A}. E[Xn; T < n] _ Fn j=l E[Xn; T = j], and since {T = Because j} E .3j for all j > 1, the submartingale property implies that E[X,n ] > F,I E[Xj; T = j] > AP{T < n}. This proves the first Doob inequality. For the second portion of (8.26) define (8.28)
7-:= inf{1 < j:5 n : Xj < -A},
where inf 0 := oo. By the optional stopping theorem, EXI < EXTnn Since
X. < -A on Jr < oo}, (8.29)
EXTnn = E[XT; r < n] + E[Xn; r > n] < -AP{r < n} + E[X,+].
This completes the proof.
0
The following convergence theorem of Doob (1940) is a consequence of the Doob inequalities.
The Martingale Convergence Theorem. Let X be a submartingale. Suppose either: (i) X is bounded in LI(P); or (ii) X is non-positive a.s.
Then, limn, Xn exists and is finite a.s. Proof. We follow the general outline of the proof of the strong law of large numbers (p. 73): We first prove things in the L2-case; then truncate down to LI(P). This is achieved in four easy steps.
5. Inequalities and Convergence
135
Step 1. The Non-negative L2-Bounded Case. If X is non-negative and bounded in L2(P), then for all n, k > 1, IIXn+k - XnII2 = IIXn+kII2 + IIXnII2 - 2E[Xn+kXnj (8.30)
= IIXn+kII2 +IIXnII2 - 2E [E(Xn+k I -rn)Xn] IIXn+kII2 - IIXnII2
According to Lemma 8.18, X2 is a submartingale since Xn > 0 for all n. Therefore, IIXnII2 / SUPm IIXmII2 as n / oo. It follows that {Xn}n 1 is a Cauchy sequence in L2(P), and so it converges in L2(P). Let X00 be the L2(P)-limit of Xn, and find nk T oo such that IIXoo - Xnk1I2 < 2-k. By Chebyshev's inequality, 00
00
EP{IXoo- XnkI? }
(8.31)
k=1
4-k
vE>0.
k=1
Thus, by the Borel-Cantelli lemma (p. 73), limk.oo Xnk = X00 a.s. On the other hand, IIXnk+1- XnkII2
IIXnk+l - Xnk III
< IIXoo - XnkII2 + IIXoo - Xnk+l 112 (8.32)
< 2-k +2 -k+l
= 3.2-k Therefore, Corollary 8.35 shows us that for all e > 0, 00
EP (8.33)
k=1
max
jnk:5j:5nk+1
IXj - Xnk I
2
00
>- e< E IIXnk+l - Xnk III E k=1 6
0-0
< E 2-k < oo. k=1
We have used the fact that {Xn+j -
is a submartingale for each fixed n with respect to the filtration {9j+n}j9_o, and that this submartingale starts at 0. It follows from the Borel-Cantelli lemma that (8.34)
lim
max
k-.oo nk<j
IXj - Xnk I = 0
a.s.
Because Xnk -1 X00 a.s., we have proved that lim,',o X,n = X., a.s. As X00 E L2(P) it follows that X00 is a.s. finite. Step 2. The Non-positive Case. If Xn < 0 is a submartingale, then exp(X,,) is a bounded non-negative submartingale (Lemma 8.18). Thanks to Step 1, limn..00 exp(Xn) exists and is finite a.s. Step 3. The Non-negative L'-Bounded Case. If Xn is a non-negative
submartingale that is bounded in L1(P), then thanks to the Krickeberg
136
8. Martingales
decomposition (p. 128), we can write X. = Y is a non-negative martingale and Z is a non-negative supermartingale; see also Remark 8.21. By Step 2, limn Yn and Z exist and are finite a.s. Consequently, Xn exists and is finite a.s. Step 4. The L1-Bounded Case. If X is an L1-bounded submartingale,
then we can write X = Y - Yn - Zn where Y is an L'-bounded martingale and Z is an L1-bounded non-negative supermartingale. Because Y+ and Y- are non-negative submartingales (Lemma 8.18), Step 3 implies that limn_ Yn and limn.w Yn exist a.s. Similarly, Z is a non-negative supermartingale and limn_m Zn exists almost surely; see Step 2. This completes the proof. 0
6. Further Applications Martingale theory provides us with a powerful set of analytical tools and, as such, it is not surprising that it has made an impact on a large number of diverse mathematical problems. This section contains a few applications of this theory. More examples can be found in the exercises.
6.1. Kolmogorov's Zero-One Law. Let {Xt}i=1 denote independent random variables, and define T to be the corresponding tail a-algebra; see Definition 6.14. Recall that the Kolmogorov zero-one law (p. 69) asserts that J is trivial. Here is a martingale proof: Let Fn := a({Xt}= 1), and consider any event A E J. Because A is independent of .fin, we have the a.s. identity
P(A I Fn) = P(A). But P(A ] 9n) defines a Doob martingale. Therefore, in accord with the martingale convergence theorem, L = limn. P(A 1 9n) exists a.s. and in L'(P). We claim that L = 1A a.s. This would then prove that P(A) = 1A a.s., which is Kolmogorov's zero-one law. Fix an integer k _> 1 and note that E[L; B] = limn. E[P(A J r,,); B] = P(A fl B) for all B E Fk. By the monotone class theorem (p. 30), E[L; B] = P(A fl B) for all B E 9, , where 9,,. denotes the smallest a-algebra that contains U°n°__1 n. Because L is 9,,,,-measurable and A E A"., this ensures that L = P(A 19 ...) = 1A a.s. The result follows.
6.2. Levy's Borel-Cantelli Lemma. In this section we describe an optimal improvement to the Borel-Cantelli lemma (p. 73; see also Problem 6.20, p. 86). This improvement is due to Levy (1937, Corollary 68, p. 249). First, we need a definition.
Definition 8.36. If E and F are events, then we say that E = F almost surely when 1E(w) = 1F(w) for almost every w.
6. Further Applications
137
Theorem 8.37. If {. n}n is a filtration and E1, E2, ... are events such 1
that En E .fin for all n > 1, then ( (8.35)
{En occurs infinitely often}
00
E P(Ef 19n_1) = 00
l a. S.
n=2
Consequently, the two events, F 1 := {E 2P(EnI,Fn-1) = oo} and F2 :_ {En occurs infinitely often} have the same probability. The proof of Levy's Borel-Cantelli lemma rests on a general result about martingales with bounded increments.
Theorem 8.38. Suppose {Xn}n is a martingale such that 1 Xn - Xn_1 < a a.s. for all n > 1, where a is a positive non-random constant. Consider the events L1 := {supra Xn < oc}, L2 := {infra Xn > -oo}, and L3 {limn-o. Xn exists and is finite}. Then L1 = L2 = L3 a.s. 1
Proof. For any A > 0 define Ta := inf{n > 1 : Xn > .1}, (8.36) where inf 0 := oo. By the optional stopping theorem (Corollary 8.27), {XnATa }°°_1 is a martingale. Moreover, the fact that the increments of X are at most a implies that XTA < a + A. Therefore, a + A - XnATA defines a non-negative martingale. This must converge a.s. Consequently, for any A > 0 there exists a null set off which limn_.oo XnAT,, exists and is finite. Take the union of these null sets, as A ranges over all positive rationals, to deduce that outside one null-set N, limn-c. X nATA exists and is finite for all rational A > 0. If w E L1 then Ta(w) is infinite for all rational A > supra Xn(w). There-
fore, L1 n N` C L3. By considering the martingale -X we find also that L2 n NC C L3. This proves that (L1 U L2) n N` C L3, whence 1L,uL2 5 1L3 a.s. Since L3 C (L1 n L2), the result follows. 2{1E, - P(E; 1 .!'i_ 1) } Proof of Theorem 8.37. The variables Xn = (n > 1) define a martingale with bounded increments. In the notation of Theorem 8.38, L1 = {Ei 1E, < oo} and L2 = {Ei P(EE j .9'i_1) < oc}.
[This merits a moment's thought.] The proof follows.
6.3. Khintchine's LIL. Suppose {Xi}°_1 are i.i.d. random variables taking the values ±1 with probability 1/2 each, and define Sn := X1 +- +Xn. By the strong law of large numbers (p. 73), Sn/n -i 0 a.s. This particular form of the strong law first appeared in the context of the normal number theorem of Borel (1909). See Problem 6.22 on page 86. One would like to know how fast Sn/n converges to zero. The central
limit theorem suggests that Sn/n cannot tend to zero much faster than
8. Martingales
138
n-1/2. The correct asymptotic size of Sn/n was found in a series of successive improvements by Hausdorff in 1913 (see 1949, pp. 420-421), Hardy and Littlewood (1914), Steinhaus (1922), and Khintchine (1923). The definitive result, along these lines, is the law of the iterated logarithm (LIL) of Khintchine (1924): S" S. 1 a.s. lim sup = - lim inf (8.37) n-oo (2n In In n)1/2 = n-oo (2n In In n)1/2
Khintchine's LIL has a remarkable extension that is valid for sums of general i.i.d. random variables with finite variance.
The Law of the Iterated Logarithm. If {Xi}i=1 are i.i.d. random vari+ Xn, then ables in L2(P) and Sn := XI + S. - nEXI lim sup
(8.38)
n-.oo (2n In In n)1/2
= SD(XI) a.s.
When the Xi's are bounded this was proved by Kolmogorov (1929). Cantelli (1933a) improved Kolmogorov's theorem to the case that X1 E L2+6(P) for some d > 0. Then the LIL for general mean-zero finite-variance increments remained elusive for nearly a decade, until Hartman and Wintner
(1941) devised an ingenious truncation method which reduced the general LIL to that of Kolmogorov. We will derive the LIL only in the case that the X's are normal. The theorem, in its full generality, is much more difficult to prove.
Proof of the LIL for Normal Increments. Without loss of generality, we may assume that the Xi's are standard normal random variables. Define
S.
A:= hm sup
(8.39)
m-oo (2m In In M) 1/z
According to the Kolmogorov 0-1 law (p. 69), A is almost surely a constant. Our task is to prove that A = 1. We do this in three steps.
Step 1. A Large-Deviations Estimate. Fix a t > 0 and define Mn :_ exp (tSn - t2n). Let An := o ({Xi}s 1), and verify that M is a non-negative 2 mean-one martingale. Moreover,
Imax Sl > nt) C I Imax Ml > exp(nt2/2)
(8.40) I.
i_
j_
JJJ
y. J
According to Doob's maximal inequality (p. 134), for all integers n _> 1 and all real numbers t > 0, (8.41)
P
max S. > nt y < exp (- nt2/2). JJJ
6. Further Applications
139
Step 2. The Upper Bound. Choose and fix c > 0 > 1, and define 9k oo, LOkJ (k =11,2.... ). We apply (8.41) to deduce that as k (8.42)
P { max Si > (2cOk_l lnln0k_1)l/2l < exp
J
11<j<ek
cOk-I In InOk_1
`
ek
=
k-(c1e)+°(1).
Thus, the left-hand side is summable in k. By the Borel-Cantelli lemma (p. 73), with probability one there exists a random variable ko such that for all k > k0, maxl<j<ek Sj < (2cOk_1 In ln0k_I )1/2. For all m > Bko we can find k > ko such that 0k_1 < m < °k, and hence, (8.43)
Sm < maxSj j!ek < (2cOk_1lnInOk_l)1/2 < (2cmInInm)1/2.
This proves that A < c1/2. Because the latter holds for all c > 0 > 1, it follows that A < 1. This is fully one-half of the LIL in the case of standard normal increments. We can also apply the preceding to the process -S to obtain the following:
(8.44)
ISmI
limsup
m-oo (2m In In m.)1/2
<1
-
a.s.
Step 3. A Lower Estimate. There exists a constant A > 0 such that (8.45)
P {X1 > Al >
Ae _ 2/2
d'\ > 1.
See Problem 1.16, page 14 for a hint. Choose and fix 0 > 1, and define Ok := [ek J . Consider the events (8.46)
Ek := {Sek+1 - Sok >_ (2ati+1 ln+ In+ Bk)1`21
where ln+ x := ln(x V e), and (8.47)
ak+1
rr
Var (Sek+ 1
- Sek) = 0k+1 - Bk.
The Ek's are independent events, and because of (8.45), for all k large, (8.48)
A
P(Ek) = P {N(0,1) > (21nln0k)112l > JJ
In Ok (2 In In Ok)
112
Consequently, Ek P(Ek) = oo, and hence by the independence part of the Borel-Cantelli Lemma, (8.49)
lim sup k-.oo
Sell+1 - Sek 1/2 > 1 (2(0k+1 Ok) In In 0k)
-
as.
8. Martingales
140
Since 9k+1 - 9k ^' Ok+1(1 - 0-1) as k -4 oc,
S Sek+l - SOk
lim sup
(8.50)
1/2
>
C1 - B)
k-.oo (29k+11n In 9k ) 1/2
a.s.
Thanks to this, (8.44), and the fact that 9k+1 - 9.9k as k
oo,
Sek+1 A > li msup k-oo (29k+11nln9k)1/2
-
Sek+1 Sek > lim sup k-oo (20k+1 In In 0k)1/2
(8.51)
>
C1 -
1/2
0
-
91/2/2
- limk---Sup(29k+11n In 9k) /2 I Sek I
1
a.s.
Let 0 T oc to find that A > 1.
O
6.4. Lebesgue's Differentiation Theorem. The fundamental theorem of calculus asserts that if f : R -I R is continuous then F(w) := fo f (x) dx is differentiable and F' = f. In fact, we have the stronger result that w+6
1
1io
(8.52)
b
f
f(y)dy = f(w),
uniformly for all w in a given compact set. Here is why: For all w E R and b > 0, I
(8.53)
1
fw+6
rw+6 Jw
If (y)
f (y) dy - f (w) l < b
Jw
- f(w) l dy.
Therefore, (8.52) follows from the uniform continuity of f on compact sets. There is a surprising extension of this, due to H. Lebesgue, that holds for all integrable functions f. The following is the celebrated differentiation theorem of Lebesgue.
Theorem 8.39. If fo If (x) I dx < oc, then (8.52) holds for almost every w E [0,1]. Consider the Steinhaus probability space ([0, 1] , 66([0, 11), P). In probabilistic language, Theorem 8.39 states that (8.52) holds almost surely provided that f E L' (P). In order to derive this formulation we need a maximal inequality for the following function M f that is known as the HardyLittlewood maximal function (1930, Theorem 17). First we extend f to a function on R by setting f (w) := f (0) if w < 0 and f (w) := f (1) if w > 1. Then, we define: 1
(8.54)
(Mf)(w) = M(f)(w) =
sup
f
w+6
6E(0,1-w) 8 w
where 0/0 := 0 to ensure that (M f) (1) = 0.
If(y)I dy
"w E [0, 11,
6. Further Applications
141
Theorem 8.40. For all A > 0, p > 1, and f E L'(P), P{Mf > A} < 8PIlflip
(8.55)
Let us first prove Theorem 8.39 assuming the preceding maximal-function inequality. Theorem 8.40 is proved subsequently.
Proof of Theorem 8.39. For notational convenience, define the "averaging operators" A6 as follows: 1
(8.56) A6(f)(w) := (A6f)(w) :=
b
f
w+6
f(y)dy dw E [0,1], f E L1(P)
Thus, we have the pointwise equality, (M f)(w) = suP6E[o,1-w] A6(I f I)(w)
Throughout this proof we tacitly extend the domain of all continuous functions g : [0, 11 - R to R by setting g(w) := g(0) for w < 0 and g(w) := g(1) for w > 1. Because continuous functions are dense in L1(P) (Problem 4.18, p. 50), for every n > 1 we can find a continuous function gn such that II9n - f II i n-1. Let 2 := lim sup6lo I A6f - f I to find that
.2
(8.57)
610
= lim sup IA69n - A6f I + I9,, - f I 610
M(I9n-f1)+I9n-fIIf 2 > A, then by the triangle inequality one of the two terms on the right-most side must be at least A/2. Therefore, we can write (8.58)
P {2 > Al < T1 +T2,
where (8.59)
T1:=P{M(I9n-fl)> 2}
and
T2:=P{I9n-fI>
2}.
We estimate T1 and T2 separately. On one hand, we can apply Theorem 8.40, with p = 1, to deduce that (8.60)
T15
16 A
16 II9n-fII1<--
An
On the other hand, we appeal to Chebyshev's inequality (p. 43) to find that (8.61)
T25
2 -2 II9n-fl11Sn
Consequently, P{2 _> Al < 18/(An) for all n > 1. Let n -' oo and A 10, in this order, to deduce that 2 = 0 a.s. This proves the theorem.
8. Martingales
142
Proof of Theorem 8.40. Because f can be replaced with if 1, we can assume without any loss in generality that f > 0. Also, we will extend the domain of the definition of f by setting f (w) := 0 for all w E R \ (0, 1]. Define .9'n to be the collection of all dyadic intervals in (0, 1]. That is, 1 E . !FnO if and only if I = (j2-n, (j + 1)2-n] where j E {0 , ... , 2n - 11 and n > 0. Define .$n to be the o-algebra generated by .fin. Since every element of FnO is a union of two of the elements of .fin+1+ it follows that gn C JFn+1
(8.62)
do > 0.
That is, {.$n}n°_o is a filtration; it is known as the dyadic filtration. We can view the function f as a random variable, and compute Mn E[f I 9n] using Corollary 8.8: (8.63)
r Mn(w) _ E 1Q(w)2n f f (y) dy
for almost all w E [0, 11.
QE.
It should be recognized that the preceding sum consists of one term only. Next define i to be the collection of all shifted dyadic intervals of the
form J = (j2-n +
2_n-1
,
(j + 1)2-n +
2-n_1),
where j E Z and n > 0.
Let Wn denote the o-algebra generated by the intervals in Wno, and define Nn := E[ f ] 8on]. Because f vanishes outside [0, 11, (8.64)
Nn(w) _
1Q(w)2n
JQ
for almost all w E [0, 11.
f (y) dy
Consider w E (0,1) and b E (0,1 - w). There exists n = n(w) > 0 such that 2-n-1 < 5 < 2-n. We can find 1(w) E JrnO and J(w) E fl-both containing w-such that (w, w + 6) C I (w) U J(w). Because f > 0, f =- 0 off [0,1], and b > 2-n-1, this implies that 1
(8.65)
d
f+' f (y) dy <- 2n+1 j f (y) dy +
Jd(W)
f (y) dy
= 2 (Mn(w) + Nn(w))
Optimize over all 8 to find that M f < 2 supra Mn + 2 supra Nn. Therefore,
for all y>0, (8.66)
P{Mf >A} n>0
4
A}+P(supNn> n>O
Al 4
Note that Mn + N,, is not a martingale because M and N are adapted to different filtrations. However, M and N are martingales in their respective filtrations. We apply the first maximal inequality of Doob (p. 134) to the
6. Further Applications
143
submartingale defined by IMnII" to find that (8.67)
P j suP Mn > l n>0
41
<
sup E (IMnIp) ap n>0
P. Ap IIfIIp
[The last inequality follows from the conditional form of Jensen's inequality.]
A similar inequality holds for N. We can combine our bounds to obtain, 2,4P (8.68)
P{Mf > A} <
IIf IIp
0
The theorem follows because 2.4P < 8p.
The following corollary of Theorem 8.40, essentially due to Hardy and Littlewood (1930, Theorem 17), is noteworthy as it has a number of interesting consequences in real and harmonic analysis.
Corollary 8.41. If p > 1 and f E LP(P) then (8.69)
f l I(Mf)(t)I" dt < (p8p1)p f 1 If(t)Ip dt.
6.5. Option-Pricing in Discrete Time. We now take a look at an application of martingale theory to the mathematics of securities in finance. In this example we consider a simplified case where there is only one
type of stock whose value changes at times n = 1,2,3, ... , N. We start with yo dollars at time 0. During the time period (n, n + 1) we look at the performance of this stock up to time n. Based on this information, we may decide to buy An+i-many shares. Negative investments are also allowed in the marketplace: If An(w) < 0 for some n and w, then we are selling short for that w. This means that we sell An(w) stocks that we do not own, hoping that when the clock strikes N, we will earn enough to pay our debts. Let Sn denote the value of the stock at time n. We simplify the model further by assuming that ISn+i -SnI = 1. That is, the stock value fluctuates b y exactly one unit at each time step, and the stock value is updated precisely at time n f o r every n = 1, 2, .... The only unexplained variable is the ending time N; this is the so-called time to maturity and will be explained later. Now we can place things in a more precise framework. Let St denote the collection of all possible w = (w1, ... , wN) where every
wj takes the values ±1. Intuitively, wj = 1 if and only if the value of our stock went up by 1 dollar at time j. Thus, wj = -1 means that the stock went down by a dollar, and Sl is the collection of all stock movements that are theoretically possible. Define the functions Si, ... , SN by So(w) := 0, and (8.70)
Sn(wi,...,wn)
wi +...+wn
dn= 1,...,N.
8. Alartingales
144
We may abuse the notation slightly and write &(w) in place of S,,(wi ...... a ).
In this way, S,,(w) represents the value of the stock at time n, and corresponds to the stock movements w1, ... , w,,. During the time interval (n, n + 1), we may look at w1,... , w,,, choose a number A,,+1(w) =A.+, (w, .... , wn ), and buy which might depend on shares. If our starting fortune at time 0 is yo, then our fortune at time n depends on { A; (w) };_11, and is given by n (8.71)
YY(w) = Yn(wl,...,wn) = yo + E A.i(w)[Sj(w) - Sy-1(w)], J=1
as n ranges from 1 to N. The sequence {A=(w)}( 1 is our investment strategy.
Recall that it depends on the stock movements {w;}N I in a "previsible manner": i.e., for each n > 2, A,,(w) depends only on w,.... ,Wn-l- JAI does not depend on w.]
A European call option is a gamble wherein we purchase the right to buy the stock at a given price C-the strike or exercise price-at time N. Suppose we have the option to call at C dollars. If it happens that SN(w) > C, then we have gained (SN(w) - C) dollars. This is because we can buy the stock at C dollars and then instantaneously sell the stock at SN(W). On the other hand, if SN(w) < C then it is not wise to buy at C. Therefore, no matter what happens, the value of our option at time N is (SN(w) - C)+. An important question that needs to be settled is this: (8.72)
What is the fair price for a call at C?
This was answered by Black and Scholes (1973) and Merton (1973) for a related, but slightly different, model. The connections to probability were discovered later by Harrison and Kreps (1979) and Harrison and Pliska (1981). The present model, the so-called "binomial pricing model," is due to Cox, Ross, and Rubenstein (1979). In order to explain their solution to (8.72) we need a brief definition from finance.
Definition 8.42. A strategy A is a hedging strategy if-
(i) Using A does not lead us to bankruptcy; i.e., Yn(w) > 0 for all
n = 1,...,N. (ii) Y attains the value of the stock at time N; i.e., YN(W) = (SN(w) - C)+. Of course any strategy A is also previsible.
Let us posit that there are no "arbitrage opportunities," where arbitrage is synonymous to "free lunch." That is, we assume there are no risk-free investments. Then, in terms of our model, yo is the "fair price of a given
6. Further Applications
145
option" if, starting with yo dollars, we can find a hedging/investment strategy that yields the value of the said option at time N, no matter how the stock values behave.
The solution of Black and Scholes (1973), transcribed to the present simplified setting, depends on first making (Q, .1'(l)) into a probability space. Here, Y(Q) denotes the power set of Q. Define the probability measure P so that Xj (w) = wj are i.i.d. taking the values ±1 with probability each. In words, under the measure P, the stock values fluctuate at random
but in a fair manner. Another, yet equivalent, way to define P is as the product measure: (8.73)
P(dw) = Q(dwl) ... Q(dwN)
dw E 1,
where Q({1}) = Q({-1}) = 1/2. Using this probability space (1, .9(l), P), {A;}°°1, {S1}°O1, and {Yt}°_1 are stochastic processes, and we can present the so-called Black-Scholes formula for the fair price yo of a European option.
The Black-Scholes Formula. A hedging strategy exists if (8.74)
yo = E [(SN - C)+] .
Proof (Necessity). We first prove Theorem 6.5 assuming that a hedging strategy A exists. If so then the process Yn defined in (8.71) is a martingale; see Example 8.15. Moreover, by the definition of a hedging strategy, Yn > 0
for all n, and YN = (SN - C)+ a.s. (in fact for all w). On the other hand, martingales have a constant mean; i.e., EYN = EY1 = yo, thanks to (8.71). Therefore, we have shown that yo = E[(SN - C)+] as desired. (] In order to prove the second half we need the following.
The Martingale Representation Theorem. In (Q, .9(1k), P), the process S is a mean-zero martingale. Any other martingale M is a martingale transform of S; i.e., there exists a previsible process H such that n
(8.75)
Mn = EMI + > H3 (Si - Sj-1)
do = 1, ... , N.
j=1
Proof. Because Lemma 8.29 proves that S is a mean-zero martingale, we can concentrate on proving that M is a martingale transform. Since M is adapted, Mn is a function of w1, ... , wn only. We abuse the notation slightly, and write (8.76)
Mn(w) = Mn(w1, ... , Wn)
Vw E I1.
The martingale property states that E[Mn+1 1,9n] = Mn a.s. Now suppose 01, ... , On are bounded and Oj is a function of wj only. Then, thanks to the
8. Martingales
146
independence of the wj's,
E
OLi.Mn+h] =1
(8.77)
[Jctj(wj)Mn+l(wl,...,wn,-1)Q(dwl) ... Q(dwn)
-2 1
+2
J j=1
j(wj)Mn+1(w1,...,wn,1)
G1
1) ... Q(dwn)-
That is, we can write n
n
E f Oj M-+1 = E H Oj . Nn
(8.78)
j=1
,
j=1
where (8.79)
1
Nn(w)
1
=2Mn+1(wl,...,wn, 1)+2Mn+1(wl,...,wn,
Note that Nn is Fn measurable for every n = 1, . . . , N. Therefore, (8.77) and the martingale property of M together show that M = N a.s. This leads us to the formula (8.80)
1
1
Mn(w) = 2Mn+l(wl,... ,wn, l) + 2Mn+1(wl,...,wn, -l),
valid for almost all w E 52.1 In fact, because fl is finite and P assigns positive measure to each wj, the stated equality must hold for all w. Moreover, since
go = 10, 52}, the preceding discussion continues to hold for n = 0 if we define Mo = EMI. Since Mn(w) = ZMn(w) + Z1Lin(w), the following holds for all 0 < n <
N-1andallwE0: (8.81)
Mn+l(wl,...,wn,l)-Mn(w)=Mn(w)-Mn+l(wl,...,wn,-1).
'While this calculation is intuitively clear, you should prove its validity by first checking it of the form fjlZi hj(wj), and then appealing to a monotone class argument. for
6. Further Applications
147
Let dj := Mj+1 - Mj so that Mn+l(w) - Mo = E"=odj(w), and apply the preceding as follows:
[dj(w)1{1}(wj+1) +dj(w)1{-1}(wj+l)]
Mn+1(w) - 11ro =
j=o n (8.82)
_E
(Mj+1(wi,... ,wj, l) - Mj(w)) [1{1}(wj+1) - 1{-1}(wj+1)]
j=o n
E (Mj+1(w1,...,wj, l) - Mj(w)) [Sj+1(w) - Sj(w)] j=o
This proves (8.75) with Hj(w) := Mj(wl,...,wj_1i 1)-Mj-I(w) and Mo O EMI. [Note that H is previsible.] We are ready to prove the second half of the Black-Scholes formula.
Proof of the Black-Scholes Formula (Sufficiency). The process Yn = E[(SN - C)+ I fn] (0 < n < N) is a non-negative Doob martingale. Also it has the property that YN = (SN - C)+ almost surely, and hence for all w (why?). Thanks to the martingale representation theorem, we can find a previsible process A such that n-1
(8.83)
Yn = EYi + > Aj(Sj - Sj_1). j=1
It follows that A is a hedging strategy with yo = EYI. By the martingale property, EYi = EY2 = . . . = EYN. This implies that yo = E[(SN - C)+J, which proves the theorem. 0
6.6. Rademacher's Theorem. A function f : (0, 1) -; R is Lipschitz continuous if there exists a constant A > 0 such that for all x, y E (0, 11, (8.84)
If (x)
- f(y)I < AIx - yl.
The optimal choice of A is called the Lipschitz constant of f . If f' exists and is continuous, then one can perform a one-term Taylor expansion to note that f is Lipschitz continuous. The following theorem of Rademacher (1919) asserts a remarkable converse.
Theorem 8.43. If f : (0, 1] -+ R is Lipschitz-continuous then it is differentiable almost everywhere.
Proof. Let ((0,1] , 4((0,1]) , P) denote the Steinhaus probability space, so that P is Lebesgue measure, and "a.e." is the same thing as "a.s." Also let {.`fi'n}°° 1 and {.9n}n°_1 respectively denote the dyadic intervals and filtration
(p. 142).
8. Martingales
148
two numbers, f(Q) and We associate to all dyadic intervals Q E r(Q): t(Q) is the left end-point of Q and r(Q), the right one. For instance, if Q = (k2--, (k + 1)2-n], then f(Q) = k2-n and r(Q) = (k + 1)2-n. Define (8.85)
Xn(w) :_ QE.9"n
f(r(Q)) - f(e(Q))1Q(w) r(Q) - e(Q)
dw E (0, 1).
This is a difference quotient because the sum consists of exactly one term and the Lipschitz continuity off ensures that sup,, I Xn I is a bounded random variable. From here on, the proof splits into two steps. Step 1. X is a martingale with respect to .9. To prove this, write (8.86)
Xn(w) = E E
f (r(Q))2- f (E(Q))1Q(w)
JE.9 _1 QE.$,o,:
QCJ
By Corollary 8.8, if w E J E .fin-1 and Q E gn is a subset of J, then P(Q I gn_1)(w) is the classical probability P(Q I J), which is z. If U) ¢ J then P(Q I.n_1)(w) = 0. Thus, E[Xn 19n-1] =
E
2
f (t(Q))1 J
f (r(Q)
QE.Fn:
2-
n
QCJ (8.87)
f (r(J)) - f (t(J))1J
_
2-'
JE.3n_1
= Xn_1.
According to the martingale convergence theorem, all bounded martingales
converge a.s. and in L1(P) (p. 134). Therefore, we can find X. such that Xn
X,,,, a.s. [P] and in L1(P).
Step 2. The Conclusion. Suppose I, J E .fin for the same n > 1, and I lies to the left of J; that is, every u E I is less than every v E J. Then we denote this by I < J. For all Q E .5n, fQ Xn(w) du) = f (r(Q)) - f (f(Q)). Therefore, for all
(8.88)
r f (r(J)) - f (0) = E / Xn(w) dw = IEYfO:
I<J
I
j
"M
Xn(w) dw.
6. Further Applications
149
Given any x E (0, 11 and n > 1, we can find a unique J E .'ro such that w E .`ro . If A denotes the Lipschitz constant of f, then
- f (r(J))I < Aix - r(J) I < 2 Therefore, If (x) - f (0) - fo ") Xn(w) dwi < A2-n. Also, (8.89)
If (x)
jr(j) Xn(w)
(8.90)
rx+2-^
dw - fox Xn(w) dw
<_ J
IXn(w)Idw.
By the dominated convergence theorem, the right-hand side goes to zero as n -+ oo. Therefore, by the monotone convergence theorem,
f (x) - f (0) = J X,,, (w) dw
(8.91)
ex E (0, 11.
0
Rademacher's theorem follows from Lebesgue's differentiation theorem (Theorem 8.39), and f' = X, almost everywhere.
6.7. Random Patterns. Suppose X1, X2, ... are i.i.d. random variables with P{X1 = 1} = p and P{X1 = 0} = q, where q:= 1-p and 0 < p < 1. It is possible to use the Kolmogorov zero-one law and deduce that the infinite sequence X1, X2,... will a.s. contain a zero, say. In fact, with probability one, any predescribed finite pattern of zeroes and ones will appear infinitely often in the sequence {X1, X2,. ..}. Let N denote the first k such that the sequence {X1, ... , Xk} contains a predescribed, non-random pattern. Then, we wish to know EN.
The simplest patterns are "0" and "1." So let N denote the smallest k such that {X1, . . . , Xk} contains a "0." It is not hard to convince yourself
that EN = 1/q because P{N = j} = p)-1q for j = 1,2,.... But this calculation uses too much of the structure of the pattern "0." Next is a more robust argument, due to Li (1980): Consider the process Yn := 91{X1=0} +
(8.92)
Define
91{X2=0} +
+q
to be the o-algebra defined by {X;} 1 for every n > 1. Then,
for alln>1, (8.93)
E[Yn+1 I fn] = Y. + 1 P(Xn+1 = 01.13.) = Yn + 1.
Therefore, {Yn - n}n 1 is a mean-zero martingale (check!). By the optional stopping theorem, E[N A n] = EYNnn for all n > 1. But N < oc a.s., and both IN A n}n and {YNnn}n 1 are increasing. Therefore, we can apply the monotone convergence theorem to deduce that EYN = EN. Because YN = (1/q) almost surely, EN = 1/q, as we know already. 1
8. Martingales
150
The advantage of the second proof is that it can be applied to other patterns. Suppose, for instance, the pattern is a sequence of f ones, where f > 1 is an integer. Consider
1
Zn,t :=
1{X,=1,...,X1=1} + 71{X2=1,....Xt+1=1} 1
+ ... +
(8.94)
1{xn_t+,=l,...,X =1} + pe11{Xn-t+2=1,....Xn=1} 1
+ p1-221{Xn-1+3=1,...,Xn=1} + ... + P1{Xn=1}
Then, you should check that {Zn,t - n}'_1 is a martingale. As before, we
have EZN,t = EN, and now we note that ZN,t = (1/pt)+(1/pt-1)+ +(1/p) a.s. Therefore, (8.95)
Therefore, set e = 2 to find that
EN =
(8.96)
1
+
p
for the pattern "11."
Our next result is another example of this kind.
Lemma 8.44. If the pattern is "01" then EN = 1/(pq). Define (8.97)
Wn
p1{X1=0,X2=1} +
+
1 1{Xn-,=O,xn=1} + 91{x.=o}
One can prove that {Wn - n}°O=1 is a martingale. Lemma 8.44 is proved by using the preceding martingale methods. Suppose we wished to know which of the two patterns, "01" and "11," is more likely to come first. To answer this, we first note that { Wn - Zn,2 } °,° 1 is a martingale, since it is the difference of two martingales.
Define T to be the smallest integer k > 1 such that the sequence {X1, ... , Xk} contains either "01" or "11." Then, we argue as before and
find that E[WT - ZT,2) = 0. But WT - ZT,2 = q-1 on ("01" comes up first}, and WT - ZT,2 = -(1/p) - (1/p2) = -(p + 1)/p2 on {"11" comes up first}. Therefore,
0=E[WT-ZT,21 (8.98)
= P { "01" comes up first} -
p p+21
P { "11" comes up first}.
Solve to find that (8.99)
P ("01" comes up before "11" } = 1 - p2.
Problems
151
6.8. Random Quadratic Forms. Let {Xi}i=1 be a sequence of i.i.d. random variables. For a given double array {ai,,, } 2,7 =1 of real numbers, we wish
to consider the "quadratic form" process, do > 1.
Q,,:= >2 > ai,jXiXj
(8.100)
1
Define a,3 :_ (aij + aj,i)/2. A little thought shows that we do not alter the value of Qn if we replace aij by of .. Therefore, we can assume that ai,j = aj,i, and suffer no loss in generality. The quadratic form process {Qn}n° I arises in many disciplines. For instance, in mathematical statistics, {Qn}' 1 belongs to an important family of processes called "U-statistics."
Theorem 8.45 (Varberg, 1966). Suppose EX1 = 0, E[X2] = 1, and E[Xf ] < oo. If E 1 1 a,3 < co, then limn. (Qn - E I
E'
Proof. Let An := E1
Qn-An=2Un+Vn,
(8.101)
where (8.102)
Un :_ EE ai,,X;Xj
and
Vn := E ai,i [X?
1
I
Because (x + y)2 < 2(x2 + y2) for all x, y E R, it follows that (Qn - An)2 <
8Un + 2. Therefore, (8.103)
E [(Qn - An)2] < 8E[Un] + 2E[V,2].
By independence, E[VI] = El
Problems 8.1. We say that X E LI(P) converges to X E LI(P) weakly in LI(P) if for all bounded random E[XZJ. Show that X - X weakly in Ll(P) if for any subvariables Z,
a-algebra I C 9, E[X I y] converges to E(XIT) weakly in LI(P). Conversely, prove that if X - X in LI (P), then for any sub-o-algebra I C .9', E[X" (`.>'J - E(X 159) in V (P). 8.2. Construct three random variables U, V, W such that EIE(U I V) I WJ 0 E(E(U I W) I V] with positive probability.
8.3. Suppose X and Y are independent real (say) random variables, and / Rz - R is bounded and measurable. If g(x) := Etf(x,Y)J for all x E R, then prove that g(X) = E[j(X,Y) I XI as.
8. Martingales
152
8.4. Suppose X, Y E Ll (P) satisfy E[X I YJ = Y and E[Y I X] = X as. Prove then that X = Y as. (HINT: Prove first that E[X - Y; X < q < YJ = 0 for all q E Q; see also Doob (1953, p. 314).) 8.5. Let ((0, 1],-4((0, 11),P) denote the Steinhaus probability space, and consider X (w) = w for 0 < w < 1. Compute and compare E[X 19'1) and E[X 19'2] where 9'1 is the o-algebra generated by o((0, 1/2]), and .92 is the sub-o-algebra of 9((0, 11) that is generated by .9((0,1/21). This is due to J. A. Turner. 8.6 (Conditional Variance). Suppose cf C .IF is a o-algebra. If X E L2(P), then define Var(X 144) _
E{(X - E[X I y])21 y). Prove that Var(X [1) < E{(X - C)2 1 y} as, for every X E L2(f2 ,.S, P) and E L2(f1,4,P). Derive (8.4) as a consequence. 8.7 (Conditional Chebyshev). Prove that if I is a sub-o-algebra of.Jr and X is a non-negative random variable on (Q, .'F, P), then AP(X > a [ y) < E[X I yJ a.s. for all a > 0. 8.8 (Corollary 8.8, p. 125, Continued). Let (X, Y) be an absolutely continuous random variable with piecewise-continuous density f (x, y). Prove that for every non-negative measurable h : R -+ R,
(8.104)
E[h(X)] YJ =
fff'_ (x)f(x,Y)dr f(x,Y)dx
as.
Even though P{Y = y} = 0 for all y E R, use the preceding to justify the classical definition,
P(X
8.9 (Censoring). Let 9 denote a filtration, and consider 9-stopping times S < T as. For any fixed event A E fs define r = S1A + T1AC. Prove that r is an .r-stopping time. 8.10. Verify Proposition 8.9.
8.11. Carefully prove Lemmas 8.25 and 8.26. In addition, construct an example that shows that the difference of two stopping times, even if non-negative, need not be a stopping time. 8.12. The optional stopping theorem (p. 130) assumes that T is almost surely bounded. Construct an example to show that this assumption cannot be dropped altogether.
8.13. Prove Lemma 8.29.
8.14 (Gambler's Ruin). If {X,} 1 are i.i.d. random variables with P{X1 = 11 = 1 - P{X1 = -l} = p 54 1, then verify that, in the proof of Theorem 8.34, {(Sn)n 1 is indeed a bounded mean-one martingale. Also compute ET in the case p = 1/2. 8.15. Let {Xn}n 1 be independent random variables such that for all k > 1, hk(t) = Ee1Xk exists and is finite for all t E (-to, to) for a fixed to > 0. Prove that whenever III < to, Mn(t) _ e1Sn / {Zk=1 hk(t) defines a mean-one martingale. [As usual, S. denotes E
1
X,.]
8.16 (Likelihood Ratios). Suppose f and g are two strictly positive probability density functions on R. Prove that if {X,};__1 are i.i.d. random variables with probability density f, then Fl,'= I [g(X, )/ f (Xj)J defines a mean-one martingale. When does it converge. and in what sense(s)?
8.17 (Ptilya's Urns). An urn initially contains R red and B black balls. Except for their colors, the balls are identical. A ball is chosen at random. If it comes up red (reap. black), then it is replaced with two red (reap. black) balls. Let X denote the number of red balls in the urn after n draws. Prove that the fraction fn = Xn/(n+ R + B) of red balls has an almost-sure limit. 8.18. Prove that Definition 8.12(iii) is equivalent to E[Xn+k I ,fnJ >- Xn as, for all k, n > 1. 8.19. Prove that Doob's decomposition (p. 128) is a.s.-unique, as long as we insist that {Zn}n°__1 is previsible and Z1 = 0.
8.20. Let {Xn}n 1 be a martingale. Prove that X is bounded in L1 (P) if sup E[X,+, J < oo.
8.21. Prove that if X is a martingale, then (8.105)
r PtlmaxJXi[>a <E(]X.[P) 111
j_
-
"p > 1, n > 1, \ > 0.
Also, prove that Doob'a inequalities imply Kolmogorov's maximal inequality (p. 74).
Problems
153
8.22 (Doob's LP Inequality). Suppose l; and ( are a.s. non-negative random variables such that va > 0.
P{(> a} < 1 E[(;(> a]
(8.106)
a
Prove that for all p > 1, (8.107)
IIfIIP 5 (p p 1) IKIIP
Use this show the strong LP-inequality of Doob: If X is a non-negative submartingale and Xn E LP(P) for all n > 1 and some p > 1, then r
11
E Ilmax X?J <
(8.108)
(p)
P
E (XP].
1
Use this to prove Corollary 8.41.
8.23 (Pitman's L2 Inequality; Problem 8.22, Continued). Suppose {X,}n 1 and {Mk}k=l are processes that satisfy: (i) Mk = Xk whenever Mk iE Mk-1; and (ii) E[Ek=z Mk-s(Xk - Xk-1)J is non-negative. Prove that E[A4n] < 4E[X2]. Use this to conclude (8.108) in the case that p = 2 (Pitman, 1981).
8.24 (Problem 8.22, Continued; Square Functions). Let {3,}O 1 denote a filtration, and suppose {X,}°_1 is a martingale such that X. E L2(P) for all i > 1.
(1) Prove that Xn - An defines a mean-zero martingale where A. = E; 1 E[d? d, = X, - X,_1, Xo = 0, and .9ro is the trivial o-algebra. The process A is called the square function of X. (2) Prove that E[sup, 1. (3) Conclude that limn X,(w) exists for almost all w E {A,,, < oo}. (4) Explore the case where the d,'s are independent.
8.25 (Theorem 8.30, Continued). Suppose Sn = X1 + + Xn defines a random walk with EX1 = is and VarX1 = oz < no. Let .fin = ((X,}°° 1) (n > 1), and consider an f-stopping time T that has a finite mean. Prove Wald's second identity, VarSr = VarXi ET. (HINT: Problem 8.22.)
8.26. Consider two random variables X and Y, both of which are defined on a common probability space (11, Jr, P). Define (8.109)
(-) 1{xeL,2
Xn =
2n
7=-a
" (7+1)2 ^))
vn
1.
Prove that for any Y E LI(P), limn_ ELY I Xn] = E[Y I X) a.s. and in Lt (P).
8.27. Suppose that Y E L1(P) is real-valued, and that X is a random variable that takes values in Rn. Prove that there exists a Borel measurable function f such that E[Y I X] = f(X) almost surely.
8.28. Suppose that {Xi}1,=1 are independent mean-zero random variables in L2(P) that are bounded; that is, that there exists a constant B such that almost surely, IXnI < B for all n. Prove that for allA>Oandn>1, (8.110)
(
P(ma!cn lS,I-A} 1925).
(B
+
VarSn
(Khintchine and Kolmogorov,
8.29 (Martingale Convergence in LP). Refine the martingale convergence theorem by showing that limn-- Xn exists in LP(P) whenever X is bounded in LP for some p > 1. In addition, prove
that if X = lim-a, Xn, then Xn = E[X I9n] as. for all n > 1. (HINT: Use Problem 8.22.)
8. Martingales
154
8.30 (Double-or-Nothing). Let {y,) 1 denote a sequence of i.i.d. random variables with P{11 = 0) = P{71 = 1) = 1. Consider the stochastic process X, where X1 := 1, and Xn := 2Xn_i7n for all n > 2. Prove that X is an L1-bounded martingale that does not converge in L'(P). Consequently, Problem 8.29 can fail for p = 1. Compute the almost-sure limit of X.. 8.31. Let {Xn}n 1 be independent standard normal random variables and define S.
X,
(n > 1). Prove that Mn = (n + 1)-1/2exp{Sn/(2n + 2)} defines a mean-one martingale (Woodroofe, 1975, Problem 12.10, p. 344).
8.32 (Problem 8.31, Continued). Define Mn as in Problem 8.31. Use only the martingale convergence theorem (p. 134) and the CLT (p. 100) to prove that limn-,c Mn = 0 a.s. Derive the following precursory formulation of the LIL (Steinhaus, 1922):
Sn=o((ninn) 1/2)
(8.111)
a.s.asn-+oo.
8.33. Suppose X is a submartingale with bounded increments; i.e., there exists a non-random finite constant B such that almost surely, ]X,, - Xn_ 1 I < B for all n > 2. Then prove that limn Xn exists as. on the set {sup,, IXm] < oo}.
8.34. Suppose {.fin}- 1 is a filtration of a-algebras, and Y E L1(P) is fixed. Define M,, _ E[Y ]5n] to be the corresponding Doob martingale. Prove that for all finite stopping times T, MT = ELY I.5T1 as. (Dubins and Freedman, 1966).
8.35. Prove that X is a martingale if and only if EMT = EMI for all bounded stopping times T. Characterize super- and submartingales similarly. 8.36. Let {Xn}°,,°-e be a non-negative supermartingale that attains the value zero at some a.s.finite time. Prove that limk_,o Xk = 0 as. 8.37. Follow the proof of Theorem 8.40 and prove that c1 A(f) < M f < c2A(f) where cl and c2 are positive and finite constants that do not depend on f, and A(f) := sup,, Afn + sup,, Nn. 8.38. The following is a variant of Problem 8.17. First choose and fix A E (0, 1). Then consider random variables X,, E (0, 11, adapted to .in, such that a.s. for all n > 1,
P{Xn+1=A+(1-A)XnIAn}=Xn (8.112)
P(Xn+l =0-X)Xnl9n}=1-Xn.
Prove that X := limn-,o Xn exists as. and in LP(P) for all p > 1, and that X_ is zero or one almost surely. Compute P{X = *-
8.39. Suppose that {X,},'=1 are i.i.d. with P{X1 = 1} = P{Xl = -1} = 1/2. As before, let
S,, :=X1+...+Xn.
(1) Prove that Eexp(tS,) <- exp(nt2/2) for all n > 1 and t E R(2) Prove that (8.41) continues to hold in the present setting.
(3) Prove the following half of the LIL for (±1) random variables: limsup
Sn
n-,o (2nlnlnn)1/2 -
a.s.
(4) Suppose {Y,}i= 1 are i.i.d. with P{Y1 = 0} = P{Yj = 1} = 1/2 and Tn := Y1 + +Yn. Prove that lim sup
T,, - 121
1
n-- (2nlnInn)21/2 < -2
as.
Check that this is one-half of the LIL (p. 138) for Tn. 8.40 (Uniform Integrability). Consider a martingale {X,,)n- 1 for which T is an a.s: finite stopping time. Suppose, in addition, that {XTnn} =1 is uniformly integrable; see Problem 4.28 on page
51. Then prove that EXT = EX1. 8.41 (Uniform Integrability). Let {Xnbe a uniformly integrable martingale with respect to some filtration .9; see Problem 4.28 on page 51. Prove that X. = limn-.,o X,, exists and is finite as. Prove also that outside a null set Xn = E[X ]5n] for all n > 1.
Problems
155
8.42 (Theorem 8.39, Continued). Prove that if f : (0, 1)k -» R is integrable (k > 2), then for almost all x E (0, 1)k 1
(8.113)
:k+a
610 Sk I lim k
...
S,+e I1
j(u) duI ... duk =
f (X).
8.43. Let XI be uniformly distributed on (0,1). Conditionally on XI, define X2 to be uniformly distributed on (0, XI); i.e., P{X2 E A I XI) = m(A n [0, X I ])/X 1 , where m denotes the Lebesgue measure. Iteratively define (8.114)
P{Xn E AI X1,..., Xn-1} =
m(An(0, Xn-1]) Xn-1
Explore the structure of {Xn}n I, and the behavior of Xn for large n. 8.44 (Patterns). Verify Lemma 8.44. Also, find the probability that we see f consecutive ones before k consecutive zeros.
8.45 (U-Statistics). Prove that that (8.115)
0 in the proof of Theorem 8.45. From this conclude
E [(Qn - An)2] =4 F
alt + Var(X?)
a2 ;.
1
1<.<7
8.46 (Runs in Random Permutations; Hard). Let H,. denote a random permutation of {1 , ... , n}, all permutations being equally likely. A block of ascending elements of [In is a run if it is not a sub-block of a longer block of ascending elements. For example, if 113 = (7,6,8,3,1 , 2 , 5 , 4 },
then it has five runs: {7}; {6,8}; (3); {1,2,5}; and (4). Prove that if Rn denotes the number of runs of fln, then nRn - (na 1) defines a mean-zero martingale, and VarR, = 0(n) as n -. oo. Conclude that lim
(8.116)
n
R= n
1
2
as.
8.47 (Reversed Doob Martingales; Hard). Let f1 ? Y2 D ... be a decreasing family of sub-aalgebras off. The family {.i n}°n°__1 is called a reversed (or "backward") filtration. Prove that if
Y E L'(P) then lim.-- E[Y (.rn] exists as. and in LI(P). 8.48 (Exchangeability; Problem 8.47, Continued; Hard). Random variables XI, X2.... are said to be exchangeable if the distribution of (XI , ... , Xn) is the same as that of (X,'111, ... , X,,(n)) for every permutation rr of { 1, ... , n) and all n > 1. Define Sn = X i + + Xn for all n > 1 and let en denote the a-algebra generated by {Sk}k n. If EIX1I < oo then: (1) Compute E(X, (<Sn] for all 1 < i < n.
(2) Prove that Z := limn-»w(Sn/n) exists a.s. and in LI(P). 8.49 (Levy's Equivalence Theorem; Hard). Let {Xn}n=1 be independent. Prove that Sn = E° 1 X; converges almost surely if and only if Sn converges in probability (Levy, 1937, Theorbme 44, p. 139). (HINT: Begin by proving that exp(itSn)/Eexp(itSf) defines a mean-one "complex" martingale for any t E R.)
8.50 (Azuma-Hoeffding Inequality; Hard). Let {X;}° 0 be a mean-zero martingale. Suppose there exist non-random constants {ci} 1 such that IX, - X,_1( < c, as. (i = 1,...,n). Prove that (8.117)
P { max (X,(> zlJ < 2exp (_ 0<2
z n
\ 1
C2/
vZ>0.
(HINT: Consult Problems 4.30 (p. 51) and 6.33 (p. 87).)
8.51 (Problem 8.40, Continued; Hard). Find an example of a mean-zero martingale {X,)andan a.s: finite stopping time T such that EXT 0 0.
8. Martingales
156
8.52 (Problem 8.22. Continued; Hard). Suppose f and ( are as. non-negative random variables such that for all a > 0,
P{f>a):5 1 EI(;{>a".
(8.118)
a
Prove that f E LI(P) as long as EI(In+(I < oo. Here, In+y = In(y n e). Use this to prove the strong LI-inequality of Doob: IfX is a non-negative submartingale, then sups E(IXnI In+ IXnI) < oo implies Esupn IXnI < oo. (HINT: Prove first that if 0 < x < y, then x in y< x In+ x + (y/e).)
8.53 (Hard). Prove that if f : R -. R is convex, then f' exists almost everywhere. [In fact, f" exists a.e., but this is a little more difcult.I 8.54 (de Finetti's Theorem; Problem 8.48, Continued; Hard). Suppose is an exchangeable sequence of zeros and ones, S. := XI + + Xn, and 4 = o(Sn , Sn+I , ...) for all n > 1.
(1) Prove that P(X1= =Xk=IIgn)=(n
s)/(s,)a.s.
(2) Use Stirling's formula to conclude the theorem of de Finetti (1937):
p(X1=.._=Xk=1,Xk+I=...=Xn=01 Z)=Zk(1-Z)"-k
a.s.
8.55 (Problem 8.52, Continued; Harder). Problem 8.52 cannot be improved. Let {Xn}n I denote i.i.d. mean-zero random variables, and define N. = E,n=, X, for all n > 1. Prove that Esupn ISn/ni and E{IXII In+ IXII) converge and diverge together (Burkholder. 1962). 8.58 (Problem 8.39, Continued; Harder). Prove the other half of the LIL (p. 138) for (±1) random variables: lim supn_,o Sn/an > 1 a.s., where an := (2n In In n)1 /2. You may use the following argument (de Acosta, 1983, Lemma 2.4):
(1) Prove that it suffices to show that for all c > 0,
InP {Sn > cI/2an} > -c. 1 n-. lnlnn (2) To establish (1) choose pn -. oo such that n divides p,,, and then prove that liminf
(
n
P {Sn > cI/Zan} > \P {Spy > eI/ZanPn/n}) (3) Use the central limit theorem and the preceding with pn - an/(Inlnn) to prove (1). Conclude the proof of the LIL for (±1) random variables.
(HINT: For part (2) first write S. = 5,,,, +(S2p - S,,,)+ +(Sn -S(n_I)p,,/n). Next observe that if each of these n/p terms is greater than pn A/n, then S. > A. Finally choose A judiciously. For part (3) optimize over the choice of a.) 8.57 (Problem 8.24, Continued; Harder). Suppose {Xn }n= is a martingale with respect to some filtration {.9rn},a 1. Suppose also that do = X. - Xn_I satisfies Idnl S a for all n > 1, where Xo = 0, Sto = (0,f1), and a is a non-random positive constant.
(1) Prove that for all x E R, e'' < I + x + x2el=I. Use this to prove that for all I E R
and all i=1,2,..., ed,
I
._
<1+
2
2°ItIE[d?_1I 2'
< exp
2'
a.s.
(2) Let { An },O°_ 1 denote the square function of X. Then conclude that given a non-random
t E R the following defines a non-negative supermartingale:
Mn = exp tXn -
t2&10A n 2
vn>1.
Moreover, verify that EMn < 1 for all n > 1. (3) Prove that if X. > 0, then limn-.oo X. IA. exists and is finite as. Prove also that
lim Xn(w) = 0
for almost all w E {A, = oo}.
Notes
157
Notes (1) Martingales were first introduced and studied by Ville (1939). The current powerful theory was formed by Doob (1940, 1949) shortly thereafter. (2) Our proof of the martingale convergence theorem (p. 134) is due to Isaac (1965). Aside from this and the original proof of Doob (1940) there are other nice proofs of Doob's martingale convergence theorem. For example, see Chatterji (1968), Helms and Loeb (1982), Lamb (1973), and C. lonescu Tulcea and A. lonescu Tulcea (1963). (3) An enormous literature is devoted to the study of the law of the iterated logarithm and its variants. An excellent starting point is the theorem of Strassen (1967). It implies that if X1, X2,... are i.i.d., EXi = 0, and VarXj = 1, then on some suitable probability space there exist i.i.d. N(0, 1) random variables {G;},°, such that
i_t Xi - E" _tGi n
nlim(n log log n) 1/2
l
-0
a.s.
In particular, this shows that the general LIL follows from the one proved here. More-
over, if the X;'s have higher moments than two, then the rate of approximation can be improved upon. This is the starting point of a theory of "strong approximations." Csorg6 and Rhvbsz (1981) is an excellent treatment. Two scholarly reviews of the LIL are Feller (1945), for the classical theory, and Bingham (1986), for the more modern advances.
(4) Equation (8.45) has the following improvement, due to Laplace (1805, pp. 490-493):
P{X1>a)=
A2/2
kr
A2
1+
A2
1+2 1+3
A2
1+4
(5) Theorem 8.39 is also known as the Lebesgue density theorem. It states that the antiderivative of every f E L'(dx) is f a.e. On the other hand, it is the case that "most" continuous functions are nowhere differentiable (Banach, 1931; Mazurkiewicz, 1931; Paley, Wiener, and Zygmund, 1933; Kahane, 1997, 2000, 2001). (6) The material on option-pricing (§6.5) is based in part on the discussions of Baxter and Rennie (1996, Chapter 2) and Williams (1991, Section 15.2). There you will learn, among other things, that there are in fact hedging strategies that never sell short. This demonstration requires only a little more effort than the proof described here, and is worth looking at. (7) The notion of Lipschitz continuity is due to the work of Lipschitz (1876) on differential equations. (8) One can streamline the method of §6.7; see Li (1980) and Gerber and Li (1981). (9) Problem 8.46 is, in essence, borrowed from the exciting book of Mahmoud (2000, pp. 48-51) on sorting. It can be shown that
f
R. - (n/2)
N(0, 1/12).
(ibid., Proposition 1.10, p. 51).
(10) Problem 8.48 is due to de Finetti (1937), but the proof outlined here is borrowed from Doob (1949). Aldous (1985) presents a masterly modern-day account of exchangeability and related topics. (11) When the increments of X are independent, Problem 8.50 is due to Hoeffding (1963). The general case is due to Azuma (1967), and is proved by the same argument.
158
8. Martin gales
(12) Problem 8.54 states that all exchangeable sequences of zeros and ones are "conditionally i.i.d." The proof outlined here is motivated by Exercise 6.3 of Durrett (1996, p. 271). For a detailed historical account see Cifarelli and Regazzini (1996). Remarkably enough, de Finetti's theorem has consequences in diverse subjects such as the philosophy of statistics (de Finetti, 1937; Kyhurg and Smokier, 1980), statistical mechanics (Georgii. 1988), and geometry of Hilbert spaces (Bretagnolle and Dacunha-Castelle. 1969). (13) Problem 8.57 generalizes the 'law of large numbers" of Dubins and Freedman (1965). The central ideas used here come from a paper of de Acosta (1983).
Chapter 9
Brownian Motion
The theory of random functions always makes the impression of a much greater degree of artificiality than corresponds to the facts.
-Raymond E. A. C. Paley and Norbert Wiener
On March 29, 1900, a doctoral student of J. H. Poincare by the name of Louis Jean Baptiste Alphonse Bachelier presented his thesis to the Faculty of Sciences of the Academy of Paris. Louis Bachelier's work was chiefly concerned with finding "a formula which expresses the likelihood of a market fluctuation" (Bachelier, 1964, p. 17). Bachelier's solution to this problem required the introduction of a number of novel ideas, one of which was today's "Brownian motion." See also the English translation in the volume edited by Cootner (Bachelier, 1964). In 1828, the botanist Robert Brown noted empirically that the grains of pollen in water undergo erratic motion. Brown himself admitted to not having a scientific explanation for this phenomenon. And it was years later, in 1905, that an explanation was found by Albert Einstein. The key idea in Einstein's solution was the introduction of a stochastic process that Einstein called "the Brownian motion." Unaware of the earlier work of Bachelier in economics, Einstein had rediscovered that Brownian motion is related concretely to the diffusion of particles. As a main application of his theory, Einstein found a very good estimate for Avogadro's constant. Einstein's theory was tacitly based on the assumption that the Brownian motion process exists. Nearly two decades later, Wiener (1923a) proved the validity of Einstein's assumption. In the present context, the contributions of von Smoluchowski (1918) and Perrin (1913) are also particularly noteworthy. 159
9. Brownian Motion
160
From a mathematical point of view, Bachelier's work went further than Einstein's. However, we introduce the latter's work because it is easier to describe. Thus, we begin with a modern statement of Einstein's postulates: Brownian motion .( W(t) }t>o is a random function oft (= "time") such that:
(P-a) W(0) = 0, and for any given time t > 0, the distribution of W(t) is normal with mean zero and variance t.
(P-b) For any 0 < s < t, W(t) - W(s) is independent of Think of s as the current time. Then, this condition is saying that "given the value of W at the present time, the future is independent of the past." This is called the Markov property.
(P-c) The random variable W(t) - W(s) has the same distribution as W(t - s). That is, Brownian motion has stationary increments. (P-d) The random path t' -+ W(t) is continuous with probability one.
Remark 9.1. One can also have a Brownian motion B that starts at an arbitrary point x E R by defining (9.1)
B(t) := x + W(t),
where W is a Brownian motion started at the origin. One can check directly that B has all the properties of W, except that B(t) has mean x for all t > 0, and B(0) = x. Unless stated to the contrary, our Brownian motions always
start at the origin.
So why are (P-a) through (P-d) postulates and not facts? The sticky point is the a.s.-continuity (P-d). In fact, Levy (1937, Theoreme 54.2, p. 181) has proven that if in (P-a) we replace the normal by any other distribution,
then either there is no process that satisfies (P-a)-(P-c), or else (P-d) fails to hold. In summary, while the predictions of theoretical physics were correct, a more solid understanding of Brownian motion required a rather in-depth undertaking such as that of N. Wiener. Since Wiener's work Brownian motion has been studied by multitudes of mathematicians. This and the next chapter aim to whet your appetite to learn more about this elegant theory.
1. Gaussian Processes Let us temporarily leave aside the question of the existence of Brownian motion, and first study normal distributions, Gaussian random variables, and Gaussian processes. Before proceeding further, you may wish to recall §5.2 (p. 11), as well as Examples 3.18 (p. 28) and 7.12 (p. 97), where normal random variables and their characteristic functions have been introduced.
1. Gaussian Processes
161
1.1. Normal Random Variables. Definition 9.2. An R-valued random variable Y is said to be centered if Y E L' (P) and EY = 0. An R"-valued random variable Y = (1',.. . , Yn)' is said to be centered if each Y is. If, in addition, E{IYj2} < 00 for all i = 1,. .. , n, then the covariance matrix Q = (Q,,j) of Y is the matrix whose (i,j)th entry is the covariance of Y and Yj; i.e., Qjj = E[YYj]. Suppose that X = (X1,. .. , Xn)' is a centered n-dimensional random variable in L2(P). Let a E R" denote a constant (column) vector, and note that a'X = a X = En 1 a;X1 is a centered R-valued random variable in L2(P) with variance n
n
Var(cz X) _ E E aiE[X;Xj[aj = a'Qa,
(9.2)
i=1 j=1
where Q = (Q;,j) is the covariance matrix of X. Since the variance of any random variable is non-negative, we have the following. Lemma 9.3. If Q denotes the covariance matrix of a centered L2 (p) -valued
random variable X = (X1, ... , Xn)' in R", then Q is a symmetric nonnegative definite matrix. Moreover, the diagonal terms of Q are given by Qj,j = VarXj. Definition 9.4. An R"-valued random variable X = (X1,.. . , Xn)' is centered normal (or centered Gaussian) if for all a E R",
= e-za'Qa where Q is a symmetric non-negative definite real matrix. The matrix A is called the covariance matrix of X. (9.3)
Ee'a.X
We have seen in Lemma 9.3 that covariance matrices are symmetric and non-negative definite. The following implies that the converse is true also.
Theorem 9.5. Let Q be a symmetric non-negative definite (n x n) matrix of real numbers. Then there exists a centered normal random variable X = (X1, . . , Xn) whose covariance matrix is Q. If Q is non-singular, then the distribution of X is absolutely continuous with respect to the n-dimensional .
Lebesgue measure and has the density o f Example 3.18 (p. 28) with Q replac-
ing E there. Finally, an R"-valued random variable X = (X1, ... , Xn) is centered Gaussian if and only if a'X is a mean-zero normal random variable
for all a E R. denote the n eigenvalues of Q. The )j's 1 are real and non-negative. Let {v;};'_1 denote the respective orthonormal eigenvectors, and view them as column vectors. Then the (n x n) matrix
Proof (Sketch). Let {a;}
9. Brownian Motion
162
P = (vi, ... , v,) is orthogonal. Moreover, we can write Q = P'AP, where A is the diagonal matrix of the eigenvalues al, ... , .\n. Next, let {Z;} 1 denote n independent standard normal random variables. It is not difficult to see that Z = (Z1,. .. , Zn)' is a centered Rnvalued random variable whose covariance is the identity matrix. Define
X := P'A1/2Z, where A'/2 denotes the diagonal matrix whose jth diagonal entry is x 2. Since X = (X1, ... , Xn)' is a linear combination of centered random variables, it too is a centered Rn-valued random variable. Define n
(9.4)
ll at (P'A1/2)1,k
Ak =
vk = 1.... , n.
1=1
Then a. X = Ek=1 ZkAk for all a E Rn. By independence, n
Eet°'X = [J Ee`4Ak = e-2 Ek=, Ak. k=1
One can check readily that Ek=1 Ak = a'Qa. Therefore, we have constructed a centered Gaussian process X that has covariance matrix Q. To check that Q is indeed the matrix of the covariances of X, make another round of computations to see that E[X;XJJ = Q. Next we suppose that Q is non-singular. Let j2 denote the distribution
of X. We have shown that µ(t) = exp(-Zt'Qt). Because µ is absolutely integrable on R1, the inversion theorem (see Problem 7.12, p. 112) implies that the probability density f = du/dx exists and is given by the formula (9.6)
AX) =
1
e-it"_2t'Qt
dt
vx E R.
Write Q:= P'A1/2A1/2P, and change variables [s = A2 Pt]. This transforms the preceding n-dimensional integral into a product of n one-dimensional integrals. Each of the latter integrals can be computed painlessly by completing the square. Therefrom follows the form of f. To complete this proof, we derive the assertion about linear combinations. First suppose a'X is a centered Gaussian variable in R. We have seen already that its variance is a'Qa, where Q denotes the covariance matrix of
X. In particular, thanks to Example 7.12 (p. 97), Ee'a'X = exp(-Za'Qa). Thus, if a'X is a mean-zero normal variable in R for all a E Rn, then X is centered Gaussian. The converse is proved similarly.
Remark 9.6. According to Theorem 9.5, the covariance matrix of a centered normal random variable determines its distribution. However, it can happen that Xl and X2 are normally distributed even though (X1, X2) is not; see Problem 9.4 below. This demonstrates that in general the normality
1. Gaussian Processes
of (X1, ... ,
163
is a stronger property than the normality of the individual
XD's.
The following important corollary asserts that for normal random vectors independence and uncorrelatedness are one and the same.
Corollary 9.7. Let (X 1, ... , X,,, Y1,. .. , Ym) be a centered normal random variable such that Cov(X;, YY) = 0 for all i = 1, ... , n and j = 1, ... , m. Then (XI, ... , and (Y1,. .. , Ym) are independent.
1.2. Brownian Motion as a Gaussian Process. Definition 9.8. We say that a real-valued process X is centered Gaussian if (X (tl ), ... , X (tk)) is a centered normal random variable in Rk for all 0 < t1, t2, t3, ... , tk. The function Q(s, t) := E[X (s)X (t)] is called the covariance function of the process X.
For the time being, we assume that Brownian motion exists, and derive some of its elementary properties. We establish the existence later on.
Theorem 9.9. If W := {W(t)}t>o denotes a Brownian motion, then it is a centered Gaussian process with covariance function Q(s, t) := s A t. Conversely, any a.s.-continuous centered Gaussian process that has covariance function Q and starts at 0 is a Brownian motion. Furthermore:
(1) (Quadratic Variation) For each t > 0, as n V n-I
jr
oo,
- W \ \n) t)J2
t.
(2) (The Markov Property) For any T > 0, the process
{W(t + T) - W(T)}t>o is a Brownian motion that is independent of of{W(r)}o
Remark 9.10.
(1) Any function with bounded variation has zero qua-
dratic variation. Therefore, Part 1 implies that Brownian motion has unbounded variation a.s.
(2) Because W(t + T) = [W(t + T) - W(T)] + W(T), the Markov property tells us that given the values of W before time T, the "post-T" process t ' W (t + T) is a "conditionally independent" Brownian motion that starts at W(T). The dependence of the post-T process on the past is "local" since it depends only on the last value W(T). Theorem 9.9 has the following useful consequence.
9. Brownian Motion
164
Then, Corollary 9.11. Let 39 be the o-algebra generated by Brownian motion is a continuous-time ..I -martingale. That is, E[W (t) I.1r9]
=W(s) a.s. for allt>s>0. Proof. We know that W(t) - W(s) has mean zero and is independent of J r,. Therefore, E[W(t) - W(s) I.T9] = 0 a.s. The result follows from the obvious fact that W (s) is p3-measurable.
Proof of Theorem 9.9 (Sketch). First, let us find the covariance function of W, assuming that W is indeed a centered Gaussian process: If t > s > 0, then because W(t) - W(s) is a mean-zero random variable that is independent of W(s), Q(s,t) = E [W(s)W(t)] = E [(W(t) - W(s) + W(s)) W(s)] (9.7)
= E [IW(s)I2] = s.
In other words, Q(s, t) = s A t for all s, t > 0. Next we will prove that W is a centered Gaussian process. By the independence of the increments of W, for all 0 = to < t1 < t2 <
n
(9.8)
Ee Ek=1 ak(W(tk)-W(tk-1)) = [1
Ee1Rk(W(tO-W(tk-1n)
k=1
For each k = 1,... , n, W(tk) - W (tk_ 1) is a mean-zero normal random variable with variance tk - tk_1i its characteristic function is computed in Example 7.12, p. 97. This leads to (9.9)
Ee'Ek=1 ak(W(tk)-W(tk-11) = e-Za'tita,
where the matrix M is described by M1,j = 0 if i # j, and Mj,2 = tj - tj_1.
In other words, the vector (W(tj) - W(tj_1); 1 < j < n) is a centered normal random variable in Rn. Now for any 3 = (,01 i n
(9.10)
.,On) E Rn,
n
E,8kW(tk) = Eak[W(tk) -W(tk-1)], k=1
k=1
+ an. Therefore, Ek=1 /3, W(tk) is a centered normal where Qk := ak + random variable in R. That is, W is a centered Gaussian process.
Next we prove that if G = {G(t)}t>o is an a.s.-continuous centered Gaussian process with covariance function Q, and if G(0) = 0, then G is a Brownian motion. Because the remaining conditions of Brownian motion are easily verified for G, it suffices to show that whenever t > s, G(t) - G(s) is independent of We fix 0 < u1 < . . . < uk < s and prove that G(t) - G(s) is independent of (G(u1),... , G(un)). However, the distribution of the (n + 1)-dimensional random vector (G(t) - G(s),G(u1),...,G(u))
2. Brownian Motion on [0, 1)
165
is the same as that of (W (t) - W (s), W (ul ), ... , W This is because everything reduces to the same calculations involving the function Q. This proves that G is a Brownian motion. Problem 9.6 contains ample hints for proving that V,a(t) t in probability as n -* oo. Note that t H W (t + T) - W (T) is a continuous centered Gaussian process. We verify that: (a) This is Brownian motion; and (b) it is independent of a({W(r)}o
E [(W(s + T) - W(T)) (W(t + T) - W(T))] = E[W(s + T)W(t +T)] - E[W(T)W(t + T)]
(9.11)
- E[W(T)W(s +T)] + E [(W(T))2]
=(s+T)-T-T+(T)=s=s At. In particular, t,--+ W (t + T) - W (T) is a Brownian motion. The assertion about independence is proved similarly: Ifs < T, then E(W(s) (W(T + t) - W(T))] (912)
E[W(s)W(T+t)]-E[W(s)W(T)]=s-s=0.
Corollary 9.7 proves the independence of t --+ W(t + T) - W(T) from 0'({W(r)}0
2. Wiener's Construction: Brownian Motion on [0, 1) It remains to prove that Brownian motion exists. We begin by reducing the scope of our ultimate goal: If we can show the existence of Brownian motion indexed by [0, 1), then we have the following more general existence result. Lemma 9.12. Suppose {B;}°_°0 are independent Brownian motions indexed by [0, 1). Then, the following recursive definition describes a Brownian motion W indexed by [0,00):
(9.13)
BO(t)
if t E [0,1),
BO(1) + Bl(t - 1)
if t E [1, 2),
W(t) :_
Ej-oBk(1)+Bj(t- j) if t E [j,j+1), Proof. The defined process W is a continuous centered Gaussian process because it is a finite sum of continuous centered Gaussian processes. It remains to compute the covariance of W: If s < t, then either we can find
9. Brownian Motion
166
j>0such that j<s j. In the first case, E[W (s)W (t)] is equal to
-1
E
-1
[(Bk(1)+B(t_i)) (Bk(1) + B(s - j) k-0
/
]
(9.14)
(Bk(1))2J + E [(BB (s - j)BB(t - j))] =EI k=o
=j+ (a - j)=s=sAt.
In the second case, one obtains the same final answer. In any event, W has the correct covariance function, and is therefore a Brownian motion. 0 The simplest construction of Brownian motion indexed by [0, I)-in fact, [0,1]-is the following minor variant of the original construction of Norbert Wiener. Throughout, let {X=}°_°0 denote a sequence of i.i.d. normal random variables with mean zero and variance one; the existence of this sequence is guaranteed by Theorem 6.17 on page 70. Then, formally speaking, the Brownian motion {W(t)}oct<1 is the limit of the following sequence: 9.15
Wn t =tX
+7Esin(jzrt)X
VO
j=1
In order to prove the existence properly, we first need to complete the probability space. Informally, this means that we declare all subsets of null sets measurable and null; this can always be done at no cost. See Theorem 3.20 on page 29. Now we can prove Wiener's theorem (1923a).
Wiener's Theorem. If the underlying probability space is complete, then W(t) = limn-o. W2n(t) exists a.s., and the convergence is uniform for all t E [0, 1]. The process W is a Brovmian motion indexed by [0, 1].
Proof. We split the proof into three steps. Step 1. Uniform Convergence. For n = 1, 2.... and t > 0 define (9.16)
sin(kk7rt)
Sn(t)
Xk.
k=1
Thus, Wn(t) = tXo + (f /ir)Sn(t). Stated more carefully, we have two processes Sn(t , w) and Wn(t , w), and as always we do not show the dependence on w.
We will prove that Stn forms a Cauchy sequence a.s. and in L2(P), uniformly in t E [0, 1]. Define (9.17)
On(t)
S2n}1(t) - S2-(t).
2. Brownian Motion on [0, 1)
167
Note that 2
2n+1
(9.18)
sin(jirt)
I&n(t)12 =
2n+1
X
j=2n+1
2
eijlrt
E
<
j=2n+1
i
X.,
Therefore, 2n+1
2n+1
I&n(t)12
ci(j-k)vrt
<
XjXk
k
j=2n+1 k=2n+1 2n+1
x2
_
(9.19)
k=21+1
2
2"-1 2"+'-1
k(l + k) XkXl+k
1=1 k=2n+1 2n-1 2n+1-1
X2
2n+1
eillrt
+2 E E
E 2 +2E E
k=2n+1
1=1
k=2n+1
XkXl+k k(k + 1)
The right-hand side is independent of t > 0, so we can take expectations and apply Minkowski's inequality to obtain rr
E suP IA(t)1 2 L t>0
(9.20)
l J
2n+1
<E
2
k=24+1 1 2n}1
_E
k=2n+l
2n-1
2n+1 -1
1=1
k=2n+l
+2
+
2n-1 2 t=1
V.
XkXl+k k(k + 1)
2'+1 -1
E
k=2n+1
2
XkXl+k 2
k(k+l)
1
12
(Why?) The final squared-L2(P)-norm is equal to k-2(k + l)-2. On the other hand, by monotonicity, .k=2n+1 k-2 < 2_n and .1=1 1(1+2n)-1 < 1. Therefore, (9.21)
E [SUP f''n(t)12]
< 2-n +2.2
-n/2.
It follows that En Supt>0 IS2n+1 (t) - S2n (t)I < oo a.s. Thus, as n -+ oo, S2n (t) converges uniformly in t > 0 to the limiting random process (9.22)
sin(jnt)Xi
Si(t) j=1
7
In particular, W (t) = limn-,,,, W2n (t) exists uniformly in t > 0 almost surely. Step 2. Continuity and Distributional Properties. The random map t -+ W2n (t) defined in (9.15) is obviously continuous. Because W is an a.s.uniform limit of continuous functions, it is a.s. continuous; see Step 3 below
for a technical note on this issue. Moreover, since W2n is a mean-zero
9. Brownian Motion
168
Gaussian process, so is W (Problem 9.2). Since W(0) = 0, it remains to prove that (9.23)
E [IW(t) - W(s)12] = t - s
dO < s < t.
But the independence of the X's, together with Lemma 6.8 (p. 67), yields
E [IW(t) - W(s)12] = (t - s)2 + (9.24)
_ (t - s)2 +
T2 E
[(S.(t) - Soo(s))2]
J=1J 2
(sin(Jlrt)
"0
sin(as)) 2
2
Define f(x) = 1[,ra,nti(x)+1[-nt,-as](x) (x E [-7r, -7r)) and On(x) = ei"z/ 2ir (x E [-7r, 7r]; n = 0, ±1, ±2,...). Then, 00
E [IW(t) - W(8)12] = (t - s)2 + (9.25)
1
-
2ar
00
f
2
f (x)O, (x) dx 2
IIn
f(x) b (x) dxl =-oo
a
By the Riesz-Fischer theorem (Theorem A.5, p. 205), the right-hand side is equal to (21r)-' f "'r lf (x) I2 dx = (t - s). This yields (9.23). Step 3. Technical Wrap-up. Now we tie up a subtle loose end: The uniform limit W (t) = limn-W W2.. (t) is known to exist, but only with probability one. This is insufficient because we need to define W (t, w) for all W. Thus, we define
W(t) := limsupW2n(t). n-oo The process W is well defined and continuous a.s. The remainder of the calculations of Step 2 goes through, since by redefining a random variable on a set of measure zero we do not change its distribution (unscramble this!). Finally, the completion is needed to ensure that the event C that W is continuous is measurable; in Step 1, we showed that C° is a subset of a null set. Because the underlying probability is complete, C` is null. (9.26)
3. Nowhere-Differentiability We are in a position to prove the following striking theorem of Paley, Wiener, and Zygmund (1933). Throughout, W denotes a Brownian motion.
Theorem 9.13. Suppose the underlying probability space is complete. Then, Brownian motion is nowhere differentiable almost surely.
3. Nowhere-Differentiability
169
Proof. For any A > 0 and n > 1, consider the event (9.27)
Ea = { 3s E [0,1] :
IW(s) - W(t)I < sup tE[s-2-°,s+2 ^]
A2-n JJ
(Why is this measurable?) We intend to show that En P(Ea) < oo for all
A>0. Indeed, suppose there exists s E [0, 11 such that for all t within 2-n of s, IW(s) - W(t)l < A2-1. Then there must exist a possibly random j = 0, ... , 2' - 1 such that s E D(j; n), where D(j; n) := [j2-n, (j +1)2 -nj . Thus, IW(s) -W(t)I < A2-n for all t E D(j;n). By the triangle inequality, 2A2-n = A2-n+1 for all u, v E D(j; n). we can deduce that I W (u) - W (v) I < Subdivide D(j; n) into four smaller dyadic intervals, and note that the successive differences in the values of W (at the endpoints of the subdivided intervals) are at most A2-n+1 This leads us to the following: (9.28)
2n-1 3
En g U n {Io,tI <
(9.29)
A2-n+')
,
i=o t=0
where (9.30)
A't = W
(j2-n
+ (f + 1)2 -(n+2)) - W
(j2-n + t2-(n+2))
Thanks to the independent-increments property of Brownian motion (P-b), 2n-1 3
P (En) <' ft P { IOj,II <
(9.31)
A2-n+1 }
,
i=o t=o
On the other hand, by the stationary-increments property of W, All has a normal distribution with mean zero and variance 2-(n+2) ((P-a) and (P-c)). Thus, for all Q > 0, 112("+2)/2 a-x2/2
P { IOj,tI
(9.32)
Q}
=
f_f32(,+2)/2
dx
/32
+2)/2.
Apply this with,3 = A2-n+1 to deduce that P(Ea) < 256x42-n. In particular, En P(Ea) < oo, as was asserted earlier. By the Borel-Cantelli lemma, for any A > 0, the following holds with probability one: For all but a finite number of n's, (9.33)
inf
sup
o<s
IW(s) -W(t)I > inf Is - tI
sup
0<s<1 It-s1<2
IW(s) -W(t)I > A 2- n
Thus, if W'(s) existed for some s E [0, 1], then IW'(s)l > A a.s. Because A > 0 is arbitrary, this proves that IW'(s)I = oo a.s. This contradicts the differentiability of W at some s E [0, 1]. Thanks to scaling (Theorem 9.9),
9. Brownian Motion
170
W is a.s. nowhere differentiable in [0 , c] for any c > 0. Therefore W is a.s. nowhere differentiable. Technical Aside in the Proof. In fact, we have proven that there exists a null set N such that D C N, where D is the collection of all w's for which t u--. W (t , w) is differentiable somewhere. The collection D need not be measurable. But this is immaterial to us since we can complete the underlying probability space at no cost (Theorem 3.20, p. 29). In the completed probability space D is a null set, and our task is complete. O
4. The Brownian Filtration and Stopping Times Recall the Markov property of Brownian motion (Theorem 9.9): Given any
fixed T > 0, the "post-T" process t i--, W(T + t) - W(T) is a Brownian motion that is independent of a({W (u)}o
given the value of W(T), the process after time T is independent of the process before time T. The strong Markov property states that the Markov property holds for
a large class of random times T that are called stopping times. We have encountered such times when studying martingales in discrete time, and their continuous-time definition is formally the same.
Throughout, W is a Brownian motion with some prescribed starting point, say W(0) = x. Definition 9.14. A filtration W = {. }t>o is a collection of sub-a-algebras of .9 such that a(. C Wt if s < t. If W is a filtration, then a measurable function T : SZ -+ [0, oo] is a stopping time (or sad-stopping time) if {T < t} is a /t-measurable for all t > 0. Given a stopping time T, we can define OT by (9.34)
v/T={AE.,F: An{T0}.
T is called a simple stopping time if there exist 0 < To, 7-1.... < oo such that T(w) E {ro,T1i...} for all w E Q. Define (9.35)
.3° = o ({W(u)}o
In light of our development of martingale theory this definition is quite natural. Here are some of the properties of FT when T is a stopping time.
Lemma 9.15. If T is a finite .90-stopping time, then 9T is a a-algebra, and T is 9To.-measurable. Furthermore, if S < T is another stopping time, then 3s 9 .70..
4. The Brownian Filtration and Stopping Times
171
Proof. For each t > 0, (9.36)
A`n{T
Consequently, 9T is closed under complementation. Because 9,0 is a monotone class, it is a a-algebra.
For all a E (0,oo) andt>0, T-1([0, aJ) n IT < t} _ IT < a n t} E .°t C go. (9.37) This proves that T is JrT-measurable. Finally, we suppose A E F so, and note that for any t > 0, An IT < t} _ An IS < t} n IT < t}. Since An IS < t} and IT < t} are both in 9O, this proves that A n IT < t} E 9to, whence we have A E F. This proves the remaining assertion that . C F. 0
For technical reasons, we need to modify the filtration 90. For every t > 0 let Ft denote the completion of . I. It can be checked that {`fi't}t>o is a filtration of a-algebras. Note that any .`tee- or .0-stopping time is also an .F-stopping time. We alter the filtration Or as well to obtain the Brownian filtration.
Definition 9.16. A filtration {. }t>o is right-continuous if for all t > 0, +,. The Brownian filtration {Ft}t>o is defined as the smallest right-continuous filtration that contains {.fit}t>o. That is, F't := n,>t.`3', s1t = nE>o
for all t>0. Next we construct interesting stopping times.
Proposition 9.17. If A C R is either open or closed, then the first hitting time TA := inf{t > 0 : W(t) E A}, where inf 0 := oo, is a stopping time with respect to the Brownian filtration F. Remark 9.18. If you know what Fo- and G6-sets are, then convince yourself that when A is of either variety, then TA is a stopping time. You may ask further, "What about TA when A is measurable but is neither Fo nor G6?" The answer is given by a quite deep theorem of Hunt (1957): TA is a stopping time for all Borel sets A.
Proof of Proposition 9.17. We prove this proposition in two steps. Step 1. TA is a stopping time when A is open. Suppose A is open. We wish to prove that {TA < t} E .fit for all t > 0. It suffices to prove that (TA < t} E .fit for all t > 0, because the right-continuity of Jr ensures that {TA < t} = n,>o{TA < t + e} E F. But (TA < t} is the event that there exists a time s before t at which W (s) E A. Let C denote the collection of all w such that t H W (t , w) is continuous. We know that P(C) = 1. And
9. Brownian Motion
172
since A is open, {TA < t} n C is the event that there exists a rational s < t at which W(s) E A. That is,
{TA
= U {W(s)EA}\ s
U {W(s)EA}nC`
.
s
Because At is complete, so is .frt. Therefore, all subsets of null sets are measurable null sets. It follows from this that Us
x and the set A is < n-1. It is clear that An is open, and {TA < t} n C = nn{TA, < t} n C, where C was defined in Step 1. By Step 1, {TA < t} n C is in .frt. Because Ft is complete, it follows that {TA < t} E .5rt also.
O
For all random variables T : SZ -+ [0, oo) we define the random variable W (T) as follows:
W(T)(w) := W(T(w),w).
(9.39)
Proposition 9.19. If T is a finite stopping time, then W (T) is measurable with respect to 5T. The proof of this proposition relies on a simple though important approximation scheme.
Lemma 9.20. Given any finite s-stopping time T, one can construct a non-increasing sequence of simple stopping times T1 > T2 > . limn Tn(w) = T(w) for all w E St. In addition, .$T = nn,!FT,,.
.
.
such that
Proof. Here is a receipe for the Tn's: (9.40)
7'n (w) =
(__) 1[k2-n,(k+1)2-n)(T(w))
Since every interval of the form [k2-n, (k + 1)2-n) is obtained by splitting into two an interval of the form [j2-n+1 (j+1)2-n+1), we have Tn > Tn+1 > T.
To check that Tn is a stopping time, note that {Tn < (k + 1)2-n} _ {T < (k + 1)2-n} E 9(k+1)2-n, since T is a stopping time. Now given any
5. The Strong ltiarkov Property
173
t > 0, we can find k and n such that t E [k2-n, (k + 1)2-n). Therefore, {T < t} = {Tn < k2-n} = IT < k2-n} E 5k2-n C St. This proves that the Tn's are non-increasing simple stopping times. Moreover, Tn converges to T since 0 < Tn - T < 2-n. It remains to prove that 5T nn5T,,; see Proposition 9.17 but replace 9° by 9 everywhere.
IfAEnk
An{Tn _ landt>_0.
Therefore, 00
(9.41)
00
An{TO m=1 n=m
e>0
The lemma follows from the right-continuity of the filtration 5.
Proof of Proposition 9.19. First suppose that T is a simple 3-stopping time which takes values in {T° , -r1 , ... }. In this case, given any Bore] set A
and any t > 0, (9.42)
{W(T) E Al n IT < t} = U {W(Tn) E Al n IT = Tn} E S. n>O:
-r.
For a general finite stopping time T, we can find simple stopping times Tn I T (Lemma 9.20) with .`'T = n,, ST.. Let C denote the collection of w's for which t i--+ W (t , w) is continuous and recall that P(C) = 1. Then, for any open set A C R,
{W(T) E A}nCn IT < t} (9.43)
00
00
= n u n {W(Tn)EA}nCn{Tn
Since Tn is a finite simple stopping time, {W(T,,) E Al n IT,, < t} E S. In particular, the completeness of Ft shows that the above, and hence, {W(T) E A} n IT < t} are also in S. The collection of all A E .q(R) such that {W(T) E A} n IT < t} E fit is a monotone class that contains all open sets. It follows from the monotone class theorem that for all A E -V(R) and t > 0, (9.44)
{W(T) E A} n IT < t} E .fit.
This proves the proposition.
5. The Strong Markov Property We are finally in a position to state and prove the strong Markov property of Brownian motion.
9. Brownian Motion
174
Theorem 9.21. If T is a finite .-stopping time, where 9 denotes the Brownian filtration, then {W (T + t) - W (T) }t>o is a Brownian motion that is independent of 9T.
Proof. We prove this first for simple stopping times, and then approximate, using Lemma 9.20, a general stopping time with simple ones. Step 1. Simple Stopping Times. If T is a simple stopping time, then there exist To < rl < such that T E {To, 7-1i ...} a.s. Now for any A E .?T, and for all B1,... , Bm E R(R),
P(An{'i<m: W(T+ti)-W(ti)EBi}) (9.45)
P(An{t'i<m: W(7k+ti)-W(Tk)EBi, T=Tk}). k=0
But A n IT = Tk} = A n IT < Tk} n IT < Tk_1}' is in .irk since A E 9T. Therefore, by the Markov property (Theorem 9.9),
P(An{'i<m: W(T+ti)-W(ti)EBi}) (9.46)
=EP{di<m: W(Tk+ti)-W(Tk)EBi}P({T=Tk}nA) k=0
= P {di < m : W(ti) E Bi} P(A). This proves the theorem in the case that T is a simple stopping time. Indeed,
to deduce that t H W(t + T) - W(T) is a Brownian motion, simply set A = R. The asserted independence also follows since A E 9T is arbitrary. Step 2. The General Case. In the general case, we approximate T by simple stopping times as in Lemma 9.20. Namely, we find Tn I T-all simple stopping times-such that nn,!FTn = 9T. Now for any A E 9T, and for all open B1i... , B,n C R,
P(An{di<m: W(T+ti)-W(ti)EBi}) (9.47)
= n-oo limp (An { 'i < m : W(Tn + ti) - W(ti) E Bi}) = lim P {Vi < n-oo
m : W(ti) E Bi} P(A).
In the first equation we used the fact that the B's are open and W is continuous, while in the second equation we used the fact that A E .99'T for all n, together with the result of Step 1 applied to T. This proves the theorem.
6. The ReBection Principle
175
6. The Reflection Principle The following "reflection principle" is a prime example of how the strong Markov property (Theorem 9.21) can be applied to make nontrivial computations for the Brownian motion.
Theorem 9.22. If t is non-random and positive, then supo<., 0 and t > 0, (9.48)
I
P ( sup W(s) > a = lo<s
tt J
exp I
a
-z2 I dz.
\\\
2t/
Proof. Define Ta = inf{s > 0 : W(s) > a} where inf 0 = oo. Thanks to Proposition 9.17, Ta is an f- and hence an f-stopping time. Step 1. T. is finite a.s. By scaling (Theorem 9.9), for any t > 0, the event {W(t) > v'} has probability (21r) -1/2 f1 ° e-x2/2 dx = c > 0. Consequently,
P{t.oo limsup
(9.49)
()>1}>c>0.
-
JJJ -
(Why?). Among other things, this and the zero-one law (Problem 9.11) together imply that lim supt_ao W (t) = oo a.s. Since W is continuous a.s., it must then hit a at some finite time a.s. Therefore, with probability one, T. is finite and W(Ta) = a. Step 2. Reflection. Note that {supo<s a} = {Ta < t} E .fit. Moreover, P{Ta < t} is equal to P {Ta < t
,
W(t) > a} + P {Ta < t, W(t) < a}
=P{W(t)>a}+P{Ta
= P {W(t) > a}
+E[P{W(Ta+(t-Ta))-W(Ta) <0IfT,};Ta
of fTa is below zero at time t - Ta, given the value of Ta. Independence and symmetry (Theorem 9.9) together imply that the said probability is
a.s. equal to P{W(TO + (t - Ta)) - W(Ta) > 0IfT,} (why a.s.?).
[In
other words, we have reflected the post-Ta process to get another Brownian motion, whence the term "reflection principle."] Therefore, we make this change and backtrack in the preceding display to deduce that P{Ta < t} is
9. Brownian Motion
176
equal to P {W(t) > a}
+E[P{W(Ta+(t-Ta))-W(Ta)>OI9T};Ta
= p{W(t) > a}+P{Ta < t, W(Ta + (t - Ta)) - W(Ta) > 0}
=P{W(t)>a}+P{Taal = 2P {W (t) > a),
because P{W(t) = a} = 0. The latter is manifestly equal to the integral in the statement of the theorem. In addition, thanks to symmetry (Theorem 9.9),
2P{W(t) > a} = P{W(t) > a} +P{-W(t) > a} (9.52)
= P{IW(t)I > a}.
This completes our proof. The reflection principle has the following peculiar consequences: While we expect Brownian motion to reach a level a at some finite time, this time has infinite expectation. That is,
Corollary 9.23. Let Ta = inf{s > 0 : W(s) = a} denote the first time Brownian motion reaches the level a. Then for all a
0, Ta < oo a.s. but
ETa=oo. Proof. (Sketch) We have seen already that Ta is a.s. finite; let us sketch a proof that Ta has infinite expectation. Without loss of generality, we can assume that a > 0 (why?). Then, thanks to Theorem 9.22, = a/f e-v2/2 P {Ta > t} dy. (9.53)
f.lVt
27r
See Problem 9.13 for more details. The preceding formula demonstrates that
g)1/2
(9.54)
P{Ta>t}Haast-goo.
Therefore, E°_1 P{Ta
n} = oo, and Lemma 6.8 (p. 67) finishes the
proof.
Problems Throughout, W denotes a Brownian motion. 9.1. Prove the following: If X and Y are respectively R"- and R'"-valued random variables, then X and Y are independent if and only if (9.55)
Ee"'X+wY=&wX ivY
Use this to prove Corollary 9.7.
vuER",vER"`.
Problems
177
9.2. Suppose for every n = 1, 2, ... , C" = (G',..., GA) is an Rk-valued centered normal random variable. Suppose further that Q;j = lim_ E[G°G'`,'] exists and is finite. Then prove that Q is a symmetric nonnegative definite matrix, and that G" converges weakly to a centered normal random variable C = (G 1, ... , Gk) whose covariance matrix is Q.
9.3 (Linear Regression). Suppose C = (G1,...,G") is an R"-valued centered normal random variable, and let f denote the a-algebra generated by (G1,...,Gm), where m < n. Prove that, conditionally on 1, (G,,,+i,...,G") is a centered normal random variable. Find the conditional mean, as well as the covariance matrix.
9.4. Construct an example of two random variables X1 and X2 such that each of them is standard normal, although (XI, X2) is not a Gaussian random variable. (HINT: If X = ±1 with probability 1/2, and Z is an independent N(0,1), then X[Z[ is an N(0, 1) also.)
9.5. Prove that if W denotes Brownian motion, then { fp W(s) ds}t>o is a continuously differentiable Gaussian process. Use this as a guide to construct a k-times continuously differentiable Gaussian process. 9.6. Let t > 0 be fixed and define V"(t) as in Theorem 9.9. Prove: (1) The first two moments of V"(t) are respectively t and (2t2/n) + t2. Use this to verify that V"(t) converges to tin probability. (2) There exists a constant A > 0 such that for all it > 1, [[V,(t) - t1I4 <
7n=.
Use this to prove that V"(t) converges to t almost surely.
9.7. Prove that {-W(t)}t>o, {tW(1/t)}t>o. and {c 1/2W(ct)}t>o are Brownian motions for any fixed, non-random c > 0.
9.8 (Heat equation). Let pt denote the density function of W(t). Compute pt(x), and verify that it solves the partial differential equation, z
vt>0, x E R. &° (x) = 2 ax t (x) Also, prove that pt solves pt +,(x) = f a pt(y-x)p,(y) dy (Bachelier, 1900, p. 29 and pp. 39-40). (9.56)
9.9 (Khintchine's LIL). If W denotes a Brownian motion, then prove the following LILs of Khintchine (1933): With probability one, limsup
(9.57)
t-oo
W(t) (2 t In In t)1/2
=limsup
W(t)
t-e (2tlnln(1/t))t/2
=1.
9.10 (Brownian Bridge). Given a Brownian motion {W(t)}t>o define the Brownian bridge to
be the process B(t) = W(t) - tW(1). Prove that for all 0 < ft < t2 <
ul,...,um ER,
lim E [e'Ej_t vi%V(t,) I W(1), < eJ = Ee'E7 1 ,o
(9.58)
< t,,, < I and
jB(tj).
This justifies the assertion that Brownian bridge is Brownian motion conditioned to be zero at time one. (HINT: B is independent of W(1).) 9.11 (Blumenthal's Zero-One Law). If W denotes a Brownian motion, define the tail o-algebra .Y as follows: First, for any t > 0, define 9 to be the P-completion of o{W(u); u > t}. Then, define ,T = nt>o_17t
(1) Prove that T is trivial; i.e., P(A) E {0,1} for all A E T. (2) Let of {W define At to be the P-completion of Jr,0, and let .fit be the right-continuous extension. That is, .fit = n.9, where the intersection is taken over all rational s > t. Prove Blumenthal's zero-one law: 5o is trivial; i.e., P(A) E {0,1} for all A E S.
9. Brownian Motion
178
9.12. Follow the next three steps in order to refine Wiener's Theorem (p. 166).
(1) Check that (W,n(t) - W2- (0)m=2, is a martingale for all fixed n > 1 and t E [0,1]. (2) Conclude that m - supo
2n+1
(3) Prove that as., W(t) = limn Wn(t) uniformly over all t E [0, 1].
9.13. Given a # 0, define T. = inf{s > 0: W(s) = a). (1) Prove that the density function f , of To is given by (9.59)
]ale a2/(2s) J T. (x) =
d a E R, x > 0.
x3/2 2 a
For a bigger challenge compute the characteristic function of Ta. The distribution of IT1 is called the stable distribution with index 1/2. (2) Show that the stochastic process {Ta}a>o has i.i.d. increments. 9.14. Choose and fix a, b > 0. Find the probability that Brownian motion does not take the value zero at any time in the interval (a, a + b). Use this to find the distribution function of (9.60)
L := sup(t < 1
:
W(t) = 0).
(HINT: L is not a stopping time; condition on W(a).) 9.15. Let (W(t))tE[o,l) denote a Brownian motion on (0,1). We wish to prove that the maximum of W is achieved at a unique place a.s.
(1) Prove that P{supfE[ablW(t)=x}=0forallxERand0
dk = 1,2,... . Tk+1 := inf{s > Tk : IW(s) - W(Tk)l = 1) (1) Prove that the Ti's are stopping times. (2) Prove that the vectors (W(Tk+1) - W(Tk),Tk+1 - Tk) (k > 0) are i.i.d. Therefore, the process {W(Tk)}k 0 is an embedding of a simple random walk inside Brownian motion (Knight, 1962). In fact, every mean-zero finite-variance random walk can be embedded inside Brownian motion via stopping times (Skorohod, 1961), but this is considerably more difficult to prove.
9.17 (The Forgery Theorem). Let W denote a Brownian motion on [0, 1), and consider a nonrandom continuous function f : [0, 1) -. R. Then prove that for all e > 0, (9.61)
P { sup IW(t) - f(t)] <_ e) > 0. tEI0,1)
1
Think of the graph of f as describing a "signature." Then, this shows that Brownian motion can forge this signature to within a with positive probability (Levy, 1951).
9.18 (Heat Semigroup). Let W denote Brownian motion. Suppose f : R -. R is measurable, and EI f(x+W(t))I < oo for all t > 0 and x E R. Define the heat operator, (Htf)(x) := Ef (x+W(t)), where t > 0 and x E R. Prove that {Ht}too has the following "aemigroup" property: dt, s > 0. Ht(Haf) = Hs+tf 9.19 (White Noise; Hard). Let f : R -. R be non-random, differentiable, and zero outside [a,b].
Then we can define f f dW to be - f W(s)f'(s) ds. Prove then that II f f dWll; = where m denotes the Lebesgue measure on R Use this "L2-isometry" to construct f f dW for all f E L2(m).
Problems
179
(1) Prove that for all f,9 E L2(m),
E [ f fdW f 9dWl = f f(x)9(x)dx. J
(2) Let G(f) = f f dW (f E L2(m)), and prove that C
{G(f)}JEL3(m) is a centered Gaussian process. The process C is the so-called white noise, as well as the Wiener integral.
(3) Prove that if {4,}a=1 is an orthonormal basis for L2(m), then {G(m,)}i=1 are i.i.d. standard normal random variables (Wiener, 1923a,b). R is infinitely differentiable and 9.20 (Hard). Suppose W is a Brownian motion and f : R Elf (W(t))l Goo for all t > 0. Prove that {f(W(t))}t>o is a martingale if and only if there exist
a,b E R such that f(x)=a+bxforallxER. 9.21 (Hard). Our proof of Theorem 9.13 can be refined to produce a stronger statement. Indeed, suppose a > z is fixed, and then prove that for all s E [0,1], (9.62)
lim sup I W (s) - W(t)I = oo
t-s
I9 - tIa
as.
9.22 (Hard). Choose and fix some a > z, and define W to be a Brownian motion. Prove that there exist finite random times of > a2 > .... decreasing to zero, such that W (oj) = o for all j > 1. Is there an a < z for which this property holds? (HINT: Problem 9.21.) 9.23 (Reflection Principle, Hard). Let {Xn }1 1 be i.i.d. taking the values ±1 with probability each. Define Sn = E;`_1 X, (n > 1). Prove that max,<,
9.24 (Hard). Prove that the zero-set Z = {t > 0 : W(t) = 0) of Brownian motion W is a.s. uncountable, closed, and has no isolated points. Prove also that Z has zero Lebesgue measure.
9.25 (Problem 9.13, continued; Hard). Define S(t) = sup,E[o,q W(s) and X(t) = S(t) - W(t). Then, compute the density function of X(t) for fixed t > 0. (HINT: Start by computing P{S(t) >
s,W(t) < w).) 9.26 (Harder). Suppose {W,}:°o is a collection of independent Brownian motions. Define the "two-parameter processes" Z1, Z2, ... , where (9.63)
Zn(s,t)=sWo(t)+
y-c
sln(jira)
n j=1
Wj(t)
v0<8,t<1.
Prove that almost surely, Z2" (a, t) converges to a limiting two-parameter process {Z(a, t) }s,tE[o,t[ . uniformly for all a, t E [0, 1). Prove that Z is an a.s.-continuous centered "Gaussian process" with
covariance function E[Z(a, t)Z(u, v)] = (s n u)(t n v). Use this to prove that for any fixed a > 0, t - s-I/2Z(s,t) is a Brownian motion on (0,1]. The process {Z(s,t)}a,tE1o,11 is the so-called Brownian sheet on [0,1]2.
9.27 (Problem 9.16, continued; Harder). Prove that the embedding scheme of Problem 9.16 satisfies ET, = 1 and E[TT ] < oo. Conclude that
IT,, - nI = O ((n in in n)1/2) < oo as n - oo as. Use this to prove that for all p > 141 (9.64)
nlimo
IW(T
W(n)I = 0
as.
no
Conclude that, on a suitable probability space, we can construct a simple walk {S,}1-1 and a Brownian motion {W(t)}t>o such that
max ISk - W(k)] = o(n") as n - oo a.s.
1
(HINT: W2(t) - t describes a mean-zero martingale.)
9. Brownian Motion
180
Notes (1) Brownian motion is known also as the "Wiener process," or even sometimes as the "Bachelier-Wiener process." (2) Although Bachelier's ideas are now regarded as revolutionary, they went largely unnoticed for nearly 60 years. Courtault, Kabanov, Bru, Crepel, Lebon, and Le Marchand (2000) contains an enthusiastic discussion of this issue. The said reference contains a number of other interesting facts about Bachelier. (3) The book of Nelson (1967) contains a detailed account of the development of the
physical theory of Brownian motion.
(4) The term "Avogadro's constant" is due to J. B. Perrin. (5) The a.s.-continuity assumption of Theorem 9.9 is redundant; it is a consequence of the other assumptions there. (6) The strong Markov property was introduced and utilized by Kinney (1953), Hunt (1956), Dynkin and Jushkevich (1956), and Blumenthal (1957). The phrase "strong Markov property" was coined by Dynkin and Jushkevich (1956). (7) Parts of our modification to the Brownian filtration are unnecessary because Brownian motion is as. continuous. However, this requires a more advanced development of the Brownian motion.
(8) The reflection principle (p. 175) is due to Bachelier (1964, pp. 61-65). The central idea uses the method of Andr6 (1887) developed for the simple walk (Problem 9.23, p. 179).
Chapter 10
Terminus: Stochastic Integration
Reason's last step is the recognition that there are an infinite number of things which are beyond it.
-Blaise Pascal
1. The Indefinite Ito Integral Given a "nice" stochastic process H = {H(s)}B>o, Ito (1944) constructed
a natural integral f H dW = fo H(s) W(ds) despite the fact that W is nowhere differentiable a.s. (Theorem 9.13). In order to identify what "nice"
means, it is best to go back and redefine what we mean by a stochastic process in continuous time.
Definition 10.1. A measurable stochastic process (also, process or stochastic process) X = {X(t)}t>o is a product-measurable function X : [0, oo) x S1 --i R.
We often write X (t) in place of X (t , w); this is similar to what we did in discrete time.
This is a natural place to verify that Brownian motion is a stochastic process.
If H is nicely behaved, then it stands to reason that we should define
f HdW as limn,, Tn(H), where (10.1)
Zn(H)
= off (2n /
(k+1) LW
-W
(i)] 181
10. Terminus: Stochastic Integration
182
and the nature of the limit must be made precise. Clearly, Zn(H) is a well-defined random variable if, for instance, H has compact support. The following performs some of the requisite book-keeping about {Zn(H)} 1.
Lemma 10.2. If there exists T > 0 such that H(s) = 0 for all s > T a.s., then Tn(H) is a.s. a finite sum and Zn+1(H) - Zn(H) 00
- j=o LHC2+11)-H\2n/J
[W
2n
1/-W\2+11/J
Proof. The sum is obviously finite. We derive the stated identity for Zn+1(H) - Tn(H). Throughout we write Hk,n in place of H(k2-n), and
Oj n W (.72-") - W(k2-m). Consider the identity 27,+,(H) = >o Hk,n+l0k+l,n+1, and split the sum according to whether k = 2j or k = 2j + 1: 00
00
Zn+1(H) = E Hj,n02j+1,n+1 + E H2j+1,n+10 j+ 1 nn+l j=o
j=0 00
_
(10.2)
j,n
2j+1,n+1
Hj,n A2j+1,n+1 + j+1,n j=o 00
2j+1,n+1
(H2j+l,n+1 - Hj,n) Oj+1,n j=0
Because 02 +1,n+1 + Aj+l nn+l = Oj+1,nl the first term is equal to Zn(H), whence follows the lemma.
Definition 10.3. A process H = {H(s)}t>o is adapted to the Brownian filtration 9 if H(s) is .T,-measurable for each s > 0. We say that H is a compact-support process if there exists a non-random T > 0 such that with probability one, H(s) = 0 for all s > T. We also need the following technical definition.
Definition 10.4. Choose and fix p > 1. We say that H is Dini-continuous in LP(P) if H(s) E 11(P) for all s > 0 and (10.3)
J o
1
p(r) dr < oo,
r
where
Op(r) :=
sup s,t: Is-tI
IIH(s) - H(t)IIp
The function Op is called the modulus of continuity of H in LP(P). If H is compact-support, continuous, and a.s.-bounded by a non-random
quantity, then it is a.s. uniformly continuous in 11(P) for any p > 1; i.e.,
1. The Indefinite Ito Integral
183
limt-.oVip(t) = 0. Dini-continuity in LP(P) ensures that ?Pp converges to zero at some minimum rate. Here are a few examples:
Example 10.5.
(a) Suppose H is a.s. differentiable with a derivative that satisfies K := sups IIH'(t)IIp < oo. [Since (s,w) H H(s,w) is product measurable, f I H'(r) I P dr is a random variable, and hence IIH'(t)IIp are well defined.) By the fundamental theorem of calculus,
ift>s>0then
t
IIH(s) - H(s)IIp S f IIH'(r)llpdr < Kit - sl. s
Therefore, Op(r) < Kr, and H is Dini-continuous in L1(P).
(b) Suppose H(s) = f (W (s)), where f is a non-random Lipschitzcontinuous function; i.e., there exists L such that
If(x)-f(y)I SLly - xI It follows then that ilip(r) = O(r1/2) as r -i 0, and this yields the Dini-continuity of H in L1(P) for any p > 1. (c) Consider H(s) = f (W (s) , s), where f (x, t) is a non-random function, twice continuously differentiable in each variable with bounded
derivatives. Suppose, in addition, that there exists a non-random T > 0 such that f (x, s) = 0 for all s > T. Because I H(s) - H(t)I is bounded above by
If (1't'(s), a) - f (u'(t), a)1 + If MO, a) - f MO, t)I , by the fundamental theorem of calculus we can find a constant M such that for all s, t > 0,
IH(s)-H(t)I <M(IW(s)-W(t)I +It-sl) By Minkowski's inequality for all p > 1,
IIH(s) - H(s)IIp S M(IIW(s) - W(t)IIp+ It - sl)
=M(c it-611/2+It-sl)
,
where cp := IIN(0,1)IIp. Therefore, i/ip(r) = O(r1/2) as r - 0, whence follows the Dini-continuity of H in any L1(P) (p > 1).
Remark 10.6 (Cauchy Summability Test). Dini-continuity in L1(P) is equivalent to the summability of ikp(2-nf). Indeed, we can write (10.4)
n=0
10. Terminus: Stochastic Integration
184
Because iPp is nondecreasing, 00
00
lEOp(21 t)dt<EOp(2-").
(10.5)
1
0
n=0
n=0
We can now define f H dW for adapted compact-support processes that are Dini-continuous in L2(P). We will improve the construction later on.
Theorem 10.7. Suppose H is an adapted compact-support stochastic process that is Dini-continuous in L2(P). Then limn-,,2n(H) exists in L2(P).
If we write[fHdw] fH dW for this limit, the[(JHdw)2] n
(10.6) E
=0
= E [f°°H2(s)ds].
and E
If a, b E R, and V is another adapted compact-support stochastic process that is Dini-continuous in L2(P), then with probability one,
= aJ HdW +bJ V dW.
(10.7)
Definition 10.8. The second identity in (10.6) is called the Ito isometry (Ito, 1944).
Proof (Sketch). For t > s, W(t) - W(s) is independent of Jr. (Theorem 9.21, p. 174), and H(u) is %'. measurable for u < s. Therefore, Lemma 10.2 implies the following. We use the notation introduced in the proof of Lemma 10.2 to simplify the type setting: IITn+1(H) -Tn(H)II2 IIAI+ 11H {22n+11)
(10.8)
0<j<2^T-1
-H
112
nntl
2nj /II22 2
E
= 2-`-10<j<2^T-1 IIH \22 +11)
- H \2n II2
< V'2(2'), by Dini continuity. Consequently, N+M-1
IITN+M(H) -IN(H)112 5 > IITn+1(H) -2n(H)II2 (10.9)
n=N+1
N+M-1
VT
`
bit (2-n-1)
'N, M > 1.
n=NN+1
It follows from Remark 10.6 that {2n(H)}n°_1 is a Cauchy sequence in L2(P). This proves the asserted L2-convergence.
1. The Indefinite Ito Integral
185
The basic properties of Brownian motion ensure that EZ"(H) = 0. By L2-convergence we also have E[f H dW] = 0. Similarly, we can prove (10.6):
E [(I Hdw)]
=HymnIlZn(H)II2 00
= ine' E [H2 (k2-")] 2-"
(10.10)
n
k=0 1
= E [1000 H2(s) ds]
.
The exchanges of limits and integrals are all justified by the compact-support
assumption on H together with the continuity of the function t' IIH(t)112. Finally, (10.7) follows from the linearity of H '-41,,(H) and the existence D of L2(P)-limits. We now drop many of the technical assumptions in Theorem 10.7.
Theorem 10.9 (1t6,1944). Suppose His adapted and E[ fo H2 (s) ds] < 00. Then one can define a stochastic integral f H dW that has mean zero and variance E[ f' H2(s) ds]. Moreover, if V is another such integrand process, then for all a, b E R
f(aH + bV)dW = E
[J
H dW
af HdW+bJ VdW a.s.,
J V dWl =E [ rCo H(s)V(s) ds1
.
The proof is function theoretic, and takes up the remainder of this section. If you wish to avoid such technical details, then you can do so without missing any of the central ideas: Simply skip to the next section. Throughout, let m denote the Lebesgue measure on (R, .(R) ), and let L2(m x P) denote the corresponding product L2-space. We may note that JJ
(10.12)
E [J"O H2(s) ds] = 11H11L2(mxe),
and E[f H(s)V(s)d.9) is the L2(m X P) inner product between H and V. The following is the key step in our construction of stochastic integrals.
Proposition 10.10. Given any stochastic process H E L2(m X P) we can find processes {H"}°O_1, all compact-support and Dini-continuous in L2(P),
such that limn-00 H = H in L2(m X P). Theorem 10.9 follows immediately from this.
10. Terminus: Stochastic Integration
186
Proof of Theorem 10.9. Proposition 10.10 asserts that there exist adapted compact-support processes {Hn}n_-1, that are Dini-continuous in L2(P) and converge to H in L2 (m x P). The Ito isometry (10.6) ensures that if Hn dW }n 1
is a Cauchy sequence in L2(P), since {Hn}n is a Cauchy sequence in L2 (M X P). Consequently, f H dW = limn f Hn dW exists in L2(P). The 1
properties of f H dW follow readily from those of f Hn dW and the L2(P)convergence that we have proved earlier. Let us conclude this section by proving the one remaining proposition.
Proof of Proposition 10.10. We proceed in three steps, each reducing the problem to a more restrictive class of processes H. Step 1. Reduction to the Compact-Support Case. For each n > 1, let
HH(t) := H(t)110,n)(t) and note that Hn is an adapted compact-support stochastic process. Moreover,
E [ j°°H2(s)ds] = 0. l Therefore, we can, and will, assume without loss of generality that H is also compact-support. Step 2. Reduction to the L2-Bounded Case. Let us first extend the definition of H to all of R by assigning H(t) := 0 if t < 0. Next, define (10.13)
Wimp IIH - HnII ,2(mXP) =
(10.14)
rt Hn(t) = n (
H(s) ds
Vt > 0, n > 1.
Check that H is an adapted process. Moreover, Hn(t) = 0 for all t > T + 1, so that Hn is also compact-support. Next, we claim that Hn is bounded in L2(P). The Cauchy-Bunyakovsky-Schwarz inequality and the FubiniTonelli theorem together imply the following: (10.15)
sup IIHn(t)II2 < n f IIH(s)II2 ds = nhIHIIi2(mxp) t>o
0
It remains to prove that Hn converges in L2(m X P) to H. Since fo H2(s) ds < oo a.s., the Lebesgue differentiation theorem (p. H(t) for almost every t > 0. Therefore, 140) implies that a.s., Hn(t) limn Hn = H (m x P)-almost surely by Fubini-Tonelli. According to the dominated convergence theorem, Step 2 follows if we prove that (10.16)
sup IHnI E L2(m X P). n>1
Note that supn IHnI < 4'H, where the latter is the "maximal function," !t Vt > 0. (10.17) (.4'H)(t) = sup n / IH(s)I ds n>1
t-(1/n)
2. Continuous Martingales in L2(P)
187
For each w, (4'H)(t + n-1) is the Hardy-Littlewood maximal function of H. Also, . WH is a n adapted process, whence 0
H
I (-*fH)(s)I2 ds <
(10.18) 1000
Confer with Corollary 8.41 on page 143. But H(t) = 0 for all t > T and suet IIH(t)112 < oo. Therefore, f °O H2(s) ds < oo for almost all w. Moreover, we take first expectations (Fubini-Tonelli), and then square roots, to deduce
that (10.19)
II'HIIL2(mxP) <_ 161IHIILz(mxP) < 00-
This reduces our problem to the one about the H's that are bounded in L2(P) and compact-support. Step 3. The Conclusion. Finally, if H is bounded in L2(P) and compactsupport, then we define Hn by (10.14) and note that Hn is differentiable,
and Hn(t) = n{H(t) - H(t - n-1)}. Therefore, (10.20)
sup IIHn(t)II2 <- 2nsuP IIH(t)II2 < 00I
t
Part (b) of Example 10.5 implies the asserted Dini-continuity of H,,. On the other hand, the argument developed in Step 2 proves that Hn -+ H in L2(m X P), whence follows the theorem.
2. Continuous Martingales in L2(P) The theories of continuous-time martingale and stochastic integration are intimately connected. Thus, before proceeding further, we take a side-step, and have a quick look at martingale theory in continuous time. To avoid unnecessary abstraction, .9T will denote the Brownian filtration throughout.
Definition 10.11. A process M = {M(t)}t>o is a (continuous-time) martingale if:
(1) M(t) E L'(P) for all t > 0. (2) If t > s > 0 then E[M(t) I9(s)] = M(s) a.s. The process M is a continuous L2-martingale if t H M(t) is almost-surely continuous, and M(t) E L2(P) for all t > 0. Much of the theory of discrete-time martingales transfers to continuous L2(P)-martingales. Here is a first sampler.
The Optional Stopping Theorem. If M is a continuous L2-martingale and S < T are bounded s-stopping times, then (10.21)
E[M(T) I.J"rs] = M(S)
a.s.
10. Terminus: Stochastic Integration
188
Proof. Throughout, choose and fix some non-random K > 0 such that T < K almost surely. If S and T are simple stopping times, then the optional stopping theorem is a consequence of its discrete-time namesake (p. 130). In general, let S 1 S and T I, T be the simple stopping times of Lemma 9.20 (p. 172). Note that
the condition S < T imposes 5,,, < T, + 2-' for all n, m > 1. Because T + 2-n` is a simple stopping time, it follows that (10.22)
= M(S,n)
E [M(Tn + 2-m) I
a.s.
Moreover, this very argument implies that {M(TL + 2-m)}°O_, is a discretetime martingale. Since Tn < T + 2-1 < K + 2-n, Problem 8.22 on page 153 implies that 1
(10.23)
E[sup jM(Tn+2-m)I2] <4E[IM(K+1)12]
-
n>1
By the almost-sure continuity of M, limn_,. M(Tn + 2-1) = M(T + 2-m) a.s. Convergence holds also in L2(P), thanks to the dominated convergence theorem. Therefore, E[M(T + 2-m) I sm]
IIE[M(Tn + 2-m) (1024)
2
= IIE [M(T,+2-m) - M(T +2-r) I222 < E (E [(M(Tn + 2-m) - M(T + 2-m))2 -40 asn -goo.
5,,,] I
1
We have appealed to the conditional Jensen inequality (p. 120) for the first inequality, and the towering property of conditional expectations (Theorem 8.5, page 123) for the second identity. By (10.22), M(S,n) = E[M(T + 2-m)1 gs,n] a.s. Because rs c `rs,,,, this and the towering property of conditional expectations together imply that E[M(S,n) I 9s] = E[M(T + 2-m) 1.5's] a.s. Let m oc and appeal to the argument driving (10.24) to deduce that E[M(T) 19s] = xlimo E [M(T + 2-m I .a sm] (10.25)
=
mo E[M(Sm) 19S]
= E[M(S) I -01s]
= M(S), where the limits all hold in L2(P). This implies that E[M(T) 1 ,Fs] = M(S) almost surely, as desired. 0 The following is a related result whose proof is relegated to the exercises.
3. The Definite Integral
189
Doob's Maximal Inequalities. Let Al denote a continuous L2-martingale. Then for all A, t (10.26)
0.
> AP
I sup JAI(s)I > $ o<s
r
E I Ihl(t)I; sup IM(s)I > AJ 0<,
L
In particular, for all p > 1, E
(10.27)
o Ai(s)IPJ [sun
<
(_-!_)E(aAI(t)It). p
1
3. The Definite Ito Integral It is a natural time to mention the definite Ito integral. The latter is defined simply as fo H dW = f H110,t) dW for all adapted processes H such that E[ fo H2(s) ds] < oo for all t > 0. This defines a collection of random variables fo H dW, one for each t > U. The following is of paramount importance to us, since it says something about the properties of the random function t F-. fo H dW .
Theorem 10.12. If H is an adapted process such that E[ f0 H2 (s) ds] < oc then we can construct the process { fo H dW }t>o such that it is a continuous L 2 -martingale.
Proof. According to Theorem 10.9, fo H dW exists. Thus, we can proceed by verifying the assertions of the theorem. We do so in three steps. Step 1. Reduction to H that is Dini-continuous in L2(P). Suppose we have proved the theorem for all processes H that are adapted and Dini-
continuous in L2(P). In this first step we prove that this implies the remaining assertions of the theorem. Let H be an adapted process such that E(fo H2(s) ds] < oc for all t > 0. We can find adapted processes H that are Dini-continuous in L2(P) and
lim E [f'
(10.28)
1
H(s))2 ds] = 0.
Indeed, we can apply Proposition 10.10 to H110.1.], and use the recipe of the said proposition for H,,. Then apply the proposition to t as well. By (10.28) and the Ito isometry (10.6),
H
lim E
[(f1Hdw r(10.29)
- J Hn, dW
=
0
Because f o Hn dW - fo Hn+, dW = f f (H - Hn+x) dW defines a continuous L2-inartingale, by Doob's maximal inequality (p. 189), for all non-random
10. Terminus: Stochastic Integration
190
but fixed T>0, (10.30)
r lim E I sup
n-oo
L0
r
(10
t
dW\) 2=
HdW - J t Hn+k
/
0
0.
In particular, for each T > 0 there exists a process X = {X(t)}t>o and a subsequence n' - oo such that ( 10.31)
lim
sup
.o0 0
J Hn' dW - X(t)= 0
a.s.
Moreover, the same uniform convergence holds in L2(P), and along the original subsequence n - oo. Consequently, (10.29) implies that X is a particular construction of t " fo H dW that is a.s.-continuous and adapted. In other words, X is adapted and a.s.-continuous, and also satisfies (10.32)
P{x(t) =JtHdW}=1 o
et>0.
JJJ
Finally, X(t) E L2(P) for all t > 0, so it remains to prove that X is a martingale. But remember that we are assuming that {fo Hn dW }t>o is a martingale. By the conditional Jensen inequality (p. 120), and by L2(P)-convergence,
IIE
IX(t+s) _
11X(t + s) -
-+0
ft+8
ft+8
HndWB]112
HndWII2
asn -goo.
Consequently, limn. fo Hn dW = E[X(t + s) I tee] in L2(P). But we have seen already that fo Hn dW -+ X (t) in L2(P). Therefore, (10.34)
E[X(t + s) I e] = X (s)
a.s.
That is, X is a martingale, as was claimed. Step 2. A Continuous Martingale in the Dini-Continuous Case. Now we suppose that H is in addition Dini-continuous in L2(P), and prove the theorem in this special case. Together with Step 1 this completes the proof.
3. The Definite Integral
191
The argument is based on a trick. Define
.7ntH)(t) _ 0
(10.35)
H C2/
/
+Hl
L2n2n
-W ()] 1] ) [w(t)_w(L2hi)]. LW
('2n+1)J
This is a minor variant of T(H11o 1). Indeed, you should check
In (H1io,q) (10.36)
=H(L2: Ii
[W(t)-W([2nt -1J+n
2n
JL JJ whose L2(P)-norm goes to zero as n - oc. But n(H) is also a stochastic process that is: (a) Adapted; and (b) continuous-in fact, piecewise linearin t. It is also a martingale. Here is why: Suppose t > s > 0. Then there
exist integers 0 < k < K < 2ns - 1 such that s E D(k; n) := [k2-n, (k + 1)2-n) and t E D(K;n). Then,
n(H)(t) -.7n(H)(s) (10.37)
k
+H(K) [w(t) - W
(2n
(a-H( 2) [IV(s) -
W (a].
where Ek<j
E [Jrn(H)(t) - 7n(H)(s) I .f'A] = 0.
This proves the martingale property. Step 3. The Conclusion. To finish the proof, suppose H is an adapted process that is Dini-continuous in L2(P). A calculation similar to that of Lemma 10.2 reveals that for any non-random T > 0, (10.39)
lim sup E [(.7n+1(H)(t) n-oo 0
-.n(H)(t))2]
= 0.
Therefore, by Doob's maximal inequality (p. 189), (10.40)
lim E I sup (.I.+, (H)(t) - .7n(H)(t))2I = 0. n-'O°
o
J
This implies that a subsequence of .7n(H) converges a.s. and uniformly for all t E [0, T] to some process X. Since ,7n (H) is a continuous process, X is necessarily continuous a.s. Furthermore, the argument applied in Step 1
10. Terminus: Stochastic Integration
192
shows that here too X is a martingale. We have seen already that for any fixed t > 0,
.n(H)(t) - Zn(Hlio,,1) -; 0
(10.41)
in L2(P).
Since Zn(Hlio,11) -+ J. H dW in L2(P), this proves that
P{X(t)=fo HdW}=1
(10.42)
o
111
dt>0,
JJJ
and whence the result.
4. Quadratic Variation We now elaborate a little on quadratic variation (Theorem 9.9, p. 163). Quadratic variation is a central theme in continuous-time martingale theory, but its study requires more time than we have. Therefore, we will develop only the portions for which we have immediate use. Throughout, we define the second-order analogue of Tn (10. 1), (10.43)
H
Q,,(H)
I
[W (1) - W ()]2.
k=O
Theorem 10.13. Suppose H is adapted, compactly supported, and uniformly continuous in L2(P); i.e., lim,..o2(r) = U. Then,
lim Qn(H) =
(10.44)
JH(s)ds
n-oo
in L2(P).
Proof. To simplify the notation, we write for all integers k > 0 and n > 1, (10.45)
H(k2-") and dk,n := W((k + 1)2-n) - W(k2-n).
Hk,n
Recall next that we can find a non-random T > 0 such that for all s > T, H(s) = 0 a.s. Throughout, we keep such a T fixed. Step 1. Approximating the Lebesgue Integral. We begin by proving that 00
E Hk,n2-" -b
(10.46)
k=0
J0 "O H(s) ds
in L2(P) as n - oo.
Note that 00
o0
E Hk,n2-" - f H(s) ds (10.47)
k=0
0
(ki'l)`2-n
< n 0
IHk,n- H(s)j ds.
5. Ito's Formula
193
Therefore, by Minkowski's inequality,
(10.48)
IIE Hk,n2-n - J
H(s) dsll <
k=0
2-n '2(2-n) 0
2
T'02 (2-n)
,
which converges to zero as n -+ oo; Step 1 follows. Step 2. Conclusion. Choose and fix t > 0, and for all n > 1 define 00
Dn
(10.49)
Qn(t) - E Hk,n2-n k=0
Note that Hk,n is independent of dk,n, and the latter has mean zero and variance 2-n. Because Dn = E00 o Hk,n[dk,n -2 -n ], we first square and then take expectations to obtain
=
IIDnII22
E [Hk2,n] E [(d2k, n
-2 -n)2]
0
(10.50)
+2
E [Hk nHj n (dk,n -2 -n) (di,n -2 -n)]
.
O<j
If j < k, then dk,n is independent of Hk,nHj,n(dj2,n - 2-n), and has mean zero. Therefore, (10.51)
E [HR,n] E [(d2k, n
IIDnII22 =
-2 -_)2]
.
O
Next we observe that [dk n - 2-n] has the same distribution as 2-n(Z2 - 1), where Z is standard normal. Because E[(Z2 -1)2] = E[Z4] -1 = 2, it follows
that (10.52)
IIDnII2 = 4n
IIHk,nhI2 O
Dini-continuity ensures that t " IIH(t)112 is continuous, and hence bounded on [0, T] by some constant KT. Thus, (10.53)
2 IIDnII2 <
2(2nT - 1)KT 4n
-0
as n
oo.
This and Step 1 together imply the result.
5. Ito's Formula and Two Applications Thus far, we have constructed the Ito integral, and studied some of its properties. In order to study the Ito integral further, we need to develop an operational calculus that mimics the calculus of the Lebesgue and/or
10. Terminus: Stochastic Integration
194
Riemann integral. To understand this better, we recall the chain rule of elementary calculus. Namely, (10.54)
(f o 9)'(x) = f'(9(x)) g ,(X),
valid for all continuously differentiable functions f and g. In its integrated form-this is integration by parts-the chain rule states that for all t > s > 0,
(10.55)
f(9(t)) - f(9(s)) = f tf'(9(u))9 (u) du. e
For example, let f (x) = x2 to find that (10.56)
92(t)-92(0)= fgdg t
where dg(s) = g'(s) ds. What if g were replaced by Brownian motion? As a consequence of our next result we have t
(10.57)
W2(t) - W2(0) = f W dW + 2
a.s.
0
Compared with (10.56), this has an extra factor (t/2).
Ito's Formula 1. If f : R R has two continuous derivatives, then for all t > s > 0, the following holds a.s.: (10.58)
f(W(t)) - f(W(s)) =
f
t
f'(W(r)) W(dr) + 1 f t f"(W(r))dr. 2
a
s
Ito's formula is different from the chain rule for ordinary integrals because the nowhere differentiability of W forces us to replace the right-hand side of (10.55) with a stochastic integral plus a second-derivative term. Remark 10.14. Ito's formula continues to hold even if we assume only that f" exists almost everywhere and f t (f'(W (r))2 dr < oo a.s. Of course then we have to make sense of the stochastic integral, etc.
Proof in the Case that f"' is Bounded and Continuous. We assume, without loss of too much generality, that s = 0.
The proof of Ito's formula starts out in the same manner as that of (10.55). Namely, by telescoping the sum we first write
f(W(2-"L2"t-1])) (10.59)
-f(0)
E
0
[f(w('2+n'))
AW(2'nM
5. Ito's Formula
195
To this we apply Taylor's expansion with remainder, and write (2-n [2"t
f (W (10.60)
- 1J)) - f(0)
E f (W (2 ))
dk.n
0
+2 E f""(W()) 4, + 0
where dk,n := W((k+1)2-") -W(k2-n), and IRk,nI
Rkndkn, 0
M<
supx
oo, uniformly for all k, n.
According to the proof of Theorem 10.7, the first term of the righthand side of (10.60) converges in L2(P) to fo f'(W(s)) W(ds); see also Example 10.5. The second term, on the other hand, converges in L2(P) to 2 fo f"(W(s)) ds; consult with Theorem 10.13. In addition, continuity and the dominated convergence theorem together imply the following: (10.61)
nimo f (W(2-n[2nt - 1])) - f(W(t))
a.s. and in L2(P).
It, therefore, suffices to prove that ?n --+ 0 in Ll (P) as n -* oc, where (10.62)
Rk,ndk
n
0
But as n -+ oo, (10.63)
Ildk,nll3 -
EI.-nI < M
MtIIN(0,1)II32-ni2.
0
The proof follows.
O
Next is an interesting refinement; it is proved by similar arguments involving Taylor series expansions that were used to derive Ito's formula 1.
Ito's Formula 2. Let W denote Brownian motion with W (O) = x0. If f (x , t) is twice continuously differentiable in x and continuously differentiable in t, and if E[ f0 I a= f (W (s) , s) I2 ds] < oo for all t > 0, then a.s., f (W (t) , t) = f (xo , 0) + (10.64)
ft
ax f (W (s) , s) W(ds)
f [cf
(W(s) , s) +
+ 0 This remains valid if f takes on complex values.
f (W(s) , s)} ds.
Of course, f H dW := f Re(H) dW + i f Im(H) dW whenever possible. I will not prove this refinement. Instead, let us close this book with two fascinating consequences of Ito's formula 2.
10. Terminus: Stochastic Integration
196
5.1. Levy's Theorem: A First Look at Exit Distributions. Choose and fix some a > 0, and let W denote a Brownian motion started somewhere
in (-a, a). We wish to know where W leaves the interval (-a, a). The following remarkable answer is due to Levy (1951):
Theorem 10.15. Choose and fix some a > 0, and define (10.65)
Ta := in£ is > 0: W (s) = a or - a},
where inf 0 := oo. If W (O) := xo E (-a, a), then for all real numbers A # 0, (10.66)
Eexp (i.AT0) =
cos (xo
2i))
cos (a 2iA)
Proof. We apply Ito's formula 2 with f (x, t) := V)(x)e
(10.67)
where A 0 0 is fixed, and the function
satisfies the following boundary-
value problem: Min(x) = 2iA i(x)
dx E (-a, a),
(10.68)
O(a) = V1 V (-a) = 1.
By actually taking derivatives, etc., we find that the solution is cos (x 2i.L)
(10.69)
cos (a 2iA
It is possible to check that E[fo l8 f (W (s) , s)/8x2 ds) is finite for all t > 0. implies that f solves the partial Moreover, the eigenvalue problem for differential equation, 2
(10.70)
2-5x2(x,t)+(x,t)=0.
As a result, Ito's formula 2 tells us that f (W (t) , t) - f (xo, 0) is a mean-zero (complex) martingale. By the optional stopping theorem (p. 187), (10.71)
E[f (W (Ta At),Ta A t)] = f(xo,0) ='O(xo)
Thanks to the dominated convergence theorem and the a.s.-continuity of W, we can let t -+ oc to deduce that (10.72)
E [f (W (Ta) ,Ta)] = TI'(xo)
Because W(Ta) = ±a and 0(±a) = 1, (10.73)
This proves the theorem.
f (W (Ta) ,Ta) = eaalo.
0
5. Ito's Formula
197
5.2. Chung's Formula: A Second Look at Exit Distributions. Let us have a second look at Theorem 10.15 in the simplest setting where x0 := 0 and a := 1. Define T := Tl to find that the formula (10.66) simplifies to the following elegant form: (10.74)
Eexp(i)T) =
1
cos
2ia
VA E R\ {0}.
In principle, the uniqueness theorem for characteristic functions tells us that the preceding formula determines the distribution of T. However, it is not always so easy to extract the right piece of information from (10.74). For instance, if it were not for (10.74), then we could not prove too easily that 1/ cos 2i5 is a characteristic function of a probability measure. Or for that matter, can you see from (10.74) that T has finite moments of all orders? (It does!) The following theorem of Chung (1947) contains a different representation of the distribution of T that answers the previous question about the existence of the moments of T.
Theorem 10.16. For all t > 0, (10.75)
P {T > t} = 74r
(2n +81)27r2t1
2-+ l eXp n=0
J
Consequently, P{T > t} - (4/7r) exp(-7r 2t/8) as t -b oc.
The preceding implies that P{T > t} < 2exp(-7r2t/8) for large values of t. In lieu of Lemma 6.8 (p.' 67), (10.76)
Vp>0.
E[7'I']=pJ00 tP-'P{T>t}dt,
Therefore, T has moments of all orders, as was asserted earlier. In fact, you
might wish to carry out the computation a little further and produce the following neat formula for the pth moment of T: (10.77)
E [7'P] =
23p+2r(p + 1),3(2p + 1)
y
p > 0'
al+2p
where r(p) = fo sP-le-sds denotes Euler's gamma function and 3 is the Dirichlet beta function, viz., (10.78) n=O
(2n +)1)t
Vt >
1.
Theorem 10.16 implies also the following formula (Chung, 1947).
10. Terminus: Stochastic Integration
198
Corollary 10.17. For all x > 0, (10.79)
P ( sup IW(s)I S x } = to<s<1
4 71
111
(-1)"
00
exp (- (2n + 1)2lr2)
.2n+ 1
J
8x2
n=
In particular, P{supo<s<1 IW(s)I < x} - (4/7r)exp(-7r2/(8x2)) as x - 0.
Proof of Theorem 10.16 (Sketch). The proof follows three steps. Step 1. A Cosine Formula. Choose and fix an integer n > 1, and define
282t)
(10.80)
dxER,t>0.
f(x,t)=cos(n-2x7r)exp(n
The function f solves the partial differential equation, 102f (10.81)
+
=0 subject to f(±l,t)=0 't>0.
This is a kind of heat equation on [-1, 1] with Dirichlet boundary conditions.
Barring technical conditions, Ito's formula 2 tells us that f (W (t) , t) - 1 defines a mean-zero martingale. [This uses the fact that f (0, 0) = 1.] By the optional stopping theorem (p. 187), E[f (W (T At) , T At)] = 1. Equivalently, (10.82)
E[f(W(T),T); Tt]=1.
282t)S
Because W(T) = ±1 a.s., the first term in (10.82) vanishes a.s. Whence we obtain the following cosine formula: /
(10.83)
\
1
E[cosl mr2(t) I; T>tJ
=exp(-n
2. A Fourier Series. Let L2(-2,2) denote the collection of all measurable functions g : [-2,2] - R such that f 22 g2 (x) dx < oo. Theorem A.5 (p. 205), after a little fidgeting with the variables, shows that 2, 2-1/2 sin(n7rx/2), 2-1/2 cos(m7rx/2) (n, m = 1,2,...) form an orthonormal basis for L2(-2,2). In particular, any 0 E L2(-2, 2) has the representation,
(10.84)
¢(x) = 20 +
[An cos n=1
(n7rx) + B. sin (n7rx 11 2
where:
- The infinite sums converge in L2(-2,2); - Ao = 2-1 f22 0(x) dx;
- An = 2-1/2 f 22 O(x) cos(n7rx/2) dx for n > 1; and
- Bn = 2-1/2 f22 22 0(x) sin(n7rx/2) dx for n > 1.
2
/J
Problems
199
Step 3. Putting it Together. We can apply the result of Step 2 to the function O(x) := 1(_1.1)(x) to obtain 1(-1,1)(x) -
(10.85)
1
_ 2 °O (-1)n
2
cos -E 2n + 1 n=0
((2n +1)7rx) 2
J
We "plug" in x := W (t , w), multiply by 1{T(w)>t}, and then apply expectations to find that
P{W(t)E(-1,1),T>t}-2PIT >t} (10.86)
=22n'
1E[cos((211+
)irW(t));T>tJ )
\
n=0
Since the left-hand side is equal to ZP{T > t}, the cosine formula of Step 1 completes our proof. The preceding proof is a sketch only because: (i) we casually treat the L2identity in (10.85) as a pointwise identity; and (ii) we exchange expectations with an infinite sum without actually justifying the exchange. With a little effort, these gaps can be filled.
Problems 10.1. In this exercise we construct a Dini-continuous process in LP(P) that is not as, continuous.
(1) Prove that if0<s
s
t
(2) Use this to prove that H(s) := 1(o is not as. continuous.
is Dini-continuous in L2(P), but H
10.2. In this exercise you are asked to construct a rather general abstract integral that is due to Young (1970). See also McShane (1969).
A function f : (0,11 - R is said to be Holder continuous of order a > 0 if there exists a finite constant K such that (10.87)
If(s) - f(t)l < Kit - sl°
es, t E (0, 1].
Let B° denote the collection of all such functions.
(1) Prove that B° contains only constants when a > 1, whereas `BI includes but is not limited to all continuously differentiable functions. (2) Prove that if 0 < a < 1, then if" is a complete normed linear space that is normed by
Illlly°
sup s,tE10,11
If(s) - f(t)I + sup If(t)I Is - t1°
tE iO,1]
.?it
(3) Given two functions f and g, define for all n > 1,
Ifbn9= klf (2) [9(k2 1)-9(2 )1'
10. Terminus: Stochastic Integration
200
Suppose f E 'a and g E `6'O for some 0 < a, 0 < 1. Prove fo f 6g := limn fo f bng exists whenever a + 0 > 1. Note that when we let g(x) = x, we recover the Riemann integral of f; i.e., that fo I6g = foI f(.) dx. (4) Prove that fo gbf is well defined, and
IIf69=f(1)9(l)-f(0)9(0)-I gbf 0 0 The integral f fbg is called a Young integral. (HINT: Lemma 10.2.) 10.3. In this problem you are asked to derive Doob's maximal inequality (p. 189) and its variants. We say that M is a submartingale if it is defined as a martingale, except
EIM(t) J F.] > M(s) a.s. whenever t > s > 0. M is a supermartingale if -M is a submartingale. A process M is said to be a continuous L2(10.88)
submartingale (respectively, continuous L2-supermartingale) if it is a submartingale (respectively
supermartingale), {M(t)}t>o is as, continuous, and M(t) E L2(P) for all t > 0. Prove: (1) If Y is in L2(P), then M(t) = EIY ),fit) is a martingale. This is a Doob martingale in continuous time. (2) If M is a martingale and t' is convex, then >'(M) is a submartingale provided that tl'(M(t)) E L'(P) for each t > 0. (3) If M is a submartingale, t' is a nondecreasing convex function, and tj'(M(t)) E L' (P) for all t > 0, then O(M) is a submartingale. (4) The first Doob inequality on page 189 holds if )MI is replaced by any a.s.-continuous submartingale. (HINT: Prove first that supo<.
10.4. For all integers n > 1 define pn := E{(W(1))"}, where W is Brownian motion. Use Ito's formula to prove that An+2 = (n + 1)µ". Compute An for all integers n > 1.
10.5 (Gambler's Ruin). If W denotes a Brownian motion, then for any a E R, define T. inf{s > 0 : W(s) = a} where inf 0 := oo. Recall that T. is an .ir-stopping time (Proposition 9.17, p. 171). If a, b > 0 then prove that b
P{T
(10.89)
and compute ET,. 10.6. Prove Corollary 10.17.
10.7. Recall the Dirichlet beta function 0 from (10.78). Prove that 0(1) = w/4 and 0(3) = w3/32. Few other evaluations of 0 are known explicitly. Even 0(2), the so-called "Catalan constant," does not have a simpler description.
10.8. Let W be a Brownian motion with W(O) = 0, and define T to be the first time W exits the interval (-1, 1). Prove that EITPI < oo whenever -oo < p < oo.
10.9. Let T := inf{s > 0 : IW(s)I = 1}, where inf 0 (10.90)
Eexp(-AT) =
0. Prove that for all A > 0, 1
cosh v'-2-,\'
(Levy, 1951).
10.10. Define 2t2)
(10.91)
p(t;x)_ (2wt)1/2eXP(
1(/
"xER,t>0.
Prove that if W is Brownian motion, then M(t) := p(t; W(t)) defines a martingale.
10.11 (Hard). Let W be a Brownian motion with W(0) = 0 and define T,,b to be the first time W exits the interval (-a, b), where a, b > 0 are fixed constants. Compute E exp(iAT,,b) for all
A>0.
Notes
201
10.12 (Hard). Let W denote a Brownian motion, and (3 > 0 a fixed positive number. Define Wa(t) = W(t) + j3t to be the so-called Brownian motion with drift 0. For any a,b > 0 define
r°,_t := inf {s > 0: Wa(s) = a or - b} .
(10.92) Prove that
1 - e-zaa P {Wa (.,-b) = -b} = e1db -e-2a°
(10.93)
From this deduce the distribution of -inft>oW8(t). (HINT: For all a E R find a non-random function h. : R+ -a R such that t -. exp(aW(t) - h°(t)) defines a mean-one martingale.)
Notes (1) Instead of presenting a general theory of stochastic integration, we have discussed a
special case that is: (i) broad enough to be applicable for our needs; and (ii) concrete enough so as to make the main ideas clear. Dellacherie and Meyer (1982) have written
a definitive account of the general theory of processes. Their treatment includes a detailed description of the general theory of stochastic integration with respect to (semi-) martingales. (2) Ito's theory of stochastic integrals uses the "left-hand rule," as can be seen clearly in (10.1). This "left-hand rule" is the hallmark of Ito's theory of stochastic integration. In general, it cannot be replaced by other rules-such as the midpoint- or the right-hand rule-without changing the resulting stochastic integral. (3) For Theorem 10.15, and much more, see Knight (1981, Chapter 4). (4) Theorem 10.16 can be extended to several dimensions Ciesielaki and Taylor (1962). One can use Corollary 10.17 in conjunction with the Poisson summation formula (Feller, 1966, p. 630) to deduce that
P
(10.94)
rm
(4n+1)s
exp(-u2/2) du. sup 1W(s)1:5X = V < J 0a<1 ' n___ (4n-1)x
Whereas (10.79) is useful for small values of x, the preceding is accurate when x is large. The fact that the right-hand sides of (10.79) and (10.94) agree is one of the
celebrated theta-function identities of analytic number theory. (5) Now that we have reached the end of the book, let me close by suggesting that this
is a natural place to start learning about W. K. Feller's theory of one-dimensional diffusions (1955a; 1955b; 1956). Modern and more pedagogic accounts include Bass (1998), Knight (1981, Chapter 4), and Revuz and Yor (1999, Chapters 3 and 7).
Appendix
The moving power of mathematical invention is not reasoning but imagination. -Augustus de Morgan
1. Hilbert Spaces Throughout, let H be a set, and recall that it is a (real) Hilbert space if it is linear and if there exists an inner product ( , ) on H x H such that f '-' (f , f) = 11f 112 norms H into a complete space. We recall that inner product means that (a f + ag, h) = (h, a f + 13g) = a(f , h) +,3(g, h) for all f,g,h E H and all a, 0 E R. Hilbert spaces come naturally equipped with a notion of angles: If (f, g) = 0 then f and g are orthogonal. Definition A.1. Given any S C H, we let S1 denote the collection of all elements of H that are orthogonal to all the elements of S. That is, (A.1)
S1={fEH: (f,g)=0"gES}.
It is easy to see that S' is itself a subspace of H, and that S n S1 = {0}. We now show that in fact S and S1 have a sort of complementary property.
Theorem A.2 (Orthogonal Decomposition). If S is a closed subspace of a Hilbert space H, then H = S + S1 := if + g : f E S , g E S'} In order to prove this, we need a lemma. Lemma A.3. If X is a closed and convex subset of a complete Hilbert space H, then there exists a unique f E X such that 11f II = infgEX 119 11 203
Appendix
204
Proof. By definition we can find f, E X such that Recall the "parallelogram law" : (A.2)
IIh + 9112 + IIh - 9112 = 2 (11h112 + 119112)
IIffII2 = infhEx IIh112.
vh, g E H.
We apply this with h := f, and g := fn, to find that for all n, m > 1, Ilfn 4- 4 f-111
= 2 (lfI2 +11fm112-211 fn
2fm112)
(A.3)
' (ilfn112+Ilfmll2-2hnf
11h112/J
The final inequality follows because (fn + fm)/2 E X, thanks to convexity. Let n,m - oo to deduce that {fn}°__1 is a Cauchy sequence in X. Because
X is closed, it follows that fn - f for some f E X, and hence Ilf 11 = infhEx llhll This verifies the existence of f. For the uniqueness portion suppose there were two norm-minimizing functions f, g E X. By the parallelogram law,
Ill-9112 = inf 11h112-
(A.4)
4
hEX
11:L+
2 g112 <0.
(Why?). Thus, f = g.
D
Proof of Theorem A.2. For all given f E H, the set f + S is closed and
convex, where f+ S := if + s
:
s E S}. In particular, f+ S has a
unique element .1(f) of minimal norm (Lemma A.3); also define .9(f) _
f - .1(f)
Because .1(f) E f + S, it follows that .9(f) E S for all f E H. Since .9 (f) + .91(f) = f, it suffices to demonstrate that .9-L (f) E S1 for all f E H. But by the definition of :91, 11.1(f)11 < If - 9Il for all g E S. Instead of g write G = ag + -60(f), where 1190 = 1 and a E R. Because G E S, we can deduce that for all g E S with 1190 = 1 and all a E R, (A.5)
11.91(1)112 < 11.1(f)
- a9I12
= 11.91(1)112
-2a(.91(1),9) +a2.
Let a = (91(f) , g) to deduce that (.-L(f), g) = 0 for all g E S. This is the desired result.
0
Theorem A.4. To every bounded linear functional 2 on a Hilbert space H there corresponds a unique 7r E H such that .`P(f) = (f , 7r) for all f E H.
Proof. If 2(f) = 0 for all f E H, then we define 7r = 0, and we are done. If not, then S = { f E H : 2(f) = 0} is a closed subspace of H that does not span all of H; i.e, there exists g E S1 with II9II = 1 and 2(g) > 0;
2. Fourier Series
205
this follows from the decomposition theorem for H (Theorem A.2). We will establish that 7r := g2(g) is the function that we seek, all the time remembering that 2(g) E R. For all f E H consider the function It = P(g) f - 2(f )g, and note that h E S since 2(h) = 0. Because 7r E Sl, this means that (7r, h) = 0. On the other hand, (7r, h) = .fi(g) (7r , f) - 2(g)2(f ). Since .2(g) > 0, we
have 2(f) = (71,1) for all f E H. It remains to prove uniqueness. but this too is easy for if there were two of these functions, say 7r1 and 7r2. then
for all f E H, (f, a, - 7r2) = 0. In particular. let f = 7r1 - r2 to see that D
ire = 712.
2. Fourier Series Throughout this section, we let T = [-7r, 7r] denote the torus of length 27r, and consider some elementary facts about the trigonometric Fourier series on T that are based on the following functions: (A.6)
t¢n(x) =
einx
ex E T, n = 0, ±1, ±2,...
.
27r
Let L2(T) denote the Hilbert space of all measurable functions f : T -. C such that (A.7)
IIf 11T,
JT
If(x)I2dx < oo.
As usual, L2(T) is equipped with the (semi-)norm IIf IIT and inner product (A.8)
(f, g) := IT
Our goal is to prove the following theorem.
Theorem A.S. The collection {0 }FEZ is a complete orthonormal system in L2(T). Consequently, every f E L2(T) can be written as
` 00
(A.9)
f=
(f, 0n)4n
n=-ac
where the convergence takes place in L2(T). Furthermore. 00
(A.10)
11f 11T
=
n=-oo
(f
I
The proof is not difficult, but requires some preliminary developments.
Definition A.6. A trigonometric polynomial is a finite linear combination of the fn's. An approximation to the identity is a sequence of integrable functions wro, wi , ...: T R+ such that:
Appendix
206
(i) fT ipn(x) dx = 1 for all n. Ef (ii) There exists co > 0 such that limn-,,,, f O (x) dx = 1 for all
e E 10,C01-
Note that (a) all the 1bn's are nonnegative; and (b) the preceding display shows that all of the area under 4pn is concentrated near the origin when n is large. In other words, as n -' oo, 7pn looks more and more like a point mass.
For n = 0,1,2.... and x E T consider (A.11)
cnx) = (1+cosx)" an
where an = f (1+cos(x))ndx. T
Lemma A.7. {}°_°o is an approximation to the identity. xe E (0, it/21. Then, Proof. Choose and fij(i +cosx)ndx < ir(1 +cose)n.
(A.12)
By symmetry, this estimates the integral away from the origin. To estimate the integral near the origin, we use a method of P.-S. Laplace and write (A.13)
j(i + cosx)n dx =
E
J
e engixl dx,
where g(x) := ln(1 + cos x). Apply Taylor's theorem with remainder to deduce that for any x E [0, e] there exists [; E [0, x] such that g(x) = In 2 - x2/(1 + cos (). But cos (> 0 because 0 < (< e < it/2. Thus, for all n > 1, (A.14)
j(i + cos x)dx >_ 2n J e2 dx > 0
V/n
J e-z2 dz. o
It follows from this and (A.12) that ff
Proposition A.8. If f E L2(T) and e > 0, then there is a trigonometric polynomial T such that JET- A IT < e. [Trigonometric polynomials are dense in L2(T).]
Proof. Since continuous functions (endowed with uniform topology) are dense in L2(T), it suffices to prove that trigonometric polynomials are dense in the space of all continuous function on T (why?). We first of all observe that the functions Kn are trigonometric polynomials. Indeed, by the binomial
theorem and the Euler formula cosx = 1(e" + e-z), Icnx) is a linear
2. Fourier Series
207
combination of {Oj(x)}? _n. Next we note that the convolution rcn * f of rcn and f is also a trigonometric polynomial, where (A.15)
(an * .f)(x) =
T IT
f(y)icn(x - y) dy.
Note that (rc,, * f)(x) - f (x) = fT{ f (y) - f (x)}rcn(y - x) dy. We choose and fix e E (0, 7r) and split the last integral according to whether or not Iy - xI < e. It follows that I(rc * f)(x) - f (x) I is at most (A.16)
sup If() - f(u)I +2 sup If(w)I
y,uET:
wET
.J
Iy-ul<e
By Lemma A.7 the last term vanishes as n (A.17)
oo. Thus, for all e > 0,
limsupsup 1(rc * f)(x) - f(x)I < sup If(y) - f(u)I xET
V,uET:
Iy-ul<e
Let c (A.18)
0 to see that the left-hand side is zero. Because II (Kn * ,f) - .f IIT < 27r sup I(rcn * f)(x) - f
(x)12,
zET
the proposition follows.
We are ready to prove Theorem A.5.
Proof of Theorem A.5. It is easy to see that {¢n}nEZ is an orthonormal sequence in L2(T); that is, (A.19)
1
ifn=m,
10 ifn34 m.
To establish completeness suppose that f E L2(T) is orthogonal to all On's; i.e., (f, ¢n) = 0 for all n E Z. If a and T are as in the preceding proposition, then: (i) (f ,T) = 0; and (ii) 11f - TIIT < E. This last part, (ii), implies that
E ? IIf - TIIT (A.20)
= IITIIT + IITIIT - 2(f,T) = Ill IIT + IITIIT
Since c is arbitrary, Ill IIT = 0, from which we deduce that f = 0 almost everywhere. This proves completeness. The remainder is easy to prove, but requires the material from Chapter 4 [§6].
Let Jan and 9 respectively denote the projections onto Sn := the _n and S. If f E L2(T), then -,Vnf is the a.e.-unique function g E S,, that minimizes Ill - 9IIT We can write g = j=_n CA
linear span of {0j}
Appendix
208
and expand the said L2-norm to obtain the following optimization problem: Minimize over all {cj} _n, 2
n
(A.21)
f - E Cj 0j
n
= IIfIIT + J.=-n
j=-n
n C.j
-2
Cj(f+l6j) j=-n
T
This is a calculus exercise and yields the optimal value of r_j = (f, 0j). It follows readily from this that: (1)
>_(f,Oj)Oj;
(ii) fn f = f - +9-f; (iii)
E
=-n I(f, Oj)I2; and (ice) IIyn fIIT = IIfIIT - Ej=-n I (f+Oj)I2 The last inequality yields Bessel's inequality: 00
(A.22)
F I(f,0j)I2 <_ IIfIIT
Our goal is to show that this is an equality. If not, then Fatou's lemma implies that II lim infn-,,,, .9n f IIT > 0, whence g := lim infn_.00.9 f 34 0 on a set of positive Lebesgue measure. But note that g E .9n for all n. Fix e > 0 and find a trigonometric polynomial T E Sn for some large n such that fig - TIIT 5 e. Now expand: E2
> IIg - TIIT =119112 + IITII. - 2(g, T)
(A.23)
= II911T + IITIIT II91IT
Thus, g = 0 almost everywhere, whence follows a contradiction. In fact, this
argument shows that any subsequential limit of .9 f must be zero almost everywhere, and hence .9n f -+ 0 in L2(T). It follows that 00
(A.24)
f = Eli
.9nf = > (f, Oj)Oj j=-00
as desired.
in L2(T),
Bibliography
Adams, W. J. (1974). The Life and Tomes of the Central Limit Theorem. New York: Kaedmon Publishing Co.
Aldous, D. J. (1985). Exchangeability and Related Topics. In Ecole d'E°te de Probabiht8s de SaintFlour, X111-1983, Volume 1117 of Lecture Notes in Math., pp. 1-198. Berlin: Springer. Alon, N. and J. H. Spencer (1991). The Probabilistic Method (First ed.). New York: Wiley. Andre, D. (1887). Solution directe du problbmc resolu par M. Bertrand. C. R. Acad. Sci. Paris 105, 436-437.
Azuma, K. (1967). Weighted sums of certain dependent random variables. Tdhoku Math. J. (2) 19, 357-367.
Bachelier, L. (1900). Theorie de la speculation. Ann. Sci. Ecole Norm. Sup. 17, 21-86. See also the 1995 reprint. Sceaux: Gauthier-Villars. Bachelier, L. (1964). Theory of speculation. In P. H. Cootner (Ed.), The Random Character of Stock Market Prices, pp. 17 78. MIT Press. Translated from French by A. James Bones". Banach, S. (1931). Uber die Baire'sche kategorie gewisser Funkionenmengen. Studia. Math. 111, 174-179.
Bass, R. F. (1998). Diffusions and Elliptic Operators. New York: Springer-Verlag. Baxter, M. and A. Rennie (1996). Financial Calculus: An Introduction to Derivative Pricing. Cambridge: Cambridge University Press. Second (1998) reprint. Berkes, 1. (1998). Results and problems related to the pointwise central limit theorem. In Asymptotic Methods in Probability and Statistics (Ottawa, ON, 1997), pp. 59-96. Amsterdam: North-Holland. Bernoulli, J. (1713). Ars Conjectandi (The Art of Conjecture). Basel: Basilem: Impensis Thurnisioruin Fraetrum. Bernstein, S. N. (1912/1913).
Demonstration du theorbme de Weierstrass fondee sur le calcul des
probabilites. Comm. Soc. Math. Kharkow 13, 1-2. Bernstein, S. N. (1964). On the property characteristic of the normal law. In Sobranie sochinenii. Tom IV: Teoriya veroyatnostei. Matematicheskaya Statistoka. 1911-1946. "Nauka". Moscow. Billingsley, P. (1995). Probability and Measure (Third ed.). Now York: John Wiley & Sons Inc. Binghamn, N. H. (1986). Variants on the law of the iterated logarithm. Bull. London Math. Soc. 18(5), 433-467.
Birkholf, G. D. (1931). Proof of the ergodic theorem. Proc. Nat. Acad. Sci. 17, 656-660. Black, F. and M. Scholea (1973). Pricing of options and corporate liabilities. J. Political Econ. 81, 637-654.
Blumenthal, R. M. (1957). An extended Markov property. Trans. Amer. Math. Soc. 85, 52-72. Borel, E. (1909). Les probabilites denombrables et leurs applications arithmetique. Rend. Cire. Mat. Palermo 27, 247-271. Bore], E. (1925). Mecanique Slatistique Classique (Third ed.). Paris: Gauthier-Villars. 209
210
Bibliography
Bourke, C., J. M. Hitchcock, and N. V. Vinodchandran (2005). Entropy rates and finite-state dimension. Theoret. Comput. Sci. 349(3), 392-406. Bovier, A. and P. Picco (1996). Limit theorems for Bernoulli convolutions. In Disordered Systems (Temuco, 1991/1992), Volume 53, pp. 135-158. Paris: Hermann. Breiman, L. (1992). Probability. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). See also the corrected reprint of the original (1988). Bretagnolle, J. and D. Dacunha-Castelle (1969). Applications radonifiantes dans les espaces de type p. C. R. Acad. Sci. Paris Ser. A-B 269, A1132-A1134. Broadbent, S. R. and J. M. Hammersley (1957). Percolation processes. I. Crystals and mazes. Proc. Cambridge Philos. Soc. 53, 629-641. Buczolich, Z. and R. D. Mauldin (1999). On the convergence of E::=1 f(nx) for measurable functions. Mathematika 46(2), 337-341. Burkholder, D. L. (1962). Successive conditional expectations of an integrable function. Ann. Math. Statist. 33, 887-893. Burkholder, D. L., B. J. Davis, and R. F. Gundy (1972). Integral inequalities for convex functions of operators on martingales. In Proc. Sixth Berkeley Symp. Math. Statist. Probab., Vol. II, Berkeley, Calif., pp. 223-240. Univ. California Press. Burkholder, D. L. and R. F. Gundy (1970). Extrapolation and interpolation of quasi-linear operators on martingales. Acta Math. 124, 249-304. Cantelil, F. P. (1917a). Su due applicazioni d'un teorema dl G. Boole ails statistics, matematica. Atti delta Reale Acaademia Nationale dei Lincei, Serie V, Rendicotti 26, 295-302. Cantelli, F. P. (1917b). Sulla probabilith come limits della frequenze. Atti delta Reale Accademia
Nationale dei Lincei, Serie V, Rendicotti 26,39-45. Cantelli, F. P. (1933a). Conaiderazioni sulla legge uniforme dei grandi numeri e sulla generalizzazione di un fondamentale teorema del signor L6vy. Giornale d. Istituto Italians Attuari 4, 327-350. Cantelli, F. P. (1933b). Sulla determinazione empirica delle leggi di probabilita. Giornale d. Istituto Itoliano Attuari 4, 421-424. Caratheodory, C. (1948). Vorlesungen fiber recite Funktionen. New York: Chelsea Publishing Company. Champernowne, D. G. (1933). The construction of decimals normal in the scale of ten. J. London Math. Soc. 8, 254-260. Chatterji, S. D. (1968). Martingale convergence and the Radon-Nikodym theorem in Banach spaces. Math. Scand. 22, 21-41. Chebyshev, P. L. (1846). Demonstration flementaire dune proposition generale de la th6orie des probabilit6s. Crelle J. Math. 33(2), 259-267. Chebyshev, P. L. (1867). Des valeurs moyennes. J. Math. Puns Appl. 12(2), 177-184. Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist. 23, 493-507. Chow, Y. S. and H. Teicher (1997). Probability Theory: Independence, Interchangeability, Martingales (Third ed.). New York: Springer-Verlag. Chung, K. L. (1947). On the maximum partial sum of independent random variables. Proc. Nat. Acad. Sci. U.S.A. 33, 132-136.
Chung, K. L. (1974). A Course in Probability Theory (Seconded.). New York-London: Academic Press.
Chung, K.-L. and P. Erd66 (1947). On the lower limit of sums of independent random variables. Ann. of Math. (2) 48, 1003-1013. Chung, K. L. and P. ErdSSs (1952). On the application of the Borel-Cantelli lemma. Trans. Amer. Math. Soc. 72, 179-186. Ciesielski, Z. and S. J. Taylor (1962). First passage times and sojourn times for Brownian motion in space and the exact Hausdorff measure of the sample path. Trans. Amer. Math. Soc. 103, 434-450. Cifarelli, D. M. and E. Regazzini (1996). De Finetti's contribution to probability and statistics. Statist. Sri. 11(4), 253-282. Coifman, R. R. (1972). Distribution function inequalities for singular integrals. Proc. Not. Acad. Set. U.S.A. 69,2938-2939. Copeland, A. H. and P. Erd8s (1946). Note on normal numbers. Bull. Amer. Math. Soc. 52, 857-860. Courtault, J: M., Y. Kabanov, B. Bru, P. Crepel, I. Lebon, and A. Le Marchand (2000). Louis Bachelier. On the Centenary of Theorie de la Speculation. Math. Finance 10(3), 341-353. Cover, T. M. and J. A. Thomas (1991). Elements of Information Theory. New York: John Wiley & Sons Inc.
Bibliography
211
Cox, J. C., S. A. Ross, and M. Rubenstein (1979). Option pricing: a simplified approach. J. Financial Econ. 7, 229-263. Crambr, H. (1936). Uber eine Eigenschaft der normalen Verteilungsfunktion. Math. Z. 41, 405-415. Cs6rg8. M. and P. Rdvdsz (1981). Strong Approximations in Probability and Statistics. New York: Academic Press Inc. [Harcourt Brace Jovanovich Publishers]. de Acosta, A. (1983). A new proof of the Hartman-Wintner law of the iterated logarithm. Ann. Probab. 11(2), 270-276. de Finetti, B. (1937). La pr6vision: see loin logiques, ses sources aubjectives. Ann. Inst. H. PoincarC 7, 1-68.
de Moivre, A. (1718). The Doctrine of Chances; or a Method of Calculating the Probabilities of Events in Play (First ed.). London: W. Pearson.
de Moivre, A. (1733). Approximatio ad Summam terminorum Binomii (a t b)" in Serium expansi. Privately Printed. de Moivro, A. (1738). The Doctrine of Chances; or a Method of Calculating the Probabilities of Events in Play (Second ed.). London: H. Woodfall. Dellacherie, C. and P.-A. Meyer (1982). Probabilities and Potential. B. Amsterdam: North-Holland Publishing Co. Theory of martingales, Translated from French by J. P. Wilson. Devaney, R. L. (2003). An Introduction to Chaotic Dynamical Systems. Boulder, CO: Westview Press. Reprint of the second (1989) edition. Diaconis, P. and D. Freedman (1987). A dozen de Finetti-style results in search of a theory. Ann. Inst. H. Poincar8 Probab. Statist. 23(2, suppl.), 397-423. Diaconis, P. and J. B. Keller (1989). Fair dice. Amer. Math. Monthly 96(4), 337-339. Donoho, D. L. and P. B. Stark (1989). Uncertainty principles and signal recovery. SIAM J. Appl. Math. 49(3). 906-931. Doob, J. L. (1940). Regularity properties of certain families of chance variables. Trans. Amer. Math. Soc. 47, 455-486.
Doob, J. L. (1949). Application of the theory of martingales. In Le Calcul des ProbabilitCs et ses Applications, pp. 23-27. Paris: Centre National de is Recherche Scientifique. Doob, J. L. (1953). Stochastic Processes. New York: John Wiley & Sons Inc. Doob, J. L. (1971). What is a martingale? Amer. Math. Monthly 78, 451-463. Dubins, L. E. and D. A. Freedman (1965). A sharper form of the Borel-Cantelli lemma and the strong law. Ann. Math. Statist. 36, 800-807. Dubins, L. E. and D. A. Freedman (1966). On the expected value of a stopped martingale. Ann. Math. Statist 37, 1505-1509. Dudley, R. M. (1967). On prediction theory for nonstationary sequences. In Proc. Fifth Berkeley Syrup. Math. Statist. Probab., Vol. II, pp. 223-234. Berkeley, Calif.: Univ. California Press. Dudley, R. M. (2002). Real Analysis and Probability. Cambridge: Cambridge University Press. Revised reprint of the 1989 original. Durrett, R. (1996). Probability: Theory and Examples (Seconded.). Belmont, CA: Duxbury Press. Dynkin, E. and A. Jushkevich (1956). Strong Markov processes. Teor. Veroyatnost. i Primenen. 1, 149-155.
ErdSs, P. (1948). Some remarks on the theory of graphs. Bull. Amer. Math. Soc. 53, 292-294. ErdSa, P. (1949). On the strong law of large numbers. Trans. Amer. Math. Soc. 67, 51-56. Erdbs, P. and G. Szekeres (1935). A combinatorial problem in geometry. Composito. Math. 2, 463-470. Erd6s, P. and A. R.enyi (1959). On Cantor's series with convergent E 1/q". Ann. Univ. Sci. Budapest. EStvds. Sect. Math. 2, 93-109. Erd6s, P. and A. Rbnyi (1970). On a new law of large numbers. J. Analyse Math. 23, 103-111. Etemadi, N. (1981). An elementary proof of the strong law of large numbers. Z. Wahrsch. Verso. Ceb. 55, 119-122. Falconer, K. J. (1986). The Geometry of Fractal Sets, Volume 85. Cambridge: Cambridge University Press.
Fatou, P. J. L. (1906). S6ries trigonombtriques et series de Taylor. Aeta Math. 69, 372-433. Feller, W. (1945). The fundamental limit theorems in probability. Bulletin A.M.S. 51, 800-832. Feller, W. (1955a). On differential operators and boundary conditions. Comm. Pure Appl. Math. 8, 203-216.
Feller, W. (1955b). On second order differential operators. Ann. of Math. (2) 61. 90-105.
212
Feller, W. (1956). On generalized Sturm-Liouville operators.
Bibliography
In Proceedings of the Conference on
Differential Equations (dedicated to A. Weinstein), pp. 251-270. University of Maryland Book Store, College Park. Aid.
Feller, W. (1957). An Introduction to Probability Theory and Its Applications. Vol. I (Seconded.). New York: John Wiley & Sons Inc.
Feller, W. (1966). An Introduction to Probability Theory and Its Applications. Vol. 11. New York: John Wiley & Sons Inc.
Fortuin, C. M.. P. W. Kasteleyn, and J. Ginibre (1971). Correlation inequalities on some partially ordered sets. Comm. Math. Phys. 22, 536 -564. Freshet, M. R. (1930). Stir In convergence en probability. Metron 8, 1-48. Freiling, C. (1986). Axioms of symmetry: Throwing darts at the real number line. J. Symbolic Logic 51(1),190-200. Fristedt, B. and L. Gray (1997). A Modern Approach to Probability Theory. Boston. MA: Birkhhuser Boston Inc.
Garsis, A. M. (1965). A simple proof of E. Hopf's maximal ergodic theorem. J. Math. Mech. 14, 381 382.
Georgii, H.-O. (1988). Gibbs Measures and Phase Transitions. Berlin: Walter de Gruyter & Co. Gerber. H. U. and S: Y. R. Li (1981). The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain. Stochastic Process. Appl. 71(1), 101-108. Glivenko, V. (1933). Sulla determinazione empirica dells leggi di probability. Giornale d. Istituto Italiano Attuan 4, 92 -99. Glivenko, V. (1936). Sit[ teorema limits delta teoria delle funzioni caratteristiche. Giornale d. Instituto Italiano attuart 7, 160-167. Gnedenko, B. V. (1967). The Theory of Probability. New York: Chelsea Publishing Co. Translated from the fourth Russian edition by B. D. Seckler. Gnedenko, B. V. (1969). On Hilbert's Sixth Problem (Russian). In Hilbert's Problems (Russian), pp. 116-120. "Nauka", Moscow. Gnedenko. B. V. and A. N. Kolmogorov (1968). Limit Distributions for Sums of Independent Random Variables. Reading, Mass.: Addison-Wesley Publishing Co. Translated from the original Russian, annotated, and revised by Kai Lai Chung. With appendices by J. L. Doob and P. L. 11su. Revised edition. Grimmett, G. (1999). Percolation (Second ed.). Berlin: Springer-Verlag.
Hamedani, G. G. and G. C. Walter (1984). A fixed point theorem and its application to the central limit theorem. Arch. Math. (Basel) 43(3), 258-264. Hammersley, J. M. (1963). A Monte Carlo solution of percolation in a cubic lattice. In S. F. B. Alder and M. Rotenberg (Eds.), Methods in Computational Physics, Volume I. London: Academic Press. Hardy, G. H. and J. E. Littlowood (1914). Some problems of diophantine approximation. Acta Math. 37, 155-239.
Hardy. G. H. and J. E. Littlowood (1930). A maximal theorem with function-theoretic applications. Acta Math. 54, 81 166. Harris, T. E. (1960). A lower bound for the critical probability in a certain percolation process. Proc. Cambridge Philos. Soc. 56, 13-20. Harrison, J. M. and D. Kreps (1979). Martingales and arbitrage in multiperiod securities markets. J. Econ. Theory 20,381-408. Harrison, J. M. and S. R. Pliska (1981). Martingales and stochastic integrals in the theory of continuous trading. Stoch. Proc. Their Appl. 11, 215-260. Hartman, P. and A. Wintner (1941). On the law of the iterated logarithm. Amer. J. Math. 63,169-176. Hausdori, F. (1927). Mengenlehre. Berlin: Walter De Gruyter & Co. Hausdorff, F. (1949). Grundzuge der Mengenlehre. New York: Chelsea publishing Co. Helms, L. L. and P. A. Loeb (1982). A nonstandard proof of the martingale convergence theorem. Rocky Mountain J. Math. 12(1), 165-170. Hoeffding. W. (1963). Probability inequalities for sums of bounded random variables. Amer. Statist. Assoc. 58, 13-30. Hoetfding, W. (1971). The Lt norm of the approximation error for Beinstein-type polynomials. J. Approximation Theory 4, 347-356. Houdr6, C.. V. PErez-Abreu, and D. Surgailis (1998). Interpolation, correlation identities, and inequalities for infinitely divisible variables. J. Fourier Anal. Appl. 4(6), 651-668. Hunt, G. A. (1956). Some theorems concerning Brownian motion. Tram. Amer. Math. Soc. 81, 294 319. Hunt, G. A. (1957). Markoff processes and potentials. I, 11. Illinois J. Math. 1, 44-93, 316-369.
Bibliography
213
Hunt, G. A. (1966). Martingates et processus its Mark-o% Paris: Dunod. lonescu Tulceo, A. and C. lonescu Tulcea (1963). Abstract ergodic theorems. Trans. Amer. Math. Soc. 107, 107-124. Isaac. R. (1965). A proof of the martingale convergence theorem. Proc. Amer. Math. Soc. 16, 842-844. Ito, K. (1944). Stochastic integral. Proc. Imp. Acad. Tokyo 20, 519-524. Jones, R. L. (1997/1998). Ergodic theory and connections with analysis and probability. New York J. Math. 3A(Proceedings of the New York Journal of Mathematics Conference, June 9 13, 1997), 31-67 (electronic). Kac, M. (1937). Une remarque sur lee polynomes de M.S. Bernstein. Studio Math. 7, 49-51. Kac, M. (1939). On a characterization of the normal distribution. Airier. J. Math. 61, 726-728. Kac, M. (1949). On deviations between theoretical and empirical distributions. Proc. Nat. Acad. Sri. U.S.A. 35, 252-257. Kac, M. (1956). Foundations of kinetic choery. In J. Neyman (Ed.), Proc. Third Berkeley Symp. on Math. Statist. Probab., Volume 3, pp. 171--197. Univ. of Calif. Kahane, J: P. (1997). A few generic properties of Fourier and Taylor series. In Trends in Probability and Related Analysis (Taipei, 1996), pp. 187-196. River Edge, NJ: World Sci. Publishing. Kahane. J.-P. (2000). Baire's category theorem and trigonometric series. J. Anal. Math. 80, 143-182. Kahane, J.-P. (2001). Probabilities and Baire's theory in harmonic analysis. In Twentieth Century
Harmonic Analysis-A Celebration (11 Ctocco, 2000), Volume 33 of NATO Sci. Ser. II Math. Phys. Chem., pp. 57-72. Dordrecht: Kluwer Acad. Publ. Karlin, S. and H. M. Taylor (1975). A First Course in Stochastic Processes (Seconded.). Academic Press [Harcourt Brace Jovanovich Publishers], New York-London.
Karlin, S. and H. M. Taylor (1981). A Second Course in Stochastic Processes. New York: Academic Press Inc. (Harcourt Brace Jovanovlch Publishers]. Keller, J. B. (1986). The probability of heads. Amer. Math. Monthly 9.9(3), 191-197. Kesten, H. (1980). The critical probability of bond percolation on the square lattice equals . Comm. Math. Phys. 74(l), 41-59. Khintchine, A. and A. Kolmogorov (1925). Uber Konvergenz von Reihen deren Glieder durch den Zufall bestimmt werden. Rec. Math. Moscow 32, 668-677. Khintchine, A. Y. (1923). Uber dyadische Briiche. Math. Z. 18. 109-116. Khintchine, A. Y. (1924). Ein Satz der Wahrscheinlichkeitsrechnung. Fond. Math. 6. 9-10.
Khintchine, A. Y. (1929). Stir la loi des grands nombres. C. R. Acad. Set. Paris 188, 477-479. Khintchine, A. Y. (1933). Asymptotische Gesetz der Wahrscheinlichkeitsrechnung. Springer. Kinney, J. R. (1953). Continuity properties of sample functions of Markov processes. Trans. Amer. Math. Soc. 74, 280-302. Knight, F. B. (1962). On the random walk and Brownian motion. Trans. Airier. Math. Soc. 103, 218 228.
Knight, F. B. (1981). Essentials of Broumian Motion and Diffusion. Providence, R.I.: American Mathematical Society.
Knuth, D. E. (1981). The Art of Computer Programming. Vol. 2 (Second ed.). Reading, Mass.: Addison-Wesley Publishing Co. Seminumerical Algorithms.
Kochen, S. and C. J. Stone (1964). A note on the Borel-Cantelli lemma. Illinois J. Math. 8, 248-251. Kolmogorov, A. (1930). Sur la loi forte des grandee nombres. C. R. Acad. Sci. Paris 191, 910-911. Kolmogorov, A. N. (1929). Uber das Gesetz des iterierten Logarithmus. Math. Ann. 101, 126-136. Kolmogorov, A. N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. Berlin: Springer. Kolmogorov, A. N. (1950). Foundations of Probability. New York: Chelsea Publishig Company. Translation edited by Nathan Morrison. Krickeberg, K. (1963). Wahracheinlichkeitstheorie. Stuttgart: Teubner. Krickeherg, K. (1965). Probability Theory. Reading, Massachusetts: Addison-Wesley. Kyburg, Jr., H. E. and H. E. Smokier (1980). Studies in Subjective Probability (Second ed.). Huntington, N.Y.: Robert E. Krieger Publishing Co. Inc. Lacey, M. T. and W. Philipp (1990). A note on the almost sure central limit theorem. Statist. Probab. Lett. 9(3), 201-205. Lamb, C. W. (1973). A short proof of the martingale convergence theorem. Proc. Amer. Math. Soc. 38, 215-217. Lange, K. (2003). Applied Probability. New York: Springer-Verlag.
214
Bibliography
Laplace, P. S. (1782). MCmoire sur les approximations des formulas qui sent fonctions de tres-grands nombres. Technical report, Histoire do I'AcadCmie Royale des Sciences de Paris. Laplace, P.-S. (1805). Traitd de Mdcanique Cdleate, Volume 4. Chez J. B. M. Duprat, an 7 (Crapelet). Reprinted by the Chelsea Publishing Co. (1967). Translated by N. Bowdltch. Laplace, P: S. (1812). Thdori.e Analytique des ProbabilitEs, Vol. 1 and 11. V[iemej Courtier. Reprinted in Oeuvres compldtes de Laplace, Volume VII (1886), Paris: Gauthier-Villars. Lebesgue, H. (1910). 361-450.
Sur l'integration des fonctions discontinues.
Ann. Ecole Norm. Sup. 27(3),
Levi, B. (1906). Sopra l'integrazione dells serie. Rend. !nstituto Lombardino di Sci. e Lett. 39(2), 775-780.
Levy, P. (1925). Calcul des Probabilitds. Paris: Gauthier-Villars. Levy, P. (1937). Theorie de I'Addition des Variables Aldatoires. Paris: Gauthier-Villars. Levy, P. (1951). La mesure de Hausdorff de la courbe du mouvement brownien h n dimensions. C. R. Acad. Sci. Paris 233, 600-602. Li, S.-Y. R. (1980). A martingale approach to the study of occurrence of sequence patterns in repeated experiments. Ann. Probab. 8(6), 1171-1176. Liapounov, A. M. (1900). Sur one proposition de Is theorie des probabilitCs. Bulletin de l'Acaddmie Impdriale des Sciences de St. PetCrsbourg 13(4), 359-386. Liapounov, A. M. (1922). Nouvelle forma du theorbme sur Ia limits do probability. MEmoires de I'Acaddmie Impdriale des Sciences de St. Petdraboury 12(5), 1-24. Lindeberg, J. W. (1922). Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrech-
nung. Math. Z. 15, 211-225. Lindvall, T. (1982). Bernstein polynomials and the law of large numbers. Math. Sci. 7(2), 127-139. Lipschitz, R. (1876). Sur Is possibilite d'integrer complement un systbme donna d'equations diffCrentielles. Bull. Sci. Math. 10, 149-159. Mahmoud, H. M. (2000). Sorting: A distribution theory. Wiley-Interscience, New York. Markov, A. A. (1910). Recherches our on cas remarquable d'epreuves dependantes. Acta Math. 33, 87-104.
Mattner, L. (1999). Product measurability, parameter integrals, and a Fubini-Tonelli counterexample. Enseign. Math. (2) 45(3-4), 271-279. Mazurkiewicz, S. (1931). Sur lea functions non dtrivablea. Studio. Math. 111, 92-94. McShane, E. J. (1969). A Riemann-type integral that includes Lebesgue-Stieltjjes, Bochner and stochastic integrals. Memoirs of the American Mathematical Society, No. 88. Providence, R.I.: American Mathematical Society. Merton, R. C. (1973). Theory of rational option pricing. Bell J. of Econ. and Management Sci. 4(1), 141-183.
Mukherjea, A. (1972). A remark on Tonelli's theorem on integration in product spaces. Pacific J. Math. 42, 177-185. Mukherjea, A. (1973/1974). Remark on Tonelli's theorem on integration in product spaces. II. Indiana Univ. Math. J. 23, 679-684. Nash, J. (1958). Continuity of solutions of parabolic and elliptic equations. Amer. J. Math. 80, 931-954. Nelson, E. (1967). Dynamical Theories of Brownian Motion. Princeton, N.J.: Princeton University Press. Norris, J. R. (1998). Markov Chains. Cambridge: Cambridge University Press. Reprint of 1997 original. Okamoto, M. (1958). Some Inequalities relating to the partial sum of binomial probabilities. Ann. Inst. Statist. Math. 10, 29-35.
Paley. R. E. A. C., N. Wiener, and A. Zygmund (1933). Notes on random functions. Math. Z. 37, 647-668.
Paley, R. E. A. C. and A. Zygmund (1932). A note on analytic functions in the unit circle. Proc. Camb. Phil. Soc. 28, 366-372. Perrin, J. B. (1913). Lea Atomes. Paris: Llbrairie F. Alcan. See also Atoms. The revised reprint of the second (1923) English edition. Van Nostrand, New York. Translated by Dalziel Llewellyn Hammick.
Pitman, J. (198]). A note on L2 maximal inequalities. In Seminar on Probability, XV (Univ. Strasbourg, Strasbourg, 1979/1980) (French), Volume 850 of Lecture Notes in Math., pp. 251 -258. Berlin: Springer. Plancherel, M. (1910). Contribution a 1'Ctude de Is representation dune fonction arbitraire par des
intCgrales defines. Rend. Circ. Mat. Palermo 30, 289-335. Plancherel, M. (1933). Sur lee formules de rdclprocit4 du type de Fourier. J. London Math. Soc. 8, 220-226.
Bibliography
215
Plancherel, M. and G. P61ya (1931). Sur les valeurs moyennes doe fonctions rtelles dbfinies pour toutes lea valuers do Ia variables. Comment. Math. Hely. 3, 114-121. Poincar6, H. (1912). Calcul des Probabititds. Paris: Gauthier-Villars. Pollard, D. (2002). A User's Guide to Measure Theoretic Probability. Cambridge: Cambridge University Press. P61ya, G. (1920). Uber den zentralen Grenzwertsatz der Wahrscheinlichkeitstheorie and des Momentenproblem. Math. Zeit. 8, 171-181. Rademacher, H. (1919). Uber partielle and totale Differenzierbarkeit I. Math. Ann. 89, 340-359. Rademacher, H. (1922). Einige Shtze Uber Reihen von allgemeinen Orthogonalfunktionen. Math. Ann. 87, 112-138. Raikov, D. (1936). On some arithmetical properties of summable functions. Math. Sb. 1(43:3), 377-383. Ramsey, F. P. (1930). On a problem of formal logic. Proc. London Math. Soc. 30(2), 264-286. R6nyi, A. (1962). Wahrscheinlichkeitsrechnung. Mit einem Anhang fiber Informationstheorie. Berlin: VEB Deutscher Verlag der Wissenachaften. Resnick, S. 1. (1999). A Probability Path. Boston, MA: Birkhfiuser Boston Inc.
Revuz, D. and M. Yor (1999). Continuous Martingales and Brownian Motion (Third ad.). Berlin: Springer-Verlag.
Rosenlicht, M. (1972). Integration in finite terms. Amer. Math. Monthly 79, 963-972. Schnorr, C: P. and H. Stimm (1971/1972). Endliche Automaten and Zufallafolgen. Acta Informal. 1(4), 345-359.
Schrhdingor, E. (1946). Statistical Thermodynamics. Cambridge: Cambridge University Press. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379-423, 623-656.
Shannon, C. E. and W. Weaver (1949). The Mathematical Theory of Communication. Urbana, Ill.:
Univ. of Illinois Press. Shultz, H. S. and B. Leonard (1989). Unexpected occurences of the number e. Math. Magazine 62(4), 269-271.
Sierpiiski, W. (1920). Sur les rapport entre ('existence des intograles fo f(x,y)dx, fo f(x,y)dy et fo dx fo f(x,y)dy. Fund. Math. 1, 142-147. Skolem, T. (1933). Ein kombinatorischer Satz mit anwenduag auf sin logisches Entacheidungsproblem. Fund. Math. 20, 254-261. Skorohod, A. V. (1961). Iseledovaniya po teorii sluchainykh protsessov. Kiev. Univ., Kiev. (Studies in the Theory of Random Processes. Translated from Russian by Scripts Technica, Inc., AddisonWesley Publishing Co., Inc., Reading, Mass. (1965). See also the second edition (1985), Dover, Now York.).
Skorohod, A. V. (1965). Studies in the theory of random processes. Addison-Wesley Publishing Co., Inc., Reading, Mass. Slutsky, E. (1925). Uber etochastische Asymptoten and Grenzwerte. Metron 5, 3-89. Solovay, R. M. (1970). A model of set theory in which every set of reels is Lebesgue measurable. Ann. Math. 92, 1-56. Steinhaus, H. (1922). Les probabilitds dlnombrables et leur rapport A la th6orie de mesure. Fund. Math. 4, 286-310. Steinhaus, H. (1930). Sur Ia probability de la convergence de eerie. Studia Math. 2, 21-39. Stigler, S. M. (1986). The History of Statistics. Cambridge, MA: The Belknap Press of Harvard University Press. Stirling, J. (1730). Methodue Difjerentiatis. London: Whiston & White. Strassen, V. (1967). Almost sure behavior of sums of independent random variables and martingales. In Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), pp. Vol. II, Part 1, pp. 315-343. Berkeley, Calif.: Univ. California Press. Stroock, D. W. (1993). Probability Theory, An Analytic View. Cambridge: Cambridge University
Press. Stroock, D. W. and O. Zeitounl (1991). Microcanonical distributions, Gibbs states, and the equivalence of ensembles. In Random Walks, Brownian Motion, and Interacting Particle Systems, pp. 399-424. Boston, MA: Birkhauser Boston.
Tendon, K. (1983). The Life and Works of Lip6t Fej#r. In Functions, series, operators, Vol. 1, II (Budapest, 1980), Volume 35 of Colloq. Math. Soc. Jdnos Bolyai, pp. 77-85. Amsterdam: NorthHolland.
Trotter, H. F. (1959). An elementary proof of the central limit theorem. Arch. Math. 10, 226-234.
216
Bibliography
Turing, A. M. (1934). On the Gaussian error function. Technical report, Unpublished Fellowship Dissertation, King's College Library, Cambridge. Varadhan, S. R. S. (2001). Probability Theory. New York: New York University Courant Institute of Mathematical Sciences. Varberg, D. E. (1966). Convergence of quadratic forms in independent random variables. Ann. Math. Statist. 37, 567-576. Veech, W. A. (1967). A Second Course in Complex Analysis. New York, Amsterdam: W. A. Benjamin, Inc.
Ville, J. (1939). Etude Critique de to Notion de Collectif. Paris: Gauthier-Villars. Ville, J. (1943). Sur 1'application, a un critere d'inddpendance, du d4nombrement des inversions pr8eent6es par one permutation. C. R. Acad. Sci. Paris 217, 41-42. von Neumann, J. (1940). On rings of operators, Ill. Ann. Math. 41, 94-161. von Smoluchowski, M. (1918). Uber den Begriff des Zufalls and den Ursprung der Wahrscheinlichkeit. Die Naturwissenschaften 6(17), 253-263. Wagon, S. (1985). Is x normal? Math. Intelligencer 7(3), 65-67. Wiener, N. (1923a). Differential space. J. Math. Phys. 2, 131-174. Wiener, N. (1923b). The homogeneous chaos. Amer. J. Math. 60, 879-036. Williams, D. (1991). Probability with Martingales. Cambridge: Cambridge University Press. Wong, C. S. (1977). Classroom Notes: A Note on the Central Limit Theorem. Amer. Moth. Monthly 84(6), 472. Woodroofe, M. (1975). Probability with Applications. New York: McGraw-Hill Book Co. Young, L. C. (1970). Some new stochastic integrals and Stioltjes integrals. I. Analogues of HardyLittlewood classes. In Advances in Probability and Related Topics, Vol. 2, pp. 161-240. New York: Dekker.
Zabell, S. L. (1995). Alan Turing and the central limit theorem. Amer. Math. Monthly 102(6), 483-494.
Index Symbols .8(O), Borel sigma-algebra on fl ........ 24 Bin(n,p), binomial distribution ..........8
independent identically distributed 68 mc(6d), monotone class generated by 0 30
C, the complex numbers ............... xv
IUIILP(µ) = (f IIIPdµ)'/P .............. 39 IIXIIP = (E[IXIP])'/P ................... 39 v K µ, absolute continuity ............. 47
Cb(X), bounded continuous functions from
0.(X ), 0.(.d), ... sigma-algebra generated by
CC(X), compactly supported continuous func-
A, V, the min and max operators .......xvi
C(X), continuous functions from X to R 49
X to R ............................94
X, of, etc ................. 24. 49, 120.
tions from X to R ................. 94 C°°(Rk), infinitely differentiable functions of compact support from Rk to R 111 Cov(X, Y), covariance between X and Y 62
A
EX, expectation of X .................. 39
Adapted process ..................126, 126,182
EIX; Al = E[X1AI = .fA X dP .......... 32 Geom(p), geometric distribution .........8 LP, random variables with p finite absolute
moments .......................... 39
ZP, completion of LP .................. 43 N(µ,0.2), normal distribution
.......... U
N, the natural numbers ................ xv 9(fl), the power set of 11 .............. 24 Poiss(. ), Poisson distribution ............9
Q, the rationals ........................ xv R, the real numbers .................... xv
Absolute continuity .. 11, see also Measure Adams, William J . Aldous, David J .
..................... 22
......................157
Almost everywhere convergence ........ 43 Almost sure
central limit theorem ................. 90
convergence ..........................43 Alon, Noga .............................89 Andr6, Desire .................... 119 180
Approximation to the identity ......... 205 Avogadro, Lorenzo Romano Amedeo Carlo 159
Azuma, Kazuoki ...................... 151
9
the unit sphere in R" .......... 102 SD(X), standard deviation of X ........67 Unif(a, b), uniform distribution .........11
Azuma-Hoeffding inequality ..155, see also Heoffding's inequality
VarX, variance of X ................... 62
B
X+, the positive elements of X ......... xv
Z, the integers ......................... xv
a.e., almost everywhere .................43
a.s., almost surely ...................... 43
Bachelier, Louis Jean Baptiste Alphonse 159,
117180 Backward (or reversed) martingale ..... 155 Banach, Stefan ........................152
dv/dµ, Radon-Nikodym derivative ..... 47
Bass, Richard Franklin ................ 201
J+ = max(f, ,0) ........................ 38
Baxter, Martin ........................15Z
f- = max(-f, 0) ...................... 38
Berk@s, Istvan ..........................90
217
Index
218
Bernoulli trials ..................... 8
Bernoulli, Jacob ............ 18, 22, 71
Bernstein
  polynomials ....................... 77
  proof of the Weierstrass theorem .. 77
Bernstein (Bernshtein), Sergei Natanovich 77, 117
Bessel's inequality ................ 208
Bessel, Friedrich Wilhelm .......... 208
Bingham, Nicholas H. ............... 157
Binomial distribution .... 8, see also Distribution
Birkhoff, George David .............. 90
Black, Fischer ................ 144, 145
Black-Scholes formula .............. 145
Blumenthal's zero-one law .......... 177
Blumenthal, Robert M. ......... 177, 180
Borel set and/or sigma-algebra ...... 24
Borel, Émile ... 52, 73, 86, 89, 109, 117, 137
Borel-Cantelli lemma .... 73, 109, see also Paley-Zygmund inequality
  Dubins-Freedman .................. 156
  Lévy's ........................... 136
Bounded convergence theorem ......... 45
  conditional ...................... 121
Bourke, Chris ....................... 90
Bovier, Anton ...................... 113
Bretagnolle, Jean .................. 158
Broadbent, S. R. .................... 83
Brown, Robert ...................... 159
Brownian bridge .................... 177
Brownian motion
  and the heat equation .. 117, 118, 196, 198
  as a Gaussian process ............ 153
  Einstein's predicates ............ 160
  exit distribution ... 196, 197, 200, see also Chung's formula
  filtration ....... 171, see also Filtration
  gambler's ruin formula ........... 200
  nowhere differentiability of ..... 168
  quadratic variation .............. 103
  Wiener's construction ........ 166-168
  with drift ....................... 201
Bru, Bernard ....................... 180
Buczolich, Zoltán ................... 90
Bunyakovsky, Viktor Yakovlevich ..... 40
Burkholder, Donald L. .......... 52, 156

C
Call options ....................... 144
Cantelli, Francesco Paolo ... 73, 80, 85, 89, 138
Cantor set .......................... 88
Cantor, Georg ....................... 34
Cantor-Lebesgue function ............ 88
Carathéodory Extension Theorem ...... 27
Carathéodory, Constantine ...... 27, 109
Cauchy
  sequence .......................... 46
  summability test ................. 183
Cauchy, Augustin Louis ... 40, 108, 113, 183
Cauchy-Bunyakovsky-Schwarz inequality 40
Central limit theorem ... 19, 22, 89, 100, 102
  Bovier-Picco ..................... 113
  de Moivre-Laplace ................. 19
  Liapounov's ...................... 114
  Lindeberg's ...................... 115
  projective ....................... 102
  via Liapounov's method ........... 115
  Ville's .......................... 116
  with error estimates ........ 105, 116
Champernowne, David Gawen ........... 90
Characteristic function ......... 96-117
  convergence theorem .......... 99, 102
  inversion theorem ................ 112
  uniqueness theorem ................ 99
Chatterji, Srishti Dhav ............ 152
Chebyshev's inequality .......... 18, 43
  conditional ...................... 152
  for sums .......................... 63
Chebyshev, Pafnutii Lvovich .. 18, 43, 63
Chernoff's inequality ............... 51
Chernoff, Herman ............ 51, 52, 87
Chung's formula ............... 197, 200
Chung, Kai Lai ............ xii, 89, 197
Ciesielski, Zbigniew ............... 201
Cifarelli, Donato Michele .......... 158
Coifman, Ronald R. .................. 52
Compact support (or compactly supported) function 94
Compact-support process ............ 182
Complete
  measure space ..................... 28
  topological space ................. 42
Completion .......................... 29
Conditional expectation ........ 120-125
  and prediction ................... 122
  classical ........................ 124
  properties ............. 120-121, 123
  towering property ................ 123
Conditional probability ......... 4, 125
Consistent measures ................. 59
Convergence
  almost everywhere ................. 43
  almost sure ....................... 43
  in L^p ............................ 43
  in measure ........................ 43
  in probability .................... 43
  weak .............................. 91
Convergence theorem .... 102, see also Characteristic function
Convex function ............ 40, 50, 156
Convolution ................ 15, 98, 113
Cootner, Paul H. ................... 159
Copeland, Arthur H. ................. 90
Correlation ......................... 67
Countable (sub-) additivity ......... 24
Courtault, Jean-Michel ............. 180
Covariance .......................... 67
  matrix ........................... 161
Cover, Thomas M. .................... 89
Cox, John C. ....................... 144
Cramér's theorem ................... 107
Cramér, Harald ........... 102, 107, 117
Cramér-Wold device ................. 102
Crépel, Pierre ..................... 180
Csörgő, Miklós ..................... 157
Cumulative distribution function ... 66, see also Distribution function
Cylinder set ........................ 58

D
Dacunha-Castelle, Didier ........... 158
Davis, Burgess J. ................... 52
de Acosta, Alejandro .......... 156, 158
de Finetti's theorem ............... 156
de Finetti, Bruno .............. 156-158
de Moivre, Abraham .......... 15, 19, 22
de Moivre-Laplace central limit theorem 19, see also Central limit theorem
de Moivre's formula ... 19, see also Stirling's formula
Degenerate normal ..... 11, see also Distribution
Dellacherie, Claude ................ 201
Density function ............ 10, 14, 48
Devaney, Robert L. .................. 90
Diaconis, Persi ................ 22, 117
Dimension ........................... 59
Dini, Ulisse ....................... 112
Dini-continuous process in L^2(P) .. 182
Distribution .................... 49, 50
  binomial ........................... 8
    characteristic function ......... 97
    connection to Poisson ....... 10, 19
    mean and variance ............... 12
  Cauchy ....................... 14, 113
    non-existence of the mean ....... 14
  discrete ........................... 7
  discrete uniform ................... 8
  exponential ....................... 13
    characteristic function ......... 97
    connection to uniform ........... 15
    mean and variance ............... 13
  function .................. 13, 33, 60
  gamma ............................. 13
    characteristic function ........ 112
    mean and variance ............... 13
  geometric .......................... 8
    mean and variance ............... 12
  hypergeometric .................... 13
    mean and variance ............... 13
  infinitely divisible ............. 113
  negative binomial ................. 13
    mean and variance ............... 13
  normal ........................ 11, 28
    characteristic function .... 97, 161
    degenerate ...................... 11
    mean and variance ............... 12
    multi-dimensional .......... 28, 161
    standard ........................ 11
  Poisson ........................ 9, 10
    characteristic function ......... 97
    connection to binomial ...... 10, 19
    mean and variance ............... 13
  uniform ....................... 11, 28
    characteristic function ......... 97
    connection to discrete uniform .. 15
    connection to exponential ....... 15
    mean and variance ............... 12
Dominated convergence theorem ....... 46
  conditional ...................... 121
Donoho, David L. ................... 115
Doob's
  decomposition ............... 128, 152
  martingale convergence theorem .. 134
  martingales ................. 127, 200
  maximal inequality ............... 134
    continuous-time ................ 189
  optional stopping theorem ........ 130
  strong (p, p)-inequality
    continuous-time ................ 189
  strong L^1-inequality ............ 156
  strong L^p-inequality ............ 153
    Pitman's improvement ........... 153
Doob, Joseph Leo ... xii, 127, 130, 134, 152, 153, 156, 157, 189, 200
Dubins, Lester E. ............. 154, 157
Dudley, Richard Mansfield ....... 22, 51
Durrett, Richard ............... 90, 131
Dyadic filtration ............. 142, 148
Dyadic interval ............... 142, 147
Dynkin, Eugene B. .................. 180

E
Einstein, Albert ................... 159
Elementary function ................. 37
Entropy ............................. 78
Erdős, Paul [Pál] ............ 81, 88-90
Etemadi, Nasrollah .................. 89
European options ................... 144
Event ........ 35, see also Measurable set
Exchangeable .................. 155, 156
Expectation ..................... 12, 32
F
Falconer, Kenneth K. ................ 34
Fatou's lemma ................... 45, 50
  conditional ...................... 121
Fatou, Pierre Joseph Louis ...... 45, 52
Fejér, Leopold ..................... 117
Feller, William ...... 16, 117, 157, 201
Fermi, Enrico ....................... 84
Filtration ......................... 126
  Brownian ......................... 171
  right-continuous ................. 171
Fischer, Ernst Sigismund ........... 163
FKG inequality ...................... 63
Fortuin, C. M. ...................... 64
Fourier series ................. 205-208
Fourier transform ..... 96, see also Characteristic function
Fourier, Jean Baptiste Joseph ....... 96
Fréchet, Maurice René ............... 52
Freedman, David A. ....... 117, 154, 158
Freiling, Chris ..................... 23
Fubini, Guido ....................... 55
Fubini-Tonelli theorem .............. 55
  inapplicability of ...... 56-58, 62, 63

G
Gambler's ruin formula ... 133, see also Random walk, see also Brownian motion
Garsia, Adriano M. .................. 90
Gaussian process ................... 163
Geometric distribution ... 8, see also Distribution
Georgii, Hans-Otto ................. 158
Gerber, Hans U. .................... 152
Ginibre, Jean ....................... 64
Glivenko, Valerii Ivanovich .... 80, 117
Glivenko-Cantelli theorem ........... 80
Gnedenko, Boris Vladimirovich .. 52, 117
Grimmett, Geoffrey R. ............... 83
Gundy, Richard F. ................... 52

H
Hadamard's inequality ............... 51
Hadamard, Jacques Salomon ........... 51
Hamedani, Gholamhossein Gharagoz ... 117
Hammersley, John M. ............. 83, 89
Hardy, Godfrey Harold ... 138, 140, 143, 187
Hardy-Littlewood maximal function ... 140, 143, 187
Harris, Theodore E. ................. 83
Harrison, J. Michael ............... 144
Hartman, Philip .................... 138
Hausdorff measure ................... 34
Hausdorff, Felix ............... 34, 138
Helms, Lester L. ................... 152
Hilbert space ...................... 203
Hilbert, David ...................... 52
Hitchcock, John M. .................. 90
Hoeffding's inequality ... 51, 87, see also Azuma-Hoeffding
Hoeffding, Wassily ...... 51, 52, 87, 89, 117
Hölder
  continuous function .......... 77, 199
  inequality ........................ 39
    conditional .................... 121
    generalized ..................... 51
Hölder, Otto Ludwig ......... 39, 51, 199
Houdré, Christian .................. 117
Hunt, Gilbert A. ......... 131, 180, 191

I
Independence ................ 14, 62, 68
Indicator function .................. 36
Infinitely divisible ............... 113
Information inequality .............. 87
Inner product ...................... 203
Integrable function ................. 39
Integral ............................ 39
Inverse image ...................... xvi
Inversion theorem ..... 112, see also Characteristic function
Ionescu Tulcea, Alexandra .......... 152
Ionescu Tulcea, Cassius ............ 152
Isaac, Richard ..................... 152
Itô
  formula ..................... 194, 195
  integral .................... 181-183
    indefinite ..................... 186
    under Dini-continuity .......... 184
  isometry ......................... 184
  lemma ........... see also Itô formula
Itô, Kiyosi ...... 181, 184, 185, 194, 195

J
Jensen's inequality ................. 40
  conditional ...................... 121
Jensen, Johann Ludwig Wilhelm Waldemar 40
Jones, Roger L. ..................... 52
Jushkevich [Yushkevich], Alexander A. 180

K
Kabanov, Yuri ...................... 180
Kac, Mark ........... 78, 115-117, 151
Kahane, Jean-Pierre ................ 157
Kasteleyn, Pieter Willem ............ 64
Keller, Joseph B. ................... 22
Kesten, Harry ....................... 83
Khintchine [Khinchin], Aleksandr Yakovlevich xii, 71, 72, 88, 89, 138, 153, 177
Khintchine's
  inequality ........................ 88
  weak law of large numbers ... 72, see also Law of large numbers
Kinney, John R. .................... 180
Knight, Frank B. .............. 118, 201
Knuth, Donald E. .................... 84
Kochen, Simon ....................... 89
Kolmogorov
  consistency theorem ... 60, see also Kolmogorov extension theorem
  extension theorem ................. 60
  maximal inequality ................ 74
  one-series theorem ................ 85
  strong law of large numbers ... 73, see also Law of large numbers
  zero-one law ................. 69, 136
Kolmogorov, Andrei Nikolaevich ... xii, 52, 60, 69, 73, 74, 85, 89, 103, 117, 138, 153
Kreps, David M. .................... 144
Krickeberg's decomposition ......... 128
Krickeberg, Klaus .................. 128
Kyburg, Henry E., Jr. .............. 158

L
Lacey, Michael T. ................... 90
Lamb, Charles W. ................... 152
Laplace, Pierre-Simon ... 16, 19, 21, 157, 206
Law of large numbers ............ 71-88
  and Monte Carlo simulation ........ 83
  and Shannon's theorem ............. 79
  Erdős-Rényi ....................... 88
  strong ........................ 73, 85
    and the Glivenko-Cantelli theorem 80
  weak .............................. 72
Law of rare events .................. 13
Law of the iterated logarithm ... 138, 154, 156
  for Brownian motion .............. 177
Law of total probability ............. 5
Lebesgue
  differentiation theorem ..... 140, 155
  measurable ...... 30, see also Measurable
  measure .......... 25, see also Measure
Lebesgue, Henri Léon ....... 52, 113, 140
Lebon, Isabelle .................... 180
Le Marchand, Arnaud ................ 180
Leonard, Bill ....................... 86
Levi's monotone convergence theorem . 46
Levi, Beppo ......................... 52
Lévy's
  Borel-Cantelli lemma ............. 136
  concentration inequality ......... 112
  equivalence theorem .............. 155
  forgery theorem .................. 178
Lévy, Paul ... 90, 92, 99, 112, 117, 136, 155, 160, 178, 180, 200
Li, Robert Shun-Yen ........... 149, 157
Liapounov [Lyapunov], Aleksandr Mikhailovich 114
Lindeberg, Jarl Waldemar ........... 115
Lindvall, Torgny .................... 89
Liouville, Joseph ........... 11, 16, 108
Lipschitz continuous .......... 147, 193
Lipschitz, Rudolf Otto Sigismund ... 152
Littlewood, John Edensor ... 138, 140, 143, 187
Loeb, Peter A. ..................... 152

M
Mahmoud, Hosam M. .................. 151
Markov
  inequality ........................ 43
  property ...... 104, see also Strong Markov
    of Brownian motion ........ 160, 163
Markov, Andrei Andreyevich ...... 43, 89
Martingale ......................... 126
  and likelihood ratios ............ 152
  continuous-time .................. 187
  convergence theorem ... 134, see also Doob's
  representations .................. 195
  reversed ......................... 155
  transforms ....................... 127
Mass function .................... 7, 14
Mattner, Lutz ....................... 64
Mauldin, R. Daniel .................. 90
Maximal inequality ...... 74-76, 87, 189
Maxwell, James Clerk ............... 117
Mazurkiewicz, Stefan ............... 157
McShane, Edward James .............. 199
Mean .......... 39, see also Expectation
Measurable
  function .......................... 35
  Lebesgue .......................... 30
  set ............................... 24
  space ....... 25, see also Measure space
Measure ............................. 24
  absolutely continuous ............. 47
  counting .......................... 33
  finite product .................... 54
  Lebesgue .......................... 25
    invariance properties ........... 33
  probability ....................... 25
  space ............................. 25
  support of ...... 33, see also Support
Merton, Robert C. .................. 144
Meyer, Paul-André .................. 201
Minkowski's inequality .............. 40
  conditional ...................... 121
Minkowski, Hermann .................. 40
Mixture ............................. 51
Modulus of continuity .......... 77, 182
Monotone class ...................... 30
  theorem ........................... 30
Monotone convergence theorem ... 46, see also Levi
  conditional ...................... 121
Monte Carlo
  integration ....................... 84
  simulation ... 83, 84, see also Law of large numbers
Mukherjea, Arunava .................. 69

N
Nash's Poincaré inequality ......... 116
Nash, John .................... 116, 117
Negative part of a function ......... 38
Nelson, Edward ..................... 180
Newton's method ..................... 87
Newton, Isaac ....................... 87
Nikodym, Otton Marcin ............... 47
Normal distribution ... 11, see also Distribution
Normal numbers .................. 86, 90
Norris, James R. .................... 90

O
Okamoto, Masashi .................... 87
Optional stopping
  continuous-time .................. 187
Optional stopping theorem .......... 130
Options ............................ 144
Orthogonal ......................... 203
Orthogonal decomposition theorem ... 203

P
Paley, Raymond Edward Alan Christopher 86, 157
Paley-Zygmund inequality ............ 86
Parseval des Chênes, Marc-Antoine .. 117
Parseval's identity ................ 117
Percolation ......................... 82
Pérez-Abreu, Victor ................ 117
Perrin, Jean Baptiste .............. 180
Philipp, Walter ..................... 90
Picco, Pierre ...................... 113
Piecewise continuous function ....... 11
Pitman's L^2 inequality ... 153, see also Doob's strong L^p-inequality
Pitman, James W. ................... 153
Plancherel theorem ............. 99, 115
Plancherel, Michel ......... 34, 97, 115
Pliska, Stanley R. ................. 144
Poincaré inequality ................ 116
Poincaré, Jules Henri ...... 16, 22, 159
Point-mass .......................... 25
Poisson distribution ... 9, see also Distribution
Poisson, Siméon Denis ............ 9, 19
Poissonization ...................... 10
Pólya's urns ....................... 152
Pólya, George ............. 34, 117, 152
Positive part of a function ......... 38
Positive-type function ............. 113
Power set ........................... 24
Previsible process ................. 127
Probability space ... 25, see also Measure space
Product
  measure ........... 54, see also Measure
  topology .......................... 58

Q
Quadratic variation ... 113, see also Brownian motion

R
Rademacher's theorem ............... 147
Rademacher, Hans ............... 89, 147
Radon, Johann ....................... 47
Radon-Nikodym theorem ............... 47
Raikov's ergodic theorem ........ 88, 90
Raikov, Dmitrii Abramovich ...... 88, 90
Ramsey number ....................... 81
Ramsey, Frank Plumpton .............. 81
Random
  permutation ... 3, 9, 22, 116, 155, see also Central limit theorem, Ville's
  set ............................... 64
  variable .......................... 35
    absolutely continuous ....... 10, 14
    discrete ..................... 7, 14
  vector ............................ 14
Random walk ........................ 131
  gambler's ruin formula ...... 133, 152
  nearest neighborhood ............. 132
  simple ........................... 132
Reflection principle .......... 175, 179
Regazzini, Eugenio ................. 158
Rennie, Andrew ..................... 157
Rényi, Alfréd ................... 88, 89
Reversed martingale ................ 155
Révész, Pál ........................ 152
Revuz, Daniel ...................... 201
Riemann, Georg Friedrich Bernhard .. 113
Riemann-Lebesgue lemma ............. 113
Riesz representation theorem ........ 49
Riesz, Frigyes ................. 49, 163
Rosenlicht, Maxwell ................. 16
Ross, Stephen A. ................... 144
Rubinstein, Mark ................... 144

S
Schnorr, Claus-Peter ................ 90
Scholes, Myron ................ 144, 145
Schrödinger, Erwin .................. 86
Schwarz's lemma .................... 108
Schwarz, Hermann Amandus ....... 40, 108
Semimartingale ..................... 126
Shannon's theorem ............... 78, 79
Shannon, Claude Elwood .......... 78, 79
Shultz, Harris S. ................... 86
Sierpiński, Wacław .............. 51, 64
Sigma-algebra ....................... 23
  Borel ............................. 24
  generated by a random variable 49, 121
Simple function ..................... 37
Simple walk ... 132, 164, see also Random walk
Simpson, Thomas ..................... 22
Simulation ... 83, see also Law of large numbers
Skolem, Thoralf Albert .............. 81
Skorohod [Skorokhod], Anatoli Vladimirovich 116, 178
Skorohod embedding ................. 118
Skorohod's theorem ................. 116
Slutsky, Evgeny ................. 50, 52
Slutsky's theorem ................... 50
Smokler, Howard E. ................. 155
Solovay, Robert M. .............. 23, 34
Spencer, Joel ....................... 89
Stable distribution ................ 178
Standard deviation .............. 12, 67
Standard normal ... 11, see also Distribution
Stark, Philip B. ................... 115
Steinhaus probability space ......... 44
  and weak convergence ............. 115
Steinhaus, Hugo .......... 44, 89, 138, 159
Stigler, Steven Mack ................ 89
Stimm, H. ........................... 90
Stirling's formula ... 21, 22, 82, 156, see also de Moivre's formula
Stirling, James ..................... 21
Stochastic integral ... 179, see also Itô integral
Stochastic process ................. 126
  continuous-time .................. 181
Stone, Charles J. ................... 89
Stopping time ................. 129, 187
  simple ........................... 130
Strassen, Volker ................... 152
Strong law of large numbers ... 73, see also Law of large numbers
  Cantelli's ........................ 85
  for dependent variables ........... 85
  for exchangeable variables ....... 155
Strong Markov property ............. 124
Stroock, Daniel W. ................. 117
Submartingale ...................... 126
  continuous-time .................. 200
Supermartingale .................... 126
  continuous-time .................. 200
Support ............................. 33
Surgailis, Donatas ................. 117
Szekeres, Gábor [György] ............ 81

T
Tandori, Károly .................... 117
Taylor, Samuel James ............... 201
Thomas, Joy A. ...................... 89
Tonelli, Leonida .................... 55
towering property of conditional expectations 123
Triangle inequality ............. 38, 41
Trigonometric polynomial ........... 205
Trotter, Hale Freeman .............. 117
Turing, Alan Mathison .............. 117
Turner, James A. ................... 152

U
U-statistics .................. 151, 155
Ulam, Stanisław ..................... 84
Uncertainty principle ...... 51, 52, 115
Uncorrelated random variables ....... 67
Uniform distribution ... 11, see also Distribution
  on S^{n-1} ....................... 103
Uniform integrability .......... 51, 154
Uniqueness theorem ... 99, see also Characteristic function

V
Varberg, Dale E. ................... 151
Variance ........................ 11, 62
  conditional ...................... 152
Veech, William A. .................. 117
Viète, François .................... 117
Ville, Jean ................... 117, 152
Vinodchandran, N. V. ................ 90
von Neumann, John ............... 48, 84
von Smoluchowski, Marian ........... 159

W
Wagon, Stanley ...................... 90
Wald's identity ............... 131, 153
Wald, Abraham ................. 131, 153
Wallis, John ....................... 114
Walter, Gilbert G. ................. 117
Weak convergence .................... 91
Weak law of large numbers ... 72, see also Law of large numbers
Weaver, Warren ...................... 78
Weierstrass approximation theorem ... 77, see also Bernstein
  Hoeffding's refinement ............ 89
  Kac's refinement .................. 78
Weierstrass, Karl Theodor Wilhelm ... 77
Weyl, Hermann ....................... 52
White noise ................... 178, 179
Wiener process ... 180, see also Brownian motion
Wiener, Norbert ... 157, 159, 166, 168, 179
Williams, David .................... 152
Wintner, Aurel ..................... 138
Wold, Herman O. A. ................. 102
Wong, Chi Song ........................22
Woodroofe, Michael ................. 154

Y
Yor, Marc .......................... 201
Young integral ................ 199, 200
Young's inequality .................. 51
Young, Laurence Chisholm ........... 199
Young, William Henry ................ 51

Z
Zabell, Sandy L. ................... 117
Zeitouni, Ofer ..................... 117
Zero-one law
  Blumenthal's ..................... 177
  Kolmogorov's ................. 69, 136
Zygmund, Antoni ........... 86, 157, 158
This is a textbook for a one-semester graduate course in measure-theoretic probability theory, but with ample material to cover an ordinary year-long course at a more leisurely pace. Khoshnevisan's approach is to develop the ideas that are absolutely central to modern probability theory, and to showcase them by presenting their various applications. As a result, a few of the familiar topics are replaced by interesting non-standard ones.
The topics range from undergraduate probability and classical limit theorems to Brownian motion and elements of stochastic calculus. Throughout, the reader will find many exciting applications of probability theory and probabilistic reasoning. There are numerous exercises, ranging from the routine to the very difficult. Each chapter concludes with historical notes.

ISBN 0-8218-4215-3

For additional information and updates on this book, visit
www.ams.org/bookpages/gsm-80

www.ams.org