(t) = JQ'x' - (log x)ke - x dx. (b) Show by partial integration that [(t + l) = t[(t) and hence that [(n = n! for integral n. (c) Fro m (18.10) deduce r< i) = ..[;. has volume (see Example 18.5) (d) Show that the unit sphere in
Rk
(18.19)
v:
k
=
---f( ( k/2) + 1) - ' 'TT' k /2
+ 1)
SECTION 19. THE
LP SPACES
x)x-2
By partial integration prove that /�( sin cos dx = rr.
18.19.
x)jx)2 dx
241 = rrj2 a nd f�J.l
-
x
Suppose that JL is a probability measure on ( X, ff) and that, for each i n X, measure on Suppose further that, for each B in W, vx is a probability ' vx( B is, as a function of x, measurable ff. Regard the JL(A) as initial probabilities and the v ( B as transition probabil ities: (a) Show that, if E ffx W, then vxf y: y) £] is measurable ff. (b) Show that rr( E) = fxvx fy: y) E]JL(dx) defines a probability measure this is just ( 18.1). on ffx W. If vx = v does not depend y )vx(dy) (c) Show that if f is measurable ffx W an d nonnegative, then f is measurable ff. Show further that
18.20.
(Y, W).
)
) (x E E , (x, Eon x, x
yf(x,
f( x , y )rr(d( x, y )) J [Jt( x , y ) vx( dy )] JL ( dx),
jX X Y
=
X
Y
which extends Fubini's theorem (in the probability case). Consider also f's that may be negative. (d) Let v(B) = fxvx(B)JL(dx). Show that rr(X X B) = v( B) and
l t< Y) " ( dy) = f lr i t< y)vx( dy) ] JL (dx) . y
SECTION 19. THE
LP
X
y
SPACES *
Definitions
Fix a measure space (f!, !F, JL). For 1 < < oo, let LP = LP( f! !F, JL) be the class of (measurable) real functions f for which IJI P is integrable, and for f in LP, wri t e
p
I f li P
( 19 . 1 )
fp ] [ l = jltl p dJL
There is a special definition for the case IS
( 1 9 .2 )
,
p
= oo:
The
essential supremum of f
II t I = in f[ a : 11- [ w It( w) I > a] o] ; to consist of those f for which this is finite. The spaces LP have a oo
:
=
take L"' geometric structure that can usefully guide the intuition. The basic facts are laid out in this section, together with two applications to theoretical statistics. *
Tht: results in this section are used nowhere else in the book. The proofs require some elementary facts about metric spaces, vector spaces, and convex sets, and in one place the Radon-Nikodym theorem of Section 32 is used. As a matter of fact, it is possible to jump ahead and read Section 32 at this point, since it makes no use of Chapters 4 and 5
INTEGRATION
242
oo , and are indices if For 1 + - I =1 1; - and are also conjugate if one is 1 and the other is oo (formally, 1 - + oo 1 = 1). says that if and are conjugate and if f E LP and g E U, then fg is integrable and
p-I q
p q
( 19.3)
q p Minkowski's inequality
oo and is a probability This is obvious if = 1 and = oo. If 1 measure, and if f and g are simple functions , (19 .3) is (5.35). Bat the proof in Section 5 goes over without change to the general case. says that if f, g E LP (1 < < oo ) , then f + g E LP :md
J.L
p
( 19.4) This is clear if
+ lgi P )l iP, f +
that 1 < p < Since If+ gl < 2(1 f l P pg =doesorliepin= LPSuppose Let q be conjugate to Since p- 1 = pjq, 1
oo.
Holder's inequality gives
.
oo .
p.
I !+ gil� = fit+ gi P dJ.L < fitl · l f + gl p / q dJ.L + jlgl · l f + g l p j q dJ.L < II !l i P . II I f+ g I p j q l q + llg li P . II I t + g I p j q l q q / = (I I ! l i P r llgll p ) [ flt + gl p dJ.L r = ( II f li P + llgll p ) llt + gll�1 q.
p-pjq
Since = 1 , (19.4) follows.t If a is real and f E LP, then obviously
af E LP and ( 19 .5) ll aJII P = l a i · IIJII P . Define a metric on LP by d (f, g) = I ! - gl l p · Minkowski's inequality gives the triangle inequality for dP , and d is certainly symmetric. Further, P to 0, that is, f = g almost deverywhere. /f, g ) = 0 if and only if If - gPI P integrates To make LP a metric space, identify functions that are equal almost everywhere.
t The Holder and Minkowski inequalities can also be proved by convexity arguments, see Problems 5.9 and 5.10.
SECfiON 19. THE
LP SPACES
243
If II ! -fJ p -+ 0 and < oo, so that J lf- f,I P dJL -+ 0, then f, is said to converge to If f f' and almost everywhere, then f + = f' + almost every where, and for real a, af af' almost everywhere. In LP, f and f' become the same function, and similarly for the pairs and 1, f + and f' + 1, and af and af'. This means that addition and scalar multiplication are unam biguously defined in LP, which is thus a real vector space. It is a in the sense that it is equipped with a 1 1 - I P satisfying (19.4) and (1 9.5).
p fin gthe glmean of order p.
=
=
g gl g g g norm
=
vector space
g normed
Completeness and Separability
Banach Riesz-Fischer theorem, The space LP is complete.
A normed vector space is a space if it is complete under the corresponding metric. According to the this is true of LP; Theorem 19.1.
It IS enough to show that every fundamental sequence contains a convergent subsequence. Suppose first that < oo. Assume that I !m - /,ii P -+ oo , and choose an increasing sequence 0 as so that ll fm -fJ� < 2 - < p + l )k for m, n > n k. Since J l fm -f,I P dJL > aPJL[Ifm -f,l > a] (thisk is just a general version of Markov's inequality (5.3 1)), JL[I f, - fml > 2- ] < 2Pk l fm - fJ� < 2 - k for Therefore, L 1 JL[I f, + I -J, k l > 2 k ] < oo, and it follows by the first Borei-Ca;ltelli lemma (which works for arbitrary measures) that, outside a set of wmeasure 0, L:kl f, k + I - !, k I converges. But then f, k converges to some f almost everywhere, and by Fatou ' s lemma, !If - f,kl P dJL < lim inf; ! I f,; - /,� I P dJ-t < 2 - k _ Therefore, f E L P and I ! - !,� l i P -� 0, as required. If = oo, choose { } so that 11/m - /,l i "" < 2- k for m , > Since k If, � + I -f, I < 2-k almost everywhere, /, � converges to some f, and if -f,k I < • 2 - k almost everywhere. Again, II!-f,) p -+ 0. PROOF.
p
m , n -+
{n k }
m, n >nk .
p
k
-
n nk.
n
k
The next theorem has to do with separability.
Let U be the set of simple functions L:�: 1aJ8 for a; and JL( B ) finite. For < p < U is dense in LP. af i n i t e and !F is countably generated, and if p < then LP (ii) lf separable. Theorem 19.2. ;
1-L is
(i)
1
I
oo ,
is
oo ,
of (i). Suppose first that p < For f E LP , choose (Theo Proof rem 1 3.5) simple functions J, such that fn -+ f and If, I < I f l. Then f, LP, PRooF.
and by the dominated convergence theorem,
oo.
E
J l f -J,I P dJL -+ 0. Therefore,
244
INTEGRATION
for some n; but each fn is in U. As for the case p if In2! n-fn> 1 i /P 1<""'Ethen n the fn defined by (13.6) satisfies I / - fJ"" 2- ( <E for large n). Proof of (ii). Suppose that !F is generated by a countable class � and that f! is covered by a countable class !!J of .r-sets of finite measure. Let Epconsisting £2 , be an enumeration of �u !!J; let .9,. ( n > 1 ) be the partition of the sets of the form where is E; or Et; and let · · be the field of unions of the sets in .9,. Then .ro U � is a countable field that generates !F, and J.L is £T-finite. .ra. Let V be the set E .ra, and J.L ( A;) < of simple functions L:;"= 1a;IA for a; rational, Let g L:;"= 1 a; be the element of U constructed in the proof of part (i). and any a; that vanish can Then I ! -g i P < E, the a; are rational by be suppressed. By Theorem 1 1.4(ii), there exist sets A; in .ro such that and then h L:;"= 1 aJA . lies in V provided p < (E/ml a ;DP, J.Land( B;LlA;) I ! - hllp < 2E. But V is countable.t =
<
•
•
oo ,
.
F1 n
·
�
A;
'
l8
=
'
F;
n Fn ,
on
=
=
1
�
oo.
( 1 3.6),
<
=
oo,
•
I
Conjugate Spaces -y
Linear functional on LP is a real-valued function such that y( af + a'f') ay( f ) + a'y ( f') . (19 ) The functional is bounded if there is finite M such that l r( f ) I M ·I I / I l P ( 19. 7) for all f in LP. A bounded linear functional is uniformly continuous on LP because I !- ! ' l i P < EjM implies l y ( f)- y(f' ) l < E (if M > 0; and M 0 implies y(f) 0). The norm ! l r l of y is the smallest M that works in (19.7): sup[ I y (/ )1 / 1 ! l i P : f =I= 0] . l r lSuppose and are conjugate indices and g E U. By Holder's inequality, ( 19 .8) is defined for J E LP and satisfies (1 9.7) if M> l g l q ; and is obviously linear. According to the Riesz representation theorem, this is the most general bounded linear functional in the case p < i s £Tj i n ite, t h at 1 i s Suppose that < and that J.L on LP has t o p. Every form (19.8) for conjugate functi o nal the bounded L i n ear some g E U; further, ( 19 .9) and unique up to a set of J.L-measure 0. A
=
.6
a
<
=
=
=
p
q
'Yg
oo :
Theorem 19.3.
g is
t Part (ii) definitely requi res
p < oo;
see Problem 19.2.
oo,
q
SECTION 19.
THE L P SPACES
245
dual space,
The space of bounded linear functionals on LP is called the or the and the theorem identifies U as the dual of LP. Note that the theorem does not cover the case = oo. t
conjugate space, p Case !: finite. For in define y(!A ). The linearity of ( 19. 7), of y implies that is finitely additive. For the l ·I J.L
PROO F.
Y,
A
cp( A ) =
/A
l
M cp( A ) < M cp li P = M · I J.L( AW1 P . If A = U , A n, where the A n are disjoint, then (/'( A ) = L�� lcp( A ,, ) + cp( u n > N A n), and since l cp( u n N A )I < MJ.Ll /P( U n N A ) � 0, it fol lows that cp is an additive set function in the sense of (32. 1). The Jordan decomposition (32.2) represents cp as the difference of two finite measures cp + and cp - with disjoint supports A + and A -. If J.L( A ) = 0, then cp + ( A ) = cp( A n A + ) < MJ.L1 1 P( A ) = 0. Thus cp + is absolutely continu ous with respect to J.L and by the Radon-Nikodym theorem (p. 422) has an integrable density g + : cp +( A ) = fA g + dJ.L. Together with the same result for cp - , this shows that there is an integrable g such that ·y( !A ) = cp( A) = = Thus = for simple functions in L P. Assume that this lies in U, and define 'Yg by the equation (19.8). Then and are bounded linear functionals that agree for simple functions; since the latter are dense (Theorem 1 9.2(i)), it fol lows by the continuity of and 'Yg that they agree on all of L P. It is therefore enough (in the case of finite J.L) to prove E U. It will also be shown that is at most the M of (1 9.7); since does work as a bound in (19.7), (19.9) will fol low. If r/ ) = 0, (1 9.9) will imply that = 0 al most everywhere, and for the general satisfying r/ ) = must it will follow further that two functions agree almost everywhere. oo. Let gn be simple functions such that 0 1 i > and take n = � P sgn Then h n g = n and since hn < = n J.L r/ ,) = <M is simple, it follows that l = this gives J.L ! < M. Now the M[ fgn Since 1 monotone convergence theorem gives E LP and even < M. = = 1 = l ( )l < M · = oo. In this case, Then J.L > al < for simple functions m L1 • Take = sgn = If < M · II/III = M, this inequality gives J.Ll 0; therefore g il"" = M and E L"' = U. Let A n be set'> such that A n i fi and J.L(A) oo . If II: J.L for J E J.Ln( A ) = J.L(A n A ), then lr(f/.4)1 < M =M. LP( J E L P(J.L) c L P (J.L)). By the finite case, A n supports a gn in Lq such that = for J E LP and < M. Because of uniqueness, gn l can be taken to agree with gn on A n ( LP(J.Ln l ) c LP(J.L)). There is + + therefore a function on n such that = n on A n and M. It E U. By the dominated convergence theorem < M and follows that = limn and the continuity of E LP imp lies = " limn Uniqueness fol lows as before. )= • " >
>
fA gdJ.L f!A gdJ.L. y(f) ffgdJ.L g y 'Yg f
g l gll q
f
y
llgl l q
g
f y(f ) Assumehthat g I l g > a [ MJ.L f/[lgl�a l · lgl dJ.L ffgdJ.L I g II llgll q < l g l > a] = Case a--finite. < · llf!A ) P [f ltl pdJ.LJ' IP fA y(fiA ) ffiA . gn dJ.L , l gJ q g g I l iA gll q < g llgl l q g y, f ff/A g dJ.L ffg dJ.L y(f!A y(f).
y
g
=
=
"
t Problem 1 9 3
246
INTEGRATION
Weak Compactness
f E LP, and g E U, where p and q are conjugate, write ( 19.10) ( f, g ) = ffgdJL . For fixed f in LP , this defines a bounded linear functional on U; for fixed g in U, it defines a bounded linear functional on LP. By Holder' s inequality, For
( 19 .1 1 ) Suppose that and fn are elements of If limn(f�, for each 0, then certainly to f. If in U, then converges weakly to although the converse is false. t The next theorem says in effect that if p > then the unit ball Bf 1 is compact in the topology of weak convergence.
LP. (f, g) = g) f fn weakl g fn converges --> fnl i I ! y P f, 1, = [fELP : 1 / I P < ] Suppose that is uf i n ite and :Y is countably generated. If oo, then eve1y sequence in B f contains a subsequence converging weakly 1to
Theorem 19.4.
>
11-
>
PRoor-.
d
n
m
=
=
tProblem 19.4 tProblem 19.5.
;
'
=
=
SECfiON 19.
THE
LP
SPACES
247
By the Riesz representation theorem (1 q < oo) there is an in L P (the = Since y has norm at space adjoint to U ) such that for all most 1, ( 19.9) implies that f lies in B f. Now in G. Suppose that is an arbitrary for ) = lim,(f,, Then element of U, and choose in G so that
(f, g
f < , y(g) (f, g ) g. 11/ll p < l: g) g g' g llg' - gll q < e .
l( f, g' ) - ( !, , g')l < l ( f , g ') - ( f , g ) l + l( f , g ) - ( f, , g )l + l( f, , g ) - ( J, , g ' ) l < 11/ll p llg' - gll q + l( f, g ) - ( f, , g ) l + 11/, l p llg - g 'll q < 2e + !( f , g ) - ( f, , g ) l. Since g E G, the last term here goes to 0, and hence l ,( f, g ' ) = (f, g ') for all g' in U. Therefore, !, converges weakly to f. im
.
•
Some Decision Theory
The weak compactness of the unit ball in L"" has interesting implications for sratistkal decision theory. Suppose that J-L is a--finite and sr is countably generated. Let , fJ.. be probabil ity densities with respect to J-L-nonnegative and integrating to f1 , 1. Imagine that, for some i, w is drawn from n according to the probability measure P;( A) = fA !; dJ-L. The statistical problem is to decide, on the basis of an observed w, which [; is the right one. Assume that if the right density is [;, then a statistician choosing f1 incurs a nonnegative loss L(jli). A decision rule is a vector function o(w) = ( o 1( w ), . . . , 8 k(w)), where the o;( w ) are nonnegative and add to 1: the statistician, observing w, selects [; with probability o;( w ). If, for each w, o;( w ) is 1 for one i and 0 for the others, o is a nonrandomized rule; otherwise, it is a randomized rule. Let D be the set of all rules. The problem is to choose, in some more or less rational way that connects up with the losses L(jli), a rule o from D. The risk corresponding to o and [; is •
•
•
[;
which can be interpreted as the loss a statistician using o can expect if is the right density. The risk point for o is R(o) = ( R 1(o), . . . , Rk(o)). If R;( o') < R;(o ) for all i and R;(o') R;( o ) for some i-that is, if the point R( o') is " southwest" of R(o)-then of course o' is taken as being better than 8. There is in general no rule bt!tter than all the others. (Different rules can have the same risk point, but they are then indistin guishable as regards the decision problem.) The risk setk is the collection S of all the risk points; it is a bounded set in the first orthant of R . To avoid trivialities, assume that S does not contain the origin (as would happen for example if the L(jli) were all 0). Suppose that o and 8' are elements of D, and A and A' are nonnegative and add to 1. If o;'(w) = AO;( w ) + Xo;(w), then o" is in D and R(o") = A R(o) + A'R(o'). There fore, S is a convex set. Lying much deeper is the fact that S is compact. Given points x < n > in S, choose rules o<"> such that R( o<">) = x <">. Now oJ" >( · ) is an element of L"", in fact of 8 �, and so by Theorem 19.4 there is a subsequence along which, for each j = 1, . . . , k, oJ">
<
248
INTEG RATION
converges weakly to a function o1 in B�. If J.L(A ) < oo, then fo/A dJ.L = lim n fo] n >JA dfL > 0 and J( l LA )lA aJ.L = lim n /0 "L, W >)JA J.L = 0, so that oi > 0 and L,ioi 1 almost everywhere on A. S ince J.L is u-finite, the oi can be altered on a set of J.L-measure 0 in such a way nas to ensure that o = (81, , ok ) is an element of D. risk But, along the subsequence, < > R( o ). Therefore: -
d
-
=
x
The set i s compact and vex. conThe rest is geometry. For x in Rk , let Qx be the set of x' such that 0 <x; <x1 for all If x R(o) and x' = R(o'), then o' is better than o if and only if x ' E Qx and is admissible if there exists no better t han o; it makes no sense to x'use a ruleA rule that is not admissible. Geometrically, admissibility means that. for x = R(o), Q x consists of x alone. i. * X.
=
•
-->
0
•
•
01
Sn
>
x
Let = R(o) be given, and suppose that o is not admissible. Since S n Qx is compact, it contains a po int nearest the origin unique, since S n Qx is convex as well as compact); let o' be a conesponding rule: = R(o'). Since o is not admissible, it would x' and o' is better than o. If S n Q x · contained a point distinct from ' be a point of S n Q, nearer the origin than x , which is impossible. This means that Q x · contains no point of S other than x ' itself, which means in turn that 8' is admissible. Therefore, if o is not itself admissible, there is a o ' that is admissible and is bette1 than o . This is expressed by saying that the class of admissible rules is
x'
=l=x,
(x'x'
x',
complete. Let p = (p1, , Pk ) be a probability vector, and view P; as an a priori probability that [; is the correct density. A rule o has Bayes risk R( p , o) = [1p1R1(o) with respect to p. This is a kind of compound risk: [; is correct with probability p1, and the statistician chooses fj with probability 8/ w ). A Bayes rule is one that minimizes the Bayes risk for a given p. In this case, take a = R(p, o) and consider the hyperplane H = [ z : � P;Z; = a ] ( 19. 12) •
•
•
and the half space
( 19.13 )
[
H+= z· " l..., p .z. > a ·
i
,
r -
]
•
Then x = R(o) lies o n H, and S is contained in H+: x is on the boundary of S, and
SECTION 19.
TH E
LP
SPACES
249
H is a supporting hyperplane. If P; > 0 for all i, then Qx meets S only at x , and so o is admissible. Suppose now that o is admissible, so that x = R( o ) is the only point in S n Q x and x lies on the boundary of S. The problem is to show that o is a Bayes rule, which means finding a supporting hyperplane (19.12) corresponding to a probability vector p. Let consist of those for which Q Y meets S. Then is convex: given a convex combination = of points in choose in S points and southwest of and respectively, and note that = lies in S and is southwest of Since S meets Qx only in the point x, the same is true of so that x is a boundary point of as well as of S. Let (19.12) ( p -=F 0) be a supporting hyperplane through x x H and H+. If P,· < 0, take Z; = X; 1 and take Z ; = x ,. for the other i; then lies in but not in H+, a contradiction. (The right-hand figure shows the ro le of and only H2 the planes H 1 and H2 both support S, but only H2 supports corresponds to a probability vector.) Thus P; > 0 for all i, and since f.; P; = 1 can be arranged by normalization, o is indeed a Bayes rule. Therefore
T y y" Ay + A' y ' y', T E Tc T U
T, z" Az + A'z' +
(I
tl
are Bayes rules, and they form a complete class. The S pace
T
z z'
yy". T, z T: T The admissible rules
L2
The space is special because = 2 is its own conjugate index. If f, g E L the is well defined, and by (1 9. 1 1 ) -write 11!1 1 in (f, place of IIJ II 2 -K/, This is (or Cauchy-Schwarz) inequality. If one of f and is fixed, (f, is a bounded (hence continuous) f ), the norm is giver. by linear functional in the other. Further, (f, is complete under the metric A 11! 11 = ( / f ), and is a vector space on which is defined an inner product having all these properties. The Hilbert space is quite like Euclidean space. If (f, = 0, then f and are and orthogonality is like perpendicularity. If f1 , , fn a1 e orthogonal (in pairs), then by linearity, (L.J;, L.J) = L;Lif;, f) = L.; (f; , f): IL.; /; 11 = L; /;1 1 . This is a version of the Pythagorean theo rem. If f and are orthogonal, write f .l For every f, f .l 0. Suppose now that 11- is a--finite and !F is countably generated, so that C is separable as a metric space. The construction that follows gives a sequence • • • that is in the sense that llcpnll = 1 for (finite or infinite) cp and is all and (cpm , cpn ) 0 for in the sense that (f, cp ) = 0 for all implies f = 0-so that the orthonormal system cannot be enlarged. Define 1 , Start with a sequence ... f2 , . . . that is dense in have been defined and are inductively: Let 1 f1 • Suppose that P • • • , orthogonal. Define is L.j= 1 ; ; where if Then is orthogonal to and is arbitrary if and f 1 is a linear combination of 1 , • • • , 1 • This, the Gram-Schmidt method, • • • with the property that the finite gives an orthogonal sequence linear combinations of the include all the fn and are therefore dense in 2 If = L• 0, take if 0, discard it from the sequence. Then cp 1 , cp 2 , . . . is orthonormal, and the finite linear combinations of the are are 0, in which still dense. It can happen that all but finitely many of the
L2 p 2, inner product g g))I <=l ffflgd· l JLg l!. the Schwarz g)g) (g, g = 2 d(f, g)= l f g L2 l . Hilbert space g) L2 g orthogonal, 2 2 ! l g g. cp orthonormal , 2 n n = m =I= n, complete g L2. , g f �> 2 g =gn+ l =fn+l - ang g gn ani (J�+ �> g;)/l l g; l 2 .. gw g; = O.g ggn+l g;=I=n + O ,gn, + n g , g1, 2 gn gn =I= 'Pn gn!ll gJ; gn = 'Pn gn ,
•
P
,
•
•
250
INTEG RATION
case there are only finitely many of the 'Pn · In what follows it is assumed that cp 1 , cp 2 , • • • is an infinite sequence; the finite case is analogous and somewhat simpler. Suppose that f is orthogonal to all the 'Pn · If ar� arbitrary scalars, then f, a 1 cpp · · · • a n cpn is an orthogonal set, and by the Pythagorean property, II ! - '£7= ,a;cp1ll 2 = 11!11 2 + 'L;= 1 T > 11!11 2 . If 11!11 > 0, then f cannot be ap proximated by finite linear combinations of the cpn, a contradiction: cp 1 , cp 2 , • • •
a1
a is aConsider completenoworthonormal s stem. y a sequence a 1 , a 2 , . • • of scalacs for which '[� a ,2 converge s. If sn = '£7= a 1cp , then the Pythagorean theorem gives llsn - sm11 2 = 'Lm nar Since the scalar series converges, {sJ is fundamental and therefore by Theorem 1 9 . 1 converges to some g in L2 • Thus g = lim n 'L7= 1 a1cp1, which it is namral to express as g '[� 1 1cp1• The series (that is to say, the sequence of partial sums) converges to g in the mean of order 2 (not almost everywhere). By the following argument, every element of L 2 has a unique representation in this form. The Fourier coefficients of f with respect to {cp) are the inner products n, 0 < II ! - 'L7= 1 a 1 cp;ll 2 = 11!11 2 - 2'L, a 1(f, cp) + a'L1 =a 1 (f,aicpcp).cp For11! 11each 2 - 'L 7= 1 a ! , and hence, n being arbitrary, '£7= ,a; < 11!11 2 • 1• ) u By the argument above, the series '£7= 1 a 1 cp; therefore converges. By linearity, (f- 'L7= 1 a1cp;, cp) = 0 for n > j, and by continuity, (f- '[�= 1 a cp , cp) = 0. Therefore, J- 'L7= 1 a1cp1 is orthogonal to each cpi and by completeness must ,
=1
;
=
=
<;,
a
=
;
be
1
0:
( 19.14 )
t=
00
i
I: ( f, cp; )'P; · =I
'[�=1a;cp;
Fourier representation a basis ai subspace
is This is the of f. It is unique because if f = 0 ('L ; < oo), then = (f, cp) = 0. Because of (19. 14), {cpn } is also called an orthonormal for L2 • if it is closed both algebraically (f, f' E M A subset M of L2 is a implies af + a'f' E M) and topologically (fn E M, fn � f implies f E M). If L2 is separable, then so is the subspace M, and the construction above carries over: M contains an orthonormal system {cpn } that is M, in the sense that f = 0 if (f, cpn) = 0 for all if f E M. And each f in M has the unique Fourier representation (19.14). Even if f does not lie in M, '£7= ,(f, cpy converges, so that '£7= ,(f, cp)cp; is well defined. This leads to a powerful idea, that of onto M. For an orthonormal basis {cp,} of M, define PM f= 'L7= 1(f, cp;)cp1 for all f in L2 (not just for f in M). Clearly, PM f E M. Further, f - 'L7= 1(f, cp;)cp; .l cpj for > by linearity, so that f - PMf .l cpi by continuity. But if f - PMf is orthogonal to each cpi , then, again by linearity and continuity, it is orthogonal to the general element 'L]= tbjcpj of M. Therefore, PMfE M and J- PMf .l M. The map f � PMf is the orthogonal projection on M.
complete in
nand orthogonal projection
nj
SECTION
THE L P SPACES
! 9.
251
The fundamental properties of PM are these: {i) (ii) (iii) (iv)
g E M and f - g .l M together imply g = PMf; f E M implies PMf = f; g E M implies I I ! - g il > I I ! - PMf ll; PM(a f + a'f' ) = a PMf + a' PM f'·
Property (i) says that PMf is uniquely determined by the two conditions PM f E M and J - PM .l M. To prove it, suppose that g, g ' E M, J - g .l. M, and f - g' .l M. Then g - g' E M and g - g' .l M, so that g - g' is orthogo 2 nal to itself and hence ll g - g ' ll = 0: g = g' . Thus the mapping PM is independent of the particular basis {cp;}; it is determined by M alone. Clearly, (ii) follows from (i); it implies that PM is idempotent in the sense t hat P'/4 f = PM f. As for {iii), if g lies in M, so does PM f - g, so that, by the 2 2 2 2 Pythagorean relation, II ! - g ll = II! - PM.f l l + IIPMf - g ll > I I ! - PMJ II ; the inequality is strict if g =I= PM f. Thus PM f is the unique point of M lying nearest to f. Property (iv), linearity, follows from (i). An
Estimation Problem
Fir�t, the technical setting: Let (!l, .9", ,u ) and (e, G', '7T) be a a--finite space and a probability space, and assume that .9" and G' are countably generated. Let fe(w) be a nonnegative function on e X n, measurable G'x .9", and assume that fn fi w )JL( dw) = 1 for each (J E e. For some unknown value of e, w is drawn from n according to the probabiiities P6( A ) = fA fe( w )JL( dw ), and the statistical problem is to estimate the value of g( e), where g is a real function on e. The statistician knows the func tions f( . ) and g( ) as well as the value of w; it is the value of (J that is unknown. For an example, take !l to be the line, f(w ) a function known to the statistician, and fe( w ) = af(aw + {3 ), where (J = ( a, {3) specifies unknown scale and location parameters; the problem is to estimate g(e) = a, say. Or, more simply, as in the exponential case ( 1 4.7), take fe( w) = af(aw ), where (J = g((J) = a. An estimator of g((J) is a function t(w). It is unbiased if ·
{ 19 . 15 )
·
,
1 t ( w ) f6( w )JL(dw ) = g ( e ) fl
for all e in e (assume the integral exists); this condition means that the estimate is on target in an average sense. A natural loss function is ( t ( w) - g(e))2, and if f6 is the co rrect density, the risk is taken to be fn(t(w) - g(e ))2fe( w )JL( de). If the probability measure '7T is regarded as an a priori distribution for the unknown e, the Bayes risk of t is ( 19.16)
R('1T, t ) = 1. 1 ( t ( w ) - g ( e ) ) 2 f6 (w ) JL ( dw ) '1T ( de ) ; e n
this integral, assumed finite, can be viewed as a joint integral or as an i terated integral (Fubini's theorem). And now t0 is a Bayes estimator of g with respect to '7T if it minimizes R( '7T, t) over t. This is analogous to the Bayes rules discussed earlier. The
252
INTEGRATION
followmg simple projection argument shows that, except in trivial cases, no Bayes estimator is unbaised t Let Q be the probability measure on G'x !f having density t6(w) with respect to 1r X JL, and let 2 be the space of square-integrable functions on (E) X !l, G' X !f, Q). Then Q is finite and G'x !f is countably generated. Recall that an element of 2 is an equivalence class of functions that are equal almost everywhere with respect to Q. Let G be the class of elements of [2 containing a function of the form g( e, w) = g( w) -functions of () alone Then G is a subspace. (That G is algebraically closed is clear; if t, E G and lit, - til -> 0, then-see the proof of Theorem 19.1-some subse quence converges to f outside a set of Q-measure 0, <"TJd it follows easily that t E G.) Similarly, let T be lhe subspace of functions of w alone: /(e, w) = t(w). Consider only functions g and their estimators t for which the corresponding g and t are in e. Suppose now that t0 is both an unbiased estimat�r of g <2 and a Bayes estimator of g0 with respect to rr. By (19 16) for g0, R(rr, t) = lit - g0ll , and since t0 is a Bayes estimator of g0, it follows that 1�0 - g0ll 2 < lit - g0ll 7 for all t i n T. This means that (0 is the orthogonal projection of g0 on the subspace T and hence that g 0 - (0 1. t0• On the other hand, from the assumption that t0 is an unbiased estimator of g0, it follows that, for every g(e, w) = g(e) in G,
L
L
( to - go , g ) = J J ( t0 ( w ) - g0( 0 )) g ( 6 ) /6 { w )JL( dw ) rr( de) 0 l1
[
]
= j g ( e ) J ( t0 (w ) - g0 ( 6 ) ) t6 ( w )JL ( dw) rr ( d6 ) = 0. 0
l1
This means that t0 - g0 l. G: g 0 is the orthogonal projection of t0 on the subspace G But g 0 - t0 l. t0 and t0 - g0 l. g0 together imply that t0 - g0 is orthogonal to itself: t0 = g0. Therefore, t0(w) = t0(e, w) = g0(e, w) = g0(6) for (6, w) outside a set of Q measure 0. This implies that t0 and g0 are essentially constant. Suppose for simplicity that fe(w) > 0 for all (6, w ), so that (Theorem 15.2) ( rr X JL)[(e, w ): t0(w) * g0(6)] = 0. By Fubini's theorem, there is a () such that, if a = g0(6), then JL[w: t0(w) * a] = O; and there is an w such that, if b = t0(w), then rr[e: g0(6) * b] 0. Jt follows that, fo r (6, w) outside a set of ( rr X JL)-measure 0, t0(w) and g0(6) have the common value a = b: rr[e: g0(6) = a] = 1 and P6[w: t0(w) = a] = 1 for all 0 in e. =
PROBLEMS 19.1.
Suppose that
JL(!l) < oo and t E L . Show that lltii P i II t i l,. "'
Show that L"'((O, 1], £13', A ) is not separable. (b) Show that P((O, 1], £13', JL) is not separable if JL is counting measure (JL is not a-finite). (c) Show that P( fi , 5', P ) is not separable if (Theorem 36.2) there is on the space an independent stochastic process [X, : 0 < t < 1] such that X, takes the valu es 1 with probability I each ( 5' is not countably generated).
19.2. (a)
+
L L
t This is interesting because of the close connection between Bayes rules and admissibility; see BERGER, pp. 546 If.
SECTION 19.3.
19.4.
19.
THE LP SPACES
253
Show that Theorem 19.3 fails for Banach limit of nJ(:I"f(x) dx.
L"'((O, 1],
£13',
Hint: Take y( f) to be a
A ).
L P((O,
1], .£13', A ). Consider weak c9nvergence in (a) For the case p = oo, find functions fn and f such that fn goes weakly to f but II!- [,,II does not go to 0. (b) Do the same for p = 2.
P
19.5.
Show that the unit ball in U((O, 1], £13', A ) is not weakly compact.
19.6.
, Pk ) may not be admissible Show that a Bayes rule corresponding to p = p if P; = 0 for some But there will be a better Bayes rule that admissible
19.7.
(
�>
is i The Neyman-Pearson lemma. Suppose f1 and /2 are rival densities and L(Ji i ) is 0 or 1 as j = i or j -=F i, so that R;(b) is the probability of choosing the opposite density when [; is the right one. Suppose of o that 8 2( w ) = 1 if f2(w ) > tfi( w ) and oz( w ) 0 if f2(w ) < tfi( w ) where t > 0 Show that b is adm!ssible: For any ule o', fo; r dJL < foz r dJL implies Wr h dJL fo r f dJL. Hint. f(o - o;) =
•
,
> f f ([2 - tf1 ) dJL > 0, since the integrand is nonnegative. l
19.8.
2
2
The classical orthonormal basis for r z [o, 2'7T] with Lebesgue measure is the trigonometric system ( 1 9.17) (2'7T) - I ,
'7T - 1/2 sin nx
,
l '7T - /2 cos tiX ,
Hint:
n = 1,2 .. ,
e inx
+ Prove orthonormality. Express the sines and cosines in terms of multiply out the products, and use the fact that M dx i s 2'7T or 0 as = 0 or -=fo 0. (For the completeness of the trigonometric system, see Problem 26.26.)
e-;,\ m m 19.9.
.,.eimx
Drop the assumption that [2 is separable. Order by inclusion the orthonormal systems in L 2 , and let (?..orn's lemma) = [cp,: be maximal. (a) Show that f1 = [ (f, cp, ) -=F 0] is countable. Use 'Lj= 1( f, cp1 ) 2 < 11[11 2 and the argument for Theorem 10.2(iv). (b) Let Pf = Lr E r ( f, cp )cp . Show that f - Pf l. clJ and hence \maYimality) f = Pf. Thus ¢ r s a f-t ortiJon�rmal basis. (c) Show that is countable if and only if [2 is separable. and (d) Now take to be a maximal orthonormal system in a subspace define PMf = 'L E r (f, cp)cp1• Show that PMf E M and f - PMf l. , that g = PMg if g , ")'an a that f - PMf l. M. This defines the general orthogonal projection.
y:
EM
yEn Hint·
M,
C H A PTER 4
Random Variables and Expected Values
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS
This section and the next cover random variables and the machinery for dealing with them-expected values, distributions, moment generating func tions, independence, convolution. Random Variables and Vectors
random variabl e on a probability space f!, !F, P) is a real-valued function X = X(w) measurable Sections 5 through 9 dealt with random variables of A
(
!F.
a special kind, namely simple random variables, those with finite range. All concepts and facts concerning real measurable functions carry over to ran dom variables; any ch3nges are matters of viewpoint, notation, and terminol ogy only. The positive and negative parts x+ and x- of X are defined as in (15.4) and (15.5). Theorem 13.5 also applies: Define
( k - 1 )2- n ( 20.1)
n If X is nonnegative and Xn nonnegative, define
( 20.2) (This i s the same as
254
if ( k - 1 )2- n < X < k T n , if
=
x
>
n.
1 < k < n2 n ,
1/Jn(X), then 0 < Xn i X. If X is not necessarily if X > 0, if X < 0.
(13.6).) Then 0 < Xn(w) i X(w) if X(w ) > 0 and 0 >
SECTION 20.
RANDOM VARIABLES AND DISTRIBUTION S
255
Xn(wH X(w) if X(w) < 0; and IXJw)l i I X{w)l for every w. The random variable Xn is in each case simple. is a mapping from f! to Rk that is measurable !F. Any A mapping from f! to R k must have the form w � X( w ) = (X1(w ), . . . , Xk (w )), where each X;(w) is real; as shown in Section 13 (see (13.2)), X is measur able if and only if each X; is. Thus a random vector is simply a k -tuple X = (X, , . . . , Xk ) of random variables.
random vector
Subfields
If .§ is a £T-field for which .§c !F, a k-dimensional random vector X is of course measurabie .§ if [w: X(w ) E .§ for every in � k . The £T-field £T( X) generated by X is the smallest £T-field with respect to which it is measurable. The £T-field generated by a collection of random vectors is the smallest £T-field with respect to which each one is measurable. As explained in Sections 4 and 5, a sub-lT-field corresponds to partial information about w. The information contained in £T(X) = £T(X1, , Xk ) consists of the k numbers X,(w ), . . . , Xk (w ). i The following theorem is the analogue of Theorem 5.1, but there are technical complications in its proof.
E H]
H
•••
Let X = (X" . . . , Xk ) be a random vector. {i) The £T-jield £T( X ) = £T( X 1 , •• , Xk ) consists exactly of the sets [ X E H] for H E �k . (ii) In order that a random variable Y be measurable £T(X) £T(X" . . . , Xk ) 1 k a measurabl e map y and sufficient that there exist itthat necessar f: R R such Y(w) = f( X1( w), . . . , Xk (w)) for all w. The ciass .§ of sets of the form [ X E H] for HE � k is a £T-field. Since X is measurable £T( X ), .§c £T(X). Since X is measurable .§, £T( X) c Hence part {i). Measurability of f in part {ii) refers of course to measurability � k I� 1 • The sufficiency is easy: if such an f exists, Theorem 13. 1{ii) implies that Y is measurable £T(X). To prove necessity, * suppose at first that Y is a simple random variable, and let y 1, ••• , Ym be its different possible values. Since A; = [w: Y(w ) = Y;] lies in £T(X ), it must by part {i) have the form [w: X( w) E H;] for some H; in �X(w)k. Putcanf=lieLin; Y)more H . ; certainly f is measurable. Since the A; are disjoint, no than one H; (even though the latter need not be disjoint), and hence f(X(w )) Y(w ). Theorem 20. 1 .
=
is
�
PROOF.
f.
,
=
t The partition defined by (4. 16) consists of the sets [w. X(w) t For a general version of this argument, see Problem 13.3.
=
x ] for x E
R k.
RANDOM VARIABLES AND EXPECTED VALUES
256
To treat the general case, consider simple random variables Y,. such that Y,.( w ) � Y(w) for each w. For each there is a measurable function fn : 1 R k � R such that Y,.(w) = fn( X(w)) for all w. Let M be the set of x in R k for which {fn( x )} converges; by Theorem 13.4{iii), M lies in � k. Let f( x ) = limn fJx) for x in M, and let f( x ) = O for x in R k - M. Since f= lim n fn lM and fn lM is measurable, f is measurable by Theorem 13.4{ii). For each w, Y( w ) = l im n JJ X(w)); this implies in the first place that X( w ) lies in M and in the second place that Y(w) = lim n JJX(w )) = f( X(w)). •
n,
Distributions
The distribution or law of a random variable X was in Section 14 defined as the probability measure on the line given by JL = px - 1 (see (13.7)), or ( 20.3 )
JL( A ) = P[ X t:: A ] ,
A
E �1 •
The distribution function of X was defined by ( 20.4) for real x. The left-hand limit of F satisfies ( 20.5)
F( x - ) = JL ( - oo , x )
= P[ X < x ) ,
F( x ) - F( x - ) = JL { x} = P [ X = x ] ,
and F has at most countably many discontinuities. Further, F is nondecreas ing and right-continuous, and limx --+ - oo F( x ) = 0, lim x --+oo F( x ) 1. By Theo rem 14. 1, for each F with these properties there exists on some probability space a random variable having F as its distribution function. A support for JL is a Borel set S for which JL{S ) = 1. A random variable, its distribution, and its distribution function are discrete if JL has a countable support S = { x 1 , x 2 , . . . }. In this case JL is completely determined by the values JL{ X 1 }, jL{X 2 }, • • • • A familiar discrete distribution is the =
binomial:
r=0, 1, ... , n. There are many random variables, on many spaces, with this distribution: If { Xk } is an independent sequence such that P[ Xk = 1] = p and P[ Xk = 0] = 1 - p (see Theorem 5.3), then X could be L:j= 1 Xi, or L:�:;r;xi, or the sum of any of the Xi, Or f! could be {0, 1, . . . , n} if !F con sists of all subsets, P{r} = JL{r}, r = 0, 1, . . . and X( r ) = r. Or again the space and random variable could be those given by the construction in either of the two proofs of Theorem 14. 1. These examples show that, although the distribution of a
n
, n,
SECTION 20.
RANDOM VA RIABLES AND DISTRIBUTIONS
X X
257
contains all the information about the probabilistic random variable behavior of itself, it contains beyond this no further information about the or about the interaction of with underlying probability space (f! , !F, other random variables on the space. Another common discrete distribution is the distribution with parameter ,.\ > 0:
X
P)
Poisson
,.\r
P[X= r ] =J.L{r} = e-,\rl ' - r = O, l, ... .
(20.7)
constant c P[ X= c] = J.L{c} = {x x }
X(w) c.
A can be regarded as a discrete random variable with = In this case 1. For an artificial discrete example, let , 2 , . . be an enumeration of the rationals, and put 1 .
( 20.8) the point of the example is that the support need not be contai ned in a lattice. A random variable and its distribution have f with respect to Lebesgue measure if f is a nonnegative Borel function on R 1 and
density
P[XEA ] = J.L( A ) = f f( x ) dx, A E In other words, the requirement is that J.L have density f with respect to Lebesgue measure in the sense of (16.1 1). The density is assumed to be with respect to if no other measure is specified. Taking A = R 1 in (20.9) shows that f musr integrate to 1. Note that f is determined only to within a set of Lebesgue measure 0: if f = except on a set of Lebesgue measure 0, then can also serve as a density for X and It follows by Theorem 3.3 thl:lt (20.9) holds for every Borel set A if it holds for every interval-that is, if ( 20.9)
92 1 .
A
,.\
,.\
g
g
JL ·
F( b ) -F( a ) = ft( x ) dx holds for every a and b. Note that F need not differentiate to f everywhere (see (20. 13), for example); all that is required is that f integrate properly-that (20.9) hold. On the other hand, if F does differentiate to f and f is continuous, it follows by the fundamental theorem of calculus that f is indeed a density for F.t a
t The general question of the relation between differentiation and integration is taken up in Section 3 1
258
RANDOM VARIABLES AND EXPECTED VALUES
For the
exponential distribution with parameter a > 0, the density is if if
( 20.10 )
X < 0, X > 0.
The corresponding distribution function
F ( x ) =- \! 0 e - a< 1-
( 10. 1 1 )
was studied in Section 14. For the
if if
normal distribution with parameters (x )2 1 exp , f( x ) = & 2 27r [ 2a- l
( 20 .12 )
X < 0, >0 m
and a-, a- > 0,
- m
(T
- oo
<x <
oo ;
f
a change of variable together with (18.10) shows that does integrate to 1 . normal distribution, m = 0 and a- = 1 . For the ], For the distribution over an interval
standard uniform
( 20.13)
f( x ) =
(a, b 1 if a <x < b, a 0 b-
otherwise .
The distribution function F is useful if it has a simple expression, as in (20. 1 1 ). It is ordinarily simpler to describe JL by means of a density ( ) or discrete probabilities fL{ X ) . If F comes fwm a density, it is continuous. In the discrete case, F increases in jumps; the example (20.8), in which the point� of discontinuity are dense, shows that it may nonetheless be very irregular. There exist distributions that are not discrete but are not continuous either. An example is fL( A) = !fL 1{A) + !fL 2(A) for fL 1 discrete and !-L 2 coming from a density; such mixed cases arise, but they are few. Section 31 has examples of a more interesting kind, namely functions F that are continuous but do not come from any density. These are the functions singular in the sense of Lebesgue; describing bold play in gambling (see (7.33)) turns out to be one of the them. See Example 31.1. If X has distribution JL and g is a real function of a real variable,
fx
Q(x)
( 20. 14) Thus the distribution of g(X) is
/-L g
- 1 in the notation (13.7).
RANDOM VARIABLES AND DISTRIBUTIONS
SECTION 20.
259
f and F are related by F( x ) J f( t ) dt. ( 20. 1 5 ) Hence f at its continuity points must be the derivative of F. As noted above, if F has a continuous derivative, this derivative can serve as the density f. Suppose that f is continuous and g is increasing, and let T g- The distribution function of g(X) is P[ g( X) < x] P[ X < T(x )] F(T(x )). If T is differentiable, this differentiates to f(T{x ))T'(x ), which is therefore the density for g(X). If g is decreasing, on the other hand, then P[g(X) < x] > T{x)] 1 F(T{x)), and the derivative is equal to f(T{x))T'( x) P[X f(T{x))I T'(x)l. In either case, g(X) has density In the case where there is a density, =
X
- 00
=
=
1
•
=
=
=
-
-
=
!P [ g ( X ) < x ] =f(T(x) )JT' ( x) J . ( 20.16) If X has the normal density (20. 12) and > 0, (20. 16) show� that has the normal density with parameters and Finding the density of g{X) from first principles, as in the argument leading to (20. 16), often works even if g is many-to-one: If X has the standard normal distribution, then am
a
t
b
au.
aX + b
Example 20. 1.
for
X > 0. Hence
X2 has density f( x )
0 =
1
r;:;v L. 'TT'
x
- 1 /2 - x /2 e
if X < 0, •
if X > 0.
Multidimensional Distributions
. . . , k ), the distribution J.L (a For a k-dimensional random vector X = probability measure on � k ) and the distribution function F (a real function on R k ) are defined by
(Xp X
A E �k,
(20.17) where
Sx
=
[
y : Y;
< X;, i
=
]
1, . . . , k consists of the points "southwest" of
x.
RANDOM VARIABLES AND EXPECfED VALUES
260
J.L 1, F Xk . FA
Often and are called the joint distribution and , function of X Now is nondecreasing in each variable, and D. A rectangles (see (12.12)). As h decreases to 0, the set • • •
joint
F>
distribution
0 for bounded
= [ y : X + = 1' . . . ' k ] decreases to S and therefore (Theorem 2. 1{ii)) F is continuous from above Further, in the sense that lim h ! F(x 1 + h, . . . , x k + h) = F(x , x ). k ) � if X ; � for some i (the other coordinates held fixed), F(x , x 1, k ) � 1 if X ; � for each i. For any F with thesek properties and F(x x k there is by Theorem 12.5 a unique probability measure J.L on �� such that J.L(A)As =hD.decrease A F for bounded rectangles A, and J.L( S ) F(x) for all x. s to S increases to the interior S� [ y i = 1, . . . , k ] of S and so sX . h
x'
•
•
1,
•
•
•
i
h' i
0
-
0
•
Y; <
,
I>
oo
•
•
•
oo
0,
x'
x
x. -h
=
=
:
Y; < X ;,
lim F( x1 - h , . . . , x k - h ) = 11- ( S� ) . h !O
(20. 18)
F F(x).
x F(x) J.L(
Since is nondecreasing in each variable, it is continuous at if and only if it is continuous from below there in the sense that this last limit coincides S x ) J.L(S� ), with Thus is continuous at if and only if which holds if and only if the boundary asX sX s� (the y-set where Y; < X i for all i and Y; = for some i) satisfies aS) 0. If k > 1 , can have discontinuity points even if J.L has no point masses: if J.L corresponds to a uniform distribution of mass over the segment B 0): 0 < < 1 in the 0 < x < 1, then is discontinuous at each plane . point of B. This also shows that can be discontinuous at uncountably many points. On the other hand, for fixed X the boundaries as h are disjoint for different values of h , and so (Theorem l0.2{iv)) only count�bly many of them can have positive wmeasure. Thus is the limit of points + h, . . . + h at which F is continuous: the continu ity points of are dense. There is always a random vector having a given distribution and distribu tion function: Take (f!, :F, (R \ �\ and X(w) = w. This is the obvious extension of the construction in the first proof of Theorem 14 .1. The distribution may as for the line be discrete in the sense of having countable support. I t may have density f with respect to k-dimensional Lebesgue measure: As in the case k 1, the distribution = is more fundamental than the distribution function and usually JL is described not by but by a density or by discrete probabilities. � R ; is measurable, If X is a k-dimensional random vector and then g(X) is an i-dimensional random vector; if the distribution of X is JL , the distribution of X) is just as in the case k 1-see (20 .14). If I> and its distribu � is defined by then gj(X) is is given by tion for
F x; (u{A) A[x:
x
=
-
=
J.L(
=
=F [( x,
(x,O) EA]), F (x x 1 F J.L) P)
=
=
F x ]
X
, xk )
=
J.L
J.L( A) fA f(x)dx.
=
F,
F g: Rk 1 g{ gg/ J.L , 1 R k J.LRj =J.Lg;-: 1 gj(x J.L/ A)xk)=J.L= xj,[(x1, , x ): xjEA]xj, = P[XjEA] k =
•
•
•
'
•
•
•
SECTION 20.
RANDOM VARIABLES AND DISTRIBUTIONS
261
.9i' 1 . The are the marginal distributions of J..L . If J..L has a density f in AR \E then J..Lj has over the line the density J..L i
(20.19)
f;(x) = j f(x p .. . ,x,_ 1, x , x; + �> · · · · xk ) dx1 • • • dx,_ 1 dx; + 1 • • • dxk , since by Fubini ' s theorem the right side integrated over A comes to J..L[(xpNow·· suppose · · x�) XjthatEA].g is a one-to-one, continuously differentiable map of V onto U, where U and V are open sets in R k . Let T be the inverse, and suppose its Jacobian J(x) never vanishes. If X has a density f supported b)" V, then for A c U, P[g(X) E A) = P[ X E TA] = fTA f( y ) dy , and by (17. 10), this equals fA f(Tx)I J(x)l dx. Therefore, g{X) has density x I J I for x E U, ( �( Tx ( ) ) (20 .20) d( x) = for x $. U. Rk - 1
This is the analogue of (20. 16). Example 20.2.
(X1, X2) has density f ( X X 2) = ( 2 7T) exp [ - i (X f + X i) J , Suppose that I,
-I
g 21 g(X1, X),2)
and let be the transformation to polar coordinates. Then U, V, and T are as in Example 17.7. If R and E> are the polar coordinates of then 2 in V. By (20. 19), R has density ( R , E>) has density (2 7T)- 1pe - p 2 pe - p on (0, oo and E> is uniformly distributed over (0, 2 7T . •
/2
=
(X1, X2 ),
)
For the normal distribution in R \ see Section 29. Independence
• • • , Xk are defined to be independent if the u-fields X1, u(X1 ) , • • • , u(Xk ) they generate are independent in the sense of Section 4. This concept for simple random variables was studied extensively in Chapter 1; the general case was touched on in Section 14. Since u(X) consists of the 1 sets [X; E H] for HE .9i' , X 1, • • • , Xk are independent if and only if Random variables
H1, , Hk .
for all linear Borel sets The definition (4.10) of independence ••• requires that (20.2 1) hold also if some of the events are suppressed on each side, but this only means taking H,·
[X; E H;] 1 =R •
RANDOM VARIABLES AND EXPECTED VALUES
262
Suppose that (20 .22)
x 1, , x k ;
x]
for all real it then also holds if some of the events X; < ; are suppressed on each side Oet X; � ). Since the intervals ( x ] form a 7T-system generating � � , the sets < form a 7T-system generating u(X,.). are independent. Therefore, by Theorem 4.2, (20.22) implies that XI , . . . If, for example, the X; are integer-vaiued, it is enough that for integral (see • • •
oo
-
[X; x ] n1,(5.9)). , Xk = nd =P[X1 = n 1 ] · · · P[Xk =n k ]
[oo,
' xk
P[X1 n1, ,n k Let . . . , Xk ) have distribution and distribution function F, and let the have distributions and distribution functions (the marginals). By (20.2 1), X1, , Xk are independent if and only if is product measure in the sense of Section 18: •
•
=
•
X;
( Xp
•
J.L
J.L ;
IL
• • •
•
•
Fj
( 20 .23) By (20.22), X1,
• • •
,
Xk are independent if and only if
(20.24) Suppose that each integrated over J.L has density
J.L,
fl f ( y1) ) yk k F1( x1) Fk(xk), so that
has density [;; by Fubini' s theorem, is just
(-oo, xd x · · · x { -oo, xd
•
•
•
•
•
•
( 20 . 25) in the case of independence. If are independent IT-fields and X; is measurable .#;, i 1, . . . ' k, then certainly XI , . are independent. are If X; is a d;-dimensional random vector, i 1, . . . , k, then X1 , , u(Xk ) are independent. by definition independent if the u-fields u(X The theory is just as for random variables: X1 , are independent if and E � d1 , Xk ) can be only if (20.21 ) holds for E � dk. Now (X re garded as a random vector of dimension d f.7_ d; ; if J.L is its distribution X R dk and J.L ; is the distribution of X; in R d;, then, just as in R d R d1 before, , X are independent if and only if J.L J.L X X J.L In none of this need the d; components of a single X; be themselves independent random variables. An infinite collection of random variables or random vectors is by defini tion independent if each finite subcollection is. The argument following (5.10)
.#1, , .#k •
•
•
.
.
=
' xk
=1), , X k , Hk 1 •
H1
=
X1,
x ···
. • .
k
•
• • •
•
•
•
•
•
•
•
=
=
1, , 1 · · · k. •
•
•
, Xk
SECTION 20.
RANDOM VARIABLES AND DISTRIBUTIONS
263
extends from collections of simple random variables to collections of random vectors: Theorem 20.2.
Suppose that ... ...
( 20.26) •
is an independent collection of random vectors. If .9; is the u-field generated by the i th row, then � .9;, . . . are independent. •
Let J:1f; consist of the finite intersections of sets of the form [ H] with H a Borel set in a space of the appropriate dimension, and apply Theorem 4.2. The u-fields .9; u{J:If;), i 1, . . . , n, are independent for each n, and the result follows. • PROOF.
=
X;j E
=
Each row of (20.26) may be finite or infinite, and there may be finitely or infinitely many rows. As a matter of fact, rows may be uncountable and there may be uncountably many of them. Suppose that X and are independent random vectors with distributions Then ( X, has distribution /-L and in and in and over Let range over By Fubini's theorem,
Y k k k + Rj = Rj . X . Rj R X R Y) J.L v v k ·x Rj y R . · ( 20 .27 ) (J.L x v)(B ) = J v[y: (x, y) E B] J.L(dx ) , Replace B by (A X R k ) n B, where E �j and B E �j + k . Then (20.27) reduces to ( 20.28) (J.L X v) (( A x R k ) n B ) = f v[y : (x, y) EB ]J.L( dx ) , RJ
A
A
) E B] is the x-section of B, so that Bx E� k Theorem = [ (x, B y y x 1 8.1), then P[(x, Y) E B] = P[w : (x, Y(w)) E B] P[w: Y(w) E B ] v(Bx). :
If
=
(
x
=
Expressing the formulas in terms of the random vectors themselves gives this result: Theorem 20.3.
If X and Y are independent random vectors with distribu and then
J.L and v in Rj R\ B E�j + k ( 20 .29 ) P [( X , Y ) EB ] = f P [(x , Y ) EB ] J.L(dx ) ,
tions
Rl
'
RANDOM VARIABLES AND EXPECTED VALUES
264
and
P [ XEA,( X,Y ) E B] f P [(x ,Y ) E B] J.L( dx ) , A E �j , B E �j + k . Suppose that X and Y are independent exponentially distributed random variables. By (20.29), P[Y/X > z] f�P[Yjx x zae-ax dx = (l +z)- 1 • Thus YjX has density x (l +z)-2 = J�e-a z]ae-a dx for z > O. Since P[X> z1, Y/X>z2] = fz� P[Yjx > z2]ae-a x dx by (20.30), • the joint distribution of X and YIX n be calculated as well. (20 .30)
=
A
Example 20.3.
>
=
ca
The formulas (20.29) and (20.30) are constantly applied as in thi<> example. There is no virtue in making an issue of each case, however, and the appeal to Theorem 20.3 is usua lly silent. Here is a more complicated argument of the same sort. Let , Xn be independent random variables, earh unifo rmly distributed over [0, t ]. X < Yn < t. The X; Let Yk be the k th smallest among the X;, so that 0 < Y < Yn - Yn t - Yn; let M divide [0, t ] into + 1 subintervals of lengths Y , Y2 P[ M < ] . The problem is to show be the maximum of these lengths. Define 1/Jn( t, that
1
Example
, • • •
20.4.
1 1a)= - Y1, •••a ,
n
·
·
·
_
1,
( 20 .31 )
=
where x + ( x + I xi)/2 denotes positive part. Separate consideration of the possibilities 0 < a < t /2, t /2 < a < t , and t 1 . Suppose it is shown that the probability rfii t, satisf.es disposes of the case the recursion ( 20 .32)
n=
r/J, (t , a ) = n � rfi a
n
_
1( a)( t
- x,
)
t - x ·' - 1 dx t t
·
Now (as follows by a n integration together with Pascal's identity for binomial coefficients) the right side of (20.3 1) satisfie>s this same recursion, and so it will follow by induction that (20.31) holds for all In intuitive form, the argument for (20.32) is this: If [ M < a] is to hold, the smallest If X1 is the smallest of the X1, then of the X; mus t have some value in [0, , Xn must all lie in [ t ] and divide it into subintervals of length at most the X n probability this is (1 - xjt) - J r/J, _ /t because X , , Xn have probability n of (1 - xjt ) - I of all lying in [ t ], and if they do, they are independe nt and uniformly distributed there. Now (20 .32) results from integrating with respect to the density for xl and multiplying by to allow for the fact that any of XI , . . . , xn may be the smallest. 1. Let be To make this argument rigorous, apply (20.30) for j 1 and k = the interval [0, a], and let consist of the points (x n) for which 0 < X; < t, is the minimum of xn divide [ x 1 , t] into subintervals of length , xn, and at most a. Then P[ X1 = min X;, M < a] = P[ X1 (X1 , , Xn ) 8 ]. Take X1 for
2
,
• . .
n
x, x,
B x1, • • •
x
n.
a]. - x, a),
2
1, x
a;
• • •
n-
=
x2,
• • • ,
• • • ,
EA,
• • •
E
A x1
S ECTION 20.
RANDOM VARIABLES AND DISTR IBUTIONS
X and ( X2, • • • , X,) for
265
Y in (20.30). Since X1 has density ljt,
(20.33) If C is the event that x < X; < t for 2 < i < n, then P(C) = 0 - xjt)" - 1 • A simple calculation shows that P[ X; - X < s" 2 < i < n!C] = n;'= z(sJ(t - X )); in other words, given C, the random variables X2 - x, . . . , X, - x are conditionally independent and uniformly distributed over [0, t - x ]. Now X2 , • • • , X, are random variables on some 5', P); replacing P by P( ·!C) shows that the integrand in probability space (20.33) is the same as that in (20.32). The same argument holds wi th the index 1 replaced by any k (1 < k < n), which gives (20.32). (The events [ Xk = mi n X;, < ] a re not disjoi nt, but any two intersect in a set of probability 11
(D.,
O.)
Y a
Sequences of Random Variables
Theorem 5.3 extends to general distributions
11- n -
If {JL,} is a finite or infinite sequence of probability mea exists on some probability space (f!, 5', P) an independent of random variable� such that X, has distribution JL, .
Theorem 20.4. sures on � 1 , there
sequence {X,}
PROOF. By Theorem 5.3 there exists on some probability space an independent sequence Z 1 , 22, . . . of random variables assuming the values 0 and 1 with probabilities ?[2, = 0] = ?[ 2, = 1] = i · As a matter of fact, Theorem 5.3 is not needed: take the space to be the unit interval and the Z,{w) to be the digits of the dyadic expansion of w -the functions d ,(w) of Sections and 1 and 4. Relabel the countably many random variables 2" so that they form a double array, .
.
.
All the 2n k are independent. Put U, = L.� = 1 2, k 2 - k The series certainly converges, and U, is a random variable by Theorem 13.4. Further, U1 , U2 , is, by Theorem 20.2, an independent sequence. Now P[Z, ; = z ; , 1 < i < k ] = 2 - k for each sequence z 1 , • • • , z k of O's and 1 ' s; hence the 2 k possible values j2 - k , 0 <j < 2 \ of S, k = L.7= 1 2, ; 2 ; all have probability 2 - k . If 0 < x < 1, the number of the j2 - k that lie in [0, x] is l2 k xj + 1 , and therefore P[ S, k < x ] = (l2 k xj + l)j2 k . Since S, k (w) j U,(w) as k j oo, it follows that [S, k < x] H U, < x] as k j oo, and so P[U, < x] = Iim k P[S, k < x ] lim k (l2 k xj + 1)j2 k = x for 0 < x < 1. Thus U, is uniformly distributed over the unit interval. • • •
-
=
RANDOM VARIABLES AND EXPECTED VALUES
266
The construction thus far establishes the existence of an independent sequence of random variables each uniformly distributed over [0, 1]. Let and put in : be the distribution function corresponding to for 0 < u < 1. This is the inverse used in Section 14-see (14.5). u Set Since 0, say, for u outside (0, 1), and put )-see the argument following (14.5)-P[ if and only if u And by Thus has distribution function Theorem 20.2, are independent. •
Un
Fn < Fn(x)] cpn(u) 'Pn(u) < X < Fn(x <x] = P[Un < Fn(x)] = Fn(x). Xn X" X2 , =
•
.
J.L n , cpn(u) = f[x Xn(w) = cpn(Un(w)). xn Fn.
.
This theorem of course includes Theorem 5.3 as a special case, and its proof does not depend on the earlier result. Theorem 20.4 is a special case of Kolmogorov' s existence theorem in Section 36. Convolution
X
J.L and v. = [(x, y ): x + y E H] with
Y
Let be independent random variables with distributions and Apply (20.27) and (20.29) to the planar set B
H E .9i' 1 :
"' ( 20 .34) P[ X+ YE H] = j-oo v( H -x ) J.L ( dx ) = j- "' P[YEH -x]J.L ( dx ) . The convolution of J.L and v is the measure J.L * v defined by "' ( 20 . 35 ) ( J.L * V}(H) = j- "'v ( H - x ) J.L ( dx ) , If X and Y are independent and have distributions J.L and (20.34) shows that X+ Y has distribution J.L * v. Since addition of random variables is commutative and associative, the same is true of convolution: J.L * v v * J.L and J.L *(v * TJ) = (J.L * v) * TJ. If F and G are the distribution functions corresponding to J.L and v, the distribution function corresponding to J.L * v is denoted F * G. Taking H = y] in (20.35) shows that ( "' ( 20 .36 ) ( F * G ) ( y ) = j G ( y - x ) dF ( x ). (See 0 7.22) for the notation dF(x).) If G has density g, then G{y -x) jy:,xg(s) ds = f�"'g(t - x)dt, and so the right side of (20.36) is f � "' [f':"'g(t - x) dF(x)] dt by Fubini's theorem. Thus F * G has density F * g, where "' ( 20.37) ( F * g )( y ) = j g(y -x ) dF ( x ) ; "'
v,
=
-
oo,
- 00
=
- 00
SECTION 20.
267
RANDOM VARIABLES AND DISTRIBUTIONS
G
g.
this holds if has density If, in addition, denoted and reduces by (16. 1 2) to
f*g
F
has density
f, (20.37) is
"'
( f * g )( y) = J g(y -x ) f( x ) dx. This defines convolution for densities, and * has density f * g if and have densities f and g. The formula (20.38) can be used for many explicit ( 20.38)
- oo
v
J.L
v
J.L
calculations.
Let X 1 , , Xk be independent random variables, each with the exponential densit� (20. 10). Define by Example 20.5.
•
•
•
gk
x > O,
( 20.39) put
g k(x)
=
0 for
k = 1 , 2, . . . ;
x < 0. Now g k(y).
g 1 coincides with gk =g k - l * g 1 , and . since XI + . . + xk has density gk •
which reduces to Thus (20. 10), it follows by induction that the sum The corresponding distribution function is
X > 0, as follows by differentiation.
•
Example 20.6. Suppose that X has the normal density (20.12) with and and that Y has the same density with T in place of u. If
independent, then
X+ Y
A change of variable u
=
X
has density
x(u 2 + T 2 ) 1 12juT
reduces this to
m -=
Y
0
arc
RANDOM VARIABLES AND EXPECTED VALUES
268
Thus X + Y has the normal density with
(F 2.
m = 0 and with u 2 + T 2 in place of •
If and v are arbitrary finite measures on the line, theii: convolution is defined by (20.35) even if they are not probabil ity measures.
J.L
Convergence in Probability
Random variables X, converge in probability to X, written X,
-)
lim P [ I X, - X I > E ] = 0
(20.4 1 )
P
X, if
n
for each positive E.t This is the same as (5.7), and the proof of Theorem 5.2 carries over without change (see also Example 5.4) I,
X.
If X, � X with probability then X, � (ii) A necessary and sufficient condition for X11 � X is that each subsequence {X11 } contain a further subsequence {X, } such that X � X with probability 1 as i � PRooF. Only part (ii) needs proof. If X, � X, then given {n k }, choose a subsequence {n k(i l } so that k > k ( i) implies that P[I X,k - X ! > i - 1 ] < 2 1 . By the first Borel-Cantelli lemma there is probability 1 that i X, . - X I < i - I for all but finitely many i. Therefore, lim , X, w, = X with probability 1 . If X11 does not converge to X in probability, there is some positive E for which P[ I X11� - XI > E] > E holds along some sequence {n k}. No subsequence Theorem 20.5. �
{i)
kUl
oo
P
P
11kUl
P
-
� (<)
of {X, � } can converge to X in probability, and hence none can converge to X with probability 1 . •
It follows from {ii) that if X" � P X and X, � P Y, then X = Y with probability 1 . It follows further that if is continuous and X, � P X, then
f
f(X" ) � f(X). In nonprobabilistic contexts, convergence in probability becomes conver gence in measure: If f, and f are real measurable functions on a measure space (f!, :F, J.L), and if J.L[w: lf(w) - f,(w)l E] � 0 for each E > 0, then f, converges in measure to f. P
>
The Glivenko-Cantelli Theorem ·
empirical distribution function for random- 1variables Xp . . . , X, F,,(x, w) with a jump of n at each Xk(w )
The distribution function (20.42)
tThis is ofte n expressed p lim, X,, = X •This topic may be omitted
:
is the
SECTION 20.
269
RANDOM VARIABLES AND DISTRIBUTION S
F w)
If the Xk have a common un known distribution function F(x), then n(x, is its natural estimate. The estimate has the right limiting behavior, according to the
Glivenko-Cantelli theorem: Theorem 20.6. Suppose that XI > X . . are independent and have a com mon distribution function F; put Dn(w) = supx IF/x, w) - F(x)l. Then Dn � 0 with probability 1. For each Fn(x, w) as a function of w is a random variable. By right 2, .
x,
contir!Uity, the supremum above is unchanged if x is restricted to the rationals, and therefore Dn is a random variable. The summ ands in (20.42) are indepen dent, identically distributed simple random variables, and so by the strong law of large numbers (Theorem 6.1), for each x there is a set A x of probability 0 such that limFn( x ,
( 20.43)
n
w) = F( x )
w
for $ A x · But Theorem 20.6 says more, namely that (20.43) holds for w outside some set A of probability 0, where A does not depend on x-as there are uncountably many of the sets A x • it is conceivable a priori that their union might necessarily have positive measure. Further, the conver gence in (20.43) is un iform in x. Of course, the theorem implies that with probability 1 there is weak convergence = (x) in the sense of Section 14.
Fn(x, w) F
PROOF OF THE THEOREM. As already observed, the set Ax where (20.43) fails has probability 0. Another application of the strong law of large numbers, with 1( - oo, x > in place of 1( - oo, x J i n (20.42), shows that (see (20.5)) = lim n , = ) except on a set Bx of probability 0. Let inf[ x: < (x) for 0 < < 1 (see (14.5)), and put xm k = cp( j ), > 1, , 1 < < It is not hard to see that F(cp( u) - ) < < F( cp(u)); hence 1 F( xm ' 1 - ) < m - 1 , and 1 - m - • U:: t and n( w ) be the maximum ot the quantities for k = l, ' + < < then If x < X < X� > <. (x) + n( w) and > + ' -I > Together with simi lar arguments fo r the this shows that cases x < xm, l and x >
cp(u) F/x - w) F(x u u F ] km m u F(xm k m. k 1 - ) - F(xm k - l ) < m - , F(xm m) > IF/xm k: w) - F( tm k )l Dm . . . , m. IFn(xm k - , w) - F(xm k - )I Fn(x, w) Fn(xm k - , w) F(xm k - ) m k- l - 1 k• Fn(x, w) Fn(xm k - l' w) F(xm k - l ) Dm n(w) F m Dm - Dm n(w) F(x) - m - Dm n(w). xm,m' •
•
•
•
•
( 20 .44)
w
If lies outside the union A of all the Axmk and Bxmk. , then limn = 0 and hence limn = 0 by (20.44). But A has probability 0.
Dn(w)
Dm n(w) •
•
270
RAN DOM VARIABLES AND EXPECTED VALUES
PROBLEMS 20. 1.
20.2.
2.11 i
A necessary and sufficient condition for a a--field .:# to be countably generated is that .:#= a-( X) for some random variable X. Hint: If .:#= k a-( A 1 , A 2 , • • . ), consider X = Lk=dUA )/ 10 , where f(x) is 4 for x = 0 and 5 for x -=1= 0. If X is a POSitive random variable with dens ity f, then x - l has density f( ljx)jx 2 • Prove this by (20.16) and by a direct argument.
20.3.
Suppose that a two-dimensional distribution ft.nction F has a continuous density f. Show that f( x, y) = a 2F( x, y)jax ay.
20.4.
The construction in Theorem 20.4 requires only Lebesgue measure on the unit interval. Use the theorem to prove the existence of Lebesgue measure on R k . First construct Ak re�t::icted to ( -n, n] x X ( -n, n], and then pass to the limit (n --+ oo). The idea is to argue from first principles, and not to use previous constructions, such as tho�e in Theorems 12.5 and 18.2. ·
·
B,
20.5.
Suppose that A , and C are positive, independent random variables with distribution function F. Show that the quadratic Az 2 + + C has real zeros with probability f'0f0F(x 214 y) dF(x) dF(y ).
20.6.
Show that XI , Xz , . . . are independent if a- ( X l > independent for each n.
20.7.
Le t X0, X1 , • • • be a persistent, irreducible Markov chain, and fo r a fixed state j let T1 , T2 , • • • be the times of the successive passages through j. Let Z 1 T1 and Zn = Tn -k Tn _ 1 , n > 2. Show that Z 1 , Z2 , • • • are independent and t hat Z = k ] = fjj l fo r n > 2.
P[
20.8.
Bz
" ' '
xn - l ) and a-(X) are
=
n
Ranks and records. Let XI > X2 , . . . be independent random variables with a
B
be the w-set where common continuous distribution function. Let Xm(w) Xn( w ) for some pair m , n of distinct integers, and show that 0. Remove from the space !l on which the X, are defined. This leaves the joiJit distributions of the xn unchanged and makes ties impossible. n Let r< l(w) = (Tf"l(w), . . . T�"l (w )) be that permutation (t 1 , . . . , tn) of O, . . . , n) for which X, ( w ) < X, (w) < < X, (w). Let Yn be the rank of Xn ' � among X1 , • • • , Xn: Yn r if an only if X; < j(n for exactly r - 1 values of i preceding 11. n (a) Show that T< l is uniformly distributed over the n! permutations. (b) Show that Yn r] 1jn, 1 < r < n. (c) Show that Yk is measurable a-(T<"l) for k < n. (d) Show that Y1 , Y2 , • • • are independent.
=B
P( B) =
,
·
P[ =
20.9. i
·
·
=
Record values. Let An be the event that a record occurs at time n:
max k < n Xk < Xn . (a) Show that A 1 , A 2 , • • • are independent and P(A,) = ljn. (b) Show that no record stands forever. (c) Let Nn be the time of the first record after time n . Show that k 1 = n(n + k - o - 1 (n + k )- 1 •
P[ Nn = n +
SECTION 20.
271
RANDOM VARIABLES AND DISTRIBUTIONS
20.10.
Use Fubini's theorem to prove that convolution of finite measures is commuta tive and associative.
20. 11.
Suppose that X and Y are independent and have densities. Use (20.20) to find the joint density fo r ( X + Y, X) and then use (20. 19) to find the density for X + Y. Check with (20.38).
20.12.
If f( x - €) < F( x + €) for all positive €, then x is a point of increase of F (see Problem 12.9). If f( x - ) < F( x ), then x is an atom of F. (a) Show that, if x and y are points of increase of F and G, then x + y is a pain! of increase of F * G. (b) Show that, if x and y are atoms of F and G, then x + y is an atom of
F * G. 20.13.
20.14.
Suppose that JL and v con3ist of masses a n and 13n at n, n = 0, 1, 2, . . . . Show that JL * v consists of a mass of E'k = oakf3 n - k at n, n = 0, 1, 2, . . . . Show that two Poisson distributions (the parameters may differ) convolve to a Poisson distribution. The Cauchy distribution has densitv I
c"( x ) = -
(20.45)
7T'
u
2
u
+x
2 '
- oo
< x < oo ,
for u > 0. (By (17.9), the density integrates to 1.) (a) Show that c u * c, = c U +, . Hint: Expand the convolution integrand in partial fractions. (b) Show that, if XI ' . . . ' xn are independent and have density c,, then ( X1 + + Xn )jn has density c u as well. · ·
·
Y
Show that, if X ar:d are independent and have the standard normal density, then Xj Y has the Cauchy density with u = 1. (b) Show that, if X has the uniform distribution over ( 7rj2, 7T'/2), then tan X has the Cauchy distribution with u = 1.
20.15. i
(a)
-
20. 16.
1 8.1 8 i Let XI , . . . ' xn be independent, each having the standard normal distribution Show that Xn2 = XIz + . . . +X2 n
has density
(20.46) over (0, oo). This is called the chi-squared distribution with n degrees offreedom.
272
RANDOM VA RIABLES AND EXPECTED VALUE�
20.17. i
The gamma distribution has density
(20.47) over (0, oo) for positive parameters Show that
au
f( , , ) * f(
(20.48)
a and u. Check that (20.47) integrates to 1 ; a, c ) = f( ' ; a ,u + u).
Note that (20.46) is f( x; t, nj2), and from (20.48) deduce again t hat (20.46) is the density of x;. Note that the exponential density (20 10) is f( x ; 1), and from (20 48) deduce (20 39) cnce again.
a,
n
Let N, X 1 , X2 , . . . be independent, where P[ N = n] = q - lp, n > l, and each xk has the exponential density f(x; 1). Show that XI + . . +X,., has density f(x; 1).
20.18. i
ap,
20.19.
a,
Let Anm(t:) = [IZk - Zl < t:, n < k < m]. Show that Zn --+ Z with probability 1 if and only if limn lim m P( An m(t:)) = l for all positive £, whereas Zn -"p Z if and only if lim n P( An,(t: )) = 1 for all positive £. Suppose that f: R 2 --+ R 1 is continuous. Show that Xn --+ P X and Yn --+ P Y imply f( Xn, Y) --+ P f( X, Y ). (b) Show that addition and multiplication preserve convergence in probability.
20.20. (a)
20.21.
Suppose that the sequence {Xn} is fundamental in probability in the sense that for t: positive t here exists an NE such that P[IXm - Xnl > t:] < t: for m, n > NE . (a) Prove there is a subsequence {Xn ) and a random variable X such that that limk Xn k = X with probability 1. Hint: Choose increasing nk such k k k P[ I Xm - Xnl > 2 - ] < 2 - for m, n � n�. Analyze P[ IXn k+, - Xnk l > 2 - ]. (b) Show that Xn --+ P X. Suppose that XI < Xz < . . . and that xn --+ p X. Show that xn --+ X with probability 1. (b) Show by example that in an infinite measure space functions can converge almost everywhere without converging in measure.
20.22. (a)
20.23.
20.24.
If Xn --+ 0 with probability 1, t hen n - 1 r,�= 1 Xk --+ 0 with probability 1 by the standard theorem on Cesaro means [A30]. Show by example that this is not so if convergence with probability 1 is replaced by convergence in probability. S how that in a discrete probability space convergence in probabil ity is equivalent to convergence with probability 1. (b) Show that discrete spaces are essentially the only ones where this equiva lence holds: Suppose that P has a nonatomic part in the sense that there is a set A such that P(A) > 0 and P( lA) is nonatomic. Construct random variables Xn such that X, --+ P 0 but X, does not converge to 0 with probabil ity 1.
2. 1 9 i
(a)
SECTION 21. (
'0.25.
EXPECTED VALUES
273
20.21 20.24 i Let d( X, Y) be the infimum of those positive *' for which P[l X - Y I > t: ] < t:. (a) Show that d( X, Y) = 0 if and only if X = Y with probability 1. Identify random variables t hat are equal with probability 1, and show that d is a metric
on the resulting space. (b) Show that Xn -+ P X if and only if d( X X) -+ 0. (c) Show that t he space is complete. (d) Show that in general t here is no metric d0 on this space such t hat Xn -+ X with probability 1 if and only if d0( Xn, X) -+ 0.
n
20.26.
,
Construct in R k a random variable X that is uniformly distributed over the surface of the unit sphere i:1 the sense t hat I XI = 1 and UX has the same distribution as X for orthogonal transformations U. Hint Let Z be uniformly = (1, 0, , 0), say), distributed in the unit ball in R k , define 1/J( x) = x !I xI and take X = ljJ( Z ).
( ljJ(O)
20.27.
e
and be the longitude and latitude of a random point on th� surface of the u nit sphere in R 3• Show that E> and are independent, 0 is uniformly distributed over [0, 2'7T), and is distributed over [ -'7T/2, +'7T/2] with density -} cos >
i Let
SECTION 21. EXPECTED VALVES Expected Value as Integral
The expected value of a random variable with respect to the me asure
P:
E[
X on (f!, .9', P) is the integral of X
•
X ] = jXdP = j X(w)P(dw) . n
All the definitions, conventions, and theorems of Chapter 3 apply. For nonnegative is always defin ed (it may be infinite); for the ge neral is defined, or has an expected value, if at least one of and is finite, in which case and i s integrable if = and only if E I Il < oo. The integral over a set A is defined, as before, as In the case of simple random variables, the definition reduces to that used in Sections 5 through 9.
X, E[X1 E[X- 1
X, E[X1
X
[X E[IA X].
E[X1 E[X + 1 - E[X- 1; fA XdP
X
E[ X + j
Expected Values and Limits
The theorems on integration to the limit in Section 16 apply. A useful fact: If random vari ables are dominated by an integrable random variable, or if they are uniformly integrable, then � follows if converges to in probability -convergence with probability 1 is not necessary. This follows easily from Theorem 20.5.
Xn
X
E[Xn1 E[X1
Xn
274
RAN DOM VARIABLES AND EXPECTED VALUES
Expected Values and Distributions
J.L. g
X
Suppose that has distribution If is a real function of a real vari able, then by the change-of-variable formula (16.17),
E[g( X)) f-"' g(x)J.L(dx). "' (In applying (16.1 7), replace T: f! � D.' by X: f! � R 1 , J.L by P, 11-T - 1 by J.L, and f by g.) This formula holds in the sense explained in Theorem 16. 1 3: It ( 21 . 1 )
=
holds in the non negative case, so that
( 21 2 )
E (l g(X) I] j-"' l g(x) IJ.L (dx) ; =
"'
if one side is infinite, then so is the other. And if the two sides of (21 .2) are finite, then (21 . 1 ) holds. If J.L is disci ete and J.L{X I ' x 2 , = 1, then (21. 1) becomes (use Theorem •
•
•
16.9)
( 21 .3 )
}
r
X has density f, then (21. 1) becomes (use Theorem 16. 1 1 ) E [ g(X) ) J g(x)f(x) dx. ( 21 .4 ) "' If F is the distribution function of X and J.L, ( 2 1 . 1 ) can be written E[g(X)] If
00
=
=
J':"'g(x) dF( x) in
-
the notation ( 17.22).
Moments
J.L and F determine all the absolute moments of X: k 1 , 2, . . . . ( 21 . 5 ) E[IXi k J j "' lxl kJ.L (dx) Jco lxl k dF ( x ) , k Since j < k implies that lxlj < 1 + lxl , if X has a finite absolute moment of order k, then it has fin ite absolute moments of ord ers 1, 2, . . . , k - 1 as welL For each k for which (2.15) is finite, X has k th moment
By (21.2),
=
- 00
=
- 00
=
( 21 .6 )
J.L
These quantities are also referred to as the moments of and of F. Th ey can be computed by (21 .3) and (21 .4) in the appropri ate circumstances.
SECTION 21.
EXPECTED VALUES
275
m
0 and u = 1 . Consider the normal density (20.12) with goes to 0 exponentially as � + oo, and so finite For each k, mome nts of all orders exist. Integration by parts shows that Example 21.1.
x k e - x 2 12
x
=
oo "' 1 ! x k e - x2 1 2 dx = k - f x k - 2 e -x2 ;2 dx , k 2 ,3 , . . . . J2 - "' (Apply ( 18. 16) to g(x) = x k - 2 and f(x) =xe-x 212 , and let Of course, E[ X] 0 by symmetry and E[X 0 ] 1 . It follows by induction J2
17r
7r
=
- "'
=
th at
a � - oo, b � oo. )
=
( 2 1 .7 ) E[X 2k ] = 1 x 3 x 5 x
· · ·
x ( 2k 1 ) ,
k
-
=
1 , 2, . . . , •
and that the od d moments all vanish. If the first two moments of Section the variance is
5,
X are finite and E[X] = m, then just as
in
Vac[X] E [ ( X - m ) ] = r ( x - m) 2 JJ- (dx)
( 21 .8)
=
2
" - ou
From Example 21.1 and a change of variable, it follows that a random variable with the normal den sity (20.12) has mean and variance u 2 • Consider for nonnegative the relation
m
X "' "' f f E[X] = 0 P[ X > t] dt = 0 P[ X> t] dt.
( 21 .9)
P[ X= t]
t,
Since can be positive for at most countably many values of th e two integrands differ only on a set of Lebesgue measure 0 and hence the integrals are the same. For simple and nonnegative, (21.9) was proved in Section see (5 .29). For the general nonnegative let be simple random variables for which 0 < i X (see (20.1)). By the monotone conver i gence theorem moreover, and therefore i again by the monotone convergence theorem. i Since (21.9) holds for each a passage to the limit establishes (21. 9 ) for X itself. Note that both sides of (21.9) may be infinite. If the integral on the right is finite, then is integrable. Replacing by > aJ leads from (21.9) to
X
5;
X, Xn P[ Xn > t] P[ X> t ],
Xn
£[ Xn] £[X]; f<�P[ Xn > t] dt J0P[ X> t] dt, Xn , X X XI1x ( 2 1 . 10 ) j X dP = a P[ X> a] + j "'P[ X > t] dt, As long as a > 0, this holds even if X is not nonnegative. [ X > a]
a
276
RANDOM VARIABLES AND EXPECTED VALUES
Inequalities
Since the final term in (21 . 10) is non negative, Thus
E[X].
aP[ X> a] < frx a. J XdP < �
1 P[ X> a] < -a1 1[ X <:a.lXdP < -E[X], a
( 21 . 1 1 )
for nonnegative
a > 0,
X. As in Section 5, there follow the inequalities
( 21 . 12) It is the inequality between the two extreme terms here that usually goes under the name of M; nkov; but the left-hand i nequality is often useful, too. As a special case there is Chebyshev's in equ ality,
1 P[IX - ml > a) < 2 Var [ X ] a
(21 . 13) (m
=
E[X] ).
J ensen's inequality
cp(E[ X]) < E[cp( X)]
( 2 1 . 14)
cp
X X = ax + b cp aE[X] + b < E[cp(X)].
and if and holds if i s convex on a n interval containing the range of be a support both hdve expected values. To prove it, let L(x) ing line through )- a line lying entirely under the graph of [A33]. Then so that But the left side of this inequality is cp( Holder' s inequ ality is
cp( X)
( El X], cp( E[ X]) aX(w) + b < cp(X(w)), E[ X]).
1 1 - + - = 1.
( 21 . 15)
p
q
For discrete random variables, this was proved in Section 5; see (5.35). For the general case, choose simple random variables and Y, satisfying 0 and 0 see (20.2). Then (5.35) and the monotone convergence theorem give (21 .15). Notice that (21 .15) implies that if and are integrable, then so is XY. Schwarz's inequality is the case
< IXnli lXI IYI"
p =q
=
< IYnli IYI;
Xn
2:
(2 1 . 16) If
X and Y have second moments, then XY must have a first moment.
IXIP
S ECTION 21.
EXPECTED VALUES
277
The same reasoning shows that Lyapounov 's inequality (5.37) carries over from the simple to the general case. Joint Integrals
(X1 , , Xk ) has
The relation (21 . 1 ) extends to I andom vectors. Suppose that distribution in k-space and g : � By Theorem 16.13,
Rk R1
J.L
• • •
•
(21 . 17 ) with the usual provisos about infin ite values. For example, = m;, the of X1 and If is
fRkx1x1J.L(dx). £[X1]
Random variables are
X1
covariance
£[ X,X1] =
uncorrelated if they have covariance 0.
Independence and Expected Value
X and Y are independent. If they are also simple, then Ef XY] E[ X)E[ Y), proved+ in Section 5-see (5.25). Define Xn by (20.2) and similarly define Y,. 1/Jn ( y ) - 1/1/ y- ). Then Xn and Yn are indepen dent and simple, so that E[IXnY,.ll E[IXni] E[IY,.Il, and 0 < IXnli lXI, 0 < !Y,.Ii IYI. If X and Y are integrable, then £[1XnY,I] = £[1Xni ] E[IY,I l i E[IXI] · E[I YI], and it follows by the monotone convergence theorem th::\t E[IXYI J oo; since and IXnY,I < IXYI, it follows further by the dominated conver XnY, gence theorem that E[XY] = lim n E[XnY,l = limn E[Xn]E[Yn] E[X]E[Y]. Therefore, is integrable if X and Y are (which is by no me ans true for depe ndent random variables) and E[XY] E[X]E[Y]. This argument obviously extends inductively: If X1 , , Xk are indepen dent a n d integrable, then the product XI . . . xk is also integrable and Suppose that
as
=
=
=
<
� XY
=
XY
=
•
•
•
( 2 1 . 18)
.#1
.#2 X2
A
.#1 , X1
Suppose that are independen t a--fields, lies in and is and are indepen measurable and is measurable Then dent, so that (21 . 1 8) gives
.#1 ,
.#2 •
IAX1
(21 . 19) if the random variables are integrable. In particular, ( 21 .20)
f X2 dP = P( A) E[XJ. A
X2
278
RAN DOM VARIABLES AND EXPECfED VALUES
From (21 . 18) it follows just as for simple random variables (see (5.28)) that variances add for sums of indepen dent random variables. I t is even enough th at the random variables be independent in pairs. Moment Generating Functions
The
moment generating function is defined as M (s) = E[ e 5X ] = j "' e xJL(dx) j "' e 5x dF( x )
( 2 1 .21)
- 00
=
5
- 00
s 9
for all for which this is finite (note that the integrand is nonn egative). Section shows in the cl:lse of simple random variables th e power of moment generating function methods. This function is also called the of JL, especially in nonprobabilistic contexts. Now 'xJL(dx ) is finite for < 0, a n d if it is finite for a positive then it is finite for all smaller Together with the corre spon ding result for th e left half-line, this shows that M(s) is d efined on some interval containing 0. If is nonn egative, this interval contains ( - :xJ, 0 and perhaps part of (0, if and perhaps part of ( 0). It is possible is non positive, it co mains 0, that th e interval consists of 0 alon e ; this happens, for example, if 11- is = for = 1, 2, . . . . concentrated on the integers and = where Suppose that M(s) is defined throughout an interval ( sx and the l atter function is integrable 11- for 0. Si nce l sx l < sx + so is = l sx l . By the corollary to Theorem 16.7, 11- has finite moments of all orders and
form
f�e
s
s.
[ oo)
So > lsi < s0 ,
Laplace trans s, X oo); X
-
e ek e L:'k=olsxl jk! e
]
-oo, JL{n} JL{ -n} Cjn 2 n -s0 , s0 ), lsl <s0•
Thus M( s) has a Taylor expansion about 0 with positive radius of conver gence if it is defined in some ( s0), > 0. If M(s) can somehow be can be calculated and expanded in a series and if the coefficients identified, then, since ] ! , the moments of must coincide with can be computed: = ! It also follows from the theory of Taylor expan sions that is the k th derivative M< k >(s) evaluated at s = 0:
-s<> • ks0 L:ka k s ,
( 2 1 .23 )
ak
E[Xk jk
a kk E[X ] a k k [A29] a k k! M< k > ( O) = E[X k ] = j x kp_(dx ) .
X
"'
- oo
Ms jM s
This holds if ( ) exists in some neighborhood of 0. If v has Suppose now that is defined in some neighborhood of x den sity es ( ) with respect to 11- (see (16. 1 1 )), then v has moment for u in some neighborhood of 0. generating function N(u) = M(s + u) 1
s.
M
M(s)
SECTION 21.
EXPECTED VALUES
279
N < k >(O) = f': oox k v( dx) = [:"'x ke sxJl.(dx)jM(s), and since N (o) = M < k >(s)/M(s), But then by (21.23),
00
M< k > (s) = J x k e sxJJ. ( dx ) .
( 21 . 24)
- 00
This holds as long as the moment generating fu nction exists in some neigh borhood of s. If s = 0, this gives (2 1 .23) again. Taking k 2 shows that M(s) is convex in its interval of definition. =
Example 21.2.
For the standard normal density,
and a change of variable gives ( 21 .25 ) The moment generating function in this case defined for all has the expansion
( )
k
oo _!_ � ;., 2 2 e5 1 = 2 k ! k'::o k�O
1X3X
s.
Since
e5 2 1 2
· · ·
X ( 2 k - 1 ) 2k s , (2 k ) !
the momen ts can be read off from (21 .22), which proves (21. 7) once more. Example 21.3.
•
In the exponential case (20.10), the moment generating
function (21 .26) is defined for s < a. By (21 .22) the kth moment is k!a - k . The mean and • variance are thus a - I and a - 2 • Example 21 .4.
For the Poisson distribution (20.7),
(21 .27) Since M'(s) = A e5M(s) and M"(s) = (A.2 e 2 5 + Ae5)M(s), the first two mo ments are M'(O) = A and M"(O) = ;>..2 + A.; the mean and variance are both A . •
RANDOM VARIABLES AND EXPECTED VALUES
280
Let X1 1 , Xk be independent random variables, and suppose that each X1 has a moment generating function M1(s) = E[ e s X; ] in ( -s0 , s 0). For lsi < s0, each exp(sX,) is integrable, and, since they are independent, their product exp{s L:7= 1 X) is also integrable (see (21 .18)). The moment generating function of XI + . . . + xk is therefore •
•
•
( 21 28) in ( - s0, s0 ). This relation for simple random variables was essential to the arguments in Section 9. For simple random variables it was shown in Section 9 that the moment generating function determines the distribution . This will later be proved for general random variables; see Theorem 22.2 for the nonnegative case and Section 30 for the general case. PROBLEMS 21.1.
Prove 1
../2rr
/00 e -IX f2 dx = t - lf2 - oo
2
'
differentiate k times with respect to t inside the integral (justify), and derive (21.7) again. 21.2.
Show that, if X has the standard normal distribution, then
2"n!f2j·rr .
£[1 Xl 2" + 1 ] =
21.3.
20.9 i Records. Consider the sequence of records in the sense of Problem 20.9. Show that the expected waiting time to the next record is infinite.
21.4.
20. 14 i Show that the Cauchy distribution has no mean.
21.5.
Prove the first Borel-Cantelli lemma by applying Theorem 16.6 to indicato r random variables. Why is Theorem 16.6 not enough for the second Borel-Cantelli lemma?
2 1.6.
Prove (21 .9) by Fubini's theorem.
21. 7.
Prove for integrable X that E[X] =
0
10 P[ X > t ] dt - f 00
- oo
P [ X < t ] dt .
SECTION 21. 21.8. (a)
EXPECTED VALUES
281
Suppose that X and Y have first moments, and prove
E[ YJ - E[X] = {' ( P [ X < t < Y ] - P[ Y < t < X J) dt . - oo
Let ( X, Y] be a nondegenerate random interval Show that its expected length is the integral with re>spect to t of the probab ility that it cove rs t.
(b)
21.9.
Suppose that X and Y are random variables with distribution functions F and G. (a) Show that if F and G have no common jumps, then E[F( Y)] + E[G( X)] = 1. (b) If F is condnuous, then E[F( X)] = i. (c) Even if F dud G have common jumps, if X and Y are taken to be independent, then E[ F( Y)] + E[ G( X )] 1 , P[ X = Y ]. (d) Even if F has jumps, E[F(X)] i t- � Lx P 2[ X = x ] =
=
Show that uncorrelated variables need not be independent. (b) S how that Var[L:I'= 1 X;] = L:7, 1 = 1 Cov[ X;, X1 ] = I:;'� 1 Var[ X; ] + 2[ 1 5 ; < 1 5 11 Cov[X;, �-1. The cross terms drop out if the X, are uncorrelated, and hence drop out it they are independent.
21.10. (a)
Let X, Y, and Z be independent random variables such that X and Y assume the values 0, 1, 2 with probabil ity * each and Z assumes the values 0 and 1 with probabilities i and �. Let X' = X and Y' = X + Z (mod 3). (a) Show that X', Y', and X' + Y' have the same one-dimensional distribu tions as X, Y, and X + Y, respectively, even though ( X', Y' ) and ( X, Y) have different distributions. (b) Show that X' a11d Y' are dependent but uncorrelated (c) Show that, despite dependence, the moment generating function of X' + Y' is the product of the moment generating functions of X' and Y'.
21.11. i
21.12.
Suppose that X and Y are independent, nonnegative random variables and that E[ X] = oo and E[Y] = 0. What is the value common to E[ XY ] and E[ X]E[ Y ]? Use the conventions ( 15.2) for both the product of the random variables and the product of their expected values. What if £[X] = oo and O < E[ Y ] < oo?
21.13.
Suppose that X and Y are independent and that f( x, y) is nonnegative. Put g(x) = E[ f( x , Y)] and show that E[ g ( X )] = E[ f( X, Y)]. Show Jllore generally that fx E A g( X) dP = fx E A f( X, Y) dP. Extend to f that may be negative. The integrability of X + Y does not imply that of X and Y separately. Show that it does if X and Y are independent.
21.14. i
2 1.15.
21. 16.
20.25 i Write d 1( X, Y) E[ IX Y l ;( l + I X - Yl)]. Show that this is a metric equivalent to the one in Problem 20.25. =
-
For the density C exp( - l x l 1 12 ), oo < x < oo, show that moments of all orders exist but that the moment generating function exists only at s = 0. -
282
RAN DOM VARIABLES AND EXPECTED VALUES
Show that a moment generating function M(s) defined in ( -s0, s0), s0 0, can be extended to a function analytic in the strip [ z : - s0 < Re z < s0]. If M(s) is defined in [0 s0), s0 0, show that it can be extended to a function continuous i n [ z : 0 < Re z < s0] and analytic in [ z · 0 < Re z < s0].
21.17. 16.6 i
>
>
,
21.18. Use (21.28) to find the generating function of (20.39). 21.19. For independent random variables having moment generating functions, show by (21.28) that the variances add.
Show that the gamma density (20.47) has moment generating function k 1) . . (1 - sja ) - " for s < a. Show that the k th moment is l)ja k Show that the chi-squared distribution with n degrees of freedom has mean n and vanance 2n.
21.20. 20. 17 i
u(u +
(u +
21.21. Let X1 , X2 , be identically distributed random variables with finite second moment. Show that nP[I X1 1 > € /n) -> 0 and n - 1 1 2 max k , ,I Xk I -> p 0. • • •
SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES
X1, X2 ,
Let be a sequence of independent random variables on some probability space. It is natural to ask whether the infinite series 1:: 1 con converges with probability 1 , or as in Section 6 whether - • 1 verges to some limit with probability 1 . It is to questions of this sort that the present section is devoted. will denote the partial sum Throughout the section, •
•
•
n L:�= Xk
=
X,
L� = .xk (S0 = 0).
S,
The Strong Law of Large Numbers
The central result is a general version of Theorem 6. 1.
If X 1 , X2 ,
are independent and identically distributed and have finite mean, then S,/n £[ X1] with probability 1. Theorem 22.1.
•
• •
-7
Formerly this theorem stood at the end of a chain of results. The following argument, due to Etemadi, proceeds from first principles. If the theorem holds for nonnegative random variables, then 1 n - L� = 1X:- n - 1 L:� = 1 X; E[XtJ - E[X;l £[X1] with proba bility 1. Assume then that Xk > 0. Consider the truncated random variables Yk Xk l[xk k l and their partial sums S,� L� = 1Yk . For a > 1, temporarily fixed, let u, l a " J. The first step is to prove S, � E [ s: J (22.1 ) u,
PROOF. n - • s, =
=
-7
=
=
11
n
s
=
22. SUMS OF INDEPENDENT RANDOM VARIABLES Since the Xn are independent and identically distributed, n n Var[ s: ] = L Var[ Yk ] < L E [ Yk2 ] k=l k= l = k=Ll E [ XI2/[XJ .s k ]] < nE [ xi2 /[ XI .s n ] ] . SECTION
283
ll
It follows by Chebyshev' s inequality that the sum in
(22.1) is at most
K = 2af(a - 1), and suppose x > 0. If N is the smallest n such that un > x, then a N >x, and since y 2lyJ for y > 1, n 1 1• -N K = Kx 2 a a < u < L L ;; n �N U n �X 1 Therefore, L:� = 1u;; 1I[ x1511" 1 < KX"i for X1 > 0, and the sum in (22.1) is at most KE - 2 £[X1] < oo . From (22.1) it follows by the first Borel-Cantelli lemma (take a un ion over positive, rational E) that (s:n - E[s:n ])fun 0 with probability 1. But by the 1 consistency of Cesaro summation [A30], n - 1 E[s;] = n - L:�= 1 £[ Yk ] has the same limit as £[ Yn ], namely. £[ X1 ]. Therefore s:" fun £[X1 ] with probaLet
<
-7
-7
bility 1 . Since
"' L P[Xn * Y,] = L P[ X1 > n] 1 P[X1 > t ] dt = E[X1 ] < oo , n=! n=l another application of the first Borel-Cantelli lemma shows that (s: - Sn)fn 0 and hence "'
00
-7
( 22.2) 1. un < < un + l> then since x,. > 0,
with probability k If
<
0
RANDOM VARIABLES AND EXPECTED VALUES
284
But u, 1 /u n -7 a, and so it follows by (22.2) that +
sk [ 1 [ . . sk . X1 < aE X1 J E < bm mf < hm sup J T a k T k with probability 1. This is true for each a > 1. Intersecting the corresponding sets over rational a exceeding 1 gives lim k Sdk = £[ X1 ] with probability 1. •
Although the hypothesis that the Xn all have the same distribution is used several times in this proof, independence is used only through the equation Var[s: l = L�= I Var[ Yk], and for this it is enough that the xn be indepen dent m pairs. The proof given for Theorem 6.1 of course extends beyond the case of simple random -variables, but it requires E[ x:1 < oo.
Suppose that XI > X2 , are independent and identically dis tributed and E[ XJ ] < oo, £[ xn = 00 (so that E[ XI ] = ) . Then n - I L�= I xk -7 oo with probability 1. Corollary.
• • •
00
By the theorem, n - I L�= I x;; -7 £[ X� l with probability 1, and so it suffices to prove the corollary for the case X1 = Xi > 0. If PROOF.
{ Xn xn< > - \ 0 u
if O < Xn < u , if xn > u ,
then n - 1 L:;; 1 Xk > n - 1 L:Z= 1 Xk"> -7 £[ X�">] by the theorem. Let u -7 oo, =
•
The Weak Law and Moment Generating Functions
The weak law of large numbers (Section 6) carries over without change to the case of general random variables with second moments-only Chebyshev ' s inequality is required. The idea can be used to prove in a very simple way that a distribution concentrated on [0, oo) is uniquely determined by its moment generating function or Laplace transform. For each A, let Y,� be a random variable (on some probability space) having the Poisson distribution with parameter A. Since � has mean and variance ,.\ (Example 21 .4), Chebyshev's inequality gives
SECTION 22.
Let
GA
SUMS OF INDEPENDENT RANDOM VARIABLES
285
be the distribution function of �/A , so that
The result above can be restated as if t > 1 ' if t < 1 .
( 22.3)
In the notation of Section 1 4, Gix ) = Ll(x - 1) as ,.\ -7 oo. Now consider a probability distribution J.L concentrated on [O, oo). Let F be the corresponding distribution function. Define 00
M (s) = 1 e-s xJ.L(dx ) ,
(22.4)
.
0
s > 0;
here 0 is included in the range of integration. This is the moment generating function (21 .21), but the argument has been reflected through the origin. It is a one-sided Laplace transform, defined for all nonnegative s. For positive s, (21 .24) gives (22.5) Therefore, for positive
x
��
k
(22.6)
lsxl
L k= O
( 1) '
and s,
k s ) s kM ( k l (s) = 1 L e - s y ( J,' J.L(dy) 0 k =O oo
lsxl
Fix X > 0. ut 0 < y < x , then Gs y(xjy ) -7 1 as s -7 00 by (22.3); if y > x, the limit is 0. If J.L { x = 0, the integrand on the right in (22.6) thus converges as s -7 oo to /[O, x J( y ) except on a set of J.L-measure 0. The bounded convergence theorem then gives
}
(22.7) t lf y = 0, the integrand in (22.5) i s I for k = 0 and 0 for k ;;:: I, hence for the middle term of (22 6) is 1 .
y
=
0, the in tegrand in
286
RANDOM VARIABLES AND EXPECTED VALUES
J.L{X}
x x
Thus M(s) determines the value of F at if > 0 and 0, which covers all but cou ntably many values of in [O, oo). Since F is right-continu ous, F itself and hence are determined through (22.7) by M(s). In fact is by (22. 7) determined by the values of M(s) for s beyond an arbitrary s0:
x
J.L
Theorem 22.2.
where s0 > 0, then Corollary.
=
J.L
Let J.L and v be probability measures on [0, oo). If
11 = v.
Let f 1 and f2 be real functions on [0, oo ). If
where s0 > 0, then f1 = f2 outside a set of Lebesgue measure 0. The f,. need not be nonnegative, and they need not be integrable, but must be integrable over [O, oo) for s > s0.
e - sxfi(x)
For the nonnegative case, apply the theorem to the probability i = 1, 2. For the densities g,.( x) = where m = • general case, prove that fi + fi. = f; + fl almost everywhere. PROOF.
e - s"xfi(x)jm,
J0e-snxt,.( x)dx,
=
J.L 1 J.L 2 J.L3, then the corresponding transforms (22.4) satisfy M1 (s)M2(s) Mis) for s > 0. lf J.L i is the Poisson distribution with mean Ai, then (see (21.27)) M,(s) exp[A,. ( e - s - 1)]. It follows by Theorem 22.2 that if two of the J.L, are Poisson, so is the third, and A 1 A 2 A3 • Example 22. 1.
*
If
=
=
+
•
=
Kolmogorov's Zero-One Law
1 L:Z 1 Xk(w) X1(w), . . . , Xm_ 1( w) - L: Xk(w) L:Z X
Consider the set A of w for which n -7 0 as n -7 oo. For each = are irrelevant to the question of m, the values of whether or not w lies in A, and so A ought to lie in the - fi eld xm . . . ). In fact, limn n ;:(: = 0 for fixed m, and hence w 1 lies in A if and only if lim n =m k( w ) = 0. Therefore,
u(Xm, (22.8)
+I'
u
I
11
[
A = n U n w: n-1 E
N "?: m n "?:N
X k( w) t k=m
the first intersection extending over positive rational
E.
]
<E ,
The set on the inside
SECTION 22.
SUMS OF INDEPENDENT RANDOM VARIABLES
287
lies in u( X X + 1 , . . . ), and hence so does A. Similarly, the w-set where the series Ln Xn( w) converges lies in each u( X X + 1 , . . . ). The intersection !T n : = 1 u(Xn , Xn + 1 , ) is the tail u-field associated with the sequence Xp X2 , . . . ; its elements are tail events. In the case xn = /A ' this is the u-field (4.29) studied in Section 4. The following general form of Kolmogorov's zero -one law extends Theorem 4.5. m,
m
m
m, •
.
.
"
Suppose that {Xn} is independent and that A E !T n �= 1 u(Xn , Xn + P · . . )· Then either P(A ) = O or P(A) = 1. Theorem 22.3.
Let .ro = u k = I u( XI ' . . . ' Xk ). The first thing to establish is that .ro is a field generating the u-field u(X 1 , X2 , ). If B and C lie in �, then B E o-(Xp . . . , X) and C E o-(X1, . . . , Xk ) for some 1 and k; if m = max{j, k}, then B and C both lie in u(X; , . . . , Xm), so that B U C E u(Xp . . . , Xm ) c .ro . Thus .ro is closed under the formation of finite unions; since it is similarly closed under complementation, � is a field. For H E .9R 1 ' [ xn E H] E .ro c u(.ro), and hence xn is measurable u(.ra); thus .ro generates u( X1 , X2 , ) (which in general is much larger than .ra). Suppose that A lies in .9'". Then A lies in u( Xk+ P Xk + 2 , . . . ) for each k. Therefore, if B E u(Xp . . . , X1 ), then A and B are independent by Theorem 20.2. Therefore, A is independent of .ro and hence by Theorem 4.2 is also independent of u(Xp X2 , ). But then A is independent of itself: P(A n A) = P(A)P(A) . Therefore, P(A) = P 2 (A), which implies that P(A) is ei • ther 0 or 1 . PROOF.
•
•
•
•
•
•
•
•
•
As noted above, the set where L:n X/w) converges satisfies the hypothesis of Theorem 22.3, and so does the set where n - 1 L:Z= 1 Xk(w) -7 0. In many similar cases it is very easy to prove by this theorem that a set at hand mu st have probability either 0 or 1. But to determine which of 0 and 1 is, in fact, the probability of the set may be extremely difficult. Maximal Inequalities
Essential to the study of random series are maximal inequalities-inequali ties concerning the maxima of partial sums. The best known is that of Kolmogorov.
Suppose that Xp . . . , Xn are independent with mean 0 and finite variances. For a > 0, Theorem 22.4.
(22.9)
288
RAN DOM VARIABLES AND EXPECTED VALUES
A k be the set where I Ski > a but ISj l < a for j < k. Since the Ak n E[ s,;] > r_ f s; dP k n= I A k = kL JA k [ sl + 2 Sd Sn - Sk) + (Sn - Sk) 2] dP n= l > kL fA [ Sl + 2Sk (Sn - Sk) ] dP. =l k Since A k and S k are measurable o-( X1, , Xk ) and Sn - Sk is measurable o-(X/:+ 1 , , Xn ), and since the means are all 0, it follows by (21 . 19) and independence that fA kS1.( S" - Sk ) dP 0. Therefore, n n E [ S,;] > L J S} dP > L a 2 ( A k) k= l k = l Ak = a2P [ max k :s:n !Skl > a ] . By Chebyshev' s inequality, P[ISnl > a] < a - 2 Var[ S�]. That this can be strengthened to (22.9) is an instance of a general phenomenon: For sums of indep endent variables, if maxk "'" I Ski is large, then ISn l is probably large as Let are disjoint,
PROOF.
•
•
•
•
•
•
=
P
•
l:s
well. Theorem 9.6 is an instance of this, and so is the following result, due to Etemadi. Theorem 22.5.
( 22.10) PROOF.
the
Suppose that X1 ,
• • •
P [ 1 �ka:n I Ski > 3a ]
, Xn are independent. For
a > 0,
< 3 1 �ka:n P[ I Ski > a] . Let Bk be the set where ISk l > 3a but IS) < 3a for j < k. Since
Bk are disjoint,
n- 1 P L �ka:n ISk I > 3 a ] < P [ ISn I > a] + k P ( Bk n [ ISn I < a] } �1 n- 1 < P[ISnl > a] + L P(Bk n [ 1 Sn - Skl > 2a] ) k= l n- 1 = P(ISnl > a] + kL P( Bk) P[1Sn - Skl > 2 a] =l < P[IS"I > a] + I:Smax k :Sn P(IS" - Ski > 2a] " I > a] + P(ISk [> a] ) (P(IS � P(IS" I > a] + _ max k :S :s:n < 3 l max :S k :Sn P [ ISk I > a] . 1
•
SECTION 22.
SUMS OF INDEPENDENT RANDOM VARIABLES
289
If the Xk have mean 0 and Chebyshev's inequality is applied to the right side of (22.10), and if a is replaced by a/3, the result is Kolmogorov 's inequality (22.9) with an extra factor of 27 on the right side. For this reason, the two inequalities are equally useful for the applications in this section. Convergence of Random Series
For independent Xn, the probability that L:Xn converges is either 0 or 1. It is natural tO try and characterize the two cases in terms of the distributions of the individual Xn.
Suppose that {Xn } is an independent sequence and E[Xn ] = L Var[ Xn ] < oo, then LXn converges with probability 1.
Theorem 22.6.
0. If
PROOF.
By (22.9),
Since the sets on the left are nondecreasing in
Since L: Var[ Xnl converges,
[ k� l
r,
lim P sup iSn + k - Snl > €
(22.1 1 )
n
]
letting r -7 oo gives
=
0
for each E. Let E(n, E) be the set where supj, k � n 1Sj - Ski > 2E, and put £(€ ) = n n E( n , € ). Then E(n, EH £(€ ), and (22. 11) implies P( £ (€)) = 0. Now U E E(E), where the union extends over positive rational E, contains the set where the sequence {Sn} is not fundamental (does not have the Cauchy property), and this set therefore has probability 0. • Let Xn(w) = rn(w) a n, where the rn are the Rademacher functions on the unit interval-see (1. 1 3). Then Xn has variance a�, and so [a� < oo implies that L:rn(w)a n converges with probability 1. An interesting special case is a n = n - 1 • If the signs in L: + n - 1 are chosen on the toss of a coin, then the series converges with probability 1 . The alternating harmonic series 1 2 - 1 + 3 - 1 + · · · is thus typical in this respect. • Example 22.2.
-
If L:Xn converges with probability 1 , then Sn converges with probability 1 to some finite random variable S. By Theorem 20.5, this implies tha t
RAN DOM VARIABLES AND EXPECTED VALUES
290
Sn � P S. The reverse implication of course does not hold in general, but it does if the summands are independent.
For an independent sequence {Xn }, the Sn converge with probability 1 if and only if they converge in probability. Theorem 22.7.
p
It is enough to show that if sn � S, then {Sn} is fundamental with probability 1. Since PROOF.
sn � p s implies lim sup P ( ISn t-J - Sni > E ] = 0 .
(22 . 1 2)
n
j :?.
I
But by (22. 10),
and therefore
[
]
[
P sup iSn +k - Sni > E < 3 sup P 1Sn+k - Snl > k�l k� l
�].
It now follows by (22. 12) that (22. 1 1 ) holds, and the proof is completed as • before. The final result in this direction, the three-series theorem, provides neces sary and sufficient conditions for the convergence of [ X11 in terms of the individual distributions of the xn. Let x�c) be xn truncated at c: x�c) = Xn [(IX,I :;; ]· c
Theorem 22.8.
senes
Suppose that {Xn}
is
independent, and consider the three
(22 . 13)
In order that EXn converge with probability 1 it is necessary that the three series converge for all positive c and sufficient that they converge for some positive c.
S ECTION 22.
SUMS OF INDEPENDENT RANDOM VARIABLES
291
Suppose that the series (22.13) converge, and put m�c) E[ X�cl]. By Theorem 22.6, L:( X�c) - m�c)) converges with probabil ity 1, and since L:m�c) converges, so does L:x�cl. Since P[ Xn =t= X�c) i.o.] 0 by the first Borel-Cantelli lemma, it follows finally that L:Xn converges with • probability 1. PROOF OF SuFFICIENCY. =
=
Although it is possible to prove necessity in the three-series theorem by the methods of the present section, the simplest and clearest argument uses the central limit theorem as treated in Section 27. This involves no circularity of reasoning, since the three-series theorem is nowhere used in what follows. PROOF OF NECESSITY. Suppose that L:Xn converges with probability 1 , and fix c > 0. Since X11 � 0 with probability 1 , it follows that L:X�c) con verge� with probability 1 and, by the second Borel-Cantelli lemma, that
L:P [I Xn l > c] < oo.
Let MYl and s�c) be the mean and standard deviation of s�c) EZ = 1Xkc)_ If s�c) � oo, then since the x�c) - m�c ) are uniformly bounded, it follows by the central limit theorem (see Example 27.4) that =
(22.14)
lim p n
[
s
l
-=
1 V12; £�
j
·Y
X
e
-
1 /2 dt. 2
And since EX� c) converges with probability 1, s�c) � oo also implies s�c)js�c) � 0 with probability 1, so that (Theorem 20.5) limP [ I S�c)/s�c)l > € ] = 0. n But (22. 1 4) and (22.15) stand in contradiction: Since (22.15)
p
[x <
S �c l - M�cl s�cl < y , ""(c) sn s(c) n
.
<- €
l
is greater than or equal to the probability in (22. 14) minus that in (22.15), it is positive for all sufficiently large n (if x < y ) But then .
X
- € < - M(c);s
and this cannot hold simultaneously for, say, (x - E, y + E) = ( - 1, 0) and (x - E, y + E ) = (0, 1). Thus s�c) cannot go to and the third series in (22.13) converges. And now it follows by Theorem 22.6 that L:( X�c) - m�cl) converges with • probability 1, so that the middle series in (22. 13) converges as well. oo,
If Xn r11 an , where rn are the Rademacher functions, then L:a � < oo implies that L:Xn converges with probability 1. If L:Xn con verges, then a n is bounded, and for large c the convergence of the third Example 22.3.
=
292
RANDOM VARIABLES AND EXPECfED VALUES
series in (22.13) implies [a� < oo: If the signs in L: + a n are chosen on the toss of a coin, then the series converges with probability 1 or 0 according as [a� converges or diverges. If [a� converges but L:lanl diverges, then I: + a n is • with probability 1 conditionally but not absolutely convergent . If a n � 0 but [a� = oo, then L: + an converges if the signs are strictly alternating, but diverges with probability 1 if they are chosen on the toss of a coin. • Example 22.4.
Theorems 22.6, 22.7, and 22.8 concern conditional convergence, and in the most interesting cases, L:Xn converges not because the Xn go to 0 at a high rate but because they tend to cancel each other out. In Example 22.4, the terms cancel well enough for convergence if the signs are strictly alternating, but not if they are chosen on the toss of a coin. Random Taylor Series •
Consider a power series L: + z n , where the signs are chosen on the toss of a coin. The radius of convergence being 1 , the series represents an analytic function in the open unit disk D0 = [z: l z l < 1] in the complex plane. The question arises whether this function can be extended analytically beyond D0. The answer is no: With probability 1 the unit circle is the natural boundary. Theorem 22.9.
Let {Xn} be an independent sequence such that
P[ Xn = l] = P[ X,, = - 1] = } ,
( 22 . 16)
There is probability
0
that "'
(22.17)
n = O, l, . . . .
F( w , z) = L Xn( w ) z n
n
=0
coincides in D0 with a function analytic in an open set properly containing D0• It will be seen in the course of the proof that the w-set in question lies in u(X0, X1 , ) and hence has a probability. It is intuitively clear that if the set is measurable at all, it must depend only on the Xn for large n and hence must have probability either 0 or 1. •
PRooF.
(22 . 18) *
•
•
Since
n = 0, 1 , . . .
This topic, which requires complex variable theory, may be omitted.
SECTION 22.
SUMS OF INDEPENDENT RANDOM VARIABLES
293
with probability 1, the series in (22. 17) has radius of convergence 1 outside a set of measure 0. (I < r ], where ( E D0 and r > 0. Now Consider an open disk D = [ I (22.17) . coincides in D0 with a function analytic in D0 u D if and only if its expanston
z: z -
about ( converges at least for this holds. The coefficient
l z - (I < r. Let A0 be the set of w for which u(Xm , m lam{w)l1 1 ,., am o(w ), am ui )
is a complex-valued random variable measurable X + I> ) . By the root test, E A0 if and onl� if lim supm < r - 1 • For each m0, the condition for E A D can thus be expressed in term'> of I( CL.I , . Thus Ao has a probability, and !n > alone, and so Ao I fact P(A0) is 0 or 1 by the zero·one law. Of course, P( A0) = 1 if D c D0. The central step in the proof is to show that P( A0) = 0 if D contains points not in D0. Assume on the contrary that P( A0) = 1 for such a D. Consider that part of the circumference of the unit circle that lies in D, and let k be an integer large enough that this arc has length exceeding 27T jk. Define
w
•
w
\=.
u(Xmu' xm o +
•
•
•
)
Let B0 be the w-set w�ere the function
•
•
•
•
•
if n ¥: 0 ( mod k ) , if n = 0 ( mod k) .
"'
G ( w, z ) = L Yn(w ) z n
( 22 . 19 )
rl
=0
coincides in D0 with a function analytic in D0 U D. . . } has the same structure as the original sequence: The sequence the are independent and assume the values + 1 with probability i each. Since B0 is defined in terms of the in the same way as A0 is defined in terms of the it is intuitively clear that P(B0) and P( A 0) must be the same. Assume for the moment the truth of this statement, which is somewhat more obvious than its proof. If for a particular each of (22. 17) and (22. 19) coincides in D0 with a function analytic in D0 U D, the same must be true of
{Y0 , Yw
Yn
Yn
Xn,
w
"'
(22.20)
F( w, z) - G( w, z) = 2 L= xm k ( w)z mk. m O
294
RANDOM VARIABLES AND EXPECTED VALUES D1 = [ ze21Tilf k : z E D ].
Since replacing z by ze 21Ti/k leaves the function (22.20) unchanged, it can be extended analytically to each D0 u Dp I = 1, 2, . . . . Because of the choice of k , it can therefore be extended analytically to [ z : l z l < 1 + E] for some positive E; but this is impossible if (22.18) holds, since the radius of convergence must then be 1. Therefore, AD n BD cannot contain a point w satisfying (22.18). Since (22. 18) holds with probability 1, this rules out the possibility P(AD) = P(BD) = 1 and by the zero-one law leaves only the possibility P(A D) = P( BD) = 0. Let A be the w-set where (22.17) extends to a function analytic in some open set larger than D0. Then w E A if and only if (22. 17) extends to D0 U D for some D = [ z : l z - ( l < r] for which D - D0 =t= 0, r is rational, and ( has rational real and imaginary parts; in other words, A is the countable union of ) and has probability 0. A D for such D. Therefore, A lies in a-( X0 , X 1 , It remains only to show that P( AD) = P(BD), and this is most easily done by comparing {Xn} and {Y) with a canonical sequence having the same n structure. Put Zn(w) = (Xn(w) + 1)/2, and let Tw be L� = o Zn(w)2 - - l on the w-set A* where this sum lies in (0, 1]; on f! - A* let Tw be 1, say. Because of (22.16) P( A* ) = 1. Let .9"= u (X0 , X 1 , . . . ) and let � be the u-field of Borel subsets of (0, 1 ]; then T: fl � (0, 1] is measurable .9"/ �. Let rn (x) be the nth Rademacher function. If M = [x: r;( x ) = u ; , i = 1 . . . . , n ], where u; = + 1 for each i, then P(T- 1M) = P[w: X;(w) = u1, i = 0, 1, . . . , n 1 ] = 2 - n , which is the Lebesgue measure A(M ) of M. Since these sets form a 7T-system generating �. P(T- 1 M) = A( M) for all M in f!J (Theorem 3.3). Let MD be the set of x for which L:� =o rn + 1 (x) z n extends analytically to D0 U D. Then MD lies in f!J, this being a special case of the fact that AD lies in !F. Moreover, if w E A* , then w E A D if and onl� if Tw E M0: A* nAD = A* n T- 1MD. Since P( A* ) = 1, it follows that P(AD) = A(MD). This argument only uses (22.16), and therefore it applies to {Yn } and BD as • wei!. Therefore, P( BD) = A(MD) = P(AD). Let
•
•
•
PROBLEMS 22.1.
Suppose that X 1 , X , . . . is an independent sequence and Y is measurable u ( Xn , Xn + I > . . . ) for 2 each n. Show that there exists a constant a such that P[ Y= a] =
22.2.
1.
Assume {Xn} independent, and define X�c> as in Theorem 22.8. Prove that for !:IXnl to converge with probability 1 it is necessary that [P[ I Xnl > c] and [E[IX�c>l] converge for all positive c and sufficient that they converge for some positive c. If the three series (22.13) converge but [E[ I ��cll] = oo, then there is probability 1 that [Xn converges conditionally but not a bsolutely. Generalize the Borei-Cantelli lemmas: Suppose Xn are nonnegative If [E[ Xn] < oo, then [Xn converges with probability 1. If the Xn are indepen dent and uniformly bounded, and if [E[ Xn] = oo, then [Xn diverges witt probability 1.
22.3. i
(a)
SECTION
22.
SUMS OF INDEPENDENT RANDOM VARIABLES
295
(b) Construct independent, nonnegative X11 such that [X11 converges with probability 1 but [£[ X11] diverges For an extreme example, arrange that P[ Xn > 0 i.o.] 0 but £[X11] oo. =
=
22.4. Show under the hypothesis of Theorem 22.6 that
extend Theorem 22.4 to infimte sequences. •
22.5. 20.14
[Xn has finite variance and
a re independent, each with the Cauchy 22.1 i Suppose that XI > X2 , distribution (20.45) for a common value of u. (a) Show that n - 1 !:Z 1 Xk does not converge with probability 1 . Contrast with Theorem 22. 1 . (b) Show that P[ n - I max k "' " xk < X ] --+ e - u j TTX for X > 0. Relate to Theorem 14.3. •
•
•
=
22.6. If
X1 , X2, are independent and iden tically disuibuted, and if P[ X1 > 0] 1 and P[ X1 > 0] > 0 , then [11X11 = oo with probability 1. Deduce this from Theorem 22. 1 and its corollary and also directly: find a positive E such that xr > c infinitely often with probability 1 . •
•
=
•
22.7. Suppose that
X1, X2, . . . are independent and identically distributed and E[ I X1 I] = oo. Use (21.9) to show that L11 P[ I Xn l > an] = oo for each a, and conclude that S Up11 n - 1 1 X111 oo with probability 1. Now show that sup" n - : 1511 1 = oo with probability 1. Compare with the corollary to Theorem 22. 1 . =
22.8.
Wald's equation. Let X1, X2, be in dependent and identically distributed + Xn. Suppose that -r is a stopping with finite mean, and put Sn = X1 + time: -r has positive integers as values and [ -r = n] E u(X1, , X11); see Section 7 for examples. Suppose also that £[ -r] < oo. (a) Prove that •
•
•
·
·
·
.
•
.
(22.21 )
(b) Suppose that Xn is + 1 with probabilities p and q, p -=F q, let -r be the first n for which Sn is - a or b (a and b positive integers), and calculate £[-r]. This gives the expected duration of the game in the gambler's ruin problem for unequal p and q. 22.9. 20.9 i
Let Z11 be 1 or 0 according as at time n there is or is not a record in the sense of Problem 20.9. Let R11 = Z1 + + Zn be the number of records up to time n. Show that R11jlog n -+P 1. ·
·
·
22.10. 22. 1
{X11} the radius of conver i (a) Show that fo r an independent sequence 11 gence of the random Taylor series L11 X11 Z is r with probability 1 fo r some nonrandom r. (b) Suppose that the Xn have the same distribution and P[X1 -=F 0] > 0. Show that r is 1 or 0 according as log +I X1 1 has finite mean or not.
22.1 1. Suppose that
X0, X1, . . . are independent and each is uniformly n d istributed over [0, 2 '7T ]. Show that with probability 1 the series [11eiX,z has the unit circle as its natural boundary.
22.12. Prove (what is essentially Kolmogorov's zero-one law) that if
A is independe nt
of a '7T-system f?J and A E u(f?J), then P(A ) is either 0 or 1 .
RANDOM VARIABLES AND EXPECTED VALUES
296 22.13.
22. 14.
Suppose that .w' is a semiring conta ining !l. (a) Show that if P(A n B) < bP( B) for all B E .w', and if b < 1 and A E a-(.9«'), then P(A) = 0. (b) Show that if P( A n B) < P( A )P( B) for all B E .w', and if A E a-(.9«'), then P( A) is 0 or 1. (c) Show that i f aP(B) < P(A n B) for all B E .w', and if a > 0 and A E a-(.9«'), then P( A ) = 1. (d) Show that if P( A )P( B) < P( A n B) for all B E .w', and if A E a-(.9«'), then P( A) is 0 or 1. (e) Reconsider Problem 3.20
Burstin's theorem. Let f be a Borel function on [0, 1] with arbitrarily small periods: For each *' there is a p such that 0 < p < *' and f(x) = f(x + p) for 0 < x < 1 - p. Show that such an f i<; cons tan! almost everywhere: (a) Show that it is er.ough to prove that P(r ' B) is 0 or 1 for every Borel set B, where P is Lebesgue measure on the unit interval. (b) Show that r I B is independent of each interval [0, X], and conclude that P(f- 18 ) is 0 or 1. 22. 12 i
(c) 22.15.
Show hy example that f need not be constant.
Assume that X1,
•
•
•
, Xn are indepeudent and s, t, a are nonnegative Let
L(s) = max P [ ISk l > s ] , R (s ) = max P [ IS" - Sk i > s ], k �n
[
]
k �n
M ( s ) = P max ISk I > s , T ( s ) = P [ ISn I > s ] . n (a)
k�
Following the first part of the proof of (22. 10), show that
{22.22)
M(s + t ) < T( t) + M(s + t ) R(s ) .
Take s = 2a and t = a; use (22.22), together with the inequalities T(s) < L(s) and R(2s) < 2 L(s), to prove Etemadi;s inequality (22. 10) in the form (b)
{22.23 )
M(3a) < Bd a) = 1 1\ 3L(a).
Carry the rightmost term in (22.22) to the left side, take s = t == a, a nd prove Ottaviani's inequality: (c)
(22.24)
M(2a) < B0(a ) = 1 1\
T(a ) 1 - R (a) .
(d) Prove This shows that the Etemadi and Ottaviani inequalities are of the same power for most purposes (as, fo r example, for the proofs of Theorem 22.7 and (37.9)). Etemadi's inequality seems the more natural of the two. Neither inequality can replace (9.39) in the proof of the law of the iterated logarithm .
SECTION
23.
THE POISSON PROCESS
297
SECTION 23. THE POISSON PROCESS Characterization of the Exponential Distribution
Suppose that
(23.1)
X has the exponential distribution with parameter a: X > 0. P[X>x] - ax = e ,
The definition (4. 1) of conditional probability then gives
P[X > x +yiX>x] = P [X> y],
(23.2)
X , y > 0.
X
Image as the waiting time for the occurrence of some event such as the arrival of the next customer at d queue or telephone call at an exchange. As observed in Section 14 (see (14.6)), (23.2) attributes to the waiting-time mechanism a lack of memory or aftereffect. And as shown in Section 14, the condition (23.2) implies that has the distribution (23.1) for some positive a. Thus if in the sense of (23.2) there b no aftereffect in the waiting-time mechanism, then the waiting time itself necessarily follows the exponential law.
X
The Poisson Process
Consider next a stream or sequence of events, say arrivals of calls at an be the waiting time to the first event, let be the waiting exchange. Let time between the first and second events, and so on. The formal model consists of an infinite sequence of random variables on some · · · represents the time of occurrence probability space, and S = of the nth event; it is convenient to write S0 = 0. The stream of events itself remains intuitive and unformalized, and the mathematical definitions and arguments are framed in terms of the If no two of the events are to occur simultaneously, the S must be strictly increasing, and if only finitely many of the events are to occur in each finite interval of time, s must go to infinity:
X2
X1
n
XI > X2 , X1 + +X
•
•
•
n
X
n.
n
n
(23.3)
,
This condition is the same thing as
(23 .4)
n
Throughout the section it will be assumed that these conditions hold every where-for every w. If they hold only on a set A of probability 1, and if Xn(w) is redefined as Xn(w) = 1, say, for w � A, then the conditions hold everywhere and the joint distributions of the xn and sn are unaffected.
298
RANDOM VARIABLES AND EXPECTED VALUES
Condition 0°. For each w, (23.3) and (23.4) hold. The arguments go through under the weaker condition that (23.3) and (23.4) hold with probability 1, but they then involve some fussy and uninter esting details. There are at the outset no further restrictions on the X;; they are not assumed independent, for example, or identically distributed. The number N, of events that occur in the time interval [0 , t] is the largest integer n such that sn < t:
N, = max [ n : S n < t ) . Note that N, = 0 if t < S 1 = X 1 ; in particular, N0 = 0. The number of events in (s, t] is the increment N, - Ns (23 .5)
-
N1
N1
=
2
=0
So = 0
From (23 .5) follows the basic relation connecting the N, with the Sn :
(23.6)
[ N, > n ] = [Sn < t ] .
From this follows
( 23.7) Each N, is thus a random variable. The collection [ N,: t :2:: 0] is a stochastic process, that is, a colle<.:tion of random variables indexed by a parameter regarded as time. Condition oo can
be restated in terms of this process:
Condition 0°. For each w, N,(w) is a nonnegative integer for t > 0, N0(w) = 0, and lim N,( w) = oo; further, for each w, N,( w) as a June tion of t is nondecreasing and right-continuous, and at the points of discontinuity the saltus N,(w) - sups < N/w) is exactly 1. 1 _, "'
1
It is easy to see that (23.3) and (23.4) and the definition (23.5) give random variables N, having these properties. On the other hand, if the stochastic
SECTION
23.
THE POISSON PROCESS
N,:
299
process [ t :2: 0] is given and does have these properties, and if random intl t : { ) > n] and variables are defined by = ), then (23.3) and (23.4) hold, and the definition (23.5) gives back the original N, . Therefore, anything that can be said about the can be stated in terms of the N, , and conversely. The points . . . of (O, oo) are exactly the discontinuities of N,{w) as a function of t; because of the queueing example, it is natural to call them arrival times. The program is to study the joint distributions of the under conditions on the waiting times and vice versa. The most common model specifies the independence of the waiting times and the absence of aftereffect:
Sn(w)
sn _,( w
N, w
=
Xn(w) Sn(w) Xn S 1{ w), S 2(w), N,
Xn
Condition r. The X,. are independent , and each is exponentially distributed with parameter a.
Xn
n
In this case P[ > OJ = 1 for each n and n s � a - I by the strong law of large numbers (Theorem 22.1), and so (23.3) and (23.4) hold with probabil ity 1; to assume they hold everywhere (Condition 0°) is simply a convenient normalization. Under Condition 1°, has the distribution function specified by (20.40), so that P[ N, > n] = f.7 ne - "''(at)'/i! by (23.6), and -
•
Sn
(23.8)
n (at )
n = 0 1,... .
P[ Nr = n] = e - ar n ! ,
,
N,
Thus has the Poisson distribution with mean at. More will be proved in a moment.
Condition 2°. {i) For 0 < t < · · · < t k the increments N, , , N,l - N, , , . . . , N,k - N, are independent. {ii) The individual increments have the Poisson distribution: 1
k-I
t s (23.9) P[ Nr - N = n ] = e - a< r - s ) ( a ( n-! s
n ))
'
n = 0 , 1 , . . . , 0 < s < t.
[N,:
t > 0] of Since N0 = 0, (23.8) is a special case of (23.9). A collection random variables satisfying Condition 2° is called a Poisson process, and a is the rate of the process. As the increments are independent by {i), if r < s < t, then the distributions of � - Nr and - � must convolve to that of - Nr But the requirement is consistent with {ii) because Poisson distribu tions with parameters u and v convolve to a Poisson distribution with parameter u + v.
N,
N,
Theorem 23.1.
Condition oo.
Conditions 1 o and 2° are equivalent
tn
the presence of
300
RAN DOM VARIABLES AND EXPECTED VALUES
PROOF OF 1 o -7
2°. Fix t, and consider the events that happen after time t. By (23.5), SN( < t < SN( + I • and the waiting time from t to the first event following t is SN( + 1 - t; the waiting time between the first and second events following t is XN( + 2 ; and so on. Thus (23. 10) XU1 ) - sN( + I - t , X2< l> -- XN( + 2 ' X3< l> - XN, + 3 • · · · define the waiting times following t. By (23.6), N, + ' - N, > m, or N, + s > N, + )< . . . +X� m, if and only if S N + m < t + s , which is the same thing as xv> + , s . Thus
Nl + s - N = max [ m · x Iu > +
( 23. 1 1 )
1
•
· · ·
+ xu> m -< s ] ·
Hence [ Nl + s - N1 = m ] = [ x I + · · · +X < 1 > -< s < x u > + · · + Xu> m + I ]· A comparison of (23.11) and (23.5) shows that for fixed t the random variables N, + s - N, for s � 0 are defined in terms of the sequence (23. 10) in exactly the same way as the � are defined in terms of the original sequence of waiting times. The idea now is to show that conditionally on the event [N, = n] the random variables (23.10) are independent and exponentially distributed. Because of the independence of the Xk and the basic property (23.2) of the exponential distribution, this seems intuitively clear. For a proof, apply (20.30). Suppose y > 0; if Gn is the distribution function of S n ' then since X11 + I has the exponential distribution,
=
J
X :5;
I
·
l
m
P[ Xn + l > t + y - x ] dGn( x)
= e - "')'
J
X :5; I
P[ Xn + 1 > t - x ] dGn( x )
By the assumed independence of the X11,
= P[ S n + l - t > y I • Sn -< t < Sn + l ] e - "'Y2 = P[Sn -< t < Sn + l ] e - ay , ( 23 .12)
p
[ Nl = n ' ( X I
I)'
.
•
.
'
· · · e - "'Yi
· · · e - ayi ·
X} I ) ) E H ] = p [ Nl = n ] p [ ( X ' . . . ' xj) E H ] . l
SECTION
23.
THE POISSON PROCESS
301
By Theorem 10.4, the equation extends from H of the special form above to all H in �i. Now the event [ N5 . = m;, 1 < i < u ] can be put. in the form [(X1 , , Xi ) E H], where j = mu + 1 and H is the set of x in R 1 for which x 1 + · · · +xm · < s; < x 1 + · · · +xm; + l • 1 < i < u. But then [( Xl ' >, . . . , xp >) E H] is by (23. 1 1) the same as the event [ N, - N, = m;, 1 < i < u]. Thus (23.12) gives •
(
.
.
+s (
From this it follows by induction on
k
that if 0 = t0 < t k
1
<
· · · < tk, then
P [ N, - N, = n;, 1 < i < k l = O P[ N,._, = n,· ] .
( 23.13 )
r
r-1
.
i
=I
1
1 -l
Thus Condition 1 o implies (23.13) and, as already seen, (23.8). But from (23.13) and (23.8) follow the two parts of Condition 2°. •
2° -7 1°. If 2° holds, then by (23.6), P[ X1 > t] = P[N, = 0] = e - ar , so that X1 is exponentially distributed. To find the joint distribution of X1 and X2, suppose that 0 < s 1 < t 1 < s2 < t2 and perform the caiculation PRooF oF
P[ s < s < t s 2 < s 2 < t2 ] = P[ N5 = 0, N, 1 - � = 1 , � 2 - N, = 0, N, 2 - �s > 1 ] 1 I
I'
I
1
1
= e - as, X a ( t l - s l)e - a
s 1
Thus for a rectangle A contained in the open set G = [{y i> y2): 0 < y 1 < y2].
By inclusion-exclusion, this holds for finite unions of such rectangles and hence, by a passage to the limit, for countable ones. Therefore, it holds for A = G n G' if G' is open. Since the2 open sets form a 7T-system generating the Borel sets, (S1, S2) has density a e-a y 2 on G (of course, the density is •o outside G). k By a similar argumentk in R (the notation only is more complicated), (S1, , Sk) has density a e-a yk on [ y : 0 < y 1 < · · · < h ]. If a linear trans formation g(y) = x is defined by X; = Y; - Y; - 1, then (X1, ; Xk ) = g(S1, , Sk ) has by (20.20) the density Of= 1 ae-ax; (the Jacobian is identi cally 1 ). This proves Condition 1°. • •
•
•
•
•
•
•
•
•
RANDOM VARIABLES AND EXPECTED VALUES
302 The Poisson Approximation
Other characterizations of the Poisson process depend on a generalization of the classical Poisson approximation to the binomial distribution.
Suppose that for each n, Zn 1 , , Znrn are independent random variables and Zn k assumes the values 1 and 0 with probabilities Pn k and 1 - Pn k · If rn ( 23.14) Pn k -7 A > 0, L k= l Theorem 23.2.
•
•
•
then i = 0, 1 , 2, . . . .
(23.15)
If A = 0, the limit in (23.15) is interpreted as 1 for i = 0 and 0 for i > 1. In the case where rn = n and Pn k = Ajn, (23.15) is the Poisson approximation to the binomial. Note that if A > 0, then (23.14) implies rn -7 oo.
PRooF. The argument depends on a construction like that in the proof of be independent random variables, each uni Theorem 20.4. Let U1 , U2 , formly distributed over [0, 1). For each p, 0 < p < 1, �plit [0, 1) into the two intervals /0( p) = [0, 1 - p) and / 1 p) = f 1 - p, 1), as well as into the sequence . - 0, 1 , . . . . D efi ne vn - 1 i i i I l "' ') t jl·' p of mterva I s Ji ( P ) -- [ "' p �... J. , J. = le �...i = le , i k if uk E / J( Pn k ) and vn k = 0 if u�.. E Io . . . , Znrn, (23.15) will follow if it is shown that Vn = I:�"= 1 V,. k satisfies •
.
•
p
•
{
p
"
( 23.16) Now define Wn k = i if Uk E Ji( Pn k ), i = 0, 1, . . . . Then P[Wnk i] e -Pn k p� k ji!-Wn k has the Po isson distribution with mean Pn k · Since the W,. k =
=
SECTION
23.
THE POISSON PROCESS
303
are independent, Wn = L�"= 1 Wnk has the Poisson distribution with mean An = L�"= 1 Pn k · Since 1 - p < e - P, 1 1( p ) c I1( p ) (see the diagram). Therefore,
P[ V,. k =I= Wn k ) = P[ V,. k = 1 =I= Wn k ) = P( Uk E II ( Pn k ) - 11 ( Pn k )] = pn k - e -Pnk pn k -< pn2k '
and
P[ V,, =I= Wn ] <
rn
Pn k P�k < An max L k l k= l s; s;rn
�
0
by (23.14). And now (23.16) and (23.15) follow because II
Other Characterizations of the Poisson Process
The condition (23.2) is an interesting characterization of the exponential distribution because it is essentially qualitative. There are qualitative charac terizations of the Poisson process as well. For each w, the function N,(w) has a discontinuity at t if and only if Sn( w ) = t for some n > 1; t is a fixed discontinuity if the probability of this is positive. The condition that there be no fixed discontinuities is therefore
( 23 .17 )
t > 0, n > 1 ;
that is, each of S 1 , S 2 , . . . has a continuous distribution function. Of course there is probability 1 (under Condition 0°) that N,(w) has a discontinuity somewhere (and indeed has infinitely many of them). But {23.17) ensures that a t specified in advance has probability 0 of being a discontinuity, or time of an arrival. The Poisson process satisfies this natural condition.
If Condition oo holds and [ N, : t > 0] has independent increments and no fixed discontinuities, then each increment has a Poisson distribution. This is Prekopa ' s theorem. The conclusion is not that [ Nr : t > 0] is a Poisson process, because the mean of N, - N5 need not be proportional to t - s. If rp is an arbitrary nondecreasing, continuous function on [0, oo) and cp{O) = 0, and if [ N, : t > 0] is a Poisson process, then N
conditions of the theorem.t
PRooF. The problem is to show for t' < t" that N," - N,, has for some A > 0 a Poisson distribution with mean A, a unit mass at 0 being regarded as a Poisson distribution with mean 0. tThis is in fact the general process satisfying them; see Problem 23.8
304
RANDOM VARIABLES AND EXPECTED VALUES
The procedure is to construct a sequence of partitions
t ' = tno < tn l < · · · < tn rn = t
( 23.18)
"
of [t', t"] with three properties. First, each decomposition refines the preced ing one: each tn k is a tn + l , j· Second, rn
P [ N,nk - N,n k - 1 > 1 ] j A L , k= l
( 23.19) for some finite A and
( 23.20) Third,
(23.21)
l
[
pr L N,n k-1 > 1 ] -7 0. N, nk �k �r max
n
-
,
P l max ( N,nk - N,n k-1 ) > � k � rn
2]
-7
0.
Once the partitions have been constructed, the rest of the proof is easy: Let Z,. k be 1 or 0 according as N, nk - N, n,k - 1 is positive or not. Since [ N,: t > 0] has independent increments, the Zn k are independent for each n. By Theorem 23.2, therefore, (23.19) and (23.20) imply that Zn = L�"= 1 Znk satis fies P[ Zn = i] -7 e- AAiji! Now N,. - N,, > Zn , and there is strict inequality if and only if N, k - N, - > 2 for some k. Thus (23.21) implies P[ N," - N,, =I= Zn ] -7 0, and therefo;� P[ N,, - N1, = i] = e- AAiji! To construct the partitions, consider for each t the distance D, = infm 1 It - Sml from t to the nearest arrival time. Since Sm -7 oo, the infimum is achieved. Further, D, = 0 if and only if Sm = t f01 some m , and since by hypothesis there are no fixed discontinuities, the probability of this is 0: P[ D, = O] = O. Choose 8, so that 0 < 8, < n - 1 and P[D, < 8,] < n - 1 • The intervals (t - o, t + 8,) for t' < t < t" cover [t', t"]. Choose a finite subcover, and in (23.18) take the tn k for 0 < k < rn to be the endpoints (of intervals in the subcover) that are contained in (t', t"). By the construction, �
(23.22) and the probability that (t n, k _ 1 , tn k ] contains some Sm is less than n - 1 • This gives a sequence of partitions satisfying (23.20). Inserting more points in a partition cannot increase the maxima in (23.20) and (23.22), and so it can be arranged that each partition refines the preceding one. To prove (23.21) it is enough (Theorem 4. 1) to show that the limit superior of the sets involved has probability 0. It is in fact empty: If for infinitely many n, N, (w ) - N,n, k-J(w ) > 2 holds for some k < rn , then by (23.22), N,(w) as a n k.
SECTION
23.
THE POISSON PROCESS
305
function of t has in [t', t"] discontinuity points (arrival times) arbitrarily close together, which requires S m(w) E [t', t"] for infinitely many m, in violation of Condition 0°. It remains to prove (23.19). If Znk and Zn are defined as above and Pnk = P[Zn k = 1] , then the sum in (23. 19) is L k Pnk = E [ Zn ] . Since zn + I > Zn , L k Pnk is nondecreasing in n. Now
P[ N,.. - N, . = O) = P[ Zn k = 0, k < rn ]
- 0 (1 -p k= l
nk
[
) < exp -
t
k=l
l
Pnk ·
If the left-hand side here is positive, this puts an upper bound on [, k pr. k • and (23.19) follows. But suppose P[N,. - N,. = 0] = 0. If s is the midpoint of t' and t", then since the increments are independent, one of P[ N5 - N,. = 0] and P[N,. - � = OJ must vanish. It is therefore possible to find a nested sequence of interval'> [urn, vn ] such that vm - urn -7 0 and the event A m = [N, - Nu > 1] has probability 1. But then P( n m A m ) = 1 , and if t is the point common to the [urn, vm ], there is an arrival at t with probability 1, contrary to the assumption that there are no fixed discontinuities. • m
m
Theorem 23.3 in some cases makes the Poisson model quite plausible. The increments will be essentially independent if the arrivals to time s cannot seriously deplete the population of potential arrivals, so that N5 has for t > s negligible effect on N, - N5 And the condition that there are no fixed discontinuities is entirely natural. The'>e conditions hold for arrivals of calls at a telephone exchange if the rate of calls is small in comparison with the population of subscribers and calls are not placed at fixed, predetermined times. If the arrival rate is essentially constant, this leads to the following condition. •
Condition 3°. {i) For 0 < t < · < tk the increments N, N, 2 - N, N, - N, are independent. k k -I {ii) The distribution of N, - � depends only on the difference t - s. 1
Theorem 23.4.
Condition 0°.
· ·
1,
1,
• • •
,
Conditions 1°, 2°, and 3° are equivalent in the presence of
Obviously Condition 2o implies 3°. Suppose that Condition 3o holds. If 1, is the saltus at t (1, = N, - sups < r N5 ), then [ N, - N, _ n - 1 > 1] � [1, > 1] , and it follows by (ii) of Condition 3o that P[1, > 1] is the same for all t. But if the value common to P[1, > 1] is positive, then by the independence of the increments and the second Borel-Cantelli lemma there is probability 1 that 1, > 1 for infinitely many rational t in (0, 1), for example, which contra dicts Condition 0°. PRooF.
RANDOM VARIABLES AND EXPECTED VALUES
306
By Theorem 23.3, then, the increments have Poisson distributions. If f(t) is the mean of N, then N, - Ns for s < t must have mean f(t) -f(s) and must by (ii) have mean f(t - s ); thus f(t) = f(s) + f(t - s ). Therefore, f satisfies Cauchy's functional equation [A20] and, being nondecreasing, must have the form f(t) = at for a > 0. Condition oo makes a = 0 impossible. • One standard way of deriving the Poisson process is by differential equations.
Condition 4°. If G < t 1 < then
· · ·
< tk and if n 1 , . . . , n k are nonnegative integers,
( 23.23) and (23.24) as h � 0. Moreover, [ N, : t > 0] has no fixed discontinuities. The occurrences of o(h) in (23.23) and (23.24) denote functions, say
of Condition 0°.
Conditions 1 o through 4° are all equivalent in the presence
oF 2° -7 4°. For a Poisson process with rate a, the left-hand sides of (23.23) and (23.24) are e- ahah and 1 - e - ah - e- ahah, and these are ah + o(h) and o(h), respectively, because e - al' = 1 - ah + o(h). And by the PROOF
argument in the preceding proof, the process has no fixed discontinuities.
•
4° -7 2°. Fix k, the ti , and the n i; denote by A the event [N,, = n i, j < k]; and for t > O put Pn(t) = P[N, k + , - N,k = niA]. It will be shown that PROOF OF
( 23.25)
n
) pn ( t ) = e- ar (at n! '
n = 0, 1 , . . . .
SECTION
23.
THE POISSON PROCESS
307
This will also be proved for the case in which Pn(t) = P[N, = n]. Condition 2° will then follow by induction. If t > 0 and It - sl < n- 1 , then
As n oo, the right side here decreases to the probability of a discontinuity at t, which is 0 by hypothesis. Thus P[N, = n] is continuous at t. The same kind of argument works for conditional probabilities and for t = 0, and so Pn(t) is continuous for t > 0. To simplify the notation, put D, = N, k + r - N,k . If D, + h = n, then D, = m for some m < n. If t > 0, then by the rules for conditional probabilities, �
Pn( t + h) = Pn( t )P [ Dr + h - D, = OIA n [ D, = n ]) + Pn - 1 ( t ) P [ D, +11 - D, -= 1 I A n [ D, = n - 1 ] ] n- 2 + L Pm( t )P[ Dr + h - D, = n - m i A n [ D, = m l ] . m=O
For n < 1, the final sum is absent, and for n = 0, the middle term is absent as well. This holds in the case Pn (t) = P[N, = n] if D, = N, and A = fl. (If t = 0, some of the conditioning events here are empty; hence the assumption t > 0.) By (23.24), the final sum is o(h) for each fixed n. Applying {23.23) and (23.24) now leads to
Pn( t + h ) = pn( t)(1 - ah) +pn_ 1 ( t)ah + o( h ) , and letting h � 0 gives
( 23 .26) In the case n = 0, take p _ 1(t) to be identically 0. In (23.26), t > 0 and p�(t) is a right-hand derivative. But since Pn(t) and the right side of the equation are continuous on [0, oo), (23.26) holds also for t = 0 and p�(t) can be taken as a two-sided derivative for t > 0 [A22]. Now (23.26) gives [A23]
Since Pn(O) is 1 or 0 as n = 0 or n > 0, (23.25) follows by induction on n.
•
RANDOM VARIABLES
308
AND
EXPECTED VA LUES
Stochastic Processes
The Poisson process [ N, : t > 0] is one example of a stochastic process - that is, a collection of random variables (on some probability space (f!, sz-, P)) indexed by a parameter regarded as representing time. In the Poisso n case, time is continuous. In some cases the time is discrete: Section 7 concerns the sequence ( F } of a gambler's fortunes; there n represents time, but time that . . . mcreases m JUmps. Part of the structure of a stochastic process is specified by its finite-dimen sional distributions. For any finite sequence t 1 , • • • , tk of time points, the k-dimensional random vector (N1 1 , • • • , N, ) has a distribution Jl- 1 1 1 over R k . These measures Jl- 1 1 1 k are the finite-dimensional distributions of the pro cess. Condition 2° of this section in effect specifies them for the Poisson case: n
k
( 23.27 ) P [ NI · = n1. ' ]. -< k J = n e �l 1
j
-
a( rI
· -
11 1 ) · -
(a(t - t )) j
j-l
n - - n J.
1
( ni - ni - l ) .I
.
-•
if O < n 1 :::;; . . . < n k and 0 < t 1 < . . . < tk (takc n 0 t 0 = 0) . The finite-dimensional distribution s do not, however, contain all the mathematically interesting information about the process in the case of continuous time. Because of (23.3), (23.4), and the definition (23.5), for each fixed w, N1 (w) as a function of t has the regularity properties given iu the second version of Condition 0°. These properties are used in an essentiai way in th.e proofs. Suppose that f(t) is t or 0 according as t is rational or irrational. Let N, be defined as before, and let =
Ml ( w) = N,( w) + f ( t + X I ( w ) ) . If R is the set of rationals, then P[w: f(t + X 1 (w)) =I= 0] = P[w : X1 ( w) E R - t ] = 0 for each t because R - t is countable and X, has a density. Thus P[M1 = N, ] = 1 for each t, and so the stochastic process [M1 : t > 0] has the same finite-dimensional distributions as [N1: t > 0]. For w fixed, however, M1 (w) as a function of t is everywhere discontinuous and is neither mono ( 23.28)
-
tone nor exclusively integer-valued. The function s obtained by fixing w and letting t vary are calle d the path functions or sample paths of the process. The example above shows that th e finite -dimensional distributions do not suffice to determine the character of the path functions. In specifying a stochastic process as a model for some phenomenon, it is natural to p lace conditions on the character of the samp le paths as well as on the finite-dimensional distributions. Condition oo was imposed throughout this section to ensure that the sample paths are nonde creasing, right-continuous , integer-valued step functions, a natural condition if N, is to represent the number o f events in [0, t ]. Stochastic processes i n continuous time are studied further in Chapter 7.
SECTION
23.
THE POISSON PROCESS
309
PROBLEMS
Assume the Poisson processes here satisfy Condition oo as well as Condition 1°. 23.1.
Show that the minimum of independent exponential waiting times is again exponential and that the parameters add. •
23.2.
20. 17 i Show that the time Sn of the nth event in a Poisson stream has the gamma density f( x; a, n) as defined by (20.47). This is sometimes called the
Erlang density. 23.3.
Let A , = t - SN he the time back to the most recent event in the Poisson stream (or to 0); and let B, = SN + 1 - t be the time forward to the next event. Show that A, and B, are independent, that B, is distributed as X1 (exponen tially with parameter a ), and that A, is distributed as min{Xp t}: P[ A, < t] is or 1 as x < 0, 0 < x < t, or x > t. 0, 1 - e -ax,
Let . L, =A, + B, = SN + 1 - SN be the length of the interarrival interval covenng t. (a) Show that L, has density
23.4. i
'
'
if 0 < X < t , if X > t . Show that E[L1] converges to 2 E[Xd as t --+ oo. This seems paradoxical because L, is one of the Xn. Give an intuitive resolution of the apparent paradox. (b)
23.5.
Merging Poisson streams. Define a process {N,} by (23.5) for a sequence {Xn} of
random variables satisfying (23.4). Let {X�} be a second sequence of random variables, on the same probability space, satisfying (23.4), and define {N,'} by N,' = max[ n : X) + +X� < t ]. Define {N,"} by N;' = N, + N;. Show that, if u(X1 , X7 , . . . ) and u(X;, X�, . . . ) are independent and {N1} and {N;l are Poisson processes with respective rates a and {3, then {N,"} is a Poisson process with rate a + {3. ·
23.6. i
The n th and
sn + l "
·
·
(n + l)st events
1 n the process {N,} occur at times Sn and
Find the distribution of the number Ns process during this time interval. (b) Generalize to N5 - N5, .
(a)
..
'"
23.7.
+,
- Ns of events in the other ..
,
Suppose that X 1 , X2 , are independent and exponentially distributed with parameter a, so that (23.5) defines a Poisson process {N,}. Suppose that Y1 , Y2 , . . . are independent and identically distributed and that u(X1 , Xz, . . . ) and u(Yp Y2 , ) are independent. Put Z, = L:k ,; N Yk . This is the compound Poisson process. If, for example, the event at time \ in the original process •
•
•
•
•
•
310
RANDOM VARIABLES AND EXPECTED VALUES
represents an insurance claim, and if Yn represents the amount of the claim, then z, represents the total claims to time t . (a) I f Yk = 1 with probability 1 , then {Z,} is an ordinary Poisson process. (b) Show that {Z,} has independent increments and that Zs +r - Z5 has the same distribution as z,. (c) Show that, if Yk assumes the values 1 and 0 with probabilities p and 1 - p (0 < p < 1), then {Z,} is a Poisson process with rate pa. 23.8.
Suppose a process satisfies Condition oo and has independent, Poisson-distrib uted increments and no fixed discontin uities. Show that it has the form {N
23.9.
If the waiting times xn are independent and exponentially distributed with parameter a, then S"jn --+ a - I with probability 1, by the strong law of large numbers. From lim , N, = oo and SN' < t < SN + 1 ded uce that lim , N,!t = a with probability L -+ oo
-+ OO
'
Suppose that Xp X2 , . . . are positive, and assume directly that Sn /n --+ m with probability 1, as happens if the X, are independent and identically distributed with mean m. Show that lim , N,/t = 1 jm with probabil ity 1. (b) Suppose now that Sn fn -+ oo with probability 1, as happens if the Xn are independent and identically distributed and have infinite mean. Show that lim , N,/t 0 with probability 1.
23.10. i
(a)
=
The results in Problem 23.10 are theorems in renewal theory: A component of some mechanism is replaced each time it fails or wears out The Xn are the lifetimes of the successive components, and N, is the number of replacements, or renewals, to time t. 23.11.
20.7 23. 10 i Consider a persistent, irreducible Markov chain, and for a fixed state j let Nn be the number of passages through j up to time n. Show that Nn/n --+ 1jm with probability 1, where m = Lk= kfjjk > is the mean return time 1 3 in Section 8. (replace 1/m by 0 if this mean is infinite). See Lemma
23.12.
Suppose that X and Y have Poisson distributions with parameters a and {3. Show that IP[X = i] - P[Y = i]l < Ia - /31. Hint: Suppose that a < /3, and repre sent Y as X + D, where X and D are independent and have Poisson distribu tions with parameters a and /3 - a. Use the methods in the proof of Theorem 23.2 to show that the error (23.15) is bounded uniformly in by lA - A n i + A n max k Pn k •
23.13. i
SECTION
i
24.
in
THE ERGODIC THEOREM *
Even though chance necessarily involves the notion of change, the laws governing the change may themselves remain constant as time passes: If time * This section may be omitted. There is more on ergodic theory
in Section 36.
SECfiON
24.
THE ERGODIC THEOREM
311
does not alter the roulette wheel, the gambler's fortunes fluctuate according to constant probability laws. The ergodic theorem is a version of the strong law of large numbers general enough to apply to any system governed by probability laws that are invariant in time. Measure-Preserving Transformations
Let (f!, !F, P) be a probability space. A mapping T: f! -7 f! is a measure-pre serving transformation if it is measurable n!Fj !F and P(T- 1A) = P(A) for all A in !F, from which it follows that P(T- A) = P(A) for n > 0. If, further, T is a one-to-one mapping onto f! and the point mapping T- 1 is measurable !Fj !F (TA E !F for A E !F), then T is invertible; in this case T - 1 automati cally preserves measure: P(A) P(T - 1 TA) P(TA). The first result is a simple consequence of the 7r-A theorem (or Theorems 13. l{i) and 3.3). =
=
l.
If .9 is a 7r-systcm generating !F, and if T - 'A E !F and P(T- 1A) P(A) for A E .9, then T is a measure-preserving transformation. Lemma
=
Example 24.1. The Bernoulli shift. Let S space SX! of sequences (2.15) of elements of
be a finite set, and consider the S. Define the shift T by
(24. 1) the first element of w (z 1(w ), z/w ), . . . ) is lost, and T shifts the remaining elements one place to the left: z k(Tw) z k + 1(w) for k > 1. If A is a cylinder (2.17), then =
=
(24.2)
T - 1A [ w : ( z 2 (w), . . . , zn + 1 ( w )) E H ] [ w : ( z 1 (w), . . . , zn + J (w)) E S X H] =
=
is anoth er cylinder, and since the cylinders generate the basic u-field �. T is measurable �/ �For probabilities Pu on S (nonnegative and summing to 1), define product measure P �m the field �0 of cylinders by (2.21). Then P is consistently defined and countably additive (Theorem 2.3) and hence extends to a probability measure on � = u( �0). Since the thin cylinders (2.16) form a 7r-system that generates �. P is completely determined by the probabilities it assigns to them:
(24.3)
RANDOM VARIABLES AND EXPECTED VALUES
312
If A is the thin cylinder on the left here, then by (24.2),
(24.4)
P(T - IA ) =
L Pu Pu 1 . . · Pun = pu 1 . . , Pun = P ( A } ,
uES
and it follows by the lemma that T preserves P. This T is the Bernoulli shift. If A = [w: z 1(w) = u] (u E S), then IiT k - 1 w) is 1 or 0 according as z k (w) is u or not. Since by (24.3) the events [w: z k (w) = u] are independent, each with probability Pu, the random variables IA (T k - l w) are independent and take the value 1 with probability Pu = P(A). By the strong law of large numbers, therefore,
(24.5) •
with probability 1.
Example 24. 1 gives a model for independent trials, and that T preserves P means the probability laws governing the trials are invariant in time. In the present section, it is this invariance of the probability laws that plays the fundamental role; independence is a side issue. The orbit under T of the point w is the sequence (w, Tw, T 2w, . . . ), and (24.5) can be expressed by saying that the orbit enters the set A with asymptotic relative frequency P(A). For A = [w: (z 1{w), z/w)) = (u 1 , u 2 )], the IiT k - 1 w) are not independent, but (24.5) holds anyway. In fact, for the Bernoulli shift, (24.5) holds with probabHity 1 whatever A may be (A E 9' ). This is one of the consequences of the ergodic theorem (Theorem 24. 1). What is more, according to this theorem the limit in (24.5) exists with probability 1 (although it may not be con�tant in w) if T is an arbitrary measure-preserving transformation on an arbitrary probability space, of which there are many examples.
The Markov shift. Let P = [ Pij ] be a stochastic matrix with rows and columns indexed by the finite set S, and let 7T; be probabilities on S. Replace (2.21) by P(A) = L H 7Tu I Pu I u Pu n - 1 n . The argument in Section 2 showing that product measure is consistently defined and fin itely additive Example 24.2.
2
·
·
·
u
carries over to this new measure: since the rows of the transition matrix add to 1,
and so the argument involving (2.23) goes through. The new measure is again
SECflON
24.
THE ERGODIC THEOREM
313
countably additive on t"0 (Theorem 2.3) and so extends to C. This probabil ity measure P on t" is uniquely determined by the condition
(24.6) Thus the coordinate functions z n( · ) are a Markov chain with transition probabilities pii and initial probabilities (until now unspecified) are stationary: L; 1T; P;i = 7Ti . Suppose that the Then 7T;.
7T;
and I t follows (see (24.4)) that T preserves P. Under the measure P specified by (24.6), T is the Markov shift. • The shift T, qua point transformation on S"", is the same in Examples 24.1 and 24.2. A measure-preserving transformation, however, is the point trans formation together with the £T-field with respect to which it is measurable and the measure it preserves. Example 24.3. Let P be Lebesgue measure A on the unit interval, and take Tw = 2w (mod 1 ):
Tw =
(
2w 2w - 1
if 0 < w < I , if I < w < 1 .
=
If w h as nonterminating dyadic expansion w = .d 1(w)d2(w) . . . , then Tl•J )d3(w) . . . : T shifts the digits one place to the left-compare (24. 1). Since .d2(w 1 r (0, X] = (0, IX] u ( t, I + ixl, it follows by Lemma 1 that T preserves Lebesgue • measure. Th is is the dyadic transformation.
Example 24.4. Let !l be the unit circle in the complex plane, let !T be the u-fie l d generated by the arcs, and let P be normalized circular Lebesgue measure: map [0, 1) to the unit circle by c/J(x} = ez.,.;x and define P by P(A) = A(cfJ- 1A). For a fixed c i n !l, let Tw cw. Since T is effectively the rotation of the circle through the angle arg c, T preserves P. The rotation turns out to have radically different properties according as c is a root of unity or not. •
=
Ergodicity
The �set A is invariant under T if T - 1A = A ; it is a nontrivial invariant set if 0 < P( A) < 1. And T is by definition ergodic if there are in !F no
314
RANDOM VARIABLES AND EXPECfED VALUES
nontrivial invariant sets. A measurable function f is invariant if f(Tw) = f(w) for all w ; A is invariant if and only if /A is. The ergodic theorem: Theorem 24.1. Suppose that T is a measure-preserving (f!, !F, P) and that f is measurable and integrable. Then
transformation on
1 n 1 w) =f(w) lim f( T� L n n h
( 24 . 7)
k=l
h
h
with probability 1, where f is invariant and integrable and E[ f] = E[f]. If T is ergodic, then f = £[ f] with probability 1. h
h
This will be proved later in the section. In Section 34, f will be identified as a conditional expected value (see Example 34.3). If f = /A , (24.7) becomes
(24.8) and in the ergodic case,
�( w ) = P( A)
(24.9 )
with probability 1. If A is invariant, then Iiw) is 1 on A and 0 on A c, and so the limit can certainly be nonconstant if T is not ergodic. h
Take f! = {a, b, c, d, e} and !F= 2 °. If T is th e cyclic permutation T = (a, b, c, d, e) and T preserves P, then P assigns equal probabilities to the five points. Since 0 and f! are the only invariant sets. T is ergodic. It is easy to check (24.8) and (24.9) directly. The transformation T = (a, b, cXd, e), a product of two cycles, preserves P if and only if a, b, c have equal probabilities and d, e have equal probabili ties. If the probabilities are all positive, then since {a, b, c} is invariant, T is not ergodic. If, say, A = {a, d}, the limit in (24.8) is � on {a, b, c} and ! on • {d, e}. This illustrates t he essential role of ergodicity. Example 24.5.
The coordinate functions zn< ) in Example 24. 1 are independent, and hence by Kolmogorov's zero-one law every set in the tail field !T n u( z z" + I , ) has probability 0 or 1. (That the z take values in the abstract set S does not affect the arguments.) If A E u(z 1 , . . . , z k ), then T -"A E u(zn + J > · · · , zn +k ) c u(zn + l • z n + 2 , ); since this is true for each k, A E !F u( z ) ' z 2 , . . . ) implies (Theorem 13.1(i)) T - "A E u( z + I , z + 2 • . . . ). For A invariant, it follows that A E !T: The Bernoulli shift is ergodic. Thus ·
n
n,
n
• . .
•
=
•
•
n
n
SECfiON
24.
THE ERGODIC THEOREM
315
the ergodic theorem does imply that (24.5) holds with probability 1, whatever A may be. The ergodicity of the Bernoulli shift can be proved in a different way. If A = [(zp · · · , zn) = u] and B = [(z 1 , , z k ) = v ] forn an n-tuple u and a k tuple v, and if P is given by (24.3), then P(A n T - B) = P(A)P(B) because T - n B = [(zn + I • · · · , z n +k ) = v]. Fix n and A, and use the 1r-A theorem to show that this holds for all B in !F. If B is invariant, then P(A n B) = P( A)P(B) holds for the A above, and hence (1T - A again) holds for all A. Taking· B = A shows that P(B) = (P(B)) 2 for invariant B, and P(B) is 0 or 1. This argument is very close to the proof of the zero - one law, but a modification of it gives a criterion for ergodicity that applies to the Markov shift and other transformations. • • •
Suppose that .9 c !F0 c !F, where !F0 is a field, every set in !F0 is a finite or countable disjoint union of .9-sets, and .9Q generates !F. Suppose further that there exists a positive c with this property: For each A in .9 there is an integer n A such that Lemma 2.
(24. 1 0 ) for all B in
.9.
Then T - 1 c = C implies that P(C) is 0 or 1.
It is convenient not to require that T preserve P; but if it does, then it is an ergodic measure-preserving transformation. In the argument just given, .9 consists of the thin cylinders, nA = n if A = [(zp . . . , z n ) = u], c = 1, and !F0 = t"0 is the class of cylinders. Since every .9Q-set is a disjoint union of .9-�ets, (24.10) holds for B E .9Q (and A E .9 ). Since for fixed A the class of B satisfying (24.10) is monotone, it contains !F (Theorem 3.4). If B is invariant, it follows that P(A n B) > cP(A)P(B) for A in .9. But then, by the same argument, the inequality holds for all A in !F. Take A = Be: If B is invariant, the n • P( B c )P(B) = 0 and hence P(B) is 0 or 1. PRooF.
To treat the Markov shift, take .9Q to consist of the cylinders and .9 to consist of the thin ones. If A = [(zp . . . , z n) = ( u 1 , un)], n A = n + m - 1, and B = [( zp . . . , z k ) = (v i > . . . , vk)], then , • • •
The lemma will apply if there exist an integer m and a positive c such that Phm ) > C 7Tj for all i and j. By Theorem 8.9 (or Lemmam 2, p. 125), there is in the irreducible, aperiodic case an m such that all Ph ) are positive; take c
316
RANDOM VARIABLES AND EXPECfED VALUES
less than the minimum. By Lemma 2, the corresponding Markov shift is ergodic. Example 24.6. Maps preserve ergodicity. Suppose that 1/f: !l -+ !l is measurable .9'] !T and commutes with T in the sense that 1/J Tw = TI/Jw. If T preserves P, it also preserves Pl/f- 1 : Pl/f- 1 (T- 1A) = P(r 1 1/f- 1A) = PI/f- 1 (A). And if T is ergodic under P, it is also ergodic under Pl/f - 1 : if A is invariant, so is l/f- 1A, and hence Pl/f- 1 (A) is 0 or 1. These simple observations are useful in studying the ergodicity of stochastic processes (Theorem 36.4). • Ergodicity of Rotations
The dyadic transformation, Example 24.3, is essentially the same as the Bernoulli shift. In any case, it is easy to use the zero - one law or Lemma 2 to show that it is ergodic. From this and the ergodic theorem, the normal number theorem follows once again. Consider the rotations of Example 24.4. If the complex number c defining the rotation (Tw = cw) is - 1, then the set consisting of the first and third quadrants is a nontrivial invariant set, and hence T is not ergodic. A similar construction shows that T is nonergodic whenever c is a root of unity. In the opposite case, c is ergodic. [n the first place, it is an old n umbcr-theore tic fact due to Kronecker that if c is not a root of unity then the orbit (w, cw, c 2w, . . . ) of every w is dense. Since the orbits are rotations of one another, it suffices to prove that the orbit (1, c, c 2, ) of 1 is dense. But if c is not a root of unity, then the elements of this orbit are all distinct and hence by compactness have a limit point w0. For arbitrary €, there are distinct points c " and c " + k within E/2 of w0 and hence within € of each other (distance measured along the arc). But then, since the distance from c n +jk to c n +U+ I >k is the same as that from c" to c n + k, it is clear that for some m t he points c", c" + k , . . . , c" + m k form a chain which extends all the way around the circle and in which the distance from one point to the next is less than €. Thus every point on the circle is within € of some point of the orbit (1, c, c2, ), which is indeed dense. To use this result to prove ergodicity, suppose that A is invariant and P(A) > 0. To show that P(A ) must then be 1, observe first that for arbitrary € there is an arc /, of length at most €, satisfying P(A n I) > (1 - € )P(A). Indeed, A can be covered by a sequence /I > / , . . . of arcs for which P(A )j(l - €) > L,, PU,); t he arcs can be take n 2 disjoint and of length less than €. Since L,, P( A n /,) = P( A) > (1 - € )f., P(I,), there is an n for which P(A n /,) > (1 - €)?(/,): take I = I,. Let I have length a; a < €. Since A is invariant and T is invertible and preserves P, it follows that P(A n T"l) > (1 - E)P(T"I). S uppose the arc I runs from a to b. Let n 1 be arbitrary and, using the fact that {T "a} is dense, choose n 2 so that T"1/ and T"2/ a re disjoint and the distance from T"tb to T "2 a is less than Ea. Then choose n3 so that T"1/, T"2/, T " 'l are disjoint and the distance from T"2b to T"'a is less than Ea. Continue until T" kb is within a of T"1a and a further step is impossible. Since the T"•l are disjoint, ka < 1; and by the construction, the T "• I cover the circle to within a set of measure kw + a, which is at most 2€. And now by disjointness, • • •
• • •
k
k
P ( A ) > L P ( A n T"• I ) > ( 1 - € ) L P ( T"• I ) > ( 1 - € )( 1 - 2 €). i=
I
i=
I
Since € was arbitrary, P(A) must be 1: T is ergodic if c is not a root of unity.t t For
a
simple Fourier-series proof, see Problem 26.30.
SECfiON
24.
317
THE ERGODIC THEOREM
Proof of the Ergodic Theorem
The argument depends on a preliminary result the statement and proof of which are most clearly expressed in terms of functional operators. For a real fu nction f on f!, let Uf be the real function with value (Uf)(w ) = f( Tw ) at If f is integrable, then by change of variable (Theorem 16. 13),
w.
_
(24. 12) E[Uf ] = 1 t(Tw)P(dw) = 1 t(w)PT- 1(dw) = E[ f ] . l1
l1
U is nonnegative in the sense that it carries nonnegative functions to nonnegative functions; hence f < g (pointwise) implies Uf < Ug . n Make these pointwise definitions: S0f = O, SJ=f+ Uf + · · · + U - 1f, Mn f = max0 ,., k ,., n Sk f, and MJ = supn 0 Sn f = supn 0 MJ. The maximal ergodic theorem: And the operator
�
Theorem 24.2.
�
Iff is integrable, then
fMxf>OfdP ?: 0.
(24.13)
Since Bn = [Mnf> O] i[M,J > 0], it is enough, by the dominated convergence theorem, to show that f8n f dP > 0. On Bn , Mn f = max 1 ,., k ,; n Sk f. Since the operator U is nonnegative, Sk f = f + US k _ I f < f + UMJ for 1 < k < n, and therefore Mnf
< f fdP + 1 UMnfdP = f fdP + 1 MJdP, B B l1 l1 n
n
where the last equality follows from (24.12). Hence f8n fdP > 0. =
•
Replace f by f/A. If A is invariant, then Sn{f!A) = (SJ)IA , and M,,(f!A ) (M,,J )JA , and therefore (24.13) gives
( 24.14 ) Now replace f here by f- ,.\ ,
,.\
a constant. Clearly [M,,( f - ,.\ ) > 0] is the set
318
RANDOM VARIABLES AND EXPECfED VALUES
where for some n > 1, Sn(f - A) > O. or n - 1 Snf> A. Let
(24.15 )
FA =
[ w: n � l � kt= l f( Tk - 1w) > A ] ; sup
it follows by (24.14) that fA n FA(f- A) dP > 0, or
A P( A n FJ j
(24.16)
<
A nFA
fdP
The A here can have either sign.
24.1. To prove that the averages a n(w) = k 1 n - [% = 1 f( T - l w) converge, consider the set PROOF OF THEOREM
A a./3 =
[
lu
:
lim i � f a n(w) < a < {3 < l im s �p a n( w)
]
for a < {3. Since Aa,/3 = A a, /3 n F13 and Aa,/3 is invariant, (24. 16) gives {3 P( A a,/3) < fAa. /dP. The same result with -f, - {3, - a in place of f, a, {3 is fAa. /dP < aP ( A a /3 ). Since a < {3, the two inequalities together lead to P ( A a f3 ) = 0. Take the union over rational a and {3: The averages a n(w) converge, that is,
,
,
lima n n( w ) = / ( w )
(24.1 7 )
with probability 1, where f may take the values + oo at certain values of w. Because of (24. 12), E[aJ = E[f]; if it is shown that the an( w) are uniformly integrable, then it will follow (Theorem 16.14) that f is integrable and £[ / ] = E[f]. By (24.16), AP ( FA ) < £[1 !11 . Combine this with the same inequality for - f: If GA = [w: sup lan{w)l > A], th-en A P( GA ) < 2 £[1/ll (trivial if A < 0). There fore, for positive a and A , h
h
n
1
lanl dP _!_n t J I J( T k - lw) I P( dw ) [la n i > A ) k = I GA <
< _!_ n
=1
t 1 k= l (
lf(w)l > a
<1
lf(w)l > a
I J(T' - Iwll > a
i f( T k - 1 w) I P ( dw ) + aP ( G>. )
l f(w) I P ( dw ) + aP ( G>. )
l f (w) IP( dw) + 2 � E[Itl] .
)
Take a = A1 12 ; since f is integrable, the final expression here goes to 0 as
SECfiON
24.
THE ERGODIC THEOREM
319
,.\ -7 oo . The a n(w) are therefore uniformly integrable, and £[ / ] E[f]. The uniform integrability also implies E[ia n - !11 -7 0. Set f(w) = 0 outside the set where the an(w) have a finite limit. Then (24.17) still holds with probability 1, and j(Tw) = /Cw). Since [w: j(w) < x ] is invariant, in the ergodic case its mea�ure is either 0 or 1; if x0 is the infimum of the x for which it is 1, th en f(w) = x0 with probability 1, and from • x0 = E[f] = E[f] it follows that f(w) = E[f] with probability 1. =
h
h
A
h
The Continued-Fraction Transformation
Let f! consist of the irrationals in the unit interval, and for x in f! let Tx = {1/x} and a 1(x) = l1/xJ be the fractional and integral parts of 1jx. This defines a mappi11g
(24.18 ) of f! into itself, a mapping associated with the continued-fraction expansion of x [A36]. Concentrating on irrational x avoids some triviai details con nected with the rational case, where the expansion is finite; it is an inessential restriction because the interest here centers on results of the almost-every where kind. n For x E f! and n > 1 let a n(x) = a 1(T - 1 x) be the nth partial quotient, and define integer-valued functions Pn(x) and qn(x) by the recursions
(24.19) p _ 1(x) = 1, q _ 1(x) = 0,
p0(x) = 0, q0(x) = 1,
Pn(x ) = a n(x )pn _ 1(x) + Pn - /x), n > 1, qn(x) = an(x)qn _ 1(x) + qn _ /x), n > 1.
Simple induction arguments show that
n > 0,
(24.20 ) and [A37: (27)]
n > 1. It also follows inductively [A36: (26)] that
(24.22)
_!ja 1 ( x ) + +1J an _ 1 (x) +_!ja n(x) + t Pn( x) + tpn _ 1 ( x) , n > 1, O < t < l. qn( x) + tqn - l ( x) ·
-
·
·
320
RANDOM VARIABLES AND EXPECfED VALUES
Taking t = 0 here gives the formula for the nth convergent: n > 1,
(24.23)
where, as follows from (24.20), pJx) and qn(x) are relatively prime. By (24.21) and (24.22),
n Pn( x ) -t ( T x) Pn - l ( x) x = J x ) + T n x)qn _ 1 ( x ) ' q (
( 24.24)
n > 0,
which, together with (24.20), impliest n ?:. 0.
(24.25)
Thus the convergents for even n fall to the left of x, and those for odd n fall to the right. And since (24. 19) obviously implies that qn(x) goes to infinity with n, the convergents Pn(x)jqJx) do converge to x: Each irrational x in (0, 1) has the infinite simple continued-fraction representation
(24.26) The representation is unique [A36: (35)], and Tx = lla 2 ( x ) + J]a 3 ( x ) + · · T shifts the partial quotients in the same way the dyadic transformation (Example 24.3) shifts the digits of the dyadic expansion. Since the continued fraction transformation turns out to be ergodic, it can be used to study the continued-fraction algorithm. Suppose now that a" a?., . . . are positive integers and define Pn and qn by the recursions (24.19) without the argument x. Then (24.20) again holds n (without the x), and SO Pn /qn - Pn - 1 /qn - l = (- l) + l /qn - l qn , n > 1. Since qn increases to infinity, the right side here is the n th term of a convergent alternating series. And since p0jq0 0, the n th partial sum is Pn /qn , which therefore converges to some limit: Every simple infinite continued fraction converges, and [A36: (36)] the limit is an irrational in (0, 1). Let A a 1 a. be the set of x in f! such that a k (x) a k for 1 < k < n; call it a fundamental set of rank n. These sets are analogous to the dyadic intervals and the thin cylinders. For an explicit description of A a a -necessary for the proof of ergodicity-consider the function :
=
•
•
=
I
(24.27) tTheorem 1 .4 follows from this.
n
·
SECTION 24. THE ERGODIC THEOREM
321
If x E d a a • then x = l/la a ( Tn x) by (24.21); on the other hand, because 1 1 . of the uniqueness of the partial. quotients [A36: (33)], if t is an irrational in the unit interval, then (24.27) lies in d a 1 an . Thus d a ( an is the image under (24.27) of f! itself. Just as (24.22) follows by induction, so does • •
• •
( 24.28)
I/Ja 1 a.{ t )
=
Pn + lPn - 1 q + tqn - I n
·
And 1/Ja I an(t) is increasing or decreasing in t according as n is even or odd, as is clear from the form of (24.27) (or differentiate in (24.28) and use (24.20)). It follows that if n is even, if n is odd. By (24.20), this set has Lebesgue measure ( 24.29) The fundamental sets of rank n form a partition of f!, and their unions form a field �; let ,9 2qn _ 2 by (24. 19), induction gives qn > 2< n - I J/ 2 for n > 0. And now (24.29) implies that .9Q generates the u-field !F of linear Borel sets that are subsets of f! (use Theorem 10.1(ii)). Thus .9, �. !F are related as in the hypothesis of Lemma 2. Clearly T is measurable !Fj !F. Although T does not preserve ,.\ , it does preserve Gauss ' s measure, defined by (24.30) In fact, since
P ( A ) = log1 2 f 1 dx +x ' A
A E !F.
RANDOM VARIABLES AND EXPECTED VALUES
322
it is enough to verify
ljk dx [, ljk dx . [, dx ol 1 + x = k = l fI/(k + l ) 1 + x = k = l f1 /(k + I ) 1 + x (I
Gauss's measure is useful because it is preserved by T and has the same sets of measure 0 as Lebesgue measure does. Proof that T is ergodic. Fix a 1 , , an, and write 1/Jn for 1/J a and dn for Suppose that n ni]i even, so that 1/Jn is increasing. !f X E d n , then da l n (since x = 1/Jn(T x )) s < T x < t if and only if if;/s) < x < 1/Jn(t ); and this last condition implies x E dn. Combined with (24.28) and (24.20), this shows that a
• • •
a
n
·
I
n
If B is an interval with endpoints s and t, then by (24.29),
A similar argument establishes this for n odd. Since the ratio on the right lies between t and 2,
( 24.31 ) Therefore, (24. 10) holds for .9, .9Q, !F as defined above, A = dn, nA = n, c = 4 , and A in the role of P. Thus r - 1 C = C implies that ,.\(C) is 0 or 1, and since Gauss ' s measure (24.30) comes from a density, P(C) is 0 or 1 as well. Therefore, T is an ergodic measure-preserving transformation on ( f!, :Y, P). It follows by the ergodic theorem that if f is integrable, then
( 24. 3 2)
1. t f( Tk - l x )
lim n n
k= I
=
1 r t f( x ) dx log 2 }0 1 + X
holds almost everywhere. Since the density in (24.30) is bounded away from 0 and oo, the "integrable" and "almost everywhere" here can refer to P or to A indifferently. Taking f to be the indicator of the x-set where a 1(x) = k shows that the asymptotic relative frequency of k among the partial quotients is almost everywhere equal to
1
lfk
dx log 2 JI /< k + D 1 + x
=
1 log ( k + 1) 2 log 2 k(k + 2 )
·
In particular, the partial quotients are unbounded almost everywhere.
SECTION
24.
THE ERGODIC THEOREM
323
For understanding the accuracy of the continued-fraction algorithm, the magnitude of a n(x) is less important than that of qn(x). The key relationship . IS
(24.33) which follows from (24.25) and (24.19). Suppc<;e it is shown that li� n log qn( x ) 1
( 24.34)
=
7T 2
1 2 log 2
almost everywhere; (24.33) will then imply that
(24.35)
Pn( x) hm log x n n qn ( X ) .
1
=
7T 2 : - 6 1Og 2
The discrepanc1 between x and its nth convergent is almost everywhere of the order e - nrr /(6 log 2). To prove (24.34), note first that since pj + 1(x) q/ Tx) by (24. 19) , the product n k� I Pn - k + I (T k - l x)fqn k + I (T k - l x) telescopes to 1/qn(x): =
-
(24.36) As observed earlier, qn (x) > 2 < n - I )/ 2 for n > 1. Therefore, by (24.33),
1 1 X Pn ( X ) fqn ( X ) 1 -< qn + 1 ( X ) < 2 n f 2 ' n > 1. Since llog{ 1 + s)l < 4lsl if lsi < 1/ fi . log x - log Therefore, by (24.36),
n 1 k I log log y x L qn ( X )
k� I
4 . Pn( x) < -= qn( X) - 2 n f 2
RANDOM VARIABLES AND EXPECTED VALUES
324
By the ergodic theorem, then, t
1 1 I.1 m - Iog n qn ( X ) n
-
1
(
I log X dx
log 2 10 1 + x
=
( - l)k
I dx log( 1 x + x ) }
-1 (
log 2 0
_ 7T 2 - 12 log 2 - log 2 L 2 k O (k + 1) -1
00
·
Hence (24.34 ). Diophantine Approximation
The fundamental theorem of the measure theory of Diophantine approximation, due to Khinchine, is Theorem 1.5 together with Theorem 1.6. As in Section 1, let �p(q ) be a positive function of integers and let A 'I' be the set of x in (0, 1) such that
p X - -q
(24.37)
<
1 q 2�p ( q )
---=---
has infinitely many irreducible solutions p jq. If D jq�p(q) converges, then A has Lebesgue measure 0, as was proved in Section 1 (Theorem 1.6). It remains to prove that if 1p is nondecreasing and 'L.l jq�p(q) dive 1 ges, then Acp has Lebesgue measure 1 (Theorem 1.5). It is enough to consider irrational x.
For positive a n , the probability ( P or A) of [x: an(x) > an i.o.] is 0 or 1 as [1/an converges or diverges. Lemma 3.
PROOF. Let En = [x: an(x) > an]. Since P(En) = P[x: a 1 (x) > an ] is of the order 1/an, the first Borel - Cantelli lemma settles the convergent case (not needed in the proof of Theorem 1.5). By (24.31),
Taking
a
union over certain of the !:J.n shows that for m < n,
By induction on n,
A ( E� n · · · n £�) <
1 1) 0 ( 1 ) 2( a + + k l k=m
< exp
[ - '[,
1 k=m 2( a k + l + 1 )
as in the proof of the second Borel - Cantelli lemma.
]'
•
t lntegrate by parts over (a, 1) and then let a ! 0. For the series, see Problem 26.28. The specific value of the limit in (24.34) is not needed for the application that follows.
SECTION
24.
THE ERGODIC THEOREM
325
PROOF OF THEOREM 1.5. Fix an integer N such that log N exceeds the limit in (24.34). Then, except on a set of measure 0,
(24.38) holds for all but finitely many n. Since # is nondecreasing,
1
N -:n) --:: L . q� ( q ) < ----: N ( � N" sq
'
and '[J j�( N11) diverges if Djq�(q) does. By the lemma, outside a set of measure 0, a11 1(x ) > op ( N 11 ) holds for infinitely many n. If this inequality and (24.38) both hold, then by (24.33) and the assumption that � is nondecreasing, +
•
But p (x)jq/x) is irreducible by (24.20).
11
PROBLEMS
24.1. Fix (n, ,9<) and a T measurable .r; .r. The probability measures on (n, .9"" ) preserved by T form a convex St!t C. Show that T is ergodic under P if and only if P is an extreme point of C-cannot be represented as a proper convex combination of distinct elements of C.
24.2. Show that T is ergodic if and only if 1 !:Z: l P(A n r k B) --+ P(A)P(B) fo r all A and B (or all A and B in a '7T-system generating ,9<). 'l -
24.3 . i The transformation T is mixing if (24.39)
P ( A n r nB) --+ P( A ) P( B)
for all A and B. (a) Show that mixing implies ergodicity. (b) Show that T is mixing if (24.39) holds for all A and B m a '7T-System generating .r. (c) Show that the Bernoulli shift is mixing. (d) Show that a cyclic permutation is ergodic but not mixing. (e) Show that if c is not a root of u nity, then the rotation (Example 24.4) is ergodic but not mixing.
RANDOM VARIABLES AND EXPECTED VALUES
326
E
24.4. i Write rn!F [rnA: A ,9<], and call the a--field � = n �= 1 r ng- triv ial if every set in it has probability either 0 or 1. (If T is invertible, � is g- and hence is trivial only in uninteresting cases.) (a) Show that if � is trivial, then T is ergodic. (A cyclic permutation is ergodic even though � is not trivial.) (b) Show that if the hypotheses of Lemma 2 are satisfied, then � is trivial. (c) It can be shown by martingale theory that if � is trivial, then T is mixing; see Problem 35.20. Reconsider Problem 24.3(c) 24.5. 8.3'i 24.4 i (a) Show that the shift corresponding to an irreducible, aperiodic Markov chain is mixing. Do this first by Problem 8 35, then by Problem 24.4(b), (c). (b) Show that if the chain is irreducible but has period greater than 1, then the shift i!. ergodic but not mixing. (c) Suppose the state space splits iuto two closed, disjoint, nonempty subsets, and that the initial distribution (stationary) gives positive weight to each. Show that the corresponding shift is not ergodic.
k
24.6. Show that if T is ergodic and if f is nonnegative and E[ f ] 1 d( T - lw ) --+ oo with probability 1. n - f.k. =
= oo,
then
24.7. 24.3 i Suppose that P0( A ) fAo dP for all A (o > 0) and that T is mixing with respect to P (T need not preserve P0). Use (21.9) to prove =
24.8. 24.6 i
(a) Show that
and n .--:--:--------,-Ja l ( x ) . . . a n ( x )
(
1 --+ n 1 + ---,z=--k + 2k k=l ""
)(log k )j{log2)
almost everywhere. (b) Show that
24.9. 24.4 24.7 i (a) Show that the continued-fraction transformation is mixing. (b) Show that
' [ X .. TnX < t ] --+
ll
log( I + t ) Jog 2 '
C H A PT E R S
Convergence of Distributions
SECTION
25.
WEAK CONVERGENCE
Many of the best-known theorems in probability have to do with the asymp totic behavior of distributions. This chapter covers both general methods for deriving such theorems and specific applications. The present section con cerns the general limit theory for di stributions on the real line, and the methods of proof use in an essential way the order structure of the line. For the theory in R \ see Section 29. Definitions
Distribution functions F" were defined in Section 14 to the distribution function F if
( 25 .1)
lim n Fn ( x )
=
F
converge
weakly to
(x)
for every continuity point x of F; this is expressed by writing F" � F. Examples 14.1, 14.2, and 14.3 illustrate this concept in connection with the asymptotic distribution of maxima. Example 14.4 shows the point of allowing (25.1) to fail if F is discontinuous at x ; see also Example 25.4. Theorem 25.8 and Example 25.9 show why this exemption is essential to a useful theory. If 11-n and 11- are the probability measures on (R 1 , .9J 1 ) corresponding to F11 and F, then F11 = F if and only if
(25 .2)
lim n lln ( A ) = 11- ( A )
for every A of the form A = ( x] for which JL{x} = O-see (20.5). In this case the distributions themselves are said to converge weakly, which is expressed by writing 11-n = 11-· Thus Fn � F and 11-n = 11- are only different expressions of the same fact. From weak convergence it follows that (25.2) holds for many sets A besides half-infinite intervals; see Theorem 25.8. -
oo,
327
CONVERGENCE OF DISTRIBUTIONS
328
Example 25.1. Let Fn be the distribution function corresponding mass at n: Fn = I[ n , oo)" Then lim n Fn(x) = 0 for every x, so that satisfied if F( x) = 0. But Fn = F does not hold, because F is not a
to
a
unit (25. 1 ) is distribu tion function. Weak convergence is defined in this section only for functions Fn and F that rise from 0 at - oo to 1 at + oo-that is, it is defined only fQr probability measures 11- n and 11-· t •
Poisson approximation to the binomial. Let 11- n be the binomial distribution (20.6) for p = A/n Jnd let 11- be the Poisson distribution (20.7). For non negative integers k, Example 25.2.
if n > k. As n -7 oo the second factor on the right goes to 1 for ftxed k, and so J.Ln{k} -7 JL(k }; this is a special case of Theorem 23.2. By the series form of Scheffe ' s theorem (the corollary to Theorem 16.12), (25.2) holds for every set A of nonnegative integers. Since the nonnegative integers support 11- and the ll n • (25.2) even holds for every linear Borel set A. Certainly 11-n converges weakly to 11- in this case. • Let 11- n correspond to a mass of n - 1 at each point k jn , let 11- be Lebesgue measure confined to the unit intervaL The corresponding distribution functions satisfy Fn(x ) = (lnxj + 1)/n -7 F(x) for 0 < x < 1, and so f� = F. In this case (25.1) holds for every x, but (25.2) does not, as in the preceding example, hold for every Borel set A : if A is the set of rationals, then 11- n( A ) = 1 does not converge to 11-(A ) 0. Despite this, 11-n does converge weakly to 11-· • Example 25.3. k = 0, 1, . . . , n - 1;
=
If 11-n is a unit mass at xn and 11- is a unit mass at x, then 11-n = 11- if and only if x n -7 x. If x n > x for infinitely many n , then (25.1) fails at the discontinuity point of F. • Example 25.4.
Uniform Distribution Modulo t•
For a sequence x 1 , x2, of real numbers, consider the corresponding sequence of their fractional parts {xn} = xn l x nl· For each n, define a probability measure 1-Ln by • • •
-
(25.3)
f.L n ( A )
=
1 -#[ n k : 1 < k < n, { x d E A ] ;
t There is (see Section 28) a related notion of vague convergence in which IL may be defective in the sense that f.L(R 1 ) < 1. Weak convergence is in this context sometimes called complete convergence. * This topic, which requires ergodic theory, may be omitted.
SEC'TION 25. WEAK CONVERGENCE
329
has mass n - 1 at the points {x 1 }, , {xn}, and if several of these points coincide, the masses add. The problem is to find the weak limit of {JLn} in number-theoretically interesting cases. If the JLn defined by (25.3) converge weakly to Lebesgue measure restricted to the is said to be uniformly distributed modulo 1 . In unit interval, the sequence x" x 2, this case every subinterval has asymptotically its proportional share of the points {xn}; by Theorem 25.8 below, the same is then true of every subset whose boundary has Lebesgue measure 0.
JLn
• • •
• • •
.
Theorem 25.1.
F.or e irrational, e, 2e, 3e, . . . is uniformly distributed modulo 1.
in Example 24.4, PRooF. Since { ne} = {n {e}} , e can be assumed to lie in [0,1]. As x map [0, 1) to the unit circle in the complex plane by cf>( x) = e 2 -rri . If e is irrational, then c = cf>(e) is not a root of un ity, and so (p. 000) Tw = cw defines an ergodic transformation with respect to circular Lebesgue measure P. Let J be the class of open arcs with endpoints in some fixed countable, dense set. By the ergodic theorem, n the> orbit {T w} of almost every w enters every I in J with asymptotic relative frequency P( /). Fix such an w. If /1 c1 c /2,1 where 1 is a closed arc and /1 , /2 are in J, then the upper and lower limits of n - ['/, = 1 /iTk - lw) are between P(/1 ) and P(/2), and thereiore the limit exists and equ als P(J). Since the orbits and the arcs are rotations of one another, every orbit enters every closed arc 1 with frequency P(J). n This is true in particular of the orbit {e } of 1. n Now carry all this back to [0, 1) by cf> - I : For every x i n [0, 1), {ne } = cf> - '(e ) lies in [0, x] with asymptotic relative frequency ;,;. • For a simple proof by Fourier series, see Example 26.3. Convergence in Distribution
Let Xn and X be random variables with respective distribution functions Fn and F. If Fn = F, then Xn is said to converge in distribution or in law to X, written Xn = X. This dual use of the double arrow will cause no confusion. Because of the defining conditions (25 . 1 ) and (25.2), Xn = X if and only if
( 25.4)
lim P[Xn < x ] = P[X < x ] ,
for every x such that P[X = x] = 0. be independent random variables, each Let X1 , X2, with the exponential distribution: P[Xn > x ] = e -ax, x > 0. Put Mn = max{ X1 , , Xn } and bn = a- 1 log n. The relation (14.9), established in Exam ple 14. 1 , can be restated as P[Mn - bn < x] � e - e - ax. If X is any random variable with distribution function e - e - ax' this can be written Mn - bn = X. Example 25.5.
• • •
• • •
•
One is usually interested in proving weak convergence of the distributions of some given sequence of random variables, such as the Mn - bn in this example, and the result is often most clearly expressed in terms of the random variables themselves rather than in terms of their distributions or
CONVERGENCE OF DISTRIBUTIONS
330
distribution functions. Although the Mn - bn here arise naturally from the problem at hand, the random variable X is simply constructed to make it possible to express the asymptotic relation compactly by Mn - bn = X. Recall that by Theorem 14. 1 there does exist a random variable for any prescribed distribution. For each n, let f!n be the space of n-tuples of O's and 1 's, let � consist of all subsets of f!n, and let Pn assign probability (Ajn)k (l Ajn) n - k to each w consi�tjng of k 1's and n - k O's. Let Xn(w) be the number of 1's in w; then Xn , a random variable on (f!n, �. Pn), represents the number of successes in n Bernoulli trials having probability A/n of success at each. Let X be a random variable, on some (f!, !F, P), having the Poisson • distribution with parameter A. According to Example 25.2, X,. = X. Example 25.6.
As this example shows, the random variables Xn may be defined on entirely different probability spaces. To allow for this possibility, the P on the left in (25.4) really should be written Pn . Suppressing the n causes no confusion if it is understood that P refers to \Whatever probability space it is that xn is defined on; the underlying probability space enters into the definition only via the distribution 11- n it induces on the line. Any instance of Fn = F or of 11- n = 11- can be rewritten in terms of convergence in distribution: There exist random variables Xn and X (on some probability spaces) with distribution functions Fn and F, and Fn = F and Xn = X express the same fact. Convergence in Probability
Suppose that X, X 1 , X2, . . . are random variables all defined on the same probability space (f! , !F, P). If Xn � X with probabiliry 1, then P[IXn - XI > E i.o.] 0 for E > 0, and hence =
(25.5)
lim P [ I Xn - XI > E ] n
=
0
by Theorem 4. 1. Thus there is convergence in probability Xn �P X; see Theorems 5.2 and 20.5. Suppose that (25.5) holds for each positive E. Now P[X < x - E ] - P[IXn XI > E] < P[Xn < x ] < P[X < x + €] + P[IXn - x i > E]; letting n tend to oo and then letting E tend to 0 shows that P[ X < x ] < lim infn P[Xn < x] < lim sup n P[Xn < x ] < P[X < x ]. Thus P[Xn < x ] � P[X < x ] if P[X = x] = 0, and so Xn = X:
Suppose that Xn and X are random variables on the same probability space. If Xn � X with probability l, then Xn � p X. · If Xn � p X, then Xn = X. Theorem 25.2.
SECTION 25. WEAK CONVERGENCE
331
Of
the two implications in this theorem, neither converse holds. B ecause of Example 5.4, convergence in probability does not imply convergence with probability 1 . Neither does convergence in distribution imply convergence in probability: if X and Y are independent and assume the values 0 and 1 with probability � each, and if Xn = Y, then Xn = X, but Xn � P X cannot hold because P[IX - Yl] = 1 = � - What is more, (25.5) is impossible if X and the Xn are defined on different probability spaces, as may happen in the case of convergence in distribution. Although (25.5) in general makes no sense unless X and the Xn are defined on the same probability space, suppose that X is replaced by a constant real number a-that is, suppose that X(w) = a. Then (25.5) be comes (25.6)
lim P [ IXn - al > € ] = 0, r.
and thi� condition makes sense even if the space of Xn does vary with n. Now a can be regarded as a random variable (on any probability space at all), and it is easy to show that (25.6) implies that Xn = a: Put E = lx - al; if ..i > a, then P[ Xn < x] > P[1Xn - ai < E] � 1, and if x < a, then P[Xn <x] < P[IXn - al > E] � 0. If a is regarded as a random variable, its distribution function is 0 for x < a and 1 for x > a . Thus (25.6) implies that the distribu tion function of Xn converges weakly to that of a. Suppose, on the other hand, that Xn = a. Then P[IXn - al > E] < P[Xn < a - €] + 1 -- P[Xn < a + € ] --> 0, so that (25.6) _holds:
The condition (25.6) holds for all positive E if and only if xn = a, that is, if and only if Theorem 25.3.
if x < a , if x > a . If (25.6) holds for all positive E, Xn may be said to converge to a in probability. As this does not require that the Xn be defined on the same space, it is not really a special case of convergence in probability as defined by (25.5). Convergence in probability in this new sense will be denoted xn = a, in accordance with the theorem just proved. Example 14.4 restates the weak law of large numbers in terms of this concept. Indeed, if X 1, X2 , • . • are independent, identically distributed ran dom variables with finite mean m, and if Sn = X 1 + · · · +Xn , the weak law of large numbers is the assertion n - I Sn = m. Example 6.3 provides another illustration: If sn is the number of cycles in a random permutation on n letters, then Sn / log n = 1.
332
CONVERGENCE OF DISTRIBUTIONS
Suppose that Xn = X and 8n � 0. Given E and YJ , choose x so that P[IXI > x ] < 17 and P[X = + x] = 0, and then choose n0 so that n > n0 implies that l8ni < Ejx and I P[Xn < y] - P[X < y]i < YJ for y = +x. Then P[ !8nXn1 > E] < 3YJ for n > n0• Thus Xn = X and 8n � 0 imply that • 8n Xn = 0, a restatement of Lemma 2 of Section 14 (p. 193). Example 25. 7.
The asymptotic properties of a random variable should remain unaffected if it is altered by the addition of a random variable that goes to 0 in probability. Let (Xn • Y) be a two-dimensional random vector. Theorem 25.4.
If X" = X and Xn - Yn = 0, then Yn = X.
PROOF.
Suppose that y' < x < y" and P[X = y'] = P[X = y" ] = 0. If y' < X - € < X < X + f < y", then
( 25 7 ) P [ Xn < y' ] - P [ I Xn - Yn : > € ] < p [ Yn < X ] < P[X" < y" ] + P [ I Xn - Ynl > Ej . •
Since Xn = X, letting n � oo gives
(25.8 )
P[ X < y' ] < lim infP n [ Y,. < x ] < lim sup P[ Y,. < x ] < P[ X < y" ] . n
Since P[ X = y] = O for all but countably many y, if P[X = x ] = O, then y' and y" can further be chosen so that P[X < y'] and P[ X < y" ] are arbitrarily near P[ X < x ]; hence P[ Yn < x] � P[ X < x]. • Theorem 25.4 has a useful extension. Suppose that (X�" >, Yn) is a two-dimensional random vector. Theorem 25.5.
If, for each u, XA" ) = x < u ) as n -+ oo, if x < u > = X as u -+ oo, and if lim lim sup P ( I X� > - Ynl > € ] = 0
( 25.9)
u
for positive
€,
n
"
then Yn = X.
PROOF. Replace Xn by x�u ) in (25. 7). If P[ X = y' 1 = 0 P[ x < u) = y' 1 and P[ X = y"1 0 P[x < u > =uy"1, letting n -+ oo and then u -+ oo gives (25.8) once again. Since P[X = y 1 = 0 P[ x < > = y 1 for all but countably many y, the proof can be completed '-=
as before.
=
=
=
•
SECTION 25. WEAK CONVERGENCE
333
Fundamental Theorems
Some of the fundamental properties of weak convergence were established in Section 14. It was shown there that a sequence cannot have two distinct weak limits: If Fn = F and Fn = G, then F = G. The proof is simple: The hypothe sis implies that F and G agree at their common points of continuity, hence at all but countably many points, and hence by right continuity at all points. Another simple fact is this: If lim n Fn( d ) = F(d) for d in a set D dense in R 1 , then Fn = F. Indeed, if F is continuous at x, there are in D points d' and d" such that d' < x < d" and F(d") - F(d') < E, and it follows that the limits superior and inferior of Fn(x) are within E of F(x ). For any probability measure on (R 1 , .9i' 1 ) there is on some probability space a random variable having that measure as its distribi.Ition. Therefore, for probability measures satisfying 11- n = J.L, there e,..ist 1 andom variables Yn and Y having these measures as distributions and satisfying Yn = Y. Accord ing to the following theorem, the Yn and Y can be constructed on the same probability space, and even in such a way that Yn(w) � Y(w) for every w-a condition much stronger than Y,. -=> Y. This result, Skorohod ' s theorem, makes possible very simple and transparent proofs of many important facts.
Suppose that f-L n and J-L are probability measures on and J.Ln = 11- · There e.x:ist random variables Yn and Y on a common probability space (f!, !F, P) such that Yn has distribution J.Ln , Y has distribution JL, and Yn(w) � Y(w) for each w. Theorem 25.6.
( R 1 , .9i' 1 )
PRooF.
For the probabili ty space (f!, !F, P), take f! = (0, 1), let !F con sist of the Borel subsets of (0, 1), and for P(A) take the Lebesgue measure of A. The construction is related to that in the proofs of Theorems 14. 1 and 20.4. Consider the distribution functions Fn and F corresponding to J.Ln and J.L· For 0 < w < 1, put Yn(w) = infix: w � Fn(x)] and Y(w) = infi x: w < F(x)]. Since w < Fn(x) if and only if Y,.(w) < x (see the argument following (14.5)), P[ w: Yn(w) < x ] P[ w: w < Fn(x)] Fn (x). Thus Yn has distribution function Fn ; similarly, Y has distribution function F. It remains to show that Yn(w) � Y(w). The idea is that Yn and Y are essentially inverse functions to Fn and F; if the direct functions converge, so must the inverses. Suppose that 0 < w < 1. Given E, choose x so that Y(w) - E <x < Y(w) and JL{X} = 0. Then F(x) < w; Fn(x) � F(x) now implies that, for n large enough, Fn(x) < w and hence Y(w) - E < x < Yn(w). Thus lim infn Y,.(w) > Y(w). If w < w' and E is positive, choose a y for which Y(w') < y < Y(w') + E and J.L{Y} 0. Now w < w' < F(Y(w')) < F( y), and so, for n large enough, w < Fn(y) and hence Y,.(w) < y < Y(w') + E. Thus lim sup n Y,.(w) < Y(w') if w < w'. Therefore, Yn( w) � Y( w) if Y is continuous at w. =
=
=
CONVERGENCE OF DISTRIBUTIONS
334
Since Y is nondecreasing on (0, 1), it has at most countably many disconti nuities. At discontinuity points w of Y, redefine Yn(w) = Y(w) = 0. With this change, Yn(w) � Y(w) for every w. Since Y and the Yn have been altered only on a set of Lebesgue measure 0, their distributions are still 11- n and J.L. • Note that this proof uses the order structure of the real line in an essential way. The proof of the corresponding result in R k is more complicated. The following mapping theorem is of very frequent use.
Suppose that h: R 1 � R 1 is measurable and that the set Dh of its discontinuities is measurable.t If 11-n = J.L and J.L(D;,) = 0, then 11-n h - i = J.L h - 1 . Recall (see (13.7)) that 11-h - 1 has value p (h - 1A ) at A. Theorem 25.7.
PRooF.
Consider the random variables Yn and Y of Theorem 25.6. Since Yn(w) � Y(w), if Y(w) f!- D h then h(Yn(w)) � h(Y(w)). Since P[w: Y(w) E Dh ] = J.L(Dh ) = 0, it follows that h(Yn(w)) � h(Y(w)) with probability 1. Hence h(Y,.) = h(Y) by Theorem 25.2. Since P[h(Y) EA] P[ Y E h - 1A] = J.L(h - 1.4 ), h(Y) has distl ibution 11-h - I ; similarly, h(Yn) has distribution J.L n h - 1 . Th us • h(Yn ) = h(Y) is the same thing as J.L n h - I = 11-h - I . =
Because of the definition of convergence in distribution, this result has an equivalent statement in terms of random variables: Corollary
l.
If Xn = X and P[X E Dh ] = 0, then h(Xn) = h(X).
Take X = a: Corollary 2.
If Xn = a and h is continuous at a, then h(Xn) = h(a).
From Xn = X it follows directly by the theorem that aXn + b = aX + b. Suppose also that a n � a and bn � b. Then (a n - a)Xn � 0 by Example 25.7, and so (a nXn + bn ) - (aXn + b) = 0. And now a nXn + bn = aX + b follows by Theorem 25.4: If Xn = X, a n � a, and bn � b, then anXn + bn = aX + b. This fact was stated and proved differently in Section 14 • -see Lemma 1 on p. 193. Example 25.8.
By definition, 11- n = J.L means that the corresponding distribution functions converge weakly. The following theorem characterizes weak convergence
!JR 1
tThat Dh lies in is generally obvious in applications. In point of fact, it always holds (even if h i s not measurable): Let A(E, 15) be the set of x for which there exist y and z such that lx - y l < 15, lx - zi < 15, a r. d ih( y ) - h( z)i > E. Then A(E, 15) is open and Dh = u . n aA(E, 15 ), where E and 15 range over the positive rationals.
SECTION 25.
WeAK
CONVERGENCE
335
without reference to distribution functions. The boundary aA of A consists of the points that are limits of sequences in A and are also limits of sequences in Ac; alternatively, aA is the closure of A minus its interior. A �et A is a wcontinuity set if it is a Borel set and J.L(aA) = 0. Theorem 25.8.
The following three conditions are equivalent.
{i) 1-Ln = /-L; {ii) ffdi-Ln � ffdJ.L for every bounded, continuous real function f; (iii) 1-L n( A ) � J.L( A) for every wcontinuity set A.
PROOF.
Suppose that 1-Ln 1-L· and consider the random variables Yn and Y of Theorem 25.6. Suppose that f is a bounded function such that J.L( D ) = 0, where D1 is the set of points of discontinuity of F. From P[ Y E D1 = J.L(D1) = 0 it follows that fCY,,) � f(Y) with probabiiity 1, and so by change of variable (see (21. 1)) and the bounded convergence theorem, ffdi-Ln = E[f(Yn)] � E[ f(Y)] = ffdJ.L. Thus 1-Ln = J.L and J.L(D1) = 0 together imply that ffdi-L n � JfdJ.L if f is bounded. In particular, (i) implies (ii). Further, if f = /A , then D1 = a A, and from J.L(aA) = 0 and 1-Ln = 1-L follows 1-Ln(A) -= ffdi-L n � Jfdi-L = J.L(A). Thus (i) also implies {iii). Since a( - oo, x ] = {x}, obviously (iii) implies {i). It therefore remains only to deduce 1-Ln = 1-L from {ii). Consider the corresponding distribution func tions. Suppose that x < y, and let f(t) be 1 for t < x, 0 for t > y, and interpolate linearly on [x, y ]: f(t) = { y - t)j(y - x) for x < t < y. Since Fn(x) < Jfdi-Ln and Jfdi-L < F(y), it follows from (ii) that lim supn Fn(x) < F(y); letting y ! x shows that lim supn Fn(x) < F(x). Similarly, F(u) < lim infn Fn(x) for u < x and hence F(x - ) < lim infn Fn(x). This implies • convergence at continuity points.
f
=
The function f in this last part of the proof is uniformly continuous. Hence 1-Ln = 1-L follows if ffdi-L n � ffdJ.L for every bounded and uniformly continuous f. The distributions in Example 25.3 satisfy 1-Ln = J.L, but 1-Ln( A ) does not converge to J.L( A) if A is the set of rationals. Hence this A cannot be a wcontinuity set; in fact, of course, aA = R 1 • • Example 25.9.
The concept of weak convergence would be nearly useless if (25.2) were not allowed to fail when J.L(aA) > 0. Since F(x) - F(x - ) = J.L{x} = J.L(a( - oo, x ]), it is therefore natural in the original definition to allow (25.1) to fail when x is not a continuity point of F.
CONVERGENCE OF DISTRIBUTIONS
336 Melly's Theorem
One of the most frequently used results in analysis is the Helly selection
theorem:
For every sequence {F11} of distribution functions there exists a subsequence {Fn ) and a nondecreasing, right-continuous function F such that lim k F11 k( x) = F( x) at continuity points x of F. Theorem 25.9.
PROOF.
An
application of the diagonal method [Al4] gives a sequence {n k} of ir.tegers along which the limit G(r) = limk F k(r) exists for every rational r. Define F(x) = inf[ G(r ): x < r ]. Clearly F is non decreasing. To each x and there is an r fo r which x < r and G(r) < F(x) + E. If x < y < r, then F(y) < G(r) < F(x) + E. Hence F is continuous from the right. If F is continuous at x, choose y < x so that F(x) - E < F( y ); now choose rational r and s so that y < r <x < s and G(s) < F(x) + E. From F(x) - E < G(r) < G(s) < F(x) + E and F11(r) < F11( X ) < F11(s) it follows that as k goes to infinity F11 k(x) has limits superior and inferior within E of F(x ) . • 11
E
The F in this theorem necessarily satisfies 0 < F(x) < 1. But F need not be a distribution function: if F11 has a unit jump at n, for example, F(x) 0 is the only possibility. It is important to have a condition which ensures that for some subsequence the limit F is a distribution function. A sequence of probability measures 11- n on (R 1 , .9i' 1 ) is said to be tight if for each E there exists a finite interval (a, b] such that JL11(a, b] > 1 - E for all n. In terms of the corresponding distribution functions F11, the condition is that for each E there exist x and y such that F11(x) < E and F11(y) > 1 - E for all n. If 11- n is a unit mass at n, {Jl-11} is not tight in this sense-the mass of P-n "escapes to infinity." Tightness is a condition preventing this escape of mass. =
Tightness is a necessary and sufficient condition that for every subsequence {Jl-11 k } there exist a further subsequence {Jl-11 k(})} and a probability measure 11- such that 11- n kuJ = 11- as j � oo. Theorem 25.10.
Only the sufficiency of the condition in this theorem is used in what follows. theorem to the subsequence {Fn) of corresponding distribution functions. There exists a further subsequence {Fn k(})} such that lim ,. Fnk(}J(x) = F(x) at continuity points of F, where F is nondecreasing and right-continuous. There exists by Theorem 12.4 a measure 11- on (R 1 , .9i' 1 ) such that JL(a, b] = F(b) - F(a). Given E, choose a and b s o that JL11( a, b] > 1 - E for all n, which is possible by tightness. By decreasing a
PROOF. Sufficiency. Apply Helly's
SECTION 25. WEAK CONVERGENCE
337
and increasing b, one can ensure that they are continuity points of F. But then JL(a, b] > 1 - E. Therefore, 11- is a probability measure, and of course
= 11- · Necessity. If {JLn} is not tight, there exists a positive E such that for each finite interval (a, b ], 11- n(a, b] < 1 - E for some n. Choose n k so that 11-n ( - k, k] < 1 - €. Suppose that some subsequence {JLn } of {JL n } were to converge weakly to some probability measure 11-· Choose (a, b] so that JL{a} = JL{b} = 0 and JL(a, b] > 1 - E . For large enough j, (a, b] c ( - k(j), k{j)], and so 1 - € > 11-n ( - k{j), k{ j )] > 11- n (a, b] � JL(a, b ]. Thus • JL(a, b] < 1 - E, a contradiction. 11-
n k(} )
k
k(} )
k(J)
k
k(J)
is a tight sequence of probability measures, and !f each subsequence th at converges weakly at all converges weakly to the probability measure JL, then 11- n = JL. Corollary.
If {JL n}
PROOF.
By the theorem, each subsequence {JLn) contains a further subsequence {JL n . } converging weakly (j � oo) to some limit, and that limit must by hypothesis be 11- · Thus every subsequence {JL n ) contains a further subsequence {JLn } converging weakly to 11-· Suppose that �'� = 11- is false. Then there exists some x such that JL{x} = 0 but 11- n( - oo, x] does not converge to JL( - oo, x ]. But then there exists a positive E such that I 11- n ( - oo, x] - JL(- oo, x 11 > E for an infinite sequence {n k } of integers, and no subsequence of {JL n ) can converge weakly to 11- · This • contradiction shows that 11- n = 11- · k(J)
k
lf 11-n is a unit mass at xn, then {JLn} is tight if and only if {x n} is bounded. The theorem above and its corollary reduce in this case to standard facts about real, line; see Example 25.4 and A10: tightness of sequences of probability measures is analogous to boundedness of sequences of real numbers. Let 11-n be the normal distribution with mean mn and variance a-n2 • If mn and a-n2 are bounded, then the second moment of 11-n is bounded, and it follows by Markov's inequality (21. 12) that {JLn} is tight. The conclusion of Theorem 25. 10 can also be checked directly: If {n k U >} is chosen so that lim ,. m n = m and lim,. a-n2 . = a- 2 , then 11-n = JL, where 11- is normal with mean m and variance a- 2 (a unit mass at m if a- 2 = 0). If mn > b, then Jl n(b, oo ) > 4; if m n < a, then Jl n( - oo, a] > 4. Hence {JLn} cannot be tight if m n is unbounded. If m n is bounded, say by K, then 1 is the standard normal distribu oo, (a - K)a-n- ], where 11- n( - oo, a] > 1 oo, (a - K)a-n- 1 � t along some subse tion. If a-n is unbounded, then quence, and {JL n} cannot be tight. Thus a sequence of normal distributions is tight if and only if the means and variances are bounded. • Example 25.10.
k(J)
v(-
k(})
v(-
k(J)
v
CONVERGENCE OF DISTRIBUTIONS
338 Integration to the Limit Theorem 25.11.
If Xn = X, then E[ I XIl < lim infn E[ IXn l].
PROOF.
Apply Skorohod's Theorem 25.6 to the distributions of xn and X: There exist on a common probabili ty space random variables Yn and Y such that Y = lim n Yn with probability 1, Yn has the distribution of Xn , and Y has the distribution of X. By Fatou's lemma, E[ I Y il < lim infn E[ I Y,. Il. Since l X I and I Y I have the same distribution, they have the same expected value • (see (21.6)), and similarly for I Xn l and I Y,.l. The random variables Xn are said to be uniformly integrable if
(25.10)
lim sup
a -+ oo
n
1
1 Xn1 dP
[I Xn l � a]
=
0;
see (16.21). This implies (see (l6.22)) that
(25 . 1 1 ) Theorem 25.12.
integrable and
n
If Xn = X and the Xn are uniformly integrable, then X is
(25.12 )
PROOF.
Construct random variables Yn and Y as in the preceding proof. Since Y,. � Y with probability 1 and the Yn are uniformly integrable in the • sense of (16.21), E[ Xn l = E[ Y,. ] � E[ Y ] = E[X ] by Theorem 16.14. If supn E[IXi + E ] < cc for some positive integrable because
E,
then the Xn are uniformly
(25.13) Since Xn = X implies that x; = xr by Theorem 25.7, there is the following consequence of the theorem.
Let r be a positive integer. If Xn = X and sup n E[I XX+El < oo, where E > 0, then E[ l x n < oo and E[ X;l � E[ Xr]. Corollary.
'
The Xn are also uniformly integrable if there is an integrable random variable Z such that P[IXn l > t ] < P[ IZI > t ] for t > 0, because then (21.10)
SECTION 25. WEAK CONVERGENCE
339
.
gives
From this the dominated convergence theorem follows again. PROBLEMS
Show by example that distribution functions having densities can converge weakly even if the densities do not converge: Hint: Consider f/x) = 1 + cos 211'nx on [0, 1]. (b) Let fn be 2" times the indicator of the set of x in the unit interval for which d,. + 1(x) = · · · = d2n(x) = 0, where dk (x) is the k th dyadic digit. Show that f,.(x) --+ 0 except on a set of Lebesgue measure 0; on this exceptional set, redefine ,f,.( x) = 0 for all n, so that fn(x) --+ 0 everywhere. Show that the distributions corresponding to these densities converge weakly to Lebesgue measure confined to the unit interval. (c) Show that distributions with densities can converge weakly to a limit that has no density (even to a unit mass). (d) Show that discrete distributions can converge weakly to a distribution that has a density. (e) Construct an example, like that of Example 25.3, in which ttn(A ) --+ p,(A ) fails but in which all the measures come from continuous densities on [0, 1].
25.1. (a)
25.2.
14.8 i Give a simple proof of the Gilvenko-Cantelli theorem (Theorem 20.6) under the extra hypothesis that F is continuous.
25.3.
Initial digits.
x is d (in the scale of 10) if and only if {log 10 x} lies between log 1 0 d and log 1 0(d + 1), d 1, . . . , 9, where the braces denote fractional part. (b) For positive numbers x 1, x2, . . . , let Nr(d) be the number among the first n that have initial digit d. Show that (a) Show that the first signifi c ant digit of a po!'itive number
=
lim n _!n Nn ( d) = log 10 ( d + 1) - log 10 d,
( 25 . 14)
d = 1, . . . , 9,
if the sequence log 1 0 X11, n = 1, 2, . . . , is uniformly distributed modulo 1. This is true, for example, of xn {}" if log10 {} is irrational. (c) Let Dn be the first signifi c ant digit of a positive random variable X11 • Show that =
( 25. 15)
lim n P [ D" = d] = log 10( d + 1 ) - log 10 d,
if {log 10 Xn} 25.4.
=
d = 1,
. . .
, 9,
U, where U is uniformly distributed over the unit interval.
Show that for each probability measure p, on the line there exist probability measures 1-tn with finite support such that p, 11 p,. Show further that ttn{x} can =
CONVERGENCE OF DISTRIBUTIONS
340
be taken rational and that each point in the support can be taken rational. Thus there exists a countable set of probability measures such that every p., is the weak limit of some sequence from the set. The space of distribution fu nctions is thus separable in the Levy metric (see Problem 14.5). 25.5. Show that
(25.5) implies that
P([ X < X ] ll [ xn
< X ]) --+ 0 if
P[ X = X 1 =
0.
25.6. For arbitrary random variables Xn there exist positive consta nts a n such that a n xn = 0.
25.8 by showing for three-dimensional random vectors ( A n, Bn , Xn) and constants a and b, a > 0, that, if A n = a , Bn = b, and Xn = X, then A n Xn + Bn = aX + b Hint: First show that if Y,, = Y and Dn 0, then DnYn = 0.
25.7. Generalize Example =>
hn and h are Borel fu nctions. Let E be the set of x for which hnxn -+ hx fails for some sequence xn --+ x. Suppose that E � 1 and P( X £] = 0. Show that h 1 Xn hX.
25.8. Suppose that Xn = X and that
E
E
1
=>
25.9. Su ppose that the distributions of random variables Xn and X have densities fn and f. Show that if fn(x) --+ f(x) for x outside a set of Lebesgue measure 0,
then xn => X.
Suppose that Xn assumes as values Yn + kon , k = 0, + 1, . . . , where on > 0. Suppose that on --+ 0 and that, if k,. is a n integer varying with n in such a way that 'Yn + knon --+ x, then P[ Xn = 'Yn + k nonlo; 1 --+ f(x), where f is the density of a random variable X. Show that Xn = X.
25.10. i
Let sn have the binomial distributioil with parameters n and p. Assume as known that
25.11. i
(25 .16) if ( k n - np)(np(1 -p )) - 1 12 --+ x. Deduce the DeMoivre-Laplace theorem: (Sn - np )(np(l -p )) - 1 12 N, where N has the standard normal distribution. This is a special case of the central limit theorem; see Section 27. =>
25.12. Prove weak convergence in Example
theory of the Riemann integral.
25.3 by using Theorem 25.8 and the
25.13. (a) Show that probability measures satisfy /1- n
=>
p., if p.,n(a, b ] --+ p.,( a , b] when
ever p.,{a} = p.,{b} = 0. (b) Show that, if ffdp.,n --+ ffdp., fo r all continuous f with bou nded support, then /1- n = p.,.
SECTION 25. WEAK CONVERGENCE
341
Let p., be Lebesgue measure confined to the unit interval; let l.tn corre spond to a mass of xn , i - xn,i - l at some point in (xn , i - l , xn), where 0 =xn o <xn 1 < · · · <xnn = 1. Show by considering the distribution functions that l.tn = p., if max; < n(.xn,i - x,,; _ 1 ) -+ 0. Deduce that a bounded Borel fu nction continuous almost everywhere on the unit interval is Riemann integrable. See Problem 17. 1.
25. 14. i
A function f of positive integers has distribution function F if F is the weak limit of the distribution function Pn[m: f(m) < x ] of f under the measure having probability 1/n at each of 1, (see 2.34)). In this case D[m: f(m) <x] = F( x ) (see (2.35)) for continuity points x of F. Show that
25.15. 2. 18 5. 19 i
. . . , n
q; (m) j m
(see (2.37)) has a distribution: (a) Show by t he mapping theorem that it suffices to prove that f(m) = log(q;(m)jm) = [P o/m) log(1 - 1 jp ) has a distribution. (b} Let fu(m) = Lp ,; u o (m)log(1 - 1jp), and shew by (5.45) that fu has distribution function F/x) = P[I:P , u Xp Iog(1 - 1/p) < x ], where the XP are independent random variables (one for each prime p) such that P[XP = 1] = 1 /p and P[ XP = 0] = 1 - 1/p. (c) Show that I: P XP Iog(1 - 1jp) converges with probabi!it} 1. Hint: Use Theorem 22.6. (d) Show that lim u supn En[ If- fu ll = 0 (see (5.46) for the notation). (e) Conclude by Markov's inequality and Theorem 25.5 that f has the distribution of the sum in (c). _ ""
E �1
and T > 0, put A T( A ) = A([ - T, T] n A )j2T, where A is Lebesgue measure. The relative measure of A is
25.16. For A
(25. 17)
p ( A ) = lim A T( A ) , T - oo
provided that this limit exists. This is a continuous analogue of density (see (2.35)) for sets of integers. A Borel fu nction f has a distribution under A T ; if thi!; converges weakly to F, then (25. 18)
p [ x : f( x ) < u] = F ( u )
for continuity points u of F, and F is called the distribution function of f. Show that all periodic fu nctions have distributions. 25. 17. Suppose that supn ffdp., n < oo for a nonnegative f such that
x -+
± oo.
25.18. 23 . 4 i
Show that {p.,n} is tight.
f(x) -+ oo as
1
Show that the random variables A and L, in Problems 23.3 and 23.4 co nverge in distribution. Show that the moments converge.
25.19. In the applications of Theorem 9.2, only a weaker result is actually needed:
For each K there exists a positive a = a(K) such that if E(X] = 0, E[X 2 ] = 1 , and E[ X4] < K , then P[ X > 0] > a. Prove this by using tightness and the corollary to Theorem 25. 12.
Xn for which there is no inte grable Z satisfying P[IXnl > t] < P[l Zl > t] for t > 0.
25.20. Find u niformly integrable random variables
CONVERGENCE OF DISTRIBUTIONS
342
SECTION 26. CHARACTERISTIC FUNCTIONS Definition
Th e characteristic function of a probability measure JL on the l ine is defined for real t by
�( t)
=
j'"'00eitxJL ( dx) -
"'
= f COS txJL(dx) - 00
00
+ if 00 �in txJL( dx ) ; -
see the end of Section 16 fer integrals of complex-valued fu ncticns. t A random variable X with distribution J..L has characteristic function
� ( t ) = E[ e i ' X ] =
"'
J- eitxJL(dx) . "'
The characteristic function is thus defined as the moment generat ing function but with the real argument s replaced by it; it has the advantage that it a lways exists because eit x is bounded. The characteristic function in nonprob abilistic contexts is called the Fourier transform. The characteristic function has three fundamental properties to be estab lished here: {i) If JL 1 and JL 2 have respective characteristic functions � 1 (t ) and �it), then JL 1 * JL 2 has characteristic function � 1(t)�it). Although convolution is essential to the study of sums of independent random variables, it is a complicated operation, and it is often simpler to study the products of the corresponding cha 1 acteristic functions. {ii) The characteristic function uniquely determines the distribution. This shows that in studying the products in {i), no information is lost. {iii) From the pointwise convergence of characteristic functions follows the weak convergence of the corresponding distributions. This makes it possible, for example, to investigate the asymptotic distributions of sums of independent random variables by means of their characteristic functions. Moments and Derivatives
It is convenient first to study the relation between a characteristic function and the moments of the distribution it comes from. t From complex variable theory only De Moivre's formula and the simplest properties of the exponential function are needed here.
SECTION 26. CHARACTERISTIC FUNCTIONS
343
Of course, cp{O) = 1, and by (16.30), lcp(t )I < 1 for all t. By Theorem 1 6 .80 ), cp(t ) is continuous in t. In fact, lcp(t + h) - cp(t )I < Jle ih x - 1 I J.L{dx), and so it follows by the bounded convergence theorem that cp(t) is umformly continu
ous.
In the following relations, versions of Taylor's formula with remainder, x is assumed real. Integration by parts shows that
( 26. 1)
X
1o ( x -
n s ) e ;5 ds =
l j ( - s ) n + l e 1s ds + ' n+1 n+1 o x X
.
n+J
X
and it follows by induction that
k ·n + l n lX ( ) ·x l j e' = L k .1 + n .1 ( X - s ) e'� ds o k =O n
( 26.2 )
.l.
"
·
for n > 0. Replace n by n - 1 in (26. 1), solve for the integral on the right, and substitute this for the integral in (26.2); this gives
( 26. 3) Estimating the integrals in (26.2) and (26.3) (consider separately the cases x > 0 and x < 0) now leads to
{
k n n +l x l ( ix ) l 2lxl . e ix < mm ' n! k ! ! n + 1 ) ( L k O ' I
( 26.4 )
n
}
for n > 0. The first term on the right gives a sharp estimate for lxl small, the second a sharp estimate for lxl large. For n = 0, 1, 2, the inequality special izes to
le ix - 11 < min{ lxl, 2} , l e'x - ( 1 + ix) l < min{Ix 2 , 2 lxl} ,
(26.40 ) ( 26.41 )
l e i x - (1 + ix - Ix 2 ) l < min{-i;lxl 3 , x 2 } .
( 26.4 2 )
If X has a moment of order n , it follows that ( 26.5 )
[ { ltX I n + I
k n it tX 2l . I ;, ( k ).I E[X k ] -< £ mm cp ( t ) L. I I ' . n + . n 1 ) ( k=O _
}]
.
CONVERGENCE OF DISTRIBUTIONS
344
For any
t
satisfying
(26.6) cp(t) must therefore have the expansion (26.7) compare
(21 .22).
If
[, 1 � 1 ; E( IX Ik) k =O
=
E[ e l r Xi ] < oo,
then (see (16.3 1)) (26.7) must hold. Thus generating function over the whole line. Since
Example 26.1.
tion, by
(26. 7) and (21. 7)
( 26.8) cp( t )
"'
=
L k =0
(26.7)
holds if
X
has
a
moment
E[ei' XI ] < oo if X
has the standard normal distribu its characteristic function is
( lt ) 2 k 1 X 3 X · · · X (2k - 1) ( 2k ) !
"'
•
This and (21.25) formally coincide if
s
=
11 -
L k =O k .
( - !.._2 _ ) k 2
=
e - 1 2 /2 •
= it.
If the power-series expansion (26. 7) holds, the moments of off from it:
X can be read
(26.9) This is the analogue of (21.23). It holds, however, under the weakest possible assumption, namely that E[ I X k l] < oo. Indeed,
[
]
h X _ l - ihX i cp( t + h) - cp ( t) e v i · r r . - E[ 1.11. e X ] - E e i X h h By (26.4 1 ), the integrand on the right is dominated by 2 I X I and goes to 0 with h; hence the expected value goes to 0 by the dominated convergence theore m. Thus cp'(t) = E[iXelr x]. Repeating this argument inductively gives
(26 .10)
SECTION 26. CHARACTERISTIC FUNCTIONS
345
if E[IX k l] < oo. Hence (26.9) holds if E[ IX k l] < oo. The proof of uniform k continuity for cp(t) works for cp< >(t) as well. If £[X 2 1 is finite, then
t -7 0.
(26 . 1 1 )
Indeed, by (26 .4 2 ), the error is at most t 2 £[ min{lti iX I 3, X 2 }], and as t -7 0 the integrand goes to 0 and is dominated by X 2 . Estimates of this kind are essential for proving limit theorems. The more moments JL has, the more derivatives cp has. This is one sense in which lightness of the tails of JL is reflected by smoothness of q>. There are results which connect the behavior of cp(t) as It 1 -� oo with smoothness properties of JL. The Riemann -Lebesgue theorem is the most important of these: Theorem 26.1.
If JL has a density, then cp(t) -7 0 as I t 1 -7
oo .
PROOF.
The problem is to prove for integrable f that ff(x )eirx dx -7 0 as ltl -7 oo. There exists by Theorem 17.1 a step function g = [k ak iA ,• a finite linear combination cf indicators of intervals Ak = (a k , bk ], for which J!J gl dx < € . Now Jf(x)e ir x dx differs by at most € from Jg(x)ei rx dx • [k ak(eirbk - e ir ak )jit, and this goes to 0 as ltl -7 oo. =
Independence
The multiplicative property (21 .28) of moment generating functions extends to characteristic functions. Suppose that X 1 and X2 are ind ependent random variables with characteristic functions cp 1 and cp 2 . If }j = cos Xj and 21 sin tXj, then (Y1 , Z 1 ) and (Y2 , 2 2 ) are independent; by the rules for integrat ing complex-valued functions, =
•
cp 1 ( t)cp 2 ( t) = ( E [ Y1 ] + i£[2 1 ) )( E[ Y2 ) + i£[ 22 )) = E[ Y1 ] E[ Y2 ] - £[ 2 1 ) £[ 2 2 ) + i ( E[ YJ l £[ 2 2 ) + £[2 J l E[ Y2 ]) i t ( X I + X2 > ] = E [ Y, Y - 2 1 2 + i( yl 2 + 2 1 Y2 )] = E [ e . 2 2 2
This extends to sums ·of three or more: If
(26.12)
E [ e i t 'f.Z - I Xk l
=
n
X1 , • • • , Xn
n E[ e irXk ] .
k=l
are independent, then
CONVERGENCE OF DISTRIBUTIONS
346
If X has characteristic function function
cp(t ),
then
aX + b
has characteristic
(26.13) cp( - t ), which
In particular, -X has characteristic function conjugate of cp(t ).
is the complex
Inversion and the Uniqueness Theorem
A characteristic function cp uniquely determines the measure JL it comes from. This fundamental fact will be derived by means of an inversion formula through which JL can in principle be recovered from cp. Define
T > 0. In Example
18.4 it is shown that 7T ; T) S = ( 2 T-+ oo
(26.14)
lim
is therefore bounded. If sgn 0 is + 1, negative, then
S(T)
(26.15)
1
T
0,
or
-1
as () is positive,
0,
or
sin tO
0
t dt = sgn o · S ( T IOI) ,
If the probability measure JL has characteristic function cp, and if J1 {a} = JL{b} = 0, then - ir b T e- ir a 1 . e hm - J JL ( a , b] = T(26.16) cp( t) dt. l't + oo 2 7T - T Theorem 26.2.
_
Distinct measures cannot have the same characteristic function. integrand here converges as t --) 0 to b - a, which is to be taken as its value for t = 0. For fixed a and b the integrand is thus continuous in t, and by (26.40) it is bounded. If JL is a unit mass at 0, then cp(t) 1 and the integral in (26. 16) ca nnot be extended over the whole line.
Note: By (26.4 1 ) the =
PROOF.
The inversion formula will imply uniqueness: It will imply that if JL and v have the same characteristic function, then JL(a, b] = v(a, b] if JL{a} v{a} = JL{b} = v{b} = 0; but such intervals (a, b] form a 1r-system gen erating g 1 • =
SECTION 26. CHARACTERISTIC FUNCTIONS Denote by
347
IT the quantity inside the limit in (26.16). By Fubini's theorem
( 26.17 ) This interchange is legitimate because the double integral extends over a set of finite product measure and by (26.40) the int_egrand is bounded by lb - ai. Rewrite the integrand by DeMoivre's formula. Since sin s and cos s are odd and even, respectively, (26.15) gives
The integrand here is bounded and converges as 1
2
!/la, b ( x ) =
-7 oo to the function
if x < a, if x = a , if a < x < b, if X = b, if b < x.
0 (26.18)
T
1 I
2
0
Thu� / -7 f i/Ja, b dfL, which implies that (26.16) holds if JL{a} = JL{b} = 0. T The inversion formula contains further information. Suppose that
•
"'
f 1 �( t ) 1 dt < oo.
( 26.19)
- 00
In this case the integral in
(26.16) can be extended over R 1 • By (26.40),
e - ir b - e - i ra
it
i e r( b - a ) - 11 < lb - a l; ,= ! tl i
therefore, JL(a, b) < (b - a)f':"'l�(t)l dt, and there can be no point masses. By (26.16), the corresponding distribut ion function satisfies F( x + h ) - F( x ) h
1 "' = 27T !_
"'
i e - rx _ e - ir (x +h )
ith
�( t) dt
(whether h is positive or negative). The integrand is by (26.40) dominated by l�{t)l and goes to eir x�(t) as h -7 0. Therefore, F has derivative
(26.20 )
CONVERGENCE OF DISTRIBUTIONS
348
Since f is continuous for the same reason cp is, it integrates to F by the fundamental theorem of the calcuius (see (17.6)). Thus (26.1 9) implies that JL has the continuous density (26.20). Moreover, this is the only continuous density. In this result, as in the Riemann-Lebesgue theorem, conditions on the size of cp(t) for large I t I are connected with smoothness properties of JL . The inversion formula (26.20) has many applications. In the first place, it can be used for a new derivation of (26.14). As pointed out in Example 17.3, the. existence of the limit in (26. 14) is easy to prove. Denote this limit temporarily by 7r0/2-without assuming that 7r o = 7r. Then (26.16) and (26.20) follow as before if 7r is replaced by 7r0. Applying the latter to the standard normal density (see (26.8)) gives
( 26.21)
where the 7r on the left is that of analysis and geometry--it comes ultimately from the quadrature (18. 10). An application of (26.8) with x and t inter changed reduces the right side of (26.21 ) to (J27r /2 7r0)e - x 2 1 2 , and therefore 7r 0 does equal 7r. Consider the densities in the table. The characteristic function for the normal distribution has already been calculated. For the uniform distribution over (0, 1 ), the computation is of course straightforward; note that in this case the density cannot be recovered from ( 26.20), because cp(t) is not integrable; this is reflected in the fact that the density has discontinuities at 0 and 1.
Distribution
Density
Interval
1. Normal
1 - x 2 /2 Vl rr
- oo < x < oo
Uniform
1
0 <x < 1
3. Exponential 4. Double
e
2.
exponential or Laplace
5. Cauchy 6. Triangular 7.
e
-x
-!-
e
- lxl
1 1 rr 1 + x 2 1 - lxl 1 1 - COS X x2 7r
0 < x < oo - oo < x < oo
Characteristic Fu nction e
e
il
-
it 1
1
1 - it 1
1 + t2
- oo < x < oo
e
- 1 <x < 1
2
- oo < x < oo
_ ,2 /2
-It!
1-
cos t
t2 ( 1 - I t l) /( - I, I )(t)
SECTION 26. CHARACTERISTIC FUNCTIONS
349
The characteristic function for the exponential distribution is easily calcu lated; compare Example 2 .3 As for the double exponential or Laplace distribution, e - l x l e irx integrates over (0, oo) to - it )- 1 and over ( oo, 0) to + it )- 1 , which gives the result. By (26.20), then,
1 .
(1
e -lxl
(1
=
1 j "' e
7T
-
- oo
- ir x
-
dt . 1 + t2
= 0 this gives the standard integral f':"' dtj{l + t 2 ) = 1r; see Example Thus the Cauchy density in the table integrates to and has character istic function e 1 1 1 This distribution has no first moment, and the characteris tic function is not differentiable at the origin. A straightforward integration shows that the triangular density has the characteristic function given in the table, and by (26.20),
For
17.5.
1
x
-
.
( 1 - I X I) 1( - 1, 1)\ 1
) X =
1 ! "' e 1 - COS t dt.
7T
- oo
- i rx
t2
"'( 1
For x = 0 this is J': - cos t )t - 2 dt = 1r ; hence the last line of the table. Each density and characteristic function in the table can be transformed by (26.13), which gives a family of distributions.
The Continuity Theorem
Because of (26.1 2), the characteristic function provides a powerful means of studying the distributions of sums of independent random variables. It is often easier to work with products of characteristic functions than with convolutions, and knowing the characteristic function of the sum is by Theorem 26.2 in principle the same thing as knowing the distribution itself. Because of the following continuity theorem, characteristic functions can be used to study limit distributions. Theorem
26.3. Let JLn, JL be probability measures with characteristic func
tions 'Pn ' cp. A necessary and sufficient condition for JLn = JL is that cpn(t) -7 cp(t ) for each t.
PROOF. Necessity. For each t,
e i rx
has bounded modulus and is continu ous in x. The necessity therefore follows by an application of Theorem 25.8 (to the real and imaginary parts of e irx ).
CONVERGENCE OF DISTRIBUTIONS
350
Sufficiency. By Fubini's theorem, ( 26 .22)
>
2j
lx l � 2/u
1 ) dx (1 - 1 1 /-Ln( ) UX
(Note that the first integral is real.) Since cp is continuous at the origin and cp(O) = 1, there is for positive E a u for which u - 1J�J1 - cp(t)) dt < E . Since 'Pn converges to cp, the bounded convergence theorem implies that there exists an n0 such that u - lf�u(l cpn (t )) dt < 2 € for n > n0. If a = 2/u in (26.22), then JLn [x: lxl > a] < 2 € for n > n0. Increasing a if necessary will ensure that this inequality also holds for the finitely many n preceding n0. Therefore, {JLn } is tight. By the corollary to Theorem 25 . 1 0, 1-L n = JL will follow if it is shown that each subsequence {JL n) that converges weakly at all converges weakly to J.L . But if JL n = v as k -7 oo, then by the necessity half of the theorem, already k proved, v has characteristic function Iim k 'Pn k(t) = cp(t ). By Theorem 26.2, v and fL must coincide. II
-
Two corollaries, interesting in themselves, will make clearer the structure of the proof of sufficiency given above. In each, let J.Ln be probability measures on the line with characteristic functions 'Pn · l.
Suppose that limn cpn (t) = g(t) for each t, where the limit function g is continuous at 0. Then there exists a JL such that 1-L n = JL, and J.L has characteristic function g. Corollary
PRoOF.
The point of the corollary is that g is not assumed at the outset to be a characteristic function. But in the argument following (26.22), only cp{O) = 1 and the continuity of cp at 0 were used; hence {JLn} is tight under the present hypothesis. If 1-L nk => v as k oo, then v must have characteristic function limk 'Pnk(t) = g(t ) Thus g is, in fact, a characteristic function, and the proof goes through as before. • .
--+
SECTION 26. CHARACTERISTIC FUNCTIONS
351
In this proof the continuity of g wa s used to establish tightness. Hence if {J.Ln} is assumed tight in the first place, the hypothesis of continuity can be suppressed:
Suppose that lim n cp,,(t) = g(() exists for each t and that {J.L n} is tight. Then there exists a J.L such that J.L n = J.L, and J.L has characteristic function g. Corollary 2.
This second corollary applies, for example, if th e 1-L n have a common bounded support. If 1-L n is the uniform distribution over ( n , n), its charac teristic function is (nt) - 1 sin tn for t =I= 0, and hence it converges to Ic0J( t ). In this case {J.L n} is not tight, the limit fu nction is not continuous at 0, and 1-Ln does not converge weakly. • Example 26.2.
-
Fourier Series •
Let p., be a probability measure on coefficients are defined by
(26.23)
.91 1 that is supported by [0, 2'7T]. Its Fom ier
[2rr .m em = lo e• xp., ( dx ) , 0
m
= 0, + 1, + 2, . . . .
Th ese coefficients, the values of the characteristic functio n for integer arguments, suffice to determine p., except for the weights it may put at 0 and 2'7T. The relatio n between p., and its Fourier coefficients can be expressed formally by
1 p., ( dx ) - 2 '7T L c,e -•lx dx : I= 00
(26.24 )
•
- oo
if the p., ( dx } in (26.23} is replaced by the right side of (26.24}, and if the sum over I is interchanged with the integral, the result is a formal identity. To see how to recover p., from its Fourier coefficients, consider the symmetric partial sums sm(t) = (2'7T )- 1 'L'i'= mc1 e -u' and their Cesaro averages um(t) = - identity [A24] m 'L'i'=o 1s1(t ). From the trigonometric
-I
(26.25) it follows that
(26.26) * This topic may be omitted
m-
L
I
I 2 2 L e ikx = sin 2lmx sin � x 1 = 0 k= - 1
352
I
CONVERGENCE OF DISTRIBUTIONS
If JL is (2 '7T) - I times Lebesgue measure confined to [0, 2'7T ], then c 0 = 1 and c m = 0 for m -=fo 0, so that um(t) = sm(t) = (2 '7T) - ; this gives the identity
1 rr sin 2 �ms 2'7Tm J- rr sin 2 h ds = 1 . --==-=---
( 26.27)
Suppose that 0 < a < b < 2'7T, and integrate (26.26) over (a, b). Fubini's theorem (the integrand is nonnegative) and a change of variable lead to
(26.28)
b m( t ) dt = j2 rr [ 1 b-x x sin22 tms ] ds 1 fa 2 '7T m Ja-x Sin 2s 0 u
.
JL
( dx )
.
The denominator in (26.27) is bounded away from 0 outside ( o, o), and so as to oo with o fixed (0 < o < 1T ), -
m
goes
I 1 Sin 2 2ms I ds --+ 0 ' 2'7Tm J.li
•
Therefore, the expression in brackets in (26.28) goes to 0 if 0 < x < a or b < x < 2 '7T, and it goes to 1 if a < x < b; and because of (26.27), it is bounded by 1. It follows by the bounded convergence theorem that
( 26.29 ) if JL{a} = JL{b} = 0 and 0 < a < b < 2'7T. This is the analogue of (26.16). If JL and v have the same Fourier coefficients, it follows from (26.29) that JL( A) = v(A) fo r A c (0, 2'7T) and hence that JL{0,2'7T} = v{O, 2'7T}. It is clear from periodicity that the coefficients (26.23) are unchanged if JL{O} and JL{2'7T} are altered but 1-�{0} + JL{2'7T} i'i held wnstant. Suppose that 1-Ln is supported by [0, 211 ] and has coefficients c�>, and suppose that limn c�> = c m for all m. Since {JLn} is tight, JLn = JL will hold if JL n = v (k --+ oo) implies v = JL · But in this case v and JL have the same coefficients ckm ' and hence they are identical except perhaps in the way they split the mass v{O, 2'7T} = JL{O, 2'7T} between the points 0 and 2'7T. But this poses no problem if JL{O, 2'7T} = 0: If lim n c�> = e m for all m and JL{O} = JL{2'7T} = 0, then JL n = JL ·
Example 26.3. If JL is (2'7T) - I times Lebesgue measure confined to the interval [0, 2'7T], the condition is that Iimn dr7> = 0 for m -=fo 0. Let x 1 , x 2 , be a sequence of 1 reals, and let JLn put mass n at each point 2'7T{x k }, 1 < k < n, where { x k } = x k - [ x k J denotes fractional part. This is the probability measure (25.3) rescaled to [0, 2'7T]. The sequence x 1 , x 2 , is uniformly distributed modulo 1 if and only if •
•
for m
-=fo
•
•
0. This is Weyl' s criterion.
•
•
SECTION 26. CHARACTERISTIC FUNCTIONS
353
If x k = k8, where 8 is irrational, then exp(2 '7T i8 m ) -=fo 1 for m -=F 0 and hence n 1 L: 27Tik8m -1 2-rr i8m 1 - e 2-rr in8m = ne -+ o. nk e 1 e 2-rr18m _
=l
_
.
Thus 8, 28, 38, . . . is uniformly distributed modulo 1 if 8 is irrational, which gives • another proof of Theorem 25.1. PROBLEMS
lattice distribution if for some a and b, b > 0, the lattice [a + nb: n = 0, ± 1, . . . ] supports the distribution of X. Let X have
26.1. A random variable has a
characteristic function 'P· (a) Show that a necessary condition for X to have a lattice distribution is that l�p(t )I = 1 for some t -=F 0. (b) Show that the condition is sufficient as well. (c} Suppose that l�p(t)l = l�p(t')l = 1 for incommensurable t and t' (t -=F 0, t' -=F 0, t jt' irrational). Show that P[ X = c] = 1 for some constant c.
p.( - oo, x ] = p.[ -x, oo) for all x (which implies that p.(A) = p.(-A) for all A � 1 ), then p. is symmetric. Show that this holds if and only if the
26.2. If
E
characteristic function is real.
26.3. Consider functions 1p that are real and nonnegative and satisfy
�p(t ) and �p(O) = 1.
�p( -t) =
are positive and Lk. 1 dk = oo, that s 1 > s2 > · · · > 0 and lim k sk = 0, and that Lk = l sk d k = 1. Let 1p be the convex polygon whose successive sides have slopes -s 1 , - s2 , . . . and lengths d 1 , d 2 , when pro jected on the horizontal axis: 1p has value 1 - I:j= 1 s dj at t k = d 1 + · · · +dk . If sn = 0, there are in effect only n sides. Let �p0(t ) = (1 - It 1)1< 1 , 1 >(t) be the characteristic function in the last line in the table on p. 348, and show that �p(t ) is a convex combination of the characteristic functions �p0(t jt k ) and hence is itself a characteristic function. (b) P6lya's criterion. Show that 1p is a characteristic function if it is even and continuous and, on [O, oo), nonincreasing and convex t�p(O) = 1). (a) Suppose that
d 1 , d2 ,
•
•
•
=
•
_
•
•
CONVERGENCE OF DISTRIBUTIONS
354 26.4.
i Let 'P I and �p2 be characteristic functions, and show that the set A = [t: �p1(t) = �p2(t)] is closed, contains 0, and is symmetric about 0. Show that every set with these three properties can be such an A. What does this say about the uniqueness theorem?
26.5. Show by Theorem
26.1 and integration by1 parts that if JL has a density f with integrable derivative f', then �p(t) = o(t - ) as It I -+ oo. Extend to higher deriva tives.
26.6. Show for independent random variables uniformly distributed over ( - 1, n
that X1 +
26. 7.
···
+ Xn has density 71"- 1/Q((sin t)jt ) cos tx dt for n > 2.
+ 1)
21.17 i Uniqueness theorem for moment generating functions. Suppose that F has a moment generating function in ( -s0, s0), s0 > 0. From the fact that f":.. oo ezx dF( x ) is analytic in the strip -s0 < Re < s0, prove that the moment z
generating function determines F. Show that it is enougn that tbe moment generating function exist in [0, s0), s0 > 0. 26.8.
21.20 26.7 i Show that the gam111a density (20.47) has characteristic func tion
(1
� - t/a ) I
u
(
(
it
= exp -u log 1 - a
)] ,
where the logarithm is the principal part. Show that f0euf( x ; a, u) d.x 1 � analytic for Re z < a. 26.9. Use characteristic functions for a simple proof that the family of Cauch .
distributions defined by (20.45) is closed under convolution; compare th argument in Problem 20. 14(a). Do the same for the normal distributic (compare Example 20.6) and for the Poisson and gamma distributions.
26. 10. Suppose that Fn -=> F and that the characteristic functions are dominated by
integrable function. Show that of the Fn.
26. 11. Show for all
F
has a density that is the limit of the densit
a and b that the right side of (26.16) is JL(a, b) + �JL {a} + �JL
26.12. By the kind of argument leading to
1
Let x 1 , x2, prove that
(26.31)
J
T
.
e -110�p( t ) dt . JL { a } = lim 2T T-. oo -T
(26.30)
26. 13. i
(26.16), show that
•
•
•
be the points of positive JL-measure. By the following '
2 2 .hm 1 T " 1 I �p ( t ) dt = L..,., ( JL { x d ) . T-.oo 2T T k
J-
SECTION 26. CHARACTERISTIC FUNCTIONS
355
Let X and Y be independent and have characteristic function cp. (a) Show by (26.30) that the left side of (26.31) is P[ X - Y = 0]. (b) Show (Theorem 20.3) that P( X - Y = 0] = f:_ ooP( X = y]JL(dy) =
[k ( �L {x k )) 2. 26.14. r
Show that 1-t has no point masses if cp 2 (t) is integrable.
{1-tn} is tight, then the characteristic functions 'Pn(t) are uniformly equicontinuous (for each there is a o such that Is - t1 < o implies that lcpn(s) - Cf'n(t)l < for all n). 1-t implies that 'Pn(t) -+ cp(t) uniformly on bounded sets. (b) Show that 1-tn
26.15. (a) Show that if
=
E
E
(c) Show that the convergence in part (b) need not be uniform over the entire
line.
14.5 26.15 i For distribution functions F and G, define d'(F, G) =
26.16.
sup, lcp(t) - r/J(t )lj(l + ltl), where 1p and rfJ are the corresponding characteristic functions. Show that this is a metric and equivalent to the Uvy metric.
26.17.
25.16 i
A real function f has
mean value
J
1 M[ f ( x )] = lim 2 T T f( x ) d.x , -T T -.oo
(26.32)
provided that f is integrable over each [ - T, T] and the limit exists. (a) Show that, if f is bounded and ei•f<x> has a mean value for each t, then f has a distribution in the sense of (25.18). (b) Show that if t = 0, if t * 0.
( 26.33 ) Of course, f(x) = x has no distribution.
1. Let 1-tn be the distribution of the fractional part {nX}. Use the continuity theorem and Theorem 25.1 to 1 show that n - I:Z = 1 1-t k converges weakly to the uniform distribution on [0, 1 ].
21 18. Suppose that X is irrational with probability
26 1. �
25.13 i The uniqueness theorem for characteristic functions can be derived
from the Weierstrass approximation theorem. Fill in the details of the follow ing argument. Let 1-t and v be probability measures on the line. For continuous f with bounded support choose a so that 1-t( -a, a) and v( -a, a) are nearly 1 and f vanishes outside (-a, a). Let g be periodic and agree with f in (-a, a), and by the Weierstrass theorem uniformly approximate g(x) by a trigonomet ric sum p(x) = I:f= l a k eitk x. If 1-t and v have the same characteristic function, then ffd �L fgd�L fp d�L = fpdv fgdv ffdv. ==
==
==
==
26.20. Use the continuity theorem to prove the result in Example
convergence of the binomial distribution to the Poisson.
25.2 concerning the
356
CONVERGENCE OF D ISTRIBUTIONS
26.21. According to Example
25.8, if Xn = X, an --+ a, and bn --+ b, then aX + b. Prove this by means of characteristic functions.
26.22.
a
n Xn + bn =
26. 1 26. 15 i According to Theorem 14.2, if Xn = X and anXn + bn = Y, where an > 0 and the distributions of X and Y are nondegenerate, then an --+ a > 0, bn --+ b, and aX + b and Y have the same distribution. Prove this by characteristic functions. Let 'Pn' tp, r/J be the characteristic functions of Xn , X, Y. (a) Show that l�pn(ant )I -+ lt/J(t)l uniformly on bounded sets and hence that an cannot converge to 0 along a subsequence. (b) Interchange the roles of tp and rjJ and show that an cannot converge to infinity along a subsequence. (c) Show that an converges to some a > 0. (d) Show that eitb,. --+ r/J(t )jtp(at ) in a neighborhood of 0 and hence that Jci e isb ds fci(r/J(s)/tp(as)) ds. Conclude that bn converges. "
____,
26.23. Prove a continuity theorem for moment generating functions as defined by
(22.4) for probability measures on [0, the analogue of (26.22) is 2
( ( 1 - M(s)) ds
u .0
26.24.
oo). For uniqueness, see Theorem 22.2; >
�-t{�-,oo) . u
26.4 i Show by example that the values tp(m) of the characteristic function at integer arguments may not determine the distribution if it is not supported by [0, 2 '1T ].
m
f is integrable over [0, 2 1T ], define its Fourier coefficients as Jl"ei xf(x)dx. Show that these coefficients uniquely determine f up to sets of measure 0.
26.25. If
26.26.
19.8
26.25 i
Show that the trigonometric system (19. 17) is complete.
(26. 19) is I:mlcml < Show that x I:mcme-im x on [0, 21T ], where f is f(21T ). This is the analogue of the inversion formula
26.27. The Fourier-series analogue of the condition it implies 11- has density f( ) (2 1T ) - 1
oo.
=
continuous and f(O) (26.20).
26.28.
i
=
Show that
O < x < 21T .
26.29.
(a) Suppose X' and X'' are independent random variables with values in [0, 21T ], and let X be X' + X" reduced module 21T. Show that the correspond ing Fourier coefficients satisfy em c;.,c�. (b) Show that if one or the other of X' and X" is uniformly distributed, so is X. =
SECTION 27. THE CENTRAL LIMIT THEOREM 26.30.
357
26.25 i The theory of Fourier series can be carried over from [0, 21T] to the
unit circle in the complex plane with normalized circular Lebesgue measure P. The circular functions e im x become the powers wm, and an integrable f is determined to within sets of measure 0 by its Fourier coefficients e = m fnwmf(w)P(dw). Suppose that A is invariant under the rotation through the angle arg c (Example 24.4). Find a relation on the Fourier coefficients of /A , and conclude that the rotation is ergodic if c is not a root of unity. Compare the proof on p. 316.
SECTION 27. THE CENTRAL LIMIT THEOREM Identically Distributed Summands
The central limit theorem says roughly that the sum of many independent random variables will be approximately normally distributed if each sum mand has high probability of being small. Theorem 27. 1 , the Lindeberg-Levy theorem, will give an idea of the techn iques and hypotheses needed for the more general results that follow. Throughout, N will denote a random variable with the standard normal distribution: ( 27 .1 )
Suppose that {Xn } is an independent sequence of random variables having the same distribution with mean c and finite positive variance u 2 • If sn = XI + . . . +Xn , then Theorem 27.1.
(27 .2)
S__:.:_n- nc = N. u fii -=-
By the argument in Example 25.7, (27.2) implies that n - 1 Sn = c. The central limit theorem and the strong law of large numbers thus refine the weak law of large numbers in different directions. Since Theorem 27.1 is a special case of Theorem 27.2, no proof is really necessary. To understand the methods of this section, however, consider the special case in which Xk takes the values + 1 with probability 1/2 each. Each Xk then has characteristic function cp(t) = !e ;1 + !e - it = cos t . By n (26. 12) and (26. 13), Sn/ Vn has characteristic function cp (tj Vn), and so, by n the continuity theorem, the problem is to show that cos t/ Vn � E[e i1 N] e - 1 2 12 , or that n log cos t j Vn (well defined for large n) goes to - �t 2 • But this follows by 1 ' Hopi tal's rule: Let t I rn = X go continuously to 0. =
CONVERGENCE OF DISTRIBUTiONS
358
For a proof closer in spirit to those that follow, note that (26.5) for 3 gives l�(t) - (1 - tt 2 )1 < ltl (IXk l < 1). Therefore,
n=2
(27.3) Rather than take logarithms, use (27.5) below. which gives (n large) ( 27.4)
n
2 But of course (1 - t 2 j2 n) -7 e - 1 12 , which completes the proof for this special case. Logarithms for complex arguments can be avoided by use of the following simple lemma. at
Lemma l. Let most 1; then
lz 1
( 27 .5)
PROOF.
( z 1 - w 1 Xz 2
z1,
•
•
•
•
and w 1 ,
, zm
•
•
-
zm - W 1
•
•
•
•
•
•
, wm
wmi <
be complex numbers of modulus m
l z k - wk l. L k= l
This follows by induction from •
•
•
zm ) + w 1( z 2
•
•
•
zm - w 2
•
•
•
wm ).
z l . . . zm
- WI
. . . wm =
•
Two illustrations of Theorem 27. 1 : Example 2 7.1. In the classical De Moivre-Laplace theorem, Xn takes the values 1 and 0 with probabilities p a n d q = 1 - p, so that c = p, and u 2 = pq. Here Sn is the number of successes in n Bernoulli trials, and (Sn -
np)j ,fnpq = N.
•
Example 27.2. Suppose that one wants to estimate the parameter a of an exponential distribution (20. 10) on the basis of an independent sample X I , . . . , xn" As n -7 00 the sample mean xn = n - 1 L:Z= l xk converges in prob ability to the mean 1ja of the distribution, and hence it is natural to use 1/Xn to estimate a itself. How good is the estimate? The variance of the exponential distribution being 1ja 2 (Example 21.3), am (Xn - lja) N by the Lindeberg-Uvy theorem. Thus Xn is approximately normally distributed with mean 1ja and standard deviation 1jam. By Skorohod 's Theorem 25.6 there exist on a single probability space random variables Yn and Y having the respective distributions of Xn and N and satisfying am (Yn(w) - lja) -7 Y{w) for each w. Now Y,.(w) -7 1/a and a- 1 m (Y,.{w) - 1 - a) = am (a- 1 - Yn(w))jaY,.(w) -7 - Y(w). Since - Y has =>
SECTION 27. THE CENTRAL LIMIT THEOREM the distribution of
-
N and Yn
359
has the distribution of
( -)
Vn 1 a Xn
a
=>
-
Xn , it follows that
N '·
thus 1/Xn is approximately normall� distributed with mean a and standard deviation a I rn. In effect, 1 !Xn has been studied through the local linea r approximation to the function 1jx. This is called the delta method. • The Lindeberg and Lyapounov Theorems
Suppose that for each
n
( 27.6) are independent: the probability space for the sequence may change with n. Such a collection is called a triangular array of random variables. Put Sn Xn 1 + · · +Xnr . Theorem 27.1 covers the special case in which rn n and Xn k = Xk . Example 6 .3 on the number of cycles in a random permutation shows that the idea of triangular array is natural and useful. The central limit theorem for triangular arrays will be applied in Example 27.3 to the same array. To establish the asymptotic normality of Sn by means of the ideas in the precedi ng proof requires expanding the characteristic function of each Xn k to second-order terms and estimating the remainder. Suppose that the means are 0 and the variances are finite; write
=-
·
=
n
rn
( 27 .7)
s;, = I: un2k· k= l
The assumption that Xnk has mean 0 entails no loss of generality. Assume s� > 0 for large n. A successful remainder estimate is possible under the assumption of the Lindeberg condition: ( 27.8) for E > 0.
Suppose that for each n the sequence Xn 1 , is independent and satisfies (27.7). If (27.8) holds for all positive Sn/Sn = N. Theorem 27.2.
•
•
•
E,
, Xn
r n
then
CONVERGENCE OF DISTRIBUTIONS
360
This theorem contains the preceding one: Suppose that Xn k = Xk and rn = n, where the entire sequence {Xk } is independent and the Xk all have the same distribution with mean 0 a n d variance u 2 • Then (27.8) reduces to
( 27 .9) which holds because
[IX1 1 > EuVn]� 0 as n j oo.
PROOF OF THE THEOREM.
Replacing no loss of generality in assuming
Xn k
by
Xnkfsn shows that there
is
rn
s� = L u.,2k = 1 . k= l
(27. 10)
Therefore, the characteristic function
'Pn k of Xnk
satisfies
(27 . 1 1 ) Note that the expected value is finite. For positive E the right side of (27.11) is at most
Since the un2k add to condition that
rn
(27.12)
L
k=I
for each fixed
(27.13)
1
and
is arbitrary, it follows by the Lindeberg
I 'Pn k ( t ) - ( 1 - t t 2un2d l � 0
t. The objective rn
E
now is to show that
rn
n 'Pn k ( t ) = n ( 1 - it 2u,?k) + o ( 1 ) k= l k= l rn
= n e- 1 2u;d 2 + o( 1 ) = e - 1 2 / 2 + o ( 1 ) . k=l
SECTION
For
27.
E
THE CENTRAL LIMIT THEOREM
361
positive,
and so it follows by the Lindeberg con dition (recall that
sn
is now
1) that
( 27.14) For large enough n, 1 - tt 2un2k are all between 0 and 1, and by (27.5), n�"= l cpn k (t) and fl�"= I(l - tt 2un2k) differ by at most the sum in (27.12). This establishe� the first of the asymptoti::: relations in (27.13). Now (27.5) a!so implies that rn
rn
rn
n e -t 2a};k ! 2 - f1 ( 1 - �t 2un2k) < L l e-t 2 a}d 2 - 1 + tt 2un2k I . k�l k= l k=l For complex
z,
k-2 I 1 "' < lzl 2e l z l . ie • - 1 - zl < lzl 2 L
(27.15)
k=
\! 2
Using this in the right member of the preceding inequality bounds it by t4e 1 2 L�"= I un�; by (27.14) and (27.10), this sum goes to 0, from which the second equality in (27.13) follows. • It is shown in the next section (Example 28.4) that if the independeP.t array {Xnk} satisfies (27.7), and if Sn/sn N, then the Lindeberg condition holds, provided max k , , a:fk js� -+ 0. But this converse fails without the extra condition: Take Xnk = xk norm al with mean 0 and variance a;k = uf , where ul = 1 and an2 = ns� 1· =
sn = L:Z= 1 Xnk
in Example 27.3. Goncharov ' s theorem. Consider the sum Example 6.3. Here Sn is the number of cycles in a random permutation on n
letters, the
xn k
are independent, and
The mean m n is Ln = L:Z= 1 k- 1 , and the variance s� is Ln + 0(1). Lindeberg's condition for Xn k - (n - k + 1) - 1 is easily verified because these random variables are bounded by 1. The theorem gives (Sn - Ln ) fsn = N. Now, in fact, Ln = log n + 0(1), and so (see Example 25.8) the sum can be renormalized: (Sn - log n) j ,jlog n = N. •
CONVERGENCE OF DISTRIBUTIONS
362
'
Suppose that the 1 Xnk 1 2 + "' are integrable for some positive 8 and that
Lyapounov 's condition
'
"
1
[
E I Xn k I2 + "' lim L n k = I s2n +.5
( 27 . 1 6 )
J=0
holds. Then Lindeberg's condition follows because the sum in (27.8) is bounded by
Hence Theorem 27.2 has this corollary:
Suppose that for each n the sequence Xn1, • • • , Xn,n is independent and satisfies (27.7). If (27. 16) holds for some positive 8, then Sn /S n = N. Theorem 27.3.
Suppose that X1 , X , • • • are independent and uniformly 2 bounded and have mean 0. If the variance s� of Sn = X1 + · · · +X, gees to oo, then Sn!sn = N: If K bounds the Xn, then Example 27.4.
K = - � o' sn •
which is Lyapounov's condition for 8 = 1.
Example 27.5. Elements are drawn from a population of size n, randomly and with replacement, until the number of distinct elements that have been sampled is rn' where 1 < rn < n. Let Sn be the drawing on which this first happens. A coupon collector requires Sn purchases to fill out a given portion of the complete set. Suppose that rn varies with n in such a way that rn/n --+ p, 0 < p < 1. What is the approximate distribution of Sn? Let YP be the tria! on which successk first 1 occurs in a Bernoulli sequence with probability p for success: P[YP = k] = q - p, where q = 1 -p. Since the moment generating function is pesj(l - qes), E[ YP ] = p - I and Var[ ¥;, 1 = qp - 2• If k 1 dis tinct items have thus far entered the sample, the waiting time until the next distinct one enters is distributed as ¥;, as p = (n - k + 1)jn. Therefore, Sn can be repre sented as I:�·� 1Xnk for independent su mmands Xn k distributed as Y{n k l >fn- Since - + rn - pn, the mean and variance above give ·-
m n = E[ Sn ] =
( k - 1 )-l L 1'n
k=l
n
P dx
- n 10 1 - x
SECflON 27. THE CENTRAL LIMIT THEOREM
and s2n =
ity
i-
kL=l
{
k- 1 1_ k- 1 n
n
) -2
363
_
n
i
P
xdx o (1 -x) 2 "
Lyapounov's theorem applies for o = 2, and to check (27.16) requires the inequal
( 27.17) for some K independent of p. A calculation with the moment generating function shows that the left side is in fact qp - 4(1 + 7q + q 2 ). It now follows that
( 27.18) dx o (1 - x ) 4 •
- Knfp
Since (27.16) follows from this, Theorem 27.3 applies: (S,. - mn)/sn
=
N.
•
Dependent Variables •
The assumption of independence in the preceding theorems can be relaxed in various ways. Here a central limit theorem will be proved for sequences in which random variables far apart from one another are nearly independent in a sense to be defined. For a sequence Xp X2 , of random variables, let an be a number such that • • •
I P( A n B) - P( A)P ( B) I < an
( 27 . 19)
for A E o-(Xp · · · • Xk ), B E u(Xk +n' xk +n + P · · · ), and k > 1, n > 1. Suppose that an -7 0, the idea being that xk and xk +n are then approximately independent for large n. In this case the sequence {XJ is said to be a-mixing . If the distribution of the random vector (Xn , xn + I • ' xn +) does not depend on n, the sequence is said to be stationary. . • .
Example
Let {Y;J be a Markov chain with finite state space and positive transition probabilities pij, and suppose that Xn = f(Y), where f is some real function on the state space. If the initial probabilities P; are the stationary ones (see Theorem 8.9), then clearly {XJ is stationary. Moreover, n n _ by (8.42), I P� ) Pj l < p , where p < l . By (8. 1 1 ), P[ Y1 = ip . . . , Yk = ik , n · · · · , = = j P; k - l ; k P�k,.JO P1·o1 I • P1·/ - 11·/ , which differs �k + n o , · · Yk + n + l = jl] P; 1 P,· 1 ;2 27.6.
•
• This topic may be omitted.
•
364
CONVERGENCE OF DISTRIBUTIONS
, Yk + n + l = j1 ] by at most from P[ Y1 = i l > . . . , Yk = i k ] P[ Yk + n = j0, P; I P,· 1 ;2 . . . P;k - 1 ,· � p np,.ol· I . . . P,·/ - 1,·/ . It follows by addition that, if s is the number of states, then for sets of the form A = [( YI > . . . , Yk ) E H ] and B = n [(Yk +n• · · · · Yk +n + l ) E H'], (27. 19) holds with an = sp . These sets (for k and n fixed) form fields generating u-fields which contain u( X 1 , , Xk ) and u( Xk + n • Xk +n + 1, ), respectively. For fixed A the set of B satisfying (27. 19) is a monotone class, and similarly if A and B are interchanged. It follows by the monotone class theorem (Theorem 3.4) that {Xn} is a-mixing with n • an = sp . •
•
•
• • •
• • •
The sequence is m-dependent if ( X 1 , , Xk ) and (Xk + n , . . . , Xk+n + l ) are independent whenever n > m. In this case the sequence is a-mixing with a n = 0 for n > m . In this terminology an independent sequence is 0-depen dent. • • •
Example 27. 7. Let Y1, Y2 , and put X = f(Y,., . . . , }'� + m ) �
stationary and m-dependent.
then
(27 .20)
be independent and identically dis tributed, m 1 for a real function f on R + • Then {XJ is • •
•
Suppose that X 1 , X2 , is stationary and a-mixing wich and that E[XJ = O and E[X! 2 ] < oo. If S n = X 1 + · · · + Xn ,
Theorem 27.4.
an = O( n - 5)
•
• • •
1 n - Var [ SJ -7 CT 2 = E [ xn
"'
+ 2 L E[ X I XI + k ] , k= l
where the series converges absolutely. If u > 0, then Sn /uVn N. =
The conditions an = O(n - 5) and E[ X! 2 1 < oo are stronger tha n nece�sary; they are imposed to avoid technical complications in the proof. The idea of the proof, which goes back to Markov, is this: Split the sum X I + . . . + xn into alternate blocks of length bn (the big blocks) and In (the little blocks). Namely, let
where rn is the largest integer i for which (i let
(27.22)
l)(bn + I) + bn < n.
V,.; = X(i - IX b , + l, ) + b , + l + . . . +Xi(b , + l, ) • V,. ,, = X(r, - I X b, + l, ) + b, + l + . . . +Xn .
Furth er,
Then sn = L:;� I U,. I + L:;� I V,. , , and the technique will be to choose the In small enough that L;V,.; is small in comparison with L:; Uni but large enough
SECfiON 27. THE CENTRAL LIMIT THEOREM
365
that the Un i are nearly independent, so that Lyapounov's theorem can be adapted to prove L; Un i asymptotically normal.
If Y is measurable CT(X 1 , , Xk) and bounded by C, and if Z is measurable CT(Xk + n • xk + n + I > ) and bounded by D, then Lemma 2.
• • •
• • •
I E[ YZ ] - £[ Y ] £[ Z ] I < 4CDa n .
(27.23)
PROOF.
It is no restriction to take C = D = 1 and (by the usual approxi mation method) to take Y L;YJA , and Z = [jzJ8 simple ( IY;I, l zjl < 1). If i d1 j = P(A; n B) - P(A)P(B), the left side of (27.23) is IL.; jYi zjd iil. Take {; to be + 1 or - 1 as f.jzjd ;j is positive or not; now take TJj to be + 1 or - 1 as L;{; d;j is positive or not. Then =
" y . z . d . . -< " " L,. L,. z d . . .
L,. 1,1
I
)
I)
I
1
)
I)
< I: [g;d;j . l ]
=
"c L,. <::, l L,.
" z .d . .
.
I
=
1
)
I)
I:. TJj [g;d . ;j = L {;TJjd ij · l
1
l,J
0 0 < < l [B l] be0 the union of the A ; [ B) for which {; = + 1 [ TJj = + 1], Le t A let A ( l l = f! - A < l [BO l = f! - B<0 l]. Then l,)
and
U,L
If Y is measurable CT(X 1 , Xk) and E[Y4] < C, and if Z is measurable CT(Xk + n ' Xk + n + 1 ) and E[Z4] < D, then Lemma 3.
• • •
,
• • • •
I E [ YZ ] - E[ Y]E( Z ] I
(27.24)
PROOF.
< 8( 1 + C + D )a!1 2 .
Y0 = YI11 r' l a]• Y1 = YI11y1 > a]• Z0 = ZI1121 By Lemma 2, I E[ Y0 Z0] - E[ Y0]£[Z0]1 < 4a 2a n . Further, Let
s;
s;
ap
Z 1 = ZI[ IZ I > a]"
I E[ Yo Z J ] - E[ Yo ]E[ Z I ] I < E [l Yo - E[ Yo ] 1 · 1 z l - E[ Z I ] I] < 2a · 2£( 1Zd] < 4aE [ I Z 1 I · I Z 1 /ai 3 J < 4Dja 2 . Similary,
I£[Y1 Z0] - £[ Y1 ]E[Z0]1 < 4Cja 2 . Finally,
l f2 z l /2 2 [ y 2 ] £ 1 !2 [ z n ! 1 £ Y Z < Var Var < Y Z ]E[ ] [ £[ Y ] ] l [ ] l l £[ I I I I I I < £ 1 !2 [ y14ja 2 ] £ 1 ! 2 [ Zifa 2 j < C I I2 D I /2 ja 2 . 2 Adding these inequalities gives 4a a n + 4(C + D)a - 2 + C 1 1 2D 1 12a - 2 as a
CONVERGENCE OF DISTRIBUTIONS
366
= a,� 1 14 and observe that 4 + 4(C + • D) + C I f2Dif2 < 4 + 4(C lf2 + D l/ 2 ) 2 < 4 + 8(C + D) . bound for the left side of (27.24). Take
PROOF OF THEOREM 27. 4. 2 2
a
By
Lemma 3, I£[X 1 X 1 + JI < 8(1 + 2£[X(Da!1 = O(n - 51 ), and so the series in (27.20) converges absolutely. If pk = E[X 1 X 1 + d, then by stationary E[S�] = np0 + 2L:Z: :
where the indices in the sum are constrained by i, j, k > 0 and i + 1 + k < n. By Lemma 3 the summand is a t most
which is at most t
Similarly,
K 1 alf2
is a bound. Hence
E [ s� ] < 4!n 2 E K I m in { a f 12 ' a V2 } i , k
i+k
"'
k=O
Since
a k = O(k- 5), the series here converges, and therefore
( 27.25 ) for some K independent of n. Let bn l n 314 J and In = ln 1 14 j. If rn is the largest integer (i - lXbn + I) + bn < n , then =
(27.26) Consider the random variables
i
such that
In - n l/4 ' (27.21 ) and (27.22).
By
(27.25), (27.26), and
SECfiON 27.
THE CENTRAL LIMIT THEORE M
367
stationarity,
(27.25) and (27.26) also give
Therefore, L:�� 1 VnjaVn => 0, and by Theorem 25.4 it suffices to prove that
L:i� Pn ;/uVn => N. Let u:i' 1 < i < rn , be independent random variables having the distribu tion common to the Un i · By Lemma 2 extended inductively the characteristic functions of L:j� 1Un;/uVn and of [jn 1 U:JuVn differ by at most t 16rna1 • 1 Since an = O(n - 5), this difference is O(n - ) by (27.26). 2 - 1 12 e if The characteristic function of L:i� 1Unjcr{i; will thus approach that of L:!J=. p:JuVn does. It therefore remains only to show that L:j� p:i /uvn = N. Now E[ IU:;I 2 ] = E[Un�] - bnu 2 by (27.20). Further, E[l u:ll < Kb?; by (27.25). Lyapounov's condition (27.16) for 8 = 2 therefore n
follows because
E [ IU:1 I 4] - e [ 1u:� 14] n ( rn E [ I U:I12] t rn b2u4 rn
-<
K
r u4 n
-7 0.
•
Let {Y,.} be the stationary Markov process of Example 27.6. Let f be a function on the state space, put m = [i pJ( i ), and define of Theorem 27.4. If Xn = [CY,.) - m. Then {Xn} satisfies the conditions 2 f3ij = 8 ij pi - P; Pj + 2 pi[� = 1(p}fl - p), then the u in (27.20) is Lijf3ij(f(i) mXf(j) - m), and L:Z = d( Yk ) is app roximately normally distributed with mean nm and standard deviation uVn. If f(i) = 8i oi' then L:Z d( Yk ) is the number of passages through the state i0 in the first n steps of the process. In this case m = Pi0 and a 2 = (k) • PiP - Pi) + 2 P;0L k= lPi0i0 - P;/ Example 2 7. 8.
=
00
If the Xn are stationary and m-dependent and have mean 0, Theorem 27.4 applies and u 2 = E[xn + 2L:Z'= 1 E[X 1 X1 + d · Example 27.7 is a case in point. Taking m = 1 and f(x, y) = x - y in that example gives an • instance where u 2 = 0. Example 27.9.
"�" The 4 in (27.23) has become 16 to allow for splitting into real and imaginary parts.
CONVERGENCE OF DISTRIBUTIONS
368 PROBLEMS
23.2 by means of characteristic functions. Hint: Use (27.5) to compare the characteristic function of L A:"� 1 Znk with exp[ I: k Pn k (e i' - l )].
27.1. Prove Theorem
{Xn} is independent1 and the X" all have the same distribution with finite first moment, then n - sn ----> E[ X 1 ] with probability 1 (T!l.eorem 22. 1), so that n - lsn E[X 1 ]. Prove the latter fact by characteristic functions. Hint: Use
27.2. If
(27.5).
=
27.3. For a Poisson variable YA with mean
Show that (22.3) fails for
27.4. Suppose that
t = l.
A, show that ( YA - A )I lA
I Xnkl < Mn with probability 1 and Mnl n ----> 0
condition and then Lindeberg's condition.
s
=
N as A ----> oo.
Verify Lyapounov's
27.5. Suppose that the random variables in any single row of the triangular array arc
identically distributed. To what do Lindeberg's and Lyapounov's conditions reduce?
27.6. Suppose that
Z 1 , Z2 ,
are independent and identically distributed with mean 0 and variance 1, and suppose that Xnk = unkzk . Write down the Lindeberg condition and show that it holds if max k s. r, u}k = o([/;"� 1 un2k). • • •
27.7. Construct an example where Lindeberg's condition holds but Lyapounov's
does not.
27.8.
27.9.
27.10.
22.9 i Prove a central limit theorem for the number
time
n.
sn
Rn
of records up to
be the number of inversions in a random permutation on letters. Prove a central limit theorem for Sn .
6.3 i Let
n
The 8-method. Suppose that Theorem 27.1 applies to {Xn}, so that vn u- 1 (Xn - c) = N, where Xn = - 1 [f: � 1 Xk . Use Theorem 25.6 as in Exam ple 27.2 to show that, if f(x) has a nonzero derivative at c, then {n (fC'
f(c ))Iul f'(c)l N: Xn is approximately normal with mean c and standard deviation u1 f,i, and f(Xn ) is approximately normal with mean f(c) and standard deviation lf'(c)lul {,;' . Example 27.2 is the case f(x) = 11x. 27.11. Suppose independent Xn have density l x l - 3 outside ( - 1, + 1). S how that (n log n)- 1 12Sn N. =
=
27.12. There can be asymptotic normality even if there are no moments at all.
Construct a simple example.
dn (w) be the dyadic digits of a point w drawn at random from the unit interval. For a k-tuple (u 1 , . . . , u k ) of 0' s and 1's, let Nn(u 1 , . . . , u k ; w) be the number of m < n for which (dm(w), . . . , dm +k l(w)) = (u 1 , . . . , uk). Prove central limit theorem for Nn(u 1 , . . . , uk; w ). (See -Problem 6.12.)
27.13. Let
a
SECfiON 27. THE CENTRAL LIMIT THEOREM 27.14.
369
The ce."!tral limit theorem for a random number of summands. Let X1 , X2 , be independent, identically distributed random variables with mean 0 and vari + Xn. For each positive t, let v, be a random ance a- 2, and let Sn = X1 + variable assuming positive integers as values; it need not be independent of the xn. Su ppose that there exist positive constants a and 8 such that • • •
· · ·
I
as
t
----> co.
Snow by the following steps that
sv(:
( 27.27)
cry v1
--
=
N, a,
(a) Show that it may be assumed that 8 = 1 and the are integers. (b) Show that it suffices to prove �he second relation in (27.27). (c) Show that it suffices to prove I {i; 0.
(S v, - Sa)
(d) Show that
=
P [ I Sv, - Sa,I > E /� ] < P [Iv, - a ,l > £ 3a ,) +
P
[
max 3
l k -a ,l :S E u ,
I Sk - Sa , I > E {i;
l,
and conclude from Kolmogorov's inequality that the last probability is at most 2£u 2 27.15.
21.21 23.10 23.14 i
A
central limit theorem in renewal theory.
Let X I , x , . . . be independent, identically distributed positive random variables 2 with mean m and variance a- 2, and as in Problem 23. 10 let � be the maximum n for which Sn < t. Prove by the following steps that
(a) Show by the results in Problems
21.21 and 23. 10 that (S
(b) Show that it suffices to prove that
N, - sN,m- 1 u t l f2 m - 3 2
N,
- t)I .(t
=
0.
!
(c) Show (Problem
lem 27.14.
23. 10) that
NJt = m- 1 , and apply the theorem in Prob
CONVERGENCE OF DISTRIBUTIONS
370 27.16. Show by partial integration that
(27.28) as 27.1i.
x
----> oo.
i Suppose that X1, X2, are in dependent and identically distributed with mean 0 and variance 1, and suppose that a n ----> oo. Formally combine the central limit theorem and (27.28) to obtain •
•
•
(27.29) where ?, ----> 0 if a n ----> oo. For a case in which this does hold, see Theorem 9.4. 27.18.
Stirling's formula. Let Sn = X1 +
+Xn , where the Xn are indepe n dent and each has the Poisson distribution with parameter 1. Prove succes sively. n k nn +(lf2>e -n Sn - n - - - n n n k (a) E [ ( .fn ) ] e 1 n.1 . k n n
21.2 i
sn - n (b) ( rn
) - = N -.
( I:
k
�
o
In
)
·
·
·
_
Sn n (c) E [ ( in ) ] -) E[ N - ] =
(d)
n! - /2rr n n +( lf2>e - n.
27.19. Let ln(w) be the length of the run of O's starting at the nth place in the dyadic
expansion of a point w drawn at random from the unit interval; see Example
4.1.
(a) Show that / 1, /2 , . . . is an a-mixing sequence, where a11 = 4j2 n.
(b) Show that [f: 1 /k is approximately normally distributed with mean
variance 6n.
�
27.20. Prove under the hypotheses of Theorem
Hint: Use (27.25).
27.21.
26. 1 26.29 i Let X 1 , X2 ,
n and
27.4 that Sn!n ----> 0 with probability 1.
be independent and identically distributed, and suppose that the distribution common to the xn is supported by [0, 2rr] and is not a lattice distribution. Let sn = X I + . . . +XII , where the sum is reduced modulo 27T. Show that Sn U, where U is uniformly distributed over [0, 21T ]. • • •
=
SECfiON 28.
INFINITELY DIVISIBLE DISTRIBUTIONS
371
SECTION 28. INFINITELY DMSIBLE DISTRIBUTIONS*
Suppose that ZA has the Poisson distribution with parameter ,.\ and that Xn P . . . , Xnn are independent and P[ Xn k 1] A/n, P[ Xn k = 0] = 1 - Ajn. According to Example 25.2, xn I + . . . + xnn zA . This contrasts with the central limit theorem, in which the limit law is normal. What is the class of all possible limit laws for independent triangular arrays? A suitably restricted form of this question will be answered here. =
=
=
Vague Convergence
The theory requires two preliminary facts about convergence of measures Let J.Ln and JL be finite measures on (R 1 , � 1 ). If J.L n( a b) --> JL(a. b] for every finite interval for which JL{ a} = JL{b} = 0, then J.Ln converges vaguely to JL, written J.Ln --> , JL. If J.Ln and JL are probability measures, it is not hard to set! that this is equivalent to wedk conv�rgence · t-Ln = JL · On the other hand, if J.Ln is a unit mass at n and JL(R 1 ) = 0, then 1-Ln --> , JL, but J.Ln JL makes no sense, because JL is not a probability measure. The first fact needed is this: Suppose that J.Ln --> JL and ,
-=>
1
(28 . 1 )
then (28.2)
for every conti."!uous real f that 1 vanishes at + oo 1in the sense that lim 1 x1 _,,., f(x) = O. Indeed, choose M so that JL(R ) < M and J.Ln(R ) < M for all n. Given t:, choose a and b so that JL{a} = JL{b} = 0 and lf(x)l < t:jM if x $ A = (a, b]. Then I JAcfdJ.Lnl < €
and I JA . fdJLI < t:. If JL(A) > O, define v(B) = JL(B nA)/JL(A) and vn(B) =J.Ln(B n A )/J.Ln(A). It is easy to see that vn = v, so that ffdvn --> ffdv. But then lfA fdJLn fA fdJLI < " for large n, and hence I JfdJLn - JfdJL I < 3t: for large n. If JL(A) = 0, then fA fdJLn --> 0, and the argument is even simpler. The other fact needed below is this: If (28. 1) holds, then there is a subseq/Jence {JL ' } and a finite measure JL such that 1-Ln . --> JL as k --> Indeed, let Fn(x) = J.Ln( - oo, x ]. Since the Fn are uniformly bounded because of (28.1), the proof of Helly's theorem shows there exists a subsequence {Fn) and a bounded, nondecreasing, right-continuous function F such that lim k Fn.(x) = F(x) at continuity points x of F. If JL is the measure for which JL(a, b ] = F(b) - F(a) (Theorem 12.4), then clearly 1
�
oo.
J.Ln • --> I JL ·
The Possible Limits
Let xn l • . . . ' xnrn• n = 1, 2, . . . be a triangular array as in the preceding ' section. The random variables in each row are independent, the means are 0, *This section may be
omitted.
CONVERG ENCE OF DISTRIBUTIONS
372
and the variances are finite: rn
s� = L un2k ·
(28.3)
k=l
Assume s� > 0 and put Sn = Xn 1 + the total variance is bounded:
(28.4) ln order that the
(28.5)
· · + Xn n. r
·
Here it will be assumed that
sup s� < oo.
n
Xnk
be small compa red with lim max u 2k =
n
k s rn
n
Sn, assume that
0.
,
The arrays in the preceding section were normalized by replacing Xnk by Xnkfs�. This has the effect of replacing sn by 1, in which case of course (28.4) holds, and (28.5) is the same thing as maxk un2kfs� � 0. A distribution function F is infinitely dimsible if for each n there is a distribution function Fn such that F is the n-fold convolution Fn * · · · * Fn (n copies) of Fw The class of pos:;ible limit laws will turn out to consist of the infinitely divisible distributions with mean 0 and finite variance. t It will be possible to exhibit the characteristic functions of these laws in an explicit way. Theorem 28.1.
Suppose that
1 1 i x - 1 - itx ) 2 11- ( dx ) , � ( t) = exp j ( e x Ri where 11- is a finite measure. Then � is the characteristic function of an infinitely divisible distribution with mean 0 and variance JL(R 1 ). ( 28.6)
By (26.4 2 ), the integrand in (28.6) converges to - t 2/2 as x � 0; take this as its value at x = 0. By (26.4 1 ), the integrand is at most t 2/2 in modulus and so is integrable. The formula (28.6) is the canonical representation of � . and 11- is the
canonical measure.
Before proceeding to the proof, consider three examples.
If 11- consists of a mass of u 2 at the origin, (28.6) is e - u 2 12 1 2 , the characteristic function of a centered normal distribution F. It is certainly infinitely divisible-take Fn normal with variance u 2jn. • Example 28.1.
tThere do exist infinitely divisible distributions without moments (see Problems 28.3 and 28. 4 ) but they do not figure in the theory of this section
,
SECfiON 28. INFINITELY DIVISIBLE DISTRIBUTIONS
373
Suppose that IL consists of a mass of ,.\ x 2 at x * 0. Then (28.6) is exp ,.\( e i1x - 1 - itx ); but this is the characteristic function of x(ZA ,.\), where Z,� has the Poisson distribution with mean ,.\ . Thus (28.6) is the characteristic function of a distribution function F, and F is infinitely • divisible-take Fn to be the distribution function of x(ZA; n - Ajn). Example 28.2.
IL
Example 28.3. If cpit) is given by (28.6) with ILi for the measure, and if = L.J= 1 1L 1 , then (28.6) is cp 1(t) . . . cp k ( t). It follows by the preceding two
examples that (28.6) is a characteristic function if IL consists of finitely many point masses. It is easy to check in the preceding two examples that the distribution corresponding to cp( t) has mean 0 and variance IL(R 1 ), and since the means and variances add, the same must be true in the present example . •
Let IL k have mass IL{j2-\ {j + 1)2 - k ] at j2 - k for j = 0, + 1, . . . , + 2 2 . Then ILJ. � IL · As observed in Example 28.3, if cp k ( t) is (28.6) with IL k in place of IL, then 'P k is a characteristic function. For each t the integrand in (28.6) vanishes at + oo; since supk 1L k(R 1 ) < oo, 'P/t) � cp(t) follows (see (28.2)). By Corollary 2 to Theorem 26.3, cp( t) is itself a characteristic function. Further, the distribution corresponding to cpk ( t ) has second moment 1L k (R 1 ), and since this is bounded, it follows (Theorem 25.1 1) that the distribution corresponding to cp( t) has a finite second moment. Differentiation (use Theorem 16.8) shows that the mean is cp'(O) = 0 and the variance is - cp"(O) = IL(R 1 ). Thus (28.6) is always the 1 characteristic function of a distribution with mean 0 and variance IL( R ). If 1/Jn(t ) is (28.6) with JL jn in place of IL, then cp t ) = rfJ;(t ), so that the distribution corresponding to cp( t) is indeed infinitely divisible. •
PROOF OF THEOREM k28.1.
1
(
The representation (28.6) shows that the normal and Pois�on distributions are special cases in a very large class of infinitely divisible laws.
Every infinitely divisible distribution with mean 0 and finite variance is the Limit Law of Sn for some independent triangular array satisfying • (28.3), (28.4), and (28.5). Theorem 28.2.
The proof requires this preliminary result:
If X and Y are independent and X + Y has a second moment, then X and Y have second moments as well. Lemma.
PROOF.
Since X 2 + Y 2 < (X + Y) 2 + 2I XY I, it suffices to prove I XY I integrable, and by Fubini's theorem applied to the joint distribution of X and Y it suffices to prove lXI and I YI individually integrable. Since IYI < lxl + lx + Yl, E[IYI] = oo would imply E[lx + Yl] oo for each x; by Fubini's
=
CONVERGENCE OF DISTRIBUTIONS
374
theorem again E[ I Y I] = oo would therefore imply E[IX + Yl] impossible. Hence E[I Y I] < oo, and similarly E[ I X I] < oo.
= oo,
which is •
PROOF OF THEOREM 28.2. Let F be infinitely divisible with mean 0 and variance u 2 • If F is the n-fold convolution of Fn, then by the lemma (extended inductively) Fn has finite mean and variance, and these must be 0 and u 2 jn. Take rn = n and take Xn 1 , • • • , Xnn independent, each with distri bution function Fn. •
If F is the Limit law of Sn for an independent triangular array satisfying (28.3), (28.4), and (28.5), then F has characteristic function of the form (28.6) for some finite measure 11- · Theorem 28.3.
PROOF. The proof will yield information making it possible to identify the limit. Let 'Pn k (t) be the characteristic function of Xnk · The first step is to prove that rn
rn
k=l
k=l
O 'Pn k ( t ) - exp I: ('Pr.k ( t) - 1 ) � o
(28.7)
for each t. Since l z l < 1 implies that l e z - I I = eRez - I < 1, it follows by (27.5) that the difference 8n(t) in (28.7) satisfies l8n(t)l < L�"= 1 lcpn k (t) exp(cpn k(t) - 1) 1. Fix t. If 'Pn k (t) - 1 = ()n k • then IOn k l < t 2un2d2, and it follows by (28.4) and (28.5) that max k l()n k l � 0 and Ek l()n k l = 0{1). Therefore, for sufficiently large n, l8n(t)l < Ek l1 + ()n k - e8•k l < e 2 Ek 1 0nk 1 2 < e 2 max k l()n k l · Ek l()n k l by (27.15). Hence (28.7). If Fn k is the distribution function of Xn k • then rn
L ('Pn k ( t ) - 1 ) =
k=l
rn
J L k=l
R1
( e ii x - 1) dFn k ( x )
Let 11-n be the finite measure satisfying (28.8) and put (28.9)
11-n( - oo, x ] =
r"
L= I j
k
y s; x
y 2 dFn k ( y) ,
SECTION 28.
375
INFINITELY Dl VISIBLE DISTRIBUTIONS
Then (28.7) can be written
r"
n 'Pn k ( t ) - 'Pn( t) � 0.
( 28.10)
k= l
By (28.8), Jl n( R 1 ) = s�, and this is bounded by assumption. Thus (28.1) holds, and some subsequence {JLn "} converges vaguely to a finite measure 11- · Since the integrand in (28.9) vanishes at + oo, 'Pn"(t) converges to (28.6). But, . of course, limn cpn(t) must coincide with the characteristic function of the limit law F, which exists by hypothesis. Thus F must have characteristic function of the form (28.6). • Theorems 28. 1, 28.2, and 28.3 together show that the possible. limit laws are exactly the infinitely divisible distributions with mean 0 and finite vari ance, and they give explicitly the form the characteristic functions of such laws must have. Characterizing the Limit
Suppose that F has characteristic function (28.6) and that an independent triangular array satisfies (28.3), (28.4), and (28.5). Then sn has limit Law F if and only if 11-n �" JL, where 11-n is defined by (28.8). Theorem 28.4.
PROOF. Since (28.7) holds as before, s n has limit law
cpn ( J.L n
F if and only if
t ) (defined by (28.9)) converges for each t to cp(t) (defined by (28.6)). If � ,. JL, then cpn(t) � cp(t) follows because the integrand in (28.9) and (28.6)
vanishes at + oo and because (28. 1) follows from (28.4). Now suppose that
II
=
•
•
The normal case. According to the theorem, Sn = N if and only if 11-n converges vaguely to a unit mass at 0. If s� 1 , this holds if and only if L�"= 1 f1x1 " x 2 dFn k (x) � 0, which is exactly Lindeberg's condition. • Example 28.4.
=
:
The Poisson case. Let Zn 1 , , Znr" be an independent triangular array, and suppose xn k = zn k - m nk satisfies the conditions of the Example 28.5.
•
•
•
376
CONVERGENCE OF DISTRIBUTIONS
theorem, where m n k = E[Zn k ] . If ZA has the Poisson distribution with parameter A, then L:k Xn k = ZA - ,.\ if and only if 11-n converges vaguely to a mass of ,.\ at 1 (see Example 28.2). If s� � ,.\, the requirement is 11-n[ 1 - E, 1 + E ] � ,.\ , or ( 28 .1 1 ) for positive E . If s� and L:k m n k both converge to ,.\, (28. 1 1 ) is a necessary and sufficient condition for L:k Znk = ZA . The conditions are easily checked under the hypotheses of Theorem 23.2: Zn k assumes the values 1 and 0 with probabilities Pn k and 1 - Pn k • Lk Pn k � A , and max k Pn k � 0. •
PROBLEMS
n ----> 1 J.L
implies JL(R 1 ) lim infn JL n( R 1 ). Thus in vague conver gence mass can "escape to infinity" but mass cannot "enter from infir.ity."
28.1. Show that JL
<
1-Ln ----> 1 JL if and only if (28.2) holds for every continuous f with bounded support. (b) Show that if JLn ----> 1 JL but (28.1) does not hold, then there is a continuous f vanishing at + oo for which (28.2) does not hold. are independent, the Yn have a common 28.3. 23.7 i Suppose that N, Y1 , Y2 , distribution function F, and N has the Poisson distribution with mean a. Then S = Y1 + + YN has the compound Poisson distribution. 28.2. (a) Show that
•
·
·
•
•
·
(a) Show that the distribution of
S is infinitely divisible. Note that S may not
have a mean. n n n (b) The distribution function of S is [�=0e-aa F *(x)jn!, where F * is the tz-fold convolution of F (a unit jump at 0 for n = 0). The characteristic funrtion of S is exp af �'",(e11x - l) dF(x). (c) Show that, if F has mean 0 and finite variance, then the canonical measure JL in (28.6) is spedfied by JL(A) =afA x 2 dF(x). 28.4. (a) Let v be a finite measure, and define
(28.12 )
itx ) 1 +x 2 v ( dx ) 1 +x 2 x 2
]
,
where the integrand is - t 2;2 at the origin. Show that this is the characteristic function of an infinitely divisible distribution. (b) Show that the Cauchy distribution (see the table on p. 348) is the case where y = 0 and v has density 7T- 1 (1 +x 2 )- 1 with respect to Lebesgue measure. 28.5. Show that the Cauchy, exponential, and gamma (see
infinitely divisible.
(20.47)) distributions are
SECTION 28.
INFINITELY DIVISIBLE DISTRIBUTIONS
377
28.6. Find the canonical representation (28.6) of the exponential distribution with
mean 1: (a) The characteristic function is J0e i1 Xe-x d.x = (1 - it ) - 1 = �p(t ). (b) Show that (use the principal branch of the logarithm or else operate formally for the moment) d(log �p(t ))1dt = i�p(t) iJ0ei'xe -x d.x. Integrate with respect to t to obtain =
1
--:--.--.
(28. 13 )
1
- It
=
eXp r ( eiiX - 1 ) - d.x }0
X
e -x
00
Verify (28 13) after the fact by showing that the ratio of the two sides has derivative 0. (c) Multiply (28. 13) by e- it to center the exponential distribution at its mP-an: The canonical measure JL has density xe-x over (0, oo). If
X and Y are independent and each has the exponential density e - x, then X - Y has the double exponential density 1e -1x1 (see the table on p. 348).
28.7. i
Show that its characteristic function is 1
---=-
=
1 + t2
exp
100
.
1
(e11x - 1 - itx ) - l x le - lx l d.x. x2 - oo
are independent and each has the double exponential Suppose X1 , X , 2 density. Show that [�= 1 Xnfn converges with probability 1. Show that the distribution of the sum is infinitely divisible and that its canonical measure has density l x l e - 1x1j ( l - e -lxl) = [� = l x l e - 1nx1.
28.8. i
•
•
•
1
Show that for the gamma density e -xx u - I jf(u ) the canonical mea sure has density uxe -x over (0, oo).
28.9. 26.8 i
The remaining problems require the notion of a stable law. A distribution function F is stable if for each ll there exist constants an and b11, an > 0, such that, if XI , . . , xn are independent and have distribution fu nction F, then a;; 1 (XI + Xn) + bn also has distribution function F. + ·
·
·
a, a', b, b' there exist a", b" (here a, a', a" are all positive) F(ax + b) * F(a'x + b') = F(a"x + b"). Show that F is stable.
28.10. Suppose that for all
such that
28.11. Show that a stable law is infinitely divisible. 28.12. Show that the Poisson law, although infinitely divisible, is not stable. 28. 13. Show that the normal and Cauchy laws are stable. 28. 14. 28.10 j
of
a'', b"
Suppose that F has mean 0 and variance 1 and that the dependence on a, a', b, b' is such that
(X) (X)
F - *F - =F ul O"z
( X ) ul + u}
V
Show that F is the standard normal distribution.
.
CONVERGENCE OF DISTRIBUTIONS
378
28.15. (a) Let Yn k be independent random variables having the Poisson distribution with mean cnajl k l 1 a, where c > O and O < a < 2. Let Zn = n - 1 'L'k� - n2kYn k (omit k = 0 in the sum), and show that if c is properly chosen then the characteristic function of zn converges to e _ ,,,a, (b) Show for 0 < a < 2 that e - llla is the characteristic fu nction of a symmetric stable distribution; it is called the symmetric stable law of exponent a. The case a = 2 is the normal law, and a = 1 is the Cauchy law.
SECTION 29. LIMIT THEOREMS IN
Rk
If Fn and F are distribution functions on R\ then Fn converges weakly to F, written Fn = F, if limn Fn( x ) = F(x) for all continuity points of F. The corresponding distributions 11- n and 11- are in this case also said to converge weakly: 11- n = 11-· If ¥n and X are k-dimensional random vectors (possibly on different probability spaces), Xn converges in distribution to X, written Xn = X, if the corresponding distribution functions converge weakly. The definitions are thus exactly as for the line. x
The Basic Theorems
The closure A - of a set in R k is the set of limits of sequences in A; the interior is Ao = R k - ( R k - A ) - ; and the boundary is aA = A - - A0• A Borel set A is a wcontinuity set if JL(aA ) = 0. The first theorem is the k-dimen sional version of Theorem 25.8.
measures 11- n and J.L on ( R\ IN k ), each of the following conditions is equivalent to the weak convergence of 11- n to 11-: Theo rem 29.1.
(i) (ii) (iii) (iv)
For probability
lim n ffdll n = ffdJL for bounded continuous f; lim supn JL/C) < JL{C) for closed C; lim infn Jln(G) > JL{G) for open G; limn Jln(A) = JL(A) for wcontinuity sets A.
It will first be shown that {i) through (iv) are all equivalent. (i) implies {ii): Consider the distance dist(x, C) = inf[ l x - yl: y E C] from x to C. It is continuous in x. Let PRooF.
cpj( t ) =
1 1
0
- jt
if t < 0, if o < t < r I , if r I < t.
SECTION 29.
LIMIT THEOREMS IN
Rk
379
Then f/x) = cp/dist(x, C)) is continuous and bounded by 1, and f/xH Ic(x) as j i oo because C is closed. If (i) holds, then lim supn J.Ln(C) < limn ffi dJ.Ln = ffi dJ.L. As j i 00, ffi dJ.L ! fie dJ.L = J.L(C). (ii) is equivalent to (iii). Take C = R k - G. (ii) and {iii) imply (iv): From {ii) and (iii) follows
J.L( Ao ) < lim infJ.Ln n ( Ao ) < lim inf n J.Ln( A ) n
1l
Clearly (iv) follows from this. (iv) implies (i): Suppose that f is continuous and lf(x )I is bounded by K. Given E , choose reals a0 < a 1 < · · · < a1 so that a0 < - K < K < a1, a; a; _1 < E, and J.L[x: f(x) = a;] = O. The last condition can be achieved be cause the sets [x: f(x) = a] are disjoint for different a. Put A; = [x: a;_ 1
=
=
=
Suppose that h: R k � R i is measurable and that the set Dh of its discontinuities is measurable.t If 11-n = J.L in R k and J.L(Dh) = 0, then J.L nh - 1 = 11-h - I in R i. Theorem 29.2.
tThe argument in the foot note on
p.
334 shows that in fact Dh
E
gp k always holds.
380
CONVERGENCE OF DISTRIB UTIONS
C be a closed set in R i. The closure (h - 1 c) - in R k satisfies (h - 1 C) - c Dh u h - 1 C. If 11-n = J.L, then part (ii) of Theorem 29. 1 gives PRooF.
Let
(
lim SUPJ.Lnh - 1 (C) < lim SUP J.Ln ( h - 1 C
n
n
f) < J.L ( ( h - 1 C f )
Using (ii) again gives J.Lnh - 1 = 11-h - I if J.L(Dh) = 0.
•
Theorem 29.2 is the k-dimensional version of the mapping theorem-The orem 25.7. The two proofs just given provide in the case k = 1 a second approach to the theory of Section 25, which there was based on Skorohod's theorem (Theorem 25.6). Skorohod's theorem does extend to Rk, but the proof is harder.t Theorems 29.1 and 29.2 can of course be stated in terms of random vectors. For example, Xn = X if and only if P[X E G] < lim infn P[Xn E G] for all open sets G. A sequence {J.L n } of probability measures on (R k , � k ) is tight if for every E there is a bounded rectangle A such that J.L n(A) > 1 E for all n. -
If {J.L) is a tight sequence of probability measures, there is a subsequence {J.LJ and a probability measure J.L such that 11- n. = J.L as i � oo. Theorem 29.3. I
I
Sx = [y: Yi < xi , j < k ] and Fn (x) = J.Ln(S). The proof of Helly's theorem (Theorem 25.9) carries over: For points x and y in R\ interpret x < y as meaning x u < Yu , u = 1, . . . , k, and x < y as meaning x u < Yu , u = 1 , . . , k. Consider rational points r- points whose coordinates are all rational-and by the diagonal method [A14] choose a sequence {n;} along which lim; Fn (r) = G( r) exists for each such r. As before, define F(x) = inf[ G(r ): x < r ]. Although F is clearly nondecreasing in each variable, a further argument is required to prove D.AF > 0 (see (12. 12)). Given E and a rectangle A = (a 1 , b 1 ] X · · · X (a k , bk], choose a o such that if z = (8, . . . , 8), then for each of the 2 k vertices x of A, x < r <x + z implies IF(x) - G(r )I < E j 2 k . Now choose rational points r and s such that a < r < a + z and b < s < b + z. If B = (r 1 , s 1 ] X · · · X (rk , sk ], then ID.AF D. 8 G I < E . Since D. 8 G = lim; D. 8 Fn .> 0 and E is arbitrary, it follows that !:J. A F > 0. PROOF. Take
.
I
I
With the present interpretation of the symbols, the proof of Theorem 25.9 shows that F is continuous from above and lim; Fn (x) F(x) for continuity points x of F. I
=
t The approach of this section carries over to general metric spaces; for this theory and its applications, see BILLINGSLEY1 and BILLINGSLEY 2• Since S korohod's theorem is no easier in R k than in the general metric space, it is not treated here.
SECTION
29. LIMIT THEOREMS IN R k
381
12.5, there is a measure J.L on ( R\ � k ) such that J.L(A ) = !lA F for rectangles A. By tightness, there is for given € a t such that J.L J y: - t < yi < t, j < k ] > 1 - E for all n. Suppose that all coordinates of x exceed t: If r > x, then Fn (r) > 1 - E and hence (r rational) G(r) > 1 - E, so that F(x) > 1 - E. Suppose, on the other hand, that some coordinate of x is less than - t: Choose a rational r such that x < r and some coordinate of r is less than - t; then Fn(r) < E, hence G(r) < E, and so F(x) < E. Therefore, for every E there is a t such that By Theorem
F( x ) \f
( 29 . 1 )
> 1-€
<€
if xi > t for all j, if xi < - t for some j.
If B5 = [ y: - s < yi < xi, j < k ] , then J.L(S) = I im 5 J.L(B5 ) = 1im 5 1l8l. Of the 2 k te 1 ms in the sum !l8F, all but F(x) go to 0 (s -7 oo) because of the ' second part of (29. 1). Thus J.L(S) = F(x ).t Because of the other part of (29. 1), J.L is a probability measure. Therefore, Fn = F and 11-n = J.L . • Obviously Theorem 29.3 implies that tightness is a sufficient condition that each subsequence of {J.Ln} contain a further subsequence converging weakly to some probability measure. (An easy modification of the proof of Theorem 25. 10 shows that tightness is necessary for this as well.) And clearly the corollary to Theorem 25.10 now goes through:
If {J.Ln} is a tight sequence of probability measures, and if each subsequence that converges weakly at all converges weakly to the probability measure J.L , then J.Ln = J.L · Corollary.
Characteristic Functions
Consider a random vector X = ( X1 , , Xk ) and its distribution J.L in R k . Let t · x = L� 1 t u x., denote inner product. The characteristic function of X and of J.L is defined over R k by •
•
•
=
(29 .2) To a great extent its properties parallel those of the one-dimensional charac teristic function and can be deduced by parallel arguments. "t This requires proof because there exist (Problem 12.10) functions F' other than F for which J.L( A ) AAF' holds for all rectangles A. =
382
CONVERGENCE OF DISTR IBUTIONS
(26.16)
The inversion formula takes this form: For a bounded rectangle A = [x: a u <x u < bu , u < k ] such that J.L(aA ) 0, =
II II QII - e- 1/ 1 bII e 1 .t 1-L ( A ) = lim cp( t ) dt, j k 0 u l T -+ "' (27T) BT u = I k
(29.3)
.
where B T [t E R k : l tul < T, u < k ] and it, replace cp{t) by the middle term in The integral in is
(26.17):
=
.
dt is short for dt 1
(29.3)
The inner integral may be evaluated by Fubini's theorem in IT =
f un= I [ sgn( x�Rk
-
dtk . To prove
(29.2) and reverse the integrals as in •
a
•
•
R k , which gives
u) S T · Ixu - aul) (
sgn ( x; - bu)
J
S ( T · Ix u - bui) J.L( dx ) .
(26.1 8)), (29.3) 29.1
Since the integrand converges to n� = 1 1/laII' bII(xJ (see follows as in the case k The proof that weak convergence implies (iii) in Theorem shows that k for probability measures 1-L and 11 on R there exists a dense set D of reals such that J.L(aA) = ll(aA ) 0 for all rectangles A whose vertices have coordi nates in D. If J.L{A) 11{A) for such rectangles, then 1-L and 11 are identical by Theorem Thus the characteristic function cp uniquely determines the probability mea sure 1-L Further properties of the characteristic function can be derived from the one-dimensional case by means of the following device of Cramer and Wold. For t E R\ define h 1: R k � R 1 by h 1 (x) = t · x. For real a, [ x : t · x < a] is a half space, and its wmeasure is =
1.
=
3.3.
=
(29.4)
By change of variable, the characteristic function of J.L h ; 1 is
( 29 .5)
f e isyJ.Lh ; l (dy ) f e is(l x)J.L( dx ) =
Rl
Rk
= cp( st 1 ,
• • •
, stk ),
(29.4))
to know each J.L h 1- 1 To know the J.L- measure of every half space is (by for s to know cp( t ) for every t; and to know the and hence is (by
(29.5)
=
1)
SECTION
29. LIMIT THEOREMS IN R k
383
J.L is to know J.L. Thus J.L is uniquely determined by the values it gives to the half spaces. This result, very simple in its statement, seems to require Fourier methods-no elementary proof is known. If J.Ln = J.L for probability measures on R k , then cpn(t) -7 cp(t) for the corresponding characteristic functions by Theorem 29. 1. But suppose that the characteristic functions converge pointwise. It follows by (29.5) that for each t the characteristic function of J.L n h; 1 converges pointwise on the line to the characteristic function of 11-h 1 ; by the continuity theorem for characteristic functions on the line then, J.L n h :- 1 = J.Lh ,- 1• Take the u th component of t to be 1 and the others 0; then the J.Ln h,- 1 are the marginals for the uth coordinate. Since {J.Lnh,- 1 } is weakly convergent, there is a bounded inte rval (a u , bJ such that J.L n[x E R k : a u < xu < bJ = JL n h,- 1 (a u , bJ > 1 - E jk for all n . But then J.Ln(A) > 1 - € for the bounded rectangle A = [x: a u < x u < bu , u = 1, . . ; k]. The sequence {J.L n} is therefore. tight. If a subsequence {;.t n } converges weakly to then 'Pn(t) converges to the characteristic function of v , which is therefore cp(t ). By uniqueness, v J.L, so that 11-n = J.L. By the corollary to Theorem 29.3, 11-n = J.L. This proves the continuity theorem for k-dimensional characteristic functions: J.Ln = J.L if and only if cpn( t ) -7 cp(t) for all t. characteristic function cp of
1
.
v'
I
=
•
The Cramer-Wold idea leads also to the following result, by means of which certain limit theorems can be reduced in a routine way to the one-dimensional case.
For random vectors Xn = (Xn 1 , , Xn k ) and Y = (YI , . . . ' Yk ), a necessary and sufficient condition for xn = y is that L� = l tu xnu , tk ) in R k . � L �= 1 tu Yu for each (t Theorem 29.4.
•
P
.
.
•
•
•
PRooF. The necessity follows from a consideration of the continuous
mapping h, above-use Theorem 29.2. As for sufficiency, the condition implies by the continuity theorem for one-dimensional characteristic func tions that for each (t 1 , , tk ) •
•
•
for all real s. Taking s = 1 shows that the characteristic function of Xn converges pointwise to that of Y. • Normal Distributions in
Rk
By Theorem 20.4 there is (on some probability space) a random vector X = (XP . . . , Xk ) with independent components each having the standard normal distribution. Since each Xu has density e-x 2 12 jVz1r , X has density
CONVERGENCE OF DISTRIBUTIONS
384
(see (20.25)) (29.6) where lxl 2 = r.�= x� denotes Euclidean norm. This distribution plays the 1 normal distribution in R k . Its characteristic function is role of the standard (29.7) Let A = [a u, ] be a k X k matrix, and put Y =AX. where X is viewed as a column vector. Since £[ X X13 ] = 80 13 , the rr..atrix l, = [uuJ of the covariances of Y has entries uu, = E [Y)'; ] = L. ! a u a . Thus 2: =AA', where the prime denotes transpose. The matrix I 1is symmetric and nonnegative defin ite: Lu r uUl x.,x, IA'x l 2 > 0. View t also as a column vector with transpose t', and note that t · x = t' x. The characteristic function of AX is thus a
=
a
, a
=
(29. 8)
£{ e ir'( A X l ] = E [ e i( A'r )' X ] = e- I A'q2 /2 = e - r'lr ;2 .
Define a centered normal distribution as any probability measure whose characteristic function has this form for some symmetric nonnegative definite I. If I is symmetric and nonnegative definite, then for an appropriate orthogonal matrix U, U'IU = D is a diagonal matrix whose diagonal ele ments are the eigenvalues of I and hence are nonnegative. If D0 is the diagonal matrix whose elements are the square roots of those of D, and if A = UD0, then I = AA'. Thus for every nonnegative definite I there exists a centered normal distribution (namely the distribution of AX) with covari ance matrix I and characteristic function exp( - 1t'It). If I is nonsingular, so is the A just constructed. Since X has density (29.6), Y = AX has, by the Jacobian transformation formula (20.20), density f(A - 1x)ldet A - 1 1. From I = AA' follows ldet A - 1 I = (det i) - 11 2 • Moreover, I - 1 = (A') - 1A - 1, so that IA - 1xl 2 = x'I - 1x. Thus the normal distribution has density (27T) k l2(det i) - 1 12 exp( - {x'I - 1x) if I is nonsingular. If I is singular, the A constructed above must be singular as well, so that AX is confined to some hyperplane of dimension k - 1 and the distribution can have no density. By (29.8) and the uniqueness theorem for characteristic functions in R k , a
centered normal distribution is completely determined by its covariance matrix.
Suppose the off-diagonal elements of I are 0, and let A be the diagonal matrix with the u;)/ 2 along the diagonal. Then I = AA', and if X has the standard normal distribution, the components X; are independent and hence so are the components u;)/ 2X; of AX. Therefore, the components of a
SECTION
29. LIMIT THEOREMS IN R k
385
normally distributed random vector are independent if and only if they are uncorrelated. If M is a j X k matrix and Y has in R k the centered normal distribution with covariance matrix l, then MY has in R i the characteristic function j exp( - HM'tYl(M't)) = exp( - tt'(MlM')t) (t E R ). Hence MY has the centered normal distribution in R i with covariance matrix MlM'. Thus a Linear transformation of a normal distribution is itself normal.
These normal distributions are special in that all the first moments vanish. The general normal distribution is a translation of one of these centered distributions. It is completely determined by its means and covariances. The Central
Limit
Theorem
Let Xn (Xn 1 , , Xnk) be independent random vectcrs all having the same distribution. Suppose that £[ x;"J < oo; let the vector of means be c = (c 1 , . . . , ck), where C11 = E[ XnJ, and let the covariance matrix be l = [u11 , ], = E[(Xn ll - ciiX Xnl - c, )1. P ut sn = X I + . . . +Xn " where =-
•
•
•
(Til l
Under these assumptions, the distribution of the random vector (Sn - nc)j Vn converges weakly to the centered normal distribution with covariance matrix l. Theorem 29.5.
Let Y = ( YP . . . , Yk ) be a normally distributed random vector with 0 means and covariance matrix l. For given t = (t P t k ), let Zn = L� = 1 tJXn11 - C11) and Z = L�= 1 t11Y11 By Theorem 29.4, it suffices to prove that n - 1 12 L:j= 1 Zi = Z (for arbitrary t ). But this is an instant consequence of • the Lindeberg-Uvy theorem (Theorem 27. 1). PROOF.
•
•
•
,
•
PROBLEMS
k
29.1. A
real function f on R is everywhere upper semicontinuous (see Problem 13.8) if for each x and E there is a B such that I x - y l < B implies that f( y ) < f( x) E; f is lower semicontinuous if f is upper semicontinuous. (a) Use condition (iii) of Theorem 29.1, Fatou's lemma, and (2 1 .9) to show that, if JLn JL and f is bounded and lower semicontinuous, then -
.J..
=
lim inf ffdJLn > ffdJL.
( 29.9)
n
Show that, if (29.9) holds for all bounded, lower semicontinuous functions f, then JLn JL · (c) Prove the analogous results for upper semicontinuous functions. (b)
=
CONVERGENCE OF DISTRIBUTIONS
386
X
X
Show f o r probabi l i t y measures on the l i n e that JLn vn = JL v if and only if JLnSuppose = JL and vn = v. that Xn and Yn are independent and that X and Y are indepen dent. Show that, i f Xn = X and Yn = Y, then ( Xn, Yn ) = (X, Y) and hence that Xn + Yn = X + Y. Show that part (b) f a i l s wi t hout i n dependence. If Fn = F and Gn = G, then Fn * Gn = F * G. Prove thi s by part (b) and also by characteristic functions. Show that {JLn} is tight if and only if fo r each there is a compact set suchShow that that {JLn} is tightforifalandl n. only if each of the k sequences of marginal distributions is tight on the line. Assume of (Xm Yn) that X, = X and = c. Show that (X.7 , Yn) = (X, This iins dependent. an example of Problem 29.2(b) where and Yr. need not be assumed Prove analogues for R k of the corollaries to Theo!"em 26.3. Suppose that f(X) and g(Y) are uncorrelated for all bounded continuous and Show that X and Y are independent. Hint: Use characteristic fu nc tions. 20.normal 16 diSuppose that the random vector X has a centered k-dimensional as has 1 s tri b uti o n whose covari a nce matri x an ei g enval u e of mul t i 2 plchiic-isquared ty r anddi0striasbanutioeingwienvalth ruedegrees of multofiplfreedom. icity k- r. Show that IXI has the Multinomial sampling. Let p , p k be positive and add to 1, and let be i n dependent k-di m ensi o nal random vect o rs such that Zn has wi(fnthl , .probabi l i t y P; a 1 i n the ith component and O' s elsewhere. Then n = n from a . . , fnk) = i s t h e frequency count f o r a sampl e of si z e popul a ti o n wi t h cel l probabi l i t i e s mul t i n omi a l P;· Put xni liP; ) ,;np; and Show Xn = ( Xn J> · · · , Xnk). that X" has mean val u es 0 and covaria nces u;j = (f>;j Pj P;Pj)/ fP;Pj npY jnp; has asymptotically Show that the chi squared stati s ti c the chi-squared distribution with k- 1 degrees of freedom. 20.uni2f6ormlyAdistheorem of Poincare. Suppose that Xn = (X . . . , Xnn) is R" . Fix t, tri b uted sphere of radi u s {;:'i n over the surf a ce of a X , , X111 are in the l i m i t independent, each with the and show that normal standard di s tri b uti o n. Hint: If the components of Yn ( Yn , , Ynn) each then Xn has the aresameindidependent, wi t h the standard normal di s tri b uti o n, as s tri b uti o n .fn Yn!1Yn1· Suppose that t h e di s tri b uti o n of Xn = ( Xn , , Xnn) is spherically sym sense t h e that metri c i n Xn!I Xnl is uniformly distributed over the unit sphere. that Assume I Xn1 /n = 1, and show that xn l • . . . ' Xnr are asymptotically independent and normal.
29.2. (a) (b)
(c)
(d)
29.3. (a)
JLn(K) > 1 - E
(b)
29.4.
29.5. 29.6.
29.7.
E
Y,,
K
c).
Xn
f
g.
i
29.8. i
1,
Z 1 , Z2, • • •
• • •
f
I:;:, = 1 Zm
I
= (f.,; -
(a)
·
(b)
29.9.
I:7= 1(/m -
(a )
i
11
(b)
1
1,
• • •
=
1
2
• • •
1
• • •
SECTION 29. 29.10.
LIMIT THEOREMS IN R k
387
Xn = (Xn1, . - - , Xnk ), = an = (Xn , Xn Xn
be randomthatvectors the sati s fyi n g mi x i n g Letcondition (27.19) with n O(n1, 2,-...5 ). ,Suppose stati o nar y i s the sequence 0, and (tthathe dithestributiu oaren ofuniforml...y, bounded. +) is the same for al l n), that Show that i f then + + .fn has i n the l i m i t the centered normal di s tri b uti o n with covaria nces
Sn !
E[ Xnu1 = Sn = X 1 · · · Xn ,
00
00
E [ xluxl, ] + L E[ xlu xl +j,l ] + L E[ xl +j. u xl, ] =l l j=
j
Hint:
29.11.
Use the Cramer-Wold device. l e t e state be a Markov chai n wi t h fi n i t e 27. 6 , space As{1, in Exampl say. Suppose t h e t r ansi t i o r: probabi l i t i e s Pu, are all po!>itive and thefor whiinitciahl 1probabi l i t i e s Pu are the statio nary ones. Let fn u be the number of i Show that the normalized frequency count i n and j
S=
{Yn}
. . . , s},
<
}j
<
=
u.
has in the limit the centered normal distribution with covariances 00
00
Bu, - PuP, + L ( P�{> - PuP, ) + L ( P!� - P, Pu ) · j= l
29.12.
Assume that imensi s posiotinalve definormalnite,densi invertty itisexplicitly, and show that the corresponding two-di (29 . 10) where D normal di s tri b uti o n that has the standard Suppose i n R 1• Let be the mixturehavewithdiequal wei g hts of the di s tri b uti o ns of and and l e t s tri b uti o n Prove: they AlAltthough each of and normal , are not joi n tl y normal . hough and are uncorrelated, they are not independent. =
29.13.
j= l
a- 1 1 a-22 - a-f2· Z
( X, Y ) (a)
(b)
X
X Y
JL ·
(Z, Z)
Y
is
JL
(Z, - Z ),
388
CONVERGENCE OF DISTRIBUTIONS
SECTION 30. THE METHOD OF MOMENTS • The Moment Problem
For some distributions the characteristic function is intractable but moments can nonetheless be calculated. In these cases it is sometimes possible to prove weak convergence of the distributions by establishing that the moments converge. This approach requires conditions under which a distribution is uniquely determined by its moments, and this is for the same reason that the continuity theorem for characteristic functions requires for its proof the uniqu�ness theorem.
-
Let J.L be a probability measure on the Line having finite moments a k r X kJ.L ( dx ) of all order3. If the power series [ k a k r kjk! has a positive radius of convergence , then J.L is the only probability measure with the moments a a 2 , k PROOF. Let {3 k f':. ""lxi J.L(dt-) be the absolute moments. The first step i� Theorem 30.1. =
P
.
oo
•
to show that
•
•
=
k -7 oo ,
(30.1 )
for some positive r. By hypothesis there exists an s, 0 < s < 1, such that a k s kjk! -7 0. Choose O < r < s; then 2kr 2 k - 1 < s 2k for large k. Since
lxl 2 k - l < 1 + l x l 2 \
for large k. Hence (30. 1) holds as k goes to infinity through odd values; since {3k = ak for k even, (30.1) follows. By (26.4),
e IIX o
( e - k': _,__( ihxk,.!c'--.., ) k ) o lhX o
n
'\"
and therefore the characteristic function
cp
<
of
n+ 1 lhx l ( rt + 1 ) ! '
J.L satisfies <
* This section may be omitted.
n lhl + 1{3n + I (n + 1 ) ! ·
SECTION 30. TH E METHOD OF MOMENTS
By (26. 10), the integral here is
cp< "l(t). By (30. 1),
cp( k ) ( t ) h", cp( t + h ) = I: k ! k =O 00
(30.2)
389
lhl < r.
If v is another probability measure with moments function 1/J(t ), the same argument gives
ak
and characteristic
r/J ( k) ( t ) 1/J(t + h) = I: k! h", k =O 00
( 30.3 )
t = 0; since cp < "l(O) i "ak = r/J< kl(O) (see (26'.9)), cp and rfJ agree in ( - r, r) and hence have identical derivatives there. Taking t = r - E and t = - r + E in (30.2) and (30.3) shows that cp and 1/J also agree in ( - 2 r + E, 2r - E) and hence in ( - 2 r, 2 r ). But then they must by the same argument agree i n ( - 3r, 3r) as well, and so on.t Thus cp and rfJ coincide, and by the Take
=
uniqueness theorem for characteristic functions, so do J.L and
be
v.
•
A probability measure satisfying the conclusion of the theorem is said to
determined by its moments.
Example 30.1. For the standard normal distribution, theorem implies that it is determined by its moments.
Ia " ! < k!, and so the •
But not all measures are determined by their moments: Example 30.2. If N has the standard normal density, then log-normal density
f( x ) = Put
0
g(x) = f(x)(l + sin(2 7T log x)). If
i x"f( x ) sin( 2 7T log x ) dx = 0, 00
0
if
eN has the
X < 0.
k = O, l , 2, . . . ,
then g, which is nonnegative, will be a probability density and will have the same moments as f. But a change of variable log x = s + k reduces the t
This process is a version of analytic continuation.
390
CONVERGENCE OF DISTR IBUTIONS
integral above to
which vanishes because the integrand is odd.
•
Suppose that the distdbution· of X is determined by its moments, that the Xn have moments of all orders, and that limn E[X�] E[ X'] for r = 1, 2, . . . . Then Xn =X. Theorem 30.2.
=
Let 11-n and J.L be the distributions of xn and X. Since £[ Xn2 ] converges, it is bounded, say by K. By Markov's inequality, P[IXn l ] Kjx 2 , which implies that the sequence {J.L n} is tight. Suppose that 11- nk v, and let Y be a random variable with distribution v . If u is an even integer exceeding r, the convergence and hence boundedness of £[ x:J implies that £[ X�J £[ yr], by the corollary to Theorem 25.12. By the hypothesis, then, E[ Y'] E[X']- that is, v and J.1 have the same moments. Since J.L is by hypothesis determined by its moments, v must be the same as J.L , and so 11-n k = J.L · The conclusion now follows by the corollary to Theorem 25 .10. •
>x <
PROOF.
4
-7
=
Convergence to the log-normal distribution cannot be proved by establish ing convergence of moments (take X to have density f and the Xn to have density g in Example 30.2). Because of Example 30. 1, however, this approach will work for a normal limit. Moment Generating Functions
Suppose that J.L has a moment generating function M(s) for s E [ -30, s0 ], s0 > 0. By (2 1.22), the hypothesis of Theorem 30. 1 is satisfied, and so J.L is determined by its moments, which are in turn determined by M(s) via (21.23). Thus J.L is determined by M(s) if it exists in a neighborhood of o.t The version of this for one-sided transforms was proved in Section 22-see Theorem 22.2. Suppose that J.Ln and J.L have moment generating functions in a common interval [ -s0 , s0 ], s0 > 0, and suppose that Mn( s ) -7 M(s) in this interval. Since JLn[( - a, aY] e - sua(Mn( -s0) + Mn(s0)), it follows easily that {�-L n} is tight. Since M(s) dete rmines J.L, the usual argument now gives 11-n = JL ·
<
t For another proof, see Problem 26.7. The prese nt proof does not require the idea of analyticity.
SECTION 30. THE METHOD OF MOMENTS
391
Central Limit Theorem by Moments
To understand the application of the method of moments, consider once again a sum Sn = Xn 1 + · · · +Xn k , where Xn l ' . . . , Xn k are independent and n
n
(30.4) Suppose further that for each n there is an Mn such that k 1, . . . , k n , with probability 1. Finally, suppose that =
IXnk l < Mn,
(30.5) All moments exist, andt
S'.. = Lr [ ' r ' · r!· · ru ! u1l [ " X'' · · · X'" ni ., • 1 • • • u�I
( 30.6)
-
nr l
L:' extends over the u-tuples (rw . . , ru ) of positive integers satisfying r1 + · · + ru = r and L:' extends over the u-tuples (i w . . , i J of distinct integers in the range 1 < ia < k n. where
·
By independence, then,
( 30.7) where
An( r 1 , • • • , ru) =
( 30.8)
1 " [ 5 , E [ X:JJ · · · E [ x:l'..] , n
and L:' and L:' have the same ranges as before. To prove that (30.7) converges to the rth moment of the standard normal distribution, it suffices to show that if r 1 = · · · otherwise
(3 0.9)
=
r
u
=
2
'
Indeed, if r is even, all terms in (30.7) will then go to 0 except the one for which u = r/2 and ra 2, which will go to r!j(r 1 ! · · · r)u!) = 1 X 3 X 5 X · · · X (r 1). And if r is odd, the terms will go to 0 without exception. =
-
tTo deduce this from the multinomial formula, restrict the inner sum to u-tuples satisfying 1 .:s: i < · · · < i u s; k and compensate by striking out the 1 /u! .
1
n
392
CONVERGENCE OF DISTRIBUTIONS
ra = 1 for some a, then (30.9) holds because by (30.4) each summand in (30.8) vanishes. Suppose that ra > 2 for each a and ra > 2 for some a. Then r > 2u, and since IE[X:rJI < M� r,. - Zlun� ' it follows that An(rw . . , r) < (MnfsnY - 2 uAn(2, . . . , 2). But this goes to 0 because (30.5) holds and because A n(2, . . . , 2) is bounded by 1 (it increases to 1 if the sum in (30.8) is enlarged to include all the u-tuples (i w . . , i)). It remains only to check (30.9) for r 1 = · · · = ru = 2. As just noted, An(2, . . . ,2) is at most 1, and it differs from 1 by L.sn- 2uun: , the sum extending over the (i., . . . , i) with at least one repeated index. S ince un� < M;, the u terms for example with i u = i u -l sum to at most M;s;; 2 LCTn� · · · un� < Mn2S11- 2 Thus 1 - A11 (2, . . . , 2) < u 2Mn2sn- 2 � 0. This proves that the moments (30 7) converge to those of the normal distribution and hence that Sn/sn = N If
I
•
u-1
Application to Sampling Theory
Suppose that
n
number'>
not necessarily distinct, are associated with the elements of a population of size n. Suppose that these numbers are normalized by the requirement
( 30.10)
n L Xn :, = 0 ,
h
=I
h
n L x �h = 1 , =I
Mn = maxn l x nh l . h :s;
An ordered sample Xn�> . . . , Xn k is taken, where the sampling is without " replacement. By (30. 10), E[Xnk ] = 0 and E[x;k J = 1jn. Let s� = k n /n be the fraction of the population sampled. If the Xn k were independent, which they are not, Sn = Xn 1 -t · · · +Xn k would have variance s� . If k n is small in comparison with n, the effects of dependence should be small. It will be shown that Snfsn = N if n
( 30 . 1 1 )
k , --" 2 sn = 0' n
Mn _ s n � o'
Since M; > n - I by (30.10), the second condition here in fact implies the third. The moments again have the form (30.7), but this time E[X:J , · · · x:;) cannot be factored as in (30.8). On the other hand, this expected value is by symmetry the same for each of the (kn)u = kn(kn - 1) · · · (kn - u + 1) choices of the indices i a in the sum L:'. Thus
A n( r p . . . , ru ) - ( ksnr L E [ Xn l . . . X'"J nu . n ''
The problem again is to prove (30.9).
SECTION
30.
THE METHOD OF MOMENTS
393
The proof goes by induction on u. Now An(r) kns;'n - 1 L:Z = 1 x� h • so that A n(l) = O and An(2) = 1. If r > 3, then lx�h i < M�-2x�h ' and so I An(r)l < (Mn/snY - 2 � 0 by (30. 1 1). Next suppose as induction hypothesis that (30.9) holds with u 1 in place of u. Since the sampling is without replacement, E[ X�j · : · X�d = L:x��� · · · x�'J,j(n) u , where the summation extends over the u-tuples (h I ' . . . , h) of distinct integers in the range 1 < ho: < n. In this last sum enlarge the range by requiring of (hp h 2 , , h) only that h 2 , , h u be distinct, and then com pensate by subtracting away the terms where h 1 = h 2 , where h 1 = h3, and so on. The result is =
-
•
•
•
•
•
•
n( n) u _ 1 [ X ' � ] E [ X '2 · X '" ] n 2 · · nu ( n) u E n l � ( n ) u - 1 E [ X'n 2 . . . Xn' 1 +ra . . X'" ] '-' ( n ) u o: 2 =2
_
T' U
.
0:
This takes the place of the factorization made possible in assumed independence there. It gives
u A n ( r2 , L o: =2
• • •
, r 1 + ro: ,
•
(30.8) by the
• . •
, r, ) .
By the induction hypothesis the last sum is bounded, and the factor in front goes to 0 by (30.1 1). As for the fi rst term on the right, the factor in front goes to 1. If r 1 =1= 2, then An(r1 ) � o and An(r2 , , r) is bounded, and so A n(r I ' . . . , r) � 0. The same holds by symmetry if ro: =I= 2 for some a other than 1. If r 1 · · · ru 2, then An(r 1 ) 1, and A n(r 2 , , r) � 1 by the induction hypothesis. Thus (30.9) holds in all cases, and Snfs n = N follows by the method of moments. • • •
=
=
=
=
• • •
Application to Number Theory
g(m) be the number of distinct prime factors of the integer m; for example g{34 x 5 2 ) = 2. Since there are infinitely many primes, g(m) is unbounded above; for the same reason, it drops back to 1 for infinitely many m (for the primes and their powers). Since g fluctuates in an irregular way, it Let
is natural to inquire into its average behavior. On the space f! of positive integers, let Pn be the probability measure that places mass 1/n at each of 1, 2, . . . , n, so that among the first n positive
394
CONVERGENCE OF DISTRI BUTIONS
integers the proportion that are contained in a given set A is just Pn( A). The problem is to study Pn[m: g(m) < x] for large n. If 8/m) is 1 or 0 according as the prime p divides m or not, then
( 30.12) Probability theory can be used to investigate this sum because under Pn the 8/m) behave somewhat like independent random variables. If Pp . . . , Pu are distinct primes, then by the fundamental theorem of arithmetic, 8P1(m) = · · · = 8P (m) = 1-that is, each P; divides m-if and only if the product p 1 · · · pu divides m. The probability under Pn of this is just n - I times the number of m in the range 1 < m < n that are multiples of p1 • Pu, and this number is the integer part of njp 1 • • Pu · Thus II
•
•
•
( 30.13) for distinct p1 • Now let XP be independent random variables (on some probability space, one variable for each prime p ) satisfying
1 P [ Xp = 1 ] = p' If P I > . . . , Pu are distinct, then (30 . 14)
P[ XPi = 1 ' i = 1 ' . . . ' U ] = P • 1• • Pu I
For fixed p" . . . , Pu, (30. 13) converges to (30. 14) as n � oo. Thus the behavior of the XP can serve as a guide to that of the 8/m). If m < n, (30. 12) is L.P ,s n8/m), because no prime exceeding m can divide it. The ideat is to compare this sum with the corresponding sum L.P ,s n Xp. This will require from number theory the elementary estimate*
(30.15)
1 p <X p
I: - = log log x + 0( 1 ) .
The mean and variance of Lp ,snXp are Lp ,s nP_, and Lp ,s nP- 1 { 1 -p- 1); since L.pP- 2 converges, each of these two sums is asymptotically log log n . teompare Problems 2.18, 5.19, and 6.16. t see, for example, Problem 18.17, or HARov & WRIGHT, Chapter XXII.
SECTION 30. THE METHOD OF MOMENTS
395
Lp s.n8/m) with L.p < n Xp then leads one to conjecture the Erdos -Kac centra/ limit theorem for the prime divisor function: Comparing
Theorem 30.3.
(30.16)
For all x ,
[Pn m : g( m)
- log log n vflog log n
<x -
]
�
1 =- x --=
- u 2; 2 du . e h'Tr J - oo
PROOF. The argument uses the method of moments. The first step is to
show that (30.16) is unaffected if the range of p in (30.12) is further restricted. Let {a n } be a sequence going to infinity slowly enough that log an log � 0
( 30.17 )
n
but fast enough that ( 30.18)
"\" i...J
1
<>n < p < n p
-
j2 = o( log log n) l .
Because of (30.15), these two requirements are met if, for example, log a n {log n)jlog log n. Now define
=
( 30.19) For a function
f of positive integers, iet En[f] = n - 1
n
L f( m)
m=l
denote its expected value computed with respect to
By
Pn. By (30.13) for u = 1 ,
(30. 18) and Markov's inequality,
Therefore (Theorem 25.4) , (30.16) is unaffected if
g(m).
gn(m) is substituted for
CONVERGENCE OF DISTRIBUTIONS
396
Now compare (30.19) with the corresponding sum mean and variance of sn are
Sn
=
<
'£P o:nXP.
and each is log log n + o{log log n) 1 1 2 by (30. 18). Thus (see Example (30.16) with g(m) replaced as above is equivalent to
1
( 30.20)
�
VL. 'TT
fx e
-
- oo
u
The
25.8),
2 f 2 du .
It therefore suffices to prove (30.20). Since the XP are bou nded, the analysis of rhe moments (30.7) applies here. The only difference is that the summands in Sn are indexed not by the integers k in the range k < kn but by the primes p in the range p < a n; also, 1 XP must be replaced by XP - p- to center It. Thus the rth moment of (Sn - cn)/sn converges to that of the normal distribution, and so (30.20) and (30.16) will follow by the method of moments if it is shown that as n � oo,
( 30.21) for each r. Now E[S�] is the sum r
1 r .I p: ) L. E[ Xr ...... " L. ' '. . . . ,.u ' u ' " u=l
(30.22)
,.
l
•
"
•
. . . xrp;: ) '
where the range of r: is as in (30.6) and (30.7), and I:" extends over the u-tuples ( p 1 , p) of distinct primes not exceeding a n . Since XP assumes only the values 0 and 1, from the independence of the XP and the fact that the p1 are distinct, it follows that the summand in (30.22) is • • •
,
E [ XP I
(30.23)
Xp, )
· · ·
=
1 . PI . Pu •
•
By the definition (30.19), En [g�] is just (30.22) with the summand replaced 8Pr ]. Since 8P(m) assumes only the values 0 and 1, from (30.13) by En [ 8Pr 1I and the fact that the P; are distinct, it follows that this summand is •
( 30.24)
•
•
u
,
En [ 8P I
p., ]
· · · 8
=
1l n n p , . . . Pu J .
-
SECTION
30.
397
THE METHOD OF MOMENTS
But (30.23) and (30.24) differ by at most 1jn, and hence E[S�] and En[g�] differ by at most the sum (30.22) with the summand replaced by l jn. Therefore,
J E[ S� ] - En [ g�] J <
( 30.25)
�(
L p � an
l
r < :� .
Now
and En[(gn - cnYl has the analogous expansion. Comparing the two expan sions term for term and applying (30.25) shows that
( 30.26)
Since
en < an, and since a�jn � 0 by (30. 17), (30.21) follows as required.
•
The method of proof requires passing from (30.12) to (30.19). Without this, the an on the right in (30.26) would instead be n, and it would not follow that the difference on the left goes to 0; hence the truncation (30. 19) for an an small enough to satisfy (30.17). On the other hand, an must be large enough to satisfy (30.18), in order that the truncation leave (30. 16) unaffected. PROBLEMS 30.1. 30.2.
assumpt i o n (30.5) get the ful From the cent r al l i m i t t h eorem under t h e Lindeberg theorem by a truncation argument. For a sampl e of si z e kn with replacement from a population of size the 1 (1 Under the assumpt i o n probabi l i t y of no dupli c at e s i s kn/ Vn -+ 0 in addition to (30. 10), de uce the asymptotic normality of Sn by a reduction to the independent case. Byinadapti n g the proof of (21.24), show that the moment generating function of an arbitrary interval determines 25. 13 30.3 Suppose that the moment generating function of converges to that of in some interval. Show that 1-Ln jjn).
n��o d
30.3. 30.4.
n,
JL .
JL
JL
i
=
JL ·
JL1 1
398
30.5.
CONVERGENCE OF DISTRIBUTIONS JL
fRk l x;i'JL(dx ) < oo
k
whi c h f o r Let1, beanda probabi l i t y measure on 1, 2, . . . . Consider the cross moments R
r=
... ,k
for
1 =
fo r nonnegat i v e i n t e gers Suppose fo r each i that r;.
(a)
( 30.27 ) JL
s of convergence hasdetearmiposinedtivbye radiits umoments as a power seri e s i n e. Show that i n t h e sense that, i f a probabil i t y measure satiwithsfies then coincides fo r all Show that a by k-di m ensi o nal normal di s tri b uti o n i s det e rmi n ed i t s mo ments. Let JLn and for Suppose that on each be probabi l i t y measures (30.27) has a positive radius of convergence. Suppose that JL ·
(b)
30.6.
a(r 1 ,
•
•
.
T
, rk ) = fx �·
· x;/v(dx)
•
JL
30.8.
•
•
•
•
rk ,
v
v
1,
R k.
JL n = JL .
for all nonnegative integers , Show that 30.5 Suppose that X and Y are bounded random vari a bles and that X and are uncorrel a t e d f o r 1, 2, ... . Show that X and Yare i n depen dent. 26.17 30.6 In the notation (26.32), show for 0 that r 1,
30. 7.
r1,
IS
•
•
•
rk.
m
i
yn
m,
i
n=
A
(a)
-=fo
of of
fmoment o r evens thatandcosthat thehas mean i s 0 fo r odd It fo l ows by the method sense a di s tri b uti o n i n the of (25. 18), and in fact course the relative measure is 1 (30.29) p[ cos ] 1 - -arccos of Suppose that are l i n earl y i n dependent over the fi e l d rat i o nal s then in the0.sense that , 0 fo r i n tegers i f Show that r
Ax
x:
(b)
nm =
r.
A x :::;: u =
A 1 , A2, n1A1 + •
.
.
· ·
· +nmAm =
(30.30)
fo r nonnegative integers
r1 ,
•
•
•
1T
, 'k·
u,
- 1
nv,
n1 =
-
SECTION 30. THE METHOD OF MOMENTS
399
on X f u nct i o n the Let X1 , di s tri b uti o n , be i n dependent and have the 2 right in (30.29). Show that (30.31) p [ x : _t cos Ajx
• • •
···
J= I
a
30.9. 30.10.
For a si g nal l a rge number of pure cosi n e si g nal s wi t h that i s the sum of itnhcommensurabl e frequenci e s, (30.32) describes the relative arr.ount of ti me e signal is between u1 and u2• 6. 1 6 i From (30.16), deduce once more the Hardy-Ramanujan theorem (see (6. 10)). i Prove rhat (if Pn puts probabil ity 1/n at 1, . . . , n) m j l o l l o o g n gl o g l 0. > l i m Pn [ m: (30.33) n log log n From (30.16) deduce that (see (2.35) fo r the notation) [ 1 lx - u2 12 ( m - loglogm l d D m: <x u. (30.34) Jlog log m i Let G(m) be the number of prime factors in m with mul t i p l icity counted. In thShow e notatiforonk>of 1Probl e m 5.19, G(m) [P a/m). that Pn[m: aP(m) - 8/m) > k] < 1/p k + l; hence En[ aP <Show 2fp 2 • that En[ G - is bounded. Deduce fro m (30. 16) that [Pn m·· G( mJi)ogl-looggnlog n -<x l 1 lx - u2 f2 du. (d) Prove fo r G the analogue of (30.34). i Prove the Hardy-Ramanujan theorem in the fo rm D [ m·· log log m - 1 > ] 0. Prove this with G in place of (a)
E
=
(b)
g
30.11.
)
=
e
J2Tr - oo
=
(a) op] (b) (c)
g]
--+
30.12.
g( ) m
g.
� v L. 'TT'
E
=
e
- 00
CHAPTER 6
Derivatives and Conditional Prob ability
SECTION 31. DERIVATIVES ON THE LINE*
This section on Lebesgue ' s theory of derivatives for real functions of a real variable serves to introduce the general theory of Radon-Nikodym deriva tives, which underlies the modern theory of conditional probability. The results here are interesting in themselves and will be referred to later for purposes of illustration and comparison, but they will not be required in subsequent proofs. The Fundamental Theorem of Calculus
To what extent are the operations of integration and differentiation inverse to one another? A function F is by definition an indefinite integral of another function f on [a, b] if X
F( x ) - F( a ) = f f( t ) dt
(3 1 . 1 )
a
a < x < b; F is by definition a primitive of f if it has derivative f: F'( x ) =f( x ) ( 31 .2) for a < x < b. According to the fundamental theorem of calculus (see (17.5)), these concepts coincide in the case of continuous f: for
Theorem 31.1.
Suppose that f is continuous on [a, b ].
An indefinite integral off is a primitive off: if (3 1.1) holds for all x in [a, b ], then so does (31.2). (ii) A primitive off is an indefinite integral off: if (31.2) holds for all x in [a, b], then so does (31. 1). (i)
* This section may be omitted.
400
SECTION
31.
401
DERIVATIVES ON THE LINE
A basic problem is to investigate the extent to which this theorem holds if
f is not assumed continuous. First consider part (i). Suppose f is integrable, so that the right side of (31 .1) makes sense. If f is 0 for x < m and 1 for x m (a < m < b), then an F satisfying (31.1) has no derivative at m . It is thus too much to ask that (31.2) hold for all x. On the other hand, according to a famous theorem of Lebesgue, if (31.1) holds for all x, then (31.2) holds almost everywhere-that is, except for x in a set of Lebesgue measure 0. In this section almost everywhere will refer to Lebesgue measure only. This result, the most one- could hope for, will be proved below (Theorem 31.3 ) . Now consider part {ii) of Theorem 31.1. Suppose that (31.2) holds almost everywhere, as in Lebesgue' s theorem, just stated. Does (31.1) follow? The answer is no: If f is identically 0, and if F(x ) is 0 for x < m and 1 for x m (a < m < b), then (3 1.2) holds almost everywhere, but (31. 1) fails for x m. The question was wrongly posed, and the trouble is not far to seek: If f is integrable and (31.1) holds, then
>
> >
( 31 .3 )
F( x + h ) - F( x )
=
Jbl( x. x +h lt)f( t ) dt � O a
as h � 0 by the dominated convergence theorem. Together with a similar argument for h i 0 this shows that F must be continuous. Hence the question becomes this: If F is continuous and f is integrable, and if (31.2) holds almost everywhere, does (31.1) follow? The answer, strangely enough, is still no: In Example 31.1 there is constructed a continuous, strictly increasing F for which F '(x) = 0 except on a set of Lebe,<,gue measure 0, and (31 . 1) is of course impossible if f vanishes almost everywhere and F is strictly increas ing. This leads to the problem of characterizing those F for which (3 J . 1) does follow if (3 1.2) holds outside a set of Lebesgue measure 0 and f is integrable. In other words, which functions are the integrals of their (almost everywhere) derivatives? Theorem 31.8 gives the characterization. a
Theore It i s possi b l e 31.1 i n different direction. Suppose t o ext e nd part (i i ) of m that (31.2) hol d s fo r every not just almost everywhere. In Example 17.4 there was giandveninathifunct i o n F, everywhere differentiable, whose derivative i s not integrable, s case the ri g ht si d e of (31.1) has no meaning. If, however, (31.2) holds fo r and i f every i s i n tegrabl e , then (31.1) does hol d fo r al l For most purposes of probabi l i t y theor y , i t i s natural to i m pose condi t i o ns onl y al m ost everywhere, and so this theorem wil not be proved here.t x,
f
x,
f
x.
The program then is first to show that (31.1) for integrable f implies that (31.2) holds almost everywhere, and second to characterize those F for which the reverse implication is valid. For the most part, f will be nonnegative and F will be nondecreasing. This is the case of greatest interest for probability theory; F can be regarded as a distribution function and f as a density. t For
a
proof,
see
Ruo1N2 , p 1 79
402
DERIVATIVES AND CONDITIONAL PROBABILITY
In Chapters 4 and 5 many distribution functions F were either shown to have a density f with respect to Lebesgue measure or were assumed to have one, but such F's were never intrinsically characterized, as they will be in this section. Derivatives of Integrals
The first step is to show that a nondecreasing function has a derivative almost everywhere. This requires two preliminary results. Let ,.\ denote lebesgue measure. l.
Let A be a bounded linear Borel set, and let J be a collection of open intervals covering A . Then J contains a finite, disjoint subcollection 11 , • • • , lk for which [�= 1 ,.\{l) > A(A)/6. Lem ma
By regularity (Theorem 12.3) A contains a compact subset K satisfying A(K) > A(A)/2. Choose in J a finite subcollection J0 covering K. Let I1 be an interval in J0 of maximal length ; discard from J0 the interval /1 and all the others that intersect /1• Among the intervals remaining in J0, let /2 be one of maximal length; discard I2 and all int�rvals that intersect it. Continue this way until J0 is exhausted. The I; are disjoint. Let J,. be the interval with the same midpoint as I,. and three times the length. If I is an interval in J0 that is cast out because it meets I,., then I cJ,.. Thus each discarded interval is contained in one of the J,., and so the J,. cover K. • Hence [ ,.\( !) = [,.\{J,.)j3 > ,.\( K )/3 > A(A)j6. PROOF.
If
(31 .4) is a partition of an inte 1val
(31.5) Lemma 2.
( 3 1 .6) and if
[a, b] and F is a function over [a, b ], let k
II FI Ll = L I F( a ; ) - F ( a ,. _ 1 ) 1. i= I
Consider a partition (31.4) and a nonnegative (). If F( a) < F( b ) ,
( 31 .7) for a set of intervals [a,._ 1 , aJ of total length d, then IIFII Il > I F ( b ) - F( a ) I + 2()d.
SECTION
31.
403
D ERIVATIVES ON THE LINE
This also holds if the inequalities in (3 1.6) and (31.7) are reversed and replaced by 0 in the latter.
- 0 is
k = 2 and the left-hand interval satisfies (31.7). Here F falls at least Od over [a, a + d], rises the same amount over [a + d, u] , and then rises F(b) - F(a) ove r [u, b ]. PROOF. The figure shows the case where
b
u
a
For the general case, let L:' denote summation over those i satisfying (31.7 ) and let L:" denote summation over the remaining i (1 ). Then
IIFIIIl = [' ( F( a; _ 1 ) - F ( a J ) + [" J F( a;) - F( a ;_ 1 ) J > [' ( F( a;_ 1 ) - F (a;) ) + I L:" ( F( a;) - F( a; _ 1 ) ) 1 = [' ( F(a;_ 1 ) - F (a ;) ) + ! ( F(b) - F( a )) + [' ( F( a;_ 1 ) - F(a;)) l
·
As all the differences in this last exp ression are nonnegative, the absolute value bars can be suppressed; therefore,
IIFIIIl > F( b) - F( a) + 2 [' ( F( a;_ 1 ) - F ( a ;) ) • > F(·b) - F(a) + 2 0 [' (a ; - a ;_ 1 ) . A function F has at each x four derivates, the upper and lower right
derivatives
. sup F( x + h ) - F( x ) , D F ( x ) = hm h h J. O
DF ( x ) = I 1m mf
. . F( x + h) - F( x ) , h tO
h
and the upper and lower left derivatives
F( X ) - F( X - h ) ( , D x ) = hm sup h
F
•
h tO
F h F(x) x ( . ) . 1 ( ) _ f 1 1 0 m D X F h h tO
·
404
DERIVATIVES AND CONDITIONAL PROBABILITY
There is a derivative at x if and only if these four quantities have a common value. Suppose that F has finite derivative F'(x) at x. If u < x < v, then
F( v ) - F ( u ) ' ( x ) < v - x F( v ) - F( x ) - F ' ( x ) F v-x v-u v-u x - u F( x) - F ( u ) _ ' + F ( x) x-u v-u Therefore,
- F( u ) v-u
F( v )
(31.8)
�
F'( x )
as u j x and v � x; that is to say, for each E there is a o such that u < x < v and 0 < v - u < o together imply that the quantities on either side of the arrow differ by less than E . Suppos� that F is measurable and that it is continuous except possibly at countably many points. Thi� will be true if F is nondecrcasing or is the difference of two nondecreasing functions. Let M be a countable, dense set containing all the discontinuity points of F; let rn(x) be the smallest number of the form kjn exceeding x. Then F( y) - F( x) . Y -x --+oo x
D F ( x ) = nlim
sup
'
the function inside the limit is measurable because the x-set where it exceeds
a
IS
U [ x : x < y < rn( x ) , F ( y ) - F( x ) > a ( y - x ) J .
y EM
Thus DF(x) is measurable, as are the other three derivates. This does not exclude infinite values. The set where the four derivates have a common finite value F' is therefore a Borel set. In the following theorem, set F' = 0 (say) outside this set; F' is then a Borel function.
A nondecreasing function F is differentiable almost evety where, the derivative F' is nonnegative, and Theorem 31.2.
( 31 .9 ) for all a and b.
fbF'( t ) dt < F( b ) - F( a) a
This and the following theorems can also be formulated for functions over an interval.
SECTION
31.
PROOF.
405
DERIVATIVES ON THE LINE
If it can be shown that
except on a set of Lebesgue measure 0, then cby the same result applied to G(x) = -F(-x) it will follow that FD(x) = D ( -x) < c D( -x) = D/x) al most everywhere. This will imply that DF(x) < D F(x) < D(x) < FD(x) < F Dp(x) almost everywhere, since the first and third of these inequalities are obvious, and so, outside a set of Lebesgue measure 0, F will have a derivative, possibly infinite. Since F is nondecreasing, F' must be nonnega tive, and once (31.9) is proved, it will follow that F' is finite almost every where. If (31 .10) is violated for a particular x, then for some pair a, {3 of rationals satisfying a < {3, x will lie in the set A a13 = [ x: F D(x) < a < {3 < D F(x)]. Since there arc only countably many of these sets, (31.10) will hold out�idc a set of Lebesgue measure 0 if A ( A a 13 ) = 0 for all a and {3. Put G(x) F(x) - �(a + {3)x and () H f3 - a). Since differentiation is linear, A a13 = B8 = [x: D(x) < - () < () < Dc(x)]. Since F and G have only c countably many discontinuities, it suffices to prove that ,.\(C11) = 0, where C11 is the set of points in B11 that are continuity points of G. Consider an interval (a, b), and suppose for the moment that G(a ) < G(b ). For each x in C11 satisfying a <x < b, from cD(x} < - () it follows that there exists an open interval (ax, bx ) for which x E (ax , bx ) c (a, b) and =
=
(31 .11 ) There exists by Lemma 1 a finite, disjoint collection (ax·' bx ) of these intervals of total length L.(bx. - ax) > A((a, b) n C11)/6. Let d be the partition (31.4) of [ a b] with the points ax and bx in the role of the a 1 , , ak 1 By Lemma 2, I
1
,
I
I
I
I
•
•
•
_
•
{ 31 . 1 2 )
IIGIIIl > I G( b ) - G ( a) I + �(),.\ ( ( a , b) n C11 ) . If instead of G( a ) < G( b ) the reverse inequality holds, choose ax and bx so that the ratio in (31.11) exceeds (), which is possible because Dc(x) > () for X E ell. Again (31. 12) follows. In each interval [a, b] there is thus a partition (31.4) satisfying (31.12). Apply this to each interval [ a _ 1 , a;] in the partition. This gives a partition � 1 that refines d , and adding the corresponding inequalities (31.12) leads to ;
Continuing leads to a sequence of successively finer partitions dn such that
( 31 .13)
406
DERIVATIVES AND CONDITIONAL PROBABILITY
Now IIGIIIl is bounded by IF(b) - F(a)i + ila + f3i( b - a) because F is mono tonic. Thus (31. 13) is impossible unless A(( a, b) n C11) = 0. Since (a, b) can be any interval, ,.\(C11) = 0. This proves (3 1. 10) and establishes the differentiabil ity of F almost everywhere. It remains to prove (31.9). Let
F (x + n - 1 ) - F( x ) J n( x ) n- 1 "
( 3 1 . 14)
_
·
Now fn is nonnegative, and by what has been shown, fn(x) -7 F'(x) except on a set of Lebesgue measure 0. By Fatou\ lemma and the fact that F is nondecreasing,
J bF' ( X ) dx < lim infn Jbfn( X ) dx J, + n - F ( x ) dx - n J + n - F ( x ) dx l = lim inf [ n b j n b < lim inf [ F( b + n - 1 ) - F( a)] = F( b + ) - F( a ) . n a
a
a
1
1
a
Replacing
b by b - E and letting E -7 0 gives (3 1.9).
•
Iff is nonnegative and integrable, and if F(x) = r_,J(t) dt, then F'(x) = f(x) except on a set of Lebesgue measure 0. Theorem 31.3.
Since f is nonnegative, F is nondecreasing and hence by Theorem 31 .2 is differentiable almost everywhere. The problem is to show that the derivative F' coincides with f almost everywhere. PROOF FOR BouNDED f
Suppose first that f is bounded by M. Define f, by (3 1.14). Then fn(x) = nf: +" - f(t ) dt is bounded by M and conve rges almost everywhere to F'(x ), so that the bounded convergence theorem gives
jbF'(x) dx = limn jbfn(x) dx J, + n - F ( x ) dx - n J + n - F( x ) dx ] . = lim [ n b n b a
a
1
a
1
a
Since F is continuous (see (3 1.3)), this last limit is F(b) - F(a) = J:f(x) dx. Thus fA F'(x) dx = fA f(x) dx for bounded intervals A = (a, b ]. Since these form a 1r-system, it follows (Theorem 16.10(iii)) that F' f almost every where. • =
S ECTION
31.
DERIVATIVES ON THE LINE
PROOF FOR INTEGRABLE
407
f Apply the result for bounded functions tO f
<
truncated at n: If hn(x) is f(x) or n as f(x) n or f(x) > n, then Hn(x) = fx_oohn(t ) dt differentiates almost everywhere to hn(x) by the case already treated. Now F(x) = Hn(x) + jx_"'( f(t) - h n(t )) dt ; the integral here is nonde creasing because the integrand is nonnegative, and it follows by Theorem 31.2 that it has almost everywhere a nonnegative derivative. Since differentiation is linear, F'(x) H�(x) = h n(x) almost everywhere. As n was arbitrary, F'(x) f(x) almost everywhere, and so f bF'( x) dx f bf(x) dx F(b ) - F(a). But the reverse inequality is a consequence of (31.9). Therefore, J:(F'(x) f(x )) dx = 0, and as before F' = f except on a set of Lebesgue measure 0. • '
>
>
a
>
a
=
Singular Functions
If f(x) is nonnegative and integrable, differentiating its indefinite integral jx_"'f(t) dt leads back to f(x) except perhaps on a set of Lebesgue measure 0. That is the content of Theorem 31.3. The converse question is this: If F(x ) is nondecreasing and hence has almost everywhere a derivative F'(x ), does integrating F'(x) lead back to F(x )? As stated before, the answer turns out to be no even if F( x) is assumed continuous: Let X1 , X2 , . . . be independent, identically distributed random variables such that P[ Xn = 0] = Po and P[ Xn = 1j = p 1 = 1 - p0, and n let X = r.: =I Xn2 - . Let F(x) = P[ X <x] be the distribution function of X. For an arbitrary sequence u I ' u 2 , . . . of O' s and 1 's, P[ xn = u n , n = 1, 2, . . . ] = limn Pu · · · Pu = 0; since x can have at most two dyadic expansions continuous. Of course, x = L:n u n 2 - n , P[X = x] = 0. Thus F nis everywhere n F(O) = 0 and F(l) = 1. For 0 k < 2 , k2- has the form "£7= 1 u ,. 2 - ; for some n-tuple (u 1 , u n ) of O's and 1 's. Since F is continuous, Example 31.1.
n
I
•
(31.15)
(
•
•
,
<
) ( ;n) = P [ ;n < X < k;} ]
F k; 1 - F
= P[ Xi = u i , i < n] = Pu 1 " " " Pu. ·
This shows that F is strictly increasing over the unit interval. n If P o = p 1 = t, the right side of (31.15) is 2 - , and a passage to the limit shows that F(x) = x for 0 x 1. Assume, however, that p0 =I= p 1 It will be shown that F'(x) = 0 except on a set of Lebesgue measure 0 in this case. Obviously the derivative is 0 outside the unit interval, and by Theorem 31.2 it exists almost everywhere inside it. Suppose then that 0 < x < 1 and that F has a derivative F'(x) at x. It will be shown that F'(x) = 0. For each n choose kn so that x lies in the interval In = (kn T n , (k n + 1)T n ]; In is that dyadic interval of rank n that contains x. By (31.8),
< <
•
408
DERI VATIVES AND CONDITIONAL PROBABILITY
25
Graph of F(x) for P o = .25, p 1 .75. Because of the recursion (3 1 . 1 7), the part of the graph over [0, .5 ] and the part over [.5, 1] are identical, apart from changes in scale, with the whole graph Each segment of the curve therefore contains scaled copies of the whole, the er.treme irregularity thi� implies is obscured by the fact that the accuracy is only to within the width of the printed line. =
If F'(x) is distinct from 0, the ratio of two successive terms here must go to so that
P[XE in + l ] P[ X E in ]
( 31.16)
�
1,
1
-
2"
If In consists of the reals with nonterminating base-2 expansions beginning with the digits u 1 , un, then P[X E In ] = Pu · · · p11 by (31. 15). But In + must for some un + 1 consist of the reals beginning u 1 , , u n , u n + 1 (u n + 1 is 1 or 0 according as x lies to the right of the midpoint of In or not). Thus P[X E ln +l ]jP[X E in] = pu is either P o or p 1 , and (31.16) is possible only 1f P o = p 1 , wh1ch was excluded by hypothesis. Thus F is continuous and strictly increasing over [0, 1], but F'( x ) = 0 except on a set of Lebesgue measure 0. For 0 x I independence gives •
�
�
•
•
I
,
n
1
n
•
+I
< <
•
•
SECTION
31.
DERIVATIVES ON THE LINE
409
n F(x) = P[X 1 = 0, L�= 2 Xn 2 - + l < 2x] = P0F(2x). Similarly, F(x ) - P o = p 1 F(2x - 1) for 4 < x < 1. Thus (31.17 )
F( x ) =
\ p0F( 2x )
P o + p 1 F( 2x -
In Section 7, F(x ) (there denoted at bold play; see (7.30) and (7.33). A function is
1)
if 0 < X < t , 1"f 2I < x < l .
Q(x )) entered as the probability of success •
singular if it has derivative 0 except on a set of Lebesgue
measure 0. Of course, a step function constant over intervals is singular. What is remarkable (indeed, singular) about the function in the preceding example is that it is continuous and strictly incr�asing but nonetheless has derivative 0 except on a set of Lebesgue measure 0. Note that there is strict inequality in (31.9) for this F. Further properties of nondecreasing functions can be discovered through a study of the measures they generate. Assume from now on that F is nondecreasing, that F is continuous from the right (this is only a normaliza F(x) < lim + "' F(x) = m < oo. Call such an F tion), and that 0 = lim a distribution function, even though m need not be 1. By Theorem 12.4 there exists a unique measure J.L on the Borel sets of the line for which x --+
( 31.18)
x --+
_ 00
11-( a , b] = F ( b ) - F( a ) .
Of course, J.L( R 1 ) = m is finite. The larger F' is, the larger J.L is:
Suppose that F and J.L are related by (31. 18) and that F'( x) exists throughout a Borel set A. Theorem 31.4.
{i) If F'(x) < a (ii) If F'(x) > a
for: x E A , then J.L( A) < a..\(A). for x E A , then J.L( A) > a..\(A).
E
It is no restriction to assume A bounded. Fix for the moment. Let E be a countable, dense set, and let A n = n (A n I), where the intersec tion extends over the intervals I = (u, v] for which u, v E £, 0 < ..\( I ) < n - 1 and PROOF.
( 31 .19 )
J.L ( l )
< ( a + E ) ..\ ( 1 ) .
,
Then An is a Borel set and (see (31.8)) An i A under the hypothesis of (i). By Theorem 1 1.4 there exist disjoint intervals In k (open on the left, closed on
DERIVATIVES AND CONDITIONAL P ROBABILITY
410
the right) such that
An c U k in k and
(31 . 20 )
L A ( ln k ) < A ( An) + E. k
It is no restriction to assume that each In k has endpoints in E, meets and satisfies ..\ { In k ) < n - l . Then (31.19) applies to each In k • and hence
An,
J.L ( A n ) < L J.L( Ink ) < ( a + E) L A ( In k ) < ( a + E )( A ( A n) + E ) . k
k
In the extreme terms here let n -7 oo and then E -7 0; {i) follows . To prove {ii), let the countable, dense set E contain all the discontinuity points of F, and use tbc same argument with ,_J,(l) > (a - E)..\{l) in place of (31. 19) and L k J.L(ln k ) < J.L(An) + E in place of (31.20). Since E contains all the discontinuity points of F, it is again no restriction to assume that each In k has endpoints in E, meets An , and satisfies ..\ { In k ) < n - l . It follows that
J.L ( An) + € > L J.L( In k ) > ( k
Again let
a
n -7 oo and then E -7 0.
The measures and S,� · such that
( 31.21)
- €)
L A ( Ink ) > (a - € ) A ( An) · k
•
J.L and A have disjoint supports if there exist Borel sets S,.. J.L ( R 1 - s,.. ) = o, s,.. n s,� O . =
Suppose that F and J.L are related by (31.18). A necessary and sufficient condition for J.L and A to have disjoint supports is that F'(x) = 0 except on a set of Lebesgue measure 0. Theorem 31.5.
PROOF. By Theorem 31.4, J.L[x: l xl < a, F'(x) < E} < 2aE, and so {let E -7 0 and then a -7 oo) J.L[x: F'(x) = 0] = 0. If F'(x) = 0 outside a set of
Lebesgue measure O, then S,� [ x : F'(x) = 0] and S,.. = R 1 - S,� satisfy (3 1.21). Suppose that there exist S,_. and S,� satisfying (31.21). By the other half of Theorem 31.4, d[x: F'(x) > E] = d[ x : x E S,�, F'(x) > E] < J.L(S,�) = O, and so {let E -7 0) F'(x) = 0 except on a set of Lebesgue measure 0. • =
Suppose that J.L is discrete, consisting of a mass m k at each of countably many points x k " Then F( x) = L.m k ' the sum extending over the k for which x k < x. Certainly, J.L and A have disjoint supports, and so F' must vanish except on a set of Lebesgue measure 0. This is directly obvious if the • x k have no limit points, but not, for example, if they are dense. Example 31.2.
SECTION
31.
DERIVATIVES ON THE LINE
411
Consider again the distribution function F in Example 31.1. Here J.L(A ) = P[ X E A]. Since F is singular, J.L and ,.\ have disjoint supports. This fact has an interesting direct probabilistic proof. For x in the unit interval, let d 1(x), dix), . . . be the digits in its nontermi n nating dyadic expansion, as in Section 1. If (k2- , (k + 1)2- n ] is the dyadic interval of rank n consisting of the reals whose expansions begin with the digits u l , . . . ' u n , then, by (31.15), Example 31.3.
If the unit interval is regarded as a probability space under the measure J.L, then the d;(x) become random variables, and (3 1.22) says that these random variables are independent and identically distributed and J.L[ x: di(x) 0] = p0, =
J.L[x: di(x) = 1] = p 1 •
Since these random variables have expected value p 1 1 the strong law of large number'i implies that their averages go to p 1 with probability 1:
( 31 .23)
J.L
[X E (0, 1]
:
lim n
� i.t= I d ( X) = p I ] = 1. j
On the other hand, by the normal number theorem,
(31 .24)
,.\
[ x E (0, 1] : n n1 .[ di( x ) = 21 ] = 1 . .
hm -
n
'=I
(Of course, (31.24) is just (31.23) for the special case p0 = p 1 = f; in this case J.L and ,.\ coincide in the unit interval.) If p 1 =I= �, the sets in (31.23) and (31.24) are disjoint, so that J.L and ,.\ do have disjoint supports. It was shown in Example 31.1 that if F'(x) exists at all (0 < x < 1), then it is 0. By part (i) of Theorem 31.4 the set where F'(x) fails to exist therefore has wmeasure 1; in particular, this set is uncountable. • In the singular case, according to Theorem 31.5, F' vanishes on a support of A. It is natural to ask for the size of F' on a support of J.L · If B is the x-set where F has a finite derivative, and if (31.21) holds, then by Theorem 31.4, J.L[x E B: F'(x) < n] = J.L[x E B n S"': F'(x) < n] < nA(S"') = O, and hence J.L(B) = 0. The next theorem goes further. Theorem 31.6. Suppose that F and J.L are related by (31. 18) and that and ,.\ have disjoint supports. Then, except for x in a set of J.L-measure D(x) = oo.
F
J.L 0,
412
DERIVATIVES AND CONDITIONAL PROBABILITY
If J.L has finite support, then clearly FD(x) = oo if J.L{ x } > 0, while D/x ) = 0 for all x. Since F is continuous from the right, FD and DF play different roles.t PRooF. Let A n be the set where FD(x) < The problem is to prove that J.L ( A n ) = 0, and by (31.21) it is enough to prove that ( A n n S1) = 0.
n.
J.L
Further, by Theorem 12.3 it is enough to prove that J.L( K) = 0 if K is a compact subset of A n n S,.. . Fix E. Since A. ( K ) = 0, there is an open G such that K c G and A{ G) < E. If x E K, then x E A n , and by the definition of FD and the right-continuity of F, there is an open interval Ix for which x E Ix c G and J.LU) < n;>,.(l). By compactness, K has a finite subcover Ix 1, , Ixk · If some three of these have a nonempty intersection, one of them must be contained in the union of the ether two. Such superfluous intervals can be removed from the subcover, and it is therefore possible to assume that no point of K lies in more than two of the Ix But then •
I
•
•
J.L( K) < 11- ( V I_.; ) � �J.L{Ix) < n � A.( IxJ I
Since
•
I
I
< 2 n A ( l) Ir, ) 2 nA. ( G ) < 2nE. <
I
•
E was arbitrary, A(K) = 0, as required.
Example 31.4.
Rest r i c t the F of Example!. and t o (0, and let g be the iIfnverse. Thus F and g are conti n uous, strictly increasing mappings of (0, onto i ts el f. A = [ x E (0, F'( x) = 0], then A( A) = as shown in the examples, while f.L( A ) 0. Let H be a set in (0, that is not a Lebesgue set. Since H - is contained in a set ofsinceLebesgue measure 0, it is a Lebesgue set; hence H0 = H nA is not a Lebesgue set, other w i s e H = H0 U (H - A ) would be a Lebesgue set. If B = (O, x], then A g - 1 ( 8) = A(O, F( x)] = F(x) = JL(B), and i t fo l ows that A g - 1 ( 8) = JL(B) for al l Borel setLebesgue s B. Sisetnce. Ong -theH0otheris a hand, A(g - 1A) = JL(A) = 0, g - 1 H is a subsetif gof- 1H0g - 1Awereanda Borel H0 = woul d set , 1 alBorel so beset.a +Borel set. Thus g - H0 provides an example of a Lebesgue set that is not a 31.1
1 ):
1,
1)
1
3 1 .3
1 ),
1)
=
A
p-r( g - 1H0 )
0
•
Integrals of Derivatives
Return now to the problem of extending part (ii) of Theorem 31.1, to the problem of characterizing those distribution functions F for which F' inte grates back to F:
( 31.25)
F( x ) =
tsee Problem 31.8 t For a different argument, see Problem 3.14.
f
X
- 00
F'( t ) dt .
SECTION
31.
DERIVATIVES ON THE LI!'IE
413
The first step is easy: If (3 1.25) holds, then F has the form F( x )
(31 .26)
= j f ( t ) dt X
- 00
for a nonnegative, integrable f (a density), namely f = F'. On the other hand, (3 1.26) implies by Theorem 31.3 that F' = f outside a set of Lebesgue measure 0, whence (3 1.25) foliows. Thus (3 1.25) holds if and only if F has the form (31.26) for some f, and the problem is to characterize functions of this form. The function of Example 31.1 is not among them. As observed earlier (see (31.3)), an F of the form (31.26) with f integrable is continuous. It has a still stronger property: For each E th::.re exists a 8 such that
JAf( x) dx < E
( 31 .27)
if ..\ ( A ) < 8.
Indeed, if An = [x: f(x ) > n], then An ! 0, and since f is integrable, the dominated convergence theorem implies that fA f( x)dx < E/2 for large n. Fix such an n and take 8 Ej2n. If ..\(A ) < 8, then fA f( x ) dx < n
=
fA -A " f( x ) dx + fA
f(x) dx < n..\( A) + €/2 < €. If F is given by" (3 1.26), then F(b) - F(a) = Jabf(x) dx, and (31.27) has this consequence: For every E the1 e exists a 8 such that for each finite collection [a1, b1], i = 1, . . . , k, of nonoverlappingt intervals, ( 31 .28)
k
L j F( b1) F ( a ) J < €
i� I
-
,
if
k
L ( b, - a,) < 8.
i=I
A function F with this property is said to be absolutely continuous.+ A function of the form (31.26) (f integrable) is thus absolutely continuous. A continuous distribution function is uniformly continuous, and so for every E there is a 8 such that the implication in (31.28) holds provided that k = 1. The definition of absolute continuity requires this to hold whatever k may be, which puts severe restrictions on F. Absolute continuity of F can be characterized in terms of the measure J.L :
Suppose that F and J.L are related by (31.18). Then F is absolutely continuous in the sense of (3 1 .28) if and only if J.L(A) = 0 fo r every A for which ..\(A) = 0. Theorem 31.7.
t In tervals are nonoverlapping if their interiors are disjoint. In this definition it is immaterial
whether the intervals are regarded as closed or open or half-open, since this has no effect on (3 1 .28). tThe definition applies to all functions, not just to distribution functions If F is a distribution function as in the present discussion, the absolute-value bars in (3 1.28) are unnecessary
414
DERIVATIVES AND CONDITIONAL PROBABILITY
Suppose that F is absolutely continuous and that A(A) = 0. Given E, choose 8 so that (31.28) holds. There exists a countable disjoint union B = U k lk of intervals such that A c B and A(B) < 8. By (31.28) it follows that J.L( U Z= 1 /k ) < E for each n and hence that J.L(A) < J.L(B) < E. Since E was arbitrary, J.L(A) = 0. If F is not absolutely continuous, then there exists an E such that for every 8 some finite disjoint union A of intervals satisfies A(A) < 8 and J.L(A) > E. Choose An so that A( An) < n - 2 and J.L(An) Then ,.\(lim supn A n) = 0 by the first Borel-Cantelli lemma (Theorem 4.3, the proof of which does not require P to be a probability measure or even finite). On the other hand, J.L0im supn An) > E > 0 by Theorem 4. 1 (the proof of which applies because 1-L is assumed finite). • PROOF.
> €.
This result leads to a characterization of indefinite integrals.
distribution function F(x) has the form J�"'f(t) dt for an integrable f if and only if it is absolutely continuous i."l the sense of (31.28). Theorem 31.8.
A
That an F of the form (31.26) is absolutely continuous was proved in the argument leading to the definition (31.28). For anothe r proof, apply Theorem 31.7: if F has this form, then A ( A ) = 0 implies that J.L(A) = PROOF.
fAf(t) dt = 0.
To go the other way, define for any distribution function
(31 .29)
F
Fac ( x) = f F'( t ) dt X
- ex:
and
( 31 .30) Then F;; is right-continuous, and by (31.9) it is both nonnegative and nondecreasing. Since Fac comes form a density, it is absolutely continuous. By Theorem 31.3, F;c = F' and hence F; 0 except on a set of Lebesgue measure 0. Thus F has a decomposition =
(31.31)
F( x ) = Fac ( x) + F;; ( x ) ,
where Fac has a density and hence is absolutely continuous and F5 is singular. This is called the Lebesgue decomposition. Suppose that F is absolutely continuous. Then F5 of (3 1.30) must, as the difference of absolutely continuous functions, be absolutely continuous itself. If it can be shown that F5 is identically 0, it will follow that F = Fac has the required form. It thus suffices to show that a distribution function that is both absolutely continuous and singular must vanish.
SECTION
31.
415
DERIVATIVES ON THE LIN E
If a distribution function F is singular, then by Theorem 31.5 there are disjoint supports S,.. and SA . But if F is also absolutely continuous, then from ,.\(S,_.) = 0 it follows by Theorem 31.7 that JL ( S,_. ) 0. But then JL(R 1 ) 0, and • so F( x ) 0.
=
=
=
This theorem identifies the distribution functions that are integrals of their derivatives as the absolutely continuous functions. Theorem 31.7, on the other hand, characterizes absolute continuity in a way that extends to spaces f! without the geometric structure of the line necessary to a treatment involving distribution functions and ordinary derivatives.t The extension is studied in Section 32. Functions of Bounded Variation
The of t h e precedi n g remai n der of thi s sect i o n bri e fl y sket c hes theory t o the ext e nsi o n funct i o ns that are not monot o ne. The resul t s arc f o r si m pl i c i t y gi v en onl y f o r a fi n i t e inteIfrvalF( x[ a) =b]faxf(t) and fodtr fius nctian ionnsdefiFnionte i[nat,eb]gralsat, where isfyingfF(a)is in=tegrabl 0. e but not necessar itwlyononnegat i v e, then F( faxr(t) dt - J:r(t) de exhibits F as the difference of i n t e gral s thus nondecreasi n g functi o ns. The probl e m of charact e ri z i n g i n defi n i t e ldieffadserenceto theof nondecreasi prelimin aryngprobl e m of charact e ri z i n g funct i o ns repre<> e ntabl e as a f u nct i o ns. Now F is said to be of bounded variation over [ a , b] if sup8 II F ila is finite, where II FilA is defi n ed by (3 1 .5) and ranges over all parti t i o ns (31.4) of [ a , b]. Clearly, a diasffwelerencel: Forofevery nondecreasi n g funct i o ns i s of bounded vari a ti o n. But the converse hol d s finite collection of nonoverlapping intervals [ x;, Y;l in [a, b], put ,
X) =
�
r
Now define = supPr, N( x ) = sup Nr, where the supr::. m a extend over part i t i o ns of {a, x ] . If F is of bounded variatio n, then Pr Nr + F( x ). This gives the P( x) and N( x) are finite. For each such inequalities P( x )
r
Pr < N( x ) + F( x ) ,
r
r
r,
=
P( x ) > Nr + F( x ) ,
which in turn lead to the inequalities P(x) > N( x ) + F( x ) . P( x ) < N( x ) + F(x ) , Thus F(x ) = P( x ) - N( x ) (31 .32) representati o n: A function the difference of two nondecreasing gifunctions ves the ifrequi r ed and only if it is of bounded variation. in is
tTheorems 3 1 .3 and 3 1 8 do have geometric analogues
R k ; see
RuoiN2 , Chapter 8.
416
DERIVATIVES AND CONDITIONAL PROBABILITY ;
If Pr Nr, then I:IF( F( x )l. According to the definition (31.28), F conti n uous i f f o r whenever the iisntabsol u t e l y every € there exists a such that l e ngth l e ss than col l e cti o n have total If F is absolutely e rval s i n t h e of 1 and decompose [ a , b] i n to a fi n i t e t a ke the correspondi n g an conti n uous, t o number, say n, of subin tervals [ u , u) of lengths less than Any partition of [ a , b] can by the insertion of the u be spl i t i n to n sets of i n tervals each of total length lfunct ess tiohnanis necessari and itlyfoofl obounded wst thatvariI Filaation.n. Therefore, an absolutely continuous An absol u tel y conti n uous F thus has a representation (31.32). It fo l ows by the defi n i t i o ns that P( y) - P( x) i s at most supr where ranges over the parti tions of [ x, y]. If x Y;] are nonoverlapping in tervals, then P( y;) -P( x, ) i s at most col l e cti o ns of i n terval s that parti t i o n each of supr where now ranges over t h e thethat[ L(x y1y;]- x;)Therefore,impliifesFthatis absol[(P(y,utel)y-contiP(xnuous, there exi s t s for each such )) In other words, P i s absol u tel y contiThusn uous.an absol Similuatelrlyy, Ncontiis nabsol u t e l y conti n uous. uous F is the diffe rence of two nondecreasing absolutely conti n uou!. f u ncti o ns. By Theorem 31.8, each of these is an i n defi n i t e i n tegral, which implies that F is an indefinite integral as well: For an F on [a, b] satisfying F(a) 0, Tr =
Tr =
+
r
o
1
j
8,
Tr,
[
y;) -
8.
E
_1
Tr < E
�
o.
<
r
Tr,
1,
i•
o
I:(
r
E
;
< E.
a
o
=
absolute continuity is a necessary and sufficient condition for F to be an indefinite integral-to have the form F( x) = J:f(t) dt for an integrable f. PROBLEMS
31.1.
p0 ,
Extaddienndg toExampl e s , Pr - be nonnegative numbers 31.1 and 31 .3: Let 1, where r 2; suppose there is no i such that P; 1. Let X1 , XP[2Xn,... i]be independent, i d enti c al l y di s tri b uted random vari a bl e s such that n 0 i
=
=p1,
•
•
1
•
=
1
<
p1
31.2.
=
p
=
p
=
4
31.3.
(a)
=
a.
(b)
t This uses the fact that IIFII11 cannot decrease under passage to a finer partition.
� a]
SECTION
31.4. i
31.
417
DERIVATIVES ON THE LINE
An
arbi t rary funct i o n f on 1] can be represented as a composi t ion of a Lebesgue funct i o n f1 and a Borel function f • For x in 1], let d ( x) be the nth digit in itsShownontthatermifnatiisnigncreasi f (x) = defi n e dyadi c expansi o n, and the n g 1] i s contai n ed i n and that 2 1 t o be f(f2 ( x )) if x E 1] and i f x E 1] x) f ( , 1]. set . Take Cantor Now show that f = fd2 • LetLk r 1 r2, • • •Defibe nane enumerat i o n of the rati o nal s 1) and put F(x) = i n by (14.5) and prove that it is continuous and singular. rel a t e d by (31.18). IfF is not absolutely conti n uous, Suppose that JL and F are JL( A)> f o r some set A of Lebesgue measure It is an interesting fact, then have A must however, that al m ost al l transl a tes of J.L measure From Fubi n i ' s i n vari a nt under theorem and the fact t h at i s transl a ti o n and refl e ct i o n f o r x a--fi n i t e, x) through <; h ow A)= and JL i s t h en JL(A that, i f A( outside a set of Lebesgue measure 17.4 31.6 i Show that F is absolutely continuous if and only if fo r each Borel set A. JL(A x) is continuous in i n f(F(u)-F(u))j(u-u) , where t h e i n fi m um extends Letover uF*(x)=l i m 6__ . 0 and u such and that u Defi n e F*( x) as t h i s l i m i t u -u x u with theIeplaincedfimumby F*replinacedpartby(i)aandsupremum. Show that i n Theorem 31.4, F' can be by F i n part (i i ) . Show that i n Theorem 31.6, FD can be replaced by F (note that F (x) FD(x )). Lebesgue's density theorem. point x is a density point of a Borel set A if A(deduce (u, u] nA)j(u-u)-+ 1 as u i x and u! x. From Theorems 31. 2 and 31.4 that al m ost al l poi n ts of A are densi t y poi n t s . Si m i l a rl y , A( ( u, u] n A)j(u-u)-+ almost everywhere on Ac. k be an a1 c; (t )=( i ( t ), . . . , fk(t)). Show that the arc IS Letrectif:fiabl[ae, ibf]-+R and only if each /; is of bounded variation over [a, b]. i Suppose that F is continuous and nondecreasing and that F(O) = F(l) = 1. Then f(x) = (x, F(x)) defines an arc f: [ , 1]-+ R 2• It i s easy to see ngth >satisfie.:s. byL(f)monotoItniicsialtysothateasy,thegivarcen.:is, torectiproduce fiable andfunctithat,ons iFn ffoarctwhi, itschleL(f) t h e argument s i n the proof of Theorem Show by 31.4 that L(f) i f F i s singular. l�p(t)l = 1. Suppose that the characteri s t i c funct i o n of F sati s fi e s l i m sup,__ . , l a tt i c e case Show that F i s (Probl e m 26.1). Hint: Use si n gul a r. Compare the the Lebesgue decomposition and the Riemann - Lebesgue theorem. Suppose t h at are i n dependent and assume t h e val u es 1 wi t h � each, and let X= Show probabi l i t y i s uni f orml y di s that triandbuteddeduceover(1.40). [ - 1, 1]. Calculate the characteristic functions of and establ i s h Conversel y , (1.40) by trigonometry and concl u de that is uniformly distributed over [ - 1, 1]. (0,
n
r.:= 1 2dn(x )j3 .
31.5. 31.6.
U
0
A
0
0.
+
=
0.
0
x.
<
<
< o.
*
*
31.9.
20
(0,
0.
+
31.8.
2
(0,
1p
rk s. x 2- k .
0,
31.7.
f2(0, 0
/2(0,
,
n
(0,
2
*
<
A
0
31.10. 31.11.
f
i
0,
0
< 2.
31.12.
31.13.
=
X 1 , X2 , . . .
n
r.:= , Xn!2 .
+
X
+
X
2
2-
±
X
Xn
418
DERIVATIVES AND CONDITIONAL PROBABILITY
the val u es Suppose that ••• are i n dependent and assume and 1 with probability t andeach. Let and Show G be the distribution fu nctions of that and G are singular but that ShowGthatis absolthe uconvol tely contiutionnuous. an of absol u tel y conti n uous di s tri b uti o n func tion with an arbitrary distribution function is absolutely continuous. 31.2 i Show that the Cantor function is the di s tri b uti o n function of where t h e are i n dependent and assume t h e val u es and 1 wiproduct th probabi l i t y } each. Express i t s characte ristic function as an i n fi n i te . Show foFrom r the(31 17)ofdeduce Examplthate 31.1 that and and fo r all dyadiicf rationals Analyze the cJse p0 and sketch the graph 6. 14 i Let be as i n Exampl e 31.1, and let be t h e correspondi n g n d th be the probabi l i t y the measure on t h e uni t i n t e rval . Let n(x) di g i t i n d ( ) of and l e t k(x). I f nontermi n ati n g bi n ar y expansi o n i s 1 the dyadic interval of order n containing then 31 .33) ln og ( ) ( 1 - ) logp0- -n- logp 1 • Show that (3 1.33) converges on a set of wmeasure to the entropy h the fact that thi s entropy i s l e ss t h an i f ldeduce og inpthi1 losgcasep 1 • From l o g does not have a that on a set of wmeasure finiteShowderivthatative.(31.33) converges to - } log � log p on a set of Lebesgue 1 measure 1. If p0 } thi s limit exceeds log2 (arithmetic versus geometric means), except on set of Lebesgue measure Thi s and so a does not prove that exi s t s al m ost everywhere, but i t does show that, exceptShowforthat,in ifa (31.33) set of Lebesgue measure i f does exi s t , then i t i s converges to l, then { iiff ll lloogg lim ( 31 .34) Iforder(31.34)lflogholdThus condi t i o nt s, then sati(roughl y ) of (exact ) sati s fi e s a Li p schi t z s fi e s a Li p schi t z condi t i o n of order h flog on a set ( - log - t log p1 ) l g ofon JL-measure and Li p schi t z condi t i o n of order a a set of Lebesgue measure 1. van der Waerden' s conti n uous, nowhere di ff e renti a bl e funct i o n i s where 0 ( ) is the distance fro m to the nearest in teger and 0 ( ). Show by the Weierstrass M-test that is continuous. Use and the ideas in Example 31.1 to show that is nowhere differentiable.
31.14. (a)
X1 , X2 ,
2
(b)
31.15.
[�
31.16.
31.17.
=
*
Xnf 3",
Po < � -
F
2 L';; = 1 X2nf2 ".
L� = 1 X2n _ 1 f2 " -
F
0
F
0
Xn
F
x.
>
D F(O) = 0 F D(l) =Foo D (x) = 0 ( D( x ) = oo
2
F
JL
x,
(
l
-
(a) = - Po Po * },
2
1
x,
s,
JL \ l x ) = ,
s
n x = L:k'
( x)
ln( x )
=
sn( x )
11
l
Po -
2
l, F
Po -
(b)
*
n JLUn( x ))f2- -+ 0 F'( x )
x
(c)
JL( /n( X )) n (2- n (
F
2.
0.
0.
=
oo 0
F'( x )
f a< f
L:'k=o a k ( x ), a x a k ( x ) = 2- ka 2 k x (31.8)
2, 2.
a>
F
}
1
31. 18.
0.
Po
2
fo 2
f(x )
x
f
=
f
tA Li psc hitz condi tion of order a holds at x if F(x + h ) - F(x) O(ihia) as h -+ 0 for a > 1 this implies F'(x) 0, and for o < a < I it is a smoothness condition stronger than 'continuity =
=
and weaker than differentiability.
32.
S ECTION
31.19.
31.20.
TH E RADON -NIKODYM THEOREM
419
Show (see 01.31)) that (apart from addition of constants) a fu nction can have onllar.y one re presentation with absolutely continuous and singu Show thatwherethe iins cont the iLebesgue decomposi t i o n can be furt h er spl i t i n t o i n creases onl y i n jumps n uous and si n gul a r and iposin thteiosense t h at the correspondi n g measure i s di s cret e . The compl e t e decom n is then Show and that, i f assumes Suppose that L i F( )l + then i t is of unbounded1 variation. ( i n t e rval the valDefiuene0 in each f o r over [0, 1] by and F(x)=xasi n x For which values of is of bounded variation? 14.4 i I f is nonnegative and Lebesgue i n tegrable, then by Theorem 31.3 and (31.8), except fo r in a set of Lebesgue measure 0, F1 + F2
F2
F1
F5 Fcs
Fd + Fcs>
Fd
F = Fac + Fcs + Fd.
31.21. (a)
(b)
31.22.
x (< x 2 <
F
a
f
· ·
·
x n,
x n 1 ), F(O) = O
F
n
xn
= oo.
F
x > O.
x
1
(31.35)
v-u u
< u,
!' f( t ) dt -+ f( x ) u
There liisty anmeasure analoguef.L: Ifin whiiscnonnegat imeasure f is replaced andby a general probabi h Lebesgueive and in tegrable with respect to f.L, then as h ! 0, (31 .36 ) f.L ( X - �, h] t -h , x+h/( l)f.L ( dt) -+ onf.L , anda setputof wmeasureinf[x:1. Let be thefor di0 stributi1o(see n funct(14.5)). ion correspondi n g t o Deduce (31.36} from (31.35) by change of variable and Problem 14.4. u <x<
u,
u,
u
-+ x.
f
f( x )
x+
�p(u) =
F u < F(x)]
SECTION 32. THE RADON-NIKODYM THEOREM
f is a nonnegative function on a measure space (f!, !F, JL), then v ( A ) = fA fd11- defines another measure on !F. In the terminology of Section 16, v has density f with respect to 11- ; see (16.1 1). For each A in !F, JL( A ) = 0 implies that v ( A ) = 0. The purpose of this section is to show conversely that if this last condition holds and v and 11- are u-finite on !F, then v has a density with respect to 11- · This was proved for the case ( R 1 , .9i' 1 , ,.\ ) in Theorems 31.7 and 31.8. The theory of the preceding section, although If
illuminating, is not required here.
Additive Set Functions
Th roughout this section, (f!, !F) is a measurable space. All sets involved are assumed as usual to lie in !F.
420
An
DERIVATIVES AND CONDITIONAL PROBABILITY
additive set function is a function cp from !F to the reals for which
(
( 32.1)
cp U A n n
) = [n cp( A n )
if A 1 , A 2 , . . . is a finite or infinite sequence of disjoint sets. A set function differs from a measure in that the values cp(A) may be negative but must be finite-the special values + oo and - oo are prohibited. It will turn out that the series on the right in (32. 1) must in fact converge absolutely, but this need not be assumed. Note that cp(0) = 0.
Example 32.1 . If 11- 1 and 11- 2 are finite measures, then cp(A) = JL 1 (A) ;L2(A) is an additive set function. It will turn out that the general additive set function has this form. A speciai caf.e of this if cp( A ) = fA! dJL, where f is integrable (not necessarily nonnegative).
•
The proof of the main theorem of this section (Theorem 32.2) requires certain facts about additive set functions, even though the statement of the theorem involves only measures. Lemma
l.
If Eu i E or Eu ! E, then cp(EJ � cp(E).
Eu i E, then cp(E) = cp(EI u u := I (Eu + l - EJ) = cp ( E I ) + r.: = I cp ( Eu + I - E) = lim ,. [cp(£ 1 ) + E�:�cp(Eu + l - E)] = limL' cp(E) by (32.1). If Eu ! E, then EZ i e, and hence cp( E) = cp(!l) - cp( EZ) � cp{!l ) • cp( E c ) = cp(E). PROOF.
If
Although this result is essentially the same as the corresponding ones for measures, it does require separate proof. Note that the limits need not be monotone unless cp happens to be a measure. The Hahn Decomposition
For any additive set function cp, there +exist disjoint sets A + and A - such that A + u A - = !l , cp(E) > 0 for all E in A , and ip(E) < 0 for all E in A - . Theorem 32.1.
>
A set A is posuwe if cp(E) 0 for E cA and negative if 1p(E) < 0 for E cA. The A + and A - in the theorem decompose n into a positive and a negative set. This is the Hahn decomposition . If cp(A) = fAfdJL (see Example 32.1), the result is easy: take A + = [f > O] and A - = [f < 0] .
SECTION 32.
TH E RADON-NIKODYM THEOREM
421
Let a = sup[cp{A): A E .9']. Suppose that there exists a set A + satisfying cp(A + ) = a (which implies that a is finite). Let A -= n - A + . If A c A + and cp{ A ) < 0, then cp(A + - A ) > a, an impossibility; hence A + is a positive set. If A cA - and cp(A) > 0, then cp(A + u A) > a, an impossibility; hence A - is a negative set. It is therefore only necessary to construct a set A + for which cp(A +) = a. Choose sets An n such that cp(An) � a, and let A = U n An. For each n consider the 2 sets Bni (some perhaps empty) that are intersections of the form· n Z = 1 A'k • where each A'k is either A k or else A - A k . The collection n f!Jn = [Bn;: 1 < i < 2 ] of these sets partitions A. Clearly, f!Jn refines f!Jn _ 1 : each Bnj is contained in exactly one of the Bn - I, i · Let en be the union of those Bn; in �n for which cp( Bn;) > 0. Since An is the union of certain of the B�;, it follows that cp(An) < cp(en). Since the partitions f!J1 , f!J2 , are successively finer, m < n implies that (enr u · · · u en - I u en) - (em u · · · u en _ 1 ) is the union (perhaps empty) of certain of the sets Bni ; the Bni in this union must satisfy cp(Bn;) > 0 because they are contained in en- Therefore, cp(Cm U · · · U en _ 1 ) < cp(em u · · · u Cn ), so that by induction cp(Am) < cp(em ) < cp(em U · · · U en). If Dm U :=men, then by Lemma 1 (take E, = em u · · · u em J cp(Am) < cp( Dm). Let A + = n ';;, = 1 Dm (note that A + = lim supn en), so that +Dm ! A +. By Lemma 1, a = limm cp(A m) • < limm cp(Dm) = cp(A + ). Thus A + does have maximal cp-value. PROOF.
•
•
•
=-
If cp +{A) = cp(A nA + ) and cp - { A ) = - cp(A n A - ), then cp + and cp - are finite measures. Thus (32.2 ) represents the set function cp as the difference of two finite measures having disjoint supports. If E cA, then '{){£) < cp + (£) < cp + (A), and there is equal ity if E = A nA +. Therefore, cp+(A) = sup£c A cp(£). Similarly, cp -(A) = - inf £ c A cp( E). The measures cp + and cp - are called the upper and Lower variations of cp, and the measure lcpl with value cp +(A) + cp -{A) at A is called the total variation. The representation (32.2) is the Jordan decomposi
tion.
Absolute Continuity and Singularity
Measures J.L and v on (f!, .9') are by definition mutually singular if they have disjoint supports-that is, if there exist sets S,.. and Sv such that (32.3)
11- ( f! - S,_. ) = 0 , s,.. n sv =
0.
v (f!
- Sv ) = 0,
422
DE RIVATIVES AND CONDITIONAL PROBABILITY
In this case J.L is also said to be singular with respect to v and v singular with respect to J.L · Note that measures are automatically singular if one of them is identically 0. 1 According to Theorem 31.5 a finite measure on R with distribution function F is singular with respect to Lebesgue measure in the sense of (32.3) if and only if F'(x) = 0 except on a set of Lebesgue measure 0. In Section 31 the latter condition was taken as the definition of singularity, but of course it is the requirement of disjoint supports that can be generalized 1 from R to an arbitrary fl. The measure v is absolutely continuous with respect to J.L if for each A in !F, J.L( A) = 0 implies v(A) = 0. In this case v is also said to be domillated by J.L, and the relation is indicated by v « J.L. If v « J.L and J.L « v, the measures are equivalent, indicated by v J.L. A finite measure on the line is by Theorem 31.7 absolutely continuous in this sense with respect to Lebesgue measure if and only if the corresponding distribution function F satisfies the condition (3 1.28). The latter condition, taken in Section 31 as the definition of absolute continuity, is again not the one that generalizes from R 1 to fl. There is an E-8 idea related to the definition of absolute continuity given above. Suppose that for every E there exists a 8 such that =
v( A ) < €
(32.4)
if 11- ( A ) < 8.
If this condition holds, J.L(A) = 0 implies that v(A) < E for all E, and so v « JL. Suppose, on the other hand, that this condition fails and that v is finite. Then for some E there exist sets An such that J.L(An) < n - 2 and v(An) > €. If A = lim :;,upn An, then J.L(A ) = 0 by the first Borel-Cantelli lemma (which applies to arbitrary measures), but v(A ) > E > 0 by the right hand inequality in (4.9) (which applies because v is finite). Hence v « J.L fails, and so (32.4) follows if v is finite and v « f-L . If v is finite, in order that « J.L it is therefore necessary and sufficient that for every E there exist a 8 satisfying (32.4). This condition is not suitable as a definition, because it need not follow from v « J.L if v is infinite.t v
The Main Theorem
If v(A) = f fdJ.L, then certainly v « f.L. The in the opposite direction: A
Radon-Nikodym theorem goes
If J.L and v are u-finite measures such that v « J.L, then there exists a nonnegative f, a density, such that v(A) = f fdJ.L for all A E !F. For two such densities f and g, 11-U =I= g ] = 0. Theorem 32.2.
tsee Problem 32.3.
A
SECTION 32.
TH E RADON-NIKODYM THEOREM
423
The uniqueness of the density up to sets of J.L-measure 0 is settled by Theorem 16. 10. It is only the existence that must be proved. The density f is integrable J.L if and only if v is finite. But since f is integrable J.L over A if v ( A ) < oo, and since v is assumed u-finite, f < oo except on a set of J.L-measure 0; and f can be taken finite everywhere. By Theorem 16. 1 1, integrals with respect to v can be calculated by the form ula
JAhdv = JAhfdJ.L.
(32.5)
The density whose existence is to be proved is called the Radon-Nikodym derivative of v with respect to J.L and is often denoted dv/dJ-1-. The term derivative is appropriate because of Theorems 3 1.3 and 3 1 .8: For an abso lutely continuous distribution function F on the line, the corresponding measure J.L has with respect to Lebesgue measure the Radon-Nikodym derivative F'. Note that (32.5) can be written
v f hdv = J h : dJ.L.
(32.6)
A
A
J.L
Suppose that Theorem 32.2 holds for finite J.L and v (which is in fact enough for the probabilistic applications in the sections that follow). In the u-finite case there is a countable decomposition of f! into .r-sets An for which J.L(An) and v ( A n ) are both finite. If (32.7) then v « J.L implies vn « IL n • and so vn( A ) = fA in dJ.Ln for some density In Since .u n has density IA " with respect to J.L (Example 16.9),
Thus [nfniA is the density sought. It is therefore enough to treat finite J.L and result. n
If J.L and
v. This requires a preliminary
v are finite measures and are not mutually singular, then there exists a set A and a positive € such that J.L(A) > 0 and E J.L( E ) < v(E) Lemma 2.
for all E c A .
DERIVATIVES AND CONDITIONAL PROBABILITY
424
PRooF. Let A ; u A ; be a Hahn decomposition for the set function v - n _ ,11-; put M = U n A;, so that Me = n n A;. Since Me is in the negative set A; for v - n - IJL, it follows that v(Me) n - IJL(Me); since this holds for all n, v( Me ) = 0. Thus M supports v, and from the fact that 11- and v are not
<
mutually singular it follows that Me cannot support 11--that is, that JL(M ) must be positive. Therefore, JL(A;) > 0 for some n . Take A = A ; and • E = n - 1. Suppose that (f!, !T) = ( R 1 , � 1 ), 11- is Lebesgue measure ,.\, and v(a, b] = F(b) - F(a ). If v and ,.\ do not have disjoint supports, then by Theorem 31.5, ,.\[x: F'(x) > 0] > 0 and hence for somt" E, A = [x: F'(x) > E] satisfies ,.\{ A ) > 0. If E = (a , b] is a sufficiently small interval about an x in A, then v( E)jA( E) = (F(b) - F(a))j(b - a) > E, which is the same thing as Example 32.2.
d (E)
< v(E).
•
Thus Lemma 2 ties in with derivatives and quotients v( E) IJL(E) for "small" sets E. Martingale theory links Radon-Nikodym derivatives with such quotients; see Theorem 35.7 and Example 35. 10. PRooF OF THEOREM 32.2. Suppose that 11- and v are finite measures satisfying v « 11- · Let .§ be the class of nonnegative functions g such that fEgdJL < v( E) for all E. If g and g' lie in .§, then max(g, g ') also lies in .§
because
J max( g , g') dJL = j E
gd11- +
E n [ g � g 'J
j
g ' dJL
£ n[g' > g ]
< v ( E n [ g > g ']) + v ( E n [ g ' > g ] ) = v ( E ) .
Thus .§ i<> closed under the formation of finite maxima. Suppose that functions gn lie in .§ and gn j g. Then fEgdJL = limn fcgn d11- v(E) by the monotone convergence theorem, so that g lies in .§. Thus § is closed under nondecreasing passages to the limit. Let a = sup fg dJL for g ranging over .§ (a ::::;; v(f!)). Choose gn in .§ so that fgn d11- > a - n - I . If fn = max(g" . . . , gn) and f = lim fn, then f lies in .§ and ffdJL limnffn d11- > lim n fgn dJL = a. Thus f is an element of .§ for which ffdJL is maximal. Define vac by V3/ E) = fEfdJL and v5 by v5( E) = v( E) - v3c( E). Thus
<
=
(32.8) Since f is in .§, v5 as well as v3c is a finite measure-that is, nonnegative. Of course, vac is absolutely continuous with respect to 11-·
S ECTION
31.
425
DERIVATIVES ON THE LINE
Suppose that v5 fails to be singular with resp ect to 11- · It then follows from Lemma 2 that there are a set A and a positive € such that JL(A) > 0 and EJL(E) < 4£) for all E cA. Then for every E '
'
'
= v( E nA) + f AfdJL < v ( E nA ) + v ( E - A ) £= v( E ) . In other words, f + dA lies in .§; since f(f+ dA ) dJL = a + EJL ( A ) > a, th is contradicts the maximality of f. Therefore, 11- and v5 are mutually singular, and there exists an S such that 4S) JL(S c ) = 0. But since v « JL, v5(S c ) < v(S c ) = 0, and so v5{f!) = 0. The • rightmost term in (32.8) thus drops out. =
Absolute continuity was not used until the last step of the proof, and what the argument shows is that v always has a decomposition (32.8) into an absolutely continuous part and a singular part with respect to 11- · This is the Lebesgue decomposition, and it generalizes the one in the preceding section (see (31.31)). PROBLEMS
32.1.
32.2.
(32. 1)
absol u te: There are t w o ways t o show that the convergence i n must be Use the Jordan decomposi t i o n. Use the fact that a seri e s converges absol u tel y if it has the same sum no matter what order the terms are taken in. be other ones A(UA). a Hahne ofdecomposi t i o n of there may IfConstA +ruAuct -anisexampl thi s . Show that there i s uni q ueness t o the ext e nt that �p(A +6A i> �p(A -6A)) that absol u te conti n ui t y does not i m pl y the condi t i o n i f v is Show l e t s t subset s icounti nfiniten.g measure,Let andconsi of al l of the space of i n t e gers, v be at l e t Not e that have mass i s fi n i t e and v is a--finite. that t h e Radon-Ni k odym theorem fai l s not afi n i t e , even i f i f i s v is Show fiuncountabl nite. e Let let consibe scount t of theing count a bl e and the cocount a bl e sets i n an measure, and l e t v(A) be or 1 as A is countable or cocountable. 1p,
=
32.3.
Hint.
32.4.
= 0.
!f
!f
Hint:
n,
(32.4)
E-B
JL
n-
2
JL
n.
JL
JL
0
426
32.5.
DERIVATIVES AND CONDITIONAL PROBA BILITY
measure Let {AJL be theA restE rictiofonvertoficplalastnarrips.Lebesgue A 2 to the u-field Defi n e v on by v(A = A 2(A 1)). Show that v is absolutely continuous with respect to JL but has no density. Why does this not contradict the Radon-Nikodym theorem? LetderivJL,ativesandherpe beareu-fieverywhere nite measures on (D., .9"). Assume the Radon-Nikodym nonnegat i v e and fi n i t e. Show that v « JL and JL « p imply that v « p and X
!F
x (0,
32.6.
R1:
.9"
�1 }
X
R1)
f�,
(a)
dv dv dJL . dp = dJL dp (b)
=
Show that v JL implies ( )
dJL - I dv . djL = l[diLfdv > O] dv v
p,
1
« Suppose that JL p and « and let A be the set where dv dp dJL fdp. Show that v «JL if and onl y if p( A) = O, in which case (c)
>
0
=
dv f dvfdp = djL [diLfdp > Ol dJLfdp . 32.7. 32.8.
decomposi t i o n (32.8) in the if-fi nite as well as Show that there i s a Lebesgue the finite case. Prove that it is unique. JL i s u-fi n i t e, even The Radon Ni k odym theorem hol d s i f i f v i s not. Assume at firLetst that beJL ithes finiclatess(andof (,9:sets) v « JL).B such that JL(E) = or v(E) = for each E cB.LetShowbethatthe �clacontai n s a set B0 of maximal wmeasure. ss of set s i n 00 = 80 that are countable uni ons of sets of fiDnoit=e fiov-measure. Show t h at contai n s a set C0 of maximal JL-measure. Let - Co . Deduce from the maxi m al i t y of 8 0 and C0 that JL( D0 ) = v(D0) = Let v0(A) = v(A 00). Using the Radon-Nikodym theorem fo r the pai r JL, v0,Nowproveshowit fthat o r JL,thev. theorem holds if JL is merely u-finite. Show that if the density can be taken everywhere finite, then v is u-finite. .9"), and suppose that LetcontaiJL nanded ivn be finThen ite measures on i s a ufi e l d rest r i c t i o ns J-l-0 and v0 of JL and v to the are measures on (!l, .9" 0). Let v3c,v5, v�c, v5° be, respectively, the absolutely contin respect to v and v0 wi t h and JL0• Show that uous and si n gul a r part s of v�c(E) V3c(E) and v;(E) v5( E) for E E .9"0• on (D., Suppose that JL, v, vn are fi n i t e measures .9") and that v(A) = Ln vn( ) vn(A) = fA in dJL + v�(A) and v(A) = fA fdJL + v'(A) be the fordecomposi all A.tioLetns (32.8); si n gul a r wi t h respect here v' and v� are Show that to JL · f = En fn except on a set of JL·measu re and that v' ( A) = Env�( A) for al l Show that v « JL if and only if vn JL for all (a)
-�
(b)
.tf
oo
0
.tf
0.
(c)
n
(d) (e)
(f)
32.9.
.r.
>
32.10.
g:-o
en,
g:-o
.u
<
A
«
0
n.
A.
SECTION
33.
32.11. 32.2 i
1p
respect t o a measure JL Absol u t e conti n ui t y i s nct i o n wi t h of a set f u were i t s el f a measure: JL(A) = must imply that �p(A) = just as i f defi n ed u-fi n i t e , hol d s and JL that, i f thi s i s Show t h en �p(A) = fA fdJL for some + iHahn ntegrabldecomposi e f. Showtionthatfor A Show = [w: f(w) > O] and A - = [w: f(w) < O] give a + (A) = that the three vari a ti o ns sat i s fy = f dJL. Hint: To construct f, start dJL, and dJL, �p - (A) with (32.2). A signed measure is a set fu nction that satisfies (32. 1) i f A 1 , A 2 , are and - but not both. Extend nt andandmayJordan assumedecomposi one of thetionsvaltouessigned+ measures dithesjoiHahn 31.22 Suppose that JL and v are a probabi l i ty measure and a u-fin i t e measure on t h e l i n e and that v JL. Show that the Radon - Nikodym derivative f satisfies m1 ;.tv(x( x -- hh ,, xx +-+ h]h] = f( x ) on a set of JL-measnre < < the unis StPinsuchtervalthatuncountabl Fi�inthd �onupport y many probabi l i t y measures JL , p JL)x} = fo r each x and and the SP are disjoint paus. Letuncountablbee theDefifienlde consion stingbyoftakithneg fi�p(A) nite tando be tthehe number cofinite ofsetpois innts anin A i f A is finite, and the negative of the number of points in A' if A i s cofinite . Show that (32. 1) holds (t hi s is not true if !l is countable). Show that there are notion,negatandivtehatsets fdoes o r (except the empty set ) , t h at there i s no Hahn decomposi not have bounded range. 1p
'P ·
= fA r
fA r
l �p i(A)
1p
32.12. i
32.13.
427
CONDITIONAL PROBABILITY 0
0.
1p
fA i l
oo
i
• • •
oo
«
I
h
.
....
c
1.
32.14.
0
m
32.15.
.9Q
1p
n.
1p
p
0
p
1,
.9Q
1p
SECTION 33. CONDITIONAL PROBABILI1Y I
The concepts of conditional probability and expected value with respect to a u-field underlie much of modem probability theory. The difficulty in under standing these ideas has to do not with mathematical detail so much as with probabilistic meaning, and the way to get at this meaning is through calcula tions and examples, of which there are many in this section and the next. The Discrete Case
Consider first the conditional probability of a set A with respect to another set B. It is defined of course by P(A I B ) = P(A n B)jP(B), unless P(B) vanishes, in which case it is not defined at all. It is helpful to consider conditional probability in terms of an observer in possession of partial information.t A probability space (f!, !F, P) describes tAs always, observer, information , know, and so on are informal, nonmathematical terms; see the related discussion in Section 4 (p. 57).
DERIVATIVES AND CONDITIONAL PROBABILITY
428
the working of a mechanism, governed by chance, which produces a result w distributed according to P; P( A) is for the observer the probability that the point w produced lies in A. Suppose now that w lies in B and that the observer learns this fact and no more. From the point of view of the observer, now in possession of this partial information about w, the probability that w also lies in A is P(A I B ) rather than P(A). This is the idea lying back of the definition. If, on the other hand, w happens to lie in Be and the observer learns of this, his probability instead becomes P( AIBe). These two conditional proba bilities can be linked together by the simple function ( 33 . 1 )
j P( A I B ) f( w) = \ P( A I B c )
if w E B, it w E Be .
The observer learns whether w lies in B or in Be; his new probability for the event w E A is then just f(w ). Although the observer does not in general know the argument w of f, he can calculate the value f(w) because he knows which of B and Be contains w. (Note conversely that from the value f(w) it is possible to determine whether lies in B or in SC, unless P(A I B ) = P(A I Be)-that is, unless A and B are independent, in which case the conditional probability coincides with the unconditional one anyway. ) The sets B and Be partition f!, and these ideas carry over to the general partition. Let B 1 , B 2 , be a finite or countable partition of n into .r-sets, and let .:# consist of all the unions of the B,. Then § is the u-field generated by the B,. For A in !F, consider the function with values cu
•
•
•
( 33.2) f( w) = P( A I B,) =
P( A n B; ) P( B; )
i = l , 2, . . . . w,
If the observer learns which element B, of the partition it is that contains then his new probability for the event w E A is f(w). The partition {B;}, or equivalently the u-field .:#, can be regarded as an experiment, and to learn which B, it is that contains w is to learn the outcome of the experiment. For this reason th e function or random variable f defined by (33.2) is called the conditional probability of A given § and is denoted P[ AII§J. This is written P[ A 11.:#]., whenever the argument w needs to be explicitly shown. Thus P[ A II.:#] is the function whose value on B1 is the ordinary condi tional probability P( A I BJ This definition needs to be completed, because P( A I B) is not defined if P(B) = 0. In this case P[ A II.:#J will be taken to have any constant value on B,; the value is arbitrary but must be the same over all of the set B,. If there are nonempty sets B; for which P(B) = 0, P[ A II.:#] therefore stands for any one of a family of functions on f!. A specific such function is for emphasis often called a version of the conditional
SECTION 33. CONDITIONAL PROBABILITY
429
probability. Note that any two versions are equal except on a set of probabil ity 0. Consider the Poisson process. Suppose that 0 < s < t, and and B; = [ N, = i], i = 0, 1, . . . . Since the increments are independent (Section 23), P( A I B ) = P[ Ns = O]P[ N, - Ns = i]jP[ N, = i], and since they have Poisson distributions (see (23.9)), a simple calculation reduces this to Example 33.1. let A = [Ns = 0]
;
i 0, 1 , 2, .
(33 .3)
=
Since i = N,(w) on B,, this can be written (33.4)
P[Ns = O il &' ] .,
( = 1
-t S
.
.
.
) N,(w ) .
Here the experiment or observation corresponding to {B;} or .§ deter mines the number of events -telephone calls, say-occurring in the time interval [0, t ]. For an observer who knows this r.umber but not the locations of the calls within [0, t], (33.4) gives his probability for the event that none of them occurred before time s. Although this observer does not known w, he • knows N,(w), which is all he needs to calculate the right side of (33.4). Suppose that X0, X 1 , . . . is a Markov chain with state space S as in Section 8. The events Example 33.2.
(33.5) form a finite or countable partition of f! as i0, , in range over S. If .§, is the u-field generated by this partition, then by the defining condition (8.2) for Markov chains, P[Xn + 1 = jll.§, ] ., = P;ni holds for w in (33.5). The sets .
.
•
(33 .6)
0
a u-field � smaller than � for i E S also partition f!, and they generate 0 Now (8.2) also stipulates P[Xn + l = jll-§, ]., = P;j for w in (33.6), and the essence of the Markov property is that •
(33 .7) The General Case
If .§ is the u-field generated by a partition B 1 , B 2 , . . , then the general element of .§ is a disjoint union B; 1 U B; 2 U finite or countable, of certain of the B;. To know which set B; it is that contains w is the same thing .
· · · ,
DERIVATIVES AND CONDITIONAL PROBABILITY
430
as to know which sets in .§ contain w and which do not. This second way of looking at the matter carries over to the general u-field .§ contained in !F. (As always, the probability space is (f!, !F, P).) The u-field .§ will not in general come from a partition as above. One can imagine an observer who knows for each G in .§ whether w E G or w 'E Gc. Thus the u-field .§ can in principle be identified with an experiment or observation. This is the point of view adopted in Section 4; see p. 57. It is natural to try and define conditional probabilities P[ A II.:#] with respect to the experiment .§. To do this, fix an A in !F and define a finite measure v on .§ by v(G)
=
P ( A n G) ,
G E .§.
Then P{ G ) = 0 implies that v ( G ) = 0. The Radon- Nikodym theorem can be applied to the measures v and P on the measurable space (f!, .§) because the first one is absolutely continuous with respect to the second.t It follows that there exists a function or random variable f, measurable .§ and integrable with respect to P, such that t P(A n G) = v(G) = J0fdP for all G in .:#. Denote this function f by P[ AII.:#]. It is a random variable with two properties: (i) P[ AII.:#] is measurable .§ and integrable. {ii) P[ A II.§] satisfies the functional equation
j P[ A II.:# ] dP = P( A n G) ,
(33.8)
G
G E .§.
There will in general be many such random variables P[ AII.:#], but any two of them are equal with probabilit-y 1. A specific such random variable is called a version of the conditional probability. If .§ is generated by a partition B 1 , B 2 , the function f defined by (33.2) is measurable .§ because [ w: f( w ) EH] is the union of those B; over which the constant value of f lies in H. Any G in .§ is a disjoint union G = U k Bik ' and P(A n G ) = [k P(AI B;)P(B;), so that (33.2) satisfies (33.8) as well. Thus the general definition is an extension of the one for the discrete case. Condition {i) in the definition above in effect requires that the values of P[ AII.:#] depend only on the sets in .§. An observer who knows the outcome of .§ viewed as an experiment knows for each G in .§ whether it contains or not; for each he knows this in particular for the set [ w ' : P[ A II.:#].,. = ], • • •
w
x
x
t Let P0 be the restriction of P to .§ (Example 1 0.4), and find on (fl, .§) a density f for v with respect to P0• Then, for G E .§, v(G) = fcfdP0 fcfdP (Example 16.4). If g is another such density, then P[f * g 1 = P0[f * g 1 0. =
=
S ECTION 33. CONDITIONAL FROBABILITY
431
and hence he knows in principle the functional value P[ A II.#L even if he does not know w itself. In Example 33. 1 a knowledge of N,(w) suffices to determine the value of (33.4)-w itself is not needed. Condition (ii) in the definition has a gambling interpretation. Suppose that the observer, after he has learned the outcome of .#, is offered the opportu nity to bet on the event A (unless A lies in .#, he does not yet know whether or not it occurred). He is required to pay an entry fee of P[ A II.#] units and will win 1 unit if A occurs and nothing otherwise. If the observer decides to bet and pays his fee, he gains 1 - P[ A II.#] if A occurs and - P[ A II.#] otherwise, so that his gain is ( 1 - P[ A II.# ] ) IA + ( - P [ A II,f ] ) IA' = fA - P[ A II.#]. If he declines to bet, his gain is of course 0. Suppose that he adopts the strategy of betting if G occurs but not otherwise, where G is some set in .#. He can actually carry out this strategy, since after learning the outcome of the experiment .§' he knows whether or not G occurred. His expected gain with this strategy is his gain integrated over G: •
f) fA -- P[ A ll.#]) dP. But (33.8) is exactly the requirement that this vanish for each G in .#. Condition (ii) requires then that each strategy be fair in the sense that the observer stands neither to win nor to lose on the average. Thus P[ A II.#] is the just entry fee, as intuition requires.
Example
Suppose that A E .#, which will always hold if .§' coin cides with th e whole u-field !F. Then /A satisfies conditions (i) and (ii), so that P[ A II.#] = /A with probability 1 . If A E .#, then to know the outcome of .§' viewed as an experiment is in particular to know whether or not A has • occurred. 33.3.
Example 33.4. If .§' is {0, f!}, the smallest possible u-field, every function
measurable .§' must be constant. Therefore, P[ A II.#L P(A) for all w in • this case. The observer learns nothing from the experiment .#. =
According to these two examples, P[ AII{O, f!}] is identically P(A), whereas /A is a version of P[ A II!F]. For any .§', the function identically equal to P(A ) satisfies condition (i) in the definition of conditional probability, whereas /A satisfies condition (ii). Condition {i) becomes more stringent as .§' de creases, and condition (ii) becomes more stringent as .§ increases. The two conditions work in opposite directions and between them delimit the class of versions of P[ A II.§].
DERIVATIVES AND CONDITIONAL PROBABILITY
432
Example 33.5. Let f! be the plane R 2 and let :F be the class � 2 of planar Borel sets. A point of f! is a pair (x, y ) of reals. Let .§ be the u-field consisting of the vertical strips, the product sets E X R 1 = [(x, y ): x E £], where E is a linear Borel set. If the observer knows for each strip E X R 1 whether or not it contains (x, y ), then, as he knows this for each one-point
set E, h e knows the value of x. Thus the experiment .§ consists in the determination of the first coordinate of the sample point. Suppose now that P is a probability measure on � 2 having a density y ) with respect to planar Lebesgue measure: P(A ) = y )dxdy. Let A be a horizontal strip R 1 X F = [(x, y ): y E F], F being a linear Borel s::.t. The conditional probability P[ A II.§] can be calculated explicitly. Put
ffA f(x.,
�( X ' y )
( 33.9)
=
f(x ,
j f( x , t) dt --'F'---- -!Rlf( t) dt X,
Set �(x, y ) = 0, say, �lt points where the denominator here vanishes; these points form a set of P-measure 0. Since 'fJ( -', y ) is a function of x alone, it is measurable .:#. The general element of .§ being E X R 1 , it will follow that � is a version of P[ AII.:#] if it is shown that ( 33 . 10)
fExR1�( x , y ) dP( x , y )
=
P( A n ( E X R 1 ) ) .
Since A = R 1 X F, the right side here is P( E X F). Since P has density Theorem 16. 1 1 and Fubini's theorem reduce the left side to
JE { JR I�(
X
f,
} JE { fI( X ' t ) dt } dx
' y ) ( X ' y ) dy dx =
f
=
Thus (33.9) does give a version of P[ R
ffEx Ff( x , y ) dxdy = P( E X F ) . 1
X Fll.:# ].
•
The right side of (33.9) is the classical formula for the conditional probability of the event R 1 X F (the event that y E F) given the event {x} x R 1 {given the value of x ). Since the event {x} X R 1 has probability 0, the formula P(A I B ) = P( A n B)jP(B) does not work here. The whole point of this section is the systematic development of a notion of conditional probability that covers conditioning with respect to events of probability 0. This is accomplished by conditioning with respect to collections of events-that is, with respect to u-fields .§.
SECTION 33.
433
CONDITIONAL PROBABILITY
Example 33.6. The set A is by definition independent of the u-field .§ if
it is independent of each G in .§: P( A n G) = P(A )P(G). This being the A is independent of .§ if and only if same thing as P( A n G) • ( A ) with probability 1. P[ A II.:# ]
= J0P(A) dP ,
=P
The u-field u(X) generated by a random variable X consists of the sets �1; see Theorem 20. 1. The conditional probability [w: X(w) H] for of A given X is defined as P[ AIIu(X)] and is denoted P[ A II X]. Thus P[ AIIX] = P[ A IIu(X)] by definition. From the experiment corresponding to ] contains and the u-field u(X), one learns which of the sets [ : hence learns the value X(w). Example 33.5 is a case of this: take X( y) y) in the sample space f! = R 2 there. for This definition applies without change to random vector, or, equivalentiy, to a tinite set of random variables. It can be adapted to arbitrary sets of random variables as well. For any such set [ X�> t T], the u-field u[X, T] it generates is the smallest u-field with respect to which each X, is measurable. It is generated by the collection of sets of the form [ : X,(cu) T] A P[ A II X, H] for t in T and in �1• The is by definition the conditional probability P[ A IIu[ X, T]] of A with respect to the u-field u[X, , T]. In this notation the property (33.7) of Markov chains becomes
E
HE
w' X(w') = x
w
x, = x
( x,
E
tE
H conditional probability with respect to this set of random variables tE
E w t E of tE
(33 . 1 1 ) l
=
The conditional probability of [ Xn + j] is the same for someone who knows the present state xn as for someone who knows the present state xn and th e past states X0, Xn _ as well. • • •
,
1
Y be random vectors of dimensions j and k, let JL be the distribution of X over R j, and suppose that X and Y are independent. According to (20.30), Example 33. 7. Let X and
HE
for !!J j and �j + k. This is a consequence of Fubini's theorem; it has a conditional-probability interpretation. For each in R j put
JE
x f( x) = P[(x,Y ) EJ] = P[w' : ( x , Y(w')) EJ] .
( 33.12 )
By Theorem 20. 1{ii), f(X(w)) is measurable u(X ), and since distribution of X, a change of variable gives
1
JL
is the
f( X( w ))P( dw ) = j f(x) JL( dx ) = P([( X, Y) EJ] n [X E H] ).
[ X e H]
H
434
DERIVATIVES AND CONDITIONAL PROBABILITY
Since [ X E H] is the general element of u(X), this proves that
f( X( ) ) P [ ( X, Y ) E J II XL , w
(33. 13)
=
•
with probability 1.
The fact just proved can be writ en P[ ( X, Y ) Ell I XL = P[( X( w ) , Y) E 1 ] = P[w' : (X( w ) , Y( w')) El].
col l i s i o n Repl a ci n g w' by w on the the one notati o nal l i k e here causes ri g ht a replacing by x causes in J:J(x, y ) dy. Suppose that and Y are independent random variables and that Y has distribution function F. For = [(u, v ): max{u, v} < m], (33. 12) is 0 for m < and F(m) for m > if M = max{ Y}, then (33.13) gives y
X
J
x;
x
X,
( 33 . 14 ) with probability 1. All equations involving conditional probabilities must be qualified in this way by the phrase with probability 1 , because the conditional probability is unique only to within a set of probability 0. The following theorem is useful for checking conditional probabilities.
Let 9 be a 1r-system generating the u-field .§, and suppose that f! is a finite or countable union of sets in g_ An integrable function f is a version of P[ A il .:#] if it is measurable .§ and if Theorem 33.1.
( 33 .15 )
j fdP = P( A n G) G
holds for all G in g_ PRooF.
•
Apply Theore m 10.4.
The condition that f! is a finite or countable union of 9-sets cannot be suppressed; see Example 10.5.
Suppose that X and Y are independent random variables with a conti n uous. What di s tri b uti o n functi o n common i s the condi that i s posi t i v e and F of [X< x] ticloenalarlyprobabi l i t y the random vari a bl e M = max{ X, Y}? it shoul d gi v en i f M < x, suppose that M x. Since X< x requires M Y, the chance of be the x] shoul d whidencech ibes !�F(x)j by symmetry, by i n depen condi t i o nal probabi l i t y of [X < F(m) !P[ X < xiX < m] with the random variable M substituted Example 33.8. 1
>
=
=
As
S ECTION 33.
CONDITIONAL PROBABILITY
435
for m. Intuition thus gives (33.16)
It suffi c es t o check (33.15) for sets G = [M < m], because these fo rm a 'IT-system generating u(M). The functional equation reduces to
P[M < min{x, m} ] + ��x<M�m :((;;) dP = P [ M<m , X<x]. other caseit folisloeasy, Siproduct nce themeasure, suppose that x < m. Since the distri b ution of (X, Y ) is ws by Fubini's theorem and the assumed continuity of F that (33.17 )
dF ( u)dF( v ) v F ) ( X
fx <M�m F(M ) dP - jfu 1
which gives (33.17).
1
<; t
•
Example 33.9. A collection [ X,: t > 0] of random variables is a Markov process in continuous time if for k > 1, 0 < t 1 < · · · < tk < u, and H E 8i' 1 , (33 . 18)
(
holds with probability 1. The analogue for discrete time is (33. 1 1). Th Xn there have countable range as well, and the transition probabilities are constant in time, conditions that are not imposed here.) Suppose that t < u. Looking on the right side of (33. 18) as a version of the conditional probability on the left shows that
e
f P [ Xu E H IIX, ] dP = P([ Xu E H ] n G)
(33. 19)
G
if 0 < t 1 < · < tk = t < u and G E u(X, 1 , , X,). Fix t, u, and H, and let k and t" . . . , tk vary. Consider the class g= U u X, X,) , the union < tk t. If extending over all k > 1 and all k-tuples satisfying 0 < t 1 < X,) , A Eu X , X,) and B E u X5 1 , X5), then A n B E u X, where the r are the s13 and the tY merged together. Thus g is a 1r-system. Since g generates u[X5: s < t] and P[ Xu E HI I X,] is measurable with respect to this u-field, it follows by (33. 19) and Theorem 33. 1 that P[Xu E HIIX,] is a version of P[ Xu E H II X5, s < t ] : ·
(
11
•
•
a
• • •
·
•
,
( 33.20) with probability 1 .
( ,
(
1
, , • • •
·
• • •
t < u,
·
(
=
·
1
, , •
•
•
436
DERIVATIVES AND CONDITIONAL PROBABILITY
This says that for calculating conditional probabilities about the future, the present u(X,) is equivalent to the present and the entire past u[ X5 : s t ]. This follows from the apparently weaker condition (33.18). •
<
The Poisson process [ N, : t > 0] has independent incre ments (Section 23). Suppose that 0 t 1 · · · tk u. The random vector (N, , N1 - N, , . . . , N, - N, ) is independent of Nu - N, , and so (Theorem 20.2) (N, 1 , N, 2 , , N, ) is independent of Nu - N, k . If J is the set of points (x 1 , , xk, y) in R k + l such that x k + y E H, where H E /?l2 1 , and if v is the distribution of Nu - N, k, then (33.12) is P[( x 1 , , x k, Nu - N,) E J] = P[ x k + N1 - N, k E H ] = v(H - x k ). The refore, (33. 1 3) gives P[ Nu E H I I N, 1 , N,) = v( H - N, ) This holds also if k = 1, and hence P[ N11 E H I I N, , . . . , N, ] = P[ N E HIIN, k ]. The Poisson process thus has the Markov property (33. 18); this is a cor.sequence solely of the independence of the increments. The extended Markov property (33.20) follows. •
Example 2
I
I
• • •
k.
< <
< <
33.1 0.
k-1
k
• • •
• • •
1
•
•
I
•
•
,
k
Properties of Conditional Probability Theorem 33.2.
With probability 1, P[0!1.§] = 0, P[ f!ll.§] = 1 ; and
<
0 P [ All.§' )
( 33.21 )
for each A. If A 1 , A 2 ,
• • •
<1
is a finite or countable sequence of disjoint sets, then
( 33 .22 )
with probability 1 . PRooF. For each version of the conditional probability, faP[ A II.§] dP = P(A n G) > 0 for each G in .§; since P[ A II.§] is measurable .#, it must be nonnegative except on a set of P-measure 0. The other inequality in (33.21 ) is proved the same way. If the A n are disjoint and if G lies in .§, it follows (Theorem 16.6) that
Thus L:n P[ Anll.§], which is certainly measurable .§, satisfies the functional equation for P[ U n A n ll.§], and so must coincide with it except perhaps on a set of !'-measure 0. Hence (33.22). •
SECTION 33.
CONDITIONAL PROBABILITY
437
Additional useful facts can be established by similar arguments. If (33.23)
P[ B - A ll.§' ] = P[ Bll.§'] - P[ A ll.§' ] ,
A c B,
then
P[ All.§'] < P[ Bll.§'].
The inclusion-exclusion formula P
(33.24)
holds. If
[U ]
A ;ll.§' = }: P[ A ; II .§' ] - L P[ A; n A)I .§' ) + I=I I I <).
An i A,
then P[ A"l!.§'] i P[ A ll.§'] ,
(33.25)
and if
·· ·
An ! A,
then
( 33.26)
Further,
P( A ) = 1
implies that P[Ail.§' ] = 1,
(33.27)
and
P( A) = 0
implies that P[ All.§' ] = 0 .
(33.28)
Of course (33.23) through (33.28) hold with probability 1 only. Difficulties and Curiosities
This section has been devoted almost entirely to examples connecting the abstract definition (33.8) with the probabilistic idea lying back of it. There are pathological examples showing that the interpretation of conditional proba bility in terms of an observer with partial information breaks down in certain cases.
Example
Let (0, !F, P) be the unit interval f! with Lebesgue measure P on the u-field !F of Borel subsets of f!. Take .§ to be the u-field of sets that are either countable or cocountable. Then the function identically equal to P( A) is a version of P[ A II.:#]: (33.8) holds because P(G) is either 0 or 1 for every G in .:#. Therefore, ( 33 .29)
33.11.
P [ AII.:# ]
w
=
P( A )
with probability 1. But since .§ contains all one-point sets, to know which
DERIVATIVES AND CONDITIONAL PROBABILITY
438
elements of .§ contain w is to know w itself. Thus .§ viewed as an experiment should be completely informative-the observer given the infor mation in .§ should know w exactly-and so one might expect that (33.30)
P[ A II .:# L =
{�
if w E A , if w $A . •
This is Example 4. 10 in a new form.
The mathematical definition gives (33.29); the heuristic considerations lead to (33.30). Of course, (33.29) is right and (33.30) is wrong. The heuristic view breaks down in certain cases but is nonetheless illuminating and cannot, since it does not intervene in proofs, lead to any difficulties. The point of view in this section has been "global." To each fixed A in !F has been attached a function ( usually a family of functions) P[AII.:# L defined over all of fl. What happens if the point of view is reversed-if is fixed and A varies over /7" ? Will this result in a probability measure on SO? Intuition says it should, and if it does, then (33.2 1 ) through (33.28) all reduce to standard facts about measures. Suppose that Bp . . . , B, is a partition of f! into Y-sets, and let .:!l= u(Bp . . . , B,). If P( B 1 ) 0 and P(B) > 0 for the other i, then one version of P[ A II.:# ] is w
=
17
P[ A II .:# L = P ( A n B;) P( B,)
if w E B; , i
=
2, . . . , r .
With this choice of version for each A, P[ A II.:# L is, as a function of A, a probability measure on !F if w E B 2 U · · U Br, but not if w E B 1 • The "wrong" versions have been chosen. If, for example, ·
P( A ) P [A II.:# L = P( A n B, ) P( B;)
if w E B1 , i
=
2, . . . , r ,
then P[ AII.:# L is a probability measure in A for each w. Clearly, versions such as this one exist if .§ is finite. It might be thought that for an arbitrary u-field .§ in !F versions of the a various P[ A II.:# l can be so chosen that P[ AII.:# L is for each fixed probability measure as A varies over /7". It is possible to construct a w
SECTION 33.
CONDITIONAL PROBABILITY
439
counterexample showing that this is not so.t The example is possible because the exceptional w-set of probability 0 where (33.22) fails depends on the sequence A 1 , A 2 , ; if there are uncountably many such sequences, it can happen that the union of these exceptional sets has positive probability whatever versions P[ A II.§] are chosen. The existence of such pathological examples turns out not to matter. Example 33.9 illustrates the reason why. From the assumption (33. 1 8) the notably stronger conclusion (33.20) was reached. Since the set [Xu E H] is fixed throughout the argument,-il does not matter that conditional probabili ties may not, in fact, be measures. What does matter for the theory is Theorem 33.2 and its extensions. Consider a point w0 with the property that P(G) > 0 for every G in .§ that contains w0. This will be true if the one-point set {w0} lies in 7 and has positive probability. Fix any versions of the P[ A II.§]. For each A the set [ w : P[ A II.§L < 0] lies in .§ and has probability 0; it therefore cannot contain w0• Thus P[ AI15'L 0 > 0. Similarly, P[ fiii5' L 0 1, and, if the An are dis joint, P[ U nAnll .:flw = LnP[A II.§L . Therefore, P[ A II.§lw is a probability measure as A ranges over !F. Thus conditional probabilities behave like probabilities at points of posi tive probability. That they may not do so at points of probability 0 causes no problem because individual such points have no effect on the probabilities of sets. Of course, sets of points individually having probability 0 do have an effect, but here the global point of view reenters. • • •
=
0
0
0
Conditional Probability Distributions Let
X be a random variable on (f!, !F, P), and let .§ be a u-field in !F. 1 Theorem 33.3. There exists a function JL(H, w ), defined for H in � and w in f!, with these two properties: (i) For each w in f!, JL( w) is a probability measure on � 1 (ii) For each H in !JP 1 , JL(H, · ) is a version of P[X E Hll.§]. ·
,
•
The probability measure JLL w) is a conditional distribution of X given .§. If .§= u{Z), it is a conditional distribution of X given Z. For each rational r < s, then by (33.23), PROOF.
<
let F(r, w) be a version of P[X ri15'L. If
F( r , w ) < F( s , w )
( 33.3 1 ) tThe argument is outlined nonmeasurable sets.
r,
10
Problem 33 . 1 1 . It depends on the construction of certain
DERIVATIVES AND CONDITIONAL PROBABILITY
440
for w outside a .§.. set A of probability 0. By (33.26), rs
F( r, w)
( 33.32 )
=
lim F( r + n - I , w) n
for w outside a .§.. set B, of probability 0. Finally, by (33.25) and (33.26), (33 .33)
lim F( r, w)
r -+ - oo
= 0,
lim F( r, w) = 1
r -+ oo
outside a .§.. set C of probability 0. As there are only countably many of these exceptional sets, their union E lies in .§ and has probability 0. For w $. E extend F( · , w) to all of R 1 by setting F(x, w) inf[ F(r w ) : x < r]. For w E E take F(x, w ) = F(x ), where F is some arbitrary but fixed distribution function. Suppose that w $. E . By (33.31) and (33.32), F(x, w) agrees with the first definition on the rationals and is nondecreasing; it is right-continuous; and by (33.33) it is a probability distribution function. Therefore, there exists a probability measure JL( · , w) on (R 1 , � 1 ) with distribution function F( · , w). For w E E, let JL{ - , w) be the probability measure corresponding to F(x ). Then condition {i) is satisfied. The class of H for which JL(H, · ) is measurable .§ is a A-system containing the sets H ( - oo, r] for rational r; therefore JL(H, · ) is measur able .§ for H in � 1 • By construction, JL(( - oo, r ], w) = P[ X < rii.:#L with probability 1 for ratio nal r; that is, for H = ( - oo, r] as well as for H = R1 , =
,
=
f JL( H, w)P( dw) = P( [ X E H ] n G) G
for all G in .:#. Fix G. Each side of this equation is a measure as a function • of H, and so the equation must hold for all H in � 1 •
X and Y be random variables whose joint distribu tion v in R 2 has density f(x, y) with respect to Lebesgue measure: P[(X, Y ) E A] = v(A) = ffAf(x, y) dxdy. Let g(x, y) = f(x, y)jfR�f(x, t ) dt, and let JL(H, x) = fHg(x, y ) dy have probability density g(x, · ); if fR�f(x, t ) dt = O, let ILL x) be an arbitrary probability measure on the line. Then JL(H, X(w )) will serve as the conditional distribution of Y given X. Indeed, (33.10) is the same thing as f£xR 1 JL(F, x) dv(x, y) = v(E X F), and a change of variable gives frx e EJfL( F, X(w))P(dw) P[X E E, YE F]. Thus JL(F, X(w)) is a ver II sion of P[Y E Fll XL. This is a new version of Example 33.5. Example 33.12.
Let
=
SECTION 33.
CONDITIONAL PROBABILITY
441
PROBLEMS 33.1.
20.Speci27ified byBorel's paradox. Suppose that a random point on the sphere is < E) < TT, SO that E) l o ngi t ude and l a ti t ude but restri c t by 0 speci fi e s the compl e te meri d i a n ci r cl e (not poi n t, semi c i r cl e ) contai n i n g the and(a) Show compensate by l e tt i n g range over ( -TT, TT]. that f o r gi v en 6 the condi t i o nal di s tri b ution of has density -i" lcos ¢1 over ( -TT, +TT]. If the point lies on, say, the meridian circle through Greenwi c h, i t i s therefore not uni f o rml y di s tri b uted over t h at great ci r cl e . Show that for the conditional distribution of i s uniform over gi v en (0,distriTT).buted If theoverpoiuthatt liegreat s on tciherclequator ( is 0 or TT ), i t i s therefore uni fo rml y e . Si n ce the poi n t i s uni f orml y di s tri b uted over t h e spheri c al surface and contradi c t i o n. ent great ci r cl e s are i n di s ti n gui s habl e , (a) and (b) st a r. d i P. appa 1 Thievents shows agai n the i n admi s si b i l i t y of condi t i o ni n g wi t h respect t o an i s ol a ted of probability 0. The relevant u-tield must not be lost sight of. 20.16 i Let X and Y be independent, each having the standard normal be the pol a r coordi n at e s f o r ( X, Y). di(a)striShew butiothat n, andX l+etY( R,6) and X- Y are independent and that R 2 =[( X+ Y) 2 + ( X - Y)2]12, and conclude that the conditional distribution of R 2 given X - Y i(sXthe- Y)chi212.-squared distribution with one degree of freedom translated by Show that the condi t i o nal di s tri b uti o n of R2 given 6 is chi-squared wi t h two Ifdegrees of freedom. X - Y 0, the conditional di s tri b ution of R 2 i s chi-squared wi t h one degree freedom.witIfh two degrees 4 or 5TT 4, the conditional di s tri b ution of 2R[ is chiof4]-squared of freedom. But the e· J ent s [ X - Y 0] and [ 5TT 4] are the same. Resolve the apparent contradiction. a i(a) OfParadoxes of somewhat si m i l a r ki n d ari � e i n very si m pl e cases. three pri s oners, cal l t h em 1, 2, and 3, two have been chosen by lot fo r executi o n. Pri s oner 3 says to the guard, "Which of 1 and 2 is to be executed? of t h em wil l be, and you gi v e me no i n formati o n about mysel f i n tel l i n g One which"itAndis." now The 3guard fi n ds thi s reasonabl e and says, "Pri s oner 1 i s to be meexecuted. reasons, "[ know that 1 i s to be execut e d; the other wi l betheeittherit was2 orbefore, me, and" Apparent so my chance of bei n g execut e d i s now onl y t, instead of guard has given him info rmation. l y , the i t If one l o oks for a u-fi e l d , must be the one descri b i n g the guard' s answer, and i t t h en becomes cl e ar that the sampl e space i s i n compl e t e l y speci fi e d. i s "1" wi t h Suppose that, and 2 are t o be execut e d, t h e guard' s i f 1 response p and "2" with probability 1 - p; and, of course, suppose that, if 3 probabi l i t y other vi c t i m . Cal c ul a te iprobabi s to be liexecuted, the guard names the the condi t i o nal t i e s. t w o chi l d ren di s tri b uti o ns f o ur sex Assume that among fami l i e s wi t h t h e i n to one of the t w o chi l d ren such area famiequally, andly likheelyi.sYoua boy.haveWhatbeenis ithentroduced condi t i o nal probabi l i t y that the other i s a boy as well? E)
,
E)
El
(b)
33.2.
(b)
(c)
e = 1TI
33.3.
(b)
=
u
e=
e = 1TI
I
e=
I
=
442
33.4.
DERIVATIVES AND CONDITIONAL PROBABILITY
(a) Consi d er probabi l i t y spaces that T: and (D.' , suppose (D., !l' is measurabl e Let be a u-fi e l d and i n and = take .:# to be the u-field with ?-probabilFority 1. show by (16.18) that Now take (D.' , = ( R 2, .9!? 2, JL), wheie JL is the distribution of vect o r (X, on (D., Suppose that ( X, Y) has density f, and random show by (33.9) that
5', P) n -> S'j !f' P' Pr 1• [ r 1 G': G' E .:#']. P[ r 1A ' II .:#lw = P'[ A 'II .:#']Tw 5'', P') (b) 5', P). Y)
5'', P'); .:#' A' E 5'',
5'',
a
j f( X( w ) , t ) dt
P[Y E Fl l XL = ___;;_ F ___
_ _ _
f.R lf( X( w ), t) dt
l.
with probability (a) There i s a sl i g htl y di ff erent approach t o condi t i o nal probabi l i t y . Let a probabie lity space,Define a measure a measurablone space,by v(andA') = (D.,a mapping bemeasurabl Prove that there exi s t s a funct i o n l w ) on !l', measur T A ) for e such that 1(dw') = ablforeall andin integrabl Intui t i v el y , p(Ai w ' ) i s the condi t i o nal probabi l i t y t h at w that fisora someone who knows Tw = w'. Let [Tshow that u-fiConnect eld andthisthatwithp(AIpartT(a)w) iofs atheversiprecedi on ofng problem. that T = X is a random variable, (D.' , = ( R 1 , ), and x is Suppose R In thi s case poi n t of lx) is sometimes writ en theWhatgeneral I X= x]. is the problem with this notation? For the Poisson process (see Example 33. 1) show that for 0 t,
33.5. i
5', P)
-
A' E 5''. 1 ' !f' A' !f'.
(D.', !f')
T: H -> !l'
!f' P( A n p( A ' P( A n r 1A ') fA.p( A iw')Pr EA .:#= 1A ': A' E 5'']; .:# P[ A II�lw-
.9'] 5''.
Pr 1,
v
(b)
!f')
33.6. i
33.7.
p( A
1•
<s <
P [N. = kilN;]
=
k > N; .
0,
N.
condi t i o nal di s tri b ut i o n (i n the sense of Theorem 33.3) of given N, Thus the is binomial with parameters and t. 29.12 Suppose that (XI > X2 ) has the centered norma1 distribution-has in plane theadil asstribution with density (29.10). Express the quadratic form in thethe exponenti N;
33.8.
.9£'1 P[ A
i
sj
SECTION 33.
CONDITIONAL PROBABILITY
integrate out the x2 and show that
-
1= � _/v 2 71" T
2 u 1 exp [ - ( x2 - 0"121 x ) ] 1 _ 2T
I
'
where T 0"22 -uf2ui/ Describe the conditional distribution of x2 given x,. (a)JL(H,Suppose that JL(H, w) has property (i ) in Theorem 33.3, and suppose that ) is a version of P[ E Hll.§'] for H i n a 71"-system generating .9£' 1 . Show that UseJL< - ,Theorem w) is a conditional distribution of given .§'. 12.5 to extend Theorem 33.3 from R 1 to R k . Show that condi t i o nal probabi l i t i e s can be defi n ed as genui n e probabi l i t i e s on spaces of the special form (D., u( . . . ' xk ), P). i Deduce from (33. 16) that the conditional di s tri b uti o n of given is =
33.9.
x ::.)_ ---'f'--(,_x__::l_,-'2 .:: "' !- f(x 1 , t ) dt "'
443
·
X
X
(b) (c)
33. 10.
XI '
X
M
1 JL ( H n ( - oo , M ( w )]) 1 2 /[M E H )(w ) + 2 JL ( - oo, M(w)] ) '
where JL is the distribution corresponding to (positive and continuous). Hint: First check H = ( - oo, x]. 4. 10 1 2.4 i The fol l owing construction shows that conditional probabi l i t i e s mayInnotProblgiveemmeasures. Compl e t e the detai l s . 4.10 i t i s shown that there exi s t a probabi l i t y space (!l, P), a u-fiindependent eld .§' in, .§' contandainas alsetl theHsiinngletons,suchandthat.§' iP(H) = -f, H and .§' are s generat e d by a count a bl e a 71" s ystem subcl a ss. The countabl e subcl a ss generat i n g . §' can be t a ken to be .9= {8 1 , 82 , . . . } (pass to the finite i n tersections of the sets i n the origi n al class).Assume that it is possible to choose versions P[ AII.§'] so that P[A II.§'lw is forwhereeachP[Bnll.§'lw w a probability measure as A varie<; over Let Cn be the w-set = (w); show (Example 33.3) that C = nCn has probabi l ihence ty 1. Show that w E implies that P[ Gll.§'lw = IG(w) fo r al l G i n .§' and that P[{w}ll.§'lw = 1. Now w E H n C i m pl i es that P[ HII.§'lw > P[{w}ll.§'lw 1 and w E He n iP[H mpliII.§'lw es that= IH(w). P[ HII.§'lw < P[!l - {w}ll.§'lw = Thus w E C implies that But si n ce H and .§' are independent, P[HII.§'lw P( H) = t with probabil ity 1, a contradiction. 4.10 but concerns mathemati c al fact Thi s exampl e i s rel a t e d t o Exampl e instead of heuristic interpretation. the l i n e, Letdensityandwith{3 respect and l e t f(x, y) be a probability be u-fintoite measures on {3. Define F
33. 1 1.
5',
5',
!f
!f.
n
18
'C
0.
=
C
=
33.12.
a
(33.34 )
a
X
gx ( y ) =
f(x , y) , f. lf(x, t )f3 (dt ) R
444
DERIVATIVES AND CONDITIONAL PROBABILITY
i n whi c h case unlif (X,ess thehasdenomi t a ke gx( y) = 0, say. Show that, n ator vani s hes, densi t y f wi t h respect to a {3, then the conditional di s tri b uti o n of33.5 andgiven33.X12,haswheredensiatyandEx<{3Y )arewitLebesgue h respect measure. to {3. This generalizes Examples 18.20 Suppose that and x (one fo r each real x) are probabi l i t y measures onThenthe(sleeine,Probl and esuppose that x ( B) is a Borel function in x for each B E m 18 20) X
Y)
Y
33.13.
JL
i
v v
.!Jf 1 •
(33.35 )
defiSuppose nes a probabi on ( R 2 , .9f2 ). l i t y measure that ( X, has di s tri b uti o n TT", and show that x is a version of the conditional distribution of given X. Let a and {3 be u-finite measures on the line. Speci a l i z e the s�tup of Probl e m 33. 13 by suppo>ing that has density f(x) wi t h respect to a and hasin thdensi t y g/ y) with respect to (3. Assume that gx( y) is measurable e pai r (x, y), so that v/ B) is automatically measurabl e in x. Show thatfEf( x(33.35) has densi t y f(x)gx ( y) with respect to a {3: TT"( E) = )g/ y )a(dx)f3(dy ). Show that (33.34) is consistent with f(x, y) = j(x)gx ( y). Put Y)
v
Y
33.14. i
JL
v 2 .9f
X
f
=
p (x) y
f( x)gx ( y )
f.R lf(s)g.( y )a(ds )
has Suppose that densi t y f(x)gx( y) wi t h respect to aX{3, and show (X,Y) that p y(x) is a density with respect to a for the conditional distribution of givenIn the language of Bayes, f(x) is the prior density of a parameter x, g/ y ) iiss thethe condi t i o nal densi t y the ob!. e rvat i o n y given the parameter, and P/X ) of posterior density of the parameter given the observation. Now suppose that a and {3 are Lebesgue measure, that f( x) is positive, n conti n uous, g/ y) = e -
X
Y.
33.15. i
(
)
1 X 1 -x 12 e -p y + - -+ y 1 2 11" rn rn
33.16.
1
fordistrifixedbutioxnandwithy.mean Thus ytheandpostvarieriaoncer densi1jn.ty is approximately that of a normal P( AIIXlw = f( X(w)) for 32. 13i Suppose that X has distribution Now Borel funct i o n f. Show that l i m h P(Aix - h < X < + h ] f(x) for x some i=n x].a setHint:of wmeasure 1. Roughly speaking, P(A 1x - h < X < x + h] -+ P(A Take (B) P( A [ X E B]) in Problem 32.13. _,
v
=
n
0
JL ·
x
=
IX
S ECTION 34.
CONDITIONAL EXPECTATION
445
SECTION 34. CONDITIONAL EXPECTATION
In this section the theory of conditional expectation is developed from first principles. The properties of conditional probabilities will then follow as special cases. The preceding section was long only because of the examples in it; the theory itself is quite compact. Definition
X is an integrable random variable on (f!, !F, P) and that .§ is !F. There exists a random variable £[XI I .:#], called the condi tional expected value of X given .:#, having these two properties: (i) £[XII.§] is measurable .§ and integrable. (ii) £[XII.§] satisfies the functional equation Suppose that a u-field in
j E[XI!.:#] dP = f XdP,
( 34.1 )
G
G
To prove the existence of such a random variable, consider first the case of nonnegative d . This measure by v( G) Define a measure v on is finite because is integrable, and it is absolutely continuous with respect to By the Radon-Nikodym theorem there is a function measurable .:#, such that v{G) = fa d . t This has properties (i) and (ii). If is not clearly has the required necessarily nonnegative, properties. any one of There will in general be many such random variables them is called a version of the conditional expected value. Any two versions arc equal with probability 1 {by Theorem 16.10 applied to restricted to Arguments like those in Examples 33.3 and 33.4 show that f!}] and that with probability 1. As .§ increases, condition {i) becomes weaker and condition (ii) becomes stronger. The value at w is to be interpreted as the expected value of for someone who knows for each G in .§ whether or not it contains the point w, which itself in general remains unknown. Condition (i) ensures that can in principle be calculated from this partial information alone. Condition (ii) can be restated as 0; if the observer, in possession of the partial information contained in is offered the opportu nity to bet, paying an entry fee of and being returned the amount and if he adopts the strategy of betting if G occurs, this equation says that the game is fair.
X.
P.
E[X]
£[XI I .:#]
X,
.§
X
fP
=
f £[X+ I .:# ] - E[X - 1 .§ ]
faX P f,
X
£[XI I .§]; P .§). E[XII{O, =
E[XII !F l =X E[XII.:#L
X
fa(X - E[XII.:#])dP .§, £[XII.§]
=
tAs in the case of conditional probabilities, the integral is the same on (.!1, .'7, P ) as on (.!1, .#) with P restricted to .# (Example 16.4).
446
DERIVATIVES AND CONDITIONAL PROBABILITY
Example 34.1. Suppose that B 1 , B 2 , is a finite or countable partition of f! generating the u-field .:#. Then £[XII.§] must, since it is measurable .§, have some constant value over B;, say a;. Then (34.1) for G = B; gives a;P( B;) = f8 XdP. Thus •
•
•
I
( 34.2 )
E[ XII.:# L = P( �;) j8 XdP, w E B;,
P( B;) > O.
'
•
If P(B) = 0, the value of £[ XII.§] over B; is constant but arbitrary.
Example 34.2. For an indicator /A the defining properties of £[1)1 .#] and ?( All.§] coincide; therefore, £[ /All.§] = ?[ A ll.§] with probability 1. It is easily checked that, more generally, £[XII.§] = L;a;P[ A;II.#l with pwba • bility 1 for a simple function X = L;a;IA . I
In analogy with the case of conditional probability, if [X, t E T] is a collection of random variables, E[ XIIX, t E T] is by definition £ [XII.#] with u[X, t E T] in the role of .:#.
Example 34.3. Let J be the u-field of sets invariant under a measurepreserving transformation T on (f!, !F, P). For f integrable , the limit f in (24.7) is E[fiiJ]: Since f is invariant, it is measurable J. If G is invariant, then the averages a n in the proof of the ergodic theorem (p. 318) satisfy E[ J0a n ] = E[{0Jl. But since the a n converge to f and are uniformly inte • grable, E[I0f ] = E[I0f]. A
A
A
Properties of Conditional Expectation
To prove the first result, apply Theorem 16.10(iii) to f and £[ XII.§] on (f!, .§, P).
Let g be a 1r-system generating the u-field -�. and suppose that f! is a finite or countable union of sets in .:#. An integrable function f is a version of £[ XII.§] if it is measurable .§ and if Theorem 34.1.
( 34.3 )
holds for all G in
j fdP = j XdP G
G
g_
In most applications it is clear that f! E g_ All the equalities and inequalities in the following theorem hold with probability 1.
SECTION 34.
-
447
CONDITIONAL EXPECTATION
Theorem 34.2.
Suppose that X, Y, Xn are integrable.
(i) If X = a with probability 1, then £[ XII.§] = a. {ii) For constants a and b , E[aX + bYII.§] = a£[ XII.§] + b£[ Y I I 5' ]. (iii) If X < Y with probability 1, then £[ XII.§] < E[ YII.§]. (iv) I E[ XII 5' ll < £[l XI II.§]. {v) If lim n Xn = X with probability 1 , IXnl < Y, and Y is integrable, then lim n E[ Xnll.§] = £[ XII.§] with probability 1. If X= a with probability 1 , the function identically equal to a satisfies conditions (i) and (ii) in the definition of £[ XII.§'], and so {i) above follows by uniqueness. As for (ii), aE[XII.§] -r bE[ YII.§] is integrable and measurable �4, and PROOF.
f ( aE[ XII.§] + bE[ Y II.§]) dP = a f E[X I I .§ ] dP + b f E[ Y II.§ ] dP G
G
=a
j XdP + b f G
G
Y
G
dP =
j (aX + bY) dP G
for all G in .§, so that this function satisfies the functional equation. If X < Y with probability 1, then fa( E[ YII.§] - £[XII.§]) dP fa(Y X) dP > 0 for all G in .§. Since E[ YII.§] - £[ X II .§] is measurable .§, it must be nonnegative with probability 1 (consider the set G where it is negative). This proves (iii), which clearly implies (iv) as well as the fact that £[ XII.§] = E[ YII.§] if X = Y with probability 1 . To prove (v), consider Zn = supk ;?; n1Xk - X ]. Now Zn � 0 with probability 1, and by {ii), (iii), and {iv), I E[Xn l l .#] - E[ XII.§]I < E[ Zn ll.§]. It suffices, therefore, to show that E[ Zn11.4HO with probability 1. By (iii) the sequence E[ Zn ll.§] is nonincreasing and hence has a limit Z; the problem is to prove that Z 0 with probability 1, or, Z being nonnegative, that E[Z] = 0. But 0 < Zn < 2Y, and so (34. 1 ) and the dominated convergence theorem give • E[ Z] = fE[Z II.§] dP < JE[ Znll.§] dP E[ Zn] -7 0. =
=
=
The properties (33.21) through (33.28) can be derived anew from Theorem 34.2. Part (ii) shows once again that E[L:,.a,.IA II.§] = L:,.a,.P[A,.II.§] for simpie functions. If X is measurable .§, then clearly £[ XII.§] = X with probability 1. The following generalization of this is used constantly. For an observer with the information in .§, X is effectively a constant if it is measurable .#: '
Theorem 34.3.
(34.4)
with probability 1 .
If X is measurable .:#, and tf Y and XY are integrable, then E[ XY II.§] = XE[ YII.§]
448
DERIVATIVES AND CONDITIONAL PROBABILITY
It will be shown first that the right side of (34.4) is a version of the left side if X = 100 and G0 E .§. Since /00E[YII.§] is certainly measur able .:#, it suffices to show that it satisfies the functional equation f0100E[ YII.:#] dP = f0100YdP, G E .:#. But this reduces to fa n a11 E[ YII.:#] dP = fa na11 YdP, which holds by the definition of E[YII.:#]. Thus (34.4) holds if X is the indicator of an element of .:#. It follows by Theorem 34.2(ii) that (34.4) holds if X is a simple function measurable .§. For the general X that is measurable .:#, there exist simple functions Xn , measurable .:#, such that 1Xr1 < lXI and lim n Xn = X (Theorem 13.5). Since IXn YI < I XYI and IXYI is integrable, Theorem 34.2(v) implies that lim n E[ Xn YII.:#l = E[ XYII.:#] with probability 1. But E[XnY II.:#] = Xn E[ YII.:#] by the case already treated, and of course lim n Xn E[Y II.:#] = XE[ YII .:#]. (Note that IXnE[ Y II.:#ll = I E[XnYII.:#JI < E[ IXn YI II.:#l E[I XYI II.:#], so that the limit XE[YI!.§] is integrable.) Thus (34.4) holds in • general. Notice that X has not been assumed integrable. PRooF.
<
Taking a conditional expected value can be thought of as an averaging or smoothing operation. This leads one to expect that averaging X with respect to .:#2 and then averaging the result with respect to a coarser (smaller) a--field .:#1 should lead to the same resuit as would averaging with respect to .:#1 in the first place: Theorem 34.4.
.:#1 c .:#2, then
If X is integrable and the a--fields .:#1 and .:#2 satisfy
( 34.5)
with probability 1 . The left side of (34.5) is measurable .§P and so to prove that it is a version of E[ XII.:#d, it is enough to verify f0£[ £[XII.:# ]II.:#1 ] dP = f0XdP 2 for G E .41 • But if G E .:#1 , then G E .:#2 , and the left side here is PROOF.
•
f0E[ XII.:#2 ]dP = f0XdP.
!F, then E[XII-:#2 ] = X, so that (34.5) is trivial. If .:#1 .:#2 = .:#, then (34.5) becomes If .:#2
=
=
{0, D.} and
E[ E[ XII.:# ] ] = E[ X ] ,
( 34.6)
the special case of (34.1) for G = fl. If .:#1 c-:#2 , then E[ XII-:#1 ], being measurable .:#1 , is also meast1rable .§2 , so that taking an expected value with respect to .:#2 does not alter it: £[ £[ XII.:#1 lll.:#2 ] E[X II-:#1 ]. Therefore, if .:#1 c-:#2 , taking iterated expected values in either order gives E[XII-:#1 ]. =
SECTION 34.
449
CONDITIONAL EXPECTATION
The remaining result of a general sort needed here is Jensen's inequality for conditional expected values: If cp is a convex function on the line and X and cp(X) are both integrable, then
( 34.7 )
cp ( E[XII .# ]) < E[cp ( X) II.# ]
with probability 1 . For each x0 take a support line [A33] through (x0, cp(x0)): cp(x0) + A(x0Xx - x0) < cp(x). The slope A(x0) can be taken as the right-hand derivative of cp, so that it is nondecreasing in x0• Now
cp ( E[ XII .# ] ) + A ( E[ XII .# ])( X - E[ XII .# ] ) < cp ( X ) . Suppose that £[XII.§] is bounded. Then all three terms here are integrable (if cp is convex on R 1 , then cp and A are bounded on bounded sets), and taking expected values with respect to .§ and using (34.4) on the middle term gives (34.7). To prove (34.7) in general, let Gn = [IE[XII.#ll < n] . Then E[I0 X II.§] = I0 £[X II.§] is bounded, and so (34.7) holds for I0 X: cp(l0 £[X II.§]) < E[cp(l0 X ) ll .§ ]. Now E[cp(l c X ) l l .§ ] = E [ I0 cp( X ) + I0ccp(O) I I.§] = -7 E[cp(X)II.#]. Since cp(l0 £[XII.§]) converges to I0 E[cp(X)II.§] + I0ccp(O) n n cp( E[XII.§]) by the continuity of cp, (34.7) follows. If cp(x) = lxl, (34.7) gives part (iv) of Theorem 34.2 again. n
n
n
n
n
n
n
n
n
Conditional Distributions and Expectations
Let J.L( · , w) be a conditional distribution with respect to .§ of a random variable X, in the sense of Theorem 33.3. If cp: R 1 -7 R 1 is a Borel function for which cp(X) is integrabie, then fR lcp(x)J.L(dx, w) is a version of E[cp(X)II.#L. PROOF. If cp = IH and H E .9R 1 , this is an immediate consequence of the Theorem 34.5.
definition of conditional distribution, and by Theorem 34.2(ii) it follows for cp a simple function over R 1 • For the general nonnegative cp, choose simple 'Pn such that 0 < cpn(x) i cp(x) for each x in R 1 • By the case already treated, fR I 'Pn(x)J.L(dx, w) is a version of E[cpn(X)II.#L. The integral converges by the monotone convergence theorem in (R 1 , .9R 1 , J.L( · , w)) to fRicp(x)J.L(dx, w) for each w, the value + oo not excluded, and E[cpn(X)II.#L converges to E[cp(X)II.#L with probability 1 by Theorem 34.2{v). Thus the result holds for nonnegative cp, and the general case follows from splitting into positive and • negative parts. It is a consequence of the proof above that fR �cp(x)J.L(dx, w) is measurable .§ and finite with probability 1. If X is itself integrable, it follows by the
DERIVATIVES AND CONDITIONAL PROBABILITY
450
theorem for the case cp( x) x that =
E [ XII §L
j "'XJL(dx, w ) "'
=
-
with probability 1. If cp(X) is integrable as well, then
"' E[cp( X )I1 .:9'L = j cp ( x ) JL( dx, w ) "'
( 34.8 )
-
with probability 1 . By Jensen's inequa lity (21.14) for unconditional expected values, the right side of (34.8) is at least cp( f':"'XJL(dx, w)) if cp is convex. This gives another proof of (34.7). Sufficient Subfields *
(fl, Y).
that for each e i n an i n dex set P6 is a probabi l i ty measure on In Suppose stati s ti c s an the probl e m i s to draw i n ferences about the unknown paramet e r e from observat i o n Denote by P6[ AII.§'] and £6[ XII.§'] conditional probabi l i t ies and expected val u es measur e P6 on u-fi e l d to the probabi l i t y calis sufficient culated wiftohr respect .§' in the fami l y [P6: e ] if versions P0[ AII.§'] can be chosen that are f u nct i o n p( A, ) of A .r and isuch ndependent of e-that i s , i f ther e exi s t s a that, for each p(A, ) is a version of P6[AII.§']. There is no and e requi r ement that p( ) be a measure for fixed. The idea is that al t hough there i n be already cont a i n ed .§', thi s information is i r relevant to i n formati o n i n not may ferences about e. t sufficient statistic i s a random variable or therandomdrawivectngoofr Tinsuch that u(T) i s a suffi c i e nt subfi e l d . A , from i f , f o r each fami l y of measures dominates another family f o r and for al l i n i t foll o w!' that al l i n If each of A) f.L A) domi n ates the other, t h ey are equivalent . For sets consisting of a single measure these are the concepts introduced in Section 32. 0,
w.
E
A E ·,w
(
=
0
Theorem
.A'
f.L
0
w
E 0,
Y
Y
A
(fl, Y). A
Y
w En
E
·
w
A
v(
.A',
=
.A/ v .A/.
0
.A'
.A/
34.6. Suppose that [ P6: (J E 0 ] is dominated by the u-finite measure p., . A
necessary and sufficient condition for .§' to be sufficient is that the density f6 of P6 with respect to f.L can be put in the form f6 = g6h for a g6 measurable .§'. h
i s assumed throughout that It g6 and are nonnegative and of course that h is measurabl e factorization theorem, the condi t ion being Theorem i s cal l e d the that t h e densi t y f6 spl i t s i n m a factor depending on only through .§' and a factor 6 and h are not assumed i n tegrable their pro duct f6, iasndependent of e. Al t hough de the nsi t y of a fi n i t e measure, must be. Before proceedi n g t o the proof , consi d er an application. Let and f o r l e t P6 be the measure having with respect to k-dimensional Lebesgue measure the density iotherwi f se. i 1 , . , k, !F.
34.6
g
Exampk 34.4.
(fl, Y ) = (Rk, � k ),
w
11 > 0
0 < X; < (J , =
* This topic may be omitted. tSee Problem 34.19.
f.L,
..
SECTION 34.
CONDITIONAL EXPECTATION
451
If X; is the function on R k defined by X;( X) = X;, then under P6, XI , . . . ' xk are independent random variables, each uniformly distributed over [0, e]. Let T(x ) max ; ,; k X;(x ). If g 6(t) is e - k for 0 < t < e and 0 otherwise, and if h(x) is 1 or 0 according as all x( are nonnegative or not, then fix) = g eCT(x ))h(x ). The factoriza tion criterion is thus satisfied, and T is a sufficient statistic. Sufficiency is clear on intuitive ground& as well: e is not involved in the co nditional distribution of X1 , , Xk given T because, roughly speaking, a random one of them equals T and the others are independent and uniform over [0, T]. If this is true, the distribution of X; given T ought to have a mass of k - I at T and a uniform distribution of mass 1 - k - I over [0, T ], so that =
•
•
•
(34 .9 ) For a proof of thi� fact, needed later, note that by (21 9)
(34 10) r t - u { .!_) k - I l t k +� du = = o
e
fl
2e k
if 0 < t < e. On the other hand, P6[T < t ] = (t ;ei, so that under P6 the distributio n of T has density kt k - I ;e k over [0, e]. Thus
fT,; r
(34.1 1 )
k
+2k 1 TdP.6 = k 2k+ 1 j 'uk u k -k l du = t k +kl o
e
2e
Since (34.10) and (34.11) agree, (34.9) follows by Theorem 34.1.
•
The essential ideas in the proof of Theorem 34.6 are most easily understoo d through a preliminary consideration of special cases.
E
Lemma 1. Suppose that [ P6; e e] is dominated by a probability measure P and that each P6 has with respect to P a density g 6 that is measurable .:#. Then .:# is sufficient, and P[ A ll.#] is a version of P6[ A ll.#] for each e in e. PRoOF. For
G
in .:#, (34.4) gives
Therefore, P[ A II.:#]-the conditional probability calculated with respect to P-does serve as a version of P6[ A II.:#] for each e in e. Thus .:# is sufficient for th e family
452
DERIVATIVES AND CONDITIONAL PROBABILITY
E 0]
even for this family augmented by p (which might happen to lie in the family to start with). • For the necessity, suppose first that the family is dominated by one of its members.
[ Pe:
e
-
E
E
Lemma 2. Suppose that [ Pe : e 0] is dominated by Pe for_ some eo e. If .:# is sufficient, then each Pe has with respect to Pe 11 a density g e /hat is measurable .:#. PROOF. Let p(A , w) be the function in the definition of sufficiency, and take and 0. Let de be any density of Pe w Pe[ A l l.:#]., = p( A, w) fo r all A with respect to Pn . By a number of applications of (34.4 ), uo
E !T, E fl,
6E
the next-to-last equality by sufficiency (the integrand on either side being p (A , - )). Thus gn = Ee [dell.:#], which is measu rable .:#, can serve as a density for Pe with respect to Pn . • u
uu
"
To complete the proof of Theorem 34.6 requires one more lemma of a technical sort.
E E>] is dominated by a u-finite measure, then it is equivalent to some finite or countably Lemma 3. If [Pe: e
infinite subfamily.
In many examples, the Pe are all equivalent to each other, in which case the subfamily can be taken to consist of a single Peu· PROOF. Since JL is u-finite, there is a finite or countable partition of n into .r-sets An such that 0 < JL(An) < oo. Choose positive constants am one for each An, in such a way that [nan < oo. The finite measure with value Ln anJL(A nAn)/JL(An) at A dominates JL. In proving the lemma it is therefore no restriction to assume the family dominated by a finite measure JL. Each Pe is dominated by JL and hence has a density fe with respect to it. Let Se = [w : fiw) > 0]. Then PiA) = PiA n Se) for all A, and Pe( A ) = 0 if and only if JL(A n Se) = 0. In particular, Se supports Pe. Call a set B in !T a kernel if B c Se for some e, and call a finite or countable union of kernels a chain. Let a be the supremum of JL(C) over chains C. Since JL is finite and a finite or countable union of chains is a chain, a is finite and JL(C) = a for some chain C. Suppose that C Un Bm whe1e each Bn is a kernel , and suppose that Bn c Se ". is dominated by [ Pe : n = 1, 2, . . . ] and The problem is to show that [ Pe: e hence equivalen t to it. Suppose that Pe ( A ) = 0 for all n. Then "JL( A n Se") = 0, as observed above. Since C c Un Se • JL(A n C) = 0, and it follows that Pe(A n C) = 0 =
E 8]
,
S ECTION 34.
453
CONDITIONAL EXPECTATION
whatever e may be. But suppose that PeCA - C ) > 0. Then P6((A - C) n S6) = PeCA - C) is positive, and so (A - C ) n S6 is a kernel, disjoint from C, of positive JL-measure; this is impossible because of the maximality of C. Thus PeCA C) is 0 • along with PeC A n C), and so P6( A) = 0. -
Suppose that [ P6: e E 0] is
P( A) = [ c,P6,( A ).
( 34. 12)
11
Clearly, P is equivalent to [ P6 , P6 , ' dominated by J.L . ·
.
] and hence to [ P6 : e E 0 ], and all th ree are
•
( 34.13 ) 34.6. If each P6 has density g 6h with respect to JL, then by the construction (34. 12), P has density fh with respect to where f = Lncng6 . Put r6 = g6jf if f > 0, and = 0 (say) if f = 0. If each g6 is measurable .:#, th e' same is true of f and hence of the r6 Since P[f 0] = 0 and P is equivalent to the entire family, P6[f = 0] = 0 fo 1 all e. Therefore, PROOF OF SuFFICIENCY I"' THEOREM
,u ,
ru
=
•
JA r6 dP = J r6fh dJL = JA n [f> O]r6 fh dJL = JA n [ f> O]g6h dJL 4
Each P6 thus has with respect to the probability measure P a density measurable .:#, and it follows by Lemma 1 thai .:# is sufficient • PROOF OF NECESSITY IN THEOREM
34.6. Let p(A, w) be a function such that, for
each A and e, p(A, · ) is a version of P6[ A II.:#], as required by the definition of sufficiency. For P as in (34.12) and G E .:fl,
(34. 14)
j p ( A , w ) P ( d w ) = Ln Cn }r p( A , w )P6 (dw) G
G
=
''
[ cn j0P6.,( AII .:# ] dP6
,,
n
= [ cnPe ( A n G) = P( A n G).
n
"
Thus p( A, · ) serves as a version of P[ AII.:#] as well, and .:# is still sufficient if P is added to the family Since P dominates the augmented family, Lemma 2 implies that each P6 has with respect to P a density g6 that is measurable .:#. But if h is the density of P with respect to JL (see (34.13)), then P6 has density g6h with respect to JL. •
454
DERIVATIVES AND CONDITIONAL PROBABILITY
E
A sub-a-field .§'0 sufficient with respect to [ P6 : e e ] is minimal if, fo r each sufficient .§', .§'0 is essentially contained in .§' in the sense that for each A in .§'0 there is a B in .§' such that P6(A A B) = 0 fo r all e in e. A sufficient .§' represents a compression of the information in !T, and a minimal sufficient .§'0 represents the greatest possible compressio n. Suppose the densities f6 of the P6 with respect to JL have the property that f6(w) i s measurable G'x !T, where G' is a u-fie!d i n e. Let rr be a probability measure on G', and define P as f0P6rr(de), in the sense that P(A) = fe!A f6(w)JL(dw)rr(de) = fe PeCA )rr(de). Obviously, P « [ P6: e E e]. Assume that {34 15) If rr has mas!. c, at e,, then P is given by (34. 12), and of course, (35.15) holds if (34. 13) does. Let r6 be a density for P6 with respect to P. Theorem
sub-a-field.
34.7. If (34. 15) holds, then .§'0 u[ r6: e =
E e] is a minimal sufficient
PROOF. That .§'0 is sufficient follows by Theorem 34.6. Suppose that .§' is sufficient. It follows by a simple extension of (34. 14) that .§' is still sufficient if P is added to the family, and then it follows by Lemma 2 that each P6 has with respect to P a density g 6 that is measurable .§'. Since densities are essentially unique, P[ g6 = r6 ] = 1. Let d? be the class of A in .§'0 such that P( A A B) = 0 for some B in .§'. Then d? is a u-field containing each set of the fo rm A = [ r6 H] (take B = [ g6 H]) and hence containing .§'0. Since, by (34. 15), P dominates each P6, §0 is essentially contained in .§', in the sense of the defi nitio n. •
E
E
Minimum-Variance Estimation •
To illustrate sufficiency, let g be a real function on e, and consider the problem of estimating g(e). One possibility is that e is a subset of the line and g is the identity; another is that e is a subset of R k and g picks out one of the coordinates. (This problem is considered from a slightly different point of view at the end of Section 19.) An estimate of g(e) is a random variable Z, and the estimate is unbiased if £6[Z] g (e) for all e. One measure of the accuracy of the estimate Z is £6[( Z g(ll)) 2 ]. If .§' is sufficient, it follows by linearity (Theorem 34.2(ii)) that £6[ Xli.§'] has for simple X a version that is independent of e. Since there are simple X, such that I X,l < lXI and X, -+ X, the same is true of any X that is integrable with respect to each P6 (use Theorem 34.2(v)). Suppose that .§' is, in fact, sufficient, and denote by £[ XII.§'] a version of £6[ XII.§'] that is independent of e. =
Theorem
Then
34.8. Suppose that E6[(Z - g(e))Z] < oo for all e and that .§' is sufficient.
(34.16)
for all e. If Z is unbiased, then so is E[ ZII.§']. * This topic may be omitted.
SECTION
34.
CONDITIONAL EXPECTATION
455
By Jensens's inequality (34.7) for �p(x) = (x - g(6))2, (E[ZII.§'] - g(6))2 < E6[(Z - g (6 ))2 1 1.§' . Applying £6 to each side gives (34.16). The second statement • follows from the fact that £6[ E[ZII.§']] = £6[Z]. PROOF.
]
This, the Rao-Blackwell theorem, says that as Z if .§' is sufficient.
E[ZII.§'] is at least as good an estimate
Example 34.5. Returning to Example 34.4, note that each X; has mean 6 j2 under P6, so that if X = k- 1t7= X; is the sample mean, then 2X is an unbiased estimate of 6. But there is a better1 one. By (34.9), £6[2 X liT] = (k + l )Tjk T', and by the Rao-Biackwell theorem, T' is an un biased estimate with variance at most t hat of 2X. In fact, for an arbitrary unbiased estimate Z, E6[(T' - 6)2] < £6[(Z - 6)2 ]. To prove this, let o = T' - E[ZIIT]. By Theorem 20. l(ii), o = f(T) for some Borel function f, and E6[ f(T )] 0 for all 6. Taking account of the density for T leads to jg[(x )x" - 1 dx = 0, so thai f(x)x k - 1 integrates to 0 over all intervals. Therefore, f(x) along with f(x )x k - 1 vani:;hes for x > 0, except on a set of Lebesgue measure 0, and hence P6[f(T) = 0] 1 and P6[T' = E[ZIIT]] 1 for all 6. Therefore, E6[(T' 6)2] = £6[(E[Z11Tl - 6)2 ] < £6[(Z - 6)2] for Z unbiased, and T' has minimum vari =
=
=
=
ance among all unbiased estimates of 6.
•
PROBLEMS 34.1. Work out for conditional expected values the analogues of Problems 33.4, 33.5, and 33.9. 34.2. In the context of Examples 33.5 and 33. 12, show that the conditional expected value of Y (if it is integrable) given X is g(X), where
g(x) =
{' f( x, y )ydy ::
- - -
J
-
f( x, y ) dy
"'
34.3. S how that the independence of X and Y implies that E[YIIX] = E[Y], which in turn implies that E[XY] = E[ X]E[Y]. Show by examples in an of three points that the reverse implications are bo th false.
D.
B be an event with P( B) > 0, and define a probability measure P0 by P0(A) = P(AIB). Show that P0[ AII.§'] = P[A n Bll.§']jP[BII.§'] on a set of P0-measure 1. (b) Suppose that d? is generated by a partition 8 , 82 , . . . , and let .§'V � 1 u(.§'U d?). Show that with probabil ity 1,
34.4. (a) Let
P[ All.§'v d?]
=
n B;ll.§' ) �18; P[PA[ B;ll.§' )
456
D ERIVATIVES AND CONDITIONAL PROBABILITY
34.5. The equation
(34.5) was proved by showing that the left side is a version of the right side. Prove it by showing that the right side is a version of the left side.
34.6. Prove fo r bounded X and Y that E[ YE[ XII.§']] = E[ XE[ YII.§']]. 34.7.
33.9 j
Generalize Theorem 34.5 by replacing X with a random vecto r
34.8. Assume that X is nonnegative but no t necessarily integrable. Show that it is
still possible to define a nonnegative random variable £[ X II.§'], measurable .§', such that (34. 1 ) holds. Prove versions of the monotone convergence theorern and Fatou's lemma. 34.9.
(a) Show fo r non nega tive X that £[ X I I .§'] = f0P[ X > tll.§'] dt with probabil ity 1. k k (b) Generalize Markov's inequ ality: P[ I X I ?- a ll.§' ] < a - E[ I X I ll.§'] with probab;lity 1 . (c) Simil arly generalize Chevyshev's and Holder"s inequalities.
34.10.
(a) Show that, if .§'1 c .§'2 and E[ X 2 ] < oo, then E[( X - E[ X II .§'2 ]) ] < E[(X 2
2
- £[ Xll .§'1 ]) ]. The dispersion of X about its conditional mean becomes smaller as the u-field grows. 2 (b) Define Var[ X II.§' ] = E[( X - £[ X II.§']) II.§']. Prove that Var[ X ] = E[Var[ X I I .§']] + Var[ E[ X II.§']].
!T, let 5;1 be the u-field generated by 5; U .§j, and let A; be the generic set in .§j. Consider three conditions:
34. 1 1 . Let .§'1 , .§'2 , .§'3 be u-fields in
all all
(i) P[ A 3II.§'1 2 ] = P[ A 3 II .§'2 ] for A 3• A 3• (ii) P[ A 1 n A 3II.§'2 ] = P[ A 1 /I .§'2 ]P[ A 3II .§'2 ] for A1 (iii) P[ A 1 II .§'23 ] P[ A 1 II .§'2 ] for A 1• If .§'1 , .§'2 , and .§'3 are interpreted as descriptions of the past, present, and future, respectively, (i) is a general version of the Markov prope rty: th e conditional probability of a future event A 3 given the past and present .#'1 2 is the same as lhe condi tional probabil ity given the present .§'2 alone. Condition (iii) is the same with time reversed. And (ii) says that past and future events A 1 and A 3 are conditionally independent given the present .§'2 • Prove the three conditions equivalent. =
34. 12.
all and
33.7 34. 1 1 i Use Exa mple 33.10 to calculate P[ Ns the Poisson process.
=
k liN;,
u >
t ] ( s < t) for
2 L be the Hilbert space of square-integrable random variables on (fi, !T, P). For .§' a u-field in !T, let M.# be the su bspace of elements of e 2 that are measurable .§'. Show that the operator P.# defined for X E [ by P.#X = £[ X II.§'] is the perpendicular projection on M.#.
34.13. Let
Suppose in Problem 34. 13 that .§= u(Z) fo r a random variable Z in L2 . Let S2 be the o ne-dimensional subspace spanned by z. Show that S2 may be 2 much smaller than Mu Z>• so that E[ XIIZ] (fo r X E L ) is by no means the projection of X on Z. Take Z the identity function on the unit interval with Lebesgue measure.
34.14. j
Hint :
SECTION
34.
CONDITIONAL EXPECTATION
457
j
34.15.
Problem 34. 13 can be turned around to give an alternative approach to conditional probability and expected value. For a u-field .:# in !T, let P.# be the perpendicular projection on the subspace M§· Show that P.#X has for X E e the two properties required of E[ XII.:#]. Use this to define E[ XII.:#] for X E L2 and then extend it to all integrable X via approximation by random variables in e. Now define conditional probability.
34.16.
Mixing sequences. A sequence A A 2 , . . . !T, P) is mixing with constant if 1,
a
(D.,
of .r-sets in a probability space
lim P ( A, n E) = a P ( E )
(34.17)
n
fo r every E in !T. Then a = lim ,P( A,). (a) Show that { A ) is mixing with constant
a
if and only if
li,?1 fA XdP = a jXdP
( 34. 18)
n
fo r each integrable X (measurable !T). (b) Suppose that (34. 17) holds fo r E E &, where g is a 'IT-system, n E 9J, and A, E u( !Y) for ail n. Show that {A,} is mixing. Hint: First check (34. 18) for X measurable u(!Y) and then use conditional expected values with respec t to u(!Y). (c) Show that, if P0 is a probability measure on !T) and P0 « P, then mixing is preserved if P is replaced by P0.
(D.,
be 34.17. i Application of mixing to the central limit theorem. Let XI > X2 , random variables on P), independent and identically distributed with mean 0 and variance u 2, and pu t S, = X1 + · · · +X,. Then S,ju{n = N by the Lindeberg-Levy theorem. Show by the steps below that this still holds if P is replaced by any probability measure P0 on (!l, !T) that P dominates. For example, the central limit theorem applies to the sums [i: = 1 r1,(w) o f Rademacher functions if w i s chosen according to the unifo rm density over the unit interval, and this result shows that the same is true if w is chosen according to an arbitrary density. Let Y,, = S,juVn and Z, = (S, - S[1o , 1 )/uVn, and take fY to consist of the sets of the form [(XI ' . . . , Xk ) E H], � > 1, H E !JR k . Prove successively: (a) P[Y, < x] --+ P[N < x]. •
•
•
(D., !T,
P[IY, - Z,I > £] --+ 0. P[Z, <x] -+ P[N < x]. P(E n [Z, <x]) --+ P(E)P[N < x] for E E 9J. P(E n [Z, <x]) --+ P(E)P[N <x] for E E !T. (f) P0[Z, <x] --+ P[ N < x]. (g) P0[IY, - Z,I > E] -+ O. (h) P0[Y, <x] --+ P[N < x]. 34.18. Suppose that .:# is a sufficient subfield for the family of probability measures P6 , (} E E>, on !T). Suppose that for each (} and A, p(A, w) is a version of P6[AII.:#].,. and suppose further t hat for each p(- , w) is a probability (b) (c) (d) (e)
�
(D.,
w,
DERIVATIVES AND CONDITIONAL PROBABILITY
458
measure on
Qe = Pe ·
!F.
Define Q6 on !T by QeCA) = f0 p(A , w)Pidw ), and show that
The idea is that an observer with the information in .:# (but ignorant of w itself) in principle knows the values p(A , w) because each p(A, · ) is measur able .:#. If he has the appropriate randomization device, he can draw an w' from n according to the probability measure p( · , w ), and h is w' will have the same distribution Q6 = P6 that w has. Thus, whatever the value of the unknown e, the observer can on the basis of the information in .:# alo ne, and without knowing w itself, construct a probabilistic replica of w.
34.19. 34. 13 j In the context of the discussion O fl p. 252, let !T be the u-field of sets of the form e X A for A E g:: Show that under the probability measure Q, i o is the conditional expected value of g0 given !F.
6
34.20. (a) In Example 34.4, take rr to have der.sity e - over 0 = (O, oo). S how by Theorem 34.7 that T is a minimal sufficient statistic (in the sense that u(T) is minimal). (b) Let P6 be the distribution for samples of size n from a normal distribution with parameter (J = (m, u 2 ), u 2 > 0, and let rr put unit mass at (0, 1 ). Show that the sampie mean and variance form a minimal sufficient statistk:.
SECTION 35. MARTINGALES Definition
Let X1 , X2 , be a sequence of random variables oo a probability space (f!, !F, P), and let �. !F2 , be a sequence of u-fields in !F. The sequence {( X", .9,;): n 1, 2, . . . } is a martingale if these four conditions hold: •
•
•
•
•
•
=
(i) � c .9� + 1; (ii) xn is measurable �; (iii) E[IX" Il < oo; (iv) with probability 1, (35 . 1 ) Alternatively, the sequence X1 • X2 , is said to be a martingale relative to Condition (i) is expressed by saying the � form a the u-fields .9j, !F2 , filtration and condition (ii) by saying the X" are adapted to the filtration. If X" represents the fortune of a gambler after the nth play and � represents his information about the game at that time, (35.1) says that his expected fortune after the next play is the same as his present fortune. Thus a martingale represents a fair game, and sums of independent random variables with mean 0 give one example. As will be seen below, martingales arise in very diverse connections. • •
• • •
•
•
SECTION 35.
MARTINGALES
459
is defined to be a martingale if it is a martingale The sequence Xp X2 , relative to some sequence .9;, .9;, . . . . In this case, the u-fields .§, = u( X1 , X" ) always work: Obviously, .#, c .§, + 1 and X" is measurable .§, , and if (35.1) holds, then E[ Xn + l ll�l = £[£[Xn + 1 ll�lll.§, l = E[Xnll -§, 1 = X" by (34.5). For these special u-field s .§, , (35.1) reduces to •
•
•
•
•
•
,
(35.2) Since u( X1 , , X") c � if and only if X" is measurable � for each n , the u(X 1 , , X) are the smallest u-fields with respect to which the X" are a martingale. The essential condition is embodied in (35.1) and in its specialization (35.2). Condition {iii) is of course needed to ensure that E[ Xn + l ll�l exists. Condition (iv) says that X" is a version of E[ X"+ 1 11 �]; since X" is measurable �' the requirement reduces to •
•
•
•
•
•
(35 .3)
·
·
Since the � are nested, A E � implies that fAXn dP = fAXn + l dP = = fAXn +k dP. Therefore, X,, being measurable �, is a version of
·
E[X" +k ll�] : (35 .4)
with probability 1 for k > 1. Note that for A = f!, (35.3) gives (35 .5) The defining conditions for a martingale can also be given in terms of the differences (35 .6)
(d 1 = X1). By linearity, (35.1) is the same thing as (35 .7) Note that, since xk = d I + . . . +dk and dk = xk - xk - 1 > the sets Xp . . . , X, and d 1 > . . . , d " generate the same u-field: (35 .8) Let d 1 , d 2 , . . . be independent, integrable random vari ables such that E[d " ] = 0 for n > 2. If � is the u-field (35.8), then by independence E[d n + t ii�] = E[dn + l ] = 0. If d is another random variable, Example 35.1.
460
DERIVATIVES AND CONDITIONAL PROBABILITY
independent of the d" , and if � is replaced by u(d, d I > , � " ), then the + d" are still a martingale relative to the �- It is natural and X" d 1 + convenient in the theory to allow u-fields � larger than the minimal ones (35.8). • •
·
=
·
•
•
·
35.2. Let (f!, !F, P ) be a probability space, let v be on !F, and let .9;, .9;_, . . . be a nondecreasing sequence of
Example
a finite measure u-fields in !F. Suppose that P dominates v when both are restricted to �-that is, suppose that A E � and P(A) 0 together imply that v(A) 0. There is then a density or Radon-Nikodym derivative X" of v with respect to P when both are restricted to �; X" is a function that is measurable � and integrable with respect to P, and it satisfies =
=
A E �.
( 35 .9)
=
If A E � ' then A E �+ I as well, so that fA Xn + I dP v(A); this and (35.9) give (35.3). Thus the X" are a martingale with respect to the �• For a specialization of the preceding example, let P be Lebesgue measure on the u-field !F of Borel subsets of f! (0, 1], and let � be the finite u-field generated by the partition of f! into dyadic intervals (k2- " , (k + 1)2 - " ], 0 k < 2 " . If A E � and P(A) 0, then A is empty. Hence P dominates every finite measure v on �- The Radon-Nikodym derivative is
Example
35.3.
=
<
v .'l.
( 35 .1 0 )
n
( ) W
=
=
v ( k 2- " , ( kn+ 1)2- " ] --::::2 o-c ---'-- ·--=-
There is no need here to assume that P dominates v when they are viewed as measures on all of !F. Suppose that v is the distribution of E� = 1 Zk 2 - k for independent Zk assuming values 1 and O with probabilities p and 1 - p. This is the measure in Examples 31.1 and 31.3 (there denoted by JL), and for p =I= �' v is singular with respect to Lebesgue measure P. It is nonetheless absolutely continuous with respect to P when both are restricted • to �-
Example 35.4. For another specialization of Example 35.2, suppose that v
is a probability measure Q on !F and that � = u(YI > . . . , Y,) for random variables Y1 , Y2 , . . . on (f!, !F ). Suppose that under the measure P the distribution of the random vector ( Yp . . . , Y,) has density Pn( y 1 , , y" ) with respect to n-dimensional Lebesgue measure and that under Q it has density q "(y I > , Yn). To avoid technicalities , assume that Pn is everywhere positive. •
•
•
•
•
•
SECTION
35.
MARTINGALES
461
Then the Radon-Nikodym derivative for Q with respect to P on .9,; 1 s
(35 . 1 1) To see this, note that the general element of � is [(YI > . . . , Y" ) E H], H E !Jr; by the change-of-variable formula,
In statistical terms, (35.11) is a likelihood ratio: Pn and q" are rival densities, and the larger X" is, the more strongly one prefers q " as an explanation of the observation (Y1 , , Y" ). The analysis is carried out under the assumption that P is the measure actually governing the Y" ; that is, X" is a martingale under P and not in general under Q. In the most common case the }'n are independent and identically dis tributed under both P and Q: P n( Y 1, • • • , Yn ) = p( y 1 ) • p( y" ) and q" (y 1, • • • , Yn ) = q(y 1 ) q(y" ) for densities p and q on the line, where p is assumed everywhere positive for simplicity. Suppose that the measures corre sponding to the densities p1 and q are not identical, so that P[Yn E H ] =1= Q[ J'n E H] for some H E � • If z" = ll Yn e H J • then by the strong law of large numbers, n - 1 L� = I z k converges to P[ Y1 E H ] on a set (in L
•
•
•
•
•
•
•
Example 35.5. Suppose that Z is an integrable random variable on (f!, 9"", P) and that � are nondecreasing u-fields in !F. If (35. 12) then the first three conditions in the martingale definition are satisfied, and by (34.5), E[Xn + t ll�l = £[£[ ZII � + 1 lll�l = E[ZII�l = X" " Thus X" is a martingale relative to � · •
Example 35.6. Let Nn k • n, k = 1, 2, . . . , be an independent array of iden tically distributed random variables assuming the values 0, 1, 2, . . . . Define Z0, Z p Z 2 , . . . inductively by Z0(w) = 1 and Z n ( w ) = Nn)w) + · · · +Nn, z, _ 1 ( w )(w); Z"(w) = O if Zn _ lw) = O. lf Nnk is thought of as the number of progeny of an organism, and if Zn _ 1 represents the size at time n - 1 of a population of these organisms, then Z" represents the size at time n. If the expected number of progeny is E[ Nnk l = m, then E[Zn11Zn - I l = " Z" _ 1m , so that Xn = Zn /m , n = 0, 1 , 2, . . . , is a martingale. The sequence Z0, Z I > . . . is a branching process. •
462
DERIVATIVES AND CONDITIONAL PROBABILITY
In the preceding definition and examples, n ranges over the positive integers. The definition makes sense if n ranges over 1, 2, . . . , N; here conditions (ii) and (iii) are required for 1 n N and conditions (i) and (iv) only for 1 n < N. It is, in fact, clear that the definition makes sense if the indices range over an arbitrary ordered set. Although martingale theory with an interval of the line as the index set is of great interest and importance, here the index set will be discrete.
< <
<
Submartingales
·
Random variables X" are a submartingale relative to u-fields � if (i), (ii), and (iii) of the definition above hold and if this condition holds in place of (iv). (iv') with probability 1,
(35 . 13) As
before, the X" are a submartingale if they are a submartingale with respect to some sequence �, and the special sequence � u(Xp . . . , X" ) works if any does. The requirement (35.13) is the same thing as =
A E �.
( 35 .14)
This extends inductively (see the argument for (35.4)), and so
( 35 .15) for k > 1. Taking expected values in (35. 15) gives
(35 .16) Example 35. 7. Suppose that the d " are independent and integrable, as in Example 35.1, but assume that E[d " 1 is for n > 2 nonnegative rather than 0. • Then the partial sums d + · · · + d " form a submartingale. 1
Example 35.8. Suppose that the X" are a martingale relative to the .9,; . Then I X"I is measurable .9,; and integrable, and by Theorem 34 .2(iv), E[IXn + 1 11�1 > I E[Xn + 11�11 = IXJ Thus the IX" I are a submartingale rela tive to the �· Note that even if X I > . . . , X" generate �, in general IX 1, . . . , I X" I will generate a u-field smaller than � · • 1
1
1
Reversing the inequality in (35.13) gives the definition of a supennartin gale. The inequalities in (35.14), (35.15), and (35. 16) become reversed as well. The theory for supermartingales is of course symmetric to that of submartin gales.
SECTION
35.
MARTINGALES
463
Gambling
Consider again the gambler whose fortune after the nth play is X" and whose information about the game at that time is represented by the u-field �· If � = ( X 1 , X" ), he knows the sequence of his fortunes and nothing else, but � could be larger. The martingale condition (35.1) stipulates that his expected or average fortune after the next play equals his present fortune, and so the martingale is the model for a fair game. Since the condition (35.13) for a submartingale stipulates that he stands to gain (or at least not lose) on the average, a submartingale represents a game favorable to the gambler. Similarly, a supermartingale represents a game unfavorable to the gambler. t Examples of such games were studied in Section 7, and some of the results there have immediate generalizations. Start the martingale at n 0, X0 representing the gambler's initial fortune. The difference d" = X" - X" _ 1 represents the amount the gambler wins on the nth play,* a negative win being of course a loss. Suppose instead that d " represents the amount he wins if he puts up unit stakes. If instead of unit stakes he wagers the amount W, on the nth play, W" d " represents his gain on that play. Suppose that W, > 0, and that W, is measurable � - I to exclude prevision: Before the nth play the information available to the gambler is that in � - I > and his choice of stake W, must be based on this alone. For simplicity take W" bounded. Then Jf� d " is integrable, and it is measurable � if d " is, and if X" is a martingale, then E[ W, d " ll�_ 1 ] W, £[d " ll�_ 1 ] = 0 by (34.2). Thus u
, • • •
=
=
( 35.17 ) is a martingale relative to the �· The sequence Wp W2, . . . represents a betting system, and transforming a fair game by a betting system preserves fairness; that is, transforming xn into (35.17) preserves the martingale property. The various betting systems discussed in Section 7 give rise to various martingales, and these martingales are not in general sums of independent random variables-are not in general the special martingales of Example 35.1. If W" assumes only the values 0 and 1, the betting system is a selection system; see Section 7. If the game is unfavorable to the gambler-that is, if xn is a supermartin gale-and if W, is nonnegative, bounded, and measurable � - I • then the same argument shows that (35.17) is again a supermartingale, is again unfavorable. Betting systems are thus of no avail in unfavorable games. The stopping-time arguments of Section 7 also extend. Suppose that {X" } is a martingale relative to {�}; it may have come from another martingale tThere is a reversal of terminology here: a subfair game (Section 7) is against the gambler, while a submartingale favors him. t The notation has, of course, changed. The F,, and X of Section 7 have become Xn and An. n
DERIVATIVES AND CONDITIONAL PROBABILITY
464
via transformation by a betting system. Let r be a random variable taking on nonnegative integers as values, and suppose that (35 .18)
[r = n ] E � .
If r is the time the gambler stops, [ r = n] is the event he stops just after the nth play, and (35.18) requires that his decision is to depend only on the information � available to him at that time. His fortune at time n for this stopping rule is Xn* =
(35 .19)
(
X" XT
<
if n r, if n > r.
Here X7 (which has value Xr( w )(w ) at w) is the gambler's ultimate fortune, and it is his fortune for all times subsequent to r. The problem is to show that Xd', X( , . . . is a martingale relative to .9Q, 9;, . . . . First, n- 1
E[I X; Ij = L 1
IXkl dP +
k = O [T = k )
<
f[ T :;:>: n)I Xn l dP <
n
L E[ I Xkl] < oo.
k =O
Since [ r > n] = f! - [ r n] E �' n
[ X"* E H ] = U [ r = k , Xk E H ] U [ r > n , X" E H ) E � . k=O
Moreover,
and
Because of (35.3), the right sides here coincide if A E �; this establishes (35.3) for the sequence X(, Xi, . . . , which is thus a martingale. The same kind of argument works for supermartingales. Since x; = XT for n 2. r, x; � X7• As pointed out in Section 7, it is not always possible to integrate to the limit here. Let X" = a + d I + . . . + d where the d" are independent and assume the values + 1 with probability � + d" = 1. Then (X0 = a ), and let r be the smallest n for which d 1 + E[ X; l = a and XT = a + 1. On the other hand, if the X" are uniformly bounded or uniformly integrable, it is possible to integrate to the limit: E[ XT ] = E[ X0 ]. n'
·
·
·
SECTION 35.
MARTINGALES
465
Functions of Martingales
Convex functions of martingales are submartingales: (i) If X I > X2 , • • • is a martingale relative to .9;, ,9S, . . . , tf cp is convex, and tf the cp(X,) are integrable, then cp(X1), cp(X2 ), . . . is a submartingale relative to .9;, ,9S. (ii) If Xp X2 , . . . is a submartingale relative to .9;, !7"2 , . . . , tf cp is nonde Theorem 35.1.
creasing and convex, and if the cp(X,) are integrable, then cp(X1 ), cp(X2 ), • • • is a submartingale relative to .9;, ,9S, . . . .
In the submartingale case, X, < E[Xn + 1 11 �1, and if cp is non decreasing, then cp(X,) < cp(E[X, + 1 11�]). In the martingale case, X, = E[X, + 1 11�1, and so cp(X,) cp(E[X, + 1 11�]). If cp is convex, then by Jensen's inequality (34.7) for conditional expectations, it follows that cp(£[ X, + 1 11 �]) < PRooF.
=
II
£[cp(X, + , )II�1.
Example 35.8 is the case of part (i) for cp(x) lxl. =
Stopping Times
Let r be a random variable taking as values positive integers or the special value oo. It is a stopping time with respect to {�} if [" = k 1 E .9\ for each finite k (see (35.18)), or, equivalently, if [ r < k 1 E .9\ for each finite k. Define (35 .20)
.9:, = [ A E .9": A n [ < k ] E .9\ ' 1 < k < oo] .
'
7'
This is a u-field, and the definition is unchanged if [ r < k] is replaced by [ r = k 1 on the right. Since clearly [ r = j] E .9:, for finite j, r is measurable
.9:,.
If r(w) < oo for all w and � = u(X 1 , . . . , X,), then Iiw) = Iiw') for all A in .9:, if and only if X;(w) X;(w') for i < r(w) r(w'): The information in .9:, consists of the values r(w), Xlw), . . . , XT( w )(w). Suppose now that r 1 and r2 are two stopping times and r 1 < r 2 • If A E .9:, 1 , then A n [r 1 < k1 E S\ and hence A n [r 2 < k1 = A n [r 1 < k1 n (r 2 < k 1 E .9\ : .9:,1 C .9:, • 2 =
=
If X 1 , • • • , X, is a submartingale with respect to .9;, . . . , .9,; and r r 2 are stopping times satisfying 1 < r < r 2 < n, then X7 1 , X7 2 is a submartingale with respect to .9:,1 .9;2 • Theorem 35.2. 1,
1
DERIVATIVES AND CONDITIONAL PROBABILITY
466
This is the optional sampling theorem. The proof will show that X7 1 , X7 2 is a martingale if Xp . . . , X" is. Since the X7. are dominated by EZ = 1 1Xkl, they are integrable. It is required to show that E[XT 2 liS:T ] > XT J , or PROOF.
I
1
A E -9;." .
(35 .21)
I
ButA E .9;1 implies thatA n [r l < k < r 2 ] = (A n [r l < k - l]) n [r 2 < k - lY lies in .9k_ 1 • If D.k = Xk - Xk _ 1 , then n
fA ( XT2 - XT J dP = fA kL= l I[ T J < k � ·21/), k dP n
= Lf D.k dP 2. 0 k = l A n [ T 1 < k ,;; T 2 ) by the submartingalc property.
•
Inequalities
There are two inequalities that are fundamental to the theory of martingales. Theorem 35.3.
If X1 , • • • , X" is a submartingale, then for a > 0,
(35 .22) This extends Kolmogorov's inequality: If S I > 5 2 , • • • are partial sums of independent random variables with mean 0, they form a martingale; if the variances are finite, then S�, Si, . . . is a submartingale by Theorem 35.1(i), and (35.22) for this submartingale is exactly Kolmogorov's inequality (22.9). Let r 2 = n; let r 1 be the smallest k such that Xk > a, if there is one, and n otherwise. If Mk = max k X; , then [M" > a] n [r 1 < k] = [Mk > a] E .9k, and hence [M" > a] is in .9;."1 • By Theorem 35.2, PROOF.
; ,;;
(35 .23)
<1
x: dP < E [ x;]
[ Mn �a)
< E [ I X".I] .
•
This can also be proved by imitating the argument for Kolmogorov's inequality in Section 23. For improvements to (35.22), use the other integrals
SECTION 35.
467
MARTINGALES
in (35.23). If X 1 , , X" is a martingale, 1IX1 1, . . . , IX" I is a submartingale, and so (35.22) gives ?[max; < " IX; I > a] < a - £[1Xnll · The second fundamental inequality requires the notion of an upcrossing. Let [a, f3] be an interval (a < 13) and let X 1 , X" be random variables. Inductively define variables r p r 2 , : •
•
•
•
•
•
•
•
,
•
r 1 is the smallest j such that 1 < j < n and Xi < a, and is n if there is no such j; rk for even k is the smallest j such that rk - l <j < n and Xi > /3, and is n if there is no such j; rk for odd k exceeding 1 is the smallest j such that rk <j < n and Xi < a, and is n if there is no such j. _1
The number U of upcrossings of [a, 131 by X 1 , . . . , X" is the largest i such that X < a < t3 < X . In the diagram, n = 20 and there are three upcrossmgs. T 21
T 2, - l .
I Jr\.
v �r\ /1\
a
'V
/1\.
\V
'
\
�
'
v
I
v\
\\JI
I\.
I
\
I I
T7
=
Tg
=
'
'
For a submartingale X p . . . , X" , the number U of upcross ings of [a, 131 satisfies Theorem 35.4.
(35 .24 ) Let Yk = max{O, Xk - a} and () = t3 - a. By Theorem 35.1(ii), Y1 , , Y,. is a submartingale. The rk are unchanged if in the definitions xj < a is replaced by 1j = 0 and xj � t3 by lj � (), and so u is also the number of upcrossings of [0, 0] by Yp . . . , Y,. . If k is even and rk- l is a stopping time, then for j < n, j- 1 PRooF. •
•
•
[rk = j) = U [ rk - l = i, Y;+ I < O , . . . , l:J - I < O , lj � O ] i= I
468
DERlVATIVES AND CONDITIONAL PROBABILITY
lies in � and [ rk = n] = [ r� < n - 1Y lies in �, and so rk is also a stopping time. With a similar argument for odd k, this shows that the rk are all stopping times. Since the rk are strictly increasing until they reach n, " = n. Therefore, "
n
Yn = YT, > YT" - YTI = L ( YT• - YT• - , ) k= 2
=
L e + La ,
where Le and La are the sums over the even k and the odd k in the range 2 < k < n. By Theorem 35.2, La has nonnegative expected value, and there fore, E[ Ynl � E[ LJ If YTV -I = 0 < () < Y7 2; (which is the same thing as X7 2; _ 1 < a < {3 < X7 2), then the difference Y72i - Y,. 2i - l appears in the sum Le and is at least 0. Since there are U of these differences, Le > OU, and therefore E[ Y"] � OE[U ]. In terms of the originai variables, this is
( {3 - a ) E[U ] < j
[ X, > a)
( X" - a) dP < E [ I X)] + lal.
•
In a sense, an upcrossing of [a, {3 ] is easy: since the Xk form a submartin gale, they tend to increase. But before another upcrossing can occur, the sequence must make its way back down below a, which it resists. Think of the extreme case where the Xk are strictly increasing constants. This is reflected in the proof. Each of le and La has nonnegative expected value, but for Le thr proof uses the stronger inequality E[ Le] > E[OU ]. Convergence Theorems
The martingale convergence theorem, due to Doob, has a number of forms. The simplest one is this:
Let X I > X2 , be a submartingale. If K = sup" E[I X"� < then X" � X with probability 1, where X is a ran dom variable satisfying E[ I X Il < K. Theorem 35.5.
•
•
•
oo ,
a and {3 for the moment, and let U" be the number of upcrossings of [ a, {3 ] by X1, . . . , X". By the upcrossing theorem, E[U"] < ( E[IX"I] + iai)j({3 - a) < ( K + lal)/({3 - a). Since Un is nondecreasing and PRooF.
Fix
E[U" ] is bounded, it follows by the monotone convergence theorem that sup" U" is integrable and hence finite-valued almost everywhere. Let X * and X * be the limits superior and inferior of the sequence X I> X2 , . . . ; they may be infinite. If X * < a < {3 < X * , then U" must go to infinity. Since U" is bounded with probability 1, P[ X* < a < {3 <X*] = 0.
SECTION 35.
MARTINGALES
469
Now [ X * < X * ] = U [ X * < a < {3 < X * ] ,
(35 .25 )
where the union extends over all pairs of rationals a and {3. The set on the left therefore has probability 0. Thus X * and X * are equal with probability 1, and X" converges to their common value X, which may be + oo. By Fatou's lemma, £[1 X I] < • lim inf" E[I X"Il < K. Since it is integrable, X is finite with probability 1. If the X" forn 1 a martingale, then by (35. 16) applied to the submartingale I X 1 1, I X2 1, . . the E[IXnll are nondecreasing, so that K = lim " E[IXnll· The hypothesis in the theorem that K be finite is es<;ential: If X" = d I + . . . + d n, where the d " are independent and assume values + 1 with probability i, then X" does not converge; £[1Xnll goes to infinity in tpis case. If the X" form a nonnegative martingale, then E[IX"Il = E[ X"] E[ X 1 ] by (35.5), and K is necessarily finite. .
=
35.9. The X" in Example 35.6 are nonnegative, and so X" = Example " 2"/m � X , where X is nonnegative and integrable. If m < 1, then, since Z" is an integer, 2" = 0 for large n, and the population dies out. In this case, X = 0 with probability 1 . Since E[ X"] = E[X0] 1, this shows that E[ X"] � =
•
E[X] may fail in Theorem 35.5.
Theorem 35 .5 has an important application to the martingale of Example 35.5, and this requires a lemma.
If 2 is integrable and � are arbitrary u fields , then the random variables £[ 2 11�1 are uniformly integrable. Lemma.
-
For the definition of uniform integrability, see (16.21). The � must, of course, lie in the u-field S', but they need not, for example, be nondecreas. mg. Since I E[ZII�ll < £[121 11�], 2 may be assumed nonnegative. Let Aan = [ £[2119,;] > a ] . Since Aan E 9,; PROOF OF THE LEMMA.
jA�E [ 2 ll� ] dP = jA�2 dP . It is therefore enough to find, for given E, an a such that this last integral is less than E for all n. Now fA 2dP is, as a function of A, a finite measure dominated by P; by the E -8 version of absolute continuity (see (32.4)) there is a1 8 such that P(A) < 8 implies that JA 2 dP < E. But P[ £[ 2119,;1 > a] < a - £[ £[ 211�11 = a - 1 £[ 2] < 8 for large enough a. •
470
DERIVATIVES AND CONDITIONAL PROBABILITY
Suppose that � are u-fields satisfying .9; c .9i c · · · . If the union u� I � generates the u-field �' this is expressed by � i �- The require ment is not that � coincide with the union, but that it be generated by it. �
If � i � and Z is integrable, then
Theorem 35.6.
E [ ZII� ] � E [ ZII� ] .
(35 .26 ) with probability 1.
PRooF. According to Example 35.5, the random variables X" = E[ZII�l form a martingale relative to the �- By the lemma, the X11 are uniformly integrable. Since E[I X" Il <:: E[IZI], by Theorem 35.5 the X" converge to an integrable X. The problem is to identify X with E[ ZII�]. Because of the uniform integrability, it is possible (Theorem 16.14) to integrate to the limit: fA XdP = lim " fA Xn dP. If A E � and n � k, then fAXn dP = fA E[Zil�]dP = fAZdP. Therefore, fA XdP = fAZdP for all A in the 7T-system u;= � ; since X is measurable �' it follows by Theorem 34.1 that X is a version of E[ZII�]. • 1
Applications: Derivatives
Suppose that (f!, !F, P) is a probability space, v is a finite measure on !F, and � i S:, c !F. Suppose that P dominates v on each � ' and let X" be the corresponding Radon -Nikodym derivatives. Then X" � X with probability 1, where X is integrable. Theorem 35.7.
(i) If P dominates v on S:,, then X is the corresponding Radon -Nikodym
derivative . {ii) If P and v are mutually singular on �' then X = 0 with probability 1. The situation is that of Example 35.2. The density X" is measur able � and satisfies (35.9). Since X" is nonnegative, E[IX" Il = E[ X" ] = v(f!), and it follows by Theorem 35.5 that X" converges to an integrable X. The limit X is measurable S:,. Suppose that P dominates v on � and let Z be the Radon-Nikodym derivative: Z is measurable �, and fAZ dP = v(A) for A E �. It follows that fAZ dP = fA Xn dP for A in �, and so X" = E[ZII�]. Now Theorem 35.6 implies that X" � E[ZIIS:,] = Z. Suppose, on the other hand, that P and v are mutually singular on S:,, so that there exists a set S in � such that v(S) = 0 and P(S) = 1. By Fatou's lemma fAXdP < lim inf nJAXn dP. If A E � ' then fA Xn dP = v(A) for n > k, and so fA XdP < v(A) for A in the field u; = I � . It follows by the monotone class theorem that this holds for all A in S:,. Therefore, JX dP fs X dP v ( S ) = 0, and X vanishes with probability 1. • PROOF.
=
<
SECTION 35.
471
MARTINGALES
Example 35.10. As in Example 35.3, let v be a finite measure on the unit interval with Lebesgue measure (f!, !F, P). For .9,; the u-field generated by the dyadic intervals of rank n , (35.10) gives X" . In this case .9,; j � !F. For each w and n choose the dyadic rationals a "( w ) = k 2 - " and b"(w) = (k + 1)2 - " for which a iw ) < w < b"(w). By Theorem 35.7, if F is the =
distribution function for
v,
then
(35 .27)
except on a set of Lebesgue measure 0. According to Theorem 3 1.2, F has a derivative F' except on a set of Lebesgue measure 0, and since the intervals (aiw ), b/w)] contract to w, the difference ratio (35.27) converges almost everywhere to F'(w) (see (31.8)). This identifies X. Since (35.27) involves intervals (a n<w ), b"(w )] of a special kind, it does not quite imply Theorem 31.2. By Theorem 35 .7, X = F' is the density for v in the absolutely continuous case, and X = F' = 0 (except on a set of Lebesgue measure 0) in the singular case, facts proved in a different way in Section 3 1 . The singular case gives • another example where E[Xn] � E[ X] fails in Theorem 35.5. Ukelihood Ratios
Return to Example 35.4: v = Q is a probability measure, 9,; = u( Yp . . . , Y,) for random variables Y,, and the Radon-Nikodym derivative or likelihood ratio Xn has the form (35.1 1) for densities Pn and qn on R " . By Theorem 35 .7 the Xn converge to some X which is integrable and measurable � = u ( Yp Y2 , ). If the Yn are independent under P and under Q, and if the densities are different, then P and Q are mutually singular on u( Yp Y2 , ), as shown in Example 35.4. In this case X = 0 and the likelihood ratio converges to 0 on a set of ?-measure 1 . The statistical relevance of this is that the smaller X" is, the more strongly one prefers P over Q as an explanation of the observation ( Y1 , , Yn), and Xn goes to 0 with probability 1 if P is in fact the measure governing the Yn. It might be thought that a disingenuous experimenter could bias his results by stopping at an Xn he likes-a large value if his prejudices favor Q, a small value if they favor P. This is not so, as the following analysis shows. For this argument P must dominate Q on each 9,; = u( Yp . . . , Y"), but the likelihood ratio Xn need not have any special form. Let r be a positive-integer-valued random variable representing the time the experimenter stops. Assume that r is finite, and to exclude prevision, assume that it is a stopping time. The u-field .9::, defined by (35.20) repre sents the information available at time r, and the problem is to show that XT • • •
•
•
•
•
•
•
472
DERIVATIVES AND CONDITIONAL PROBABILITY
is the likelihood ratio (Radon -Nikodym derivative) for Q with respect to P on .9:,. First, X� is clearly measurable .9:,. Second, if A E 5':,, then A n [ r = n ] E .9,";, and therefore 00
00
f XT dP = nL= 1 fA nh= n )xn dP = nL= Q( A n [ A
1
T
= n ] ) = Q( A ) '
as required. Reversed Martingales '
A left-infinite sequence . . . , X _ 2 , X _ 1 is a martingale relative to ufields . . . , !F_ 2 , !?_ 1 if conditions (ii) and (iii) in the definition of martingale are satisfied for n < - 1 and conditions (i) and (iv) are satisfied for n < - 1 . Such a sequence is a reversed or backward martingale.
For a reversed martingale, lim n -+ oo X- n = X exists and is integrable, and E[X] = E[ X_n] for all n . Theorem 35.8.
PRooF. The proof is almost the same as that for Theorem 35.5. Let X* and X * be the limits superior and inferior of the sequence X - I > X _ 2 , Again (35.25) holds. Let Un be the number of upcrossings of [a, {3] by X - n ' . . , , X _ 1 • By the upcrossing theorem, E[Unl < (E[ I X_ 1 1] + l aD /(/3 - a ). Again E[ Un ] is bounded, and so supn Un is finite with probability 1 and the sets in (35.25) have probability 0. Therefore, limn -+ OO X - n = X exists with probability 1 . By the property (35 .4) for martingales, X_ n = Ef X_ 1 11 !F_ n] for n = 1, 2, . . . . The lemma above (p. 469) implies that the X - n are uniformly integrable. Therefore , X is integrable and E[ X ] is the limit of the E[X_n]; these all have the same value • by (35.5). •
If .o/,; are u-fields Satisfying .9; :::> .9; . . . , then the intersection n:= 1 .o/,; � is also a u-field, and the relation is expressed by .9,"; � � · Theorem 35.9.
( 35 .28)
•
•
=
If .9,"; � � and Z is integrable, then E [ Z ll� ]
�
£[ Z ll�]
with probability 1 . PRooF. If X_n = E[Z II�], then . . . , X_ 2 , X_ 1 is a martingale relative to . . . , !F2, .9; . By the preceding theorem, E[ Z II�l converges as n --? oo to an integrable X and by the lemma, the E[ Z II�l are uniformly integrable. As the limit of the E[Z II�l for n z k, X is measurable �; k being arbitrary , X is measurable � ·
•
SECTION
35.
473
MARTINGALES
By uniform integrability, A E � implies that
j XdP A
=
lim n E[ Zll� ] dP = lim n E( £ [ Zll9,; ] II�] dP
=
lim n E[ Zll�) dP =
j
j
A
A
j
A
j E[ Zll� ] dP. A
Thus X is a version of E[ZII�].
•
Theorems 35.6 and 35.9 are parallel. There is an essential difference between Theorems 35.5 and 35.8, however. In the latter, the martingale has a last random variable, namely X _ 1 , and so it is unnecessary in proving convergence to assume the £[ 1 Xnll bounded. On the other hand, the proof in Theorem 35.8 that X is integrable would not \'!Ork for a submartingale. Applications: de Finetti's Theorem
< <
Let 0, Xp X2 , be random variables such that 0 () 1 and, conditionally on (), the X� are independent and assume the values 1 and 0 with probabili ties () and 1 (): for u 1 , , un a sequence of O 's and 1 's, • • •
•
-
•
•
(35 .29 ) where s = u 1 +
·
·
·
+ un .
be independent random vari To see that such sequences exist, let 8, Z1, Z 2 , ables, where (J has an arhitrarily prescribed distribution supported by [0, 1] and the zn are uniformly distributed over [0, 1]. Put Xn l[z , , 6 1 . If, for instance, f(x) = x(1 - x ) = P[Z 1 < x, Z 2 > x], then P[X1 = 1, X2 = 0!IB)., =f(8(w)) by (33.13). The obvious extension establishes (35.29). • • •
=
Integrate (35.29):
(35 .30 ) Thus {Xn} is a mixture of Bernoulli processes. It is clear from (35.30) that the Xk are exchangeable in the sense that for each n the distribution of (X 1 , , Xn) is invariant under permutations. According to the following theorem of de Finetti, every exchangeable sequence is a mixture of Bernoulli sequences. • • •
If the random variables Xp X2 , are exchangeable and take values 0 and 1, then there is a random variable () for which (35.29) and (35.30) hold. Theorem 35.10.
• • •
474
D ERIVATIVES AND CONDITIONAL PROBABILITY
t < m, then
Let sm = XI + . . . +Xm . If
PROOF.
+ um = t. By where the sum extends over the sequences for which u 1 + exchangeability, the terms on the right are all equal, and since there are of them, ·
Suppose that s < n :-::; m and u 1 + un + J > • • . , um that sum to t - s:
where
n
·
·
·
+u n =
s
<
t < m;
/
·
·
{ 7)
add out the
fn.s. m (x) = Yl {x - .!:._) m ll 1 { 1 - x - .!:._) m YI1 ( 1 - .!:._ m_). i=O
i=O
i=O
The preceding equations still hold if further constraints s m + I = t ,, . . . 'sm +j = tj are joined to sm = t. Therefore, P[XI = U p . . . ' xn = un i!Sm, . . . ' sm +) = fn . s m(Sm /m). Let � u(Sm, sm + I > . . . ) and ../ nm �- Now fiX n and U p . . . ' u n , and +un = s. Let j � oo and apply Theorem 35.6: P[X1 = suppose that u 1 + U p . . . ' xn = u nll�] = fn s m < Sm/m). Let m � 00 and apply Theorem 35.9: '
=
·
·
·
•
•
holds for w outside a set of probability 0. Fix such an w and suppose that {Sm(w)/m} has two distinct limit points. Since the distance from each Sm(w)jm to the next is less than 2/m, it follows that the set of limit points must fill a nondegenerate interval. But limk x m = x implies limk fn s m (xm ) = x5(1 - x)n - s, and so it follows further that x5(1 - x) n - s must be constant over this interval, which is impossible. Therefore, S m( w)/m must converge to some limit 6( w ). This shows that P[X 1 u1, • • • , Xn = u nll../] = 65(1 - O) n - s with probability 1. Take a condi tional expectation with respect to u(O), and (35.29) follows. • k.
=
t t
k.
k.
SECTIO N
35.
475
MARTINGALES
Bayes Estimation
From the Bayes point of view in statistics, the 0 in (35.29) is a pa rameter governed by some a priori distribution known to the statistician. For given X� > . . . , Xn, the Bayes estimate of 0 is E[ OIIX�> . . . , Xn]. The problem is to show that this estimate is consistent in the sense that
(35 .31) with probability 1. By Theorem 35.6, E[OIIX I > . . . , Xn] -7 £[011.9,;,], where S:, = u(Xp X2 , ), and so what must be shown is that E[ OIIS:,] 0 with probability 1. By an elementary argument that parallels the unl:onditional case, it follows from (35.29) for Sn = X 1 + +Xn that E( SniiO) = nO and E[(Sn - n0) 2 IIO) - !Sn - 0) 2 ] < n - 1 , and by Chebyshev 's inequality = n O( l - 0). Hence E[(n n - 1 S n converges in probability to 0. Therefore (Theorem 20.5), lim k n ;; 1S nk = 0 with probability 1 for some subsequence. Thus 0 = 0' with probability 1 for a 0' measurable S:,, and E[OIIS:,] = E[O'IIS:,] = 0' = 0 with probability 1. =
• • •
·
·
·
A Central Limit Theorem •
is a martingale relative to .9;, .9; , . . . , and put Y,. = Xn Suppose X�> X2 , - Xn - l ( Y1 = X1 ), so that • • •
(35 .32 ) View Yn as the gambler's gain on the n th trial in a fair game. For example, if dl, d7.. , " ' are independent and have mean 0, � = u( d l , . . . . �n), wn is measurable � - I • and Y,. = Wn dn, then (35.32) holds (see (35.17)). A special ization of this case shows that Xn = L:Z Yk need not be approximately normally distributed for large n. =
1
Example 35.11. Suppose that dn takes the values + 1 with probability � each and W1 = 0, and suppose that Wn = 1 for n > 2 if d 1 = 1, while W, = 2 for n > 2 if d 1 = - 1. If Sn = d 2 + + dn, then Xn is Sn or 2Sn according as d 1 is + 1 or - 1. Since Sn/ Vn has approximately the standard normal · · ·
distribution, the approximate distribution of Xn/ Vn is a mixture, with equal weights, of the centered normal distributions with standard deviations 1 and 2. • To understand this phenomenon, consider conditional variances. Suppose for simplicity that the Yn are bounded, and define
(35 .33 ) * This topic, which requires the limit theory of Chapter 5, may be omitted.
476
DERIVATIVES AND CONDITIONAL PROBABILITY
(take � = {0, f!}). Consider the stopping times
[
(35 .34)
v1 = min n :
t uf > t j .
k= l
Under appropriate conditions, Xv ;fi will be approximately normally distributed for large t. Consider the preceding example. Roughly: If d 1 + 1, then '£Z= 1uf = n - 1, and so v1 ""' t and Xv , ;fi ""' S, ;..fi; if d 1 = - 1, then '£Z = 1 uf = 4( n - 1), and so v, ::::: t /4 and Xv, ! fi ::::: 2S1 1 4 j/i = S, 14 / Jt j4 . In either case, Xv ;/i approximately follov.. s the standard !10rmal law. If the nth play takes un2 units of time, then v1 is essentially the number of plays that take place during the first t units of time. This change of the time scale stabilizes the rate at which money changes hands. r
=
r
Suppose the yn = xn - xn - I are uniformly bounded and satisfy (35.32), and assume that 'Lnun2 = oo with probability 1. Then Xv 1 /i N. Theorem 35.11.
r
=
This will be deduced from a more general result, one that contains the Lindeberg theorem. Suppose that, for each n, Xn�> Xn 2 , is a martingale with respect to �� 1 , � 2 , . . . . Define Y,.k = Xnk - Xn, k - P suppose the Y,, k have second moments, and put un2k = E[ Yni II�. k 1 ](�0 = {0, f!}). The probability space may vary with n . If the martingale is originally defined only for 1 < k < rn , take Ynk = 0 and �k = �' for k > rn . Assume that r.; = 1Yn k and r.; = 1 un2k converge with probability 1. • •
•
•
_
Theorem 35.12.
Suppose that 00
2k � 2 " n f.... (F p (F '
( 35 .35)
k=l
where u is a positive constant, and that "'
L E [ Yni I!IY.k l <J) � 0
(36.36)
k=l
for each
€.
�
Then r.; = I Y,. k = uN.
PRooF oF
The proof will be given for t going to infinity through the integers.t Let Ynk = I1v. � kld Vn and �k = � . From [vn > k] = ['LJ.:lo/ < n ] E � - 2 follow E[ Ynkll�.k - l ] = 0 and un2k = E[ Y,.i ll �.k - l ] = I1v. � klufjn. If K bounds the I Yk l, then 1 < 'L; = I un2k = n - 1 '£'k"= 1 uf < 1 + K 2jn, so that (35.35) holds for u = 1. For n large enough that Kj /n < E, the sum in (35.36) vanishes. Theorem 35. 12 therefore applies, and '£'k"= 1 Yd Vn = '£; = 1 Y,.k = N. • THEOREM 35.1 1 .
t For the general case, first check that the proof of Theorem 35. 12 goes through without change if n is replaced by a parameter going continuously to infinity.
S ECTION 35.
MARTINGALES
477
PRooF oF THEOREM 35. 12. Assume at first that there is a constant
that
c
such
00
(35 .37)
which in fact suffices for the application to Theorem 35. 1 1. Write Sk = L:J= 1 Ynj (S0 = 0), Soo = r:;= I Y,.j , L k = L�� 1 un21 (l0 = 0), and 2 L00 = L:j= 1unj; the dependence on n is suppressed in the notation. To prove E[ e 1· 1 Sx] .-., e - ' 1 2 u2, observe first that I
= I E [ e l l x( 1 - e'l 2"<"-<-x e - -;1 u 2 ) + e - -;1 2u 2 ( e ' 1 2"<"-<-x e l l x - 1 ) ] I' < £ ( 1 1 - /1 2 2xe - { 1 2u2 1} + I E ( e�1 2 2x ei1 Sx _ 1] 1 = A + B . .
s
I
I 2
I
I
.
s
The term A on the right goes to 0 as n oo, because by (35.35) and (35.37) the integrand is bounded and goes to 0 in probability. The integrand in B is ---.
"'
L.t "'
k=l
e l·l Sk -1 ( el·l - e - -; 1 2un2k ) e' l 2"<".<. k ' y' k
I
I
2
because the mth partial sum here telescopes to e ;1 5'" e !1 2 m - 1 . Since, by (35.37), this partial sum is bounded uniformly in m, and since S k - I and l. k are measurable �. k - 1 , it follows (Theorem 16.7) that
00
< L I E [ ei1 Sk - le �l 2 Lk£ [ eil Ynk - e - � 1 2u,�k l1.9,; , k - l ] I k=l < e�I 2c L £ [ 1 E [ etl Ynk - e - !1 2u,;k l1.9,; , k - d I ] . k= l 00
To complete the proof (under the temporary assumption (35.37)), it is enough to show that this last sum goes to 0. By (26.4 2 ), (35 .38)
where (write
Ink = I( I Y" k l
and let
K1
bound
t2 and lt l3)
478
DERIVATIVES AND CONDITIONAL PROBABILITY
And (38.39)
where (use (27. 15) and increase K, )
Because of the condition E[ Ynk \ l �. k _ 1 1 0 and the definition of un2k , the right sides of (35.38) and (35.39), minus () and ()', respectively, have the same conditional expected value given �. k 1 • By (35.37), therefore, =
_
t E[l E k'Ynk - e - !t 2
k=l
< K1
00
L
([ k= l
< K,
(E[}�ifnk ] r €E [ un2k ] + E [ un� ] )
[
)
E [ Ynt fnk ] + a + c£ sup un2k l · j k�l k=l
E
Since un2k < £[€ 2 + Ynt fnkll�, k - l 1 < € 2 + L:7= 1 [ Y,.� /nj ll �. j - l 1, it follows by (35 .36) that the last expression above is, in the limit, at most K 1( EC + a 2 ). Since E is arbitrary, this completes the proof of the theorem under the assumption (35.37). To remove this assumption, take c > u2, define A nk = [L:7= 1 un2j < c 1 and A noo = [L:j= I (J'n� < c 1, and take znk = Y,.k !Ank" From A,k E �. k - 1 follow E[ Znk l!�' k - l 1 = 0 and -r;k = E[ z;k il� k - 1 1 = / un2k . S ince L:j= 1 -r;j is L:,.k= I (J'n2j on A nk -- A n k + I and Lj = I un,·2 on A noo' the Z-array sa Usfie� ( 35.37). ' Now P( AnJ -7 1 by (35.35), and on Anoo' Tnk 2 - unk 2 for all k, so that the Z-array satisfies (35.35). And it satisfies (35.36) because I Znk I < I Y,.k 1 . There fore, by the case already treated, L:; = 1 Znk = uN. But since L; = 1 Y,.k com • cides with this last sum on A n""' it, too, is asymptoticaily normal. oo
'
A
nk
,
,
PROBLEMS 35.1. Suppose that tl 1, !l 2 , . . . are independent random variables with mean 0. Let XI = Il l and xn+ I = xn + tln+ dn(XI> . . . ' Xn), and suppose that the xn are
integrable. Show that {Xn} is a martingale. The martingales of gambling have this form.
Y1 , Y2, . . . be independent random variables with mean 0 and variance u 2• Let Xn = O:::f: = 1 Yk ) 2 - nu 2 and show that {Xn} is a martingale.
35.2. Let
SECTION 35.
MARTINGALES
479
35.3. Suppose that {Yn} is a finite-state Markov chain with transition matrix [ p1). Suppose that LjPtjx(j) = Ax(i) for all i (the x(i) are the components of a right n eigenvector of the transition matrix). Put Xn = A - x(Yn) and show that {Xn} is a martingale. 35.4. Suppose that Y1 , Y2 , are independent, positive random variables and that E[Y,,] = 1. Put Xn = Y1 Yn. (a) Show that {Xn} is a martingale and converges with probability 1 to an integrable X. (b) Suppose specifically that Yn assumes the values t and % with probability t each. Show that X = 0 with probability 1 . This gives an example where E[n; = I Yn] * n� = I E[Y,,] for independent, integrable, positive random vari ables. Show, however, that E[n;= I Yn] < n;= 1 E[ Yn] always holds. • • •
•
•
•
35.5. Suppose that X1 , X? , . . . is a martingale satisfying £[ X� ] = 0 and E[ Xj] < oo. Show that E[( Xn+r - Xn) 2 ] = r:,;; = 1 E[( Xn + k - Xn + k - l ) ] (the variance of the sum is the sum of the variances). Assume that L,n E[(Xn - Xn _ 1 )2 ] < and prove that Xn converges with probability 1. Do this first by Theorem 35.5 and then (see Theorem 22.6) by Theorem 35.3. oo
35.6. Show that a submartingale Xn can be represented as Xn Yn + Zn, where Yn is a martingale and 0 < Z 1 < Z2 < · · · . Hint · Take X0 = 0 and !ln =Xn - Xn _ 1 , and define zn = L,f: = 1 £[/l k ll�� - d <.rc = {0, !l}). =
35.7. If X 1 , X2 , . . . is a martingale and bounded either above or below, then s upn £[1 Xnil < oo. 35.8. i Let xn = Il l + . . . +lln, where the .ln are independent and assume the values + 1 with probability I each. Let T be the smallest n such that Xn = 1 and define X;i by (35.19). Show that the hypotheses of Theorem 35.5 are satisfied by {X;i} but that it is impossible to integrate to the limit. Hint: Use (7.8) and Problem 35. 7. 35.9. Let XI , X2• be a martingale, and assume that I Xl(w )I and I Xn(w ) - xn - l(w )I are bounded by a constant independent of w and n . Let T be a stopping time with finite mean. Show that X7 is integrable and that E[X7] = E[Xd. • • •
35.10. 35.8 35.9 i Use the preceding result to show that the T in Problem 35.8 has infinite mean. Thus the waiting time until a symmetric random walk moves one step up from the starting point has infinite expected value. 35.11. Let X1 , X2 , be a Markov chain with countable state space S and transition probabilities p0. A function 1p on S is excessive or superharmonic if �p(i) � LjPtjifJ(j). Show by martingale theory that �p(Xn) converges with probability 1 if 1p is bounded and excessive. Deduce from this that if the chain is irreducible and persistent, then 1p must be constant. Compare Problem 8.34. • • •
35.12. i A function 1p on the integer lattice in R k is superharmonic if for each lattice point x, �p(x) > (2k) - 1 [�p(y ), the sum extending over the 2k nearest neighbors y. Show for k = 1 and k = 2 that a bounded superharmonic function is constant. Show for k > 3 that there exist nonconstant bounded harmonic functions.
DERIVATIVES AND CONDITIONAL PROBABILITY
480
35.13. 32.7 32.9 i Let !T, P) be a probability space, let v be a finite measure on !T, and suppose that .9,; i .9;, c: !F. For n < oo, let Xn be the Radon Nikodym derivative with respect to P of the absolutely continuous part of v when P and v are both restricted to .9,;. The problem is to extend Theorem 35.7 by showing that Xn -> X, with probability 1. (a) For n < oo, let
(D.,
A E -9,; , be the decomposition of v into absolutely continuous and singular parts with respect to P on .9,;. Show that X1 , X2 is a supermartingale and converges with probability 1. (b) Let • • • •
A E'
.9,; ,
be the decomposition of a-00 into absolutely continuous and singular parts with fespect to P on .9,;. Let Yn = E[Xooll-9,;], and prove A E .9,;. Conclude that Yn + Zn = Xn with probability 1. Since Yn converges to X,, Zn converges with probability 1 to some Z. Show that fA Z dP < a-""( A) for A E .9;,, and conclude that Z = 0 with probability 1.
35.14. (a) Show that {Xn} is a martingale with respect to {.9,;} if and only if, for all n and all stopping times T such that T < n, E[ Xn11.9;] = X7• (b) Show that, if {Xn} is a martingale and T is a bounded stopping time, then
£[ XT ] = £[ Xd.
35.15. 31.9 i Suppose that .9,; i � and A E �, and prove that P[AII�] -> /A with probability 1. Compare Lebesgue's density theorem. 35.16. Theorems 35.6 and 35.9 have analogues in Hilbert space. For n < oo, let Pn be the perpendicular projection on a subspace Mn - Then Pnx Poox for all x if either (a) M1 c M2 c · · · and Moo is the closure of Un M2 -:::> • • and Moo = nn M,. --+
•
< oo
35.17. Suppose that e has an arbitrary distribution, and suppose that, conditionally on e, the random variables Y1 , Y2 , . . . are independent and normally dis tributed with mean e and variance a- 2 Construct such a sequence {e , Y1 , Y2 , • • • }. Prove (35.3 1). •
35.18. It is shown on p. 471 that optional stopping has no effect on likelihood ratios This is not true of tests of significance. Suppose that X1 , X2 , . . . are indepen dent and identically distributed and assume the values 1 and 0 with probabili ties p and 1 - p. Consider the null hypothesis that p = 4- and the alternative that p > t· The usual .05-Ievel test of significance is to reject the null
SECTION 35.
MARTINGALES
481
hypothesis if (35.40 ) For this test the chance of falsely rejecting the null hypothesis is approximately P[ N > 1.645] "" .05 if n is large and fixed. Suppose that n is not fixed in advance of sampling, and show by the law of the iterated logarithm that, even if p is, in fact, t, there are with probability 1 infinitely many n for which {35.40) holds.
35.19. (a) Suppose that (35.32) and (35.33) hold. Suppose further that, for constants s�, s; 2 ['l, = 1 ul -> p 1 and s; 2 ['l, = 1 E[ Y/I[! Yk l� s) -> 0, and show that S1� 1 r:,z = 1 Yk N. Hint: Simplify the proof of Theorem 35. 1 1 . (b) The Lir.deberg-Levy theorem for martingales Suppose that =
. . ·
, Y - 1 , Yo , Y1 , · · ·
is stationary and ergodic (p. 494) and that
Prove that L, k= l yk ;.,fii is asymptotically normal. Hint: Use Theorem 36.4 and the remark following the statement of Lindeberg's Theorem 27.2.
35.20. 24.4 j Suppose that the u-field � in Problem 24.4 is trivial. Deduce from n Theorem 35.9 that P[ AIIr .r] ----> P[ A II�] = P(A) with probability 1, and conclude that T is mixing.
CHAPTER 7
Stochastic Processes
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM Stochastic Processes
A
stochastic process is a collection [X,: t E T] of random variables on
a
probability space (f!, !F, P). The sequence of gambler's fortunes in Section 7, the sequences of independent random variables in Section 22, the martin gales in Section 35-all these are stochastic processes for which T = {1, 2, . . . }. For the Poisson process [ N1: t > 0] of Section 23, T = [0, oo). For all these processes the points of T are thought of as representing time. In most cases, T is the set of integers and time is discrete, or else T is an interval of the line and time is continuous. For the general theory of this section, however, T can be quite arbitrary. Finite-Dimensional Distributions
A process is usually described in tenns of distributions it induces in Eu
clidean spaces. For each k-tuple (t 1 , . . . , tk ) of distinct elements of T, the random vector (X1 , , X1 k ) has over R k some distribution I""11 1 , /k' I
• • •
I
•
( 36.1 ) These probability measures J.L1 1 • 1k are the finite-dimensional distributions of the stochastic process [ X1: t E T]. The system of finite-dimensional distribu tions does not completely determine the properties of the process. For example, the Poisson process [N1: t > 0] as defined by (23.5) has sample paths (functions N1(w) with w fixed and t varying) that are step functions. But (23.28) defines a process that has the same finite-dimensional distributions and has sample paths that are not step functions. Nevertheless, the first step in a general theory is to construct processes for given systems of finite dimensional distributions. 482
S ECTION
36.
KOLMOGOROV' s EXISTENCE THEOREM
483
Now (36.1) implies two consistency propertie s of the system J.L1 1 , k . Suppose the H in (36.1) has the form H = H1 X · · x Hk (H; E .9i' 1 ), and consider a permutation 7T of (1, 2, . . . , k ). Since [( X, , . . . , X, ) E (H1 X · · X Hk )] and [( X, , . . . , X, ) E (Hrr 1 X · X Hrr k )] are t he sam� event, it follows by (36.1) that ..
•
·
1T
· ·
-rr k.
I
(36.2 ) For example, if 11- s, I = v X v ' , then necessarily The second consistency condition is
JL1, s
'
= v X v.
( 36.3 ) This is clear because (X,1, , X, k _ ) lies in H1 x · · · x Hk _ 1 if and only if (X, I , . . . , X,k - I , X, k ) lies in H 1 x · · x Hk _ 1 x R 1 Measures Jl- 1 1 1 k coming from a process [ X,: t E T ] via (36.1) necessarily satisfy (36.2) and (36.3). Kolmogorcv's existence theorem says conversely that if a given system of measures satisfies the two consistency conditions, then there exists a stochastic process having these finite-dimensional distributions. The proof is a construction, one which is more easily understood if (36.2) and into a single condition. (36.3) are combined k Define cprr: R � R k by • • •
·
•
..
applies the permutation 7T to the coordinates (for example, if 7T sends x 3 to first position, then 7T 1 1 = 3). Since cp;; 1 (H 1 X · · · X Hk ) = Hrr ! X · · · X Hrr k ' it follows from (36.2) that 'Prr
-
for rectangles H. But then
(36.4) Similarly, if cp: R k R k - 1 is the projection cp( x 1 , . . . , x k ) = ( x I > . . . , x � _ 1 ), then (36.3) is the same thing as �
(36.5 ) The conditions (36.4) and (36.5) have a common extension. Suppose that (u 1 , , u rn ) is an m-tuple of distinct elements of T and that each element of (tp . . . , tk ) is also an element of ( up . . . , um). Then (t 1 , . . . , t k ) must be the u rn ); that is, k m and there initial segment of some permutation of (u 1 , • • •
• • •
,
<
STOCHASTIC PROCESSES
484
is a permutation
7T
of (1, 2, . . . , m) such that
where tk+ 1 , . . . , tm are elements of (u 1 , . . . , u m ) that do not appear in ( t 1 , , tk ). Define rfJ : Rm � R k by • • •
( 36.6) 1/J applies 1r to the coordinates and then projects onto the first k of them. Since r/J ( Xu , . . . , Xu ) = (X, , . . . , X, ), I
( 36.7 )
m
I
k
u
,f, '+' m
I
•
This contains (36.4) and (36.5) as special cases, but as 1/J is a coordinate permutation followed by a sequence of projections of the form (x 1 , , x1) � (x 1 , , x 1_ 1 ), it is also a consequence of these special cases. •
•
•
•
•
•
Product Spaces
The standard construction of the general process involves product spaces. Let T be an arbitrary index set, and let RT be the collection of all real functions on T-all maps from T into the real line. If T = {1, 2, . . . , k}, a real function on T can be identified with a k-tuple (x I > . . . , x k ) of real numbers, and so RT can be identified with k-dimensional Euclidean space R". If T = {1, 2, . . . }, a real function on T is a sequence {x 1 , x 2, } of real numbers. If T is an interval, RT consists of all real functions, however irregular, on the interval. The theory of RT is an elaboration of the theory of the analogous but simpler space S" of Section 2 (p. 27). Whatever the set T may be, an element of RT will be denoted x. The value of x at t will be denoted x( t) or x, depending on whether x is viewed as a function of t with domain T or as a vector with components indexed by the elements t of T. Just as R k can be regarded as the Cartesian product of k copies of the real line, RT can be regarded as a product space-a product of copies of the real line, one copy for each t in T. 1 For each t define a mapping Z, : RT � R by • • •
( 36.8 )
Z,(x ) = x ( t ) = x, .
The Z, are called the coordinate functions or projections. When later on a probability measure has been defined on RT, the Z, will be random variables, the coordinate variables. Frequently, the value Z,(x) is instead denoted Z(t, x ). If x is fixed, Z( · , x) is a real function on T and is, in fact, nothing other than x( · )-that is, x itself. If t is fixed, Z(t, · ) is a real function on R7 and is identical with the function Z, defined by (36.8).
SECTION 36.
KOLMOGORov's EXISTENCE THEOREM
485
There is a natural generalization to RT of the idea of the u-field of k-dimensional Borel sets. Let �T be the u-field generated by all the coordinate functions Z, t E T: �T = u[Z,: t E T ] . It is generated by the sets of the form
1 k for t E T and H E � • If T = {1, 2, . . . , k}, then �T coincides with jN . Consider the class �ci consisting of the sets of the form (36.9)
A = =
[ [
x
x
E RT: ( Z , J ) . . . , z, k( ) ) E H J x
E RT: (
,
x, 1 , . . . , x,
x
J E H] ,
where k is an integer, (t 1 , . . . , tk) is a k-tuple of distinct points of T, and H E �k. Sets of this form, elements of �[, are called finite-dimensional sets, or cylinders. Of course, IN[ generates �T. Now �l is not a u-field, does not coincide with �T (unless T is finite), but the following argument shows that it is a field.
y
a,
I,
If T is an interval, the cylinder [x e RT: a1 < x(t1 ) < {31, a2 <x(t2) < {32] consists of the functions that go through the two gates shown; y lies in the cylinder and z does not (they need not be continuous functions, of course)
The complement of (36.9) is RT - A = [x E RT: ( • • • , ) E Rk - H ], and so �l is closed under complementation. Suppose that A is given by (36.9) and B is given by x 1 1,
x,
(36.10) where / E �i. Let ( u 1 , . . . , u m ) be an m-tuple containing all the t,. and all the s13 • Now (t . . . , tk) must be the initial segment of some permutation of m ( u 1 , . . . , u rn ), and if 1/1 is as in (36.6) and H' = 1/1 - I H, then H' E � and A is �>
STOCHASTIC PROCESSES
486
given by
(36.1 1 ) as well as by (36.9). Similarly, B can be put in the form
(36. 12 ) where I' E �m. But then
(36.13) Since H' u I' E �m, A u B is a cylinder. This proves that �6 is a field such that � T = u(�l). The Z, are measurable functions on the measurable space ( R T, � T ). If P is a probability measure on � T, then [ 21 : t E T] is a stochastic process on (R T, � T, P), the coordinate-variable process. Kolmogorov's Existence Theorem
The existence theorem can be stated two ways:
If J.L 1 1 1 k are a system of distributions satisfying the consistency conditions (36.2) ana (36.3), then there is a probability measure P on � T such that the coordinate-variable process [Z 1: t E T] on (R T, � T, P) has the f.L 1 1 1k as its finite-dimensional distributions. Thecrem 36.2. Tf J.L 1 1 • 1 k are a system of distributions satisfying the consis tency conditions (36.2) and (36.3), then there exists on some probability space (f!, !F, P) a stochastic process [X1 : t E T] having the Jl- 1 1 • 1 k as its finite dimensional distributions. Theorem 36.1.
For many purposes the underlying probability space is irrelevant, the joint distributions of the variables in the process being all that matters, so that the two theorems are equally useful. As a matter of fact, they are equivalent anyway. Obviously, the first implies the second. To prove the converse, suppose that the process [X1 : t E T] on (f!, !F, P) has finite-dimensional distributions Jl- 1 1 •• 1 k, and define a map {: f! � R T by the requirement
(36.14 )
t E T.
For each w, {(w) is an element of R T, a real function on T, and the
SECTION
36.
KOLMOGOROV ' S EXISTENCE THEOREM
requirement is that
487
X1( w ) be its value at t. Clearly,
C 1 [ x E R T : (Z1lx ) , . . . , Z1 k( x )) E H ]
(36.15 )
= [ w E f! : (Z11( {(w)) , . . . , Z1 k( {( w)) ) E H] = [ w E n : ( X11( w ) , . . . , X1k( w)) E H ] ; since the X1 are random variables, measurable !F, this set lies in !F if k Thus � . C1A E !F for A E �J, and so (Theorem 13.1) g is measur E H able !Fj� T. By (36.15) and the assumption that [X1 : t E T] has finitedimensional distributions J..L 1 1 1�, PC 1 (see (13.7)) satisfies
( 36.16) pg- 1 [ x e R T : (Z1lx) , . . . , Z1 k( x )) E H]
= P [ w E f! : ( XJ w) , . . . , X1 k( w ) ) E H ] = 11- 1 1 1 k( H ) . Thus the coordinate-variable process [ Z 1 : t E T] on (RT, �T, PC 1) also has finite-dimensional distributions Jl-1 1 1k . Therefore, to prove either of the two versions of Kolmogorov's existence theorem is to prove the other one as well. that T is finite, say T = {l, 2, . . . , k}. Then Example 36. 1. Suppose k (R T, �T) is (R k , f!! ), and taking P 11- 1 ,2, , k satisfies the requirements of Theorem 36.1. • =
• •
Example 36.2. Suppose that T = {1 , 2, . . . } and (36.17) are probability distributions on the line. The consistency where /J- p 11-2, conditions are easily checked, and the probability measure P guaranteed by Theorem 36.1 is product measure on the product space (R T, �T ). But by Theorem 20.4 there exists on some (f!, !F, P) an independent sequence X 1 , X2, of random variables with respective distributions JJ- 1 , 11-2, ; then (36.17) is the distribution of (XI > . . . , X1 ). For the special case (36.17), Theorem 36.2 (and hence Theorem 36.1) was thus proved in Section 20. The existence of independent sequences with prescribed distributions was the measure-theoretic basis of all the probabilistic developments in Chapters 4, 5, and 6: even dependent processes like the Poisson were constructed from independent sequences. The existence of independent sequences can also be made the basis of a proof of Theorems 36.1 and 36.2 in their full generality; see the second proof below. • • • •
• • •
• • .
I
n
STOCHASTIC PROCESSES
488
The preceding example has an analogue in 1the space S"' of sequences (2. 15). Here the finite set S plays the role of R , the zn ( ) are analogues of the Zn( ), and the product measure defined by (2.21) is the analogue of the product measure specified by (36. 17) with J.L; = J.L. See also Example 24.2. The theory for S"' is simple because S is finite: see Theorem • 2.3 and the lemma it depends on. Example 36.3.
·
·
If T is a subset of the line, it is convenient to use the order structure of the line and take the 11- s I s to be specified initially only for k-tuples ( s 1 , . . . , sk ) that are in increasing order: Example 36.4.
k
( 36.18)
It is natural for example to specify the finite-dimensional distributions for the Poisson processes for increasing sequences of time points alone; see (23.27). Assume that the J.L5 1 sk for k-tuples satisfying (36.18) have the consistency property
For given s1, • • • , sk satisfying (36.18), take (X5 1 , • • • , Xs) to have distribution J.Ls I s • If tl, . . . , tk iS a permutatiOn Of S I • • • • > Sk , take J.L1 I tO be the distribution of (X1 1 , • • • , X,): k
•
Ik
(36.20)
This unambiguously defines a collection of finite-dimensional distributions. Are they consistent? If t,. 1 , • • • , t,. k is a permutation of t 1 , . . . , tk , then it is also a permutation of is the distribution of s 1 , . . . , sk , and by the definition (36.20), J.L,;<./ , (X, trl , . . . , X, trk. ), which immediately gives (36.2J, the first of the consistency conditions. Because of (36.19), J.L s 1 s; _ 1 si + l s is the distribution of (X5 I , • • • , X5.1 - 1 , X5.1 + 1 , • • • , X5k. ), and if tk = s;, then t1, . . . , tk - l is a permutation of s 1 , . . . , s;_ 1 , si+ 1 , . . . , sk , which are in increasing order. By the definition (36.20) applied to t 1 , . . . , t k - l it therefore follows that J.L1 1 . is the distribution of (X, I , . . . , X, k - 1 ). But this gives (36.3), the second of the consistency conditions. It will therefore follow from the existence theorem that if T c R 1 and J.L5 s is defined for all k-tuples in increasing order, and if (36.19) holds, 1 then there exists a stochastic process [ X, : t E T] satisfying (36.1) for increas • ing t 1 , • • • , tk . rrk
•
•
•
k
· 'k - l
•
k
SECT I ON
36.
KOLMOGOROV'S EXISTENCE THEOREM
489
Two proofs of Kolmogorov's existence theorem will be given. The first is based on the extension theorem of Section 3 . •
FIRST PRooF oF KoLMOGORov's THEOREM. Consider the first formula tion, Theorem 36. 1. If A is the cylinder (36.9), define
(36.21 ) This gives rise to the question of consistency because A will have other representations as a cylinder. Suppose, in fact, that A coincides with the cylinder B defined by (36. 10). As observed before, if (u 1, 1 , u ) contains all the to: and �13, A is also given by (36.11), where H' = rfJ - H and 1/J is defined in (36.6). Since the consistency conditions (36.2) and (36.3) imply the more general one (36.7), P( A) JJ- 1 1 1k( H ) = Jku 1 . . . uJ H'). Similarly, (36. 10) has the form (36.12), and P( B ) = JLs 1 sP ) = J.L u1 .. u {l' ). Since the u Y are dis tinct, for any real numbers z 1, , z m there are points x of R T for which (x . . . ' Xu ) = ( z , , . . . ' zm ). From this it follows that if the cylinders (36.1 1) and (36.12J coincide, then H' = 1'. Hence A = B implies that P( A) = 11- u 1 u ( H' ) 11- u I u {I') = P( B), and the definition (36.21) is indeed consistent. Now consider disjoint cylinders A and B. As usual, the index sets may be taken identical. Assume then that A is given by (36.11) and B by (36.12), so that (36.13) holds. If H' n I' were nonempty, then A n B would be nonempty as well. Therefore, H' n I' = 0, and •
•
•
m
=
m
•
•
•
ul'
m
=
m
P ( A u B ) = J.L u 1 . uJ H' u l' )
.
= J.L u 1 uJ H' ) + J.L u 1 . . uJ I' ) = P ( A ) + P ( B ) .
Therefore, P is finitely additive on YJ[. Clearly, P( RT ) = 1. Suppose that P is shown to be countably additive on �[. By Theorem 3.1, P will then extend to a probability measure on �T. By the way P was defined on �6,
(36.22 ) and therefore the coordinate process [ 21: t E T] will have the required finite-dimensional distributions. It suffices, then, to prove P countably additive on �6, and this will follow if An E �6 and An � 0 together imply P( AnH 0 (see Example 2.10). Sup and that P( An) > E > 0 for all n. The problem is to pose that A 1 ::> A 2 :::> show that n n An must be nonempty. Since An E �[, and since the index set involved in the specification of a cylinder can always be permuted and •
•
•
STOCHASTIC PROCESSES
490
expanded, there exists a sequence
tI> t2,
•
•
•
of points in T for which
wheret H,. E ���. Of course, P( A ,.) = Jl-1 I t,.(H,. ). By Theorem 12.3 (regularity), there 1 exists inside H,. a compact set K,. such that Jl- 1 1 1 ( H,. - K,. ) 1< E / 2 " + • If B,. = [ x E RT: (x1 I , , x 1 ) E K,. ], then P( A,. - B,) < E / 2" + • Put C, = n z = I Bk . Then C,. c B, cA,. and P( A, - C, ) < E / 2, so that P(C,) > E / 2 > 0. Therefore, C, c C, _ 1 and C, is nonempty. n n Choose a point x < ) of RT in en " If n > k, then x < ) E c, c ckI c Bk and n n · Kk IS . boun d e d , the sequence { x (1 ), x;f2), . . . } hence ( x (1 ), , x 1( ) ) E Kk . smce is bounded for each k. By the diagonal method [A14] select an increasing sequence n n 2 , of integers such that lim; x��;l exists for each k. There is in RT some point x whose tk th coordinate is this limit for each k. But then, . . . , x�";l) and hence lies for each k, ( x 1 I , , x1 k ) is the limit as i � oo of (x��;l, k I in Kk . But that means that x itself lies in Bk and hence in A k . Thus • E n � = I A k , which completes the proof.:!: •
•
I
•
•
•
•
•
•
n
k.
k
•
1,
•
•
•
•
k.
•
•
X
The second proof of Kolmogorov's theorem goes in two stages, first for countable T, then for general T.*
T. The result for countable T will be proved in its second formulation, Theorem 36.2. It is no restriction to enumerate T as { t 1, t2, } and then to identify t,. with n; in other words, it is no restriction to assume that T = {1, 2, . . . }. Write p., , in place of p., By Theorem 20.4 there exists on a probability space (.fl, 5', P) (which can be taken SECOND PROOF FOR couNTABLE
•
1 2 , •
•
•
,n·
of random variables each to be the unit interval) an independent sequence U1, U2, uniformly distributed over (0, 1). Let F1 be the distribution function corresponding to p., 1 If the "inverse" g 1 of F1 is defined over (0, 1) by g 1(s) = inf[ x : s < F1(x )], then XI g I(UI) has distribution p., I by the usual argument: P[ g i(UI ) < X ] = P[ ul < Fl(x)] •
•
•
•
=
= Fl x ).
The problem is to construct X2, X3,
•
•
•
inductively in such a way that
(36.23) for a Borel function h k and ( X1, . . . , X,) has the distribution p., , . Assume tha t Xp . . . , X, _ 1 have been defined (n > 2): they have joint distribution p., , _ 1 and (36.23) holds for k < n - 1. The idea now is to construct an appropriate conditional distribu tion function F,.( x l x 1 , . . . , x, _ 1 ); here F,( xi X1(w), . . . , X, _ 1(w)) will have the value P[ X,. < x ll X1, , X,. _ dw would have if X,. were already defined. If g,.( · l x I > , x,. _ 1 ) •
•
•
•
•
•
t in general, An will involve indices t 1, . . . , ta ' w here a 1 < a2 < · · · . For notational simplicity a,. n is taken as n. As a matter of fact, this can be arranged anyway: Take A'a = A ,., A'k [x: k k a (x11, . . . , x,) e R ] = RT tor k < a1, and A'k = [x: (x11, . . . , x,) e H,. x R - :] = A ,. for a,. < , k < an + I· Now relabel A ,. as A,.. t The last part of the argument is, in effect, the proof that a countable product of compact sets i s compact. * This second proof, which may be omitted, uses the conditional-probability theory of Section 33.
=
.
SECTION 36.
KOLMOGOROV'S EXISTENCE THEOREM
491
is the "inverse" function, then Xn(w) = gn(Un(w)I X1(w), . . . , Xn _ 1(w)) will by the usual argument have the right conditional distribution given X 1 , . . . , Xn_ 1 , so that n (X1 , . . . , Xn_ 1 , Xn) will have the right distribution over R . the conditional distribution function, apply Theorem 33.3 in nTo!Jr,construct (R , 11-n) to get a conditional distribution of the last coordinate of (x 1 , . , . , xn) given the first n - 1 of them. This will have (Theorem 20. 1) the form v( H; i 1 , . . . , xn _ 1 ); it is a probability measure as H varies over � 1 , and
Since the integrand involves only x 1 , . . . , xn _ 1 , and since 11- n by consistency projects to 11- n - l under the map (x 1 , . . . , xn) --+ (x 1 , . . . , xn _ 1 ), a change of variable gives
JMv( H ; X I , . . . , xn - 1 ) dp.,n - l( x l , . . . , xn - 1 ) = 11- n [ X E R n : (X I >
.
•
.
' xn - 1 ) E M' Xn E H ] .
Define Fn(xlx 1 , . . . , xn _ 1 ) = v(( - oo, x]; x 1 , . . . , xn _ 1 ). Then Fn(- l x 1 , . . . , xn _ 1 ) nis a probability distribution function over the line, Fn(xl · ) is a Borel function over R - l , and
Put g,(ulx l , . . . ' Xn - 1 ) = inf[ x: u < Fn(xlx l , . . . ' xn - 1 )] for 0 < u < 1. Since Fn(xlx 1 , . . . , xn_ 1 ) is nondecreasing and righ t-continuous in x, gn(ulx I > • • • , xn _ 1 ) < x if and only if u < Fn(xlx 1 , . . . , xn_ 1 ). Set Xn = gn(Un1X1 , . . . , Xn _ 1 ). Since (X1 , . . . , Xn _ 1 ) has distribution 11-n - l and by (36.23) is indeper.dent oi Un> an application of (20.30) gives
P ( ( X1 , . . . , Xn _ 1 ) E M, Xn < x ) P [ ( X1 , . . . , Xn _ 1 ) E M, Un < Fn ( xiX 1 , . . . , Xn _ 1 ) ) =
= J Fn( xlx 1 , , xn _ J) dJtn - l( x 1 , , xn - l) M n = 11- n [ X E R : ( X I ' . . . ' Xn - I ) E M, xn < X ] . •••
• • •
Thus (X I , . . . ' Xn) has distribution 11-n · Note that xn, as a function of XI , . . . ' xn - 1 and Un, is a function of U1 , . . . , Un becau se (36.23) was assumed to hold for k < n. Hence (36.23) holds for k = n as well. •
STOCHASTIC PROCESSES
492
SECOND PROOF FOR GENERAL r. Consider (RT, !JRT) once again. If s c T, let SIS = a- [ Z, : t E S]. Then SIS c !f = !JRT. Suppose that S is countable. TBy the case just treated, there exists a process [ X,:
t E S ] on some W, 5', P)-the space and the process depend on S-such that ,k for eve ry k-tuple (t I > , tk ) from S. Define a (X, I , ./:. . ,....X, ) has distribution p., , . I . map � : u -+k RT by requmng that • • •
Z, ( { { w ) ) =
{:
( ) , w
if t E S, if t � s.
Now (36.15) holds as before if t 1 , . . . , t k all lie in S, and so { is measurable S'j SIS. Further, (36.16) holds for t 1 , . . . , tk in S, Put Ps = PC 1 on SIS Then Ps is a probability measure on (RT, SIS), and (36,24 ) if H E gp k and t 1 , , t k all lie in S. (The various spaces (0, 5', P) and processes [ X,: t E S] now become irrelevant.) If 50 c S, and if A is a cylinder (36.9) for which the t 1 , , tk lie in 50, then Ps ( A ) and Ps ( A ) coincide, their common value being p., , , (H). Since these cylin�ers generate Y.s , Ps ( A ) = Ps ( A ) for all A in !F.s . If A 11iesk both in g:;.-> and g:;s , then 1 2 Ps( A ) = Ps u s ( A ) = Ps ( A ). Thus P( A ) = Ps( A ) consistently defines a set function 1 on the clas� U � SIS, the 2union extending over the countable subsets S of T. If A n lies in this union and A n E SIS (Sn countable), then s = u n sn is countable and u n An lies in SIS· Thus u s SIS is a a--field and so must coincide with !JRT. Therefore, p is a probability measure on !JR T, and by (36.24) the coordinate process has under P the • required finite-dimensional distributions. • • •
. • .
11
o
'
The Inadequacy of Theorem 36.3.
o
�T Let [ X1: t E T] be a family of real functions on !1.
lf A E u[X1 : t E T] and w E A, and if X1(w) = X:(w') for all t E T, then w' EA. {ii) If A E u[X1: t E T], then A E u [X1 : t E S] for some countable subset S of T. { i)
Define g: n --+ R T by Z/ (g(w)) = Xl(w). Let !F u [ Xl : t E1T]. By (36. 15), g is measurable !Fj�T and hence !F contains the class [C M: M E �T]. 'I he latter class is a u-field, however, and by (36.15) it contains th e sets [ w E !l : (X1(w), . . . , X1k(w)) E H], H E �\ and hence contains the I u-field !F they generate. Therefore PROOF.
(36.25)
This is an infinite-dimensional analogue of Theorem 20.10).
SECTION
' KOLMOGO ROV S EXISTENCE THEOREM
36.
493
1
As for {i), the hypotheses imply that w E A = C M and g( w) = {(w' ), so that w' E A certainly follows. For S c T, let .rs u[ X1: t E S]; {ii) says that !F= !F coincides with T .:#= U s .rs, the union extending over the countable subsets S of T. If A 1 , A 2, lie in .§, An lies in .rs for some countable Sn, and so U n A n lies n in .§ because it lies in .rs for S = U nSn. Thus .§ is a u-field, and since it contains the sets [X1 E H], it contains the u-field !F they generate. (This part of the argument was used in the second proof of the existence theorem.) • =
•
•
•
From this theorem it follows that various important sets lie outside the class � T. Suppose that T [0, oo). Of obvious interest is the subset C of R T consisting of the functions continuous over [0, oo). But C is not in � T. For suppose it were. By part (ii) of the theorem Oet !1 = RT and put [ 21: t E T] in the role of [X1 : t E T]), C would lie in u[ Z1 : t E S] for some countable S c [0, oo ). But then by part {i) of the theorem Oet f! = R T and put [Z1 : t E S ] in t he role of [ X1 : t E T]), if x E C and Z1( x ) = Z1( y) for all t E S, then y E C. From the assumption that C lies in � T thu!) follows the existence of a countable set S such that, if x E C and x( t ) = y(c) for all t in S, then y E C. But whatever countable set S may be, for every continuous x there obviously exist functions y that have discontinuities but agree with x on S. Therefore, cannot lie in � T. What the argument shows is this: A set A in R T cannot lie in � T unless there exists a countable subset S of T with the property that, if x E A and x( t ) = y(t) for all t in S, then y E A . Thus A cannot lie in � T if it effectivety involves all the points t in the sense that, for each x in A and each t in T, it is possible to move x out of A by changing its value at t alone. An d C is such a set. For another, consider the set of functions x over T = [0, oo) that are non decreasing and assume as values x(t ) only nonnegative integers: =
( 36 .26)
[ X E R[O, co) :
X
( S) < x ( t ) ,
X < t ; X ( t ) E { 0, 1 , . . . } , t > 0] .
This, too, lies outside �7. In Section 23 the Poisson process was defined as follows: Let X1 , X2, be independent and identically distributed with the exponential distribution (the probability space f! on which they are defined may by Theorem 20.4 be taken to be the unit interval with Lebesgue measure). Put S0 0 and + Xn . If Sn( w) < Sn + 1(w) for n > 0 and Sn( w ) � oo, put N(t, w) Sn X1 + = �( w ) = max[ n : Sn( w ) < t ] for t > 0; otherwise, put N( t, w) = �( w ) 0 for t > 0. Then the stochastic process [ � : t > 0] has the fini te-dimensional distributions described by the equations (23.27). The function N( w) is the path function or sample functiont corresponding to w, and by the construc tion every path function lies in the set (36.26). This is a good thing if the .
=
=
·
·
·
=
·,
t Other terms are realization of the process and trajectory.
•
•
494
STOCHASTIC PROCESSES
process is to be a model for, say, calls arriving at a telephone exchange: The sample path represents the history of the calls, its value at t being the number of arrivals up to time t, and so it ought to be nondecreasing and integer-valued. According to Theorem 36.1, there exists a measure P on for T [0, oo) P) has the finite such that the coordinate process [Z 1: t > 0] on dimensional distributions of the Poisson process. This time does the path function Z( · x) lie in the set (36.26) with probability 1? Since Z( · x) is j ust x itself, the question is whether the set (36.26) has P-measure 1. But this set does not lie in and so it has no measure at all. An application of Kolmogorov ' s existence theorem will always yield a stochastic process with prescribed finite-dimensional distributions, but the process may lack certain path-function properties that it is reasonable to require of it as a model for some natural phenomenon. The special construc tion of Section 23 ger<; around this difficulty for the Poisson process, and in the next section a special construction will yield a model for Brownian motion with continuous paths. Section 38 treats a general method for producing stochastic processes that have prescribed finite-dimensional distri butions and at the same time have path functions with desirable regularity properties.
T R ( R T, gT,
,
=
,
gT,
A Return to Ergodic Theory • �
T T {0, Wri t e , a'(; , a' i n the case where the i n dex set ... } R"' , a'0, f o r R consi s ts of al l t h e i n t e gers. Then R"' i s anal o gous t o S"' (Sections and except that here the sequences are doubly infinite: X= ( I:ethke Tthe(notshi_fatn m_indexSectis_et)on der.oSmce t� the AshiEfta'0: 1mphes T-{j4 E a'0,= 0,T 1s measurabl . . This is e Clearlyi,ciprocess t is invertX=ible(. ... ,X X0, X1, ... ) on a'"'Fora'"'a .stochast defi n e{: f1 -> R"' 1 , byon (R"', a'"'{w) can= X(w) = X0(w), X1 ( w), ... ). The measure PC = PX-1 1 ( w ( be vi e wed as the distribution of X. Suppose t h at X is stationary i n , Xn+k) theall nsense that, f o r each > and H E P[ ( Xn+l• ... E H] is the same fo r 1 and Lemma p. = 0, ... . Then (use the shi f t preserves PC i s defi n ed The process X to be ergodic i f under PC the shift is ergodic in the sense of Secti o n In the ergodic case, it follows by the ergodic theorem that 2
+ 1, + 2,
24),
. . . , z _ J( x ) , Zo( x), ZJ( x), . . . ) .
�k �Tx) . Zk + x),
24.
1
. . . , X_
(36.14):
±
1,
k
1
)
k a' ,
(36.27) * This topic, which requires Section 24, may be omitted.
+ I, . . _
(.fl, !7, P),
_ 1,
24.
k
1
(36. 16)
1,
31 1).
SECTION 36.
KOLMOGOROV 'S EXISTENCE THEOREM
495
on a set of PC 1 -measure 1, provided f is measurable !JR"' and integrable. Carry (36.27) back to (fl, .9', P) by the inverse se t mapping C 1 • Then (36.28) with probability 1: (36.28) holds at w if and only if (36.27) holds at x = t w = X(w). It is understood that on the left in (36.28), Xk is the center coordinate (the Oth coordinate) of the argument of f, and on the right, X0 is the center coordinate: For stationary, ergodic X and iategrable f, (36.28) holds with probability 1 . then the Zk are independent under P{ - 1 • In this case, If the Xk are independent, n limn PC 1 (A n r B) = PC 1 ( A)PC 1 (B) for A and B in !JR0, because for large enough n the cylinders A and r nB depend on disjoint sets of time indices and hence are independent. But then it follows by approximation (Corollary 1 to Theorem 1 1 .4) that the same limit holds for all A and B in !JR"'. But for invariant B, this implies PC 1 (SC n B) = PC 1 ( W)PC 1 (B), so that P{ - 1 (8) is 0 cr 1, and the shift is ergodic under PC 1 : If X is stationary and independent, then it is ergodic. If f depends on just one coordinate of x, then (36.28) is in the indept!ndent case a consequence of the strong law of large numbers, Theorem 22. 1 . But (36.28) follows by the ergodic theorem even if f involves all the coordinates in some complicated way. Consider now a measurable real function on R"'. Define !{i: R"' R"' by �
r/J (X) = ( . . . , >( r 1 X ) , >( X), >( Tx ), . . . ) ; here (x) is the center coordinate: Zk (r/J(x)) = (T k x). It is easy to show that r/J is measurable !JR"' 1!JR"' and commutes with the shift in the sense of Example 24.6. Therefore, T preserves PC 1 rfi - I if it preserves PC \ and it is ergodic under PC 1 rfi - 1 if it is ergodic under PC 1 . This translates immediate!y into a result on stochastic processes. Define Y = ( . . . , Y _ J > Y0, Y1, ) in terms of X by •
.
(36.29 )
.
Y,
=
> ( . . . ' xn - 1 ' xn ' xn + I ' . . ) '
that is to sa), Y(w) = rfi ( X(w)) = rfi{w. Since PC 1 is the distribution of X, P{ - I t/1 - 1 = P(r/J{) - I = py - I is the distribution of Y:
36.4. If X is stationary and ergodic, in particular tf the Xn are indepen dent and identically distributed, then Y as defined by (36.29) is stationary and ergodic. Theorem
This theorem fails if Y is not defined in terms of X in a time-invariant way-if the > in (36.29) is no t the same for all n: If n(x) = Z - n(x) and > is replaced by n in (36.29), then Yn X0 ; in this case Y happens to be stationary, but it is not ergodic if the distribution of X0 does not concentrate at a single point. =
The autoregressive model. Let (x) = 'f:k =of3 k Z _ k (x) on the set < 1 and where the series converges, and take (x) = 0 elsewhere. Suppose that Example
36.5.
1 {3 1 that the Xn are independent and identically distributed with finite second moments.
Then by Theorem 22.6, Y, = 'f:k=of3 k Xn - k converges with probability 1, and by Theorem 36.4, the process Y is ergodic. Note that Yn + l f3 Yn + Xn + l and that Xn + l is independent of Yn . This is the linear autoregressive model of order 1. • =
496
STOCHASTIC PROCESSES
The Hewitt-Savage Theorem*
Change notation: Let (R"", !JR"") be the product space with {1,2, . . . } as the index set, the space of one-sided sequences. Let P be a probability measure on !JR"". If the coordinate variables Zn are independent under P, then by Theorem 22.3, P(A ) is 0 or 1 for each A in the tail a--field !J'. If the Zn are also identically distributed under P, a stronger result halos. Let J,; be the class of !JR""-sets A that are invariant under permutations of the first n coordinates: if rr is a permutation of {1, . . . , n}, then x lies in A if and only if (Z,ix), . . . , Z,.n(x), Zn + l(x), . . . ) does. Then ../,; is a a--field. Let ../ n � = 1 J,; be the a--field of !JR""-sets invariant under all finite permutations of coordinates. Then .../ is larger than !T, since, for example, the x-set where r::;; = 1Zk(x) > en infinitely often lies in .../ but not in !T. The Hewitt-Savage theorem is a zero-one law for .../ in the independent, identi cally distributed case.
If the Zn are independent and identically distributed under P, then 1 for each A in .../.
Theorem 36.5.
P( A) is 0
or
By Corollary 1 to Theorem 1 n1.4, there are for giver. A and *' an n and a , Zn) E H) (H E gp ) such that P(A U) < t: . Let V = set U = [(Z1 [(Zn + l • . . . ' Zzn) E H ]. If the zk are independent o.nd identically distributed, then P( A "' U) is the same as PROOF.
•
.
.
!>
.
P ( [ ( Zn + ] • • . . , Zzn , Z ] , . . . , zn . z2n + l • z2n + 2 • . . ) E A ]
But if A E J;n, this is in turn the same as P(A "' V). Therefore, P(A "' U) = P(A "' V). But then, P(A "' V) < t: and P(A "' (U n V)) < P(A "' U) + P(A "' V) < 2t:. Since U and V have the same probability and are independent, it follows that P( A) is within 2 2 *' of P(U) and hence P (A) is within 2t: of P (U) = P(U)P(V) = P(U n V), which is in turn within 2t: of P(A). Therefore, I P 2 (A) - P(A)I < 4t: for all t:, and so P(A) • must be 0 or 1.
PROBLEMS
Suppose that [ X,: t E T] is a stochastic process on en, 5', P) and A E !T. Show that there is a countable subset S of T for which P[ All X, t E T] P[ AII X, t E S ] with probability 1. Replace A by a random variable and prove a similar result.
36.1. i
=
T be arbitrary and let K(s, t) be a real function over T x T. Suppose that K is symmetric in the sense that K(s, t) = K(t, s) and nonnegative-definite in the sense that I:L = 1 K(ti, ti)xi xi > O for k � 1, t1 , , tk in T, and x1, , x k real. Show that there exists a process [ X,: t E T] for which (X, , . . . , X, ) ha s
36.2. Let
•
the centered normal distribution with covariances
* This topic may be omitted.
•
•
•
•
K(ti, t), i, j =1 1, . . . , k�
•
SECTION 36.
KOLMOGORov 's EXISTENCE THEOREM
497
36.3. Let L be a Borel set on the line, let ..£' consist of the Borel subsets of L, and let LT consist of all maps from T into L. Define the appropriate notion of cylinder, and let JT be the a-field generated by the cylinders. State a version of Theorem 36. 1 for (LT, JT ). Assume T countable, and prove this theorem not by imitating the previous proof but by observing that LT is a subset of R T and lies in !JRT. 36.4. Suppose that the random variables X1, X2 , assume the values 0 and 1 and P [ Xn = 1 i.o.] = 1. Let p., be the distribution over (0, 1] of Ln = I Xn ;zn. Show that on the unit interval with the measure p., , the digits of the nonterminating dyadic expansion form a stochastic process with the same finite-dimensional distributions as X1, X2 , . . . . . • .
36.5. 36.3 i There is an infinite-dimensional version of Fubini's theorem. In the construction in Problem 36.3, let L = I = lO, 1), T = {1, 2, . . }, let J consist of the Borel subsets of I, and suppose that each k-dimensional distribution is the k-fold product of Lebesgue measure over the unit interval. Then I T is a countable product of copies of (0, 1), its elements are sequences x = (x1, x 2 , . . . ) of points of (0, 1 ), and Kolmogorov's theorem ensures the existence on (I T, JT ) of a product probability measure 7T: 7T[ X: X; < a;, i < n] = a 1 · · · an for 0 < a; < 1. Let r denote the n-dimensional unit cube. (a) Define r/J: r X I T -+ I T by .
n Show that r/J is measurable y X JT1JT and r/J - I is measurable J T1yn l X J T. Show that r/J - p,n X 7T) = 7T, where An is n-dimensional Lebesgue measure restricted to r. (b) Let f be a function measurable J T and, for simplicity, bounded. Define
in other words, integrate out the coordinates one by one. Show by Problem 34. 18, martingale theory, and the zero-one law that (36 .30 ) except for x in a set of 7T-measure 0. (c) Adopting the point of view of part (a), let gn(x1, , xn) be the result of integrating the variable (yn+ J, Yn + z , . . . ) out (with respect to 7T) from f(x 1, . , . , xn, yn + 1, . . . ). This may suggestively be written as • • •
Show that gn(x1,
• • •
, xn) -+ f(x1, x 2 , . . . ) except for x in a set of 7T-measure 0.
36.6. (a) Let T be an interval of the line. Show that !JR T fails to contain the sets of: linear functions, polynomials, constants, nondecreasing functions, functions of bounded variation, differentiable functions, analytic functions, functions con-
STOCHASTIC PROCESSES
498
tinuous at a fixed t0, Borel measurable functions. Show that it fails to contain the set of functions that: vanish somewhere in T, satisfy x(s) < x(t) for some pair with s < t, have a local maximum anywhere, fail to have a local maximum. (b) Let C be the set of continuous functions on T = [0, oo). Show that A E gp r and A c C imply that A = 0. Show, on the other hand, that A E !JRT and C cA do not imply that A = R T.
36.7.
Not all systems of finite-dimensional distributions can be realized by stochastic processes for which !l is the unit interval. Show that there is on the unit interval with Lebesgue measure no process [ X, : t > 0] for which the X, are independent and assume the values 0 and 1 with probability i each. Compare Problem 1.1.
36.8.
Here is an application of the existence theorem in which T is not a subset of the line. Let (N, J, v) be a measure space, and take T to consist of the ..,Y-sets of finite v-measure. The problem is to construct a generalized Poisson process, a stochastic process [ XA: A E T] such that (i) XA has the Pois�on distribution with mean v( A) and (ii) XA , . . . , XA are independent if A 1 , , An are disjoint. Hint: To define the fi nite-dimensional distributions, generalize this construction: For A, B in T, consider independent random variables Y1 , Y2 , Y3 having Poisson distributions with means v( A n Be), v(A n B), v(Ac n B), take ll- A . B to be the distribution of (Y1 + Y2 , Y2 + Y3 ). •
.
.
SECTION 37. BROWNIAN MOTION Definition
A Brownian motion or Wiener process is a stochastic process [ W, : t > 0], on some (f!, !F, P), with these three properties: {i) The process starts at 0:
P[ W0 = 0] = 1 . {ii) The increments are independent : If
( 37.1)
(37 .2)
then
( 37 .3)
P [ W,. - W, .
<
,
,-1
E H;, i < k ]
=
n P [ W, - W,
i s, k
,
1-1
E H;) .
{iii) For 0 s < t the increment W, - W.S is normally distributed with mean
0 and variance t - s: (37 .4)
P [ W' - ws E H ] = -;==1== ! e- x 2j2(t - s) dx . J 27r ( t - s ) H
The existence of such processes will be proved. Imagine suspended in a fluid a particle bombarded by molecules in thermal motion. The particle will perform a seemingly random movement
SECTION 37.
BROWNIAN MOTION
499
first described by the nineteenth-century botanist Robert Brown. Consider a single component of this motion-imagine it projected on a vertical axis-and denote by W, the height at time t of the particle above a fixed horizontal plane. Cond ition {i) is merely a convention: the particle starts at 0. Condition {ii) reflects a kind of lack of memory. The displacements W, I - W, , . . . , W,k.- 1 - W, k _ 2 the particle undergoes dunng the mtervals [t0, t1], . . . , [ tk _2, tk _ 1] m no way influence the displacement W,k - W, k - 1 it undergoes during [t k _ 1 , t k ]. Although the future behavior of the particle depends on its present position, it does not depend on how the particle got there. As for {iii), that W, - � has mean 0 reflects the fact that the particle is as likely to go up as to go down-there is no drift. The variance grows as the length of the interval [s, t ]; the particle tends to wander away from its position at time s, and having done so suffers no force tending to restore it to that position. To Norbert Wiener are due the mathematical foundations of the theory of this kind of random motion. •
•
•
0
A Brownian motion path.
The increments of the Brownian motion process are stationary in the sense that the distribution of W, - W.S depends only on the difference t - s. Since W0 = 0, the distribution of these increments is described by saying that
STOCHA STIC PROCESSES
500
w; is normally distributed with mean 0 and variance t. This implies (37. 1). If 0 < s < t, then by the independence of the increments, £[ �w; ] = E[( �( w; - �)] + £[ �2 ] = E[�]E[ W: - �] + £[� 2 ] = s. This specifies all the means, variances, and covariances: (37 .5 )
E[ �w; ] = min{ s , t} .
E[ W: ] = O,
If 0 < t 1 < . . · < tk , the joint density of (W, I , W, - w; I , . . . , W, - w; ) is by (20.25) the product of the corresponding normal densities. By the Jacobian formula (20.20), ( W, I , . . . , w; ) has density 2
k-
k
1
k
(37.6)
where t0 = x0 = 0. Sometimes w; will be denoted W(t ), and its value at w will be W(t, w). The nature of the path functions W( · , w ) will be of great importance. The existence of the Brownian motion process follows from Kolmogorov's theorem. For 0 < t 1 < · · · < tk let 11be the distribution in R k with density (37.6). To put it another way, let Jl-1 , be the distribution of (Sp . . . ,Sk ), where S; = X1 + · · · +X; and where X1 , . . . , Xk are independent, normally distributed random variables with mean 0 and variances t1, t 2 - t 1 , • • • , tk - tk - l ' If g(x1, • • • , x k ) = (x1, • • • , x;_1, x; + !• · · · · x k ), then g{Sp . . . , S k ) = (S 1 , . . . , S;_ 1 , S;+ 1 , . . . , Sk ) has the distribution prescribed for , This is because X; + Xi + I is normally distriButed with mean 0 Jl-1 I , , and variance t; + 1 - t; _ 1 ; see Example 20.6. Therefore, 11 •
��
I
,-
1
1+l
•
k.
k
•
( 37.7)
defined in this way for increasing, positive t1 , , tk thus The Jl-1 1 satisfy the conditions for Kolmogorov's existence theorem as modified in Example 36.4; (37.7) is the same thing as (36. 19). Therefore, there does exist a process [ w; : t > 0] corresponding to the Jl-1 1 Taking w; = 0 for t = 0 shows that there exists on some (f!, !F, P) a process [w;: t > 0] with the finite-dimensional distributions specified by the conditions (i), (ii), and (iii). ,
•
k
·
•
•
' . k
Continuity of Paths
If the Brownian motion process is to represent the motion of a particle, it is natural to require that the path functions W( · , w ) be continuous. But Kolmogorov 's theorem does not guarantee continuity. Indeed, for T = [0, oo), the space (f!, !F) in the proof of Kolmogorov's theorem is (RT, ar), and as shown in the last section, the set of continuous functions does not lie in .§RT.
SECTION
37.
501
BROWNIAN MOTION
A special construction gets around this difficulty. The idea is to use for dyadic rational t the random variables � as already defined and then to
redefine the other � in such a way as to ensure continuity. To carry th is through requires proving that with probability 1 the sample path is uniformly continuous for dyadic rational arguments in bounded intervals. Fix a space (f!, !F, P) and on it a process [ � : t > 0] having the finite motion. Let D be the set dimensional distributions prescribed for Brownian n of nonnegative dyadic rationals, let In k = [k2 - ,(k + 2)z - ], and put n
Mn k ( w ) = sup I W( 1 , w ) - W( k 2 - n , w ) l r e In k n D
(37.8 )
Suppose it is shown that f P[Mn > n - 1 ] converges. The first Borel - Cantelli i.o.] ha� probability 0. But lemma will then imply that B = [Mn > n suppose w lies outside B. Then for every t and E there exists an n such that t < n, 2n - 1 < E, and Mn(w) < n - 1 . Take 8 = 2 - n . Suppose that r and r' are dyadic rationals in [0, t ] and l r - r'l < 8. Then rn and r' must for some k < n2 n lie in a common interval In k (length 2 X 2 - ), in which case iW(r, w) - W(r', w)l < 2Mn k(w) < 2Mn( w) < 2 n - 1 < €. Therefore, uJ $. B implies that W( r, w) is for every t uniformly continuous as r ranges over the dyadic rationals in [0, t ], and hence it will have a continuous extension to [0, oo). To prove L:P[Mn > n - 1 ] < oo, use Etemadi's maximal ineq�ality (22.10), which applies because of the independence of the increments. This, together with Markov's inequality, gives -1
_P [ �� I W( t + 8i2 - m ) - W( t ) I > a ) m ) - W( t) l :? a/3] < 3 imax P[ I W( t + 8i2 s; m 2
(see (21.7) for the moments of the normal distribution). The sets on the left here increase with m , and letting m � oo leads to
P sup I W( t + r8) - W( t) l > a
(37.9 )
O :s; r :s; l reD
Therefore,
n)2 X 2K( 2 P [ Mn > n - 1 ] < n2 n 1 4 (n- )
and L:P[ M > n - 1] does converge. n
4 Kn 5
STOCHASTIC PROCESSES
502
Therefore, there exists a measurable set B such that P(B ) = 0 and such that for w outside B, W(r, w ) is uniformly continuous as r ranges over the dyadic rationals in any bounded interval. If w $. B and r decreases to t through dyadic rational values, then W(r, w) has the Cauchy property and hence converges. Put
�, ( w ) = W' ( t, w ) =
lim W( r, w )
if w $. B,
0
if
r
!1
(J)
E B'
where r decreases to t through the set D of dyadic rationals. By construc tion, W'(t, w) is continuous in t for each w in fl. If w $. B, then W(r, w ) = W'( r, w ) for dyadic rationals, and W'( · , w ) is the continuous extension to all of [0, oo). The next thing is to show that the �, have the same joint distributions as the � . It is convenient to prove this by a lemma which will be used again further on.
Let Xn and X be k-dimensional random vectors, and Let Fn( x) be the distribution function of Xn. If Xn � X with probability 1 and Fn(x) � F(x) for all x, then F(x) is the distribution function of X. Lemma 1.
PROOF. t Le t X have distribution function Theorem 4. 1, if h > 0, then n
< lim infn Fn (
X1
+
H. By two applications of
h , . . . , Xk + h )
It follows by continuity from above that F and H agree.
•
Now, for 0 < t 1 < · · · < tk , choose dyadic rationals r1(n) decreasing to the t;. Apply Lemma 1 with (W,.1 < n > ' . . . , W,.k ) and (� ; , . . . , �: ) in the roles of Xn and X, and with the distribution function with density (37.6) in the role of F. Since (37.6) is continuous in the t,., it follows by Scheffe's theorem that Fr.(x) � F(x ), and by construction Xn � X with probability 1. By the lemma, ( W,', . . . , �,k ) has distribution function F, which of course is also the distribution function of ( � , . . . , W, k ). Thus [�': t > 0] is a stochastic process, on the same probability space as [ � , t > 0], which has the finite-dimensional distributions required for Brown ian motion and moreover has a continuous sample path W'( · , w ) for every w. J
1
t The lemma is an obvious consequence of the weak-convergence theory of Section 29; the point of the special argument is to keep the development independent of Chapters 5 and 6.
SECTION 37.
BROWNIAN MOTION
503
By enlarging the set B in the definition of �'(w) to include all the w for which W(O, w) =I= 0, one can also ensure that W'(O, w) = 0. Now discard the original random variables � and relabel �, as � . The new [ �: t > 0] is a stochastic process satisfying conditions (i), (ii), and (iii) for Brownian motion and this one as well: (iv) For each w, W(t, w) is continuous in t and W(O, w) = 0. From now on, by a Brownian motion will be meant a process satisfying (iv") as well as {i), (ii), and (iii). What has been proved is this:
There exist processes [ � : t > 0] satisfying conditions (i), (ii), (iii), and (iv)-Brownian motion processes. Theorem 37.1.
In the construction above, W,. for dyadic r was used to define � in general. F0 1 that reason it suffices to apply Kolmogorov's theorem for a countable index set. By the second proof of that theorem, the space (f!, !F, P) can be taken as the unit interval with Lebesgue measure. The next section treats a general scheme for dealing with path-function questions by in effect replacing an uncountabk time set by a countable one. �easurable Processes
Let T be a Borel set on the line, let [X, : t E T] be a stochastic process on an (f!, !F, P), and consider the mapping (37. 10)
1
( t, w ) � X, ( w ) = X( t, w )
carrymg T X f! into R • Let !T be the u-field of Borel subsets of T. The process is said to be measurable if the mapping (37.10) is measurable !Tx sz-;!Rt. In the presence of measurability, each sample path X( · , w) is measurable !T by Theorem 18.1. Then, for example, J:cp(X(t, w)) dt makes sense if (a, b) c T and cp is a Borel function, and by Fubini' s theorem if
jb E[ lcp( X, )I] dt a
< oo .
Hence the usefulness of this result: Theorem 37.2. PROOF.
Brownian motion is measurable.
If for
k 2-n
<
n ( k + 1)2 - , k = 0, 1, 2, . . . ,
STOCHASTIC PROCESSES
504
then the mapping (t, w) � w < n >( t , w) is measurable !TX !F. But by the continuity of the sample paths, this mapping converges to the mapping (37. 10) pointwise (for every (t, w )), and so by Theorem 13.4(ii) the latter mapping is also measurable !Tx !Fj!N 1 . • Irregularity of Brownian Motion Paths
Starting with a Brownian motion [W,: t > 0] define
( 37 . 1 1 ) where c > 0. Since t � c 2 t is an increasing function, it is easy to see that the process [w;': t > 0] has independent increments. Moreover, w;' - �, = c - 1 (Wc2 , - Wc 2), and for s < t this is normally distributed with mean 0 and variance c - 2 (c 2 t - c 2s) = t - s. Since the paths W'( · w) all start from 0 and are continuous, [ w;' : t > 0] is another Brownian motion. In (37. 1 1) rhe time scale is contracted by the factor c 2 , but the other scale only by the factor c. That the transformation (37.1 1) preserves the properties of Brownian motion implies that the paths, although continuous, must be h igh ly irregular. It seems intuitively clear that for c large enough the path W( · w) must with probability nearly 1 have somewhere in the time interval [0, c] a chord with slope exceeding, say, 1. But then W'( · w) has in [0, c- 1] a chord with slope exceeding c. Since the w;' are distributed as the w; , this makes it plausible that W( · w) must in arbitrarily small intervals [0, 8] have chords with arbitrarily great slopes, which in turn makes it plausible that W( · w) cannot be differentiable at 0. More generally, mild irregularities in the path will become ever more extreme under the transformation (37. 1 1) with ever larger values of c. It is shown below that, in fact, the paths are with probability 1 nowhere differentiable. Also interesting in this connection is the transformation ,
,
,
,
,
if t > 0, if t = 0.
( 37.12)
Again it is easily checked that the increments are independent and normally distributed with the means and variances appropriate to Brownian motion. Moreover, the path W"( · w) is continuous except possibly at t = 0. But (37.9) holds with w;' in place of � because it depends only on the finite-dimen sional distributions, and by the continuity of W"( · w) over (0, oo) the supremum is the same if not restricted to dyadic rationals. Therefore, 1 P[sup s :s; n -J I W;'I > n - ] < Kjn 2 , and it follows by the first Borel-Cantelli lemma that W"( · w) is continuous also at 0 for w outside a set M of probability 0. For w E M, redefine W"(t, w) 0; then [ w;": t > 0] is a Brown ian motion and (37. 12) holds with probability 1. ,
,
,
=
SECTION 37.
BROWNIAN MOTION
505
The behavior of W( · , w) near 0 can be studied through the behavior of W"( · , w ) near oo and vice versa. Since (W/' - w;)/t = W1 1 , , W"( · , w) cannot have a derivative at 0 if W( · , w) has no limit at oo. Now, in fact, inf Wn =
(37. 13)
n
- oo
sup wn =
'
n
+ oo
with probability 1. To prove th is, note that Wn = X 1 xk = wk - wk - 1 are independent. Consider
[ sup wn < ] oo
n
=
"'
"'
[
· · ·
+
+
Xn , where the
< ]
u n "!ax W; u ;
u = l m= l
t_m
this is a tail set and hence by the zero-one law has probability 0 or 1. Now -X1 , -X2 , have the same joint distributions as X1, X2 , , and so th is event has the same probability as •
•
•
•
[ inf W, > ] = - oo
n
"'
"'
[
u n '!l:s;ax ( - W;) r m
u=l m � l
•
•
< uJ .
If these two sets have probability 1, so has [supni W,I < oo], so that P[ sup n i Wn l <x] > O for some x. But P[1Wnl <x] = P[IW1 I <xjn 1 1 2 ] � o. This proves (37. 13). Since (37. 13) holds with probability 1, W"( , w) has with probability 1 upper and lower right derivatives of + oo and - oo at t = 0. The same must be true of every Brownian motion. A similar argument shows that, for each fixed t, W( , w) is nondifferentiable at t with probability 1 . In fact, W( w) is nowhere differentiable: ·
·
Theorem 37.3.
differentiable.
· ,
For w outside a set of probability 0, W( , w) is nowhere ·
The proof is direct-makes no use of the transformations (37. 1 1 ) and (37.12). Let PROOF.
By independence and the fact that the differences here have the distribution of 2 - n i 2 W1 , P[Xn k €] = P 3[IW1 1 2 n 1 2E ]; since the standard normal den sity is bounded by 1, P[Xnk €] (2 X 2 n 12E)3. If Yn = min k :s; n 2" Xn k then
<
(37.15)
< < <
•
STOCHASTIC PROCESSES
506
Consider now the upper and lower right-hand derivatives w
D (t, w)
=
D w( t , w ) =
.hm sup W( t + h, w ) - W( t , w ) , h h !O .1 . W( t + h, w ) - W( t, w ) I m Inf h h !O
·
w
Define E (not necessarily in !F) as the set of w such that D (t, w) and Dw ( t, w) are both finite for some value of t. Suppose that w lies in £, and suppose specifically that
< <
There exists a positive 8 (depending on w, t, and K) such that t s t + li implies IW(s, ) W(t, w)l Kls - tl. If n exceeds some n0 (depending on 8 , K, and t ), then 4 - n 8, 8K < n , n > t. cu
<
-
X
2 <
<
n n n Given such an n, choose k so that (k - l)r t < k2- . Then li2 -n - tl < 8 for i = k, k + 1, k + 2, k + 3, and therefore Xnk(w) 2K(4 X 2 - ) < n2- n . n Since k - 1 t2 n < n2 n, Yn( w) n2 - . What has been shown is that if w lies in E, then w lies in A n [Yn nr n ] for all sufficiently large n: E c lim infn An- By (37. 15),
<
<
<
=
<
By Theorem 4.1, lim infn A n has probability 0, and outside this set W( w) is nowhere differentiable-in fact, nowhere does it have finite upper and lower right-hand derivatives. (Similarly, outside a set of probability 0, nowhere does W( w) have finite upper and lower left-hand derivatives.) • ·
·
,
,
If A is the set of w for wh ich W( w) has a derivative somewhere, what has been shown is that A c B for a measurable B such that P (B) 0; P(A) 0 if A is measurable, but this has not been proved. To avoid such problems in the study of continuous-time processes, it is convenient to work in a complete probability space. The space (f!, !F, P) is complete (see p. 44) if A cB, B E /F, and P(B) = 0 together imply that A E !F (and then, of course, P(A ) 0). If the space is not already complete, it is possible to enlarge !F to a new u-field and extend P to it in such a way that the new space is complete. The following assumption therefore entails no loss of generality: For the rest of this section the space (f!, !F, P) on which the Brownian motion is defined is assumed complete. Theorem 37.3 now becomes: W( w) is with probability 1 nowhere differentiable. · ,
=
=
·
,
=
S ECTION 37.
BROWNIAN MOTION
507
A nowhere-differentiable path represents the motion of a particle that at no time has a velocity. Since a function of bounded variation is differentiable almost everywhere (Section 31), W( w) is of unbounded variation with probability 1. Such a path represents the motion of a particle that in its wanderings back and forth travels an infinite distance in finite time. The Brownian motion model thus does not in its fine structure represent physical reality. The irregularity of the Brownian motion paths is of considerable mathematical interest, however. A continuous, nowhere-differentiable func tion is regarded as pathological, or used to be, but from the Brownian-motion point of view such functions are the rule not the exception.t The set of zeros of the Brownian motion is also interesting. By property (iv), t = 0 is a zero of W( w) for each w. Now [W:": t > Oj as defined by (37. 12) is another Brownian motion, and so by (37.13) the sequence {U.�': n = 1, 2, . . . } = {nW1 1 : n = 1, 2, . . . } has supremum + oo and infimum oo for w outside a set of probability 0; for such an w, W( w) changes sign infiniteiy often near 0 and hence by continuity has zeros arbitrarily near 0. Let Z(w) denote the set of zeros of W( w ). What has just been shown is that 0 E Z{w) for each w and that 0 is with probability 1 a limit of positive points in Z(w). From (37.13) it also follows that Z(w) is with probability 1 un bounded above. More is true: ·
·
,
,
-
,
·
·
,
,
The set Z(w) is with probability bounded, nowhere dense, and of Lebesgue measure 0. Theorem 37.4.
1
perfect [A15], un
Since W( - , w) is continuous, Z(w) is closed for every w. Let A denote Lebesgue measure. Since Brownian motion is measurable (Theorem 37.2), Fubini's theorem applies: PROOF.
j A ( Z( w )) P( dw) = ( A X P ) [(t, w ) : W( t, w) = 0) l1
"'
= 1 P [ w: W( t, w) = O) dt = O. 0
Thus A(Z( w )) = 0 with probability 1. If W( · w) is nowhere differentiable, it cannot vanish throughout an interval I and hence must by continuity be nonzero throughout some subin terval of I. By Theorem 37.3, then, Z(w) is with probability 1 nowhere dense. It remains to show that each point of Z(w) is a limit of other points of Z(w). As observed above, th is is true of the point 0 of Z(w). For the general point of Z( w ), a stopping-time argument is required. Fix r > 0 and let ,
t For the construction of a specific example, see Problem 31. 18.
508
STOCHASTIC PROCESSES
r(w) = inf[ t : t > r, W(t, w) = 0]; note that this set is nonempty with probabil ity 1 by (37. 13). Thus r( w) is the first zero following r. Now
and by continuity the infimum here is unchanged if s is restricted to rationals. This shows that r is a random variable and that [ w. r( w)
<
t]
E CT [ Wu : u < t ] .
A nonnegative random variable with this property is a stopping time. To know the value of r is to know at most the values of Wu for u < r. Since the inctements are independent, it therefore seems intuitiveiy clear that the process
: > 0, ought itself to be a Brownian motion. This is, in fact, true by the next result, Theorem 37.5. What is proved there i� that the finite-dimensional distribu tions of [ � * : t > 0] are the right ones for Brownian motion. The other properties are obvious: W * ( w) is continuous a nd vanishes at 0 by construc tion , and the space on which [ � * ; t > 0] is defined is complete because it is the original space (f!, !F, P), assumed complete . If [ � * : t > 0] is indeed a Brownian motion, then, as observed above, for w outside a set Br of probability 0 there is a positive sequence {tn} such that t n � 0 and W* (tn , w) = 0. But then W( r(w) + tn , w) = 0, so that r(w ), a zero of W( w ), is the limit of other larger zeros of W( w ). Now r( w) was the first zero following r. (There is a different stopping time r for each r, but the notation does not show this.) If B is the union of the Br for rational r, the first point of Z(w) following r is a limit of other, larger points of Z(w). Suppose that w $. B and t E Z(w ), where t > 0; it is to be shown that t is a limit of other points of Z(w). If t is the limit of smaller points of Z(w), there is of course nothing to prove. Otherwise, there is a rational r such that r < t and W( w) does not vanish in [ r, t ); but then, since w $. Br, t is a limit of larger points s that lie in Z(w). This completes the proof of Theorem 37.4 under the provisional assumption that (37. 16) is a Brownian motion. • · ,
· ,
·
,
· ,
The Strong Markov Property
Fix t0 > 0 and put (37.17)
�, = � o+t - � o'
t > 0.
It is easily checked that [ �, : t > 0] has the finite-dimensional distributions
SECTION
37.
BROWNIAN MOTION
509
appropriate to Brownian motion. As the other properties are obvious, it is in fact a Brownian motion. Let
(37.18 ) The random variables (37.17) are independent of .9;0. To see this, suppose that O < s 1 < < t k . Put u ; = t0 + t,. Since the < si < t0 and 0 < t 1 < increments are independent, ( w;', w;' - w;', . . . , W,' - W,' ) = ( Wu w; 0, Wu - W, , W,, - Wu ) is independent of (Ws) , Ws - Ws , . . . , Ws W.S. ). But then ( w;', w;' , . . . , w;' ) is independent of \ W.S , � , . . . , W.SJ By Theorem 4.2, ( W,' , . . . , U!;' ) is independent of .9;0 Thus 0 0 0
2
J - l
I
.
.
· ·
0
,
k
I
I
(37. 19 )
k -
2
I
1
0
2
I
'
k
0
k
k
I
2
k-
2
I
1
I
J
j
P ( [ ( w;' , w;: ) E H] n A) P [ ( w;' , , w,: ) E H ] P ( A ) P ( ( w; , w;J E H ] P( A ) , 1
• • • ,
• 0 •
=
1
=
1, • • •
where the second equality follows because (37.17) is a Brownian motion. This holds for all H in IN k. The problem now is to prove all this when t0 is replaced by a stopping time r- a nonnegative random variable for which
(37 .20 )
[ w : r ( w) < t]
E
t > 0.
.9;,
It will be assumed that r is finite, at least with probability 1. Since [ r = t] [ r < t] - U J r < t - n - 1 ], (3 7.20) implies that
(37 .21 )
[ w : r ( w ) t] =
=
t > 0.
E .9;,
The conditions (37.20) and (37.21) are analogous to the conditions (7. 18) and (35.18), which prevent prevision on the part of the gambler. Now .9;0 contains the information on the past of the Brownian motion up to time t0, and the analogue for r is needed. Let .9:, consist of all measurable sets M for which
(37.22)
M n [ w: r ( w )
< t)
E .9;
for all t. (See (35.20) for the analogue in discrete time.) Note that .9:, is u-field and r is measurable .9:,. Since M n [ r t] M n [ r t] n [ r < t ], =
(37.23)
M n [ w: r ( w)
for M in .9:,. For example, [infs w.r > - 1] is in .9:,. ,s
7
r
=
t)
= inf[t: w;
=
a
=
E .9; =
1] is a stopping time and
STOCHASTIC PROCESSES
510 Theorem 37.5.
Let r be a stopping time, and put
(37.24 )
t > 0.
Then [ � * : t > ] is a Brownian motion, and it is independent of .9:,-that is, u[ � * : t > 0] is independent of 5':,: (37.25) P ( [ ( � 7 , . . . , � : } E H ] n M ) =
for H in
�
k
P[( � 7 , . . . , �: ) E H ) P( M ) = P ((�1, . . . , �J E H ] P(M)
and M in .9:,.
That the transfmmation (37.24) preserves Brownian motion is the strong Markov property .t Part of . the conclusion is that the � * are random vari ables. Suppose first that general point of V. Since PROOF.
[ w : �* ( w )
E H]
=
r has countable range V and let t0 be the
U [ w : � 0 + 1 ( w ) - �i w) E H, r ( w ) = t0 ] ,
t0E V
� * is a random variable. Also,
P( [ ( � 7 , . . . , �: ) E H ] n M) = L P( [ ( � 7 , . . . , �: ) E H] n M 0 [ r = t01) . t0 E V
If M E �' then M n [ = tol E g-;0 by (37.23). Further, if = to , then �* coincides with �, as defined by (37.17). Therefore, (37.19) reduces th is last sum to T
T
L P[( �
t0E V
1
, . . . , �J E H ] P(M n [ r = t0 ])
This proves the first and third terms in (37.25) equal; to prove equality with the middle term, simply consider the case M = fl. t since the Brownian motion has independent increments, it is a Markov process (see Examples 33.9 and 33.!0); hence the terminology.
S ECTION
37.
Sll
BROWNIAN MOTION
Thus the theorem holds if r has countable range. For the general r, put
{ k2 -n
n n k - 1 }2- < < k 2- , k = 1 , 2, . . . if ( (37.26) n= 0 if r = O. n If k r n < t < (k + 1 )T , then [rn < t] = [r < k T n ] E !Fk2-" c !Jl;. Th us each n n + 1)2 - . and k r M E t < (k .9; rn is a stopping time. Suppose that < n Then M n [ Tn < t l = M n [ < kr ] E !F..k 2 -n c sz;. Thus s: c .9"_ Let n (n ) - �< >(w) = WT n(w ) +t(w) - WT"(w )(w) -that is, let � be the � * corresponding to the stopping time rn . If M E .9; then M E .9; and by an application of (37.25) to the discrete case already treated, 7'
7'
7'
T
n
Tn
•
,
But rn(w) � r(w) for each w, and by continuity of the sample paths, �< n >(w) � � * (w) for each w. Condition on M and apply Lemma 1 with n n (w,< >, . . . , w,< > ) for Xm ( �* , . . . , �"' ) for X, and the distribution function of (� , . . . , � n ) for F = Fn . Then (37.25J!<, follows from (37.27). • I
k
I
I
The r in the proof of Theorem 37.4 is a stopping time, and so (37.16) is a Brownian motion, as required in that proof. Further applications will be given below. If !F* = u[W/ : t > 0], then according to (37.25) (and Theorem 4.2) the u-fields .9; and !F * are independent:
(37.28)
P ( A n B ) = P( A ) P( B ) ,
For fixed t define rn by (37.26) but with t2 - n in place of 2 - n at each occurrence. Then [ WT < x 1 n [ r < t] is the limit superior of the sets [ WT < x ] n [ r < t ], each of which lies in SZ";. This proves that [ � < x] lies in .9'; and hence that � is measurable g;;. Since r is measurable .9;, n
(37.29) for planar Borel sets H. The Reflection Principle
For a stopping time r, define
(37.30 )
if t < if t >
7' ' 7' .
The sample path for [ �": t > 0] is the same as the sample path for [ � : t > 0] up to r, and beyond that it is reflected through the point WT . See the figure.
512
STOCHASTIC PROCESSES w WT
------------ ----- -----
1
-------I r
: \ {'\..__ ../ I I I
I I I
...
'
W"
T
The process defined by (37.30) is a Brownian motion, and to prove it, one need only check th� finite-dimensional distributions: ?[( � 1 , , �) E H] = E H ]. By the argument starting with (37.26), it is enough to P [( �" , . . . , �") k consider the case where r has countable range, and for this it is enough to check the equation when the sets are intersected with [ r = t0 ]. Consider for notational simplicity a pair of points: • • •
I
(37.31 ) If s < t < t0 , this holds because the two events are identical. Suppose next that s < t0 < t. Since [ r = t0] lies in !F, , it follows by the independence of the increments, symmetry, and the definition (37.30) that 0
P [ r = t0 , (Ws , W, J E / , � - W,0 E l] = P [ r = t0, ( �, �0 ) E l , - (� - �J E l]
K
If
=
I X J, this is
P [ r = to , ( � , � � - � ) E K ] = P [ r = t 0, ( w;' , �� , �, - �� ) E K ] , o,
o
and by 1r-A it follows for all K E !JP3• For the appropriate K, this gives (37.31). The remaining case, t0 < s < t, is similar. These ideas can be used to derive in a very simple way the distribution of M = sup s � · Suppose that x > O. Let r = inf[s: � > x], define W" by and (37.30), ati"d put r" = inf[s: w;' > x] and M;' = SUPs I w;' . Since W" is another Brownian motion, reflection through the point W7 = x shows ,
< r
:S
r" = T
SECTION
that
37.
BROWNIAN MOTION
5 13
P[ M, > X ] = P[ T < t] = P[ T < t ' w, < X ] + P[ T < t ' w, > X ] = P[ < t ' W," < X ] + P[ T < t ' w, > X ] P [ r" < t, W, > X ] + P[ r < t , W, > X ] = P[ T < t ' w, > X] + P[ T < t' w, > X ] = 2P[ w, > X ] . T11
=
Therefore,
(37 .32)
!"'
2 e - u 2 12 du . -/2 7r x;{r
P[ M, > x ] =
-
This argument, an application of the reflection principle,t becomes quite transparent when referred to the diagram. Skorohod Embedding •
Suppose that Xp X2 , are independent and identically distributed random variables with mean 0 and variance u 2 A powerful method, due to Skoro hod, of studying the partial sums sn = XI + . . . + xn is to construct an increasing sequence r0 = O, r 1 , r 2 , of stopping times such that W(rn) has the Same distribution aS Sn. The differenCeS Tk - Tk - l Will turn OUt tO be independent and identically distributed with mean u 2 , so that by the law of large numbers n - l rn = n - I [Z = 1( rk - rk _ 1 ) is likely to be near u 2 • But if rn is near nu 2 , then by the continuity of Brownian motion paths W(rn) will be near W(nu 2 ), and so the distribution of Sn/uVn, which coincides with the distribution of W(rn)/uVn, will be near the distribution of W(na 2 )juVn -that is, will be near the standard normal distribution. The method will thus yield another proof of the central limit theorem, one independent of the characteristic-function arguments of Section 27. of max k < n Sk fuVn But it will also give more. For example, the distribution is exactly the distribution of max k p W( rk )ju{n , and this in turn is near the distribution of sup, ,;n u W(t)juvn , which can be written down explicitly because of (37.32). It will thus be possible to derive the limiting distribution of max k ,; n Sk . The joint behavior of the partial sums is closely related to the behavior of Brownian motion paths. The Skorohod construction involves the class !T of stopping times for which •
•
•
•
•
•
•
2
(37.33 ) (37.34 )
E[WT] = 0, E[r] = E [ W72 ] ,
t see Problem 37. 18 for another application. * The rest of this section, which requires martingale theory, may be omitted.
514
STOCHASTIC PROCESSES
and
Lemma
<
2. ALL bounded stopping times are members of !T.
Define Y9 , , = exp(OW, - �0 2 t) for all 0 and for t > 0. Suppose t and A E !f;. Since Brownian motion has independent increments,
PROOF.
that s
f Y9 , t dP = f e 9 Ws - 9 2s/2 dP . E [ e 9( W, - Ws) - 92(1 - s)/2 ] ' A
A
and a calculation with moment generating functions (see Example 21.2) shows that
s < t ' A E �-
(37.36)
This says that for 0 fixed, [ Y9, 1 : t > 0] is a continuous-time martingale adapted to the u-fields !F,. It is the moment-generating-function martingale associated with the Brownian motion. Let f(O, t) denote the right side of (37.36). By Theorem 16.8,
:O f( 0 , t ) = £ Y9 ( W, - 0 t ) dP, a2 2 - t ] dP, Ot Y t = O, ) f( ) ( j W, , 9 [ ao 2 ,t
A
•
iJ 4 , [ ( W, - Ot ) 4 - 6( W, - O t) 2 t + 3t 2 ] dP. Y f( f ) , t 8 9 ao 4 A ·
Differentiate the other side of the equation (37.36) the same way and set 0 = 0. The result is
f � dP = f W, dP, A
A
fA ( Ws4 - 6W/s + 3s 2 ) dP = f ( W,4 - 6W, 2 t + 3t 2 ) dP, A
s < t, A E �' s < t , A E g;;,
s < t , A E g;;,
This gives three more martingales: If Z, is any of the three random variables
(37 .37)
w, ,
SECTION
then
37.
BROWNIAN MOTION
Z0 = 0, Z,
is integrable and measurable !F,, and
j Zs dP = j Z, dP,
(37.38)
5 15
A
A
s < t, A E �-
In particular, E[ Z,] = [Z0] = 0. If r is a stopping time with finite range (37.38) implies that
{t 1 ,
• • •
, tm}
bounded by
t,
then
Suppose that r is bounded by t but does not necessarily have finite range. n n n n Put Tn = k2 - t if (k - l)T t < T ::;; k2- t, 1 k 2 , and put Tn = 0 if r =- 0. Then rn is a stopping time and E[ Z-r" ] = 0. For each of the three possibilities {37.37) for Z, SUPs :s;riZsl is integrable because of (37.32). It therefore follows by the dominated convergence theorem that £[27] = lim" E[ ZT " ] = 0. Thus £[ Z7 ] = 0 for every bounded stopping time -r. The three cases {17.37) give
< <
This implies
(37.33), (37.34), and
0 = £[ �4 ] - 6£ [ W72 r ] + 3£ ( r 2 ] > E [ WT4 ] - 6£ 1 !2 [ WT4 ] £ 1 ! 2 [ r 2 ] + 3£(r 2 ]. If C = E 1 1 2 [ W74] and x = E 1 1 2 [ r 2], the inequality is 0 > q(x) = 3x 2 - 6Cx + C 2 . Each zero of q is at most 2C, and q is negative only between these two zeros. Therefore, x < 2C, which implies (37.35). •
Suppose that r and rn are stopping times, that each rn is a member of !T, and that rn � r with probability 1. Then r is a member of Y if (i) E[ W74" ] < E[ W74] < oo for all n, or if {ii) the W74" are uniformly integrable. Lemma 3.
PROOF. Since Brownian motion paths are continuous, W7 " � W7 with probability 1. Each of the two hypotheses {i) and {ii) implies that £[ �4] " is bounded and hence that £[ r,;J is bounded, and it follows (see (16.28)) that the sequences { rn }, {W7" }, and {W72" } are uniformly integrable. Hence (37.33) and (37.34) for r follow by Theorem 16.14 from the same relations for the rn The first hypothesis implies that lim infn £[ �4] " E[W74], and the second implies that limn E[ W74] = E[ W74]. In either case it follows by Fatou's lemma that £[ r 2 ] lim infn El r,;] 4 lim infn E[ W74] • " 4E[ W74].
<
<
<
<
STOCHASTIC PROCESSES
516
Suppose that a, b > 0 and a + b > 0, and let r(a, b) be the hitting time for the set { - a, b}: r(a, b) = inf[t: W, E { - a, b}]. By (37. 13), r(a, b) is fini te with probability 1, and it is a stopping time because r(a, b) < t if and only if for every m there is a rational r < t for which W, is within m - 1 of -a or of b. From I W(min{r{a, b), n})l < max{a , b} it follows by Lemma 3(ii) that r(a, b) is a member of !T Since WT(a , b ) assumes only the values - a and b, E[ WT(a , b ) ] = 0 implies that
b W a P [ T(a , b ) = - ] = a + b '
( 37.39)
a . = P [ �< a . b > � b ] a + b
This is obvious on grounds of symmetry in the case a = b. Let J.L be a probability measure on the line with mean 0. The program is to construct a stopping time r for which W7 has distribution J.L· Assume that J.L{O} < 1 , since otherwise r = 0 obviou!>ly works. If J.L consists of two point masses, they must for some positive a and b be a mass of b j(a + b) at - a and a mass of a j(a + b) at b; in this case r< a. b ) is by (37.39) the required stopping time. The general case will be treated by adding together stopping times of this sort. Comider a random variable X having distribution J.L· (The probability space for X has nothing to do with the space the given Brownian moiion is defined on.) The technique will be to represent X as the limit of a martingale X� X2 , . . . of a simple form and then to duplicate the martingale by W7'1 , W7 , for stopping times rn; the rn will have a limit r such that W7 has 2 distribution as X. the same The first step is to construct sets •
•
•
and corresponding partitions
IQ' = ( - a�n > J , n = ( a< n > a< n > J 1 < k <· rn I k- I' k ' - - ' .:::rn : k Irnn + l - ( a,(n" ) , oo ) . oo ,
I'UJ
_
Let M( H ) be the conditional mean: if 11-( H ) > 0.
d 1 consist of the single point M(R 1 ) = E[ X] = 0, so that .91 consists of IJ = ( - oo, 0] and Il = (0, oo). Suppose that dn and 9.r are given. If J.L({I:)o) > 0, split If: by adding to d n the point M(lf:), which lies in (If:)0; if J.L(Uf:n = 0, I: appears again in 9.r + 1 • Let
SECTION 37.
BROWNIAN MOTION
517
� be the u-field generated by the sets [X E I;J, and put xn = £[XII�]. Then XI , X2 , . . . is a martingale and xn = M{I;) on [ X E I;J. The Xn have finite range, and their joint distributions can be written out explic itly. In fact, [XI = MU1 I ), . . . ' xn = !vl( I;n )] = [ X E If I ' . . . ' X E I;n ], and this set is empty unless IfI :::> :::> I;n, in which case it is [ Xn = MU;n )] = [ X E I:). Therefore, if k n - l = j and Ip - 1 = I;_ 1 u i;, Let
•
•
•
and
provided the conditioning event has positive probability. Thus the martingale {Xn} has the Markov property, and if x = M{Ip - 1 ), u = M(I;_ 1 ), and u = M{I;), then the conditional distribution of xn given xn - ! = X is concen trated at the two points u and v and has mean x. The structure of { Xn} is determined by these conditional probabilities toge ther with the distribution
of
X1
•
If .:#= u(U n.§,), then Xn � £[XII.§] with probability 1 by the martingale theorem (Theorem 35.6). But, in fact, Xn � X with probability 1, as the following argument shows. Let B be the union of all open sets of wmeasure 0. Then B is a countable disjoint union of open intervals; enlarge B by adding to it any endpoints of wmeasure 0 these intervals may have. Then JL(B) = 0, and x $. B implies that JL(X - E, x] > 0 and JL[ X, x + €) > 0 for all positive €. Suppose that x = X(w) $. B and let xn = Xn(w). Let I; be the element of 9.r containing x; then xn + l = MU; ) and I; � I for some interval I. Suppose that xn+ l < x - E for n in an infinite sequence N of integers. Then xn+ 1 is the left endpoint of I;n++ I1 for n in N and converges along N to the left endpoint, say a, of I, and (x - E, x] c I. Further, xn+ 1 = MU; ) � M(I) along N, so that M(J) = a . But this is impossible because JL(X - €, X] > 0. Therefore, Xn > X - € for large n . Similarly, xn <X + E for large n , and so xn �x. Thus Xn(w) � X(w) if X(w) $. B, the probability of which is 1 . Now X1 = E[ XII-:#1 ] has mean 0, and its distribution consists of point masses at - a = M(IJ ) and b = M(Il). If r 1 = r( a , b) is the hitting time to "
"
"
"
518
STOCHASTIC PROCESSES
{ - a, b}, then (see (37.39)) r 1 is a stopping time, a member of !T, and W has the same distribution as X1 • Let r 2 be the infimum of those t for which t > r 1 and W, is one of the points M{l�), 0 < k < r2 + 1. By (37. 13), r 2 is finite with probability 1; it is a stopping time, because r 2 < t if and only if for every m there are rationals r and s such that r < s + m - 1 , r < t, s < t, W,. is within m - 1 of one of the points M(l} ), and W, is within m - 1 of one of the points M(l�). Since I W(min{r2 , n})l is at most the maximum of the values I M{l�)l, it follows by Lemma 3(ii) that r 2 is a member of !T. Define W,* by (37.24) with r 1 for r. If x = M{l/ ), then x is an endpoint common to two adjacent intervals If_ 1 and I�; put u = M(l�_ 1 ) and v M(lk2 ). If W = x, then u and u are the only possible values of WT2 . If r* is the first time the Brownian motion [ W,* : t > 0] hits u - x or v - x, then by TJ
=
TJ
(37.39),
v -x P[ Wt r = u - x ] = -V - U '
P [ WtT = v - x ] =
On the set [ Wr = x ], r 2 coincides with r 1 1
x-u
V - U
.
+ r*, and it follows by (37.28) that
P [ WT = x WT2 = v ] = P[ WT = x x + WTt v] 1
7
1
7
=
x u = P [ Wr Jv-u = P [ Wr = x ) P[ Wrt = v - x ] 1
·
This, together with the same computation with u in place of v, shows that for Wr I = x the conditional distribution of WT 2 is concentrated at the two points u and v and has mean x. Thus the conditional distribution of Wr 2 given Wr 1 coincides with the conditional distribution of X2 given X1 • Since Wr 1 and X 1 have the same distribution, the random vectors (W,. 1 , Wr ) and (X1 , X2 ) also have the same distribution. An inductive extension of this argument proves the existence of a se , each rn is a member of quence of stopping times rn such that r 1 < r 2 < , W,." have the same joint distribution as !T, and for each n, Wr , XI ' . . . ' xn. Now suppose that X has finite variance. Since Tn is a member of ' !T, £[ rn] = E[ Wr2" ] = E[ X;J = £[ £ 2[ X II.§,.]] < £[X 2 ] by Jensen s inequality (34. 7). Thus r = limn rn is finite with probability 1. Obviously it is a stopping time, and by path continuity, WT � WT with probability 1. Since xn � X with " probability 1, it is a consequence of the following lemma that Wr has the distribution of X. · · ·
1
• • •
If Xn � X and Yn � Y with probability the same distribution, then so do X and Y. Lemma 4.
1,
and if Xn and Y,. have
SECTION
37.
PROOF.t
51 9
BROWNIAN MOTION
By two applications of (4.9), P[ X < x ] < P[ X < x + € ] < lim infP[ Xn < x + E ] n
< lim sup P[ Y,. < x + E ] < P[ Y < x + E ] . n
Let E � 0: P[ X < x ] < P [ Y < x ]. Now interchange the roles of X and
Y.
•
Since X! < £[ X 2 11 .§,. ], the Xn are uniformly integrable by the lemma preceding Theorem 35.6. By the monotone convergence theorem and Theo rem 16. 14, £[ r] = limn £[ rn] = limn E[ W.,.:] = limn E[ X!] = £[ X 2 ] = E Ut:?J . If E[ X4] < oo, then E[ W.,.�) = E[ X:l < £[ X4J = £[ W.,.4 ] (Jensen's inequality again), and so r is a member of !Y. Hence £[ r 2 ] < 4E[ W.,.4 ]. This construction establishes the first of Skorohod's embedding theorems:
Suppose that X is a random variable with mean 0 and finite variance. There is a stopping time r such that W.,. has the sume distnbution as X, E[r] = £[ X 2 ], and E[r 2 ] < 4E[ X4]. Theorem
37.6.
Of course, the last inequality is trivial unless £[ X4] is finite. The theorem could be stated in terms not of X but of its distribution, the point being that the probability space X is defined on is irrelevant. Skorohod's second embedding theorem is this:
Suppose that X1 , X2 , are independent and identically distributed random variables with mean 0 and finite variance, and put Sn = X1 of stopping times + · +Xn. There is a nondecreasing sequence r 1 , r , 2 SUCh that the W.,.n haVe the Same joint distributions aS the Sn and Tl, T2 - T l , T3 - r2 , are independent and identically distributed random variables satisfying £[ rn - rn - d E[ xn and E[( rn - rn _ 1 ) 2 ] < 4E[ XiJ. Theorem
·
37.7.
•
•
•
·
•
•
.
•
•
•
=
The me thod is to repeat the construction above inductively. For notational clarity write W, = w,< 1 > and put .9;< 1 > = u[ �< l ): 0 < s < t ] and 1 sr-< > = u[ w,( l >: t > 0]. Let 8 1 be the stopping time of Theorem 37.6, so that W8\1 > and X1 have the same distribution. Let �� I ) be the class of M such that M n [81 < t ] E .9;< 1> for all t. Now put w. <2> = w,li<1ll+ 1 - W.li(1l )' .:T..t <2> = u[ W.S< 2> · 0 < s < t ]' and .9"( 2) = u[ W,( 2): t > O]. By another application of Theorem 37.6, construct a stopping time 8 for the Brownian motion [ W,(2): t > 0] in such a way that W8�> has the 2 same distribution as X1 • In fact, use for 8 2 the very same martingale construction as for 8p so that (81, W8\1 >) and (8 , W8�2>) have the same 2 2 9" ( . ) are independent (see (37.28)), it follows distribution. Since �(I) and (see (37.29)) that (8p W8\1 >) and (8 , W8�2>) are independent. PROOF.
I
•
2
t This is obvious from the weak-convergence poi nt of view.
-
520
STOCHASTIC PROCESSES
class of M such that M n [8 < t] E .9; < 2> for all t. If 2 and sz-<3> is the u-field generated by these random variables, then again SZ";,�2> and sz-(3) are independent. These two u-fields are contained in sz-( 2), which is independent of �� I )· Therefore, the three u-fields SZ";,� 1>, ��2>, sz-(3) are independent. The procedure therefore extends inductively to give independent, identically distributed random vectors (8n ' W,(8"n ) ). If Tn 8 1 + . . . + 8n ' then W ( l ) W,(81l ) + . . . + W,(8"n ) has the dis• tribution of XI + . . . + xn .
sz-;,<2> be the WI(3 ) w./l2<2+> l - w./l2< 2> Let
=
T"
=
=
In variance •
If E[ xn CT 2' then, since the random variables Tn - Tn I of Theorem 37.7 are independent and identically distributed, the strong law of large numbers (Theorem 22. 1) applies and hence so does the weak one: =
( 37 .40) (If E[X:] < oo, so that the rn - rn l have second moments, this follows immediately by Chebyshev 's inequality.) Now Sn has the distribution of W(rn), and rn is near nu 2 by (37.40); hence Sn should have nearly the distribution of W(nu 2 ), namely the normal distribution with mean 0 and variance nu 2• To prove this, choose an increasing s�quence of integers Nk such that P[ln - 1 rn - u 2 1 > k- 1 ] < k- 1 for n > Nk , and put En k - 1 for Nk < n < Nk + 1 • Then En � 0 and P[ln - 1 rn - u 2 1 > t: n ] < En. By two applications of (37.32), =
s.( < )
�
[ """:!f,;W(
P I W(
'•) I > <
< P [ I n - 1 rn - u 2 I > En]
+P
]
[ -nu
sup
lr
nn
2 1 :S <
I W( t ) ·- W( nu 2 ) l > w·./n
< En + 4P [ I W( Enn) I � EuVn] ,
and it follows by Chebyshev's inequality that limn 8n(E) distributed as W(rn),
P
[ W��
2)
]
0. Since
[ u� < x ] < p [ w�{,( ) < x + E l + 8n( E ) .
< x - E - 8n(E) < P
*This topic may be om itted.
=
sn
] IS
SECTION
37.
521
BROWNIAN MOTION
Here W(nu 2 )ju..[rl can be replaced by a random variable N with the standard normal distribution, and lett ing n � oo and then € � 0 shows that
This gives a new proof of the central limit theorem for independent, identi cally distributed random variables with second moments (the Lindeberg -Uvy theorem-Theorem 27.1). Observe that none of the convergence theory of Chapter 5 has been used. This proof of the central limit theorem is an application of the invariance principle: Sn has nearly the distribution of W(nu 2 ), and the distribution of the latter does not depend on { ary with) the distribution common to the Xn . More can be said if the Xn have fourth moments. For each n, define a stochastic process [Y,.{t): 0 < t < 1] by Y,.(O, w) = 0 and v
(37.41 ) Yn( t , w ) =
1 k 1 . k S < t f < C (w) 1 n - n' uvn k
k = 1, . . . , n.
If kjn = t > 0 and n is large, then k is large, too, and Y,.(t) = t 1 1 2Sk /u..fk is by the central limit theorem approximately normally distributed with mean 0 and variance t. Since the Xn are independent, the increments of 07.41) should be approximately independent, and so the process should behave approximately as a Brownian motion does. Let rn be the stopping times of Theorem 37.7, and in analogy with (37.41) put Zn(O) = 0 and
(37.42) Zn( t ) = 1r W(rk ) I. f k n- 1· < t < nk- , uvn
k = 1 , . . . , n.
By construction, the finite-dimensional distributions of [ Y,.(t ): 0 < t < 1] coin cide with those of [Zn{t): 0 < t < 1]. It will be shown that the latter process nearly coincides with [ W(tnu 2 )juVn : 0 < t < 1], which is itself a Brownian motion over the time interval [0, 1]-see (37.11). Put Wn(t) = W(tnu 2 )ju..[rl . Let Bn (8) be the event that I rk - ku 2 1 > 8nu 2 for some k < n. By Kolmogorov's inequality (22.9),
(37.43) If (k -
1)n - 1 < t < kn - 1
and
n > 8 - 1 , then
STOCHASTIC PROCESSES
522
on the event ( Bn( 8 )Y, and so I Zn( t ) - W,.( t ) l = wn
( n�2 ) - Wn( t ) <
sup I W ( s ) - Wn( t ) i
l s - t i :S 28
n
on ( Bn( 8 )Y. Since the distribution of this last random variable is unchanged if the Wn( t ) are replaced by W(t ) ,
[
P sup i Zn( t) - W,. ( t ) l > € ts
I
(
< P ( Bn( o ) ) + P sup
]
J
sup I W( s) - W( t ) l > E .
1 :$ 1 1.< - r l :$ 2/l
Let n � oo and then 8 � 0; it follows by (37.43) and the continuity of Brownian motion paths that
[
]
lim P sup i Zn( t ) - W ( t ) l > € = 0
( 37.44)
n
n
r :S I
for positive E. Since the processes (37.41) and (37.42) have the same finite dimensional distributions, this proves the following general invariance princi ple or functional central Limit theorem.
Suppose that X1 , X2 , . . . are independent, identically dis tributed random variables with mean 0, variance u 2 , and finite fourth mo ments, and define Y ( t) by (37.41). There exist (on another probability space), for each n, processes [Zn(t): 0 < t < 1] and [W,.(t): 0 < t < 1] such that the first has the same finite-dimensional distributions as [ Yn( t ): 0 < t < 1], ihe second is a Brownian motion, and P[sup, 5 1 1 Zn(t) - W,.{t)l > €] � 0 for positive €. Theorem 37.8.
n
As an application, consider the maximum Mn = max k 5 n Sk . Now Mn fuVn = sup, Yn( t ) has the same distribution as sup, Zn(t ), and it follows by (37.44) that
[
]
P sup Zn( t ) - sup Wn( t ) > € � o . 1 :$ 1
1 :$ 1
But P[sup, s: 1 Wn(t) > x] = P[sup, 5 1 W(t) > x] = 2P[ N > x] for x > O by (37.32). Therefore, ( 37.45)
]
< x � 2 P [ N < x ],
X > 0.
SECTION 37.
BROWNIAN MOTION
523
PROBLEMS
37.1. 36. 2 j Show that K(s, t) = min{s, t} is nonnegative-definite; use Problem 36.2 to prove the existence of a process with the fi nite-dimensional distributions prescribed for Brownian motion. 37.2. Let X(t) be independent, standard normal variables, one for each dyadic rational t (Theorem 20.4; the unit interval can be used as the probability space). Let W(O) = O and W(n) = Ef:= 1 X(k). Suppose that W(t) is already defined for dyadic rationals of rank 'l, and put
Prove by induction that the W( t) for dyadic t have the finite-dimensional distributions prescnbed for Brownian motion. Now construct a Brownian motion with continuous paths by the argument leading to Theorem 37. 1. This avoids an appeal to Kolmogorov' s existence theorem.
37.3.
n
j
n
For each n define new variables Wn(t) by setting Wn(kj2 ) = W(kj2 ) for dyadics of order n and interpolating linearly in between. Set Bn = sup, , n1Wn + l(t) - Wn(t)l, and show that
The construction in the preceding problem makes it clear that the difference here is normal with variance 1/2n + <. Find positive xn such that Exn and E P( on > xnl both converge, and conclude that outside a set of probability 0, Wn(t, w) converges uniformly over bounded intervals. Replace W(t, w) by lim n Wn( t, w ). This gives another construction of a Brownian motion with continuous paths.
37.4. 36.6 i Let T = [0, oo), and let P be a p1 0bability measure on ( RT, �T) having the finite-dimensional distributions prescribed for Brownian motion. Let C consist of the continuous elements of RT. (a) Show that P* (C) = 0, or P*(RT- C) = 1 (see (3.9) and (3.10)). Thus completing (RT, �T, P) will not give C probability 1. (b) Show that P*( C) = l. 37.5. Suppose that [W,: t > 0] is some stochastic process having independent, sta tionary increments satisfying E[ W,] = 0 and E[ W, 2 ] = t. Show that if the finite-dimensional distributions are pr eserved by the transformation (37. 11), then they must be those of Brownian motion. 37.6. Show that n I > oul W,: s > t 1 contains only sets of probability 0 and 1 . Do the same for n , >Ou( W,: 0 t €]; give examples of sets in this u-field.
<<
37.7. Show by a direct argument that W( · , w ) isn with probability l of unbounded n variation on n[0, 1]: Let Yn = Ef: 1 1 W(i2 - ) - W((i - 02 - )l. Show that Y, has mean 2 i 2£[ 1W1 1] and variance at most Var[IW1 1]. Conclude that
EP (Yn < n] < oo.
STOCHASTIC PROCESSES
524
37.8. Show that the Poisson process as defined by (23.5) is measurable. 37.9. Show that for T = [0, oo) the coordinate-variable process [Z,: t E T] on ( R T, gp T) is not measurable. 37.10. Extend Theorem 37.4 to the set [t: W(t, w) = a]. 37.11. Let Ta be the first time the Brownian motion hits a > 0: Ta = inf[t: W, > a ]. Show that the distri6ution of Ta has over (0, oo) the density
a l_ - a 2 f2 t h a ( t ) - r;;- _ 32 v2rr t / e
(37 .46)
Show that £['Ta l = oo. Show that 'Ta has the same distribution as a 2 IN 2 , where N is a standard normal variable.
37.12. i (a) Show by the !.trong Markov property that Ta and Ta + /3 - Ta are independent and that the latter has t he same distribution as Tw Conclude that h a * h13 = h a + f3 · Show that {3Ta has the same distribution as Ta " (b)
,_fjj
Show that each h a is stable-see Problem 28. 10.
37.13. i Suppose that X 1 , X2 , • • • are in dependent and each has the distribution (37.46). (a) Show that (X1 + · · +Xn)ln2 also has the distribution (37.46). Contrast this with the law of large numbers. (b) Show that P[n - 2 max k !; n Xk < x] -+ exp( -a yf2/rr x ) for x > O. Relate this to Theorem 14.3. ·
37.14. 37. 1 1 i Let p(s, t) be the probability that a Brownian path has at least one zero in (s, t). From (37.46) and the Markov property deduce (37 .47 )
,; p(s, t ) = - arccos !.. .
2
t
1T
Hint: Condition with respect to Hj.
37 .15. i (a) Show that the probability of no zero in ( t, 1) is (21 rr) arcsin ..[t and hence that the position of the last zero preceding 1 is distributed over (0, 1) with density rr - 1 (t(l - t)) - 1 12 • (b) Similarly calculate the distribution of the position of the first zero follow ing time 1 . (c) Calculate the joint distribution of the two zeros in (a) and (b). 37.16. i (a) Show by Theorem 37.8 that inf s u s r Yn(u) and inf s u s r Zn(u) both converge in distribution to inf s u s 1 W(u) for 0 < s < t < 1. Prove a similar result for the supremum. (b) Let An(s, t) be the event that the position at time k in a symmetric random walk, is 0 for at least one k in the range sn < k < tn, and show that P(An(s, t )) -+ (21rr) arccos ..fi7i. s
s
Sk,
s
SECTION 37. (c) Let
BROWNIAN MOTION
525
Tn be the maximum k such that k < n and Sk = 0. Show that Tnfn has
asymptotically the distribution with density 7T- 1 (t(l - t)) - 1 12 over (0, l). As this density is larger at the ends of the interval than in the middle, the last time during a night's play a gambler was even is more likely to be either early or late than to be around midnight.
37.17. i Show that p(s, t) =p(t- ' , s - ') = p(cs, ct). Check this by (37.47) and also by the fact that the transformations (37. 1 1) and (37.12) preserve the properties of Brownian nlotion. 37.18. Deduce by the reflection principle that ( M, w;) has density
[
1
2 2(2y -x) (2y x ) --'---i=:==-.<... e xp 2t " J2 7T t ,
on the: set where y > 0 and y > ..(, No-.v deduce from Theorem 37.8 the corresponding limit theorem for symmetric random walk.
37.19. Show by means of the transformation (37.12) that for positive a and b the probabi lity is 1 that the process is within the boundary -at w; < bt for all sufficiently large t. Show that aj(a + b) is lhe probability that it last touches above rather than below.
<
37.20. The martingale calculation used for (37.39) also works for slanting boundaries. For positive a, b, r, let T be the smallest t such that either w; = - a + rt o r w; = b + rt, and let p(a, b, r) be the p1 0bability that the exit is through the upper barrier-that Wr = b + rT. (a) For the martingale Y6 • 1 in the proof of Lemma 2, show that E[Y6) = 1. Operating formally at first, conclude that
( 3 7 .48 ) Take (J = 2r, and note that ewT - !e 2 T is then 2rb if the exit is above (probability p(a, b, r )) and - 2ra if the exit is below (probability 1 - p(a, b, r )). Deduce
p (a, b,r) =
1 - e 2 ra e 2 rb - e - 2r a '
Show that p(a, b, r) --+ aj(a + b) as r --+ 0, in agreement with (37.39). (c) It remains to justify (37.48) for (J = 2r. From E[Y6, 1] = 1 deduce
(b)
( 37 .49 ) for nonrandom u. By the arguments in the proofs of Lemmas 2 and 3, show that (37.49) holds for simple stopping times u, for bounded ones, for u = T A n , for u = T.
STOCHASTIC PROCESSES
526
SECTION 38. NONDENUMERABLE PROBABILITIES * Introduction
As observed a number of times above, the finite-dimensional distributions do not suffice to determine the character of the sample paths of a process. To obtain paths with natural regularity properties, the Poisson and Brownian motion processes were constructed by ad hoc methods. It is always possible to ensure that the paths have a certain very general regularity property called separability, and from this property will follow in appropriate circumstances various other desirable regularity properties. Section 4 dealt with "denumerable" probabilities; questions about path functions involve all the time points and hence concern "nondenumerable" probabilities.
Example
For a mathematicaHy simple illustration of the fact that path properties are not entirely determined by the finite-dimensional distri butions, consider a probability space (D., !F, P) on which is defined a positive random variab!e V with continuous distribution: P[ V x] 0 for each x. For t > 0, put X(t, w) 0 for all w, and put 38.1.
=
=
=
Y( t , w)
(38.1)
=
(� \
if if
V( w ) t , V( w ) =1= t. =
Since V has continuous di-;tribution, PlX, Y,l 1 for each t, and so [X, : t > 0] and [ Y,: t > 0] are stochastic processes with identical finite-dimensional distributions; for each t , tk , the distribution f.l- 11 common to (X, , . . . , X, ) and (Y, , . . . , Y, ) concentrates all its mass at the origin of R k. But what about the sample paths? Of course, X(· , w) is identically 0, but Y( · , w) has a discontinuity-it is 1 at t V(w) and 0 elsewhere. It is because the position of this discontinuity has a continuous distribution that the two • processes have the same finite-dimensional distributions. =
1,
I
k.
I
•
•
=
,k
•
It:
=
Definitions
The idea of separability is to make a countable set of time points serve to determine the properties of the process. In all that follows, the time set T will for definiteness be taken as [0, oo) . Most of the results hold with an arbitrary subset of the line in the role of T. As in Section 36, let R T be the set of all real functions over T [0, oo). Let D be a countable, dense subset of T. A function x-an element of R T--is separable D, or separable with respect to D, if for each t in T there exists a =
* This section may be omitted.
SECTION
38.
sequence
NONDENUMB ERABLE PROBABILITIES
t 1, t2,
• • •
527
of points such that
(38 .2 ) (Because of the middle condition here, it was redundant to require D dense at the outset. ) For t in D, (38.2) imposes no condition on x, since t n may be taken as t . An x separable with respect to D is determined by its values at the points of D. Note, however, that separability requires that (38.2) hold for every t-an uncountable set of conditions. It is not hard to show that the set of functions separable with respect to D lies outside �T.
Example 38.2.
If x is everywhere continuous or right-continuous, then it is separable with respect to every countable, dense D . Suppose that x(t) is 0 for t =i= u and 1 for t v , where v > 0. Then x is not separable with respect to D unless v lies in D. The paths Y( · , w) in Example 38.1 are of this form. • =
The condition for separability can be stated another way: x is separable D if and only if for every t and every open interval I containing t, x(t) lies in the closure of [ x(s ): s E I n D ]. Suppose that x is separable D and that I is an open interval in T. If E > O, then x(t0) + E > sup 1 e 1 x(t) = u for some t0 in I. By separability l x(s0) - x(t0)1 < E for some s0 in I n D. But then x(s0) + 2€ > u, so that sup x ( t )
(38.3)
r ei
Similarly,
infx ( t )
(38.4)
, r ei
=
=
sup
re/nD
x( t ) .
t) x( r e/nD inf
and
(38.5)
sup
i x ( t ) - x( t0) i =
sup
i x( t ) - x( t0 ) i .
A stochastic process [X,: t > 0] on CO., !F, P) is separable D if D is a countable, dense subset of T = [0, oo) and there is an .r-set N such that P(N) 0 and such that the sample path X( · , w) is separable with respect to D for w outside N. Finally, the process is separable if it is separable with respect to some D; this D is sometimes called a separant. In these definitions it is assumed for the moment that X(t, w ) is a finite real number for each t and w. =
If the sample path X( · , w ) is continuous for each w, then the process is separable with respect to each countable, dense D. This covers Brownian motion as constructed in the preceding section. •
Example 38.3.
STOCHASTIC PROCESSES
528
Suppose that U¥,: t > 0] has the finite-dimensional distri butions of Brownian motion, but do not assume as in the preceding section that the paths are necessarily continuous. Assume, however, that [ W,: t > 0] is separable with respect to D. Fix t0 and 8. Choose sets Dm {tm 1 , , tmm } of D-points such that t 0 < tm l < · · · < tmm < t0 + 8 and Dm i D n (t0, t0 + 8). By the argument leading to (37.9),
Example 38.4.
=
P
(38.6)
sup t0.s;t s t 0 + 1l t ED
.
.
•
I W, - W, 11 1 > a
For sample points outside the N in the definition of separability, the supremum in (38.6) is unaltered, because of (38.5), if the restriction t E D is dropped. Since P( N ) = 0,
Define Mn by (37.8) but with r ranging over all the reals (not just over the n n n dyadic rationals) in [k2 - ,(k + 2)2 - ]. Then P[ Mn > n - i ] 4Kn 5 j2 fol lows just as before. But for w outside B [ Mn > n - 1 i.o.], W( · , w) is continuous. Since P( B) 0, W( · , w) is continuous for cu outside an .r-set of probability 0. If ( f!, !F, P) is complete, then the set of w for which W( · , w) is continuous is an .r-set of probability 1. Thus paths are continuous with probability 1 for any separable process having the finite-dimensional distribu tions of Brownian motion-provided that the underlying space is complete, • which can of course always be arranged.
<
=
=
As it will be shown below that there exists a separable process with any consistently prescribed set of finite-dimensional distributions, Example 38.4 provides another approach to the construction of continuous Brownian motion. The value of the method lies in its generality. It must not, however, be imagined that separability automatically ensures smooth sample paths: Suppose that the random variables Xt> t > 0, are indepen dent, each having the standard normal distribution. Let D be any countable set dense in T [0, oo). Suppose that I and J are open intervals with rational endpoints. Since the random variables X1 with t E D n I are independent, and since the value common to the P[X1 E ] ] is positive, the second Borel-Cantelli lemma implies that with probability 1, X1 E ] for some t in D n I. Since there are only countably many pairs I and J with rational endpoints, there is an .r-set N such that P(N) 0 and such that for w outside N the set [X(t, w): t E D n l ] is everywhere dense on the line for every open interval I in T. This implies that [X1 : t > 0] is separable with respect to D. But also of course it implies that the paths are highly irregular.
Example 38.5.
=
=
SECTION
38.
NONDENUMBERA BLE PROBABILITIES
529
This irregularity is not a shortcoming of the concept of separability-it is a necessary consequence of the properties of the finite-dimensional distribu • tions specified in this example.
Example 38.6. The process [ Y,: t > 0] in Example 38.1 is not separable: The path Y( - , w) is not separable D unless D contains the point V(w ). The set of w for which Y( · , w) is separable D is thus contained in [w: V(w) E D], a set of probability 0, since D is countable and V has a continuous •
distribution.
Existence Theorems
It will be proved in stages that for every consistent system of finite-dimen sional distributions there exists a separable process having those distribu tions. Define x to be separable D at the point t if there exist points t in D such that tn � t and x(t) � x(t ). Note that this is no restriction on x if t lies in D, and note that separability is the <;arne thing as separability at every t. n
Let [ X : t > 0] be a stochastic prvc'!SS on (f!, !F, P ). There exists a countable, dense set D in [O, oo), and there exists for each t an !J<:.set N(t ), such that P(N(t )) = 0 and such that for w outside N(t) the path function X( · , w) is separable D at t. ,
Lemma 1.
PRooF.
Fix open intervals I and J, and consider the probability
(
p (U ) = P n [ Xs $ 1 ] sEU
)
for countable &ubsets U of I n T. As U increases, the intersection here decreases and so does p(U). Choose Un so that p(Un ) � infv p(U). If U(l, J ) U n Un , then U(l, J ) is a countable subset of I n T making p(U) minimal : =
(38.7 )
P
(
s E U(I, J )
for every countable subset
( 38.8) because otherwise
) (
[ Xs $ J ] < P n [ Xs $ J ]
n
p
sEU
U of I n T. If t E I n T, then
( [ XI E J ]
n
n
)
[ xs $ J ] = 0'
s E U(I, J)
(38. 7) would fail for U U(l, J ) U {t}. =
)
STOCHASTIC PROCESSES
530
Let D = U U(I, J ), where the union extends over all open intervals I and J with rational endpoints. Then D is a countable, dense subset of T. For each t let
(38.9 )
N( t)
=
U
([X1El] n
n s E U( I , J)
[Xs$1]),
where the union extends over all open intervals J that have rational end points and ove1 all open interval� I that have rational endpoints and contain t. Then N(t) is by (38.8) an 9t"set such that P(N(t )) 0. Fix t and w $ N(t ). The problem is to show that X( · , w) is separable with respect to D at t. Given n, choose open intervals I and J that have rational endpoints and lengths less than n - 1 and satisfy t I and X(t, w) J. Since w lies outside (38.9), there must be an sn in U(I, J) such that X(sn, w) J. But then s., E D, l sn - tl < n - 1 , and I X(sn , w) -- X(t, w) l < n - 1 • Thus sn � t and X(sn , w ) � x(t, w) for a sequence s 1 , s 2 , in D. • =
E
•
For any countable respect to D at t is
D,
the set of
w
•
E E
•
for which
X( - , w)
is separable with
n U [ w : I X ( t , w) - X( s, w) l < n - 1 ] . n = I ls-I I
(38.10 )
sED
This set lies in !F for each t, and the point of the lemma is that it is possible to choose D in such a way that each of these sets has probability 1.
Let [X1 : t > 0] be a stochastic process on (f!, :F, P). Suppose that for all t and w Lemma 2.
a <X( t , w) < b .
(38.11)
Then there exists on (f!, !F, P) a process [X;: t > 0] having these three properties:
1
P[X; X ] 1 for each t. For some countable, dense subset D of [0, oo), X'( - , w) is separable D for every w in n. (iii) For all t and w, (i) (ii)
( 38.12) 0
=
=
a <X'( t, w) < b.
P RooF. Choose a countable, dense set D and .r-sets N(t) of probability as in Lemma 1. If t E D or if w $ N(t), define X'(t, w) = X(t, w). If t $ D,
SECTION
38.
NONDENUMBE RABLE PROBABILITIES
531
fix some sequence {s�')} in D for which lim n s�') t, and define X'(t, w) = lim supn X(s�' l, w) for
w E N(t). To sum up,
X( t , w) (3 8. 13 ) X '( t, w ) = lim su p X ( s�n , w ) n
=
t E D or w $ N( t ) , if t $ D and w E N( t).
if
N(t) E .9", x; is measurable .9" for_each t. Since P(N(t)) = O, P[X, = x;J = 1 for each t. Fix t and w. If t E D, then certainly X'( · , w) is separable D at t, and so assume t $ D. If w $ N(t), then by the construction of N(t), X( - , w) is separable with respect to D at t, so that there exist points sn in D such that sn � t and X(sn , w) � X(t, w). But X(s n , w) = X'(s n , w) because sn E D, and X(t, w) = X'(t, w) because w $ N(t). Hence X'(s n , w) � X'(t, w), and so X'( · , w) is separable with respect to D at t. Finally, suppose that t $ D and w E N(t). Then X'(t, w ) = lim k X(s�'} , w) for some sequence {n k } of integers. l ) w) � X'lt w)' so tha t again ) < < < � w) =X(s X'(s k oo s As n t l and n n ' n ' k. k.) .. k. X'( - , w) is separable with respect to D at t. Clearly, (38.11) implies (38.12) . Since
\.
�
'
•
One must allow for the possibility of equality in (38.12). Suppose that V(w) > 0 for all w and that V has a continuous distribution. Define Example 38. 7.
Il l { f( t ) = �
if t * 0, if t = 0,
and put X(t, w) = f(t - V(w )). If [X; : t > 0] is any separable process with the same finite-dimensional distributions as [X,: t > 0], then X'( · , w) must with probability 1 assume the value 1 somewhere. In this case (38.11) holds for • a < 0 and b = 1, and equality in (38.12) cannot be avoided. If
( 38 .1 4)
sup i X ( t , w ) l < oo , I,
w
then (38.11) holds for some a and b. To treat the case in which (38. 14) fails, it is necessary to allow for the possibility of infinite values. If x(t) is oo or - oo, replace the third condition in (38.2) by x(tn ) � oo or x(t) � - oo. This extends the definition of separability to functions x that may assume infinite values and to processes [X,: t > 0] for which X(t, w) = + oo is a possibility.
If [X, : t > 0] is a finite-valued process on (f!, !F, P), there exists on the same space a separable process [X;: t > 0] such that P[X; X, ] 1 for each t. Theorem 38.1.
=
=
STOCHASTIC PROCESSES
532
It is assumed for convenience here that X(t, w) is finite for all t and w, although this is not really necessary. But in some cases infinite values for certain X'(t, w) cannot be avoided-see Example 38.8. PROOF. If (38.14) holds, the result is an immediate consequence of Lemma 2. The definition of separability allows an exceptional set N of probability 0; in the construction of Lemma 2 this set is actually empty, but it is clear from the definition this could be arranged anyway. The case in which (38.14) may fa il could be treated by tracing through the preceding proofs, making slight changes to allow for infinite values. A simple argument makes this unnecessary. Let g be a continuous, strictly increasing mapping of R 1 onto (0, 1). Let Y(t, w) g(X(t, w)). Lemma 2 applie� to [Y,: t > 0]; there exists a separable process [ Y,': t > 0] such that P[ Y,' Y,] 1. Since 0 < Y(t, w) < 1, Lemma 2 ensures 0 Y'(t, w) .::;; 1. Define =
- oo X'( t, w) = g - 1 ( Y '( t , w ) ) + oo Then [X;: each t.
t > 0]
<
=
Y'( t , cJJ ) 0, if 0 < Y' ( t, w) < 1, if Y' ( t, w) 1 . if
=
=
satisfies the requirements. Note that
Suppose that V(w) > distribution. Define Example 38.8.
h( t)
=
{�
l '
=
0 for all w
P[X; +oo] 0 for =
=
•
and V has a continuous
if t * 0, if t 0, =
and put X(t, w ) = h(t - V(w)). This is analogous to Example 38.7. If [X;: t > 0] is separable and has the finite-dimensional distributions of [X,: t > 0], • then X'( · , w) must with probability 1 assume the value oo for some t. Combining Theorem 38.1 with Kolmogorov' s existence theorem shows that
fior any consistent system offinite-dimensional distributions 11- r k there exists a separable process with the 11- r k as finite-dimensional distributions. As shown in Example 38.4, this leads to another construction of Brownian motion with I
1
I
1
continuous paths.
Consequences of Separability
The of a fact The
next theorem implies in effect that, if the finite-dimensional distributions process are such that it "should" have continuous paths, then it will in have continuous paths if it is separable. Example 38.4 illustrates this. same thing holds for properties other than continuity.
SECTION
38.
NONDENUMBERABLE PROBABILITIES
533
Let RT be the set of functions on T = [O, oo) with values that are ordinary reals or else oo or -oo. Thus R T is an enlargement of the R T of Section 36, an enlargement necessary because separability sometimes forces infinite values. Define the function Z, on R T by Z,(x) Z(t, x) = x(t ). This is just an extension of the coordinate function (36.8). Let �T be the u-field in R T generated by the Z, t > 0. Suppose that A is a subset of R T, not necessarily in � T. For D c T [0, oo), let A0 consist of those elements x of R T that agree on D with some element y of A : =
=
A D = u n [ x E R T: x(t ) y ( t )] .
(38. 15)
=
y EA t E D
Of course, A c A 0. Let S0 denote the set of x in R T that are separable whh respect to D. In the following theorem, [X,: t > 0] and [X;: t > 0] are processes on spaces (0, !F, P) and (0', ::7', P'), which may be distinct; the path functions are X( · , w) and X'( - , w').
Suppose of A that for each countable, dense subset T [0, oo), the set (38.15) satisfies Theorem 38.2.
D
of
=
(38.16 )
-T
A0 E � ,
If [X,: t > 0] and [X;: t > 0] have the same finite-dimensional distributions, if [w: X( · , w) EA] Lies in !Y and has ?-measure 1, and if [X;: t > O] is separable, then [w': X'( · , w') EA] contains an !F'-set of P'-measJ,tre 1. If (0', .'F', P' ) is complete, then of course !F'-set of ?'-measure 1.
[w': X'(· , w') EA]
is itself an
[X;: t > 0] is separable with respect to D. The [w': X'( · , ) EA0] - [w': X'( · , w') EA] is by (38.16) a subset of [w': X'( · , w') E R T - S0], which is contained in an !7'-set of N' of ?'-mea sure 0. Since the two processes have the same finite-dimensional distribu tions and hence induce the same distribution on (R T, � T ), and since A 0 lies in �T, it follows that P'[w': X'( · , w') EA0] = P[w: X(· , w) EA0] > P[w: X{ · , w) EA] = l. Thus the subset [w': X'( · , w') EA0] - N' of [w': X'( ·, w') E A ] lies in !F' and has P'-measure 1. • PROOF. difference
Suppose that '
w
Consider the set C of finite-valued, continuous functions on T. If x E S 0 and y E C, and if x and y agree on a dense D, then x and y agree everywhere: x y. Therefore, C0 n S0 c C. Further, Example 38.9.
=
C0 = n U n ( x E RT: I x(s ) l < oo, i x (t ) l < oo, i x( s ) - x( t) I < E ] , €, 1
li
s
.
STOCHASTIC PROCESSES
534
where E and 8 range over the positive rationals, t ranges over D, and the inner intersection extends over the s in D satisfying Is - t I < 8. Hence C0 E �T. Thus C satisfies the condition (38.16). Theorem 38.2 now implies that if a process has continuous paths with probability 1, then any separable process having the same finite-dimensional distributions has continuous paths outside a set of probability 0. In particular, a Brownian motion with continuous paths was constructed in the preceding section, and so any separable process with the finite-dimensional distribu tions of Browni�r. motion has continuous paths outside a set of probability 0. The argument in Example 38.4 now becomes supererogatory. • There is a somewhat similar argument for the step functions of the Poisson process. Let z + be the set of nonnegative integers; let E consist of the nondecreasing functions x in R T such that x(t) E Z for all t and such that for every n E Z + there exists a non empty interval J su�h that x(t) = n for t E /. Then EY.ample 38.10.
+
n
s, t E D, s < l
tED 00
() n u n�O
I
n
r eDn/
[A : x(s) < x ( t )]
[x: x(t ) = n],
where I ranges over the open intervals with rational endpoints. Thus E0 E !JRT. Clearly, E0 n S0 c E, and so Theorem 38.2 applies. In Section 23 was constructed a Poisson process with paths in E, and therefore any separable process with the same finite-dimensional distributions will have paths in E except for a set of probability 0. • For E as in Example 38. 10, let E0 consist of the elements of E that are right-continuous; a function in E need not lie in E0, although at each t it must be continuous from one side or the other. The Poisson process as defined in Section 23 by N, = max[ n : Sn < t] (see (23.5)) has paths in E0 • But if N; = m ax[ n : sn < t], then [N,': t � 0] is separable and has the sam� finite-dimensional distributions, but its paths are not in E0• Thus E0 does not satisfy the hypotheses of Theorem 38.2. Separabil ity does not help distinguish between continuity from the right and continu ity from the left. • Example 38.JI.
The class of sets A satisfying (38. 16) is closed under the forma tion o f countable unions and intersections but is not closed under complementation. Define X, and Y, as in Example 38. 1, and let C be the set of continuous paths. Then [Y,: t > 0] and [ X, : t � 0] have the same finite-dimensional distributions, and the latter is separable; Y( · , w ) is in R T - C for each w, and X( · , w ) is in R T - C for • no w. Example 38.12.
1
As a final example, consider the set of functions with disconti nuities of at most the first kind: x is in if it is finite-valued, if x(t + ) = lim s 1 x(s) exists (finite) for t � 0 and x(t - ) = lim s l 1 s(s) exists (finite) for t > 0, and if x{t) lies between x(t + ) and x(t - ) for t > 0. Continuous and right-continuous functions are special cases. Example 38./3.
1
SECTION 38.
Let
NOND ENUMBERABLE PROBABILITIES
535
V denote the general system
{38.17) where k is an integer, where the
r,., s,., and a,. are rational, and where
Define k
1( D , V, t:) = n [ x : a,. <x ( t) < a,. + t: , t E ( r,. , s;) n D ] i= I
n
%-m k
k
n
i= 2
[ x : min{a,. _ 1 , a;} < x(t) < max{ a ,._ 1 , a,.} + t:,
6
t o::
( s,. _ 1 , r; )
be the class of systems (38.17) that have a fixed value for r,. - s,._ 1 .<.B, i = 2, . . . , k, and sk > m. lt will be shown that Let
00
(38.18)
00
1v = n n u n
k
nD)
and satisfy
U 1 (D, V, t:),
where € and fj range over the positive rationals. From this it will follow that 10 E !JRT. It will also be shown that 10 n S0 c1, so that 1 satisfies the hypothesis of Theorem 38.2. Suppose that y E 1. For fixed t: , let H be the set of nonnegative h for which there exist finitely many points t; such that 0 = t0 < t 1 < < t , = h and I y(t) -y(t')l < € for t and t' in the same interval (t,._ 1 , t,.). If hn E H and hn i h, then from the existence of y(h - ) follows h E H. Hence H is closed. If h E H, from the existence of y(h + ) it follows that H contains points to the right of h. Therefore, H = [O, oo). From this it follows that the right side of (38.18) contains 10. Suppose that x is a member of the right side of (38. 18). It is not hard to deduce that for each t the l imits ·
(38. 1 9)
x s ), ( sED
lim
s ! I,
lim
·
s j r, s e D
·
x(s)
exist and that x(t) lies between them if t E D. For t E D take y(t) = x(t), and for t � D take y(t) to be the first limit in (38.19). Then y E 1 and hence x E 10. This argument also shows that 10 n S0 c1. •
Appendix
Gathered here for easy reference are certain definitions and results fro m set theory and real analysis required in the text. Although there are many newer books, HAUSDORFF (the early sections) on set theory and HARDY on analysis are still excellent for the general background assumed here. Set Theory
Ai. The empty set is denoted by 0. Sets are variable subsets of some space that is space is denoted either fixed in any one definition, argument, or discussion; this k generical ly by n or by some special symbol (such as R for Eucl idean k-space). A singleton is a set consisting of just one point or element. That A is a subset of B is expressed by A c B. In accordance with standard usage, A c B does not preclude A = B; A is a proper subset of B if A c B and A -=F B. The complement of A is always relative to the overall space !l; it consists of the points of !l not contained in A and is denoted by Ac. The difference between A and B , denoted by A - B, is A n SC; here B need not be contained in A , and if it is, then A - B is a proper difference. The symmetric difference A ,), B = (A n Be) U (Ac n B) consists of the points that lie in one of the sets A and B but not in both. Classes of sets are denoted by script letters. The power set of n is the class of all subsets of !l; it is denoted 2 °. A2.
The set of w that lie in A and satisfy a given property p(w) is denoted [ w E A: p (w )]; if A = !1, this is usually shortened to [w: p( w )].
A3. In this book, to say that a collection [ A 6: (J E e] is disjoint always means that it is pairwise disjoint: A 6 n A 6, = 0 if e and e' are distinct elements of the index set e. To say that A meets B, or that B meets A, is to say that they are not disjoint: A n B -=F 0. The collection [ A 6: e E 0] covers B if B c U 6 A 6• The collection is a decomposition or partition of B if it is disjoint and B = U 6 A 6• A4. By A n i A is meant A 1 c A 2 c A I ::J A 2 ::J . • • and A = n n An. AS.
·· ·
and A = U n A,; by A n l A is meant
The indicator' or indicator function ' of a set A is the function On n that assumes the value 1 on A and 0 on A c; it is denoted !A ' The alternative term "characteristic function" is reserved for the Fourier transform (see Section 26).
536
537
APPENDIX A6.
De Morgan's laws are (U 6A6Y = n 6A� and ( n 6 A 6 Y = U 6A�. These and the
other facts of basic set theory are assumed known : a countable union of countable sets is countable, and so on.
If T: !l --+ !l' is a mapping of !l into !l' and A' is a set in !l', the inverse image of A' is T- 1,4' = [w E !l: Tw E A']. It is easily checked that each of these statements is equivalent to the next: w E n - T- 1A'' w $. T- 1A', Tw $.A'' Tw E !l' - A', w E T- 1 (0' - A'). Therefore, !l - r t.4' = r 1 (0' - A'). Simple considerations of this kind show that U 6r t.4'6 = r 1 ( U 6A�) and n 6r t.4'6 = r 1( n 6 A'6 ), and that A' n B' = 0 implies r 1A' n T- 1 B' = 0 (the reverse implication is false unless T!l = !l'). I f f maps n into another space, f(w ) is the value of the function f at an unspecified value of the argument w. The function f itself (the rule defining the mapping) is sometimes denoted f( · ). This is especially convenient for a functio n f(w , t ) of two arguments: For each fixed t, f( · , t ) denotes the function on !l with value f(w, t) at w. A7.
AS.
The axiom of choice.
Suppose that [ A 6: (J E E>] is a decomposition of !l into nonempty sets. The axiom of choice says that there exists a set (at least one set) C that contains exactly one point from each A6: C n A 6 is a singleton for each e in e. The existence of such sets C is assumed in "eve tyday" mathematics, and the axiom of choice may even seem to be simply true. A careful treatment of set theory, however, is based on an explicit list of such axioms and a study of the relatio nships between them; see HALMOS 2 or DuDLEY. A few of the problems require Zorn's lemma, which is equivalent to the axiom of choice; see Du DLEY or KAPLANSKY. The Real Line
The real line is denoted by R 1 ; x v y = max{x, y} and x 1\ y = min {x , y}. For real x, l x J is the integer part of x, and sgn x is + 1, 0, or - 1 as x is positive, 0, o r negative. It is convenient to be explicit about open, closed, and half-open intervals:
A9.
( a, b) = [x: a < x < b), [a, b] = [x: a < x < b), ( a, b] = [x: a <x < b ), [a, b) = [x: a < x < b ]. AlO. Of course xn --+ x means lim n x, = x; xn i x means x1 <x 2 < · · · and xn --+x; xn l x means x 1 x 2 · · · and xn x. A sequence {xn} is bounded if and only tf every subsequence {xn } contains a further subsequence {xn . } that converges to some x: lim,. x, . = x. If { ;n} is not bounded, then for each k there is an n k for which lxn I > k; no subsequence of {xn } can ' t{ >
k(j )
>
--+
.{ ( J )
converge. The implication in the other direction is a simple consequence of t e fact that every bounded sequence contains a convergent subsequence.
If {x n} is bounded, and if each subsequence that converges at all converges to x, then lim n x n x. If x n does not converge to x, then I x n, - x I > € for some positive € and some increasing sequence {n k } of integers; some subsequence of {xn } converges, but ' the limit cannot be x. Al l. A set G is defined as open if for each x in G there is an open interval I such that x E I c G. A set F is defined as closed if pc is open. The interior of A, denoted =
538
APPENDIX
x closure
I
such that A0, consists of the in A for which there exists an open interval E c A . The of A , denoted A - consists of the for which there exists a sequence of A is aA = A - - A0• The basic facts in A with The of real analysis are assumed known: A is open if and only if A = A0; A is closed if and only if A = A - ; A is closed if and only if it contains all limits of sequences in it; lies in aA if and only if there is a sequence in A c such in A and a sequence
x I
that
{xn}
Xn --+ x.
xn --+ x and Yn --+ x; and
so
x
,
boundary
{yn}
{xn}
on.
x
Every open set G on the line is a countable, disjoint union of open intervals. To see this, define points x and y of G to be equivalent if x < y and [ x, y] c G or y < x and [ y , x ] c G. This is an equivalence relation. Each equivalence ci<1ss is an interval, and since G is open, each is in fact an open interval. Thus G is a disjoint union of open (nonempty) intervals, and there can be only countably many of them, since each contains a rational. A12.
The simplest form of rhe He,ne -Borel theorem says that if [ a , b] c U k' = 1(a k , bk ), then [a, b] c lJ f: = 1(a k > bd for some n. A set A is defined m be compact if each cover of it by open sets has a iinite subcover--that is, if [G6: (J E 0] covers A and each G6 is open, then some finite subcollection {G6 1 , • • • , G6 } covers A. Equivalent to the Heine-Bore! theorem is the assertion that a bo unded, closed set is compact. Also equivalent is the assertion that every bounded sequence of real numbers has a convergent subsequence. Al3.
A14.
The diagonal method. From this last fact follows one of the basic principles of
analysis.
Theorem.
Suppose that each row of the array Xl,l
X 2. I
(1)
X1.2
X 2, 2
Xl,3
X 2, 3
is a bounded sequence of real numbers. Then there exists an increasing sequence n 1 , n 2 , . of integers such that the limit lim k x n, exists for r = 1 , 2, . . . . • •
r,
PROOF.
From the first row, select a co nvergent subsequence X 1•
(2)
n1.1 , X I • n1 2 , X I
• n I '\
,•••;
here {n 1 • k } is an increasing sequence of integers and lim k the second row of (1) along the sequence n 1, I> n 1, 2, : • • .
X2
{3)
n 1. 1 , X 2• n , x 2 n I 2
•
I '\ ,
x1 n •
I k
exists. Look next at
•• • •
As a subsequence of the second row of (1), (3) is bounded. Select from it a convergent subsequence
here {n 2 k } is an increasing sequence of integers, a subsequence of {n l.k }, and lim k X 2, n: eXiStS. <
APPENDIX
539
Continue inductively in the same way. This gives an array
n. 3 I '
(4)
with three properties: (i) Each row of (4) is an increasing sequ ence of integers. (ii) The rth row is a subsequence of the (r - l )st. (iii) For each r, lim k x , n r k exists. Thus •
X,
(5 )
n • r l'
X,
n 2 • r
' X,
•
n
t
'\
'
.
is a convergent subsequence of the rth row of (1). Put n k = n k k . Since each row of (4) is increasing and is ::ontained in the preceding row, n1, n 2 , n3, . . . is an increasing sequence of integers. Furthermore, ? " n , + I> n , + 2 • • • • is a subse�uence of the r th row of (4). T.hus X r, n ,> X r, n, + xr, n, + 2 ' . . . ts a subsequence of (5) and IS therefore convergent. Thus hm k x,, nk does extst. • Since {n k } is the diagonal of the array (4), application of this theorem is called the ,•
.
diagonal method.
The set A is by definition der.se in the set B if for each x in B and each open interval containing x, meets A. This is the same thing as requiring B cA -. The set E is by definition nowhere dense if each open interval I contains some open interval that does not meet E. This makes sense: It I contains an open interval that does not meet E, then E is not dense in I; the definition requires that E be dense in no interval I. A set A is defined to be perfect if it is closed and for each x in A and positive €, there is a y in A such that 0 < lx -yl < *' · An equivalent requirement is that A be closed and for each x in A there exist a sequence { xn } in A such that xn -=1= x and x n --+ x. The Cantor set is uncountable, nowhere dense, and perfect. A set that is nowhere dense is in a sense small. If A is a countable union of sets each of which -is nowhere dense, then A is said to be of the first category. This is a weaker notion of smallness. A set that is not of the first category is said to be of the AlS.
1 1
1
1
second category.
Euclidean k-Space
Euclidean space of dimension k is denoted R k . Points (a1, . . . , ak) and (b 1 , • • • , bk) determine open, closed, and half-open rectangles in R k : Al6.
[x: a; < x1 < b1, i = l , . . . , k ] , [x: a; <x; < b; , i = l , . . . , k ], [x: a; <x; < b; , i= 1 , . . . , k ] .
A rectangle (without a qualifying adjective) is1 in this book a set of this last form. The Euclidean distance U:fr= 1(x1 -y;>Z) 12 between x = (x1, . . . , xk ) and y = ( y I> . . . , yk ) is denoted by I x - y I. Al7. All the concepts in All carry over to R k : simply take the I there to be an open rectangle in R k . The definition of compact set also carries over word for word, and the Heine-Borel theorem in R k says that a closed, bounded set is compact.
540
A PPENDIX
Analysis
The standard Landau notation is used. Suppose that { t,.} and { y,.} are real sequences and Y,. > 0. Then x,. = O(y,.) means x,./Y,. is boundP-d; x,. = o(y,.) means x,./Y,. --+ 0; x,. -Yn means x,./Yn --+ 1; x,. :=e: Yn means x,./Y,. and y,./x,. are both bounded. To write x,. = z,. + O(y,.), for example, means that x,. = z,. + u,. for some {u,.} satisfying u,. = O( y,.)- that is, that x,. - z,. = O(y,.). Al8.
difference equation. Suppose that a and b are integers and a < b. Suppose that x,. is defined for a < n < b and satisfies
Al9.
A
x,. = pxn + l + qx,. _ 1
(6)
where p and q are positive and equation has the form X"
(7)
for a
n
< < b,
p + q = 1. The general solution of this difference
" (= A -+-+ B(qjp) Bn
for a < n < b for a < n < b
A
if p * q, if p = q.
That (7) always solves (6) is easily checked. Suppose the values where a < n1 < n 2 < b. If p * q, the system A
x,.
I
and
x,.
2
a re given,
+ B ( qjp) " 1 = x� , I
B. If p = q, the system A + Bn 1 = x,.1,
can always be solved for A and
can always be solved. Take n 1 = a and n 2 = a + 1; the corresponding A and B satisfy (7) for n = a and for n = a + 1, aP.d it follows by induction that (7) holds for all n. Thus any solution of (6) can indeed be put in the form (7). Furthermo re. the equation (6) and any pair of values x,. and x,. (n1 * n 2 ) suffice to determine all the x,. . If x,. is defined for a <1n < oo a � d satisfies (6) for a
Cauchy 's equation. Theorem. Let f be a real function on (0, oo), and suppose that f satisfies Cauchy 's equation: f( x + y) = f( x) + f( y) for x, y > 0. If there is some interval on which f is bounded above, then f( x) = xf(l) for x > 0. PROOF. The problem is to prove that g( x) = f( x) - xf(l) vanishes identically. Clearly, g(l) = 0, and g satisfies Cauchy's equation and on some interval is bounded above. By induction, g(nx) = ng(x); hence ng(mjn) = g(m) = mg(l) = 0, so that g(r) = 0 for positive rational r. Suppose that g(x0) * 0 for some x0• If g(x0) < 0, then g(r0 - x0) = -g( x0) > 0 for rational r0 > x0• It is thus no restriction to assume that g(x0) > 0. Let l be an open interval in which g is bounded above. Given a number M, choose n so that ng(x0) > M, and then choose a rational r so that nx0 + r lies in /. If r > 0, then g(r + nx0) = g(r) + g(nx0) = g(nx0) = ng(x0). If r < 0, then ng(x0) g(nx0) = g((-r) + (nx0 + r)) = g( -r) + g(nx0 + r) = g(nx0 + r ). In either A20.
=
APPENDIX
541 •
case, g(nx0) + r) = ng(x0); of course this is trivial if r = 0. Since g(nx0 + r) = ng(x0) > M and M was arbitrary, g is not bounded above in I, a contradiction. • Obviously, the same proof works if f is bounded below in some interval.
Let U be a realfunction on (0, oo) and suppose that U( x + y) = U( x )U( y) for x, y > 0. Suppose further that there is some interval on which U is bounded above. x A Then either U(x) = 0 for x > 0, or else there is an A such that U(x) = e for x > 0. n PROOF. Since U( x ) = U 2 (xj2), u is nonnegative. If U(x ) = 0, then U(xj2 ) = 0 Corollary.
and so U vanishes at points arbitrarily near 0. If U vanishes at a point, it must by the functional equation vanish everywhere to the right of that point. Hence U is identically 0 or else everywhere positive. In the latter case, the theorem applies to f(x) = log U( x ), this function being II bounded above in some interval, and so f( x) =Ax for A = log U(l).
A21. A
number-theoretic fact.
Suppose that MIS a set ofpositive integers closed under addition and that M has greatest common divisor 1. J'hen M contains all integers exceeding some n • Theorem.
0
PRooF. Let M1 consist of all the integers m, -m, and m - m' with m and m' in M. Then M1 IS closed under addition and subtraction (it is a subgroup of the group of integers). Let d be the smallest positive element of M1 • If n E M1, write n = qd + r, where 0 < r < d. Since r = n - qd lies in Mt> r must actually be 0. Thus M1 consists of the multiples of d. Since d divides all the integers in M1 and hence all the integers in M, and since M has greatest common divisor 1, d = 1. Thus M1 contains all the integers. Write 1 = m - m' with m and m' in M (if 1 itself is in M, the proof is easy), and take n0 = (m + m')2 • Given n > n0, write n = q(m + m') + r, where O < r < m + m'. From n > n0 > (r + 1Xm + m') follows q = (n - r)j(m + m') > r. But n = q(m + m') + r(m - m') = (q + r)m + (q - r)m', and since q + r > q - r > O, n lies in M. •
A22.
One- and two-sided derivatives.
Suppose that f and g are continuous on [0, oo) and g is the right-hand derivative off on (0, oo) r (t) = g(t) for t > 0. Then r (0) = g(O) as well, and g is the two-sided derivative off on (0, oo). Theorem.
:
PRooF. It suffices to show that F(t) = f(t) -f(O) - fJg(s) ds vanishes for t > 0. By assumption, F is continuous on [O, oo) and F + (t) 0 for t > 0. Suppose that F(t0) > F(t1), where 0 < t0 < t 1 • Then G(t)+= F(t) - (t - t0XF(t1) - F(t0))j(t 1 - t0) is continuous on [O, oo), G(t0) = G(t1), and c (t) > 0 on (O, oo). But then the maximum of G over [t0, t d must occur at some interior point; since G + < 0 at a local maximum, this is impossible. Similarly F(t0) < F(t1) is impossible. Thus F is constant over (0, oo) and by continuity is constant over [0, ) Since F(O) = 0, F vanishes on [0, ) • =
oo .
co .
A23. A differential equation. The equation f'(t) = Af(t) + g(t) (t > 0; g continuous) has the particular solution f0(t) = e A 'JJg(s)e- A s ds; for an arbitrary solution f, 0 and hence equals f(O) identically. All solutions (f(t) -f0(t))e- A1 has derivative A thus have the form f(t) = e '[f(O) + JJg(s)e- As ds].
APPENDIX
542
A24. A
trigonometric identity. If z * 1 and z * 0, then
and hence m
I
-I
L:
L:
/=0 k = -1
zk =
z 1-z- z L:
m-
I
-
I
1+ I
1=0
[ -z
1 1 = 1 -z 1
_
-m
z- I
m] -z -z 1 -z 1
( z m f2 _ z - m f 2 ) 2 _ - I) ( z 1/2 z - 1/2 ) 2
1 _ z-m + 1 - zm ( 1 - Z)( 1 Take z = eix. If
x
x
Z
_
is not an integral multiple of 21r, then 2 ) ( sin � mx i L L e kx = 2 • ) ( Sin 2 x 1 = 0 k = -1
m-I
If
-
I
.
1
= 21rn, the left-hand side here is m 2, which is the limit of the right-hand side as
x --+ 2 1rn .
Infinite Series
A25. Nonnegative series. Suppose x 1 , x 2 , • • • are nonnegative. If E is a finite set of integers, then E c {1, 2, . . . , n} for some n, so that by nonnegativity [k E E x k < [ f: = 1 x k . The set of partial sums [f:= 1 x k thus has the same supremum as the larger set of sums [ k E x k ( E finite). Therefore, the nonnegative series L,; 1 x k converges if and only if the sums [k E E x k for finite E are bounded, in which case the sum is the supremum: L,; = ! x k = sup E [k E E x k . E
=
A26. Dirichlet 's theorem. Since the supremum in P..25 is invariant under permuta tions, so is [k= ! x k : If the x k are nonnegative and Yk = x f< k > for some one-to-one map f of the positive integers onto themselves, then [k x k and L k Yk diverge or converge together and in the latter case have the same sum.
A27. Double series. Suppose that x ij i, j 1, 2, . . . , are nonnegative. The ith row gives a series [1xu, and if each of these converges, one can form the series L;LJ xiJ. Let the terms x;1 be arranged in some order as a single infinite series [ifxiJ; by Dirichlet's theorem, the sum is the same whatever order is used. Suppose each [jx11 converges and [;[1x;1 converges. If E is a finite set of the pairs (i, j), there is an n for which L( l, j ) E e X ij < L; s, n Lj s, n X ij < L; s, n LJ X ij < L; LjX ij; hence [;1 x if converges and has sum at most L;[1xif. On the other hand, if [if x U converges, then L; s, m LJ s. n xil < [if x;1; letting n --+ oo and then m --+ oo shows that each [j xiJ converges and that [;[1xil < L;J xif. Therefore, in the nonnegative case, Lu X ;l-. converges if and only if the [jxiJ all converge and L;Lj x iJ converges, in which case Lux;j - [;[J...xiJ. . By sy� metry, 2.. if x11 Ll-.L; Xif. Thus the order of summation can be reversed in a • nonnegative double senes: 2.. 1[1x;1 = [1 [1 xiJ. •
A28.
The Weierstrass M-test.
=
543
APPENDIX
Suppose that lim n x n k = xk for each k and that lxnkl <Mk> where f.k Mk < oo. Then f.k x k and all the f.k x nk converge, and lim n f.k x nk = f.k x k . PRooF. The series of course converge absolutely, since f.k Mk < oo. Now lf.k xnk - f.k x k l < f.k s k �lxnk - x k l + 2f.k > k Mk. Given e, choose k0 so that f.k > k Mk < e / 3, and th n choose n0 so that n > n0 implies lxnk - xkl < e/ 3k0 for • k < k�-. Then n > n0 implies If." xnk - f.k x"l < e. A29. Power series. The principal fact needed is this: If f(x) = Lk= oak x " co nverges in the range I xI < r, then it is differentiable there and f'(x) = L kak x " - 1 • (8) Theorem.
k=l
For a simple proof, choose r0 and r1 so that lxl < r0 < r1 < r. If lhl < r0 - lxl, so that I x ± hi < r0 , then the mean-value theorem gives (here 0 < (Jh < 1)
2kr$ - I j f goe<> to 0, it is bounded by some M, and if Mk = lak I · Mrf, then f." Mk < oo and Ia " I times the left member of (9) is at most Mk for I hi < r0 - I xl. By the M·test [A28] (applied with h -+ 0 instead of n -+ oo ), Since
r
Hence (8). Repeated application of (8) gives [ 0>( x ) =
00
[ k ( k - 1 ) ' " ( k - j + 1 ) a k x k -j.
k =j
For x = 0, this is aj = JU >(O)jj!, the formula for the coefficients in a Taylor series. This shows in particular that the values of f(x) for lxl < r determine the coefficients
ak .
Cesaro averages. If xn -+ x, then n- 1 f.Z = 1 x k -+ x. To prove this, let M bound lxkl, and given e, choose k0 so that l x - x k i < E/2 for k > k0• If n > k0 and n > 4k0Mje, then ku - 1 n n 1 x - -n1 L x" < -n L 2M + n-1 L !.2 <e. k�l k = ko k= l A30.
A3l. by
Dyadic expansions. Define a mapping T of the unit interval !l = (0, 1] into itself
{ 2w Tw = 2w - 1
if O < w < L if � < w < 1.
APPENDIX
544
Define a function
d1 on !l by if O < w < L if � < w < 1,
and let
( 10)
d;(w) = d 1(T' - 1 w ). Then ;.. d; ( w ) < L. i= l
2'
·
W
<
;..
L.
i= l
d; ( w ) + 2'
·
1
2n
for all w E w and n > 1. To verify this for n = 1, check the cases w < � and w > � separately. Suppose that (10) holds for a particular n and for all w. Replace w by Tw in (10) and use the fact that d;(Tw) = d; + 1(w); separate consideration of the cases w < � and w > � now shows that (10) holds with n + 1 in place of n Thus (10) holds for all n and w, and it follows that w = f.�= 1d/ w )/ 2 ;. This gives the dyadic representation of w. If d,(w) = 0 for all i > n, then w = f.7= 1 d; (w )j 2 ' , which contradicts the left-hand inequality in (10). Thus the expansion does not terminate in O's. Convex Functions A32.
A function
{11) for
cp on an open interval I (bounded or unbounded) is convex cp(tx + ( 1 - t )y ) < tcp( x ) + ( 1 - t)cp ( y)
if
x, y E I and 0 < t < 1. From this it follows by induction that
{ 12) if the X; lie in I and the P; are llonnegative and add to 1. If cp has a continuous, nondecreasing derivative cp' on /, then cp is convex. Indeed, if a < b < c, the average of cp' over (a, b) is at most the average over (b, c):
cp ( b) - cp( a ) = 1 f bcp' ( s) ds m'(b) < 1 (c ' ( s ) ds "' c - b ),b 'P b -a b-a b) c) cp( cp( - c -b a
<
The inequality between the extreme terms here reduces to
( 13) which is (11) with
( c - a )cp( b ) < ( c - b ) cp( a ) + ( b - a )cp ( c ) , x = a, y = c, t = (c - b)j(c - a).
A33. Geometrically, (11) means that the point $B$ in Figure (i) lies on or below the chord $AC$. But then slope $AB \le$ slope $AC$; algebraically, this is $(\varphi(b) - \varphi(a))/(b - a) \le (\varphi(c) - \varphi(a))/(c - a)$, which is the same as (13). As $B$ moves to $A$ from the right,
[Figure: panels (i)-(iv), showing chords such as $AB$ and $AC$, the slopes $DE$, $EF$, $FG$, and support lines of a convex curve.]
slope $AB$ is thus nonincreasing and hence has a limit. In other words, $\varphi$ has a right-hand derivative $\varphi^+$. Figure (ii) shows that slope $DE \le$ slope $EF \le$ slope $FG$. Let $E$ move to $D$ from the right and let $G$ move to $F$ from the right: The right-hand derivative at $D$ is at most that at $F$, and $\varphi^+$ is nondecreasing. Since the slope of $XY$ in Figure (iii) is at least as great as the right-hand derivative at $X$, the curve to the right of $X$ lies on or above the line through $X$ with slope $\varphi^+(x)$:
(14) $$\varphi(y) \ge \varphi(x) + (y - x)\varphi^+(x), \qquad y > x.$$
Figure (iii) also makes it clear that $\varphi$ is right-continuous. Similarly, $\varphi$ has a nondecreasing left-hand derivative $\varphi^-$ and is left-continuous. Since slope $AB \le$ slope $BC$ in Figure (i), $\varphi^-(b) \le \varphi^+(b)$. Since clearly $\varphi^+(b) < \infty$ and $-\infty < \varphi^-(b)$, $\varphi^+$ and $\varphi^-$ are finite. Finally, (14) and its left-sided analogue show that the curve lies on or above each line through $z$ in Figure (iv) having slope $m$ between $\varphi^-(z)$ and $\varphi^+(z)$:
(15) $$\varphi(x) \ge \varphi(z) + m(x - z).$$

This is a support line.
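The support-line inequality (15) is easy to probe numerically. In the sketch below, the choice $\varphi(x) = |x| + x^2$, the support point $z = 0$, and the test grid are all illustrative assumptions; any slope $m$ between the one-sided derivatives works.

```python
# Probe the support-line property (15) for the convex (nonsmooth) function
# phi(x) = |x| + x^2 at z = 0, where phi^-(0) = -1 and phi^+(0) = +1.
phi = lambda x: abs(x) + x * x
z = 0.0

for m in (-1.0, -0.5, 0.0, 0.5, 1.0):           # any m in [phi^-(z), phi^+(z)]
    grid = [-2 + 0.01 * k for k in range(401)]  # points of [-2, 2]
    assert all(phi(x) >= phi(z) + m * (x - z) - 1e-12 for x in grid)

print("every m in [-1, 1] gives a support line at z = 0")
```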
Some Multivariable Calculus
A34. Suppose that $U$ is an open set in $R^k$ and $T: U \to R^k$ is continuously differentiable; let $D_x = [t_{ij}(x)]$ and $J(x) = \det D_x$ be the Jacobian matrix and determinant, as in Theorem 17.2. Let $Q^-$ be a closed rectangle in $U$.

Theorem. If $|t_{ij}(x') - t_{ij}(x)| \le \alpha$ for $x, x' \in Q^-$ and all $i, j$, then

(16) $$|Tx' - Tx - D_x(x' - x)| \le k^2\alpha|x' - x|, \qquad x, x' \in Q^-.$$
Before proceeding to the proof, note that, since the $t_{ij}$ are continuous, $\alpha$ can be taken to go to 0 as $Q$ contracts to the point $x$. In other words, (16) implies

(17) $$\lim_{x' \to x} \frac{|Tx' - Tx - D_x(x' - x)|}{|x' - x|} = 0.$$
This shows that $D_x$ acts as a multivariable derivative. Suppose on the other hand that (17) holds at $x$ for an initially unspecified matrix $D_x$. Take $x'_j = x_j + h$ and $x'_l = x_l$ for $l \ne j$, and let $h$ go to 0. It follows that the entries of $D_x$ must be the partial derivatives $t_{ij}(x)$: If (17) holds, then $D_x$ must be the Jacobian matrix.
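One can watch (17) happen numerically. In the sketch below the map $T(x_1, x_2) = (x_1^2 - x_2, x_1x_2)$ and the base point are arbitrary choices; the linearization error divided by $|x' - x|$ shrinks as $x' \to x$.

```python
# Check (17): the error of the Jacobian linearization is o(|x' - x|).
# Example map T(x1, x2) = (x1^2 - x2, x1*x2) at the point x = (1, 2).
import math

def T(x1, x2):
    return (x1 * x1 - x2, x1 * x2)

x = (1.0, 2.0)
D = [[2 * x[0], -1.0],      # Jacobian matrix [t_ij(x)] computed by hand
     [x[1], x[0]]]

for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    xp = (x[0] + eps, x[1] + eps)           # x' approaching x
    dx = (xp[0] - x[0], xp[1] - x[1])
    lin = tuple(T(*x)[i] + D[i][0] * dx[0] + D[i][1] * dx[1] for i in range(2))
    err = math.dist(T(*xp), lin) / math.dist(xp, x)
    print(eps, err)    # err is O(eps): about one more decimal place each row
```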
[Figures (v) and (vi): the coordinate chain from $x$ to $x'$, and the rectangle $Q$ with $\partial Q$, $T(\partial Q)$, and the ball $Y$ about $y_0$.]
PROOF OF (16). For $j = 0, 1, \dots, k$, let $z^j$ agree with $x'$ in the first $j$ places and with $x$ in the last $k - j$ places. Then $z^0 = x$, $z^k = x'$, and $|z^j - z^{j-1}| = |(z^j - z^{j-1})_j| = |x'_j - x_j|$ (Figure (v)). By the mean-value theorem in one dimension, there is a point $w^{ij}$ on the segment from $z^{j-1}$ to $z^j$ such that $t_i(z^j) - t_i(z^{j-1}) = t_{ij}(w^{ij})(z^j - z^{j-1})_j$. Since

$$(D_x(z^j - z^{j-1}))_i = \sum_l t_{il}(x)(z^j - z^{j-1})_l = t_{ij}(x)(z^j - z^{j-1})_j,$$

it follows that

$$|Tx' - Tx - D_x(x' - x)| \le \sum_{ij} |t_i(z^j) - t_i(z^{j-1}) - (D_x(z^j - z^{j-1}))_i| = \sum_{ij} |t_{ij}(w^{ij}) - t_{ij}(x)| \cdot |(z^j - z^{j-1})_j| \le \sum_{ij} \alpha|x'_j - x_j| \le k^2\alpha|x' - x|. \qquad \blacksquare$$
A35. The multivariable inverse-function theorem. Let $x_0$ be a point of the open set $U$.

Theorem. If $J(x_0) \ne 0$, then there are open sets $X$ and $Y$, containing $x_0$ and $y_0 = Tx_0$, respectively, such that $T$ is a one-to-one map from $X$ onto $Y = TX$; further, $T^{-1}: Y \to X$ is continuously differentiable, and the Jacobian matrix of $T^{-1}$ at $y$ is $D_{T^{-1}y}^{-1}$.
This is a local theorem. It is not assumed, as in Theorem 17.2, that $T$ is one-to-one on $U$ and $J(x)$ never vanishes; but under those additional conditions, $TU$ is open and the inverse point mapping is continuously differentiable. To understand the role of the condition $J(x_0) \ne 0$, consider the case where $k = 1$, $x_0 = 0$, and $Tx$ is $x^2$ or $x^3$.

PROOF. Let $Q$ be a rectangle such that $x_0 \in Q^\circ \subset Q^- \subset U$ and $J(x) \ne 0$ for $x \in Q^-$. As $(x, u)$ ranges over the compact set $Q^- \times [u: |u| = 1]$, $|D_xu|$ is bounded below by some positive $\beta$:

(18) $$|D_xu| \ge \beta, \qquad x \in Q^-, \ |u| = 1.$$
Making $Q$ smaller will ensure that $|t_{ij}(x) - t_{ij}(x')| \le \beta/2k^2$ for all $x$ and $x'$ in $Q^-$ and all $i, j$. Then (16) and (18) give, for $x, x' \in Q^-$,

$$|Tx' - Tx| \ge |D_x(x' - x)| - |Tx' - Tx - D_x(x' - x)| \ge |D_x(x' - x)| - \tfrac{1}{2}\beta|x' - x| \ge \tfrac{1}{2}\beta|x' - x|.$$

Thus

(19) $$|x' - x| \le \frac{2}{\beta}|Tx' - Tx| \qquad \text{for } x, x' \in Q^-.$$
This shows that $T$ is one-to-one on $Q^-$. Since $x_0$ does not lie in the compact set $\partial Q$, $\inf_{x \in \partial Q} |Tx - Tx_0| = d > 0$. Let $Y$ be the open ball with center $y_0 = Tx_0$ and radius $d/2$ (Figure (vi)). Fix a $y$ in $Y$. The problem is to show that $y = Tx$ for some $x$ in $Q^\circ$, which means finding an $x$ such that $\varphi(x) = |y - Tx|^2 = \sum_i (y_i - t_i(x))^2$ vanishes. By compactness, the minimum of $\varphi$ on $Q^-$ is achieved there. If $x \in \partial Q$ (and $y \in Y$), then $2|y - y_0| < d \le |Tx - y_0| \le |Tx - y| + |y - y_0|$, so that $|y - Tx_0| < |y - Tx|$. Therefore, $\varphi(x_0) < \varphi(x)$ for $x \in \partial Q$, and so the minimum occurs in $Q^\circ$ rather than on $\partial Q$. At the minimizing point, $\partial\varphi/\partial x_j = -\sum_i 2(y_i - t_i(x))t_{ij}(x) = 0$, and since $D_x$ is nonsingular, it follows that $y = Tx$: Each $y$ in $Y$ is the image under $T$ of some point $x$ in $Q^\circ$. By (19), this $x$ is unique (although it is possible that $y = Tz$ for some $z$ outside $Q$). Let $X = Q^\circ \cap T^{-1}Y$. Then $X$ is open and $T$ is a one-to-one map of $X$ onto $Y$. Now let $T^{-1}$ denote the inverse point transformation on $Y$. By (19), $T^{-1}$ is continuous. To prove differentiability, consider in $Y$ a fixed point $y$ and a variable point $y'$ such that $y' \to y$ and $y' \ne y$. Let $x = T^{-1}y$ and $x' = T^{-1}y'$; then $x'$ is a function of $y'$, $x' \to x$, and $x' \ne x$. Define $u$ by $Tx' - Tx = D_x(x' - x) + u$; then $u$ is a function of $x'$ and hence of $y'$, and $|u|/|x' - x| \to 0$ by (17). Apply $D_x^{-1}$: $D_x^{-1}(Tx' - Tx) = x' - x + D_x^{-1}u$, or $T^{-1}y' - T^{-1}y = D_x^{-1}(y' - y) - D_x^{-1}u$. By (18) and (19),
$$\frac{|T^{-1}y' - T^{-1}y - D_x^{-1}(y' - y)|}{|y' - y|} = \frac{|D_x^{-1}u|}{|y' - y|} \le \frac{|u|}{\beta|y' - y|} \le \frac{2}{\beta^2} \cdot \frac{|u|}{|x' - x|}.$$
The right side goes to 0 as $y' \to y$. By the remark following (17), the components of $D_x^{-1}$ must be the partial derivatives of the inverse mapping: $T^{-1}$ has Jacobian matrix $D_{T^{-1}y}^{-1}$ at $y$. The components of an inverse matrix vary continuously with the components of the original matrix (think for example of the inverse as specified by the cofactors), and so $T^{-1}$ is even continuously differentiable on $Y$. ∎
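The proof's device of inverting $T$ by minimizing $\varphi(x) = |y - Tx|^2$ also works as an algorithm. The sketch below is illustrative only: the map, starting point, step size, and iteration count are assumptions made here. It recovers $x$ from $y = Tx$ by gradient descent on $\varphi$.

```python
# Invert y = Tx locally by minimizing phi(x) = |y - Tx|^2, as in the proof.
# T(x1, x2) = (x1^2 - x2, x1*x2); illustrative choices throughout.
def T(x1, x2):
    return (x1 * x1 - x2, x1 * x2)

def grad_phi(x1, x2, y):
    # d(phi)/dx_j = -2 * sum_i (y_i - t_i(x)) * t_ij(x)
    r = (y[0] - (x1 * x1 - x2), y[1] - x1 * x2)
    return (-2 * (r[0] * 2 * x1 + r[1] * x2), -2 * (r[0] * -1 + r[1] * x1))

y = T(1.0, 2.0)                 # so the answer should be x = (1, 2)
x = [1.2, 1.7]                  # start near the true preimage
for _ in range(2000):
    g = grad_phi(x[0], x[1], y)
    x = [x[0] - 0.02 * g[0], x[1] - 0.02 * g[1]]

print(x)                        # ~ [1.0, 2.0]
```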
Continued Fractions

A36. In designing a planetarium, Christiaan Huygens confronted this problem: Given the ratio $x$ of the periods of two planets, approximate it by the ratio of the periods of two linked gears. If one gear has $p$ teeth and the other has $q$, then the ratio of their periods is $p/q$, so that the problem is to approximate the real number $x$ by the rational $p/q$. Of course $x$, being empirical, is already rational, but the numerator and denominator may be so large that gears with those numbers of teeth are not practical: in the approximation $p/q$, both $p$ and $q$ must be of moderate size.
Since the gears play symmetrical roles, there is no more reason to approximate $x$ by $r = p/q$ than there is to approximate $1/x$ by $1/r = q/p$. Suppose, to be definite, that $x$ and $r$ lie to the left of 1. Then

(20) $$|x - r| = xr\left|\frac{1}{x} - \frac{1}{r}\right| \le \left|\frac{1}{x} - \frac{1}{r}\right|,$$

and the inequality is strict unless $x = r$: If $x < 1$, it is better to approximate $1/x$ and then invert the approximation, since that will control both errors. For a numerical illustration, approximate $x = .127$ by rounding it up to $r = .13$; the calculations (to three places) are
(21) $$r - x = .13 - .127 = .003, \qquad \frac{1}{x} - \frac{1}{r} = 7.874 - 7.692 = .182.$$
The second error is large. So instead, approximate $1/x = 7.874$ by rounding it down to $1/r' = 7.87$. Since $1/7.87 = .1271$ (to four places), the calculations are
(22) $$\frac{1}{x} - \frac{1}{r'} = 7.874 - 7.87 = .004, \qquad r' - x = .1271 - .127 = .0001.$$
This time, both errors are small, and the error .0001 in the new approximation to $x$ is smaller than the corresponding .003 in (21). It is because $x$ lies to the left of 1 that inversion improves the accuracy; see (20). If this inversion method decreases the error, why not do another inversion in the middle, in finding a rational approximation to $1/x$? It makes no sense to invert $1/x$ itself, since it lies to the right of 1 (and inversion merely leads back to $x$ anyway). But to approximate $1/x = 7.874$ is to approximate the fractional part .874, and here a second inversion will help, for the same reason the first one does. This suggests Huygens's iterative procedure. In modern notation, the scheme is this. For $x$ (rational or irrational) in (0, 1), let $Tx = \{1/x\}$ and $a_1(x) = \lfloor 1/x \rfloor$ be the fractional and integral parts of $1/x$; and set $T0 = 0$. This defines a mapping of [0, 1) onto itself:
(23) $$Tx = \begin{cases} \left\{\dfrac{1}{x}\right\} = \dfrac{1}{x} - \left\lfloor \dfrac{1}{x} \right\rfloor = \dfrac{1}{x} - a_1(x) & \text{if } 0 < x < 1, \\[4pt] 0 & \text{if } x = 0. \end{cases}$$
Then

(24) $$x = \frac{1}{a_1(x) + Tx} \qquad \text{if } 0 < x < 1.$$
What (20) says is that replacing $Tx$ on the right in (24) by a good rational approximation to it gives an even better rational approximation to $x$ itself.
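The procedure is easy to run. The sketch below iterates $T$ to produce partial quotients and folds them back up via (24); the sample value $x = 0.127$ and the depth are choices made here for illustration, not taken from the text.

```python
# Huygens's scheme: iterate Tx = {1/x} to get partial quotients a_1, a_2, ...,
# then rebuild the continued fraction from the bottom, as in (24).
from math import floor

def partial_quotients(x, n):
    a = []
    for _ in range(n):
        if x == 0:
            break
        a.append(floor(1 / x))        # a_1(x) = integer part of 1/x
        x = 1 / x - a[-1]             # Tx = fractional part of 1/x
    return a

def fold(a):
    # 1/(a_1 + 1/(a_2 + ...)), replacing the final tail T^n x by 0
    value = 0.0
    for q in reversed(a):
        value = 1 / (q + value)
    return value

x = 0.127
a = partial_quotients(x, 5)
print(a)                 # [7, 1, 6, 1, 15]: small numbers, small gears
print(fold(a[:3]), x)    # 7/55 = 0.12727... already approximates x well
```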
To carry this further requires a convenient notation. For positive variables $z_i$ define the continued fractions

$$\frac{1|}{|z_1} = \frac{1}{z_1}, \qquad \frac{1|}{|z_1} + \frac{1|}{|z_2} = \cfrac{1}{z_1 + \cfrac{1}{z_2}}, \qquad \frac{1|}{|z_1} + \frac{1|}{|z_2} + \frac{1|}{|z_3} = \cfrac{1}{z_1 + \cfrac{1}{z_2 + \cfrac{1}{z_3}}},$$

and so on. It is "typographically" clear that

(25) $$\frac{1|}{|z_1} + \frac{1|}{|z_2} + \cdots + \frac{1|}{|z_n} = \cfrac{1}{z_1 + \left(\dfrac{1|}{|z_2} + \cdots + \dfrac{1|}{|z_n}\right)}$$

and

(26) $$\frac{1|}{|z_1} + \cdots + \frac{1|}{|z_{n-1}} + \frac{1|}{|z_n} = \frac{1|}{|z_1} + \cdots + \frac{1|}{|z_{n-2}} + \frac{1|}{|z_{n-1} + 1/z_n}.$$
For a formal theory, use (25) as a recursive definition and then prove (26) by induction (or vice versa). An infinite continued fraction is defined by

$$\frac{1|}{|z_1} + \frac{1|}{|z_2} + \cdots = \lim_n \left( \frac{1|}{|z_1} + \frac{1|}{|z_2} + \cdots + \frac{1|}{|z_n} \right),$$

provided the limit exists. A continued fraction is simple if the $z_i$ are positive integers. If $T^{n-1}x > 0$, let $a_n(x) = a_1(T^{n-1}x)$; the $a_n(x)$ are the partial quotients of $x$. If $x$ and $Tx$ are both positive, then (24) applies to each of them:

$$x = \cfrac{1}{a_1(x) + \cfrac{1}{a_2(x) + T^2x}}.$$
If none of the iterates $x, Tx, \dots, T^{n-1}x$ vanishes, then it follows by induction (use (26)) that

(27) $$x = \frac{1|}{|a_1(x)} + \frac{1|}{|a_2(x)} + \cdots + \frac{1|}{|a_n(x) + T^nx}.$$

This is an extension of (24), and the idea following (24) extends as well: a good rational approximation to $T^nx$ in (27) gives a still better rational approximation to $x$. Even if $T^nx$ is approximated very crudely by 0, there results a sharp approximation

(28) $$\frac{1|}{|a_1(x)} + \frac{1|}{|a_2(x)} + \cdots + \frac{1|}{|a_n(x)}$$

to $x$ itself. The right side here is the $n$th convergent to $x$, and it goes very rapidly to $x$; see Section 24. By the definition (23), $x$ and $Tx$ are both rational or both irrational. For an irrational $x$, therefore, $T^nx$ remains forever among the irrationals, and (27) holds for all $n$. If $x$ is rational, on the other hand, $T^nx$ remains forever among the rationals, and in fact, as the following argument shows, $T^nx$ eventually hits 0 and stays there.
Suppose that $x$ is a rational in (0, 1): $x = d_1/d_0$, where $0 < d_1 < d_0$. If $Tx > 0$, then $Tx = \{d_0/d_1\} = d_2/d_1$, where $0 < d_2 < d_1$ because $0 < Tx < 1$. (If $d_1/d_0$ is irreducible, so is $d_2/d_1$.) If $T^2x > 0$, the argument can be repeated:

(29) $$Tx = \frac{d_2}{d_1}, \qquad T^2x = \frac{d_3}{d_2}, \qquad \dots$$

And so on. Since the $d_n$ decrease as long as the $T^nx$ remain positive, $T^nx$ must vanish for some $n$, and then $T^mx = 0$ for $m \ge n$. If $n$ is the smallest integer for which $T^nx = 0$ ($n \ge 1$ if $x > 0$), then by (27),

(30) $$x = \frac{1|}{|a_1(x)} + \frac{1|}{|a_2(x)} + \cdots + \frac{1|}{|a_n(x)}.$$

Thus each positive rational has a representation as a finite simple continued fraction. If $0 < x < 1$ and $Tx = 0$, then $1 > x = 1/a_1(x)$, so that $a_1(x) \ge 2$. Applied to $T^{n-1}x$, this shows that the $a_n(x)$ in (30) must be at least 2.
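For a rational input the iteration is exact integer arithmetic: it is Euclid's algorithm in disguise. A short sketch (the fraction 127/1000 is an arbitrary example chosen here) shows the denominators strictly decreasing until the expansion terminates.

```python
# For rational x, iterating Tx = {1/x} is Euclid's algorithm in disguise:
# the denominators strictly decrease, so the expansion terminates.
from fractions import Fraction

x = Fraction(127, 1000)          # arbitrary example rational in (0, 1)
quotients = []
while x != 0:
    inv = 1 / x                  # exact rational arithmetic
    a = inv.numerator // inv.denominator
    quotients.append(a)          # a_n(x)
    x = inv - a                  # one application of T; denominator shrinks

print(quotients)                 # [7, 1, 6, 1, 15]: finite, last quotient >= 2
```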
Section 24 requires a uniqueness result. Suppose that

(31) $$x = \frac{1|}{|a_1} + \frac{1|}{|a_2} + \cdots + \frac{1|}{|a_n + t},$$

where the $a_i$ are positive integers and

(32) $$0 < x < 1, \qquad 0 \le t < 1, \qquad a_n + t > 1.$$

The last condition rules out $a_n = 1$ and $t = 0$ (which in the case $n = 1$ is also ruled out by $x < 1$). It follows from (31) and (32) that

(33) $$a_k(x) = a_k \text{ for } k \le n, \qquad T^nx = t.$$
The case $n = 1$ being easy, suppose the implication holds for $n - 1$, where $n \ge 2$. Since $0 < 1/(a_n + t) < 1$, the induction hypothesis (use (26)) gives $a_k(x) = a_k$ for $k < n$ and $T^{n-1}x = 1/(a_n + t)$. Now apply the case $n = 1$ to $T^{n-1}x$. (If $a_n = 1$ and $t = 0$, then $a_k(x) = a_k$ for $k \le n - 2$, $a_{n-1}(x) = a_{n-1} + 1$, and $T^{n-1}x = 0$.)

Consider now the infinite case. Assume that
(34) $$x = \frac{1|}{|a_1} + \frac{1|}{|a_2} + \cdots$$

converges, where the $a_n$ are positive integers. Then

(35) $$a_n(x) = a_n, \qquad n \ge 1.$$

To prove this, let $n \to \infty$ in (25): the continued fraction $t = \frac{1|}{|a_2} + \frac{1|}{|a_3} + \cdots$ converges and $x = 1/(a_1 + t)$. It follows by induction (use (26)) that

(36) $$0 < \frac{1|}{|a_1} + \frac{1|}{|a_2} + \cdots + \frac{1|}{|a_n} < 1, \qquad n \ge 2.$$
Hence $0 < x < 1$, and the same must be true of $t$. Therefore, $a_1$ and $t$ are the integer and fractional parts of $1/x$, which proves (35) for $n = 1$. Apply the same argument to $Tx$, and continue. The $x$ defined by (34) is irrational: otherwise, $T^nx = 0$ for some $n$, which contradicts (35) and (36). Thus the value of an infinite simple continued fraction uniquely determines the partial quotients. The same is almost true of finite simple continued fractions. Since (31) and (32) imply (33), it follows that if $x$ is given by (30), then any continued fraction of $n$ terms that represents $x$ must indeed match (30) term for term. But, for example, $\frac{1|}{|1} + \frac{1|}{|3} = \frac{1|}{|1} + \frac{1|}{|2} + \frac{1|}{|1}$. This is always possible: replace $a_n(x)$ in (30) (where $a_n(x) \ge 2$) by $a_n(x) - 1 + \frac{1|}{|1}$. Apart from this ambiguity, the representation is unique, and the representation (30) that results from repeated application of $T$ to a rational $x$ never ends with a partial quotient of 1.†
† See ROCKETT & SZÜSZ for more on continued fractions.
Notes on the Problems
These notes consist of hints, solutions, and references to the literature. As a rule a solution is complete in proportion to the frequency with which it is needed for the solution of subsequent problems.
Section 1

1.1. (a) Each point of the discrete space lies in one of the four sets $A_1 \cap A_2$, $A_1^c \cap A_2$, $A_1 \cap A_2^c$, $A_1^c \cap A_2^c$ and hence would have probability at most $2^{-2}$; continue. (b) If, for each $i$, $B_i$ is $A_i$ or $A_i^c$, then $B_1 \cap \cdots \cap B_n$ has probability at most $\prod_{i=1}^n (1 - a_i) \le \exp[-\sum_{i=1}^n a_i]$.

1.3. (b) Suppose $A$ is trifling and let $A^-$ be its closure. Given $\varepsilon$ choose intervals $(a_k, b_k]$, $k = 1, \dots, n$, such that $A \subset \bigcup_{k=1}^n (a_k, b_k]$ and $\sum_{k=1}^n (b_k - a_k) < \varepsilon/2$. If $x_k = a_k - \varepsilon/2n$, then $A^- \subset \bigcup_{k=1}^n [x_k, b_k]$ and $\sum_{k=1}^n (b_k - x_k) < \varepsilon$. For the other parts of the problem, consider the set of rationals in (0, 1).

1.4. (a) Cover $A_r(i)$ by $(r-1)^n$ intervals of length $r^{-n}$. (c) Go to the base $r^k$. Identify the digits in the base $r$ with the keys of the typewriter. The monkey is certain eventually to reproduce the eleventh edition of the Britannica and even, unhappily, the fifteenth.
1.5. (a) The set $A_3(1)$ is itself uncountable, since a point in it is specified by a sequence of 0's and 2's (excluding the countably many that end in 0's). (b) For sequences $u_1, \dots, u_n$ of 0's, 1's, and 2's, let $M_{u_1 \cdots u_n}$ consist of the points in (0, 1] whose nonterminating base-3 expansions start out with those digits. Then $A_3(1) = (0, 1] - \bigcup M_{u_1 \cdots u_n}$, where the union extends over $n \ge 1$ and the sequences $u_1, \dots, u_n$ containing at least one 1. The set described in part (b) is $[0, 1] - \bigcup M^\circ_{u_1 \cdots u_n}$, where the union is as before, and this is the closure of $A_3(1)$. From this representation of $C$, it is not hard to deduce that it can be defined as the set of points in [0, 1] that can be written in base 3 without any 1's if terminating expansions are also allowed. For example, $C$ contains $\frac{2}{3} = .1222\cdots = .2000\cdots$ because it is possible to avoid 1 in the expansion. (c) Given $\varepsilon$ and an $\omega$ in $C$, choose $\omega'$ in $A_3(1)$ within $\varepsilon/2$ of $\omega$; now define $\omega''$ by changing from 2 to 0 some digit of $\omega'$ far enough out that $\omega''$ differs from $\omega'$ by at most $\varepsilon/2$.
1.7. The interchange of limit and integral is justified because the series $\sum_k r_k(\omega)2^{-k}$ converges uniformly in $\omega$ (integration to the limit is studied systematically in Section 16). There is a direct derivation of (1.40): let $n \to \infty$ in $\sin t = 2^n \sin(2^{-n}t)\prod_{k=1}^n \cos(2^{-k}t)$, which follows by induction from the half-angle formula.
1.10. (a) Given $m$ and a subinterval $(a, b]$ of (0, 1], choose a dyadic interval $I$ in $(a, b]$, and then choose in $I$ a dyadic interval $J$ of order $n > m$ such that $n^{-1}s_n(\omega) > \frac{1}{2}$ for $\omega \in J$. This is possible because to specify $J$ is to specify the first $n$ dyadic digits of the points in $J$: choose the first digits in such a way that $J \subset I$, and take the following ones to be 1, with $n$ so large that $n^{-1}s_n(\omega)$ is near 1 for $\omega \in J$. (b) A countable union of sets of the first category is also of the first category; $(0, 1] = N \cup N^c$ would be of the first category if $N^c$ were. For Baire's theorem, see ROYDEN, p. 139.
1.11. (a) If $x = p_0/q_0 \ne p/q$, then $|x - p/q| = |p_0q - pq_0|/q_0q \ge 1/q_0q$. (c) The rational $\sum_{k=1}^n 2^{-a(k)}$ has denominator $2^{a(n)}$ and approximates $x$ to within $2/2^{a(n+1)}$.
Section 2

2.3. Let $\Omega$ consist of four points, and let $\mathscr{F}$ consist of the empty set, $\Omega$ itself, and all six of the two-point sets.

2.4. (b) For example, take $\Omega$ to consist of the integers, and let $\mathscr{F}_n$ be the σ-field generated by the singletons $\{k\}$ with $k \le n$. As a matter of fact, any example in which $\mathscr{F}_n$ is a proper subclass of $\mathscr{F}_{n+1}$ for all $n$ will do, because it can be shown that in this case $\bigcup_n \mathscr{F}_n$ necessarily fails to be a σ-field; see A. Broughton and B. W. Huff: A comment on unions of sigma-fields, Amer. Math. Monthly, 84 (1977), 553-554.

2.5. (b) The class in question is certainly contained in $f(\mathscr{A})$ and is easily seen to be closed under the formation of finite intersections. But $(\bigcup_{i=1}^m \bigcap_{j=1}^{n_i} A_{ij})^c = \bigcap_{i=1}^m \bigcup_{j=1}^{n_i} A_{ij}^c$, and $\bigcup_{j=1}^{n_i} A_{ij}^c = \bigcup_{j=1}^{n_i} [A_{ij}^c \cap \bigcap_{l<j} A_{il}]$ has the required form.
2.8. If $\mathscr{A}^*$ is the smallest class over $\mathscr{A}$ closed under the formation of countable unions and intersections, clearly $\mathscr{A}^* \subset \sigma(\mathscr{A})$. To prove the reverse inclusion, first show that the class of $A$ such that $A^c \in \mathscr{A}^*$ is closed under the formation of countable unions and intersections and contains $\mathscr{A}$ and hence contains $\mathscr{A}^*$.

2.9. Note that $\bigcup_n B_n \in \sigma(\bigcup_n \mathscr{A}_{B_n})$.
2.10. (a) Show that the class of $A$ for which $I_A(\omega) = I_A(\omega')$ is a σ-field. See Example 4.8.
2.11. (b) Suppose that $\mathscr{F}$ is the σ-field of the countable and the cocountable sets in $\Omega$. Suppose that $\mathscr{F}$ is countably generated and $\Omega$ is uncountable. Show that $\mathscr{F}$
is generated by a countable class of singletons; if $\Omega_0$ is the union of these, then $\mathscr{F}$ must consist of the sets $B$ and $B \cup \Omega_0^c$ with $B \subset \Omega_0$, and these do not include the singletons in $\Omega_0^c$, which is uncountable because $\Omega$ is. (c) Let $\mathscr{F}_1$ consist of the Borel sets in $\Omega = (0, 1]$, and let $\mathscr{F}_2$ consist of the countable and the cocountable sets there.
2.12. Suppose that $A_1, A_2, \dots$ is an infinite sequence of distinct sets in a σ-field $\mathscr{F}$, and let $\mathscr{G}$ consist of the nonempty sets of the form $\bigcap_{n=1}^\infty B_n$, where $B_n = A_n$ or $B_n = A_n^c$, $n = 1, 2, \dots$. Each $A_n$ is the union of the $\mathscr{G}$-sets it contains, and since the $A_n$ are distinct, $\mathscr{G}$ must be infinite. But there are uncountably many distinct countable unions of $\mathscr{G}$-sets, and they all lie in $\mathscr{F}$.
2.18. For this and the subsequent problems on applications of probability theory to arithmetic, the only number theory required is the fundamental theorem of arithmetic and its immediate consequences. The other problems on stochastic arithmetic are 4.15, 4.16, 5.19, 5.20, 6.16, 18.17, 25.15, 30.9, 30.10, 30.11, and 30.12. See also Theorem 30.3. (b) Let $A$ consist of the even integers, let $C_k = [m: u_k < m \le u_{k+1}]$, and let $B$ consist of the even integers in $C_1 \cup C_3 \cup \cdots$ together with the odd integers in $C_2 \cup C_4 \cup \cdots$; take $u_k$ to increase very rapidly with $k$ and consider $A \cap B$. (c) If $c$ is the least common multiple of $a$ and $b$, then $M_a \cap M_b = M_c$. From $M_a \in \mathscr{M}$ conclude in succession that $M_a \cap M_b \in \mathscr{M}$, that $M_{a_1} \cap \cdots \cap M_{a_n} \cap M_{b_1}^c \cap \cdots \cap M_{b_m}^c \in \mathscr{M}$, and that $f(\mathscr{M}) \subset \mathscr{M}$. By the same sequence of steps, show how $D$ on $\mathscr{M}$ determines $D$ on $f(\mathscr{M})$. (d) If $B_l = M_a - \bigcup_{p \le l} M_{ap}$, then $a \in B_l$ and (the inclusion-exclusion formula requires only finite additivity) $D(B_l) = a^{-1} - \sum_p (ap)^{-1} + \cdots$. Choose $l_a$ so that, if $C_a = B_{l_a}$, then $D(C_a) < 2^{-a-1}$. If $D$ were a probability measure on $f(\mathscr{M})$, $1 = D(\Omega) \le \sum_a D(C_a) < 1$ would follow. See Problem 4.16 for a different approach.
2.19. (a) Apply the intermediate-value theorem to the function $f(x) = \lambda(A \cap (0, x])$. Note that this even proves part (c) for $\lambda$ (under the assumption that $\lambda$ exists). (b) If $0 < P(B) < P(A)$, then either $0 < P(A \cap B) \le \frac{1}{2}P(A)$ or $0 < P(A - B) \le \frac{1}{2}P(A)$. Continue. (c) If $P(\bigcup_k H_k) < x$, choose $C$ so that $C \subset A - \bigcup_k H_k$ and $0 < P(C) < x - P(\bigcup_k H_k)$. If $h_n < P(C)$, then $P(\bigcup_{k<n} H_k) + h_n < P(\bigcup_{k<n} H_k) + P(H_n) + P(C) \le P(\bigcup_{k<n} H_k) + h_n$, which is impossible.
2.21. (c) If $\mathscr{G}_{n-1}$ were a σ-field, $\mathscr{B} \subset \mathscr{G}_{n-1}$ would follow.
2.22. Use the fact that, if $\alpha_1, \alpha_2, \dots$ is a sequence of ordinals satisfying $\alpha_n < \Omega$, then there exists an ordinal $\alpha$ such that $\alpha < \Omega$ and $\alpha_n < \alpha$ for all $n$.
555
2.23. Suppose that Bj E U f3 � j = 1 , 2, . . . . Choose odd integers nj in such a way that Bj E �a< ni > and the nj are all distinct; choose .fo-sets such that < a
•
<1>�a , < n · > ( em"il , em 1
Hj2
) = B, ;
'...
for n not of the form n,·, choose Yo-Sets for which <1>�,a (n)<em ' em ' . . . ) is 0 or (0, 1 ] as n is odd or even. Then U ]= t Bj = ( e 1 , e2 . . . ). Similarly, B e = a(e., e2 , . . . ) for Yo-sets en ifB E u {3 � The rest of the proof is essen tially the same as before. < a
·
a
,
nl
n2
Section 3 3.1. (a) The finite additivity of P is used in the proof that 90 c .4' and again (via monotonicity; see (2.5)) in the proof of (3.7). The countable additivity of P is used (via wuntable subadditivity; see Theorem 2. 1) in the proof of (3.7). 1 1 (b) For a specific example consider U n(2 - + n - , 1 ] in connection with Prob lem 2.15. But an exam;Jie is provided by every P that is finitely but not countably additive: If P is finitely additive on a field 90 and A,. are disjoint .9Q-sets whose union A also lies in 90, then monotonicity (which requires finite additivity only) gives Lk o:; n P( A k ) = P( U k ,; n A k ) < P( A ) and hence [k P(A k ) < P(A). Countable subaddit!vity will ensure that there is equality here. (c) The proof of (3. 7) involves the countable subadditivity of P on 90, which is only assumed to be a field (that being the whole point of the theorem). 3.2. (a) Given €, choose 90-sets An such that A c U n A n and [P( An) < P* (A) + t: ; if B = U n A n , then A c B, B E 5', and P( B) < P*(A) + t: ; hence the right side of (3.9) is at most P*( A) On the other hand, A c B and B E !f impl(' P*(A) < P* (B) = P( B). Hence (3.9). If A c Bk > Bk E 5', P(Bk ) < P*(A) + k - , and B = n k Bk > then A c B , B E 5', and P*(A) = P(B). For (3.l0), argue by completnentation. (b) Suppose that P * (A ) = P* ( A ) and chose .9=sets A 1 and A in such a way 2 that A 1 c A cA 2 and P(A 1 ) = P( A 2 ). Given E, choose an .9=set B in such a way that E c B and P'" (E) = P( B). Tht!n P*(A n E) + P*(Ac n E) < P(A 2 n B ) + P( A'{ n B). Now use (2.7) to bound tht! last snm by P(B) + P(A 2 - A 1 ) = P* (E).
3.3. First note the general fact that P* agrees with P on 90 if and only if P is countably, additive there, a condition not satisfied in parts (b) and (e). Using Problem 3.2 simplifies the analysis of P* and ./I(P* ) in the other parts. Note in parts (b) and (e) that, if P* and P* are defined by (3.1) and (3.2), then, since P* ( A ) = 0 for all A, (3.4) holds for all A and (3.3) holds for no A. Countable additivity thus plays an essential role in Problem 3.2. 3.6 (c) Split Ee by A: P0 (E ) = 1 - P ( £C ) = 1 - r (A n Ee ) - P (Ae n £C ) = 1 0 P ( A () Ee) - P( Ae) = P( A ) - P (A - E).
3.7.
(b)
Apply (3. 13): For A E 90, Q( A ) = P (H n A ) + P0 (He n A ) = P(H nA)
+ P(A ) - P0 ( A - ( He n A )) = P(A). (c) If A 1 and A 2 are disjoint 90-sets, then by (3.12),
NOTES ON THE PROBLEMS
556
Apply (3. 13) to the three terms in this equation, successively using A 1 U A 2, A 1 , and A 2 for A: Po ( H e n ( A 1 U A 2 ) ) = Po ( He n A 1 ) + Po ( He n A 2 ) .
But for these two equations to hold it is enough that H n A 1 n A 2 = 0 in the first case and He n A 1 n A 2 = 0 in the second (replacing A 1 by A 1 n A� changes nothing).
3.8. By using Banach limits (BANACH, p. 34) one can similarly prove that density D on the class 9 (Problem 2.18) extends to a finitely additive probability on the class of all subsets of n = {1, 2, . . . }. 3. 14. The argument is based on cardinality. Since the Cantor set e has Lebesgue measure 0, 2c is contained in the class ...i' of Lebesgue sets in (0, 1]. But e is uncountable: card � = card(O, 1] < card 2 c < card ...£'. 3.18. (a) Since the A Ell r are disjoint Borel sets, I:, A( A Ell r) < 1, and so the common value A(A) of the A( A Ell r) must be 0. Similarly, if A is a Borel set contained in some H Ell r, then A( A) = 0. (b) If the E n (H Ell r) are all Borel sets, they all have Lebesgue measure 0, and so E is a Borel set of Lebesgue measure 0. 3,19.
Given A 1 , B 1 , , A , _ 1 , 8, _ 1 , note that their union en is nowhere dense, so that In contains an interval In disjoint from en. Choose in In disjoint, nowhere dense sets An and Bn of positive measure. (c) Note that A and Bn are disjoint and that An U Bn c G. (b)
•
•
•
3,20. (a) If In are disjoint open intervals with union G, then b - lA( A) > f.n A(In) > Lnb - lA( A () In) > b - lA( A ).
Section 4 4.1. Let r be the quantity on the right in (4.30), assumed finite. Suppose that x < r; then x < V 'k=n x k for n > 1 and hence x < x k for some k > n: x < xn i.o. Suppose that x < x n i.o.; then x < V k_ =,, x k for n > 1: x < r. It follows that r = sup[ x : x < x n i.o.], which is easily seen to be the supremum of the limit points of the sequence. The argument for (4.31) is similar. .9'
is the u-field generated by .:#U {H} (Problem 2.7(a)). If (H n G 1 ) U (He n G2) = (H n G'1 ) u (He n G;), then G 1 � G'1 c He and now follows because A * (H) G2 � G'2 nc H; consistency n = A * (He) = O. If A!i) = n (H n G\ >) u (He n G� >) are disjoint, then G\m> n G\ > c He and G \m> n G� nc (see nProblem 2.17) P( U nA n ) = zM U nG\ >) H for m -=fonn, and therefore n + tM U ,, G� >) = Ln(tMG\ >) + -!MG� >)) = Ln P( An). The intervals with ratio nal endpoints generate .:#.
'4.10. The class
4.14. Show as in Problem l. l(b) that the maximum of P(B1 n n Bn), where B1 is A1 or A�, goes to 0. Let A , = [w: f.n iA (w)2 - " < x ], show that P(A n Ax) is continuous in x, and proceed as in Problem 2.19(a). ·
·
·
NOTES ON THE PROBLEMS
557
4.15. Calculate D(F1) by (2.36) and the inclusion - exclusion formula, and estimate Pn( F1 - F) by subadditivity; now use 0 < Pn(F1) - Pn(F) = Pn(F1 - F). For the calculation of the infinite product, see HARDY & WRIGHT, p. 246. Section 5 5.5. (a) lf m = O, a > O, and x > O, then P[ X > a ] < P[( X + x ) 2 � (a + x F ] < E[( X + x ) 2 ]j(a + x )2 = (u 2 + x 2 )j(a + x) 2 ; minimize over x. 5.8. {b) It is enough to prove that cp\t ) = f(t ( x', y') + (1 - t X x, y )) is convex in t (O ::; t < l ) for ( x , y) and ( x', y' ) in C. If a = x' - x and {3 = y' - y, then (if [, , > 0) cp" = fu a 2 + 2 ft2 a f3 + fn/3 2 1 = - Un a + !! 2 /3 ) 2 + 1-1 ( fu /22 - flz) f3 2 > 0. �11 II Examples like f( x, y ) = y 2 - 2 xy show that convexity in each variable sepa rately does not imply convexitv.
5.9. Check (5.39) for f( x, y) = - x 11P y 1 fq . 5.10 . Check (5.39) for f( x, y) = - ( x 1 1P + y 1 1P ) P. 5.19. For (5.43) use (2.36) and the fundamental theorem of arithmetic · since the P ; k are distinct, the P; ; individually divide m if and only if their product does. For (5.44) use inclusion - exclusion. For (5.47), use (5.29) (see Problem 5. 12)).
k 5.20. (a) By (5.47), En[a P ] < f.� = 1 p - < 2 jp. And, of course, n - ! log n! = En[log] = LpEn[a P ] log p. k (b) Use (5.48J and the fact that En[a P - o P ] < Lk = p - . 2 (c) By (5.49), n
log p =
n � j - 2 l ; j) log p (l L
n
< 2n ( E n [ log * ] - En [ lo g* ] ) = O(n) . 2
Deduce (5.50) by splitting the range of summation by successive powers of 2. (d) Jf K bounds the O(l) terms in (5.51), then
L log p > O x
p s,x
L p - 1 log p > Ox ( log e - 1 - 2K ) .
9x
(e) For (5.53) use
L
p sx
log p x < log x :$ '7T( )
" L.. 1 +
p s,x l /2
L
x lf2
<- xl/2 + log2 x L log p . p s. x
log p lo a x 1 12 o
NOTES ON THE PROBLEMS
558
By (5.53), '7T(X) > x 1 12 for large x, and hence log '7T(x) :: d og x and '7T(x ) :::: xjlog '7T(x). Apply this with x = p, and note t hat '7T( p,) = r.
Section 6 6.3. Since for given values of Xn1(w), . . . , Xn , k - l(w) there are for Xn(w ) the k possible values 0, 1, . . . , k - 1, the number of values of (Xn1(w), . . . , Xnn( w)) is n!. Therefore, the map w -+ (Xnlw), . . . , Xm,(w)) is one-to-one, and the Xnk(w) determine w. It follows t hat if 0 < X; < i for 1 < i < k, then t he number of permutations w satisfying Xn ;(w) = X;, 1 < i < k, is j ust (k + 1Xk + 2) · · · n, so that P[ xni = X;, 1 < i < k ] = ljk!. It now follows by induction on k that Xn1 , . . . , Xnk are independent and P[ Xn k = x ] = k - 1 (0 < x < k). Now calculate E [ Xnd =
k-1 2
,
0 + 1 + · · · + {n - 1 )
-)
n(n - 1) 4 , 4 2 k - 1 2 k2 - 1 0 2 + f + · " ' + ( k - 1 )2 - 2 Var[ Xn d = - 12 k n 2n 3 + 3n 2 - 5n 1 i: Var[ Sn l = rr (e - 1) = 72 E[S,. ] =
(
k=l
Apply Chebyshev 's inequality.
6.7. (a) If k 2 < n < (k + 1) 2 , let a n = e; if M bounds the lxnl, then
n - an _!_ _!_ s .!_ _!_ -+ 0. = a 2M M + (n nM < n s ) .!_n n a n a, - n a an an n
6.16. From (5.53) and (5.54) it follows that an = LP n - 1ln /P J (6.8) is
- + co,
The ieft side of
1 _!_ + _!_ , .!. . .!. . .!. . (.!. . ) (! ! ) < < .!_n l.!!. p n q n pq .._ l _ n l!!. p . j n l q J pq np nq !!..
Section 7
7.3. If one grants that there are only countably many effective rules, the result is an immediate consequence of the mathematics of t his and the preceding sections: C i s a countable intersection of .r-sets of measure 1 . The argument proves in particular the nontrivial fact that collectives exist. 7.7. If n < T, then Wn = Wn - I - Xn - 1 = WI - Sn - 1 > and T is the smallest n for Which Sn - l = W1 • Use (7.8) for the question of whether the game terminates. Now T- 1
FT = F0 + I: ( W1 - sk _ I ) Xk k= l
=
F0 + W1S7_ 1 - i(s;_ 1 - ( T - 1 )) .
NOTES ON THE PROBLEMS
559
7.8. Let x 1 , . . . . x,. be the initial pattern and put I.0 = x 1 + · · · + x,.. Define I-n = I,n - 1 - wn xn, L o = k, and Ln Ln - 1 - (3Xn + l)j2. Then T is the smallest n such that Ln < 0, and T is by the strong law finite with probability 1 if E[3Xn + 1] = 6( p - �) > 0. For n < -r, I.n is t he sum of t he pattern used to determine Wn + 1 . Since Fn - Fn _ 1 = I.n _ 1 - I.n, it follows that Fn = F0 + !. 0 - I-n and F7 = F0 + !. 0• =
"Section 8
8.8.
With probability 1 the population either dies out or goes to infinity. If, for example, Pko = 1 - Pk , k + l = 1jk 2, then extinction and explosion each have positive probability. (b)
8.9. To prove t hat x,. 0 is the only possibility in the persistent case, use Problem 8.5, or else argue directly: If X; = f.1 * ;0 P;1 x1 ' i -=1= i0, and K bounds t he I ..<), then X; = "L pij, · " Pln - d.Xln' where the sum is over j 1 , . . . , jn - l distinct from i0, and hence lx;l < KP;[ Xk -=1= i 0, k < nl -+ 0. =
8.13. Let P be the set of i for which rr; > 0, let N be the set of i for which 71'; < 0, and suppose n >that P and N are both nonempty. For i 0 E P and j0 E N choose n so that P! } 0. Then CIJ 0
0<
L L 7r; P}jl = L rr1 - L L rr; P!j l
jeN i E P
= I: iEN
71';
jeN iEN
jeN
I: pjjl < 0.
jeP
Transfer from N to P any i for which rr; = 0 and use a similar argument.
8.16. Denote the sets ( 8.32) and (8.52) by P and by F, respectively_ Since F c P, gcd P < gcd F. The reverse inequality follow-; from the fact that each integer in P is a sum of integers in F. 8.17. Consider the chain with states 0, 1, , . _ and a and transition probabiiities Poj = [1+ 1 for j > 0, Poa = 1 -f, P;, i - l = 1 for i > 1, and Paa = 1 (a is an absorbing state). The transition matrix is
[, fz 13 1 0 0 0 1 0 -
·
0
-
·
·
·
0
·
·
'
- -
.'.
1 -f 0 0
· · · · - · - · · · · ·
0
-
'.
1
Show tha t f/Jjl = fn and PlKil = un, and apply Theorem 8.1. Then assume f = 1, discard the state a and any states j such that fk 0 for k > j, and apply Theorem 8.8. =
In FELLER, Volume 1, the renewal theorem is proved by purely analytic means and is then used as the starting point for the theory of Markov chains_ Here the procedure is the reverse.
8.19. The transition probabilities are $p_{0r} = 1$ and $p_{i,r-i+1} = p$, $p_{i,r-i} = q$, $1 \le i \le r$; the stationary probabilities are $u_1 = \cdots = u_r = q^{-1}u_0 = (r + q)^{-1}$. The chance of getting wet is $u_0p$, of which the maximum is $2r + 1 - 2\sqrt{r(r+1)}$. For $r = 5$ this is .046, the pessimal value of $p$ being .523. Of course, $u_0p < 1/4r$. In more reasonable climates fewer umbrellas suffice: if $p = .25$ and $r = 3$, then $u_0p = .050$; if $p = .1$ and $r = 2$, then $u_0p = .031$. At the other end of the scale, if $p = .8$ and $r = 3$, then $u_0p = .050$; and if $p = .9$ and $r = 2$, then $u_0p = .043$.
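These figures are quick to confirm by computing the stationary vector directly. The sketch below is an illustration under stated assumptions (state $i$ counts umbrellas at the walker's current location; the power-iteration routine and parameter values are choices made here): it builds the transition matrix and iterates it.

```python
# Stationary distribution of the umbrella chain: states 0..r count the
# umbrellas at the walker's current location; it rains with probability p.
def wet_probability(r, p):
    q = 1 - p
    P = [[0.0] * (r + 1) for _ in range(r + 1)]
    P[0][r] = 1.0                      # no umbrellas here: all r are there
    for i in range(1, r + 1):
        P[i][r - i + 1] = p            # rain: carry one umbrella across
        P[i][r - i] = q                # dry: walk over empty-handed
    u = [1.0 / (r + 1)] * (r + 1)
    for _ in range(10000):             # power iteration: u <- uP
        u = [sum(u[i] * P[i][j] for i in range(r + 1)) for j in range(r + 1)]
    return u[0] * p                    # wet = no umbrella here, and it rains

print(round(wet_probability(5, 0.523), 3))   # ~ 0.046
print(round(wet_probability(3, 0.25), 3))    # ~ 0.05
```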
8.22. For t he last part, consider the chain with state space Cm and transition probabilities P;1 for i, j E Cm (show that they do add to 1)_ 8.23. Let C' = S - (T U C) , and take U = T U C' in (8.51)_ The probability of absorp tion in C is the probability of ever entering it, and for initial states i in T U C these probabilities are the minimal solution of Y; =
L P; Yj + jL
jE T
O < y,. < 1 ,
;
EC
Pij
+
L Pij j , j ' e
C
V
i E T U C' ' i E T U C' .
Since the states in C' (C' = 0 is po ssible) are pe1 sbtent and C is closed, it is impossible to move from C' to C. Therefore, in the minimal solution of the system above, y,. = 0 for i E C'. This gives the system (8.55). It also gives, for the minimal solution, '[j T Pij Yj + '[j c p,.1 = 0, i E C'. This makes probabilistic sense: for an i in C', not only is it Impossible to move to a j in C, it is impossible to move to a j in T for which there is positive probability of absorption in C. e
e
8.24. Fix on a state i, and let Sv consist of those j for whkh pfj lm> 0 for some n congruent to v modulo t. Choose k so that p)/l > 0; if p� ) and pfj'l are positive, then t divides m + k and n + k, so that m and n are congruent modulo t_ The S, are thus well defined. 8.25. Show that Theorem 8.6 applies to the chain with transition probabilities p�>. 8.27. (a) From PC = CA follows Pc ; = A 1 c; , from RP = A R follows r P = A ; r,. , and n ; n from RC = I follows r; cj = oij- Clearly N is diagonal and p = CA R. Hence pfj l = Lu, C;u A: o u, R ,j = Lu A: (c u ru )ij
=
Lu A:( A u );j •
By Problem 8.26, there are scalars p and y such that r1 = pr0 = p ('7T 1 , . . '7T5 ) and c1 = yc0, where c0 is the column vector of 1 ' s. From r 1 c 1 = 1 follows p y = 1, and hence A 1 = c1r1 = c0r0 has rows ('7T 1, . . . , '7T5). Of course, (8.56) gives the exact rate of convergence. It is useful for numerical work; see <;INLAR, pp. 364 ff. (c) Suppose all four pij are positive. Then '7T 1 = p 2 1 /( p 21 + P 1 2 ), '7T z = p 1 2/(P zl + p12), the second eigenvalue is A = 1 - p 1 2 - p 21, and (b)
_
,
NOTES ON THE PROBLEMS
561
Note that for given n and t:, A" > 1 - t: is possible, which means that the pfj l are not yet near the rrj. I n the case of positive pij, P is always diagonalizable by C= (d)
[�
7r z
- rr I
]
,
R=
For example, take t = *, 0 < t: < t , and
t t
P=
t t
l + t:
1 - t:
[ ��
'Trz
-1
]
t t t
In this case, 0 is an eigenvalue with algebraic multiplicity multiplicity 1 .
2
and geometric
8.30. Show that an = 7r; and {3n - � n = �
2
n-1 L-
"
n(n _ 1) k = I
=0
{� t
n k=t
rr;( n - k ) ( P;(;k ) _ rr; )
) (�)
( n - k )p k = o
,
where p is as in Theorem 8.9.
8.36. The definitions give E; [ f(
Xu,) ] = P; [un < n ]f(O) + P;[un = n ]f( i
+
n)
= 1 - P;[un = n ] + P;[un = n ] { 1 - fi+n .o) = 1 - P; · · · Pi + n - 1 + P I · · · P;+n - J( Pi+n P;+n + I · · ) ,
and this goes to 1 . Since P;[r < n = un] > P;({ r < n ] n ( Xk > O, k > 1]) --+ 1 [,.0 > 0, there is an n of the kind required in the last part of the problem. And now E;[ f( XT)] < P;[ T < n = u, ]f( i + n ) + 1 - P; [ T < n = un ] =
1 - P;[ T < n = un ]f; + n .o·
8.37. If i > 1, n 1 < n 2, (i, . . . , i + n 1 ) E /n , and (i . . . . , i + n 2) E /n2 , then P;(r = n1, T = n2] > P,.( Xk = i + k, k < n 2 ] > 0, �hich is impossible. Section 9 9.3. See
BAHADUR.
9.7. Because of Theorem 9.6 there are for P( Mn > a ] bounds of the same order as the ones for P(Sn > a ] used in the proof of (9-.36).
562
NOTES ON THE PROBLEMS
Section 10 10.7. Let JLJ be counting measure on the a--field of all subsets of a countably infinite n, let JL z = 2JL 1 , and let 9J consist of the cofinite sets. Granted the existence of Lebesgue measure A on @ 1 , one can construct another example: let JL 1 = A and JL z = 2A, and let 9J consist of t he half-infinite intervals ( - oo, x ]. There are similar examples with a field .9Q in place of f?J. Let n consist of the rationals in (0, 1 ], let JL1 be co unting measure, let JL z = 2JL 1 , and let .9Q consist of finite disjoint unions of " intervals" [ r E !l: a < r < b ]. Section l l 11.4. (b) If ( f, g ] c U k ( fk , gd, then ( f( w ), g(w)] c U k (fk(w), fk( w )] for all w, and Theorem 1 .3 gives g(w) - f(w ) < 'Lk(gk(w) - fk(w)). If h m = (g - f - Lk s n (gk - fk )) V 0, then h, l 0 and g - f < LH ,,( gk - fk ) + h n. The positivity and continuity of A now give v0(f, g ] < '[k v0(fk , g k ]. A similar, easier argument shows that Lk v0(fk , g d < v0(j, g ] if ( fk , gd are disjoint subsets of (f, g ]. 11.5. (b) From ( 1 1 .7) it follows t hat [f > 1 ] E .9Q for f in ..£' Since ..£' is linear, [ f > x ] and [ f < -x] are in .9Q for f E ...i' and x > 0. Since the sets (x, oo) and ( - x) for x > 0 generate .9f' 1 , each j in ..£' is measurable a-( .9ij). Hence .r a-(.9Q). It is easy to show that .9Q is a semiring and is in fa�t closed under the formation of proper differences. It can happen that n � .90-for example, in the case where n = {1, 2} and../ consists of the f with f(l ) = 0. See Jurgen Kindler: A simple proof of the Danieli-Stone representation theorem. Amer. Ma th . Monthly, 90 (1983), 396-397_) - co,
Section 12 12.4. (a) If en = e,.. , then en - m = 0 and n = m because e is irrational. Split G into finitely many intervals of length less than t:; one of them must contain points 62n and (}2m with 6z n < 6zm • If k = m - n , then 0 < (}2m - 6zn = (}2m e 6z n (J2k < €, and the points 6zkl for 1 < l < [62/ J form a chain in which the distance from each to the next is less than t: , the first is to the left of t:, and the last is to the right of 1 - £. (c) If S 1 e Sz = 6zk + 1 G IJ 2n Ell 6 zn lies in the subgroup, then s 1 = Sz and (J2 H 1 = =
2
I
(J 2( n , - n2 )
'
12.5. (a) The S Ell em are disjoint, and (2n + l)u + k = (2 n + l)u' + k' with lkl, lk'l < n is impossible if u -=1= u'. ( b) The A Ell (J• are disjoint, contained in G, and have the same Lebesgue measure. 12.6. See Example 2.10 (which applies to any finite measure). 12.8. By Theorem 12.3 and Problem 2.19(b), A contains two disjoint compact sets of arbitrarily small positive measure. Construct inductively compact sets Ku , . . . u, (each U ; is 0 or 1) such that 0 < JL(Ku , u ) < 3 - n and Ku , u ,. O and Ku , u ,.l are disjoint subsets of Ku u . Take K = n n U u u Ku u . The Cantor set is a special case. ,
I
n
l
u
i
,
NOTES ON THE PROBLEMS
563
Section 13 13.3. If f = L;X;IA , and A ; E T- 1 Y', take A'; in Y' so that A ; = T- 1A';. and set 1 1p = L.;x,.IA' · For the general f measurable T - :T', there exist simple functions fn, measurable T- 1 :T', such that fn(w) --+ f(w) for each w. Choose 'Pn• measurable Y', so that fn = 'PnT. Let C' be the set of w' for which 'Pn(w') has a finite limit, and define ip(w') = limn ifJn(w') fo r w' E C' and �p(w') = 0 for w' � C'. Theorem 20.1(ii) is a special case. 13.7. The class of Borel functions contains the continuous functions and is closed under pointwise passages to the limit and hence contains ff. By imitating the proof of the rr-A theorem, show that, if f an d g lie in f?l:, then so do f + g, fg, f- g, f V g (note that, for example, [ g : f + g E ff] is closed under passages to the limit). If fn(x) is 1 or 1 - n(x - a) or 0 as x < a or a < x < a + n - 1 or a + n - 1 < x, then fn is continuous and fn(x) --+ 1< - oo. al x). Show that [ A: lA E ff] is a A -system. Conclude that g; contains all indicators of Borel sets, all simple Borel functions, all Borel functions. 13.13. Let B = { b p . . . , bk }, where k < n, E; = C - b;- 1A, and E = U 7= 1 E;. Then E = C - U 7= 1 b,:- 1A. Since JL is invariant under rotation&, JL( E;) = 1 - JL(A) < n - I ' and hence JL(E) < 1. Thaefore c - E = n ;= I b;- 1A is nonempty. Use any (J in C - E.
Section 14 14.3. (b ) Since u < F(x) is equivalent to �p(u) < x, it follows that u < F(:p(u)). And since F(x) < u is equivalent to x < �p(u ), it follows further t hat F(�p(u) - € ) < u for positive €. 14.4. (a) If 0 < u < v < 1, then P[u < F( X ) < v, X E C] = P[ �p(u) < X < �p(u ), X E C]. If �p(u) E C, this is at most P[ �p( u ) < X < �p(u )] = F(�p(u) - ) - F(�p(u) - ) = F(�p(u) - ) - F(�p(u)) < u - u; if �p(u) � C, it is at most P[ �p( u ) < X < �p( u )] = F(�p(u) - ) - F(�p(u)) < u - u. Thus P[ F( X) E [u, v), X E C] < A[u, u) if 0 < u < u < 1. This is true also for u = 0 (let u ! C and note that P[F( X) = 0] = 0) and fm u = 1 (let u i 1), The finite disjoint unions of intervals [u, u ) in [0, 1) form a field there, and by addition P[F(X) E A , X E C] < A ( A ) for A in this field. By the monotone class theorem, the inequality holds for all Borel sets in [0, 1). Since P[ F(X) = 1, X E C] = 0, this holds also for A = { 1}. 14.5. The sufficiency is easy. To prove necessity, choose continuity points X; of F in such a way that x0 < x 1 < < x k , F(x0) < £ , F(x k ) > 1 - £, and X; - x;_ 1 < € . If n exceeds some n0, IF(x;) - Fn(x;)l < £/2 for all i. Suppose that X;_ 1 < x < x;. Then F,,(x) < Fn(x;) < F(x;) + Ej2 < F(x + E) + Ej2. Establish a simi lar inequality going the other direction, and give special arguments for the cases x < x0 and x > x k . ·
·
·
Section 15 15.1. Suppose there is an .%partition s uch that I;[supA JlJL( A;) < oo, Then a ; = supA · f < oo for i in the set I of indices for which JL(A ; ) > 0. If a = ' max1 a ;, then JL[f > a ] = I1JL ( A ; n [ f > a ]) < I1JL ( A ; n [ f > a ; ]) = 0. And A; n [ f > 0] = 0 for i outside the set 1 of indices for which JL( A; ) < oo, so that JL[f > 0] = IJL( A , n [/> O]) < I1JL( A;) < oo,
564
NOTES ON THE PROBLEMS
15.4. Let W, ;r-, JL + ) be the completion (Problems 3.10 and 10.5) of W, 5', JL). If g is measurable 5', [f* g] cA, A E 5', JL(A) = 0, and HE .!JR 1 , then [fE H] = (Ac n [f E H]) U (A n [fE H]) = (Ac n [g E H]) U (A n [fE H]) lies in /r", and hence f is measurable !r". (a) Since f is measurable !r", it will be enough to prove that for each (finite) /r"-partition {B) there is an .9=partition {A;} such that I ,[infA , f]JL(A;) > I} inf8I]JL + (B), and to prove the dual relation for t he upper sums. Choose (Problem 3.2) Y-sets A; so that A; c B; and JL(A;) = JL * (B;) = JL + (B;). For the partition consisting of the A; together with ( U ;A;Y, the lower sum is at least I;[inf8'I ]JL(A;) I;[ inf 8'I ]JL + (B;). (b) Choose successively finer .9=partitions {A11;} in such a way that the corresponding upper and lower sums differ by at most ljn 3• Let gn and fn have values in fA I and sup A I on An;· Use Markov ' s inequality-since JL(fi) is finite, it may as well be l-to show that JL[fn - gn > ljn] < ljn 2 , and then use the first Borei-Cantelli lemma to show that fn - gn -+ 0 almost everywhere. Take g =- limn gn. =
Section 16
16.4. (a) By Fatou's lemma,
and
< lim inf n J (bn -fn) djL =
JbdJL - lim sup Jfn dJL . n
Therefore
16.6. For
w E A and small enough complex h, I f( w , z0 + h ) -f( w, z0 ) I
16.8. Use the fact that fA i fl dJL
=
fzu+hf, (w, z) dz < lhlg(w, z0). zo
< aJL( A) + fr1r1
�
a1 1!1 dJL.
NOTES ON THE PROBLEMS
565
JL( A) < o im �lies fA Ifni dJL < t: for all n , and if a - I sup nf If) d JL < o, then JL[Ifnl > a] < a- f Ifn i d JL < o and hence fut a]lfnl d JL < € for all n. For the
16.9. If
.. i
;;,
reverse implication adapt the argument in the preceding note.
16.10. (b) Suppose that fn are nonnegative and satisfy condition (ii) and JL is nonatomic_ Choose o so that JL(A) < o implies fA fn dJL < 1 for all n . If JL[ fn = oo] > 0, there is an A such that A c [ fn = oo] and 0 < JL( A) < o; but then fA fn dJL = oo_ Since JL[f,. = oo] = 0, t here is an a such that JL[fn > a] < o < JL[fn > a]. Choose B c [fn = a] in such a way that A = [fn > a] U B satisfies JL( A) = o. Then a o = a JL( A) < fA fn dJL < 1 and ffn dJL < 1 + aJL(AC) < 1 + 0 - IJL (D,)_ 16.12. (b) Suppose that f E ..£' and f > 0. If fn = (1 - n - I )f V 0, then fn E ..£' and fn i f, so that v(fn, f] = A(f- fn) t 0. Since v(f1 , f] < oo, it follows that v[(CL• , I): f(w) = I ] = 0. The disjoint union
increases to
B, where B c (O, f] and (0, f] - B c [(w, 1 ):
((Ct.•)
= I ]. Therefore
Section 17 17.1. (a) Let A . be t he set of x such that for every o t here are points y and z satisfying I y - xi < o, lz - xl < o, and lf(y) -f(z )I > E. Show that A. is closed and Dr is the unior. of the A (c) Given t: and TJ , choose a partition into intervals I1 for which the corre sponding upper and lower sums differ by at most O J . By considering those i1 whose interiors meet A., show that O J > t: A(A.). (d) Let M bound 1!1 and, giver: t:, find an open G such that D c G and 1 A(G) < t:/M. Take C = [0, 1] - G and show by compactness that there is a o such that lf(y) -f(x )I < F if x (but perhaps not y) li es in C and ly - xl < o. If [0, 1] is decomposed into intervals I; with A( l1 ) < o, and if x1 E I1, let g be the function with valu e f(x1) on I1• Let r: denote summation over those i for which I1 meets C, and let '[" deuote summation over the other i. Show that •.
ft( x) dx - L f( x ; )A (l; ) < f it( x ) - g ( x )i dx
< [' 2 EA( l1 ) + [" 2MA( l1 ) < 4t: .
17.10. (c) Do not overlook the possibility t hat points in (0, 1) - K converge to a point in K. 17.11. (b) Apply the bounded convergence theorem to fn(x) (1 - n dist(x , [s, I]))+. (c) The class of Borel sets B in [ u , v] for which f = I8 satisfies (17.8) is a A-system. =
566
NOTES ON THE PROBLEMS
(e) Choose simple fn such that 0 < fn i f To 07.8) for f = fn, apply the monoton e convergence theorenl on the right and the dominated convergence theorem on the left. is the distance from x to [a, b ], then fn = (1 - ng) V 0 l Ir a. b 1 and fn E ...t"'; since the continuous functions are measurable !JR 1 , it follows that sr !JR' . If fn(x) ! O for each X, t hen the compact sets [x: fn(x) > € ] decrease to 0 and hence one of them is 0; thus the convergence is uniform.
17.12. If
g(x)
17.13. The linearity and positivity of A are certainly elementary facts, and for the continuity property, note that if 0 < f < t and f 'vanishes outside [a, b], then elem entary cons:derations show that 0 < A(f) < E(b - a). Section 18 18.2. First, .9l'x :Jt' is generated by the sets ot the forms {x} X X and X X {x}. If the diagonal E lies in 8rx [J[', then there must be a countable S in X such that E lies in the u-field .:T generated by the sets of these two forms for x in S. If .9 consists of sc and the singletons in S, then !F is the class of unions of sets in t he partition [ P1 X P2 : P1 , P2 E .9"]. But E E !F is impossible. 18.3. Consider A X B, where A consists of completion of !JR 1 with respect to .\ ,
a
singl e point and B lies outside the
f = p - I log p, and put fn = 0 if n is not a prime In the notation of (18.175, F(x) = log x + �p(x), where ip is bounded because of (5.51). If G(x) = - 1 flog x, then
18.17. Put
.!_
(x F( t ) dt F( x ) L P = log x + }2 t log z t p :s, x 'P( X )
= 1 + log x
00 tp { l ) dt f oo �p( t ) dt dt ! + zt . 2 t log t 2 t log z t t log x
� +
x
Section 19 19.3. See BANACH, p. 34. 19.4. (a) Take f = 0 and [,, = I(O, I /n )' (b ) Take f = 0, and let Un> be an infinite orthonormal set. u�e the fact that
Ln< fm g)Z < ll g ll 2• 19.5. Take fn ni<0, 1 /n)• andoo suppose that fn
converges weakly to som e f in L1 . Integrate against the L -functions sgn / I< • I ) and conclude that f = 0 almost everywhere; now integrate against the functlon identically 1 and get a contra diction. =
Section 20 20.4. Suppose U1 , . . . , Uk are independent and uniformly distributed over the unit interval, put V; = 2nll; - n, and let 1-Ln be (2n) k times the distribution of
NOTES ON THE PROBLEMS
567
X ( - n, n ], and if W1 , . . . , Vk ). Then JLn is supported by Qn = ( -n, n] X l = (a 1 , b d X X (a k , b k ] c Qn, then JLn(l) = n�= 1(b; - a). Further, if A C Qn C Qm (n < m), then JLn( A) = JLm(A). Define A k (A) = limn JLn(A II Qn). ·
·
-
-
·
·
20.7. By the argument preceding (8. 16),
P.[TI = n I I
' · · · '
Tk = n k ] = Jr.<.n • >f !"z - n ,) I)
))
·
• '
n n J <. k - k - . > ))
-
For the general initial distribution, average over i.
20.8. (a) Use t he rr-A theorem to show that P[(X1T I > . . . , x1Tn) E H] is the same for all permutations rr. (b) Us e part (a) and the fact that Yn = r if and only if r,.< n > = n. (c) If k < n , then Yk = r if and only if exactly r - 1 among the integers n 1, . . ' k - 1 precede k in the permutation r < >. n n + l> (d) Observe that r < > = (t 1 , . . . , tn) and Yn + : = r if and oniy if r< = (t 1 , . .n. , t,_ " n + 1, t" . . . , tn), and conclude that u(Yn + l) is independent of u(T < l) and hence of u(Y1 , , Y,, )- see Probkm 20.6. _ . _
20.12. If X and Y are independent, then
P [ I( X + y ) - ( X + y ) I < €] > p [ I X - X I < ±t:] p [ I y - y I < ±t:] and
P[ X + Y = x + y ] > P[ X = x ] P[ Y = y ] . 20.14. The partial-fraction expansion gives
uu 1 c u ( y - x)c1 ( x) = 2 R ( A + B + C + D), 7r
2y ( y - x ) B= 2 ' u + ( y - x) 2 2yx D= 2 2. u +x After the fact t his can of course be checked mechanically. Integrate over [ - t, t] and let t --+ oo: f '_, Ddx = 0, J '_, Bdx -+ 0, and f�,,lA + C) dx = (y 2 + u 2 - u 2 )u - 1 rr + (y 2 - v 2 + u 2 )v - 1 rr = u - 1 v - 1 rr 2Rc u+ 1 (y). There is a very simple proof by characteristic functions; see Problem 26.9_ and a 20.16. See Example 20.1 for the case n = 1, prove by inducti�e convolution n x l 2 2 f change of variable that the density must have the form Knx < f >- e - , and t hen from t he fact that the density must integrate to 1 deduce the form of Kn.
20.17. Show by (20.38) and a change of variable that the left side of (20.48) is some constant times the right side; then show that the constant must be 1.
568
NOTES ON THE PROBLEMS
20.20. (a) Given € choose M so that P[IXI > M] < € and P[ IYI > M] < t:, and t hen choose o so that l x l, l yl < M, lx - x'l < o, and l y - y'l < o imply t hat l f(x', y') - f(x, y)l < t: . Note that P[ l f( X Yn) - f(X, Y)l > t:] < 2t: + P[IXn m
X I > o] + P[ I Yn - Y I > o ].
20.23. Take, for example, independent Xn assuming the values 0 and n with proba bilities 1 - n - I and n - I . Estimate the probability that Xk = k for some k in the range n/2 < k < n. 20.24. (b) For each m split A into 2 m sets Am k of probability P( A)/2 M. Arrange all the Am k in one infin ,te sequence, and let X11 be the indicator of the n th �et in it 20.27. To get the distribution of <1>, show by integration that for 0 < cf> < 2 rr, the intersection with the unit ball of the (x 1 , x 2 , x 3 )-set where 0 < x 3 < (x[ + x�) 1 1 2 tan cf> has volu m e trr sin cf>. Section 21 21.5. Consider 'LIA . A random variable is finit e with probability 1 if (but not only if) it is integrable. 21.6. Calculate J';xdF(x)
=
/o/cf dy dF(x) = f0J;' dF(x) dy.
21 .8. (a) Write E[Y - X] = fx < Y fX < t 5. Y dtdP - ! Y < Xf Y < t 5. x dtdP. 21.10. (a) The most important dependent uncorrelated random variables are the trigonometric functions-the random variables sin 2rrnw and cos 2rrnw on the unit interval with Lebesgue measure. See Problem 19.8. 21.13. Use Fubini 's theorem; s ee (20.29) and (20.30).
+
21.14. Even if X = - Y is not integrable, X + Y= O is_ Since IYI < I x l + l x YI, E[IYI] = oo implies that E[ l x + Yl] = oo for each x; use Problem 21. 13. See also th e lemma in Section 28. 21.21. Use (21.12). Section 22 22.2. For sufficiency, use E[LI x�c>l] = L: E[ I x�c>l]. 22.8. (a) Put U = L k I1k 5. T1Xt and V = L k Il k 5. T1X;; , so that ST U - V. Since [ T > k ] n - [ T < k - 1] lies in a-( XI , . . ' xk - 1 ), it follows that E[ ll T � k ] Xtl = E[ I1 T � k J ]E[ XtJ = P[T > k ]E[ Xi ]. Hence E[U] = L'k= 1 E[ Xi ]P[T > k] = £[ Xi]E[ T ]. Treat V the sam e way. (b) To prove E[T] < oo, show that P[T > (a + b)n ] < (l - p a + b) n. By (7 . 7), ST is b with probability (1 - p a )/(1 - p a + b) and -a with t he opposite probability. Since E[ X1 ] = p - q, =
-
=
a 1 - pa a + b E[T] = q -p - q - p 1 p a +h ' _
q
p=p
*- L
NOTES ON THE PROBLEMS
569 n
22.1 1. For each e, [ne;x,(e i6z) has the sam e probabilistic behavior as the original series, because t he xn + ne reduced modulo 2rr are independent and uni formly distributed. Therefore, the rotation idea in the proof of Theorem 22.9 carries ove r. See KAHANE for further results_ 22.14. (b) Let A = r I B and suppose p is a period of f- Let m = lx /P J and n = l1jp]. By periodicity, P(A n [ y, y + p ]) is the sam e for all y ; there fore, I P(A n [0, x]) - mP(A n [0, p ])I < p, I P(A) - nP(A n [0, p ])I < p, and I P(A n [0, x ]) - P(A)xl < 2p + l x - mjnl < 3p. Since p can be taken arbitrar ily small, P( A n [0, x]) = P(A)x. 22.15. (a) By the inequalities L ( s) < M(s) and (22.24),
Bd2s) = l A 3L(2s) < 3M(2s) < 380(s ) . For the other inequality, note first t hat T(s) i s nonincreasing and that R(2s) < 2 L(s) and T(s) < L(s) < M(s). If 8E(s) < 1/3, then
T(3s) T ( 6s ) M(3s) Bo ( 6 s) < 1 R( 6 < -----=-=-=---"---,- < --'---'-:--,s) 1 2 L ( 3 s ) -1 2 M ( 3 s ) -
On the other hand, if BE (s) > 1j3, then B0(6s) < 1 < 3B£(s). In either case , B0(6s) < 3BE(s).
Section 23 23.3. Note that A, cannot exceed t. If 0 < u < t and P[ Nt + t - Nr - u = 0] = e -aue - a•
u >
0, then P[ A, > u, 81 > v ] =
·
23.4. (a) Use (20.37) and the distributions of A1 and 81 • (b) A long interarrival interval has a better chance of covering one does_ 23.6. The probability that Ns
n + J..
t
t han a short
- N:S" = j is
"' ( f3 )1 a k a kf3j ( J' + k - 1) ! -I dx { e -(3 x x k --'ax e - k +j--: j ! (k - 1) ! j! f( k ) x Jo (a + {3-)� _
.
23.8. Let M1 be the given process and put �p(t ) = £[ M1 ]_ Since there are no fixed discontinuities, �p(t ) is continuous_ Let r/J(u) = intl t : u < �p(t)], and show that Nu = MI/J( u ) is an ordinary Poisson process and M, = N"'
--+ oo
in
s
s N, + I t << -::-:--' N, - N, - N, + 1 N,
-
----,-
N,
+1
-'-=--=--
N,
NOTES ON THE PROBLEMS
570
23.11. Restrict t in Problem 23. 10 to integers. The waiting times are the Zn of Problem 20.7, and account must be taken of the fdct that the distribution of Z 1 may differ from that of the other zn. Section 25 25.1. (e) Let G be an open set that contains the rationals and satisfies A( G) < t. For k = 0, 1, . n - 1 , construct a triangle whose base contains kjn and is contained in G: make these bases so narrow that they do not overlap, and adjust th e heights of the triangle s so that each has area 1 /n. For the nth density, piece together these triangular functions, and for the limit density, use t he function identically 1 over the unit interval. _ . ,
25.2. By Problem 14.8 it suffices to prove that Fn( , w ) F with probability 1, and for this it is eno ugh that Fn(x, w ) -+ F(x) with probability 1 for each rational x. =
25.3. (b) It can be shown, for example , that (25.14) holds for xn n !. See Persi Diaconis: The distribution of leading digits and uniform distributio!l mod 1, Ann . Prob., 5 (1977), 72 - 81. (c) The first significant digits of numbers drawn at random fro m empirical compilations such as almanacs and engineering handbooks seem approximately to follow the limiting distribution in (25.15) rather than the uniform distribu tion over 1, 2, . . . , 9. This is sometimes called Benford's law. One explanation is that the distribution of the observation X and hence of log 1 0 X will be spread over a large interval; if log 1 0 X has a reasonably smooth density, it then seems plausible that {log 1 0 X} should be approximately uniformly distributed. See FELLER, Volume 2, p. 62. =
25.9. Use Scheffe's theorem. 25.10. Put fn(x) = P[ Xn = Yn + kon]o; ' for y, + kon < x < yn + (k + l)on. Construct random variables Yn with densities fm and first prove Yn X. Show that Zn = 'Yn + [(Yn - 'Yn)!on]o, has the distribt.tion of Xn and that Yn - Zn 0. =
=
25.11. For a proof of (25.16) see FELLER, Volume 1, Chapter 7.
25.13. (b) Follow the proof of Theorem 25.8, but approximate l< x. Y l instead of /
( - x ]· oo,
25.20. Let Xn assume the values n and 0 with probabilities Pn = 1 j(n log n) and 1 - Pn · Section 26 26.1. ( b) Let p., be the distributionx of X. If l�p(t)l 1 and t -=fo 0, then �p(t) = e ira for some a, and 0 = J�J.1 - e ir ( - a> )p.,( dx ) = f':. oo(l - cos t(x - a))p.,(dx). Since the integral vanishes, p., must confine its mass to t he points where the nonnegative integrand vanishes, namely to the points x for which t(x - a) = 2Trn for some integer n. (c) The mass of p., concentrates at points of the form a + 2Trnjt and also at points of the form a' + 2Trnjt'. If p., is positive at two distinct points, it follows that t jt' is rational. =
NOTES ON THE PROBLEMS
571
26.3. (a) Let f0( x ) = 7T - I x - 2 ( 1 - cos x ) be the density corresponding to �p0(t). If Pk = (sk - sk + 1 )t k> then Lk. 1 Pk = 1; since 'Lk. 1 Pk �p0(t jt k ) = �p(t ) (check the points t = ti), �p(t ) is t he characteristic function of the continuous density Lk. 1 Pk tdo (t k , x). (b) If lim �p(t) = 0, approximate 1p by functions of the kind in part (a), pass to the limit, and use the first corollary to the continuity theorem. If 1p does not vanish at infinity, mix in a unit mass at 0. �
=
=
1 _, oo
26.12. On the right in (26.30) replace �p(t) by the integral defining it and apply Fubini's theorem; the integral average comes to p.,
{ } a
+
sin T( x - a ) ( dx Jx * a T( X - a ) p., ) · r
Now use the bounded convergence theorem.
26.15. (a) Use (26.40) to prove that l�pn(t + h) - ifin(t)l < 2p.,n( -a. aY + alhl. (b) Use part (a). 26.17. (a) Use the second corollary to the continuity theorem. 26.19. For the Weierstrass approximation theorem, see Ruo i N 1 , Theorem 7.32. =
26.22. (a) If a n goes to 0 along a subsequence, then lr/J(t)l 1; use part (c) of Problem 26.1. (c) Suppose two subsequences of {an} converge to a0 and a, where 0 < a0 < a; k put e = a0ja and show that l�p(t)l l�p((J t)l. (d) Observe that =
26.25. First do the nonnegative case; then note that if f and g have the same coefficients, so do r + g - and g + +r Section 27
27.8. By the same reasoning as in E"ample 27.3, (R n - log n)j ylog n
=
N.
27.9. The Lindeberg theorem applies: (Sn - n2 j4)j ..jn3 j36 N. 27.11. Let Yn be Xn or 0 according as IXnl < n 1 f2 log n or not. Show that Xn = Yn for large n, with probability 1, and that Lyapounov's theorem (lJ 1) applies to the Yn. =
=
27.12. For example, let the distribution of Xn be the mixture, with weights 1 - n - 2 and n -2 , of the standard normal and Cauchy distributions.
NOTES ON THE PROBLEMS
572
27.17. For another approach to large-deviation theory, see Mark Pinsky: An elemen tary derivation of Khintchine's estimate for large deviations, Proc. Amer. Math. Soc., 22 (1969), 288 - 290. 27.19. (a) Everything comes from (4.7). If u(lk +n , lk +n+ l • · " ), then
A = [(I I > . . . , lk ) E H]
and
BE
I P( A () B) - P ( A )P(B) i < [ i P ( [lu = i u , < k ] n B ) - P[lu = i", u < k ] P ( B ) i , u
where the sum extends -over the k-tuples ( i 1 , , ik) of nonnegative integers in H. The summand vanishes if u + iu < k + n for u < k, the remaining terms add to at mosl n:: � = I P[ l > k + n - u ] < 4/2 n (b) To show that u 2 = 6 (see (27.20)), show that /1 has mean 1 and variance 2 and that •
•
•
ll
if i <. n, if i > Jl . Section 28
28.2. (b) Pass to a subsequence along which p.,11(R 1 ) --+ oo, choose €11 so that it decreases to 0 and €11p.,11(R 1 ) --+ oo, and choose X11 so that it increases to oo and P.,11( -X11, x,) > tp.,11( R 1 ); consider the f that satisfies f( +x11) = £11 for all n and is defined by linear interpolation in between these points. 28.4. (a) If all functions (28. 12) are characteristic functions, they are all certainly infinitely divisible. Since (28. 12) is continuous at 0, it need only be exhibited as a limit of characteristic functions. If p.,, has density lr n. n P + x 2 ) with respect to v, then
is a characteristic function and converges to (28. 12). It can also be shown that every infinitely divisible distribution (no moments required) has characteristic function of the form (28. 12); see GN EDENKO & KoLMOGOROV, p. 76. (b) Use (see Problem 18. 19) - ltl = 71' - I/�"'(cos tx - l)x- 2 dx.
28.14. If X1 , X2 , . . . are independent and have distribution function F, then (X1 + · · · + X11)jf,i also has distribution function F. Apply t he central limit theo rem. 28.15. The characteristic function of Zn is
c exp - }: n k
1
( l k l jn )
1 +a
( e tk f i
"
it x - 1 e dx - 1 ) --+ exp c j + l a - oo lxl a! "' 1 - COS X dx = exp -clt l l+ - oo lxl a "'
[
].
NOTES ON THE PROBLEMS
573
Section 29 29. 1. (a) If f is lower semicontinuous, [x: f(x) > t] is open. If f is positive , which is no restriction , then Jf d/L = Jr:P-[ f > t] dt < f� lim infn /L n[ f > t] dt < lim infn foiL ,[ f > t] dt = lim inf, !fd /L n · (b) If G is open, then lc is lower semicontin uous. 29.7. Let � be the covariance matrix. Let M be an orthogonal matrix such that the entries of M�M' are 0 except for the first r diagonal entries, which are 1. If Y = MX, then Y has covariance matrix M�M', and so Y= (Y1 , • • :-, Y,. , 0, . . . , 0), where Y1 , • , Y, are independent and have t he standard normal distribution. But IXI 2 = I:7= 1 Y/. 29.8. By Theo rem 29.5, X11 has asymptotically lhe centered normal distribution with covariances u;r Put x = ( p lf2 , . . . , Plf2 ) and show that �x ' = 0, so lhat 0 !s an eigenvalue of �. Show that �y' = y' if y is perpendicular to x, so that � has 1 as an eigenvalue of multiplicity k - 1. Use Problem 29.7 together with Theo rem 29.2 (h(x) = ixl 2 ).
29.9. (a) Note that n - I I:j'= 1 Y,� 1 and thal 1 tion as ( Y,1 1 , • • • , Y,, )/(n- [ 7= 1¥,� )112 • =
( X,1, . . . , X,,) has the same distribu
Section 30 30.1. Rescale so that s� = 1, and put L,(f.) = I:k J1x.,k1 ..,_ ,x;k dP. Choose increasing nu so that L,(u - 1 ) < u -3 for n > nu, and put M, = u- 1 for n 11 < n < nu + l· Then M, --+ 0 and� L,(M,) < M;. Put Y,k = x,k lux. . � :5 M., ]' S?ow that Lk E[ Y,k ] --+ 0 and Lk E[ ¥,7, ] --+ 1, and apply to I:k Yn k the central hmit theo rem under (30.5). Show that Lk P[ X,k * Y, d --+ 0. .
30.4. Suppose that the moment generating function M, of /L, converges to t he moment generating function M of It in som e interval about s. Let v, have density e sx/M,( s ) with respect to J..L , , 3nd let v have density e sx/M(s) wi th respect to IL · Then the moment generating function of v, converges to that of v in some mterval about 0, and hence v, = v. Show that f ":.. oof(x)P-,(dx ) --+ J":_J(x )P-(dx) if f is continuous and has bounded support; see Problem 25.13(b). 30.5. (a) By Ho lder's inequ ality II:j= I tjx I' < k r - I r:: j= l ltixl, and so I:,e'J II:itixl/L(dx)/ r! has positive radius of convergence. Now
where the summation extends over k-tuples that add to r. Project /L to the line by the mapping [itixi, apply Theorem 30. 1, and use the fact that /L is determined by its values on half-spaces.
30.6. Use the Cramer -Wold idea.
NOTES ON THE PROBLEMS
574
30.8.
=
Suppose that k 2 in (30.30). Then =
( ( f � + M[ 2 e -iA1x
e iA1x
= 2 - rl -rz
e iAzx
e -iAzx
fl
E E ( r ' )(�]z2 )M (exp i(A 1(2j1 -r d +A z{ 2j2 - r2 }) x ] . 0
i1 = J z = O
1!
i s 1 i f By2j1 -r1(26.33)2j2-r here and the2 i0n dependence , the l a st mean of A1 and A 2 and i s 0 otherwi s e. si m i l a r cal c ul a ti o n for k 1 gives generalFork (30.gives3 1)(30.use30).theThemulacttiduimalenSiformona!of (30the 28),distriandbut!aosin minil(30.ar cal29)culisaunitionmforportant. method of moments (Probl e m 30. 6 ) and the mappi n g t h eorem. For (30. 3 2) use the central limit theorem; by (30.28). X1 has mean 0 and variance 2 1 1 n < t h e i n equal i t y i n (30. 3 3) hol d s, l o g Iflog lno1g12n<- E(l
=
=
A
t·
30.10.
m
tz
€
Section 31 31. 1.
a
the argument i n Exampl e 31. 1 . Suppose that F has nonzero Consi d er expansi o ns deri v at i v e at x, and l e t be the set of numbers whose baser anal o gue of (31. 1 6) i s P[X E agree]/iP[X n the first] nr pl1a, cesandwitheth ratithato ofherex. iThe 1 for s one of p0, ... ,p,. lf P;-=l=r 1 use the second Borel some Cant e l i l e mma to show that the rati o i s i, iargument nfinitely oftis eunnecessary n except onifa rset of Lebesgue measure 0. (This last part of the anal o gue of The argument i n Exampl e 31. 3 needs no essenti a l change. The (3 1.17) is -r -<x<- i +r 1 ' OA( 1 -£, 1 lf(kjn) -f((k - Ojn )l 2-2£. 1,
1, +
E I,
-
--+
=
P;
2. )
I
31.3. 31.9.
=
=
A
A,
n A ),
C,
E.
E )A(C, )
31.11.
=
= lg - I H11
A( C. ) =
>
€
�
C. )
=
€.
I:'/, =
E
N.
�
>
A
C. ) :5
-
A.
a =
=
A
A, A,) > A,
=
E
E
A, )
NOTES ON THE PROBLEMS
575
2 it /3")• 31 15 noon � l (.!.2 + .!.e 2 •
•
31. 18. For x fixed, let u11 and V11 be the pair of successive dyadic rationals of order n (un - U 11 = 2 - ") for which U11 < x < V11 • Show that
k =0
=
n-1
a k- ( x ), L k=O
where a;; is the left-hanJi derivative. Since af:(x) = + Liar all x and k, the difference ratio cannot have a finite limit
31 .22. Let A be the x-set where (31.35) fails if f is replaced by f�p; then A has Lebesgue measure 0. Let G be the union of all open sets of tt·measure 0; represent G as a countable disjoint union of open intervals, and let B be G together with any endpoints of zero wrneasure of these intervals. Let D be t he set of discontinuity points of F. If F(x) r;t: A, x ff: B, and x l;t: D, then F(x - h) < F(x) < F(x + h ), Hx ± h) -+ F(x), and
Now x - € < �p(F(x)) < x follows from F(x - t:) < F(x), and hence �p(F(x)) = x. If A is Lebesgue measure restricted to (0, 1), then tt = A 1p - I , and (31.36) follows by change of variable. But (36.36) is easy if x E D, and hence it holds outside B (De n F - 1A). But tt ( B ) = 0 by construction and tt(D' n F- 1A) = 0 by Problem 14.4.
U
Section 32
n n I' S 32 . 7. Define Jt11 an d V11 as ·I n (32. 7), an d Write V - V3(0 ) + v5( ), Where V3(r.) c r l is singular with respect to absolutely con tinuous with respect to tt n and v5" (11) '\' '\' (II) lt n · Ta ke Vac - L.n Vac an d Vs - L. nVs ' Suppose that V3c(E) + v5(E) = v�c(E) + v�(E) for all E in :F. Choose an S in :T that supports v5 and v� and satisfies tt(S J = 0. Then v&c(E) ·
_
=
Vac(E () sc) = vac( E () sc) + J's(E () sc) = v�c(E () sc) + v�(E () sc) = v�c(E () sc) = v�c( E). A similar argument shows that v5(E) = v�(E).
32.8. (a) Show that � is closed under t he formation of countable unions, choose �-sets Bn such that Jt(B11 ) --+ sup a- tt(B) ( < oo), and take 80 = U11 B11 • (b) The same argument. (c) Suppose tt(D0) > 0. The maximality of 80 implies that 80 u D0 contains an E such that tt(E) > 0 and v(E) < oo. Since 80 n E cB0 E �, tt(B0 n E) = 0 (v(E) < oo rules out v(B0 n E) = oo). Therefore, JL(D0 n E) > 0 and v(D0 n E) < oo, which contradicts the maximality of C0. (d) Take the density to be oo on fl(j. 32.9. Define f and v5 as in (32.8), and let r and v� be t he corresponding func tion and measure for !T0 : v(E) = fEr dtt + v� (E) for E E !T0, and ther e is an g;-o -set So such that v� (fl - So) = 0 and tt(So ) = 0. If E E g;-o , it fol lows that fEr dtt = IE - S" r dtt = IE - S" r dtt0 = V0(E - so ) = v(E - S0) > IE- S" fdtt = IEfdtt ·
576
NOTES ON THE PROBLEMS
It is instructive to consider the extreme case g-o = {0, fl}, in which v0 is absolutely continuous with respect to p.,0 (provided p.,( fl) > 0) and hence v� vanishes.
Section 33 33.2. (a) To prove independence, check the covariance. Now use Example 33. 7. (b) Use the fact that R and 0 are in dependent (Example 20.2). (c) As the single event [ X = Y] = [ X - Y = 0] = [0 = 7T/4] u [ 0 = 57T/4] has probability 0, the conditional probabilities have no meaning, and strictly speaking there is nothing to resolve But whether it is natural to regard the degrees of freedom as one or as two depends on whether the 45o line through the origin is regarded as an element of the decomposition of the plane into 45° lines or whether it is regarded as the union of two elements of the decoJnposi tion of the plane into rays from the origin. Borel's paradox can be explained the same way: The equator is an element of the decornpo�ition of the sphere into lines of constant latitude; the Green wich meridian is an element of the decomposition of the sphere into great circles with common poles. The decomposition matters, which is to say the u-field matters. 33.3. (a) If the guard says, " 1 is to be exect.ted," then the conditional probability that 3 is also to be executed is 1 /0 + p). The " paradox" comes from assumin g that p must be 1, in which case the conditional probability is indeed t. But if p -=/= t, then the guard does give prisoner 3 some information. (b) Here "one" and "other" are undefined, and the problem ignores the possibility that you have been introduced to a girl. Let the sample space be a
bbo 4 , bby
1-a 4 '
f3 • bgo 4 1 - {3 bgy 4 '
'Y gbo 4 , 1 - ')' gby 4 '
0
ggo 4' 1 -8 ggy 4 .
For example, bgo is the event (probability f3 14) that the older child is a boy, the younger is a girl, and the child you have been introduced to is the older; and ggy is the event (probability (1 - o)j4) that both children are girls and the one you have been introduced to is the younger. Note that the four sex· distributions do have probability t· If the child you have been introduced to is a boy, then the conditional probability that the other child is also a boy is p = 1/(2 + f3 - y ). If f3 = 1 and y = 0 (the parents present a son if t hey have one), then p = t· If f3 = y (the parents are in different), then p = t . Any p between t and 1 is possible. This problem shows again that one must keep in mind the entire experi ment the sub,.-field .:# represents, not just one of the possible outcomes of the experiment.
33.6. There is no problem, un less the notation gives rise to the illusion that p(A ix) is P(A n [ X = x])jP[ X = x ]. 33 .15. If N is a standard normal variable, then
NOTES ON THE PROBLEMS
577
Section 34 34.3. If (X, Y ) takes the values (0, 0), (1, - 1), and (1, 1) with probability t each, then X and Y are dependent but E[ YIIX] = E[ Y] 0. If (X, Y) takes the values ( - 1, 1), (0, - 2), and (1, 1) with probability { each, t hen E[ X] = E[Y] = E[ XY ] = 0 and so E[XY] = E[ X]E[ Y], but E[ Y II X ] = Y -=1= 0 = E[ Y ]. 0{ course, this is another example of dependent but uncorre lated random variables. =
34.4. First s how t hat ffdP0 = f8 fdPjP( B) and that P[ BII&"] > 0 on measure 1 . Let G be the general set in &". (a) Since
=
a
set of
P0-
P(B) fr/cP0[ All&"] dP0 = P( B )P0(A n G)
= j P[ A n BII&"] dP, G
it follows that
holds on a set of ?-measure 1. (b) If P,(A) = P(AI B,), then
=1
P( All&"v .;y] dP.
G n B;
Therefore, fc· l8 P;[ A II&"]dP = fc /8 P[AII&"v �] dP if C = G rtB;, and of course this holds for C = G n 81 if j -=1= i. But C's of this form constitute a 'IT-system generating .§'v �. and hence /8 P;[ AII&"] = 18' P[AII&"v �] on a set of ?-measure 1. Now use t he result in part (a).
34.9. All such results can be proved by imitating the proofs for t he unconditional case or else by using Theorem 34.5 (for part (c), as generalized in Problem 34.7). For part (a), it must be shown that it is possible to take the integral measurable &". 34.10. (a) If Y =2 X - E[ XII&"d, t hen X - E[ XII&"2 ] = Y - E[YII&"2 ], and E[(Y 2 E[ YII&"2 ]) II&"2 ] = E[ Y II&"2 ] - E 2 [ Y II&"2 ] < E[ Y 2 II&"2 ]. Take expected values. 34.11. First prove that
578
NOTES ON THE PROBLEMS
From this and (i) deduce (ii). From
(ii), and the preceding equation deduce
The sets A 1 n A 2 form
a
'IT-system generating .§'1 2 .
34. 16. (a) Obviously (34. 18) implies (34. 17). If (34. 17) holds, then clearly (34. 18) holds for X simple. For the general X, choose simple Xk such that lim k Xk = X and IXk l < lXI. Note that
f XdP - afXdP An
let n -+ oo and then let k -+ oo. (b) If .!1 E .9, then the class of E satisfying (34. 17) is a A-system, and so by the 7T-A theorem and part (a), (34. 18) holds if X is measurable u(9'). Since A n E u (.9 ), it follows that
j XdP = j E[ XIIu (.9)] dP -+ ajE[XIiu (.9)] dP An
An
= a jxdP. (c) Replace X by XdP0jdP in (34 .18). 34. 17. (a) The Lindeberg-Uvy theorem. (b) Chebyshev's inequal ity. (c) Them em 25.4. (d) Independence of the Xn (e) Problem 34. 1 6(b). (f) Problem 34. 16(c). (g) Part (b) here and the t:-8 defin ition of absolute continuity. (h) Theorem 25.4 again. Section 35 35.4. (b) Let snn be the number of k such that 1 < k < n and yk = i- Then Xn = 35"j 2 . Take logarithms and use the strong law of large numbers.
579
NOTES ON THE PROBLEMS
35.9. Let K bound l X I I and the iXn - xn - I I· Bound IXT I by KT. Write IT $ k XT dP = L 7= 1 fT =i X, dP = [7= 1( fT ""- i X, dP - fT ""- l + l X,. dP). Transform the last integral by the martingale property and reduce the expression to £[X I ] - IT > k xk + I dP. Now
fT > kxk + l dP
< K ( k + 1 ) P[ r > k ] < K( k + l)k - 1 f k T dP -+ 0. T>
is a supermartingale. Si nce 35.13. (a) By the result in Problem 32.9, X1 , X2 , E[ IXnl] = E[ Xn] < v(fl), Theorem 35.5 applies. (b) If A E .9,;, then fiYn + Zn) dP + a-�( A ) = fA X., dP + a-J A) = v( A ) fA Xn dP + a-n( A). Since the Lebesgue decomposition is unique (Problem 32. 7), Yn + Zn = Xn with probability 1. Since Xn and Yn converge, so does Zn. If A � !T, and n > k, then fAZn dP < a-J A), and by Fatm:'s lemma, the limit Z satisfies fA Z dP < a-.,( A ). This holds for A l n U k !T, and hence (monotone class theorem) for A in �. Choose A so that P(A) = 1 and a-.,( A ) = 0: E[ Z] = fAZdP < a-.,( A ) = 0. It can happen that a-n(fl) = 0 aud a-J !l) = v(fl) > 0, in which case a-n does not converge to (.Too and the xn cannot be integrated to the limit. •
•
=
35.17. For a very general result, see J. L. Doob: Appl ication of the theory of martingales, Le Calcul des Probabilites et ses Applications (Colloques Interna tionaux du Centre de Ia Recherche Scientifique, Par:s, 1949). Section 36 36.5. (b) Show by part (a) and Problem 34.18 that fn is the conditional expected value of f with respect to the a--field .9';; + 1 generated by the coordinates x n + 1 , x n + z • . . . . By Theorem 35.9, (36.30) will follow if each set in n n:T, has 'IT-measure either 0 or 1, and here the zero-one law applies. (c) Show that gn is the conditional expected value of f wit h respect to the a--field generated by the coordinates x " . . . , x n , and apply Theorem 35.6. 36.7. Let ..£' be the countable set of simple functions L;a ; IA for a; rational and {A;} a finite decomposition of the unit interval into subi�tervals with rational endpoints. Suppose that the X, exist, and choose (Theorem 17. 1) Y, in ..£' so that E[I X, - Y, l] < i. From E[IXs - X, l] = t, conclude that E[ I Y, - ¥; 1] > 0 for s -=1= t. But there are only countably many of the y; . It does no good to replace Lebesgue measure by some other measure on the unit interval. t Section 37 37. 1. If t 1 , . . . , t k are in increasing order and t0 = 0, then min(i,j)
[, K( t; , tJ x; xi = [, x ; xi [, Ut - tt d 1= 1 t,} i,j
(
= [, ( It - lt - d [, x; I
i ?! I
f > 0.
580
NOTES ON THE PROBLEMS
37.4 (a) Use Problem 36.6(b). (b) Let [ W,: t > 0] be a Brownian motion on ( fl, !f. P0 ), where W( · , w) E C for every w. Define {: n --+ RT by Z,({(w)) = w;(w). Show that { is mea surable S'j!JRT and P = P0C 1 • If C c A E .9P T, then P(A) = P0(C 1A) =
Po(fl) = 1.
37.5. Consider Since
W(l) = EZ
=
nj
1(W(k jn ) - W((k - l)jn)) for notational convenience.
I W( I /�)1 ? <
( )
wz l n dP
=
j
IW( l )I
W 2 ( 1) dP --+ O,
the Lindeberg theorem applies.
37.14. By symmetry,
[
p(s, t ) = 2 P W. > 0 . inf ( Wu - W5 ) < S $; U :$, 1
W. ] ;
W, and the infimum here are independent because of t he Markov property,
and so by (20.30) (and symmetry aga in)
oo p ( s, t ) - _,- P[ Tx s t - s ]
l
o
1 e - x Z ;zs dx v2Trs r;
Reverse t he integral, use J;xe - x 2 r ;z dx
=
1 jr, and put = (s j(s + u)) 1 12 : u
1 � t -s 1 P( s, t ) = Tr o u + s 2
'J
du
s 1 12 du u l f2
. = Tr z .[0 VI - u
Bibliography
HALMos and SAKS have been the strongest measure-theoretic and Doos and FELLER the strongest probabilistic influences on this book, and the spirit of KAc's small volume has been very important. AUBREY: Brief Lives, John Aubrey; ed., 0. L. Dick. Seker and Warburg, London, 1949. BAHADUR: Some Limit Phi ladelphia, 1971.
Theorems in Statistics,
BANACH: Theorie des Operations atyczne, Warsaw, 1932. BERGER: Statistical Decision Verlag, New York, 1985.
R. R. Bahadur. SIAM,
Lineaires, S. Banach.
Theory,
Monografje Matem
2nd ed., James 0. Berger. Springer
BHATTACHARYA & WAYMIRE: Stochastic Processes with Applications, Rabi N. Bhattacharya and Edward C. Waymire. Wiley, New York, 1990.
I
, BILLINGSLEY 1 : Convergence Wiley, New York, 1968. • +
of Probability Measures,
Patrick Billingsley.
BILLINGSLEY 2 : Weak Convergence of Measures: Applications Patrick Billingsley. SIAM, Philadelphia, 1971.
�BIRKHOFF & MAc LANE:
in Probability,
A Survey of Modern Algebra,
4th ed., Garrett Birkhoff and Saunders Mac Lane. Macmillan, New York, 1977. <;INLAR: Introduction to Stochastic Processes, Erhan <;inlar. Prentice-Hall, Englewood Cliffs, New Jersey, 1975. CRAMER: Mathematical Methods of Statistics, Harald Cramer. Princeton University Press, Princeton, New Jersey, 1946. - Doos: Stochastic Processes, J. L. Doob. Wiley, New York, 1953. DuBINS & SAvAGE: How to Gamble If You Must, Lester E. Dubins and Leonard J. Savage. McGraw-Hill, New York, 1965. 581
582
BIBLIOGRAPHY
DuDLEY: Real Analysis and Probability, Richard M. Dudley. Wadsworth and Brooks, Pacific Grove, California, 1989. DYNKIN & YusHKEVICH: Markov Processes, English ed., Evgenii B. Dynkin and Aleksandr A. Yushkevich. Plenum Press, New York, 1969. FELLER: An Introduction to Probability Theory and Its Applications, Vol. I. 3rd ed., Vol. II, 2nd ed., William Feller. Wiley, New York, 1968, 1971. GALAMBos: The Asymptotic Theory of Extreme Order Statistics, Janos Galambos. Wiley, New York, 1978. GELBAUM & ()LMSTED: Counterexamples in Analysis, Bernard R. Gelbaum and John M. Olmsted. Holden-Day, San Francisco, 1964. GNEDENKO & KoLMOGORov: Limit Distributions for Sums of Independent Random Variables, English ed., B. V. Gnedenko and A. N. Kolmogorov. Addison-Wesley, Reading, Massachusetts, 1954. HALMOS 1 : Measure Theory, Paul R. Hal mos. Van Nostrand, New York,
1950.
HALMOs 2 :
1960.
Naive Set Theory,
Paul
R.
Halmo<;. Van Nostrand, Princeton,
HARDY: A Course of Pure Mathematics, 9th ed., G. H. Hardy. Macmillan, New York, 1946. HARDY & WRIGHT: An Introduction to the Theory of Numbers, 4th ed., G. H. Hardy and E. M. Wright. Clarendon, Oxford, 1959. HAuSDORFF: Set Theory, 2nd English ed., Felix Hausdorff. Chelsea, New York, 1962. JEcH: Set Theory, Thomas Jech. Academic Press, New York, 1978. KAc: Statistical Independence in Probability, Analysis and Number Theory, Carus Math. Monogr. 12, Marc Kac. Wiley, New York, 1959. KAHANE: Some Random Series of Functions, Jean-Pierre Kahane. Heath, Lexington, Massachusetts, 1968. K\PLANSKY: Set Theor; and Metn·c Spaces, Irving Kaplansky. Chelsea, New York, 1972. KARATZES & SHREVE: Brownian Motion and Stochastic Calculus, loannis Karatzis and Steven E. Shreve. Springer-Verlag, New York, 1988. KA RLIN & TAYLOR: A First Course in Stochastic Processes, 2nd ed., A Second Course in Stochastic Processes, Samuel Karlin and Howard M. Taylor. Academic Press, New York, 1975 and 1981. KOLMOGOROv: Grundbegriffe der Wahrscheinlichkeitsrechnung, Erg. Math., Vol. 2, No. 3, A. N. Kolmogorov. Springer-Verlag, Berlin, 1933. LEvY: Theorie de I' Addition des Variables ALeatoires, Paul Uvy. Gauthier Villars, Paris, 1937. PROTTER: Stochastic Integration and Differential Equations, Philip Protter. Springer-Verlag, 1990.
BI BLIOGRAPHY
583
RI ESZ & Sz.-NAG Y: Functional Analysis, English ed., Frigyes Riesz and Bela Sz.-Nagy. Unger, New York, 1955. RocKETT & Szu sz: Continued Fractions, Andrew M. Rockett and Peter Sziisz. World Scientific, Singapore, 1992. RoYDEN: Rea/ Analysis, 2nd ed., H. I. Royden. Macmillan, New York, 1968. R uDI N 1 : Principles of Mathematical Analysis, 3rd ed., Walter Rudin. McGraw-Hill, New York, 1976. R uDIN 2 : Real and Complex A nalysis, 2nd ed., Walter Rudin. McGraw-Hill, New York, 1974. SAKS: Theory of the Integral, 2nd rev. ed., Stanislaw Saks. Hafner, New York, 1 93 7. SPIVAK: Calculus on Mamfolds, Michael Spivak. W. A Benjamin, New York, 1965. WAGON: The Banach - Tarsky Paradox, Stan Wagon. Cambridge University Press, 1985.
List of Symbols
.!1, 536 2'\ 536 0, 536 (a, b ], 537 /A , 536 Ill, 1 P, 2, 22 d/w), 3 rn(w), 5 sn(w), 6 N, 8, 357 l X J, 537 sgn x, 537 A - B, 536 A c, 53 6 A t> B, 536 A c B, 536 J':set, 20 �0• 20 u(N), 21 J, 22 �. 22 (.!1, !7, P), 23 A n t A , 536 A n ! A , 536 A, 25, 43, 168 1\ , 537 v , 53 7 f(.W), 33 P/ A ), 35 D( A ), 35 �. 35 P* , 37, 47 p * , 37, 47 S'', 27, 3 1 1 9, 41 ../, 41 A* 44 ,
P(Bi A ), 5 1 lim SllPn A n• 52 lim infn A n, 52 lim n A n• 52 i.o., 53 g-, 62, 287 R I , 537 R k , 53 9 [ X = x ], 67 O"( X ), 68, 255 J.L , 73, 160, 256 E[X], 76, 273 Var[ X ], 78, 275 En[ f], 87 s/a), 93 T , 99, 1 33, 464, 508
PiJ•
S,
a; ,
7T;,
111
11 J
Ill
1 24 M(t ) 146, 278, 285 gp k , 158 .Wn .!10, 159 x k t x, 160, 537 x = [k xk , 1 60 J.L* , 1 65 .II( J.L* ), 165 A 1 , 168 A k , 171 F, 1 75, 1 77 AA F, 1 76 r- 'A', 537 sz-;sz-', 1 82 Fn = F , 1 9 1 , 327, 378 dF(x ), 228 X X Y, 23 1 !!Z" x �. 231 J.L X v , 233 ,
585
586 *,
266 Xn -> pX, 70, 268, 330 IIJ II, 249 lifllp, 241 L P, 241 J.Ln => J.L , 327, 378 Xn = X, 329, 378 xn = a, 331 JA , 538 tp( t ), 342 J.Ln -> 1 J.L, 371 F. , 414
LIST OF SYMBOLS Fac' 4 1 4 v « J.L, 422 d vfdJ.L, 423 v. , 424 v.c , 424 P[ A II&"], 428, 4 30 P[ A IIX" t E T], 433 E[ XIi&"], 445 RT, 484 !JPT, 485 w, 498
Index
Here An refers to paragraph n of the Appendix (p. 536): u . v refers to Problem o in Section u , or else to a note on it (Notes on the Problems, p. 552); the other references are to pages. Greek letters are alphabetized by their Roman equivalents (m for J.L, and �o on). Names in the bibliogr3phy are not indexed separatt"ly. Absolute continuity, 413, 422 Absolutely continuous part, 425 Absolute moment, 274 Absm bing state, 1 1 2 Adapted u-fields, 458 Additive set function, 420 Additivity: countable, 23, 161 finite, 23, 161 Admissible, 248, 252 Affine transformation, 172 Algebra, 1 9 Almost evel)where, 60 Almost surely, 60 a-mixing, 363, 29.10 Aperiodic, 125 Approximation of measure, 168 Area over the curve, 79 Area under the curve, 203 Asymptotic equipartition property, 91, 1 44 Asymptotic relative frequency, 8 Atom, 271 Autoregression, 495 Axiom of choice, 21, 45
Beppo Levi theorem, 1 6.3 Bernoulli-Laplace model of diffusion, 1 12 Bernoulli shift, 3 1 1 Bt"rnoulli trials, 75 Bernstein polynomial, 87 Betting system, 98 Binary digit, 3 Binomial distribution, 256 Blackweii-Rao theorem, 455 Bold play, 102 Hoole's inequality, 25 Borel, 9 Borel-Cantelli lemmas, 59, 60 Borel function, 183 Borel normal num ber theorem, 9 Borel paradox, 441 Borel set, 22, 158 Boundary, Al l Bounded convergence theorem, 210 Bounded variation, 415 Branching process, 461 Britannica, 552 Brownian motion, 498 Burstin's theorem, 22.1 4
Baire category, 1 10, Al 5 Baire function, 1 3.7 Banach lim its, 3.8, 19.3 Banach space, 243 Banach-Tarski paradox, 180 Bayes estimation, 475 Bayes risk, 248, 25 1 Benford law, 25.3
Canonical measure, 372 Canonical representation, 372 Cantelli ineq uality, 5.5 Cantelli theorem, 6.6 Cantor function, 31 .2, 3 1 . 1 5 Cantor set, 1 .5 Cardinal ity of u·fields, 2.12, 2.22 Cartesian product, 231
587
588 Category, A15, 1 10 Cauchy distribution, 20 1 4, 348 Cauchy equation, A20, 1 4.7 Cavalieri principle, 18.8 Central limit theorem, 291, 357, 385, 391, 34 1 7, 475 Cesaro averages, A30, 20.23 Change of variable, 215, 224, 225, 274 Characteristic function, 342 Chebyshev inequality, 5, 80, 276 Chernoff theorem, 1 5 1 Chi-squared distribution, 20. 15 Chi-squared statistic, 29.8 Circular Lebesgue measure, 1 3 12, 3 1 3 Class of sets, 18 Closed set, Al l Closed set of states, 8 21 Closed support, 1 2.9 Closure, Al l Cocountable set, 21 Cofinite set, 20 Collective, 1 09 Compact, A13 Complement, Al Completely normal number, 6. 1 3 Complete space or measure, 44, 10.5 Com pletion, 3 10, 10.5 Complex functions, integration of, 218 Compound Poisson distribution, 28.3 Compound Poisson process, 32.7 Concentrated, 1 6 1 Conditional distribution, 439, 449 Conditional expected value, 1 33, 445 Conditional probability, 5 1 , 427, 33.5 Congruent by dissection, 1 79 Conjugate index, 242 Conjugate spact", 244 Consistency conditions for finite-dimensional distributions, 483 Contt:nt, 3.15 Continued·fraction transformation, 3 1 9, A36 Con tinuity from above, 25 Continuity of paths, 500 Continuum hypothesis, 46 Conventions involving oo, 160 Convergence in distribution, 329, 378 Convergence in mean, 243 Convergence in measure, 268 Convergence in probability, 70, 268, 330 Convergence with probability 1, 70, 330 Convergence of random series, 289 Convergence of types, 1 93 Convex functions, A32 Convolution, 266
INDEX Coordinate function, 27, 484 Coordinate variable, 484 Countable, 8 Countable additivity, 23, 1 6 1 Countable subadditivity, 25, 1 62 Countably generated u-field, 2 1 1 Countably infinite, 8 Counting measure, 1 6 1 Coupled chain, 126 Coupon problem, 362 Covariance, 277 Cover, A3 Cramer-Wold theorem, 383 Cylinder, 27, 485 Daniell-Slone theorem, ! I 14, 16 12 Darboux-Young definition, 15.2 A-distribution, 1 92 Decision theory, 247 Decom position, A3 de Finetti theorem, 473 Definite integral, 200 Degenerate di stribution function, 193 Delta method, 359 DeMoive-Laplace theorem, 25. 1 1 , 358 DeMorgan !aw, A6 Dense, Al5 Density of measure, 213, 422 Density point, 31.9 Density of random variable or distribution, 257, 260 Density of set of integers, 2. 18 Denumerable probabilities, 51 Dependent random variables, 363 Derivatives of integrals, 402 Diagonal method, 29, A14 Difference equation. A19 Difference set, Al Diophantine approximation, 13, 324 Dirichlet theorem, 13, A26 Discontinuity of the first kind, 534 Discrete measure, 23, 1 61 Discrete random variable, 256 Discrete space, 1 . 1 , 23, 5.16 Disjoint, A3 Disjoint supports, 410, 421 Distribution: of random variable, 73, 187, 256 of random vector, 259 Distribution function, 175, 188, 256, 259, 409 Dominated convergence theorem, 78, 209 Dominated measure, 422 Double exponential distribution, 348 Double integral, 233
INDEX Double series, A27 Doubly stochastic matrix, 8.20 Dual space, 245 Dubins-Savage theorem, 102 Dyadic expa nsion, 3, A31 Dyadic interval, 4 Dyadic transformation, 3 1 3 Dynkin 's rr-A theorem, 42 E-.5 definition of absolute continuity, 422
Egorov theorem, 1 3 9 Eigenvalues, 8.26 Empirical distribution function, 268 Empty set, A 1 Entropy, 57, 6. 1 4, 8.31, 3 1 . 1 7 Equkontin uous, 355 Equivalence class, 58 Equivalent measures, 422 Erdos-Kac central limit theorem, 395 Ergodic theorem, 314 Erlang density, 23.2 Essential supremum, 241 Estimation, 2'51, 452 Etemadi, 282, 288, 22. 15 Euclidean distance, A16 Euclidean space, A1, A16 Euler function, 2. 18 Event, 18 Excessive function, 1 34 Exchangeable, 473 Existence of independent sequences, 73, 265 Existence of Markov chains, 1 15 Expected value, 76, 273 Exponential convergence, 131, 8.18 Exponential distribution, 189, 258, 297, 348 Extension of measure, 36, 166, 1 1. 1 Extremal distribution, 1 95
Factorization and sufficiency, 450 Fair game, 92, 463 Fatou lemma, 209 Field, 1 9, 2.5 Filtration, 458 Finite additivity, 20, 23, 2.15, 3 8, 161 Finite or countable, 8 Finite-dimensional distri butions, 308, 482 Finite-dimensional sets, 485 Finitely additive field, 20 Finite subadditivity, 24, 162 First Borel-Cantelli lemma, 59 First category, 1 . 10, A15 First passage, 1 18 fixed discontinuity, 303 Fourier representation, 250
589 Fourier series, 35 1 , 26.30 Fourier transform, 342 Frequency, 8 Fubini theorem, 233 Functional central limit th eorem, 522 Fundamental in probability, 20 21 Fundamental set, 320 Fundamental theorem of calculus, 224, 400 Fundamental theorem of Diophantine approximation, 324 Gambling policy, 98 Gamma distribution, 20. 1 7 Gamma function, 18 1 8 Generated u· field, 21 Glivenko-Cantelli theoreM, 269 Goncharov's theorem, 361 Hahn decomposition, 420 Hamel basis, 14 7 Hardy-Ramanujan theorem, 6. 16 Heine-Bore! theorem, A 1 3, A17 Hewitt-Savage zero-one law, 496 Hilbert space, 249 Hitting time, 1 36 Holder's inequality, 80, 5 9, 242, 276 Hypothesis testing, 1 5 1 Inadequacy of !JPT, 492 Inclusion-exclusion formula, 24, 163 Indefinite integral, 400 Identically distributed, 85 Independent classes, 55 Independent events, 53 Independent increments, 299, 498 Independent random variables, 71, 261 Independent random vectors, 263 Indicator, AS Infini tely divisible distributions, 371 Infinitely often, 53 Infinite series, A25 Information, 57 Initial digit problem, 25.3 Initial probabilities, 1 1 1 Inner boundary, 64 Inner measure, 37, 3.2 Integrable, 200, 206 In tegral, 1 99 Integral with respect to Lebesgue measure, 221 Integrals of derivatives, 412 Integration by parts, 236 Integration over sets, 212 Integration with respect to a density, 214
590 Interior, Al l Interval, A9 lnvariance principle, 520 Invariant set, 313 Inverse image, A7, 182 Inversion formula, 346 Irreducible chain, 1 1 9 Irregular paths, 504 Iterated integral, 233 Jacobian, 225, 261, 545 Jensen inequality, 80, 276, 449 Jordan decomposition, 421 Jordan measurable, 3.15 k-dimensional Borel set, 1 58 k-dimensional Lebesgue measure, 1 71, 1 77, 1 7 14, 20.4 Kolmogorov existence theorem, 483 Kolmogorov zero-one law, 63, 287 Landau notation, Al8 Laplace distribution, 348 Laplace transform, 285 Large deviations, 148 Lattice distribution, 26. 1 Law of the iterated logarithm, 1 5 3 Law of large numbers: strong, 9, 1 1 , 85, 282 weak, 5, 1 1 , 86, 284 Lebesgue decomposition, 414, 425 Lebesgue density theorem, 3 1 .9, 35. 15 Lebesgue function, 3 1 .3 Lebesgue in tegrable, 221 , 225 Lebesgue measure, 25, 43, 167, 171, 177 Lebesgue set, 45 Leibniz formula, 17.8 Levy distance, 14.5, 25.4, 26.1 6 Likelihooct ratio, 461, 471 Limit inferior, 52, 4.1 Limit of sets, 52 Lindeberg condition, 359 Lindeberg-Uvy theorem, 357 Linear Borel set, 158 Linear functional, 244 Linearity of expected value, 77 Linearity of the integral, 206 Linearly independent reals, 14.7, 30.8 Lipschitz condition, 418 Log-normal distribution, 388 Lower integral, 204, 228 Lower semicontinuous, 29. 1 Lower variation, 421 L P-space, 241
INDEX A-system, 41 Lusin theorem, 17.10 Lyapounov condition, 362 Lyapounov inequality, 81, 277 Mapping theorem, 344, 380 Marginal distribution, 261 Markov chain, 1 1 1, 363, 367, 29. 1 1 , 429 Markov inequality, 80, 276 Markov process, 435, 5 1 0 Markov sh ift, 312 Markov time, 1 33 Martingale, 1 0 1 , 458, 5 1 4 Martingale central limit theorem, 475 Martingale convergence th eorem, 468 Maximal ergodic th eorem, 3 1 7 Maximal inequ'llity, 287 Maximal solution, 122 wcontir:uity set, 335, 378 m-dependent, 6 1 1 , 364 Mean value, 26. 1 7 Measurable mapping, 1 82 Measurable proLess, 503 Measurable rectangle, 231 Measurable with respect to a u·field, 68, 225 Measurable set, 20, 38, 1 65 Measurable space, 161 Measure, 22, 1 60 Measure-preserving transformation, 31 1 Measure space, 1 6 1 Meets, A3 Method of moments, 388, 30.6 Minimal sufficient field, 454 Minimum·variance estimation, 454 Minkowski inequality, 5.1 0, 242 Mixing, 24.3, 363, 29. 1 0, 34. 16 Mixture, 473 IL*-measurable, 165 Moment, 274 Moment generating function, 1 .6, 1 46, 278, 284, 390 Monotone, 24, 162, 206 Monotone class, 43, 3.12 Monotone class th eorem, 43 Monotone convergence theorem, 208 M-test, 210, A28 Multidimensional central limit theorem, 385 Multidimensional characteristic function, 381 Multidimensional d istribution, 259 Multidimensional normal distribution, 383 Multinomial sampling, 29.8 Negative part, 200, 254 Negligible set, 8, 1 .3, 1 . 9, 44
INDEX Neym an- Pearson lemma, 1 9.7 Nonatomic, 2 . 1 9 Nondenumerable probabilities, 526 Nonmeasurable set, 45, 12 4 Nonnegative series, A25 Norm, 243 Normal distribution, 258, 383 Normal number, 8, 1 8, 86, 6 1 3 Normal number theorem, 9, 6.9 Nowhere dense, AlS Nowht"re differentiable, 3 1 . 18, 505 Null persistent, 1 30 Num ber theory, 393 Open set, Al l Optional sampling theorem, 466 Optimal stopping, 1 33 Ordel of dyadic interval, 4 Orthogonal projection, 250 Orthonormal, 249 Ottaviani inequahty, 22. 15 Outer boundary, 64 Outer measure, 3 7, 3.2, 165 Pairwise disjoint, A3 Partial-fraction expansion, 20.1 4 Partial inform ation, 57 Partition, A3 Path function, 308, 493, 500 Payoff function, 1 33 Perfect set, Al5 Pea:10 curve, 1 79 Period, 125 Permut
591 Probability measure space, 23 Probability transformation, 14 3 Product measure, 28, 12.12, 233, 487 Product space, 27, 23 1 , 484 Projection, 27, 484 Proper difference, Al Proper subset, Al rr-system, 41 Rademacher functions, 5, 289, 291 Radon-Nykodym derivat ive , 423, 460, 470 Radon-Nykodym theorem, 422, 32.8 Random Taylor series, 292 Random variable, 67, 1 82, 254 Random vector, 1 83, 255 Random walk, 1 1 2 Ra11 k, 4, 320 Ranks and records, 20.8 Rate of Poisson process, 299 Rational rectangle, 158 Realization of process, 493 Record values, 20.9, 21 .3, 22.9, 27 8 Rectangle, 158, Al6 Recurrent event, 8.17 Red-and-black, 92 Reflection principle, 5 1 1 Regularity, 1 74 Relative fieq uency, 8 Relative measure, 25.1 6 Renewal theory, 8.17, 310 Reversed martingale, 472 Riemann integral, 2, 12, 221 , 228, 25 1 2 Riemann-Lebesgue lemma, 345 Riesz-Fischer theorem, 243 Riesz representation theorem, 1 7. 12, 244 Right continuity, 1 75, 256 Rigid transformation, 1 72 Risk, 247, 25 1 Saltus, 188 Sample function, 1 88 Sample path, 1 88, 308 Sample point, 18 Sampling theory, 392 Scheffe theorem, 215 Schwarz inequality, 81, 5 6, 249, 276 Second Borel-Cantelli lemma, 60, 4.1 1 , 88 Second category, 1 . 1 0, A 15 Second-order Markov chain, 8.32 Secretary problem, 1 1 4 Section, 231 Selection problem, 1 1 3, 138 Selection system, 95, 7.3 Semicontinuous, 29 1
592 Semiring, 1 66 Separable function, 526 Separable process, 527, 5 3 1 Separable u-iield, 2. 1 1 Separant, 527 Sequence space, 27, 31 1 Set function, 22, 420 u-field, 20 u·field generated by a class of sets, 21 u-field generated by a random variable, 68, 255 u· finite on a class, 160 u-finite measure, 160 Shannon theorem, 6.1 4, 8 3 1 Signed measure, 32.12 Simple function, 184 Sim ple raodom variable, 67, 185, 254 Singleton, Al Singular function, 407 Singularity, 421 Singular part, 425 Skorohod el!lbedding, 51 3, 519 Skorohod theorem, 333 Southwest, 176, 24 7 Space, Al, 17 Souare-free integers, 4.1 5 Stable law, 377 Sta ndard normal distribution, 258 State space, 1 1 1 Stationary d istribution, 124 Stationary increments, 499 Stationary probabil ities, 124 Stationary process, 363, 494 Stationary sequence of random variables, 363 Stationary transition probabilities, 1 1 1 Stieltjes integral, 228 Stirling formula, 27. 18 Stochastic arithmetic, 2.18, 4.15, 4. 16, 5.19, 6.1 6, 18. 1 7, 25 . 1 5, 393, 30.9, 30.10, 30. 1 1 Stochastic matrix, 1 1 2 Stochastic process, 298, 308, 482 Stopping time, 99, 1 33, 464, 465, 508 Strong law of large num bers, 8, 9, 1 1 , 85, 6.8, 282, 312, 27.20 Strong Markov property, 508 Subadditivity: countable, 25, 162 fin ite, 24, 1 62 Subfair game, 92, 1 02 Subm artingale, 462 Subset, Al Subsolution, 8.5
INDEX Sufficient subfield, 450 Superharmonic function, 134 Supermartingale, 462 Support line, A33 Support of measure, 23, 1 6 1 . Symmetric difference, Al Symmetric random walk, 1 1 3, 35. 1 0 Symmetric stable law, 378 System, 1 1 1 Tai! u-field, 63, 287, 496 Tarski theorem, 3.8 Taylor series, A29, 292 Thin cylinders, 27 Three series theorem, 290 Tiehmess, 336, 380 Timid play, 108 Tonelli theorem, 234 Total variation, 421 Trajectory of process, 493 Transformation of measures, 185, 215 Transient, 1 1 7, 1 20 Transition ;Jrobabilities, 1 1 1 Translation invariance, 45, 172 Triangular array, 359 Triangular distribution, 348 Trifling set, 1 .3, 1 .9, 3.15 Type, 1 93 Unbiased estimate, 251, 454 Uncorrelated random variables, 7, 277 Uniform distribution, 258, 348 Uniform distribution modulo 1, 328, 352 Uniform integrability, 216, 338 Uniformly equicontinuous, 26.1 5 Uniqueness of extension, 36, 42, 163 Uniqueness theorem for charactenstic functions, 346, 382 Uniq ueness theorem for moment generating functions, 147, 284, 26.7, 390 Unit interval, 5 1 Unit mass, 24 U pcrossing, 467 Upper integral, 204, 228 Upper semicontinuous, 29.1 Upper variation, 421 Utility function, 7 12 Vague convergence, 371 Value of payoff function, 134 van der Waerden function, 3 1 . 1 8
593
INDEX Variance, 78, 275 Version of conditional expected value, 445 Version of conditional probability, 430 Vieta formula, 1 . 7 Wald equation, 22.8 Weak convergence, 1 90, 327, 378 Weak law of large number�, 5, 1 1, 86, 284 Weierstra�s approximation theorem, 87, 26 1 9
Weierstrass M-te�t, 210, A28 Weiner process, 498 With probability 1, 60 Zeroes of Brownian motion, 507 Zero-one law, 63, 1 1 7, 8.35, 286, 22. 12, 3 14, 496, 37.4 Zorn's lemma, AS