Y_1^{(a)}(ω), Y_2^{(a)}(ω), ... with the right asymptotic relative frequency (in the sense of Problem 6.12). Let C be the intersection of C_a over all the effective selection systems a. Show that C lies in ℱ (the σ-field in the probability space (Ω, ℱ, P) on which the X_n are defined) and that P(C) = 1. A sequence (X_1(ω), X_2(ω), ...) for ω in C is called a collective: a subsequence chosen by any of the effective rules a contains all k-tuples in the correct proportions.
* This topic may be omitted.
PROBABILITY

7.4. Let D_n be 1 or 0 according as X_{2n−1} ≠ X_{2n} or not, and let M_k be the time of the kth 1 (the smallest n such that Σ_{t=1}^n D_t = k). Let Z_k = X_{2M_k}. In other words, look at successive nonoverlapping pairs (X_{2n−1}, X_{2n}), discard accordant (X_{2n−1} = X_{2n}) pairs, and keep the second element of discordant (X_{2n−1} ≠ X_{2n}) pairs. Show that this process simulates a fair coin: Z_1, Z_2, ... are independent and identically distributed and P[Z_k = +1] = P[Z_k = −1] = 1/2, whatever p may be. Follow the proof of Theorem 7.1; it is instructive to write out the proof in all detail, suppressing no ω's.
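Problem 7.4 is the von Neumann trick for extracting a fair coin from a biased one, and it is easy to check empirically. The sketch below is only an illustration; the bias p = 0.8 and the sample size are arbitrary choices, not from the text.

```python
import random

def fair_flips(p, n_pairs, seed=0):
    """Von Neumann extractor: scan nonoverlapping pairs (X_{2n-1}, X_{2n}),
    drop accordant pairs, keep the second element of discordant pairs."""
    rng = random.Random(seed)
    zs = []
    for _ in range(n_pairs):
        x = 1 if rng.random() < p else -1   # X_{2n-1}
        y = 1 if rng.random() < p else -1   # X_{2n}
        if x != y:                          # discordant pair: Z_k = X_{2n}
            zs.append(y)
    return zs

zs = fair_flips(p=0.8, n_pairs=100_000)
frac_plus = sum(z == 1 for z in zs) / len(zs)
print(len(zs), frac_plus)   # frac_plus is near 1/2 even though p = 0.8
```

About 2pq of the pairs survive, so the extractor pays for fairness with a reduced output rate.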
7.5. Suppose that a gambler with initial fortune 1 stakes a proportion θ (0 < θ < 1) of his current fortune: F_0 = 1 and W_n = θF_{n−1}. Show that F_n = Π_{k=1}^n (1 + θX_k) and hence that

log F_n = (n/2) log(1 − θ²) + (S_n/2) log((1 + θ)/(1 − θ)).

Show that F_n → 0 with probability 1 in the subfair case.

7.6. In "doubling," W_1 = 1, W_n = 2W_{n−1}, and the rule is to stop after the first win. For any positive p, play is certain to terminate. Here F_T = F_0 + 1, but of course infinite capital is required. If F_0 = 2^k − 1 and W_n cannot exceed F_{n−1}, the probability of F_T = F_0 + 1 in the fair case is 1 − 2^{−k}. Prove this via Theorem 7.2 and also directly.

7.7. In "progress and pinch," the wager, initially some integer, is increased by 1 after a loss and decreased by 1 after a win, the stopping rule being to quit if the next bet is 0. Show that play is certain to terminate if and only if p ≥ 1/2. Show that F_T = F_0 + (1/2)W_1² + (1/2)(T − 1). Infinite capital is required.

7.8. Here is a common martingale. At each stage the gambler has before him a pattern x_1, ..., x_k of positive numbers. He bets x_1 + x_k, or x_1 in case k = 1. If he loses, at the next stage he uses the pattern x_1, ..., x_k, x_1 + x_k. If he wins, at the next stage he uses the pattern x_2, ..., x_{k−1}, unless k is 1 or 2, in which case he quits. Show that play is certain to terminate if p ≥ 1/2 and that the ultimate gain is the sum of the numbers in the initial pattern. Infinite capital is again required.
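The displayed identity in Problem 7.5 is pure algebra once F_n = Π(1 + θX_k) is known, so it can be checked numerically. In this sketch the values of θ, p, and n are arbitrary.

```python
import math
import random

rng = random.Random(1)
theta, p, n = 0.25, 0.4, 1000
xs = [1 if rng.random() < p else -1 for _ in range(n)]   # the +-1 outcomes X_k

log_F = sum(math.log(1 + theta * x) for x in xs)         # log prod (1 + theta X_k)
S = sum(xs)                                              # S_n = X_1 + ... + X_n
closed = (n / 2) * math.log(1 - theta ** 2) \
       + (S / 2) * math.log((1 + theta) / (1 - theta))

print(abs(log_F - closed) < 1e-9)   # the two expressions agree
```

Since log(1 − θ²) < 0 and S_n/n converges to p − q, the first term dominates in the subfair case, which is the mechanism behind F_n → 0.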
7.9. Suppose that W_k ≡ 1, so that F_k = F_0 + S_k. Suppose that p ≥ q and that T is a stopping time such that 1 ≤ T ≤ n with probability 1. Show that E[F_T] ≤ E[F_n], with equality in case p = q. Interpret this result in terms of a stock option that must be exercised by time n, F_0 + S_k representing the price of the stock at time k.
7.10. Let u be a real function on [0, 1], u(x) representing the utility of the fortune x. Consider policies bounded by 1; see (7.24). Let Q_π(F_0) = E[u(F_T)]; this represents the expected utility under the policy π of an initial fortune F_0. Suppose of a policy π_0 that

(7.34)  Q_{π_0}(x) ≥ u(x),  0 ≤ x ≤ 1,

and that

(7.35)  Q_{π_0}(x) ≥ pQ_{π_0}(x + t) + qQ_{π_0}(x − t),  0 ≤ x − t ≤ x ≤ x + t ≤ 1.

Show that Q_π(x) ≤ Q_{π_0}(x) for all x and all policies π. Such a π_0 is optimal. Theorem 7.3 is the special case of this result for p ≤ 1/2, bold play in the role of π_0, and u(x) = 1 or u(x) = 0 according as x = 1 or x < 1. Condition (7.34) says that gambling with policy π_0 is at least as good as not gambling at all; (7.35) says that, although the prospects even under π_0 become on the average less sanguine as time passes, it is better to use π_0 now than to use some other policy for one step and then change to π_0.
7.11. The functional equation (7.30) and the assumption that Q is bounded suffice to determine Q completely. First, Q(0) and Q(1) must be 0 and 1, respectively, and so (7.31) holds. Let T_0x = x/2 and T_1x = x/2 + 1/2; let f_0x = px and f_1x = p + qx. Then Q(T_{u_1} ··· T_{u_n} x) = f_{u_1} ··· f_{u_n} Q(x). If the binary expansions of x and y both begin with the digits u_1, ..., u_n, they have the form x = T_{u_1} ··· T_{u_n} x′ and y = T_{u_1} ··· T_{u_n} y′. If K bounds Q and if m = max{p, q}, it follows that |Q(x) − Q(y)| ≤ Km^n. Therefore, Q is continuous and hence uniquely determined by (7.30) and (7.31).
SECTION 8. MARKOV CHAINS

As Markov chains illustrate well the connection between probability and measure, their basic properties are here developed in a measure-theoretic setting.

Definitions
Let S be a finite or countable set. Suppose that to each pair i and j in S there is assigned a nonnegative number p_ij and that these numbers satisfy the constraint

(8.1)  Σ_{j∈S} p_ij = 1,  i ∈ S.
Let X_0, X_1, X_2, ... be a sequence of random variables whose ranges are contained in S. The sequence is a Markov chain or Markov process if

(8.2)  P[X_{n+1} = j | X_0 = i_0, ..., X_n = i_n] = P[X_{n+1} = j | X_n = i_n] = p_{i_n j}

for every n and every sequence i_0, ..., i_n in S for which P[X_0 = i_0, ..., X_n = i_n] > 0. The set S is the state space or phase space of the process, and the p_ij are the transition probabilities. Part of the defining condition (8.2) is that the transition probability

(8.3)  P[X_{n+1} = j | X_n = i] = p_ij

does not vary with n.†

The elements of S are thought of as the possible states of a system, X_n representing the state at time n. The sequence or process X_0, X_1, X_2, ... then represents the history of the system, which evolves in accordance with the probability law (8.2). The conditional distribution of the next state X_{n+1} given the present state X_n must not further depend on the past X_0, ..., X_{n−1}. This is what (8.2) requires, and it leads to a copious theory.

The initial probabilities are

(8.4)  π_i = P[X_0 = i].

The π_i are nonnegative and add to 1, but the definition of Markov chain places no further restrictions on them.

Example 8.1. The Bernoulli–Laplace model of diffusion. Imagine r black
balls and r white balls distributed between two boxes, with the constraint that each box contains r balls. The state of the system is specified by the number of white balls in the first box, so that the state space is S = {0, 1, ..., r}. The transition mechanism is this: at each stage one ball is chosen at random from each box and the two are interchanged. If the present state is i, the chance of a transition to i − 1 is the chance i/r of drawing one of the i white balls from the first box times the chance i/r of drawing one of the i black balls from the second box. Together with similar arguments for the other possibilities, this shows that the transition probabilities are
p_{i,i−1} = (i/r)²,   p_{i,i+1} = ((r − i)/r)²,   p_{ii} = 2i(r − i)/r²,
the others being 0. This is the probabilistic analogue of the model for the flow of two liquids between two containers. ∎

† Sometimes in the definition of Markov chain P[X_{n+1} = j | X_n = i] is allowed to depend on n. A chain satisfying (8.3) is then said to have stationary transition probabilities, a phrase that will be omitted here because (8.3) will always be assumed.
* For an excellent collection of examples from physics and biology, see FELLER, Volume 1, Chapter XV.
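The three formulas of Example 8.1 can be tabulated directly. This sketch builds the (r + 1) × (r + 1) matrix with exact rational entries and checks condition (8.1); the helper name and the choice r = 6 are arbitrary.

```python
from fractions import Fraction

def bernoulli_laplace(r):
    """Transition matrix of the Bernoulli-Laplace diffusion on S = {0, ..., r}."""
    P = [[Fraction(0)] * (r + 1) for _ in range(r + 1)]
    for i in range(r + 1):
        if i >= 1:
            P[i][i - 1] = Fraction(i, r) ** 2          # white drawn out, black in
        if i <= r - 1:
            P[i][i + 1] = Fraction(r - i, r) ** 2      # black drawn out, white in
        P[i][i] = Fraction(2 * i * (r - i), r * r)     # one ball of each color drawn
    return P

P = bernoulli_laplace(6)
print(all(sum(row) == 1 for row in P))   # every row sums to 1: P is stochastic
```

The row-sum check is exact because (i/r)² + ((r − i)/r)² + 2i(r − i)/r² = ((i + r − i)/r)² = 1.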
The p_ij form the transition matrix P = [p_ij] of the process. A stochastic matrix is one whose entries are nonnegative and satisfy (8.1); the transition matrix of course has this property.

Example 8.2. Random walk with absorbing barriers. Suppose that S = {0, 1, ..., r} and

        | 1  0  0  ···  0  0  0 |
        | q  0  p  ···  0  0  0 |
        | 0  q  0  ···  0  0  0 |
  P =   | ·  ·  ·  ···  ·  ·  · |
        | 0  0  0  ···  0  p  0 |
        | 0  0  0  ···  q  0  p |
        | 0  0  0  ···  0  0  1 |
That is, p_{i,i+1} = p and p_{i,i−1} = q = 1 − p for 0 < i < r, and p_00 = p_rr = 1. The chain represents a particle in random walk. The particle moves one unit to the right or left, the respective probabilities being p and q, except that each of 0 and r is an absorbing state: once the particle enters, it cannot leave. The state can also be viewed as a gambler's fortune; absorption in 0 represents ruin for the gambler, absorption in r ruin for his adversary (see Section 7). The gambler's initial fortune is usually regarded as nonrandom, so that (see (8.4)) π_i = 1 for some i. ∎
Example 8.3. Unrestricted random walk. Let S consist of all the integers i = 0, ±1, ±2, ..., and take p_{i,i+1} = p and p_{i,i−1} = q = 1 − p. This chain represents a random walk without barriers, the particle being free to move anywhere on the integer lattice. The walk is symmetric if p = q. ∎
The state space may, as in the preceding example, be countably infinite. If so, the Markov chain consists of functions X_n on a probability space (Ω, ℱ, P), but these will have infinite range and hence will not be random variables in the sense of the preceding sections. This will cause no difficulty, however, because expected values of the X_n will not be considered. All that is required is that for each i ∈ S the set [ω: X_n(ω) = i] lie in ℱ and hence have a probability.

Example 8.4. Symmetric random walk in space. Let S consist of the integer lattice points in k-dimensional Euclidean space R^k; x = (x_1, ..., x_k) lies in S if the coordinates are all integers. Now x has 2k neighbors, points of the form y = (x_1, ..., x_i ± 1, ..., x_k); for each such y let p_xy = (2k)^{−1}.
The chain represents a particle moving randomly in space; for k = 1 it reduces to Example 8.3 with p = q = 1/2. The cases k ≤ 2 and k ≥ 3 exhibit an interesting difference. If k ≤ 2, the particle is certain to return to its initial position, but this is not so if k ≥ 3; see Example 8.6. ∎

Since the state space in this example is not a subset of the line, the X_0, X_1, ... do not assume real values. This is immaterial because expected values of the X_n play no role. All that is necessary is that X_n be a mapping from Ω into S (finite or countable) such that [ω: X_n(ω) = i] ∈ ℱ for i ∈ S. There will be expected values E[f(X_n)] for real functions f on S with finite range, but then f(X_n(ω)) is a simple random variable as defined before.
Example 8.5. A selection problem. A princess must choose from among r suitors. She is definite in her preferences and if presented with all r at once could choose her favorite and could even rank the whole group. They are ushered into her presence one by one in random order, however, and she must at each stage either stop and accept the suitor or else reject him and proceed in the hope that a better one will come along. What strategy will maximize her chance of stopping with the best suitor of all?

Shorn of some details, the analysis is this. Let S_1, S_2, ..., S_r be the suitors in order of presentation; this sequence is a random permutation of the set of suitors. Let X_1 = 1, and let X_2, X_3, ... be the successive positions of suitors who dominate (are preferable to) all their predecessors. Thus X_2 = 4 and X_3 = 6 means that S_1 dominates S_2 and S_3 but S_4 dominates S_1, S_2, S_3, and that S_4 dominates S_5 but S_6 dominates S_1, ..., S_5. There can be at most r of these dominant suitors; if there are exactly m, X_{m+1} = X_{m+2} = ··· = r + 1 by convention.

As the suitors arrive in random order, the chance that S_i ranks highest among S_1, ..., S_i is (i − 1)!/i! = 1/i. The chance that S_j ranks highest among S_1, ..., S_j and S_i ranks next is (j − 2)!/j! = 1/j(j − 1). This leads to a chain with transition probabilities†

(8.5)  P[X_{n+1} = j | X_n = i] = i/j(j − 1),  1 ≤ i < j ≤ r.

If X_n = i, then X_{n+1} = r + 1 means that S_i dominates S_{i+1}, ..., S_r as well as S_1, ..., S_i, and the conditional probability of this is

(8.6)  P[X_{n+1} = r + 1 | X_n = i] = i/r,  1 ≤ i ≤ r.

As downward transitions are impossible and r + 1 is absorbing, this specifies a transition matrix for S = {1, 2, ..., r + 1}.

It is quite clear that in maximizing her chance of selecting the best suitor of all, the princess should reject those who do not dominate their predecessors. Her strategy therefore will be to stop with the suitor in position X_T, where T is a random variable representing her strategy. Since her decision to stop must depend only on the suitors she has seen thus far, the event [T = n] must lie in σ(X_1, ..., X_n). If X_T = i, then by (8.6) the conditional probability of success is f(i) = i/r. The probability of success is therefore E[f(X_T)], and the problem is to choose the strategy T so as to maximize it. For the solution, see Example 8.16.† ∎

† The details can be found in DYNKIN & YUSHKEVICH, Chapter III.
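Example 8.5 can also be explored by simulation. The well-known solution is a threshold rule: let some number m of suitors pass, then accept the first one dominating all predecessors. The sketch below (r = 20 and the candidate thresholds are arbitrary choices, not from the text) estimates the success probability of several thresholds; the best threshold turns out to lie near r/e.

```python
import random

def success_prob(r, m, trials=20_000, seed=0):
    """Estimate the chance of stopping on the best suitor under the rule:
    reject the first m, then accept the first suitor dominating all before."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        ranks = list(range(r))             # rank r - 1 is the best suitor
        rng.shuffle(ranks)
        best_seen = max(ranks[:m], default=-1)
        for i in range(m, r):
            if ranks[i] > best_seen:       # first dominant suitor after position m
                wins += ranks[i] == r - 1
                break
    return wins / trials

r = 20
est = {m: success_prob(r, m) for m in (0, 3, 7, 12)}
print(max(est, key=est.get))   # the winning threshold is near r/e (about 7 here)
```

With m = 0 she simply takes the first suitor (success 1/r), while thresholds near r/e succeed with probability close to 1/e.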
Higher-Order Transitions
The properties of the Markov chain are entirely determined by the transition and initial probabilities. The chain rule (4.2) for conditional probabilities gives

P[X_0 = i_0, X_1 = i_1] = P[X_0 = i_0] P[X_1 = i_1 | X_0 = i_0] = π_{i_0} p_{i_0 i_1}.

Similarly,

(8.7)  P[X_0 = i_0, X_1 = i_1, ..., X_m = i_m] = π_{i_0} p_{i_0 i_1} ··· p_{i_{m−1} i_m}

for any sequence i_0, i_1, ..., i_m of states. Further,

(8.8)  P[X_{m+t} = j_t, 1 ≤ t ≤ n | X_s = i_s, 0 ≤ s ≤ m] = p_{i_m j_1} p_{j_1 j_2} ··· p_{j_{n−1} j_n},

as follows by expressing the conditional probability as a ratio and applying (8.7) to numerator and denominator. Adding out the intermediate states now gives the formula

(8.9)  p_{ij}^{(n)} = P[X_{m+n} = j | X_m = i] = Σ_{k_1, ..., k_{n−1}} p_{i k_1} p_{k_1 k_2} ··· p_{k_{n−1} j}

(the k_t range over S) for the nth-order transition probabilities.

† With the princess replaced by an executive and the suitors by applicants for an office job, this is known as the secretary problem.
Notice that p_{ij}^{(n)} is the entry in position (i, j) of P^n, the nth power of the transition matrix P. If S is infinite, P is a matrix with infinitely many rows and columns; as the terms in (8.9) are nonnegative, there are no convergence problems. It is natural to put

p_{ij}^{(0)} = δ_{ij} = 1 if i = j,  0 if i ≠ j.

Then P^0 is the identity, as it should be. From (8.1) and (8.9) follow

(8.10)  p_{ij}^{(n+1)} = Σ_k p_{ik} p_{kj}^{(n)} = Σ_k p_{ik}^{(n)} p_{kj}.
An Existence Theorem

Theorem 8.1. Suppose that P = [p_ij] is a stochastic matrix and that the π_i are nonnegative numbers satisfying Σ_{i∈S} π_i = 1. There exists on some (Ω, ℱ, P) a Markov chain X_0, X_1, X_2, ... with initial probabilities π_i and transition probabilities p_ij.
PROOF. Reconsider the proof of Theorem 5.2. There the space (Ω, ℱ, P) was the unit interval, and the central part of the argument was the construction of the decompositions (5.10). Suppose for the moment that S = {1, 2, ...}. First construct a partition I_1^{(0)}, I_2^{(0)}, ... of (0, 1] into countably many† subintervals of lengths (P is again Lebesgue measure) P(I_i^{(0)}) = π_i. Next decompose each I_i^{(0)} into subintervals I_{ij}^{(1)} of lengths P(I_{ij}^{(1)}) = π_i p_{ij}. Continuing inductively gives a sequence of partitions {I_{i_0 ··· i_n}^{(n)}: i_0, ..., i_n = 1, 2, ...} such that each refines the preceding and P(I_{i_0 ··· i_n}^{(n)}) = π_{i_0} p_{i_0 i_1} ··· p_{i_{n−1} i_n}. Put X_n(ω) = i if ω ∈ ∪_{i_0, ..., i_{n−1}} I_{i_0 ··· i_{n−1} i}^{(n)}. It follows just as in the proof of Theorem 5.2 that the set [X_0 = i_0, ..., X_n = i_n] coincides with the interval I_{i_0 ··· i_n}^{(n)}. Thus P[X_0 = i_0, ..., X_n = i_n] = π_{i_0} p_{i_0 i_1} ··· p_{i_{n−1} i_n}. From this, (8.2) and (8.4) follow immediately.

That completes the construction for the case S = {1, 2, ...}. For the general countably infinite S, let g be a one-to-one mapping of {1, 2, ...} onto S and replace the X_n as already constructed by g(X_n); the assumption S = {1, 2, ...} was merely for notational convenience. The same argument obviously works if S is finite. ∎

† If δ_1 + δ_2 + ··· = b − a and δ_i ≥ 0, then the intervals I_i = (b − Σ_{j≤i} δ_j, b − Σ_{j<i} δ_j], i = 1, 2, ..., decompose (a, b] into intervals of lengths δ_i.

Although strictly speaking the Markov chain is the sequence X_0, X_1, ..., one often speaks as though the chain were the matrix P together
with the initial probabilities π_i, or even P with some unspecified set of π_i. Theorem 8.1 justifies this attitude: for given P and π_i the corresponding X_n do exist, and the apparatus of probability theory (the Borel–Cantelli lemmas and so on) is available for the study of P and of systems evolving in accordance with the Markov rule.

From now on, fix a chain X_0, X_1, ... satisfying π_i > 0 for all i. Denote by P_i probabilities conditional on [X_0 = i]: P_i(A) = P[A | X_0 = i]. Thus

(8.11)  P_i[X_t = i_t, 1 ≤ t ≤ n] = p_{i i_1} p_{i_1 i_2} ··· p_{i_{n−1} i_n}

by (8.8). The interest centers on these conditional probabilities, and the actual initial probabilities are now largely irrelevant. From (8.11) follows
(8.12)  P_i[X_t = i_t, 1 ≤ t ≤ m; X_{m+u} = j_u, 1 ≤ u ≤ n]
          = P_i[X_t = i_t, 1 ≤ t ≤ m] P_j[X_u = j_u, 1 ≤ u ≤ n]  if i_m = j.

Suppose that I is a set (finite or infinite) of m-long sequences of states, J is a set of n-long sequences of states, and every sequence in I ends in j. Adding both sides of (8.12) for (i_1, ..., i_m) ranging over I and (j_1, ..., j_n) ranging over J gives

(8.13)  P_i[(X_1, ..., X_m) ∈ I, (X_{m+1}, ..., X_{m+n}) ∈ J]
          = P_i[(X_1, ..., X_m) ∈ I] P_j[(X_1, ..., X_n) ∈ J].

For this to hold it is essential that each sequence in I end in j. Formulas (8.12) and (8.13) are of central importance.
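The product formula (8.11), and with it (8.7), can be checked by simulation. The sketch below uses an arbitrary two-state chain and samples paths by repeated inverse-transform draws, which is the computational face of the nested-interval construction in the proof of Theorem 8.1.

```python
import random

def sample_path(pi, P, n, rng):
    """Draw X_0, ..., X_n by inverse-transform sampling at each step."""
    def draw(weights):
        u, acc = rng.random(), 0.0
        for state, w in enumerate(weights):
            acc += w
            if u < acc:
                return state
        return len(weights) - 1

    path = [draw(pi)]
    for _ in range(n):
        path.append(draw(P[path[-1]]))
    return tuple(path)

pi = [0.5, 0.5]
P = [[0.9, 0.1], [0.4, 0.6]]          # arbitrary two-state chain
rng = random.Random(42)
trials = 200_000
hits = sum(sample_path(pi, P, 2, rng) == (0, 1, 1) for _ in range(trials))

exact = pi[0] * P[0][1] * P[1][1]     # pi_0 p_01 p_11, by (8.7)
print(abs(hits / trials - exact) < 0.003)
```

The empirical frequency of the path (0, 1, 1) matches π_0 p_01 p_11 = 0.03 to within sampling error.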
Transience and Persistence
Let

(8.14)  f_ij^{(n)} = P_i[X_t ≠ j, 1 ≤ t < n; X_n = j]

be the probability of a first visit to j at time n for a system that starts in i, and let

(8.15)  f_ij = P_i(∪_{n=1}^∞ [X_n = j]) = Σ_{n=1}^∞ f_ij^{(n)}

be the probability of an eventual visit. A state i is persistent if a system starting at i is certain sometime to return to i: f_ii = 1. The state is transient in the opposite case: f_ii < 1.

Suppose that n_1, ..., n_k are integers satisfying 1 ≤ n_1 < ··· < n_k, and consider the event that the system visits j at times n_1, ..., n_k but not in between; this event is determined by the conditions

X_t ≠ j for 1 ≤ t < n_1,  X_{n_1} = j;
X_t ≠ j for n_1 < t < n_2,  X_{n_2} = j;
  ····················
X_t ≠ j for n_{k−1} < t < n_k,  X_{n_k} = j.

Repeated application of (8.13) shows that under P_i the probability of this event is f_ij^{(n_1)} f_jj^{(n_2 − n_1)} ··· f_jj^{(n_k − n_{k−1})}. Add this over the k-tuples n_1, ..., n_k: the P_i-probability that X_n = j for at least k different values of n is f_ij f_jj^{k−1}. Letting k → ∞ therefore gives

(8.16)  P_i[X_n = j i.o.] = 0 if f_jj < 1;  = f_ij if f_jj = 1.

Recall that i.o. means infinitely often. Taking i = j gives

(8.17)  P_i[X_n = i i.o.] = 0 if f_ii < 1;  = 1 if f_ii = 1.

Thus P_i[X_n = i i.o.] is either 0 or 1; compare the zero-one law (Theorem 4.5), but note that the events [X_n = i] here are not in general independent.
Theorem 8.2. (i) Transience of i is equivalent to P_i[X_n = i i.o.] = 0 and to Σ_n p_ii^{(n)} < ∞. (ii) Persistence of i is equivalent to P_i[X_n = i i.o.] = 1 and to Σ_n p_ii^{(n)} = ∞.

PROOF. By the first Borel–Cantelli lemma, Σ_n p_ii^{(n)} < ∞ implies P_i[X_n = i i.o.] = 0, which by (8.17) in turn implies f_ii < 1. The entire theorem will be proved if it is shown that f_ii < 1 implies Σ_n p_ii^{(n)} < ∞.
The proof uses a first-passage argument: by (8.13),

p_ij^{(n)} = P_i[X_n = j] = Σ_{s=0}^{n−1} P_i[X_1 ≠ j, ..., X_{n−s−1} ≠ j, X_{n−s} = j, X_n = j]
  = Σ_{s=0}^{n−1} P_i[X_1 ≠ j, ..., X_{n−s−1} ≠ j, X_{n−s} = j] P_j[X_s = j]
  = Σ_{s=0}^{n−1} f_ij^{(n−s)} p_jj^{(s)}.

Therefore,

Σ_{t=1}^n p_ij^{(t)} = Σ_{t=1}^n Σ_{s=0}^{t−1} f_ij^{(t−s)} p_jj^{(s)} = Σ_{s=0}^{n−1} p_jj^{(s)} Σ_{t=s+1}^n f_ij^{(t−s)} ≤ Σ_{s=0}^n p_jj^{(s)} f_ij.

Thus (1 − f_ii) Σ_{t=1}^n p_ii^{(t)} ≤ f_ii, and if f_ii < 1, this puts a bound on the partial sums Σ_{t=1}^n p_ii^{(t)}. ∎

Example 8.6. Polya's theorem. For the symmetric k-dimensional random walk (Example 8.4), all states are persistent if k = 1 or k = 2, and all states are transient if k ≥ 3.

To prove this, note first that the probability p_ii^{(n)} of return in n steps is the same for all states i; denote this probability by a_n^{(k)} to indicate the dependence on the dimension k. Clearly, a_{2n+1}^{(k)} = 0. Suppose that k = 1. Since return in 2n steps means n steps east and n steps west,

a_{2n}^{(1)} = C(2n, n) 2^{−2n},

where C(2n, n) is the binomial coefficient. By Stirling's formula, a_{2n}^{(1)} ~ (πn)^{−1/2}. Therefore, Σ_n a_n^{(1)} = ∞ and all states
are persistent by Theorem 8.2.

In the plane, a return to the starting point in 2n steps means equal numbers of steps east and west as well as equal numbers north and south:

a_{2n}^{(2)} = Σ_{u=0}^n (2n)!/(u! u! (n−u)! (n−u)!) · 4^{−2n} = C(2n, n) 4^{−2n} Σ_{u=0}^n C(n, u)².

It can be seen on combinatorial grounds that the last sum is C(2n, n), and so a_{2n}^{(2)} = (a_{2n}^{(1)})² ~ (πn)^{−1}. Again, Σ_n a_n^{(2)} = ∞ and every state is persistent.

For three dimensions,

a_{2n}^{(3)} = Σ (2n)!/(u! u! v! v! (n−u−v)! (n−u−v)!) · 6^{−2n},

the sum extending over nonnegative u and v satisfying u + v ≤ n. This reduces to

(8.18)  a_{2n}^{(3)} = Σ_{l=0}^n C(2n, 2l) (1/3)^{2n−2l} (2/3)^{2l} a_{2n−2l}^{(1)} a_{2l}^{(2)},

as can be checked by substitution. (To see the probabilistic meaning of this formula, condition on there being 2n − 2l steps parallel to the vertical axis and 2l steps parallel to the horizontal plane.) It will be shown that a_{2n}^{(3)} = O(n^{−3/2}), which will imply that Σ_n a_n^{(3)} < ∞. The terms in (8.18) for l = 0 and l = n are each O(n^{−3/2}) and hence can be omitted. Now a_{2u}^{(1)} ≤ Ku^{−1/2} and a_{2u}^{(2)} ≤ Ku^{−1}, as already seen, and so the sum in question is at most

K² Σ_{l=1}^{n−1} C(2n, 2l) (1/3)^{2n−2l} (2/3)^{2l} (2n − 2l)^{−1/2} (2l)^{−1}.

Since (2n − 2l)^{−1/2} ≤ 2n^{1/2}(2n − 2l + 1)^{−1} and (2l)^{−1} ≤ 2(2l + 1)^{−1}, this is at most a constant times

(n^{1/2}/((2n + 1)(2n + 2))) Σ_{l=1}^{n−1} C(2n + 2, 2l + 1) (1/3)^{2n−2l} (2/3)^{2l} = O(n^{−3/2}).

Thus Σ_n a_n^{(3)} < ∞, and the states are transient. The same is true for k = 4, 5, ..., since an inductive extension of the argument shows that a_{2n}^{(k)} = O(n^{−k/2}). ∎
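The quantities a_{2n}^{(k)} are easy to compute, and the dichotomy in Polya's theorem shows up numerically. A sketch (the summation cutoff of 200 terms is arbitrary):

```python
from math import comb

def a1(two_n):
    """a^(1)_{2n}: return probability at time 2n on the line."""
    n = two_n // 2
    return comb(2 * n, n) / 4.0 ** n

def a2(two_n):
    """a^(2)_{2n} = (a^(1)_{2n})^2, as shown in Example 8.6."""
    return a1(two_n) ** 2

def a3(two_n):
    """a^(3)_{2n} via the decomposition (8.18)."""
    n = two_n // 2
    return sum(comb(2 * n, 2 * l) * (1 / 3) ** (2 * n - 2 * l) * (2 / 3) ** (2 * l)
               * a1(2 * n - 2 * l) * a2(2 * l)
               for l in range(n + 1))

sums = {k: sum(a(2 * n) for n in range(1, 200)) for k, a in [(1, a1), (2, a2), (3, a3)]}
# k = 1, 2: the partial sums keep growing with the cutoff (divergence, persistence);
# k = 3: the partial sum has essentially converged already (transience).
print(sums)
```

As a spot check, (8.18) gives a_2^{(3)} = 1/18 + 1/9 = 1/6, which is the direct answer: the second step must undo the first, with probability 1/6.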
It is possible for a system starting in i to reach j (f_ij > 0) if and only if p_ij^{(n)} > 0 for some n. If this is true for all i and j, the Markov chain is irreducible.
Theorem 8.3. If the Markov chain is irreducible, then one of the following two alternatives holds.
(i) All states are transient, P_i(∪_j [X_n = j i.o.]) = 0 for all i, and Σ_n p_ij^{(n)} < ∞ for all i and j.
(ii) All states are persistent, P_i(∩_j [X_n = j i.o.]) = 1 for all i, and Σ_n p_ij^{(n)} = ∞ for all i and j.
The irreducible chain itself can accordingly be called persistent or transient. In the persistent case the system visits every state infinitely often. In the transient case it visits each state only finitely often, hence visits each finite set only finitely often, and so may be said to go to infinity.
PROOF. For each i and j there exist r and s such that p_ij^{(r)} > 0 and p_ji^{(s)} > 0. Now

(8.19)  p_ii^{(r+s+n)} ≥ p_ij^{(r)} p_jj^{(n)} p_ji^{(s)},

and since p_ij^{(r)} p_ji^{(s)} > 0, Σ_n p_ii^{(n)} < ∞ implies that Σ_n p_jj^{(n)} < ∞: if one state is transient, they all are. In this case (8.16) gives P_i[X_n = j i.o.] = 0 for all i and j, so that P_i(∪_j [X_n = j i.o.]) = 0 for all i. Since Σ_{n=1}^∞ p_ij^{(n)} = Σ_{n=1}^∞ Σ_{s=1}^n f_ij^{(s)} p_jj^{(n−s)} ≤ (Σ_{s=1}^∞ f_ij^{(s)})(Σ_{m=0}^∞ p_jj^{(m)}) ≤ Σ_{m=0}^∞ p_jj^{(m)}, it follows that if j is transient, then (Theorem 8.2) Σ_n p_ij^{(n)} converges for every i.

The other possibility is that all states are persistent. In this case P_j[X_n = j i.o.] = 1 by Theorem 8.2, and it follows by (8.13) that

p_ji^{(m)} = P_j([X_m = i] ∩ [X_n = j i.o.])
  ≤ Σ_{n>m} P_j[X_m = i, X_{m+1} ≠ j, ..., X_{n−1} ≠ j, X_n = j]
  = Σ_{n>m} p_ji^{(m)} f_ij^{(n−m)} = p_ji^{(m)} f_ij.

There is an m for which p_ji^{(m)} > 0, and therefore f_ij = 1. By (8.16), P_i[X_n = j i.o.] = f_ij = 1. If Σ_n p_ij^{(n)} were to converge for some i and j, it would follow by the first Borel–Cantelli lemma that P_i[X_n = j i.o.] = 0. ∎
Example 8.7. Since Σ_j p_ij^{(n)} = 1, the first alternative in Theorem 8.3 is impossible if S is finite: a finite, irreducible Markov chain is persistent. ∎

Example 8.8. The chain in Polya's theorem is certainly irreducible. If the dimension is 1 or 2, there is probability 1 that a particle in symmetric random walk visits every state infinitely often. If the dimension is 3 or more, the particle goes to infinity. ∎
Example 8.9. Consider the unrestricted random walk on the line (Example 8.3). According to the ruin calculation (7.8), f_01 = p/q for p < q. Since the chain is irreducible, all states are transient. By symmetry, of course, the chain is also transient if p > q, although in this case (7.8) gives f_01 = 1. Thus f_ij = 1 (i ≠ j) is possible in the transient case.†

If p = q = 1/2, the chain is persistent by Polya's theorem. If n and j − i have the same parity,

p_ij^{(n)} = C(n, (n + j − i)/2) 2^{−n},  |j − i| ≤ n.

This is maximal if j = i or j = i ± 1, and by Stirling's formula the maximal value is of order n^{−1/2}. Therefore, lim_n p_ij^{(n)} = 0, which always holds in the transient case but is thus possible in the persistent case as well (see Theorem 8.8). ∎
Another Criterion for Persistence
Let Q = [q_ij] be a matrix with rows and columns indexed by the elements of a finite or countable set U. Suppose it is substochastic in the sense that q_ij ≥ 0 and Σ_j q_ij ≤ 1. Let Q^n = [q_ij^{(n)}] be the nth power, so that

(8.20)  q_ij^{(n+1)} = Σ_k q_ik q_kj^{(n)}.

Consider the row sums

(8.21)  σ_i^{(n)} = Σ_j q_ij^{(n)}.

From (8.20) follows

(8.22)  σ_i^{(n+1)} = Σ_j q_ij σ_j^{(n)}.

Since Q is substochastic, σ_j^{(1)} ≤ 1, and hence σ_i^{(n+1)} = Σ_j q_ij^{(n)} σ_j^{(1)} ≤ σ_i^{(n)}. Therefore, the monotone limits

(8.23)  σ_i = lim_n σ_i^{(n)}

exist. By (8.22) and the Weierstrass M-test [A28], σ_i = Σ_j q_ij σ_j. Thus the σ_i solve the system

(8.24)  x_i = Σ_{j∈U} q_ij x_j, i ∈ U;  0 ≤ x_i ≤ 1, i ∈ U.

For an arbitrary solution, x_i = Σ_j q_ij x_j ≤ Σ_j q_ij = σ_i^{(1)}; and x_i ≤ σ_i^{(n)} for all i implies that x_i ≤ Σ_j q_ij σ_j^{(n)} = σ_i^{(n+1)} by (8.22). Thus x_i ≤ σ_i^{(n)} for all n by induction, and so x_i ≤ σ_i. Thus the σ_i give the maximal solution to (8.24):

Lemma 1. For a substochastic matrix Q the limits (8.23) are the maximal solution of (8.24).

Now suppose that U is a subset of the state space S. The p_ij for i and j in U give a substochastic matrix Q. The row sums (8.21) are σ_i^{(n)} = Σ p_{i j_1} p_{j_1 j_2} ··· p_{j_{n−1} j_n}, where the j_1, ..., j_n range over U, and so σ_i^{(n)} = P_i[X_t ∈ U, t ≤ n]. Let n → ∞:

(8.25)  σ_i = P_i[X_t ∈ U, t = 1, 2, ...],  i ∈ U.

In this case, σ_i is thus the probability that the system remains forever in U, given that it starts at i. The following theorem is now an immediate consequence of Lemma 1.

Theorem 8.4. For U ⊂ S the probabilities (8.25) are the maximal solution of the system

(8.26)  x_i = Σ_{j∈U} p_ij x_j, i ∈ U;  0 ≤ x_i ≤ 1, i ∈ U.

† But for each j there must be some i ≠ j for which f_ij < 1; see Problem 8.9.
Example 8.10. For the random walk on the line consider the set U = {0, 1, 2, ...}. The system (8.26) is

x_i = p x_{i+1} + q x_{i−1},  i ≥ 1;
x_0 = p x_1.

It follows [A19] that x_n = A + An if p = q and x_n = A − A(q/p)^{n+1} if p ≠ q. The only bounded solution is x_n ≡ 0 if q ≥ p, and in this case there is probability 0 of staying forever among the nonnegative integers. If q < p, A = 1 gives the maximal solution x_n = 1 − (q/p)^{n+1}. Compare (7.8) and Example 8.9. ∎

Now consider the system (8.26) with U = S − {i_0} for an arbitrary single state i_0:

(8.27)  x_i = Σ_{j≠i_0} p_ij x_j, i ≠ i_0;  0 ≤ x_i ≤ 1, i ≠ i_0.

There is always the trivial solution, the one for which x_i ≡ 0.
Theorem 8.5. An irreducible chain is transient if and only if (8.27) has a nontrivial solution.

PROOF. The probabilities

(8.28)  σ_i = P_i[X_t ≠ i_0, t = 1, 2, ...] = 1 − f_{i i_0},  i ≠ i_0,

are by Theorem 8.4 the maximal solution of (8.27). Therefore (8.27) has a nontrivial solution if and only if f_{i i_0} < 1 for some i ≠ i_0. If the chain is persistent, this is impossible by Theorem 8.3(ii). Suppose the chain is transient. Since

f_{i_0 i_0} = P_{i_0}[X_1 = i_0] + Σ_{n=2}^∞ Σ_{i≠i_0} P_{i_0}[X_1 = i, X_2 ≠ i_0, ..., X_{n−1} ≠ i_0, X_n = i_0] = p_{i_0 i_0} + Σ_{i≠i_0} p_{i_0 i} f_{i i_0},

and since f_{i_0 i_0} < 1, it follows that f_{i i_0} < 1 for some i ≠ i_0. ∎

Since the equations in (8.27) are homogeneous, the issue is whether they have a solution that is nonnegative, nontrivial, and bounded. If they do, 0 ≤ x_i ≤ 1 can be arranged by rescaling.†

† See Problem 8.11.
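The maximal solution of Example 8.10 can also be recovered numerically: truncate the infinite system (8.26) at a large N and pin x_N = 1 as a stand-in for the far boundary (the truncation is an approximation made here, not in the text), then solve the resulting linear equations.

```python
import numpy as np

p, q, N = 0.7, 0.3, 200   # p > q, so staying forever in {0, 1, ...} is possible

# x_0 = p x_1;  x_i = q x_{i-1} + p x_{i+1} for 1 <= i < N;  x_N = 1 (truncation)
A = np.zeros((N + 1, N + 1))
b = np.zeros(N + 1)
A[0, 0], A[0, 1] = 1.0, -p
for i in range(1, N):
    A[i, i - 1], A[i, i], A[i, i + 1] = -q, 1.0, -p
A[N, N], b[N] = 1.0, 1.0

x = np.linalg.solve(A, b)
exact = [1 - (q / p) ** (i + 1) for i in range(10)]   # Example 8.10's formula
print(np.allclose(x[:10], exact, atol=1e-8))
```

Since (q/p)^{N+1} is astronomically small for N = 200, the truncation bias is invisible at this precision.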
Example 8.11. In the simplest of queueing models the state space is {0, 1, 2, ...} and the transition matrix has the form

      | a_0  a_1  a_2   0    0    0   ··· |
      | a_0  a_1  a_2   0    0    0   ··· |
P =   |  0   a_0  a_1  a_2   0    0   ··· |
      |  0    0   a_0  a_1  a_2   0   ··· |
      |  0    0    0   a_0  a_1  a_2  ··· |
      |  ·    ·    ·    ·    ·    ·   ··· |

If there are i customers in the queue and i ≥ 1, the customer at the head of the queue is served and leaves, and then 0, 1, or 2 new customers arrive (probabilities a_0, a_1, a_2), which leaves a queue of length i − 1, i, or i + 1. If i = 0, no one is served, and the new customers bring the queue length to 0, 1, or 2. Assume that a_0 and a_2 are positive, so that the chain is irreducible. For i_0 = 0 the system (8.27) is

(8.29)  x_1 = a_1 x_1 + a_2 x_2,
        x_k = a_0 x_{k−1} + a_1 x_k + a_2 x_{k+1},  k ≥ 2.

Since a_0, a_1, a_2 have the form q(1 − a), a, p(1 − a) for appropriate p, q, a, the second line of (8.29) has the form x_k = p x_{k+1} + q x_{k−1}, k ≥ 2. Now the solution [A19] is A + B(q/p)^k = A + B(a_0/a_2)^k if a_0 ≠ a_2 (p ≠ q) and A + Bk if a_0 = a_2 (p = q), and A can be expressed in terms of B because of the first equation in (8.29). The result is

x_k = B[(a_0/a_2)^k − 1] if a_0 ≠ a_2;  x_k = Bk if a_0 = a_2.

There is a nontrivial solution if a_0 < a_2 but not if a_0 ≥ a_2. If a_0 < a_2, the chain is thus transient, and the queue size goes to infinity with probability 1. If a_0 ≥ a_2, the chain is persistent. For a nonempty queue the expected increase in queue length in one step is a_2 − a_0, and the queue goes out of control if and only if this is positive. ∎

Stationary Distributions
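The stability dichotomy in Example 8.11 is visible in simulation. The two arrival distributions below are arbitrary choices with a_0 > a_2 and a_0 < a_2 respectively.

```python
import random

def queue_path(a0, a1, a2, steps, seed=0):
    """Simulate the queue chain: serve one customer if the queue is nonempty,
    then add 0, 1, or 2 arrivals with probabilities a0, a1, a2."""
    rng = random.Random(seed)
    x, path = 0, []
    for _ in range(steps):
        u = rng.random()
        arrivals = 0 if u < a0 else (1 if u < a0 + a1 else 2)
        x = max(x - 1, 0) + arrivals
        path.append(x)
    return path

stable = queue_path(0.5, 0.3, 0.2, 100_000)    # a0 > a2: persistent chain
unstable = queue_path(0.2, 0.3, 0.5, 100_000)  # a0 < a2: transient chain
print(stable.count(0), unstable[-1])
```

The persistent queue empties over and over, while the transient one drifts off at rate a_2 − a_0 per step.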
Suppose that the initial probabilities for the chain satisfy

(8.30)  Σ_{i∈S} π_i p_ij = π_j,  j ∈ S.

It then follows by induction that

(8.31)  Σ_{i∈S} π_i p_ij^{(n)} = π_j,  j ∈ S,  n = 0, 1, 2, ....

If π_i is the probability that X_0 = i, then the left side of (8.31) is the probability that X_n = j, and thus (8.30) implies that the probability of [X_n = j] is the same for all n. A set of probabilities satisfying (8.30) is for this reason called a stationary distribution. The existence of such a distribution implies that the chain is very stable. To discuss this requires the notion of periodicity.

The state j has period t if p_jj^{(n)} > 0 implies that t divides n and if t is the largest integer with this property. In other words, the period of j is the greatest common divisor of the set of integers

(8.32)  [n: n ≥ 1, p_jj^{(n)} > 0].

If the chain is irreducible, then for each pair i and j there exist r and s for which p_ij^{(r)} and p_ji^{(s)} are positive, and of course

(8.33)  p_ii^{(r+s+n)} ≥ p_ij^{(r)} p_jj^{(n)} p_ji^{(s)}.

Let t_i and t_j be the periods of i and j. Taking n = 0 in this inequality shows that t_i divides r + s; and now it follows by the inequality that p_jj^{(n)} > 0 implies that t_i divides r + s + n and hence divides n. Thus t_i divides each integer in the set (8.32), and so t_i ≤ t_j. Since i and j can be interchanged in this argument, i and j have the same period. One can thus speak of the period of the chain itself in the irreducible case. The random walk on the line has period 2, for example. If the period is 1, the chain is aperiodic.
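To see (8.30) and (8.31) concretely: for a small chain the stationary π can be computed as the normalized left eigenvector of P for eigenvalue 1 (a standard linear-algebra method, not one developed in the text), after which πP^n = π for every n. The matrix below is an arbitrary irreducible, aperiodic example.

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0],
              [0.3, 0.3, 0.4],
              [0.5, 0.0, 0.5]])   # arbitrary irreducible, aperiodic chain

w, v = np.linalg.eig(P.T)                        # left eigenvectors of P
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])   # eigenvector for eigenvalue 1
pi = pi / pi.sum()                               # normalize to a distribution

print(np.allclose(pi @ P, pi))   # (8.30)
print(all(np.allclose(pi @ np.linalg.matrix_power(P, n), pi) for n in range(8)))  # (8.31)
```

Theorem 8.6 below sharpens this: for such chains every row of P^n converges to π.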
Lemma 2. In an irreducible, aperiodic chain, for each i and j, p_ij^{(n)} > 0 for all n exceeding some n_0(i, j).

PROOF. Since p_jj^{(m+n)} ≥ p_jj^{(m)} p_jj^{(n)}, if M is the set (8.32), then m ∈ M and n ∈ M together imply m + n ∈ M. But it is a fact of number theory [A21] that if a set of positive integers is closed under addition and has greatest common divisor 1, then it contains all integers exceeding some n_1. Given i and j, choose r so that p_ij^{(r)} > 0. If n ≥ n_0 = n_1 + r, then p_ij^{(n)} ≥ p_ij^{(r)} p_jj^{(n−r)} > 0. ∎
SECTION 8. MARKOV CHAINS

Theorem 8.6. Suppose of an irreducible, aperiodic chain that there exists a stationary distribution — a solution of (8.30) satisfying π_i ≥ 0 and Σ_i π_i = 1. Then the chain is persistent,

(8.34) lim_n p_{ij}^{(n)} = π_j

for all i and j, the π_j are all positive, and the stationary distribution is unique.

The main point of the conclusion is that, since p_{ij}^{(n)} approaches π_j, the effect of the initial state i wears off.

PROOF. If the chain is transient, then p_{ij}^{(n)} → 0 for all i and j by Theorem 8.3, and it follows by (8.31) and the M-test that π_j is identically 0, which contradicts Σ_i π_i = 1. The existence of a stationary distribution therefore
implies that the chain is persistent.

Consider now a Markov chain with state space S × S and transition probabilities p(ij, kl) = p_{ik} p_{jl} (it is easy to verify that these form a stochastic matrix). Call this the coupled chain; it describes the joint behavior of a pair of independent systems, each evolving according to the laws of the original Markov chain. By Theorem 8.1 there exists a Markov chain (X_n, Y_n), n = 0, 1, ..., having positive initial probabilities and transition probabilities

P[(X_{n+1}, Y_{n+1}) = (k, l) | (X_n, Y_n) = (i, j)] = p(ij, kl).

For n exceeding some n_0 depending on i, j, k, l, p^{(n)}(ij, kl) = p_{ik}^{(n)} p_{jl}^{(n)} is positive by Lemma 2. Therefore, the coupled chain is irreducible. (The proof of this fact requires only the assumptions that the original chain is irreducible and aperiodic.) It is easy to check that π(ij) = π_i π_j forms a set of stationary initial probabilities for the coupled chain, which, like the original one, must therefore be persistent. It follows that, for an arbitrary initial state (i, j) for the chain {(X_n, Y_n)} and an arbitrary i_0 in S, P_{ij}[(X_n, Y_n) = (i_0, i_0) i.o.] = 1. If τ is the smallest integer such that X_τ = Y_τ = i_0, then τ is finite with probability 1 under P_{ij}.

The idea of the proof is now this: X_n starts in i and Y_n starts in j; once X_n = Y_n = i_0 occurs, X_n and Y_n follow identical probability laws, and hence the initial states i and j will lose their influence. By (8.13) applied to the coupled chain, if m ≤ n, then

P_{ij}[(X_n, Y_n) = (k, l), τ = m]
  = P_{ij}[(X_t, Y_t) ≠ (i_0, i_0), t < m, (X_m, Y_m) = (i_0, i_0)] × P_{i_0 i_0}[(X_{n−m}, Y_{n−m}) = (k, l)]
  = P_{ij}[τ = m] p_{i_0 k}^{(n−m)} p_{i_0 l}^{(n−m)}.
Adding out l gives P_{ij}[X_n = k, τ = m] = P_{ij}[τ = m] p_{i_0 k}^{(n−m)}, and adding out k gives P_{ij}[Y_n = l, τ = m] = P_{ij}[τ = m] p_{i_0 l}^{(n−m)}. Take k = l, equate the two probabilities, and add over m = 1, ..., n:

P_{ij}[X_n = k, τ ≤ n] = P_{ij}[Y_n = k, τ ≤ n].

From this follows

P_{ij}[X_n = k] ≤ P_{ij}[X_n = k, τ ≤ n] + P_{ij}[τ > n]
  = P_{ij}[Y_n = k, τ ≤ n] + P_{ij}[τ > n] ≤ P_{ij}[Y_n = k] + P_{ij}[τ > n].
This and the same inequality with X and Y interchanged give

|p_{ik}^{(n)} − p_{jk}^{(n)}| ≤ P_{ij}[τ > n].

Since τ is finite with probability 1,

(8.35) lim_n |p_{ik}^{(n)} − p_{jk}^{(n)}| = 0.

(This proof of (8.35) requires only that the coupled chain be persistent.) By (8.31), π_k − p_{jk}^{(n)} = Σ_i π_i (p_{ik}^{(n)} − p_{jk}^{(n)}), and this goes to 0 by the M-test if (8.35) holds. Thus lim_n p_{jk}^{(n)} = π_k. As this holds for each stationary distribution, there can be only one of them.

It remains to show that the π_j are all strictly positive. Choose r and s so that p_{ij}^{(r)} and p_{ji}^{(s)} are positive. Letting n → ∞ in (8.33) shows that π_i is positive if π_j is; since some π_j is positive (they add to 1), all the π_i must be positive. ∎
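The two conclusions of the proof — rows of P^n coalesce, as in (8.35), and approach the stationary distribution, as in (8.34) — are easy to watch numerically. A minimal sketch with a made-up 3-state irreducible, aperiodic matrix:

```python
import numpy as np

# an arbitrary irreducible, aperiodic stochastic matrix (illustrative numbers)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4]])

# stationary distribution: left eigenvector for eigenvalue 1, normalized
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

Pn = np.linalg.matrix_power(P, 60)
max_row_spread = float(np.abs(Pn[0] - Pn[1]).max())  # (8.35): rows coalesce
err = float(np.abs(Pn - pi).max())                   # (8.34): rows approach pi
print(max_row_spread, err)
```

Both quantities are numerically zero after a few dozen steps; Theorem 8.9 below explains why the decay is in fact exponential in the finite case.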
Example 8.12. For the queueing model in Example 8.11 the equations (8.30) are

π_0 = π_0 a_0 + π_1 a_0,
π_1 = π_0 a_1 + π_1 a_1 + π_2 a_0,
π_2 = π_0 a_2 + π_1 a_2 + π_2 a_1 + π_3 a_0,
π_k = π_{k−1} a_2 + π_k a_1 + π_{k+1} a_0,   k ≥ 3.

Again write a_0, a_1, a_2 as q(1 − a), a, p(1 − a). Since the last equation here is π_k = qπ_{k+1} + pπ_{k−1}, the solution is

π_k = A + B(p/q)^k = A + B(a_2/a_0)^k   if a_0 ≠ a_2,
π_k = A + Bk                             if a_0 = a_2,

for k ≥ 2. If a_0 < a_2 and Σ_k π_k converges, then π_k ≡ 0, and hence there is no stationary distribution; but this is not new, because it was shown in Example 8.11 that the chain is transient in this case. If a_0 = a_2, there is again no stationary distribution, and this is new because the chain was in Example 8.11 shown to be persistent in this case. If a_0 > a_2, then Σ_k π_k converges, provided A = 0. Solving for π_0 and π_1 in the first two equations of the system above gives π_0 = Ba_2 and π_1 = Ba_2(1 − a_0)/a_0. From Σ_k π_k = 1 it now follows that B = (a_0 − a_2)/a_2, and the π_k can be written down explicitly. Since π_k = B(a_2/a_0)^k for k ≥ 2, there is small chance of a large queue length. ∎
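The explicit solution in the case a_0 > a_2 can be verified against the balance equations directly. In the sketch below the transition probabilities are reconstructed from the system of equations above (rows 0 and 1 both equal to (a_0, a_1, a_2, 0, ...), and row k ≥ 2 carrying a_0, a_1, a_2 at k−1, k, k+1); this reconstruction is an assumption, not a quotation from Example 8.11, and the numbers a_0 = 0.5, a_1 = 0.2, a_2 = 0.3 are made up:

```python
a0, a1, a2 = 0.5, 0.2, 0.3            # a0 > a2: positive persistent case
B = (a0 - a2) / a2
# pi_0 = B*a2, pi_1 = B*a2*(1-a0)/a0, pi_k = B*(a2/a0)^k for k >= 2
pi = [B * a2, B * a2 * (1 - a0) / a0] + [B * (a2 / a0) ** k for k in range(2, 60)]

def p(i, j):
    # transition probabilities reconstructed from the balance equations
    if i <= 1:
        return {0: a0, 1: a1, 2: a2}.get(j, 0.0)
    return {i - 1: a0, i: a1, i + 1: a2}.get(j, 0.0)

total = sum(pi)                        # ~ 1 up to the truncated geometric tail
balance = [abs(pi[j] - sum(pi[i] * p(i, j) for i in range(len(pi))))
           for j in range(40)]
print(total, max(balance))
```

Because π_k decays geometrically with ratio a_2/a_0 = 0.6, truncating at 60 states costs only about 10^{-13} of the total mass.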
If a_0 = a_2 in this queueing model, the chain is persistent (Example 8.11) but has no stationary distribution (Example 8.12). The next theorem describes the asymptotic behavior of the p_{ij}^{(n)} in this case.

Theorem 8.7. If an irreducible, aperiodic chain has no stationary distribution, then

(8.36) lim_n p_{ij}^{(n)} = 0

for all i and j.

If the chain is transient, (8.36) follows from Theorem 8.3. What is interesting here is the persistent case.
PROOF. By the argument in the proof of Theorem 8.6, the coupled chain is irreducible. If it is transient, then Σ_n (p_{ij}^{(n)})^2 converges by Theorem 8.2, and the conclusion follows. Suppose, on the other hand, that the coupled chain is persistent. Then the stopping-time argument leading to (8.35) goes through as before. If the p_{ij}^{(n)} do not all go to 0, then there is an increasing sequence {n_u} of integers along which some p_{ij}^{(n)} is bounded away from 0. By the diagonal method [A14], it is possible by passing to a subsequence of {n_u} to ensure that each p_{ij}^{(n_u)} converges to a limit, which by (8.35) must be independent of i. Therefore, there is a sequence {n_u} such that lim_u p_{ij}^{(n_u)} = α_j exists for all i and j, where α_j is nonnegative for all j and positive for some j. If M is a finite set of states, then Σ_{j∈M} α_j = lim_u Σ_{j∈M} p_{ij}^{(n_u)} ≤ 1, and hence 0 < σ = Σ_j α_j ≤ 1. Now Σ_{k∈M} p_{ik}^{(n_u)} p_{kj} ≤ p_{ij}^{(n_u+1)} = Σ_k p_{ik} p_{kj}^{(n_u)}; it is possible to pass to the limit (u → ∞) inside the first sum (if M is finite) and inside the second sum (by the M-test), and hence Σ_{k∈M} α_k p_{kj} ≤ Σ_k p_{ik} α_j = α_j. Therefore, Σ_k α_k p_{kj} ≤ α_j; if one of these inequalities were strict, it would follow that Σ_k α_k = Σ_j Σ_k α_k p_{kj} < Σ_j α_j, which is impossible. Therefore Σ_k α_k p_{kj} = α_j for all j, and the ratios π_j = α_j/σ give a stationary distribution, contrary to the hypothesis. ∎
The limits (8.34) and (8.36) can be described in terms of mean return times. Let

(8.37) μ_j = Σ_{n=1}^∞ n f_{jj}^{(n)};

if the series diverges, write μ_j = ∞. In the persistent case, this sum is to be thought of as the average number of steps to first return to j, given that X_0 = j.†

Lemma 3. Suppose that j is persistent and that lim_n p_{jj}^{(n)} = u. Then u > 0 if and only if μ_j < ∞, in which case u = 1/μ_j.

Under the convention that 0 = 1/∞, the case u = 0 and μ_j = ∞ is consistent with the equation u = 1/μ_j.
PROOF. For k ≥ 0 let ρ_k = Σ_{n>k} f_{jj}^{(n)}; the notation does not show the dependence on j, which is fixed. Consider the double series

f_{jj}^{(1)} + f_{jj}^{(2)} + f_{jj}^{(3)} + ···
           f_{jj}^{(2)} + f_{jj}^{(3)} + ···
                        f_{jj}^{(3)} + ···
                                    ···

The rows sum to ρ_0, ρ_1, ρ_2, ..., and the nth column sums to n f_{jj}^{(n)} (n ≥ 1), and so [A27] the series in (8.37) converges if and only if Σ_k ρ_k does, in which case

(8.38) μ_j = Σ_{k=0}^∞ ρ_k.

Since j is persistent, the P_j-probability that the system does not hit j up to time n is the probability that it hits j after time n, and this is ρ_n. Therefore,

1 − p_{jj}^{(n)} = P_j[X_n ≠ j]
  = P_j[X_1 ≠ j, ..., X_n ≠ j] + Σ_{k=1}^{n−1} P_j[X_k = j, X_{k+1} ≠ j, ..., X_n ≠ j]
  = ρ_n + Σ_{k=1}^{n−1} p_{jj}^{(k)} ρ_{n−k},

and since ρ_0 = 1,

1 = ρ_0 p_{jj}^{(n)} + ρ_1 p_{jj}^{(n−1)} + ··· + ρ_{n−1} p_{jj}^{(1)} + ρ_n p_{jj}^{(0)}.

Keep only the first k + 1 terms on the right here, and let n → ∞; the result is 1 ≥ (ρ_0 + ··· + ρ_k)u. Therefore u > 0 implies that Σ_k ρ_k converges, so that μ_j < ∞.

Write x_{nk} = ρ_k p_{jj}^{(n−k)} for 0 ≤ k ≤ n and x_{nk} = 0 for n < k. Then 0 ≤ x_{nk} ≤ ρ_k and lim_n x_{nk} = ρ_k u. If μ_j < ∞, then Σ_k ρ_k converges and it follows by the M-test that 1 = Σ_{k=0}^∞ x_{nk} → Σ_{k=0}^∞ ρ_k u. By (8.38), 1 = μ_j u, so that u > 0 and u = 1/μ_j. ∎

† Since in general there is no upper bound to the number of steps to first return, it is not a simple random variable. It does come under the general theory in Chapter 4, and its expected value is indeed μ_j (and (8.38) is just (5.25)), but for the present the interpretation of μ_j as an average is informal.
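The relation u = 1/μ_j is easy to check by simulation: for a chain with known stationary distribution, the empirical mean return time to a state should match the reciprocal of that state's stationary probability. A sketch with a made-up two-state chain (p_{01} = 0.3, p_{10} = 0.2, so π_0 = 0.4 and the mean return time to 0 is 2.5):

```python
import random

random.seed(1)
a, b = 0.3, 0.2                  # p01 = a, p10 = b
pi0 = b / (a + b)                # stationary probability of state 0

def step(s):
    if s == 0:
        return 1 if random.random() < a else 0
    return 0 if random.random() < b else 1

trials, total = 20000, 0
for _ in range(trials):
    s, n = step(0), 1            # start at 0, count steps until first return
    while s != 0:
        s, n = step(s), n + 1
    total += n

mu0_hat = total / trials
print(mu0_hat, 1 / pi0)
```

Here the exact value is also easy by hand: with probability 1 − a the return takes one step, and otherwise the excursion to state 1 lasts a Geometric(b) number of steps, giving μ_0 = 1 + a/b = 2.5 = 1/π_0.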
The law of large numbers bears on the relation u = 1/μ_j in the persistent case. Let V_n be the number of visits to state j up to time n. If the time from one visit to the next is about μ_j, then V_n should be about n/μ_j: V_n/n ≈ 1/μ_j. But (if X_0 = j) V_n/n has expected value n^{−1} Σ_{k=1}^n p_{jj}^{(k)}, which goes to u under the hypotheses of Lemma 3 [A30].

Consider an irreducible, aperiodic, persistent chain. There are two possibilities. If there is a stationary distribution, then the limits (8.34) are positive, and the chain is called positive persistent. It then follows by Lemma 3 that μ_j < ∞ and π_j = 1/μ_j for all j. In this case, it is not actually necessary to assume persistence, since this follows from the existence of a stationary distribution. On the other hand, if the chain has no stationary distribution, then the limits (8.36) are all 0, and the chain is called null persistent. It then follows by Lemma 3 that μ_j = ∞ for all j. This, taken together with Theorem 8.3, provides a complete classification:
Theorem 8.8. For an irreducible, aperiodic chain there are three possibilities:

(i) The chain is transient; then for all i and j, lim_n p_{ij}^{(n)} = 0 and in fact Σ_n p_{ij}^{(n)} < ∞.
(ii) The chain is persistent but there exists no stationary distribution (the null persistent case); then for all i and j, p_{ij}^{(n)} goes to 0 but so slowly that Σ_n p_{ij}^{(n)} = ∞, and μ_j = ∞.
(iii) There exist stationary probabilities π_j and (hence) the chain is persistent (the positive persistent case); then for all i and j, lim_n p_{ij}^{(n)} = π_j > 0 and μ_j = 1/π_j < ∞.

Since the asymptotic properties of the p_{ij}^{(n)} are distinct in the three cases, these asymptotic properties in fact characterize the three cases.
Example 8.13. Suppose that the states are 0, 1, 2, ... and the transition matrix is

q_0  p_0   0    0   ···
q_1   0   p_1   0   ···
q_2   0    0   p_2  ···
···························

where the p_i and q_i are positive. The state i represents the length of a success run, the conditional chance of a further success being p_i. Clearly the chain is irreducible and aperiodic. A solution of the system (8.27) for testing for transience (with i_0 = 0) must have the form x_k = x_1/(p_1 ··· p_{k−1}). Hence there is a bounded, nontrivial solution, and the chain is transient, if and only if the limit α of p_0 ··· p_n is positive. But the chance of no return to 0 (for initial state 0) in n steps is clearly p_0 ··· p_{n−1}; hence f_{00} = 1 − α, which checks: the chain is persistent if and only if α = 0.

Every solution of the steady-state equations (8.30) has the form π_k = π_0 p_0 ··· p_{k−1}. Hence there is a stationary distribution if and only if Σ_k p_0 ··· p_k converges; this is the positive persistent case. The null persistent case is that in which p_0 ··· p_k → 0 but Σ_k p_0 ··· p_k diverges (which happens, for example, if q_k = 1/k for k ≥ 2).

Since the chance of no return to 0 in n steps is p_0 ··· p_{n−1}, in the persistent case (8.38) gives μ_0 = Σ_{k=0}^∞ p_0 ··· p_{k−1}. In the null persistent case this checks with μ_0 = ∞; in the positive persistent case it gives μ_0 = Σ_{k=0}^∞ π_k/π_0 = 1/π_0, which again is consistent. ∎
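The identities μ_0 = Σ_k p_0 ··· p_{k−1} and π_0 = 1/μ_0 can be checked numerically on a concrete positive persistent instance. The choice p_i = 1/2 for all i below is made up for illustration; there π_k = 2^{−(k+1)} and μ_0 = 2:

```python
p = [0.5] * 60                 # p_i = 1/2: probability of a further success
prods = [1.0]                  # prods[k] = p_0 * ... * p_{k-1} (empty product = 1)
for pk in p:
    prods.append(prods[-1] * pk)

mu0 = sum(prods)               # (8.38): mean return time to 0, here ~ 2
pi0 = 1 / mu0
pi = [pi0 * prods[k] for k in range(len(prods))]

total = sum(pi)                               # normalization, ~ 1
back_to_0 = sum(pik * 0.5 for pik in pi)      # sum_k pi_k q_k, should be ~ pi_0
print(mu0, total, back_to_0)
```

The last line checks the steady-state equation for state 0, π_0 = Σ_k π_k q_k, which holds because every failure sends the run length back to 0.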
Exponential Convergence*

In the finite case, p_{ij}^{(n)} converges to π_j at an exponential rate:

Theorem 8.9. If the state space is finite and the chain is irreducible and aperiodic, then there is a stationary distribution {π_j}, and

|p_{ij}^{(n)} − π_j| ≤ Aρ^n,

where A ≥ 0 and 0 ≤ ρ < 1.

From this it follows that a finite, irreducible, aperiodic chain cannot be null persistent. This also follows from Theorem 8.8; see Example 8.7.

* This topic may be omitted.
PROOF. Let m_j^{(n)} = min_i p_{ij}^{(n)} and M_j^{(n)} = max_i p_{ij}^{(n)}. By (8.10),

m_j^{(n+1)} = min_i Σ_ν p_{iν} p_{νj}^{(n)} ≥ min_i Σ_ν p_{iν} m_j^{(n)} = m_j^{(n)},
M_j^{(n+1)} = max_i Σ_ν p_{iν} p_{νj}^{(n)} ≤ max_i Σ_ν p_{iν} M_j^{(n)} = M_j^{(n)}.

Since obviously m_j^{(n)} ≤ M_j^{(n)},

(8.39) 0 ≤ m_j^{(1)} ≤ m_j^{(2)} ≤ ··· ≤ M_j^{(2)} ≤ M_j^{(1)} ≤ 1.

Suppose temporarily that all the p_{ij} are positive. Let r be the number of states and let δ = min_{ij} p_{ij}. Since Σ_j p_{ij} ≥ rδ, 0 < δ ≤ r^{−1}. Fix states u and v for the moment; let Σ′ denote summation over j in S satisfying p_{uj} > p_{vj}, and let Σ″ denote summation over j satisfying p_{uj} ≤ p_{vj}. Then

(8.40) Σ′(p_{uj} − p_{vj}) + Σ″(p_{uj} − p_{vj}) = 1 − 1 = 0.

Since Σ′p_{vj} + Σ″p_{uj} ≥ rδ,

(8.41) Σ′(p_{uj} − p_{vj}) = 1 − Σ″p_{uj} − Σ′p_{vj} ≤ 1 − rδ.

Apply (8.40) and then (8.41):

p_{uj}^{(n+1)} − p_{vj}^{(n+1)} = Σ_k (p_{uk} − p_{vk}) p_{kj}^{(n)}
  ≤ Σ′(p_{uk} − p_{vk}) M_j^{(n)} + Σ″(p_{uk} − p_{vk}) m_j^{(n)}
  = Σ′(p_{uk} − p_{vk}) (M_j^{(n)} − m_j^{(n)})
  ≤ (1 − rδ)(M_j^{(n)} − m_j^{(n)}).

Since u and v are arbitrary, M_j^{(n+1)} − m_j^{(n+1)} ≤ (1 − rδ)(M_j^{(n)} − m_j^{(n)}). Therefore, M_j^{(n)} − m_j^{(n)} ≤ (1 − rδ)^n. It follows by (8.39) that m_j^{(n)} and M_j^{(n)} have a common limit π_j and that

(8.42) |p_{ij}^{(n)} − π_j| ≤ M_j^{(n)} − m_j^{(n)} ≤ (1 − rδ)^n.

Take A = 1 and ρ = 1 − rδ. Passing to the limit in (8.10) shows that the π_j are stationary probabilities. (Note that the proof thus far makes almost no use of the preceding theory.)

If the p_{ij} are not all positive, apply Lemma 2: Since there are only finitely many states, there exists an m such that p_{ij}^{(m)} > 0 for all i and j. By the case just treated, M_j^{(mt)} − m_j^{(mt)} ≤ ρ^t. Take A = ρ^{−1} and then replace ρ by ρ^{1/m}. ∎
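The contraction step of the proof can be observed directly: when all entries of P are positive, the column spreads M_j^{(n)} − m_j^{(n)} stay below (1 − rδ)^n. A sketch with an arbitrary all-positive matrix (the entries are made up):

```python
import numpy as np

P = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4],
              [0.5, 0.25, 0.25]])
r, delta = P.shape[0], float(P.min())
rho = 1 - r * delta                     # here 1 - 3*0.2 = 0.4

bound_ok, Pn = True, P.copy()
for n in range(1, 30):
    # M_j^(n) - m_j^(n), maximized over columns j
    spread = float((Pn.max(axis=0) - Pn.min(axis=0)).max())
    if spread > rho ** n + 1e-12:
        bound_ok = False
    Pn = Pn @ P
print(bound_ok)
```

The small additive tolerance only guards against floating-point rounding; the mathematical inequality (8.42) is exact.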
Example 8.14. Suppose that

P =  p_0      p_1      ···  p_{r−1}
     p_{r−1}  p_0      ···  p_{r−2}
     ···························
     p_1      p_2      ···  p_0

The rows of P are the cyclic permutations of the first row: p_{ij} = p_{j−i}, j − i reduced modulo r. Since the columns of P add to 1 as well as the rows, the steady-state equations (8.30) have the solution π_i = r^{−1}. If the p_i are all positive, the theorem implies that p_{ij}^{(n)} converges to r^{−1} at an exponential rate.

If X_0, Y_1, Y_2, ... are independent random variables with range {0, 1, ..., r−1}, if each Y_n has distribution {p_0, ..., p_{r−1}}, and if X_n = X_0 + Y_1 + ··· + Y_n, where the sum is reduced modulo r, then P[X_n = j] → r^{−1}. The X_n describe a random walk on a circle of points, and whatever the initial distribution, the positions become equally likely in the limit. ∎
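Example 8.14 is easy to check numerically: the circulant matrix built from any positive step distribution is doubly stochastic, and its powers flatten to 1/r. A sketch with a made-up step distribution on r = 5 points:

```python
import numpy as np

r = 5
p = np.array([0.5, 0.2, 0.1, 0.1, 0.1])   # distribution of each step Y_n
# circulant transition matrix: p_ij = p_{(j - i) mod r}
P = np.array([[p[(j - i) % r] for j in range(r)] for i in range(r)])

rows_ok = bool(np.allclose(P.sum(axis=1), 1))
cols_ok = bool(np.allclose(P.sum(axis=0), 1))   # doubly stochastic
Pn = np.linalg.matrix_power(P, 200)
uniform_err = float(np.abs(Pn - 1 / r).max())   # every entry approaches 1/r
print(rows_ok, cols_ok, uniform_err)
```

The double stochasticity is what forces the uniform stationary distribution; compare Problem 8.22.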
Optimal Stopping*

Assume throughout the rest of the section that S is finite. Consider a function τ on Ω for which τ(ω) is a nonnegative integer for each ω. Let F_n = σ(X_0, X_1, ..., X_n); τ is a stopping time or a Markov time if

(8.43) [ω: τ(ω) = n] ∈ F_n

for n = 0, 1, .... This is analogous to the condition (7.18) on the gambler's stopping time. It will be necessary to allow τ(ω) to assume the special value ∞, but only on a set of probability 0. This has no effect on the requirement (8.43), which concerns finite n only.

If f is a real function on the state space, then f(X_0), f(X_1), ... are simple random variables. Imagine an observer who follows the successive

* This topic may be omitted.
states X_0, X_1, ... of the system. He stops at time τ, when the state is X_τ (that is, X_{τ(ω)}(ω)), and receives a reward or payoff f(X_τ). The condition (8.43) prevents prevision on the part of the observer. This is a kind of game, the stopping time is a strategy, and the problem is to find a strategy that maximizes the expected payoff E[f(X_τ)]. The problem in Example 8.5 had this form; there S = {1, 2, ..., r + 1}, and the payoff function is f(i) = i/r for i ≤ r (set f(r+1) = 0).

If P(A) > 0 and Y = Σ_j y_j I_{B_j} is a simple random variable, the B_j forming a finite decomposition of Ω into F-sets, the conditional expected value of Y given A is defined by

E[Y | A] = Σ_j y_j P(B_j | A).

Denote by E_i conditional expected values for the case A = [X_0 = i]:

E_i[Y] = E[Y | X_0 = i] = Σ_j y_j P_i(B_j).

The stopping-time problem is to choose τ so as to maximize simultaneously E_i[f(X_τ)] for all initial states i.

If x lies in the range of f, which is finite, and if τ is everywhere finite, then [ω: f(X_{τ(ω)}(ω)) = x] = ∪_{n=0}^∞ [ω: τ(ω) = n, f(X_n(ω)) = x] lies in F, and so f(X_τ) is a simple random variable. In order that this always hold, put f(X_{τ(ω)}(ω)) = 0, say, if τ(ω) = ∞ (which happens only on a set of probability 0).

The game with payoff function f has at i the value

(8.44) v(i) = sup_τ E_i[f(X_τ)],

the supremum extending over all Markov times τ. It will turn out that the supremum here is achieved: there always exists an optimal stopping time. It will also turn out that there is an optimal τ that works for all initial states i. The problem is to calculate v(i) and find the best τ.

If the chain is irreducible, the system must pass through every state, and the best strategy is obviously to wait until the system enters a state for which f is maximal. This describes an optimal τ, and v(i) = max f for all i. For this reason the interesting cases are those in which some states are transient and others are absorbing (p_{ii} = 1).

A function φ on S is excessive, or superharmonic, if†

(8.45) φ(i) ≥ Σ_j p_{ij} φ(j),   i ∈ S.

† Compare the conditions (7.28) and (7.35).
In terms of conditional expectation the requirement is φ(i) ≥ E_i[φ(X_1)].

Lemma 4. The value function v is excessive.

PROOF. Given ε, choose for each j in S a "good" stopping time τ_j satisfying E_j[f(X_{τ_j})] ≥ v(j) − ε. By (8.43), [τ_j = n] = [(X_0, ..., X_n) ∈ I_{jn}] for some set I_{jn} of (n+1)-long sequences of states. Set τ = n + 1 (n ≥ 0) on the set [X_1 = j] ∩ [(X_1, ..., X_{n+1}) ∈ I_{jn}]; that is, take one step and then from the new state X_1 add on the "good" stopping time for that state. Then τ is a stopping time and

E_i[f(X_τ)] = Σ_{n=0}^∞ Σ_j Σ_k P_i[X_1 = j, (X_1, ..., X_{n+1}) ∈ I_{jn}, X_{n+1} = k] f(k)
  = Σ_{n=0}^∞ Σ_j Σ_k p_{ij} P_j[(X_0, ..., X_n) ∈ I_{jn}, X_n = k] f(k)
  = Σ_j p_{ij} E_j[f(X_{τ_j})].

Therefore, v(i) ≥ E_i[f(X_τ)] ≥ Σ_j p_{ij}(v(j) − ε) = Σ_j p_{ij} v(j) − ε. Since ε was arbitrary, v is excessive. ∎

Lemma 5. Suppose that φ is excessive.

(i) For all stopping times τ, φ(i) ≥ E_i[φ(X_τ)].
(ii) For all pairs of stopping times satisfying σ ≤ τ, E_i[φ(X_σ)] ≥ E_i[φ(X_τ)].

Part (i) says that for an excessive payoff function, τ ≡ 0 represents an optimal strategy.
PROOF. To prove (i), put τ_N = min{τ, N}. Then τ_N is a stopping time, and

(8.46) E_i[φ(X_{τ_N})] = Σ_{n=0}^{N−1} Σ_k P_i[τ = n, X_n = k] φ(k) + Σ_k P_i[τ ≥ N, X_N = k] φ(k).

Since [τ ≥ N] = [τ < N]^c ∈ F_{N−1}, the final sum here is by (8.13)

Σ_k Σ_j P_i[τ ≥ N, X_{N−1} = j, X_N = k] φ(k) = Σ_k Σ_j P_i[τ ≥ N, X_{N−1} = j] p_{jk} φ(k)
  ≤ Σ_j P_i[τ ≥ N, X_{N−1} = j] φ(j).

Substituting this into (8.46) leads to E_i[φ(X_{τ_N})] ≤ E_i[φ(X_{τ_{N−1}})]. Since τ_0 ≡ 0 and E_i[φ(X_0)] = φ(i), E_i[φ(X_{τ_N})] ≤ φ(i) for all N. But for τ(ω) finite, φ(X_{τ_N(ω)}(ω)) → φ(X_{τ(ω)}(ω)) (there is equality for large N), and so E_i[φ(X_{τ_N})] → E_i[φ(X_τ)] by Theorem 5.3.

The proof of (ii) is essentially the same. If τ_N = min{τ, σ + N}, then τ_N is a stopping time, and

E_i[φ(X_{τ_N})] = Σ_{m=0}^∞ Σ_{n=0}^{N−1} Σ_k P_i[σ = m, τ = m + n, X_{m+n} = k] φ(k)
  + Σ_{m=0}^∞ Σ_k P_i[σ = m, τ ≥ m + N, X_{m+N} = k] φ(k).

Since [σ = m, τ ≥ m + N] = [σ = m] − [σ = m, τ < m + N] ∈ F_{m+N−1}, again E_i[φ(X_{τ_N})] ≤ E_i[φ(X_{τ_{N−1}})] ≤ E_i[φ(X_{τ_0})]. Since τ_0 = σ, part (ii) follows from part (i) by another passage to the limit. ∎
Lemma 6. If an excessive function φ dominates the payoff function f, then it dominates the value function v as well.

PROOF. By definition, to say that g dominates h is to say that g(i) ≥ h(i) for all i. By Lemma 5, φ(i) ≥ E_i[φ(X_τ)] ≥ E_i[f(X_τ)] for all Markov times τ, and so φ(i) ≥ v(i) for all i. ∎

Since τ ≡ 0 is a stopping time, v dominates f. Lemmas 4 and 6 immediately characterize v:

Theorem 8.10. The value function v is the minimal excessive function dominating f.
There remains the problem of constructing the optimal strategy τ. Let M be the set of states i for which v(i) = f(i); M, the support set, is nonempty since it at least contains those i that maximize f. Let A = ∩_{n=0}^∞ [X_n ∉ M] be the event that the system never enters M. The following argument shows that P_i(A) = 0 for each i. As this is trivial if M = S, assume that M ≠ S. Choose δ > 0 so that f(i) < v(i) − δ for i ∈ S − M. Now E_i[f(X_τ)] = Σ_{n=0}^∞ Σ_k P_i[τ = n, X_n = k] f(k); replacing the f(k) by v(k) or v(k) − δ according as k ∈ M or k ∈ S − M gives E_i[f(X_τ)] ≤ E_i[v(X_τ)] − δP_i[X_τ ∈ S − M] ≤ E_i[v(X_τ)] − δP_i(A) ≤ v(i) − δP_i(A), the last inequality by Lemmas 4 and 5. Since this holds for every Markov time, taking the supremum over τ gives P_i(A) = 0. Whatever the initial state, the system is thus certain to enter the support set M.

Let τ_0(ω) = min[n: X_n(ω) ∈ M] be the hitting time for M. Then τ_0 is a Markov time, and τ_0 = 0 if X_0 ∈ M. It may be that X_n(ω) ∉ M for all n, in which case τ_0(ω) = ∞, but as just shown, the probability of this is 0.
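On a finite chain, v and the support set M can be computed by iterating v ← max(f, Pv) starting from f: the iterates increase to the minimal excessive function dominating f, which by Theorem 8.10 is the value function. A sketch, with an absorbing symmetric walk and a made-up payoff (this numerical scheme is an illustration, not a construction from the text):

```python
import numpy as np

P = np.array([
    [1.0, 0,   0,   0,   0  ],   # 0 and 4 absorbing
    [0.5, 0,   0.5, 0,   0  ],
    [0,   0.5, 0,   0.5, 0  ],
    [0,   0,   0.5, 0,   0.5],
    [0,   0,   0,   0,   1.0],
])
f = np.array([1.0, 0.1, 2.0, 0.3, 3.0])

v = f.copy()
for _ in range(500):                 # monotone convergence to the least
    v = np.maximum(f, P @ v)         # fixed point of v = max(f, Pv)

M = [i for i in range(5) if abs(v[i] - f[i]) < 1e-9]   # support set
print(v, M)
```

For this payoff the limit is the minimal concave majorant of f (compare Example 8.15 below): v = (1, 1.5, 2, 2.5, 3), with M = {0, 2, 4}, so the optimal rule stops at 0, 2, or 4.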
Theorem 8.11. The hitting time τ_0 is optimal: E_i[f(X_{τ_0})] = v(i) for all i.

PROOF. By the definition of τ_0, f(X_{τ_0}) = v(X_{τ_0}). Put φ(i) = E_i[f(X_{τ_0})] = E_i[v(X_{τ_0})]. The first step is to show that φ is excessive. If τ_1 = min[n: n ≥ 1, X_n ∈ M], then τ_1 is a Markov time and

E_i[v(X_{τ_1})] = Σ_{n=1}^∞ Σ_{k∈M} P_i[X_1 ∉ M, ..., X_{n−1} ∉ M, X_n = k] v(k)
  = Σ_{n=1}^∞ Σ_{k∈M} Σ_{j∈S} p_{ij} P_j[X_0 ∉ M, ..., X_{n−2} ∉ M, X_{n−1} = k] v(k)
  = Σ_j p_{ij} E_j[v(X_{τ_0})].

Since τ_0 ≤ τ_1, E_i[v(X_{τ_0})] ≥ E_i[v(X_{τ_1})] by Lemmas 4 and 5. This shows that φ is excessive. And φ(i) ≤ v(i) by the definition (8.44). If φ(i) ≥ f(i) is proved, it will follow by Theorem 8.10 that φ(i) ≥ v(i) and hence that φ(i) = v(i).

Since τ_0 = 0 for X_0 ∈ M, i ∈ M implies that φ(i) = E_i[f(X_0)] = f(i). Suppose that φ(i) < f(i) for some values of i in S − M, and choose i_0 to maximize f(i) − φ(i). Then ψ(i) = φ(i) + f(i_0) − φ(i_0) dominates f and is excessive, being the sum of a constant and an excessive function. By Theorem 8.10, ψ must dominate v, so that ψ(i_0) ≥ v(i_0), or f(i_0) ≥ v(i_0). But this implies that i_0 ∈ M, a contradiction. ∎

The optimal strategy need not be unique. If f is constant, for example, all strategies have the same value.
Example 8.15. For the symmetric random walk with absorbing barriers at 0 and r (Example 8.2), a function φ on S = {0, 1, ..., r} is excessive if φ(i) ≥ ½φ(i−1) + ½φ(i+1) for 1 ≤ i ≤ r − 1. The requirement is that φ give a concave function when extended by linear interpolation from S to the entire interval [0, r]. Hence v thus extended is the minimal concave function dominating f. The figure shows the geometry: the ordinates of the dots are the values of f, and the polygonal line describes v. The optimal strategy is to stop at a state for which the dot lies on the polygon.

[Figure: dots at heights f(i) over the states 0, 1, ..., r; the polygonal line above them is the minimal concave majorant v.]

If f(r) = 1 and f(i) = 0 for i < r, v is a straight line: v(i) = i/r. The optimal Markov time τ_0 is the hitting time for M = {0, r}, and v(i) = E_i[f(X_{τ_0})] is the probability of absorption in the state r. This gives another solution of the gambler's ruin problem for the symmetric case. ∎
v ( i ) > g( i ) =
( 8 . 47)
r
L
J- i + l
.
.( . � l) v ( j) ,
J J
1 S i < r.
By Theorem 8 . 1 0, v is the smallest function satisfying (8.47) and v(i) > l(i) == ijr, 1 � i s r. Since (8.47) puts no lower limit on v ( r), v(r) = l ( r) == 1, and r lies in the support set M. By minimality
( 8 . 48)
v( i ) = max{ I( i ) , g( i ) } ,
1
< i < r.
M , then I( i) == v ( i) � �( i) � E j + 1 ij- 1 ( j - 1) - 1I( j ) == I( i )Ej + 1 ( j - 1) � 1 , and hence r:; _! + 1 ( j - 1) - � 1 . 0� the ��er .hand,! ! thi.s ineq�ali�y hold� and 1 + 1 , . . . , r all lie m M , then g( 1 ) == E + 1 1) ( 1 - 1) I( 1 ) == I( 1 )E1 - ; + 1 ( J - 1) - 1 s l( i), so that i e M by (8.48). Therefore, M == { i,, i, + 1, . . . , r, r + 1 } , If i
e
_
;
_
;
_
;
136 where i r is determined by
(8 .49) If i
�+ 1,
PROBABILITY
1 + ··· + 1 <1 < + i, 1 -1 r
1
i, - 1
+ ! + ··· + 1 r
i,
-
1
< i,, so that i E M, then v (i) > /(i) and so, by (8.48),
...
i
'r,l
0
J-i+ l
0;
J( J -
1)
v( j)
(
+ .!.. . 1 + . . . + 1 1 1 r
It follows by backward induction starting with i (8.50)
v( i)
= Pr ...
i, r
(
r -
1, -
==
i, -
1 that
1 . 1 + . . . + _1_ 1 1 1, -
).
r -
)
is constant for 1 < i < i,. In the selection problem as originally posed, X1 == 1 . The optimal strategy is to stop with the first Xn that lies in M. The princess should therefore reject the first i, - 1 suitors and accept the next one who is preferable to all his predecessors (is dominant). The probability of success is p, as given by (8.50). Failure can happen two ways. Perhaps the first dominant suitor after i, is not the best of all suitors; in this case the princess will be unaware of failure. Perhaps no dominant suitor comes after i ,; in this case the princess is obliged to take the last suitor of all and may be well aware of failure. Recall that the problem was to maximize the chance of getting the best suitor of all rather than, say, the chance of getting a suitor in the top half. If r is large, (8.49) essentially requires that log r - log i, be near 1, so that i, rje . In this case, p, 1/e . Note that although the system starts in state 1 in the original problem, its resolution by means of the preceding theory requires consideration of all possible initial states. • =
=
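The threshold i_r from (8.49), the success probability (8.50), and the stopping rule itself can all be checked by simulation. A sketch (r = 20 is an arbitrary choice; for it the threshold works out to i_r = 8 and p_r ≈ 0.3842):

```python
import random

random.seed(7)

def threshold(r):
    # smallest i_r satisfying (8.49): 1/i_r + ... + 1/(r-1) <= 1
    s, i = 0.0, r
    while i >= 2 and s + 1.0 / (i - 1) <= 1:
        s += 1.0 / (i - 1)
        i -= 1
    return i

def success_formula(r, ir):
    # (8.50): p_r = ((i_r - 1)/r) * (1/(i_r - 1) + ... + 1/(r - 1))
    return (ir - 1) / r * sum(1.0 / m for m in range(ir - 1, r))

def simulate(r, ir, trials=100_000):
    wins = 0
    for _ in range(trials):
        ranks = random.sample(range(r), r)    # ranks[t]: quality of suitor t
        pick = r - 1                          # forced to take the last suitor
        for t in range(ir - 1, r):
            if t == 0 or ranks[t] > max(ranks[:t]):
                pick = t                      # first dominant suitor accepted
                break
        wins += ranks[pick] == r - 1          # success = overall best chosen
    return wins / trials

r = 20
ir = threshold(r)
pr = success_formula(r, ir)
win_rate = simulate(r, ir)
print(ir, pr, win_rate)
```

The simulated success rate agrees with (8.50) to within sampling error, and both are already close to the limiting value 1/e ≈ 0.368.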
This theory carries over in part to the case of infinite S, although this requires the general theory of expected values, since f(X_τ) may not be a simple random variable. Theorem 8.10 holds for infinite S if the payoff function is nonnegative and the value function is finite.† But then problems

† The only essential change in the argument is that Fatou's lemma (Theorem 16.3) must be used in place of Theorem 5.3 in the proof of Lemma 5.
arise : Optimal strategies may not exist, and the probability of hitting the support set M may be less than 1. Even if this probability is 1, the strategy of stopping on first entering M may be the worst one of all. t PROBLEMS 8.1. 8.2.
Show that (8.2) holds if the first and third terms are equal for all n , i 0 ' . . . ' in ' and j. Let Y0 , Y1 , be independent and identically distributed with P[ Y, = 1] == p, P[ Y, == 0] == q == 1 - p, p ¢ q. Put Xn == Y, + Y, + 1 (mod 2). Show that Xo ' x1 ' . . . is not a Markov chain even though p [ xn 1 == j I xn - 1 == i] == P [ Xn 1 == j ]. Does this last relation hold for all Markov chains? Why? Show by example that a function /( X0 ), /( X1 ), of a Markov chain need not be a Markov chain. 4.18 f Given a 2 X 2 stochastic matrix and initial probabilities w0 and w1 , define P for cylinders in sequence space by P[ w: a; ( w ) == u; , 1 < i < n ] == wu Pu u Pu _ u for sequences u 1 , , u n of O's and 1 ' s (Problem 4.18 is th� c�s� piJ == � 1 =:: p1 ). Extend P as a probability measure to fl0 and then to fl. Show that a1 ( w ), a 2 ( w ), . . . is a Markov chain with transition probabili ties piJ . Show that •
•
•
+
+
8.3. 8.4.
•
· · ·
8.5.
• • •
/,. . I)
oo
� �
k -0
p))�� ) ==
oo
� �
n
� �
n-1 m=1
j,.< !" >p � � - m ) == 1) ))
and prove that if j is transient, then E n P�j' > Show that if j is transient, then
8.6.
8.7.
• •
<
oo
oo
� �
n-1
p� n ) ' 1)
(compare Theorem 8.3(i)).
specialize to the case i == j: in addition to implying that i is transient (Theorem 8.2(i)), a finite value for E�_ 1 p1r > suffices to determine /;; exactly. Call { X; } a subsolution of (8.24) if X; < E 1 qiJx and 0 < X; < 1, i e U. Extending Lemma 1, show that a subsolution (X; } satisfies X; < o; : the solution { o; } of (8.24) dominates all subsolutions as well as all solutions. Show that if X; == E1 qiJ xi and - 1 < X; < 1, then { 1 X; 1 } is a subsolution of (8.24). Find solutions to (8.24) that are not maximal.
t See Problems 8.34 and 8.35.
138
8.8.
8.9.
PROBAB ILITY
Show by solving (8.27) that the unrestricted random walk on the line (Example 8.3) is persistent if and only if p == t . (a) Generalize an argument in the proof of Theorem 8.5 to show that /; k = p; k + EJ k piJJjk Generalize this further to •
•
JiI k == Jit(lk ) + . . . +Jit(kn ) + E P; [ XI
i• k
¢
k ' . . . ' xn - I
¢
k ' xn == j ] Jjk •
Take k == i. Show that /, > 0 if and only if P; [ XI -:1= i, . . . ' xn - 1 ¢ i, xn == j] > 0 for some n, and conclude that i is transient if and only if � ; < 1 for some j ¢ i such that fiJ > 0. (c) Show that an irreducible chain is transient if and only if for each i there is a j ¢ i such that Jj; < 1. 8. 10. Suppose that S == (0, 1, 2, . . . }, p00 == 1, and /;0 > 0 for all i. (a) Show that P; (Uj.... 1 [ Xn == j i.o.]) == 0 for all i. (b) Regard the state as the size of a population and interpret the conditions p00 == 1 and /;0 > 0 and the conclusion in part (a). 8.1 1. 8.6 f Show for an irreducible chain that (8.27) has a nontrivial solution if and only if there exists a nontrivial, bounded sequence { X; } (not necessarily nonnegative) satisfying X; == EJ piJ xJ , i -:1= i0• (See the remark following the proof of Theorem 8.5.) (b)
• ;0
8.12.
f Show that an irreducible chain is transient if and only if (for arbitrary i0 ) the system Y; == EJ PJ.J.Y.i ' i ¢ i0 (sum over all j) has a bounded, noncon stant solution { Y; , i e .s } .
8.13. Show that the P;-probabilities of ever leaving U for i e U are the minimal solution of the system
Z; == L PijZj + L Pi} ' i E U,
{ 8.51 )
jE U
0 S Z; < 1 ,
jE U
i E U.
The constraint z; < 1 can be dropped: the minimal solution automatically satisfies it, since z; 1 is a solution. 8.14. Show that supiJ n0(i, j) == oo is possible in Lemma 2. 8.15. Suppose that { 'IT; } solves (8.30), where it is assumed that E; l w; l < oo , so that the left side is well defined. Show in the irreducible case that the 'IT; are either all positive or all negative or all 0. Stationary probabilities thus exist in the irreducible case if and only if (8.30) has a nontrivial solution { 'IT; } (L, w, absolutely convergent). 8.16. Show by example that the coupled chain in the proof of Theorem 8.6 need not be irreducible if the original chain is not aperiodic. =
SECTION
8.
139
M.AllKOV CHAINS
8. 17. Suppose that S consists of all the integers and - 13 ' - 1 - p0. 0 - p0, + 1 Pk . k - 1 == q , Pk . k + 1 == p , Pk . k - 1 == p , Pk . k + 1 == q , n
.rO,
k < -1 k > 1. -
'
Show that the chain is irreducible and aperiodic. For which p's is the chain persistent? For which p 's are there stationary probabilities? 8.18. Show that the period of j is the greatest common divisor of the set ( 8.52) 8.19.
f
s
be nonnegative numbers with I == EC:.. 1 In Recurrent events. Let 11 , 12 , 1. Define u 1 , u 2 , recursively by u 1 == 11 and • •
• • •
•
( 8.53) (a) Show that I < 1 if and only if En un < oo .
(b)
Assume that I == 1, set
( 8.54)
p.
== E�_ 1 nln , and assume that
gcd[ n : n
�
1 , In > 0] == 1 .
Prove the renewal theorem: Under these assumptions, the limit u == lim n u n exists, and u > 0 if and only if p. < oo, in which case u == 1 Ip.. Although these definitions and facts are stated in purely analytical terms, they have a probabilistic interpretation: Imagine an event I that may occur at times 1, 2, . . . . Suppose In is the probability I occurs first at time n. Suppose further that at each occurrence of I the system starts anew, so that In is the probability that I next occurs n steps later. Such an I is called a recurrent event. If un is the probability that I occurs at time n , then (8.53) holds. The rec11rrent event I is called transient or persistent according as I < 1 or I == 1, it is called aperiodic if (8.54) holds, and if I == 1, p. is interpreted as the mean recurrence time. 8.20. Show for a two-state chain with S == {0, 1 } that pbQ> == p 1 0 + ( p00 p 1 0 ) p6{; _ 1+,. Derive an explicit formula for pbQ> and show that it converges to p 1 0/( p01 p 1 0 ) if all p, are positive. Calculate the other P1j> and verify that the limits satisfy (8. JO). Verify that the rate of convergence is exponen tial. 8.21. A thinker who owns r umbrellas travels back and forth between home and office, taking along an umbrella (if there is one at hand) in rain (probability p) but not in shine (probability q). Let the state be the number of umbrellas at hand, irrespective of whether the thinker is at home or at work. Set up the transition matrix and find the stationary probabilities. Find the steady-state probability of his getting wet, and show that five umbrellas will protect him at the 5% level against any climate (any p ) . •
140
PROBABILITY
8.22. (a) A transition matrix is doubly stochastic if Σ_i p_ij = 1 for each j. For a finite, irreducible, aperiodic chain with doubly stochastic transition matrix, show that the stationary probabilities are all equal.
(b) Generalize Example 8.14: Let S be a finite group, let p(i) be probabilities, and put p_ij = p(j·i⁻¹), where product and inverse refer to the group operation. Show that, if all the p(i) are positive, the states are all equally likely in the limit.
(c) Let S be the symmetric group on 52 elements. What has (b) to say about card shuffling?

8.23. A set C in S is closed if Σ_{j∈C} p_ij = 1 for i ∈ C: once the system enters C it cannot leave. Show that a chain is irreducible if and only if S has no proper closed subset.

8.24. ↑ Let T be the set of transient states and define persistent states i and j (if there are any) to be equivalent if f_ij > 0. Show that this is an equivalence relation on S − T and decomposes it into equivalence classes C_1, C_2, ..., so that S = T ∪ C_1 ∪ C_2 ∪ ···. Show that each C_m is closed and that f_ij = 1 for i and j in the same C_m.

8.25. 8.13 8.23 ↑ Let T be the set of transient states and let C be any closed set of persistent states. Show that the P_i-probabilities of eventual absorption in C for i ∈ T are the minimal solution of

(8.55)  y_i = Σ_{j∈T} p_ij y_j + Σ_{j∈C} p_ij,  i ∈ T,
        0 ≤ y_i ≤ 1,  i ∈ T.
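A minimal numeric sketch of (8.55), not from the text: for fair gambler's ruin on {0, 1, 2, 3, 4} (an assumed example) with 0 and 4 absorbing, the transient set is T = {1, 2, 3} and the target closed set is C = {4}; the absorption probabilities solve the linear system y = P_TT y + b, where b_i is the one-step probability of entering C. The classical answer is y_i = i/4.

```python
import numpy as np

# Fair gambler's ruin on {0,1,2,3,4}; 0 and 4 absorbing (assumed example).
# Transient states T = {1,2,3}; target closed set C = {4}.
P_TT = np.array([[0.0, 0.5, 0.0],
                 [0.5, 0.0, 0.5],
                 [0.0, 0.5, 0.0]])   # transitions within T
b = np.array([0.0, 0.0, 0.5])        # one-step probability of jumping into C

# (8.55): y = P_TT y + b, i.e. (I - P_TT) y = b.
y = np.linalg.solve(np.eye(3) - P_TT, b)
```

For a finite chain the solution of this system is unique, so minimality is automatic here; minimality matters when T is infinite, where (8.55) can have many solutions.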
8.26. Suppose that an irreducible chain has period t > 1. Show that S decomposes into sets S_0, ..., S_{t−1} such that p_ij > 0 only if i ∈ S_ν and j ∈ S_{ν+1} for some ν (ν + 1 reduced modulo t). Thus the system passes through the S_ν in cyclic succession.

8.27. ↑ Suppose that an irreducible chain of period t > 1 has a stationary distribution {π_j}. Show that, if i ∈ S_ν and j ∈ S_{ν+α} (ν + α reduced modulo t), then lim_n p_ij^(nt+α) = tπ_j. Show that lim_n n⁻¹ Σ_{m=1}^n p_ij^(m) = π_j for all i and j.

8.28. Suppose that {X_n} is a Markov chain with state space S and put Y_n = (X_n, X_{n+1}). Let T be the set of pairs (i, j) such that p_ij > 0 and show that {Y_n} is a Markov chain with state space T. Write down the transition probabilities. Show that, if {X_n} is irreducible and aperiodic, so is {Y_n}. Show that, if π_i are stationary probabilities for {X_n}, then π_i p_ij are stationary probabilities for {Y_n}.

8.29. 6.10 8.28 ↑ Suppose that the chain is finite, irreducible, and aperiodic and that the initial probabilities are the stationary ones. Fix a state i, let A_n = [X_n = i], and let N_n be the number of passages through i in the first n steps. Calculate α_n and β_n as defined by (5.37). Show that β_n − α² = O(1/n), so that n⁻¹N_n → π_i with probability 1. Show for a function f on the state space that n⁻¹ Σ_{k=1}^n f(X_k) → Σ_i π_i f(i) with probability 1. Show that n⁻¹ Σ_{k=1}^n g(X_k, X_{k+1}) → Σ_{ij} π_i p_ij g(i, j) for functions g on S × S.
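The pair-chain construction of Problem 8.28 is easy to check numerically. The sketch below (the particular two-state matrix is an assumed example, not from the text) builds the transition matrix Q of Y_n = (X_n, X_{n+1}) and verifies that ν(i, j) = π_i p_ij is stationary for it.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # assumed two-state transition matrix
pi = np.array([2 / 3, 1 / 3])        # its stationary vector: pi P = pi

# State space of Y_n = (X_n, X_{n+1}) is T = [(i,j): p_ij > 0];
# the transition probability from (i,j) to (j',k) is p_{j'k} if j' = j, else 0.
T = [(i, j) for i in range(2) for j in range(2) if P[i, j] > 0]
Q = np.array([[P[j, k] if j == jp else 0.0
               for (jp, k) in T] for (i, j) in T])

nu = np.array([pi[i] * P[i, j] for (i, j) in T])  # candidate stationary law
```

The identity ν Q = ν follows from Σ_i π_i p_ij = π_j, which is exactly what the assertions check.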
8.30. 8.29 ↑ If X_0(ω) = i_0, ..., X_n(ω) = i_n for states i_0, ..., i_n, put p_n(ω) = π_{i_0} p_{i_0 i_1} ··· p_{i_{n−1} i_n}, so that p_n(ω) is the probability of the observation observed. Show that −n⁻¹ log p_n(ω) → h = −Σ_{ij} π_i p_ij log p_ij with probability 1 if the chain is finite, irreducible, and aperiodic. Extend to this case the notions of source, entropy, and asymptotic equipartition.

8.31. A sequence {X_n} is a Markov chain of second order if P[X_{n+1} = j | X_0 = i_0, ..., X_n = i_n] = P[X_{n+1} = j | X_{n−1} = i_{n−1}, X_n = i_n] = p_{i_{n−1} i_n ; j}. Show that nothing really new is involved, because the sequence of pairs (X_n, X_{n+1}) is an ordinary Markov chain (of first order). Compare Problem 8.28. Generalize this idea to chains of order r.

8.32. Consider a chain on S = {0, 1, ..., r}, where 0 and r are absorbing states and p_{i,i+1} = p_i > 0, p_{i,i−1} = q_i = 1 − p_i > 0 for 0 < i < r. Identify state i with a point z_i on the line, where 0 = z_0 < ··· < z_r and the distance from z_i to z_{i+1} is q_i/p_i times that from z_{i−1} to z_i. Given a function φ on S, consider the associated function φ̄ on [0, z_r] defined at the z_i by φ̄(z_i) = φ(i) and in between by linear interpolation. Show that φ is excessive if and only if φ̄ is concave. Show that the probability of absorption in r for initial state i is t_{i−1}/t_{r−1}, where t_i = Σ_{k=0}^{i} (q_1 ··· q_k)/(p_1 ··· p_k). Deduce (7.1). Show that in the new scale the expected distance moved on each step is 0.

8.33. Suppose that a finite chain is irreducible and aperiodic. Show by Theorem 8.9 that an excessive function must be constant.

8.34.† Alter the chain in Example 8.13 so that q_0 = 1 − p_0 = 1 (the other p_i and q_i still positive). Let β = lim_n p_1 ··· p_n, and assume that β > 0. Define a payoff function by f(0) = 1 and f(i) = 1 − f_{i0} for i > 0. If X_0, ..., X_n are positive, put σ_n = n; otherwise let σ_n be the smallest k such that X_k = 0. Show that E_i[f(X_{σ_n})] → 1 as n → ∞, so that v(i) = 1. Thus the support set is M = {0}, and for an initial state i > 0 the probability of ever hitting M is f_{i0} < 1. For an arbitrary finite stopping time τ, choose n so that P_i[τ ≤ n = σ_n] > 0. Then E_i[f(X_τ)] ≤ 1 − f_{i+n,0} P_i[τ ≤ n = σ_n] < 1. Thus no strategy achieves the value v(i) (except of course for i = 0).

8.35. ↑ Let the chain be as in the preceding problem, but assume that β = 0, so that f_{i0} = 1 for all i. Suppose that λ_1, λ_2, ... exceed 1 and that λ_1 ··· λ_n → λ < ∞; put f(0) = 0 and f(i) = λ_1 ··· λ_{i−1}/(p_1 ··· p_{i−1}). For an arbitrary (finite) stopping time τ, the event [τ = n] must have the form [(X_0, ..., X_n) ∈ I_n] for some set I_n of (n + 1)-long sequences of states. Show that for each i there is at most one n ≥ 0 such that (i, i + 1, ..., i + n) ∈ I_n. If there is no such n, then E_i[f(X_τ)] = 0. If there is one, then the only possible values of E_i[f(X_τ)] are

0, f(i), p_i f(i + 1) = f(i)λ_i, p_i p_{i+1} f(i + 2) = f(i)λ_i λ_{i+1}, ...,

and hence v(i) = f(i)λ/(λ_1 ··· λ_{i−1}) for i ≥ 1; no strategy achieves this value. The support set is M = {0}, and the hitting time τ_0 for M is finite, but E_i[f(X_{τ_0})] = 0.

† The last two problems involve expected values for random variables with infinite range.

SECTION 9. LARGE DEVIATIONS AND THE LAW OF THE ITERATED LOGARITHM*
It is interesting in connection with the strong law of large numbers to estimate the rate at which S_n/n converges to the mean m. The proof of the strong law used upper bounds for the probabilities P[|S_n − m| ≥ a] for large a. Accurate upper and lower bounds for these probabilities will lead to the law of the iterated logarithm, a theorem giving very precise rates for S_n/n → m. The first concern will be to estimate the probability of large deviations from the mean, which will require the method of moment generating functions. The estimates will be applied first to a problem in statistics and then to the law of the iterated logarithm.

Moment Generating Functions
Let X be a simple random variable assuming values x_1, ..., x_l with respective probabilities p_1, ..., p_l. Its moment generating function is

(9.1) M(t) = E[e^{tX}] = Σ_{i=1}^{l} p_i e^{t x_i}.

(See (5.16) for expected values of functions of random variables.) This function, defined for all real t, can be regarded as associated with X itself or as associated with its distribution, that is, with the measure on the line having mass p_i at x_i (see (5.9)). If c = max_i |x_i|, the partial sums of the series e^{tx} = Σ_{k=0}^∞ t^k x^k/k! are bounded by e^{|t|c}, and so the corollary to Theorem 5.3 applies:

(9.2) M(t) = Σ_{k=0}^∞ (t^k/k!) E[X^k].

Thus M(t) has a Taylor expansion, and as follows from the general theory [A29], the coefficient of t^k must be M^(k)(0)/k!. Thus

(9.3) E[X^k] = M^(k)(0), k = 1, 2, ....

* This section may be omitted.
Furthermore, term-by-term differentiation in (9.1) gives

M^(k)(t) = Σ_{i=1}^{l} p_i x_i^k e^{t x_i} = E[X^k e^{tX}];

taking t = 0 here gives (9.3) again. Thus the moments of X can be calculated by successive differentiation, whence M(t) gets its name. Note that M(0) = 1.

Example 9.1. If X assumes the values 1 and 0 with probabilities p and q = 1 − p, as in Bernoulli trials, its moment generating function is M(t) = pe^t + q. The first two moments are M'(0) = p and M''(0) = p, and the variance is p − p² = pq. •
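The identity E[X^k] = M^(k)(0) in (9.3) is easy to check numerically. The sketch below (an illustration, not from the text; the value of p is arbitrary) approximates M'(0) and M''(0) for the Bernoulli generating function of Example 9.1 by central finite differences.

```python
import math

p = 0.3
q = 1 - p
M = lambda t: p * math.exp(t) + q   # moment generating function of Example 9.1

h = 1e-5
M1 = (M(h) - M(-h)) / (2 * h)              # ~ M'(0)  = E[X]    = p
M2 = (M(h) - 2 * M(0.0) + M(-h)) / h**2    # ~ M''(0) = E[X^2]  = p
var = M2 - M1**2                            # ~ Var[X] = pq
```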
If X_1, ..., X_n are independent, then for each t (see the argument following (5.7)), e^{tX_1}, ..., e^{tX_n} are also independent. Let M and M_1, ..., M_n be the respective moment generating functions of S = X_1 + ··· + X_n and of X_1, ..., X_n; of course, e^{tS} = Π_i e^{tX_i}. Since by (5.21) expected values multiply for independent random variables, there results the fundamental relation

(9.4) M(t) = M_1(t) ··· M_n(t).

This is an effective way of calculating the moment generating function of the sum S. The real interest, however, centers on the distribution of S, and so it is important to know that distributions can in principle be recovered from their moment generating functions. Consider along with (9.1) another finite exponential sum N(t) = Σ_j q_j e^{t y_j}, and suppose that M(t) = N(t) for all t. If x_{i_0} = max_i x_i and y_{j_0} = max_j y_j, then M(t) ~ p_{i_0} e^{t x_{i_0}} and N(t) ~ q_{j_0} e^{t y_{j_0}} as t → ∞, and so x_{i_0} = y_{j_0} and p_{i_0} = q_{j_0}. The same argument now applies to Σ_{i ≠ i_0} p_i e^{t x_i} = Σ_{j ≠ j_0} q_j e^{t y_j}, and it follows inductively that, with appropriate relabeling, x_i = y_i and p_i = q_i for each i. Thus the function (9.1) does uniquely determine the x_i and p_i.

Example 9.2. If X_1, ..., X_n are independent, each assuming values 1 and 0 with probabilities p and q, then S = X_1 + ··· + X_n is the number of successes in n Bernoulli trials. By (9.4) and Example 9.1, S has the moment generating function

M(t) = (pe^t + q)^n = Σ_{k=0}^{n} (n choose k) p^k q^{n−k} e^{tk}.
The right-hand form shows this to be the moment generating function of a distribution with mass (n choose k) p^k q^{n−k} at the integer k, 0 ≤ k ≤ n. The uniqueness just established therefore yields the standard fact that P[S = k] = (n choose k) p^k q^{n−k}. •

The cumulant generating function of X (or of its distribution) is

(9.5) C(t) = log M(t) = log E[e^{tX}].

(Note that M(t) is strictly positive.) Since C' = M'/M and C'' = (MM'' − (M')²)/M², and since M(0) = 1,

(9.6) C(0) = 0, C'(0) = E[X], C''(0) = Var[X].

Let m_k = E[X^k]. The leading term in (9.2) is m_0 = 1, and so a formal expansion of the logarithm in (9.5) gives

(9.7) C(t) = Σ_{v=1}^∞ ((−1)^{v+1}/v) (Σ_{k=1}^∞ (m_k/k!) t^k)^v.
Since M(t) → 1 as t → 0, this expansion is valid for t in some neighborhood of 0. By the theory of series, the powers on the right can be expanded and the terms with a common factor t^i collected together. This gives an expansion

(9.8) C(t) = Σ_{i=1}^∞ (c_i/i!) t^i,

valid in some neighborhood of 0. The c_i are the cumulants of X. Equating coefficients in the expansions (9.7) and (9.8) leads to c_1 = m_1 and c_2 = m_2 − m_1², which checks with (9.6). Each c_i can be expressed as a polynomial in m_1, ..., m_i and conversely, although the calculations soon become tedious. If E[X] = 0, however, so that m_1 = c_1 = 0, it is not hard to check that

(9.9) c_2 = m_2, c_3 = m_3, c_4 = m_4 − 3m_2².

Taking logarithms converts the multiplicative relation (9.4) into the additive relation
(9.10) C(t) = C_1(t) + ··· + C_n(t)

for the corresponding cumulant generating functions; it is valid in the presence of independence. By this and the definition (9.8), it follows that cumulants add for independent random variables.

Clearly, M''(t) = E[X² e^{tX}] ≥ 0. Since (M'(t))² = E²[X e^{tX}] ≤ E[e^{tX}] E[X² e^{tX}] = M(t)M''(t) by Schwarz's inequality (5.32), C''(t) ≥ 0. Thus the moment generating function and the cumulant generating function are both convex.
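A quick numeric illustration, not from the text (the centered Bernoulli distribution and step size are assumed choices): with m_1 = 0, the relation c_4 = m_4 − 3m_2² from (9.9) can be cross-checked against a finite-difference fourth derivative of C(t) = log M(t).

```python
import math

p, q = 0.3, 0.7

# Moments of the centered variable X - p (value 1-p w.p. p, value -p w.p. q).
m2 = p * q                    # variance
m4 = p * q * (1 - 3 * p * q)  # fourth central moment of a Bernoulli

# (9.9): with m_1 = 0 the fourth cumulant is c_4 = m_4 - 3 m_2^2.
c4 = m4 - 3 * m2**2

# Cross-check against the fourth derivative of C(t) = log M(t) at 0
# via a five-point central difference (C here is for the centered variable).
def C(t):
    return math.log(p * math.exp(t * (1 - p)) + q * math.exp(-t * p))

h = 0.05
c4_numeric = (C(2 * h) - 4 * C(h) + 6 * C(0.0) - 4 * C(-h) + C(-2 * h)) / h**4
```

Algebraically c_4 = pq(1 − 6pq) here, which is negative for p = 0.3; the finite difference reproduces it to a few decimal places.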
Large Deviations

Let Y be a simple random variable assuming values y_j with probabilities p_j. The problem is to estimate P[Y ≥ a] when Y has mean 0 and a is positive. It is notationally convenient to subtract a away from Y and instead estimate P[Y ≥ 0] when Y has negative mean. Assume then that

(9.11) E[Y] < 0, P[Y > 0] > 0,

the second assumption to avoid trivialities. Let M(t) = Σ_j p_j e^{t y_j} be the moment generating function of Y. Then M'(0) < 0 by the first assumption in (9.11), and M(t) → ∞ as t → ∞ by the second. Since M(t) is convex, it has its minimum ρ at a positive argument τ:

(9.12) inf_t M(t) = M(τ) = ρ, 0 < ρ < 1, τ > 0.

Construct (on an entirely irrelevant probability space) an auxiliary random variable Z such that

(9.13) P[Z = y_j] = e^{τ y_j} p_j / ρ

for each y_j in the range of Y. Note that the probabilities on the right do add to 1. The moment generating function of Z is

(9.14) E[e^{tZ}] = Σ_j e^{(τ+t) y_j} p_j / ρ = M(τ + t)/ρ,
and therefore

(9.15) E[Z] = M'(τ)/ρ = 0,  s² = E[Z²] = M''(τ)/ρ > 0.

For all positive t, P[Y ≥ 0] = P[e^{tY} ≥ 1] ≤ M(t) by Markov's inequality (5.27), and hence

(9.16) P[Y ≥ 0] ≤ ρ.
Inequalities in the other direction are harder to obtain. If Σ' denotes summation over those indices j for which y_j ≥ 0, then

(9.17) P[Y ≥ 0] = Σ' p_j = ρ Σ' e^{−τ y_j} P[Z = y_j].

Put the final sum here in the form e^{−θ}, and let p = P[Z ≥ 0]. Since log x is concave, Jensen's inequality (5.29) gives

−θ = log Σ' e^{−τ y_j} p⁻¹ P[Z = y_j] + log p ≥ Σ' (−τ y_j) p⁻¹ P[Z = y_j] + log p.

By (9.15) and Lyapounov's inequality (5.33),

Σ' y_j p⁻¹ P[Z = y_j] ≤ p⁻¹ E[|Z|] ≤ p⁻¹ E^{1/2}[Z²] = s/p.

The last two inequalities give

(9.18) 0 ≤ θ ≤ τs/P[Z ≥ 0] − log P[Z ≥ 0].

This proves the following result.

Theorem 9.1. Suppose that Y satisfies (9.11). Define ρ and τ by (9.12), let Z be a random variable with distribution (9.13), and define s² by (9.15). Then P[Y ≥ 0] = ρe^{−θ}, where θ satisfies (9.18).
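The quantities in Theorem 9.1 are easy to compute for a concrete Y. The sketch below is illustrative only (the two-point distribution is an assumed example): it locates ρ = inf_t M(t) by ternary search on the convex function M and confirms the upper bound (9.16).

```python
import math

ys = [-1.0, 1.0]
ps = [0.8, 0.2]   # E[Y] = -0.6 < 0 and P[Y > 0] = 0.2 > 0, as in (9.11)

def M(t):
    return sum(p * math.exp(t * y) for y, p in zip(ys, ps))

# M is convex with its minimum at some tau > 0; ternary search on [0, 50].
lo, hi = 0.0, 50.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if M(m1) < M(m2):
        hi = m2
    else:
        lo = m1
tau = (lo + hi) / 2
rho = M(tau)                 # here tau = log 2 and rho = 0.8 in closed form

exact = sum(p for y, p in zip(ys, ps) if y >= 0)   # P[Y >= 0]
```

Here the bound (9.16) is far from tight (0.2 ≤ 0.8); its strength appears only for sums of many independent copies, as in Chernoff's theorem below.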
To use (9.18) requires a lower bound for P[Z ≥ 0].

Theorem 9.2. If E[Z] = 0, E[Z²] = s², and E[Z⁴] = ξ⁴ > 0, then P[Z ≥ 0] ≥ s⁴/4ξ⁴.†
PROOF. Let Z⁺ = Z·I[Z ≥ 0] and Z⁻ = −Z·I[Z < 0]. Then Z⁺ and Z⁻ are nonnegative, Z = Z⁺ − Z⁻, Z² = (Z⁺)² + (Z⁻)², and

(9.19) s² = E[(Z⁺)²] + E[(Z⁻)²].

Let p = P[Z ≥ 0]. By Schwarz's inequality (5.32),

E[(Z⁺)²] = E[Z² I[Z ≥ 0]] ≤ E^{1/2}[Z⁴] E^{1/2}[I[Z ≥ 0]] = ξ² p^{1/2}.

By Hölder's inequality (5.31) (for p = 3/2 and q = 3),

E[(Z⁻)²] = E[(Z⁻)^{2/3} (Z⁻)^{4/3}] ≤ E^{2/3}[Z⁻] E^{1/3}[(Z⁻)⁴] ≤ E^{2/3}[Z⁻] ξ^{4/3}.

Since E[Z] = 0, E[Z⁻] = E[Z⁺], and another application of Hölder's inequality (for p = 4 and q = 4/3) gives

E[Z⁺] = E[Z I[Z ≥ 0]] ≤ E^{1/4}[Z⁴] E^{3/4}[I[Z ≥ 0]] = ξ p^{3/4}.

Combining these three inequalities with (9.19) gives

s² ≤ ξ² p^{1/2} + (ξ p^{3/4})^{2/3} ξ^{4/3} = 2 p^{1/2} ξ²,

and hence p ≥ s⁴/4ξ⁴. •
Chernoff's Theorem*

Theorem 9.3. Let X_1, X_2, ... be independent, identically distributed simple random variables satisfying E[X_n] < 0 and P[X_n > 0] > 0, let M(t) be their common moment generating function, and put ρ = inf_t M(t). Then

(9.20) lim_{n→∞} n⁻¹ log P[X_1 + ··· + X_n ≥ 0] = log ρ.
† For a related result, see Problem 25.21.
* This theorem is not needed for the law of the iterated logarithm, Theorem 9.5.
PROOF. Put Y_n = X_1 + ··· + X_n. Then E[Y_n] < 0 and P[Y_n > 0] ≥ Pⁿ[X_1 > 0] > 0, and so the hypotheses of Theorem 9.1 are satisfied. Define ρ_n and τ_n by inf_t M_n(t) = M_n(τ_n) = ρ_n, where M_n(t) is the moment generating function of Y_n. Since M_n(t) = Mⁿ(t), it follows that ρ_n = ρⁿ and τ_n = τ, where M(τ) = ρ.

Let Z_n be the analogue for Y_n of the Z described by (9.13). Its moment generating function (see (9.14)) is M_n(τ + t)/ρⁿ = (M(τ + t)/ρ)ⁿ. This is also the moment generating function of V_1 + ··· + V_n for independent random variables V_1, ..., V_n, each having moment generating function M(τ + t)/ρ. Now each V_i has (see (9.15)) mean 0 and some positive variance σ² and fourth moment ξ⁴ independent of i. Since Z_n must have the same moments as V_1 + ··· + V_n, it has mean 0, variance s_n² = nσ², and fourth moment ξ_n⁴ = nξ⁴ + 3n(n − 1)σ⁴ = O(n²) (see (6.2)). By Theorem 9.2, P[Z_n ≥ 0] ≥ s_n⁴/4ξ_n⁴ ≥ a for some positive a independent of n. By Theorem 9.1 then, P[Y_n ≥ 0] = ρⁿ e^{−θ_n}, where 0 ≤ θ_n ≤ τ_n s_n a⁻¹ − log a = τa⁻¹σ√n − log a. This gives (9.20), and shows, in fact, that the rate of convergence is O(n^{−1/2}). •

This result is important in the theory of statistical hypothesis testing. An informal treatment of the Bernoulli case will illustrate the connection. Suppose S_n = X_1 + ··· + X_n, where the X_i are independent and assume the values 1 and 0 with probabilities p and q. Now P[S_n ≥ na] = P[Σ_{k=1}^n (X_k − a) ≥ 0], and Chernoff's theorem applies if p < a < 1. In this case M(t) = E[e^{t(X_1 − a)}] = e^{−ta}(pe^t + q). Minimizing this shows that the ρ of Chernoff's theorem satisfies
−log ρ = K(a, p) = a log(a/p) + b log(b/q),

where b = 1 − a. By (9.20), n⁻¹ log P[S_n ≥ na] → −K(a, p); express this as

(9.21) P[S_n ≥ na] = e^{−nK(a,p) + o(n)}.

Suppose now that p is unknown and that there are two competing hypotheses concerning its value, the hypothesis H_1 that p = p_1 and the hypothesis H_2 that p = p_2, where p_1 < p_2. Given the observed results X_1, ..., X_n of n Bernoulli trials, one decides in favor of H_2 if S_n ≥ na and in favor of H_1 if S_n < na, where a is some number satisfying p_1 < a < p_2. The problem is to find an advantageous value for the threshold a. By (9.21),

(9.22) P_{p_1}[S_n ≥ na] = e^{−nK(a, p_1) + o(n)},

where the notation indicates that the probability is calculated for p = p_1, that is, under the assumption of H_1. By symmetry,

(9.23) P_{p_2}[S_n < na] = e^{−nK(a, p_2) + o(n)}.
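The exponential rate in (9.21) can be checked directly against exact binomial probabilities. The sketch below is illustrative only (the particular p, a, and n are arbitrary choices, not from the text).

```python
import math

def K(a, p):
    # K(a, p) = a log(a/p) + b log(b/q), with b = 1 - a, q = 1 - p
    b, q = 1 - a, 1 - p
    return a * math.log(a / p) + b * math.log(b / q)

def binom_tail(n, p, k0):
    # exact P[S_n >= k0] for S_n ~ Binomial(n, p)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k0, n + 1))

p, a, n = 0.3, 0.5, 400
rate = -math.log(binom_tail(n, p, math.ceil(n * a))) / n
# rate should approximate K(a, p) for large n, up to the o(n)/n correction
```

The discrepancy between `rate` and K(a, p) is of order (log n)/n, coming from the polynomial prefactor hidden in the o(n) term of (9.21).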
The left sides of (9.22) and (9.23) are the probabilities of erroneously deciding in favor of H_2 when H_1 is, in fact, true and of erroneously deciding in favor of H_1 when H_2 is, in fact, true; these are the probabilities describing the level and power of the test. Suppose a is chosen so that K(a, p_1) = K(a, p_2), which makes the two error probabilities approximately equal. This constraint gives for a a linear equation with solution

(9.24) a = a(p_1, p_2) = log(q_1/q_2) / (log(p_2/p_1) + log(q_1/q_2)),

where q_i = 1 − p_i. The common error probability is approximately e^{−nK} for this value of a, and so the larger K(a, p_1) is, the easier it is to distinguish statistically between p_1 and p_2.

Although K(a(p_1, p_2), p_1) is a complicated function, it has a simple approximation for p_1 near p_2. As x → 0, log(1 + x) = x − x²/2 + O(x³). Using this in the definition of K and collecting terms gives

(9.25) K(p + x, p) = x²/(2pq) + O(x³), x → 0.

Fix p_1 = p, and let p_2 = p + t; (9.24) becomes a function ψ(t) of t, and expanding the logarithms gives

(9.26) ψ(t) = p + t/2 + O(t²), t → 0,

after some reductions. Finally, (9.25) and (9.26) together imply that

(9.27) K(a(p, p + t), p) = t²/(8pq) + O(t³), t → 0.
In distinguishing p_1 = p from p_2 = p + t for small t, if a is chosen to equalize the two error probabilities, then their common value is about e^{−nt²/8pq}. For t fixed, the nearer p is to 1/2, the larger this probability is and the more difficult it is to distinguish p from p + t. As an example, compare p = .1 with p = .5. Now .36nt²/8(.1)(.9) = nt²/8(.5)(.5). With a sample only 36 percent as large, .1 can therefore be distinguished from .1 + t with about the same precision as .5 can be distinguished from .5 + t.
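The threshold formula (9.24) and the approximation (9.27) can be verified numerically; the following sketch is an illustration with assumed values of p and t, not part of the text.

```python
import math

def K(a, p):
    b, q = 1 - a, 1 - p
    return a * math.log(a / p) + b * math.log(b / q)

def threshold(p1, p2):
    # (9.24): the a that equalizes K(a, p1) = K(a, p2)
    q1, q2 = 1 - p1, 1 - p2
    return math.log(q1 / q2) / (math.log(p2 / p1) + math.log(q1 / q2))

p, t = 0.5, 0.01
a = threshold(p, p + t)        # close to the midpoint p + t/2, per (9.26)
Kv = K(a, p)                   # common error exponent
approx = t * t / (8 * p * (1 - p))   # the leading term in (9.27)
```

By construction K(a, p) = K(a, p + t) exactly, and for small t the common exponent agrees with t²/8pq to within a relative error of order t.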
The Law of the Iterated Logarithm

The analysis of the rate at which S_n/n approaches the mean depends on the following variant of the theorem on large deviations.
Theorem 9.4. Let S_n = X_1 + ··· + X_n, where the X_n are independent and identically distributed simple random variables with mean 0 and variance 1. If a_n are constants satisfying

(9.28) a_n → ∞, a_n = o(n^{1/6}),

then

(9.29) P[S_n ≥ a_n √n] = e^{−(a_n²/2)(1 + ζ_n)}

for a sequence ζ_n going to 0.

PROOF. Put Y_n = S_n − a_n √n = Σ_{k=1}^n (X_k − a_n/√n). Then E[Y_n] < 0. Since X_1 has mean 0 and variance 1, P[X_1 > 0] > 0, and it follows by (9.28) that P[X_1 > a_n/√n] > 0 for n sufficiently large, in which case P[Y_n > 0] ≥ Pⁿ[X_1 − a_n/√n > 0] > 0. Thus Theorem 9.1 applies to Y_n for all large enough n.

Let M_n(t), ρ_n, τ_n, and Z_n be associated with Y_n as in the theorem. If m(t) and c(t) are the moment and cumulant generating functions of the X_n, then M_n(t) is the nth power of the moment generating function e^{−t a_n/√n} m(t) of X_1 − a_n/√n, and so Y_n has cumulant generating function

(9.30) C_n(t) = −t a_n √n + n c(t).

Since τ_n is the unique minimum of C_n(t), and since C_n'(t) = −a_n √n + n c'(t), τ_n is determined by the equation c'(τ_n) = a_n/√n. Since X_1 has mean 0 and variance 1, it follows by (9.6) that

(9.31) c(0) = c'(0) = 0, c''(0) = 1.

Now c'(t) is nondecreasing because c(t) is convex, and since c'(τ_n) = a_n/√n goes to 0, τ_n must therefore go to 0 as well and must in fact be O(a_n/√n). By the second-order mean-value theorem for c'(t), a_n/√n = c'(τ_n) = τ_n + O(τ_n²), from which follows

(9.32) τ_n = a_n/√n + O(a_n²/n).

By the third-order mean-value theorem for c(t),

log ρ_n = C_n(τ_n) = −τ_n a_n √n + n c(τ_n) = −τ_n a_n √n + (n/2) τ_n² + O(n τ_n³).

Applying (9.32) gives

(9.33) log ρ_n = −a_n²/2 + O(a_n³/√n) = −(a_n²/2)(1 + o(1)).
Now (see (9.14)) Z_n has moment generating function M_n(τ_n + t)/ρ_n and (see (9.30)) cumulant generating function D_n(t) = C_n(τ_n + t) − log ρ_n = −(τ_n + t) a_n √n + n c(t + τ_n) − log ρ_n. The mean of Z_n is D_n'(0) = 0. Its variance s_n² is D_n''(0); by (9.31), this is

(9.34) s_n² = n c''(τ_n) = n(c''(0) + O(τ_n)) = n(1 + o(1)).
The fourth cumulant of Z_n is D_n''''(0) = n c''''(τ_n) = O(n). By formula (9.9) relating moments and cumulants (applicable because E[Z_n] = 0), E[Z_n⁴] = 3s_n⁴ + D_n''''(0). Therefore E[Z_n⁴]/s_n⁴ → 3, and it follows by Theorem 9.2 that there exists an a such that P[Z_n ≥ 0] ≥ a > 0 for all sufficiently large n. By Theorem 9.1, P[Y_n ≥ 0] = ρ_n e^{−θ_n} with 0 ≤ θ_n ≤ τ_n s_n a⁻¹ − log a. By (9.28), (9.32), and (9.34), θ_n = O(a_n) = o(a_n²), and it follows by (9.33) that P[Y_n ≥ 0] = e^{−(a_n²/2)(1 + o(1))}. •

The law of the iterated logarithm is this:

Theorem 9.5. Let S_n = X_1 + ··· + X_n, where the X_n are independent, identically distributed simple random variables with mean 0 and variance 1. Then

(9.35) P[limsup_n S_n/√(2n log log n) = 1] = 1.
Equivalent to (9.35) is the assertion that for positive ε

(9.36) P[S_n ≥ (1 + ε)√(2n log log n) i.o.] = 0

and

(9.37) P[S_n ≥ (1 − ε)√(2n log log n) i.o.] = 1.

The set in (9.35) is, in fact, the intersection over positive rational ε of the sets in (9.37) minus the union over positive rational ε of the sets in (9.36).

The idea of the proof is this. Write

(9.38) φ(n) = √(2n log log n).

If A_n^± = [S_n ≥ (1 ± ε)φ(n)], then by (9.29), P(A_n^±) is near (log n)^{−(1±ε)²}. If n_k increases exponentially, say n_k ~ θ^k for θ > 1, then P(A_{n_k}^±) is of the order k^{−(1±ε)²}. Now Σ_k k^{−(1±ε)²} converges if the sign is + and diverges if
the sign is −. It will follow by the first Borel–Cantelli lemma that there is probability 0 that A_{n_k}^+ occurs for infinitely many k. In proving (9.36), an extra argument is required to get around the fact that the A_n^+ for n ≠ n_k must also be accounted for (this involves choosing θ near 1). If the A_{n_k}^− were independent, it would follow by the second Borel–Cantelli lemma that with probability 1, A_{n_k}^− occurs for infinitely many k, which would in turn imply (9.37). An extra argument is required to get around the fact that the A_{n_k}^− are dependent (this involves choosing θ large).

For the proof of (9.36) a preliminary result is needed. Put M_k = max{S_0, S_1, ..., S_k}, where S_0 = 0.

Theorem 9.6. If the X_k are independent with mean 0 and variance 1, then for a ≥ √2,

(9.39) P[M_n ≥ a√n] ≤ 2P[S_n ≥ (a − √2)√n].

PROOF. If A_j = [M_{j−1} < a√n ≤ M_j], then

P[M_n ≥ a√n] − P[S_n ≥ (a − √2)√n] ≤ Σ_{j=1}^n P(A_j ∩ [S_n < (a − √2)√n]) ≤ Σ_{j=1}^n P(A_j) P[|S_n − S_j| > √(2n)].

Since S_n − S_j has variance n − j, it follows by independence and Chebyshev's inequality that the probability in the sum is at most (n − j)/2n ≤ 1/2; hence

P[M_n ≥ a√n] − P[S_n ≥ (a − √2)√n] ≤ (1/2) Σ_{j=1}^n P(A_j) = (1/2) P[M_n ≥ a√n],

which gives (9.39). •
PROOF OF (9.36). Given ε, choose θ so that θ > 1 but θ² < 1 + ε. Let n_k = ⌊θ^k⌋ and x_k = θ(2 log log n_k)^{1/2}. By (9.29) and (9.39),
P[M_{n_k} ≥ x_k √n_k] ≤ 2P[S_{n_k} ≥ (x_k − √2)√n_k] = 2 exp[−((x_k − √2)²/2)(1 + ζ_k)],

where ζ_k → 0. The negative of the exponent is asymptotically θ² log k and hence for large k exceeds θ log k, so that

P[M_{n_k} ≥ x_k √n_k] ≤ 2k^{−θ}.

Since θ > 1, it follows by the first Borel–Cantelli lemma that there is probability 0 that (see (9.38))

(9.40) M_{n_k} ≥ θφ(n_k)

for infinitely many k. Suppose that n_{k−1} < n ≤ n_k and that

(9.41) S_n > (1 + ε)φ(n).

Now φ(n) ≥ φ(n_{k−1}) ~ θ^{−1/2}φ(n_k); hence, by the choice of θ, (1 + ε)φ(n) ≥ θφ(n_k) if k is large enough. Thus for sufficiently large k, (9.41) implies (9.40) (if n_{k−1} < n ≤ n_k), and there is therefore probability 0 that (9.41) holds for infinitely many n. •

PROOF OF (9.37). Given ε, choose an integer θ so large that 3θ^{−1/2} < ε. Take n_k = θ^k. Now n_k − n_{k−1} → ∞, and (9.29) applies with n = n_k − n_{k−1} and a_n = x_k/√(n_k − n_{k−1}), where x_k = (1 − θ⁻¹)φ(n_k). It follows that

P[S_{n_k} − S_{n_{k−1}} > x_k] = P[S_{n_k − n_{k−1}} > x_k] = exp[−(x_k²/2(n_k − n_{k−1}))(1 + ζ_k)],

where ζ_k → 0. The negative of the exponent is asymptotically (1 − θ⁻¹) log k and so for large k is less than log k, in which case P[S_{n_k} − S_{n_{k−1}} > x_k] ≥ k⁻¹. The events here being independent, it follows by the second Borel–Cantelli lemma that with probability 1, S_{n_k} − S_{n_{k−1}} > x_k for infinitely many k. On the other hand, by (9.36) applied to {−X_n}, there is probability 1 that −S_{n_{k−1}} ≤ 2φ(n_{k−1}) ≤ 2θ^{−1/2}φ(n_k) for all but finitely many k. These two inequalities give

S_{n_k} > x_k − 2θ^{−1/2}φ(n_k) ≥ (1 − ε)φ(n_k),

the last inequality because of the choice of θ. •
That completes the proof of Theorem 9.5.
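A quick simulation sketch, illustrative only (the seed, horizon, and bounds are arbitrary choices, not from the text), showing the scale φ(n) = √(2n log log n) of (9.38) at work for a symmetric ±1 walk: the running ratio |S_n|/φ(n) stays of order 1, though the limsup value 1 of (9.35) is approached only at astronomically large n.

```python
import math
import random

rng = random.Random(0)          # fixed seed for reproducibility
N = 100_000
phi = lambda n: math.sqrt(2 * n * math.log(math.log(n)))

S = 0
max_ratio = 0.0
for n in range(1, N + 1):
    S += rng.choice((-1, 1))
    if n >= 100:                # skip tiny n, where log log n is near 0
        max_ratio = max(max_ratio, abs(S) / phi(n))
```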
PROBLEMS

9.1. Prove (6.2) by using (9.9) and the fact that cumulants add in the presence of independence.

9.2. Let Y be a simple random variable with moment generating function M(t). Show that P[Y > 0] = 0 implies that P[Y ≥ 0] = inf_t M(t). Is the infimum achieved in this case?
9.3. In the Bernoulli case, (9.21) gives

P[S_n ≥ np + x_n] = e^{−nK(a,p) + o(n)},

where p < a < 1 and x_n = n(a − p). Theorem 9.4 gives

P[S_n ≥ np + x_n] = e^{−(x_n²/2npq)(1 + o(1))},

where x_n = a_n √(npq). Resolve the apparent discrepancy. Use (9.25) to compare the two expressions in case x_n/n is small. See Problem 27.17.
9.4. Relabel the binomial parameter p as θ = f(p), where f is increasing and continuously differentiable. Show by (9.27) that the distinguishability of θ from θ + Δθ, as measured by K, is (Δθ)²/8p(1 − p)(f'(p))² + O(Δθ)³. The leading coefficient is independent of θ if f(p) = arcsin √p.
Suppose Xn takes the values ± 1 with probability ! each and show that P[ Sn = 0 i.o.] == 1 (This gives still another proof of the persistence of symmet ric random walk on the line (Example 8.3).) Show more generally that, if the Xn are bounded by M, then P [ I Sn l s M i.o.] == 1.
f
9.7. Weakened versions of (9.36) are quite easy to prove. By a fourth moment argument (see (6.2)), show that P[Sn > n 314 (log n ) < 1 + f )/4 i.o.] == 0. Use (9.29) to give a simple proof that P[ Sn > (3n log n ) 1 12 i.o.] == 0. 9.8. What happens in Theorem 9.6 if fi is replaced by fJ (0 < fJ < a)? 9.9. Show that (9.35) is true if sn is replaced by I sn I or maxk n sk or maxk n I sk I · s
s
CHAPTER
2
Measure
SECI'ION 10. GENERAL MEASURES
Lebesgue measure on the unit interval was central to the ideas in Chapter 1. Lebesgue measure on the entire real line is important in probability as well as in analysis generally, and a uniform treatment of this and other examples requires a notion of measure for which infinite values are possible. The present chapter extends the ideas of Sections 2 and 3 to this more general
case.
aasses
of Sets
The a-field of Borel sets in (0, 1] played an essential role in Chapter 1, and it is necessary to construct the analogous classes for the entire real line and for k-dimensional Euclidean space. Let x = (x 1 , . . . , xk) be the generic point of Euclidean k-space R k . The bounded rectangles Example 10.1.
(10.1 ) play in R k the role intervals (a, b] played in (0, 1]. Let � k be the o-field generated by these rectangles. This is the analogue of the class fJI of Borel sets in (0, 1]; see Example 2.6. The elements of � k are the k-dimen siona/ Borel sets. For k = 1 they are also called the linear Borel sets. Call the rectangle (1 0.1) rational if the a ; and h; are all rational. If G is an open set in R k and y e G, then there is a rational rectangle Av such that Y e A.v c G. But then G = Uy eGAy, and since there are only. countably will
155
156
MEASURE
many rational rectangles, this is a countable union. Thus f11 k contains the open sets. Since a closed set has open complement, f11 k also contains the closed sets. Just as fJI contains all the sets in (0, 1] that actually come up in ordinary analysis and probability theory, !1/k contains all the sets in R k that actually come up. The a-field gJ k is generated by subclasses other than the class of rectangles. If An is the x-set where a; < x; < b; + n - 1 , i = 1, . . . , k, then An is open and (10.1) is nnAn. Thus !1/ k is generated by the open sets. Similarly, it is generated by the closed sets. Now an open set is a countable union of rational rectangles. Therefore, the (countable) class of rational
rectangles generates f11 k.
•
The a-field !A 1 on the line R 1 is by definition generated by the finite intervals. The a-field fJI in (0, 1] is generated by the subintervals of (0, 1 ]. The question naturally arises whether the elements of !!4 are the elements of !1/ 1 that happen to lie inside (0, 1]. The answer is yes, as shown by the following technical but frequently useful result. Theorem 10.1. (i) If !F is
Oo.
Suppose that 0 0 c 0. a a-field in 0, then �0 = [ A
n
0 0:
A e �]
is a a-field in
If .JJI generates the a-field � in 0, then ..ra10 = [ A n 0 0 : A e ..raf] generates the a-field � = [A n 0 0 : A e � ] in 0 0• PROOF. Of course, 0 0 = 0 n 0 0 e �0• If B = A n 0 0 and A e �' then 0 0 - B = (0 - A ) n 0 0 and 0 - A e �- If Bn = An n 0 0 and An e �' then U n Bn = (U n An) n O o and U nAn E �- Thus � is a a-field in O o . Let a0( �0 ) be the a-field ..ra10 generates in 0 0• Since �0 is a a-field in 0 0 , and since .JJ/0 c � ' certainly a0( d0 ) c �0• Now �0 c a0 ( .JJ/0 ) will follow if it is shown that A e � implies A n 0 0 e a0 ( ..ra10 ), or, to put it another way, if it is shown that � is contained in f§ = [ A : A n O o E ao ( ..rafo )]. Since A E d implies that A n Oo E ..rafo c a0 ( .JJ/0 ), it follows that d c �- It is therefore enough to show that f§ is a a-field in D. But if A E f§, then (0 - A) n O o = O o - ( A n O o ) E ao (do) and hence D - A E �- If An E f§ for all n, then (UnAn) n Oo = Un(An n 0 0 ) E a0(�0 ) and hence U nAn E f§. • (ii)
If 0 0 e /F, then �0 = [ A : A c 0 0 , A e � ]. If 0 = R1, 0 0 = (0, 1 ] , and 1 � = fA , and if .JJI is the class of finite intervals, then ..ra10 is the class of subintervals of (0, 1] and fJI = a0( d0) is given by
( 1 0 .2)
ffl = ( A : A
c (O, 1] , A
e !11 1 ] .
SECTION
10.
GENERAL MEASURES
157
A
subset of (0, 1] is thus a Borel set (lies in ffl) if and only if it is a linear Borel set (lies in 91 1 ), and the distinction in terminology can be dropped. Conventions Involving oo
Measures assume values in the set [0, oo] consisting of the ordinary nonnegative reals and the special value oo , and some arithmetic conventions are called for. For x, y e [0, oo ], x < y means that y == oo or else x and y are finite (that is, are ordinary real numbers) and x < y holds in the usual sense. Similarly, x < y means that y == oo and x is finite or else x and y are both finite and x < y holds in the usual sense. For a finite or infinite sequence x, x 1 , x 2 , . in [0, oo ], • •
(10 .3) means that either (i) x == oo and xk == oo for some k, or (ii) x == oo and xk < oo for all k and :E k xk is an ordinary divergent infinite series or (iii) x < oo and xk < oo for all k and (10.3) holds in the usual sense for l:k xk an ordinary finite sum or convergent infinite series. By these conventions and Dirichlet's theorem [A26], the order of summation in (10.3) has no effect on the sum. For an infinite sequence x, x 1 , x 2 , in [0, oo ], • • •
(10.4) means in the first place that xk < xk 1 s x and in the second place that either (i) x < oo and there is convergence in the usual sense, or (ii) xk == oo for some k, or (iii) x == oo and the xk are finite reals converging to infinity in the usual sense. +
Measures

A set function μ on a field ℱ in Ω is a measure if it satisfies these conditions:

(i) μ(A) ∈ [0, ∞] for A ∈ ℱ;
(ii) μ(∅) = 0;
(iii) if A₁, A₂, … is a disjoint sequence of ℱ-sets and if ∪_k A_k ∈ ℱ, then (see (10.3))

μ(∪_k A_k) = Σ_k μ(A_k).
The measure μ is finite or infinite as μ(Ω) < ∞ or μ(Ω) = ∞; it is a probability measure if μ(Ω) = 1, as in Chapter 1. If Ω = A₁ ∪ A₂ ∪ ··· for some finite or countable sequence of ℱ-sets satisfying μ(A_k) < ∞, then μ is σ-finite. The significance of this concept
will be seen later. A finite measure is by definition σ-finite; a σ-finite measure may be finite or infinite. If 𝒜 is a subclass of ℱ, μ is σ-finite on 𝒜 if Ω = ∪_k A_k for some finite or infinite sequence of 𝒜-sets satisfying μ(A_k) < ∞. It is not required that these sets A_k be disjoint. Note that if Ω is not a finite or countable union of 𝒜-sets, then no measure can be σ-finite on 𝒜.

If μ is a measure on a σ-field ℱ in Ω, the triple (Ω, ℱ, μ) is a measure space. (This term is not used if ℱ is merely a field.) It is an infinite, a σ-finite, a finite, or a probability measure space according as μ has the corresponding properties. If μ(A^c) = 0 for an ℱ-set A, then A is a support of μ, and μ is concentrated on A. For a finite measure, A is a support if and only if μ(A) = μ(Ω).

The pair (Ω, ℱ) itself is a measurable space if ℱ is a σ-field in Ω. To say that μ is a measure on (Ω, ℱ) indicates clearly both the space and the class of sets involved.

As in the case of probability measures, (iii) above is the condition of countable additivity, and it implies finite additivity: if A₁, …, A_n are disjoint ℱ-sets, then

μ(∪_{k=1}^n A_k) = Σ_{k=1}^n μ(A_k).

As in the case of probability measures, if this holds for n = 2, then it extends inductively to all n.
Example 10.2. A measure μ on (Ω, ℱ) is discrete if there are finitely or countably many points ω_i in Ω and masses m_i in [0, ∞] such that μ(A) = Σ_{ω_i ∈ A} m_i for A ∈ ℱ. It is an infinite, a finite, or a probability measure as Σ_i m_i diverges, or converges, or converges to 1; the last case was treated in Example 2.9. If ℱ contains each singleton {ω_i}, then μ is σ-finite if and only if m_i < ∞ for all i. ∎

Example 10.3. Let ℱ be the σ-field of all subsets of an arbitrary Ω, and let μ(A) be the number of points in A, where μ(A) = ∞ if A is not finite. This μ is counting measure; it is finite if and only if Ω is finite and is σ-finite if and only if Ω is countable. Even if ℱ does not contain every subset of Ω, counting measure is well defined on ℱ. ∎
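A finite instance of Examples 10.2 and 10.3 can be checked mechanically. This sketch is illustrative only (the helper names are not from the text), and a finite support stands in for the countable case:

```python
def discrete_measure(points, masses):
    """The discrete measure A ↦ Σ_{ω_i ∈ A} m_i of Example 10.2, for a
    finite support (a finite stand-in for the countable case)."""
    def mu(A):
        return sum(m for w, m in zip(points, masses) if w in A)
    return mu

def counting_measure(A):
    """Counting measure of Example 10.3, evaluated on a finite set."""
    return len(A)

mu = discrete_measure(["a", "b", "c"], [0.5, 0.25, 0.25])
# additivity on disjoint sets, checked on a finite instance:
assert mu({"a", "b"}) == mu({"a"}) + mu({"b"})
assert mu({"a", "b", "c"}) == 1.0        # a probability measure here
assert counting_measure({1, 2, 3}) == 3
```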
Example 10.4. Specifying a measure includes specifying its domain. If μ is a measure on a field ℱ and ℱ₀ is a field contained in ℱ, then the restriction μ₀ of μ to ℱ₀ is also a measure. Although often denoted by the
same symbol, μ₀ is really a different measure from μ unless ℱ₀ = ℱ. Its properties may be different: if μ is counting measure on the σ-field ℱ of all subsets of a countably infinite Ω, μ is σ-finite, but its restriction to the σ-field ℱ₀ = {∅, Ω} is not σ-finite. ∎

Certain properties of probability measures carry over immediately to the general case. First, μ is monotone: μ(A) ≤ μ(B) if A ⊂ B. This is derived, just as its special case (2.4), from μ(A) + μ(B − A) = μ(B). But it is possible to go on and write μ(B − A) = μ(B) − μ(A) only if μ(B) < ∞. If μ(B) = ∞ and μ(A) < ∞, then μ(B − A) = ∞; but for every a ∈ (0, ∞] there are cases where μ(A) = μ(B) = ∞ and μ(B − A) = a. The inclusion-exclusion formula (2.7) also carries over without change to ℱ-sets of finite measure:
(10.5)  μ(∪_{k=1}^n A_k) = Σ_i μ(A_i) − Σ_{i<j} μ(A_i ∩ A_j) + ··· + (−1)^{n+1} μ(A₁ ∩ ··· ∩ A_n).

The proof of finite subadditivity also goes through just as before:

μ(∪_{k=1}^n A_k) ≤ Σ_{k=1}^n μ(A_k);

here the A_k need not have finite measure.

Theorem 10.2. Let μ be a measure on a field ℱ.
(i) Continuity from below: If A_n and A lie in ℱ and A_n ↑ A, then† μ(A_n) ↑ μ(A).
(ii) Continuity from above: If A_n and A lie in ℱ and A_n ↓ A, and if μ(A₁) < ∞, then μ(A_n) ↓ μ(A).
(iii) Countable subadditivity: If A₁, A₂, … and ∪_k A_k lie in ℱ, then

μ(∪_k A_k) ≤ Σ_k μ(A_k).

(iv) If μ is σ-finite on ℱ, then ℱ cannot contain an uncountable, disjoint collection of sets of positive μ-measure.

†See (10.4).
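Both (10.5) and finite subadditivity are easy to sanity-check on a finite measure. This sketch is illustrative only; it uses counting measure on finite sets as the toy measure:

```python
from itertools import combinations

def mu(A):
    """Counting measure on finite sets: a toy finite measure for checking
    (10.5) and finite subadditivity."""
    return len(A)

def inclusion_exclusion(sets):
    """The right-hand side of (10.5) for finitely many sets."""
    total = 0
    for r in range(1, len(sets) + 1):
        sign = (-1) ** (r + 1)
        for combo in combinations(sets, r):
            inter = set.intersection(*combo)
            total += sign * mu(inter)
    return total

A1, A2, A3 = {1, 2, 3}, {2, 3, 4}, {3, 5}
assert mu(A1 | A2 | A3) == inclusion_exclusion([A1, A2, A3])
assert mu(A1 | A2 | A3) <= mu(A1) + mu(A2) + mu(A3)   # finite subadditivity
```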
PROOF. The proofs of (i) and (iii) are exactly as for the corresponding parts of Theorem 2.1. The same is essentially true of (ii): if μ(A₁) < ∞, subtraction is possible, and A₁ − A_n ↑ A₁ − A implies that μ(A₁) − μ(A_n) = μ(A₁ − A_n) ↑ μ(A₁ − A) = μ(A₁) − μ(A).

There remains (iv). Let [B_θ: θ ∈ Θ] be a disjoint collection of ℱ-sets satisfying μ(B_θ) > 0. Consider an ℱ-set A for which μ(A) < ∞. If θ₁, …, θ_n are distinct indices satisfying μ(A ∩ B_{θ_i}) ≥ ε > 0, then nε ≤ Σ_{i=1}^n μ(A ∩ B_{θ_i}) ≤ μ(A), and so n ≤ μ(A)/ε. Thus the index set [θ: μ(A ∩ B_θ) ≥ ε] is finite, and hence (take the union over positive rational ε) [θ: μ(A ∩ B_θ) > 0] is countable. Since μ is σ-finite, Ω = ∪_k A_k for some finite or countable sequence of ℱ-sets A_k satisfying μ(A_k) < ∞. But then Θ_k = [θ: μ(A_k ∩ B_θ) > 0] is countable for each k. If μ(A_k ∩ B_θ) = 0 for all k, then μ(B_θ) ≤ Σ_k μ(A_k ∩ B_θ) = 0. Therefore, Θ = ∪_k Θ_k, and so Θ is countable. ∎
Uniqueness

According to Theorem 3.3, probability measures agreeing on a π-system 𝒫 agree on σ(𝒫). There is an extension to the general case.
Theorem 10.3. Suppose that μ₁ and μ₂ are measures on σ(𝒫), where 𝒫 is a π-system, and suppose they are σ-finite on 𝒫. If μ₁ and μ₂ agree on 𝒫, then they agree on σ(𝒫).

PROOF. Suppose B ∈ 𝒫 and μ₁(B) = μ₂(B) < ∞, and let ℒ_B be the class of sets A in σ(𝒫) for which μ₁(B ∩ A) = μ₂(B ∩ A). Then ℒ_B is a λ-system containing 𝒫 and hence (Theorem 3.2) containing σ(𝒫). By σ-finiteness there exist 𝒫-sets B_k satisfying Ω = ∪_k B_k and μ₁(B_k) = μ₂(B_k) < ∞. By the inclusion-exclusion formula (10.5),

μ_a((∪_{i=1}^n B_i) ∩ A) = Σ_{1≤i≤n} μ_a(B_i ∩ A) − Σ_{1≤i<j≤n} μ_a(B_i ∩ B_j ∩ A) + ···

for a = 1, 2 and all n. Since 𝒫 is a π-system containing the B_i, it contains the B_i ∩ B_j, and so on. Whatever σ(𝒫)-set A may be, the terms on the right above are therefore the same for a = 1 as for a = 2. The left side is then the same for a = 1 as for a = 2; letting n → ∞ gives μ₁(A) = μ₂(A). ∎
Theorem 10.4. Suppose μ₁ and μ₂ are finite measures on σ(𝒫), where 𝒫 is a π-system and Ω is a finite or countable union of sets in 𝒫. If μ₁ and μ₂ agree on 𝒫, then they agree on σ(𝒫).
PROOF. By hypothesis, Ω = ∪_k B_k for 𝒫-sets B_k, and of course μ_a(B_k) ≤ μ_a(Ω) < ∞, a = 1, 2. Thus μ₁ and μ₂ are σ-finite on 𝒫, and Theorem 10.3 applies. ∎
Example 10.5. If 𝒫 consists of the empty set alone, then it is a π-system and σ(𝒫) = {∅, Ω}. Any two finite measures agree on 𝒫, but of course they need not agree on σ(𝒫). Theorem 10.4 does not apply in this case because Ω is not a countable union of sets in 𝒫. For the same reason, no measure on σ(𝒫) is σ-finite on 𝒫, and hence Theorem 10.3 does not apply. ∎
Example 10.6. Suppose (Ω, ℱ) = (R^1, ℛ^1) and 𝒫 consists of the half-infinite intervals (−∞, x]. By Theorem 10.4, two finite measures on ℱ that agree on 𝒫 also agree on ℱ. The 𝒫-sets of finite measure required in the definition of σ-finiteness cannot in this example be made disjoint. ∎
Example 10.7. If a measure on (Ω, ℱ) is σ-finite on a subfield ℱ₀ of ℱ, then Ω = ∪_k B_k for disjoint ℱ₀-sets B_k of finite measure: if they are not disjoint, replace B_k by B_k ∩ B₁^c ∩ ··· ∩ B_{k−1}^c. ∎
The proof of Theorem 10.3 simplifies slightly if Ω = ∪_k B_k for disjoint 𝒫-sets with μ₁(B_k) = μ₂(B_k) < ∞, because additivity itself can be used in place of the inclusion-exclusion formula.
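On a finite Ω the uniqueness assertion of Theorems 10.3 and 10.4 can be verified by brute force. In this sketch (illustrative names, not from the text; Ω itself is put into the π-system so that Theorem 10.4 applies) the generated σ-field is computed by closing under complements and unions:

```python
def generate_sigma_field(omega, seed):
    """Brute-force the σ-field generated by `seed` on a finite Ω,
    closing under complement and pairwise union until stable (on a
    finite space a field is automatically a σ-field)."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(s) for s in seed}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        for A in current:
            if omega - A not in sets:
                sets.add(omega - A)
                changed = True
        for A in current:
            for B in current:
                if A | B not in sets:
                    sets.add(A | B)
                    changed = True
    return sets

omega = {1, 2, 3, 4}
# a π-system (closed under intersection) that covers Ω, so Theorem 10.4 applies
pi_system = [{1}, {1, 2}, {1, 2, 3, 4}]
F = generate_sigma_field(omega, pi_system)

masses1 = {1: 0.5, 2: 0.2, 3: 0.15, 4: 0.15}
masses2 = {1: 0.5, 2: 0.2, 3: 0.1, 4: 0.2}   # differs pointwise on {3} and {4}

def mu1(A):
    return sum(masses1[w] for w in A)

def mu2(A):
    return sum(masses2[w] for w in A)

agree_on_pi = all(abs(mu1(P) - mu2(P)) < 1e-12 for P in map(frozenset, pi_system))
agree_on_F = all(abs(mu1(A) - mu2(A)) < 1e-12 for A in F)
# the σ-field cannot separate 3 from 4, so agreement on 𝒫 forces agreement on σ(𝒫)
assert agree_on_pi and agree_on_F and len(F) == 8
```

The two measures differ on singletons {3} and {4}, but those sets are not in σ(𝒫); on σ(𝒫) itself agreement on the π-system propagates, exactly as the theorem asserts.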
PROBLEMS

10.1. Show that if conditions (i) and (iii) in the definition of measure hold, and if μ(A) < ∞ for some A ∈ ℱ, then condition (ii) holds.

10.2. Let (Ω_n, ℱ_n, μ_n) be a measure space, n = 1, 2, …, and suppose that the sets Ω_n are disjoint. Let Ω = ∪_n Ω_n, let ℱ consist of the sets of the form A = ∪_n A_n with A_n ∈ ℱ_n, and for such an A let μ(A) = Σ_n μ_n(A_n). Show that (Ω, ℱ, μ) is a measure space. Under what conditions is it σ-finite? Finite?

10.3. On the σ-field of all subsets of Ω = {1, 2, …}, put μ(A) = Σ_{k ∈ A} 2^{−k} if A is finite and μ(A) = ∞ otherwise. Is μ finitely additive? Countably additive?

10.4. (a) In connection with Theorem 10.2(ii), show that if A_n ↓ A and μ(A_k) < ∞ for some k, then μ(A_n) ↓ μ(A).
(b) Find an example in which A_n ↓ A, μ(A_n) = ∞, and A = ∅.
10.5. The natural generalization of (4.10) is

(10.6)  μ(lim inf_n A_n) ≤ lim inf_n μ(A_n) ≤ lim sup_n μ(A_n) ≤ μ(lim sup_n A_n).
Show that the left-hand inequality always holds. Show that the right-hand inequality holds if μ(∪_{k≥n} A_k) < ∞ for some n but can fail otherwise.

10.6. 3.5 ↑ A measure space (Ω, ℱ, μ) is complete if A ⊂ B, B ∈ ℱ, and μ(B) = 0 together imply that A ∈ ℱ; the definition is just as in the probability case. Use the ideas of Problem 3.5 to construct a complete measure space (Ω, ℱ⁺, μ⁺) such that ℱ ⊂ ℱ⁺ and μ and μ⁺ agree on ℱ.
10.7. The condition in Theorem 10.2(iv) essentially characterizes σ-finiteness.
(a) Suppose that (Ω, ℱ, μ) has no "infinite atoms," in the sense that for every A in ℱ, if μ(A) = ∞, then there is in ℱ a B such that B ⊂ A and 0 < μ(B) < ∞. Show that if ℱ does not contain an uncountable, disjoint collection of sets of positive measure, then μ is σ-finite. (Use Zorn's lemma.)
(b) Show by example that this is false without the condition that there are no "infinite atoms."

10.8. Deduce Theorem 3.3 from Theorem 10.3 and from Theorem 10.4.

10.9. Example 10.5 shows that Theorem 10.3 fails without the σ-finiteness condition. Construct other examples of this kind.

10.10. Suppose that μ₁ and μ₂ are measures on a σ-field ℱ and are σ-finite on a field ℱ₀ generating ℱ.
(a) Show that, if μ₁(A) ≤ μ₂(A) holds for all A in ℱ₀, then it holds for all A in ℱ.
(b) Show that this fails if ℱ₀ is only a π-system.
SECTION 11. OUTER MEASURE

Outer Measure
An outer measure is a set function μ* that is defined for all subsets of a space Ω and has these four properties:

(i) μ*(A) ∈ [0, ∞] for every A ⊂ Ω;
(ii) μ*(∅) = 0;
(iii) μ* is monotone: A ⊂ B implies μ*(A) ≤ μ*(B);
(iv) μ* is countably subadditive: μ*(∪_n A_n) ≤ Σ_n μ*(A_n).
The set function P* defined by (3.1) is an example, one that generalizes:

Example 11.1. Let ρ be a set function on a class 𝒜 in Ω. Assume that ∅ ∈ 𝒜 and ρ(∅) = 0, and that ρ(A) ∈ [0, ∞] for A ∈ 𝒜; ρ and 𝒜 are otherwise arbitrary. Put

(11.1)  μ*(A) = inf Σ_n ρ(A_n),
where the infimum extends over all finite and countable coverings of A by 𝒜-sets A_n. If no such covering exists, take μ*(A) = ∞ in accordance with the convention that the infimum over an empty set is ∞.

That μ* satisfies (i), (ii), and (iii) is clear. If μ*(A_n) = ∞ for some n, then obviously μ*(∪_n A_n) ≤ Σ_n μ*(A_n). Otherwise, cover each A_n by 𝒜-sets B_{nk} satisfying Σ_k ρ(B_{nk}) < μ*(A_n) + ε/2^n; then μ*(∪_n A_n) ≤ Σ_{n,k} ρ(B_{nk}) < Σ_n μ*(A_n) + ε. Thus μ* is an outer measure. ∎
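For a finite class 𝒜 the infimum in (11.1) ranges over finitely many coverings and can be computed exhaustively. A sketch (illustrative only; ρ is given as a list parallel to the class):

```python
import math
from itertools import combinations

def outer_measure(A, cls, rho):
    """μ*(A) from (11.1): the infimum of Σ ρ(A_n) over coverings of A by
    sets from the finite class `cls` (so only finite coverings occur);
    math.inf if no covering exists, matching the empty-infimum convention."""
    best = math.inf
    idx = range(len(cls))
    for r in range(1, len(cls) + 1):
        for combo in combinations(idx, r):
            cover = set().union(*(cls[i] for i in combo))
            if set(A) <= cover:
                best = min(best, sum(rho[i] for i in combo))
    return best

cls = [{1, 2}, {2, 3}, {3, 4}, {1, 2, 3, 4}]
rho = [1.0, 1.0, 1.0, 2.5]
assert outer_measure({1, 4}, cls, rho) == 2.0          # {1,2} ∪ {3,4}
assert outer_measure({1, 2, 3, 4}, cls, rho) == 2.0
assert outer_measure({5}, cls, rho) == math.inf        # no covering exists
# monotonicity, as required of an outer measure:
assert outer_measure({1}, cls, rho) <= outer_measure({1, 4}, cls, rho)
```

Monotonicity and subadditivity hold by the same covering argument as in the example above; the code only illustrates the definition, not a proof.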
Define A to be μ*-measurable if

(11.2)  μ*(A ∩ E) + μ*(A^c ∩ E) = μ*(E)

for every E. This is the general version of the definition (3.4) used in Section 3. It is by subadditivity equivalent to

(11.3)  μ*(A ∩ E) + μ*(A^c ∩ E) ≤ μ*(E).

Denote by ℳ(μ*) the class of μ*-measurable sets.

The extension property for probability measures in Theorem 3.1 was proved by means of a sequence of four lemmas. These lemmas carry over directly to the case of the general outer measure: if P* is replaced by μ* and ℳ by ℳ(μ*) at each occurrence, the proofs hold word for word, symbol for symbol. In particular, an examination of the arguments shows that ∞ as a possible value for μ* does not necessitate any changes. The fourth lemma in Section 3 becomes this:

Theorem 11.1. If μ* is an outer measure, then ℳ(μ*) is a σ-field, and μ* restricted to ℳ(μ*) is a measure.
This will be used to prove an extension theorem, but it has other applications as well; see Sections 19 and 29.
Extension

Theorem 11.2. A measure on a field has an extension to the generated σ-field.

If the original measure on the field is σ-finite, then it follows by Theorem 10.3 that the extension is unique. Theorem 11.2 can be deduced from Theorem 11.1 by the arguments used in the proof of Theorem 3.1.† It is unnecessary to retrace the steps, however, because the ideas will appear in stronger form in the proof of the next result, which generalizes Theorem 11.2.

Define a class 𝒜 of subsets of Ω to be a semiring if (i) ∅ ∈ 𝒜; (ii) A, B ∈ 𝒜 implies A ∩ B ∈ 𝒜; (iii) if A, B ∈ 𝒜 and A ⊂ B, then there exist disjoint 𝒜-sets C₁, …, C_n such that B − A = ∪_{k=1}^n C_k.

Example 11.2. The class of finite intervals in Ω = R^1 and the class of subintervals of Ω = (0, 1] are the simplest examples of semirings. Note that a semiring need not contain Ω. ∎

If A and B lie in a semiring 𝒜, then since A ∩ B ∈ 𝒜 by (ii), A ∩ B^c = A − A ∩ B has by (iii) the form ∪_{k=1}^n C_k for disjoint 𝒜-sets C_k. If A, B₁, …, B_{j+1} are in 𝒜 and A ∩ B₁^c ∩ ··· ∩ B_j^c has this form, then A ∩ B₁^c ∩ ··· ∩ B_{j+1}^c = ∪_{k=1}^n (C_k ∩ B_{j+1}^c), and each C_k ∩ B_{j+1}^c is a finite disjoint union of 𝒜-sets. It follows by induction that for any sets A, B₁, …, B_j in 𝒜, there are in 𝒜 disjoint sets C_k such that A ∩ B₁^c ∩ ··· ∩ B_j^c = ∪_{k=1}^n C_k.
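For the interval semiring, the induction just described is a concrete computation. This sketch (illustrative function names, not from the text) represents half-open intervals (a, b] as pairs and writes A ∩ B₁^c ∩ ··· ∩ B_j^c as a disjoint union:

```python
def interval_minus(a1, b1, a2, b2):
    """(a1, b1] minus (a2, b2], returned as a list of disjoint half-open
    intervals: the semiring property for intervals in concrete form."""
    lo, hi = max(a1, a2), min(b1, b2)
    if lo >= hi:                      # no overlap: nothing is removed
        return [(a1, b1)] if a1 < b1 else []
    pieces = []
    if a1 < lo:
        pieces.append((a1, lo))
    if hi < b1:
        pieces.append((hi, b1))
    return pieces

def subtract_many(interval, others):
    """A ∩ B1^c ∩ ... ∩ Bj^c as a finite disjoint union of intervals,
    following the induction of Example 11.2."""
    parts = [interval]
    for (a2, b2) in others:
        parts = [p for part in parts for p in interval_minus(*part, a2, b2)]
    return parts

assert interval_minus(0, 10, 3, 7) == [(0, 3), (7, 10)]
assert subtract_many((0, 10), [(3, 7), (8, 9)]) == [(0, 3), (7, 8), (9, 10)]
```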
Theorem 11.3. Suppose that μ is a set function on a semiring 𝒜. Suppose that μ has values in [0, ∞], that μ(∅) = 0, and that μ is finitely additive and countably subadditive. Then μ extends to a measure on σ(𝒜).‡
This contains Theorem 11.2, because the conditions are all satisfied if 𝒜 is a field and μ is a measure on it. If Ω = ∪_k A_k for a sequence of 𝒜-sets satisfying μ(A_k) < ∞, then it follows by Theorem 10.3 that the extension is unique.
†See also Problem 11.1.
‡And so μ must have been countably additive on 𝒜 in the first place; see Problem 11.4.
PROOF. If A, B, and the C_k are related as in condition (iii) in the definition of semiring, then by finite additivity μ(B) = μ(A) + Σ_{k=1}^n μ(C_k) ≥ μ(A). Thus μ is monotone. Define an outer measure μ* by (11.1) for ρ = μ:

(11.4)  μ*(A) = inf Σ_n μ(A_n),

the infimum extending over coverings of A by 𝒜-sets. The first step is to show that 𝒜 ⊂ ℳ(μ*).

Suppose that A ∈ 𝒜. If μ*(E) = ∞, then (11.3) holds trivially. If μ*(E) < ∞, for given ε choose 𝒜-sets A_n such that E ⊂ ∪_n A_n and Σ_n μ(A_n) < μ*(E) + ε. Since 𝒜 is a semiring, B_n = A ∩ A_n lies in 𝒜, and A^c ∩ A_n = A_n − B_n has the form ∪_{k=1}^{m_n} C_{nk} for disjoint 𝒜-sets C_{nk}. Note that A_n = B_n ∪ ∪_{k=1}^{m_n} C_{nk}, where the union is disjoint, and that A ∩ E ⊂ ∪_n B_n and A^c ∩ E ⊂ ∪_n ∪_{k=1}^{m_n} C_{nk}. By the definition of μ* and the assumed finite additivity of μ,

μ*(A ∩ E) + μ*(A^c ∩ E) ≤ Σ_n μ(B_n) + Σ_n Σ_{k=1}^{m_n} μ(C_{nk}) = Σ_n μ(A_n) < μ*(E) + ε.
Since ε is arbitrary, (11.3) follows. Thus 𝒜 ⊂ ℳ(μ*).

The next step is to show that μ* and μ agree on 𝒜. If A ⊂ ∪_n A_n for 𝒜-sets A and A_n, then by the assumed countable subadditivity of μ and the monotonicity established above, μ(A) ≤ Σ_n μ(A ∩ A_n) ≤ Σ_n μ(A_n). Therefore, A ∈ 𝒜 implies that μ(A) ≤ μ*(A) and hence, since the reverse inequality is an immediate consequence of (11.4), μ(A) = μ*(A). Thus μ* agrees with μ on 𝒜.

Since 𝒜 ⊂ ℳ(μ*) and ℳ(μ*) is a σ-field (Theorem 11.1), σ(𝒜) ⊂ ℳ(μ*). Since μ* is countably additive on ℳ(μ*) (Theorem 11.1 again), μ* restricted to σ(𝒜) is an extension of μ on 𝒜, as required. ∎

Example 11.3. For 𝒜 take the semiring of subintervals of Ω = (0, 1] together with the empty set. For μ take length λ: λ(a, b] = b − a. The finite additivity of λ follows by Theorem 2.2 and the countable subadditivity by Lemma 2 to that theorem.† By Theorem 11.3, λ extends to a measure on the class σ(𝒜) = ℬ of Borel sets in (0, 1]. ∎
†On a field, countable additivity implies countable subadditivity, but 𝒜 is merely a semiring; hence the necessity of this observation. But see Problem 11.4.
This gives a second construction of Lebesgue measure in the unit interval. In the first construction λ was extended first from the class of intervals to the field ℬ₀ of finite disjoint unions of intervals (see (2.11)) and then by Theorem 11.2 (in its special form Theorem 3.1) from ℬ₀ to ℬ = σ(ℬ₀). Using Theorem 11.3 instead of Theorem 11.2 effects a slight economy, since the extension then goes from 𝒜 directly to ℬ without the intermediate stop at ℬ₀, and the arguments involving (2.12) and (2.13) become unnecessary.
Example 11.4. In Theorem 11.3 take for 𝒜 the semiring of finite intervals on the real line R^1, and consider λ^1(a, b] = b − a. The arguments for Theorem 2.2 and its two lemmas in no way require that the (finite) intervals in question be contained in (0, 1], and so λ^1 is finitely additive and countably subadditive on this class 𝒜. Hence λ^1 extends to the σ-field ℛ^1 of linear Borel sets, which is by definition generated by 𝒜. This defines Lebesgue measure λ^1 over the whole real line. ∎
A subset of (0, 1] lies in ℬ if and only if it lies in ℛ^1 (see (10.2)). Now λ^1(A) = λ(A) for subintervals A of (0, 1], and it follows by uniqueness (Theorem 3.3) that λ^1(A) = λ(A) for all A in ℬ. Thus there is no inconsistency in dropping λ^1 and using λ to denote Lebesgue measure on ℛ^1 as well as on ℬ.

Example 11.5. The class of bounded rectangles in R^k is a semiring, a fact needed in the next section. Suppose A = [x: x_i ∈ I_i, i ≤ k] and B = [x: x_i ∈ J_i, i ≤ k] are nonempty rectangles, the I_i and J_i being finite intervals. If A ⊂ B, then I_i ⊂ J_i, so that J_i − I_i is a disjoint union I_i′ ∪ I_i″ of intervals (possibly empty). Consider the 3^k disjoint rectangles [x: x_i ∈ K_i, i ≤ k], where for each i, K_i is I_i or I_i′ or I_i″. One of these rectangles is A itself, and B − A is the union of the others. The rectangles thus form a semiring. ∎
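The 3^k decomposition of Example 11.5 can be enumerated directly. The following sketch (illustrative; a rectangle is a tuple of per-axis half-open intervals) checks it for k = 2:

```python
from itertools import product

def split_axis(inner, outer):
    """Split the interval J = (c, d] around I = (a, b] ⊂ J into the middle
    piece I and the two side pieces I' and I'' (possibly empty)."""
    (a, b), (c, d) = inner, outer
    return [(a, b), (c, a), (b, d)]

def rectangle_difference(A, B):
    """B − A for rectangles A ⊂ B in R^k, as a disjoint union of at most
    3^k − 1 rectangles; empty cells are dropped."""
    choices = [split_axis(ia, ja) for ia, ja in zip(A, B)]
    rects = []
    for combo in product(*choices):
        if combo == tuple(A):
            continue                          # that cell is A itself
        if all(lo < hi for lo, hi in combo):  # keep only nonempty cells
            rects.append(combo)
    return rects

def volume(rect):
    v = 1.0
    for lo, hi in rect:
        v *= hi - lo
    return v

A = ((1, 2), (1, 2))          # unit square
B = ((0, 3), (0, 3))          # enclosing 3×3 square
parts = rectangle_difference(A, B)
assert len(parts) == 8        # the 3^2 − 1 surrounding cells
assert abs(sum(map(volume, parts)) - (9 - 1)) < 1e-12
```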
An Approximation Theorem
If 𝒜 is a semiring, then by Theorem 10.3 a measure on σ(𝒜) is determined by its values on 𝒜 if it is σ-finite there. The following theorem shows more explicitly how the measure of a σ(𝒜)-set can be approximated by the measures of 𝒜-sets.
Theorem 11.4. Suppose that 𝒜 is a semiring, μ is a measure on ℱ = σ(𝒜), and μ is σ-finite on 𝒜.
(i) If B ∈ ℱ and ε > 0, there exists a finite or infinite disjoint sequence A₁, A₂, … of 𝒜-sets such that B ⊂ ∪_k A_k and μ((∪_k A_k) − B) < ε.
(ii) If B ∈ ℱ and ε > 0, and if μ(B) < ∞, then there exists a finite disjoint sequence A₁, …, A_n of 𝒜-sets such that μ(B Δ (∪_{k=1}^n A_k)) < ε.
PROOF. Return to the proof of Theorem 11.3. If μ* is the outer measure defined by (11.4), then ℱ ⊂ ℳ(μ*) and μ* agrees with μ on 𝒜, as was shown. Since μ* restricted to ℱ is a measure, it follows by Theorem 10.3 that μ* agrees with μ on ℱ as well.

Suppose now that B lies in ℱ and μ(B) = μ*(B) < ∞. There exist 𝒜-sets A_k such that B ⊂ ∪_k A_k and μ(∪_k A_k) ≤ Σ_k μ(A_k) < μ(B) + ε; but then μ((∪_k A_k) − B) < ε. To make the sequence {A_k} disjoint, replace A_k by A_k ∩ A₁^c ∩ ··· ∩ A_{k−1}^c; each of these sets is (Example 11.2) a finite disjoint union of sets in 𝒜.

Next suppose that B lies in ℱ and μ(B) = μ*(B) = ∞. By σ-finiteness there exist 𝒜-sets C_m such that Ω = ∪_m C_m and μ(C_m) < ∞. By what has just been shown, there exist 𝒜-sets A_{mk} such that B ∩ C_m ⊂ ∪_k A_{mk} and μ((∪_k A_{mk}) − (B ∩ C_m)) < ε/2^m. The sets A_{mk} taken all together provide a sequence A₁, A₂, … of 𝒜-sets satisfying B ⊂ ∪_k A_k and μ((∪_k A_k) − B) < ε. As before, the A_k can be made disjoint.

To prove part (ii), consider the A_k of part (i). If B has finite measure, so has A = ∪_k A_k, and hence by continuity from above (Theorem 10.2(ii)), μ(A − ∪_{k≤n} A_k) < ε for some n. But then μ(B Δ (∪_{k=1}^n A_k)) < 2ε. ∎
If, for example, B is a linear Borel set of finite Lebesgue measure, then λ(B Δ (∪_{k=1}^n A_k)) < ε for some disjoint collection of finite intervals A₁, …, A_n.
Corollary. If μ is a finite measure on a σ-field ℱ generated by a field ℱ₀, then for each ℱ-set A and each positive ε there is an ℱ₀-set B such that μ(A Δ B) < ε.
PROOF. This is of course an immediate consequence of part (ii) of the theorem, but there is a simple direct argument. Let 𝒢 be the class of ℱ-sets with the required property. Since A^c Δ B^c = A Δ B, 𝒢 is closed under complementation. If A = ∪_n A_n, where A_n ∈ 𝒢, given ε choose n₀ so that μ(A − ∪_{n≤n₀} A_n) < ε, and then choose ℱ₀-sets B_n, n ≤ n₀, so that μ(A_n Δ B_n) < ε/n₀. Since (∪_{n≤n₀} A_n) Δ (∪_{n≤n₀} B_n) ⊂ ∪_{n≤n₀} (A_n Δ B_n), the ℱ₀-set B = ∪_{n≤n₀} B_n satisfies μ(A Δ B) < 2ε. Of course ℱ₀ ⊂ 𝒢; since 𝒢 is a σ-field, ℱ ⊂ 𝒢, as required. ∎
Carathéodory's Condition*
Suppose that μ* is an outer measure on the class of all subsets of R^k. It satisfies Carathéodory's condition if

(11.5)  μ*(A ∪ B) = μ*(A) + μ*(B)

whenever A and B are at positive distance in the sense that dist(A, B) = inf[|x − y|: x ∈ A, y ∈ B] is positive.
Ifk p.* is an outer measure on R k satisfying Caratheodory 's
condition, then !Ji
c
.l(p.*).
Since .l(p.* ) is a a-field (Theorem 11.1) and the closed sets generate !Jt k (Example 10.1), it is enough to show that each closed set is p.*-measureable. The problem is thus to prove (11.3) for A closed and E arbitrary. Let B = A n E and C = A c n E. Let Cn = [x e C: dist(x, A ) � n - 1 ], so that Cn t C and dist(Cn , A) � n - 1 . By Caratheodory' s condition, p.*( E) = p.*( B U C ) � p.*( B U Cn ) = p.*( B) + p.*( Cn ). Hence it suffices to prove that p.*( Cn ) .-. p.*( C), or, since Cn t C, that p.*( C) � lim n p.*( Cn ). Let Dn = cn + 1 - en ; if X E Dn + 1 and l x - Yl < n - 1 ( n + 1) - 1 , then dist( y , A ) s I Y - x l + dist(x, A) < n - 1 , so that y � en . Thus l x - Y l > n - 1 ( n + 1) - 1 if X E Dn + 1 and y E en , and so Dn + 1 and en are at positive distance if they are nonempty. By the Caratheodory condition extended inductively, PROOF.
(11 .6) and ( 1 1 .7) By subadditivity,
( 1 1 .8)
00
00
p * ( C ) � p.*( C2 n ) + En p.* ( D2k ) + E p.*( D2k - l ) . k
k-n+ l
If Ep.*( D2 k ) or Ep.*( D2 k _ 1 ) diverges, p.*(C) < limnp.*(Cn ) follows from • (11 .6) or (11.7); otherwise, it follows from (11.8). * This topic may be omitted; it is needed only for the construction of Hausdorff measure in
Section 19.
SECTION
11.
OUTER MEASURE
169
Example 11.6. For A ⊂ R^1 define λ*(A) = inf Σ_n (b_n − a_n), where the infimum extends over countable coverings of A by intervals (a_n, b_n]. Then λ* is an outer measure (Example 11.1). Suppose that A and B are at positive distance, and consider any covering of A ∪ B by intervals (a_n, b_n]; (11.5) will follow if Σ(b_n − a_n) ≥ λ*(A) + λ*(B) necessarily holds. The sum here is unchanged if (a_n, b_n] is replaced by a disjoint union of subintervals each of length less than dist(A, B). But if b_n − a_n < dist(A, B), then (a_n, b_n] cannot meet both A and B. Let M and N be the sets of n for which (a_n, b_n] meets A and B, respectively. Then M and N are disjoint, A ⊂ ∪_{n∈M} (a_n, b_n], and B ⊂ ∪_{n∈N} (a_n, b_n]; hence Σ(b_n − a_n) ≥ Σ_{n∈M} (b_n − a_n) + Σ_{n∈N} (b_n − a_n) ≥ λ*(A) + λ*(B). Thus λ* satisfies Carathéodory's condition. By Theorem 11.5, ℛ^1 ⊂ ℳ(λ*), and so λ* restricted to ℛ^1 is a measure by Theorem 11.1. By Lemma 2 in Section 2 (p. 24), λ*(a, b] = b − a. Thus λ* restricted to ℛ^1 is λ, which gives still another construction of Lebesgue measure. ∎
PROBLEMS

11.1. The proof of Theorem 3.1 obviously applies if the probability measure is replaced by a finite measure, since this is only a matter of rescaling. Take as a starting point, then, the fact that a finite measure on a field extends uniquely to the generated σ-field. By the following steps, prove Theorem 11.2, that is, remove the assumption of finiteness.
(a) Let μ be a measure (not necessarily even σ-finite) on a field ℱ₀ and let ℱ = σ(ℱ₀). If A ∈ ℱ₀ and μ(A) < ∞, restrict μ to a finite measure μ_A on the field ℱ₀(A) = [B: B ⊂ A, B ∈ ℱ₀] in A, and extend μ_A to a finite measure (still denoted μ_A) on the σ-field ℱ(A) = [B: B ⊂ A, B ∈ ℱ] generated in A by ℱ₀(A).
(b) Suppose that E ∈ ℱ. If there exist disjoint ℱ₀-sets A_n such that μ(A_n) < ∞ and E ⊂ ∪_n A_n, put μ(E) = Σ_n μ_{A_n}(E ∩ A_n) and prove consistency; otherwise, put μ(E) = ∞.
(c) Show that μ, so extended, is a measure on ℱ and agrees with the original μ on ℱ₀.

11.2. (a) If μ* is an outer measure and ν*(E) = μ*(E ∩ A), then ν* is an outer measure.
(b) If μ_n* are outer measures and μ*(E) = Σ_n μ_n*(E) or μ*(E) = sup_n μ_n*(E), then μ* is an outer measure.

11.3. 2.5 10.10 ↑ Suppose 𝒜 is a semiring that contains Ω.
(a) Show that the field generated by 𝒜 consists of the finite disjoint unions of elements of 𝒜.
(b) Suppose that measures μ₁ and μ₂ on σ(𝒜) are σ-finite on 𝒜 and that μ₁(A) ≤ μ₂(A) for all A in 𝒜. Show that μ₁(A) ≤ μ₂(A) for all A in σ(𝒜).
11.4. Suppose that 𝒜 is a semiring and μ is a finitely additive set function on 𝒜 having values in [0, ∞] and satisfying μ(∅) = 0.
(a) Suppose that A, A₁, …, A_m are 𝒜-sets. Show that Σ_{k=1}^m μ(A_k) ≤ μ(A) if ∪_{k=1}^m A_k ⊂ A and the A_k are disjoint. Show that μ(A) ≤ Σ_{k=1}^m μ(A_k) if A ⊂ ∪_{k=1}^m A_k (here the A_k need not be disjoint).
(b) Show that μ is countably additive on 𝒜 if and only if it is countably subadditive.

11.5. Measure theory is sometimes based on the notion of a ring. A class 𝒜 is a ring if it contains the empty set and is closed under the formation of differences and finite unions; it is a σ-ring if it is also closed under the formation of countable unions.
(a) Show that a ring (σ-ring) containing Ω is a field (σ-field). Find a ring that is not a field, a σ-ring that is not a σ-field, a σ-ring that is not a field.
(b) Show that a ring is closed under the formation of symmetric differences and finite intersections. Show that a σ-ring is closed under the formation of countable intersections.
(c) Show that the field f(𝒜) generated (see Problem 2.5) by a ring 𝒜 consists of the sets A for which either A ∈ 𝒜 or A^c ∈ 𝒜. Show that f(𝒜) = σ(𝒜) if 𝒜 is a σ-ring.
(d) Show that a (countably additive) measure on a ring 𝒜 extends to f(𝒜). Of course this follows by Theorem 11.3, but prove it directly.
(e) Let 𝒜 be a ring and let 𝒜′ be the σ-ring generated (define) by 𝒜. Show that a measure on 𝒜 extends to a measure on 𝒜′ and that the extension is unique if the original measure is σ-finite on 𝒜. Hint: To prove uniqueness, first show by adapting the proof of Theorem 3.4 that a monotone class containing 𝒜 must contain 𝒜′.

11.6. Call 𝒜 a strong semiring if it contains the empty set and is closed under the formation of finite intersections and if A, B ∈ 𝒜 and A ⊂ B imply that there exist 𝒜-sets A₀, …, A_n such that A = A₀ ⊂ A₁ ⊂ ··· ⊂ A_n = B and A_k − A_{k−1} ∈ 𝒜, k = 1, …, n.
(a) Show that the class of bounded rectangles in R^k is a strong semiring.
(b) Show that a semiring need not be a strong semiring.
(c) Call μ on 𝒜 pairwise additive if A, B, A ∪ B ∈ 𝒜 and A ∩ B = ∅ imply that μ(A ∪ B) = μ(A) + μ(B). It can be shown that, if μ is pairwise additive on a strong semiring, then it is finitely additive there. Show that this is false if 𝒜 is only a semiring.

11.7. Suppose that μ is a finite measure on the σ-field ℱ generated by the field ℱ₀. Suppose that A ∈ ℱ and that A₁, …, A_n is a decomposition of A into ℱ-sets. Show that there exist an ℱ₀-set B and a decomposition B₁, …, B_n of B into ℱ₀-sets such that μ(A_k Δ B_k) < ε, k = 1, …, n. Show that B may be taken as A if A ∈ ℱ₀.

11.8. Suppose that μ is σ-finite on a field ℱ₀ generating the σ-field ℱ. Show that for A ∈ ℱ there exist ℱ₀-sets C_{ni} and D_{ni} such that, if C = ∪_n ∩_i C_{ni} and D = ∩_n ∪_i D_{ni}, then C ⊂ A ⊂ D and μ(D − C) = 0.

11.9. Show that σ-finiteness is essential in Theorem 11.4. Show that part (ii) can fail if μ(B) = ∞.

11.10. Here is another approach to Theorem 11.4(i). Suppose for simplicity that μ is a finite measure on the σ-field ℱ generated by a field ℱ₀. Let 𝒢 be the
class of ℱ-sets G such that for each ε there exist ℱ₀-sets A_k and B_k satisfying ∩_k A_k ⊂ G ⊂ ∪_k B_k and μ((∪_k B_k) − (∩_k A_k)) < ε. Show that 𝒢 is a σ-field containing ℱ₀.

11.11. ↑ This and Problems 11.12, 16.25, 17.16, 17.17, and 17.18 lead to proofs of the Daniell-Stone and Riesz representation theorems. Let Λ be a real linear functional on a vector lattice ℒ of (finite) real functions on a space Ω. This means that if f and g lie in ℒ, then so do f ∨ g and f ∧ g (with values max{f(ω), g(ω)} and min{f(ω), g(ω)}), as well as αf + βg, and Λ(αf + βg) = αΛ(f) + βΛ(g). Assume further of ℒ that f ∈ ℒ implies f ∧ 1 ∈ ℒ (where 1 denotes the function identically equal to 1). Assume further of Λ that it is positive in the sense that f ≥ 0 (pointwise) implies Λ(f) ≥ 0, and continuous from above at 0 in the sense that f_n ↓ 0 (pointwise) implies Λ(f_n) → 0.
(a) If f ≤ g (f, g ∈ ℒ), define in Ω × R^1 an "interval"

(11.9)  (f, g] = [(ω, t): f(ω) < t ≤ g(ω)].

Show that these sets form a semiring 𝒜₀.
(b) Define a set function μ₀ on 𝒜₀ by

(11.10)  μ₀(f, g] = Λ(g − f).

Show that μ₀ is finitely additive and countably subadditive on 𝒜₀.

11.12. ↑ (a) Assume f ∈ ℒ and let f_n = (n(f − f ∧ 1)) ∧ 1. Show that f(ω) ≤ 1 implies f_n(ω) = 0 for all n and f(ω) > 1 implies f_n(ω) = 1 for all sufficiently large n. Conclude that, for x > 0,

(11.11)  (0, xf_n] ↑ [ω: f(ω) > 1] × (0, x].

(b) Let ℱ be the smallest σ-field with respect to which every f in ℒ is measurable: ℱ = σ[f^{−1}H: f ∈ ℒ, H ∈ ℛ^1]. Let ℱ₀ be the class of A in ℱ for which A × (0, 1] ∈ σ(𝒜₀). Show that ℱ₀ is a semiring and that ℱ = σ(ℱ₀).
(c) Let ν be the extension of μ₀ (see (11.10)) to σ(𝒜₀), and for A ∈ ℱ₀ define μ₀(A) = ν(A × (0, 1]). Show that μ₀ is finitely additive and countably subadditive on the semiring ℱ₀.
SECTION 12. MEASURES IN EUCLIDEAN SPACE

Lebesgue Measure
Example 11.4 Lebesgue measure A was constructed on the class fll 1 of linear Borel sets. By Theorem 10.3, A is the only measure on fll 1 satisfying A( a , b] = b - a for all intervals. There is in k-space an analogous k-dimen sional Lebesgue measure A k on the class !Ji k of k-dimensional Borel sets (Example 10.1). It is specified by the requirement that bounded rectangles In
172
MEASURE
have measure (I2.I)
Ak (x:
a ; < X ; < bi ' i =
I, . . . , k]
=
k n ( b;
i=l
- a, ) .
This is ordinary volume-that is, length ( k = I), area ( k = 2), volume (k = 3), or hypervolume ( k > 4). Since an intersection of rectangles is again a rectangle, the uniqueness theorem shows that (I2.I) completely determines A k . That there does exist such a measure on gJ k can be proved in several ways. One is to use the ideas involved in the case k = I. A second construction is given in Theorem I2.5. A third, independent, construction uses the general theory of product measures; this is carried out in Section IS. t For the moment, assume the existence on fit k of a measure A k satisfying (I2.I). Of course, A k is a-finite. A basic property of A k is translation invariance. *
Theorem 12.1. If A ∈ 𝓡^k, then A + x = [a + x: a ∈ A] ∈ 𝓡^k and λ_k(A) = λ_k(A + x) for all x.
PROOF. If 𝒢 is the class of A such that A + x is in 𝓡^k for all x, then 𝒢 is a σ-field containing the bounded rectangles, and so 𝒢 ⊃ 𝓡^k. Thus A + x ∈ 𝓡^k for A ∈ 𝓡^k. For fixed x define a measure μ on 𝓡^k by μ(A) = λ_k(A + x). Then μ and λ_k agree on the π-system of bounded rectangles and so agree for all Borel sets. ∎
If A is a (k − 1)-dimensional subspace and x lies outside A, the hyperplanes A + tx for real t are disjoint, and by Theorem 12.1 all have the same measure. Since only countably many disjoint sets can have positive measure (Theorem 10.2(iv)), the measure common to the A + tx must be 0: every (k − 1)-dimensional hyperplane has k-dimensional Lebesgue measure 0.

The Lebesgue measure of a rectangle is its ordinary volume. The following theorem makes it possible to calculate the measures of simple figures.
Theorem 12.2. If T: R^k → R^k is linear and nonsingular, then A ∈ 𝓡^k implies that TA ∈ 𝓡^k and

(12.2)  λ_k(TA) = |det T| · λ_k(A).
†See also Problems 12.15, 17.18, and 20.4.
*An analogous fact was used in the construction of a nonmeasurable set at the end of Section 3.
As a parallelepiped is the image of a rectangle under a linear transformation, (12.2) can be used to compute its volume. If T is a rotation or a reflection, that is, an orthogonal or a unitary transformation, then det T = ±1, and so λ_k(TA) = λ_k(A). Hence every rigid transformation (a rotation followed by a translation) preserves Lebesgue measure.

PROOF OF THE THEOREM. Since T⋃_n A_n = ⋃_n TA_n and TA^c = (TA)^c because of the assumed nonsingularity of T, the class 𝒢 = [A: TA ∈ 𝓡^k] is a σ-field. Since TA is open for open A, again by the assumed nonsingularity of T, 𝒢 contains all the open sets and hence (Example 10.1) all the Borel sets. Therefore, A ∈ 𝓡^k implies that TA ∈ 𝓡^k.

For A ∈ 𝓡^k, set μ₁(A) = λ_k(TA) and μ₂(A) = |det T| · λ_k(A). Then μ₁ and μ₂ are measures, and by Theorem 10.3 they will agree on 𝓡^k (which is the assertion (12.2)) if they agree on the π-system consisting of the rectangles [x: a_i < x_i ≤ b_i, i = 1, …, k] for which the a_i and the b_i are all rational (Example 10.1). It suffices therefore to prove (12.2) for rectangles with sides of rational length. Since such a rectangle is a finite disjoint union of cubes and λ_k is translation-invariant, it is enough to check (12.2) for cubes
(12.3)  A = [x: 0 < x_i ≤ c, i = 1, …, k]
that have their lower corner at the origin.

Now the general T can by elementary row and column operations† be represented as a product of linear transformations of these three special forms:
(1°) T(x_1, …, x_k) = (x_{π1}, …, x_{πk}), where π is a permutation of the set {1, 2, …, k};
(2°) T(x_1, …, x_k) = (αx_1, x_2, …, x_k);
(3°) T(x_1, …, x_k) = (x_1 + x_2, x_2, …, x_k).
Because of the rule for multiplying determinants, it suffices to check (12.2) for T of these three forms. And, as observed, for each such T it suffices to consider cubes (12.3).

(1°) Such a T is a permutation matrix, and so det T = ±1. Since (12.3) is invariant under T, (12.2) is in this case obvious.

(2°) Here det T = α, and TA = [x: x_1 ∈ H, 0 < x_i ≤ c, i = 2, …, k], where H = (0, αc] if α > 0, H = {0} if α = 0, and H = [αc, 0) if α < 0. In each case, λ_k(TA) = |α| · c^k = |α| · λ_k(A).

†BIRKHOFF & MAC LANE, Section 8.9.
(3°) Here det T = 1. Let B = [x: 0 < x_i ≤ c, i = 3, …, k], where B = R^k if k < 3, and define

B_1 = [x: 0 < x_1 ≤ x_2 ≤ c] ∩ B,
B_2 = [x: 0 < x_2 < x_1 ≤ c] ∩ B,
B_3 = [x: c < x_1 ≤ c + x_2, 0 < x_2 ≤ c] ∩ B.

Then A = B_1 ∪ B_2, TA = B_2 ∪ B_3, and B_1 + (c, 0, …, 0) = B_3. Since λ_k(B_1) = λ_k(B_3) by translation invariance, (12.2) follows by additivity. ∎
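For k = 2 the assertion (12.2) can be checked concretely: the image of the unit square under a linear T is a parallelogram, and the shoelace formula for polygon area recovers |det T|. A sketch (the function name and the test matrix are our own choices):

```python
def image_area_of_unit_square(T):
    """Area of T(unit square) for a 2x2 matrix T (nested lists),
    via the shoelace formula applied to the image's vertices."""
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]  # ccw vertices
    pts = [(T[0][0]*x + T[0][1]*y, T[1][0]*x + T[1][1]*y) for x, y in square]
    n = len(pts)
    s = sum(pts[i][0]*pts[(i+1) % n][1] - pts[(i+1) % n][0]*pts[i][1]
            for i in range(n))
    return abs(s) / 2.0
```

For T = [[2, 1], [0.5, 3]] this returns 5.5 = |det T| · λ₂(unit square), in agreement with (12.2); for a rotation matrix it returns 1, illustrating that rigid transformations preserve Lebesgue measure.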
If T is singular, then det T = 0 and TA lies in a (k − 1)-dimensional subspace. Since such a subspace has measure 0, (12.2) holds if A and TA lie in 𝓡^k. The surprising thing is that A ∈ 𝓡^k need not imply that TA ∈ 𝓡^k if T is singular. For even a very simple transformation such as the projection T(x_1, x_2) = (x_1, 0) in the plane, there exist Borel sets A for which TA is not a Borel set.† But if A is a rectangle, TA is a Borel set by the argument involving (1°), (2°), and (3°); note that α can be 0 in (2°).
Regularity
Important among measures on 𝓡^k are those assigning finite measure to bounded sets. They share with λ_k the property of regularity:

Theorem 12.3. Suppose that μ is a measure on 𝓡^k such that μ(A) < ∞ if A is bounded.
(i) For A ∈ 𝓡^k and ε > 0, there exist a closed C and an open G such that C ⊂ A ⊂ G and μ(G − C) < ε.
(ii) If μ(A) < ∞, then μ(A) = sup μ(K), the supremum extending over the compact subsets K of A.
PROOF. The second part of the theorem follows from the first: μ(A) < ∞ implies that μ(A − A₀) < ε for a bounded subset A₀ of A, and it then
†See HAUSDORFF, p. 241.
follows from the first part that μ(A₀ − K) < ε for a closed and hence compact subset K of A₀.

To prove (i), consider first a bounded rectangle A = [x: a_i < x_i ≤ b_i, i ≤ k]. The set G_n = [x: a_i < x_i < b_i + n⁻¹, i ≤ k] is open, and G_n ↓ A. Since μ(G₁) is finite by hypothesis, it follows by continuity from above that μ(G_n − A) < ε for large n. A bounded rectangle can therefore be approximated from the outside by open sets.

The rectangles form a semiring (Example 11.5). For an arbitrary set A in 𝓡^k, by Theorem 11.4(i) there exist bounded rectangles A_k such that A ⊂ ⋃_k A_k and μ((⋃_k A_k) − A) < ε. Choose open sets G_k such that A_k ⊂ G_k and μ(G_k − A_k) < ε/2^k. Then G = ⋃_k G_k is open and μ(G − A) < 2ε. Thus the general k-dimensional Borel set can be approximated from the outside by open sets. To approximate from the inside by closed sets, pass to complements. ∎

Specifying Measures on the Line
There are on the line many measures other than λ that are important for probability theory. There is a useful way to describe the collection of all measures on 𝓡¹ that assign finite measure to each bounded set. If μ is such a measure, define a real function F by

(12.4)  F(x) = μ(0, x] if x > 0,  and F(x) = −μ(x, 0] if x ≤ 0.
It is because μ(A) < ∞ for bounded A that F is a finite function. Clearly, F is nondecreasing. Suppose that x_n ↓ x. If x ≥ 0, apply part (ii) of Theorem 10.2, and if x < 0, apply part (i); in either case, F(x_n) ↓ F(x) follows. Thus F is continuous from the right. Finally,

(12.5)  μ(a, b] = F(b) − F(a)

for every bounded interval (a, b]. If μ is Lebesgue measure, then (12.4) gives F(x) = x.

The finite intervals form a π-system generating 𝓡¹, and therefore by Theorem 10.3 the function F completely determines μ through the relation (12.5). But (12.5) and μ do not determine F: if F(x) satisfies (12.5), then so does F(x) + c. On the other hand, for a given μ, (12.5) certainly determines F to within such an additive constant. For finite μ, it is customary to standardize F by defining it not by (12.4) but by
(12.6)  F(x) = μ(−∞, x];
then lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = μ(R¹). If μ is a probability measure, F is called a distribution function (the adjective cumulative is sometimes added). Measures μ are often specified by means of the function F. The following theorem ensures that to each F there does exist a μ.
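For a concrete instance of (12.5) and (12.6), take F(x) = 1 − e^{−x} for x > 0 and F(x) = 0 for x ≤ 0 (a probability distribution function); the measure it determines assigns to (a, b] the mass F(b) − F(a). A sketch with names of our own choosing:

```python
import math

def F(x):
    """A nondecreasing, right-continuous F with F(-inf) = 0, F(inf) = 1."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def mu(a, b):
    """Mass mu(a, b] of the determined measure, per (12.5)."""
    return F(b) - F(a)
```

Here mu(0, math.log(2)) is 0.5, and the additivity mu(a, b] + mu(b, c] = mu(a, c] holds automatically, since the inner F(b) terms cancel.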
Theorem 12.4. If F is a nondecreasing, right-continuous real function on the line, there exists on 𝓡¹ a unique measure μ satisfying (12.5) for all a and b.
As noted above, uniqueness is a simple consequence of Theorem 10.3. The proof of existence is almost the same as the construction of Lebesgue measure, the case F(x) = x. This proof is not carried through at this point, because it is contained in a parallel, more general construction for k-dimensional space in the next theorem. For a very simple argument establishing Theorem 12.4, see the second proof of Theorem 14.1.

Specifying Measures in R^k
The σ-field 𝓡^k of k-dimensional Borel sets is generated by the class of bounded rectangles

(12.7)  A = [x: a_i < x_i ≤ b_i, i = 1, …, k]

(Example 10.1). If I_i = (a_i, b_i], A has the form of a cartesian product

(12.8)  A = I_1 × ··· × I_k.

Consider the sets of the special form

(12.9)  S_x = [y: y_i ≤ x_i, i = 1, …, k];

S_x consists of the points "southwest" of x = (x_1, …, x_k); in the case k = 1 it is the half-infinite interval (−∞, x]. Now S_x is closed, and (12.7) has the form

(12.10)  A = S_{(b_1, …, b_k)} − ⋃_{i=1}^k S_{(b_1, …, b_{i−1}, a_i, b_{i+1}, …, b_k)}.
Therefore, the class of sets (12.9) generates 𝓡^k. This class is a π-system.

The objective is to find a version of Theorem 12.4 for k-space. This will in particular give k-dimensional Lebesgue measure. The first problem is to find the analogue of (12.5). A bounded rectangle (12.7) has 2^k vertices: the points x = (x_1, …, x_k) for which each x_i is either a_i or b_i. Let sgn_A x, the signum of the vertex, be +1 or −1, according as the number of i (1 ≤ i ≤ k) satisfying x_i = a_i is even or odd. For a real function F on R^k, the difference of F around the vertices of A is Δ_A F = Σ_x sgn_A x · F(x), the sum extending over the 2^k vertices x of A. In the case k = 1, A = (a, b] and Δ_A F = F(b) − F(a). In the case k = 2, Δ_A F = F(b_1, b_2) − F(b_1, a_2) − F(a_1, b_2) + F(a_1, a_2).

Since the k-dimensional analogue of (12.4) is complicated, suppose at first that μ is a finite measure on 𝓡^k and consider instead the analogue of (12.6), namely
(12.11)  F(x) = μ[y: y_i ≤ x_i, i = 1, …, k].

Suppose that S_x is defined by (12.9) and A is a bounded rectangle (12.7). Then

(12.12)  μ(A) = Δ_A F.

To see this, apply to the union on the right in (12.10) the inclusion-exclusion formula (10.5). The k sets in the union give 2^k − 1 intersections, and these are the sets S_x for x ranging over the vertices of A other than (b_1, …, b_k). Taking into account the signs in (10.5) leads to (12.12).
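The difference Δ_A F is mechanical to evaluate: sum sgn_A x · F(x) over the 2^k vertices. For F(x_1, x_2) = x_1 x_2 it returns the area of A, matching (12.12) when μ is λ₂. A sketch (the helper function is ours):

```python
from itertools import product

def delta_F(F, a, b):
    """Difference of F around the vertices of A = (a_1, b_1] x ... x (a_k, b_k]:
    sum of sgn_A(x) * F(x), where sgn is +1 iff an even number of the
    vertex's coordinates are lower endpoints a_i."""
    k = len(a)
    total = 0.0
    for choice in product((0, 1), repeat=k):          # 0 picks a_i, 1 picks b_i
        x = [b[i] if c else a[i] for i, c in enumerate(choice)]
        num_lower = k - sum(choice)                   # coordinates equal to a_i
        total += (-1) ** num_lower * F(x)
    return total
```

For example, delta_F(lambda x: x[0]*x[1], (1, 2), (4, 6)) gives 12.0, the area (4 − 1)(6 − 2); for k = 1 the sum reduces to F(b) − F(a).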
Suppose x^(n) ↓ x in the sense that x_i^(n) ↓ x_i as n → ∞ for each i = 1, …, k. Then S_{x^(n)} ↓ S_x, and hence F(x^(n)) ↓ F(x) by Theorem 10.2(ii). In this sense, F is continuous from above.
Theorem 12.5. Suppose that the real function F on R^k is continuous from above and satisfies Δ_A F ≥ 0 for bounded rectangles A. Then there exists a unique measure μ on 𝓡^k satisfying (12.12) for bounded rectangles A.
The empty set can be taken as a bounded rectangle (12.7) for which a_i = b_i for some i, and for such a set A, Δ_A F = 0. Thus (12.12) defines a finite-valued set function μ on the class of bounded rectangles. The point of the theorem is that μ extends uniquely to a measure on 𝓡^k. The uniqueness is an immediate consequence of Theorem 10.3, since the bounded rectangles form a π-system generating 𝓡^k.
If F is bounded, then μ will be a finite measure. But the theorem does not require that F be bounded. The most important unbounded F is F(x) = x_1 ··· x_k. Here Δ_A F = (b_1 − a_1) ··· (b_k − a_k) for A given by (12.7). This is the ordinary volume of A as specified by (12.1). The corresponding measure extended to 𝓡^k is k-dimensional Lebesgue measure as described at the beginning of this section.

PROOF OF THEOREM 12.5. As already observed, the uniqueness of the extension is easy to prove. To prove its existence it will first be shown that μ as defined by (12.12) is finitely additive on the class of bounded rectangles. Suppose that each side I_i = (a_i, b_i] of a bounded rectangle (12.7) is partitioned into n_i subintervals J_{ij} = (t_{i,j−1}, t_{ij}], j = 1, …, n_i, where a_i = t_{i0} < t_{i1} < ··· < t_{i n_i} = b_i. The n_1 n_2 ··· n_k rectangles

(12.13)  J_{1j_1} × J_{2j_2} × ··· × J_{kj_k}

then partition A. Call such a partition regular. It will first be shown that μ adds for regular partitions:

(12.14)  μ(A) = Σ_{j_1, …, j_k} μ(J_{1j_1} × ··· × J_{kj_k}).
The right side of (12.14) is Σ_B Σ_x sgn_B x · F(x), where the outer sum extends over the rectangles B of the form (12.13) and the inner sum extends over the vertices x of B. Now

(12.15)  Σ_B Σ_x sgn_B x · F(x) = Σ_x F(x) Σ_B sgn_B x,

where on the right the outer sum extends over each x that is a vertex of one or more of the B's, and for fixed x the inner sum extends over the B's of which it is a vertex. Suppose that x is a vertex of one or more of the B's but is not a vertex of A. Then there must be an i such that x_i is neither a_i nor b_i. There may be several such i, but fix on one of them and suppose for notational convenience that it is i = 1. Then x_1 = t_{1j} with 0 < j < n_1. The rectangles (12.13) of which x is a vertex therefore come in pairs B′ = J_{1j} × J_{2j_2} × ··· × J_{kj_k} and B″ = J_{1,j+1} × J_{2j_2} × ··· × J_{kj_k}, and sgn_{B′} x = −sgn_{B″} x. Thus the inner sum on the right in (12.15) is 0 if x is not a vertex of A. On the other hand, if x is a vertex of A as well as of at least one B, then for each i either x_i = a_i = t_{i0} or x_i = b_i = t_{i n_i}. In this case x is a vertex of only one B of the form (12.13), the one for which j_i = 1 or j_i = n_i according as x_i = a_i or x_i = b_i, and sgn_B x = sgn_A x. Thus the right side of (12.15) reduces to Δ_A F, which proves (12.14).
Now suppose that A = ⋃_{u=1}^n A_u, where A and the A_u are bounded rectangles of the form (12.8), A_u = I_{1u} × ··· × I_{ku} for u = 1, …, n, and the A_u are disjoint. For each i (1 ≤ i ≤ k), the intervals I_{i1}, …, I_{in} have I_i as their union, although they need not be disjoint. But their endpoints split I_i into disjoint subintervals J_{i1}, …, J_{i m_i} such that each I_{iu} is the union of certain of the J_{ij}. The rectangles B of the form (12.13) are a regular partition of A as before; furthermore, the B's contained in a single A_u form a regular partition of A_u. Since the A_u are disjoint, it follows by (12.14) that

μ(A) = Σ_B μ(B) = Σ_{u=1}^n Σ_{B ⊂ A_u} μ(B) = Σ_{u=1}^n μ(A_u).
Therefore, μ is finitely additive on the class 𝓙^k of bounded k-dimensional rectangles. As shown in Example 11.5, 𝓙^k is a semiring, and so Theorem 11.3 applies.

Now suppose that ⋃_{u=1}^n A_u ⊂ A, where A and the A_u are in 𝓙^k and the latter are disjoint. As shown in Example 11.2, A − ⋃_{u=1}^n A_u is a finite disjoint union ⋃_{v=1}^m B_v of sets in 𝓙^k. Therefore, Σ_{u=1}^n μ(A_u) + Σ_{v=1}^m μ(B_v) = μ(A), and so

(12.16)  Σ_{u=1}^n μ(A_u) ≤ μ(A)  if ⋃_{u=1}^n A_u ⊂ A.

Now suppose that A ⊂ ⋃_{u=1}^n A_u, where A and the A_u are in 𝓙^k. Let B_1 = A ∩ A_1 and B_u = A ∩ A_u ∩ A_1^c ∩ ··· ∩ A_{u−1}^c. Again, each B_u is a finite disjoint union ⋃_v B_{uv} of elements of 𝓙^k. The B_u are disjoint, and so the B_{uv} taken all together are also disjoint. They have union A, so that μ(A) = Σ_u Σ_v μ(B_{uv}). Since ⋃_v B_{uv} = B_u ⊂ A_u, an application of (12.16) gives

(12.17)  μ(A) ≤ Σ_{u=1}^n μ(A_u)  if A ⊂ ⋃_{u=1}^n A_u.

Of course, (12.16) and (12.17) give back finite additivity again.

To apply Theorem 11.3 requires showing that μ is countably subadditive on 𝓙^k. Suppose then that A ⊂ ⋃_{u=1}^∞ A_u, where A and the A_u are in 𝓙^k. The problem is to prove that

(12.18)  μ(A) ≤ Σ_{u=1}^∞ μ(A_u).

Suppose that ε > 0. If A is given by (12.7) and B = [x: a_i + δ < x_i ≤ b_i, i ≤ k], then μ(B) > μ(A) − ε for small enough positive δ, because μ is defined by (12.12) and F is continuous from above. Note that A contains the closure B⁻ = [x: a_i + δ ≤ x_i ≤ b_i, i ≤ k] of B. Similarly, for each u there is in 𝓙^k a set B_u = [x: a_{iu} < x_i ≤ b_{iu} + δ_u, i ≤ k] such that μ(B_u) < μ(A_u) + ε/2^u and A_u is contained in the interior B_u° = [x: a_{iu} < x_i < b_{iu} + δ_u, i ≤ k] of B_u. Since B⁻ ⊂ A ⊂ ⋃_{u=1}^∞ A_u ⊂ ⋃_{u=1}^∞ B_u°, it follows by the Heine–Borel theorem that B ⊂ B⁻ ⊂ ⋃_{u=1}^n B_u° ⊂ ⋃_{u=1}^n B_u for some n. Now (12.17) applies, and so μ(A) − ε ≤ μ(B) ≤ Σ_{u=1}^n μ(B_u) ≤ Σ_{u=1}^∞ μ(A_u) + ε. Since ε was arbitrary, the proof of (12.18) is complete.
Thus μ as defined by (12.12) is finitely additive and countably subadditive on the semiring 𝓙^k. By Theorem 11.3, μ extends to a measure on 𝓡^k = σ(𝓙^k). ∎
PROBLEMS

12.1. Suppose that μ is a measure on 𝓡¹ that is finite for bounded sets and is translation-invariant: μ(A + x) = μ(A). Show that μ(A) = aλ(A) for some a ≥ 0. Extend to R^k.

12.2. Suppose that A ∈ 𝓡¹, λ(A) > 0, and 0 < θ < 1. Show that there is a bounded open interval I such that λ(A ∩ I) > θλ(I). Hint: Show that λ(A) may be assumed finite, and choose an open G such that A ⊂ G and λ(A) > θλ(G). Now G = ⋃_n I_n for disjoint open intervals I_n [A12], and Σ_n λ(A ∩ I_n) > θ Σ_n λ(I_n); use an I_n.

12.3. ↑ If A ∈ 𝓡¹ and λ(A) > 0, then the origin is interior to the difference set D(A) = [x − y: x, y ∈ A]. Hint: Choose a bounded open interval I as in Problem 12.2 for θ = 3/4. Suppose that |z| < λ(I)/2; since A ∩ I and (A ∩ I) + z are contained in an interval of length less than 3λ(I)/2 and hence cannot be disjoint, z ∈ D(A).

12.4. 4.13 12.3 ↑ The following construction leads to a subset H of the unit interval that is nonmeasurable in the extreme sense that its inner and outer Lebesgue measures are 0 and 1: λ_*(H) = 0 and λ*(H) = 1 (see (3.9) and (3.10)). Complete the details. The ideas are those in the construction of a nonmeasurable set at the end of Section 3. It will be convenient to work in G = [0, 1); let ⊕ and ⊖ denote addition and subtraction modulo 1 in G, which is a group with identity 0.
(a) Fix an irrational θ in G and for n = 0, ±1, ±2, … let θ_n be nθ reduced modulo 1. Show that θ_n ⊕ θ_m = θ_{n+m}, θ_n ⊖ θ_m = θ_{n−m}, and the θ_n are distinct. Show that {θ_{2n}: n = 0, ±1, …} and {θ_{2n+1}: n = 0, ±1, …} are dense in G.
(b) Take x and y to be equivalent if x ⊖ y lies in {θ_n: n = 0, ±1, …}, which is a subgroup. Let S contain one representative from each equivalence class (each coset). Show that G = ⋃_n (S ⊕ θ_n), where the union is disjoint. Put H = ⋃_n (S ⊕ θ_{2n}) and show that G − H = H ⊕ θ.
(c) Suppose that A is a Borel set contained in H. If λ(A) > 0, then D(A) contains an interval (0, ε); but then some θ_{2k+1} lies in (0, ε) ⊂ D(A) ⊂ D(H), and so θ_{2k+1} = h_1 − h_2 = h_1 ⊖ h_2 = (s_1 ⊕ θ_{2n}) ⊖ (s_2 ⊕ θ_{2m}) for some h_1, h_2 in H and some s_1, s_2 in S. Deduce that s_1 = s_2 and obtain a contradiction. Conclude that λ_*(H) = 0.
(d) Show that λ_*(H ⊕ θ) = 0 and λ*(H) = 1.

12.5. ↑ The construction here gives sets H_n such that H_n ↑ G and λ_*(H_n) = 0. If J_n = G − H_n, then J_n ↓ ∅ and λ*(J_n) = 1.
(a) Let H_n = ⋃_{|k| ≤ n} (S ⊕ θ_k), so that H_n ↑ G. Show that the sets H_n ⊕ θ_{(2n+1)v} are disjoint for different v.
(b) Suppose that A is a Borel set contained in H_n. Show that A and indeed all the A ⊕ θ_{(2n+1)v} have Lebesgue measure 0.
12.6. Suppose that μ is nonnegative and finitely additive on 𝓡^k and μ(R^k) < ∞. Suppose further that μ(A) = sup μ(K), where K ranges over the compact subsets of A. Show that μ is countably additive. (Compare Theorem 12.3(ii).)

12.7. Suppose μ is a measure on 𝓡^k such that bounded sets have finite measure. Given A, show that there exist an F_σ set U (a countable union of closed sets) and a G_δ set V (a countable intersection of open sets) such that U ⊂ A ⊂ V and μ(V − U) = 0.

12.8. 2.17 ↑ Suppose that μ is a nonatomic probability measure on (R^k, 𝓡^k) and that μ(A) > 0. Show that there is an uncountable compact set K such that K ⊂ A and μ(K) = 0.

12.9. The closed support of a measure μ on 𝓡^k is a closed set C⁰ such that C⁰ ⊂ C for closed C if and only if C supports μ. Prove its existence and uniqueness. Characterize the points of C⁰ as those x such that μ(U) > 0 for every neighborhood U of x. If k = 1 and if μ and the function F are related by (12.5), the condition is F(x − ε) < F(x + ε) for all ε; x is in this case called a point of increase of F.

12.10. Let 𝒢 be the class of sets in 𝓡² of the form [(x_1, x_2): x_1 ∈ H] for some H in 𝓡¹. Show that 𝒢 is a σ-field in R² and that two-dimensional Lebesgue measure is not σ-finite on 𝒢 even though it is σ-finite on 𝓡².

12.11. Of minor interest is the k-dimensional analogue of (12.4). Let I_t be (0, t] for t > 0 and (t, 0] for t ≤ 0, and let A_x = I_{x_1} × ··· × I_{x_k}. Let φ(x) be +1 or −1 according as the number of i, 1 ≤ i ≤ k, for which x_i < 0 is even or odd. Show that, if F(x) = φ(x)μ(A_x), then (12.12) holds for bounded rectangles A. Call F degenerate if it is a function of some k − 1 of the coordinates, the requirement in the case k = 1 being that F is constant. Show that Δ_A F = 0 for every bounded rectangle if and only if F is a finite sum of degenerate functions; (12.12) determines F to within addition of a function of this sort.

12.12. Let G be a nondecreasing, right-continuous function on the line and put F(x, y) = min{G(x), y}. Show that F satisfies the conditions of Theorem 12.5, that the curve C = [(x, G(x)): x ∈ R¹] supports the corresponding measure, and that λ₂(C) = 0.

12.13. Let F_1 and F_2 be nondecreasing, right-continuous functions on the line and put F(x_1, x_2) = F_1(x_1)F_2(x_2). Show that F satisfies the conditions of Theorem 12.5. Let μ, μ_1, μ_2 be the measures corresponding to F, F_1, F_2, and prove that μ(A_1 × A_2) = μ_1(A_1)μ_2(A_2) for intervals A_1 and A_2. This μ is the product of μ_1 and μ_2; products are studied in a general setting in Section 18.

12.14. Let 𝒜' be the class of sets A such that there exist an open G and a closed C such that C ⊂ A ⊂ G and μ(G − C) < ε. Prove Theorem 12.3(i) for the case of a finite μ by showing that 𝒜' is a σ-field containing the closed rectangles.
12.15. In the case F(x) = x_1 ··· x_k, the proof of Theorem 12.5 shows that k-dimensional volume (see (12.1)) is finitely additive and countably subadditive for bounded rectangles. Starting with this fact, use the ideas in Example 11.6 to give another construction of λ_k.

12.16. Identifying π. Take as a starting point the fact that each point (x, y) of the plane other than the origin has a unique polar representation (x, y) = (r cos θ, r sin θ), where 0 ≤ θ < 2π and r = (x² + y²)^{1/2}. Let D_r be the closed disc [(x, y): x² + y² ≤ r²] and define π₀ = λ₂(D₁). The problem is to identify π₀, an area, with the π of trigonometry.
(a) Let T_φ be the linear map with matrix

[ cos φ  −sin φ ]
[ sin φ   cos φ ].

Show by the addition formulas for sin and cos that T_φ rotates the plane through the angle φ. Show that it preserves area.
(b) Consider the sector S_{θ₁θ₂} consisting of the points with polar representations satisfying θ₁ ≤ θ < θ₂. Show that λ₂(D₁ ∩ S_{θ₁θ₂}) = π₀(θ₂ − θ₁)/2π by considering in sequence the cases where (θ₂ − θ₁)/2π is (i) the reciprocal of an integer, (ii) rational, (iii) arbitrary.
(c) Show that the linear map taking (x, y) to (rx, ry) multiplies areas by r². Let A_{r₁r₂} be the annulus D_{r₂} − D_{r₁}, and show that

(12.19)  λ₂(A_{r₁r₂} ∩ S_{θ₁θ₂}) = π₀(r₂² − r₁²)(θ₂ − θ₁)/2π.

Problem 18.24 shows how to prove π₀ = π by calculating the area of D₁ ∩ S_{0θ} in two different ways.
SECTION 13. MEASURABLE FUNCTIONS AND MAPPINGS
If a real function X on Ω has finite range, it is by the definition in Section 5 a simple random variable if [ω: X(ω) = x] lies in the basic σ-field 𝓕 for each x. The requirement appropriate for the general real function X is stronger: namely, [ω: X(ω) ∈ H] must lie in 𝓕 for each linear Borel set H. An abstract version of this definition greatly simplifies the theory of such functions.
Measurable Mappings
Let (Ω, 𝓕) and (Ω', 𝓕') be two measurable spaces. For a mapping T: Ω → Ω', consider the inverse images T⁻¹A' = [ω ∈ Ω: Tω ∈ A'] for A' ⊂ Ω'. (See [A7] for the properties of inverse images.) The mapping T is measurable 𝓕/𝓕' if T⁻¹A' ∈ 𝓕 for each A' ∈ 𝓕'.
For a real function f, the image space Ω' is the line R¹, and in this case 𝓡¹ is always tacitly understood to play the role of 𝓕'. A real function f on Ω is thus measurable 𝓕 (or simply measurable, if it is clear from the context what 𝓕 is involved) if it is measurable 𝓕/𝓡¹, that is, if f⁻¹H = [ω: f(ω) ∈ H] ∈ 𝓕 for every H ∈ 𝓡¹. In probability contexts, a real measurable function is called a random variable. The point of the definition is to ensure that [ω: f(ω) ∈ H] has a measure or probability for all sufficiently regular sets H of real numbers, that is, for all Borel sets H.
Example 13.1. A real function f with finite range is measurable 𝓕 if f⁻¹{x} ∈ 𝓕 for each singleton {x}, but this is too weak a condition to impose on the general f. (It is satisfied if (Ω, 𝓕) = (R¹, 𝓡¹) and f is any one-to-one map of the line into itself; but in this case f⁻¹H, even for so simple a set H as an interval, can for an appropriately chosen f be any uncountable set, say the non-Borel set constructed in Section 3.) On the other hand, for a measurable f with finite range, f⁻¹H ∈ 𝓕 for every H ⊂ R¹; but this is too strong a condition to impose on the general f. (For (Ω, 𝓕) = (R¹, 𝓡¹), even f(x) = x fails to satisfy it.) Notice that nothing is required of the image fA; it need not lie in 𝓡¹ for A in 𝓕. ∎
If in addition to (Ω, 𝓕), (Ω', 𝓕'), and the map T: Ω → Ω' there is a third measurable space (Ω″, 𝓕″) and a map T': Ω' → Ω″, the composition T'T = T' ∘ T is the mapping Ω → Ω″ that carries ω to T'(T(ω)).
Theorem 13.1. (i) If T⁻¹A' ∈ 𝓕 for each A' ∈ 𝒜' and 𝒜' generates 𝓕', then T is measurable 𝓕/𝓕'.
(ii) If T is measurable 𝓕/𝓕' and T' is measurable 𝓕'/𝓕″, then T'T is measurable 𝓕/𝓕″.

PROOF. Since T⁻¹(Ω' − A') = Ω − T⁻¹A' and T⁻¹(⋃_n A'_n) = ⋃_n T⁻¹A'_n, and since 𝓕 is a σ-field in Ω, the class [A': T⁻¹A' ∈ 𝓕] is a σ-field in Ω'. If this σ-field contains 𝒜', it must also contain σ(𝒜'), and (i) follows. As for (ii), it follows by the hypotheses that A″ ∈ 𝓕″ implies that (T')⁻¹A″ ∈ 𝓕', which in turn implies that (T'T)⁻¹A″ = [ω: T'Tω ∈ A″] = [ω: Tω ∈ (T')⁻¹A″] = T⁻¹((T')⁻¹A″) ∈ 𝓕. ∎
By part (i), if f is a real function such that [ω: f(ω) ≤ x] lies in 𝓕 for all x, then f is measurable 𝓕. This condition is usually easy to check.

Mappings into R^k

For a mapping f: Ω → R^k carrying Ω into k-space, 𝓡^k is always understood to be the σ-field in the image space. In probabilistic contexts, a
measurable mapping into R^k is called a random vector. Now f must have the form

(13.1)  f(ω) = (f_1(ω), …, f_k(ω))

for real functions f_j(ω). Since the sets (12.9) (the "southwest regions") generate 𝓡^k, Theorem 13.1(i) implies that f is measurable 𝓕 if and only if the set

(13.2)  [ω: f_1(ω) ≤ x_1, …, f_k(ω) ≤ x_k] = ⋂_{j=1}^k [ω: f_j(ω) ≤ x_j]

lies in 𝓕 for each (x_1, …, x_k). This condition holds if each f_j is measurable 𝓕. On the other hand, if x_j = x is fixed and x_1 = ··· = x_{j−1} = x_{j+1} = ··· = x_k = n goes to ∞, the sets (13.2) increase to [ω: f_j(ω) ≤ x]; the condition thus implies that each f_j is measurable. Therefore, f is measurable 𝓕 if and only if each component function f_j is measurable 𝓕. This provides a practical criterion for mappings into R^k.

A mapping f: R^i → R^k is defined to be measurable if it is measurable 𝓡^i/𝓡^k. Such functions are often called Borel functions. To sum up, T: Ω → Ω' is measurable 𝓕/𝓕' if T⁻¹A' ∈ 𝓕 for all A' ∈ 𝓕'; f: Ω → R^k is measurable 𝓕 if it is measurable 𝓕/𝓡^k; and f: R^i → R^k is measurable (a Borel function) if it is measurable 𝓡^i/𝓡^k. If H lies outside 𝓡¹, then I_H (i = k = 1) is not a Borel function.
Theorem 13.2. If f: R^i → R^k is continuous, then it is measurable.

PROOF. As noted above, it suffices to check that each set (13.2) lies in 𝓡^i. But each is closed because of continuity. ∎
Theorem 13.3. If f_j: Ω → R¹ is measurable 𝓕, j = 1, …, k, then g(f_1(ω), …, f_k(ω)) is measurable 𝓕 if g: R^k → R¹ is measurable, in particular if it is continuous.

PROOF. If the f_j are measurable, then so is (13.1), so that the result follows by Theorem 13.1(ii). ∎
Taking g(x_1, …, x_k) to be Σ_{i=1}^k x_i, ∏_{i=1}^k x_i, and max{x_1, …, x_k} in turn shows that sums, products, and maxima of measurable functions are measurable. If f(ω) is real and measurable, then so are sin f(ω), e^{f(ω)}, and so on, and if f(ω) never vanishes, then 1/f(ω) is measurable as well.
Limits and Measurability
For a real function f it is often convenient to admit the artificial values ∞ and −∞, that is, to work with the extended real line [−∞, ∞]. Such an f is by definition measurable 𝓕 if [ω: f(ω) ∈ H] lies in 𝓕 for each Borel set H of (finite) real numbers and if [ω: f(ω) = ∞] and [ω: f(ω) = −∞] both lie in 𝓕. This extension of the notion of measurability is convenient in connection with limits and suprema, which need not be finite.
Theorem 13.4. Suppose that f_1, f_2, … are real functions measurable 𝓕.
(i) The functions sup_n f_n, inf_n f_n, lim sup_n f_n, and lim inf_n f_n are measurable 𝓕.
(ii) If lim_n f_n exists everywhere, then it is measurable 𝓕.
(iii) The ω-set where {f_n(ω)} converges lies in 𝓕.
(iv) If f is measurable 𝓕, then the ω-set where f_n(ω) → f(ω) lies in 𝓕.

PROOF. Clearly, [sup_n f_n ≤ x] = ⋂_n [f_n ≤ x] lies in 𝓕 even for x = ∞ and x = −∞, and so sup_n f_n is measurable. The measurability of inf_n f_n follows the same way, and hence lim sup_n f_n = inf_n sup_{k≥n} f_k and lim inf_n f_n = sup_n inf_{k≥n} f_k are measurable. If lim_n f_n exists, it coincides with these last two functions and hence is measurable. Finally, the set in (iii) is the set where lim sup_n f_n(ω) = lim inf_n f_n(ω), and that in (iv) is the set where this common value is f(ω). ∎
Special cases of this theorem have been encountered before: part (iv), for example, in connection with the strong law of large numbers. The last three parts of the theorem obviously carry over to mappings into R^k.

A simple real function is one with finite range; it can be put in the form

(13.3)  f = Σ_{i=1}^n x_i I_{A_i},

where the A_i form a finite decomposition of Ω. It is measurable 𝓕 if each A_i lies in 𝓕. The simple random variables of Section 5 have this form. Many results concerning measurable functions are most easily proved first for simple functions and then, by an appeal to the next theorem and a passage to the limit, for the general measurable function.

Theorem 13.5. If f is real and measurable 𝓕, there exists a sequence {f_n} of simple functions, each measurable 𝓕, such that

(13.4)  0 ≤ f_n(ω) ↑ f(ω)  if f(ω) ≥ 0
and

(13.5)  0 ≥ f_n(ω) ↓ f(ω)  if f(ω) ≤ 0.

PROOF.
Define

(13.6)  f_n(ω) =
  −n  if −∞ ≤ f(ω) ≤ −n,
  −(k − 1)2⁻ⁿ  if −k2⁻ⁿ < f(ω) ≤ −(k − 1)2⁻ⁿ, 1 ≤ k ≤ n2ⁿ,
  (k − 1)2⁻ⁿ  if (k − 1)2⁻ⁿ ≤ f(ω) < k2⁻ⁿ, 1 ≤ k ≤ n2ⁿ,
  n  if n ≤ f(ω) ≤ ∞.

This sequence has the required properties. ∎
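The dyadic construction (13.6) is easy to trace numerically at a fixed ω: the value f(ω) is capped at ±n and otherwise rounded toward 0 to a multiple of 2⁻ⁿ. A sketch of f_n evaluated at a point where f takes the value x (the function name is ours):

```python
import math

def f_n(x, n):
    """The simple-function value (13.6) at a point where f equals x:
    cap at n (resp. -n), otherwise round toward 0 to a multiple of 2^-n."""
    if x >= n:
        return float(n)
    if x <= -n:
        return float(-n)
    if x >= 0:
        return math.floor(x * 2**n) / 2**n    # (k-1)2^-n <= x < k 2^-n
    return math.ceil(x * 2**n) / 2**n         # -k 2^-n < x <= -(k-1)2^-n
```

At x = 0.7 the values f_1, f_3, f_20 are 0.5, 0.625, and 0.6999…, increasing to 0.7 as (13.4) requires; at x = −0.3 the values decrease to −0.3, as in (13.5).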
Note that (13.6) covers the possibilities f(ω) = ∞ and f(ω) = −∞.

If A ∈ 𝓕, a function f defined only on A is by definition measurable if [ω ∈ A: f(ω) ∈ H] lies in 𝓕 for H ∈ 𝓡¹ and for H = {∞} and H = {−∞}.
Transformations of Measures
Let (Ω, 𝓕) and (Ω', 𝓕') be measurable spaces and suppose that the mapping T: Ω → Ω' is measurable 𝓕/𝓕'. Given a measure μ on 𝓕, define a set function μT⁻¹ on 𝓕' by

(13.7)  (μT⁻¹)(A') = μ(T⁻¹A'),  A' ∈ 𝓕'.

That is, μT⁻¹ assigns value μ(T⁻¹A') to the set A'. If A' ∈ 𝓕', then T⁻¹A' ∈ 𝓕 because T is measurable, and hence the set function μT⁻¹ is well defined on 𝓕'. Since T⁻¹⋃_n A'_n = ⋃_n T⁻¹A'_n and the T⁻¹A'_n are disjoint sets in Ω if the A'_n are disjoint sets in Ω', the countable additivity of μT⁻¹ follows from that of μ. Thus μT⁻¹ is a measure. This way of transferring a measure from Ω to Ω' will prove useful in a number of ways. If μ is finite, so is μT⁻¹; if μ is a probability measure, so is μT⁻¹.†

†But see Problem 13.22.
PROBLEMS

13.1. Show by examples in which Ω = Ω′ is a space of two points that measurability 𝓕/𝓕′ is compatible with all but one of the eight statements

A ∈ 𝓕 and TA ∈ 𝓕′,  A ∈ 𝓕 and TA ∉ 𝓕′,
A ∉ 𝓕 and TA ∈ 𝓕′,  A ∉ 𝓕 and TA ∉ 𝓕′,
A′ ∈ 𝓕′ and T⁻¹A′ ∈ 𝓕,  A′ ∈ 𝓕′ and T⁻¹A′ ∉ 𝓕,
A′ ∉ 𝓕′ and T⁻¹A′ ∈ 𝓕,  A′ ∉ 𝓕′ and T⁻¹A′ ∉ 𝓕.

13.2. Although an indicator I_A is measurable 𝓕 if and only if A ∈ 𝓕, the simple function (13.3) can be measurable 𝓕 even if some of the A_i lie outside 𝓕.

13.3. Show that a monotone real function is measurable.

13.4. Functions are often defined in pieces (for example, let f(x) be x³ or x⁻¹ as x ≥ 0 or x < 0), and the following result shows that the function is measurable if the pieces are. Consider measurable spaces (Ω, 𝓕) and (Ω′, 𝓕′) and a map T: Ω → Ω′. Let A₁, A₂, ... be a countable covering of Ω by 𝓕-sets. Consider the σ-field 𝓕_n = [A: A ⊂ A_n, A ∈ 𝓕] in A_n and the restriction T_n of T to A_n. Show that T is measurable 𝓕/𝓕′ if and only if T_n is measurable 𝓕_n/𝓕′ for each n.

13.5. (a) For a map T and σ-fields 𝓕 and 𝓕′, define T⁻¹𝓕′ = [T⁻¹A′: A′ ∈ 𝓕′] and T𝓕 = [A′: T⁻¹A′ ∈ 𝓕]. Show that T⁻¹𝓕′ and T𝓕 are σ-fields and that measurability 𝓕/𝓕′ is equivalent to T⁻¹𝓕′ ⊂ 𝓕 and to 𝓕′ ⊂ T𝓕.
(b) For given 𝓕′, T⁻¹𝓕′, which is the smallest σ-field for which T is measurable 𝓕/𝓕′, is by definition the σ-field generated by T. For simple random variables describe σ(X₁, ..., X_n) in these terms.
(c) Let σ′(𝓐′) be the σ-field in Ω′ generated by 𝓐′. Show that σ(T⁻¹𝓐′) = T⁻¹(σ′(𝓐′)). Prove Theorem 10.1 by taking T to be the identity map from Ω₀ to Ω.

13.6. ↑ Suppose that f: Ω → R¹. Show that f is measurable T⁻¹𝓕′ if and only if there exists a map φ: Ω′ → R¹ such that φ is measurable 𝓕′ and f = φT. Hint: First consider simple functions and then use Theorem 13.5.

13.7. ↑ Relate the result in Problem 13.6 to Theorem 5.1(ii).

13.8. Show of real functions f and g that f(ω) + g(ω) < x if and only if there exist rationals r and s such that r + s < x, f(ω) < r, and g(ω) < s. Prove directly that f + g is measurable 𝓕 if f and g are.

13.9. Let 𝓕 be a σ-field in R¹. Show that 𝓡¹ ⊂ 𝓕 if and only if every continuous function is measurable 𝓕. Thus 𝓡¹ is the smallest σ-field with respect to which all the continuous functions are measurable.
13.10. Consider on R¹ the smallest class 𝓤 (that is, the intersection of all classes) of real functions containing all the continuous functions and closed under pointwise passages to the limit. The elements of 𝓤 are called Baire functions. Show that Baire functions and Borel functions are the same thing.

13.11. A real function f on the line is upper semicontinuous at x if for each ε there is a δ such that |x − y| < δ implies that f(y) < f(x) + ε. Show that, if f is everywhere upper semicontinuous, then it is measurable.

13.12. If f is bounded and measurable 𝓕, then there exist simple functions f_n, each measurable 𝓕, such that f_n(ω) → f(ω) uniformly in ω.

13.13. Suppose that f_n and f are finite-valued, 𝓕-measurable functions such that f_n(ω) → f(ω) for ω ∈ A, where μ(A) < ∞ (μ a measure on 𝓕). Prove Egoroff's theorem: For each ε there exists a subset B of A such that μ(B) < ε and f_n(ω) → f(ω) uniformly on A − B. Hint: Let B_n^(k) be the set of ω in A such that |f(ω) − f_i(ω)| > k⁻¹ for some i ≥ n. Show that B_n^(k) ↓ ∅ as n ↑ ∞, choose n_k so that μ(B_{n_k}^(k)) < ε/2ᵏ, and put B = ⋃_{k=1}^∞ B_{n_k}^(k).

13.14. ↑ Show that Egoroff's theorem is false without the hypothesis μ(A) < ∞.

13.15. 13.13 ↑ Prove this variant of Egoroff's theorem: If μ(A) < ∞ and lim sup_n f_n ≤ f on A, then for each ε there is a subset B of A with the property that μ(B) < ε and for each δ there is an n₀ such that f_n(ω) < f(ω) + δ for ω ∈ A − B and n ≥ n₀.

13.16. Suppose that (Ω, 𝓕, P) is a probability measure space, T: Ω → Ω′ is measurable 𝓕/𝓕′, and Q = PT⁻¹ on 𝓕′. Consider the outer and inner measures as defined by (3.9) and (3.10). Show that Q_*(A′) ≤ P_*(T⁻¹A′) ≤ P*(T⁻¹A′) ≤ Q*(A′) and show by example that the inequalities can be strict. Show that P*(A) ≤ Q*(TA) and that, if T is one-to-one, Q_*(TA) ≤ P_*(A).
13.17. 11.7 ↑ Show that, if 𝓐 is a field generating 𝓕, if f is measurable 𝓕, and if μ(Ω) < ∞, then for positive ε there exists a simple function g = Σ_i x_i I_{A_i} such that A_i ∈ 𝓐 and μ[ω: |f(ω) − g(ω)| ≥ ε] < ε. Hint: First approximate f by a simple function measurable 𝓕, and then use Problem 11.7.

13.18. 2.8 ↑ Show that, if f is measurable σ(𝓐), then there exists a countable subclass 𝓐_f of 𝓐 such that f is measurable σ(𝓐_f).

13.19. Circular Lebesgue measure. Let C be the unit circle in the complex plane and define T: [0, 1) → C by Tω = e^{2πiω}. Let 𝓤 consist of the Borel subsets of [0, 1) and let λ be Lebesgue measure on 𝓤. Show that 𝓒 = [A: T⁻¹A ∈ 𝓤] consists of the sets in 𝓡² (identify R² with the complex plane) that are contained in C. Show that 𝓒 is generated by the arcs of C. Circular Lebesgue measure is defined as μ = λT⁻¹. Show that μ is invariant under rotations: μ[θz: z ∈ A] = μ(A) for A ∈ 𝓒 and θ ∈ C.

13.20. ↑ Suppose that the circular Lebesgue measure of A satisfies μ(A) > 1 − n⁻¹ and that B contains at most n points. Show that some rotation carries B into A: θB ⊂ A for some θ in C.
13.21. In connection with (13.7), show that μ(T′T)⁻¹ = (μT⁻¹)(T′)⁻¹.
13.22. Show by example that μ σ-finite does not imply μT⁻¹ σ-finite.
13.23. Consider Lebesgue measure λ restricted to the class 𝓤 of Borel sets in (0, 1]. For a fixed permutation n₁, n₂, ... of the positive integers, if x has dyadic expansion .x₁x₂..., take Tx = .x_{n₁}x_{n₂}.... Show that T is measurable 𝓤/𝓤 and that λT⁻¹ = λ.

SECTION 14. DISTRIBUTION FUNCTIONS
Distribution Functions
A random variable as defined in Section 13 is a measurable real function X on a probability measure space (Ω, 𝓕, P). The distribution or law of the random variable is the probability measure μ on (R¹, 𝓡¹) defined by

(14.1)  μ(A) = P[X ∈ A],  A ∈ 𝓡¹.
As in the case of the simple random variables in Chapter 1, the argument ω is usually omitted: P[X ∈ A] is short for P[ω: X(ω) ∈ A]. In the notation (13.7), the distribution is PX⁻¹. For simple random variables the distribution was defined in Section 5 - see (5.9). There μ was defined for every subset of the line, however; from now on μ will be defined only for Borel sets, because unless X is simple, one cannot in general be sure that [X ∈ A] has a probability for A outside 𝓡¹.

The distribution function of X is defined by

(14.2)  F(x) = μ(−∞, x] = P[X ≤ x]

for real x. By continuity from above (Theorem 10.2(ii)) for μ, F is right-continuous. Since F is nondecreasing, the left-hand limit F(x−) = lim_{y↑x} F(y) exists, and by continuity from below (Theorem 10.2(i)) for μ,

(14.3)  F(x−) = μ(−∞, x) = P[X < x].
Thus the jump or saltus in F at x is

F(x) − F(x−) = μ{x} = P[X = x].

Therefore (Theorem 10.2(iv)) F can have at most countably many points of discontinuity. Clearly,
(14.4)  lim_{x→−∞} F(x) = 0,  lim_{x→∞} F(x) = 1.
A function with these properties must, in fact, be the distribution function of some random variable:

Theorem 14.1. If F is a nondecreasing, right-continuous function satisfying (14.4), then there exists on some probability space a random variable X for which F(x) = P[X ≤ x].

FIRST PROOF. By Theorem 12.4, if F is nondecreasing and right-continuous, there is on (R¹, 𝓡¹) a measure μ for which μ(a, b] = F(b) − F(a). But lim_{x→−∞} F(x) = 0 implies that μ(−∞, x] = F(x), and lim_{x→∞} F(x) = 1 implies that μ(R¹) = 1. For the probability space take (Ω, 𝓕, P) = (R¹, 𝓡¹, μ) and for X take the identity function: X(ω) = ω. Then P[X ≤ x] = μ[ω ∈ R¹: ω ≤ x] = F(x). •

SECOND PROOF. There is a proof that uses only the existence of Lebesgue measure on the unit interval and does not require Theorem 12.4. For the probability space take the open unit interval: Ω is (0, 1), 𝓕 consists of the Borel subsets of (0, 1), and P(A) is the Lebesgue measure of A. To understand the method, suppose at first that F is continuous and strictly increasing. Then F is a one-to-one mapping of R¹ onto (0, 1); let φ: (0, 1) → R¹ be the inverse mapping. For 0 < ω < 1, let X(ω) = φ(ω). Since φ is increasing, certainly X is measurable 𝓕. If 0 < u < 1, then φ(u) ≤ x if and only if u ≤ F(x). Since P is Lebesgue measure, P[X ≤ x] = P[ω ∈ (0, 1): φ(ω) ≤ x] = P[ω ∈ (0, 1): ω ≤ F(x)] = F(x), as required.
If F has discontinuities or is not strictly increasing, define†

(14.5)  φ(u) = inf[x: u ≤ F(x)]

† This is called the quantile function.
for 0 < u < 1. Since F is nondecreasing, [x: u ≤ F(x)] is an interval stretching to ∞; since F is right-continuous, this interval is closed on the left. For 0 < u < 1, therefore, [x: u ≤ F(x)] = [φ(u), ∞), and so φ(u) ≤ x if and only if u ≤ F(x). If X(ω) = φ(ω) for 0 < ω < 1, then by the same reasoning as before, X is a random variable with P[X ≤ x] = F(x). •

This second argument actually provides a simple proof of Theorem 12.4 for a probability distribution† F: the distribution μ (as defined by (14.1)) of the random variable just constructed satisfies μ(−∞, x] = F(x) and hence μ(a, b] = F(b) − F(a).

† For the general case, see Problem 14.2.
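In modern terms the second proof is inverse transform sampling: X = φ(U) for U uniform on (0, 1). The sketch below checks the key equivalence φ(u) ≤ x if and only if u ≤ F(x) numerically, using a three-point distribution of our own choosing (dyadic masses keep the floating-point comparisons exact):

```python
import bisect
from itertools import accumulate

# A distribution function F with jumps at given points (illustrative choice,
# not from the text): F(x) = sum of the masses at points <= x.
points = [0.0, 1.0, 3.0]
masses = [0.25, 0.5, 0.25]
cum = list(accumulate(masses))

def F(x):
    return sum(m for p, m in zip(points, masses) if p <= x)

def phi(u):
    """The quantile function (14.5): phi(u) = inf[x : u <= F(x)], 0 < u < 1."""
    return points[bisect.bisect_left(cum, u)]

# phi(u) <= x if and only if u <= F(x): the property behind the second proof,
# which makes X = phi(U) have distribution function F for U uniform on (0,1).
for u in [0.1, 0.25, 0.3, 0.75, 0.9]:
    for x in [-1.0, 0.0, 0.5, 1.0, 2.0, 3.0]:
        assert (phi(u) <= x) == (u <= F(x))
```

The flat stretches and jumps of F are exactly where the infimum in (14.5) does real work; the check above exercises both.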
Exponential Distributions
There are a number of results which for their interpretation require random variables, independence, and other probabilistic concepts, but which can be discussed technically in terms of distribution functions alone and do not require the apparatus of measure theory. Suppose as an example that F is the distribution function of the waiting time to the occurrence of some event, say the arrival of the next customer at a queue or the next call at a telephone exchange. As the waiting time must be positive, assume that F(0) = 0. Suppose that F(x) < 1 for all x, and furthermore suppose that

(14.6)  [1 − F(x + y)]/[1 − F(x)] = 1 − F(y),  x, y ≥ 0.

The right side of this equation is the probability that the waiting time exceeds y; by the definition (4.1) of conditional probability, the left side is the probability that the waiting time exceeds x + y given that it exceeds x. Thus (14.6) attributes to the waiting-time mechanism a kind of lack of memory or aftereffect: if after a lapse of x units of time the event has not yet occurred, the waiting time still remaining is conditionally distributed just as the entire waiting time from the beginning. For reasons that will emerge later (see Section 23), waiting times often have this property.

The condition (14.6) completely determines the form of F. If U(x) = 1 − F(x), (14.6) is U(x + y) = U(x)U(y). This is a form of Cauchy's equation [A20], and since U is bounded, U(x) = e^{−αx} for some α. Since lim_{x→∞} U(x) = 0, α must be positive. Thus (14.6) implies that F has the
exponential form

(14.7)  F(x) = 0 if x ≤ 0,  F(x) = 1 − e^{−αx} if x ≥ 0,

and conversely.

Weak Convergence
Random variables X₁, ..., X_n are defined to be independent if the events [X₁ ∈ A₁], ..., [X_n ∈ A_n] are independent for all Borel sets A₁, ..., A_n, so that P[X_i ∈ A_i, i = 1, ..., n] = Π_{i=1}^n P[X_i ∈ A_i]. To find the distribution function of the maximum M_n = max{X₁, ..., X_n}, take A₁ = ··· = A_n = (−∞, x]. This gives P[M_n ≤ x] = Π_{i=1}^n P[X_i ≤ x]. If the X_i are independent and have common distribution function G and M_n has distribution function F_n, then

(14.8)  F_n(x) = Gⁿ(x).

It is possible without any appeal to measure theory to study the real function F_n solely by means of the relation (14.8), which can indeed be taken as defining F_n. It is possible in particular to study the asymptotic properties of F_n:
Example 14.1. Consider a stream or sequence of events, say arrivals of calls at a telephone exchange. Suppose that the times between successive events, the interarrival times, are independent and that each has the exponential form (14.7) with a common value of α. By (14.8) the maximum M_n among the first n interarrival times has distribution function F_n(x) = (1 − e^{−αx})ⁿ, x ≥ 0. For each x, lim_n F_n(x) = 0, which means that M_n tends to be large for n large. But P[M_n − α⁻¹ log n ≤ x] = F_n(x + α⁻¹ log n). This is the distribution function of M_n − α⁻¹ log n, and it satisfies

(14.9)  F_n(x + α⁻¹ log n) = (1 − e^{−αx}/n)ⁿ → exp(−e^{−αx})

as n → ∞; the equality here holds if log n ≥ −αx, and so the limit holds for all x. This gives for large n the approximate distribution of the normalized random variable M_n − α⁻¹ log n. •
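Since F_n is explicit here, the limit (14.9) can be checked numerically without any simulation. A sketch with α = 1 and n = 10⁶ (the test points are our own choices):

```python
import math

def F_n(x, n, alpha=1.0):
    """Distribution function of the maximum M_n of n independent
    exponential(alpha) interarrival times: F_n(x) = (1 - e^{-alpha x})^n, x >= 0."""
    return (1.0 - math.exp(-alpha * x)) ** n if x >= 0 else 0.0

def gumbel(x, alpha=1.0):
    """The limit in (14.9): exp(-e^{-alpha x})."""
    return math.exp(-math.exp(-alpha * x))

# F_n(x + log n) is already very close to exp(-e^{-x}) at n = 10^6:
# the centered maximum M_n - log n is approximately Gumbel-distributed.
n = 10 ** 6
for x in [-1.0, 0.0, 1.0, 2.0]:
    assert abs(F_n(x + math.log(n), n) - gumbel(x)) < 1e-5
```

The error of the approximation (1 − c/n)ⁿ ≈ e^{−c} is of order c²/n, which is why so tight a tolerance passes at n = 10⁶.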
If F_n and F are distribution functions, then by definition, F_n converges weakly to F, written F_n ⇒ F, if

(14.10)  lim_n F_n(x) = F(x)
for each x at which F is continuous.† To study the approximate distribution of a random variable Y_n it is often necessary to study instead the normalized or rescaled random variable (Y_n − b_n)/a_n for appropriate constants a_n and b_n. If Y_n has distribution function F_n and if a_n > 0, then P[(Y_n − b_n)/a_n ≤ x] = P[Y_n ≤ a_n x + b_n], and therefore (Y_n − b_n)/a_n has distribution function F_n(a_n x + b_n). For this reason weak convergence often appears in the form‡

(14.11)  F_n(a_n x + b_n) ⇒ F(x).

An example of this is (14.9): there a_n = 1, b_n = α⁻¹ log n, and F(x) = exp(−e^{−αx}).
Example 14.2. Consider again the distribution function (14.8) of the maximum, but suppose that G has the form

G(x) = 0 if x < 1,  G(x) = 1 − x^{−α} if x ≥ 1,

where α > 0. Here F_n(n^{1/α}x) = (1 − n⁻¹x^{−α})ⁿ for x ≥ n^{−1/α}, and therefore

lim_n F_n(n^{1/α}x) = 0 if x ≤ 0,  lim_n F_n(n^{1/α}x) = exp(−x^{−α}) if x > 0.

This is an example of (14.11) in which a_n = n^{1/α} and b_n = 0. •
Example 14.3. Consider (14.8) once more, but for

G(x) = 0 if x < 0,  G(x) = 1 − (1 − x)^α if 0 ≤ x ≤ 1,  G(x) = 1 if x > 1,

where α > 0. This time F_n(n^{−1/α}x + 1) = (1 − n⁻¹(−x)^α)ⁿ if −n^{1/α} ≤ x ≤ 0. Therefore,

lim_n F_n(n^{−1/α}x + 1) = exp(−(−x)^α) if x ≤ 0,  lim_n F_n(n^{−1/α}x + 1) = 1 if x > 0,

a case of (14.11) in which a_n = n^{−1/α} and b_n = 1. •

† For the role of continuity, see Example 14.4.
‡ To write F_n(a_n x + b_n) ⇒ F(x) ignores the distinction between a function and its value at an unspecified value of its argument, but the meaning of course is that F_n(a_n x + b_n) → F(x) at continuity points x of F.
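The limits in Examples 14.2 and 14.3 can be checked numerically in the same way from the explicit formulas for F_n; in this sketch the tail parameter α = 2, the sample size, and the test points are all our own choices:

```python
import math

n, a = 10 ** 6, 2.0

def frechet_limit(x, a):
    """Limit in Example 14.2: exp(-x^{-a}) for x > 0."""
    return math.exp(-x ** (-a))

def weibull_limit(x, a):
    """Limit in Example 14.3: exp(-(-x)^a) for x < 0."""
    return math.exp(-((-x) ** a))

# Example 14.2: G(x) = 1 - x^{-a} for x >= 1; norming constants a_n = n^{1/a}, b_n = 0.
for x in [0.5, 1.0, 2.0]:
    Fn = (1.0 - x ** (-a) / n) ** n           # F_n(n^{1/a} x)
    assert abs(Fn - frechet_limit(x, a)) < 1e-5

# Example 14.3: G(x) = 1 - (1-x)^a on [0, 1]; a_n = n^{-1/a}, b_n = 1.
for x in [-2.0, -1.0, -0.5]:
    Fn = (1.0 - (-x) ** a / n) ** n           # F_n(n^{-1/a} x + 1)
    assert abs(Fn - weibull_limit(x, a)) < 1e-5
```

Together with the previous sketch for (14.9), this exhibits one numerical instance of each of the three limit types that Theorem 14.3 below shows to be exhaustive.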
Let Δ be the distribution function with a unit jump at the origin:

(14.12)  Δ(x) = 0 if x < 0,  Δ(x) = 1 if x ≥ 0.

If X(ω) ≡ 0, then X has distribution function Δ.

Example 14.4. Let X₁, X₂, ... be independent random variables for which P[X_k = 1] = P[X_k = −1] = 1/2, and put S_n = X₁ + ··· + X_n. By the weak law of large numbers,

(14.13)  lim_n P[|n⁻¹S_n| ≥ ε] = 0

for ε > 0. Let F_n be the distribution function of n⁻¹S_n. If x > 0, then F_n(x) = 1 − P[n⁻¹S_n > x] → 1; if x < 0, then F_n(x) ≤ P[|n⁻¹S_n| ≥ |x|] → 0. As this accounts for all the continuity points of Δ, F_n ⇒ Δ. It is easy to turn the argument around and deduce (14.13) from F_n ⇒ Δ. Thus the weak law of large numbers is equivalent to the assertion that the distribution function of n⁻¹S_n converges weakly to Δ.

If n is odd, so that S_n = 0 is impossible, then by symmetry the events [S_n < 0] and [S_n > 0] each have probability 1/2 and hence F_n(0) = 1/2. Thus F_n(0) does not converge to Δ(0) = 1, but because Δ is discontinuous at 0, the definition of weak convergence does not require this. •
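Because the X_k take only the values ±1, F_n can be computed exactly from binomial coefficients, so the convergence F_n ⇒ Δ, and the non-convergence at the discontinuity point 0, are easy to inspect. A sketch (the values of n are our own choices):

```python
from math import comb

def F_n(x, n):
    """Distribution function of S_n/n, where S_n = X_1 + ... + X_n and the
    X_i are independent +-1 with probability 1/2 each (so S_n = 2B - n
    with B binomial(n, 1/2))."""
    # S_n/n <= x  iff  B <= n(1 + x)/2
    kmax = int(n * (1 + x) / 2)
    return sum(comb(n, k) for k in range(0, kmax + 1)) / 2 ** n

# Mass concentrates at 0 (the weak law), so F_n(x) -> 0 for x < 0
# and F_n(x) -> 1 for x > 0 ...
assert F_n(-0.2, 400) < 0.01
assert F_n(0.2, 400) > 0.99
# ... while at the discontinuity point of Delta, F_n(0) stays at 1/2 (n odd).
assert abs(F_n(0.0, 401) - 0.5) < 1e-9
```

This is exactly why (14.10) exempts discontinuity points: F_n(0) = 1/2 forever, yet F_n ⇒ Δ.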
Allowing (14.10) to fail at discontinuity points x of F thus makes it possible to bring the weak law of large numbers under the theory of weak convergence. But if (14.10) need hold only for certain values of x, there arises the question of whether weak limits are unique. Suppose that F_n ⇒ F and F_n ⇒ G. Then F(x) = lim_n F_n(x) = G(x) if F and G are both continuous at x. Since F and G each have only countably many points of discontinuity,† the set of common continuity points is dense, and it follows by right continuity that F and G are identical. A sequence can thus have at most one weak limit.

Convergence of distribution functions is studied in detail in Chapter 5. The remainder of this section is devoted to some weak-convergence theorems which are interesting both for themselves and for the reason that they require so little technical machinery.

† The proof following (14.3) uses measure theory, but this is not necessary: If the saltus s(x) = F(x) − F(x−) exceeds ε at x₁ < ··· < x_n, then F(x_i) − F(x_{i−1}) > ε (take x₀ < x₁), and so nε < F(x_n) − F(x₀) ≤ 1; hence [x: s(x) > ε] is finite and [x: s(x) > 0] is countable.
Convergence of Types*

Distribution functions F and G are of the same type if there exist constants a and b, a > 0, such that F(ax + b) = G(x) for all x. A distribution function is degenerate if it has the form Δ(x − x₀) (see (14.12)) for some x₀; otherwise, it is nondegenerate.

Theorem 14.2. Suppose that F_n(u_n x + v_n) ⇒ F(x) and F_n(a_n x + b_n) ⇒ G(x), where u_n > 0, a_n > 0, and F and G are nondegenerate. Then there exist a and b, a > 0, such that a_n/u_n → a, (b_n − v_n)/u_n → b, and F(ax + b) = G(x).
Thus there can be only one possible limit type and essentially only one possible sequence of norming constants. The proof of the theorem is for clarity set out in a sequence of lemmas. In all of them, a and the a_n are assumed to be positive.

Lemma 1. If F_n ⇒ F, a_n → a, and b_n → b, then F_n(a_n x + b_n) ⇒ F(ax + b).

PROOF. If x is a continuity point of F(ax + b) and ε > 0, choose continuity points u and v of F so that u < ax + b < v and F(v) − F(u) < ε; this is possible because F has only countably many discontinuities. For large enough n, u < a_n x + b_n < v, |F_n(u) − F(u)| < ε, and |F_n(v) − F(v)| < ε; but then F(ax + b) − 2ε ≤ F(u) − ε < F_n(u) ≤ F_n(a_n x + b_n) ≤ F_n(v) < F(v) + ε ≤ F(ax + b) + 2ε. •
Lemma 2. If F_n ⇒ F and a_n → ∞, then F_n(a_n x) ⇒ Δ(x).

PROOF. Given ε choose a continuity point u of F so large that F(u) > 1 − ε. If x > 0, then for all large enough n, a_n x > u and |F_n(u) − F(u)| < ε, so that F_n(a_n x) ≥ F_n(u) > F(u) − ε > 1 − 2ε. Thus lim_n F_n(a_n x) = 1 for x > 0; similarly, lim_n F_n(a_n x) = 0 for x < 0. •
Lemma 3. If F_n ⇒ F and b_n is unbounded, then F_n(x + b_n) cannot converge weakly.

PROOF. Suppose that b_n is unbounded and that b_n → ∞ along some subsequence (the case b_n → −∞ is similar). Suppose that F_n(x + b_n) ⇒ G(x). Given ε choose a continuity point u of F so that F(u) > 1 − ε. Whatever x may be, for n far enough out in the subsequence, x + b_n > u
* This topic may be omitted.
and F_n(u) > 1 − 2ε, so that F_n(x + b_n) > 1 − 2ε. Thus G(x) = lim_n F_n(x + b_n) = 1 for all continuity points x of G, which is impossible. •
Lemma 4. If F_n(x) ⇒ F(x) and F_n(a_n x + b_n) ⇒ G(x), where F and G are nondegenerate, then

(14.14)  0 < inf_n a_n ≤ sup_n a_n < ∞,  sup_n |b_n| < ∞.

PROOF. Suppose that a_n is not bounded above. Arrange by passing to a subsequence that a_n → ∞. Then by Lemma 2, F_n(a_n x) ⇒ Δ(x), and by Lemma 3 and the hypothesis, b_n is bounded along this subsequence. Arrange by passing to a further subsequence that b_n converges to some b. Then by Lemma 1, F_n(a_n x + b_n) ⇒ Δ(x + b), and so by hypothesis G(x) = Δ(x + b), contrary to the assumption that G is nondegenerate. Thus a_n is bounded above.

If G_n(x) = F_n(a_n x + b_n), then G_n(x) ⇒ G(x) and G_n(a_n⁻¹x − a_n⁻¹b_n) = F_n(x) ⇒ F(x). The result just proved shows that a_n⁻¹ is bounded. Thus a_n is bounded away from 0 and ∞.

If b_n is not bounded, pass to a subsequence along which b_n → ±∞ and a_n converges to some a. Since F_n(a_n x) ⇒ F(ax) along the subsequence by Lemma 1, b_n → ±∞ is by Lemma 3 incompatible with the hypothesis that F_n(a_n x + b_n) ⇒ G(x). •
Lemma 5. If F(x) = F(ax + b) for all x and F is nondegenerate, then a = 1 and b = 0.

PROOF. Since F(x) = F(aⁿx + (aⁿ⁻¹ + ··· + a + 1)b), it follows by Lemma 4 that aⁿ is bounded away from 0 and ∞, so that a = 1, and it then follows that nb is bounded, so that b = 0. •

PROOF OF THEOREM 14.2. Suppose first that u_n ≡ 1 and v_n ≡ 0. Then (14.14) holds. Fix any subsequence along which a_n converges to some positive a and b_n converges to some b. By Lemma 1, F_n(a_n x + b_n) ⇒ F(ax + b) along this subsequence, and the hypothesis gives F(ax + b) = G(x). Suppose that along some other sequence, a_n → u > 0 and b_n → v. Then F(ux + v) = G(x) and F(ax + b) = G(x) both hold, so that u = a and v = b by Lemma 5. Every convergent subsequence of {(a_n, b_n)} thus converges to (a, b), and so the entire sequence does.

For the general case, let H_n(x) = F_n(u_n x + v_n). Then H_n(x) ⇒ F(x) and H_n(a_n u_n⁻¹x + (b_n − v_n)u_n⁻¹) ⇒ G(x), and so by the case already treated, a_n u_n⁻¹ converges to some positive a and (b_n − v_n)u_n⁻¹ to some b, and as before, F(ax + b) = G(x). •
Extremal Distributions*
A distribution function F is extremal if it is nondegenerate and if, for some distribution function G and constants a_n (a_n > 0) and b_n,

(14.15)  Gⁿ(a_n x + b_n) ⇒ F(x).

These are the possible limiting distributions of normalized maxima (see (14.8)), and Examples 14.1, 14.2, and 14.3 give three specimens. The following analysis shows that these three examples exhaust the possible types.

Assume that F is extremal. From (14.15) follow Gⁿᵏ(a_n x + b_n) ⇒ Fᵏ(x) and Gⁿᵏ(a_{nk}x + b_{nk}) ⇒ F(x), and so by Theorem 14.2 there exist constants c_k and d_k such that c_k is positive and

(14.16)  Fᵏ(x) = F(c_k x + d_k).

Since Fʲᵏ(x) = (Fᵏ(x))ʲ = Fʲ(c_k x + d_k) = F(c_j(c_k x + d_k) + d_j), from F(c_{jk}x + d_{jk}) = Fʲᵏ(x) follow (Lemma 5) the relations

(14.17)  c_{jk} = c_j c_k,  d_{jk} = c_j d_k + d_j = c_k d_j + d_k.

Of course, c₁ = 1 and d₁ = 0. There are three cases to be considered separately.

CASE 1. Suppose that c_k = 1 for all k. Then

(14.18)  Fᵏ(x) = F(x + d_k).

This implies that F^{j/k}(x) = F(x + d_j − d_k). For positive rational r = j/k, put δ_r = d_j − d_k; (14.17) implies that the definition is consistent, and Fʳ(x) = F(x + δ_r). Since F is nondegenerate, there is an x such that 0 < F(x) < 1, and it follows by (14.18) that d_k is decreasing in k, so that δ_r is strictly decreasing in r. For positive real t let φ(t) = inf_{0<r≤t} δ_r (r rational in the infimum). Then φ(t) is decreasing in t and

(14.19)  Fᵗ(x) = F(x + φ(t))

for all x and all positive t. Further, (14.17) implies that φ(st) = φ(s) + φ(t), so that by the theorem on Cauchy's equation [A20] applied to φ(eˣ), φ(t)

* This topic may be omitted.
= −β log t, where β > 0 because φ(t) is strictly decreasing. Now (14.19) with t = e^{−x/β} gives F(x) = exp{e^{−x/β} log F(0)}, and so F must be of the same type as

(14.20)  F₁(x) = exp(−e^{−x}).

Example 14.1 shows that this distribution function can arise as a limit of distributions of maxima - that is, F₁ is indeed extremal.

CASE 2. Suppose that c_{k₀} ≠ 1 for some k₀, which necessarily exceeds 1. Then there exists an x′ such that c_{k₀}x′ + d_{k₀} = x′; but (14.16) then gives F^{k₀}(x′) = F(x′), so that F(x′) is 0 or 1. (In Case 1, F has the type (14.20) and so never assumes the values 0 and 1.)

Now suppose further that, in fact, F(x′) = 0. Let x₀ be the supremum of those x for which F(x) = 0. By passing to a new F of the same type one can arrange that x₀ = 0; then F(x) = 0 for x ≤ 0 and F(x) > 0 for x > 0. The new F will satisfy (14.16), but with new constants d_k. If a (new) d_k is distinct from 0, then there is an x near 0 for which the arguments on the two sides of (14.16) have opposite signs. Therefore, d_k = 0 for all k, and

(14.21)  Fᵏ(x) = F(c_k x)

for all k and x. This implies that F^{j/k}(x) = F(xc_j/c_k). For positive rational r = j/k, put γ_r = c_j/c_k. The definition is by (14.17) again consistent, and Fʳ(x) = F(γ_r x). Since 0 < F(x) < 1 for some x, necessarily positive, it follows by (14.21) that c_k is decreasing in k, so that γ_r is strictly decreasing in r. Put ψ(t) = inf_{0<r≤t} γ_r for positive real t. From (14.17) follows ψ(st) = ψ(s)ψ(t), and by the corollary to the theorem on Cauchy's equation [A20] applied to ψ(eˣ), ψ(t) = t^{−ξ} for some ξ > 0. Since Fᵗ(x) = F(ψ(t)x) for all x and for t positive, F(x) = exp{x^{−1/ξ} log F(1)} for x > 0. Thus (take α = 1/ξ) F is of the same type as

(14.22)  F₂,α(x) = 0 if x ≤ 0,  F₂,α(x) = exp(−x^{−α}) if x > 0.
Example 14.2 shows that this case can arise.

CASE 3. Suppose as in Case 2 that c_{k₀} ≠ 1 for some k₀, so that F(x′) is 0 or 1 for some x′, but this time suppose that F(x′) = 1. Let x₁ be the infimum of those x for which F(x) = 1. By passing to a new F of the same type, arrange that x₁ = 0; F(x) < 1 for x < 0 and F(x) = 1 for x ≥ 0. If
d_k ≠ 0, then for some x near 0, one side of (14.16) is 1 and the other is not. Thus d_k = 0 for all k, and (14.21) again holds. And again γ_{j/k} = c_j/c_k consistently defines a function satisfying Fʳ(x) = F(γ_r x). Since F is nondegenerate, 0 < F(x) < 1 for some x, but this time x is necessarily negative, so that c_k is increasing. The same analysis as before shows that there is a positive ξ such that Fᵗ(x) = F(tᵏˣ...); more precisely, Fᵗ(x) = F(t^{ξ}x) for all x and for t positive. Thus F(x) = exp{(−x)^{1/ξ} log F(−1)} for x < 0, and F is of the type F₃,α, where

(14.23)  F₃,α(x) = exp(−(−x)^α) if x ≤ 0,  F₃,α(x) = 1 if x > 0.

Example 14.3 shows that this distribution function is indeed extremal.
This completely characterizes the class of extremal distributions:

Theorem 14.3. The class of extremal distribution functions consists exactly of the distribution functions of the types (14.20), (14.22), and (14.23).
It is possible to go on and characterize the domains of attraction. That is, it is possible for each extremal distribution function F to describe the class of G satisfying (14.15) for some constants a_n and b_n - the class of G attracted to F.†

† This theory is associated with the names of Fisher, Tippett, Fréchet, and Gnedenko. For further information, see GALAMBOS.

PROBLEMS

14.1. The general nondecreasing function F has at most countably many discontinuities. Prove this by considering the open intervals (sup_{u<x} F(u), inf_{v>x} F(v)) - each nonempty one contains a rational.

14.2. For distribution functions F, the second proof of Theorem 14.1 shows how to construct a measure μ on (R¹, 𝓡¹) such that μ(a, b] = F(b) − F(a).
(a) Extend to the case of bounded F.
(b) Extend to the general case. Hint: Let F_n(x) be −n or F(x) or n as F(x) < −n or −n ≤ F(x) ≤ n or n < F(x). Construct the corresponding μ_n and define μ(A) = lim_n μ_n(A).

14.3. Suppose that X assumes positive integers as values. Show that P[X > n + m | X > n] = P[X > m] if and only if X has a Pascal distribution: P[X = n] = q^{n−1}p, n = 1, 2, .... This is the discrete analogue of the exponential distribution.

14.4.
(a) Suppose that X has a continuous, strictly increasing distribution function F. Show that the random variable F(X) is uniformly distributed over the unit interval in the sense that P[F(X) ≤ u] = u for 0 ≤ u ≤ 1. Passing from X to F(X) is called the probability transformation.
(b) Show that the function φ(u) defined by (14.5) satisfies F(φ(u)−) ≤ u ≤ F(φ(u)) and that, if F is continuous (but not necessarily strictly increasing), then F(φ(u)) = u for 0 < u < 1.
(c) Show that P[F(X) < u] = F(φ(u)−) and hence that the result in part (a) holds as long as F is continuous.
14.5. ↑ Let C be the set of continuity points of F.
(a) Show that for every Borel set A, P[F(X) ∈ A, X ∈ C] is at most the Lebesgue measure of A.
(b) Show that if F is continuous at each point of F⁻¹A, then P[F(X) ∈ A] is at most the Lebesgue measure of A.
14.6. A distribution function F can be described by its derivative or density f = F′, provided that f exists and can be integrated back to F. One thinks in terms of the approximation

(14.24)  P[x < X ≤ x + dx] ≈ f(x) dx.
(a) Show that, if X has a distribution function with density f and g is increasing and differentiable, then the distribution function of g(X) has density f(g⁻¹(x))/g′(g⁻¹(x)).
(b) Check part (a) of Problem 14.4 under the assumption that F is differentiable.

14.7. (a) Suppose that X₁, ..., X_n are independent and that each has distribution function F. Let X₍ₖ₎ be the kth order statistic, the kth smallest among the X_i. Show that X₍ₖ₎ has distribution function

G_k(x) = Σ_{u=k}^{n} C(n, u)(F(x))ᵘ(1 − F(x))ⁿ⁻ᵘ.
(b) Assume that F has density f and show that G_k has density

[n!/((k − 1)!(n − k)!)](F(x))^{k−1} f(x)(1 − F(x))^{n−k}.

Interpret this in terms of (14.24). Hint: Do not differentiate directly. Let A be the event that k − 1 of the X_i lie in (−∞, x], one lies in (x, x + h], and n − k lie in (x + h, ∞). Show that P[x < X₍ₖ₎ ≤ x + h] − P(A) is nonnegative and does not exceed the probability that two of the X_i lie in (x, x + h]. Estimate the latter probability, calculate P(A) exactly, and let h go to 0.

14.8. Show for every distribution function F that there exist distribution functions F_n such that F_n ⇒ F and (a) F_n is continuous; (b) F_n is constant over each interval ((k − 1)n⁻¹, kn⁻¹], k = 0, ±1, ±2, ....
14.9. The Lévy distance d(F, G) between two distribution functions is the infimum of those ε such that G(x − ε) − ε ≤ F(x) ≤ G(x + ε) + ε for all x. Verify that this is a metric on the set of distribution functions. Show that a necessary and sufficient condition for F_n ⇒ F is that d(F_n, F) → 0.

14.10. 12.3 ↑ A Borel function f satisfying Cauchy's equation [A20] is automatically bounded in some interval and hence satisfies f(x) = xf(1). Hint: Take K large enough that λ[x: x > 0, |f(x)| ≤ K] > 0. Apply Problem 12.3 and conclude that f is bounded in some interval to the right of 0.

14.11. ↑ Consider sets S of reals that are linearly independent over the field of rationals in the sense that n₁x₁ + ··· + n_k x_k = 0 for distinct points x_i in S and integers n_i (positive or negative) is impossible unless n_i = 0.
(a) By Zorn's lemma find a maximal such S. Show that it is a Hamel basis. That is, show that each real x can be written uniquely as x = n₁x₁ + ··· + n_k x_k for distinct points x_i in S and integers n_i.
(b) Define f arbitrarily on S and define it elsewhere by f(n₁x₁ + ··· + n_k x_k) = n₁f(x₁) + ··· + n_k f(x_k). Show that f satisfies Cauchy's equation but need not satisfy f(x) = xf(1).
(c) By means of Problem 14.10 give a new construction of a nonmeasurable function and a nonmeasurable set.
14.12. 14.9 ↑ (a) Show that if a distribution function F is everywhere continuous, then it is uniformly continuous.
(b) Let δ_F(ε) = sup[F(x) − F(y): |x − y| ≤ ε] be the modulus of continuity of F. Show that d(F, G) ≤ ε implies that sup_x |F(x) − G(x)| ≤ ε + δ_F(ε).
(c) Show that, if F_n ⇒ F and F is everywhere continuous, then F_n(x) → F(x) uniformly in x. What if F is continuous over a closed interval?

14.13. Show that (14.22) and (14.23) are everywhere infinitely differentiable, although not analytic.
CHAPTER
3
Integration
SECTION 15. THE INTEGRAL
Expected values of simple random variables and Riemann integrals of continuous functions can be brought together with other related concepts under a general theory of integration, and this theory is the subject of the present chapter.

Definition

Throughout this section, f, g, and so on will denote real measurable functions, the values ±∞ allowed, on a measure space (Ω, 𝓕, μ). The object is to define and study the definite integral

∫ f dμ = ∫_Ω f(ω) μ(dω).
Suppose first that $f$ is nonnegative. For each finite decomposition $\{A_i\}$ of $\Omega$ into $\mathscr{F}$-sets, consider the sum

(15.1)    $\sum_i \Bigl[\inf_{\omega\in A_i} f(\omega)\Bigr]\mu(A_i).$

In computing the products here, the conventions about infinity are

(15.2)    $x \cdot \infty = \infty \cdot x = \infty$ if $0 < x \le \infty$,    $0 \cdot \infty = \infty \cdot 0 = 0.$
The reasons for these conventions will become clear later. Also in force are the conventions of Section 10 for sums and limits involving infinity; see (10.3) and (10.4). If $A_i$ is empty, the infimum in (15.1) is by the standard convention $\infty$; but then $\mu(A_i) = 0$, so that by the convention (15.2), this term makes no contribution to the sum (15.1). The integral of $f$ is defined as the supremum of the sums (15.1):

(15.3)    $\int f\,d\mu = \sup \sum_i \Bigl[\inf_{\omega\in A_i} f(\omega)\Bigr]\mu(A_i).$

The supremum here extends over all finite decompositions $\{A_i\}$ of $\Omega$ into $\mathscr{F}$-sets.

For general $f$, consider its positive part,

(15.4)    $f^+(\omega) = \begin{cases} f(\omega) & \text{if } 0 \le f(\omega) \le \infty, \\ 0 & \text{if } -\infty \le f(\omega) < 0, \end{cases}$

and its negative part,

(15.5)    $f^-(\omega) = \begin{cases} -f(\omega) & \text{if } -\infty \le f(\omega) < 0, \\ 0 & \text{if } 0 \le f(\omega) \le \infty. \end{cases}$

These functions are nonnegative and measurable, and $f = f^+ - f^-$. The general integral is defined by

(15.6)    $\int f\,d\mu = \int f^+\,d\mu - \int f^-\,d\mu$

unless $\int f^+\,d\mu = \int f^-\,d\mu = \infty$, in which case $f$ has no integral. If $\int f^+\,d\mu$ and $\int f^-\,d\mu$ are both finite, $f$ is integrable, or integrable $\mu$, or summable, and has (15.6) as its definite integral. If $\int f^+\,d\mu = \infty$ and $\int f^-\,d\mu < \infty$, $f$ is not integrable but is in accordance with (15.6) assigned $+\infty$ as its definite integral. Similarly, if $\int f^+\,d\mu < \infty$ and $\int f^-\,d\mu = \infty$, $f$ is not integrable but has definite integral $-\infty$. Note that $f$ can have a definite integral without being integrable; it fails to have a definite integral if and only if its positive and negative parts both have infinite integrals. The really important case of (15.6) is that in which $\int f^+\,d\mu$ and $\int f^-\,d\mu$ are both finite. Allowing infinite integrals is a convention that simplifies the statements of various theorems, especially theorems involving nonnegative functions. Note that (15.6) is defined unless it involves "$\infty - \infty$"; if one term on the right is $\infty$ and the other is a finite real $x$, the difference is defined by the conventions $\infty - x = \infty$ and $x - \infty = -\infty$.
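The supremum in (15.3) can be watched forming numerically. In the sketch below (my own illustration; Lebesgue measure on $(0,1]$ and $f(x) = x$ are assumed inputs), refining the decomposition into half-open cells drives the lower sums (15.1) up toward the integral $1/2$.

```python
def lower_sum(f, cuts):
    """The sum (15.1) for the decomposition of (0,1] into cells (cuts[i], cuts[i+1]].

    For an increasing f, the infimum over (a, b] is the right-hand limit at a."""
    total = 0.0
    for a, b in zip(cuts, cuts[1:]):
        total += f(a + 1e-12) * (b - a)   # inf of f over (a, b], times mu((a, b]) = b - a
    return total

f = lambda x: x
for n in (1, 2, 10, 100, 1000):
    cuts = [i / n for i in range(n + 1)]
    print(n, lower_sum(f, cuts))   # increases toward 1/2
```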
The extension of the integral from the nonnegative case to the general case is consistent: (15.6) agrees with (15.3) if $f$ is nonnegative, because then $f^- = 0$.
Nonnegative Functions
It is convenient first to analyze nonnegative functions.
Theorem 15.1. (i) If $f = \sum_i x_iI_{A_i}$ is a nonnegative simple function, $\{A_i\}$ being a finite decomposition of $\Omega$ into $\mathscr{F}$-sets, then $\int f\,d\mu = \sum_i x_i\mu(A_i)$.
(ii) If $0 \le f(\omega) \le g(\omega)$ for all $\omega$, then $\int f\,d\mu \le \int g\,d\mu$.
(iii) If $0 \le f_n(\omega) \uparrow f(\omega)$ for all $\omega$, then $0 \le \int f_n\,d\mu \uparrow \int f\,d\mu$.
(iv) For nonnegative functions $f$ and $g$ and nonnegative constants $\alpha$ and $\beta$, $\int(\alpha f + \beta g)\,d\mu = \alpha\int f\,d\mu + \beta\int g\,d\mu$.

In part (iii) the essential point is that $\int f\,d\mu = \lim_n \int f_n\,d\mu$, and it is important to understand that both sides of this equation may be $\infty$. If $f_n = I_{A_n}$ and $f = I_A$, where $A_n \uparrow A$, the conclusion is that $\mu$ is continuous from below (Theorem 10.2(i)): $\lim_n \mu(A_n) = \mu(A)$; this equation often takes the form $\infty = \infty$.
PROOF OF (i). Let $\{B_j\}$ be a finite decomposition of $\Omega$ and let $\beta_j$ be the infimum of $f$ over $B_j$. If $A_i \cap B_j \ne \emptyset$, then $\beta_j \le x_i$; therefore, $\sum_j \beta_j\mu(B_j) = \sum_{ij}\beta_j\mu(A_i \cap B_j) \le \sum_{ij}x_i\mu(A_i \cap B_j) = \sum_i x_i\mu(A_i)$. On the other hand, there is equality here if $\{B_j\}$ coincides with $\{A_i\}$. •

PROOF OF (ii). The sums (15.1) obviously do not decrease if $f$ is replaced by $g$. •
PROOF OF (iii). By (ii) the sequence $\int f_n\,d\mu$ is nondecreasing and bounded above by $\int f\,d\mu$. It therefore suffices to show that $\int f\,d\mu \le \lim_n \int f_n\,d\mu$, or that

(15.7)    $\lim_n \int f_n\,d\mu \ge S = \sum_{i=1}^m \nu_i\mu(A_i)$

if $A_1, \ldots, A_m$ is any decomposition of $\Omega$ into $\mathscr{F}$-sets and $\nu_i = \inf_{\omega\in A_i} f(\omega)$. In order to see the essential idea of the proof, which is quite simple, suppose first that $S$ is finite and all the $\nu_i$ and $\mu(A_i)$ are positive and finite. Fix an $\epsilon$ that is positive and less than each $\nu_i$, and put $A_{in} = [\omega \in A_i\colon f_n(\omega) > \nu_i - \epsilon]$. Since $f_n \uparrow f$, $A_{in} \uparrow A_i$. Decompose $\Omega$ into $A_{1n}, \ldots, A_{mn}$ and
the complement of their union, and observe that

(15.8)    $\int f_n\,d\mu \ge \sum_{i=1}^m (\nu_i - \epsilon)\mu(A_{in}) \to \sum_{i=1}^m (\nu_i - \epsilon)\mu(A_i) = S - \epsilon\sum_{i=1}^m \mu(A_i).$

Since the $\mu(A_i)$ are all finite, letting $\epsilon \to 0$ gives (15.7).

Now suppose only that $S$ is finite. Each product $\nu_i\mu(A_i)$ is then finite; suppose it is positive for $i \le m_0$ and 0 for $i > m_0$. (Here $m_0 \le m$; if the product is 0 for all $i$, then $S = 0$ and (15.7) is trivial.) Now $\nu_i$ and $\mu(A_i)$ are positive and finite for $i \le m_0$ (one or the other may be $\infty$ for $i > m_0$). Define $A_{in}$ as before, but only for $i \le m_0$. This time decompose $\Omega$ into $A_{1n}, \ldots, A_{m_0n}$ and the complement of their union. Replace $m$ by $m_0$ in (15.8) and complete the proof as before.

Finally, suppose that $S = \infty$. Then $\nu_{i_0}\mu(A_{i_0}) = \infty$ for some $i_0$, so that $\nu_{i_0}$ and $\mu(A_{i_0})$ are both positive and at least one is $\infty$. Suppose $0 < x < \nu_{i_0} \le \infty$ and $0 < y < \mu(A_{i_0}) \le \infty$, and put $A_{i_0n} = [\omega \in A_{i_0}\colon f_n(\omega) > x]$. Since $f_n \uparrow f$, $A_{i_0n} \uparrow A_{i_0}$; hence $\mu(A_{i_0n}) > y$ for $n$ exceeding some $n_0$. But then (decompose $\Omega$ into $A_{i_0n}$ and its complement) $\int f_n\,d\mu \ge x\mu(A_{i_0n}) \ge xy$ for $n \ge n_0$, and therefore $\lim_n \int f_n\,d\mu \ge xy$. If $\nu_{i_0} = \infty$, let $x \to \infty$, and if $\mu(A_{i_0}) = \infty$, let $y \to \infty$. In either case (15.7) follows: $\lim_n \int f_n\,d\mu = \infty$. •

PROOF OF (iv). Suppose at first that $f = \sum_i x_iI_{A_i}$ and $g = \sum_j y_jI_{B_j}$ are simple. Then $\alpha f + \beta g = \sum_{ij}(\alpha x_i + \beta y_j)I_{A_i \cap B_j}$, and so

$\int(\alpha f + \beta g)\,d\mu = \sum_{ij}(\alpha x_i + \beta y_j)\mu(A_i \cap B_j) = \alpha\sum_i x_i\mu(A_i) + \beta\sum_j y_j\mu(B_j) = \alpha\int f\,d\mu + \beta\int g\,d\mu.$

Note that the argument is valid if some of $\alpha$, $\beta$, $x_i$, $y_j$ are infinite. Apart from this possibility, the ideas are as in the proof of (5.18). For general nonnegative $f$ and $g$, there exist by Theorem 13.5 simple functions $f_n$ and $g_n$ such that $0 \le f_n \uparrow f$ and $0 \le g_n \uparrow g$. But then $0 \le \alpha f_n + \beta g_n \uparrow \alpha f + \beta g$ and $\int(\alpha f_n + \beta g_n)\,d\mu = \alpha\int f_n\,d\mu + \beta\int g_n\,d\mu$, so that (iv) follows via (iii). •
By part (i) of Theorem 15.1, the expected values of simple random variables in Chapter 1 are integrals: $E[X] = \int X(\omega)P(d\omega)$. This also covers the step functions in Section 1 (see (1.5)). The relation between the Riemann integral and the integral as defined here will be studied in Section 17.
Example 15.1. Consider the line $(R^1, \mathscr{R}^1, \lambda)$ with Lebesgue measure. Suppose that $-\infty < a_0 \le a_1 \le \cdots \le a_m < \infty$ and let $f$ be the function with nonnegative value $x_i$ on $(a_{i-1}, a_i]$, $i = 1, \ldots, m$, and value 0 on $(-\infty, a_0]$ and $(a_m, \infty)$. By part (i) of Theorem 15.1, $\int f\,d\lambda = \sum_{i=1}^m x_i(a_i - a_{i-1})$ because of the convention $0 \cdot \infty = 0$; see (15.2). If the "area under the curve" to the left of $a_0$ and to the right of $a_m$ is to be 0, this convention is inevitable. It must then work the other way as well: $\infty \cdot 0 = 0$, so that $\int f\,d\lambda = 0$ if $f$ is $\infty$ at a single point (say) and 0 elsewhere. If $f = I_{(a,\infty)}$, the area-under-the-curve point of view makes $\int f\,d\lambda = \infty$ natural. Hence the second convention in (15.2), which also requires that the integral be infinite if $f$ is $\infty$ on a nonempty interval and 0 elsewhere. •
Recall that almost everywhere means outside a set of measure 0.
Theorem 15.2. Suppose that $f$ and $g$ are nonnegative.
(i) If $f = 0$ almost everywhere, then $\int f\,d\mu = 0$.
(ii) If $\mu[\omega\colon f(\omega) > 0] > 0$, then $\int f\,d\mu > 0$.
(iii) If $\int f\,d\mu < \infty$, then $f < \infty$ almost everywhere.
(iv) If $f \le g$ almost everywhere, then $\int f\,d\mu \le \int g\,d\mu$.
(v) If $f = g$ almost everywhere, then $\int f\,d\mu = \int g\,d\mu$.

PROOF. Suppose that $f = 0$ almost everywhere. If $A_i$ meets $[\omega\colon f(\omega) = 0]$, then the infimum in (15.1) is 0; otherwise, $\mu(A_i) = 0$. Hence each sum (15.1) is 0, and (i) follows.

If $A_\epsilon = [\omega\colon f(\omega) \ge \epsilon]$, then $A_\epsilon \uparrow [\omega\colon f(\omega) > 0]$ as $\epsilon \downarrow 0$, so that under the hypothesis of (ii) there is a positive $\epsilon$ for which $\mu(A_\epsilon) > 0$. Decomposing $\Omega$ into $A_\epsilon$ and its complement shows that $\int f\,d\mu \ge \epsilon\mu(A_\epsilon) > 0$.

If $\mu[f = \infty] > 0$, decompose $\Omega$ into $[f = \infty]$ and its complement: $\int f\,d\mu \ge \infty \cdot \mu[f = \infty] = \infty$ by the conventions. Hence (iii).

To prove (iv), let $G = [f \le g]$. For any finite decomposition $\{A_1, \ldots, A_m\}$ of $\Omega$,

$\sum_i \Bigl[\inf_{A_i} f\Bigr]\mu(A_i) = \sum_i \Bigl[\inf_{A_i} f\Bigr]\mu(A_i \cap G) \le \sum_i \Bigl[\inf_{A_i \cap G} g\Bigr]\mu(A_i \cap G) \le \int g\,d\mu,$
where the last inequality comes from a consideration of the decomposition $A_1 \cap G, \ldots, A_m \cap G, G^c$. This proves (iv), and (v) follows immediately. •

Suppose that $f = g$ almost everywhere, where $f$ and $g$ may not be nonnegative. If $f$ has a definite integral, then since $f^+ = g^+$ and $f^- = g^-$ almost everywhere, it follows by Theorem 15.2(v) that $g$ also has a definite integral and $\int f\,d\mu = \int g\,d\mu$.

Uniqueness

Although there are various ways to frame the definition of the integral, they are all equivalent: they all assign the same value to $\int f\,d\mu$. This is because the integral is uniquely determined by certain simple properties it is natural to require of it. It is natural to want the integral to have properties (i) and (iii) of Theorem 15.1. But these uniquely determine the integral for nonnegative functions: For $f$ nonnegative, there exist by Theorem 13.5 simple functions $f_n$ such that $0 \le f_n \uparrow f$; by (iii), $\int f\,d\mu$ must be $\lim_n \int f_n\,d\mu$, and (i) determines the value of each $\int f_n\,d\mu$. Property (i) can itself be derived from (iv) (linearity) together with the assumption that $\int I_A\,d\mu = \mu(A)$ for indicators $I_A$: $\int(\sum_i x_iI_{A_i})\,d\mu = \sum_i x_i\int I_{A_i}\,d\mu = \sum_i x_i\mu(A_i)$. If (iv) of Theorem 15.1 is to persist when the integral is extended beyond the class of nonnegative functions, $\int f\,d\mu$ must be $\int f^+\,d\mu - \int f^-\,d\mu$, which makes the definition (15.6) inevitable.
PROBLEMS

These problems outline alternative definitions of the integral. In all of them $f$ is assumed measurable and nonnegative. Call (15.3) the lower integral and write it as

(15.9)    $\int_* f\,d\mu = \sup \sum_i \Bigl[\inf_{\omega\in A_i} f(\omega)\Bigr]\mu(A_i)$

to distinguish it from the upper integral

(15.10)    $\int^* f\,d\mu = \inf \sum_i \Bigl[\sup_{\omega\in A_i} f(\omega)\Bigr]\mu(A_i).$

The infimum in (15.10), like the supremum in (15.9), extends over all finite partitions $\{A_i\}$ of $\Omega$ into $\mathscr{F}$-sets.
15.1. Show that $\int^* f\,d\mu = \infty$ if $\mu[\omega\colon f(\omega) > 0] = \infty$ or if $\mu[\omega\colon f(\omega) > x] > 0$ for all (finite) $x$.

As there are many $f$'s of this last kind that ought to be integrable, (15.10) is inappropriate as a definition of $\int f\,d\mu$. The only trouble with (15.10), however, is that
it treats infinity incorrectly. To see this, and to focus on essentials, assume in the following problems that $\mu(\Omega) < \infty$ and that $f$ is bounded and nonnegative.

15.2. ↑ (a) Show that

$\sum_i \Bigl[\inf_{\omega\in A_i} f(\omega)\Bigr]\mu(A_i) \le \sum_j \Bigl[\inf_{\omega\in B_j} f(\omega)\Bigr]\mu(B_j)$

if $\{B_j\}$ refines $\{A_i\}$. Prove a dual relation for the sums in (15.10) and conclude that $\int_* f\,d\mu \le \int^* f\,d\mu$.
(b) Suppose that $M$ bounds $f$ and consider the partition $A_i = [\omega\colon i\epsilon \le f(\omega) < (i+1)\epsilon]$, $i = 0, \ldots, N$, where $N$ is large enough that $N\epsilon > M$. Show that

$\int^* f\,d\mu - \int_* f\,d\mu \le \epsilon\mu(\Omega).$

Conclude that

(15.11)    $\int_* f\,d\mu = \int^* f\,d\mu.$

To define the integral as the common value in (15.11) is the Darboux-Young approach. The advantage of (15.3) as a definition is that it applies at once to unbounded $f$ and infinite $\mu$.

15.3. 3.2 15.2 ↑ For $A \subset \Omega$, define $\mu^*(A)$ and $\mu_*(A)$ by (3.9) and (3.10) with $\mu$ in place of $P$. Suspend for the moment the assumption that $f$ is measurable and show that $\int^* I_A\,d\mu = \mu^*(A)$ and $\int_* I_A\,d\mu = \mu_*(A)$ for every $A$. Hence (15.11) can fail if $f$ is not measurable. (Where was measurability used in the proof of (15.11)?) Even though the definition (15.3) makes formal sense for all nonnegative functions, it is thus idle to apply it to nonmeasurable ones. Assume again that $f$ is measurable (as well as nonnegative and bounded, and that $\mu$ is finite).
15.4. ↑ Show that for positive $\epsilon$ there exists a finite partition $\{A_i\}$ such that, if $\{B_j\}$ is any finer partition and $\omega_j \in B_j$, then

$\Bigl|\int f\,d\mu - \sum_j f(\omega_j)\mu(B_j)\Bigr| < \epsilon.$
15.5. ↑ Show that

$\int f\,d\mu = \lim_n \sum_{k=1}^{n2^n} \frac{k-1}{2^n}\,\mu\Bigl[\omega\colon \frac{k-1}{2^n} \le f(\omega) < \frac{k}{2^n}\Bigr].$

The limit on the right here is Lebesgue's definition of the integral.
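The horizontal-slicing sums of Problem 15.5 can be evaluated directly. In the sketch below (my own illustration; the choice $f(x) = x^2$ on $[0,1]$, for which $\mu[(k-1)/2^n \le f < k/2^n]$ has the closed form $\sqrt{k/2^n} - \sqrt{(k-1)/2^n}$, is an assumption made for the example), the sums increase to $\int f\,d\lambda = 1/3$.

```python
from math import sqrt

def lebesgue_sum(n):
    """The sum of Problem 15.5 for f(x) = x**2 on [0, 1] with Lebesgue measure."""
    total = 0.0
    for k in range(1, n * 2 ** n + 1):
        lo, hi = (k - 1) / 2 ** n, k / 2 ** n
        if lo >= 1.0:
            break                        # f < 1 on [0, 1), so higher levels carry no mass
        mass = sqrt(min(hi, 1.0)) - sqrt(lo)   # mu[lo <= f < hi]
        total += lo * mass
    return total

for n in (2, 4, 8, 12):
    print(n, lebesgue_sum(n))   # increases toward 1/3
```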
15.6. ↑ Suppose that the integral is defined for simple functions by $\int(\sum_i x_iI_{A_i})\,d\mu = \sum_i x_i\mu(A_i)$. Suppose that $f_n$ and $g_n$ are simple and nondecreasing and have a common limit: $0 \le f_n \uparrow f$ and $0 \le g_n \uparrow f$. Adapt the arguments used to prove Theorem 15.1(iii) and show that $\lim_n \int f_n\,d\mu = \lim_n \int g_n\,d\mu$. Thus $\int f\,d\mu$ can (Theorem 13.5) consistently be defined as $\lim_n \int f_n\,d\mu$ for simple functions for which $0 \le f_n \uparrow f$.

SECTION 16. PROPERTIES OF THE INTEGRAL
Equalities and Inequalities
By definition, the requirement for integrability of $f$ is that $\int f^+\,d\mu$ and $\int f^-\,d\mu$ both be finite, which is the same as the requirement that $\int f^+\,d\mu + \int f^-\,d\mu < \infty$ and hence is the same as the requirement that $\int(f^+ + f^-)\,d\mu < \infty$ (Theorem 15.1(iv)). Since $f^+ + f^- = |f|$, $f$ is integrable if and only if

(16.1)    $\int |f|\,d\mu < \infty.$

It follows that if $|f| \le |g|$ almost everywhere and $g$ is integrable, then $f$ is integrable as well. If $\mu(\Omega) < \infty$, a bounded $f$ is integrable.

Theorem 16.1. (i) Monotonicity: If $f$ and $g$ are integrable and $f \le g$ almost everywhere, then

(16.2)    $\int f\,d\mu \le \int g\,d\mu.$

(ii) Linearity: If $f$ and $g$ are integrable and $\alpha$, $\beta$ are finite real numbers, then $\alpha f + \beta g$ is integrable and

(16.3)    $\int(\alpha f + \beta g)\,d\mu = \alpha\int f\,d\mu + \beta\int g\,d\mu.$

PROOF OF (i). For nonnegative $f$ and $g$ such that $f \le g$ almost everywhere, (16.2) follows by Theorem 15.2(iv). And for general $f$ and $g$, if $f \le g$ almost everywhere, then $f^+ \le g^+$ and $f^- \ge g^-$ almost everywhere, and so (16.2) follows by the definition (15.6). •

PROOF OF (ii). First, $\alpha f + \beta g$ is integrable because, by Theorem 15.1,

$\int|\alpha f + \beta g|\,d\mu \le \int(|\alpha|\cdot|f| + |\beta|\cdot|g|)\,d\mu = |\alpha|\int|f|\,d\mu + |\beta|\int|g|\,d\mu < \infty.$
By Theorem 15.1(iv) and the definition (15.6), $\int(\alpha f)\,d\mu = \alpha\int f\,d\mu$; consider separately the cases $\alpha \ge 0$ and $\alpha < 0$. Therefore, it is enough to check (16.3) for the case $\alpha = \beta = 1$. By definition, $(f+g)^+ - (f+g)^- = f + g = f^+ - f^- + g^+ - g^-$, and therefore $(f+g)^+ + f^- + g^- = (f+g)^- + f^+ + g^+$. All these functions being nonnegative, $\int(f+g)^+\,d\mu + \int f^-\,d\mu + \int g^-\,d\mu = \int(f+g)^-\,d\mu + \int f^+\,d\mu + \int g^+\,d\mu$, which can be rearranged to give $\int(f+g)^+\,d\mu - \int(f+g)^-\,d\mu = \int f^+\,d\mu - \int f^-\,d\mu + \int g^+\,d\mu - \int g^-\,d\mu$. But this reduces to (16.3). •
Since $-|f| \le f \le |f|$, it follows by Theorem 16.1 that

(16.4)    $\Bigl|\int f\,d\mu\Bigr| \le \int|f|\,d\mu$

for integrable $f$. Applying this to integrable $f$ and $g$ gives

(16.5)    $\Bigl|\int f\,d\mu - \int g\,d\mu\Bigr| \le \int|f - g|\,d\mu.$

Example 16.1. Suppose that $\Omega$ is countable, that $\mathscr{F}$ consists of all the subsets of $\Omega$, and that $\mu$ is counting measure: each singleton has measure 1. To be definite, take $\Omega = \{1, 2, \ldots\}$. A function is then a sequence $x_1, x_2, \ldots$. If $x_{nm}$ is $x_m$ or 0 as $m \le n$ or $m > n$, the function corresponding to $x_{n1}, x_{n2}, \ldots$ has integral $\sum_{m=1}^n x_m$ by Theorem 15.1(i) (consider the decomposition $\{1\}, \ldots, \{n\}, \{n+1, n+2, \ldots\}$). It follows by Theorem 15.1(iii) that in the nonnegative case the integral of the function given by $\{x_m\}$ is the sum $\sum_m x_m$ (finite or infinite) of the corresponding infinite series. In the general case the function is integrable if and only if $\sum_{m=1}^\infty |x_m|$ is a convergent infinite series, in which case the integral is $\sum_{m=1}^\infty x_m^+ - \sum_{m=1}^\infty x_m^-$.

The function $x_m = (-1)^{m+1}m^{-1}$ is not integrable by this definition and even fails to have a definite integral, since $\sum_{m=1}^\infty x_m^+ = \sum_{m=1}^\infty x_m^- = \infty$. This invites comparison with the ordinary theory of infinite series, according to which the alternating harmonic series does converge in the sense that $\lim_M \sum_{m=1}^M (-1)^{m+1}m^{-1} = \log 2$. But since this says that the sum of the first $M$ terms has a limit, it requires that the elements of the space $\Omega$ be ordered. If $\Omega$ consists not of the positive integers but, say, of the integer lattice points in 3-space, it has no canonical linear ordering. And if $\sum_m x_m$ is to have the same finite value no matter what the order of summation, the series must be absolutely convergent.† This helps to explain why $f$ is defined to be integrable only if $\int f^+\,d\mu$ and $\int f^-\,d\mu$ are both finite. •

† RUDIN, p. 76.
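Example 16.1's identification of integrals over counting measure with infinite series is easy to see numerically. This sketch is my own (the two test sequences are assumptions chosen for illustration); it tracks the positive and negative parts separately, which is exactly what decides integrability.

```python
def pos_neg_parts(x, n_terms):
    """Partial sums of the positive and negative parts of the sequence x(1), x(2), ...

    Under counting measure, x is integrable iff both stay bounded as n_terms grows."""
    pos = sum(max(x(m), 0.0) for m in range(1, n_terms + 1))
    neg = sum(max(-x(m), 0.0) for m in range(1, n_terms + 1))
    return pos, neg

# x_m = (-1)^(m+1)/m: the partial sums of the series converge (to log 2),
# but both parts diverge, so the function has no integral.
pos, neg = pos_neg_parts(lambda m: (-1) ** (m + 1) / m, 100000)
print(pos, neg)   # both grow without bound as n_terms increases

# x_m = 1/m^2 is integrable: the positive part converges, the negative part is 0.
pos2, neg2 = pos_neg_parts(lambda m: 1.0 / m ** 2, 100000)
print(pos2, neg2)
```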
Example 16.2. In connection with Example 15.1, consider the function $f = 3I_{(a,\infty)} - 2I_{(-\infty,a)}$. There is no natural value for $\int f\,d\lambda$ (it is "$\infty - \infty$"), and none is assigned by the definition.

If a function $f$ is bounded on bounded intervals, then each function $f_n = fI_{(-n,n)}$ is integrable with respect to $\lambda$. Since $f = \lim_n f_n$, the limit of $\int f_n\,d\lambda$, if it exists, is sometimes called the "principal value" of the integral of $f$. Although it is natural for some purposes to integrate symmetrically about the origin, this is not the right definition of the integral in the context of general measure theory. The functions $g_n = fI_{(-n,n+1)}$ for example also converge to $f$, and $\int g_n\,d\lambda$ may have some other limit, or none at all; $f(x) = x$ is a case in point. There is no general reason why $f_n$ should take precedence over $g_n$. As in the preceding example, $f = \sum_{k=1}^\infty (-1)^kk^{-1}I_{(k,k+1]}$ has no integral, even though the $\int f_n\,d\lambda$ above converge. •
Integration to the Limit
The first result, the monotone convergence theorem, essentially restates Theorem 15.1(iii).

Theorem 16.2. If $0 \le f_n \uparrow f$ almost everywhere, then $\int f_n\,d\mu \uparrow \int f\,d\mu$.

PROOF. If $0 \le f_n \uparrow f$ on a set $A$ with $\mu(A^c) = 0$, then $0 \le f_nI_A \uparrow fI_A$ holds everywhere, and it follows by Theorem 15.1(iii) and the remark following Theorem 15.2 that $\int f_n\,d\mu = \int f_nI_A\,d\mu \uparrow \int fI_A\,d\mu = \int f\,d\mu$. •
If p. is a measure on � and �0 is a a-field contained in S', then the restriction p.0 of p. to �0 is another measure (Example 10.4). If / == /A and A e �0, then Example 16.4.
the common value being p. ( A ) = p. 0 ( A ). The same is by linearity true for
212
INTEGRATION
nonnegative simple functions measurable �0• It holds by Theorem 16.2 for all nonnegative I measurable j&'0 because (Theorem 13.5) 0 < In i I for simple functions In measurable j&'0• For functions measurable �0, integra tion with respect to p. is thus the same thing as integration with respect to •
P. o·
In this example a property was extended by linearity from indicators to nonnegative simple functions and thence to the general nonnegative func tion by a monotone passage to the limit. This is a technique of very frequent application. For a finite or infinite sequence of measures P.n on j&', p.(A) = LnP.n ( A) defines another measure (countably additive because [A27] sums can be reversed in a nonnegative double series). For indicators I, Example 16.5.
and by linearity the same holds for simple I > 0. If 0 < lk i I for simple lk, then by Theorem 16.2 and Example 16.3, f I dp. = lim k f lk dp. lim k En f lk dp.n = En lim k f lk dp.n = En f ldP.n · The relation in question thus holds for all nonnegative 1. • =
An important consequence of the monotone convergence theorem is
Fatou 's lemma: Theorem 16.3.
For nonnegative In,
If gn = infk 'il!:. nlk , then 0 < gn j g = lim infnln , and the preceding two theorems give f In dp. � f 8n dp. -+ f g dp.. • PROOF.
.
On ( R1 , !1/ 1 , A), In = n 2/(o, n -t ) and I = 0 satisfy ln ( x ) -+ l ( x ) for each x, but / IdA = 0 and fln d A = n -+ oo . This shows that the inequality in (16.6) can be strict and that it is not always possible to integrate to the limit. This phenomenon has been encountered before; see • Examples 5.6 and 7.7. EXIIIple II 16. 6.
16. PROPERTIES OF THE INTEGRAL 213 Fatou' s lemma leads to Lebesgue 's dominated convergence theorem: SECTION
f
If Ifn i s g almost everywhere, where g is integrable, and if In f almost everywhere, then f and the In are integrable and fIn dp. f fdp.. PROOF. From the hypothesis 1/n l s g it follows that all the In as well as I * = lim supn In and f * = lim inf In are integrable. Since g + In and g - In neorem 16.4. -+
-+
are nonnegative, Fatou' s lemma gives
and
.S:
j
j
j
lim inf In dp. . n ( g - In ) dp. = g dp. - lim sup n
Therefore (16 . 7) .S:
j
j
lim sup In dp. .S: lim sup In dp. . n n
The only assumption used thus far is that the In are dominated by an integrable g. If In f almost everywhere, the extreme terms in (16.7) agree • with f fdp.. -+
Example 16.6 shows that this theorem can fail if no dominating EXIIIple II 16. 7.
g exists.
The Weierstrass M-test for series. Consider the space
{ 1, 2, . . . } together with counting measure, as in Example 16.1. If l x n m l < M, and E m Mm < oo, and if lim n x n m = x m for each m, then lim n E m x n m = E,.. x m . This follows by an application of Theorem 16.4 with the function given by the sequence M1 , M2 , in the role of g. This is another standard • result on infinite series [A28]. •
•
The next result, the bounded convergence theorem, is a special case of Theorem 16.4. It contains Theorem 5.3 as a further special case.
Theorem 16.5. If $\mu(\Omega) < \infty$ and the $f_n$ are uniformly bounded, then $f_n \to f$ almost everywhere implies $\int f_n\,d\mu \to \int f\,d\mu$.
The next two theorems are simply the series versions of the monotone and the dominated convergence theorems.

Theorem 16.6. If $f_n \ge 0$, then $\int \sum_n f_n\,d\mu = \sum_n \int f_n\,d\mu$.
The members of this last equation are both equal either to oo or to the same finite, nonnegative real number.
Theorem 16.7. If $\sum_n f_n$ converges almost everywhere and $|\sum_{k=1}^n f_k| \le g$ almost everywhere, where $g$ is integrable, then $\sum_n f_n$ and the $f_n$ are integrable and $\int \sum_n f_n\,d\mu = \sum_n \int f_n\,d\mu$.
Corollary. If $\sum_n \int |f_n|\,d\mu < \infty$, then $\sum_n f_n$ converges absolutely almost everywhere and is integrable, and $\int \sum_n f_n\,d\mu = \sum_n \int f_n\,d\mu$.

PROOF. The function $g = \sum_n |f_n|$ is integrable by Theorem 16.6 and is finite almost everywhere by Theorem 15.2(iii). Hence $\sum_n |f_n|$ and $\sum_n f_n$ converge almost everywhere and Theorem 16.7 applies. •
In place of a sequence $\{f_n\}$ of real measurable functions on $(\Omega, \mathscr{F}, \mu)$, consider a family $[f_t\colon t > 0]$ indexed by a continuous parameter $t$. Suppose of a measurable $f$ that

(16.8)    $\lim_{t\to\infty} f_t(\omega) = f(\omega)$

on a set $A$, where

(16.9)    $A \in \mathscr{F}, \qquad \mu(\Omega - A) = 0.$

A technical point arises here, since $\mathscr{F}$ need not contain the $\omega$-set where (16.8) holds:

Example 16.8. Let $\mathscr{F}$ consist of the Borel subsets of $\Omega = (0, 1]$, and let $H$ be a nonmeasurable set, a subset of $\Omega$ that does not lie in $\mathscr{F}$ (see the end of Section 3). Define $f_t(\omega) = 1$ if $\omega$ equals the fractional part $t - [t]$ of $t$ and their common value lies in $H^c$; define $f_t(\omega) = 0$ otherwise. Each $f_t$ is measurable $\mathscr{F}$, but if $f(\omega) \equiv 0$, then the $\omega$-set where (16.8) holds is exactly $H$. •
Because of such examples, the set $A$ above must be assumed to lie in $\mathscr{F}$. (Because of Theorem 13.4, no such assumption is necessary in the case of sequences.)
The assumption that (16.8) holds on a set $A$ satisfying (16.9) can still be expressed as the assumption that (16.8) holds almost everywhere. If $I_t = \int f_t\,d\mu$ converges to $I = \int f\,d\mu$ as $t \to \infty$, then certainly $I_{t_n} \to I$ for each sequence $\{t_n\}$ going to infinity. But the converse holds as well: if $I_t$ does not converge to $I$, then there is a positive $\epsilon$ such that $|I_{t_n} - I| \ge \epsilon$ for a sequence $\{t_n\}$ going to infinity. To the question of whether $I_{t_n}$ converges to $I$ the previous theorems apply. Suppose that (16.8) holds almost everywhere and that $|f_t| \le g$ almost everywhere, where $g$ is integrable. By the dominated convergence theorem, $f$ must then be integrable and $I_{t_n} \to I$ for each sequence $\{t_n\}$ going to infinity. It follows that $\int f_t\,d\mu \to \int f\,d\mu$. In this result $t$ could go continuously to 0 or to some other value instead of to infinity.
Theorem 16.8. Suppose that $f(\omega, t)$ is a measurable function of $\omega$ for each $t$ in $(a, b)$.
(i) Suppose that $f(\omega, t)$ is almost everywhere continuous in $t$ at $t_0$; suppose further that, for each $t$ in some neighborhood of $t_0$, $|f(\omega, t)| \le g(\omega)$ almost everywhere, where $g$ is integrable. Then $\int f(\omega, t)\mu(d\omega)$ is continuous in $t$ at $t_0$.
(ii) Suppose that for $\omega \in A$, where $A$ satisfies (16.9), $f(\omega, t)$ has in $(a, b)$ a derivative $f'(\omega, t)$ with respect to $t$; suppose further that $|f'(\omega, t)| \le g(\omega)$ for $\omega \in A$ and $t \in (a, b)$, where $g$ is integrable. Then $\int f(\omega, t)\mu(d\omega)$ has derivative $\int f'(\omega, t)\mu(d\omega)$ on $(a, b)$.

PROOF. Part (i) is an immediate consequence of the preceding discussion. To prove part (ii), consider a fixed $t$. By the mean value theorem,
$\frac{f(\omega, t+h) - f(\omega, t)}{h} = f'(\omega, s),$

where $s$ lies between $t$ and $t + h$. The left side goes to $f'(\omega, t)$ as $h \to 0$ and is by the hypothesis dominated by the integrable function $g(\omega)$. Therefore,
$\frac{1}{h}\Bigl[\int f(\omega, t+h)\mu(d\omega) - \int f(\omega, t)\mu(d\omega)\Bigr] \to \int f'(\omega, t)\mu(d\omega).$
•
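Theorem 16.8(ii) is easy to sanity-check on a toy finite measure space. Everything below (the three-point space, the weights, and $f(\omega, t) = \sin(\omega t)$) is an assumption of this sketch, not from the text; here $|f'(\omega, t)| \le 2$ supplies the integrable dominating $g$.

```python
from math import sin, cos

omegas = [0.5, 1.0, 2.0]     # a toy finite measure space
weights = [0.2, 0.3, 0.5]    # mu({omega_i})

def F(t):
    """F(t) = integral of f(omega, t) = sin(omega * t) over the finite measure mu."""
    return sum(w * sin(o * t) for o, w in zip(omegas, weights))

def F_prime(t):
    """Integral of the t-derivative f'(omega, t) = omega * cos(omega * t)."""
    return sum(w * o * cos(o * t) for o, w in zip(omegas, weights))

t, h = 0.7, 1e-6
finite_diff = (F(t + h) - F(t)) / h
print(finite_diff, F_prime(t))   # agree to about 1e-5, as the theorem predicts
```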
The condition involving $g$ can be localized. It suffices to assume that for each $t$ there is an integrable $g(\omega, t)$ such that $|f'(\omega, s)| \le g(\omega, t)$ for all $s$ in some neighborhood of $t$.

Integration over Sets

The integral of $f$ over a set $A$ in $\mathscr{F}$ is defined by

(16.10)    $\int_A f\,d\mu = \int I_Af\,d\mu.$

The definition applies if $f$ is defined only on $A$ in the first place (set $f = 0$ outside $A$). Notice that $\int_A f\,d\mu = 0$ if $\mu(A) = 0$.
All the concepts and theorems above carry over in an obvious way to integrals over $A$. Theorems 16.6 and 16.7 yield this result:

Theorem 16.9. If $A_1, A_2, \ldots$ are disjoint, and if $f$ is either nonnegative or integrable, then $\int_{\cup_n A_n} f\,d\mu = \sum_n \int_{A_n} f\,d\mu$.

Example 16.9. The integrals (16.10) suffice to determine $f$, provided $\mu$ is $\sigma$-finite. Suppose that $f$ and $g$ are nonnegative and that $\int_A f\,d\mu = \int_A g\,d\mu$ for all $A$ in $\mathscr{F}$. If $\mu$ is $\sigma$-finite, there are $\mathscr{F}$-sets $A_n$ such that $A_n \uparrow \Omega$ and $\mu(A_n) < \infty$. If $B_n = [0 \le g < f,\ g \le n]$, then the hypothesized equality applied to $A_n \cap B_n$ implies $\int_{A_n\cap B_n} f\,d\mu = \int_{A_n\cap B_n} g\,d\mu < \infty$ (since $A_n \cap B_n$ has finite measure and $g$ is bounded there) and hence $\int I_{A_n\cap B_n}(f - g)\,d\mu = 0$. But then by Theorem 15.2(ii), the integrand is 0 almost everywhere, and so $\mu(A_n \cap B_n) = 0$. Therefore, $\mu[0 \le g < f,\ g < \infty] = 0$, so that $f \le g$ almost everywhere:

(i) If $f$ and $g$ are nonnegative and $\int_A f\,d\mu = \int_A g\,d\mu$ for all $A$ in $\mathscr{F}$, and if $\mu$ is $\sigma$-finite, then $f = g$ almost everywhere.

There is a simpler argument for integrable functions; it does not require that the functions be nonnegative or that $\mu$ be $\sigma$-finite. Suppose that $f$ and $g$ are integrable and that $\int_A f\,d\mu = \int_A g\,d\mu$ for all $A$ in $\mathscr{F}$. Then $\int_{[g<f]}(f - g)\,d\mu = 0$ and hence $\mu[g < f] = 0$ by Theorem 15.2(ii):

(ii) If $f$ and $g$ are integrable and $\int_A f\,d\mu = \int_A g\,d\mu$ for all $A$ in $\mathscr{F}$, then $f = g$ almost everywhere.

(iii) If $f$ and $g$ are integrable and $\int_A f\,d\mu = \int_A g\,d\mu$ for all $A$ in $\mathscr{P}$, where $\mathscr{P}$ is a $\pi$-system generating $\mathscr{F}$ and $\Omega$ is a finite or countable union of $\mathscr{P}$-sets, then $f = g$ almost everywhere.

If $f$ and $g$ are nonnegative, this follows from Theorem 10.4 and (ii) above. For the general case, first prove that $f^+ + g^- = f^- + g^+$ almost everywhere. •
Suppose that 8 is a nonnegative function and define a measure " by (Theorem 16.9) (16.11)
v(A)
=
L8 dfl ,
A
e � '·
8 is not assumed integrable with respect to p.. Many measures arise in this way. Note that p. ( A ) = 0 implies that 11(A) = 0. Clearly, " is finite if and
217 16. PROPERTIES OF THE INTEGRAL only if 8 is integrable p.. Another function 8' gives rise to the same " if 8 = 8 ' almost everywhere. On the other hand, J1(A) = JAB' dp. and (16.11) together imply that 8 = 8 ' almost everywhere if p. is a-finite, as was shown SECTION
in the preceding example. t p.
The measure " defined by (16.11) is said to have density 8 with respect to A density is by definition nonnegative. Formal substitution dJI = 8 dp. gives formulas (16.12) and (16.13).
1beorem 16.10.
If has density 8 with respect to p., then Jl
jfdv = jf8 dfA
( 16.12)
for nonnegative f. Moreover, f (not necessarily nonnegative) is integra ble with respect to if and only if fB is integrable with respect to p., in which case (16.12) and holds
Jl
(16.13 ) both hold. For nonnegative f, (16.13) always holds. Here fB is to be taken as 0 if f = 0 or if 8 conventions (15.2). Note that J1( 8 = 0] = 0.
= 0; this is consistent with the
If f = lA, then f fdJI = J1( A), so that (16.12) reduces to the definition (16.11). If f is a simple nonnegative function, (16.12) then follows by linearity. If f is nonnegative, fIn dJI = f fn6 dp. for the simple functions /, of Theorem 13.5, and (1�.12) follows by a monotone passage to the limit -that is, by Theorem 16:! Note that both sides of (16.12) may be infinite. Even if f is not nonnegative, (16.12) applies to 1/1 , whence it follows that I is integrable with respect to " if and only if f6 is integrable with respect to p. And if f is integrable, (16.12) follows from differencing the same result • for j+ and /-. Replacing f by fiA leads from (16.12) to (16.13). PROOF.
If J1( A) = p.(A n A0), then (16.11) holds with 6 and (16.13) reduces to fAfdJI = fA n A o fdp.. Example 16. 10.
= lA o'
•
Theorem 16.10 has two features in common with a number of theorems about integration. (i) The relation in question, (16.12) in this case, in addition to holding for integrable functions, holds for all nonnegative 'This is true also if " is a-finite;
see
Problem 16.12.
218
INTEGRATION
functions- the point being that if one side of the equation is infinite, then so is the other, and if both are finite, then they have the same value. This is useful in checking for integrability in the first place. (ii) The result is proved first for indicator functions, then for simple functions, then for nonnegative functions, then for integrable functions. In this connection, see Examples 16.4 and 16.5. The next result is Scheffe ' s theorem. Theorem 16.11.
sities 8n and 8. If
Suppose that �'n ( A) = fA Bn dP. and I'( A) = fA B dp. for den
n = 1 , 2, . . . , and if Bn .-. 8 except on a set of p.-measure 0, then ( 16.14 )
( 16.15 )
The inequality in (16.15) of course follows from (16.5). Let gn = 8 - Bn . The positive part g: of gn converges to 0 except on a set of p.-measure 0. Moreover, 0 < g: < 8 and 8 is integrable, and so the dominated convergence theorem applies: f g: dp. --+ 0. But f gn dp. = 0 by (16.14), and therefore PROOF.
A corollary concerning infinite series follows immediately: take $\mu$ as counting measure on $\Omega = \{1, 2, \ldots\}$.
Corollary. If $\sum_m x_{nm} = \sum_m x_m < \infty$, the terms being nonnegative, and if $\lim_n x_{nm} = x_m$ for each $m$, then $\lim_n \sum_m |x_{nm} - x_m| = 0$. If $\{y_m\}$ is bounded, then $\lim_n \sum_m y_mx_{nm} = \sum_m y_mx_m$.
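A classical instance of this corollary (my own illustration; the binomial-to-Poisson setup is an assumption chosen for the example): the rows $x_{nm} = \mathrm{Bin}(n, \lambda/n)\{m\}$ and the limits $x_m = \mathrm{Poisson}(\lambda)\{m\}$ all sum to 1 and converge termwise, so the $\ell^1$ distance between row and limit must vanish.

```python
from math import comb, exp, factorial

lam = 2.0

def binom_pmf(n, m):
    """Bin(n, lam/n) mass at m (0 for m > n)."""
    p = lam / n
    return comb(n, m) * p ** m * (1 - p) ** (n - m) if m <= n else 0.0

def poisson_pmf(m):
    return exp(-lam) * lam ** m / factorial(m)

def l1_distance(n, n_terms=60):
    """sum over m of |x_{nm} - x_m|; the tail beyond n_terms is negligible here."""
    return sum(abs(binom_pmf(n, m) - poisson_pmf(m)) for m in range(n_terms))

for n in (5, 20, 100, 500):
    print(n, l1_distance(n))   # decreases toward 0
```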
Change of Variable
Let $(\Omega, \mathscr{F})$ and $(\Omega', \mathscr{F}')$ be measurable spaces, and suppose that the mapping $T\colon \Omega \to \Omega'$ is measurable $\mathscr{F}/\mathscr{F}'$. For a measure $\mu$ on $\mathscr{F}$, define a measure $\mu T^{-1}$ on $\mathscr{F}'$ by

(16.16)    $\mu T^{-1}(A') = \mu(T^{-1}A'), \qquad A' \in \mathscr{F}',$

as at the end of Section 13.
Suppose $f$ is a real function on $\Omega'$ that is measurable $\mathscr{F}'$, so that the composition $fT$ is a real function on $\Omega$ that is measurable $\mathscr{F}$ (Theorem 13.1(ii)). The change-of-variable formulas are (16.17) and (16.18). If $A' = \Omega'$, the second reduces to the first.

Theorem 16.12. If $f$ is nonnegative, then

(16.17)    $\int_\Omega f(T\omega)\mu(d\omega) = \int_{\Omega'} f(\omega')\mu T^{-1}(d\omega').$

A function $f$ (not necessarily nonnegative) is integrable with respect to $\mu T^{-1}$ if and only if $fT$ is integrable with respect to $\mu$, in which case (16.17) and

(16.18)    $\int_{T^{-1}A'} f(T\omega)\mu(d\omega) = \int_{A'} f(\omega')\mu T^{-1}(d\omega')$

hold. For nonnegative $f$, (16.18) always holds.

PROOF. If $f = I_{A'}$, then $fT = I_{T^{-1}A'}$, and so (16.17) reduces to the definition (16.16). By linearity, (16.17) holds for nonnegative simple functions. If $f_n$ are simple functions for which $0 \le f_n \uparrow f$, then $0 \le f_nT \uparrow fT$, and (16.17) follows by the monotone convergence theorem. An application of (16.17) to $|f|$ establishes the assertion about integrability, and for integrable $f$, (16.17) follows by decomposition into positive and negative parts. Finally, if $f$ is replaced by $fI_{A'}$, (16.17) reduces to (16.18). •
Example 16.11. Suppose that (Ω′, ℱ′) = (R¹, ℛ¹) and T = φ is an ordinary real function, measurable ℱ. If f(x) = x, (16.17) becomes

(16.19)  ∫_Ω φ(ω) μ(dω) = ∫_{R¹} x μφ⁻¹(dx).

If φ = Σᵢ xᵢ I_{Aᵢ} is simple, then μφ⁻¹ has mass μ(Aᵢ) at xᵢ, and each side of (16.19) reduces to Σᵢ xᵢ μ(Aᵢ). •
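Both sides of (16.19) can be computed exactly on a small discrete example; the four-point space and masses below are assumptions chosen for illustration, not from the text.

```python
from fractions import Fraction

# A discrete check of (16.19): integrating a simple phi over Omega should
# equal integrating the identity against the image measure mu phi^{-1}.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
phi = {"a": 3, "b": -1, "c": 3, "d": 0}

lhs = sum(phi[w] * mu[w] for w in mu)        # integral of phi with respect to mu

image = {}                                    # mu phi^{-1}: mass carried to each value
for w, m in mu.items():
    image[phi[w]] = image.get(phi[w], Fraction(0)) + m
rhs = sum(x * m for x, m in image.items())    # integral of x against mu phi^{-1}
```

The value 3 picks up mass 1/2 + 1/8 = 5/8, exactly as in the simple-function computation at the end of the example.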
Uniform Integrability

If f is integrable, then |f| I_{[|f| ≥ a]} goes to 0 almost everywhere as a → ∞ and is dominated by |f|, and hence

(16.20)  lim_{a→∞} ∫_{[|f| ≥ a]} |f| dμ = 0.

INTEGRATION
A sequence {f_n} is uniformly integrable if (16.20) holds uniformly in n:

(16.21)  lim_{a→∞} sup_n ∫_{[|f_n| ≥ a]} |f_n| dμ = 0.

If (16.21) holds and μ(Ω) < ∞, and if a is large enough that the supremum in (16.21) is less than 1, then

(16.22)  ∫|f_n| dμ ≤ aμ(Ω) + 1,

and hence the f_n are integrable. On the other hand, (16.21) always holds if the f_n are uniformly bounded, but the f_n need not in that case be integrable if μ(Ω) = ∞. For this reason the concept of uniform integrability is interesting only for finite μ.

If h is the maximum of |f| and |g|, then

∫_{[|f+g| ≥ 2a]} |f + g| dμ ≤ 2∫_{[h ≥ a]} h dμ ≤ 2∫_{[|f| ≥ a]} |f| dμ + 2∫_{[|g| ≥ a]} |g| dμ.

Therefore, if {f_n} and {g_n} are uniformly integrable, so is {f_n + g_n}.
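For the family in Problem 16.17(b), the supremum in (16.21) can be evaluated in closed form, and the sketch below (an illustration, not part of the text's argument) computes it for a few values of a.

```python
import math

# The family f_n = (n / log n) * 1_{(0, 1/n)}, n >= 3, on (0, 1] with Lebesgue
# measure: the integral of f_n over [f_n >= a] is 1/log n when n/log n >= a
# and 0 otherwise, and n/log n is increasing for n >= 3, so the supremum in
# (16.21) is attained at the smallest qualifying n.
def sup_tail(a):
    n = 3
    while n / math.log(n) < a:
        n += 1
    return 1 / math.log(n)

tails = [sup_tail(a) for a in (10.0, 100.0, 1000.0)]   # slowly decreasing to 0
```

The decrease is only logarithmic in a, which is why no single integrable g can dominate the whole family even though (16.21) holds.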
Theorem 16.13. If μ(Ω) < ∞ and f_n → f almost everywhere, and if the f_n are uniformly integrable, then f is integrable and ∫f_n dμ → ∫f dμ.

PROOF. Since ∫|f_n| dμ is bounded (see (16.22)), it follows by Fatou's lemma that f is integrable. Put h_n = |f_n − f|. Then the h_n are uniformly integrable and h_n → 0 almost everywhere. If h_n^{(a)} = h_n I_{[h_n < a]}, then for each a, lim_n h_n^{(a)} = 0 and so lim_n ∫h_n^{(a)} dμ = 0 by the bounded convergence theorem. Now

∫h_n dμ = ∫_{[h_n ≥ a]} h_n dμ + ∫h_n^{(a)} dμ.

Given ε, choose a so that the first term on the right is less than ε for all n; then choose n₀ so that the second term on the right is less than ε for all n ≥ n₀. This shows that ∫h_n dμ → 0. •

Suppose that

(16.23)  sup_n ∫|f_n|^{1+ε} dμ < ∞

for a positive ε. If K is the supremum here, then

∫_{[|f_n| ≥ a]} |f_n| dμ ≤ a⁻ᵉ ∫_{[|f_n| ≥ a]} |f_n|^{1+ε} dμ ≤ K/aᵉ,

so {f_n} is uniformly integrable.
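The estimate behind (16.23) can be checked on a grid; the function f(x) = x^{−1/2} on (0, 1) and ε = 1/2 below are assumptions chosen for illustration.

```python
# Grid check of: integral of |f| over [|f| >= a]  <=  a**(-eps) * integral of
# |f|**(1+eps), for f(x) = x**(-1/2) on (0, 1), eps = 0.5 (midpoint rule).
eps = 0.5
n = 200_000
h = 1.0 / n
f_vals = [((i + 0.5) * h) ** -0.5 for i in range(n)]

moment = sum(v ** (1 + eps) for v in f_vals) * h   # integral of |f|^{1.5}, about 4
ok = []
for a in (2.0, 5.0, 10.0):
    tail_mass = sum(v for v in f_vals if v >= a) * h
    ok.append(tail_mass <= moment / a ** eps)
```

Here the tail integrals are 2/a exactly, comfortably below the bound 4/√a, and they go to 0 at the uniform rate the inequality promises.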
Complex Functions

A complex-valued function on Ω has the form f(ω) = g(ω) + ih(ω), where g and h are ordinary finite-valued real functions on Ω. It is, by definition, measurable ℱ if g and h are. If g and h are integrable, then f is by definition integrable, and its integral is of course taken as

(16.24)  ∫(g + ih) dμ = ∫g dμ + i∫h dμ.

Since max{|g|, |h|} ≤ |f| ≤ |g| + |h|, f is integrable if and only if ∫|f| dμ < ∞, just as in the real case. The linearity equation (16.3) extends to complex functions and coefficients; the proof requires only that everything be decomposed into real and imaginary parts. Consider the inequality (16.4) for the complex case:

(16.25)  |∫f dμ| ≤ ∫|f| dμ.

If f = g + ih and g and h are simple, the corresponding partitions can be taken to be the same (g = Σ_k x_k I_{A_k} and h = Σ_k y_k I_{A_k}), and (16.25) follows by the triangle inequality. For the general integrable f, represent g and h as limits of simple functions dominated by |f| and pass to the limit.

The results on integration to the limit extend as well. Suppose that f_k = g_k + ih_k are complex functions satisfying Σ_k ∫|f_k| dμ < ∞. Then Σ_k ∫|g_k| dμ < ∞, and so by the corollary to Theorem 16.7, Σ_k g_k is integrable and integrates to Σ_k ∫g_k dμ. The same is true of the imaginary parts, and hence Σ_k f_k is integrable and

(16.26)  ∫ Σ_k f_k dμ = Σ_k ∫f_k dμ.
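With counting measure on finitely many points, (16.25) is just the triangle inequality for complex numbers; the particular values below are assumptions for illustration, with equality occurring when all values point the same way.

```python
import cmath

# Discrete instance of (16.25): |sum f_k| <= sum |f_k|.
values = [1 + 2j, -0.5 + 0.1j, cmath.exp(0.7j), 2 - 1j, -3j]
lhs = abs(sum(values))
rhs = sum(abs(z) for z in values)

# Equality case: values aligned along one direction.
aligned_lhs = abs((1 + 1j) + (2 + 2j))
aligned_rhs = abs(1 + 1j) + abs(2 + 2j)
```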
PROBLEMS

16.1. Suppose that μ(A) < ∞ and α ≤ f ≤ β almost everywhere on A. Show that αμ(A) ≤ ∫_A f dμ ≤ βμ(A).

16.2. (a) Deduce parts (i) and (ii) of Theorem 10.2 from Theorems 16.2 and 16.4.
(b) Compare (16.7), (10.6), and (4.10).
16.3. 13.13 ↑ Suppose that μ(Ω) < ∞ and the f_n are uniformly bounded.
(a) Assume f_n → f uniformly and deduce ∫f_n dμ → ∫f dμ from (16.5).
(b) Use part (a) and Egoroff's theorem to give another proof of Theorem 16.5.

16.4. Prove that if 0 ≤ f_n → f almost everywhere and ∫f_n dμ ≤ A < ∞, then f is integrable and ∫f dμ ≤ A. (This is essentially the same as Fatou's lemma and is sometimes called by that name.)

16.5. Suppose that μ is Lebesgue measure on the unit interval Ω and that (a, b) = (0, 1) in Theorem 16.8. If f(ω, t) is 1 or 0 according as ω < t or ω ≥ t, then for each t, f′(ω, t) = 0 almost everywhere. But ∫f(ω, t) μ(dω) does not differentiate to 0. Why does this not contradict the theorem? Examine the proof for this case.

16.6. (a) Suppose that functions a_n, b_n, f_n converge almost everywhere to functions a, b, f, respectively. Suppose that the first two sequences may be integrated to the limit; that is, the functions are all integrable and ∫a_n dμ → ∫a dμ, ∫b_n dμ → ∫b dμ. Suppose, finally, that the first two sequences enclose the third: a_n ≤ f_n ≤ b_n almost everywhere. Show that the third may be integrated to the limit.
(b) Deduce Lebesgue's dominated convergence theorem from part (a).

16.7. Show that, for f integrable and ε positive, there exists an integrable simple g such that ∫|f − g| dμ < ε. If 𝒜 is a field generating ℱ, g can be taken as Σᵢ xᵢ I_{Aᵢ} with Aᵢ ∈ 𝒜. Hint: Use Theorems 13.5, 16.4, and 11.4.

16.8. Suppose that f(ω, ·) is, for each ω, a function on an open set W in the complex plane and that f(·, z) is, for z in W, measurable ℱ. Suppose that A satisfies (16.9), that f(ω, ·) is analytic in W for ω in A, and that for each z₀ in W there is an integrable g(·, z₀) such that |f(ω, z)| ≤ g(ω, z₀) for all ω ∈ A and all z in some neighborhood of z₀. Show that ∫f(ω, z) μ(dω) is analytic in W.

16.9. Suppose that the f_n are integrable and sup_n ∫f_n dμ < ∞. Show that, if f_n ↑ f, then f is integrable and ∫f_n dμ → ∫f dμ. This is Beppo Levi's theorem.

16.10. Suppose of integrable f_n that α_{mn} = ∫|f_m − f_n| dμ → 0 as m, n → ∞. Then there exists an integrable f such that β_n = ∫|f − f_n| dμ → 0. Prove this by the following steps:
(a) Choose n_k so that m, n ≥ n_k implies α_{mn} < 2⁻ᵏ. Use the corollary to Theorem 16.7 to show that f = lim_k f_{n_k} exists and is integrable and lim_k ∫|f − f_{n_k}| dμ = 0.
(b) For n ≥ n_k, write ∫|f − f_n| dμ ≤ ∫|f − f_{n_k}| dμ + ∫|f_{n_k} − f_n| dμ.

16.11. (a) If μ is not σ-finite, (i) in Example 16.9 can fail. Find an example.
(b) Modifying the argument in Example 16.9, show that if f, g ≥ 0, if ∫_A f dμ ≤ ∫_A g dμ for all A ∈ ℱ, and if [g < ∞, f > 0] is a countable union of sets of finite μ-measure, then f ≤ g almost everywhere.
(c) Show that the condition on [g < ∞, f > 0] in part (b) is satisfied if the measure ν(A) = ∫_A f dμ is σ-finite.

16.12. ↑ (a) Suppose that (16.11) holds (δ ≥ 0). Show that ν determines δ up to a set of μ-measure 0 if μ or ν is σ-finite.
(b) Show that, if μ is σ-finite and μ[δ = ∞] = 0, then ν is σ-finite.
(c) Show that if neither μ nor ν is σ-finite, then δ may not be unique even up to sets of μ-measure 0.

16.13. Show that the supremum in (16.15) is half the integral on the right.

16.14. Show that if (16.14) holds and if lim inf_n δ_n ≥ δ almost everywhere, then (16.15) holds and lim inf_n δ_n = δ almost everywhere.

16.15. Show that Scheffé's theorem still holds if (16.14) is weakened to ν_n(Ω) → ν(Ω).

16.16. ↑ Suppose that f_n and f are integrable and that f_n → f almost everywhere. Show that ∫|f_n| dμ → ∫|f| dμ if and only if ∫|f_n − f| dμ → 0.

16.17. (a) Show that if |f_n| ≤ g and g is integrable, then {f_n} is uniformly integrable. Compare the hypotheses of Theorems 16.4 and 16.13.
(b) On the unit interval with Lebesgue measure, let f_n = (n/log n) I_{(0, n⁻¹)} for n ≥ 3. Show that the f_n are uniformly integrable (and ∫f_n dμ → 0) although they are not dominated by any integrable g.
(c) Show for f_n = n I_{(0, n⁻¹)} − n I_{(n⁻¹, 2n⁻¹)} that the f_n can be integrated to the limit (Lebesgue measure) even though the f_n are not uniformly integrable.

16.18. Show that if f is integrable, then for each ε there is a δ such that μ(A) < δ implies ∫_A |f| dμ < ε.

16.19. ↑ Suppose that μ(Ω) < ∞. Show that {f_n} is uniformly integrable if and only if (i) ∫|f_n| dμ is bounded and (ii) for each ε there is a δ such that μ(A) < δ implies ∫_A |f_n| dμ < ε for all n.

16.20. 2.17 16.19 ↑ Assume μ(Ω) < ∞.
(a) Show by examples that neither of conditions (i) and (ii) in the preceding problem implies the other.
(b) Show that (ii) implies (i) for all sequences {f_n} if and only if μ is nonatomic.

16.21. 16.16 16.19 ↑ Suppose that μ(Ω) < ∞, f_n and f are integrable, and f_n → f almost everywhere. Show that these conditions are equivalent: (i) {f_n} is uniformly integrable; (ii) ∫|f_n| dμ → ∫|f| dμ; (iii) ∫|f_n − f| dμ → 0.

16.22. 13.15 16.19 ↑ (a) Suppose that μ(Ω) < ∞ and f_n ≥ 0. Show that if the f_n are uniformly integrable, then lim sup_n ∫f_n dμ ≤ ∫ lim sup_n f_n dμ.
(b) Use this fact to give another proof of Theorem 16.13 in the nonnegative case.

16.23. (a) Suppose that f₁, f₂, ... are nonnegative and put g_n = max{f₁, ..., f_n}. Show that ∫_{[g_n ≥ a]} g_n dμ ≤ Σ_{k=1}^n ∫_{[f_k ≥ a]} f_k dμ.
(b) Suppose further that {f_n} is uniformly integrable and μ(Ω) < ∞. Show that ∫g_n dμ = o(n).

16.24. Let f be a complex-valued function integrating to re^{iθ}, r ≥ 0. From ∫(|f(ω)| − e^{−iθ} f(ω)) μ(dω) = ∫|f| dμ − r, deduce (16.25).
16.25. 11.12 ↑ Consider the vector lattice ℒ and the functional Λ of Problems 11.11 and 11.12. Let μ be the extension (Theorem 11.3) to ℱ = σ(ℒ) of the set function μ₀ defined there.
(a) Show by (11.11) that, for positive x, y₁, y₂, ν([f > 1] × (0, x]) = xμ₀[f > 1] = xμ[f > 1] and ν([y₁ < f ≤ y₂] × (0, x]) = xμ[y₁ < f ≤ y₂].
(b) Show that if f ∈ ℒ, then f is integrable and

Λ(f) = ∫f dμ.

(Consider first the case f ≥ 0.) This is the Daniell–Stone representation theorem.
SECTION 17. INTEGRAL WITH RESPECT TO LEBESGUE MEASURE

The Lebesgue Integral on the Line

A real measurable function on the line is Lebesgue integrable if it is integrable with respect to Lebesgue measure λ, and its Lebesgue integral ∫f dλ is denoted by ∫f(x) dx or, in case of integration over an interval, by ∫_a^b f(x) dx. It is instructive to compare it with the Riemann integral.
The Riemann Integral
A real function f on an interval (a, b] is by definition† Riemann integrable, with integral r, if this condition holds: For each ε there exists a δ with the property that

(17.1)  |r − Σ_j f(x_j) λ(J_j)| < ε

if {J_j} is any finite partition of (a, b] into subintervals satisfying λ(J_j) < δ and if x_j ∈ J_j for each j. The Riemann integral for step functions was used in Section 1.

Suppose that f is Borel measurable, and suppose that f is bounded, so that it is Lebesgue integrable. If f is also Riemann integrable, the r of (17.1) must coincide with the Lebesgue integral ∫_a^b f(x) dx. To see this, first note that letting x_j vary over J_j leads from (17.1) to

(17.2)  |r − Σ_j sup_{x ∈ J_j} f(x) · λ(J_j)| ≤ ε.

† There are of course a number of equivalent definitions.
Consider the simple function g with value sup_{x ∈ J_j} f(x) on J_j. Now f ≤ g, and the sum in (17.2) is the Lebesgue integral of g. By monotonicity of the Lebesgue integral, ∫_a^b f(x) dx ≤ ∫_a^b g(x) dx ≤ r + ε. The reverse inequality follows in the same way, and so ∫_a^b f(x) dx = r. Therefore, the Riemann integral when it exists coincides with the Lebesgue integral.

Suppose that f is continuous on [a, b]. By uniform continuity, for each ε there exists a δ such that |f(x) − f(y)| < ε/(b − a) if |x − y| < δ. If λ(J_j) < δ and x_j ∈ J_j, then g = Σ_j f(x_j) I_{J_j} satisfies |f − g| < ε/(b − a) and hence |∫_a^b f dx − ∫_a^b g dx| < ε. But this is (17.1) with r replaced (as it must be) by the Lebesgue integral ∫_a^b f dx: A continuous function on a closed interval is Riemann integrable.

Example 17.1. If f is the indicator of the set of rationals in (0, 1], then the Lebesgue integral ∫₀¹ f(x) dx is 0 because f = 0 almost everywhere. But for an arbitrary partition {J_j} of (0, 1] into intervals, Σ_j f(x_j) λ(J_j) for x_j ∈ J_j is 1 if each x_j is taken from the rationals and 0 if each x_j is taken from the irrationals. Thus f is not Riemann integrable. •

Example 17.2. For the f of Example 17.1, there exists a g (namely, g = 0) such that f = g almost everywhere and g is Riemann integrable. To show that the Lebesgue theory is not reducible to the Riemann theory by the casting out of sets of measure 0, it is of interest to produce an f (bounded and Borel measurable) for which no such g exists. In Examples 3.1 and 3.2 there were constructed Borel subsets A of (0, 1] such that 0 < λ(A) < 1 and such that λ(A ∩ J) > 0 for each subinterval J of (0, 1]. Take f = I_A. Suppose that f = g almost everywhere and that {J_j} is a decomposition of (0, 1] into subintervals. Since λ(J_j ∩ A ∩ [f = g]) = λ(J_j ∩ A) > 0, g(y_j) = f(y_j) = 1 for some y_j in J_j ∩ A, and

(17.3)  Σ_j g(y_j) λ(J_j) = 1 > λ(A).

If g were Riemann integrable, its Riemann integral would coincide with the Lebesgue integrals ∫g dx = ∫f dx = λ(A), in contradiction to (17.3). •

It is because of their extreme oscillations that the functions in Examples 17.1 and 17.2 fail to be Riemann integrable. (It can be shown that a bounded function on a bounded interval is Riemann integrable if and only if the set of its discontinuities has Lebesgue measure 0.†) This cannot happen in the case of the Lebesgue integral of a measurable function: if f fails to be Lebesgue integrable, it is because its positive part or its negative part is too large, not because one or the other is too irregular.

† See Problems 17.5 and 17.6.
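The two Riemann sums of Example 17.1 can be tabulated for partitions of any fineness; in this sketch the rationality of each tag is tracked symbolically (an implementation assumption, since floating point cannot decide rationality).

```python
from fractions import Fraction

# Example 17.1 in numbers: partition (0, 1] into n equal intervals.  Rational
# tags (the midpoints (2k-1)/(2n)) give Riemann sum 1; irrational tags give 0.
# The gap stays 1 for every n, so the sums never settle on a single value.
def dirichlet_sum(n, rational_tags):
    width = Fraction(1, n)
    value_at_tag = 1 if rational_tags else 0   # f is 1 on rationals, 0 otherwise
    return sum(value_at_tag * width for _ in range(n))

gaps = [dirichlet_sum(n, True) - dirichlet_sum(n, False) for n in (10, 100, 1000)]
```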
Example 17.3. It is an important analytic fact that

(17.4)  lim_{t→∞} ∫₀ᵗ (sin x)/x dx = π/2.

The existence of the limit is simple to prove, because ∫_{(n−1)π}^{nπ} x⁻¹ sin x dx alternates in sign and its absolute value decreases to 0; the value of the limit will be identified in the next section (Example 18.3). On the other hand, x⁻¹ sin x is not Lebesgue integrable over (0, ∞), because its positive and negative parts both integrate to ∞. Within the conventions of the Lebesgue theory, (17.4) thus cannot be written ∫₀^∞ x⁻¹ sin x dx = π/2, although such "improper" integrals appear in calculus texts. It is, of course, just a question of choosing the terminology most convenient for the subject at hand. •

The function in Example 17.2 is not equal almost everywhere to any Riemann integrable function. Every Lebesgue integrable function can, however, be approximated in a certain sense by Riemann integrable functions of two kinds.
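Both halves of Example 17.3 show up numerically: the signed lobe sums settle near π/2, while the sums of absolute lobes keep growing. The quadrature scheme below is an illustrative assumption, not the text's proof.

```python
import math

# Partial integrals of sin(x)/x over the lobes ((k-1)pi, k pi), midpoint rule.
def lobe(k, steps=200):
    a = (k - 1) * math.pi
    h = math.pi / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        total += math.sin(x) / x * h
    return total

lobes = [lobe(k) for k in range(1, 2001)]
partial = []
s = 0.0
for v in lobes:
    s += v
    partial.append(s)

signed = 0.5 * (partial[-1] + partial[-2])   # averaging damps the alternation
absolute = sum(abs(v) for v in lobes)        # grows like (2/pi) log t: divergence
```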
Theorem 17.1. Suppose that ∫|f| dx < ∞ and ε > 0.
(i) There is a step function g = Σ_{i=1}^k xᵢ I_{Aᵢ}, with bounded intervals as the Aᵢ, such that ∫|f − g| dx < ε.
(ii) There is a continuous integrable h such that ∫|f − h| dx < ε.

PROOF. By the construction (13.6) and the dominated convergence theorem, (i) holds if the Aᵢ are not required to be intervals; moreover, λ(Aᵢ) < ∞ for each i for which xᵢ ≠ 0. By Theorem 11.4 there exists a finite disjoint union Bᵢ of intervals such that λ(Aᵢ Δ Bᵢ) < ε/k|xᵢ|. But then Σᵢ xᵢ I_{Bᵢ} satisfies the requirements of (i) with 2ε in place of ε.

To prove (ii) it is only necessary to show that for the g of (i) there is a continuous h such that ∫|g − h| dx < ε. Suppose that Aᵢ = (aᵢ, bᵢ]; let hᵢ(x) be 1 on (aᵢ, bᵢ] and 0 outside (aᵢ − δ, bᵢ + δ], and let it increase linearly from 0 to 1 over (aᵢ − δ, aᵢ] and decrease linearly from 1 to 0 over (bᵢ, bᵢ + δ]. Since ∫|I_{Aᵢ} − hᵢ| dx → 0 as δ → 0, h = Σᵢ xᵢhᵢ for sufficiently small δ will satisfy the requirements. Note that this h vanishes outside a bounded interval. •
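The tent construction in the proof of (ii) can be carried out for one indicator; the interval (0.2, 0.7] and corner width δ = 0.05 are illustrative assumptions, and the L¹ error is then exactly δ (two triangles of area δ/2 each).

```python
# Continuous approximation of the indicator of (a, b], as in Theorem 17.1(ii).
def tent(x, a=0.2, b=0.7, delta=0.05):
    if a <= x <= b:
        return 1.0
    if a - delta <= x < a:
        return (x - (a - delta)) / delta      # linear rise over (a - delta, a]
    if b < x <= b + delta:
        return ((b + delta) - x) / delta      # linear fall over (b, b + delta]
    return 0.0

n = 200_000
w = 1.0 / n
l1_error = sum(
    abs(tent((i + 0.5) * w) - (1.0 if 0.2 < (i + 0.5) * w <= 0.7 else 0.0)) * w
    for i in range(n)
)
```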
The Lebesgue integral is thus determined by its values for continuous functions.†

† This provides another way of defining the Lebesgue integral on the line. See Problem 17.17.
The Fundamental Theorem of Calculus

For positive h,

|(1/h) ∫_x^{x+h} f(y) dy − f(x)| ≤ (1/h) ∫_x^{x+h} |f(y) − f(x)| dy ≤ sup[|f(y) − f(x)|: x ≤ y ≤ x + h],

and the right side goes to 0 with h if f is continuous at x. The same thing holds for negative h,† and therefore ∫_a^x f(y) dy has derivative f(x):

(17.5)  (d/dx) ∫_a^x f(y) dy = f(x)

if f is continuous at x. Suppose that F is a function with continuous derivative F′ = f; that is, suppose that F is a primitive of the continuous function f. Then

(17.6)  ∫_a^b f(x) dx = ∫_a^b F′(x) dx = F(b) − F(a),

as follows from the fact that F(x) − F(a) and ∫_a^x f(y) dy agree at x = a and by (17.5) have identical derivatives. For continuous f, (17.5) and (17.6) are two ways of stating the fundamental theorem of calculus. To the calculation of Lebesgue integrals the methods of elementary calculus thus apply.

As will follow from the general theory of derivatives in Section 31, (17.5) holds outside a set of Lebesgue measure 0 if f is integrable; it need not be continuous. As the following example shows, however, (17.6) can fail for discontinuous f.

Example 17.4. Define F(x) = x² sin x⁻² for 0 < x ≤ ½ and F(x) = 0 for x ≤ 0 and for x ≥ 1. Now for ½ < x < 1 define F(x) in such a way that F is continuously differentiable over (0, ∞). Then F is everywhere differentiable, but F′(0) = 0 and F′(x) = 2x sin x⁻² − 2x⁻¹ cos x⁻² for 0 < x < ½. Thus F′ is discontinuous at 0; F′ is, in fact, not even integrable over (0, 1], which makes (17.6) impossible for a = 0.

For a more extreme example, decompose (0, 1] into countably many subintervals (a_n, b_n]. Define G(x) = 0 for x ≤ 0 and x ≥ 1, and on (a_n, b_n] define G(x) = F((x − a_n)/(b_n − a_n)). Then G is everywhere differentiable, but (17.6) is impossible for G if (a, b] contains any of the (a_n, b_n], because G′ is not integrable over any of them. •

† This requires defining ∫_v^u = −∫_u^v for u > v.
Change of Variable

Suppose that T is a continuous, increasing mapping of (a, b) onto (Ta, Tb). If T has a continuous derivative T′, then the formal substitutions y = T(x) and dy = T′(x) dx give the usual formula for changing variables:

(17.7)  ∫_a^b f(T(x)) T′(x) dx = ∫_{Ta}^{Tb} f(y) dy.

There is a more general formula. If T has a continuous derivative on (a, b) and is either increasing or decreasing, then

(17.8)  ∫_A f(T(x)) |T′(x)| dx = ∫_{TA} f(y) dy

for Borel subsets A of (a, b). To prove this, define μ(A) = ∫_A |T′(x)| dx for A ⊂ (a, b). If B is a subinterval of (a, b), then

(17.9)  μ(B) = λ(TB),

as follows by the fundamental theorem of calculus (consider increasing and decreasing T separately). By Theorem 10.4, (17.9) holds for all Borel sets B in (a, b). By the general change-of-variable formula (16.18), ∫_{T⁻¹B} f(Tx) μ(dx) = ∫_B f(y) dy. Putting B = TA (T is one-to-one) and transforming the left side by the formula (16.13) for integration with respect to a density gives (17.8).†

Note that T′ may vanish and T⁻¹ may be nondifferentiable at places, for example if T(x) = x³. The intervals (a, b) and (Ta, Tb) may be infinite. Finally, (17.8) holds for all nonnegative f as well as for integrable f.
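Formula (17.8) can be checked numerically for the very map T(x) = x³ mentioned above, whose derivative vanishes at 0; the test function f(y) = 1/(1 + y) and the interval (0, 2) are illustrative assumptions.

```python
import math

# Check (17.8) for T(x) = x**3 on A = (0, 2), so TA = (0, 8):
#   integral over A of f(x**3) * 3x**2 dx  =  integral over TA of f(y) dy.
def midpoint(g, a, b, n=400_000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

lhs = midpoint(lambda x: (1 / (1 + x**3)) * 3 * x**2, 0.0, 2.0)
rhs = midpoint(lambda y: 1 / (1 + y), 0.0, 8.0)   # closed form: log 9
```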
Example 17.5. Put T(x) = sin x/cos x on (−π/2, π/2). Then T′(x) = 1 + T²(x), and (17.7) for f(y) = (1 + y²)⁻¹ gives

(17.10)  ∫_{−∞}^{∞} dy/(1 + y²) = π. •

† For another approach, see Problem 17.12.
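A direct grid computation recovers (17.10); the cutoff R = 1000 is an assumption, and the missing tails contribute about 2/R.

```python
import math

# (17.10) numerically: the integral of 1/(1 + y**2) over (-R, R) is
# 2*arctan(R), which tends to pi as R grows.
def cauchy_integral(R, n=1_000_000):
    h = 2 * R / n
    return sum(1 / (1 + (-R + (i + 0.5) * h) ** 2) for i in range(n)) * h

approx = cauchy_integral(1000.0)   # pi - 0.002, up to quadrature error
```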
The Lebesgue Integral in R^k

The k-dimensional Lebesgue integral, the integral in (R^k, ℛ^k, λ_k), is denoted ∫f(x) dx, it being understood that x = (x₁, ..., x_k). In low-dimensional cases it is also denoted ∫∫ f(x₁, x₂) dx₁ dx₂, and so on.

A mapping T: V → R^k carrying a set V in R^k into R^k has the form T(x) = (t₁(x), ..., t_k(x)). Suppose that the partial derivatives t_{ij} = ∂t_i/∂x_j exist, let D_x be the k × k matrix with entries t_{ij}(x), and let J(x) denote the Jacobian:

(17.11)  J(x) = det D_x.

Theorem 17.2. Let T: V → TV be a one-to-one mapping of an open set V onto an open set TV. Suppose that T is continuous and that T has continuous partial derivatives t_{ij}. Then

(17.12)  ∫_V f(Tx) |J(x)| dx = ∫_{TV} f(y) dy.

As usual this holds for nonnegative f and for f integrable over TV. Replacing f by f I_{TA} for A ⊂ V gives

(17.13)  ∫_A f(Tx) |J(x)| dx = ∫_{TA} f(y) dy.

In the case k = 1, this reduces to (17.8). If T is linear, D_x is for each x the matrix of the transformation. If T is identified with this matrix, (17.13) reduces to

(17.14)  |det T| · ∫_A f(Tx) dx = ∫_{TA} f(y) dy.

Only this special case will be proved here, because Theorem 19.3 contains Theorem 17.2.†

Suppose that T is linear and nonsingular and V = R^k. It is enough to prove (17.14) for A = R^k. If f is an indicator, this reduces to (12.2), and it follows in the usual way that it holds for simple f, then for nonnegative f, and finally for the general integrable f.

† The proof in Section 19 uses properties of Hausdorff measure. For an alternative argument see, for example, Rudin, p. 184. In the present book, as it happens, Theorem 17.2 is applied only to nonsingular linear transformations T and to the T of Example 17.6. For the latter case, there is an entirely different proof; see Problems 12.16, 17.15, and 18.24.
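For an indicator, (17.14) says λ₂(TA) = |det T| λ₂(A); the particular shear below is an illustrative assumption, with the image area computed by the shoelace formula.

```python
# (17.14) for an indicator: T = [[2, 1], [0, 3]] applied to the unit square A
# yields a parallelogram TA of area |det T| = 6.
T = [[2.0, 1.0], [0.0, 3.0]]
det_T = T[0][0] * T[1][1] - T[0][1] * T[1][0]

corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
image = [(T[0][0] * x + T[0][1] * y, T[1][0] * x + T[1][1] * y) for x, y in corners]

area = 0.0                                   # shoelace formula for the image
for i in range(4):
    x1, y1 = image[i]
    x2, y2 = image[(i + 1) % 4]
    area += x1 * y2 - x2 * y1
area = abs(area) / 2
```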
Example 17.6. If T(r, θ) = (r cos θ, r sin θ), then the Jacobian (17.11) is J(r, θ) = r, which gives the formula for integrating in polar coordinates:

(17.15)  ∫∫_{r>0, 0<θ<2π} f(r cos θ, r sin θ) r dr dθ = ∫∫_{R²} f(x, y) dx dy. •
Stieltjes Integrals

Suppose that F is a function on R^k satisfying the hypotheses of Theorem 12.5, so that there exists a measure μ such that μ(A) = Δ_A F for bounded rectangles A. In integrals with respect to μ, μ(dx) is often replaced by dF(x):

(17.16)  ∫_A f(x) dF(x) = ∫_A f(x) μ(dx).

The left side of this equation is the Stieltjes integral of f with respect to F; since it is defined by the right side of the equation, nothing new is involved. Suppose that f is uniformly continuous on a rectangle A, and suppose that A is decomposed into rectangles A_m small enough that |f(x) − f(y)| < ε/μ(A) for x, y ∈ A_m. Then

|∫_A f(x) dF(x) − Σ_m f(x_m) Δ_{A_m} F| < ε

for x_m ∈ A_m. In this case the left side of (17.16) can be defined as the limit of these approximating sums without any reference to the general theory of measure, and for historical reasons it is sometimes called the Riemann–Stieltjes integral; (17.16) for the general f is then called the Lebesgue–Stieltjes integral. Since these distinctions are unimportant in the context of general measure theory, ∫f(x) dF(x) and ∫f dF are best regarded as merely notational variants for ∫f(x) μ(dx) and ∫f dμ.
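The approximating sums above are easy to form in one dimension; the choices F(x) = x² and f(x) = x on [0, 1] are illustrative assumptions, for which dF = 2x dx and the limit is ∫₀¹ 2x² dx = 2/3.

```python
# Riemann-Stieltjes sums sum f(x_m) * (F(b_m) - F(a_m)) for F(x) = x**2,
# f(x) = x, tags at midpoints; the limit is the Lebesgue-Stieltjes integral
# of f dF = integral of 2*x**2 dx over (0, 1), i.e. 2/3.
def stieltjes_sum(n):
    F = lambda x: x * x
    total = 0.0
    for m in range(n):
        a, b = m / n, (m + 1) / n
        total += ((a + b) / 2) * (F(b) - F(a))
    return total

approx = stieltjes_sum(100_000)
```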
PROBLEMS

17.1. Extend Theorem 17.1 to R^k.

17.2. Show that if f is integrable, then lim_{t→0} ∫|f(x + t) − f(x)| dx = 0. Use Theorem 17.1.
17.3. Suppose that μ is a finite measure on ℛ¹ and A is closed. Show that μ(x + A) is upper semicontinuous in x and hence measurable.

17.4. Suppose that ∫₀^∞ |f(x)| dx < ∞. Show that, for each ε, λ[x: x ≥ a, |f(x)| ≥ ε] → 0 as a → ∞. Show by example that f(x) need not go to 0 as x → ∞.
17.5. A Riemann integrable function on (0, 1] is continuous almost everywhere. Let the function be f, and let A_ε be the set of x such that for every δ there are points y and z satisfying |y − x| < δ, |z − x| < δ, and |f(y) − f(z)| ≥ ε. Show that A_ε is closed and that the set D of discontinuities of f is the union of the A_ε for positive, rational ε. Suppose that [0, 1] is decomposed into intervals Iᵢ. If the interior of Iᵢ meets A_ε, choose yᵢ and zᵢ in Iᵢ in such a way that f(yᵢ) − f(zᵢ) ≥ ε; otherwise take yᵢ = zᵢ ∈ Iᵢ. Show that Σᵢ (f(yᵢ) − f(zᵢ)) λ(Iᵢ) ≥ ε λ(A_ε). Finally, show that if f is Riemann integrable, then λ(D) = 0.

17.6. A bounded Borel function on (0, 1] that is continuous almost everywhere is Riemann integrable. Let M bound |f| and suppose that λ(D) = 0. Given ε, find an open G such that D ⊂ G and λ(G) < ε/M, and take C = [0, 1] − G. For each x in C find a δ_x such that |f(y) − f(x)| < ε if |y − x| < δ_x. Cover C by finitely many intervals (x_k − δ_{x_k}/2, x_k + δ_{x_k}/2) and take δ less than half the minimum of the δ_{x_k}. Show that |f(y) − f(x)| < 2ε if |y − x| < δ and x ∈ C. If [0, 1] is decomposed into intervals Iᵢ with λ(Iᵢ) < δ, and if yᵢ ∈ Iᵢ, let g be the function with value f(yᵢ) on Iᵢ. Let Σ′ denote summation over those i for which Iᵢ meets C, and let Σ″ denote summation over the other i. Show that

∫₀¹ |f − g| dx ≤ Σ′ 2ε λ(Iᵢ) + Σ″ 2M λ(Iᵢ) < 2ε + 2ε,

and conclude that f is Riemann integrable.
17. 7.
Show that if f is integrable, there exist continuous, integrable functions gn such that gn (x) --+ /(x ) except on a set of Lebesgue measure 0. (Use Theorem 17.1(ii) with £ = n - 2 . )
17.8.
13.13 1 7.7 f Let f be a finite-valued Borel function over (0, 1). By the ' following steps, prove Lusin s theorem: For each £ there exists a continuous function g such that A [ x e (0, 1): /( x) =I= g( x)] < £ . (a) Show that f may be assumed integrable, or even bounded. (b) Let gn be continuous functions converging to f almost everywhere. Combine Egoroff's theorem and Theorem 12.3 to show that convergence is uniform on a compact set K such that A ((O, 1) - K ) < £. The limit lim n gn ( x) = /(x) must be continuous when restricted to K. (c) Exhibit (0, 1) - K as a disjoint union of open intervals Ik [A12] ; define g as f on K and define it by linear interpolation on each lk . n n Let fn ( x ) = x - 1 - 2x2 - 1 • Calculate and compare /JE':- 1 /n ( x) dx and EC:- 1 /J/n ( x) dx. Relate this to Theorem 16.6 and to the corollary to Theorem 16.7.
17.9.
232
INTEGRATION
17.10. Show that (1 + y 2 ) - 1 has equal integrals over ( - oo , - 1), ( - 1, 0), (0, 1), (1 , oo). Conclude from (17.10) that /J (1 + y 2 ) - 1 dy == w/4. Expand the integrand in a geometric series and deduce Leibniz's formula
1 1 1 'IT - == 1 - - + - - - + · · · 5 7 4 3 16.7 (note that its corollary does not apply). 17.11. Take T( x ) == sin x and l(y) = (1 - y 2 ) - 1 12 in (17.7) and show that t == /Ji0 1(1 - y 2 ) - 1 12 dy for 0 < t s wj2. From Newton's binomial for mula and two applications of Theorem 16.6, conclude that by Theorem
{ 17 .17) 17.12. 17.7 f
Suppose that T maps [ a, b] into [ u , v] arid has continuous deriva tive T'; T need not be monotonic, but assume for notational convenience that Ta s Tb. Suppose that I is a Borel function on [ u, v] for which J: I/( T( x )) T' ( x ) l dx < oo and Jl: II(Y) I dy < oo . Then (17.7) holds. (a) Prove this for continuous I by replacing b by a variable upper limit and differentiating. (b) Prove it for bounded I by exhibiting I as the limit almost everywhere of uniformly bounded continuous functions. (c) Prove it in general.
17.13. A function I on [0, 1] has Kurzweil integral· I if for each £ there exists a positive function B( · ) over [0, 1] such that, if x0 , , xn and � 1 , , �, •
•
•
•
•
•
satisfy
{ 17 .18)
{
O == x0 < x1 < · · · < xn == 1 ' X; - 1 < �; < X; ' X; - X; - 1 < B ( �; ) ,
i == 1 , . . . , n ,
then
( 17.19)
n
E l( �; ) ( x; - X; _ 1 ) - I < £ .
k- 1
Riemann's definition is narrower because it requires the function B( · ) to be constant. Show in fact by the following steps that, if a Borel function over [0, 1] is Lebesgue integrable, then it has a Kwzweil integral, namely its Lebesgue integral. (a) The Kunweil integral being obviously additive, assume that I > 0. Choose k0 so that J1 � k I dx < £, and for k == 1, 2, . . . choose open sets Gk such that Gk :::> A k = r(k - 1)£ < I < k£] and A ( Gk - A k ) < 1jkk5 2k . Define B(() == dist (�, Gk ) for E e A k .
SECTION
17.
233
INTEGRAL WITH RESPECT TO LEBESGUE MEASURE
(b) Suppose that (17.18) holds. Note that �; e A k implies that [ X; _ 1 , c Gk and show that E7- 1 /( �; )( x; - X; _ 1 ) < I + 2£ by splitting the sum according to which A k it is that �; lies in. k (c) Letk E< > denote summation over the i for which �; e Ak , and show that E< > ( x; - X; _ 1 ) 1 - E1 ., k E < I) ( x; - X; _ 1 ) > A ( Ak ) - k0 2 • Con clude that E7- 1 l(�; )( x; - X; _ 1 ) > k - 3£. Thus - 2£ > I
x,]
f dx
=
I dx
11< 0 I dx
(17.18) implies (17.19) for I = I Idx (with 3£ in place of £). 17. 14. f (a) Suppose that I is continuous on (0, 1] and ( 17 .20)
lim j11( x ) dx = I.
��o ,
Choose 1J so that t s 1J implies that I I - 1/1 dxl < £; define 80( ) over (0, 1] in such a way that 0 < x , y s 1 and lx - Y l < B0 ( x ) imply that ll( x ) - I( Y ) I < £, and put ·
B( x ) =
min{ � x , Bo ( x )} min{ 1 + 1 �( 0) I } 11 '
if O < X < 1 , if X
= 0.
From (17.18) deduce that � 1 = 0 and then that (17.19) holds with 3£ in place of £. (b) Let F be the function in Example 17.4, and put I = F' on [0, 1]. Show that I has a Kunweil integral even though it is not Lebesgue integrable. 17.15. 12.16 f Identifying w. Deduce from (12.19) that ( 17 .21 )
A2 ( A)
= 'ITo fJ rdrdfJ , 'IT B
where B is the set of ( r, 8 ) with (rcos 8, r sin 8 ) e A , and then that ( 17.22)
fJ f( x , y ) dx dy = � JJ R
2
f( rcos fJ , r sin fJ ) r dr dfJ .
r>O
0 < 9 < 2 tr
When w0 = has been established in Problem 18.24, this will give (17.15). 17.16. 16.25 f Let !R consist of the continuous functions on R1 with compact support. Show that !R is a vector lattice in the sense of Problem 11.11 and has the property that I e !R implies I 1\ 11 e !R (note that 1 E !R). Show that the a-field !F generated by !R is 9P • Suppose A is a positive linear functional on !R; show that A has the required continuity property if and only if ln (x) �O uniformly in x implies A ( /n ) � 0. Show under this ,
234
INTEGRATION
assumption on
A
{ 17 .23 )
that there is a measure p. on 911 such that A
( /)
=
ffdp. ,
f e !R.
Show that p. is a-finite and unique. This is a version of the Riesz representa
tion theorem .
17. 17.
f Let A ( /) be the Riemann integral of f, which does exist for f in !£'. Using the most elementary facts about Riemann integration, show that the p. determined by (17 .23) is Lebesgue measure. This gives still another way of constructing Lebesgue measure.
17. 18.
f
Extend the ideas in the preceding two problems to R k . SECI'ION 18.
PRODUCf MEASURE AND FUBINI'S THEOREM
Let ( X, !I) and ( Y, C!!J ) be measurable spaces . For given measures p. and v on these spaces, the problem is to construct on the cartesian product X X Y a product measure 'IT such that 'IT(A X B) = p.(A)v( B) for A c X and B c Y. In the case where p. and " are Lebesgue measure on the line, 'IT will be Lebesgue measure in the plane. The main result is Fubini 's theorem, according to which double integrals can be calculated as iterated integrals. Product Spaces
It is notationally convenient in this section to change from (Sl, �) to ( X, fi) and ( Y, C!!J ). In the product space X X Y a measurable rectangle is a product A X B for which A e fi and B e C!!J . The natural class of sets in X X Y to consider is the a-field fi X C!!J generated by the measurable rectangles. (Of course, fi X C!!J is not a cartesian product in the usual sense.) Suppose that X = Y = R 1 and f£= C!!J = !R 1 . Then a measurable rectangle is a cartesian product A X B in which A and B are linear Borel sets. The term rectangle has up to this point been reserved for cartesian products of intervals, and so a measurable rectangle is more general. As the measurable rectangles do include the ordinary ones and the latter generate !A 2 , !A 2 c !A 1 X 91 1 • On the other hand, i f A is an interval, [B: A X B e !A 2 ] contains R 1 (A X R 1 = U n ( A X ( - n, n ])e !A 2 ) and is closed under the formation of proper differences and countable unions; thus it is a a-field containing the intervals and hence the Borel sets. Therefore, if B is a Borel set, [ A : A X B e 91 2 ] contains the intervals and hence, being a Example 18. 1.
SECTION
18.
PRODUCT MEASURE AND FUBINI ' S THEOREM
235
a-field, contains the Borel sets. Thus all the measurable rectangles are in 91 2 , and so fJ1t 1 x !1/ 1 = fJ1t 2 consists exactly of the two-dimensional Borel
.
��-
As this example shows, fi X C!!J is in general much larger than the class of measurable rectangles.
Theorem 18.1. (i) If E ∈ 𝒳 × 𝒴, then for each x the set [y: (x, y) ∈ E] lies in 𝒴, and for each y the set [x: (x, y) ∈ E] lies in 𝒳.
(ii) If f is measurable 𝒳 × 𝒴, then for each fixed x the function f(x, y) with y varying is measurable 𝒴, and for each fixed y the function f(x, y) with x varying is measurable 𝒳.

The set [y: (x, y) ∈ E] is the section of E determined by x, and f(x, y) as a function of y is the section of f determined by x.
PROOF. Fix x and consider the mapping T_x: Y → X × Y defined by T_x y = (x, y). If E = A × B is a measurable rectangle, T_x⁻¹E is B or ∅ according as A contains x or not, and in either case T_x⁻¹E ∈ 𝒴. By Theorem 13.1(i), T_x is measurable 𝒴/𝒳 × 𝒴. Hence [y: (x, y) ∈ E] = T_x⁻¹E ∈ 𝒴 for E ∈ 𝒳 × 𝒴. By Theorem 13.1(ii), if f is measurable 𝒳 × 𝒴/ℛ¹, then fT_x is measurable 𝒴/ℛ¹. Hence f(x, y) = fT_x(y) as a function of y is measurable 𝒴. The symmetric statements for fixed y are proved the same way. ∎
Product Measure
Now suppose that (X, 𝒳, μ) and (Y, 𝒴, ν) are measure spaces, and suppose for the moment that μ and ν are finite. By the theorem just proved, ν[y: (x, y) ∈ E] is a well-defined function of x. If ℒ is the class of E in 𝒳 × 𝒴 for which this function is measurable 𝒳, it is not hard to show that ℒ is a λ-system. Since the function is I_A(x)ν(B) for E = A × B, ℒ contains the π-system consisting of the measurable rectangles. Hence ℒ coincides with 𝒳 × 𝒴 by the π-λ theorem. It follows without difficulty that

(18.1)  π′(E) = ∫_X ν[y: (x, y) ∈ E] μ(dx),  E ∈ 𝒳 × 𝒴,

is a finite measure on 𝒳 × 𝒴, and similarly for

(18.2)  π″(E) = ∫_Y μ[x: (x, y) ∈ E] ν(dy),  E ∈ 𝒳 × 𝒴.
INTEGRATION
For measurable rectangles,

(18.3)  π′(A × B) = π″(A × B) = μ(A)·ν(B).

The class of E in 𝒳 × 𝒴 for which π′(E) = π″(E) thus contains the measurable rectangles; since this class is a λ-system, it contains 𝒳 × 𝒴. The common value π′(E) = π″(E) is the product measure sought.

To show that (18.1) and (18.2) also agree for σ-finite μ and ν, let {A_m} and {B_n} be decompositions of X and Y into sets of finite measure, and put μ_m(A) = μ(A ∩ A_m) and ν_n(B) = ν(B ∩ B_n). Since ν(B) = Σ_n ν_n(B), the integrand in (18.1) is measurable 𝒳 in the σ-finite as well as in the finite case; hence π′ is a well-defined measure on 𝒳 × 𝒴, and so is π″. If π′_{mn} and π″_{mn} are (18.1) and (18.2) for μ_m and ν_n, then by the finite case, already treated, π′(E) = Σ_{m,n} π′_{mn}(E) = Σ_{m,n} π″_{mn}(E) = π″(E). Thus (18.1) and (18.2) coincide in the σ-finite case as well. Moreover, π′(A × B) = Σ_{m,n} μ_m(A)ν_n(B) = μ(A)ν(B).
Theorem 18.2. If (X, 𝒳, μ) and (Y, 𝒴, ν) are σ-finite measure spaces, π(E) = π′(E) = π″(E) defines a σ-finite measure on 𝒳 × 𝒴; it is the only measure such that π(A × B) = μ(A)·ν(B) for measurable rectangles.
PROOF. Only σ-finiteness and uniqueness remain to be proved. The products A_m × B_n for {A_m} and {B_n} as above decompose X × Y into measurable rectangles of finite π-measure. This proves both σ-finiteness and uniqueness, since the measurable rectangles form a π-system generating 𝒳 × 𝒴 (Theorem 10.3). ∎
The π thus defined is called product measure; it is usually denoted μ × ν. Note that the integrands in (18.1) and (18.2) may be infinite for certain x and y, which is one reason for introducing functions with infinite values. Note also that (18.3) in some cases requires the conventions (15.2).
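On finite spaces the integrals in (18.1) and (18.2) become finite sums, and the construction can be checked directly. A minimal numerical sketch (the weights and the set E below are arbitrary choices for illustration):

```python
from fractions import Fraction as F

# Two finite measure spaces: X = {0, 1, 2} with measure mu, Y = {0, 1} with nu.
mu = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}
nu = {0: F(3, 4), 1: F(1, 4)}

# An arbitrary set E in the product space X x Y.
E = {(0, 0), (1, 0), (1, 1), (2, 1)}

# (18.1): integrate the nu-measure of the x-sections with respect to mu.
pi1 = sum(mu[x] * sum(nu[y] for y in nu if (x, y) in E) for x in mu)
# (18.2): integrate the mu-measure of the y-sections with respect to nu.
pi2 = sum(nu[y] * sum(mu[x] for x in mu if (x, y) in E) for y in nu)
assert pi1 == pi2 == F(3, 4)        # the two iterated formulas agree

# On a measurable rectangle A x B the common value factors as mu(A)*nu(B).
A, B = {0, 1}, {1}
rect = {(x, y) for x in A for y in B}
piA = sum(mu[x] * sum(nu[y] for y in nu if (x, y) in rect) for x in mu)
assert piA == sum(mu[x] for x in A) * sum(nu[y] for y in B)
```

Since the measurable rectangles determine the measure (Theorem 18.2), agreement on rectangles is exactly what forces the two iterated formulas to coincide everywhere.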
Fubini's Theorem
Integrals with respect to π are usually computed via the formulas

(18.4)  ∫_{X×Y} f(x, y) π(d(x, y)) = ∫_X [ ∫_Y f(x, y) ν(dy) ] μ(dx)

and

(18.5)  ∫_{X×Y} f(x, y) π(d(x, y)) = ∫_Y [ ∫_X f(x, y) μ(dx) ] ν(dy).
The left side here is a double integral, and the right sides are iterated integrals. The formulas hold very generally, as the following argument shows.

Consider (18.4). The inner integral on the right is

(18.6)  ∫_Y f(x, y) ν(dy).
Because of Theorem 18.1(ii), for f measurable 𝒳 × 𝒴 the integrand here is measurable 𝒴; the question is whether the integral exists, whether (18.6) is measurable 𝒳 as a function of x, and whether it integrates to the left side of (18.4).

First consider nonnegative f. If f = I_E, everything follows from Theorem 18.2: (18.6) is ν[y: (x, y) ∈ E], and (18.4) reduces to π(E) = π′(E). Because of linearity (Theorem 15.1(iv)), if f is a nonnegative simple function, then (18.6) is a linear combination of functions measurable 𝒳 and hence is itself measurable 𝒳; further application of linearity to the two sides of (18.4) shows that (18.4) again holds. The general nonnegative f is the monotone limit of nonnegative simple functions; applying the monotone convergence theorem to (18.6) and then to each side of (18.4) shows that again f has the properties required. Thus for nonnegative f, (18.6) is a well-defined function of x (the value ∞ is not excluded), measurable 𝒳, whose integral satisfies (18.4). If one side of (18.4) is infinite, so is the other; if both are finite, they have the same finite value.

Now suppose that f, not necessarily nonnegative, is integrable with respect to π. Then the two sides of (18.4) are finite if f is replaced by |f|. Now make the further assumption that
(18.7)  ∫_Y |f(x, y)| ν(dy) < ∞

for all x. Then

(18.8)  ∫_Y f(x, y) ν(dy) = ∫_Y f⁺(x, y) ν(dy) − ∫_Y f⁻(x, y) ν(dy).

The functions on the right here are measurable 𝒳 and (since f⁺, f⁻ ≤ |f|) integrable with respect to μ, and so the same is true of the function on the left. Integrating out the x and applying (18.4) to f⁺ and to f⁻ gives (18.4) for f itself.
The set A₀ of x satisfying (18.7) need not coincide with X, but μ(X − A₀) = 0 if f is integrable with respect to π, because the function in (18.7) integrates to ∫|f| dπ (Theorem 15.2(iii)). Now (18.8) holds on A₀, (18.6) is measurable 𝒳 on A₀, and (18.4) again follows if the inner integral on the right is given some arbitrary constant value on X − A₀. The same analysis applies to (18.5):

Theorem 18.3. Under the hypotheses of Theorem 18.2, for nonnegative f the functions

(18.9)  ∫_Y f(x, y) ν(dy),  ∫_X f(x, y) μ(dx)

are measurable 𝒳 and 𝒴, respectively, and (18.4) and (18.5) hold. If f (not necessarily nonnegative) is integrable with respect to π, then the two functions (18.9) are finite and measurable on A₀ and on B₀, respectively, where μ(X − A₀) = ν(Y − B₀) = 0, and again (18.4) and (18.5) hold.

It is understood here that the inner integrals on the right in (18.4) and (18.5) are set equal to 0 (say) outside A₀ and B₀.†

This is Fubini's theorem; the part concerning nonnegative f is sometimes called Tonelli's theorem. Application of the theorem usually follows a two-step procedure that parallels its proof. First, one of the iterated integrals is computed (or estimated above) with |f| in place of f. If the result is finite, then the double integral (integral with respect to π) of |f| must be finite, so that f is integrable with respect to π; then the value of the double integral of f is found by computing one of the iterated integrals of f. If the iterated integral of |f| is infinite, f is not integrable π.

Example 18.2. Let I = ∫_{−∞}^∞ e^{−x²} dx. By Fubini's theorem applied in the plane and by the polar-coordinate formula (see (17.15)),
I² = ∬_{R²} e^{−(x²+y²)} dx dy = ∬_{r>0, 0≤θ<2π} e^{−r²} r dr dθ.
The double integral on the right can be evaluated as an iterated integral by

† Since two functions that are equal almost everywhere have the same integral, the theory of integration could be extended to functions that are only defined almost everywhere; then A₀ and B₀ would disappear from Theorem 18.3.
another application of Fubini's theorem, which leads to the famous formula

(18.10)  ∫_{−∞}^∞ e^{−x²} dx = √π.

As the integrand in this example is nonnegative, the question of integrability does not arise. ∎
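The two-step procedure and (18.10) are easy to check numerically. The sketch below approximates the iterated integral of exp(−(x² + y²)) by a composite midpoint rule on a truncated square (the truncation point and mesh are arbitrary illustrative choices; the integrand is nonnegative, so Tonelli applies):

```python
import math

def midpoint(f, a, b, n):                 # composite midpoint rule
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

L, n = 8.0, 400                           # truncation and mesh (illustrative)
inner = lambda x: midpoint(lambda y: math.exp(-(x * x + y * y)), -L, L, n)
double = midpoint(inner, -L, L, n)        # iterated integral as in (18.4)
single = midpoint(lambda x: math.exp(-x * x), -L, L, n)

assert abs(double - math.pi) < 1e-8              # I^2 = pi
assert abs(single - math.sqrt(math.pi)) < 1e-10  # (18.10)
```

The midpoint rule is spectrally accurate here because the Gaussian and all its derivatives are negligible at the truncation points.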
Example 18.3. It is possible by means of Fubini's theorem to identify the limit in (17.4). First,

∫₀ᵗ e^{−ux} sin x dx = 1/(1 + u²) [1 − e^{−ut}(u sin t + cos t)],

as follows by differentiation with respect to t. Since ∫₀ᵗ ∫₀^∞ e^{−ux}|sin x| du dx = ∫₀ᵗ (|sin x|/x) dx ≤ t < ∞, Fubini's theorem applies to the integration of e^{−ux} sin x over (0, t) × (0, ∞):

∫₀ᵗ (sin x)/x dx = ∫₀ᵗ sin x [ ∫₀^∞ e^{−ux} du ] dx = ∫₀^∞ du/(1 + u²) − ∫₀^∞ e^{−ut}(u sin t + cos t)/(1 + u²) du.

The next-to-last integral is π/2 (see (17.10)), and a change of variable ut = s shows that the final integral goes to 0 as t → ∞. Therefore,

(18.11)  lim_{t→∞} ∫₀ᵗ (sin x)/x dx = π/2. ∎

Integration by Parts
Let F and G be two nondecreasing, right-continuous functions on an interval [a, b], and let μ and ν be the corresponding measures:

μ(x, y] = F(y) − F(x),  ν(x, y] = G(y) − G(x),  a ≤ x ≤ y ≤ b.
In accordance with the convention (17.16), write dF(x) and dG(x) in place of μ(dx) and ν(dx).
Theorem 18.4. If F and G have no common points of discontinuity in (a, b], then

(18.12)  ∫_{(a,b]} G(x) dF(x) = F(b)G(b) − F(a)G(a) − ∫_{(a,b]} F(x) dG(x).

In brief: ∫ G dF = Δ(FG) − ∫ F dG. This is one version of the partial integration formula.
PROOF. Note first that replacing F(x) by F(x) − C leaves (18.12) unchanged: it merely adds and subtracts Cν(a, b] on the right. Hence (take C = F(a)) it is no restriction to assume that F(x) = μ(a, x], and no restriction to assume that G(x) = ν(a, x]. If π = μ × ν is product measure in the plane, then by Fubini's theorem,

(18.13)  π[(x, y): a < y ≤ x ≤ b] = ∫_{(a,b]} ν(a, x] μ(dx) = ∫_{(a,b]} G(x) dF(x)
and

(18.14)  π[(x, y): a < x ≤ y ≤ b] = ∫_{(a,b]} μ(a, y] ν(dy) = ∫_{(a,b]} F(y) dG(y).

The two sets on the left have as their union the square S = (a, b] × (a, b]. The diagonal of S has π-measure

π[(x, y): a < x = y ≤ b] = ∫_{(a,b]} ν{x} μ(dx) = 0

because of the assumption that μ and ν share no points of positive measure. Thus the left sides of (18.13) and (18.14) add to π(S) = μ(a, b]ν(a, b] = F(b)G(b). ∎
Suppose that ν has a density g with respect to Lebesgue measure: G(x) = c + ∫_a^x g(t) dt. Transform the right side of (18.12) by the formula (16.13) for integration with respect to a density; the result is

(18.15)  ∫_{(a,b]} G(x) dF(x) = F(b)G(b) − F(a)G(a) − ∫_a^b F(x) g(x) dx.

A consideration of positive and negative parts shows that this holds for any g integrable over (a, b]. Suppose further that μ has a density f with respect to Lebesgue measure: F(x) = c′ + ∫_a^x f(t) dt. Then (18.15) further reduces to

(18.16)  ∫_a^b G(x) f(x) dx = F(b)G(b) − F(a)G(a) − ∫_a^b F(x) g(x) dx.

Again, f can be any integrable function. This is the classical formula for integration by parts. Under the appropriate integrability conditions, (a, b] can be replaced by an unbounded interval.
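A numerical spot check of the classical formula (18.16); the particular functions here (F(x) = sin x with density f = cos x, and G(x) = x² with density g(x) = 2x) are illustrative choices:

```python
import math

def simpson(func, a, b, n=2000):          # composite Simpson rule, n even
    h = (b - a) / n
    s = func(a) + func(b)
    s += 4 * sum(func(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
    s += 2 * sum(func(a + 2 * i * h) for i in range(1, n // 2))
    return s * h / 3

a, b = 0.0, 2.0
F, f = math.sin, math.cos                 # F(x) = integral of f from a to x
G = lambda x: x * x                       # G has density g(x) = 2x
g = lambda x: 2 * x

lhs = simpson(lambda x: G(x) * f(x), a, b)
rhs = F(b) * G(b) - F(a) * G(a) - simpson(lambda x: F(x) * g(x), a, b)
assert abs(lhs - rhs) < 1e-10             # the two sides of (18.16) agree
```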
Products of Higher Order
Suppose that (X, 𝒳, μ), (Y, 𝒴, ν), and (Z, 𝒵, η) are three σ-finite measure spaces. In the usual way, identify the products X × Y × Z and (X × Y) × Z. Let 𝒳 × 𝒴 × 𝒵 be the σ-field in X × Y × Z generated by the sets A × B × C with A, B, C in 𝒳, 𝒴, 𝒵, respectively. For C ∈ 𝒵, let 𝒢_C be the class of E ∈ 𝒳 × 𝒴 for which E × C ∈ 𝒳 × 𝒴 × 𝒵. Then 𝒢_C is a σ-field containing the measurable rectangles in X × Y, and so 𝒢_C = 𝒳 × 𝒴. Therefore, (𝒳 × 𝒴) × 𝒵 ⊂ 𝒳 × 𝒴 × 𝒵. But the reverse relation is obvious, and so (𝒳 × 𝒴) × 𝒵 = 𝒳 × 𝒴 × 𝒵.

Define the product μ × ν × η on 𝒳 × 𝒴 × 𝒵 as (μ × ν) × η. It gives to A × B × C the value (μ × ν)(A × B)·η(C) = μ(A)ν(B)η(C), and it is the only measure that does. The formulas (18.4) and (18.5) extend in the obvious way.

Products of four or more components can clearly be treated in the same way. This leads in particular to another construction of Lebesgue measure in R^k = R¹ × ⋯ × R¹ (see Example 18.1) as the product λ × ⋯ × λ (k
factors) on ℛ^k = ℛ¹ × ⋯ × ℛ¹. Fubini's theorem of course gives a practical way to calculate volumes:

Example 18.4. Let V_k be the volume of the sphere of radius 1 in R^k; by Theorem 12.2, a sphere in R^k with radius r has volume r^k V_k. Let A be the unit sphere in R^k, let B = [(x₁, x₂): x₁² + x₂² ≤ 1], and let C(x₁, x₂) = [(x₃, …, x_k): Σ_{i=3}^k x_i² ≤ 1 − x₁² − x₂²]. By Fubini's theorem,

V_k = V_{k−2} ∬_{0<ρ≤1, 0≤θ<2π} (1 − ρ²)^{(k−2)/2} ρ dρ dθ = (2π/k) V_{k−2}.

If V₀ is taken as 1, this holds for k = 2 as well as for k ≥ 3. Since V₁ = 2, it follows by induction that

V_{2i−1} = 2(2π)^{i−1}/(1·3·5 ⋯ (2i−1)),  V_{2i} = (2π)^i/(2·4 ⋯ (2i)),

for i = 1, 2, …. ∎
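The recursion V_k = (2π/k)V_{k−2} from Example 18.4 can be checked against the closed forms above and against the familiar gamma-function expression V_k = π^{k/2}/Γ(k/2 + 1) (a known equivalent form, not derived here):

```python
import math

V = {0: 1.0, 1: 2.0}                      # V_0 = 1 by convention, V_1 = 2
for k in range(2, 11):
    V[k] = 2 * math.pi / k * V[k - 2]     # the recursion from Example 18.4

for i in range(1, 5):
    odd = 2 * (2 * math.pi) ** (i - 1) / math.prod(range(1, 2 * i, 2))
    even = (2 * math.pi) ** i / math.prod(range(2, 2 * i + 1, 2))
    assert abs(V[2 * i - 1] - odd) < 1e-12
    assert abs(V[2 * i] - even) < 1e-12

for k in range(1, 11):                    # gamma-function form agrees as well
    assert abs(V[k] - math.pi ** (k / 2) / math.gamma(k / 2 + 1)) < 1e-12
```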
PROBLEMS
18.1. Show by Theorem 18.1 that if A × B is nonempty and lies in 𝒳 × 𝒴, then A ∈ 𝒳 and B ∈ 𝒴.

18.2. Suppose that X = Y is uncountable and 𝒳 = 𝒴 consists of the countable and the cocountable sets. Show that the diagonal E = [(x, y): x = y] does not lie in 𝒳 × 𝒴, even though [y: (x, y) ∈ E] ∈ 𝒴 and [x: (x, y) ∈ E] ∈ 𝒳 for all x and y.

18.3. 10.6 18.1 ↑ Let (X, 𝒳, μ) = (Y, 𝒴, ν) be the completion of (R¹, ℛ¹, λ). Show that (X × Y, 𝒳 × 𝒴, μ × ν) is not complete.

18.4. 2.8 ↑ The assumption of σ-finiteness in Theorem 18.2 is essential: Let μ be Lebesgue measure on the line, let ν be counting measure on the line, and take E = [(x, y): x = y]. Then (18.1) and (18.2) do not agree.
18.5. Suppose Y is the real line and ν is Lebesgue measure. Suppose that, for each x in X, f(x, y) has with respect to y a derivative f′(x, y). Formally,

∫_a^y [ ∫_X f′(x, t) μ(dx) ] dt = ∫_X [ f(x, y) − f(x, a) ] μ(dx),

and hence ∫_X f(x, y) μ(dx) has derivative ∫_X f′(x, y) μ(dx). Use this idea to prove Theorem 16.8(ii).
18.6. Suppose that f is nonnegative on a σ-finite measure space (Ω, ℱ, μ). Show that

∫_Ω f dμ = (μ × λ)[(ω, y): 0 ≤ y < f(ω)].

Prove that the set on the right is measurable. This gives the "area under the curve." Given the existence of μ × λ on Ω × R¹, one can use the right side of this equation as an alternative definition of the integral.

18.7. Reconsider Problem 12.13.
18.8. Suppose that ν[y: (x, y) ∈ E] = ν[y: (x, y) ∈ F] for all x, and show that (μ × ν)(E) = (μ × ν)(F). This is a general version of Cavalieri's principle.
18.9. (a) Suppose that μ is σ-finite and prove the corollary to Theorem 16.7 by Fubini's theorem in the product of (Ω, ℱ, μ) and {1, 2, …} with counting measure.
(b) Relate the series in Problem 17.9 to Fubini's theorem.

18.10. (a) Let μ = ν be counting measure on X = Y = {1, 2, …}. If

f(x, y) = 2 − 2^{−x} if x = y,  = −2 + 2^{−x} if x = y + 1,  = 0 otherwise,

then the iterated integrals exist but are unequal. Why does this not contradict Fubini's theorem?
(b) Show that xy/(x² + y²)² is not integrable over the square [(x, y): |x|, |y| ≤ 1] even though the iterated integrals exist and are equal.
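For Problem 18.10(a) the iterated integrals are iterated sums, and each row and column of the array has at most two nonzero entries, so the inner sums can be evaluated exactly; a sketch:

```python
from fractions import Fraction as F

def f(x, y):
    if x == y:
        return 2 - F(1, 2 ** x)
    if x == y + 1:
        return -2 + F(1, 2 ** x)
    return F(0)

def row(x):                  # integral over y for fixed x (two nonzero terms)
    return f(x, x) + (f(x, x - 1) if x > 1 else 0)

def col(y):                  # integral over x for fixed y
    return f(y, y) + f(y + 1, y)

N = 50
sum_rows = sum(row(x) for x in range(1, N + 1))   # = 3/2 exactly
sum_cols = sum(col(y) for y in range(1, N + 1))   # -> -1/2 as N grows
assert sum_rows == F(3, 2)
assert abs(sum_cols + F(1, 2)) < F(1, 2 ** N)
```

There is no contradiction with Theorem 18.3: the iterated sum of |f| diverges, so f is not integrable with respect to the product (counting) measure.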
18.11. For an integrable function f on (0, 1], consider a Riemann sum in the form R = Σ_{j=1}^n f(x_j)(x_j − x_{j−1}), where 0 = x₀ < ⋯ < x_n = 1. Extend f to (0, 2] by setting f(x) = f(x − 1) for 1 < x ≤ 2, and define

R(t) = Σ_{j=1}^n f(x_j + t)(x_j − x_{j−1}).

For 0 ≤ t ≤ 1, the points x_j + t reduced modulo 1 give a partition of (0, 1], and R(t) is essentially the corresponding Riemann sum, differing from it by at most three terms. If max_j (x_j − x_{j−1}) is small, then R(t) is for most values of t a good approximation to ∫₀¹ f(x) dx, even though R = R(0) may be a poor one. Show in fact that

(18.17)  ∫₀¹ | R(t) − ∫₀¹ f(x) dx | dt = ∫₀¹ | Σ_{j=1}^n ∫_{x_{j−1}}^{x_j} [f(x_j + t) − f(x + t)] dx | dt ≤ Σ_{j=1}^n ∫_{x_{j−1}}^{x_j} ∫₀¹ |f(x_j + t) − f(x + t)| dt dx.

Given ε, choose a continuous g such that ∫₀¹ |f(x) − g(x)| dx < ε²/3 and then choose δ so that |x − y| ≤ δ implies |g(x) − g(y)| ≤ ε²/3. Show that max_j (x_j − x_{j−1}) ≤ δ implies that the first integral in (18.17) is at most ε², and hence that |R(t) − ∫₀¹ f(x) dx| < ε on a set of t (in (0, 1]) of Lebesgue measure exceeding 1 − ε.
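The phenomenon in Problem 18.11 shows up clearly in a simulation. Here f(x) = 1/√x (integrable on (0, 1] with integral 2, singular at 0) is shifted by random t; the partition size and sample count are arbitrary illustrative choices:

```python
import math, random

def f(x):                     # 1/sqrt(x) on (0, 1], extended by f(x) = f(x-1)
    if x > 1:
        x -= 1
    return 1 / math.sqrt(x)

n = 10000
xs = [j / n for j in range(n + 1)]        # uniform partition of (0, 1]

def R(t):
    return sum(f(xs[j] + t) * (xs[j] - xs[j - 1]) for j in range(1, n + 1))

random.seed(0)
ts = [random.random() for _ in range(200)]
errors = [abs(R(t) - 2.0) for t in ts]    # the true integral is 2
# For the large majority of shifts t the Riemann sum is close to the integral:
assert sum(1 for e in errors if e < 0.1) > 0.9 * len(ts)
```

The occasional large error occurs when a sample point x_j + t falls just past the singularity, exactly the behavior that (18.17) controls on average.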
18.12. Exhibit a case in which (18.12) fails because F and G have a common point of discontinuity.
18.13. Prove (18.16) for the case in which all the functions are continuous by differentiating with respect to the upper limit of integration.
18.14. Prove for distribution functions F that ∫_{−∞}^∞ (F(x + c) − F(x)) dx = c.
18.15. Prove for continuous distribution functions that ∫_{−∞}^∞ F(x) dF(x) = ½.

18.16. Suppose that a number f_n is defined for each n ≥ n₀, and put F(x) = Σ_{n₀≤n≤x} f_n. Deduce from (18.15) that

(18.18)  Σ_{n₀≤n≤x} G(n) f_n = F(x)G(x) − ∫_{n₀}^x F(t) g(t) dt

if G(y) − G(x) = ∫_x^y g(t) dt, which will hold if G has continuous derivative g. First assume that the f_n are nonnegative.

18.17. ↑ Take n₀ = 1, f_n = 1, and G(x) = 1/x, and derive Σ_{n≤x} n⁻¹ = log x + γ + O(1/x), where γ = 1 − ∫₁^∞ (t − [t]) t⁻² dt is Euler's constant.
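The expansion in Problem 18.17 is easy to test: H_N − log N should differ from Euler's constant γ ≈ 0.57722 by roughly 1/(2N). A sketch:

```python
import math

N = 10 ** 6
H = sum(1.0 / n for n in range(1, N + 1))          # harmonic partial sum
gamma = H - math.log(N)                            # = gamma + O(1/N)
assert abs(gamma - 0.5772156649015329) < 1e-5

# The integral defining gamma, evaluated piecewise exactly over [n, n+1):
integral = sum(math.log(1 + 1 / n) - 1 / (n + 1) for n in range(1, N))
assert abs((1 - integral) - gamma) < 1e-6          # gamma = 1 - integral
```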
18.18. 5.17 18.16 ↑ Use (18.18) and (5.47) to prove that there exists a constant c such that

(18.19)  Σ_{p≤x} 1/p = log log x + c + O(1/log x).

18.19. If A_k = Σ_{i=1}^k a_i and B_k = Σ_{i=1}^k b_i, then

Σ_{k=1}^n a_k B_k = A_n B_n − Σ_{k=2}^n A_{k−1} b_k.

This is Abel's partial summation formula. Derive it by arguing as in the proof of Theorem 18.4.
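Abel's formula of Problem 18.19 can be verified mechanically on random rational data; the identity behind it is A_kB_k − A_{k−1}B_{k−1} = a_kB_k + A_{k−1}b_k, summed over k:

```python
import random
from fractions import Fraction as F

random.seed(1)
n = 20
a = [F(random.randint(-5, 5)) for _ in range(n)]
b = [F(random.randint(-5, 5)) for _ in range(n)]

def partial(seq, k):                      # A_k = seq_1 + ... + seq_k (1-based)
    return sum(seq[:k], F(0))

lhs = sum(a[k - 1] * partial(b, k) for k in range(1, n + 1))
rhs = partial(a, n) * partial(b, n) - sum(
    partial(a, k - 1) * b[k - 1] for k in range(2, n + 1))
assert lhs == rhs                         # Abel's partial summation
```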
18.20. (a) Let I_n = ∫₀^{π/2} (sin x)ⁿ dx. Show by partial integration that I_n = (n − 1)(I_{n−2} − I_n) and hence that I_n = (n − 1)n⁻¹ I_{n−2}.

(b) Show by induction that … From (17.17) deduce that

π²/8 = 1 + 1/3² + 1/5² + 1/7² + ⋯

and hence that

π²/6 = 1 + 1/2² + 1/3² + 1/4² + ⋯.

(c) Show by induction that

I_{2n} = ((2n−1)/(2n))·((2n−3)/(2n−2)) ⋯ (3/4)·(1/2)·(π/2),  I_{2n+1} = ((2n)/(2n+1))·((2n−2)/(2n−1)) ⋯ (4/5)·(2/3).

From I_{2n−1} ≥ I_{2n} ≥ I_{2n+1} deduce that

(2²·4² ⋯ (2n)²)/(3²·5² ⋯ (2n−1)²·(2n+1)) ≤ π/2 ≤ (2²·4² ⋯ (2n)²)/(3²·5² ⋯ (2n−1)²·2n),

and from this derive Wallis's formula,

(18.20)  π/2 = (2/1)·(2/3)·(4/3)·(4/5)·(6/5)·(6/7) ⋯.
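Both limits in Problem 18.20 converge slowly but visibly; a numerical sketch (the truncation points are arbitrary):

```python
import math

# Wallis's product (18.20): partial products converge to pi/2 like O(1/n).
prod = 1.0
for n in range(1, 100001):
    prod *= (2 * n) * (2 * n) / ((2 * n - 1) * (2 * n + 1))
assert abs(prod - math.pi / 2) < 1e-4

# The series of part (b): pi^2/8 = 1 + 1/3^2 + 1/5^2 + ...
s = sum(1.0 / (2 * k + 1) ** 2 for k in range(200000))
assert abs(s - math.pi ** 2 / 8) < 1e-5
```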
18.21. Stirling's formula. The area under the curve y = log x between the abscissas 1 and n is A_n = n log n − n + 1. The area under the corresponding inscribed polygon is B_n = log n! − ½ log n.

[Figure: the curve y = log x with the inscribed polygon through the points with abscissas 1, 2, 3, 4, ….]
(a) Let S_k be the sliver-shaped region bounded by curve and polygon between the abscissas k and k + 1, and let T_k be this region translated so that its right-hand vertex is at (2, log 2). Show that the T_k do not overlap and that T_n ∪ T_{n+1} ∪ ⋯ is contained in a triangle of area ½ log(1 + 1/n). Conclude that

(18.21)  log n! = (n + ½) log n − n + c + a_n,

where c is 1 minus the area of T₁ ∪ T₂ ∪ ⋯, and

0 < a_n < ½ log(1 + 1/n) < 1/n.
(b) By Wallis's formula (18.20) show that

log √(π/2) = lim_n [ 2n log 2 + 2 log n! − log (2n)! − ½ log(2n + 1) ].

Substitute (18.21) and show that c = log √(2π). This gives Stirling's formula

n! ∼ √(2π) n^{n+1/2} e^{−n}.
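With c = log √(2π), the approximation (18.21) gives Stirling's formula with an error a_n between 0 and 1/n; a quick check using log Γ(n + 1) = log n!:

```python
import math

for n in (5, 10, 100, 1000):
    log_fact = math.lgamma(n + 1)                  # log n!
    approx = (n + 0.5) * math.log(n) - n + 0.5 * math.log(2 * math.pi)
    a_n = log_fact - approx
    assert 0 < a_n < 1.0 / n                       # as asserted in (18.21)
```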
18.22. Euler's gamma function is defined for positive t by Γ(t) = ∫₀^∞ x^{t−1} e^{−x} dx.
(a) Prove that Γ^{(k)}(t) = ∫₀^∞ x^{t−1} (log x)^k e^{−x} dx.
(b) Show by partial integration that Γ(t + 1) = tΓ(t) and hence that Γ(n + 1) = n! for integral n.
(c) From (18.10) deduce Γ(½) = √π.
(d) Show that the unit sphere in R^k has volume (see Example 18.4)
(18.22)  V_k = π^{k/2} / Γ(k/2 + 1).

18.23. By partial integration prove that ∫₀^∞ ((sin x)/x)² dx = π/2 and ∫_{−∞}^∞ (1 − cos x) x⁻² dx = π.

18.24. 17.15 ↑ If A_θ is the set of (r cos φ, r sin φ) satisfying 0 < r ≤ 1 and 0 ≤ φ < θ, then by (17.21),

(18.23)  λ₂(A_θ) = (π₀/π)·(θ/2).

But by Fubini's theorem, if 0 < θ < π/2, then

(18.24)  λ₂(A_θ) = ½ sin θ cos θ + ∫_{cos θ}^1 (1 − x²)^{1/2} dx.

Equate the right sides of (18.23) and (18.24), differentiate with respect to θ, and conclude that π₀ = π. Thus the factor π₀/π can be dropped in (17.22), which gives the formula (17.15) for integrating in polar coordinates.

18.25. Suppose that μ is a probability measure on (X, 𝒳) and that, for each x in X, ν_x is a probability measure on (Y, 𝒴). Suppose further that, for each B in 𝒴, ν_x(B) is, as a function of x, measurable 𝒳. Regard the μ(A) as initial probabilities and the ν_x(B) as transition probabilities.
(a) Show that, if E ∈ 𝒳 × 𝒴, then ν_x[y: (x, y) ∈ E] is measurable 𝒳.
(b) Show that π(E) = ∫_X ν_x[y: (x, y) ∈ E] μ(dx) defines a probability measure on 𝒳 × 𝒴. If ν_x = ν does not depend on x, this is just (18.1).
(c) Show that if f is measurable 𝒳 × 𝒴 and nonnegative, then ∫_Y f(x, y) ν_x(dy) is measurable 𝒳. Show further that

∫_{X×Y} f(x, y) π(d(x, y)) = ∫_X [ ∫_Y f(x, y) ν_x(dy) ] μ(dx),

which extends Fubini's theorem (in the probability case). Consider also f's that may be negative.
(d) Let ν(B) = ∫_X ν_x(B) μ(dx). Show that π(X × B) = ν(B) and

∫_Y g(y) ν(dy) = ∫_X [ ∫_Y g(y) ν_x(dy) ] μ(dx)

for nonnegative g.
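Some of the gamma-function facts in Problem 18.22 checked numerically (the quadrature parameters are crude, illustrative choices):

```python
import math

def gamma_quad(t, n=200000, upper=50.0):   # midpoint rule for Gamma(t)
    h = upper / n
    return sum(((i + 0.5) * h) ** (t - 1) * math.exp(-(i + 0.5) * h)
               for i in range(n)) * h

assert abs(gamma_quad(5) - math.factorial(4)) < 1e-3     # Gamma(n+1) = n!
assert abs(gamma_quad(0.5) - math.sqrt(math.pi)) < 2e-2  # Gamma(1/2) = sqrt(pi)

# Unit-ball volumes via V_k = pi**(k/2) / Gamma(k/2 + 1), cf. Example 18.4:
assert abs(math.pi / math.gamma(2) - math.pi) < 1e-12                   # k = 2
assert abs(math.pi ** 1.5 / math.gamma(2.5) - 4 * math.pi / 3) < 1e-12  # k = 3
```

The Γ(½) estimate is only loosely accurate because the midpoint rule converges slowly near the integrable singularity at 0.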
SECTION 19. HAUSDORFF MEASURE*

The theory of Lebesgue measure gives the area of a figure in the plane and the volume of a solid in 3-space, but how can one define the area of a curved surface in 3-space? This section is devoted to such geometric questions.

The Definition

Suppose that A is a set in R^k. For positive m and ε put

(19.1)  h_{m,ε}(A) = inf Σ_n c_m (diam B_n)^m,

where the infimum extends over countable coverings of A by sets B_n with diameters diam B_n = sup[|x − y|: x, y ∈ B_n] less than ε. Here c_m is a positive constant, to be assigned later. As ε decreases, the infimum extends over smaller classes, and so h_{m,ε}(A) does not decrease. Thus h_{m,ε}(A) has a

* This section should be omitted on a first reading.
limit h_m(A), finite or infinite:

(19.2)  h_m(A) = lim_{ε↓0} h_{m,ε}(A).

This limit h_m(A) is the m-dimensional outer Hausdorff measure of A. Although h_m on R^k is technically a different set function for each k, no confusion results from suppressing k in the notation.

Since h_{m,ε} is an outer measure (Example 11.1), h_{m,ε}(∪_n A_n) ≤ Σ_n h_{m,ε}(A_n) ≤ Σ_n h_m(A_n), and hence h_m(∪_n A_n) ≤ Σ_n h_m(A_n). Since it is obviously monotone, h_m is thus an outer measure. If A and B are at positive distance, choose ε so that ε < dist(A, B). If A ∪ B ⊂ ∪_n C_n and diam C_n < ε for all n, then no C_n can meet both A and B. Splitting the sum c_m Σ_n (diam C_n)^m according as C_n meets A or not shows that it is at least h_{m,ε}(A) + h_{m,ε}(B). Therefore, h_m(A ∪ B) ≥ h_m(A) + h_m(B), and h_m satisfies the Carathéodory condition (11.5). It follows by Theorem 11.5 that every Borel set is h_m-measurable. Therefore, by Theorem 11.1, h_m restricted to ℛ^k is a measure.

It is immediately obvious from the definitions that h_m is invariant under isometries: If A ⊂ R^k, φ: A → R^i, and |φ(x) − φ(y)| = |x − y| for x, y ∈ A, then h_m(φA) = h_m(A).
Example 19.1. Suppose that m = k = 1. If a_n = inf B_n and b_n = sup B_n, then [a_n, b_n] and B_n have the same diameter b_n − a_n. Thus (19.1) is in this case unchanged if the B_n are taken to be intervals. Further, any sum Σ_n diam B_n remains unchanged if each interval B_n is split into subintervals of length less than ε. Thus h_{1,ε}(A) does not depend on ε, and h_1(A) = h_{1,ε}(A) is for A ⊂ R¹ the ordinary outer Lebesgue measure λ*(A) of A, provided that c₁ is taken to be 1, as it will be.

A planar segment S_t = [(x, t): 0 ≤ x ≤ l], as an isometric copy of the interval [0, l], satisfies h_1(S_t) = l. This is a first indication that Hausdorff measure has the correct geometric properties. Since the segments S_t are disjoint, one-dimensional Hausdorff measure in the plane is not σ-finite. ∎
Example 19.2. Consider the segments S₀ and S_t of the preceding example with l = 1 and 0 < t < 1. Now h_1(S₀ ∪ S_t) = 2, and in fact h_{1,ε}(S₀ ∪ S_t) = 2 if ε < t. This shows the role of ε: If h_m(A) were defined, not by (19.2), but as the infimum in (19.1) without the requirement diam B_n < ε, then S₀ ∪ S_t would have the wrong one-dimensional measure: diam(S₀ ∪ S_t) = (1 + t²)^{1/2} instead of 2. ∎
The Normalizing Constant

In the definition (19.1), m can be any positive number, but suppose from now on that it is an integer between 1 and the dimension of the space R^k: m = 1, …, k. The problem is to choose c_m so as to satisfy the geometric intention of the definition. Let V_m be the volume of a sphere of radius 1 in R^m; the value of V_m is explicitly calculated in Example 18.4. And now take the c_m in (19.1) as the volume of a sphere of diameter 1:

(19.3)  c_m = V_m / 2^m.

A sphere in R^m of diameter d has volume c_m d^m. It will be proved below that for this choice of c_m the k-dimensional Hausdorff and Lebesgue measures agree for k-dimensional Borel sets:
Theorem 19.1. For A ∈ ℛ^k, h_k(A) = λ_k(A).

If A ∈ ℛ^m and B is a set in R^k that is an isometric copy of A, it will follow by this theorem that h_m(B) = h_m(A) = λ_m(A).

The unit cube C in R^k can be covered by n^k cubes of side n^{−1} and diameter k^{1/2} n^{−1}; taking n > k^{1/2} ε^{−1} shows that h_{k,ε}(C) ≤ c_k n^k (k^{1/2} n^{−1})^k = c_k k^{k/2}, and so h_k(C) < ∞. If C ⊂ ∪_n B_n and diam B_n = d_n, enclose B_n in a sphere S_n of radius d_n. Then C ⊂ ∪_n S_n, and so 1 = λ_k(C) ≤ Σ_n λ_k(S_n) = Σ_n V_k d_n^k; hence h_k(C) ≥ c_k/V_k. Thus h_k(C) is finite and positive.

Put h_k(C) = K, where again C is the unit cube in R^k. Then of course

(19.4)  h_k(A) = K λ_k(A)

for A = C. Consider the linear transformation carrying x to θx, where θ > 0. By Theorem 12.2, λ_k(θA) = θ^k λ_k(A) for A ∈ ℛ^k. Since diam θA = θ·diam A, the definition (19.2) gives

(19.5)  h_k(θA) = θ^k h_k(A).

Thus (19.4) holds for all cubes and hence by additivity holds for rectangles whose vertices have rational coordinates. The latter sets form a π-system generating ℛ^k, and so (19.4) holds for all A ∈ ℛ^k, by Theorem 10.3.

Thus Theorem 19.1 will follow once it has been shown that the h_k-measure of the unit cube in R^k, the K in (19.4), is 1. To establish this requires
two lemmas connecting measure with geometry. The first is a version of the Vitali covering theorem.

Lemma 1. Suppose that G is a bounded open set in R^k and that ε > 0. Then there exists in G a disjoint sequence S₁, S₂, … of closed spheres such that λ_k(G − ∪_n S_n) = 0 and 0 < diam S_n < ε.

PROOF. Let S₁ be any closed sphere in G satisfying 0 < diam S₁ < ε. Suppose that S₁, …, S_n have already been constructed. Let 𝒮_n be the class of closed spheres in G that have diameter less than ε and do not meet S₁ ∪ ⋯ ∪ S_n. Let s_n be the supremum of the radii of the spheres in 𝒮_n, and let S_{n+1} be an element of 𝒮_n whose radius r_{n+1} exceeds s_n/2. Since the S_n are disjoint subsets of G, Σ_n V_k r_n^k = Σ_n λ_k(S_n) ≤ λ_k(G) < ∞, and hence

(19.6)  r_n → 0.

Put B = G − ∪_n S_n. It is to be shown that λ_k(B) = 0. Assume the opposite. Let S_n′ be the sphere having the same center as S_n and having radius 4r_n. Then Σ_n λ_k(S_n′) = 4^k Σ_n λ_k(S_n) ≤ 4^k λ_k(G). Choose N so that Σ_{n>N} λ_k(S_n′) < λ_k(B). Then there is an x₀ such that

(19.7)  x₀ ∈ B,  x₀ ∉ ∪_{n>N} S_n′.

As the S_n are closed, there is about x₀ a closed sphere S with radius r small enough that

(19.8)  S ⊂ G,  diam S < ε,  S ∩ (S₁ ∪ ⋯ ∪ S_N) = ∅.

If S does not meet S₁ ∪ ⋯ ∪ S_n, then r ≤ s_n < 2r_{n+1}; this cannot hold for all n because of (19.6). Thus S meets some S_n; let S_{n₀} be the first of these:

(19.9)  S ∩ (S₁ ∪ ⋯ ∪ S_{n₀−1}) = ∅,  S ∩ S_{n₀} ≠ ∅.

Because of (19.8), n₀ > N, and by (19.7), x₀ ∉ S_{n₀}′. But since x₀ ∉ S_{n₀}′ and S ∩ S_{n₀} ≠ ∅, S has radius r ≥ 3r_{n₀} > (3/2)s_{n₀−1}. By (19.9), r ≤ s_{n₀−1}. Thus s_{n₀−1} ≥ r > (3/2)s_{n₀−1}, a contradiction. ∎

Lemma 2. If A ∈ ℛ^k, then λ_k(A) ≤ c_k (diam A)^k. In other words, the volume of a set cannot exceed that of a sphere with the same diameter.
PROOF. Represent the general point of R^k as (t, y), where t ∈ R¹, y ∈ R^{k−1}. Let A_y = [t: (t, y) ∈ A]. Since A is bounded, each A_y is bounded, and A_y = ∅ for y outside some bounded set. By the results of Section 18, A_y ∈ ℛ¹ and λ(A_y) is, as a function of y, measurable ℛ^{k−1}. Let I_y be the open (perhaps empty) interval (−λ(A_y)/2, +λ(A_y)/2), and set S₁A = ∪_y [(t, y): t ∈ I_y]. If 0 ≤ f_n(y) ↑ λ(A_y)/2 and f_n is simple, then [(t, y): |t| < f_n(y)] is in ℛ^k and increases to S₁A; thus S₁A ∈ ℛ^k. By Fubini's theorem, λ_k(A) = λ_k(S₁A). The passage from A to S₁A is Steiner symmetrization with respect to the hyperplane H₁ = [x ∈ R^k: x₁ = 0].

Let J_y be the closed interval from inf A_y to sup A_y; it has length λ(J_y) ≥ λ(A_y). Let m_y be the midpoint of J_y. If s ∈ I_y and s′ ∈ I_{y′}, then |s − s′| ≤ ½λ(A_y) + ½λ(A_{y′}) ≤ |m_y − m_{y′}| + ½λ(J_y) + ½λ(J_{y′}), and this last sum is |t − t′| for appropriate endpoints t of J_y and t′ of J_{y′}. Thus for any points (s, y) and (s′, y′) of S₁A, there are points (t, y) and (t′, y′) that are at least as far apart and are limits of points in A. Hence diam S₁A ≤ diam A.

For each i there is a Steiner symmetrization S_i with respect to the hyperplane H_i = [x: x_i = 0]; λ_k(S_iA) = λ_k(A) and diam S_iA ≤ diam A. It is obvious by the construction that S_iA is symmetric with respect to H_i: if x′ is x with its ith coordinate reversed in sign, then x and x′ both lie in S_iA or neither does. If A is symmetric with respect to H_i (i > 1) and y′ is y with its (i − 1)st coordinate reversed in sign, then, in the notation above, A_y = A_{y′}, and so S₁A is symmetric with respect to H_i. More generally, if A is symmetric with respect to H_i and j ≠ i, then S_jA is also symmetric with respect to H_i.

Pass from A to SA = S_k ⋯ S₁A. This is central symmetrization. Now λ_k(SA) = λ_k(A), diam SA ≤ diam A, and SA is symmetric with respect to
252
INTEGRATION
each H; , so that x e SA implies - x e SA . If l x l > !diam A, then l x ( - x ) l > diam A > diam SA , so that SA cannot contain x and - x . Thus SA is contained in the closed sphere with center 0 and diameter diam A . • 19.1. It is only necessary to show that K = 1 in (1 9.4). By Lemma 1 the interior of the unit cube C in R k contains disjoint spheres such that diam Sn < £ and h k , ( (C - Un Sn ) < h k (C - Un Sn ) = S1 , S2 , K 'A k ( C - U n Sn ) = 0. By (19.3), h k , ( (C) = h k , ( (U n Sn ) < ck En (diam Sn ) k = ck E nck" 1 'A k ( Sn ) < A k (C) = 1. Hence h k (C) < 1. If C C U n Bn , then by Lemma 2, 1 < En 'A k ( Bn ) < Enck (diam Bn ) k , and • hence h k ( C ) � 1. PROOF OF THEOREM
•
•
•
Change of Variable
If x₁, …, x_m are points in R^k, let P(x₁, …, x_m) be the parallelepiped [Σ_{i=1}^m a_i x_i: 0 ≤ a₁, …, a_m ≤ 1]. Suppose that D is a k × m matrix, and let x₁, …, x_m be its columns, viewed as points of R^k; define |D| = h_m(P(x₁, …, x_m)). If C is the unit cube in R^m and D is viewed as a linear map from R^m to R^k, then P(x₁, …, x_m) = DC, so that

(19.10)  |D| = h_m(DC).

In the case m = k it follows by Theorem 12.2 that

(19.11)  |D| = |det D|,  m = k.
Theorem 12.2 has an extension: Theorem 19.2.
and
Suppose that D is univa/ent. t If A e £1/ m , then DA e £Il k
(19 .12) Since D is univalent, D Un A n = Un DA n and D( R m - A ) = DR m - DA ; since DR m is closed, it is in £Il k . Thus [A : DA e £Il k ] is a a-field. Since D is continuous, , DA is compact if A is. By Theorem 13.1(i), D is measurable £1/ mj£1/ k . Each side of (19.12) is a measure as A varies over £1/ m . They agree by definition if A is the unit cube in R m , and so by (19.5) they agree if A is any • cube. By the usual argument they agree for all A in !A m . PROOF.
t That is, one-to-one: Du
==
0 implies that u
=
0. In this case
m
< k.
SECTION 19. HAUSDORFF MEASURE

If D is not univalent, A ∈ ℛ^m need not imply that DA ∈ ℛ^k, but (19.12) still holds because each side vanishes.

The Jacobian formula (17.12) for change of variable also generalizes. Suppose that V is an open set in R^m and T is a continuous, one-to-one mapping of V onto the set TV in R^k, where m ≤ k. Let t_i(x) be the ith coordinate of Tx, and suppose that t_{ij}(x) = ∂t_i(x)/∂x_j is continuous on V. Put

(19.13)    D_x = [t_{ij}(x)],

the k × m matrix of these partial derivatives. Then |D_x| as defined by (19.10) plays the role of the modulus of the Jacobian (17.11) and by (19.11) coincides with it in case m = k.
Theorem 19.3. Suppose that T is a continuous, one-to-one map of an open set V in R^m into R^k. If T has continuous partial derivatives and A ⊂ V, then

(19.14)    ∫_A |D_x| h_m(dx) = h_m(TA)

and

(19.15)    ∫_A f(Tx)|D_x| h_m(dx) = ∫_{TA} f(y) h_m(dy).

By Theorem 19.1, h_m on the left in (19.14) and (19.15) can be replaced by λ_m.

This generalizes Theorem 17.2. Several explanations are required. Since V is open, it is a countable union of compact sets K_n. If A is closed, T(A ∩ K_n) is compact because T is continuous, and so T(A ∩ V) = ∪_n T(A ∩ K_n) ∈ ℛ^k. In particular, TV ∈ ℛ^k, and it follows by the usual argument that TA ∈ ℛ^k for Borel subsets A of V. During the course of the proof of the theorem it will be seen that |D_x| is continuous in x. Thus the formulas (19.14) and (19.15) make sense if A and f are measurable. As usual, (19.15) holds if f is nonnegative and also if the two sides are finite when f is replaced by |f|. And (19.15) follows from (19.14) by the standard argument starting with indicators. Hence only (19.14) need be proved.

To see why (19.14) ought to hold, imagine that A is a rectangle split into many small rectangles A_n; choose x_n in A_n and approximate the integral in (19.14) by Σ_n |D_{x_n}| h_m(A_n). For y in A_n, Ty has the linear approximation
254
INTEGRATION
Tx_n + D_{x_n}(y − x_n), and so TA_n is approximated by Tx_n + D_{x_n}(A_n − x_n), an m-dimensional parallelepiped for which h_m is |D_{x_n}| h_m(A_n − x_n) = |D_{x_n}| h_m(A_n) by (19.12). Hence Σ_n |D_{x_n}| h_m(A_n) ≈ Σ_n h_m(TA_n) = h_m(TA).

The first step in a proof based on these ideas is to split the region of integration into small sets (not necessarily rectangles) on each of which the D_x are all approximately equal to a linear transformation that in turn gives a good local approximation to T. Let V_0 be the set of x in V for which D_x is univalent.
Lemma 3. If θ > 1, then there exists a countable decomposition of V_0 into Borel sets B_j such that, for some linear maps M_j: R^m → R^k,

(19.16)    θ^{−1}|M_j v| ≤ |D_x v| ≤ θ|M_j v|,    x ∈ B_j, v ∈ R^m,

and

(19.17)    θ^{−1}|M_j(x − y)| ≤ |Tx − Ty| ≤ θ|M_j(x − y)|,    x, y ∈ B_j.

PROOF. Choose ε and θ_0 in such a way that θ^{−1} + ε < θ_0^{−1} < 1 < θ_0 < θ − ε. Let 𝓜 be the set of k × m matrices with rational coefficients. For M in 𝓜 and p a positive integer let E(M, p) be the set of x in V such that

(19.18)    θ_0^{−1}|Mv| ≤ |D_x v| ≤ θ_0|Mv|,    v ∈ R^m,

and

(19.19)    θ^{−1}|M(x − y)| ≤ |Tx − Ty| ≤ θ|M(x − y)|    for |y − x| < p^{−1}, y ∈ V.

Imposing these conditions for countable, dense sets of v's and y's shows that E(M, p) is a Borel set.

The essential step is to show that the E(M, p) cover V_0. If D_x is univalent, then δ_0 = inf[|D_x v|: |v| = 1] is positive. Take δ so that δ ≤ (θ_0 − 1)δ_0 and δ ≤ (1 − θ_0^{−1})δ_0, and then choose in 𝓜 an M such that |Mv − D_x v| ≤ δ|v| for all v ∈ R^m. Then |Mv| ≤ |D_x v| + |Mv − D_x v| ≤ |D_x v| + δ|v| ≤ θ_0|D_x v|, and (19.18) follows from this and a similar inequality going the other way. As for (19.19), it follows from (19.18) that M is univalent, and so there is a positive η such that η|v| ≤ |Mv| for all v. Since T is differentiable at x, there exists a p such that |y − x| < p^{−1} implies that |Ty − Tx − D_x(y − x)| ≤ εη|y − x| ≤ ε|M(y − x)|. This and (19.18) imply that |Ty − Tx| ≤ |D_x(y − x)| + |Ty − Tx − D_x(y − x)| ≤ θ_0|M(x − y)| + ε|M(x − y)| ≤ θ|M(x − y)|. This inequality and a similar one in the other direction give (19.19). Thus the E(M, p) cover V_0.

Now E(M, p) is a countable union of sets B such that diam B < p^{−1}. These sets B (for all M in 𝓜 and all p), made disjoint in the usual way, together with the corresponding maps M, satisfy the requirements of the lemma. ∎
Also needed is an extension of the fact that isometries preserve Hausdorff measure.
Lemma 4. Suppose that A ⊂ R^k, φ: A → R^i, and ψ: A → R^j. If |φx − φy| ≤ θ|ψx − ψy| for x, y ∈ A, then h_m(φA) ≤ θ^m h_m(ψA).

PROOF. Given ε and δ, cover ψA by sets B_n such that θ diam B_n < ε and c_m Σ_n (diam B_n)^m < h_m(ψA) + δ. The sets φψ^{−1}B_n cover φA and diam φψ^{−1}B_n ≤ θ diam B_n < ε, whence follows h_{m,ε}(φA) ≤ θ^m(h_m(ψA) + δ). ∎
PROOF OF THEOREM 19.3. FIRST CASE. Suppose first that A is contained in the set V_0 where D_x is univalent. Given θ > 1, choose sets B_j as in Lemma 3 and put A_j = A ∩ B_j. If C is the unit cube in R^m, it follows by (19.16) and Lemma 4 that θ^{−m}h_m(M_jC) ≤ h_m(D_xC) ≤ θ^m h_m(M_jC) for x ∈ A_j; by (19.10), θ^{−m}|M_j| ≤ |D_x| ≤ θ^m|M_j| for x ∈ A_j.† By (19.17) and Lemma 4, θ^{−m}h_m(M_jA_j) ≤ h_m(TA_j) ≤ θ^m h_m(M_jA_j). Now h_m(M_jA_j) = |M_j| λ_m(A_j) by Theorem 19.2, and so θ^{−2m}h_m(TA_j) ≤ ∫_{A_j}|D_x| h_m(dx) ≤ θ^{2m}h_m(TA_j). Adding over j and letting θ → 1 gives (19.14).

† Since the entries of D_x are continuous in x, if y is close enough to x, then θ^{−1}|D_x v| ≤ |D_y v| ≤ θ|D_x v| for v ∈ R^m, so that by the same argument, θ^{−m}|D_x| ≤ |D_y| ≤ θ^m|D_x|; hence |D_x| is continuous in x.

SECOND CASE. Suppose that A is contained in the set V − V_0 where D_x is not univalent. It is to be shown that each side of (19.14) vanishes. Since the entries of (19.13) are continuous in x and V is a countable union of compact sets, it is no restriction to assume that there exists a constant K such that |D_x v| ≤ K|v| for x ∈ A, v ∈ R^m. It is also no restriction to assume that h_m(A) = λ_m(A) is finite.

Suppose that 0 < ε < K. Define T′: R^m → R^k × R^m by T′x = (Tx, εx). For the corresponding matrix D′_x of derivatives, D′_x v = (D_x v, εv), and so D′_x is univalent. Fixing x in A for the moment, choose in R^m an orthonormal set v_1, ..., v_m such that D_x v_1 = 0, which is possible because x ∈ V − V_0. Then |D′_x v_1| = |(0, εv_1)| = ε and |D′_x v_i| ≤ |D_x v_i| + ε|v_i| ≤ 2K. Since h_m(P(v_1, ..., v_m)) = 1, it follows by Theorem 19.2 that |D′_x| = h_m(D′_x P(v_1, ..., v_m)). But the parallelepiped D′_x P(v_1, ..., v_m) of dimension
m in R^{k+m} has one edge of length ε and m − 1 edges of length at most 2K and hence is isometric with a subset of [x ∈ R^m: |x_1| ≤ ε, |x_i| ≤ 2mK, i = 2, ..., m]; therefore, |D′_x| ≤ ε(2mK)^{m−1}. Now (19.14) for the univalent case already treated gives h_m(T′A) ≤ ∫_A ε(2mK)^{m−1} h_m(dx). If π: R^k × R^m → R^k is defined by π(z, v) = z, then TA = πT′A, and Lemma 4 gives h_m(TA) ≤ h_m(T′A) ≤ ε(2mK)^{m−1}h_m(A). Hence h_m(TA) = 0.

If D_x is not univalent, D_x R^m is contained in an (m − 1)-dimensional subspace of R^k and hence is isometric to a set in R^m of Lebesgue measure 0. Thus |D_x| = 0 for x ∈ V − V_0, and both sides of (19.14) do vanish. ∎
Calculations
The h_m on the left in (19.14) and (19.15) can be replaced by λ_m. To evaluate the integrals, however, still requires calculating |D| as defined by (19.10). Here only the case m = k − 1 will be considered.† If D is a k × (k − 1) matrix, let z be the point in R^k whose ith component is (−1)^{i+1} times the determinant of D with its ith row removed. Then

(19.20)    |D| = |z|.

To see this, note first that for each x in R^k the inner product x · z is the determinant of D augmented by adding x as its first column (expand by cofactors). Hence z is orthogonal to the columns x_1, ..., x_{k−1} of D. If z ≠ 0, let z_0 = z/|z|; otherwise, let z_0 be any unit vector orthogonal to x_1, ..., x_{k−1}. In either case a rotation and an application of Fubini's theorem shows that |D| = h_{k−1}(P(x_1, ..., x_{k−1})) = λ_k(P(z_0, x_1, ..., x_{k−1})); this last quantity is by Theorem 12.2 the absolute value of the determinant of the matrix with columns z_0, x_1, ..., x_{k−1}. But, as above, this is z_0 · z = |z|.
Example 19.3. Let V be the open unit sphere in R^{k−1}. If t_i(x) = x_i for i < k and t_k(x) = [1 − Σ_{i=1}^{k−1} x_i²]^{1/2}, then TV is the top half of the surface of the unit sphere in R^k. The determinants of the (k − 1) × (k − 1) submatrices of D_x are easily calculated, and |D_x| comes out to t_k^{−1}(x). If A_k is the surface area of the unit sphere in R^k, then by (19.14), A_k = 2h_{k−1}(TV) = 2∫_V t_k^{−1}(x) dx. Integrating out one of the variables shows that A_k = 2πV_{k−2}, where V_{k−2} is the volume of the unit sphere in R^{k−2}. The calculation in Example 18.4 now gives A_2 = 2π and

A_{2i} = (2π)^i / (2·4·6 ⋯ (2i − 2)),    A_{2i−1} = 2(2π)^{i−1} / (1·3·5 ⋯ (2i − 3))

for i = 2, 3, .... Note that A_k = kV_k for k = 1, 2, .... By (19.5) the sphere in R^k with radius r has surface area r^{k−1}A_k. ∎

† For the general case, see SPIVAK.

PROBLEMS
19.1. Describe S_2S_1A for the general triangle A in the plane.

19.2. Show that in (19.1) finite coverings by open sets suffice if A is compact: for positive ε and δ there exist finitely many open sets B_n such that A ⊂ ∪_n B_n, diam B_n < ε, and c_m Σ_n (diam B_n)^m < h_m(A) + δ.
19.3. ↑ Let f be an arc or curve, a continuous mapping of an interval [a, b] into R^k, and let f[a, b] = [f(t): a ≤ t ≤ b] be its trace or graph. The arc length L(f) of f is the supremum of the lengths of inscribed polygons:

L(f) = sup Σ_{i=1}^n |f(t_i) − f(t_{i−1})|,
where the supremum extends over sets {t_i} for which a = t_0 < t_1 < ··· < t_n = b. The curve is rectifiable if L(f) is finite.
(a) Let l[u, v] be the length of f restricted to a subinterval [u, v]. Show that l[u, v] = l[u, t] + l[t, v] for u < t < v.
(b) Let l(t) be the length of f restricted to [a, t]. Show for s < t that |f(t) − f(s)| ≤ l[s, t] = l(t) − l(s) and deduce that h_1(f[a, b]) ≤ L(f): the one-dimensional measure of the trace is at most the length of the arc.
(c) Show that h_1(f[a, b]) < L(f) is possible.
(d) Show that |f(b) − f(a)| ≤ h_1(f[a, b]). Hint: Use a compactness argument: cover f[a, b] by finitely many open sets B_1, ..., B_N such that Σ_{n=1}^N diam B_n < h_{1,ε}(f[a, b]) + ε.
(e) Show that h_1(f[a, b]) = L(f) if the arc has no multiple points (f is one-to-one).
(f) Show that if the arc is rectifiable, then l(t) is continuous.
(g) Show that a rectifiable arc can be reparameterized in such a way that [a, b] = [0, L(f)] and l(t) = t. Show that under this parameterization, h_1(A) for a Borel subset A of the arc is the Lebesgue measure of f^{−1}A, provided there are no multiple points.

19.4. ↑ Suppose the arc is smooth in the sense that f(t) = (f_1(t), ..., f_k(t)) has continuous derivative f′(t) = (f_1′(t), ..., f_k′(t)). Show that L(f) = ∫_a^b |f′(t)| dt. The curve is said to be parameterized by arc length if |f′(t)| ≡ 1, because in that case f[s, t] has length t − s.

19.5. ↑ Show that a circle of radius r has perimeter 2πr: if f(t) = (r cos t, r sin t) for 0 ≤ t ≤ 2π, then ∫_0^{2π} |f′(t)| dt = 2πr. The π here is that of trigonometry; for the connection with π as the area of the unit disc, see Problem 18.24.

19.6. ↑ Suppose that f(t) = (x(t), y(t)), a ≤ t ≤ b, is parameterized by arc length. For a ≤ ξ ≤ b, define ψ(ξ, η) = (x(ξ), y(ξ), η). Interpret the transformation geometrically and show that it preserves area and arc length.
19.7. ↑ (a) Calculate the length of the helix (sin θ, cos θ, θ), 0 ≤ θ ≤ 2π.
(b) Calculate the area of the helical surface (t sin θ, t cos θ, θ), 0 ≤ θ ≤ 2π, 0 ≤ t ≤ 1.

19.8. Since arc length can be approximated by the lengths of inscribed polygons, it might be thought that surface area could be approximated by the areas of inscribed polyhedra. The matter is very complicated, however, as an example will show. Split a 1 by 2π rectangle into mn rectangles, each m^{−1} by 2πn^{−1}, and split each of these into four triangles by means of the two diagonals. Bend the large rectangle into a cylinder of height 1 and circumference 2π, and use the vertices of the 4mn triangles as the vertices of an inscribed polyhedron with 4mn flat triangular faces. Compute the area A_{mn} of the resulting polyhedron, and show that as m, n → ∞, the limit points of A_{mn} fill out the interval [2π, ∞).
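As a numerical aside (not part of the text), Problem 19.7(a) can be spot-checked with the inscribed-polygon definition of arc length from Problem 19.3: the helix has |f′(θ)| = (cos²θ + sin²θ + 1)^{1/2} = √2, so its length should be 2π√2.

```python
# Sketch (illustrative): approximate the length of the helix
# f(t) = (sin t, cos t, t), 0 <= t <= 2*pi, by an inscribed polygon.
# The exact value is 2*pi*sqrt(2) since |f'(t)| = sqrt(2).
import math

def helix(t):
    return (math.sin(t), math.cos(t), t)

def polygon_length(f, a, b, n):
    # Sum |f(t_i) - f(t_{i-1})| over a uniform partition a = t_0 < ... < t_n = b.
    ts = [a + (b - a) * i / n for i in range(n + 1)]
    pts = [f(t) for t in ts]
    return sum(math.dist(p, q) for p, q in zip(pts, pts[1:]))

approx = polygon_length(helix, 0.0, 2 * math.pi, 100_000)
exact = 2 * math.pi * math.sqrt(2)
print(approx, exact)  # the two agree to many decimal places
```

The inscribed polygon underestimates the length, so `approx` increases toward `exact` as n grows, in line with the supremum definition of L(f).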
19.9. A point on the unit sphere in R³ is specified by its longitude θ, 0 ≤ θ < 2π, and its latitude φ, −π/2 ≤ φ ≤ π/2. Calculate the surface area of the sector θ_1 ≤ θ ≤ θ_2, φ_1 ≤ φ ≤ φ_2.

19.10. By (18.22), (19.3) can be written c_m = π^{m/2}/(2^m Γ(1 + m/2)), and this is well defined for all m ≥ 0. Thus h_m is well defined for all m ≥ 0. Show that h_0 is counting measure.

19.11. ↑ (a) Show that, if h_m(A) < ∞ and δ > 0, then h_{m+δ}(A) = 0.
(b) Because of this, there exists a nonnegative number dim A, called the Hausdorff dimension of A, such that h_m(A) = ∞ if m < dim A and h_m(A) = 0 if m > dim A. Thus 0 < h_m(A) < ∞ implies that dim A = m. Show that "most" sets in R^k specified by m smooth parameters have (integral) dimension m.
(c) Show that dim A ≤ dim B if A ⊂ B and that dim ∪_n A_n = sup_n dim A_n.

19.12. ↑ Show that h_m(A) is well defined by (19.1) and (19.2) for sets A in an arbitrary metric space. Show that h_m is an outer measure, whatever the space. Show that part (a) of the preceding problem goes through, so that Hausdorff dimension is defined for all sets in all metric spaces, and show that part (c) again holds.

19.13. Let x = (x_1, ..., x_k) range over the surface of the unit sphere in R^k, and consider the set where α_u ≤ x_u ≤ β_u, 1 ≤ u ≤ t; here t ≤ k − 2 and −1 ≤ α_u ≤ β_u ≤ 1. Show that its area is

(19.21)    A_{k−t} ∫_S [1 − Σ_{i=1}^t x_i²]^{(k−t−2)/2} dx_1 ⋯ dx_t,

where A_{k−t} is as in Example 19.3 and S is the set where Σ_{i=1}^t x_i² < 1 and α_u ≤ x_u ≤ β_u, 1 ≤ u ≤ t.
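A small numerical aside (not part of the text, and assuming the reconstruction of the constant in Problem 19.10): for integral m, c_m = π^{m/2}/(2^m Γ(1 + m/2)) should be the volume of a ball of diameter 1 in R^m, so c_1 = 1, c_2 = π/4, c_3 = π/6.

```python
# Sketch: spot-check the normalizing constant c_m of Problem 19.10 at
# integral m, where it equals the volume of a ball of radius 1/2 in R^m.
import math

def c(m):
    return math.pi ** (m / 2) / (2 ** m * math.gamma(1 + m / 2))

print(c(1), c(2), c(3))  # 1.0, pi/4, pi/6
```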
CHAPTER 4

Random Variables and Expected Values
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS

This section and the next cover random variables and the machinery for dealing with them: expected values, distributions, moment generating functions, independence, convolution.

Random Variables and Vectors

A random variable on a probability space (Ω, 𝓕, P) is a real-valued function X = X(ω) measurable 𝓕. Sections 5 through 9 dealt with random variables of a special kind, namely simple random variables, those with finite range. All concepts and facts concerning real measurable functions carry over to random variables; any changes are matters of viewpoint, notation, and terminology only. The positive and negative parts X⁺ and X⁻ of X are defined as in (15.4) and (15.5). Theorem 13.5 also applies: Define
(20.1)    ψ_n(x) = (k − 1)2^{−n} if (k − 1)2^{−n} ≤ x < k2^{−n}, 1 ≤ k ≤ n2^n;    ψ_n(x) = n if x ≥ n.

If X is nonnegative and X_n = ψ_n(X), then 0 ≤ X_n ↑ X. If X is not necessarily nonnegative, define

(20.2)    X_n = ψ_n(X) if X ≥ 0,    X_n = −ψ_n(−X) if X ≤ 0.

(This is the same as (13.6).) Then 0 ≤ X_n(ω) ↑ X(ω) if X(ω) ≥ 0 and 0 ≥ X_n(ω) ↓ X(ω) if X(ω) ≤ 0; and |X_n(ω)| ↑ |X(ω)| for every ω. The random variable X_n is in each case simple.

A random vector is a mapping from Ω to R^k measurable 𝓕. Any mapping from Ω to R^k must have the form ω → X(ω) = (X_1(ω), ..., X_k(ω)), where each X_i(ω) is real; as shown in Section 13 (see (13.2)), X is measurable if and only if each X_i is. Thus a random vector is simply a k-tuple X = (X_1, ..., X_k) of random variables.
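The dyadic staircase (20.1) is easy to transcribe; the following Python sketch (an illustration, not part of the text) checks the claimed monotonicity in n and the convergence ψ_n(x) → x for x ≥ 0:

```python
# Sketch: the function psi_n of (20.1) rounds x >= 0 down to the dyadic
# grid {(k-1) 2^{-n}} and truncates at level n.  Checks: psi_n(x) <= x,
# psi_n(x) is nondecreasing in n, and psi_n(x) -> x once n exceeds x.
import math

def psi(n, x):
    if x >= n:
        return n
    return math.floor(2 ** n * x) / 2 ** n  # largest (k-1)2^{-n} <= x

for x in [0.0, 0.3, 1.75, math.pi, 100.0]:
    vals = [psi(n, x) for n in range(1, 30)]
    assert all(v <= x for v in vals)                    # psi_n(x) <= x
    assert all(a <= b for a, b in zip(vals, vals[1:]))  # monotone in n

assert abs(psi(50, math.pi) - math.pi) < 2 ** -49       # convergence
```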
Subfields
If 𝓖 is a σ-field for which 𝓖 ⊂ 𝓕, a k-dimensional random vector X is of course measurable 𝓖 if [ω: X(ω) ∈ H] ∈ 𝓖 for every H in ℛ^k. The σ-field σ(X) generated by X is the smallest σ-field with respect to which it is measurable. The σ-field generated by a collection of random vectors is the smallest σ-field with respect to which each one is measurable. As explained in Sections 4 and 5, a sub-σ-field corresponds to partial information about ω. The information contained in σ(X) = σ(X_1, ..., X_k) consists of the k numbers X_1(ω), ..., X_k(ω).† The following theorem is instructive in this connection, although only part (i) is used much. It is the analogue of Theorem 5.1, but there are technical complications in its proof.
Theorem 20.1. Let X = (X_1, ..., X_k) be a random vector.
(i) The σ-field σ(X) = σ(X_1, ..., X_k) consists exactly of the sets [X ∈ H] for H ∈ ℛ^k.
(ii) In order that a random variable Y be measurable σ(X) = σ(X_1, ..., X_k) it is necessary and sufficient that there exist a measurable map f: R^k → R^1 such that Y(ω) = f(X_1(ω), ..., X_k(ω)) for all ω.

PROOF. The class 𝓖 of sets of the form [X ∈ H] for H ∈ ℛ^k is a σ-field. Since X is measurable σ(X), 𝓖 ⊂ σ(X). Since X is measurable 𝓖, σ(X) ⊂ 𝓖. Hence part (i).

Measurability of f in part (ii) refers of course to measurability ℛ^k/ℛ^1. The sufficiency is easy: if such an f exists, Theorem 13.1(ii) implies that Y is measurable σ(X).

† The partition defined by (4.17) consists of the sets [ω: X(ω) = x] for x ∈ R^k.
To prove necessity,† suppose at first that Y is a simple random variable, and let y_1, ..., y_m be its different possible values. Since A_i = [ω: Y(ω) = y_i] lies in σ(X), it must by part (i) have the form [ω: X(ω) ∈ H_i] for some H_i in ℛ^k. Put f = Σ_i y_i I_{H_i}; certainly f is measurable. Since the A_i are disjoint, no X(ω) can lie in more than one H_i (even though the latter need not be disjoint), and hence f(X(ω)) = Y(ω).

To treat the general case, consider simple random variables Y_n such that Y_n(ω) → Y(ω) for each ω. For each n, there is a measurable function f_n: R^k → R^1 such that Y_n(ω) = f_n(X(ω)) for all ω. Let M be the set of x in R^k for which {f_n(x)} converges; by Theorem 13.4(iii), M lies in ℛ^k. Let f(x) = lim_n f_n(x) for x in M, and let f(x) = 0 for x in R^k − M. Since f = lim_n f_n I_M and f_n I_M is measurable, f is measurable by Theorem 13.4(ii). For each ω, Y(ω) = lim_n f_n(X(ω)); this implies in the first place that X(ω) lies in M and in the second place that Y(ω) = lim_n f_n(X(ω)) = f(X(ω)). ∎
Distributions
The distribution or law of a random variable X was in Section 14 defined as the probability measure on the line given by μ = PX^{−1} (see (13.7)), or

(20.3)    μ(A) = P[X ∈ A],    A ∈ ℛ^1.
The distribution function of X was defined by

(20.4)    F(x) = μ(−∞, x] = P[X ≤ x]

for real x. The left-hand limit of F satisfies

(20.5)    F(x−) = μ(−∞, x) = P[X < x],    F(x) − F(x−) = μ{x} = P[X = x],

and F has at most countably many discontinuities. Further, F is nondecreasing and right-continuous, and lim_{x→−∞} F(x) = 0, lim_{x→∞} F(x) = 1. By Theorem 14.1, for each F with these properties there exists on some probability space a random variable having F as its distribution function.

A support for μ is a Borel set S for which μ(S) = 1. A random variable, its distribution, and its distribution function are discrete if μ has a countable support S = {x_1, x_2, ...}. In this case μ is completely determined by the values μ{x_1}, μ{x_2}, ....

† For a general version of this argument, see Problem 13.6.
A familiar discrete distribution is the binomial:

(20.6)    P[X = r] = μ{r} = \binom{n}{r} p^r (1 − p)^{n−r},    r = 0, 1, ..., n.
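As a quick numerical illustration (not in the text), the binomial probabilities (20.6) sum to 1 and agree with the representation, discussed just below, of X as a sum of n independent 0-1 variables:

```python
# Sketch: check the binomial pmf (20.6) against exhaustive enumeration of
# the 2^n outcomes of n independent indicators with P[X_i = 1] = p.
from itertools import product
from math import comb, isclose

n, p = 5, 0.3
pmf = [comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1)]
assert isclose(sum(pmf), 1.0)

direct = [0.0] * (n + 1)
for bits in product([0, 1], repeat=n):
    prob = 1.0
    for b in bits:
        prob *= p if b else 1 - p
    direct[sum(bits)] += prob    # P[X_1 + ... + X_n = r]
assert all(isclose(a, b) for a, b in zip(pmf, direct))
```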
There are many random variables, on many spaces, with this distribution: If {X_k} is an independent sequence such that P[X_k = 1] = p and P[X_k = 0] = 1 − p (see Theorem 5.2), then X could be Σ_{i=1}^n X_i, or Σ_{i=2}^{n+1} X_i, or the sum of any n of the X_i. Or Ω could be {0, 1, ..., n} if 𝓕 consists of all subsets, P{r} = μ{r}, r = 0, 1, ..., n, and X(r) = r. Or again the space and random variable could be those given by the construction in either of the two proofs of Theorem 14.1. These examples show that, although the distribution of a random variable X contains all the information about the probabilistic behavior of X itself, it contains beyond this no further information about the underlying probability space (Ω, 𝓕, P) or about the interaction of X with other random variables on the space.

Another common discrete distribution is the Poisson distribution with parameter λ > 0:

(20.7)    P[X = r] = μ{r} = e^{−λ} λ^r/r!,    r = 0, 1, ....
A constant c can be regarded as a discrete random variable with X(ω) = c. In this case P[X = c] = μ{c} = 1. For an artificial discrete example, let {x_1, x_2, ...} be an enumeration of the rationals and put
(20.8)    μ{x_n} = 2^{−n},    n = 1, 2, ...;

the point of the example is that the support need not be contained in a lattice.

A random variable and its distribution have density f with respect to Lebesgue measure if f is a nonnegative function on R^1 and

(20.9)    P[X ∈ A] = μ(A) = ∫_A f(x) dx,    A ∈ ℛ^1.
In other words, the requirement is that μ have a density with respect to λ in the sense of (16.11). The density is assumed to be with respect to λ if no other measure is specified. Taking A = R^1 in (20.9) shows that f must integrate to 1. Note that f is determined only to within a set of Lebesgue measure 0: if f = g except on a set of Lebesgue measure 0, then g can also serve as a density for X and μ.
It follows by Theorem 3.3 that (20.9) holds for every Borel set A if it holds for every interval, that is, if

F(b) − F(a) = ∫_a^b f(x) dx

holds for every a and b. Note that F need not differentiate to f everywhere (see (20.13), for example); all that is required is that f integrate properly, that is, that (20.9) hold. On the other hand, if F does differentiate to f and f is continuous, it follows by the fundamental theorem of calculus that f is indeed a density for F.†

For the exponential distribution with parameter α > 0, the density is

(20.10)    f(x) = 0 if x < 0,    f(x) = αe^{−αx} if x ≥ 0.

The corresponding distribution function

(20.11)    F(x) = 0 if x ≤ 0,    F(x) = 1 − e^{−αx} if x > 0

† The general question of the relation between differentiation and integration is taken up in Section 31.
was studied in Section 14. For the normal distribution with parameters (20.12)
m and σ, σ > 0,

(20.12)    f(x) = (1/(σ√(2π))) exp[−(x − m)²/(2σ²)],    −∞ < x < ∞;

a change of variable together with (18.10) shows that f does integrate to 1. For the standard normal distribution, m = 0 and σ = 1. For the uniform distribution over an interval (a, b],
l(x )
=
1
b-a
0
if
a < x s b,
otherwise.
The distribution function F is useful if it has a simple expression, as in (20.11). It is ordinarily simpler to describe μ by means of the density f(x) or the discrete probabilities μ{x_r}. If F comes from a density, it is continuous. In the discrete case, F increases in jumps; the example (20.8), in which the points of discontinuity
are dense, shows that it may nonetheless be very irregular. There exist distributions that are not discrete but are not continuous either. An example is μ(A) = ½μ_1(A) + ½μ_2(A) for μ_1 discrete and μ_2 coming from a density; such mixed cases arise, but they are few. Section 31 has examples of a more interesting kind, namely functions F that are continuous but do not come from any density. These are the functions singular in the sense of Lebesgue; the Q(x) describing bold play in gambling (see (7.33)) turns out to be one of them.

If X has distribution μ and g is a real function of a real variable, then

(20.14)    P[g(X) ∈ A] = P[X ∈ g^{−1}A] = μg^{−1}(A)

in the notation (13.7). Thus the distribution of g(X) is μg^{−1}.

In the case where there is a density, f and F are related by

(20.15)    F(x) = ∫_{−∞}^x f(t) dt.
Hence f at its continuity points must be the derivative of F. As noted above, if F has a continuous derivative, this derivative can serve as the density f.

Suppose that f is continuous and g is increasing, and let T = g^{−1}. The distribution function of g(X) is

P[g(X) ≤ x] = P[X ≤ T(x)] = F(T(x)).

If T is differentiable, this differentiates to

(20.16)    (d/dx) P[g(X) ≤ x] = f(T(x)) T′(x),

which is thus the density for g(X) (as follows also by (17.8)). If X has normal density (20.12) and a > 0, (20.16) shows that aX + b has the normal density with parameters am + b and aσ.

Calculating (20.16) is generally more complicated than calculating the distribution function of g(X) from first principles and then differentiating, which works even if g is many-to-one:

Example 20.1. If X has the standard normal distribution, then

P[X² ≤ x] = P[−√x ≤ X ≤ √x] = 2∫_0^{√x} (2π)^{−1/2} e^{−t²/2} dt

for x > 0. Hence X² has density

f(x) = 0 if x ≤ 0,    f(x) = (2π)^{−1/2} x^{−1/2} e^{−x/2} if x > 0. ∎
Multidimensional Distributions
For a k-dimensional random vector X = (X_1, ..., X_k), the distribution μ (a probability measure on ℛ^k) and the distribution function F (a real function on R^k) are defined by

(20.17)    μ(A) = P[(X_1, ..., X_k) ∈ A], A ∈ ℛ^k;    F(x_1, ..., x_k) = P[X_1 ≤ x_1, ..., X_k ≤ x_k] = μ(S_x),

where S_x = [y: y_i ≤ x_i, i = 1, ..., k] consists of the points "southwest" of x. Often μ and F are called the joint distribution and joint distribution function of X_1, ..., X_k.

Now F is nondecreasing in each variable, and Δ_A F ≥ 0 for bounded rectangles A (see (12.12)). As h decreases to 0, the set S_x^h = [y: y_i ≤ x_i + h, i = 1, ..., k] decreases to S_x, and therefore (Theorem 2.1(ii)) F is continuous from above in the sense that lim_{h↓0} F(x_1 + h, ..., x_k + h) = F(x_1, ..., x_k). Further, F(x_1, ..., x_k) → 0 if x_i → −∞ for some i (the other coordinates held fixed), and F(x_1, ..., x_k) → 1 if x_i → ∞ for each i. For any F with these properties there is by Theorem 12.5 a unique probability measure μ on ℛ^k such that μ(A) = Δ_A F for bounded rectangles A, and μ(S_x) = F(x) for all x.

As h decreases to 0, S_x^{−h} increases to the interior S_x° = [y: y_i < x_i, i = 1, ..., k] of S_x, and so

(20.18)    lim_{h↓0} F(x_1 − h, ..., x_k − h) = μ(S_x°).

Since F is nondecreasing in each variable, it is continuous at x if and only if it is continuous from below there in the sense that this last limit coincides with F(x). Thus F is continuous at x if and only if F(x) = μ(S_x) = μ(S_x°), which holds if and only if the boundary ∂S_x = S_x − S_x° (the y-set where y_i ≤ x_i for all i and y_i = x_i for some i) satisfies μ(∂S_x) = 0. If k > 1, F can have discontinuity points even if μ has no point masses: if μ corresponds to a uniform distribution of mass over the segment B = [(x, 0): 0 ≤ x ≤ 1] in the plane (μ(A) = λ[x: 0 ≤ x ≤ 1, (x, 0) ∈ A]), then F is discontinuous at each point of B. This also shows that F can be discontinuous at uncountably many points. On the other hand, for fixed x the boundaries ∂S_{x+h} are disjoint for different values of h, and so (Theorem 10.2(iv)) only countably many of them can have positive μ-measure. Thus x is the limit of points (x_1 + h, ..., x_k + h) at which F is continuous: the continuity points of F are dense.

There is always a random vector having a given distribution and distribution function: Take (Ω, 𝓕, P) = (R^k, ℛ^k, μ) and X(ω) = ω. This is the obvious extension of the construction in the first proof of Theorem 14.1.

The distribution may as for the line be discrete in the sense of having countable support. It may have density f with respect to k-dimensional Lebesgue measure: μ(A) = ∫_A f(x) dx. As in the case k = 1, the distribution μ is more fundamental than the distribution function F, and usually μ is described not by F but by a density or by discrete probabilities.

If X is a k-dimensional random vector and g: R^k → R^i is measurable, then g(X) is an i-dimensional random vector; if the distribution of X is μ, the distribution of g(X) is μg^{−1}, just as in the case k = 1 (see (20.14)). If g_j: R^k → R^1 is defined by g_j(x_1, ..., x_k) = x_j, then g_j(X) is X_j, and its distribution μ_j = μg_j^{−1} is given by μ_j(A) = μ[(x_1, ..., x_k): x_j ∈ A] = P[X_j ∈ A] for A ∈ ℛ^1. The μ_j are the marginal distributions of μ. If μ has a density f in R^k, then μ_j has over the line the density
(20.19)    f_j(x) = ∫_{R^{k−1}} f(x_1, ..., x_{j−1}, x, x_{j+1}, ..., x_k) dx_1 ⋯ dx_{j−1} dx_{j+1} ⋯ dx_k,

since by Fubini's theorem the right side integrated over A comes to μ[(x_1, ..., x_k): x_j ∈ A].

Now suppose that g: U → V is one-to-one, where U and V are open subsets of R^k. Suppose that T = g^{−1} is continuously differentiable, and let J(x) be its Jacobian. If X has a density f that vanishes outside U, then P[g(X) ∈ A] = P[X ∈ TA] = ∫_{TA} f(y) dy, and so by (17.13),
(20.20)    P[g(X) ∈ A] = ∫_A f(Tx)|J(x)| dx

for A ⊂ V. Thus the density for g(X) is f(Tx)|J(x)| on V and 0 elsewhere.†

† Although (20.20) is indispensable in many investigations, it is used in this book only for linear T and in Example 20.2; and here different approaches are possible; see (17.14) and Problem 18.24.
Example 20.2. Suppose that (X_1, X_2) has the standard normal density

f(x_1, x_2) = (2π)^{−1} exp[−(x_1² + x_2²)/2],

and let g be the transformation to polar coordinates. Then U = R², and V and T are as in Example 17.6. If R and Θ are the polar coordinates of (X_1, X_2), then (R, Θ) = g(X_1, X_2) has density (2π)^{−1} r e^{−r²/2} in V. By (20.19), R has density re^{−r²/2} on (0, ∞) and Θ is uniformly distributed over (0, 2π). ∎
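Example 20.2 is easy to confirm by simulation (an illustration, not from the text): R should have distribution function 1 − e^{−r²/2}, the integral of re^{−r²/2}, and Θ should be uniform on (0, 2π):

```python
# Sketch: Monte Carlo check of the polar-coordinate densities of
# Example 20.2 for a pair of independent standard normals.
import math
import random

random.seed(7)
N = 200_000
rs, thetas = [], []
for _ in range(N):
    x1, x2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    rs.append(math.hypot(x1, x2))                 # R
    thetas.append(math.atan2(x2, x1) % (2 * math.pi))  # Theta in [0, 2*pi)

for r in [0.5, 1.0, 2.0]:
    empirical = sum(s <= r for s in rs) / N
    assert abs(empirical - (1 - math.exp(-r * r / 2))) < 0.01

for t in [1.0, 3.0, 5.0]:
    empirical = sum(s <= t for s in thetas) / N
    assert abs(empirical - t / (2 * math.pi)) < 0.01
```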
Independence
Random variables X_1, ..., X_k are defined to be independent if the σ-fields σ(X_1), ..., σ(X_k) they generate are independent in the sense of Section 4. This concept for simple random variables was studied extensively in Chapter 1; the general case was touched on in Section 14. Since σ(X_i) consists of the sets [X_i ∈ H] for H ∈ ℛ^1, X_1, ..., X_k are independent if and only if

(20.21)    P[X_1 ∈ H_1, ..., X_k ∈ H_k] = P[X_1 ∈ H_1] ⋯ P[X_k ∈ H_k]

for all linear Borel sets H_1, ..., H_k. The definition (4.11) of independence requires that (20.21) hold also if some of the events [X_i ∈ H_i] are suppressed on each side, but this only means taking H_i = R^1. Suppose that

(20.22)    P[X_1 ≤ x_1, ..., X_k ≤ x_k] = P[X_1 ≤ x_1] ⋯ P[X_k ≤ x_k]

for all real x_1, ..., x_k; it then also holds if some of the events [X_i ≤ x_i] are suppressed on each side (let x_i → ∞). Since the intervals (−∞, x] form a π-system generating ℛ^1, the sets [X_i ≤ x] form a π-system generating σ(X_i). Therefore, by Theorem 4.2, (20.22) implies that X_1, ..., X_k are independent. If, for example, the X_i are integer-valued, it is enough that P[X_1 = n_1, ..., X_k = n_k] = P[X_1 = n_1] ⋯ P[X_k = n_k] for integral n_1, ..., n_k (see (5.6)).

Let (X_1, ..., X_k) have distribution μ and distribution function F, and let the X_i have distributions μ_i and distribution functions F_i (the marginals). By (20.21), X_1, ..., X_k are independent if and only if μ is product measure in the sense of Section 18:

(20.23)    μ = μ_1 × ⋯ × μ_k.
By (20.22), X_1, ..., X_k are independent if and only if

(20.24)    F(x_1, ..., x_k) = F_1(x_1) ⋯ F_k(x_k).

Suppose that each μ_i has density f_i; by Fubini's theorem, f_1(y_1) ⋯ f_k(y_k) integrated over (−∞, x_1] × ⋯ × (−∞, x_k] is just F_1(x_1) ⋯ F_k(x_k), so that μ has density

(20.25)    f(x_1, ..., x_k) = f_1(x_1) ⋯ f_k(x_k)

in the case of independence.

If 𝓖_1, ..., 𝓖_k are independent σ-fields and X_i is measurable 𝓖_i, i = 1, ..., k, then certainly X_1, ..., X_k are independent.

If X_i is a d_i-dimensional random vector, i = 1, ..., k, then X_1, ..., X_k are by definition independent if the σ-fields σ(X_1), ..., σ(X_k) are independent. The theory is just as for random variables: X_1, ..., X_k are independent if and only if (20.21) holds for H_1 ∈ ℛ^{d_1}, ..., H_k ∈ ℛ^{d_k}. Now (X_1, ..., X_k) can be regarded as a random vector of dimension d = Σ_{i=1}^k d_i; if μ is its distribution in R^d = R^{d_1} × ⋯ × R^{d_k} and μ_i is the distribution of X_i in R^{d_i}, then, just as before, X_1, ..., X_k are independent if and only if μ = μ_1 × ⋯ × μ_k. In none of this need the d_i components of a single X_i be themselves independent random variables.

An infinite collection of random variables or random vectors is by definition independent if each finite subcollection is. The argument following (5.7) extends from collections of simple random variables to collections of random vectors:
Theorem 20.2. Suppose that

(20.26)    X_{11}, X_{12}, ...
           X_{21}, X_{22}, ...
           ⋯

is an independent collection of random vectors. If 𝓕_i is the σ-field generated by the ith row, then 𝓕_1, 𝓕_2, ... are independent.

PROOF. Let 𝓐_i consist of the finite intersections of sets of the form [X_{ij} ∈ H] with H a Borel set in a space of the appropriate dimension, and apply Theorem 4.2. The σ-fields 𝓕_i = σ(𝓐_i), i = 1, ..., n, are independent for each n, and the result follows. ∎
Each row of (20.26) may be finite or infinite, and there may be finitely or infinitely many rows. As a matter of fact, rows may be uncountable and there may be uncountably many of them.
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS 269
Suppose that $X$ and $Y$ are independent random vectors with distributions $\mu$ and $\nu$ in $R^j$ and $R^k$. Then $(X,Y)$ has distribution $\mu\times\nu$ in $R^j\times R^k = R^{j+k}$. Let $x$ range over $R^j$ and $y$ over $R^k$. By Fubini's theorem,

$$(\mu\times\nu)(B) = \int_{R^j}\nu[y\colon (x,y)\in B]\,\mu(dx), \qquad B\in\mathscr{R}^{j+k}. \tag{20.27}$$

Replace $B$ by $(A\times R^k)\cap B$, where $A\in\mathscr{R}^j$ and $B\in\mathscr{R}^{j+k}$. Then (20.27) reduces to

$$(\mu\times\nu)\bigl((A\times R^k)\cap B\bigr) = \int_A \nu[y\colon (x,y)\in B]\,\mu(dx). \tag{20.28}$$

If $B_x = [y\colon (x,y)\in B]$ is the $x$-section of $B$, so that $B_x\in\mathscr{R}^k$ (Theorem 18.1), then $P[(x,Y)\in B] = P[\omega\colon (x,Y(\omega))\in B] = P[\omega\colon Y(\omega)\in B_x] = \nu(B_x)$. Expressing the formulas in terms of the random vectors themselves gives this result:

Theorem 20.3. If $X$ and $Y$ are independent random vectors with distributions $\mu$ and $\nu$ in $R^j$ and $R^k$, then

$$P[(X,Y)\in B] = \int_{R^j} P[(x,Y)\in B]\,\mu(dx), \qquad B\in\mathscr{R}^{j+k}, \tag{20.29}$$

and

$$P[X\in A,\ (X,Y)\in B] = \int_A P[(x,Y)\in B]\,\mu(dx), \qquad A\in\mathscr{R}^j,\ B\in\mathscr{R}^{j+k}. \tag{20.30}$$
Example 20.3. Suppose that $X$ and $Y$ are independent, exponentially distributed random variables. By (20.29), $P[Y/X > z] = \int_0^\infty P[Y/x > z]\,\alpha e^{-\alpha x}\,dx = \int_0^\infty e^{-\alpha xz}\,\alpha e^{-\alpha x}\,dx = (1+z)^{-1}$. Thus $Y/X$ has density $(1+z)^{-2}$ for $z>0$. Since $P[X>z_1,\ Y/X>z_2] = \int_{z_1}^\infty P[Y/x>z_2]\,\alpha e^{-\alpha x}\,dx$ by (20.30), the joint distribution of $X$ and $Y/X$ can be calculated as well. ∎
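The computation in Example 20.3 is easy to corroborate by simulation; the following sketch (the rate $\alpha$ and the level $z$ are arbitrary choices) compares a Monte Carlo estimate of $P[Y/X > z]$ with $(1+z)^{-1}$:

```python
import random

random.seed(1)
alpha = 2.0      # the common exponential rate; the answer does not depend on it
z = 1.5
n = 200_000
count = 0
for _ in range(n):
    x = random.expovariate(alpha)
    y = random.expovariate(alpha)
    if y / x > z:
        count += 1
est = count / n
exact = 1 / (1 + z)   # P[Y/X > z] = (1 + z)^{-1} from the example
print(est, exact)
```

With 200,000 trials the estimate should match the exact value to two decimal places.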
Formulas (20.29) and (20.30) are constantly applied as in this example. There is no virtue in making an issue of each case, however, and the appeal to Theorem 20.3 is usually silent.
Example 20.4. Here is a more complicated argument of the same sort. Let $X_1,\dots,X_n$ be independent random variables, each uniformly distributed over $[0,t]$. Let $Y_k$ be the $k$th smallest among the $X_i$, so that $0\le Y_1\le\cdots\le Y_n\le t$. The $X_i$ divide $[0,t]$ into $n+1$ subintervals of lengths $Y_1,\ Y_2-Y_1,\ \dots,\ Y_n-Y_{n-1},\ t-Y_n$; let $M$ be the maximum of these lengths. Define $\psi_n(t,\alpha) = P[M\le\alpha]$. The problem is to show that

$$\psi_n(t,\alpha) = \sum_{k=0}^{n+1}(-1)^k\binom{n+1}{k}\left(\Bigl(1-\frac{k\alpha}{t}\Bigr)_+\right)^n, \tag{20.31}$$

where $x_+ = (x+|x|)/2$ denotes positive part.

Separate consideration of the possibilities $0\le\alpha<t/2$, $t/2\le\alpha<t$, and $t\le\alpha$ disposes of the case $n=1$. Suppose it is shown that the probability $\psi_n(t,\alpha)$ satisfies the recursion

$$\psi_n(t,\alpha) = n\int_0^\alpha \psi_{n-1}(t-x,\alpha)\Bigl(1-\frac{x}{t}\Bigr)^{n-1}\frac{1}{t}\,dx. \tag{20.32}$$

Now (as follows by an integration together with Pascal's identity for binomial coefficients) the right side of (20.31) satisfies this same recursion, and so it will follow by induction that (20.31) holds for all $n$.

In intuitive form, the argument for (20.32) is this: If $[M\le\alpha]$ is to hold, the smallest of the $X_i$ must have some value $x$ in $[0,\alpha]$. If $X_1$ is the smallest of the $X_i$, then $X_2,\dots,X_n$ must all lie in $[x,t]$ and divide it into subintervals of length at most $\alpha$; the probability of this is $(1-x/t)^{n-1}\psi_{n-1}(t-x,\alpha)$, because $X_2,\dots,X_n$ have probability $(1-x/t)^{n-1}$ of all lying in $[x,t]$, and if they do, they are independent and uniformly distributed there. Now (20.32) results from integrating with respect to the density $1/t$ for $X_1$ and multiplying by $n$ to account for the fact that any of $X_1,\dots,X_n$ may be the smallest.

To make this argument rigorous, apply (20.30) for $j=1$ and $k=n-1$. Let $A$ be the interval $[0,\alpha]$, and let $B$ consist of the points $(x_1,\dots,x_n)$ for which $0\le x_i\le t$, $x_1$ is the minimum of $x_1,\dots,x_n$, and $x_2,\dots,x_n$ divide $[x_1,t]$ into subintervals of length at most $\alpha$. Then $P[X_1=\min X_i,\ M\le\alpha] = P[X_1\in A,\ (X_1,\dots,X_n)\in B]$. Take $X_1$ for $X$ and $(X_2,\dots,X_n)$ for $Y$ in (20.30). Since $X_1$ has density $1/t$,

$$P[X_1=\min X_i,\ M\le\alpha] = \int_0^\alpha P[(x,X_2,\dots,X_n)\in B]\,\frac{dx}{t}. \tag{20.33}$$

If $C$ is the event that $x\le X_i\le t$ for $2\le i\le n$, then $P(C) = (1-x/t)^{n-1}$. A simple calculation shows that $P[X_i-x\le s_i,\ 2\le i\le n \mid C] = \prod_{i=2}^n\bigl(s_i/(t-x)\bigr)$; in other words, given $C$, the variables $X_2-x,\dots,X_n-x$ are conditionally independent and uniformly distributed over $[0,t-x]$. Now $(X_2,\dots,X_n)$ are random variables on some probability space $(\Omega,\mathscr{F},P)$; replacing $P$ by $P(\cdot\mid C)$ shows that the integrand in (20.33) is the same as that in (20.32). The same argument holds with the index 1 replaced by any $k$ $(1\le k\le n)$, which gives (20.32). (The events $[X_k=\min X_i,\ M\le\alpha]$ are not disjoint, but any two intersect in a set of probability 0.) ∎
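Formula (20.31) lends itself to a simulation check. The sketch below (sample size and the values $n=5$, $\alpha=0.35$ are arbitrary choices) compares a Monte Carlo estimate of $P[M\le\alpha]$ for $t=1$ with the right side of (20.31):

```python
import math
import random

random.seed(3)

def psi_exact(n, t, a):
    # right side of (20.31): sum over k of (-1)^k C(n+1,k) ((1 - k a/t)_+)^n
    total = 0.0
    for k in range(n + 2):
        pos = max(1 - k * a / t, 0.0)
        total += (-1) ** k * math.comb(n + 1, k) * pos ** n
    return total

def max_spacing(n):
    # maximum of the n+1 subinterval lengths cut out of [0, 1] by n uniforms
    pts = sorted(random.random() for _ in range(n))
    gaps = [pts[0]] + [b - a for a, b in zip(pts, pts[1:])] + [1 - pts[-1]]
    return max(gaps)

n, a, trials = 5, 0.35, 100_000
hits = sum(1 for _ in range(trials) if max_spacing(n) <= a)
est = hits / trials
exact = psi_exact(n, 1.0, a)
print(est, exact)
```

For $n=1$ the formula can also be checked by hand: $P[M\le 0.6] = P[0.4\le X_1\le 0.6] = 0.2$, which `psi_exact(1, 1.0, 0.6)` reproduces exactly.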
Sequences of Random Variables

Theorem 5.2 extends to general distributions $\mu_n$.

Theorem 20.4. If $\{\mu_n\}$ is a finite or infinite sequence of probability measures on $\mathscr{R}^1$, there exists on some probability space $(\Omega,\mathscr{F},P)$ an independent sequence $\{X_n\}$ of random variables such that $X_n$ has distribution $\mu_n$.
PROOF. By Theorem 5.2 there exists on some probability space an independent sequence $Z_1, Z_2,\dots$ of random variables assuming the values 0 and 1 with probabilities $P[Z_n=0] = P[Z_n=1] = \frac12$. As a matter of fact, Theorem 5.2 is not needed: take the space to be the unit interval and the $Z_n(\omega)$ to be the digits of the dyadic expansion of $\omega$, the functions $d_n(\omega)$ of Sections 1 and 4. Relabel the countably many random variables $Z_n$ so that they form a double array,

$$\begin{matrix} Z_{11}, & Z_{12}, & \dots \\ Z_{21}, & Z_{22}, & \dots \\ \vdots & \vdots \end{matrix}$$

All the $Z_{nk}$ are independent. Put $U_n = \sum_{k=1}^\infty Z_{nk}2^{-k}$. The series certainly converges, and $U_n$ is a random variable by Theorem 13.4. Further, $U_1, U_2,\dots$ is, by Theorem 20.2, an independent sequence.

Now $P[Z_{ni}=z_i,\ 1\le i\le k] = 2^{-k}$ for each sequence $z_1,\dots,z_k$ of 0's and 1's; hence the $2^k$ possible values $j2^{-k}$, $0\le j<2^k$, of $S_{nk} = \sum_{i=1}^k Z_{ni}2^{-i}$ all have probability $2^{-k}$. If $0\le x<1$, the number of the $j2^{-k}$ that lie in $[0,x]$ is $[2^kx]+1$, the brackets indicating integral part, and therefore $P[S_{nk}\le x] = ([2^kx]+1)/2^k$. Since $S_{nk}(\omega)\uparrow U_n(\omega)$ as $k\uparrow\infty$, $[S_{nk}\le x]\downarrow[U_n\le x]$ as $k\uparrow\infty$, and so $P[U_n\le x] = \lim_k P[S_{nk}\le x] = \lim_k ([2^kx]+1)/2^k = x$ for $0\le x<1$. Thus $U_n$ is uniformly distributed over the unit interval.

The construction thus far establishes the existence of an independent sequence of random variables $U_n$ each uniformly distributed over $[0,1]$. Let $F_n$ be the distribution function corresponding to $\mu_n$, and put $\varphi_n(u) = \inf[x\colon u\le F_n(x)]$ for $0<u<1$. This is the inverse used in Section 14; see (14.5). Set $\varphi_n(u)=0$, say, for $u$ outside $(0,1)$, and put $X_n(\omega) = \varphi_n(U_n(\omega))$. Since $\varphi_n(u)\le x$ if and only if $u\le F_n(x)$ (see the argument following (14.5)), $P[X_n\le x] = P[U_n\le F_n(x)] = F_n(x)$. Thus $X_n$ has distribution function $F_n$. And by Theorem 20.2, $X_1, X_2,\dots$ are independent. ∎
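The last step of the proof, $X_n(\omega) = \varphi_n(U_n(\omega))$, is the familiar quantile-transform method of simulation. A minimal sketch, with the exponential distribution function standing in for $F_n$ so that $\varphi_n$ has a closed form (the rate and sample size are arbitrary choices):

```python
import math
import random

random.seed(11)

def quantile_exp(u, alpha=1.0):
    # phi(u) = inf[x: u <= F(x)] for F(x) = 1 - exp(-alpha x); since F is
    # continuous and strictly increasing on [0, inf), phi is the ordinary inverse
    return -math.log(1 - u) / alpha

# U uniform on (0,1)  =>  X = phi(U) has distribution function F
sample = [quantile_exp(random.random(), alpha=2.0) for _ in range(200_000)]
mean = sum(sample) / len(sample)
print(mean)   # the exponential(2) distribution has mean 1/2
```

The sample mean should be near $1/2$, as the exponential distribution with rate 2 requires.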
This theorem of course includes Theorem 5.2 as a special case, and its proof does not depend on the earlier result. Theorem 20.4 is a special case of Kolmogorov's existence theorem in Section 36.

Convolution
Let $X$ and $Y$ be independent random variables with distributions $\mu$ and $\nu$. Apply (20.27) and (20.29) to the planar set $B = [(x,y)\colon x+y\in H]$ with $H\in\mathscr{R}^1$:

$$P[X+Y\in H] = \int_{-\infty}^\infty \nu(H-x)\,\mu(dx) = \int_{-\infty}^\infty P[Y\in H-x]\,\mu(dx). \tag{20.34}$$

The convolution of $\mu$ and $\nu$ is the measure $\mu*\nu$ defined by

$$(\mu*\nu)(H) = \int_{-\infty}^\infty \nu(H-x)\,\mu(dx), \qquad H\in\mathscr{R}^1. \tag{20.35}$$

If $X$ and $Y$ are independent and have distributions $\mu$ and $\nu$, (20.34) shows that $X+Y$ has distribution $\mu*\nu$. Since addition of random variables is commutative and associative, the same is true of convolution: $\mu*\nu = \nu*\mu$ and $\mu*(\nu*\eta) = (\mu*\nu)*\eta$.

If $F$ and $G$ are the distribution functions corresponding to $\mu$ and $\nu$, the distribution function corresponding to $\mu*\nu$ is denoted $F*G$. Taking $H = (-\infty,y]$ in (20.35) shows that

$$(F*G)(y) = \int_{-\infty}^\infty G(y-x)\,dF(x). \tag{20.36}$$

(See (17.16) for the notation $dF(x)$.) If $G$ has density $g$, then $G(y-x) = \int_{-\infty}^{y-x} g(s)\,ds = \int_{-\infty}^y g(t-x)\,dt$, and so the right side of (20.36) is $\int_{-\infty}^y\bigl[\int_{-\infty}^\infty g(t-x)\,dF(x)\bigr]dt$ by Fubini's theorem. Thus $F*G$ has density $F*g$, where

$$(F*g)(y) = \int_{-\infty}^\infty g(y-x)\,dF(x); \tag{20.37}$$

this holds if $G$ has density $g$. If, in addition, $F$ has density $f$, (20.37) is denoted $f*g$ and reduces by (16.12) to

$$(f*g)(y) = \int_{-\infty}^\infty g(y-x)f(x)\,dx. \tag{20.38}$$
This defines convolution for densities, and $\mu*\nu$ has density $f*g$ if $\mu$ and $\nu$ have densities $f$ and $g$. The formula (20.38) can be used for many explicit calculations.

Example 20.5. Let $X_1,\dots,X_k$ be independent random variables, each with the exponential density (20.10). Define $g_k$ by

$$g_k(x) = \alpha\frac{(\alpha x)^{k-1}}{(k-1)!}e^{-\alpha x}, \qquad x>0,\ k=1,2,\dots; \tag{20.39}$$

put $g_k(x)=0$ for $x\le 0$. Now

$$(g_{k-1}*g_1)(y) = \int_0^y g_{k-1}(y-x)g_1(x)\,dx,$$

which reduces to $g_k(y)$. Thus $g_k = g_{k-1}*g_1$, and since $g_1$ coincides with (20.10), it follows by induction that the sum $X_1+\cdots+X_k$ has density $g_k$. The corresponding distribution function is

$$G_k(x) = 1-\sum_{i=0}^{k-1}e^{-\alpha x}\frac{(\alpha x)^i}{i!}, \qquad x>0, \tag{20.40}$$

as follows by differentiation. ∎
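Example 20.5 can be checked by simulation: sum $k$ independent exponential variables and compare the empirical distribution with the distribution function $G_k(x) = 1-\sum_{i<k}e^{-\alpha x}(\alpha x)^i/i!$ of the example. The parameters below ($\alpha = 1.5$, $k = 3$, the evaluation point $x_0$) are arbitrary choices.

```python
import math
import random

random.seed(5)
alpha, k = 1.5, 3

def G(x):
    # distribution function of the k-fold convolution of exponential(alpha)
    return 1 - sum(math.exp(-alpha * x) * (alpha * x) ** i / math.factorial(i)
                   for i in range(k))

n = 200_000
x0 = 2.0
hits = sum(1 for _ in range(n)
           if sum(random.expovariate(alpha) for _ in range(k)) <= x0)
est = hits / n
print(est, G(x0))   # agree up to Monte Carlo error
```

The empirical fraction and the closed-form value should agree to about two decimal places at this sample size.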
Example 20.6. Suppose that $X$ has the normal density (20.12) with $m=0$ and that $Y$ has the same density with $\tau$ in place of $\sigma$. If $X$ and $Y$ are independent, then $X+Y$ has density

$$\frac{1}{2\pi\sigma\tau}\int_{-\infty}^\infty \exp\left[-\frac{(y-x)^2}{2\tau^2}-\frac{x^2}{2\sigma^2}\right]dx.$$

A change of variable $u = x(\sigma^2+\tau^2)^{1/2}/\sigma\tau$ reduces this to

$$\frac{1}{\sqrt{2\pi(\sigma^2+\tau^2)}}\exp\left[-\frac{y^2}{2(\sigma^2+\tau^2)}\right].$$

Thus $X+Y$ has the normal density with $m=0$ and with $\sigma^2+\tau^2$ in place of $\sigma^2$. ∎
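Example 20.6 can be confirmed numerically: the convolution integral (20.38) for two centered normal densities should reproduce the normal density with variance $\sigma^2+\tau^2$. A crude Riemann-sum sketch (step size, truncation point, and the values of $\sigma$, $\tau$, $y$ are all ad hoc choices):

```python
import math

def normal_pdf(x, sigma):
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

sigma, tau = 1.0, 2.0

def convolved(y, h=0.001, span=30.0):
    # numerical version of (20.38): (f * g)(y) = integral of g(y - x) f(x) dx
    steps = int(2 * span / h)
    total = 0.0
    for i in range(steps):
        x = -span + i * h
        total += normal_pdf(y - x, tau) * normal_pdf(x, sigma) * h
    return total

y = 0.7
exact = normal_pdf(y, math.sqrt(sigma ** 2 + tau ** 2))
approx = convolved(y)
print(approx, exact)
```

For smooth, rapidly decaying integrands a step of $10^{-3}$ already gives several correct digits.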
If $\mu$ and $\nu$ are arbitrary finite measures on the line, their convolution $\mu*\nu$ is defined by (20.35) even if they are not probability measures.
Convergence in Probability

Random variables $X_n$ converge in probability to $X$, written $X_n\to_P X$, if

$$\lim_n P[|X_n-X|\ge\varepsilon] = 0 \tag{20.41}$$

for each positive $\varepsilon$.† If $X_n\to X$ with probability 1, then $\limsup_n[|X_n-X|\ge\varepsilon]$ has probability 0, and it follows by Theorem 4.1 that $X_n\to_P X$. But the converse does not hold: there exist sets $A_n$ such that $P(A_n)\to 0$ and $P(\limsup_n A_n) = 1$. For example, in the unit interval, let $A_1, A_2$ be the dyadic intervals of order one, $A_3,\dots,A_6$ those of order two, and so on (see (4.30) or (1.31)). If $X_n = I_{A_n}$ and $X=0$, then $X_n\to_P X$ but $P[\lim_n X_n = X] = 0$.
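The dyadic-interval example can be made concrete. The code below is a direct enumeration, not a simulation: it lists the intervals order by order and confirms that a fixed $\omega$ falls in one interval of every order even though the interval lengths, and hence the probabilities $P(A_n)$, go to 0.

```python
def dyadic_intervals(max_order):
    # A_1, A_2 = order-one intervals, A_3..A_6 = order-two intervals, and so on
    out = []
    for order in range(1, max_order + 1):
        m = 2 ** order
        out.extend((j / m, (j + 1) / m) for j in range(m))
    return out

omega = 0.3
intervals = dyadic_intervals(12)
hits = [i for i, (a, b) in enumerate(intervals) if a <= omega < b]
lengths = [b - a for (a, b) in intervals]
# P(A_n) is the interval length, which shrinks to 0; yet omega lies in
# exactly one interval of every order, so I_{A_n}(omega) = 1 infinitely often.
print(len(hits), min(lengths))
```

Here `len(hits)` equals the number of orders enumerated, while the smallest probability is $2^{-12}$: the indicators converge to 0 in probability but at no point $\omega$.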
Theorem 20.5. (i) If $X_n\to X$ with probability 1, then $X_n\to_P X$.
(ii) A necessary and sufficient condition for $X_n\to_P X$ is that each subsequence $\{X_{n_k}\}$ contain a further subsequence $\{X_{n_{k(i)}}\}$ such that $X_{n_{k(i)}}\to X$ with probability 1 as $i\to\infty$.

PROOF. The argument for (i) has been given. If $X_n\to_P X$, given $\{n_k\}$ choose a subsequence $\{n_{k(i)}\}$ so that $k\ge k(i)$ implies that $P[|X_{n_k}-X|\ge i^{-1}] < 2^{-i}$. By the first Borel-Cantelli lemma, there is probability 1 that $|X_{n_{k(i)}}-X| < i^{-1}$ for all but finitely many $i$. Therefore, $\lim_i X_{n_{k(i)}} = X$ with probability 1.

If $X_n$ does not converge to $X$ in probability, there is some positive $\varepsilon$ for which $P[|X_{n_k}-X|\ge\varepsilon]\ge\varepsilon$ holds along some sequence $\{n_k\}$. No subsequence of $\{X_{n_k}\}$ can converge in probability to $X$, and hence none can converge to $X$ with probability 1. ∎

It follows from (ii) that if $X_n\to_P X$ and $X_n\to_P Y$, then $X=Y$ with probability 1. It follows further that if $f$ is continuous and $X_n\to_P X$, then $f(X_n)\to_P f(X)$.

In nonprobabilistic contexts, convergence in probability becomes convergence in measure: If $f_n$ and $f$ are real measurable functions on a measure space $(\Omega,\mathscr{F},\mu)$, and if $\mu[\omega\colon |f(\omega)-f_n(\omega)|\ge\varepsilon]\to 0$ for each $\varepsilon>0$, then $f_n$ converges in measure to $f$.

†This is often expressed $\operatorname{p-lim}_n X_n = X$.
The Glivenko-Cantelli Theorem*

*This topic may be omitted.

The empirical distribution function for random variables $X_1,\dots,X_n$ is the distribution function $F_n(x,\omega)$ with a jump of $n^{-1}$ at each $X_k(\omega)$:

$$F_n(x,\omega) = \frac{1}{n}\sum_{k=1}^n I_{(-\infty,x]}(X_k(\omega)). \tag{20.42}$$

If the $X_k$ have a common unknown distribution function $F(x)$, then $F_n(x,\omega)$ is its natural estimate. The estimate has the right limiting behavior, according to the Glivenko-Cantelli theorem:

Theorem 20.6. Suppose that $X_1, X_2,\dots$ are independent and have a common distribution function $F$; put $D_n(\omega) = \sup_x|F_n(x,\omega)-F(x)|$. Then $D_n\to 0$ with probability 1.

For each $x$, $F_n(x,\omega)$ as a function of $\omega$ is a random variable. By right continuity, the supremum above is unchanged if $x$ is restricted to the rationals, and therefore $D_n$ is a random variable. The summands in (20.42) are independent, identically distributed simple random variables, and so by the strong law of large numbers (Theorem 6.1), for each $x$ there is a set $A_x$ of probability 0 such that

$$\lim_n F_n(x,\omega) = F(x) \tag{20.43}$$

for $\omega\notin A_x$. But Theorem 20.6 says more, namely that (20.43) holds for $\omega$ outside some set $A$ of probability 0, where $A$ does not depend on $x$; as there are uncountably many of the sets $A_x$, it is conceivable a priori that their union might necessarily have positive measure. Further, the convergence in (20.43) is uniform in $x$. Of course, the theorem implies that with probability 1 there is weak convergence $F_n(x,\omega)\Rightarrow F(x)$ in the sense of Section 14.

PROOF OF THE THEOREM. As already observed, the set $A_x$ where (20.43) fails has probability 0. Another application of the strong law of large numbers, with $I_{(-\infty,x)}$ in place of $I_{(-\infty,x]}$ in (20.42), shows that (see (20.5)) $\lim_n F_n(x-,\omega) = F(x-)$ except on a set $B_x$ of probability 0. Let $\varphi(u) =$
$\inf[x\colon u\le F(x)]$ for $0<u<1$ (see (14.5)), and put $x_{m,k} = \varphi(k/m)$, $m\ge 1$, $1\le k\le m$. It is not hard to see that $F(\varphi(u)-)\le u\le F(\varphi(u))$; hence $F(x_{m,k}-)-F(x_{m,k-1})\le m^{-1}$, $F(x_{m,1}-)\le m^{-1}$, and $F(x_{m,m})\ge 1-m^{-1}$. Let $D_{m,n}(\omega)$ be the maximum of the quantities $|F_n(x_{m,k},\omega)-F(x_{m,k})|$ and $|F_n(x_{m,k}-,\omega)-F(x_{m,k}-)|$ for $k=1,\dots,m$. If $x_{m,k-1}\le x<x_{m,k}$, then $F_n(x,\omega)\le F_n(x_{m,k}-,\omega)\le F(x_{m,k}-)+D_{m,n}(\omega)\le F(x)+m^{-1}+D_{m,n}(\omega)$ and $F_n(x,\omega)\ge F_n(x_{m,k-1},\omega)\ge F(x_{m,k-1})-D_{m,n}(\omega)\ge F(x)-m^{-1}-D_{m,n}(\omega)$. Together with similar arguments for the cases $x<x_{m,1}$ and $x\ge x_{m,m}$, this shows that

$$D_n(\omega)\le D_{m,n}(\omega)+m^{-1}. \tag{20.44}$$

If $\omega$ lies outside the union $A$ of all the $A_{x_{m,k}}$ and $B_{x_{m,k}}$, then $\lim_n D_{m,n}(\omega)=0$ and hence $\lim_n D_n(\omega)=0$ by (20.44). But $A$ has probability 0. ∎

PROBLEMS

20.1. 2.10 ↑ A necessary and sufficient condition for a $\sigma$-field $\mathscr{G}$ to be countably generated is that $\mathscr{G} = \sigma(X)$ for some random variable $X$. Hint: If $\mathscr{G} = \sigma(A_1,A_2,\dots)$, consider $X = \sum_{k=1}^\infty f(I_{A_k})/10^k$, where $f(x)$ is 4 for $x=0$ and 5 for $x\ne 0$.

20.2. If $X$ is a positive random variable with density $f$, then $X^{-1}$ has density $f(1/x)/x^2$. Prove this by (20.16) and by a direct argument.

20.3. Suppose that a two-dimensional distribution function $F$ has a continuous density $f$. Show that $f(x,y) = \partial^2F(x,y)/\partial x\,\partial y$.

20.4. The construction in Theorem 20.4 requires only Lebesgue measure on the unit interval. Use the theorem to prove the existence of Lebesgue measure $\lambda_k$ on $R^k$. First construct $\lambda_k$ restricted to $(-n,n]\times\cdots\times(-n,n]$, and then pass to the limit ($n\to\infty$). The idea is to argue from first principles, and not to use previous constructions, such as those in Theorems 12.5 and 18.2.

20.5. 19.9 ↑ Let $\Theta$ and $\Phi$ be the longitude and latitude of a random point on the surface of the unit sphere in $R^3$; probability is proportional to surface area. Show that $\Theta$ and $\Phi$ are independent, $\Theta$ is uniformly distributed over $[0,2\pi)$, and $\Phi$ is distributed over $[-\pi/2,+\pi/2]$ with density $\frac12\cos\varphi$.

20.6. Suppose that $A$, $B$, and $C$ are positive, independent random variables with distribution function $F$. Show that the quadratic $Az^2+Bz+C$ has real zeros with probability $\int_0^\infty\!\int_0^\infty F(x^2/4y)\,dF(x)\,dF(y)$.

20.7. Show that $X_1,X_2,\dots$ are independent if $\sigma(X_1,\dots,X_{n-1})$ and $\sigma(X_n)$ are independent for each $n$.

20.8. Let $X_0,X_1,\dots$ be a persistent, irreducible Markov chain, and for a fixed state $j$ let $T_1,T_2,\dots$ be the times of the successive passages through $j$. Let $Z_1 = T_1$ and $Z_n = T_n-T_{n-1}$, $n\ge 2$. Show that $Z_1,Z_2,\dots$ are independent and that $P[Z_n=k] = f_{jj}^{(k)}$ for $n\ge 2$.
• •
• • •
E
E
•
, •
•
.
•
•
•
•
•
•
•
• • •
20.13. Use Fubini's theorem to prove that convolution of finite measures is commutative and associative.

20.14. In Example 20.6, remove the assumption that $m$ is 0 for each of the normal densities.

20.15. Suppose that $X$ and $Y$ are independent and have densities. Use (20.20) to find the joint density for $(X+Y,X)$ and then use (20.19) to find the density for $X+Y$. Check with (20.38).

20.16. If $F(x-\varepsilon) < F(x+\varepsilon)$ for all positive $\varepsilon$, $x$ is a point of increase of $F$ (see Problem 12.9). If $F(x-) < F(x)$, then $x$ is an atom of $F$.
(a) Show that, if $x$ and $y$ are points of increase of $F$ and $G$, then $x+y$ is a point of increase of $F*G$.
(b) Show that, if $x$ and $y$ are atoms of $F$ and $G$, then $x+y$ is an atom of $F*G$.

20.17. Suppose that $\mu$ and $\nu$ consist of masses $\alpha_n$ and $\beta_n$ at $n$, $n=0,1,2,\dots$. Show that $\mu*\nu$ consists of a mass of $\sum_{k=0}^n\alpha_k\beta_{n-k}$ at $n$, $n=0,1,2,\dots$.

20.18. ↑ Show that two Poisson distributions (the parameters may differ) convolve to a Poisson distribution.

20.19. Show that, if $X_1,\dots,X_n$ are independent and have the normal distribution (20.12) with $m=0$, then the same is true of $(X_1+\cdots+X_n)/\sqrt{n}$.
20.20. The Cauchy distribution has density

$$c_u(x) = \frac{1}{\pi}\cdot\frac{u}{u^2+x^2}, \qquad -\infty<x<\infty, \tag{20.45}$$

for $u>0$. (By (17.10), the density integrates to 1.)
(a) Show that $c_u*c_v = c_{u+v}$. Hint: Expand the convolution integrand in partial fractions.
(b) Show that, if $X_1,\dots,X_n$ are independent and have density $c_u$, then $(X_1+\cdots+X_n)/n$ has density $c_u$ as well. Compare with the convolution law in Problem 20.19.

20.21. ↑ (a) Show that, if $X$ and $Y$ are independent and have the standard normal density, then $X/Y$ has the Cauchy density with $u=1$.
(b) Show that, if $X$ has the uniform distribution over $(-\pi/2,\pi/2)$, then $\tan X$ has the Cauchy distribution with $u=1$.

20.22. 18.22 ↑ Let $X_1,\dots,X_n$ be independent, each having the standard normal distribution. Show that $\chi_n^2 = X_1^2+\cdots+X_n^2$ has density

$$\frac{1}{2^{n/2}\Gamma(n/2)}x^{(n/2)-1}e^{-x/2} \tag{20.46}$$

over $(0,\infty)$. This is called the chi-squared distribution with $n$ degrees of freedom.

20.23. ↑ The gamma distribution has density

$$f(x;\alpha,u) = \frac{\alpha^u}{\Gamma(u)}x^{u-1}e^{-\alpha x} \tag{20.47}$$

over $(0,\infty)$ for positive parameters $\alpha$ and $u$. Check that (20.47) integrates to 1. Show that

$$f(\cdot\,;\alpha,u)*f(\cdot\,;\alpha,v) = f(\cdot\,;\alpha,u+v). \tag{20.48}$$

Note that (20.46) is $f(x;\frac12,n/2)$, and from (20.48) deduce again that (20.46) is the density of $\chi_n^2$. Note that the exponential density (20.10) is $f(x;\alpha,1)$, and from (20.48) deduce (20.39) once again.

20.24. ↑ The beta function is defined for positive $u$ and $v$ by

$$B(u,v) = \int_0^1 x^{u-1}(1-x)^{v-1}\,dx.$$

From (20.48) deduce that

$$B(u,v) = \frac{\Gamma(u)\Gamma(v)}{\Gamma(u+v)}.$$

Show that

$$B(u,v) = \frac{(u-1)!\,(v-1)!}{(u+v-1)!}$$

for integral $u$ and $v$.

20.25. 20.23 ↑ Let $N,X_1,X_2,\dots$ be independent, where $P[N=n] = q^{n-1}p$, $n\ge 1$, and each $X_k$ has the exponential density $f(x;\alpha,1)$. Show that $X_1+\cdots+X_N$ has density $f(x;\alpha p,1)$.

20.26. Let $A_{nm}(\varepsilon) = [\,|Z_k-Z|<\varepsilon,\ n\le k\le m\,]$. Show that $Z_n\to Z$ with probability 1 if and only if $\lim_n\lim_m P(A_{nm}(\varepsilon)) = 1$ for all positive $\varepsilon$, whereas $Z_n\to_P Z$ if and only if $\lim_n P(A_{nn}(\varepsilon)) = 1$ for all positive $\varepsilon$.

20.27. (a) Suppose that $f\colon R^2\to R^1$ is continuous. Show that $X_n\to_P X$ and $Y_n\to_P Y$ imply $f(X_n,Y_n)\to_P f(X,Y)$.
(b) Show that addition and multiplication preserve convergence in probability.

20.28. Suppose that the sequence $\{X_n\}$ is fundamental in probability in the sense that for $\varepsilon$ positive there exists an $N_\varepsilon$ such that $P[|X_m-X_n|\ge\varepsilon]<\varepsilon$ for $m,n\ge N_\varepsilon$.
(a) Prove there is a subsequence $\{X_{n_k}\}$ and a random variable $X$ such that $\lim_k X_{n_k} = X$ with probability 1. Hint: Choose increasing $n_k$ such that $P[|X_m-X_n|\ge 2^{-k}]<2^{-k}$ for $m,n\ge n_k$. Analyze $P[|X_{n_{k+1}}-X_{n_k}|\ge 2^{-k}]$.
(b) Show that $X_n\to_P X$.

20.29. (a) Suppose that $X_1\le X_2\le\cdots$ and that $X_n\to_P X$. Show that $X_n\to X$ with probability 1.
(b) Show by example that in an infinite measure space functions can converge almost everywhere without converging in measure.

20.30. If $X_n\to 0$ with probability 1, then $n^{-1}\sum_{k=1}^n X_k\to 0$ with probability 1 by the standard theorem on Cesàro means [A30]. Show by example that this is not so if convergence with probability 1 is replaced by convergence in probability.

20.31. 2.17 ↑ (a) Show that in a discrete probability space convergence in probability is equivalent to convergence with probability 1.
(b) Show that discrete spaces are essentially the only ones where this equivalence holds: Suppose that $P$ has a nonatomic part in the sense that there is a set $A$ such that $P(A)>0$ and $P(\cdot\mid A)$ is nonatomic. Construct random variables $X_n$ such that $X_n\to_P 0$ but $X_n$ does not converge to 0 with probability 1.

20.32. 20.28 20.31 ↑ Let $d(X,Y)$ be the infimum of those positive $\varepsilon$ for which $P[|X-Y|\ge\varepsilon]\le\varepsilon$.
(a) Show that $d(X,Y)=0$ if and only if $X=Y$ with probability 1. Identify random variables that are equal with probability 1, and show that $d$ is a metric on the resulting space.
(b) Show that $X_n\to_P X$ if and only if $d(X_n,X)\to 0$.
(c) Show that the space is complete.
(d) Show that in general there is no metric $d_0$ on this space such that $X_n\to X$ with probability 1 if and only if $d_0(X_n,X)\to 0$.

SECTION 21. EXPECTED VALUES
Expected Value as Integral
The expected value of a random variable $X$ on $(\Omega,\mathscr{F},P)$ is the integral of $X$ with respect to the measure $P$:

$$E[X] = \int X\,dP = \int_\Omega X(\omega)\,P(d\omega).$$

All the definitions, conventions, and theorems of Chapter 3 apply. For nonnegative $X$, $E[X]$ is always defined (it may be infinite); for the general $X$, $E[X]$ is defined, or $X$ has an expected value, if at least one of $E[X^+]$ and $E[X^-]$ is finite, in which case $E[X] = E[X^+]-E[X^-]$; and $X$ is integrable if and only if $E[|X|]<\infty$. The integral $\int_A X\,dP$ over a set $A$ is defined, as before, as $E[I_AX]$. In the case of simple random variables, the definition reduces to that used in Sections 5 through 9.

Expected Values and Distributions
Suppose that $X$ has distribution $\mu$. If $g$ is a real function of a real variable, then by the change-of-variable formula (16.17),

$$E[g(X)] = \int_{-\infty}^\infty g(x)\,\mu(dx). \tag{21.1}$$

(In applying (16.17), replace $T\colon\Omega\to\Omega'$ by $X\colon\Omega\to R^1$, $\mu$ by $P$, $\mu T^{-1}$ by $\mu$, and $f$ by $g$.) This formula holds in the sense explained in Theorem 16.12: It holds in the nonnegative case, so that

$$E[|g(X)|] = \int_{-\infty}^\infty |g(x)|\,\mu(dx); \tag{21.2}$$

if one side is infinite, then so is the other. And if the two sides of (21.2) are finite, then (21.1) holds.

If $\mu$ is discrete and $\mu\{x_1,x_2,\dots\} = 1$, then (21.1) becomes (use Theorem 16.9)

$$E[g(X)] = \sum_r g(x_r)\,\mu\{x_r\}. \tag{21.3}$$

If $X$ has density $f$, then (21.1) becomes (use Theorem 16.10)

$$E[g(X)] = \int_{-\infty}^\infty g(x)f(x)\,dx. \tag{21.4}$$

If $F$ is the distribution function of $X$ and $\mu$, (21.1) can be written $E[g(X)] = \int_{-\infty}^\infty g(x)\,dF(x)$ in the notation (17.16).
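Formula (21.4) lends itself to a direct numerical check. In the sketch below the density is the exponential (20.10) with $\alpha=2$ (an arbitrary choice) and $g(x)=x^2$; the exact second moment of the exponential is $2/\alpha^2$.

```python
import math

alpha = 2.0

def f(x):
    # the exponential density (20.10)
    return alpha * math.exp(-alpha * x)

# E[X^2] by the change-of-variable formula (21.4), via a crude Riemann sum
h, upper = 0.0005, 40.0
steps = int(upper / h)
approx = sum((i * h) ** 2 * f(i * h) * h for i in range(steps))
exact = 2 / alpha ** 2     # the k-th moment of the exponential is k!/alpha^k
print(approx, exact)
```

The Riemann sum should reproduce $2/\alpha^2 = 0.5$ to three or four decimal places at this step size.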
Moments

By (21.2), $\mu$ and $F$ determine all the absolute moments of $X$:

$$E[|X|^k] = \int_{-\infty}^\infty |x|^k\,\mu(dx) = \int_{-\infty}^\infty |x|^k\,dF(x), \qquad k=1,2,\dots. \tag{21.5}$$

Since $j\le k$ implies that $|x|^j\le 1+|x|^k$, if $X$ has a finite absolute moment of order $k$, then it has finite absolute moments of orders $1,2,\dots,k-1$ as well. For each $k$ for which (21.5) is finite, $X$ has $k$th moment

$$E[X^k] = \int_{-\infty}^\infty x^k\,\mu(dx) = \int_{-\infty}^\infty x^k\,dF(x). \tag{21.6}$$
These quantities are also referred to as the moments of $\mu$ and of $F$. They can be computed via (21.3) and (21.4) in the appropriate circumstances.

Example 21.1. Consider the normal density (20.12) with $m=0$ and $\sigma=1$. For each $k$, $x^ke^{-x^2/2}$ goes to 0 exponentially as $x\to\pm\infty$, and so finite moments of all orders exist. Integration by parts shows that

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty x^ke^{-x^2/2}\,dx = \frac{k-1}{\sqrt{2\pi}}\int_{-\infty}^\infty x^{k-2}e^{-x^2/2}\,dx, \qquad k=2,3,\dots.$$

(Apply (18.16) to $g(x)=x^{k-2}$ and $f(x)=xe^{-x^2/2}$, and let $a\to-\infty$, $b\to\infty$.) Of course, $E[X]=0$ by symmetry and $E[X^0]=1$. It follows by induction that

$$E[X^{2k}] = 1\cdot3\cdot5\cdots(2k-1), \qquad k=1,2,\dots, \tag{21.7}$$

and that the odd moments all vanish. ∎

If the first two moments of $X$ are finite and $E[X]=m$, then just as in Section 5, the variance is

$$\mathrm{Var}[X] = E[(X-m)^2] = \int_{-\infty}^\infty (x-m)^2\,\mu(dx). \tag{21.8}$$

From Example 21.1 and a change of variable, it follows that a random variable with the normal density (20.12) has mean $m$ and variance $\sigma^2$.

Consider for nonnegative $X$ the relation

$$E[X] = \int_0^\infty P[X>t]\,dt = \int_0^\infty P[X\ge t]\,dt. \tag{21.9}$$

Since $P[X=t]$ can be positive for at most countably many values of $t$, the two integrands differ only on a set of Lebesgue measure 0, and hence the integrals are the same. For $X$ simple and nonnegative, (21.9) was proved in Section 5; see (5.25). For the general nonnegative $X$, let $X_n$ be simple random variables for which $0\le X_n\uparrow X$ (see (20.1)). By the monotone convergence theorem, $E[X_n]\uparrow E[X]$; moreover, $P[X_n>t]\uparrow P[X>t]$, and so $\int_0^\infty P[X_n>t]\,dt\uparrow\int_0^\infty P[X>t]\,dt$, again by the monotone convergence theorem. Since (21.9) holds for each $X_n$, a passage to the limit establishes (21.9) for $X$ itself. Note that both sides of (21.9) may be infinite.

If the integral on the right is finite, then $X$ is integrable. Replacing $X$ by $XI_{[X>a]}$ leads from (21.9) to

$$\int_{[X>a]}X\,dP = aP[X>a] + \int_a^\infty P[X>t]\,dt, \qquad a\ge 0. \tag{21.10}$$

As long as $a\ge 0$, this holds even if $X$ is not nonnegative.

Inequalities

Since the final term in (21.10) is nonnegative, $aP[X>a]\le\int_{[X>a]}X\,dP\le E[X]$. Thus

$$P[X\ge a]\le\frac{1}{a}\int_{[X\ge a]}X\,dP\le\frac{1}{a}E[X], \qquad a>0, \tag{21.11}$$

for nonnegative $X$. As in Section 5, there follows the inequality

$$P[|X|\ge a]\le\frac{1}{a^k}\int_{[|X|\ge a]}|X|^k\,dP\le\frac{1}{a^k}E[|X|^k], \qquad a>0. \tag{21.12}$$

It is the inequality between the two extreme terms here that usually goes under the name of Markov; but the left-hand inequality is often useful, too. As a special case there is Chebyshev's inequality,

$$P[|X-m|\ge a]\le\frac{1}{a^2}\mathrm{Var}[X] \tag{21.13}$$

($m=E[X]$).

Jensen's inequality

$$\varphi(E[X])\le E[\varphi(X)] \tag{21.14}$$

holds if $\varphi$ is convex on an interval containing the range of $X$ and if $X$ and $\varphi(X)$ both have expected values. To prove it, let $l(x)=ax+b$ be a supporting line through $(E[X],\varphi(E[X]))$, a line lying entirely under the graph of $\varphi$ [A33]. Then $aX(\omega)+b\le\varphi(X(\omega))$, so that $aE[X]+b\le E[\varphi(X)]$. But the left side of this inequality is $\varphi(E[X])$.

Hölder's inequality is

$$E[|XY|]\le E^{1/p}[|X|^p]\,E^{1/q}[|Y|^q], \qquad \frac{1}{p}+\frac{1}{q}=1. \tag{21.15}$$

For discrete random variables, this was proved in Section 5; see (5.31). For the general case, choose simple random variables $X_n$ and $Y_n$ satisfying $0\le|X_n|\uparrow|X|$ and $0\le|Y_n|\uparrow|Y|$; see (20.2). Then (5.31) and the monotone convergence theorem give (21.15). Notice that (21.15) implies that if $|X|^p$ and $|Y|^q$ are integrable, then so is $XY$.

Schwarz's inequality is the case $p=q=2$:

$$E[|XY|]\le E^{1/2}[X^2]\,E^{1/2}[Y^2]. \tag{21.16}$$

If $X$ and $Y$ have second moments, then $XY$ must have a first moment. The same reasoning shows that Lyapounov's inequality (5.33) carries over from the simple to the general case.
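The Markov and Chebyshev bounds are easy to compare with exact tail probabilities in a case where everything is in closed form. The sketch below uses the exponential distribution with $\alpha=1$ (an arbitrary choice), for which $E[X]=1$, $\mathrm{Var}[X]=1$, and $P[X\ge a]=e^{-a}$:

```python
import math

# Exponential with alpha = 1: E[X] = 1, Var[X] = 1, P[X >= a] = exp(-a).
def tail(a):
    return math.exp(-a)

# Markov's inequality ((21.12) with k = 1): P[X >= a] <= E[X]/a.
markov_ok = all(tail(a) <= 1.0 / a for a in [1.0, 2.0, 5.0])

# Chebyshev's inequality (21.13) around m = 1: P[|X - 1| >= a] <= Var[X]/a^2.
# For a > 1 the exact tail is P[X >= 1 + a] = exp(-(1 + a)).
cheb_ok = all(math.exp(-(1 + a)) <= 1.0 / a ** 2 for a in [1.5, 2.0, 3.0])
print(markov_ok, cheb_ok)
```

Both checks should print `True`; the bounds are typically far from tight, which is the usual price of their generality.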
Joint Integrals

The relation (21.1) extends to random vectors. Suppose that $(X_1,\dots,X_k)$ has distribution $\mu$ in $k$-space and $g\colon R^k\to R^1$. By Theorem 16.12,

$$E[g(X_1,\dots,X_k)] = \int_{R^k}g(x)\,\mu(dx), \tag{21.17}$$

with the usual provisos about infinite values. For example, $E[X_iX_j] = \int_{R^k}x_ix_j\,\mu(dx)$. If $E[X_i]=m_i$, the covariance of $X_i$ and $X_j$ is

$$E[(X_i-m_i)(X_j-m_j)] = \int_{R^k}(x_i-m_i)(x_j-m_j)\,\mu(dx).$$

Independence and Expected Value
Suppose that X and Y are independent. If they are also simple, then E [ XY] = E[ X]E[ Y], as proved in Section 5 -see (5.21). Define Xn by (20.2) and similarly define Yn = 1/ln( Y+ ) - 1/l n( Y-). Then Xn and Yn are independent and simple, so that E[ I XnYn l 1 = E[ 1 Xn 1 1 E [ I Yn l ], and 0 < 1 Xn 1 i l XI , 0 � I Yn l i I Yl . If X and Y are integrable, then E[ I XnYn l 1 = E [ 1 Xn 1 1 E [ I Yn 1 1 i E[ I XI ] · E[ l Yl ], and it follows by the monotone conver gence theorem that E[ I XYI ] < oo ; since XnYn -+ XY and I XnYnl < I XYI , it follows further by the dominated convergence theorem that E [ XY] = lim n E [ XnYn] = limnE[Xn]E[ Yn] = E[ X]E [ Y ]. Therefore, XY is integrable if X and Y are (which is by no means true for dependent random variables) and E[ XY] = E[ X]E[Y]. This argument obviously extends inductively: If X1 , , Xk are indepen dent and integrable, then the product xl . . . xk is also integrable and •
•
•
(21 .18) Suppose that fl1 and fl2 are independent a-fields, A lies in fl1 , X1 is measurable G1, and X2 is measurable fl2• Then /A X1 and X2 are indepen dent, so that (21 .18) gives (21 .19)
£x. x2 dP = £ X1 dP . E [ X2 ]
if the random variables are integrable. In particular, (21 .20)
SECTION
21 .
EXPECTED VALUES
285
From (21 .1 8) it follows just as for simple random variables (see (5.24)) that variances add for sums of independent random variables. It is even enough that the random variables be independent in pairs. Moment Generating Functions
The
moment generating function is defined as
(21 .21 )
M ( s ) = E [ e•X] =
J oooo e•xfl ( dx ) = J oooo e •x dF( x ) -
-
for all s for which this is finite (note that the integrand is nonnegative). Section 9 shows in the case of simple random variables the power of moment generating function methods. This function is also called the Laplace transform of p., especially in nonprobabilistic contexts. Now f0e sxp.(dx) is finite for s < 0, and if it is finite for a positive s, then it is finite for all smaller s. Together with the corresponding result for the left half-line, this shows that M(s) is defined on some interval containing 0. If X is nonnegative, this interval contains ( - oo, 0] and perhaps part of (0, oo ); if X is nonpositive, it contains [0, oo) and perhaps part of ( - oo , 0). It is possible that the interval consists of 0 alone; this happens for example, if p. is concentrated on the integers and p. { n } = p. { - n } = Cjn 2 for n = 1, 2, . . . . Suppose that M(s) is defined throughout an interval ( -s0, s0), where So > 0. Since elsxl < esx + e-sx and the latter function is integrable ,., for l s i < s0, so is Er- o l sx l kjk! = e lsxl . By the corollary to Theorem 16.7, p. has finite moments of all orders and
Thus M(s) has a Taylor expansion about 0 with positive radius of convergence if it is defined in some (−s0, s0), s0 > 0. If M(s) can somehow be calculated and expanded in a series Σ_k a_k s^k, and if the coefficients a_k can be identified, then, since a_k must coincide with E[X^k]/k!, the moments of X can be computed: E[X^k] = a_k k!. It also follows from the theory of Taylor expansions [A29] that a_k k! is the kth derivative M^{(k)}(s) evaluated at s = 0:

(21.23)    M^{(k)}(0) = E[X^k] = ∫_{−∞}^∞ x^k μ(dx).

This holds if M(s) exists in some neighborhood of 0.
Suppose now that M is defined in some neighborhood of s. If ν has density e^{sx}/M(s) with respect to μ (see (16.11)), then ν has moment generating function N(u) = M(s + u)/M(s) for u in some neighborhood of 0. But then by (21.23), N^{(k)}(0) = ∫_{−∞}^∞ x^k ν(dx) = ∫_{−∞}^∞ x^k e^{sx} μ(dx)/M(s), and since N^{(k)}(0) = M^{(k)}(s)/M(s),

(21.24)    M^{(k)}(s) = ∫_{−∞}^∞ x^k e^{sx} μ(dx).

This holds as long as the moment generating function exists in some neighborhood of s. If s = 0, this gives (21.23) again. Taking k = 2 shows that M(s) is convex in its interval of definition.
Example 21.2. For the standard normal density,

M(s) = (1/√(2π)) ∫_{−∞}^∞ e^{sx} e^{−x²/2} dx = e^{s²/2} (1/√(2π)) ∫_{−∞}^∞ e^{−(x−s)²/2} dx,

and a change of variable gives

(21.25)    M(s) = e^{s²/2}.

The moment generating function is in this case defined for all s. Since e^{s²/2} has the expansion

e^{s²/2} = Σ_{k=0}^∞ (1/k!)(s²/2)^k = Σ_{k=0}^∞ (1·3 ··· (2k−1)/(2k)!) s^{2k},

the moments can be read off from (21.22), which proves (21.7) once more. •
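The even moments 1·3···(2k−1) can also be checked numerically. The snippet below is an added illustration, not part of the text: it integrates x^{2k} against the standard normal density with a crude midpoint rule (the grid bounds and step count are arbitrary choices).

```python
import math

def normal_moment(k, lo=-12.0, hi=12.0, n=200_000):
    # midpoint Riemann sum of x^k times the standard normal density
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += x ** k * math.exp(-x * x / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)

# E[X^2] = 1, E[X^4] = 1*3 = 3, E[X^6] = 1*3*5 = 15
assert abs(normal_moment(2) - 1.0) < 1e-6
assert abs(normal_moment(4) - 3.0) < 1e-5
assert abs(normal_moment(6) - 15.0) < 1e-4
```

The odd moments vanish by symmetry, which the same routine confirms as well.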
Example 21.3. In the exponential case (20.10), the moment generating function

(21.26)    M(s) = α/(α − s)

is defined for s < α. By (21.22) the kth moment is k!α^{−k}. The mean and variance are thus α^{−1} and α^{−2}. •
Example 21.4. For the Poisson distribution (20.7),

(21.27)    M(s) = Σ_{k=0}^∞ e^{−λ} (λ^k/k!) e^{sk} = e^{λ(e^s − 1)}.

Since M'(s) = λe^s M(s) and M''(s) = (λ²e^{2s} + λe^s) M(s), the first two moments are M'(0) = λ and M''(0) = λ² + λ; the mean and variance are both λ. •
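These derivative calculations are easy to confirm by finite differences. The sketch below is an added illustration, not from the text; λ = 3 and the step h are arbitrary choices, and the central-difference formulas only approximate the derivatives of (21.27) at s = 0.

```python
import math

lam = 3.0  # arbitrary Poisson parameter for the illustration

def M(s):
    return math.exp(lam * (math.exp(s) - 1.0))  # (21.27)

h = 1e-5
M1 = (M(h) - M(-h)) / (2.0 * h)              # approximates M'(0) = lam
M2 = (M(h) - 2.0 * M(0.0) + M(-h)) / h ** 2  # approximates M''(0) = lam^2 + lam

assert abs(M1 - lam) < 1e-6
assert abs(M2 - (lam ** 2 + lam)) < 1e-3
# variance = M''(0) - M'(0)^2
assert abs((M2 - M1 ** 2) - lam) < 1e-3
```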
Let X1, ..., Xk be independent random variables, and suppose that each Xi has a moment generating function Mi(s) = E[e^{sXi}] in (−s0, s0). For |s| < s0, each exp(sXi) is integrable and, since they are independent, their product exp(s Σ_{i=1}^k Xi) is also integrable (see (21.18)). The moment generating function of X1 + ··· + Xk is therefore

(21.28)    M(s) = M1(s) ··· Mk(s)

in (−s0, s0). This relation for simple random variables was essential to the arguments in Section 9.

For simple random variables it was shown in Section 9 that the moment generating function determines the distribution. This will later be proved for general random variables; see Theorem 22.2 for the nonnegative case and Section 30 for the general case.
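The product rule (21.28) can be illustrated by estimating empirical moment generating functions from simulated data. The sketch below is an added illustration, not from the text: the exponential distribution, the sample size, and s = 0.5 are arbitrary choices, and the empirical averages agree with (21.26) and (21.28) only up to sampling error.

```python
import math
import random

random.seed(7)
alpha, s, n = 2.0, 0.5, 200_000   # illustrative parameters, s < alpha

xs = [random.expovariate(alpha) for _ in range(n)]
ys = [random.expovariate(alpha) for _ in range(n)]

def emp_mgf(zs):
    # empirical estimate of E[e^{sZ}]
    return sum(math.exp(s * z) for z in zs) / len(zs)

m_x, m_y = emp_mgf(xs), emp_mgf(ys)
m_sum = sum(math.exp(s * (x + y)) for x, y in zip(xs, ys)) / n

exact = alpha / (alpha - s)          # (21.26)
assert abs(m_x - exact) < 0.02       # empirical MGF is near alpha/(alpha - s)
assert abs(m_sum - m_x * m_y) < 0.05 # MGF of the sum ~ product of the MGFs
```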
PROBLEMS

21.1. Prove that

(1/√(2π)) ∫_{−∞}^∞ e^{−tx²/2} dx = t^{−1/2},

differentiate k times with respect to t inside the integral (justify), and derive (21.7) again.

21.2. Show that, if X has the standard normal distribution, then E[|X|^{2n+1}] = 2^n n! √(2/π).

21.3. 16.23 ↑ Suppose that X1, X2, ... are identically distributed (not necessarily independent) and E[|X1|] < ∞. Show that E[max_{k≤n} |Xk|] = o(n).

21.4. 20.12 ↑ Records. Consider the sequence of records in the sense of Problem 20.12. Show that the expected waiting time to the next record is infinite.

21.5. 20.20 ↑ Show that the Cauchy distribution has no mean.

21.6. Prove the first Borel–Cantelli lemma by applying Theorem 16.6 to indicator random variables. Why is Theorem 16.6 not enough for the second Borel–Cantelli lemma?

21.7. Derive (21.11) by imitating the proof of (5.26). Derive Jensen's inequality (21.14) for bounded X and φ by passing to the limit from the case of simple random variables (see (5.29)).
21.8. Prove (21.9) by Fubini's theorem.

21.9. Prove for integrable X that

E[X] = ∫_0^∞ P[X > t] dt − ∫_{−∞}^0 P[X < t] dt.
21.10. (a) Suppose that X and Y have first moments and prove

E[Y] − E[X] = ∫_{−∞}^∞ (P[X ≤ t < Y] − P[Y ≤ t < X]) dt.

(b) Let [X, Y] be a nondegenerate random interval. Show that its expected length is the integral with respect to t of the probability that it covers t.

21.11. Suppose that X and Y are random variables with distribution functions F and G.
(a) Show that if F and G have no common jumps, then E[F(Y)] + E[G(X)] = 1.
(b) If F is continuous, then E[F(X)] = 1/2.
(c) Even if F and G have common jumps, if X and Y are taken to be independent, then E[F(Y)] + E[G(X)] = 1 + P[X = Y].
(d) Even if F has jumps, E[F(X)] = 1/2 + (1/2) Σ_x P²[X = x].

21.12. Let X have distribution function F.
(a) Show that E[X⁺] < ∞ if and only if ∫_a^∞ (−log F(t)) dt < ∞ for some a.
(b) Show more precisely that, if a and F(a) are positive, then

∫_{[X>a]} X dP ≤ a(−log F(a)) + ∫_a^∞ (−log F(t)) dt

and

∫_a^∞ (−log F(t)) dt ≤ (1/F(a)) ∫_{[X>a]} X dP.
21.13. Suppose that X and Y are nonnegative random variables, r > 1, and P[Y ≥ t] ≤ t^{−1} ∫_{[Y≥t]} X dP for t > 0. Use (21.9), Fubini's theorem, and Hölder's inequality to prove E[Y^r] ≤ (r/(r−1))^r E[X^r].

21.14. Let X1, X2, ... be identically distributed random variables with finite second moment. Show that nP[|X1| > ε√n] → 0 and n^{−1/2} max_{k≤n} |Xk| →_P 0.
21.15. Random variables X and Y with second moments are uncorrelated if E[XY] = E[X]E[Y].
(a) Show that independent random variables are uncorrelated but that the converse is false.
(b) Show that, if X1, ..., Xn are uncorrelated (in pairs), then Var[X1 + ··· + Xn] = Var[X1] + ··· + Var[Xn].

21.16. ↑ Let X, Y, and Z be independent random variables such that X and Y assume the values 0, 1, 2 with probability 1/3 each and Z assumes the values 0 and 1 with probabilities 1/3 and 2/3. Let X' = X and Y' = X + Z (mod 3).
(a) Show that X', Y', and X' + Y' have the same one-dimensional distributions as X, Y, and X + Y, respectively, even though (X', Y') and (X, Y) have different distributions.
(b) Show that X' and Y' are dependent but uncorrelated.
(c) Show that, despite dependence, the moment generating function of X' + Y' is the product of the moment generating functions of X' and Y'.

21.17. Suppose that X and Y are independent, nonnegative random variables and that E[X] = ∞ and E[Y] = 0. What is the value common to E[XY] and E[X]E[Y]? Use the conventions (15.2) for both the product of the random variables and the product of their expected values. What if E[X] = ∞ and 0 < E[Y] < ∞?

21.18. Deduce (21.18) for k = 2 from (21.17), (20.23), and Fubini's theorem. Do not in the nonnegative case assume integrability.

21.19. Suppose that X and Y are independent and that f(x, y) is nonnegative. Put g(x) = E[f(x, Y)] and show that E[g(X)] = E[f(X, Y)]. Show more generally that ∫_{[X∈A]} g(X) dP = ∫_{[X∈A]} f(X, Y) dP. Extend to f that may be negative.

21.20. ↑ The integrability of X + Y does not imply that of X and Y separately. Show that it does if X and Y are independent.
21.21. The dominated convergence theorem (Theorem 16.4) and the theorem on uniformly integrable functions (Theorem 16.13) of course apply to random variables. Show that they hold if convergence with probability 1 is replaced by convergence in probability.
21.22. ↑ Show that if E[|Xn − X|] → 0, then Xn →_P X. Show that the converse is false but holds under the added hypothesis that the Xn are uniformly integrable. If E[|Xn − X|] → 0, then Xn is said to converge to X in the mean.

21.23. ↑ Write d1(X, Y) = E[|X − Y|/(1 + |X − Y|)]. Show that this is a metric equivalent to the one in Problem 20.32.

21.24. 5.10 ↑ Extend Minkowski's inequality (5.36) to arbitrary random variables.

21.25. 20.28 21.21 21.24 ↑ Suppose that p ≥ 1. For a fixed probability space (Ω, ℱ, P), consider the set of random variables satisfying E[|X|^p] < ∞, and identify random variables that differ only on a set of probability 0. This is the space L^p = L^p(Ω, ℱ, P). Put d_p(X, Y) = E^{1/p}[|X − Y|^p] for X, Y ∈ L^p.
(a) Show that L^p is a metric space under d_p.
(b) Show that L^p is complete.

21.26. 2.10 16.7 21.25 ↑ Show that L^p(Ω, ℱ, P) is separable if ℱ is separable, or if ℱ is the completion of a separable σ-field.

21.27. 21.25 ↑ For X and Y in L² put (X, Y) = E[XY]. Show that (·, ·) is an inner product for which d₂(X, Y) = (X − Y, X − Y)^{1/2}. Thus L² is a Hilbert space.

21.28. 5.11 ↑ Extend Problem 5.11 concerning essential suprema to random variables that need not be simple.

21.29. Consider on (1, ∞) densities that are multiples of e^{−αx}, e^{−x²}, x^{−2}e^{−αx}, x^{−2}; also consider reflections of these through 0 and linear combinations of them. Show that the convergence interval for a moment generating function can be any interval containing 0. There are 16 possibilities, depending on whether the endpoints are 0, finite and different from 0, or infinite, and whether they are contained in the interval or not.

21.30. For the density C exp(−|x|^{1/2}), −∞ < x < ∞, show that moments of all orders exist but that the moment generating function exists only at s = 0.

21.31. 16.8 ↑ Show that a moment generating function M(s) defined in (−s0, s0), s0 > 0, can be extended to a function analytic in the strip [z: −s0 < Re z < s0]. If M(s) is defined in [0, s0), s0 > 0, show that it can be extended to a function continuous in [z: 0 ≤ Re z < s0] and analytic in [z: 0 < Re z < s0].

21.32. Use (21.28) to find the generating function of (20.39).

21.33. The joint moment generating function of (X, Y) is defined as M(s, t) = E[e^{sX+tY}]. Assume that it exists for (s, t) in some neighborhood of the origin, expand it in a double series, and conclude that E[X^m Y^n] is ∂^{m+n}M/∂s^m ∂t^n evaluated at s = t = 0.

21.34. For independent random variables having moment generating functions, show by (21.28) that the variances add.

21.35. 20.23 ↑ Show that the gamma density (20.47) has moment generating function (1 − s/α)^{−u} for s < α. Show that the kth moment is u(u + 1) ··· (u + k − 1)/α^k. Show that the chi-squared distribution with n degrees of freedom has mean n and variance 2n.

SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES
Let X1, X2, ... be a sequence of independent random variables on some probability space. It is natural to ask whether the infinite series Σ_{n=1}^∞ Xn converges with probability 1, or, as in Section 6, whether n^{−1} Σ_{k=1}^n Xk converges to some limit with probability 1. It is to questions of this sort that the present section is devoted. Throughout the section, Sn will denote the partial sum Σ_{k=1}^n Xk (S0 = 0).
The Strong Law of Large Numbers

The central result is a general version of Theorem 6.1.

Theorem 22.1. If X1, X2, ... are independent and identically distributed and have finite mean, then Sn/n → E[X1] with probability 1.
Formerly this theorem stood at the end of a chain of results. The following argument, due to Etemadi, proceeds from first principles.
If the theorem holds for nonnegative random variables, then ,.- 1sn = n - 1Ei:_ 1 Xt - n - 1Ei:_ 1 X; -+ E[ X(] - E[ X}] = E[ X1 ] with probability 1. Assume then that Xk > 0. Consider the truncated random variables Yk = Xk ll x. k J and their par n tial sums Sn* = Ei:_ 1 Yk . For a > 1, temporarily fixed, let un = [a ], where the brackets denote integer part. The first step is to prove s
>£ <
(22 .1)
Since the
00 .
Xn are independent and identically distributed, n
n
k-1
k-1
Var[Sn* ] = E Var ( Yk ] < E E [ Yk2 ] n
= E E [ x.2J( X, s k ) ] k-1
s:
nE [ x.2I( x, s n d .
It follows by Chebyshev' s inequality that the sum in (22.1) is at most
Let K = 2aj(a - 1) and suppose x > 0. If N is the smallest n such that N u,. � x, then a > x, and since y � 2 [y] for y > 1,
E u ; 1 � 2 E a - n = Ka - N < Kx - 1 •
u,. � x
n�N
Therefore, E:_ 1 u n 1/[ x1 s u,. J < KX} 1 for X1 > 0, and the sum in (22.1) is at most K£ - 2E [ X1 ] < oo . From (22.1) it follows by the first Borel-Cantelli lemma (take a union over positive, rational £) that (Su* - E[Su* 1)/u n -+ 0 with probability 1 . But 1 1 by the consistency of Cesaro sunimation (:�30], n - E[Sn*1 = n - Ei:_ 1 E[Yk ] has the same limit as E[Yn ], namely, E[X1 ]. Therefore Su* fun -+ E[X1 ] with probability 1. Since II
00
00
E P [ Xn ¢ Yn ] = nE P [ X1 > n ] s: ,_ 1 -=1
1
00
0
P [ X1 > t ] dt = E [ X. J <
oo ,
another application of the first Borel–Cantelli lemma shows that (Sn* − Sn)/n → 0 and hence

(22.2)    S_{un}/un → E[X1]

with probability 1.

If un ≤ k ≤ u_{n+1}, then since Xj ≥ 0,

S_{un}/u_{n+1} ≤ Sk/k ≤ S_{u_{n+1}}/un.

But u_{n+1}/un → α, and so it follows by (22.2) that

α^{−1} E[X1] ≤ lim inf_k Sk/k ≤ lim sup_k Sk/k ≤ α E[X1]

with probability 1. This is true for each α > 1. Intersecting the corresponding sets over rational α exceeding 1 gives lim_k Sk/k = E[X1] with probability 1. •
Although the hypothesis that the Xn all have the same distribution is used several times, independence is used only through the equation Var[Sn*] = Σ_{k=1}^n Var[Yk], and for this it is enough that the Xn be independent in pairs. The proof given for Theorem 6.1 of course extends beyond the case of simple random variables, but it requires E[X1⁴] < ∞.
Corollary. Suppose that X1, X2, ... are independent and identically distributed and E[X1⁻] < ∞, E[X1⁺] = ∞ (so that E[X1] = ∞). Then n^{−1} Σ_{k=1}^n Xk → ∞ with probability 1.

PROOF. By the theorem, n^{−1} Σ_{k=1}^n Xk⁻ → E[X1⁻] with probability 1, and so it suffices to prove the corollary for the case X1 = X1⁺ ≥ 0. Define truncated variables by Xn^{(u)} = Xn if 0 ≤ Xn ≤ u and Xn^{(u)} = 0 if Xn > u. Then n^{−1} Σ_{k=1}^n Xk ≥ n^{−1} Σ_{k=1}^n Xk^{(u)} → E[X1^{(u)}] by the theorem. Let u → ∞. •
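A simulation makes Theorem 22.1 concrete. The sketch below is an added illustration, not from the text: it averages 10^5 independent exponential draws with mean 1, and the tolerance is many standard errors wide (the distribution, sample size, and seed are arbitrary choices).

```python
import random

random.seed(42)
n = 100_000
total = 0.0
for _ in range(n):
    total += random.expovariate(1.0)   # i.i.d. draws with E[X1] = 1

sample_mean = total / n
# standard error is about 1/sqrt(n) ~ 0.003, so 0.03 is a generous bound
assert abs(sample_mean - 1.0) < 0.03
```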
The Weak Law and Moment Generating Functions

The statement and proof of Theorem 6.2, the weak law of large numbers, carry over word for word to the case of general random variables with second moments; only Chebyshev's inequality is required. The idea can be
used to prove in a very simple way that a distribution concentrated on [0, ∞) is uniquely determined by its moment generating function or Laplace transform.

For each λ, let Y_λ be a random variable (on some probability space) having the Poisson distribution with parameter λ. Since Y_λ has mean and variance λ (Example 21.4), Chebyshev's inequality gives

P[|Y_λ/λ − 1| ≥ ε] ≤ 1/(λε²) → 0,    λ → ∞.

Let G_λ be the distribution function of Y_λ/λ, so that

G_λ(t) = Σ_{k ≤ λt} e^{−λ} λ^k/k!.

The result above can be restated as

(22.3)    lim_{λ→∞} G_λ(t) = 1 if t > 1, and lim_{λ→∞} G_λ(t) = 0 if t < 1.
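The convergence in (22.3) can be watched directly by summing the Poisson probabilities. The snippet below is an added illustration, not from the text; the particular values of λ and t are arbitrary.

```python
import math

def G(lam, t):
    # distribution function of Y_lam/lam: sum of e^{-lam} lam^k / k!
    # over k <= lam * t, with the terms built up iteratively
    p = math.exp(-lam)        # k = 0 term
    total = p
    for k in range(1, int(lam * t) + 1):
        p *= lam / k          # next Poisson probability
        total += p
    return total

# G_lam(t) tends to 1 for t > 1 and to 0 for t < 1 as lam grows
assert G(10.0, 1.2) < G(100.0, 1.2) < G(400.0, 1.2)
assert G(400.0, 1.2) > 0.99
assert G(400.0, 0.8) < 0.01
```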
In the notation of Section 14, G_λ(x) ⇒ Δ(x − 1) as λ → ∞.

Now consider a probability distribution μ concentrated on [0, ∞). Let F be the corresponding distribution function. Define

(22.4)    M(s) = ∫_0^∞ e^{−sy} μ(dy),    s ≥ 0;

here 0 is included in the range of integration. This is the moment generating function (21.21), but the argument has been reflected through the origin. It is a one-sided Laplace transform, defined for all nonnegative s. For positive s, (21.24) gives

(22.5)    M^{(k)}(s) = (−1)^k ∫_0^∞ y^k e^{−sy} μ(dy).

Therefore, for positive x and s,

(22.6)    Σ_{k ≤ sx} ((−1)^k/k!) s^k M^{(k)}(s) = ∫_0^∞ Σ_{k ≤ sx} e^{−sy} ((sy)^k/k!) μ(dy) = ∫_0^∞ G_{sy}(x/y) μ(dy).

Fix x > 0. If† 0 < y < x, then G_{sy}(x/y) → 1 as s → ∞ by (22.3); if y > x, the limit is 0. If μ{x} = 0, the integrand on the right in (22.6) thus

† If y = 0, the integrand in (22.5) is 1 for k = 0 and 0 for k ≥ 1; hence for y = 0, the integrand in the middle term of (22.6) is 1.
converges as s → ∞ to I_{[0, x]}(y) except on a set of μ-measure 0. The bounded convergence theorem then gives

(22.7)    lim_{s→∞} Σ_{k ≤ sx} ((−1)^k/k!) s^k M^{(k)}(s) = μ[0, x] = F(x).

Thus M(s) determines the value of F at x if x > 0 and μ{x} = 0, which covers all but countably many values of x in [0, ∞). Since F is right-continuous, F itself and hence μ are determined through (22.7) by M(s). In fact, μ is by (22.7) determined by the values of M(s) for s beyond an arbitrary s0:

Theorem 22.2.
Let μ and ν be probability measures on [0, ∞). If

∫_0^∞ e^{−sy} μ(dy) = ∫_0^∞ e^{−sy} ν(dy),    s ≥ s0,

where s0 > 0, then μ = ν.

Corollary.
Let f1 and f2 be real functions on [0, ∞). If

∫_0^∞ e^{−sx} f1(x) dx = ∫_0^∞ e^{−sx} f2(x) dx,    s ≥ s0,

where s0 > 0, then f1 = f2 outside a set of Lebesgue measure 0.

The f_i need not be nonnegative, and they need not be integrable, but e^{−sx} f_i(x) must be integrable over [0, ∞) for s ≥ s0.

PROOF. For the nonnegative case, apply the theorem to the probability densities g_i(x) = e^{−s0 x} f_i(x)/m, where m = ∫_0^∞ e^{−s0 x} f_i(x) dx, i = 1, 2. For the general case, prove that f1⁺ + f2⁻ = f2⁺ + f1⁻ almost everywhere. •
Example 22.1. If μ1 * μ2 = μ3, then the corresponding transforms (22.4) satisfy M1(s)M2(s) = M3(s) for s ≥ 0. If μi is the Poisson distribution with mean λi, then (see (21.27)) Mi(s) = exp[λi(e^{−s} − 1)]. It follows by Theorem 22.2 that if two of the μi are Poisson, so is the third, and λ1 + λ2 = λ3. •
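The convolution identity in Example 22.1 can be verified numerically on a truncated range: convolving the Poisson(λ1) and Poisson(λ2) probabilities reproduces the Poisson(λ1 + λ2) probabilities. The snippet below is an added illustration with arbitrary parameter choices.

```python
import math

def poisson_pmf(lam, kmax):
    # probabilities e^{-lam} lam^k / k! for k = 0, ..., kmax
    p, out = math.exp(-lam), []
    for k in range(kmax + 1):
        out.append(p)
        p *= lam / (k + 1)
    return out

lam1, lam2, kmax = 1.5, 2.5, 60
a = poisson_pmf(lam1, kmax)
b = poisson_pmf(lam2, kmax)

# convolution of the two pmfs on {0, ..., kmax}
conv = [sum(a[j] * b[k - j] for j in range(k + 1)) for k in range(kmax + 1)]
c = poisson_pmf(lam1 + lam2, kmax)

assert max(abs(x - y) for x, y in zip(conv, c)) < 1e-12
```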
Kolmogorov's Zero-One Law

Consider the set A of ω for which n^{−1} Σ_{k=1}^n Xk(ω) → 0 as n → ∞. For each m, the values of X1(ω), ..., X_{m−1}(ω) are irrelevant to the question of whether or not ω lies in A, and so A ought to lie in the σ-field σ(Xm, X_{m+1}, ...). In fact, lim_n n^{−1} Σ_{k=1}^{m−1} Xk(ω) = 0 for fixed m, and hence ω lies in A if and only if lim_n n^{−1} Σ_{k=m}^n Xk(ω) = 0. Therefore,

(22.8)    A = ∩_ε ∪_{N≥m} ∩_{n≥N} [ω: |n^{−1} Σ_{k=m}^n Xk(ω)| < ε],

the first intersection extending over positive rational ε. The set on the inside lies in σ(Xm, X_{m+1}, ...), and hence so does A. Similarly, the ω-set where the series Σ_n Xn(ω) converges lies in each σ(Xm, X_{m+1}, ...).

The intersection 𝒯 = ∩_{n=1}^∞ σ(Xn, X_{n+1}, ...) is the tail σ-field associated with the sequence X1, X2, ...; its elements are tail events. In the case Xn = I_{A_n}, this is the σ-field (4.29) studied in Section 4. The following general form of Kolmogorov's zero-one law extends Theorem 4.5.
Theorem 22.3. If {Xn} is independent, and if A ∈ 𝒯 = ∩_{n=1}^∞ σ(Xn, X_{n+1}, ...), then either P(A) = 0 or P(A) = 1.

PROOF. Let ℱ0 = ∪_{k=1}^∞ σ(X1, ..., Xk). The first thing to establish is that ℱ0 is a field generating the σ-field σ(X1, X2, ...). If B and C lie in ℱ0, then B ∈ σ(X1, ..., Xj) and C ∈ σ(X1, ..., Xk) for some j and k; if m = max{j, k}, then B and C both lie in σ(X1, ..., Xm), so that B ∪ C ∈ σ(X1, ..., Xm) ⊂ ℱ0. Thus ℱ0 is closed under the formation of finite unions; since it is similarly closed under complementation, ℱ0 is a field. For H ∈ ℛ¹, [Xn ∈ H] ∈ ℱ0 ⊂ σ(ℱ0), and hence Xn is measurable σ(ℱ0); thus ℱ0 generates σ(X1, X2, ...) (which in general is much larger than ℱ0).

Suppose that A lies in 𝒯. Then A lies in σ(X_{k+1}, X_{k+2}, ...) for each k. Therefore, if B ∈ σ(X1, ..., Xk), then A and B are independent by Theorem 20.2. Therefore, A is independent of ℱ0 and hence by Theorem 4.2 is also independent of σ(X1, X2, ...). But then A is independent of itself: P(A ∩ A) = P(A)P(A). Therefore, P(A) = P²(A), which implies that P(A) is either 0 or 1. •
As noted above, the set where Σ_n Xn(ω) converges satisfies the hypothesis of Theorem 22.3, and so does the set where n^{−1} Σ_{k=1}^n Xk(ω) → 0. In many similar cases it is very easy to prove by this theorem that a set at hand must have probability either 0 or 1. But to determine which of 0 and 1 is, in fact, the probability of the set may be extremely difficult.

Example 22.2. In connection with the problem of verifying the hypotheses of Theorem 22.3, consider this example: Let Ω = R¹, let A be a subset of R¹ not contained in ℛ¹ (see the end of Section 3), and let ℱ be the σ-field generated by ℛ¹ ∪ {A}. Take X1(ω) = I_A(ω) and X2(ω) = X3(ω) = ··· = ω. Then A lies in σ(X1, X2, ...). Fix m. If one knows Xm(ω), X_{m+1}(ω), ..., then one knows ω itself and hence knows whether or not ω lies in A. To put it another way, if Xj(ω') = Xj(ω'') for all j ≥ m, then I_A(ω') = I_A(ω''); therefore, the values of X1(ω), ..., X_{m−1}(ω) are irrelevant to the question of whether or not ω ∈ A. Does A lie in σ(Xm, X_{m+1}, ...)? For m ≥ 2 the answer is no, because σ(Xm, X_{m+1}, ...) = ℛ¹ and A ∉ ℛ¹. •
Although the issue is a technical one of measurability, this example shows that in checking the hypothesis of Theorem 22.3 it is not enough to show for each m that A does not depend on the values of X1, ..., X_{m−1}. It must for each m be shown by some construction like (22.8) that A does depend in a measurable way on Xm, X_{m+1}, ... alone.

Maximal Inequalities

Essential to the study of random series are maximal inequalities, that is, inequalities concerning the maxima of partial sums. The best known is that of Kolmogorov.
Theorem 22.4. Suppose that X1, ..., Xn are independent with mean 0 and finite variances. For α > 0,

(22.9)    P[max_{1≤k≤n} |Sk| ≥ α] ≤ (1/α²) Var[Sn].

PROOF. Let Ak be the set where |Sk| ≥ α but |Sj| < α for j < k. Since the Ak are disjoint,

E[Sn²] ≥ Σ_{k=1}^n ∫_{Ak} Sn² dP = Σ_{k=1}^n ∫_{Ak} [Sk² + 2Sk(Sn − Sk) + (Sn − Sk)²] dP ≥ Σ_{k=1}^n ∫_{Ak} [Sk² + 2Sk(Sn − Sk)] dP.

Since I_{Ak} Sk is measurable σ(X1, ..., Xk) and Sn − Sk is measurable σ(X_{k+1}, ..., Xn), and since the means are all 0, it follows by (21.19) and independence that ∫_{Ak} Sk(Sn − Sk) dP = 0. Therefore,

E[Sn²] ≥ Σ_{k=1}^n ∫_{Ak} Sk² dP ≥ Σ_{k=1}^n α² P(Ak) = α² P[max_{1≤k≤n} |Sk| ≥ α]. •

By Chebyshev's inequality, P[|Sn| ≥ α] ≤ α^{−2} Var[Sn]. That this can be strengthened to (22.9) is an instance of a general phenomenon: For sums of independent variables, if max_{k≤n} |Sk| is large, then |Sn| is probably large as well. Theorem 9.6 is an instance of this, and so is the following result, due to Etemadi.
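A quick Monte Carlo check of (22.9) for a simple ±1 random walk (an added illustration, not from the text; the walk length, threshold, trial count, and seed are arbitrary): the empirical frequency with which max_k |Sk| reaches α should not exceed Var[Sn]/α².

```python
import random

random.seed(1)
n_steps, alpha, trials = 100, 15.0, 10_000
exceed = 0
for _ in range(trials):
    s, running_max = 0, 0
    for _ in range(n_steps):
        s += random.choice((-1, 1))          # step with mean 0, variance 1
        running_max = max(running_max, abs(s))
    if running_max >= alpha:
        exceed += 1

empirical = exceed / trials
bound = n_steps / alpha ** 2                 # Var[S_n] / alpha^2
assert empirical <= bound
assert empirical > 0.0                       # the bound is not vacuous here
```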
Theorem 22.5. Suppose that X1, ..., Xn are independent. For α > 0,

(22.10)    P[max_{1≤k≤n} |Sk| ≥ 4α] ≤ 4 max_{1≤k≤n} P[|Sk| ≥ α].

PROOF. Let Bk be the set where |Sk| ≥ 4α but |Sj| < 4α for j < k. Since the Bk are disjoint,

P[max_{1≤k≤n} |Sk| ≥ 4α] ≤ P[|Sn| ≥ 2α] + Σ_{k=1}^{n−1} P(Bk ∩ [|Sn| < 2α])
≤ P[|Sn| ≥ 2α] + Σ_{k=1}^{n−1} P(Bk ∩ [|Sn − Sk| > 2α])
= P[|Sn| ≥ 2α] + Σ_{k=1}^{n−1} P(Bk) P[|Sn − Sk| > 2α]
≤ 2 max_{0≤k<n} P[|Sn − Sk| ≥ 2α].

Now replace 4α by 2α and apply the same argument to the partial sums of the random variables in reverse order, Xn, ..., X1; it follows that

P[max_{0≤k<n} |Sn − Sk| ≥ 2α] ≤ 2 max_{1≤k≤n} P[|Sk| ≥ α].

These two inequalities give (22.10). •
If the Xk have mean 0 and Chebyshev's inequality is applied to the right side of (22.10), and if α is replaced by α/4, the result is Kolmogorov's inequality (22.9) with an extra factor of 64 on the right side. For this reason, the two inequalities are equally useful for the applications in this section.

Convergence of Random Series
For independent Xn, the probability that Σ Xn converges is either 0 or 1. It is natural to try and characterize the two cases in terms of the distributions of the individual Xn.
Theorem 22.6. Suppose that {Xn} is an independent sequence and E[Xn] = 0. If Σ Var[Xn] < ∞, then Σ Xn converges with probability 1.

PROOF. By (22.9),

P[max_{1≤k≤r} |S_{n+k} − Sn| ≥ ε] ≤ ε^{−2} Σ_{k=n+1}^{n+r} Var[Xk].

Since the sets on the left are nondecreasing in r, letting r → ∞ gives

P[sup_{k≥1} |S_{n+k} − Sn| ≥ ε] ≤ ε^{−2} Σ_{k=n+1}^∞ Var[Xk].

Since Σ Var[Xn] converges,

(22.11)    lim_n P[sup_{k≥1} |S_{n+k} − Sn| ≥ ε] = 0

for each ε. Let E(n, ε) be the set where sup_{j,k≥n} |Sj − Sk| > 2ε, and put E(ε) = ∩_n E(n, ε). Then E(n, ε) ↓ E(ε), and (22.11) implies P(E(ε)) = 0. Now ∪ E(ε), where the union extends over positive rational ε, contains the set where the sequence {Sn} is not fundamental (does not have the Cauchy property), and this set therefore has probability 0. •

Example 22.3. Let Xn(ω) = rn(ω)an, where the rn are the Rademacher functions on the unit interval (see (1.12)). Then Xn has variance an², and so Σ an² < ∞ implies that Σ rn(ω)an converges with probability 1. An interesting special case is an = n^{−1}: if the signs in Σ ±n^{−1} are chosen on the toss of a coin, then the series converges with probability 1. The alternating harmonic series 1 − 2^{−1} + 3^{−1} − ··· is thus typical in this respect. •
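A simulation of this special case (an added illustration, not from the text; the seed and checkpoint values are arbitrary): with random signs, the partial sums of Σ ±1/n settle down, since the tail beyond n has standard deviation √(Σ_{k>n} k^{−2}), roughly n^{−1/2}.

```python
import random

random.seed(2024)
s = 0.0
checkpoints = {}
for n in range(1, 100_001):
    s += random.choice((-1, 1)) / n      # coin-tossed sign times 1/n
    if n in (1_000, 10_000, 100_000):
        checkpoints[n] = s

# successive partial sums differ by far less than the harmonic tail would;
# the tolerances below are many standard deviations wide
assert abs(checkpoints[10_000] - checkpoints[1_000]) < 0.4
assert abs(checkpoints[100_000] - checkpoints[10_000]) < 0.2
```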
If Σ Xn converges with probability 1, then Sn converges with probability 1 to some finite random variable S. By Theorem 20.5, this implies that Sn →_P S. The reverse implication of course does not hold in general, but it does if the summands are independent.
Theorem 22.7. For an independent sequence {Xn}, the Sn converge with probability 1 if and only if they converge in probability.

PROOF. It is enough to show that if Sn →_P S, then {Sn} is fundamental with probability 1. Since

P[|S_{n+j} − Sn| ≥ ε] ≤ P[|S_{n+j} − S| ≥ ε/2] + P[|Sn − S| ≥ ε/2],

Sn →_P S implies

(22.12)    lim_n sup_{j≥1} P[|S_{n+j} − Sn| ≥ ε] = 0.

But by (22.10),

P[max_{1≤j≤k} |S_{n+j} − Sn| ≥ ε] ≤ 4 max_{1≤j≤k} P[|S_{n+j} − Sn| ≥ ε/4],

and therefore

P[sup_{j≥1} |S_{n+j} − Sn| ≥ ε] ≤ 4 sup_{j≥1} P[|S_{n+j} − Sn| ≥ ε/4].

It now follows by (22.12) that (22.11) holds, and the proof is completed as before. •
The final result in this direction, the three-series theorem, does provide necessary and sufficient conditions for the convergence of Σ Xn in terms of the individual distributions of the Xn. Let Xn^{(c)} be Xn truncated at c: Xn^{(c)} = Xn I_{[|Xn| ≤ c]}.

Theorem 22.8. Suppose that {Xn} is independent and consider the three series

(22.13)    Σ P[|Xn| > c],    Σ E[Xn^{(c)}],    Σ Var[Xn^{(c)}].

In order that Σ Xn converge with probability 1 it is necessary that the three series converge for all positive c and sufficient that they converge for some positive c.
PROOF OF SUFFICIENCY. Suppose that the series (22.13) converge and put mn^{(c)} = E[Xn^{(c)}]. By Theorem 22.6, Σ (Xn^{(c)} − mn^{(c)}) converges with probability 1, and since Σ mn^{(c)} converges, so does Σ Xn^{(c)}. Since P[Xn ≠ Xn^{(c)} i.o.] = 0 by the first Borel–Cantelli lemma, it follows finally that Σ Xn converges with probability 1. •
Although it is possible to prove necessity in the three-series theorem by the methods of the present section, the simplest and clearest argument uses the central limit theorem as treated in Section 27. This involves no circularity of reasoning, since the three-series theorem is nowhere used in what follows.

PROOF OF NECESSITY. Suppose that Σ Xn converges with probability 1 and fix c > 0. Since Xn → 0 with probability 1, it follows that Σ Xn^{(c)} converges with probability 1 and, by the second Borel–Cantelli lemma, that Σ P[|Xn| > c] < ∞.

Let Mn^{(c)} and sn^{(c)} be the mean and standard deviation of Sn^{(c)} = Σ_{k=1}^n Xk^{(c)}. If sn^{(c)} → ∞, then since the Xn^{(c)} − mn^{(c)} are uniformly bounded, it follows by the central limit theorem (see Example 27.4) that

(22.14)    lim_n P[x ≤ (Sn^{(c)} − Mn^{(c)})/sn^{(c)} ≤ y] = (1/√(2π)) ∫_x^y e^{−t²/2} dt.

And since Σ Xn^{(c)} converges with probability 1, Sn^{(c)}/sn^{(c)} → 0 with probability 1, so that (Theorem 20.5)

(22.15)    lim_n P[|Sn^{(c)}/sn^{(c)}| ≥ ε] = 0.

But (22.14) and (22.15) stand in contradiction: Since

P[x < (Sn^{(c)} − Mn^{(c)})/sn^{(c)} < y, |Sn^{(c)}/sn^{(c)}| < ε]

is greater than or equal to the probability in (22.14) minus that in (22.15), it is positive for all sufficiently large n (if x < y). But then

x − ε < −Mn^{(c)}/sn^{(c)} < y + ε,

and this cannot hold simultaneously for, say, (x − ε, y + ε) = (−1, 0) and (x − ε, y + ε) = (0, 1). Thus sn^{(c)} cannot go to ∞, and the third series in (22.13) converges.
And now it follows by Theorem 22.6 that Σ (Xn^{(c)} − mn^{(c)}) converges with probability 1, so that the middle series in (22.13) converges as well. •
Theorems 22.6, 22.7, and 22.8 concern conditional convergence, and in the most interesting cases, E xn converges not because the xn go to 0 at a high rate but because they tend to cancel each other out.t Random Taylor Series *
Consider a power series E ± z n , where the signs are chosen on the toss of a coin. The radius of convergence being 1, the series represents an analytic function in the open unit disc D0 = [z: l z l < 1] in the complex plane. The question arises whether this function can be extended analytically beyond D0 • The answer is no: With probability 1 the unit circle is the natural boundary. 1beorem 22.9.
Let { xn } be an independent sequence such that n = 0, 1, . . . .
(22.16 )
There is probability 0 that F( w,
(22.17 )
Z
00
) = E Xn ( W ) z n n-O
coincides in D0 with a function analytic in an open set properly containing D0• It will be seen in the course of the proof that the w-set in question lies in a ( X0 , X 1 , ) and hence has a probability. It is intuitively clear that if the set is measurable at all, it must depend only on the Xn for large n and hence • • •
† See Problem 22.3.
* This topic, which requires complex variable theory, may be omitted.
must have probability either 0 or 1.

PROOF. Since

(22.18)    |Xn(ω)| = 1,    n = 0, 1, ...,

with probability 1, the series in (22.17) has radius of convergence 1 outside a set of measure 0.

Consider an open disc D = [z: |z − ζ| < ρ], where ζ ∈ D0 and ρ > 0. Now (22.17) coincides in D0 with a function analytic in D0 ∪ D if and only if its expansion

F(ω, z) = Σ_{m=0}^∞ am(ω)(z − ζ)^m

about ζ converges at least for |z − ζ| < ρ. Let AD be the set of ω for which this holds. The coefficient

am(ω) = (1/m!) F^{(m)}(ω, ζ) = Σ_{n=m}^∞ (n choose m) Xn(ω) ζ^{n−m}

is a complex-valued random variable measurable σ(Xm, X_{m+1}, ...). By the root test, ω ∈ AD if and only if lim sup_m |am(ω)|^{1/m} ≤ ρ^{−1}. For each m0, the condition for ω ∈ AD can thus be expressed in terms of a_{m0}(ω), a_{m0+1}(ω), ... alone, and so AD ∈ σ(X_{m0}, X_{m0+1}, ...). Thus AD has a probability, and in fact P(AD) is 0 or 1 by the zero-one law. Of course, P(AD) = 1 if D ⊂ D0.

The central step in the proof is to show that P(AD) = 0 if D contains points not in D0. Assume on the contrary that P(AD) = 1 for such a D. Consider that part of the circumference of the unit circle that lies in D, and let k be an integer large enough that this arc has length exceeding 2π/k. Define

Yn(ω) = Xn(ω) if n ≢ 0 (mod k),    Yn(ω) = −Xn(ω) if n ≡ 0 (mod k).

Let BD be the ω-set where the function

(22.19)    G(ω, z) = Σ_{n=0}^∞ Yn(ω) z^n

coincides in D0 with a function analytic in D0 ∪ D. The sequence {Y0, Y1, ...} has the same structure as the original sequence: the Yn are independent and assume the values ±1 with probability 1/2 each. Since BD is defined in terms of the Yn in the same way as AD is defined in terms of the Xn, it is intuitively clear that P(BD) and P(AD) must be the same. Assume for the moment the truth of this statement, which is somewhat more obvious than its proof.

If for a particular ω each of (22.17) and (22.19) coincides in D0 with a function analytic in D0 ∪ D, the same must be true of
( 22 . 20 )
00
F( w, z ) - G ( w, z ) = 2 E xmk ( w ) z mk . m -0
Let D1 = [ze 2 "illk : z E D). Since replacing z by ze 2 "i /k leaves the function (22.20) unchanged, it can be extended analytically to each D0 u D1, I = 1 , 2, . . . . Because of the choice of k, it can therefore be extended analyti cally to [z: l z I < 1 + E] for some positive E; but this is impossible if (22.18) holds, since the radius of convergence must then be 1. Therefore, AD n BD cannot contain a point w satisfying (22.18). Since (22.18) holds with probability 1, this rules out the possibility P(AD) = P( BD) = 1 and by the zero-one law leaves only the possibility P(AD) = P( BD) = 0. Let A be the w-set where (22.17) extends to a function analytic in some open set larger than D0• Then w E A if and only if (22.17) extends to D0 U D for some D = [z: l z - r 1 < r] for which D - D0 ¢ 0, r is rational, and r has rational real and imaginary parts; in other words, A is the countable union of AD for such D. Therefore, A lies in o( X0, X1 , ) and has probability 0. It remains only to show that P(AD) = P(BD), and this is most easily done by comparing { xn } and { yn } with a canonical sequence having the same structure. Put Zn ( w ) = ( Xn ( w ) + 1)/2, and let Tw be E:- o zn ( w )2 - n - I on the w-set A* where this sum lies in (0, 1 ]; on 0 - A* let Tw be 1, say. Because of (22.16) P(A*) = 1. Let �= o( X0, X1 , ) and let � be the a-field of Borel subsets of (0, 1 ]; then T: 0 --+ (0, 1] is measurable �/!fl. Let rn(x) be the n th Rademacher function. If M = [x: r1(x ) = u ; , i = 1, . . . , n ], where u ; = ± 1 for each i, then P(T - 1M ) = P[ w : �((a,)) = u ; , i = 0, 1, . . . , n - 1] = 2 - n , which is the Lebesgue measure A(M) of M. Since these sets form a w-system generating ffl, P(T - 1M) = A(M ) for all M in fJI (Theorem 3.3). n Let MD be the set of x for which E�-o rn + 1 (x)z extends analytically to D0 U D. Then MD lies in ffl, this being a special case of the fact that AD lies in �. 
Moreover, if w E A*, then w E AD if and only if Tw E MD : 1 A * n AD = A* n T - MD. Since P(A*) = 1, it follows that P(AD) = A ( MD) . This argument only uses (22.16), and therefore it applies to { Yn } and BD • as well. Therefore, P(BD) = A( MD) = P(AD). • • •
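The map T in the proof sends the sign sequence to a point of (0, 1] whose binary digits are the Z_n, and the argument turns on T carrying P to Lebesgue measure. A short numerical sketch (not part of the text; the interval [1/4, 1/2) and the sample sizes are arbitrary choices for illustration):

```python
import random

random.seed(7)

def T(bits):
    # the map of the proof: a 0/1 sequence goes to sum_n bits[n] * 2^(-n-1)
    return sum(b * 2.0 ** (-(n + 1)) for n, b in enumerate(bits))

# with Z_n = (X_n + 1)/2 for fair +-1 coin flips X_n, T should be
# (approximately, truncating to 40 digits) uniformly distributed on (0, 1]
samples = [T([random.randint(0, 1) for _ in range(40)]) for _ in range(20000)]

# any dyadic interval of rank 2, e.g. [1/4, 1/2), should receive probability 1/4
frac = sum(1 for u in samples if 0.25 <= u < 0.5) / len(samples)
```

The empirical frequency `frac` comes out near 1/4, matching λ of the interval, as the identity P(T⁻¹M) = λ(M) predicts.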
304
RANDOM VARIABLES AND EXPECTED VALUES
The Hewitt-Savage Zero-One Law *
Kolmogorov's zero-one law does not apply to the events [S_n > a_n i.o.] and [S_n = 0 i.o.], which lie in each σ-field σ(S_n, S_{n+1}, …) but not in each σ(X_n, X_{n+1}, …). If the X_n are identically distributed as well as independent, however, these events come under a different zero-one law.

Let ℛ_s^{n,k} consist of the sets H in ℛ^{n+k} that are symmetric in the first n coordinates in the sense that if (x₁, …, x_{n+k}) lies in H, then so does (x_{π1}, …, x_{πn}, x_{n+1}, …, x_{n+k}) for every permutation π of 1, …, n. Then ℛ_s^{n,k} is a σ-field. Let 𝒮_{n,k} consist of the sets [(X₁, …, X_{n+k}) ∈ H] for H ∈ ℛ_s^{n,k}, let 𝒮_n be the σ-field generated by ∪_{k=1}^∞ 𝒮_{n,k}, and let 𝒮 = ∩_{n=1}^∞ 𝒮_n. The information in 𝒮_n corresponds to knowing the values X_{n+1}(ω), X_{n+2}(ω), … exactly and knowing the values X₁(ω), …, X_n(ω) to within a permutation. If n ≤ m, then [S_m ∈ H_m] ∈ 𝒮_{m,0} ⊂ 𝒮_{m−n}, and so [S_m ∈ H_m i.o.] lies in 𝒮 and comes under the Hewitt-Savage zero-one law:
Theorem 22.10. If X₁, X₂, … are independent and identically distributed, and if A ∈ 𝒮, then either P(A) = 0 or P(A) = 1.
PROOF. Since ∪_{n=1}^∞ σ(X₁, …, X_n) is a field, by the corollary to Theorem 11.4 there is for given ε an n and a set U = [(X₁, …, X_n) ∈ H] (H ∈ ℛⁿ) such that P(A △ U) < ε. Let V = [(X_{n+1}, …, X_{2n}) ∈ H]. From A ∈ 𝒮_{2n} and the fact that the X_k are identically distributed as well as independent it will be shown that

(22.21)  P(A △ U) = P(A △ V).

This will imply P(A △ V) < ε and P(A △ (U ∩ V)) ≤ P(A △ U) + P(A △ V) < 2ε. Since U and V are independent and have the same probability, it will follow that P(A) is within ε of P(U) and hence P²(A) is within 2ε of P²(U) = P(U)P(V) = P(U ∩ V), which is in turn within 2ε of P(A). But then |P²(A) − P(A)| < 4ε for all ε, and so P(A) must be 0 or 1.

It remains to prove (22.21). If B ∈ 𝒮_{2n,k}, then B = [(X₁, …, X_{2n+k}) ∈ J] for some J in ℛ_s^{2n,k}, and since the X_i are independent and identically distributed and J is symmetric in its first 2n coordinates, the vector (X_{n+1}, …, X_{2n}, X₁, …, X_n, X_{2n+1}, …, X_{2n+k}) lies in J with the same probability that (X₁, …, X_{2n+k}) does; it follows that P(B ∩ U) = P(B ∩ V). But since ∪_{k=1}^∞ 𝒮_{2n,k} is a field generating 𝒮_{2n}, P(B ∩ U) = P(B ∩ V) holds for all B in 𝒮_{2n} (Theorem 10.4). Therefore P(A^c ∩ U) = P(A^c ∩ V); the same argument shows that P(A ∩ U^c) = P(A ∩ V^c). Hence (22.21). ∎

* This topic may be omitted.
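The mechanism behind (22.21) is that U and V apply the same Borel set H to disjoint blocks of an i.i.d. sequence, so they are independent events of equal probability. A quick empirical sketch (not part of the text; the choice n = 3, fair ±1 variables, and H = [x₁ + x₂ + x₃ > 0] are illustrative assumptions):

```python
import random

random.seed(1)

n, trials = 3, 40000

def in_H(x):                         # H = [x: x_1 + x_2 + x_3 > 0], a Borel set in R^3
    return sum(x) > 0

cU = cV = cUV = 0
for _ in range(trials):
    xs = [random.choice([-1, 1]) for _ in range(2 * n)]   # iid X_1, ..., X_2n
    u = in_H(xs[:n])                 # U = [(X_1, ..., X_n) in H]
    v = in_H(xs[n:])                 # V = [(X_{n+1}, ..., X_{2n}) in H]
    cU += u
    cV += v
    cUV += u and v

pU, pV, pUV = cU / trials, cV / trials, cUV / trials
```

Empirically pU and pV agree and pUV is close to pU·pV, reflecting P(U) = P(V) and the independence used in the proof.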
SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES   305
PROBLEMS

22.1. Suppose that X₁, X₂, … is an independent sequence and Y is measurable σ(X_n, X_{n+1}, …) for each n. Show that there exists a constant a such that P[Y = a] = 1.
22.2. 21.12 ↑ Let X₁, X₂, … be independent random variables with distribution functions F₁, F₂, ….
(a) Show that P[sup_n X_n < ∞] is 0 or 1 and is in fact 1 if and only if ∑_n (1 − F_n(x)) < ∞ for some x.
(b) Suppose that X = sup_n X_n is finite with probability 1 and let F be its distribution function. Show that F(x) = ∏_{n=1}^∞ F_n(x).
(c) Show further that E[X⁺] < ∞ if and only if ∑_n ∫_{[X_n > a]} X_n dP < ∞ for some a.
22.3. Assume {X_n} independent and define X_n^{(c)} as in Theorem 22.8. Prove that for ∑ X_n to converge with probability 1 it is necessary that ∑ P[|X_n| > c] and ∑ E[|X_n^{(c)}|] converge for all positive c and sufficient that they converge for some positive c. If the three series (22.13) converge but ∑ E[|X_n^{(c)}|] = ∞, then there is probability 1 that ∑ X_n converges conditionally but not absolutely.
22.4. ↑ (a) Generalize the Borel-Cantelli lemmas: Suppose the X_n are nonnegative. If ∑ E[X_n] < ∞, then ∑ X_n converges with probability 1. If the X_n are independent and uniformly bounded, and if ∑ E[X_n] = ∞, then ∑ X_n diverges with probability 1.
(b) Construct independent, nonnegative X_n such that ∑ X_n converges with probability 1 but ∑ E[X_n] diverges. For an extreme example, arrange that P[X_n > 0 i.o.] = 0 but E[X_n] = ∞.
22.5. Show under the hypotheses of Theorem 22.6 that ∑ X_n has finite variance, and extend Theorem 22.4 to infinite sums.
22.6. 20.20 ↑ Suppose that X₁, X₂, … are independent, each with the Cauchy distribution (20.45) for a common value of u.
(a) Show that n⁻¹ ∑_{k=1}^n X_k does not converge with probability 1 to a constant. Contrast with Theorem 22.1.
(b) Show that P[n⁻¹ max_{k ≤ n} X_k ≤ x] → e^{−u/πx} for x > 0. Relate to Theorem 14.3.
22.7. If X₁, X₂, … are independent and identically distributed, and if P[X₁ ≥ 0] = 1 and P[X₁ > 0] > 0, then ∑_n X_n = ∞ with probability 1. Deduce this from Theorem 22.1 and its corollary and also directly: find a positive ε such that X_n > ε infinitely often with probability 1.
22.8. Suppose that X₁, X₂, … are independent and identically distributed and E[|X₁|] = ∞. Show that ∑_n P[|X_n| > an] = ∞ for each a, and use (21.9) to conclude that sup_n n⁻¹|X_n| = ∞ with probability 1. Now show that sup_n n⁻¹|S_n| = ∞ with probability 1. Compare with the corollary to Theorem 22.1.
22.9. Wald's equation. Let X₁, X₂, … be independent and identically distributed with finite mean, and put S_n = X₁ + ⋯ + X_n. Suppose that T is a stopping time: T has positive integers as values and [T = n] ∈ σ(X₁, …, X_n); see Section 7 for examples. Suppose also that E[T] < ∞.
(a) Prove that

(22.22)  E[S_T] = E[X₁] E[T].
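Wald's identity (22.22) can be checked numerically. The sketch below (not part of the text) uses the gambler's-ruin stopping time of part (b) with illustrative values p = 0.6, a = 3, b = 4 and compares Monte Carlo estimates of the two sides:

```python
import random

random.seed(2)

p, a, b = 0.6, 3, 4          # P[X_k = 1] = p, P[X_k = -1] = 1 - p; stop at -a or b
mu = 2 * p - 1               # E[X_1]

reps = 20000
tot_T = tot_S = 0
for _ in range(reps):
    s = t = 0
    while -a < s < b:        # T is the first n with S_n = -a or S_n = b
        s += 1 if random.random() < p else -1
        t += 1
    tot_T += t
    tot_S += s

lhs = tot_S / reps           # estimates E[S_T]
rhs = mu * tot_T / reps      # estimates E[X_1] E[T]
```

The two estimates agree to within Monte Carlo error; dividing the estimate of E[S_T] by E[X₁] also recovers the expected duration E[T] asked for in part (b).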
(b) Suppose that X_n is ±1 with probabilities p and q, p ≠ q, let T be the first n for which S_n is −a or b (a and b positive integers), and calculate E[T]. This gives the expected duration of the game in the gambler's ruin problem for unequal p and q.

22.10. 20.12 ↑ Let Z_n be 1 or 0 according as at time n there is or is not a record in the sense of Problem 20.12. Let R_n = Z₁ + ⋯ + Z_n be the number of records up to time n. Show that R_n/log n →_P 1.

22.11. 22.1 ↑ (a) Show that for an independent sequence {X_n} the radius of convergence of the random Taylor series ∑_n X_n zⁿ is r with probability 1 for some nonrandom r.
(b) Suppose that the X_n have the same distribution and P[X₁ ≠ 0] > 0. Show that r is 1 or 0 according as log⁺|X₁| has finite mean or not.

22.12. Suppose that X₀, X₁, … are independent and each is uniformly distributed over [0, 2π]. Show that with probability 1 the series ∑_n e^{iX_n} zⁿ has the unit circle as its natural boundary.

22.13. In the proof of Kolmogorov's zero-one law, use the corollary to Theorem 11.4 in place of Theorem 4.2. Hint: See the first part of the proof of the Hewitt-Savage zero-one law.

22.14. Prove (what is essentially Kolmogorov's zero-one law) that if A is independent of a π-system 𝒫 and A ∈ σ(𝒫), then P(A) is either 0 or 1.

22.15. 11.3 ↑ Suppose that 𝒜 is a semiring containing Ω.
(a) Show that if P(A ∩ B) ≤ bP(B) for all B ∈ 𝒜, and if b < 1 and A ∈ σ(𝒜), then P(A) = 0.
(b) Show that if P(A ∩ B) ≤ P(A)P(B) for all B ∈ 𝒜, and if A ∈ σ(𝒜), then P(A) is 0 or 1.
(c) Show that if aP(B) ≤ P(A ∩ B) for all B ∈ 𝒜, and if a > 0 and A ∈ σ(𝒜), then P(A) = 1.
(d) Show that if P(A)P(B) ≤ P(A ∩ B) for all B ∈ 𝒜, and if A ∈ σ(𝒜), then P(A) is 0 or 1.
(e) Reconsider Problem 3.20.

22.16. 22.14 ↑ Burstin's theorem. Let f be a Borel function on [0, 1] with arbitrarily small periods: for each ε there is a p with 0 < p < ε such that f(x) = f(x + p) for 0 ≤ x ≤ 1 − p. Show that such an f is constant almost everywhere:
(a) Show that it is enough to prove that P(f⁻¹B) is 0 or 1 for every Borel set B, where P is Lebesgue measure on the unit interval.
(b) Show that f⁻¹B is independent of each interval [0, x], and conclude that P(f⁻¹B) is 0 or 1.
(c) Show by example that f need not be constant.
SECTION 23. THE POISSON PROCESS

Characterization of the Exponential Distribution
Suppose that X has the exponential distribution with parameter α:

(23.1)  P[X > x] = e^{−αx},  x ≥ 0.

The definition (4.1) of conditional probability then gives

(23.2)  P[X > x + y | X > x] = P[X > y],  x, y ≥ 0.
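The relations (23.1) and (23.2) are easy to check by simulation. The sketch below (not part of the text; α = 1.5 and the points x, y are arbitrary illustrative choices) estimates both sides of (23.2) from exponential samples:

```python
import math
import random

random.seed(3)

alpha, x, y, n = 1.5, 0.8, 0.5, 100000
samples = [random.expovariate(alpha) for _ in range(n)]

# memorylessness: P[X > x + y | X > x] should match P[X > y] = e^{-alpha*y}
cond = sum(1 for s in samples if s > x + y) / sum(1 for s in samples if s > x)
uncond = sum(1 for s in samples if s > y) / n
```

The conditional and unconditional estimates agree with each other and with e^{−αy}, as (23.1) and (23.2) require.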
Imagine X as the waiting time for the occurrence of some event such as the arrival of the next customer at a queue or telephone call at an exchange. As observed in Section 14 (see (14.6)), (23.2) attributes to the waiting-time mechanism a lack of memory or aftereffect. And as shown in Section 14, the condition (23.2) implies that X has the distribution (23.1) for some positive α. Thus if in the sense of (23.2) there is no aftereffect in the waiting-time mechanism, then the waiting time itself necessarily follows the exponential law.

The Poisson Process
Consider next a stream or sequence of events, say arrivals of calls at an exchange. Let X₁ be the waiting time to the first event, let X₂ be the waiting time between the first and second events, and so on. The formal model consists of an infinite sequence X₁, X₂, … of random variables on some probability space, and S_n = X₁ + ⋯ + X_n represents the time of occurrence of the nth event; it is convenient to write S₀ = 0. The stream of events itself remains intuitive and unformalized, and the mathematical definitions and arguments are framed exclusively in terms of the X_n. If no two of the events are to occur simultaneously, the S_n must be strictly increasing, and if only finitely many of the events are to occur in each finite interval of time, S_n must go to infinity:

(23.3)  0 = S₀(ω) < S₁(ω) < S₂(ω) < ⋯,  lim_n S_n(ω) = ∞.

This condition is the same thing as

(23.4)  X_n(ω) > 0 for n ≥ 1,  ∑_n X_n(ω) = ∞.

Throughout the section it will be assumed that these conditions hold everywhere, that is, for every ω. If they hold only on a set A of probability 1, and if X_n(ω) is redefined as X_n(ω) = 1, say, for ω ∉ A, then the conditions hold everywhere and the joint distributions of the X_n and S_n are unaffected.

Condition 0°. For each ω, (23.3) and (23.4) hold.
The arguments go through under the weaker condition that (23.3) and (23.4) hold with probability 1, but they then involve some fussy and uninteresting details. There are at the outset no further restrictions on the X_i; they are not assumed independent, for example, or identically distributed. The number N_t of events that occur in the time interval [0, t] is the largest integer n such that S_n ≤ t:

(23.5)  N_t = max[n: S_n ≤ t].

Note that N_t = 0 if t < S₁ = X₁; in particular, N₀ = 0. The number of events in (s, t] is the increment N_t − N_s.
[Figure: a sample path of N_t: a step function with N_t = 0 on [0, S₁), jumping to 1 at S₁ = X₁, to 2 at S₂ = X₁ + X₂, and so on.]
From (23.5) follows the basic relation connecting the N_t with the S_n:

(23.6)  [N_t ≥ n] = [S_n ≤ t].

From this follows

(23.7)  [N_t = n] = [S_n ≤ t] − [S_{n+1} ≤ t].

Each N_t is thus a random variable. The collection [N_t: t ≥ 0] is a stochastic process, that is, a collection of random variables indexed by a parameter t regarded as time. Condition 0° can be restated in terms of this process:
Condition 0°. For each ω, N_t(ω) is a nonnegative integer for t ≥ 0, N₀(ω) = 0, and lim_{t→∞} N_t(ω) = ∞; further, for each ω, N_t(ω) as a function of t is nondecreasing and right-continuous, and at the points of discontinuity the saltus N_t(ω) − sup_{s<t} N_s(ω) is exactly 1.
It is easy to see that (23.3) and (23.4) and the definition (23.5) give random variables N_t having these properties. On the other hand, if the stochastic process [N_t: t ≥ 0] is given and does have these properties, and if random variables are defined by S_n(ω) = inf[t: N_t(ω) ≥ n] and X_n(ω) = S_n(ω) − S_{n−1}(ω), then (23.3) and (23.4) hold, and the definition (23.5) gives back the original N_t. Therefore, anything that can be said about the X_n can be stated in terms of the N_t, and conversely. The points S₁(ω), S₂(ω), … of (0, ∞) are exactly the discontinuities of N_t(ω) as a function of t; because of the queueing example, it is natural to call them arrival times. The program is to study the joint distributions of the N_t under conditions on the waiting times X_n and vice versa. The most common model specifies the independence of the waiting times and the absence of aftereffect:

Condition 1°. The X_n are independent and each is exponentially distributed with parameter α.

In this case P[X_n > 0] = 1 for each n and n⁻¹S_n → α⁻¹ by the strong law of large numbers (Theorem 22.1), and so (23.3) and (23.4) hold with probability 1; to assume they hold everywhere (Condition 0°) is simply a convenient normalization. Under Condition 1°, S_n has the distribution function specified by (20.40), so that P[N_t ≥ n] = ∑_{i=n}^∞ e^{−αt}(αt)^i/i! by (23.6), and

(23.8)  P[N_t = n] = e^{−αt} (αt)ⁿ / n!,  n = 0, 1, ….

Thus N_t has the Poisson distribution with mean αt. More will be proved presently.
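The passage from exponential waiting times to the Poisson count (23.8) can be illustrated by simulation. In the sketch below (not part of the text; α = 2, t = 3, and the sample size are illustrative choices), N_t is generated directly from the definition (23.5):

```python
import math
import random

random.seed(4)

alpha, t, reps = 2.0, 3.0, 20000

def N_t():
    # count arrivals in [0, t] for iid exponential waiting times (Condition 1)
    s, n = 0.0, 0
    while True:
        s += random.expovariate(alpha)
        if s > t:
            return n
        n += 1

counts = [N_t() for _ in range(reps)]
mean = sum(counts) / reps                  # should be near alpha * t = 6
p4 = counts.count(4) / reps                # empirical P[N_t = 4]
poisson4 = math.exp(-alpha * t) * (alpha * t) ** 4 / math.factorial(4)
```

Both the mean and the point probability at n = 4 match the Poisson(αt) values predicted by (23.8).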
Condition 2°. (i) For 0 < t₁ < ⋯ < t_k the increments N_{t₁}, N_{t₂} − N_{t₁}, …, N_{t_k} − N_{t_{k−1}} are independent.
(ii) The individual increments have the Poisson distribution:

(23.9)  P[N_t − N_s = n] = e^{−α(t−s)} (α(t − s))ⁿ / n!,  n = 0, 1, …,  0 ≤ s < t.

Since N₀ = 0, (23.8) is a special case of (23.9). A collection [N_t: t ≥ 0] of random variables satisfying Condition 2° is called a Poisson process, and α is the rate of the process. As the increments are independent by (i), if r < s < t, then the distributions of N_s − N_r and N_t − N_s must convolve to that of N_t − N_r. But this requirement is consistent with (ii) because Poisson distributions with parameters u and v convolve to a Poisson distribution with parameter u + v.

Theorem 23.1. Conditions 1° and 2° are equivalent in the presence of Condition 0°.
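The convolution fact just cited is an exact identity and can be verified numerically (a sketch, not part of the text; the parameter values u = 1.3, v = 2.2 are arbitrary):

```python
import math

def pois(lam, n):
    # Poisson(lam) probability of the value n
    return math.exp(-lam) * lam ** n / math.factorial(n)

u, v = 1.3, 2.2
# convolution of Poisson(u) with Poisson(v), compared with Poisson(u + v)
max_err = max(
    abs(sum(pois(u, j) * pois(v, n - j) for j in range(n + 1)) - pois(u + v, n))
    for n in range(15)
)
```

The discrepancy `max_err` is at the level of floating-point roundoff, since by the binomial theorem the convolution equals e^{−(u+v)}(u + v)ⁿ/n! exactly.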
PROOF OF 1° → 2°. Fix t and consider the events that happen after time t. By (23.5), S_{N_t} ≤ t < S_{N_t+1}, and the waiting time from t to the first event following t is S_{N_t+1} − t; the waiting time between the first and second events following t is X_{N_t+2}; and so on. Thus

(23.10)  X₁^{(t)} = S_{N_t+1} − t,  X₂^{(t)} = X_{N_t+2},  X₃^{(t)} = X_{N_t+3},  …

define the waiting times following t. By (23.6), N_{t+s} − N_t ≥ m, or N_{t+s} ≥ N_t + m, if and only if S_{N_t+m} ≤ t + s, which is the same thing as X₁^{(t)} + ⋯ + X_m^{(t)} ≤ s. Thus

(23.11)  N_{t+s} − N_t = max[m: X₁^{(t)} + ⋯ + X_m^{(t)} ≤ s].

Hence [N_{t+s} − N_t = m] = [X₁^{(t)} + ⋯ + X_m^{(t)} ≤ s < X₁^{(t)} + ⋯ + X_{m+1}^{(t)}]. A comparison of (23.11) and (23.5) shows that for fixed t the random variables N_{t+s} − N_t for s ≥ 0 are defined in terms of the sequence (23.10) in exactly the same way as the N_s are defined in terms of the original sequence of waiting times.

The idea now is to show that conditionally on the event [N_t = n] the random variables (23.10) are independent and exponentially distributed. Because of the independence of the X_k and the basic property (23.2) of the exponential distribution, this seems intuitively clear. For a proof, apply (20.30). Suppose y ≥ 0; if G_n is the distribution function of S_n, then since X_{n+1} has the exponential distribution,

P[N_t = n, X₁^{(t)} > y] = P[S_n ≤ t, S_{n+1} > t + y]
  = ∫_{[0,t]} P[X_{n+1} > t + y − x] dG_n(x)
  = e^{−αy} ∫_{[0,t]} P[X_{n+1} > t − x] dG_n(x)
  = e^{−αy} P[S_n ≤ t < S_{n+1}] = e^{−αy} P[N_t = n].

By the assumed independence of the X_n, the same calculation gives

P[N_t = n, X₁^{(t)} > y₁, …, X_j^{(t)} > y_j] = P[S_{n+1} − t > y₁, S_n ≤ t < S_{n+1}] e^{−αy₂} ⋯ e^{−αy_j}
  = e^{−αy₁} e^{−αy₂} ⋯ e^{−αy_j} P[N_t = n].

If H = (y₁, ∞) × ⋯ × (y_j, ∞), this is

(23.12)  P[N_t = n, (X₁^{(t)}, …, X_j^{(t)}) ∈ H] = P[N_t = n] P[(X₁, …, X_j) ∈ H].

By Theorem 10.4, the equation extends from H of the special form above to all H in ℛ^j.

Now the event [N_{s_i} = m_i, 1 ≤ i ≤ u] can be put in the form [(X₁, …, X_j) ∈ H], where j = m_u + 1 and H is the set of x in R^j for which x₁ + ⋯ + x_{m_i} ≤ s_i < x₁ + ⋯ + x_{m_i+1}, 1 ≤ i ≤ u. But then [(X₁^{(t)}, …, X_j^{(t)}) ∈ H] is by (23.11) the same as the event [N_{t+s_i} − N_t = m_i, 1 ≤ i ≤ u]. Thus (23.12) gives

P[N_t = n, N_{t+s_i} − N_t = m_i, 1 ≤ i ≤ u] = P[N_t = n] P[N_{s_i} = m_i, 1 ≤ i ≤ u].

From this it follows by induction on k that if 0 = t₀ < t₁ < ⋯ < t_k, then

(23.13)  P[N_{t_i} − N_{t_{i−1}} = n_i, 1 ≤ i ≤ k] = ∏_{i=1}^k P[N_{t_i − t_{i−1}} = n_i].

Thus Condition 1° implies (23.13) and, as already seen, (23.8). But from (23.13) and (23.8) follow the two parts of Condition 2°. ∎
PROOF OF 2° → 1°. If 2° holds, then by (23.6), P[X₁ > t] = P[N_t = 0] = e^{−αt}, so that X₁ is exponentially distributed. To find the joint distribution of X₁ and X₂, suppose that 0 < s₁ < t₁ < s₂ < t₂ and perform the calculation

P[s₁ < S₁ ≤ t₁, s₂ < S₂ ≤ t₂]
  = P[N_{s₁} = 0, N_{t₁} − N_{s₁} = 1, N_{s₂} − N_{t₁} = 0, N_{t₂} − N_{s₂} ≥ 1]
  = e^{−αs₁} · α(t₁ − s₁)e^{−α(t₁−s₁)} · e^{−α(s₂−t₁)} · (1 − e^{−α(t₂−s₂)})
  = α(t₁ − s₁)(e^{−αs₂} − e^{−αt₂}) = ∫∫_{s₁<y₁≤t₁, s₂<y₂≤t₂} α² e^{−αy₂} dy₁ dy₂.

Thus for a rectangle A contained in the open set G = [(y₁, y₂): 0 < y₁ < y₂],

P[(S₁, S₂) ∈ A] = ∫∫_A α² e^{−αy₂} dy₁ dy₂.

By inclusion-exclusion, this holds for finite unions of such rectangles and hence by a passage to the limit for countable ones. Therefore, it holds for A = G ∩ G′ if G′ is open. Since the open sets form a π-system generating the Borel sets, (S₁, S₂) has density α²e^{−αy₂} on G (of course, the density is 0 outside G). By a similar argument in R^k (the notation only is more complicated), (S₁, …, S_k) has density α^k e^{−αy_k} on [y: 0 < y₁ < ⋯ < y_k]. If a linear transformation g(y) = x is defined by x_i = y_i − y_{i−1}, then (X₁, …, X_k) = g(S₁, …, S_k) has by (20.20) the density ∏_{i=1}^k αe^{−αx_i} (the Jacobian is identically 1). This proves Condition 1°. ∎
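The rectangle probability computed at the start of this proof can be checked against simulation. The sketch below (not part of the text; the rate and the rectangle corners are illustrative, chosen so that the rectangle lies inside G) estimates P[s₁ < S₁ ≤ t₁, s₂ < S₂ ≤ t₂] and compares it with α(t₁ − s₁)(e^{−αs₂} − e^{−αt₂}):

```python
import math
import random

random.seed(5)

alpha = 1.0
s1, t1, s2, t2 = 0.2, 0.6, 1.0, 1.5      # rectangle inside [0 < y1 < y2]
reps = 200000
hits = 0
for _ in range(reps):
    S1 = random.expovariate(alpha)       # S_1 = X_1
    S2 = S1 + random.expovariate(alpha)  # S_2 = X_1 + X_2
    if s1 < S1 <= t1 and s2 < S2 <= t2:
        hits += 1

empirical = hits / reps
exact = alpha * (t1 - s1) * (math.exp(-alpha * s2) - math.exp(-alpha * t2))
```

The empirical frequency agrees with the closed form, which is what integrating the density α²e^{−αy₂} over the rectangle gives.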
The Poisson Approximation
Other characterizations of the Poisson process depend on a generalization of the classical Poisson approximation to the binomial distribution.
Theorem 23.2. Suppose that for each n, Z_{n1}, …, Z_{nr_n} are independent random variables and Z_{nk} assumes the values 1 and 0 with probabilities p_{nk} and 1 − p_{nk}. If

(23.14)  ∑_{k=1}^{r_n} p_{nk} → λ ≥ 0,  max_{1≤k≤r_n} p_{nk} → 0,

then

(23.15)  P[∑_{k=1}^{r_n} Z_{nk} = i] → e^{−λ} λ^i / i!,  i = 0, 1, 2, ….

If λ = 0, the limit in (23.15) is interpreted as 1 for i = 0 and 0 for i ≥ 1. In the case where r_n = n and p_{nk} = λ/n, (23.15) is the Poisson approximation to the binomial. Note that if λ > 0, then (23.14) implies r_n → ∞.

[Diagram: the unit interval [0, 1) split two ways: into I₀(p) = [0, 1 − p) and I₁(p) = [1 − p, 1), and into J₀(p), J₁(p), J₂(p), … of lengths e^{−p}, e^{−p}p/1!, e^{−p}p²/2!, …; since 1 − p ≤ e^{−p}, J₁(p) ⊂ I₁(p).]

PROOF.
The argument depends on a construction like that in the proof of Theorem 20.4. Let U₁, U₂, … be independent random variables, each uniformly distributed over [0, 1). For each p, 0 ≤ p < 1, split [0, 1) into the two intervals I₀(p) = [0, 1 − p) and I₁(p) = [1 − p, 1), as well as into the sequence of intervals J_i(p) = [∑_{j<i} e^{−p}p^j/j!, ∑_{j≤i} e^{−p}p^j/j!), i = 0, 1, ….

Define V_{nk} = 1 if U_k ∈ I₁(p_{nk}) and V_{nk} = 0 if U_k ∈ I₀(p_{nk}). Then V_{n1}, …, V_{nr_n} are independent, and V_{nk} assumes the values 1 and 0 with probabilities P[U_k ∈ I₁(p_{nk})] = p_{nk} and P[U_k ∈ I₀(p_{nk})] = 1 − p_{nk}. Since V_{n1}, …, V_{nr_n} have the same joint distribution as Z_{n1}, …, Z_{nr_n}, (23.15) will follow if it is shown that V_n = ∑_{k=1}^{r_n} V_{nk} satisfies

(23.16)  P[V_n = i] → e^{−λ} λ^i / i!.

Now define W_{nk} = i if U_k ∈ J_i(p_{nk}), i = 0, 1, …. Then P[W_{nk} = i] = e^{−p_{nk}} p_{nk}^i / i!: W_{nk} has the Poisson distribution with mean p_{nk}. Since the W_{nk} are independent, W_n = ∑_{k=1}^{r_n} W_{nk} has the Poisson distribution with mean λ_n = ∑_{k=1}^{r_n} p_{nk}. Since 1 − p ≤ e^{−p}, J₁(p) ⊂ I₁(p) (see the diagram). Therefore,

P[V_{nk} ≠ W_{nk}] = p_{nk} − e^{−p_{nk}} p_{nk} ≤ p_{nk}²,

and

P[V_n ≠ W_n] ≤ ∑_{k=1}^{r_n} p_{nk}² ≤ λ_n max_{1≤k≤r_n} p_{nk} → 0

by (23.14). And now (23.16) and (23.15) follow because

|P[V_n = i] − P[W_n = i]| ≤ P[V_n ≠ W_n] → 0 and P[W_n = i] = e^{−λ_n} λ_n^i / i! → e^{−λ} λ^i / i!. ∎
Other Characterizations of the Poisson Process
The condition (23.2) is an interesting characterization of the exponential distribution because it is essentially qualitative. There are qualitative characterizations of the Poisson process as well.

For each ω, the function N_t(ω) has a discontinuity at t if and only if S_n(ω) = t for some n ≥ 1; t is a fixed discontinuity if the probability of this is positive. The condition that there be no fixed discontinuities is therefore

(23.17)  P[S_n = t] = 0,  t > 0,  n ≥ 1;

that is, each of S₁, S₂, … has a continuous distribution function. Of course there is probability 1 (under Condition 0°) that N_t(ω) has a discontinuity somewhere (and indeed has infinitely many of them). But (23.17) ensures that a t specified in advance has probability 0 of being a discontinuity, or time of an arrival. The Poisson process satisfies this natural condition.

Theorem 23.3. If Condition 0° holds and [N_t: t ≥ 0] has independent increments and no fixed discontinuities, then each increment has a Poisson distribution.
This is Prekopa's theorem. The conclusion is not that [N_t: t ≥ 0] is a Poisson process, because the mean of N_t − N_s need not be proportional to t − s. If φ is an arbitrary nondecreasing, continuous function on [0, ∞) with φ(0) = 0, and [N_t: t ≥ 0] is a Poisson process, then N_{φ(t)} satisfies the conditions of the theorem.†

PROOF.* The problem is to show for t′ < t″ that N_{t″} − N_{t′} has for some λ ≥ 0 a Poisson distribution with mean λ, a unit mass at 0 being regarded as a Poisson distribution with mean 0.
† This is in fact the general process satisfying them; see Problem 23.9.
* This proof carries over to general topological spaces; see Timothy C. Brown, Amer. Math. Monthly, 91 (1984), 116-123.
The procedure is to construct a sequence of partitions

(23.18)  t′ = t_{n0} < t_{n1} < ⋯ < t_{nr_n} = t″

of [t′, t″] with three properties. First, each decomposition refines the preceding one: each t_{nk} is a t_{n+1,j}. Second,

(23.19)  ∑_{k=1}^{r_n} P[N_{t_{nk}} − N_{t_{n,k−1}} ≥ 1] ↑ λ

for some finite λ, and

(23.20)  max_{1≤k≤r_n} P[N_{t_{nk}} − N_{t_{n,k−1}} ≥ 1] → 0.

Third,

(23.21)  P[max_{1≤k≤r_n} (N_{t_{nk}} − N_{t_{n,k−1}}) ≥ 2] → 0.
Once the partitions have been constructed, the rest of the proof is easy: Let Z_{nk} be 1 or 0 according as N_{t_{nk}} − N_{t_{n,k−1}} is positive or not. Since [N_t: t ≥ 0] has independent increments, the Z_{nk} are independent for each n. By Theorem 23.2, therefore, (23.19) and (23.20) imply that Z_n = ∑_{k=1}^{r_n} Z_{nk} satisfies P[Z_n = i] → e^{−λ}λ^i/i!. Now N_{t″} − N_{t′} ≥ Z_n, and there is strict inequality if and only if N_{t_{nk}} − N_{t_{n,k−1}} ≥ 2 for some k. Thus (23.21) implies P[N_{t″} − N_{t′} ≠ Z_n] → 0, and therefore P[N_{t″} − N_{t′} = i] = e^{−λ}λ^i/i!.

To construct the partitions, consider for each t the distance D_t = inf_{m≥1} |t − S_m| from t to the nearest arrival time. Since S_m → ∞, the infimum is achieved. Further, D_t = 0 if and only if S_m = t for some m, and since by hypothesis there are no fixed discontinuities, the probability of this is 0: P[D_t = 0] = 0. Choose δ_t so that 0 < δ_t < n⁻¹ and P[D_t ≤ δ_t] < n⁻¹. The intervals (t − δ_t, t + δ_t) for t′ ≤ t ≤ t″ cover [t′, t″]. Choose a finite subcover, and in (23.18) take the t_{nk} for 0 < k < r_n to be the endpoints (of intervals in the subcover) that are contained in (t′, t″). By the construction,

(23.22)  max_{1≤k≤r_n} (t_{nk} − t_{n,k−1}) ≤ 2n⁻¹,

and the probability that (t_{n,k−1}, t_{nk}] contains some S_m is less than n⁻¹. This gives a sequence of partitions satisfying (23.20). Inserting more points in a partition cannot increase the maxima in (23.20) and (23.22), and so it can be arranged that each partition refines the preceding one.
To prove (23.21) it is enough (Theorem 4.1) to show that the limit superior of the sets involved has probability 0. It is in fact empty: If for infinitely many n, N_{t_{nk}}(ω) − N_{t_{n,k−1}}(ω) ≥ 2 holds for some k ≤ r_n, then by (23.22), N_t(ω) as a function of t has in [t′, t″] discontinuity points (arrival times) arbitrarily close together, which requires S_m(ω) ∈ [t′, t″] for infinitely many m, in violation of Condition 0°.

It remains to prove (23.19). If Z_{nk} and Z_n are defined as above and p_{nk} = P[Z_{nk} = 1], then the sum in (23.19) is ∑_k p_{nk} = E[Z_n]. Since Z_{n+1} ≥ Z_n, ∑_k p_{nk} is nondecreasing in n. Now

P[N_{t″} − N_{t′} = 0] = P[Z_{nk} = 0, k ≤ r_n] = ∏_{k=1}^{r_n} (1 − p_{nk}) ≤ exp[−∑_{k=1}^{r_n} p_{nk}].

If the left-hand side here is positive, this puts an upper bound on ∑_k p_{nk}, and (23.19) follows. But suppose P[N_{t″} − N_{t′} = 0] = 0. If s is the midpoint of t′ and t″, then since the increments are independent, one of P[N_s − N_{t′} = 0] and P[N_{t″} − N_s = 0] must vanish. It is therefore possible to find a nested sequence of intervals [u_m, v_m] such that v_m − u_m → 0 and the event A_m = [N_{v_m} − N_{u_m} ≥ 1] has probability 1. But then P(∩_m A_m) = 1, and if t is the point common to the [u_m, v_m], there is an arrival at t with probability 1, contrary to the assumption that there are no fixed discontinuities. ∎
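The remark following Theorem 23.3 (a time-changed Poisson process satisfies the hypotheses but need not be a Poisson process) is easy to illustrate. In the sketch below (not part of the text; the choice φ(t) = t², which is nondecreasing and continuous with φ(0) = 0, is an illustrative assumption), the two increments of M_t = N_{φ(t)} over intervals of equal length have different Poisson means:

```python
import random

random.seed(6)

alpha = 1.0
phi = lambda t: t * t        # nondecreasing, continuous, phi(0) = 0

reps = 20000
sum1 = sum2 = 0
for _ in range(reps):
    arrivals = []
    s = random.expovariate(alpha)
    while s <= phi(2.0):
        arrivals.append(s)
        s += random.expovariate(alpha)
    n_phi1 = sum(1 for a in arrivals if a <= phi(1.0))
    sum1 += n_phi1                      # M_1 - M_0 ~ Poisson(phi(1)) = Poisson(1)
    sum2 += len(arrivals) - n_phi1      # M_2 - M_1 ~ Poisson(phi(2) - phi(1)) = Poisson(3)

m1 = sum1 / reps
m2 = sum2 / reps
```

The increments over [0, 1] and [1, 2] have the same length but means near 1 and 3, so the mean of M_t − M_s is not proportional to t − s, exactly as the remark says.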
Theorem 23.3 in some cases makes the Poisson model quite plausible. The increments will be essentially independent if the arrivals to time s cannot seriously deplete the population of potential arrivals, so that N_s has for t > s negligible effect on N_t − N_s. And the condition that there are no fixed discontinuities is entirely natural. These conditions hold for arrivals of calls at a telephone exchange if the rate of calls is small in comparison with the population of subscribers and calls are not placed at fixed, predetermined times. If the arrival rate is essentially constant, this leads to the following condition.
Condition 3°. (i) For 0 < t₁ < ⋯ < t_k the increments N_{t₁}, N_{t₂} − N_{t₁}, …, N_{t_k} − N_{t_{k−1}} are independent.
(ii) The distribution of N_t − N_s depends only on the difference t − s.

Theorem 23.4. Conditions 1°, 2°, and 3° are equivalent in the presence of Condition 0°.

PROOF. Obviously Condition 2° implies 3°. Suppose that Condition 3° holds. If J_t is the saltus at t (J_t = N_t − sup_{s<t} N_s), then [N_t − N_{t−n⁻¹} ≥ 1] ⊃ [J_t ≥ 1], and it follows by (ii) of Condition 3° that P[J_t ≥ 1] is the same for all t. But if the value common to the P[J_t ≥ 1] is positive, then by the independence of the increments and the second Borel-Cantelli lemma there is probability 1 that J_t ≥ 1 for infinitely many rational t in (0, 1), for example, which contradicts Condition 0°. By Theorem 23.3, then, the increments have Poisson distributions. If f(t) is the mean of N_t, then N_t − N_s for s < t must have mean f(t) − f(s); thus N_{t−s}, which by (ii) has the same distribution, must have mean f(t) − f(s) = f(t − s). Therefore, f satisfies Cauchy's functional equation [A20] and, being nondecreasing, must have the form f(t) = αt for some α ≥ 0. Condition 0° makes α = 0 impossible. ∎
f( t)
One standard way of deriving the Poisson process is by differential equations.
Condition 4°. If 0 < t1 < tegers, then
···
<
tk and if n1, . . . , nk are nonnegative in
(23.23)
tmd (23.24 )
Moreover, [ N,: t > 0] has no fixed discontinuities. The occurrences of o( h) in (23.23) and (23.24) denote functions, say +1 ( h ), and f/>2 (h), such that h Iq,; ( h) -+ 0 as h � 0; the f/>; may a priori depend on k, t1, , tk, and n1, , nk as well as on h. It is assumed in as h � 0 .
-
•
•
•
•
•
•
(23.23) and (23.24) that the conditioning events have positive probability, so that the conditional probabilities are well defined.
1beorem
23.5.
condition 0° .
Conditions 1 ° through 4° are all equivalent in the presence of
2 ° -+ 4 °. For a Poisson process with rate a, the left-hand sides of (23.23) and (23.24) are e - ahah and 1 - e - ah - e - ahah, and these are «h + o(h) and o(h), respectively, because e - ah = 1 - ah + o(h). And by PROOF OF
the argument -·
PROOF
[�i
in the preceding proof, the process has no fixed discontinui-
ti,
ni;
.
OF 4° -+ 2 ° . Fix k, the and the denote by A the event j � k]; and for > 0 put = P [ N," + t - N," = i A ] . It will
== ni,
t
Pn (t)
n
318
RANDOM VARIABLES AND EXPECTED VALUES
be shown that (23 .25 )
r , (at _ ,. t = e n ( P ) n! '
n
This will also be proved for the case in which 2° will then follow by induction. If t > 0 and I t - s I < n - 1 , then
= 0, 1 , . . . . Pn(t) = P[N1 = n]. Condition
As n ↑ ∞, the right side here decreases to the probability of a discontinuity at t, which is 0 by hypothesis. Thus P[N_t = n] is continuous at t. The same kind of argument works for conditional probabilities and for t = 0, and so p_n(t) is continuous for t ≥ 0. To simplify the notation, put D_t = N_{t_k + t} − N_{t_k}. If D_{t+h} = n, then D_t = m for some m ≤ n. If t > 0, then by the rules for conditional probabilities,
p_n(t + h) = p_n(t) P[D_{t+h} − D_t = 0 | A ∩ [D_t = n]]
  + p_{n−1}(t) P[D_{t+h} − D_t = 1 | A ∩ [D_t = n − 1]]
  + ∑_{m=0}^{n−2} p_m(t) P[D_{t+h} − D_t = n − m | A ∩ [D_t = m]].
For n ≤ 1, the final sum is absent, and for n = 0, the middle term is absent as well. This holds in the case p_n(t) = P[N_t = n] if D_t = N_t and A = Ω. (If t = 0, some of the conditioning events here are empty; hence the assumption t > 0.) By (23.24), the final sum is o(h) for each fixed n. Applying (23.23) and (23.24) now leads to
p_n(t + h) = p_n(t)(1 − αh) + p_{n−1}(t)αh + o(h),

and letting h ↓ 0 gives

(23.26)  p_n′(t) = −αp_n(t) + αp_{n−1}(t).

In the case n = 0, take p_{−1}(t) to be identically 0. In (23.26), t ≥ 0 and p_n′(t) is a right-hand derivative. But since p_n(t) and the right side of the equation are continuous on [0, ∞), (23.26) holds also for t = 0 and p_n′(t) can be taken as a two-sided derivative for t > 0 [A22].
Now (23.26) gives [A23]

p_n(t) = e^{−αt} [p_n(0) + α ∫₀ᵗ e^{αs} p_{n−1}(s) ds].

Since p_n(0) is 1 or 0 as n = 0 or n > 0, (23.25) follows by induction on n. ∎
Stochastic Processes
stochastic process-that
t
The Poisson process [N_t: t ≥ 0] is one example of a stochastic process, that is, a collection of random variables (on some probability space (Ω, ℱ, P)) indexed by a parameter regarded as representing time. In the Poisson case, time is continuous. In some cases the time is discrete: Section 7 concerns the sequence {F_n} of a gambler's fortunes; there n represents time, but time that increases in jumps. Part of the structure of a stochastic process is specified by its finite-dimensional distributions. For any finite sequence t₁, …, t_k of time points, the k-dimensional random vector (N_{t₁}, …, N_{t_k}) has a distribution μ_{t₁⋯t_k} over R^k. These measures μ_{t₁⋯t_k} are the finite-dimensional distributions of the process. Condition 2° of this section in effect specifies them for the Poisson case:
continuous.
discrete: n
dimensional distributions.
t
1
n1
n
1
•
•
•
•
•
tk
finite
•
•
t
t
n 0 t0
(23.27)  μ_{t₁⋯t_k}[x: x_i = n_i, 1 ≤ i ≤ k] = ∏_{i=1}^k e^{−α(t_i − t_{i−1})} (α(t_i − t_{i−1}))^{n_i − n_{i−1}} / (n_i − n_{i−1})!

if 0 < t₁ < ⋯ < t_k and 0 ≤ n₁ ≤ ⋯ ≤ n_k (take n₀ = t₀ = 0).
t
/(t)
(23 .28)
t
M1 ( w ) = N1 (w ) + f ( t + X1 ( w ) ) .
If R is the set of rationals, then P[w: f(t + X1 (w)) ¢ 0] = P[w: X1 (w) e R - t ] = 0 for each t because R - t is countable and X1 has a density. Thus P[M1 = N1 ] = 1 for each t, and so the stochastic process [M,: t > 0] has the same finite-dimensional distributions as [N1 : t > 0]. For w fixed,
320
RANDOM VARIABLES AND EXPECTED VALUES
t
however, M1 ( w ) as a function of is everywhere discontinuous and is neither monotone nor exclusively integer-valued. The functions obtained by fixing w and letting vary are called the or of the process. The example above shows that the finite-dimensional distributions do not suffice to determine the character of the path functions. In specifying a stochastic process as a model for some phenomenon, it is natural to place conditions on the character of the sample paths as well as on the finite-dimensional distributions. Condition 0° was imposed throughout this section to ensure that the sample paths are nondecreasing, right-continuous, integer-valued step functions, a natural condition if N, is to represent the number of events in [0, Stochastic processes in continuous time are studied further in Chapter 7.
path
t
functions sample paths
t].
PROBLEMS
Assume the Poisson processes here satisfy Condition 0° as well as Condition 1°. 13.1. 13.2.
13.3.
Show that the minimum of independent exponential waiting times is again exponential and that the parameters add. 20.23 f Show that the time Sn of the nth event in a Poisson stream has the gamma density /( x; a, n) as defined by (20.47). This is sometimes called the Er/ang density. Let A , t - SN1 be the time back to the most recent event in the Poisson stream (or to 0) and let B, SN 1 - t be the time forward to the next event. Show that A, and B, are independent, that B, is distributed as X1 (exponentially with parameter a), and that A, is distributed as min{ X1 , t } : P [ A, S t] is 0, 1 - e - or 1 as x < 0, 0 < x < t, or x � t. f Let L, A + B, SN1 1 - SN1 be the length of the interarrival interval covering t. (a) Show that L, has density ==
==
,
+
ax ,
13.4.
==
==
,
+
if O < x < t, if X > I. (b)
13.5.
Show that E[ L,] converges to 2 E[ X1 ] as t --+ oo . This seems paradoxi cal because L, is one of the Xn. Give an intuitive resolution of the apparent paradox. Merging Poisson streams. Define a process { N, } by (23.5) for a sequence { Xn } of random variables satisfying (23.4). Let { X� } be a second sequence of random variables, on the same probability space, satisfying (23.4) and define { N,' } by N,' max[ n: X{ + · · · + X� < t ]. Define { N," } by N," N, + N,'. Show that, if a( X1 , X2 , ) and a( X{, X2 , . . . ) are independent ==
=
•
•
•
23.
SECTION
and { � } and { N, } are Poisson processes with respective rates a and fJ, then { N, } is a Poisson process with rate a + {J . f The n th and (n + 1)st events in the process { N, } occur at times Sn and '
"
2.1.6.
321
THE POISSON PROCESS
sn + 1 ·
(a) Find the distribution of the number Ns,.
+
1
- Ns,. of events in the other
process during this time interval. (b) Generalize to Ns - Ns,. . 23.7. For a Poisson stream and a bounded linear Borel set A , let N(A ) be the number of events that occur at times lying in A . Show that N( A ) is a random variable having the Poisson distribution with mean a times the Lebesgue measure of A . Show that if A and B are disjoint, then N( A ) and N( B) are independent. Suppose that X1 , X2 , are independent and exponentially distributed with 23.8. parameter a, so that (23.5) defines a Poisson process { N, }. Suppose that are independent and identically distributed and that Y1 , Y2 , a( x1 ' x2 ' . . . ) and a( y1 ' Yi ' . . . ) are independent. Put z, == Ek N yk . This l is the compound Poisson process. If, for example, the event at time Sn in the original process represents an insurance claim, and if Y, represents the amount of the claim, then Z, represents the total claims to time t. (a) If Yk == 1 with probability 1, then { Z, } is an ordinary Poisson process. (b) Show that Z, has independent increments and that Zs + r - Zs has the same distribution as Z, . (c) Show that, if Yk assumes the values 1 and 0 with probabilities p and 1 p (0 < p < 1), then { Z, } is a Poisson process with rate pa. 11.9. Suppose a process satisfies Condition 0° and has independent, Poisson-dis tributed increments and no fixed discontinuities. Show that it has the form { Ncr> < I ) } , where { N, } is a standard Poisson process and cp is a nondecreas ing, continuous function on [0, oo) with cp(O) == 0. 11. 10. If the waiting times Xn are independent and exponentially distributed with parameter a, then Sn/n --+ a - 1 with probability 1, by the strong law of large numbers. From lim, _ 00 N, == oo and SN < t < SN + 1 deduce that lim, - 00 N,/t == a with probability 1. 1.1.1 1. f (a) Suppose that X1 , X2 , . are positive and assume directly that Sn/n --+ m with probability 1, as happens if the Xn are independent and identically distributed with mean m. 
Show that lim , N,/t == 1/m with probability 1. (b) Suppose now that Sn/n --+ oo with probability 1, as happens if the Xn are independent and identically distributed and have infinite mean. Show that lim, N,/t == 0 with probability 1. The results in Problem 23.11 are theorems in renewal theory: A component of some mechanism is replaced each time it fails or wears out. The Xn are the lifetimes of the successive components, and N, is the number of replacements, or renewals, to time t. 11. 1 2. 20.8 23.11 f Consider a persistent, irreducible Markov chain and for a fixed state j let Nn be the number of passages through j up to time n. Show that Nn/n --+ 1 /m with probability 1, where m = Ek' 1 kfj�k > is the "'
•
•
•
•
•
•
s
-
1
• •
1
322
RANDOM VARIABLES AND EXPECTED VALUES
mean return time (replace 1/m by 0 if this mean is infinite). See Lemma in Section 8.
3
23.13. Suppose that X and Y have Poisson distributions with parameters a and {J. Show that I P( X = i] - P[Y = i] l < I a - fJ I . Hint: Suppose that a < fJ and represent Y as X + D, where X and D are independent and have Poisson distributions with parameters
23.14.
a and fJ - a.
f Use the methods in the proof of Theorem 23.2 to show that the error in (23.15) is bounded uniformly in i by l A - A n I + A n max k Pnk .
SECI'ION 24. QUEUFS AND RANDOM WALK *
Results in queueing theory can be derived from facts about random walk, facts having an independent interest of their own. The Single-Server Queue
Suppose that customers arrive at a server in sequence, their arrival times being 0, A 1 , A 1 + A 2 , The customers are numbered 0, 1, 2, . . . , and customer n arrives at time A 1 + +A n , customer 0 arriving by conven tion at time 0. Assume that the sequence { A n } of interarrival times is independent and that the An all have the same distribution concentrated on (0, oo ) . The A n may be exponentially distributed, in which case the arrivals form a Poisson process as in the preceding section. Another possibility is P [ A n = a] = 1 for a positive constant a; here the arrival stream is determin istic. At the outset, no special conditions are placed on the distribution common to the A n . Let B0, B 1 , . . . be the service times of the successive customers. The Bn are by assumption independent of each other and of the A n , and they have a common distribution concentrated on (0, oo ) . Nothing further is assumed initially of this distribution: it may be concentrated at a single point for example, or it may be exponential. Another possibility is that the service-time distribution is the convolution of several exponential distributions; this represents a service consisting of several stages, the service times of the constituent stages being dependent and themselves exponentially distrib uted. If a customer arrives to find the server busy, he joins the end of a queue and his service begins when all the preceding customers have been served. If he arrives to find the server free, his service begins immediately; the server is by convention free at time 0 when customer 0 arrives. • • •
•
· · ·
* This section may be omitted.
SECTION
24.
QUEUES AND RANDOM WALK
323
Airplanes coming into an airport form a queueing system. Here the A n are the times between arrivals at the holding pattern, Bn is the time the plane takes to land and clear the runway, and the queue is the holding pattern itself. Let Wo , WI , W2 , . . . be the waiting times of the successive customers: wn is the length of time customer n must wait until his service begins; thus W,. + Bn is the total time spent at the service facility. Because of the convention that the server is free at time 0, W0 = 0. Customer n arrives at time A1 + · · · + A n ; abbreviate this as t for the moment. His service begins at time t + Wn and ends at time t + Wn + Bn. Moreover, customer n + 1 arrives at time t + A n +1 . If customer n + 1 arrives before t + Wn + Bn (that is, if t + A n + t < t + Wn + Bn , or Wn + Bn - A n +1 > 0), then his waiting time is Wn + t = (t + Wn + Bn ) - (t + A n + t ) = Wn + Bn - A n +1 ; if he arrives after t + Wn + Bn , he finds the server free and his waiting time is W,. + 1 = 0. Therefo�e, Wn + t = max[O, Wn + Bn - A n + 1 ]. Define
(24.1)
n
=
1 , 2, . . . ;
the recursion then becomes
n = 0, 1 , . . . .
(24.2 ) Note that Wn is measurable
o( X1 ,
•
•
•
, Xn ) and hence is measurable
Bn - 1 ' A 1 , . . . ' A n ) . The equations (24.1) and ( 24.2) also describe inventory systems. Suppose
a( Bo , . . . '
that on day n the inventory at a storage facility is augmented by the amount Bn l and the day' s demand is A n . Since the facility cannot supply what it does not have, (24.1) and (24.2) describe the inventory Wn at the end of day n. A problem of obvious interest in queueing theory is to compute or approximate the distribution of the waiting time Wn. Write -
(24 . 3 )
S0 = 0,
The following induction argument shows that
(24 .4) this is obvious for n = 0, assume that it holds for a particular n. If wn + xn + 1 is nonnegative, it coincides with wn + 1 ' which is therefore the maximum of Sn - Sk + Xn + 1 = Sn + 1 - Sk over O < k < n; this maxi mum, being nonnegative, is unchanged if extended over 0 < k < n + 1, As
324
RANDOM VARIABLES AND EXPECTED VALUES
whence (24.4) follows for n + 1. On the other hand, if Wn + Xn + 1 < 0, then sn + 1 - sk = sn - sk + xn + 1 < 0 for 0 < k < n, so that max 0 � k � n + 1 ( Sn + 1 - Sk ) and Wn + 1 are both 0. Thus (24.4) holds for all n. Since the components of ( Xh X2 , , Xn ) are independent and have a common distribution, this random vector has the same distribution over R n as ( Xn , Xn h . . . , X1 ). Therefore, the partial sums S0, S 1 , S2 , , Sn have the same joint distribution as 0, Xn , Xn + xn 1 ' . . . ' xn + . . . + XI ; that is, they have the same joint distribution as Sn -- Sn , Sn - Sn _ 1 , Sn - Sn_ 2 , , Sn - S0• Therefore, (24.4) has the same distribution as max o � k � n sk . The following theorem sums up these facts. • • •
•
•
•
• • •
Let { Xn } be an independent sequence of random variables with common distribution, and define Wn by (24.2) and Sn by (24.3). Then (24.4) holds and wn has the same distribution as max o � k � n sk . Theorem 24.1.
In this and the other theorems of this section, the Xn are any indepen dent random variables with a common distribution; it is only in the queueing-theory context that they are assumed defined by (24.1) in terms of positive, independent random variables A n and Bn . The problem is to investigate the distribution function
( 24.5 )
Mn ( x )
=
P ( Wn s x ]
=
]
[
Sk < x , P max n � � O k
X > 0.
(Since wn > 0, M(x) = 0 for X < 0.) By the zero-one law (Theorem 22.3), the event [sup k Sk = oo"] has probability 0 or 1. In the latter case max0 � k � n sk --+ oo with probability 1, so that
( 24.6 ) for every x, however large. On the other hand, if [sup k Sk probability 0, then
( 24.7 )
M( x )
=
[
P sup Sk k>O
s
x
= oo ]
]
is a well-defined distribution function. Moreover, [max0 k � n Sk x] � [sup k Sk s x] for each x and hence <
( 24. 8)
nlimoo Mn ( x ) ......
=
has
M( x ) .
Thus M(x) provides an approximation for Mn (x).
s
24. QUEUES AND RANDOM WALK If the A n and Bn of queueing theory have finite means, the ratio SECTION
325
(24 .9 )
called the traffic intensity. The larger p is, the more congested is the queueing system. If p > 1, then E[ Xn1 > 0 by (24.1), and since n - 1sn --+ E[ X1 ] with probability 1 by the strong law of large numbers, supn Sn = oo with probability 1. In this case (24.6) holds: the waiting time for the n th customer has a distribution that may be said to move out to + oo ; the service time is so large with respect to the arrival rate that the server cannot accommodate the customers. It can be shown that supn Sn = oo with probability 1 also if p = 1 , in which case E[ Xn 1 = 0. The analysis to follow covers only the case p < 1 . Here the strong law gives n - 1Sn --+ E[ X1 ] < 0, Sn --+ - oo , and supn Sn < oo with probability 1, so that (24.8) applies. is
Random Walk and Ladder Indices
it helps to visualize the Sn as the successive positions in a random walk. The integer n is a ladder index for the random walk if Sn is the maximum position up to time n:
To calculate (24.7)
(24 .1 0 )
Note that the maximum is nonnegative because S0 = 0. Define a finite measure L by (23 .11) L ( A ) =
P[ max n sk n- l O
E
=
0<
]
sn E A ,
A
c
( 0, oo ) .
Thus L is a measure on the Borel sets of the line and has support (0, oo ). The probability that there is at least one ladder index is then (24.12 )
Let T1 , T2 , be the times between the ladder indices: T1 is the smallest n satisfying (24.10), T1 + T2 is the next smallest, and so on. To complete the definition, if there are only k ladder indices ( k may be 0), set Tk + I = Tk+ 2 · · · = oo . Define the k th ladder height by H1 + · · · + Hk = Sr1 + . .. + r,/ Thus H1 + · · · + Hk is the position in the random walk at the k th ladder •
•
•
•
326
RANDOM VARIABLES AND EXPECTED VALUES
index; if there are only k ladder indices, set Hk + I Hk + 2 = · · · = 0. As long as ladder indices occur, the ladder heights must increase and the Hk must be positive. Whether there are infinitely many ladder indices or only finitely many, the definitions are so framed that =
(24.13 )
sup Sk k
=
00
E Hn . n== l
Therefore, (24.7) can be studied through the Hn. The distribution of H1 consists of the measure L defined by (24.11) together with a mass of 1 - p at the point 0. If there is a ladder index and n is the first one ( T1 = n ), then H2 is the first positive term (if there is one) in the sequence Sn + 1 - Sn , Sn + 2 - sn , . . . ; but this sequence has the same structure as the original random walk. Therefore, under the condition that there is a ladder index and hence a first one, the distribution of H2 is the original distribution of H1 An extension of this idea is the key to understanding the distribution of (24.13). Let L n * denote the n-fold convolution of L with itself; L n * = L< n - 1 > * L, and L0 * is a unit mass at the point 0. Let 1/1 be the measure (see Example 16.5) •
•
00
( 24.14)
n -O
A c [O, oo ) .
Note that 1/1 has a unit mass at 0 and the rest of the mass is confined to (0, 00 ) . Theorem 24.2. (i)
If A 1 ,
•
•
•
, An
are subsets of (0, oo), then
( 24.15 ) If p = 1, then with probability 1 there are infinitely many ladder indices and supn Sn = oo. (iii) If p < 1, then with probability p n (l - p) there are exactly n ladder indices; with probability 1 there are only finitely many ladder indices and supn Sn < oo ; finally, (ii)
( 24.16 )
[ n >O
P sup Sn e A
] = ( 1 - p ) l/J ( A ) ,
A c [0, oo ) .
If A c (0, oo), then P[Hn e A] = p n - 1L(A), as will be seen in the course of the proof; thus (24.15) does not assert the independence of the Hn - obvi ously Hn = 0 implies that Hn + 1 = 0.
327 24. QUEUES AND RANDOM WALK (i) For n = 1, (24.15) is just the definition (24.1 1). Fix subsets PROOF. .A1 , . . . , A n of (0, oo) and for k > n - 1 let Bk be the w-set where H; A; for i � n - 1 and T1 + · · · + Tn - l k. Then Bk E o( X1, . . . , Xk ) and so independence and the assumption that the X; all have the same distribuSECTION
=
by tion,
E
Summing over m = 1, 2, . . . and then over k = n - 1, n, n + 1, . . . gives P[H; e A ; , i � n] = P [ H; E A ; , i < n - 1] L ( A n ). Thus (24.15) follows by induction. Taking A 1 = · · · = A n = (0, oo) in (24.15) shows that the probability of at least n ladder indices is P[H; > 0, i < n] = p n . (ii) If p = 1, then there must be infinitely many ladder indices with probability 1. In this case the Hn are independent random variables with distribution L concentrated on (0, oo ) ; since 0 < E[ Hn ] < oo, it follows by the strong law of large numbers that n - 1 E'ic,_1Hk converges to a positive number or to oo, so that (24.13) is infinite with probability 1. (iii) Suppose that p < 1. The chance that there are exactly n ladder indices is p n - p n +l = p n (1 - p) and with probability 1 there are only finitely many. Moreover, the chance that there are exactly n ladder indices and that H; E A ; for i < n is L(A 1 ) · • • L(A n ) - L(A 1 ) · • · L(A n )L(O, oo ) = L(A 1 ) • • • L(A n )(1 - p). Conditionally on there being exactly n ladder indices, H1 , . . . , Hn are therefore independent, each with distribution p - 1L. Thus the probability that there are exactly n ladder indices and H1 + · · · + Hn E A is (1 - p)L n *(A) for A c (0, oo). This 0 holds also for n = 0: (1 - p )L * consists of a mass 1 - p = P[sup k Sk = 0] at the point 0. Adding over n = 0, 1, 2, . . . gives (24.16). Note that in this case there are with probability 1 only finitely many positive terms on the • right in (24.13). Exponential Right Tail
Further information about the p and 1/1 in (24.16) can be obtained by a time-reversal argument like that leading to Theorem 24.1. As observed before, the two random vectors (S0, S1 , . • . , Sn ) and (Sn - Sn, Sn S,._ 1 , • • • , Sn - S0) have the same distribution over R n + t . Therefore, the event [max o k n sk < sn E A] has the same probability as the event <
<
328
RANDOM VARIABLES AND EXPECTED VALUES
[
A]
Sk > 0, Sn E Sk < Sn E A ] = P min P [ max l k n O k
�
00
00
00
L l/ln( A ) = E L P [ Tt +
n- 1
n - l j- l
···
<
<
+ 1j = n , H1 +
···
+ H1 e A ]
00
= E Li* ( A ) . j- l
Let 1/1 0 be a unit mass at 0; it follows by (24.14) that 00
E 1/J n( A ) = 1/J ( A ) ,
( 24. 1 8 )
n -O
A
c
[O, oo) .
Let F be the distribution function common to the Xn . If P. n is the distribution in the plane of (min 1 � k � n sk , Sn ), then by (20.30), P[min 1 � k � n Sk > 0, Sn +l � x] = /y1 > 0 F( x - y2 ) dp. n (y1 , y2 ). Identifying 1/1 n < A) with the right side of (24.17) givest
(24. 19) f'XJOOF( x - y ) t/ln ( dy) = p L � n sk > 0, sn +l � X ] . If the minimum is suppressed, this holds for n = 0 as well as for n > 1. Since convolution commutes, the left side here is J 00 oo l/l n ( - oo, x - y] dF(y) = /y � xl/1 n [O, x - y] dF(y ). Adding over n = 0, 1, . . . leads to this result: Theorem 24.3.
The measure (24. 1 4) satisfies
fy s x1/! [0, x - y) dF(y ) = nE- l P [ l smink < n sk > o , sn < x ] . 00
( 24.20 )
The important case is x � 0; here the sets on the right in (24.20) are disjoint and their UniOn is the set where Sn < 0 for SOme n > 1 and Sn � X t That /y1
>
0/(y2 ) dp.n (y1 , Yl )
indicators f.
==
/� 00/( y )1/Jn ( dy ) follows by the usual argument starting with
SECTION
24.
QUEUES AND RANDOM WALK
329
for the first such n. This theorem makes it possible to calculate (24.16) in a
case
of interest for queueing theory.
Theorem
24.4.
Suppose that E[X1 ] < 0 and that the right tail of F is
exponential: P ( X1 > x] = 1 - F( x ) = �e - A x , (24 .21 ) where 0 < � < 1 and A > 0. Then p < 1 and (24. 22) p SUp Sn > x = pe - A ( l -p ) x ,
[
n �O
]
X > 0,
X > 0.
Moreover, A(1 - p) is the unique root of the equation (24.23) in
the range 0 < s < A.
Because of the absence of memory (property (23.2) of the exponen tial distribution), it is intuitively clear that (even though only the right tail of F is exponential), if Sn t < 0 < Sn , then the amount by which Sn exceeds 0 should follow the exponential distribution. If so, PROOF.
L ( x , oo ) = pe - A x ,
(24.24 )
X > 0.
To prove this, apply (20.30):
[
P max Sk < o, sn > x k
] P [ max Sk < 0, xn > X - sn - 1 ] 1 �e - ><< x - S. - t l dP =
=
k
max S�c s O
k
Adding over n gives (24.24). Thus L is the exponential distribution multiplied by p. Therefore, L k * k k can be computed explicitly: L *[O, x] is p times the right side of (20.40) with A = a, so that
(24.25 )
330
RANDOM VARIABLES AND EXPECTED VALUES
Note that (24.24) and (24.25) follow from (24.21) alone-the condition E[ X1 ] < 0 has not been used. From E [ X1 ] < 0 it follows that Sn --+ - oo with probability 1 , so that p must be less than 1 . Summing (24.25) in the other reduces it to (24.26)
1
tP (0, X ) = 1 - p
P e - A ( l -p ) x , 1 -p
whence (24.22) follows via (24.16). It remains to analyze (24.23). Since Sn 1 for x = 0, and so by (24.26)
X > 0,
--+ - oo, the right side of (24.20) is
fy !f;O( 1 - pe><
( 24 .27)
Let l(s) denote the function on the left in (24.23). Because of (24.21), it is defined for 0 � s < A and (see (21 .26)) l(s) = �A( A - s ) - 1 + fx !f; 0esx dF( x ). From this, (24.27), and F(O) = 1 - �' it follows that I( A(l - p )) = 1. Now I has value 1 at s = 0 also. Since I is continuous on [0, A ) and l "(s) = f 00 00 x 2es x dF(x) is positive on (0, A ) because F does not con • centrate at 0, (24.23) cannot have more than one root in (0, A ). Suppose in the queuing context-see (24.1)- that the service times Bn are exponentially distributed with parameter P and that the A n have the distribution function A over (0, oo ). Assume that p- t < E[A 1 ], so that the traffic intensity (24.9) is less than 1 and E [ X1 ] < 0. Since P[ X1 > x] = e - flxJ0e - flY dA(y), Theorem 24.4 applies for A = p. The Bn have moment generating function fJ/(P - s) by (21.26); the moment gener ating function for F can by the product rule be expressed in terms of that for A. Therefore, (24.23) is in this case Example 24. 1.
(24.28)
1
o
00e - sy dA ( y ) = p {J- s .
This can be solved explicitly if the interarrival times An are also exponen tial, say with parameter a < p. The left side of (24.28) is then aj( a + s ) , the only positive root is P - a, and so p = a;p. In this case (see (24.8)) (24.29 )
P ( W, < x) = 1 - : e-x, nlim -+ oo P
X�
0.
SECTION
24.
331
QUEUES AND RANDOM WALK
This limit is a mixture of discrete and continuous distributions: there is a mass of 1 - a.;p at 0 and a density ap - 1 ( P - a.) e - x over (0, oo ). • Exponential Left Tail If F is continuous, reflecting the relation to treat the case in which the left tail of
Theorem 24.5. Suppose that left tail of F is exponential:
(24.20) through 0 makes it possible F is exponential.
E[X1 ] < 0, that F is continuous, and that the
(24.30) where
X<
0,
0 < � < 1 and A > 0. Then
(24 .3 1) and
L(O, x)
(24 .3 2 )
=
x
fo
F( x ) - F(O) + >.. (1 - F( y )) dy.
First assume not (24.30) but (24.21), and assume that E[ X1 ] > 0. Then Sn + oo , and so p = 1. As pointed out after the proof of (24.25), that relation follows from (24.21) alone; in the case p = 1 it reduces to (reverse the sum and use the Poisson mean) 1/J[O, x] = 1 + AX, x > 0. By Theorem 24.3, PROOF.
-+
[
(24 .33) 'f. P l � n Sk > 0, Sn < x n 1 00
= for
x <
] � < } 1 + >.. ( x - y )) dF( y) =
(1 +
0.
>.. x ) P ( X1 < x) - >.. f xX1 dP X1 <
Now if (24.30) holds and if E[ X1 ] < 0, the analysis above applies to { - xn }. If xn is replaced by - xn and X by - X, then (24.33) becomes
'f. P [ 1 max S < 0, Sn > x ] k n < sk n -= 1 00
valid
for
=
(1 -
>.. x ) P ( X1 > x ) + >..
f
X1 > x
X1 dP,
x > 0. Now F is assumed continuous, and it follows by the
332
RANDOM VARIABLES AND EXPECTED VALUES
convolution formula that the same must be true of the distribution of each 00 Sk . By (24.11) and (21.10), L(x, oo) = L[x, oo) = 1 - F(x) + >.. fx (1 F(y)) dy. By (24.30), fx1 0 X1 dP = - >.. - 1� = - >.. - 1F(O), and so /0(1 F( y )) dy = fx1 0 X1 dP = E[ X1 ] + A - 1F(O). Since p = L (0, oo ), this gives • (24.31) and (24.32). �
';!!
This theorem gives formulas for p and L and hence in principle determines the distribution of supn 0 Sn via (24.14) and (24.16). The rela tion becomes simpler if expressed in terms of the Laplace transforms 'i!!
and
.l ( s ) = E exp { -s sup sn ) . n �O
(
]
These are moment generating functions with the argument reflected through 0 as in (22.4); they exist for s > 0. Let �+ (s) denote the transform of the restriction to (0, oo) of the measure corresponding to F:
By (24.32), L is this last measure plus the measure with density A (1 over (0, oo ). An integration by parts shows that
=
- F(x))
� ( 1 - F(0)) + ( 1 - � ) !F+ ( s ) .
By (24.14) and (24.16), .l(s) = (1 - p)r./:_ 0!t' n (s); the series converges because !t'(s) � !t'(O) = p < 1. By (24.31),
( 24"34 ) .l ( s ) =
>.. E ( X1 )s ( s - " ) !F+ ( s ) - s + " ( 1 - F(O)) '
s > 0.
This, a form of the Khinchine-Po//aczek formula, is valid under the hypothe ses of Theorem 24.5. Because of Theorem 22.2, it determines the distribu tion of supn 0Sn . 'i!!
SECTION
24.
333
QUEUES AN D RAN DOM WALK
Example 24. 2.
Suppose that the queueing variables A ,, are exponen tially distributed with parameter a, so that the arrival stream is a Poisson process . Let b = E [ Bn], and assume that the traffic intensity p = ab is less than 1 . If the B, have distribution function B( x ) an d transform ffl( s ) = JOOe -sx dB ( x ), then F(x ) has densityt ae axdl( a ) ,
F (x) = '
ae ax
J
x < O
e - a v dB ( y ) ,
y>x
X>
0.
Note that F(O) = !11 ( a). Thus Theorem 24.5 applies for A = a, and revers ing integrals shows that �+ (s) = a( a - s ) - 1 ( 91(s) - F(O)). Therefore, (24. 3 4) becomes (24.35)
.K
(s ) =
(1 - p )s s - a + adl( s ) '
the
usual form of the Khinchine-Pollaczek formula. If the B,. are exponen tially distributed with parameter p, this checks with the Laplace transform • of the distribution function M derived in Example 24.1. Queue Size
Suppose that p < 1 and let Qn be the size of the queue left behind by customer n when his service terminates. His service terminates at time A I + . . . + A n + wn + Bn , and customer n + k arrives at time A l + A n + k · Therefore, Qn < k if and only if A 1 + · · + A n + k > A 1 + + + A n + Wn + Bn . (This requires the convention that a customer arriving exactly when service te rminates is not counted as being left behind in the queue.) Thus ·
·
·
·
·
·
·
k
> 1.
Since W,. is measurable o( B0, , Bn _ 1 , A 1 , , A n ) (see (24.2)), the three random variables Wn , Bn , and - ( A n + l + + A n + k ) are independent. If Mn , B, and Vk are the distribution functions of these random variables, then by the convolution formula, P[Qn < k ] = j 00 00 Mn ( -y )( B • Vk )( dy ). If M is the distribution function of supn Sn , then (see (24.8)) M,. converges to M and hence p [ Q n < k ] converges to I 00 00 M( -y )( B * vk )( dy ), which is the • • •
• • •
·
tJust
as
·
·
(20.37) gives the density for X + Y, f� 00 g( x - y ) dF( x ) gives the density for X - Y.
334
RANDOM VARIABLES AND EXPECTED VALUES
same as /00 00 Vk ( -y )( M * B )( dy) because convolution is commutative. Dif ferencing for successive values of k now shows that = k ) = qk , P(Qn n -+ oo
( 24.36)
lim
k > 1,
where
These relations hold under the sole assumption that p < 1. If the A n are exponential with parameter a, the integrand in (24.37) can be written down explicitly:
( 24.38) Example
Suppose that the A n and Bn are exponential with parameters p, as in Example 24.1. Then (see (24.29)) M corresponds to a mass of 1 - a;P = 1 - p at 0 together with a density p(p - a ) e - < IJ - a > x over (0, oo ) . By (20.37), M * B has density 24.3. a and
1
fJe - fl < y - x) dM ( x ) = ( fJ
[O. y ]
- a ) e - < fl - a) y .
By (24.38),
Use inductive integration by parts (or the gamma function) to reduce the last integral to k ! Therefore, qk = (1 - p)pk , a fundamental formula of • queuing theory. See Examples 8.11 and 8.12.
C HAP T ER
5
Convergence of Distributions
SECI'ION 25. WEAK CONVERGENCE
Many of the best known theorems in probability have to do with the asymptotic behavior of distributions. This chapter covers both general methods for deriving such theorems and specific applications. The present section concerns the general limit theory for distributions on the real line, and the methods of proof use in an essential way the order structure of the line. For the theory in R k , see Section 29. Definitions
Distribution functions Fn were in Section the distribution function F if
14 defined to converge weakly to
(25 .1 ) for every continuity point x of F; this is expressed by writing Fn � F. Examples 14.1, 14.2, and 14.3 illustrate this concept in connection with the asymptotic distribution of maxima. Example 14.4 shows the point of allowing (25.1) to fail if F is discontinuous at x; see also Example 25.4. Theorem 25.8 and Example 25.9 show why this exemption is essential to a useful theory. If p. n and p. are the probability measures on ( R 1 , !1/ 1 ) corresponding to Fn and F, then Fn � F if and only if lim p. n ( A )
(25 .2 ) for every A of the form
n
A
=
(
-
oo ,
=
p. ( A )
x] for which p. { x } = O-see
(20.5). In 335
336
CONVERGENCE OF DISTRIBUTIONS
this case the distributions themselves are said to converge weakly, which is expressed by writing P. n � p.. Thus Fn � F and P. n => p. are only different expressions of the same fact. From weak convergence it follows that (25.2) holds for many sets A besides half-infinite intervals; see Theorem 25.8.
Example
Let Fn be the distribution function corresponding to a unit mass at n: Fn = /[ n . oo > · Then lim n Fn ( x ) = 0 for every x, so that (25. 1 ) is satisfied if F( x) = 0. But Fn � F does not hold, because F is not a distribution function. Weak convergence is in this section defined only for functions Fn and F that rise from 0 at - oo to 1 at + oo -that is, it is defined only for probability measures p. n and p.. t For another example of the same sort, take Fn as the distribution function of the waiting time W, for the n th customer in a queue. As observed after (24.9), Fn (x) --+ 0 for each x if the traffic intensity exceeds 1 . 25. 1.
•
Example
Poisson approximation to the binomial. Let P. n be the binomial distribution (20.6) for p = 'A/n and let p. be the Poisson distribu tion (20.7). For nonnegative integers k , ( n ) "' k A n - k P. n { k } = k n 1 - n n k k-1 'A/n 'A 1 1 i ) ( x = n 1 -k n k! ( 1 - 'Ajn ) i = O 25. 2.
( )(
)
(
)
if n > k . As n --+ oo the second factor on the right goes to 1 for fixed k , and so P. n { k } --+ p. { k } ; this is a special case of Theorem 23.2. By the series form of Scheffe's theorem (the corollary to Theorem 16. 1 1 ), (25.2) holds for every set A of nonnegative integers. Since the nonnegative integers support p. and the P. n ' (25.2) even holds for every linear Borel set A. Certainly P. n converges • weakly to p. in this case.
Example 25.3. Let P. n correspond to a mass of n - 1 at each point kjn, k = 0, 1 , . . . , n - 1 ; let p. be Lebesgue measure confined to the unit interval. The corresponding distribution functions satisfy Fn (x) = ([nx] + 1 )/n --+ F(x) for 0 < x < 1 , and so Fn � F. In this case (25. 1) holds for every x, but (25.2) does not as in the preceding example hold for every Borel set
A:
t There is (see Section 28) a related notion of vague convergence in which p. may be defective in the sense that p.( R 1 ) < 1 . Weak convergence is in this context sometimes called complete convergence.
SECTION
25.
WEAK CONVERGENCE
337
is the set of rationals, then P. n (A) = 1 does not converge to p. (A) = 0. Despite this, P. n does converge weakly to p.. •
if A
Example
If P. n is a unit mass at x n and p. is a unit mass at x, then p , :::o p. if and only if x n --+ x. If x n > x for infinitely many n , then (25.1) • fails at the discontinuity point of F. 25.4.
Unifonn Distribution Modulo 1*
For a sequence x1 , x 2 , of real numbers, consider the corresponding sequence of their fractional parts { xn } == xn - [ xn ]. For each n , define a probability measure p. n •
•
•
by
(25 .3)
P.n ( A ) == -1 , # ( k : 1 < k < n , { xk } E A ] ; n
has mass n - 1 at the points { x1 }, , { xn }, and if several of these points coincide, the masses add. The problem is to find the weak 1imi t of { p. n } in number-theoretically interesting cases. Suppose that xn == nplq, where plq is a rational fraction in lowest terms. Then { kplq } and { k'plq } coincide if and only if (k - k')plq is an integer, and since p and q have no factor in common, this holds if and only if q divides k - k'. Thus for any q successive values of k the fractional parts { kp1q } consist of the q points
P.n
•
0 1
•
•
,..., q'q
(25 .4)
q-1 q
in
some order. If O < x < 1, then P.n [O, x] == P.n [O, y ], where y is the point ilq just left of x, namely y == [ qx]lq. If r == [ nlq], then the number of { kpjq } in [0, y] for 1 s k < n is between r([ qx] + 1) and ( r + 1)([ qx] + 1) and so differs from nq- 1 ([ qx] + 1) by at most 2 q. Therefore, {25 .5)
- [ qx 1 + 1 2q p." [ 0 ' x 1 < n ' q
0 < X < 1.
If ν consists of a mass of q⁻¹ at each of the points (25.4), then (25.5) implies that μₙ ⇒ ν.

Suppose next that xₙ = nθ, where θ is irrational. If p/q is a highly accurate rational approximation to θ (q large), then the measures (25.3) corresponding to {nθ} and to {np/q} should be close to one another. If μ is Lebesgue measure restricted to the unit interval, then ν is for large q near μ (see Example 25.3), and so there is reason to believe that the measures (25.3) for {nθ} will converge weakly to μ.
If the μₙ defined by (25.3) converge weakly to Lebesgue measure restricted to the unit interval, the sequence x₁, x₂, ... is said to be uniformly distributed modulo 1. In this case every subinterval has asymptotically its proportional share of the points {xₙ}; by Theorem 25.8 below, the same is true of every subset whose boundary has Lebesgue measure 0.

* This topic may be omitted.
Theorem 25.1. For θ irrational, {nθ} is uniformly distributed modulo 1.

PROOF. Define μₙ by (25.3) for xₙ = nθ. It suffices to show that

(25.6)  μₙ(x, y] → y − x,  0 < x < y < 1.

Suppose that

(25.7)  0 < ε < min{x, 1 − y}.

It follows by the elementary theory of continued fractions† that for n exceeding some n₀ = n₀(ε, θ) there exists an irreducible fraction p/q (depending on ε, n, and θ) such that

1/q ≤ ε,  q ≤ εn,  |θ − p/q| ≤ ε/n.

If μₙ′ is the measure (25.3) for the sequence {np/q}, then by (25.5),

(25.8)  |μₙ′(x, y] − (y − x)| ≤ |μₙ′(x, y] − (([qy] + 1)/q − ([qx] + 1)/q)| + |([qy] − [qx])/q − (y − x)| ≤ 4q/n + 2/q ≤ 6ε.

Now |kθ − kp/q| ≤ ε for k = 1, ..., n. If {kθ} lies in (x, y], then kθ and kp/q have the same integral part because of (25.7), so that {kp/q} lies in (x − ε, y + ε]. Thus μₙ(x, y] ≤ μₙ′(x − ε, y + ε]. Similarly, μₙ′(x + ε, y − ε] ≤ μₙ(x, y]. Two applications of (25.8) give |μₙ(x, y] − (y − x)| ≤ 8ε. Since this holds for n > n₀, (25.6) follows. For another proof see Example 26.3. ∎

Theorem 25.1 implies Kronecker's theorem, according to which {nθ} is dense in the unit interval if θ is irrational.
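Theorem 25.1 lends itself to a direct numerical check. In the sketch below θ = √2 (any irrational would do), and the empirical measure (25.3) of a fixed interval is compared with its length; the interval (0.25, 0.75] is an arbitrary choice:

```python
import math

# Empirical measure (25.3) for x_n = n*theta: mu_n(x, y] should
# approach y - x when theta is irrational (Theorem 25.1).
def mu_n_interval(theta, n, x, y):
    frac = ((k * theta) % 1.0 for k in range(1, n + 1))
    return sum(1 for f in frac if x < f <= y) / n

theta = math.sqrt(2)
for n in (100, 10_000, 100_000):
    print(n, mu_n_interval(theta, n, 0.25, 0.75))
# The printed values approach y - x = 0.5 as n grows.
```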
† Let p_r/q_r be the convergents of θ and choose r = rₙ so that q_r ≤ nε < q_{r+1}. If n is sufficiently large, this forces r to be large enough that q_r > ε⁻². By the theory of continued fractions, |θ − p_r/q_r| < 1/(q_r q_{r+1}) ≤ 1/(nεq_r) < ε/n. Take p/q = p_r/q_r.

Convergence in Distribution

Let Xₙ and X be random variables with respective distribution functions Fₙ and F. If Fₙ ⇒ F, then Xₙ is said to converge in distribution or in law to X, written Xₙ ⇒ X. This dual use of the double arrow will cause no confusion.
Because of the defining conditions (25.1) and (25.2), Xₙ ⇒ X if and only if

(25.9)  limₙ P[Xₙ ≤ x] = P[X ≤ x]

for every x such that P[X = x] = 0.
Example 25.5. Let X₁, X₂, ... be independent random variables, each with the exponential distribution: P[Xₙ > x] = e^{−αx}, x ≥ 0. Put Mₙ = max{X₁, ..., Xₙ} and bₙ = α⁻¹ log n. The relation (14.9) established in Example 14.1 can be restated as P[Mₙ − bₙ ≤ x] → e^{−e^{−αx}}. If X is any random variable with distribution function e^{−e^{−αx}}, this can be written Mₙ − bₙ ⇒ X. ∎
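Since Mₙ has the exact distribution function (1 − e^{−αx})ⁿ, the convergence in Example 25.5 can be checked without any simulation. In this sketch α = 1 and the evaluation point x = 0.5 are arbitrary choices:

```python
import math

# Exact distribution function of M_n - b_n for i.i.d. exponentials
# (P[X_n > x] = e^{-alpha*x}), with b_n = log(n)/alpha, against the
# limit exp(-e^{-alpha*x}) of Example 25.5.
def cdf_shifted_max(n, x, alpha=1.0):
    t = x + math.log(n) / alpha
    return (1.0 - math.exp(-alpha * t)) ** n if t > 0 else 0.0

def gumbel(x, alpha=1.0):
    return math.exp(-math.exp(-alpha * x))

for n in (10, 1000, 100_000):
    print(n, cdf_shifted_max(n, 0.5), gumbel(0.5))
# The exact values approach the limit as n grows.
```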
One is usually interested in proving weak convergence of the distributions of some given sequence of random variables, such as the Mₙ − bₙ in this example, and the result is often most clearly expressed in terms of the random variables themselves rather than in terms of their distributions or distribution functions. Although the Mₙ − bₙ here arise naturally from the problem at hand, the random variable X is simply constructed to make it possible to express the asymptotic relation compactly by Mₙ − bₙ ⇒ X. Recall that by Theorem 14.1 there does exist a random variable with any prescribed distribution.

Example 25.6. For each n, let Ωₙ be the space of n-tuples of 0's and 1's, let 𝓕ₙ consist of all subsets of Ωₙ, and let Pₙ assign probability (λ/n)^k(1 − λ/n)^{n−k} to each ω consisting of k 1's and n − k 0's. Let Xₙ(ω) be the number of 1's in ω; then Xₙ, a random variable on (Ωₙ, 𝓕ₙ, Pₙ), represents the number of successes in n Bernoulli trials having probability λ/n of success at each. Let X be a random variable, on some (Ω, 𝓕, P), having the Poisson distribution with parameter λ. According to Example 25.2, Xₙ ⇒ X. ∎
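The Poisson convergence in Example 25.6 is easy to check numerically by comparing exact point probabilities; λ = 3 is an arbitrary choice here:

```python
from math import comb, exp, factorial

# X_n ~ Binomial(n, lam/n) converges in distribution to Poisson(lam);
# compare the exact point probabilities for small k.
def binom_pmf(n, lam, k):
    p = lam / n
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    return exp(-lam) * lam**k / factorial(k)

lam = 3.0
for n in (10, 100, 10_000):
    print(n, [round(binom_pmf(n, lam, k), 4) for k in range(5)])
print([round(poisson_pmf(lam, k), 4) for k in range(5)])
```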
As this example shows, the random variables Xₙ may be defined on entirely different probability spaces. To allow for this possibility, the P on the left in (25.9) really should be written Pₙ. Suppressing the n causes no confusion if it is understood that P refers to whatever probability space it is that Xₙ is defined on; the underlying probability space enters into the definition only via the distribution μₙ it induces on the line. Any instance of Fₙ ⇒ F or of μₙ ⇒ μ can be rewritten in terms of convergence in distribution: There exist random variables Xₙ and X (on some probability spaces) with distribution functions Fₙ and F, and Fₙ ⇒ F and Xₙ ⇒ X express the same fact.
Convergence in Probability
Suppose that X, X₁, X₂, ... are random variables all defined on the same probability space (Ω, 𝓕, P). If Xₙ → X with probability 1, then P[|Xₙ − X| ≥ ε i.o.] = 0 for ε > 0, and hence

(25.10)  limₙ P[|Xₙ − X| ≥ ε] = 0

by Theorem 4.1. Thus there is convergence in probability Xₙ →_P X as discussed in Section 20; see (20.41). That convergence with probability 1 implies convergence in probability is part of Theorem 20.5.

Suppose that (25.10) holds for each positive ε. Now P[X ≤ x − ε] − P[|Xₙ − X| ≥ ε] ≤ P[Xₙ ≤ x] ≤ P[X ≤ x + ε] + P[|Xₙ − X| ≥ ε]; letting n tend to ∞ and then letting ε tend to 0 shows that P[X < x] ≤ lim infₙ P[Xₙ ≤ x] ≤ lim supₙ P[Xₙ ≤ x] ≤ P[X ≤ x]. Thus P[Xₙ ≤ x] → P[X ≤ x] if P[X = x] = 0, and so Xₙ ⇒ X:
Theorem 25.2. Suppose that Xₙ and X are random variables on the same probability space. If Xₙ → X with probability 1, then Xₙ →_P X. If Xₙ →_P X, then Xₙ ⇒ X.
Of the two implications in this theorem, neither converse holds. As observed after (20.41), convergence in probability does not imply convergence with probability 1. Neither does convergence in distribution imply convergence in probability: if X and Y are independent and assume the values 0 and 1 with probability ½ each, and if Xₙ = Y, then Xₙ ⇒ X, but Xₙ →_P X cannot hold because P[|X − Y| = 1] = ½. What is more, (25.10) is impossible if X and the Xₙ are defined on different probability spaces, as may happen in the case of convergence in distribution.

Although (25.10) in general makes no sense unless X and the Xₙ are defined on the same probability space, suppose that X is replaced by a constant real number a, that is, suppose that X(ω) ≡ a. Then (25.10) becomes

(25.11)  limₙ P[|Xₙ − a| ≥ ε] = 0,

and this condition makes sense even if the space of Xₙ does vary with n. Now a can be regarded as a random variable (on any probability space at all), and it is easy to show that (25.11) implies that Xₙ ⇒ a: Put ε = |x − a|; if x > a, then P[Xₙ ≤ x] ≥ P[|Xₙ − a| < ε] → 1, and if x < a, then P[Xₙ ≤ x] ≤ P[|Xₙ − a| ≥ ε] → 0. If a is regarded as a random variable, its distribution function is 0 for x < a and 1 for x ≥ a. Thus (25.11) implies that the distribution function of Xₙ converges weakly to that of a. Suppose, on the other hand, that Xₙ ⇒ a. Then P[|Xₙ − a| ≥ ε] ≤ P[Xₙ ≤ a − ε] + 1 − P[Xₙ < a + ε] → 0, so that (25.11) holds:

Theorem 25.3. Condition (25.11) holds for all positive ε if and only if Xₙ ⇒ a, that is, if and only if limₙ P[Xₙ ≤ x] is 0 for x < a and 1 for x > a.

If (25.11) holds for all positive ε, Xₙ may be said to converge to a in probability. As this does not require that the Xₙ be defined on the same space, it is not really a special case of convergence in probability as defined by (25.10). Convergence in probability in this new sense will be denoted Xₙ →_P a, in accordance with the theorem just proved.

Example 14.4 restates the weak law of large numbers in terms of this concept. Indeed, if X₁, X₂, ... are independent, identically distributed random variables with finite mean m, and if Sₙ = X₁ + ··· + Xₙ, the weak law of large numbers is the assertion n⁻¹Sₙ →_P m. Example 6.4 provides another illustration: If Sₙ is the number of cycles in a random permutation on n letters, then Sₙ/log n →_P 1.
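The weak law just cited can be checked by an exact computation rather than a simulation. In this sketch Sₙ is binomial with p = ½ (fair coin flips), so the tail probability in (25.11) can be summed exactly; the tolerance ε = 0.1 is an arbitrary choice:

```python
from math import comb

# Weak law as convergence in probability to a constant (Theorem 25.3):
# for S_n ~ Binomial(n, 1/2), P[|S_n/n - 1/2| >= eps] -> 0.
def tail_prob(n, eps):
    return sum(comb(n, k) for k in range(n + 1)
               if abs(k / n - 0.5) >= eps) / 2**n

for n in (10, 100, 1000):
    print(n, tail_prob(n, 0.1))
# The tail probabilities decrease toward 0 as n grows.
```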
Example 25.7. Suppose that Xₙ ⇒ X and δₙ → 0. Given ε and η choose x so that P[|X| ≥ x] < η and P[X = ±x] = 0, and then choose n₀ so that n ≥ n₀ implies that |δₙ| < ε/x and |P[Xₙ ≤ y] − P[X ≤ y]| < η for y = ±x. Then P[|δₙXₙ| ≥ ε] < 3η for n ≥ n₀. Thus Xₙ ⇒ X and δₙ → 0 imply that δₙXₙ ⇒ 0, a restatement of Lemma 2 of Section 14 (p. 195). ∎
The asymptotic properties of a random variable should remain unaffected if it is altered by the addition of a random variable that goes to 0 in probability. Let (Xₙ, Yₙ) be a two-dimensional random vector.

Theorem 25.4. If Xₙ ⇒ X and Xₙ − Yₙ ⇒ 0, then Yₙ ⇒ X.

PROOF. Suppose that y′ < x < y″ and P[X = y′] = P[X = y″] = 0. If ε < x − y′ and ε < y″ − x, then

(25.12)  P[Xₙ ≤ y′] − P[|Xₙ − Yₙ| ≥ ε] ≤ P[Yₙ ≤ x] ≤ P[Xₙ ≤ y″] + P[|Xₙ − Yₙ| ≥ ε].
Since Xₙ ⇒ X, letting n → ∞ gives

(25.13)  P[X ≤ y′] ≤ lim infₙ P[Yₙ ≤ x] ≤ lim supₙ P[Yₙ ≤ x] ≤ P[X ≤ y″].
Since P[X = y] = 0 for all but countably many y, if P[X = x] = 0, then y′ and y″ can further be chosen so that P[X ≤ y′] and P[X ≤ y″] are arbitrarily near P[X ≤ x]; hence P[Yₙ ≤ x] → P[X ≤ x]. ∎

Theorem 25.4 has a useful extension. Suppose that (Xₙ⁽ᵘ⁾, Yₙ) is a two-dimensional random vector.

Theorem 25.5. If, for each u, Xₙ⁽ᵘ⁾ ⇒ X⁽ᵘ⁾ as n → ∞, if X⁽ᵘ⁾ ⇒ X as u → ∞, and if

(25.14)  lim_u lim supₙ P[|Xₙ⁽ᵘ⁾ − Yₙ| ≥ ε] = 0

for positive ε, then Yₙ ⇒ X.

PROOF. Replace Xₙ by Xₙ⁽ᵘ⁾ in (25.12). If P[X = y′] = P[X⁽ᵘ⁾ = y′] = 0 and P[X = y″] = P[X⁽ᵘ⁾ = y″] = 0, then letting n → ∞ and then u → ∞ in the resulting inequality gives (25.13) once again, and the proof is completed as before. ∎
Fundamental Theorems
Some of the fundamental properties of weak convergence were established in Section 14. It was shown there that a sequence cannot have two distinct weak limits: If Fₙ ⇒ F and Fₙ ⇒ G, then F = G. The proof is simple: The hypothesis implies that F and G agree at their common points of continuity, hence at all but countably many points, and hence by right continuity at all points. Another simple fact is this: If limₙ Fₙ(d) = F(d) for d in a set D dense in R¹, then Fₙ ⇒ F. Indeed, if F is continuous at x, there are in D points d′ and d″ such that d′ < x < d″ and F(d″) − F(d′) < ε, and it follows that the limits superior and inferior of Fₙ(x) are within ε of F(x).

For any probability measure on (R¹, 𝓡¹) there is on some probability space a random variable having that measure as its distribution. Therefore, for probability measures satisfying μₙ ⇒ μ, there exist random variables Yₙ and Y having these measures as distributions and satisfying Yₙ ⇒ Y. According to the following theorem, the Yₙ and Y can be constructed on the same probability space, and even in such a way that Yₙ(ω) → Y(ω) for every ω — a condition which is of course much stronger than Yₙ ⇒ Y. This result, Skorohod's theorem, makes possible very simple and transparent proofs of many important facts.
Theorem 25.6. Suppose that μₙ and μ are probability measures on (R¹, 𝓡¹) and μₙ ⇒ μ. There exist random variables Yₙ and Y on a common probability space (Ω, 𝓕, P) such that Yₙ has distribution μₙ, Y has distribution μ, and Yₙ(ω) → Y(ω) for each ω.
PROOF. For the probability space (Ω, 𝓕, P), take Ω = (0, 1), let 𝓕 consist of the Borel subsets of (0, 1), and for P(A) take the Lebesgue measure of A. The construction is related to that in the proofs of Theorems 14.1 and 20.4. Consider the distribution functions Fₙ and F corresponding to μₙ and μ. For 0 < ω < 1, put Yₙ(ω) = inf[x: ω ≤ Fₙ(x)] and Y(ω) = inf[x: ω ≤ F(x)]. Since ω ≤ Fₙ(x) if and only if Yₙ(ω) ≤ x (see the argument following (14.5)), P[ω: Yₙ(ω) ≤ x] = P[ω: ω ≤ Fₙ(x)] = Fₙ(x). Thus Yₙ has distribution function Fₙ; similarly, Y has distribution function F.

It remains to show that Yₙ(ω) → Y(ω). The idea is that Yₙ and Y are essentially inverse functions to Fₙ and F; if the direct functions converge, so must the inverses. Suppose that 0 < ω < 1. Given ε, choose x so that Y(ω) − ε < x < Y(ω) and μ{x} = 0. Then F(x) < ω; Fₙ(x) → F(x) now implies that, for n large enough, Fₙ(x) < ω and hence Y(ω) − ε < x < Yₙ(ω). Thus lim infₙ Yₙ(ω) ≥ Y(ω). If ω < ω′ and ε is positive, choose a y for which Y(ω′) < y < Y(ω′) + ε and μ{y} = 0. Now ω < ω′ ≤ F(Y(ω′)) ≤ F(y), and so, for n large enough, ω < Fₙ(y) and hence Yₙ(ω) ≤ y < Y(ω′) + ε. Thus lim supₙ Yₙ(ω) ≤ Y(ω′) if ω < ω′. Therefore, Yₙ(ω) → Y(ω) if Y is continuous at ω.

Since Y is nondecreasing on (0, 1), it has at most countably many discontinuities. At discontinuity points ω of Y, redefine Yₙ(ω) = Y(ω) = 0. With this change, Yₙ(ω) → Y(ω) for every ω. Since Y and the Yₙ have been altered only on a set of Lebesgue measure 0, their distributions are still μₙ and μ. ∎
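The quantile construction in this proof can be made concrete. In the sketch below μₙ is the uniform distribution on {k/n: 1 ≤ k ≤ n} (so Fₙ is a staircase and μₙ ⇒ μ, Lebesgue measure on the unit interval), and the quantile functions Yₙ converge pointwise to Y(ω) = ω; the evaluation point is arbitrary:

```python
import math

# Skorohod's construction: Y_n(w) = inf{x : w <= F_n(x)} on Omega = (0, 1).
# For the uniform measure on {k/n : 1 <= k <= n}, the infimum is the
# smallest atom k/n with k/n >= w, i.e. ceil(n*w)/n.
def Y_n(w, n):
    return math.ceil(n * w) / n

w = 0.3
for n in (10, 100, 10_000):
    print(n, Y_n(w, n))
# Y_n(w) -> w = Y(w) as n -> infinity.
```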
Note that this proof uses the order structure of the real line in an essential way. The proof of the corresponding result in R^k (Theorem 29.6) is more complicated.

The following mapping theorem is of very frequent use.

Theorem 25.7. Suppose that h: R¹ → R¹ is measurable and that the set D_h of its discontinuities is measurable.† If μₙ ⇒ μ and μ(D_h) = 0, then μₙh⁻¹ ⇒ μh⁻¹.

† That D_h lies in 𝓡¹ is generally obvious in applications. In point of fact, it always holds (even if h is not measurable): Let A(ε, δ) be the set of x for which there exist y and z such that |x − y| < δ, |x − z| < δ, and |h(y) − h(z)| ≥ ε. Then A(ε, δ) is open and D_h = ⋃_ε ⋂_δ A(ε, δ), where ε and δ range over the positive rationals.
Recall (see (13.7)) that μh⁻¹ has value μ(h⁻¹A) at A.

PROOF. Consider the random variables Yₙ and Y of Theorem 25.6. Since Yₙ(ω) → Y(ω), Y(ω) ∉ D_h implies that h(Yₙ(ω)) → h(Y(ω)). Since P[ω: Y(ω) ∈ D_h] = μ(D_h) = 0, h(Yₙ(ω)) → h(Y(ω)) holds with probability 1. Hence h(Yₙ) ⇒ h(Y) by Theorem 25.2. Since P[h(Y) ∈ A] = P[Y ∈ h⁻¹A] = μ(h⁻¹A), h(Y) has distribution μh⁻¹; similarly, h(Yₙ) has distribution μₙh⁻¹. Thus h(Yₙ) ⇒ h(Y) is the same thing as μₙh⁻¹ ⇒ μh⁻¹. ∎

Because of the definition of convergence in distribution, this result has an equivalent statement in terms of random variables:

Corollary 1. If Xₙ ⇒ X and P[X ∈ D_h] = 0, then h(Xₙ) ⇒ h(X).

Take X ≡ a:

Corollary 2. If Xₙ ⇒ a and h is continuous at a, then h(Xₙ) ⇒ h(a).

Example 25.8. From Xₙ ⇒ X it follows directly by the theorem that aXₙ + b ⇒ aX + b. Suppose also that aₙ → a and bₙ → b. Then (aₙ − a)Xₙ ⇒ 0 by Example 25.7, and so (aₙXₙ + bₙ) − (aXₙ + b) ⇒ 0. Now aₙXₙ + bₙ ⇒ aX + b follows by Theorem 25.4: If Xₙ ⇒ X, aₙ → a, and bₙ → b, then aₙXₙ + bₙ ⇒ aX + b. This fact was stated and proved differently in Section 14 — see Lemma 1 on p. 195. ∎
By definition, μₙ ⇒ μ means that the corresponding distribution functions converge weakly. The following theorem characterizes weak convergence without reference to distribution functions. The boundary ∂A of A consists of the points that are limits of sequences in A and are also limits of sequences in Aᶜ; alternatively, ∂A is the closure of A minus its interior. A set A is a μ-continuity set if it is a Borel set and μ(∂A) = 0.

Theorem 25.8. The following three conditions are equivalent:
(i) μₙ ⇒ μ;
(ii) ∫f dμₙ → ∫f dμ for every bounded, continuous real function f;
(iii) μₙ(A) → μ(A) for every μ-continuity set A.

PROOF. Suppose that μₙ ⇒ μ and consider the random variables Yₙ and Y of Theorem 25.6. Suppose that f is a bounded function such that μ(D_f) = 0, where D_f is the set of points of discontinuity of f. Since P[Y ∈ D_f] = μ(D_f) = 0, f(Yₙ) → f(Y) with probability 1, and so by change of variable (see (21.1)) and the bounded convergence theorem, ∫f dμₙ = E[f(Yₙ)] → E[f(Y)] = ∫f dμ. Thus μₙ ⇒ μ and μ(D_f) = 0 together imply that ∫f dμₙ → ∫f dμ if f is bounded. In particular, (i) implies (ii). Further, if f = I_A, then D_f = ∂A, and from μ(∂A) = 0 and μₙ ⇒ μ follows μₙ(A) = ∫f dμₙ → ∫f dμ = μ(A). Thus (i) also implies (iii). Since ∂(−∞, x] = {x}, obviously (iii) implies (i).

It therefore remains only to deduce μₙ ⇒ μ from (ii). Consider the corresponding distribution functions. Suppose that x < y, and let f(t) be 1 for t ≤ x, 0 for t ≥ y, and interpolate linearly on [x, y]: f(t) = (y − t)/(y − x) for x ≤ t ≤ y. Since Fₙ(x) ≤ ∫f dμₙ and ∫f dμ ≤ F(y), it follows from (ii) that lim supₙ Fₙ(x) ≤ F(y); letting y ↓ x shows that lim supₙ Fₙ(x) ≤ F(x). Similarly, F(u) ≤ lim infₙ Fₙ(x) for u < x and hence F(x−) ≤ lim infₙ Fₙ(x). This implies convergence at continuity points. ∎

The function f in this last part of the proof is uniformly continuous. Hence μₙ ⇒ μ follows if ∫f dμₙ → ∫f dμ for every bounded and uniformly continuous f.

Example 25.9. The distributions of Example 25.3 satisfy μₙ ⇒ μ, but μₙ(A) does not converge to μ(A) if A is the set of rationals. Hence this A cannot be a μ-continuity set; in fact, of course, ∂A = R¹. ∎

The concept of weak convergence would be nearly useless if (25.2) were not allowed to fail when μ(∂A) > 0. Since F(x) − F(x−) = μ{x} = μ(∂(−∞, x]), it is therefore natural in the original definition to allow (25.1) to fail when x is not a continuity point of F.

Helly's Theorem

One of the most frequently used results in analysis is the Helly selection theorem:

Theorem 25.9. For every sequence {Fₙ} of distribution functions there exists a subsequence {Fₙ_k} and a nondecreasing, right-continuous function F such that lim_k Fₙ_k(x) = F(x) at continuity points x of F.

PROOF. An application of the diagonal method [A14] gives a sequence {n_k} of integers along which the limit G(r) = lim_k Fₙ_k(r) exists for every rational r. Define F(x) = inf[G(r): x < r]. Clearly F is nondecreasing. To each x and ε there is an r for which x < r and G(r) < F(x) + ε. If x ≤ y < r, then F(y) ≤ G(r) < F(x) + ε. Hence F is continuous from the right. If F is continuous at x, choose y < x so that F(x) − ε < F(y); now choose rational r and s so that y < r < x < s and G(s) < F(x) + ε. From F(x) − ε < G(r) ≤ G(s) < F(x) + ε and Fₙ_k(r) ≤ Fₙ_k(x) ≤ Fₙ_k(s) it follows that as k goes to infinity Fₙ_k(x) has limits superior and inferior within ε of F(x). ∎

The F in this theorem necessarily satisfies 0 ≤ F(x) ≤ 1. But F need not be a distribution function; if Fₙ has a unit jump at n, for example, F(x) ≡ 0 is the only possibility. It is important to have a condition which ensures that for some subsequence the limit F is a distribution function.

A sequence of probability measures μₙ on (R¹, 𝓡¹) is said to be tight if for each ε there exists a finite interval (a, b] such that μₙ(a, b] > 1 − ε for all n. In terms of the corresponding distribution functions Fₙ, the condition is that for each ε there exist x and y such that Fₙ(x) < ε and Fₙ(y) > 1 − ε for all n. If μₙ is a unit mass at n, {μₙ} is not tight in this sense — the mass of μₙ "escapes to infinity." Tightness is a condition preventing this escape of mass.

Theorem 25.10. Tightness is a necessary and sufficient condition that for every subsequence {μₙ_k} there exist a further subsequence {μₙ_k(j)} and a probability measure μ such that μₙ_k(j) ⇒ μ as j → ∞.

Only the sufficiency of the condition in this theorem is used in what follows.

PROOF. SUFFICIENCY. Apply Helly's theorem to the subsequence {Fₙ_k} of corresponding distribution functions. There exists a further subsequence {Fₙ_k(j)} such that lim_j Fₙ_k(j)(x) = F(x) at continuity points of F, where F is nondecreasing and right-continuous. There exists by Theorem 12.4 a measure μ on (R¹, 𝓡¹) such that μ(a, b] = F(b) − F(a). Given ε, choose a and b so that μₙ(a, b] > 1 − ε for all n, which is possible by tightness. By decreasing a and increasing b, one can ensure that they are continuity points of F. But then μ(a, b] ≥ 1 − ε. Therefore, μ is a probability measure, and of course μₙ_k(j) ⇒ μ.

NECESSITY. If {μₙ} is not tight, there exists a positive ε such that for each finite interval (a, b], μₙ(a, b] ≤ 1 − ε for some n. Choose n_k so that μₙ_k(−k, k] ≤ 1 − ε. Suppose that some subsequence {μₙ_k(j)} of {μₙ_k} were to converge weakly to some probability measure μ. Choose (a, b] so that μ{a} = μ{b} = 0 and μ(a, b] > 1 − ε. For large enough j, (a, b] ⊂ (−k(j), k(j)], and so 1 − ε ≥ μₙ_k(j)(−k(j), k(j)] ≥ μₙ_k(j)(a, b] → μ(a, b]. Thus μ(a, b] ≤ 1 − ε, a contradiction. ∎

Corollary. If {μₙ} is a tight sequence of probability measures, and if each subsequence that converges weakly at all converges weakly to the probability measure μ, then μₙ ⇒ μ.

PROOF. By the theorem, each subsequence {μₙ_k} contains a further subsequence {μₙ_k(j)} converging weakly (j → ∞) to some limit, and that limit must by hypothesis be μ. Thus every subsequence {μₙ_k} contains a further subsequence {μₙ_k(j)} converging weakly to μ. Suppose that μₙ ⇒ μ is false. Then there exists some x such that μ{x} = 0 but μₙ(−∞, x] does not converge to μ(−∞, x]. But then there exists a positive ε such that |μₙ_k(−∞, x] − μ(−∞, x]| ≥ ε for an infinite sequence {n_k} of integers, and no subsequence of {μₙ_k} can converge weakly to μ. This contradiction shows that μₙ ⇒ μ. ∎

If μₙ is a unit mass at xₙ, then {μₙ} is tight if and only if {xₙ} is bounded. The theorem above and its corollary reduce in this case to standard facts about the real line; see Example 25.4 and A10: tightness of sequences of probability measures is analogous to boundedness of sequences of real numbers.

Example 25.10. Let μₙ be the normal distribution with mean mₙ and variance σₙ². If mₙ and σₙ² are bounded, then the second moment of μₙ is bounded, and it follows by Markov's inequality (21.12) that {μₙ} is tight. The conclusion of Theorem 25.10 can also be checked directly: If {n_k(j)} is chosen so that lim_j mₙ_k(j) = m and lim_j σₙ_k(j)² = σ², then μₙ_k(j) ⇒ μ, where μ is normal with mean m and variance σ² (a unit mass at m if σ² = 0).

If mₙ ≥ b, then μₙ(b, ∞) ≥ ½; if mₙ ≤ a, then μₙ(−∞, a] ≥ ½. Hence {μₙ} cannot be tight if mₙ is unbounded. If mₙ is bounded, say by K, then μₙ(−∞, a] ≥ ν(−∞, (a − K)σₙ⁻¹], where ν is the standard normal distribution. If σₙ is unbounded, then ν(−∞, (a − K)σₙ⁻¹] → ½ along some subsequence, and {μₙ} cannot be tight. Thus a sequence of normal distributions is tight if and only if the means and variances are bounded. ∎
Integration to the Limit

Theorem 25.11. If Xₙ ⇒ X, then E[|X|] ≤ lim infₙ E[|Xₙ|].

PROOF. Apply Skorohod's Theorem 25.6 to the distributions of Xₙ and X: There exist on a common probability space random variables Yₙ and Y such that Y = limₙ Yₙ with probability 1, Yₙ has the distribution of Xₙ, and Y has the distribution of X. By Fatou's lemma, E[|Y|] ≤ lim infₙ E[|Yₙ|]. Since |X| and |Y| have the same distribution, they have the same expected value (see (21.6)), and similarly for |Xₙ| and |Yₙ|. ∎
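The inequality of Theorem 25.11 can be strict, and the standard example is also the one showing why Theorem 25.12 below needs uniform integrability. The computation is exact (no simulation); the cutoff a = 50 is an arbitrary choice:

```python
# X_n = n with probability 1/n and 0 otherwise: X_n => X = 0, yet
# E|X_n| = 1 for all n while E|X| = 0, so Theorem 25.11 gives only
# an inequality. The X_n are not uniformly integrable: for a <= n the
# tail integral in (25.15) stays at 1.
def mean(n):
    return n * (1 / n)                      # E|X_n|

def tail(n, a):
    # integral of |X_n| over the event [|X_n| >= a]
    return n * (1 / n) if n >= a else 0.0

print([mean(n) for n in (10, 100, 1000)])        # stays near 1, not 0
print([tail(n, 50.0) for n in (10, 100, 1000)])  # sup over n does not vanish
```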
The random variables Xₙ are said to be uniformly integrable if

(25.15)  lim_{a→∞} supₙ ∫_{[|Xₙ|≥a]} |Xₙ| dP = 0;

see (16.21). This implies (see (16.22)) that

(25.16)  supₙ E[|Xₙ|] < ∞.

Theorem 25.12. If Xₙ ⇒ X and the Xₙ are uniformly integrable, then X is integrable and

(25.17)  E[Xₙ] → E[X].

PROOF. Construct random variables Yₙ and Y as in the preceding proof. Since Yₙ → Y with probability 1 and the Yₙ are uniformly integrable in the sense of (16.21), E[Xₙ] = E[Yₙ] → E[Y] = E[X] by Theorem 16.13. ∎

If supₙ E[|Xₙ|^{1+ε}] < ∞ for some positive ε, then the Xₙ are uniformly integrable because

(25.18)  ∫_{[|Xₙ|≥a]} |Xₙ| dP ≤ a^{−ε} ∫_{[|Xₙ|≥a]} |Xₙ|^{1+ε} dP ≤ a^{−ε} supₙ E[|Xₙ|^{1+ε}].

Since Xₙ ⇒ X implies that Xₙʳ ⇒ Xʳ by Theorem 25.7, there is the following consequence of the theorem.

Corollary. Let r be a positive integer. If Xₙ ⇒ X and supₙ E[|Xₙ|^{r+ε}] < ∞, where ε > 0, then E[|X|ʳ] < ∞ and E[Xₙʳ] → E[Xʳ].

The Xₙ are also uniformly integrable if there is an integrable random variable Z such that P[|Xₙ| ≥ t] ≤ P[|Z| ≥ t] for t ≥ 0, because then (21.10) gives

∫_{[|Xₙ|≥a]} |Xₙ| dP ≤ ∫_{[|Z|≥a]} |Z| dP.
From this the dominated convergence theorem follows again.

PROBLEMS

25.1.
(a) Show by example that distribution functions having densities can converge weakly even if the densities do not converge. Hint: Consider fₙ(x) = 1 + cos 2πnx on [0, 1].
(b) Let fₙ be 2ⁿ times the indicator of the set of x in the unit interval for which d_{n+1}(x) = ··· = d_{2n}(x) = 0, where d_k(x) is the kth dyadic digit.
Show that fₙ(x) → 0 except on a set of Lebesgue measure 0; on this exceptional set, redefine fₙ(x) = 0 for all n, so that fₙ(x) → 0 everywhere. Show that the distributions corresponding to these densities converge weakly to Lebesgue measure confined to the unit interval.
(c) Show that distributions with densities can converge weakly to a limit that has no density (even to a unit mass).
(d) Show that discrete distributions can converge weakly to a distribution that has a density.
(e) Construct an example, like that of Example 25.3, in which μₙ(A) → μ(A) fails but in which all the measures come from continuous densities on
[0, 1].

25.2. 14.12 ↑ Give a simple proof of the Glivenko-Cantelli theorem (Theorem 20.6) under the extra hypothesis that F is continuous.

25.3. Initial digits. (a) Show that the first significant digit of a positive number x is d (in the scale of 10) if and only if {log₁₀ x} lies between log₁₀ d and log₁₀(d + 1), d = 1, ..., 9, where the brackets denote fractional part.
(b) For positive numbers x₁, x₂, ..., let Nₙ(d) be the number among the first n that have initial digit d. Show that
(25.19)  (1/n)Nₙ(d) → log₁₀(d + 1) − log₁₀ d,  d = 1, ..., 9,
( 25 .20) lim P [ Dn == d ] == log 1 0( d + 1 ) - log1 0 d , n
25.4.
d == 1 , . . . , 9,
if {log₁₀ Xₙ} ⇒ U, where U is uniformly distributed over the unit interval.

25.4. Show that for each probability measure μ on the line there exist probability measures μₙ with finite support such that μₙ ⇒ μ. Show further that μₙ{x} can be taken rational and that each point in the support can be taken rational. Thus there exists a countable set of probability measures such that every μ is the weak limit of some sequence from the set. The space of distribution functions is thus separable in the Lévy metric (see Problem 14.9).
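The initial-digit law of Problem 25.3 can be explored numerically for xₙ = 2ⁿ. The helper below reads the leading digit off the fractional part of n log₁₀ 2 (so it is exact only up to floating-point error); the name `digit_freq` is just an illustrative label:

```python
import math

# Empirical initial-digit frequencies of 2, 4, 8, ..., 2^n_terms,
# to compare with log10(d+1) - log10(d) as in (25.19).
def digit_freq(n_terms):
    counts = [0] * 10
    log2 = math.log10(2.0)
    for n in range(1, n_terms + 1):
        frac = (n * log2) % 1.0         # {n log10 2}
        counts[int(10 ** frac)] += 1    # leading digit of 2^n
    return [c / n_terms for c in counts]

freq = digit_freq(10_000)
for d in range(1, 10):
    print(d, round(freq[d], 4), round(math.log10(d + 1) - math.log10(d), 4))
```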
25.5. 25.6.
25.7. 25.8.
Show that (25.10) implies that P([ X < X ]�[ xn < X ]) --+ 0 if P[ X == X ] == 0. For arbitrary random variables Xn there exist positive constants an such that an xn � 0. Generalize Example 25.8 by showing for three-dimensional random vectors ( A n , Bn , Xn ) and constants a and b, a > 0, that, if An � a, Bn � b, and Xn � X, then AnXn + Bn � aX + b . Suppose that Xn � X and that hn and h are Borel functions. Let E be the set of x for1 which hnxn --+ hx fails for some sequence xn --+ x. Suppose that E e 9P and P[ X e E] == 0. Show that hn Xn � hX.
Suppose that the distributions of random variables X,, and X have densities /,. and f. Show that if fn( x) --+ /( x ) for x outside a set of Lebesgue measure 0, then Xn � X. 25. 10. f Suppose that Xn assumes as values y, + k B,. , k == 0, + 1, . . . , where with n in Bn > 0. Suppose that Bn --+ 0 and that, if kn is an integer varying 1 such a way that Yn + knBn --+ x, then P[ Xn == Yn + knBn 1 Bn- --+ /( x), where I is the density of a random variable X. Show that xn = X. 25. 11. f Let Sn have the binomial distribution with parameters n and p. Assume as known that 25.9.
( 25 .21)
25.12. 25. 13.
25. 14.
P [ S,. = k n ] ( np ( l - p ))
1
--+ ..ff; e - ' 2 /2
if ( kn - np)( np(1 - p)) - 1 12 1--+ x. Deduce the DeMoivre-Laplace the orem: ( Sn - np)( np( 1 - p)) - 12 = N, where N has the standard normal distribution. This is a special case of the central limit theorem; see Section 27. Prove weak convergence in Example 25.3 by using Theorem 25.8 and the theory of the Riemann integral. (a) Show that probability measures satisfy #Ln = p. if P. n ( a , b] --+ p.( a , b) whenever I' { a } == I' { b } == 0. (b) Show that, if I fdp.n --+ I fdp. for all continuous f with bounded support, then ,.,. n = ,.,. . f Let I' be Lebesgue measure confined to the unit interval; let l'n COrrespond tO a mass Of Xn . Xn 1 at SOme point in ( Xn . Xn . ] where 0 == Xn o < xn 1 < · · · < xnn == 1. Show by considering the distribution functions that p.,. = p. if max ; < n( x, ; - xn . ; - 1 ) --+ 0. Deduce that a bounded Borel function continuous almost everywhere on the unit interval is Riemann integrable. See Problem 17.6. 2.15 5.16 f A function f of positive integers has distribution function F if F is the weak limit of the distribution function Pn [ m : /( m ) < x] of I under the measure having probability 1/n at each of 1, . . . , n (see (2.20)). In this case D [ m : /( m ) < x] == F(x) (see (2.21)) for continuity points x of F. Show that q>(m)/m (see (2.23)) has a distribution: (a) Show by the mapping theorem that it suffices to prove that f( m) .. log(q>( m )/m) == EPBP ( m)log(1 - 1/p) has a distribution. (b) Let fu ( m) == EP "B1 ( m )log(1 - 1/p), and show by (5.41) that fu has distribution function �\ X ) == P{E u Xp log(1 - 1/p) < x], where the XP are independent random variables (one for each prime p) such that P[ XP .. 1] == 1/p and P( XP == 0] == 1 - 1/p. (c) Show that EP XP log(1 - 1/p) converges with probability 1. Hint: Use Theorem 22.6. (d) Show that limu _ oo supn En l l/ - fu l 1 == 0 (see (5.42) for the notation). (e) Conclude by Markov's inequality and Theorem 25.5 that f has the distribution of the sum in (c). i
-
.
I_ 1,
I_
•.
25.15.
1 /2
s
s
I ,
25.16. 5.16 ↑ Suppose that f(n) → x. Show that Pₙ[m: |f(m) − x| ≥ ε] → 0. From Theorem 25.12 conclude that Eₙ[f] → x (see (5.42); use the fact that f is bounded). Deduce the consistency of Cesàro summation [A30].

25.17. For A ∈ 𝓡¹ and T > 0, put λ_T(A) = λ([−T, T] ∩ A)/2T, where λ is Lebesgue measure. The relative measure of A is
(25.22)  ρ(A) = lim_{T→∞} λ_T(A),

provided that this limit exists. This is a continuous analogue of density (see (2.21)) for sets of integers. A Borel function f has a distribution under λ_T; if this converges weakly to F, then

(25.23)  ρ[x: f(x) ≤ u] = F(u)
for continuity points u of F, and F is called the distribution function of f. Show that all periodic functions have distributions. 25.18. Suppose that supn / fdp.n < oo for a nonnegative f such that f(x) --+ oo as x --+ + oo . Show that { P.n } is tight. 25. 19. f Let f( x) == l x l cr , a > 0 ; this is an f of the kind in Problem 25.18. Construct a tight sequence for which f lx l aP.n ( dx) --+ oo ; construct one for which f l x l aP.n( dx) == 00 . 25.20. 23.4 t
Show that the random variables Aₙ and Lₙ in Problems 23.3 and 23.4 converge in distribution. Show that the moments converge.

25.21. In the applications of Theorem 9.2, only a weaker result is actually needed: For each K there exists a positive α = α(K) such that if E[X] = 0, E[X²] = 1, and E[X⁴] ≤ K, then P[X ≥ 0] ≥ α. Prove this by using tightness and the corollary to Theorem 25.12.

25.22. Find uniformly integrable random variables Xₙ for which there is no integrable Z satisfying P[|Xₙ| ≥ t] ≤ P[|Z| ≥ t] for t ≥ 0.

SECTION 26. CHARACTERISTIC FUNCTIONS

Definition
The characteristic function of a probability measure $\mu$ on the line is defined for real $t$ by

$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\,\mu(dx);$$
352
CONVERGENCE OF DISTRIBUTIONS
see the end of Section 16 for integrals of complex-valued functions.† A random variable $X$ with distribution $\mu$ has characteristic function
$$\varphi(t) = E[e^{itX}] = \int_{-\infty}^{\infty} e^{itx}\,\mu(dx).$$
The characteristic function is thus defined as the moment generating function but with the real argument $s$ replaced by $it$; it has the advantage that it always exists because $e^{itx}$ is bounded. The characteristic function in nonprobabilistic contexts is called the Fourier transform.

The characteristic function has three fundamental properties to be established here:

(i) If $\mu_1$ and $\mu_2$ have respective characteristic functions $\varphi_1(t)$ and $\varphi_2(t)$, then $\mu_1 * \mu_2$ has characteristic function $\varphi_1(t)\varphi_2(t)$. Although convolution is essential to the study of sums of independent random variables, it is a complicated operation, and it is often simpler to study the products of the corresponding characteristic functions.

(ii) The characteristic function uniquely determines the distribution. This shows that in studying the products in (i), no information is lost.

(iii) From the pointwise convergence of characteristic functions follows the weak convergence of the corresponding distributions. This makes it possible, for example, to investigate the asymptotic distributions of sums of independent random variables by means of their characteristic functions.

Moments and Derivatives
It is convenient first to study the relation between a characteristic function and the moments of the distribution it comes from. Of course, $\varphi(0) = 1$, and by (16.25), $|\varphi(t)| \le 1$ for all $t$. By Theorem 16.8(i), $\varphi(t)$ is continuous in $t$. In fact, $|\varphi(t+h) - \varphi(t)| \le \int |e^{ihx} - 1|\,\mu(dx)$, and so it follows by the bounded convergence theorem that $\varphi(t)$ is uniformly continuous.
Integration by parts shows that

$$\int_0^x (x-s)^n e^{is}\,ds = \frac{x^{n+1}}{n+1} + \frac{i}{n+1}\int_0^x (x-s)^{n+1} e^{is}\,ds, \tag{26.1}$$

and it follows by induction that

$$e^{ix} = \sum_{k=0}^{n} \frac{(ix)^k}{k!} + \frac{i^{n+1}}{n!}\int_0^x (x-s)^n e^{is}\,ds \tag{26.2}$$

for $n \ge 0$. Replace $n$ by $n-1$ in (26.1), solve for the integral on the right,

† From complex variable theory only De Moivre's formula and the simplest properties of the exponential function are needed here.
and substitute this for the integral in (26.2); this gives

$$e^{ix} = \sum_{k=0}^{n} \frac{(ix)^k}{k!} + \frac{i^{n}}{(n-1)!}\int_0^x (x-s)^{n-1}\left(e^{is} - 1\right)ds. \tag{26.3}$$
Estimating the integrals in (26.2) and (26.3) (consider separately the cases $x \ge 0$ and $x < 0$) now leads to

$$\left| e^{ix} - \sum_{k=0}^{n} \frac{(ix)^k}{k!} \right| \le \min\!\left[ \frac{|x|^{n+1}}{(n+1)!},\ \frac{2|x|^n}{n!} \right] \tag{26.4}$$

for $n \ge 0$. The first term on the right gives a sharp estimate for $|x|$ small, the second a sharp estimate for $|x|$ large. For $n = 0, 1, 2$, the inequality specializes to

$$|e^{ix} - 1| \le \min\{|x|,\ 2\}, \tag{26.4_0}$$
$$|e^{ix} - (1 + ix)| \le \min\{\tfrac12 x^2,\ 2|x|\}, \tag{26.4_1}$$
$$|e^{ix} - (1 + ix - \tfrac12 x^2)| \le \min\{\tfrac16 |x|^3,\ x^2\}. \tag{26.4_2}$$

If $X$ has a moment of order $n$, it follows that

$$\left| E[e^{itX}] - \sum_{k=0}^{n} \frac{(it)^k}{k!}\,E[X^k] \right| \le E\!\left[ \min\!\left\{ \frac{|tX|^{n+1}}{(n+1)!},\ \frac{2|tX|^n}{n!} \right\} \right]. \tag{26.5}$$

For any $t$ satisfying

$$\lim_n \frac{|t|^n E[|X|^n]}{n!} = 0, \tag{26.6}$$

$\varphi(t)$ must therefore have the expansion

$$\varphi(t) = \sum_{k=0}^{\infty} \frac{(it)^k}{k!}\,E[X^k]; \tag{26.7}$$

compare (21.22). If $E[e^{|tX|}] < \infty$, then (see (16.26)) (26.7) must hold. Thus (26.7) holds if $X$ has a moment generating function over the whole line.
Example 26.1. Since $E[e^{|tX|}] < \infty$ if $X$ has the standard normal distribution, by (26.7) and (21.7) its characteristic function is

$$\varphi(t) = \sum_{k=0}^{\infty} \frac{(it)^{2k}}{(2k)!}\, 1\cdot 3 \cdots (2k-1) = \sum_{k=0}^{\infty} \frac{1}{k!}\left( -\frac{t^2}{2} \right)^{k} = e^{-t^2/2}. \tag{26.8}$$

This and (21.25) formally coincide if $s = it$. ∎
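The series identity (26.8) is easy to spot-check numerically. The following sketch (not from the text; it assumes NumPy is available) estimates $E[e^{itN}]$ by Monte Carlo and compares it with $e^{-t^2/2}$:

```python
import numpy as np

# Monte Carlo estimate of the standard normal characteristic function,
# compared against the closed form e^{-t^2/2} from (26.8).
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)       # draws of N
t = 1.5
phi_hat = np.mean(np.exp(1j * t * x))    # empirical E[exp(itN)]
phi_exact = np.exp(-t ** 2 / 2)
# agreement to Monte Carlo accuracy (about 1e-3 with 10^6 draws)
assert abs(phi_hat - phi_exact) < 5e-3
```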
If the power series expansion (26.7) holds, the moments of $X$ can be read off from it:

$$\varphi^{(k)}(0) = i^k E[X^k]. \tag{26.9}$$

This is the analogue of (21.23). It holds, however, under the weakest possible assumption, namely that $E[|X^k|] < \infty$. Indeed,

$$\frac{\varphi(t+h) - \varphi(t)}{h} - E[iXe^{itX}] = E\!\left[ e^{itX}\left( \frac{e^{ihX} - 1}{h} - iX \right) \right].$$

By (26.4$_1$), the integrand on the right is dominated by $2|X|$ and goes to 0 with $h$; hence the expected value goes to 0 by the dominated convergence theorem. Thus $\varphi'(t) = E[iXe^{itX}]$. Repeating this argument inductively gives

$$\varphi^{(k)}(t) = E\!\left[ (iX)^k e^{itX} \right] \tag{26.10}$$

if $E[|X^k|] < \infty$. Hence (26.9) holds if $E[|X^k|] < \infty$. The proof of uniform continuity for $\varphi(t)$ works for $\varphi^{(k)}(t)$ as well.

If $E[X^2]$ is finite, then

$$\varphi(t) = 1 + itE[X] - \tfrac12 t^2 E[X^2] + o(t^2), \qquad t \to 0. \tag{26.11}$$

Indeed, by (26.4$_2$), the error is at most $t^2 E[\min\{|t|\,|X|^3, X^2\}]$, and as $t \to 0$ the integrand goes to 0 and is dominated by $X^2$. Such estimates are useful in proving limit theorems.

The more moments $\mu$ has, the more derivatives $\varphi$ has. This is one sense in which lightness of the tails of $\mu$ is reflected by smoothness of $\varphi$. There are results which connect the behavior of $\varphi(t)$ as $|t| \to \infty$ with smoothness properties of $\mu$. The Riemann-Lebesgue theorem is the most important of these:
Theorem 26.1. If $\mu$ has a density, then $\varphi(t) \to 0$ as $|t| \to \infty$.

PROOF. The problem is to prove for integrable $f$ that $\int f(x)e^{itx}\,dx \to 0$ as $|t| \to \infty$. There exists by Theorem 17.1 a step function $g = \sum_k a_k I_{A_k}$, a finite linear combination of indicators of intervals $A_k = (a_k, b_k]$, for which $\int |f - g|\,dx < \epsilon$. Now $\int f(x)e^{itx}\,dx$ differs by at most $\epsilon$ from $\int g(x)e^{itx}\,dx = \sum_k a_k (e^{itb_k} - e^{ita_k})/it$, and this goes to 0 as $|t| \to \infty$. ∎
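A small numerical illustration of the theorem (a sketch, assuming NumPy; not from the text): the characteristic function $(e^{it}-1)/it$ of the uniform density on $(0, 1)$ decays as $|t| \to \infty$, while the characteristic function $e^{it}$ of a unit mass at 1, which has no density, keeps modulus 1:

```python
import numpy as np

mods = []
for t in (10.0, 100.0, 1000.0):
    phi_unif = (np.exp(1j * t) - 1) / (1j * t)   # cf of uniform(0, 1)
    mods.append(abs(phi_unif))
assert mods[0] > mods[2]        # the modulus decays...
assert mods[2] < 2 / 1000.0     # ...at rate O(1/|t|)
# a point mass has cf of constant modulus 1 -- no Riemann-Lebesgue decay
assert abs(abs(np.exp(1j * 1000.0)) - 1.0) < 1e-12
```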
Independence

The multiplicative property (21.28) of moment generating functions extends to characteristic functions. Suppose that $X_1$ and $X_2$ are independent random variables with characteristic functions $\varphi_1$ and $\varphi_2$. If $Y_j = \cos tX_j$ and $Z_j = \sin tX_j$, then $(Y_1, Z_1)$ and $(Y_2, Z_2)$ are independent; by the rules for integrating complex-valued functions,

$$\begin{aligned}
\varphi_1(t)\varphi_2(t) &= (E[Y_1] + iE[Z_1])(E[Y_2] + iE[Z_2]) \\
&= E[Y_1]E[Y_2] - E[Z_1]E[Z_2] + i(E[Y_1]E[Z_2] + E[Z_1]E[Y_2]) \\
&= E[Y_1Y_2 - Z_1Z_2 + i(Y_1Z_2 + Z_1Y_2)] = E[e^{it(X_1+X_2)}].
\end{aligned}$$

This extends to sums of three or more: If $X_1, \ldots, X_n$ are independent, then

$$E\!\left[ e^{it\sum_{k=1}^n X_k} \right] = \prod_{k=1}^{n} E[e^{itX_k}]. \tag{26.12}$$

If $X$ has characteristic function $\varphi(t)$, then $aX + b$ has characteristic function

$$E[e^{it(aX+b)}] = e^{itb}\varphi(at). \tag{26.13}$$

In particular, $-X$ has characteristic function $\varphi(-t)$, which is the complex conjugate of $\varphi(t)$.
A characteristic function cp uniquely determines the measure p. it comes from. This fundamental fact will be derived by means of an inversion formula through which p. can in principle be recovered from cp. Define
S ( T ) = i r sinX x dx ,
T > 0.
0
In Example 1 8.3 it is shown that (26. 14)
T = lim S( ) r-. oo
'IT
2
;
S( T) is therefore bounded. If sgn 8 is + 1, 0, or - 1
negative, then
( 26.15 ) Theorem 26.2.
and if p. { a }
for si� tfJ dt
=
sgn fJ
If the probability
· S(TifJI ) ,
as 8 is positive, 0, or
T > 0.
measure p. has characteristic function
p. { b } = 0, then e - i ta e - i tb 1 . cp ( t ) dt. p.( a, b] = lim j ( 26.16 ) It 2 'IT Distinct measures cannot have the same characteristic function. =
T T-+ oo - T
q>,
_
.
Note: By (26.4 1 ), the integrand here converges as t -+ 0 to b - a, which is to be taken as its value for t = 0. For fixed a and b the integrand is thus continuous in t, and by (26.40) it is bounded. If p. is a unit mass at 0, then cp (t) = 1 and the integral in (26.16) cannot be extended over the whole line. PROOF . The inversion formula will imply uniqueness: It will imply that if p. and " have the same characteristic function, then p.(a, b] = Jl(a, b] if p. { a } = J1 { a } = p.1 { b } = J1 { b } = 0; but such intervals (a, b] form a 'IT-sys tem generating 91 • Denote by Ir the quantity inside the limit in (26.16). By Fubini's theorem ( 26 .17 )
IT _
_!__
2 'IT
/ 00 [J T e it( x - a) .
_
-
oo - T
It
e i t( x - b )
]dt ,., (
dx ) .
This interchange is legitimate because the double integral extends over a set of finite product measure and by (26.4$_0$) the integrand is bounded by $|b - a|$. Rewrite the integrand by De Moivre's formula. Since $\sin s$ and $\cos s$ are odd and even, respectively, (26.15) gives

$$I_T = \int_{-\infty}^{\infty} \left[ \frac{\operatorname{sgn}(x-a)}{\pi}\,S(T\cdot|x-a|) - \frac{\operatorname{sgn}(x-b)}{\pi}\,S(T\cdot|x-b|) \right] \mu(dx).$$

The integrand here is bounded and converges as $T \to \infty$ to the function

$$\psi_{a,b}(x) = \begin{cases} 0 & \text{if } x < a, \\ 1/2 & \text{if } x = a, \\ 1 & \text{if } a < x < b, \\ 1/2 & \text{if } x = b, \\ 0 & \text{if } b < x. \end{cases} \tag{26.18}$$

Thus $I_T \to \int \psi_{a,b}\,d\mu$, which implies that (26.16) holds if $\mu\{a\} = \mu\{b\} = 0$. ∎
The inversion formula contains further information. Suppose that

$$\int_{-\infty}^{\infty} |\varphi(t)|\,dt < \infty. \tag{26.19}$$

In this case the integral in (26.16) can be extended over $R^1$. By (26.4$_0$),

$$\left| \frac{e^{-ita} - e^{-itb}}{it} \right| = \left| \frac{e^{it(b-a)} - 1}{it} \right| \le |b - a|;$$

therefore, $\mu(a, b] \le \frac{b-a}{2\pi}\int_{-\infty}^{\infty} |\varphi(t)|\,dt$, and there can be no point masses. By (26.16), the corresponding distribution function satisfies

$$\frac{F(x+h) - F(x)}{h} = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-itx} - e^{-it(x+h)}}{ith}\,\varphi(t)\,dt$$

(whether $h$ is positive or negative). The integrand is by (26.4$_0$) dominated by $|\varphi(t)|$ and goes to $e^{-itx}\varphi(t)$ as $h \to 0$. Therefore, $F$ has derivative

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx}\varphi(t)\,dt. \tag{26.20}$$
Since $f$ is continuous for the same reason $\varphi$ is, it integrates to $F$ by the fundamental theorem of the calculus (see (17.6)). Thus (26.19) implies that $\mu$ has the continuous density (26.20). Moreover, this is the only continuous density. In this result, as in the Riemann-Lebesgue theorem, conditions on the size of $\varphi(t)$ for large $|t|$ are connected with smoothness properties of $\mu$.

The inversion formula (26.20) has many applications. In the first place, it can be used for a new derivation of (26.14). As pointed out in Example 17.3, the existence of the limit in (26.14) is easy to prove. Denote this limit temporarily by $\pi_0/2$, without assuming that $\pi_0 = \pi$. Then (26.16) and (26.20) follow as before if $\pi$ is replaced by $\pi_0$. Applying the latter to the standard normal density (see (26.8)) gives

$$\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2} = \frac{1}{2\pi_0} \int_{-\infty}^{\infty} e^{-itx}\,e^{-t^2/2}\,dt, \tag{26.21}$$

where the $\pi$ on the left is that of analysis and geometry; it comes ultimately from the quadrature (18.10). An application of (26.8) with $x$ and $t$ interchanged reduces the right side of (26.21) to $(\sqrt{2\pi}/2\pi_0)e^{-x^2/2}$, and therefore $\pi_0$ does equal $\pi$.
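The inversion formula (26.20) can also be checked numerically. The sketch below (NumPy assumed; the truncation range $T$ is an arbitrary choice) recovers the double exponential density $\frac12 e^{-|x|}$ from its characteristic function $1/(1+t^2)$ by a midpoint-rule approximation of the integral:

```python
import numpy as np

T, m = 200.0, 400_000
t = -T + 2 * T * (np.arange(m) + 0.5) / m    # midpoint grid on [-T, T]
phi = 1.0 / (1.0 + t ** 2)                   # cf of the Laplace density
x0 = 0.7
# (26.20): f(x0) = (2*pi)^{-1} * integral of e^{-i t x0} phi(t) dt
f_x0 = (np.exp(-1j * t * x0) * phi).mean().real * 2 * T / (2 * np.pi)
assert abs(f_x0 - 0.5 * np.exp(-abs(x0))) < 5e-3   # truncation error ~1/T
```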
Consider the densities in the table below. The characteristic function for the normal distribution has already been calculated. For the uniform distribution over (0, 1), the computation is of course straightforward; note that in this case the density cannot be recovered from (26.20) because $\varphi(t)$ is not integrable; this is reflected in the fact that the density has discontinuities at 0 and 1. The characteristic function for the exponential distribution is easily calculated; compare Example 21.3. As for the double exponential or Laplace distribution, $e^{-|x|}e^{itx}$ integrates over $(0, \infty)$ to $(1-it)^{-1}$ and over $(-\infty, 0)$ to $(1+it)^{-1}$, which gives the result. By (26.20), then,

$$e^{-|x|} = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{e^{-itx}}{1+t^2}\,dt.$$

For $x = 0$ this gives the standard integral $\int_{-\infty}^{\infty} dt/(1+t^2) = \pi$; see Example 17.5.

Distribution | Density | Interval | Characteristic function
1. Normal | $e^{-x^2/2}/\sqrt{2\pi}$ | $-\infty < x < \infty$ | $e^{-t^2/2}$
2. Uniform | $1$ | $0 < x < 1$ | $(e^{it} - 1)/it$
3. Exponential | $e^{-x}$ | $0 < x < \infty$ | $1/(1 - it)$
4. Double exponential or Laplace | $\frac12 e^{-|x|}$ | $-\infty < x < \infty$ | $1/(1 + t^2)$
5. Cauchy | $1/\pi(1 + x^2)$ | $-\infty < x < \infty$ | $e^{-|t|}$
6. Triangular | $1 - |x|$ | $-1 < x < 1$ | $2(1 - \cos t)/t^2$
7. | $(1 - \cos x)/\pi x^2$ | $-\infty < x < \infty$ | $(1 - |t|)I_{(-1,1)}(t)$

Thus the Cauchy density in the table integrates to 1 and has characteristic function $e^{-|t|}$. This distribution has no first moment, and the characteristic function is not differentiable at the origin.

A straightforward integration shows that the triangular density has the characteristic function given in the table, and by (26.20),

$$(1 - |x|)\,I_{(-1,1)}(x) = \frac{1}{\pi} \int_{-\infty}^{\infty} e^{-itx}\,\frac{1 - \cos t}{t^2}\,dt.$$

For $x = 0$ this is $\int_{-\infty}^{\infty} (1 - \cos t)t^{-2}\,dt = \pi$; hence the last line of the table.

Each density and characteristic function in the table can be transformed by (26.13), which gives a family of distributions.

The Continuity Theorem
Because of (26.12), the characteristic function provides a powerful means of studying the distributions of sums of independent random variables. It is often easier to work with products of characteristic functions than with convolutions, and knowing the characteristic function of the sum is by Theorem 26.2 in principle the same thing as knowing the distribution itself. Because of the following continuity theorem, characteristic functions can be used to study limit distributions.

Theorem 26.3. Let $\mu_n, \mu$ be probability measures with characteristic functions $\varphi_n, \varphi$. A necessary and sufficient condition for $\mu_n \Rightarrow \mu$ is that $\varphi_n(t) \to \varphi(t)$ for each $t$.
PROOF: NECESSITY. For each $t$, $e^{itx}$ has bounded modulus and is continuous in $x$. The necessity therefore follows by an application of Theorem 25.8 (to the real and imaginary parts of $e^{itx}$). ∎

SUFFICIENCY. By Fubini's theorem,

$$\frac{1}{u} \int_{-u}^{u} (1 - \varphi_n(t))\,dt = \int_{-\infty}^{\infty} \left[ \frac{1}{u} \int_{-u}^{u} (1 - e^{itx})\,dt \right] \mu_n(dx) = 2\int_{-\infty}^{\infty} \left( 1 - \frac{\sin ux}{ux} \right)\mu_n(dx) \tag{26.22}$$

$$\ge 2\int_{|x| \ge 2/u} \left( 1 - \frac{1}{|ux|} \right)\mu_n(dx) \ge \mu_n\!\left[ x\colon |x| \ge \frac{2}{u} \right].$$
(Note that the first integral is real.) Since $\varphi$ is continuous at the origin and $\varphi(0) = 1$, there is for positive $\epsilon$ a $u$ for which $u^{-1}\int_{-u}^{u}(1 - \varphi(t))\,dt < \epsilon$. Since $\varphi_n$ converges to $\varphi$, the bounded convergence theorem implies that there exists an $n_0$ such that $u^{-1}\int_{-u}^{u}(1 - \varphi_n(t))\,dt < 2\epsilon$ for $n \ge n_0$. If $a = 2/u$ in (26.22), then $\mu_n[x\colon |x| \ge a] \le 2\epsilon$ for $n \ge n_0$. Increasing $a$ if necessary will ensure that this inequality also holds for the finitely many $n$ preceding $n_0$. Therefore, $\{\mu_n\}$ is tight.

By the corollary to Theorem 25.10, $\mu_n \Rightarrow \mu$ will follow if it is shown that each subsequence $\{\mu_{n_k}\}$ that converges weakly at all converges weakly to $\mu$. But if $\mu_{n_k} \Rightarrow \nu$ as $k \to \infty$, then by the necessity half of the theorem, already proved, $\nu$ has characteristic function $\lim_k \varphi_{n_k}(t) = \varphi(t)$. By Theorem 26.2, $\nu$ and $\mu$ must coincide. ∎
Two corollaries, interesting in themselves, will make clearer the structure of the proof of sufficiency given above. In each, let $\mu_n$ be probability measures on the line with characteristic functions $\varphi_n$.

Corollary 1. Suppose that $\lim_n \varphi_n(t) = g(t)$ for each $t$, where the limit function $g$ is continuous at 0. Then there exists a $\mu$ such that $\mu_n \Rightarrow \mu$, and $\mu$ has characteristic function $g$.

PROOF. The point of the corollary is that $g$ is not assumed at the outset to be a characteristic function. But in the argument following (26.22), only $\varphi(0) = 1$ and the continuity of $\varphi$ at 0 were used; hence $\{\mu_n\}$ is tight under the present hypothesis. If $\mu_{n_k} \Rightarrow \nu$ as $k \to \infty$, then $\nu$ must have characteristic function $\lim_k \varphi_{n_k}(t) = g(t)$. Thus $g$ is, in fact, a characteristic function, and the proof goes through as before. ∎

In this proof the continuity of $g$ was used to establish tightness. Hence if $\{\mu_n\}$ is assumed tight in the first place, the hypothesis of continuity can be suppressed:

Corollary 2. Suppose that $\lim_n \varphi_n(t) = g(t)$ exists for each $t$ and that $\{\mu_n\}$ is tight. Then there exists a $\mu$ such that $\mu_n \Rightarrow \mu$, and $\mu$ has characteristic function $g$.
This second corollary applies, for example, if the $\mu_n$ have a common bounded support.

Example 26.2. If $\mu_n$ is the uniform distribution over $(-n, n)$, its characteristic function is $(nt)^{-1}\sin tn$ for $t \neq 0$, and hence it converges to $I_{\{0\}}(t)$. In this case $\{\mu_n\}$ is not tight, the limit function is not continuous at 0, and $\mu_n$ does not converge weakly. ∎
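A positive illustration of Theorem 26.3 (a sketch, NumPy assumed): the characteristic functions of the binomial$(n, \lambda/n)$ distributions converge pointwise to $\exp(\lambda(e^{it} - 1))$, the characteristic function of the Poisson($\lambda$) distribution, matching the classical limit of Example 25.2 (see also Problem 26.20):

```python
import numpy as np

lam, t = 3.0, 1.7
phi_poisson = np.exp(lam * (np.exp(1j * t) - 1))   # Poisson(lam) cf
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    p = lam / n
    phi_binom = (1 - p + p * np.exp(1j * t)) ** n  # binomial(n, p) cf
# by n = 10^6 the pointwise agreement is very close
assert abs(phi_binom - phi_poisson) < 1e-4
```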
Fourier Series*

Let $\mu$ be a probability measure on $\mathscr{R}^1$ that is supported by $[0, 2\pi]$. Its Fourier coefficients are defined by

$$c_m = \int_0^{2\pi} e^{imx}\,\mu(dx), \qquad m = 0, \pm 1, \pm 2, \ldots. \tag{26.23}$$

These coefficients, the values of the characteristic function for integer arguments, suffice to determine $\mu$ except for the weights it may put at 0 and $2\pi$. The relation between $\mu$ and its Fourier coefficients can be expressed formally by

$$\mu(dx) = \frac{1}{2\pi} \sum_{l=-\infty}^{\infty} c_l e^{-ilx}\,dx: \tag{26.24}$$

if the $\mu(dx)$ in (26.23) is replaced by the right side of (26.24), and if the sum over $l$ is interchanged with the integral, the result is a formal identity.
To see how to recover $\mu$ from its Fourier coefficients, consider the symmetric partial sums $s_m(t) = (2\pi)^{-1}\sum_{l=-m}^{m} c_l e^{-ilt}$ and their Cesàro averages $\sigma_m(t) = m^{-1}\sum_{l=0}^{m-1} s_l(t)$. From the trigonometric identity [A24]

$$\sum_{l=0}^{m-1} \sum_{k=-l}^{l} e^{ikx} = \frac{\sin^2 \frac12 mx}{\sin^2 \frac12 x} \tag{26.25}$$

it follows that

$$\sigma_m(t) = \frac{1}{2\pi m} \int_0^{2\pi} \frac{\sin^2 \frac12 m(x-t)}{\sin^2 \frac12 (x-t)}\,\mu(dx). \tag{26.26}$$
If $\mu$ is $(2\pi)^{-1}$ times Lebesgue measure confined to $[0, 2\pi]$, then $c_0 = 1$ and $c_m = 0$ for $m \neq 0$, so that $\sigma_m(t) = s_m(t) \equiv (2\pi)^{-1}$; this gives the identity

$$\frac{1}{2\pi m} \int_{-\pi}^{\pi} \frac{\sin^2 \frac12 ms}{\sin^2 \frac12 s}\,ds = 1. \tag{26.27}$$
Suppose that $0 < a < b < 2\pi$, and integrate (26.26) over $(a, b)$. Fubini's theorem (the integrand is nonnegative) and a change of variable lead to

$$\int_a^b \sigma_m(t)\,dt = \int_0^{2\pi} \left[ \frac{1}{2\pi m} \int_{a-x}^{b-x} \frac{\sin^2 \frac12 ms}{\sin^2 \frac12 s}\,ds \right] \mu(dx). \tag{26.28}$$

The denominator in (26.27) is bounded away from 0 outside $(-\delta, \delta)$ for fixed $\delta$ $(0 < \delta < \pi)$, and so the numerator factor $1/2\pi m$ drives the integral over that region to 0 as $m \to \infty$. Therefore, the expression in brackets in (26.28) goes to 0 if $0 < x < a$ or $b < x < 2\pi$, and it goes to 1 if $a < x < b$; and because of (26.27), it is bounded by 1. It follows by the bounded convergence theorem that

$$\lim_m \int_a^b \sigma_m(t)\,dt = \mu(a, b) \tag{26.29}$$

if $\mu\{a\} = \mu\{b\} = 0$ and $0 < a < b < 2\pi$. This is the analogue of (26.16).

If $\mu$ and $\nu$ have the same Fourier coefficients, it follows from (26.29) that $\mu(A) = \nu(A)$ for $A \subset (0, 2\pi)$ and hence that $\mu\{0, 2\pi\} = \nu\{0, 2\pi\}$. It is clear from periodicity that the coefficients (26.23) are unchanged if $\mu\{0\}$ and $\mu\{2\pi\}$ are altered but $\mu\{0\} + \mu\{2\pi\}$ is held constant.

Suppose that $\mu_n$ is supported by $[0, 2\pi]$ and has coefficients $c_m^{(n)}$, and suppose that $\lim_n c_m^{(n)} = c_m$ for all $m$. Since $\{\mu_n\}$ is tight, $\mu_n \Rightarrow \mu$ will hold if $\mu_{n_k} \Rightarrow \nu$ $(k \to \infty)$ implies $\nu = \mu$. But in this case $\nu$ and $\mu$ have the same coefficients $c_m$, and hence they are identical except perhaps in the way they split the mass $\nu\{0, 2\pi\} = \mu\{0, 2\pi\}$ between the points 0 and $2\pi$. But this poses no problem if $\mu\{0, 2\pi\} = 0$: If $\lim_n c_m^{(n)} = c_m$ for all $m$ and $\mu\{0\} = \mu\{2\pi\} = 0$, then $\mu_n \Rightarrow \mu$.
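As a concrete sketch of (26.23) and (26.24) (NumPy assumed; the density below is an arbitrary example, not from the text), take $\mu$ with density $f(x) = (1 + \cos x)/2\pi$ on $[0, 2\pi]$; its only nonzero Fourier coefficients are $c_0 = 1$ and $c_{\pm 1} = \frac12$, and the finite sum $(2\pi)^{-1}\sum_m c_m e^{-imx}$ returns $f$:

```python
import numpy as np

m = 200_000
x = 2 * np.pi * (np.arange(m) + 0.5) / m   # midpoint grid on [0, 2*pi]
f = (1 + np.cos(x)) / (2 * np.pi)          # a density supported on [0, 2*pi]
w = 2 * np.pi / m                          # quadrature weight
c = {k: np.sum(np.exp(1j * k * x) * f) * w for k in range(-3, 4)}  # (26.23)
assert abs(c[0] - 1.0) < 1e-8 and abs(c[1] - 0.5) < 1e-8 and abs(c[2]) < 1e-8
recon = sum(c[k] * np.exp(-1j * k * x) for k in range(-3, 4)) / (2 * np.pi)
assert np.max(np.abs(recon.real - f)) < 1e-7   # (26.24) recovers the density
```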
Example 26.3. If $\mu$ is $(2\pi)^{-1}$ times Lebesgue measure confined to the interval $[0, 2\pi]$, the condition is that $\lim_n c_m^{(n)} = 0$ for $m \neq 0$. Let $x_1, x_2, \ldots$ be a sequence of reals, and let $\mu_n$ put mass $n^{-1}$ at each point $2\pi\{x_k\}$, $1 \le k \le n$, where $\{x_k\} = x_k - \lfloor x_k \rfloor$ denotes fractional part. This is the probability measure (25.3) rescaled to $[0, 2\pi]$. The sequence $x_1, x_2, \ldots$ is uniformly distributed modulo 1 if

$$\frac{1}{n} \sum_{k=1}^{n} e^{2\pi i \{x_k\} m} = \frac{1}{n} \sum_{k=1}^{n} e^{2\pi i x_k m} \to 0$$

for $m \neq 0$. If $x_k = k\theta$, where $\theta$ is irrational, then $\exp(2\pi i \theta m) \neq 1$ for $m \neq 0$ and hence

$$\left| \frac{1}{n} \sum_{k=1}^{n} e^{2\pi i k\theta m} \right| = \frac{1}{n} \left| e^{2\pi i \theta m}\,\frac{1 - e^{2\pi i n\theta m}}{1 - e^{2\pi i \theta m}} \right| \to 0.$$

Thus $\theta, 2\theta, 3\theta, \ldots$ is uniformly distributed modulo 1 if $\theta$ is irrational, which gives another proof of Theorem 25.1. ∎
PROBLEMS

26.1. A random variable $X$ has a lattice distribution if for some $a$ and $b$, $b > 0$, the lattice $\{a + nb\colon n = 0, \pm 1, \ldots\}$ supports the distribution of $X$. Let $X$ have characteristic function $\varphi$.
(a) Show that a necessary condition for $X$ to have a lattice distribution is that $|\varphi(t)| = 1$ for some $t \neq 0$.
(b) Show that the condition is sufficient as well.
(c) Suppose that $|\varphi(t)| = |\varphi(t')| = 1$ for incommensurable $t$ and $t'$ ($t \neq 0$, $t' \neq 0$, $t/t'$ irrational). Show that $P[X = c] = 1$ for some constant $c$.
26.2. If $\mu(-\infty, x] = \mu[-x, \infty)$ for all $x$ (which implies that $\mu(A) = \mu(-A)$ for all $A \in \mathscr{R}^1$), then $\mu$ is symmetric. Show that this holds if and only if the characteristic function is real.

26.3. Consider functions $\varphi$ that are real and nonnegative and satisfy $\varphi(-t) = \varphi(t)$ and $\varphi(0) = 1$.
(a) Suppose that $d_1, d_2, \ldots$ are positive and $\sum_{k=1}^{\infty} d_k = \infty$, that $s_1 \ge s_2 \ge \cdots \ge 0$ and $\lim_k s_k = 0$, and that $\sum_{k=1}^{\infty} s_k d_k = 1$. Let $\varphi$ be the convex polygon whose successive sides have slopes $-s_1, -s_2, \ldots$ and lengths $d_1, d_2, \ldots$ when projected on the horizontal axis: $\varphi$ has value $1 - \sum_{i=1}^{k} s_i d_i$ at $t_k = d_1 + \cdots + d_k$. If $s_n = 0$, there are in effect only $n$ sides. Let $\varphi_0(t) = (1 - |t|)I_{(-1,1)}(t)$ be the characteristic function in the last line in the table on p. 358, and show that $\varphi(t)$ is a convex combination of the characteristic functions $\varphi_0(t/t_k)$ and hence is itself a characteristic function.
(b) Construct a continuous density for which the characteristic function is not integrable. In such a case the density cannot be recovered from (26.20).
(c) Pólya's criterion. Suppose that $\varphi$ is continuous and convex on $[0, \infty)$. Show that $\varphi$ is a characteristic function.

26.4. ↑ Let $\varphi_1$ and $\varphi_2$ be characteristic functions, and show that the set $A = [t\colon \varphi_1(t) = \varphi_2(t)]$ is closed, contains 0, and is symmetric about 0. Show that every set with these three properties can be such an $A$. What does this say about the uniqueness theorem?

26.5. Show by Theorem 26.1 and integration by parts that if $\mu$ has a density $f$ with integrable derivative $f'$, then $\varphi(t) = o(t^{-1})$ as $|t| \to \infty$. Extend to higher derivatives.

26.6. Show for independent random variables uniformly distributed over $(-1, +1)$ that $X_1 + \cdots + X_n$ has density $\pi^{-1}\int_0^\infty ((\sin t)/t)^n \cos tx\,dt$ for $n \ge 2$.

26.7. 21.3 ↑ Uniqueness theorem for moment generating functions. Suppose that $F$ has a moment generating function in $(-s_0, s_0)$, $s_0 > 0$. From the fact that $\int_{-\infty}^{\infty} e^{zx}\,dF(x)$ is analytic in the strip $-s_0 < \operatorname{Re} z < s_0$, prove that the moment generating function determines $F$. Show that it is enough that the moment generating function exist in $[0, s_0)$, $s_0 > 0$.

26.8. 21.35 26.7 ↑ Show that the gamma density (20.47) has characteristic function

$$\left( 1 - \frac{it}{\alpha} \right)^{-u} = \exp\left[ -u \log\left( 1 - \frac{it}{\alpha} \right) \right],$$
where the logarithm is the principal part. Show that $\int_0^\infty e^{zx} f(x; \alpha, u)\,dx$ is analytic for $\operatorname{Re} z < \alpha$.

26.9. Use characteristic functions for a simple proof that the family of Cauchy distributions defined by (20.45) is closed under convolution; compare the argument in Problem 20.20(a). Do the same for the normal distribution (compare Example 20.6) and for the Poisson and gamma distributions.

26.10. Suppose that $F_n \Rightarrow F$ and that the characteristic functions are dominated by an integrable function. Show that $F$ has a density that is the limit of the densities of the $F_n$.

26.11. Show for all $a$ and $b$ that the right side of (26.16) is $\mu(a, b) + \frac12\mu\{a\} + \frac12\mu\{b\}$.

26.12. By the kind of argument leading to (26.16), show that

$$\mu\{a\} = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} e^{-ita}\varphi(t)\,dt. \tag{26.30}$$
26.13. ↑ Let $x_1, x_2, \ldots$ be the points of positive $\mu$-measure. By the following steps prove that

$$\lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} |\varphi(t)|^2\,dt = \sum_k \left( \mu\{x_k\} \right)^2. \tag{26.31}$$

Let $X$ and $Y$ be independent and have characteristic function $\varphi$.
(a) Show by (26.30) that the left side of (26.31) is $P[X - Y = 0]$.
(b) Show (Theorem 20.3) that $P[X - Y = 0] = \int_{-\infty}^{\infty} P[X = y]\,\mu(dy) = \sum_k (\mu\{x_k\})^2$.

26.14. ↑ Show that $\mu$ has no point masses if $\varphi^2(t)$ is integrable.

26.15. (a) Show that if $\{\mu_n\}$ is tight, then the characteristic functions $\varphi_n(t)$ are uniformly equicontinuous (for each $\epsilon$ there is a $\delta$ such that $|s - t| < \delta$ implies that $|\varphi_n(s) - \varphi_n(t)| < \epsilon$ for all $n$).
(b) Show that $\mu_n \Rightarrow \mu$ implies that $\varphi_n(t) \to \varphi(t)$ uniformly on bounded sets.
(c) Show that the convergence in part (b) need not be uniform over the entire line.

26.16. 14.9 26.15 ↑ For distribution functions $F$ and $G$, define $d'(F, G) = \sup_t |\varphi(t) - \psi(t)|/(1 + |t|)$, where $\varphi$ and $\psi$ are the corresponding characteristic functions. Show that this is a metric and equivalent to the Lévy metric.

26.17. 25.17 ↑ A real function $f$ has mean value
$$M(f(x)) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f(x)\,dx, \tag{26.32}$$

provided that $f$ is integrable over each $[-T, T]$ and the limit exists.
(a) Show that, if $f$ is bounded and $e^{itf(x)}$ has a mean value for each $t$, then $f$ has a distribution in the sense of (25.23).
(b) Show that

$$M(e^{itx}) = \begin{cases} 1 & \text{if } t = 0, \\ 0 & \text{if } t \neq 0. \end{cases} \tag{26.33}$$
Of course, $f(x) = x$ has no distribution.

26.18. Suppose that $X$ is irrational with probability 1. Let $\mu_n$ be the distribution of the fractional part $\{nX\}$. Use the continuity theorem and Theorem 25.1 to show that $n^{-1}\sum_{k=1}^{n} \mu_k$ converges weakly to the uniform distribution on $[0, 1]$.

26.19. 25.13 ↑ The uniqueness theorem for characteristic functions can be derived from the Weierstrass approximation theorem. Fill in the details of the following argument. Let $\mu$ and $\nu$ be probability measures on the line. For continuous $f$ with bounded support choose $a$ so that $\mu(-a, a)$ and $\nu(-a, a)$ are nearly 1 and $f$ vanishes outside $(-a, a)$. Let $g$ be periodic and agree with $f$ in $(-a, a)$, and by the Weierstrass theorem uniformly approximate $g(x)$ by a trigonometric sum $p(x) = \sum_{k=1}^{m} a_k e^{it_k x}$. If $\mu$ and $\nu$ have the same characteristic function, then

$$\int f\,d\mu \approx \int g\,d\mu \approx \int p\,d\mu = \int p\,d\nu \approx \int g\,d\nu \approx \int f\,d\nu.$$

26.20.
Use the continuity theorem to prove the result in Example 25.2 concerning the convergence of the binomial distribution to the Poisson.

26.21. According to Example 25.8, if $X_n \Rightarrow X$, $a_n \to a$, and $b_n \to b$, then $a_nX_n + b_n \Rightarrow aX + b$. Prove this by means of characteristic functions.

26.22. Infinite convolutions. Write $F = F_1 * F_2 * \cdots$ if $F_1 * \cdots * F_n \Rightarrow F$. Show that this holds if and only if the corresponding characteristic functions satisfy $\varphi(t) = \prod_{n=1}^{\infty} \varphi_n(t)$ in the usual sense of infinite products.

26.23. 26.1 26.15 ↑ According to Theorem 14.2, if $X_n \Rightarrow X$ and $a_nX_n + b_n \Rightarrow Y$, where $a_n > 0$ and the distributions of $X$ and $Y$ are nondegenerate, then $a_n \to a > 0$, $b_n \to b$, and $aX + b$ and $Y$ have the same distribution. Prove this by characteristic functions. Let $\varphi_n, \varphi, \psi$ be the characteristic functions of $X_n, X, Y$.
(a) Show that $|\varphi_n(a_nt)| \to |\psi(t)|$ uniformly on bounded sets and hence that $a_n$ cannot converge to 0 along a subsequence.
(b) Interchange the roles of $\varphi$ and $\psi$ and show that $a_n$ cannot converge to infinity along a subsequence.
(c) Show that $a_n$ converges to some $a > 0$.
(d) Show that $e^{itb_n} \to \psi(t)/\varphi(at)$ in a neighborhood of 0 and hence that $\int_0^t e^{isb_n}\,ds \to \int_0^t (\psi(s)/\varphi(as))\,ds$. Conclude that $b_n$ converges.

26.24. Prove a continuity theorem for moment generating functions as defined by (22.4) for probability measures on $[0, \infty)$. For uniqueness, see Theorem 22.2; the analogue of (26.22) is

26.25. 26.4 ↑ Show by example that the values $\varphi(m)$ of the characteristic function at integer arguments may not determine the distribution if it is not supported by $[0, 2\pi]$.

26.26. The Fourier-series analogue of the condition (26.19) is $\sum_m |c_m| < \infty$. Show that it implies $\mu$ has density $f(x) = (2\pi)^{-1}\sum_m c_m e^{-imx}$ on $[0, 2\pi]$, where $f$ is continuous and $f(0) = f(2\pi)$. This is the analogue of the inversion formula (26.20).

26.27. ↑ Show that

$$\sum_{m=1}^{\infty} \frac{\cos mx}{m^2} = \frac{(x - \pi)^2}{4} - \frac{\pi^2}{12},$$
$0 < x < 2\pi$. Show that $\pi^2/6 = \sum_{m=1}^{\infty} m^{-2}$; see Problem 18.20(b).

26.28. Suppose $X'$ and $X''$ are independent random variables with values in $[0, 2\pi]$, and let $X$ be $X' + X''$ reduced modulo $2\pi$. Show that the corresponding Fourier coefficients satisfy $c_m = c_m' c_m''$.

SECTION 27. THE CENTRAL LIMIT THEOREM

Identically Distributed Summands
The central limit theorem says roughly that the sum of many independent random variables will be approximately normally distributed if each summand has high probability of being small. Theorem 27.1, the Lindeberg-Lévy theorem, will give an idea of the techniques and hypotheses needed for the more general results that follow. Throughout, $N$ will denote a random variable with the standard normal distribution:

$$P[N \le x] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du. \tag{27.1}$$

Theorem 27.1. Suppose that $\{X_n\}$ is an independent sequence of random variables having the same distribution with mean $c$ and finite positive variance $\sigma^2$. If $S_n = X_1 + \cdots + X_n$, then

$$\frac{S_n - nc}{\sigma\sqrt{n}} \Rightarrow N. \tag{27.2}$$
By the argument in Example 25.7, (27.2) implies that $n^{-1}S_n \Rightarrow c$. The central limit theorem and the strong law of large numbers thus refine the weak law of large numbers in different directions.

PROOF. Replacing $X_n$ by $X_n - c$ reduces the theorem to the case in which $E[X_n] = 0$. Hence assume that $c = 0$. Let $\varphi$ be the characteristic function common to the $X_n$. By (26.12) and (26.13), $S_n/\sigma\sqrt{n}$ has characteristic function $\varphi^n(t/\sigma\sqrt{n})$. By the continuity theorem, it suffices to prove that this converges to the characteristic function $e^{-t^2/2}$ of $N$. By (26.11), $\varphi(t) = 1 - \frac12 t^2\sigma^2 + \beta(t)$, where $\beta(t)/t^2 \to 0$ as $t \to 0$. The idea is to show that

$$\varphi^n\!\left( \frac{t}{\sigma\sqrt{n}} \right) - \left( 1 - \frac{t^2}{2n} \right)^{n} \to 0.$$
This can be proved by using logarithms for complex arguments, but a simple device avoids them.

Lemma 1. Let $z_1, \ldots, z_m$ and $w_1, \ldots, w_m$ be complex numbers of modulus at most 1; then

$$|z_1 \cdots z_m - w_1 \cdots w_m| \le \sum_{k=1}^{m} |z_k - w_k|. \tag{27.3}$$

PROOF. This follows by induction from $z_1 \cdots z_m - w_1 \cdots w_m = (z_1 - w_1)(z_2 \cdots z_m) + w_1(z_2 \cdots z_m - w_2 \cdots w_m)$. ∎
(27 .4)
n cp
( alit ) - ( 1 - !}__2 n )
-+ 0
fixed t. In the lemma take m = n, z k = 1 - t 2j2n, and wk = in); (27.3) implies that (for n large enough that t 2j2n < 1) n t t (27 .5 ) cp 1 _ !}__ n cp n - 1 - 2n . s, 2n a li ali
for each
( ) ( _
)
!}__ ) ( ( )
cp(tja
This goes to 0 by (27.4), and Theorem 27.1 follows because $(1 - t^2/2n)^n \to e^{-t^2/2}$. ∎
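The mechanism of the proof can be seen numerically. In this sketch (NumPy assumed), $X_k$ is exponential(1) centered at its mean, so $\sigma = 1$ and the characteristic function of $X_k$ is $e^{-it}/(1 - it)$; raising $\varphi(t/\sqrt{n})$ to the $n$th power approaches $e^{-t^2/2}$:

```python
import numpy as np

t = 1.5
target = np.exp(-t ** 2 / 2)
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    s = t / np.sqrt(n)
    phi = np.exp(-1j * s) / (1 - 1j * s)   # cf of centered exponential(1)
    val = phi ** n                         # cf of S_n / sqrt(n)
assert abs(val - target) < 1e-3            # close by n = 10^6
```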
Example 27.1. In the classical De Moivre-Laplace theorem, $X_n$ takes the values 1 and 0 with probabilities $p$ and $q = 1 - p$, so that $c = p$ and $\sigma^2 = pq$. Here $S_n$ is the number of successes in $n$ Bernoulli trials, and

$$\frac{S_n - np}{\sqrt{npq}} \Rightarrow N. \qquad \blacksquare$$
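A Monte Carlo sketch of the De Moivre-Laplace theorem (NumPy assumed; the sample sizes are arbitrary): the standardized binomial count has mean near 0, variance near 1, and approximately normal tail probabilities:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
n, p = 2_000, 0.3
s = rng.binomial(n, p, size=100_000)          # 10^5 values of S_n
z = (s - n * p) / np.sqrt(n * p * (1 - p))    # standardized counts
assert abs(np.mean(z)) < 0.02
assert abs(np.var(z) - 1.0) < 0.02
Phi1 = 0.5 * (1 + erf(1 / sqrt(2)))           # Phi(1), about 0.8413
assert abs(np.mean(z <= 1.0) - Phi1) < 0.02
```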
Example 27.2. Suppose that one wants to estimate the parameter $\alpha$ of an exponential distribution (20.10) on the basis of an independent sample $X_1, \ldots, X_n$. As $n \to \infty$ the sample mean $\bar{X}_n = n^{-1}\sum_{k=1}^{n} X_k$ converges in probability to the mean $1/\alpha$ of the distribution, and hence it is natural to use $1/\bar{X}_n$ to estimate $\alpha$ itself. How good is the estimate? The variance of the exponential distribution being $1/\alpha^2$ (Example 21.3), $\alpha\sqrt{n}(\bar{X}_n - 1/\alpha) \Rightarrow N$ by the Lindeberg-Lévy theorem. Thus $\bar{X}_n$ is approximately normally distributed with mean $1/\alpha$ and standard deviation $1/\alpha\sqrt{n}$. By Skorohod's Theorem 25.6 there exist on a single probability space random variables $Y_n$ and $Y$ having the respective distributions of $\bar{X}_n$ and $N$ and satisfying $\alpha\sqrt{n}(Y_n(\omega) - 1/\alpha) \to Y(\omega)$ for each $\omega$. Now $Y_n(\omega) \to 1/\alpha$ and $\alpha^{-1}\sqrt{n}(Y_n^{-1}(\omega) - \alpha) = \alpha\sqrt{n}(\alpha^{-1} - Y_n(\omega))/\alpha Y_n(\omega) \to -Y(\omega)$. Since $Y$ has the distribution of $N$ and $Y_n$ has the distribution of $\bar{X}_n$, it follows that

$$\alpha^{-1}\sqrt{n}\left( \frac{1}{\bar{X}_n} - \alpha \right) \Rightarrow N;$$

thus $1/\bar{X}_n$ is approximately normally distributed with mean $\alpha$ and standard deviation $\alpha/\sqrt{n}$. In effect, $1/\bar{X}_n$ has been studied through the local linear approximation to the function $1/x$. This is called the delta method. ∎

The Lindeberg and Lyapounov Theorems
Suppose that for each $n$

$$X_{n1}, X_{n2}, \ldots, X_{nr_n} \tag{27.6}$$

are independent; the probability space for the sequence may change with $n$. Such a collection is called a triangular array of random variables. Put $S_n = X_{n1} + \cdots + X_{nr_n}$. Theorem 27.1 covers the special case in which $r_n = n$ and $X_{nk} = X_k$. Theorem 6.2 is the weak law of large numbers for triangular arrays, and Example 6.4 on the number of cycles in a random permutation shows its usefulness. The central limit theorem for triangular arrays will be applied below to the same example.
SECTION 27. THE CENTRAL LIMIT THEOREM
To establish the asymptotic normality of S_n by means of the ideas in the preceding proof requires expanding the characteristic function of each X_{nk} to second-order terms and estimating the remainder. Suppose that X_{nk} has mean 0 and variance σ_{nk}², and put

(27.7)    s_n² = Σ_{k=1}^{r_n} σ_{nk}².

The assumption that X_{nk} has mean 0 entails no loss of generality. A successful remainder estimate is possible under the assumption of the Lindeberg condition:

(27.8)    lim_n (1/s_n²) Σ_{k=1}^{r_n} ∫_{|x| ≥ ε s_n} x² dF_{nk}(x) = 0    for ε > 0,

where F_{nk} denotes the distribution function of X_{nk}.

Theorem 27.2. Suppose that for each n the sequence X_{n1}, …, X_{nr_n} is independent and satisfies (27.7). If (27.8) holds for all positive ε, then S_n/s_n ⇒ N.
This theorem contains the preceding one: Suppose that X_{nk} = X_k and r_n = n, where the entire sequence {X_k} is independent and the X_k all have the same distribution with mean 0 and variance σ². Then (27.8) reduces to

(27.9)    lim_n (1/σ²) E[X_1² I(|X_1| ≥ εσ√n)] = 0,

which holds because [|X_1| ≥ εσ√n] ↓ ∅ as n ↑ ∞; note that the expected value E[X_1²] is finite.

PROOF OF THE THEOREM. Replacing X_{nk} by X_{nk}/s_n shows that there is no loss of generality in assuming

(27.10)    s_n² = Σ_{k=1}^{r_n} σ_{nk}² = 1.

By (26.4₂), the characteristic function φ_{nk} of X_{nk} satisfies

(27.11)    |φ_{nk}(t) − (1 − ½t²σ_{nk}²)| ≤ E[min{|tX_{nk}|², |tX_{nk}|³}].
CONVERGENCE OF DISTRIBUTIONS
For positive ε the right side of (27.11) is at most

E[|tX_{nk}|³ I(|X_{nk}| < ε)] + E[|tX_{nk}|² I(|X_{nk}| ≥ ε)] ≤ ε|t|³ σ_{nk}² + t² ∫_{|x| ≥ ε} x² dF_{nk}(x).

Since the σ_{nk}² add to 1 and ε is arbitrary, it follows by the Lindeberg condition that

(27.12)    Σ_{k=1}^{r_n} |φ_{nk}(t) − (1 − ½t²σ_{nk}²)| → 0

for each fixed t. This is the analogue of (27.4), and the objective now is to show that

(27.13)    Π_{k=1}^{r_n} φ_{nk}(t) = Π_{k=1}^{r_n} (1 − ½t²σ_{nk}²) + o(1) = Π_{k=1}^{r_n} e^{−t²σ_{nk}²/2} + o(1) = e^{−t²/2} + o(1).

For positive ε,

σ_{nk}² ≤ ε² + ∫_{|x| ≥ ε} x² dF_{nk}(x),

and so it follows by the Lindeberg condition (recall that s_n is now 1) that

(27.14)    max_{k ≤ r_n} σ_{nk}² → 0.

For large enough n, the 1 − ½t²σ_{nk}² are all between 0 and 1, and by (27.3), Π_{k=1}^{r_n} φ_{nk}(t) and Π_{k=1}^{r_n} (1 − ½t²σ_{nk}²) differ by at most the sum in (27.12). This establishes the first of the asymptotic relations in (27.13). Now (27.3) also implies that

|Π_{k=1}^{r_n} (1 − ½t²σ_{nk}²) − Π_{k=1}^{r_n} e^{−t²σ_{nk}²/2}| ≤ Σ_{k=1}^{r_n} |e^{−t²σ_{nk}²/2} − 1 + ½t²σ_{nk}²|.
For real x with 0 ≤ x ≤ 1,

(27.15)    |e^{−x} − 1 + x| ≤ Σ_{v=2}^∞ x^v / v! ≤ x².

Using this with x = ½t²σ_{nk}² (which lies in [0, 1] for large n, by (27.14)) in the right member of the preceding inequality bounds it by (t⁴/4) Σ_{k=1}^{r_n} σ_{nk}⁴ ≤ (t⁴/4) max_k σ_{nk}² · Σ_k σ_{nk}²; by (27.14) and (27.10) this goes to 0, from which the second equality in (27.13) follows. ∎
Example 27.3. Goncharov's Theorem. Consider the sum S_n = Σ_{k=1}^n X_{nk} in Example 6.4. Here S_n is the number of cycles in a random permutation on n letters, the X_{nk} are independent, and P[X_{nk} = 1] = 1/(n − k + 1) = 1 − P[X_{nk} = 0]. The mean m_n is L_n = Σ_{k=1}^n k⁻¹, and the variance s_n² is L_n + O(1). Lindeberg's condition for X_{nk} − (n − k + 1)⁻¹ is easily verified because these random variables are bounded by 1. The theorem gives (S_n − L_n)/s_n ⇒ N. Now, in fact, L_n = log n + O(1), and so (see Example 25.8) the sum can be renormalized: (S_n − log n)/√(log n) ⇒ N. ∎

Suppose that the |X_{nk}|^{2+δ} are integrable for some positive δ and that Lyapounov's condition

(27.16)    lim_n (1/s_n^{2+δ}) Σ_{k=1}^{r_n} E[|X_{nk}|^{2+δ}] = 0

holds. Then Lindeberg's condition follows because the sum in (27.8) is bounded by

(1/s_n²) Σ_{k=1}^{r_n} E[|X_{nk}|² (|X_{nk}|/εs_n)^δ I(|X_{nk}| ≥ εs_n)] ≤ (1/ε^δ s_n^{2+δ}) Σ_{k=1}^{r_n} E[|X_{nk}|^{2+δ}].
Hence Theorem 27.2 has this corollary:
Theorem 27.3. Suppose that for each n the sequence X_{n1}, …, X_{nr_n} is independent and satisfies (27.7). If (27.16) holds for some positive δ, then S_n/s_n ⇒ N.
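Goncharov's theorem in Example 27.3 makes a concrete prediction: the cycle count of a random permutation on n letters has mean and variance close to L_n = Σ_{k=1}^n k⁻¹. This is easy to probe numerically; a sketch (sample sizes are arbitrary choices, standard library only):

```python
import random
import statistics

random.seed(11)
n = 500       # permutation size (arbitrary)
reps = 2000   # number of permutations sampled

def cycle_count(perm):
    """Count the cycles of a permutation given as a list of images."""
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return cycles

counts = []
for _ in range(reps):
    perm = list(range(n))
    random.shuffle(perm)  # uniform random permutation
    counts.append(cycle_count(perm))

L_n = sum(1.0 / k for k in range(1, n + 1))  # harmonic number, about log n
mean_count = statistics.mean(counts)
var_count = statistics.variance(counts)
```

The exact variance is L_n minus Σ k⁻², which is L_n + O(1), so both sample moments should land near L_n ≈ 6.79 for n = 500.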
Example 27.4. Suppose that X_1, X_2, … are independent and uniformly bounded and have mean 0. If the variance s_n² of S_n = X_1 + ⋯ + X_n goes to ∞, then S_n/s_n ⇒ N: if K bounds the X_n, then

(1/s_n³) Σ_{k=1}^n E[|X_k|³] ≤ (K/s_n³) Σ_{k=1}^n E[X_k²] = K/s_n → 0,

which is Lyapounov's condition for δ = 1. ∎
Example 27.5. Elements are drawn from a population of size n, randomly and with replacement, until the number of distinct elements that have been sampled is r_n, where 1 ≤ r_n ≤ n. Let S_n be the drawing on which this first happens. A coupon collector requires S_n purchases to fill out a given portion of the complete set. Suppose that r_n varies with n in such a way that r_n/n → ρ, 0 < ρ < 1. What is the approximate distribution of S_n?

Let Y_p be the trial on which success first occurs in a Bernoulli sequence with probability p for success: P[Y_p = k] = q^{k−1}p, where q = 1 − p. Since the moment generating function is pe^s/(1 − qe^s), E[Y_p] = p⁻¹ and Var[Y_p] = qp⁻². If k − 1 distinct items have thus far entered the sample, the waiting time until the next distinct one enters is distributed as Y_p for p = (n − k + 1)/n. Therefore, S_n can be represented as Σ_{k=1}^{r_n} X_{nk} for independent summands X_{nk} distributed as Y_{(n−k+1)/n}. Since r_n ∼ ρn, the mean and variance above give

m_n = E[S_n] = Σ_{k=1}^{r_n} (1 − (k−1)/n)⁻¹ ∼ n ∫_0^ρ dx/(1 − x)

and

s_n² = Σ_{k=1}^{r_n} ((k−1)/n)(1 − (k−1)/n)⁻² ∼ n ∫_0^ρ x dx/(1 − x)².

Lyapounov's theorem applies for δ = 2, and to check (27.16) requires the inequality

(27.17)    E[(Y_p − p⁻¹)⁴] ≤ Kp⁻⁴

for some K independent of p. (A calculation with the moment generating function shows that the left side is qp⁻⁴(1 + 7q + q²), but the estimate will suffice.) Since (Y_p − p⁻¹)⁴ ≤ 8Y_p⁴ + 8p⁻⁴ by Hölder's inequality, it is enough to estimate

(27.18)    E[(pY_p)⁴] = ∫_0^∞ 4s³ P[Y_p > s/p] ds.

If s > 4p, then P[Y_p > s/p] = q^{[s/p]} ≤ q^{s/2p}; since p = 1 − q ≤ −log q, q^{s/2p} ≤ e^{−s/2}. This and (27.18) give (27.17).
It now follows that

Σ_{k=1}^{r_n} E[(X_{nk} − n/(n − k + 1))⁴] ≤ K Σ_{k=1}^{r_n} (1 − (k−1)/n)⁻⁴ ∼ Kn ∫_0^ρ dx/(1 − x)⁴.

Since s_n⁴ is of order n², (27.16) follows for δ = 2, and Theorem 27.3 applies: (S_n − m_n)/s_n ⇒ N. ∎
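The coupon-collector approximation in Example 27.5 can likewise be explored numerically: draw with replacement from n items until r_n = ρn distinct ones have appeared, and compare the mean stopping time with the exact sum of geometric means and its integral approximation −n log(1 − ρ). A sketch (parameter values are arbitrary):

```python
import math
import random
import statistics

random.seed(3)
n = 300                # population size (arbitrary)
rho = 0.5
r_n = int(rho * n)     # number of distinct elements to collect
reps = 1000

draws_needed = []
for _ in range(reps):
    seen = set()
    draws = 0
    while len(seen) < r_n:
        draws += 1
        seen.add(random.randrange(n))  # one purchase, uniform over the n items
    draws_needed.append(draws)

# Exact mean: sum of geometric means n/(n - k + 1) for k = 1..r_n;
# it is asymptotic to -n*log(1 - rho), the integral in the text.
exact_mean = sum(n / (n - k + 1) for k in range(1, r_n + 1))
approx_mean = -n * math.log(1.0 - rho)
sim_mean = statistics.mean(draws_needed)
```

For n = 300 and ρ = 1/2 all three quantities come out near 208, with the integral approximation off by well under one percent.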
Feller's Theorem*

Consider a triangular array of normally distributed random variables; in this case of course S_n/s_n ⇒ N. Suppose further that r_n = n and X_{nk} = X_k, and that σ_k² = E[X_k²] is defined by σ_1² = 1 and the recursion σ_n² = n s_{n−1}², so that s_n² ∼ σ_n². The sum in (27.8) is then at least s_n⁻² E[X_n² I(|X_n| ≥ εs_n)], and since X_n is normal with variance σ_n² ∼ s_n², this term does not go to 0. Thus the Lindeberg condition is not necessary for the central limit theorem. (For another example, take r_n = 1 and X_{n1} normal; but here the array does not arise from truncating a single infinite sequence.)

In this example, one of the summands dominates the others. If this sort of phenomenon is excluded, the Lindeberg condition becomes necessary as well as sufficient. Suppose that

(27.19)    max_{k ≤ r_n} σ_{nk}²/s_n² → 0.

This is, in fact, a consequence of the Lindeberg condition (see (27.14), where s_n² is assumed 1). If E[X_{nk}] = 0, (27.19) and Chebyshev's inequality imply that

(27.20)    max_{k ≤ r_n} P[|X_{nk}| ≥ εs_n] → 0

for positive ε. This condition ensures that no one summand contributes much to S_n; if it holds, the random variables X_{nk}/s_n form an infinitesimal

* This topic may be omitted.
array, or are uniformly asymptotically negligible. Feller's theorem is essentially the converse of Lindeberg's theorem:

Theorem 27.4. Suppose that for each n the sequence X_{n1}, …, X_{nr_n} is independent and satisfies (27.7). If S_n/s_n ⇒ N and (27.20) holds, then the Lindeberg condition (27.8) is satisfied.

Since (27.8) implies (27.19), which in turn implies (27.20), Theorems 27.2 and 27.4 together show that (27.20) and the condition S_n/s_n ⇒ N are jointly equivalent to the Lindeberg condition.

PROOF OF THE THEOREM.
As before, it is no restriction to assume that s_n² = 1. By (26.4₀), |φ_{nk}(t) − 1| ≤ 2E[min{1, |tX_{nk}|}]. Splitting the expected value according as |X_{nk}| exceeds ε or not shows that |φ_{nk}(t) − 1| ≤ 2P[|X_{nk}| ≥ ε] + 2ε|t|. By (27.20), max_{k ≤ r_n} |φ_{nk}(t) − 1| → 0. By (26.4₁), |φ_{nk}(t) − 1| ≤ t²σ_{nk}². Since the σ_{nk}² add to 1 and Σ_k |φ_{nk}(t) − 1|² ≤ max_k |φ_{nk}(t) − 1| · Σ_k |φ_{nk}(t) − 1|, it follows that

(27.21)    Σ_{k=1}^{r_n} |φ_{nk}(t) − 1|² → 0

for each t. If |z| ≤ 1, then |e^{z−1}| = e^{Re z − 1} ≤ 1. Hence (27.3) applies to z_k = e^{φ_{nk}(t)−1} and w_k = φ_{nk}(t):

(27.22)    |Π_{k=1}^{r_n} e^{φ_{nk}(t)−1} − Π_{k=1}^{r_n} φ_{nk}(t)| ≤ Σ_{k=1}^{r_n} |e^{φ_{nk}(t)−1} − φ_{nk}(t)|.

If |z| ≤ ½, then |e^z − 1 − z| ≤ |z|²; for large n, |φ_{nk}(t) − 1| ≤ ½, and the right side of (27.22) is at most Σ_k |φ_{nk}(t) − 1|². Since S_n ⇒ N by hypothesis, Π_k φ_{nk}(t) → e^{−t²/2}. By (27.21) and (27.22),

exp Σ_{k=1}^{r_n} (φ_{nk}(t) − 1) → e^{−t²/2}.

Since |e^z| = e^{Re z}, taking absolute values and then logarithms leads to

(27.23)    Σ_{k=1}^{r_n} Re(φ_{nk}(t) − 1) → −t²/2.

The real part here is E[cos tX_{nk} − 1], and so (27.23) can be written

(27.24)    Σ_{k=1}^{r_n} E[cos tX_{nk} − 1 + ½t²X_{nk}²] → 0.
Since cos x ≥ 1 − ½x², the integrand is nonnegative, and the sum is not less than

Σ_{k=1}^{r_n} ∫_{|x| ≥ ε} (cos tx − 1 + ½t²x²) dF_{nk}(x) ≥ (½t² − 2/ε²) Σ_{k=1}^{r_n} ∫_{|x| ≥ ε} x² dF_{nk}(x).

Given ε, choose t so large that the first factor on the right is positive. By (27.24), the left side goes to 0, and hence so does the sum on the right. ∎
Dependent Variables*

The assumption of independence in the preceding theorems can be relaxed in various ways. Here a central limit theorem will be proved for sequences in which random variables far apart from one another are nearly independent in a sense to be defined.

For a sequence X_1, X_2, … of random variables, let α_n be a number such that

(27.25)    |P(A ∩ B) − P(A)P(B)| ≤ α_n

for A ∈ σ(X_1, …, X_k), B ∈ σ(X_{k+n}, X_{k+n+1}, …), and k ≥ 1, n ≥ 1. Suppose that α_n → 0, the idea being that X_k and X_{k+n} are then approximately independent for large n. In this case the sequence {X_n} is said to be α-mixing. If the distribution of the random vector (X_n, X_{n+1}, …, X_{n+j}) does not depend on n, the sequence is said to be stationary.
•
'
,·
,
•
•
•
•
2
'•
•
•
'• '2
,
·
·
•
•
•
•
•
•
•
•
•
* This topic may be omitted.
•
•
•
•
•
•
The sequence is m-dependent if (X_1, …, X_k) and (X_{k+n}, …, X_{k+n+l}) are independent whenever n > m. In this case the sequence is α-mixing with α_n = 0 for n > m. In this terminology an independent sequence is 0-dependent.

Example 27.7. Let Y_1, Y_2, … be independent and identically distributed, and put X_n = f(Y_n, …, Y_{n+m}) for a real function f on R^{m+1}. Then {X_n} is stationary and m-dependent. ∎
Theorem 27.5. Suppose that X_1, X_2, … is stationary and α-mixing with α_n = O(n⁻⁵), and that E[X_n] = 0 and E[X_n^{12}] < ∞. If S_n = X_1 + ⋯ + X_n, then

(27.26)    n⁻¹ Var[S_n] → σ² = E[X_1²] + 2 Σ_{k=1}^∞ E[X_1 X_{1+k}],

where the series converges absolutely. If σ > 0, then S_n/σ√n ⇒ N.

The conditions α_n = O(n⁻⁵) and E[X_n^{12}] < ∞ are stronger than necessary; they are imposed to avoid technical complications in the proof. The idea of the proof, which goes back to Markov, is this: Split the sum X_1 + ⋯ + X_n into alternate blocks of length b_n (the big blocks) and l_n (the little blocks). Namely, let

(27.27)    U_{ni} = X_{(i−1)(b_n+l_n)+1} + ⋯ + X_{(i−1)(b_n+l_n)+b_n},    1 ≤ i ≤ r_n,

where r_n is the largest integer i for which (i − 1)(b_n + l_n) + b_n ≤ n. Further, let

(27.28)    V_{ni} = X_{(i−1)(b_n+l_n)+b_n+1} + ⋯ + X_{i(b_n+l_n)},    1 ≤ i < r_n,
           V_{n r_n} = X_{(r_n−1)(b_n+l_n)+b_n+1} + ⋯ + X_n.

Then S_n = Σ_{i=1}^{r_n} U_{ni} + Σ_{i=1}^{r_n} V_{ni}, and the technique will be to choose the l_n small enough that Σ_i V_{ni} is small in comparison with Σ_i U_{ni} but large enough that the U_{ni} are nearly independent, so that Lyapounov's theorem can be adapted to prove Σ_i U_{ni} asymptotically normal.
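The index bookkeeping in (27.27) and (27.28) is easy to get wrong; the following sketch (function and variable names are ours, not the text's) builds the big and little index blocks for given n, b_n, l_n and checks that together they tile {1, …, n}, so that S_n = Σ_i U_{ni} + Σ_i V_{ni}.

```python
def block_indices(n, b, l):
    """Return (big, little): index ranges implementing (27.27) and (27.28).

    big[i-1] covers the i-th big block of length b; little[i-1] covers the
    following little block of length l, except that the last little block
    runs all the way through index n. Assumes b <= n.
    """
    big, little = [], []
    i = 1
    while (i - 1) * (b + l) + b <= n:   # r_n = largest such i
        start = (i - 1) * (b + l)
        big.append(range(start + 1, start + b + 1))            # U_{n,i}
        little.append(range(start + b + 1, i * (b + l) + 1))   # V_{n,i}
        i += 1
    if big:
        # V_{n,r_n} is special: it extends (possibly empty, possibly longer) to n.
        start_last = (len(big) - 1) * (b + l) + b
        little[-1] = range(start_last + 1, n + 1)
    return big, little

big, little = block_indices(100, 7, 3)
covered = sorted(j for r in big + little for j in r)
r_n = len(big)
```

Here every index 1 through n is covered exactly once, with r_n = 10 big blocks for n = 100, b = 7, l = 3.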
Lemma 2. If Y is measurable σ(X_1, …, X_k) and bounded by C, and if Z is measurable σ(X_{k+n}, X_{k+n+1}, …) and bounded by D, then

(27.29)    |E[YZ] − E[Y]E[Z]| ≤ 4CDα_n.
PROOF. It is no restriction to take C = D = 1 and (by the usual approximation method) to take Y = Σ_i y_i I_{A_i} and Z = Σ_j z_j I_{B_j} simple (|y_i|, |z_j| ≤ 1). If δ_{ij} = P(A_i ∩ B_j) − P(A_i)P(B_j), the left side of (27.29) is |Σ_i y_i Σ_j z_j δ_{ij}|; if ξ_i is +1 or −1 as Σ_j z_j δ_{ij} is positive or not, this is at most Σ_{ij} ξ_i z_j δ_{ij}. This last sum, in turn, is at most Σ_{ij} ξ_i η_j δ_{ij}, where η_j = ±1. If A^{(0)} is the union of the A_i for which ξ_i = +1 and A^{(1)} = Ω − A^{(0)}, and if B^{(0)} and B^{(1)} are defined similarly in terms of the η_j, then Σ_{ij} ξ_i η_j δ_{ij} ≤ Σ_{u,v} |P(A^{(u)} ∩ B^{(v)}) − P(A^{(u)})P(B^{(v)})| ≤ 4α_n. ∎
Lemma 3. If Y is measurable σ(X_1, …, X_k) and E[Y⁴] ≤ C, and if Z is measurable σ(X_{k+n}, X_{k+n+1}, …) and E[Z⁴] ≤ D, then

(27.30)    |E[YZ] − E[Y]E[Z]| ≤ 8(1 + C + D) α_n^{1/2}.
PROOF. Let Y_0 = Y I_{[|Y| ≤ a]}, Y_1 = Y I_{[|Y| > a]}, Z_0 = Z I_{[|Z| ≤ a]}, Z_1 = Z I_{[|Z| > a]}. By Lemma 2,

|E[Y_0 Z_0] − E[Y_0]E[Z_0]| ≤ 4a²α_n.

Further,

|E[Y_0 Z_1] − E[Y_0]E[Z_1]| ≤ E[|Y_0 − E[Y_0]| · |Z_1 − E[Z_1]|] ≤ 2a · 2E[|Z_1|] ≤ 4aE[|Z_1| · |Z_1/a|³] ≤ 4D/a².

Similarly, |E[Y_1 Z_0] − E[Y_1]E[Z_0]| ≤ 4C/a². Finally,

|E[Y_1 Z_1] − E[Y_1]E[Z_1]| ≤ Var^{1/2}[Y_1] Var^{1/2}[Z_1] ≤ E^{1/2}[Y_1²] E^{1/2}[Z_1²] ≤ C^{1/2}D^{1/2}/a².

Adding these inequalities gives 4a²α_n + 4(C + D)a⁻² + C^{1/2}D^{1/2}a⁻² as a bound for the left side of (27.30). Take a = α_n^{−1/4} and observe that 4 + 4(C + D) + C^{1/2}D^{1/2} ≤ 4 + 4(C^{1/2} + D^{1/2})² ≤ 4 + 8(C + D). ∎
PROOF OF THEOREM 27.5. By Lemma 3, |E[X_1 X_{1+k}]| ≤ 8(1 + 2E[X_1⁴]) α_k^{1/2} = O(k^{−5/2}), and so the series in (27.26) converges absolutely. If ρ_k = E[X_1 X_{1+k}], then by stationarity E[S_n²] = nρ_0 + 2 Σ_{k=1}^{n−1} (n − k)ρ_k, and therefore

|σ² − n⁻¹E[S_n²]| ≤ 2 Σ_{k ≥ n} |ρ_k| + 2n⁻¹ Σ_{k=1}^{n−1} k|ρ_k| → 0;

hence (27.26). By stationarity again,

E[S_n⁴] ≤ 4! n Σ |E[X_1 X_{1+i} X_{1+i+j} X_{1+i+j+k}]|,

where the indices in the sum are constrained by i, j, k ≥ 0 and i + j + k < n. By Lemma 3 (split the product at the first gap, where the leading factor X_1 has mean 0), the summand is at most K_1 α_i^{1/2}, where K_1 depends only on moments of X_1 (by Hölder's inequality the fourth moments of the three-fold products are bounded in terms of E[X_1^{12}], which is where the twelfth moment enters); similarly, K_1 α_k^{1/2} is a bound. Hence

E[S_n⁴] ≤ 4! n² Σ_{i,k ≥ 0, i+k < n} K_1 min{α_i^{1/2}, α_k^{1/2}} ≤ K_2 n² Σ_{0 ≤ i ≤ k} α_k^{1/2} = K_2 n² Σ_{k=0}^∞ (k + 1) α_k^{1/2}.

Since α_k = O(k⁻⁵), the series here converges, and therefore

(27.31)    E[S_n⁴] ≤ Kn²

for some K independent of n. Let b_n = [n^{3/4}] and l_n = [n^{1/4}]. If r_n is the largest integer i such that (i − 1)(b_n + l_n) + b_n ≤ n, then

(27.32)    b_n ∼ n^{3/4},    l_n ∼ n^{1/4},    r_n ∼ n^{1/4}.
Consider the random variables (27.27) and (27.28). Each V_{ni} with i < r_n is a sum of l_n consecutive X_j's, and V_{n r_n} is a sum of at most b_n + l_n of them, so by stationarity (27.31) gives E[V_{ni}⁴] ≤ Kl_n² and E[V_{n r_n}⁴] ≤ K(b_n + l_n)². By Minkowski's and Chebyshev's inequalities, (27.31) and (27.32) give

P[|Σ_{i=1}^{r_n−1} V_{ni}| ≥ εσ√n] ≤ K r_n⁴ l_n²/(ε⁴σ⁴n²) ≤ K′/(ε⁴σ⁴n^{1/2}) → 0,

and also

P[|V_{n r_n}| ≥ εσ√n] ≤ K(b_n + l_n)²/(ε⁴σ⁴n²) ≤ K′/(ε⁴σ⁴n^{1/2}) → 0.

Therefore Σ_{i=1}^{r_n} V_{ni}/σ√n ⇒ 0, and by Theorem 25.4 it suffices to prove that Σ_{i=1}^{r_n} U_{ni}/σ√n ⇒ N.

Let U′_{ni}, 1 ≤ i ≤ r_n, be independent random variables having the distribution common to the U_{ni}. By Lemma 2 extended inductively, the characteristic functions of Σ_{i=1}^{r_n} U_{ni}/σ√n and of Σ_{i=1}^{r_n} U′_{ni}/σ√n differ by at most 16 r_n α_{l_n}.† Since α_n = O(n⁻⁵), this difference is O(n⁻¹) by (27.32). The characteristic function of Σ_{i=1}^{r_n} U_{ni}/σ√n will thus approach e^{−t²/2} if that of Σ_{i=1}^{r_n} U′_{ni}/σ√n does, and so it remains only to show that Σ_{i=1}^{r_n} U′_{ni}/σ√n ⇒ N. Now E[U′_{n1}²] = E[U_{n1}²] ∼ b_n σ² by (27.26), and E[U′_{n1}⁴] ≤ Kb_n² by (27.31). Lyapounov's condition (27.16) for δ = 2 therefore follows because

r_n E[U′_{n1}⁴]/(r_n E[U′_{n1}²])² ≤ K r_n b_n² (1 + o(1))/(r_n b_n σ²)² = O(1/r_n) → 0. ∎
Example 27.8. Let {Y_n} be the stationary Markov process of Example 27.6. Let f be a function on the state space, put m = Σ_i p_i f(i), and define X_n = f(Y_n) − m. Then {X_n} satisfies the conditions of Theorem 27.5. If

ρ_{ij} = δ_{ij} p_i − p_i p_j + 2p_i Σ_{k=1}^∞ (p_{ij}^{(k)} − p_j),

then the σ² in (27.26) is Σ_{ij} ρ_{ij} (f(i) − m)(f(j) − m), and Σ_{k=1}^n f(Y_k) is approximately normally distributed with mean nm and standard deviation σ√n. If f(i) = δ_{i i_0}, then Σ_{k=1}^n f(Y_k) is the number of passages through the state i_0 in the first n steps of the process. In this case m = p_{i_0} and

σ² = p_{i_0}(1 − p_{i_0}) + 2p_{i_0} Σ_{k=1}^∞ (p_{i_0 i_0}^{(k)} − p_{i_0}). ∎
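For a concrete two-state check of the occupation-count formula just derived (states 0 and 1; the transition probabilities below are arbitrary choices, and the closed form for the series is special to the two-state chain), the following sketch simulates many stationary chains, counts passages through state 0, and compares the sample variance of the count with nσ².

```python
import random
import statistics

random.seed(19)
a, b = 0.3, 0.2          # P[0 -> 1] = a, P[1 -> 0] = b (arbitrary)
p0 = b / (a + b)         # stationary probability of state 0
lam = 1.0 - a - b        # second eigenvalue; p00^(k) - p0 = (1 - p0) * lam**k

# Series in (27.26): p0*(1-p0) + 2*p0*(1-p0)*sum(lam**k) sums in closed form.
sigma2 = p0 * (1 - p0) * (1 + lam) / (1 - lam)

n = 2000      # chain length
reps = 800    # independent chains

counts = []
for _ in range(reps):
    state = 0 if random.random() < p0 else 1   # start in the stationary law
    visits = 0
    for _ in range(n):
        if state == 0:
            visits += 1
            state = 1 if random.random() < a else 0
        else:
            state = 0 if random.random() < b else 1
    counts.append(visits)

sample_mean = statistics.mean(counts)
sample_var = statistics.variance(counts)
```

With these values p0 = 0.4 and σ² = 0.72, so the count of visits should have mean near n·p0 = 800 and variance near nσ² = 1440, noticeably larger than the binomial variance n·p0(1 − p0) = 480: the positive correlation between nearby visits inflates the spread.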
Example 27.9. If the X_n are stationary and m-dependent and have mean 0, Theorem 27.5 applies, and σ² = E[X_1²] + 2 Σ_{k=1}^m E[X_1 X_{1+k}]. Example 27.7 is a case in point. Taking m = 1 and f(x, y) = x − y in that example gives an instance where σ² = 0. ∎
PROBLEMS

27.1. Prove Theorem 23.2 by means of characteristic functions. Hint: Use (27.3) to compare the characteristic function of Σ_{k=1}^{r_n} Z_{nk} with exp[Σ_k p_{nk}(e^{it} − 1)].

27.2. If {X_n} is independent and the X_n all have the same distribution with finite first moment, then n⁻¹S_n → E[X_1] with probability 1 (Theorem 22.1), so that n⁻¹S_n ⇒ E[X_1]. Prove the latter fact by characteristic functions. Hint: Use (27.3).

27.3. For a Poisson variable Y_λ with mean λ, show that (Y_λ − λ)/√λ ⇒ N as λ → ∞. Show that (22.3) fails for t = 1.

† The 4 in (27.29) has become 16 to allow for splitting into real and imaginary parts.
380
27.4. 27.5.
27.6.
CONVERGENCE OF DISTRIBUTIONS
Suppose that I Xn k I S Mn with probability 1 and Mnfsn --+ 0. Verify Lyapounov's condition and then Lindeberg's condition. Suppose that the random variables in any single row of the triangular array are identically distributed. To what do Lindeberg's and Lyapounov's condi tions reduce? Suppose that Z1 , Z2 , are independent and identically distributed with mean 0 and variance 1 and suppose that Xnk == an k Z . Write down the Lindeberg condition and show that it holds if max k a�k == o(Lk.._ 1 an2k ). Construct an example where Lindeberg's condition holds but Lyapounov' s does not. 22.10 t Prove a central limit theorem for the number Rn of records up to time n . 6.3 f Let Sn be the number of inversions in a random permutation on n letters. Prove a central limit theorem for Sn . The B-method. Suppose that Theorem 27.1 applies to { Xn }, so that {n a - 1 ( Xn - c) � N, where Xn == n - 1 EZ - t Xk . Use Theorem 25.6 as in Example 27.2 to show that, if f( x) has a nonzero derivative at c, then {n (f( Xn ) - f( c))/af'( c) � N. Example 27.2 is the case f(x) == 1jx. f The mean volume of grains of sand can be estimated by measuring the amount of water a sample of them displaces. Use Problem 27.10 to find a method for estimating the mean diameter of the grains and analyze the variance of the estimator. The hypotheses of the theorems in this section impose conditions on moments. But there can be asymptotic normality even if there are no moments at all. For ann example, let Xn assume the values + 1 and - 1 with probability !(1 - 2 - ) each and the value 2k with probability 2 - k for k > n . Show that ' n- 1 12 Ei: _ 1 Xk � N even though the Xn have no finite moments. Let dn ( w) be the dyadic digits of a point w drawn at random from the unit interval. For a k-tuple ( u1 , . , uk ) of O's and 1 's, let Nn ( u 1 , , u k ; ) be the number of m < n for which ( dm (w ), . . . , dm + k I (w )) == ( u 1 , , u k ). - Problem 6.12.) 
Prove a central· limit theorem for Nn ( u1 , , uk ; w ) . (See The central limit theorem for a random number of summands . Let X1 , X2 , be independent, identically distributed random variables with mean 0 and variance a 2 ' and let sn == XI + . . . + xn . For each positive t ' let .,, be a random variable assuming positive integers as values; it need not be independent of the Xn . Suppose that there exist positive constants a, and 8 such that ., _ • • •
s ,
27.7. 27.8. 27.9. 27. 10.
27. 11.
27. 12.
27.13.
II
• •
• • •
• • •
27. 14.
•
•
w
•
.
a, --+
as
t --+
( 27 .33 )
oo .
oo ,
I
a,
�
(}
Show by the following steps that
S, --
, � N'
aF,
s,
___;,. , .
-
a{BQ,
_
�
N.
• .
SECTION
(a) (b)
27.
381
THE CENTRAL LIMIT THEOREM
Show that it may be assumed that fJ == 1 and the are integers. Show that it suffices to prove the second relation in (27.33). (c) Show tha t it suffices to prove ( S,, - Sa,)/ {Q; � 0. (d) Show that a1
P ( I S,, - Sa, l > ({Q;) < P ( l v, - a, l +
> f 3a , ]
I sk - sa, I P [ l k max < - a,l £ 3
a,
>
(
F.
]
,
and conclude from Kolmogorov's inequality that the last probability is at most 2 £a 2 • %7.15. 21.14 23.11 27.14 f A central limit theorem in renewal theory. Let XI ' x2 ' . . . be independent, identically distributed positive random variables with mean m and variance a 2 , and as in Problem 23.11 let N, be the maximum n for which S,. < t. Prove by the following steps that
(a) Show by the results in Problems 21.14 and 23.11 that �
0.
( SN, - t)/ {i
(b) Show that it suffices to prove that
(c) Show (Problem 23.11) that N,/t � m - 1 and apply the theorem in
Problem 27.14. 27.16. Show by partial integration that (27 .34)
as x --+ oo . %7.17. f Suppose that XI ' x2 ' . . . are independent and identically distributed with mean 0 and variance 1, and suppose that an --+ oo . A formal combina tion of the central limit theorem and (27.34) gives (27 .35) where fn --+ 0 if an --+ oo . For a case in which this does hold, see Theorem 9.4. 27.18. 21.2 f Stirling ' s formula. Let sn == XI + . . . + xn , where the xn are inde pendent and each has the Poisson distribution with parameter 1. Prove
382
CONVERGENCE OF DISTRIBUTIONS
successively: (a)
(b) (c)
E
[ ( Sn - n ) - ] = e - n kr.-0 ( n - ) n: - _nn_+_on;_,_2)_e-_'1 rn n
( Sni r � [E ( Sni n r ] �
rn
N-
n
E
(d) n ! - �2 '1T n + ( l /2> e - n
[ N- ] =
k
k.
k.
of O' s starting at the n th place in the dyadic expansion of a point w drawn at random from the unit interval; see Example 4.1. (a) Show that /1 , /2 , • • • is an a-mixing sequence, where a n = 4/2 11 • (b) Show that E;: _ 1 /�c is approximately normally distributed with mean n and variance 6n . hypotheses of Theorem 27.5 that Sn/n --+ 0 with probabil 27.20. Prove under the ' ity 1. Hint: use (27.31). 27.21. 26.1 26.28 f Let XI ' x2 ' . . . be independent and identically distributed, and suppose that the distribution common to the X,. is supported by [0, 2 '1T ] and is not a lattice distribution. Let sn = XI + . . . + xn ' where the sum is reduced modulo 2w. Show that Sn � U, where U is uniformly distributed over [0, 2 , ]. 27. 19. Let /n ( w) be the length of the
run
SECI'ION 28. INFINITELY DIVISIBLE DISTRIBUTIONS *
Suppose that Z" has the Poisson distribution with parameter A and that Xn 1 , • • • , Xnn are independent and P[ Xn k = 1] = 'Ajn, P[ Xnk = 0] = 1 A /n . According to Example 25.2, xnl + . . . + xnn => ZA. This contrasts with the central limit theorem, in which the limit law is normal. What is the class of all possible limit laws for independent triangular arrays? A suitably restricted form of this question will be answered here. Vague Convergence
The theory requires two preliminary facts about convergence of measures. Let #L n and p. be finite measures on ( R1 , 9P1 ). If l'n(a, b) --+ p.(a, b) for every finite interval for which p. { a } = p. { b } = 0, then #Ln converges vaguely to p., written #Ln --+ v p.. If P.n and p. are probability measures, it is not hard to see that this is equivalent to weak convergence: P.n � p.. On the other hand, if #Ln is a unit mass at n and p.( R1 ) = 0, then P.n --+ v p., but P.n � p. makes no sense because p. is not a probabil ity measure. The first fact needed is this: Suppose that #Ln --+ v p. and sup p.n ( R 1 ) < oo ; (28.1 )
n
* This section may be omitted.
28.
SECTION
383
INFINITELY DIVISIBLE DISTRIBUTIONS
then
ffdp.n --+ Jfdp.
(28.2)
for every continuous real f that vanishes at + oo in the sense that lim 1 t l - oc. f( x ) = 0. Indeed, choose M so that p.( R 1 ) < M and #Ln ( R 1 ) < M for all n . Given £ , choose a and b so that p. { a } == p. { b } == 0 and 1/ (x) l < £/M if x ll. A == ( a , b] . Then 1 / Acf dp.n l < £ and 1 / A fdp. l < £. If p. ( A ) > 0, define v ( B ) = p.( B n A )/p.( A ) and ,,. ( B) == P.n ( B n A )/P.n ( A ). It is easy to see that "n � v , so that ffd vn --+ f fd v . But then 1 / A fdp.n - f A fdp. l < £ for large n , and hence 1 / fdp. n - ffdp. l < 3£ for large n. If p. ( A ) == 0, then f A fdp.n --+ 0, and the argument is even simpler. The other fact needed below is this: If (28.1) holds, then there is a subsequence { l'n } and a finite measure p. such that p., --+ p. as k --+ oo . Indeed, let F, ( x) == p,,. ( :_ oo , x ]. Since the F, are uniformly b�unded because of (28.1 ), the proof of c
,,
Helly's theorem shows there exists a subsequence { F, } and a bounded, nondecreas ing, right-continuous function F such that lim* F, ( ; ) == F( x ) at continuity points x of F. If p. is the measure for which p.( a , b] == F� b) - F( a ) (Theorem 12.4), then clearly l'n --+ p.. l'
"
The
Possible Limits
Let Xnl, , Xn r , n = 1, 2, . . . , be a triangular array as in the preceding section. The random variables in each row are independent, and •
•
•
II
rll
s; = E on2k · k=l
(28.3) Put Sn = Xn1 + bounded :
·
·
·
+ Xn r . Here it will be assumed that the total variance is
sup sn2
(28.4) In
order that the
(28 .5)
II
n
xn k
<
oo .
be small compared with lim max on2k =
n k < r11
sn, assume that
0.
The arrays in the preceding section were normalized by replacing Xnk by X ,cfsn . This has the effect of replacing sn by 1, in which case of course (28.4) holds, and (28.5) is the same thing as (27 .19). A distribution function F is infinitely divisible if for each n there is a distribution function Fn such that F is the n-fold convolution Fn Fn (n copies) of Fn. The class of possible limit laws will tum out to consist of ,
•
·
·
·
•
384
CONVERGENCE OF DISTRIBUTIONS
the infinitely divisible distributions with mean 0 and finite variance. t It will be possible to exhibit the characteristic functions of these laws in an explicit way.
Theorem 28.1.
Suppose that
(28.6 )
cp ( t ) = exp j ( e itx - 1 - itx ) � p. ( dx ) , R1
X
where p. is a finite measure. Then cp is the characteristic1 function of an infinitely divisible distribution with mean 0 and variance p. ( R ). By (26.4 2 ) the integrand in (28.6) converges to - t 2/2 as x --+ 0; take this as its value at x = 0. By (26.4 1 ), the integrand is at most t 2j2 in ,
modulus and so is integrable. The formula (28.6) is the
canonical measure.
canonical representation
of
cp,
and
p.
is the
Before proceeding to the proof, consider three examples.
If p. consists of a mass of o 2 at the origin, (28.6) is e - a 2 12 12 , the characteristic function of a centered normal distribution F. It • is certainly infinitely divisible-take Fn normal with variance o 2jn. Exllllple l 28.1.
Example 28. 2.
Suppose that
p. consists of a mass of x 2 at x =1= 0. Then
(28.6) is exp A( eitx - 1 - itx ); but this is the characteristic function of x ( Z" - A ), where Z" has the Poisson distribution with mean A. Thus (28.6) is the characteristic function of a distribution function F, and F is infinitely • divisible- take Fn to be the distribution function of x(Z A / n - A/n). Example 28.3. If cp1 (t) is given by (28.6) with p.1 for the measure, and if p. = Ej_ 1 p.1 , then (28.6) is cp 1 (t) · · · cp k (t). It follows by the preceding two examples that (28 . 6) is a characteristic function if p. consists of finitely many point masses. It is easy to check in the preceding two examples that the distribution corresponding to cp(t) has mean 0 and variance p.(R 1 ), and since the means and variances add, the same must be true in the present
��k
•
28.1.k Let P. k have mass p.( j 2 - k , (j + 1)2 - k ) at j 2- k for j = 0, ± 1, . . . , ± 2 2 . Then p. k --+ p.. As observed in Example 28.3, if PROOF OF THEOREM
v
t There do exist infinitely divisible distributions without moments (see Problems 28.3 and 28.4) ,
but they do not figure in the theory of this section.
SECTION
28.
INFINITELY DIVISIBLE DISTRIBUTIONS
385
fJJk (t) is (28.6) with P. k in place of p., then cpk is a characteristic function. For 1 each t the integrand in (28.6) vanishes at ± oo ; since sup k p.k( R ) < oo , fJJk ( t) .-. cp( t) follows (see (28.2)). Therefore, cp( t ) , as a limit of characteris
tic functions, is itself a characteristic function. Further, the distribution 1 corresponding to cp k ( t ) has second moment P. k ( R ), and since this is bounded, it follows (Theorem 25.11) that the distribution corresponding to fJJ ( I) has a finite second moment. Differentiation (use Theorem 16.8) shows 1 that the mean is cp'(O) = 0 and the variance is - cp"(O) = p. ( R ). Thus (28.6) is always the characteristic function of a distribution with mean 0 and 1 variance p. ( R ). If 1/1 n ( t) is (28.6) with p.jn in place of p., then cp ( t ) = 1/1:( t ), so that the distribution corresponding to cp( t) is indeed infinitely divisible. • The representation (28.6) shows that the normal and Poisson distribu tions are special·cases in a very large class of infinitely divisible laws.
Every infinitely divisible distribution with mean 0 and finite variance is the limit law of Sn for some independent triangular array satisfying (28.3), (28.4), and (28.5). • Theorem 28.2.
The proof requires this preliminary result:
If X and Y are independent and X + .Y has a second moment, then X and Y have second moments as well. PROOF. Since X 2 + Y 2 < ( X + Y) 2 + 2 1 XY I , it suffices to prove I XY I integrable, and by Fubini's theorem applied to the joint distribution of X and Y it suffices to prove 1 X I and 1 Y l individually integrable. Since I Yl < l x l + l x + Y l , E[ l Y l ] = oo would imply E[ l x + Y l ] = oo for each x; by Fubini's theorem again E[ l Y l ] = oo would therefore imply E[ I X + Yl ] = oo , which is impossible. Hence E[ l Y l ] < oo , and similarly E [ I X I ]
Lemma.
< 00 .
•
28.2.
Let F be infinitely divisible with mean 0 and variance o 2 If F is the n-fold convolution of Fn , then by the lemma extended inductively Fn has finite mean and variance and these must be 0 and o 2jn. Take rn = n and take Xn 1 , Xnn independent, each with distribution function Fn . • PROOF OF THEOREM •
•
•
•
,
If F is the limit law of Sn for an independent triangular array satisfying (28.3), (28.4), and (28.5), then F has characteristic function of the form (28.6) for some finite measure p.. Theorem 28.3.
386
CONVERGENCE OF DISTRIBUTIONS
The proof will yield information making it possible to identify the limit. Let n k ( t ) be the characteristic function of Xn k · The first step is to prove that n 'n (28 .7) n cpn k ( t ) - exp E ( cpn k ( t ) - 1 ) ..... o PROOF.
k -= l
k- 1
for each t. Since 1 z 1 � 1 implies that l ez l l = eRez- I < 1 , it follows by (27.3) that the difference Bn( t ) in (28.7) satisfies IBn( t ) l < � k.. 1 l cpnk( t) 2 2 exp( cpn k ( t ) - 1) 1 . Fix t. If cpnk( t ) - 1 = (Jnk ' then l fJn k l < t on k/2, and it follows by (28.4) and (28.5) that max k l fJn k l -+ 0 and E k l fJn k l = 0(1). There fore, for sufficiently large, n, I Bn( t ) l < Ek 1 1 + (Jn k - e '.. " l < E k l fJn k l 2 < max k l fJn k l · E k l fJn k l · Hence (28.7). If Fn k is the distribution function of Xn k ' then -
'n
E ( CJln k ( t ) - 1 )
k-l
'n
= E J. (e itx - 1 ) dF, k ( x ) R1 k- l
Let p. n be the finite measure satisfying (28 .8)
l' n( - oo , x]
'n
= E
k-l
1
2
y dFn k ( y ) ,
y<x
and put (28 .9) Then (28. 7) can be written 'n
(28 .10)
n cpn k ( t ) - cpn ( t )
k-l
-+ 0.
1 By (28.8), P. n ( R ) = s;, and this is bounded by assumption. Thus (2 8. 1 ) holds, and some subsequence { p. n } converges vaguely to a finite measu re P. · Since the integrand in (28.9) vanishes at ± oo, cpn (t) converges to (2 8.6 ). But, of course, limncpn( t) must coincide with the characteristic function o f the limit law F, which exists by hypothesis. Thus F must have characteristic • function of the form (28.6). II
SECTION
28.
387
INFINITELY DIVISIBLE DISTRIBUTIONS
Theorems 28.1, 28.2, and 28.3 together show that the possible limit laws are exactly the infinitely divisible distributions with mean 0 and finite variance, and they give explicitly the form the characteristic functions of such laws must have. Characterizing the Limit
Theorem 28.4. Suppose that $F$ has characteristic function (28.6) and that an independent triangular array satisfies (28.3), (28.4), and (28.5). Then $S_n$ has limit law $F$ if and only if $\mu_n \to_v \mu$, where $\mu_n$ is defined by (28.8).

PROOF. Since (28.7) holds as before, $S_n$ has limit law $F$ if and only if $\varphi_n(t)$ (defined by (28.9)) converges for each $t$ to $\varphi(t)$ (defined by (28.6)). If $\mu_n \to_v \mu$, then $\varphi_n(t) \to \varphi(t)$ follows because the integrand in (28.9) and (28.6) vanishes at $\pm\infty$ and because (28.1) follows from (28.4).

Now suppose that $\varphi_n(t) \to \varphi(t)$. Since $\mu_n(R^1) = s_n^2$ is bounded, each subsequence $\{\mu_{n_i}\}$ contains a further subsequence converging vaguely to some $\nu$. If it can be shown that $\nu$ necessarily coincides with $\mu$, it will follow by the usual argument that $\mu_n \to_v \mu$. But by the definition (28.9) of $\varphi_n(t)$, it follows that $\varphi(t)$ must coincide with $\psi(t) = \exp \int_{R^1} (e^{itx} - 1 - itx) x^{-2}\, \nu(dx)$. Now $\varphi'(t) = i\varphi(t) \int_{R^1} (e^{itx} - 1) x^{-1}\, \mu(dx)$, and similarly for $\psi'(t)$. Hence $\varphi(t) = \psi(t)$ implies that $\int_{R^1} (e^{itx} - 1) x^{-1}\, \mu(dx) = \int_{R^1} (e^{itx} - 1) x^{-1}\, \nu(dx)$. A further differentiation gives $\int_{R^1} e^{itx}\, \mu(dx) = \int_{R^1} e^{itx}\, \nu(dx)$. Setting $t = 0$ shows that $\mu(R^1) = \nu(R^1)$, and so $\mu = \nu$ by the uniqueness theorem for characteristic functions. ∎
Example 28.4. The normal case. According to the theorem, $S_n \Rightarrow N$ if and only if $\mu_n$ converges vaguely to a unit mass at 0. If $s_n^2 = 1$, this holds if and only if $\sum_{k=1}^{r_n} \int_{|x| \ge \epsilon} x^2\, dF_{nk}(x) \to 0$, which is exactly Lindeberg's condition. Thus Theorem 28.4 contains Theorems 27.2 and 27.4. ∎
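A tiny computational illustration of this example (a sketch; the Rademacher array is an assumed choice, not from the text): for $X_{nk} = \xi_k/\sqrt{n}$ with $\xi_k = \pm 1$ equally likely, the measure $\mu_n$ of (28.8) is exactly mass $\frac{1}{2}$ at each of $\pm n^{-1/2}$, so its vague convergence to a unit mass at 0, and with it Lindeberg's condition, can be seen directly.

```python
from fractions import Fraction

def mu_n(n):
    """Canonical measure mu_n of (28.8) for the array X_nk = xi_k/sqrt(n),
    xi_k Rademacher.  Each F_nk puts mass 1/2 at +-n^{-1/2}, where y^2 = 1/n,
    so each of the n summands contributes mass (1/2)(1/n) at each point;
    summing over k gives mass 1/2 at -n^{-1/2} and 1/2 at +n^{-1/2}."""
    point = n ** -0.5
    return {-point: Fraction(1, 2), point: Fraction(1, 2)}

def mass_within(mu, eps):
    """Mass mu places in [-eps, eps]."""
    return float(sum(w for x, w in mu.items() if abs(x) <= eps))

eps = 0.05
# total mass is s_n^2 = 1 for every n
assert all(float(sum(mu_n(n).values())) == 1.0 for n in (1, 10, 1000))
# all mass enters (-eps, eps) once 1/sqrt(n) < eps -- Lindeberg's
# condition specialized to this array
m10 = mass_within(mu_n(10), eps)
m1000 = mass_within(mu_n(1000), eps)
print(m10, m1000)  # -> 0.0 1.0
```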
Example 28.5. The Poisson case. Let $Z_{n1}, \ldots, Z_{nr_n}$ be an independent triangular array, and suppose $X_{nk} = Z_{nk} - m_{nk}$ satisfies the conditions of the theorem, where $m_{nk} = E[Z_{nk}]$. If $Z_\lambda$ has the Poisson distribution with parameter $\lambda$, then $\sum_k X_{nk} \Rightarrow Z_\lambda - \lambda$ if and only if $\mu_n$ converges vaguely to a mass of $\lambda$ at 1 (see Example 28.2). If $s_n^2 \to \lambda$, the requirement is $\mu_n[1 - \epsilon, 1 + \epsilon] \to \lambda$, or

\[ (28.11) \qquad \sum_{k=1}^{r_n} \int_{|x - 1| \le \epsilon} x^2\, dF_{nk}(x) \to \lambda \]

for positive $\epsilon$. If $s_n^2$ and $\sum_k m_{nk}$ both converge to $\lambda$, (28.11) is a necessary and sufficient condition that $\sum_k Z_{nk} \Rightarrow Z_\lambda$. The conditions are easily checked under the hypotheses of Theorem 23.2: $Z_{nk}$ assumes the values 1 and 0 with probabilities $p_{nk}$ and $1 - p_{nk}$, $\sum_k p_{nk} \to \lambda$, and $\max_k p_{nk} \to 0$. ∎
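The Bernoulli case just mentioned can be examined numerically. The sketch below (parameter choices are illustrative, not from the text) computes the total variation distance between the row sum, which is Binomial$(n, \lambda/n)$, and its Poisson$(\lambda)$ limit; Le Cam's inequality bounds this distance by $\sum_k p_{nk}^2 = \lambda^2/n$.

```python
from math import comb, exp, factorial

def binom_pmf(n, p, j):
    return comb(n, j) * p**j * (1.0 - p)**(n - j)

def poisson_pmf(lam, j):
    return exp(-lam) * lam**j / factorial(j)

def tv_distance(n, lam, jmax=60):
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam),
    truncated at jmax (for lam = 4 the omitted tail mass is negligible)."""
    p = lam / n
    return 0.5 * sum(abs(binom_pmf(n, p, j) - poisson_pmf(lam, j))
                     for j in range(jmax + 1))

lam = 4.0
d10, d100, d1000 = (tv_distance(n, lam) for n in (10, 100, 1000))
assert d10 > d100 > d1000          # the distance shrinks as n grows
assert d1000 <= lam**2 / 1000      # Le Cam's bound: sum of p_nk^2
```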
PROBLEMS

28.1. Show that $\mu_n \to_v \mu$ implies $\mu(R^1) \le \liminf_n \mu_n(R^1)$. Thus in vague convergence mass can "escape to infinity" but mass cannot "enter from infinity."

28.2. (a) Show that $\mu_n \to_v \mu$ if and only if (28.2) holds for every continuous $f$ with bounded support.
(b) Show that if $\mu_n \to_v \mu$ but (28.1) does not hold, then there is a continuous $f$ vanishing at $\pm\infty$ for which (28.2) does not hold.

28.3. 23.8 ↑ Suppose that $N, Y_1, Y_2, \ldots$ are independent, the $Y_i$ have a common distribution function $F$, and $N$ has the Poisson distribution with mean $\alpha$. Then $S = Y_1 + \cdots + Y_N$ has the compound Poisson distribution.
(a) Show that the distribution of $S$ is infinitely divisible. Note that $S$ may not have a mean.
(b) The distribution function of $S$ is $\sum_{n=0}^\infty e^{-\alpha} \alpha^n F^{n*}(x)/n!$, where $F^{n*}$ is the $n$-fold convolution of $F$ (a unit jump at 0 for $n = 0$). The characteristic function of $S$ is $\exp \alpha \int_{-\infty}^\infty (e^{itx} - 1)\, dF(x)$.
(c) Show that, if $F$ has mean 0 and finite variance, then the canonical measure $\mu$ in (28.6) is specified by $\mu(A) = \alpha \int_A x^2\, dF(x)$.

28.4. (a) Let $\nu$ be a finite measure and define

\[ (28.12) \qquad \varphi(t) = \exp\left[ i\gamma t + \int_{-\infty}^\infty \left( e^{itx} - 1 - \frac{itx}{1 + x^2} \right) \frac{1 + x^2}{x^2}\, \nu(dx) \right], \]

where the integrand is $-t^2/2$ at the origin. Show that this is the characteristic function of an infinitely divisible distribution.
(b) Show that the Cauchy distribution (see the table on p. 358) is the case where $\gamma = 0$ and $\nu$ has density $\pi^{-1}(1 + x^2)^{-1}$ with respect to Lebesgue measure.

28.5. Show that the Cauchy, exponential, and gamma (see (20.47)) distributions are infinitely divisible.

28.6. Find the canonical representation (28.6) of the exponential distribution with mean 1:
(a) The characteristic function is $\int_0^\infty e^{itx} e^{-x}\, dx = (1 - it)^{-1} = \varphi(t)$.
(b) Show that (use the principal branch of the logarithm or else operate formally for the moment) $d(\log \varphi(t))/dt = i\varphi(t) = i \int_0^\infty e^{itx} e^{-x}\, dx$. Integrate with respect to $t$ to obtain

\[ (28.13) \qquad \frac{1}{1 - it} = \exp \int_0^\infty (e^{itx} - 1) \frac{e^{-x}}{x}\, dx. \]

Verify (28.13) after the fact by showing that the ratio of the two sides has derivative 0.
(c) Multiply (28.13) by $e^{-it}$ to center the exponential distribution at its mean: The canonical measure $\mu$ has density $xe^{-x}$ over $(0, \infty)$.

28.7. ↑ If $X$ and $Y$ are independent and each has the exponential density $e^{-x}$, $X - Y$ has the double exponential density $\frac{1}{2}e^{-|x|}$ (see the table on p. 358). Show that its characteristic function is

\[ \frac{1}{1 + t^2} = \exp \int_{-\infty}^\infty (e^{itx} - 1 - itx) \frac{|x| e^{-|x|}}{x^2}\, dx. \]

28.8. ↑ Suppose $X_1, X_2, \ldots$ are independent and each has the double exponential density. Show that $\sum_{n=1}^\infty X_n/n$ converges with probability 1. Show that the distribution of the sum is infinitely divisible and that its canonical measure has density $|x|e^{-|x|}/(1 - e^{-|x|}) = \sum_{n=1}^\infty |x| e^{-n|x|}$.

28.9. 26.8 ↑ Show that for the gamma density $e^{-x} x^{u-1}/\Gamma(u)$ the canonical measure has density $uxe^{-x}$ over $(0, \infty)$.
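Identity (28.13) of Problem 28.6 can be checked by direct numerical quadrature; the sketch below (the grid and cutoff are arbitrary choices, not from the text) approximates the exponent with the trapezoidal rule and compares the result with $(1 - it)^{-1}$.

```python
import numpy as np

def canonical_side(t, xmax=60.0, m=200001):
    """Right side of (28.13): exp of the integral of (e^{itx}-1) e^{-x}/x
    over (0, xmax), evaluated by the trapezoidal rule on a uniform grid.
    The integrand extends continuously to the value i*t at x = 0, and the
    tail beyond xmax is exponentially small."""
    x = np.linspace(1e-12, xmax, m)
    f = (np.exp(1j * t * x) - 1.0) * np.exp(-x) / x
    dx = x[1] - x[0]
    integral = dx * (f.sum() - 0.5 * (f[0] + f[-1]))
    return np.exp(integral)

for t in (0.5, 1.0, 2.0):
    assert abs(canonical_side(t) - 1.0 / (1.0 - 1j * t)) < 1e-4
```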
The remaining problems require the notion of a stable law. A distribution function $F$ is stable if for each $n$ there exist constants $a_n$ and $b_n$, $a_n > 0$, such that, if $X_1, \ldots, X_n$ are independent and have distribution function $F$, then $a_n^{-1}(X_1 + \cdots + X_n) + b_n$ also has distribution function $F$.

28.10. Suppose that for all $a, a', b, b'$ there exist $a'', b''$ (here $a, a', a''$ are all positive) such that $F(ax + b) * F(a'x + b') = F(a''x + b'')$. Show that $F$ is stable.

28.11. Show that a stable law is infinitely divisible.

28.12. Show that the Poisson law, although infinitely divisible, is not stable.

28.13. Show that the normal and Cauchy laws are stable.

28.14. 28.10 ↑ Suppose that $F$ has mean 0 and variance 1 and that the dependence of $a'', b''$ on $a, a', b, b'$ is such that

\[ F\left( \frac{x}{a_1} \right) * F\left( \frac{x}{a_2} \right) = F\left( \frac{x}{\sqrt{a_1^2 + a_2^2}} \right). \]

Show that $F$ is the standard normal distribution.

28.15. (a) Let $Y_{nk}$ be independent random variables having the Poisson distribution with mean $cn^\alpha/|k|^{1+\alpha}$, where $c > 0$ and $0 < \alpha < 2$. Let $Z_n = n^{-1} \sum_{k=-n^2}^{n^2} k Y_{nk}$ (omit $k = 0$ in the sum) and show that if $c$ is properly chosen then the characteristic function of $Z_n$ converges to $e^{-|t|^\alpha}$.
(b) Show for $0 < \alpha \le 2$ that $e^{-|t|^\alpha}$ is the characteristic function of a symmetric stable distribution; it is called the symmetric stable law of exponent $\alpha$. The case $\alpha = 2$ is the normal law, and $\alpha = 1$ is the Cauchy law.
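For exponent $\alpha = 1$ the assertion that $e^{-|t|}$ is the Cauchy characteristic function can be verified numerically (a sketch; the cutoff and grid are arbitrary choices, not from the text):

```python
import numpy as np

def cauchy_cf(t, xmax=5000.0, m=1_000_001):
    """Characteristic function of the standard Cauchy law by trapezoidal
    integration of cos(tx)/(pi(1 + x^2)) over [0, xmax], doubled by
    symmetry; the imaginary part vanishes because the density is even.
    The oscillating tail beyond xmax is far below the tolerance used."""
    x = np.linspace(0.0, xmax, m)
    f = np.cos(t * x) / (np.pi * (1.0 + x * x))
    dx = x[1] - x[0]
    return 2.0 * dx * (f.sum() - 0.5 * (f[0] + f[-1]))

for t in (0.5, 1.0, 2.0):
    assert abs(cauchy_cf(t) - np.exp(-abs(t))) < 1e-3
```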
SECTION 29. LIMIT THEOREMS IN $R^k$
If $F_n$ and $F$ are distribution functions on $R^k$, $F_n$ converges weakly to $F$, written $F_n \Rightarrow F$, if $\lim_n F_n(x) = F(x)$ for all continuity points $x$ of $F$. The corresponding distributions $\mu_n$ and $\mu$ are in this case also said to converge weakly: $\mu_n \Rightarrow \mu$. If $X_n$ and $X$ are $k$-dimensional random vectors (possibly on different probability spaces), $X_n$ converges in distribution to $X$, written $X_n \Rightarrow X$, if the corresponding distribution functions converge weakly. The definitions are thus exactly as for the line.

The Basic Theorems
The closure $A^-$ of a set $A$ in $R^k$ is the set of limits of sequences in $A$; the interior is $A^\circ = R^k - (R^k - A)^-$; and the boundary is $\partial A = A^- - A^\circ$. A Borel set $A$ is a $\mu$-continuity set if $\mu(\partial A) = 0$. The first theorem is the $k$-dimensional version of Theorem 25.8.

Theorem 29.1. For probability measures $\mu_n$ and $\mu$ on $(R^k, \mathscr{R}^k)$, each of the following conditions is equivalent to the weak convergence of $\mu_n$ to $\mu$:
(i) $\lim_n \int f\, d\mu_n = \int f\, d\mu$ for bounded continuous $f$;
(ii) $\limsup_n \mu_n(C) \le \mu(C)$ for closed $C$;
(iii) $\liminf_n \mu_n(G) \ge \mu(G)$ for open $G$;
(iv) $\lim_n \mu_n(A) = \mu(A)$ for $\mu$-continuity sets $A$.
PROOF. It will first be shown that (i) through (iv) are all equivalent.

(i) implies (ii): Consider the distance $\mathrm{dist}(x, C) = \inf[\,|x - y|: y \in C\,]$ from $x$ to $C$. It is continuous in $x$. Let

\[ \varphi_j(t) = \begin{cases} 1 & \text{if } t \le 0, \\ 1 - jt & \text{if } 0 \le t \le j^{-1}, \\ 0 & \text{if } j^{-1} \le t. \end{cases} \]

Then $f_j(x) = \varphi_j(\mathrm{dist}(x, C))$ is continuous and bounded by 1, and $f_j(x) \downarrow I_C(x)$ as $j \uparrow \infty$ because $C$ is closed. If (i) holds, then $\limsup_n \mu_n(C) \le \lim_n \int f_j\, d\mu_n = \int f_j\, d\mu$. As $j \uparrow \infty$, $\int f_j\, d\mu \downarrow \int I_C\, d\mu = \mu(C)$.

(ii) is equivalent to (iii): Take $C = R^k - G$.

(ii) and (iii) imply (iv): From (ii) and (iii) follows

\[ \mu(A^\circ) \le \liminf_n \mu_n(A^\circ) \le \liminf_n \mu_n(A) \le \limsup_n \mu_n(A) \le \limsup_n \mu_n(A^-) \le \mu(A^-). \]

Clearly (iv) follows from this.
(iv) implies (i): Suppose that $f$ is continuous and $|f(x)|$ is bounded by $K$. Given $\epsilon$ choose reals $a_0 < a_1 < \cdots < a_l$ so that $a_0 < -K < K < a_l$, $a_i - a_{i-1} < \epsilon$, and $\mu[x: f(x) = a_i] = 0$. The last condition can be achieved because the sets $[x: f(x) = a]$ are disjoint for different $a$. Put $A_i = [x: a_{i-1} < f(x) \le a_i]$. Since $f$ is continuous, $A_i^- \subset [x: a_{i-1} \le f(x) \le a_i]$ and $A_i^\circ \supset [x: a_{i-1} < f(x) < a_i]$. Therefore, $\partial A_i \subset [x: f(x) = a_{i-1}] \cup [x: f(x) = a_i]$, and therefore $\mu(\partial A_i) = 0$. Now $|\int f\, d\mu_n - \sum_{i=1}^l a_i \mu_n(A_i)| \le \epsilon$ and similarly for $\mu$, and $\sum_{i=1}^l a_i \mu_n(A_i) \to \sum_{i=1}^l a_i \mu(A_i)$ because of (iv). Since $\epsilon$ was arbitrary, (i) follows.

It remains to prove these four conditions equivalent to weak convergence.

(iv) implies $\mu_n \Rightarrow \mu$: Consider the corresponding distribution functions. If $S_x = [y: y_i \le x_i,\ i = 1, \ldots, k]$, then $F$ is continuous at $x$ if and only if $\mu(\partial S_x) = 0$; see the argument following (20.18). Therefore, if $F$ is continuous at $x$, $F_n(x) = \mu_n(S_x) \to \mu(S_x) = F(x)$, and $F_n \Rightarrow F$.

$\mu_n \Rightarrow \mu$ implies (iii): Since only countably many parallel hyperplanes can have positive $\mu$-measure, there is a dense set $D$ of reals such that $\mu[x: x_i = d] = 0$ for $d \in D$ and $i = 1, \ldots, k$. Let $\mathscr{A}$ be the class of rectangles $A = [x: a_i < x_i \le b_i,\ i = 1, \ldots, k]$ for which the $a_i$ and the $b_i$ all lie in $D$. All $2^k$ vertices of such a rectangle are continuity points of $F$, and so $F_n \Rightarrow F$ implies (see (12.12)) that $\mu_n(A) = \Delta_A F_n \to \Delta_A F = \mu(A)$. It follows by the inclusion-exclusion formula that $\mu_n(B) \to \mu(B)$ for finite unions $B$ of elements of $\mathscr{A}$. Since $D$ is dense on the line, an open set $G$ in $R^k$ is a countable union of sets $A_m$ in $\mathscr{A}$. But $\mu(\bigcup_{m \le M} A_m) = \lim_n \mu_n(\bigcup_{m \le M} A_m) \le \liminf_n \mu_n(G)$. Letting $M \to \infty$ gives (iii). ∎
Theorem 29.2. Suppose that $h: R^k \to R^j$ is measurable and that the set $D_h$ of its discontinuities is measurable.† If $\mu_n \Rightarrow \mu$ in $R^k$ and $\mu(D_h) = 0$, then $\mu_n h^{-1} \Rightarrow \mu h^{-1}$ in $R^j$.

PROOF. Let $C$ be a closed set in $R^j$. The closure $(h^{-1}C)^-$ in $R^k$ satisfies $(h^{-1}C)^- \subset D_h \cup h^{-1}C$. If $\mu_n \Rightarrow \mu$, then part (ii) of Theorem 29.1 gives

\[ \limsup_n \mu_n h^{-1}(C) \le \limsup_n \mu_n((h^{-1}C)^-) \le \mu((h^{-1}C)^-) \le \mu(D_h) + \mu(h^{-1}C). \]

Using (ii) again gives $\mu_n h^{-1} \Rightarrow \mu h^{-1}$ if $\mu(D_h) = 0$. ∎

Theorem 29.2 is the $k$-dimensional version of the mapping theorem, Theorem 25.7. The two proofs just given provide in the case $k = 1$ a second approach to the theory of Section 25, which there was based on Skorohod's theorem (Theorem 25.6).† Skorohod's theorem does extend to $R^k$ (see Theorem 29.6), but the proof is harder.

Theorems 29.1 and 29.2 can of course be stated in terms of random vectors. For example, $X_n \Rightarrow X$ if and only if $P[X \in G] \le \liminf_n P[X_n \in G]$ for all open sets $G$.

A sequence $\{\mu_n\}$ of probability measures on $(R^k, \mathscr{R}^k)$ is tight if for every $\epsilon$ there is a bounded rectangle $A$ such that $\mu_n(A) > 1 - \epsilon$ for all $n$.

†The argument in the footnote on p. 343 shows that in fact $D_h \in \mathscr{R}^k$ always holds.
Theorem 29.3. If $\{\mu_n\}$ is a tight sequence of probability measures, there is a subsequence $\{\mu_{n_i}\}$ and a probability measure $\mu$ such that $\mu_{n_i} \Rightarrow \mu$ as $i \to \infty$.
FIRST PROOF. The proof of Helly's theorem (Theorem 25.9) carries over to $R^k$: For points $x$ and $y$ in $R^k$, interpret $x \le y$ as meaning $x_u \le y_u$, $u = 1, \ldots, k$, and $x < y$ as meaning $x_u < y_u$, $u = 1, \ldots, k$. Consider rational points $r$ (points whose coordinates are all rational) and by the diagonal method [A14] choose a sequence $\{n_i\}$ along which $\lim_i F_{n_i}(r) = G(r)$ exists for each such $r$. As before, define $F(x) = \inf[\,G(r): x < r\,]$.

Although $F$ is clearly nondecreasing in each variable, a further argument is required to prove $\Delta_A F \ge 0$ (see (12.12)). Given $\epsilon$ and a rectangle $A = (a_1, b_1] \times \cdots \times (a_k, b_k]$, choose a $\delta$ such that if $z = (\delta, \ldots, \delta)$, then for each of the $2^k$ vertices $x$ of $A$, $x < r < x + z$ implies $|F(x) - G(r)| < \epsilon/2^k$. Now choose rational points $r$ and $s$ such that $a < r < a + z$ and $b < s < b + z$. If $B = (r_1, s_1] \times \cdots \times (r_k, s_k]$, then $|\Delta_A F - \Delta_B G| < \epsilon$. Since $\Delta_B G = \lim_i \Delta_B F_{n_i} \ge 0$ and $\epsilon$ is arbitrary, it follows that $\Delta_A F \ge 0$.

With the present interpretation of the symbols, the proof of Theorem 25.9 shows that $F$ is continuous from above and $\lim_i F_{n_i}(x) = F(x)$ for continuity points $x$ of $F$. By Theorem 12.5, there is a measure $\mu$ on $(R^k, \mathscr{R}^k)$ such that $\mu(A) = \Delta_A F$ for rectangles $A$. If $\mu$ is a probability measure, then certainly $\mu_{n_i} \Rightarrow \mu$. But given $\epsilon$, choose a rectangle $A$ such that $\mu_n(A) > 1 - \epsilon$ for all $n$. By enlarging $A$, it is possible to ensure that none of the $(k-1)$-dimensional hyperplanes containing the $2k$ faces of $A$ have positive $\mu$-measure (Theorem 10.2(iv)). But then $F$ is continuous at each of the $2^k$ vertices of $A$ (see the discussion following (20.18)), and so $1 \ge \mu(R^k) \ge \mu(A) = \Delta_A F = \lim_i \Delta_A F_{n_i} = \lim_i \mu_{n_i}(A) \ge 1 - \epsilon$. Thus $\mu$ is a probability measure. ∎

†The approach of this section carries over to general metric spaces; for this theory and its applications, see BILLINGSLEY¹.
There is a proof that avoids an appeal to Theorem 12.5.†

SECOND PROOF. Let $\mathscr{A}$ be the class of bounded, closed rectangles whose vertices have rational coordinates, and let $\mathscr{H}$ consist of the empty set and the finite unions of $\mathscr{A}$-sets. Then $\mathscr{H}$ is countable and closed under the formation of finite unions, and each set in $\mathscr{H}$ is compact. Select by the diagonal procedure a subsequence $\{\mu_{n_i}\}$ along which the limits

\[ (29.1) \qquad \alpha(H) = \lim_i \mu_{n_i}(H) \]

exist for all $H$ in $\mathscr{H}$. Suppose it can be shown that there exists a probability measure $\mu$ such that

\[ (29.2) \qquad \mu(G) = \sup[\,\alpha(H): H \subset G,\ H \in \mathscr{H}\,] \]

for all open sets $G$. Then $H \subset G$ will imply that $\alpha(H) = \lim_i \mu_{n_i}(H) \le \liminf_i \mu_{n_i}(G)$, so that $\mu(G) \le \liminf_i \mu_{n_i}(G)$ for open $G$ by (29.2). Therefore (condition (iii) in Theorem 29.1), it suffices to produce a probability measure satisfying (29.2).

Clearly, $\alpha(H)$, defined by (29.1) for all $H$ in $\mathscr{H}$, has these properties:

\[ (29.3) \qquad \alpha(H_1) \le \alpha(H_2) \quad \text{if } H_1 \subset H_2; \]
\[ (29.4) \qquad \alpha(H_1 \cup H_2) = \alpha(H_1) + \alpha(H_2) \quad \text{if } H_1 \cap H_2 = \varnothing; \]
\[ (29.5) \qquad \alpha(H_1 \cup H_2) \le \alpha(H_1) + \alpha(H_2). \]

Define

\[ (29.6) \qquad \beta(G) = \sup[\,\alpha(H): H \subset G,\ H \in \mathscr{H}\,] \]

for open sets $G$, and then for arbitrary $A$ define

\[ \gamma(A) = \inf[\,\beta(G): A \subset G\,], \]

where $G$ ranges over open sets. Clearly, $\gamma(G) = \beta(G)$ for open $G$. Suppose it is shown that $\gamma$ is an outer measure and that each closed set is $\gamma$-measurable. In the notation of Theorem 11.1, it will then follow that $\mathscr{R}^k \subset \mathscr{M}(\gamma)$, and the restriction $\mu$ of $\gamma$ to $\mathscr{R}^k$ will be a measure satisfying $\mu(G) = \gamma(G) = \beta(G)$, and so (29.2) will hold for open $G$, as required. By tightness, there is for given $\epsilon$ an $\mathscr{H}$-set $H$ such that $\mu_n(H) > 1 - \epsilon$ for all $n$. But then $1 \ge \alpha(H) \ge 1 - \epsilon$ by (29.1), and $\mu(R^k) = \beta(R^k) = 1$ by (29.2).

The first step in proving that $\gamma$ is an outer measure is to show that $\beta$ is finitely subadditive (on open sets): If $H \subset G_1 \cup G_2$ and $H \in \mathscr{H}$, define $C_1 = [x \in H: \mathrm{dist}(x, G_1^c) \ge \mathrm{dist}(x, G_2^c)]$ and $C_2 = [x \in H: \mathrm{dist}(x, G_1^c) \le \mathrm{dist}(x, G_2^c)]$. If $x \in C_1$ and $x \notin G_1$, then $x \in G_2$, so that, since $G_2^c$ is closed, $\mathrm{dist}(x, G_2^c) > 0 = \mathrm{dist}(x, G_1^c)$, a contradiction. Thus $C_1 \subset G_1$; similarly, $C_2 \subset G_2$. If $x \in C_1$, there is an $\mathscr{A}$-set $A_x$ such that $x \in A_x^\circ \subset A_x \subset G_1$. These $A_x^\circ$ cover $C_1$, which is compact (a closed subset of the compact set $H$), and so finitely many of them cover $C_1$; the union $H_1$ of the corresponding sets $A_x$ is in $\mathscr{H}$ and satisfies $C_1 \subset H_1 \subset G_1$. Similarly, there is an $\mathscr{H}$-set $H_2$ for which $C_2 \subset H_2 \subset G_2$. But then $\alpha(H) \le \alpha(H_1 \cup H_2) \le \alpha(H_1) + \alpha(H_2) \le \beta(G_1) + \beta(G_2)$ by (29.3), (29.5), and (29.6). Since $H$ can be any $\mathscr{H}$-set in $G_1 \cup G_2$, another application of (29.6) gives $\beta(G_1 \cup G_2) \le \beta(G_1) + \beta(G_2)$.

Next, $\beta$ is countably subadditive (on open sets): If $H \subset \bigcup_n G_n$, then by compactness $H \subset \bigcup_{n \le n_0} G_n$ for some $n_0$, and therefore by finite subadditivity $\alpha(H) \le \beta(\bigcup_{n \le n_0} G_n) \le \sum_{n \le n_0} \beta(G_n) \le \sum_n \beta(G_n)$. Letting $H$ vary over the $\mathscr{H}$-sets in $\bigcup_n G_n$ gives $\beta(\bigcup_n G_n) \le \sum_n \beta(G_n)$.

And now $\gamma$ is an outer measure: Clearly, $\gamma$ is monotone. To prove it subadditive, suppose that the $A_n$ are arbitrary sets and $\epsilon > 0$. Choose open sets $G_n$ such that $A_n \subset G_n$ and $\beta(G_n) < \gamma(A_n) + \epsilon/2^n$. By the countable subadditivity of $\beta$, $\gamma(\bigcup_n A_n) \le \beta(\bigcup_n G_n) \le \sum_n \beta(G_n) \le \sum_n \gamma(A_n) + \epsilon$; $\epsilon$ being arbitrary, $\gamma(\bigcup_n A_n) \le \sum_n \gamma(A_n)$ follows.

It remains only to prove that each closed set is $\gamma$-measurable:

\[ (29.7) \qquad \gamma(A) \ge \gamma(A \cap C) + \gamma(A \cap C^c) \]

for $C$ closed and $A$ arbitrary (see (11.3)). To prove (29.7) it suffices to prove

\[ (29.8) \qquad \beta(G) \ge \gamma(G \cap C) + \gamma(G \cap C^c) \]

for open $G$, because then $G \supset A$ implies that $\beta(G) \ge \gamma(A \cap C) + \gamma(A \cap C^c)$, and taking the infimum over $G$ will give (29.7). To prove (29.8), for given $\epsilon$ choose an $\mathscr{H}$-set $H_1$ for which $H_1 \subset G \cap C^c$ and $\alpha(H_1) > \beta(G \cap C^c) - \epsilon$. Now choose an $\mathscr{H}$-set $H_2$ for which $H_2 \subset G \cap H_1^c$ and $\alpha(H_2) > \beta(G \cap H_1^c) - \epsilon$. Since $H_1$ and $H_2$ are disjoint and are contained in $G$, it follows by (29.4) that $\beta(G) \ge \alpha(H_1 \cup H_2) = \alpha(H_1) + \alpha(H_2) > \beta(G \cap C^c) + \beta(G \cap H_1^c) - 2\epsilon \ge \gamma(G \cap C^c) + \gamma(G \cap C) - 2\epsilon$. Hence (29.8). ∎

Obviously Theorem 29.3 implies that tightness is a sufficient condition that each subsequence of $\{\mu_n\}$ contain a further subsequence converging weakly to some probability measure. (An easy modification of the proof of Theorem 25.10 shows that tightness is necessary for this as well.) And clearly the corollary to Theorem 25.10 now goes through:

Corollary. If $\{\mu_n\}$ is a tight sequence of probability measures, and if each subsequence that converges weakly at all converges weakly to the probability measure $\mu$, then $\mu_n \Rightarrow \mu$.

†It extends to the general metric space; see LOÈVE, p. 196.
Characteristic Functions
Consider a random vector $X = (X_1, \ldots, X_k)$ and its distribution $\mu$ in $R^k$. Let $t \cdot x = \sum_{u=1}^k t_u x_u$ denote inner product. The characteristic function of $X$ and of $\mu$ is defined over $R^k$ by

\[ (29.9) \qquad \varphi(t) = E[e^{it \cdot X}] = \int_{R^k} e^{it \cdot x}\, \mu(dx). \]

To a great extent its properties parallel those of the one-dimensional characteristic function and can be deduced by parallel arguments.

The inversion formula (26.16) takes this form: For a bounded rectangle $A = [x: a_u < x_u \le b_u,\ u \le k]$ such that $\mu(\partial A) = 0$,

\[ (29.10) \qquad \mu(A) = \lim_{T \to \infty} \frac{1}{(2\pi)^k} \int_{B_T} \prod_{u=1}^k \frac{e^{-it_u a_u} - e^{-it_u b_u}}{it_u}\, \varphi(t)\, dt, \]

where $B_T = [t \in R^k: |t_u| \le T,\ u \le k]$ and $dt$ is short for $dt_1 \cdots dt_k$. To prove it, replace $\varphi(t)$ by the middle term in (29.9) and reverse the integrals as in (26.17): The integral in (29.10) is

\[ I_T = \int_{R^k} \prod_{u=1}^k \left[ \frac{1}{2\pi} \int_{-T}^T \frac{e^{it_u(x_u - a_u)} - e^{it_u(x_u - b_u)}}{it_u}\, dt_u \right] \mu(dx). \]

The inner integral may be evaluated by Fubini's theorem in $R^k$, which gives

\[ I_T = \int_{R^k} \prod_{u=1}^k \left[ \frac{\mathrm{sgn}(x_u - a_u)}{\pi}\, S(T|x_u - a_u|) - \frac{\mathrm{sgn}(x_u - b_u)}{\pi}\, S(T|x_u - b_u|) \right] \mu(dx). \]

Since the integrand converges to $\prod_{u=1}^k \psi_{a_u, b_u}(x_u)$ (see (26.18)), (29.10) follows as in the case $k = 1$.
The proof that weak convergence implies (iii) in Theorem 29.1 shows that for probability measures $\mu$ and $\nu$ on $R^k$ there exists a dense set $D$ of reals such that $\mu(\partial A) = \nu(\partial A) = 0$ for all rectangles $A$ whose vertices have coordinates in $D$. If $\mu(A) = \nu(A)$ for such rectangles, then $\mu$ and $\nu$ are identical by Theorem 3.3. Thus the characteristic function $\varphi$ uniquely determines the probability measure $\mu$.

Further properties of the characteristic function can be derived from the one-dimensional case by means of the following device of Cramér and Wold. For $t \in R^k$, define $h_t: R^k \to R^1$ by $h_t(x) = t \cdot x$. For real $\alpha$, $[x: t \cdot x \le \alpha]$ is a half-space, and its $\mu$-measure is

\[ (29.11) \qquad \mu[x: t \cdot x \le \alpha] = \mu h_t^{-1}(-\infty, \alpha]. \]

By change of variable, the characteristic function of $\mu h_t^{-1}$ is

\[ (29.12) \qquad \int_{R^1} e^{isy}\, \mu h_t^{-1}(dy) = \int_{R^k} e^{is(t \cdot x)}\, \mu(dx) = \varphi(st). \]

To know the $\mu$-measure of every half-space is (by (29.11)) to know each $\mu h_t^{-1}$ and hence is (by (29.12) for $s = 1$) to know $\varphi(t)$ for every $t$; and to know the characteristic function $\varphi$ of $\mu$ is to know $\mu$. Thus $\mu$ is uniquely determined by the values it gives to the half-spaces. This result, very simple in its statement, seems to require Fourier methods; no elementary proof is known.

If $\mu_n \Rightarrow \mu$ for probability measures on $R^k$, then $\varphi_n(t) \to \varphi(t)$ for the corresponding characteristic functions by Theorem 29.1. But suppose that the characteristic functions converge pointwise. It follows by (29.12) that for each $t$ the characteristic function of $\mu_n h_t^{-1}$ converges pointwise on the line to the characteristic function of $\mu h_t^{-1}$; by the continuity theorem for characteristic functions on the line, then, $\mu_n h_t^{-1} \Rightarrow \mu h_t^{-1}$. Take the $u$th component of $t$ to be 1 and the others 0; then the $\mu_n h_t^{-1}$ are the marginals for the $u$th coordinate. Since $\{\mu_n h_t^{-1}\}$ is weakly convergent, there is a bounded interval $(a_u, b_u]$ such that $\mu_n[x \in R^k: a_u < x_u \le b_u] = \mu_n h_t^{-1}(a_u, b_u] > 1 - \epsilon/k$ for all $n$. But then $\mu_n(A) > 1 - \epsilon$ for the bounded rectangle $A = [x: a_u < x_u \le b_u,\ u = 1, \ldots, k]$. The sequence $\{\mu_n\}$ is therefore tight. If a subsequence $\{\mu_{n_i}\}$ converges weakly to $\nu$, then $\varphi_{n_i}(t)$ converges to the characteristic function of $\nu$, which is therefore $\varphi(t)$. By uniqueness, $\nu = \mu$, so that $\mu_{n_i} \Rightarrow \mu$. By the corollary to Theorem 29.3, $\mu_n \Rightarrow \mu$. This proves the continuity theorem for $k$-dimensional characteristic functions: $\mu_n \Rightarrow \mu$ if and only if $\varphi_n(t) \to \varphi(t)$ for all $t$.
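As a quick sanity check of (29.12) (a sketch with an arbitrary discrete measure, not an example from the text): for $\mu$ concentrated on finitely many points of $R^2$, the characteristic function of $\mu h_t^{-1}$ at $s$ and $\varphi(st)$ can both be computed exactly and compared.

```python
import numpy as np

# A discrete probability measure mu on R^2: support points with weights
# (assumed illustrative values).
points = np.array([[0.0, 1.0], [2.0, -1.0], [-1.0, 0.5], [1.5, 2.0]])
weights = np.array([0.1, 0.4, 0.3, 0.2])

def cf(t):
    """phi(t) = integral of e^{i t.x} mu(dx) over R^2, as in (29.9)."""
    return np.sum(weights * np.exp(1j * (points @ t)))

def cf_projection(t, s):
    """Characteristic function of mu h_t^{-1}, h_t(x) = t.x: the
    integral of e^{isy} over the distribution of t.x."""
    y = points @ t
    return np.sum(weights * np.exp(1j * s * y))

t = np.array([0.7, -1.3])
for s in (0.5, 1.0, 3.0):
    # (29.12): the two sides agree up to floating-point roundoff
    assert abs(cf_projection(t, s) - cf(s * t)) < 1e-12
```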
The Cramér-Wold idea leads also to the following result, by means of which certain limit theorems can in a routine way be reduced to the one-dimensional case.

Theorem 29.4. For random vectors $X_n = (X_{n1}, \ldots, X_{nk})$ and $Y = (Y_1, \ldots, Y_k)$, a necessary and sufficient condition for $X_n \Rightarrow Y$ is that $\sum_{u=1}^k t_u X_{nu} \Rightarrow \sum_{u=1}^k t_u Y_u$ for each $(t_1, \ldots, t_k)$ in $R^k$.

PROOF. The necessity follows from a consideration of the continuous mapping $h_t$ above; use Theorem 29.2. As for sufficiency, the condition implies by the continuity theorem for one-dimensional characteristic functions that for each $(t_1, \ldots, t_k)$,

\[ E\left[ \exp\left( is \sum_{u=1}^k t_u X_{nu} \right) \right] \to E\left[ \exp\left( is \sum_{u=1}^k t_u Y_u \right) \right] \]

for all real $s$. Taking $s = 1$ shows that the characteristic function of $X_n$ converges pointwise to that of $Y$. ∎

Normal Distributions in $R^k$

By Theorem 20.4 there is (on some probability space) a random vector
$X = (X_1, \ldots, X_k)$ with independent components each having the standard normal distribution. Since each $X_u$ has density $e^{-x^2/2}/\sqrt{2\pi}$, $X$ has density (see (20.25))

\[ (29.13) \qquad f(x) = \frac{1}{(2\pi)^{k/2}} e^{-|x|^2/2}, \]

where $|x|^2 = \sum_{u=1}^k x_u^2$ ($|x|$ is the Euclidean norm). This distribution plays the role of the standard normal distribution in $R^k$. Its characteristic function is

\[ (29.14) \qquad E[e^{it \cdot X}] = \prod_{u=1}^k E[e^{it_u X_u}] = e^{-|t|^2/2}. \]

Let $A = [a_{uv}]$ be a $k \times k$ matrix and put $Y = AX$, where $X$ is viewed as a column vector. Since $E[X_\alpha X_\beta] = \delta_{\alpha\beta}$, the matrix $\Sigma = [\sigma_{uv}]$ of the covariances of $Y$ has entries $\sigma_{uv} = E[Y_u Y_v] = \sum_{\alpha=1}^k a_{u\alpha} a_{v\alpha}$. Thus $\Sigma = AA'$, where the prime denotes transpose. The matrix $\Sigma$ is symmetric and nonnegative definite: $\sum_{u,v} \sigma_{uv} x_u x_v = |A'x|^2 \ge 0$. View $t$ also as a column vector with transpose $t'$, and note that $t \cdot x = t'x$. The characteristic function of $AX$ is thus

\[ E[e^{it'AX}] = E[e^{i(A't) \cdot X}] = e^{-|A't|^2/2} = e^{-t'\Sigma t/2}. \]

Define a centered normal distribution as any probability measure whose characteristic function has this form for some symmetric nonnegative definite $\Sigma$. If $\Sigma$ is symmetric and nonnegative definite, then for an appropriate orthogonal matrix $U$, $U'\Sigma U = D$ is a diagonal matrix whose diagonal elements are the eigenvalues of $\Sigma$ and hence are nonnegative. If $D_0$ is the diagonal matrix whose elements are the square roots of those of $D$, and if $A = UD_0$, then $\Sigma = AA'$. Thus for every symmetric nonnegative definite $\Sigma$ there exists a unique centered normal distribution (namely the distribution of $AX$) with covariance matrix $\Sigma$ and characteristic function $\exp(-\frac{1}{2} t'\Sigma t)$.

If $\Sigma$ is nonsingular, so is the $A$ just constructed. Since $X$ has density (29.13), $Y = AX$ has by the Jacobian transformation formula (20.20) density $f(A^{-1}x)|\det A^{-1}|$. Since $\Sigma = AA'$, $|\det A^{-1}| = (\det \Sigma)^{-1/2}$. Moreover, $\Sigma^{-1} = (A')^{-1}A^{-1}$, so that $|A^{-1}x|^2 = x'\Sigma^{-1}x$. Thus the normal distribution has density $(2\pi)^{-k/2}(\det \Sigma)^{-1/2} \exp(-\frac{1}{2} x'\Sigma^{-1}x)$ if $\Sigma$ is nonsingular. If $\Sigma$ is singular, the $A$ constructed above must be singular as well, so that $AX$ is confined to some hyperplane of dimension $k - 1$ and the distribution can have no density.

If $M$ is a $j \times k$ matrix and $Y$ has in $R^k$ the centered normal distribution with covariance matrix $\Sigma$, then $MY$ has in $R^j$ the characteristic function $\exp(-\frac{1}{2}(M't)'\Sigma(M't)) = \exp(-\frac{1}{2} t'(M\Sigma M')t)$ for $t \in R^j$. Hence $MY$ has the centered normal distribution in $R^j$ with covariance matrix $M\Sigma M'$.
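The construction $\Sigma = AA'$ via $U'\Sigma U = D$ can be carried out numerically; the following sketch (the matrix $\Sigma$ is an arbitrary illustrative choice) uses an eigendecomposition and checks the identity $|A't|^2 = t'\Sigma t$ that underlies the characteristic-function computation.

```python
import numpy as np

# A symmetric nonnegative definite covariance matrix (assumed example).
sigma = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])

# U' sigma U = D with D diagonal and nonnegative; A = U D^{1/2}.
eigvals, U = np.linalg.eigh(sigma)
assert np.all(eigvals >= -1e-12)           # nonnegative definiteness
A = U @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None)))
assert np.allclose(A @ A.T, sigma)         # Sigma = A A'

# For Y = AX with X standard normal, the characteristic function is
# exp(-|A't|^2/2) = exp(-t' Sigma t / 2); check the quadratic form:
t = np.array([0.3, -0.2, 1.1])
assert np.isclose(np.linalg.norm(A.T @ t) ** 2, t @ sigma @ t)
```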
Thus a linear transformation of a normal distribution is itself normal. These normal distributions are special in that all the first moments vanish. The general normal distribution is a translation of one of these centered distributions.

The Central Limit Theorem
Let $X_n = (X_{n1}, \ldots, X_{nk})$ be independent random vectors all having the same distribution. Suppose that $E[X_{nu}^2] < \infty$; let the vector of means be $c = (c_1, \ldots, c_k)$, where $c_u = E[X_{nu}]$, and let the covariance matrix be $\Sigma = [\sigma_{uv}]$, where $\sigma_{uv} = E[(X_{nu} - c_u)(X_{nv} - c_v)]$. Put $S_n = X_1 + \cdots + X_n$.

Theorem 29.5. Under these assumptions, the distribution of the random vector $(S_n - nc)/\sqrt{n}$ converges weakly to the centered normal distribution with covariance matrix $\Sigma$.
PROOF. Let $Y = (Y_1, \ldots, Y_k)$ be a normally distributed random vector with 0 means and covariance matrix $\Sigma$. For given $t = (t_1, \ldots, t_k)$, let $Z_n = \sum_{u=1}^k t_u(X_{nu} - c_u)$ and $Z = \sum_{u=1}^k t_u Y_u$. By Theorem 29.4, it suffices to prove that $n^{-1/2} \sum_{j=1}^n Z_j \Rightarrow Z$ (for arbitrary $t$). But this is an instant consequence of the Lindeberg-Lévy theorem (Theorem 27.1). ∎
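A simulation sketch of Theorem 29.5 (the distribution of the summands is an arbitrary assumed choice, not from the text): for i.i.d. vectors in $R^2$ with dependent coordinates, the normalized sums $(S_n - nc)/\sqrt{n}$ should exhibit the covariance matrix $\Sigma$, and each Cramér-Wold projection $t \cdot Z$ the variance $t'\Sigma t$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative i.i.d. vectors in R^2: X = (U, U + V) for independent
# uniforms U, V on [0, 1], so c = (1/2, 1) and Sigma = [[1, 1], [1, 2]]/12.
c = np.array([0.5, 1.0])
sigma = np.array([[1.0, 1.0], [1.0, 2.0]]) / 12.0

n, m = 400, 5000                        # n terms per sum, m replicates
U = rng.random((m, n))
V = rng.random((m, n))
S = np.stack([U.sum(axis=1), (U + V).sum(axis=1)], axis=1)
Z = (S - n * c) / np.sqrt(n)            # (S_n - nc)/sqrt(n), m replicates

# The covariance of Z equals Sigma exactly for every n, so the sample
# covariance should match Sigma up to sampling error; by Theorem 29.4 the
# projection t.Z should have mean near 0 and variance near t' Sigma t.
assert np.allclose(np.cov(Z.T), sigma, atol=0.02)
t = np.array([1.0, -1.0])
proj = Z @ t
assert abs(proj.mean()) < 0.03
assert abs(proj.var() - t @ sigma @ t) < 0.02
```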
Skorohod's Theorem in $R^k$*

Suppose that $Y_n$ and $Y$ are $k$-dimensional random vectors and $Y_n \to Y$ with probability 1. If $G$ is open, then

\[ [Y \in G] \subset \liminf_n [Y_n \in G] \cup [Y_n \to Y]^c, \]

and Theorem 4.1 implies that $P[Y \in G] \le \liminf_n P[Y_n \in G]$, so that $Y_n \Rightarrow Y$ by Theorem 29.1. Thus convergence with probability 1 implies convergence in distribution.

Theorem 29.6. Suppose that $\mu_n$ and $\mu$ are probability measures on $(R^k, \mathscr{R}^k)$ and $\mu_n \Rightarrow \mu$. Then there exist random vectors $Y_n$ and $Y$ on a common probability space $(\Omega, \mathscr{F}, P)$ such that $Y_n$ has distribution $\mu_n$, $Y$ has distribution $\mu$, and $Y_n(\omega) \to Y(\omega)$ for each $\omega$.
PROOF. As in the proof of Theorem 25.6, the probability space $(\Omega, \mathscr{F}, P)$ will be taken to be the unit interval with Lebesgue measure. It will be convenient to take $\Omega = [0, 1)$, closed on the left and open on the right. Let $B_0$ be the set of real $s$ such that $\mu[x: x_i = s] > 0$ for some $i = 1, \ldots, k$, and let $B$ consist of all rational multiples of elements of $B_0$. Then $B$ is countable, and so there exists some positive $t$ outside $B$. By the construction the set

\[ (29.15) \qquad D = [\,n2^{-u}t: u = 0, 1, 2, \ldots;\ n = 0, \pm 1, \pm 2, \ldots\,] \]

is disjoint from $B_0$; of course, $D$ is dense on the line.

Let $[C_{i_1}^{(1)}: i_1 = 1, 2, \ldots]$ be the countable decomposition of $R^k$ into cubes of the form $[x: n_i t < x_i \le (n_i + 1)t,\ i = 1, \ldots, k]$ for integers $n_1, \ldots, n_k$. For each $i_1$ let $[C_{i_1 i_2}^{(2)}: i_2 = 1, \ldots, 2^k]$ be the decomposition of $C_{i_1}^{(1)}$ into $2^k$ cubes of side $t/2$. Now decompose each $C_{i_1 i_2}^{(2)}$ into $2^k$ cubes of side $t/2^2$. Continue inductively in this way to obtain for $u = 1, 2, \ldots$ decompositions $[C_{i_1 \cdots i_u}^{(u)}]$ of $R^k$ into cubes of side $t/2^{u-1}$ for which

\[ (29.16) \qquad C_{i_1 \cdots i_{u-1}}^{(u-1)} = \bigcup_{i=1}^{2^k} C_{i_1 \cdots i_{u-1} i}^{(u)}. \]

In all that follows, the first index $i_1$ ranges over the positive integers, and the other indices range over $1, 2, \ldots, 2^k$. Since (29.15) is disjoint from $B_0$,

\[ (29.17) \qquad \lim_n \mu_n(C_{i_1 \cdots i_u}^{(u)}) = \mu(C_{i_1 \cdots i_u}^{(u)}). \]

*This topic may be omitted.
Construction of $Y$. Next, decompose† $[0, 1)$ into subintervals $I_{i_1}^{(1)}$, $i_1 = 1, 2, \ldots$, in such a way that $P(I_{i_1}^{(1)}) = \mu(C_{i_1}^{(1)})$, where $\mu$ is the given limit measure on $R^k$. Now decompose† each $I_{i_1}^{(1)}$ into subintervals $I_{i_1 i_2}^{(2)}$, $i_2 = 1, \ldots, 2^k$, in such a way that $P(I_{i_1 i_2}^{(2)}) = \mu(C_{i_1 i_2}^{(2)})$. Continue inductively in this way to obtain for $u = 1, 2, \ldots$ decompositions $[I_{i_1 \cdots i_u}^{(u)}]$ of $[0, 1)$ into subintervals for which

\[ (29.18) \qquad I_{i_1 \cdots i_{u-1}}^{(u-1)} = \bigcup_{i=1}^{2^k} I_{i_1 \cdots i_{u-1} i}^{(u)} \]

and

\[ (29.19) \qquad P(I_{i_1 \cdots i_u}^{(u)}) = \mu(C_{i_1 \cdots i_u}^{(u)}). \]

In each $C_{i_1 \cdots i_u}^{(u)}$ choose an arbitrary point $x_{i_1 \cdots i_u}^{(u)}$. Define random vectors $Y^{(u)}$ on $[0, 1)$ by

\[ (29.20) \qquad Y^{(u)}(\omega) = x_{i_1 \cdots i_u}^{(u)} \quad \text{if } \omega \in I_{i_1 \cdots i_u}^{(u)}. \]

Since $I_{i_1 \cdots i_u i_{u+1}}^{(u+1)} \subset I_{i_1 \cdots i_u}^{(u)}$ and $C_{i_1 \cdots i_u i_{u+1}}^{(u+1)} \subset C_{i_1 \cdots i_u}^{(u)}$, the points $Y^{(u)}(\omega)$ and $Y^{(u+1)}(\omega)$ lie in the same element of the $u$th decomposition of $R^k$. Therefore, $|Y^{(u)}(\omega) - Y^{(u+1)}(\omega)| \le kt/2^{u-1}$. The sequence $\{Y^{(u)}(\omega)\}$ is thus fundamental in the Euclidean metric for each $\omega$, and so the limit

\[ (29.21) \qquad Y(\omega) = \lim_u Y^{(u)}(\omega) \]

exists. Note that

\[ (29.22) \qquad |Y(\omega) - Y^{(u)}(\omega)| \le kt/2^{u-2}. \]

The $Y^{(u)}$ and $Y$ are measurable mappings from the unit interval into $R^k$; they are random vectors.

†If $b - a = \theta_1 + \theta_2 + \cdots$ (finite sum or infinite series) and $\theta_i \ge 0$, then $I_i = [\,a + \sum_{j < i} \theta_j,\ a + \sum_{j \le i} \theta_j\,)$ decomposes $[a, b)$ into subintervals of lengths $P(I_i) = \theta_i$.
Applying (29.16), (29.20), (29.19), and (29.16) in succession gives

\[ P[\omega: Y^{(u+v)}(\omega) \in C_{i_1 \cdots i_u}^{(u)}] = \sum_{j_1 \cdots j_v} P[\omega: Y^{(u+v)}(\omega) \in C_{i_1 \cdots i_u j_1 \cdots j_v}^{(u+v)}] = \sum_{j_1 \cdots j_v} P(I_{i_1 \cdots i_u j_1 \cdots j_v}^{(u+v)}) = \sum_{j_1 \cdots j_v} \mu(C_{i_1 \cdots i_u j_1 \cdots j_v}^{(u+v)}) = \mu(C_{i_1 \cdots i_u}^{(u)}). \]

Therefore,

\[ \lim_v P[\omega: Y^{(v)}(\omega) \in C_{i_1 \cdots i_u}^{(u)}] = \mu(C_{i_1 \cdots i_u}^{(u)}). \]

If $A = [y: y_i \le x_i,\ i = 1, \ldots, k]$ and the $x_i$ all lie in the set (29.15), then $A$ is for some $u$ the union of countably many sets $C_{i_1 \cdots i_u}^{(u)}$, so that by Scheffé's theorem $\lim_v P[Y^{(v)} \in A] = \mu(A)$. Since (29.15) is dense, the distribution of $Y^{(v)}$ converges weakly to $\mu$.† Since $Y^{(v)} \to Y$ by (29.21), $\mu$ and the distribution of $Y$ must coincide.
Construction of the $Y_n$. The construction is as for $Y$. For $n, u \ge 1$, let $[I_{i_1 \cdots i_u}^{(n,u)}]$ be a decomposition of $[0, 1)$ into subintervals for which

\[ (29.23) \qquad I_{i_1 \cdots i_{u-1}}^{(n,u-1)} = \bigcup_{i=1}^{2^k} I_{i_1 \cdots i_{u-1} i}^{(n,u)} \]

and

\[ (29.24) \qquad P(I_{i_1 \cdots i_u}^{(n,u)}) = \mu_n(C_{i_1 \cdots i_u}^{(u)}). \]

This time arrange inductively* that $i < j$ if and only if $I_{i_1 \cdots i_{u-1} i}^{(n,u)}$ lies to the left of $I_{i_1 \cdots i_{u-1} j}^{(n,u)}$. Order the $(i_1, \ldots, i_u)$ lexicographically: $(i_1, \ldots, i_u) < (j_1, \ldots, j_u)$ means that there exists an $r$, $1 \le r \le u$, such that $i_s = j_s$ for $1 \le s < r$ and $i_r < j_r$. It follows by induction that $(i_1, \ldots, i_u) < (j_1, \ldots, j_u)$ if and only if $I_{i_1 \cdots i_u}^{(n,u)}$ lies to the left of $I_{j_1 \cdots j_u}^{(n,u)}$.

For the same $x_{i_1 \cdots i_u}^{(u)}$ as in (29.20), define $Y_n^{(u)}(\omega) = x_{i_1 \cdots i_u}^{(u)}$ for $\omega$ in $I_{i_1 \cdots i_u}^{(n,u)}$. As before, $|Y_n^{(u+1)}(\omega) - Y_n^{(u)}(\omega)| \le kt/2^{u-1}$, so that the limit

\[ (29.25) \qquad Y_n(\omega) = \lim_u Y_n^{(u)}(\omega) \]

exists, and $Y_n$ has distribution $\mu_n$.

†See the proof on p. 391 that $\mu_n \Rightarrow \mu$ implies (iii).
*The construction in the footnote on p. 400 has the property that $i < j$ if and only if $I_i$ lies to the left of $I_j$.
Convergence. It remains to consider the convergence of $Y_n(\omega)$ to $Y(\omega)$. Let

\[ I_{i_1 \cdots i_u}^{(n,u)} = [\,a_{i_1 \cdots i_u}^{(n,u)},\ b_{i_1 \cdots i_u}^{(n,u)}\,), \qquad I_{i_1 \cdots i_u}^{(u)} = [\,a_{i_1 \cdots i_u}^{(u)},\ b_{i_1 \cdots i_u}^{(u)}\,). \]

If $\sum'$ denotes summation over those $(j_1, \ldots, j_u)$ for which $(j_1, \ldots, j_u) < (i_1, \ldots, i_u)$, where $i_1, \ldots, i_u$ are fixed, then by (29.24),

\[ a_{i_1 \cdots i_u}^{(n,u)} = {\sum}' P(I_{j_1 \cdots j_u}^{(n,u)}) = {\sum}' \mu_n(C_{j_1 \cdots j_u}^{(u)}). \]

Since $\mu_n \Rightarrow \mu$ by hypothesis, it follows by (29.17) that this finite sum converges to

\[ {\sum}' \mu(C_{j_1 \cdots j_u}^{(u)}) = {\sum}' P(I_{j_1 \cdots j_u}^{(u)}) = a_{i_1 \cdots i_u}^{(u)}. \]

Thus $\lim_n a_{i_1 \cdots i_u}^{(n,u)} = a_{i_1 \cdots i_u}^{(u)}$ for each $u$ and $i_1, \ldots, i_u$. Similarly, $\lim_n b_{i_1 \cdots i_u}^{(n,u)} = b_{i_1 \cdots i_u}^{(u)}$. If $\omega$ is interior to $I_{i_1 \cdots i_u}^{(u)}$, it is therefore also interior to $I_{i_1 \cdots i_u}^{(n,u)}$ for large $n$, so that $Y^{(u)}(\omega) = Y_n^{(u)}(\omega)$. It follows by (29.22) and (29.25) that $|Y(\omega) - Y_n(\omega)| \le 2 \cdot kt/2^{u-2}$ for large $n$. Since $u$ was arbitrary, $\lim_n Y_n(\omega) = Y(\omega)$ if $\omega$ is, for each $u$, interior to the $I_{i_1 \cdots i_u}^{(u)}$ that contains it.

The endpoints of all the $I_{i_1 \cdots i_u}^{(u)}$ make up a countable set; redefining $Y_n(\omega) = Y(\omega) = 0$ on this set of Lebesgue measure 0 ensures that $Y_n(\omega) \to Y(\omega)$ for every $\omega$ and does not alter the distributions of $Y_n$ and $Y$. This completes the proof of Skorohod's theorem for general $k$. ∎

Example 29.1.
The B-method .
•
Suppose that
(29.26)  v_n(X_n − α, Y_n − β) ⇒ (X, Y),

where v_n → ∞, and suppose that f(x, y) is differentiable at (α, β) with partial derivatives f_1 and f_2. The problem is to prove that

(29.27)  v_n(f(X_n, Y_n) − f(α, β)) ⇒ f_1(α, β)X + f_2(α, β)Y.

By Skorohod's theorem in R² it is legitimate to proceed as though all these random variables were defined on the same probability space and (29.26) were ordinary convergence in R² for each sample point ω. Since f is differentiable, if ε_n and η_n go to 0, then f(α + ε_n, β + η_n) = f(α, β) + f_1(α, β)ε_n + f_2(α, β)η_n + θ_n(ε_n² + η_n²)^{1/2}, where θ_n → 0. Since v_n → ∞, (X_n(ω), Y_n(ω)) → (α, β) for each ω, and hence

v_n(f(X_n(ω), Y_n(ω)) − f(α, β)) = f_1(α, β)v_n(X_n(ω) − α) + f_2(α, β)v_n(Y_n(ω) − β) + θ_n(ω)[v_n²(X_n(ω) − α)² + v_n²(Y_n(ω) − β)²]^{1/2},
SECTION 29. LIMIT THEOREMS IN R^k    403
where θ_n(ω) → 0 for each ω. The expression converges to f_1(α, β)X(ω) + f_2(α, β)Y(ω) for each ω, and so (29.27) follows. Therefore, if v_n → ∞, (29.26) implies (29.27). The proof assumed more of the random variables than (29.26), but the result is true under the assumption of (29.26) and v_n → ∞ alone. This is another case of the delta method; see Example 27.2. ∎

PROBLEMS

29.1. A real function f on R^k is everywhere upper semicontinuous (see Problem 13.11) if for each x and ε there is a δ such that |x − y| < δ implies f(y) < f(x) + ε; f is lower semicontinuous if −f is upper semicontinuous.
(a) Use condition (iii) of Theorem 29.1, Fatou's lemma, and (21.9) to show that, if μ_n ⇒ μ and f is bounded and lower semicontinuous, then

(29.28)  liminf_n ∫ f dμ_n ≥ ∫ f dμ.

(b) Show that, if (29.28) holds for all bounded, lower semicontinuous functions f, then μ_n ⇒ μ.
(c) Prove the analogous results for upper semicontinuous functions.

29.2. (a) Show for probability measures on the line that μ_n × ν_n ⇒ μ × ν if and only if μ_n ⇒ μ and ν_n ⇒ ν.
(b) Suppose that X_n and Y_n are independent and that X and Y are independent. Show that, if X_n ⇒ X and Y_n ⇒ Y, then (X_n, Y_n) ⇒ (X, Y) and hence that X_n + Y_n ⇒ X + Y.
(c) Show that part (b) fails without independence.
(d) If F_n ⇒ F and G_n ⇒ G, then F_n * G_n ⇒ F * G. Prove this by part (b) and also by characteristic functions.

29.3. (a) Show that {μ_n} is tight if and only if for each ε there is a compact set K such that μ_n(K) > 1 − ε for all n.
(b) Show that {μ_n} is tight if and only if each of the k sequences of marginal distributions is tight on the line.

29.4. Assume of (X_n, Y_n) that X_n ⇒ X and Y_n ⇒ c. Show that (X_n, Y_n) ⇒ (X, c). This is an example of Problem 29.2(b) where X_n and Y_n need not be assumed independent.

29.5.
Let μ and ν be a pair of centered normal distributions on R². Suppose that all the variances are 1 but that the two covariances differ. Show that the mixture ½μ + ½ν is not normal even though its marginal distributions are.

29.6. Prove analogues for R^k of the corollaries to Theorem 26.3.

29.7. Devise a version of Lindeberg's theorem for R^k.

29.8. Suppose that estimates U and V of α and β are approximately normally distributed with means α and β and small variances σ_U² and σ_V², respectively. Suppose that U and V are independent and β > 0. Use Example 29.1 for f(x, y) = x/y to show that U/V is approximately normally distributed with mean α/β and variance σ_U²/β² + α²σ_V²/β⁴. A physical example is this: U estimates the charge α = e on the electron, V estimates the charge-to-mass ratio β = e/m, and U/V estimates the mass α/β = m.

29.9. To obtain a uniform distribution over the surface of the unit sphere in R^k, fill in the details in the following argument. Let X have the centered normal distribution with the identity as covariance matrix. For U orthogonal, UX has the same distribution. Let ψ(x) = x/|x| on R^k (take ψ(0) = 0, say). Then Y = ψ(X) lies on the unit sphere and, as UY = ψ(UX) for U orthogonal, the distribution of Y is invariant under rotations.

29.10. Assume that the covariance matrix

[ σ_11  σ_12 ]
[ σ_12  σ_22 ]

is positive definite, invert it explicitly, and show that the corresponding two-dimensional normal density is
(2π√D)^{−1} exp{−(σ_22 x² − 2σ_12 xy + σ_11 y²)/2D},

where D = σ_11 σ_22 − σ_12².

29.11. 21.15 29.10 ↑ Although uncorrelated random variables are not in general independent, they are independent if they are jointly normally distributed. Prove this.

29.12. Suppose that f(X) and g(Y) are uncorrelated for all bounded continuous f and g. Show that X and Y are independent. Hint: Use characteristic functions.

29.13. 20.22 ↑ Suppose that the random vector X has a centered k-dimensional normal distribution whose covariance matrix has 1 as an eigenvalue of multiplicity r and 0 as an eigenvalue of multiplicity k − r. Show that |X|² has the chi-squared distribution with r degrees of freedom.

29.14. ↑ Multinomial sampling. Let p_1, ..., p_k be positive and add to 1, and let Z_1, Z_2, ... be independent k-dimensional random vectors such that Z_n has with probability p_i a 1 in the ith component and 0's elsewhere. Then f_n = (f_{n1}, ..., f_{nk}) = Σ_{m=1}^n Z_m is the frequency count for a sample of size n from a multinomial population with cell probabilities p_i. Put X_{ni} = (f_{ni} − np_i)/√(np_i) and X_n = (X_{n1}, ..., X_{nk}).
(a) Show that X_n has mean values 0 and covariances σ_ij = (δ_ij p_j − p_i p_j)/√(p_i p_j).
(b) Show that the chi-squared statistic Σ_{i=1}^k (f_{ni} − np_i)²/np_i has asymptotically the chi-squared distribution with k − 1 degrees of freedom.
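Problem 29.14(b) can be sanity-checked by simulation. The sketch below draws multinomial samples and averages the chi-squared statistic, whose limiting chi-squared(k − 1) distribution has mean k − 1; the cell probabilities and sample sizes are illustrative choices, not from the text.

```python
import random

# Simulate the chi-squared statistic of Problem 29.14(b) and compare its
# average to k - 1, the mean of the limiting chi-squared distribution.
random.seed(7)
p = [0.2, 0.3, 0.5]            # illustrative cell probabilities, summing to 1
k, n, trials = len(p), 2000, 400

def chi_squared_stat():
    counts = [0] * k
    for _ in range(n):
        u, acc = random.random(), 0.0
        for i, pi in enumerate(p):
            acc += pi
            if u < acc:
                counts[i] += 1
                break
    return sum((counts[i] - n * p[i]) ** 2 / (n * p[i]) for i in range(k))

stats = [chi_squared_stat() for _ in range(trials)]
mean_stat = sum(stats) / trials
# The average should be near k - 1 = 2.
assert 1.5 < mean_stat < 2.5
```

With 400 replications the Monte Carlo error in the average is about 0.1, so the loose bounds above are comfortably satisfied.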
29.15. 29.9 ↑ A theorem of Poincaré. (a) Suppose that X_n = (X_{n1}, ..., X_{nn}) is uniformly distributed over the surface of a sphere of radius √n in R^n. Fix t and show that X_{n1}, ..., X_{nt} are in the limit independent, each with the standard normal distribution. Hint: If the components of Y_n = (Y_{n1}, ..., Y_{nn}) are independent, each with the standard normal distribution, then X_n has the same distribution as √n Y_n/|Y_n|.
(b) Suppose that the distribution of X_n = (X_{n1}, ..., X_{nn}) is spherically symmetric in the sense that X_n/|X_n| is uniformly distributed over the unit sphere. Assume that |X_n|²/n ⇒ 1 and show that X_{n1}, ..., X_{nt} are asymptotically independent and normal.

29.16. 19.13 29.15 ↑ Prove the result in Problem 29.15(a) by the method of Problem 19.12.

29.17. Let X_n = (X_{n1}, ..., X_{nk}), n = 1, 2, ..., be random vectors satisfying the mixing condition (27.25) with α_n = O(n^{−5}). Suppose that the sequence is stationary (the distribution of (X_n, ..., X_{n+j}) is the same for all n), that E[X_{nu}] = 0, and that the X_{nu} are uniformly bounded. Show that if S_n = X_1 + ··· + X_n, then S_n/√n has in the limit the centered normal distribution with covariances

σ_uv = E[X_{1u}X_{1v}] + Σ_{j=1}^∞ E[X_{1u}X_{1+j,v}] + Σ_{j=1}^∞ E[X_{1+j,u}X_{1v}].

Hint: Use the Cramér–Wold device.

29.18. ↑ As in Example 27.6, let {Y_n} be a Markov chain with finite state space S = {1, ..., k}, say. Suppose the transition probabilities p_uv are all positive and the initial probabilities p_u are the stationary ones. Let f_{nu} be the number of i for which 1 ≤ i ≤ n and Y_i = u. Show that the normalized frequency count n^{−1/2}(f_{n1} − np_1, ..., f_{nk} − np_k) has in the limit the centered normal distribution with covariances

σ_uv = δ_uv p_u − p_u p_v + Σ_{j=1}^∞ p_u(p_uv^(j) − p_v) + Σ_{j=1}^∞ p_v(p_vu^(j) − p_u).
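Before leaving Section 29, the linearization behind the δ-method of Example 29.1 can be checked deterministically. The sketch below takes the hypothetical choice f(x, y) = x/y (as in Problem 29.8) at an arbitrary point (α, β) and verifies that v_n(f(x_n, y_n) − f(α, β)) approaches f_1(α, β)a + f_2(α, β)b along x_n = α + a/v_n, y_n = β + b/v_n; all numerical values are illustrative assumptions.

```python
# Deterministic check of the delta-method linearization for f(x, y) = x/y.
alpha, beta, a, b = 1.0, 2.0, 3.0, -1.0   # illustrative values
f = lambda x, y: x / y
f1 = 1.0 / beta                # partial derivative of x/y in x at (alpha, beta)
f2 = -alpha / beta ** 2        # partial derivative of x/y in y at (alpha, beta)
limit = f1 * a + f2 * b        # the value predicted by (29.27)

for v in [10.0, 100.0, 1000.0]:
    scaled = v * (f(alpha + a / v, beta + b / v) - f(alpha, beta))
    # The error of the linear approximation shrinks like 1/v.
assert abs(scaled - limit) < 1e-2
```

The final `scaled` value (at v = 1000) is already within 10⁻² of the limit, reflecting the O(1/v_n) remainder in the differentiability expansion.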
SECTION 30. THE METHOD OF MOMENTS*

The Moment Problem

For some distributions the characteristic function is intractable but moments can nonetheless be calculated. In these cases it is sometimes possible

*This section may be omitted.
to prove weak convergence of the distributions by establishing that the moments converge. This approach requires conditions under which a distribution is uniquely determined by its moments, and this is for the same reason that the continuity theorem for characteristic functions requires for its proof the uniqueness theorem.
Theorem 30.1. Let μ be a probability measure on the line having finite moments α_k = ∫_{−∞}^{∞} x^k μ(dx) of all orders. If the power series Σ_k α_k r^k/k! has a positive radius of convergence, then μ is the only probability measure with the moments α_1, α_2, ....

PROOF. Let β_k = ∫_{−∞}^{∞} |x|^k μ(dx) be the absolute moments. The first step is to show that

(30.1)  β_k r^k/k! → 0,  k → ∞,

for some positive r. By hypothesis there exists an s, 0 < s < 1, such that α_k s^k/k! → 0. Choose 0 < r < s; then 2k r^{2k−1} ≤ s^{2k} for large k. Since |x|^{2k−1} ≤ 1 + |x|^{2k},

β_{2k−1} r^{2k−1}/(2k−1)! ≤ r^{2k−1}/(2k−1)! + β_{2k} s^{2k}/(2k)!

for large k. Hence (30.1) holds as k goes to infinity through odd values; since β_k = α_k for k even, (30.1) follows.
By (26.4),

|e^{itx}| |e^{ihx} − Σ_{k=0}^n (ihx)^k/k!| ≤ |hx|^{n+1}/(n+1)!,

and therefore the characteristic function φ of μ satisfies

|φ(t + h) − Σ_{k=0}^n (h^k/k!) ∫ e^{itx}(ix)^k μ(dx)| ≤ β_{n+1}|h|^{n+1}/(n+1)!.

By (26.10), the integral here is φ^(k)(t). By (30.1),

(30.2)  φ(t + h) = Σ_{k=0}^∞ (φ^(k)(t)/k!) h^k,  |h| < r.
If ν is another probability measure with moments α_k and characteristic function ψ(t), the same argument gives

(30.3)  ψ(t + h) = Σ_{k=0}^∞ (ψ^(k)(t)/k!) h^k,  |h| < r.

Take t = 0; since φ^(k)(0) = i^k α_k = ψ^(k)(0) (see (26.9)), φ and ψ agree in (−r, r) and hence have identical derivatives there. Taking t = r − ε and t = −r + ε in (30.2) and (30.3) shows that φ and ψ also agree in (−2r + ε, 2r − ε) and hence in (−2r, 2r). But then they must by the same argument agree in (−3r, 3r) as well, and so on.† Thus φ and ψ coincide, and by the uniqueness theorem for characteristic functions, so do μ and ν. ∎

A probability measure satisfying the conclusion of the theorem is said to be determined by its moments.
Example 30.1. For the standard normal distribution, |α_k| ≤ k!, and so the theorem implies that it is determined by its moments. ∎
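The standard normal moments used throughout this section are α_r = 1·3·5···(r − 1) for even r and 0 for odd r. A simple quadrature check of these values (a numerical sketch, not part of the text's proof):

```python
import math

# Trapezoid-rule computation of the standard normal moments, compared to
# the double factorial 1*3*5*...*(r-1) for even r and 0 for odd r.
def normal_moment(r, lo=-12.0, hi=12.0, steps=50000):
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0     # trapezoid endpoint weights
        total += w * (x ** r) * math.exp(-x * x / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)

def double_factorial(r):                         # 1*3*5*...*(r-1)
    out = 1
    for j in range(1, r, 2):
        out *= j
    return out

for r in range(1, 9):
    expected = double_factorial(r) if r % 2 == 0 else 0
    assert abs(normal_moment(r) - expected) < 1e-3
```

The odd moments vanish by symmetry of the integrand; the even ones match the double factorials to within the quadrature error.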
But not all measures are determined by their moments:

Example 30.2. If N has the standard normal density, then e^N has the log-normal density

f(x) = (1/(x√(2π))) e^{−(log x)²/2} if x > 0,  f(x) = 0 if x ≤ 0.

Put g(x) = f(x)(1 + sin(2π log x)). If

∫_0^∞ x^k f(x) sin(2π log x) dx = 0,  k = 0, 1, 2, ...,

then g, which is nonnegative, will be a probability density and will have the same moments as f. But a change of variable log x = s + k reduces the integral above to

(e^{k²/2}/√(2π)) ∫_{−∞}^{∞} e^{−s²/2} sin(2πs) ds,

which vanishes because the integrand is odd. ∎

†This process is a version of analytic continuation.
Suppose that the distribution of X is determined by its moments, that the Xn have moments of all orders, and that limnE [ X; ] E[ x r ] for r 1, 2, . . . . Then Xn X. PROOF. Let P. n and p. be the distributions of xn and X. Since E[ X,l ] converges, it is bounded, say by K. By Markov' s inequality, P[ I Xnl > x] s Kjx 2 , which implies that the sequence { P. n } is tight. Suppose that P. n k " and let Y be a random variable with distribution If u is an even integer exceeding r, the convergence and hence boundedness of E[Xn"1 implies that E [ X;k ] --+ E[Y r ] by the corollary to Theorem 25.12. By the hypothesis, then, E[Y r ] E[X r ]-or " and p. have
Theorem 30.2.
�
=
"
=
�
·
=
the same moments. Since p. is by hypothesis determined by its moments, " must be the same as p., and so P. n k � p.. The conclusion now follows by the corollary to Theorem 25.10. • Convergence to the log-normal distribution cannot be proved by estab lishing convergence of moments (take X to have density f and the Xn to have density g in Example 30.2). Because of Example 30.1, however, this approach will work for a normal limit.
Moment Generating Functions
Suppose that p. has a moment generating function M(s) for s e [ - s0 , s0 ], s0 > 0. By (21 .22), the hypothesis of Theorem 30.1 is satisfied, and so p. is determined by its moments, which are in turn determined by M(s) via (21 .23). Thus p. is determined by M(s) if it exists in a neighborhood of o.t The version of this for one-sided transforms was proved in Section 22-see Theorem 22.2. Suppose that P. n and p. have moment generating functions in a common interval [ - s 0 , s0 ], s 0 > 0, and suppose that Mn (s) --+ M(s) in this interval. Since P. n [( - a, a)c] S e - so a(Mn ( - s0 ) + Mn (s 0 )), it follows easily that { P.n } is tight. Since M(s) determines p., the usual argument now gives P. n � p.. Central Limit Theorem by Moments
To understand the application of the method of moments, consider on ce again a sum sn = xnl + . . . + xn k ' where xn l ' . . . ' xn k are independent n
t For another proof, analyticity.
see
n
Problem 26.7. The present proof does not require the idea of
SECTION
30.
409
THE METHOD OF MOMENTS
and kn
(30 .4)
s; = E on2k · k- 1
Suppose further that for each n there is an Mn such that I Xnk l < Mn , k == 1, . . . , k n' with probability 1. Finally, suppose that (30 .5) All moments exist,
(30.6)
andt
r r .I 1 1 = L " r Srn . E E ' r ' . . . r ' x n i l u · u •' 1· u- 1
•
•
•
II
xnri, '
where E' extends over the u-tuples of positive integers satisfying r1 + · · · + ru = r and E" extends over the u-tuples ( i 1 , , i u ) of distinct integers in the range 1 < ia S k n . By independence, then, •
•
•
( 30.7)
where (30. 8 )
A n ( rl · . . , ru ) = E " •
. . . E [ x:r. ] . x [ :,E ;}. ] n
and E' and E" have the same ranges as before. To prove that (30. 7)
converges to the r th moment of the standard normal distribution, it suffices to show that (30.9 )
if r1 = · · · = ru = otherwise.
2'
if r is even, all terms in (30.7) will then go to 0 except the one for which u = r/2 and ra = 2, which will go to r !j(r1 ! · · · ru !u!) = 1 · 3 · S ( r - 1 ). And if r is odd, the terms will go to 0 with out exception. Indeed, •
•
•
tTo deduce this from the multinomial formula, restrict the inner sum to u-tuples satisfying 1 S i1 < · · · < iu < k n and compensate by striking out the 1/u !.
410
CONVERGENCE OF DISTRIBUTIONS
= 1 for some then (30.9) holds because by (30.4) each summand in (30.8) vanishes. Suppose that r01 > 2 for each and r01 > 2 for some Then r > 2 u, and since I E[X;j] l < M� ra - 2>on� ' it follows that A n( r1 , • • • , ru ) < ( Mn/sn ) r- 2 "A n (2, . . . , 2). But this goes to 0 because (30.5) holds and because A n (2, . . . , 2) is bounded by 1 (it increases to 1 if the sum in (30.8) is enlarged to include all the u-tuples (il, . . . ' i u )). It remains only to check (30.9) for r1 = · · · = ru = 2. As just noted, A n (2, . . . , 2) is at most 1, and it differs from 1 by Es; 2 "on� , the sum If
a,
r01
a.
a
"
extending over the ( i 1 , • • • , i u ) with at least one repeated index. Since on2; < M;, the terms for example with i u = i u - I sum to at most Mn2s; 2 "Eon�1 • • • on�,_ 1 < M;s; 2 • Thus 1 A n (2, . . . , 2) < u 2M;s; 2 --+ 0. This proves that the moments (30.7) converge to those of the normal distribution and hence that Snfsn => N. -
Application to Sampling 1beory
Suppose that
n
numbers
not necessarily distinct, are associated with the elements of a population of size n . Suppose that these numbers are normalized by the requirement
( 30.10)
Mn
= maxn l x n h l · h<
An ordered sample Xn 1 , • • • , Xn k is taken, where the sampling is without replacement. By (30.10), E[ Xn k 1 = 0 and E[ X;k ] = 1 /n . Let s; = k n/n be the fraction of the population sampled. If the Xn k were independent, which they are not, Sn = Xn1 + · · · + Xn k would have variance s; . If kn is small in comparison with n , the effects of dependence should be small. It will be shown that Snfsn => N if II
II
( 30.11)
sn2
=
kn n
-+ 0 '
M __!!. sn
--+ 0 '
Since M; � n - 1 by (30.10), the second condition here in fact implies the third. The moments again have the form (30.7), but this time E[ X;} 1 • • • x:r, 1 cannot be factored as in (30.8). On the other hand, this expected value is by symmetry the same for each of the ( k n ) u = k n ( kn 1) · · · ( kn u + 1 ) -
-
SECTION
30.
411
THE METHOD OF MOMENTS
choices of the indices ia in the sum E". Thus
The problem again is to prove (30.9). 1 The proof goes by induction on u. Now A n ( r ) = kns; rn - Ej! _ 1 x�h' so that An (1) = 0 and An(2) = 1 . If r > 3, then l x�h l < M;- 2x;,h , and so IA,.( r ) l � ( M,/sn ) r - 2 -+ 0 by (30.1 1). Next suppose as induction hypothesis that (30.9) holds with u 1 in place of u. Since the sampling is without replacement, E [ X;t · · · x:: 1 = Ex�1h 1 x�.,/( n )u, where the summation extends over the u-tuples ( h 1 , , h u ) of distinct integers in the range 1 < h a < n . In this last sum enlarge the range b:y requiring of ( h 1 , h 2, , h u ) only that h 2, , h u be distinct, and then compensate by subtracting away the terms where h 1 = h 2, where h 1 = h 3, and so on. The result is -
•
•
•
•
•
•
•
_
•
•
•
�
( n ) u - t [ x r1... . . . x r1 + ra E n2 na ai.-2 (n) u
•
•
•
•
xr, nu] ·
•
This takes the place of the factorization made possible in (30.8) by the assumed independence there. It gives
By the induction hypothesis the last sum is bounded, and the factor in front goes to 0 by (30.11). As for the first term on the right, the factor in front goes to 1 . If r1 =I= 2, A n ( r1 ) -+ 0 and An( r2 , , ru ) is bounded, and so A,.(r1 , , ru ) -+ 0. The same holds by symmetry if ra =I= 2 for some a other , ru ) -+ 1 by than 1 . If r1 = · · · = ru = 2, then An( r1 ) = 1 , and A n ( r2 , the induction hypothesis. Thus (30.9) holds in all cases, and Sn!sn => N follows by the method of moments. •
•
•
•
•
•
• •
•
412
CONVERGENCE OF DISTRIBUTIONS
Application to Number Theory
Let g ( m) be the number of distinct prime factors of the integer m; for example g (3 4 5 2 ) = 2. Since there are infinitely many primes, g (m) is unbounded above; for the same reason, it drops back to 1 for infinitely many m (for the primes and their powers). Since g fluctuates in an irregular way, it is natural to inquire into its average behavior. On the space D of positive integers, let Pn be the probability measure that places mass 1/n at each of 1, 2, . . . , n, so that among the first n positive integers the proportion that are contained in a given set A is just Pn ( A ). The problem is to study Pn [m: g(m ) < x] for large n. If Bp (m) is 1 or 0 according as the prime p divides m or not, then •
(30 .12) Probability theory can be used to investigate this sum because under Pn the Bp ( m ) behave somewhat like independent random variables. t If p 1 , • • • , Pu are distinct primes, then by the fundamental theorem of arithmetic, BP 1 ( m ) = · · · = Bp ,( m ) = 1 -that is, each P; divides m-if and only if the product p 1 • • • Pu divides m. The probability under Pn of this is just n - 1 times the number of m in the range 1 < m < n that are multiples of p 1 • • • Pu , and this number is the integer part of njp 1 • • • Pu· Thus (30.13)
Pn [ m : Bp . ( m) = 1 , i = 1 , . . . , u ] = ! n I
•
[
n
P t . . . Pu
]
for distinct P; · Now let XP be independent random variables (on some probability space, one variable for each prime p) satisfying 1 P ( Xp = 1 ] = - ' p If
p 1 , • • • , pu are distinct, then P [ xp,· = 1 , ; = 1 , . . . , u ] =
(30 .14) For fixed
p1, • • •
,
Pu '
1
P t · · · Pu
(30.13) converges to (30.14) as
t See also Problems 2.1 5, 5.16, and 6.17.
.
n --+
oo .
Thus the
SECTION
30.
413
THE METHOD OF MOMENTS
behavior of the XP can serve as a guide to that of the BP (m). If m < n, (30. 12) is EP < nBp ( m ) because no prime exceeding m can divide it. The idea is to compare this sum with the corresponding sum E P n XP . This will require from number theory the elementary estimatet s
(30. 1 5 )
1
= log log x + 0 ( 1 ) . p p <x E
-
The mean and variance of Ep < n Xp are Ep � n P - 1 and Ep � n P - 1 (1 - p - 1 ) ; since EP p - 2 converges, each of these two sums is asymptotically log log n. Comparing Ep � n8p ( m ) with Ep s n Xp then leads one to conjecture the
Erdos-Kac centra/ limit theorem for the prime divisor function:
1beorem 30.3.
(30.16)
Pn
For all x,
[m : g ( m )
l
- log log n 1 � X --+ - � v�w y'log log n
jx -
oo
e - u 2 ;2 d
U.
PROOF.
The argument uses the method of moments. The first step is to show that (30.16) is unaffected if the range of p in (30.12) is further restricted. Let { an } be a sequence going to infinity slowly enough that
(30.17) but
log a n --+ log n
O
fast enough that
(30.18 ) Because of (30.15), these two requirements are met if, for example, log an = log njlog log n. Now define
(30.19) For
Bn ( m ) = E Bp ( m ) . p � an
a function f of positive integers, let
n En ( / ] = n - 1 E /( m ) m-1
'See, for example, Problem 1 8.18, or HARDY & WRIGHT, Chapter XXII.
414
CONVERGENCE OF DISTRIBUTIONS
denote its expected value computed with respect to Pn. By (30.13) for
u=
1,
By (30.18) and Markov' s inequality
Therefore (Theorem 25.4), (30.16) is unaffected if Bn ( m ) is substituted for
g(m).
Now compare (30.19) with the corresponding sum Sn mean and variance of sn are
=
Ep � a..Xp . The
cn = � � -p1 ' p � an
and each is log log n + o (log log n) 1 12 by (30.18). Thus (see Example 25.8), (30.16) with g(m) replaced as above is equivalent to (30.20)
[
�
pn m : 8n ( m - en
<
x] -+ � jxooe -uzl2 du.
It therefore suffices to prove (30.20). Since the XP are bounded, the analysis of the moments (30. 7) applies here. The only difference is that the summands in sn are indexed not by the integers k in the range k � kn but by the primes p in the range p < an; also, XP must be replaced by XP - p - 1 to center it. Thus the rth moment of ( Sn - cn)/sn converges to that of the normal distribution, and so (30.20) and (30.16) will follow by the method of moments if it is shown that as n ..._. oo , (30 .21 ) for each r. Now E[S;J is the sum (30.22)
,
r .' 1 ' E E r 1 • • r I u f E "E [ xr�P t 1• u u- 1 •
•
•
· · ·
xr�� Pu ] ,
SECTION
30.
THE METHOD OF MOMENTS
415
•here the range of E' is as in (30.6) and (30. 7) and E" extends over the u- tuples ( p 1 , , Pu ) of distinct primes not exceeding a n . Since XP assumes only the values 0 and 1, from the independence of the XP and the fact that the p ; are distinct, it follows that the summand in (30.22) is • • •
(30 .23 )
E [ XPl · · · X ] = p.
p1
•
1
• •
pu •
By the definition (30.19), En [ g;] is just (30.22) with the summand replaced by En lB;: · · · B;: 1 · Since Bp( m ) assumes only the values 0 and 1, from (30.13) and the fact that the P; are distinct, it follows that this summand is
{30. 24 )
En [ 8Pl
· · · BP u ]
1
= -n
[
n
]
. P t • . . Pu
But (30.23) and (30.24) differ by at most 1/n, and hence E[S;] and En [ g;] differ by at most the sum (30.22) with the summand replaced by 1/n. Therefore, (30.25)
I E [s:1 - En [s; 1 1 <
;( E 1 ) p s an
'
<
:� .
Now
,
E [ (sn - en) ' ] = E ( � ) E [ s: ] < - cn r - k . k -0
and
En [( gn - en ) ' ] has the analogous expansion. Comparing the two expan
sions
term for term and applying (30.25) shows that
(30.26)
, ( r } ank r k 1 r a < + = -n ( n en ) . k -e n n k =O � �
Since
en � a n , and since a�/n --+ 0 by (30.17), (30.21) follows as required.
•
The method of proof requires passing from (30.12) to (30.19). Without this, the a n on the right in (30.26) would instead be n, and it would not follow that the difference on the left goes to 0; hence the truncation (30.19)
416
CONVERGENCE OF DISTRIBUTIONS
for an a n small enough to satisfy (30.17). On the other hand, a n must be large enough to satisfy (30.18) in order that the truncation leave (30.16) unaffected. PROBLEMS 30.1. 30.2.
30.3. 30.4.
From the central limit theorem under the assumption (30.5) get the full Lindeberg theorem by a truncation argument. For a sample of size kn with replacement from a population of size n , the probability of no duplicates is llj.:.0 1 (1 - j/n). Under the assumption knlrn -+ 0 in addition to (30.10), deduce the asymptotic normality of sn by a reduction to the independent case. By adapting the proof of (21.24), show that the moment generating function of p. in an arbitrary interval determines p.. 25.13 30.3 f Suppose that the moment generating function of l'n con verges to that of p. in some interval. Show that l'n p.. Let 1-' be a probability measure on R k for which fR " l x; l rl-' ( dx ) < oo for i 1, . . . , k and r 1, 2, . . . . Consider the cross moments �
30.5.
==
==
for nonnegative integers r; . (a) Suppose for each i that
( 30.27) has a positive radius of convergence as a power series in fJ. Show that 1-' is determined by its moments in the sense that, if a probability measure " satisfies a( ,. , . . . , rk > 1 xp · · · xr" ,< dx > for a11 r1 , , rk , then , coin cides with 1-' · (b) Show that a k-dimensional normal distribution is determined by its moments. f Let 1-'n and 1-' be probability measures on R k . Suppose that for each i, (30.27) has a positive radius of convergence. Suppose that ==
30.6.
for all nonnegative integers r1 , . . . , rk . Show that 1-'n
• • •
�
1-' ·
30.7. 30.8.
30.
417 30.5 f n Suppose that X and Y are bounded random variables and that xm and y are uncorrelated for m, n 1, 2, . . . . Show that X and Y are independent. 26.17 30.6 f (a) In the notation (26.32), show for A =I= 0 that SECTION
THE METHOD OF MOMENTS
==
{ 30.28) for even r and that the mean is 0 for odd r. It follows by the method of moments that cos Ax has a distribution in the sense of (25.23), and in fact of course the relative measure is (30.29)
p[
x : cos Ax s u ]
==
1 1 - - arccos u,
-1 < u < 1.
'IT
(b) Suppose that A1 , A 2 , . . . are linearly independent over the field of
rationals in the sense that, if n 1 A1 + ··· n m 0. Show that n1 =
==
···
+ nm Am
==
==
0 for integers n,, then
(30.30) for nonnegative integers r1 , . . . , rk . (c) Let X1 , X2 , . • • be independent and have the distribution function on the right in (30.29). Show that { 30.31) (d)
Show that
( 30.32)
30.9. 30. 10.
For a signal that is the sum of a large number of pure cosine signals with incommensurable frequencies, (30.32) describes the relative amount of time the signal is between u1 and u 2 • 6.17 f From (30.16), deduce once more the Hardy-Ramanujan theorem (see (6.14)). f (a) Prove that (if Pn puts probability 1/n at 1, . . . , n) { 30.33)
[
log log m - log log n lim pn m : n ,flog log n
>
£]
=
O
.
418
CONVERGENCE OF DISTRIBUTIONS
(b) From (30.16) deduce that (see (2 . 21 ) for the notation)
{ 30 .34) 30.1 1.
v [ m : g( m ) {
x]
- log log m � log log m
==
fx
1
J2
w
-
oo
e - u 2 12 du .
f Let G( m) be the number of prime factors in m with multiplicity counted. In the notation of Problem 5.16, G( m) == EP aP (m ). (a) Show for k > 1 that Pn [m: ap (m) - Bp (m) > /C) � 1/p k + I ; hence
En [ a� - Bp ] < 2/p 2 • (b) Show that En [ G - g] is bounded. (c)
Deduce from (30.16) that
[Pn m .
. G { m ) - loglog n
(d)
30.12.
f
y'log log n
S
X] �
1
� v w
2
jx -
oo
Prove for G the analogue of (30.34). Prove the Hardy-Ramanujan theorem in the form
[
m v m : g( ) - 1 log log m Prove this with G in place of g.
�
(]
== 0.
e - u2; 2 dU .
C H APTER
6
Derivatives and Conditional Probability
SECI10N 31. DERIVATIVES ON THE LINE *
This section on Lebesgue's theory of derivatives for real functions of a real variable serves to introduce the general theory of Radon-Nikodym deriva tives, which underlies the modem theory of conditional probability. The results here are interesting in themselves and will be referred to later for purposes of illustration and comparison, but they will not be required in subsequent proofs. The Fundamental Theorem of Calculus
To what extent are the operations of integration and differentiation inverse to one another? A function F is by definition an indefinite integral of another function f on [a, b) if
(31 .1) for
a
a � x � b; F is by definition a primitive of f if it has derivative f:
(31 .2) for
x F( x) - F( a) = j f ( t ) dt
F' ( x) = / ( x)
a � x < b . According to the fundamental theorem of calculus (see
* This section may be omitted.
419
420
DERIVATIVES AND CONDITIONAL PROBABILITY
(17.5)), these concepts coincide in the case of continuous f: Suppose that f is continuous on [ a , b). (i) An indefinite integral off is a primitive off: if (31.1) holds for all in [ a , b ], then so does (31.2). (ii) A primitive off is an indefinite integral off: if (31.2) holds for all in [ a, b ], then so does (31.1). Theorem 31.1.
x x
A basic problem is to investigate the extent to which this theorem holds if
f is not assumed continuous. First consider part (i). Suppose f is integrable, so that the right side of (31.1) makes sense. If f is 0 for x < m and 1 for x > m (a < m < b), then an F satisfying (31.1) has no derivative at m. It is thus too much to ask that (31.2) hold for all x. On the other hand, according to a famous theorem of Lebesgue, if (31.1) holds for all x, then (31.2) holds almost everywhere-that is, except for x in a set of Lebesgue measure 0. In this section almost everywhere will refer to Lebesgue measure
only. This result, the most one could hope for, will be proved below (Theorem 31 . 3) . Now consider part (ii) of Theorem 31.1. Suppose that (31.2) holds almost everywhere, as in Lebesgue' s theorem, just stated. Does (31.1) follow? The answer is no: If f is identically 0, and if F(x) is 0 for x < m and 1 for x > m (a < m < b), then (31.2) holds almost everywhere, but (31.1) fails for x > m. The question was wrongly posed, and the trouble is not far to seek: If f is integrable and (31.1) holds, then
( 31 .3)
F( x + h ) - F( x ) =
J\x . x + h ) ( t ) f( t ) dt -+ a
0
as h � 0 by the dominated convergence theorem. Together with a similar argument for h j 0 this shows that F must be continuous. Hence the question becomes this: If F is continuous and f is integrable, and if (31.2) holds almost everywhere, does (31.1) follow? The answer strangely enough is still no: In Example 31.1 there is constructed a continuous, strictly increas ing F for which F '( x) = 0 except on a set of Lebesgue measure 0, an d (31.1) is of course impossible if f vanishes almost everywhere and F is strictly increasing. This leads to the problem of characterizing those F for which (31.1) does follow if (31.2) holds outside a set of Lebesgue measure 0 and f is integrable. In other words, which functions are the integrals of their (almost everywhere) derivatives? Theorem 31.8 gives the characteriza tion. It is possible to extend part (ii) of Theorem 31.1 in a different direction. Suppose that (3 1. 2) holds for every x, not just almost everywhere. In Example 17.4 there was
SECTION
31.
421
DERIVATIVES ON THE LINE
given a function F, everywhere differentiable, whose derivative I is not integrable, and in this case the right side of (3 1 . 1) has no meaning. If, however, (3 1 . 2) holds for every x , and if I is integrable, then (3 1 . 1) does hold for all x . For most purposes of probability theory, it is natural to imfose conditions only almost everywhere, and so this theorem will not be proved here.
The program then is first to show that (31.1) with f integrable implies that (31.2) holds almost everywhere, and second to characterize those F for which the reverse implication is valid. For the most part, f will be non negative and F will be nondecreasing. This is the case of greatest interest for probability theory; F can be regarded as a distribution function and f as a density. In Chapters 4 and 5 many distribution functions F were either shown to have a density f with respect to Lebesgue measure or were assumed to have one, but such F ' s were never intrinsically characterized, as they will be in this section. Derivatives of Integrals
The first step is to show that a nondecreasing function has a derivative almost everywhere. This requires two preliminary results. Let λ denote Lebesgue measure.
Lemma 1. Let A be a bounded linear Borel set, and let 𝒥 be a collection of open intervals covering A. Then 𝒥 has a finite, disjoint subcollection I₁, …, I_k for which Σ_{i=1}^k λ(I_i) > λ(A)/6.

PROOF. By regularity (Theorem 12.3) A contains a compact subset K satisfying λ(K) > λ(A)/2. Choose in 𝒥 a finite subcollection 𝒥₀ covering K. Let I₁ be an interval in 𝒥₀ of maximal length; discard from 𝒥₀ the interval I₁ and all the others that intersect I₁. Among the intervals remaining in 𝒥₀, let I₂ be one of maximal length; discard I₂ and all intervals that intersect it. Continue this way until 𝒥₀ is exhausted. The I_i are disjoint. Let Ĩ_i be the interval with the same midpoint as I_i and three times the length. If I is an interval in 𝒥₀ that is cast out because it meets I_i, then I ⊂ Ĩ_i. Thus each discarded interval is contained in one of the Ĩ_i, and so the Ĩ_i cover K. Hence Σλ(I_i) = Σλ(Ĩ_i)/3 ≥ λ(K)/3 > λ(A)/6. ∎
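The greedy step in this proof can be run directly. Below is a small sketch (function names and the sample cover are mine, not the text's): it keeps a longest interval, discards everything meeting it, repeats, and then checks the two facts the proof uses, namely that the selected intervals are disjoint and that tripling them swallows every discarded interval.

```python
def greedy_disjoint(intervals):
    """Greedy step from Lemma 1: repeatedly keep a longest interval
    and discard every interval that intersects it."""
    pool = sorted(intervals, key=lambda iv: iv[1] - iv[0], reverse=True)
    chosen = []
    while pool:
        a, b = pool.pop(0)                    # longest remaining interval
        chosen.append((a, b))
        # keep only the intervals disjoint from (a, b)
        pool = [(c, d) for (c, d) in pool if d <= a or c >= b]
    return chosen

def triple(iv):
    """Interval with the same midpoint and three times the length."""
    a, b = iv
    m, r = (a + b) / 2, (b - a) / 2
    return (m - 3 * r, m + 3 * r)

cover = [(0.0, 1.0), (0.5, 1.2), (1.1, 1.3), (0.9, 2.0), (1.9, 2.05)]
chosen = greedy_disjoint(cover)

# chosen intervals are pairwise disjoint
disjoint = all(b1 <= a2 for (a1, b1), (a2, b2)
               in zip(sorted(chosen), sorted(chosen)[1:]))
# every original interval lies inside some tripled chosen interval
covered = all(any(ta <= c and d <= tb for (ta, tb) in map(triple, chosen))
              for (c, d) in cover)
length = sum(b - a for a, b in chosen)
```

On this cover the single interval (0.9, 2.0) is selected; its triple contains all five intervals, and its length 1.1 exceeds a sixth of λ of the union, as the lemma promises.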
If

(31.4)   a = a₀ < a₁ < ⋯ < a_k = b

is a partition of an interval [a, b] and F is a function over [a, b], let

(31.5)   ‖F‖_Δ = Σ_{i=1}^k |F(a_i) − F(a_{i−1})|.

† For a proof, see RUDIN¹, p. 179.
Lemma 2. Consider a partition (31.4) and a positive θ. If

(31.6)   F(a) ≤ F(b),

and if

(31.7)   [F(a_i) − F(a_{i−1})]/(a_i − a_{i−1}) < −θ

for a set of intervals [a_{i−1}, a_i] of total length d, then ‖F‖_Δ ≥ |F(b) − F(a)| + 2θd. This also holds if the inequalities in (31.6) and (31.7) are reversed and −θ is replaced by θ in the latter.

PROOF. The figure shows the case where k = 2 and the left-hand interval satisfies (31.7): F falls at least θd over [a, a + d], rises the same amount over [a + d, u], and then rises F(b) − F(a) over [u, b].

For the general case, let Σ′ denote summation over those i satisfying (31.7) and let Σ″ denote summation over the remaining i (1 ≤ i ≤ k). Then

‖F‖_Δ = Σ′(F(a_{i−1}) − F(a_i)) + Σ″|F(a_i) − F(a_{i−1})|
   ≥ Σ′(F(a_{i−1}) − F(a_i)) + |Σ″(F(a_i) − F(a_{i−1}))|
   = Σ′(F(a_{i−1}) − F(a_i)) + |(F(b) − F(a)) + Σ′(F(a_{i−1}) − F(a_i))|.

As all the differences in this last expression are nonnegative, the absolute
value bars can be suppressed; therefore,

‖F‖_Δ ≥ F(b) − F(a) + 2Σ′(F(a_{i−1}) − F(a_i)) ≥ F(b) − F(a) + 2θΣ′(a_i − a_{i−1}). ∎

A function F has at each x four derivates, the upper and lower right derivates

D̄F(x) = lim sup_{h↓0} [F(x + h) − F(x)]/h,   D̲F(x) = lim inf_{h↓0} [F(x + h) − F(x)]/h,

and the upper and lower left derivates

F D̄(x) = lim sup_{h↓0} [F(x) − F(x − h)]/h,   F D̲(x) = lim inf_{h↓0} [F(x) − F(x − h)]/h.
There is a derivative at x if and only if these four quantities have a common value.

Suppose that F has derivative F′(x) at x. If u < x < v, then

[F(v) − F(u)]/(v − u) − F′(x)
   = [(v − x)/(v − u)]·([F(v) − F(x)]/(v − x) − F′(x))
   + [(x − u)/(v − u)]·([F(x) − F(u)]/(x − u) − F′(x)).

Therefore,

(31.8)   [F(v) − F(u)]/(v − u) → F′(x)

as u ↑ x and v ↓ x; that is to say, for each ε there is a δ such that u < x < v and 0 < v − u < δ together imply that the quantities on either side of the arrow differ by less than ε.

Suppose that F is measurable and that it is continuous except possibly at countably many points. This will be true if F is nondecreasing or is the
difference of two nondecreasing functions. Let M be a countable, dense set containing all the discontinuity points of F, and let r_n(x) be the smallest number of the form k/n exceeding x. Then

D̄F(x) = lim_n sup[ (F(y) − F(x))/(y − x) : y ∈ M, x < y < r_n(x) ];

the function inside the limit is measurable because the x-set where it exceeds a is

∪_{y∈M} [x: x < y < r_n(x), F(y) − F(x) > a(y − x)].

Thus D̄F(x) is measurable, as are the other three derivates. This does not exclude infinite values.

The following theorems concern real functions defined over the whole line. They could also be formulated for functions over an interval.
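The four derivates are easy to probe numerically. In the sketch below (all names mine), max and min over a finite grid of h = 2^−k stand in, only approximately, for lim sup and lim inf; at x = 0 the function F(x) = |x| has right derivates +1 and left derivates −1, so the four values disagree and there is no derivative.

```python
def derivates(F, x, ks=range(5, 30)):
    """Approximate the four derivates of F at x over h = 2**-k.
    (max/min over a finite grid only approximate lim sup / lim inf.)"""
    hs = [2.0 ** -k for k in ks]
    right = [(F(x + h) - F(x)) / h for h in hs]
    left = [(F(x) - F(x - h)) / h for h in hs]
    # upper right, lower right, upper left, lower left
    return max(right), min(right), max(left), min(left)

# F(x) = |x| at 0: right quotients are all +1, left quotients are all -1,
# so the four derivates do not share a common value: no derivative at 0.
vals = derivates(abs, 0.0)
```

For a function differentiable at x, all four approximations would cluster around F′(x) instead.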
Theorem 31.2. A nondecreasing function F is differentiable except on a set of Lebesgue measure 0. The derivative F′ is measurable† and nonnegative, and

(31.9)   ∫_a^b F′(t) dt ≤ F(b) − F(a)

for all a and b.

† Outside some Borel set A such that λ(A) = 0, F has a finite derivative F′ = D̄F; since D̄F is measurable, F′ is measurable if it is set equal to 0 (say) on A.

PROOF. If it can be shown that

(31.10)   D̄F(x) ≤ F D̲(x)

except on a set of Lebesgue measure 0, then by the same result applied to G(x) = −F(−x) it will follow that F D̄(x) = D̄G(−x) ≤ G D̲(−x) = D̲F(x) almost everywhere. This will imply that D̲F(x) ≤ D̄F(x) ≤ F D̲(x) ≤ F D̄(x) ≤ D̲F(x) almost everywhere, since the first and third of these inequalities are obvious, and so outside a set of Lebesgue measure 0, F will have a derivative, possibly infinite. Since F is nondecreasing, F′ must be nonnegative, and once (31.9) is proved, it will follow that F′ is finite almost everywhere.

If (31.10) is violated for a particular x, then for some pair α, β of rationals satisfying α < β, x will lie in the set A_{αβ} = [x: F D̲(x) < α < β <
D̄F(x)]. Since there are only countably many of these sets, (31.10) will hold outside a set of Lebesgue measure 0 if λ(A_{αβ}) = 0 for all α and β.

Put G(x) = F(x) − ½(α + β)x and θ = ½(β − α). Since differentiation is linear, A_{αβ} = B_θ = [x: G D̲(x) < −θ < θ < D̄G(x)]. Since F and G have only countably many discontinuities, it suffices to prove that λ(C_θ) = 0, where C_θ is the set of points in B_θ that are continuity points of G.

Consider an interval (a, b), and suppose for the moment that G(a) ≤ G(b). For each x in C_θ satisfying a < x < b, from G D̲(x) < −θ it follows that there exists an open interval (a_x, b_x) for which x ∈ (a_x, b_x) ⊂ (a, b) and

(31.11)   [G(b_x) − G(a_x)]/(b_x − a_x) < −θ.

There exists by Lemma 1 a finite, disjoint collection (a_{x_i}, b_{x_i}) of these intervals of total length Σ(b_{x_i} − a_{x_i}) > λ((a, b) ∩ C_θ)/6. Let Δ be the partition (31.4) of [a, b] with the points a_{x_i} and b_{x_i} in the role of a₁, …, a_{k−1}. By Lemma 2,

(31.12)   ‖G‖_Δ ≥ |G(b) − G(a)| + (θ/3)λ((a, b) ∩ C_θ).

If instead of G(a) ≤ G(b) the reverse inequality holds, choose a_x and b_x so that the ratio in (31.11) exceeds θ, which is possible because D̄G(x) > θ for x ∈ C_θ. Again (31.12) follows.

In each interval [a, b] there is thus a partition (31.4) satisfying (31.12). Apply this to each interval [a_{i−1}, a_i] in the partition. This gives a partition Δ₁ that refines Δ, and adding the corresponding inequalities (31.12) leads to

‖G‖_{Δ₁} ≥ ‖G‖_Δ + (θ/3)λ((a, b) ∩ C_θ).

Continuing leads to a sequence of successively finer partitions Δ_n such that

(31.13)   ‖G‖_{Δ_n} ≥ (nθ/3)λ((a, b) ∩ C_θ).

Now ‖G‖_Δ is bounded by |F(b) − F(a)| + ½|α + β|(b − a) because F is monotonic. Thus (31.13) is impossible unless λ((a, b) ∩ C_θ) = 0. Since (a, b) can be any interval, λ(C_θ) = 0. This proves (31.10) and establishes the differentiability of F almost everywhere.

It remains to prove (31.9). Let

(31.14)   f_n(x) = [F(x + n⁻¹) − F(x)]/n⁻¹.
Now f_n is nonnegative, and by what has been shown, f_n(x) → F′(x) except on a set of Lebesgue measure 0. By Fatou's lemma and the fact that F is nondecreasing,

∫_a^b F′(x) dx ≤ lim inf_n ∫_a^b f_n(x) dx = lim inf_n [ n∫_b^{b+n⁻¹} F(x) dx − n∫_a^{a+n⁻¹} F(x) dx ]
   ≤ lim inf_n [F(b + n⁻¹) − F(a)] = F(b+) − F(a).

Replacing b by b − ε and letting ε → 0 gives (31.9). ∎
Theorem 31.3. If f is nonnegative and integrable, and if F(x) = ∫_{−∞}^x f(t) dt, then F′(x) = f(x) except on a set of Lebesgue measure 0.

Since f is nonnegative, F is nondecreasing and hence by Theorem 31.2 is differentiable almost everywhere. The problem is to show that the derivative F′ coincides with f almost everywhere.

PROOF FOR BOUNDED f. Suppose first that f is bounded by M. Define f_n by (31.14). Then f_n(x) = n∫_x^{x+n⁻¹} f(t) dt is bounded by M and converges almost everywhere to F′(x), so that the bounded convergence theorem gives

∫_a^b F′(x) dx = lim_n ∫_a^b f_n(x) dx = lim_n [ n∫_b^{b+n⁻¹} F(x) dx − n∫_a^{a+n⁻¹} F(x) dx ].

Since F is continuous (see (31.3)), this last limit is F(b) − F(a) = ∫_a^b f(x) dx. Thus ∫_A F′(x) dx = ∫_A f(x) dx for bounded intervals A = (a, b]. Since these form a π-system, it follows (see Example 16.9) that F′ = f almost everywhere. ∎

PROOF FOR INTEGRABLE f. Apply the result for bounded functions to f truncated at n: If h_n(x) is f(x) or n as f(x) ≤ n or f(x) > n, then H_n(x) = ∫_{−∞}^x h_n(t) dt differentiates almost everywhere to h_n(x) by the case already treated. Now F(x) = H_n(x) + ∫_{−∞}^x (f(t) − h_n(t)) dt; the integral here is nondecreasing because the integrand is nonnegative, and it follows by Theorem 31.2 that it has almost everywhere a nonnegative derivative. Since differentiation is linear, F′(x) ≥ H_n′(x) = h_n(x) almost everywhere.
As n was arbitrary, F′(x) ≥ f(x) almost everywhere, and so ∫_a^b F′(x) dx ≥ ∫_a^b f(x) dx = F(b) − F(a). But the reverse inequality is a consequence of (31.9). Therefore, ∫_a^b (F′(x) − f(x)) dx = 0, and as before F′ = f except on a set of Lebesgue measure 0. ∎
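Theorem 31.3 can be illustrated numerically: for an indefinite integral F of a density f, the difference quotients (31.14) converge to f(x) at points where f is continuous. A toy check, with the density 3t² and the sample point chosen by me:

```python
def f(t):
    return 3.0 * t * t        # a density on [0, 1]; its integral is x**3

def F(x):
    return x ** 3             # F(x) = integral of f from 0 to x

def f_n(x, n):
    """The difference quotient (31.14) at scale 1/n."""
    h = 1.0 / n
    return (F(x + h) - F(x)) / h

x = 0.37
errors = [abs(f_n(x, n) - f(x)) for n in (10, 100, 1000)]
# the error shrinks roughly like 1/n, consistent with f_n(x) -> F'(x) = f(x)
```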
Singular Functions
If f(x) is nonnegative and integrable, differentiating its indefinite integral ∫_{−∞}^x f(t) dt leads back to f(x) except perhaps on a set of Lebesgue measure 0. That is the content of Theorem 31.3. The converse question is this: If F(x) is nondecreasing and hence has almost everywhere a derivative F′(x), does integrating F′(x) lead back to F(x)? As stated before, the answer turns out to be no even if F(x) is assumed continuous:

Example 31.1. Let X₁, X₂, … be independent, identically distributed random variables such that P[X_n = 0] = p₀ and P[X_n = 1] = p₁ = 1 − p₀, and let X = Σ_{n=1}^∞ X_n 2⁻ⁿ. Let F(x) = P[X ≤ x] be the distribution function of X. For an arbitrary sequence u₁, u₂, … of 0's and 1's, P[X_n = u_n, n = 1, 2, …] = lim_n p_{u₁} ⋯ p_{u_n} = 0; since x can have at most two dyadic expansions x = Σ_n u_n 2⁻ⁿ, P[X = x] = 0. Thus F is everywhere continuous. Of course, F(0) = 0 and F(1) = 1.

For 0 ≤ k < 2ⁿ, k2⁻ⁿ has the form Σ_{i=1}^n u_i 2⁻ⁱ for some n-tuple (u₁, …, u_n) of 0's and 1's. Since F is continuous,

(31.15)   F((k + 1)2⁻ⁿ) − F(k2⁻ⁿ) = P[X_i = u_i, i ≤ n] = p_{u₁} ⋯ p_{u_n}.

This shows that F is strictly increasing over the unit interval. If p₀ = p₁ = ½, the right side of (31.15) is 2⁻ⁿ, and a passage to the limit shows that F(x) = x for 0 ≤ x ≤ 1. Assume, however, that p₀ ≠ p₁. It will be shown that F′(x) = 0 except on a set of Lebesgue measure 0 in this case.

Obviously the derivative is 0 outside the unit interval, and by Theorem 31.2 it exists almost everywhere inside it. Suppose then that 0 < x < 1 and that F has a derivative F′(x) at x. It will be shown that F′(x) = 0. For each n choose k_n so that x lies in the interval I_n = (k_n 2⁻ⁿ, (k_n + 1)2⁻ⁿ]; I_n is that dyadic interval of rank n that contains x. By (31.8),

P[X ∈ I_n]/λ(I_n) → F′(x).
[Graph of F(x) for p₀ = .25, p₁ = .75. Because of the recursion (31.17), the part of the graph over [0, .5] and the part over [.5, 1] are identical, apart from changes in scale, with the whole graph. Each segment of the curve therefore contains scaled copies of the whole; the extreme irregularity this implies is obscured by the fact that the accuracy is only to within the width of the printed line.]
If F′(x) is distinct from 0, the ratio of two successive terms here must go to 1, so that

(31.16)   P[X ∈ I_{n+1}]/P[X ∈ I_n] → ½.

If I_n consists of the reals with nonterminating base-2 expansions beginning with the digits u₁, …, u_n, then P[X ∈ I_n] = p_{u₁} ⋯ p_{u_n} by (31.15). But I_{n+1} must for some u_{n+1} consist of the reals beginning u₁, …, u_n, u_{n+1} (u_{n+1} is 1 or 0 according as x lies to the right of the midpoint of I_n or not). Thus P[X ∈ I_{n+1}]/P[X ∈ I_n] = p_{u_{n+1}} is either p₀ or p₁, and (31.16) is possible only if p₀ = p₁, which was excluded by hypothesis.
Thus F is continuous and strictly increasing over [0, 1] but F′(x) = 0 except on a set of Lebesgue measure 0.

For 0 ≤ x ≤ ½, independence gives F(x) = P[X₁ = 0, Σ_{n=2}^∞ X_n 2^{−n+1} ≤ 2x] = p₀F(2x). Similarly, F(x) − p₀ = p₁F(2x − 1) for ½ ≤ x ≤ 1. Thus

(31.17)   F(x) = p₀F(2x) if 0 ≤ x ≤ ½;   F(x) = p₀ + p₁F(2x − 1) if ½ ≤ x ≤ 1.
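The recursion (31.17) also gives a fast way to evaluate F. The sketch below (code mine) unfolds (31.17) once per binary digit of x; with p₀ = p₁ = ½ it reproduces F(x) = x, and in general the increments over dyadic intervals are the products p_{u₁} ⋯ p_{u_n} of (31.15).

```python
def F(x, p0, depth=50):
    """Evaluate the F of Example 31.1 on [0, 1] by unfolding (31.17)."""
    p1 = 1.0 - p0
    acc, scale = 0.0, 1.0     # invariant: answer = acc + scale * F(current x)
    for _ in range(depth):
        if x <= 0.0:
            return acc
        if x >= 1.0:
            return acc + scale
        if x < 0.5:
            x, scale = 2 * x, scale * p0              # F(x) = p0 F(2x)
        else:                                         # F(x) = p0 + p1 F(2x-1)
            acc, x, scale = acc + scale * p0, 2 * x - 1, scale * p1
    return acc

# dyadic increments multiply: F(1/2) = p0, F(1/4) = p0**2,
# F(3/4) = p0 + p1*p0, and with p0 = 1/2 the function is F(x) = x
vals = F(0.5, 0.25), F(0.25, 0.25), F(0.75, 0.25)
```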
In Section 7, F(x) (there denoted Q(x)) entered as the probability of success at bold play; see (7.30) and (7.33). ∎

A function is singular if it has derivative 0 except on a set of Lebesgue measure 0. Of course, a step function constant over intervals is singular. What is remarkable (indeed, singular) about the function in the preceding example is that it is continuous and strictly increasing but nonetheless has derivative 0 except on a set of Lebesgue measure 0. Note that there is strict inequality in (31.9) for this F.

Further properties of nondecreasing functions can be discovered through a study of the measures they generate. Assume from now on that F is nondecreasing, that F is continuous from the right (this is only a normalization), and that 0 = lim_{x→−∞} F(x) ≤ lim_{x→+∞} F(x) = m < ∞. Call such an F a distribution function, even though m need not be 1. By Theorem 12.4 there exists a unique measure μ on the Borel sets of the line for which
(31.18)   μ(a, b] = F(b) − F(a).

Of course, μ(R¹) = m is finite. The larger F′ is, the larger μ is:
Theorem 31.4. Suppose that F and μ are related by (31.18) and that F′(x) exists throughout a Borel set A.
(i) If F′(x) ≤ α for x ∈ A, then μ(A) ≤ αλ(A).
(ii) If F′(x) ≥ α for x ∈ A, then μ(A) ≥ αλ(A).

PROOF. It is no restriction to assume A bounded. Fix ε for the moment. Let E be a countable, dense set, and let A_n = ∪(A ∩ I), where the union extends over the intervals I = (u, v] for which u, v ∈ E, 0 < λ(I) < n⁻¹, and

(31.19)   μ(I) ≤ (α + ε)λ(I).

Then A_n is a Borel set and (see (31.8)) A_n ↑ A under the hypothesis of (i).
By Theorem 11.4 there exist disjoint intervals I_{nk} (open on the left, closed on the right) such that A_n ⊂ ∪_k I_{nk} and

(31.20)   Σ_k λ(I_{nk}) ≤ λ(A_n) + ε.

It is no restriction to assume that each I_{nk} has endpoints in E, meets A_n, and satisfies λ(I_{nk}) < n⁻¹. Then (31.19) applies to each I_{nk}, and hence

μ(A_n) ≤ Σ_k μ(I_{nk}) ≤ (α + ε)Σ_k λ(I_{nk}) ≤ (α + ε)(λ(A_n) + ε).

In the extreme terms here let n → ∞ and then ε → 0; (i) follows.

To prove (ii), let the countable, dense set E contain all the discontinuity points of F, and use the same argument with μ(I) ≥ (α − ε)λ(I) in place of (31.19) and Σ_k μ(I_{nk}) ≤ μ(A_n) + ε in place of (31.20). Since E contains all the discontinuity points of F, it is again no restriction to assume that each I_{nk} has endpoints in E, meets A_n, and satisfies λ(I_{nk}) < n⁻¹. It follows that

μ(A_n) + ε ≥ Σ_k μ(I_{nk}) ≥ (α − ε)Σ_k λ(I_{nk}) ≥ (α − ε)λ(A_n).

Again let n → ∞ and then ε → 0. ∎

The measures μ and λ have disjoint supports if there exist Borel sets S_μ and S_λ such that

(31.21)   μ(R¹ − S_μ) = 0,   λ(R¹ − S_λ) = 0,   S_μ ∩ S_λ = ∅.
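The two bounds of Theorem 31.4 sandwich μ between multiples of λ wherever F′ is trapped between two constants. A quick exact check with F(x) = x² on (a, b] ⊂ [0, 1] (my example), where F′ = 2x lies between 2a and 2b:

```python
# For F(x) = x**2 the measure of (a, b] is mu = b**2 - a**2, and F' = 2x
# satisfies 2a <= F'(x) <= 2b on (a, b]; Theorem 31.4 then forces
#   2a * lam <= mu <= 2b * lam,   where lam = b - a.
def check(a, b):
    mu = b * b - a * a
    lam = b - a
    return 2 * a * lam <= mu <= 2 * b * lam

cases = [(0.0, 0.5), (0.1, 0.9), (0.25, 0.3)]
ok = all(check(a, b) for a, b in cases)
```

Here the sandwich is just the algebraic identity b² − a² = (a + b)(b − a) with a + b between 2a and 2b, but it shows the shape of the theorem's conclusion.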
Theorem 31.5. Suppose that F and μ are related by (31.18). A necessary and sufficient condition that μ and λ have disjoint supports is that F′(x) = 0 except on a set of Lebesgue measure 0.

PROOF. By Theorem 31.4, μ[x: |x| ≤ a, F′(x) ≤ ε] ≤ 2aε, and so (let ε → 0 and then a → ∞) μ[x: F′(x) = 0] = 0. If F′(x) = 0 outside a set of Lebesgue measure 0, then S_λ = [x: F′(x) = 0] and S_μ = R¹ − S_λ satisfy (31.21).

Suppose that there exist S_μ and S_λ satisfying (31.21). By the other half of Theorem 31.4, ελ[x: F′(x) ≥ ε] = ελ[x: x ∈ S_λ, F′(x) ≥ ε] ≤ μ(S_λ) = 0, and so (let ε → 0) F′(x) = 0 except on a set of Lebesgue measure 0. ∎
Example 31.2. Suppose that μ is discrete, consisting of a mass m_k at each of countably many points x_k. Then F(x) = Σm_k, the sum extending over the k for which x_k ≤ x. Certainly, μ and λ have disjoint supports, and so F′ must vanish except on a set of Lebesgue measure 0. This is directly obvious if the x_k have no limit points but not, for example, if they are dense. ∎
Example 31.3. Consider again the distribution function F of Example 31.1. Here μ(A) = P[X ∈ A]. Since F is singular, μ and λ have disjoint supports. This fact has an interesting direct probabilistic proof.

For x in the unit interval, let d₁(x), d₂(x), … be the digits in its nonterminating dyadic expansion, as in Section 1. If (k2⁻ⁿ, (k + 1)2⁻ⁿ] is the dyadic interval of rank n consisting of the reals whose expansions begin with the digits u₁, …, u_n, then, by (31.15),

(31.22)   μ[x: d_i(x) = u_i, i ≤ n] = p_{u₁} ⋯ p_{u_n}.

If the unit interval is regarded as a probability space under the measure μ, then the d_i(x) become random variables, and (31.22) says that these random variables are independent and identically distributed with μ[x: d_i(x) = 0] = p₀ and μ[x: d_i(x) = 1] = p₁. Since these random variables have expected value p₁, the strong law of large numbers implies that their averages go to p₁ with probability 1:

(31.23)   μ[x ∈ (0, 1]: lim_n (1/n)Σ_{i=1}^n d_i(x) = p₁] = 1.

On the other hand, by the normal number theorem,

(31.24)   λ[x ∈ (0, 1]: lim_n (1/n)Σ_{i=1}^n d_i(x) = ½] = 1.

(Of course, (31.24) is just (31.23) for the special case p₀ = p₁ = ½; in this case μ and λ coincide in the unit interval.) If p₁ ≠ ½, the sets in (31.23) and (31.24) are disjoint, so that μ and λ do have disjoint supports.

It was shown in Example 31.1 that if F′(x) exists at all (0 < x < 1), then it is 0. By part (i) of Theorem 31.4 the set where F′(x) fails to exist therefore has μ-measure 1; in particular, this set is uncountable. ∎
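The opposition between (31.23) and (31.24) shows up immediately in simulation: under μ the digits are i.i.d. with probability p₁ of a 1, under λ they are fair coin flips, and the two digit averages settle near p₁ and ½ respectively. A sketch; the seed, p₁, and sample size are my choices:

```python
import random

random.seed(0)
p1, n = 0.75, 10_000

# digit average of a mu-distributed point: digits i.i.d. with P[digit = 1] = p1
mu_avg = sum(random.random() < p1 for _ in range(n)) / n
# digit average of a lambda-distributed (uniform) point: fair bits
lam_avg = sum(random.random() < 0.5 for _ in range(n)) / n
# the two laws concentrate on disjoint digit-frequency events
```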
In the singular case, according to Theorem 31.5, F′ vanishes on a support of λ. It is natural to ask for the size of F′ on a support of μ. If B is
the x-set where F has a finite derivative, and if (31.21) holds, then by Theorem 31.4, μ[x ∈ B: F′(x) ≤ n] = μ[x ∈ B ∩ S_μ: F′(x) ≤ n] ≤ nλ(S_μ) = 0, and hence μ(B) = 0. The next theorem goes further.
Theorem 31.6. Suppose that F and μ are related by (31.18) and that μ and λ have disjoint supports. Then, except for x in a set of μ-measure 0, F D̄(x) = ∞.

If μ has finite support, then clearly F D̄(x) = ∞ if μ{x} > 0, while D̄F(x) = 0 for all x. Since F is continuous from the right, F D̄ and D̄F play different roles.†

PROOF. Let A_n be the set where F D̄(x) < n. The problem is to prove that μ(A_n) = 0, and by (31.21) it is enough to prove that μ(A_n ∩ S_μ) = 0. Further, by Theorem 12.3 it is enough to prove that μ(K) = 0 if K is a compact subset of A_n ∩ S_μ.

Fix ε. Since λ(K) = 0, there is an open G such that K ⊂ G and λ(G) < ε. If x ∈ K, then x ∈ A_n, and by the definition of F D̄ and the right-continuity of F, there is an open interval I_x for which x ∈ I_x ⊂ G and μ(I_x) < nλ(I_x). By compactness, K has a finite subcover I_{x₁}, …, I_{x_l}. If some three of these have a nonempty intersection, one of them must be contained in the union of the other two. Such superfluous intervals can be removed from the subcover, and it is therefore possible to assume that no point of K lies in more than two of the I_{x_i}. But then

μ(K) ≤ μ(∪_i I_{x_i}) ≤ Σ_i μ(I_{x_i}) ≤ nΣ_i λ(I_{x_i}) ≤ 2nλ(G) < 2nε.

Since ε was arbitrary, μ(K) = 0, as required. ∎
† See Problem 31.8.

Example 31.4. Restrict the F of Examples 31.1 and 31.3 to (0, 1), and let g be the inverse. Thus F and g are continuous, strictly increasing mappings of (0, 1) onto itself. If A = [x ∈ (0, 1): F′(x) = 0], then λ(A) = 1, as shown in the examples, while μ(A) = 0.

Let H be a set in (0, 1) that is not a Lebesgue set. Since H − A is contained in a set of Lebesgue measure 0, it is a Lebesgue set; hence H₀ = H ∩ A is not a Lebesgue set, since otherwise H = H₀ ∪ (H − A) would be a Lebesgue set. If B = (0, x], then λg⁻¹(B) = λ(0, F(x)] = F(x) = μ(B), and it follows that λg⁻¹(B) = μ(B) for all Borel sets B. Since g⁻¹H₀ is a subset of g⁻¹A and λ(g⁻¹A) = μ(A) = 0, g⁻¹H₀ is a Lebesgue set. On the other hand, if g⁻¹H₀ were a Borel set, H₀ = F⁻¹(g⁻¹H₀) would also be a Borel set. Thus g⁻¹H₀ provides an example of a Lebesgue set that is not a Borel set.† ∎
Integrals of Derivatives

Return now to the problem of extending part (ii) of Theorem 31.1, that is, the problem of characterizing those distribution functions F for which F′ integrates back to F:

(31.25)   F(x) = ∫_{−∞}^x F′(t) dt.

The first step is easy: If (31.25) holds, then F has the form

(31.26)   F(x) = ∫_{−∞}^x f(t) dt

for a nonnegative, integrable f (a density), namely f = F′. On the other hand, (31.26) implies by Theorem 31.3 that F′ = f outside a set of Lebesgue measure 0, whence (31.25) follows. Thus (31.25) holds if and only if F has the form (31.26) for some f, and the problem is to characterize functions of this form. The function of Example 31.1 is not among them.

As observed earlier (see (31.3)), an F of the form (31.26) with f integrable is continuous. It has a still stronger property: For each ε there exists a δ such that

(31.27)   ∫_A f(x) dx < ε   if λ(A) < δ.

Indeed, if A_n = [x: f(x) > n], then A_n ↓, and since f is integrable, the dominated convergence theorem implies that ∫_{A_n} f(x) dx < ε/2 for large n. Fix such an n and take δ = ε/2n. If λ(A) < δ, then ∫_A f(x) dx = ∫_{A−A_n} f(x) dx + ∫_{A∩A_n} f(x) dx ≤ nλ(A) + ε/2 < ε.

If F is given by (31.26), then F(b) − F(a) = ∫_a^b f(x) dx, and (31.27) has this consequence: For every ε there exists a δ such that for each finite collection [a_i, b_i], i = 1, …, k, of nonoverlapping* intervals,

(31.28)   Σ_{i=1}^k |F(b_i) − F(a_i)| < ε   if   Σ_{i=1}^k (b_i − a_i) < δ.

† For a different argument, see Problem 3.14.
* Intervals are nonoverlapping if their interiors are disjoint. In this definition it is immaterial whether the intervals are regarded as closed or open or half-open, as this has no effect on (31.28).
A function F with this property is said to be absolutely continuous.† A function of the form (31.26) (f integrable) is thus absolutely continuous.

A continuous distribution function is uniformly continuous, and so for every ε there is a δ such that the implication in (31.28) holds provided that k = 1. The definition of absolute continuity requires this to hold whatever k may be, which puts severe restrictions on F. Absolute continuity of F can be characterized in terms of the measure μ:
Theorem 31.7. Suppose that F and μ are related by (31.18). Then F is absolutely continuous in the sense of (31.28) if and only if μ(A) = 0 for every A for which λ(A) = 0.

PROOF. Suppose that F is absolutely continuous and that λ(A) = 0. Given ε, choose δ so that (31.28) holds. By Theorem 11.4 there exists a countable disjoint union B = ∪_k I_k of intervals such that A ⊂ B and λ(B) < δ. By (31.28) it follows that μ(∪_{k=1}^n I_k) ≤ ε for each n and hence that μ(A) ≤ μ(B) ≤ ε. Since ε was arbitrary, μ(A) = 0.

If F is not absolutely continuous, then there exists an ε such that for every δ some finite disjoint union A of intervals satisfies λ(A) < δ and μ(A) ≥ ε. Choose A_n so that λ(A_n) < n⁻² and μ(A_n) ≥ ε. Then λ(lim sup_n A_n) = 0 by the first Borel–Cantelli lemma (Theorem 4.3, the proof of which does not require P to be a probability measure or even finite). On the other hand, μ(lim sup_n A_n) ≥ ε > 0 by Theorem 4.1 (the proof of which applies because μ is finite). ∎

This result leads to a characterization of indefinite integrals.
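The criterion of Theorem 31.7 fails visibly for a singular F. The sketch below (code mine) uses the Cantor function of Problem 31.2: the 2^n generation-n intervals of the middle-thirds construction have total length (2/3)^n, yet the sum of the F-increments over them is always 1, so no δ in (31.28) can serve for ε < 1.

```python
def cantor(x, depth=40):
    """Cantor function via ternary digits: halve the 0s and 2s, stop at a 1."""
    if x >= 1.0:
        return 1.0
    total, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        d = int(x)
        x -= d
        if d == 1:
            return total + scale
        total += scale * (d // 2)
        scale /= 2
    return total

def generation(n):
    """The 2**n closed intervals of stage n of the middle-thirds construction."""
    ivs = [(0.0, 1.0)]
    for _ in range(n):
        ivs = [piece for (a, b) in ivs
               for piece in ((a, a + (b - a) / 3), (b - (b - a) / 3, b))]
    return ivs

ivs = generation(6)
total_len = sum(b - a for a, b in ivs)                    # (2/3)**6, about 0.088
variation = sum(cantor(b) - cantor(a) for a, b in ivs)    # stays at 1
```

The whole Cantor-function mass sits over a Lebesgue-null set, which is exactly the failure of the μ(A) = 0 criterion.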
Theorem 31.8. A distribution function F(x) has the form ∫_{−∞}^x f(t) dt for an integrable f if and only if it is absolutely continuous in the sense of (31.28).

PROOF. That an F of the form (31.26) is absolutely continuous was proved in the argument leading to the definition (31.28). For another proof, apply Theorem 31.7: if F has this form, then λ(A) = 0 implies that μ(A) = ∫_A f(t) dt = 0.

To go the other way, define for any distribution function F

(31.29)   F_ac(x) = ∫_{−∞}^x F′(t) dt

and

(31.30)   F_s(x) = F(x) − F_ac(x).
† The definition applies to all functions, not just to distribution functions. If F is a distribution function, as in the present discussion, the absolute-value bars in (31.28) are unnecessary.
Then F_s is right-continuous, and by (31.9) it is both nonnegative and nondecreasing. Since F_ac comes from a density, it is absolutely continuous. By Theorem 31.3, F′_ac = F′ and hence F′_s = 0 except on a set of Lebesgue measure 0. Thus F has a decomposition

(31.31)   F(x) = F_ac(x) + F_s(x),

where F_ac has a density and hence is absolutely continuous and F_s is singular. This is called the Lebesgue decomposition.

Suppose that F is absolutely continuous. Then F_s of (31.30) must, as the difference of absolutely continuous functions, be absolutely continuous itself. If it can be shown that F_s is identically 0, it will follow that F = F_ac has the required form. It thus suffices to show that a distribution function that is both absolutely continuous and singular must vanish. If a distribution function F is singular, then by Theorem 31.5 there are disjoint supports S_μ and S_λ. But if F is also absolutely continuous, then from λ(S_μ) = 0 it follows by Theorem 31.7 that μ(S_μ) = 0. But then μ(R¹) = 0, and so F(x) ≡ 0. ∎

This theorem identifies the distribution functions that are integrals of their derivatives as the absolutely continuous functions. Theorem 31.7, on the other hand, characterizes absolute continuity in a way that extends to spaces Ω without the geometric structure of the line necessary to a treatment involving distribution functions and ordinary derivatives.† The extension is studied in the next section.

Functions of Bounded Variation
The remainder of the section briefly sketches the extension of the preceding theory to functions that are not monotone. The results are for simplicity given only for a finite interval [a, b] and for functions F on [a, b] satisfying F(a) = 0.

If F(x) = ∫_a^x f(t) dt is an indefinite integral, where f is integrable but not necessarily nonnegative, then F(x) = ∫_a^x f⁺(t) dt − ∫_a^x f⁻(t) dt exhibits F as the difference of two nondecreasing functions. The problem of characterizing indefinite integrals thus leads to the preliminary problem of characterizing functions representable as a difference of nondecreasing functions.

Now F is said to be of bounded variation over [a, b] if sup_Δ ‖F‖_Δ is finite, where ‖F‖_Δ is defined by (31.5) and Δ ranges over all partitions (31.4) of [a, b]. Clearly, a difference of nondecreasing functions is of bounded variation. But the converse holds as well: For every finite collection Γ of nonoverlapping intervals [x_i, y_i] in [a, b], put

P_Γ = Σ_i (F(y_i) − F(x_i))⁺,   N_Γ = Σ_i (F(y_i) − F(x_i))⁻.
† Theorems 31.3 and 31.8 do have geometric analogues in R^k; see RUDIN¹, Chapter 8.
Now define

P(x) = sup_Γ P_Γ,   N(x) = sup_Γ N_Γ,

where the suprema extend over partitions Γ of [a, x]. If F is of bounded variation, then P(x) and N(x) are finite. For each such Γ, P_Γ = N_Γ + F(x). This gives the inequalities

P_Γ ≤ N(x) + F(x),   P(x) ≥ N_Γ + F(x),

which in turn lead to the inequalities

P(x) ≤ N(x) + F(x),   P(x) ≥ N(x) + F(x).

Thus

(31.32)   F(x) = P(x) − N(x)

gives the required representation: A function is the difference of two nondecreasing functions if and only if it is of bounded variation.

If T_Γ = P_Γ + N_Γ, then T_Γ = Σ|F(y_i) − F(x_i)|. According to the definition (31.28), F is absolutely continuous if for every ε there exists a δ such that T_Γ < ε whenever the intervals in the collection Γ have total length less than δ. If F is absolutely continuous, take the δ corresponding to an ε of 1 and decompose [a, b] into a finite number, say n, of subintervals [u_{j−1}, u_j] of lengths less than δ. Any partition Δ of [a, b] can, by the insertion of the u_j, be split into n sets of intervals, each of total length less than δ, and it follows that ‖F‖_Δ < n.† Therefore, an absolutely continuous function is necessarily of bounded variation.

An absolutely continuous F thus has a representation (31.32). It follows by the definitions that P(y) − P(x) is at most sup_Γ T_Γ, where Γ ranges over the partitions of [x, y]. If the [x_i, y_i] are nonoverlapping intervals, then Σ(P(y_i) − P(x_i)) is at most sup_Γ T_Γ, where now Γ ranges over the collections of intervals that partition each of the [x_i, y_i]. If F is absolutely continuous, there thus exists for each ε a δ such that Σ(y_i − x_i) < δ implies that Σ(P(y_i) − P(x_i)) < ε. In other words, P is absolutely continuous. Similarly, N is absolutely continuous. Thus an absolutely continuous F is the difference of two nondecreasing absolutely continuous functions. By Theorem 31.8, each of these is an indefinite integral, which implies that F is an indefinite integral as well: For an F on [a, b] satisfying F(a) = 0, absolute continuity is a necessary and sufficient condition that F be an indefinite integral, that is, have the form F(x) = ∫_a^x f(t) dt for an integrable f.

PROBLEMS
† This uses the fact that ‖F‖_Δ cannot decrease under passage to a finer partition.

31.1. Extend Examples 31.1 and 31.3: Let p₀, …, p_{r−1} be nonnegative numbers adding to 1, where r ≥ 2; suppose there is no i such that p_i = 1. Let X₁, X₂, … be independent, identically distributed random variables such that P[X_n = i] = p_i, 0 ≤ i < r, and put X = Σ_{n=1}^∞ X_n r⁻ⁿ. Let F be the distribution function of X. Show that F is continuous. Show that F is strictly increasing over the unit interval if and only if all the p_i are strictly positive. Show that F(x) = x for 0 ≤ x ≤ 1 if p_i = r⁻¹ for all i, and that otherwise F is singular; prove singularity by extending the arguments both of Example 31.1 and of Example 31.3. What is the analogue of (31.17)?

31.2. ↑ In Problem 31.1 take r = 3 and p₀ = p₂ = ½, p₁ = 0. The corresponding F is called the Cantor function. The complement in (0, 1] of the Cantor set (see Problems 1.6 and 3.16) consists of the middle third (⅓, ⅔), the middle thirds (⅑, 2/9) and (7/9, 8/9), and so on. Show that F is ½ on the first of these intervals, ¼ on the second, ¾ on the third, and so on. Show by direct argument that F′ = 0 except on a set of Lebesgue measure 0.

31.3. A real function f of a real variable is a Lebesgue function if [x: f(x) ≤ a] is a Lebesgue set for each a.
(a) Show that, if f₁ is a Borel function and f₂ is a Lebesgue function, then the composition f₁f₂ is a Lebesgue function.
(b) Show that there exist a Lebesgue function f₁ and a Lebesgue (even Borel, even continuous) function f₂ such that f₁f₂ is not a Lebesgue function. Hint: Use Example 31.4.

31.4. ↑ An arbitrary function f on (0, 1] can be represented as a composition of a Lebesgue function f₁ and a Borel function f₂. For x in (0, 1], let d_n(x) be the nth digit in its nonterminating dyadic expansion, and define f₂(x) = Σ_{n=1}^∞ 2d_n(x)/3ⁿ. Show that f₂ is increasing and that f₂(0, 1] is contained in the Cantor set. Take f₁(x) to be f(f₂⁻¹(x)) if x ∈ f₂(0, 1] and 0 if x ∈ (0, 1] − f₂(0, 1]. Now show that f = f₁f₂.

31.5. Let r₁, r₂, … be an enumeration of the rationals in (0, 1) and put F(x) = Σ_{r_k ≤ x} 2⁻ᵏ. Define φ by (14.5) and prove that it is continuous and singular.

31.6. Suppose that μ and F are related by (31.18). If F is not absolutely continuous, then μ(A) > 0 for some set A of Lebesgue measure 0. It is an interesting fact, however, that almost all translates of A must have μ-measure 0. From Fubini's theorem and the fact that λ is invariant under translation and reflection through 0 show that, if λ(A) = 0 and μ is σ-finite, then μ(A + x) = 0 for x outside a set of Lebesgue measure 0.

31.7. 17.2 31.6 ↑ Show that F is absolutely continuous if and only if for each Borel set A, μ(A + x) is continuous in x.

31.8. Let F_*(x) = lim_{δ→0} inf (F(v) − F(u))/(v − u), where the infimum extends over u and v such that u ≤ x ≤ v and 0 < v − u ≤ δ. Define F*(x) as this limit with the infimum replaced by the supremum. Show that in Theorem 31.4, F′ can be replaced by F* in part (i) and by F_* in part (ii). Show that in Theorem 31.6, F D̄ can be replaced by F* (note that F_*(x) ≤ F D̄(x)).

31.9. Lebesgue's density theorem. A point x is a density point of a Borel set A if λ((u, v] ∩ A)/(v − u) → 1 as u ↑ x and v ↓ x. From Theorems 31.2 and 31.4 deduce that almost all points of A are density points, that is, that the
438
31.10. 31.11.
31.12.
31. 13.
DERIVATIVES AND CONDITIONAL PROBAB ILITY
points of A that are not density points form a set of Lebesgue measure 0. Similarly, A (( u, v ] n A)/( v - u ) --+ 0 almost everywhere on Ac. 19.3 f Let f: [a, b ] --+ R * be an arc; /( t) = (/1 ( t ), . . . , /" ( t)). Show that the arc is rectifiable if and only if each /; is of bounded variation over [ a, b ]. F(O) = 0, f Suppose that F is continuous and nondecreasing and that 2 F(1) = 1. Then f(x) = (x, F(x)) defines an arcf: [0, 1] --+ R • It is easy to see by monotonicity that the arc is rectifiable and that, in fact, its length satisfies L( / ) < 2. It is also easy, given £, to produce functions F for which L( / ) > 2 - £. Show by the arguments in the proof of Theorem 31.4 tha t L( / ) = 2 if F is singular. Suppose that the characteristic function of F satisfies lim sup, _ lcp( t) 1 = 1. Show that F is singular. Compare the lattice case (Problem 26.1). Hint: Use the Lebesgue decomposition and the Riemann- Lebesgue theorem. 26.22 f Suppose that X1 , X2 , • • • are independent andn assume the values + 1 with probability t each, and let X = �_ 1 Xn/2 . Show that X is uniformly distributed over [ - 1, + 1]. Calculate the characteristic functions of X and Xn and deduce (1.36). Conversely, establish (1.36) by trigonome try and conclude that X is uniformly distributed over [ - 1, + 1]. oo
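The Cantor function of Problem 31.2 can be computed digit by digit: ternary digits 0 and 2 of x map to binary digits 0 and 1, and the expansion stops at the first ternary digit 1, after which F is constant across the corresponding middle-third gap. A minimal numerical sketch (the function name and the digit cutoff are illustrative choices, not anything from the text):

```python
def cantor_function(x, ndigits=60):
    """Approximate the Cantor function F(x) for x in [0, 1].

    Reads off the ternary expansion of x: digit 2 contributes a binary
    digit 1, digit 0 contributes 0, and the first digit 1 contributes a
    final binary 1, after which F is constant on a middle-third gap.
    """
    value, weight = 0.0, 0.5
    for _ in range(ndigits):
        x *= 3
        digit = int(x)
        x -= digit
        if digit == 1:            # x lies in (or at) a removed middle third
            return value + weight
        if digit == 2:
            value += weight
        weight /= 2
    return value

# F is 1/2 on (1/3, 2/3), 1/4 on (1/9, 2/9), 3/4 on (7/9, 8/9):
print(cantor_function(0.5))     # 0.5
print(cantor_function(0.15))    # 0.25
print(cantor_function(0.8))     # 0.75
```

Note that the computed values are constant on each removed interval, in agreement with the claim of Problem 31.2 that F′ = 0 off the Cantor set.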
31.14. (a) Suppose that X1, X2, ... are independent and assume the values 0 and 1 with probability ½ each. Let F and G be the distribution functions of Σ_{n=1}^∞ X_{2n−1}/2^{2n−1} and Σ_{n=1}^∞ X_{2n}/2^{2n}. Show that F and G are singular but that F*G is absolutely continuous.
(b) Show that the convolution of an absolutely continuous distribution function with an arbitrary distribution function is absolutely continuous.

31.15. 31.2 ↑ Show that the Cantor function is the distribution function of Σ_{n=1}^∞ X_n/3^n, where the X_n are independent and assume the values 0 and 2 with probability ½ each. Express its characteristic function as an infinite product.

31.16. Show for the F of Example 31.1 that F_D(1) = ∞ and F^D(0) = 0 if p_0 < ½. From (31.17) deduce that F_D(x) = ∞ and F^D(x) = 0 for all dyadic rationals x. Analyze the case p_0 > ½ and sketch the graph.

31.17. 6.14 ↑ Let F be as in Example 31.1 and let μ be the corresponding probability measure on the unit interval. Let d_n(x) be the nth digit in the nonterminating binary expansion of x, and let s_n(x) = Σ_{k=1}^n d_k(x). If I_n(x) is the dyadic interval of order n containing x, then
(31.33)    −(1/n) log μ(I_n(x)) = −(s_n(x)/n) log p_1 − (1 − s_n(x)/n) log p_0.

(a) Show that (31.33) converges on a set of μ-measure 1 to the entropy h = −p_0 log p_0 − p_1 log p_1. From the fact that this entropy is less than log 2 if p_0 ≠ ½, deduce that on a set of μ-measure 1, F does not have a finite derivative if p_0 ≠ ½.
(b) Show that (31.33) converges to −½ log p_0 − ½ log p_1 on a set of Lebesgue measure 1. If p_0 ≠ ½ this limit exceeds log 2 (arithmetic versus geometric means), and so μ(I_n(x))/2^{-n} → 0 except on a set of Lebesgue measure 0. This does not prove that F′(x) exists almost everywhere, but it does show that, except for x in a set of Lebesgue measure 0, if F′(x) does exist, then it is 0.
(c) Show that, if (31.33) converges to l, then

(31.34)    μ(I_n(x))/λ(I_n(x))^a → ∞ if a > l/log 2,    μ(I_n(x))/λ(I_n(x))^a → 0 if a < l/log 2.

If (31.34) holds, then (roughly) F satisfies a Lipschitz condition† of (exact) order l/log 2. Thus F satisfies a Lipschitz condition of order h/log 2 on a set of μ-measure 1 and a Lipschitz condition of order (−½ log p_0 − ½ log p_1)/log 2 on a set of Lebesgue measure 1.

31.18. van der Waerden's continuous, nowhere differentiable function is f(x) = Σ_{k=0}^∞ a_k(x), where a_0(x) is the distance from x to the nearest integer and a_k(x) = 2^{-k} a_0(2^k x). Show by the Weierstrass M-test that f is continuous. Use (31.8) and the ideas in Example 31.1 to show that f is nowhere differentiable.

31.19. Show (see (31.31)) that (apart from addition of constants) a function can have only one representation F_1 + F_2 with F_1 absolutely continuous and F_2 singular.

31.20. Show that the F_s in the Lebesgue decomposition can be further split into F_cs + F_d, where F_cs is continuous and singular and F_d increases only in jumps, in the sense that the corresponding measure is discrete. The complete decomposition is then F = F_ac + F_cs + F_d.

31.21. (a) Suppose that x_1 < x_2 < ··· and Σ_n |F(x_n)| = ∞. Show that, if F assumes the value 0 in each interval (x_n, x_{n+1}), then it is of unbounded variation.
(b) Define F over [0, 1] by F(0) = 0 and F(x) = x^a sin x^{-1} for x > 0. For which values of a is F of bounded variation?

31.22. 14.5 ↑ If f is nonnegative and Lebesgue integrable, then by Theorem 31.3 and (31.8), except for x in a set of Lebesgue measure 0,

(31.35)    (1/(v − u)) ∫_u^v f(t) dt → f(x)

if u ≤ x ≤ v, u < v, and u, v → x. There is an analogue in which Lebesgue

†A Lipschitz condition of order a holds at x if F(x + h) − F(x) = O(|h|^a) as h → 0; for a > 1 this implies F′(x) = 0, and for 0 < a < 1 it is a smoothness condition stronger than continuity and weaker than differentiability.
measure is replaced by a general probability measure μ: If f is nonnegative and integrable with respect to μ, then as h ↓ 0,

(31.36)    (1/μ(x − h, x + h]) ∫_{(x−h, x+h]} f(t) μ(dt) → f(x)

on a set of μ-measure 1. Let F be the distribution function corresponding to μ, and put φ(u) = inf[x: u ≤ F(x)] for 0 < u < 1 (see (14.5)). Deduce (31.36) from (31.35) by change of variable and Problem 14.5.

SECTION 32. THE RADON-NIKODYM THEOREM
If f is a nonnegative function on a measure space (Ω, ℱ, μ), then ν(A) = ∫_A f dμ defines another measure on ℱ. In the terminology of Section 16, ν has density f with respect to μ; see (16.11). For each A in ℱ, μ(A) = 0 implies that ν(A) = 0. The purpose of this section is to show, conversely, that if this last condition holds and ν and μ are σ-finite on ℱ, then ν has a density with respect to μ. This was proved for the case (R¹, 𝓡¹, λ) in Theorems 31.7 and 31.8. The theory of the preceding section, although illuminating, is not required here.

Additive Set Functions
Throughout this section (Ω, ℱ) is a measurable space. All sets involved are assumed as usual to lie in ℱ. An additive set function is a function φ from ℱ to the reals for which

(32.1)    φ(⋃_n A_n) = Σ_n φ(A_n)

if A1, A2, ... is a finite or infinite sequence of disjoint sets. A set function differs from a measure in that the values φ(A) may be negative but must be finite; the special values +∞ and −∞ are prohibited. It will turn out that the series on the right in (32.1) must in fact converge absolutely, but this need not be assumed. Note that φ(∅) = 0.
Example 32.1. If μ1 and μ2 are finite measures, then φ(A) = μ1(A) − μ2(A) is an additive set function. It will turn out that the general additive set function has this form. A special case of this is φ(A) = ∫_A f dμ, where f is integrable (not necessarily nonnegative). ∎
The proof of the main theorem of this section (Theorem 32.2) requires certain facts about additive set functions, even though the statement of the theorem involves only measures.
Lemma 1. If E_n ↑ E or E_n ↓ E, then φ(E_n) → φ(E).

PROOF. If E_n ↑ E, then φ(E) = φ(E_1 ∪ ⋃_{n=1}^∞ (E_{n+1} − E_n)) = φ(E_1) + Σ_{n=1}^∞ φ(E_{n+1} − E_n) = lim_v [φ(E_1) + Σ_{n=1}^{v−1} φ(E_{n+1} − E_n)] = lim_v φ(E_v) by (32.1). If E_n ↓ E, then E_n^c ↑ E^c, and hence φ(E_n) = φ(Ω) − φ(E_n^c) → φ(Ω) − φ(E^c) = φ(E). ∎

Although this result is essentially the same as the corresponding ones for measures, it does require separate proof. Note that the limits need not be monotone unless φ happens to be a measure.

The Hahn Decomposition
Theorem 32.1. For any additive set function φ, there exist disjoint sets A⁺ and A⁻ such that A⁺ ∪ A⁻ = Ω, φ(E) ≥ 0 for all E ⊂ A⁺, and φ(E) ≤ 0 for all E ⊂ A⁻.

A set A is positive if φ(E) ≥ 0 for E ⊂ A and negative if φ(E) ≤ 0 for E ⊂ A. The A⁺ and A⁻ in the theorem decompose Ω into a positive and a negative set. This is the Hahn decomposition. If φ(A) = ∫_A f dμ (see Example 32.1), the result is easy: take A⁺ = [f ≥ 0] and A⁻ = [f < 0].
PROOF. Let α = sup[φ(A): A ∈ ℱ]. Suppose that there exists a set A⁺ with φ(A⁺) = α (which implies that α is finite). Let A⁻ = Ω − A⁺. If A ⊂ A⁺ and φ(A) < 0, then φ(A⁺ − A) > α, an impossibility; hence A⁺ is a positive set. If A ⊂ A⁻ and φ(A) > 0, then φ(A⁺ ∪ A) > α, an impossibility; hence A⁻ is a negative set. It is therefore only necessary to construct a set A⁺ for which φ(A⁺) = α.

Choose sets A_n such that φ(A_n) → α, and let A = ⋃_n A_n. For each n consider the 2^n sets B_{ni} (some perhaps empty) that are intersections of the form ⋂_{k=1}^n A_k′, where each A_k′ is either A_k or else A − A_k. The collection 𝓑_n = [B_{ni}: 1 ≤ i ≤ 2^n] of these sets partitions A. Clearly, 𝓑_n refines 𝓑_{n−1}: each B_{nj} is contained in exactly one of the B_{n−1,i}. Let C_n be the union of those B_{ni} in 𝓑_n for which φ(B_{ni}) > 0. Since A_n is the union of certain of the B_{ni}, φ(A_n) ≤ φ(C_n). Since the partitions 𝓑_1, 𝓑_2, ... are successively finer, m ≤ n implies that (C_m ∪ ··· ∪ C_{n−1} ∪ C_n) − (C_m ∪ ··· ∪ C_{n−1}) is the union (perhaps empty) of certain of the sets B_{ni}; the B_{ni} in this union must satisfy φ(B_{ni}) > 0 because they are contained in C_n. Therefore, φ(C_m ∪ ··· ∪ C_{n−1}) ≤ φ(C_m ∪ ··· ∪ C_n), so that by induction φ(A_m) ≤ φ(C_m) ≤ φ(C_m ∪ ··· ∪ C_n). If D_m = ⋃_{n=m}^∞ C_n, then by Lemma 1 (take E_v = C_m ∪ ··· ∪ C_{m+v}), φ(A_m) ≤ φ(D_m). Let A⁺ = ⋂_{m=1}^∞ D_m (note that A⁺ = lim sup_n C_n), so that D_m ↓ A⁺. By Lemma 1, α = lim_m φ(A_m) ≤ lim_m φ(D_m) = φ(A⁺). Thus A⁺ does have maximal φ-value. ∎

If φ⁺(A) = φ(A ∩ A⁺) and φ⁻(A) = −φ(A ∩ A⁻), then φ⁺ and φ⁻ are finite measures. Thus

(32.2)    φ(A) = φ⁺(A) − φ⁻(A)

represents the set function φ as the difference of two finite measures having disjoint supports. If E ⊂ A, then φ(E) ≤ φ⁺(E) ≤ φ⁺(A), and there is equality if E = A ∩ A⁺. Therefore, φ⁺(A) = sup_{E ⊂ A} φ(E). Similarly, φ⁻(A) = −inf_{E ⊂ A} φ(E). The measures φ⁺ and φ⁻ are called the upper and lower variations of φ, and the measure |φ| with value φ⁺(A) + φ⁻(A) at A is called the total variation. The representation (32.2) is the Jordan decomposition.
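On a finite space the construction is transparent: if φ is given by signed point masses, the positive set can be taken to be the points of nonnegative mass, and the Jordan decomposition follows at once. A small sketch along these lines (the point masses are arbitrary illustrative numbers):

```python
# phi(A) = sum of signed point masses over A, on a finite Omega.
omega = {"a": 2.0, "b": -3.0, "c": 1.5, "d": -0.5, "e": 0.0}

A_plus = {w for w, m in omega.items() if m >= 0}    # a positive set
A_minus = set(omega) - A_plus                       # a negative set

def phi(A):
    return sum(omega[w] for w in A)

phi_plus = phi(A_plus)                   # upper variation of the whole space
phi_minus = -phi(A_minus)                # lower variation
total_variation = sum(abs(m) for m in omega.values())

# Jordan decomposition (32.2): phi = phi_plus - phi_minus on Omega.
assert phi(set(omega)) == phi_plus - phi_minus
print(A_plus, phi_plus, phi_minus, total_variation)
```

Note that the zero-mass point "e" could go into either set, which mirrors the nonuniqueness of the Hahn decomposition discussed in Problem 32.2.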
Absolute Continuity and Singularity
Measures μ and ν on (Ω, ℱ) are by definition mutually singular if they have disjoint supports; that is, if there exist sets S_μ and S_ν such that

(32.3)    μ(Ω − S_μ) = 0,    ν(Ω − S_ν) = 0,    S_μ ∩ S_ν = ∅.

In this case μ is also said to be singular with respect to ν, and ν singular with respect to μ. Singularity is sometimes indicated by μ ⊥ ν. Note that measures are automatically singular if one of them is identically 0. According to Theorem 31.5, a finite measure on R¹ with distribution function F is in the sense of (32.3) singular with respect to Lebesgue measure if and only if F′(x) = 0 except on a set of Lebesgue measure 0. In Section 31 the latter condition was taken as the definition of singularity, but of course it is the requirement of disjoint supports that can be generalized from R¹ to an arbitrary Ω.

The measure ν is absolutely continuous with respect to μ if for each A in ℱ, μ(A) = 0 implies ν(A) = 0. In this case ν is also said to be dominated by μ, and the relation is indicated by ν ≪ μ. If ν ≪ μ and μ ≪ ν, the measures are equivalent, indicated by ν ≡ μ.
A finite measure on the line is by Theorem 31.7 absolutely continuous in this sense with respect to Lebesgue measure if and only if the corresponding distribution function F satisfies the condition (31.28). The latter condition, taken in Section 31 as the definition of absolute continuity, is again not the one that generalizes from R¹ to Ω.

There is an ε-δ idea related to the definition of absolute continuity above. Suppose that for every ε there exists a δ such that

(32.4)    ν(A) < ε if μ(A) < δ.

If this condition holds, μ(A) = 0 implies that ν(A) < ε for all ε, and so ν ≪ μ. Suppose, on the other hand, that this condition fails and that ν is finite. Then for some ε there exist sets A_n such that μ(A_n) < n^{-2} and ν(A_n) ≥ ε. If A = lim sup_n A_n, then μ(A) = 0 by the first Borel-Cantelli lemma (which applies to arbitrary measures) but ν(A) ≥ ε > 0 by the right-hand inequality in (4.10) (which applies because ν is finite). Hence ν ≪ μ fails, and so (32.4) follows if ν is finite and ν ≪ μ. If ν is finite, in order that ν ≪ μ it is therefore necessary and sufficient that for every ε there is a δ satisfying (32.4). This condition is not suitable as a definition because it need not follow from ν ≪ μ if ν is infinite.†

†See Problem 32.3.
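The footnoted counterexample (Problem 32.3) is easy to see numerically: on the integers let ν be counting measure and let μ put mass n^{-2} at n. Then ν ≪ μ, since μ(A) = 0 forces A to be empty, yet (32.4) fails for ε = 1 because singletons have arbitrarily small μ-measure and ν-measure 1. A sketch:

```python
# mu has mass 1/n^2 at the integer n >= 1; nu is counting measure.
def mu(A):
    return sum(1.0 / n**2 for n in A)

def nu(A):
    return float(len(A))

# However small delta is, some singleton {n} has mu({n}) < delta while
# nu({n}) = 1, so (32.4) fails for epsilon = 1 even though nu << mu.
for delta in (1e-2, 1e-4, 1e-6):
    n = int(delta ** -0.5) + 1          # chosen so that 1/n^2 < delta
    assert mu({n}) < delta and nu({n}) == 1.0
print("(32.4) fails for epsilon = 1 at every delta")
```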
The Main Theorem
If ν(A) = ∫_A f dμ, then certainly ν ≪ μ. The Radon-Nikodym theorem goes in the opposite direction:

Theorem 32.2. If μ and ν are σ-finite measures such that ν ≪ μ, then there exists a nonnegative f, a density, such that ν(A) = ∫_A f dμ for all A ∈ ℱ. For two such densities f and g, μ[f ≠ g] = 0.
The uniqueness of the density up to sets of μ-measure 0 was settled in Section 16 (see Example 16.9). It is only the existence that must be proved. The density f is integrable with respect to μ if and only if ν is finite. By Theorem 16.10, integrals with respect to ν can be calculated by the formula

(32.5)    ∫_A h dν = ∫_A h f dμ.

The density whose existence is to be proved is called the Radon-Nikodym derivative of ν with respect to μ and is often denoted dν/dμ. The term derivative is appropriate because of Theorems 31.3 and 31.8: For an absolutely continuous distribution function F on the line, the corresponding measure μ has with respect to Lebesgue measure the Radon-Nikodym derivative F′. Note that (32.5) can be written

(32.6)    ∫_A h dν = ∫_A h (dν/dμ) dμ.

Suppose that Theorem 32.2 holds for finite μ and ν (which is in fact enough for the probabilistic applications in the sections that follow). In the σ-finite case there is a countable decomposition of Ω into ℱ-sets A_n for which μ(A_n) and ν(A_n) are both finite. If

(32.7)    μ_n(A) = μ(A ∩ A_n),    ν_n(A) = ν(A ∩ A_n),

then ν ≪ μ implies ν_n ≪ μ_n, and so ν_n(A) = ∫_A f_n dμ_n for some density f_n. Since ν_n has density I_{A_n} f_n with respect to μ (Example 16.10), Σ_n f_n I_{A_n} is the density sought. It is therefore enough to treat finite μ and ν. This requires a preliminary result.

Lemma 2. If μ and ν are finite measures and are not mutually singular, then there exists a set A and a positive ε such that μ(A) > 0 and εμ(E) ≤ ν(E) for all E ⊂ A.
PROOF. Let A_n⁺ ∪ A_n⁻ be a Hahn decomposition for the set function ν − n^{-1}μ; put M = ⋃_n A_n⁺, so that M^c = ⋂_n A_n⁻. Since M^c is in the negative set A_n⁻ for ν − n^{-1}μ, ν(M^c) ≤ n^{-1}μ(M^c); since this holds for all n, ν(M^c) = 0. Thus M supports ν, and from the fact that μ and ν are not mutually singular it follows that M^c cannot support μ; that is, μ(M) must be positive. Therefore, μ(A_n⁺) > 0 for some n. Take A = A_n⁺ and ε = n^{-1}. ∎
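When μ and ν are given by point masses on a countable space, the density guaranteed by Theorem 32.2 is simply the pointwise ratio f(ω) = ν{ω}/μ{ω}, and the defining identity ν(A) = ∫_A f dμ can be checked on every set. A minimal sketch with arbitrary illustrative weights:

```python
from itertools import combinations

# Point masses of mu and nu on a finite Omega; nu << mu because every
# point of positive nu-mass also carries positive mu-mass.
mu = {"a": 0.5, "b": 0.25, "c": 0.25}
nu = {"a": 0.1, "b": 0.6, "c": 0.3}

# Radon-Nikodym derivative f = d(nu)/d(mu): ratio of point masses.
f = {w: nu[w] / mu[w] for w in mu}

def integral_f_dmu(A):
    return sum(f[w] * mu[w] for w in A)

# nu(A) equals the integral of f over A with respect to mu, for every A.
points = list(mu)
for r in range(len(points) + 1):
    for A in combinations(points, r):
        assert abs(sum(nu[w] for w in A) - integral_f_dmu(A)) < 1e-12
print(f)    # {'a': 0.2, 'b': 2.4, 'c': 1.2}
```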
Example 32.2. Suppose that (Ω, ℱ) = (R¹, 𝓡¹), μ is Lebesgue measure λ, and ν(a, b] = F(b) − F(a). If ν and λ do not have disjoint supports, then by Theorem 31.5, λ[x: F′(x) > 0] > 0, and hence for some ε, A = [x: F′(x) > ε] satisfies λ(A) > 0. If E = (a, b] is a sufficiently small interval about an x in A, then ν(E)/λ(E) = (F(b) − F(a))/(b − a) > ε, which is the same thing as ελ(E) ≤ ν(E). ∎

Thus Lemma 2 ties in with derivatives and quotients ν(E)/μ(E) for "small" sets E. Martingale theory links Radon-Nikodym derivatives with such quotients; see Theorem 35.8 and Example 35.10.

PROOF OF THEOREM 32.2. Suppose that μ and ν are finite measures satisfying ν ≪ μ. Let 𝒢 be the class of nonnegative functions g such that ∫_E g dμ ≤ ν(E) for all E. If g and g′ lie in 𝒢, then max(g, g′) also lies in 𝒢 because

∫_E max(g, g′) dμ = ∫_{E ∩ [g ≥ g′]} g dμ + ∫_{E ∩ [g′ > g]} g′ dμ ≤ ν(E ∩ [g ≥ g′]) + ν(E ∩ [g′ > g]) = ν(E).
Thus 𝒢 is closed under the formation of finite maxima. Suppose that functions g_n lie in 𝒢 and g_n ↑ g. Then ∫_E g dμ = lim_n ∫_E g_n dμ ≤ ν(E) by the monotone convergence theorem, so that g lies in 𝒢. Thus 𝒢 is closed under nondecreasing passages to the limit.

Let α = sup ∫ g dμ for g ranging over 𝒢 (α ≤ ν(Ω)). Choose g_n in 𝒢 so that ∫ g_n dμ > α − n^{-1}. If f_n = max(g_1, ..., g_n) and f = lim f_n, then f lies in 𝒢 and ∫ f dμ = lim_n ∫ f_n dμ ≥ lim_n ∫ g_n dμ = α. Thus f is an element of 𝒢 for which ∫ f dμ is maximal.

Define ν_ac by ν_ac(E) = ∫_E f dμ and ν_s by ν_s(E) = ν(E) − ν_ac(E). Thus

(32.8)    ν(E) = ν_ac(E) + ν_s(E) = ∫_E f dμ + ν_s(E).

Since f is in 𝒢, ν_s as well as ν_ac is a finite measure. Of course, ν_ac is absolutely continuous with respect to μ. Suppose that ν_s fails to be singular with respect to μ. It then follows from Lemma 2 that there is a set A and a positive ε such that μ(A) > 0 and εμ(E) ≤ ν_s(E) for all E ⊂ A. Then for every E,

∫_E (f + εI_A) dμ = ∫_E f dμ + εμ(E ∩ A) ≤ ∫_E f dμ + ν_s(E ∩ A)
    = ∫_{E∩A} f dμ + ν_s(E ∩ A) + ∫_{E−A} f dμ
    = ν(E ∩ A) + ∫_{E−A} f dμ ≤ ν(E ∩ A) + ν(E − A)
    = ν(E).
In other words, f + εI_A lies in 𝒢; since ∫(f + εI_A) dμ = α + εμ(A) > α, this contradicts the maximality of f. Therefore, μ and ν_s are mutually singular, and there exists an S such that ν_s(S) = μ(S^c) = 0. But since ν ≪ μ, ν_s(S^c) ≤ ν(S^c) = 0, and so ν_s(Ω) = 0. The rightmost term in (32.8) thus drops out. ∎
Absolute continuity was not used until the last step of the proof, and what the argument shows is that ν always has a decomposition (32.8) into an absolutely continuous part and a singular part with respect to μ. This is the Lebesgue decomposition, and it generalizes the one in the preceding section (see (31.31)).

PROBLEMS

32.1. There are two ways to show that the convergence in (32.1) must be absolute: Use the Jordan decomposition. Use the fact that a series converges absolutely if it has the same sum no matter what order the terms are taken in.

32.2. If A⁺ ∪ A⁻ is a Hahn decomposition for φ, there may be other ones A₁⁺ ∪ A₁⁻. Construct an example of this. Show that there is uniqueness to the extent that φ(A⁺ △ A₁⁺) = φ(A⁻ △ A₁⁻) = 0.

32.3. Show that absolute continuity does not imply the ε-δ condition (32.4) if ν is infinite. Hint: Let ℱ consist of all subsets of the space of integers, let ν be counting measure, and let μ have mass n^{-2} at n. Note that μ is finite and ν is σ-finite.

32.4. Show that the Radon-Nikodym theorem fails if μ is not σ-finite, even if ν is finite. Hint: Let ℱ consist of the countable and the cocountable sets in an uncountable Ω, let μ be counting measure, and let ν(A) be 0 or 1 as A is countable or cocountable. Or consider one-dimensional and two-dimensional Hausdorff measure in the plane.

32.5. Let μ be the restriction of planar Lebesgue measure λ₂ to the σ-field ℱ = {A × R¹: A ∈ 𝓡¹} of vertical strips. Define ν on ℱ by ν(A × R¹) = λ₂(A × (0, 1)). Show that ν is absolutely continuous with respect to μ but has no density. Why does this not contradict the Radon-Nikodym theorem?

32.6. Let μ, ν, and ρ be σ-finite measures on (Ω, ℱ).
(a) Show that ν ≪ μ and μ ≪ ρ imply that ν ≪ ρ and

    dν/dρ = (dν/dμ)(dμ/dρ).

(b) Show that ν ≡ μ implies dν/dμ = (dμ/dν)^{-1}.
(c) Suppose that μ ≪ ρ and ν ≪ ρ, and let A be the set where dν/dρ > 0 = dμ/dρ. Show that ν ≪ μ if and only if ρ(A) = 0, in which case

    dν/dμ = I_{A^c} (dν/dρ)/(dμ/dρ).
32.7. Show that there is a Lebesgue decomposition (32.8) in the σ-finite as well as the finite case. Prove that it is unique.

32.8. ↑ The Lebesgue decomposition (32.8) can be described differently. Show that there exists a set S (the singular set for ν with respect to μ) and a nonnegative function f (the density) such that μ(S) = 0 and ν(A) = ∫_A f dμ for A ⊂ Ω − S. Show that for any such S and f, ν_ac(A) = ∫_A f dμ and ν_s(A) = ν(A ∩ S).

32.9. Let μ and ν be finite measures on (Ω, ℱ), and suppose that ℱ₀ is a σ-field contained in ℱ. Then the restrictions μ₀ and ν₀ of μ and ν to ℱ₀ are measures on (Ω, ℱ₀). Let ν_ac, ν_s, ν⁰_ac, ν⁰_s be, respectively, the absolutely continuous and singular parts of ν and ν₀ with respect to μ and μ₀. Show that ν⁰_ac(E) ≥ ν_ac(E) and ν⁰_s(E) ≤ ν_s(E) for E ∈ ℱ₀.

32.10. Suppose that μ, ν, ν_n are finite measures on (Ω, ℱ) and that ν(A) = Σ_n ν_n(A) for all A. Let ν_n(A) = ∫_A f_n dμ + ν_n′(A) and ν(A) = ∫_A f dμ + ν′(A) be the decompositions (32.8); here ν′ and the ν_n′ are singular with respect to μ. Show that f = Σ_n f_n except on a set of μ-measure 0 and that ν′(A) = Σ_n ν_n′(A) for all A. Show that ν ≪ μ if and only if ν_n ≪ μ for all n.

32.11. 32.2 ↑ Absolute continuity of a set function φ with respect to a measure μ is defined just as if φ were itself a measure: μ(A) = 0 must imply that φ(A) = 0. Show that, if this holds and μ is σ-finite, then φ(A) = ∫_A f dμ for some integrable f. Show that A⁺ = [ω: f(ω) ≥ 0] and A⁻ = [ω: f(ω) < 0] give a Hahn decomposition for φ. Show that the three variations satisfy φ⁺(A) = ∫_A f⁺ dμ, φ⁻(A) = ∫_A f⁻ dμ, and |φ|(A) = ∫_A |f| dμ. Hint: To construct f, start with (32.2).

32.12. ↑ A signed measure φ is a set function which satisfies (32.1) if A1, A2, ... are disjoint and which may assume one of the values +∞ and −∞ but not both. Extend the Hahn and Jordan decompositions to signed measures.

32.13. 16.12 ↑ Show that, if ν is not σ-finite, then the Radon-Nikodym derivative need not be unique even if it exists.

32.14. 31.22 ↑ Suppose that μ and ν are a probability measure and a σ-finite measure on the line and that ν ≪ μ. Show that the Radon-Nikodym derivative f satisfies
32.11.
32.12.
Show that there is a Lebesgue decomposition (32.8) in the a-finite as well as the finite case. Prove that it is unique. f The Lebesgue decomposition (32.8) can be described differently. Show that there exists a set S (the singular set for ., with respect to p.) and a nonnegative function I (the density) such that p. ( S) == 0 and v ( A ) == fA idp. for A c Q - S. Show that for any such S and I, �'a c ( A ) == fA idp., and "s ( A ) == ., ( A n S ). Let p. and ., be finite measures on ( Q, �) and suppose that �o is a a-field contained in �- Then the restrictions p.0 and ., o of p. and ., to $0 are measures on ( 0 , , 0 ) . Let �'a c , .,s , .,a� , .,s0 be, respectively, the absolutely continuous and singular parts of ., and ., o with respect to p. and p.0 • Show that v:, ( E) > �'a c( E) and .,so ( E) < �'s ( E) for E e ' o . Suppose that p. , .,, �'n are finite measures on ( Q, '> and that v ( A ) == En vn ( A ) for all A . Let �'n ( A ) == fAin dp. + v� (A) and v(A ) == fA fdp. + v'( A ) be the decompositions (32.8); here .,, and .,� are singular with respect to p.. Show that I == En ln except on a set of p.-measure 0 and that v'( A ) == En v� ( A ) for all A . Show that ., << p. if and only if "n << p. for all n. 32.2 f Absolute continuity of a set function cp with respect to a measure p. is defined just as if cp were itself a measure: p.( A ) == 0 must imply that cp(A ) == 0. Show that, if this holds and p. is a-finite, then cp ( A ) == fA idp. for some integrable 1. Show that A + == [ w : I( w ) > 0] and A - == [ w : I( w ) < 0] give a Hahn decomposition for cp . Show that the three variations satisfy cp + ( A ) == fA r dp., cp - ( A ) == fA r dp., and l cp i ( A ) == fA Ill dp.. Hint: To construct I, start with (32.2). f A signed measure cp is a set function which satisfies (32.1) if A 1 , A 2 , are disjoint and which may assume one of the values + oo and oo but not both. Extend the Hahn and Jordan decompositions to signed measures. 
16.12 f Show that, if ., is not a-finite, then the Radon-Nikodym deriva tive need not be unique even if it exists. 31.22 f Suppose that p. and ., are a probability measure and a a-finite measure on the line and that ., << p.. Show that the Radon-Nikodym derivative I satisfies -
32.13. 32.14.
• • •
x x + h] = f x) ( lim ll - h , ( p,{ X - h , X + h]
h _. O
on a set of p.-measure 1. 32.1 5. Find on the unit interval uncountably many probability measures p.P , 0 < p < 1, with supports SP such that P.p { x } == 0 for each x and p and the sp are disjoint in pairs.
448
DERIVATIVES AND CONDITIONAL PROBABI LITY
32. 16. Let 3lb be the field consisting of the finite and the cofinite sets in an
uncountable n. Define q> on 3lb by taking cp( A ) to be the number of points in A if A is finite, and the negative of the number of points in A( if A is cofinite. Show that (32.1) holds (this is not true if Q is countable). Show that there are no negative sets for q> (except the empty set), that there is no Hahn decomposition, and that cp does not have bounded range. SECI'ION 33. CONDmONAL PROBABILITY
The concepts of conditional probability and expected value with respect to a a-field underlie much of modem probability theory. The difficulty in under standing these ideas has to do not with mathematical detail so much as with probabilistic meaning, and the way to get at this meaning is through calculations and examples, of which there are many in this section and the next. The Discrete Case
Consider first the conditional probability of a set A with respect to another set B. It is defined of course by P( A I B) = P( A n B)/P( B), unless P( B) vanishes, in which case it is not defined at all. It is helpful to consider conditional probability in terms of an observer in possession of partial information. t A probability space ( D, �, P) describes the working of a mechanism, governed by chance, which produces a result w distributed according to P; P( A) is for the observer the probability that the point w produced lies in A. Suppose now that w lies in B and that the observer learns this fact and no more. From the point of view of the observer, now in possession of this partial information about w, the probability that w also lies in A is P(A IB) rather than P( A). This is the idea lying back of the definition. If, on the other hand, w happens to lie in B e and the observer learns of this, his probability instead becomes P( A I Be). These two conditional probabilities can be linked together by the simple function ( 33 . 1 )
{ P( A IB ) /( ) = P( A I B") "'
The observer learns whether w lies in B or in Be; his new probability for the event w E A is then just w ). Although the observer does not in
/(
tAs always, obseroer , information , know , and so on are informal, nonmathematical terms; see the related discussion in Section 4 (p. 52).
general know the argument ω of f, he can calculate the value f(ω) because he knows which of B and B^c contains ω. (Note conversely that from the value f(ω) it is possible to determine whether ω lies in B or in B^c, unless P(A|B) = P(A|B^c); that is, unless A and B are independent, in which case the conditional probability coincides with the unconditional one anyway.)

The sets B and B^c partition Ω, and these ideas carry over to the general partition. Let B1, B2, ... be a finite or countable partition of Ω into ℱ-sets, and let 𝒢 consist of all the unions of the B_i. Then 𝒢 is the σ-field generated by the B_i. For A in ℱ, consider the function with values

(33.2)    f(ω) = P(A|B_i) = P(A ∩ B_i)/P(B_i)    if ω ∈ B_i,    i = 1, 2, ....

If the observer learns which element B_i of the partition it is that contains ω, then his new probability for the event ω ∈ A is f(ω). The partition {B_i}, or equivalently the σ-field 𝒢, can be regarded as an experiment, and to learn which B_i it is that contains ω is to learn the outcome of the experiment. For this reason the function or random variable f defined by (33.2) is called the conditional probability of A given 𝒢 and is denoted P[A‖𝒢]. This is written P[A‖𝒢]_ω whenever the argument ω needs to be explicitly shown.

Thus P[A‖𝒢] is the function whose value on B_i is the ordinary conditional probability P(A|B_i). This definition needs to be completed, because P(A|B_i) is not defined if P(B_i) = 0. In this case P[A‖𝒢] will be taken to have any constant value on B_i; the value is arbitrary but must be the same over all of the set B_i. If there are nonempty sets B_i for which P(B_i) = 0, P[A‖𝒢] therefore stands for any one of a family of functions on Ω. A specific such function is for emphasis often called a version of the conditional probability. Note that any two versions are equal except on a set of probability 0.

Example 33.1. Consider the Poisson process. Suppose that 0 < s < t, and let A = [N_s = 0] and B_i = [N_t = i], i = 0, 1, .... Since the increments are independent (Section 23), P(A|B_i) = P[N_s = 0]P[N_t − N_s = i]/P[N_t = i], and since they have Poisson distributions (see (23.9)), a simple calculation reduces this to

(33.3)    P[N_s = 0‖𝒢]_ω = (1 − s/t)^i    if ω ∈ B_i,    i = 0, 1, 2, ....

Since i = N_t(ω) on B_i, this can be written

(33.4)    P[N_s = 0‖𝒢]_ω = (1 − s/t)^{N_t(ω)}.
B; ,
i = 0, 1 , 2, . . . .
P [ Ns = 0 II f§ ] "' = ( 1 - tS ) N, ( "' ) .
450
DERIVATIVES AND CONDITIONAL PROBABILITY
Here the experiment or observation corresponding to { B; } or f§ de termines the number of events-telephone calls, say-occurring in the time interval [0, t ]. For an observer who knows this number but not the locations of the calls within [0, t ], (33.4) gives his probability for the event that none of them occurred before time s. Although this observer does not know w, he knows N1 ( w ), which is all he needs to calculate the right side of (33.4). • Example 33. 2.
Suppose that X0, X1 , space S as in Section 8. The events
• • •
is a Markov chain with state
( 33.5 ) form a finite or countable partition of D as i 0, , i n range over S. If f§n is the a-field generated by this partition, then by the defining condition (8.2) for Markov chains, P[ Xn + l = j ll f§n ]"' = P;n.� holds for w in (33.5). The sets •
•
•
( 33.6 ) for i e S also partition D, and they generate a a-field f§n° smaller than f§n . Now (8.2) also stipulates P[ Xn + l = j ll f§n°1"' = P;j for w in (33.6), and the essence of the Markov property is that
( 33 .7 )
•
The General Case
If f§ is the a-field generated by a partition B 1 , B , , then the general 2 element of l§ is a disjoint union B;1 u B; 2 u . . . , finite or countable, of certain of the B;. To know which set B; it is that contains w is the same thing as to know which sets in f§ contain w and which do not. This second way of looking at the matter carries over to the general a-field f§ contained in !F. (As always, the probability space is (D, ofF, P ).) The a-field f§ will not in general come from a partition as above. One can imagine an observer who knows for each G in f§ whether w e G or w e Gc. Thus the a-field f§ can in principle be identified with an experiment or observation. This is the point of view adopted in Section 4; see p. 52. It is natural to try and define conditional probabilities P[ A ll r9' ] with respect to the experiment f§. To do this, fix an A in ofF and define a finite measure " on f§ by •
11
(G)
= P( A n G) ,
G E f§.
•
•
SECTION
33.
CONDITIONAL PROBABILITY
Then P(G) = 0 implies that ν(G) = 0. The Radon-Nikodym theorem can be applied to the measures ν and P on the measurable space (Ω, 𝒢) because the first one is absolutely continuous with respect to the second.† It follows that there exists a function or random variable f, measurable 𝒢 and integrable with respect to P, such that P(A ∩ G) = ν(G) = ∫_G f dP for all G in 𝒢. Denote this function f by P[A‖𝒢]. It is a random variable with two properties:

(i) P[A‖𝒢] is measurable 𝒢 and integrable.
(ii) P[A‖𝒢] satisfies the functional equation

(33.8)    ∫_G P[A‖𝒢] dP = P(A ∩ G),    G ∈ 𝒢.
There will in general be many such random variables P[A‖𝒢], but any two of them are equal with probability 1. A specific such random variable is called a version of the conditional probability.

If 𝒢 is generated by a partition B1, B2, ..., the function f defined by (33.2) is measurable 𝒢 because [ω: f(ω) ∈ H] is the union of those B_i over which the constant value of f lies in H. Any G in 𝒢 is a disjoint union G = ⋃_k B_{i_k}, and P(A ∩ G) = Σ_k P(A|B_{i_k})P(B_{i_k}), so that (33.2) satisfies (33.8) as well. Thus the general definition is an extension of the one for the discrete case.

Condition (i) in the definition above in effect requires that the values of P[A‖𝒢] depend only on the sets in 𝒢. An observer who knows the outcome of 𝒢 viewed as an experiment knows for each G in 𝒢 whether it contains ω or not; for each x he knows this in particular for the set [ω′: P[A‖𝒢]_{ω′} = x], and hence he in principle knows the functional value P[A‖𝒢]_ω even if he does not know ω itself. In Example 33.1 a knowledge of N_t(ω) suffices to determine the value of (33.4); ω itself is not needed.

Condition (ii) in the definition has a gambling interpretation. Suppose that the observer, after he has learned the outcome of 𝒢, is offered the opportunity to bet on the event A (unless A lies in 𝒢, he does not yet know whether or not it occurred). He is required to pay an entry fee of P[A‖𝒢] units and will win 1 unit if A occurs and nothing otherwise. If the observer decides to bet and pays his fee, he gains 1 − P[A‖𝒢] if A occurs and −P[A‖𝒢] otherwise, so that his gain is

    I_A − P[A‖𝒢].

†Let P₀ be the restriction of P to 𝒢 (Example 10.4) and find on (Ω, 𝒢) a density f for ν with respect to P₀. Then for G ∈ 𝒢, ν(G) = ∫_G f dP₀ = ∫_G f dP (Example 16.4). If g is another such density, then P[f ≠ g] = P₀[f ≠ g] = 0.
If he declines to bet, his gain is of course 0. Suppose that he adopts the strategy of betting if G occurs but not otherwise, where G is some set in 𝒢. He can actually carry out this strategy, since after learning the outcome of the experiment 𝒢 he knows whether or not G occurred. His expected gain with this strategy is his gain integrated over G:

    ∫_G (I_A − P[A‖𝒢]) dP.

But (33.8) is exactly the requirement that this vanish for each G in 𝒢. Condition (ii) requires then that each strategy be fair in the sense that the observer stands neither to win nor to lose on the average. Thus P[A‖𝒢] is the just entry fee, as intuition requires.
Suppose that A e f§, which will always hold if f§ coin cides with the whole a-field !F. Then IA satisfies conditions (i) and (ii), so that P[AIIl§] = lA with probability 1. If A e f§, then to know the outcome of l§ viewed as an experiment is in particular to know whether or not A has • occurred. If f§ is {0, 0 }, the smallest possible a-field, every func tion measurable f§ must be constant. Therefore, P[AII f§ ] "' = P( A ) for all w • in this case. The observer learns nothing from the experiment f§. Example 33.4.
According to these two examples, P[A II{O, O }] is identically P( A ), whereas lA is a version of P[AII�]. For any f§, the function identically equal to P( A ) satisfies condition (i) in the definition of conditional prob ability, whereas lA satisfies condition (ii). Condition (i) becomes more stringent as f§ decreases, and condition (ii) becomes more stringent as f§ increases. The two conditions work in opposite directions and between them delimit the class of vetsions of P [A II f§ ]. Let 0 be the plane R 2 and let � be the class gJ 2 of planar Borel sets. A point of 0 is a pair ( x, y) of reals. Let f§ be the a-field consisting of the vertical strips, the product sets E x R 1 = [(x, y): x e £ ], where E is a linear Borel set. If the observer knows for each strip E X R 1 whether or not it contains (x, y), then, as he knows this for each one-poin t set E, he knows the value of x. Thus the experiment f§ consists in the determination of the first coordinate of the sample point. Suppose now that P is a probability measure on gJ 2 having a density f(x, y) with respect to planar Lebesgue measure: P( A) = ffAf(x, y ) dx dy. Let A be a horizon tal strip R 1 x F = [( x , y): y e F], F being a linear Borel set. The conditional probability P[ AIIf§] can be calculated explicitly. Extunple 33.5.
SECTION 33. CONDITIONAL PROBABILITY 453
Put

(33.9)  φ(x, y) = ∫_F f(x, t) dt / ∫_{R¹} f(x, t) dt.

Set φ(x, y) = 0, say, at points where the denominator here vanishes; these points form a set of P-measure 0. Since φ(x, y) is a function of x alone, it is measurable 𝒢. The general element of 𝒢 being E × R¹, it will follow that φ(x, y) is a version of P[A‖𝒢]_{(x,y)} if it is shown that

(33.10)  ∫_{E×R¹} φ(x, y) dP(x, y) = P(A ∩ (E × R¹)).

Since A = R¹ × F, the right side here is P(E × F). Since P has density f, Theorem 16.10 and Fubini's theorem reduce the left side to

∫_E [∫_{R¹} φ(x, y) f(x, y) dy] dx = ∫_E [∫_F f(x, t) dt] dx = ∫∫_{E×F} f(x, y) dx dy = P(E × F).

Thus (33.9) does give a version of P[R¹ × F ‖ 𝒢]_{(x,y)}. ∎
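The computation in Example 33.5 can also be checked on a grid. The sketch below is an illustration, not part of the text: the density f(x, y) = x + y on the unit square and the sets E = [0, ½], F = [½, 1] are arbitrary choices. It verifies the defining equation (33.10) numerically: integrating φ against P over the strip E × R¹ recovers P(E × F).

```python
# Numerical check of Example 33.5 (illustrative density and sets, not from the text).
N = 200                      # grid resolution
h = 1.0 / N                  # cell width
xs = [(i + 0.5) * h for i in range(N)]
ys = [(j + 0.5) * h for j in range(N)]

def f(x, y):
    return x + y             # density on [0,1]^2; its double integral is 1

def phi(x):
    """phi(x) = \\int_F f(x,t) dt / \\int f(x,t) dt with F = [0.5, 1]."""
    num = sum(f(x, t) * h for t in ys if t >= 0.5)
    den = sum(f(x, t) * h for t in ys)
    return num / den

# Left side of (33.10): integral of phi over the strip E x R^1, E = [0, 0.5]
lhs = sum(phi(x) * sum(f(x, y) * h for y in ys) * h for x in xs if x < 0.5)

# Right side: P(E x F) computed directly
rhs = sum(f(x, y) * h * h for x in xs if x < 0.5 for y in ys if y >= 0.5)

print(abs(lhs - rhs) < 1e-6)   # the two sides agree (here P(E x F) = 0.25)
```

The agreement is exact up to rounding because, cell by cell, φ times the strip mass reproduces the mass of E × F, which is precisely the argument made with Fubini's theorem above.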
The right side of (33.9) is the classical formula for the conditional probability of the event R¹ × F (the event that y ∈ F) given the event {x} × R¹ (given the value of x). Since the event {x} × R¹ has probability 0, the formula P(A|B) = P(A ∩ B)/P(B) does not work here. The whole point of this section is the systematic development of a notion of conditional probability that covers conditioning with respect to events of probability 0. This is accomplished by conditioning with respect to collections of events, that is, with respect to σ-fields 𝒢.

Example 33.6. The set A is by definition independent of the σ-field 𝒢 if it is independent of each G in 𝒢: P(A ∩ G) = P(A)P(G). This being the same thing as P(A ∩ G) = ∫_G P(A) dP, A is independent of 𝒢 if and only if P[A‖𝒢] = P(A) with probability 1. ∎
The σ-field σ(X) generated by a random variable X consists of the sets [ω: X(ω) ∈ H] for H in ℛ¹; see Theorem 20.1. The conditional probability of A with respect to X is defined as P[A‖σ(X)] and is denoted P[A‖X]. Thus P[A‖X] = P[A‖σ(X)] by definition. From the experiment corresponding to the σ-field σ(X) one learns which of the sets [ω′: X(ω′) = x] contains ω and hence learns the value X(ω). Example 33.5 is a case of this: take X(x, y) = x for (x, y) in the sample space Ω = R² there.

This definition applies without change to a random vector, or, equivalently, to a finite set of random variables. It can be adapted to arbitrary sets of random variables as well. For any such set [X_t, t ∈ T], the σ-field σ[X_t, t ∈ T] it generates is the smallest σ-field with respect to which each X_t is measurable. It is generated by the collection of sets of the form [ω: X_t(ω) ∈ H] for t in T and H in ℛ¹. The conditional probability P[A‖X_t, t ∈ T] of A with respect to this set of random variables is by definition the conditional probability P[A‖σ[X_t, t ∈ T]] of A with respect to the σ-field they generate. In this notation the property (33.7) of Markov chains becomes

(33.11)  P[X_{n+1} = j ‖ X₀, …, X_n] = P[X_{n+1} = j ‖ X_n].

The conditional probability of [X_{n+1} = j] is the same for someone who knows the present state X_n as for someone who knows the present state X_n and the past states X₀, …, X_{n−1} as well.

Example 33.7. Let X and Y be random vectors of dimensions j and k, let μ be the distribution of X over R^j, and suppose that X and Y are
independent. According to (20.30),
P[X ∈ H, (X, Y) ∈ J] = ∫_H P[(x, Y) ∈ J] μ(dx)

for H ∈ ℛ^j and J ∈ ℛ^{j+k}. This is a consequence of Fubini's theorem; it has a conditional probability interpretation. For each x in R^j put

(33.12)  f(x) = P[(x, Y) ∈ J] = P[ω′: (x, Y(ω′)) ∈ J].

If X and Y are independent, then

(33.13)  f(X(ω)) = P[(X, Y) ∈ J ‖ X]_ω

with probability 1. By the theory of Section 18, f(X(ω)) is measurable
σ(X), and since μ is the distribution of X, a change of variable gives

∫_{[X ∈ H]} f(X(ω)) P(dω) = ∫_H f(x) μ(dx) = P([(X, Y) ∈ J] ∩ [X ∈ H]).

Since [X ∈ H] is the general element of σ(X), (33.13) follows. The fact just proved can be written

P[(X, Y) ∈ J ‖ X]_ω = P[(X(ω), Y) ∈ J] = P[ω′: (X(ω), Y(ω′)) ∈ J].
Replacing ω′ by ω on the right here causes a notational collision analogous to the one replacing y by x causes in ∫ₓ^∞ f(x, y) dy. ∎

Suppose that X and Y are independent random variables and that Y has distribution function F. For J = [(u, v): max{u, v} ≤ m], (33.12) is 0 for m < x and F(m) for m ≥ x; if M = max{X, Y}, then (33.13) gives

(33.14)  P[M ≤ m ‖ X] = I_{[X ≤ m]} F(m)

with probability 1. All equations involving conditional probabilities must in this way be qualified by the phrase with probability 1, because the conditional probability is unique only to within a set of probability 0.

The following theorem is helpful in calculating conditional probabilities.
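The identity (33.13)-(33.14) can be illustrated by simulation. The sketch below is an assumed concrete instance, not part of the text: X and Y are taken uniform on (0, 1), so F(t) = t, and the cutoffs m and b are arbitrary. It checks the defining equation over a set G = [X ≤ b] in σ(X): the integral of the claimed version I_{[X ≤ m]}F(m) over G matches P(M ≤ m, X ≤ b).

```python
# Monte Carlo check that I[X<=m]*F(m) acts as P[M <= m || X] (uniform F assumed).
import random

random.seed(42)
n = 200_000
m, b = 0.7, 0.4          # cutoff for M, and a test set G = [X <= b] in sigma(X)
lhs_sum = 0.0            # integral over G of the claimed version I[X<=m]*F(m)
rhs_count = 0            # direct count for P(M <= m, X <= b)

for _ in range(n):
    x, y = random.random(), random.random()   # F(t) = t on (0, 1)
    if x <= b:
        lhs_sum += (m if x <= m else 0.0)     # version evaluated at this omega
    if max(x, y) <= m and x <= b:
        rhs_count += 1

lhs = lhs_sum / n
rhs = rhs_count / n
print(abs(lhs - rhs) < 0.01)   # both are near m*b = 0.28 here
```

Since the same samples feed both sides, the difference is a mean of bounded terms with expectation 0, and the agreement is well inside Monte Carlo error.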
Theorem 33.1. Let 𝒫 be a π-system generating the σ-field 𝒢, and suppose that Ω is a finite or countable union of sets in 𝒫. An integrable function f is a version of P[A‖𝒢] if it is measurable 𝒢 and if

(33.15)  ∫_G f dP = P(A ∩ G)

holds for all G in 𝒫.

PROOF. Apply Theorem 10.4. ∎

The condition that Ω is a finite or countable union of 𝒫-sets cannot be suppressed; see Example 10.5.

Example 33.8. Suppose that X and Y are independent random variables with a common distribution function F that is positive and continuous. What is the conditional probability of [X ≤ x] given the random variable M = max{X, Y}? As
it should clearly be 1 if M ≤ x, suppose that M > x. Since X ≤ x requires M = Y, the chance of which is ½ by symmetry, the conditional probability of [X ≤ x] should by independence be ½F(x)/F(m) = ½P[X ≤ x | X ≤ m] with the random variable M substituted for m. Intuition thus gives

(33.16)  P[X ≤ x ‖ M] = I_{[M ≤ x]} + ½ (F(x)/F(M)) I_{[M > x]}.

It suffices to check (33.15) for sets G = [M ≤ m], because these form a π-system generating σ(M). The functional equation reduces to

(33.17)  P[M ≤ min{x, m}] + ½ ∫_{[x < M ≤ m]} (F(x)/F(M)) dP = P[M ≤ m, X ≤ x].

Since the other case is easy, suppose that x < m. Since the distribution of (X, Y) is product measure, it follows by Fubini's theorem and the assumed continuity of F that the left side is

F(x)² + ½ F(x) ∫∫_{[x < max{u,v} ≤ m]} (1/F(max{u, v})) dF(u) dF(v) = F(x)² + ½ F(x) · 2(F(m) − F(x)) = F(x)F(m) = P[M ≤ m, X ≤ x],

which gives (33.17). ∎
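The functional equation (33.17) can also be checked by simulation. In the sketch below, F uniform on (0, 1) and the cutoffs x = 0.3, m = 0.8 are assumed choices for illustration: both sides of (33.17) should come out near F(x)F(m) = 0.24.

```python
# Monte Carlo check of (33.15) for the version (33.16), with F(t) = t assumed.
import random

random.seed(7)
n = 200_000
x_cut, m_cut = 0.3, 0.8     # x < m; X, Y uniform on (0, 1)

lhs_sum = 0.0               # integral of the version (33.16) over G = [M <= m]
rhs_count = 0               # direct count for P(M <= m, X <= x)

for _ in range(n):
    u, v = random.random(), random.random()
    mx = max(u, v)
    if mx <= m_cut:
        # the version (33.16) evaluated at this sample point
        version = 1.0 if mx <= x_cut else 0.5 * x_cut / mx
        lhs_sum += version
        if u <= x_cut:
            rhs_count += 1

lhs = lhs_sum / n
rhs = rhs_count / n
print(abs(lhs - rhs) < 0.01)   # both near F(x)F(m) = 0.24
```

The exact computation mirrors the Fubini argument in the text: the integral term contributes ½x · 2(m − x) and the first term x², summing to xm.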
Example 33.9. A collection [X_t: t ≥ 0] of random variables is a Markov process in continuous time if for k ≥ 1, 0 ≤ t₁ ≤ ⋯ ≤ t_k ≤ u, and H ∈ ℛ¹,

(33.18)  P[X_u ∈ H ‖ X_{t₁}, …, X_{t_k}] = P[X_u ∈ H ‖ X_{t_k}]

with probability 1. The analogue for discrete time is (33.11). (The X_n there have countable range as well, and the transition probabilities are constant in time, conditions that are not imposed here.)

Suppose that t ≤ u. Looking on the right side of (33.18) as a version of the conditional probability on the left shows that

(33.19)  ∫_G P[X_u ∈ H ‖ X_t] dP = P([X_u ∈ H] ∩ G)

if 0 ≤ t₁ ≤ ⋯ ≤ t_k = t ≤ u and G ∈ σ(X_{t₁}, …, X_{t_k}). Fix t, u, and H and let k and t₁, …, t_k vary. Consider the class 𝒫 = ⋃ σ(X_{t₁}, …, X_{t_k}), the union extending over all k ≥ 1 and all k-tuples satisfying 0 ≤ t₁ ≤ ⋯ ≤ t_k = t. If A ∈ σ(X_{t₁}, …, X_{t_k}) and B ∈ σ(X_{s₁}, …, X_{s_j}), then A ∩ B ∈ σ(X_{r₁}, …, X_{r_l}), where the r_a are the s_b and the t_c merged together. Thus 𝒫 is a π-system. Since 𝒫 generates σ[X_s: s ≤ t] and P[X_u ∈ H ‖ X_t] is measurable with respect to this σ-field, it follows by (33.19) and Theorem 33.1 that P[X_u ∈ H ‖ X_t] is a version of P[X_u ∈ H ‖ X_s, s ≤ t]:

(33.20)  P[X_u ∈ H ‖ X_s, s ≤ t] = P[X_u ∈ H ‖ X_t]

with probability 1. This says that for calculating conditional probabilities about the future, the present σ(X_t) is equivalent to the present and the entire past σ[X_s: s ≤ t]. This follows from the apparently weaker condition (33.18). ∎

Example 33.10. The Poisson process [N_t: t ≥ 0] has independent increments (Section 23). Suppose that 0 ≤ t₁ < ⋯ < t_k < u. The random vector (N_{t₁}, N_{t₂} − N_{t₁}, …, N_{t_k} − N_{t_{k−1}}) is independent of N_u − N_{t_k}, and so (Theorem 20.2) (N_{t₁}, N_{t₂}, …, N_{t_k}) is independent of N_u − N_{t_k}. If J is the set of points (x₁, …, x_k, y) in R^{k+1} such that x_k + y ∈ H, where H ∈ ℛ¹, and if μ is the distribution of N_u − N_{t_k}, then (33.12) is P[(x₁, …, x_k, N_u − N_{t_k}) ∈ J] = P[x_k + N_u − N_{t_k} ∈ H] = μ(H − x_k). Therefore, (33.13) gives P[N_u ∈ H ‖ N_{t₁}, …, N_{t_k}] = μ(H − N_{t_k}). This holds also if k = 1, and hence P[N_u ∈ H ‖ N_{t₁}, …, N_{t_k}] = P[N_u ∈ H ‖ N_{t_k}]. The Poisson process thus has the Markov property (33.18); this is a consequence solely of the independence of the increments. The extended Markov property (33.20) follows. ∎
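The Markov property of Example 33.10 can be seen in a simulation. The sketch below is illustrative (rate 1, times t = 1, u = 2, and the conditioning event N_t = 1 are arbitrary choices): a Poisson path is built from exponential interarrival times, and the conditional mean of N_u given N_t = 1 is compared with the value 1 + (u − t) predicted by the independent-increments argument.

```python
# Simulation of Example 33.10: E[N_u | N_t = k] = k + (u - t) for a rate-1 process.
import random

random.seed(1)

def poisson_counts(t1, t2):
    """Counts of a rate-1 Poisson process in [0, t1] and [0, t2],
    built from exponential interarrival times."""
    s, n1, n2 = 0.0, 0, 0
    while True:
        s += random.expovariate(1.0)
        if s > t2:
            break
        n2 += 1
        if s <= t1:
            n1 += 1
    return n1, n2

t, u = 1.0, 2.0
total, cond_sum, cond_n = 200_000, 0.0, 0
for _ in range(total):
    n_t, n_u = poisson_counts(t, u)
    if n_t == 1:                    # condition on the present state N_t = 1
        cond_sum += n_u
        cond_n += 1

est = cond_sum / cond_n
print(abs(est - 2.0) < 0.05)   # E[N_u | N_t = 1] = 1 + (u - t) = 2
```

That the path before time t beyond the value N_t is irrelevant is exactly the content of (33.18) for this process.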
Properties of Conditional Probability

Theorem 33.2. For each A in ℱ,

(33.21)  0 ≤ P[A‖𝒢] ≤ 1

with probability 1. If A₁, A₂, … is a finite or countable sequence of disjoint sets, then

(33.22)  P[⋃ₙ Aₙ ‖ 𝒢] = Σₙ P[Aₙ ‖ 𝒢]

with probability 1.

PROOF. For each version of the conditional probability, ∫_G P[A‖𝒢] dP = P(A ∩ G) ≥ 0 for each G in 𝒢; since P[A‖𝒢] is measurable 𝒢, it must be nonnegative except on a set of P-measure 0. The other inequality in (33.21) is proved the same way.
If the Aₙ are disjoint and if G lies in 𝒢, it follows (Theorem 16.6) that

∫_G Σₙ P[Aₙ‖𝒢] dP = Σₙ ∫_G P[Aₙ‖𝒢] dP = Σₙ P(Aₙ ∩ G) = P((⋃ₙ Aₙ) ∩ G).

Thus Σₙ P[Aₙ‖𝒢], which is certainly measurable 𝒢, satisfies the functional equation for P[⋃ₙ Aₙ‖𝒢], and so must coincide with it except perhaps on a set of P-measure 0. Hence (33.22). ∎

Additional useful facts can be established by similar arguments. If A ⊂ B, then

(33.23)  P[B − A ‖ 𝒢] = P[B‖𝒢] − P[A‖𝒢], and P[A‖𝒢] ≤ P[B‖𝒢].

The inclusion-exclusion formula

(33.24)  P[⋃ᵢ₌₁ⁿ Aᵢ ‖ 𝒢] = Σᵢ P[Aᵢ‖𝒢] − Σ_{i<j} P[Aᵢ ∩ Aⱼ ‖ 𝒢] + ⋯

holds. If Aₙ ↑ A, then

(33.25)  P[Aₙ‖𝒢] ↑ P[A‖𝒢],

and if Aₙ ↓ A, then

(33.26)  P[Aₙ‖𝒢] ↓ P[A‖𝒢].

Further, P(A) = 1 implies that

(33.27)  P[A‖𝒢] = 1,

and P(A) = 0 implies that

(33.28)  P[A‖𝒢] = 0.

Of course, (33.23) through (33.28) hold with probability 1 only.

Difficulties and Curiosities
This section has been devoted almost entirely to examples connecting the abstract definition (33.8) with the probabilistic idea lying back of it. There are pathological examples showing that the interpretation of conditional probability in terms of an observer with partial information breaks down in certain cases.

Example 33.11. Let (Ω, ℱ, P) be the unit interval Ω with Lebesgue measure P on the σ-field ℱ of Borel subsets of Ω. Take 𝒢 to be the σ-field
of sets that are either countable or co-countable. Then the function identically equal to P(A) is a version of P[A‖𝒢] because P(G) is either 0 or 1 for every G in 𝒢; therefore,

(33.29)  P[A‖𝒢]_ω = P(A)

with probability 1. But since 𝒢 contains all one-point sets, to know which elements of 𝒢 contain ω is to know ω itself. Thus 𝒢 viewed as an experiment should be completely informative (the observer given the information in 𝒢 should know ω exactly), and so one might expect that

(33.30)  P[A‖𝒢]_ω = 1 if ω ∈ A, and P[A‖𝒢]_ω = 0 if ω ∉ A.

This is Example 4.9 in a new form. ∎
The mathematical definition gives (33.29); the heuristic considerations lead to (33.30). Of course, (33.29) is right and (33.30) is wrong. The heuristic view breaks down in certain cases but is nonetheless illuminating and cannot, since it does not intervene in proofs, lead to any difficulties.

The point of view in this section has been "global." To each fixed A in ℱ has been attached a function (usually a family of functions) P[A‖𝒢]_ω defined over all of Ω. What happens if the point of view is reversed, if ω is fixed and A varies over ℱ? Will this result in a probability measure on ℱ? Intuition says it should, and if it does, then (33.21) through (33.28) all reduce to standard facts about measures.

Suppose that B₁, …, B_r is a partition of Ω into ℱ-sets, and let 𝒢 = σ(B₁, …, B_r). If P(B₁) = 0 and P(Bᵢ) > 0 for the other i, then one version of P[A‖𝒢] is

P[A‖𝒢]_ω = 17 for ω ∈ B₁,  P[A‖𝒢]_ω = P(A ∩ Bᵢ)/P(Bᵢ) for ω ∈ Bᵢ, i = 2, …, r.

With this choice of version for each A, P[A‖𝒢]_ω is, as a function of A, a probability measure on ℱ if ω ∈ B₂ ∪ ⋯ ∪ B_r, but not if ω ∈ B₁. The "wrong" versions have been chosen. If, for example,

P[A‖𝒢]_ω = P(A) for ω ∈ B₁,  P[A‖𝒢]_ω = P(A ∩ Bᵢ)/P(Bᵢ) for ω ∈ Bᵢ, i = 2, …, r,

then P[A‖𝒢]_ω is a probability measure in A for each ω. Clearly, versions such as this one exist if 𝒢 is finite.
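For a finite 𝒢 the claim is easy to verify exactly. The toy space below (six equally likely points and a three-cell partition, all arbitrary illustration) uses the standard version P(A ∩ Bᵢ)/P(Bᵢ) and checks that, for fixed ω, it is additive over disjoint events and takes values in [0, 1], that is, behaves as a probability measure in A.

```python
# Exact check, with rational arithmetic, that the partition version of P[A || G]
# is a probability measure in A.  (Toy space for illustration only.)
from fractions import Fraction as Fr

omega = range(6)                                # six equally likely points
P = {w: Fr(1, 6) for w in omega}

partition = [{0, 1}, {2, 3, 4}, {5}]            # cells generating the sigma-field G

def cond_prob(A):
    """The standard version of P[A || G]: constant P(A & B)/P(B) on each cell B."""
    out = {}
    for B in partition:
        pB = sum(P[w] for w in B)
        pAB = sum(P[w] for w in B if w in A)
        for w in B:
            out[w] = pAB / pB
    return out

A1, A2 = {0, 2, 4}, {1, 5}                      # disjoint events
union_version = cond_prob(A1 | A2)
v1, v2 = cond_prob(A1), cond_prob(A2)

additive = all(union_version[w] == v1[w] + v2[w] for w in omega)
in_range = all(0 <= v <= 1 for v in v1.values())
print(additive and in_range)
```

With cells of positive probability, (33.21) and (33.22) hold pointwise in ω, not merely with probability 1; the pathology discussed next arises only for infinite 𝒢.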
It might be thought that for an arbitrary σ-field 𝒢 in ℱ versions of the various P[A‖𝒢] can be so chosen that P[A‖𝒢]_ω is for each fixed ω a probability measure as A varies over ℱ. It is possible to construct a counterexample showing that this is not so.† The example is possible because the exceptional ω-set of probability 0 where (33.22) fails depends on the sequence A₁, A₂, …; if there are uncountably many such sequences, it can happen that the union of these exceptional sets has positive probability whatever versions P[A‖𝒢] are chosen.

The existence of such pathological examples turns out not to matter. Example 33.9 illustrates the reason why. From the assumption (33.18) the notably stronger conclusion (33.20) was reached. Since the set [X_u ∈ H] is fixed throughout the argument, it does not matter that conditional probabilities may not, in fact, be measures. What does matter for the theory is Theorem 33.2 and its extensions.

Consider a point ω₀ with the property that P(G) > 0 for every G in 𝒢 that contains ω₀. This will be true if the one-point set {ω₀} lies in 𝒢 and has positive probability. Fix any versions of the P[A‖𝒢]. For each A the set [ω: P[A‖𝒢]_ω < 0] lies in 𝒢 and has probability 0; it therefore cannot contain ω₀. Thus P[A‖𝒢]_{ω₀} ≥ 0. Similarly, P[Ω‖𝒢]_{ω₀} = 1, and, if the Aₙ are disjoint, P[⋃ₙ Aₙ‖𝒢]_{ω₀} = Σₙ P[Aₙ‖𝒢]_{ω₀}. Therefore, P[A‖𝒢]_{ω₀} is a probability measure as A ranges over ℱ.

Thus conditional probabilities behave like probabilities at points of positive probability. That they may not do so at points of probability 0 causes no problem, because individual such points have no effect on the probabilities of sets. Of course, sets of points individually having probability 0 do have an effect, but here the global point of view reenters.
Conditional Probability Distributions

Let X be a random variable on (Ω, ℱ, P) and let 𝒢 be a σ-field in ℱ.

Theorem 33.3. There exists a function μ(H, ω), defined for H in ℛ¹ and ω in Ω, with these two properties:
(i) For each ω in Ω, μ(H, ω) is, as a function of H, a probability measure on ℛ¹.
(ii) For each H in ℛ¹, μ(H, ω) is, as a function of ω, a version of P[X ∈ H ‖ 𝒢]_ω.

The probability measure μ(·, ω) is a conditional distribution of X given 𝒢. If 𝒢 = σ(Z), it is a conditional distribution of X given Z.

† The argument is outlined in Problem 33.13. It depends on the construction of certain nonmeasurable sets.
PROOF. For each rational r, let F(r, ω) be a version of P[X ≤ r ‖ 𝒢]_ω. If r ≤ s, then by (33.23),

(33.31)  F(r, ω) ≤ F(s, ω)

for ω outside a 𝒢-set A_{rs} of probability 0. By (33.26),

(33.32)  F(r, ω) = limₙ F(r + n⁻¹, ω)

for ω outside a 𝒢-set B_r of probability 0. Finally, by (33.25) and (33.26),

(33.33)  lim_{r→∞} F(r, ω) = 1,  lim_{r→−∞} F(r, ω) = 0

outside a 𝒢-set C of probability 0. As there are only countably many of these exceptional sets, their union E lies in 𝒢 and has probability 0. For ω ∉ E extend F(·, ω) to all of R¹ by setting F(x, ω) = inf[F(r, ω): x < r]. For ω ∈ E take F(x, ω) = F(x), where F is some arbitrary but fixed distribution function. Suppose that ω ∉ E. By (33.31) and (33.32), F(x, ω) agrees with the first definition on the rationals and is nondecreasing; it is right-continuous; and by (33.33) it is a probability distribution function. Therefore, there exists a probability measure μ(·, ω) on (R¹, ℛ¹) with distribution function F(·, ω). For ω ∈ E, let μ(·, ω) be the probability measure corresponding to F(x). Then condition (i) is satisfied.

The class of H for which μ(H, ω) is measurable 𝒢 is a λ-system containing the sets H = (−∞, r] for rational r; therefore μ(H, ω) is measurable 𝒢 for H in ℛ¹. By construction, μ((−∞, r], ω) = P[X ≤ r ‖ 𝒢]_ω with probability 1 for rational r. For H = (−∞, r], then,

∫_G μ(H, ω) P(dω) = P([X ∈ H] ∩ G)

for all G in 𝒢. Fix G. Each side of this equation is a measure as a function of H, and so the equation must hold for all H in ℛ¹. ∎

Example 33.12. Let X and Y be random variables whose joint distribution ν in R² has density f(x, y) with respect to Lebesgue measure: P[(X, Y) ∈ A] = ν(A) = ∫∫_A f(x, y) dx dy. Let g(x, y) = f(x, y)/∫_{R¹} f(x, t) dt, and let μ(H, x) = ∫_H g(x, y) dy have probability density g(x, ·); if ∫_{R¹} f(x, t) dt = 0, let μ(·, x) be an arbitrary probability measure on the line. Then μ(H, X(ω)) will serve as the conditional distribution of Y given X. Indeed, (33.10) is the same thing as ∫_{E×R¹} μ(F, x) dν(x, y) = ν(E × F), and a change of variable gives ∫_{[X ∈ E]} μ(F, X(ω)) P(dω) = P[X ∈ E, Y ∈ F]. Thus μ(F, X(ω)) is a version of P[Y ∈ F ‖ X]_ω. This is Example 33.5 in a new guise. ∎
PROBLEMS

33.1. 20.5 ↑ Borel's paradox. Suppose that a random point on the sphere is specified by longitude θ and latitude φ, but restrict θ by 0 ≤ θ < π, so that θ specifies only the complete meridian circle (not semicircle) containing the point, and compensate by letting φ range over (−π, π].
(a) Show that for given θ the conditional distribution of φ has density ¼|cos φ| over (−π, π]. If the point lies on, say, the meridian circle through Greenwich, it is thus not uniformly distributed over that great circle.
(b) Show that for given φ the conditional distribution of θ is uniform over [0, π). If the point lies on the equator (φ is 0 or π), it is thus uniformly distributed over that great circle.
Since the point is uniformly distributed over the spherical surface and great circles are indistinguishable, (a) and (b) stand in apparent contradiction. This shows again the inadmissibility of conditioning with respect to an isolated event of probability 0. The relevant σ-field must not be lost sight of.

33.2. 20.22 ↑ Let X and Y be independent, each having the standard normal distribution, and let (R, Θ) be the polar coordinates for (X, Y).
(a) Show that X + Y and X − Y are independent and that R² = [(X + Y)² + (X − Y)²]/2, and conclude that the conditional distribution of R² given X − Y is the chi-squared distribution with one degree of freedom translated by (X − Y)²/2.
(b) Show that the conditional distribution of R² given Θ is chi-squared with two degrees of freedom.
(c) If X − Y = 0, the conditional distribution of R² is chi-squared with one degree of freedom. If Θ = π/4 or Θ = 5π/4, the conditional distribution of R² is chi-squared with two degrees of freedom. But the events [X − Y = 0] and [Θ = π/4] ∪ [Θ = 5π/4] are the same. Resolve the apparent contradiction.

33.3. ↑ Paradoxes of a somewhat similar kind arise in very simple cases.
(a) Of three prisoners, call them 1, 2, and 3, two have been chosen by lot for execution. Prisoner 3 says to the guard, "Which of 1 and 2 is to be executed? One of them will be, and you give me no information about myself in telling me which it is." The guard finds this reasonable and says, "Prisoner 1 is to be executed." And now 3 reasons, "I know that 1 is to be executed; the other will be either 2 or me, and so my chance of being executed is now only ½, instead of the ⅔ it was before." Apparently, the guard has given him information. If one looks for a σ-field, it must be the one describing the guard's answer, and it then becomes clear that the sample space is incompletely specified. Suppose that, if 1 and 2 are to be executed, the guard's response is "1" with probability p and "2" with probability 1 − p; and, of course, suppose that, if 3 is to be executed, the guard names the other victim. Calculate the conditional probabilities.
(b) Assume that among families with two children the four sex distributions are equally likely. You have been introduced to one of the two
children in such a family, and he is a boy. What is the conditional probability that the other is a boy as well?

33.4. Extend Examples 33.4 and 33.11: If P(G) is 0 or 1 for every G in 𝒢, then P[A‖𝒢] = P(A) with probability 1.

33.5. There is a slightly different approach to conditional probability. Let (Ω, ℱ, P) be a probability space, (Ω′, ℱ′) a measurable space, and T: Ω → Ω′ a mapping measurable ℱ/ℱ′. Define a measure ν on ℱ′ by ν(A′) = P(A ∩ T⁻¹A′) for A′ ∈ ℱ′. Prove that there exists a function p(A|ω′) on Ω′, measurable ℱ′ and integrable PT⁻¹, such that ∫_{A′} p(A|ω′) PT⁻¹(dω′) = P(A ∩ T⁻¹A′) for all A′ in ℱ′. Intuitively, p(A|ω′) is the conditional probability that ω ∈ A for someone who knows that Tω = ω′. Let 𝒢 = [T⁻¹A′: A′ ∈ ℱ′]; show that 𝒢 is a σ-field and that p(A|Tω) is a version of P[A‖𝒢]_ω.

33.6. ↑ Suppose that T = X is a random variable, (Ω′, ℱ′) = (R¹, ℛ¹), and x is the general point of R¹. In this case p(A|x) is sometimes written P[A|X = x]. What is the problem with this notation?

33.7. For the Poisson process (see Example 33.1) show that for 0 ≤ s ≤ t,

P[N_s = k ‖ N_t] = C(N_t, k)(s/t)^k(1 − s/t)^{N_t − k} for 0 ≤ k ≤ N_t,

and P[N_s = k ‖ N_t] = 0 for k > N_t. Thus the conditional distribution (in the sense of Theorem 33.3) of N_s given N_t is binomial with parameters N_t and s/t.

33.8. 29.10 ↑ Suppose that (X₁, X₂) has the centered normal distribution, that is, has in the plane the distribution with density (29.29). Express the quadratic form in the exponential as a₁₁(x₁ + a₁₂a₁₁⁻¹x₂)² + τx₂², where τ = a₂₂ − a₁₂²a₁₁⁻¹. Describe the conditional distribution of X₂ given X₁.

33.9. Prove this form of Bayes' theorem:

P(G|A) = ∫_G P[A‖𝒢] dP / ∫_Ω P[A‖𝒢] dP

for G ∈ 𝒢.
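The three-prisoners model of Problem 33.3(a) can be explored by simulation once the guard's randomization is specified. In the sketch below, p = ½ is an arbitrary illustrative choice; the answer 1/(1 + p) for prisoner 3's conditional chance of execution given the response "1" is what the calculation asked for in the problem yields under this model.

```python
# Simulation of Problem 33.3(a) with guard parameter p = 1/2 (assumed value).
import random

random.seed(3)
p = 0.5                      # guard says "1" with probability p when 1 and 2 both die
n = 300_000
says_one = 0
says_one_and_3_dies = 0

for _ in range(n):
    doomed = random.choice([(1, 2), (1, 3), (2, 3)])   # pair chosen by lot
    if doomed == (1, 2):
        answer = 1 if random.random() < p else 2
    elif doomed == (1, 3):
        answer = 1                                     # guard must name 1
    else:
        answer = 2                                     # guard must name 2
    if answer == 1:
        says_one += 1
        if 3 in doomed:
            says_one_and_3_dies += 1

est = says_one_and_3_dies / says_one
print(abs(est - 2 / 3) < 0.01)   # P(3 executed | guard says "1") = 1/(1+p) = 2/3
```

With p = ½ the guard's answer carries no information about prisoner 3, whose conditional probability of execution stays at ⅔; other values of p would change the answer, which is the point of modeling the response explicitly.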
33.10. Suppose that μ(H, ω) has property (i) in Theorem 33.3, and suppose that μ(H, ·) is a version of P[X ∈ H ‖ 𝒢] for H in a π-system generating ℛ¹. Show that μ(·, ω) is a conditional distribution of X given 𝒢.

33.11. ↑ Deduce from (33.16) that the conditional distribution of X given M is

½ I_H(M(ω)) + ½ μ(H ∩ (−∞, M(ω)]) / μ((−∞, M(ω)]),

where μ is the distribution corresponding to F (positive and continuous). Hint: First check H = (−∞, x].

33.12. It is shown in Example 33.9 that (33.18) implies (33.20). Prove the reverse implication.

33.13. 4.13 12.4 ↑ The following construction shows that conditional probabilities may not give measures. Complete the details.
In Problem 4.13 it is shown that there exist a probability space (Ω, ℱ, P), a σ-field 𝒢 in ℱ, and a set H in ℱ, such that P(H) = ½, H and 𝒢 are independent, 𝒢 contains all the singletons, and 𝒢 is generated by a countable subclass. The countable subclass generating 𝒢 can be taken to be a π-system 𝒫 = {B₁, B₂, …} (pass to the finite intersections of the sets in the original class).
Assume that it is possible to choose versions P[A‖𝒢] so that P[A‖𝒢]_ω is for each ω a probability measure as A varies over ℱ. Let Cₙ be the ω-set where P[Bₙ‖𝒢]_ω = I_{Bₙ}(ω); show (Example 33.3) that C = ⋂ₙ Cₙ has probability 1. Show that ω ∈ C implies that P[G‖𝒢]_ω = I_G(ω) for all G in 𝒢 and hence that P[{ω}‖𝒢]_ω = 1.
Now ω ∈ H ∩ C implies that P[H‖𝒢]_ω ≥ P[{ω}‖𝒢]_ω = 1, and ω ∈ Hᶜ ∩ C implies that P[H‖𝒢]_ω ≤ P[Ω − {ω}‖𝒢]_ω = 0. Thus ω ∈ C implies that P[H‖𝒢]_ω = I_H(ω). But since H and 𝒢 are independent, P[H‖𝒢]_ω = P(H) = ½ with probability 1, a contradiction.
This example is related to Example 4.9 but concerns mathematical fact instead of heuristic interpretation.

33.14. Use Theorem 12.5 to extend Theorem 33.3 to the case in which X is a random vector.

33.15. Let α and β be σ-finite measures on the line and let f(x, y) be a probability density with respect to α × β. Define

(33.34)  g_x(y) = f(x, y) / ∫_{R¹} f(x, t) β(dt)

unless the denominator vanishes, in which case take g_x(y) = 0, say. Show that, if (X, Y) has density f with respect to α × β, then the conditional distribution of Y given X has density g_X(y) with respect to β. This generalizes Examples 33.5 and 33.12, where α and β are Lebesgue measure.
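The prior-likelihood-posterior mechanics described in the Bayes language of Problems 33.15 and 33.17 can be sketched on a grid. The prior (Beta(2, 1)) and the Bernoulli likelihood below are assumed illustrative choices, not taken from the text; the check is that the normalized product f(x)g_x(y) integrates to 1 in x and has the posterior mean it should (Beta(3, 1), mean ¾, after observing y = 1).

```python
# Posterior density p_y(x) proportional to f(x) * g_x(y) on a grid.
# Prior and likelihood are illustrative choices, not from the text.
N = 400
h = 1.0 / N
xs = [(i + 0.5) * h for i in range(N)]

def prior(x):
    return 2.0 * x                    # density of Beta(2, 1) on (0, 1)

def likelihood(x, y):                 # g_x(y): chance of observing y given x
    return x if y == 1 else 1.0 - x   # a single Bernoulli observation

y_obs = 1
weights = [prior(x) * likelihood(x, y_obs) for x in xs]
norm = sum(w * h for w in weights)
posterior = [w / norm for w in weights]

mass = sum(q * h for q in posterior)
mean = sum(x * q * h for x, q in zip(xs, posterior))
print(abs(mass - 1.0) < 1e-9, abs(mean - 0.75) < 1e-3)
```

Normalizing by ∫ f(t)g_t(y) dt is exactly the denominator in the posterior formula of Problem 33.17, here approximated by a Riemann sum.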
33.16. 18.25 ↑ Suppose that μ and ν_x (one for each real x) are probability measures on the line, and suppose that ν_x(B) is a Borel function in x for each B ∈ ℛ¹. Then (see Problem 18.25)

(33.35)  π(E) = ∫_{R¹} ν_x[y: (x, y) ∈ E] μ(dx)

defines a probability measure on (R², ℛ²). Suppose that (X, Y) has distribution π, and show that ν_X is a version of the conditional distribution of Y given X.

33.17. ↑ Let α and β be σ-finite measures on the line. Specialize the setup of Problem 33.16 by supposing that μ has density f(x) with respect to α and ν_x has density g_x(y) with respect to β. Assume that g_x(y) is measurable ℛ² in the pair (x, y), so that ν_x(B) is automatically measurable in x. Show that (33.35) has density f(x)g_x(y) with respect to α × β: π(E) = ∫∫_E f(x)g_x(y) α(dx) β(dy). Show that (33.34) is consistent with f(x, y) = f(x)g_x(y). Put

p_y(x) = f(x)g_x(y) / ∫_{R¹} f(t)g_t(y) α(dt).

Suppose that (X, Y) has density f(x)g_x(y) with respect to α × β, and show that p_Y(x) is a density with respect to α for the conditional distribution of X given Y. In the language of Bayes, f(x) is the prior density of a parameter x, g_x(y) is the conditional density of the observation y given the parameter, and p_y(x) is the posterior density of the parameter given the observation.

33.18. ↑ Now suppose that α and β are Lebesgue measure, that f is positive, continuous, and bounded, and that g_x(y) = e^{−(y−x)²/2}/√(2π) is the normal density with mean x and variance 1. Show that
for fixed x and y. Thus the posterior density is approximately that of a normal distribution with mean y and variance 1/n. 33.19. 32.14 t Suppose that X has distribution p.. Now P[A II X1c., == / ( X(w)) for some Borel function f. Show that lim,. 0 P[ A l x - h < X s x + h ] == f(x) for x in a set of p.-measure 1. Roughly speaking, P [ A lx - h < X < x + h ] P[A I X == x]. Hint: Take P( B) == P(A n [ X e B)) in Problem 32.14. .....
�
466
DERIVATIVES AND CONDITIONAL PROBABILITY
SECI'ION 34. CONDITIONAL EXPECTATION
In this section the theory of conditional expectation is developed from first principles. The properties of conditional probabilities will then follow as a special case. The preceding section was long only because of the examples in it; the theory itself is quite compact. Definition
Suppose that X is an integrable random variable on ( 0 , S&", P) and that f§ is a a-field in S&". There exists a random variable E[ X ll rl], called the conditional expected value of X given rl, which has these two properties: (i) E[ X II�] i� measurable � and integrable. (ii) E[ X II�] satisfies the functional equation (34 .1 )
1 E [ XIIfl] dP = 1 XdP, G
G
G
E
rl.
To prove the existence of such a random variable, consider first the case of nonnegative X. Define a measure " on � by J1(G) = JG X dP. This measure is finite because X is integrable, and it is absolutely continuous with respect to P. By the Radon-Nikodym theorem there is a function f, measurable �, such that Jl ( G) = fGIdP. t This f has properties (i) and (ii). * If X is not necessarily nonnegative, E[ x+ ll �] - E[ x - 11�1 clearly has the required properties. There will in general be many such random variables E[ X II�]; any one of them is called a version of the conditional expected value. Any two versions are equal with probability 1 (by Example 16.9 with P restricted to
�).
Arguments such as those in Examples 33.3 and 33.4 show that E[ XII{O, O }] = E[ X] and that E[ XIIS&"] = X with probability 1. As f§ increases, condition (i) becomes weaker and condition (ii) becomes stronger. The value E[ XII�]"' at w is to be interpreted as the expected value of X for someone who knows for each G in � whether or not it contains the point w, which itself in general remains unknown. Condition (i) ensures that E[ XII�] can in principle be calculated from this partial information alone. Condition (ii) can be restated as fG ( }( - E[ X I I�]) dP = 0; if the observer, in possession of the partial information contained in �, is offered the tAs in the case of conditional probabilities, the integral is the same on (0, �, P ) as on ( 0 , rg) with P restricted to rl (Example 16.4). * See Problem 34.19 for another construction.
SECTION
34.
467
CONDITIONAL EXPECTATION
opportunity to bet, paying an entry fee of E[ XII�] and being returned the amount X, and if he adopts the strategy of betting if G occurs, this equation says that the game is fair. Example 34.1. Suppose that B 1 , B2 , is a finite or countable partition of 0 generating the a-field f§. Then E[ X II f§] must, since it is measurable �, • • •
have some constant value over a ; P( B; ) = f8; X dP. Thus
B; ,
say
a; .
Then (34.1) for G
= B;
gives
P( B; ) > 0. If
P( B; ) = 0, the value of E[ XII f§] over B; is constant but arbitrary.
•
For an indicator lA the defining properties of E[/AIIf§] and P[AIIf§] coincide; therefore, E[/AIIf§] = P[A II f§] with probability 1. It is easily checked that, more generally, E[ X II f§] = E; a ; P[A ; II f§] with prob • ability 1 for a simple function X = E; a ; IA . · Example 34. 2.
I
In analogy with the case of conditional probability, if [ X, , t E T] is a collection of random variables, E[ X II X,, t E T] is by definition E[X II r9' ] with a [ X,, t E T] in the role of �Properties of Conditional Expectation
Theorem 34.1. Let 𝒫 be a π-system generating the σ-field 𝒢, and suppose that Ω is a finite or countable union of sets in 𝒫. An integrable function f is a version of E[X‖𝒢] if it is measurable 𝒢 and if

∫_G f dP = ∫_G X dP

holds for all G in 𝒫.

PROOF. To prove the result, apply part (iii) of Example 16.9 to f and X on (Ω, 𝒢, P). ∎

In most applications it is clear that Ω ∈ 𝒫. All the equalities and inequalities in the following theorem hold with probability 1.
DERIVATIVES AND CONDITIONAL PROBABILITY
Theorem 34.2. Suppose that X, Y, X_n are integrable.
(i) If X = a with probability 1, then E[X‖𝒢] = a.
(ii) For constants a and b, E[aX + bY‖𝒢] = aE[X‖𝒢] + bE[Y‖𝒢].
(iii) If X ≤ Y with probability 1, then E[X‖𝒢] ≤ E[Y‖𝒢].
(iv) |E[X‖𝒢]| ≤ E[|X| ‖𝒢].
(v) If lim_n X_n = X with probability 1, |X_n| ≤ Y, and Y is integrable, then lim_n E[X_n‖𝒢] = E[X‖𝒢] with probability 1.

PROOF. If X = a with probability 1, the function identically equal to a satisfies conditions (i) and (ii) in the definition of E[X‖𝒢], and so (i) above follows by uniqueness. As for (ii), aE[X‖𝒢] + bE[Y‖𝒢] is integrable and measurable 𝒢, and

∫_G (aE[X‖𝒢] + bE[Y‖𝒢]) dP = a ∫_G E[X‖𝒢] dP + b ∫_G E[Y‖𝒢] dP
  = a ∫_G X dP + b ∫_G Y dP = ∫_G (aX + bY) dP

for all G in 𝒢, so that this function satisfies the functional equation.

If X ≤ Y with probability 1, then ∫_G (E[Y‖𝒢] - E[X‖𝒢]) dP = ∫_G (Y - X) dP ≥ 0 for all G in 𝒢. Since E[Y‖𝒢] - E[X‖𝒢] is measurable 𝒢, it must be nonnegative with probability 1 (consider the set G where it is negative). This proves (iii), which clearly implies (iv) as well as the fact that E[X‖𝒢] = E[Y‖𝒢] if X = Y with probability 1.

To prove (v), consider Z_n = sup_{k≥n} |X_k - X|. Now Z_n → 0 with probability 1, and by (ii), (iii), and (iv), |E[X_n‖𝒢] - E[X‖𝒢]| ≤ E[Z_n‖𝒢]. It suffices, therefore, to show that E[Z_n‖𝒢] → 0 with probability 1. By (iii) the sequence E[Z_n‖𝒢] is nonincreasing and hence has a limit Z; the problem is to prove that Z = 0 with probability 1, or, Z being nonnegative, that E[Z] = 0. But 0 ≤ Z_n ≤ 2Y, and so (34.1) and the dominated convergence theorem give E[Z] ≤ ∫ E[Z_n‖𝒢] dP = E[Z_n] → 0. ∎
The properties (33.21) through (33.28) can be derived anew from Theorem 34.2. Part (ii) shows once again that E[Σ_i a_i I_{A_i}‖𝒢] = Σ_i a_i P[A_i‖𝒢] for simple functions.

If X is measurable 𝒢, then clearly E[X‖𝒢] = X with probability 1. The following generalization of this is used constantly. For an observer with the information in 𝒢, X is effectively a constant if it is measurable 𝒢:
Theorem 34.3. If X is measurable 𝒢, and if Y and XY are integrable, then

(34.2)  E[XY‖𝒢] = X E[Y‖𝒢]

with probability 1.
PROOF. It will be shown first that the right side of (34.2) is a version of the left side if X = I_{G_0} and G_0 ∈ 𝒢. Since I_{G_0} E[Y‖𝒢] is certainly measurable 𝒢, it suffices to show that it satisfies the functional equation ∫_G I_{G_0} E[Y‖𝒢] dP = ∫_G I_{G_0} Y dP, G ∈ 𝒢. But this reduces to ∫_{G∩G_0} E[Y‖𝒢] dP = ∫_{G∩G_0} Y dP, which holds by the definition of E[Y‖𝒢]. Thus (34.2) holds if X is the indicator of an element of 𝒢. It follows by Theorem 34.2(ii) that (34.2) holds if X is a simple function measurable 𝒢.

For the general X that is measurable 𝒢, there exist simple functions X_n, measurable 𝒢, such that |X_n| ≤ |X| and lim_n X_n = X (Theorem 13.5). Since |X_n Y| ≤ |XY| and |XY| is integrable, Theorem 34.2(v) implies that lim_n E[X_n Y‖𝒢] = E[XY‖𝒢] with probability 1. But E[X_n Y‖𝒢] = X_n E[Y‖𝒢] by the case already treated, and of course lim_n X_n E[Y‖𝒢] = X E[Y‖𝒢]. (Note that |X_n E[Y‖𝒢]| = |E[X_n Y‖𝒢]| ≤ E[|X_n Y| ‖𝒢] ≤ E[|XY| ‖𝒢], so that the limit X E[Y‖𝒢] is integrable.) Thus (34.2) holds in general. Notice that X has not been assumed integrable. ∎
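Theorem 34.3 can be checked concretely on a finite space, where conditional expectation given a partition is a cell average. The four-point space, the partition, and the functions X and Y below are illustrative assumptions, not from the text.

```python
from fractions import Fraction

# Theorem 34.3 on a finite space: a G-measurable X factors out of
# E[XY||G].  The four-point space, partition, X, and Y are illustrative.
P = {w: Fraction(1, 4) for w in range(4)}
G = [{0, 1}, {2, 3}]                        # partition generating G
X = {0: 2, 1: 2, 2: -1, 3: -1}              # constant on cells: measurable G
Y = {0: 1, 1: 5, 2: 2, 3: 4}

def cond_exp(Z, partition, P):
    E = {}
    for B in partition:
        a = sum(Z[w] * P[w] for w in B) / sum(P[w] for w in B)
        E.update({w: a for w in B})
    return E

lhs = cond_exp({w: X[w] * Y[w] for w in P}, G, P)   # E[XY||G]
EY = cond_exp(Y, G, P)
rhs = {w: X[w] * EY[w] for w in P}                  # X * E[Y||G]
assert lhs == rhs                                   # (34.2) holds exactly
```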
Example 34.3. Suppose that X and Y are random vectors of dimensions j and k, that X is measurable 𝒢, and that Y is independent of 𝒢. If J ∈ ℛ^{j+k} and f_J(x) = P[(x, Y) ∈ J], then

(34.3)  P[(X, Y) ∈ J ‖ 𝒢] = f_J(X)

with probability 1, which generalizes (33.13). To prove it, suppose first that J = K × L (K ∈ ℛ^j, L ∈ ℛ^k); then by (34.2) and Example 33.6, P[X ∈ K, Y ∈ L‖𝒢] = I_{[X∈K]} P[Y ∈ L‖𝒢] = I_K(X) P[Y ∈ L], and this reduces to (34.3). Since the class of J for which (34.3) holds is a λ-system, it holds for all J in ℛ^{j+k}.

Let φ(x, y) be a Borel function on R^j × R^k, and for simplicity assume it is bounded. If f_φ(x) = E[φ(x, Y)], then

(34.4)  E[φ(X, Y)‖𝒢] = f_φ(X)

with probability 1. If φ = I_J, this is just (34.3). It follows by linearity that (34.4) holds for simple functions, and by Theorem 34.2(v) (and Theorem 13.5) that it holds for the general φ. ∎

Taking a conditional expected value can be thought of as an averaging or smoothing operation. This leads one to expect that averaging X with respect
to 𝒢_2, and then averaging the result with respect to a coarser (smaller) σ-field 𝒢_1, should lead to the same result as would averaging with respect to 𝒢_1 in the first place:

Theorem 34.4. If X is integrable and the σ-fields 𝒢_1 and 𝒢_2 satisfy 𝒢_1 ⊂ 𝒢_2, then

(34.5)  E[E[X‖𝒢_2]‖𝒢_1] = E[X‖𝒢_1]

with probability 1.

PROOF. The left side of (34.5) is measurable 𝒢_1, and so to prove that it is a version of E[X‖𝒢_1], it is enough to verify ∫_G E[E[X‖𝒢_2]‖𝒢_1] dP = ∫_G X dP for G ∈ 𝒢_1. But if G ∈ 𝒢_1, then G ∈ 𝒢_2, and the left side here is ∫_G E[X‖𝒢_2] dP = ∫_G X dP. ∎
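For σ-fields generated by nested finite partitions, (34.5) is a finite computation. The following sketch checks it exactly with rational arithmetic; the eight-point space, the weights, and the two partitions are illustrative assumptions.

```python
from fractions import Fraction

# Sketch of Theorem 34.4 on a finite space: conditioning first on the
# finer field G2 and then on the coarser G1 gives E[X||G1].  The space
# and partitions are illustrative assumptions, not from the text.
P = {w: Fraction(1, 8) for w in range(8)}
X = dict(enumerate([5, 1, 4, 2, 8, 6, 3, 7]))
G2 = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]      # finer partition
G1 = [{0, 1, 2, 3}, {4, 5, 6, 7}]          # coarser: sigma(G1) in sigma(G2)

def cond_exp(Z, partition, P):
    E = {}
    for B in partition:
        a = sum(Z[w] * P[w] for w in B) / sum(P[w] for w in B)
        E.update({w: a for w in B})
    return E

inner = cond_exp(X, G2, P)                 # E[X||G2]
iterated = cond_exp(inner, G1, P)          # E[E[X||G2]||G1]
direct = cond_exp(X, G1, P)                # E[X||G1]
assert iterated == direct                  # (34.5) holds exactly
```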
If 𝒢_2 = ℱ, then E[X‖𝒢_2] = X, so that (34.5) is trivial. If 𝒢_1 = {∅, Ω} and 𝒢_2 = 𝒢, then (34.5) becomes

(34.6)  E[E[X‖𝒢]] = E[X],

the special case of (34.1) for G = Ω. If 𝒢_1 ⊂ 𝒢_2, then E[X‖𝒢_1], being measurable 𝒢_1, is also measurable 𝒢_2, so that taking an expected value with respect to 𝒢_2 does not alter it: E[E[X‖𝒢_1]‖𝒢_2] = E[X‖𝒢_1]. Therefore, if 𝒢_1 ⊂ 𝒢_2, taking iterated expected values in either order gives E[X‖𝒢_1].

The remaining result of a general sort needed here is Jensen's inequality for conditional expected values: If φ is a convex function on the line and X and φ(X) are both integrable, then

(34.7)  φ(E[X‖𝒢]) ≤ E[φ(X)‖𝒢]

with probability 1. For each x_0 take a support line [A33] through (x_0, φ(x_0)): φ(x_0) + λ(x_0)(x - x_0) ≤ φ(x). The slope λ(x_0) can be taken as the right-hand derivative of φ, so that it is nondecreasing in x_0. Now

φ(E[X‖𝒢]) + λ(E[X‖𝒢])(X - E[X‖𝒢]) ≤ φ(X).

Suppose that E[X‖𝒢] is bounded. Then all three terms here are integrable (if φ is convex on R^1, then φ and λ are bounded on bounded sets), and taking expected values with respect to 𝒢 and using (34.2) on the middle term gives (34.7).

To prove (34.7) in general, let G_n = [|E[X‖𝒢]| ≤ n]. Then E[I_{G_n} X‖𝒢] = I_{G_n} E[X‖𝒢] is bounded, and so (34.7) holds for I_{G_n} X: φ(I_{G_n} E[X‖𝒢]) ≤
E[φ(I_{G_n} X)‖𝒢]. Now E[φ(I_{G_n} X)‖𝒢] = E[I_{G_n} φ(X) + I_{G_n^c} φ(0)‖𝒢] = I_{G_n} E[φ(X)‖𝒢] + I_{G_n^c} φ(0) → E[φ(X)‖𝒢]. Since φ(I_{G_n} E[X‖𝒢]) converges to φ(E[X‖𝒢]) by the continuity of φ, (34.7) follows. If φ(x) = |x|, (34.7) gives part (iv) of Theorem 34.2 again.
Conditional Distributions and Expectations
Theorem 34.5. Let μ(·, ω) be a conditional distribution with respect to 𝒢 of a random variable X, in the sense of Theorem 33.3. If φ: R^1 → R^1 is a Borel function for which φ(X) is integrable, then ∫_{R^1} φ(x) μ(dx, ω) is a version of E[φ(X)‖𝒢]_ω.

PROOF. If φ = I_H and H ∈ ℛ^1, this is an immediate consequence of the definition of conditional distribution, and by Theorem 34.2(ii) it follows for φ a simple function over R^1. For the general nonnegative φ, choose simple φ_n such that 0 ≤ φ_n(x) ↑ φ(x) for each x in R^1. By the case already treated, ∫_{R^1} φ_n(x) μ(dx, ω) is a version of E[φ_n(X)‖𝒢]_ω. The integral converges by the monotone convergence theorem in (R^1, ℛ^1, μ(·, ω)) to ∫_{R^1} φ(x) μ(dx, ω) for each ω, the value +∞ not excluded, and E[φ_n(X)‖𝒢]_ω converges to E[φ(X)‖𝒢]_ω with probability 1 by Theorem 34.2(v). Thus the result holds for nonnegative φ, and the general case follows from splitting into positive and negative parts. ∎
It is a consequence of the proof above that ∫_{R^1} φ(x) μ(dx, ω) is measurable 𝒢 and finite with probability 1. If X is itself integrable, it follows by the theorem for the case φ(x) = x that

E[X‖𝒢]_ω = ∫_{-∞}^{∞} x μ(dx, ω)

with probability 1. If φ(X) is integrable as well, then

(34.8)  E[φ(X)‖𝒢]_ω = ∫_{-∞}^{∞} φ(x) μ(dx, ω)

with probability 1. By Jensen's inequality (21.14) for unconditional expected values, the right side of (34.8) is at least φ(∫_{-∞}^{∞} x μ(dx, ω)) if φ is convex. This gives another proof of (34.7).

Sufficient Subfields*
Suppose that for each θ in an index set Θ, P_θ is a probability measure on (Ω, ℱ). In statistics the problem is to draw inferences about the unknown parameter θ from an observation ω.
* This topic may be omitted.
Denote by P_θ[A‖𝒢] and E_θ[X‖𝒢] conditional probabilities and expected values calculated with respect to the probability measure P_θ on (Ω, ℱ). A σ-field 𝒢 in ℱ is sufficient for the family [P_θ: θ ∈ Θ] if versions P_θ[A‖𝒢] can be chosen that are independent of θ; that is, if there exists a function p(A, ω) of A ∈ ℱ and ω ∈ Ω such that, for each A ∈ ℱ and θ ∈ Θ, p(A, ·) is a version of P_θ[A‖𝒢]. There is no requirement that p(·, ω) be a measure for ω fixed. The idea is that although there may be information in ℱ not already contained in 𝒢, this information is irrelevant to the drawing of inferences about θ.† A sufficient statistic is a random variable or random vector T such that σ(T) is a sufficient subfield.

A family ℳ of measures dominates another family 𝒩 if, for each A, from μ(A) = 0 for all μ in ℳ it follows that ν(A) = 0 for all ν in 𝒩. If each of ℳ and 𝒩 dominates the other, they are equivalent. For sets consisting of a single measure these are the concepts introduced in Section 32.
Theorem 34.6. Suppose that [P_θ: θ ∈ Θ] is dominated by the σ-finite measure μ. A necessary and sufficient condition that 𝒢 be sufficient is that the density f_θ of P_θ with respect to μ can be put in the form f_θ = g_θ h for a g_θ measurable 𝒢.

It is assumed throughout that g_θ and h are nonnegative and of course that h is measurable ℱ. Theorem 34.6 is called the factorization theorem, the condition being that the density f_θ splits into a factor depending on ω only through 𝒢 and a factor independent of θ. Although g_θ and h are not assumed integrable μ, their product f_θ, as the density of a finite measure, must be. Before proceeding to the proof, consider an application.

Example 34.4. Let (Ω, ℱ) = (R^k, ℛ^k), and for θ > 0 let P_θ be the measure having with respect to k-dimensional Lebesgue measure the density

f_θ(x) = θ^{-k} if 0 ≤ x_i ≤ θ, i = 1, ..., k, and f_θ(x) = 0 otherwise.

If X_i is the function on R^k defined by X_i(x) = x_i, then under P_θ, X_1, ..., X_k are independent random variables, each uniformly distributed over [0, θ]. Let T(x) = max_{i≤k} X_i(x). If g_θ(t) is θ^{-k} for 0 ≤ t ≤ θ and 0 otherwise, and if h(x) is 1 or 0 according as all x_i are nonnegative or not, then f_θ(x) = g_θ(T(x)) h(x). The factorization criterion is thus satisfied, and T is a sufficient statistic.

Sufficiency is clear on intuitive grounds as well: θ is not involved in the conditional distribution of X_1, ..., X_k given T because, roughly speaking, a random one of them equals T and the others are independent and uniform over [0, T]. If this is true, the distribution of X_i given T ought to have a mass of k^{-1} at T and a uniform distribution of mass 1 - k^{-1} over [0, T], so that

(34.9)  E_θ[X_i‖T] = (1/k) T + ((k-1)/k)(T/2) = ((k+1)/(2k)) T.

†See Problem 34.20.
For a proof of this fact, needed later, note that by (21.9)

(34.10)  ∫_{[T≤t]} X_i dP_θ = ∫_0^∞ P_θ(T ≤ t, X_i ≥ u) du
  = ∫_0^t ((t-u)/θ)(t/θ)^{k-1} du = t^{k+1}/(2θ^k)

if 0 ≤ t ≤ θ. On the other hand, P_θ[T ≤ t] = (t/θ)^k, so that under P_θ the distribution of T has density k t^{k-1}/θ^k over [0, θ]. Thus

(34.11)  ∫_{[T≤t]} ((k+1)/(2k)) T dP_θ = ((k+1)/(2k)) ∫_0^t u (k u^{k-1}/θ^k) du = t^{k+1}/(2θ^k).

Since (34.10) and (34.11) agree, (34.9) follows by Theorem 34.1. ∎
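Both members of the identity (34.10) = (34.11) = t^{k+1}/(2θ^k) can be confirmed numerically. The sketch below evaluates the two integrals by midpoint Riemann sums; the values of θ, t, and k are arbitrary illustrative choices with 0 ≤ t ≤ θ.

```python
import math

# Check that (34.10) and (34.11) both equal t^(k+1) / (2 theta^k).
# theta, t, k are illustrative values with 0 <= t <= theta.
theta, t, k = 2.0, 1.5, 4
closed_form = t ** (k + 1) / (2 * theta ** k)

def midpoint(f, a, b, n=50_000):
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# (34.10): integral over [0, t] of P_theta(T <= t, X_i >= u) du
eq_10 = midpoint(lambda u: ((t - u) / theta) * (t / theta) ** (k - 1), 0.0, t)

# (34.11): ((k+1)/2k) * integral over [0, t] of u * k u^(k-1) / theta^k du
eq_11 = midpoint(lambda u: (k + 1) / (2 * k) * u * k * u ** (k - 1) / theta ** k,
                 0.0, t)

assert math.isclose(eq_10, closed_form, rel_tol=1e-6)
assert math.isclose(eq_11, closed_form, rel_tol=1e-6)
```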
The essential ideas in the proof of Theorem 34.6 are most easily understood through a preliminary consideration of special cases. For the sufficiency of the condition, suppose first that h is integrable μ.
Lemma 1. Suppose that [P_θ: θ ∈ Θ] is dominated by a measure μ and that each P_θ has with respect to μ a density f_θ = g_θ h, where g_θ is measurable 𝒢 and h is integrable μ. Then 𝒢 is sufficient.

PROOF. Put a = ∫ h dμ and replace g_θ by g_θ a and h by h/a. This shows that it is no restriction to assume that ∫ h dμ = 1. Define P on ℱ by P(A) = ∫_A h dμ, so that P_θ(A) = ∫_A g_θ dP. For G in 𝒢, (34.2) gives

∫_G P[A‖𝒢] dP_θ = ∫_G g_θ P[A‖𝒢] dP = ∫_G E[g_θ I_A‖𝒢] dP = ∫_G g_θ I_A dP = P_θ(A ∩ G).

Therefore, P[A‖𝒢], the conditional probability calculated with respect to P, serves as a version of P_θ[A‖𝒢] for each θ in Θ. Thus 𝒢 is sufficient for the family [P_θ: θ ∈ Θ], even for this family augmented by P (which might happen to lie in the family to start with). ∎

For the necessity, suppose first that the family is dominated by one of its members.
Lemma 2. Suppose that [P_θ: θ ∈ Θ] is dominated not only by μ but by P_{θ_0} for some θ_0 ∈ Θ. If 𝒢 is sufficient, then each P_θ has with respect to μ a density f_θ = g_θ h, where g_θ is measurable 𝒢 and h is integrable μ.
PROOF. Let d_θ be any density of P_θ with respect to P_{θ_0}. By a number of applications of (34.2),

∫_A E_{θ_0}[d_θ‖𝒢] dP_{θ_0} = ∫ E_{θ_0}[I_A E_{θ_0}[d_θ‖𝒢] ‖𝒢] dP_{θ_0} = ∫ E_{θ_0}[I_A‖𝒢] E_{θ_0}[d_θ‖𝒢] dP_{θ_0}
  = ∫ E_{θ_0}[E_{θ_0}[I_A‖𝒢] d_θ ‖𝒢] dP_{θ_0} = ∫ E_{θ_0}[I_A‖𝒢] d_θ dP_{θ_0}
  = ∫ P_{θ_0}[A‖𝒢] dP_θ = ∫ P_θ[A‖𝒢] dP_θ = P_θ(A),

the next-to-last equality by sufficiency. Thus g_θ = E_{θ_0}[d_θ‖𝒢], which is measurable 𝒢, can serve as a density for P_θ with respect to P_{θ_0}. Take for h a density for P_{θ_0} with respect to μ. ∎
To complete the proof of Theorem 34.6 requires one more lemma of a technical sort.

Lemma 3. If [P_θ: θ ∈ Θ] is dominated by a σ-finite measure, then it is equivalent to some countable subfamily.
PROOF. If μ is σ-finite, there is a countable partition of Ω into ℱ-sets A_n such that 0 < μ(A_n) < ∞. The measure with value Σ_n 2^{-n} μ(A ∩ A_n)/μ(A_n) at A is finite and equivalent to μ. In proving the lemma it is therefore no restriction to assume that the family is dominated by a finite measure μ.

Each P_θ is dominated by μ and hence has a density f_θ with respect to it. Let S_θ = [ω: f_θ(ω) > 0]. Then P_θ(A) = P_θ(A ∩ S_θ) for all A, and P_θ(A) = 0 if and only if μ(A ∩ S_θ) = 0. In particular, S_θ supports P_θ.

Call a set B in ℱ a kernel if B ⊂ S_θ for some θ, and call a finite or countable union of kernels a chain. Let α be the supremum of μ(C) over chains C. Since μ is finite and a countable union of chains is a chain, α is finite and μ(C) = α for some chain C. Suppose that C = ∪_n B_n, where each B_n is a kernel, and suppose that B_n ⊂ S_{θ_n}. The problem is to show that [P_θ: θ ∈ Θ] is dominated by [P_{θ_n}: n = 1, 2, ...] and hence equivalent to it.

Suppose that P_{θ_n}(A) = 0 for all n. Then μ(A ∩ S_{θ_n}) = 0, as observed above. Since C ⊂ ∪_n S_{θ_n}, μ(A ∩ C) = 0, and it follows that P_θ(A ∩ C) = 0 whatever θ may be. But suppose that P_θ(A - C) > 0. Then P_θ((A - C) ∩ S_θ) = P_θ(A - C) is positive, and so (A - C) ∩ S_θ is a kernel, disjoint from C, of positive μ-measure; this is impossible because of the maximality of C. Thus P_θ(A - C) is 0 along with P_θ(A ∩ C), and so P_θ(A) = 0. ∎
Suppose that [P_θ: θ ∈ Θ] is dominated by a σ-finite μ, as in Theorem 34.6, and let [P_{θ_n}: n = 1, 2, ...] be the sequence whose existence is guaranteed by Lemma 3. Define a probability measure P on ℱ by

(34.12)  P = Σ_n 2^{-n} P_{θ_n}.
Clearly, P is equivalent to [P_{θ_n}: n = 1, 2, ...] and hence to [P_θ: θ ∈ Θ], and μ dominates P.

PROOF OF SUFFICIENCY IN THEOREM 34.6. If each P_θ has density g_θ h with respect to μ, then by the construction (34.12), P has density fh with respect to μ, where f = Σ_n 2^{-n} g_{θ_n}. Put r_θ = g_θ/f if f > 0 and r_θ = 0 (say) if f = 0. If each g_θ is measurable 𝒢, the same is true of f and hence of the r_θ. Since P[f = 0] = 0 and P is equivalent to the entire family, P_θ[f = 0] = 0 for all θ. Therefore,

∫_A r_θ dP = ∫_A r_θ f h dμ = ∫_{A∩[f>0]} r_θ f h dμ = ∫_{A∩[f>0]} g_θ h dμ = P_θ(A ∩ [f > 0]) = P_θ(A).
Each P_θ thus has with respect to the probability measure P a density measurable 𝒢, and it follows by Lemma 1 for μ = P and h ≡ 1 that 𝒢 is sufficient. ∎

PROOF OF NECESSITY IN THEOREM 34.6. Let p(A, ω) be a function such that, for each A and θ, p(A, ·) is a version of P_θ[A‖𝒢], as required by the definition of sufficiency. For P as in (34.12) and G ∈ 𝒢,

∫_G p(A, ω) P(dω) = Σ_n 2^{-n} ∫_G p(A, ω) P_{θ_n}(dω) = Σ_n 2^{-n} ∫_G P_{θ_n}[A‖𝒢] dP_{θ_n}
  = Σ_n 2^{-n} P_{θ_n}(A ∩ G) = P(A ∩ G).

Thus p(A, ·) serves as a version of P[A‖𝒢] as well, and 𝒢 is still sufficient if P is added to the family. Since P dominates the augmented family, Lemma 2 gives the required factorization. ∎

Minimum-Variance Estimation*
To illustrate sufficiency, suppose that Θ is a subset of the line. An estimate of θ is a random variable Z, and the estimate is unbiased if E_θ[Z] = θ for all θ. A measure of the accuracy of the estimate Z is E_θ[(Z - θ)^2]. If 𝒢 is sufficient, it follows by linearity (Theorem 34.2(ii)) that E_θ[X‖𝒢] has for X simple a version that is independent of θ. Since there are simple X_n such that |X_n| ≤ |X| and X_n → X, the same is true of any X that is integrable with respect to each P_θ (use Theorem 34.2(v)). Suppose that 𝒢 is, in fact, sufficient, and denote by E[X‖𝒢] a version of E_θ[X‖𝒢] that is independent of θ.

Theorem 34.7. Suppose that E_θ[(Z - θ)^2] < ∞ for all θ and that 𝒢 is sufficient. Then

(34.13)  E_θ[(E[Z‖𝒢] - θ)^2] ≤ E_θ[(Z - θ)^2]

for all θ. If Z is unbiased, then so is E[Z‖𝒢].

*This topic may be omitted.
PROOF. By Jensen's inequality (34.7) for φ(x) = (x - θ)^2, (E[Z‖𝒢] - θ)^2 ≤ E_θ[(Z - θ)^2 ‖𝒢]. Applying E_θ to each side gives (34.13). The second statement follows from the fact that E_θ[E[Z‖𝒢]] = E_θ[Z]. ∎
This, the Rao-Blackwell theorem, says that E[Z‖𝒢] is at least as good an estimate as Z if 𝒢 is sufficient.

Example 34.5. Returning to Example 34.4, note that each X_i has mean θ/2 under P_θ, so that if X̄ = k^{-1} Σ_{i=1}^k X_i is the sample mean, then 2X̄ is an unbiased estimate of θ. But there is a better one. By (34.9), E_θ[2X̄‖T] = (k+1)T/k = T′, and by the Rao-Blackwell theorem, T′ is an unbiased estimate with variance at most that of 2X̄.

In fact, for an arbitrary unbiased estimate Z, E_θ[(T′ - θ)^2] ≤ E_θ[(Z - θ)^2]. To prove this, let B = T′ - E[Z‖T]. By Theorem 20.1(ii), B = f(T) for some Borel function f, and E_θ[f(T)] = 0 for all θ. Taking account of the density for T, it follows that f(x)x^{k-1} integrates to 0 over [0, θ] for every θ. Therefore, f(x)x^{k-1}, and with it f(x), vanishes for x > 0, except on a set of Lebesgue measure 0; hence P_θ[f(T) = 0] = 1 and P_θ[T′ = E[Z‖T]] = 1 for all θ. Therefore, E_θ[(T′ - θ)^2] = E_θ[(E[Z‖T] - θ)^2] ≤ E_θ[(Z - θ)^2] for Z unbiased, and T′ has minimum variance among all unbiased estimates of θ. ∎
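The variance reduction in Example 34.5 shows up clearly in simulation. The sketch below is a Monte Carlo illustration; the sample size, seed, θ, and k are arbitrary choices, not from the text. The exact variances are θ²/(3k) for 2X̄ and θ²/(k(k+2)) for T′.

```python
import random
import statistics

# Monte Carlo sketch of Example 34.5; theta, k, the trial count, and the
# seed are arbitrary.  T' = (k+1)max(X)/k should beat 2 * sample mean.
random.seed(7)
theta, k, trials = 1.0, 5, 20_000
est_mean, est_rb = [], []
for _ in range(trials):
    xs = [random.uniform(0, theta) for _ in range(k)]
    est_mean.append(2 * sum(xs) / k)            # unbiased: 2 * Xbar
    est_rb.append((k + 1) * max(xs) / k)        # Rao-Blackwellized: T'

assert abs(statistics.mean(est_mean) - theta) < 0.05    # both unbiased
assert abs(statistics.mean(est_rb) - theta) < 0.05
# exact variances here: 1/15 for 2*Xbar versus 1/35 for T'
assert statistics.variance(est_rb) < statistics.variance(est_mean)
```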
PROBLEMS

34.1. Work out for conditional expected values the analogue of Problem 33.5.

34.2. In the context of Examples 33.5 and 33.12, show that the conditional expected value of Y (if it is integrable) given X is g(X), where

g(x) = ∫_{-∞}^{∞} y f(x, y) dy / ∫_{-∞}^{∞} f(x, y) dy.

34.3. Show that the independence of X and Y implies that E[Y‖X] = E[Y], which in turn implies that E[XY] = E[X]E[Y]. Show by examples in an Ω of three points that the reverse implications are both false.

34.4. (a) Let B be an event with P(B) > 0, and define a probability measure P_0 by P_0(A) = P(A|B). Show that P_0[A‖𝒢] = P[A ∩ B‖𝒢]/P[B‖𝒢] on a set of P_0-measure 1.
(b) Suppose that ℋ is generated by a partition B_1, B_2, ..., and let 𝒢 ∨ ℋ = σ(𝒢 ∪ ℋ). Show that with probability 1,

P[A‖𝒢 ∨ ℋ] = Σ_n I_{B_n} P[A ∩ B_n‖𝒢] / P[B_n‖𝒢].
34.5. The equation (34.5) was proved by showing that the left side is a version of the right side. Prove it by showing that the right side is a version of the left side.

34.6. Prove for bounded X and Y that E[YE[X‖𝒢]] = E[XE[Y‖𝒢]].

34.7. 33.14↑ Generalize Theorem 34.5 by replacing X with a random vector.

34.8. Assume that X is nonnegative but not necessarily integrable. Show that it is still possible to define a nonnegative random variable E[X‖𝒢], measurable 𝒢, such that (34.1) holds. Prove versions of the monotone convergence theorem and Fatou's lemma.

34.9. (a) Show for nonnegative X that E[X‖𝒢] = ∫_0^∞ P[X > t‖𝒢] dt with probability 1.
(b) Generalize Markov's inequality: P[|X| ≥ α‖𝒢] ≤ α^{-k} E[|X|^k ‖𝒢] with probability 1.
(c) Similarly generalize Chebyshev's and Hölder's inequalities.
34.10. (a) Show that, if 𝒢_1 ⊂ 𝒢_2 and E[X^2] < ∞, then E[(X - E[X‖𝒢_2])^2] ≤ E[(X - E[X‖𝒢_1])^2]. The dispersion of X about its conditional mean becomes smaller as the σ-field grows.
(b) Define Var[X‖𝒢] = E[(X - E[X‖𝒢])^2 ‖𝒢]. Prove that Var[X] = E[Var[X‖𝒢]] + Var[E[X‖𝒢]].

34.11. Let 𝒢_1, 𝒢_2, 𝒢_3 be σ-fields in ℱ, let 𝒢_{ij} be the σ-field generated by 𝒢_i ∪ 𝒢_j, and let A_i be the generic set in 𝒢_i. Consider three conditions:
(i) P[A_3‖𝒢_{12}] = P[A_3‖𝒢_2] for all A_3.
(ii) P[A_1 ∩ A_3‖𝒢_2] = P[A_1‖𝒢_2] P[A_3‖𝒢_2] for all A_1 and A_3.
(iii) P[A_1‖𝒢_{23}] = P[A_1‖𝒢_2] for all A_1.
If 𝒢_1, 𝒢_2, and 𝒢_3 are interpreted as descriptions of the past, present, and future, respectively, (i) is a general version of the Markov property: the conditional probability of a future event A_3 given the past and present 𝒢_{12} is the same as the conditional probability given the present 𝒢_2 alone. Condition (iii) is the same with time reversed. And (ii) says that past and future events A_1 and A_3 are conditionally independent given the present 𝒢_2. Prove the three conditions equivalent.
34.12. 33.7 34.11↑ Use Example 33.10 to calculate P[N_s = k‖N_u, u ≥ t] (s < t) for the Poisson process.
34.13. Let L^2 be the Hilbert space of square-integrable random variables on (Ω, ℱ, P). For 𝒢 a σ-field in ℱ, let M_𝒢 be the subspace of elements of L^2 that are measurable 𝒢. Show that the operator P_𝒢 defined for X ∈ L^2 by P_𝒢 X = E[X‖𝒢] is the perpendicular projection on M_𝒢.

34.14. ↑ Suppose in Problem 34.13 that 𝒢 = σ(Z) for a random variable Z in L^2. Let S_Z be the one-dimensional subspace spanned by Z. Show that S_Z may be much smaller than M_{σ(Z)}, so that E[X‖Z] (for X ∈ L^2) is by no means the projection of X on Z. Hint: Take Z the identity function on the unit interval with Lebesgue measure.
34.15. ↑ Problem 34.13 can be turned around to give an alternative approach to conditional probability and expected value. For a σ-field 𝒢 in ℱ, let P_𝒢 be the perpendicular projection on the subspace M_𝒢. Show that P_𝒢 X has for X ∈ L^2 the two properties required of E[X‖𝒢]. Use this to define E[X‖𝒢] for X ∈ L^2, and then extend it to all integrable X via approximation by random variables in L^2. Now define conditional probability.
34.16. Mixing sequences. A sequence A_1, A_2, ... of ℱ-sets in a probability space (Ω, ℱ, P) is mixing with constant α if

(34.14)  lim_n P(A_n ∩ E) = αP(E)

for every E in ℱ.
(a) Show that {A_n} is mixing with constant α if and only if

(34.15)  lim_n ∫_{A_n} X dP = α ∫_Ω X dP

for each integrable X (measurable ℱ).
(b) Suppose that (34.14) holds for E ∈ ℬ, where ℬ is a π-system, Ω ∈ ℬ, and A_n ∈ σ(ℬ) for all n. Show that {A_n} is mixing. Hint: First check (34.15) for X measurable σ(ℬ), and then use conditional expected values with respect to σ(ℬ).
(c) Show that, if P_0 is a probability measure on (Ω, ℱ) and P_0 ≪ P, then mixing is preserved if P is replaced by P_0.
34.17. ↑ Application of mixing to the central limit theorem. Let X_1, X_2, ... be random variables on (Ω, ℱ, P), independent and identically distributed with mean 0 and variance σ^2, and put S_n = X_1 + ··· + X_n. Then S_n/σ√n ⇒ N by the Lindeberg-Lévy theorem. Show by the steps below that this still holds if P is replaced by any probability measure P_0 on (Ω, ℱ) that P dominates. For example, the central limit theorem applies to the sums Σ_{k=1}^n r_k(ω) of Rademacher functions if ω is chosen according to the uniform density over the unit interval, and this result shows that the same is true if ω is chosen according to an arbitrary density.

Let Y_n = S_n/σ√n and Z_n = (S_n - S_{⌊log n⌋})/σ√n, and take ℬ to consist of the sets of the form [(X_1, ..., X_k) ∈ H], k ≥ 1, H ∈ ℛ^k. Prove successively:
(a) P[Y_n ≤ x] → P[N ≤ x].
(b) P[|Y_n - Z_n| ≥ ε] → 0.
(c) P[Z_n ≤ x] → P[N ≤ x].
(d) P(E ∩ [Z_n ≤ x]) → P(E) P[N ≤ x] for E ∈ ℬ.
(e) P(E ∩ [Z_n ≤ x]) → P(E) P[N ≤ x] for E ∈ ℱ.
(f) P_0[Z_n ≤ x] → P[N ≤ x].
(g) P_0[|Y_n - Z_n| ≥ ε] → 0.
(h) P_0[Y_n ≤ x] → P[N ≤ x].

34.18. Let (X, 𝒳, μ) and (Y, 𝒴, ν) be probability measure spaces, and let (Ω, ℱ, P) be their product. Let 𝒢 = [A × Y: A ∈ 𝒳] be the σ-field of vertical strips. For a random variable f = f(x, y), put g(x, y) = ∫_Y f(x, z) ν(dz), and show by Fubini's theorem that g is a version of E[f‖𝒢].
34.19. Here is an alternative construction of conditional expectation. It uses the Hahn decomposition and in effect incorporates one of the proofs of the Radon-Nikodym theorem. Suppose that X is nonnegative and integrable. For each nonnegative dyadic rational t, find by the Hahn result (Theorem 32.1) a 𝒢-set A_t such that

∫_G X dP ≥ tP(G) if G ⊂ A_t, G ∈ 𝒢,
∫_G X dP ≤ tP(G) if G ⊂ A_t^c, G ∈ 𝒢.

Show that P(A_v - A_u) = 0 if v > u, and replace A_t by A_t - ∪_{v>u} (A_v - A_u), the union extending over pairs of dyadic rationals v > u. Show that it is possible to take A_0 = Ω. Show by the integrability of X that P(∩_{s>0} A_s) = 0, and replace A_t by A_t - ∩_{s>0} A_s for t > 0. Thus it can be assumed that the A_t are nonincreasing (over dyadic rational t), that A_0 = Ω, and that ∩_{t>0} A_t = ∅.

Put G_{nk} = A_{k2^{-n}} - A_{(k+1)2^{-n}} and consider the successively finer decompositions {G_{n0}, G_{n1}, ...}, n = 1, 2, .... Put f_n(ω) = k2^{-n} for ω ∈ G_{nk}, and show that, for G ∈ 𝒢,

(34.16)  (k/2^n) P(G ∩ G_{nk}) ≤ ∫_{G∩G_{nk}} X dP ≤ ((k+1)/2^n) P(G ∩ G_{nk}).

If ω ∈ G_{mj} ⊂ G_{nk} (n < m), then [j2^{-m}, (j+1)2^{-m}] ⊂ [k2^{-n}, (k+1)2^{-n}], and (take G = G_{nk} and then G = G_{mj} in (34.16)) f_n(ω), f_m(ω) ∈ [k2^{-n}, (k+1)2^{-n}]. For all ω, |f_m(ω) - f_n(ω)| ≤ 2^{-n} for m > n, and so lim_n f_n(ω) = f(ω), where |f(ω) - f_n(ω)| ≤ 2^{-n}. Clearly, ∫_G f_n dP → ∫_G f dP. If G ∈ 𝒢, then

|∫_G f_n dP - ∫_G X dP| ≤ Σ_k P(G ∩ G_{nk}) 2^{-n} ≤ 2^{-n}

by (34.16). Hence ∫_G f_n dP → ∫_G X dP, and the 𝒢-measurable function f satisfies ∫_G f dP = ∫_G X dP for G ∈ 𝒢.
34.20. ↑ Suppose that 𝒢 is a sufficient subfield for the family of probability measures P_θ, θ ∈ Θ, on (Ω, ℱ). Suppose that for each θ and A, p(A, ω) is a version of P_θ[A‖𝒢]_ω, and suppose further that for each ω, p(·, ω) is a probability measure on ℱ. Define Q_θ on ℱ by Q_θ(A) = ∫_Ω p(A, ω) P_θ(dω), and show that Q_θ = P_θ.

The idea is that an observer with the information in 𝒢 (but ignorant of ω itself) in principle knows the values p(A, ω) because each p(A, ·) is measurable 𝒢. If he has the appropriate randomization device, he can draw an ω′ from Ω according to the probability measure p(·, ω), and his ω′ will have the same distribution Q_θ = P_θ that ω has. Thus, whatever the value of the unknown θ, the observer can, on the basis of the information in 𝒢 alone and without knowing ω itself, construct a probabilistic replica of ω.
SECTION 35. MARTINGALES

Definition
Let X_1, X_2, ... be a sequence of random variables on a probability space (Ω, ℱ, P), and let ℱ_1, ℱ_2, ... be a sequence of σ-fields in ℱ. The sequence {(X_n, ℱ_n): n = 1, 2, ...} is a martingale if these four conditions hold:
(i) ℱ_n ⊂ ℱ_{n+1};
(ii) X_n is measurable ℱ_n;
(iii) E[|X_n|] < ∞;
(iv) with probability 1,

(35.1)  E[X_{n+1}‖ℱ_n] = X_n.

Alternatively, the sequence X_1, X_2, ... is said to be a martingale relative to the σ-fields ℱ_1, ℱ_2, .... Condition (ii) is expressed by saying the X_n are adapted to the ℱ_n.

If X_n represents the fortune of a gambler after the nth play and ℱ_n represents his information about the game at that time, (35.1) says that his expected fortune after the next play is the same as his present fortune. Thus a martingale represents a fair game, and sums of independent random variables with mean 0 give one example. As will be seen below, martingales arise in very diverse connections.

The sequence X_1, X_2, ... is defined to be a martingale if it is a martingale relative to some sequence ℱ_1, ℱ_2, .... In this case, the σ-fields 𝒢_n = σ(X_1, ..., X_n) always work: obviously 𝒢_n ⊂ 𝒢_{n+1} and X_n is measurable 𝒢_n, and if (35.1) holds, then E[X_{n+1}‖𝒢_n] = E[E[X_{n+1}‖ℱ_n]‖𝒢_n] = E[X_n‖𝒢_n] = X_n by (34.5). For these special σ-fields 𝒢_n, (35.1) reduces to

(35.2)  E[X_{n+1}‖X_1, ..., X_n] = X_n.
Since σ(X_1, ..., X_n) ⊂ ℱ_n if and only if X_n is measurable ℱ_n for each n, the σ(X_1, ..., X_n) are the smallest σ-fields with respect to which the X_n are a martingale.

The essential condition is embodied in (35.1) and in its specialization (35.2). Condition (iii) is of course needed to ensure that E[X_{n+1}‖ℱ_n] exists. Condition (iv) says that X_n is a version of E[X_{n+1}‖ℱ_n]; since X_n is measurable ℱ_n, the requirement reduces to

(35.3)  ∫_A X_{n+1} dP = ∫_A X_n dP,  A ∈ ℱ_n.

Since the ℱ_n are nested, A ∈ ℱ_n implies that ∫_A X_n dP = ∫_A X_{n+1} dP = ··· = ∫_A X_{n+k} dP. Therefore, X_n, being measurable ℱ_n, is a version of E[X_{n+k}‖ℱ_n]:

(35.4)  E[X_{n+k}‖ℱ_n] = X_n

with probability 1 for k ≥ 1. Note that for A = Ω, (35.3) gives

(35.5)  E[X_1] = E[X_2] = ···.

The defining conditions for a martingale can also be given in terms of the differences

(35.6)  Δ_n = X_n - X_{n-1}

(Δ_1 = X_1). By linearity, (35.1) is the same thing as

(35.7)  E[Δ_{n+1}‖ℱ_n] = 0.

Note that, since X_k = Δ_1 + ··· + Δ_k and Δ_k = X_k - X_{k-1}, the sets X_1, ..., X_n and Δ_1, ..., Δ_n generate the same σ-field:

(35.8)  σ(X_1, ..., X_n) = σ(Δ_1, ..., Δ_n).

Example 35.1. Let Δ_1, Δ_2, ... be independent, integrable random variables such that E[Δ_n] = 0 for n ≥ 2. If ℱ_n is the σ-field (35.8), then by independence E[Δ_{n+1}‖ℱ_n] = E[Δ_{n+1}] = 0. If Δ is another random variable, independent of the Δ_n, and if ℱ_n is replaced by σ(Δ, Δ_1, ..., Δ_n), then the X_n = Δ_1 + ··· + Δ_n are still a martingale relative to the ℱ_n. It is natural and convenient in the theory to allow σ-fields ℱ_n larger than the minimal ones (35.8). ∎
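The minimal case of Example 35.1 can be verified by brute force: with symmetric ±1 steps (an illustrative choice, not from the text), conditioning on an atom of ℱ_n fixes the past, and averaging over the next step returns the current partial sum.

```python
from fractions import Fraction
from itertools import product

# Example 35.1 with Delta_i = +/-1, each with probability 1/2 (an
# illustrative choice).  On each atom of F_n (a fixed past), averaging
# X_{n+1} = X_n + Delta_{n+1} over the next step gives back X_n, which
# is (35.7): E[Delta_{n+1} || F_n] = 0.
n = 3
half = Fraction(1, 2)
for past in product((-1, 1), repeat=n):     # an atom of F_n
    X_n = sum(past)
    X_next_mean = sum(half * (X_n + step) for step in (-1, 1))
    assert X_next_mean == X_n
```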
Example 35.2. Let (Ω, ℱ, P) be a probability space, let ν be a finite measure on ℱ, and let ℱ_1, ℱ_2, ... be a nondecreasing sequence of σ-fields in ℱ. Suppose that P dominates ν when both are restricted to ℱ_n; that is, suppose that A ∈ ℱ_n and P(A) = 0 together imply that ν(A) = 0. There is then a density or Radon-Nikodym derivative X_n of ν with respect to P when both are restricted to ℱ_n; X_n is a function that is measurable ℱ_n and integrable with respect to P, and it satisfies

(35.9)  ∫_A X_n dP = ν(A),  A ∈ ℱ_n.

If A ∈ ℱ_n, then A ∈ ℱ_{n+1} as well, so that ∫_A X_{n+1} dP = ν(A); this and (35.9) give (35.3). Thus the X_n are a martingale with respect to the ℱ_n. ∎

Example 35.3. For a specialization of the preceding example, let P be Lebesgue measure on the σ-field ℱ of Borel subsets of Ω = (0, 1], and let ℱ_n be the finite σ-field generated by the partition of Ω into the dyadic intervals (k2^{-n}, (k+1)2^{-n}], 0 ≤ k < 2^n. If A ∈ ℱ_n and P(A) = 0, then A is empty. Hence P dominates every finite measure ν on ℱ_n. The Radon-Nikodym derivative is

(35.10)  X_n(ω) = 2^n ν((k2^{-n}, (k+1)2^{-n}])  for ω ∈ (k2^{-n}, (k+1)2^{-n}].

There is no need here to assume that P dominates ν when they are viewed as measures on all of ℱ. Suppose that ν is the distribution of Σ_{k=1}^∞ Z_k 2^{-k} for independent Z_k assuming values 1 and 0 with probabilities p and 1 - p. This is the measure in Examples 31.1 and 31.3 (there denoted μ), and for p ≠ 1/2, ν is singular with respect to Lebesgue measure P. It is nonetheless absolutely continuous with respect to P when both are restricted to ℱ_n. ∎
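In Example 35.3 the martingale identity (35.3) can be checked exactly. The sketch below uses the standard digit description of ν: a dyadic interval of rank n gets probability p^s (1-p)^{n-s}, where s is the number of 1's among the n binary digits indexing it; p = 1/3 is an arbitrary illustrative value.

```python
from fractions import Fraction

# Example 35.3 with nu the distribution of sum_k Z_k 2^-k, Z_k i.i.d.
# Bernoulli(p): the rank-n dyadic interval indexed by k has nu-measure
# p^s (1-p)^(n-s), s = number of 1's among the n binary digits of k.
# Then X_n = 2^n nu(interval), and (35.3) is checked exactly: the
# average of the two children values equals the parent's value.
p = Fraction(1, 3)                          # illustrative; any p in (0, 1)

def nu(n, k):
    """nu((k 2^-n, (k+1) 2^-n]) from the binary digits of k."""
    prob = Fraction(1)
    for i in range(n):
        prob *= p if (k >> i) & 1 else 1 - p
    return prob

def X(n, k):
    return 2 ** n * nu(n, k)                # Radon-Nikodym derivative on F_n

for n in range(1, 6):
    for k in range(2 ** n):
        children_avg = (X(n + 1, 2 * k) + X(n + 1, 2 * k + 1)) / 2
        assert children_avg == X(n, k)      # martingale property (35.3)
```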
Example 35.4. For another specialization of Example 35.2, suppose that $\mu$ is a probability measure $Q$ on $(\Omega, \mathscr{F})$, and that $\mathscr{F}_n = \sigma(Y_1, \ldots, Y_n)$ for random variables $Y_1, Y_2, \ldots$ on $(\Omega, \mathscr{F})$. Suppose that under the measure $P$ the distribution of the random vector $(Y_1, \ldots, Y_n)$ has density $p_n(y_1, \ldots, y_n)$ with respect to $n$-dimensional Lebesgue measure and that under $Q$ it has density $q_n(y_1, \ldots, y_n)$. To avoid technicalities, assume that $p_n$ is everywhere positive. Then the Radon-Nikodym derivative for $Q$ with respect to $P$ on $\mathscr{F}_n$ is

(35.11)  $X_n = \dfrac{q_n(Y_1, \ldots, Y_n)}{p_n(Y_1, \ldots, Y_n)}.$
SECTION 35. MARTINGALES
To see this, note that the general element of $\mathscr{F}_n$ is $[(Y_1, \ldots, Y_n) \in H]$, $H \in \mathscr{R}^n$; by the change-of-variable formula,

$$\int_{[(Y_1,\ldots,Y_n)\in H]} X_n\,dP = \int_H \frac{q_n(y)}{p_n(y)}\,p_n(y)\,dy = \int_H q_n(y)\,dy = Q\big[(Y_1,\ldots,Y_n)\in H\big]. \qquad∎$$
In statistical terms, (35.11) is a likelihood ratio: $p_n$ and $q_n$ are rival densities, and the larger $X_n$ is, the more strongly one prefers $q_n$ as an explanation of the observation $(Y_1, \ldots, Y_n)$. The analysis here is carried out under the assumption that $P$ is the measure actually governing the $Y_n$; that is, $X_n$ is a martingale under $P$ and not in general under $Q$.

In the most common case the $Y_n$ are independent and identically distributed under both $P$ and $Q$: $p_n(y_1, \ldots, y_n) = p(y_1)\cdots p(y_n)$ and $q_n(y_1, \ldots, y_n) = q(y_1)\cdots q(y_n)$ for densities $p$ and $q$ on the line, where $p$ is assumed everywhere positive for simplicity. Suppose that the measures corresponding to the densities $p$ and $q$ are not identical, so that $P[Y_n \in H] \ne Q[Y_n \in H]$ for some $H \in \mathscr{R}^1$. If $Z_n = I_{[Y_n \in H]}$, then by the strong law of large numbers, $n^{-1}\sum_{k=1}^n Z_k$ converges to $P[Y_1 \in H]$ on a set (in $\mathscr{F}$) of $P$-measure 1 and to $Q[Y_1 \in H]$ on a (disjoint) set of $Q$-measure 1. Thus $P$ and $Q$ are mutually singular on $\mathscr{F}$ even though $P$ dominates $Q$ on each $\mathscr{F}_n$. ∎
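A simulation sketch of the likelihood-ratio martingale (under the illustrative assumption, not dictated by the text, that the $Y_i$ are standard normal under $P$ and normal with mean 1 under $Q$): then $\log(q/p)(y) = y - \frac12$, so under $P$ the log likelihood ratio is a random walk with drift $-\frac12$ and $X_n \to 0$ a.s., even though $E[X_n] = 1$ for every $n$.

```python
# Under P: Y_i ~ N(0,1). Rival density q is N(1,1), so
# log q(y) - log p(y) = y - 1/2 and log X_n drifts to -infinity under P.
import random

random.seed(1)
n = 5000
log_X = 0.0
for _ in range(n):
    y = random.gauss(0.0, 1.0)   # an observation actually governed by P
    log_X += y - 0.5             # increment of the log likelihood ratio
print(log_X < -1000)             # True: X_n = exp(log_X) is effectively 0
```

The drift $-\frac12$ here is the negative Kullback-Leibler divergence of $Q$ from $P$; the martingale itself keeps mean 1 because rare large excursions compensate the almost-sure decay.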
Example 35.5. Suppose that $Z$ is an integrable random variable on $(\Omega, \mathscr{F}, P)$ and that $\mathscr{F}_n$ are nondecreasing $\sigma$-fields in $\mathscr{F}$. If

(35.12)  $X_n = E[Z\|\mathscr{F}_n],$

then the first three conditions in the martingale definition are satisfied, and by (34.5), $E[X_{n+1}\|\mathscr{F}_n] = E\big[E[Z\|\mathscr{F}_{n+1}]\,\big\|\,\mathscr{F}_n\big] = E[Z\|\mathscr{F}_n] = X_n$. Thus $X_n$ is a martingale relative to $\mathscr{F}_n$. ∎

Example 35.6. Let $N_{nk}$, $n, k = 1, 2, \ldots$, be an independent array of identically distributed random variables assuming the values $0, 1, 2, \ldots$. Define $Z_0, Z_1, Z_2, \ldots$ inductively by $Z_0(\omega) = 1$ and $Z_n(\omega) = N_{n,1}(\omega) + \cdots + N_{n,Z_{n-1}(\omega)}(\omega)$; $Z_n(\omega) = 0$ if $Z_{n-1}(\omega) = 0$. If $N_{nk}$ is thought of as the number of progeny of an organism, and if $Z_{n-1}$ represents the size at time $n - 1$ of a population of these organisms, then $Z_n$ represents the size at time $n$. If the expected number of progeny is $E[N_{nk}] = m$, then $E[Z_n\|Z_{n-1}] = Z_{n-1}m$, so that $X_n = Z_n/m^n$, $n = 0, 1, 2, \ldots$, is a martingale. The sequence $Z_0, Z_1, \ldots$ is a branching process. ∎
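A simulation sketch of Example 35.6 (with an offspring distribution of my own choosing: 0, 1, or 2 progeny with probabilities $\frac14, \frac12, \frac14$, so $m = 1$ and $X_n = Z_n/m^n = Z_n$): the martingale mean $E[Z_n] = m^n = 1$ can be checked by Monte Carlo.

```python
# Galton-Watson branching process of Example 35.6 with m = E[N] = 1.
import random

random.seed(0)

def offspring():
    return random.choices([0, 1, 2], weights=[1, 2, 1])[0]

def generation_sizes(n_gen):
    z, sizes = 1, [1]
    for _ in range(n_gen):
        z = sum(offspring() for _ in range(z))  # Z_n = N_1 + ... + N_{Z_{n-1}}
        sizes.append(z)                         # (empty sum: extinction is absorbing)
    return sizes

reps = 20000
est = sum(generation_sizes(3)[-1] for _ in range(reps)) / reps
print(abs(est - 1.0) < 0.1)  # True: E[Z_3] = m^3 = 1
```

Note that extinction ($Z_n = 0$) is automatic here, since a sum over zero individuals is zero; Example 35.9 below returns to what this means for the limit of $X_n$.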
In the preceding definition and examples, $n$ ranges over the positive integers. The definition makes sense if $n$ ranges over $1, 2, \ldots, N$; here conditions (ii) and (iii) are required for $1 \le n \le N$ and conditions (i) and (iv) only for $1 \le n < N$. It is, in fact, clear that the definition makes sense if the indices range over an arbitrary ordered set. Although martingale theory with an interval of the line as the index set is of interest and importance, here the index set will be discrete.

Submartingales
Random variables $X_n$ are a submartingale relative to $\sigma$-fields $\mathscr{F}_n$ if (i), (ii), and (iii) of the definition above hold and if this condition holds in place of (iv):

(iv') with probability 1,

(35.13)  $E[X_{n+1}\|\mathscr{F}_n] \ge X_n.$

As before, the $X_n$ are a submartingale if they are a submartingale with respect to some sequence $\mathscr{F}_n$, and the special sequence $\mathscr{F}_n = \sigma(X_1, \ldots, X_n)$ works if any does. The requirement (35.13) is the same thing as

(35.14)  $\int_A X_n\,dP \le \int_A X_{n+1}\,dP$,  $A \in \mathscr{F}_n$.

This extends inductively (see the argument for (35.4)), and so

(35.15)  $E[X_{n+k}\|\mathscr{F}_n] \ge X_n$

for $k \ge 1$. Taking expected values in (35.15) gives

(35.16)  $E[X_n] \le E[X_{n+k}].$

Example 35.7. Suppose that the $\Delta_n$ are independent and integrable, as in Example 35.1, but assume that $E[\Delta_n]$ is for $n \ge 2$ nonnegative rather than 0. Then the partial sums $\Delta_1 + \cdots + \Delta_n$ form a submartingale. ∎
Example 35.8. Suppose that the $X_n$ are a martingale relative to the $\mathscr{F}_n$. Then $|X_n|$ is measurable $\mathscr{F}_n$ and integrable, and by Theorem 34.2(iv), $E\big[|X_{n+1}|\,\big\|\,\mathscr{F}_n\big] \ge \big|E[X_{n+1}\|\mathscr{F}_n]\big| = |X_n|$. Thus the $|X_n|$ are a submartingale relative to the $\mathscr{F}_n$. Note that even if $X_1, \ldots, X_n$ generate $\mathscr{F}_n$, $|X_1|, \ldots, |X_n|$ will in general generate a $\sigma$-field smaller than $\mathscr{F}_n$. ∎
Reversing the inequality in (35.13) gives the definition of a supermartingale. The inequalities in (35.14), (35.15), and (35.16) become reversed as well. The theory for supermartingales is of course symmetric to that of submartingales.
Gambling
Consider again the gambler whose fortune after the $n$th play is $X_n$ and whose information about the game at that time is represented by the $\sigma$-field $\mathscr{F}_n$. If $\mathscr{F}_n = \sigma(X_1, \ldots, X_n)$, he knows the sequence of his fortunes and nothing else, but $\mathscr{F}_n$ could be larger. The martingale condition (35.1) stipulates that his expected or average fortune after the next play equals his present fortune, and so the martingale is the model for a fair game. Since the condition (35.13) for a submartingale stipulates that he stands to gain (or at least not lose) on the average, a submartingale represents a game favorable to the gambler.† Similarly, a supermartingale represents a game unfavorable to the gambler. Examples of such games were studied in Section 7, and some of the results there have immediate generalizations.

Start the martingale at $n = 0$, $X_0$ representing the gambler's initial fortune. The difference $\Delta_n = X_n - X_{n-1}$ represents the amount the gambler wins on the $n$th play,* a negative win being of course a loss. Suppose instead that $\Delta_n$ represents the amount he wins if he puts up unit stakes; if instead of unit stakes he wagers the amount $W_n$ on the $n$th play, $W_n\Delta_n$ represents his gain on that play. Suppose that $W_n \ge 0$, and that $W_n$ is measurable $\mathscr{F}_{n-1}$, to exclude prevision: before the $n$th play the information available to the gambler is that in $\mathscr{F}_{n-1}$, and his choice of stake $W_n$ must be based on this alone. For simplicity take $W_n$ bounded. Then $W_n\Delta_n$ is integrable and measurable $\mathscr{F}_n$ if $\Delta_n$ is, and if $X_n$ is a martingale, then $E[W_n\Delta_n\|\mathscr{F}_{n-1}] = W_nE[\Delta_n\|\mathscr{F}_{n-1}] = 0$ by (34.2). Thus

(35.17)  $X_0 + W_1\Delta_1 + \cdots + W_n\Delta_n$

is a martingale relative to the $\mathscr{F}_n$. The sequence $W_1, W_2, \ldots$ represents a betting system, and transforming a fair game by a betting system preserves fairness; that is, transforming $X_n$ into (35.17) preserves the martingale property. The various betting systems discussed in Section 7 give rise to various martingales, and these martingales are not in general sums of independent

† There is a reversal of terminology here: a subfair game (Section 7) is against the gambler, while a submartingale favors him.
* The notation has, of course, changed. The $F_n$ and $x_n$ of Section 7 have become $X_n$ and $\Delta_n$.
random variables, that is, not in general the special martingales of Example 35.1. If $W_n$ assumes only the values 0 and 1, the betting system is a selection system; see Section 7.

If the game is unfavorable to the gambler, that is, if $X_n$ is a supermartingale, and if $W_n$ is nonnegative, bounded, and measurable $\mathscr{F}_{n-1}$, then the same argument shows that (35.17) is again a supermartingale, is again unfavorable. Betting systems are thus of no avail in unfavorable games.

The stopping-time arguments of Section 7 also extend. Suppose that $\{X_n\}$ is a martingale relative to $\{\mathscr{F}_n\}$; it may have come from another martingale via transformation by a betting system. Let $\tau$ be a random variable taking on nonnegative integers as values, and suppose that

(35.18)  $[\tau = n] \in \mathscr{F}_n$,  $n = 0, 1, 2, \ldots$

If $\tau$ is the time the gambler stops, $[\tau = n]$ is the event he stops just after the $n$th play, and (35.18) requires that his decision depend only on the information $\mathscr{F}_n$ available to him at that time. His fortune at time $n$ for this stopping rule is

(35.19)  $X_n^* = \begin{cases} X_n & \text{if } n \le \tau, \\ X_\tau & \text{if } n > \tau. \end{cases}$
Here $X_\tau$ (which has value $X_{\tau(\omega)}(\omega)$ at $\omega$) is the gambler's ultimate fortune, and it is his fortune for all times subsequent to $\tau$.

The problem is to show that $X_0^*, X_1^*, \ldots$ is a martingale relative to $\mathscr{F}_0, \mathscr{F}_1, \ldots$. First,

$$E[|X_n^*|] = \sum_{k=0}^{n-1}\int_{[\tau=k]}|X_k|\,dP + \int_{[\tau\ge n]}|X_n|\,dP \le \sum_{k=0}^{n}E[|X_k|] < \infty.$$

Since $[\tau \ge n] = \Omega - [\tau \le n-1] \in \mathscr{F}_{n-1}$,

$$[X_n^* \in H] = \bigcup_{k=0}^{n-1}[\tau = k,\ X_k \in H] \cup [\tau \ge n,\ X_n \in H] \in \mathscr{F}_n.$$

Moreover, for $A \in \mathscr{F}_n$,

$$\int_A X_{n+1}^*\,dP = \sum_{k=0}^{n}\int_{A\cap[\tau=k]}X_k\,dP + \int_{A\cap[\tau\ge n+1]}X_{n+1}\,dP$$

and

$$\int_A X_n^*\,dP = \sum_{k=0}^{n}\int_{A\cap[\tau=k]}X_k\,dP + \int_{A\cap[\tau\ge n+1]}X_n\,dP.$$

Because of (35.3), the right sides here coincide if $A \in \mathscr{F}_n$ (note that $A \cap [\tau \ge n+1] \in \mathscr{F}_n$); this establishes (35.3) for the sequence $X_0^*, X_1^*, \ldots$, which is thus a martingale. The same kind of argument works for supermartingales.

Since $X_n^* = X_\tau$ for $n \ge \tau$, $X_n^* \to X_\tau$. As pointed out in Section 7, it is not always possible to integrate to the limit here. Let $X_n = a + \Delta_1 + \cdots + \Delta_n$, where the $\Delta_n$ are independent and assume the values $\pm 1$ with probability $\frac12$ ($X_0 = a$), and let $\tau$ be the smallest $n$ for which $\Delta_1 + \cdots + \Delta_n = 1$. Then $E[X_n^*] = a$ and $X_\tau = a + 1$. On the other hand, if the $X_n^*$ are uniformly bounded or uniformly integrable, it is possible to integrate to the limit: $E[X_\tau] = E[X_1]$.
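The counterexample just given can be watched numerically; this sketch (taking $a = 0$, and the finite horizon is my own truncation) simulates the stopped martingale (35.19) for a fair walk stopped on first hitting $+1$.

```python
# Stopped martingale X*_n for a fair +/-1 walk, X_0 = 0, stopped at
# tau = min{n : X_n = 1}. At any fixed horizon n, E[X*_n] = 0, yet
# X_tau = 1 on [tau < infinity], which has probability 1.
import random

random.seed(3)

def stopped_value(horizon):
    x = 0
    for _ in range(horizon):
        x += random.choice((-1, 1))
        if x == 1:
            return 1          # stopped: fortune frozen at X_tau = 1
    return x                  # still playing at the horizon

reps, horizon = 40000, 50
vals = [stopped_value(horizon) for _ in range(reps)]
est = sum(vals) / reps
print(abs(est) < 0.1)                           # True: E[X*_n] = 0
print(sum(v == 1 for v in vals) / reps > 0.8)   # True: most paths already stopped
```

The two prints together show the failure of integrating to the limit: the mean stays at 0 for every fixed $n$, while the terminal value is 1 almost surely; the mass balancing the mean sits on the shrinking event $[\tau > n]$, where the walk is far below 0.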
Functions of Martingales
Convex functions of martingales are submartingales:

Theorem 35.1. (i) If $X_1, X_2, \ldots$ is a martingale relative to $\mathscr{F}_1, \mathscr{F}_2, \ldots$, if $\varphi$ is convex, and if the $\varphi(X_n)$ are integrable, then $\varphi(X_1), \varphi(X_2), \ldots$ is a submartingale relative to $\mathscr{F}_1, \mathscr{F}_2, \ldots$.
(ii) If $X_1, X_2, \ldots$ is a submartingale relative to $\mathscr{F}_1, \mathscr{F}_2, \ldots$, if $\varphi$ is nondecreasing and convex, and if the $\varphi(X_n)$ are integrable, then $\varphi(X_1), \varphi(X_2), \ldots$ is a submartingale relative to $\mathscr{F}_1, \mathscr{F}_2, \ldots$.

PROOF. In the submartingale case, $X_n \le E[X_{n+1}\|\mathscr{F}_n]$, and if $\varphi$ is nondecreasing, then $\varphi(X_n) \le \varphi(E[X_{n+1}\|\mathscr{F}_n])$. In the martingale case, $X_n = E[X_{n+1}\|\mathscr{F}_n]$, and so $\varphi(X_n) = \varphi(E[X_{n+1}\|\mathscr{F}_n])$. If $\varphi$ is convex, then by Jensen's inequality (34.7) for conditional expectations, $\varphi(E[X_{n+1}\|\mathscr{F}_n]) \le E[\varphi(X_{n+1})\|\mathscr{F}_n]$. ∎

Example 35.8 is the case of part (i) for $\varphi(x) = |x|$.
Inequalities
There are two inequalities that are fundamental to the theory of martingales.

Theorem 35.2. If $X_1, \ldots, X_n$ is a submartingale, then for $\alpha > 0$,

(35.20)  $P\Big[\max_{i\le n} X_i \ge \alpha\Big] \le \dfrac{1}{\alpha}E[|X_n|].$

This extends Kolmogorov's inequality: if $S_1, S_2, \ldots$ are partial sums of independent random variables with mean 0, they form a martingale; if the variances are finite, then $S_1^2, S_2^2, \ldots$ is a submartingale by part (i) of the preceding theorem, and (35.20) for this submartingale is exactly Kolmogorov's inequality (22.9).

PROOF OF THE THEOREM. For arbitrary $\alpha$, let $A_1 = [X_1 \ge \alpha]$, $A_k = [\max_{i<k} X_i < \alpha \le X_k]$, and $A = \bigcup_{k=1}^n A_k = [\max_{i\le n} X_i \ge \alpha]$. Since $A_k \in \mathscr{F}_k = \sigma(X_1, \ldots, X_k)$, it follows (see (35.15)) that

$$\int_A X_n\,dP = \sum_{k=1}^n\int_{A_k}X_n\,dP = \sum_{k=1}^n\int_{A_k}E[X_n\|\mathscr{F}_k]\,dP \ge \sum_{k=1}^n\int_{A_k}X_k\,dP \ge \alpha\sum_{k=1}^nP(A_k) = \alpha P(A).$$

Thus

$$\alpha P\Big[\max_{i\le n}X_i \ge \alpha\Big] \le \int_{[\max_{i\le n}X_i\ge\alpha]}X_n\,dP \le E[X_n^+] \le E[|X_n|]. \qquad∎$$
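The inequality (35.20) can be confirmed exactly on a small example (my own finite instance: the submartingale $|S_1|, \ldots, |S_n|$ for a fair $\pm 1$ walk), by enumerating all equally likely paths rather than sampling.

```python
# Exact finite check of the maximal inequality (35.20) for the
# submartingale |S_1|, ..., |S_n| of a fair +/-1 walk.
from itertools import product

n = 12
paths = list(product((-1, 1), repeat=n))
max_abs, total_abs_Sn = [], 0
for path in paths:
    s, m = 0, 0
    for step in path:
        s += step
        m = max(m, abs(s))     # running max of |S_i|
    max_abs.append(m)
    total_abs_Sn += abs(s)
E_abs_Sn = total_abs_Sn / len(paths)
ok = all(
    sum(m >= alpha for m in max_abs) / len(paths) <= E_abs_Sn / alpha + 1e-12
    for alpha in (1, 2, 3, 4, 5)
)
print(ok)  # True: P[max |S_i| >= a] <= E|S_n| / a for each a checked
```

Since every path is enumerated, this is a proof-by-exhaustion for $n = 12$, not a statistical estimate.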
If $X_1, \ldots, X_n$ is a martingale, $|X_1|, \ldots, |X_n|$ is a submartingale, and so (35.20) gives $P[\max_{i\le n}|X_i| \ge \alpha] \le \alpha^{-1}E[|X_n|]$.

The second fundamental inequality requires the notion of an upcrossing. Let $[\alpha, \beta]$ be an interval ($\alpha < \beta$) and let $X_1, \ldots, X_n$ be random variables. The number of upcrossings of $[\alpha, \beta]$ by $X_1(\omega), \ldots, X_n(\omega)$ is the number of times the sequence passes from below $\alpha$ to above $\beta$. More specifically, the number of upcrossings is the number of pairs $u$ and $v$ of integers such that $1 \le u < v \le n$, $X_u \le \alpha$, $\alpha < X_i < \beta$ for $u < i < v$ (a vacuous condition if $v = u + 1$), and $X_v \ge \beta$. In the diagram, $n = 20$ and there are three upcrossings; the corresponding pairs $u, v$ are shown at the bottom.
[Diagram: a sample path of length $n = 20$ crossing the band $[\alpha, \beta]$ three times; beneath it the three pairs $u, v$, and the times $\tau_k$ of the proof below, are marked off.]
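The definition above translates directly into a scan of the path; this small function (names my own) counts upcrossings exactly as defined: wait until the path is at or below $\alpha$, then credit one upcrossing when it next reaches $\beta$ or above.

```python
# Count upcrossings of [alpha, beta] by a finite path, per the definition:
# pairs u < v with X_u <= alpha, X_v >= beta, and alpha < X_i < beta between.
def upcrossings(xs, alpha, beta):
    count = 0
    below = False            # path has touched alpha or below since last upcrossing
    for x in xs:
        if not below:
            if x <= alpha:
                below = True
        elif x >= beta:
            count += 1
            below = False
    return count

# Drops to 0, rises to 3, drops to -1, rises to 4: two upcrossings of [0, 3].
print(upcrossings([2, 0, 1, 3, -1, 2, 4], 0, 3))  # 2
```

Applied to successive segments $X_1, \ldots, X_n$ of a submartingale, this is the quantity $U$ bounded by Theorem 35.3 below, and the quantity whose boundedness drives the convergence theorem.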
Theorem 35.3. For a submartingale $X_1, \ldots, X_n$, the number $U$ of upcrossings of $[\alpha, \beta]$ satisfies

(35.21)  $E[U] \le \dfrac{E[|X_n|] + |\alpha|}{\beta - \alpha}.$
The proof requires a lemma. Suppose $X_1, \ldots, X_n$ is a submartingale with respect to $\mathscr{F}_1, \ldots, \mathscr{F}_n$, and let $\tau_1$ and $\tau_2$ be stopping times: $[\tau_i = k] \in \mathscr{F}_k$, $1 \le k \le n$.

Lemma 1. If $X_1, \ldots, X_n$ is a submartingale and the stopping times $\tau_1$ and $\tau_2$ satisfy $1 \le \tau_1 \le \tau_2 \le n$ with probability 1, then $E[X_{\tau_1}] \le E[X_{\tau_2}]$.
PROOF. Since $E[|X_{\tau_i}|] \le \sum_{k=1}^nE[|X_k|]$, the expected values exist. If $\Delta_k = X_k - X_{k-1}$, then

$$X_{\tau_2} - X_{\tau_1} = \sum_{k=1}^nI_{[\tau_1<k\le\tau_2]}\Delta_k.$$

Since $[\tau_1 < k \le \tau_2] = [\tau_1 \le k-1] \cap [\tau_2 \le k-1]^c \in \mathscr{F}_{k-1}$, the integrals

$$\int_{[\tau_1<k\le\tau_2]}\Delta_k\,dP = \int_{[\tau_1\le k-1]\cap[\tau_2\le k-1]^c}(X_k - X_{k-1})\,dP$$

are nonnegative by (35.14). Adding over $k$ gives $E[X_{\tau_2}] - E[X_{\tau_1}] \ge 0$. ∎
PROOF OF THE THEOREM. Let $Y_k = \max\{0, X_k - \alpha\}$ and $\theta = \beta - \alpha$. Then $U$ is also the number of upcrossings of $[0, \theta]$ by the submartingale $Y_1, \ldots, Y_n$. Inductively define $\tau_0, \tau_1, \ldots$:

$\tau_0 = 1$; $\tau_k$ for even $k$ is the smallest $j$ such that $\tau_{k-1} < j \le n$ and $Y_j \ge \theta$, and is $n$ if there is no such $j$; $\tau_k$ for odd $k$ is the smallest $j$ such that $\tau_{k-1} < j \le n$ and $Y_j = 0$, and is $n$ if there is no such $j$.

(If $0, \theta, Y_j$ are replaced by $\alpha, \beta, X_j$ in this definition, the $\tau_k$ are unchanged; see the $\tau$'s marked off on the diagram.) Note that $\tau_k \le n$ and that $\tau_{k-1} = n$ implies $\tau_k = n$. Also, $\tau_{k-1} < n$ implies $\tau_{k-1} < \tau_k$, and hence $\tau_n = \tau_{n+1} = \cdots = n$.
If $k$ is odd and $\tau_{k-1}$ is a stopping time, then for $j < n$,

$$[\tau_k = j] = \bigcup_{i=1}^{j-1}\big[\tau_{k-1} = i,\ Y_{i+1} > 0, \ldots, Y_{j-1} > 0,\ Y_j = 0\big]$$

lies in $\mathscr{F}_j$, and $[\tau_k = n] = [\tau_k \le n-1]^c$ lies in $\mathscr{F}_n$, and so $\tau_k$ is also a stopping time. With a similar argument for even $k$, this shows that the $\tau_k$ are stopping times, since clearly $\tau_0$ is one.

Since $Y_n = Y_{\tau_n} \ge Y_{\tau_n} - Y_{\tau_0}$,

$$Y_n \ge \sum_e\big(Y_{\tau_k} - Y_{\tau_{k-1}}\big) + \sum_o\big(Y_{\tau_k} - Y_{\tau_{k-1}}\big),$$

where $\sum_e$ and $\sum_o$ extend respectively over the even and the odd $k$ in the range $1 \le k \le n$. By Lemma 1, the last sum has nonnegative expected value, and so

$$E[Y_n] \ge E\Big[\sum_e\big(Y_{\tau_k} - Y_{\tau_{k-1}}\big)\Big].$$

Suppose now that $k$ is even. If $\tau_{k-1} = n$, then $Y_{\tau_{k-1}} = Y_n = Y_{\tau_k}$, while if $\tau_{k-1} < n$, then $Y_{\tau_{k-1}} = 0$; in either case, $Y_{\tau_k} - Y_{\tau_{k-1}} \ge 0$. Suppose that $1 \le u < v \le n$, $Y_u = 0$, $0 < Y_i < \theta$ for $u < i < v$, and $Y_v \ge \theta$; that is, suppose that the pair $u, v$ contributes an upcrossing of $[0, \theta]$ by $Y_1, \ldots, Y_n$. Then there is an even $k$, $1 \le k \le n$, such that $\tau_{k-1} \le u < v = \tau_k$ (see the diagram); it follows that $Y_{\tau_{k-1}} = 0$ and $Y_{\tau_k} \ge \theta$, and hence $Y_{\tau_k} - Y_{\tau_{k-1}} \ge \theta$. Since different upcrossings give different values of $k$, $\sum_e(Y_{\tau_k} - Y_{\tau_{k-1}}) \ge U\theta$, and therefore $E[Y_n] \ge \theta E[U]$. In terms of the original random variables, this is

$$(\beta - \alpha)E[U] \le \int_{[X_n\ge\alpha]}(X_n - \alpha)\,dP \le E[|X_n|] + |\alpha|. \qquad∎$$
Convergence Theorems

The martingale convergence theorem, due to Doob, has a number of forms. The simplest one is this:

Theorem 35.4. Let $X_1, X_2, \ldots$ be a submartingale. If $K = \sup_nE[|X_n|] < \infty$, then $X_n \to X$ with probability 1, where $X$ is a random variable satisfying $E[|X|] \le K$.

PROOF. Fix $\alpha$ and $\beta$ for the moment, and let $U_n$ be the number of upcrossings of $[\alpha, \beta]$ by $X_1, \ldots, X_n$. By the upcrossing theorem, $E[U_n] \le (E[|X_n|] + |\alpha|)/(\beta - \alpha) \le (K + |\alpha|)/(\beta - \alpha)$. Since $U_n$ is nondecreasing and $E[U_n]$ is bounded, it follows by the monotone convergence theorem that $\sup_nU_n$ is integrable and hence finite-valued almost everywhere.

Let $X^*$ and $X_*$ be the limits superior and inferior of the sequence $X_1, X_2, \ldots$; they may be infinite. If $X_* < \alpha < \beta < X^*$, then $U_n$ must go to infinity. Since $\sup_nU_n$ is finite with probability 1, $P[X_* < \alpha < \beta < X^*] = 0$. Now

(35.22)  $[X_* < X^*] = \bigcup\,[X_* < \alpha < \beta < X^*],$

where the union extends over all pairs of rationals $\alpha$ and $\beta$. The set on the left therefore has probability 0. Thus $X^*$ and $X_*$ are equal with probability 1, and $X_n$ converges to their common value $X$, which may be infinite. By Fatou's lemma, $E[|X|] \le \liminf_nE[|X_n|] \le K$. Since it is integrable, $X$ is finite with probability 1. ∎

If the $X_n$ form a martingale, then by (35.16) applied to the submartingale $|X_1|, |X_2|, \ldots$, the $E[|X_n|]$ do not decrease, so that $K = \lim_nE[|X_n|]$. The hypothesis in the theorem that $K$ be finite is essential: if $X_n = \Delta_1 + \cdots + \Delta_n$, where the $\Delta_n$ are independent and assume values $\pm 1$ with probability $\frac12$, then $X_n$ does not converge; of course, $E[|X_n|]$ must go to infinity in this case. If the $X_n$ form a nonnegative martingale, then $E[|X_n|] = E[X_n] = E[X_1]$ by (35.5), and $K$ is necessarily finite.
Example 35.9. The $X_n$ in Example 35.6 are nonnegative, and so $X_n = Z_n/m^n \to X$, where $X$ is nonnegative and integrable. If $m < 1$, then, since $Z_n$ is an integer, $Z_n = 0$ for large $n$, and the population dies out. In this case, $X = 0$ with probability 1. Since $E[X_n] = E[X_0] = 1$, this shows that $E[X_n] \to E[X]$ may fail in Theorem 35.4. ∎

Theorem 35.4 has an important application to the martingale of Example 35.5, and this requires a lemma.
Lemma 2. If $Z$ is integrable and $\mathscr{F}_n$ are arbitrary $\sigma$-fields, then the random variables $E[Z\|\mathscr{F}_n]$ are uniformly integrable.

For the definition of uniform integrability, see (16.21). The $\mathscr{F}_n$ must, of course, lie in the $\sigma$-field $\mathscr{F}$, but they need not, for example, be nondecreasing.

PROOF OF THE LEMMA. Since $|E[Z\|\mathscr{F}_n]| \le E[|Z|\,\|\mathscr{F}_n]$, $Z$ may be assumed nonnegative. Let $A_{\alpha n} = [E[Z\|\mathscr{F}_n] \ge \alpha]$. Since $E[Z\|\mathscr{F}_n]$ is measurable $\mathscr{F}_n$,

$$\int_{A_{\alpha n}}E[Z\|\mathscr{F}_n]\,dP = \int_{A_{\alpha n}}Z\,dP.$$

It is therefore enough to find for given $\epsilon$ an $\alpha$ such that this last integral is less than $\epsilon$ for all $n$. Now $\int_AZ\,dP$ is, as a function of $A$, a finite measure dominated by $P$; by the $\epsilon$-$\delta$ version of absolute continuity (see (32.4)) there is a $\delta$ such that $P(A) < \delta$ implies that $\int_AZ\,dP < \epsilon$. But $P[E[Z\|\mathscr{F}_n] \ge \alpha] \le \alpha^{-1}E[E[Z\|\mathscr{F}_n]] = \alpha^{-1}E[Z] < \delta$ for large enough $\alpha$. ∎

Suppose that $\mathscr{F}_n$ are $\sigma$-fields satisfying $\mathscr{F}_1 \subset \mathscr{F}_2 \subset \cdots$. If the union $\bigcup_{n=1}^\infty\mathscr{F}_n$ generates the $\sigma$-field $\mathscr{F}_\infty$, this is expressed by $\mathscr{F}_n \uparrow \mathscr{F}_\infty$. The requirement is not that $\mathscr{F}_\infty$ coincide with the union but that it be generated by it.

Theorem 35.5. If $\mathscr{F}_n \uparrow \mathscr{F}_\infty$ and $Z$ is integrable, then

(35.23)  $E[Z\|\mathscr{F}_n] \to E[Z\|\mathscr{F}_\infty]$

with probability 1.

PROOF. According to Example 35.5, the random variables $X_n = E[Z\|\mathscr{F}_n]$ form a martingale relative to the $\mathscr{F}_n$. By Lemma 2 the $X_n$ are uniformly integrable. Since $E[|X_n|] \le E[|Z|]$, by Theorem 35.4 the $X_n$ converge to an integrable $X$. The problem is to identify $X$ with $E[Z\|\mathscr{F}_\infty]$.

Because of the uniform integrability, it is possible (Theorem 16.13) to integrate to the limit: $\int_AX\,dP = \lim_n\int_AX_n\,dP$. If $A \in \mathscr{F}_k$ and $n \ge k$, then $\int_AX_n\,dP = \int_AE[Z\|\mathscr{F}_n]\,dP = \int_AZ\,dP$. Therefore, $\int_AX\,dP = \int_AZ\,dP$ for all $A$ in the $\pi$-system $\bigcup_{k=1}^\infty\mathscr{F}_k$; since $X$ is measurable $\mathscr{F}_\infty$, it follows by Theorem 34.1 that $X$ is a version of $E[Z\|\mathscr{F}_\infty]$. ∎
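Theorem 35.5 can be seen concretely in the dyadic setting of Example 35.3 (the particular $Z$ below is my own choice): with $\mathscr{F}_n$ the dyadic $\sigma$-field and $Z(\omega) = \omega^2$ under Lebesgue measure, $E[Z\|\mathscr{F}_n]$ is the average of $Z$ over the dyadic interval containing $\omega$, computable in closed form, and it converges to $Z(\omega)$.

```python
# E[Z || F_n] for Z(w) = w^2 on (0,1] with F_n the dyadic sigma-field:
# the conditional expectation is the interval average of x^2, which has the
# closed form (a^2 + a*b + b^2)/3 on (a, b].
def cond_exp_dyadic(w, n):
    k = min(int(w * 2**n), 2**n - 1)   # index of the dyadic interval around w
    a, b = k / 2**n, (k + 1) / 2**n
    return (a * a + a * b + b * b) / 3.0

w = 0.7
errors = [abs(cond_exp_dyadic(w, n) - w * w) for n in (2, 5, 10, 20)]
decreasing = all(e2 < e1 for e1, e2 in zip(errors, errors[1:]))
print(decreasing)         # True: the approximation error shrinks with n
print(errors[-1] < 1e-5)  # True
```

The error at level $n$ is of order $2^{-n}$ here because $Z$ is smooth; for a general integrable $Z$ the theorem still gives almost-everywhere convergence, just without a rate.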
Reversed Martingales
A left-infinite sequence $\ldots, X_{-2}, X_{-1}$ is a martingale relative to $\sigma$-fields $\ldots, \mathscr{F}_{-2}, \mathscr{F}_{-1}$ if conditions (ii) and (iii) in the definition of martingale are satisfied for $n \le -1$ and conditions (i) and (iv) are satisfied for $n < -1$. Such a sequence is a reversed or backward martingale.

Theorem 35.6. For a reversed martingale, $\lim_{n\to\infty}X_{-n} = X$ exists and is integrable, and $E[X] = E[X_{-n}]$ for all $n$.

PROOF. The proof is almost the same as that for Theorem 35.4. Let $X^*$ and $X_*$ be the limits superior and inferior of the sequence $X_{-1}, X_{-2}, \ldots$. Again (35.22) holds. Let $U_n$ be the number of upcrossings of $[\alpha, \beta]$ by $X_{-n}, \ldots, X_{-1}$. By the upcrossing theorem, $E[U_n] \le (E[|X_{-1}|] + |\alpha|)/(\beta - \alpha)$. Again $E[U_n]$ is bounded, and so $\sup_nU_n$ is finite with probability 1 and the sets in (35.22) have probability 0. Therefore, $\lim_{n\to\infty}X_{-n} = X$ exists with probability 1.

By the property (35.4) for martingales, $X_{-n} = E[X_{-1}\|\mathscr{F}_{-n}]$ for $n = 1, 2, \ldots$. Lemma 2 now implies that the $X_{-n}$ are uniformly integrable. Therefore, $X$ is integrable and $E[X]$ is the limit of the $E[X_{-n}]$; these all have the same value by (35.5). ∎
If $\mathscr{F}_n$ are $\sigma$-fields satisfying $\mathscr{F}_1 \supset \mathscr{F}_2 \supset \cdots$, then the intersection $\bigcap_{n=1}^\infty\mathscr{F}_n = \mathscr{F}_0$ is also a $\sigma$-field, and the relation is expressed by $\mathscr{F}_n \downarrow \mathscr{F}_0$.

Theorem 35.7. If $\mathscr{F}_n \downarrow \mathscr{F}_0$ and $Z$ is integrable, then

(35.24)  $E[Z\|\mathscr{F}_n] \to E[Z\|\mathscr{F}_0]$

with probability 1.

PROOF. If $X_{-n} = E[Z\|\mathscr{F}_n]$, then $\ldots, X_{-2}, X_{-1}$ is a martingale relative to $\ldots, \mathscr{F}_2, \mathscr{F}_1$. By the preceding theorem, $E[Z\|\mathscr{F}_n]$ converges as $n \to \infty$ to an integrable $X$, and by Lemma 2, the $E[Z\|\mathscr{F}_n]$ are uniformly integrable. As the limit of the $E[Z\|\mathscr{F}_n]$ for $n \ge k$, $X$ is measurable $\mathscr{F}_k$; $k$ being arbitrary, $X$ is measurable $\mathscr{F}_0$. By uniform integrability, $A \in \mathscr{F}_0$ implies that

$$\int_AX\,dP = \lim_n\int_AE[Z\|\mathscr{F}_n]\,dP = \int_AZ\,dP = \int_AE[Z\|\mathscr{F}_0]\,dP,$$

the middle equality because $A \in \mathscr{F}_0 \subset \mathscr{F}_n$. Thus $X$ is a version of $E[Z\|\mathscr{F}_0]$. ∎
Theorems 35.5 and 35.7 are parallel. There is an essential difference between Theorems 35.4 and 35.6, however. In the latter, the martingale has a last random variable, namely $X_{-1}$, and so it is unnecessary in proving convergence to assume the $E[|X_n|]$ bounded. On the other hand, the proof in Theorem 35.6 that $X$ is integrable would not work for a submartingale.
In applications of Theorem 35.7, $\mathscr{F}_n$ is often $\sigma(Y_n, Y_{n+1}, \ldots)$ for random variables $Y_1, Y_2, \ldots$. In Theorem 35.5, $\mathscr{F}_n$ is often $\sigma(Y_1, \ldots, Y_n)$; in this case $\mathscr{F}_\infty = \sigma(Y_1, Y_2, \ldots)$ is in general strictly larger than $\bigcup_{n=1}^\infty\mathscr{F}_n$.
Applications: Derivatives
Theorem 35.8. Suppose that $(\Omega, \mathscr{F}, P)$ is a probability space, $\mu$ is a finite measure on $\mathscr{F}$, and $\mathscr{F}_n \uparrow \mathscr{F}_\infty \subset \mathscr{F}$. Suppose that $P$ dominates $\mu$ on each $\mathscr{F}_n$, and let $X_n$ be the corresponding Radon-Nikodym derivative. Then $X_n \to X$ with probability 1, where $X$ is integrable.
(i) If $P$ dominates $\mu$ on $\mathscr{F}_\infty$, then $X$ is the corresponding Radon-Nikodym derivative.
(ii) If $P$ and $\mu$ are mutually singular on $\mathscr{F}_\infty$, then $X = 0$ with probability 1.

PROOF. The situation is that of Example 35.2. The density $X_n$ is measurable $\mathscr{F}_n$ and satisfies (35.9). Since $X_n$ is nonnegative, $E[|X_n|] = E[X_n] = \mu(\Omega)$, and it follows by Theorem 35.4 that $X_n$ converges to an integrable $X$. The limit $X$ is measurable $\mathscr{F}_\infty$.

Suppose that $P$ dominates $\mu$ on $\mathscr{F}_\infty$, and let $Z$ be the Radon-Nikodym derivative: $Z$ is measurable $\mathscr{F}_\infty$, and $\int_AZ\,dP = \mu(A)$ for $A \in \mathscr{F}_\infty$. It follows that $\int_AZ\,dP = \int_AX_n\,dP$ for $A$ in $\mathscr{F}_n$, and so $X_n = E[Z\|\mathscr{F}_n]$. Now Theorem 35.5 implies that $X_n \to E[Z\|\mathscr{F}_\infty] = Z$.

Suppose, on the other hand, that $P$ and $\mu$ are mutually singular on $\mathscr{F}_\infty$, so that there exists a set $S$ in $\mathscr{F}_\infty$ such that $\mu(S) = 0$ and $P(S) = 1$. By Fatou's lemma, $\int_AX\,dP \le \liminf_n\int_AX_n\,dP$. If $A \in \mathscr{F}_k$, then $\int_AX_n\,dP = \mu(A)$ for $n \ge k$, and so $\int_AX\,dP \le \mu(A)$ for $A$ in the field $\bigcup_{k=1}^\infty\mathscr{F}_k$. It follows by the monotone class theorem that this holds for all $A$ in $\mathscr{F}_\infty$. Therefore, $\int_\Omega X\,dP = \int_SX\,dP \le \mu(S) = 0$, and $X$ vanishes with probability 1. ∎
Example 35.10. As in Example 35.3, let $\mu$ be a finite measure on the unit interval with Lebesgue measure $(\Omega, \mathscr{F}, P)$. For $\mathscr{F}_n$ the $\sigma$-field generated by the dyadic intervals of rank $n$, (35.10) gives $X_n$. In this case $\mathscr{F}_n \uparrow \mathscr{F}_\infty = \mathscr{F}$. For each $\omega$ and $n$ choose the dyadic rationals $a_n(\omega) = k2^{-n}$ and $b_n(\omega) = (k+1)2^{-n}$ for which $a_n(\omega) < \omega \le b_n(\omega)$. By Theorem 35.8, if $F$ is the distribution function for $\mu$, then

(35.25)  $\dfrac{F(b_n(\omega)) - F(a_n(\omega))}{b_n(\omega) - a_n(\omega)} \to X(\omega)$

except on a set of Lebesgue measure 0.

According to Theorem 31.2, $F$ has a derivative $F'$ except on a set of Lebesgue measure 0, and since the intervals $(a_n(\omega), b_n(\omega)]$ contract to $\omega$, the difference ratio (35.25) converges almost everywhere to $F'(\omega)$ (see (31.8)). This identifies $X$. Since (35.25) involves intervals $(a_n(\omega), b_n(\omega)]$ of a special kind, it does not quite imply Theorem 31.2.

By Theorem 35.8, $X = F'$ is the density for $\mu$ in the absolutely continuous case, and $X = F' = 0$ (except on a set of Lebesgue measure 0) in the singular case, facts proved in a different way in Section 31. The singular case gives another example where $E[X_n] \to E[X]$ fails in Theorem 35.4. ∎
Likelihood Ratios
Return to Example 35.4: $\mu = Q$ is a probability measure, $\mathscr{F}_n = \sigma(Y_1, \ldots, Y_n)$ for random variables $Y_n$, and the Radon-Nikodym derivative or likelihood ratio $X_n$ has the form (35.11) for densities $p_n$ and $q_n$ on $R^n$. By Theorem 35.8 the $X_n$ converge to some $X$ which is integrable and measurable $\mathscr{F}_\infty = \sigma(Y_1, Y_2, \ldots)$.

If the $Y_n$ are independent under $P$ and under $Q$, and if the densities are different, then $P$ and $Q$ are mutually singular on $\sigma(Y_1, Y_2, \ldots)$, as shown in Example 35.4. In this case $X = 0$: the likelihood ratio converges to 0 on a set of $P$-measure 1. The statistical relevance of this is that the smaller $X_n$ is, the more strongly one prefers $P$ over $Q$ as an explanation of the observation $(Y_1, \ldots, Y_n)$, and $X_n$ goes to 0 with probability 1 if $P$ is in fact the measure governing the $Y_n$.

It might be thought that a disingenuous experimenter could bias his results by stopping at an $X_n$ he likes: a large value if his prejudices favor $Q$, a small value if they favor $P$. This is not so, as the following analysis shows. For this argument $P$ must dominate $Q$ on each $\mathscr{F}_n = \sigma(Y_1, \ldots, Y_n)$, but the likelihood ratio $X_n$ need not have any special form.

Let $\tau$ be a positive-integer-valued random variable representing the experimenter's stopping time. Assume that $[\tau = n] \in \mathscr{F}_n$, which, like the analogous assumption (35.18) for the gambler's stopping time, excludes prevision on the part of the experimenter. Let $\mathscr{F}_\tau$ be the class of sets $G$ for which

(35.26)  $G \cap [\tau = n] \in \mathscr{F}_n$,  $n \ge 1$.

Now $\mathscr{F}_\tau$ is a $\sigma$-field. It is to be interpreted as representing the information the experimenter has when he stops: to know $(Y_1(\omega), \ldots, Y_{\tau(\omega)}(\omega))$ is the same thing as to know for each $G$ satisfying (35.26) whether or not it contains $\omega$. In fact, $\mathscr{F}_\tau$ is generated by the sets $[\tau = n, (Y_1, \ldots, Y_n) \in H]$ for $n \ge 1$ and $H \in \mathscr{R}^n$.
What is to be shown is that $X_\tau$ is the likelihood ratio (Radon-Nikodym derivative) for $Q$ with respect to $P$ on $\mathscr{F}_\tau$. If $H \in \mathscr{R}^1$, then $[X_\tau \in H] = \bigcup_{n=1}^\infty([\tau = n] \cap [X_n \in H])$ lies in $\mathscr{F}_\tau$, so that $X_\tau$ is measurable $\mathscr{F}_\tau$. Moreover, if $G$ satisfies (35.26), then

$$\int_GX_\tau\,dP = \sum_{n=1}^\infty\int_{G\cap[\tau=n]}X_n\,dP = \sum_{n=1}^\infty Q\big(G\cap[\tau=n]\big) = Q(G),$$

as required.

Bayes Estimation
Let $\theta, Y_1, Y_2, \ldots$ be random variables such that $0 \le \theta \le 1$ and, conditionally on $\theta$, the $Y_n$ are independent and assume the values 1 and 0 with probabilities $\theta$ and $1 - \theta$: for $u_1, \ldots, u_n$ a sequence of 0's and 1's,

(35.27)  $P[Y_1 = u_1, \ldots, Y_n = u_n\|\theta] = \theta^s(1-\theta)^{n-s},$

where $s = u_1 + \cdots + u_n$.

To see that such sequences exist, let $\theta, Z_1, Z_2, \ldots$ be independent random variables, where $\theta$ has an arbitrarily prescribed distribution supported by $(0,1)$ and the $Z_n$ are uniformly distributed over $(0,1)$. Put $Y_n = I_{[Z_n\le\theta]}$. If, for instance, $f(x) = x(1-x) = P[Z_1 \le x, Z_2 > x]$, then $P[Y_1 = 1, Y_2 = 0\|\theta](\omega) = f(\theta(\omega))$ by (33.13). The obvious extension establishes (35.27).

From the Bayes point of view in statistics, $\theta$ is a parameter governed by some a priori distribution known to the statistician. For given $Y_1, \ldots, Y_n$ the Bayes estimate of $\theta$ is $E[\theta\|Y_1, \ldots, Y_n]$. The problem is to show that this estimate is consistent in the sense that

(35.28)  $E[\theta\|Y_1, \ldots, Y_n] \to \theta$

with probability 1.

By Theorem 35.5, $E[\theta\|Y_1, \ldots, Y_n] \to E[\theta\|\mathscr{F}_\infty]$, where $\mathscr{F}_\infty = \sigma(Y_1, Y_2, \ldots)$, and so what must be shown is that $E[\theta\|\mathscr{F}_\infty] = \theta$ with probability 1. By an elementary argument that parallels the unconditional case, it follows from (35.27) that for $S_n = Y_1 + \cdots + Y_n$, $E[S_n\|\theta] = n\theta$ and $E[(S_n - n\theta)^2\|\theta] = n\theta(1-\theta)$. Hence $E[(n^{-1}S_n - \theta)^2] \le n^{-1}$, and by Chebyshev's inequality, $n^{-1}S_n$ converges in probability to $\theta$. Therefore (Theorem 20.5), $\lim_kn_k^{-1}S_{n_k} = \theta$ with probability 1 for some subsequence. Thus $\theta = \theta'$ with probability 1 for a $\theta'$ measurable $\mathscr{F}_\infty$, and $E[\theta\|\mathscr{F}_\infty] = E[\theta'\|\mathscr{F}_\infty] = \theta' = \theta$ with probability 1. ∎
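A simulation sketch of (35.28) for one particular prior (my choice: uniform on $(0,1)$, for which the posterior mean has the well-known closed form $(S_n + 1)/(n + 2)$, Laplace's rule of succession): the Bayes estimate tracks the realized $\theta$.

```python
# Consistency of the Bayes estimate E[theta || Y_1,...,Y_n] under a
# uniform prior: posterior mean = (S_n + 1)/(n + 2).
import random

random.seed(5)
theta = random.random()          # theta drawn from the uniform prior
n = 20000
s = sum(1 for _ in range(n) if random.random() <= theta)  # S_n successes
bayes = (s + 1) / (n + 2)
print(abs(bayes - theta) < 0.05)  # True: the estimate has converged to theta
```

The tolerance is generous: by the Chebyshev bound in the text the error is of order $n^{-1/2} \approx 0.007$ here.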
A Central Limit Theorem*
Suppose $X_1, X_2, \ldots$ is a martingale relative to $\mathscr{F}_1, \mathscr{F}_2, \ldots$, and put $Y_n = X_n - X_{n-1}$, so that

(35.29)  $E[Y_n\|\mathscr{F}_{n-1}] = 0.$

View $Y_n$ as the gambler's gain on the $n$th trial in a fair game. For example, if $\Delta_1, \Delta_2, \ldots$ are independent and have mean 0, $\mathscr{F}_n = \sigma(\Delta_1, \ldots, \Delta_n)$, $W_n$ is measurable $\mathscr{F}_{n-1}$, and $Y_n = W_n\Delta_n$, then (35.29) holds (see (35.17)). A specialization of this case shows that $X_n = \sum_{k=1}^nY_k$ need not be approximately normally distributed for large $n$.

Example 35.11. Suppose that $\Delta_n$ takes the values $\pm 1$ with probability $\frac12$ each and $W_1 = 0$, and suppose that $W_n = 1$ for $n \ge 2$ if $\Delta_1 = 1$, while $W_n = 2$ for $n \ge 2$ if $\Delta_1 = -1$. If $S_n = \Delta_2 + \cdots + \Delta_n$, then $X_n$ is $S_n$ or $2S_n$ according as $\Delta_1$ is $+1$ or $-1$. Since $S_n/\sqrt n$ has approximately the standard normal distribution, the approximate distribution of $X_n/\sqrt n$ is a mixture, with equal weights, of the centered normal distributions with standard deviations 1 and 2. ∎
To understand this phenomenon, consider conditional variances. Suppose for simplicity that the $Y_n$ are bounded, and define

(35.30)  $\sigma_n^2 = E[Y_n^2\|\mathscr{F}_{n-1}].$

For large $t$ consider the random variables

(35.31)  $\nu_t = \min\Big\{n: \sum_{k=1}^n\sigma_k^2 \ge t\Big\}.$

Under appropriate conditions, $X_{\nu_t}/\sqrt t$ will be approximately normally distributed for large $t$. Consider the preceding example. Roughly: if $\Delta_1 = +1$, then $\sum_{k=1}^n\sigma_k^2 = n - 1$, and so $\nu_t \approx t$ and $X_{\nu_t}/\sqrt t \approx S_t/\sqrt t$; if $\Delta_1 = -1$, then $\sum_{k=1}^n\sigma_k^2 = 4(n-1)$, and so $\nu_t \approx t/4$ and $X_{\nu_t}/\sqrt t \approx 2S_{t/4}/\sqrt t = S_{t/4}/\sqrt{t/4}$. In either case, $X_{\nu_t}/\sqrt t$ approximately follows the normal law. If the $n$th play takes $\sigma_n^2$ units of time, then $\nu_t$ is essentially the number of plays that take place during the first $t$ units of time. This change of the time scale stabilizes the rate at which money changes hands.
* This topic, which requires the limit theory of Chapter 5, may be omitted.
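The effect of the time change in Example 35.11 can be demonstrated by simulation (parameters below are my own; for the weight $w \in \{1, 2\}$ the time change amounts to running about $t/w^2$ plays, so the variance of $X_{\nu_t}/\sqrt t$ is restored to 1, while $X_n/\sqrt n$ keeps variance $E[W^2] = 2.5$).

```python
# Example 35.11: Y_k = W * Delta_k with W in {1, 2} fixed by Delta_1.
# Time-changed normalization X_{nu_t}/sqrt(t) vs. plain X_n/sqrt(n).
import random

random.seed(7)

def walk(steps):
    return sum(random.choice((-1, 1)) for _ in range(steps))

t, reps = 256, 4000
tc, plain = [], []
for _ in range(reps):
    w = 1 if random.random() < 0.5 else 2      # the weight chosen by Delta_1
    tc.append(w * walk(t // (w * w)) / t**0.5)  # nu_t ~ t/w^2 plays
    plain.append(w * walk(t) / t**0.5)          # no time change
var_tc = sum(x * x for x in tc) / reps
var_plain = sum(x * x for x in plain) / reps
print(0.85 < var_tc < 1.15)  # True: unit variance after the time change
print(var_plain > 1.8)       # True: mixture variance E[W^2] = 2.5 without it
```

Beyond matching variances, the time-changed sum is close to standard normal in distribution, which is the content of Theorem 35.9.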
Suppose the Yn = Xn - Xn - t are uniformly bounded and satisfy (35.29), and assume that Eno; = oo with probability 1. Then X.,j{i Theorem 35.9. � N.
Since Eon2 =
Jl1 is well defined by (35.31). PROOF. Let s; = Ek_ 1of , and let c, be that number such that 0 < c, � 1 and s; _ 1 + c,2a,2 = t. Let K bound the I Ynl · Since I Y, - c,Y., I < K, it is I
oo ,
enough to prove
I
I
( 35 .32 )
•, - 1
t - 1 /2 E
k-1
yk + c,Y.,I
Fix t and define random variables
=>
I
N.
Z1 , Z2 , • • • by if
.,
,
= k'
if .,, < k.
The left side of (35.32) is EC:_ 1 ZJJ{i. The notation does not reflect the dependence of the Zk on t. Since of is measurable .?Fk _ 1 , the events [P1 > k], [P1 = k], and [J11 < k] all lie in .?Fk _ 1 • If x > 0, the event 1£.,� - k J c, > x is exactly [J11 = k] n [c, > x ] = [J11 = k] n [sf _ 1 + x 2of < t], and hence ll., � -k J c, is measurable .?Fk _ 1 • It follows that for A e .?Fk _ 1 ,
Therefore
and the Zk satisfy the same martingale-difference equation (35.29) that the Yk do. Furthermore,
is measurable !Fk .
SECTION
35.
MARTINGALES
Now define Tf = E[Zf ll §"k _ 1 ]. Suppose that is measurable !Fk _ 1 ,
499
A e .Fk _ 1 . Since 1£,, - k J c;
Further,
f
A
Zi dP = 0. () [ •, < k ]
It follows that
( 35 .33)
0k2 7"k2 - C 20k2 I
0
if
,,
if if
,, ,,
> k' = k'
< k.
Introduce random variables N1 , N2 , , each with the standard normal distribution, that are ind9'endent of each other and of the a-field !F00 generated by U n�· The introduction of these new random variables entails no loss of generality: define them on a second probability space, and replace the original probability space (the one on which the Yk are defined) by its product with this second space. For n � 0 define •
•
•
The first step is to prove that the conditional characteristic function of T0j{i •
IS
(35 .34 )
DERIVATIVES AND CONDITIONAL PROBABILITY
are measurable j&'oo and the Nn are independent of -F00, the left-hand equality follows from (34.4): truncate the sum in T0 to E';_ 1 ,.k Nk , apply (34.4), then let m --+ oo and use Theorem 34.2(v). The right-hand equation in (35.34) follows from (35.33) and the definitions of ,t and ct . Therefore
Since the
Tn
(35 .35) Since the left side of (35.32) is 1',/ {i, the problem is to prove that E [ eisT.,Iv'i] --+ e - s2 12 , and this will follow by (35.35) if
I E [ eisT., Ift ]
_
E
[ eisTolft ] I = E <
[ [, ( eisT.;{t
_
n -= 1
00
eisT. -tlft )
l
E I E [ eisT.;{t eisT.-tift ] 1 -+ 0 _
n= l
as t --+
oo .
Write

    e^{isT_n/√t} − e^{isT_{n−1}/√t} = α_{nt}β_{nt}γ_{nt},

where

    α_{nt} = exp((is/√t)Σ_{k=1}^{n−1} Z_k),    β_{nt} = e^{isZ_n/√t} − e^{isτ_n N_n/√t},    γ_{nt} = exp((is/√t)Σ_{k=n+1}^∞ τ_k N_k).

If G_n is the σ-field generated by F_n ∪ σ(N_1, ..., N_n), then

    E[γ_{nt} || G_n] = exp(−(s²/2t)Σ_{k=n+1}^∞ τ_k²) = γ'_{nt}

follows by the same argument that gives (35.34): condition first on the σ-field generated by F_∞ ∪ σ(N_1, ..., N_n), and note that Σ_{k=n+1}^∞ τ_k² = t − Σ_{k=1}^n τ_k² is measurable F_{n−1}. Therefore γ'_{nt} as well as α_{nt} is measurable G_{n−1}, and so

    |E[e^{isT_n/√t} − e^{isT_{n−1}/√t}]| = |E[E(α_{nt}β_{nt}γ_{nt} || G_n)]| = |E[α_{nt}β_{nt}γ'_{nt}]| = |E[α_{nt}γ'_{nt}E(β_{nt} || G_{n−1})]| ≤ E[|E(β_{nt} || G_{n−1})|].

Since E[Z_n || G_{n−1}] = E[τ_n N_n || G_{n−1}] = 0 and E[Z_n² || G_{n−1}] = E[τ_n²N_n² || G_{n−1}] = τ_n², (26.4₂) gives

    E[|E(β_{nt} || G_{n−1})|] ≤ |s|³t^{−3/2}E[|Z_n|³ + |τ_n N_n|³].

If K' = E[|N_n|³] (= √(8/π)) and K bounds the |Y_n|, then

    E[|E(β_{nt} || G_{n−1})|] ≤ |s|³t^{−3/2}E[|Z_n|³ + K'|τ_n|³] ≤ K|s|³t^{−3/2}E[Z_n² + K'τ_n²].

Since E[Z_n²] = E[τ_n²], the entire problem reduces to proving that t^{−3/2}Σ_{n=1}^∞ E[τ_n²] → 0. But Σ_{n=1}^∞ τ_n² = t by (35.33), so the sum above is of order t^{−1/2}, which completes the proof. ∎
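The stopping construction just analyzed can be imitated in a simulation. The sketch below is hypothetical and not from the text: the conditional scale a (and hence σ_k²) is an arbitrary F_{k−1}-measurable function invented for illustration, the differences Y_k are ±a with equal probability, and the stopped sum with the last term weighted by c_t is checked for approximate standard normality after division by √t.

```python
import math, random

# Hypothetical sketch (not from the text) of the stopping construction:
# accumulate martingale differences Y_k with conditional variances
# sigma_k^2 until the total conditional variance reaches t, weighting the
# last difference by c_t so the tau_k^2 of (35.33) sum exactly to t.
def stopped_sum(t, rng):
    s2, total = 0.0, 0.0
    while True:
        a = 1.0 + 0.5 * math.sin(s2)   # invented F_{k-1}-measurable scale
        sigma2 = a * a                 # E[Y_k^2 || F_{k-1}] for Y_k = +/- a
        if s2 + sigma2 >= t:           # nu_t is reached at this step
            c2 = (t - s2) / sigma2     # c_t^2 lies in [0, 1)
            y = a if rng.random() < 0.5 else -a
            return total + math.sqrt(c2) * y
        total += a if rng.random() < 0.5 else -a
        s2 += sigma2

rng = random.Random(7)
t = 400.0
samples = [stopped_sum(t, rng) / math.sqrt(t) for _ in range(4000)]
mean = sum(samples) / len(samples)
var = sum(x * x for x in samples) / len(samples) - mean ** 2
# the normalized sums should be approximately standard normal
print(round(mean, 2), round(var, 2))
```

The sample mean and variance come out near 0 and 1, and roughly 68 percent of the normalized sums fall in [−1, 1], as the theorem predicts.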
PROBLEMS

35.1. Suppose that ξ_1, ξ_2, ... are independent random variables with mean 0. Let X_1 = ξ_1 and X_{n+1} = X_n + ξ_{n+1}f_n(X_1, ..., X_n), and suppose that the X_n are integrable. Show that {X_n} is a martingale. The martingales of gambling have this form.

35.2. Let Y_1, Y_2, ... be independent random variables with mean 0 and variance σ². Let X_n = (Σ_{k=1}^n Y_k)² − nσ² and show that {X_n} is a martingale.

35.3. Suppose that {Y_n} is a finite-state Markov chain with transition matrix [p_ij]. Suppose that Σ_j p_ij x(j) = λx(i) for all i (the x(i) are the components of a right eigenvector of the transition matrix). Put X_n = λ^{−n}x(Y_n) and show that {X_n} is a martingale.

35.4. Suppose that Y_1, Y_2, ... are independent, positive random variables and that E[Y_n] = 1. Put X_n = Y_1 ··· Y_n.
(a) Show that {X_n} is a martingale and converges with probability 1 to an integrable X.
(b) Suppose specifically that Y_n assumes the values 1/2 and 3/2 with probability 1/2 each. Show that X = 0 with probability 1. This gives an example where E[Π_{n=1}^∞ Y_n] ≠ Π_{n=1}^∞ E[Y_n] for independent, integrable, positive random variables. Show, however, that E[Π_{n=1}^∞ Y_n] ≤ Π_{n=1}^∞ E[Y_n] always holds.
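Problem 35.4(b) can be watched happening in a short simulation (an illustration, not part of the text; the path length and sample count are arbitrary choices):

```python
import random

# Simulation of Problem 35.4(b): Y_n is 1/2 or 3/2 with probability 1/2
# each, and X_n = Y_1 ... Y_n.  E[X_n] = 1 for every n, yet X_n -> 0 with
# probability 1, since E[log Y_n] = (1/2) log(3/4) < 0.
rng = random.Random(0)

def final_product(n):
    x = 1.0
    for _ in range(n):
        x *= 0.5 if rng.random() < 0.5 else 1.5
    return x

paths = [final_product(200) for _ in range(1000)]
frac_tiny = sum(1 for x in paths if x < 0.01) / len(paths)
print(frac_tiny)                 # nearly all paths are already below .01
print(sum(paths) / len(paths))   # far below E[X_200] = 1
```

The sample average is far below 1 even though E[X_n] = 1 for every n: the expectation is carried by paths too rare to appear in a simulation of this size, which is exactly why one cannot integrate to the limit here.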
35.5. Suppose that X_1, X_2, ... is a martingale satisfying E[X_1] = 0 and E[X_n²] < ∞. Show that E[(X_{n+r} − X_n)²] = Σ_{k=1}^r E[(X_{n+k} − X_{n+k−1})²] (the variance of the sum is the sum of the variances). Assume that Σ_n E[(X_n − X_{n−1})²] < ∞ and prove that X_n converges with probability 1. Do this first by Theorem 35.4 and then (see Theorem 22.6) by Theorem 35.2.

35.6. Show that a submartingale X_n can be represented as X_n = Y_n + Z_n, where {Y_n} is a martingale and 0 ≤ Z_1 ≤ Z_2 ≤ ···. Hint: Take X_0 = 0 and Δ_n = X_n − X_{n−1}, and define Z_n = Σ_{k=1}^n E[Δ_k || F_{k−1}] (the summands are nonnegative).

35.7. If X_1, X_2, ... is a martingale and bounded either above or below, then sup_n E[|X_n|] < ∞.

35.8. ↑ Let X_n = Δ_1 + ··· + Δ_n, where the Δ_n are independent and assume the values ±1 with probability 1/2 each. Let T be the smallest n such that X_n = 1, and define X_n* by (35.19). Show that the hypotheses of Theorem 35.4 are satisfied by {X_n*} but that it is impossible to integrate to the limit. Hint: Use (7.8) and Problem 35.7.

35.9. Let X_1, X_2, ... be a martingale, and assume that |X_1(ω)| and |X_n(ω) − X_{n−1}(ω)| are bounded by a constant independent of ω and n. Let T be a stopping time (see (35.18)) with finite mean. Show that X_T is integrable and that E[X_T] = E[X_1].

35.10. 35.8 35.9 ↑ Use the preceding result to show that the T in Problem 35.8 has infinite mean. Thus the waiting time until a symmetric random walk moves one step up from the starting point has infinite expected value.

35.11. Let X_1, X_2, ... be a Markov chain with countable state space S and transition probabilities p_ij. A function φ on S is excessive or superharmonic if φ(i) ≥ Σ_j p_ij φ(j). Show by martingale theory that φ(X_n) converges with probability 1 if φ is bounded and excessive. Deduce from this that if the chain is irreducible and persistent, then φ must be constant. Compare Problem 8.33.
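The decomposition in Problem 35.6 can be computed exactly in a small case (a sketch, not from the text; the walk and horizon are invented for illustration):

```python
from itertools import product

# Problem 35.6 made concrete: for the submartingale X_n = S_n^2, with S_n
# a symmetric +/-1 random walk, E[X_n - X_{n-1} || F_{n-1}] =
# E[2 S_{n-1} d_n + 1 || F_{n-1}] = 1, so the increasing part is Z_n = n
# and Y_n = S_n^2 - n is the martingale part.  Exact check of E[Y_n] = 0
# by enumerating every path.
n = 10
total = 0
for steps in product((-1, 1), repeat=n):
    s = sum(steps)
    total += s * s - n          # Y_n along this path
mean_Y = total / 2 ** n         # exact expectation over all 2^n paths
print(mean_Y)                   # 0.0 = E[Y_n] = E[Y_0]
```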
35.12. ↑ A function φ on the integer lattice in R^k is superharmonic if for each lattice point x, φ(x) ≥ (2k)^{−1}Σ_y φ(y), the sum extending over the 2k nearest neighbors y. Show for k = 1 and k = 2 that a bounded superharmonic function is constant. Show for k ≥ 3 that there exist nonconstant bounded superharmonic functions.

35.13. 32.7 32.9 ↑ Let (Ω, F, P) be a probability space, let ν be a finite measure on F, and suppose that F_1 ⊂ F_2 ⊂ ··· ⊂ F. For n ≤ ∞, let X_n be the Radon-Nikodym derivative with respect to P of the absolutely continuous part of ν when P and ν are both restricted to F_n. The problem is to extend Theorem 35.8 by showing that X_n → X_∞ with probability 1.
(a) For n < ∞, let

    ν(A) = ∫_A X_n dP + σ_n(A),    A ∈ F_n,

be the decomposition of ν into absolutely continuous and singular parts with respect to P on F_n. Show that X_1, X_2, ... is a supermartingale and converges with probability 1.
(b) Let

    ν(A) = ∫_A X_∞ dP + σ_∞(A),    A ∈ F_∞,

be the decomposition of ν into absolutely continuous and singular parts with respect to P on F_∞. Let Y_n = E[X_∞ || F_n], let Z_n be the Radon-Nikodym derivative with respect to P of the absolutely continuous part of σ_∞ restricted to F_n, and prove that

    ν(A) = ∫_A Y_n dP + σ_∞(A),    A ∈ F_n.

Conclude that Y_n + Z_n = X_n with probability 1. Since Y_n converges to X_∞, Z_n converges with probability 1 to some Z. Show that ∫_A Z dP ≤ σ_∞(A) for A ∈ F_∞, and conclude that Z = 0 with probability 1.

35.14. (a) Show that {X_n} is a martingale with respect to {F_n} if and only if E[X_n || F_T] = X_T for all n and all stopping times T such that T ≤ n, where F_T is the class of sets A such that A ∩ [T = k] ∈ F_k for all k (see (35.26)).
(b) Show that, if {X_n} is a martingale and T is a bounded stopping time, then E[X_T] = E[X_1].

35.15. 31.9 ↑ Suppose that F_n ↑ F_∞ and A ∈ F_∞, and prove that P[A || F_n] → I_A with probability 1. Compare Lebesgue's density theorem.

35.16. 21.13 ↑ Suppose that X_1, ..., X_n is a martingale and that the |X_k|^r are integrable, r > 1. If M = max_{k≤n} |X_k|, then P[M ≥ t] ≤ t^{−1}∫_{[M≥t]} |X_n| dP, as follows by the proof of Theorem 35.2. Show that E[M^r] ≤ (r/(r−1))^r E[|X_n|^r].

35.17. (a) Extend Theorem 35.5: Replace the condition F_n ↑ F_∞ by the condition that each set in F_∞ differs by a set of measure 0 from some set in σ(∪_n F_n), and conversely.
(b) Extend Theorem 35.7 analogously.

35.18. Theorems 35.5 and 35.7 have analogues in Hilbert space. For n ≤ ∞, let P_n be the perpendicular projection on a subspace M_n. Then P_n x → P_∞ x for all x if either (a) M_1 ⊂ M_2 ⊂ ··· and M_∞ is the closure of ∪_{n<∞} M_n, or (b) M_1 ⊃ M_2 ⊃ ··· and M_∞ = ∩_{n<∞} M_n.

35.19. Suppose that θ has an arbitrary distribution and that, conditionally on θ, the random variables Y_1, Y_2, ... are independent and normally distributed with mean θ and variance σ². Construct such a sequence {θ, Y_1, Y_2, ...}. Prove (35.28).

35.20. It is shown on p. 495 that optional stopping has no effect on likelihood ratios. This is not true of tests of significance. Suppose that X_1, X_2, ... are independent and identically distributed and assume the values 1 and 0 with probabilities p and 1 − p. Consider the null hypothesis that p = 1/2 and the alternative that p > 1/2.
The usual .05-level test of significance is to reject the null hypothesis if

(35.36)    (Σ_{k=1}^n X_k − n/2)/((1/2)√n) > 1.645.

For this test the chance of falsely rejecting the null hypothesis is approximately P[N > 1.645] = .05 if n is large and fixed. Suppose that n is not fixed in advance of sampling, and show by the law of the iterated logarithm that, even if p is, in fact, 1/2, there are with probability 1 infinitely many n for which (35.36) holds.

35.21. 27.14 ↑ In addition to the hypotheses of Theorem 35.9, assume that ν_t/t → 1. Use Theorem 35.2 and the ideas of Problem 27.14 to show that X_n/√n ⇒ N.
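The optional-stopping phenomenon of Problem 35.20 shows up quickly in simulation. This is a hedged sketch assuming (35.36) takes the standard normal-approximation form (2(X_1+···+X_n) − n)/√n > 1.645; the horizon and trial counts are arbitrary:

```python
import math, random

# Monte Carlo sketch of Problem 35.20.  With n fixed the rejection
# probability is about .05, but a "sample until significant" rule rejects
# far more often; by the LIL, an unbounded horizon would reject
# eventually with probability 1 even when p = 1/2.
rng = random.Random(1)

def rejects_by(horizon):
    s = 0
    for n in range(1, horizon + 1):
        s += rng.randrange(2)                  # a fair 0/1 coin
        if (2 * s - n) / math.sqrt(n) > 1.645:
            return True
    return False

trials = 400
frac = sum(rejects_by(2000) for _ in range(trials)) / trials
print(frac)   # well above the nominal .05
```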
Amarts. In the remaining problems of this section, X_1, X_2, ... are integrable random variables, X_n is measurable F_n, F_1 ⊂ F_2 ⊂ ···, and F_∞ = σ(∪_{n=1}^∞ F_n). Let T be the class of bounded stopping times: τ ∈ T if τ is a random variable whose range is a finite set of positive integers and if [τ = n] ∈ F_n for n ≥ 1. If τ is bounded by m, then E[|X_τ|] ≤ Σ_{k=1}^m E[|X_k|], and so X_τ is always integrable. Let T_n consist of the τ in T for which τ(ω) ≥ n for all ω. The sequence {X_n} is of course uniformly bounded if sup_{n,ω} |X_n(ω)| < ∞. It is called L¹-bounded if sup_n E[|X_n|] < ∞. Maxima and minima will be denoted by ∨ and ∧.

The sequence is an amart (asymptotic martingale) if for every sequence {τ_n} of stopping times such that τ_n ∈ T_n, lim_n E[X_{τ_n}] exists and is finite. Since E[X_τ] = E[X_1] for τ ∈ T in the case of a martingale (see Problem 35.14(b)), every martingale is an amart. The following arguments, which do not depend on martingale theory, show that an L¹-bounded amart converges with probability 1.

35.22. (a) If {X_n} is an amart, the limits lim_n E[X_{τ_n}] (τ_n ∈ T_n) all have a common value c, lim_n sup_{τ∈T_n} |E[X_τ] − c| = 0, and sup_{τ∈T} |E[X_τ]| < ∞.
(b) Show that {X_n} is an amart if lim_n E[X_{τ_n}] exists and is finite for each sequence {τ_n} such that τ_n ∈ T and τ_n ≥ τ_{n−1}.

35.23. ↑ Define Y to be a cluster random variable if Y is measurable F_∞ and if for each ω, Y(ω) is a limit point (the limit of a convergent subsequence) of X_1(ω), X_2(ω), .... By the following steps show that, if Y is a cluster random variable, then there are stopping times τ_n such that τ_n ∈ T_n and X_{τ_n} → Y with probability 1. (This does not require the amart property.)
(a) Choose for each k an integer n_k ≥ k and an F_{n_k}-measurable random variable Y_k such that P[|Y − Y_k| ≥ 1/k] ≤ 1/k².
(b) Choose an integer q_k ≥ n_k such that P[|X_n − Y| ≥ 1/k for n_k ≤ n ≤ q_k] ≤ 1/k².
(c) Let τ_k be the smallest n in the range n_k ≤ n ≤ q_k for which |X_n − Y_k| < 2/k (τ_k = q_k if there are none), and show that X_{τ_k} → Y with probability 1.

35.24. ↑ Prove that a uniformly bounded amart converges with probability 1. Hint: Apply the result in Problem 35.23 to lim sup_n X_n and lim inf_n X_n.

The remaining problems show how to extend this last result from the uniformly bounded to the L¹-bounded case.
35.25. ↑ Suppose that {X_n} is an L¹-bounded amart and show that (a) {X_n⁺} is an L¹-bounded amart; and (b) {X_n ∨ a} and {X_n ∧ a} are L¹-bounded amarts.

35.26. ↑ Prove the maximal inequality P[sup_n |X_n| ≥ α] ≤ α^{−1} sup_{τ∈T} E[|X_τ|], α > 0. (This does not require the amart property.)

35.27. ↑ Let {X_n} be an L¹-bounded amart, and define

    X_n^{(α)} = −α if X_n < −α,    X_n^{(α)} = X_n if −α ≤ X_n ≤ α,    X_n^{(α)} = α if α < X_n.

(a) Show that {X_n^{(α)}} is a uniformly bounded amart.
(b) Show that lim_{α→∞} P[sup_n |X_n| ≥ α] = 0.
(c) Show that {X_n} converges with probability 1.

35.28. ↑ As noted above, every martingale is an amart. (a) Show that a submartingale is an amart if E[X_n] is bounded. (b) Find an amart that is not a martingale or even a submartingale.
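The fact behind the amart definition, Problem 35.14(b), can be verified exactly by path enumeration (an illustration, not from the text; the walk and the particular bounded stopping time are invented):

```python
from itertools import product

# For a martingale and a *bounded* stopping time T, E[X_T] = E[X_1].
# Exact check for the symmetric random walk X_n = d_1 + ... + d_n with
# T = the first n <= 5 such that X_n = 1 (T = 5 if there is no such n).
m = 5
total = 0
for steps in product((-1, 1), repeat=m):
    x = 0
    for d in steps:
        x += d
        if x == 1:          # stop the first time the walk reaches 1
            break
    total += x              # x is X_T on this path; each path has prob 2^-m
print(total / 2 ** m)       # 0.0, in agreement with E[X_1] = 0
```

An unbounded stopping time would break this: the same rule with no horizon gives X_T = 1 identically, the phenomenon of Problems 35.8 and 35.10.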
CHAPTER 7
Stochastic Processes
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM

Stochastic Processes
A stochastic process is a collection [X_t: t ∈ T] of random variables on a probability space (Ω, F, P). The sequence of gambler's fortunes in Section 7, the sequences of independent random variables in Section 22, the queueing process of Section 24, the martingales of Section 35: all these are stochastic processes for which T = {1, 2, ...}. For the Poisson process [N_t: t ≥ 0] of Section 23, T = [0, ∞). For all these processes the points of T are thought of as representing time. In most cases, T is the set of integers and time is discrete, or else T is an interval of the line and time is continuous. For the general theory of this section, however, T can be quite arbitrary.

Finite-Dimensional Distributions

A process is usually described in terms of the distributions it induces in Euclidean spaces. For each k-tuple (t_1, ..., t_k) of distinct elements of T, the random vector (X_{t_1}, ..., X_{t_k}) has over R^k some distribution μ_{t_1...t_k}:

(36.1)    μ_{t_1...t_k}(H) = P[(X_{t_1}, ..., X_{t_k}) ∈ H],    H ∈ ℛ^k.

These probability measures μ_{t_1...t_k} are the finite-dimensional distributions of the stochastic process [X_t: t ∈ T]. The system of finite-dimensional distributions does not completely determine the properties of the process. For example, the Poisson process [N_t: t ≥ 0] as defined by (23.5) has sample paths (functions N_t(ω) with ω fixed and t varying) that are step functions. But (23.28) defines a process which has the same finite-dimensional distributions and has sample paths that are not step functions. Nevertheless, the first step in a general theory is to construct processes for given systems of finite-dimensional distributions.

Now (36.1) implies two consistency properties of the system μ_{t_1...t_k}. Suppose the H in (36.1) has the form H = H_1 × ··· × H_k (H_i ∈ ℛ¹), and consider a permutation π of (1, 2, ..., k). Since [(X_{t_1}, ..., X_{t_k}) ∈ (H_1 × ··· × H_k)] and [(X_{t_{π1}}, ..., X_{t_{πk}}) ∈ (H_{π1} × ··· × H_{πk})] are the same event, it follows by (36.1) that

(36.2)    μ_{t_{π1}...t_{πk}}(H_{π1} × ··· × H_{πk}) = μ_{t_1...t_k}(H_1 × ··· × H_k).

For example, if μ_{s,t} = μ × ν, then necessarily μ_{t,s} = ν × μ. The second consistency condition is

(36.3)    μ_{t_1...t_{k−1}}(H_1 × ··· × H_{k−1}) = μ_{t_1...t_k}(H_1 × ··· × H_{k−1} × R¹).

This is clear because (X_{t_1}, ..., X_{t_{k−1}}) lies in H_1 × ··· × H_{k−1} if and only if (X_{t_1}, ..., X_{t_{k−1}}, X_{t_k}) lies in H_1 × ··· × H_{k−1} × R¹.

These conditions can be stated in a slightly different way. Define φ_π: R^k → R^k by

    φ_π(x_1, ..., x_k) = (x_{π^{−1}1}, ..., x_{π^{−1}k}).

φ_π applies the permutation π to the coordinates (for example, if π sends x_3 to first position, then π^{−1}1 = 3). Since φ_π^{−1}(H_1 × ··· × H_k) = H_{π1} × ··· × H_{πk}, it follows from (36.2) that μ_{t_{π1}...t_{πk}}(φ_π^{−1}H) = μ_{t_1...t_k}(H) for rectangles H. But then

(36.4)    μ_{t_{π1}...t_{πk}}φ_π^{−1} = μ_{t_1...t_k}.

Similarly, if φ: R^k → R^{k−1} is the projection φ(x_1, ..., x_k) = (x_1, ..., x_{k−1}), then (36.3) is the same thing as

(36.5)    μ_{t_1...t_k}φ^{−1} = μ_{t_1...t_{k−1}}.
The conditions (36.4) and (36.5) have a common extension. Suppose that (u_1, ..., u_m) is an m-tuple of distinct elements of T and that each element of (t_1, ..., t_k) is also an element of (u_1, ..., u_m). Then (t_1, ..., t_k) must be the initial segment of some permutation of (u_1, ..., u_m); that is, k ≤ m and there is a permutation π of (1, 2, ..., m) such that

    (u_{π1}, ..., u_{πm}) = (t_1, ..., t_k, t_{k+1}, ..., t_m),

where t_{k+1}, ..., t_m are the elements of (u_1, ..., u_m) that do not appear in (t_1, ..., t_k). Define ψ: R^m → R^k by

(36.6)    ψ(x_1, ..., x_m) = (x_{π1}, ..., x_{πk}).

ψ applies π to the coordinates and then projects onto the first k of them. Since ψ(X_{u_1}, ..., X_{u_m}) = (X_{t_1}, ..., X_{t_k}),

(36.7)    μ_{u_1...u_m}ψ^{−1} = μ_{t_1...t_k}.

This contains (36.4) and (36.5) as special cases, but as ψ is a coordinate permutation followed by a sequence of projections of the form (x_1, ..., x_l) → (x_1, ..., x_{l−1}), it is also a consequence of these special cases.

Measures μ_{t_1...t_k} coming from a process via (36.1) necessarily satisfy (36.2) and (36.3). The problem is to show conversely that, if (36.2) and (36.3) hold for a given system of measures, then there exists a process having these finite-dimensional distributions. Proving this theorem is the main objective of the section.
Product Spaces

The standard construction of the general process involves product spaces. Let T be an arbitrary index set, and let R^T be the collection of all real functions on T, that is, all maps from T into the real line. If T = {1, 2, ..., k}, a real function on T can be identified with a k-tuple (x_1, ..., x_k) of real numbers, and so R^T can be identified with k-dimensional Euclidean space R^k. If T = {1, 2, ...}, a real function on T is a sequence {x_1, x_2, ...} of real numbers. If T is an interval, R^T consists of all real functions, however irregular, on the interval.

Whatever the set T may be, an element of R^T will be denoted x. The value of x at t will be denoted x(t) or x_t, depending on whether x is viewed as a function of t with domain T or as a vector with components indexed by the elements t of T. Just as R^k can be regarded as the cartesian product of k copies of the real line, R^T can be regarded as a product space: a product of copies of the real line, one copy for each t in T.

For each t define a mapping Z_t: R^T → R¹ by

(36.8)    Z_t(x) = x(t) = x_t.

The Z_t are called the coordinate functions or projections. When later on a probability measure has been defined on R^T, the Z_t will be random variables, the coordinate variables. Frequently, the value Z_t(x) is instead denoted Z(t, x). If x is fixed, Z(·, x) is a real function on T and is, in fact, nothing other than x(·), that is, x itself. If t is fixed, Z(t, ·) is a real function on R^T and is identical with the function Z_t defined by (36.8).

There is a natural generalization to R^T of the idea of the σ-field of k-dimensional Borel sets. Let ℛ^T be the σ-field generated by all the coordinate functions Z_t, t ∈ T: ℛ^T = σ[Z_t: t ∈ T]. It is generated by the sets of the form

    [x ∈ R^T: Z_t(x) ∈ H] = [x ∈ R^T: x_t ∈ H]

for t ∈ T and H ∈ ℛ¹. If T = {1, 2, ..., k}, then ℛ^T coincides with ℛ^k.

Consider the class ℛ_0^T consisting of the sets of the form

(36.9)    A = [x ∈ R^T: (x_{t_1}, ..., x_{t_k}) ∈ H],

where k is an integer, (t_1, ..., t_k) is a k-tuple of distinct points of T, and H ∈ ℛ^k. Sets of this form, elements of ℛ_0^T, are called finite-dimensional sets, or cylinders. Of course, ℛ_0^T generates ℛ^T. Now ℛ_0^T is not a σ-field and does not coincide with ℛ^T (unless T is finite), but the following argument shows that it is a field.

The complement of (36.9) is R^T − A = [x ∈ R^T: (x_{t_1}, ..., x_{t_k}) ∈ R^k − H], and so ℛ_0^T is closed under complementation. Suppose that A is given by (36.9) and B is given by

(36.10)    B = [x ∈ R^T: (x_{s_1}, ..., x_{s_l}) ∈ I],

where I ∈ ℛ^l. Let (u_1, ..., u_m) be an m-tuple containing all the t_α and all the s_β. Now (t_1, ..., t_k) must be the initial segment of some permutation of (u_1, ..., u_m), and if ψ is as in (36.6) and H' = ψ^{−1}H, then H' ∈ ℛ^m and A is given by

(36.11)    A = [x ∈ R^T: (x_{u_1}, ..., x_{u_m}) ∈ H']
[Figure: If T is an interval, the cylinder [x ∈ R^T: α_1 < x(t_1) < β_1, α_2 < x(t_2) < β_2] consists of the functions that go through the two gates shown; y lies in the cylinder and z does not (they need not be continuous functions, of course).]
as well as by (36.9). Similarly, B can be put in the form

(36.12)    B = [x ∈ R^T: (x_{u_1}, ..., x_{u_m}) ∈ I'],

where I' ∈ ℛ^m. But then

(36.13)    A ∪ B = [x ∈ R^T: (x_{u_1}, ..., x_{u_m}) ∈ H' ∪ I'].

Since H' ∪ I' ∈ ℛ^m, A ∪ B is a cylinder. This proves that ℛ_0^T is a field such that ℛ^T = σ(ℛ_0^T).

The Z_t are measurable functions on the measurable space (R^T, ℛ^T). If P is a probability measure on ℛ^T, then [Z_t: t ∈ T] is a stochastic process on (R^T, ℛ^T, P), the coordinate-variable process.
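The field argument can be mirrored in a small computation. The model below is an invented finite toy (functions from a five-point T into {0,1}), not part of the text:

```python
from itertools import product

# A finite toy model of the field argument: points are functions from
# T = {0,...,4} to {0,1}, and a cylinder is a pair (indices, allowed
# restrictions).  Complements and unions are formed, as in
# (36.11)-(36.13), over a merged index tuple.
T = tuple(range(5))

def members(idx, allowed):
    """All points of {0,1}^T lying in the cylinder (idx, allowed)."""
    return {x for x in product((0, 1), repeat=len(T))
            if tuple(x[i] for i in idx) in allowed}

def lift(idx, allowed, bigidx):
    """Re-express the cylinder over the larger index tuple bigidx."""
    pos = {i: j for j, i in enumerate(bigidx)}
    return {y for y in product((0, 1), repeat=len(bigidx))
            if tuple(y[pos[i]] for i in idx) in allowed}

A = ((0, 2), {(0, 1), (1, 1)})    # the cylinder [x: x_2 = 1]
B = ((2, 4), {(0, 0)})            # the cylinder [x: x_2 = 0, x_4 = 0]
big = (0, 2, 4)
HA, HB = lift(*A, big), lift(*B, big)

# the complement of a cylinder is a cylinder over the same indices
comp = (A[0], set(product((0, 1), repeat=2)) - A[1])
assert members(*comp) == set(product((0, 1), repeat=len(T))) - members(*A)
# the union is a cylinder over the merged tuple, as in (36.13)
assert members(big, HA | HB) == members(*A) | members(*B)
print("cylinder field operations verified")
```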
Kolmogorov's Existence Theorem

The existence theorem can be stated two ways:

Theorem 36.1. If μ_{t_1...t_k} are a system of distributions satisfying the consistency conditions (36.2) and (36.3), then there is a probability measure P on ℛ^T such that the coordinate-variable process [Z_t: t ∈ T] on (R^T, ℛ^T, P) has the μ_{t_1...t_k} as its finite-dimensional distributions.

Theorem 36.2. If μ_{t_1...t_k} are a system of distributions satisfying the consistency conditions (36.2) and (36.3), then there exists on some probability space (Ω, F, P) a stochastic process [X_t: t ∈ T] having the μ_{t_1...t_k} as its finite-dimensional distributions.
For many purposes the underlying probability space is irrelevant, the joint distributions of the variables in the process being all that matters, so that the two theorems are equally useful. As a matter of fact, they are equivalent anyway. Obviously, the first implies the second. To prove the converse, suppose that the process [X_t: t ∈ T] on (Ω, F, P) has finite-dimensional distributions μ_{t_1...t_k}, and define a map ξ: Ω → R^T by the requirement

(36.14)    Z_t(ξ(ω)) = X_t(ω),    t ∈ T.

For each ω, ξ(ω) is an element of R^T, a real function on T, and the requirement is that X_t(ω) be its value at t. Clearly,

(36.15)    ξ^{−1}[x ∈ R^T: (Z_{t_1}(x), ..., Z_{t_k}(x)) ∈ H] = [ω ∈ Ω: (Z_{t_1}(ξ(ω)), ..., Z_{t_k}(ξ(ω))) ∈ H] = [ω ∈ Ω: (X_{t_1}(ω), ..., X_{t_k}(ω)) ∈ H];

since the X_t are random variables, measurable F, this set lies in F if H ∈ ℛ^k. Thus ξ^{−1}A ∈ F for A ∈ ℛ_0^T, and so (Theorem 13.1) ξ is measurable F/ℛ^T. By (36.15) and the fact that [X_t: t ∈ T] has finite-dimensional distributions μ_{t_1...t_k}, Pξ^{−1} (see (13.7)) satisfies

(36.16)    Pξ^{−1}[x ∈ R^T: (Z_{t_1}(x), ..., Z_{t_k}(x)) ∈ H] = P[ω ∈ Ω: (X_{t_1}(ω), ..., X_{t_k}(ω)) ∈ H] = μ_{t_1...t_k}(H).

Thus the coordinate-variable process [Z_t: t ∈ T] on (R^T, ℛ^T, Pξ^{−1}) also has finite-dimensional distributions μ_{t_1...t_k}. Thus to prove either of the two versions of Kolmogorov's existence theorem above is to prove the other one as well.
Example 36.1. Suppose that T is finite, say T = {1, 2, ..., k}. Then (R^T, ℛ^T) is (R^k, ℛ^k), and taking P = μ_{12...k} satisfies the requirements of Theorem 36.1. ∎

Example 36.2. Suppose that T = {1, 2, ...} and

(36.17)    μ_{t_1...t_k} = μ_{t_1} × ··· × μ_{t_k},

where μ_1, μ_2, ... are distributions on the line. The consistency conditions are easily checked. By Theorem 20.4 there exists on some (Ω, F, P) an independent sequence X_1, X_2, ... of random variables with respective distributions μ_1, μ_2, .... But then (36.17) is the distribution of (X_{t_1}, ..., X_{t_k}). For the special case (36.17), Theorem 36.2 was thus proved in Section 20.

The existence of independent sequences with prescribed distributions was, in fact, the measure-theoretic basis of all the probabilistic developments of Chapters 4, 5, and 6; even dependent processes like the Poisson and queueing processes were constructed from independent sequences. The existence of independent sequences can also be made the basis of a proof of Theorem 36.2 in its full generality; see the second proof below. ∎
Example 36.3. If T is a subset of the line, it is convenient to use the order structure of T and take the μ_{t_1...t_k} to be specified initially only for k-tuples (t_1, ..., t_k) that are in increasing order:

(36.18)    t_1 < t_2 < ··· < t_k.

It is natural for example to specify the finite-dimensional distributions for the Poisson process for increasing t_1, ..., t_k alone; see (23.27). Assume that the μ_{t_1...t_k} for k-tuples satisfying (36.18) have the consistency property

(36.19)    μ_{t_1...t_k}(H_1 × ··· × H_{i−1} × R¹ × H_{i+1} × ··· × H_k) = μ_{t_1...t_{i−1}t_{i+1}...t_k}(H_1 × ··· × H_{i−1} × H_{i+1} × ··· × H_k).

If (36.18) does not hold, let π be the permutation for which

(36.20)    t_{π1} < t_{π2} < ··· < t_{πk},

and define μ_{t_1...t_k} by the right side of (36.2) (or (36.4)). This defines the finite-dimensional distributions for all k-tuples of distinct points of T, and they satisfy the two consistency conditions of Kolmogorov's theorem.

To see this, consider first a permutation ρ of (1, 2, ..., k), and put s_i = t_{ρi} and J_i = H_{ρi}. If π is determined by (36.20), then ρ^{−1}π is the permutation that puts the s_i in increasing order, because s_{ρ^{−1}πi} = t_{πi}. But then

    μ_{t_{ρ1}...t_{ρk}}(H_{ρ1} × ··· × H_{ρk}) = μ_{s_1...s_k}(J_1 × ··· × J_k) = μ_{s_{ρ^{−1}π1}...s_{ρ^{−1}πk}}(J_{ρ^{−1}π1} × ··· × J_{ρ^{−1}πk}) = μ_{t_{π1}...t_{πk}}(H_{π1} × ··· × H_{πk}) = μ_{t_1...t_k}(H_1 × ··· × H_k),

where the second and the fourth equalities hold because of the way the finite-dimensional distributions have been defined for the general k-tuple. The μ_{t_1...t_k} as thus extended therefore satisfy (36.2), the first of the consistency conditions.

Suppose of the π determined by (36.20) that k = πj. By definition,

(36.21)    μ_{t_1...t_k}(H_1 × ··· × H_{k−1} × R¹) = μ_{t_{π1}...t_{πk}}(H_{π1} × ··· × R¹ × ··· × H_{πk}),

where on the right R¹ is the jth component of the product. By the assumption (36.19), this R¹ can be canceled if the index t_{πj} = t_k is canceled in the subscript to the μ; since t_{π1}, ..., t_{π(j−1)}, t_{π(j+1)}, ..., t_{πk} are t_1, ..., t_{k−1} in increasing order, the definition further reduces the right side of (36.21) to μ_{t_1...t_{k−1}}(H_1 × ··· × H_{k−1}). The μ_{t_1...t_k} thus satisfy (36.3), the second consistency condition.

It will therefore follow from the existence theorem that if T ⊂ R¹ and μ_{t_1...t_k} is defined for all k-tuples in increasing order, and if (36.19) holds, then there exists a stochastic process [X_t: t ∈ T] satisfying (36.1) for increasing t_1, ..., t_k. ∎
Two proofs of Kolmogorov's existence theorem will be given. The first is based on the extension theorem of Section 3.

FIRST PROOF OF KOLMOGOROV'S THEOREM. Consider the first formulation, Theorem 36.1. If A is the cylinder (36.9), define

(36.22)    P(A) = μ_{t_1...t_k}(H).

This gives rise to the question of consistency, because A will have other representations as a cylinder. Suppose, in fact, that A coincides with the cylinder B defined by (36.10). As observed before, if (u_1, ..., u_m) contains all the t_α and s_β, A is also given by (36.11), where H' = ψ^{−1}H and ψ is defined in (36.6). Since the consistency conditions (36.2) and (36.3) imply the more general one (36.7), P(A) = μ_{t_1...t_k}(H) = μ_{u_1...u_m}(H'). Similarly, B of (36.10) has the form (36.12), and P(B) = μ_{s_1...s_l}(I) = μ_{u_1...u_m}(I'). Since the u_γ are distinct, for any real numbers z_1, ..., z_m there are points x of R^T for which (x_{u_1}, ..., x_{u_m}) = (z_1, ..., z_m). From this it follows that if the cylinders (36.11) and (36.12) coincide, then H' = I'. Hence A = B implies that P(A) = μ_{u_1...u_m}(H') = μ_{u_1...u_m}(I') = P(B), and the definition (36.22) is indeed consistent.
Now consider disjoint cylinders A and B. As usual, the index sets may be taken identical. Assume then that A is given by (36.11) and B by (36.12), so that (36.13) holds. If H' ∩ I' were nonempty, then A ∩ B would be nonempty as well. Therefore, H' ∩ I' = ∅, and

    P(A ∪ B) = μ_{u_1...u_m}(H' ∪ I') = μ_{u_1...u_m}(H') + μ_{u_1...u_m}(I') = P(A) + P(B).

Therefore, P is finitely additive on ℛ_0^T. Clearly, P(R^T) = 1. Suppose that P can be shown to be countably additive on ℛ_0^T. By Theorem 3.1, P will then extend to a probability measure on ℛ^T. By the way P was defined on ℛ_0^T,

    P[x ∈ R^T: (Z_{t_1}(x), ..., Z_{t_k}(x)) ∈ H] = μ_{t_1...t_k}(H),

and therefore the coordinate process [Z_t: t ∈ T] will have the required finite-dimensional distributions.

It suffices, then, to prove P countably additive on ℛ_0^T, and this will follow if A_n ∈ ℛ_0^T and A_n ↓ ∅ together imply P(A_n) → 0 (see Example 2.10). Suppose that A_1 ⊃ A_2 ⊃ ··· and that P(A_n) ≥ ε > 0 for all n. The problem is to show that ∩_n A_n must be nonempty. Since A_n ∈ ℛ_0^T, and since the index set involved in the specification of a cylinder can always be permuted and expanded, there exists a sequence t_1, t_2, ... of points in T for which

    A_n = [x ∈ R^T: (x_{t_1}, ..., x_{t_n}) ∈ H_n],

where† H_n ∈ ℛ^n. Of course, P(A_n) = μ_{t_1...t_n}(H_n). By Theorem 12.3 (regularity), there exists inside H_n a compact set K_n such that μ_{t_1...t_n}(H_n − K_n) < ε/2^{n+1}. If B_n = [x ∈ R^T: (x_{t_1}, ..., x_{t_n}) ∈ K_n], then P(A_n − B_n) < ε/2^{n+1}. Put C_n = ∩_{k=1}^n B_k. Then C_n ⊂ B_n ⊂ A_n and P(A_n − C_n) < ε/2, so that P(C_n) ≥ ε/2 > 0. Therefore, C_n is nonempty. Choose a point x^{(n)} of R^T in C_n. If n ≥ k, then x^{(n)} ∈ C_n ⊂ C_k ⊂ B_k and hence (x^{(n)}_{t_1}, ..., x^{(n)}_{t_k}) ∈ K_k.

Since K_k is bounded, the sequence {x^{(1)}_{t_k}, x^{(2)}_{t_k}, ...} is bounded for each k. By the diagonal method [A14] select an increasing sequence n_1, n_2, ... of integers such that lim_i x^{(n_i)}_{t_k} exists for each k. There is in R^T some point x whose t_k-th coordinate is this limit for each k. But then, for each k, (x_{t_1}, ..., x_{t_k}) is the limit as i → ∞ of (x^{(n_i)}_{t_1}, ..., x^{(n_i)}_{t_k}) and hence lies in K_k. But that means that x itself lies in B_k and hence in A_k. Thus x ∈ ∩_{k=1}^∞ A_k, which completes the proof.‡ ∎

†In general, A_n will involve indices t_1, ..., t_{a_n}, where a_1 < a_2 < ···. For notational simplicity a_n is taken as n; as a matter of fact, this can be arranged: start off by repeating R^T a_1 − 1 times and then for n ≥ 1 repeat A_n a_{n+1} − a_n times.
‡The last part of the argument is, in effect, the proof that a countable product of compact sets is compact.
The second proof of Kolmogorov's theorem goes in two stages, first for countable T, then for general T.*

SECOND PROOF FOR COUNTABLE T. The result for countable T will be proved in its second formulation, Theorem 36.2. It is no restriction to enumerate T as {t_1, t_2, ...} and then to identify t_n with n; in other words, it is no restriction to assume that T = {1, 2, ...}. Write μ_n in place of μ_{12...n}.

By Theorem 20.4 there exists on a probability space (Ω, F, P) (which can be taken to be the unit interval) an independent sequence U_1, U_2, ... of random variables, each uniformly distributed over (0, 1). Let F_1 be the distribution function corresponding to μ_1. If the "inverse" g_1 of F_1 is defined over (0, 1) by g_1(s) = inf[x: s ≤ F_1(x)], then X_1 = g_1(U_1) has distribution μ_1 by the usual argument: P[g_1(U_1) ≤ x] = P[U_1 ≤ F_1(x)] = F_1(x). The problem is to construct X_2, X_3, ... inductively in such a way that

(36.23)    X_k = h_k(U_1, ..., U_k)

for a Borel function h_k and (X_1, ..., X_n) has the distribution μ_n.

Assume that X_1, ..., X_{n−1} have been defined (n ≥ 2): they have joint distribution μ_{n−1} and (36.23) holds for k ≤ n − 1. The idea now is to construct an appropriate conditional distribution function F_n(x|x_1, ..., x_{n−1}); here F_n(x|X_1(ω), ..., X_{n−1}(ω)) will have the value P[X_n ≤ x || X_1, ..., X_{n−1}]_ω would have if X_n were already defined. If g_n(· | x_1, ..., x_{n−1}) is the "inverse" function, then X_n(ω) = g_n(U_n(ω) | X_1(ω), ..., X_{n−1}(ω)) will by the usual argument have the right conditional distribution given X_1, ..., X_{n−1}, so that (X_1, ..., X_{n−1}, X_n) will have the right distribution over R^n.

To construct the conditional distribution function, apply Theorem 33.3 in (R^n, ℛ^n, μ_n) to get a conditional distribution of the last coordinate of (x_1, ..., x_n) given the first n − 1 of them. This will have (Theorem 20.1) the form μ(H; x_1, ..., x_{n−1}); it is a probability measure as H varies over ℛ¹, and

    μ_n(M × H) = ∫_{M×R¹} μ(H; x_1, ..., x_{n−1}) μ_n(d(x_1, ..., x_n)),    M ∈ ℛ^{n−1}.

Since the integrand involves only x_1, ..., x_{n−1}, and μ_n by consistency projects to μ_{n−1} under the map (x_1, ..., x_n) → (x_1, ..., x_{n−1}), a change of variable gives

    μ_n(M × H) = ∫_M μ(H; x_1, ..., x_{n−1}) μ_{n−1}(d(x_1, ..., x_{n−1})).

Define F_n(x|x_1, ..., x_{n−1}) = μ((−∞, x]; x_1, ..., x_{n−1}). Then F_n(·|x_1, ..., x_{n−1}) is a probability distribution function over the line, F_n(x|·) is a Borel function over R^{n−1}, and

    μ_n(M × (−∞, x]) = ∫_M F_n(x|x_1, ..., x_{n−1}) μ_{n−1}(d(x_1, ..., x_{n−1})).

Put g_n(u|x_1, ..., x_{n−1}) = inf[x: u ≤ F_n(x|x_1, ..., x_{n−1})] for 0 < u < 1. Since F_n(x|x_1, ..., x_{n−1}) is nondecreasing and right-continuous in x, g_n(u|x_1, ..., x_{n−1}) ≤ x if and only if u ≤ F_n(x|x_1, ..., x_{n−1}). Set X_n = g_n(U_n|X_1, ..., X_{n−1}). Since (X_1, ..., X_{n−1}) has distribution μ_{n−1} and by (36.23) is independent of U_n, an application of (20.30) gives

    P[(X_1, ..., X_{n−1}) ∈ M, X_n ≤ x] = P[(X_1, ..., X_{n−1}) ∈ M, U_n ≤ F_n(x|X_1, ..., X_{n−1})] = ∫_M F_n(x|x_1, ..., x_{n−1}) μ_{n−1}(d(x_1, ..., x_{n−1})) = μ_n(M × (−∞, x]).

Thus (X_1, ..., X_n) has distribution μ_n. Note that X_n, as a function of X_1, ..., X_{n−1} and U_n, is a function of U_1, ..., U_n, because (36.23) was assumed to hold for k < n. Hence (36.23) holds for k = n as well. ∎

*This second proof, which may be omitted, uses the conditional probability theory of Section 33.
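One induction step of this construction can be sketched concretely. The two-point target distribution mu2 and the helper names g1, g2 below are invented illustrations of μ_2 and the inverse functions g_n; the joint law of the constructed pair is checked by seeded Monte Carlo:

```python
import random

# Sketch of one step of the inductive construction for two coordinates:
# X_1 = g_1(U_1) and X_2 = g_2(U_2 | X_1), each g the generalized inverse
# of the corresponding (conditional) distribution function.
mu2 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def g1(u):                       # inverse of F_1, where P[X_1 = 0] = 0.5
    return 0 if u <= 0.5 else 1

def g2(u, x1):                   # inverse of F_2(. | x1)
    p0 = mu2[(x1, 0)] / (mu2[(x1, 0)] + mu2[(x1, 1)])
    return 0 if u <= p0 else 1

rng = random.Random(3)
counts = {k: 0 for k in mu2}
n = 20000
for _ in range(n):
    x1 = g1(rng.random())
    counts[(x1, g2(rng.random(), x1))] += 1

for k in mu2:                    # empirical joint law matches mu2
    assert abs(counts[k] / n - mu2[k]) < 0.02
print("target joint distribution reproduced")
```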
Thus ( XI ' . . . ' xn ) has distribution l' n . Note that xn ' as a function of XI ' . . . ' xn and u, ' is a function of ul ' . . . ' un because (36.23) was assumed to hold for k < n. • Hence (36.23) holds for k == n as well. SECOND PROOF FOR GENERAL T. Consider ( R T, 9f T ) once again. If S C T, let .Fs a[ Z,: t e S]. Then � c !FT = 9fT. Suppose that S is countable. By the case just treated, there exists a process [ X,: t e S ] on some (0, .F, P )-the process depends on S-such that ( X, , . . . , X, ) has distribution 1' , ,�< for every k-tuple (th . . . , tk ) from S. Define a m�p �: Q � R r by requiring iliat -1
=
. . .
if t E S, if t ll. s .
SECTION
36.
KOLMOGOROV ' S EXISTENCE THEOREM
517
Now (36.15) holds as before if $t_1,\dots,t_k$ all lie in $S$, and so $\xi$ is measurable $\mathscr{F}/\mathscr{F}_S$. Further, (36.16) holds for $t_1,\dots,t_k$ in $S$. Put $P_S = P\xi^{-1}$ on $\mathscr{F}_S$. Then $P_S$ is a probability measure on $(R^T, \mathscr{F}_S)$, and

(36.24) $P_S[x\colon (x_{t_1},\dots,x_{t_k}) \in H] = \mu_{t_1\cdots t_k}(H)$

if $H \in \mathscr{R}^k$ and $t_1,\dots,t_k$ all lie in $S$. (The various spaces $(\Omega, \mathscr{F}, P)$ and processes $[X_t\colon t \in S]$ now become irrelevant.)

If $S_0 \subset S$, and if $A$ is a cylinder (36.9) for which the $t_1,\dots,t_k$ lie in $S_0$, then $P_{S_0}(A)$ and $P_S(A)$ coincide, their common value being $\mu_{t_1\cdots t_k}(H)$. Since these cylinders generate $\mathscr{F}_{S_0}$, $P_{S_0}(A) = P_S(A)$ for all $A$ in $\mathscr{F}_{S_0}$. If $A$ lies both in $\mathscr{F}_{S_1}$ and $\mathscr{F}_{S_2}$, then $P_{S_1}(A) = P_{S_1\cup S_2}(A) = P_{S_2}(A)$. Thus $P(A) = P_S(A)$ consistently defines a set function on the class $\bigcup_S\mathscr{F}_S$, the union extending over the countable subsets $S$ of $T$. If each $A_n$ lies in this union, say $A_n \in \mathscr{F}_{S_n}$ with $S_n$ countable, then $S = \bigcup_nS_n$ is countable and $\bigcup_nA_n$ lies in $\mathscr{F}_S$. Thus $\bigcup_S\mathscr{F}_S$ is a $\sigma$-field and so must coincide with $\mathscr{R}^T$. Therefore, $P$ is a probability measure on $\mathscr{R}^T$, and by (36.24) the coordinate process has under $P$ the required finite-dimensional distributions. ∎
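The sequential construction in the first proof, where independent uniform variables $U_1, U_2, \dots$ are pushed through the conditional quantile functions $g_n$, can be sketched numerically. In this minimal sketch the conditional distributions (normal, with mean depending on the previous coordinate) are an assumed example, not part of the theorem, and the bisection routine stands in for the infimum defining $g_n$.

```python
import math
import random

def quantile(u, mean, sd):
    # g(u) = inf[x: u <= F(x)] for a normal F, located by bisection.
    lo, hi = mean - 20 * sd, mean + 20 * sd
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        F = 0.5 * (1 + math.erf((mid - mean) / (sd * math.sqrt(2))))
        if u <= F:
            hi = mid
        else:
            lo = mid
    return hi

def sample_process(n, rng):
    # X_1 ~ N(0, 1); given X_1, ..., X_{n-1}, take X_n ~ N(X_{n-1}/2, 1).
    # This dependence structure is only an illustrative assumption.
    xs = []
    for _ in range(n):
        mean = xs[-1] / 2 if xs else 0.0
        xs.append(quantile(rng.random(), mean, 1.0))  # X_n = g_n(U_n | ...)
    return xs

rng = random.Random(1)
path = sample_process(5, rng)
```

Because $g_n(u|x_1,\dots,x_{n-1}) \le x$ exactly when $u \le F_n(x|x_1,\dots,x_{n-1})$, each simulated $X_n$ has the prescribed conditional distribution, which is the heart of the proof.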
The Inadequacy of $\mathscr{R}^T$

Theorem 36.3. Let $[X_t\colon t \in T]$ be a family of real functions on $\Omega$.
(i) If $A \in \sigma[X_t\colon t \in T]$ and $\omega \in A$, and if $X_t(\omega) = X_t(\omega')$ for all $t \in T$, then $\omega' \in A$.
(ii) If $A \in \sigma[X_t\colon t \in T]$, then $A \in \sigma[X_t\colon t \in S]$ for some countable subset $S$ of $T$.
PROOF. Define $\xi\colon \Omega \to R^T$ by $Z_t(\xi(\omega)) = X_t(\omega)$. Let $\mathscr{F} = \sigma[X_t\colon t \in T]$. By (36.15), $\xi$ is measurable $\mathscr{F}/\mathscr{R}^T$, and hence $\mathscr{F}$ contains the class $[\xi^{-1}M\colon M \in \mathscr{R}^T]$. The latter class is a $\sigma$-field, however, and by (36.15) it contains the sets $[\omega \in \Omega\colon (X_{t_1}(\omega),\dots,X_{t_k}(\omega)) \in H]$, $H \in \mathscr{R}^k$, and hence contains the $\sigma$-field $\mathscr{F}$ they generate. Therefore,

(36.25) $\sigma[X_t\colon t \in T] = [\xi^{-1}M\colon M \in \mathscr{R}^T]$.

This is an infinite-dimensional analogue of Theorem 20.1(i). As for (i), the hypotheses imply that $\omega \in A = \xi^{-1}M$ and $\xi(\omega) = \xi(\omega')$, so that $\omega' \in A$ certainly follows.

For $S \subset T$, let $\mathscr{G}_S = \sigma[X_t\colon t \in S]$; (ii) says that $\mathscr{F} = \mathscr{G}_T$ coincides with $\mathscr{G} = \bigcup_S\mathscr{G}_S$, the union extending over the countable subsets $S$ of $T$. If $A_1, A_2, \dots$ lie in $\mathscr{G}$, then $A_n$ lies in $\mathscr{G}_{S_n}$ for some countable $S_n$, and so $\bigcup_nA_n$ lies in $\mathscr{G}$ because it lies in $\mathscr{G}_S$ for $S = \bigcup_nS_n$. Thus $\mathscr{G}$ is a $\sigma$-field, and since it contains the sets $[X_t \in H]$, it contains the $\sigma$-field $\mathscr{F}$ they generate. (This part of the argument was used in the second proof of the existence theorem.) ∎

518 · STOCHASTIC PROCESSES
From this theorem it follows that various important sets lie outside the class $\mathscr{R}^T$. Suppose that $T = [0,\infty)$. Of obvious interest is the subset $C$ of $R^T$ consisting of the functions continuous over $[0,\infty)$. But $C$ is not in $\mathscr{R}^T$. For suppose it were. By part (ii) of the theorem (let $\Omega = R^T$ and put $[Z_t\colon t \in T]$ in the role of $[X_t\colon t \in T]$), $C$ would lie in $\sigma[Z_t\colon t \in S]$ for some countable $S \subset [0,\infty)$. But then by part (i) of the theorem (let $\Omega = R^T$ and put $[Z_t\colon t \in S]$ in the role of $[X_t\colon t \in T]$), if $x \in C$ and $Z_t(x) = Z_t(y)$ for all $t \in S$, then $y \in C$. From the assumption that $C$ lies in $\mathscr{R}^T$ thus follows the existence of a countable set $S$ such that, if $x \in C$ and $x(t) = y(t)$ for all $t$ in $S$, then $y \in C$. But whatever countable set $S$ may be, for every continuous $x$ there obviously exist functions $y$ which have discontinuities but which agree with $x$ on $S$. Therefore, $C$ cannot lie in $\mathscr{R}^T$.

What the argument shows is this: A set $A$ in $R^T$ cannot lie in $\mathscr{R}^T$ unless there exists a countable subset $S$ of $T$ with the property that, if $x \in A$ and $x(t) = y(t)$ for all $t$ in $S$, then $y \in A$. Thus $A$ cannot lie in $\mathscr{R}^T$ if it effectively involves all the points $t$ in the sense that, for each $x$ in $A$ and each $t$ in $T$, it is possible to move $x$ out of $A$ by changing its value at $t$ alone. And $C$ is such a set. For another, consider the set of functions $x$ over $T = [0,\infty)$ that are nondecreasing and assume as values $x(t)$ only nonnegative integers:

(36.26) $[x \in R^{[0,\infty)}\colon x(s) \le x(t),\ s \le t;\ x(t) \in \{0,1,\dots\},\ t \ge 0]$.

This, too, lies outside $\mathscr{R}^T$.

In Section 23 the Poisson process was defined as follows: Let $X_1, X_2, \dots$ be independent and identically distributed with the exponential distribution (the probability space $\Omega$ on which they are defined may by Theorem 20.4 be taken to be the unit interval with Lebesgue measure). Put $S_0 = 0$ and $S_n = X_1 + \cdots + X_n$. If $S_n(\omega) < S_{n+1}(\omega)$ for $n \ge 0$ and $S_n(\omega) \to \infty$, put $N(t,\omega) = N_t(\omega) = \max[n\colon S_n(\omega) \le t]$ for $t \ge 0$; otherwise, put $N(t,\omega) = N_t(\omega) = 0$ for $t \ge 0$. Then the stochastic process $[N_t\colon t \ge 0]$ has the finite-dimensional distributions described by the equations (23.27). The function $N(\cdot,\omega)$ is the path function or sample function† corresponding to $\omega$, and by the construction every path function lies in the set (36.26). This is a good thing if the process is to be a model for, say, calls arriving at a telephone exchange: The sample path represents the history of the calls, its value at $t$ being the number of arrivals up to time $t$, and so it ought to be nondecreasing and integer-valued.

According to Theorem 36.1, there exists a measure $P$ on $R^T$ for $T = [0,\infty)$ such that the coordinate process $[Z_t\colon t \ge 0]$ on $(R^T, \mathscr{R}^T, P)$ has the finite-dimensional distributions of the Poisson process. This time does the path function $Z(\cdot, x)$ lie in the set (36.26) with probability 1? Since $Z(\cdot, x)$ is just $x$ itself, the question is whether the set (36.26) has $P$-measure 1. But this set does not lie in $\mathscr{R}^T$, and so it has no measure at all. An application of Theorem 36.1 will always yield a stochastic process with prescribed finite-dimensional distributions, but the process may lack certain path-function properties which it is reasonable to require of it as a model for some natural phenomenon. The special construction of Section 23 gets around this difficulty for the Poisson process, and in the next section a special construction will yield a model for Brownian motion with continuous paths. Section 38 treats a general method for producing stochastic processes that have prescribed finite-dimensional distributions and at the same time have path functions with desirable regularity properties.

†Other terms are realization of the process and trajectory.
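The Section 23 construction is easy to imitate numerically: simulate independent exponential interarrival times, form the partial sums $S_n$, and read off $N_t = \max[n\colon S_n \le t]$. The rate and the evaluation grid below are arbitrary choices; the point of the sketch is that every simulated path automatically lies in the set (36.26), that is, it starts at 0, is nondecreasing, and is integer-valued.

```python
import math
import random

def poisson_path(grid, rng):
    # Interarrival times X_1, X_2, ... are i.i.d. exponential (rate 1);
    # S_n = X_1 + ... + X_n and N_t = max[n: S_n <= t].
    arrivals = []
    s = 0.0
    while s <= grid[-1]:
        s += -math.log(1.0 - rng.random())  # exponential variate
        arrivals.append(s)
    return [sum(1 for a in arrivals if a <= t) for t in grid]

rng = random.Random(7)
grid = [k / 100 for k in range(1001)]      # the interval [0, 10]
path = poisson_path(grid, rng)
```

The path-regularity question in the text is precisely whether this built-in regularity can be recovered from the finite-dimensional distributions alone; the coordinate-process construction cannot even ask the question.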
PROBLEMS

36.1. 13.6 ↑ Generalize Theorem 20.1(ii), replacing $(X_1,\dots,X_k)$ by $[X_t\colon t \in T]$ for an arbitrary $T$.

36.2. A process $(\dots, X_{-1}, X_0, X_1, \dots)$ (here $T$ is the set of all integers) is stationary if the distribution of $(X_n, X_{n+1},\dots, X_{n+k-1})$ over $R^k$ is the same for all $n = 0, \pm 1, \pm 2, \dots$. Define $T\colon R^T \to R^T$ by $Z_n(Tx) = Z_{n+1}(x)$; thus $T$ moves a doubly infinite sequence (that is, an element of $R^T$) one place left: $T(\dots, x_{-1}, x_0, x_1, \dots) = (\dots, x_0, x_1, x_2, \dots)$. Show that $T$ is measurable $\mathscr{R}^T/\mathscr{R}^T$, and show that the coordinate process $(\dots, Z_{-1}, Z_0, Z_1, \dots)$ on $(R^T, \mathscr{R}^T, P)$ is stationary if and only if $T$ preserves the measure $P$ in the sense that $PT^{-1} = P$.

36.3. Show that, if $X$ is measurable $\sigma[X_t\colon t \in T]$, then $X$ is measurable $\sigma[X_t\colon t \in S]$ for some countable subset $S$ of $T$.

36.4. Suppose that $[X_t\colon t \in T]$ is a stochastic process on $(\Omega, \mathscr{F}, P)$ and $A \in \mathscr{F}$. Show that there is a countable subset $S$ of $T$ for which $P[A\|X_t,\ t \in T] = P[A\|X_t,\ t \in S]$ with probability 1. Replace $A$ by a random variable and prove a similar result.
36.5. Let $T$ be arbitrary and let $K(s,t)$ be a real function over $T \times T$. Suppose that $K$ is symmetric in the sense that $K(s,t) = K(t,s)$ and nonnegative definite in the sense that $\sum_{i,j=1}^kK(t_i,t_j)x_ix_j \ge 0$ for $k \ge 1$, $t_1,\dots,t_k$ in $T$, $x_1,\dots,x_k$ real. Show that there exists a process $[X_t\colon t \in T]$ for which $(X_{t_1},\dots,X_{t_k})$ has the centered normal distribution with covariances $K(t_i,t_j)$, $i,j = 1,\dots,k$.

36.6. 8.4 ↑ Suppose that $p_n(u_1,\dots,u_n)$ is a nonnegative real number for each $n$ and each $n$-long sequence $u_1,\dots,u_n$ of 0's and 1's. Suppose that $p_1(0) + p_1(1) = 1$ and $p_{n+1}(u_1,\dots,u_n,0) + p_{n+1}(u_1,\dots,u_n,1) = p_n(u_1,\dots,u_n)$. Prove that on the $\sigma$-field $\mathscr{C}$ generated by the cylinders in sequence space there exists a probability measure $P$ such that $P[\omega\colon z_i(\omega) = u_i,\ i \le n] = p_n(u_1,\dots,u_n)$. Problems 2.16, 4.18, and 8.4 cover the coin-tossing, independent, and Markov cases, respectively.
36.7. Let $L$ be a Borel set on the line, let $\mathscr{L}$ consist of the Borel subsets of $L$, and let $L^T$ consist of all maps from $T$ into $L$. Define the appropriate notion of cylinder, and let $\mathscr{L}^T$ be the $\sigma$-field generated by the cylinders. State a version of Theorem 36.1 for $(L^T, \mathscr{L}^T)$. Assume $T$ countable and prove this theorem not by imitating the previous proof but by observing that $L^T$ is a subset of $R^T$ and lies in $\mathscr{R}^T$. If $L$ consists of 0 and 1, and if $T$ is the set of positive integers, then $L^T$ is the space considered in the preceding problem.

36.8. Suppose that the random variables $X_1, X_2, \dots$ assume the values 0 and 1 and $P[X_n = 1 \text{ i.o.}] = 1$. Let $\mu$ be the distribution over $(0,1]$ of $\sum_{n=1}^\infty X_n/2^n$. Show that on the unit interval with the measure $\mu$ the digits of the nonterminating dyadic expansion form a stochastic process with the same finite-dimensional distributions as $X_1, X_2, \dots$.

36.9. 36.7 ↑ There is an infinite-dimensional version of Fubini's theorem. In the construction in Problem 36.7, let $L = I = (0,1)$, $T = \{1,2,\dots\}$, let $\mathscr{I}$ consist of the Borel subsets of $I$, and suppose that each $k$-dimensional distribution is the $k$-fold product of Lebesgue measure over the unit interval. Then $I^T$ is a countable product of copies of $(0,1)$, its elements are sequences $x = (x_1, x_2, \dots)$ of points of $(0,1)$, and Kolmogorov's theorem ensures the existence on $(I^T, \mathscr{I}^T)$ of a product probability measure $\pi$: $\pi[x\colon x_i \le a_i,\ i \le n] = a_1\cdots a_n$ for $0 \le a_i \le 1$. Let $I^n$ denote the $n$-dimensional unit cube.
(a) Define $\psi\colon I^n \times I^T \to I^T$ by $\psi((x_1,\dots,x_n),(y_1,y_2,\dots)) = (x_1,\dots,x_n,y_1,y_2,\dots)$. Show that $\psi$ is measurable $\mathscr{I}^n \times \mathscr{I}^T/\mathscr{I}^T$ and $\psi^{-1}$ is measurable $\mathscr{I}^T/\mathscr{I}^n \times \mathscr{I}^T$. Show that $(\lambda_n \times \pi)\psi^{-1} = \pi$, where $\lambda_n$ is $n$-dimensional Lebesgue measure restricted to $I^n$.
(b) Let $f$ be a function measurable $\mathscr{I}^T$ and, for simplicity, bounded. Define

$$f_n(x) = \int_0^1\cdots\int_0^1 f(y_1,\dots,y_n,x_{n+1},x_{n+2},\dots)\,dy_1\cdots dy_n;$$

in other words, integrate out the coordinates one by one. Show by Problem 34.18, martingale theory, and the zero-one law that

(36.27) $\displaystyle \lim_n f_n(x) = \int_{I^T}f\,d\pi$

except for $x$ in a set of $\pi$-measure 0.
(c) Adopting the point of view of part (a), let $g_n(x_1,\dots,x_n)$ be the result of integrating $(y_{n+1}, y_{n+2}, \dots)$ out (with respect to $\pi$) from $f(x_1,\dots,x_n,y_{n+1},y_{n+2},\dots)$. This may suggestively be written as

$$g_n(x_1,\dots,x_n) = \int_{I^T}f(x_1,\dots,x_n,y_{n+1},y_{n+2},\dots)\,\pi(dy).$$

Show that $g_n(x_1,\dots,x_n) \to f(x_1,x_2,\dots)$ except for $x$ in a set of $\pi$-measure 0.

36.10. (a) Let $T$ be an interval of the line. Show that $\mathscr{R}^T$ fails to contain the sets of: linear functions, polynomials, constants, nondecreasing functions, functions of bounded variation, differentiable functions, analytic functions, functions continuous at a fixed $t_0$, Borel measurable functions. Show that it fails to contain the set of functions that: vanish somewhere in $T$, satisfy $x(s) < x(t)$ for some pair with $s < t$, have a local maximum anywhere, fail to have a local maximum.
(b) Let $C$ be the set of continuous functions on $T = [0,\infty)$. Show that $A \in \mathscr{R}^T$ and $A \subset C$ imply that $A = \emptyset$. Show, on the other hand, that $A \in \mathscr{R}^T$ and …

36.11. … $[X_t\colon t > 0]$ for which the $X_t$ are independent and assume the values 0 and 1 with probability $\tfrac12$ each. Compare Problem 1.1.

36.12. By Kolmogorov's existence theorem there is a process $[N_t\colon t \ge 0]$ having the finite-dimensional distributions (23.27) specified for the Poisson process. As pointed out at the end of this section and in Section 23, the sample paths may be very irregular. Let $D$ be the set of dyadic rationals, and let $A$ be the $\omega$-set where the sample path when restricted to $D$ (i) satisfies $N_0(\omega) = 0$, (ii) is nondecreasing, (iii) assumes only nonnegative integers as values, and (iv) has for each $t$ in $D$ the property that $\inf_{t\le s\in D}N_s(\omega) < N_t(\omega) + 1$. Show that $A$ is measurable and $P(A) = 1$. For $\omega \in A$ define $N_t'(\omega) = \inf_{t\le s\in D}N_s(\omega)$; for $\omega \notin A$ define $N_t'(\omega) = 0$. Show that $[N_t'\colon t \ge 0]$ is a Poisson process each of whose paths is integer-valued and nondecreasing and has no jump exceeding 1. This kind of argument is used in the next section to construct a Brownian motion with continuous paths.

36.13. Finite-dimensional distributions can be specified by conditional distributions. Suppose that $\mu$ is a probability measure on the line, and suppose that for $n \ge 2$, $\nu_n(H; x_1,\dots,x_{n-1})$ is a probability measure as $H$ varies over $\mathscr{R}^1$ and is a Borel function as $(x_1,\dots,x_{n-1})$ varies over $R^{n-1}$. Show that there exists a stochastic process $\{X_1, X_2, \dots\}$ such that $X_1$ has distribution $\mu$ and $P[X_n \in H\|X_1,\dots,X_{n-1}]_\omega = \nu_n(H; X_1(\omega),\dots,X_{n-1}(\omega))$. Show that $E[f(X_n)\|X_1,\dots,X_{n-1}]_\omega = \int_{-\infty}^{\infty}f(x)\,\nu_n(dx; X_1(\omega),\dots,X_{n-1}(\omega))$ if $f(X_n)$ is integrable.

36.14. Here is an application of the existence theorem in which $T$ is not a subset of the line. Let $(N, \mathscr{N}, \nu)$ be a measure space, and take $T$ to consist of the $\mathscr{N}$-sets of finite $\nu$-measure. The problem is to construct a generalized Poisson process, a stochastic process $[X_A\colon A \in T]$ such that (i) $X_A$ has the Poisson distribution with mean $\nu(A)$ and (ii) $X_{A_1},\dots,X_{A_n}$ are independent if $A_1,\dots,A_n$ are disjoint. Hint: To define the finite-dimensional distributions, generalize this construction: For $A$, $B$ in $T$, consider independent random variables $Y_1$, $Y_2$, $Y_3$ having Poisson distributions with means $\nu(A\cap B^c)$, $\nu(A\cap B)$, $\nu(A^c\cap B)$; take $\mu_{A,B}$ to be the distribution of $(Y_1 + Y_2,\ Y_2 + Y_3)$.
SECTION 37. BROWNIAN MOTION

Definition

A Brownian motion or Wiener process is a stochastic process $[W_t\colon t \ge 0]$, on some $(\Omega, \mathscr{F}, P)$, with these three properties:

(i) The process starts at 0:

(37.1) $P[W_0 = 0] = 1$.

(ii) The increments are independent: If

(37.2) $0 \le t_0 \le t_1 \le \cdots \le t_k$,

then

(37.3) $\displaystyle P[W_{t_i} - W_{t_{i-1}} \in H_i,\ i \le k] = \prod_{i\le k}P[W_{t_i} - W_{t_{i-1}} \in H_i]$.

(iii) For $0 \le s < t$ the increment $W_t - W_s$ is normally distributed with mean 0 and variance $t - s$:

(37.4) $\displaystyle P[W_t - W_s \in H] = \frac{1}{\sqrt{2\pi(t-s)}}\int_He^{-x^2/2(t-s)}\,dx$.
The existence of such processes will be proved presently.

Imagine suspended in a fluid a particle bombarded by molecules in thermal motion. The particle will perform a seemingly random movement first described by the nineteenth-century botanist Robert Brown. Consider a single component of this motion (imagine it projected on a vertical axis), and denote by $W_t$ the height at time $t$ of the particle above a fixed horizontal plane. Condition (i) is merely a convention: the particle starts at 0. Condition (ii) reflects a kind of lack of memory. The displacements $W_{t_1} - W_{t_0},\dots,W_{t_{k-1}} - W_{t_{k-2}}$ the particle undergoes during the intervals $[t_0,t_1],\dots,[t_{k-2},t_{k-1}]$ in no way influence the displacement $W_{t_k} - W_{t_{k-1}}$ it undergoes during $[t_{k-1},t_k]$. Although the future behavior of the particle depends on its present position, it does not depend on how the particle got there. As for (iii), that $W_t - W_s$ has mean 0 reflects the fact that the particle is as likely to go up as to go down: there is no drift. The variance grows as the length of the interval $[s,t]$; the particle tends to wander away from its position at time $s$, and having done so suffers no force tending to restore it to that position. To Norbert Wiener are due the mathematical foundations of the theory of this kind of random motion.

The increments of the Brownian motion process are stationary in the sense that the distribution of $W_t - W_s$ depends only on the difference $t - s$. Since $W_0 = 0$, the distribution of these increments is described by saying that $W_t$ is normally distributed with mean 0 and variance $t$. This implies (37.1). If $0 \le s \le t$, then by the independence of the increments, $E[W_sW_t] = E[W_s(W_t - W_s)] + E[W_s^2] = E[W_s]E[W_t - W_s] + E[W_s^2] = s$. This specifies all the means, variances, and covariances:

(37.5) $E[W_t] = 0, \qquad E[W_sW_t] = \min\{s,t\}$.
"_ 1
1,
•
•
2
-
1,
•
•
•
-
•
( x; - X ; - 1 ) 2 2 ( t; - t; - 1 ) ' where t 0 = x 0 = 0. Sometimes W, will be d�noted W( t ), and its value at w will be W( t, w ). The nature of the path functions W( · , w) will be of great importance. The existence of the Brownian motion process follows from Kolmogorov' s theorem. For 0 < t 1 < · · · < t k let p. be the distribution in R k with density (37 .6). To put it another way, let p. be the distribution of (S1 , . . . , Sk ), where S; = X1 + · · · + X; and where X1 , . . . , Xk are indepen dent, normally distributed random variables with mean 0 and variances t 1 ' t 2 - t 1 ' . . . ' t k - t k - I · If g( X 1 ' . . . ' X k ) = ( X 1 ' . . . ' X ; - 1 ' X ; + 1 ' . . . ' X k ), then g(S1 , . . . , Sk ) = (S1 , . . . , S;_ 1 , S; + 1 , . . . , Sk ) has the distribution prescribed for p. This is because X; + X; + 1 is normally distrib uted with mean 0 and variance t ; + t - t ;_ 1 ; see Example 20.6. Therefore, 11
11
( 37 .7)
• • •
1; _ 1 1 , + 1
.
•
.
1" .
• •
•
1"
11
.
.
•
1"
524
STOCHASTIC PROCESSES
The P, 11 1k defined in this way for increasing, positive t 1 , . . . , t k thus satisfy the conditions for Kolmogorov' s existence theorem as modified in Example 36.3; (37.7) is the same thing as (36.19). Therefore, there does exist a process [ W,: t > 0] corresponding to the p. 11 1�< . Taking W, = 0 for t = 0 shows that there exists on some ( 0, S&", P) a process [ W, : t > 0] with the finite-dimensional distributions specified by the conditions (i), (ii), and (iii ) . • • •
•
•
•
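The description of $\mu_{t_1\cdots t_k}$ as the distribution of partial sums of independent normal increments translates directly into a simulation. The grid, the sample size, and the tolerance below are arbitrary choices; the seeded Monte Carlo sketch merely illustrates the covariance $E[W_sW_t] = \min\{s,t\}$ of (37.5).

```python
import random

def sample_W(times, rng):
    # (W_{t_1}, ..., W_{t_k}) as partial sums of independent
    # N(0, t_i - t_{i-1}) increments, with t_0 = 0.
    w, prev, out = 0.0, 0.0, []
    for t in times:
        w += rng.gauss(0.0, (t - prev) ** 0.5)
        out.append(w)
        prev = t
    return out

rng = random.Random(0)
times = [0.5, 1.0, 2.0]
n = 20000
samples = [sample_W(times, rng) for _ in range(n)]
# Empirical E[W_s W_t] for s = 0.5 and t = 2.0; the target is min(s, t) = 0.5.
cov = sum(w[0] * w[2] for w in samples) / n
```

The consistency condition (37.7) is visible here: dropping a coordinate of the simulated vector is the same as merging two adjacent normal increments into one.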
Alternatively, define $\mu_{t_1\cdots t_k}$ as the centered normal distribution (Section 29) in $R^k$ with covariances $\rho_{ij} = \min\{t_i,t_j\}$. Since (take $t_0 = 0$)

$$\sum_{i,j}\rho_{ij}x_ix_j = \sum_{i,j}x_ix_j\sum_{l=1}^{\min\{i,j\}}(t_l - t_{l-1}) = \sum_{l=1}^{k}(t_l - t_{l-1})\Bigl(\sum_{i=l}^{k}x_i\Bigr)^2 \ge 0,$$

the matrix $(\rho_{ij})$ is nonnegative definite. Consistency is now obvious.

Continuity of Paths
If the Brownian motion process is to represent the motion of a particle, it is natural to require that the path functions $W(\cdot,\omega)$ be continuous. But Kolmogorov's theorem does not guarantee continuity. Indeed, for $T = [0,\infty)$ the space $(\Omega, \mathscr{F})$ in the proof of Kolmogorov's theorem is $(R^T, \mathscr{R}^T)$, and as shown in the last section, the set of continuous functions does not lie in $\mathscr{R}^T$. A special construction gets around this difficulty. The idea is to use for dyadic rational $t$ the random variables $W_t$ as already defined and then to redefine the other $W_t$ in such a way as to ensure continuity. To carry this through requires proving that with probability 1 the sample path is uniformly continuous for dyadic rational arguments in bounded intervals.

Fix a space $(\Omega, \mathscr{F}, P)$ and on it a process $[W_t\colon t \ge 0]$ having the finite-dimensional distributions prescribed for Brownian motion. Let $D$ be the set of nonnegative dyadic rationals, let $I_{nk} = [k2^{-n}, (k+1)2^{-n}]$, and put

(37.8) $\displaystyle B_n = \Bigl[\omega\colon \max_{0\le k<n2^n}\ \sup_{r\in I_{nk}\cap D}|W(r,\omega) - W(k2^{-n},\omega)| > \frac1n\Bigr]$.

Suppose it is shown that $\sum P(B_n)$ converges. The first Borel–Cantelli lemma will then imply that $B = \limsup_nB_n$ has probability 0. But suppose that $\omega$ lies outside $B$. Then for every $t$ and every $\epsilon$ there exists an $n$ such that $t < n$, $3n^{-1} < \epsilon$, and $\omega \in B_n^c$. Fix such an $n$ and suppose that $r$ and $r'$ are dyadic rationals in $[0,t)$ and $|r - r'| < 2^{-n}$. Then $r$ and $r'$ must lie in the
same or in adjacent dyadic intervals $I_{nk}$. If $r, r' \in I_{nk}$, then

$$|W(r,\omega) - W(r',\omega)| \le |W(r,\omega) - W(k2^{-n},\omega)| + |W(k2^{-n},\omega) - W(r',\omega)| < 2n^{-1} < \epsilon.$$

If $r \in I_{nk}$ and $r' \in I_{n,k+1}$, then

$$|W(r,\omega) - W(r',\omega)| \le |W(r,\omega) - W(k2^{-n},\omega)| + |W(k2^{-n},\omega) - W((k+1)2^{-n},\omega)| + |W((k+1)2^{-n},\omega) - W(r',\omega)| < 3n^{-1} < \epsilon.$$

Therefore, $\omega \notin B$ implies that $W(r,\omega)$ is for every $t$ uniformly continuous as $r$ ranges over the dyadic rationals in $[0,t)$, and hence $W(\cdot,\omega)$ will have a continuous extension to $[0,\infty)$.

To prove that $\sum P(B_n)$ converges requires a maximal inequality.

Lemma 1. Suppose that $X_1,\dots,X_n$ are independent, normally distributed random variables with mean 0, and put $S_k = X_1 + \cdots + X_k$. Then for positive $\alpha$ and $\epsilon$,

(37.9) $\displaystyle 2P[S_n > \alpha + 2\epsilon] - \sum_{k=1}^{n}P[X_k > \epsilon] \le P\Bigl[\max_{k\le n}S_k > \alpha\Bigr] \le 2P[S_n > \alpha]$.

PROOF. Clearly,

(37.10) $\displaystyle P\Bigl[\max_{k\le n}S_k > \alpha,\ S_n > \alpha\Bigr] = P[S_n > \alpha]$.
Let $A_k = [\max_{i<k}S_i \le \alpha < S_k]$. Since $S_n - S_k$ is independent of $A_k$ and has distribution symmetric about the origin,

$$P\Bigl[\max_{k\le n}S_k > \alpha,\ S_n \le \alpha\Bigr] = \sum_{k=1}^{n-1}P(A_k\cap[S_n \le \alpha]) \le \sum_{k=1}^{n-1}P(A_k\cap[S_n - S_k \le 0]) = \sum_{k=1}^{n-1}P(A_k\cap[S_n - S_k \ge 0]) \le \sum_{k=1}^{n-1}P(A_k\cap[S_n > \alpha]) \le P[S_n > \alpha].$$

Adding this to (37.10) gives the right-hand inequality in (37.9).
The other inequality is needed only for the calculation (37.16) and its consequence (37.51). Now $S_{k-1} \le \alpha$, $X_k < \epsilon$, $S_n - S_k < -\epsilon$ imply that $S_n < \alpha$, and $S_{k-1} \le \alpha$, $X_k < \epsilon$, $S_n > \alpha + 2\epsilon$ imply that $S_n - S_k > \epsilon$. Therefore,

$$\begin{aligned}
\sum_{k=1}^{n-1}P(A_k\cap[S_n \le \alpha]) &\ge \sum_{k=1}^{n-1}P(A_k\cap[X_k < \epsilon,\ S_n \le \alpha])\\
&\ge \sum_{k=1}^{n-1}P(A_k\cap[X_k < \epsilon,\ S_n - S_k < -\epsilon])\\
&= \sum_{k=1}^{n-1}P(A_k\cap[X_k < \epsilon,\ S_n - S_k > \epsilon])\\
&\ge \sum_{k=1}^{n-1}P(A_k\cap[X_k < \epsilon,\ S_n > \alpha + 2\epsilon])\\
&\ge \sum_{k=1}^{n-1}\{P(A_k\cap[S_n > \alpha + 2\epsilon]) - P[X_k > \epsilon]\}\\
&\ge P[S_n > \alpha + 2\epsilon] - \sum_{k=1}^{n}P[X_k > \epsilon].
\end{aligned}$$

Adding this to (37.10) gives the other inequality in (37.9). ∎
The argument goes through if the $X_k$ are independent and $S_n - S_k$ is symmetrically distributed about 0. Since $S_1,\dots,S_n$ have the same joint distribution as $-S_1,\dots,-S_n$, the right-hand inequality in (37.9) implies that

(37.11) $\displaystyle P\Bigl[\max_{k\le n}|S_k| > \alpha\Bigr] \le 2P[|S_n| > \alpha]$.

To analyze the probability of the event (37.8), fix $\theta$ and $t$ for the moment. Since the increments of Brownian motion are independent and
normally distributed with mean 0, (37.11) implies that

$$P\Bigl[\max_{k\le 2^m}|W(t + \theta k2^{-m}) - W(t)| > \alpha\Bigr] \le 2P[|W(t+\theta) - W(t)| > \alpha] \le \frac{2}{\alpha^4}E\bigl[(W(t+\theta) - W(t))^4\bigr] = \frac{6\theta^2}{\alpha^4}$$

(see (21.7) for the moments of the normal distribution). The sets on the left here increase with $m$, and letting $m \to \infty$ leads to

(37.12) $\displaystyle P\Bigl[\sup_{0\le r\le 1,\ r\in D}|W(t + r\theta) - W(t)| > \alpha\Bigr] \le \frac{6\theta^2}{\alpha^4}$.

Therefore,

(37.13) $\displaystyle P(B_n) \le n2^n\cdot\frac{6(2^{-n})^2}{(1/n)^4} = 6n^52^{-n},$

and $\sum P(B_n)$ does converge. Therefore, there exists a measurable set $B$ such that $P(B) = 0$ and such that for $\omega$ outside $B$, $W(r,\omega)$ is uniformly continuous as $r$ ranges over the dyadic rationals in any bounded interval. If $\omega \notin B$ and $r$ decreases to $t$ through dyadic rational values, then $W(r,\omega)$ has the Cauchy property and hence converges. Put
(37.14) $W_t'(\omega) = W'(t,\omega) = \begin{cases}\lim_{r\downarrow t}W(r,\omega) & \text{if } \omega \notin B,\\ 0 & \text{if } \omega \in B,\end{cases}$

where $r$ decreases to $t$ through the set $D$ of dyadic rationals. By construction $W'(t,\omega)$ is continuous in $t$ for each $\omega$. If $\omega \notin B$, then $W(r,\omega) = W'(r,\omega)$ for dyadic rational $r$, and $W'(\cdot,\omega)$ is the continuous extension to all of $[0,\infty)$.

The next thing is to show that the $W_t'$ have the same joint distributions as the $W_t$. It is convenient to prove this by a lemma that will be used again further on.
Lemma 2. Let $X_n$, $Y_n$, $X$, $Y$ be $k$-dimensional random vectors; let $F$ be a distribution function and let $F_n$ be the distribution function of $X_n$.
(i) If $X_n \to X$ with probability 1 and $F_n(x) \to F(x)$ for all $x$, then $F$ is the distribution function of $X$.
(ii) If $X_n$ and $Y_n$ have the same distribution, and if $X_n \to X$ and $Y_n \to Y$ with probability 1, then $X$ and $Y$ have the same distribution.†

PROOF. For (i), let $X$ have distribution function $H$. By Theorem 4.1, if $h > 0$, then

$$F(x_1,\dots,x_k) = \limsup_nF_n(x_1,\dots,x_k) \le H(x_1,\dots,x_k) \le \liminf_nF_n(x_1+h,\dots,x_k+h) = F(x_1+h,\dots,x_k+h).$$

It follows by continuity from above that $F$ and $H$ agree.

As for (ii), if $X$ and $Y$ have distribution functions $F$ and $G$, then Theorem 4.1 gives

$$F(x_1-h,\dots,x_k-h) \le \liminf_nF_n(x_1,\dots,x_k) \le \limsup_nF_n(x_1,\dots,x_k) \le F(x_1,\dots,x_k)$$

for $h > 0$, and the same relation with $G$ in place of $F$. Therefore, if $x$ is a continuity point of both $F$ and $G$, then $F(x) = \lim_nF_n(x) = G(x)$. But for an arbitrary $x$, among the points $(x_1+h,\dots,x_k+h)$, only countably many can be discontinuity points of $F$ (see the argument following (20.18)), and only countably many can be discontinuity points of $G$. Let $h$ decrease to 0 through values for which $F$ and $G$ are both continuous at $(x_1+h,\dots,x_k+h)$; it follows by continuity from above that $F(x) = G(x)$. ∎

Now, for $0 < t_1 < \cdots < t_k$, choose dyadic rationals $r_i(n)$ decreasing to the $t_i$. Apply Lemma 2(i) with $(W_{r_1(n)},\dots,W_{r_k(n)})$ and $(W_{t_1}',\dots,W_{t_k}')$ in the role of $X_n$ and $X$, and with the distribution function with density (37.6) in the role of $F$. Since (37.6) is continuous in the $t_i$, it follows by Scheffé's theorem that $F_n(x) \to F(x)$, and by construction $X_n \to X$ with probability 1. By the lemma, $(W_{t_1}',\dots,W_{t_k}')$ has distribution function $F$, which of course is also the distribution function of $(W_{t_1},\dots,W_{t_k})$.

Thus $[W_t'\colon t \ge 0]$ is a stochastic process, on the same probability space as $[W_t\colon t \ge 0]$, which has the finite-dimensional distributions required for Brownian motion and moreover has a continuous sample path $W'(\cdot,\omega)$ for every $\omega$. By enlarging the set $B$ in the definition of $W_t'(\omega)$ to include all the $\omega$ for which $W(0,\omega) \ne 0$, one can also ensure that $W'(0,\omega) = 0$. Now discard the original random variables $W_t$ and relabel $W_t'$ as $W_t$. The new $[W_t\colon t \ge 0]$ is a stochastic process satisfying conditions (i), (ii), and (iii) for Brownian motion and this one as well:

(iv) For each $\omega$, $W(t,\omega)$ is continuous in $t$ and $W(0,\omega) = 0$.

From now on, by a Brownian motion will be meant a process satisfying (iv) as well as (i), (ii), and (iii). What has been proved is this:

Theorem 37.1. There exist processes $[W_t\colon t \ge 0]$ satisfying conditions (i), (ii), (iii), and (iv), that is, Brownian motion processes.

†These results are simple consequences of the multidimensional weak-convergence theory of Section 29.
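The proof built the process on the dyadic rationals first and then extended by continuity. A closely related and very concrete procedure, Lévy's midpoint refinement, is sketched below; it is not the proof above, and it rests on the standard Brownian-bridge fact that, given the endpoint values over an interval of length $h$, the midpoint value is normal with mean their average and variance $h/4$.

```python
import random

def refine(w, rng):
    # w holds the values W(k 2^{-n}), k = 0, ..., 2^n, over [0, 1].
    # Given the endpoints of a dyadic interval of length h, the midpoint
    # value is N(average of endpoints, h/4): Brownian-bridge interpolation.
    n = len(w) - 1
    h = 1.0 / n
    out = []
    for k in range(n):
        out.append(w[k])
        out.append(0.5 * (w[k] + w[k + 1]) + rng.gauss(0.0, (h / 4) ** 0.5))
    out.append(w[-1])
    return out

rng = random.Random(3)
w = [0.0, rng.gauss(0.0, 1.0)]      # W(0) = 0 and W(1) ~ N(0, 1)
for _ in range(10):
    w = refine(w, rng)              # values on the dyadics k 2^{-10}
```

Ten refinements give the path on the dyadics of order 10; the uniform-continuity argument of the text is exactly what justifies passing from such dyadic skeletons to a continuous path.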
In the construction above, $W_r$ for dyadic $r$ was used to define $W_t$ in general. For that reason it suffices to apply Kolmogorov's theorem for a countable index set. By the second proof of that theorem the space $(\Omega, \mathscr{F}, P)$ can be taken as the unit interval with Lebesgue measure. The next section treats a general scheme for dealing with path-function questions by in effect replacing an uncountable time set by a countable one.

Since the paths are continuous, $\sup_{s\le t}|W_{t_0+s} - W_{t_0}|$ is a random variable because it is unchanged if $s$ is restricted to dyadic rationals. By (37.11), $P[\max_{k\le 2^m}|W(t_0 + k2^{-m}t) - W(t_0)| > \alpha - \epsilon] \le 2P[|W(t_0+t) - W(t_0)| > \alpha - \epsilon]$. Letting $m \to \infty$ and then $\epsilon \to 0$ gives

(37.15) $\displaystyle P\Bigl[\sup_{s\le t}|W_{t_0+s} - W_{t_0}| > \alpha\Bigr] \le 2P[|W_{t_0+t} - W_{t_0}| > \alpha]$.

From (37.9) it follows in the same way that $P[\sup_{s\le t}W_s > \alpha] \le 2P[W_t > \alpha]$ for $\alpha > 0$. On the other hand,

$$P\Bigl[\sup_{s\le t}W_s > \alpha\Bigr] \ge P\Bigl[\max_{k\le 2^m}W(k2^{-m}t) > \alpha\Bigr] \ge 2P[W_t > \alpha + 2\epsilon] - \sum_{k=1}^{2^m}P\bigl[W(k2^{-m}t) - W((k-1)2^{-m}t) > \epsilon\bigr].$$

Since the last term here is by Markov's inequality at most $3\epsilon^{-4}2^{-m}t^2$, letting $m \to \infty$ and then $\epsilon \to 0$ leads to

(37.16) $\displaystyle P\Bigl[\sup_{s\le t}W_s > \alpha\Bigr] = 2P[W_t > \alpha], \qquad \alpha > 0.$
Measurable Processes

Let $T$ be a Borel set on the line, let $[X_t\colon t \in T]$ be a stochastic process on an $(\Omega, \mathscr{F}, P)$, and consider the mapping

(37.17) $(t,\omega) \mapsto X(t,\omega)$,

carrying $T \times \Omega$ into $R^1$. Let $\mathscr{T}$ be the $\sigma$-field of Borel subsets of $T$. The process is said to be measurable if the mapping (37.17) is measurable $\mathscr{T}\times\mathscr{F}/\mathscr{R}^1$.

In the presence of measurability, each sample path $X(\cdot,\omega)$ is measurable $\mathscr{T}$ by Theorem 18.1. Then, for example, $\int_a^b\varphi(X(t,\omega))\,dt$ makes sense if $(a,b) \subset T$ and $\varphi$ is a Borel function, and by Fubini's theorem $E[\int_a^b\varphi(X(t,\cdot))\,dt] = \int_a^bE[\varphi(X_t)]\,dt$ if $\int_a^bE[|\varphi(X_t)|]\,dt < \infty$. Hence the usefulness of this result:

Theorem 37.2. Brownian motion is measurable.

PROOF. If

$$W^{(n)}(t,\omega) = W(k2^{-n},\omega) \quad\text{for } k2^{-n} \le t < (k+1)2^{-n},\ k = 0,1,2,\dots,$$

then the mapping $(t,\omega) \to W^{(n)}(t,\omega)$ is measurable $\mathscr{T}\times\mathscr{F}$. But by the continuity of the sample paths, this mapping converges to the mapping (37.17) pointwise (for every $(t,\omega)$), and so by Theorem 13.4(ii) the latter mapping is also measurable $\mathscr{T}\times\mathscr{F}/\mathscr{R}^1$. ∎
Irregularity of Brownian Motion Paths

Starting with a Brownian motion $[W_t\colon t \ge 0]$, define

(37.18) $W_t' = \dfrac{1}{c}W_{c^2t}$,

where $c > 0$. Since $t \to c^2t$ is an increasing function, it is easy to see that the process $[W_t'\colon t \ge 0]$ has independent increments. Moreover, $W_t' - W_s' = c^{-1}(W_{c^2t} - W_{c^2s})$, and for $s \le t$ this is normally distributed with mean 0 and variance $c^{-2}(c^2t - c^2s) = t - s$. Since the paths $W'(\cdot,\omega)$ all start from 0 and are continuous, $[W_t'\colon t \ge 0]$ is another Brownian motion. In (37.18) the time scale is contracted by the factor $c^2$ but the other scale only by the factor $c$.

That the transformation (37.18) preserves the properties of Brownian motion implies that the paths, although continuous, must be highly irregular. It seems intuitively clear that for $c$ large enough the path $W(\cdot,\omega)$ must with probability nearly 1 have somewhere in the time interval $[0,c]$ a chord with slope exceeding, say, 1. But then $W'(\cdot,\omega)$ has in $[0,c^{-1}]$ a chord with slope exceeding $c$. Since the $W_t'$ are distributed as the $W_t$, this makes it plausible that $W(\cdot,\omega)$ must in arbitrarily small intervals $[0,\delta]$ have chords with arbitrarily great slopes, which in turn makes it plausible that $W(\cdot,\omega)$ cannot be differentiable at 0. More generally, mild irregularities in the path will become ever more extreme under the transformation (37.18) with ever larger values of $c$. It is shown below that, in fact, the paths are with probability 1 nowhere differentiable.

Also interesting in this connection is the transformation

(37.19) $W_t'' = \begin{cases} tW_{1/t} & \text{if } t > 0,\\ 0 & \text{if } t = 0.\end{cases}$

Again it is easily checked that the increments are independent and normally distributed with the means and variances appropriate to Brownian motion. Moreover, the path $W''(\cdot,\omega)$ is continuous except possibly at $t = 0$. But (37.12) holds with $W''$ in place of $W$ because it depends only on the finite-dimensional distributions, and by the continuity of $W''(\cdot,\omega)$ over $(0,\infty)$ the supremum there is the same if not restricted to dyadic rationals. Therefore, $P[\sup_{s\le n^{-3}}|W_s''| > n^{-1}] \le 6/n^2$, and it follows by the first Borel–Cantelli lemma that $W''(\cdot,\omega)$ is continuous also at 0 for $\omega$ outside a set $M$ of probability 0. For $\omega \in M$, redefine $W''(t,\omega) \equiv 0$; then $[W_t''\colon t \ge 0]$ is a Brownian motion and (37.19) holds with probability 1.

The behavior of $W(\cdot,\omega)$ near 0 can be studied through the behavior of $W''(\cdot,\omega)$ near $\infty$ and vice versa. Since $W_t''/t = W_{1/t}$, $W''(\cdot,\omega)$ cannot have a derivative at 0 if $W(\cdot,\omega)$ has no limit at $\infty$. Now, in fact,

(37.20) $\inf_nW_n = -\infty, \qquad \sup_nW_n = +\infty$

with probability 1.
To prove this, note that $W_n = X_1 + \cdots + X_n$, where the $X_k = W_k - W_{k-1}$ are independent. Put

(37.21) $\displaystyle \Bigl[\sup_nW_n < \infty\Bigr] = \bigcup_{u=1}^{\infty}\bigcap_{m=1}^{\infty}\Bigl[\max_{i\le m}W_i < u\Bigr];$

this is a tail set and hence by the zero-one law has probability 0 or 1. Now $-X_1, -X_2, \dots$ have the same joint distributions as $X_1, X_2, \dots$, and so (37.21) has the same probability as

(37.22) $\displaystyle \Bigl[\inf_nW_n > -\infty\Bigr] = \bigcup_{u=1}^{\infty}\bigcap_{m=1}^{\infty}\Bigl[\max_{i\le m}(-W_i) < u\Bigr].$

If these two sets have probability 1, so has $[\sup_n|W_n| < \infty]$, so that $P[\sup_n|W_n| < x] > 0$ for some $x$. But $P[|W_n| < x] = P[|W_1| < x/n^{1/2}] \to 0$. This proves (37.20).
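The scaling transformation (37.18) can also be probed numerically: each value of $W_t' = c^{-1}W_{c^2t}$ should again be distributed as $N(0,t)$. The constants, sample size, and tolerance below are arbitrary choices for a seeded sketch.

```python
import random

rng = random.Random(11)
c, t, n = 3.0, 0.7, 20000

def W_at(u, rng):
    # A single Brownian value W_u, distributed N(0, u).
    return rng.gauss(0.0, u ** 0.5)

# W'_t = c^{-1} W_{c^2 t}; its empirical variance should be close to t.
vals = [W_at(c * c * t, rng) / c for _ in range(n)]
var = sum(v * v for v in vals) / n
```

The contraction of time by $c^2$ against the contraction of space by $c$ is what makes the chord-slope argument above work, and it foreshadows the nowhere-differentiability proved next.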
Since (37.20) holds with probability 1, $W''(\cdot,\omega)$ has with probability 1 upper and lower right derivatives of $+\infty$ and $-\infty$ at $t = 0$. The same must be true of every Brownian motion. A similar argument shows that, for each fixed $t$, $W(\cdot,\omega)$ is nondifferentiable at $t$ with probability 1. In fact, $W(\cdot,\omega)$ is nowhere differentiable:

Theorem 37.3. For $\omega$ outside a set of probability 0, $W(\cdot,\omega)$ is nowhere differentiable.

PROOF. The proof is direct; it makes no use of the transformations (37.18) and (37.19). Let

(37.23) $\displaystyle X_{nk} = \max\Bigl\{\Bigl|W\Bigl(\frac{k+1}{2^n}\Bigr) - W\Bigl(\frac{k}{2^n}\Bigr)\Bigr|,\ \Bigl|W\Bigl(\frac{k+2}{2^n}\Bigr) - W\Bigl(\frac{k+1}{2^n}\Bigr)\Bigr|,\ \Bigl|W\Bigl(\frac{k+3}{2^n}\Bigr) - W\Bigl(\frac{k+2}{2^n}\Bigr)\Bigr|\Bigr\}.$

By independence and the fact that the differences here have the distribution of $2^{-n/2}W_1$, $P[X_{nk} \le \epsilon] = P^3[|W_1| \le 2^{n/2}\epsilon]$; since the standard normal density is bounded by 1, $P[X_{nk} \le \epsilon] \le (2\cdot 2^{n/2}\epsilon)^3$. If $Y_n = \min_{k\le n2^n}X_{nk}$, then

(37.24) $P[Y_n \le \epsilon] \le n2^n(2\cdot 2^{n/2}\epsilon)^3$.

Consider now the upper and lower right-hand derivatives

$$\overline{D}W(t,\omega) = \limsup_{h\downarrow 0}\,(W(t+h,\omega) - W(t,\omega))/h, \qquad \underline{D}W(t,\omega) = \liminf_{h\downarrow 0}\,(W(t+h,\omega) - W(t,\omega))/h.$$

Define $E$ as the set of $\omega$ such that $\overline{D}W(t,\omega)$ and $\underline{D}W(t,\omega)$ are both finite for some value of $t$. Suppose that $\omega$ lies in $E$, and suppose specifically that $\overline{D}W(t,\omega)$ and $\underline{D}W(t,\omega)$ are both less than $K$ in absolute value. There exists a positive $\delta$ (depending on $\omega$, $t$, and $K$) such that $t \le s \le t + \delta$ implies $|W(s,\omega) - W(t,\omega)| \le K|s - t|$. If $n$ exceeds some $n_0$ (depending on $\delta$, $K$, and $t$), then $8K \le n$, $n > t$, and $4\cdot 2^{-n} < \delta$. Given such an $n$, choose $k$ so that $(k-1)2^{-n} \le t < k2^{-n}$. Then $|i2^{-n} - t| < \delta$ for $i = k, k+1, k+2, k+3$, and therefore $X_{nk}(\omega) \le 2K(4\cdot 2^{-n}) \le n2^{-n}$. Since $k - 1 \le t2^n < n2^n$, $Y_n(\omega) \le n2^{-n}$. What has been shown is that if $\omega$ lies in $E$, then $\omega$ lies in $A_n = [Y_n \le n2^{-n}]$ for all sufficiently large $n$: $E \subset \liminf_nA_n$. By (37.24),

$$P(A_n) \le n2^n\bigl(2\cdot 2^{n/2}\cdot n2^{-n}\bigr)^3 = 8n^42^{-n/2} \to 0.$$
By Theorem 4.1, lim inf_n A_n has probability 0, and outside this set W(·, ω) is nowhere differentiable; in fact, nowhere does it have finite upper and lower right-hand derivatives. (Similarly, outside a set of probability 0, nowhere does W(·, ω) have finite upper and lower left-hand derivatives.) ∎

If A is the set of ω for which W(·, ω) has a derivative somewhere, what has been shown is that A ⊂ B for a measurable B such that P(B) = 0; P(A) = 0 if A is measurable, but this has not been proved. To avoid such problems in the study of continuous-time processes, it is convenient to work in a complete probability space. The space (Ω, ℱ, P) is complete (see p. 39) if A ⊂ B, B ∈ ℱ, and P(B) = 0 together imply that A ∈ ℱ (and then, of course, P(A) = 0). If the space is not already complete, it is possible to enlarge ℱ to a new σ-field and extend P to it in such a way that the new space is complete. The following assumption therefore entails no loss of generality: For the rest of this section the space (Ω, ℱ, P) on which the Brownian motion is defined is assumed complete. Theorem 37.3 now becomes: W(·, ω) is with probability 1 nowhere differentiable.

A nowhere-differentiable path represents the motion of a particle that at no time has a velocity. Since a function of bounded variation is differentiable almost everywhere (Section 31), W(·, ω) is of unbounded variation with probability 1. Such a path represents the motion of a particle that in its wanderings back and forth travels an infinite distance. The Brownian motion model thus does not in its fine structure represent physical reality.

The irregularity of the Brownian motion paths is of considerable mathematical interest, however. A continuous, nowhere-differentiable function is regarded as pathological, or used to be, but from the Brownian-motion point of view such functions are the rule, not the exception.† The set of zeros of the Brownian motion is also interesting.
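The unbounded variation can be illustrated numerically (a Monte Carlo sketch, not from the text; the helper name `dyadic_variation` is made up): over the dyadic grid of width 2^{−n} the increments of W are i.i.d. N(0, 2^{−n}), so the summed absolute increments have mean 2^{n/2}·E|W_1| = 2^{n/2}·√(2/π), which grows without bound as the grid is refined (compare Problem 37.5 below).

```python
import random

def dyadic_variation(n, rng):
    """Sum of |W(i/2^n) - W((i-1)/2^n)| along one sample path;
    the 2^n increments are i.i.d. N(0, 2^-n)."""
    sd = 2.0 ** (-n / 2.0)
    return sum(abs(rng.gauss(0.0, sd)) for _ in range(2 ** n))

rng = random.Random(0)
Y = {n: dyadic_variation(n, rng) for n in (4, 7, 10)}
# E[Y[n]] = 2^{n/2} * sqrt(2/pi), while Var Y[n] = Var|W_1| stays bounded,
# so the measured variation grows by a factor ~2^3 from n = 4 to n = 10
```

Refining the partition increases the measured variation without bound, the discrete shadow of the almost-sure unbounded variation asserted above.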
By property (iv), t = 0 is a zero of W(·, ω) for each ω. Now [W''_t: t ≥ 0] as defined by (37.19) is another Brownian motion, and so by (37.20) the sequence {W''_n: n = 1, 2, ...} = {nW_{1/n}: n = 1, 2, ...} has supremum +∞ and infimum −∞ for ω outside a set of probability 0; for such an ω, W(·, ω) changes

†For the construction of a specific example, see Problem 31.18.
sign infinitely often near 0 and hence by continuity has zeros arbitrarily near 0. Let Z(ω) denote the set of zeros of W(·, ω). What has just been shown is that 0 ∈ Z(ω) for each ω and that 0 is with probability 1 a limit of positive points in Z(ω). More is true:
Theorem 37.4. The set Z(ω) is with probability 1 perfect,† nowhere dense, and of Lebesgue measure 0.

PROOF. Since W(·, ω) is continuous, Z(ω) is closed for every ω. Let λ denote Lebesgue measure. Since Brownian motion is measurable (Theorem 37.2), Fubini's theorem applies:

∫_Ω λ(Z(ω)) P(dω) = (λ × P)[(t, ω): W(t, ω) = 0] = ∫_0^∞ P[W(t, ω) = 0] dt = 0.
Thus λ(Z(ω)) = 0 with probability 1.

If W(·, ω) is nowhere differentiable, it cannot vanish throughout an interval I and hence must by continuity be nonzero throughout some subinterval of I. By Theorem 37.3, then, Z(ω) is with probability 1 nowhere dense.

It remains to show that each point of Z(ω) is a limit of other points of Z(ω). As observed above, this is true of the point 0 of Z(ω). For the general point of Z(ω), a stopping-time argument is required. Fix r ≥ 0 and let T(ω) = inf[t: t ≥ r, W(t, ω) = 0]; note that this set is with probability 1 nonempty by (37.20). Thus T(ω) is the first zero following r. Now

[ω: T(ω) ≤ t] = [ω: inf_{r≤s≤t} |W(s, ω)| = 0],

and by continuity the infimum here is unchanged if s is restricted to rationals. This shows that T is a random variable and that

[ω: T(ω) ≤ t] ∈ σ[W_s: s ≤ t],  t ≥ 0.

A nonnegative random variable with this property is called a stopping time. To know the value of T is to know at most the values of W_u for u ≤ T. Since the increments are independent, it therefore seems intuitively clear

†See A15.
that the process

(37.25)  W*_t = W_{T+t} − W_T,  t ≥ 0,

ought itself to be a Brownian motion. This is, in fact, true by the next result, Theorem 37.5 below. What is proved there is that the finite-dimensional distributions of [W*_t: t ≥ 0] are the right ones for Brownian motion. The other properties are obvious: W*(·, ω) is continuous and vanishes at 0 by construction, and the space on which [W*_t: t ≥ 0] is defined is complete because it is the original space (Ω, ℱ, P), assumed complete.

If [W*_t: t ≥ 0] is indeed a Brownian motion, then, as observed above, for ω outside a set B_r of probability 0 there is a positive sequence {t_n} such that t_n → 0 and W*(t_n, ω) = 0. But then W(T(ω) + t_n, ω) = 0, so that T(ω), a zero of W(·, ω), is the limit of other larger zeros of W(·, ω). Now T(ω) was the first zero following r. (There is a different stopping time T for each r, but the notation does not show this.) If B is the union of the B_r for rational r, then P(B) = 0. Moreover, if ω lies outside B, then for every rational r, the first point of Z(ω) following r is a limit of other larger points of Z(ω).

Suppose that ω ∉ B and t ∈ Z(ω), where t > 0; it is to be shown that t is a limit of other points of Z(ω). If t is the limit of smaller points of Z(ω), there is of course nothing to prove. Otherwise, there is a rational r such that r < t and W(·, ω) does not vanish in [r, t); but then, since ω ∉ B_r, t is a limit of larger points s that lie in Z(ω). This completes the proof of Theorem 37.4 under the provisional assumption that (37.25) is a Brownian motion. ∎

The Strong Markov Property
Fix t_0 ≥ 0 and put

(37.26)  W'_t = W_{t_0+t} − W_{t_0},  t ≥ 0.

It is easily checked that [W'_t: t ≥ 0] has the finite-dimensional distributions appropriate to Brownian motion. As the other properties are obvious, it is in fact a Brownian motion. Let

(37.27)  ℱ_t = σ[W_s: s ≤ t].
The random variables (37.26) are independent of ℱ_{t_0}. To see this, suppose that 0 ≤ s_1 ≤ ⋯ ≤ s_j ≤ t_0 and 0 ≤ t_1 ≤ ⋯ ≤ t_k. Put u_i = t_0 + t_i. Since the increments are independent,

(W'_{t_1}, W'_{t_2} − W'_{t_1}, ..., W'_{t_k} − W'_{t_{k−1}}) = (W_{u_1} − W_{t_0}, W_{u_2} − W_{u_1}, ..., W_{u_k} − W_{u_{k−1}})

is independent of (W_{s_1}, W_{s_2} − W_{s_1}, ..., W_{s_j} − W_{s_{j−1}}). But then (W'_{t_1}, W'_{t_2}, ..., W'_{t_k}) is independent of (W_{s_1}, W_{s_2}, ..., W_{s_j}). By Theorem 4.2, (W'_{t_1}, ..., W'_{t_k}) is independent of ℱ_{t_0}. Thus

(37.28)  P([(W'_{t_1}, ..., W'_{t_k}) ∈ H] ∩ A) = P[(W'_{t_1}, ..., W'_{t_k}) ∈ H] P(A) = P[(W_{t_1}, ..., W_{t_k}) ∈ H] P(A),  A ∈ ℱ_{t_0},

where the second equality follows because (37.26) is a Brownian motion. This holds for all H in ℛ^k.

The problem now is to prove all this when t_0 is replaced by a stopping time T, a nonnegative random variable for which

(37.29)  [ω: T(ω) ≤ t] ∈ ℱ_t,  t ≥ 0.

It will be assumed that T is finite, at least with probability 1. Since [T = t] = [T ≤ t] − ⋃_n [T ≤ t − n^{−1}], (37.29) implies that

(37.30)  [ω: T(ω) = t] ∈ ℱ_t,  t ≥ 0.
The conditions (37.29) and (37.30) are analogous to the conditions (7.18) and (35.18), which prevent prevision on the part of the gambler. Now ℱ_{t_0} contains the information on the past of the Brownian motion up to time t_0, and the analogue for T is needed. Let ℱ_T consist of all measurable sets M for which

(37.31)  M ∩ [ω: T(ω) ≤ t] ∈ ℱ_t

for all t. (See (35.26) for an analogue in discrete time.) Note that ℱ_T is a σ-field and T is measurable ℱ_T. Since M ∩ [T = t] = M ∩ [T ≤ t] ∩ [T = t],

(37.32)  M ∩ [ω: T(ω) = t] ∈ ℱ_t

for M in ℱ_T. For example, T = inf[t: W_t = 1] is a stopping time, and [inf_{s≤T} W_s > −1] is in ℱ_T.
Theorem 37.5. Let T be a stopping time and put

(37.33)  W*_t = W_{T+t} − W_T,  t ≥ 0.

Then [W*_t: t ≥ 0] is a Brownian motion, and it is independent of ℱ_T; that is, σ[W*_t: t ≥ 0] is independent of ℱ_T:

(37.34)  P([(W*_{t_1}, ..., W*_{t_k}) ∈ H] ∩ M) = P[(W*_{t_1}, ..., W*_{t_k}) ∈ H] P(M) = P[(W_{t_1}, ..., W_{t_k}) ∈ H] P(M)

for H in ℛ^k and M in ℱ_T.

That the transformation (37.33) preserves Brownian motion is the strong Markov property.† Part of the conclusion is that the W*_t are random variables.
PROOF. Suppose first that T has countable range V, and let t_0 be the general point of V. Since

[ω: W*_t(ω) ∈ H] = ⋃_{t_0∈V} [ω: W_{t_0+t}(ω) − W_{t_0}(ω) ∈ H, T(ω) = t_0],

W*_t is a random variable. Also,

P([(W*_{t_1}, ..., W*_{t_k}) ∈ H] ∩ M) = Σ_{t_0∈V} P([(W*_{t_1}, ..., W*_{t_k}) ∈ H] ∩ M ∩ [T = t_0]).

If M ∈ ℱ_T, then M ∩ [T = t_0] ∈ ℱ_{t_0} by (37.32). Further, if T = t_0, then W*_t coincides with W'_t as defined by (37.26). Therefore, (37.28) reduces this last sum to

Σ_{t_0∈V} P[(W_{t_1}, ..., W_{t_k}) ∈ H] P(M ∩ [T = t_0]) = P[(W_{t_1}, ..., W_{t_k}) ∈ H] P(M).

This proves the first and third terms in (37.34) equal; to prove equality with the middle term, simply consider the case M = Ω.

†Since the Brownian motion has independent increments, it is a Markov process (see Examples 33.9 and 33.10); hence the terminology.
Thus the theorem holds if T has countable range. For the general T, put

(37.35)  τ_n = k2^{−n} if (k − 1)2^{−n} < T ≤ k2^{−n}, k = 1, 2, ...;  τ_n = 0 if T = 0.

If k2^{−n} ≤ t < (k + 1)2^{−n}, then [τ_n ≤ t] = [T ≤ k2^{−n}] ∈ ℱ_{k2^{−n}} ⊂ ℱ_t. Thus each τ_n is a stopping time. Suppose that M ∈ ℱ_T and k2^{−n} ≤ t < (k + 1)2^{−n}. Then M ∩ [τ_n ≤ t] = M ∩ [T ≤ k2^{−n}] ∈ ℱ_{k2^{−n}} ⊂ ℱ_t. Thus ℱ_T ⊂ ℱ_{τ_n}.

Let W^{(n)}_t(ω) = W_{τ_n(ω)+t}(ω) − W_{τ_n(ω)}(ω); that is, let W^{(n)}_t be the W*_t corresponding to the stopping time τ_n. If M ∈ ℱ_T, then M ∈ ℱ_{τ_n}, and by an application of (37.34) to the discrete case already treated,

(37.36)  P([(W^{(n)}_{t_1}, ..., W^{(n)}_{t_k}) ∈ H] ∩ M) = P[(W_{t_1}, ..., W_{t_k}) ∈ H] P(M).

But τ_n(ω) → T(ω) for each ω, and by continuity of the sample paths W^{(n)}_t(ω) → W*_t(ω) for each ω. Condition on M and apply Lemma 2(i) with (W^{(n)}_{t_1}, ..., W^{(n)}_{t_k}) for X_n, (W*_{t_1}, ..., W*_{t_k}) for X, and the distribution function of (W_{t_1}, ..., W_{t_k}) for F = F_n. Therefore, (37.34) follows from (37.36). ∎

The T in the proof of Theorem 37.4 is a stopping time, and so (37.25) is a Brownian motion, as required in that proof. Further applications will be given below.

If ℱ* = σ[W*_t: t ≥ 0], then according to (37.34) (and Theorem 4.2) the σ-fields ℱ_T and ℱ* are independent:

(37.37)  P(A ∩ B) = P(A)P(B),  A ∈ ℱ_T, B ∈ ℱ*.

For fixed t define τ_n by (37.35) but with t2^{−n} in place of 2^{−n} at each occurrence. Then [W_T ≤ x] ∩ [T ≤ t] is the limit superior of the sets [W_{τ_n} ≤ x] ∩ [T ≤ t], each of which lies in ℱ_t. This proves that [W_T ≤ x] ∩ [T ≤ t] lies in ℱ_t and hence that W_T is measurable ℱ_T. Since T is measurable ℱ_T,

(37.38)  [ω: (T(ω), W_T(ω)) ∈ H] ∈ ℱ_T

for planar Borel sets H.
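The discretization (37.35) can be sketched directly (the helper name `dyadic_round` is made up): τ_n rounds τ up to the grid of width 2^{−n}, so the approximations decrease to τ from above.

```python
import math

def dyadic_round(tau, n):
    """tau_n of (37.35): the smallest k/2^n with (k-1)/2^n < tau <= k/2^n,
    and 0 when tau == 0."""
    if tau == 0:
        return 0.0
    return math.ceil(tau * 2 ** n) / 2 ** n

taus = [dyadic_round(0.3, n) for n in range(1, 12)]
# each tau_n is >= 0.3, the sequence is nonincreasing, and it converges to 0.3
```

Since ceil(2x) ≤ 2·ceil(x), the sequence is automatically nonincreasing, which is what lets the countable-range case pass to the limit.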
Skorohod Embedding*
*The rest of this section, which requires martingale theory, may be omitted.

Suppose that X_1, X_2, ... are independent and identically distributed random variables with mean 0 and variance σ². A powerful method, due to Skorohod, of studying the partial sums S_n = X_1 + ⋯ + X_n is to construct an increasing sequence τ_0 = 0, τ_1, τ_2, ... of stopping times such that W(τ_n) has the same distribution as S_n. The differences τ_k − τ_{k−1} will turn out to be independent and identically distributed with mean σ², so that by the law of large numbers n^{−1}τ_n = n^{−1}Σ_{k=1}^n (τ_k − τ_{k−1}) is likely to be near σ². But if τ_n is near nσ², then by the continuity of Brownian motion paths W(τ_n) will be near W(nσ²), and so the distribution of S_n/σ√n, which coincides with the distribution of W(τ_n)/σ√n, will be near the distribution of W(nσ²)/σ√n; that is, it will be near the standard normal distribution.

The method will thus yield another proof of the central limit theorem, one independent of the characteristic-function arguments of Section 27. But it will also give more. For example, the distribution of max_{k≤n} S_k/σ√n is exactly the distribution of max_{k≤n} W(τ_k)/σ√n, and this in turn is near the distribution of sup_{t≤nσ²} W(t)/σ√n, which can be written down explicitly because of (37.16). It will thus be possible to derive the limiting distribution of max_{k≤n} S_k. The joint behavior of the partial sums is closely related to the behavior of Brownian motion paths.

The Skorohod construction involves the class 𝒯 of stopping times T for which

(37.39)  E[W_T] = 0,
(37.40)  E[T] = E[W_T²],

and

(37.41)  E[T²] ≤ 4E[W_T⁴].

Lemma 3. All bounded stopping times are members of 𝒯.

PROOF. Define Y_{θ,t} = exp(θW_t − ½θ²t) for all θ and for t ≥ 0. Suppose that s ≤ t and A ∈ ℱ_s (see (37.27)). Since Brownian motion has independent increments,
∫_A Y_{θ,t} dP = e^{−θ²(t−s)/2} ∫_A Y_{θ,s} e^{θ(W_t−W_s)} dP = ∫_A Y_{θ,s} dP · e^{−θ²(t−s)/2} E[e^{θ(W_t−W_s)}],

and a calculation with moment generating functions (see Example 21.2) shows that

(37.42)  ∫_A Y_{θ,s} dP = ∫_A Y_{θ,t} dP,  s ≤ t, A ∈ ℱ_s.

This says that for θ fixed, [Y_{θ,t}: t ≥ 0] is a continuous-time martingale adapted to the σ-fields ℱ_t. It is the moment generating function martingale associated with the Brownian motion.

Let f(θ, t) denote the right side of (37.42). By Theorem 16.8,

∂f(θ, t)/∂θ = ∫_A Y_{θ,t}(W_t − θt) dP,
∂²f(θ, t)/∂θ² = ∫_A Y_{θ,t}[(W_t − θt)² − t] dP,

and so on. Differentiate the other side of the equation (37.42) the same way and set θ = 0. The result is

∫_A W_s dP = ∫_A W_t dP,  s ≤ t, A ∈ ℱ_s,
∫_A (W_s² − s) dP = ∫_A (W_t² − t) dP,  s ≤ t, A ∈ ℱ_s,
∫_A (W_s⁴ − 6W_s²s + 3s²) dP = ∫_A (W_t⁴ − 6W_t²t + 3t²) dP,  s ≤ t, A ∈ ℱ_s.
This gives three more martingales: if Z_t is any of the three random variables

(37.43)  W_t,  W_t² − t,  W_t⁴ − 6W_t²t + 3t²,

then Z_0 = 0, Z_t is integrable and measurable ℱ_t, and

(37.44)  ∫_A Z_s dP = ∫_A Z_t dP,  s ≤ t, A ∈ ℱ_s.

In particular, E[Z_t] = E[Z_0] = 0.
If T is a stopping time with finite range {t_1, ..., t_m} bounded by t, then (37.44) implies that

E[Z_T] = Σ_i ∫_{[T=t_i]} Z_{t_i} dP = Σ_i ∫_{[T=t_i]} Z_t dP = E[Z_t] = 0.

Suppose that T is bounded by t but does not necessarily have finite range. Put τ_n = k2^{−n}t if (k − 1)2^{−n}t < T ≤ k2^{−n}t, 1 ≤ k ≤ 2^n, and put τ_n = 0 if T = 0. Then τ_n is a stopping time and E[Z_{τ_n}] = 0. For each of the three possibilities (37.43) for Z_t, sup_{s≤t} |Z_s| is integrable because of (37.15). It therefore follows by the dominated convergence theorem that E[Z_T] = lim_n E[Z_{τ_n}] = 0. Thus E[Z_T] = 0 for every bounded stopping time T. The three cases (37.43) give

E[W_T] = E[W_T² − T] = E[W_T⁴ − 6W_T²T + 3T²] = 0.

This implies (37.39), (37.40), and, by the Cauchy–Schwarz inequality,

0 = E[W_T⁴] − 6E[W_T²T] + 3E[T²] ≥ E[W_T⁴] − 6E^{1/2}[W_T⁴]E^{1/2}[T²] + 3E[T²].

If C = E^{1/2}[W_T⁴] and x = E^{1/2}[T²], the inequality is 0 ≥ q(x) = 3x² − 6Cx + C². Each zero of q is at most 2C, and q is negative only between these two zeros. Therefore, x ≤ 2C, which implies (37.41). ∎
Lemma 4. Suppose that T and T_n are stopping times, that each T_n is a member of 𝒯, and that T_n → T with probability 1. Then T is a member of 𝒯 if (i) E[W_{T_n}⁴] ≤ E[W_T⁴] < ∞ for all n, or if (ii) the W_{T_n}⁴ are uniformly integrable.

PROOF. Since Brownian motion paths are continuous, W_{T_n} → W_T with probability 1. Each of the two hypotheses (i) and (ii) implies that E[W_{T_n}⁴] is bounded and hence that E[T_n²] is bounded, and it follows (see (16.23)) that the sequences {T_n}, {W_{T_n}}, and {W_{T_n}²} are uniformly integrable. Hence (37.39) and (37.40) for T follow by Theorem 16.13 from the same relations for the T_n. The first hypothesis implies that lim inf_n E[W_{T_n}⁴] ≤ E[W_T⁴], and the second implies that lim_n E[W_{T_n}⁴] = E[W_T⁴]. In either case it follows by Fatou's lemma that E[T²] ≤ lim inf_n E[T_n²] ≤ 4 lim inf_n E[W_{T_n}⁴] ≤ 4E[W_T⁴]. ∎
Suppose that a, b ≥ 0 and a + b > 0, and let T(a, b) be the hitting time for the set {−a, b}: T(a, b) = inf[t: W_t ∈ {−a, b}]. By (37.20), T(a, b) is finite with probability 1, and it is a stopping time because T(a, b) ≤ t if and only if for every m there is a rational r ≤ t for which W_r is within m^{−1} of −a or of b. From |W(min{T(a, b), n})| ≤ max{a, b} it follows by Lemma 4(ii) that T(a, b) is a member of 𝒯. Since W_{T(a,b)} assumes only the values −a and b, E[W_{T(a,b)}] = 0 implies that

(37.45)  P[W_{T(a,b)} = −a] = b/(a + b),  P[W_{T(a,b)} = b] = a/(a + b).

This is obvious on grounds of symmetry in the case a = b.
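The ruin probabilities (37.45) have an exact discrete counterpart: a simple symmetric random walk started at 0 reaches −a before +b with probability b/(a + b) (the classical gambler's-ruin formula). A small simulation sketch, with a made-up helper name, for a = 2, b = 3:

```python
import random

def hits_minus_a_first(a, b, rng):
    """Run a simple symmetric random walk from 0 until it reaches
    -a or +b; report whether -a came first."""
    s = 0
    while -a < s < b:
        s += rng.choice((-1, 1))
    return s == -a

rng = random.Random(1)
trials = 20000
freq = sum(hits_minus_a_first(2, 3, rng) for _ in range(trials)) / trials
# discrete analogue of (37.45): b/(a+b) = 3/5
```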
Let μ be a probability measure on the line with mean 0. The program is to construct a stopping time T for which W_T has distribution μ. Assume that μ{0} < 1, since otherwise T = 0 obviously works. If μ consists of two point masses, they must for some positive a and b be a mass of b/(a + b) at −a and a mass of a/(a + b) at b; in this case T(a, b) is by (37.45) the required stopping time. The general case will be treated by adding together stopping times of this sort.

Consider a random variable X having distribution μ. (The probability space for X has nothing to do with the space the given Brownian motion is defined on.) The technique will be to represent X as the limit of a martingale X_1, X_2, ... of a simple form and then to duplicate the martingale by W_{τ_1}, W_{τ_2}, ... for stopping times τ_n; the τ_n will have a limit τ such that W_τ has the same distribution as X.

The first step is to construct sets

Δ_n:  a_1^{(n)} < a_2^{(n)} < ⋯ < a_{r_n}^{(n)}

and corresponding partitions

𝒫_n:  I_0^{(n)} = (−∞, a_1^{(n)}],  I_k^{(n)} = (a_k^{(n)}, a_{k+1}^{(n)}],  I_{r_n}^{(n)} = (a_{r_n}^{(n)}, ∞).

Let M(H) be the conditional mean:

M(H) = (1/μ(H)) ∫_H x μ(dx)  if μ(H) > 0.

Let Δ_1 consist of the single point M(R¹) = E[X] = 0, so that 𝒫_1 consists of I_0^{(1)} = (−∞, 0] and I_1^{(1)} = (0, ∞). Suppose that Δ_n and 𝒫_n are given. If
μ((I_k^{(n)})°) > 0, split I_k^{(n)} by adding to Δ_n the point M((I_k^{(n)})°), which lies in (I_k^{(n)})°; if μ((I_k^{(n)})°) = 0, I_k^{(n)} appears again in 𝒫_{n+1}. (Here (I)° denotes the interior of I.)

Let 𝒢_n be the σ-field generated by the sets [X ∈ I_k^{(n)}] and put X_n = E[X ∥ 𝒢_n]. Then X_1, X_2, ... is a martingale, and X_n = M(I_k^{(n)}) on [X ∈ I_k^{(n)}]. The X_n have finite range, and their joint distributions can be written out explicitly. In fact, [X_1 = M(I_{j_1}^{(1)}), ..., X_n = M(I_{j_n}^{(n)})] = [X ∈ I_{j_1}^{(1)}, ..., X ∈ I_{j_n}^{(n)}], and this set is empty unless I_{j_1}^{(1)} ⊃ ⋯ ⊃ I_{j_n}^{(n)}, in which case it is [X_n = M(I_{j_n}^{(n)})] = [X ∈ I_{j_n}^{(n)}]. Therefore, if k_{n−1} = j and I_j^{(n−1)} = I_{k−1}^{(n)} ∪ I_k^{(n)}, then

P[X_n = M(I_{k−1}^{(n)}) ∥ X_1 = M(I_{j_1}^{(1)}), ..., X_{n−1} = M(I_j^{(n−1)})] = μ(I_{k−1}^{(n)})/μ(I_j^{(n−1)})

and

P[X_n = M(I_k^{(n)}) ∥ X_1 = M(I_{j_1}^{(1)}), ..., X_{n−1} = M(I_j^{(n−1)})] = μ(I_k^{(n)})/μ(I_j^{(n−1)}),

provided the conditioning event has positive probability. Thus the martingale {X_n} has the Markov property, and if x = M(I_j^{(n−1)}), u = M(I_{k−1}^{(n)}), and v = M(I_k^{(n)}), then the conditional distribution of X_n given X_{n−1} = x is concentrated at the two points u and v and has mean x. The structure of {X_n} is determined by these conditional probabilities together with the distribution
of X_1.

If 𝒢 = σ(⋃_n 𝒢_n), then X_n → E[X ∥ 𝒢] with probability 1 by the martingale theorem (Theorem 35.5). But, in fact, X_n → X with probability 1, as the following argument shows. Let B be the union of all open sets of μ-measure 0. Then B is a countable disjoint union of open intervals; enlarge B by adding to it any endpoints of μ-measure 0 these intervals may have. Then μ(B) = 0, and x ∉ B implies that μ(x − ε, x] > 0 and μ[x, x + ε) > 0 for all positive ε.

Suppose that x = X(ω) ∉ B, and let x_n = X_n(ω). Let I^{(n)} be the element of 𝒫_n containing x; then x_{n+1} = M(I^{(n)}) and I^{(n)} ↓ I for some interval I. Suppose that x_{n+1} ≤ x − ε for n in an infinite sequence N of integers. Then x_{n+1} is the left endpoint of I^{(n+1)} for n in N and converges along N to the left endpoint, say a, of I, and (x − ε, x] ⊂ I. Further, x_{n+1} = M(I^{(n)}) → M(I) along N, so that M(I) = a. But this is impossible because μ(x − ε, x] > 0. Therefore, x_n > x − ε for large n. Similarly, x_n < x + ε for large n, and so x_n → x. Thus X_n(ω) → X(ω) if X(ω) ∉ B, the probability of which is 1.
Now X_1 = E[X ∥ 𝒢_1] has mean 0, and its distribution consists of point masses at −a = M(I_0^{(1)}) and b = M(I_1^{(1)}). If τ_1 = T(a, b) is the hitting time for {−a, b}, then (see (37.45)) τ_1 is a stopping time, a member of 𝒯, and W_{τ_1} has the same distribution as X_1.

Let τ_2 be the infimum of those t for which t ≥ τ_1 and W_t is one of the points M(I_k^{(2)}), 0 ≤ k ≤ r_2 + 1. By (37.20), τ_2 is finite with probability 1; it is a stopping time because τ_2 ≤ t if and only if for every m there are rationals r and s such that r ≤ s + m^{−1}, r ≤ t, s ≤ t, W_r is within m^{−1} of one of the points M(I_j^{(1)}), and W_s is within m^{−1} of one of the points M(I_k^{(2)}). Since |W(min{τ_2, n})| is at most the maximum of the values |M(I_k^{(2)})|, it follows by Lemma 4(ii) that τ_2 is a member of 𝒯.

Define W*_t by (37.33) with τ_1 for T. If x = M(I_j^{(1)}), then x is an endpoint common to two adjacent intervals I_{k−1}^{(2)} and I_k^{(2)}; put u = M(I_{k−1}^{(2)}) and v = M(I_k^{(2)}). If W_{τ_1} = x, then u and v are the only possible values of W_{τ_2}. If τ* is the first time the Brownian motion [W*_t: t ≥ 0] hits u − x or v − x, then by (37.45),

P[W*_{τ*} = u − x] = (v − x)/(v − u),  P[W*_{τ*} = v − x] = (x − u)/(v − u).
On the set [W_{τ_1} = x], τ_2 coincides with τ_1 + τ*, and it follows by (37.37) that

P[W_{τ_2} = v ∥ W_{τ_1} = x] = P[W*_{τ*} = v − x] = (x − u)/(v − u).

This, together with the same computation with u in place of v, shows that for W_{τ_1} = x the conditional distribution of W_{τ_2} is concentrated at the two points u and v and has mean x. Thus the conditional distribution of W_{τ_2} given W_{τ_1} coincides with the conditional distribution of X_2 given X_1. Since W_{τ_1} and X_1 have the same distribution, the random vectors (W_{τ_1}, W_{τ_2}) and (X_1, X_2) also have the same distribution.

An inductive extension of this argument proves the existence of a sequence of stopping times τ_n such that τ_1 ≤ τ_2 ≤ ⋯, each τ_n is a member of 𝒯, and for each n, W_{τ_1}, ..., W_{τ_n} have the same joint distribution as X_1, ..., X_n.

Now suppose that X has finite variance. Since τ_n is a member of 𝒯, E[τ_n] = E[W_{τ_n}²] = E[X_n²] = E[E²[X ∥ 𝒢_n]] ≤ E[X²] by Jensen's inequality (34.5). Thus τ = lim_n τ_n is finite with probability 1. Obviously it is a stopping time, and by path continuity, W_{τ_n} → W_τ with probability 1. Since X_n → X with probability 1, it follows by Lemma 2(ii) that W_τ has the
distribution of X. Since X_n² ≤ E[X² ∥ 𝒢_n], the X_n² are uniformly integrable by the lemma preceding Theorem 35.5. By the monotone convergence theorem and Theorem 16.13, E[τ] = lim_n E[τ_n] = lim_n E[W_{τ_n}²] = lim_n E[X_n²] = E[X²] = E[W_τ²]. If E[X⁴] < ∞, then E[W_{τ_n}⁴] = E[X_n⁴] ≤ E[X⁴] = E[W_τ⁴] (Jensen's inequality again), and so by Lemma 4(i) τ is a member of 𝒯. Hence E[τ²] ≤ 4E[W_τ⁴] = 4E[X⁴].

This construction establishes the first of Skorohod's embedding theorems:

Theorem 37.6. Suppose that X is a random variable with mean 0 and finite variance. There is a stopping time τ such that W_τ has the same distribution as X, E[τ] = E[X²], and E[τ²] ≤ 4E[X⁴].

Of course, the last inequality is trivial unless E[X⁴] is finite. The theorem could be stated in terms not of X but of its distribution, the point being that the probability space X is defined on is irrelevant. Skorohod's second embedding theorem is this:
Theorem 37.7. Suppose that X_1, X_2, ... are independent and identically distributed random variables with mean 0 and finite variance, and put S_n = X_1 + ⋯ + X_n. There is a nondecreasing sequence τ_1, τ_2, ... of stopping times such that the W_{τ_n} have the same joint distributions as the S_n and τ_1, τ_2 − τ_1, τ_3 − τ_2, ... are independent and identically distributed random variables satisfying E[τ_n − τ_{n−1}] = E[X_1²] and E[(τ_n − τ_{n−1})²] ≤ 4E[X_1⁴].
PROOF. The method is to repeat the construction above inductively. For notational clarity write W_t = W_t^{(1)} and put ℱ_t^{(1)} = σ[W_s^{(1)}: 0 ≤ s ≤ t] and 𝒢^{(1)} = σ[W_t^{(1)}: t ≥ 0]. Let δ_1 be the stopping time of Theorem 37.6, so that W_{δ_1}^{(1)} and X_1 have the same distribution. Let ℱ_{δ_1}^{(1)} be the class of M such that M ∩ [δ_1 ≤ t] ∈ ℱ_t^{(1)} for all t.

Now put W_s^{(2)} = W_{δ_1+s}^{(1)} − W_{δ_1}^{(1)}, ℱ_t^{(2)} = σ[W_s^{(2)}: 0 ≤ s ≤ t], and 𝒢^{(2)} = σ[W_t^{(2)}: t ≥ 0]. By another application of Theorem 37.6, construct a stopping time δ_2 for the Brownian motion [W_t^{(2)}: t ≥ 0] in such a way that W_{δ_2}^{(2)} has the same distribution as X_1. In fact, use for δ_2 the very same martingale construction as for δ_1, so that (δ_1, W_{δ_1}^{(1)}) and (δ_2, W_{δ_2}^{(2)}) have the same distribution. Since ℱ_{δ_1}^{(1)} and 𝒢^{(2)} are independent (see (37.37)), it follows (see (37.38)) that (δ_1, W_{δ_1}^{(1)}) and (δ_2, W_{δ_2}^{(2)}) are independent.

Let ℱ_{δ_2}^{(2)} be the class of M such that M ∩ [δ_2 ≤ t] ∈ ℱ_t^{(2)} for all t. If W_s^{(3)} = W_{δ_2+s}^{(2)} − W_{δ_2}^{(2)} and 𝒢^{(3)} is the σ-field generated by these random variables, then again ℱ_{δ_2}^{(2)} and 𝒢^{(3)} are independent. These two σ-fields are contained in 𝒢^{(2)}, which is independent of ℱ_{δ_1}^{(1)}. Therefore, the three σ-fields ℱ_{δ_1}^{(1)}, ℱ_{δ_2}^{(2)}, 𝒢^{(3)} are independent. The procedure therefore extends inductively to give independent, identically distributed random vectors (δ_n, W_{δ_n}^{(n)}). If τ_n = δ_1 + ⋯ + δ_n, then W_{τ_n}^{(1)} = W_{δ_1}^{(1)} + ⋯ + W_{δ_n}^{(n)} has the distribution of X_1 + ⋯ + X_n. ∎
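In the two-point case, (37.40) can be checked against the discrete analogue of T(a, b): for the embedded distribution putting mass b/(a + b) at −a and a/(a + b) at b, E[X²] = ab, and the expected number of steps for a simple symmetric random walk to leave (−a, b) is exactly ab. A simulation sketch (the helper name is made up):

```python
import random

def exit_time(a, b, rng):
    """Steps until a simple symmetric random walk from 0 first hits -a or +b."""
    s, steps = 0, 0
    while -a < s < b:
        s += rng.choice((-1, 1))
        steps += 1
    return steps

rng = random.Random(2)
a, b, trials = 2, 3, 20000
mean_T = sum(exit_time(a, b, rng) for _ in range(trials)) / trials
# E[T] = a*b = 6, matching E[X^2] for the embedded two-point law, as in (37.40)
```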
Invariance*
*This topic may be omitted.

If E[X_1²] = σ², then, since the random variables τ_n − τ_{n−1} of Theorem 37.7 are independent and identically distributed, the strong law of large numbers (Theorem 22.1) applies and hence so does the weak one:

(37.46)  P[|n^{−1}τ_n − σ²| ≥ ε] → 0.

(If E[X_1⁴] < ∞, so that the τ_n − τ_{n−1} have second moments, this follows immediately by Chebyshev's inequality.) Now S_n has the distribution of W(τ_n), and τ_n is near nσ² by (37.46); hence S_n should have nearly the distribution of W(nσ²), namely the normal distribution with mean 0 and variance nσ².

To prove this, choose an increasing sequence of integers N_k such that n ≥ N_k implies P[|n^{−1}τ_n − σ²| ≥ k^{−1}] ≤ k^{−1}, and put ε_n = k^{−1} for N_k ≤ n < N_{k+1}. Then ε_n → 0 and P[|n^{−1}τ_n − σ²| ≥ ε_n] ≤ ε_n. By two applications of (37.15),

δ_n(ε) = P[|W(nσ²) − W(τ_n)|/σ√n ≥ ε]
     ≤ P[|n^{−1}τ_n − σ²| ≥ ε_n] + P[sup_{|t−nσ²| ≤ ε_n n} |W(t) − W(nσ²)| ≥ εσ√n],

and it follows by Chebyshev's inequality that lim_n δ_n(ε) = 0. Since S_n is distributed as W(τ_n),

P[W(nσ²)/σ√n ≤ x − ε] − δ_n(ε) ≤ P[S_n/σ√n ≤ x] ≤ P[W(nσ²)/σ√n ≤ x + ε] + δ_n(ε).

Here W(nσ²)/σ√n can be replaced by a random variable N with the standard normal distribution, and letting n → ∞ and then ε → 0 shows that

lim_n P[S_n/σ√n ≤ x] = P[N ≤ x].
This gives a new proof of the central limit theorem for independent, identically distributed random variables with second moments (the Lindeberg–Lévy theorem, Theorem 27.1). Observe that none of the convergence theory of Chapter 5 has been used. This proof of the central limit theorem is an application of the invariance principle: S_n has nearly the distribution of W(nσ²), and the distribution of the latter does not depend on (vary with) the distribution common to the X_n.

More can be said if the X_n have fourth moments. For each n, define a stochastic process [Y_n(t): 0 ≤ t ≤ 1] by Y_n(0, ω) = 0 and

(37.47)  Y_n(t, ω) = S_k(ω)/σ√n  if (k − 1)/n < t ≤ k/n,  k = 1, ..., n.

If k/n = t > 0 and n is large, then k is large, too, and Y_n(t) = t^{1/2}S_k/σ√k is by the central limit theorem approximately normally distributed with mean 0 and variance t. Since the X_n are independent, the increments of (37.47) should be approximately independent, and so the process should behave approximately as a Brownian motion does.

Let τ_n be the stopping times of Theorem 37.7, and in analogy with (37.47) put Z_n(0) = 0 and

(37.48)  Z_n(t) = W(τ_k)/σ√n  if (k − 1)/n < t ≤ k/n,  k = 1, ..., n.
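The step-function definition (37.47) transcribes directly into code (a sketch; the helper name `y_n` is made up):

```python
import math, random

def y_n(xs, sigma):
    """Build the function t -> Y_n(t) of (37.47) from one sample X_1, ..., X_n."""
    n = len(xs)
    partial = [0.0]
    for x in xs:
        partial.append(partial[-1] + x)

    def Y(t):
        if t == 0:
            return 0.0
        k = math.ceil(t * n)           # the k with (k-1)/n < t <= k/n
        return partial[k] / (sigma * math.sqrt(n))

    return Y

rng = random.Random(6)
n = 1000
xs = [rng.choice((-1.0, 1.0)) for _ in range(n)]   # mean 0, sigma = 1
Y = y_n(xs, 1.0)
```

At t = k/n the value is S_k/σ√n, so Y interpolates the rescaled partial-sum path as a right-continuous step function.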
By construction, the finite-dimensional distributions of [Y_n(t): 0 ≤ t ≤ 1] coincide with those of [Z_n(t): 0 ≤ t ≤ 1]. It will be shown that the latter process nearly coincides with [W(tnσ²)/σ√n: 0 ≤ t ≤ 1], which is itself a Brownian motion over the time interval [0, 1]; see (37.18). Put W_n(t) = W(tnσ²)/σ√n. Let B_n(δ) be the event that |τ_k − kσ²| ≥ δnσ² for some k ≤ n. By Kolmogorov's inequality (22.9),

(37.49)  P(B_n(δ)) ≤ nE[(τ_1 − σ²)²]/(δnσ²)² = E[(τ_1 − σ²)²]/δ²nσ⁴.
If (k − 1)n^{−1} < t ≤ kn^{−1} and n > δ^{−1}, then

|Z_n(t) − W_n(t)| = |W_n(τ_k/nσ²) − W_n(t)| ≤ sup_{t≤1} sup_{|s−t|≤2δ} |W_n(s) − W_n(t)|

on (B_n(δ))^c. Since the distribution of this last random variable is unchanged if the W_n(t) are replaced by W(t),

P[sup_{t≤1} |Z_n(t) − W_n(t)| ≥ ε] ≤ P(B_n(δ)) + P[sup_{t≤1} sup_{|s−t|≤2δ} |W(s) − W(t)| ≥ ε].

Let n → ∞ and then δ → 0; it follows by (37.49) and the continuity of Brownian motion paths that

(37.50)  lim_n P[sup_{t≤1} |Z_n(t) − W_n(t)| ≥ ε] = 0

for positive ε. Since the processes (37.47) and (37.48) have the same finite-dimensional distributions, this proves the following general invariance principle or functional central limit theorem.
Theorem 37.8. Suppose that X_1, X_2, ... are independent, identically distributed random variables with mean 0, variance σ², and finite fourth moments, and define Y_n(t) by (37.47). There exist (on another probability space) for each n processes [Z_n(t): 0 ≤ t ≤ 1] and [W_n(t): 0 ≤ t ≤ 1] such that the first has the same finite-dimensional distributions as [Y_n(t): 0 ≤ t ≤ 1], the second is a Brownian motion, and P[sup_{t≤1} |Z_n(t) − W_n(t)| ≥ ε] → 0 for positive ε.
As an application, consider the maximum M_n = max_{k≤n} S_k. Now M_n/σ√n = sup_t Y_n(t) has the same distribution as sup_t Z_n(t), and it follows by (37.50) that

P[|sup_t Z_n(t) − sup_t W_n(t)| ≥ ε] → 0.

But P[sup_{t≤1} W_n(t) ≥ x] = P[sup_{t≤1} W(t) ≥ x] = 2P[N ≥ x] for x ≥ 0 by (37.16). Therefore,

(37.51)  lim_n P[M_n/σ√n ≥ x] = 2P[N ≥ x],  x ≥ 0.
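The limit law (37.51) is easy to probe by simulation (a sketch only; the helper name is made up). For a simple symmetric random walk, P[M_n/√n ≥ 1] should be near 2P[N ≥ 1] ≈ 0.317 for large n:

```python
import math, random

def max_partial_sum(n, rng):
    """Running maximum M_n of a simple symmetric random walk."""
    s = m = 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        if s > m:
            m = s
    return m

rng = random.Random(3)
n, trials = 400, 4000
freq = sum(max_partial_sum(n, rng) >= math.sqrt(n) for _ in range(trials)) / trials
# limit value from (37.51): 2 * P[N >= 1] = 0.3173...
```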
PROBLEMS

37.1. Let X(t) be independent, standard normal variables, one for each dyadic rational t in [0, 1] (Theorem 20.4; the unit interval can be used as the probability space). Let W(0) = 0 and W(1) = X(1). Suppose that W(t) is already defined for dyadic rationals in [0, 1] of rank n, and put

W((2k + 1)/2^{n+1}) = ½ W(k/2^n) + ½ W((k + 1)/2^n) + (1/2^{(n+2)/2}) X((2k + 1)/2^{n+1}).
Prove by induction that the W(t) for dyadic t have the finite-dimensional distributions prescribed for Brownian motion. Now construct a Brownian motion for 0 ≤ t ≤ 1 with continuous paths by the method of Theorem 37.1. This avoids an appeal to Kolmogorov's existence theorem.

37.2. 36.10↑ Let T = [0, ∞) and let P be a probability measure on (R^T, ℛ^T) having the finite-dimensional distributions prescribed for Brownian motion. Let C consist of the continuous elements of R^T.
(a) Show that P_*(C) = 0, or P*(R^T − C) = 1 (see (3.9) and (3.10)). Thus completing (R^T, ℛ^T, P) will not give C probability 1.
(b) Show that P*(C) = 1.

37.3. Suppose that [W_t: t ≥ 0] is some stochastic process having independent, stationary increments satisfying E[W_t] = 0 and E[W_t²] = t. Show that if the finite-dimensional distributions are preserved by the transformation (37.18), then they must be those of Brownian motion.

37.4. Show that ⋂_{t>0} σ[W_s: s ≥ t] contains only sets of probability 0 and 1. Do the same for ⋂_{ε>0} σ[W_t: 0 ≤ t ≤ ε]; give examples of sets in this σ-field.

37.5. Show by a direct argument that W(·, ω) is with probability 1 of unbounded variation on [0, 1]: Let Y_n = Σ_{i=1}^{2^n} |W(i2^{−n}) − W((i − 1)2^{−n})|. Show that Y_n has mean 2^{n/2}E[|W_1|] and variance Var[|W_1|]. Conclude that Σ_n P[Y_n ≤ n] < ∞.
37.6. Show that the Poisson process as defined by (23.5) is measurable.

37.7. Show that for T = [0, ∞) the coordinate-variable process [Z_t: t ∈ T] on (R^T, ℛ^T) is not measurable.

37.8. Extend Theorem 37.4 to the set [t: W(t, ω) = a].
Let T0 be the first time the Brownian motion hits a > 0: T0 W, > a]. Show by (37.16) that
=
inf[ t:
(37.52) and hence that the distribution of T0 has over (0, oo) the density
(37.53 ) Show that E[ T0] = oo. Show that T0 has the same distribution as a 2/N 2 , where N is a standard normal variable. 37. 10. f (a) Show by the strong Markov property that T0 and Ta + fl - T0 are independent and that the latter has the same distribution as Tp. Conclude that h a * hp = h a+ fl · Show that /l Ta has the same distribution as Ta,fo · (b) Show that each h a is stable-see Problem 28.10. 37. 11. f Suppose that X1 , X2 , . • • are independent and each has the distribution
(37.53).
(a) Show that ( XI + . . . + xn )/n 2 also has the distribution (37.53). Con
trast this with the law of large numbers. (b) Show that P( n- 2 max k s n Xk < x] � exp( - a{2/wx ) for x > 0. Re late this to Theorem 14.3. 37. 12. 37.9 t Let p(s, t) be the probability that a Brownian path has at least one zero in (s, t). From (37.52), (37.53), and the Markov property, deduce
( 37 .54)
p ( s , t ) = 2 arccos - . t -
'IT
Hint: Condition with respect to �37. 13.
37.14.
If
f (a) Show that the probability of no zero in (t, 1) is (2/w) arcsin/i and hence that the position of the last zero preceding 1 is distributed over (0, 1) with density , - 1 (t( 1 - t)) - 1 12 • (b) Similarly calculate the distribution of the position of the first zero following time 1. (c) Calculate the joint distribution of the two zeros in (a) and (b).
f (a) Show by Theorem 37.8 that infs s u s 1Y, ( u) and infs s u s , Zn ( u ) both converge in distribution to infs s u s , W( u) for 0 < s < t < 1 . Prove a similar result for the supremum. (b) Let An (s, t) be the event that Sk , the position at time k in a symmetric random walk, is 0 for at least one k in the range sn < k < tn , and show that p (An ( s ' t)) --+ (2I'IT )arccos{ili . (c) Let T, be the maximum k such that k � n and Sk =1 0. Show that T,/n has asymptotically the distribution with density , - ( t (1 - t)) - 1 12 over (0, 1 ). As this density is larger at the ends of the interval than in the
38. SEPARABILITY 551 midule, the last time during a night' s play a gambler was even is more likely to be either early or late than to be around midnight. 37.15. f Show that p(s, t) == p(t - 1 , s - 1 ) == p(cs, ct). Check this by (37.54) and also by the fact that the transformations (37.1 8) and (37.19) preserve the properties of Brownian motion. 37.16. Show by means of the transformation (37.19) that for positive a and b the probability is 1 that the process is within the boundary - at < W, < bt for all sufficiently large t. Show that a/( a + b) is the probability that it last touches above rather than below. 37. 17. The martingale calculation used for (37.45) also works for slanting boundaries. For positive a, b, r, let T be the smallest t such that either W, == - a + rt or W, b + rt, and let p(a, b, r) be the probability that the exit is through the upper barrier-that W,. == b + rT. (a) For the martingale Y,, , in the proof of Lemma 3, show that E[ Y8. ,] == 1. Operating formally at first, conclude that SECTION
=-
(37 .55 ) Take fJ == 2r and note that 8 W,. - !8 2T is then 2 rb if the exit is above (probability p(a, b, r)) and - 2 ra if the exit is below (probability 1 p(a, b, r)). Deduce
p( a , b, r )
==
1 - e 2 ra
• e 2 rb - e - 2 ra
(b) Show that p(a, b, r) --+ aj( a + b) as r --+ 0, in agreement with (37.45). (c) It remains to justify (37.55) for fJ == 2 r. From E[ Y8. , ) == 1 deduce (37 .56) for nonrandom a. By the arguments in the proofs of Lemmas 3 and 4, show that (37.56) holds for simple stopping times a, for bounded ones, for a == T A n , for a == T . SECI'ION 38. SEPARABD..JTY * Introduction
As observed a number of times above, the finite-dimensional distributions do not suffi ce to determine the character of the sample paths of a process. To obtain paths with natural regularity properties, the Poisson and Brownian * This section may be omitted.
552
STOCHASTIC PROCESSES
motion processes were constructed by ad hoc methods. It is always possible to ensure that the paths have a certain very general regularity property called separability, and from this property will follow in appropriate cir cumstances various other desirable regularity properties. Section 4 dealt with "denumerable" probabilities; questions about path functions involve all the time points and hence concern " nondenumerable" probabilities. For a mathematically simple illustration of the fact that path properties are not entirely determined by the finite-dimensional distri butions, consider a probability space (0, �' P) on which is defined a positive random variable V with continuous distribution: P [V = x] = 0 for each x. For t > 0, put X( t, w ) = 0 for all w , and put Example 38. 1.
(38 .1 )
Y( t, w )
=
{�
if V( w ) = t, if V( w ) ¢ t.
Since V has continuous distribution, P[ X1 = Y,] = 1 for each t, and so [ X,: t > 0] and [ Y,: t > 0] are stochastic processes with identical finite-dimen sional distributions; for each t 1 , . . . , t k , the distribution p. 1 1 common to ( X,1 , , X,lc ) and ( Y,1 , , Y,lc) concentrates all its mass at the origin of R k . But what about the sample paths? Of course, X( · , w ) is identically 0, but Y( · , w ) has a discontinuity-it is 1 at t = V( w ) and 0 elsewhere. It is because the position of this discontinuity has a continuous distribution that • the two processes have the same finite-dimensional distributions. • • •
• • •
The idea of separability is to make a countable set of time points serve to determine the properties of the process. In all that follows, the time set T will for definiteness be taken as [0, oo) . Most of the results hold with an arbitrary subset of the line in the role of T. As in Section 36, let R r be the set of all real functions over T = [0, oo ) . Let D be a countable, dense subset of T. A function x-an element of R r -is separable D, or separable with respect to D, if for each t in T there exists a sequence t 1 , t 2 , of points such that • • •
(38 .2) (Because of the middle condition here, it was redundant to require D dense at the outset.) For t in D, (38.2) imposes no condition on x, since t n may be taken as t. An x separable with respect to D is determined by its values at
38. SEPARABILITY 553 the points of D. Note, however, that separability requires that (38.2) hold for every t-an uncountable set of conditions. It is not hard to show that the set of functions separable with respect to D lies outside 9/ r. SECTION
If x is everywhere continuous or right-continuous, then it is separable with respect to every countable, dense D. Suppose that x( t) is 0 for t ¢ v and 1 for t = v, where v > 0. Then x is not separable with respect to D unless v lies in D. The paths Y( · , w) in • Example 38.1 are of this form. Example 38. 2
The condition for separability can be stated another way: x is separable D if and only if for every t and every open interval I containing t, x( t) lies in the closure of [x(s): s e I n D]. Suppose that x is separable D and that I is an open interval in T. If £ > 0, then x(t0) + £ > sup1 e 1x(t) = u for some t0 in I. By separability lx(s0) - x(t0) 1 < £ for some s0 in I n D. But then x(s0) + 2£ > u , so that
sup x ( t ) = sup x( t ) .
(38 .3)
tEl
t E I () D
Similarly, inf X ( t) =
(38 .4)
tEl
inf
t E I() D
X
(t)
and ( 38 .5 )
sup
t 0 < t < t0 + B
l x( t ) - x( t0 ) I =
sup
t 0 < t < t0 + 8 tED
l x ( t ) - x( t0) 1 -
A stochastic process [ X,: t > 0] on (0, � ' P ) is separable D if D is a countable, dense subset of T = [0, oo) and there is an jli:set N such that P( N ) = 0 and such that the sample path X( · , w) is separable with respect to D for w outside N. Finally, the process is separable if it is separable with respect to some D; this D is sometimes called a separant. In these definitions it is assumed for the moment that X( t, w) is a finite real number for each t and w. If the sample path X( · , w ) is continuous for each w , then the process is separable with respect to each countable, dense D. This • covers Brownian motion as constructed in the preceding section. Example 38.3.
Suppose that [ W, : t > 0] has the finite-dimensional dis tributions of Brownian motion, but do not assume as in the preceding Example 38.4.
554
STOCHASTIC PROCESSES
section that the paths are necessarily continuous. Assume, however, that [W,: t > 0] is separable with respect to D. Fix t0 and B. Choose sets Dm = { t m1 , , t mm } of D-points such that t0 < t m1 < · · · < t mm < t 0 + 8 and Dm j D n (t 0 , t 0 + 8). By (37.11) and Markov' s inequality, • • •
[
P 'l!:a.! I W( t m ;) - W( t0 ) l Letting
m
--+ oo
(3 8 .6 )
gives p
]
>a <
sup
t0 S t < t0 + 8 tED
I W,
2P [ I W( t mm ) - W( to) l
- W,0 I
>a S
68 2 a4
> a] <
68 2 a4
·
•
For sample points outside the N in the definition of separability, the supremum in (38.6) is because of (38.5) unaltered if the restriction t e D is dropped. Since P( N ) = 0,
P
[
sup
t0 < t < t0 + 8
]
I W, - W,0 1 > a S
68 2 a4
•
Define Bn by (37.8) but with r ranging over all the reals (not just over the dyadic rationals) in [k2 - n , (k + 1)2 - n ]. Then P( Bn ) < 6n 5j2 n follows just as (37.13) does. But for w outside B = lim supn Bn , W( · , w ) is continuous. Since P( B ) = 0, W( · , w ) is continuous for w outside an �set of probabil ity 0. If (D, $, P) is complete, then the set of w for which W( · , w ) is continuous is an �set of probability 1. Thus paths are continuous with probability 1 for any separable process having the finite-dimensional distri butions of Brownian motion-provided that the underlying space is com • plete, which can of course always be arranged. As it will be shown below that there exists a separable process with any consistently prescribed set of finite-dimensional distributions, Example 38.4 provides another approach to the construction of continuous Brownian motion. The value of the method lies in its generality. It must not, however, be imagined that separability automatically ensures smooth sample paths: Suppose that the random variables X,, t > 0, are inde pendent, each having the standard normal distribution. Let D be any countable set dense in T = [0, oo ). Suppose that I and J are open interval s with rational endpoints. Since the random variables X, with t E D n I are independent, and since the value common to the P[ X, e J ] is positive, the Example 38.5.
38. SEPARABILITY 555 second Borel-Cantelli lemma implies that with probability 1, X, e J for some t in D n I. Since there are only countably many pairs I and J with rational endpoints, there is an jli:set N such that P( N ) 0 and such that for w outside N the set [ X(t, w): t e D n I ] is everywhere dense on the line for every open interval I in T. This implies that [ X, : t > 0] is separable with respect to D. But also of course it implies that the paths are highly irregular. This irregularity is not a shortcoming of the concept of separability-it is a necessary consequence of the properties of the finite-dimensional distribu • tions specified in this example. SECTION
=
Exllltlple
The process [Y,: t > 0] in Example 38.1 is not separable: The path Y( · w) is not separable D unless D contains the point V( w ). The set of w for which Y( · w) is separable D is thus contained in [ w: V( w) e D], a set of probability 0 since D is countable and V has a • continuous distribution. 38.6. ,
,
Existence Theorems
It will be proved in stages that for every consistent system of finite-dimen sional distributions there exists a separable process having those distribu tions. Define x to be separable D at the point t if there exist points t n in D such that t n --+ t and x( t n ) --+ x(t). Note that this is no restriction on x if t lies in D, and note that separability is the same thing as separability at every t.
Let [ X,: t > 0] be a stochastic process on ( 0, �' P). There exists a countable, dense set D in [0, oo ), and there exists for each t an jli:set N( t), such that P( N(t)) 0 and such that for w outside N( t) the path function X( · , w) is separable D at t. PROOF. Fix open intervals I and J and consider the probability Lemma 1.
=
(
p ( U ) = P n [ X. e J ) sE U
)
for countable subsets U of I n T. As U increases, the intersection here decreases and so does p(U). Choose Un so that p(Un ) --+ inf up(U). If U( I, J ) U nUn , then U( I, J ) is a countable subset of I n T making p(U) minimal : =
(38 .7)
p
(
n
s E U( / , J)
) (
[ x. e J ] � p n l x. e J ] sE U
)
556
STOCHASTIC PROCESSES
for every countable subset U of I n T. If t e I n T, then
(
(38 .8)
P [ x, e J ) n
n
s E U( l, J)
)
[ Xs 4t J ) = o ,
because otherwise (38.7) would fail for U = U( I, J) u { t } . Let D = UU( I, J), where the union extends over all open intervals I and J with rational endpoints. Then D is a countable, dense subset of T. For each t let
(38 .9)
{
N ( t ) = U [ x, e J ) n
n
s e U( I, J)
}
[ Xs 4t J ) .
where the union extends over all open intervals J that have rational endpoints and over all open intervals I that have rational endpoints and contain t. Then N( t) is by (38.8) an �set such that P( N( t )) = 0 . Fix t and w � N( t ). The problem is to show that X( · , w) is separable with respect to D at t. Given n, choose open intervals I and J that have rational endpoints and lengths less than n - 1 and satisfy t e I and X( t, w) e J. Since w lies outside (38.9), there must be an sn in U( I, J) such that X(sn , w) E J. But then sn E D, lsn - tl < n - 1 , and I X(sn , w) - X( t, w ) l < n - 1 • Thus sn --+ t and X(sn , w) --+ X( t , w) for a sequence s1 , s 2 , • • • in D. • For any countable D, the set of w for which X( · , w) is separable with respect to D at t is
(38 . 10)
00
n
U
n = l ls - tl < n - 1 sED
[ w : I X( t, w ) - X( s, w ) I
< n - 1] •
This set lies in !F for each t, and the point of the lemma is that it is possible to choose D in such a way that each of these sets has probability 1.
Let [ X,: t > 0] be a stochastic process on (0, $, P). Suppose that for all t and w Lemma 2.
(38 .11)
a < X( t, w ) < b.
Then there exists on ( 0, $, P ) a process [ X/: t > 0] having these three properties: (i) P[ X! = X,] = 1 for each t. (ii) For some countable, dense subset D of [0 , oo ), X'( · , w) is separable D for every w in D.
SECTION
38.
SEPARABILITY
557
(iii) For all t and w, a < X'( t, w ) < b .
(38 .12)
Choose a countable, dense set D and jli:sets N( t ) of probability 0 as in Lemma 1. If t e D or if w � N( t), define X'( t, w) = X( t, w). If t � D , fix some sequence { s�l ) } in D for which lim n s!'> = t and define X'(t, w) = lim supn X(s!l ) , w) for w e N( t). To sum up, PROOF.
if t e D or w � N( t ) if t tt D and "' e. N( t ) .
X( t, w ) (3 8 .1 3) X'( t "' ) = lim sup x( s!'l, "' ) n •
Since N( t ) e �, x: is measurable !F for each t. Since P( N(t)) = 0, P[ X, = x;] = 1 for each t. Fix t and w. If t e D, then certainly X'( · , w) is separable D at t, and so assume t � D . If w � N( t), then by the construction of N( t), X( · , w) is separable with respect to D at t, so that there exist points s n in D such that sn --+ t and X(s n , w) --+ X(t, w ). But X(sn, w) = X'(s n , w) because sn e D, and X( t, w) = X'( t, w) because w � N( t). Hence X'(s n , w) --+ X'( t, w), and so X'( · , w ) is separable with respect to D at t. Finally, suppose that t � D and w e N( t ). Then X'( t, w) = lim k X(s! :>, w) for some sequence { n k } of integers. As k --+ oo ' s ' w) = �(sn< lk) ' w) --+ X'( t, w ), so that again X'( · , w) is separable with respect to D at t. Clearly, (38.11) implies (38.12). • One must allow for the possibility of equality in (38.12). Suppose that V( w) > 0 for all w and that V has a continuous distribution. Define Example 38. 7.
f( t ) =
{0
e _ ,, ,
�. f t ¢ o ' tf t = 0,
and put X( t, w ) = f( t - V( w)). If [ X;: t > 0] is any separable process with the same finite-dimensional distributions as [ X,: t > 0], then X'( · , w) must with probability 1 assume the value 1 somewhere. In this case (38.11) holds • for a < 0 and b = 1, and equality in (38.12) cannot be avoided. If ( 3 8 .14 )
sup I X( t, w ) I < I,
w
oo ,
558
STOCHASTIC PROCESSES
then (38.11) holds for some a and b. To treat the case in which (38.14) fails, it is necessary to allow for the possibility of infinite values. If x( t) is oo or - oo . - oo , replace the third condition in ( 38.2) by x(t n ) --+ oo or x( t n ) This extends the definition of separability to functions x that may assume infinite values and to processes [ X,: t > 0] for which X( t, w) = ± oo is a possibility. --+
If [ X,: t > 0] is a finite-va/uedprocess on (O, �, P), there exists on the same space a separable process [ x; : t > 0] such that P[ x: = X,] = 1 for each t.
Theorem 38. 1.
It is for convenience assumed here that X( t, w) is finite for all t and w, although this is not really necessary. But in some cases infinite values for certain X'( t, w ) cannot be avoided-see Example 38.8. If (38.14) holds, the result is an immediate consequence of Lemma 2. The definition of separability allows an exceptional set N of probability 0; in the construction of Lemma 2 this set is actually empty, but it is clear from the definition this could be arranged anyway. The case in which (38.14) may fail could be treated by tracing through the preceding proofs, making slight changes to allow for infinite values. A simple argument makes this unnecessary. Let g be a continuous, strictly increasing mapping of R 1 onto (0, 1). Let Y( t, w) = g( X( t, w )) . Lemma 2 applies to [ Y,: t � 0]; there exists a separable process [ Y,': t > 0] such that P[ Y,' = Y, ] = 1. Since 0 < Y(t, w) < 1, Lemma 2 ensures 0 < Y'( t, w) s 1. Define PROOF.
- oo
X '( t , w )
=
g - 1 ( Y' ( t , w ) ) + 00
if Y ' ( t, w) = 0, if O < Y' ( t, w ) < 1 , if y I ( t' (A) ) = 1 .
Then [ X,': t > 0 ] satisfies the requirements. Note that P[ X; = ± oo] = 0 • for each t. Suppose that V( w) > 0 for all w and V has a continu ous distribution. Define EX111ple 11 38.8.
if t ¢ 0, if t = 0, and put X( t, w) = h(t - V( w)). This is analogous to Example 38.7. If [ X/ :
SECTION
38.
559
SEPARABILITY
t > 0] is separable and has the finite-dimensional distributions of [ X,: t > 0], then X'( · , w ) must with probability 1 assume the value oo for • some t. Combining Theorem 38.1 with Kolmogorov' s existence theorem shows that for any consistent system of finite-dimensional distributions p. 11 1k there • • •
exists a separable process with the p. 11 1k as finite-dimensional distributions. As shown in Example 38.4, this leads to another construction of Brownian motion with continuous paths. • • •
Consequences of Separability
The next theorem implies in effect that, if the finite-dimensional distribu tions of a process are such that it "should" have continuous paths, then it will in fact have continuous paths if it is separable. Example 38.4 illustrates this. The same thing holds for properties other than continuity. Let Rr be the set of functions on T = [0, oo) with values that are ordinary reals or else oo or oo . Thus Rr is an enlargement of the R r of Section 36, an enlargement necessary because separability forces infinite values. Define the function Z, on R r by Z,(x) = Z( t, x) = x( t). This is just an extension of the coordinate function ( 3 6 . 8) . Let gJ T be the a-field in R r generated by the Z,, t > 0. Suppose that A is a subset of R r, not necessarily in gJ T_ For D c T = [0, oo ) , let A D consist of those elements x of R r that agree on D with some element y of A : -
( 38 .1 5 )
AD =
u n [x e R T : x ( t ) = y( t )] .
yEA tED
Of course, A c A D . Let SD denote the set of x in Rr that are separable with respect to D. In the following theorem, [ X, : t > 0) and [ X;: t > 0] are processes on spaces (D, !F, P) and (D', !F', P') which may be distinct; the path func tions are X( · , w ) and X'( · , w'). It
Theorem 38.2. Suppose of A that for each countable, dense subset D of T = [0, oo ), the set (38.1 5 ) satisfies
(38 .16) If [ X, : t > 0] and [ x; : t > 0] have the same finite-dimensional distributions, if [w: X( · , w ) e A] lies in � and has P-measure 1, and if [ X;: t > 0] is separable, then [w': X'( · , w ') e A ] contains an �'-set of P'-measure 1.
560
STOCHASTIC PROCESSES
If ( 0', !F ', P') is complete, then of course [w': X'( · , w') E A ] is itself an � '-set of P '-measure 1 . PROOF. Suppose that [ X;: t > 0] is separable with respect to D. The difference [w': X'( · , w ') E AD] - [w': X'( · , w') E A ] is by (38.16) a subset of [w': X'( · , w') E R r - SD], which is contained in an � '-set N ' of P'--measure 0. Since the two processes have the same finite-dimensional distributions and hence induce the same distribution on ( R r, !1/ r ), and since A D lies in gJ r, P '[w': X'( · , w') E AD] = P[w: X( · , w ) E AD] > P[w: X( · , w ) E A ] = 1 . Thus the subset [w': X'( · , w') E AD] - N ' of [w': X'( · , w') E A] lies in �' and has P'-measure 1. • Consider the set C of finite-valued, continuous functions on T. If x E SD and y E C, and if x and y agree on a dense D, then x and y agree everywhere: x = y. Therefore, CD n SD c C. Further, Example 38. 9.
CD = n u n [ x E R T: l x ( s )l < 00 , l x ( t )l < 00 , l x ( s ) - x ( t )l < (, I
8 s
£
]
,
where £ and 6 range over the positive rationals, t ranges over D, and the inner intersection extends over the s in D satisfying I s - t l < 6. Hence CD E gJ r_ Thus C satisfies the condition (38.16). Theorem 38.2 now implies that if a process has continuous paths with probability 1, then any separable process having the same finite-dimensional distributions has continuous paths outside a set of probability 0. In particu lar, a Brownian motion with continuous paths was constructed in the preceding section, and so any separable process with the finite-dimensional distributions of Brownian motion has continuous paths outside a set of probability 0. The argument in Example 38.4 now becomes supererogatory . •
Example 38. 10.
There is a somewhat similar argument for the step functions of the Poisson process. Let z + be the set of nonnegative integers, let E consist of the nondecreasing functions x in R r such that x(t) E z + for all t and such that for every n E z + there exists a nonempty interval I such that x( t ) = n for t E /. Then
ED == n [x: x ( t ) E z + ] n tED
00
n n u n
n - O I t E D rt l
n
s, tED, s < t
[x: x ( s ) < x ( t )]
[x: x( t ) = n] ,
where I ranges over the open intervals with rational endpoints. Thus ED E gJ r_ Clearly, ED n SD c E, and so Theorem 38.2 applies.
SECTION
38.
561
SEPARABILITY
In Section 23 was constructed a Poisson process with paths in E, and therefore any separable process with the same finite-dimensional distribu • tions will have paths in E except for a set of probability 0. For E as in Example 38.10, let E0 consist of the elements of E that are right-continuous; a fu�ction in E need not lie in E0, although at each t it must be continuous from one side or the other. The Poisson process as defined in Section 23 by N, = max[n: Sn < t] (see (23.5)) has paths in E0• But if N, ' = max[n: Sn < t], then [N,' : t > 0] is separable and has the same finite-dimensional distributions, but its paths are not in E0• Thus E0 does not satisfy the hypotheses of Theorem 38.2. Separability does not help distinguish between continuity from the right and continuity from the left. • Example 38. 11.
The class of sets A satisfying (38.16) is closed under the formation of countable unions and intersections but is not closed under complementation. Define X, and Y, as in Example 38.1 and let C be the set of continuous paths. Then [ Y,: t > 0] and [ X, : t > 0] have the same finite-dimensional distributions and the latter is separable; Y( · , w) is in • R_T - C for each w, and X( · , w) is in R_T - C for no w. Example 38. 12 ·
As a final example, consider the set J of functions with discontinuities of at most the first kind: x is in J if it is finite-valued, if x( t + ) = lim s .L t x(s) exists (finite) for t > 0 and x( t - ) = lims t ,s(s) exists (finite) for t > 0, and if x( t) lies between x( t + ) and x( t ..- ) for t > 0. Continuous and right-continuous functions are special cases. Let V denote the general system Emmple 38.13.
V: k ; r1 , . . . , rk ; s 1 , . . . , sk ; a1 , . . . , ak ,
{ 38.17)
where k is an integer, where the r; , s; , and a; are rational, and where
Define
J ( D , V, £ )
=
k
n ( x : a; < x ( t ) < a; + £ , t e ( r; , s; )
i- 1
k
n
D)
() n [ X : min { a; - 1 ' a; } < X ( t ) < max { a; - ' a; } + ( ' t E ( s i-2
1
I
Let �m . k , B be the class of systems (38.17) that have a fixed value for k
1
, r; ) () D ] . and
satisfy
562
r;
- s; _ 1
{ 38.18)
STOCHASTIC PROCESSES
<
B,
i
==
2, . . . , k and sk > 00
m.
00
It will be shown that
Jo = n n u n m-1 f k-1
8
u
VE 4',.. , k , l
J ( D , V, £ ) ,
where £ and B range over the positive rationals. From this it will follow that J0 e gpr_ It will also be shown that J0 n S0 c J, so that J satisfies the hypothesis of Theorem 38.2 Suppose that y e J. For fixed £ , let H be the set of nonnegative h for which there exist finitely many points I; such that 0 = t0 < t1 < · · · < t, = h and l y( t) - y( t') I < £ for t and t' in the same interval ( 1; _ 1 , I; ) · If hn e H and hn f h, then from the existence of y(h - ) follows h e H. Hence H is closed. If h e H, from the existence of y(h + ) it follows that H contains points to the right of h. Therefore, H = [0, oo). From this it follows that the right side of (38.18) contains J0 . Suppose that x is a member of the right side of (38.18). It is not hard to deduce that for each t the limits ( 38.19)
lim x ( s ) ,
s � t, s E D
lim x ( s )
s t t,sE D
exist and that x ( t ) lies between them if t e D. For t e D take y( t) = x( t), and for t � D take y( t) to be the first limit in (38.19). Then y e J and hence x e J0 . This argument also shows that J0 n S0 c J. Separability in Product
Space
Kolmogorov's existence theorem applies in ( Rr, 9/T): Construct on ( Rr, !JIT) a probability_ measure P0 with the specified finite-dimensional distributions. The1 map p : RT --+ RT defined by p ( x ) = x is measurable !JITj91T; take P = P0 p - . The process [ Z, : t � 0] on ( Rr, !Jtr, P) then has the required finite-dimensional distribu tions. Apply Theorem 38.1 to [ X, : t > 0] = ( Z,: t > 0] and ( 0 , S', P) = ( Rr, !Jtr, P ). Specifically, apply the construction in Lemma 2, but without the restriction (38.11) , so that infinite values are allowed. Since the set (38.10) lies in S' = !Jtr, the sets N( t ) of Lemma 1 can be taken to be J[sets. Thus each N(t) can be taken to be exactly the w-set where X( · , w) is not separable with respect to D at t. Let 0' be the set where the new path X' ( · , w ) is the same as the old path X( · , w): ( 38 .20)
0' =
n ( w : X' ( t, w) = X( t, w) ) .
t�O
The proof of Lemma 2 shows that X' ( · , w) is separable with respect to D for every w. Therefore, X( · , w) is separable with respect to D if w e 0' . It will be shown that 0' has outer measure 1 : 0' c A and A e !F= !JIT together imply that P ( A ) = 1. By Theorem 36.3(ii), A e a [ X, : t e S] for some countable set S in T. If { 38 .21 )
n ( w : X' ( t, w) = X( t, w ) ) c A ,
then P(A) = 1 will follow because P[X'_t = X_t] = 1 for each t. Now X'(·, ω) is an element of Ω = R̄^T; denote it ψ(ω), so that X'(t, ω) = X(t, ψ(ω)). Since X'(·, ω) is by construction separable D for each ω, so is X(·, ψ(ω)) = ψ(ω). Thus ψ(ω) ∉ N(t) for each t. Then by the construction (38.13), X(t, ψ(ω)) = X'(t, ψ(ω)) for all t, so that ψ(ω) ∈ Ω' and hence ψ(ω) ∈ A. But X(t, ω) = X'(t, ω) for all t ∈ S implies that X(t, ω) = X(t, ψ(ω)) for all t ∈ S; hence (Theorem 36.3(i)) A contains ω as well as ψ(ω). This proves (38.21).

Let ℱ' consist of the sets of the form Ω' ∩ A for A ∈ ℱ. Then ℱ' is a σ-field in Ω', and because Ω' has outer measure 1, setting P'(Ω' ∩ A) = P(A) consistently defines a probability measure on ℱ'. The coordinate functions X_t = Z_t restricted to Ω' give a stochastic process on (Ω', ℱ', P') for which all paths are separable and the finite-dimensional distributions are the same as before.
Appendix
Gathered here for easy reference are certain definitions and results from set theory and real analysis required in the text. Although there are many newer books, HAUSDORFF (the early sections) on set theory and HARDY on analysis are still excellent for the general background assumed here.

Set Theory

A1. In this book, the empty set is denoted by ∅. Sets are variable subsets of some space that is fixed in any one definition, argument, or discussion; this space is denoted either generically by Ω or by some special symbol (such as R^k for Euclidean k-space). A singleton is a set consisting of just one point or element. That A is a subset of B is expressed by A ⊂ B. In accordance with standard usage, A ⊂ B does not preclude A = B; A is a proper subset of B if A ⊂ B and A ≠ B. The complement of A is always relative to the overall space Ω; it consists of the points of Ω not contained in A and is denoted by A^c. The difference between A and B, denoted by A − B, is A ∩ B^c; here B need not be contained in A, and if it is, then A − B is a proper difference. The symmetric difference A Δ B = (A ∩ B^c) ∪ (A^c ∩ B) consists of the points that lie in one of the sets A and B but not both.

A2. The set of ω that lie in A and satisfy a given property p(ω) is denoted [ω ∈ A: p(ω)]; if A = Ω, this is usually shortened to [ω: p(ω)].
A3. In this book, to say that a collection [A_θ: θ ∈ Θ] is disjoint always means that it is pairwise disjoint: A_θ ∩ A_θ' = ∅ if θ and θ' are distinct elements of the index set Θ. To say that A meets B, or that B meets A, is to say that they are not disjoint: A ∩ B ≠ ∅. The collection [A_θ: θ ∈ Θ] covers B if B ⊂ ∪_θ A_θ. The collection is a decomposition or partition of B if it is disjoint and B = ∪_θ A_θ.

A4. By A_n ↑ A is meant A₁ ⊂ A₂ ⊂ ··· and A = ∪_n A_n; by A_n ↓ A is meant A₁ ⊃ A₂ ⊃ ··· and A = ∩_n A_n.
A5. The indicator, or indicator function, of a set A is the function on Ω that assumes the value 1 on A and 0 on A^c; it is denoted I_A. The alternative term "characteristic function" is reserved for the Fourier transform (see Section 26).
A6. DeMorgan's laws are (∪_θ A_θ)^c = ∩_θ A_θ^c and (∩_θ A_θ)^c = ∪_θ A_θ^c. These and the other facts of basic set theory are assumed known: a countable union of countable sets is countable, and so on.
A7. If T: n --+ 0' is a mapping of Q into 0' and A' is a set in 0', the inverse
image of A' is T^(−1)A' = [ω ∈ Ω: Tω ∈ A']. It is easily checked that each of these statements is equivalent to the next: ω ∈ Ω − T^(−1)A'; ω ∉ T^(−1)A'; Tω ∉ A'; Tω ∈ Ω' − A'; ω ∈ T^(−1)(Ω' − A'). Therefore Ω − T^(−1)A' = T^(−1)(Ω' − A'). Simple considerations of this kind show that ∪_θ T^(−1)A'_θ = T^(−1)(∪_θ A'_θ) and ∩_θ T^(−1)A'_θ = T^(−1)(∩_θ A'_θ), and that A' ∩ B' = ∅ implies T^(−1)A' ∩ T^(−1)B' = ∅ (the reverse implication is false unless TΩ = Ω'). If f maps Ω into another space, f(ω) is the value of the function f at an unspecified value of the argument ω. The function f itself (the rule defining the mapping) is sometimes denoted f(·). This is especially convenient for a function f(ω, t) of two arguments: For each fixed t, f(·, t) denotes the function on Ω with value f(ω, t) at ω.
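The inverse-image identities of A7 can be checked mechanically on small finite examples. The following sketch is my own (the mapping T and the sets are invented for illustration):

```python
# Inverse images commute with union, intersection, and complement (A7).
omega = {1, 2, 3, 4, 5, 6}
omega_prime = {"a", "b", "c"}
T = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c"}

def preimage(A_prime):
    """T^(-1)A' = set of w in omega with T(w) in A'."""
    return {w for w in omega if T[w] in A_prime}

A, B = {"a", "b"}, {"b", "c"}
assert preimage(A | B) == preimage(A) | preimage(B)
assert preimage(A & B) == preimage(A) & preimage(B)
assert omega - preimage(A) == preimage(omega_prime - A)
print("A7 identities verified on a finite example")
```

Note that the forward image does not enjoy the same identities; it is precisely the inverse image that preserves all the set operations, which is why measurability is defined through it.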
A8. The axiom of choice. Suppose that [A_θ: θ ∈ Θ] is a decomposition of Ω into nonempty sets. The axiom of choice says that there exists a set (at least one set) C that contains exactly one point from each A_θ: C ∩ A_θ is a singleton for each θ in Θ. The existence of such sets C is assumed in "everyday" mathematics, and the axiom of choice may even seem to be simply true. A careful treatment of set theory, however, is based on an explicit list of such axioms and a study of the relationships between them; see HALMOS or JECH. A few of the problems require Zorn's lemma, which is equivalent to the axiom of choice; see HALMOS.

The Real Line
A9. The real line is denoted by R¹. It is convenient to be explicit about open, closed, and half-open intervals:

(a, b) = [x: a < x < b],
[a, b] = [x: a ≤ x ≤ b],
(a, b] = [x: a < x ≤ b],
[a, b) = [x: a ≤ x < b].
A10. Of course x_n → x means lim_n x_n = x; x_n ↑ x means x_1 ≤ x_2 ≤ ··· and x_n → x; x_n ↓ x means x_1 ≥ x_2 ≥ ··· and x_n → x.

A sequence {x_n} is bounded if and only if every subsequence {x_{n_k}} contains a further subsequence {x_{n_{k(j)}}} that converges to some x: lim_j x_{n_{k(j)}} = x. If {x_n} is not bounded, then for each k there is an n_k for which |x_{n_k}| > k; no subsequence of {x_{n_k}} can converge. The implication in the other direction is a simple consequence of the fact that every bounded sequence contains a convergent subsequence.

If {x_n} is bounded, and if each subsequence that converges at all converges to x, then lim_n x_n = x. If x_n does not converge to x, then |x_n − x| > ε for some positive ε and some increasing sequence {n_k} of integers; some subsequence of {x_{n_k}} converges, but the limit cannot be x.
A11. A set G is defined as open if for each x in G there is an open interval I such that x ∈ I ⊂ G. A set F is defined as closed if F^c is open. The interior of A, denoted A°, consists of the x in A for which there exists an open interval I such that x ∈ I ⊂ A. The closure of A, denoted A⁻, consists of the x for which there exists a sequence {x_n} in A with x_n → x. The boundary of A is ∂A = A⁻ − A°. The basic facts of real analysis are assumed known: A is open if and only if A = A°; A is closed if and only if A = A⁻; A is closed if and only if it contains all limits of sequences in it; x lies in ∂A if and only if there is a sequence {x_n} in A and a sequence {y_n} in A^c such that x_n → x and y_n → x; and so on.
To see this, define points x and y of G to be equivalent if x < y and [ x, y] c G or y < x and [ y, x ] c G. This is an equivalence relation. Each equivalence class is an interval, and since G is open, each is in fact an open interval. Thus G is a disjoint union of open (nonempty) intervals, and there can be only countably many of them, since each contains a rational. Al3. The simplest form of the Heine-Bore/ theorem says that if [a, b)
c
Uf.... 1 (a k , bk ), then [ a , b) c Uk _ 1 (a k , bk ) for some n. A set A is defined to be compact if each cover of it by open sets has a finite subcover-that is, if [ G8 : 8 e 9] covers A and each G8 is open, then some finite subcollection { G8 1 , , G8 } • • •
covers A . Equivalent to the Heine-Borel theorem is the assertion that a bounded , closed set is compact. Also equivalent is the assertion that every bounded sequence of real numbers has a convergent subsequence.
A14. The diagonal method. From this last fact follows one of the basic principles of analysis.

Theorem. Suppose that each row of the array

(1)  x_{1,1}, x_{1,2}, x_{1,3}, ...
     x_{2,1}, x_{2,2}, x_{2,3}, ...
     ...

is a bounded sequence of real numbers. Then there exists an increasing sequence n_1, n_2, ... of integers such that the limit lim_k x_{r,n_k} exists for r = 1, 2, ... .

PROOF. From the first row, select a convergent subsequence

(2)  x_{1,n_{1,1}}, x_{1,n_{1,2}}, x_{1,n_{1,3}}, ...;

here {n_{1,k}} is an increasing sequence of integers and lim_k x_{1,n_{1,k}} exists. Look next at the second row of (1) along the sequence n_{1,1}, n_{1,2}, ...:

(3)  x_{2,n_{1,1}}, x_{2,n_{1,2}}, x_{2,n_{1,3}}, ... .

As a subsequence of the second row of (1), (3) is bounded. Select from it a convergent subsequence

x_{2,n_{2,1}}, x_{2,n_{2,2}}, x_{2,n_{2,3}}, ...;

here {n_{2,k}} is an increasing sequence of integers, a subsequence of {n_{1,k}}, and lim_k x_{2,n_{2,k}} exists.

Continue inductively in the same way. This gives an array

(4)  n_{1,1}, n_{1,2}, n_{1,3}, ...
     n_{2,1}, n_{2,2}, n_{2,3}, ...
     ...

with three properties. (i) Each row of (4) is an increasing sequence of integers. (ii) The rth row is a subsequence of the (r − 1)st. (iii) For each r,

(5)  x_{r,n_{r,1}}, x_{r,n_{r,2}}, x_{r,n_{r,3}}, ...

is a convergent subsequence of the rth row of (1), so that lim_k x_{r,n_{r,k}} exists.

Put n_k = n_{k,k}. Since each row of (4) is increasing and is contained in the preceding row, n_1, n_2, n_3, ... is an increasing sequence of integers. Furthermore, n_r, n_{r+1}, n_{r+2}, ... is a subsequence of the rth row of (4). Thus x_{r,n_r}, x_{r,n_{r+1}}, x_{r,n_{r+2}}, ... is a subsequence of (5) and is therefore convergent. Thus lim_k x_{r,n_k} does exist. ∎

Since {n_k} is the diagonal of the array (4), application of this theorem is called the diagonal method.
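The selection in the proof is not constructive, but the Bolzano-Weierstrass step can be imitated numerically. In the sketch below, the rows, the interval-halving routine, and all numerical choices are illustrative assumptions, not part of the text: a finite index set is thinned so that each row in turn clusters.

```python
def convergent_subsequence(indices, row, stages=20):
    """Thin `indices` until row(k) clusters, by repeated interval halving."""
    lo = min(row(k) for k in indices)
    hi = max(row(k) for k in indices)
    for _ in range(stages):
        if len(indices) <= 2:
            break
        mid = (lo + hi) / 2.0
        lower = [k for k in indices if row(k) <= mid]
        upper = [k for k in indices if row(k) > mid]
        # keep the half with more terms (a finite stand-in for "infinitely many")
        if len(lower) >= len(upper):
            indices, hi = lower, mid
        else:
            indices, lo = upper, mid
    return indices

# bounded rows x_{r,k} = (-1)^k (1 + 1/k) / r, r = 1, ..., 4
rows = [lambda k, r=r: (-1) ** k * (1.0 + 1.0 / k) / r for r in range(1, 5)]

indices = list(range(1, 2001))
for row in rows:
    indices = convergent_subsequence(indices, row)

# along the surviving indices, every row clusters (here near -1/r)
for row in rows:
    tail = [row(k) for k in indices[-2:]]
    assert max(tail) - min(tail) < 0.1
```

Keeping the larger half at each stage mimics the fact that one half of a bounded interval must contain infinitely many terms of the sequence.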
A15. The set A is by definition dense in the set B if for each x in B and each open interval J containing x, J meets A. This is the same thing as requiring B ⊂ A⁻. The set E is by definition nowhere dense if each open interval I contains some open interval J that does not meet E. This makes sense: If I contains an open interval J that does not meet E, then E is not dense in I; the definition requires that E be dense in no interval I.

A set A is defined to be perfect if it is closed and for each x in A and positive ε, there is a y in A such that 0 < |x − y| < ε. An equivalent requirement is that A is closed and for each x in A there exists a sequence {x_n} in A such that x_n ≠ x and x_n → x. The Cantor set is uncountable, nowhere dense, and perfect.

A set that is nowhere dense is in a sense small. If A is a countable union of sets each of which is nowhere dense, then A is said to be of the first category. This is a weaker notion of smallness. A set that is not of the first category is said to be of the second category.
Euclidean k-Space

A16. Euclidean space of dimension k is denoted R^k. Points (a_1, ..., a_k) and (b_1, ..., b_k) determine open, closed, and half-open rectangles in R^k:

[x: a_i < x_i < b_i, i = 1, ..., k],
[x: a_i ≤ x_i ≤ b_i, i = 1, ..., k],
[x: a_i < x_i ≤ b_i, i = 1, ..., k].

A rectangle (without a qualifying adjective) is in this book a set of this last form. The Euclidean distance (Σ_{i=1}^k (x_i − y_i)²)^{1/2} between x = (x_1, ..., x_k) and y = (y_1, ..., y_k) is denoted by |x − y|.
A17. All the concepts in A11 carry over to R^k: simply take the I there to be an open rectangle in R^k. The definition of compact set also carries over word for word, and the Heine-Borel theorem in R^k says that a closed, bounded set is compact.
Analysis

A18. The standard Landau notation is used. Suppose that {x_n} and {y_n} are real sequences and y_n > 0. Then x_n = O(y_n) means x_n/y_n is bounded; x_n = o(y_n) means x_n/y_n → 0; x_n ~ y_n means x_n/y_n → 1; x_n ≍ y_n means x_n/y_n and y_n/x_n are both bounded. To write x_n = z_n + O(y_n), for example, means that x_n = z_n + u_n for some {u_n} satisfying u_n = O(y_n), that is, that x_n − z_n = O(y_n).
A19. A difference equation. Suppose that a and b are integers and a < b. Suppose that x_n is defined for a ≤ n ≤ b and satisfies

(6)  x_n = p x_{n+1} + q x_{n−1} for a < n < b,

where p and q are positive and p + q = 1. The general solution of this difference equation has the form

(7)  x_n = A + B(q/p)^n for a ≤ n ≤ b if p ≠ q,
     x_n = A + Bn for a ≤ n ≤ b if p = q.

That (7) always solves (6) is easily checked. Suppose the values x_{n_1} and x_{n_2} are given, where a ≤ n_1 < n_2 ≤ b. If p ≠ q, the system

A + B(q/p)^{n_1} = x_{n_1},  A + B(q/p)^{n_2} = x_{n_2}

can always be solved for A and B. If p = q, the system

A + Bn_1 = x_{n_1},  A + Bn_2 = x_{n_2}

can always be solved. Take n_1 = a and n_2 = a + 1; the corresponding A and B satisfy (7) for n = a and for n = a + 1, and it follows by induction that (7) holds for all n. Thus any solution of (6) can indeed be put in the form (7). Furthermore, the equation (6) and any pair of values x_{n_1} and x_{n_2} (n_1 ≠ n_2) suffice to determine all the x_n. If x_n is defined for a ≤ n < ∞ and satisfies (6) for a < n < ∞, then there are constants A and B such that (7) holds for a ≤ n < ∞.
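A quick numerical check of (6) and (7); the values of p, A, and B below are arbitrary illustrative choices, not part of the text.

```python
# With p + q = 1 and p != q, the sequence x_n = A + B*(q/p)**n satisfies
# x_n = p*x_{n+1} + q*x_{n-1}, and A, B are recoverable from any two values.

p, q = 0.6, 0.4
A, B = 2.0, -3.0
x = [A + B * (q / p) ** n for n in range(0, 11)]

# (6) holds at every interior index
for n in range(1, 10):
    assert abs(x[n] - (p * x[n + 1] + q * x[n - 1])) < 1e-12

# recover A and B from x_0 and x_1 by solving the 2x2 linear system
r = q / p
B_hat = (x[1] - x[0]) / (r - 1.0)
A_hat = x[0] - B_hat
assert abs(A_hat - A) < 1e-9 and abs(B_hat - B) < 1e-9
```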
A20. Cauchy's equation.

Theorem. Let f be a real function on (0, ∞) and suppose that f satisfies Cauchy's equation: f(x + y) = f(x) + f(y) for x, y > 0. If there is some interval on which f is bounded above, then f(x) = xf(1) for x > 0.
PROOF. The problem is to prove that g(x) = f(x) − xf(1) vanishes identically. Clearly, g(1) = 0, and g satisfies Cauchy's equation and on some interval is bounded above. By induction, g(nx) = ng(x); hence ng(m/n) = g(m) = mg(1) = 0, so that g(r) = 0 for positive rational r.

Suppose that g(x_0) ≠ 0 for some x_0. If g(x_0) < 0, then g(r_0 − x_0) = −g(x_0) > 0 for rational r_0 > x_0. It is thus no restriction to assume that g(x_0) > 0. Let I be an open interval in which g is bounded above. Given a number M, choose n so that ng(x_0) > M and then choose a rational r so that nx_0 + r lies in I. If r > 0, then g(r + nx_0) = g(r) + g(nx_0) = g(nx_0) = ng(x_0). If r < 0, then ng(x_0) = g(nx_0) = g((−r) + (nx_0 + r)) = g(−r) + g(nx_0 + r) = g(nx_0 + r). In either case, g(nx_0 + r) = ng(x_0); of course this is trivial if r = 0. Since g(nx_0 + r) = ng(x_0) > M and M was arbitrary, g is not bounded above in I, a contradiction. ∎
Obviously, the same proof works if f is bounded below in some interval.

Corollary. Let U be a real function on (0, ∞) and suppose that U(x + y) = U(x)U(y) for x, y > 0. Suppose further that there is some interval on which U is bounded above. Then either U(x) = 0 for x > 0, or else there is an A such that U(x) = e^{Ax} for x > 0.

PROOF. Since U(x) = U²(x/2), U is nonnegative. If U(x) = 0, then U(x/2^n) = 0 for all n, and so U vanishes at points arbitrarily near 0. If U vanishes at a point, it must by the functional equation vanish everywhere to the right of that point. Hence U is identically 0 or else everywhere positive. In the latter case, the theorem applies to f(x) = log U(x), this function being bounded above in some interval, and so f(x) = Ax for A = log U(1). ∎
A21. A number-theoretic fact.

Theorem. Suppose that M is a set of positive integers closed under addition and that M has greatest common divisor 1. Then M contains all integers exceeding some n_0.

PROOF. Let M_1 consist of all the integers m, −m, and m − m′ with m and m′ in M. Then M_1 is closed under addition and subtraction (it is a subgroup of the group of integers). Let d be the smallest positive element of M_1. If n ∈ M_1, write n = qd + r, where 0 ≤ r < d. Since r = n − qd lies in M_1, r must actually be 0. Thus M_1 consists of the multiples of d. Since d divides all the integers in M_1 and hence all the integers in M, and since M has greatest common divisor 1, d = 1. Thus M_1 contains all the integers.

Write 1 = m − m′ with m and m′ in M (if 1 itself is in M, the proof is easy), and take n_0 = (m + m′)². Given n > n_0, write n = q(m + m′) + r, where 0 ≤ r < m + m′. From n > n_0 ≥ (r + 1)(m + m′) follows q = (n − r)/(m + m′) > r. But n = q(m + m′) + r(m − m′) = (q + r)m + (q − r)m′, and since q + r > q − r ≥ 0, this lies in M. ∎
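The theorem can be illustrated by brute force. In the sketch below, the generating set {6, 10, 15} and the search limit are arbitrary choices for the demonstration; the code computes the additive closure and finds the threshold n_0 beyond which every integer is representable.

```python
from math import gcd
from functools import reduce

def additive_closure(gens, limit):
    """All sums of the generators (with repetition) that are <= limit."""
    reachable = set()
    frontier = set(g for g in gens if g <= limit)
    while frontier:
        reachable |= frontier
        frontier = {m + g for m in frontier for g in gens if m + g <= limit} - reachable
    return reachable

gens = [6, 10, 15]
assert reduce(gcd, gens) == 1     # the hypothesis of the theorem

limit = 200
M = additive_closure(gens, limit)
gaps = [n for n in range(1, limit + 1) if n not in M]
n0 = max(gaps)                    # every integer beyond n0 lies in M
assert all(n in M for n in range(n0 + 1, limit + 1))
```

For these generators the largest non-representable integer turns out to be 29, comfortably below the crude bound (m + m′)² from the proof.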
A22. One- and two-sided derivatives.

Theorem. Suppose that f and g are continuous on [0, ∞) and g is the right-hand derivative of f on (0, ∞): f⁺(t) = g(t) for t > 0. Then f⁺(0) = g(0) as well, and g is the two-sided derivative of f on (0, ∞).

PROOF. It suffices to show that F(t) = f(t) − f(0) − ∫₀ᵗ g(s) ds vanishes for t ≥ 0. By assumption, F is continuous on [0, ∞) and F⁺(t) = 0 for t > 0. Suppose that F(t_0) > F(t_1), where 0 ≤ t_0 < t_1. Then G(t) = F(t) − (t − t_0)(F(t_1) − F(t_0))/(t_1 − t_0) is continuous on [0, ∞), G(t_0) = G(t_1), and G⁺(t) > 0 on (0, ∞). But then the maximum of G over [t_0, t_1] must occur at some interior point; since G⁺ ≤ 0 at a local maximum, this is impossible. Similarly F(t_0) < F(t_1) is impossible. Thus F is constant over (0, ∞) and by continuity is constant over [0, ∞). Since F(0) = 0, F vanishes on [0, ∞). ∎
A23. A differential equation. The equation f′(t) = Af(t) + g(t) (t ≥ 0; g continuous) has the particular solution f_0(t) = e^{At} ∫₀ᵗ g(s)e^{−As} ds; for an arbitrary solution f, (f(t) − f_0(t))e^{−At} has derivative 0 and hence equals f(0) identically. All solutions thus have the form f(t) = e^{At}[f(0) + ∫₀ᵗ g(s)e^{−As} ds].
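A numerical spot-check of the solution formula. The choices A = 0.5, g(t) = cos t, f(0) = 2, the trapezoid rule, and the step sizes are all ad hoc illustrative assumptions.

```python
import math

A, f0 = 0.5, 2.0
g = math.cos

def integral(t, steps=20000):
    # trapezoid rule for the integral of g(s) * exp(-A*s) over [0, t]
    h = t / steps
    vals = [g(i * h) * math.exp(-A * i * h) for i in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

def f(t):
    # the solution formula f(t) = e^{At} [ f(0) + integral(t) ]
    return math.exp(A * t) * (f0 + integral(t))

# check f'(t) = A f(t) + g(t) by central differences at a few points
for t in [0.5, 1.0, 2.0]:
    h = 1e-4
    deriv = (f(t + h) - f(t - h)) / (2 * h)
    assert abs(deriv - (A * f(t) + g(t))) < 1e-4
```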
A24. A trigonometric identity. If z ≠ 1 and z ≠ 0, then

Σ_{k=−l}^{l} z^k = z^{−l} Σ_{k=0}^{2l} z^k = (z^{−l} − z^{l+1})/(1 − z),

and hence

Σ_{l=0}^{m−1} Σ_{k=−l}^{l} z^k = Σ_{l=0}^{m−1} (z^{−l} − z^{l+1})/(1 − z)
  = (1/(1 − z)) [ (1 − z^{−m})/(1 − z^{−1}) − z(1 − z^m)/(1 − z) ]
  = (1 − z^{−m} + 1 − z^m)/((1 − z)(1 − z^{−1}))
  = (z^{m/2} − z^{−m/2})²/(z^{1/2} − z^{−1/2})².

Take z = e^{ix}. If x is not an integral multiple of 2π, then

Σ_{l=0}^{m−1} Σ_{k=−l}^{l} e^{ikx} = (sin ½mx)²/(sin ½x)².

If x = 2πn, the left-hand side here is m², which is the limit of the right-hand side as x → 2πn.
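The identity is easy to test numerically; the values m = 7 and x = 0.9 below are arbitrary.

```python
import cmath, math

def double_sum(m, x):
    # the left-hand side: sum over l = 0..m-1 of sum over k = -l..l of e^{ikx}
    return sum(cmath.exp(1j * k * x) for l in range(m) for k in range(-l, l + 1))

m, x = 7, 0.9
lhs = double_sum(m, x)
rhs = (math.sin(m * x / 2) ** 2) / (math.sin(x / 2) ** 2)
assert abs(lhs - rhs) < 1e-9              # the sum is real and matches

# at x a multiple of 2*pi every term is 1, and the sum is 1 + 3 + ... = m^2
assert abs(double_sum(m, 2 * math.pi) - m ** 2) < 1e-9
```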
Infinite Series

A25. Nonnegative series. Suppose x_1, x_2, ... are nonnegative. If E is a finite set of integers, then E ⊂ {1, 2, ..., n} for some n, so that by nonnegativity Σ_{k∈E} x_k ≤ Σ_{k=1}^n x_k. The set of partial sums Σ_{k=1}^n x_k thus has the same supremum as the larger set of sums Σ_{k∈E} x_k (E finite). Therefore, the nonnegative series Σ_{k=1}^∞ x_k converges if and only if the sums Σ_{k∈E} x_k for finite E are bounded, in which case the sum is the supremum: Σ_{k=1}^∞ x_k = sup_E Σ_{k∈E} x_k.
A26. Dirichlet's theorem. Since the supremum in A25 is invariant under permutations, so is Σ_{k=1}^∞ x_k: If the x_k are nonnegative and y_k = x_{f(k)} for some one-to-one map f of the positive integers onto themselves, then Σ_k x_k and Σ_k y_k diverge or converge together and in the latter case have the same sum.
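Dirichlet's theorem can be illustrated on a finite truncation. The series 1/k² and the particular permutation below are arbitrary choices; over a common finite index set the two sums agree exactly.

```python
from math import fsum   # exact, order-independent floating-point summation

n = 100000
terms = [1.0 / k ** 2 for k in range(1, n + 1)]

# a permutation of 1..n: all even indices first, then the odd ones
perm = list(range(2, n + 1, 2)) + list(range(1, n + 1, 2))
rearranged = [1.0 / k ** 2 for k in perm]

# same multiset of nonnegative terms, hence the same sum
assert abs(fsum(terms) - fsum(rearranged)) < 1e-12
```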
A27. Double series. Suppose that x_{ij}, i, j = 1, 2, ..., are nonnegative. The ith row gives a series Σ_j x_{ij}, and if each of these converges, one can form the series Σ_i Σ_j x_{ij}. Let the terms x_{ij} be arranged in some order as a single infinite series Σ_{ij} x_{ij}; by Dirichlet's theorem, the result is the same whatever order is used.

Suppose each Σ_j x_{ij} converges and Σ_i Σ_j x_{ij} converges. If E is a finite set of the pairs (i, j), there is an n for which Σ_{(i,j)∈E} x_{ij} ≤ Σ_{i≤n} Σ_{j≤n} x_{ij} ≤ Σ_{i≤n} Σ_j x_{ij} ≤ Σ_i Σ_j x_{ij}; hence Σ_{ij} x_{ij} converges and has sum at most Σ_i Σ_j x_{ij}. On the other hand, if Σ_{ij} x_{ij} converges, then Σ_{i≤m} Σ_{j≤n} x_{ij} ≤ Σ_{ij} x_{ij}; letting n → ∞ and then m → ∞ shows that each Σ_j x_{ij} converges and that Σ_i Σ_j x_{ij} ≤ Σ_{ij} x_{ij}. Therefore, in the nonnegative case, Σ_{ij} x_{ij} converges if and only if the Σ_j x_{ij} all converge and Σ_i Σ_j x_{ij} converges, in which case Σ_{ij} x_{ij} = Σ_i Σ_j x_{ij}. By symmetry, Σ_{ij} x_{ij} = Σ_j Σ_i x_{ij}. Thus the order of summation can be reversed in a nonnegative double series: Σ_i Σ_j x_{ij} = Σ_j Σ_i x_{ij}.
A28. The Weierstrass M-test.

Theorem. Suppose that lim_n x_{nk} = x_k for each k and that |x_{nk}| ≤ M_k, where Σ_k M_k < ∞. Then Σ_k x_k and all the Σ_k x_{nk} converge, and lim_n Σ_k x_{nk} = Σ_k x_k.

PROOF. The series of course converge absolutely, since Σ_k M_k < ∞. Now |Σ_k x_{nk} − Σ_k x_k| ≤ Σ_{k<k_0} |x_{nk} − x_k| + 2 Σ_{k≥k_0} M_k. Given ε, choose k_0 so that Σ_{k≥k_0} M_k < ε/3, and then choose n_0 so that n ≥ n_0 implies |x_{nk} − x_k| < ε/3k_0 for k < k_0. Then n ≥ n_0 implies |Σ_k x_{nk} − Σ_k x_k| < ε. ∎
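A small numerical illustration of the M-test; the choice x_{nk} = (1 + 1/n)/2^k is arbitrary, so x_k = 1/2^k, M_k = 2/2^k, and the limit of the sums is the sum of the limits. Series are truncated at K terms, with a negligible geometric tail.

```python
K = 60   # enough terms that the geometric tail is far below the tolerances

def series_sum(n):
    # sum over k of x_{nk} = (1 + 1/n) / 2^k
    return sum((1.0 + 1.0 / n) / 2 ** k for k in range(1, K + 1))

limit_sum = sum(1.0 / 2 ** k for k in range(1, K + 1))

# sum_k x_{nk} approaches sum_k x_k = 1 as n grows
assert abs(series_sum(10 ** 6) - limit_sum) < 1e-5
assert abs(limit_sum - 1.0) < 1e-12
```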
A29. Power series. The principal fact needed is this: If f(x) = Σ_{k=0}^∞ a_k x^k converges in the range |x| < r, then it is differentiable there and

(8)  f′(x) = Σ_{k=1}^∞ k a_k x^{k−1}.

For a simple proof, choose r_0 and r_1 so that |x| < r_0 < r_1 < r. If |h| < r_0 − |x|, so that |x + h| < r_0, then the mean-value theorem gives (here 0 < θ_h < 1)

(9)  |((x + h)^k − x^k)/h − kx^{k−1}| = |k(x + θ_h h)^{k−1} − kx^{k−1}| ≤ 2kr_0^{k−1}.

Since 2kr_0^{k−1}/r_1^{k−1} goes to 0, it is bounded by some M, and if M_k = |a_k|·Mr_1^{k−1}, then Σ_k M_k < ∞ and |a_k| times the left member of (9) is at most M_k for |h| < r_0 − |x|. By the M-test [A28] (applied with h → 0 instead of n → ∞),

(f(x + h) − f(x))/h = Σ_k a_k ((x + h)^k − x^k)/h → Σ_k k a_k x^{k−1} as h → 0.

Hence (8).
Repeated application of (8) gives

f^{(j)}(x) = Σ_{k=j}^∞ k(k − 1) ··· (k − j + 1) a_k x^{k−j}.

For x = 0, this is a_j = f^{(j)}(0)/j!, the formula for the coefficients in a Taylor series. This shows in particular that the values of f(x) for |x| < r determine the coefficients a_k.
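Term-by-term differentiation can be checked numerically for the geometric series, an arbitrary example with a_k = 1 and r = 1: the differentiated series should equal d/dx 1/(1 − x) = 1/(1 − x)².

```python
K = 200   # truncation; the tail is negligible for |x| <= 0.5

def f_prime_series(x):
    # the term-by-term derivative: sum over k of k * x^(k-1)
    return sum(k * x ** (k - 1) for k in range(1, K + 1))

for x in [-0.5, 0.0, 0.3, 0.5]:
    assert abs(f_prime_series(x) - 1.0 / (1.0 - x) ** 2) < 1e-9
```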
A30. Cesàro averages. If x_n → x, then n^{−1} Σ_{k=1}^n x_k → x. To prove this, let M bound |x_k|, and given ε, choose k_0 so that |x − x_k| < ε/2 for k ≥ k_0. If n > k_0 and n > 4k_0M/ε, then

|x − n^{−1} Σ_{k=1}^n x_k| ≤ n^{−1} Σ_{k=1}^{k_0−1} 2M + n^{−1} Σ_{k=k_0}^n (ε/2) < ε.
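A numerical illustration with the arbitrary sequence x_n = 1 + 1/n, whose Cesàro averages converge to the same limit, 1.

```python
def cesaro(xs):
    # running averages (1/n) * (x_1 + ... + x_n)
    total, out = 0.0, []
    for n, x in enumerate(xs, start=1):
        total += x
        out.append(total / n)
    return out

xs = [1.0 + 1.0 / n for n in range(1, 100001)]
avgs = cesaro(xs)
assert abs(avgs[-1] - 1.0) < 1e-3   # average over 100000 terms is near 1
```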
A31. Dyadic expansions. Define a mapping T of the unit interval Ω = (0, 1] into itself by

Tω = 2ω if 0 < ω ≤ 1/2,  Tω = 2ω − 1 if 1/2 < ω ≤ 1.

Define a function d_1 on Ω by d_1(ω) = 0 if 0 < ω ≤ 1/2 and d_1(ω) = 1 if 1/2 < ω ≤ 1, and put d_n(ω) = d_1(T^{n−1}ω). Then

(10)  Σ_{i=1}^n d_i(ω)/2^i < ω ≤ Σ_{i=1}^n d_i(ω)/2^i + 1/2^n

for all ω ∈ Ω and n ≥ 1. To verify this for n = 1, check the cases ω ≤ 1/2 and ω > 1/2 separately. Suppose that (10) holds for a particular n and for all ω. Replace ω by Tω in (10) and use the fact that d_i(Tω) = d_{i+1}(ω); separate consideration of the cases ω ≤ 1/2 and ω > 1/2 now shows that (10) holds with n + 1 in place of n. Thus (10) holds for all n and ω, and it follows that ω = Σ_{i=1}^∞ d_i(ω)/2^i. This gives the dyadic representation of ω.

If d_i(ω) = 0 for all i > n, then ω = Σ_{i=1}^n d_i(ω)/2^i, which contradicts the left-hand inequality in (10). Thus the expansion does not terminate in 0's.
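The doubling map and the inequality (10) can be checked directly; the sample points below are arbitrary, and floating-point doubling is exact here for the first few dozen digits.

```python
def T(w):
    # the doubling map on (0, 1]
    return 2 * w if w <= 0.5 else 2 * w - 1

def digits(w, n):
    """d_1(w), ..., d_n(w), computed via d_i(w) = d_1(T^{i-1} w)."""
    ds = []
    for _ in range(n):
        ds.append(0 if w <= 0.5 else 1)
        w = T(w)
    return ds

for w in [0.1, 1 / 3, 0.625, 0.9]:
    n = 20
    ds = digits(w, n)
    partial = sum(d / 2 ** i for i, d in enumerate(ds, start=1))
    # the two-sided bound (10)
    assert partial < w <= partial + 2 ** -n
```

Note that 0.625 comes out as .1001111..., the nonterminating form of .101, consistent with the remark that the expansion never ends in 0's.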
Convex Functions

A32. A function φ on an open interval I (bounded or unbounded) is convex if

(11)  φ(tx + (1 − t)y) ≤ tφ(x) + (1 − t)φ(y)

for x, y ∈ I and 0 ≤ t ≤ 1. From this it follows by induction that

(12)  φ(Σ_i p_i x_i) ≤ Σ_i p_i φ(x_i)

if the x_i lie in I and the p_i are nonnegative and add to 1.

If φ has a continuous, nondecreasing derivative φ′ on I, then φ is convex. Indeed, if a < b < c, the average of φ′ over (a, b) is at most the average over (b, c):

(φ(b) − φ(a))/(b − a) = (1/(b − a)) ∫_a^b φ′(s) ds ≤ φ′(b) ≤ (1/(c − b)) ∫_b^c φ′(s) ds = (φ(c) − φ(b))/(c − b).

The inequality between the extreme terms here reduces to

(13)  (c − a)φ(b) ≤ (c − b)φ(a) + (b − a)φ(c),

which is (11) with x = a, y = c, t = (c − b)/(c − a).

[Figures (i)-(iv), not reproduced here, show the chords and support lines referred to below.]

A33. Geometrically, (11) means that the point B in Figure (i) lies on or below the chord AC. But then slope AB ≤ slope AC; algebraically, this is (φ(b) − φ(a))/(b − a) ≤ (φ(c) − φ(a))/(c − a), which is the same as (13). As B moves to A from the right, slope AB is thus nonincreasing and hence has a limit. In other words, φ has a right-hand derivative φ⁺. Figure (ii) shows that slope DE ≤ slope EF ≤ slope FG. Let E move to D from the right and let G move to F from the right: The right-hand derivative at D is at most that at F, and φ⁺ is nondecreasing.

Since the slope of XY in Figure (iii) is at least as great as the right-hand derivative at X, the curve to the right of X lies on or above the line through X with slope φ⁺(x):

(14)  φ(y) ≥ φ(x) + (y − x)φ⁺(x),  y ≥ x.

Figure (iii) also makes it clear that φ is right-continuous.

Similarly, φ has a nondecreasing left-hand derivative φ⁻ and is left-continuous. Since slope AB ≤ slope BC in Figure (i), φ⁻(b) ≤ φ⁺(b). Since clearly φ⁺(b) < ∞ and −∞ < φ⁻(b), φ⁺ and φ⁻ are finite. Finally, (14) and its left-sided analogue show that the curve lies on or above each line through Z in Figure (iv) with slope between φ⁻(z) and φ⁺(z):

(15)  φ(x) ≥ φ(z) + m(x − z),  φ⁻(z) ≤ m ≤ φ⁺(z).
This is a support line.
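The support-line inequality (15) is easy to verify numerically for a smooth convex function. Here φ(x) = x² and the grid are arbitrary choices, and since φ is differentiable the slopes φ⁻(z) = φ⁺(z) = 2z coincide.

```python
def phi(x):
    # an arbitrary convex function on I = (-2, 2)
    return x * x

def support_slope(z):
    return 2.0 * z   # phi'(z); the one-sided derivatives agree here

zs = [-1.5, -0.3, 0.0, 0.7, 1.9]
xs = [-2.0 + 0.01 * i for i in range(401)]
for z in zs:
    m = support_slope(z)
    # the curve lies on or above the support line through (z, phi(z))
    assert all(phi(x) >= phi(z) + m * (x - z) - 1e-12 for x in xs)
```

For x² the inequality reduces to (x − z)² ≥ 0, which is why it holds with equality exactly at the point of support.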
Notes on the Problems
These notes consist of hints, solutions, and references to the literature. As a rule a solution is complete in proportion to the frequency with which it is needed for the solution of subsequent problems.

Section 1

1.1.
(a) Each point of the discrete space lies in one of the four sets A_1 ∩ A_2, A_1^c ∩ A_2, A_1 ∩ A_2^c, A_1^c ∩ A_2^c, and one of these must have probability at least 2⁻²; continue. (b) If, for each i, B_i is A_i or A_i^c, then B_1 ∩ ··· ∩ B_n would have probability at most Π_{i=1}^n (1 − a_i) ≤ exp[−Σ_{i=1}^n a_i].
1.5.
1.6.
(b)
Suppose A is trifling and let A - be its closure. Given £ choose intervals (a k , bk ], k = 1, . . . , n , such that A c Ui: - 1 (a k , bk ] and EZ- 1 (bk - a k ) < e/2. If xk = a k - £j2n, then A - c uz_ l (x k , bk ] and E Z - I ( bk xk ) < £. For the other parts of the problem, consider the set of rationals. n n (a) Cover A b ( /l) by (b - 1) intervals of length b- . (c) Go to the base bk . Identify the digits in the base b with the keys of the typewriter. The monkey is certain eventually to reproduce the eleventh edition of the Britannica and even, unhappily, the fifteenth. (a) The se t A 1 (3) is itself uncountable, since a point in it is specified by a sequence of O' s and 2's (excluding the countably many that end in O's). (b) For sequences u 1 , , un of O' s, 1 ' s, and 2' s, let Mu . . . u consist of the points in (0, 1 ] whose nonterminating base-3 expansions �tart" out with those digits. Then A 1 (3) = (0, 1] - U Mu . . . u , where the union extends over n > 1 and sequences u1 , , un containi�g afleast one 1. The set described in part (b) is [0, 1] - U Nu1 u,. ' where the union is as before and Nu1 u,. is the interior of Mu . . . u , and this is the closure of A 1 (3). From this �epresentation of C, it is not hard to deduce that it can be defined as the set of points in [0, 1] that can be written in base 3 without any • • •
• • •
575
576
NOTES ON THE PROBLEMS
1 's if terminating expansions are also allowed. For example, C contains 2/3 . = .1 222 · · · = .2000 · · · because it is possible to avoid 1 in the expanSIOn. (c) Given an £ and an w in C, choose w' in A 1 (3) within £/2 of w; now define w" by changing from 2 to 0 some digit of w' far enough out that w" differs from w' by at most £/2.
1.8.
The interchange of limit and integral is justified because the series k E k rk ( w )2 - converges uniformly in w (integration to the limit is studied systematically in Section 16). There is a direct derivation of (1.36) : sin t = n n 2 sin 2 - tn k - 1 cos 2- k t ' as follows inductively by the double-angle for mula; use the fact that x - 1 sin x --+ 1 as x --+ 0.
1.9.
Within each b-adic interval ofnrank n - 1, rn ( w) assumes the value b - 1
length b - and the value - 1 on subintervals of total on a subinterval of n length ( b - 1) b - . The proofs therefore parallel those for the case b = 2, /l = 1 .
1.10.
Choose q0 so that Eq "?:. q0qf( q) < £; cover A1 by the intervals ( pq- 1 /( q), pq- 1 + /( q)] for q � q0 and 0 < p < q. See HARDY & WRIGHT, p. 1 58, for a proof that infinitely many pjq come within q- 2 of an irrational w. If q 2f( q) is nonincreasing and E q qf( q ) == oo , it can be shown that A1 has negligible complement, but this is harder; see p. 69 of A. Ya. Khinchirie: Continued Fractions (University of Chicago Press, Chicago, 1964). For example, except for a negligible set of w, l w - p/ql < 1/( q 2 log q) has infinitely many solutions.
1. 13.
(a) Given m and a subinterval (a, b] of (0, 1], choose a dyadic interval I in (a, b ], and then choose in I a dyadic interval J of order n > m such that I n - 1 sn ( w) I > t for w E J. This is possible because to specify J is to specify the first n dyadic digits of the points in J; choose the first digits in such a way that J c I and take the following ones to be 1, with n so large that n - 1 sn ( w ) is near 1 for w e J.
(b) A countable union of sets of the first category is also of the first category; (0, 1] == N u Nc would be of the first category if Nc were. For Baire's theorem, see ROYDEN, p. 1 39.
Section 2 Let n consist of four points and let � consist of the empty set, n itself, and all six of the two-point sets.
2.3.
(b)
2.4.
(b) For example, take 0 to consist of the integers and let � be the
I
2.5.
a-field generated by the singletons { k } with k s n. As a matter of fact, any example in which � is a proper subclass of � + I for all n will do, because it can be shown that in this case Un � necessarily fails to be a a-field; see A. Broughton and B. W. Huff: A comment on unions of sigma-fields, Amer. Math . Monthly, 84 (1977), 553-554.
(b) The class in question is certainly contained in /( JJI) and is easily seen to be closed under the formation of finite intersections. But ( U;'!, 1 nj!.. 1 A;1 ) == n ;"- 1 Uj !.. 1 A r1 , and Uj!.. 1 A f1 == Uj!.. 1 [A f1 n ni:� A ; k ] has the required form.
c
577
NOTES ON THE PROBLEMS
2.7.
2.8. 2.9. 2.10.
If � is the smallest class over .JJI closed under the formation of countable unions and intersections, clearly fc a( .JJI ). To prove the reverse inclusion, first show that the class of A such that Ac e f is closed under the formation of countable unions and intersections and contains .JJI and hence contains f. Note that Un Bn E a ( U n .JJIB ). (a) Show that the class of A for which lA ( w ) lA ( w') is a a-field. See Example 4. 7. (b) Suppose that !F is the a-field of the countable and the co-countable sets in Q. Suppose that !F is countably generated and Q is uncountable. Show that !F is generated by a countable class of singletons; if 00 is the union of these, then !F must consist of the sets B and B u 00 with B c n0 , and these do not include the singletons in ng, which is uncountable because Q is. (c) Let � consist of the Borel sets in Q (0, 1] and let ,. consist of the countable and the co-countable sets there. II
==
==
2.1 1.
2.13. 2.15.
Suppose that A1 , A 2 , . is an infinite sequence of distinct sets in a a-field !F, and let 16 consist of the nonempty sets of the form n�.... 1 Bn , where Bn = A n or Bn = A � , n = 1, 2, . . . . Each An is the union of the �sets it contains, and since the An are distinct, t§ must be infinite. But there are uncountably many distinct countable unions of �sets, and they all lie in !F. If Bn = A n n A f n · · · n A � _ 1 , then A1 = Un Bn , the Bn are disjoint, and P( Bn ) = P(An ) because An � Bn c U :,:. 1 ( A m n A n ) · • •
For this and the subsequent problems on applications of probability theory to arithmetic, the only number theory required is the fundamental theorem of arithmetic and its immediate consequences. The other problems on stochastic arithmetic are 4.21, 5.16, 5.17, 6.17, 18.18, 25.15, 30.9, 30.10, 30.11, and 30.12. See also Theorem 30.3. (b) Let A consist of the even integers, let Ck = [ m : vk < m < vk + 1 ], and let B consist of the even integers in C1 u C3 u · · · together with the odd integers in c2 u c4 u . . . ; take vk to increase very rapidly with k and consider A n B. (c) If c is the least common multiple of a and b, then M0 n Mb = Me. From M0 e 1) conclude in succession that Ma n Mb e 1) , M01 n · · · n Ma n Mt n · · · n Mt e 1) , f(J/) c 1). By the same sequence of steps, show how1 D on Jl de termines D on f(J/). (d) If B1 = Ma - U p s / Map ' then a e B1 and (the inclusion-exclusion formula requires only finite additivity) 1 D ( B, ) = ! - E _!_ + E _ - . . . p s. l
a
(
p < q s. l
ap
)
=!n 1 - .!. < s. l a
p
P
apq
(
! exp - E .!. a
a Choose /0 so that, if Ca = B1 , then D( Ca ) < 2 ity measure on f(J/), D (n ) ��< ! would follow.
psi - •.
P
If
)
D
--+ o .
were a probabil
578 2. 16.
NOTES ON THE PROBLEMS
The essential fact is that if r < s and a cylinder of rank r is the disjoint union of cylinders all of rank s, then there must be 2^{s−r} cylinders in the union, and so the P-values add. Further, a cylinder of rank r can be split into cylinders of rank s (r < s). In what follows, let C_1, ..., C_k be disjoint cylinders of ranks r_1, ..., r_k.

(a) To see that (∪_i C_i)^c lies in 𝒞_0, choose s greater than all the r_i and represent C_i as a disjoint union ∪_j D_{ij} of cylinders of rank s; then (∪_i C_i)^c is the union of the cylinders of rank s not appearing among the D_{ij}. Since the intersection of two cylinders is either the empty set or a cylinder, 𝒞_0 is also closed under the formation of finite intersections.

(b) Suppose that C = ∪_i C_i is itself a cylinder. Choose s and the D_{ij} as before. Then s exceeds the rank of C and C = ∪_{ij} D_{ij}, and so P(C) = Σ_i P(C_i) follows because P(C) = Σ_{ij} P(D_{ij}) and P(C_i) = Σ_j P(D_{ij}).

(c) Follow the pattern of the argument involving (2.12) and (2.13′).

(d) It suffices (Example 2.10) to prove that A_n ∈ 𝒞_0, A_1 ⊃ A_2 ⊃ ···, and P(A_n) ≥ ε > 0 together imply that ∩_n A_n is nonempty. But each A_n is nonempty, and the diagonal method [A14] can be used to produce a point common to the A_n: Choose ω_n in A_n. Choose a sequence {n_k} along which the limits lim_k a_i(ω_{n_k}) = x_i all exist, and put ω = (x_1, x_2, ...). Given m, choose s so that whether or not a point of Ω lies in A_m depends only on its first s components. Now choose k large enough that n_k ≥ m and (a_1(ω), ..., a_s(ω)) = (a_1(ω_{n_k}), ..., a_s(ω_{n_k})). Since ω_{n_k} ∈ A_{n_k} ⊂ A_m, it follows that ω ∈ A_m. (This is the standard proof that sequence space, as a topological product, is compact; the sets in 𝒞_0 are closed.)

(e) The sequence (1.7) cannot end in 0's, but the sequence (a_1(ω), a_2(ω), ...) can. Dyadic intervals can contract to the empty set, but nonempty cylinders cannot.
The diagonal method plays here the role the Heine-Borel theorem plays in the proof of Lemma 2 to Theorem 2.2. The argument for part (d) shows that if P is finitely additive on 𝒞_0, then it is in fact countably additive there.
c
•
•
•
• • •
2. 17.
2.18.
(a) Apply the intermediate-value theorem to the function f(x) = λ(A ∩ (0, x]). Note that this even proves part (c) for λ (under the assumption that λ exists). (b) If 0 < P(B) < P(A), then either 0 < P(B) ≤ ½P(A) or 0 < P(A − B) ≤ ½P(A). Continue. (c) If P(∪_k H_k) < x, choose C so that C ⊂ A − ∪_k H_k and 0 < P(C) ≤ x − P(∪_k H_k). If n^{−1} < P(C), then P(∪_{k<n} H_k) + h_n ≤ P(∪_{k<n} H_k) + P(H_n) + P(C) ≤ P(∪_{k≤n} H_k) + h_n. Zorn's lemma leads to a simpler proof.

(a) Choose B so that P_1(B) = ½; take A = B or A = B^c according as P_2(B) ≥ ½ or P_2(B) < ½. This corresponds to the scheme for dividing a piece of cake between two children: one divides and the other chooses. (b) If P_1, ..., P_n are nonatomic, Ω partitions into sets A_1, ..., A_n such that P_i(A_i) ≥ 1/n. The cake version of the proof runs as follows. The n children line up. The first cuts off 1/n of the cake (according to his measure) and passes it down the line, and any child who thinks it too large shaves it down to 1/n, returning the shavings (somehow) to the body of the cake. The last child to shave the piece down (or the first child if none do) keeps it, and the remaining n − 1 children proceed to divide the remaining cake.
•
• •
For further information, see Walter Stromquist: How to cut a cake fairly, Amer. Math. Monthly, 87 (1980), 640-644. It can also be shown, although this is much harder, that [(P_1(A), ..., P_n(A)): A ∈ 𝔉] is a closed, convex set in n-space: see Paul R. Halmos: The range of a vector measure, Bull.
Amer. Math . Soc., S4 (1948), 416-421 . (c) If J,_ 1 were a a-field, !I c J,_ 1 would follow. Use the fact that, if a1 , a 2 , • • • is a sequence of ordinals satisfying an < 0, then there exists an ordinal a such that a < Q and an < a for all n. Suppose that B1 e Up < 0-.lp , j == 1 , 2 . . . . Choose odd integers n1 in such a way that � e �a
( em 1 1 ' em 1 ' • • • ) == � ; II
II
2
for n not of the form n). ' choose Jo-sets for which .,Pa Q (n) ( em ' em ' . . . ) is 0 or (0, 1] as n is odd or even. Then Uj... 1 � == (l)cr (e1 , e2 , . . . ). Similarly, Be == �cr ( el ' e2 ' • • ) for Jo-sets en if B E Up < cr JP. The rest of the proof is essentially the same as before. �
11 2
0
Section 3 3.1.
3.2.
(a) The finite additivity of P is used in the proof that 𝒞_0 ⊂ 𝓜(P*) and again
(via mono tonicity; see (2.4)) in the proof of (3. 7). The countable additivity of P is used (via countable subadditivity; see Theorem 2.1 ) in the proof of (3. 7). (b) For a specific example consider U n ( 2 - 1 + n - 1 , 1] in connection with Problem 2.14. But an example is provided by every P that is finitely but not countably additive: If P is finitely additive on a field � and An are disjoint 'a-sets whose union A also lies in 'a , then monotonicity (which requires finite additivity only) gives Ek < n P(A k ) == P(U k s n A k ) < P(A) and hence Ek P(Ak ) < P( A ). Countable subadditivity will ensure that there is equality here. (c) The proof of (3.1) involves the countable subadditivity of P on 'a, which is only assumed to be a field (that being the whole point of the theorem). (a) Given £, choose 'a-sets An such that A c Un An and EP(An) < P*( A ) + £; if B == Un A n , then A c B, B e � ' and P( B) < P*( A ) + £; hence the right side of (3.9) is at most P*( A ). On the other hand, A c B and B e ' imply P*(A) < P*( B) == P( B). Hence (3.9). If A c Bk , B" e ,, P( Bk ) < P*(A) + k - 1 , and B == n k Bk , then A c B, B e � ' and P*( A ) == P( B). For ( 3 . 1 0), argue by complementation. (b) Suppose that P.(A) == P*(A ) and choose J[sets A 1 and A 2 in such a way that A 1 c A c A 2 and P(A 1 ) == P(A 2 ). Given £, choose an ji:: set B such that E c B and P*( E) == P( B). Then P*(A n E) + P*(A( n E) <
B) < P( B) + P( A 2 - A 1 ) == P*( E). The cases in which P and P* fail to agree on � are those in which P is P(A 2
3.3.
n
B) + P(A}
n
not countably additive on 'a , namely parts (b) and (e). (In these parts, if P* and P * are defined by (3.1) and (3.2), then, since P *(A ) = 0 for all A ,
NOTES ON THE PROBLEMS

(3.4) holds for all A and (3.3) holds for no A. Countable additivity thus plays an essential role in the preceding problem.) Using Problem 3.2 simplifies the analysis of ℳ(P*) in the other parts of the problem.

3.7. If ℱ is separable, then (Problem 2.6(c)) there is a countable field ℱ₀ such that ℱ = σ(ℱ₀). Given an ℱ-set A and a positive ε, choose ℱ₀-sets A_k such that A ⊂ ⋃_k A_k and ΣP(A_k) < P*(A) + ε = P(A) + ε. If n is large enough and B = ⋃_{k≤n} A_k, then d(A, B) < ε. There are only countably many such sets B.

3.8. (a) See the note on part (b) of Problem 2.3. (b) Of course (A-5) is trivial. If A, B ∈ 𝒫 and A ⊂ B, then 𝒫 contains the disjoint union Bᶜ ∪ A and its complement B − A; hence (A-4). Finally, (A-6) follows from the observation that Aₙ ↑ A implies A = ⋃ₙ(Aₙ − A_{n−1}), and (A-7) follows by complementation. (c) If 𝒫 satisfies (A-1), (A-4), and (A-6) and A and B are disjoint and in 𝒫, then A ∪ B = Ω − ((Ω − A) − B) is by (A-4) also in 𝒫; further, if the Aₙ are disjoint 𝒫-sets, then the 𝒫-sets ⋃_{k≤n} A_k increase to ⋃ₙAₙ, which must then lie in 𝒫 by (A-6).

3.9. By part (b) of Problem 2.5 it suffices to show that if A, A₁, …, Aₙ lie in 𝒫, then A ∩ A₁ᶜ ∩ ⋯ ∩ Aₙᶜ ∈ 𝒫. Use induction on n and the property (A-4).

3.13. In Ω = {1, 2, 3, 4} consider the class consisting of {1, 2} and {2, 3}.

3.14. Since the Cantor set is uncountable and has measure 0, the number of Lebesgue sets exceeds the power of the continuum.

3.18. (a) Since the A ⊕ r are disjoint Borel sets, Σ_r λ(A ⊕ r) ≤ 1, and so the common value λ(A) of the λ(A ⊕ r) must be 0. Similarly, if A is a Borel set contained in some H ⊕ r, then λ(A) = 0. (b) If the E ∩ (H ⊕ r) are all Borel sets, they all have Lebesgue measure 0, and so E is a Borel set of Lebesgue measure 0.

3.19. (b) Given A₁, B₁, …, A_{n−1}, B_{n−1}, note that their union Cₙ is nowhere dense, so that Iₙ contains an interval Jₙ disjoint from Cₙ. Choose in Jₙ disjoint, nowhere dense sets Aₙ and Bₙ of positive measure. (c) Note that A and Bₙ are disjoint and that Aₙ ∪ Bₙ ⊂ G.

3.20. (a) If the Iₙ are disjoint open intervals with union G, then b⁻¹λ(A) ≥ Σₙ λ(Iₙ) ≥ Σₙ b⁻¹λ(A ∩ Iₙ) ≥ b⁻¹λ(A).

3.21. Let the Iₙᵢ be the dyadic intervals of order n. Given Iₙⱼ and Iₙₖ, choose 2ⁿ⁻¹ − 1 of the Iₙᵢ distinct from Iₙⱼ and Iₙₖ and from each other. If B is their union, then P(B ∪ Iₙⱼ) and P(B ∪ Iₙₖ) have common value ½, and so P(Iₙⱼ) = P(Iₙₖ). Now apply Theorem 3.3 to the class of dyadic intervals (a π-system).
Section 4

4.2. Let r be the quantity on the right in (4.33), assumed finite. Then x < r implies that x < ⋁_{k≥n} x_k for each n ≥ 1, which in turn implies x ≤ r. Now x < ⋁_{k≥n} x_k for each n ≥ 1 if and only if for each n ≥ 1 there is a k ≥ n such that x < x_k, and this holds if and only if x < xₙ infinitely often.
Therefore, r = sup[x: x < xₙ i.o.], and this is easily seen to be the supremum of the limit points of the sequence. The argument for (4.34) is similar.

4.4. (a) lim infₙ Aₙ = A₁ ∩ A₂, lim supₙ Aₙ = A₁ ∪ A₂. (c) The limits inferior and superior are, respectively, the closed disc of radius 1 and the open disc of radius √2.

4.8. If {Aₙ} is a Cauchy sequence, ensure by passing to a subsequence that P(Aₙ Δ Aₙ₊₁) < 2⁻ⁿ, take A = lim supₙ Aₙ, and apply the first Borel-Cantelli lemma and Problem 4.6(d).

4.11. First use the inclusion-exclusion formula to show that the equation holds for A = ⋃ᵢ₌₁ʳ Aᵢ and then pass to the limit to show that it holds for A = ⋃ᵢ₌₁^∞ Aᵢ. Now use the π-λ theorem, as in the proof of Theorem 4.2.

4.13. If (H ∩ G₁) ∪ (Hᶜ ∩ G₂) = (H ∩ G₁′) ∪ (Hᶜ ∩ G₂′), then G₁ Δ G₁′ ⊂ Hᶜ and G₂ Δ G₂′ ⊂ H; consistency now follows because λ_*(H) = λ_*(Hᶜ) = 0. If the Aₙ = (H ∩ G₁⁽ⁿ⁾) ∪ (Hᶜ ∩ G₂⁽ⁿ⁾) are disjoint, then G₁⁽ᵐ⁾ ∩ G₁⁽ⁿ⁾ ⊂ Hᶜ and G₂⁽ᵐ⁾ ∩ G₂⁽ⁿ⁾ ⊂ H for m ≠ n, and therefore (see Problem 2.13) P(⋃ₙAₙ) = ½λ(⋃ₙG₁⁽ⁿ⁾) + ½λ(⋃ₙG₂⁽ⁿ⁾) = Σₙ(½λ(G₁⁽ⁿ⁾) + ½λ(G₂⁽ⁿ⁾)) = Σₙ P(Aₙ).

4.14. The intervals with rational endpoints generate 𝒢.

4.19. (c) Consider disjoint events A₁, A₂, … for which P(Aₙ) = 1/2ⁿ.

4.20. The problem being symmetric in Aₙ and Aₙᶜ, suppose that aₙ = pₙ and define Aₙ as in Problem 4.18. Define B = lim infₙ Aₙᶜ, apply the first Borel-Cantelli lemma, and pass from sequence space to B and from Aₙ to B ∩ Aₙ. Show as in Problem 1.1(b) that the maximum of P(B₁ ∩ ⋯ ∩ Bₙ), where Bᵢ is Aᵢ or Aᵢᶜ, goes to 0.

4.21. Let A_x = [ω: Σₙ zₙ(ω)2⁻ⁿ ≤ x], show that P(A ∩ A_x) is continuous in x, and proceed as in Problem 2.17(a).

4.23. Calculate D(Fᵢ) by (2.22) and the inclusion-exclusion formula, and estimate Pₙ(Fᵢ − F) by subadditivity; now use 0 ≤ Pₙ(Fᵢ) − Pₙ(F) = Pₙ(Fᵢ − F). For the calculation of the infinite product, see HARDY & WRIGHT, p. 245. If Aₙ is the union of the intervals removed at the nth stage, then P(Aₙ | A₁ᶜ ∩ ⋯ ∩ A_{n−1}ᶜ) = aₙ, where P refers to Lebesgue measure.
Section 5

5.5. (a) If m = 0, a ≥ 0, and x > 0, then P[X ≥ a] ≤ P[(X + x)² ≥ (a + x)²] ≤ E[(X + x)²]/(a + x)² = (σ² + x²)/(a + x)²; minimize over x.

5.8. (b) It is enough to prove that φ(t) = f(t(x′, y′) + (1 − t)(x, y)) is convex in t (0 ≤ t ≤ 1) for (x, y) and (x′, y′) in C. If α = x′ − x and β = y′ − y, then (if f₁₁ > 0)

φ″ = f₁₁α² + 2f₁₂αβ + f₂₂β² = f₁₁⁻¹(f₁₁α + f₁₂β)² + f₁₁⁻¹(f₁₁f₂₂ − f₁₂²)β² ≥ 0.
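The minimization in 5.5(a) can be carried out numerically as a check: calculus gives the minimizer x = σ²/a and the minimal bound σ²/(σ² + a²) (the one-sided Chebyshev bound). A small sketch; the function names are ours:

```python
# One-sided Chebyshev (5.5a): minimize (s2 + x^2)/(a + x)^2 over x > 0.
# Calculus gives minimizer x* = s2/a and minimum value s2/(s2 + a^2).
def bound(s2, a, x):
    return (s2 + x * x) / (a + x) ** 2

def cantelli(s2, a):
    return s2 / (s2 + a * a)

s2, a = 2.0, 3.0
# crude grid search over x in (0, 100)
grid_min = min(bound(s2, a, k * 0.001) for k in range(1, 100000))
assert abs(grid_min - cantelli(s2, a)) < 1e-6
assert abs(bound(s2, a, s2 / a) - cantelli(s2, a)) < 1e-12
```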
5.9. Check (5.35) for f(x, y) = x^{1/p} y^{1/q}.

5.10. Check (5.35) for f(x, y) = (x^{1/p} + y^{1/p})^p.

5.16. For (5.39) use (2.22) and the fundamental theorem of arithmetic: since the pᵢ are distinct, the pᵢ^{aᵢ} individually divide m if and only if their product does. For (5.40) use inclusion-exclusion. For (5.43), use (5.25) (see Problem 5.14).

5.17. (a) By (5.43), Eₙ[α_p] ≤ Σ_{k=1}^∞ p⁻ᵏ ≤ 2/p. And, of course, n⁻¹ log n! = Eₙ[log] = Σ_p Eₙ[α_p] log p. (See Problem 18.21 for one proof of Stirling's formula.) (b) Use (5.44) and the fact that Eₙ[α_p − δ_p] ≤ Σ_{k=2}^∞ p⁻ᵏ. (c) By (5.45),

Σ_{n<p≤2n} log p = Σ_{n<p≤2n} ([2n/p] − 2[n/p]) log p ≤ 2n(E_{2n}[log*] − Eₙ[log*]) = O(n).

Deduce (5.46) by splitting the range of summation by successive powers of 2. The difference between the left side of (5.47) and E_{[x]}[log*] is O(1). (d) If K bounds the O(1) term in (5.47), then

Σ_{θx<p≤x} log p ≥ θx Σ_{θx<p≤x} p⁻¹ log p ≥ θx(log θ⁻¹ − 2K).

(e) For (5.49) use

(log x)⁻¹ Σ_{p≤x} log p ≤ ω(x) ≤ Σ_{p≤x^{1/2}} 1 + Σ_{x^{1/2}<p≤x} log p / log x^{1/2} ≤ x^{1/2} + (2/log x) Σ_{p≤x} log p.

By (5.49), ω(x) ≥ x^{1/2} for large x, and hence log ω(x) ≍ log x and ω(x) ≍ x/log ω(x). Apply this with x = p_r, and note that ω(p_r) = r.
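The Chebyshev-type estimates of 5.17 say that Σ_{p≤x} log p is of exact order x and that ω(x), the number of primes up to x, is of order x/log x. A quick numerical illustration (the sieve helper is ours):

```python
# Numerical check of the order-of-magnitude estimates in 5.17:
# theta(x) = sum of log p over primes p <= x is near x, and the
# prime count pi(x) is of order x/log x (convergence is slow).
from math import log

def sieve(n):
    is_p = [True] * (n + 1)
    is_p[0] = is_p[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_p[i]:
            for j in range(i * i, n + 1, i):
                is_p[j] = False
    return [i for i in range(2, n + 1) if is_p[i]]

x = 10 ** 5
primes = sieve(x)
theta = sum(log(p) for p in primes)
pi = len(primes)
assert 0.9 < theta / x < 1.1           # theta(x)/x is near 1
assert 1.0 < pi / (x / log(x)) < 1.2   # pi(x) ~ x/log x
```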
Section 6

6.3. Since for given values of X_{n1}(ω), …, X_{n,k−1}(ω) there are for X_{nk}(ω) the k possible values 0, 1, …, k − 1, the number of values of (X_{n1}(ω), …, X_{nn}(ω)) is n!. Therefore, the map ω → (X_{n1}(ω), …, X_{nn}(ω)) is one-to-one, and the X_{nk}(ω) determine ω. It follows that if 0 ≤ xᵢ < i for 1 ≤ i ≤ k, then the number of permutations ω satisfying X_{ni}(ω) = xᵢ, 1 ≤ i ≤ k, is just (k + 1)(k + 2) ⋯ n, so that P[X_{ni} = xᵢ, 1 ≤ i ≤ k] = 1/k!. It now follows by induction on k that X_{n1}, …, X_{nk} are independent and P[X_{nk} = x] = k⁻¹ (0 ≤ x < k). Now calculate

E[X_{nk}] = (0 + 1 + ⋯ + (k − 1))/k = (k − 1)/2,  E[Sₙ] = Σ_{k=1}ⁿ (k − 1)/2 = n(n − 1)/4,

Var[X_{nk}] = (0² + 1² + ⋯ + (k − 1)²)/k − ((k − 1)/2)² = (k² − 1)/12,

Var[Sₙ] = Σ_{k=1}ⁿ (k² − 1)/12 = (2n³ + 3n² − 5n)/72.

Apply Chebyshev's inequality.
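The moment formulas in 6.3 can be verified exactly by enumerating all n! permutations, since Sₙ is the total inversion count; a small sketch:

```python
# Exact check of 6.3: over all n! permutations, the inversion count S_n
# has mean n(n-1)/4 and variance (2n^3 + 3n^2 - 5n)/72.
from itertools import permutations

def inversions(p):
    return sum(1 for i in range(len(p)) for j in range(i) if p[j] > p[i])

n = 6
counts = [inversions(p) for p in permutations(range(n))]
mean = sum(counts) / len(counts)
var = sum(c * c for c in counts) / len(counts) - mean ** 2
assert abs(mean - n * (n - 1) / 4) < 1e-9
assert abs(var - (2 * n ** 3 + 3 * n ** 2 - 5 * n) / 72) < 1e-9
```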
6.7. (a) If k² ≤ n < (k + 1)², let aₙ = k²; if M bounds the |xₙ|, then

|sₙ/n − s_{aₙ}/aₙ| ≤ (1/n)|sₙ − s_{aₙ}| + |s_{aₙ}|(1/aₙ − 1/n) ≤ (1/aₙ)(n − aₙ)M + (1/aₙ)(n − aₙ)M = 2M(n − aₙ)/aₙ → 0.

6.17. From (5.49) it follows that aₙ = Σ_p n⁻¹[n/p] → ∞. The left side of (6.12) is

(1/n)[n/pq] − (1/n)[n/p](1/n)[n/q] ≤ 1/pq − (1/p − 1/n)(1/q − 1/n) ≤ 1/np + 1/nq.
Section 7

7.3. If one grants that there are only countably many effective rules, the result is an immediate consequence of the mathematics of this and the preceding sections: C is a countable intersection of ℱ-sets of measure 1. The argument proves in particular the nontrivial fact that collectives exist. See Per Martin-Löf: The literature on von Mises' Kollektivs revisited, Theoria, 35 (1969), 12-37.
7.7. If n ≤ τ, then Wₙ = W_{n−1} − X_{n−1} = W₁ − S_{n−1}, and τ is the smallest n for which S_{n−1} = W₁. Use (7.8) for the question of whether the game terminates. Now

F_τ = F₀ + Σ_{k=1}^{τ−1} (W₁ − S_{k−1})X_k = F₀ + W₁S_{τ−1} − ½(S_{τ−1}² − (τ − 1)) = F₀ + ½W₁² + ½(τ − 1).

7.8. Let E₀ = x₁ + ⋯ + x_k, Eₙ = E_{n−1} − WₙXₙ, L₀ = k, and Lₙ = L_{n−1} − (3Xₙ + 1)/2. Then τ is the smallest n such that Lₙ ≤ 0, and τ is by the strong law finite with probability 1 if E[3Xₙ + 1] = 6(p − 1/3) > 0. For n < τ, Eₙ is the sum of the pattern used to determine W_{n+1}. Since Fₙ − F_{n−1} = E_{n−1} − Eₙ, Fₙ = F₀ + E₀ − Eₙ, and F_τ = F₀ + E₀.

7.9. Observe that E[Fₙ − F_{τ∧n}] = E[Σ_{k=1}ⁿ X_k I_{[τ<k]}] = Σ_{k=1}ⁿ E[X_k] P[τ < k].
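The identity F_τ = F₀ + ½W₁² + ½(τ − 1) of 7.7 holds pathwise, so it can be checked by simulation. A sketch under the bookkeeping conventions above (bet Wₙ; a win Xₙ = +1 shrinks the next bet by 1, a loss grows it by 1; play stops when the called-for bet is 0); the variable names are ours:

```python
# Pathwise check of F_tau = F_0 + W_1^2/2 + (tau - 1)/2 ("progress and pinch").
import random

random.seed(7)
for _ in range(200):
    F0, W1 = 10.0, random.randint(1, 5)
    F, w, n = F0, W1, 0
    while w > 0:
        n += 1                                  # bet number n
        X = 1 if random.random() < 0.6 else -1  # p = .6 > 1/2, so play ends a.s.
        F += w * X
        w -= X                                  # win: pinch by 1; loss: progress by 1
    tau = n + 1  # play stops when the bet called for at time tau is 0
    assert F == F0 + 0.5 * W1 ** 2 + 0.5 * (tau - 1)
```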
Section 8
8.10.
(b) With probability 1 the population either dies out or goes to infinity. If, for example, p_{k0} = 1 − p_{k,k+1} = 1/k², then extinction and explosion each have positive probability.
8.11. To prove that xᵢ ≡ 0 is the only possibility in the persistent case, use Problem 8.6, or else argue directly: If xᵢ = Σ_{j≠i₀} p_{ij}xⱼ for i ≠ i₀, and K bounds the |xᵢ|, then xᵢ = Σ p_{ii₁} p_{i₁i₂} ⋯ p_{i_{n−1}iₙ} x_{iₙ}, where the sum is over i₁, …, i_{n−1} distinct from i₀, and hence |xᵢ| ≤ K Pᵢ[X_k ≠ i₀, k ≤ n] → 0.

8.15. Let P be the set of i for which πᵢ > 0, let N be the set of i for which πᵢ ≤ 0, and suppose that P and N are both nonempty. For i₀ ∈ P and j₀ ∈ N choose n so that p_{i₀j₀}^{(n)} > 0. Then

0 < Σ_{i∈P} πᵢ Σ_{j∈N} p_{ij}^{(n)} = Σ_{j∈N} πⱼ − Σ_{j∈N} Σ_{i∈N} πᵢ p_{ij}^{(n)} ≤ 0.

Transfer from N to P any i for which πᵢ = 0 and use a similar argument.
8.18. Denote the sets (8.32) and (8.52) by P and by F, respectively. Since F ⊂ P, gcd P ≤ gcd F. The reverse inequality follows from the fact that each integer in P is a sum of integers in F.
8.19. Consider the chain with states 0, 1, 2, … and a, and transition probabilities p_{0j} = f_{j+1} for j ≥ 0, p_{0a} = 1 − f, p_{i,i−1} = 1 for i ≥ 1, and p_{aa} = 1 (a is an absorbing state). The transition matrix is

    f₁  f₂  f₃  ⋯  1 − f
    1   0   0   ⋯  0
    0   1   0   ⋯  0
    ⋯
    0   0   0   ⋯  1

Show that f₀₀^{(n)} = fₙ and p₀₀^{(n)} = uₙ, and apply Theorem 8.1. Then assume f = 1, discard the state a and any states j such that f_k = 0 for k > j, and apply Theorem 8.8. In FELLER, Volume 1, the renewal theorem is proved by purely analytic means and is then used as the starting point for the theory of Markov chains. Here the procedure is the reverse.
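The identities f₀₀^{(n)} = fₙ and p₀₀^{(n)} = uₙ mean that the uₙ satisfy the renewal recursion uₙ = Σ_{k=1}ⁿ f_k u_{n−k}. A numerical sketch with geometric fₖ (our choice of example), where the renewal-theorem limit 1/μ is visible exactly:

```python
# Renewal recursion behind 8.19: u_0 = 1, u_n = sum_{k=1}^n f_k u_{n-k}.
# For f_k = 2^{-k} the mean recurrence time is mu = sum k f_k = 2,
# and u_n -> 1/mu (here u_n = 1/2 exactly for n >= 1).
N = 60
f = [0.0] + [2.0 ** -k for k in range(1, N + 1)]
u = [1.0]
for n in range(1, N + 1):
    u.append(sum(f[k] * u[n - k] for k in range(1, n + 1)))
mu = sum(k * f[k] for k in range(1, N + 1))
assert abs(mu - 2.0) < 1e-12
assert abs(u[N] - 1.0 / mu) < 1e-12
```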
8.21. The transition probabilities are p_{0r} = 1 and p_{i,r−i+1} = p, p_{i,r−i} = q for 1 ≤ i ≤ r; the stationary probabilities are uᵢ = q⁻¹u₀ = (r + q)⁻¹, 1 ≤ i ≤ r. The chance of getting wet is u₀p, of which the maximum is 2r + 1 − 2√(r(r + 1)). For r = 5 this is .046, the pessimal value of p being .523. Of course, u₀p ≤ 1/4r. In more reasonable climates fewer umbrellas suffice: if p = .25 and r = 3, then u₀p = .050; if p = .1 and r = 2, then u₀p = .031.
At the other end of the scale, if p = .8 and r = 3, then u₀p = .050; and if p = .9 and r = 2, then u₀p = .043.

8.24. For the last part, consider the chain with state space C_m and transition probabilities p_{ij} for i, j ∈ C_m (show that they do add to 1).
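The numerical values quoted in the umbrella problem (8.21) follow from the closed form u₀p = pq/(r + q); a direct check of that formula (taking it as given; the function name is ours):

```python
# Check of the numbers in 8.21, using u0*p = p*q/(r + q) with q = 1 - p.
from math import sqrt

def wet(p, r):
    q = 1.0 - p
    return p * q / (r + q)

assert round(wet(0.25, 3), 3) == 0.050
assert round(wet(0.10, 2), 3) == 0.031
assert round(wet(0.80, 3), 3) == 0.050
assert round(wet(0.90, 2), 3) == 0.043
# maximum over p for r = 5 is 2r + 1 - 2*sqrt(r(r+1)) ~ .046, near p ~ .523
r = 5
peak = max(wet(k / 10000.0, r) for k in range(1, 10000))
assert abs(peak - (2 * r + 1 - 2 * sqrt(r * (r + 1)))) < 1e-6
```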
8.25. Let C′ = S − (T ∪ C) and take U = T ∪ C′ in (8.51). The probability of absorption in C is the probability of ever entering it, and for initial states i in T ∪ C′ these probabilities are the minimal solution of

yᵢ = Σ_{j∈T} p_{ij}yⱼ + Σ_{j∈C} p_{ij} + Σ_{j∈C′} p_{ij}yⱼ,  i ∈ T ∪ C′,
0 ≤ yᵢ ≤ 1,  i ∈ T ∪ C′.

Since the states in C′ (C′ = ∅ is possible) are persistent and C is closed, it is impossible to move from C′ to C. Therefore, in the minimal solution of the system above, yᵢ = 0 for i ∈ C′. This gives the system (8.55). It also gives, for the minimal solution, Σ_{j∈T} p_{ij}yⱼ + Σ_{j∈C} p_{ij} = 0, i ∈ C′. This makes probabilistic sense: from an i in C′, not only is it impossible to move to a j in C, it is impossible to move to a j in T for which there is positive probability of absorption in C.
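For a concrete chain the absorption system can be solved directly. A sketch with fair gambler's ruin on {0, …, 5}, T = {1, …, 4}, C = {5}, where the absorption probabilities are known to be i/5 (the example chain and the small solver are our illustration, not from the text):

```python
# Absorption probabilities as the solution of y = Qy + b, where Q holds
# transitions within the transient set T and b holds one-step moves into C.
def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting (small dense system)
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                t = M[r][c] / M[c][c]
                M[r] = [a - t * e for a, e in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

N, p = 5, 0.5
T = list(range(1, N))                  # transient states 1..4
Q = [[0.0] * len(T) for _ in T]
b = [0.0] * len(T)
for idx, i in enumerate(T):
    if i + 1 < N:
        Q[idx][idx + 1] = p            # step up within T
    else:
        b[idx] = p                     # step up into C = {N}
    if i - 1 >= 1:
        Q[idx][idx - 1] = 1 - p        # step down within T (state 0 absorbs)
A = [[(1.0 if r == c else 0.0) - Q[r][c] for c in range(len(T))] for r in range(len(T))]
y = solve(A, b)
assert all(abs(y[idx] - i / N) < 1e-9 for idx, i in enumerate(T))
```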
8.26. Fix on a state i and let S_ν consist of those j for which p_{ij}^{(n)} > 0 for some n congruent to ν modulo t. Choose k so that p_{ji}^{(k)} > 0; if p_{ij}^{(m)} and p_{ij}^{(n)} are positive, then t divides m + k and n + k, so that m and n are congruent modulo t. The S_ν are thus well defined.
8.27. Show that Theorem 8.6 applies to the chain with transition probabilities p_{ij}^{(t)}.

8.29. Show that aₙ = πᵢ + O{(1/n) Σ_{k=1}ⁿ ρ^{n−k}} = πᵢ + O(1/n), where ρ is as in Theorem 8.9.

8.34. The definitions give

Eᵢ[I(X_{αₙ})] = Pᵢ[αₙ < n]I(0) + Pᵢ[αₙ = n]I(i + n) = 1 − Pᵢ[αₙ = n] + Pᵢ[αₙ = n](1 − h_{i+n,0})
= 1 − pᵢ ⋯ p_{i+n−1} + pᵢ ⋯ p_{i+n−1}(p_{i+n} p_{i+n+1} ⋯),
and this goes to 1. Since Pᵢ[τ ≤ n = αₙ] ≥ Pᵢ([τ ≤ n] ∩ [X_k > 0, k ≥ 1]) ≥ 1 − f_{i0} > 0, there is an n of the kind required in the last part of the problem.

8.35. If i ≥ 1, n₁ < n₂, (i, …, i + n₁) ∈ I_{n₁}, and (i, …, i + n₂) ∈ I_{n₂}, then Pᵢ[τ = n₁, τ = n₂] ≥ Pᵢ[X_k = i + k, k ≤ n₂] > 0, which is impossible.
Section 9
9.4.
See BAHADUR.
9.9.
Because of Theorem 9.6 there are for P[Mₙ ≥ a] bounds of the same order as the ones for P[Sₙ ≥ a] used in the proof of (9.36).
Section 10
10.7.
(a) Order by inclusion the family of all disjoint collections of ℱ-sets of finite positive measure; each of them is countable by hypothesis. Use Zorn's lemma to find a maximal collection {A_k}; show that μ((⋃_k A_k)ᶜ) is finite and is in fact 0.
10.8.
Suppose μ₁ and μ₂ are probability measures on σ(𝒫) and agree on the π-system 𝒫. Since μ₁(Ω) = μ₂(Ω), Ω can be added to 𝒫, and then the hypotheses of Theorems 10.3 and 10.4 are both satisfied.
10.9.
Let μ₁ be counting measure on the σ-field of all subsets of a countably infinite Ω, let μ₂ = 2μ₁, and let 𝒫 consist of the cofinite sets. Granted the existence of Lebesgue measure λ on ℛ¹, one can construct another example: let μ₁ = λ and μ₂ = 2λ, and let 𝒫 consist of the half-infinite intervals (−∞, x). There are similar examples with a field ℱ₀ in place of 𝒫. Let Ω consist of the rationals in (0, 1], let μ₁ be counting measure, let μ₂ = 2μ₁, and let ℱ₀ consist of finite disjoint unions of "intervals" [r ∈ Ω: a < r ≤ b].
10.10
(a) Modify the proof of Theorem 10.3, using the monotone class theorem in place of the π-λ theorem.
Section 11

11.4. (a) See the special cases (12.16) and (12.17).

11.5. (d) Suppose that A ∈ f(𝒜); if A can be represented as a finite or countable disjoint union of 𝒜-sets Aₙ, put μ(A) = Σₙ μ(Aₙ) (prove consistency); otherwise, put μ(A) = ∞. This extension need not be unique: suppose 𝒜 = {∅}, for example.

11.6. (c) See HALMOS, p. 31, for the proof that pairwise additivity implies finite additivity in strong semirings. For a counterexample in the case of semirings, take 𝒜 to consist of ∅, {1}, {2}, {3}, and Ω = {1, 2, 3}, and make sure that μ{1} + μ{2} + μ{3} < μ(Ω). Accounts treating only strong semirings omit the qualifier "strong."
11.7.
Take preliminary ℱ₀-sets B_k satisfying μ(B_k Δ A_k) < ε/3n; add ⋃_{j<k}(B_j ∩ B_k) to one of these sets and subtract it from the others, and let B be the union of the new B_k. If A lies in ℱ₀, subtract B ∩ Aᶜ from all the B_k and add A ∩ Bᶜ to one of them.
11.10. Complementation is easy. Suppose G = ⋃ₙGₙ, Gₙ ∈ 𝒢. Choose ℱ₀-sets A_{nk} and B_{nk} such that ⋂_k A_{nk} ⊂ Gₙ ⊂ ⋃_k B_{nk} and μ((⋃_k B_{nk}) − (⋂_k A_{nk})) < ε/2ⁿ. Now choose N so that μ(G − ⋃_{n≤N} Gₙ) < ε. Approximate G from without by ⋃_{n,k} B_{nk} and from within by ⋂_k(⋃_{n≤N} A_{nk}).
11.11. (b) If (f, g] ⊂ ⋃_k (f_k, g_k], then (f(ω), g(ω)] ⊂ ⋃_k (f_k(ω), g_k(ω)] for all ω, and Lemma 2 to Theorem 2.2 gives g(ω) − f(ω) ≤ Σ_k (g_k(ω) − f_k(ω)). If hₙ = (g − f − Σ_{k≤n}(g_k − f_k)) ∨ 0, then hₙ → 0 and g − f ≤ Σ_{k≤n}(g_k − f_k) + hₙ. The positivity and continuity of Λ now give ν₀(f, g] ≤ Σ_k ν₀(f_k, g_k]. A similar, easier argument shows that Σ_k ν₀(f_k, g_k] ≤ ν₀(f, g] if the (f_k, g_k] are disjoint subsets of (f, g].

11.12. (b) From (11.11) it follows that [f > 1] ∈ ℱ₀ for f in ℒ. Since ℒ is linear, [f > x] and [f < −x] are in ℱ₀ for f ∈ ℒ and x > 0. Since the sets (x, ∞) and (−∞, −x) for x > 0 generate ℛ¹, each f in ℒ is measurable σ(ℱ₀). Hence ℱ = σ(ℱ₀). It is easy to show that ℱ₀ is a semiring and is in fact closed under the formation of proper differences. It can happen that Ω ∉ ℱ₀: for example, in the case where Ω = {1, 2} and ℒ consists of the f with f(1) = 0. (On the other hand, ℱ₀ is a σ-ring and 𝒜₀ is a strong semiring in the sense of Problems 11.5 and 11.6, and the arguments can be carried through in that context; see Jürgen Kindler: A simple proof of the Daniell-Stone representation theorem, Amer. Math. Monthly, 90 (1983), 396-397.)
Section 12

12.1. Take a = μ(0, 1]; for A an interval with rational endpoints, compare μ(A) with a by exhibiting A and (0, 1] as disjoint unions of translates of a single interval. Such A's form a π-system.
12.4. (a) If θₙ = θₘ, then θ_{n−m} = 0 and n = m because θ is irrational. Split G into finitely many intervals of length less than ε; one of them must contain points θ_{2n} and θ_{2m} with θ_{2n} < θ_{2m}. If k = m − n, then 0 < θ_{2m} − θ_{2n} = θ_{2m} ⊖ θ_{2n} = θ_{2k} < ε, and the points θ_{2kl} for 1 ≤ l ≤ [θ_{2k}⁻¹] form a chain in which the distance from each to the next is less than ε, the first is to the left of ε, and the last is to the right of 1 − ε. (c) If s₁ ⊖ s₂ = θ_{2k+1} ⊖ θ_{2n₁} ⊕ θ_{2n₂} lies in the subgroup, then s₁ = s₂ and θ_{2k+1} = θ_{2(n₁−n₂)}.
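The chain argument in 12.4(a) shows that the multiples of an irrational θ, reduced mod 1, come within ε of every point of [0, 1); a numerical illustration (the particular θ and ε are our choices):

```python
# Density of {k*theta mod 1}: after finitely many multiples, every gap
# between consecutive points of the orbit is smaller than epsilon.
from math import sqrt

theta = sqrt(2) - 1.0     # an irrational in (0, 1)
eps = 0.01
pts = sorted((k * theta) % 1.0 for k in range(1, 2000))
gaps = [pts[0]] + [b - a for a, b in zip(pts, pts[1:])] + [1.0 - pts[-1]]
assert max(gaps) < eps
```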
12.5. (a) The S ⊕ θₘ are disjoint, and (2n + 1)v + k = (2n + 1)v′ + k′ with |k|, |k′| ≤ n is impossible if v ≠ v′. (b) The A ⊕ θ_{(2n+1)v} are disjoint, contained in G, and have the same Lebesgue measure.
12.6.
See Example 2.10 (which applies to any finite measure).
12.8. By Theorem 12.3 and Problem 2.17(b), A contains two disjoint compact sets of arbitrarily small positive measure. Construct inductively compact sets K_{u₁⋯uₙ} (each uᵢ is 0 or 1) such that 0 < μ(K_{u₁⋯uₙ}) < 3⁻ⁿ and K_{u₁⋯uₙ0} and K_{u₁⋯uₙ1} are disjoint subsets of K_{u₁⋯uₙ}. Take K = ⋂ₙ ⋃_{u₁⋯uₙ} K_{u₁⋯uₙ}. The Cantor set is a special case.
12.14. Suppose that Aₙ ∈ ℋ. To show that A = ⋃ₙAₙ ∈ ℋ, choose closed Cₙ and open Gₙ so that Cₙ ⊂ Aₙ ⊂ Gₙ and μ(Gₙ − Cₙ) < ε/2ⁿ, choose N so that μ(A − ⋃_{n≤N} Aₙ) < ε, and consider C = ⋃_{n≤N} Cₙ and G = ⋃ₙGₙ. Complementation is easy. Approximate a closed rectangle [x: aᵢ ≤ xᵢ ≤ bᵢ, i ≤ k] by open rectangles [x: aᵢ − m⁻¹ < xᵢ < bᵢ + m⁻¹, i ≤ k].
Section 13

13.6. If f = Σᵢ xᵢ I_{Aᵢ} and Aᵢ ∈ T⁻¹ℱ′, take Aᵢ′ in ℱ′ so that Aᵢ = T⁻¹Aᵢ′ and set φ = Σᵢ xᵢ I_{Aᵢ′}. For the general f measurable T⁻¹ℱ′, there exist simple functions fₙ, measurable T⁻¹ℱ′, such that fₙ(ω) → f(ω) for each ω. Choose φₙ, measurable ℱ′, so that fₙ = φₙT. Let C′ be the set of ω′ for which φₙ(ω′) has a finite limit, and define φ(ω′) = limₙ φₙ(ω′) for ω′ ∈ C′ and φ(ω′) = 0 for ω′ ∉ C′. Theorem 20.1(ii) is a special case.
13.10. The class of Borel functions contains the continuous functions and is closed under pointwise passages to the limit and hence contains ℬ. By imitating the proof of the π-λ theorem, show that, if f and g lie in ℬ, then so do f + g, fg, f − g, f ∨ g (note that, for example, [g: f + g ∈ ℬ] is closed under passages to the limit). If fₙ(x) is 1 or 1 − n(x − a) or 0 as x ≤ a or a ≤ x ≤ a + n⁻¹ or a + n⁻¹ ≤ x, then fₙ is continuous and fₙ(x) → I_{(−∞, a]}(x). Show that [A: I_A ∈ ℬ] is a λ-system. Conclude that ℬ contains all indicators of Borel sets, all simple Borel functions, all Borel functions.
13.11. The set [x: f(x) < a] is open.

13.14. On the integers with counting measure, consider the indicator of {n, n + 1, …}.
13.20. Let B = {b₁, …, b_k}, where k < n, Eᵢ = C − bᵢ⁻¹A, and E = ⋃ᵢ₌₁ᵏ Eᵢ. Then E = C − ⋂ᵢ₌₁ᵏ bᵢ⁻¹A. Since μ is invariant under rotations, μ(Eᵢ) = 1 − μ(A) < n⁻¹, and hence μ(E) < 1. Therefore C − E = ⋂ᵢ₌₁ᵏ bᵢ⁻¹A is nonempty. Use any θ in C − E.
Section 14

14.4. (b) Since u ≤ F(x) is equivalent to φ(u) ≤ x, u ≤ F(φ(u)). Since F(x) < u is equivalent to x < φ(u), F(φ(u) − ε) < u for positive ε.

14.5. (a) If 0 < u < v < 1, then P[u ≤ F(X) < v, X ∈ C] = P[φ(u) ≤ X < φ(v), X ∈ C]. If φ(u) ∈ C, this is at most P[φ(u) ≤ X < φ(v)] = F(φ(v)−) − F(φ(u)−) = F(φ(v)−) − F(φ(u)) ≤ v − u; if φ(u) ∉ C, it is at most P[φ(u) < X < φ(v)] = F(φ(v)−) − F(φ(u)) ≤ v − u. Thus P[F(X) ∈ [u, v), X ∈ C] ≤ λ[u, v) if 0 < u < v < 1. This is true also for u = 0 (let u ↓ 0 and note that P[F(X) = 0] = 0) and for v = 1 (let v ↑ 1). The finite disjoint unions of intervals [u, v) in [0, 1) form a field there, and by addition P[F(X) ∈ A, X ∈ C] ≤ λ(A) for A in this field. By the monotone class theorem, the inequality holds for all Borel sets in [0, 1). Since P[F(X) = 1, X ∈ C] = 0, this holds also for A = {1}.

14.9. The sufficiency is easy. To prove necessity, choose continuity points xᵢ of F in such a way that x₀ < x₁ < ⋯ < x_k, F(x₀) < ε, F(x_k) > 1 − ε, and xᵢ − x_{i−1} < ε. If n exceeds some n₀, |F(xᵢ) − Fₙ(xᵢ)| < ε/2 for all i. Suppose that x_{i−1} ≤ x ≤ xᵢ. Then Fₙ(x) ≤ Fₙ(xᵢ) ≤ F(xᵢ) + ε/2 ≤ F(x + ε) + ε/2. Establish a similar inequality going the other direction, and give special arguments for the cases x < x₀ and x > x_k.
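For a continuous, strictly increasing F the relations in 14.4 can be checked pointwise, since φ(u) = inf[x: F(x) ≥ u] is then the ordinary inverse; a sketch with the exponential F (our choice of example):

```python
# 14.4 for F(x) = 1 - exp(-x): u <= F(phi(u)) with equality in the
# continuous case, and u <= F(x) iff phi(u) <= x.
from math import log, exp

def F(x):
    return 1.0 - exp(-x)

def phi(u):
    return -log(1.0 - u)

for j in range(1, 100):
    u = j / 100.0
    assert abs(F(phi(u)) - u) < 1e-12          # u <= F(phi(u)), with equality here
    for x in (0.1, 0.5, 1.0, 2.0):
        assert (u <= F(x)) == (phi(u) <= x)    # the defining equivalence
```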
Section 16

16.6. (a) By Fatou's lemma,

∫ b dμ − ∫ f dμ = ∫ limₙ (b − fₙ) dμ ≤ lim infₙ ∫ (b − fₙ) dμ = ∫ b dμ − lim supₙ ∫ fₙ dμ.

Therefore lim supₙ ∫ fₙ dμ ≤ ∫ f dμ.

16.9. 0 ≤ fₙ ≤ |fₙ − f| + |f|.
16.11. (a) Take ℱ = {∅, Ω}, μ(Ω) = ∞, f = 2, and g = 1. (b) Suppose Aₙ ↑ [g < ∞, f > 0] and μ(Aₙ) < ∞. If Bₙ = [0 ≤ g ≤ f, g ≤ n], then μ(Aₙ ∩ Bₙ) < ∞ and Aₙ ∩ Bₙ ↑ [g ≤ f]. (c) Suppose Cₙ ↑ [f > 0] and μ(Cₙ) < ∞. Then n⁻¹μ([f ≥ n⁻¹] ∩ Cₙ) ≤ ∫_{[f ≥ n⁻¹] ∩ Cₙ} f dμ < ∞, and [f ≥ n⁻¹] ∩ Cₙ ↑ [f > 0].

16.13. Whatever A may be, ∫_A (g − gₙ) dμ ≤ ∫ (g − gₙ)⁺ dμ, and there is equality here if A = [g > gₙ].
16.16. By the result in Problem 16.15, ∫|fₙ| dμ → ∫|f| dμ implies ∫||f| − |fₙ|| dμ → 0. If Aₙ = [fₙ < 0 < f] ∪ [f < 0 < fₙ] and Bₙ = [|fₙ| ≤ |f|], then

0 ≤ ∫|f − fₙ| dμ − ∫||f| − |fₙ|| dμ = ∫_{Aₙ∩Bₙ} 2|fₙ| dμ + ∫_{Aₙ∩Bₙᶜ} 2|f| dμ ≤ 2∫_{Aₙ}|f| dμ → 0

because I_{Aₙ} → 0 almost everywhere.

16.18. Use the fact that ∫_A |f| dμ ≤ aμ(A) + ∫_{[|f| ≥ a]} |f| dμ.

16.19. If μ(A) < δ implies ∫_A |fₙ| dμ ≤ ε for all n, and if a⁻¹ supₙ ∫|fₙ| dμ < δ, then μ[|fₙ| ≥ a] ≤ a⁻¹∫|fₙ| dμ < δ and hence ∫_{[|fₙ| ≥ a]} |fₙ| dμ ≤ ε for all n. For the reverse implication adapt the argument in the preceding note.

16.20. (b) Suppose that the fₙ are nonnegative and satisfy condition (ii) and μ is nonatomic. Choose δ so that μ(A) < δ implies ∫_A fₙ dμ ≤ 1 for all n. If μ[fₙ = ∞] > 0, there is an A such that A ⊂ [fₙ = ∞] and 0 < μ(A) < δ; but then ∫_A fₙ dμ = ∞. Since μ[fₙ = ∞] = 0, there is an a such that μ[fₙ > a] ≤ δ ≤ μ[fₙ ≥ a]. Choose B ⊂ [fₙ = a] in such a way that A = [fₙ > a] ∪ B satisfies μ(A) = δ. Then aδ = aμ(A) ≤ ∫_A fₙ dμ ≤ 1 and ∫ fₙ dμ ≤ 1 + aμ(Aᶜ) ≤ 1 + δ⁻¹μ(Ω).

16.21. By Theorem 16.13, (i) implies (ii), and the equivalence of (ii) and (iii) was established in Problem 16.16. Suppose (ii) and (iii) hold. Then certainly ∫|fₙ| dμ is bounded. Given ε, choose n₀ so that ∫|fₙ − f| dμ < ε for n ≥ n₀ and then choose δ so that μ(A) < δ implies ∫_A |f| dμ < ε and ∫_A |fₙ| dμ < ε for n ≤ n₀. If μ(A) < δ and n ≥ n₀, then ∫_A |fₙ| dμ ≤ ∫|fₙ − f| dμ + ∫_A |f| dμ < 2ε.

16.25. (b) Suppose that f ∈ ℒ and f ≥ 0. If fₙ = (1 − n⁻¹)f ∨ 0, then fₙ ∈ ℒ and fₙ ↑ f, so that ν(fₙ, f] = Λ(f − fₙ) → 0. Since ν(f₁, f] < ∞, it follows that ν[(ω, t): f(ω) = t] = 0. The disjoint union

Bₙ = ⋃_{i=1}^{n2ⁿ} ([i/2ⁿ < f ≤ (i + 1)/2ⁿ] × (0, i/2ⁿ])

increases to B, where B ⊂ (0, f] and (0, f] − B ⊂ [(ω, t): f(ω) = t]. Therefore

Λ(f) = ν(0, f] = limₙ ν(Bₙ) = limₙ Σ_{i=1}^{n2ⁿ} (i/2ⁿ) μ[i/2ⁿ < f ≤ (i + 1)/2ⁿ] = ∫ f dμ.
Section 17

17.11. The binomial formula is

(1 + x)^a = Σ_{n=0}^∞ C(a, n) xⁿ,

valid for arbitrary a and |x| < 1; here C(a, n) = a(a − 1) ⋯ (a − n + 1)/n!. A simple way to verify the formula without considering the analyticity of the left-hand side is to show by the ratio test that the series converges for |x| < 1 to f_a(x), say, and then by a binomial identity to derive (1 + x)f_a′(x) = a f_a(x) [A29], whence it follows that the derivative of f_a(x)/(1 + x)^a vanishes. Note that the series in (17.17) is nonnegative.
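The series can be checked numerically for a non-integral exponent, using the coefficient recursion C(a, n + 1) = C(a, n)(a − n)/(n + 1); a sketch:

```python
# Partial sums of the generalized binomial series for (1 + x)^a, |x| < 1.
a, x = 0.5, 0.3
term, total = 1.0, 0.0       # term starts at C(a, 0)*x^0 = 1
for n in range(200):
    total += term
    term *= (a - n) / (n + 1) * x   # C(a, n+1) x^{n+1} from C(a, n) x^n
assert abs(total - (1.0 + x) ** a) < 1e-12
```

The ratio of successive terms tends to |x| = 0.3, so 200 terms leave a negligible tail.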
(17.20) holds, then f has Kunweil integral I over [0, 1]. This integral was introduced in J. Kunweil: Generalized ordinary differential equations and continuous dependence on a parameter, Czechoslovak Math . Jour., 7 (82) (1957), 418-446. 17.15. If A == A, n S, 1 , (17.21) follows from (12.19). But these A 's form a w-system generatihg � 2 • And now (17 .22) follows by the usual argument starting with indicators for f. 17.16. If g(x ) is the distance from x to [ a , b), then In == (1 - ng) V 0 � I[ a . h J and In e 9' ; since the continuous functions are measurable �1 , it follows that !F == 911 • If In ( x) � 0 for each x, then the compact sets [ x: In ( x) � £] decrease to 0 and hence one of them is 0; thus the convergence is uniform. 17.17. The linearity and positivity of A are certainly elementary facts, and for the continuity property, note that if 0 s f s £ and f vanishes outside [ a , b ], then elementary considerations show that 0 s A( / ) � £(b - a). ,2
Section 18
If E e !I X tty , then E lies in the a- field generated by some countable class of measurable rectangles, which by the definition of !I== tty may be taken as But then E the singletons A m n == {(xm , xn )} for some sequence x1 , x 2 , must be the union of certain of the A m n ' or such a union together with ( U m n A m n )"; either case is impossible. 18.3. Consider A X B, where A consists of a single point and B lies outside the completion of �1 with respect to A . 18.11. See DOOB, p. 63 18.18. Put /p == p - 1log p and put In == 0 if n is not a prime. In the notation of (18.1 8), F(x) == log x + cp (x), where cp is bounded because of (5.47). If G(x) == - 1jlog x, then 18.2.
• • •
L
psx
_! p
..,
F( X ) + log x
l
x
2
F( I ) dt t log 2 t
cp ( X ) == 1 + log + x
lx 2
dt + t log t
.
loo cp( t ) dt - oo cp ( t ) 2dt t log 2 t Jx t log t
0
2
18.21. Take the point of view of Problem 18.6 and use the translation invariance of Lebesgue measure in the plane. The Tk do not overlap because the slope log 1 (1 + k - 1 ) of the lower boundary of T is less than the slope k - of a
k
tangent to the upper boundary of Tk 1 at its right-hand endpoint. _
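The expansion in 18.18 leads to Mertens' theorem, Σ_{p≤x} 1/p = log log x + M + o(1); a numerical check (the value M ≈ 0.2615 is the standard constant, not taken from the text, and the sieve helper is ours):

```python
# Numerical illustration of Mertens' theorem for the sum in 18.18.
from math import log

def primes_upto(n):
    s = [True] * (n + 1)
    s[0] = s[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if s[i]:
            s[i * i :: i] = [False] * len(s[i * i :: i])
    return [i for i, b in enumerate(s) if b]

x = 10 ** 5
recip = sum(1.0 / p for p in primes_upto(x))
# sum_{p<=x} 1/p is already very close to log log x + M at x = 10^5
assert abs(recip - (log(log(x)) + 0.26150)) < 0.005
```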
Section 19

19.2. Choose sets Bₙ such that A ⊂ ⋃ₙBₙ, diam Bₙ < ε, and c_m Σₙ (diam Bₙ)^m < h_{m,ε}(A) + ε. Replace Bₙ by the open set [x: dist(x, Bₙ) < δₙ]; if the δₙ are small enough, the inequalities are preserved. Now use compactness.

19.3. (a) Inserting the extra point f(t) cannot make a polygon shorter. (b) If L(f) = ∞, l(·) is not well defined, but in this case there is nothing to prove. Apply Lemma 4 to the maps f: [a, b] → Rᵏ and l: [a, b] → R¹. (c) Make sure that the trace is swept out several times as t runs from a to b. (d) Let tₙ be the supremum of the t in [a, b] for which f(t) ∈ B̄ₙ. Now f(a) lies in some B_{n₀}, f(t_{n₀}) lies in some B_{n₁}, f(t_{n₁}) lies in some B_{n₂}, and so on. The process must terminate with a t_{n_i} of b, because n₀, n₁, … are distinct until then. Therefore |f(b) − f(a)| ≤ Σᵢ |f(t_{n_{i+1}}) − f(t_{n_i})| ≤ Σₙ diam Bₙ < h¹_ε(f[a, b]) + ε. (e) If a = t₀ < ⋯ < t_r = b, then, since f is one-to-one and h¹ is 0 for singletons, Σ_{j=1}^r |f(tⱼ) − f(t_{j−1})| ≤ Σ_{j=1}^r h¹(f[t_{j−1}, tⱼ]) = h¹(f[a, b]). (f) Suppose, for example, that f(t+) − f(t) = a > 0. For some δ, t < s₁ ≤ s₂ < t + δ implies f(s₂) − f(s₁) < a/3. If t < s < t + δ, there is a partition t = t₀ < ⋯ < t_r = s such that Σ_{i=1}^r |f(tᵢ) − f(t_{i−1})| > 2a/3; but since f[t₁, s] < a/3, |f(t₁) − f(t)| > a/3. Let s ↓ t and get a contradiction to the continuity of f.

19.4. If the curve has no multiple points, the result follows from (19.14), but it is easy to derive it in the general case directly from the definition of L(f).
19.7. (a) 2√2. (b) 2π[1/√2 + ½ log(1 + √2)].

19.8. In the diagram below the circle has radius 1. Take θ = π/n and x = 1/2m, calculate the lengths of the numbered segments in order, and show that the triangles abc and abd have areas sin θ[(1 − cos θ)² + x²]^{1/2} and x[2 − 2 cos θ]^{1/2}. Since the polyhedron has 2mn triangular faces of each kind, the formula for A_{mn} follows. Since sin t ∼ t as t → 0, if m and n go to infinity in such a way that m²/n⁴ → p, then A_{mn} → π + π[1 + pπ⁴]^{1/2}. See FEDERER and RADO for accounts of this unimaginably complicated subject.
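Assembling the two triangle areas gives A_{mn} = 2mn(sin θ √((1 − cos θ)² + x²) + x√(2 − 2 cos θ)), θ = π/n, x = 1/2m; the assembled formula is our reading of the note. The dependence of the limit on m²/n⁴ can then be seen numerically:

```python
# Schwarz lantern: the "area" of the inscribed polyhedron depends on how
# m (bands) and n (sectors) go to infinity together.
from math import pi, sin, cos, sqrt

def area(m, n):
    t, x = pi / n, 1.0 / (2 * m)
    return 2 * m * n * (sin(t) * sqrt((1 - cos(t)) ** 2 + x * x)
                        + x * sqrt(2 - 2 * cos(t)))

n = 200
assert abs(area(10, n) - 2 * pi) < 1e-2                       # m^2/n^4 -> 0: cylinder, 2*pi
p = 1.0                                                       # m = n^2 gives m^2/n^4 -> 1
assert abs(area(n * n, n) - (pi + pi * sqrt(1 + p * pi ** 4))) < 1e-2
```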
19.9. The rectangular coordinates are (x, y, z) = (cos θ cos φ, sin θ cos φ, sin φ), and |D_{θ,φ}| comes out to cos φ. The sector therefore has area (θ₂ − θ₁)(sin φ₂ − sin φ₁).
19.11. (a) Choose sets Bₙ such that A ⊂ ⋃ₙBₙ, diam Bₙ < ε, and Σₙ(diam Bₙ)^m ≤ h_m(A) + 1, and note that h_{m+δ,ε}(A) ≤ ε^δ(h_m(A) + 1). (b) If T is a map from R^m to R^k to which Theorem 19.3 applies, and if ∫_A |D_x| λ_m(dx) is finite and positive, then dim TA = m. If T is not smooth, of course, anything can happen: consider a Peano curve.
19.12. This definition of dimension is due to Felix Hausdorff: Dimension und äusseres Mass, Math. Ann., 79 (1919), 157-179. He showed that the Cantor set has dimension log 2/log 3. Eggleston showed that the set of points in the unit interval in whose dyadic expansion 1 appears with asymptotic relative frequency p has dimension −p log p − q log q; see BILLINGSLEY, Section 14. For the connection between Hausdorff and topological dimension, see Witold Hurewicz and Henry Wallman: Dimension Theory (Princeton University Press, Princeton, New Jersey, 1948).
Section 20 20.4.
Suppose U1 , • • • , Uk are independent and uniformly distributed over the unit interval, put V; == 2 n lf; - n , and let l'n be (2 n) k times the distribution of ( V1 , • • • , Vk ). Then l'n is supported by Qn == ( - n , n ] X • · · X ( - n , n ], and if I == (a 1 , b1 ] X · · · X (ak , bk ] c Qn , then l'n (/) == n f_ 1 (b; - a; )· Fur ther, if A C Qn C Qm (n < m), then l'n ( A ) == l'm (A). Define A k (A) ==
lim nl'n ( A
n
Qn)·
n
n 2- nt>
20.8.
preceding (8.16), P; [ T1 == n 1 , • • • , Tk == n k ] == /;� t>jj� By then argument n • · · �� 1c - �c- a>. For the general initial distribution, average over i.
20.9. Integrate rather than differentiate.

20.10. Use the π-λ theorem.

20.11. (a) See Problem 20.10. (b) Use part (a) and the fact that Y_n = r if and only if T_r^{(n)} = n. (c) If k ≤ n, then Y_k = r if and only if exactly r − 1 among the integers 1, ..., k − 1 precede k in the permutation T^{(n)}. (d) Observe that T^{(n)} = (t₁, ..., t_n) and Y_{n+1} = r if and only if T^{(n+1)} = (t₁, ..., t_{r−1}, n + 1, t_r, ..., t_n), and conclude that σ(Y_{n+1}) is independent of σ(T^{(n)}) and hence of σ(Y₁, ..., Y_n); see Problem 20.7.
20.16. If X and Y are independent, then P[|(X + Y) − (x + y)| < ε] ≥ P[|X − x| < ½ε] P[|Y − y| < ½ε] and P[X + Y = x + y] ≥ P[X = x] P[Y = y].
594
NOTES ON THE PROBLEMS
20.20. The partial-fraction expansion gives

B = 2y(y − x)/(u² + (y − x)²).

After the fact this can of course be checked mechanically. Integrate over [−t, t] and let t → ∞: ∫_{−t}^t D dx = 0, ∫_{−t}^t B dx → 0, and ∫_{−∞}^∞ (A + C) dx = (y² + v² − u²)u⁻¹π + (y² − v² + u²)v⁻¹π = u⁻¹v⁻¹π² c_{u+v}(y). There is a very simple proof by characteristic functions; see Problem 26.9.

20.22. See Example 20.1 for the case n = 1, prove by inductive convolution and a change of variable that the density must have the form K_n x^{n/2−1} e^{−x/2}, and then from the fact that the density must integrate to 1 deduce the form of K_n.
20.23. Show by (20.38) and a change of variable that the left side of (20.48) is some constant times the right side; then show that the constant must be 1.
20.27. (a) Given ε choose M so that P[|X| ≥ M] < ε and P[|Y| ≥ M] < ε, and then choose δ so that |x|, |y| ≤ M, |x − x′| < δ, and |y − y′| < δ imply that |f(x′, y′) − f(x, y)| < ε. Note that P[|f(X_n, Y_n) − f(X, Y)| ≥ ε] ≤ 2ε + P[|X_n − X| ≥ δ] + P[|Y_n − Y| ≥ δ].
20.30. Take, for example, independent X_n assuming the values 0 and n with probabilities 1 − n⁻¹ and n⁻¹. Estimate the probability that X_k = k for some k in the range n/2 < k ≤ n.

20.31. (b) For each m split A into 2^m sets A_{mk} of probability P(A)/2^m. Arrange all the A_{mk} in one infinite sequence and let X_n be the indicator of the nth set in it.

Section 21
21.6. Consider εI_A. A random variable is finite with probability 1 if (but not only if) it is integrable.

21.8. Calculate ∫₀^∞ x dF(x) = ∫₀^∞ ∫₀^x dy dF(x) = ∫₀^∞ ∫_y^∞ dF(x) dy.

21.10. (a) Write E[Y − X] = ∫_{[X<Y]} ∫_{X < t ≤ Y} dt dP − ∫_{[Y<X]} ∫_{Y < t ≤ X} dt dP.

21.12. Show that 1 − F(t) ≤ −log F(t) ≤ (1 − F(t))/F(a) for t ≥ a and apply 21.8.
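The interchange of integrals above is the tail-integral formula E[X] = ∫₀^∞ P[X > y] dy for nonnegative X. A minimal exact check for a discrete distribution (the die example and helper names are mine, not the text's):

```python
from fractions import Fraction

def mean_direct(pmf):
    """E[X] = sum of x * P[X = x] for a nonnegative integer-valued X."""
    return sum(x * p for x, p in pmf.items())

def mean_by_tails(pmf):
    """E[X] = sum over y >= 0 of P[X > y]: the discrete form of
    E[X] = integral_0^inf P[X > y] dy obtained by swapping the integrals."""
    top = max(pmf)
    return sum(sum(p for x, p in pmf.items() if x > y) for y in range(top))

die = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die: illustrative choice
assert mean_direct(die) == mean_by_tails(die) == Fraction(7, 2)
```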
595
NOTES ON THE PROBLEMS
21.13. Calculate (21.10).
21.14. Use (21.12).

21.15. (a) The most important dependent uncorrelated random variables are the trigonometric functions: the random variables sin 2πnω and cos 2πnω on the unit interval with Lebesgue measure.
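The claim in 21.15(a) can be verified numerically: over Lebesgue measure on [0, 1], sin 2πnω and cos 2πnω each have mean 0 and their product has mean 0, so they are uncorrelated; yet sin² + cos² = 1 forces dependence. A midpoint-rule sketch (the quadrature setup and step count are my choices):

```python
import math

def mean(f, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    return sum(f((i + 0.5) / steps) for i in range(steps)) / steps

n = 3  # any fixed frequency works; 3 is an arbitrary choice
s = lambda w: math.sin(2 * math.pi * n * w)
c = lambda w: math.cos(2 * math.pi * n * w)

# Covariance ~ 0: the variables are uncorrelated.
cov = mean(lambda w: s(w) * c(w)) - mean(s) * mean(c)
# But s(w)^2 + c(w)^2 = 1 pointwise: a deterministic functional tie, so dependent.
tie = mean(lambda w: s(w) ** 2 + c(w) ** 2)
```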
21.19. Use Fubini's theorem; see (20.29) and (20.30).
21.20. Even if X = −Y is not integrable, X + Y = 0 is. Since |Y| ≤ |x| + |x + Y|, E[|Y|] = ∞ implies that E[|x + Y|] = ∞ for each x; use Problem 21.19. See also the lemma in Section 28.
21.21. Use Theorem 20.5.

21.25. (b) Suppose {X_n} is d_p-fundamental. By Markov's inequality, the sequence is fundamental in probability and hence converges in probability to some X. By Fatou's lemma it follows from X_n ∈ L^p that X ∈ L^p. Another application of Fatou's lemma gives d_p(X, X_n) ≤ lim inf_m d_p(X_m, X_n), and so d_p(X, X_n) → 0.

21.28. If x < ess sup|X|, then E^{1/p}[|X|^p] ≥ x P^{1/p}[|X| ≥ x] → x as p → ∞.
Section 22

22.3. For sufficiency, use E[Σ_k |X_k^{(c)}|] = Σ_k E[|X_k^{(c)}|].

22.9. (a) Put U = Σ_k I_{[k ≤ T]} X_k⁺ and V = Σ_k I_{[k ≤ T]} X_k⁻, so that S_T = U − V. Since [T ≥ k] = Ω − [T ≤ k − 1] lies in σ(X₁, ..., X_{k−1}), E[I_{[T ≥ k]} X_k⁺] = E[I_{[T ≥ k]}] E[X_k⁺] = P[T ≥ k] E[X₁⁺]. Hence E[U] = Σ_{k≥1} E[X₁⁺] P[T ≥ k] = E[X₁⁺] E[T]. Treat V the same way. (b) To prove E[T] < ∞, show that P[T > (a + b)n] ≤ (1 − p^{a+b})ⁿ. By (7.7), S_T is b with probability (1 − ρ^a)/(1 − ρ^{a+b}) and −a with the opposite probability. Since E[X₁] = p − q,

E[T] = a/(q − p) − ((a + b)/(q − p)) · (1 − ρ^a)/(1 − ρ^{a+b}),  ρ = q/p ≠ 1.
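The closed form for E[T] in 22.9(b) can be checked against a direct computation: the expected durations e_j (starting fortune j, absorbing barriers at 0 and a + b) satisfy e_j = 1 + p·e_{j+1} + q·e_{j−1}, and e_a is E[T]. A sketch (the tridiagonal solver and all names are mine, assuming the formula as reconstructed above):

```python
def expected_duration(a, b, p):
    """Solve e_j = 1 + p*e_{j+1} + q*e_{j-1}, e_0 = e_{a+b} = 0, by the
    Thomas algorithm for the tridiagonal system; return e_a = E[T]."""
    q = 1 - p
    m = a + b - 1                      # unknowns e_1, ..., e_{a+b-1}
    sub, diag, sup = [-q] * m, [1.0] * m, [-p] * m
    rhs = [1.0] * m
    for i in range(1, m):              # forward elimination
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        rhs[i] -= w * rhs[i - 1]
    e = [0.0] * m
    e[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):     # back substitution
        e[i] = (rhs[i] - sup[i] * e[i + 1]) / diag[i]
    return e[a - 1]

def closed_form(a, b, p):
    """E[T] = a/(q-p) - ((a+b)/(q-p)) * (1-rho^a)/(1-rho^(a+b)), rho = q/p."""
    q = 1 - p
    rho = q / p
    return a / (q - p) - (a + b) / (q - p) * (1 - rho**a) / (1 - rho**(a + b))
```

For a = b = 1 and p = 0.6 both give 1, as they must: a single play reaches a barrier.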
22.12. For each θ, Σ_n e^{iX_n}(e^{iθ})ⁿ has the same probabilistic behavior as the original series because the X_n + nθ reduced modulo 2π are independent and uniformly distributed. Therefore, the rotation idea in the proof of Theorem 22.9 carries over. See KAHANE for further results.
22.16. (b) Let A = T⁻¹B and suppose p is a period of f. Let n = [1/p] and m = [x/p]. By periodicity, P(A ∩ [y, y + p]) is the same for all y; therefore, |P(A ∩ [0, x]) − mP(A ∩ [0, p])| ≤ p, |P(A) − nP(A ∩ [0, p])| ≤ p, and |P(A ∩ [0, x]) − P(A)x| ≤ 2p + |x − m/n| ≤ 3p. Since p can be taken arbitrarily small, P(A ∩ [0, x]) = P(A)x.
Section 23

23.3. Note that A_t cannot exceed t. If 0 ≤ u < t and v ≥ 0, then P[A_t ≥ u, B_t ≥ v] = P[N_{t+v} − N_{t−u} = 0] = e^{−αu}e^{−αv}.

23.4. (a) Use (20.37) and the distributions of A_t and B_t. (b) A long interarrival interval has a better chance of covering t than a short one does.
23.6. The probability that N_{s_{t,k}} − N_{s_t} = j is
23.7. Since N(A) is Σ_n I_A(S_n), it is a random variable. Restrict attention to subsets of a fixed interval [0, a]. Show that the sets with the desired property include the finite disjoint unions of intervals and form a monotone class. If A and B are disjoint, then N(A) and N(B) are certainly independent if A and B are finite disjoint unions of intervals; now consider in sequence the cases where A and B are (i) open (countable disjoint unions of intervals), (ii) closed (hence compact; use disjoint, open A_n and B_n satisfying A_n ↓ A and B_n ↓ B), (iii) countable unions of closed sets, and (iv) general (use Problem 12.7).
23.9. Let M_t be the given process and put φ(t) = E[M_t]. Since there are no fixed discontinuities, φ(t) is continuous. Let ψ(u) = inf[t: u ≤ φ(t)] and show that N_u = M_{ψ(u)} is an ordinary Poisson process and M_t = N_{φ(t)}.
23.10. Let t → ∞ in

S_{N_t}/N_t ≤ t/N_t ≤ (S_{N_t+1}/(N_t + 1)) · ((N_t + 1)/N_t).
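Letting t → ∞ in the sandwich gives t/N_t → E[X₁] with probability 1, that is, N_t/t → 1/E[X₁]. A simulation sketch for exponential interarrival times (the rate, horizon, and seed are my choices, not the text's):

```python
import random

def arrival_count(rate, t, seed=42):
    """Count arrivals up to time t for a renewal process with
    exponential(rate) interarrival times, i.e. a Poisson process."""
    random.seed(seed)  # fixed seed so the run is reproducible
    clock, n = 0.0, 0
    while True:
        clock += random.expovariate(rate)
        if clock > t:
            return n
        n += 1

rate, t = 2.0, 50_000.0
ratio = arrival_count(rate, t) / t   # should be near rate = 1/E[X_1]
```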
23.12. Restrict t in Problem 23.11 to integers. The waiting times are the Z_n of Problem 20.8, and account must be taken of the fact that the distribution of Z₁ may differ from that of the other Z_n.
Section 25

25.1. (e) Let G be an open set that contains the rationals and satisfies λ(G) < ε. For k = 0, 1, ..., n − 1, construct a triangle whose base contains k/n and is contained in G; make these bases so narrow that they do not overlap, and adjust the heights of the triangles so that each has area 1/n. For the nth density, piece together these triangular functions, and for the limit density, use the function identically 1 over the unit interval.
25.2. By Problem 14.12 it suffices to prove that F_n(·, ω) ⇒ F with probability 1, and for this it is enough that F_n(x, ω) → F(x) with probability 1 for each rational x.
25.3. (b) It can be shown, for example, that (25.19) holds for x_n = n!. See Persi Diaconis: The distribution of leading digits and uniform distribution mod 1, Ann. Prob., 5 (1977), 72-81. (c) The first significant digits of numbers drawn at random from empirical compilations such as almanacs and engineering handbooks seem approximately to follow the limiting distribution in (25.20) rather than the uniform distribution over 1, 2, ..., 9. This is sometimes called Benford's law. One explanation is that the distribution of the observation X and hence of log₁₀ X will be spread over a large interval; if log₁₀ X has a reasonably smooth density, it then seems plausible that {log₁₀ X} should be approximately uniformly distributed. See FELLER, Volume 2, p. 62, and R. Raimi: The first digit problem, Amer. Math. Monthly, 83 (1976), 521-538.
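The uniform-distribution-mod-1 mechanism behind the limiting digit law in 25.3 can be seen concretely with the powers of 2 (my choice of example; the text's part (b) uses x_n = n!): since log₁₀ 2 is irrational, {n log₁₀ 2} is uniformly distributed mod 1, so the leading digit d of 2ⁿ has asymptotic frequency log₁₀(1 + 1/d).

```python
import math
from collections import Counter

def leading_digit_freqs(num_terms):
    """Empirical leading-digit frequencies of 2^n, n = 1..num_terms,
    read off from the fractional part of n*log10(2)."""
    log2 = math.log10(2)
    counts = Counter(int(10 ** ((n * log2) % 1)) for n in range(1, num_terms + 1))
    return {d: counts[d] / num_terms for d in range(1, 10)}

freqs = leading_digit_freqs(10_000)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
# freqs[d] tracks benford[d] closely; e.g. digit 1 appears about 30.1% of the time.
```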
25.9.
Use Scheffé's theorem.
25.10. Put f_n(x) = P[X_n = y_n + kδ_n]/δ_n for y_n + kδ_n ≤ x < y_n + (k + 1)δ_n. Construct random variables Y_n with densities f_n, and first prove Y_n ⇒ X. Show that Z_n = y_n + [(Y_n − y_n)/δ_n]δ_n has the distribution of X_n and that Y_n − Z_n ⇒ 0.
25.11. For a proof of (25.21) see FELLER, Volume 1, Chapter 7.
25.13. (b) Follow the proof of Theorem 25.8, but approximate I_{(x,y]} instead of I_{(−∞,x]}.
25.22. Let X_n assume the values n and 0 with probabilities p_n = 1/(n log n) and 1 − p_n.
Section 26
26.1. (b) Let μ be the distribution of X. If |φ(t)| = 1 and t ≠ 0, then φ(t) = e^{ita} for some a, and 0 = ∫_{−∞}^∞ (1 − e^{it(x−a)}) μ(dx) = ∫_{−∞}^∞ (1 − cos t(x − a)) μ(dx). Since the integral vanishes, μ must confine its mass to the points where the nonnegative integrand vanishes, namely to the points x for which t(x − a) = 2πn for some integer n. (c) The mass of μ concentrates at points of the form a + 2πn/t and also at points of the form a′ + 2πn/t′. If μ is positive at two distinct points, it follows that t/t′ is rational.

26.3. (a) Let f₀(x) = π⁻¹x⁻²(1 − cos x) be the density corresponding to φ₀(t). If p_k = (s_k − s_{k+1})t_k, then Σ_{k≥1} p_k = 1; since Σ_{k≥1} p_k φ₀(t/t_k) = φ(t) (check the points t = t_j), φ(t) is the characteristic function of the continuous density Σ_{k≥1} p_k t_k f₀(t_k x). (b) For example, take d_k = 1 and s_k = k⁻¹(k + 1)⁻¹. (c) If lim_{t→∞} φ(t) = 0, approximate φ by functions of the kind in part (a), pass to the limit, and use the first corollary to the continuity theorem. If φ does not vanish at infinity, mix in a unit mass at 0.
26.12. On the right in (26.30) replace q>(t) by the integral defining it and apply Fubini's theorem; the integral average comes to
Now use the bounded convergence theorem.
26.15. (a) Use (26.40) to prove that |φ_n(t + h) − φ_n(t)| ≤ 2μ_n((−a, a)^c) + a|h|. (b) Use part (a).
26.17. (a) Use the second corollary to the continuity theorem.

26.19. For the Weierstrass approximation theorem, see RUDIN, Theorem 7.32.

26.13. (a) If a_n goes to 0 along a subsequence, then |φ(t)| ≡ 1; use part (c) of Problem 26.1. (c) Suppose two subsequences of {a_n} converge to a₀ and to a, where 0 < a₀ < a; put θ = a₀/a and show that |φ(t)| = |φ(θt)|. (d) Observe that
Section 27

27.8. By the same reasoning as in Example 27.3, (R_n − log n)/√(log n) ⇒ N.

27.9. The Lindeberg theorem applies: (S_n − n²/4)/√(n³/36) ⇒ N.

27.16. Write ∫_x^∞ e^{−u²/2} du = x⁻¹e^{−x²/2} − ∫_x^∞ u⁻²e^{−u²/2} du.

27.17. For another approach to large deviation theory, see Mark Pinsky: An elementary derivation of Khintchine's estimate for large deviations, Proc. Amer. Math. Soc., 22 (1969), 288-290.
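The integration by parts in 27.16 yields the standard normal tail bounds (x⁻¹ − x⁻³)e^{−x²/2} ≤ ∫_x^∞ e^{−u²/2} du ≤ x⁻¹e^{−x²/2}. A numerical check by Simpson's rule (the quadrature cutoff and step count are my choices):

```python
import math

def normal_tail(x, upper=40.0, steps=200_000):
    """Composite Simpson's rule for the integral of exp(-u^2/2) over [x, upper];
    the integrand is negligible beyond upper = 40."""
    h = (upper - x) / steps
    total = math.exp(-x * x / 2) + math.exp(-upper * upper / 2)
    for i in range(1, steps):
        u = x + i * h
        total += (4 if i % 2 else 2) * math.exp(-u * u / 2)
    return total * h / 3

for x in (1.0, 2.0, 3.0):
    tail = normal_tail(x)
    upper_bound = math.exp(-x * x / 2) / x
    lower_bound = (1 / x - 1 / x**3) * math.exp(-x * x / 2)
    assert lower_bound <= tail <= upper_bound
```

The upper bound is exactly the first term on the right of 27.16; the lower bound comes from integrating by parts once more.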
27.19. (a) Everything comes from (4.7). If A ∈ σ(I_{k+n}, I_{k+n+1}, ...) and B = [(I₁, ..., I_k) ∈ H], then

|P(A ∩ B) − P(A)P(B)| ≤ Σ |P(A ∩ [I₁ = i₁, ..., I_k = i_k]) − P(A)P[I₁ = i₁, ..., I_k = i_k]|,

where the sum extends over the k-tuples (i₁, ..., i_k) of nonnegative integers in H. The summand vanishes if u + i_u < k + n for u ≤ k; the remaining terms add to at most 2Σ_{u=1}^k P[I_u ≥ k + n − u] ≤ 4 · 2⁻ⁿ. (b) To show that σ² = 6 (see (27.26)), show that I₁ has mean 1 and variance 2, and compute the mixed moments of the I_i separately for i < n and i > n.
Section 28

28.2. Pass to a subsequence along which μ_n(R¹) → ∞, choose ε_n so that it decreases to 0 and ε_n μ_n(R¹) → ∞, and choose x_n so that it increases to ∞ and μ_n(−x_n, x_n) > ½μ_n(R¹); consider the f that satisfies f(±x_n) = ε_n for all n and is defined by linear interpolation in between these points.
28.4. (a) If all functions (28.12) are characteristic functions, they are all certainly infinitely divisible. Since (28.12) is continuous at 0, it need only be exhibited as a limit of characteristic functions. If μ_n has density I_{[−n,n]}(1 + x²) with respect to μ, then

exp[ iγt + ∫_{−∞}^∞ (e^{itx} − 1 − itx/(1 + x²)) ((1 + x²)/x²) μ_n(dx) ]

is a characteristic function and converges to (28.12). It can also be shown that every infinitely divisible distribution (no moments required) has characteristic function of the form (28.12); see GNEDENKO & KOLMOGOROV, p. 76. (b) Use (see Problem 18.23) −|t| = π⁻¹ ∫_{−∞}^∞ (cos tx − 1) x⁻² dx.
28.14. If X₁, X₂, ... are independent and have distribution function F, then (X₁ + ··· + X_n)/√n also has distribution function F. Apply the central limit theorem.
28.15. The characteristic function of Z_n is

exp[ c n⁻¹ Σ_k (e^{itk/n} − 1)/(|k|/n)^{1+α} ] → exp[ c ∫_{−∞}^∞ (e^{itx} − 1)/|x|^{1+α} dx ] = exp[ −c|t|^α ∫_{−∞}^∞ (1 − cos x)/|x|^{1+α} dx ].
Section 29

29.1. (a) If f is lower semicontinuous, [x: f(x) > t] is open. If f is positive, which is no restriction, then ∫f dμ = ∫₀^∞ μ[f > t] dt ≤ ∫₀^∞ lim inf_n μ_n[f > t] dt ≤ lim inf_n ∫₀^∞ μ_n[f > t] dt = lim inf_n ∫f dμ_n. (b) If G is open, then I_G is lower semicontinuous.
29.13. Let Σ be the covariance matrix. Let M be an orthogonal matrix such that the entries of MΣM′ are 0 except for the first r diagonal entries, which are 1. If Y = MX, then Y has covariance matrix MΣM′, and so Y = (Y₁, ..., Y_r, 0, ..., 0), where Y₁, ..., Y_r are independent and have the standard normal distribution. But |X|² = Σ_{i=1}^r Y_i².
29.14. By Theorem 29.5, X_n has asymptotically the centered normal distribution with covariances σ_ij. Put x = (p₁^{1/2}, ..., p_k^{1/2}) and show that Σx′ = 0, so that 0 is an eigenvalue of Σ. Show that Σy′ = y′ if y is perpendicular to x, so that Σ has 1 as an eigenvalue of multiplicity k − 1. Use Problem 29.13 together with Theorem 29.2 (h(x) = |x|²).

29.15. (a) Note that n⁻¹Σ_{i=1}^n Y_{ni}² ⇒ 1 and that (X_{n1}, ..., X_{nr}) has the same distribution as (Y_{n1}, ..., Y_{nr})/(n⁻¹Σ_{i=1}^n Y_{ni}²)^{1/2}.

29.16. Express the probability as a ratio of quantities (19.21), change variables x_i = y_i/√n, show that the new integrands converge to exp(−½Σ_{i=1}^k y_i²), and use the dominated convergence theorem.
Section 30

30.1. Rescale so that Σ_k E[X²_{nk}] = 1, and put L_n(ε) = Σ_k ∫_{[|X_{nk}| ≥ ε]} X²_{nk} dP. Choose n_u increasing so that L_n(u⁻¹) ≤ u⁻³ for n > n_u, and put M_n = u⁻¹ for n_u < n ≤ n_{u+1}. Show that M_n → 0 and that L_n(M_n)/M_n² → 0. Put Y_{nk} = X_{nk} I_{[|X_{nk}| < M_n]}. Show that Σ_k E[Y²_{nk}] → 1 and Σ_k P[X_{nk} ≠ Y_{nk}] → 0, and apply to Σ_k Y_{nk} the central limit theorem under (30.5).

30.4. Suppose that the moment generating function M_n of μ_n converges to the moment generating function M of μ in some interval about s. Let ν_n have density e^{sx}/M_n(s) with respect to μ_n, and let ν have density e^{sx}/M(s) with respect to μ. Then the moment generating function of ν_n converges to that of ν in some interval about 0, and hence ν_n ⇒ ν. Show that ∫_{−∞}^∞ f(x)μ_n(dx) → ∫_{−∞}^∞ f(x)μ(dx) if f is continuous and has bounded support; see Problem 25.13(b).
30.5. (a) By Hölder's inequality |Σ_{j=1}^k t_j x_j|^r ≤ k^{r−1} Σ_{j=1}^k |t_j x_j|^r, and so Σ_r s^r Σ_j ∫|t_j x_j|^r μ(dx)/r! has positive radius of convergence. Now

∫_{R^k} (Σ_j t_j x_j)^r μ(dx) = Σ (r!/(r₁! ··· r_k!)) t₁^{r₁} ··· t_k^{r_k} α(r₁, ..., r_k),

where the summation extends over k-tuples that add to r. Project μ to the line by the mapping Σ_j t_j x_j, apply Theorem 30.1, and use the fact that μ is determined by its values on half-spaces.
30.6.
Use the Cramer-Wold idea.
30.8. Suppose that k = 2 in (30.30). Then

M[(cos λ₁X)^{r₁}(cos λ₂X)^{r₂}] = M[((e^{iλ₁X} + e^{−iλ₁X})/2)^{r₁}((e^{iλ₂X} + e^{−iλ₂X})/2)^{r₂}].

By (26.33) and the independence of λ₁ and λ₂, the last mean here is 1 if 2j₁ − r₁ = 2j₂ − r₂ = 0 and is 0 otherwise. A similar calculation for k = 1 gives (30.28), and a similar calculation for general k gives (30.30). The actual form of the distribution in (30.29) is unimportant. For (30.31) use the multidimensional method of moments (Problem 30.6) and the mapping theorem. For (30.32) use the central limit theorem; by (30.28), X₁ has mean 0 and variance ½.

30.10. If n^{1/2} < m ≤ n and the inequality in (30.33) holds, then log log n^{1/2} ≤ log log n − ε(log log n)^{1/2}, which implies log log n ≤ ε⁻² log² 2. For large n the probability in (30.33) is thus at most 1/√n.
Section 31

31.1. Consider the argument in Example 31.1. Suppose that F has a nonzero derivative at x and let I_n be the set of numbers whose base-r expansions agree in the first n places with that of x. The analogue of (31.16) is P[X ∈ I_{n+1}]/P[X ∈ I_n] → r⁻¹, and the ratio here is one of p₀, ..., p_{r−1}. If p_i ≠ r⁻¹ for some i, use the second Borel-Cantelli lemma to show that the ratio is p_i infinitely often except on a set of Lebesgue measure 0. (This last part of the argument is unnecessary if r = 2.) The argument in Example 31.3 needs no essential change. The analogue of (31.11) is

F(x) = p₀ + ··· + p_{i−1} + p_i F(rx − i),  i/r ≤ x ≤ (i + 1)/r,  0 ≤ i ≤ r − 1.

31.3. (b) Take f₁ = I_{B⁻¹H₀} and f₂ = F; (f₁f₂)⁻¹{1} = H₀ is not a Lebesgue set.

31.9. Suppose that A is bounded, define μ by μ(B) = λ(B ∩ A), and let F be the corresponding distribution function. It suffices to show that F′(x) = 1 for x in A, apart from a set of Lebesgue measure 0. Let C_ε be the set of x in A for which F′(x) ≤ 1 − ε. From Theorem 31.4(i) deduce that λ(C_ε) = μ(C_ε) ≤ (1 − ε)λ(C_ε) and hence λ(C_ε) = 0. Thus F′(x) > 1 − ε almost everywhere on A. Obviously, F′(x) ≤ 1.

31.11. Let A be the set of x in the unit interval for which F′(x) = 0, take a = 0, and define A_n as in the first part of the proof of Theorem 31.4. Choose n so that λ(A_n) > 1 − ε. Split {1, 2, ..., n} into the set M of k for which ((k − 1)/n, k/n] meets A_n and the opposite set N. Prove successively that Σ_{k∈M}[F(k/n) − F((k − 1)/n)] < ε, Σ_{k∈N}[F(k/n) − F((k − 1)/n)] > 1 − ε, Σ_{k∈M} 1/n ≥ λ(A_n) > 1 − ε, and Σ_{k=1}^n |f(k/n) − f((k − 1)/n)| > 2 − 2ε.
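The functional equation F(x) = p₀ + ··· + p_{i−1} + p_i F(rx − i) determines F recursively from the base-r digits, and it can be evaluated numerically. A sketch (the truncation scheme and names are mine): with all p_i = 1/r it reproduces the uniform F(x) = x, while unequal p_i give a continuous, strictly increasing, singular F.

```python
def singular_cdf(x, p, depth=60):
    """Evaluate F from F(x) = p_0 + ... + p_{i-1} + p_i * F(r*x - i) on
    [i/r, (i+1)/r], truncating the recursion at the given depth; the
    truncation error is at most max(p)^depth."""
    r = len(p)
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    if depth == 0:
        return 0.0
    i = min(int(r * x), r - 1)
    return sum(p[:i]) + p[i] * singular_cdf(r * x - i, p, depth - 1)

# Uniform digit probabilities give the uniform distribution function:
assert abs(singular_cdf(0.3, [0.5, 0.5]) - 0.3) < 1e-12
# Unequal probabilities give a strictly increasing (singular) F:
vals = [singular_cdf(k / 16, [0.2, 0.8]) for k in range(17)]
assert all(u < v for u, v in zip(vals, vals[1:]))
```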
31.15. Π_{n=1}^∞ (½ + ½e^{2it/3ⁿ}).

31.18. For x fixed, let u_n and v_n be the pair of successive dyadic rationals of order n (v_n − u_n = 2⁻ⁿ) for which u_n ≤ x < v_n. Show that the difference ratio (F(v_n) − F(u_n))/(v_n − u_n) is a sum involving the a_k, where a_k is the left-hand derivative. Since a_k(x) = ±1 for all x and k, the difference ratio cannot have a finite limit.

31.22. Let A be the x-set where (31.35) fails if f is replaced by fφ; then A has Lebesgue measure 0. Let G be the union of all open sets of μ-measure 0; represent G as a countable disjoint union of open intervals, and let B be G together with any endpoints of zero μ-measure of these intervals. Let D be the set of discontinuity points of F. If x ∉ A ∪ D ∪ B, then F(x + h) → F(x) and

(F(x + h) − F(x − h))⁻¹ ∫_{F(x−h)}^{F(x+h)} f(φ(t)) dt → f(φ(F(x))).

Now x − ε < φ(F(x)) ≤ x follows from F(x − ε) < F(x), and hence φ(F(x)) = x. If λ is Lebesgue measure restricted to (0, 1), then μ = λφ⁻¹, and (31.36) follows by change of variable. But (31.36) is easy if x ∈ D, and hence it holds outside B ∪ (D^c ∩ F⁻¹A). But μ(B) = 0 by construction and μ(D^c ∩ F⁻¹A) = 0 by Problem 14.5.
Section 32

32.7. Define μ_n and ν_n as in (32.7), and write ν_n = ν_ac^{(n)} + ν_s^{(n)}, where ν_ac^{(n)} is absolutely continuous with respect to μ_n and ν_s^{(n)} is singular with respect to μ_n. Take ν_ac = Σ_n ν_ac^{(n)} and ν_s = Σ_n ν_s^{(n)}.

32.9. Suppose that ν(E) = ν′_ac(E) + ν′_s(E) for all E in ℱ, where ν′_ac is absolutely continuous with respect to μ and ν′_s is singular with respect to μ. Choose an S in ℱ that supports ν_s and ν′_s and satisfies μ(S) = 0. Then ν_ac(E) = ν_ac(E ∩ S^c) = ν_ac(E ∩ S^c) + ν_s(E ∩ S^c) = ν(E ∩ S^c) = ν′_ac(E ∩ S^c) + ν′_s(E ∩ S^c) = ν′_ac(E ∩ S^c) = ν′_ac(E). A similar argument shows that ν_s(E) = ν′_s(E).

Define f and ν_s as in (32.8), and let f⁰ and ν_s⁰ be the corresponding function and measure for ℱ⁰: ν(E) = ∫_E f⁰ dμ + ν_s⁰(E) for E ∈ ℱ⁰, and there is an ℱ⁰-set S₀ such that ν_s⁰(Ω − S₀) = 0 and μ(S₀) = 0. If E ∈ ℱ⁰, then ∫_{E−S₀} f⁰ dμ = ∫_E f⁰ dμ and ∫_{E−S₀} f⁰ dμ = ν(E − S₀) = ∫_{E−S₀} f dμ. It is instructive to consider the extreme case ℱ⁰ = {∅, Ω}, in which ν_s⁰ vanishes and ν is, on ℱ⁰, absolutely continuous with respect to μ (provided μ(Ω) > 0).
Section 33

33.2. (a) To prove independence, check the covariance. (b) Now use Example 33.7. (c) Use the fact that R and θ are independent (Example 20.2). As the single event [X = Y] = [X − Y = 0] = [θ = π/4] ∪ [θ = 5π/4] has probability 0, the conditional probabilities have no meaning, and strictly speaking there is nothing to resolve. But whether it is natural to regard the degrees of freedom as one or as two depends on whether the 45° line through the origin is regarded as an element of the decomposition of the plane into lines or whether it is regarded as the union of two elements of the decomposition of the plane into rays from the origin. Borel's paradox can be explained the same way: the equator is an element of the decomposition of the sphere into lines of constant latitude;
the Greenwich meridian is an element of the decomposition of the sphere into great circles with common poles. The decomposition matters, which is to say the σ-field matters.
33.3. (a) If the guard says, "1 is to be executed," then the conditional probability that 3 is also to be executed is 1/(1 + p). The "paradox" comes from assuming that p must be 1, in which case the conditional probability is indeed ½. If p = ½, then the guard gives no information. (b) Here "one" and "other" are undefined, and the problem ignores the possibility that you have been introduced to a girl. Let the sample space consist of the outcomes bbo, bby, bgo, bgy, gbo, gby, ggo, ggy, with probabilities α/4, (1 − α)/4, β/4, (1 − β)/4, γ/4, (1 − γ)/4, δ/4, (1 − δ)/4. For example, bgo is the event (probability β/4) that the older child is a boy, the younger is a girl, and the child you have been introduced to is the older; and ggy is the event (probability (1 − δ)/4) that both children are girls and the one you have been introduced to is the younger. Note that the four sex distributions do have probability ¼ each. If the child you have been introduced to is a boy, then the conditional probability that the other child is also a boy is p = 1/(2 + β − γ). If β = 1 and γ = 0 (the parents present a son if they have one), then p = 1/3. If β = γ (the parents are indifferent), then p = ½. Any p between 1/3 and 1 is possible. This problem shows again that one must keep in mind the entire experiment the sub-σ-field 𝒢 represents, not just one of the possible outcomes of the experiment.
33.6.
There is no problem, unless the notation gives rise to the illusion that p(A|x) is P(A ∩ [X = x])/P[X = x].
33.18. If N is a standard normal variable, then

(1/√n) p_ν(y + x/√n) = (1/√(2π)) e^{−x²/2} f(y + x/√n) / E[f(y + N/√n)].

Section 34
34.3.
If (X, Y) takes the values (0, 0), (1, −1), and (1, 1) with probability ⅓ each, then X and Y are dependent but E[Y‖X] = E[Y] = 0. If (X, Y) takes the values (−1, 1), (0, −2), and (1, 1) with probability ⅓ each, then E[X] = E[Y] = E[XY] = 0 and so E[XY] = E[X]E[Y], but E[Y‖X] = Y ≠ 0 = E[Y]. Of course, this is another example of dependent but uncorrelated random variables.
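Both three-point examples in 34.3 can be verified by direct enumeration (the helper function is mine; arithmetic is exact with rationals):

```python
from fractions import Fraction

def moments(points):
    """E[X], E[Y], E[XY] for the uniform distribution on the given (x, y) points."""
    n = len(points)
    ex = sum(Fraction(x) for x, _ in points) / n
    ey = sum(Fraction(y) for _, y in points) / n
    exy = sum(Fraction(x * y) for x, y in points) / n
    return ex, ey, exy

# First example: E[Y] = 0 and E[Y | X] = 0, yet X determines |Y|, so dependent.
ex1, ey1, exy1 = moments([(0, 0), (1, -1), (1, 1)])
assert ey1 == 0 and exy1 == 0

# Second example: uncorrelated (E[XY] = E[X]E[Y] = 0) but Y = 3X^2 - 2 = E[Y | X].
ex2, ey2, exy2 = moments([(-1, 1), (0, -2), (1, 1)])
assert exy2 == ex2 * ey2 == 0
```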
34.4. First show that ∫f dP₀ = ∫_B f dP/P(B) and that P[B‖𝒢] > 0 on a set of P₀-measure 1. Let G be the general set in 𝒢.
(a) Since

∫_G P₀[A‖𝒢] P[B‖𝒢] dP = ∫_G P₀[A‖𝒢] I_B dP = ∫_B I_G P₀[A‖𝒢] dP,

it follows that P₀[A‖𝒢] P[B‖𝒢] = P[A ∩ B‖𝒢] holds on a set of P-measure 1. (b) If P_i(A) = P(A|B_i), then

... = ∫_{G∩B_i} P[A‖𝒢 ∨ ℋ] dP.
34.9.
All such results can be proved by imitating the proofs for the unconditional
case
or else by using Theorem
34.1).
34.5 (for part (c), as generalized in Problem
34.10. (a) If Y = X − E[X‖𝒢₁], then X − E[X‖𝒢₂] = Y − E[Y‖𝒢₂], and E[(Y − E[Y‖𝒢₂])²‖𝒢₂] = E[Y²‖𝒢₂] − E²[Y‖𝒢₂] ≤ E[Y²‖𝒢₂]. Take expected values.
34.11. First prove that ... From this and (i) deduce (ii). From (ii) and the preceding equation deduce ∫_{A₁∩A₂} P[A₃‖𝒢₂] dP = ∫_{A₁∩A₂} P[A₃‖𝒢₁₂] dP. The sets A₁ ∩ A₂ form a π-system generating 𝒢₁₂.
34.16. (a) Obviously (34.15) implies (34.14). If (34.14) holds, then clearly (34.15) holds for X simple. For the general X, choose simple X_k such that lim_k X_k = X and |X_k| ≤ |X|. Note that ∫_{A_n} X dP − α ∫X dP can be estimated by comparing X with X_k; let n → ∞ and then let k → ∞. (b) If Ω ∈ 𝒝, then the class of E satisfying (34.14) is a λ-system, and so by the π-λ theorem and part (a), (34.15) holds if X is measurable σ(𝒝). Since A_n ∈ σ(𝒝), it follows that

∫_{A_n} X dP = ∫_{A_n} E[X‖σ(𝒝)] dP → α ∫ E[X‖σ(𝒝)] dP = α ∫ X dP.

(c) Replace X by X dP₀/dP in (34.15).

34.17. (a) The Lindeberg-Lévy theorem. (b) Chebyshev's inequality. (c) Theorem 25.4. (d) Independence of the X_n. (e) Problem 34.16(b). (f) Problem 34.16(c). (g) Part (b) here and the ε-δ definition of absolute continuity. (h) Theorem 25.4 again.
34.19. See S. M. Samuels: The Radon-Nikodym theorem as a theorem in probability, Amer. Math. Monthly, 85 (1978), 155-165, and SAKS, p. 34.
Section 35

35.4. (b) Let S_n be the number of k such that 1 ≤ k ≤ n and Y_k = 1. Then X_n = 3^{S_n}/2ⁿ. Take logarithms and use the strong law of large numbers.
35.9. Let K bound |X₁| and the |X_n − X_{n−1}|. Bound |X_T| by KT. Write ∫_{[T ≤ k]} X_T dP = Σ_{i=1}^k ∫_{[T=i]} X_i dP = Σ_{i=1}^k (∫_{[T ≥ i]} X_i dP − ∫_{[T ≥ i+1]} X_i dP). Transform the last integral by the martingale property and reduce the expression to E[X₁] − ∫_{[T > k]} X_{k+1} dP. Now

∫_{[T > k]} X_{k+1} dP ≤ K(k + 1)P[T > k] ≤ K(k + 1)k⁻¹ ∫_{[T > k]} T dP → 0.
35.13. (a) By the result in Problem 32.9, X₁, X₂, ... is a supermartingale. Since E[|X_n|] = E[X_n] ≤ ν(Ω), Theorem 35.4 applies. (b) If A ∈ ℱ_k, then ∫_A (Y_n + Z_n) dP + α_n(A) = ∫_A X_∞ dP + α_∞(A) = ν(A) = ∫_A X_n dP + α_n(A). Since the Lebesgue decomposition is unique (Problem 32.1), Y_n + Z_n = X_n with probability 1. Since X_n and Y_n converge, so does Z_n. If A ∈ ℱ_k and n ≥ k, then ∫_A Z_n dP ≤ α_∞(A), and by Fatou's lemma the limit Z satisfies ∫_A Z dP ≤ α_∞(A). This holds for A in ⋃_k ℱ_k and hence (monotone class theorem) for A in ℱ. Choose A so that P(A) = 1 and α_∞(A) = 0: E[Z] = ∫_A Z dP ≤ α_∞(A) = 0. It can happen that α_n(Ω) = 0 and α_∞(Ω) = ν(Ω) > 0, in which case α_n does not converge to α_∞ and the X_n cannot be integrated to the limit.

35.19. For a very general result, see J. L. Doob: Application of the theory of martingales, Le Calcul des Probabilités et ses Applications (Colloques Internationaux du Centre de la Recherche Scientifique, Paris, 1949).

35.22. (a) To prove that two sequences must have the same limit, interlace them. To prove the last assertion choose n and K so that |E[X_T]| ≤ K for T ∈ T_n, and note that T ∨ n ∈ T_n and

|E[X_{T∨n}]| ≤ Σ_{k=1}^n E[|X_k|] + |E[X_T]|.
35.23. (a) Use Problem 13.17. (b) If A_{mq} is the set where |X_n − Y| < ε for some n in the range m ≤ n ≤ q, then P(⋃_q A_{mq}) = 1 for each ε and m because Y is a cluster variable, and hence P(A_{mq}) ↑ 1 as q ↑ ∞. (c) By parts (a) and (b) there is probability exceeding 1 − 2/k² that |X_n − Y_k| < 2/k for some n in the range n_k ≤ n ≤ q_k, in which case |X_{T_k} − Y_k| < 2/k. Thus P[|X_{T_k} − Y| ≥ 3/k] ≤ 3/k².
35.25. (a) Of course, {X_n⁺} is L₁-bounded. To show that it is an amart, it suffices (Problem 35.22(b)) to show that E[X_{T_n}⁺] converges if T_n ∈ T and T_n ≥ T_{n−1}. Suppose that T_n ≤ q and put

σ_{n,q} = T_n if X_{T_n} > 0,  σ_{n,q} = q if X_{T_n} ≤ 0.

Check that σ_{n,q} ∈ T and

E[X_{T_n}⁺] = E[X_{σ_{n,q}}] − ∫_{[X_{T_n} ≤ 0]} X_q dP.

Now deduce (Problem 35.22(a)) that lim sup_n E[X_{T_n}⁺] ≤ lim inf_q E[X_{T_q}⁺]. Thus lim_n E[X_{T_n}⁺] exists, although it might be infinite. Choose integers p_n such that T_n ≤ p_n and put

ρ_n = T_n if X_{T_n} ≥ 0,  ρ_n = p_n if X_{T_n} < 0.

Check that ρ_n ∈ T and use the estimate E[X_{T_n}⁺] ≤ E[X_{ρ_n}] + E[|X_{p_n}|] along with Problem 35.22(a) and the assumption that {X_n} is L₁-bounded. (b) Obviously, sums and differences of L₁-bounded amarts are again L₁-bounded amarts; note that X_n ∨ a = X_n + (a − X_n)⁺ and X_n ∧ a = X_n − (X_n − a)⁺.

35.26. If σ_{a,p} = p ∧ inf[n: |X_n| ≥ a], then σ_{a,p} ∈ T and aP[max_{n≤p} |X_n| ≥ a] ≤ aP[|X_{σ_{a,p}}| ≥ a] ≤ E[|X_{σ_{a,p}}|].

35.27. (a) Use Problem 35.25(b) and the fact that X_n^{(a)} = −a ∨ (X_n ∧ a). (b) Use Problem 35.26. (c) Combine parts (a) and (b) with Problem 35.24.

35.28. (a) Since E[X_n] increases to some limit, it suffices to show that n ≤ T ≤ p implies that E[X_n] ≤ E[X_T] ≤ E[X_p]. To prove the left-hand inequality, note that

E[X_n] = ∫_{[T=n]} X_T dP + ∫_{[T>n]} X_n dP.

(b) For example, let X be a bounded, positive random variable and put X_n = (−1)ⁿX/n.
Section 36
36.6. The analogue in sequence space of a cylinder in R^T is [ω: (z_{i₁}(ω), ..., z_{i_k}(ω)) ∈ H], where H is a set of k-long sequences of 0's and 1's. The special cylinders considered in this problem, where H consists of a single sequence, might then be called "thin" cylinders. Assign thin cylinders probabilities P[ω: z_i(ω) = u_i, i ≤ n] = p_n(u₁, ..., u_n). Show that if r < s, then

p_r(u₁, ..., u_r) = Σ p_s(u₁, ..., u_r, v_{r+1}, ..., v_s),

where the sum extends over sequences (v_{r+1}, ..., v_s) of 0's and 1's. Deduce from this that if r < s and a thin cylinder of rank r is the disjoint union of thin cylinders of rank s, then the P-values add. Following the arguments in the notes on Problem 2.16, extend P to the field ℱ₀ of finite disjoint unions of thin cylinders, and prove consistency and finite additivity. Since P is automatically countably additive on ℱ₀, it can be extended to ℱ = σ(ℱ₀). This proof parallels that of Kolmogorov's theorem but is simpler, since regularity of measures is irrelevant and finite additivity on ℱ₀ implies countable additivity.
36.9.
(b) Show by part (a) and Problem 34.18 that f_n is the conditional expected value of f with respect to the σ-field 𝒢_{n+1} generated by the coordinates x_{n+1}, x_{n+2}, .... By Theorem 35.7 (see Problem 35.17), (36.27) will follow if each set in ⋂_n 𝒢_n has π-measure either 0 or 1, and here the zero-one law applies. (c) Show that g_n is the conditional expected value of f with respect to the σ-field generated by the coordinates x₁, ..., x_n, and apply Theorem 35.5.
36.11. Let 𝓛 be the countable set of simple functions Σ_i a_i I_{A_i} for a_i rational and {A_i} a finite decomposition of the unit interval into subintervals with rational endpoints. Suppose that the X_t exist and choose (Theorem 17.1) Y_t in 𝓛 so that E[|X_t − Y_t|] < ⅓. From E[|X_s − X_t|] = 1, conclude that E[|Y_s − Y_t|] > 0 for s ≠ t. But there are only countably many of the Y_t. It does no good to replace Lebesgue measure by some other measure on the unit interval.
36.12. That over D the path function for ω is integer-valued and nondecreasing with probability 1 is easy to prove. Suppose that ω has this property and that the path restricted to D has somewhere a jump exceeding 1. Then for some t₀ and for every sufficiently large n there is a k ≤ 2ⁿt₀ such that N_{k/2ⁿ}(ω) − N_{(k−1)/2ⁿ}(ω) ≥ 2. To prove that this has probability 0, use the fact that P[N_{t+h} − N_t ≥ 2] ≤ Kh², where K is independent of t and h.
36.13. Put μ₁ = ν and inductively define μ_n(M × H) = ∫_M ν_n(H; x₁, ..., x_{n−1}) dμ_{n−1}(x₁, ..., x_{n−1}) for M ∈ ℛ^{n−1} and H ∈ ℛ¹; see Problem 18.25. Prove consistency and let {X₁, X₂, ...} have finite-dimensional distributions μ₁, μ₂, .... Let P_n′(H, ω) = ν_n(H; X₁(ω), ..., X_{n−1}(ω)) and prove by a change of variable that ∫_G P_n′(H, ω)P(dω) = P(G ∩ [X_n ∈ H]) for G = [(X₁, ..., X_{n−1}) ∈ M]. In the sense of Theorem 33.3, P_n′(H, ω) is thus a conditional distribution for X_n given σ[X₁, ..., X_{n−1}]. For the part concerning integrals use Theorem 34.5. If A supports ν and supports ν_n(·; x₁, ..., x_{n−1}) for (x₁, ..., x_{n−1}) ∈ A × ··· × A, then P(X₁ ∈ A, ..., X_n ∈ A) = μ_n(A × ··· × A) = 1 for all n. In such a case it makes no difference how ν_n(H; x₁, ..., x_{n−1}) is defined outside A × ··· × A.
Section 37

37.2. (a) Use Problem 36.10(b). (b) Let [W_t: t ≥ 0] be a Brownian motion on (Ω, ℱ, P_0), where W.(ω) ∈ C for every ω. Define ξ: Ω → R^T by Z_t(ξ(ω)) = W_t(ω). Show that ξ is measurable ℱ/ℛ^T and P = P_0 ξ⁻¹. If C ⊂ A ∈ ℛ^T, then P(A) = P_0(ξ⁻¹A) = P_0(Ω) = 1.
37.3. Consider W(1) = Σ_{k=1}^n (W(k/n) − W((k−1)/n)) for notational convenience. Since

∫_{[|W(1/n)| ≥ ε]} W²(1/n) dP = (1/n) ∫_{[|W(1)| ≥ ε√n]} W²(1) dP → 0,

the Lindeberg theorem applies.
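The displayed Lindeberg quantity can be evaluated in closed form: for a standard normal Z, E[Z²; |Z| ≥ a] = 2(aφ(a) + 1 − Φ(a)), so the truncated second moment of W(1/n) is this expression with a = ε√n, divided by n. A small sketch confirming it tends to 0 (the choice ε = 0.1 is arbitrary):

```python
import math

def phi(a):   # standard normal density
    return math.exp(-a * a / 2) / math.sqrt(2 * math.pi)

def tail(a):  # P[Z >= a] for standard normal Z
    return 0.5 * math.erfc(a / math.sqrt(2))

def lindeberg_term(n, eps):
    # Integral of W(1/n)^2 over [|W(1/n)| >= eps]
    # = (1/n) * integral of W(1)^2 over [|W(1)| >= eps*sqrt(n)],
    # and E[Z^2; |Z| >= a] = 2*(a*phi(a) + P[Z >= a]).
    a = eps * math.sqrt(n)
    return 2.0 * (a * phi(a) + tail(a)) / n

eps = 0.1
terms = [lindeberg_term(n, eps) for n in (1, 10, 100, 1000)]
print(terms)  # strictly decreasing toward 0: the Lindeberg condition holds
```

The identity used, ∫_a^∞ z²φ(z) dz = aφ(a) + 1 − Φ(a), follows by integration by parts.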
37.12. By symmetry,

p(s, t) = 2P[W_s > 0, inf_{s≤u≤t} (W_u − W_s) < −W_s];

W_s and the infimum here are independent because of the Markov property, and so by (20.30) (and symmetry again)

p(s, t) = 2 ∫_0^∞ P[τ_x < t − s] (1/√(2πs)) e^{−x²/2s} dx
        = 2 ∫_0^∞ ∫_0^{t−s} (x/√(2π)) u^{−3/2} e^{−x²/2u} du (1/√(2πs)) e^{−x²/2s} dx.

Reverse the integral, use ∫_0^∞ x e^{−x²r/2} dx = 1/r, and put v = (s/(s + u))^{1/2}:

p(s, t) = (1/π) ∫_0^{t−s} √s/(u^{1/2}(u + s)) du = (2/π) ∫_{√(s/t)}^1 dv/√(1 − v²) = (2/π) arccos √(s/t).
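The resulting arccos law — the probability (2/π) arccos √(s/t) that W has a zero in (s, t) — can be sanity-checked by simulation: approximate W on a grid and flag paths whose sign changes in (s, t). This is only a rough sketch (step count, trial count, seed, and tolerance are arbitrary choices, and the discrete grid slightly undercounts zeros):

```python
import math
import random

def has_zero(s, t, steps, rng):
    # Random-walk approximation of W on [0, t]; a zero in (s, t) is
    # detected as a sign change (or an exact hit) on the grid.
    dt = t / steps
    w, prev = 0.0, None
    for i in range(1, steps + 1):
        w += rng.gauss(0.0, math.sqrt(dt))
        if i * dt > s:
            if prev is not None and (w == 0.0 or (w < 0) != (prev < 0)):
                return True
            prev = w
    return False

rng = random.Random(12345)
s, t = 0.25, 1.0
trials = 4000
hits = sum(has_zero(s, t, 400, rng) for _ in range(trials))
est = hits / trials
exact = (2 / math.pi) * math.acos(math.sqrt(s / t))  # = 2/3 for s/t = 1/4
print(est, exact)
```

With these parameters the empirical frequency lands near 2/3, a little low because zeros between grid points can be missed.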
Bibliography
HALMOS and SAKS have been the strongest measure-theoretic and DOOB and FELLER the strongest probabilistic influences on this book, and the spirit of KAC's small volume has been very important.

AUBREY: Brief Lives, John Aubrey, ed. O. L. Dick. Secker and Warburg, London, 1949.
BAHADUR: Some Limit Theorems in Statistics, R. R. Bahadur. SIAM, Philadelphia, 1971.
BILLINGSLEY: Ergodic Theory and Information, Patrick Billingsley. Wiley, New York, 1965.
BILLINGSLEY': Convergence of Probability Measures, Patrick Billingsley. Wiley, New York, 1968.
BIRKHOFF: Lattice Theory, rev. ed., Garrett Birkhoff. American Mathematical Society, Providence, Rhode Island, 1961.
BIRKHOFF & MAC LANE: A Survey of Modern Algebra, 4th ed., Garrett Birkhoff and Saunders Mac Lane. Macmillan, New York, 1977.
BREIMAN: Probability, Leo Breiman. Addison-Wesley, Reading, Massachusetts, 1968.
CHUNG: A Course in Probability Theory, 2nd ed., Kai Lai Chung. Academic, New York, 1974.
CHUNG': Markov Chains with Stationary Transition Probabilities, 2nd ed., Kai Lai Chung. Springer-Verlag, New York, 1967.
ÇINLAR: Introduction to Stochastic Processes, Erhan Çinlar. Prentice-Hall, Englewood Cliffs, New Jersey, 1975.
CRAMÉR: Mathematical Methods of Statistics, Harald Cramér. Princeton University Press, Princeton, New Jersey, 1946.
DOOB: Stochastic Processes, J. L. Doob. Wiley, New York, 1953.
DUBINS & SAVAGE: How to Gamble If You Must, Lester E. Dubins and Leonard J. Savage. McGraw-Hill, New York, 1965.
DYNKIN & YUSHKEVICH: Markov Processes, English ed., Evgenii B. Dynkin and Aleksandr A. Yushkevich. Plenum Press, New York, 1969.
FEDERER: Geometric Measure Theory, Herbert Federer. Springer-Verlag, New York, 1969.
FELLER: An Introduction to Probability Theory and Its Applications, Vol. I, 3rd ed., Vol. II, 2nd ed., William Feller. Wiley, New York, 1968, 1971.
GALAMBOS: The Asymptotic Theory of Extreme Order Statistics, Janos Galambos. Wiley, New York, 1978.
GNEDENKO & KOLMOGOROV: Limit Distributions for Sums of Independent Random Variables, English ed., B. V. Gnedenko and A. N. Kolmogorov. Addison-Wesley, Reading, Massachusetts, 1954.
HALMOS: Measure Theory, Paul R. Halmos. Van Nostrand, New York, 1950.
HALMOS': Naive Set Theory, Paul R. Halmos. Van Nostrand, Princeton, 1960.
HARDY: A Course of Pure Mathematics, 9th ed., G. H. Hardy. Macmillan, New York, 1946.
HARDY & WRIGHT: An Introduction to the Theory of Numbers, 4th ed., G. H. Hardy and E. M. Wright. Clarendon, Oxford, 1959.
HAUSDORFF: Set Theory, 2nd English ed., Felix Hausdorff. Chelsea, New York, 1962.
JECH: Set Theory, Thomas Jech. Academic Press, New York, 1978.
KAC: Statistical Independence in Probability, Analysis and Number Theory, Carus Math. Monogr. 12, Marc Kac. Wiley, New York, 1959.
KAHANE: Some Random Series of Functions, Jean-Pierre Kahane. Heath, Lexington, Massachusetts, 1968.
KARLIN & TAYLOR: A First Course in Stochastic Processes, 2nd ed., A Second Course in Stochastic Processes, Samuel Karlin and Howard M. Taylor. Academic Press, New York, 1975 and 1981.
KINGMAN & TAYLOR: Introduction to Measure and Probability, J. F. C. Kingman and S. J. Taylor. Cambridge University Press, Cambridge, 1966.
KOLMOGOROV: Grundbegriffe der Wahrscheinlichkeitsrechnung, Erg. Math., Vol. 2, No. 3, A. N. Kolmogorov. Springer-Verlag, Berlin, 1933.
LÉVY: Théorie de l'Addition des Variables Aléatoires, Paul Lévy. Gauthier-Villars, Paris, 1937.
LOÈVE: Probability Theory I, 4th ed., M. Loève. Springer-Verlag, New York, 1977.
LUKACS: Characteristic Functions, Eugene Lukacs. Griffin, London, 1960.
NEVEU: Mathematical Foundations of the Calculus of Probability, English ed., J. Neveu. Holden-Day, San Francisco, 1965.
RADÓ: Length and Area, Tibor Radó. American Mathematical Society, Providence, Rhode Island, 1948.
RÉNYI: Probability Theory, A. Rényi. North-Holland, Amsterdam, 1970.
RÉNYI': Foundations of Probability, Alfréd Rényi. Holden-Day, San Francisco, 1970.
RIESZ & SZ.-NAGY: Functional Analysis, English ed., Frigyes Riesz and Béla Sz.-Nagy. Ungar, New York, 1955.
ROYDEN: Real Analysis, 2nd ed., H. L. Royden. Macmillan, New York, 1968.
RUDIN: Principles of Mathematical Analysis, 3rd ed., Walter Rudin. McGraw-Hill, New York, 1976.
RUDIN': Real and Complex Analysis, 2nd ed., Walter Rudin. McGraw-Hill, New York, 1974.
SAKS: Theory of the Integral, 2nd rev. ed., Stanislaw Saks. Hafner, New York, 1937.
SKOROKHOD: Studies in the Theory of Random Processes, English ed., A. V. Skorokhod. Addison-Wesley, Reading, Massachusetts, 1965.
SPITZER: Principles of Random Walk, Frank Spitzer. Van Nostrand, Princeton, New Jersey, 1964.
SPIVAK: Calculus on Manifolds, Michael Spivak. W. A. Benjamin, New York, 1965.
List of Symbols

Ω, 2, 16
(a, b], 565
I_A, 564
P, 2, 20
d_n(ω), 3
r_n(ω), 5
s_n(ω), 6
N, 8, 366
A − B, 564
A^c, 564
A △ B, 564
A ⊂ B, 564
ℬ_0, 18
σ(𝒜), 19
ℬ, 20
(Ω, ℱ, P), 21
A_n ↑ A, 564
A_n ↓ A, 564
λ, 24, 40, 166
∧, 59
∨, 59
P_n(A), 30
D(A), 30
P*, 33, 42
P_*, 33
P(B|A), 46
lim sup_n A_n, 46
lim inf_n A_n, 46
lim_n A_n, 47
i.o., 47
𝒯, 57, 295
R¹, 565
R^k, 567
[X = x], 63
σ(X), 64, 260
μ, 68, 157, 261
E[X], 70, 280
Var[X], 73, 282
N_n, 91
τ, 96, 130, 486
p_ij, 107
S, 107
π_i, 108
M(t), 142, 285, 293
ℛ^k, 155
x_k ↑ x, 157, 565
x = Σ_k x_k, 157
μ*, 162
ℳ(μ*), 163
dist, 168
λ_k, 172
F, 175, 177
Δ_A F, 177
T⁻¹A, 565
ℱ/ℱ′, 182
F_n ⇒ F, 192, 335, 390
dF(x), 230
X × Y, 234
𝒳 × 𝒴, 234
μ × ν, 236
diam, 247
h_m, 248
e_m, 249
|D|, 252
∗, 272
X_n →_P X, 274, 340
L^p(Ω, ℱ, P), 289
N_t, 308
μ_n ⇒ μ, 336, 390
X_n ⇒ X, 338, 390
X_n ⇒ a, 341
∂A, 566
φ(t), 351
μ_n → μ, 382
F_s, 434
F_ac, 434
ν ≪ μ, 442
dν/dμ, 443
ν_s, 445
ν_ac, 445
P[A‖𝒢], 451
P[A‖X_t, t ∈ T], 454
E[X‖𝒢], 466
R^T, 508
ℛ^T, 509
W_t, 522
Index
Here An refers to paragraph n of the Appendix (pp. 564–574); u.v refers to Problem v in Section u, or else to a note on it (Notes on the Problems, pp. 575–609); the other references are to pages. Greek letters are alphabetized by their Roman equivalents (m for μ, and so on). Names in the bibliography are not indexed separately.

Abel, 244
Absolute continuity, 434, 436, 31.7, 442, 443, 32.3
Absolutely continuous part, 446
Absolute moment, 281
Absorbing state, 109
Adapted σ-fields, 480
Additive set function, 440, 32.11
Additivity:
  countable, 20, 158
  finite, 21, 23, 158
Algebra, 17
Almost always, 48
Almost everywhere, 54, 420
Almost surely, 54
Amart, 504
α-mixing, 375, 29.17
Aperiodic, 122
Approximation of measure, 166
Arc, 19.3
Arc length, 19.3, 31.11
Area over the curve, 74
Area under the curve, 206
Arrival times, 309
Asymptotic equipartition property, 6.14, 8.30
Atom, 20.16
Axiom of choice, A5, 18
Bahadur, 9.4
Baire category, A15, 1.13
Baire function, 13.10
Bayes, 59, 33.9, 33.17, 496
Benford's law, 25.3
Beppo Levi's theorem, 16.9
Bernoulli-Laplace model of diffusion, 108
Bernoulli trials, 80, 143, 148, 9.3, 368
Bernstein polynomial, 82
Bernstein's theorem, 82
Beta function, 20.24
Betting system, 95, 485
Billingsley, 19.12, 392
Binary digit, 3
Binary expansion, 3
Binomial distribution, 262
Binomial series, 17.11
Birkhoff, 42, 173
Blackwell, 476
Bold play, 98
Boole's inequality, 23
Borel, A13, 9
Closure, A11
Borel-Cantelli lemmas, 53, 55, 58, 83, 21.6, 22.4
Cluster variable, 35.23
Co-countable set, 18
Borel function, 184, 13.10, 31.3
Cofinite set, 18
Borel set, 20, 3.14, 155, 433
Borel's normal number theorem, 9, 6.9
Borel's paradox, 33.1
Collective, 7.3
Complement, A1
Boundary, A11, 265
Completely normal number, 6.13
Bounded convergence theorem , 2 1 4
Complete space or measure , 39 , 1 0 . 6 , 533
Bounded gambling policy , 97
Completion, 3 . 5 , 1 0. 6
Bounded variation, 435
Complex functions, integration of, 22 1
Branching process , 483
Compound Poisson distribution , 28 . 3
Britannica , I . 5
Compound Poisson process , 23 . 8
Brown , 3 1 4
Concentrated , 1 58
Brownian motion , 522, 529 , 553 , 559 , 560 Burstin ' s theorem , 22 . 1 6
Conditional distribution, 460 , 47 1 , 36 . 1 3
Compact , A 1 3
Conditional expected value , 1 3 1 , 466 , 467 , 34 . 1 9
Canonical measure , 384
Conditional probability , 46 , 448 , 45 1 , 454 , 33.5, 33. 1 3
Canonical representation, 384 Cantelli ' s inequality , 5 . 5
Consistency conditions for finite-dimensional
Cantelli ' s theorem , 6 . 6
Content , 3 . 1 5
Cantor function , 3 1 . 2 , 3 1 . 1 5
Continuity from above, 23, 159, 177, 265
Cantor set , 1 . 6, 1 . 9 , 3 . 1 6 , 4 . 23 , 1 2 . 8
Continuity from below , 23 , 1 59 , 265
Caratheodory' s condition , 1 68
Continuity of paths , 524
Cardinality of a-fields , 2 . 1 1
Continuity set, 344, 390
Cartesian product , 234
Continuity theorem for characteristic functions ,
Category , A 1 5
distributions , 507
359 , 396
Cauchy distribution, 20.20, 21.5, 22.6, 358, 26.9, 28.5, 28.13
Cauchy's equation, A20, 14.11
Cavalieri's principle, 18.8
Central limit theorem, 300, 366, 27.14, 27.15, 387, 398, 408, 34.17, 497
Central symmetrization, 251
Cesàro averages, A30, 25.16
Change of variable, 218, 228, 229, 252, 264, 280
Characteristic function, A5, 351, 395
Chebyshev's inequality, 5, 75, 5.5, 81, 283, 34.9
Chernoff's theorem, 147
Conventions involving ∞, 157, 202
Convergence in distribution, 338, 390
Convergence in mean, 21.22
Convergence in measure, 274
Convergence in probability, 274, 299, 340
Convergence of random series, 298
Convergence of types, 195
Convergence with probability 1, 274, 299
Convex functions, A32, 75
Convolution, 272
Coordinate function, 509
Coordinate variable, 509
Coordinate-variable process, 510
Countable , 8
Chi-squared distribution , 20 . 22 , 2 1 . 35 , 29 . 1 4 ,
Countable additivity , 20 , 1 58
462
Countable subadditivity , 23 , 1 59 , 1 62 , 1 64
Chi-squared statistic , 29 . 1 4
Countably additive , 20
Circular Lebesgue measure, 1 3 . 1 9
Countably generated a-field , 2 . 1 0 , 20 . 1
Class of sets , 1 6
Countably infinite , 8
Closed set , A I I
Counting measure , 1 58
Closed set of states , 8 . 23
Coupled chain , 1 23 , 8 . 1 6
Closed support , 1 2 . 9
Coupon problem , 372
Covariance, 284
Double-or-nothing, 95, 7.6
Cover, A3
Double series, A27
Cramér-Wold theorem, 397, 30.6
Doubly stochastic matrix, 8.22
Cumulant, 144
Dubins-Savage theorem, 99
Cumulant generating function, 144
Dyadic expansion, A31, 3
Cylinder, 2.16, 509, 36.6
Dyadic interval, 4
Dyadic rational, 4
Daniell-Stone theorem, 11.12, 16.25
Dynkin, 110
Dynkin's π-λ theorem, 37
Darboux-Young definition, 208
δ-distribution, 194
Decomposition, A3
Definite integral, 202, 203
ε-δ definition of absolute continuity, 443, 32.3
Egorov's theorem, 13.13
Degenerate distribution function, 1 95
Empirical distribution function , 275
Delta method , 368 , 27 . 1 0 , 27 . 1 1 , 402 , 29 . 8
Empty set , A I
DeMoivre-Laplace theorem, 25.11, 368
DeMorgan's law, A6
Entropy , 52, 6. 1 4 , 8 . 30, 3 1 . 1 7 Equicontinuous , 26 . 1 5
Dense , A l 5
Equivalence class , 52
Density of measure , 2 1 6 , 440
Equivalent measures , 442 , 472
Density point, 3 1 . 9
Erdos-Kac central limit theorem , 4 1 3
Density of random variable or distribution,
Erlang density, 23 . 2
262 , 264 , 266
Essential supremum , 5 . I I
Density of set of integers , 2 . 1 5
Estimation, 475
Denumerable probability, 45 , 552
Etemadi , 290, 297
Dependent random variables , 375
Euclidean distance , A 1 6
Derivates , 423
Euclidean space , A I , A 1 6
Derivatives of integrals , 42 1 Diagonal method , A 1 4
Euler, 1 8 . 1 7 , 1 8 . 22 Euler ' s function, 2 . 1 5
Difference , A I
Event , 1 6
Difference equation, A 1 9
Excessive function , 1 3 1 , 8 . 33
Differential equations and the Poisson process ,
Existence of independent sequences , 68 , 27 1
317
Existence of Markov chains , 1 1 2
Dirichlet ' s theorem , A26 , 20
Expected value , 70 , 280
Discontinuity of the first kind , 56 1
Exponential convergence , 1 28
Discrete measure , 2 1 , 1 58
Discrete random variable, 261
Discrete space, 1.1, 21
Exponential distribution, 191, 263, 20.25, 286, 307, 327, 331, 358, 28.5
Extension of measure, 32, 164, 11.1, 11.5
Disjoint, A3
Extremal distribution , 1 97
Disjoint supports , 430, 442 Distribution: of random variable , 68 , 1 89 , 26 1
Factorization and sufficiency , 472
of random vector, 265 Distribution function , 1 76, 1 89 , 26 1 . 265 , 429
Fair game , 8 8 , 480 , 485 Fatou ' s lemma, 2 1 2 , 1 6.4, 34 . 8
Dividing cake , 2 . 1 8
Federer, 1 9 . 8
Dominated convergence theorem, 72, 213, 214, 16.6, 21.21, 348
Dominated measure, 442, 472, 494
Feller, 1, 8.19, 108, 25.3, 25.11, 610
Feller's theorem, 374
Field , 1 7 , 1 9 , 2 . 5
Doob , 1 8 . 1 1 ' 490, 35 . 1 9 Double exponential distribution, 358
Finite additivity , 2 1 , 23 , 1 58
Double integral , 237
Finite-dimensional distributions , 3 1 9 , 506
Finite or countable , 8
Finite-dimensional sets , 509
Independent classes . 50
Finitely additive field , 1 8
Independent events . 48 . 453
Finite measure , 1 57
Independent increments , 309 , 522 , 37 . 3
Finite subadditivity, 22, 159
First Borel-Cantelli lemma, 53
Independent random variables, 66, 73, 192, 267, 284, 287, 290, 355, 454
First category , A 1 5 , 1 . 1 3
Independent random vectors , 268 . 269 . 454
First passage , 1 1 5
Indicator, AS
Fisher, 1 99
Infinite convolutions , 26 . 22
Fixed discontinuity , 3 1 4
Infinitely divisible distributions , 382, 385
Fourier series , 36 1 , 26. 26 , 27 . 2 1
Infinitely often, 47
Fourier transform , 352
Infinite series, A25
Frechet , 1 99
Infinitesimal array, 373
Frequency , 5 Fubini ' s theorem , 238 , 36. 9
Information, 52
Functional central limit theorem , 548
Initial digit problem , 25 . 3
Fundamental in probability , 20 . 28
Initial probabilities , I 08
Fundamental theorem of calculus , 227 , 4 1 9
Inner boundary , 58 , 85
Information source , 6. 1 4 , 8 . 30
Inner measure , 3 3 , 3 . 2 Integrable, 203 Galambos, 1 99
Integral , 202
Gambling policy , 96
Integral with respect to Lebesgue measure , 224
Gamma distribution , 20 . 23 , 26 . 8 , 28 . 5
Integrals of derivatives , 433
Gamma function, 1 8 . 22
Integration by parts , 239
Generated a-field , 1 9 , 64, 1 3 . 5 , 260
Integration over sets , 2 1 5
Glivenko-Cantelli theorem, 275 , 25 . 2
Integration with respect to a density , 2 1 7
Gnedenko , 1 99 , 28.4 Goncharov ' s theorem , 37 1
lnterarrival time , 322 Interior, A I I Interval , A9
Invariance principle , 547 , 37 . 1 4 Hahn decomposition, 44 1 , 32 . 2 , 34 . 1 9
Inventory , 3 23
Halmos, A5, 39, 11.6, 610
Inverse image , A7 , 1 82
Hamel basis , 1 4 . 1 1
Inversion formula, 355
Hardy , 1 . 1 0, 4 1 3 , 564
Irreducible chain, 1 1 7
Hardy-Ramanujan theorem , 6 . 1 7 , 30 .9, 30. 1 2
Irregular paths , 530
Hausdorff, 174, 564
Iterated integral , 237
Hausdorff dimension, 1 9 . 1 1 Hausdorff measure , 247 Heine-Bore( theorem , A l 3 , A l 7
Jacobian , 229 , 25 3 , 266
Helly selection theorem , 345 , 392
Jensen·s inequality , 75 , 5 . 7 . 5 . 8 , 283 , 2 1 . 7 ,
Hewitt-Savage zero-one law , 304
470
Hilbert space , 2 1 . 27 , 34. 1 3 , 34 . 1 4 , 35 . 1 8
Jordan , 3 . 1 5
Hitting time , 1 34 , 54 1 , 37 .9 Holder ' s inequality , 75 , 5 . 9 , 5 . 1 1 , 283 , 34 . 9
Jordan decomposition, 442
Hypothesis testing , 1 48 Kac , 4 1 3 , 6 1 0 Inadequacy of �R1• 5 1 7 . 36. I 0
Inclusion-exclusion formula. 22 . 1 59 Indefinite integral . 4 1 9 Independent array , 5 1
Kahane , 22 . 1 2 k-dimensional Borel set , 1 55 k-dimensional Lebesgue measure , 1 7 1 , 1 78 . 20 .4 Khinchine , I . I 0, 27 . 1 7
Khinchine-Pollaczek formula, 332, 333
Lower semicontinuous, 29.1
Kindler, 11.12
Lower variation, 442
Kolmogorov, 28.4
L^p-space, 21.25
Kolmogorov's existence theorem, 272, 510, 523, 37.1
λ-system, 36, 3.8, 3.10
Lusin's theorem, 17.8
Kolmogorov's inequality, 296, 487
Lyapounov's condition, 371, 27.7
Kolmogorov's zero-one law, 57, 294, 22.13
Lyapounov's inequality, 76, 283
Kronecker's theorem, 338
Lyapounov's theorem, 371
Kurzweil integral , 1 7 . 1 3 Mac Lane , 1 73 Ladder height, 325
Mapping theorem , 343 , 39 1
Ladder index , 325
Marginal distribution , 266
Landau notation, A 1 8
Markov chain, 107, 20.8, 375, 379, 29.18, 450
Laplace distribution, 358
Laplace transfonn, 285 , 293
Markov process , I 07 , 456 , 537
Large deviations , 1 45 , 1 49 , 27 . 1 7
Markov 's inequality , 74 , 283 , 34 . 9
Lattice distribution, 26.1
Law of the iterated logarithm, 151
Law of large numbers:
  strong, 9, 11, 58, 80, 6.8, 127, 290, 27.20
  weak, 5, 11, 58, 81, 292
Law of random variable, 189, 261
Markov time, 130
Martingale, 98, 7.8, 445, 480, 487, 540, 37.17
Martingale central limit theorem, 497
Martingale convergence theorem, 490
Maximal ineq uality , 296 , 525
L1 -bounded , 504
Maximal solution, 1 1 9
Lebesgue , 1 5 . 5 , 2 1 3
μ-continuity set, 344, 390
Lebesgue decomposition, 435, 31.20, 446, 32.7
m-dependent, 6.11, 376
Mean val ue , 70 , 26 . 1 7
Lebesgue function, 3 1 . 3
Measurable function, 1 83
Lebesgue integrable , 224
Measurable mapping , 1 82
Lebesgue integral , 224 , 225 , 229
Measurable process , 529 , 37 . 6 , 37 . 7
Lebesgue measure, 24, 33, 40, 166, 171, 178, 17.17, 241, 20.4
Lebesgue's density theorem, 31.9, 35.15
Lebesgue set, 40 , 3 . 1 4, 433
Measurable random variable . 64 Measurable rectangle . 234 Measurable with respect to a a-field , 64, 260 Measurable set, 1 8 , 34 , 1 63
Leibniz ' s fonnula, 1 7 . I 0
Measurable space , 1 58
Levy distance , 1 4 . 9 , 25 .4
Measure , 20 , 1 57
Likelihood ratio, 483 , 495
Measure space , 2 1 , 1 58
Limit inferior , 46 , 4 . 2
Meets , A3
Limit of sets , 47
Method of moments, 405 , 30 . 5 , 30 .6
Limit superior , 46, 4 . 2
Minimal solution , 8 . 1 3
Lindeberg condition, 369 , 27 . 7
Minimum-variance estimation , 475
Lindeberg-Uvy theorem , 366
Minkowski 's inequality , 5 . I 0 , 5 . I I , 2 1 . 24
Lindeberg theorem , 368 , 30 . I
Mixing , 375 , 29 . 1 7 , 34 . 1 6
Linear Borel set, 1 55
μ*-measurable, 163
Linearity of expected value , 7 1
Moment, 7 1 , 28 1
Linearity of the integral , 209
Linearly independent reals, 14.11, 30.8
Moment generating function, 1.7, 142, 285, 292, 408
Lipschitz condition. 3 1 . 1 7
Monotone , 22 , 1 59 , 1 62 , 209
log-normal distribution, 407
Monotone class , 39 , 3 . 1 0
Lower integral . 207
Monotone class theorem , 39
Monotone convergence theorem , 2 1 1 , 2 1 4, 34 . 8
Pascal distribution, 199
Path function, 320, 518, 525
Monotonicity of the integral , 209
Payoff function, 1 3 1
M-test , A28 , 2 1 3
Perfect set , A 1 5
Multidimensional central limit theorem , 398 ,
Period , 1 22 , 8 . 27
29 . 1 8
Permutation , 67 , 6 . 3 , 37 1
Multidimensional characteristic function, 395
Persistent, 1 1 4 , 1 1 7, 1 1 8
Multidimensional distribution, 265
Phase space , I 08
Multidimensional normal distribution, 397 ,
Pinsky , 27 . 1 7
29 . 5 , 29 . 1 0 , 33 . 8 Multinomial sampling , 29 . 1 4
π-λ theorem, 36, 3.9
P*-measurable, 34
Poincaré, 29.15
Point, 16
Negative part , 203 , 259
Point of increase , 1 2. 9 , 20 . 1 6
Negligible set, 8 , 1 .4, 1 . 1 2, 4 1
Poisson approximation, 3 1 2 , 336
Newton , 1 7 . I I
Nonatomic, 2.17
Nondegenerate distribution function, 195
Nondenumerable probability, 552
Nonmeasurable set, 41, 4.13, 12.4
Poisson distribution, 262, 20.18, 286, 310, 23.13, 27.1, 27.3, 387, 28.12
Poisson process, 307, 449, 457, 33.7, 34.12, 36.12, 36.14, 560
Poisson's theorem, 6.5
Nonnegative series , A25
Policy , gambling , 96
Normal distribution, 263, 264, 273, 20.19, 20.22, 281, 286, 354, 358, 366, 28.13, 28.14, 397
Pólya's criterion, 26.3
Pólya's theorem, 115, 118
Positive part, 203, 259
Normal number , 8 , I . 9 , I . I I , 3 . 1 5 , 8 1 , 6. 1 3
Positive persistent , 1 27
Normal number theorem, 9 . 6 . 9
Power series, A29
Nowhere dense , A 1 5
Prekopa ' s theorem , 3 1 4
Nowhere differentiable , 3 1 . 1 8 , 532
Prime number theorem , 5 . 1 7
Null persistent, 1 27
Primitive, 227 , 4 1 9
Number theory , 4 1 2
Probability measure , 20 Probability measure space , 2 1 Probability transformation , 1 4 . 4
Open set, A l l
Product measure , 1 2. 1 3 , 234 , 236
Optimal policy , 99
Product space , 234 , 508 , 3 7. 2, 562
Optimal stopping , 1 30
Progress and pinch , 7 . 7
Optimal strategy , 1 34
Proper difference , A I
Optional stopping , 495 , 35 . 20
Proper subset , A I
Order of dyadic interval , 4
π-system, 36, 3.9
Order statistic, 1 4 . 7 Orthogonal transformation , 1 73 , 398 Outer boundary , 5 8 , 85
Queueing model, 1 2 1 , 1 24 , 322
Outer measure , 3 3 , 3 . 2 , 1 62
Queue size , 333
Pairwise additive , 1 1 . 6
Rademacher functions, 5 , 1 . 8 , 63 , 298 , 30 1
Pairwise disjoint, A3
Rado , 1 9 . 8
Partial-fraction expansion, 20 .20
Partial information, 52
Partial summation, 18.19
Radon-Nikodym derivative, 443, 445, 32.6, 32.13, 482, 494, 35.13
Radon-Nikodym theorem, 443, 32.4
Partition , A3 , 422 , 450
Random Taylor series, 30 1
Random variable, 63, 183, 259
Sequence space , 2 . 1 6 , 4. 1 8 , 8 . 4 , 36 .6
Random vector, 1 84, 260
Service time , 322
Random walk, 1 09 , 325
Set , A I
Rank of dyadic interval , 4
Set function, 20, 440, 32.11
Ranks and records, 20.11
σ-field, 17, 19
Rate of Poisson process, 310
σ-field generated by a class of sets, 19
Rational rectangle, 155
σ-field generated by a random variable, 64, 260
Realization of process, 518
Record values, 20.12, 21.4, 27.8
σ-field generated by a transformation, 13.5
Rectangle, A16, 155
σ-finite on a class, 158
Rectifiable curve, 19.3, 31.11
σ-finite measure, 157, 10.7
Recurrent event, 8 . 1 9
Shannon' s theorem , 6. 1 4 , 8 . 30
Red-and-black, 88
Signed measure , 32 . 1 2
Regularity , 4 1 , 1 74
Simple function, 1 85
Regular partition, 1 78
Simple random variable , 63 , 1 85 , 260
Relative measure , 25 . 1 7
Singleton , A I
Renewal theory, 8 . 1 9 , 32 1
Singular function, 427 , 43 1 , 3 1 . 1
Renyi-Lamperti lemma , 6 . 1 5
Singularity , 442 , 494
Reversed martingale , 492
Singular part, 446
Riemann integral , 2 , 25 , 202 , 224 , 1 7 . 6 , 25 . 1 4
Skorohod embedding , 339 , 545
Riemann-Lebesgue lemma, 354
Skorohod 's theorem , 342 , 392 , 399
Right continuity , 1 75 , 26 1
Source , 6. 1 4 , 8 . 30
Rigid transformation , 1 73
Southwest , 1 76
Ring , 1 1 . 5
Space , A I , 1 6
Royden, I . 1 3
Square-free integers , 4 . 2 1
Rudin, 229 , 26 . 1 9 , 42 1 , 435
σ-ring, 11.5
Stable law, 389, 28.15
Standard normal distribution, 263
Saks , 6 1 0
State space , I 08
Saltus , 1 89 , 309
Stationary distribution, 1 2 1
Sample function , 5 1 8
Stationary increments , 523 , 37. 3
Sample path , 320 , 5 1 8
Stationary probabilities, 1 2 1
Sample point, 1 6
Stationary process, 36. 2
Sampling theory , 4 10
Stationary sequence of random variables ,
Samuels , 34 . 1 9
375
Scheffe's theorem , 2 1 8 , 1 6 . 1 5
Stationary transition probabilities , 1 08
Schwarz' s inequality , 75 , 5 . 6, 283
Statistical inference , 9 1 , 1 48
Second Borel-Cantelli lemma , 55 , 4 . 1 4 , 83
Steiner symmetrization, 25 1
Second category , A 1 5
Stieltjes integral , 230
Second-order Markov chain , 8 . 3 1
Stirling 's formula, 5 . 1 7 , 1 8 . 2 1 , 27 . 1 8
Secretary problem, 111
Section, 235
Selection problem, 110, 135
Stochastic arithmetic, 2.15, 4.21, 5.16, 5.17, 6.17, 18.18, 25.15, 412, 30.10, 30.11, 30.12
Selection system, 91, 93, 7.3
Stochastic matrix, 109
Semicontinuous, 29.1
Stochastic process, 309, 319, 506
Semiring, 164
Separable function, 552
Separable process, 553, 558
Separable σ-field, 2.10
Separant, 553
Stopping time, 96, 130, 486, 534, 536, 539
Strong law of large numbers, 9, 11, 58, 80, 6.8, 127, 290, 27.20
Strong Markov property, 535, 537, 37.10
Strong semiring, 11.6
622
INDEX
Subadditivity:
  countable, 23, 159
  finite, 22, 159
Subfair game, 88, 99
Uniform distribution modulo 1, 337, 362
Uniform integrability, 219, 21.21, 347
Uniformly asymptotically negligible array, 374
Uniformly bounded, 72, 6.8, 97, 504
Subfield , 52, 260
Uniformly eq uicontinuous, 26 . 1 5
Submartingale, 484, 35.6
Subset, A1
Sub-σ-field, 52, 260
Subsolution, 8.6
Substochastic matrix, 118
Uniqueness of extension, 36, 38, 160
Uniqueness theorem for characteristic functions, 355, 26.19, 396
Uniqueness theorem for moment generating functions, 143, 294, 26.7, 408, 30.3
Sufficient σ-field, 471, 34.20
Unitary transformation, 173
Sufficient statistic, 472
Unit interval, 45, 68, 271, 515, 37.1
Superharmonic function , 1 3 1
Unit mass , 22 , 1 94
Supermartingale , 485
Univalent, 252
Support line , A33
Upcrossing , 488
Support of measure, 21, 158, 12.9, 261, 430, 442
Upper integral, 207
Upper semicontinuous, 13.11, 29.1
Surface area, 247, 19.8, 19.13
Upper variation, 442
Symmetric difference, A1
Utility function, 7.10
Symmetric random walk, 109, 115, 135, 35.10
Symmetric stable law, 28.15
System, 108
Vague convergence, 382, 28.1, 28.2
Value of payoff function, 131
van der Waerden, 31.18
Tail event, 57, 295, 304
Tail σ-field, 57, 295
Taylor series, A29, 301
Thin cylinders, 36.6
Three series theorem, 299
Tightness, 346, 392, 29.3
Timid play, 105
Tippett, 199
Tonelli's theorem, 238
Total variation, 442
Trace, 19.3
Traffic intensity, 325
Trajectory of process, 518
Transformation of measures, 186
Transient, 114, 117
Transition probabilities, 108, 111
Translation invariance, 172
Triangular array, 368
Triangular distribution, 358
Trifling set, 1.4, 1.12, 3.15
Type, 195
Variance, 73, 282, 285
Variation, 435, 442
Version of conditional expected value, 466
Version of conditional probability, 449, 451
Vieta's formula, 1.8
Vitali covering theorem, 250
Waiting time, 323
Wald's equation, 22.9
Wallis's formula, 18.20
Weak convergence, 192, 335, 390
Weak law of large numbers, 5, 11, 58, 81, 292
Weierstrass approximation theorem, 82, 26.19
Weierstrass M-test, A28, 213
Wiener, 523
Wiener process, 522
With probability 1, 54
Wright, 1.10, 413
Unbiased estimate, 475
Uncorrelated random variables, 21.15
Uniform distribution, 263, 358, 27.21, 29.9
Yushkevich, 110
Zeroes of Brownian motion, 534, 37.13
Zero-one law, 57, 294, 304, 22.13, 37.4
Zorn's lemma, A5