This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
0, am � 0 sufficiently fast that
( 14.6)
m=l The term size has been coined to describe the rate of convergence of the mixing numbers, although different definitions have been used by different authors, and the terminology should be used with caution. One possibility is to say that the sequence is of size -
(Sn),<Jn } is a submartingale if { Sn,<Jn } a submartingale. Proof For the martingale case, 1 , form an Lp-mixingale of size - <po . According to the Minkowski and conditional modulus inequalities, �o. so that (p3 -P4)� - 1, which in (p3 4) � P turn implies (20.81). This completes the proof. Noting that 1 � �0 � 2, the condition in (20. 7 3) may be compared with (20. 5 9). Put <po i and r = 2 and we obtain �0 = �. whereas with <po = 1, we get �o = 2(2r - 1 )/(3 r - 1 ) which does not exceed r in the relevant range, taking values between 1 when r = 1 and � when r = 2. Square-summability of ct lat is sufficient only in the limit as both <po and r oo Thus, this theorem does not contain 20.16. On the other hand, in the cases where { Ct } is uniformly bounded and at = t, we need only �0 > 1, so that any r > 1 and
21 1
Mixing
§'';'+m cr(Yt+m.Yr+m+ b ·· .). Since Y1 is measur able on any cr-field on which each of X1,X1_ 1 , ... ,X1-'t are measurable, §'_:"" � rg;_:"" and §'';'+m c '!f';'+m- 't· Let ar,m = sup1 (§'_:oo,§'7+m) and it follows that ar,m � Um - 't for m � 't. With 't finite, am -1: = O(m-q>) if am = O(m - q>) and the conclusion follows. The same argument follows word for word with '<)> ' replacing 'a ' . • Proof Let §'.:== cr( ... , Y1- J,f1), and
=
1 4 . 2 Mixing Inequalities
Strong and uniform mixing are restrictions on the complete joint distribution of the sequence, and to make practical use of the concepts we must know what they imply about particular measures of dependence. This section establishes a set of fundamental moment inequalities for mixing processes. The main results bound the m-step-ahead predictions, E(Xt+m I '!f .: ""). Mixing implies that, as we try to forecast the future path of a sequence from knowledge of its history to date, looking further and further forward, we will eventually be unable to improve on the predictor based solely on the distribution of the sequence as a whole, E(Xt+m). The r.v. E(Xt+m l rg;_:"") - E(Xt+m) is tending to zero as m increases. We prove convergence of the Lp-norm. 14.2 Theorem (lbragimov 1962) For r � p � 1 and with am defined in (14. 1),
I I E(Xt+m I rg;_:oo) - E(Xr+m) li P � 2(2 1 1P + 1)a�lp - l lr 11 Xr+m ll r·
(14.7)
Proof To simplify notation, substitute X for Xt+m• §' for '!! .:=, :1-e for '!f7+m • and
a
for am . It will be understood that X is an Jf-measurable random variable where §', :1-e � 'lf . The proof is in two stages, first to establish the result for I X I � Mx < oo a.s., and then to extend it to the case where X is £,-bounded for finite r. Define the §'-measurable r.v. 1 , E(XI §') � E(X), 11 = sgn(E(XI §') - E(X)) = (14.8) - 1 , otherwise. Using 10.8 and 10.10,
{
El E(XI §') - E(X) I
=
E[11 (E(XI §') - E(X))]
= E[(E(11Xl §') - 11E(X)] =
Cov(11,X)
=
I Cov(11,X) 1 .
(14.9)
Let Y be any §'-measurable r.v., such as 11 for example. Noting that � sgn(E(YI Jf) - E(Y)) is :1-e-measurable, similar arguments give I Cov(X, Y) I = I E(X(E(YI H) - E(Y))) I
� E( I XI I CEC YI H) - E(Y)) I ) � MxE I E(YI H) - E(Y) I
=
Theory of Stochastic Processes
L. l L.
(14. 10)
� Mx i Cov(I;, Y) I ,
where the first inequality is the modulus inequality. 1; and 11 are simple random variables taking only two distinct values each, so define the sets A + = { 11 = 1 } , A - = {11 = -1 }, B+ = {1; = 1 } , and B - = {1; = - 1 } . Putting (14.9) and (14. 10) together gives
E I E(XI §') - E(X) I � Mx 1 Cov(11 1;) 1 = Mxi E(s11 ) - E(1;)E(11) 1 = Mx l [P(A+ n B+) + P(A - n B- ) - P(A + n B -) - P(A - n B+)] - [P(A +)P(B+) + P(A - )P(B - ) - P(A +)P(B - ) - P(A - )P(B+)] I � 4Mxa. (14. 1 1) ,
I E(X) I � 2Mx, it follows that, for p 2 1, (14. 12) II ECX I §') - E(X) Iip � 2Mx(2a) 1 1P. This completes the first part of the proof. The next step is to let X be L,-bounded. Choose a finite positive Mx, and define X1 = 1 I I X !:<> Mx}X and X2 = X - X1 . By the Minkowski inequality and ( 14 . 1 1 ) Since I E(X I §') - E(X) I � I E(XI §') I
+
,
I! E(XI §') - E(X) II p � II E(Xd §') - E(X, ) II p + II (Xz l §') - E(Xz) ll p � 2Mx(2a) 1 1P + 2 11 Xz ll v•
(14. 1 3)
and the problem is to bound the second right-hand-side member. But
I! Xz llv
�
M}- riP II X II�1P
( 14. 14)
for r 2 p, so we arrive at
I ! E(XI §') - E(X) Ilv � 2Mx(2a) 1 1P + 2M_k- rlp iiX II�1P. Finally, choosing Mx = II X !I ,a- 11' and simplifying yields II ECXI §') - E(X) II p � 2(2 1 1P + 1)allp -l !r iiX I I,,
which is the required result.
( 14. 15)
•
There is an easy corollary bounding the autocovariances of the sequence. >
1 and r 2 pl(p - 1 ), I Cov(Xr.Xt+m) l � 2(2 1 - 1 1P + 1 )a� - 1 /p - l !rii Xr ll p i! Xt+mll r·
14.3 Corollary For p
(14. 16)
Proof
I Cov(Xr.Xt+m) I = I E(XrXt+m) - E(X,)E(Xt+m) I = j E[X1(E(Xt+m l �,) - E(Xt+m))] j � I! Xr llpi i E(Xr+m l ��) - E(Xr+m) ll p!(p 1 ) p /p 1 1', 1 / 1 -1 � 2(2 - + l) I! Xr llp ll Xr+m ll r<X�
(14. 17)
Mixing
213
where the second equality is by 10.8 and 10.10, noting that Xr is :1ft-measurable, the first inequality is the Holder inequality, and the second inequality is by 14.2 . • 14.4 Theorem (Serfling 1968: th. 2.2) For r ;::: p
;:::
1,
(14. 18) II E(Xt+m l :1f�oo) - E(Xr+m) ii p :::; 2<j>�- 1 1'11 Xt+m l l r · where
:::; {� l xd I P(Ad :1f�oo) - P(Ai) l r (� I Xj I I P(Ai I :1f:oo) - P(Ai) 1 11' I P(Ai I :1f�oo) - P(Ai) 1 1/q) r q $ (� I x;l ' I P(A t l P(A;) I r P(A;) I ) (� I P(A, I =
:1' '�) -
:1' '�) -
(14. 19)
The second inequality here is by 9.25. The sets A i partition Q, and P(A i U A i' I :1f �"") = P(A i I :1f :"") + P(A i' I :1f �oo) a.s. and P(A i U A i') = P(A i) + P(A i') for i =I= i'. Letting Aj denote the union of all those A i for which P(A d :1f :"") - P(A i) ;::: 0, and A j the complement of A j on n,
L I P(A il :1f �oo) - P(A) I i
=
I P(Aj l :y;:oo) - P(A 1) I + I P(A f I :1f�oo) - P(A j) 1 . (14.20)
By 13.22, the inequalities
I P(AJ I :1f �oo) - P(A1) I :::; <J>m I P(A l I :1f �oo) - P(A j) I :::; <J>m hold with probability 1 . Substituting into (14. 19) gives I E(Xt+m l :1f�oo) - E(Xt+m) l r ;::: [E( I Xt+m l r l :y;�oo) + EI Xt+m l r] (2<J>m)rlq, a.s. (14.21) Taking expectations and using the law of iterated expectations then gives
EI E(Xt+m l :1f�oo) - E(Xt+m) l r :::; 2EI Xt+m l r(2<J>m) r/q' and, raising both sides to the power 1/r,
(14.22)
Theory of Stochastic Processes
214
(14.23) Inequality (14. 1 8) follows by Liapunov's inequality. The result extends from simple to general r.v.s using the construction of 3.28. For any r. v. Xr+m there exists a monotone sequence of simple r. v .s { X(k)t+m• k E IN } such that I X(k)t+m(ffi) - Xr+m(ro) l -7 0 as k � =, for all ro E .Q. This conver gence transfers a.s. to the sequences { E(�k)t+m I r:f _:oo) - E(X(k)r+m), k E IN } by 10.15. Then, assuming Xr+m is L,-bounded, the inequality in (14.22) holds as k � oo by the dominated convergence theorem applied to each side, with I Xr+m l ' as the dominating function, thanks to 10.13(ii). This completes the proof. • The counterpart of 14.3 is obtained similarly. 14.5 Corollary For r � 1, where, if r
=
I Cov(Xr+m.Xr) l S 2<j>�1' 11 Xr ll r i! Xr+mll rl(r- I ) • 1, replace I! Xr+m ll rl(r- 1 ) by I ! Xr+m ll oo = ess sup Xt+m·
(14.24)
Proof
I Cov(Xr+m .Xr) I S I !Xr ll r li E(Xt+m I r:f _:oo) - E(Xt+m) II rl(r- I )
S 2<j>�1' 11 Xr ll r 11 Xt+m ll rl(r- 1) •
(14.25)
where the first inequality corresponds to the one in (14. 17), and the second one is by 14.4. • These results tell us a good deal about the behaviour of mixing sequences. A fundamental property is mean reversion. The mean deviation sequence { X1 - E(X1) } must change sign frequently when the rate of mixing is high. If the sequence exhibits persistent behaviour with X1 - E(X1) tending to have the same sign for a large number of successive periods, then I E(Xt+m I r:f .:. ) - E(Xr+m) I would likewise tend to be large for large m. If this quantity is small the sign of the mean deviation m periods hence is unpredictable, indicating that it changes frequently. But while mixing implies mean reversion, mean reversion need not imply mixing. Theorems 14.2 and 14.4 isolate the properties of greatest importance, but not the only ones. A sequence having the property that I I Var(Xr m l rg, _:"") - Var(Xr+m) ll p > 0 is called conditionally heteroscedastic. Mixing also requires this sequence of norms to converge as m � =, and similarly for other integrable functions of Xt+m· Comparison of 14.2 and 14.4 also shows that being able to assert uniform mixing can give us considerably greater flexibility in applications with respect to the existence of moments. In (14. 1 8), the rate of convergence of the left-hand side to zero with m does not depend upon p, and in particular, El E(Xr+m I r:f .:oo) - E(Xr+m) I converges whenever IIXr+mll 1+o exists for o > 0, a condition infinitesimally stronger than uniform integrability. In the corresponding inequality for am in 14.2, p < r is required for the restriction to 'bite' . Likewise, 14.5 for the case p = 2 yields ""
+
Mixing
215 (14.26)
but to be useful 14.3 requires that either Xr or Xt+m be �+0-bounded, for 8 > 0. Mere existence of the variances will not suffice. 1 4 . 3 Mixing in Linear Processes
A type of stochastic sequence { Xr } ':"" which arises very frequently in econometric modelling applications has the representation
q (14.27) Xr =L 8jZr-j• 0 ::;; q ::;; oo, j=O where {Zr} �: (called the innovations or shocks) is an independent stochastic sequence, and { ej }J=O is a sequence of fixed coefficients. Assume without loss of generality that the Z1 have zero means and that 80 = 1 . ( 14.27) is called a moving average process of order q (MA(q)). Into this class fall the finite-order auto
regressive and autoregressive-moving average CARMA) processes commonly used to model economic time series. We would clearly like to know when such sequences are mixing, by reference to the properties of the innovations and of the sequence { 8j } . Several authors have investigated this question, including Ibragimov and Linnik (1971 ), Chanda (197 4), Gorodetskii (1977), Withers (1981 a), Pham and Tran ( 1985), and Athreya and Pantula (1986a, 1986b). Mixing is an asymptotic property, and when q < oo the sequence is mixing infinitely fast. This case is called q-dependence. The difficulties arise with the cases with q = oo. Formally, we should think of the MA(oo) as the weak limit of a sequence of MA(q) processes; the characteristic function of Xr has the form
q (14.28) <J>qrO�) = TI
sary that the sequence be square-summable. Note that the solutions of finite order ARMA processes are characterized by the approach of l 8j I to 0 at an exp onential rate, beyond a finite point in the sequence. If { Zr} is i.i.d. with mean 0 and variance d-, Xr is stationary and has spectral density function fCA.) =
oo
� j "'j:.. ejiN j 2. j=O
( 14.29)
The theorem of lbragimov and Linnik cited in § 14. 1 yields the condition I}=0 I ej I < as sufficient for strong-mixing in the Gaussian case. However, another standard result (see Doob 1 953: ch. X.8, or lbragimov and Linnik 1 97 1 : ch. 1 6.7)
Theory of Stochastic Processes
216
states that every wide-sense stationary sequence admitting a spectral density has a (doubly-infinite) moving average representation with orthogonal increments and square summable coefficients. But allowing more general distributions for the innovations yields surprising results. Contrary to what might be supposed, having the ei tend to zero even at an exponential rate is not sufficient by itself for strong mixing. Here is a simple illustration. Recall that the first-order autoregressive process X1 = pX1_ 1 + Z1, I p i < 1, has the MA(oo) form with ei = p i, j 0, 1 ,2, .. =
.
14.6 Example Let { Z1} 0 be an independent sequence of Bernoulli r. v .s, with P(Z1 = 1) = P(Z1 = 0) = !· Let Xo = Zo and
Xt It
=
�t- 1 + Z1
=
t L 2 -izt-J• t = 1 ,2,3, ...
(14.30)
l=O
is not difficult to see that the term
t 'L. Tizt -J j=O
=
t 2 -1L 2kzk
(14.3 1)
k=O
belongs for each t to the set of dyadic rationals W1 = {k/21 , k = 0,1 ,2, ... ,2t+ 1 - 1 } Each element of W1 corresponds to one of the 21+ 1 possible drawings {Zo, ... ,Z1}, and has equal probability of T1- 1 . Iff Zo 0, .
Xt E Bt whereas iff Zo
=
=
{k/21, k
=
=
0,2,4, ... ,2(2 1 - 1) } ,
1, 1 xt E Wt - Br = {k/2 , k = 1,3,5, . . .,21+ 1 - 1 } . It follows that {Xo = 1 } n {X1 E B1} = 0 , for every finite t. But it i s clear that P(X1 E B1) = P(Xo = 0) = !· Hence for every finite m, CXm :2:
I P( { Xo
which contradicts
CXm
=
1 } n {Xm
-7 0.
E
Bm }) - P(Xo
=
l)P(Xm
E
Bm) I
(14.32)
o
Since the process starts at t = 0 in this case it is not stationary, but the ex ample is easily generalized to a wider class of processes, as follows. 14.7 Theorem (Andrews 1984) Let {Z1}':.'= be an independent sequence of Bern oulli r.v.s, taking values 1 and 0 with fixed probabilities p and 1 - p. If X1 = pX1- 1 + Z1 for p E (O,!], {X1}':'oo is not strong mixing. o
Note, the condition on p is purely to expedite the argument. The theorem surely holds for other values of p, although this cannot be proved by the present approach. Proof Write
Xt+s
=
psX1 + X1,s where
Mixing xt,s
=
217
s- 1 L pjZt+s-j · j=O
(14.33)
The support of X1,s is finite for finite p, having at most 25 distinct members. Call this set W5 , so that W1 = (0, 1) , W2 = (0, 1, p, 1 + p), and so on. In general, Ws+ l is obtained from Ws by adding p 5 to each of its elements and forming the union of these elements with those of W5 ; formally,
(14.34) For given s denote the distinct elements of Ws by Wj, ordered by magnitude with w1 5 < . . . < WJ, for J � 2 • Now suppose that X1 E (O,p), so that p5X1 E (O,p 5+1 ). This means that Xt+s assumes a value between Wj and Wj + ps+ 1 , for some }. Defining events A = { X1 E (O,p)} and Bs = { Xt+s E Uf= l (wj, w + r s+l ) } , we have P(Bs i A) = 1 for any s, however large. To see that P(A) > 0, consider the case Z1 = Zt-1 = Zt-2 = 0 and Z1-3 = 1 and note that 3 (14.35) L PjZt-j � L Pj = 1 � p < p j=3 j=3 for p E (0,1J. So, unless P(Bs) = 1, strong mixing is contradicted. The proof is completed by showing that the set D = {X1 E [p, 1 ] } has positive probability, and is disjoint with Bs . D occurs when Z1 = 0 and Zt- l = 1 , since then, for p E (0,1], =
P
=
� pj - _P_ pjzt-1 <- L 1 -p j= l j= l
< � - L
.
-
and hence P(D) > 0. Suppose that min { Wj+ 1 - Wj } Then, if D occurs,
j <': l
�
<
-
ps- I .
1
'
(14.36) (14.37)
W1· + P s+l -< w·1 + p sXt < w1· + p s- I < Wj+ I ' (14.38) hence, Xr+s = Wj + psX1 E Uf= l (wj, Wj + p s+ l ), or in other words, Bs n D = 0. The assertion in (14.37) is certainly true when s = 1, so consider the following inductive argument. Suppose the distance between two points in Ws is at least p s- I . Then by (14.34), the smallest distance between two points of Ws+ l cannot be less than the smaller of ps and p s- I - ps. But when p E (0,�), ps � 1Ps- l , which implies p s- I - p 5 � p s. It follows that (14.37) holds for every s. • These results may appear surprising when one thinks of the rate at which p s -
approaches 0 with s; but if so, this is because we are unconsciously thinking about the problem of predicting gross features of the distribution of Xt+s from time t, things like P(Xt+s � x i A) for fixed x, for example. The notable feature of the sets Bs is their irrelevance to such concerns, at least for large s. What we
Theory of Stochastic Processes
218
have shown is that from a practical viewpoint the mixing concept has some undesi able features. The requirement of a decline of dependence is imposed over c events, whereas in practice it might serve our purposes adequately to tolera certain uninteresting events, such as the Bs defined above, remaining dependent c the initial conditions even at long range. In the next section we will derive some sufficient conditions for strong mixint and it turns out that certain smoothness conditions on the marginal distribution of the increments will be enough to rule out this kind of counter-example. But no\ consider uniform mixing. 14.8 Example 13 Consider an AR(1 ) process with i.i.d. increments,
Xr = pXr- 1 + Zr. 0 < p < 1 ,
i n which the marginal distribution of Z1 has unbounded support. We show that { X1] is not uniform mixing. For 8 > 0 choose a positive constant M to satisfy
(14.39 ) Then consider the events
A = {Xo � p -m (L + M) } E � �oo B = {Xm :S: L} e ��00, where L is large enough that P(B) � 1 - 8. We show P(A) > 0 for every m. Let PK = P(Zo < K), for any constant K. Since Z0 has unbounded support, either PK < 1 for every K > 0 or, at worst, this holds after substituting { -Z1} for {Z1 } and hence { -X1 } for {X1 } . Pk < 1 for all K implies, by stationarity, P(X- 1 < 0) = P(Xo < 0) < 1 . Since { X0 < K} c {Zo < K} u ({Zo � K} n {X- 1 < 0 } ) , independence of the { Z1 } implies that
P(Xo < K) :S: PK+ (1 - PK)P(X- t < 0) < 1 . So P(A) > 0, since K is arbitrary. Since Xm = pmXo + L.}:A p1Z
that
P(B iA)
�
(
P pmXo +
�
m
l
)
p jZ, -j S L p mXo ;, L + M <
by ( 14.39). Hence m � I P(B I A) - P(B) I means m = 1 for every m. o
>
6
(14.40) -J •
it is clear
(14.4 1 )
1 - 28, and since 8 i s arbitrary, this
Processes with Gaussian increments fall into the category covered by this example, and if -mixing fails in the first-order AR case it is pretty clear that counter examples exist for more general MA(oo) cases too. The conditions for uniform mix ing in linear processes are evidently extremely tough, perhaps too tough for this mixing condition to be very useful. In the applications to be studied in later chapters, most of the results are found to hold i n some form for strong mixing processes, but the ability to assert uniform mixing usually allows a relaxation of
Mixing
219
conditions elsewhere i n the problem, s o i t i s still desirable to develop the parallel results for the uniform case. The strong restrictions needed to ensure processes are mixing, which these exam ples point to (to be explored further in the next section), threaten to limit the usefulness of the mixing concept. However, technical infringements like the ones demonstrated are often innocuous in practice. Only certain aspects of mixing, encapsulated in the concept of a mixingale, are required for many important limit results to hold. These are shared with so-called near-epoch dependentfunctions of mixing sequences, which include cases like 14.7. The theory of these dependence concepts is treated in Chapters 16 and 17. While Chapter 15 contains some neces sary background material for those chapters, the interested reader might choose to skip ahead at this point to find out how, in essence, the difficulty will be resolved. 14.4 Sufficient Conditions for Strong and Uniform Mixing
The problems in the counter-examples above are with the form ofthe marginal shock distributions - discrete or unbounded, as the case may be. For strong mixing, a degree of smoothness of the distributions appears necessary in addition to summa bility conditions on the coefficients of linear processes. Several sufficient conditions have been derived, both for general MA(oo) processes and for auto regressive and ARMA processes. The sufficiency result for strong mixing proved below is based on the theorems of Chanda (1974) and Gorodetskii (1977). These conditions are not the weakest possible in all circumstances, but they have the virtues of generality and comparative ease of verification. 14.9 Theorem Let X1 = LJ==OejZt-j define a random sequence {X1}:oo , where, for either 0 < r :::; 2 or r an even positive integer, (a) Z1 is unifor-mly Lr-bounded, independent, continuous with p.d.f. fz1, and s �p
J:] fdzt + a)
-
fziz) J dz
:::;
Mi a I , M <
oo,
(14.42)
whenever I a I :::; 3, for some () > 0; (b) "L7==0Gt(r) 11( l +r) < oo , where
( 00 )r/2 2r-l � j==t
j
e]
r :::; 2, (14.43) '
r � 2;
(c) 8(x) = "L}== 1 8_;.x -:;:. 0 for all complex numbers x with l xl Then {Xd is strong mixing with <Xm = O("L7==m+ ! GtCr) 1 10 +r)). o
:::;
1.
Before proceeding to the proof, we must discuss the implications of these three conditions in a bit more detail. Condition 14.9(a) may be relaxed somewhat, as we
Theory of Stochastic Processes
220
show below, but we begin with this case for simplicity. The following lemma extends the condition to the joint distributions under independence. 14.10 Lemma Inequality
(14.42) implies that for ! a, !
:::;;
�.
t
=
1 , ... ,k, ( 14.44)
Proof Using Fubini's theorem,
J I Ok fzlzr + ar) - ilk fzlzr) I dzt · · ·dZk IRk
t=l
t=l
:::;; M l a 1 l +
J I r-rlfzlzr + ar) - r-fi2tz,(zr) I dz2...dzk· 2 IRk-!
The lemma follows on applying the same inequality to the second term on the right, iteratively for t = 2, . . ,k. • .
Condition 14.9(b) is satisfied when I ej I « F11 for Jl > 1 + 2/r when r 5 2 and ll > 3/2 + l!r when r � 2. The double definition of Gr(r) is motivated by the fact that for cases with r s 2 we use the von Bahr-Esseen inequality (11.15) to bound a certain sequence in the proof, whereas with r > 2 we mly on Lemma 14. 1 1 below. Since the latter result requires r to be an even integer, the conditions in the theorem are to be applied in practice by taking r as the nearest even integer below the highest existing absolute moment. Gorodetskii (1977) achieves a further weakening of these summability conditions for r > 2 by the use of an inequality due to Nagaev and Fuk ( 1 97 1). We will forgo this extension, both because proof of
the Nagaev-Fuk inequalities represents a rather complicated detour, and because the present version of the theorem permits a generalization (Corollary 14.13) which would otherwise be awkward to implement. Define Wr = I}:6ejZt-j and Vr = I,j=tejZt-j• so that Xt = Wt + V,, and Wr and Vt are independent. Think of Vt as the � � 00-measurable 'tail ' of Xr, whose contribu tion to the sum should become negligible as t -7 oo .
14.11 Lemma If the sequence { Zs } is independent with zero mean, then
E(Vi"') S z2m-I (t, sf:�� E(Zl'") for each positive integer m such that sup
s �o
E(Z�m)
<
(14.45) oo
.
Mixing
221
Proof First consider the case where the r.v.s Zr-j are symmetrically distributed,
meaning that -Zr-j and Zr-j have the same distributions. In this case all existing odd-order integer moments about 0 are zero, and +k +k t+k L j t-j = . . .
t t 2 m E ( e Z ) � L eh .. . ehmE(Zr-jj . . Zt-hm) 11 =t 12m=t
1=t
t+k
t+k
h=t
jm=t
L · · · L 8]1 eJmE(Z7-h- . .z7 ) ,; (� aJr:�� E(z;m) ( 14 . 46 ) The second equality holds since E(Zr-h . . Zt-hm) vanishes unless the factors form matching pairs, and the inequality follows since, for any r.v. Y possessing the j j requisite moments, E(Y +k) � E(Y )E(Y k) (i.e., Cov(Y< Y k) � 0) for j,k > 0. The result for symmetrically distributed Zs follows on letting k oo For general Z5, let z; be distributed identically as, and independent of, Z5, for each s s 0. Then v� LJ te z� - is independent of Vr, and Vr-v� has symmet rically distributed independent increments Zr-j- Z�-j- Hence E(V?m) s E(Vr-V�)2m s (t eJ) msup E(Zr-j-Z�-j)2m =
-jm
•••
---7
=
= j
.
j
1=t
1
(14.47) where the first inequality is by 10.19, the second by (14.45), and the third is the C r inequality. •
Lastly, consider condition 14.9(c). This is designed to pin down the properties of the inverse transformation, taking us from the coordinates of to those of It ensures that the function of a complex variable S(x) possesses an ana lytic 1 4 inverse (x) = for l x l s 1 . The particular property needed and implied by the condition is that the coefficient sequence { 'tj is absolutely summable. If = under 14.9(c) the inverse representation is also defined, as = Note that 'to = 1 if = 1 . An effect of 14.9( c) is to rule out 'over-differenced' cases, as for example where S(x) = (x)( l - x) with (.) a summable polynomial. The differencing transformation does not yield a mixing process in general, the exception being where it reverses the previous integration of a mixing process. For a finite number of terms the transformation is conveniently expressed using matrix notation. Let
{Zr} .
81
't 2-}=o't� Zr Xr 2.}2-}= =O'tjXr0ej-jZ· t-j•
So
{Xr} }
81
222
Theory of Stochastic Processes 1 0
1
e1
e1
An =
1
(n x n),
(14.48)
Sn- 2 Sn- 1
Sn-2
so that the equations x1 = I.j:be1zt-J• t = 1, ... ,n can be written x = AnZ where x = (x1 , ,xn )' and z = (z1 , ,zn)'. A � 1 is also lower triangular, with elements 'tj replacing ej for j = O, . . ,n - 1 . If v = (Vt . · · ··Vn)' the vector v = A�1v has elements I.}:d'tJVt-J• for t = l , . . . ,n. These operations can in principle be taken to the limit as n --7 =, subject to 14.9(c). .••
• • . .
Proof of 14.9 Without loss of generality, the object is to show that the a-fields
��= = cr( . . . .X- J ,Xo) and �;;;+ ! = cr(Xm+I .Xm+2•· · · ·) are independent as m --7 = The result does not depend on the choice of origin for the indices. This is shown for a sequence { X1} 7��-p for finite p and k, and since k and p are arbitrary, it then follows by the consistency theorem (12.4) that there exists a sequence {X1 }'::' = whose finite-dimensional distributions possess the property for every k and p. This sequence is strong mixing on the definition. Define a p + m + k-vector X = (X0,Xi,X2)' where Xo = (XI -p·· · ·,X0)' (p x 1 ), X1 = (Xt , . . . ,Xm)' (m X 1 ), and X2 = (Xm+J , . . . ,Xm+d (k x 1), and also vectors W = (Wi , W2)' and V = (Vi, V2)' such that X1 = W1 + VI and X2 = W2 + Vz. (The elements of W and V are defined above 14.11.) The vectors Xo and V are independent of W. Now, use the notation �; = a(Xs, . . . ,X1) and define the following sets: G = { ro: Xo(ro) E C} E �?-p, for some C E <.BP, .
H = { ro: X2Cro) E D } E ��!1. for some D E <.Bk, E
�� oo, where B = { v2 : I v2 l :::; 1l } E <.Bk, I v2 l denotes the vector whose elements are the absolute values of v2, and 1l = ('Tlm+J, . . . ,TJm+ )' is a vector of positive constants. k Also define E = { ro: Vz(ffi) E B}
D - v2 = { w2: w2 + v2 E D } E <.Bk.
H may be thought of as the random event that has occurred when, first V2 = v 2 is realized, and then W2 E D - v2. By independence, the joint c.d.f. of the variables (Wz, Vz,Xo) factorizes as F = Fw2Fv2 ,x0 (say) and we can write
P(H)
= P(X2 E D) =
J J kJ IRP
1R
D-vz
dF(w 2, v2 , xo) (14.49)
Mixing
223
where
(14.50) These definitions set the scene for the main business of the proof, which is to show that events G and H are tending to independence as rn becomes large. Given m �/;Ep+ +k_measurability of X, this is sufficient for the result, since C and D are arbitrary. By the same reasoning that gave (14.49), we have
fcf8x(v2)dFv2,x0(v2,xo). sup 2 X( 2) and X* inf 2 E X( 2), and (14.51) implies P(G n H n E)
Define X*
=
v es
=
=
V
v
s
(14.51)
V
X * P(G n E) � P(G n H n E) � X* P(G n E).
(14.52)
Hence we have the bounds
P(G n H)
=
P(G n H n E) + P(G n H n Ee) � X* P(G) + P(Ee),
(14.53)
and similarly, since X * � 1,
P(G n H) � X* P(G n E) + P(G n Hn Ee) = X* P(G) - X * P(G n Ee) + P(G n H n Ee) � X* P(G) - P(Ee). Choosing G
=
(14.54)
.Q (i.e., C = [RP) in (14.53) and (14.54) gives in particular X* - P(Ee) � P(H) � X * + P(Ee), (14.55)
and combining all these inequalities yields I P(G n H) - P( G)P(H) I
� X* - X* + 2P(Ee).
(14.56) Write W Am+kz, where Z = (Z1 , ... ,Zm+k)' and Am+k is defined by (14.48). Since I A m+k l = 1 and the {Z1 , ... ,Zm+k } are independent, the change of variable formula =
from 8.18 yields the result that W is continuously distributed with m+k
(14.57) fw(w) = fz(z) fl !Zt(z,). m Define B' { v : v 1 0, v2 B } B +k. Then the following relations hold: X* - X* � 2 sup I X(v2)- X(O) I V2 EB � 2 sup J I fw2(w2 + v2)-fw2(w2) I dw2 V2 EB =
=
=
E
D
E
t=l
Theory of Stochastic Processes
224
= {f { 2 su�
I
n fz/Zt + Vr) - n fdzr)
IR m+k t:= !
VEB
m+k
m+k
m+k
}
t:=l
I
dz
}
� 2M sup L l vrl , v E B'
t:=m+l
( 14.58)
where it is understood in the final inequality (which is by 14.10) that I vr I � 8 where 8 is defined in condition 14.9(a). The third equality substitutes v = A;;;!kv and uses the fact that v1 = 0 if v1 = 0 by lower triangularity of A m+k· For v E B', note that m+k
L l vr l
t=m+l
= .L
m+k
t-m-1
t:=m+!
j=O
L
'tjVt-j
(14.59) assuming Tt has been chosen with elements small enough that the terms in parenthe ses in the penultimate member do not exceed 8. This is possible by condition 14.9(c). For the final step, choose r to be the largest order of absolute moment if this is does not exceed 2, and the largest even integer moment, otherwise. Then
m+k
m+k
t:=m+!
t:=m+!
� L P( l Vrl > 1l r) � L El Vrl r1l � r,
(14.60)
by the Markov inequality, and
(14.61) El Vrl r � sup E I Zs l rG,(r), s where Gr(r) is given by (14.43), applying 11.15 for r � 2 (see (1 1 .65) for the required extension) and Lemma 14.11 for r > 2. Substituting inequalities (14.58), ( 14.59), (14.60), and (14.61 ) into (14.56) yields j P(G n H) - P(G)P(H) I
m+k
«
L (1lr + Gr(r)1l�r).
t:=m+!
(14.62)
Mixing
225
Since Gr(r) � 0 by 14.9(b), it is possible to choose m large enough that (14.59) and hence (14.62) hold with Tl t = Gr(r)1 1(l +r) = Gr(r)Tl? for each t > m. We obtain I P(G n H) - P(G)P(H) I
«
::=:;
m+k
L
t=m+ l 00
L
t=m+ l
Gr(r)ll(l +r) Gr(r) ll(l +r) ,
(14.63)
where the right-hand sum is finite by 14.9(b), and goes to zero as m � oo . This completes the proof. • It is worth examining this argument with care to see how violation of the condi tions can lead to trouble. According to (14.56), mixing will follow from two conditions: the obvious one is that the tail component V2, the � �""-measurable part of X2 , becomes negligible, such that is, P(E) gets close to 1 when m is large, even when 1\ is allowed to approach 0. But in addition, to have X* - X * disappear, P( W2 E D - v 2) must approach a unique limit as v2 � 0, for any D, and whatever the path of convergence. When the distribution has atoms, it is easy to devise examples where this requirement fails. In 14.6, the set B1 becomes W1 - B1 on 1 being translated a distance of 2 - • For such a case these probabilities evidently do not converge, in the limiting case as t � oo. However, this is a sufficiency result, and it remains unclear just how much more than the absence of atoms is strictly necessary. Consider an example where the distribution is continuous, having differentiable p.d.f., but condition ( 14.42) none the less fails.
14.12 Example Let f(z) = C0 z-2sin\z\ z E IR . This is non-negative, continuous everywhere, and bounded by Co z -2 and hence integrable. By choice of Co we can have t:J(z)dz = 1 , so f is a p.d.f. By the mean value theorem,
I f(z + a) - f(z) I = I a I I f'(z + a(z)a) I , a(z) E [0, 1], ( 14.64) where f'(z) = 8Cosin(z4)cos(z4)z - 2Cosin2(i)z - 3 . But note that t: l f'(z) l dz = oo, and hence,
1
J
TQT
+oo _""
l f(z + a) - f(z) l dz �
oo
as l a l � 0,
( 14.65)
which contradicts (14.42). The problem is that the density is varying too rapidly in the tails of the distribution, and I f(z + a) - f(z) I does not diminish rapidly enough in these regions as a � 0. The rate of divergence in (14.65) can be estimatedY For fixed (small) a, l f(z + a) - f(z) l is at a local maximum at points at which sin (z + a)4 = 1 (or 0) and sin z4 = 0 (or 1), or in other words where (z + a)4 - i = 4az3 + O(a2) = ±rr/2 . The solutions to these approximate relations can be written as z 1 ±C1 1 a l - 13 for C1 > 0. At these points we can write, again approximately (orders of magnitude are all we need here), =
Theory of Stochastic Processes
226
l f(z + a) - f(z) l � 2f(z) � 2CoC 12 I a l 213.
The integral is bounded within the interval [-C1 I a l - 1 13, C1 l a l - 1 13] by 4 C0C 11 1 a 1 1 13, the area of the rectangle having height 2Co C 1 2 1 a 1 213 . Outside the interval, f is bounded by C0 z-2, and the integral over this region is bounded by
J+oo
2C0
z- 2dz Cl l a i - 113
=
/ J 2Co C 1 l a l 1 3 .
Adding up the approximations yields
for M <
oo. o
s:: l f(z + a) - f(z) l dz � M l a l 113
(14.66)
The rate of divergence is critical for relaxing the conditions. Suppose instead of (14.42) that
J:: l f(z + a) - f(z) l dz � Mh( l a i ), I a l � B, B > 0
( 14.67)
could be shown sufficient, where h(.) is an arbitrary increasing function with h( I a I ) t 0 as I a I t 0. Since
s:: l f(z + a) - f(z) l dz � 2s::f(z)dz
=
(14.68)
2
for any a, ( 14.67) effectively holds for any p.d.f., by the dominated convergence theorem. Simple continuity of the distributions would suffice. 1 6 This particular result does not seem to be available, but it is possible to relax 14.9(a) substantially, at the cost of an additional restriction on the moving average coefficients. 14.13 Corollary Modify the conditions of 14.9 as follows: for 0 < � � 1 , assume that (a') Zr is uniformly Lr-bounded, independent, and continuously distributed with p.d.f. fz1, and
(14.69) whenever l a I � B, for some 8 > 0;
(b') I.7=0Gr(r) 13!(13+r) < oo, where Gr(r) is defined in (14.43); (c') 1/S (x) = 't(x) = LJ=I'tjXj for l xl � 1 , and I.'J=t l -rj l l3 < oo
Then Xr is strong mixing with
CXm
=
0(2:7=m+ 1 Gr(r) j3/(j3+r)).
Proof This follows the proof of 14.9 until
.
(14.58), which becomes
Mixing
227
{ m+k lvr l �}'
x * - x * :::; 2M sup L vE B'
(14.70)
t=m+i
applying the obvious extension of Lemma 14.10. Note that
t-Lm- I -j � oo l 1 l 13 m+k 11� (14.71 ) t=m+ I t=m+I j=O 'tjVt :::; .L}=0 t t=.Lm+I using (9.63), since 0 < � :::; 1 . Applying assumption 14.13(c'), m+k (14.72) I P(G n H) - P(G)P(H) I L (TJ � + Gr(r)TJ?), t=m+I and the result is obtained as before, but in this case setting TJ r G�1(�+r). Condition 14.13(b') is satisfied when I ej l for 1-l > l!J3 2/r when r :::; 2, and 1-1 > 1/2 + 1/r 11� when r � 2, which shows how the summability restrictions have «
•
=
+
« F11
+
to be strengthened when J3 is close to 0. This is none the less a useful extension because there are important cases where 14.13(b') and 14.13(c') are easily satisfied. In particular, if the process is finite-order ARMA, both I I and I I either decline geometrically or vanish beyond some finite j, and (b') and (c') both hold. Condition 14.13(a') is a strengthening of continuity since there exist functions h(.) which are slowly varying at 0, that is, which approach 0 more slowly than any positive power of the argument. Look again at 14.12, and note that setting � = ! will satisfy condition 14.13(a') according to (14.65). It is easy to generalize the example. Putting j(z) = Csin2(l)z - 2 for k � 4, the earlier argument is easily modified to show that the integral converges at the rate ! a ! and this 2 2 choice of J3 is appropriate. But for f(z) = Csin (ez)z the integral converges more slowly than l a l � for all � > 0, and condition 14.13(a') fails. To conclude this chapter, we look at the case of uniform mixing. Manipulating inequalities (14.52)--{14.55) yields
ej
'tj
1/(k-I \
( (�J
I P(H I G) - P(H) I � x* - x * + P(Ec) l + P (14.73) . which shows that uniform mixing can fail unless P(E) = 1 for all m exceeding a finite value. Otherwise, we can always construct a sequence of events G whose probability is positive but approaching 0 no slower than P(Ec). When the support of (X-p, . . . ,X0) is unbounded this kind of thing can occur, as illustrated by 14.8. The essence of this example does not depend on the AR(l ) model, and similar cases could be constructed in the general MA(=) framework. Sufficient conditions must include a.s. boundedness of the distributions, and the summability conditions are also modified. We will adapt the extended version of the strong mixing condition in 14.13, although it is easy to deduce the relationship between these conditions and 14.9 by setting J3 = 1 below.
Theory of Stochastic Processes
228
14.14 Theorem Modify the conditions of 14.13 as follows. Let (a') and (c') hold as before, but replace (b') by (b") )13 < oo, and add (d) is uniform!y bounded a. s. = Then is uniform mixing with )13). Proof Follow the proof of 14.9 up to (14.55), but replace (14.56) by (14.73). By condition 14.14(d), there exists < oo such that sup1 l < a.s., and hence a.s. It further follows, recalling the definition of V2, that P(E) = 1 < when m + 1 , ... ,m + k. Substituting directly into (14.73) < from (14.70) and (14.71), and making this choice of 11, gives (for any G with P(G)
L,";'=O(Lj=r l Sjl {Xr}{ Z1}
KL,j=0l Sjl Tl r KLj=rlSjl for t =
IXrl
> 0)
(14.74) The result now follows by the same considerations as before.
•
These summability conditions are tougher than in 14.13. Letting r � oo in the latter case for comparability, 14.13(b') is satisfied when I = OU 11) for 11 > 1/2 + 1/�, while the corresponding implication of 14.14(b") is 11 > 1 + 1/�.
Sjl
15 Martingales
1 5 . 1 Sequential Conditioning
It is trivial to observe that the arrow of time is unidirectional. Even though we can study a sample realization ex post, we know that, when a random sequence is generated, the 'current' member X1 is determined in an environment in which the previous members, Xt- k for k > 0, are given and conditionally fixed, whereas the members following remain contingent. The past is known, butthe future is unknown. The operation of conditioning sequentially on past events is therefore of central importance in time-series modelling. We characterize partial know ledge by specify ing a <J-subfield of events from c:J, for which it is known whether each of the events belonging to it has occurred or not. The accumulation of information by an observer as time passes is represented by an increasing sequence of a-fields, { c:J1}':'oo , such that . .. � cg;_ 1 c c:Jo � c:J 1 c c:J2 � ... � cg;_ I ? If X1 is a random variable that is c:J1-measurable for each t, { c:J1} ':'00 is said to be adapted to the sequence {X1}':'oo . The pairs {X1,c:Jr}':'oo are called an adapted sequence. Setting c:J1 = a(Xs, oo < s � t) defines the minimal adapted sequence, but c:Jt typically has the interpretation of an observer' s information set, and can contain more information than the history of a single variable. When X1 is inte grable, the conditional expectations E(X1 1 c:J1_ 1 ) are defined, and can be thought of as the optimal predictors of X1 from the point of view of observers looking one period ahead (compare 10.12). Consider an adapted sequence {Sn.c:Jn} ':'oo on a probability space (Q,c:J,P), where { c:Jn } is an increasing sequence. If the properties -
E I Sn l < E(Sn I cg;n - 1 )
=
00,
Sn - 1 , a.s.,
(15.1) (15.2)
hold for every n, the sequence is called a martingale. In old-fashioned gambling parlance, a martingale was a policy of attempting to recoup a loss by doubling one's stake on the next bet, but the modern usage of the term in probability theory is closer to describing a gambler's worth in the course of a sequence of fair bets. In view of (10. 1 8), an alternative version of condition (15.2) is
JAsndP JAsn-ldP, each A =
E cg;n_ 1 .
(15 .3)
Sometimes the sequence has a finite initial index, and may be written {Sn.c:Jn}1 where is an arbitrary integrable r.v.
s1
Theory of Stochastic Processes
230
15.1 Example Let {Xr}1 be an i.i.d. integrable sequence with zero mean. If Sn
'L7= 1Xr and r:Ji
=
=
o"(Xn.Xn - J , ... ,Xt), {Sn,r:Jin }1 is a martingale, also known as a random walk sequence. Note that E I Sn l :::; 'L7= tEI Xr l < oo. o n
15.2 Example Let
2 be
an integrable, r:Ji/:8-measurable, zero-mean r.v., { r:Jin}':oo an increasing sequence of a-fields with limn�oorg;n = r:Ji, and Sn = E(ZI r:Ji ) Then n .
(15.4) where the second equality is by 10.26(i). El Sn l :::; E I ZI < oo by 10.27, so Sn is a martingale. o Following on the last definition, a martingale difference (m.d.) sequence { Xr,r:Jir}':'oo is an adapted sequence on (Q,r:Ji,P) satisfying the properties E l Xr l < oo ,
E(Xr l r:Jit- I )
=
(15.5) (15.6)
0, a.s.,
for every t. Evidently, if { Sn } is a martingale and Xr = Sr - Sr- I , then { Xr} is a m.d. Conversely, we may define a martingale as the partial sum of a sequence of m.d.s, as in 15.1 (an independent integrable sequence is clearly a m.d.). However, if Xr has positive variance uniformly in t, condition (15. 1) holds for all finite n but not uniformly in n. To define a martingale by Sn = L�=- = Xt can therefore lead to difficulties. Example 15.2 shows how a martingale can arise without reference to summation of a difference sequence. It is important not to misunderstand the force of the integrability requirement in (15. 1). After all, if we observe Sn - 1 , predicting Sn might seem to be just a matter of knowing something about the distribution of the increment. The problem is that we cannot treat E(Sn I Sn - 1 , ... ) as a random variable without integrability of Sn . Conditioning on Sn- l is not the same proposition as treating it as a constant, which entails restricting the probability space entirely to the set of repeated random drawings of Xn. The latter problem has no connection with the theory of random sequences. A fundamental result is that a m.d. is uncorrelated with any measurable function of its lagged values. 15.3 Theorem If {Xr,r:Jir} is a m.d., then
Cov(Xr,cJ>(Xr- t ,Xt- 2•· ··))
=
0,
where cp is any Borel-measurable, integrable function of the arguments. Proof By 10.11 (see also the remarks following) noting that cj>(Xr- t ,Xr-2 , . . . ), is r:Jir- 1 -measurable. • 15.4 Corollary If {Xt,r:Jir} is a m.d., then E(XtXr-k) = 0, for all t and all k '# 0.
Proof Put cp
and t' - I k l
=
=
Xr-k in 15.3. For k < 0, redefine the subscripts, putting t' t, so as to make the two cases equivalent. •
=
t-k
Martingales
23 1
One might think of the m.d. property as intermediate between uncorrelatedness and independence in the hierarchy of constraints on dependence. However, note the asymmetry with respect to time. Reversing the time ordering of an independent sequence yields another independent sequence, and likewise a reversed uncorrelated sequence is uncorrelated; but a reversed m.d. is not a m.d. in general. The Doob decomposition of an integrable sequence {Sn,:!Fn }o is
(15.7) 0, Mo = So, and (15.8) Mn = Mn - 1 + Sn - E(Sn i :!Fn - 1 ), (15.9) An = An - I + E(Sn i :!Fn - J ) - Sn - 1 · An is an :!Fn- 1 -measurable sequence called the predictable component of Sn. Writing !:l.Sn = Yn and !:l.Mn = Xn, we find Mn = E(Yn l :!Ft- 1 ), and (15. 10) Xn is known as a centred sequence, and also as the innovation sequence of Sn. It is adapted if { Yn,:!Fn }o is, and since £1 Yn I < oo by assumption, where Ao
=
E I Xn l � E I Yn l + E I E(Yn i :!Fn - 1 ) 1 � E I Yn l + E(E( I Yn l l :!fn - 1 )) = 2£ 1 Yn l < 00 ,
(15. 1 1)
by (respectively) Minkowski' s inequality, the conditional modulus inequality, and the LIE. Since it is evident that E(Xt l :!Ft- 1 ) = 0, {Xn,:!Fn}o is a m.d. and so { Mn,:!Fn } o is a martingale. Martingales play an indispensable role in modern probability theory, because m.d.s behave in many important respects like independent sequences. Independence is the simplifying property which permitted the 'classical ' limit results, laws of large numbers and central limit theorems, to be proved. But independence is a constraint on the entire joint distribution of the sequence. The m.d. property is a much milder restriction on the memory and yet, as we shall see in later chapters, most limit theorems which hold for independent sequences can also be proved for m.d.s, with few if any additional restrictions on the marginal distributions. For time series applications, it makes sense to go directly to the martingale version of any result of interest, unless of course a still weaker assumption will suffice. We will rarely need a stronger one. Should we prefer to avoid the use of a definition involving a-fields on an abstract probability space, it is possible to represent a martingale difference as, for example, a sequence with the property
(15. 12) When a random variable appears in a conditioning set it is to be understood as representing the corresponding minimal cr-subfield, in this case cr(Xt- I .Xt-2 , ... ).
Theory of Stochastic Processes
232
This is appealing at an elementary level since it captures the notion of informa tion available to an observer, in this case the sequence realization to date. But since, as we have seen, the conditioning information can extend more widely than the history of the sequence itself, this type of notation is relatively clumsy. Suppose we have a vector sequence and though not necessarily is a m.d. with respect to = ... ) in the sense of ( 1 5.6). This case is distinct from (15. 12), and shows that that definition is inadequate, although ( 1 5. 16) implies (15. 12). More important, the representation of conditioning information is not unique, and we have seen (10.3(ii)) that any measurably isomorphic transformation of the conditioning variables contains the same information as the original variables. Indeed, the information need not even be represented by a variable, but is merely knowledge of the occurrence/non occurrence of certain abstract events.
{ (Xr,Zr)}, Xrrg;r cr(Xr,Zr,Xt- 1 ,Zr- 1 ,
Zr -
1 5. 2 Extensions of the Martingale Concept
{ {Xnr, rg;nr}��J};=I• where {kn};=l is some increasing ( 1 5 . 13) E i Xm i < (15. 14) E(Xnrl rg;n,t-1 ) 0 a.s. for each t l , ... , kn and n � 1 , is called a martingale difference array. In many applications we would have just kn n. The double subscripting of the subfield rg;nt may be superfluous if the information content of the array does not depend on n, with rg;nt rg;t for each n, but the additional generality given by the definition is harmless and could be useful. The sequence { Sn, rg;n} i where Sn L�� 1 Xm and rg;n rg;n,kn is not a martingale, but the properties of martingales can be profitably used to analyse its behaviour. Consider the case Sn n - 1 12.L7=1 Xr where { Xr, rg; t} is a m.d. Such scaling by sample size may ensure that the distribution of Sn has a non-degenerate limit. Sn is not a martingale since (15. 15) E(Sn l rg;n - 1 ) [(n 1)/n] 1 12Sn 1 , An adapted triangular array sequence of integers, for which
=,
=
=
=
=
=
=
=
=
-
-
but each column of the m.d. array
x1 T 112X1 3 -1 12Xt 4- 112X1 T 1 '2x2 3 - lt2x2 4-lt2x2 3 -1'2x3 4-112X3 4-112x4
(15. 1 6)
is a m.d. sequence, and Sn is the sum of column n. It is a term in a martingale sequence even though this is not the sequence A n adapted sequence of L 1 -bounded variables satisfying
{Smrg;nf:'oo
{Sn}.
Martingales
233 (15. 17)
is called a submartingale, in which case Xn = Sn - Sn - ! is a submartingale differ ence, having the property E(Xn+! l <Jn) ;::: 0 a.s. In the Doob decomposition of a submartingale, the predictable sequence An is non-decreasing. Reversing the inequality defines a supermartingale, although, since -Sn is a supermartingale whenever Sn is a submartingale, this is a minor extension. A supermartingale might represent a gambler' s worth when a sequence of bets is unfair because of a house percentage. The generic term semimartingale covers all the possibilities. 15.5 Theorem Let <)>(.): IR f-) IR be continuous and convex. If {Sm<Jn} is a martin gale and E I <J>(Sn) l < =, then {<J>(Sn),<Jn} is a submartingale. If
(15. 1 8) by the conditional Jensen inequality (10.18). For the submartingale case, '= ' becomes •;::: • in (15. 1 8) when X! � x2 ==> <j>(x1 ) � <J> (x2). • If {X1,'J1}i is a (sub)martingale difference, {Z1,'J1}o any adapted sequence, and n Sn = _LXrZr-J , (15. 19) t:=l
then {Sn,<Jn }i is a (sub)martingale since
n E(Sn+ d <Jn) = _LXrZt- 1 + ZnE(Xn+d <Jn) = Sn (;::: Sn). t:=l
(15.20)
We might think of Xt as the random return on a stake of 1 unit in a sequence of bets, and the sequence { Z1} as representing a betting system, a rule based on information available at time t - 1 for deciding how many units to bet in the next game. The implication of (15.20) is that, if the basic game (in which the same stake is bet every time) is fair, there is no betting system (based on no more than information about past play) that can turn it into a game favouring the player - or for that matter, a game favouring the house into a fair game. For an increasing sequence { <J t } of cr-subfields of (Q,<J ,P), a stopping time 't( co) is a random integer having the property { co: t = 't(co) } e 'J1• The classic example is a gambling policy which entails withdrawing from the game whenever a certain condition depending only on the outcomes to date (such as one' s losses exceeding some limit, or a certain number of successive wins) is realized. If 't is the random variable defined as the first time the said condition is met in a sequence of bets, it is a stopping time. Let 't be a stopping time of { <Jn }, and consider
SnA't =
{
Sn, n � 't Sr;, n > 't
(15.21)
Theory of Stochastic Processes
234
{
where n A 't stands for min n , 't } . {Snl\'t,�n } ';;'= 1 is called a stopped process.
{ Sn , �n }t is a martingale (submartingale), then { Sn 't �n lt is a martingale (submartingale). Proof Since { �n} 1 is increasing, { ro: k ro)} � n for k < n, and hence also n 't(W)} �n. by complementation. Write SnNr: IZ: l sk l {lr-'t) +Snl {n:O:'t )• {ro: where the indicator functions are all �n-measurable. It follows by 3.25 and 3.33 that sn/\'t is �n-measurable, and n -1 E I SM't l 2: E I Sk l ( k='tJ I +EI Sn l { n�'t J I k=1 15.6 Theorem If
:S
l\ •
=
E
't(
E
=
:S
n- 1 :S L E I Sk l + E I Sn l < oo, n ;?: 1.
k=1
(15.22)
{Sn,�n}i is a martingale then for A �n. applying (15.3), fAS
E
=
�(D't)
=
•
The general conclusion is that a gambler cannot alter the basic fairness charact eristics of a game, whatever gambling policy (betting system plus stopping rule) he or she selects. All these concepts have a natural extension to random vectors. An adapted sequence is defined to be a vector martingale diffe rence if and only if �1} :'"" is a scalar m .d. sequence for all conformable fixed vectors ::1= 0. It has the property
{A'Xr. {X1, �r}:'oo
A
(15.24) The one thing to remember is that a vector martingale difference is not the same thing as a vector of martingale differences. A simple counter-example is the two element vector = where is a m.d.; is an adapted sequence, but
Xr (X1,X1- d, X1 {f-.1Xr + �Xr- 1 •�t} E(AIXr + A2Xt- d �r- I) A2Xr- 1 0, so it is not a m.d.. On the other hand, E(A.IXr+l + X l � ) 0, but {AIXr+I +�X1, �r} is not adapted, since Xr+ l is not �1-measurable. :f.
=
A2
t
t-
l
=
Martingales
235
1 5 . 3 Martingale Convergence
Applyi ng 15.5 to the case <j>(.) = I . IP and taking unconditional expectations shows that every martingale or submartingale has the property
(15.25) By 2.1 1 the sequence of pth absolute moments converges as n --7 oo, either to a finite limit or to +oo. In the case where the L 1 -norms are uniformly bounded, (sub)martingales also exhibit a substantially stronger property; they converge, almost surely, to some point which is random in the sense of having a distribution over realizations, but does not change from one time period t to the next. The intuition is reasonably transparent. { r:Jn } is an increasing sequence of a-fields which converges to a limit r:J"" � r:J, the a-field that contains r:Jn for every n. Since E(Sn l r:Jn) = Sm the convergence of the sequence { r:J'n } implies that of a uniformly bounded sequence with the property E(Sn+ l l r:Jn) � Sn, so long as these expectations remain well-defined in the limit. Thus, we have the following.
M < oo, then Sn --7 S a.s. where S is a r:J'-measurable random variable with El S l -5, M. o The proof of 15.7, due to Doob, makes use of a result called the upcrossing inequality, which is proved as a preliminary lemma. Considering the path of a submartingale through time, an upcrossing of an interval [a,�] is a succession of steps starting at or below a and terminating at or above �· To complete more than 15.7 Theorem If {Smr:J'n}i is a submartingale sequence and supnE I Sn l
-5,
one upcrossing, there . must be one and only one intervening downcrossing, so downcrossings do not require separate consideration. Fig. 15.1 shows two upcross ings of [a,�], spanning the periods marked by dots on the abscissa.
�
0
0
0 0
a
0 °o 0 0
0 00
0
0
0
0
0
�o �
· ············································ ········ ··························-
0
0
·
······
o
····································································
0
00 0 0
0
············
0
0
0
0 0
0 0 0 0
0
0
0
················ ················· ····
o
·············
0
0
0
c;-··························
· ·_·_·_·_ --�---·_ ·_ · _ · ·_·_·_ ·_ ·_ · _______·_·_ ·_ ·_ · _ · ·_·_·_ ·_ · __________� k ··_ ·_
yk
000001 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1000000001 1 1 1 1 1 1 1 1 1 1000000000000 Fig. 15. 1
Let the r.v. Yk be the indicator of an upcrossing. To be precise, set Y1 and then, for k = 2,3, ... ,n,
=
0,
236
{
Theory of Stochastic Processes
0 if either Yk-1 yk = 1 if either Yk - 1
= =
-
0, Sk 1 > a, or Yk- 1 0, Sk-1 :5 a, or Yk- 1
=
=
1, Sk-1 � �. 1 , Sk-1 < �-
(15.26)
The values of Yk appear at the bottom of Fig. 15. 1 . Observe that an upcrossing begins the period after Sk falls to or below a, and ends at the first step there after where � is reached or exceeded. Yk is a function of Sk- 1 and an �k - 1 -measur able random variable. The number of upcrossings of [a, � ) up to time n of the sequence {Sn(ro) } i, to be denoted Un(ro), is an �n-measurable random variable. The sequence { Un(ro)} i is monotone, but it satisfies the following condition. 15.8 Upcrossing inequality The number of upcrossings of [a,�) by a submartin gale {Sm�n }i satisfies
(15.27) Proof Define S� = max{Sn,a} , a continuous, convex, non-decreasing function of
Sm such that { S� .� n } is an adapted sequence and also a submartingale. Un is the set of upcrossings up to n for {S� } as well as for {Sn } . Write n n n (1 5.28) S� - S! = ,L xk ,L YkXk + ,L ( l - Yk)Xk, k=2 k=2 k=2 where Yk is from (15.26), and Xk is a submartingale difference. Then
=
=if
E(Xk l �k- l )dP � 0,
(15.29)
k=2 { Y,rO} using the definition of a conditional expectation in the second equality (recall ing that Yk is �k -1-measurable), and the submartingale property, to give the in equality. We have therefore shown that
(i
)
(15.30) Ykxk . k=2 I,k=2 YkXk is the sum of the steps made during upcrossings, by definition of Yk · Since the sum of the Xk over an upcrossing equals at least � - a by definition, we must have n (15.3 1) ,L YkXk � (� - a)Un, k=2 where Un is the number of upcrossings completed by time n. Taking the expectation of ( 15.3 1 ) and substituting (15.30), we obtain, as required, E(S� - Sl) � E
Martingales
237
(� - a)E( Vn) � E(S� - S j ) � E(S� - a) =
f
I Sn >a)
(15.32)
(Sn - a)dP
� E I Sn - a l � EI Sn l + l a l .
•
The upcrossing inequality contains the implication that, if the sequence is uniformly bounded in L1 , the expected number of upcrossings is finite, even as n ---7 oo. This is the heart of the convergence proof, for it means that the sequence has to be settling down somewhere beyond a certain point. Proof of 15.7 Fix a and �
E(Un) �
> a. By 15.8,
E I Sn l + l a l M + l a l < � A p-a �-a
oo.
(15.33)
For CD E Q, { Un(CD) } i is a positive, non-decreasing sequence and either diverges to +oo or converges to a finite limit U(CD) as n ---7 oo. Divergence for CD E C with P(C) > 0 would imply E(Un) --7 oo, which contradicts (15.33), so Un ---7 U a.s., where E(U) < oo. Define S(CD) = limsupn�ooSn and �(CD) = liminfn�ooSn. If �(co) < a < � < S(CD), the interval [a,�] is crossed an infinite number of times as n --7 oo, so it must be the case that P(S. < a < � < S) = 0. This is true for any pair a,�. Hence consider { co: �(CD) < S(CD) } = U {S. � a < � � S } , a,J3
(15.34)
where the union on the right is taken over rational values of a and �- Evidently, P(S. < S) = 0 by 3.6(ii), which is the same as S. = S = S a.s., where S is the limit of {Sn } . Finally, note that
(1 5.35) E I S I � liminf E I Sn l � sup E I Sn l � M, n n�oo where the first inequality is from Fatou ' s lemma and the last is by assumption. This completes the proof. • Of the examples quoted earlier, 15.1 does not satisfy the conditions of 15.7. A random walk does not converge, but wanders forever with a variance that is an increasing function of time. But in 15.2, X1 is of course converging to Z. 15.9 Corollary Let {Su,?:Fn }:'"" be a doubly infinite martingale. Then Sn --7 S_"" a.s.
as n
---7 - oo ,
where s_oo is an £!-bounded r.v.
Proof Let U- n denote the number of upcrossings of [a,�] performed by the sequence { Sj, -1 � j � n } The argument of 15.8 shows that -
E( U- n) �
.
E I S1 I + l a l � , all n � 1 . a _
Arguments precisely analogous to those of 15.7 show that
(15.36)
Theory of Stochastic Processes
238
)
(
P 1!:��f Sn < 1���� sn = 0,
( 1 5.37)
so that the limit s_"" exists a.s. The sequence { E I Sn l } ::� is non-negative, non increasing as n decreases by ( 1 5 .25), and E I S-d < oo by definition of a martin gale. Hence E I S -.,.,1 < oo. • If a martingale does not converge, it must not be thought of as converging in R, of heading off to +oo or -oo, never to return. This is an event that occurs only with probability 0. Subject to the increments having a suitably bounded distribu tion, a nonconvergent martingale eventually visits all regions of the real line, almost surely. 15.10 Theorem Let {X1,?1'1} be a m.d. sequence with E(sup1 I X1 1 ) < oo, and let Sn = I7=tX1• If C = { ffi: Sn(ffi) converges } , and
E = { co: either infnSn(CO) > -oo or supnSn ( (I)) < oo } , then P(E - C) = 0. Proof For a constant M > 0, define the stopping time 'tM(ro) as the smallest integer n such that Sn( ro) > M, if one exists, and 'tM(ro) = oo otherwise. The stopped process { SnAtu • ?/'nAtu l';;'= l is a martingale (15.6), and Scn -l )Nr:u � M for all n. Letting S�"tM = max { SnA M'O } , t
s �A'tM � s (n - l )A'tM + x�A'tM � M + sup I Xn I .
n
( 1 5.38)
Since E(SnA'tM) = 0, El snA u l = 2E(�A'tM), and hence supnEI SnNrul < 00 , and snA M converges a.s., by 15.7. And since snATu(ro) = Sn(CO) on the set { (1): supnSn(ffi) � M}, Sn(co) converges a.s. on the same set. Letting M � oo, and then applying the same argument to -Sn, we obtain the conclusion that Sn(ro) converges a.s. on the set E; that is, P(C n E) = P(E), from which the theorem follows. • 't
t
Note that E c = { co: supnSn( co) = +oo, infnSn(ro) = -oo } . Since P(Ec) = P((C n E)c) = P( cc u E c) , a direct consequence of the theorem is that cc c Ec u N where P(N) = 0, which is the claim made above. 1 5 .4 Convergence and the Conditional Variances
If { Sn } is a square-integrable martingale with differences {Xn } ,
E(S� I ?Fn- t) = E(S�- 1 + X; + 2XnSn -d ?l'n-t ) ;:::: S�- 1 .
and s; i s a submartingale. The Doob decomposition of the sequence of squares has the form s; = Mn + An where �n = X� - E(X; I ?l'n- J ), and Mn = E(X� I ?l'n- t ). The sequence {An } is called the quadratic variation of { Sn } . The following theorem reveals an intimate link between martingale convergence and the summability of the conditional variances; the latter property implies the former almost surely, and in particular, if I'7= 1 ECX7 1 ?l't- t ) < oo a.s. then Sn � S a.s.
Martingales
239
15. 1 1 Theorem Let {X1,�r}j be a m.d. sequence, and Sn = I.7= I X1. If
D
=
{ w: I7=IE(X7 1 �t-J)(w) < oo } E � .
C = { w: Sn(W) converges } E �.
=
0. Proof Fix M > 0, and define the stopping time 'tM(W) as the smallest value of n then P(D - C)
having the property
n
_L ecx7 1 �t-J)(w) � M. t=i
(15.39)
If there is no finite integer with this property then 'tM(W) = oo. If DM = { w: 'tM(w) = oo } , D = limM�J)M· The r.v. 1 (<M �nJ(w) is �n- 1 -measurable, since it is known at time n - 1 whether the inequality in (15.39) is true. Define the stopped process n (15.40) _L xt 1 !<M� t ) = SnA<M· t=i
SnA,M is a martingale by 15.6. The increments are orthogonal, and sup E(S�I\'M) n
=
sup e n
(f x; t l,M� nl) t=i
(15.41) where the final inequality holds for the expectation since it holds for each w E Q by definition of 'tM(ro). By Liapunov ' s inequality, sup E I SnA,M I � sup I! SnA<Mih < M112, n n
and hence SnA,M converges a.s., by 15.7. If w E DM, Sn1\,�ro) = Sn(w) for every E IN , and hence Sn(w) converges, except for w in a set of zero measure. That is, P(DM n C) = P(DM). The theorem follows on taking complements, and then letting n
M � oo.
•
15.12 Example To get an idea of what convergence entails, consider the case of {Xr } an i.i.d. sequence (compare 15.1). Then {X/a1} is a m.d. sequence for any
sequence {ar} of positive constants. Since E(X7 1 �t-I) = E(X7) = c:Jl, a constant which we assume finite, Sn = 'L7=IX/a1 is an a.s. convergent martingale whenever I7= dla7 < oo. For example, a1 = t would satisfy the requirement. o
In the almost sure case of Theorem 15.11 (when P(C) = P(D) = 1), the summa bility of the conditional variances transfers to that of the ordinary variances, a7 = E(X7). Also when E(sup1X7) < oo, the summability of the conditional variances is almost equivalent to the summability of the x7 themselves. These are conse quences of the following pair of useful results.
Theory of Stochastic Processes
240
{Z1} ifbeanda anyonlynon-negative stochastic sequence. �r- 1) < oo a.s. if 'L7:: : J E(Z 'L7=IE(Z1 ) < rl (i) (ii) If E(sup1Z1) < then P(D E) 0, where D {co:'L7=JE(Zrl �r- l )(co) < oo } , E = {co: L7=1Zr(CO) < oo } Proof (i) The first of the sums is the expected value of the second, so the 'only if' part is immediate. Since E(Z11 �t- d is undefined unless E(Z1) < we may assume 'L7=1E(Z1) < oo for each finite n. These partial sums form a monotone series which either converges to a finite limit or diverges to +oo. Suppose 'L7= I E(Zrl �t- d converges a.s, implying (by the Cauchy criterion) that 'L7�'::+ JE(Z11 �t- d -7 0 a.s. as m 1\ n -7 oo. Then 7�':: E(Z1) -7 0 by the monotone con vergence theorem, so that by the same criterion 'L7::: 1E(Z1) -7 'L7::: 1 E(Z1) < oo , as required. (ii) Define the m.d. sequence X1 = Z1-E(Z11 �1- J), and let Sn 'L7::: I X1. Clearly supnSn(CO) s 'L7= 1 Zr(co), and if the majorant side of this inequality is finite, Sn(CO) converges in almost every case, by 15.10. Given the definition of X1, this implies in turn that 'L7::: 1 E(Z1I�t- J)(co) < oo. In other words, P(E - D) = 0. Now apply the same argument to -X1 = E(Z11 �1- 1 ) -Z1 to show that the reverse implica tion holds almost surely, and P(D - E) = 0 also. 15.13 Theorem Let
oo
Ll
oo
=
=
.
oo,
1:.
+1
=
•
1 5 . 5 Martingale Inequalities
Of the many interesting results that can be proved for martingales, certain inequalities are essential tools of limit theory. Of particular importance are maximal inequalities, which place bounds on the extreme behaviour a sequence is capable of over a succession of steps. We prove two related results of this type. The first, a sophisticated cousin of the Markov inequality, was originally proved by Kolmogorov for the case where { X is an independent sequence rather than a m.d., and in this form is known as Kolmogorov' s inequality. 15.14 Theorem Let be a martingale. For any p ;?: 1 ,
r}
{Sn,�n}i ) EISn i P P { max ISkl > £ s E l� k�n Proof Define the events A1 {co: IS1(co)l > £}, and for k = 2, Ak = {co: 1max j5 < k ISj(co) l s £, I Sk(co)l > £} �k· The collection A 1 , .An is disjoint, and tJAk = { max !�k sn ISkl > £}. P
.
=
( 15.42) . . . ,n,
E
•••
k=l
( 15.43)
Martingales
241
{ I Sk i > £}, the Markov inequality (9.10) gives (15.44) P(A k) :::;; £-pE( I Sk i P 1 Ak). By 15.5, I Sn I p for p ;?: 1 is a submartingale, so I sk I p :::;; E( I Sn I p I �k) a.s., for 1 :::;; k :::;; n. Since A k E �k' it follows that (15.45) where the equality applies- (10. 18). Noting .LZ=t l Ak = 1 u�=tAk' we obtain from (15.43)-{15.45), as required, Since A k
P
c
t�::
n
i Sk l > £
) � =
� ( i )
P(A k) :::;; £-p E( I Sn i P 1 Ak)
£-pE I Sn i P
1 Ak :::;; £-PE( I Sni P).
• (15.46) = � k1 The second result converts the probability bound of 15.14 into a moment inequal ity. 15.15 Doob's inequality Let {Sn,�n }l be a martingale. For p > 1 , =
(15.47) Proof
Consider the penultimate member of (15.46) for the case p
)
(
=
1, that is,
(15.48) P max I Sk l > £ :::;; £- 1 E( I Sn l 1 tmaxts k s n i Skl> e } ), 1 :'> k :<> n and apply the following ingenious lemma involving integration by parts. 15.16 Lemma Let X and Y be non-negative r.v.s. If P(X > £) :::;; £- 1 E(Y1 1x > e J ) for all £ > 0, then E(XP) :::;; [p/(p - 1)]PE(YP), for p > 1 . Proof Letting Fx denote the marginal c.d.f. of X, and integrating by parts, using d(l - Fx) = -dFx and [xP( l - Fx(x))]o = 0, E(Xp) = =
J:�PdFx(�) J:�Pd(l Fx(�)) J:p;p- 1 (1 Fx(;))d; J:p;p- 1 P(X > ;)d;. =
-
-
-
=
(15.49)
Define the function 1 tx>s J(x) = 1 when x > ;, and 0 otherwise. Letting Fx, Y denote the j oint c.d.f. of X and Y and substituting the assumption of the lemma into (15.49), we have
I
E(XP ) :::;; p :;p - 2E(Yl { X >I;) )d;
Theory of Stochastic Processes
242
f (5
oo = p c-,P -2 0
=
)
y l {X>�)(x)dFx,r (x,y) df:,
(IR2)+
�� l ) E(fXP- 1 ).
(15.50)
Here (rR 2t denotes the non-negative orthant of [R 2 , or [O,oo) X [O,oo) . The second equality is permitted by Tonelli's theorem, noting that the function FxyS defines a a-finite product measure on (rR 3 t. By Holder's inequality, E( fXP- 1 ) :::; £ 1 1P( fP)£ 1 - 1 1P(_xP ) . Substituting into the m�orant side of ( 15.50) and simplifying gives the result. • To complete the proof of 15.15, apply the lemma to (15.48 ) to yield (15.47), putting X = max l 5 k 5 n 1 Sk l and Y = ! Sn l · • Because of the orthogonality of the differences, we have the interesting property of a martingale {Sn}i that E(S�)
=
E
(it=1 x7) ,
(15.51)
where, with S0 = 0, Xr = Sr - St- l · This lets us extend the last two inequalities for the case p = 2, to link P(m ax t 5 k :::; n ! Sn l > E) and E(max l $k$ nsi) directly with the variance of the increments. It would be most useful if this type of property extended to other values of p, in particular for p E (0,2). One approach to this problem is the von Bahr-Esseen inequality of § 1 1 .5. Obviously, 1 1.15 has a direct application to martingales. 15.17 Theorem If {X,�r}i is a m.d. sequence and Sn = I.�=l Xt, n (15.52) E ! Sn l p :::; 2_L E !Xt ! P t=l for 0 < p :::; 2. Proof This is by iterating 11.15 with Y = Xm '§ = �n J, and X = Sn - t. as in the argument leading to ( 1 1 .65); note that the latter holds for m.d. sequences just as for independent sequences. • Another route to this type of result is Burkholder's inequality (Burkholder 1 973). 15.18 Theorem Let { Sn,�n } i be a martingale with increments Xt = St - St- 1 , and
Martingales
243
So 0. For 0 < p :::;; 1 there exist positive constants cp and Cp , depending only on p, such that =
(15.53) On the majorant side, this extends by the cr inequality to n p n 2p ( 15.54) :::; ; C S EI n l pE _Lx7 :::;; cp.LE1Xr i 2P, 0 < p :::;; 1 , t=l t=l which differs from (15.52) only in the specified constant. 1 8 In fact, the Burkholder inequality holds for p > 1 also, although we shall not need to use this result and extending the proof is fairly laborious. Concavity of ( . f becomes convexity, so that the arguments have to be applied in reverse. Readers may like to attempt this as an exercise. The proof employs the following non-probabilistic lemma. 15.19 Lemma Let { Yr }1 be a sequence of non-negative �s with y1 > 0, and let Yn = I.7= 1Yt for t � 1 and Yo = 0. Then, for 0 < p $ 1 , n (15.55) y� :::;; yl) + p_L y�::: � Yt :::;; (1 + Bp)Y�, t=2 where Bp � 0 is a finite constant depending only on p. Proof For p = 1 this is trivial, with Bp = 0. Otherwise, expand Y� = ( Yn- 1 + Ynf in a Taylor series of first order to get (15.56) Ypn = Ypn- 1 + p( Yn- 1 + enYnJ\P- 1Yn , where en E [0,1]. Solving the difference equation in (15.56) yields n (15.57) y� = y l) + p L (Yr- 1 + etYtf -1Yt· t=2 Defining 1 1 ( 15.58) Kt - ypt - I - < Yr-1 + 0tYt!\[J - , we obtain the result by showing that n 0 $ p_L KtYt $ Bp Y�. (15.59) t=2 The left-hand inequality is immediate. For the right-hand one, note that l - xr :::;; (y - xr, for y > X > 0 and 0 < r :::;; 1 (see (9.63)). Hence, ( 15.58) implies that 1 -p 1 1 - (Yr-t(Yr-1 + 8rYr))p-1 O 1r -pYt1-p . ( 15.60) Kr $ O yt-1 yt-1 + tYt _
(
It follows that
)
Theory of Stochastic Processes
244
and hence
0 <- KrYt <- y2p-2 t - 1 Y2t -p ,
(15.6 1 )
n n rY p y�P L., K t S P L (yr! Yr-1)2( 1 -p)(y/Ynf t=2 t=2 n -- P L.. "" (y'!Y' t t-1 )2(1-p)y'tP t=2
(15.62) where y� = yrf Yn for t = 1, . . ,n is a collection of non-negative numbers summing to 1, r; = I.!= 1 y;, and Bp (n) denotes the supremum of the indicated sum over all such collections, given p and n. The terms y;IY ;- 1 = y11 Yr- 1 for t ;::: 2 are finite since Y 1 > 0. If at most a finite number of the Yt are positive the majorant side of (15.62) is certainly finite, so assume otherwise. Without loss of generality, by reinterpreting n if necessary as the number of nonzero terms, we can also assume Yt > 0 for every t. Then, y;!r;_ 1 = O(t-1 ) and y; = O(n - 1), and applying 2.27 yields the result Bp (n) o(1) for all p e (0, 1 ). Putting Bp = supnBp(n) < oo completes the proof. • Proof of 15.18 Put An I.7= t X7, and for £ > 0 and b > 0, set n (15.63) Yn = £ + S� + b(E +An) = (1 + b)(E +An) + 2L_Sr-1Xt, t=l so that in the notation of 15.19, Yr = Yr - Yr-1 = (1 + b)X7 + 2Sr- 1 Xr for t ;::: 2, with y 1 = ( 1 + b)(E + Xb > 0. Then by the left-hand inequality of ( 15.55), n n (15.64) Y� s ( l + bf(E + Xtf + (l + b)pL_ Y�:�x7 + 2pL_ Y�:}Sr- IXt . t=2 t=2 .
=
=
However, E is arbitrary in (15.64), and we may allow it to approach 0. Taking expectations through, using the law of iterated expectations, and the facts that E(X11 �t- l ) = 0 and that (. f- 1 is decreasing in its argument, we obtain 1 E(S� + BA n f s E (l + BfXIP + (1 + b)p± (S7-l + BAt-1f- X7 t=2
) ( S E ((l + bfXyP + (1 + b)bp - lp± A�::: � x7) . t=2
( 15.65)
But if we put now Yn = E + An , with Y1 = E + Xy and Yt = X7 for t ;::: 2, the right hand inequality of ( 15.55) yields (again, as the limiting case as E -t 0) n (15.66) XyP + pL_A�::: �x7 s (1 + Bp)E(A�). t=2
Martingales
245
and since (1 + 8f � (1 + 8)8 p- I' this combines with (15.65) to give
(1 + 8)8p - I (1 + Bp)E(A�) :?: E(S� + 8An)P (1 5.67) ;;::: 2p- l [E J Sn J 2P + 8PE(A�)], where the second inequality is by the concavity of the function (.f for p � 1 . Rearrangement yields (15.68) which is the right-hand inequality in (15.54), where Cp is given by choosing 8 to minimize the expression on the majorant side of (15.68). In a similar manner, combining the right-hand inequality of (15.55) with Yn = £ + with (15.65) and (15.67), and using concavity, yields
s;
t, s;��-ox;) :?: E ((l + 8)PXyP + (1 + 8)pt, (S7 - I + 8At- I f-1 X7) (
(1 + 8)(1 + Bp)EI Sn l 2p :?: (1 + 8)E xrp + p
:?:
E(S� + 8Ant
(15.69) which rearranges as (15.70) EI Sn l 2p :?: 8P [2 1 -p(l + 8)(1 + Bp) - 1 ] - 1 E(A�), which is the left hand inequality of (15.54), with Cp given by choosing 8 to maximize the expression on the majorant side . 11 For the case p = 1 , Bp = 0 identically in (15.55) and c 1 = C1 = 1 for any 8, reproducing the known orthogonality property. Our final result is a so-called exponential inequality. This gives a probability bound for martingale processes whose increments are a.s. bounded, which is accordingly related directly to the bounding constants, rather than to absolute moments. 15.20 Theorem If {Xt5'tli is a m.d. sequence with I Xtl � Bt a.s., where {Bt} is a sequence of positive constants, and Sn = 'L7=I Xt, (15.71) This is due, in a slightly different form, to Azuma (1967), although the cor responding result for independent sequences is Hoeffding' s inequality, (Hoeffding 1963). The chief interest of these results is the fact that the tail probabilities decline exponentially as £ increases. To fix ideas, consider the case B t = B for all t, so that the probability bound becomes P( I Sn l > £) � 2exp{ -£212nB 2 } . This ineaualitv is trivial when n is small. since of course P( I S, I > nB) = 0 bv con-
Theory of Stochastic Processes
246
struction. However, choosing E = O(n 1 12) allows us to estimate the tail probabili ties associated with the quantity n - J I2Sn. The fact that these are becoming exponential suggests an interesting connection with the central limit results to be studied in Chapter 24. Proof of 15.20 By convexity, every x E [-B1,B1] satisfies (Bt + x)ea.B r + (B1 -x)e -a.Br ax < e ------��( 1 5.72) 28 t -
__ __ __ _
for any a > 0. Hence by the m.d. property, ( 15.73) E( eaX1 I :Y't- J ) :::; �(eaB1 + e- aB 1) :::; exp{ ia2B7 } a.s., where the second inequality can be verified using the series expansion of the exponential function. Now employ a neat recursion of 10.10: ( 1 5.74) E(eaSn i :J'n - d = E(eaSn- t+aXn i :J'n - l)
Generalizing this idea yields E(eaSn) = E(E( ... E(E(eaSn l :Y'n - 1 ) I :Y'n - 2) . . . 1 :1' 1 )) :::; exp {ia2B� } E(E( ... E(eaSn - t l :Y'n -2} · · 1 :¥ 1 ))
(1 5.75)
:::; . ..
Combining ( 1 5 .75) with the generalized Markov inequality 9.11 gives (15.76) P(Sn > E) :::; exp { -ac + !a.li,7= l B7 } for E > 0, which for the choice a = EI('L7= JB7) becomes (15. 77) P(Sn > c) :::; 2exp { -E212(I,7= 1 B7) } . The result follows on repeating the argument of ( 1 5.75)--(15.76) in respect of -Sn and summing the two inequalities. • A practical application of this sort of result is to team it with a truncation or uniform integrability argument, under which the probabilities of the bound B being exceeded can also be suitably controlled.
16
Mixingales
1 6. 1 Definition and Examples
Martingale differences are sequences of a rather special kind. One-step-ahead unpredictability is not a feature we can always expect to encounter in observed time series. In this chapter we generalize to a concept of asymptotic unpredict ability. 16.1 Definition On a probability space (Q,'!f ,P), the sequ� of pairs { 1, r} :"" , where { 1 } is an increasing sequence of cr-subfields of and tile X1 are integrable r.v.s, is called an Lp-mixingale if, for p � 1, there exist sequences of non negative constants { cr}':"" and { �}0 such that � 0 as m � =, and
'!F
�m I E(Xrl '!Fr-m) I P :::; Cr�m I Xr-E(Xr i '!Ft+m) I P :::; Cr�m+l
X '!F
'!F
(16. 1) (16.2)
hold for all t, and m � 0. o A martingale difference is a mixingale having = 0 for all m > 0. Indeed, 'mixing ale differences' might appear the more logical terminology, but for the fact that the counterpart of the martingale (i.e. the cumulation of a mixingale sequence) does not play any direct role in this theory. The present terminology, due to Donald McLeish who invented the concept, is standard. Many of the results of this chapter are basically due to McLeish, although his theorems are for the case p = 2. Unlike martingales, mixingales form a very general class of stochastic processes; many of the processes for which limit theorems are known to hold can be characterized as mixingales, although supplementary conditions are generally needed. Note that mixingales are not adapted sequences, in general. is not assumed to be ?F1-measurable, although if it is, (16.2) holds trivially for every m � 0. The mixingale property captures the idea that the sequence { s } contains progressively more information about as s increases; in the remote past nothing is known according to (16.1 , whereas in the remote future everything will eventu ally be known according to (16.2). The constants c1 are scaling factors to make the choice of scale-independent, will often fulfil this role. As for mixing processes (see and multiples of say that the sequence is of size -
�m
)
I Xrllp
'!F
Xr
Sm
Xr
248
Theory of Stochastic Processes
16.2 Example Consider a linear process (16.3) }= -oo where { Us}�: is a Lp-bounded martingale difference sequence, with p ::2: 1 . Also let '3'1 = a(Us, s � t). Then E(Xr l 'ff1-m)
=
Xr - E(Xr l 'ff t+m)
=
'L, S}Ur-j' a.s. }=m
( 16.4)
(16.5) L s_j ut+}' a.s. }=m+ l Assuming { Us } :'oo to be uniformly Lp-bounded, the Minkowski inequality shows that (16. 1) and ( 16.2) are satisfied with c1 = supsll Usll P for every t, and Sm = Ij=m( I SJ I + I S -} I ). { Xr.'ff 1} is therefore a Lp-mixingale if 2..}=m( I SJ I + I S-} I ) -7 0 as m -7 oo, and hence if the coefficients { S1 } :'oo are absolutely summable. The 'one sided' process in which SJ = 0 for j < 0 arises more commonly in the econometric modelling context. In this case X1 is 'ff ;-measurable and X1 - E(X1 I 'ffr+m) = 0 a.s., but we may set c 1 = sups$r11 Usllp which may increase with t, and does not have to be bounded in the limit to satisfy the definition. To prove X1 integrable, given integrability of the Us, requires the absolute summability of the coefficients, and in this sense, integrability is effectively sufficient for a linear process to be an L1-mixingale o
We could say that mixingales are to mixing processes as martingale differences are to independent processes; in each case, a restriction on arbitrary dependence is replaced by a restriction on a simple type of dependence, predictability of the level of the process. Just as martingale differences need not be independent, so mixingales need not be mixing. However, application of 14.2 shows that a mixing zero-mean process is an adapted Lp-mixingale for some p 2 1 with respect to the subfields '3'1 = cr(X1,Xt- 1, ... ), provided it is bounded in the relevant norm. To be precise, the mean deviations ofany L,-bounded sequence whichis a-mixing of size -
Mixingales
249
The next examples show the type of case arising in the sequel. 16.3 Example An Lr-bounded, zero-mean adapted sequence is an L2-mixingale of size -� if either r > 2 and the sequence is a-mixing of size - r/(r - 2), or r ?: 2 and it is -mixing of size -r/2(r - 1). o 16.4 Example Consider for any j ?: 0 the adapted zero-mean sequence { XtXt+i - at,t+J• � t+i } ,
where at, t+J = E(XtXt+j) , and { Xt l is defined as in 16.3. By 14.1 this is a-mixing (<j>-mixing) of the same size as Xt for finite j, and is Lr12-bounded, since IIXtXt+j ll r12
S
IIXtllriiXt+jllr-
by the Cauchy-Schwartz inequality. Assuming r > 2 and applying 14.2, this is an L1-mixingale of size - 1 in the a-mixing case. To get this result under -mixing also requires a size of -r/(r - 2), by 14.4, but such a sequence is also a-mixing of size -r/(r- 2) so there is no separate result for the <j>-m� case. o ' Mixingales generalize naturally from sequences to arrays. 16.5 Definition The integrable array { { Xnt• �nt l7=- oo}�= l is an Lp-mixingale if, for p ?: 1, there exists an array of non-negative constants { end :oo, and a non negative sequence { �m } 0 such that �m ---7 0 as m ---7 oo , and IIE(Xnt l �n.t-m) llp
S
Cnt�m
II Xnt - E(Xnt I �n,t+m) II p S Cnt�m+ 1
(16.6) (16.7)
hold for all t, n, and m ?: 0. o The other details of the definition are as in 16.1. All the relevant results for mixingales can be proved for either the sequence or the array case, and the proofs generally differ by no more than the inclusion or exclusion of the extra sub script. Unless the changes are more fundamental than this, we generally discuss the sequence case, and leave the details of the array case to the reader. One word of caution. This is a low-level property adapted to the easy proof of convergence theorems, but it is not a useful construct at the level of time-series modelling. Although examples such as 16.4 can be exhibited, the mixingale prop erty is not generally preserved under transformations, in the manner of 14.1 for example. Mixingales have too little structure to permit results of that sort. The mixingale concept is mainly useful in conjunction with either mixing assumptions, or approximation results of the kind to be studied in Chapter 17. There we will find that the mixingale property holds for processes for which quite general results on transformations are available. 1 6.2 Telescoping Sum Representations
Mixingale theory is useful mainly because of an ingenious approximation method. A sum of mixingales is 'nearly' a martingale process, involving a remainder which
Theory of Stochastic Processes
250
can be neglected asymptotically under various assumptions limiting the dependence. For the sake of brevity, let EsXr stand for E(Xrl �s). Then note the simple identity, for any integrable random variable X1 and any m � 1, m (16.8) Xr = L (Et+kXr - Er+k - lXr) + Er-m - l Xt + (Xr - Er+mXt) . k=-m
Verify that each term on the right-hand side of (16.8) appears twice with opposite signs, except for X1 . For any k, the sequence { Er+kXt - Et+k- l Xr, lfft+k } 7= 1 is a martingale difference, since Et+k-l (Er+kXr - Er+k- I Xr) = 0 by the LIE. When { X1,lff1 } is a mixingale, the remainder terms can be made negligible by taking m large enough. Observe that { Er+mXt, lfft+m l:=-oo is a martingale, and since supmE I Er+mXrl ::; E I Xr l < by 10.14, it converges a.s. both as m � oo and as m � -oo, by 15.7 and 15.9, respectively. In view of the fact that II Er-mXr ll p � 0 and II Xr - Er+mXr ll p � 0, the respective a.s. limits must be 0 and X1, and hence we are able to assert that oo
Xr = L (Et+kXr - Er+k- 1 Xr), a.s. Letting Sn
where
k=-oo
=
I,�=1X1, we similarly have the decomposition m n n = r Sn L Ynk + L E - m- I Xt + L (Xr - Et+mXr) k= -m t=l t=l
n Ynk = L (Er+kXr - Er+k- 1Xr), t= l
(16.9)
(16. 10)
(16. 1 1)
and the processes { Ynk> lffn+d are martingales for each k. By taking m large enough, for fixed n, the remainders can again be made as small as desired. The advantage of this approach is that martingale properties can be exploited in studying the convergence characteristics of sequences of the type Sn. Results of this type are elaborated in § 16.3 and § 16.4. If the sequence {Xr } is stationary, the constants {c1} can be set to 1 with no loss of generality. In this case, a modified form of telescoping sum actually yields a representation of a partial sum of mixingales as a single martingale process, plus a remainder whose behaviour can be suitably controlled by limiting the dependence. 16.6 Theorem (after Hall and Heyde 1980: th. 5.4) Let { Xr,lffr } be a stationary L 1 -mixingale of size - 1 . There exists the decomposition (16. 12) Xr = Wr + Zr - Zr+ l • where E I Zr l < and { W1,lff 1 } is a stationary m.d. sequence. o oo
Mixingales There is the immediate corollary that Sn = Yn + Z1 - Zn+ 1 where { Ym�n } is a martingale. Proof Start with the identity
25 1 (16. 13)
(16. 14) where, for m � 1, m (16. 15) Wmt = L (EtXt+s - Et-1Xt+s) + EtXt+m+1 + Xt-m-1 - Et- 1 Xt-m- 1 s=-m m (16. 16) Zmt = L (Et-1Xt+s - Xt-s- 1 + Et- 1 Xt-s- 1 ) . s=O J As in (1 6.8), every term appears twice with different sign in (16. yi), except for X1• Consider the limiting cases of these random variables as m ----7 oo, to be designated W1 and Z1 respectively. By stationarity, E I Et-1 Xt+s1 = E i Et-s - 1 Xt l and E I Xt-s- 1 - Et-1Xt-s- 1 1 = E I Xt - Et+sXt l ; hence, applying the triangle inequality, 00
E I Zt l :::; L E I Et-s - 1Xt l + L E I Xt - Et+sXt l s=O
:::; 2L ss < s=O
s=O
oo .
Writing W1 = X1 - Z1 + Zt+ 1 • note that
(16. 17)
(16. 1 8) and it remains to show that W1 is a m.d. sequence. Applying 10.26(i) to (16. 15), (16. 19) and stationarity and (16.1) imply that (16.20) E I Et- 1Xt+m+t l = EI Er-m-2Xt l ----7 0 as m ----7 oo, so that El £1_ 1 Wm1 l ----7 0 also. Anticipating a result from the theory of stochastic convergence (18.6), this means that every subsequence {mb k E IN } contains a further subsequence { mk(i) • j E IN } such that I E1- 1 Wmk(i)· t I ----7 0 a.s. as j ----7 oo Since Wmk(i)J ----7 W1 for every such subsequence, it is possible to conclude that E(W11 �r- 1 ) = 0 a.s. This completes the proof. • .
252
Theory of Stochastic Processes
The technical argument in the final paragraph of this proof can be better appre ciated after studying Chapter 18. It is neither possible nor necessary in this approach to assert that E(Wmt l �t- I ) ---7 0 a.s. Note how taking conditional expectations of (16. 12) yields E(Xt l �t - I ) = Zt - Zt+I a.s. (16.21 ) It follows that Wt is almost surely equal to the centred r.v. Xt - E(Xt l �t- J). 16.7 Example Consider the linear process from 16.2, with { Ut } a stationary inte grable sequence. Then Xt is stationary, and 00
E l Ud L I Sj l < OO, j=-<X) If the coefficients satisfy a stronger summability condition, i.e. EI Xd
00
�
00
00
00
(16.22) L _L ( ! Sjl + 1 8-jl ) = ,L m l 8m l + ,L m l 8-ml < 00, m= l m=l m=l j=m then Xt is an L1-mixingale of size - 1 . By a rearrangement of terms we obtain the decomposition of (16. 12) with (16.23) and
(16.24) where E ! Zt l <
oo
by (16.22). o
1 6. 3 Maximal Inequalities
As with martingales, maximal inequalities are central to applications of the mixingale concept in limit theory. The basic idea of these results is to extend Doob's inequality (15.15) by exploiting the representation as a telescoping sum of martingale differences. MacLeish's idea is to let m go to oo in (16. 10), and accordingly write
Sn
=
L Ynb a.s.
k=- oo
(16.25)
Suppose { Snll has the representation in (16 . 25) . Let { ak}':oo be a summable collection of non-negative real numbers, with ak = 0 if Ynk = 0 a.s., and ak > 0 otherwise. For any p > 1 , I P ( 1 6.26) E m�x ! Sj ! P � {_� 1 i ak p- L ar1 E I Ynk iP. \P k=-oo l:'>j :'> n ak>O 16.8 Lemma
(
)
)( )
Mixingales
253
Proof For a real sequence { xd : and positive real sequence {ad :DO, let K = Lk= -=ak and note that DO
(16.27)
where the weights ak/K sum to unity, and the inequality follows by the convexity of the power transformation (Jensen's inequality). Clearly, (16.27) remains true if the terms corresponding to zero xk are omitted from the sums, and for these cases set ak = 0 without loss of generality. Put xk = Ynk> take the max over 1 � j � n , and then take expectations, to give
/
(16.28)
To get (16.26), apply Doob's inequality on the right-hand side. • This lemma yields the key step in the proof of the next theorem, a maximal inequ ality for L2-mixingales. This may not appear a very appealing result at first sight, but of course the interesting applications arise by judicious choice of the sequence { ak} . 16.9 Theorem (Macleish 1975a: th. 1 .6) Let {X1,'!f1} ':= be an L2-mixingale, let Sn = I,�= 1X1, and let {ad 0 be any summable sequence of positive reals. Then E
(::: ) (� ) ( s]
<;
8
a, (1;5 + i;llaO 1 + 2
� i;[(ak 1 - ai� 1 )) (� cl) .
(16.29)
get a doubly infinite sequence {ad : put a_k = ak for k > 0. Then, applying 16.8 for the case p = 2,
Proof To
E
(�':: ) t ) {.t_ sJ
<;
{
a•
DO,
)
ai1E(Y�,) .
(1 6.30)
Since the terms making up Ynk are martingale differences and pairwise uncorre lated, we have n (16.3 1 ) E(Y�k) = L_E(Er+kXr - Et+k-tXrf t= l
Now, E(Et+kXrEt+k- tXr) = E(Et+k- t (Er+kXrEt+k-t Xr) ) = E(E7+k- tXr) by the LIE, from which it follows that (16.32) E(Er+kXt - Et+k - tXr)2 = E(E7+kXt - E7+k- tXr) . Also let Z1k = X1 - Et+kX1, and it is similarly easy to verify that E(Et+kXr - Et+k-l Xt)2 = E(Zt,k- l - Ztk)2 = E(Z7,k- l - z7k).
(16.33)
Theory of Stochastic Processes
254
Now apply Abel's partial summation formula (2.25), to get
oo
n
oo
L ai1E(Y�k) = L L ai 1 E(Er+kXr - Et+k- 1 Xr)2
k= -oo
t::
I k=- oo
+ a! 1 E(Z}o) +
ik=l E(Z}k)(a:kll - a:k 1 ))
(16.34)
where the second equality follows by substituting (16.32) for the cases k � 0, and (16.33) for the cases k > 0. (16.29) now follows, noting from (16. 1) that E(E}-kXr) � c}�� and from (16 . 2) that E(Z}k) � c}��+ l · • Putting (16.35)
oo
this result poses the question, whether there exists a summable sequence { ad o such that K < There is no loss of generality in letting the sequence { Sk } o be monotone. If Sm = 0 for m < then Sm+j = 0 for all j > 0, and in this case one may choose ak = 1 , k = O, .. ,m + 1 , and K reduces to (m + l)(s5 + sb. Alter natively, consider the case where Sk > 0 for every k. If we put a0 = �0, and then define the recursion .
.
oo ,
(16.36) ak
is real and positive if this is true of ak- l and the relation -I -I ak - ak1
-
�o;,r k 2ak is satisfied for each k. Since a0 1 C s6 + t;,b � 2a0, we have
K=s
=
� I 6 (£ ak) 2• (i ak) (ao\s5 + sh + 2�ak) k==O k-I k=O
(16.37)
(16.38)
In this case, for k > 0 we find si 2
so that
=
(ai 1 - a:k�J)ai 1
� a :k 2 - a:k : I
(16.39)
m
"
Mixingales
y--2 k L ':! k=O
255
m
" (a k-2 - a -2 $ y-':10-2 + L k-1) k=l
K {m=Ooo ( m
Substituting into (16.38), we get
)
=
}
am
-2 .
- l/2 2 . $ 16 L L s "k 2 k=O
(16.40)
( 1 6.41)
This result links the maximal inequality directly with the issue of the summa bility of the mixingale coefficients. In particular, we have the following corollary. 16.10 Corollary Let
where
K
{ Xt,:1't} be an L2-mixingale of size -!. Then
(
E max s;
< oo.
l �j � n
) ::;; Ki d, t= l
(16.42)
O(k - 1 12-6) for o > 0, as the theorem imposes, then Lk= I S"k2 = O(m2+26) by 2.27, and (I'k= 1 s"i2t 1 12 = O(m- 1 -6) and hence is summable over m. The theorem follows by (16.41). • Proof If
Sk
=
However, it should be noted that the condition -1 2 s "k2 1 < 00
� (�
)
(16.43)
is weaker than Sk = O(k- 1 12-6). Consider the case Sk = (k + 2r 1 12 (log k + 2) - l -E for E > 0, so that k 1 12+6Sk ----7 oo for every 0 > 0. Then
m
m
L s"k 2 = L (k + 2)(log k + 2)2+2E s (m + 2) 2(log m + 2)2+2E, k=O
k=O
(16.44)
and (16.43) follows by 2.31. One may therefore prefer to define the notion of 'size = -!' in terms of the summability condition (16.43), rather than by orders of magnitude in m. However, in a practical context assigning an order of magnitude to Sm is a convenient way to bound the dependence, and we shall find in the sequel that these summability arguments are greatly simplified when the order-of magnitude calculus can be routinely applied. Theorem 16.9 has no obvious generalization from the �-mixingale case to general Lp for p > 1 , as in 15.15, because (16.31) hinges on the uncorrelatedness of the terms. But because second moments may not exist in the cases under consideration, a comparable result for 1 < p < 2 would be valuable. This is attainable by a slightly different approach, although at the cost of raising the mixingale size from -! to - 1 ; in other words, the mixingale numbers will need to be summable. 16.11 Theorem Let {Xr,:1't } ':oo be an Lp-mixingale, 1 < p < 2, of size - 1 , and let Sn
=
I7= tXt ; then
256
Theory of Stochastic Processes
(16.45) Cp is a positive constant. Proof Let Ynk be defined as in (16.11), and apply Burkholder' s inequality (15.18) and then Loeve ' s inequality with r p/2 C1, 1 ) to obtain E I Y,,j P ,;; CpE I t (E..,X, - E,.,_,X,)2 1 ,n where
Cr
=
�
E
n
C�LEI t=:l (Et+kXt-Et+k- IXt! P.
(16.46)
Now we have the mixingale inequalities,
(16.47)
I! Er+kXr-Er+k- I Xtl lp � I Er+kXtllp + I Et+k- I Xtl lp � 2cl,k
0 and I !Er+kXr-Er+k-IXrllp I Zr,k-I - Ztkl ip � I Zt,k- I I p + I ! Ztkl lp � 2 cr�k (16.48) for k > 0, where Ztk is defined above (16. 3 3). Hence, (16.49) £ 1 Ynkl p � 2P Cp�{ L � t=:l (put �o 1), and substitution in (16. 26), with {ado a positive sequence and -ak ak, gives P p-l (16.50) £ { m� I Sj l p) � 2p+l cP I_� 1 ) (iak) (ial-psr)±c�. \jJ Both ak and al-psf can be summable for p > 1 only in the case Sk O(ak), and the conclusion follows. for k <
=
n
=
C
=
k=O
l:'>J � n
k=O
•
t=:l
=
A case of special importance is the linear process of 16.2. Here we can special
ize 16.1 1 as follows: 16.12 Corollary For
1 < p � 2,
X, I}=-ooejut-j• then P E ( max I Sj i P) � Cp (_� ) ( 1 So l + .f c 1 ek 1 + 18-ki)) Pn sup £1 Us ! P ; s \jJ 1 l�j:'>n k=I
(i) if
(ii) if x,
=
=
Ij=Oejut -j• then
E (�� I Sj i ) ,;; Cp �� �n� 1 e, 1 ) t :�� E I U,l''
'
ales 257 Proof In this case, Er- kXr-Er-k- I Xr = 8k Ur-k· Letting {ad 0 be any non-negative constant sequence and a_k = ak> n p p (16.5 1) _L ak- EI Ynki P :::;; Cp_L al l 8ki P_L0 t=I , where c1 = sups II Us l i p i n case (i), and c1 = sups rl l Us l i P in case (ii). Choosing ak = l 8k I and substituting in (16. 26) yields the results. Recall that the mixing ale coefficients in this case are Sm = Ij=m( l 8j I 18 -j I ), so linearity yields a dramatic relaxation of the conditions for the inequalities to be satisfied. Absolute summability of the 8j is sufficient. This corresponds simply to Sm � 0. A mixingale size of zero suffices. Moreover, there is no separate result for £2-bounded linear processes. Putting p = 2 yields a result that is correspondingly superior in terms of mixingale size restrictions to 16. 1 1. Mixing
al(i{)
al(i{)
::;
•
+
1 6.4 Uniform Square-integrability
One of the most important of McLeish's mixingale theorems is a further conse quence of 16.9. It is not a maximal inequality, but belongs to the same family of results and has a related application. The question at issue is the uniform inte grability of the sequence of squared partial sums.
1975b: lemma 6.5; 1977: lemma 3.5) Let {X1,�1} I7 X1 , and v� = I7= Ic7 where c1 is defined in =I { tc }7 I is uniformly integrable, then so is the (16.1)-{16.2) . '::'} =1 · X7 7 = Proof A preliminary step is to decompose X1 into three components. Choose posi tive numbers B and m (to be specified below), let 1 � = 11 and then define (16.52) U1 = X1 - Et+mXr + E1-mX1 (16.53) Yr = Et+mXr 1�-Er-mXr1� (16. 54) Zr = Et+mXrO - 1 �) -Er-mXrO - 1 �), such that X1 = U1 Y1 Z1 • This decomposition allows us to exploit the following collection of properties. (To verify these, use various results from Chapter 10 on 16.13 Theorem (from MacLeish be an �-mixingale of size -�, Sn = If the sequence sequence { maxi::;j ::;nSJIV�
IXr i :5:Bcr J •
+
+
conditional expectations, and consider the cases k � m and k < m separately.) First,
(16. 55) (16.56) for k �
0, where k v m = max { k,m} . Second, EE7-kYt E(E7-(kAm�t l �-E7-mXr 1 �), =
(16. 57)
258
Theory of Stochastic Processes
(16.58)
m m k,m}. E(Xh �) Bc1• EEt-kZr = E(E;-(kNn�r( l - 1�) - E;_mXr( l - 1 �)),
The terms are both zero ifk � and are otherwise bounded where k 1\ = min { by Third, �
(16.59) (16.60)
m E(X7(1 - 1 �)) otherwise. Note E(X7(1 - 1�))!c7 E((Xrfc1)2 1t1Xr'cri>B)) -) 0 B -) oo uniformly in t, by
where the terms are zero for k � and bounded by that = as the assumption of uniform integrability. The inequality
(16.61) 1
X1
Z1,
for � j � n follows from substituting = U1 + Y1 + multiplying out, and applying the Cauchy-Schwartz inequality. For brevity, write
(16.61)
xj = sJtv�, uj = (L/r= I uYtv�, lvn, Yj = lvn. Zj = is equivalent to Xj � 3(uj + yj + Zj), for each j =
(L'r..=I Yr)2 2 (L'r=IZr)2 2
1, .
Then .. ,n. Also let = max 1 :5j:5nXj, and define Um Yn · and Zn similarly; then clearly, Xn � Un + Yn + Z n . For any r. v. and constant M > introduce the notation £:5 = £( so that the object of the proof is to show that supn£:5MCxn) -) as M -) As a consequence of and 9.29,
Xn
"
3(
"
"
"
(16.62) M(X) 1 {X>MIX), 0
)
X� 0 0, (16.62) EEM(Xn) � 3f5Mt3(un + Yn + Zn)
oo.
(16.63)
0, (16.63) ( 16.55) (16.56), Sm -l/2-o), ak m-l -o m. m, ak o l.lr= r U1 E(un) � 8 ((m + 1)m- I -o + i k - I o) (s�m l +o + 2 i. siko) k=m+ 1 k==m+l 0
We now show that for any E > each of the expectations on the right-hand side of can be bounded by E by choosing M large enough. First consider given and we can apply 16.9 to this and assuming = O(m I case, setting = for k > Applying for k � and = k with substituted for Sj in that expression produces -
-
E(un); (16.29)
-
= O(m - ),
(16.64)
259
Mixingales
m
where the order of magnitude in follows from 2.27(iii). Evidently we can choose < £. Henceforth, let be fixed at this value. large enough that A similar argument is applied to but in view of and we and = otherwise. Write, formally, may choose = k = O, s; and where s;
m E(un) E(zn), 2c ty':l2k E(Zrak-Er+1,kZr)2 ...c2t,my':l2k,• ak 0
m
(16.59) (16.670) EE -kzt k < m, (16.65) k � m,
(16. 29) leads to max E((Xrfcr)2 1 I X/ r i> Bl) . (16.66) E(zn) s; 16(m + 1) I�t� c n This term goes to zero as B � oo, so let B be fixed at a value large enough that < £. E(zn) For the remaining term, notice that Y1 Lk= - m+ I �tk where (16.67) For each k, { �tk • �t+k } is a m.d. sequence. If 16.8 is applied for the case 4 and ak 1 for I k I s; m, 0 otherwise, we obtain (not forgetting that for Yj > 0, (maxm)2 maxj{ YJ }) j 1 4) 1 (4) 4 m E(�k), (16.68) 1 ( 1) (2m+ max LYt EQ�) V4En I�j� 3 L s; 4 n I t=I k=-m Vn where Ynk L� =I �tk · Now, given Ynk Yn - I ,k + �nk. we have the recursion E(Y�k) E(Y�- i,k)+4E(Y�- l,k�nk)+6E(Y�-l,k��k) + 4E(Yn- I,��k) + E(��k). (16.69) The �tk are bounded absolutely by 2Bc1; hence consider the terms on the right-hand side of (16. 69). The second one vanishes, by the m.d. property. For the third one, we have (16.70) E(Yn2- I,k-�':l2nk) E(Yn2- ! ,k)(2Bcn)2 s; (2B) Yn2- J Cm2 and for the fourth one, note that by the Cauchy-Schwartz inequality, (16.7 1) El Yn- I,��kl s; (2B)4Yn- J C�. Making these substitutions into (16. 69) and solving the implied inequality and then application of
1
=
p =
=
=
=
3
=
=
=
s;
4
recursively yields
(16.72)
Theory of Stochastic Processes
260
Plugging this bound into (16.68), and applying the inequality a 0 (X) s E(X2 ) for X· � 0 and a > 0, yields finally a
()
r�n ) <- J_E-\Y r�2n) <- ± 4 6(2m + 1 )4 1 1 (2B)4 �M/6\Y M 3 M C?
( 16.73)
·
By choice of M, this quantity can be made smaller than £. Thus, according to ( 1 6.63) we have shown that 0MCxn) < 1 8£ for large enough M, or, equivalently,
(16.74) By assumption, the foregoing argument applies uniformly in n, so the proof is complete. • The array version of this result, which is effectively identical, is quoted for the record. 16.14 Corollary Let {Xnr.?l'nr} be an L2-mixingale array of size -�, and let Sn = 'L7=t C�1, where Cnr is given by (16.6)-( 16.7); if {X�/c�1 } is uniformly integrable, { maxl: o:;j � nS]!v; }�=! is uniformly integrable.
'L7=tXnr and v�
=
Proof As for 16.13, after inserting the subscript n as required.
•
17 Near-Epoch Dependence
1 7 . 1 Definitions and Examples
As noted in § 14.3, the mixing concept has a serious drawback from the viewpoint of applications in time-series modelling, in that a function of a mixing sequence (even an independent sequence) that depends on an infinite number of lags and/or leads of the sequence is not generally mixing. Let (17.1) where V1 is a vector of mixing processes. The idea to be developed in this chapter is that although X1 may not be mixing, if it depends almost entirely on the 'near epoch ' of { Vt} it will often have properties permitting lhe application of limit theorems, of which the mixingale property is the most important. This idea goes back to lbragimov (1962), and had been formalized in different ways by Billingsley (1968), McLeish (1975a), Bierens (1983), Gallant and White ( 1 988), Andrews (1988), and Potscher and Prucha (1991a), among others. The following definitions encompass and extend most existing ones. Consider first a definition for sequences.
{ Vr }�:. possibly vector-valued, on a ;:: = cr(V probability space (Q,':J,P), let :!F�� r-m • ··· .Vt+m), such that { :!f��;:: } ;;;=O is an increasing sequence of a-fields. If, for p > 0, a sequence of integrable r.v.s { X1 } � : satisfies 17.1 Definition For a stochastic sequence
(1 7.2) · where Vm ----7 0, and {d1 }�: is a sequence of positive constants, X1 will be said to be near-epoch dependent in Lp-norm (Lp-NED) on { V1 }�:. o Many results in this literature are proved for the case p = 2 (Gallant and White, 1988, for example) and the term near-epoch dependence, without qualification, may be used in this case. As for mixingales, there is an extension to the array case.
{ { Vnrl7:'- oo }';= l • possibly vector-valued, on a probability space (Q,:!f,P), let :!f��"!-m = a(Vn ,t- m•· ·· · Vn,t+m). If an integrable array { {Xnrl7:'- oo}';= l • satisfies 17.2 Definition For a stochastic array
(17.3) where Vm -7 0, and {dnr} is an array of positive constants, it is said to be Lp-NED on { Vn1 } . o
262
Theory of Stochastic Processes
We discuss the sequence case below with the extensions to the array case being easily supplied when needed. The size terminology which has been defined for mixing processes and mixingales is also applicable here. We will say that the sequence or array is Lp-NED of size -
I Xr -E(Xrl �::z:) liP :s; I Xr - J.Lr l p + I E(Xr - J.Lr l ��:Z:) I p 2 I Xr - J.Lr l p, (17.4) where J.Lr E(X1). The role of the sequence {d1} in (17. 2 ) is usually to account for the possibility of trending moments, and when I Xr -J.Lr l p is uniformly bounded, we should expect to set d1 equal to a finite constant for all t. However, a drawback with the definition is that { d1} can always be chosen in such a way that . t { I Xr -E(XdrtI �::z:) liP} 0, mf for every m, so that the near-epoch dependence property can break down in the limit without violating (17. 2 ). Indeed, (17. 2) might not hold except with such a choice of constants. In many applications this would represent an undesirable weakening of the condition, which can be avoided by imposing the requirement d1 2can case, dnt 2 I Xnr -J.Lnr l p · Under this restriction we I X1set-J.Lr liP, :s;or1 forwiththenoarray loss of generality. :s;
=
=
:s;
:s;
vm
Near-epoch dependence is not an alternative to a mixing assumption; it is a property of the mapping from to not of the random variables themselves. The concept acquires importance when is a mixing process, because then inherits certain useful characteristics. Note that �::z:) is a finite-lag, �::Z:J;B-measurable function of a mixing process and hence is also mixing, by 14.1. Near-epoch dependence implies that is 'approximately' mixing in the sense of being well approximated by a mixing process. And as we show below, a near-epoch dependent function of a mixing process, subject to suitable restrictions on the moments, can be a mixingale, so that the various inequalities of § can be exploited in this case. From the point of view of applications, near-epoch dependence captures nicely the characteristics of a stable dynamic econometric model in which a dependent variable depends mainly on the recent histories of a collection of explanatory variables or shock processes which might be assumed to be mixing. The symmetric dependence on past and future embodied in the definition of a Lp-NED function has no obvious relevance to this case, but it is at worst a harmless generalization. In fact, such cases do arise in various practical contexts, such as the application of two-sided seasonal adjustment procedures, or similar smoothing filters; since most published seasonally adjusted time series are the output of a two-sided filter, none of these variables is strictly measurable with out reference to future events.
{ V1} { X1},{ V1}
{ X1}
E(X11
{X1}
16.2
X1
17.3 Example Let
V1,
{ V1} :: be a zero-mean, L0-bounded scalar sequence. and define
Near Epoch Dependence
263 (17.5)
}=-oo
Then, by the Minkowski inequality,
I Xt-ECXtl ���;;:) liP = I J.=m£+! (8j( Vt-r E;�;;:vt-1) + 8-j( Vt+J- E;�;;:vt+j)) I P (17.6)
), 8 8 1 I(I jl jl Vm I,J=m + { 81}
and dt = 2supsi1 Vs l lp, all t. Clearly, Vm --7 0 if + where = the sequence is absolutely summable, and vm is of size -
e1
1 81 18-J 1
The second example, suggested by Gallant and White (1 988), illustrates how near epoch dependence generalizes to a wide class of lag functions subject to a dynamic stability condition, analogous to the summability condition in the linear example.
{ V1}
17.4 Example Let be a Lp-bounded stochastic sequence for p � 2 and let a sequence be generated by the nonlinear difference equation
{ X1}
where
{f1(.,. ) }
Xt
=
frC Vt, Xt-d , is a sequence of differentiable functions satisfying
(1 7.7)
dft(v,x) I � b < 1 . (1 7.8) �; I dx As a function of x, f1 is called a contraction mapping. Abstracting from the stochastic aspect of the problem, write v1 as the dummy first argument of f1• By repeated substitution, we have s
(1 7.9) and, by the chain rule for differentiation of composite functions,
dgt !' I dVt-j
-- � br I .
I ddVftt--j l . j
( 1 7. 1 0)
' --
Define a ��-m/B-measurable approximation to g1 by replacing the arguments with lag exceeding m by zeros: ( 1 7. 1 1) By a Taylor expansion about 0 with respect to
v1_1 for j > m, ( 1 7 . 1 2)
264
Theory of Stochastic Processes
where * denotes evaluation of the derivatives at points in the intervals [0, Vr-jl Now define the stochastic sequence {X1} by evaluating g1 at (V1Y1- 1 , ). Note that • . .
(17. 1 3) by 10.12. The Minkowski inequality, ( 1 7 . 1 2), and then (17. 10) further imply that
� �
00
L II Gr-jVt-j ll z
j=m+l
b j- I II Fr-j Vt-j l b L j=m+l 00
(17. 14) where Gr-j is the random variable defined by evaluating [(d!dv1-j)g1]* with respect to the random point (V1-j , Vr-j- J , ... ), and Fr-j bears the corresponding relationship with (dldv1-)fr-j· X1 is therefore L2-NED of size with constants dr « sup � r ii F V ll z , if this norm exists. In particular, Holder ' s inequality allows us to make this derivation whenever II Vr llzr and II Fr l brtr- 1) exist for r > 1, and also if II Vr ll 2 < oc and F1 is a.s. bounded. o s
s
-oc,
s
1 7. 2 Near-Epoch Dependence and Mixingales
The usefulness of the near-epoch depencence concept is due largely to the next theorem. 17.5 Theorem Let {Xr}:'oo be an Lr-bounded zero-mean sequence, for r
> 1.
(i) Let { V1} be a-mixing of size -a. If X1 is Lp-NED of size b on { V1} for 1 � p < r with constants { d1} , { X,,�_:oo} is an Lp-mixingale of size -min { b, a(l!p - 1/r)} with constants c1 « max { IIXr ll n drJ . (ii) Let { V,} be -mixing, of size -a. If X1 is Lp-NED of size -b on { V1} for 1 � p :S r with constants { d1 } , {X1,�:""} is an Lp-mixingale of size -min{ b, a ( l 1 /r) } , with constants c1 « max { I ! Xr ll , d,J . -
-
= £(. 1 �!) where �� = cr(Vs, ... , V,). Also, for m � [m/2] , the largest integer not exceeding m/2. By the Minkowski
Proof For brevity we write £�(.)
1 , let k
=
inequality,
IIEL:mXr ll p � IIE.:.:m (X, - E��fXr) liP + IIE_:.:;m(E���Xr) liP ,
and we bound each of the right-hand-side terms. First,
(17 . 1 5)
Near Epoch Dependence
265
( 1 7. 1 6)
using the conditional Jensen inequality and law of iterated expectations. Second, Er�ZX1 is a finite-lag measurable function of Vr-h···,Vt+b and hence mixing of the same size as { V1 } for finite k. Hence for part (i), we have, from 14.2, ( 1 7 . 17) IIE_: .:;m (E���X,) Ii p � 6a;llp- llr ii .Er�ZX,II , � 6a;llp- llr iiX,II , . Combining ( 1 7. 16) and (17. 1 7) into ( 1 7. 1 5) yields ( 1 7. 1 8) I I E_:.:;mx, IIP � max { II Xr ll ,dr} sm , p where Sm = 6aY - llr + Vk. Also, applying 10.28 gives ( 1 7. 19) II X, - E_:.;!;mXr iiP � 2 II Xr - Er��Xrii P � 2drVm � 2drSm · Since Sm is of size -min {b, a( l/p - 1/r) } , part (i) of the theorem holds with c1 « max { II X, I I ,d1 } . The proof of part (ii) is identical except that in place of ( 17 . 1 7) we must substitute, by 14.4, ( 17.20) IIE.:.:m (E��ZX,) II p � 2¢1- 11'IIE;�ZXr llr � 2¢1- 11'11 Xr llr· • Let us also state the following corollary for future reference. 17.6 Corollary Let { {Xnr};:'- oo }'::'= I be an £,-bounded zero-mean array,
r > 1. (i) If Xnt is Lp-NED of size -b for 1 � p < r with constants { dnr l on an array { Vnr} which is a-mixing of size -a, then {Xn1,:1'�, -oo } is an Lp-mixingale of size -min { b, a(l/p - 1/r) } , with respect to constants Cnr « max { II Xntll ,dnr } . (ii) If Xnr is Lp-NED of size -b for 1 � p � r with constants {dnr} on an array { Vnr} which is ¢-mixing of size -a, then {Xnr,:J�, -oo } is an Lp-mixingale of size -min{ b, a(l - 1/r) } , with respect to constants Cnr « max{ I IXnr ll ,dnr } .
Proof Immediate on inserting
last proof.
•
n before the t subscript wherever required i n the
The replacement of V1 by Vn1 and :1<� by ��s is basically a formality here, since none of our applications will make use of it. The role of the array notation will always be to indicate a transfonnation by a function of sample size, typically the normalization of the partial sums to zero mean and unit variance, and in these cases :1< �s = :1<.� for all n. Reconsider the AR process of 14.7. As a special case of 17.3, it is clear that in that example X1 is Lp-NED of size -oo on Z1, an independent process, and hence is a Lp-mixingale of size -oo, for every p > 0. There is no need to impose smoothness assumptions on the marginal distributions to obtain these properties, which will usually be all we need to apply limit theory to the process. These results allow us to fine-tune assumptions on the rates of mixing and near epoch dependence to ensure specific low-level properties which are needed to prove convergence theorems. Among the most important of these is summability of the sequences of autocovariances. If for example we have a sum of terms, Sn = I�= IX,,
Theory of Stochastic Processes
266
we should often like to know at what rate the variance of this sum grows with n. Assuming E(X1) = 0 with no loss of generality, E(S;)
=
t E(XlJ + 2� (� E(X,X,.m)) ,
a sum of n 2 terms. If the sequence { X,} is uncorrelated only the n variances appear and, assuming uniformly bounded moments, E(S�) = O(n). For general dependent processes, summability of the sequences of autocovariances { E(X1X1-j) . j E IN } implies that on a global scale the sequence behaves like an uncorrelated sequence, in the sense that, again, E(S�) = O(n). For future reference, here is the basic result on absolute summability. To fulfil subsequent requirements, this incorporates two easy generalizations. First we consider a pair (X1, Y1), which effectively permits a generalization to random vectors since any element of an autocovariance matrix can be considered. To deal with the leading case discussed above we simply set Y1 = X1• Second, we frame the result in such a way as to accommodate trending moments, imposing L,-boundedness but not uniform L,-boundedness. It is also noted that, like previous results, this extends trivially to the array case. 17.7 Theorem Let {Xr.Ytl be a pair of sequences, each Lr1(r- l )-NED of size - 1 , with respect to constants {d},dr} for r > 2, on either (i) an a-mixing process of size -r/(r - 2), or (ii) a $-mixing process of size -r/(r - 1), where d} « IIXrllr and dr « II Yt II,. Then the sequences
(17.21) are summable for each t. Also, if arrays { Xnr,YnrJ are similarly Lrl(r- n-NED of size -1 with respect to constants {�1,d�r } with cfnt « IIXntll r and drr « II Ynr ll n the sequences
(17.22) are summable for each n and t. o Since r > 2, the constants appearing in ( 17.21) and ( 17.22) are smaller (absolutely) than the autocorrelations, and the latter need not converge at the same rate. But notice too that rl(r- 1 ) < 2, so it is always sufficient for the result if the functions are L2-NED. Proof As before, let £�(.) = E(. l ��), and let k = [m/2]. By the triangle inequal ity, ( 17.23) I E( Yt+mXt) I ::;; I E( Yt+m(Xt - Fr�ZXr)) I + I E( Yt+mE��ZXr) 1 . The modulus and Holder inequalities give
Near Epoch Dependence
I E(Yt+rn(X, - Er�ZX,)) I � II Yt+rn ll riiXr - Er�ZXrllrt(r- 1 ) � II Yr+m ll�v{,
267 (17.24)
where v� is of size - 1 . Also, applying 17.5(i) with p = rl(r - 1), a = r/(r - 2), and b = 1,
I E(Yt+rnEr�fX,) I = I E(Er�ZYt+rnE��ZXr) I � IIEr�ZYt+rn II rt(r- 1) 11-Fr�zx,) II r � II E�!kYt+m ll r/(r-1) 11Xrll r
� IIX, II r c ;+msk.
(17.25)
where cr « max{ II Y l l ndr } and t;� is of size - 1 . Combining the inequalities in (17.24) and (17.25) in (17.23) gives r
I E(X,Yt+m) I � max { II Yr+rn II r di, II X, II r c r+rn } (v{ + /; [)
(17.26)
where �rn = (v{ + /; [) is of size - 1 . This completes the proof of (i). The proof of (ii) is similar using 17.5(ii) with p = a = r/(r - 1 ) and b = 1. For the array generalization, simply insert the n subscript after every random variable and scaling constant. The argument is identical except that 17.6 is applied in place of 17.5. • 1 7 . 3 Near-Epoch Dependence and Transformations
Suppose that (X1 , , ... ,Xw) ' = X, = g( ... ,V,- I , Vt,Vt+ 1 , ... ) is a v-vector of Lp-NED functions, and interest focuses on the scalar sequence { <)>r(X,) } , where <)>,: 'U' H IR, 'U' � IR v, i s a �v/�-measurable function. We may presume that, under certain condi tions on the function, <)>,(X,) will be near-epoch dependent on { V, } if the elements of X, are. This setup subsumes the important case v = 1 , in which the question at issue is the effect of nonlinear transformations on the NED property. The depen dence of the functional form <)>r(.) on t is only occasionally needed, but is worth making explicit. The first cases we look at are the sums and products of pairs of sequences, for which specialized results exist. 17.8 Theorem Let X, and Y, be Lp-NED on { V,} of respective sizes -
II (X, + Y,) - E���(X, + Y,) l i P � II X, - E;��Xr l l p + II Y, - Er��Yt li P � ctiv� + drv� � dtYrn,
(17.27)
Theory of Stochastic Processes
268
y wh ere d = max { a.Xi, d 1 } and Vm = vXm + vmY = O(m-min { q>x,q> r) ) . • A variable that is Lq -NED is Lp-NED for 1 5 p 5 q, by the Liapunov inequality, so there is no loss of generality in equating the orders of norm in this result. The same consideration applies to the next theorem. 17.9 Theorem Let X1 and Y1 be Lz-NED on { V1} of respective sizes -
E I XtYr - E��:cxrYr) l
!
E (XrYr - XrEr�: Yr) + (XrEr�: Yr - (Er�:x,)(Er�: r,)) - E��: ((Xr - Er�:Xr)(Yr - E��: Yr)) l 5 II X, Ibll Y, - Er�:r,lb + IIEr�:r,lbii Xr - Er�:xr lb + II Xr - Er�:Xr lbll Yr - Er�: Yr lb 5 I I Xr ll zdrv;; + I! Yr ll z�! + dfv�div! =
(17.28) where dr
=
max{ II Xr ll zdr, II Yr ll zd?, d;d? } and Vm = v ;; + v! +V�v! = O(m-min {cpx,CJlrl ) .
•
In both of the last results we would like to be able to set Y, = Xr+J for some finite j. A slight modification of the argument is required here. 17.10 Theorem If X, is Lp-NED on { V, } , so is Xr+J for 0 j < Proof If X1 is Lp-NED, then
<
oo
.
ll Xr+j - E(Xr+j I ���J�:) liP S 2 l1 Xr+j - E(Xr+j I ��!)�:) li P x ( 17.29) -< d't Vm, using 10.28, where d;x 2J{+J· We can write ( 17.30) I I Xr+j - E(Xr+j l ���:) lip 5 d;�;,.., where v0, m 1 v;. = . Vm-) • m > j and v/n is of size
-
{
<"
_
Near Epoch Dependence size -min {
269
o
{=
By considering Z1 = Xr- [k/2] Yr+k- [kl2] , the L1 -NED numbers can
be
given here as
m � [k/2] + 1 Vo, Y m- [k/21 _ 1 , m > [k/2] + 1 :n where v = v� + v� + v� v�, and the constants are 4d}- [kl2] d;+k- [kl2] • assummg that d} and d; are not smaller than the corresponding L2 norms. v
m
All these results extend to the array case as before, by simply including the extra subscript throughout. Corollary 17.11 should be compared with 17.7, and care taken not to confuse the two. In the former we have k fixed and finite, whereas the latter result deals with the case as m ----7 oo The two theorems naturally complement each other in applying truncation arguments to infinite sums of prod ucts. Applications will arise in subsequent chapters. More general classes of function can be treated under an assumption of continu ity, but in this case we can deal only with cases where <\lr(X1) is L2-NED. Let <j>(x): lr 1--7 !R, lr s;; !Rv be a function of v real variables, and use the taxicab metric on !Rv, .
v I XI - X2 , "" p(x I , X 2) =Li i1
(17.3 1 )
i= l
to measure the distance between points x 1 and x2. We consider a set of results that impose restrictions of differing severity on the types of function allowed, but offer a trade-off with the severity of the moment restrictions. To begin with, impose the uniform Lipschitz condition,
(17.32) where B1 is a finite constant. 17.12 Theorem Let Xit be L2-NED of size -a on
{ V1} for i = 1 , . . . ,v, with constants dit . If (17.32) holds, { <j>1(X1)} is also L2-NED on { Vr } of size -a, with constants a finite multiple of maxd dit } . Proof Let �� denote a ?:1'���-measurable approximation to <j>r(X1). Then (17.33) li<J>r(Xr) - E���<J>tCXr) lb � II <J>rCXr) - �r lb by 10.12. Since <j>1(E���1) is an ?:1'���-measurable random variable, ( 17.33) holds for this choice of <j>1, and by Minkowski ' s inequality, A
v
� BrL II Xir - Er��Xir lb i= l v � BrL ditV im i=l
Theory of Stochastic Processes
270
( 1 7.34) where d, = vB,maxi {dit } and Vm = v-1Ii'= IVim• the latter sequence being of size -a by assumption. •
If we can assume only that the Xit are Lp-NED on V, for some p E [1 ,2), this argument fails. There is a way to get the result, however, if the functions <j>, are bounded almost surely. 17.13 Theorem Let Xit be Lp-NED of size -a on { V, } , for 1 s p s 2, with constants dir. i = 1 , ... ,v. Suppose that, for each t, I <j>,(X,) I s M < oo a.s., and also that I <J>,(X 1 ) - <j>,(X2) 1
s
min {B,p (X1 ,X2), 2M} a.s.,
(17.35)
where B, < oo . Then { <j>,(X,) } is L -NED on { V, } of size -ap/2, with constants a 2 finite multiple of max-i{ £iii? } . Proof For brevity, write <j>; ·
=
1 � - 7 1 s 2M min{Z, 1 } . Then E(<j>�
<j>1(X1), and let Z ·
=
-7? f { Zsl } (<j>} - <j>7)2dP + J {Z>l!(<j>} - <j>7)2dP S (2M)2 (f { Zs l } z 2dP + f iZ>II dP ) =
S (2M)2E(ZP)
=
where B lt
=
B,p(X I , X2)12M, so that
By,E( p(X 1 , X2f) ,
(17.36)
B�12(2M)1 -P12. Combining (17.33) with (17.36), we can write
l!(l>t(X,) - E��Z!<J>,(Xr) lb
s B �ti!P (X,,E���X,) II�12
(17.37) where d, = B ,vP12maxi{d�? } and V m = (v-1 I/= Vim)P12, which is of size -ap/2 by 1 1 assumption. •
An important example of this case (with v = 1) is the truncation of X,, although this must be defined as a continuous transformation. 17.14 Example For M
> 0 let
Near Epoch Dependence
�(x) or, equivalently, �(x) so set B
=
=
l
x, M, -M,
271
l xl $ M x>M X < -M
(17.38)
x1 { 1xlgf) +M(x! l x l ) l { lx i>MJ · In this case I �(XI ) - � (X2) I $ l XI - X2 1 ' 1 , and 17.13 can be used to show that { �(X1)} is L2-NED if {Xr } is =
{
Lp -NED. The more conventional truncation, Xr, I Xrl $ M Xr 1 { 1 x i�M) =
(17.39)
0, otherwise,
cannot be shown to be near-epoch dependent by this approach, because of the lack of continuity. o A further variation on 17.12 is to relax the Lipschitz condition ( 17.32) by letting the scale factor B be a possibly unbounded function of the random vari ables. Assume
(17.40) where, for each t,
(17.4 1 )
i s a non-negative, 152v/:B-measurable function. To deal with this case requires a lemma due to Gallant and White (1988). 17.15 Lemma Let B and p be non-negative r.v.s and assume < oo, and II B p ll r < oo, for q � 1, and r > 2. Then
l l p ll q < oo , IIB I I q!(q - 1 )
l II B p llz $ 2 ( ll p ii� -2 II B II �/Zq- 1 ) 1 1 Bp l l �) 1 12(r- ) .
( 1 7.42)
Proof Define
(17.43) and let B 1
=
l ( Bp:> C } B . Then by the Minkowski inequality, I I Bp l lz $ II B IP I I z + IICB - B I ) P IIz .
( 1 7.44)
) 1 12
The right-hand-side terms are bounded by the same quantity. First,
IIB I P I Iz =
J( Bp:> C(Bp)2dP
( 17.45)
Theory of Stochastic Processes
272
applying the Holder inequality. Second, IICB - B t )P II z =
(J
Bp > C
) 112
(Bp)2dP
(17.46) where the first inequality follows from r > 2 and Bp/C > 1 . Substituting for C in (17.45) and (17.46) and applying (17.44) yields the result. • The general result is then as follows. 17.16 Theorem Let { X1} be a v-dimensioned random sequence, of which each ele ment is Lz-NBD of size -a on { V1} , and suppose that cpr(X1) is Lz-bounded. Suppose further that for 1 :$; q :$; 2, ll p(Xr,E�::tr) ll q < oo, II Br(Xr,Er:::;:xt) ll q!(q - 1 ) < oo, and for r > 2, II B,(Xt ,Er::::xt) P (Xr, E�:;;:xt) II r < oo , Then { cj)r(X1) } is Lz-NBD on { V1} of size -a ( r - 2)/2( r - 1 ). Proof For ease of notation, write p for p(X1,E�::;:x1) and B for Br(X1,E�::t1). As in the previous two theorems, the basic inequality (17.33) is applied, but now we have llcp(Xr) - E�:;::cp(Xr) liz :$; ilcp(Xr) - <J>(E�::;:xr) liz :$; II Bp llz ( :$; 2 11 p ll � r-2)/2(r-l ) l i B II �/(� � wr- 1) II Bp 11 12 r- 1 ), where the last step is by 17.15. For q ll p ll q
:$;
IIP II 2
:$;
v
:$;
2,
L 1 1Xit - Er::::xitll 2 �1
(17.47)
:$;
v
_L ditv im �1
=
dtYm ,
(17.48)
where d1 = v maxd dtt } and Vm = v-1Lt=IVim • which is of size -a by assumption. Hence, under the stated assumptions, (17 .49) ll $tCXr) - Er:Z:cptCXr) llz :$; d�v�r- Z)IZ(r- 1 ) , where d; = II B II �/(�� f �Cr-l ) II Bp ll�'2<'- l )d1. • Observe the important role of 17.15 in tightening this result. Without it, the best we could do in ( 17.47) would be to apply Holder's inequality directly, to obtain
Near Epoch Dependence
273 ( 1 7.50)
The minimum requirement for this inequality to be useful is that B is bounded almost surely permitting the choice q = 1, which is merely the case covered by 1 2 17.12 with the constant scale factors set to ess sup Br(X ,X ). The following application of this theorem may be contrasted with 17.9. The moment conditions have to be strengthened by a factor of at least 2 to ensure that the product of L2-NED functions is also L2-NED, rather than just L 1 -NED. There is also a penalty in terms of the L2-NED size which does not occur in the other case. 17.17 Example Let X, = (Xr.Y1) and <)>(X1) = X1Y1• Assume that 1/ Xr /b < DO and II Yr /b < DO for r > 2, and that X1 and Y1 are L2-NED on { Vr} of size -a. Then
1 x } r} - x;r; I � I x} I I r} - r7 1 + 1 x} - x; I I r7 1
� ( I X} I + I Y7 1 )( I Y} - Y7 1 + I X} - X7 1 ) ( 1 7.5 1)
defining B and p. For any
q
i n the range [4/3,4], the assumptions imply
II B(X} ,x7 ) /l qt(q - l ) � II X} II qt(q- 1 ) + II Y7 11 qt(q- 1)
< DO ,
(1 7.52) (17.53)
and
I /B(X} ,X7)p(X } ,X7) // , � II X} Ib ii Y} Ib + II X} Ibii Y7 /b + II X} I/�, + II X} lb i i X7 1 b + II r7 1 12rll Y} lb + II r7 1 1�, + II Y7 1bi/ X} /b + II Y7 1 b i i X7/ b < DO ,
( 1 7.54)
Putting X } = X1 and x; = E/:.':;X1, the conditions of 17.16 are satisfied for the range [4/3,2] and X1Y1 is L2-NED of size -a(r - 2)/2(r- 1). o
q
in
1 7 .4 Approximability
In § 1 7.2 we showed that Lp-NED functions of mixing processes were mixingales, and most of the subsequent applications will exploit this fact. Another way to look at the Lp -NED property is in terms of the existence of a finite lag approx imation to the process. The conditional mean E(X1 I ���:;:) can be thought of as a function of the variables V1-m , ... ,Vt+m' and if { V1} is a mixing sequence so is { E(X,I ���:;:) } , by 14.1 . One approach to proving limit theorems is to team a limit theorem for mixing processes with a proof that the difference between the actual sequence and its approximating sequence can be neglected. This is an alternative way to overcome the problem that lag functions of mixing processes need not be m1xmg.
Theory of Stochastic Processes But once this idea occurs to us, it is clear that the conditional mean might not be the only function to possess the desired approximability property. More gener ally, we might introduce a definition of the following sort. Letting V1 be l x 1 , we shall think of h7 : !R1(2m+ l ) � lR as a ?F:�;::/1�-measurable function, where ?}�:�;;: =
a( Vr-m , ... , Vt+m).
17.18 Definition The sequence {X1 } will be called Lp-approximable (p > 0) on the sequence { V1} if for each m E !N there exists a sequence { h7} of ?}�:�;;:-measurable random variables, and
(17 .55) where {d,} is a non-negative constant sequence, and Vm -7 0 as m -7 oo. {Xr} will also be said to be approximable in probability (or Lo-approximable) on { V,} if there exist { h7 } , { d1} , and { vm } as above such that, for every o > 0, (17.56) The usual size terminology can be applied here. There is also the usual extension to arrays, by inclusion of the additional subscript wherever appropriate. If a sequence is Lp-approximable for p > 0, then by the Markov inequality
(17.57) P( ! Xr - h7 1 > dro) s; (dro) -p i! Xr - h7 11� s; v:n , where v/n = ()-Pvf:t; hence an Lp-approximable process is also Lo-approximable. An Lp-NED sequence is Lp-approximable, although only in the case p = 2 are we able to claim (from 10.12) that E(Xrl ?f ��;;:) is the best Lp-approximator in the sense that the p-norms in ( 17.55) are smaller than for any alternative choice of h7. 17.19 Example Consider the linear process of 17.3. The function
(17.58) j:::. -m is different from E(Xr I ?f��;;:) unless { Vr} is an independent process, but is also an Lp-approximator for X1 since (17.59) p j=m+ l where Vm = I,j:::.m+ l( l ejl + I e_j l ) and dr = SUPr ll Vr llp· D 17.20 Example In 17.4, the functions g7 are Lp-approximators for X1, of infinite size, whenever sups�r ii Fs Vs ll p < oo . D One reason why approximability might have advantages over the Lp-NED prop erty is the ease of handling transformations. As we found above, transferring the Lp-NED property to transformations of the original functions can present diffi culties, and impose undesirable moment restrictions. With approximability, these difficulties can be largely overcome. The first step is to show that, subject to Lr-boundedness, a sequence that is approximable in probability is also Lp-approx imable for p < r, and moreover, that the approximator functions can be bounded for
Near Epoch Dependence
275
each finite m. The following is adapted from Potscher and Prucha (1991a). 17.21 Theorem Suppose { X1} is L,-bounded, r> 1, and L.v-approximable by h 7. Then
for 0 < p < r it is Lp-approximable by h7 = h7 1 1 l h7l.,; d1Mml • where Mm < oo for each
mE
!N .
Proof Since h7 i s an L0-approximator of X1, we may choose a positive sequence
{ 8m } such that Dm --7 0 and yet P( l X1 - h7 1 > d1Dm) :::; Vm --7 0 as m --7 oo. Also choose a sequence of numbers { Mm } having the properties Mm --7 =, but Mmvm --7 0. For example, Mm = v::n 1 12 would serve. There is no loss of generality in assuming supmM::n 1 :::; 1 . By Minkowski ' s inequality we are able to write II Xr - 'fz 7 1 1p :::; A }m +A 7m + A ;m , (17.60) where
A}m = A 7m = A ;m =
II (Xr - h7)1 ( 1X1- h';'l > dr'im J IIP ' II (Xr - h ';') 1 { IX1- h';'l<;; dr)m, l hrl > drMm) l i P ' II (Xr - h7)1 ( IX1- h';'l<;; dr)m,lh71<;; drMmJ i lp ·
HOlder's inequality implies that
(17.61 ) II XYII P :::; II X II pq II Yllpq/(q- 1 ) · q > 1. Choose q = rip and apply (17.61) to A�1 for i = 1,2,3. Noting that II Xr - 'fz 7 ll r :::; II Xr ll , + drMm, again by Minkowski's inequality, and that 11 1 £ 1 1� = P(E), we obtain
the following inequalities. First,
A 1r m :::; II Xr - h� mt il ,P( l Xr - �hmt l > drDm) :::; dr( II Xr 1dr ii , M::n 1 + 1)Mmvm.
(17.62)
Second, observe that
{ 1 Xr - h7 1 :::; drDm , l h7 1 > dr Mm } C { I Xr l > dr(Mm - Dm) } and hence, when Mm > Dm , P( I Xr - h7 1 :::; drDm , l h7 1 > drMm) :::; P( I Xr l > dr(Mm - Dm)) :::; II Xr ll�ti; '(Mm - Dm) - r
(17.63)
by the Markov inequality, so that
A7m :::; 11 Xr - h7 11 , P( l Xr - h7 1 :::; drDm , l h7 1 > drMm) :::; ( II Xr ll r + drMm) II Xr ll�ti;'(Mm - Dmf' (17.64) :::; dr( II Xr ldr ii�+ I + II Xrldr ii�Mm(Mm - Dm) - r. (The final inequality is from replacing M::n 1 by 1 .) And lastly, A ;m :::; drDm (17.65) in view of the fact that 'h7 = h7 on the set { I 'fz7 1 :::; d1Mm } . We have therefore
� . ·�v,
5 u1
0wcnasnc Processes
established that
(17.66) where
v/n
=
MmVm + Mm(Mm - '6m) -' + Om
(17.67)
and v/n � 0 by assumption, since r > 1, and
d�
=
2d,max { II X/d, ll ,, II X/dr ll�+l , II Xrldr ll�, 1 } .
•
(17.68)
If Lo-approximability is satisfied with d1 « II X1 1! , then d; = 2d1• The value of this result is that we have only to show that the transformation of an Lo-approximable variable is also Lo-approximable, and establish the existence of the requisite moments, to preserve Lp-approximability under the transformation. Consider the Lipschitz condition specified in (17.40). The conditions that need to be imposed on B(.,.) are notably weaker than those in 17.16 for the Lp-NED case. ,
17.22 Theorem Let h"! = (h'J.1, ... ,h�1) be the Lo-approximator of X1 = (X1 1, ... ,Xv1)' of size -
then <MXr) is Lo-approximable of size
.
Proof Fix 8 > 0 and M > 0, and define d1 = Li=tdif. The Markov inequality gives
P( I <MXr) - <j>,(h"!) I > d1'6) :::;; P(B1(Xr.h"!)p(X1, h"!) > d1'6, Br(X1,h"!) > M) + P(Br(X1,h"!)p(X1,h"!) > d1'6, BtCXt,h"!)
:::;;
M)
(17.69) Since M is arbitrary the first term on the majorant side can be made as small as desired. The proof is completed by noting that
(� I Xit - h'i\1 > d,'61M) :::;; P (Q { ! Xit - hirl > dir'61M })
P(p(X1,h"!) > d1'61M) = P
v :::;; .L P( I xit - hit I > dit'6!M) i=l v :::;; L Vim � 0 as m � oo. • i= l
(17.70)
It might seem as if teaming 17.22 with 17.21 would allow us to show that, given an L,-bounded, L2 -NED sequence, r > 2 , which is accordingly L2-approximable and hence Lo-approximable, any transformation satisfying the conditions of 17.22 is L2 -approximable, and therefore also L2-NED, by 10.12. The catch with circum-
Near Epoch Dependence
277
venting the moment restrictions of 17. 16 in this manner is that it is not possible to specify the L2-NED size of the transformed sequence. In (17 .67), one cannot put a bound on the rate at which the sequence { 8m } may converge without specifying the distributions of the Xit in greater detail. However, if it is possible to do this in a given application, we have here an alternative route to dealing with trans formations. Potscher and Prucha (1991a), to whom the concepts of this section are due, define approximability in a slightly different way, in terms of the convergence of the Cesaro-sums of the p-norms or probabilities. These authors say that Xt is Lp-approximable (p > 0) if limsup n�oo
1
n
n
L II Xt - h"/ llp --7 0 as m --7
(17.71)
00 ,
t=I
and is Lo-approximable if, for every 8 > 0, n 1 limsup - _L P( I Xt - h"/ 1 > 8) n t=I n�oo
--7
0 as m --7 =
.
(17.72)
It is clear that we might choose to define near-epoch dependence and the mixingale property in an analogous manner, leading to a whole class of alternative converg ence results. Comparing these alternatives, it turns out that neither definition dominates, each permitting a form of behaviour by the sequences which is ruled out by the other. If (17.55) holds, we may write
(17.73) so long as the limsup on the majorant side is bounded. On the other hand, if (17.71) holds we may define Vm
=
limsup ! n�oo
i
n t= I II Xt - h"/ I J p ,
(17.74)
and then dt = supm{ I IXt - h": J lp lv m } will satisfy 17.18 so long as it is finite for finite t. Evidently, (17.71) permits the existence of a set of sequence coordin ates for which the p-norms fail to converge to 0 with m, so long as these are ultimately negligible, accumulating at a rate strictly less than n as n increases. On the other hand, (17.55) permits trending moments, with for example dt = O(tJ...), A > 0, which would contradict (17.71). Similarly, for 8m > 0, and Vm > 0, define dtm by the relation
(17.75) and then, allowing Vm --7 0 and 8m --7 0, define dt = supmdtm · (17.56) is satisfied if dt < oo for each finite t; this latter condition need not hold under (17. 72). On the other hand, (17. 72) could fail in cases where, for fixed 8 and every m, P( I Xt - h"/ 1 > 8) is tending to unity as t --7 = .
IV THE LAW OF LARGE NUMBERS
18 Stochastic Convergence
1 8 . 1 Almost Sure Convergence
Almost sure convergence was defined formally in § 12.2. Sometimes the condition is stated in the form
)
(
P limsup i Xn - XI > E = 0, for all E > 0. (18.1) n-';oo Yet another way to express the same idea i s to say that P( C) = 1 where, for each ro E C and any E > 0, I Xn(W) - X(ro) I > E at most a finite number of times as we pass down the sequence. This is also written as P( I Xn - XI > E, i.o.) = 0, all E > 0,
(18.2)
where i.o. stands for 'infinitely often ' . Note that the probability in (1 8.2) is assigned to an attribute of the whole sequence, not to a particular n. One way to grasp the 'infinitely often ' idea is to consider the event U';;'=m { I Xn - X I > E } ; in words, 'the event that has occurred whenever { I Xn - XI > E} occurs for at least one n beyond a given point m in the sequence ' . If this event occurs for every m, no matter how large, { I Xn - X I > E} occurs infinitely often. In other words, { I Xn - XI > £, i.o. } =
00
n u { I Xn - XI
m= l n=m
> E}
= limsup { I Xn - XI > E } .
(1 8.3)
Useful facts about this set and its complement are contained in the following lemma. 18.1 Lemma Let {En E � } i be an arbitrary sequence. Then
{ ) ( lJ ) (ii) P (liminf En) = lim P ( fl Em) . n....:;oo m=n n-';oo
(i) P limsup En = lim P Em . n....:;oo m=n n....:;oo
{U';;'=mEn J:= l is decreasing monotonically to limsupn En . Part (i) therefore follows by 3.4. Part (ii) follows in exactly the same way, since the
Proof The sequence
The Law of Large Numbers
282
sequence {n';;'=mEn };= I increases monotonically to liminf En. • A fundamental tool in proofs of a.s. convergence is the Borel-Cantelli lemma. This has two parts, the 'convergence' part and the 'divergence' part. The former is the most useful, since it yields a very general sufficient condition for convergence, whereas the second part, which generates a necessary condition for convergence, requires independence of the sequence. 18.2 Borel-Cantelli lemma
(i) For an arbitrary sequence of events {En
E
?f}'!,
'L_ P(En) < oo � P(Em i.o.) 0. 00
n= l (ii) For a sequence {En
=
� } i of independent events,
E
'L_ P(En) = oo � P(En i.o.) = 1. 00
n= l
Proof
(18.4)
(18.5)
By countable subadditivity,
(18.6) The premise in (18.4) is that the majorant side of (18.6) is finite for m = 1 . This implies L,';;'=mP(En) � 0 as m � oo (by 2.25), which further implies
( )
lim P LJ En = 0 . (18.7) m---too n=m Part (i) now follows by part (i) of 18.1. To prove (ii), note by 7.5 that the collection { g,; E ?f } i is independent; hence for any m > 0, and m' > m,
(18.8) by hypothesis, since e -x ;:::: 1 - x. (18.8) holds for all m, so P(liminf g,;)
=
( g,;)
lim P n m-+oo n=m
=
0,
(18.9)
by 18.1(ii). Hence, (18. 10) P(En i.o.) = P(limsup En) = 1 P(liminf E,;) = 1. • To appreciate the role of this result (the convergence part) in showing a.s. convergence, consider the particular case
-
Stochastic Convergence
283
If 2.";;= 1 P(En) < oo, the condition P(En) > 0 can hold for at most a finite number of n. The lemma shows that P(En i.o.) has to be zero to avoid a contradiction. Yet another way to characterize a.s. convergence is suggested by the following theorem. 18.3 Theorem {Xn } converges a.s. to X if and only if for all £ > 0
(
lim P sup / Xn - X / m�oo n ?: m
:s;
)
£
=
1.
(18. 1 1)
Proof Let
(18. 12) n=m and then (18. 1 1) can be written in the form limm�ooPCA m(E)) = 1 . The sequence {Am(E) } i is non-decreasing, so Am(£) = Ui= IAj(£); letting A(£) = u;;;= !Am(E), (18. 1 1) can be stated as P(A(£)) = 1 . Define the set C by the property that, for each ffi E C, { Xn( ffi)} i converges. That is, for ffi E C, 3 m(ffi) such that sup / Xn(ffi) - X(ffi) / :s; £, for all £ > 0.
(18. 13) n ?: m(w) Evidently, ffi E C=::} co E Am(E) for some m, so that C �A(E). Hence P(C) = 1 implies P(A(E)) = 1 , proving 'only if . To show 'if , assume P(A(£)) = 1 for all E > 0. Set E = llk for positive integer k, and define c (18. 14) A* = n A(llk) = ( LJ A(llkt . k=I
k=I
)
The second equality here is 1.1(iv). By 3.6(ii), P(A*) = 1 - P(Uk'=1A(l/k)c) = 1 . But every element of A * is a convergent outcome in the sense of (18. 13), hence A * c C, and the conclusion follows. • The last theorem characterizes a.s. convergence in terms of the uniform prox imity of the tail sequences { / Xn(ffi) - X(ffi) I }"::=m to zero, on a set Am whose measure approaches 1 as m ----7 = A related, but distinct, result establishes a direct link between a.s. convergence and uniform convergence on subsets of .Q. 18.4 Egoroff's Theorem If and only if Xn ...!!:4 X, there exists for every 8 > 0 a set C(8) with P(C(8)) � 1 - <5, such that Xn(ffi) ----7 X(ffi) uniformly on C(8). Proof To show 'only if', suppose Xn(ffi) converges uniformly on sets C(1/k), k = 1 ,2,3, ... The sequence { C(1/k) , k E rN } can be chosen as non-decreasing by mono tonicity of P, and P(Uk'= I C( 1 /k) ) = 1 by continuity of P. To show 'if, let .
The Law of Large Numbers
284 A m(O)
=
00
n { m: I Xn(ffi) - X(m) l < lim},
n=k(m)
(18.15)
k(m) being chosen to satisfy the condition P(A m(O)) � 1 - Tmo. In view of a.s. convergence and 18.3, the existence of finite k(m) is assured for each m. Then if C(o)
=
00
n A m(O),
m= I
(18. 16)
convergence is uniform on C(o) by construction; that is, for every ffi e C(O), I Xn(ffi) - X(m) I < l im for n � k(m), for each m > 0. Applying 1.1(iii) and subaddi tivity, we find, as required,
00
� 1 - L ( 1 - P(A m(O))) m= l � 1 - 8. II
(18. 17)
1 8 . 2 Convergence in Probability
In spite of its conceptual simplicity, the theory of almost sure convergence cannot easily be appreciated without a grasp of probability fundamentals, and traditionally, an alternative convergence concept has been preferred in econo metric theory. If, for any £ > 0, lim P( ! Xn - X I > £) = 0,
(18. 18)
Xn is said to converge in probability (in pr.) to X. Here the convergent sequences are specified to be, not random elements { Xn(ffi) } i, but the nonstochastic sequences { P( I Xn - X I > £) }}. The probability of the convergent subset of Q is left unspec ified. However, the following relation is immediate from 18.3, since (18. 1 1) implies (18. 18). 18.5 Theorem If Xn � X then Xn � X. o The converse does not hold. Convergence in probability imposes a limiting condi tion on the marginal distribution of the nth member of the sequence as n � = The probability that the deviation of Xn from X is negligible approaches 1 as we move down the sequence. Almost sure convergence, on the other hand, requires that beyond a certain point in the sequence the probability that deviations are negli gible from there on approaches 1 . While it may not be intuitively obvious that a sequence can converge in pr. but not a.s., in 18. 16 below we show that convergence in pr. is compatible with a.s. nonconvergence. However, convergence in probability is equivalent to a.s. convergence on a .
Stochastic Convergence
285
subsequence; given a sequence that converges in pr., it is always possible, by throwing away some of the members of the sequence, to be left with an a.s. conver gent sequence.
18.6 Theorem Xn � X if and only if every subsequence { Xnk ' k E IN } contains a further subsequence {Xnk(J) ' j E IN } which converges a.s. to X. Proof To prove 'only if : suppose P( I Xn - XI > E) � 0 for any E > 0. This means
that, for any sequence of integers { nb k E IN } , P( I Xnk - X I > E) � 0. Hence for each j E IN there exists an integer k(j) such that
(18. 19) Since this sequence of probabilities is summable over j, we conclude from the first Borel-Cantelli lemma that
(18.20) It follows, by consideration of the infinite subsequences U � J} for 1 > liE, that P( I Xnk(J) - XI > E i.o.) = 0 for every E > 0, and hence the subsequence {XnkVl } converges a.s. as required. To prove 'if : if {Xn } does not convergence in probability, there must exist a subsequence {nk } such that infk { P( I Xnk - X I > E) } � E, for some E > 0. This rules out convergence in pr. on any subsequence of {nk } , which rules out convergence a.s. on the same subsequence, by 18.5. • 1 8 .3 Transformations and Convergence
The following set of results on convergence, a.s. and in pr., are fundamental tools of asymptotic theory. For completeness they are given for the vector case, even though most of our own applications are to scalar sequences. A random k-vector Xn is said to converge a.s. (in pr.) to a vector X if each element of Xn converges a.s. (in pr.) to the corresponding element of X. 18.7 Lemma Xn � X a.s. (in pr.) if and only if IIXn - X I I � 0 a.s. (in pr.). 1 9
Proof Take first the case of a.s. convergence. The relation IIXn - X I I � 0 may be
expressed as
P
(urn ± (Xni - Xi)2 < E2) n--7oo r=l
for any E > 0. But ( 1 8.21 ) implies that
(
P lim I Xni - Xi ! n--7oo
< E, i =
=
)
1, . ,k . .
(18.21)
1
=
1,
(18.22)
proving 'if . To prove 'only if , observe that if (18.22) holds, P(Iimn--7oo i iXn - XII < k 1 12E) = 1, for any E > 0. To get the proof for convergence in pr., replace P(limn--7oo · .. ) everywhere by limn--7ooP( ... ), and the arguments are identical. •
The Law of Large Numbers
286
There are three different approaches, established in the following theorems, to the problem of preserving convergence (a.s. or in pr.) under transformations.
IR k ----7 IR be a Borel function, let C8 c IR k be the set of conti nuity points of g, and assume P(X e C8) = 1 . (i) If Xn � X then g(Xn) .-!!:4 g(X). 18.8 Theorem Let g:
(ii) If Xn � X then g(Xn) � g(X).
Proof For case (i), there is by hypothesis a set D e 'J', with P(D) = 1 , such that Xn(ro) ----7 X(ro), each ro e D. Continuity and 18.7 together imply that g(Xn(ro)) ----?
g(X(w)) for each (!) E x-1(Cg) n D. This set has probability 1 by 3.6(iii). Toprove (ii), analogous reasoning shows that, for each £ > 0, 3 8 > 0 such that { w: ll� (ro) - X(ro) ll < 8 } n K 1(C8) � { ro: l g(� (ro)) - g(X(ro)) l < e } . (18.23)
Note that if P(B) = 1 then for any A E
'fF,
( 1 8.24) by de Morgan ' s law and subadditivity of P. In particular, when P(X e C8) = 1 , ( 1 8.23) and monotonicity imply P( !l Xn - X II < 8) :::; P( j g(Xn) - g(X) j < e) . (18.25) Taking the limit of each side of the inequality, the rninorant side tends to 1 by
hypothesis.
•
We may also have cases where only the difference of two sequences is convergent. 18.9 Theorem Let { Xn } and { Zn } be sequences of randomk-vectors (not necessarily converging) and g the function defined in 18.8, and let P(Xn E C8) = P(Zn E C8) =
1 for every
n.
(i) If IIXn - Zn II � 0 then I g(Xn) - g(Zn) I � 0. (ii) If II Xn - Zn ll � 0 then l g(Xn) - g(Zn) l � 0.
Proof Put Efz = X�1(C8), E� = Z�\C8), Ex = n�=tE!z, and e2 = n�=l�· P(Ex) =
P(£2) = 1 , by assumption and 3.6(iii). Also let D be the set on which I I Xn - Zn I I converges. The proof is now a straightforward variant of the preceding one, with the set Ex n £2 playing the role of .A1(C8). • The third result specifies convergence to a constant limit, but relaxes the conti nuity requirements.
IR k ----? IR be a Borel function, continuous at a. (i) If Xn � a then g(Xn) � g(a).
18.10 Theorem Let g:
(ii)
If
Xn �
a
then g(Xn) � g(a).
with P(D) = 1 , such that Xn(ro) ----? a, each ro E D. Continuity implies g(Xn(ro)) ----7 g(a) for ro E D, proving (i). Likewise,
Proof By hypothesis there is a set D E
'J',
{ ro: II Xn(w) - a ll < 8 } � { ro : l g(Xn(ro)) - g(a) l < £ } ,
and (ii) follows much as in the preceding theorems.
•
( 18.26)
Stochastic Convergence
287
Theorem 18.10(ii) is commonly known as Slutsky ' s theorem (Slutsky 1925). These results have a vast range of applications, and represent one of the chief reas ons why limit theory is useful. Having established the convergence of one set of statistics, such as the first few empirical moments of a distribution, one can then deduce the convergence of any continuous function of these. Many commonly used estimators fall into this category. 18.11 Example Let An be a random matrix whose elements converge a.s. (in pr.) to a limit A . Since the matrix inversion mapping is continuous everywhere, the results a.s.lim A � 1 = A - l (plimA� 1 = A - l ) follow on applying 18.8 element by element. o The following is a useful supplementary result, for a case not covered by the Slutsky theorem because Yn is not required to converge in any sense. 18.12 Theorem Let a sequence { Yn}i be bounded in probability (i.e., Op(1) as n � oo); if Xn � 0, then XnYn � 0. Proof For a constant B > 0, define
Y� = Yn1 { 1 Yn l :o;B} · The event { I XnYn l � £} for
£ > 0 is expressible as a disjoint union:
(18.27) { I XnYn l � £} = { IXn i i Y� I � £} u { I Xn i i Yn - Y� I � £ } . For any £ > 0, { I Xn l l Y� l � £} c { I Xn l � £1B } , and (18.28) P( I Xn l l Y� l � E) � P( I Xn l � £/B) � 0. By the Op (l) assumption there exists, for each 8 > 0, B0 < oo such that P( l Yn - y�& I > 0) < 8 for n E lN . Since { I Xn l l Yn - Y� l � £ } c { I Yn - Y� l > 0 } , (1 8.27) and additivity imply, putting B = B0 i n (8.28), that (18.29) The theorem follows since both £ and 8 are arbitrary.
•
1 8 .4 Convergence in LP Norm
Recall that when E( I Xn I P) < oo, we have said that Xn is Lp-bounded. Consider, for p > 0, the sequence { II Xn - X l l p } i . If E( II Xn llp) < oo, all n, and limn�oo ii Xn - X l l p = 0, Xn is said to converge in 4 norm to X (write Xn � X). When p = 2 we speak of convergence in mean square (m.s.). Convergence in probability is sometimes called LQ-convergence, terminology which can be explained by the fact that Lp-convergence implies Lq-convergence for 0 < q < p by Liapunov ' s inequality, together with the following relationship, which is immediate from the Markov inequality. 18.13 Theorem If Xn � X for any p > 0, then Xn � X.
o
The converse does not follow in general, but see the following theorem. 18.14 Theorem If Xn � X, and { I Xn l P }i is uniformly integrable, then Xn � X.
The Law of Large Numbers
288 Proof
For E > 0,
E I Xn - XI P = E( l { IXn -X IP>EJ I Xn - X I P) + E( l { IXn -X (�;eJ I Xn - XI P)
$ E( l i i Xn -X I P>eJ I Xn - XI P) + E.
(18.30)
Convergence in pr. means that P( I Xn - X I > E) � 0 as n � co. Uniform integrab ility therefore implies, by 12.9, that the expectation on the majorant side of (18.30) converges to zero. The theorem follows since E is arbitrary. • We proved the a.s. counterpart of this result, in effect, as 12.8, whose conclus ion can be written as: I Xn - XI � O implies E I Xn - XI � 0. The extension from the L 1 case to the Lp case is easily obtained by applying 18.8(i) to the case g(.) =
I . I P.
One of the useful features of Lp convergence is that the Lp norms of Xn - X define a sequence of constants whose order of magnitude in n may be determined, providing a measure of the rate of approach to the limit. We will say for example that Xn converges to X in mean square at the rate nk if IIX1 - X l i z = O(n -k), but not o(n -k) . This is useful in that the scaled random variable nk(X1 - X) may be non-degenerate in the limit, in the sense of having positive but finite limiting variance. Determining this rate of convergence is often the first step in the analysis of limiting distributions, as discussed in Part V below. 1 8 . 5 Examples
Convergence in pr. is a weak mode of convergence in that without side conditions it does not imply, yet is implied by, a.s. convergence and Lp convergence. How ever, there is no implication from a.s. convergence to Lp convergence, or vice versa. A good way to appreciate the distinctions is to consider 'pathological' cases where one or other mode of convergence fails to hold. 18.15 Example Look again at 12.7, in whichXn = 0 with probability 1 - 1/n, and Xn = n with probability l in, for n = 1,2,3, ... . A convenient model for this sequence is to let ro be a drawing from the space ([0,1] , 13 [o,I],m) where m is Lebesgue measure, and define the random variable ro E [0, l in), Xn(ro) = ( 1 8.3 1 ) 0, otherwise. The set { ro: limnXn(OO) :f:. 0 } consists of the point { 0 } , and has p.m. zero, so that p Xn � 0 according to (18. 1). But E I Xn i P = 0.(1 - l in) + nP!n = n - l . It will be recalled that this sequence is not uniformly integrable. It fails to converge in Lp for any p > 1, but for the case p = 1 we obtain E(Xn) = 1 for every n. The limiting expectation of Xn is therefore different from its almost sure limit. o The same device can be used to define a.s. convergent sequences which do not converge in Lp for any p > 0. It is left to the reader to construct examples.
{n,
Stochastic Convergence
289
18.16 Example Let a sequence be generated as follows: X1 = 1 with probability 1 ; (X2 ,X3) are either (0, 1 ) or ( 1 ,0) with equal probability; (X4,X5 ,X6) are chosen from (1 ,0,0), (0, 1 ,0), (0,0, 1) with equal probability; and so forth. For k = 1 ,2,3, ... the next k members of the sequence are randomly selected such that one of them is unity, the others zero. Hence, for n in the range [�k(k - 1) + 1, !k(k + 1 )] , P(Xn = 1) = Ilk, as well as E I Xn i P = llk for p > 0. Since k � as n � =, it is clear that Xn converges to zero both in pr. and in Lp norm. But since, for any n, Xn+J = 1 a.s. for infinitely many j, oo
P( I Xn l
< £, i.o.)
=
0
(18.32)
for 0 :::; £ :::; 1 . The sequence not only fails to converge a.s., but actually converges with probability 0. Consider also the sequence { k 1 1rXn} , whose members are either 0 or k1 1r in the range [1k(k - 1) + 1 , 1k(k + 1 )]. Note that E( l k11rXn i P) = Jtlr- t , and by suitable choice of r we can produce a sequence that does not converge in Lp for p > r. With r = 1 we have E(kXn) = 1 for all n, but as in 18.15, the sequence is not uniformly integrable. The limiting expectation of the sequence exists, but is different from the probability limit. o In these non-uniformly integrable cases in which the sequence converges in L 1 but not in L 1 +9 for any e > 0, one can see the expectation remaining formally well-defined in the limit, but breaking down in the sense of losing its intuitive interpretation as the limit of a sample average. Example 18.15 is a version of the well-known St Petersburg Paradox. Consider a game of chance in which the player announces a number n E IN , and bets that a succession of coin tosses will produce n heads before tails comes up, the pay-off for a correct prediction being £2 n+ l . The n probability of winning is T - I, so the expected winnings are £ 1 ; that is to say, it is a 'fair game ' if the stake is fixed at £ 1 . The sequence of random winnings Xn generated by choosing n = 1,2,3, ... is exactly the process specified in 18.15 ?0 If n is chosen to be a very large number, a moment's reflection shows that the probability limit is a much better guide to one's prospective winnings in a finite number of plays than the expectation. The paradox that with large n no one would be willing to bet on this apparently fair game has been explained by appeal to psychological notions such as risk aversion, but it would appear to be an adequate explanation that, for large enough n, the expectation is simply not a practical predictor of the outcome. 1 8 . 6 Laws of Large Numbers
Let { Xr}I be a stochastic sequence and define Xn = n - l L7: I Xr. Suppose that E(Xr) = 11r and n - l I7=tl1t ---7 11 with 1 11 1 < oo; this is trivial in the mean-stationary case in which l1t = 11 for all t. In this simple setting, the sequence is said to obey the weak law of large numbers (WLLN) when Xn � 11, and the strong law of large numbers (SLLN) when Xn � 11· These statements of the LLNs are standard and familiar, but as characterizations
290
The Law of Large Numbers
of a class of convergence results they are rather restrictive. We can set 1-1 = 0 with no loss of generality, by simply considering the centred sequence { X1 - IJ.r } i; centring is generally a good idea, because then it is no longer necessary for the time average of the means to converge in the manner specified. We can quite easily have n - 1 :L7=I I-lr -j oo at the same time that n - 1 :L7=t(X1(ro) - !-11) -j 0. In such cases the law of large numbers requires a modified interpretation, since it ceases to make sense to speak of convergence of the sequence of sample means. More general modes of convergence also exist. It is possible that Xn does not converge in the manner specified, even after centring, but that there exists a sequence of positive constants {an } i such that an t oo and a� 1 :L7= 1 X1 -j 0. Results below will subsume these possibilities, and others too, in a fully general array formulation of the problem. If { {Xnr}�� I }-;:'= 1 is a triangular stochastic array with { kn} -;:'= 1 an increasing integer sequence, we will discuss conditions for kn
Sn = L Xnr � 0. t=l
(1 8.33)
A result in this form can be specialized to the familiar case with Xnr = a� 1 (X1 - !-11) and an = kn = n, but there are important applications where the greater generality is essential. We have already encountered two cases where the strong law of large numbers applies. According to 13.12, Xn � 1-1 = E(X1 ) when {Xr} is a stationary ergodic sequence and E I X1 1 < oo . We can illustrate the application of this type of result by an example in which the sequence is independent, which is sufficient for ergodicity. 18.17 Example Consider a sequence of independent Bernoulli variables X1 with P(X1 = 1) = P(X1 = 0) = !; that is, of coin tosses expressed in binary form (see 12.1). The conditions of the ergodic theorem are clearly satisfied, and we can conclude that n - 1 :L7= 1 X1 � E(X1) = !· This is called Borel's normal number theorem, a normal number being defined as one in which Os and 1 s occur in its binary expansion with equal frequency, in the limit. The normal number theorem therefore states that almost every point of the unit interval is a normal number; that is, the set of normal numbers has Lebesgue measure 1 . Any number with a terminating expansion is clearly non-normal and we know that all such numbers are rationals; however, rationals can be normal, as for example 1, which has the binary expansion 0.01010101010101... This is a different result from the well-known zero measure of the rationals, and is much stronger, because the non-normal numbers include irrationals, and form an uncountable set. For example, anynumberwithabinary expansionoftheform0.1 1bt 1 1b 2 1 1b 3 1 1 ... where the bi are arbitrary digits is non-normal; yet this set can be put into 1-1 cor respondence with the expansions O.b 1 b2b3 , ... , in other words, with the points of the whole interval. The set of non-normal numbers is equipotent with the reals, but it none the less has Lebesgue measure 0. o A useful fact to remember is that the stationary ergodic propery is preserved under measurable transformations; that is, if {X1} is stationary and ergodic, so
Stochastic Convergence
291
is the sequence {g(X1) } whenever g: !R f---7 !R is a measurable function. For example, we only need to know that E(XI) < to be able to assert that n - 1 I7= 1x7 ...!!:4 E(Xi). The ergodic theorem serves to establish the strong law for most stationary sequences we are likely to encounter; recall from § 13.5 that ergodicity is a weaker property than regularity or mixing. The interesting problems in stochastic convergence arise when the distributions of sequence coordinates are hetero geneous, so that it is not trivial to assume that averaging of coordinates is a stable procedure in the limit. Another result we know of which yields a strong law is the martingale conver gence theorem (15.7), which has the interpretation that a � 1 I7= 1 Xr ...!!:4 0 whenever {L7= 1 Xr} is a submartingale with El I7= 1 Xr l < oo uniformly in n, and an -7 This particular strong law needs to be combined with additional results to give it a broad application, but this is readily done, as we shall show in §20.3. But, lest the law of large numbers appear an altogether trivial problem, it might also be a good idea to exhibit some cases where convergence fails to occur.
oo
co
.
-
18.18 Example Let {X1} denote a sequence of independent Cauchy random vari ables with characteristic function G>xP,.) = e I A. I for each t (11.9). It is easy to
n verify using formulae (1 1 .30) and (1 1 .33) that
-
t
(18.34) Xr = L 'l'sZs = Xr 1 + 'JfrZr, t = 1,2,3, ... s= 1 with X0 = 0, where { Z1 } j is an independent stationary sequence with mean 0 and variance �, and { 'Vs } j is a sequence of constant coefficients. Notice, these are indexed with the absolute date rather than the lag relative to time t, as in the linear processes considered in § 14.3. For m > 0, t
Cov(X1, Xt+m) = Var(X1) = cr2 L Vs · s= 1
(18.35)
oo
For {X1} } to be uniformly L2-bounded requires I7= 1 'l'; < ; in this case the effect of the innovations declines to zero with t and Xr approaches a limiting random variable X, say. Without the square-summability assumption, Var(X1) -7 An example of the latter case is the random walk process, in which 'Vs = 1, all s. Since Cov(X1 ,X1) = 'Jficr2 for every t, these processes are not mixing. Xn has zero mean, but
)
oo
.
1 (18.36) f var(Xj) . ± v ar(X1) + 2i n2 t=1 t=2 ]= 1 If 'L7:::: 1 'VJ < co, then limn�ooVar(Xn) = �L7= 1 'JIJ; otherwise Var(Xn) -7 In either case the sequence { Xn } fails to converge to a fixed limit, being either stochastic Var(Xn) =
(
oo
.
292
The Law of Large Numbers
asymptotically, or divergent. o These counter-examples illustrate the fact that, to obey the law of large numbers, a sequence must satisfy regularity conditions relating to two distinct factors: the probability of outliers (limited by bounding absolute moments) and the degree of dependence between coordinates. In 18.18 we have a case where the mean fails to exist, and in 18.19 an example of long-range dependence. In neither case can Xn be thought of as a sample statistic which is estimating a parameter of the under lying distribution in any meaningful fashion. In Chapters 19 and 20 we devise sets of regularity conditions sufficient for weak and strong laws to operate, constraining both characteristics in different configurations. The necessity of a set of regularity conditions is usually hard to prove (the exception is when the sequences are independent), but various configurations of mixing and Lp-boundedness conditions can be shown to be sufficient. These results usually exhibit a trade-off between the two dimensions of regularity; the stronger the moment restrictions are, the weaker dependence restrictions can be, and vice versa. One word of caution before we proceed to the theorems. In §9 . 1 we sought to motivate the idea of an expectation by viewing it as the limit of the empirical average. There is a temptation to attempt to define an expectation as such a limit; but to do so would inevitably involve us in circular reasoning, since the arguments establishing convergence are couched in the language of probability. The aim of the theory is to establish convergence in particular sampling schemes. It cannot, for example, be used to validate the frequentist interpretation of probability. However, it does show that axiomatic probability yields predictions that accord with the frequentist model, and in this sense the laws of large numbers are among the most fundamental results in probability theory.
19
Convergence in Lp Norm
1 9. 1 Weak Laws by Mean-Square Convergence
This chapter surveys a range of techniques for proving (mainly) weak laws oflarge numbers, ranging from classical results to recent additions to the literature. The common theme in these results is that they depend on showing convergence in Lp-norrn, where in general p lies in the interval [ 1 ,2]. Initially we consider the case p = 2. The regularity conditions for these results relate directly to the variances and covariances of the process. While for subsequent results these moments will not need to exist, the � case is of interest both because the conditions are familiar and intuitive, and because in certain respects the results available are more powerful. Consider a stochastic sequence {X1}}, with sequence of means {J.l1 }}, and vari ances {cr� }i. There is no loss of generality in setting 1-lr = 0 by simply consider ing the case of { X1 - j..Lr } i, but to focus the discussion on a familiar case, let us initially assume iln n - 1 I�= II-lr ----7 !-l (finite), and so consider the question, what are sufficient conditions for E(Xn - j..L)2 ----7 0? An elementary relation is (19.1) where the second term on the right-hand side converges to zero by definition of j..l . Thus the question becomes: when does Var(Xn) ----7 0? We have =
(19.2) where � = Var(X1) and <J15 = Cov(X1,X5). Suppose, to make life simple, we assume that the sequence is uncorrelated, with <J15 = 0 for t-::f. s in (19.2). Then we have the following well-known result. 19.1 Theorem If {Xr }i is uncorrelated sequence and
(19.3) then Xn � j..l . Proof This is an application of Kronecker's lemma (2.35), by which ( 19.3) implies Var(Xn) = n - 2L1� ----7 0. • This result yields a weak law of large numbers by application of 18.13, known as Chebyshev' s theorem. An (amply) sufficient condition for (19.3) is that the
The Law of Large Numbers variances are uniformly bounded with, say, sup1o7 ::;; B < oo . Wide-sense stationary sequences fall into this class. In such cases we have Var(Xn) O(n - 1 ). But since all we need is Var(Xn) o(l ), o7 � oo is evidently permissable. If o7 - tHi for > 0, I,� 1 t- 2o7 has terms of O(t- 1 -8), and therefore converges by 2.27. Looking at (19. 2) again, it is also clear that uncorrelatedness is an unnec 294
=
=
=
o
essarily tough condition. It will suffice if the magnitude of the covariances can be suitably controlled. Imposing uniform L2-boundedness to allow the maximum relaxation of constraints on dependence, the Cauchy-Schwartz inequality tells us that I a15 I ::;; for all and Rearranging the formula in
B
t s.
(19. 2),
n- 1 n n 1 2 o7 + 2 L L l at,t-m I ::;; n2 L: n m=l t=m+ l t= l
(19.4) Bm ::;; B, m
Bm
where = sup1/ a1,1 ml , and all � 1 . This suggests the following variant on 19.1. oo 19.2 Theorem If {X,} i is a uniformly Lrbounded sequence, and I.;= L where = suprl ar,r- m l , then Xn ---=4 11· Proof Since it is sufficient by to show the convergence of to zero. This follows immediately from the stated condition and Kronecker's lemma. • o > 0; a very A sufficient condition, in view of 2.30, is = 0 ( (log mild restriction on the autocovariances. There are two observations that we might make about these results. The first is to point to the trade-off between the dimensions of dependence and the growth of the variances. Theorems 19.1 and 19.2 are easily combined, and it is found that by tightening the rate at which the covariances diminish the variances can grow faster, and vice versa. The reader can explore these possibilities using the rather simple techniques of the above proofs, although remember that the I a ,t I will need to be treated as growing with as well as diminishing with Analogous trade-offs are derived in a different context below. The order of magnitude in of Var(Xn), which depends on these factors, can be thought of as a measure of the of convergence. With no correlation and bounded variances, convergence is at the rate - 1 12 in the sense that Var(Xn) = - 1 ); but from = If convergence implies that Var(Xn) = rates are thought of as indicating the number of sample observations required to get Xn close to j..L with high confidence, the weakest sufficient conditions evidently yield convergence only in a notional sense. It is less easy in some of thl" mnrl". oP:nP.r�l rP�ml ts helow to l i nk exnlicitlv the rate of conver_gence with the
I m- I Bm<
Bm (n-m)/n < 1, (2/n)I,�:�Bm
(19.4)
mr 1-8),
Bm
t
m.
n
O(n
rate (19.4), Bm O(m-8)
n
O(n-8).
1
-
m
Convergence in Lp Norm
295
degree of dependence and/or nonstationarity; this is always an issue to keep in mind. Mixing sequences have the property that the covariances tend to zero, and the mixing inequalities of § 14.2 gives the following corollary to 19.2. 19.3 Corollary If {Xr}] is either (i) uniformly L2-bounded and uniform mixing with
(19.5) Lm= m- 1<1>�12 < 1 or (ii) uniformly L2+a-bounded for 8 > 0, and strong mixing with ( 19 . 6) Lm=l m - 1a�/(2+8) < j.l. then Xn Proof For part (i), 14.5 for the case r 2 yields the inequality Bm ::::;; 2B<j>�12• For r 2 + 8 yields Bm ::::;; 6 IIXrll�+8a�1(2+8). Noting part (ii), 14.3 for the case that B II Xrll�+li• the conditions of 19.2 are satisfied in either case. A sufficient condition for 19.3(i) is
oo ,
00
-
oo ,
L
�
p =
::::;;
=
=
•
=
19.3(ii), Um = O((log mf(l 2l (l e ) for E > 0 is sufficient. In the size termi nology of § 14. 1 , mixing of any size will ensure these conditions. The most signif icant cost of using the strong mixing condition is that simple existence of the variances is not sufficient. This is not of course to say that no weak law exists for L2-bounded strong mixing processes, but more subtle arguments, such as those of § 19 .4, are needed for the proof. 1 9 .2 Almost Sure Convergence by the Method of Subsequences
Almost sure convergence does not follow from convergence in mean square (a counter-example is 18.16), but a clever adaptation of the above techniques yields a result. The proof of the following theorems makes use of the exploiting the relation between convergence in pr. and convergence a.s. demonstrated in 18.6. Mainly for the sake of clarity, we first prove the result for the uncorrelated case. Notice how the conditions have to be strengthened, relative to 19.1.
method of
subsequences,
19.4 Theorem If {Xr}] is uniformly �-bounded and uncorrelated, Xn ...E4 j.l . o A natural place to start in a sufficiency proof of the strong law is with the convergence part of the Borel-Cantelli lemma. The Chebyshev inequality yields, under the stated conditions,
(19.7)
B
for < oo , with the probability on the left-hand side going to zero with the right-hand side as -7 oo . One approach to the problem of bounding the quantity
n
The Law of Large Numbers
296
P( ! Xn - iin l > £, i.o.) would be to add up the inequalities in (19.7) over n. Since the partial sums of l in form a divergent sequence, a direct attack on these lines does not succeed. However, I';;':: 1 n -2 :::::: 1 .64, and we can add up the subsequence of the probabilities in (19.7), for n = 1 ,4,9, 16, ... , as follows. Proof of 19.4 By
(19.7),
" ,< 1 .64B L P( Xnz - iinz I > £) - -2n2 £
<
oo .
(19.8)
Now 18.2(i) yields the result that the subsequence {Xnz, n E IN } converges a.s. The proof is completed by showing that the maximum deviation of the omitted terms from the nearest member of { Xnz } also converges in mean square. For each n define max I Xk - Xnz l (19.9) n2 S k <(n+l)2 and consider the variance of Dnz. Given the assumptions, the sequence of the Var(Xn) = (1/n2)I,7== I � tends monotonically to zero. For n2 < k < (n + 1) 2, re
Dnz
=
arrangement of the terms produces
(19. 1 0) and when the sequence is uncorrelated the two terms on the right are also uncorre lated. Hence
( �r� � (: �) (: � )
� 1 =
B
2
+
_
2
�B
-
(k - n2)
. 2 - (n 1 )2
(19. 1 1)
Var(Dnz) cannot exceed the last term in (19. 1 1), and n2
n
(19.12)
so the Chebyshev inequality gives
B (19. 1 3) _L P(Dnz > £) � 2 < oo, n2 £ and the subsequence { Dnz, n E !N } also converges a.s. Dn 2 � I xk - Xnz l for any k between n2 and (n + 1 )2, and hence, by the triangle inequality,
Convergence in Lp Norm
297 (19 . 14)
The sequences on the majorant side are positive and converge a.s. to zero, hence so does their sum. But (19. 14) holds for n2 :S; k < (n + 1)2 for { n2, n E IN } , so that k ranges over every integer value. We must conclude that Xn � Jl. • We can generalize the same technique to allow autocorrelation. 19.5 Corollary If {Xr}j is uniformly L2-bounded, and
(19. 15) where Bm
=
supt I O"t,t -m I , then Xn � Jl.
o
Note how much tougher these conditio ns are than those of 19.2. It will suffice here for Bm = O(m - 1 (log mf 1 -0) for o > 0. Instead of, in effect, having the autocovariances merely decline to zero, we now require their summability. Proof of 19.5 By (19 .4) , V ar(Xn) :S; (B + 2B * )In and hence equation (19. 7) holds in the modified form,
L p( l x -n2 - Jln2 I > £) n2 -
< _
1 .64(B + 2B * ) £2
<
oo
.
(19. 1 6)
Instead of ( 19. 1 1) we have, on multiplying out and taking expectations, - n2 Var(Xk - Xnz) = Var k1 L Xnz � X1 - 1 - k
[
(
t=n2+!
)
_
]
2 k t- 1 ( ( - k 1 - � ) L L at,t-m) . t=n2+ m=t n2
2
1
-
(19. 1 7)
The first term on the right-hand side is bounded by ( 1 - n2/k) 2(B + 2B * )/n2 , the second by (k - n2)(B + 2B*)!k2, and the third (absolutely) by 2(1 - n2/k) 2B* . Adding together these latter terms and simplifying yields
-<
(_!_n2 - (n +1 1 )2)s + 2 L!ln2 - (n +1 1)2 + ( 1 - (n +n21)2) 2] s* .
Note, ( 1 - n2!(n + 1 ) 2)2 (19.13) we can write
=
(19. 1 8)
O(n -2), so the term in B* is summable. In place of
The Law of Large Numbers
298
L P(Dn2 > E) s; n2
B + KtB * E2
< oo ,
(19. 19)
where K1 is a finite constant. From here on the proof follows that of 19.4.
•
Again there is a straightforward extension to mixing sequences by direct analogy with 19.3. 19.6 Corollary If { X1 } 1 is either (i) uniformly L2-bounded and uniform mixing with
m=l
"L
or (ii) uniformly L2+0-bounded for 8 > 0, and strong mixing with L a�t(2+B) < oo,
m=l 00
then Xn � 0.
D
( 19.20)
(19.21 )
Let it be emphasized that these results have no pretensions to being sharp ! They are given here as an illustration of technique, and also to define the limits of this approach to strong convergence. In Chapter 20 we will see how they can be improved upon. 1 9 . 3 A Martingale Weak Law
We now want to relax the requirement of finite variances, and prove Lp-convergence for p < 2. The basic idea underlying these results is a truncation argument. Given a sequence { X1}} which we assume to have mean 0, define Y1 = 1 { IXri�B)X1, which equals X1 when I X1 I s; B < oo , and 0 otherwise. Letting Z1 = X1 - Y1, the 'tail component ' of X1, notice that E(Z1) = -E(Y1) by construction, and Xn = Yn + Zn. Since Y1 is a.s. bounded and possesses all its moments, arguments of the type used in § 19 . 1 might be brought to bear to show that Yn � !-lr (say). Some other approach must then be used to show that Zn � !-lz = -j..l y. An obvious technique is to assume uniform integrability of { I X1 j P } . In this case, sup1E I Z1 I P can be made as small as desired by choosing B large enough, leading (via the Minkowski inequality, for example) to an Lp-convergence result for ZnA different approach to limiting dependence is called for here. We cannot assume that Y1 is serially uncorrelated just because X1 is. The serial independence assumption would serve, but is rather strong. However, if we let X1 be a martin gale difference, a mild strengthening of uncorrelatedness, this property can also be passed on to Y1, after a centring adjustment. This is the clever idea behind the next result, based on a theorem of Y. S. Chow (1971). Subsequently (see § 1 9.4) the m.d. assumption can be relaxed to a mixingale assumption. We will take this opportunity to switch to an array formulation. The theorems are easily specialized to the case of ordinary sample averages (see § 1 8.6), but in subsequent chapters, array results will be indispensable.
Convergence in Lp Norm
299
19.7 Theorem Let {Xnr,!¥nr l be a m.d. array, { Cnr } a positive constant array, and { kn } an increasing integer sequence with kn t =. If, for 1 :::; p :::; 2, (a) { I Xnr !Cnr iP} is uniformly integrable, kn
(b) limsup L Cnt <
n -too
kn
t= l
(c) lim L
c�1
n-too t= l then L��!Xnr � 0.
=,
and
0,
=
D
The leading specialization of this result is where Xnr = Xrfan, where { X1,:¥,} is a m.d. sequence with :¥
(b) L hr t= l
=
O (an), and
=
o (a � ) ;
kn
(c) L h; t=l
then a� 1 :L��IXr � 0.
Xnr X11an and Cnt brfan. • Be careful to distinguish the constants an and kn. Although both are equal to in
Proof Immediate from 19.7, defining
=
=
n
the sample-average case, more generally their roles are quite different. The case with kn different from typically arises in 'blocking' arguments, where the array coordinates are generated from successive blocks of underlying sequence coor dinates. We might have kn = for a E (0, 1) denoting the largest integer below x) where the length of a block does not exceed For an application of this sort see §24.4. Conditions 19.8(b) and (c) together imply an t =, so this does not need to be separately asserted. To form a clear idea of the role of the assumptions, it is helpful to suppose that b1 and an are regularly varying functions of their argu ments. It is easily verified by 2.27 that the conditions are observed if b1 - t � for any � � -1, by choosing an - n 1 +� for � > - 1 , and an - log for � - 1 . In particular, setting b1 = 1 for all t, an = kn = yields
n
[na]
([x] 1 [n a].
n
n
=
(19.22) Choosing an = I,�� 1 b1 will automatically satisfy condition (a), and condition (b) will also hold when b1 = O(t �). On the other hand, a case where the conditions
The Law of Large Numbers 1 and, for t > 1, b1 .L,�:lbs In this case condition (a) fail is where imposes the requirement bn O(an), so that b� O(a�). contradicting condition (b). The growth rate of b1 exceeds that of tl3 for every � > 0. 300
=
h =
=
=
=
t.
Proof of 19.7 Uniform integrability implies that
n,t E(IXnr Cnr P 1
sup
1
i
1 1 Xn1lcnr i >M )
)
-7 0
as
One may therefore find, for £ > 0, a constant B£ <
n.t {
sup IIXnrl ! IXnti>BEcnrJ II p !cn
Define
} :::; £.
M -7 oo .
such that
=
(19.23)
Ynt Xnrl IXntl gl£Cntl ' and Znt xnt- Ynt. Then since E(Xnt I ?}n,t- 1) Xnt Ynr-E(Ynrl?in,t- J)+Znr-E(Znr l?in,t- 1 ). =
{
=
=
=
0,
By the Minkowsk:i inequality, kn
p
:::; 2 t= l
(Ynt -E(Ynr l ?in,t- 1)) p
+
kn
2 t=!
(Znt-E(Znrl ?fn,t-J )) p
(19.24)
Consider each of these right-hand-side terms. First,
2(Ynr-E(Ynrl ?in,r-I )) :::; L (Ynr-E(Ynrl ?in,t- I )) 2 �
t=l
�
p
(i E(Ynt-E(Ynr l ?in, d)2)1/2 (i EY�t) 1/2 (i c�t) 1 2 t= l
=
$
k
r
t=l
k
$ B£
t=l
-
k
t=l
(19.25)
The first inequality in (19.25) is Liapunov' s inequality, and the equality follows because { is a m.d., and hence orthogonal. Second,
Yn1 -E(Ynrl ?in,r-1 )} kn kn kn L (Znt -E(Zntl ?i n, t -1 )) :::; L I!Znrl! p + L I ! E(Zntl ?in,t- J ) I ! p p t= l
t= 1
t= l
kn
kn
t=l
t= l
$ 2_L IIZntllp $ 2£L Cnt .
(19.26)
The second inequality here follows because
EIE(Znrl?in,t- J )I P $ E(E(IZnri P I?fn,t- J)) EIZnti P, =
from, respectively, the conditim1al Jensen inequality and the law of iterated expectations. The last is by (19.23). It follows by (c) that for £ > 0 there exists N£ � 1 such that, for n � N£,
Convergence in Lp Norm k ""'n C2
<
L,_, nt -
t=l
301
B - 2£2·
(19.27)
E
Putting together (19.24) with (1 9.25) and (19.23) shows that kn
(19.28)
L Xnt p :::; B£ t=l
22:��� nr < oo, by condition (b). Since £ is arbitrary,
n Ne,
where B = 1 + for 2 this completes the proof. •
C
The weak law for martingale differences follows directly, on applying 18.13. 19.9 Corollary Under the conditions of 19.7 or 19.8,
p
r 1/n
n
.L��1 Xnr
�
0.
o
If we take the case = 1 and set Cn = and kn = as above, we get the result that uniform integrability of { Xr} is sufficient for convergence in probability of the sample mean Xn. This cannot be significantly weakened even if the martingale difference assumption is replaced by independence. If we assume identically distributed coordinates, the explicit requirement of uniform integrability can be dropped and L 1 -boundedness is enough; but of course, this is only because the uniform property is subsumed under the stationarity. You may have observed that (b) in 19.7 can be replaced by kn
(b') limsup /ln - 1 L d,;1 < = . t=I n�oo It suffices for the two terms on the majorant side of (19.24) to converge in Lp, and the Cr inequality can be used instead of the Minkowski inequality in (19.26) to obtain p (19.29) , ( , :J' c.;,.
It
1
•.
£(2k.r't
However, the gain in generality here is notional. Condition (b') requires that limsup1,n�oo'lnd,;t < = , and if this is true the same property obviously extends to { kncntl· For concreteness, put Cnt = as in 19.8 with - and where � and y can be any real constants. With kn - for a > 0, note that the of the value majorant side of (19.29) is bounded if a(1 + �) - y :::; 0, of p. This condition is automatically satisfied as an equality by setting = but note how the choice of can accommodate different choices of kn. None the less, in some situations condition (b) is stronger than what we know to be sufficient. For the case = 2 it can be omitted, in addition to weakening the martingale difference assumption to uncorrelatedness, and uniform integrability to simple L2-boundedness. Here is the array version of 19.1, with the conditions cast in the framework of 19.7 for comparability, although all they do is to ensure that the variance of the partial sums goes to zero.
brian
.L�� 1 br,
an
p
bt t � an nY, na independent an
The Law of Large Numbers
302
19.10 Corollary If {Xnr} is a zero-mean stochastic array with E(XnrXns) = 0 for t # s, and (a) {Xnr!Cnr } is uniformly L2-bounded, and kn
(b) lim ,L c�,
=
0,
1 9.4 A Mixingale Weak Law
To generalize the last results from martingale differences to rnixingales is not too difficult. The basic tool is the 'telescoping series' argument developed in § 16.2. The array element Xnr can be decomposed into a finite sum of martingale differences, to which 19.7 can be applied, and two residual components which can be treated as negligible. The following result, from Davidson (1993a), is an extension to the heterogeneous case of a theorem due to Andrews (1988). 19.11 Theorem Let the� {Xnr.�n1}'::'oo be a L 1 -mixingale with respect to a constant array { cnr } . If (a) {Xnrlcnr} is uniformly integrable, kn
(b) limsup L cnt < oo, and n �oo t=l kn
(c) lim ,L c�r = 0, n �oo t=l where kn is an increasing integer-valued function of n and kn t � O. o
oo,
then I,��1 Xn
r
There is no restriction on the size here. It suffices simply for the mixingale coefficients to tend to zero. The remarks following 19.7 apply here in just the same way. In particular, if X, is a L1 -mixingale sequence and { X,Ib,} is uniformly integrable for positive constants { b,} , the theorem holds for Xnr = X,lan and Cnr = b, lan where an L�= I b r. Theorems 14.2 and 14.4 give us the corresponding results for mixing sequences, and 17.5 and 17.6 for NED processes. It is sufficient for, say, Xnr to be Lr-bounded for r > 1 , and Lp-NED, for p :?: 1 , o n a a-mixing process. Again, n o size restrictions need to b e specified. Uniform integrability of { Xnrlcnr} will obtain in those cases where I! Xnr II is finite for r > 1 and each t, and the NED constants likewise satisfy dnr » II Xnr I I A simple lemma is required for the proof: =
r
r·
19.12 Lemma If the array {Xnrlcnr} is uniformly integrable for p :?: 1 , so is the array {Er-)Xnrlcnr} for j > 0. Proof By the necessity part of 12.9, for any
£ > 0 .:3 8 > 0 such that
Convergence in
{
sup sup
n,t
Lp
Norm
fE I Xnt!cnrl dP}
303
< £,
( 1 9.30)
where the inner supremum is taken over all E E ':; satisfying P(E) < o. Since ':Jn,t -J c ':J, ( 1 9.30) also holds when the supremum is taken over E E ':Jn, t J satisfying P(E) < o. For any such E,
fE i Xntlcnt l dP fEEt-J I Xntlcnt l dP � fE i Et-JXnrlcnrl dP, =
(19.3 1 )
by definition of Et-J(.), and the conditional Jensen inequality (10.18). We may accordingly say that, for £ > 0 :3 o > 0 such that
:;
s
{
sup
J. I E,-jX.,!c.,l dP} < £,
(19.32)
taking the inner supremum over E E ':Jn,t -J satisfying P(E) < o. Since Et-JXnr is ':Jn,t-rmeasurable, uniform integrability holds by the sufficiency part of 12.9. • Proof of 19.11 Fix an integer j and let
Ynj
kn
L (Et+}Xnt - Et+j- l Xnt) . t= l
=
The sequence { Yn1, ':Jn,n+J } ;= l is a martingale, for each j. Since the array
{ (Er+}Xnr - Et+j - l Xnr)lcnr } is uniformly integrable by (a) and 19.12, it follows by (b) and (c) and 19.7 that
Ynj � 0.
(19.33)
We now express L.�� 1 Xnt as a telescoping sum. For any M
M- 1 L Ynj }= 1 -M
1,
kn
kn
=
�
L Et+M- l Xnt - L Et-MXnt' t=l t= 1
( 1 9.34)
and hence
M- 1 kn kn + ) Xn + (Xnr Et Y "L_ Xnt L nj "L_ +M- 1 t "L_ Et-MXnt· t= 1 t= 1 t= 1 }= 1 -M kn
( 19.35)
=
The triangle inequality and the L1 -mixingale property now give
kn kn M- 1 E "L_ Xnt :::; L El YnJI + "L_ E I Xnr - Et+M- 1 Xnrl + "L_ EI Et-MXnr l t= 1 t= 1 t=1 }= 1 -M kn M- 1 + 2 ( 19.36) E l Ynjl �ML Cnr . :::; L t = M = } 11 kn
According to the assumptions, the second member on the right-hand side of (19.36) �
_
/ "'\ /
... 1-0"
r
�
�
1"\
_ _ _ _1
_
� - - - --
_
�
n
� 1
1
,,
•
Y
�lr-
The Law of Large Numbers
304
�E for M � Me. By choosing n large enough, the sum of 2M - 1 terms on the right hand side of ( 19.36) can be made smaller than �E for any finite M, by ( 19.33). So, < E when n is large enough. The theorem is by choosing M � Me we have £ 1 now proved since E is arbitrary. •
I��lXntl
A comparison with the results of § 19. 1 is instructive. In an L2-bounded process, the L2-mixingale property would be a stronger form of dependence restriction than the limiting uncorrelatedness specified in 19.2, just as the martingale property is stronger than simple uncorrelatedness. The value of the present result is the substantial weakening of the moment conditions. 1 9. 5 Approximable Processes
There remains the possibility of cases in which the mixingale property is not easily established -perhaps because of a nonlinear transformation of a Lp-NED process which cannot be shown to preserve the requisite moments for application of the results in § 1 7 .3. In such cases the theory of § 1 7.4 may yield a result. On the assumption that the approximator sequence is mixing, so that its mean deviations converge in probability by 19. 1 1 , it will be sufficient to show that this implies the convergence of the approximable sequence. This is the object of the following theorem.
{h';; t } is a stochastic array and the centred array {h';:1 -E(h';:1) } satisfies the conditions of 19. 1 1 . If the array {Xnt l is L 1 -approximable by { h'::r } with respect to a constant array { dnt}, and limsupn�oo l:��l dnt � B < then L��lXnt 0. Establishing the conditions of the theorem will typically be achieved using 17.21 , by showing that Xm i s Lr-bounded for r > 1 , and approximable i n probability on h';:1 for each m, the latter being m-order lag functions of a mixing array of any 19.13 Theorem Suppose that, for each m E =,
�
IN ,
D
size.
Proof Since ( 19.37)
by the triangle inequality, we have for () > 0
( 1 9.38)
by subadditivity, since the event whose probability is on the minorant side implies at least one of those on the majorant. By the Markov inequality,
Convergence in Lp Norm
3
:::; 3
kn Vm· (2.:.dnt ) t=l
305
(19.39)
P(stochastic is equal to either 0 or 1 , according to whether the non c) l > 8/3)holds IL�� IE(h'::inequality or does not hold. By the fact that E(Xnc) 0 and L -approximability, =
1
(19.40)
and hence (19.4 1 )
We therefore find that for each m E IN
n�= P ( I i Xnc I > 8) :::; 3:v m limsupP (I i (h'::c - E(h'::c)) l > �) + 1 {svm> o/3} n �=
limsup
t= I
+
t=I
(19.42)
h'::c
by the assumption that satisfies the WLLN for each m E IN . The proof is completed by letting m � oo • .
20 The Strong Law of Large Numbers
20. 1 Technical Tricks for Proving LLNs
In this chapter we explore the strong law under a range of different assumptions, from independent sequences to near-epoch dependent functions of mixing processes. Many of the proofs are based on one or more of a collection of ingenious technical lemmas, and we begin by studying these results. The reader has the option of skipping ahead to §20.2, and referring back as necessary, but there is something to be said for forming an impression of the method of attack at the outset. These theorems are found in several different versions in the literature, usually in a form adapted to the particular problem in hand. Here we will take note of the minimal conditions needed to make each trick work. We start with the basic convergence result that shows why maximal inequalities (for example, 15.14, 15.15, 16.9, and 16.11) are important. 20.1 Convergence lemma Let {X,}} be a stochastic sequence on a probability space (O.,'Jf,P), and let Sn = L�= 1 X, and So = 0. For ro E 0., let
If P(M > E)
=
(
}
inf �up i Sj(co) - Sm(ro) l . m J> m 0 for all E > 0 , then Sn � S. M(ro)
=
(20. 1 )
Proof By the Cauchy criterion for convergence, the realization {Sn(ro) } converges if we can find an m such that I Sj - Sm I :::; E for all j > m, for all E > 0; in other words, it converges if M(ro) :::; E for all E > 0. •
This result is usually applied in the following way. 20.2 Corollary Let { cr} 1 be a sequence of constants, and suppose there exists p > 0 such that, for every m � 0 and n > m, and every E > 0,
(
)
�
i
P m �x I Sr Sm l > E s c�, E t=m+ l m <J:O, n where K is a finite constant. If I-7= 1 c� < oo, then Sn � S.
(20.2)
Proof Since { c� } is summable it follows by 2.25 that limm�oa"L'7=m+lc � = 0. Let M be the r.v. in (20. 1). By definition, M :::; supj>m i Sr Sm l for any m > 0, and hence
(
P(M > E) :::; lim �up j Sj - Sm l > E m�oo j> m
)
The Strong Law of Large Numbers 00 K fP m--)oo t=m+ 1
::; - lim L c�
=
307
0,
(20.3)
where the final inequality is the limiting case of (20.2). 20.1 completes the proof. •
Notice how this proof does not make a direct appeal to the Borel-Cantelli lemma to get a.s. convergence. The method is closer to that of 18.3. The essential trick with a maximal inequality is to put a bound on the probability of all occurrences of a certain type of event as we move down the sequence, by specifying a probabil ity for the most extreme of them. Since S is finite almost surely, Xn � 0 is an instant corollary of 20.2. How ever, the result can be also used in a more subtle way in conjunction with Kronecker ' s lemma. If i7=t Yt converges a.s., where { Yt} = {X11at} and {at} is a sequence of positive constants with an t oo , it follows that a� 1 I7= 1 X1 � 0. This is of course a much weaker condition than the convergence of I7=tXt itself. Most applications feature at = t, but the more general formulation also has uses. There is a standard device for extending a.s. convergence to a wider class of sequences, once it has been proved for a given class: the method of equivalent sequences. Sequences {Xt}j and { Yt}j are said to be equivalent if (20.4)
By the first Borel-Cantelli lemma (18.2(i)), (20.4) implies P(Xt 7:. Yt, i.o.) = 0. In other words, only on a set of probability measure zero are there more than a finite number of t for which Xt(ro) 7:. Yr(ro): 20.3 Theorem If Xt and Yt are equivalent, I�= J(Xt - Yt) converges a.s. Proof By definition of equivalence and 18.2(i) there exists a subset C of Q, with
P(Q - C) = 0, and with the following property: for all ro n0(ro) such that X1(ro) = Y1(ro) for t > n0(ro) . Hence n
L (Xt(ro) - Yt(ro) ) t=l
=
no(ro)
L (Xt(ro) - YtCro)), t= l
and the sum converges, for all ro E C.
'\/
E
C, there is a finite
n 2 no(ro),
•
The equivalent sequences concept is often put to use by means of the following theorem.
/w.4 Theorem Let {Xt}j be a zero-mean random sequence satisfying ,L E I Xr i Pfa� t= l
< oo
(20.5)
for some p 2 1, and a sequence of positive constants {at } . Then, putting 1 � for the indicator function 1 { IX11s;arJ(ro),
The Law of Large Numbers
308
_L P( I Xr l > a1)
< 00,
(20.6)
L I E(X11 �) I la1
< 00,
(20.7)
t= 1
t= 1
and for any
r
� p,
_LE( I Xr l r 1�)/a� <
(20.8)
00• D
t=l
The idea behind this result may be apparent. The indicator function is used to truncate a sequence, replacing a member by 0 if it exceeds a given absolute bound. The ratio of the truncated sequence to the bound cannot exceed 1 and possesses all its absolute moments, while inequality (20.6) tells us that the truncated sequence is equivalent to the original under condition (20.5). Proving a strong law under (20.5) can therefore be accomplished by proving a strong law for a truncated sequence, subject to (20.7) and (20.8). Proof of Theorem 20.4 We prove the following three inequalities:
P( I Xr l > ar)
E(l - 1�) $ E( I Xr i P (1 - 1 �))Ia� $ E( I Xri P)fa�. =
(20.9)
Here the inequalities are because I Xr(ro) I P/� > 1 for ro E { I Xt I > at } , and because E(I Xt i P 1�) is non-negative, respectively.
I E(Xt 1 �) I fat
=
I E(Xr(1 - 1 �) I Iat $ E(I Xti 0 - 1�))/at $ E( I Xt i P (l - 1�))/a� $ E( I Xt iP)fa�.
(20. 10)
The equality in (20. 10) is because E(Xt) = 0, hence E(Xt1 �) = -E(Xr(l - 1 �)). The first inequality is the modulus inequality, and the second is because on the event { I Xr l > ad , ( I Xt l latf � I Xr l lat for p � 1 . Finally, by similar arguments to the above,
E( I Xt l r1�)/a� $ E( I X1 1 P l�)la� for p $ $ E( I Xri P)faf. The theorem follows on summing over t.
r
(20. 1 1)
•
There are a number of variations on this basic result. The first is a version for martingale differences in terms of the one-step-ahead conditional moments, where
The Strong Law of Large Numbers
309
the weight sequence is also allowed to be stochastic. The style of this result is appropriate to the class of martingale limit theorems we shall examine in §20.4, in which we establish almost-sure equivalence between sets on which certain conditions obtain and on whkh sequences converge. 20.5 Corollary Let {X1,:g;1} be a m.d. sequence, let { W1} be a sequence of positive � t-1 -measurable r.v.s, and for some p 2 1 let
D
=
Also define
D1
=
{ro: £ t= l
{ro:
ro ro) }
E( I Xr I P I :g;t- 1 )( )/Wr(
.f P( I Xr l > Wr l �r- I )( m)
< oo
< oo
t=l
}
(20. 12)
:g;.
E
(20. 13)
:g;
E
(20. 14) (20. 15) and let D'
P(D')
=
1.
=
D1
n
D2 n D3. Then P(D - D') = 0. In particular, If P(D) = 1 then
(20.9), (20. 10), and (20. 1 1) for the case of conditional expectations. Noting that E(X1 1 :g;1_!) = 0 a.s. and using
Proof It suffices to prove the three inequalities
the fact that W1 is :g;r_1-measurable, all of these go through unchanged, except that the conditional modulus inequality 10.14 is used to get (20. 14) It follows that almost every m E D is in D'. • .
Another version of this theorem uses a different truncation, with the truncated variable chosen to be a continuous function of Xr; see 17.13 to appreciate why this variation might be useful. 20.6 Corollary Let {Xr}'i' be a zero-mean random sequence satisfying (20.5) for 2 1 . Define
p
(20 . .16) -1 ,
Xr
<
- at .
Then,
2:: I E(Yr) I t=l
< oo,
(20. 17)
The Law of Large Numbers
3 10 00
_L E I Yrl t=l
Proof Write
r
<
00,
r 2:: p.
(20. 1 8)
± ar to denote a1X/I X1 1 . Inequalities (20.10) and (20. 1 1 ) of 20.4
are adapted as follows.
I E(Yr) I = I E(Xr1 � + (1 - 1 �)(±ar)) I /at = I E(Xr - (± ar))( I - 1�) 1 /at ::;; E I Xr i 0 - 1�)/ar + EI 1 - 1� 1 :s; E( I Xri P(l - 1�))/a � + P( ! Xrl > ar) ::;; 2E( I X1 ! P)fa�.
(20. 19)
The second equality in (20. 19) is again because E(Xr) = 0. The first inequality is an application of the modulus inequality and triangle inequalities in succession, and the last one uses (20.9). By similar arguments, except that here the c, inequality is used in the second line, we have
E( l Yr l ') ::;; E I Xr1� + (1 - l�)(±ar) l '!a� 1 :s; 2'- (EI X11� I '/a� + EI (l - 1� ) 1 ') ::;; 2'- \E( I Xr iP1�)/a� + P( ! Xr l > ar)) for p ::;; 2rE( I Xr !P)fa�. The theorem follows on summing over t as before.
::;;
r
(20.20)
•
Clearly, 20.5 could be adapted to this case if desired, but that extension will not be needed for our results. The last extension is relatively modest, but permits summability conditions for norms to be applied. 20.7 Corollary (20.6), (20.7), (20.8), (20. 17), and (20. 1 8) all continue to hold if (20.5) is replaced by
_L E( ! Xr ! P) 1 1qla�1q 00
t=l
for any
q 2::
<
oo
(20.21)
1.
Proof The modified forms of
(20.9), and of (20. 19) and (20.20) (say) are (20.22) (20.23)
The Strong Law of Large Numbers
311 (20.24)
where in each case the first inequality is because the left-hand-side member does not exceed 1 . •
For example, by choosing p = q the condition that the sequence { IIX1/a1 11 p } is summable is seen to be sufficient for 20.4 and 20.6. 20. 2 The Case of Independence
The classic results on strong convergence are for the case of independent sequences. The following is the 'three series theorem ' of Kolmogorov: 20.8 Three series theorem Let {X1} be an independent sequence, and Sn = L�= 1 X1• Sn � S if and only if the following conditions hold for some fixed a > 0:
l:: PC I Xr l > a)
(20.25)
< oo,
t=l
2: E( 1 ! 1 Xtl � a)Xr)
(20.26)
< oo,
t=l 00
2:: Var ( 1 { IXtl �a)Xr) t= l
< oo. D
(20.27)
Since the event {S � S} is the same as the event {Sn+ l � S}, convergence is invariant to shift transformations. It is a remote event by 13.19 and hence in independent sequences occurs with probability either 0 or 1 , according to 13.17. 20.8 gives the conditions under which the probability is 1 , rather than 0. The theorem has the immediate corollary that Sn la � 0, whenever an t oo The basic idea of these proofs is to prove the convergence result for the trun cated variables 1 1 J X1J� a )X1, and then use the equivalent sequences theorem to extend it to X1 itself. In view of 20.4, the condition n
n
l:: E I Xr l p t=l
< oo,
1 � p � 2,
.
(20.28)
is sufficient for convergence, although not necessary. Another point to notice about the proof is that the necessity part does not assign a value to a. Conver gence implies that (20.25)-{20.27) hold for every a > 0. Proof of 20.8 Write Y1 = 1 1 J X1J�a1X1, so that the summands in (20.26) and (20.27)
are respectively the means and variances of Y1• The sequence { Y1 - E(Y1) } is inde pendent and hence a martingale difference, so that s� - s:n = L�=m+1 ( Yr - E(Yr)) is a martingale for fixed m 2 0, and I,7=m+ t Var(Y1) = Var(S� - s:n). Theorem 15.14 combined with 20.2, setting p = 2 in each case and putting c7 = Var(Y1) and K = 1 , together yield the result that S� � S' when (20.27) holds. If (20.26) holds, this further implies that I,7=t Y1 converges. And then if (20.25) holds the
The Law of Large Numbers
3 12
sequences {X1 } and { Y1 } are equivalent, and so Sn � S, by 20.3. This proves sufficiency of the three conditions. Conversely, suppose Sn � S. By 2.25 applied to Sn(ffi) for each ffi E Q, it follows that limm�ooL7=mX = 0 a.s. This means that P( I Xr l > a , i.o.) = 0, for any a > 0, and so (20.25) must follow by the divergence part of the Borel-Cantelli lemma (18.2(ii)). 20.3 then assures us that I7=I Y1 also converges a.s. Write s� = L�= !Var(Y1). If s� ---7 oo as n --7 oo, L,7= t(Y1 - E(Y1))1sn fails to converge, but is asymptotically distributed as a standard Gaussian r.v. (This is the central limit theorem - see 23.6.) This fact contradicts the possibility of I7= 1 Y1 converging, so we conclude that s� is bounded in the limit, which is equiv alent to (20.27). Finally, consider the sequence { Y1 - E(Y1) } . This has mean zero, the same vari ance as Y1, and P( l Y1 - E(Y1) ! > 2a) = 0 for all t. Hence, it satisfies the condi tions (20.25)-(20.27) (in respect of the constant 2a) and the sufficiency part of the theorem implies that L,7= 1 (Y1 - E(Y1)) converges. And since L,7= 1 Yt converges, (20.26) must hold. This completes the proof of necessity. • r
The sufficiency part of this result is subsumed under the weaker conditions of 20.10 below, and is now mainly of historical interest; it is the necessity proof that is interesting, since it has no counterpart in the LLNs for dependent sequences. In these cases we cannot use the divergence part of the Borel-Cantelli lemma, and it appears difficult to rule out special cases in which convergence is achieved with arbitrary moment conditions. Incidentally, Kolmogorov originally proved the maximal inequality of 15.14, cited in the proof, for the independent case; but again, his result can now be subsumed under the case of martingale differences, and does not need to be quoted separately. Another reason why the independent case is of interest is because of the follow ing very elegant result due to Levy. This shows that, when we are dealing with partial sums of independent sequences, the concepts of weak and strong conver gence coincide. 20.9 Theorem When {Xz} is an independent sequence and Sn = L,7= 1 X1, Sn � S if and only if Sn � S.
Proof Sufficiency is by 18.5. It is the necessity that is unique to the particular
case cited. Let Smn = L�=m+ I X1, and for some £ > 0 consider the various ways in which the event { ! Smn l > £ } can occur. In particular, consider the disjoint collection
{
max ! Smjl
msjs k- l
S:
}
2£, ! Smk l > 2£ , k
=
m + l , . . ,n. .
For each k, this is the event that the sum from m onwards exceeds 2£ absolutely for the first time at time k, and thus
lJ
{
max ! Smj l
k=m+ l m sj s k- 1
S:
} {
2£, ! Smk l > 2£
=
}
max I Smjl > 2£ ,
m sj s n
(20.29)
The Strong Law of Large Numbers
313
where the sets of the union are disjoint. It is also the case that
YJm �::'- ] I Smj l 5 2£, I Smd > 21} > { I S,. I 5 £ }
ko
(;;
{ I S= I > E } ,
(20.30)
where the inclusion is ensured by imposing the extra condition for each k. The events in this union are still disjoint, and by the assumption of an independent sequence they are the intersections of independent pairs of events. On applying (20.29), we can conclude from (20.30) that
{�;;'.
L':':: C�::
I Smi l > 2E
�
, "' ' P
.
P( I S, I 5 t)
_ , I S,; I ,
21:.
)
I S"" I > 2e P( I S, I , ,) (20.3 1 )
If Sn � S, there exists by definition m 2 1 such that
P( ! Smn l > £) < £ (20.32) for all n > m. According to (20.32), the second factor on the minorant side of (20. 3 1 ) is at least as great as 1 - £, so for 0 < £ < 1 , P
(:::;::. I Smi I > 2!:) < 1 � £.
Letting n � oo and then m
�
(20.33)
oo, the theorem now follows by 18.3.
•
This equivalence of weak and strong results is one of the chief benefits stemming from the independence assumption. Since the three-series theorem is equivalent to a weak law according to 20.9, we also have necessary conditions for convergence in probability. As far as sufficiency results go, however, practically nothing is lost by passing from the independent to the martingale case, and since showing convergence is usually of disproportionately greater importance than showing nonconvergence, the absence of necessary conditions may be regarded a small price to pay. However, a feature of the three series theorem that is common to all the strong law results of this chapter is that it is not an array result. Being based on the convergence lemma, all these proofs depend on teaming a convergent stochastic sequence with an increasing constant sequence, such that their ratio goes to zero. Although the results can be written down in array form, there is no counterpart of the weak law of 19.7, more general than its specialization in 19.8. 20.3 Martingale Strong Laws
Martingale limit results are remarkably powerful. So long as a sequence is a mart1 n cr<>lP rl 1 ffPrPn<'P
no f1 1rthPr rPctri r'tion<: on
it�:
riPnPnrlP.ncP.
arf'. rf'.nnirf'.O anrl the
The Law of Large Numbers
314
moment assumptions called for are scarcely tougher than those imposed in the inde pendent case. Moreover, while the m.d. property is stronger than the uncorrelated ness assumed in § 19.2, the distinction is very largely technical. Given the nature of econometric time-series models, we are usually able to assert that a sequence is uncorrelated because it is a m.d., basically a sequence which is not fore castable in mean one step ahead. The case when it is uncorrelated with its own past values but not with some other function of lagged information could arise, but would be in the nature of a special case. The results in this section and the next one are drawn or adapted chiefly from Stout (1974) and Hall and Heyde (1980), although many of the ideas go back toDoob (1953). We begin with a standard SLLN for L 2-bounded sequences. 20.10 Theorem Let {Xr,:!Fr} o be a m.d. sequence with variance sequence { a7 } , and {ar } a positive constant sequence with ar i oo. Snfan � 0 if 00
L a7ta7 r-= 1
< =. o
(20.34)
There are (at least) two ways to prove this result. The first is to use the mart ingale convergence theorem (15.7) directly, and the second is to combine the maxi mal inequality of 15.14 with the convergence lemma 20.2. In effect, the second line of argument provides an alternative proof of martingale convergence for the square-integrable case, providing an interesting comparison of techniques. First proof Define Tn = L�== IXrlat, so that { Tm:!Fn } is a square-integrable martin
gale. We can say, using the norm inequality and orthogonality of { Xr } , l/2 < oo, sup E I Tn l � sup E(T �) 1 12 = ,La7ta7 (20.35) n n I r-=
( 00 )
leading directly to the conclusion Tn � T a.s., by 15.7. Now apply the Kronecker lemma to the sequences { Tn(ro) } for (0 E n, to show that Snfan � 0 . •
m ;?: 0, { Tn - Tm, :!Fn } is a martingale with (20.36) E(Tn - Tm) 2 = L�==m+I <J7ta7. Apply 15.14 for p = 2, and then 20.2 with c7 = a7ta7. Finally, apply the Kronecker S �cond proof For
lemma as before.
•
Compare this result with 19.4. If Var(Xr) = <J7 � B < oo, say, then setting at = t, we have I7== 1 a7tP � B'L7== 1 11P 1 . 64B < =, and the condition of the theorem is satisfied, hence Xn = Snfn -E:4 0, the same conclusion as before. But the conditions on the variances are now a lot weaker, and in effect we have converted the weak law of 19.1 into a strong law, at the small cost of substituting the m.d. assumption for orthogonality. As an example of the general formulation, suppose the sequence satisfies z
The Strong Law of Large Numbers 00
L E(X7)1i"
315
< 00 •
t= l
(20.37)
We cannot then rely upon Xn converging to zero, but (putting ar = P) we can show that n -2.L7= 1 Xr = Xnln will do so. The limitation of 20.10 is that it calls for square integrability. The next step is to use 20.4 to extend it to the class of cases that satisfy
_L E ! Xr!Pfa�
< oo
t=l
(20.38)
for 1 :::;; p :::;; 2, and some {ar } t oo It is important to appreciate that (20.3 8) for p < 2 is not a weaker condition than for p = 2, and the latter does not imply the former. For contrast, consider p = 1 . The Kronecker lemma applied to (20.38) implies that .
n
a� 1 :.'L E ! Xrl t=l
-7
0.
(20.39)
For an - n, such a sequence has got to be zero or very close to it most of the time. In fact, there is a trivially direct proof of convergence. Applying the monotone convergence theorem (4.9),
(
) (
E lim a� 1 1 Sn l :::;; E lim a� 1 n�oo
n�oo
n
± I Xrl ) t=l
1 = lim a� ,L E ! Xr l · n�oo
t=l
(20.40)
For any random variable X, EIXI = 0 if and only if X = 0 a.s .. Nothing more is needed to show that Snlan converges, regardless of other conditions. Thus, having latitude in the value of p for which the theorem may hold is really a matter of being able to trade off the existence of absolute moments against the rate ofdamping necessary to make them summable. We may meet interesting cases in which (20.38) holds for p < 2 only rarely, but since this extension is available at small extra cost in complexity, it makes sense to take advantage of it. 20.1 1 Theorem If Snlan � 0.
{ Xr,�r}i is a m.d. sequence satisfying (20.38) for 1
:::;;
p :::;; 2,
Proof Let Yr = 1 { IXti�1)Xr, and note that {Xr } and { Yr } are equivalent under (20.3 8), by 20.4. Yr is also �rmeasurable, and hence the centred sequence
{ Zr,� t }, where Zr = Yr - E(Yr l �r- l), is a m.d. Now,
E(Z7) = E(E(Z� ! � t- 1 )) = E(E(Y7 1 �t- l ) - E(Yr l �t- 1 )2)
316
The Law of Large Numbers (20.41)
According to 20.4 with r = 2 , (20.38) implies that 2.7=1E(Y7)1a7 < =, and so, since E(Z7) s; E(Y7) by (20.41 ), 00
(20.42) L E(Z7)1a7 < =. t= l By 20.10, this is sufficient for 2.7= 1 Z11a1 � S t , where S1 is some random variable. But
n n n L Zr lat = L Yrfat - L E(Yr l ?l'r-t )lar. t= J t= I t=I By 15.13(i), (20.38) is equivalent to
(20.43)
(20.44) L E( I Xr i P I ?l't-t )ldr < oo, a.s. t= l According to 20.5, (20.44) implies that 2.7=I I E(Y1 I ?1'1-J ) l ia1 < =, a.s. Absolute
convergence of a series implies convergence by 2.24, so we may say that 2.7=tE(Yt f ?i'r-t)lar � Sz. Hence, I.7= t Yt lar � St + Sz and so a�12.�= 1 Yr � 0 by the Kronecker lemma. It follows by 20.3 and the equivalence of X1 and Y1 implied by (20.38) that Snlan � 0. • Notice that in this proof there are no short cuts through the martingale conver gence theorem. While we know that 2.7= 1 X,Ia1 is a martingale, the problem is to establish that it is uniformly L1 -bounded, given only information about the joint distribution of {X1}, in the form of (20.38). We have to go by way of a result for p = 2 to exploit orthogonality, which is where the truncation arguments come in handy. 20.4 Conditional Variances and Random Weighting
A feature of martingale theory exploited in the last theorem is the possibility of relating convergence to the behaviour of the sequences of one-step-ahead condi tional moments; we now extend this principle to the conditional variances E(X71 ?ft- J). The elegant results of this section contain those such as 20.10 and 20.11. The conditional variance of a centred coordinate is the variance of the innova tion, that is, of X1 - E(X1 I ?f1_1), and in some circumstances it may be more nat ural to place restrictions on the behaviour of the innovations than on the orig inal sequence. In regression models, for example, the innovations may correspond to the regression disturbances. Moreover, the fact that the conditional moments are ?1'1_ 1-measurable random variables, so that any constraint upon them is probabilistic, permits a generalization of the concept of convergence, following the results of § 15.4; our confidence in the summability of the weighted condi tional variances translates into a probability that the sequence converges, in the
The Strong Law of Large Numbers
317
manner of the following theorem. A nice refinement is that the constant weight sequence {a,} can be replaced by a sequence of �1- 1-measurable random weights. 20.12 Theorem Let {X1,�1 } 0 be a m.d. sequence, { W,} a non-decreasing sequence of positive, ��- 1-measurable r.v.s, and Sn = I7= 1X1. Then
The last statement is perhaps a little opaque, but roughly translated it says that the probability of convergence, of the event {Sn!Wn --7 0}, is not less than that of the intersection of the two other events in (20.45). In particular, when one probability is 1 , so is the other. Proof If {Xr } is a m.d. sequence so is {X1/Wr } , since W, is ��- 1 -measurable, and
oo
Tn = I7= 1X11W, is a martingale. For W E Q, if Tn(W) --7 T(w) and Wn(W) t then Sn(w)!Wn(ro) --7 0 by Kronecker' s lemma. Applying 15.11 completes the proof. • See how this result contains 20.10, corresponding to the case of a fixed, diver gent weight sequence and a.s. summability. As before, we now weaken the summ ability conditions from conditional variances to pth absolute moments for 1 � p � 2. However, to exploit 20.5 outside the almost sure case requires a modification to the equivalent sequences argument (20.3), as follows. 20.13 Theorem If {Xr } and { Y1} are sequences of �,-measurable r.v.s, P
({�
P(X,
*
Y, I :l' H) <
+'< {�
(X, - Y,) converges
})
�
0.
(20.46)
Proof Let E1 = {X1 t: Yr } E �1, so that P(X1 t: Y1 1 �r-1) = E(lEt l �r- 1). According to
15.13(ii), P
({
ro:
� E(IE, I :J',_J)(ro) <
oo
}{ � < oo (20.47)
= !'<
ro:
l E,(ro) <
=
})
�
0.
(20.47)
But 2.7= 1 1 Et(w) < means that the number of coordinates for which Xr(ro) t: Yr(w) is finite, and hence I7= 1 (X1(ro) - Yr(w)) therefore implies (20.46). • .
Now we are able to prove the following extension of 20.11. 20.14 Theorem For 1 � p � 2, let £ 1 = { I7= 1 EC I Xr iP I �r- 1)/Wf: { W1 t } Under the conditions of 20.12,
oo
<
oo }
and £2
=
.
(20.48)
Proof The basic line of argument follows closely that of 20.1 1 . As before, let Y1
= l i i Xtl:!> atJ Xr, so that Zr
=
Yr - E(Yr l �r- 1 ) is a m.d. and
ECZ7 1 �r- 1 ) = E(Y7 1 �r- 1 ) - (E(Yr l �r- 1) )2
The Law of Large Numbers
318
(20.49) Applying 20.5 and the last inequality,
( g
P E1 -
E(Z; J 3',- I )IW;
It follows by 15.1 1 and the fact that
< =
})
=
0.
£1 - C � (£1 - D)
(20.50) u
(D - C) that (20.5 1 )
where S1 is some a.s. finite random variable. A second application of 20.5 gives
( {�
I E(Y, l !f,_I ) l ! W,
( {i
E(Yr l ?ir- J )IW,
P EI
-
which is equivalent (by 2.24) to P
£1 -
t:=l
})
=
--7 })
=
< =
S2
J
0,
(20.52)
0,
(20.53)
where S2 is another a.s. finite r.v. And a third application of 20.5 together with 20.13 gives (20.54) for some a.s. finite r.v. S3 . Now (20.5 1 ), (20.53), (20.54), the definition of Z1, the Kronecker lemma and some more set algebra yield, as required,
(20.55) 20.5 Two Strong Laws for Mixingales
The martingale difference assumption is specialized, and the last results are not sufficient to support a general treatment of dependent processes, although they are the central prop. The key to extending them, as in the weak law case, is the mixingale concept. In this section we contrast two approaches to proving mixing ale strong convergence. The first applies a straightforward generalization of the methods introduced by McLeish (1975a); see also Hansen (199 1 , 1992a) for related
319
The Strong Law of Large Numbers
results. We have two versions of the theorem to choose from, a milder constraint on the dependence being available in return for the existence of second moments.
{X1,?11}':: ""
20.15 Theorem Let the sequence be a Lp-mixingale with respect to constants { for either with mixingale size -�, or (i) p (ii) 1 < p < with rnixingale size - 1 . If 2.7= 1 d; < oo then Sn � S.
c1}, 2, =
2,
Proof We have the maximal inequality,
)
(
K±c�,
(20.56)
E max I Sj i P � 1 5)5, n t=l
K
where is a finite constant. This is by 16.10 in the case of (i) and 16.11 case (ii). By relabelling coordinates It can be expressed in the form
E
(�:
,
)
I Sr Sm i P
,;
for any choice of m and n. Moreover,
P
l�:
,
I Sr Sm l
> e)
=
K!1 cl
{':':
,
m
(20.57)
I Sr Sm i P
>
e"
)
m�x I Sj - Sm i P) �E(m<]'O, (20.58) n by the Markov inequality. Inequalities (20. 5 7) and (20. 5 8) combine to yield (20.2), and the convergence lemma 20.2 now yields the result. �
£
We can add the usual corollary from Kronecker' s lemma.
{ Xrfat>?F1}':""
•
20.16 Corollary Let satisfy either (i} or (ii) of 20.15 with respect to constants for a positive sequence with t oo. If
{ c1/ar},
L t=l
c�/a�
{ a1}
a1
(20.59)
< oo,
0. The second result exploits a novel and remarkably powerful argument due to R. M. de Jong (1992). 20.17 Theorem Let { X1, ?1 1} be an L,-bounded, L1-mixingale with respect to con stants { c1} for r � 1, and let { a1}, { B1} be positive constant sequences, and { M1} a positive integer sequence, with an t oo If �M,ex{e2a�/(32M�t.s;) } (20.60)
then Snlan �
D
.
<
�.
The Law of Large Numbers
320 00
_Lt=l B �-rEIXtl rlat Lt=l SM1Ctlat
(20.61 )
< 00,
(20.62)
< oo,
{ sm };=O are the mixingale coefficients, then Snlan 0. Here {B1} and {M1} are chosen freely to satisfy the conditions, given {a1} and { c1} , which suggests a considerable amount of flexibility in application. The sequence {B1} will be used to define a truncation of { Xc}, the role which was played by {a1} in 20.11. The most interesting of the conditions is (20.62), which explicitly trades off the rate of decrease of the mixingale numbers with that of the sequence {c1/a1}. This approach is in contrast with the McLeish method of where
�
o
defining separate summability conditions for the moments and rnixingale numbers, as detailed in § 16.3.
Proof Writing
1 � for 1 1 start by noting that Er+jXt Et+)�Xr + Er+/1 - l�)Xt. IXti :SBtl' =
Hence, we have the identity
Xt = (Et+M1- tl�Xt-Et-M1 l �Xt) + (Et+M1- t ( l - l �)Xr - Et-M/1 - l�)Xr) (20.63) + (Xt-Et+M1- tXt) + Er-MXr, and, by the usual 'telescoping sum' argument, M1- 1 , (20.64) Et+M1- t l�Xt - Et-M11 �Xt = j=Ll-M10t where Zjt Et+jl�X1- Et+j- d �X1, and { Zjt, ?Jt+j} is a m.d. sequence. Note that application of 10 13(ii) Summing yields 1 0tl 2B1 na.s., by a double Mr- 1 _Lxt t=l = Lt=l j=lL-M10t + _LEt t=l +M1- t (1 - 1 �)Xt n - Lt=l Et-MP - 1�)Xr '+ Lt=l (Xr-Er+M1- lXr) s
=
n
n
.
.
n
(20.65)
1,
The object will be to show that Sknlan � 0 for k = . . . ,5. Starting with S1 n , the main task is to reorganize the double sum. It can be verified by inspection that
The Strong Law of Large Numbers n
M1- l
Mn- 1
n
Mn- 1
n
32 1
L J=LI -M1ZJ1 = L L ZJt t= l
=
=.2::1 -Mn LZJt -
}
(20.66)
t=l
= -M1 x�,
<j s -M1- I and Mr-1 s j < M1, where q1 = 1 for -M1 < j < M1 , and q1 t for > £} � for t = 2, ... ,n. Note that, for arbitrary numbers ... ,xk, > flk } . Hence by subadditivity, and Azuma's inequality (15.20),
U�=d lxd
{ a;/ (32 ; � B?)} -i'
,; 4M"exp
{ I I�=1xd
(20.67)
M
Under (20.60), these probabilities are summable over n and so S1 n first Borel-Cantelli lemma. Now let { be any integrable sequence and define
Y1}
la � 0 by the n
(20.68) By the Markov inequality,
)
( I
P max Sj - S� I > £ s m <j 5, n
�E ( m�x I Sj - S� I m <J 5, n
)
(20.69)
If 00
L E I Yrlla1 < t=l
00,
(20.70)
The Law of Large Numbers by an application of 20.2, and hence S,:!an then 0 by Kronecker' s lemma. We apply this result to each of the remaining terms. For S2n, put Y1 Ez+M1- I Xz( l n- 1 �), and noten that n B - laz, a Y 1�) (20.7 1) 1az :::; z :::; _LE1(1 zlf _LEI Xrl _L t=l t=l t=I } 'EIXrl' using the fact that IXtC I - 1 �)/B11 :::; 1Xr( l - 1�)/B11'. S3n is dealt with i n ex actly the same way. For S4n and Ssn , put successively Yz Xr-Et+M1- I X1 and Y1 assumption, Er-M1X1, and by the mixingale n n (c a ) (20.72) ,LEI t=l Yzlfaz :::; _Lt=I zf z SMt" The proof is completed by noting that the majorant terms of (20. 7 1) and (20. 7 2) 322
�
s,: � sa
=
=
are bounded in the limit by assumption.
=
•
The conditions of 20.17 are rather difficult to apply and interpret. We will restrict them very slightly, to derive a simple summability condition which can be compared directly with 20. 15.
{ X1,�tl { cr}. { c1} { a1} I X1I I , c1 an
-
20.18 Corollary Let be an L,-bounded, L1-mixingale of size with respect to constants If and are positive, regularly varying sequences of constants with and t oo, and «
(20.73) where
0. Proof Define o1 then
Snfan
-E4
l) oS = (1 2r
(20.74)
0.
= (log for 8 > This is slowly varying at infinity by 2.28, and the sequence is summable by 2.31. Apply the conditions of 20. 17 with the added stipulation that and are regularly varying, increasing sequences, and so consider the conditions for summability of a series of the form converges, summability follows for 11 > Since from � Taking logarithms, this is equivalent to
{B1} {M1} (n)}, 0. InConln) -11 U LnUl (n)exp{ 2 (nlon)U1 (n)exp{ -11 U2(n)} 0.
(20.75)
U(n) nPL(n) where L(n) is slowly varying, this condition has the form (20.76) where P I and are non-negative constants and L1 (n) and L2(n) are slowly varying. The terms log (on) and log(L1(n)) can be neglected here. Put P 2 0 and L2(n) lion (log n) 1 +8, and the condition reduces to Since
=
pz
=
=
=
The Strong Law of Large Numbers p1 , B } r
0
which holds for all for any 11 > and o > satisfied (recalling that { is monotone) if
323 (20.77)
0. Condition (20.60) is therefore (20.78)
(2.61) and (2. 62) are satisfied if, respectively, B) -rEIXtlrlat B) -rc�lat tlt, (20.79) and (20.80) We can identify the bounding cases of B t and Mt by replacing the second order-of magnitude inequality sign in (20. 79), and that in (20. 80), by equalities, leaving the required scaling constants implicit. Solving for Mt and B t in this way, sub stituting into (20. 78), and simplifying yields the condition (20.81) where � [2r
«
« O
=
>
•
<
=
---7 oo
---7
.
theorem shows that
(n(log n) 1 +3t 1 _L Xt 0 a.s. t=l n
---7
(20.82)
This amounts to saying that the sequence of sample means is almost surely slowly varying as ---7 =; it could diverge, but no faster than a power of log
n
n.
20.6 Near-Epoch Dependent and Mixing Processes Tn viP.w nf
thP h l�t re�ul ts_ there are two nossihle
annroaches to the NED case. It
The Law of Lar e Numbers
324
g
turns out that neither approach dominates the other in terms of permissable condi tions. We begin with the simplest of the arguments, the straightforward extension of 20.18.
{ Xr } �00 -b, { J..lr } �00 , dr I Xr-J..Lr l p { Vr} �oo 2, -a. (20.83) L I CXr -J.l. r)larl l � < 00 t=l for q > p in the a-mixing case (q � p in the -mixing case) where . {(1 2qb+2(q1) ' 2q(a+ 1) } - mm �(20.84) +q)b + 2(q - 1) (1 +q)a +2q ' then a�1L�= I(Xr-J..lr) -.!!:4 0. Proof By 17.5, {Xr-J..Lr } is a L 1 -mixingale of size -min {b, a(l - 1/q)} with respect to constants fer }, with Cr I Xr -J..Lr l q· This is by 17.5(i) in the a-mixing case and by 17.5(ii) in the -mixing case. The theorem follows by 20.18, after substituting for <po in (20.74) and simplifying. This permits arbitrary mixing and NED sizes and arbitrary moment restrictions, so long as (20. 8 3) holds with � arbitrarily close to 1. By letting b one obtains a result for mixing sequences, and by letting a a result for sequences that are Lp-NED on an independent underlying process. Interestingly, in each of these special cases � ranges over the interval (1, 2ql(q + 1)) as the mixing/Lp-NED size
20.19 Theorem Let a sequence be Lp-NED of size for with means on a (possibly vector-valued) sequence 1 ::;; p ::;; with constants « which is a-mixing (<j>-mixing) of size If , 00
«
•
---7
oo
---7
oo
is allowed to range from zero to oo By contrast, a result based on 20.15 would be needed if we could claim only square-summability of the sequence for finite p; this rules out for any choices of a and The first of these results comes directly by applying 17.5. -
.
{ b. I (Xr-J..lr)larllp } 20.20 Theorem For real numbers b, p and r, let a sequence {Xr}�oo with means {J..lr } �oo be Lp-NED of size -b on a sequence {Vr}�=• with constants dr I Xr- J..Lr l p· For a positive constant sequence {ar } let {(Xr -J..Lr)lar}i be uniformly Lr-bounded, and let (20.83)
«
t oo,
(20.85) I CXr -J..lr)larll� < Then a� 1 .L�=I(Xr- J..L1) 0 in each of the following cases: (i) b = -1, p = 2, r > 2, {Vr } is a-mixing of size -r/(r - 2); (ii) b = 1 , 1 < p < 2, r > p, {Vr} is a-mixing of size -prl(r -p); (iii) b -1, p = 2, r � 2, {Vr } is -mixing of size -r/2(r - 1); (iv) b = 1 , 1 < p < 2, � p, {Vr} is -mixing of size -r/(r - 1). Proof By 17.5, conditions (i)-(iv) are all sufficient for { (Xr-J..lt)/at, ?fr} to be L t=I
-.!!:4
=
r
oo
.
The Strong Law of Large Numbers 325 an Lp-mixingale of size -b, where '!11 cr(V5, s :::; t). The mixingale constants are crlar max{dr.IIXr-�tl l r }lat I Xt- �tl l ,!ar. The theorem follows by 20.16. As an example, let X1 possess moments of all orders and be L2-NED of size -�. on an a-mixing process of size close to -1 (letting r --7 oo). Summability of the terms Var(X/a1) is sufficient by (20. 8 5). The same numbers yield s � on putting q 2 and a 1 in (20.84 ), which is not far from requiring summability of the �-norms. However, this theorem requires L,-boundedness, which if r is small constrains the permitted mixing size, as well as offering poor NED size characteristics for cases with p < 2. It can be improved upon in these situations by introducing a =
«
•
=
=
=
=
truncation argument. The third of our strong laws is the following.
1} �} Lp-NED of size -lip X � V1} which is d1 I X1-�� liP' :::; 2, r 2 -r/(r-2) -r/2(r- 1) r 1, r :::; r a1} oo, let 00 (20.86) L I (Xr-�t)larl l 1n{ p,2qlr } < oo ; t=l then a� 1 :L�=l (Xt-�1) 0. Note the different roles of the three constants specified in the conditions. p controls the size of the NED numbers, q is the minimum order of moment required to exist, and r controls the mixing of { V1 } . The distribution of X1 does not otherwise depend on r. Proof The strategy is to show that there is a sequence equivalent to { (X1-�1)/at }, and satisfying the conditions of 20.15(i). As in 20.6, let (20. 87) where 1 � 1 1 and '±' denotes if X1 > M-1, ' otherwise. Note that { (X1-�1)/a1} is Lp-NED with constants d1/a1, and Y1 is a continuous function of (size X1-�-�1)/a1withwithconstants I Yr l :::; 1 a.s. Applying 17.13 shows that Y1 is L2-NED on { V1} of 21 -p12(d11a1)P12. Since I Yrl l , < oo for- every finite r, it further follows by 17.5 that if '!11 cr(V1-5, s � 0), { Y1 E(Y1), '!f1}o is an L2-mixingale of size -� with constants (20. 88) Cr max { (drfat)P12, I Ytll r } . Here, dr :::; 2 I Xr-�r l q for any � p, and I Ytl l r :::; 2( 1 Xt -�t l q la1) qlr for any q :::; r by the second inequality of (20.24). Condition- (20.86) is therefore sufficient for the sequence { d} to be summable, and { Y1 E(Y1) } satisfies the conditions of 20.15(i). We can conclude that L�=l (Y1 - E(Y1)) S1 , where S1 is some random variable. According to 20.6, condition (20.86) is sufficient for :L7= d E(Y1) I < oo. The '\'00 20.21 Theorem Let a sequence { :'oo with means { :oo be on a sequence { for 1 p :::; with constants « either for > or (i) a-mixing of size for > and � p; (ii) $-mixing of size and for q with p q :::; and a constant positive sequence {
�
=
:'""
t
o
' +'
I Xr l :::: a tl
'
=
«
q
�
cPrtPC
)n
•
J<'{V.\ thPrpfnrP t'nn"PrctPc tn
<=�
fi n i tP l i m i t hu '? ?Ll
"""
•
PfV \ -
The Law of Large Numbers
326
n
Lt=l Yr
� St + Ct .
(20.89)
Inequalities (20.22) and (20.6) further imply that lent sequences, and hence
Y1 and (X1- �1)fa1 are equiva
n
Lt=l ((Xr -!-lr)lar- Yr)
� Sz,
(20.90)
where S2 is another random variable, by 20.3. We conclude that n
.�)Xr-1-l r)lar t=l
� S 1 + Sz + Ct
=
say. It follows by Kronecker's lemma that a� 1 .L�=I conclusion. •
S3 ,
(X1 -j.l1)
(20.9 1 ) � 0,
the required
Here is a final, more specialized, result. The linear function of martingale differences with summable coefficients is a case of particular interest since it unifies our two approaches to the strong law.
X1
20.22 Theorem Let = .Lj=- aa81Ur-j where { U1} is a uniformly Lp-bounded m.d. sequence with p > 1, and L}:-ool e) I < Then
oo
.
(20.92) xr ..!!:4 0. Proof Y1 X1/t is a Lp-mixingale with c1 lit and arbitrary size. It was shown in 16.12 that the maximal inequality (20.56) holds for this case. Application of the convergence lemma and Kronecker's lemma lead directly to the result. Alterna tively, apply 20.18 to X1 with a1 t.
n± t= I l.
=
«
=
•
In these results, four features summarize the relevant characteristics of the stochastic process: the order of existing moments, the summability characteristics of the moments, and the sizes ofthe mixing and near-epoch dependence numbers. The way in which the currently available theorems trade off these features suggests that some unification should be possible. The McLeish-style argument is revealed by de Jong' s approach to be excessively restrictive with respect to the dependence conditions it imposes, whereas the tough summability conditions the latter's theorem requires may also be an artefact of the method adopted. The repertoire of dependent strong laws is currently being extended (de Jong, 1994) in work as yet too recent for incorporation in this book.
21 Uniform Stochastic Convergence
2 1 . 1 Stochastic Functions on a Parameter Space
The setting for this chapter is the class of functions f: Q x 8 � iR, where (Q,� ,j.t) is a measure space, and (8,p) is a metric space. We write f( ro,S) to denote the real value assumed by f at the point (ro,S), which is a random variable for fixed 8. But f(ro,.) , alternatively written just f(ro), is not a random vari able, but a random element of a space of functions. Econometric analysis is very frequently concerned with this type of object. Log Ukelihoods, sums of squares, and other criterion functions for the estimation of econometric models, and also the first and second derivatives of these criterion functions, are all the subject of important convergence theorems on which proofs of consistency and the derivation of limiting distributions are based. Except in a restricted class of linear models, all of these are typically functions both of the model parameters and of random data. To deal with convergence on a function space, it is necessary to have a crit erion by which to judge when two functions are close to one another. In this chapter we examine the questions posed by stochastic convergence (almost sure or in probability) when the relevant space of functions is endowed with the uniform metric. A class of set functions that are therefore going to be central to our discussion have the form f*: Q � iR, where f*(ro)
=
sup f(ro,S).
(2 1 . 1)
6E0
For example, if g and h are two stochastic functions whose uniform proximity is at issue, we would be interested in the supremum of f(ro,S) = l g(ro,S) - h(ro,S) I . An important technical problem arises here which ought to be confronted at the outset. We have not so far given any results that would justify treating f* as a random variable, when (8,p) may be an arbitrary metric space. We can write { ro: f * (ro) > x}
=
U { ro : f(S,ro) > x},
(2 1 .2)
and the results of 3.26 show that { ro: f*( ro) > x} E � when { ro: f(S,ro) > x} E � for each 8, when 8 is a countable set. But typically 8 is a subset of (R k,de) or something of the kind, and is uncountable.
The Law of Large Numbers
328
This is one of a class of measurability problems having ramifications far beyond the uniform convergence issue, and to handle it properly requires a mathematical apparatus going beyond what is covered in Chapter We shall not attempt to deal with this question in depth, and will offer no proofs in this instance. We will merely outline the main features of the theory required for its solution. The essential step is to recognize that the set on the left-hand side of can be expressed as a projection. Let denote the Borel field of subsets of that is, the smallest a-field containing the sets of that are open with respect to p. Then let (.Q X lJi denote the product space endowed with the product a-field (the a-field generated from the measurable rectangles of lJi and and suppose that is cy; /:B-measurable. Observe that, if
3.
(21.2)
:Be
8,
8
8, ®:Be) f(.,. ) (21.3 )
:Be), Ax = { (co,S): f(ro,S) > x} ®:Be, the projection of Ax into .Q is Ex = {co: f(co,S) > X, 8} = {co: f *(ro) > x}. (21.4) In view of 3.24, measurability of r is equivalent to the condition that Ex for rational x. Projections are not as a rule measurable transformations,21 but under certain conditions it can be shown that Ex cy;P , where (.Q,lJP,P) is the completion of the probability space. The key notion is that of an analytic set. A standard reference on this topic is Dellacherie and Meyer (1978); see also Dudley (1989: ch. 13 ) , and Stinchcombe and White (1992). The latter authors provide the following definition. Letting (.Q,cyl) be a measurable space, a set E .Q is called lJ-analytic if there exists a compact metric space (8,p) such that E is the projection onto .Q of a set A ® :Be. The collection of cy;-analytic sets is written dJ(cyl). Also, a function f: .Q is called cyl-analytic if {co: f( co) :::; x} dJ(cyl) for each x Since every E cy; is the projection of E x 8 ®:Be, cy; dl(lJ). A measurable ®:Be
E lJi
e E
E tj
E
c
E lJi
E iR. �
E
E cy;
E
1-7
iR
set (or function) is therefore also analytic. dJ(cyl) is not in general a a-field, although it can. be shown to be closed under countable unions and countable intersections. The conditions under which an image under projection is known to be analytic are somewhat weaker than the definition might suggest, and it will actually suffice to let be a that is, a space that is measurably isomorphic to an analytic subset of a compact metric space. A suffi cient condition, whose proof can be extracted from the results in Stinchcombe and White is the following:
(8,:Be)
Souslin space,
(1992), 21.1 Theorem Let (.Q,cyl) be a measurable space and (8, :B e) a Souslin space. If dl(lJ ®:Be), the projection of onto .Q is in dl(lJ). Now, given the measurable space (.Q,cyl), define = n)lcy;ll, where (.Q,cyl11,ji) is the completion of the probability space (.Q,cyl,f...l) (see 3.7) and the intersection is taken over all p.m.s defined on the space. The elements of are called univerBE
B
o
rg;U
f...l
lJi u
Uniform Stochastic Convergence
329
sally measurable sets. The key conclusion, from Dellacherie and Mayer (1978: III.33(a)), is the following.
21.2 Theorem For a measurable space (05), :A('!F) � rg; u_ o
(21 .5)
Since by definition rg; u c '!F� for any choice of J.L, it follows that the analytic sets of '!F are measurable under the completion of (Q,'!f ,Jl) for any choice of Jl. In other words, if E is analytic there exist A, B E '!F such that A � E c B and J.L(A) = J.L(B). In this sense we say that analytic sets are 'nearly ' measurable. All the standard probabilistic arguments, and in particular the values of integrals, will be unaffected by this technical non-measurability, and we can ignore it. We can legitimately treat f*(ro) as a random variable, the conditions on 8 are observed and we can assume f(. , .) to be (near-) '!F ® :130 /:B-measurable. An analytic subset of a compact space need not be compact but must be totally bounded. It is convenient that we do not have to insist on compactness of the parameter space, since the latter is often required to be open, thanks to strict inequality constraints (think of variances, stable roots of polynomials and the like). In the convergence results below, we find that 8 will in any case have to be totally bounded for completely different reasons: to ensure equicontinuity; to ensure that the stochastic functions have bounded moments; and that when a stoch astic criterion function is being optimized with respect to 8, the optimum is usually required to lie almost surely in the interior of a compact set. Hence, total boundedness is not an extra restriction in practice. The measurability condition on f(ro,S) might be verifiable using an argument from simple functions. It is certainly necessary by 4.19 that the cross-section functions f(.,S) : .Q f---7 iR and f(ro, .): f---7 iR be, respectively, '!f/:B-measurable for each e E and :Be /:B-measurable for each 0) E .Q. For a finite partition { , . . . ,em } of by :Be-sets, consider the functions
provided
e e
where
ej
e
fem) (ro, S)
is a point of
=
ej.
f(ro, �), e E
If Ej
=
el
ej,
j
=
I, . . .
,m,
(2 1 .6)
{ ro: f(0), �) :::; X} E '!F for each j, then (21 .7)
j
being a finite union of measurable rectangles. Since this is true for any x, fern) is '!F ® :80 /:B-measurable. The question to be addressed in any particular case is whether a sequence of such partitions can be constructed such that fern) --7 f as
m
--7 oo .
Henceforth we shall assume without further comment that suprema of stochastic functions are random variables. The following result should be carefully noted, not least because of its deceptive similarity to the monotone convergence theorem, although this inequality goes the opposite way. The monotone convergence theorem concerns the expectation of the supremum of a class of functions { fnCro) } , whereas the present one is more precisely concerned with the of a class of
envelope
The Law of Large Numbers
330
functions, the function j*(w) which assumes the value supB e ef(ro, O) at each point of n.
(
21.3 Theorem sup E(f(O)) � E sup f(O)) . J Be e Be e
Proof Appealing to 3.28, it will suffice to prove this inequality for simple
functions. A simple function depending on e has the form
=
=
m
_L a/9) 1 E; (W) i=l
ai(9), W E Ei.
(
)
=
a;, w E Ei.
i
sup E(cp(O)) - E sup cp(e) = sup ( ai(O) - ai)P(Ei) � 0, B e e i=l Be e where the final inequality is by definition of a;. • Bee
(2 1 .8)
SUPB e eai(O), sup cp(ro,e) Be e
Hence
=
(2 1 .9)
(2 1 . 10)
2 1 . 2 Pointwise and Uniform Stochastic Convergence
Consider the convergence (a.s., in pr., in Lp, etc.) of the sequence { Qn(O) } to a limit function Q(8), Typically this is a law-of-large-numbers-type problem, with n (2 1 . 1 1) Qn(O) = L qnr(O) t=l (we use array notation for generality, but the case qnt = q1/n may usually be assumed), and Q(O) = limn�coE'(Qn(O)). Alternatively, we may want to consider the case Gn(8) --7 0 where n (2 1 . 12) Gn(O) = L (qnr(O) - E(qnr(O ))) . t= l By considering (2 1 . 1 2) we divide the problem into two parts, the stochastic convergence of the sum of the mean deviations to zero, and the nonstochastic convergence assumed in the definition of Q( O). This raises the separate question of whether the latter convergence is uniform, which is a matter for the problem at hand and will not concern us here. As we have seen in previous chapters, obedience to a law of large numbers calls for both the boundedness and the dependence of the sequence to be controlled. In the case of a function on 0, the dependence question presents no extra difficulty; for example, if qntCOt) is a mixing or near-epoch dependent array of a given class, the property will generally be shared by qntC82), for any e l ' 92 E But the existence of particular moments is clearly not independent of e . If there
e.
Uniform Stochastic Convergence 331 exists a positive array {Dnr} such that I qnr(8) I Dnt for all 8 e, and I Dnrl l r < oo, uniformly in t and n, qntC8) is said to be Lr-dominated. To ensure pointwise convergence on e, we need to postulate the existence of a dominating array. There is no problem if the qnr( 8) are bounded functions of 8. More generally it is E
::;
necessary to bound e, but since e will often have to be bounded for a different set of reasons, this does not necessarily present an additional restriction. Given restrictions on the dependence plus suitable domination conditions, as an ordinary pointwise stochastic convergence follows by considering stochastic sequence, for each E e. However, this line of argument does not guarantee that there is a minimum rate of convergence which applies for all the condition of uniform convergence. If pointwise convergence of to the limit is defined by
{ Gn(8)}
8
{ Gn(8)} 8, G(8) Gn(8) 0 (a.s., in or in pr.), each 8 e, (21.13) a sequence of stochastic functions { Gn(8)} is said to converge unifo rmly (a.s., in or in pr.) on e if sup I Gn(8) I or in pr.). 0 (a.s., in (21.14) �
E
Lp,
Lp ,
�
Lp ,
9 E8
To appreciate the difference, consider the following example.
[O,oo), and define a zero-mean array {gnr(8)} where o 8 ll2n Z8, h t (21.15) gntC8) + Z(lln-8), ll2n < 8 1/n 0, 1/n < 8 < oo where {hr} is a zero-mean stochastic sequence, and Z is a binary r.v. with P(Z 1)and P(Z -1) �· Then Gn(8) L�=lgnr(8) Hn + Kn(8), where Hn n L7= ht, Zn8 , 0 8 1/2n Kn(8) Z(l - n8), ll2n < 8 1/n. (21.16) 0, lin < 8 < oo We assume Hn ...E4 0. Since Gn(8) Hn for 8 > lin as well as for 8 0, Gn(8) ...E4 0 for each fixed 8 e. In other words, Gn(8) converges pointwise to zero, a.s. However, supe I Kn(8) I I �Z I = � for every n ;:::: 1. Because Hn converges a.s. there will exist N such that I Hnl < ! for all n ;:::: N, with probability 1. You can verify that when I Hn I < ! the supremum on e of I Hn + Kn(8) I is always attained at the point 8 1/2n. Hence, \Yith probability 1, sup I GnC8) 1 I Hn+�ZI for n ;:::: N, 21.4 Example Let e
=
=
=
!
n
=
E
E8
=
!
::;
=
::;
::;
=
=
=
=
9E 8
::;
::;
=
=
=
::;
=
-
I
t
=
The Law of Large Numbers � as n oo It follows that the uniform a.s. limit of Gn(S) is not zero. Similarly, for n � N, P (6supE 8 I Gn(S) I � E) P( IHn +�ZI � E) P( I�ZI � E) 1, 332
----7
----7
.
(21 . 1 7)
=
=
----7
(21 . 1 8)
so that the uniform probability limit is not zero either, although the pointwise probability limit must equal the pointwise a.s. limit. o Our first result on uniform a.s. convergence is a classic of the probability literature, the This is also of interest as being a case outside the class of functions we shall subsequently consider. For a collec tion of identically distributed r.v.s on the probability space (Q.,'!f,P), the is defined as
Glivenko-Cantelli theorem. { X1(co), ... ,Xn(CO)} empirical distribution function n 1 (21 . 19) Fn(x,co) -n L 1 (-co,rJ(Xr(co)). In other words, the random variable Fn(x,co) is the relative frequency of the variables in the set not exceeding x. A natural question to pose is whether (and in what sense) Fn converges to F, the true marginal c.d.f. for the distribution. For fixed x, {Fn(x,co)} i is a stochastic sequence, the sample mean of n Bern oulli-distributed random variables which take the value 1 with probability F(x) and 0 otherwise. If these form a stationary ergodic sequence, for example, we know that Fn (x,co) F(x) a.s. for each x IR . We may say that the strong law of large numbers holds pointwise on IR in such a case. Convergence is achieved at x for all co Cx, where P(Cx) 1. The problem is that to say that thefunctions Fn converge a.s. requires that a.s. convergence is achieved at each of uncountable set of points. We cannot appeal to 3.6(iii) to claim that P(nx E IRCx) 1 , and hence the assertion that Fn (x,co) F(x) with probability 1 at a point x not specified beforehand cannot be proved in this manner. This is a problem for a.s. convergence additional to the possibility of convergence breaking down at certain points of =
t= l
----7
E
E
=
an
----7
=
the parameter space, illustrated by 21.4. However, uniform convergence is the condition that suffices to rule out either difficulty. In this case, thanks to the special form of the c.d.f. which as we know is bounded, monotone, and right-continuous, uniform continuity can be proved by establishing a.s. convergence just at a countable collection of points of IR .
Fn(x,co) F(x) a.s. pointwise, for x sup I Fn(x,co)-F(x) I 0 a.s. Proof First define, in parallel with Fn, 21.5 Glivenko-Cantelli theorem If X
----7
----7
o
E
IR , then (21 .20)
Uniform Stochastic Convergence
333
(2 1 . 2 1 ) F�(x,w) F(x-) for all w in a set C�, where P(C�) __,
and note that integer > 1 let
m
=
Xjm inf{x E lR : F(x) � jim} , j 1 , .. . ,m - 1 , and also let Xom = and Xmm so that, by construction, F(Xjm-) - F(Xj-I ,m) � l im, j 1, ... ,m. =
-=
Lastly let
= +=,
=
1 . For an (21 .22)
=
(2 1 2 3 ) .
=
iFn(Xjm,W)-F(Xjm)l , IF�(Xjm ,W) -F(Xjm-) 1 l} (21 .24) Mmn(W) 1max{maxf 5j 5 m Then, for j 1 , . .. ,m and x E [Xj- l,m ,Xjm), 1 F(x) --m Mmn(W) � F(x1·- 1 ,m) -Mmn(W) � Fn(Xj- l ,m,(J)) � Fn(X,ffi) � F�(Xjm,W) 1 (21 .25) � F(Xjm-)+Mmn(W) � F(x) +-+Mmn(W). m That is to say, IFn(x,w) -F(x)l � llm +Mmn(W) for every x E IR . By pointwise strong con vergence we may say that limn�ooMmnCro) 0 for finite m, and hence that limn�ooSUPx I Fn(x,ro) -F(x) I � 1 /m , for all c;, where m = (2 1 . 2 6 ) c; n ccxmjn c�m). =j l But P(limm oo c;) 1 by 3.6(iii), and this completes the proof. =
(J) (It
�
=
•
=
Another, quite separate problem calling for uniform convergence is when a sample statistic is not merely a stochastic function of parameters, but is to be eval uated at a random point in the parameter space. Estimates of covariance matrices of estimators generally have this character, for example. One way such estimates are obtained is as the inverted negative Hessian matrix of the associated sample log-likelihood function, evaluated at estimated parameter values. The problem of proving consistency involves two distinct stochastic convergence phenomena, and it does not suffice to appeal to an ordinary law of large numbers to establish convergence to the true function evaluated at the true point. The following theorem gives sufficient conditions for the double convergence to hold. 21.6 Theorem Let (Q,c:J,P) be a probability space and (8,p) a metric space, and let H IR be c:J/:8-measurable for each 8 E If (a) e� � eo , and (b) � Q( e) uniformly on an open set Bo containing eo, where Q (e) is a nonstochastic function continuous at eo,
Qn: exn Qn(e)
e.
The Law of Large Numbers
334
then Qn(S� ) � Q(So) . Proof Uniform convergence in probability of Qn on Bo implies that, for any £ > 0 � 1 large enough that, for � and 8 > 0, there exists
(
N1
n N1,
)
(2 1 .27)
P sup I Qn(S) - Q (S) I < �£ � 1 - !8. B E Bo
Also, since S� � S0, there exists
P(s�
e
Nz such that, for � Nz, n
(21 .28)
Bo) � 1 - a8.
To consider the joint occurrence of these two events, use the elementary relation
P(A n B) � P(A) + P(B) - 1 . Since
{ 8� for
E
}
{
Bo } n sup I Qn(S ) - Q (S) I < 1£ B E Bo
n � max(NI ,Nz),
c
(21 .29)
{ I Qn(S�) - Q( S�) I < �£} , (21 .30)
P { I Qn(S�) - Q(S� ) I < �£} � 2( 1 - !P) - I = 1 - 18. (21 .3 1 ) Using continuity at S0 and 18.10(ii), there exists large enough that, for �
N3
N3,
n
(21 .32)
P( I Q(S�) - Q (So) I < �£) � 1 - �8. By the triangle inequality,
(21 .33) and hence
{ I QnCS�) - Q (S�) I < 1£} n { I Q(S� ) - Q(So) I < �£} � {
Applying (21 .29) again gives, for
I Qn(S�) - Q (So) I < £ } .
(2 1 .34)
n � max(N1 ,N2,N3),
P( I Qn(S�) - Q(S o) I < £) � 1 - 8. The theorem follows since 8 and £ are arbitrary. •
(21 .35)
Notice why we need uniform convergence here. Pointwise convergence would not allow us to assert (21 .27) for a which works for all S e B0• There would be the risk of a sequence of points existing in B0 on which is diverging. Suppose So = 0 and Gn(S) = Qn(S) - Q(S) in 21.4. A sequence approaching So, say { 1 /m, m e IN } , has this property; we should have
single N1
N1
(21 .36) P( I Qn(1/m) - Q(llm) I < �£) � 1 - !P for arbitrary £ > 0 and 8 > 0, for > m. Therefore we would not be able to
only n
Uniform Stochastic Convergence 335 claim the existence of a finite n for which (2 1 . 3 1 ) holds, and the proof collapses. In this example, the sequence of functions { Gn(9 ) } is continuous for each n, but
the continuity breaks down in the limit. This points to a link between uniform convergence and continuity. We had no need of continuity to prove the Glivenko Cantelli theorem, but the c.d.f. is rather a special type of function, with its behaviour at discontinuities (and elsewhere) subject to tight limitations. In the wider class of functions, not necessarily bounded and monotone, continuity is the condition that has generally been exploited to get uniform convergence results. 2 1 . 3 S tochastic Equicontinuity
Example 21.4 is characterized by the breakdown of continuity in the limit of the sequence of continuous functions. We may conjecture that to impose continuity uniformly over the sequence would suffice to eliminate failures of uniform conver gence. A natural comparison to draw is with the uniform integrability property of sequences, but we have to be careful with our terminology because, of course, uniform continuity is a well-established term for something completely different. The concept we require is or, to be more precise, see (5.47). Our results will be based on the following version of the ArzeUt-Ascoli theorem (5.28).
equicontinuity,
uniform equicontinuity;
n
asymptotic
21.7 Theorem Let Un(9 ), E [N } be sequence of (nonstochastic) functions on a totally bounded parameter space (8,p ) . Then, supe E e I fn( 9) I 0 if and only if and Un } is fn(e) --7 0 for all e E 8o, where 8o is a dense subset of asymptotically uniformly equicontinuous. o
--7 e,
n
The set IF = Un, E [N } u { 0 } , endowed with the uniform metric, is a subspace of ( Ce ,du), and by definition, convergence of fn to 0 in the uniform metric is the same thing as uniform convergence on 8. According to 5.12, compactness of IF is equivalent to the property that every sequence in IF has a cluster point. In view of the pointwise convergence, the cluster point must be unique and equal to 0, so that the conclusion of this theorem is really identical with the ArzeUt-Ascoli theorem, although the method of proof will be adapted to the present case. Where convenient, we shall use the notation wUn l>)
=
sup
sup I fn( 9') - fn(8) 1 .
(2 1 .37)
9 E 8 9'E S(8,0)
The function w(fn,.): IR + 1--7 IR + is called the of fn· Asymp totic uniform equicontinuity of the sequence Un } is the property that limsupnw(fn,8) -1- 0 as 8 -1- 0.
modulus of continuity
Proof of 21.7 To prove 'if : given £ > 0, there exists by assumption 8 > 0 to
satisfy
limsup w(fn,8) < £.
(2 1 .38)
The Law of Large Numbers
336
Since e is totally bounded, it has a cover {S(S))/2), i = 1 , ... ,m}. For each i, choose ei E eo such that p( Sj,Sj) < o/2 (possible because eo is dense in 8) and note that { S(Sj,O), i = 1 , ... ,m} is also a cover for e. Every e E e is contained in S(S i,o) for some i, and for this i,
I fn(S ) I
:::;
:::;
sup I fn(S') I
e' e s(6;,5)
sup I fnCS') - fCSD I + I f(8i) 1 .
(2 1 .39)
e' e S(S;,5)
We can therefore write sup I fn(S ) I
eee
:::;
max
sup I fn( S') - f(S i) I + max I f(Si) I J=:; i =:;m
J =:; i =:; m e' e S(e;,o)
(21 .40) I =:;; i =:;; m
Sufficiency follows on taking the limsup of both sides of this inequality. 'Only if follows simply from the facts that uniform convergence entails point wise convergence, and that w(fmO) :::; 2 sup I fn (S ) 1 . eee
•
(21 .41)
To apply this result to the stochastic convergence problem, we must define concepts of stochastic equicontinuity. Several such definitions can be devised, of which we shall give only two: respectively, a weak convergence (in pr.) and a strong convergence (a.s.) variant. Let (8,p) be a metric space and (0./!F,P) a probability space, and let { Gn(S ,co)), n E IN } be a sequence of stochastic functions Gn : e X Q. i---7 IR , ?1'/:B-measurable for each 8 E e. The sequence is said to be asymp totically uniformly stochastically equicontinuous (in pr.) if for all E > 0 3 o > 0 such that Iimsup P(w(GmO)
::?:
E) < E.
(21 .42)
And it is said to be strongly asymptotically uniformly stochastically equicontin uous if for all E > 0 3 o > 0 such that
( n-too
)
P umsup w(Gn.o) :::: E
=
0.
(2 1 .43)
Clearly, there is a bit of a terminology problem here ! The qualifiers 'asymptotic' and 'uniform' will be adopted in all the applications in this chapter, so let these be understood, and let us speak simply of stochastic equicontinuity and strong stochastic equicontinuity. The abbreviations s.e. and s.s.e. will sometimes be used.
Uniform Stochastic Convergence
337
2 1 .4 Generic Uniform Convergence
Uniform convergence results and their application in econometrics have been researched by several authors including Hoadley (197 1 ), Bierens ( 1989), Andrews ( 1987a, 1992), Newey ( 1 99 1 ), and Potscher and Prucha ( 1989, 1994). The material in the remainder of this chapter is drawn mainly from the work of Andrews and Potscher and Prucha, who have pioneered alternative approaches to deriving 'generic' uniform convergence theorems, applicable in a variety of modelling situations. These methods rely on establishing a stochastic equicontinuity condition. Thus, once we have 21.7, the proof of uniform almost sure convergence is direct.
n
21.8 Theorem Let { Gn(S), E IN } be a sequence of stochastic real-valued functions on a totally bounded metric space (8,p). Then sup I Gn(S) I ...!!4 0 8E 0
(2 1 .44)
if and only if (a) Gn(S) ...!!4 0 for each 8 E 8o, where 8o is a dense subset of (b) { Gn } is strongly stochastically equicontinuous.
e,
Proof Because (8,p) is totally bounded it is separable (5.7) and 80 can be chosen
to be a countable set, say 80 = { Sh k E IN } . Condition (a) means that for k = 1,2, ... there is a set with = 1 such that Gn(Sbro) 0 for ro E Condition (b) means that the sequences { Gn( ro) } are asymptotically equicontinuous for all ro E with = 1 . By the sufficiency part of 21.7, sup8E e i Gn(S,ro) l --7 0 for ro E c* = nk=l c = 1 by 3.6(iii), proving 'if . 'Only if' follows from the necessity part of 21.7 applied to { Gn(ro) } for each ro
Ck P(Ck) C, P(C) k (') C. P(C*)
e
c*.
--7
Ck.
•
The corresponding 'in probability' result follows very similar lines. The proof cannot exploit 21.7 quite so directly, but the family resemblance in the arguments will be noted. 21.9 Theorem Let { Gn(S), n E IN } be a sequence of stochastic real-valued functions on a totally bounded metric space (8,p ). Then sup I Gn(S) I � 0 8E 0 if and only if (a) Gn(S) � 0 for each 8 E 8o, where 8o is a dense subset of (b) { Gn } is stochastically equicontinuous.
S
Proof To show 'if ' let { ( S i, 8)
,
(2 1 .45)
e,
i = 1 ,. . . ,m} with ei E 8o be a finite cover for
8. This exists by the assumption of total boundedness and the argument used in the proof of 21.7. Then,
The Law of Large Numbers
338
(
2£) ::; P (�·�X
P :�� 1 Gn(8) 1
�
'
Lt_m e
sup
( I Gn (9') - Gn(9i) I + I Gn(Si) I )
) t c:) + P (Q { I Gn(SD I c:} )
eS(S;,o)
::; P(w(Gmo) � c:) + P ��: I Gn(SD I � c: � ::; P(w(Gn,o) �
2£)
�
m
::; P(w(Gmo) � £) + _LP( I Gn(Si) l i=l
where we used the fact that
{x+y � 2£ }
�
{x � c: }
c
u
�
(21.46)
£),
{y � £ }
(21.47)
for real numbers x and y, to get the third inequality. Taking the limsup of both sides of (a) and (b) impiy that
(21.46),
n--7oo P
limsup
(
sup
eee
I Gn(8) I
�
2c:) < £.
(21.48)
To prove 'only if' , pointwise convergence follows immediately from uniform convergence, so it remains to show that s.e. holds; but this follows easily in view of the fact (see that
(21.4 1))
P(w(Gn ,O) � c:) ::; P
(
sup
eee
I Gn(9) I
�
c:/2) .
(21.49)
•
0
There is no loss of generality in considering the case Gn � in these theorems. We canjust as easily apply them to the case where Gn(8) = Qn(8) - Qn(9) and Qn is a nonstochastic function which may really depend on n, or just be a limit func tion, so that Qn = Q. In the former case there is no need for Qn to converge, as long as Qn - Qn does. Applying the triangle inequality and taking complements in we obtain
(21.47),
(21. 50)
This means that { Qn - Qn} is s.e., or s.s.e. as the case may be, provided that { Qn} is s.e., or s.s.e., and { Qn} is asymptotically equicontinuous in the ordinary sense of This extension of 21.8 is obvious, and in 21.9 we can insert the step
§5. 5 .
(21.46),
0 1
(21. 5 1)
into where the second term on the right is or depending o n whether the indicated nonstochastic condition holds, and this term will vanish when n � N
Uniform Stochastic Convergence
339
for some N � 1 , by assumption. The s.e. and s.s.e. conditions may not be particularly easy to verify directly, and the existence of Lipschitz-type sufficient conditions could then be very convenient. Andrews (1 992) suggests conditions of the following sort. 21.10 Theorem Suppose there exists N � 1 such that
I Qn(S') - Qn(S) I
n
::;
(21 .52)
Bnh( p (S,8')), a.s.
holds for all 8,8' E 8 and � N, where h is nonstochastic and and { Bn } is a stochastic sequence not depending on e. Then (i) { Qn } is s.e. if Bn = Op (l). (ii) { Qn } is s.s.e. if limsupnBn < oo, a.s. Proof The definitions imply that w(Qn,8) ::; Bnh(8) a.s. for note that, for any £ > 0 and 8 > 0,
limsup P(w(Qn,8) � E)
::;
h(x)
-1-
0 as
x
-1-
0,
n � N. To prove (i), (21 .53)
limsup P(Bn � £/h(8)).
By definition of Op(l), the right-hand side can be made arbitrarily small by choosing £1h(8) large enough. In particular, fix £ > 0, and then by definition of we may take 8 small enough that limsupn�ooP(Bn � < £. For (ii), we have in the same way that, for small enough 8,
h
£/h(8))
(
) (
)
P limsup w(Qm8) � £ ::; P limsup Bn � £/h(8) < £. n �oo n �oo
•
(21 .54)
A sufficient condition for Bn Op (l) is to have Bn uniformly bounded in L1 norm, i.e., supnE(Bn) < oo (see 12.11), and it is sufficient for limsupnBn to be a.s. bounded if, in addition to this, Bn - E(Bn) � 0. The conditions of 21.10 offer a striking contrast in restrictiveness. Think of (21 .52) as a continuity condition, which says that Qn(8') must be close to Qn(8) when 8' is close to e. When Qn is stochastic these conditions are very hard to satisfy for Bn, because random changes of scale may lead the condition to be violated from time to time even if Qn(S,oo) is a continuous function for all w and The purpose of the factor Bn is to allow for such random scale variations. Under s.e., we require that the probability of large variations declines as their magnitude increases; this is what Op(l) means. But in the s.s.e. case, the requirement that { Bn } be bounded a.s. except for at most a finite number of terms implies that { Qn } must satisfy the same condition. This is very restrictive. It means for example that Qn(8) cannot be Gaussian, nor have any other distribution with infinite support. In such a case, no matter what { Bn } and were chosen, the condition in (21 .52) would be violated eventually. It does not matter that the probability of large deviations might be extremely small, because over an number of sequence coordinates they will still arise with probability 1 . Thus, strong uniform convergence is a phenomenon confined, as far as we are able to show, to a.s. bounded sequences. Although (21 . 52) is only a sufficient condition, it can be verified that this feature of s.s.e. is implicit in the =
fixed
n.
h
infinite
340
The Law of Large Numbers
definition. This fact puts the relative merits of working with strong and weak laws of large numbers in a new light. The former are simply not available in many important cases. Fortunately, 'in probability' results are often sufficient for the purpose at hand, for example, determining the limits in distribution of estimators and sample statistics; see §25 . 1 for more details. Supposing (8,p) c (lRk,d£) , suppose further that Qn (O) is differentiable a.s. at each point of 8; to be precise, we must specify differentiability a.s. at each point of an open convex set 8* containing 8. (A set B c IRk is said to be convex if x E B and y E B imply Ax + (1 - A.)y E B for A E [0, 1].) The mean value theorem yields the result that, at a pair of points 8,8' E 8* , 22 (2 1 .55) where 8 * E 8* is a point on the line segment joining 8 and 8', which exists by convexity of 8*. Applying the Cauchy-Schwartz inequality, we get
I Qn (B) - Qn (B') I
where
l
I
k aQn I Oj ' · l I ei I a e i I S=e
s�
S Bn ii B - B'll a.s.,
(2 1 .56)
I � j .1 .
(2 1 . 57)
a n 8n = sup a _ S-O 9 * E 8*
Here II . 11 denotes the Euclidean length, and dQnldB is the gradient vector whose elements are the partials of Qn with respect to the Oi. Clearly, (2 1 .52) is satisfied by taking h as the identity function, and Bn defined in (2 1 .57) is a random variable for all n. Subject to this condition, and Bn satisfying the conditions specified in 21.10, a.s. differentiability emerges as a sufficient condition for s.e .. 2 1 .5 Uniform Laws of Large Numbers
In the last section it was shown that stochastic equicontinuity (strong or in pr.) is a necessary and sufficient condition to go from pointwise to uniform convergence (strong or in pr.). The next task is to find sufficient conditions for stochastic equicontinuity when { Qn(O) } is a sequence of partial sums, and hence to derive uniform laws of large numbers. There are several possible approaches to this problem, of which perhaps the simplest is to establish the Lipschitz condition of 21.10. 21.11 Theorem Let { { qnr( ro,O) } �= 1 } �= 1 denote a triangular array of real stochastic functions with domain (8,p), satisfying, for � 1 ,
N
Uniform Stochastic Convergence
341
I qnr(S') - qnr(S) I � Bnrh(p(8 , 8')), a.s., (2 1 .58) for all 8,8' e 8 and 2 N, where is nonstochastic and (x) -1, 0 as x -1, 0, and { Bnr l is a stochastic array not depending on a with L�= !E(Bnr) = 0( 1 ). If Qn(S) = L�= l qnr(S ), then (i) Qn is s.e. ; (ii) Qn is s.s.e. if L�=!(Bnr - E(Bnr)) � 0.
n
h
h
Proof For (i) it is only necessary by 21.10(i) and the triangle inequality to
establish that I�= !Bn1 = Op (l). This follows from the stated condition by the Markov inequality. Likewise, (ii) follows directly from 21.10(ii). • A second class of conditions is obtained by applying a form of s.e. to the summands. For these results we need to specify Gn to be an unweighted average of functions, since the conditions to be imposed take the form of Cesaro summability of certain related sequences. It is convenient to confine attention to the case
n
Gn(ro,S)
=
1 n - L (qtCXr(CO),S) - E(qr(Xr,S ))),
n
t=l
(2 1 .59)
where X1 e tz is a random element drawn from the probability space (tz,X, �1). Typically, though necessarily, X1 is a vector of real r.v.s with K a subset of m !R , m 2 1 , X being the restriction of 13m to K. The point here is not to restrict the form of the functional relation between q1 and ro, but to specify the existence of marginal derived measures �1, with �1(A) = P(X1 E A) for A e X. The usual context will have Gn the sample average of functions that are stochastic through their dependence on some kind of data set, indexed on t. The functions themselves, not just their arguments, can be different for different t. We must find conditions on both the functions qr(.,.) and the p.m.s J.l1 which yield the s.e. condition on Gn. The first stage of the argument is to establish conditions on the stochastic functions q1(8) which have to be satisfied for s.e. to hold. Andrews (1 992) gives the following result.
not
21.12 Theorem If (a) there exists a positive stochastic sequence { d1} satisfying sup I qr (S ) I � d,, all
and
eee
1 n . hmsup _L E(drl ldt>M} )
t
---7 0 as M ---7 oo ; n�oo t=l (b) for every £ > 0, there exists 0 > 0 such that n 1 limsup - _L P(w(q1,0) > E) < £;
n
-
n�oo
n
t=l
(2 1 .60)
(2 1 .6 1 )
(21 .62)
The Law of Large Numbers
342
Gn is s.e. Condition (21. 6 1) is an interesting Cesaro-sum variation on uniform integrability, and actual uniform integrability of { d1} is sufficient, although not necessary. Condition (a) is a domination condition, while condition (b) is called by Andrews termwise stochastic equicontinuity. Proof Given £ > 0, choose M such that limsupn�oon I7= t E(2d11 < t£2 , and then o such that limsup ! i, r(w(q1,o) > !£2 ) < !M-1 £2• (21.63) n�oo n then
o
-I
1 2dr>MJ)
t:=l
The first thing to note is that
w((q1 - E(qr)),o) ::; w(qr,ro) + w(E(qr), o) $ w(q1,ffi) +E(w(q1,0)),
(21.64)
where the last inequality is an application of 21.3. Applying Markov's inequality,
P(w(Gn,O) > £) ::; r (* i w(q1 -E(q1),o) > £)
(21.64) and using
t= l
(21.65) 1.
where the indicator functions in the last member add up to Using the fact that > M } , and taking the limsup, we now and hence > M} c $ obtain n £2 £2 >6 limsup > £) $ 6 + M limsup ;z L P E n�oo t:=l n�oo
w(q1,0) 2d1,
{ w(q1,0) P(w(Gn,O) 2 [
{2d1
::; £, in view of the values chosen for M and
1 (w(q1,0) )
o.
•
(21.66)
Uniform Stochastic Convergence
343
Clearly, whether condition 21.12(a) is satisfied depends on both the distribution of X1 and functional form of qr(.). But something relatively general can be said about termwise s.e. (condition 21.12(b)). Assume, following Potscher and Prucha (1989), that p
(2 1 .67) L rk1(x)skr(x,9), k=i where rkr: � � IR, and Skt(., S): � � IR for fixed e , are :X/:B-measurable functions.
qr(x,S)
=
The idea here is that we can be more liberal in the behaviour allowed to the factors rkt as functions of Xr than to the factors sk1; discontinuities are permitted, for example. To be exact, we shall be content to have the rkt uniformly L1-bounded in Cesaro mean: 1 n sup - :L E 1 rkr(X1) 1 :5 B < oo, k n
n
t=l
=
1 , . . . ,p.
(21 .68)
As to the factors skr(x, S ), we need these to be asymptotically equicontinuous for a sufficiently large set of x values. Assume there is a sequence of sets {Km E :X, m = 1 ,2, ... } , such that 1 n limsup - L �r(K�) n�oo
and that for each
m
n t=i
�
0 as
m �
oo,
(2 1 .69)
� 1 and E > 0, there exists 8 > 0 such that
limsup sup w(skr(x, .),8) < E, k
=
1 , . . . ,p.
(2 1 . 70)
Notice that (2 1 . 70) is a nonstochastic equicontinuity condition, but under condi tion (2 1 .69) it holds (as one might say) 'almost surely, on average ' when the r.v. Xr is substituted into the formula. These conditions suffice to give termwise s.e., and hence can be used to prove s.e. of Gn by application of 21.12. 21.13 Theorem If qrCXr,S) is defined by (21 .67), and (2 1 . 68), (2 1 .69), and (2 1 .70) hold, then for every E > 0 there exists 8 > 0 such that 1 n limsup - :L P(w (q,,8) > E) < E. n�oo
Proof Fix
n t=i
E > 0, and first note that
(±k=l l rkr l w(skr,8) > E) :5 P ( LJ{ I rkr I w(skr,8) > Elp } ) k=i
P(w(qr,8) > E) :5 P
(2 1 .7 1 )
The Law of Large Numbers
344 p
::;; _L P(I rkrl w(skr.8) > Elp) ,; � [P(ir,,jw(s.,,O)i K. > 2�) + P ( I rkt I w(skr.8) 1 K,;. > 2�) l (21. 72) Consider any one of these p terms. Choose m large enough that (21.73) and for this m choose 8 small enough that limsup sup w(skr(x,. ) , 8 ) < -(21.74) 4Bp EK x m k=l
n �oo
2 £
2.
Then, by the Markov inequality,
n
1 _L E I rktl 28Ep
::;; limsup n n�oo t= l and by
(21.75)
(21.73), (21.76)
Substituting these bounds into
(21.72) yields the result.
•
v THE CENTRAL LIMIT THEOREM
22 Weak Convergence of Distributions
22. 1 B asic Concepts
The objects we examine in this part of the book are not sequences of random vari ables, but sequences of marginal distribution functions. There will of course be associated sequences of r. v .s generated from these distributions, but the concept of convergence arising here is quite distinct. Formally, if { Fn }i is a sequence of c.d.f.s, we say that the sequence converges weakly to a limit F if Fn(x) ----7 F(x) pointwise for each x e C, where C � [R is the set of points at which F is continu ous. Then, if Xn has c.d.f. Fn and X has c.d.f. F, we say that Xn converges in distribution to X. These terms are in practice used more or less interchangeably for the distributions and associated r.v.s. Equivalent notations for weak convergence are Fn � F, and Xn � X. Although the latter notation is customary, it is also slightly irregular, since to say a sequence of r.v.s converges in distribution means only that the limiting r.v. has the given distribution. If both X and Y have the distribution specified by F, then Xn � X and Xn � Y are equivalent statements. Moreover, we write things like Xn � N(O, 1 ) to indicate that the limiting distribution is standard Gaussian, although 'N(O, 1 )' is shorthand for 'a r.v. having the standard Gaussian distribu tion' ; it does not denote a particular r.v .. Also used by some authors is the notation '�' standing for 'convergence in probability law' , but we avoid this form because of possible confusion with convergence in Lp-norm. Pointwise convergence of the distribution functions is all that is needed, remembering that F is non-decreasing, bounded by 0 and 1 , and that every point is either a continuity point or a jump point. It is possible that F could possess a jump at a point x0 which is a continuity point of Fn for all finite n, and in these cases Fn(x0) does not have a unique limit since any point between F(x0-) and F(x0) is a candidate. But the jump points of F are at most countable in number, and according to 8.4 the true F can be constructed by assigning the value F(x0) at every jump point x0; hence, the above definition is adequate. If ll represents the corresponding probability measure such that F(x) = !l(( -oo, x]) for each x e [R, we know (see §8.2) that ll and F are equivalent repre sentations of the same measure, and similarly for !ln and Fn. Hence, the statement J..ln � ll is equivalent to Fn � F. The corresponding notion of weak convergence for the sequence of measures {!ln } is given by the following theorem. 22.1 Theorem lln � !l iff !ln(A) ----7 !l(A) for every A e 'B for which !l(dA) = 0.
o
The proof of this theorem is postponed to a later point in the development. Note
The Central Limit Theorem
348
meanwhile that the exclusion of events whose boundary points have positive proba bility corresponds to the exclusion of jump points of F, where the events in question have the form oo ,x . Just as the theory of the expectation is an application of the general theory of integrals, so the theory of weak convergence is a general theory for sequences of finite measures. The results below do not generally depend upon the condition J..ln(lR ) for their validity, provided definitions are adjusted appropriately. However, a serious concern of the theory is whether a sequence of distribution functions has a distribution function as its limit; more specifically, should it follow because J..ln (lR ) for every that This is a question that is taken up in Meanwhile, the reader should not be distracted by the use of the convenient notations £(.) and P(.) from appreciating the generality of the theory.
{ (- ] }
=1
§22.5. = 1
n J.l. {lR) = 1?
{
) n=
22.2 Example Consider the sequence of binomial distributions B(n)Jn , ... } , where the probability of successes in n Bernoulli trials is given by
1,2,3,
x = x) = (:) (A/n)\1 - Aint-x, x = O, ... ,n
P(Xn
(22.1)
(see 8.7). Here, I� i s a constant parameter, so that the probability of a success falls linearly as the number of trials increases. Note that E(Xn) = 'A for every x For fixed as ---7 oo, and taking the binomial expansion of - � oo, whereas ---7 1 . We shows that � e -J.. as may therefore conclude that
1/x! n x, (�)n (1 -Aint (1 -Aint P(Xn
and accordingly, Fn(a)
=
n.
(1 -A/n)-x
n�
= x) � X.A� e-A, x = 0,1,2, ... , L
O$x$a
P(Xn
= x) � e-A. L AX�. O$x$a
(22.2) (22.3)
at all points a < oo. Thus the limit (and hence the weak limit) of the sequence is the Poisson distribution with parameter A. o
{B(n,Ain)}
[0, 1] is defined by = x) = { 1/n, x = iln , i = 1, ... ,n. (22.4) 0, otherwise This sequence actually converges weakly to Lebesgue measure m on [0, 1], although this fact may be less than obvious; it will be demonstrated below. For any x [0,1], J.l. n([O,x]) = [nx]/n � x = m([O,x]), where [nx] denotes the largest integer less than nx. There are sets for which convergence fails, notably the set of all rationals in [0,1], in view of the fact that = 1 for every n, and m((Q[o. n ) = 0. But [0,1] and = 1, thus the definition of weak convergence in 22.1 is not violated. 22.3 Example A sequence of discrete distributions on P(Xn
E
(Q [O,l J
(Q r0,11 =
m(o((Qr0,11)) o
J..ln ((Q [O, l])
Weak Convergence of Distributions
349
Although convergence in distribution is fundamentally different from converg ence a.s. and in pr., the latter imply the former. In the next result, · � · can be substituted for ·� ' , by 18.5.
0,
22.4 Theorem If Xn � X, then Xn � X. Proof For £ >
we have
P(Xn ::::; x) = P({Xn ::::; x} n { I Xn - X I ::::; £})
+ P( { Xn ::::; X} rl { I Xn - X I > £})
� P(X ::::; x + £) + P( I Xn - X I > £),
(22.5)
where the events whose probabilities appear on the right-hand side of the inequal ity contain (and hence are at least as probable as) the corresponding events on the left. P( I Xn - X I > £) -7 by hypothesis, and hence
0
limsup P(Xn ::::; x) � P(X � x + £).
(22.6)
Similarly, P(X � x - £)
P( {X � x - £} n { I Xn - XI � £})
=
+ P( { X � X - £} (l { I Xn - X I > £ } )
� P(Xn � x) + P( I Xn - X I > £),
(22.7)
P(X � x - £) � liminf P(Xn ::::; x).
(22.8)
and so
0,
Since £ is arbitrary, it follows that limn--+ooP(Xn � x) = P(X ::::; x) at every point x for which P(X = x) = such that lime,!, oF(X � x - £) = P(X � x). This condition is equivalent to weak convergence. •
The converse of 22.4 is not true in general, but the two conditions are equiva lent when the probability limit in question is a constant. A degenerate distribu tion has the form F(x)
=
{0,
x
1, x
;?:
a
(22.9)
If a random variable is converging to a constant, its c.d.f. converges to the step function (22.9), through a sequence of the sort illustrated in Fig. 22. 1 . 22.5 Theorem Xn converges in probability to a constant a iff its c.d.f. converges to a step function with jump at a. Proof For any £ >
0
PC I Xn - a l < £)
=
P(a - £ � Xn � a + £)
The Central Limit Theorem
350
=
Fn(a + £) - Fn((a - £)-). (22. 10) Convergence to a step function with jump at a implies limn�ooFn(a + £) = F(a + £) = 1 , and similarly limn�ooFn((a - £)-) = F((a - £)-) = 0 for all £ > 0. The suffi ciency part follows from (22. 10) and the definition of convergence in probability. For the necessity, let the left-hand side of (22.10) have a limit of 1 as n -7 oo, for all £ > 0. This implies lim [Fn(a + £) - Fn((a - £)-) ]
=
1.
(22. 1 1)
Since 0 � F � 1 , (22. 1 1) will be satisfied for all £ > 0 only if F(a) F(a-) = 0, which defines the function in (22.9). •
a
=
1 and
X
Fig. 22. 1 . 22.2 The S korokhod Representation Theorem
Notwithstanding the fact that Xn � X does not imply Xn ...E4 X, whenever a sequence of distributions { Fn } converges weakly to F one can construct a sequence of r.v.s with distributions Fn, which converges almost surely to a limit having distribution F. Shown by Skorokhod (1956) in a more general context (see §26.6), this is an immensely useful fact for proving results about weak convergence. Consider the sequence { Fn } converging to F. Each of these functions is a mono tone mapping from [R to the interval [0,1]. The idea is to invert this mapping. Let a random variable ro be defined on the probability space ([0,1],i3ro. 1 1 ,m), where :B ro.I J is the Borel field on the unit interval and m is the Lebesgue measure. Define for ro E [0, 1] Yn(W)
=
inf{x: ro � Fn(x) } .
(22. 1 2)
In words, Yn is the random variable obtained by using the inverse distribution function to map from the uniform distribution on [0,1 ] onto iR, taking care of any discontinuities in Fn 1 (ro) (corresponding to intervals with zero probability mass under Fn) by taking the infimum of the eligible values. Yn is therefore a non decreasing, left-continuous function. Fig. 22.2 illustrates the construction,
Weak Convergence of Distributions
35 1
essentially the same as used in the proof of 8.5 (compare Fig. 8.2). When Fn has discontinuities it is only possible to assert (by right-continuity) that Fn(Yn(w)) :2: ro, whereas Yn(Fn(x)) ::::; x, by left-continuity of Yn. - The first important feature of the Skorokhod construction is that, for any constant a E IR ,
(22. 13) where the last equality follows from the fact that w is uniformly distributed on [0, 1 ] . Thus, Fn is the c.d.f. of Yn. 23 Letting F be a c.d.f. and Y the r.v. corres ponding to F according to (22. 12), the second important feature of the construc tion is contained in the following result. 1
0 �----�--+ X Yn(W2) = Yn(W3)
Fig. 22.2 22.6 Theorem If Fn
=>
F then Yn -----7 Y a.s.[m] as n
-----7
=. o
In working through the proof, it may be helpful to check each assertion about the functions F and Y against the example in Fig. 22.3. This represents the extreme case where F, and hence also Y, is a step function; of course, if F is everywhere continuous and increasing, the mappings are 1-1 and the problem becomes trivial. Proof Let w be any continuity point of Y, excluding the end points 0 and
1 . For
any E > 0, choose x as a continuity point of F satisfying Y( co) - E < x < Y( ro). Given the countability of the discontinuities of F, such a point will always exist, and according to the definition of Y, it must have the property F(x) < w. If Fn(x) -----7 F(x), there will be n large enough that Fn(x) < W, and hence x < Yn(W), by definition. We therefore have Y(w) - E < x < Yn(CO).
(22. 14)
Without presuming that limn oo Yn(w) exists, since E is arbitrary (22. 14) allows us to conclude that liminfn�ooYn(W) :2: Y(w). Next, choose y as a continuity point of F satisfying Y(w) < y < Y(w) + E. The properties of F give co ::::; F(Y(co)) ::::; F(y). For large enough n we must also have co ::::; �
The Central Limit Theorem
352
and hence, again by definition of Yn, Yn(w) � y Y(w) + £. (22. 15) In the same way as before, we may conclude that limsupn�oofn(ffi) � Y(w). The superior and inferior limits are therefore equal, and limn�oofn(W) = Y(w). Fn (y) ,
<
F(y), w' 0)
F(x)
A
- - - - - - - - - - - -----------------------------------
----------------------------- ---------·---------------------
B
T
Y(w) - £
.
X
T Y( w), Y(w')
T
Y
Y( w) + £
Yn( w')
Fig. 22.3 This result only holds for continuity points of Y. However, there is a 1-1 correspondence between the discontinuity points of Y and intervals having zero probability under J.l in IR . A collection of disjoint intervals on the line is at most countable (1.11), and hence the discontinuities of Y (plus the points 0 and 1 ) are countable, and have Lebesgue measure zero. Hence, Yn ---7 f w.p. l [m], as asserted. • In Fig. 22.3, notice how both functions take their values at the discontinuities at the points marked A and B. Thus, F(Y(w)) = w' > w. Inequality (22. 15) holds for ro, but need not hold for ro', a discontinuity point. A counter-example is the sequence of functions Fn obtained by vertical translations of the fixed graph from below, as illustrated. In this case Yn(w') > Y(w') + £ for every n. 22.7 Corollary Define random variables Y�, so that f�(w) = Yn(W) at each w where the function is continuous, and f�(w) = 0 at discontinuity points and at w = 0 and 1 . Define Y' similarly. If Fn => F then f�(ro) ---7 Y'(ro) for every w E [0, 1], and Fn and F are the distribution functions of Y� and Y'. Proof The convergence for every ro is immediate. The equivalence of the distribu tions follows from 8.4, since the discontinuity points are countable and their complement is dense in [0, 1], by 2.10. • In the form given, 22.6 does not generalize very easily to distributions in IR k for k > 1 , although a generalization does exist. This can be deduced as a special case of 26.25, which derives the Skorokhod representation for distributions on general metric spaces of suitable type. A final point to observe about Skorokhod' s representation is its generalization
Weak Convergence of Distributions
353
to any finite measure. If Fn is a non-decreasing right-continuous function with codomain [a, b] , (22. 12) defines a function Yn(co) on a measure space ([a,b], :B[a,b]• m), where m is Lebesgue measure as before. With appropriate modifications, all the foregoing remarks continue to apply in this case. The following application of the Skorokhod representation yields a different, but equivalent, characterization of weak convergence. 22.8 Theorem Xn � X iff lim E(f(Xrz)) = E(f(X) ) (22. 16) n--)oo for every bounded, continuous real function f. o The necessity half of this result is known as the Helly-Bray theorem. Proof To prove sufficiency, construct an example. For a E [R and 8 > 0, let 1, x $ a-8 f(x) = (a - x)/8, a - 8 < x $ a (22. 1 7) 0, x>a
{
We call this the 'smoothed indicator' of the set (-oo, a]. (See Fig. 22.4.) It is a continuous function with the properties
f
Fn(a - 8) $ fdFn $ Fn(a), all F(a - 8)
n,
:::; JfdF :::; F(a).
(22. 1 8) (22. 19)
By hypothesis, ffdFn � ffdF, and hence
J
(22.20)
limsup Fn(a-) $ F(a),
(22.21)
F(a-) $ liminf Fn(a ).
(22.22)
limsup Fn(a - 8) .$ fdF $ liminf Fn(a). n�oo n�oo Letting 8 ·� 0, combining (22. 19) and (22.20) yields
These inequalities show that IimnFn(a) exists and is equal to F(a) whenever F(a-) = F(a), that is, Fn � F. To prove necessity, let f be a bounded function whose points of discontinuity are contained in a set D1, where �(D1) = 0, � being the p.m. such that F(x) = �(( -oo,.x]). When Fn � F CFn being the c.d.f. of Xn and F that of X) Y�(co) � Y'(co) for every co e [0, 1], where Y�(co) and Y'(co) are the Skorokhod variables defined in 22.7. Since m(co: Y'(co) e D1) = �(D1) = 0, f(Y�) � f(Y') a.s. [�] by 18.8(i). The bounded convergence theorem then implies E(f(Y�)) � E(f(Y')), or
The Central Limit Theorem
354
ff(y)dJln(y) � ff(y)dJl(y),
(22.23)
where Jln is the p.m. corresponding to Fn. But 9.6 allows us to write 1 (22.24) f(y)dJln(y) = ydJlnf- (y) = xdJlnf- 1 (x) = E(f(Xn)),
f
f
f
with a similar equality for E(f(X)) . (The trivial change of dummy argument from y to x is just to emphasize the equivalence of the two formulations.) Hence we have E(f(Xn )) ---7 E(f(X)) . The result certainly holds for the case D1 = 0, so 'only if is proved. •
/(x )
a- 8 a
X
Fig. 22.4 Notice how the proof cleverly substitutes ([0, 1] ,:Bro, 1 1 , m) for the fundamental probability space (Q, ':J,P) generating {Xn }, exploiting the fact that the derived distributions are the same. This result does not say that the expectations converge only for bounded continuous functions; it is simply that convergence is implied at least for all members of this large class of functions. The theorem also holds if we substitute any subclass of the class of bounded continuous functions which contains at least the smoothed indicator functions of half-lines, for example the bounded uniformly continuous functions. 22.9 Example We now give the promised proof of weak convergence for 22.3. Clearly, in that example, 1 n f(iln). (22.25) fdJln = n L i==l
f
-
The limit of the expression on the right of (22.25) as n � oo is by definition the Riemann integral of f on the unit interval. Since this agrees with the Lebesgue integral, we have a proof of weak convergence in this case. o We shall subsequently require the generalization of Theorem 22.8 to general finite measures. This will be stated as a corollary, the modifications to the proof being left to the reader to supply; it is mainly a matter of modifying the notation to suit.
Weak Convergence of Distributions
355
22.10 Corollary Let {Fn } be a sequence of bounded, non-decreasing, right continuous functions. Fn � F if and only if
JtdFn � ftdF
(22.26)
for every bounded, continuous real function f. o Another proof which was deferred earlier can now be given. Proof of 22.1 To show sufficiency, consider A = (-oo, x], for which dA = {x} . Weak convergence is defined by the condition J.ln( { ( -oo, x] } ) � J.l({ ( -oo, x] }) whenever J.l( { x}) = 0. To show necessity, consider in the necessity part of 22.8 the case f(x) = lA(x) for any A E 'B. The discontinuity points of this function are contained in the set dA, and if J.l(dA) = 0, we have J.ln(A) � J.l(A) as a case of (22. 16), when Fn � F. • 22.3 Weak Convergence and Transformations
The next result might be thought of as the weak convergence counterpart of 18.8. 22.1 1 Continuous mapping theorem Let h: IR f----7 IR be Borel-measurable with discontinuity points confined to a set Dh, where J.l(Dh) = 0. If J.ln � J.l, then J.lnh 1 1 � J.lh - . Proof By the argument used to prove the Helly-Bray theorem, h(Y�) � h(Y') a.s.[J.l]. It follows from 22.4 that h(Y�) � h(Y'). Since m(ro: Y�(w) E A) = J.ln(A), (22.27) for each A E 'B, using 3.21. Similarly, m(h(Y') E A) = J.lh - 1 (A). According to the definition of weak convergence, h(Y�) � h(Y') is equivalent to J.lnh - 1 � J.lh - 1 . • -
22.12 Corollary If h is the function of 22.11 and Xn � X, then h(Xn) � h(X). Proof Immediate from the theorem, given that Xn - Fn and X - F. • 22.13 Example If Xn � N(O, l) , then X� � x2 (1). o
Our second result on transformations is from Cramer (1946), and is sometimes called Cramer's theorem: 22.14 Cramer's theorem If Xn � X and Yn � a (constant), then (i) (Xn + Yn) � (X + a). (ii) YnXn � aX. (iii) XnfYn � X!a, for a :f. 0. Proof This is by an extension of the type of argument used in 22.4.
The Central Limit Theorem
356
+ P(Xn + Yn ::::; X, I Yn - a I :2:
E).
P(Xn ::::; x - a - E) ::::; P(Xn + Yn ::::; x) + P( I Yn -: a l
:2:
::::; P(Xn ::::; x - a + E) + P( I Yn - a l
:2:
E)
(22.28)
Similarly, E),
(22.29)
and putting these inequalities together, we have P(Xn ::::; x - a - E) - P( I Yn - a l
:2:
E) ::::; P(Xn + Yn ::::; x)
::::; P(Xn ::::; x - a + E) + P( I Yn - a l :2: E) .
(22.30)
Let Fxn and Fxn+ Yn denote the c.d.f.s of Xn and Xn + Yn respectively, and let Fx be the c.d.f. of X, such that Fx = Iimn�ooFxn(x) at all continuity points of Fx. Since Iimn�=P( I Yn - a I :2: E) = 0 for all E > 0 by assumption, (22.30) implies
(22.3 1 )
n �oo
Taking E arbitrarily close to zero shows that
(22.32) whenever x - a is a continuity point of FX· This proves (i). To prove (ii), suppose first that a > 0. By taking E > 0 small enough we can ensure a - E > 0, and applying the type of argument used in (i) with obvious variations, we obtain the inequalities P(Xn(a + E) ::::; x) - P( I Yn - a I :2: E) ::::; P(XnYn ::::; x)
::::; P(Xn(a - E) ::::; x) + P( I Yn - a I
:2:
E).
(22.33)
Taking limits gives Fx(x/(a + E)) ::::; liminfFxnrJx) n�oo ::::; limsup Fxn yn(x) ::::; Fx(xl(a - E)), n �oo
and thus
<
(22.34) (22.35)
n �oo
If a 0, replace Yn by - Yn and a by -a, repeat the preceding argument, and then apply 22.12. And if a = 0, (22.33) becomes P(XnE ::::; x) - P( I Yn I
:2:
E) ::::; P(XnYn ::::; x)
Weak Convergence of Distributions
357 (22.36)
<
For x > 0, this yields in the limit FXnYn (x) = 1, and for x 0, FXnYn(x) = 0, which defines the degenerate distribution with the mass concentrated at 0. In this case XnYn � 0 in view of 22.5. (Alternatively, see 18.12.) To prove (iii) it suffices to note by 18.10(ii) that plim llYn = lla if a =t. 0. Replacing Yn by llYn in (ii) yields the result directly. • 22.4 Convergence of Moments and Characteristic Functions
Paralleling the sequence of distribution functions, there may be sequences of moments. If Xn � X where the c.d.f. of X is F, then E(X) = fxdF(x), where it exists, is sometimes called the asymptotic expectation of Xn. There is a tempta tion to write E(X) = limn�ooE(Xn), but there are cases where E(Xn) does not exist for any finite n while E(X) exists, and also cases where E(Xn) exists for every n but E(X) does not. This usage is therefore best avoided except in specific circum stances when the convergence is known to obtain. Theorem 22.8 assures us that expectations of bounded random variables converge under weak convergence of the corresponding measures. The following theorems indicate how far this result can be extended to more general cases. Recall that El XI is defined for every X, although it may take the value +oo. 22.15 Theorem If Xn � X then E I XI � liminfn�ooE I Xnl · Proof The function ha(x) = I x I I { I xi::; a } is real and bounded. If P( I XI = a) = 0, it follows by 22.11 that ha(Xn) � ha(X), and from 22.8 (letting f be the identity function which is bounded in this case) that E(ha(X))
=
lim E(ha(Xn)) � liminf E I Xn l ·
(22.37)
The result follows on letting a approach += through continuity points of the distribution. • The following theorem gives a sufficient condition for E(X) to exist, given that E(Xn) exists for each n. 22.16 Theorem If Xn � X and {Xn } is uniformly integrable, then E I XI oo and E(Xn) ---7 E(X). Proof Let Yn and Y be the Skorokhod variables of (22. 12), so that Yn --E4 Y. Since Xn and Yn have the same distribution, uniform integrability of {Xn } implies that of { Yn } . Hence we can invoke 12.8 to show that E(Yn) ---7 E(Y), Y being integrable. Reversing the argument then gives E I XI oo and E(Xn) ---7 E(X) as required. • Uniform integrability is a sufficient condition, and although where it fails the existence of E(X) may not be ruled out, 12.7 showed that its interpretation is questionable in these circumstances. A sequence of complex r.v.s which is always uniformly integrable is f e itXn l. for
<
<
The Central Limit Theorem any sequence { since I eitXn I = 1 . Given the sequence of characteristic func tions {
Xn}, F.
Xn � X
F,
Fn F,
X.
Fn}
F
Fn
(22.39)
F
fdF
and <)>(t) is continuous at the point t = 0, then is a c.d.f. (i.e., = 1) and <)> is its ch.f. o The fact that the conditions imposed on the limit F in this theorem are not un reasonable will be established by the Helly selection theorem, to be discussed in the next section. Proof Note that = = 1 for any n, by (22.39) and the fact that is a c.d.f. For v > 0, e = (22.40) (t) t =
n(O) fdFn
Fn
�f>n d s:: (�f:eitxdt) dFn s:: ( i::: 1) dFn,
the change in order of integration being permitted by 9.32. By 22.10, which extends to complex-valued functions by linearity of the integral, we have, as n -7 oo, 1 5 <)>(t)dt, e e 1 = (22.41) -7 . lVX V 0 lVX
f+"" ( ivx - 1) dFn f+"" ivx - ) dF - v oc ( oc .
-
-
where the equality is by (22.40) and the definition of <)>. Since <)> is continuous at t = 0, lim v 1 (}<)>(t) t = <)>(0) and since -7 <)> we must have <)>(0) = 1 . It follows from (22.41 ) that
v oo f d ---t
-
+oo lim eivx : ) dF +oodF. (22.42) J-oo ( iv f-oo In view of the other conditions imposed, this means F is a c.d.f. It follows by 22.8 that <)>(t) E(e itX) where X is a random variable having c.d.f. F. This 1
=
1
=
completes the proof.
V---tO
•
=
Weak Convergence of Distributions
359
The continuity theorem provides the basic justification for investigating limit ing distributions by evaluating the limits of sequences of ch.f.s, and then using the inversion theorem of § 1 1 .5. The next two chapters are devoted to developing these methods. Here we will take the opportunity to mention one useful applica tion, a result similar to 22.4 which may also be proved as a corollary. 22.18 Theorem If I Xn - Zn l � 0 and {Xn } converges in distribution, then {Zn } converges in distribution to the same limit. Proof I eitXn - eitZn I � 0 by 18.9(ii). Since I eitX I = 1 these functions are Leo-bounded, and the sequence { I eitXn - eitZn I } is uniformly integrable. So by 18.14, I eitXn - eitZn I � 0. However, the complex modulus inequality (11.3) gives (22.43) so that a further consequence is I
=
22.5 Criteria for Weak Convergence
Not every sequence of c.d.f.s has a c.d.f. as its limit. Counter-examples are easy to construct. 22.19 Example Consider the uniform distribution on the interval [-n,n], such that Fn(a) = �(1 + aln), -n :::; a :::; n. Then Fn(a) --7 1 for all a E IR . D 22.20 Example Consider the degenerate r.v., Xn n w.p. l . The c.d.f. is a step function with jump at n. Fn(a) ---7 0, all a E IR . o Although Fn is a c.d.f. for all n, in neither of these cases is the limit F a c.d.f., in the sense that F(a) ---7 1 (0) as a ---7 ( ) Nor does intuition suggest to us that the limiting distributions are well defined. The difficulty in the first example is that the probability mass is getting smeared out evenly over an infinite support, so that the density is tending everywhere to zero. It does not make sense to define a random variable which can take any value in IR with equal probability, any more than it does to make a random variable infinite almost surely, which is the limiting case of the second example. In view of these pathological cases, it is important to establish the conditions under which a sequence of measures can be expected to converge weakly. The condi tion that ensures the limit is well-defined is called uniform tightness. A se quence { Jln } of p.m.s on fR is uniformly tight if there exists a finite interval (a,b] such that, for any £ > 0, SUPnlln((a,b]) > 1 - £. Equivalently, if {Fn } is the sequence of c.d.f.s corresponding to {J..Ln } , uniform tightness is the condition that for £ > 0 3 a,b with b - a < and =
co
co
-co
.
360
The Central Limit Theorem
sup { Fn(b) - Fn(a) } > 1 - E. (22.44) n It is easy to see that examples 22.19 and 22.20 both fail to satisfy the uniform tightness condition. However, we can show that, provided a sequence of p.m.s { Jln } is uniformly tight, it does converge to a limit Jl which is a p.m. This terminology derives from the designation tight for a measure with .the property that for every E > 0 there is a compact set Ke such that J.!(K�) :::; E. Every p.m. on (IR ,B) is tight, although this is not necessarily the case in more general probability spaces. See §26.5 for details on this. An essential ingredient in this argument is a classic result in analysis, Helly ' s selection theorem.
22.21 Belly's selection theorem If { Fn} is any sequence of c.d.f.s, there exists a subsequence { nk, k = 1 ,2, ... } such that Fnk � F, where F is bounded, non-decreas ing and right-continuous, and 0 :::; F :::; 1 . Proof Consider the bounded array { { Fn(Xi), n E IN } , i E IN } , where {xi, i E IN } is an
enumeration of the rationals. By 2.36, this array converges on a subsequence, so that limk�Jn/Xi) = F'(xD for every i. Note that F*(xi1) :::; F'(xi2) whenever Xi 1 < xi2, since this property is satisfied by Fn for every n. Hence consider the non decreasing function on IR , F(x)
=
(22.45)
inf F*(xi).
x; > x
Clearly 0 :::; F*(xi) :::; 1 for all i, since the Fnk(xa have this property for every k. By definition of F, for x E IR 3 xi > x such that F(x) s F'(xt) < F(x) + E for any E > 0, showing that F is right-continuous since F*(xD = F(xt). Further, for continuity points x of F, there exist xi1 < x and Xi2 > x such that
(22.46) The following inequalities hold in respect of these points for every k:
(22.47) Combining (22.46) with (22.47), F(x) - E < liminf Fnk(x) s limsup Fnk(x) < F(x) + E. Since E is arbitrary, limk�ooFnk(x)
=
F(x) at all continuity points of F.
(22.48) •
The only problem here is that F need not be a c.d.f., as in 22.19 and 22.20. We need to ensure that F(x) � 1 (0) as x � oo ( oo ), and tightness is the required property. -
22.22 Theorem Let { Fn } be a sequence of c.d.f.s. If
Weak Convergence of Distributions
361
(a) Fnk ==> F for every convergent subsequence {nk }, and (b) the sequence is uniformly tight, then Fn ==> F, where F is a c.d.f. Condition (b) is also necessary. o Helly ' s theorem tells us that { Fn } has a cluster point F. Condition (a) requires that this F be the unique cluster point, regardless of the subsequence chosen, and the argument of 2.13 applied pointwise to {Fn } implies that F is the actual limit of the sequence. Uniform tightness is necessary and sufficient for this limit F to be a c.d.f. Proof of 22.22 Let x be a continuity point of F, and suppose Fn(x) A F(x). Then I Fn(x) - F(x) I � £ > 0 for an infinite subsequence of integers, say { nh k E IN } . Define a sequence of c.d.f.s by Fk = Fnk' k 1,2, ... According to Helly's theorem, this sequence contains a convergent subsequence, { ki, i E IN }, say, such that Fie; ==> F'. But by (a), F' F, and we have a contradiction, given how the sub sequence { ki} was constructed. Hence, Fn ==> F. Since Fn is a c.d.f. for every n, Fn(b) - Fn(a) > 1 - £ for some b - a < oo, for any £ > 0. Since Fn -7 F at continuity points, increase b and reduce a as neces sary to make them continuity points of F. Assuming uniform tightness, we have by (22.44) that F(b) - F(a) > 1 - £, as required. It follows that 1imx---7ooF(x) = 1 and limx---7 - J(x) 0. Given the monotonicity and right continuity of F established by Helly' s theorem, this means that F is a c.d.f. On the other hand, if the sequence is not uniformly tight, F(b) - F(a) � 1 - £ for some £ > 0, and every b > a. Letting b -7 +oo and a -7 -oo, we have F(+<XJ) - F(-oo) � 1 - £ < 1 . Hence, either F(+oo) < 1 or F( oo) > 0 or both, and F is not a c.d.f. • The role of the continuity theorem (22.17) should now be apparent. Helly' s theorem ensures that the limit F of a sequence of c.d.f.s has all the properties of a c.d.f. except possibly that of fdF 1 . Uniform tightness ensures this property, and the continuity of the limiting ch.f. at the origin can now be interpreted as a sufficient condition for tightness of the sequence. It is of interest to note what happens in the case of our counter-examples. The ch.f. corresponding to example 22.19 is n sin v n _ (22.49) <J>n(Y) = (2n ) - 1 -n (cos vx + i sin vx)dx vn We may show (use l ' Hopital's rule) that <J>n(O) 1 for every n, whereas <J>n(Y) -7 0 for all v :f. 0. In the case of 22.20 we get <J>n(Y) cos v n + i sin vn, (22.50) which fails to converge except at the point v = 0. =
=
=
-
=
J
=
=
=
22.6 Convergence of Random Sums
Most of the important weak convergence results concern sequences of partial sums of a random array {Xn,, t 1 , ... ,n, n E IN }. Let =
The Central Limit Theorem
362
Sn
=
n L Xnt• t=l
(22.51)
and consider the distributions of the sequence {Sn } as n � oo . The array notation (double indexing) permits a normalization depending on n to be introduced. Central limit theorems, the cases in which typically Xnr = n - 1 12X1, and Sn converges to the Gaussian distribution, are to be examined in detail in the following chapters, but these are not the only possibility. 22.23 Example The B(n)Jn) distribution is the distribution of the sum of n inde pendent Bernoulli random variables Xnt• where P(Xnr = 1) = IJn and P(Xnr = 0) = 1 - IJn. From 22.2 we know that in this case n (22.52) Sn = L Xnt � Poisson with mean A. o t= l
From 11.1 we know that the distribution of a sum of independent r.v.s is given by the convolution of the distributions of the summands. The weak limits of independent sum distributions therefore have to be expressible as infinite con volutions. The class of distributions that have such a representation is necessar ily fairly limited. A distribution F is called infinitely divisible if for every n E fN there exists a distribution Fn such that F has a representation as the n-fold convolution (22.53) In view of (1 1 .33), infinite divisibility implies a corresponding multiplicative rule for the ch.f.s. 22.24 Example For the Poisson distribution, <J>x(t;A) = exp{Aeit - 1 } , from (1 1 .34), and x(t; A) = (exp{ (IJn)(eit _ 1 ) } Y = (<J>x(t; IJn)t. (22.54) The sum of n independent Poisson variates having parameter IJn is therefore a Poisson variate with parameter A. o In certain infinitely divisible distributions, Fn and F have a special relation ship, expressed through their characteristic functions. A distribution with ch.f.
Weak Convergence of Distributions <j)x(t)
=
363 (22.56)
exp { -a ! t iP } , a � 0.
22.25 Example The Cauchy distribution is stable with p = 1 and b(n) = 0, having ch.f.
-
-
23 The Classical Central Limit Theorem
2 3 . 1 The i . i.d. Case
The 'normal law of error' is justly the most famous result in statistics, and to the susceptible mind has an almost mystical fascination. If a sequence of random variables {X1} I have means of zero, and the partial sums L�= tX1, n = 1 ,2,3, ... have variances s� tending to infinity with n although finite for each finite n, then, subject to rather mild additional conditions on the distributions and the sampling process, 1 n D (23. 1) Sn = L Xt � N(O, 1). -
Sn t=l
This is the central limit theorem (CLT). Establishing sets of sufficient condi tions is the main business of this chapter and the next, but before getting into the formal results it might be of interest to illustrate the operation of the CLT as an approximation theorem. Particularly if the distribution of the X1 is symmet ric, the approach to the limit can be very rapid. 23.1 Example In 11.2 we derived the distribution of the sum of two independent U[O, 1 ] drawings. Similarly, the sum of three such drawings has density fx+Y+z(w)
r t rt
= J 0 J 0 1 ( w-z- I.w-z )dydz,
(23.2)
which is plotted in Fig. 23. 1. This function is actually piecewise quadratic (the three segments are on [0 , 1], [1 ,2] and [2,3] respectively), but lies remarkably close to the density of the Gaussian r.v. having the same mean and variance as X + Y + Z (also plotted). The sum of 10 or 12 independent uniform r.v.s is almost indistinguishable from a Gaussian variate; indeed, the formula S = 'L�� 1 Xi - 6, which has mean 0 and variance 1 when Xi - U[O, l ] and independent, provides a simple and perfectly adequate device for simulating a standard Gaussian variate in computer modelling exercises. o 23.2 Example For a contrast in the manner of convergence consider the B(n,p ) distribution, the sum of n Bernoulli(p) variates for fixed p E (0, 1). The proba bilities for p = � and n = 20 are plotted in Fig. 23.2, together with the Gaussian density with matching mean and variance. These distributions are of course dis crete for every finite n, and continuous only in the limit. The correspondence of the ordinates is remarkably close, although remember that for p :t � the binomial distribution is not symmetric and the convergence is correspondingly slower. This
The Classical Central Limit Theorem
365
example should be compared with 22.2, the non-Gaussian limit in the latter case being obtained by having p decline as a function of h. o
Sum of 3 U[O, 1 ]s
1 .5
3
Fig. 23. 1 0.2 I Bin(20, 0.5)
1-
0. 1
1-
'-
0
0
__
..
(1
I
I
I
I I I I
I I
I
, '
\
�
N( 1 0,5) \ \
- - -
\ \ \
\ \
\
�'
10
t ..
... _
20
Fig. 23.2 Proofs of the CLT, like proofs of stochastic convergence, depend on establishing properties for certain nonstochastic sequences. Previously we considered sample points I Xn(ro) - X(ro) I for ro E C with P( C) = 1 , probabilities P( I Xn - X I > £) , and moments E I Xn - XI P , as different sequences to be shown to converge to 0 to estab lish the convergence of Xn to X, respectively a.s., in pr., or in Lp . In the present case we consider the expectations of certain functions of the Sn; the key result is Theorem 22.8. The practical trick is to find functions that will finger print the limiting distribution conclusively. The characteristic function is by
The Central Limit Theorem
366
common consent the convenient choice, since we can exploit the multiplicative property for independent sums. This is not the only possible method though, and the reader can find an alternative approach in Pollard (1984: lll.4), for example. The simplest case is where the sequence { X1 } is both stationary and indepen dently drawn. 23.3 Lindeberg-Levy theorem If {X1}] is an i.i.d. sequence having zero mean and variance cr2, D ,. X1 I(J � = n - 1 12L... N(0 , 1 ) . 0 (23.3) t= l Proof The ch.f.s <j>x(A.) of X1 are identical for all t, 24 so from ( 1 1 .30) and
Sn
n
(1 1 .33),
Applying 11.6 with k = 2 yields the expansion
jcA.Xr)2 A.Xr
(23.4)
}
I 13 , 1 <\>x(A.cr -1 n - 1/2 ) - l - A.212n l � E mm �· l fTn 6cr3n312 .
(23.5)
which makes it possible to write, for fixed A., <\>x(A.cr- 1 n - 1 12) = l - A.212n + O(n-312) .
(23.6)
Applying the binomial expansion, ( 1 + alnt = I.}=o('})(alnY � 'Lj=naj!p = ea as n -7 oo, and setting a = -�A.2 + O(n- 112), we find lim
The Classical Central Limit Theorem
367
alternative to the central limit property under the specified conditions, that is, zero mean and finite variance. The earlier remark that symmetry of the distribution of n -Inx, improves the rate of convergence to the limit can be also 0, the expansion in (23.5) can be taken appreciated here. If we can assume to third order, and the remainder in (23.6) is of O(n- 2). On the other hand, if the variance does not exist the expansion of (23.6) fails. Indeed, in the specific case we know of, in which the are centred Cauchy, n 2Xn = O(n 1 12) ; the sequence of distributions of {n 1 12Xn } is not tight, and there is no weak convergence. The limit law for the sum would itself be Cauchy under the appropriate scaling of n - I . The distinction between convergence in distribution and convergence in probabil ity, and in particular the fact that the former does not imply the latter, can be demonstrated here by means of a counter-example. Consider the sequence {X,}i defined in the statement of the Lindeberg-Levy theorem, and the corresponding Sn in (23.3). 23.4 Theorem Sn does not converge in probability. Proof If it was true that plimn---7ooSn = Z, it would also be the case that p1imn---7=S2n = Z, implying plim (S2n - Sn) = 0. (23.8)
E(X�)
=
X,
11
We will show that (23. 8) is false. We have (23.9) where S'n
=
n
- 112"' 2n X "- t =n+ I
p1(J,
hence (23. 10)
According to the Lindeberg-Levy theorem,
X,
X(ro).
=
ro,
n- 1122:7= 1Xr(ro)
{X(ro) ro
roE
X
ro,
The Central Limit Theorem
368
not necessarily close to Sn(W) no matter how large n becomes. Weak convergence of the distribution functions does not imply convergence of the random sequence. Characteristic function-based arguments can also be used to show convergence in distribution to a degenerate limit. The following is a well-known proof of the weak law oflarge numbers for i.i.d. sequences, which circumvents the need to show L 1 convergence. 23.5 Khinchine's theorem If { Xr} i is an identically and independently distributed sequence· with finite mean �' then Xn = n -1 'L7=1 Xt � �· Proof The characteristic function of Xn has the form
The Lindeberg-Levy theorem imposes conditions which are too strong for the result to have wide practical applications in econometrics. In the remainder of this chapter we retain the assumption of independence (to be relaxed in Chapter 24), but allow the summands to have different distributions. In this theory it is convenient to work with normalized variables, such that the partial sums always have unit variance. This entails a double indexing scheme. Define the triangular array {Xnt' t = 1, ... ,n, n E IN }, the elements having zero mean and variances O'�t' such that if
(23. 15) then (under independence) E(S�)
=
n L �t t=l
=
1.
(23. 16)
Typically we would have Xnt = (Yr - �r)lsn where { Yr} is the 'raw' sequence under study, with means {�r}, and s� = 'L7= 1 E(Yr - �i. In this case O'�t = E(Yt - �t) 2/s�, so that these variances sum to unity by construction. It is also possible to have Xnt = (Ynt - �nt)lsn, the double indexing of the mean arising in situations where the sequence depends on a parameter whose value in turn depends on n. This case arises, for example, in the study of the limiting distributions of
The Classical Central Limit Theorem
369
test statistics under a sequence of 'local' deviations from the null hypothesis, the device known as Pitman drift. The existence of each variance �1 is going to be a necessary baseline condition in all the theorems, just as the existence of the common variance cJl was required in the Lindeberg-Levy theorem. However, with heterogeneity, not even uniformly bounded variances are sufficient to get a central limit result. If the Y1 are identically distributed, we do not have to worry about a small (i.e. finite) number of members of the sequence exhibiting such extreme behaviour as to in fluence the distribution of the sum as a whole, even in the limit. But in a heter ogeneous sequence this is possible, and could interfere with convergence to the normal, which usually depends on the contribution of each individual member of the sequence being negligible. The standard result for independent, non-identically distributed sequences is the Lindeberg-Feller theorem, which establishes that a certain condition on the distributions of the summands is sufficient, and in some circumstances also neces sary. Lindeberg is credited with the sufficiency part, and Feller the necessity; we look at the latter in the next section. 23.6 Lindeberg theorem Let the array { Xnr } be independent with zero mean and variance sequence {�r} satisfying (23. 16). Then, Sn � N(O, 1) if X�1 dP = 0, for all c: > 0. o (23 . 1 7) lim i n�oo t= l { I Xnr l > t:} Equation (23. 17) is known as the Lindeberg condition. The proof of the Lindeberg theorem requires a couple of purely mechanical lemmas. 23.7 Lemma If x 1 , . .. ,xn and y1 , ... ,yn are collections of complex numbers with l xr l � 1 and I Y r l � 1 for t = 1 , ... ,n, then
J
(23 . 1 8) Proof
For n
=
2, i x 1 x2 - Y1Yz l
I (x t - YI )xz + (xz - Yz)Y I I � i x1 - Yi i i xz l + i xz - Yzi I Yd � l xl - Y l l + l xz - Yzi . =
(23. 19)
The general case follows easily by induction. • 23.8 Lemma If z is a complex number and l z l � �' then i ez - 1 - z l � l z l 2 . Proof Using the triangle inequality, � l z li � l z l 2f=o (23.20) U + 2)!"
The Central Limit Theorem
370
Since I.j=02 -J = 2, the infinite series on the right-hand side cannot exceed 1 . • Proof of 23.6 We have to show that
I
-
n
2
e - 'A 12 1 = Il
I j fi
s
n 'A2crnl2 1 - Il e- 2 t=l
t=l
t=l
2 2 e I D _" (J�� - fi o - 1A-2�a I , where the equality is by definition, using the fact that I7=10"�t +
(23.21)
= 1, and the inequality is the triangle inequality. The proof will be complete if we can show that each of the right-hand side terms converges to zero. The integrals in (23. 17) may be expressed in the form E( l 1 \Xnr\>e} X�t), and
O"�t = E(l ( \ Xnr\�e}X�t) + E(l(\ Xnr \ >E}X�t) ::; c? + E(l { \Xnr \>E} X�t) e2 as n oo, all 1 s t s n, ----7
----7
(23.22)
since the Lindeberg condition implies that the second term on the right-hand side of the inequality (which is positive) goes to zero; since E can be chosen arbit rarily small, this shows that max O"�t ----7 0. (23.23)
l�ts n
In the first of the two terms on the majorant side of (23.21), the ch.f.s are all less than 1 in modulus, and by taking n large enough we can make 1 1 - 1A-2a�1 I s 1 for any fixed value of A-. Hence by 23.7,
I fi
(23.24)
To break down the terms on the majorant side of (23.24), note that 11.6 for the case k = 2, combined with (1 1 .29), yields
I
s A-2£(1 1 \Xnrl>e}X�t) + �� A-I 3E�t·
I7=1 O"�t = 1 , j
Hence, recalling
i
t=l
(23.25)
The Classical Central Limit Theorem �
� I A-1 \: as n
371 (23.26)
� oo ,
since the first majorant-side term vanishes by the Lindeberg condition. Since £ is arbitrary, this limit can be made as small as desired. Similarly, for the second term of (23.21), take n large enough so that
I TI e-A.2cr�/2 - TI (1 - �A2�r) I t=!
t=l
:::;
i I e-A.2a�/2 -
1
t= l
-
!"-2�t I ·
(23.27)
Setting z = -1A-2a�t (a real number, actually) in 23.8 and applying the result to the majorant side of (23.27) gives n 4 (23.28) :::; !"- .L cr�t· t=I
But,
n L <J�t
n :::; (max <J�r) L <J�r = max <J�r � 0 as n � (23.29) t=I t=I by (23.23). The proof is therefore complete. • The Lindeberg condition is subtle, and its implications for the behaviour of random sequences call for careful interpretation. This will be easier if we look at the case Xnr = Xtlsn, where Xr has mean 0 and variance cr7 and s� = I7=1cr7. Then the Lindeberg condition becomes n lim 21 L (23.30) x7dP = 0, for all £ > 0. n-too Sn t=I I { I Xr J >snE ) One point easily verified is that, when the summands are identically distributed, sn = Vncr, and (23.30) reduces to limn-too s� 2E(XTl { IXJ I > minE J ) = 0. The Lindeberg condition then holds if and only if X1 has finite variance, so that the Lindeberg theorem contains the Lindeberg-Levy as a special case. The problematical cases the Lindeberg condition is designed to exclude are those where the behaviour of a finite subset of sequence elements dominates all the others, even in the limit. This can occur either by the sequence becoming exces sively disorderly in the limit, or (the other side of the same coin, really) by its being not disorderly enough, beyond a certain point. Thus, the condition clearly fails if the variance sequence { cr7 } is tending to zero in such a way that s� = I7=1cr7 is bounded in n. On the other hand, if s� � oo, then sn£ � oo for any fixed positive £, and the Lindeberg condition resembles a condition of 'average' uniform integrability of {X7 } . The sum of the terms £(1 { I Xr l > snE }X7) must grow less fast than s�, no matter how close £ is to zero. The following is a counter-example (compare 12.7). 23.9 Example Let Xr = Yr - E(Yt) where Yt = 0 with probability 1 - t-2 , and t with probability t-2 . Thus E(Yr) = t- 1 and Xt is converging to a degenerate r.v., equal oo,
The Central Limit Theorem
372
to 0 with probability 1 , although Var(Yr) = 1 - t-2 for every t. The Lindeberg condition fails here. s� = n - 'L7= 1 t-2 , and for 0 < £ S. 1 we certainly have t > Sn£ whenever t > n 1 12. Therefore, (23.3 1)
as n --7 = And indeed, we can show rather easily that the CLT fails here. For any no � 1 , if we put Bo = 'L7� 1 t, then .
(23.32)
where the majorant side can be made as small as desired by choosing no large enough. It follows that 2:.7= 1 Y1 = Op(l) and hence, since we also have 'L7= 1 E(Yr) = O(log n), that 'L7= 1 Xr1Sn � 0, confirming that the CLT does not operate in this case. o Uniform square-integrability is neither sufficient nor necessary for the Lindeberg condition, so parallels must be drawn with cauiion. However, the following theorem gives a simple sufficient condition. 23.10 Theorem If {X7 } is uniformly integrable, and s�/n � B > 0 for all n , then (23.30) holds. Proof For any n and £ > 0, the latter assumption implies (23.33)
Hence lim � n ---7= Sn
i t=l
E(l { IX11>snEl X7) S. � limsup max {E(l { IXtl>snEl X7)} n---?oo i �t� n S. =
E( 1 1 1 x1 >sn£ 1 X7) � supt nlim ---?oo 1
0,
(23.34)
where the last equality follows by uniform integrability. • There is no assumption here that the sequence is independent. The conditions involve only the sequence of marginal distributions of the variables. None the less, the conditions are stronger than necessary, and a counter-example as well as further discussion of the conditions appears in §23.4. The following is a popular version of the CLT for independent processes. 23.11 Liapunov's theorem A sufficient condition for (23. 17) is
The Classical Central Limit Theorem
373
n
lim .2::: El Xn1 1 2+8 = 0, for some 8 > 0. (23 .35) n�oo t:= l Proof For 8 > 0 and £ > 0, (23.36) EIXnr l 2+l5 � E(l{ IXnt i>£J I Xnrl 2+8) � £8E(l I IXnri>£)X�r) . The theorem follows since, if £8limn�oo I7=1 E(l 1 1 Xnt i >£)X�1) = 0 for fixed £ > 0, then the same holds with £0 replaced by 1. •
Condition (23.35) is called the Liapunov condition, although this term is also used to refer to Liapunov's original result, in which the condition was cast in terms of integer moments, i.e.
n
lim L EIXnrl3
n�oo
t:= l
=
(23 .37)
0.
Although stronger than necessary, the Liapunov condition has the advantage of being more easily checkable, at least in principle, than the Lindeberg condition, as the following example illustrates. 23.12 Theorem Liapunov ' s condition holds if s�/n > 0 uniformly in n and E I X1 1 2+8 < oo uniformly in t, 8 > 0. Proof Under the stated conditions,
n nB 2 Bl < 2 8 + X EI tl L 2 Sn B2 Sn2 1
- � t:= l
<
-
-
-
� E I X1 1 2+8 lim -1- L
n�oo Sn2+8
t:= l
=
<
00
0
(23.38)
(23.39)
follows immediately. • Note that these conditions imply those of 23.10, by 12.10. It is sufficient to avoid the 'knife-edge' condition in which variances exist but no moments even fractionally higher, provided the sum of those variances is also O(n). 2 3 . 3 Feller' s Theorem and Asymptotic Negligibility
We said above that the Lindeberg condition is sufficient and sometimes also neces sary. The following result specifies the side condition which implies necessity. 23.13 Feller's theorem Let { Xnr} be an indep�ndent sequence with zero mean and variance sequence { 0'�1}. If Sn � N(O, 1) and
374
The Central Limit Theorem
max P( I Xnr l > E) � 0 as n � oo, any E > 0, (23.40) :=::: t :5n 1 the Lindeberg condition must hold. o The proof of Feller' s theorem is rather fiddly and mechanical, but necessary conditions are rare and difficult to obtain in this theory, and it is worth a little study for that reason alone. Several of the arguments are of the type used already in the sufficiency part. Proof Since �t � 0 for every t, the series expansion of the ch.f. suggests that I
I
(23.41) (23.42)
In each case the second inequality is by (1 1 .29), setting E = 0 for (23.42). Squaring I
t. 1
I AI 3 E as n
(23.43)
� oo,
using (23.40) and (23.41). This result is used to show that
exp
{t. (�x",(A.) 1 )} ··· D�x"'(A.) t. exp{ �x"'(A.) - 1 } - �x",(A.) I -
5
I
1
n (23.44) L I x../A) - 1 . The condition of the lemma can be satisfied for large enough n according to (23.42) and (23.23). By hypothesis
The Classical Central Limit Theorem 1 og
ex
��
(�x"'(A.)
-
} � E(cos AX., - 1 )
1)
=
375 -->
-1'-2,
(23.45)
using ( 1 1.12 ) and ( 1 1.13 ) to get the equality. Taking the real part of the expansion in ( 1 1.24) up to k = 2 gives cos x = 1 - ¥2cos ax for 0 :S: a :S: 1 , so that ¥2 - 1 + cos x � 0 for any x. Fix £ > 0 and choose A > 21£, so that the contents of the parentheses on the minorant side below is positive. Then we have n ')...2 2 n I A?X2nt - 2 dP x2 dP nt .L L 2 2 - t=I { IXnti>E} 2 £ t=I { IXnti>E}
(
)
f
< f
:S:
if t= l n
(
)
( �'A2 1 + cos AXnr) dP r X� { IXnti>E}
_L EqA.2X�1 - 1 + cos AXnr) ----7 0,
(23.46) t= l where the last inequality holds since the integrand is positive by construction for every Xn1, and the convergence is from (23.45) after substituting Lr�t = 1 . Since £ is arbitrary, the Lindeberg condition must hold according to (23.46). • Condition (23.40) is a condition of 'asymptotic negligibility' , under which no single summand may be so influential as to dominate the sum as a whole. The chief reason why we could have
-
f
f
where Z is a standard Gaussian variate. A condition related to (23.40) is
P ( max I Xnr I
)
(23.47) o
----7 0 as n ----7 oo, any £ > 0, I :> t :> n which says that the largest Xnr converges in probability to zero. 23.15 Theorem (23.48) implies (23.40).
>£
(23.48)
The Central Limit Theorem
376
P(max1s;1s;n I Xnrl $ £) 1. But P t��� I Xnr l £) = P (!] { I Xnr l £}) $ min P(! Xnt l $ £) Is;ts;n = 1 - max P( I Xn zl > £) , (23 .49) Is;ts;n where the inequality is by monotonicity of P. If the first member of (23.49) converges to 1 , so does the last. Proof
(23 .48) is the same as
�
s;
s;
•
Also, interestingly enough, we have the following. 23.16 Theorem The Lindeberg condition implies (23.48). Proof Another way to write (23. 17) (interchanging the order of summation and integration) is
nL 1!1Xnrl>E}X�, 0, all £ > (23.50) t=I According to 18.13 this implies 2:7=1 1 1 I Xnr / > E}X�1 or, equivalently, (23.51) P (it=I 1 1 /Xnr/ > E}X�1 > £2) 0 as n for any £ 0. But notice that {ro: i111 s;t�n Xnz(CO)j > £}, (23.52) t=l Xnr / > E}(ro)X�r(ro) > £2} = {ro: Imax! �
0.
� 0,
�
�
=,
>
so (23.5 1) is equivalent to (23.48). • Note that the last two results hold generally, and do not impose independence on the sequence. The foregoing theorems establish a network of implications which it may be helpful to summarize symbolically. Let L = the Lindeberg condition; I = independence of the sequence; � e- J...ZI2 ); AG = asymptotic Gaussianity AN = asymptotic negligibility (condition (23.40)); and PM = � 0 (condition (23.48)). Then we have the implications L + I ==> AG +PM + I ==> AG + AN + I ==> L + l, (23.53) where the first implication is the Lindeberg theorem and 23.16, the second is by 23.15, and the third is by the Feller theorem. Under independence, conditions L, AG + PM, and AG + AN are therefore equivalent to one another.
maxi Xnr l
(�sn(A)
The Classical Central Limit Theorem
377
However, this is not quite the end of the story. The following example shows the possibility of a true CLT operating under asymptotic negligibility, without the Lindeberg condition. 23.17 Example Let X1 = 1 and -1 with probabilities 1C l - t- 2) each, and t and -t with probabilities 1t- 2 each, so that E(X1) = 0 and Var(X1) = � - !?, and s� = �n + O(l). This case has similar characteristics to 23.9. Since I Xnrl = ts� 1 with probability t- 2 and 1s� 1 otherwise, we have for any E > 0 that, whenever n is large enough that ESn > 1, (23.54) P( \ Xnr l > E) � n; 2 where n£ is the smallest integer such that n£s� 1 > E. Since n£ = O(n 1 12), (23.40) holds in this case. However, the argument used in 23.9 shows that the Lindeberg condition is not satisfied. E( l { 1 Xn�I >E }X�1) � s� 2 for t � n£, and hence
it=1 f
{ IXnti>E}
X�rdP � (n - n£)s�2 � � > 0.
(23.55)
However, consider the random sequence { W1} , where W1 X1 when I X1 I = 1 and Wr = 0 otherwise. As t increases, W1 tends to a centred Bernoulli variate with p = 1, and defining Wnr = W 1sn , it is certainly the case that n (23.56) L Wnt � N(O, !). t= 1 However, I X1 - Wrl is distributed like Y1 in 23.9, and applying (23.32) shows that .L7=J I X, - W, I = Op(l ), and hence I .L7=1 Xnr - L7= 1 Wnr \ � 2:.7=1 \ Xnr - Wnr l = Op(n - 1 12). It follows that .L7=1Xn1 � N(0, !), according to 22.18. o A CLT therefore does operate in this case. Feller's theorem is not contradicted because the limit is not the standard Gaussian. The clue to this apparent paradox is that the sequence is not uniformly square-integrable, having a component which contributes to the variance asymptotically in spite of vanishing in probability. In these circumstances Sn can have a 'variance' of 1 for every n despite the fact that its limiting distribution has a variance of ! ! =
1
23 .4 The Case of Trending Variances
The Lindeberg-Feller theorems do not impose uniform integrability or Lr-bounded ness conditions, for any r. A trending variance sequence, with no uniform bound, is compatible with the Lindeberg condition. It would be sufficient if, for example, cr7 < oo for each finite t and the unit-variance sequence {X1 /cr1} is uniformly square-integrable, provided that the variances do not grow so fast that the largest of them dominates the Cesaro sum of the sequence. The following is an extension of the sufficient condition of 23.10. 23.18 Theorem Suppose { x7td} is uniformly integrable, where { c 1} is a sequence of positive constants. Then {X1} satisfies the Lindeberg condition, (23.30), if
The Central Limit Theorem
378
sup nMnf 2 sn2 n
=
C
<
(23.57)
oo,
One way to construct the c1 might be as max { 1 ,a1 } . The variances of the trans formed sequence are then bounded by 1 , but a7 = 0 is not ruled out for some t. Proof of 23.18 The inequality of (23.33) extends to
{c7E(1{1X/c11>s,Eic1)(X,Icr)2)} _!_£E(1{ 1 Xrl>s,E)X7) Sn� max $t$ n
s�
:s;
t:= l
:s;
l
C max {E( 1 ( IX/crl>s,Eictl (X/cr) 2)} l$t$ n
(23.58)
The analogous modification of (23.34) then gives sup l im E( 1 { IX!crl>s,.Eicr) (X,Icr)2) t n�oo
=
0.
•
(23.59)
Notice how (23.57) restricts the growth of the variances whether this be positive or negative. Regardless of the choice of { c1}, it requires that s�ln > 0 uniformly in n. It permits the c1 to grow without limit so long as they are finite for all t, so the variances can do the same; but the rate of increase must not be so rapid as to have a single coordinate dominate the whole sequence. If we let c1 = max{ 1 ,0"1} as above, (23.57) is1 satisfied (according to 2.27) when a7 - {x for any a � 0, but not when a7 2 • In fact, the conditions of 23.18 are stronger than necessary in the case of decreasing variances. The variance sequence may actually decline to zero without violating the Lindeberg condition, but in this case it is not possible to state a general sufficient condition on the sequence. If a7 - P with - 1 < a < 0, we would have to replace (23.33) by the condition' -
it:=l
max {E( 1 (IX1i>s,e}X7)}. (23.60) E( 1 { 1Xrl>s,e)X7) :s; Bn � Ists n s� where B = infn (s�ln 1 +a) > 0 by assumption (note, s� - n 1 +a under independence). Convergence of the majorant side of (23.60) to zero as n --7 oo is not ruled out, but depends on the distribution of the X1• The following example illustrates both possibilities. 23.19 Example Let { X1} be a zero-mean independent sequence with X1- U[-ta /v.] for some real a, such that a7 = ft2a either growing with t (a > 0) or declining with t (a < 0). However, X1 is Loo-bounded for finite t (see 8.13). The integrals in (23.30) each take the form _!_
,
(23.61 )
The Classical Central Limit Theorem
379
where s; = !I�= 1 't2a. Now, I,�=1 't2a = O(n2a+ l ) for a > -! and O(log n) for a = -! (2.27). Condition (23.57) is satisfied when a � 0. Note that (23.61) is zero if (!I,�=1 't2a) 1 12£ > ta ; (I,�= l 't2a) 1 12 grows faster than na for all a � 0, and hence (23.61) vanishes in the limit for every t in these cases, and the Lindeberg condition is satisfied. But if X1 - U[ -21,2 1], 2n grows at the same rate as (I,�= 1 22-r) 1 12 , the above argument does not apply, and the Lindeberg condition fails. Note how condition (23.57) is violated in this case. However, the fact that condition (23.57) is not necessary is evident from the fact that the variance sum diverges at a positive rate when X1 - U[ -ta,ta] for any a � -! even though the variance sequence itself goes to zero. It can be verified that (23.61) vanishes in the limit, and accordingly the Lindeberg condition holds, for these cases too. On the other hand, if a < ! s; is bounded in the limit and (23 . 1 7) becomes -
,
(23.62) where B = limn-tooL7=1 t2a < oo, and by choice of small enough £, (23.62) can be made arbitrarily close to 1 . This is the other extreme at which the Lindeberg condition fails. o
24 CL Ts for Dependent Processes
24. 1 A General Convergence Theorem
The results of this chapter are derived from the following fundamental theorem, due to McLeish ( 1 974). 24.1 Theorem Let {Zni• i = l , . . . ,rn , n E [N } denote a zero-mean stochastic array, where rn is a positive, increasing integer-valued function of n, and let25 rn (24. 1 ) Trn = n ( 1 + iAZni), A > 0. i= l Then, S rn = Lt�tZni � N(O, 1) if the following conditions hold: (a) Trn is uniformly integrable, (b) E(Trn) � 1 as n � oo, rn (c) L Z�i � 1 as n � oo, i= l (d) maXt::;i::;rn i Znd � 0 as n � oo. D There are a number of features requiring explanation here, regarding both the theorem and the way it has been expressed. This is a generic result in which the elements of the array need not be data points in the conventional way, so that their number r does not always correspond with the number of sample observations, n. rn = n is a leading case, but see 24.6 for another possibility. It is interesting to note that the Lindeberg condition is not imposed in 24.1, nor is anything specific assumed about the dependence of the sequence. Condition 24.1(d) is condition PM defined in (23.48), and by 23.16 it follows from the Lindeberg condition. We noted in (23.53) that under independence the condition is equivalent to the Lindeberg condition in cases where (as we shall prove here) the central limit theorem holds. But without independence, conditions 24.1(a)-(d) need not imply the Lindeberg condition. Proof of 24.1 Consider the series expansion of the logarithmic function, defined for l x l < 1 , log(l +x) = x - �2 + !x3 Although a complex number does not possess a unique logarithm, the arithmetic identity obtained by taking the exponential of both sides of this equation is well-defined when x is complex. The formula yields n
••.
CLTs for Dependent Processes
38 1
(24.2) where the remainder satisfies I r(x) I :::;; I x 1 3 for I x I < 1 . Multiplying up the terms for i = 1 , ... ,rn yields exp{ iAS,J = T,n U'n ' where T,n is defined in (24. 1) and
{-1A.'� z;, - �r(O..z,+
(24.3)
u,, = exp
Taking expectations produces
s,n (A,) = E(T,n U,n ) = e - 'A212E(T,n ) + E( T,n( U,n - e-'A212)),
(24.4)
so given condition (b) of the theorem,
-
lim E , T,n( U,n e - 'A212) '
n.....:;oo
The sequence
=
0.
(24.5) (24.6)
is uniformly integrable in view of condition (a), the first term on the right-hand side having unit modulus. So in view of 18.14, it suffices to show that Plim T'n (U'n - e -'A212) = 0. (24.7)
n....oo.:;
Since T,n is clearly Op(l), the problem reduces, by 18.12, to showing that plimn.....:;oo U,n = e -'A212, and for this in tum it suffices, by condition (c), if 'n plim :L r(iAZnD = 0. (24.8)
n---';oo
i=l
To show that this convergence obtains, we have by the triangle inequality 'n { 'n 'n (24.9) I Znd � Z�;. � r(iAZn;) :::;; I A 1 3 � I Znd :::;; I A
I
I
3 l3l:;i::
)
The result now follows from conditions (c) and (d), and 18.12. • It is instructive to compare this proof with that of the Lindeberg theorem. A different series approximation of the ch.f. is used, and the assumption from independence, that is avoided. Of course, we have yet to show that conditions 24.1(a) and 24.1(b) hold under convenient and plausible assumptions about the sequence. The rest of
The Central Limit Theorem
382
this chapter is devoted to this question. 24.l(b) will tum out to result from suitable restrictions on the dependence. 24.l(a) can be shown to follow from a more primitive moment condition by an argument based on the 'equivalent sequences ' idea.
24.2 Theorem For an array { Zni } , let (i) The sequence T,n
=
ll/� 1 ( 1
(
+ iAZni) is uniformly integrable if
)
sup E max Z�i < oo. n l:S: i :S: rn And if I:i�1Z�i � 1, then 'n (ii) _Lz�i � 1 ; i=l rn (iii) Srn = _L z i has the same limiting distribution as Srn· i=l Proof Let
(24. 1 1)
n
(24. 12)
otherwise such that Zni
=
0, if at all, from the point i i=I
=
ln + 1 onwards. Note that (24. 1 3)
i=l
The terms A.2Z�i are real and positive. The inequality 1 +x < ex for x > 0 implies that ll;(l +x;) :::; nieXi for Xi > 0. Hence, ln- 1 2 I 't,n 1 = fl (1 + t..2z�i)C l + t..2Z�)
i=l
:::; exp { ki:{�1 1 Z�i) } ( l + kZ�) :::;
e2'-2( 1 + 1..2Z�).
(24. 14)
where the last inequality is by definition of ln. Then by (24. 1 1), sup E I T, 1 2 :::; e 2'-\ I + A2 sup £(Z�n)) < 00 (24. 15) n n Uniform boundedness of E I T,n 1 2 is sufficient for uniform integrability of T,n , proving (i). Since by construction I:i� 1 Z�i :::; 2, n
•
CLTs for Dependent Processes P
(� z�i * � z�i) = P (� z�i > 2) (I � z�i - 1 1 > E) ::; p
'
383
for 0 < E < 1 '
-----7
0 as n ----7 oo, (24. 16) by assumption, which proves (ii). In addition, P(Zni * Zni• some 1 :::;; i :::;; rn) = P(� 1 Z�k > 2) -----7 0 as n ----7 oo, so I Sn - Sn I � 0, and by 22.18, Sn and Sn have the same limiting distribution, proving (iii). • 24.2 The Martingale Case
Although it permits a law of large numbers, uncorrelatedness is not a strong enough assumption to yield a central limit result. But the martingale difference assumption is similar to uncorrelatedness for practical purposes, and is attrac tive in other ways too. The next theorem shows how 24.1 applies to this case. 24.3 Theorem Let {Xnr,r::Fn r} be a martingale difference array with finite uncondi tional variances { 0'�1} , and L-7=1 �t = 1 . If n (a) _L X�1 � 1 , and t= 1 (b) max1�t�n i Xnr l � 0, n then Sn = _LXnr � N(0, 1). t=1 We use 24.1 and 24.2, setting rn = n, i = t, and Zni = Xnr. Conditions (a) and (b) are the same as (c) and (d) of 24.1, so it remains to show that the other conditions of 24.1 are satisfied; not actually by Xnt• but by an equivalent sequence in the sense of 24.2(iii). If Tn = ll7=1 ( 1 + i'AXn1), we show that limn ooE(Tn) = 1 when { Xnr} is a m.d. array. By repeated multiplying out, n Tn = fl( l + i'AXnr) = Tn - 1 + if..Tn -1Xnn t= 1 Proof
-7
=
Tr- 1
=
n i L 1 = + f.. Tr-1Xnr· t=1 ll�: }(1 + i'AXns) is an r:!r- 1-measurable r.v., so by the LIE,
(24. 1 7)
The Central Limit Theorem
384
E(Tn)
=
n 1 + iA-L E(Tr- IXnr) t:=l
n (24. 18) 1 + iA-L E(Tr- I E(Xnrl ?Fn ,r- 1 )) =. 1 . t:= 1 This is an exact result for any n, so certainly holds in the limit. If Xnr is a m.d., so is Xnr = Xnr1 (IJ:}x�k � 2), and this satisfies 24.1(b) as above, and certainly also 24.1(d) according to condition (b) of the theorem. Since I.7= 1 E(X�1) = 1 , condition (24. 1 1) holds for Xnr. Hence, Xnr satisfies 24.1(a) and 24.1(c) according to 24.2(i) and (ii), and so obeys the CLT. The theorem now follows by 24.2(iii). • This theorem holds for independent sequences as a special case of m.d. sequences, but the conditions are slightly stronger than those of the Lindeberg theorem. Under independence, we know by (23. 53) that24.3(b) is equivalent to the Lindeberg condition when the CLT holds. However, 24.3(a) is not a consequence of the Lin de berg condition. For the purpose of discussion, assume that Xnr = X/sn , where s� = I7= 1 cr? and a7 = E(X7). Under independence, a sufficient extra condition for 24.3(a) is that the sequence {X71a7} be uniformly integrable. In this case, inde pendence of {X1} implies independence of {X7 } , {X7 - a7, ?f1} is a m.d., and 19.8 (put Gn = S� and b r = a7) gives sufficient conditions for s�2L7: 1 (X7 - a7) � 0. This is equivalent to 24.3(a). But of course, it is not the case that {X? - a?, ?f1} is a m.d. merely because {Xr.�t } is a m.d. If 24.3(a) cannot be imposed in any other manner, we should have to require E(X71 ?Fr- 1 } = a?, a significant strengthen ing of the assumptions. On the other hand, the theorem does not rule out trending variances. Following the approach of §23.5, we can obtain such a case as follows. 24.4 Theorem If {Xr.?f1} is a square-integrable m.d. sequence and E(X71 ?Fr- 1 ) = a7 a.s., and there exists a sequence of positive constants { c1} such that {X71d} is uniformly integrable and =
2 n2 < oo sup nMnls (24. 1 9) n where M� = max l :o; r :o; n d, conditions 24.3(a) and 24.3(b) hold for Xnr = Xr fsn . Proof By 23.18 the sequence {Xnr } satisfies the Lindeberg condition, and hence, 24.3(b) holds by 23.16. Note that neither of these results imposes restrictions on the dependence of the sequence. To get 24.3(a), apply 19.8 to the m.d. sequence CX7 - a7), putting p = 1 , b1 = c7, and an = s�. The sequence { CX7 - a7)1d} is uniformly integrable on the assumptions, and note that I.7= 1 b1 � nM� = O(s�) and that I.7= 1 b7 :::; nM;; = o(s�), both as consequences of (24. 1 9). The conditions of 19.8 are therefore satisfied, and the required result follows by 19.9. •
CLTs for Dependent Processes
385
24. 3 Stationary Ergodic Sequences
It is easy to see that any stationary ergodic martingale difference having finite variance satisfies the conditions of 24.3. Under stationarity, finite variance is sufficient for the Lindeberg condition, which ensures 24.3(b) by 23.16, and 24.3(a) follows from the ergodicity by 13.12. The interest of this case stems from the following result, attributed by Hall and Heyde (1980: 137) to unpublished work of M. I. Gordin. 24.5 Theorem Let {X1,:1'1 } be a stationary ergodic L 1 -mixingale of size -1, and if Sn = I7= 1 X, assume that limsup n - 1 12E I Sn l < oo . (24.20) Then n - 1 12E I Sn l -7 A, 0 :::;; A < oo. If A > 0, n- 1 12Sn � N(0, 1:nA2). o Notice that the assumptions for this CLT do not include E(Xf) < oo . Let X� = Xrl ( IXr i � Bl · The centred sequence Y� = � - E(X� I :J
(
)
(24.21 ) E n - 1 12± Y� 2 = crl t-= 1 The sequence { n - 1 12 I I7= 1 Y� I } is therefore uniformly integrable (12.11), and by the continuous mapping theorem it converges in distribution to the half-Gaussian limit (see 8.16); hence, by 9.8, n - 1 12E
i it y� � -7 crs(2/n) 112•
I
I
(24.22) -=1 Now define Y1 = X1 - E(Xrl :J
(24.23)
Noting that cr� = f1 IXt l � B )XIdP, { cr�, B = 1 ,2, ... } is a monotone sequence converg ing either to a finite limit cr2 (say) or to +oo. In the latter case, in view of (24.22) there would exist N such that n - 1 12EI I.7=t Y11 > A for N � n, which contradicts (24.23), so we can conclude that cr2 < oo . Taking B = oo in (24.22) yields, in view of the fact that n - 1 12E I Zt - Zn+ I I -7 0,
The Central Limit Theorem
386
n- 1 12EI Sn I = n - 1 12E
I ± Yr + Zt - Zn+l I � a(2ht) 112.
(24.24)
t= l
Hence, A = a(2/rc) 1 12 . Since Y1 is now known to be a stationary ergodic m.d. with finite variance if, and if > 0, 24.3 and 22.18 imply that n - 1 12Sn � N(O,a2 ), which completes the proof. • This result can be thought of as the counterpart for dependent sequences of the Lindeberg-Levy theorem, but unlike that case, we do not have to assume explicitly that E(XT) < oo. Independence of {Xr} enforces the condition Z1 = 0 for all t, and then the conditions of the theorem imply E(XT) = if. It might appear that the existence of dependence weakens the moment restrictions required for the CLT, but this gain is more technical than real, for it is not obvious how to construct a stationary sequence such that X1 is not square-integrable but Y1 is. The most useful implication is that the independence assumption can be replaced by arbi trary local dependence (controlled by the mixing ale assumption) without weakening any of the conclusions of the Lindeberg-Levy theorem. 24.4 The CLT for NED Functions of Strong Mixing Processes
The traditional approach to the problem of general dependence is the so-called method of 'Bernstein sums ' (Bernstein 1927). That is, break up Sn into blocks (partial sums), and consider the sequence of blocks. Each block must be so large, relative to the rate at which the memory of the sequence decays, that the degree to which the next block can be predicted from current information is negligible; but at the same time, the number of blocks must increase with n so that a CLT argument can be applied to this derived sequence. It would suffice to require the sequence of blocks to approach independence in the limit, but a result can also be obtained if it behaves asymptotically like a martingale difference. This is the approach we adopt. The theorem weprove (from Davidson 1992, 1993b) is given in two versions. The first, 24.6, is fully general but the conditions are complicated and not very intuitive. The second, 24.7, is a special case, whose conditions are simpler but cover almost all the possibilities. The exceptional cases for which 24.6 is essen tial are those in which the variances of the process tend to 0 as t increases. 24.6 Theorem Let {Xnr• t = 1 , .. ,n, n � 1 } be a triangular stochastic array, let { Vnt• -oo < t < oo, n � 1 } be a (possibly vector-valued) stochastic array, and let r:f��7 = a(Vns• t - m :::; s :::; t + m). Also, let Sn = L�=IXnr. Suppose the following assumptions hold: (a) Xnr is r:f�. - ooi:B-measurable, with E(Xnr) = 0 and E(S�) = 1. (b) There exists a positive constant array {cnr } such that SUPn,r i i Xnlcnr llr < oo for r > 2. (c) Xnr is �-NED of size -1 on { Vnr} with respect to the constants { cnr } speci fied in (b), and { Vnr } is a-mixing of size -(1 + 28)r/(r - 2), 0 :::; 8 < �.
-m
CLTs for Dependent Processes
387
(d) Letting bn = [n 1 - a] and rn = [nlbn ] for some a E (0, 1 ] , and defining Mn i = maX(i- l )bn
The Central Limit Theorem
388
s�
=
n n t- 1 ,L a7 + 2L .L ar,t-k. t= 1 t=2 k= 1
(24.30)
where at t-k Cov ( Yr,Yt-k). Assumptions 24.6(c) or 24.7(c') imply, by the norm inequality, that (24.3 1 ) 11 Xnr - E(Xnt l ���7-m) llp � CntYm, for 0 < p � 2 , where v is of size - 1 . The following lemma is an immediate consequence of 17.6. 24.8 Lemma Under assumptions 24.6(a), (b), and (c), { Xn1, ��. oo } is an Lp-mixingale of size -min { l , (1 + 28)(r -p)/p(r - 2) } for 1 � p � 2, with constants {cnt l · o In particular, when p = 2 the size is -{1 + 8), and when p = r/(r - 1) the size is - 1 . Under the assumptions of 24.7 the same conclusions apply, except that 8 = 0. There is also a more subtle implication which happens to be very convenient for the present result. This is the following. 24.9 Lemma Under 24.6(a) and (b), plus either 24.6(c) or 24.7(c'), ==
m
-
(24.32) and
(24.33) for each k E IN, where ants = E(Xn�Xns), and �m = O(m-1 -8) for 8 > 0. o These inequalities are convenient for the subsequent analysis, but in effect the lemma asserts that for each fixed k the products XnrXn,t+k• after centring, form L 1 -mixingales of size -1 with constants given by max { c�1,c�,t+d . One of these might be written as, say, { Un1,��. -oo L where Unt = Xn,t-[k/2]Xn,t+k-[k/2] - an,t-[k/2],t+k-[k/2]· The mixingale coefficients here are �� = �0 for m = O, ... ,[k/2], and �� = �m-[k!Z] for m > [k/2]. Proof of 24.9 The array {Xn�Xn,t+d is L 1 -NED on { Vnt l of size -1 by 17.11. The conclusion then follows by 17.6(i), noting that any constant factors generated in forming the inequalities have been absorbed into �m in (24.32) and (24.33). • Now consider 24.6(d) and 24.7(d'). These assumptions permit global nonstation arity (see § 13.2). This is a fact worthy of emphasis, because in the functional CLT (see Chapters 27 and 29 for details) global stationarity is a requirement. In this respect the ordinary CLT is notably unrestrictive, provided we normalize by sn as we do here. The following cases illustrate what is allowed. 24.10 Example Let { Y,} be an independent sequence with variances a7 - t � for any � :?: 0 (compare 13.6). It is straightforward to verify that assumption 24.7(d') is
CLTs for Dependent Processes
389
satisfied for Xnr Yr lsn where Cnr = ar lsn , and in this case s� = L�=I cri· It is however violated when cr7 - 2t, a case that is incompatible with the asymptotic negligibility of individual summands (compare 23.19). It is also violated when � < 0 (see below). o =
24.11 Example Let { Yr} be an independent sequence with variance sequence gen erated by the scheme described in 13.7. Putting Xnr = Yr lsn, 24.7(d') is satis fied with Cnr = llsn for all t. o Among the cases that 24. 7(d') rules out are asymptotically degenerate sequences, having a7 � 0 as t � oo. In these cases, max 1 �r�ncri = 0(1) as n � oo, but given 24.7(c'), it will usually be the case that s�ln � 0. It is certain of these cases that assumption 24.6(d) is designed to allow. To see what is going on here, it is easiest to think in terms of the array { Cnt } as varying regularly with n and t, within certain limits to be determined. We have the following lemma, whose conditions are somewhat more general than we require, but are easily specialized. 24.12 Lemma Suppose c�t - tpn-y- I for �,y E IR . Then, 24.6(d) holds iff (24. 34)
and � < 2[8 +y(l + 8)].
0
(24. 3 5)
Notice that � and y can be of either sign, subject to the indicated constraints. Proof We establish that an a E (0, 1] exists such that each of conditions (24.25) (24.27) are satisfied. We have either M�i - (ibn)Pn -y- I for � � 0, or M�i ((i - 1)bn) Pn -y- I for � < 0, but in both cases Tn v �
M2 ·
rni+PbnPn -y- 1
na.O
+P)+C I - a.) p -y- 1
(24. 36) i=l Simplifying shows that condition (24.34) is necessary and sufficient for (24.27), independently of the value of a, note. Next, (24.25) is equivalent to max Cnr2 = o (na.-1 ) , (24.37) l �t� n which (since the maximum is at t = n for � � 0, and t = 1 otherwise) imposes the requirement max{ �,O} - y < a. (24. 38 ) In view of (24.34) this constraint only binds if y < 0, and is equivalent to just -y < a. (24. 39 ) Also, m
_
rn
_
" . i+ I2 I2 - ( + I )I2 � M - rn P bnP n y i= l m
·
The Central Limit Theorem
390
a(! +P/2)+(( 1 -a)p-y- 1 )/2 -n '
(24.40)
and (24.26) reduces to
28 +y- � a -< 1 + 28 ·
(24.41) i
The existence of a satisfying (24.41 ) and (24.39) requires two strict inequalities to be satisfied by �. y, and 8, but since 8 is positive the first of these, which is 8 > �� - y), holds by (24.34). The second is -y(1 + 29) < 28 + y- �. which is the same as (24.35). It follows that (24.34) and (24.36) are equivalent to 24.6(d), as asserted. • In terms of the leading case with Xm = (Yt - �t)lsn, with s� given by (24.30), consider first the case a7 - rP, � � 0. To have Cnt monotone in t (not essential, but analytically convenient), we could often set
(24.42) = max {<Js } fsn. 1 $s $t In this case, under 24.6(b) and 24.7(c'), the conditions of 17.7 are satisfied. Note however that I I Yt - �til (Jt is required by 24.6(b ). Since { sd is of size - 1 we have, substituting into (24.30), n n t- 1 p (24.43) s� « 2 t + 22 rP12 L (t - k)P12 Sk O(n l+P) . t= 1 t= 1 k=l This only provides an upper bound, and condition 24.7(c') alone does not exclude the possibility that s� - n 1+y for some y < �- However, compliance with condition (24.34), which follows in tum from 24.7(d') (note, M� = c�n in this case), enforces y = �- This condition says that the variance of the sum must grow no more slowly than the sum of the variances. Now consider the case a7 tp with � < 0. Here, we might be able to set (24.44) C nt
r
-
=
-
s :?: t
Under 24.7(c') and (24.34) we would again have c� t l3n - l3 -1 , but here M� = <J7 •1s� for some t* < oo; hence M� - n - 13- 1 , and 24.7(d') ceases to hold. However, with � y, condition (24.35) reduces to r
-
=
�
> 1-+2828'
(24.45)
and it is possible for the conditions of 24.12 to be satisfied, although only with 8 > 0. As 8 is increased, limiting the dependence, a can be increased according to (24.36) and this allows smaller �. increasing the permitted rate of degeneration. As 8 approaches �' so that the mixing size approaches -2rl(r - 2), � may approach -�, with a also approaching �These conclusions are summarized in the following corollary. Part (i) is a case
CLTs for Dependent Processes
391
of 24.7, and part (ii) a case of 24.6. 24.13 Corollary Let Xnt = (Y1 - J.l1)1sn where � t � and s� - n 1 +f3. If either (i) 0 :::; � < oo, and 24.6(b) and 24.7(c') hold with Cnt defined in (24.42); or (ii) there exists 8 such that (24.45), 24.6(b) and 24.6(c) hold with Cnt defined by (24.44); then Sn � N(O, 1). o By an apparent artefact of the proof, t- 112 represents a limit on the permitted rate of degeneration of the variances. We may conjecture that with mixing sizes exceeding 2rl(r - 2), the CLT can hold with � :::; -� , but a different method of attack would be necessary to obtain this result. Also, both the above cases appear to require that a E (0, �), and it is not clear whether larger values of a might be possible generally. The plausibility of both these conjectures is strengthened by the existence of the following special case. 24.14 Corollary If conditions 24.6(a) and (b) hold, and also (c") {Xn1,��, - oo} is a martingale difference array, and (d") conditions (24.25) and (24.27) hold with bn = 1 , then Sn � N(O, l). o In the case Xnt = (Y1 - J...L1)1sn, where by the m.d. assumption s� = I,�=J�, note that (24.27) is satisfied with bn = 1 by construction, and (24.25) requires only that s� t oo, so that � t� is permitted for any � :2: -1 under 2.14(d"). This result may be compared with 24.4, whose conditions it extends rather as 24.6 extends 24.7. The proof will be given below after the intermediate results for the proof of 24.6 have been established. As an example, let Y1 - J.l1 be a m.d. with a7 = �It where � is constant, so that s� log n. Corollary 24.14 establishes that (log nf 1 12L�= l (Yt - J.lt) � N(O, a2). The limit on the permissable rate of degeneration here is set by the requirement s� --7 oo as n --7 oo. If the variances are summable, the central limit theorem surely fails. Here is a well-known case where the non-summability condition is violated. 24.15 Example Consider the sequence { Y1} of first differences, with Y1 = Z1 and Y1 = Z1 - Z1- I , t = 2,3, . .. , where { Z1} is a zero-mean, uniformly L,-bounded, independent sequence, with r > 2. Here { Y1} satisfies 24.6(a)-(c), but L�= l Yt = Zn and s� = Var(Zn) = 0(1 ). o -
-
-
24. 5 Proving the CLT by the Bernstein Blocking Method
We shall prove only 24.6, the arguments being at worst a mild extension of those required for 24.7. We show that with a suitable Bernstein-type blocking scheme the blocks will satisfy the conditions of 24.1. In effect, we show that the blocks behave like martingale differences. In most applications of the Bernstein approach alternating big and small blocks are defined, with the small blocks containing of
The Central Limit Theorem
392
the order of [n �( !-a)] summands for some � E (0, 1 ) , small enough that their omis sion is negligible in the limit but increasing with n so that the big blocks are asymptotically independent. Our martingale difference approximation method has the advantage that the small blocks can be dispensed with. Define bn and rn as in condition 24.6(d) for some a E (0 , 1), and let ibn
Zni = L Xnt• i = 1 , . .. ,rm t= i )bn+ ( -I
such that n
(24.46)
I
'n
(24.47) Zni + (Xn, nbn+l + .. . + Xnn) . L t= ! i== l The final fragment has fewer than bn terms, and is asymptotically negligible in the sense that bnrnln ---7 1 . Our method is to show the existence of a > 0 for which an array { Zni } can be constructed according to (24.46), such that 24.1(c) is satisfied, and such that the truncated sequence {Znd defined in (24.10) satisfies the other conditions of 24.1. This will be sufficient to prove the theorem, since then Lt� I Zn i � 1 according to 24.2(ii), and Zni and Zni are equivalent sequences when 24.1(c) holds, by 24.2(iii). Since { Xnr} is a L2-mixingale array of size -� according to 24.8, we may apply 16.14 to establish that the sequences (24.48) . ib 2 At 1 east, thIS" 1011ows "" 1y mtegrable, where Vn2i = " are unl!orm .L.. t=(i-!)bn+!Cnr. directly in the case i = 1, and it is also clear that the result generalizes to any i, for, although the starting point (i - 1)bn + 1 is increasing with n, the nth coordinate of the sequence in (24.48) can be embedded in a sequence with fixed starting point which is uniformly integrable by 16.14. This holds uniformly in n and i, and it follows in particular that the array {Z�/Y�i , 1 � i � rm rn � 1 } is uniformly integrable. This result leads to the following theorem. 24.16 Theorem Under assumptions 24.6(a)-(d), {Znd satisfies the Lindeberg condition. Proof For any i, v�i � bnM�i � 0 as n � oo by (24.25) and hence, for E > 0, 1 { 1Znil>£}z�i) 1 { I Z,/vn; I >Eivn;}z�i) < max max bnMn2i Y2ni ! :S i :S r ! :S i :S r Sn = L Xnt =
r
"
{E(
n
} {E( _
n
� O as n � = by uniform integrability. The conclusion,
}
(24.49)
CLTs for Dependent Processes
393
rn
_L £( 1 { l2nj l >£ l z�j) j= l
-----) 0 as n -----) oo, (24 . 50) now follows since the sum of rn terms on the majorant side is 0(1 ) from (24.27). • This theorem leads in turn to two further crucial results. First, by way of 23.16 we know that condition 24.l(d) is satisfied (and if this holds for Zni it is certainly true for Zn;); and second, note that for any E > 0 max Zn2i
rn
"' 1 { I Z,.;I>£ } zn2i· E2 + L
(24.5 1 ) i= l Taking expectations of both sides of this inequality, and then the limit as n -----) oo, shows that (24. 1 1 ) holds, and therefore that Trn is uniformly integrable (that is, condition 24.1(a) holds) by 24.2(i). This leaves two results yet to be proven, E( Tr ) -----) 1 (corresponding to 24.1(b) for the truncated array), and 24.1(c). By the latter result we shall establish parts (ii) and (iii) of 24.2, and then in view of 24.2(iii) the proof will be complete. We tackle 24.1(c) first. Consider ] :s; j :s; rn
::;;
n
(24.52) say, where rn
An = _L (�; - E(Z�;)) i= l (24.53) and Bn
= B�
+
B�, B�
where
[
(
ibn n -t Tn = 2 _L _L L crnt,t+k , i= l t=(i- l )bn+ l k=ibn -t+l
)]
(24.54) (24.55)
Here, ants = E(XntXns), and recall that E(S�) = .L.7: } (cr�t + 2.L.}=t+l crnt,t+j) + �n = 1 . It may be helpful to visualize these expressions as made up of elements from outer product and covariance matrices divided into � blocks of dimension bn x bn, with a border corresponding to the final n - rnbn terms, if any; see Fig. 24. 1 .
The Central Limit Theorem
394
The terms in An correspond to the rn diagonal blocks, and B� and B� contain the remaining covariances, those from the off-diagonal and final blocks. 2
3
0 1hn hn 0 �
2 3 :
0 :
:
0
:
:
'" I I I I � .
.
Fig. 24. 1
An is stochastic and so must be shown to converge in probability, whereas the components of Bn are nonstochastic. The nonstochastic part of the problem is the
more straightforward, so we start with this. 24.17 Theorem Under assumptions 24.6(a)-(d), Bn --7 0. Proof Since r > 2, r/(r - 1) < 2 and conditions 24.6(b) and (c) imply by 17.7 that (24.56) I
rn rn+ l bn bn I B� I :5: 2� � � L I
(24.57) To determine the order of magnitude of the majorant expression, verify that the terms in the parentheses are O(b�-8(1 - i) -I -&) for I = i + l , . . ,rn + 1 , and some b > 0. Changing the order of summation and putting k = l - i allows us to write .
(24.58) But for every k in this sum, the Cauchy-Schwartz inequality and (24.27) give
rn+ l -k rn+ l L MniMn,i+k :5: ,LM�i = O(b�1 ) , i=l i=l
(24.59)
CLTs for Dependent Processes
395
so that (24.58) implies, as required, that I B� I O(b�8) O(n-S( l -a)) . (24.60) To complete the proof we can also show, by a similar kind of argument but applying (24.25), that (24.61) =
=
To solve the stochastic part of the problem, decompose the terms Z�i - E(Z�i) in (24.53) into individual mixingale components, each indexed by i = 1 , ... ,rn. For a pair of positive integers j and k, let (24.62) It is an immediate consequence of 24.9, specifically of (24.32) and (24.33), that for fixed j and k the triangular array { Wni(i ,k), ��� '2= ; 1 :::; i :::; rn, n � N(j,k) } , (24.63) where N(j,k) = min{n: rn � 1, bn � j + k}, is an L 1 -mixingale of size -1, with mixing coefficients 'Vo = �o and 'Jip �(p - l )bn+J for p � 1 , and constants (24.64) Substituting from (24.62) into (24.53), we have the sum of b� such rnixingales, =
Z�i - E(Z�i) =
I:1 (wniU,O) + 2rk=! WniU, k)) + Wn;(bn,O). j=l
(24.65)
Although this definition entails considering k of order bn as n --7 oo, note that the inter-block dependence of the summands does not depend on k. The designation of 'mixingale ' is convenient here, but it need not be taken more literally than the inequalities (24.32) and (24.33) require. The crucial conclusion is that a weak law of large numbers can be applied to these terms. 24.18 Theorem Under assumptions 24.6(a)-(d), A n � 0. o The object here is to show that the array (24.66) { (Z�i - E(Z�;) ), ��� '2= ; 1 :::; i :::; rn , rn � 1 } is an L1-mixingale with constants {and which satisfy conditions (a), (b), and (c) of 19.11. The proof could be simplified in the present case by using the fact that Zni is ���'2=-measurable (by 24.6(a)) so that the minorant side of (24.68) below actually disappears identically for p � 0. But since the result could find other applications, we establish the mixingale property formally, without appealing to this assumption. Proof By multiplying out Z�i and applying 24.9 term by term, we obtain
/
E E(Z�i - E(Z�a I � �� -p)bn ).
j
The Central Limit Theorem
396
, �� ( 1
e e(w.,{i,O) i !l' ! � -p)b,) I
+
2
�I
e e(w.,{i,k) i !l'!�-p)b, )
E ' E( Wni(bmO) j � ��-p)bn ) I
+
I)
(24.67) and similarly,
I
E z�i - E(Z�d � � �+p)bn ) I «
1 -0)
M�i
[�1 (
�(p+
l )bn-j + 2
) ]
�
�(p+l)b -j -k
+ �Pb
'
(24.68)
n n pl k=l where �j = 0([ for o > 0. Write, formally, ani'I'; to denote the larger of the two majorant expressions in (24.67) and (24.68), such that \jf; � 0 and ani is fixed by setting 'l'o = 1. Evaluating (24.68) at p = 1 and (24.68) at p = 0 respec tively gives
I
E E(Z�, - E(�,) J !!'!� - l )h,) I and also, putting
I
j' j =
( 1 + 2(b, -j))j+O + b� I -O
M�,
«
M�;bn,
bn - and k'
E z�i - E(Z�il � ��n) l
) (� j' [�1 (j'-1-o �k'=O ) ]
((
«
M�i
=
(24.69)
- k,
/= 1
+2
k' - l -o
+
1
(24.70)
2 Mnibn. Hence, ani = BM�ibn for some finite constant B. Since 'n 'n (24.7 1 ) L �i $ max M�; _LM�i = o(b�2) i= l i=l l$i$rn+l i n view of (24.27) and (24.25), these constants satisfy conditions 19.1l(b) and (c). And since Z�/bnM�; s Z�/V�i where Z�/V�; is uniformly integrable, they also satisfy condition 19.11(a). It follows that An � 0, and the proof is complete. • This brings us to the final step in the argument, establishing the asymptotic m.d. property of the Bernstein blocks. 24.19 Theorem Under 24.6(a)-(d), limn--?c£(f,n) = 1 . Proof Applying (24.17),
«
CLTs for Dependent Processes
397
'n 'n ) = 1 i 1 i Z + A (24.72) + A L 1\-! Zni• ni ( TI i=l i=l where Ti -l = Tij:t ( l + iA-Znj) is an ��:=�)bn_measurable r.v. by 24.6(a), and hence 'n E(Ti- I ZnD ,) = 1 + iAL E(T i=l T,n =
(24.73) By the Cauchy-Schwartz inequality,
(24 . 74) i=l i=l where II Ti- 1 1 l 2 is uniformly bounded by (24 . 1 5), which follows in tum by (24. 1 1 ), so the result hinges on the rate of approach to zero of II E(Znd ��:=�)bn) lb. This cannot be less than that for Zni • so consider, more conveniently, the latter case.
[�
E(E(Xntl ��:= �)bn))2 t=(i- 1 )bn+ I 1 12 i l i t I ( ( l � i � i + 2 L..., L..., � E(E(Xnt � �n,-oo)bn)E(Xnt+k � �n,-oo)bn)) l t=(i- I )bn+ I k=l Applying 24.8, �
]
(24 . 75) (24.76)
and
E I E(E(Xm l �t=�)bn)E(Xnt+k l ��:=�)bn)) I
� II E(Xm l ��:= �)bn) ll 2 II E(Xnt+k l ��:= �)bn) lb
where Sj
O(j - 1 12- 11) for
(24.77)
> 8 for the 8 defined in 24.6(c). Hence, 1 12 (24.78) II E(Znd ��:= �)bn) ll 2 « Mni sJ + 2 \j i Sk , k j=l =! =j+! ] where the sum of squares is 0(1 ) and the double sum is O(b� - 211) . Applying (24.26) , assumption 24.6(d) implies =
J.l
(�
�
)
The Central Limit Theorem
398 �
L j E(Ti- J E(Znd ��:=�)bn )) l i=l
�
L MnP (max { b� 1 - 211)12, 1 }) i=l - -I = O(max { b� 11 , b� I2 }) = o ( l ). =
(24.79 )
This ensures that L,j� 1 E(Ti- !Zni) ----7 0 as n ----7 oo, which is the desired result. • Proof of 24.6 We have established that L.i� 1 Zni = L.;�fnXnt � N(O,l). There remains the formality of extending the same conclusion to Sm but this is easy smce (24.80) and Sn has the same limiting distribution by 22.18. The proof of 24.6 is therefore complete. • It remains just to modify the proof for the martingale difference case, as was promised above. Proof of 24.14 It is easily seen that 24.16 holds for rn = n and bn = 1 . In 24.17, Bn = 0 identically since �k 0 for k > 0 in (24.57). In 24.18 one may put Zni = Xnt• and the conditions for 19.10 follow directly from the assumptions. Lastly, 24.19 holds since the sum in (24.79) vanishes identically under the martingale difference assumption. The proof is completed just as for 24.6. • =
25 S ome Extensions
25 . 1 The CLT with Estimated Normalization
The results of the last two chapters, applied to the case Xnt = X11sn where E(X1) = 0 and s� = E(I.7= 1 X1) 2 , would not be particularly useful if it were necessary to know the sequences { cr7} , and { cr1,1-d for k � 1 , in order to apply them. Obviously, the relevant normalizing constants must be estimated in practice. Consider the independent case initially, and let Sn = "'L7= 1 X11sn where s� = I.7=Icr7. Also let s� I.7 =Ix7, and we may write n 1 Sn = --;::- L Xr = dnSn, (25.1) S n t=! where dn snfsn. If dn � 1 , we could appeal to 22.14 to show that Sn � N(0,1 ) whenever Sn � N(O, l). The interesting question is whether the minimal conditions sufficient for the CLT are also sufficient for the relevant convergence in probability. If the sequence is stationary as well as independent, existence of the variance 2 cr is sufficient for both the CLT (23.3) and for n -1 I.7= 1 X7 � d (applying 23.5 to X7). In the heterogeneous case, we do not have a weak law of large numbers for { x7 } based solely on the Linde berg condition. However, the various sufficient conditions for the Lindeberg condition given in Chapter 23, based on uniform integrability, are sufficient for a WLLN. Without loss of generality, take the case of possibly trending variances. 25.1 Theorem If {X1} is an independent sequence satisfying the conditions of 23.18, then =
�
=
(25.2) the sequence (X7 - a7)1c7. By assumption this has zero mean, is independent (and hence an m.d.), and uniformly integrable. The conditions of 19.8, with p = 1 , b1 = d, and an = s�, are satisfied since, by (23.57), n " 2 2 2 (25.3) L- c t :$ nMn = O(sn) ,
Proof Consider
t=!
where Mn
=
max, � 1 � n{c1 } . Hence
The Central Limit Theorem
400 E
=
I-7=1 x? E --2- - 1 --7 0, Sn
(25.4)
which is sufficient for convergence in probability. • When the sequence { X1} is a martingale difference, supplementary conditions are needed for {X?} to obey a WLLN, but these tum out to be the same as are needed for the martingale CLTs of 24.3 and 24.4. In fact, condition 24.3(a) corresponds precisely to the required result. We have, immediately, the following theorem. 25.2 Theorem Let {X1,�1} be a m.d., and let the conditions of 24.3 or 24.4 be satisfied; then (25.2) holds. o Although we have spoken of estimating the variance, there is clearly no neces sity for {s�/n } , or any other sequence apart from { dn } , to converge. Example 24.1 1 is a case in point. In those globally covariance stationary cases (see § 1 3 .2) where s�/n converges to a finite positive constant, say -& , the 'average variance' of the sequence, we conventionally refer to s�ln as a consistent estimator of cr2. But more generally, the same terminology can always be applied to s� with respect to s� , in the sense of (25.2). Alternative variance estimators can sometimes be defined which exploit the particular structure of the random sequence. In regression analysis, we typically apply the CLT to sequences of the form X1 = W1U1 where { U1,�1} is assumed to be a m.d. with fixed variance � (the disturbance), and where W1 (a regressor) is �1- 1 -measurable. In this case, {Xt � l is a m.d. with variances a? = �E( W?) , which suggests the estimator s� = (n-1 2-7= 1 u?) L-7= 1 w?, for s�. This is the usual approach in regression analysis, but of course the method is not robust to the failure of the fixed-variance assumption. By contrast, s� = 2-7= 1 w?u? possesses the property cited in (25.2) under the stated conditions, regardless of the distributions of the sequence. The latter type of estimator is termed hetero scedasticity-consistent. 26 Now consider the case of general dependence. The complicating factor here is that s� contains covariances as well as variances, and s� is no longer a suitable estimator. A sample analogue of s� must include terms of the form X1Xt+j for IJ I � 1 as well as for j = 0, but the problem is to know how many of these to include. If we include all of them, in other words for j = 1 - t, ... ,n - t, the resulting sum is equal to (I-7= 1 X1) 2, and the ratio of this quantity to s� is converging, not in probability to 1 , but in distribution to x2 (1) . For consistent estimation we must make use of the knowledge that all but a finite number of the covariances are arbitrarily close to 0, and omit the corresponding sample products from the sum. Similarly to the m.d. case, the conditions of 24.6 contain the required convergence result. Consider (24.46) and (24.47) where Xnt = X!sm but now write >
4z;
=
SnZni =
t
ibn
1
L X1, i'= 1 , ... , rn.
t=(i- 1 )bn+
(25.5)
Some Extensions
401
In the proof of the CLT the construction of the Bernstein blocks was purely conceptual, but we might consider the option of actually computing them. The sum of squares of the unweighted blocks, s� 1 = L.i� 1 Z�T, is consistent for s� in the sense that r sn2l � 2 pr (25.6) 2 L Zni --"-----7 1 ' Sn i= l according to 24.17 and 24.18. An important rider to this proposal is that 24.17 and 24.18 were proved using only (24.25) and (24.27), so that, as noted previ ously, any a E (0, 1 ) will serve to construct the blocks. In the context of the conditions of 24.12 at least, the only constraint imposed by 24.6(d) is repre sented by (24.39), which puts a possible lower bound on a in the case of decreas ing variances (y > 0), but no upper bound strictly less than 1 . It is sufficient for consistency if bn goes to infinity at a positive rate, and we are not bound to use the a that satisfies the conditions of the CLT to construct the blocks in (25.6). But although consistent, s� 1 is not the obvious choice of estimator. It would be more natural to follow authors such as Newey and West (1987) and Gallant and White (1988), inter alia, who consider estimators based on all the cross-products X1Xr-k for t = k + 1 , ... ,n and k = O, . . . ,bn. In terms of the array represention of Fig. 24. 1, these are the elements in the diagonal band of width 2bm rather than the diagonal blocks only. (In this context, bn is referred to as the band width.) The simplest such estimator is n bn n s�2 = .L X7 + 2L L XrXr-k· (25.7) t= l k=l t=k+ l 25.3 Theorem Under the conditions of 24.6, applied to X/sn, s�2/s� � 1 . Proof Let Xnr = X/sn in (24.53), so that s!4 n denotes the same sum constructed from the X1 in place of the Xnr . The difference between An and (s�2 - E(s�2))1s� is the quantity
(25.8) The components of this sum correspond to the rn - 1 'triangles' which separate the diagonal blocks in Fig. 24. 1 , each containing �bn(bn - 1) terms, plus the terms from the lower-right corner blocks. Reasoning closely analogous to the proof of 24.18 shows that A� � 0. The sums of the corresponding covariances converge absolutely to 0 by 24.17, since they are components of B� in (24.54), and it follows that
402
The Central Limit Theorem I s�2 - s� d
2 -- � 0 . --Sn
The theorem therefore follows from 24.18. • Since this estimator uses sample data which are, discarded in s� 1 , there are informal reasons for preferring it in small samples. But there is a problem in that (the chosen notation notwithstanding) s�2 is not a square, and not always non-negative except in the limit. This difficulty can be overcome by inserting fixed weights in the sum, as in bn n n s;3 = _Lx7 + 2,L wnk L XrXt-k· t=l k=l t=k+l
(25 . 9 )
Suppose Wnk � 1 as n � oo for every k ::; K, for every fixed K < n I S�2 - s;3 l 2 bn ) ( 1 = 2 L - Wnk L XrXr-k � r(K), s; t=k+l Sn k=l where
1
---
I
bn n 1 (1 r(K) = 2 plim 2 L Wnk) L XrXt-k . n-too Sn k=K+I t=k+l
I
-
I
oo.
Then (25. 10)
(25 . 1 1 )
Since r(K) can be made as small as desired by taking K large enough, in view of 25.3 and 24.18, s�3 is consistent when the weights have this property. It remains to see if they can also be chosen in such a way that (25 . 9) is a sum of squares. Following Gallant and White (1988), choose b n + 1 real numbers an J , ... ,an,bn+ l , satisfying 2:j�t 1 a�j = 1 , and consider the n + bn variables Yni = a 1 X1, Yn2 = an1X2 + an2X1 , Yn'bn+ 1 = an!Xbn+1 + ... + an'bn+lXb Yn,bn+2 = an 1Xbn+2 + ... + an,bn+ 1X2, Ynn Yn,n+ 1
==
==
aniXn + ... + an,bn+lXn-bn ' an2Xn + ... + anbnXn-bn+b
Observe that
(25. 1 2)
Some Extensions
403
which shows that any weights of the form Wnk = I,j�tl l anJanJ -k, k = O, ... ,bn impose non-negativity, and also give wno = 1 . A case that fulfils the consistency requirement is anj = (b n + 1) - 1!2 , all j, which yields Wnk = 1 - k/(bn + 1). Variance estimators having the general form n n L L w( I t - S I lbn)XrXs t=l s=l are known as kernel estimators. The function w(x) = 1 - l x l for l xl 5 1, 0 other wise, which corresponds to �3 , is the Bartlett kernel. The estimator �2 , by contrast, uses the truncated kernel, w(x) = 1 1 l x l :o; I l . Other possibilities exist; see Andrews ( 1991) for details. One other point to note is that much of the literature on covariance matrix estimation relies on L2-convergence results for the sums and products, and accordingly requires L,-boundedness of the variables for r � 4. The present results hold under the milder moment conditions sufficient for the CLT, by using a weak law of large numbers based on 19.11. See Hansen (1992b) for a comparable approach to the problem. 25 . 2 The CLT with Random Norming
Here is a problem not unconnected with the last one. We will discuss it in the context of the m.d. case for simplicity, but it clearly exists more generally. Consider a m.d. sequence {X1} which instead of (25.2) has the property �7=Ix7 pr 2 (25. 13) 11 ' Sn2 � where 11 2 is a random variable. This might arise in the following manner. Let X1 = W1 U1 where { U1, '!F 1 } :00 is a m.d. and { W1 } := a sequence of r. v .s which are measurable with respect to the remote a-field 'R = n7=- oo'ffs . The implication is that W1 is 'strongly exogenous' (see Engle et al. 1983). with respect to the generation mechanism of U1• Then (25. 14) since 'R � r:Ji1_ 1 for every t, hence X1 is a m.d. Provided the W1 are distributed in such a way that �7= 1 w7u7Js� � 1, the analysis can proceed exactly as in §24.2. There is no practical need to draw a distinction between nonstochastic Wr and 'R-measurable Wr. But this need not be the case, as the following example shows. 25.4 Example Anticipating some distributional results from Part VI, let Wr = ��=I Vs where { V5 } is a stationary m.d. sequence with E(V�) = 't2 for all s. Also assume, for simplicity, that E(U7) = cr2, all t. Then, E(W7) = t't2 and s� = !n(n + l)'t2cr2. If we further assume that { Vs } satisfies the conditions of 24.3, it will be shown below (see 27.14) that
The Central Limit Theorem
404
'
f B(r)2dr, 1 � W12 � D (25. 15) L 22 o 't n t=l where B(r) is a Brownian motion process. Under the distribution conditional upon 'R, we may treat the sequence { Wr } as nonstochastic and apply the weak LLN (by an argument paralleling 25.2) to say that . I�=,x;
I�=' w7u7 . I�=' w; = 2 hm (25 . 1 6) 2 phm 2 .-2 = 11 2 , 2 2 't 't n s� n�oo crn (n + 1) n�oo n�oo where 11 2 merely denotes the limit indicated. Under the joint distribution of { U1 } and { W1}, !11 2 is a drawing from the limiting distribution specified in (25.15). o The application of the CLT to { X1} can proceed under the conditional distribution (defined according to 10.30), replacing each expectation by E(. l 'R). Let cr7 = E(U7 1 'R), defining an 'R-measurable random sequence, so that
phm
.
=
E
(it=l x7 1 'R) it=l w7a7. =
(25. 17)
We can then apply a result for heterogeneously distributed sequences, such as 24.4, letting Cr = Wrat· Assuming that ECX71 �t-1 ) = w;a;, and that condition (24. 19) is almost surely satisfied, the conditional CLT takes the form
}
( {
)
lim E exp n i� 2 1 12 i x� 'R = e - ".2!2 , a.s. (Lr =l Wrcrr) t= l n�oo
(25. 1 8)
But if we normalize by the unconditional variance, the situation is quite different. W1 must be treated as stochastic and E(X71 �1-1 ) :f:. ECX7), so the condi tions of 24.4 are violated. However, if (25.13) holds with 11 an 'R-measurable r.v., then according to 22.14(ii) the conditional distribution has the property
1 � X 'R -n L r S t= l
I
D �
N(O, 11 2) , a.s.
{ }I ( )
(25 . 1 9)
(see § 10.6 for the relevant theory). This result can also be expressed in the form
iA n Xr 'R lim E exp :S L n n�oo t=l
=
2 2
e - "- 11 12, a.s.
(25.20)
Hence, the limiting unconditional distribution is given by
(25.21) This is a novel central limit result, because we have established that L,�= 1 X11sn is not asymptotically Gaussian. The right-hand side of (25.21) is the ch.f. of a mixed Gaussian distribution. One may visualize this distribution by noting that
Some Extensions
405
drawings can be generated in the following way. First, draw 11 2 from the appropri ate distribution, for example the functional of Brownian motion defined in (25 . 1 5); then draw a standard Gaussian variate and multiply it by YJ . If X is mixed Gaussian with respect to a marginal c.d.f. G(YJ) (say), and <J>T'I is the Gaussian density with mean 0 and variance YJ 2, the moments of the distribution are easily computed, and as well as E(X) = E(X3) = 0 we find
E(X2) =
f""f 00 x2<J>T'l (x)dxdG(YJ) = f""0 YJ2dG(YJ) = E(YJ 2). 0
-
oo
(25.22)
However, the kurtosis is non-Gaussian, for
(25.23) where the right-hand side is in general different from iE(YJ 2)2 (see 9.7). 25 . 3 The Multivariate CLT
An array {Xnt} of p-vectors of random variables is said to satisfy the CLT if the joint distribution of Sn = L�= lXnt converges weakly to the multivariate Gaussian. In its multivariate version, the central limit theorem contributes a new and powerful approximation result. Given a vector of stochastic processes exhibiting arbitrary (contemporaneous) dependence among themselves, we can show that there exist different linear combinations of the processes which are asymptotically independent of one another (uncorrelated Gaussian variables being of course independent). This is a fundamental result in the general theory of asymptotic inference in econometrics. The main step in the solution to the multivariate problem sometimes goes by the name of the 'Cramer-Wold device' . 25.5 Cramer-Wold theorem A vector random sequence { Sn } i, Sn E IR \ con verges in distribution to a random vector S if and only if a'Sn � a'S for every fixed k-vector a t:. 0. Proof For given a the characteristic function of the scalar a'Sn is E(exp{ iA.a'Sn } ) =
--7
--7
The Central Limit Theorem
406
(25.25) where Ln ::: CnA�12, Cn and An being respectively the eigenvector matrix (satisfying CnC� = C�Cn= lp) and the diagonal, non-negative matrix of eigenvalues. Let Xnt ::: L-;,X1, where if l:n has full rank then L-;; = A� 1 12C� so that L-;,I:nL-;; ' lp· However, l:n need not have full rank for every n. If it is singular with An = I n ,
[� �]
=
let L-;; ' = [Cn 1 A!�12 : 0] where Cnl is the appropriate submatrix of Cn. In this case, L -;; l:nL -;; ' has either ones or zeros on the diagonal. We do however require l:n to be asymptotically of full rank, in the sense that L-;;l:nL-;; ' � lp. If Sn = L,7=IXn1, then for any p-vector a with a'a = 1 , we have E(a'Sn) 2 � 1 . If this condition fails, and there exists a ::f. 0 such that E(a'Sn)2 � 0, the asymptotic distribution of Sn is said to be singular. In this case, some elements of the limiting vector are linear combinations of the remainder. Their distribution is therefore determined, and nothing is lost by dropping these variables from the analysis. To obtain the multivariate CLT it is necessary to show that the scalar sequences { a'Xnt } satisfy the ordinary scalar CLT, for any a. If sufficient conditions hold for a'Sn � N(O, l), the Cramer-Wold theorem allows us to say that Sn � S, and it remains to determine the distribution of S. For any a, the ch.f. of a'S is <j>(A.) = E(exp { iA.a'S} ) = e - "-212 . (25.26) But letting t = A« be a vector of length A., it follows from ( 1 1 .41) that (25.26) is the ch.f. of a standard multi-Gaussian vector. (Recall that ex.'ex. = 1 .) By the inversion theorem we get the required result that S - N(O, lp) . We have therefore proved the following theorem. 25.6 Theorem Let {Xr} be a stochastic sequence of p-vectors and let l:n = E((L,7=IX1)(L7=IXt)'). If L-;;l:nL-;; ' � lp and L,7=Ia'L-;;Xt � N(0,1 ) for every ex. satisfying ex.'a = 1, then n (25.27) _LL-;;Xt � N(O, lp). D t=l
In this result the elements of l:n need not have the same orders of magnitude in n. The variances can be tending to infinity for some elements of X1, and to zero for others, within the bounds set by the Lindeberg condition. However, in the case when all of the elements of l:n have the same order of magnitude, say n° for some 8 > 0 such that n -ol:n � 1:: , a finite constant matrix, it is easy to manipulate (25.27) into the form n 0 2 (25.28) n 1 _LXr � N(O, l:) . t= l
Techniques for estimating the normalization factors generalize naturally from
Some Extensions
407
the scalar case discussed in §25 . 1 , just like the CLT itself. Consider the m.d. case in which l:n = L�=JE(XrX;), and assume this matrix has rankp asymptotically in the sense defined above. Under the assumptions of 25.2, (2 5 .29 ) for any a: with a:'a: = 1 , where the ratio is always well defined on taking n large enough, by assumption. This suggests that the positive semidefinite matrix i:n = L,�= 1 XrX; is the natural estimator for l:n. To be more precise: (25.29) says that P( I a:'i:na:/a:'l:n« - I I ) can be made as small as desired for arbitrary a: ::f. 0 by taking n large enough, since the normalization to unit length cancels in the ratio. This is therefore true in the particular case a:* = L;;'a:, and since a:�'l:n«� = 1, we are able to conclude that a:'L;;:i:nL;;'a: � 1 . We can further deduce from this fact that L;;:i:nL;;' � lp. To show this, note that if a matrix B is nonsingular, and g (a:) = a:'Ba:!a:'a: = 1 for every a: ::f. 0, g has the gradient vector g'(a:) = 2Ba:/a:'a: - 2a:/a:'a:, for any a:, and the system of equations Ba:/a:' a: - a:/a:'a: = 0 has the unique solution B = lp. If Ln is the factor ization of l:n. since Ln is asymptotically of rank p it follows by 18.10(ii) that LnL ;; � lp, and we arrive at the desired conclusion, for comparison with (25.27): n A " D N(O, lp). L -;; L Xt � (25. 30)
(pxp)
t=l
The extension to general dependence is a matter of estimating l:n by a general ization of the consistent methods discussed in §25. 1 , either i:n 1 = I,j�1Z�iZ� i where Z �i = L�(i-l )bn+ IXr, or, letting weights { Wnd represent the Bartlett kernel, n n bn (25. 3 1 ) i:n3 = :LXrX; + :L wnk L (XrX;_k + Xt-kX;). + t=l
k=l
t=k l
The latter matrix is assuredly positive definite, since a:'i:n3« � 0 with arbitrary a: by application of (25. 12) with Xr = Xt«. 25.4 Error Estimation
There are a number of celebrated theorems in the classical probability literature on the rates at which the deviations of distributions from their weak limits, and also stochastic sequences from their almost sure limits, decline with n. Most are for the independent case, and the extensions to general dependent sequences are not much researched to date. These results will not be treated in detail, but it is useful to know of their existence. For details and proofs the reader is referred to texts such as Chung ( 1974) and Loeve (1977). If {Fn } is a sequence of c.d.f.s and Fn � (the Gaussian c.d.f.), the Berry Esseen theorem sets limits on the largest deviation of Fn from . The setting for
The Central Limit Theorem
408
this result is the integer-moment case of the Liapunov CLT; see (23.37). 25.7 Berry-Esseen theorem Let {X1} be a zero-mean, independent, L3 -bounded random sequence, with variances { cr?L let s� = I,�= 1 cr?, and let Fn be the c.d.f. of Sn = 'L7= I X1 1sn. There exists a constant C > 0 such that, for all n, n sup i Fn (x) (x) l � C 2:: £ 1 Xrl 31s�. D (25 . 32) X t= l
-
The measure of distance between functions Fn and appearing on the left-hand side of (25.32) is the uniform metric (see §5.5). As was noted in §23 . 1 , convergence to the Gaussian limit can be very rapid, with favourable choice of Fn. The Berry Esseen bounds represent the 'worst case' scenario, the slowest rate of uniform convergence of the c.d.f.s to be expected over all sampling distributions having third absolute moments. For the uniformly L3-bounded case in which s� = O(n), inequality (25.32) establishes convergence at the rate n 1 12 . Another famous set of results on rates of convergence goes under the name of the law of the iterated logarithm (LIL). These results yield error bounds for the strong law of large numbers although they tell us something important about the rate of weak convergence as well. The best known is the following. 25.8 Hartman-Wintner theorem If {X1} is i.i.d. with mean j.l and variance a2, then I.7=I xr - 1-l = 1 a.s. o (25.33) limsup n--7= a(2n log log n) l/2 Notice the extraordinary delicacy of this result, being an equality. It is equiva lent to the condition that, for any £ > 0, both
P
(
and
P
I.7= 1 Xr - j.l � 1 + £, i.o. = 0, a(2n log log n) 1 12
(
)
- )
I.�= l xr - 1-l � 1 £, i.o. = 1 . a(2n log log n) 112
(25.34)
(25.35)
In words, infinitely many of these sequence coordinates will come arbitrarily close to 1, but no more than a finite number will exceed it, almost surely. By symmetry, there is a similar a.s. bound of -1 on the liminf of the sequence. Under these assumptions, n- 1 122.� = 1 (X1 - j.l)la � N(0, 1) according to the Lindeberg-Levy theorem, and so is asymptotically supported on the whole real line. On the other hand, n - 1 I.�=I (X1 - j..L)/a ----7 0 a.s. by the law of large numbers. It is clear there is a function of n, lying between 1 and n 1 12, representing the 'knife edge' between the degenerate and non-degenerate asymptotic distributions, being the smallest scaling factor which frustrates a.s. convergence on being applied to
Some Extensions
409
the sequence of means. The Hartman-Wintner law tells us that the knife-edge is 1 12 . l - 1 12 . precrse y n (2 1og 1 og n ) A feel for the precision involved can be grasped by trying some numbers: (2 log log 1099) 1 12 3.3! A check with the tabulation of the standard Gaussian probabilities will show that 3.3 is far enough into the tail that the probability of exceeding it is arbitrarily close to zero. What the LIL reveals is that for the scaled partial sums this probability is zero for some n not exceeding 1099, although not for yet larger n. Be careful to note how this is true even if the Xr variables have the whole real line as their support. For nonstationary sequences there is the following version of the LIL. 25.9 Theorem (Chung 197 4: th. 7.5 . 1) Let { Xr} be independent and L3 -bounded with variance sequence { a7 } , and let s� = I7= 1 cr7. Then (25.33) holds if for £ > 0, I7= 1 £1Xr l 3 = O(s�/( Iog Sn) 1 +e) . D ""
Generalizations to martingale differences also exist; see Stout (1974) and Hall and Heyde (1980) inter alia for further details.
VI THE FUNCTIONAL CENTRAL LIMIT THEOREM
26 Weak Convergence in Metric Spaces
26. 1 Probability Measures o n a Metric Space
In any topological space 5>, for which open and closed subsets are defined, the Borel field of 5l is defined as the smallest a-field containing the open sets (and hence also the closed sets) of 5>. In this chapter we are concerned with the properties of measurable spaces (Sl,!f), where s; is a metric space endowed with a metric d, and !f will always be taken to be the Borel field of 5>. If a probability measure Jl is defined on the elements of !f, we obtain a proba bility space ((Sl,d),!f,J..t) , and an element x E s; is referred to as a random element. As in the theory of random variables, it is often convenient to specify an under lying probability space (Q,'!F,P) and let ((Sl,d),!f,Jl) be a derived space with the prope1iy Jl(A) = P(x- \A ) ) for each A E !f, where x: � cs,d) is a measurable mapping. We shall often write 5l for (Sl,d) when the choice of metric is understood, but it is important to keep in mind that d matters in this theory, because !f is not invariant to the choice of metric; unless d1 and d2 are equivalent metrics, the open sets of (5l,d1 ) are not the same as those of (5l,d2). A property of measure spaces that is sometimes useful to assume is regularity (yet another usage of an overworked word, not to be confused with regularity of sequences etc.): (S>,J',J..t) is called a regular measure space (or Jl a regular measure with respect to (Sl,!f)) if for each A E !f and each £ > 0 there exists an open set 010 and a closed set C10 such that (26 . 1) and (26.2)
n
Happily, as the following theorem shows, this condition can be relied upon when 5l is a metric space. 26.1 Theorem On a metric space ((Sl,d),!f), every measure is regular. Proof Call a set A E !f regular if it satisfies (26. 1 ) and (26.2). The first step is to show that any closed set is regular. Let An = { x: d(A ,x) < lin}, n = 1 ,2,3, ... denote a family of open sets. (Think of A with a 'halo' of width 1/n.) When A is closed we may write A = nc;;= tAm and An -1- A as n � · By continuity of the measure this means J..L(An - A) � 0. For any £ > 0 there therefore exists N such that J..L(AN - A) < £. Choosing 010 = AN and C10 = A shows that A is regular. ""
The Functional Central Limit Theorem
414
Since § is both open and closed, it is clearly regular. If a set A is regular, so is its complement, since 0� is closed, C� is open, 0� � A c � C� and C� - 0� = Oe - Ce. If we can show that the class of regular sets is also closed under count-able unions, we will have shown that every Borel set is regular, which is the required result. Let Ar, A2 , ... be regular sets, and define A = U�= lAn. Fixing £ > 0, let One and Cne be open and closed sets respectively, satisfying (26.3)
and (26.4)
Let Oe = U�= l One, which is open, and A � Oe. Also let Ce = U�= l Cne, where the latter set is not necessarily closed, but C� = U�= l Cne where k is finite is closed, and C� � A; and since C� t Ce, continuity of the measure implies that k can be chosen large enough that Jl(Ce - C�) < £12. For such a k,
Jl(Oe - C�) $ Jl(Oe - Ce) + Jl(Ce - C�) 00
$ L Jl(One - Cne) + Jl(Ce - C�) < £. (26.5) n= l It follows that A is regular, and this completes the proof. • Often the theory of random variables has a straightforward generalization to the case of random elements. Consider the properties of mappings, for example. If (§,d) and (lf,p) are metric spaces with Borel fields !I and 'J, and f: § � lf is a function, there is a natural extension of 3.32(i), as follows. 26.2 Theorem If f is continuous, it is Borel-measurable. Proof Direct froms 5.19 and 3.22, and the fact that !I and 'J contain the open sets of § and lf respectively. • Let ((§ d), !!) and ((lf,p),'JJ be two measurable spaces, and let h: § � lf define a measurable mapping, such thatA E 'J implies that h - 1 (A) E !!; then each measure Jl on 5i has the property that Jlh- I , defined by Jlh - 1 (A) = Jl(h- \A)), A E 'J, (26.6) is a measure on ((lf,p),'J). This is just an application of 3.21, which does not use topological properties of the spaces and deals solely with the set mappings involved. However, the theory also presents some novel difficulties. A fundamental one concerns measurability. It is not always possible to assign probabilities to the Borel sets of a metric space - not, at least, without violating the axiom of choice. 26.3 Example Consider the space (D [o,I ] ,du), the case of 5.27 with a = 0 and b = 1 . Recall that each of the random elements fe specified by (5.43) are at a mntn�l rfi <:.:t�nce of 1 from one another. Hence, the spheres BCfe , 1) are all ,
Weak Convergence in Metric Spaces
415
disjoint, and any union of them is an open set (5.4). This means that the Borel field Vro, l ] on (Dr0, 1 1 ,du) contains all of these sets. Suppose we attempt to construct a probability space on ((Dro, l ] ,du), Vro, JJ) which assigns a uniform distribution to the fa, such that J.l( {fa: a e � b}) = b - a for 0 � a b � 1. Superficially this appears to be a perfectly reasonable project. The problem is formally identical to that of constructing the uniform distribution on [0, 1 ] . But there is one crucial difference: here, sets of fa functions corresponding to every subset of the interval are elements of Vro, l ] · We know that there are subsets of [0, 1] that are not Lebesgue-measurable unless the axiom of choice is violated; see 3.17. Hence, there is no consistent way of constructing the probability space ((Dro, l ] ,du),Vro, l J•J.l), where J.l assigns the uniform measure to sets of fa elements. This is merely a simple case, but any other scheme for assigning probabilities to these events would founder in a similar way. o There is no reason why we should not assign probabilities consistently to smaller a-fields which exclude such odd cases, and in the case of (Dro,n,du) the so called projection a-field will serve this purpose (see §28. 1 below for details). The point is that with spaces like this we have to move beyond the familiar intuitions of the random variable case to avoid contradictions. The space (Dr0, 1 1,du) is of course nonseparable, and nonseparability is the source of the difficulty encountered in the last example. The characteristic of a separable metric space which matters most in the present theory is the following. 26.4 Theorem In a separable metric space, there exists a countable collection V of open spheres, such that a(V) is the Borel field. Proof This is direct from 5.6, V being any collection of spheres S(x,r) where x ranges over a countable dense subset of s; and r over the positive rationals. • The possible failure of the extension of a p.m. to ('£,f/) is avoided when there is a countable set which functions as a determining class for the space. Measurabil ity difficulties on rR were avoided in Chapter 3 by sticking to the Borel sets (which are generated from countable collections of intervals, you may recall) and this dictum extends to other metric spaces so long as they are separable. Another situation where separability is a useful property is the construction of product spaces. In §3 .4 some aspects of measures on product spaces were discussed, but we can now extend the theory in the light of the additional structure contributed by the product topology. Let and ("IT" /J) be a pair of measurable topological spaces, with !I and <J the respective Borel fields. If 'R denotes the set of open rectangles of s; and !I ® <J = a('R), we have the following result. 26.5 Theorem If s; and "IT" are separable spaces, !I ® <J is the Borel field of with the product topology. Proof Under the product topology, 'R is a base for the open sets (see §6.5). Since s; "IT" is separable by 6.16, any open set of s; "IT" can be generated as a countable union of 'R-sets. It follows that any a-field containing 'R also contains the open sets of s; and in particular, !I ® <J contains the Borel field. Since the sets
<
x"U",
x
x"U",
<
(5>,:1)
5> x"U"
x
The Functional Central Limit Theorem
416
of � are open, it is also true that any a-field containing the open sets of s; x T also contains � ' and it follows likewise that the Borel field contains !I® <J . • If either s; or T are nonseparable, the last result does not generally hold. A counter-example is easily exhibited. 26.6 Example Consider the space (D[O, I J X D[o, 1 1 , Pu), where Pv is the max metric defined by (6. 1 3) with du for each of the componentmetrics. Let E denote the union of the open balls B( (x9,y9), 1) over 8 E [0, 1 ] , where xe and Ye are func tions of the form fe in (5.43). In this metric the sets B((xe,Ye), 1) are mutually disjoint rectangles, of which E is the uncountable union; if � denotes the open rectangles of (D[o, l ] x D[o, 1 1 , P u), E e a(�), even though E is in the Borel field of D[o, l ] x D[o, 1 1 , being an open set. o The importance of this last result is shown by the following case. Given a proba bility space (.Q,� ,P), let x and y be random elements of derived probability spaces ((Si,d),!f,!lx) and ((Si,d),!f,!ly). Implicitly, the pair (x,y) can always be thought of as a random element of a product space of which the llx and llY are the marginal measures. Since x and y are points in the same metric space, for given ro E Q a distance d(x(ro),y(ro)) is a well-defined non-negative real number. The question of obvious interest is whether d is also a measurable function on (.Q,�). This we can answer as follows. 26.7 Theorem If (Si,d) is a separable space, d(x,y) is a random variable. Proof The inverse image of a rectangle A x B under the mapping (x,y) : f--7 s; x§ lies in � ' being the intersection of the �-sets x- 1 (A) and y 1 (B) . The mapping is therefore �/!! ® !/-measurable by 3.22. But under separability, !I ® !I is the Borel field of § x § according to 26.5. Hence (x,y)(ro) (x(w),y(ro)) is a �/Borel measurable random element of§ x §. If the spaces; x s; is endowed with the product topology, the function d: § X § f--7 IR+ is continuous by construction, and this mapping is also Borel-measurable. The composite mapping (x,y) od: .Q f--7 1R is therefore �/:B-measurable, and the theorem follows. •
n
-
=
26.2 Measures and Expectations
As well as taking care to avoid measurability problems, we must learn to do with out various analytical tools which proved fundamental in the study of random variables, in particular the c.d.f. and ch.f. as representations of the distrib ution. These handy constructions are available only for r.v.s. However, if U5 is the set of bounded, uniformly continuous real functions f: s; f--7 IR, the expect ations
Weak Convergence in Metric Spaces
417 (26.7)
are always well defined. (From now on, the domain of integration will be understood to be s; unless otherwise specified.) The theory makes use of this family of expectations to fingerprint a distrib ution uniquely, a device that works regardless of the nature of the underlying space. While there is no single all-purpose function that will do this job, like eiAX in the case X E IR , the expectations in (26. 7) play a role in this theory analogous to that of the ch.f. in the earlier theory. As a preliminary, we give here a pair of lemmas which establish the unique represention of a measure on (Sl,:f) in terms of expectations of real functions on Sl. The first establishes the uniqueness of the representation by integrals. 26.8 Lemma If J.l and v are measures on ((Sl,d),:f) (:f the Borel field), and
ffdJ.l f dv, all =
f
fE
Us,,
(26.8)
then J.l = v . Proof We show that Us, contains an element for which (26.8) directly yields the conclusion. Let B E :f be closed, and define Bn = { x: d(x,B) < 1/n } . Think of Bn as B with an open halo of width 1/n. Bn -l- B as n --7 oo, B and B� are closed and mutu ally disjoint, and infxe BJ;,ye sd(x,y) :2 1/n for each n. Let gs:;,s E U'£ be a separating function such that g8:;,8 (x) = 0 for x E B� and 1 for x E B (see 6.13). Then j.l(B)
:::;
fgs:;,sdJ.l fgs:;,sdv fBngs:;,sdv :::; v(Bn). =
=
(26.9)
where the last inequality is because g8:;,8 (x ) :::; 1 . Letting n --7 oo, we have J.l(B) :::; v(B). But J.l and v can be interchanged, so J.l(B ) = v(B). This holds for all closed sets, which form a determining class for the space, so the theorem follows. • Since Us, c C'£, the set of all continuous functions on Sl, this result remains true if we substitute Cs, for Us, ; the point is that Us, is the smallest class of general functions for which it holds, by virtue of the fact that it contains the required separating function for each closed set. The second result, although intuitively very plausible, is considerably deeper. Given a p.m. J.l on a space Sl, define A(f) = ffdJ.l for f E Us,. We know that A is a functional on Us, with the following properties: (26. 10) f(x) :2 0, all X E Sl =::} A(f) :2 0 (26. 1 1) f(x) = 1, all X E Sl =::} A(f) = 1 (26. 12) A(afl + bh) = aA(fi) + bA(/2), !1> h E U, a,b E IR, where (26. 1 1) holds since fdJ.l = 1 , and (26. 12) is the linearity property of integrals. The following lemma states that on compact spaces the implication also
The Functional Central Limit Theorem
418
runs the other way. 26.9 Lemma Let be a compact metric space, and let A Us -7 1R define a func tional satisfying (26. 10)--(26. 12). There exists a unique p.m. on satisfy ing = A ( ) , each E Us. o In other words, functionals A and measures are uniquely paired. At a later stage we use this result to establish the existence of a measure (the limit of a sequence) by exhibiting the corresponding A functional. We shall not attempt to give a proof of this result here; see Parthasarathy (1967: ch. 2.5) for the details. Note that because is compact, Us and Cs coincide here; see 5.21.
ffdll f
5>
(f): ll (5>,Jl)
f
ll
5>
26.3 Weak Convergence
Consider IM, the set of all probability measures on ((5>,d),!f). As a matter of fact, we can extend our results to cover the set of all finite measures, and there are a couple of cases in the sequel where we shall want to apply the results of this chapter to measures where "# 1 . However, the modifications required for the extension are trivial. It is helpful in the proofs to have an agreed normaliz ation, and = 1 is as good as any, so let 1M be the p.m.s, while keeping the possibility of generalization in mind. Weak convergence concerns the properties of sequences in IM, and it is mathemati cally convenient to approach this problem by treating 1M as a topological space. The natural means of doing this is to define a collection of real-valued functions on IM, and adopt the weak topology that they induce. And in view of (26.7), a natural class to consider are the integrals of bounded, continuous real-valued functions with respect to the elements of IM. For a point E IM, define the base sets
ll fdll
fdll
f
ll
v,.tCk, t ... ,fk,£)
fi
,
=
{
v: v E IM,
J Jfidv-Jtidll/
< £,
i
=
}
1, ... ,k
(26. 1 3)
,llfk·
where E Us for each i, and £ > 0. By ranging over all the possible ft , ... and £, for each k E IN , (26. 1 3) defines a collection of open neighbourhoods of The base collection V11(k,ft, . . ,fk,£), E IM, defines the weak topology on IM. The idea is that two measures are close to one another when the expectations of various elements of Us are close to one another. The more functions this applies to, and the closer they are, the closer are the measures. This is not the conse quence of some more fundamental notion of closeness, but is the defining property itself. This simple yet remarkable application illustrates the power of the topo logical ideas developed in Chapter 6. The weak topology is the basic trick which allows distributions on general metric spaces to be handled by a single theory. Given a concept of closeness, we have immediately a companion concept of convergence. A sequence of measures n E IN } is said to converge in the weak topology, or converge weakly, to a limit written => if, for every neigh bourhood V11, 3 N such that E V11 for all n � N. If is a random element from a probability space and => !l, we shall say that converges in distri Essenhntion to and write xM � x. where x is a random element from
.
x
ll n (5>,Jl,!ln), lln
ll
{!ln!l,,
Xnlln Xn!l, (5>,Jl,!l).
Weak Convergence in Metric Spaces
419
tially, the same caveats noted in §22. 1 apply in the use of this terminology. The following theorem shows that there are several ways to characterize weak convergence. 26.10 Theorem The following conditions are equivalent to one another: (a) �n � �(b) ffdJln -c) ffd� for every f E U§. (c) limsupn�n(C) s �(C) for every closed set C E !f. (d) liminfn�n(B) � �(B) for every open set B E !f. (e) limn�n(A) = �(A) for every A E !f for which �(oA) = 0. o The equivalence of (a) and (b), and of (a) and (e), were proved for the case of measures on the line as 22.8 and 22.1 respectively; in that case weak convergence was identified with the convergence of the sequence of c.d.f.s, but this charac terization has no counterpart here. A noteworthy consequence of the theorem is the fact that the sets (26. 1 3) are not the only way to generate the topology of weak convergence. The alternative corresponding to part (e) of the theorem, for exam ple, is the system of neighbourhoods, V�(k,A 1 , . . . ,Ak>£)
=
{v: v E
[M,
j vCAi) - �(AD j < £, i
=
l,
. . ,k},
(26. 14)
where Ai E !f, i = l, ... ,k and �(()Ai) = 0. Proof of 26.10 This theorem is proved by showing the circular set of implications, (a) � (b) � (c) � (c),(d) � (e) � (a). The first is by definition. To show that (b) � (c), we can use the device of 26.8; let B be any closed set in !f, and put Bm = {x: d(x,B) < 1 /m } , so that B and B� are closed and inf B;. ,y B d(x,y ) � lim. Letting gs;.,B E u$ be the separating function defined above (26.9), we have xE
f
f
E
f
limsup Jln(B) s limsup gs;.,sd�n = gs;.,sd)l = B gs;.,sd� s �(Bm), (26. 1 5) m n---')oo n---')oo where the first equality is by (b). (c) now follows on letting m -c) oo . (c) � (d) is immediate since every closed set is the complement of an open set relative to and �(Sl) 1 . To show (c) and (d) � (e): for any A E !f, A0 c A � A, where A0 is open and A is closed, and ()A = A - A0• From (c), limsup Jln(A) s limsup �n (A) s J.L(A) = �(A), (26. 1 6) n---')oo and from (d), (26. 1 7) liminf�n(A) � liminf�n(A0) � �(A0) = �(A),
5>,
=
hence limn�n(A) �(A). The one relatively tricky step is to show (e) � (a). Let f E U§, and define (what is easily verified to be) a measure, �' on the real line (IR,15) by =
The Functional Central Limit Theorem
420
(26 . 1 8) 1-t'CB) = !l({x: f(x) E B}), B E 'B. f is bounded, so there exists an interval (a,b) such that a < f(x) < b, all x E Sl. Recall that a distribution on (lR ,'B) has at most a countable number of atoms. Also, a finite interval can be divided into a finite collection of disjoint subintervals of width not exceeding £, for any £ > 0. Therefore it is possible to choose m points t1, with a = to < t1 < ... < tm = b, such that t1 - tJ-1 < £, and 1-t'C { t1}) = 0 , for each j. Use these to construct a simple r.v. m (26. 19) gm(X) = 2)}- dAjCx), }= I
where AJ = {x: fJ- 1 :::::;; f(x) :::::;; fJ } , and note that supxl f(x) - gm(x) I
< £.
[ ffd!Jn - Jfd!l l :::::;; J l f - gm l d!ln + J l f - gm l d!l + [ fgmd!ln - fgmd!l l :::::;; 2£
m + L I tj- l l l !ln(Aj) - !l(Aj l ·
Thus,
(26.20)
}= I
Since !l(dA1) (e),
=
0
by the choice of t;, so that limn!ln(Aj) li:n��p
I ffdJ.ln - Ifd!l l
:::::;; 2£.
=
!l(Aj), for each j by (26.2 1 )
Since £ can be chosen arbitrarily small, (a) follows and the proof is complete. • A convergence-determining class for (Sl,:f) is a class of sets 'U c :.t which satisfy the following condition: if !ln(A) ---7 !l(A) for every A E 'U with !l(dA) = 0, then lln => J..l . This notion may be helpful for establishing weak convergence in cases where the conditions of 26.10 are difficult to show directly. The following theorem is just such an example. 26.11 Theorem If 'U is a class of sets which is closed under finite intersections, and such that every open set is a finite or countable union of 'U-sets, then 'U is con vergence-determining. Proof We first show that the measures lln converge for a finite union of 'U-sets A 1 , ,Am . Applying the inclusion-exclusion formula (3 .4), • • .
(26.22)
where the sets Ck consist of the AJ and all their mutual intersections and hence are in 11 whenever the A1 are, and '±' indicates that the sign of the term is given in accordance with (3 .4). By hypothesis, therefore, (26.23 )
Weak Convergence in Metric Spaces
421
To extend this result to a countable union B = U}= IAj , note that continuity of � implies �(U} = IAj) t �(B) as m -7 oo, so for any £ > 0 a finite m may be chosen large enough that �(B) - �(Uj= IAj) < £. Then
(LJ ) (LJ )
(26.24) liminf �n(B) � liminf �n Aj = � Aj > �(B) - £. n �oo n �oo �= I :t= l Since £ is arbitrary and (26.24) holds for any open B E !f by hypothesis on 'U, condition (d) of 26.10 is satisfied. • A convergence-determining class must also be a determining class for the space (see §3.2). But caution is necessary since the converse does not hold, as the following counter-example given by Billingsley (1968) shows. 26.12 Example Consider the family of p.m.s {�n } on the half-open unit interval [0 , 1) with �n assigning unit measure to the singleton set { 1 - 1/n } . That is, �n( { 1 - 1/n}) = 1 . Evidently, { �n } does not have a weak limit. The collection b' of half-open intervals [a ,b) for 0 < a < b < 1 generate the Borel field of [0 , 1 ) , and so are a determining class. But �n([a,b)) -7 0 for every fixed a > 0 and b < 1 , and the p.m. � for which �( { 0}) = 1 has the property that �([a,b)) = 0 for all a > 0. It is therefore valid to write (26.25) �n(A) -7 �(A), all A E b', even though �n � � in this case, so b' is not convergence-determining. o The last topic we need to consider in this section is the preservation of weak convergence under mappings from one metric space to another. Since �n => � means ffdllr. -7 ffd� for any f E U$, it is clear, since j oh E U$ when h is continuous, that ff(h(x))dllr.(x) -7 ff(h(x))d�(x). Writing y for h(x), we have the result
fJ(y)dJ..!nh- I (y) -7 Jf(y)d�h- I (y).
(26.26)
So much is direct, and relatively trivial. But what we can also show, and is often much more useful, is that mappings that are 'almost' continuous have the same property. This is the continuous mapping theorem proper, the desired generaliz ation of 22.11. 26.13 Continuous mapping theorem Let h: S H T be a measurable function, and let Dh c $ be the set of discontinuity points of h. If �n => � and �(Dh) = 0, then �nh- I => �h - I . Proof Let C be a closed subset of T. Recalling that (A) - denotes the closure of A, limsup �nh - 1 (C) = limsup �n(h - I ( C)) =::; limsup �n((h - I ( C)) -) n �oo =::; �((h - 1 (C)) -) =::; �(h - 1 (C) u Dh) =::;
�(h - l (C)) + �(Dh)
=
�(h- l (C))
=
�h - I (C),
(26.27)
422
The Functional Central Limit Theorem
noting for the third inequality that (h- 1 (C))- k h- 1 (C) u Dh; i.e., a closure point of h- 1 (C) is either in h- 1 (C), or is not a continuity point of h. The second inequality is by 26.10(c), and the conclusion follows similarly. • 26.4 Metrizing the S pace of Measures
We can now outline the strategy for determining the weak limit of a sequence of measures {�n} on (£,:!). The problem falls into two parts. One of these is to determine the limits of the sequences {�n(A) } for each A E t:i', where t:i' is a deter mining class for the space. This part of the programme is specific to the particu lar space under consideration. The other part, which is quite general, is to verify conditions under which the sequence of measures as a whole has a weak limit. Without this reassurance, the convergence of measures of elements of t:i' is not generally sufficient to ensure that the extensions to :f also converge. It is this second aspect of the problem that we focus on here. It is sufficient if every sequence of measures on the space is shown to have a cluster point. If a subsequence converges to a limit, this must agree with the unique ordinary limit we have (by assumption) established for the determining class. Our goal is achieved by finding conditions under which the relevant topo logical space of measures is sequentially compact (see §6.2). This is similar to what Billingsley (1968) calls 'relative' compactness, and the required results can be derived in his framework. However, we shall follow Prokhorov (1956) and Parthasarathy ( 1967) in making 1M a metric space which will under appropriate circumstances be compact. The following theorem shows that this project is feasi ble; the basic idea is an application of the embedding theorem (6.20/6.22). 26.14 Theorem (Parthasarathy 1967: th. ll.6.2) If and only if ('5,d) is separable, 1M can be metrized as a separable space and embedded in [0, 1 ]=. Proof Assume ('5,d) is separable. The first task is to show that Us is also separa ble. According to 6.22, '5 can be metrized as a totally bounded space ('5,d') where d' is equivalent to d. Let S denote the completion of '5 under d' (including the limits of all Cauchy sequences on '5) and then S is a compact space (5.12). The space ofcontinuous functions Cs; is accordingly separable under the uniform metric (5.26(ii)). Now, every continuous function on a compact set is also uniformly continuous (5.21), so that Us; = Cs;. Moreover, the spaces Cs; and Us are isometric (see §5.5) and if the former is separable so is the latter. Let {gm, m E lN } be a dense subset of U'£, and define the mapping T: 1M -7 IR= by (26.28) The object is to show that T embeds 1M in IR=. Suppose T(�) = T(v), so that fgmd� = fgmdv for all m. Since {gm } is dense in u'£, f E u'£ implies that
I ffd� -fgmd� I :5 f I f - gm I d� ::; du(f,&n)
<
£
(26.29)
Weak Convergence in Metric Spaces
423
for some m, and every £ > 0. (The second inequality is because fdJ.l = 1 , note.) The same inequalities hold for v, and hence we may say that ffdJ.l = ffdv for all f E U5, . It follows by 26.8 that Jl = v , so T is 1 - 1 . Continuity of T follows from the equivalence of (a) and (b) in 26.10. To show that T - 1 is continuous, let { Jln } be a sequence of measures and assume T(Jln) ----7 T(Jl). For f E U5, and any m � 1 ,
I ffd� - ffdJ.l l
=
I f(f - gm)dJ.ln + f(g + fg fg l g + I fgmdJ.ln - fg l · m - f)dJ.l
:S: 2du(f,
m
mdJ.ln -
mdJ.l
mdJ.l
)
(26.30)
Since the second term of the majorant side converges to zero by assumption, limsupn
If f l fd� - fdJ.l
:S:
2du(f,&n ) < 2£
(26 .3 1 )
for some m, and £ > 0, by the right-hand inequality of (26.29). Hence limn l ffd� - ffdJ.l i = 0, and lln ==> Jl by 26.10(b). We have therefore shown that 1M is homeomorphic with the set T(IM) � IR"", and IR"" is homeomorphic to [0, 1 ]"" as noted in 5.22. The distance d"" between the images of points of 1M under T defines a metric on 1M which induces the weak topology. The space T(IM) with the product topology is separable (see 6.16), so applying 6.9(i) to T - 1 yields the result that 1M is separable. This completes the sufficiency part of the proof. The necessity part requires a lemma, which will be needed again later on. Let Px E 1M be the degenerate p.m. with unit mass at x, that is, px( { x } ) = 1 and Px('5 - { x } ) = 0, and so let D = {px : x E '5 } � IM. 26.15 Lemma The topological spaces '5 and D are homeomorphic.
'5 1---7 D taking points x E '5 to points Px E D is clearly 1 - 1 , onto. For f E C5,, ffdpx = f(x), and Xn ----7 x implies f(.xn) ----7 f(x) and hence Pxn ==> Px by 26.10, establishing continuity of p. Conversely, suppose Xn A x. There is then an open set A containing x, such that for every N E IN, Xn E '5 - A for some n � N. Let f be a separating function such that f(x) = 0, f(y) = 1 for y E '5 - A, and 0 :S: f :S: 1. Then ffdPxn = 1 and ffdPx = 0, so Pxn � Px· This establishes Proof The mapping p:
continuity of p - 1 , and p is a homeomorphism, as required.
•
Proof of 26.14, continued Now suppose 1M is a separable metric space. It can be embedded in a subset of [0, 1 ]"" , and the subsets of 1M are homeomorphic to their images in [0, 1 ]"" under the embedding, which are separable sets, and hence are
themselves separable (again, by 6.16 and 6.9(i)). Since D c IM, D is separable and hence '5 must be separable since it is homeomorphic to D by 26.15. This proves necessity. •
The last theorem showed that 1M is metrizable, but did not exhibit a specific metric on IM. Note that different collections of functions {gm } yield different metrics, given how d= is defined. Another aooroach to the oroblem is to construct
The Functional Central Limit Theorem
424
such a metric directly, and one such was proposed by Prokhorov ( 1 956). For a setA e !1, define the open set A li = {x: d(x,A) < 3}, that is, 'A with a 3-halo' . The Prokhorov distance between measures !l, v e IM, is L(!l,V) inf{3 > 0: !l(Aii) + 3 ;;::: v(A), all A e !!}. (26.32) Since !f contains complements and !l(Sl) = v(Sl) = 1 , it must be the case, unless !l = v, that !l(A) ;;::: v(A) for some sets A e !1, and !l(A) < v(A) for others. The idea of the Prokhorov distance is to focus on the latter cases, and see how much has to be added to both the sets and their !l-measures, to reverse all the inequalities. When the measures are close this amount should be small, but you might like to convince yourself that both the adjustments are necessary to get the desired properties. As we show below, L is a metric, and hence is symmetric in !l and v. The properties are most easily appreciated in the case of measures on the real line, in which case the metric has the representation in terms of the c.d.f.s, L*(F1 ,F2) = inf { <5 > 0 : F2(x - <5) <5 � F1 (x) � F2 (x+ 3) + <5, V x e [R } , (26.33) for c.d.f.s F 1 and F2 . This is also known as Levy 's metric. =
-
Fig.
26. 1
Fig. 26. 1 sketches this case, and F2 has been given a discontinuity, so that the form of the bounding functions F2 (x + <5) + <5 and F2(x - <5) - 3 can be easily discerned. Any c.d.f. lying wholly within the region defined by these extremes, such as the one shown, is within () of F2 in the L* metric. 26.16 Theorem L is a metric. Proof L(!l,V) = L(V,!l) is not obvious from the definition; but for any 3 > 0 consider B = (A ii)c. If x e A, then d(x,y) 2 () for each y e B, whereas if x e Bli, d(x,y) <5 for some y e B; or in other words, Bli = A c If L(!l,V) � <5, then !l(A c) + 3 = !l(B 0) + () ;;::: v(B). (26.34) Subtracting both sides of (26.34) from 1 gives !l(A) � v(Bc) + () = v(Ali) + 8, (26.35)
<
.
Weak Convergence in Metric Spaces
425
and hence L(V,Il) � B. This means there is no B for which L(!l,V) > B ;::: L(V,!l), nor, by symmetry, for which L(V,Il) > B ;::: L(!l,V), and equality follows. It is immediate that L(!l,V) = 0 if ll = v. To show the converse holds, note that n n if L(!l,V) = 0, ll(A Il ) + lin ;::: v(A) for A E !f, and any n E IN. If A is closed, A Il n .J, A as n ---7 By continuity of 11, ll(A) = limn(ll(A li ) + lin) ;::: v(A), and by n symmetry, v(A) = limn(v(A li ) + lin) ;::: ll(A) likewise. It follows that ll(A) = v(A) for all closed A. Since the closed sets are a determining class, ll v. Finally, for measures 11, v, and 't let L(!l,V) = B and L(V,'t) = ll · Then for any A E !f, ll(A) � v(A8) + B � 't((A 8Y1 ) + B + ll � 't(A&+11) + B + ll , (26.36)
oo
.
=
where the last inequality holds because (A8)11 = {x: d(x,A 8) < ll } � {x: d(x,A ) < B + ll) }
=
A&+11 ,
(26.37)
the inclusion being valid since d satisfies the triangle inequality. Hence L(!l,'t) � B + ll = L(!l,V) + L(V,'t). • We can also show that L induces the topology of weak convergence. 26.17 Theorem If {lln} is a sequence of measures in IM , lln L(lln ,ll) ---7 0 .
===>
ll if and only if
show 'if , suppose L(lln•ll) ---7 0. For each closed set A E !f, limsupnlln(A) � ll(A 8) + B for every B > 0, and hence, letting B .J, 0, limsupnlln(A) � ll(A) by continuity. Weak convergence follows by (c) of 26.10. To show 'only if , consider for A E !f and fixed B the bounded function Proof To
fA (x) Note that fA (x) Since
=
=
{
}
d(x,A)
(26.38)
max 0, 1 - -- . B
1 for x E A, 0 < fA (x) � 1 for x E A 8, and fA (x)
=
0 for x � A 8 .
(26.39) independent of A, the family { fA , A E !f } is uniformly equicontinuous (see §5.5) and so is a subset of U5 . If lln ===> 11, then by 26.10(b), ,1n
=
sup
AE 9'
I ffAdlln - f Ad l.
(26.40)
f ll ---7 0.
Hence, n can be chosen large enough that ,1n � B, for any B > 0. For this n or larger,
f
f
f
lln(A) � fAdlln � fAdll + ,1n � fAdll + B � ll(A 8) + B, or, equivalently, L(lln,ll) � B. It follows that L(lln.ll)
---7
0.
•
(26.41)
It i s possible to establish the theory of convergence o n 1M by working explicitly
426
The Functional Central Limit Theorem
in the metric space (lM,L). However, we will follow the approach of Varadarajan (1958), of working in the equivalent space derived in 26.14. The treatment in this section and the following one draws principally on Parthasarathy (1967). The Prokhorov metric has an application in a different context, in §28.5. The next theorem leads on from 26.14 by answering the crucial question - when is 1M compact? 26.18 Theorem (Parthasarathy 1967: th. II.6.4) 1M is compact if and only if § is compact. Proof First, let § be compact, and recall that in this case Cs = U5 (5.21), and C5 is separable (5.26(ii)). For simplicity of notation write just C for C5, and write 0 for that element of C which takes the value 0 everywhere in §. Let Sc(0, 1) denote the closed unit sphere around 0 in C, such that sup1 l f(t) I ::;; 1 for all f E Sc(O, 1 ), and let {gm , m E IN } be a sequence that is dense in Sc(O, 1 ). For this sequence of functions, the map T defined in (26.28) is a homeomorphism taking 1M into T(IM), a subset of the compact space [ - 1 , 1 r. This follows by the argument used in 26.14. It must be shown that T(IM) is closed and therefore compact. Let {�n } be a sequence of measures in 1M such that T(�n) -7 y E [ -1 , 1r. What we have to show to prove sufficiency is that y E T(lM). Since the mapping T - 1 onto 1M is continuous, this would imply (6.9(ii)) that 1M itself is compact. Write An(f) = fjd�, and note that, since I ffd� I ::;; sup1 l f(t) I ::;; 1 , this defines a functional An(f): Sc(O, l) t--7 [- 1 , 1] . (26.42) In this notation we have T(�n) = (An(gt), An(g2), ... ). Since Sc(O, 1 ) is compact and {gm } is dense in it, we can choose for every f E Sc(0, 1) a subsequence {gmk ' k E IN } converging to f. Then, as in (26.30), (26.43) The second term of the majorant side contains a coordinate of T(J.ln) - T(J.ln') and converges to 0 as n and n' --7 oo by assumption. Letting k --7 =, we obtain, as in (26.31 ), (26.44) lim I An (f) - 1\z, (f) I = 0. n,n'�oo This says that {An } is a Cauchy sequence of real functionals on [- 1 , 1], and so must have a limit A; in particular, y = (A(g 1 ), A(g2), ... ). It is easy to verify that each An(f), and hence also A(f), satisfy conditions (26.10)-{26. 12) for f E Sc(0,1 ). Since for every f E C there is a constant c > 0 such that cf e Sc(0, 1), we may further say, by (26. 12), that A(f) = cA*(f/c) where A * (.) is a functional on C which must also satisfy (26. 10)-(26.12). From 26.9, there exists a unique J.l E 1M such that A* (f)= ffd�, fe C. Hence, we may write y = T(Jl). It follows that T(IM) contains its limit points, and being also bounded is compact; and since T - 1 is a homeomorphism, 1M is also compact. This completes the proof of sufficiency.
Weak Convergence in Metric Spaces
427
To prove necessity, consider D = {px : x E S> } � IM, the set shown to be homeo morphic to S> in 26.15. If D is compact, then so is D is totally bounded when 1M is compact, so by 5.12 it suffices to show completeness. Every sequence in D is the image of a sequence { Xn E S> } , and can be written as {Pxn } , so suppose Pxn ==> q E IM . If Xn � x E Sl, then q = Px E D by 26.15, so it suffices to show that Xn p x is impossible. The possibility that { Xn } has two or more distinct cluster points in 5l is ruled out by the assumption Pxn ==> q, so Xn P x means that the sequence has no cluster points in Sl. We assume this, and obtain a contradiction. Let E {xt,x2 , ... } � 5l be the set of the sequence coordinates, and let E 1 be any infinite subset of E. If the sequence has no cluster points, every point y E E1 is isolated, in that Et n S(y, E) - {y } is empty for some E > 0. Otherwise, there would have to exist a sequence {Yn E Ed such that Yn E S(y, lin) for every n, and y would be a cluster point of {xn } contrary to assumption. A set containing only isolated points is closed, so E1 is closed and, by 26.10(c),
5>.
=
(26.45) where the equality must obtain since Et contains Xn for some n � N, for every N E [N . Since q E !M, this has to mean q(E1 ) = 1 . But clearly we can choose another subset from E, say E2, such that E 1 and E2 are disjoint, and the same logic would give q(E2) = 1 . This is impossible. The contradiction is shown, concluding the proof. • 26.5 Tightnes s and Convergence
In §22.5 we met the idea of a tight probability measure, as one whose mass is concentrated on a compact subset of the sample space. Formally, a measure 1-1 on a space (S>,!f) is said to be tight if, for every E > 0, there exists a compact set K£ E !f such that j.l(K�) � E. Let TI � 1M denote any family of measures. The family n is said to be uniformly tight if supJlE rrj.l(K�) ::::; E. Tightness is a property of general measures, although we shall concentrate here on the case of p.m.s. In the applications below, TI typically represents the sequence of p.m.s associated with a stochastic sequence {Xn}l'. If a p.m. 1-1 is tight, then of course j.l(K£) > 1 - E for compact K£. In §22.5 uniform tightness of a sequence of p.m.s on the line was shown to be a necessary condition for weak convergence of the sequence, and here we shall obtain the same result for any metric space that is separable and complete. The first result needed is the following. 26.19 Theorem (Parthasarathy 1967: th. 11.3.2) When 5l is separable and complete, every p.m. on the space is tight. o Notice, this proves the earlier assertion that every measure on � .�) is tight, given that IR is a separable, complete space. Another lemma is needed for the proof, and also subsequently.
The Functional Central Limit Theorem
428
26.20
Lemma
Let S be a complete space, and let (26.46)
where Sni is a sphere of radius lin in S, Sni is its closure, and in is a finite integer for each n. Then K is compact. Proof Being covered by a finite collection of the Sni for each n, K is totally bounded. If { Xj, i E IN } is a Cauchy sequence in K, completeness of S implies that Xj � x E S. For each n, since K c U{� 1 Sni• infinitely many of the sequence coor dinates must lie in Kn = K n Snk for some k, 1 � k � in · Since Snk has radius 1/n, taking n to the limit leads to the conclusion that nnKn = {x}, and hence, X E K; K is therefore complete, and the lemma follows by 5.12. • Proof of 26.19 By separability, a covering of S by 1/n-balls Sn = S(x, 1/n), x E S, has a countable subcover, say {Sn i i E IN } , for each n = 1 ,2, ....nFix n. For any E > 0 there must exist in large enough that Jl(An) � 1 - El2 , where An = U{� ! Sni; otherwise we would have Jl(Ui'= I Sn i) = Jl(S) < 1 - El2n , which is a contradiction since Jl is a p.m. Given E, choose An in this manner for each n and let Ke = n;= IAn where An = U{�1Sni• note. Then Ke is compact by 26.20. Further, since (26.47)
n= l
n=l
(26.48)
or, in other words, Jl(Ke) > 1 - E. • Before moving on, note that the promised proof of 12.6 can be obtained as a corollary of 26.19. 26.21 Corollary Let (S,Jl,Jl) be a separable complete probability space. For any E E Jl, there is for any E > 0 a compact subset K of E such that Jl(E - K) < E. Proof Let the compact set d E Jl satisfy Jl(d) > 1 - El2, as is possible by 26.19, and let (d,Jl�. Jl�) denote the trace of (S,Jl,Jl) on d. This is a compact space, such that every set in Jl� is totally bounded. By regularity of the measure (26.1) there exists for any A E Jl� an open set A' ::J A such that Jl�(A' - A) < E /2 . Moving to the complements, A'c is a closed, and hence compact, set contained in Ac. But A c - A'c = A' - A, and Jl(A c - A'c) = Jl�(A c -A'c)Jl(d) < El2. Now for any set E E Jl let A = (E n dl, and let K = A'c, and this argument shows that there is a compact subset K of E n d (and hence of E) such that Jl((E n d) - K) < fi2. Since Jl(E n de) � Jl(dc) � fi2, Jl(E - K) < E, as required. •
Weak Convergence in Metric Spaces
429
k Lemma 12.6 follows from this result on noting that IR is a separable complete space. Theorem 26.19 tells us that on a separable complete space, every measure Jln of a sequence is tight. It remains to be established whether the same property applies to the weak limit of any such sequence. Here the reader should review examples 22.19 and 22.20 to appreciate how this need not be the case. The next theorem is a partial parallel of 22.22, although the latter result goes further in giving sufficient conditions for a weak limit to exist. Here we merely establish the possibility of weak convergence, via an application of theorems 5.10 and 5.11, by showing the link between uniform tightness and compactness. 26.22 Theorem (Parthasarathy 1967: th. II.6.7) Let (S,d) be a separable complete space, and let I1 � 1M be a family of p.m.s on (S,!f). TI is compact if and only if it is uniformly tight.
(S,d) is separable, it is homeomorphic to a subset of [0, 1 ]00, by 6.22. Accordingly, there exists a metric d' equivalent to d such that (S,d') is rela Proof Since
tively compact. In this metric, let § be a compact space containing '5, and let !f be the Borel field on § . We cannot assume that '5, E '!1, but !f, the Borel field of '£, is the trace of !f on '£. Define a family of measures fi on § such that, for jl E fi, jl(A) = Jl(A n '£), Jl E TI, for each A E !f. To prove that I1 is compact, we show that a sequence of measures {Jln, n E [N } from I1 has a cluster point in TI. Consider the counterpart sequence { Jln, n E [N } in fi. Since § is compact, fi is compact by 26.18, so this sequence has one or more cluster points in fi. Let v be such a cluster point. The object is to show that there exists a p.m. Jl E I1 such that J1 = v. Tightness of I1 means that for every integer r there is a compact set Kr c '5, such that Jl(Kr) � 1 1/r, for all Jl E TI. Being closed in § , Kr E !f and jl(Kr) = f.l(Kr n '5,) = 1-l(Kr), all !-1 E Il. Since Kr is closed we have for some subsequence { nh k E IN } -
(26.49) by 26.10(c). Since UrKr E '!1, we have in particular that v(UrKr) = 1 . Now, suppose we let v*('S,) denote the outer measure of 'S, in terms of coverings by !f-sets. Since UrKr c '£, we must have v*(S) � v*(UrKr) = v(UrKr) = 1 . Applying 3.10, note that '5, is v-measurable since the inequality in (3. 19) becomes
(26.50) which holds for all B � § . Since !f is the trace of !f on S, all the sets of !f are v*(B n '£) � v*(B),
accordingly v-measurable and there exists a p.m. Jl E I1 such Athat J1 = v, as required. For any closed subset C of '£, there exists a closed D � '5, such that C = D n '£, the assertions limsupkJlnk(D) � jl(D) and limsupkJln/C) � J.L(C) are equiv alent, and hence, by 26.10, Jlnk ::::::> Jl. This means that { J.!n } has a convergent sub sequence, proving sufficiency. Notice that completeness of '5, is not needed for this part of the proof. To prove necessitv. assume I1 is comoact. Letti n g f S_,_ i E IN 1 he a r.onntahlP.
430
The Functional Central Limit Theorem
covering of § by 1/n-spheres, and Un, n E iJ'.J } any increasing subsequence of inte gers, define An = U{�1Sni · We show first that the assumption, ::3 J.l E II such that, for 8 > 0, (26.51) leads to a contradiction, and so has to be false. If (26.51) is true for at least one element of (compact) II, there is a convergent sequence { J.lkl k E iJ'.J } in II, with J.lk ==:} J.l, such that it holds for all J.lk· (Even if there is only one such element, we can put J.lk = J.l, all k). Fix m. By 26.10, J.l(An) � limsup J.lk(An) � 1 - 8.
(26.52)
k-7oo
Letting n � oo yields J.l(§) � 1 8, which is a contradiction. Putting 8 = El2n , we may therefore assert n J.l(An) > 1 - El2 , all n, all J.l E II. (26.53) Letting Ke = n�=1An , this set is compact by 26.20 (§ being complete) and it follows as in (26.48) above that J.l(Ke) > 1 - E. Since J.l is an arbitrary element of II, the family is uniformly tight. • We conclude this section with a useful result for measures on product spaces. See §7.4 for a discussion of the marginal measures. 26.23 Theorem A p.m. J.l on the space (§ x lr, with the product topology is tight iff the marginal p.m.s J.lx and J.l:y are tight. Proof For a set K E § x lr, let Kx = nx(K) denote the projection of K onto §. Since the projection is continuous (see §6.5), Kx is compact if K is compact (5.20). Since (26.54) tightness of J.l implies tightness of J.lx· Repeating the argument for J.l:y proves the necessity. For sufficiency we have to show that there exists a compact set K E having measure exceeding 1 - E. Consider the set K =A x B where A E :1 and J.lx{A) > 1 - E/2, and B E where J.ly(B) > 1 - E/2. Note that (26.55) Kc = (A x Bc) u (Ac x B) U (Ac x Bc), where the sets of the union on the right are disjoint. Thus, -
:1 ® 'J)
:1 ® 'J,
'J
J.l(Kc) � J.l(A c X B) + J.l(A X Be) +
+
2J.L(A c X Be)
J.l(§ X Be) = J.lx{A c) + J.l:y(Bc) � E. = J.l(A c X lf)
(26.56)
If A and B are compact they are separable in the relative topologies generated from § and lr (5.7), and hence K is compact by 6.17 • .
Weak Convergence in Metric Spaces
43 1
26.6 Skorokhod ' s Representation
Considering a sequence of random elements, we can now give a generalization of some familiar ideas from the theory of random variables. Recall from 26.7 that separability ensures that the distance functions in the following definitions are r.v.s. Let {xn } be a sequence of random elements and x a given random element of a separable space ('£,!/). If (26.57) d(xn( W),x(w)) ----7 0 for ffi E C, with P(C) 1, we say that Xn converges almost surely to x, and also write Xn � x. Also, if (26.58) P(d(xn,x) ;?: £) ----7 0, all £ > 0, we say that Xn converges in probability to x, and write Xn � x. A.s. convergence is sufficient for convergence in probability, which in tum is sufficient for Xn � x. A case subsumed in the above definition is where x = a with probability 1 , a being a fixed element of '£. We now have the following result generalizing 22.18. 26.24 Theorem Given a probability space (Q.,'!J,P), let {xn( W) } and {Yn(ffi) } be random sequences on a separable space ('£,!/). If Xn � x and d(xn,Yn) � 0, then D Yn --===-? X. Proof Let A E !f be a closed set, and for £ > 0 put A £ = { x: d(x,A) � £} E !f, also a closed set for each £, and A £ -!- A as £ -!- 0. Since { w: Yn( W) E A } � {w: Xn(W) E A£ } U { ffi: d(xn(W),yn(W)) ;::: £ } , we have (26.59) and, letting n ----7 oo, =
limsup P(yn E A) S limsup J.ln(A £) � J.l(A£), (26.60) n -toa n-toa where J.ln is the measure associated with Xm J.l the measure associated with x, and the second inequality of (26.60) is by hypothesis on {xn } and 26.10(c). Since this inequality holds for every £ > 0, we have limsup P(yn E A) � J.l(A), (26.61) n -toa by continuity of the measure. This is sufficient for the result by 26.10. • In §22.2, we showed that the weak convergence of a sequence of distributions on the line implies the a.s. convergence of a sequence of random variables. This is the Skorokhod representation of weak convergence. That result was in fact a special case of the final theorem of this chapter. 26.25 Theorem (Skorokhod 1956: 3. 1) Let {J.ln } be a sequence of measures on the
The Functional Central Limit Theorem
432
'Bro.n/:1-
separable, complete metric space ('£,:/). There exists a sequence of measurable functions Xn: [0, 1] 1--7 '£ such that /1-n (A) m( { ro: Xn(W) E A }) for each A E where m is Lebesgue measure. If 1-ln => /-!, there exists a function x( w) such that /1-(A) m( { ro: x(w) E A }) for each and d(xn(W),x(w)) -7 0 a.s.[m] as n -7 = AE Proof This is by construction of the functions Xn(W). For some k E IN let {x�k), i E IN } denote a countable collection of points in '£ such that, for every x E '£, d(x,x�k)) � 112k+l for some i. Such sequences exist for every k by separability. Let S(x�k).Tk), for 1 !2k+ l < rk < 1!2k, denote a system of spheres in '£ having the property /1-(dS(x�k),rk)) 0 for every i. An rk satisfying this condition exists, since there can be at most a countable number of points r such that /1-(dS(x�k\r)) > 0 for one or more i; this fact follows from 7.4. For given k, the system {S(x�k),rk), i E IN } covers '£, and accordingly the sets i-1 (26.62) D7 S(x)k),rk) - US(xjk),rk), i E IN,
:1,
==
:1,
.
==
==
==
j= l
form a partition of '£. By letting each of the k integers i1, ... ,ik range indepen dently over IN, define the countable collection of sets (26.63) Each Si1, . .,ik is a subset of a sphere of radius rk < 1/2k, and /-!(dSi1, ...,ik) 0. By construction, any pair Si1,... ,ik and Si;, ... ,i!: are disjoint unless ik ik. Fixing iJ , . . . ,ik-1 we have (26.64)
== ==
and in particular, (26.65) That is to say, for any k the collection {Si�o .. -,ik} forms a partition of '£, which gets finer as k increases. These sets are not all required to be non-empty. For any n E IN and k E IN , define a partition of [0, 1 ] into intervals .1.)��---.ik' where it is understood that .1.���---.ik lies to the left of .1�(� ... ,ik if i1 ij for j 1 , ... ,r- 1 and ir < i; for some r, and the lengths of the segments equal the probabilities /1-n(Si� o ... ,ik) . We are now ready to define a measurable mapping from [0, 1] to '£. Choose an element xi�o ... ,ik from each non-empty Si1, ... .ik• and for w E [0, 1 ] put (26.66) X�((J)) = Xi1, ...,ik if (J) E .1)��---.ikNote that by construction d(x�(ro),x�+m(m)) � I/2k for m � 1 , and taking k = 1 ,2, . . . defines a Cauchy sequence in '£ which is convergent since '£ is a complete hu .f). C" C"lu·nnt1nn '"XT.,.. f+.c:t. _kt ==
==
C f"V.:l. l' P.
I r., \
1 ; _...,
'·'
\
Weak Convergence in Metric Spaces
433
To show that Xn(CO) is a random element with distribution defined by 1-ln, it is sufficient to verify that (26.67) 1-ln(A) P(xn E A) = m( { CO: Xn(CO) E A }), for, at least, all A E !f such that 1-ln(aA) 0. If we let A(k) denote the union of all Si 1 , ... ,ik � A and A'(k) the union of all Si 1, . .. ,ik � Ac, it is clear that A (k) � A � A'(k) , and that (26.67) holds in respect of A(k) and A' (k) . Let c
=
=
=
. . .•
.
oc,
oc,
=
27 Weak Convergence in a Function Space
2 7 . 1 Measures on Function Spaces
This chapter is mainly about the space of continuous functions on the unit interval, but an important preliminary is to consider the space R[O, Il of all real functions on [0, 1]. We shall tend to write just R for this space, for brevity, when the context is clear. In this chapter and the following ones we also tend to use the symbols x,y etc. to denote functions, and t,s etc. to denote their arguments, instead of f,g and x,y respectively as in previous chapters. This is conventional, and reflects the fact that the objects under consideration are usually to be interpreted as empirical processes in the time domain. Thus, x: [0, 1 ] f---7 IR will be the function which assumes the value x(t) at the point t. In what follows the element x will typically be stochastic, a measurable mapping from a probability space (Q,c:Jf,P). We may legitimately write x: Q f---7 R, assigning x(co) as the image of the element co, but also x : Q x [0, 1] f---7 IR, where x(co , t) denotes the value of x at (co , t). We may also write x(t) to denote the ordinate at t where dependence on ro is left implicit. The potential ambiguity should be resolved by the context. Sometimes one writes x1 to denote the random ordinate where x1(co) = x(co,t), but we avoid this as far as possible, given our use of the subscript notation in the context of a sequence with countable domain. The notion of evaluating the function at a point is formalized as a projection mapping. The coordinate projections are the mappings 1t1: R[o,I] -7 1R , where n1(x) = x(t). The projections define cylinder sets in R; for example, the set n� 1 (a), a E IR , is the collection of all functions on [0,1] which pass through the point of the plane with coordinates (a , t). This sort of thing is familiar from § 12.3, and the union or intersection of a collection of k such cylinders with different coordinates is a k-dimensional cylinder; the difference is that the number of coordinates we have to choose from here is uncountable. Let { t1 , ... ,td be any finite collection of points in [0, 1], and let k (27. 1 ) 1tr 1, ... ,t/X) = (1tr1 (x), ... ,1t1/x)) E IR denote the k-vector of projections from these coordinates. The sets of the collection
Weak Convergence in a Function Space 1f
=
{1t�� .... ,riB)
� R [o,l]:
435
B E :Bk, tt , .. . , tk E [0, 1], k E IN },
(27.2)
are called the finite-dimensional sets of R[O,IJ· It is easy to verify that Jf is a field. The projection a-field is defined as 'P = cr(Jf). Fig. 27. 1 shows a few of the elements of a rather simple Jf-set, with k 1 , and B an interval [a,b] of IR . The set H = 1t��([a,b]) E Jf consists of all those functions that succeed in passing through a hole of width b - a in a barrier erected at the point t1 of the interval. Similarly, the set of all the functions passing through holes in two such barriers, at t1 and t2, is the image under 1t �� .t2 of a rectangle in the plane - and so forth. =
Fig. 27. 1 If the domain of the function had been countable, the projection <J-field 'P would be effectively the same collection as :B"" of 12.3. But since the domain is uncountable, 'P is strictly smaller than the Borel field on R. The sets of example 26.3 are Borel sets but are not in 'P, since their elements are restricted at uncountably many points of the interval. As that example showed, the Borel sets of R are not generally measurable; but (R,'P) is a measurable space, as we now show. Define for k 1,2,3, ... the family of finite-dimensional p.m.s I'"""I J , ... , tk on k (IR ,:Bk), indexed on the collection of all the k-vectors of indices, =
1 1
{ (tr , ... , tk): ti
[0, 1], j
1 , ... ,k} . This family will be required to satisfy two consistency properties. The first is (27.3) �t1, .. . ,t/E) = �t1, ... ,tm(E x 1R m -k) for E E 'Bk and all m > k > 0. In other words, a k-dimensional distribution can be obtained from an m-dimensional distribution with m > k, by the usual operation of E
=
marginalization. This is simply the generalization to arbitrary collections of coordinates of condition (12.7). The second is where n( l )
_____
n(k)
IS a
nermutation of the i ntep-ers 1
_
- .k. anct m:
(27.4) IR k � IR k
The Functional Central Limit Theorem
436
denotes the (measurable) transformation which reorders the elements of a k-vector according to the inverse permutation; that is, <j>(xp (I) •···Xp(kJ ) = x 1 , ... ,xk. This condition basically means that re-ordering the vector elements transforms the measure in the way we would expect if the indices were 1 , ... ,k instead of tJ, ... ,tk. The following extends the consistency theorem, 12.4. 27.1 Theorem For any family of finite-dimensional p.m.s { J.Lr 1 , ... ,rk } satisfying conditions (27.3) and (27.4), there exists a unique p.m. Jl on (R,'P), such that llr 1 , ... , t = Jl1t�� ... ,t for each finite collection of indices. k k Proof Let T denote the set of countable sequences of real numbers from [0, 1]; that is, 't E T if 't = { Sj E [0, 1], j E lN } . Define the projections 1t-r: R f---7 Iff"' by (27.5) For any 't, write v� = llsk··sn for n = 1 ,2, ... Then by 12.4, which applies thanks to (27.3), there exist p.m.s v-r on (IR"",�"") such that v� v-rn� \ where nn (y) is the projection of the first n coordinates of y, for y E IR "". Consistency requires that v� = v�' if sequences 't and -r' have their first n coordinates the same. Since evidently 'P c { n� 1 (B) : B E :B"", 't E T} , we may define a p.m. J.L on (R,'P) by setting (27.6) for each B E :B"". No extension is necessary here, since the measure is uniquely defined for each element of 'P. It remains to show that the family { J.Lr1 , ... , rk } corresponds to the finite dimensional distributions of J.L. For any Jlr 1 , .. ,tk there exists 't E T such that { tJ. ... ,tk } c { s 1 , ... ,sn } , for some n large enough. Construct a mapping \jf: IR n f---7 IR\ by first applying a permutation p to the indices s 1 , ... ,sn which sets x(p(s iJ) n to IR k by suppressing the = x(ti) for i = 1 , ... ,k, and then projecting from IR indices Sp(k+ l ) , ... , Sp(n) · The consistency properties imply that .
=
(27.7) Since \jf 0 1tn °1t-r = 1tr 1 , ... , tk is a projection, Jlr 1 , . .. . tk is a finite-dimensional distribution of J.L. • If we have a scheme for assigning a joint distribution to any finite collection of coordinate functions {x(t 1 ), ... ,x(tk) } with rational coordinates, this can be extended, according to the theorem, to define a unique measure on (R,'P). These p.m.s are called the finite-dimensional distributions of the stochastic process x. The sets generated by considering this vector of real r.v.s are elements of 1f., and hence there is a corollary which exactly parallels 12.5. 27.2 Corollary 1f. is a determining class for (R , 'P). o
Weak Convergence in a Function Space
437
27 .2 The Space C
Visualize an element of Cro. 1 1 , the space of continuous real-valued functions on [0 , 1], as a curve drawn by the pen of a seismograph or similar instrument, as it traverses a sheet of paper of unit width, making arbitrary movements up and down, but never being lifted from the paper. Since [0 , 1] is a compact set, the elements of Cr0, 1 1 are actually uniformly continuous. To get an idea why distributions on Cr0.1 1 might be of interest to us, imagine observing a realization of a stochastic sequence {Sj(ro) }]Z, from a probability space (O.,'!J,P), for some finite n. A natural way to study these data is to display them on a page or a computer screen. We would typically construct a graph of Sj against the integer values of j from 1 to n on the abscissa, the discrete points being joined up with ruled lines to produce a 'time plot ' , the kind of thing shown in Fig. 27.2.
n
1
(1)
Fig. 27.2 We will then have done rather more than just drawn a picture; by connecting the points we have defined a random continuous function, a random drawing (the word here operates in both its senses !) from the space C[1 ,n]. It is convenient, and there is obviously no loss of generality, if instead of plotting the points at unit intervals we plot them at intervals of ll(n - 1 ) ; in other words, let the width of the paper or computer screen be set at unity by choice of units of measurement. Also, relocating the origin at 0, we obtain by this means an element of Cro, 1 1 , a member of the subclass of piecewise linear functions, with formula (27.8) x(t) = (i - tm)x((i - 1)/m) + (1 + tm - i)x(ilm) for t E [(i - 1)/m, ilm], and i = 1 , ... ,m, m = n - 1 . The points x(ilm) E IR for i = O, ... ,m are the m + 1 vertices of the function. In effect, we have defined a measurable mapping between points of IR n and elements of Cr0, 1 1 , and hence a family of distributions on Cro, I J derived from (O.,'!J,P), indexed on n. The specific problem to be studied is the distribution of these graphs as n tends to infinity, under particular assumptions about the sequence { Sj } . When { Sj } is a sequence of scaled partial sums of independent or asymptotically independent random variables, we shall obtain a useful generalization of the central limit theorem.
438
The Functional Central Limit Theorem
As in §5.5, we metrize Cro.I J with the uniform metric du(x,y) = sup l x(t) - y(t) l . t
(27.9)
Imagine tying two pens to a rod, so that moving the rod up and down as it traverses a sheet of paper draws a band of fixed width. The uniform distance du(x,y) between two elements of C[O, I J is the width of the narrowest such band that will contain both curves at all points. We will henceforth tend to write C for CCro, I J•du) when the context is clear. C is a complete space by 5.24, and, since [0, 1] is compact, is also separable by 5.26(ii). In this case an approximating function for any element of C, fully determined by its values at a finite number of points of the interval (compare 5.25) is available in the form of a piecewise linear function. A set IIm = { t 1 , ... ,tm } satisfying 0 = to < t 1 < ... < tm = 1 is called a partition of [0, 1 ] . This i s a slight abuse of language, an abbreviated way of saying that the collect ion defines such a partition into subintervals, say, A; = (ti- l ,t;) for i = 1 , ... ,m - 1 together with A m = [tm - J,1]. The norm II IIm II = max { t; - t;- d (27. 10) l�i�m is called the fineness of the partition, and a refinement of IIm is any partition of which IIm is a proper subset. We could similarly refer to mint�i�m{ t; - t;- d as the coarseness of Tim . The following approximation lemma specializes 5.25 with the partition II2n = { i/2n , i = 1 , ... , 2n } for n ;?: 1 playing the role of the B-net on the domain, with in this case () < 2!2n . n 27.3 Theorem Given x E C, let Yn E C be piecewise linear, having 2 + 1 vertices, with max { l x(2 -n i) -Yn(2 -n i) I < �£. (27. 1 1) n l �i� 2 There exists n large enough that du(x,yn) < £. n n n Proof Write A; = [T (i - 1), 2 - i], i = 1 , ... ,2 . (Inclusion of both endpoints is innocuous here.) Applying (27.8) we find that, for t E A;, Yn(t) = A-yn(t') + (1 - A.)yn(t") where t' = rn (i - 1), t" = rn i, and A. = i - 2n t. Noting that
}
l x(t) - yn(t) i � A- l x(t) - x(t') l + (1 - A.) Ix(t) -x((') l + A-l x(t') - yn(t') l + (1 - A.) I x(t") -yn((') l ,
(27. 1 2)
and that for n large enough, sups,teA; i x(s) - x(t) l < �£ by continuity, it follows that for such n,
du(x,y,) =
,:d::� I
x(t) - y.(t)
I}
<
E.
•
(27. 13)
Weak Convergence in a Function Space
439
Note that as n ---7 =, II2n ---7 [[) (the dyadic rationals). There is the following important implication. 27.4 Theorem If x,y E C and x(t) = y (t) whenever t E [[), then x = y. Proof Let Zn be piecewise linear with Zn (t) = x(t) = y(t) for t E TI2n . By assumption, such a Zn exists for every n E IN. Fix £, and by taking n large enough that max { 4J(x,zn) , du(y,zn } } < �£, as is possible by 27.3, we can conclude by the triangle inequality that du(x,y) < £. Since £ is arbitrary it follows that du(x,y) = 0, and hence x = y since du is a metric. • The continuity of certain elements of R, particularly the limits of sequences of functions, is a crucial feature of several of the limit arguments to follow. An important tool is the modulus of continuity of a function x E R, the monotone function Wx: (0, 1 ] f-7 IR + defined by wx(o) = sup lx(s) -x(t) 1 . (27. 14) I s- t i d )
Wx has already been encountered in the more general context of the ArzeUt-Ascoli theorem in §5.5. It tells us how rapidly x may change over intervals of width o. Setting o = 1 , for example, defines the range of x. But in particular, the fact that the x are uniformly continuous functions implies that, for every x E C, wx(o) -!, 0 as o -!, 0. (27. 15) For fixed o, we may think of wx(o) = w(x,o) as a function on the domain C. Since I w(x o) - w(y,o) I ::; 2du(x,y) , w(x , o) is continuous on C, and hence a measurable function of x. The following is the version of the ArzeUt-Ascoli theorem relevant to C. 27.5 Theorem A set A c C is relatively compact iff sup I x(O) I < =, (27. 16) XEA lim sup wx(o) = 0. o (27. 17) 0---70 xE A
,
These conditions together impose total boundedness and uniform equicontinuity on A. Consider, for some t E [0, 1] and k E IN,
[ x(t) [ ,; [ x(O) [
+
�
x
(H -x (i � 1 t)
(27. 1 8)
Equality (27.17) implies that for large enough k, supxEA wx(1/k) < =. Therefore (27. 16) and (27. 17) together imply that (27.19) sup sup I x(t) I < =. t xE A In other words, all the elements of A must be contained in a band of finite width ::�rnnnci
n
Thi .;;: thP:nrf'.m i.;;: thP:rP:fnrP: ::� .;;:tr::�i ohtfnrw::�n-1 r.nrnlhrv nf �-2"'
The Functional Central Limit Theorem
440
27. 3 Measures on C
We now see how 27.1 specializes when we restrict the class of functions under consideration to the members of C. The open spheres of C are sets with the form S(x,r) = {y E C: du(x,y) < r,} (27.20) for x E C.
Fig. 27.3 Such sets can be visualized as a bundle of continuous graphs, with radius r and the function x at the core, traversing the unit interval - for example all the functions lying within the shaded band in Fig. 27.3. We shall write :B e for the Borel field on C, and since (C,du) is separable each open set has a countable covering by open spheres and :B e can be thought of as the a-field generated by the open spheres of C. Each open sphere can be represented as a countable union of closed spheres, S(x,r)
=
00 U S(x, r 1/n) , n= l
(27.21)
-
and hence :Be is also the a-field generated from the closed spheres. Now consider the coordinate projections on C. Happily we know these to be continuous (see 6.15), and hence the image of an open (closed) finite-dimen sional rectangle under the inverse projection mapping is an open (closed) element of 'P. Letting 1-fe = { H n C: H E Jf } with Jf defined in (27 .2), and so defining 'P e = cr(Jfe), we have the following important property: 27.6 Theorem :B e = 'Pe. Proof Let H,(x,a) = E C: max I y(rki) - x(rki) I < a E 1-f e, (27.22)
{Y
and so let
l � iQk
}
Weak Convergence in a Function Space H(x,a)
=
n Hk(x,a) k=l
{
441
}
= y E C: sup l y(t) -x(t) l :::; a E 'Pc, tE ID
(27.23)
where [) denotes the dyadic rationals. Note that we cannot rely on the inequality in (27.22) remaining strict in the limit, but we can say by 27.4 that H(x,a) = S(x,a), (27.24) where S is the closure of S. Using (27.21), we obtain 00
S(x,r) = U H(x, r - lin). n=l It follows that the open spheres of C lie in 'Pc. and so 'Be � 'Pc.
(27.25)
X
Fig. 27.4 To show 'P c c 'Be consider, for a E IR and to E the restriction to [0, 1] of the functions on IR ,
!
[0, 1], functions Xn E
a + n(n + 1/n)(t + 1/n - t0), to - 1/n :::; t < to Xn(t) = a + n(n + 1/n)(to + lin - t), to :::; t < to + 1/n
C defined by
(27.26)
a, otherwise. Every element y of the set S(xmn) E 'Be has the property y(to) > a. (This is the shaded region in Fig. 27.4.) Note that 00
(27.27) G(a,to) = {y E C: 1tro(y) > a} = u scxn,n) E 'Be. l n= Now, G(a,t0) is an element of the collection 1f00 where for general t we define 1fcr = {1t� 1 (B), B E 'B } . (27.28)
442
The Functional Central Limit Theorem
In words, the elements of 1f.e1 are the sets of continuous functions x having x(t) E B, for each B E 'B. In view of parts (ii) and (iii) of 1.2, and the fact that 'B can be generated by the collection of open half-lines ( a,oo), it is easy to see that 1f.0 is the a-field generated from the sets of the form G(a,t) for fixed t and a E rR . Moreover, 1f.e is the a-field generated by {1f.0, t E [0, 1 ] } . Since G(a, t) E 'Be for any a and t by (27.27), it follows that 1f.e c 'Be and hence cp e � 'Be. • It will be noted that the limit Xoo(t) of (27.26) is not an element of C, taking the value a at all points except t0, and +oo at t0. Of course, {xn } is not a Cauchy sequence. However, the countable union of open spheres in (27 .27) is an open set (the inverse projection of the open half line) and omits this point. cp e is the projection a-field on C with respect to arbitrary points of the continuum [0, 1], but consider the collection cp(; = { H l't C: H E cp'}, where cp' is the collection of cylinder sets of Rro, 1 1 having rational coordinates as a base. In other words, the sets of cp' contain functions whose values x(t) are unrestricted except at rational t. Since elements of C which agree on the rational coordinates agree everywhere by 27.4, (27.29) This argument is just an alternative route to the conclusion (from 6.22) that C is homeomorphic to a subset of rR"". However, it is not true that cp = cp', because cp is generated from the projections of every point of the continuum [0,1], and arbitrary functions can be distinct in spite of agreeing on rational t. Evidently (C,'Bc) is a measurable space, and according to 27.2 and 27.6, 1f.e is a determining class for the space. In other words, the finite-dimensional distribu tions of a space of continuous functions uniquely determine a p.m. on the space. Every p.m. on R[O,l J must satisfy the consistency conditions, but the elements of C have the special property that x(t1) and x(t2) are close together whenever t1 and t2 are close together, and this puts a further restriction on the class of finite-dimensional distributions which can generate distributions on C. Such distributions must have the property that for any E > 0, ::3 8 > 0 such that (27.30) The class of p.m.s in (C,'Bc), whose finite-dimensional distributions satisfy this requirement, will be denoted !Me. Note that, thanks to 26.14, we are able to treat !Me as a separable metric space. This fact will be most important below. 27.4 Brownian Motion
The original and best-known example of a p.m. on C, whose theory is due to Norbert Wiener (Wiener 1923) is also the one that matters most from our point of view, since in the theory of weak convergence it plays the role of the attractor measure which the Gaussian distribution plays on the line. It is in fact the natural generalization of that distribution to function spaces. 27.7 Definition Wiener measure W is the p.m. on (C,'Bc) having these properties:
Weak Convergence in a Function Space (a) W(x(O)
=
0)
(b) W(x(t) ::::; a)
= =
443
1;
-1-fa e-s1121df;, 0
V2iti
-oo
<
t ::::;
1;
(c) for every partition { tb ... ,tk l of [0,1], the increments x(t 1 ) - x(t0), x(t3 ) - x(t2), ... ,x(tk) - x(tk- 1 ) are totally independent. o Parts (a) and (b) of the definition give the marginal distributions of the coord inate functions, while condition (c) fixes their joint distribution. Any finite collection of process coordinates {x(ti), i = 1 , ... ,k} has the multivariate Gauss ian distribution, with x(tj) - N(O,tj), and E(x(tj)x(tf)) = min { tj,tj' } . Hence, x(tt) - x(t2) - N(O, l tt - t2 l ), which agrees with the requirements of continuity. This full specification of the finite-dimensional distributions suffices to define a unique measure on ( C,'Bc). This does not amount to proving that such a measure exists, but we shall show this below; see 27.15. W may equally well be defined on the interval [O,b) for any b > 0, including b = oo, but the cases with b ::t 1 will not usually concern us here. A random element distributed according to W is called a Wiener process or a Brownian motion process. The latter term refers to the use of this p.m. as a mathematical model of the random movements of pollen grains suspended in water resulting from thermal agitation of the water molecules, first observed by the botanist Robert Brown. 27 In practice, the terms Wiener process and Brownian motion tend to be used synonymously. The symbol W conventionally stands for the p.m., and we also follow convention in using the symbol B to denote a random element from the derived probability space ( C, 'Bc, l-ll). In terms of the underlying proba bility space (Q,�,P) on which we assume B: Q � C to be a �/'Be-measurable mapping, we have W(E) = P(B E E) for each set E E 'Be. The continuous graph of a random element of Brownian motion, B(co) for co E Q, is quite a remarkable object (see Fig. 27 .5). It belongs to the class of geometrical forms namedfractals (Mandelbrot 1983). These are curves possessing the property of self-similarity, meaning essentially that their appearance is invariant to scaling operations. It is straightforward to verify from the defini tion that if B is a Brownian motion so is B * , where (27.31) B *(ro,t) = k - 1 12(B(OJ ,s + kt) - B( ro,s)) for any s E [0,1) and k E (0, 1 - s] . Varying s and k can be thought of as 'zooming in' on the portion of the process from s to s + k. The key property is the one contained in part (iii) of the definition, that of independent increments. A little thought is required to see what this means. In the definition, the points tJ, ... ,tk may be arbitrarily close together. Consider ing a pair of points t and t + �. the increment B( OJ,t + �) - B( OJ,t) is Gaussian with variance �. and independent of B(ro,t). Symmetry of the Gaussian density implies that P( OJ: (B(ro,t + M B(OJ,t) )(B(OJ,t) - B(ro,t - �)) < 0) = 1 -
The Functional Central Limit Theorem
444
for L1 � t � 1 L1 and every L1 > 0. This is compatible with continuity, but completely rules out smoothness; in any realization of the process, almost every point of the graph is a comer, and has po tangent. This property is also apparent when we attempt to differentiate B(ro). Note from the definition that -
B(ro,t + � - B(ro,t)
_
N(O, llh).
(27.32)
The sequence of measures defined by letting h � 0 in (27.32) is not uniformly tight, and fails to converge to any limit. To be precise, the probability that the difference quotients in (27.32) fall in any finite interval is zero, another way of saying that the sample path x(t,ro) is non-differentiable at t, almost surely. A way to think about Brownian motion which makes its relation to the problem of weak convergence fairly explicit is as the limit of the sequence of partial sums of n independent standard Gaussian r.v.s, scaled by n - 1 n . Note that (27.33) �(ro) = n 1 12(B(ro,jln) - B(ro,(j - 1)/n)) - N(0, 1) for j = 1, ... ,n and B(ro,j/n) = n - 1 12E= I�i(ro). By taking n large enough, we can express B(ro,t) in this form for any rational t, and by a.s. continuity of the process we may write [nt] (27.34) B(ro,t) = lim n - 1 12L �/ro) a.s. n�oo i= l for any t E [0, 1], where [nt] denotes the integer part of nt. Consider the expected sum of the absolute increments contributing to B(t). According to 9.8, l �j l has mean (2/n) 1 12 and variance 1 - 2/n, and so by indepen dence of the increments the r.v. An(t) = n - I/2L��fl l �d has mean [nt](21nn) 1 12 = m(t,n) (say) and variance 1 - 2/n. Applying Chebyshev ' s inequality, we have that for t > 0,
P(An(t) > �m(t,n)) � P( I An(t) - m(t,n) I � �m(t,n)) (27.35) � 1 _ 4(1 - 2/n) m(t,n) 2 _ Since m(t,n) = O(n 1 12), An(t) � oo a.s.[WJ for all t > 0. This means that the random element B(ro) is a function of unbounded variation, almost surely. Since limn�ooAn(t) is the total distance supposedly travelled by a Brownian particle as it traverses the interval from 0 to t, and this turns out to be infinite for t > 0, Brownian motion cannot be taken as a literal description of such things as particles undergoing thermal agitation. Rather, it provides a simple limiting approximation to actual behaviour when the increments are small. Standard Brownian motion is merely the leading member of an extensive family of a.s. continuous processes on [0, 1 ] having Gaussian characteristics. For example, if we multiply B by a constant cr > 0, we obtain what is called a Brownian motion with variance �- Adding the deterministic function �t to the process defines a Rrownian motion with drift LL Thu X(t) = rTR(t) -1- tlt rP-nrP-"P-nt" a fami l v of .;_
.
Weak Convergence in a Function Space
445
processes having independent increments X(t) -X(s) - N(J.l(t - s), d I t - s 1 ). More elaborate generalizations of Brownian motion include the following. 27.8 Example Let X(t) = B(t 1 +J3) for -1 < � < oo. X is a Brownian motion which has been subjected to stretching and squeezing of the time domain. Like B, it is a.s. continuous with independent Gaussian increments. It can be thought of as the limit of a partial sum process whose increments have trending variance. Suppose �;(ro) N(O }), which means the variances are tending to 0 if � < 0, or to infinity if � > 0. Then n-l-J3E(If�{l�;) 2 ---) t1 +J3 , and [nt] ) +J3 !2 O n _L �;(ro) ---) B(ro,t 1 +J3) a.s. o (27.36) i=l
27.9 Example Let X(t) = O(t)B(t) where 9: [0, 1] f--7 1R is any continuous determin istic function, and B is a Brownian motion. For s < t, (27.37) X(t) -X(s) = O(t)(B(t) - B(s)) + (9(t) - 9(s))B(s), which means that the increments of this process, while Gaussian, are not inde pendent. It can be thought of as the almost sure limit as n � oo of a double partial sum process, n-
J n� [
6 (iln) E;;( ffi) + ( 6(i/n) - 6 ((i - I )in))
� ]
E;,{ffi) ,
(27.38)
where �; - N(0, 1). o 27.10 Example Letting B denote standard Brownian motion on [O,oo), define (27.39) X(t) = e- J3tB(iJ3t) for fixed � > 0. This is a zero-mean Gaussian process, having dependent increments like 27.9. The remarkable feature of this process is that it is stationary, with X(t) - N(O, 1) for all t > 0, and (27.40) E(X(t)X(s)) = eJ3(2min{ t,sJ -t-s) = e- J3 1 t-s i . This is the Ornstein-Uhlenbeck process. o 27.11 Example The Brownian bridge is the process B0 E C where B0(t) = B(t) - tB(l ), t E [0, 1]. (27.41) This is a Brownian motion tied down at both ends, and has E(B0(t)B0(s)) = min { t,s} - ts. A natural way to think about B0 is as the limit of the partial sums of a mean-deviation process, that is
B" (t,ro)
=
�: n -
where �;(ro) - N(O,l).
o
w
�{
E;;(ro) -
� t E;,{ro)) a.s.
(27.42)
The Functional Central Limit Theorem
446
We have asserted the existence of Wiener measure, but we have not so far offered a proof. The consistency theorem (27.1) establishes the existence of a measure on (C, :B c) whose finite-dimensional distributions satisfy conditions (a)-(c) of 27.7, so we might attempt to construct a continuous process having these properties. Consider
-
Y.(t,co) : n - I /2
{% �,{co) + (nt - [nt])�l.,J+I (co)) ,
(27.43)
where Si N(O,l ) and the set {sJ, ... , sn } are totally independent. For given ro, Yn(.,ro) is a piecewise linear function of the type sketched in Fig. 27.2, although with Yn(O,ro) = 0 (the Si represent the vertical distances from one vertex to the next), and is an element of C. Yn(t) is Gaussian with mean 0, and E(Yn(t))2 = n - \ [nt] + (nt - [tft])2) 2 1 = t + n - ([nt] - nt+ (nt - [nt]) ) (27.44) = t + K(n,t)ln, (say) where 0 < K(n,t) < 2. Moreover, the Gaussian pair Yn( t) and Yn(t + s + n - 1 ) - Yn(t + n - 1 ) , s > 0, are independent. Extrapolating the same argument to general collections of non-overlapping increments, it becomes clear Yn(t) � N(O , t), and more generally that if Yn � Y, then Y is a stochastic process whose finite dimensional distributions match those of W. Fig. 27 .5, which plots the partial sums of around 8000 (computer-generated) independent random numbers, shows the typical appearance of a realization of the process approaching the limit. 1 . 3 ,-------� 1.0 0.5 0 �------� 0
1 Fig. 27.5
This argument does not show that the measure on ( C,:Bc) corresponding to Y actually is W. There are attributes of the sample paths of the process which are not specified by the finite dimensional distributions. According to the continuous mapping theorem, Yn � Wwould imply that h( Yn) � h(W) for any a.s. continuous function h. For example, suptl Yn(t) I is such a function, and there are no grounds .f.rnm th.,. <>rmnn<>nt" f'ron " ; ri PrPrl �hnvP. fnr •mnnn"i n u thM "lln. l y_(t) I � SUO, I W( t) I .
Weak Convergence in a Function Space
447
However, if we are able to show that the sequence of measures corresponding to Yn converges to a unique limit, this can only be W, since the finite-dimensional cylinder sets of C are a determining class for distributions on (C/Bc). This is what we were able to conclude from 27.6, in view of 27.1. This question is taken up in the next section, and the proof of existence will eventually emerge as a corollary to the main weak convergence result in §27.6. 27 . 5 Weak Convergence on C
Let { J.ln } be a sequence of probability measures in IMc. For example, consider the distributions associated with a sequence like { Ym n E IN } , whose elements are defined in (27.43). According to 26.22, the necessary and sufficient condition for the family {J.ln } to be compact, and hence to possess (by 5.10) a cluster point in IMc, is that it is uniformly tight. Theorem 27.5 provides us with the relevant compactness criteria. The message of the following theorem is that the uniform tightness of measures on C is equivalent to boundedness at the origin and continuity arising with sufficiently high probability, in the limit. Since tightness is the concentration of the mass of the distribution in a compact set, this is just a stochastic version of the Arzela-Ascoli theorem. 27.12 Theorem (Billingsley 1968: th. 8.2) {J.ln } is uniformly tight iff there exists N E IN such that, for all 11 > 0 and for all n � N, (a) there exists M < oo such that (27.45) J.ln({x: l x(O) I > M}) :::; T) ; (b) for each £ > 0, there exists 8 E (0, 1) such that J.ln({x: wx(o) � E}) :::; T) . o (27.46) Condition (b) is a form of stochastic equicontinuity (compare §21 .3). It is easier to appreciate the connection with the notions of equicontinuity defined in §5.5 if we write it in the form P(w(Xn,8) � E) :::; T) , where {Xn} is the sequence of stochas tic functions on [0, 1] having derived measures J.ln · Asymptotic equicontinuity is sufficient in this application, and the conditions need hold only over n � N, for some finite N. Since C is a separable complete space, each individual member of { J.ln} is tight, and for uniform tightness it suffices to show that the conditions hold 'in the tail ' . Proof of 27.12 To prove the necessity, let U..tn } be uniformly tight, and for 11 > 0 choose a compact set K with J.ln(K) > 1 - T) . By 27.5, there exist M < oo and 8 E (0, 1) such that K � {x: l x(O) J :::; M} n {x: wx(8) < E} (27.47) for any £ > 0. Applying the De Morgan law,
1l � J.ln (�) 2: J.ln ({x: J x(O) J > M} U {x: wx(8)
2:
£})
The Functional Central Limit Theorem
448 ;?:
max { J.Ln( { x : l x(O) I > M}), Jln({x: wx(8) ;?: E} ) } .
(27.48)
Hence (27.45) and (27.46) hold, for all n E !N . Write J.L*(.) as shorthand for SUPn <-Nf..Ln(.) . To prove sufficiency, consider for k = 1,2, . . the sets (27.49) where { 8d is a sequence chosen so that ll*(Ak) > 1 - 8/2k+l , for 8 > 0. This is possible by condition (b). Also set B = { x: I x(O) I :::;; M } , where M is chosen so that Jl*(B) > 1 - 8/2, which is possible by condition (a). Then define a closed set K = (nk'=IAk n B) -, and note that conditions (27.16) and (27.17) hold for the case A = K. Hence by 27.5, K is compact. But
.
11* CKc) :::;; J.L*
(UAZu Be) k=l
:::;; L J.L*(AZ} + J.L*(Bc) k=l
:::;; 2e .L 112k+2 + en = e. 00
k=l
(27.50)
This last inequality is to be read as SUPn<- Niln(Kc) :::;; e, or equivalently infn<- Nf..Ln(K) > 1 - e. Since e is arbitrary, and every individual Jln is tight by 26.19, in particular for 1 :::;; n < N, it follows that the sequence { Jln} is uniformly tight. • The following lemma is a companion to the last result, supplying in conjunction with it a relatively primitive sufficient condition for uniform tightness. 27.13 Lemma (adapted from Billingsley 1968: th 8.3) Suppose that, for some 8 E (0, 1),
({
sup Jln x: sup I x(s) -x(t) I O:St:SI -o
t :5 s :5 t+o
;?:
1£
})
:::;; 1no.
(27.51 )
Then (27 .46) holds. Proof Fixing 8, consider the partition { t1 , . . . ,t, } of [0, 1], for r = 1 + [1/o], where t; = i8 for i = 1 , ... ,r - 1 and t, = 1 . Thus, for 1 < o < 1 we have r = 2 and the partition { 0, 1 }, for t < 0 :::;; � we have r = 2 and the partition { o, 28, 1 } , and so on. The width of these intervals is at most 8. A given interval [t,t'] with I t' - tl :::;; 8 must either lie within an interval of the partition, or at most overlap two adjoining intervals; it cannot span three or more. In the event that I x(t') - x(t) I ;?: E , x must change absolutely by at least �£ in at least one of the interval(s) overlapping [t,t'], and the probability of the latter event is at least that of the former. In other words, considering all such intervals,
Weak Convergence in a Function Space !J-n{ {x : wx(o) � £ } ) ::; 1-ln
(U{ t=]
x:
S,S
449
})
sup l x(s') - x(s) l � 1£ ,E [t;-!, 1;]
(27.52) where the third of these inequalities applies (27.51), and the final one follows because rO ::; 2. • These results provoke a technical query over measurability. In §21 . 1 we indicated difficulties with standard measure theory in showing that functions such as sup1 � � .�;l x(s) -x(t) l in (27.51), and wx(o) in (27.46), are random variables. However, it is possible to show that sets such as the one in (27.51) are �-analytic, and hence nearly measurable. In other words, complacency about this issue can be justified. The same qualification can be taken as implicit wherever such sets arise below. s
t+
27.6 The Functional Central Limit Theorem
Let Sno = 0 and Snj = TI=I Uni for j = 1 , ... ,n, where { Und is a zero-mean stochastic array, normalized so that E(S�n) = 1 . As in the previous applications of array notation, in Part V and elsewhere, the leading example is Un i = U/sn, where { U;} is a zero-mean sequence and s� = E(Ii= l ui. Define an element Yn of Cro. lJ• somewhat as in (27.43) above, as follows: Yn(t) = Sn,[nt] + (nt - [n t] )Un,[nt]+ I , = Snj- 1 + (nt -j + 1)Unj for (j - 1)/n ::; t < j/n, j
=
l , . . . ,n ;
Yn(l ) = Snn·
(27.53) (27.54)
This is the type of process sketched in Fig. 27.2. The question of whether the distribution of Yn possesses a weak limit as n -> oo is the one we now address. The interpolation terms in Yn(t) are necessary to generate a continuous func tion, but from an algebraic point of view they are a nuisance; dropping them, we obtain Xn (t) = Sn,[nt] Xn(l ) = Snn ·
=
Snj- 1 for U - l)ln ::; t < j/n, j = 1 , ... ,n,
(27 .55) (27.56)
If conditions of the type discussed in Chapters 23 and 24 are imposed on { Un i } , XnO) � N(O, l) as n � oo. If for example U; - i.i.d.(O,d), so that Uni = UJsn where s� = ncr2, this is just the Lindeberg-Levy theorem. However, the Lindeberg Levy theorem yields additional conclusions which are less often remarked; it is easv to verifv that. for each distinct oair tJ .f? E rO , l l,
The Functional Central Limit Theorem
450
(27.57) Since non-overlapping partial sums of independent variates are independent, we find for example that, for any 0 :::; t1 < t2 < t3 :::; 1, Xn(t2) - XnCti ) and XnCt3) Xn(t2) converge to a pair of independent Gaussian variates with variances t2 - t1 and t3 - t2, so that their sum Xn(t3) - XnCti) is asymptotically Gaussian with variance t3 - t1 , as required. Under our assumptions, (27.58) so that Yn(t) and Xn(t) have the same asymptotic distribution. Since Yn(O) = 0, the finite-dimensional distributions of Yn converge to those of a Brownian motion process as n -----7 oo. As noted in §27.4, this is not a sufficient condition for the convergence of the p.m.s of Yn to Wiener measure. But with the aid of27.12 we can prove that { Yn} is uniformly tight, and hence that the sequence has at least one cluster point in rMc. Since all such points must have the finite-dimensional distributions of W, and the finite-dimensional cylinders are a determining class for (C,'Bc), W must be the weak limit of the sequence. This convergence will be expressed either by writing !-ln :::::::> W, or, more commonly in what follows, by Yn � B. This type of result is called a functional central limit theorem (FCLT), although the term invariance principle is also used. The original FCLT for i.i.d. increments (the generalization of the Lindeberg-Levy theorem) is known as Donsker ' s theorem (Donsker 1951). Using the results of previous chapters, in particular 24.3, we shall generalize the theorem to the case of a heterogeneously distributed martingale difference, although the basic idea is the same. 27.14 Theorem Let Yn be defined by (27.53) and (27.54), where { Uni.�nd is a martingale difference array with variance array { CJ�;}, and Lt= l �i = 1 . If n (a) _L u�; � 1 , i= l (b) (c)
max I Und � 0, ]$;i$; n [nt] lim L CJ�; = t, for all t n�oa i= l
E
[0, 1 ],
then Yn � B. o Conditions (a) and (b) reproduce the corresponding conditions of 24.3, and their role is to establish the finite-dimensional distributions of the process, via the conventional CLT. Condition (c) is a global stationarity condition (see § 13.2) which has no counterpart in the CLT conditions of Chapter 24. Its effect is to rule out cases such as 24.10 and 24.11. By simple subtraction, the condition is sufficient for
Weak Convergence in a Function Space
·
,!
;
·-·
45 1
[ns] (27.59) lim L a;i = s - t, n �oo i=[nt]+l tor 0 � t < s � 1 . Clearly, without this restriction condition (27 .57) could not hold for any t1 and t2 . Proof of 27.14 Conditions 24.3(a) and 24.3(b) are satisfied, on writing Uni for Xnr · In view of the last remarks, the finite-dimensional distributions of Yn converge to those of W, and it remains to prove that { Yn} is uniformly tight (i.e., that the sequence of p.m.s of the Yn is uniformly tight). m Define, for positive integers k and m with k + m � n, s;km = "'C' "-k+i= k+ I<Jni2 = E(Sn,k+m - Snk) 2, where Sn,k+j - Snk = I7�{+1 Uni· The maximal inequality for martingales in 15.14 implies that, for A > 0, E I Sn,k+m - Snk l p p max I Sn,k+j - Snk I > Asnkm � (27.60) . (Asnkmf l s;j �m In particular, set k = [nt] and m = [n8] for fixed 8 E (0, 1) and t E [0 , 1 - 8] so that m increases with n, and then we may say that (Sn ,k+m - Snk)lsnkm � Z N(0, 1), by 24.3. For given positive numbers 11 and £, choose A satisfying A > max{E/8, 256E I ZI 3/T) £2 } , (27.61) and consider the case 8 = £2/64A2 < 1 . There must exist N0 ;::: 1 for which, with n ;::: N0, the Gaussian approximation is sufficiently close that El Sn,k+m - Snk 1 3 11 £2 � _' I _ = 1T 8 . (27.62) ) (Asnkm)3 256A2 4 Also observe from (27.59) that limn�oos�km = 8. For the choice of 8 indicated there exists N1 ;::: 1 such that As nkm < £E for n ;::: N1 , and hence, combining (27 .60) with p = 3 with (27.62), such that
)
(
-
------
_
)
(
P max I Sn ,k+j - Snk l ;::: £E � £11 8, l �j�m for n ;::: max{N0,N1 } . Now, Yn(s) - Yn(t) = Sn,[ns] - Sn, [nt] + Rn(s,t) for s > t, from (27.53) and (27.54), where Rn(s,t) = (ns - [ns]) Un,[ns]+ l - (nt - [nt])Un, [nt] +l · For t E [0, 1 8], there exists s' E [t, t + 8] such that
(27.63)
(27.64) (27.65)
-
I Yn(s') - Yn(t) I
=
sup I Yn(s) - Yn(t) 1 . t �s� t+o
(27.66)
The Functional Central Limit Theorem
452
[ns'] .:::; [nt] + [n8] and hence max I Sn,[nt] +j - Sn, [nt] l · � I J �[no] It follows (also invoking the triangle inequality) that for n
I Sn,[ns'] - Sn, [ nt] l
(27.67)
.:::;
:2:
N2 ,
=
I Sn,[ns'] - Sn,[nt] + Rn(s', t) I (27.68) :::; max I Sn ,[nt] +J - Sn,[nt] l + I Rn(s',t) I . I �J�[n o] By condition (b) of the theorem, P(sup i Rn(s, t) l :2: ac) ---7 0 as n ---7 oo, and hence there exists N3 :2: 1 such that, for n :2: N3, (27.69) Inequalities (27.69) and (27.63) jointly imply that, for all t E [0, 1 - 8] and n max{No,NbN2 ,N3 } , :2: N* I Yn(s') Yn(t) I -
s, r
=
P( I Yn(s') - Yn(t) I
( ({ (
:2:
!c)
.:::; P max I Sn, [nt] +J - Sn, [nt] I + I Rn(s' , t) I I �J �[no] :::; p
max I Sn,[nt] +j - Sn,[n t] l I�J[no]
:::; P max ! Sn,[ntl +J - Sn ,[ntJ I I �J[n o]
:::; 111 B . The conclusion may be written as
(
:2:
:2:
}
aE
)
aE
)
)
!c
})
{ I Rn(s',t) I
:2:
aE
P( I Rn(s',t) l
:2:
aE)
u
+
:2:
(27.70)
s p (27.7 1 ) 1 Yn(s) - Yn(t) I :2: !E .:::; !11 8, n :2: N*. � � :� 0� t f o � +8 Note that (27.51) i s identical with (27.71 ) for the case J.ln (A) P(Yn E A), and that 11 and c are arbitrary. Therefore, uniform tightness of the corresponding sequence of measures follows by 27.12 and 27.13. This completes the proof. • We conclude this section with the result promised in §27.4: 27.15 Corollary Wiener measure exists. o The existence is actually proved in 27.14, since we derived a unique limiting distribution which satisfied the specifications of 27.7. The points which are conveniently highlighted by a separate statement are that the tightness argument developed to prove 27.14 holds independently of the existence of W as such, and that the central limit theorem plays no role in the proof of existence. -
=
Weak Convergence in a Function Space
453
the process Yn of (27.41). We shall show that, on putting Uni -l 2 n l � i' conditions 27.14(a), (b), and (c) are satisfied. It will follow by the reasoning of 27.12 that the associated sequence of measures is uniformly tight, and possesses a limit. This limit has been shown above by direct calculation to have the finite-dimensional distributions specified by 27. 7, which will conclude the proof. Condition 27.14(c) holds by construction. Condition 27.12(a)) follows from an application of the weak law of large numbers (e.g. Khinchine's theorem), recalling that since �i is Gaussian { �1 } is an independent sequence possessing all its moments. Finally, condition 27.14(b) holds by 23.16 if the collection {�J, ... ,�n } satisfy the Lindeberg condition, which is obvious given their Gaussianity and 23.10 . • Proof Consider
=
27 .7 The Multivariate Case
We would like to extend these results to vector-valued processes, and there is no difficulty in extending the approach of §25.3. Define the space C[o, 1 1m, which we write as en for brevity, as the space of continuous vector functions m H IR m, ' X = (X J , . . . ,Xk) : [O, l ] where [0, l ] m and IR m are the Cartesian products of m copies of [0, 1] and IR respect ively. em is itself the product of m copies of C. It can be endowed with a metric such as (27.72) = max 15:j 5: m which induces the product topology, and coordinate projections remain continuous. Since C is separable, e m is also separable by 6.16, and 'Be = 'B e ® 'B e ® . .. ® 'B e (the a-field generated by the open rectangles of [0, l] m) is the Borel field of e m by m-fold iteration of 26.5. (C m, 'B(!) is therefore a measurable space. Let (27.73) denote the finite-dimensional sets of e m. Again thanks to the product topology, 1f(! is the field generated from the sets in the product of m copies of 1fe. 27.16 Theorem 1f(! is a determining class for (C m , 'B[;). Proof An open sphere in 'B(! is a set S(x,a) = E C m : < a} ,
d'U(x, y)
=
The set
{du(x1,y1) },
{y d'{j(x, y) yj(t) {y E
em : max sup I 15:j5: m
t
xj(t) I
}
< a .
(27.74)
The Functional Central Limit Theorem
454
(27.75) is an element 1-fe. It follows by the argument of 27.6 that
n=l ( , lin)
S(x,r) = U n Hk x r k= l
E
Te = cr(1fe),
(27.76)
and hence, that :Be c 'Pe. Since 1-fe is a field, the result follows by the extension theorem. • It is also straightforward to show that :Be = T(;, by a similar generalization from 27.6, but the above is all that is required for the present purpose. A leading example of a measure on (C m ,c.B(;) is wm, the p.m. of m-dimensional standard Brownian motion. A m-vector B distributed according to wm has as its elements m mutually independent Brownian motions, such that (27 .77) B(t) - N(O, m where Im is the m x m identity matrix, and the process has independent increments with E( (B(s) - B(t))(B(s) - B(t))') = (s - t)Im (27.78) for t<s The following general result can now be proved. 27. 17 Theorem Let be a m-vector martingale difference array with variance matrix array such that = Im . Then let = (27.79) for (j - 1)/n � < j/n, = 0 and for j = and = where
ti ),
0 S S 1.
{ Uni{l:niL>�nil Lt=ll:ni Yn(t) SnJ-t+(nt-j+1)Unj t l,. .,n, Yn(l) Snn'j Sno (27.80) Snj Li=l Uni' j = 1, . . ,n. If n (a) L UniU�i Im, i=l 0, (b) max U�iUni l:S::iSnn [ t ] (c) lim L l:ni tim , for all t [0 ,1 , n -'t oo then Yn � B. Consider for an m-vector A of unit length the scalar process A'Yn, having increments ;\'Uni· By definition, {;\'Uni, �ni} is a scalar martingale difference array with variance sequence ;\'l:niA . It is easily verified that all the conditions =
�
�
i=l
Proof
=
E
]
Weak Convergence in a Function Space
455
of 27.14 are satisfied, and so �'Yn � B . This holds for any choice of �- In particular, �'Yn(t) � N(O, t) , with similar conclusions regarding all the finite dimensional distributions of the process. It follows by the Cramer-Wold theorem that Yn(t) � N(O, tim), (27.81) with similar conclusions regarding all the finite-dimensional distributions of the process; these are identical to the finite-dimensional distributions of wm . Since 1-f'C is a determining class for (em, 73m), any weak limit of the p.m. s of { Yn} can only be wm . It remains to show that these p.m.s are uniformly tight. But this is true provided the marginal p.m.s of the process are uniformly tight, by 26.23. Picking � to be the jth column of Im for j = l , ... ,m and applying the argument of 27.14 shows that this condition holds, and completes the proof. • The arguments of §25.3 can be extended to convert 27.17 into provide an unusually powerful limit result. The conditions of the theorem are easily generalized, by replacing 27.17(c) by [nt] (c') lim L :Eni = t:E , n�oo i= l where :£ is an arbitrary variance matrix. Defining L- such that L -u = Im as in 25.6, 27.17 holds for the transformed vector process Zn = L - yn · The limit of the process Yn itself can then be determined by applying the continuous mapping theorem. This is a linear combination of independent Brownian motions, the finite dimensional distributions of which are jointly Gaussian by 11.13. We call it an m-dimensional correlated Brownian motion, having covariance matrix :r, and denoted B(:r). The result is written in the form Yn � B(:r). (27.82) An invariance principle can be used in this way to convert propositions about dependence between stochastic processes converging to Brownian motion into more tractable results about correlation in large samples. Given an arbitrarily related set of such processes, there always exist linear combinations of the set which are asymptotically independent of one another. 28 _,
28 Cadlag Functions
2 8 . 1 The Space
D
The proof of the FCLT in the last chapter was made more complicated by the presence of terms necessary to ensure that the random functions under considera tion lay in the space C. Since these terms were shown to be asymptotically negli gible, it might reasonably be asked whether they are needed. Why not, in other words, work directly with Xn of (27.55) and (27.56), instead of Yn of (27.53) and (27.54)? Fig. 28. 1 shows (apart from the omission of the point Xn(l), to be explained below) the graph of the process Xn corresponding to the Yn sketched in Fig. 27.2. Xn as shown does not lie in Cro, I J but it does lie in Dro, I J , the space of cadlag functions on the unit interval (see 5.27), of which Cro,I J is a subset. Henceforth, we will write D to mean D ro, I J when there is no risk of confusion with other usages.
Fig. 28. 1 As shown in 26.3, D is not a separable space under the uniform metric, which means that the convergence theory of Chapter 26 will not apply to (D,du). du is not the only metric that can be defined on D, and it is worth investigating alternatives because, once the theory can be shown to work on D in the same kind of way that it does on C, a great simplification is achieved. Abandoning du is not the only way of overcoming measurability problems. Another approach is simply to agree to exclude the pathological cases from the field of events under consideration. This can be achieved by working with the a-field <JJn , the restriction to D of the projection a-field (see §27. 1). In con trast with the case of C, <JJn c c.Bn (compare 27.6) and all the awkward cases such as uncountable discrete subsets are excluded from <JJn , while all the ones likely to arise in our theory (which exclusively concerns convergence to limit points lying
Cadlag
Functions
457
in C) are included. Studying measures on the space ((D,du),Tv) is an interesting line of attack, proposed originally by Dudley (1966, 1967) and described in detail in the book by Pollard (1984). While this approach represents a large potential simplification (much of the present chapter could be dispensed with), an early decision has to be made about which line to adopt; there is little overlap between this theory and the methods pioneered by Skorokhod (1956, 1957), Prokhorov (1956), and Billingsley (1968), which involves metrizing D as a separable complete space. Although the technical overheads of the latter approach are greater, it has the advantage that, once the investment is made, the probabilistic environment is familiar; at whatever remove, one is still working in an analogue of Euclidean space for which all sorts of useful topological and metric properties are known to hold. There is scope for debate on the relative merits of the two approaches, but we follow the majority of subsequent authors who take their cue from Billingsley's work. The possibility of metrizing D as a separable space depends crucially on the fact that in D the permitted departures from continuity are of a relatively limited kind. The only ones possible are jump discontinuities (also called 'discontinuities of the first kind'): points t at which l x(t) - x(t-) 1 > 0. There is no possibility of isolated discontinuity points t at which both lx(t) - x(t-) I and I x(t) - x( t+) I are positive, because that would contradict right-continuity. There is however the possibility that x(l) is isolated; it will be necessary to discard this point, and let x(l) = x(l-) . This is a little unfortunate, but since we shall be studying convergence to a limit lying in C[O I J (e.g., B), it will not change anything material. We adopt the following definition. 28.1 Definition D[o J ] is the space of functions satisfying the following condi tions: (a) x(t+) exists for t E [0, 1 ) ; (b) x(t-) exists for t E (0 , 1 ]; (c) x(t) = x(t+), t < 1, and x(1) = x(1-). o The first theorem shows how, under these conditions, the maximum number of jumps is limited. 28.2 Theorem There exists, for all x E D and every £ > 0, a finite partition { tJ, ... ,tr } of [0, 1] with the property sup lx(t) - x(s) I < £ (28 . 1 ) ,
,
s,te [t;-[,t;)
for each i = 1 , . . . ,r. Proof This is by showing that tr = 1 for a collection { t1 , ... ,tr } satisfying (28. 1), with to = 0. For given x and £ let t = sup{ tr } , the supremum being taken over all these collections. Since x(t-) exists for all t > 0, t belongs to the set; that is, there exists r such that t = tr. Suppose tr < 1, and consider the point tr + 8 s 1 , for some 8 > 0. By definition of tn l x(tr + 8) - xUr- d l � £. Hence consider the interval [tntr + 8). By
458
The Functional Central Limit Theorem
choice of () we can ensure by right continuity that l x(t, + D) -x(t,) I < E. Hence there exists an (r + 1 )-fold collection satisfying the conditions of the theorem. We must have t :?: tr+ l = t, + D, and the assertion that t, = t is contradicted. It follows that t, = 1. • This elementary but slightly startling result shows that the number of jump points at which I x(t) -x(t-) I exceeds any given positive number are at most finite. The number of jumps such that I x(t) - x(t-) I > 1/n is finite for every n, and the entire set of discontinuities is a countable union of finite sets, hence count able. Further, we see that (28.2) sup l x(t) l < =, t
since for any t E [0,1], x(t) is expressible according to (28.1) as a finite sum of finite increments. The modulus of continuity wx(D) in (27 . 14) provides a means of discriminating between functions in Cro,I J and functions outside the space. For just the same reasons, it is helpful to have a means of discriminating between cadlag functions and those with arbitrary discontinuities. For () E (0 , 1 ), let Il0 denote a partition { t 1 , ... ,t,} with r s [1/D] and mini { ti - ti- d > (), and then define
{ {
w;(<>) = inf max 110
l :S t :S r
sup I x(t) - x(s) I
s,tE [t;-J,t;)
}}
·
(28.3)
Let' s attempt to say this in English! w;(<>) is the smallest value, over all partitions of [0, 1 ] coarser than (), of the largest change in x within an interval of the partition. This notion differs from, and weakens, that of wx(D) , in that w;(<>) can be small even if the points ti are jump points such that wx(D) would be large. For () < � there is always a partition I10 in which ti - ti- l < 2() for some i, so that for any x E D, (28.4) for () < �· So obviously, lim0 0w;(<>) = 0 for any x E C. On the other hand, �
lim w�(D) = 0
3�0
(28.5)
is a property which holds for elements of D, but not for more general functions. 28.3 Theorem If and only if x E D, 3 () such that w;(<>) < E, for any E > 0. Proof Sufficiency is immediate from 28.2. Necessity follows from the fact that if x i: D there is a point other than 1 at which x is not right-continuous; in other words, a point t at which I x(t) - x(t+) I :?: E for some E > 0. Choose arbitrary () and consider (28.3). If t i:. ti for any i, then w�(D) :?: E by definition. But even if t = ti for some i, ti E [ti,ti+ I) and l x(ti) - x(t,+) l :?: E, and again w;(<>) :?: E.
•
Cadlag Functions 28.2 Metrizing
459
D
Recall the difficulty presented by the existence of uncountable discrete sets in (D,du), such as the sets of functions o, o � t < e xe(t) = (28. 6) 1, e � t � 1, the case of (5.43) with a = 0 and b = 1 . We need a topology in which xe and Xe' are regarded as close when I 9 - 9' 1 is small. Skorokhod (1956) devised a metric with this property. Let A denote the collection of all homeomorphisms A: [0, 1] f--7 [0, 1] with A(O) = 0 and A( 1 ) = 1 ; think of these as the set of increasing graphs connecting the opposite corners of the unit square (see Fig. 28.2). The Skorokhod J1 metric is defined as
{
ds(x,y) = inf {E > 0: sup I A(t) - tl � E, sup l x(t) -y(A(t)) I � E} . t
AEA
t
(28.7)
In his 1956 paper Skorokhod proposes four metrics, denoted J l , J2, M l, and M2. We shall not be concerned with the others, and will refer to ds as is customary, as 'the' Skorokhod metric.
0
t
1
Fig. 28.2 It is easy to verify that ds is a metric, if you note that sup 1 I A(t) - t I = suprl t - A- \ t) l and suprl x(t) -y(A(t)) l = sup1 l x(A- 1(t)) - y(t) l , where A- 1 E A if A E A. While in the uniform metric two functions are close only if their vertical separation is uniformly small, the Skorokhod metric also takes into account the possibility that the horizontal separation is small. If x is uniformly close to y except that it jumps slightly before or slightly after y, the functions would be considered close as measured by ds, if not by du. Consider xe in (28.6), and another element Xe+o· The uniform distance between these elements is 1 , as noted above. To calculate the Skorokhod distance, note that the quantity in braces in (28.7) will be 1 for any A for which A(9) * 9 + 8. Confining consideration to the subclass of A with A(9 ) = 9 + 8, choose a case
460
The Functional Central Limit Theorem
where I A(t) - t l ::s; 8 (for example, the graph { t, A(t) } , obtained by joining the three points (0,0), (9,9 + 8), and (1, 1) with straight lines, will fulfil the definition) and hence, (28.8) This distance approaches zero smoothly as 8 J- 0, whi�h might conform better to our intuitive idea of 'proximity' than the uniform metric in these circumstances. 28.4 Theorem On C, ds, and du are equivalent metrics. Proof Obviously ds(x,y) � du(x,y), since the latter corresponds to the case where A is the identity function in (28. 7). On the other hand, for any A,
du(x,y) � sup lx(t) -y(A(t)) ! + sup l y(A(t)) - y(t) l . t
t
(28.9)
Suppose y is uniformly continuous. For every E > 0 there must exist 8 > 0 such that, if ds(x,y) < 8 (and hence sup1 I A(t) - t I < 8), then sup, I y(A(t)) -y(t) I < E. In other words, (28. 10) ds(x,y) < 8 => du(x,y) < 8 + E. The criteria of (5.5) and (5.6) are therefore satisfied. Uniform continuity is equivalent to continuity on [0, 1], and so the stated inequalities hold for all y E C. . The following result explains our interest in the Skorokhod metric. 28.5 Theorem (D,ds) is separable. Proof As usual, this is shown by exhibiting a countable dense subset. The counter part in D of the piecewise linear function defined for C is the piecewise constant function (as in Fig. 28. 1) defined as y(t) = y(t;), t E [ti, t;+ I ) i = O, ... ,m - 1 , (28. 1 1 ) where the y(ti) are specified real numbers. For some n E IN , define the set A n as the countable collection of the piecewise constant functions of form (28 . 1 1), with ti = i/2n for i = 0, ... ,2n - 1 , and y(t;) assuming rational values for each i. Letting A denote the limit of the sequence {An} , A is a set of functions taking rational values at a set of points indexed on the dyadic rationals [), and hence is countable by 1.5. According to 28.2, there exists for x E D a finite partition (t1 , ... ,tm } of [0, 1], such that, for each i, sup lx(s) - x(t) l < £. s,t E [t;-],t;)
Let y be a piecewise constant function constructed on the same intervals, assuming rational values Y I · · · · ·Ym where y; differs by no more than E from a value assumed by x on [t;,t;+1 ). Then, ds(x,y) < 2E. Now, given n � 1, choose z E An such that n Zj = Yi when j/2 E [t;,t;+ 1 ). Since [) is dense in [0, 1], ds(y,z) � 0 as n � oo .
461
Cadlag Functions
Hence, ds(x,z) � ds(x,y) + ds(y,z) is tending to a value not exceeding 2£. Since by taking m large enough E can be made as small as desired, x is a closure point of A. And since x was arbitrary, we have shown that A is dense in D. • Notice how this argument would fail under the uniform metric in the cases where x has discontinuities at one or more of the points ti . Then, du(y,z) will be small only if the two sets of intervals overlap precisely, such that ti = j/2n for some j. If ti were irrational, this would not occur for any finite m, since j/2n is rational. Under these circumstances x would fail to be a closure point of A. This shows why we need the Skorokhod topology (that is, the topology induced by the Skorokhod metric) to ensure separability. Working with ds will none the less complicate matters somewhat. For one thing, ds does not generate the Tychonoff topology, and the coordinate projections are not in general continuous mappings. The fact that x and y are close in the Skorokhod metric does not imply that x(t) is close to y(t) for every t, the examples of xe and xe+o cited above being a case in point. We must therefore find alternative ways of showing that the projections are measurable . •
i
1
.
: � - �;
I n
0
I 2
�
i:
t
Fig. 28.3 And there is another serious problem: (D,ds) is not complete. This is easily seen by considering the sequence of elements { Xn} where 1, t E [�, 1 + �) (28. 12) xn(t) = 0, otherwise (see Fig. 28.3). The limit of this sequence is a function having an isolated point of discontinuity at �' and hence is not in D. However, to calculate ds(xn,Xm) A. must be chosen so that A.(�) = �' and A.(� + n = � + k; the distance is 1 for any other choice. The piecewise-linear graph with vertices at (0,0), G,�), (� + k � + k), and (1, 1 ) fulfils the definition, and satisfies (28. 7). It appears that ds(Xn,Xm) = I k - k I , and so { Xn } is a Cauchy sequence.
{
,
28.3 Billingsley' s Metric
The solution to this problem is to devise a metric that is equivalent to ds (in the sense of generating the same topology, and hence a separable space) but in
The Functional Central Limit Theorem
462
which sequences such as the one in (28. 12) are not Cauchy sequences. Ingenious alternatives have been suggested by different authors. The following is due to Billingsley (1968), from which source the results of this section and the next are adapted. Let A be the collection of homeomorphisms /.. from [0, 1 ] to [0, 1 ] with /..(0) = 0 and /..( 1 ) = 1 , and satisfying lit.. I I
=
l
sup log/..(t)t -- s/..(s) t#s
I
< =.
(28. 1 3)
Here, I I/.. II : A � [R + is a functional measuring the maximum deviation of the gradient of /.. from 1 , so that in particular ll t.. l l = 0 for the case /..(t) = t. The set A is like the one defined for the Skorokhod metric with the added proviso that l it.. I I be finite; both /.. and /..- I must be strictly increasing functions. Then define
ds(x,y)
=
inf {c: > 0: llt.. l l ::; E, sup lx(t) - y(/..(t)) l t
AEA
::;
}
c: .
(28. 14)
We review the essential properties of d8. 28.6 Theorem d8 is a metric. Proof d8(x,y) = 0 iff x = y is immediate. d8(x,y) = d8(y,x) is also easy once it is noted that llt..- 1 1 1 = llt.. l l . To show the triangle inequality, note that
{ {
� sup t# s
� sup log f *- S
(f...t (t) - At (s))(/..2(t') - /..2(s')) -....,. . ' -. s ') (t - s)-..,.( t..,...
-----.,.---=-
for arbitrary t' and s'. On setting t'
=
_ _ _
/..1 (t) and s'
=
}
(28. 15)
/..2(s), we obtain
(28. 16) where /..1 o/..2(t) = /.. 1 (�(t)), and f... 1 of...2 is clearly an element of A Since (28. 1 7) sup l x(t) - z(/..2(/.. t (t))) l ::; sup l x(t) -y(f...t (t)) l + sup l y(t) - z(/..2(t)) l t
t
t
by the ordinary triangle inequality for points of IR , the condition ds(x,z) :::;; ds(x,y) + ds(y,z) follows from the definition. • Next we explore the relationship between ds and d8, and verify that they are equivalent metrics. Inequalities going in both directions can be derived provided the distances are sufficientlv small. Given functions x and v for which do(x_v) =
Cadlag Functions
463
£ s �' consider A E A satisfying the definition of d8 for this pair, such that, in particular, II A II s £. Since A(O) = 0, there evidently must exist t E (0, 1] such that l log(A(t)/t) I s II All , or e
-E
< _
A(t) < e E. (28 . 1 8) t we find eE - 1 s 2£ for £ s �' and e -£ - 1 � - 2£ -
-
Using the series expansion of eE, similarly, which implies that (28. 1 9) -2£ s t( e -£ - 1) s A(t) - t s t(e£ - 1) s 2£, or, I A(t) - tl s 2£. And in view of our assumption about A, suptl x(t) - y(A(t)) I s £ and hence ds(x,y) cannot exceed 2£. In other words, ds(x,y) s 2d8(x,y) (28.20) whenever d8(x,y) s �Now consider a function Jl E A which is piecewise-linear with vertices at the points of a partition III), as defined above (28.3) for a suitable choice of 3 to be specified. The slope of Jl is equal to (J.L(ti} - J.LUi- 1 ))/(ti - ti l ) on the inter vals [tH,ti), where ti - ti- 1 > 3. Notice that, if supt i J.L(t) - tl s- 32 ,
I
J.LUa - J.LCti-1 ) ti - ti - l
1
d (28.21) 1 s I J.LCti) - td + I J.L(ti- 1 ) - ti - s 23. 3 For I x I s �' the series expansion log { 1 + x} = x - ¥2 + !x3 - . .. implies (28.22) l log{ 1 + x} l s max { lx l , l x - x2 1 } s 2 l x l . Substituting for x in (28.22) the quantity whose absolute value is the minorant side of (28.21), we must conclude that, if supt l J.L(t) - tl s 32 for 0 < 3 s !, then (28.23) 111111 s 43 . Now, suppose ds(x,y) = 32, which means there exists A E A satisfying sup i A(t) - t l s 32, and supt ly(t) - x(A(t)) l s 32. Choose Jl as the piecewise linear function with J.L(tD = A(tD for i = O, . . . , r. The function A- 1 Jl is 'tied down ' to the diagonal at the points of the partition; that is, it is increasing on the intervals [ti-l ,tD with A-1 J.L(t) E [tH,ti) if and only if t E [ti- I , ti). Therefore, choosing II8 to correspond to the definition of w�(3), we can say I x(t) -x(J.L(t)) I s I x(t) - x(A- 1 J.L(t)) I + I x(A - 1 J.L(t)) - x(J.L(t)) I 2 s w�(3) + 3 . (28.24) -
t
Putting this together with (28.23) gives for 0 < 3 s ! the inequality d8(x,y) s max { 43, w�(3) + 32 } s w�(3) + 43. (28.25) Since for x E D we may make w�(3) arbitrarily small by choice of 3, we have
The Functional Central Limit Theorem
464
ds(x,y) S 4ds(x,y) 1 12 (28.26) whenever ds(x,y) is sufficiently small. We may conclude as follows. 28.7 Theorem In D, metrics ds and ds are equivalent. Proof Given £ > 0, choose 8 s a' and also sma)l enough that w;(8) + 48 s £. Then, for 11 s min { 82, 1£} , ds(x,y) < 11 � ds(x,y) < £, (28.27) ds(x,y) < 11 � ds(x,y) < £, (28.28) by (28.20) and (28.25) respectively, The criteria of (5.5) and (5.6) are therefore satisfied. • Equivalence means that the two metrics induce the same topology on D (the Skorokhod topology). Given a sequence of elements {xn}, ds(Xn,x) � 0 if and only if ds(xmx) � 0, whenever x E D. But it does not imply that {xn } is a Cauchy sequence in (D,ds) whenever it is a Cauchy sequence in (D,ds), because the latter space is incomplete and a sequence may have its limit outside the space. It is clear in particular that ds(xn,x) ----7 0 only if ds(xn,X) � 0 and lim.s�ow�(8) = 0. For example, the sequence of functions in (28. 12) is not a Cauchy sequence in (D,dB). To define dB(Xn,Xm) (for n � 3, m � 4) it is necessary to find the element of A for which "-G) 1 and A.(1 + k) 1 + A, and whose gradient deviates as little as possible from 1 . This is obviously the same piecewise-linear function, with vertices at the points (0,0), (1.-D, G + k, 1 + k) and ( 1 , 1 ), as defined for ds. But the maximum gradient is n/m, corresponding to the segment connecting the second and third vertices. dB(Xn,Xm) = min { 1 , I log(nlm) I } , which does not approach zero for large n and m (set m = 2n for example). 28.8 Theorem The space (D,ds) is complete. Proof Let {yk , k E IN } be a Cauchy sequence in (D,ds) satisfying dsCYbYk+ I ) < 1 12k, implying the existence of a sequence of functions { Ilk E A} with (28.29) sup I Yk(t) -Yk+l (llk(t)) I < 1 12k, =
=
,
t
(28.30) It follows from (28.20) that SUPrl llk+m(t) - t l S 212k+m for m > 0. Define llk,m f.Lk+m 0 f.Lk+m - I 0 0f.Lb also an element of A for each finite m; the sequence {llk,m , m = 1 ,2, ... } is a Cauchy sequence in (C,du) because sup l llk,m+ l (t) - llk,m(t) I = sup l l!k+m+ l (s) - s I S 112k+m. (28.3 1) t s Since (C,du) is complete there exists a limit function Ak = Iimk�oo llk,m· To show that Ak E A, it is sufficient to show that II A.k il < oo. But by (28. 16),
=
•••
Cadlag Functions
465
m m 1 1 (28.32) m k L k < k k+ 1 1 !-! L j ll ll :=:; + �;+- :=:; N' :=:; ll !-! 0!-l l 0 ••• 01-l + j=O j=O 2 J 2 for any m, so 111�- k I I :::; 112k- I . Note that Ak = 'Ak+1 °1-lh so that 'Akll = !-!k 0 Ak 1 and hence, by (28.29), sup 1 Yk(Ak 1 (t)) - Yk+l('AklJ(t)) = sup I Yk(s) - Yk+I(!-!k(s)) < 112k. (28.33) 11 !-!k,m ll
I
t
s
I
So consider the sequence {yk o')...k 1 E D, k E [N } . According to (28.33) this is actu ally a Cauchy sequence in (D,du). But the latter space is complete; this is easily shown as a corollary of 5.24, whose proof shows completeness of (C,du) without using any of the properties of C, so that it applies without modification to the case of D. Hence Yk 0 A"k 1 has a limit y E D. Since this means both that supr l yk(t) - y(Ak(t)) I = supr iYk(Ak 1 (t)) - y(t) I � 0 and that II'Adl = II'Ak1 11 � 0, d8(yk,y) � 0 and so {yd has a limit y in (D,dn). We began by assuming that {yd was a Cauchy sequence with dn(yk >Yk+I ) < 1 12 k. But this involves no loss of generality because it suffices to show that any Cauchy sequence {xm n E [N } contains a convergent subsequence {Yk = Xnk' k E [N } . Clearly, a Cauchy sequence cannot have a cluster point which is not a limit point. Every Cauchy sequence contains a subsequence with the required property; if dn(XmXn+I ) < 1/g (n) � 0 (say), choosing nk ? g - 1 (2 k) is appropriate. This completes the proof. • 28.4 Measures
on
D
We write 'Bv for the Borel field on (D,dn). Henceforth, we will also write just D to denote (D,dn), and will indicate the metric specifically only if one different from dn is intended. The basic property we need the measurable space (D,'Bv) to possess is that measures can be fully specified by the finite-dimensional sets. An argument analogous to 27.6 is called for, although some elaboration will be needed. In particular, we have to show, without appealing to continuity of the projections, that the finite-dimensional distributions are well-defined and that there are finite-dimensional sets which constitute a determining class for (D,'Bv). We start with a lemma. Define the field of finite-dimensional sets of D as 1fv = { H n D: H E 1f } , where 1f was defined in § 27 . I . 28.9 Lemma Given x E D, a > 0, and any t1 , ... ,tm E [0, 1], let Hm(x,a)
=
{
y E D: 3 '!. E A s.t. 11'1.11 < a,
Then Hm(x,a) E 1-fv. Proof Since Hm (x,a)
}
���:: I y(t;) - x(A.(t;)) I < a
(28.34)
D, all we have to show according to (27.2) is that 1t1 1 , ... ,rm(Hm(x,a)) E 'B . This is the set whose elements are (y(tt ), ... ,y(tm )) for each y E Hm(x,a). To identify these, first define the set s
m
The Functional Central Limit Theorem
466
(28.35) Then it is apparent that 1t11, ... ,tm (Hm(x,a)) =
{
bJ, . . . ,bm: max l ai - b d
< a,
(a1, ... ,am)
lsism
E Am(x,a)
}
m
� IR .
(28.36)
In words, this is the set Am(x,a) with an open a-halo, and it is an open set. It therefore belongs to :Em . • To compare the present situation with that for C, it may be helpful to look at the case k = 1 . The one-dimensional projection nr(Hr(x,a)), where (28.37) Hr(x,a) = {y E D: 3 'A E A s.t. IIJ... I I < a, l y(t) -x(J...(t)) l < a } , is in general different from S(x(t),a), that is, the interval of width 2a centred on x(t). If x is continuous at t the difference between these two sets can be made arbitrarily small by taking a small enough, and at these points the projections are in fact continuous. Since the discontinuity points are at most countable, they can be ignored in specifying finite-dimensional distributions for x, as will be apparent in the next theorem. However, the point that matters here is that we have the material for the extension of a measure to (D,:BD) from the finite-dimensional distributions. It is easily verified that JfD, like Jf, is a field. The final link in the chain is to show that JfD is a determining class for (D,BD). 28. 10 Theorem (cf. Billingsley 1968: th. 14.5) :ED = cr(JfD). Proof An open sphere in (D,d8) is a set of the form S(x,a)
=
{ y E D : d8(y,x) < a}
=
{y E D: 3 'A E A s.t.
I I J... I I
< a,
s�p l y(t) - x(J...(t)) l
}
< a
(28.38)
for x E D, a > 0. Since these sets generate :ED, it will suffice to show they can be constructed by countable unions and complements (and hence also countable intersections) of sets in JfD. Let H(x,a) = n'k'=rHk(x,a), where Hk(x,a) is a set with the form of Hm defined in (28.34), but with m = 2k - 1 and ti = i/2k, so that the set { t1 , ... hk-d converges on [) (the dyadic rationals) as k � oo Consider y E H(x,a). Since y E Hk(x,a) for every k, we may choose a sequence { 'Ad such that, for each k � 1 , (28.39) I I A.k l l < a, k (28.40) max I y(2 - i) - x(J...k(Tki)) I < a. .
l s i s 2L 1
Making use of the fortuitous fact that J...k has the properties of a c.d.f. on [0, 1 ] , Helly' s theorem (22.21) may be applied to show that there is a subsequence f 'Ar--_,
Cadlag Functions n
467
e
IJ'.J } converging to a limit function /.., which is non-decreasing on [0, 1 ] with /..(0) = 0 and /..( 1) = 1 . /.., is necessarily in A, satisfying l l t.. l l � a (28.41) according to (28.39). And in view of (28.40), and the facts that /..k(t) � /..(t) and x is right-continuous on [0, 1), it must also satisfy either l y(t) -x(/..(t)) l � a or I y(t) - x(/..(t)-) I � a for every t e []) . Since []) is dense in [0, 1], this is equivalent to sup I y(t) - x(/..(t)) I � a. (28.42) t
The limiting inequalities (28.41) or (28.42) cannot be relied on to be strict, but comparing with (28.38) we can conclude that y e S(x,a). This holds for all such y, so that H(x,a) � S(x,a). Put a = r- lin, and take the countable union to give 00
U H(x, r - lln)
n= l
�
U S(x, r- lln)
n=l
=
S(x,r).
(28.43)
It is also evident on comparing (28.34) with (28.38) that S(x,a) � Hk(x,a) for a > 0. Again, put a = r - 1/n, and
S(x,r)
=
00
U S(x, r - 1/n)
n=l
�
U H(x, r- 1/n).
n=l
(28.44)
It follows that, for any X E D and r > 0, S(x,r) u;= l nk=!Hk(x, r - lin) where Hk(x, r - 1/n) e JfD. This completes the proof. • The defining of measures on (D,'BD) is now possible by arguments that broadly parallel those for C. The one pitfall we may encounter when assigning measures to finite-dimensional sets is that the coordinate projections of 'BD sets may have no 'natural' interpretation in terms of observed increments of the random process. For example, suppose Xn e D is the process defined in (27.55) and (27.56), with respect to the underlying space (Q.,Wf,P). It is not necessarily the case that ntCXn (W)) is measurable with respect to W'n,[nt] = a(Uni, i � [nt]), as casual intuition might suggest. A 'BD-set like HtCx,a) in (28.37) is the image under the mapping Xn : Q H D of a set E e W'; in fact, we could write E = K,; 1 (n�1 (B)), where B e 'B. But E depends on the value that x assumes at /..(t), and if /..(t) > t then E cannot be in W'n, [nt] · However, this difficulty goes away for processes lying in C almost surely. In view of28.4, we may 'embed ' ((C,du),'Bc) in ((D,d8),'BD) and a p.m. defined on the former space can be extended to the latter, with support in C. In particular, Wiener measure is defined on (D,'BD) by simply augmenting the conditions in 27.7 with the stipulation that W(x e C) = 1 . =
2 8 . 5 Prokhorov ' s Metric
The material of this section is not essential to the development since Billings ley ' s metric is all that we need to work successfully in D. But it is interesting
The Functional Central Limit Theorem
468
to compare it with the alternative approach due to Prokhorov (1956). We begin with an alternative approach to defining a continuity modulus for cadlag functions. Let
Wx(O)
�
max
{_
'<,
��� '
'
(min { I x(t') - x(t) I . I x(f') -x(t) I ) ),
}
sup l x(O) - x(o) l , sup l x(o) - x( 1 ) 1 . 1 -o< t s l
O�t
(28.45)
Again, it may be helpful to restate this definition in English. The idea is that, for every t E [8, 1 - 8], a pair of adjacent intervals of width 8 are constructed around the point, and we determine the maximum change over each of these inter vals; the smaller of these two values measures the 8-continuity at point t, and this quantity is supped over the interval. This means that the function can jump discontinuously without affecting wx(o), so long as no two jumps are too close together. The exceptions are the two points 0 and 1 , which for wx(8) ----:f 0 must be true continuity points from the right and left respectively. The following theorem parallels 28.3. 28.11 Theorem If and only if x E D, lim wx(o) Proof Suppose x E
=
0.
(28.46)
D. By 28.1(c), the second and third terms under the 'max' in (28.45) definitely go to zero with o. Hence consider the first term. Let { tk>tk,t'k } denote the sequence of points at which the supremum is attained on setting o = llk for k = 1 ,2, . .. Assume tk --7 t. (If need be consider a convergent subsequence.) Then tk --7 t and t'[; --7 t. Since x(t) = x(t+) , this implies l x(tk) - x(t'k) l --7 0, which proves sufficiency. Now suppose w (8) --7 wx(O) > 0. Since wx(O) = max { l x(O) - x(O+) l . l x(1) - x(1 -) I . min { l x(t) - x(t-) I , l x(t) - x(t+) I } } , it follows that x 12: D, proving necessity. • Now define the function wx(e +), z < 0, (28.47) wx (z ) = wxC 1), z 2 o. This is non-decreasing, right-continuous, bounded below by 0 and above by wx(1). It therefore defines a finite measure on IR , just as a c.d.f. defines a p.m. on IR . B y defining a family of measures i n this way (indexed on x) on a separable space, we can exploit the fact that a space of measures is metrizable. In fact, we can use Levy' s metric L* defined in (26.33). The Prokhorov metric for D is (28.48) x
_
{�
z
Cadlag Functions
469
where rx and ry are the graphs of x and y and dH is the Hausdorff metric. The idea here should be clear. With the first term alone, we should obtain a property similar to that of the Skorokhod metric; if we write d(x(t),ry) = inf1'dE(x(t),y(t')), then dH(rx,ry)
=
{
}
max sup d(x(t),ry), sup d(rx,y(t')) . t
t'
(28.49)
In words, the smallest Euclidean distances between x(t) and a point of y, and y(t) and a point of x, are supped over t. For comparison, the Skorokhod metric minimizes the greater of the horizontal and vertical distances separating points on rx and ry in the plane, subject to the constraints imposed on the choice of 'A such as continuity. In cases such as the functions xe of (28.6), xe and xe+o are close in (D,dH) when 8 is small. (Think in terms of the distances the graphs would have to be moved to fit over one another.) The purpose of the second term is to ensure completeness. By 28.11, limz--7 -oowx(z) = 0 if and only if x E D; otherwise this limit will be strictly positive. Unlike the case of (D,dH), it is not possible to have a Cauchy sequence in (D,dp) aproaching a point outside the space. It can be shown that dp is equivalent to ds, and hence of course also to dB, and that the space (D,dp) is complete. The proofs of these propositions can be found in Parthasarathy ( 1967). For practical purposes, therefore, there is nothing to choose between dp and dB. 2 8 . 6 Compactness and Tightness in
D
The remaining task is to characterize the compact sets of D, in parallel with the earlier application of the ArzeUt-Ascoli theorem for C. 28. 12 Theorem (Billingsley 1968: th. 14.3) A set A c D is relatively compact in (D,dB) if and only if sup sup l x(t) I < =, (28.50) XEA
t
lim sup w�(8)
8--70 X E A
=
0.
o
(28.51 )
This theorem obviously parallels 27.5 but there are significant differences in the conditions. The modulus of continuity w� appears in place of Wx which is a weaken ing of the previous conditions, but, on the other hand, (28.50) replaces (27. 16). Instead of sup1 I x(t) I we could write dB( I xI ,0), where 0 denotes the element of D which is identically zero everywhere on [0,1]. It is no longer sufficient to bound the elements at one point of the interval to ensure that they are bounded every where: the whole element must be bounded. A feature of the proof that follows, which is basically similar to that of 5.28, is that we can avoid invoking completeness of the space until, so to speak, the last moment. The sufficiency argument establishing total boundness ofA is couched in terms of the more tractable Skorokhod metric, and then we can exploit the Pnni v�lPnf'P of rln with � t'omnlPtP mPtri f' <;:Jwh �" rln to
o-Pt thP: romn::�rtnP:"" of A
The Functional Central Limit Theorem
470
The argument for necessity also uses ds to prove upper semicontinuity of w�(8), a property that, as we show, implies (28.5 1 ) when the space is compact. Proof of 28.12 Let sup E A sup l x(t) l = M. To show sufficiency, fix E > 0 and choose m as the smallest integer such that both l im < !E and sup E A w�( l lm) < !E. Such an m exists by (28.51). Construct the finite collection Em of piecewise constant functions, whose values at the discontinuity points t = jim for j = O, . . . ,m - 1, are drawn from the set {M(2u/v - 1), u = 0, 1 , ... ,v } where v is an integer exceeding 2M/E; hence, Em has (v + 1 )m different elements. This set is shown to be an E-net for A. Given the definition of m, one can choose for x E A a partition il1 1m { t , ,tr } , defined as above (28.3), to satisfy max sup l x(t) - x(s ) l < �E. (28.52) x
1
1
x
}
{
• . •
=
l s i s r s,t E (t;_1,t;)
For i = O, . .. ,r - 1 let ji be the integer such that j/m � ti < (ji + 1)/m, noting that, since the ti are at a distance more than 1/m, there is at most one of them in any one of the intervals Ulm, (j + 1)/m), j = O, ... ,m - 1 . Choose a piecewise linear function A, E A with vertices /..(j/m) ti , i = O, ... ,r. Since I ti - Nm l � lim, maxos i s r l A(j/m) -Nm l � 1E, and the linearity of A between these points means that sup I A(t) - t l � �E. (28.53) =
t
By construction, A maps points in Ulm, (j + 1)/m) into (ti, ti+ l ) whenever ji � j $ ii+l • and since x varies by at most !E over intervals [ti,ti+J), the composite function xoA, can vary by at most 1E over intervals rJim,(j + 1 )/m). An example with m = 10 and r 4 is sketched in Fig. 28.4; here, }1 2, h = 4 and h = 6. The points to, . .. ,t4 must be more than a distance 1/10 apart in this instance. One can therefore choose y E Em such that (28.54) j y(jlm) - x(A(jlm)) l < !t:, j O, .. . ,m - 1. =
=
=
1
to -J<--.---f--L-r--+'--r--+--'-r-.--l 0 1 2 Fig. 28.4
Functions 471 Since y(t) y(j/m) for t E Ulm, (j + 1)/m), we have by (28. 52) and (28. 54), s�p I y(t) -x(A(t)) I ,:�J I y(jlm) -x(A(j/m)) I 0 sup l x(A(j/m))-x(A(t)) l } + < E. (28.55) Together, (28. 5 5) and (28. 5 3) imply ds(x,y) :::; £, showing that is an £-net for A Cadlag
=
,;
te Ulm,(j+ l )lm)
Em
as required. This proves that A is totally bounded in (D,ds). But since ds and d8 are equivalent (28.7), A is also totally bounded in (D,d8); in particular, if Em is an £-net for A in (D,ds), then we can find 11 such that it is also an 11-net for and where 11 can be set arbitrarily A in (D,d8) according to small. Since (D,d8) is complete, A is therefore compact, proving sufficiency. When A is totally bounded it is bounded, proving the necessity of To show the necessity of we show that the functions = are upper semicontinuous on (D,ds) for each This means that the sets Bm = < E} are open in (D,ds) for each E > 0. By equivalence of the metrics, they are also open in (D,d8). In this case, for any such £, the sets Bm, IN } are an open covering for D by 28.3. Any compact subset of D then has a finite subcovering, or in other words, if A is compact there is an such that A � Bm. By definition of Bm, this implies that holds. To show upper semicontinuity, fix E > 0, () > 0, and D, and choose a parti tion I1 0 satisfying
(28.27)
(28.28),
(28.51),
(28.50). w'(x,llm) w�(llm)
m.
w�(llm)
{x:
{ mE
(28.51)
max
{
sup
l :::; i :::; r s,t e [ti-J.fi)
m
xE
l x(t)-x(s) l } < w�(D) + !£.
Also choose 11 < !£, and small enough that > D + 211. max
(28. 56)
(28.57) {ti -ti-d Our object is to show, after (5. 3 2), that if y E D and ds(x,y) < 11 then w;(8) < w�(D) + E. (28.58) If ds(x,y) < 11 there is A. E A such that sup jy("A.(t))-x(t)l < 11 (28.59) and sup I A.(t)-t l < (28. 60) Letting A.(tD, (28.57) and (28.60) and the triangle inequality imply that (28.6 1) 1 :::;: i :::;; r
t
fl .
t
si =
l :::; i :::; r
l :::; i:::; r
The Functional Central Limit Theorem
472
If both s and t lie in [t;- J,t;) A(s) and A(t) must both lie in [s;-1 ,s;). It follows by (28.56), (28.59), and the choice of 11 that
�t:�
�
p J y(s) - y(t) l , '"l
}
<
wili) + £.
(28.62)
In view of (28.61), this shows that (28.58) holds, and since E and x are arbitrary the proof is complete. • This result is used to characterize uniform tightness of a sequence in D. The next theorem directly parallels 27.12. We need completeness for this argument to avoid having to prove tightness of every �n' so it is necessary to specify an appropri ate metric. Without loss of generality, we can cite d8 where required. 28.13 Theorem (Billingsley 1968: th. 15.2) A sequence { �n } of p.m.s on ((D,d8), <.En) is uniformly tight iff there exists N E IN such that, for all n � N, (a) For each 11 > 0 there exists M such that �n ( {x: sup l x(t) l > M}) t
(28.63)
:<::; TJ ;
(b) for each E > 0, 11 > 0 there exists 8 E (0, 1) such that �n( {x: w�8) � E}) :<::; TJ .
(28.64)
be uniformly tight, and for 11 > 0 choose a compact set K with �n(K) By 28.12 there exist M < oo and 8 E (0, 1 ) such that
Proof Let { �n } >
1
- TJ .
K
c
{x: sup l x(t) l t
:<::;
M} n { x: w;(8)
<
E}
(28.65)
for any E > 0. Inequalities (28.63) and (28.64) follow for n E IN , proving neces sity. The object is now to find a set satisfying the conditions of 28.12, whose closure K satisfies SUPn ::>:NJ..ln (K) > 1 - e for some N E IN and all e > 0. Because (D,d8) is a complete separable space, each �n is tight (26.19) and the above is sufficient for uniform tightness. As in 27.12, let �* stand for SUPn::>:NJ..ln · For e > 0, define (28.66) where { 8k} is chosen so that �*(Ak) > 1 - 8/2k+l , possible by condition (b). Also set B = {x: sup l x(t) l :<::; M} such that �*(B) > 1 - �8, possible by condition (a). Let K = (nk=IAk n B)-, and note that K satisfies the conditions in (28.50) and (28.51 ), and hence is compact by 28.12. With these definitions, the argument follows that of 27.12 word for word. • The last result of this chapter concerns an issue of obvious relevance to the functional CLT; how to characterize a sequence in D which is converging to a limit in C. Since in all our applications the weak limit we desire to establish is in C, no other case has to be considered here. The modulus of continuitv w_ 1s thP: t
Cadlag
Functions
473
natural medium for expressing this property of a sequence. Essentially, the following theorem amounts to the result that the sufficiency part of 27.12 holds in (D,d8) just as in (C,du). 28.14 Theorem (Billingsley 1968: th. 15.5) Let {J..ln } be a sequence of measures on ((D,d8),'BD). If there exists N E IN such that, for n :?: N, (a) for each 11 > 0 there is a finite M such that (28.67) J..ln ({x: lx(O) I > M}) $ 11 ; (b) for each E > 0, 11 > 0 there is a 8 E (0, 1) such that (28.68) )..ln({x: wx(8) :?: £}) $ 11; then {J..ln } is uniformly tight, and if J..l is any cluster point of the sequence, J..l(C) = 1. Proof By (28.4), if (28.68) holds for a given 8 then (28.64) holds for 8/2. Let k = [1 /8] + 1 (so that k8 > 1) where 8 > 0 is specified by condition (b). Then according to (28.68), J..ln({x: l x(ti/k) - x(t(i - 1)/k) l :?: E}) $ 11 for i = 1 , ... ,k, and t E [0, 1]. We have noted previously that
(28.69) where each of the k intervals indicated has width less than 8. It follows by (28.68) and (28.67) that
J..ln ({x: sup l x(t) l > M + kE}) :::; J..ln( {x: lx(O) I > M}) :::; 11, t
(28.70)
so that (28.63) also holds for finite M. The conditions of 28.13 are therefore satisfied, proving uniform tightness. Let J..l be a cluster point such that J.lnk :::::} J..l for some subsequence { nk> k E IN } . Defining A = {x: wx(8) :?: £}, consider the open set AD, the interior of A; for example, x E AD if wx(8/2) :?: 2£. Then by (d) of 26.10, and (28.68),
J..l(AD) :::; liminf J.ln/AD) k�oo
:::;
11·
(28 . 71)
Hence J..l(B) :::; 11 for any set B c AD. Since E and 11 are arbitrary here, it is pos sible to choose a decreasing sequence { 8j } such that J..l(Bj) :::; 1/j, where Bj = { x: wx(8j) :?: 1/j}. For each m :?: 1 , J..L(n}=mBj) = 0, and so, by subadditivity, J..L(B) = 0 where B = liminfBj. But suppose X E Be, where Be = n;;;= l uJ=mB} is the set {x: wx(8j) < 1/j, some j :?: m; all m E IN } . Since { 8j } is monotonic, it must be the case that lim0�0wx(8) = 0 for this x. Hence Be c C, and since J..l(Bc) = 1 , J..l(C) = 1 follows. •
29 FCLTs for Dependent Variables
29. 1 The Distribution o f Continuous Functions o n
D
A surprising fact about Wiener measure is that definition 27.7 is actually
redundant; if part (b) of that definition is replaced by the specification merely of the first two moments of x(t), Gaussianity of x(t) must follow. This fact leads to a class of functional CLTs of considerably greater power and generality than is possible with the approach of §27.6. 29.1 Theorem (Billingsley 1968: th. 19. l ) LetX be a random element of D ro,IJ with the following properties: (a) E(X(t)) = 0, E(X(t)2) = t, 0 :::;; t ::; 1 . (b) P(X E C) = 1 . (c) For any partition { t1 , ... ,tk } of [0, 1], the increments X(t2) - X(t 1 ), X(t3 ) - X(t2), ... , X(tk) - XCtk-J), are totally independent. Then X - B. o This is a remarkable theorem, in the apparent triviality of the conditions; if an element of D is a.s. continuous, independence of its increments is equivalent to Gaussianity ! The essential insight it provides is that continuity of the sample paths is equivalent to the Lindeberg condition being satisfied by the increments. The virtuosity of Billingsley's proof is also remarkable. The two preliminary lemmas are technical, and in the second case the proof is rather lengthy; the reader might prefer to take this one on trust initially. If S t. · ·Sm is a random sequence, and we define Sj = E=t Si for 1 ::; j ::; m, and S0 = 0, the problem is to bound the probability of I Sm I exceeding a given value. The lemmas are obviously designed to work together to this end. 29.2 Lemma I Sm ! ::; 2 max min { I Sji . I Sm - Sj l } + max l sj l · 0 5.j 5. m 0 5.j $ m ..
Proof Let I s;;;; { O, . ,m} denote the set of integers k for which I Sk l :::;; I Sm - Sk i · If Sm = 0 the lemma holds, .and if Sm :f: 0 then m fit I. On the other hand, 0 E /. It follows that there is a k fit I such that k - 1 E /. For this choice of k, ! Sm l ::; I Sm - Sk i + I Sk l ::; I Sm - Sk i + l sk - t l + l skl (29. 1) ::; 2 max min{ I Sj i , I Sm - Sj l } + max I S; I . • j j m m O O$ 5. $$ . .
FCLTs for Dependent Processes
475
The second lemma is a variation on the maximal inequality for partial sums. 29.3 Lemma (Billingsley 1968: th. 12. 1) If
� (± b1) , j = i, ... ,k, (29.2) l=i+l for each pair i,k with 0 � i � k � m, where {b 1 , ... , bm } is a collection of posi 2
2
2
E((Sj - Si) (Sk - Sj) )
tive numbers, then ::3 K > 0 such that, for all a > 0 and all m,
P
(0�:: min { I Sj i , I Sm - Sj l } � a) � KBa4 ,
where B = LJ== l bj. Proof For 0 � i � k
2
(29.3)
� m and a > 0, we have
P(min { I Sj Sd , I Sk - Sj l } � a) = P( { I Sr Sd � a} n { I Sk - Sj l � a} ) � P( I Sj - Si i i Sk - Sj l � a2) ' k 2 � � a "L b� , (29.4) l=i+ 1 -
( ) '
where Chebyshev ' s inequality and (29.2) give the final inequality. If m = 1 , the minorant side of (29.3) is zero. If m = 2, (29.4) with i = 0 and k = 2 yields
( b 1 + b 2)2 , (29.5) P(max{ O, min{ I Sd , I S2 - SJ i } } � a) � a4 so that (29.3) holds for K. = 1 and hence for any K � 1 . The proof now proceeds by induction. Assuming there is a K for which (29.3) holds when m is replaced by any integer between 1 and m - 1, we show it holds for m itself, with the same K. The basic idea is to split the sum into two parts, each with fewer than m terms, obtain valid inequalities for each part, and combine these. Choose h to be the largest integer such that L..j:}bj � B/2 (the sum is zero if h = 1); it is easy to see that Ij=h+ 1 bj � B/2 also (the sum being zero if h = m). First define (29.6) U1 = max min{ I Sji . I Sh - 1 - Sji } O�j�h- 1 (29.7) Evidently, h 1 2 KB2 K (29.8) P(U1 � a) � 4 L, bj � -4 4a a j==l
( )
by the induction hypothesis. Also, by (29.4) with i = 0 and k = m,
The Functional Central Limit Theorem
476
2 (29.9) P(D I � a) s 84 . a The object is now to show that (29. 10) min{ I SJ I . I Sm - Sj l } :::; UI + DJ, 0 s j :::; h - 1 . If I Sj l :::; UJ, (29.10) holds, hence suppose I Sh -1 sj I :s; U}, the only other possi bility according to (29.6). If D 1 = I Sh - I I , then min{ I SJ I , I Sm - Sjl } S I SJ I S I Sh - 1 - SJ I + I Sh - I I :s; U1 + D 1 . And if D 1 I Sm - Sh - I l then again, min{ I SJ I , I Sm - Sj l } :::; I Sm - Sj l :::; I Sh - 1 - Sj i + I Sm - Sh - d :::; Ul + D J . Hence (29. 10) holds in all cases. Now, for 0 :s; J..l s 1 , -
==
P(U 1 + D I � a) s P({ UI � J..la } u {DI � (1 - J..L) a}) s P(U1 � J..la) + P(D I � (1 - J..L)a) B2 KB2 < - 4 4+ 4a J..L a\1 - J..L/ --
(29. 1 1 )
Choosing J..l to minimize K/4J..l4 + 11( 1 - J..L)4 yields J..l = (i-K) 1 15/(1 + (!K) 1 15] (use calculus). Back-substituting for J..l and simplifying yields, for K � 2[1 - (�) 1 15r5 55,021,
z
(29. 12) According to (29. 10), we have bounded min { I S1 1 , I Sm - S11 } in the range 0 s j s h - 1 . To do . the same for the range h s j s m, define (29. 13) Uz max min{ I SJ - Sh l , I Sm - SJI } h 5J 5m =
(29. 14) It can be verified by variants of the previous arguments that min{ I SJI , I Sm - SJI } s Uz+ Dz, h s j s m,
(29. 15)
and also that
(29. 16) for the same choice of K. Combining (29. 16) with (29. 12), we obtain
FCLTs for Dependent Processes P
l':':: j
)
477
min{ I Sj l , I Sm - Sj l ) 2 a � P(max { U 1 + D l > U1 + D2 ) ;, a)
= P( { Ur +D1 � a} u { U2 + D2 � a}) ::::; P(Ur + Dt � a) + P(U2 + D2 � a) 2 <- KB . (29. 17) a4 "
29.1 Let the characteristic function of X(t) be (29. 1 8) <J>(t, A) = E(e i'AX(t)) . We can write, by ( 1 1 .25), eiu = 1 + iu - 1u2 + r(u), (29.19) where l r(u) l ::::; l u l 3. We shall write either 115,1 or 11(s,t), as is most convenient, to denote X(s) - X(t) for 0 ::::; t ::::; s ::::; 1 . Observe that by conditions (a) and (c) of the theorem, E(l17+h,r) = h. Hence, <J>(t + h,A.) - <J>(t, A) = E[ei'AX(t)(eiUr+h ,t - 1)] = E [eiAX(t)(iA-11r+h,t - 1A-2117+h,t + r(A-11r+h, r)) ] 2 (29.20) = <J>(t,A)[ -1A- h + E(r(Mr+h,r)] , Proof of
where the last equality is because X(t) and 11r+h,t are independent by condition (c). Since E(r(A-11r+h, r) ) ::::; A-3EI 11r+h, rl 3, it follows that <\> (t, A)A3EI 11r+h, r l 3 th(t + h A) - th(t A) 2 ' 'V ' ! 'V (t,A.) ::::; (29.21) + A<J> h h Now, suppose that
l
I
(29.22) lim * E I 11r+h, rl 3 = 0. h ,l.O It will then follow that, for all 0 ::::; t < 1 , possesses a right-hand derivative, Im (t + h,A.)h - (t,A.) - 21'\""2th'V(t, l\,) . (29.23) h ,l.O Further, for h > 0 and h ::::; t ::::; 1 , (29 .21) holds at the point t - h, so by consid ering a path to the limit through such points we may also conclude that 1.
_
'\
_
) 2th Im (t,A.) - h (t - h,A.) = _t'l (29.24) 2"" ( t ' "" . 'V h ,J, o Since <J> (t-,A) = <J>(t,A) because is continuous in t, by condition (b) of the theorem, is differentiable on (0, 1 ) and 1.
_
'\
The Functional Central Limit Theorem
478
(29.25) This differential equation is well known to have the solution -()..212, :?: 0. = (29.26) (Verify this by differentiating log with respect to Since = 0 a.s., = 1, and applying the inversion theorem we conclude that for each E (0, 1 . By continuity of <\l at 1 , the result also extends to = 1 . Hence, the task is to prove (29.22). This requires the application of 29.2 and 29.3. For some finite let S; = for j = 1 , By assumption, the �J are independent r.v.s with variances of If SJ = L{= l �i = then E((s1 - sicsk - Sj)2) = (29.27) we have By 29.3, setting bj =
<\l(t,A.) <\l(O,A.)e
<\J(O,A.) t
t
<\l
t.)
X(t)(O) N(O,t) X t !1(t+hj/m, t+h(j- l)lm)him. ... ,m. �
)
m
11(t +}him, t),
(}-i)(k-J)h21m2 :::; h2.
him, 2 P ( m�x minfl l 11(t + �' I, I !1(t + h, t +�h) I L a}) K� . a J o��m ;:::
t)
Hence by 29.2,
P( l
d(t+ h,t) I
;,
(
a) $ P 2 x min 0�� m
K*h2 + m X C
(
:::;
(29.28)
{ 1,\.(t +�, t) I , I d(t+ h, t+ �h) I }
)
. . 1 :::; -4- P max (29.29) h) I :?: ¥x , a O�j <:;,m where K* = 44K. Letting -7 oo, the second term of the majorant member must go to zero, since E with probability 1 by condition (b), so we can say that
1!1(t+ �, t +1�
(29.30) We may now use 9.15 to give E 1 11t+h,t 1 3 =
J: 1 11t+h,t1 3 F + EP( 1 11t+h,t 1 3 d
:?:
E)
+ f�PC 1 11t+h,t 1 3 > s)ds (29.3 1 )
FCLTs for Dependent Processes 479 . . . . Ch oose £ = (K* )314h 312 to m1mm1ze the 1ast member above, and we obtam (29.32) E l �t+h, r l 3 :::::; 4(K* ) 314h312 . This condition verifies (29.22), and completes the proof. • Notice how (29.30) is a substantial strengthening of the Chebyshev inequality, which gives merely P( l �t+h, r l ;::=:: a) :::::; hla2 . We have not assumed the existence of the third moment at the outset; this emerges (along with the Gaussianity) from the assumption of independent increments of arbitrarily small width, which allows us to take (29.29) to the limit. 2 9 . 2 Asymptotic Independence
Let {Xn } i denote a stochastic sequence in (D,'Bv). We say that Xn has asymptoti cally independent increments if, for any collection of points {si ,ti, i = 1 , ... ,r} such that 0 :::::; SJ :::::; t1 < S2 :::::; t2 < . . . < S r :::::; t, :::::; 1 , and all collections of linear Borel sets B 1 , ... ,B, E 'B, r
1 , ... ,r) -7 flP(Xn(ti) - Xn (si) E Bi) (29.33) i= l as n -7 oo. Notice that in this definition, gaps of positive width are allowed to separate the increments, which will be essential to establish asymptotic indepen dence in the partial sums of mixing sequences. The gaps can be arbitrarily small, however, and continuity allows us to ignore them as we see below. Given this idea, we have the following consequence of 29.1. 29.4 Theorem Let { Xn } ;;'= 1 have the following properties: (a) The increments are asymptotically independent. (b) For any £ > 0 and 11 > 0, 3 8 E (0, 1 ) s.t. limsupn---+ooP(w(Xn.8) ;::=:: £) :::::; T) . (c) {Xn(t)2 }';;'= 1 is uniformly integrable for each t E [0, 1]. (d) E(Xn Ct)) -7 0 and E(Xn(t)2) -7 t as n -7 oo, each t E [0, 1]. Then Xn � B. o Be careful to note that w(.,8) in (b) is the modulus of continuity of (27. 14), not w' of (28.3). Proof Condition (b), and the fact that E l Xn(O) l -7 0 by (d), imply by 28.14 that the associated sequence of p.m.s is uniformly tight. Theorem 26.22 then implies that the latter sequence is compact, and so has one or more cluster points. To complete the proof, we show that all such cluster points must have the characteristics of Wiener measure, and hence that the sequence has this p.m. as its unique weak limit. Consider the properties the limiting p.m. must possess. Writing X for the random element, 28.14 also gives P(X E C) = 1 . Uniform integrability of Xn(t) 2, and hence of Xn(t), implies that E(X(t)) = 0 and E(X(t)2) = t, by 22.16. By condition (a) we P(Xn(tD - Xn(si)
E
Bi, i
=
The Functional Central Limit Theorem
480
may say that the increments X(t 1 ) - X(s 1 ), ... ,X(t,) - X(s,) are totally indepen dent according to (29.33). Specifically, consider increments X(ti) - X(si) and X(ti+ I ) - X(si+ I ) for the case where si+ I = ti + 1/m. By a.s. continuity, (29.34) so that asymptotic independence extends to contiguous increments. All the condi tions of 29.1 are therefore satisfied by X, and X - B. • Our aim is now to get a FCLT for partial-sum processes by linking up the asymptotic independence idea with our established characterization of a dependent increment process; that is to say, as a near-epoch dependent function of a mixing process. Making this connection is perhaps the biggest difficulty we still have to surmount. An approach comparable to the 'blocking' argument used in the CLTs of §24.4 is needed; and in the present context we can proceed by mapping an infinite sequence into [0, 1 ] and identifying the increments with asymptotically independent blocks of summands. This is a particularly elegant route to the result. However, an asymptotic martingale difference-type property of the type exploited in 24.6 is not going to work in our present approach to the problem. vv'hile the terms of a mixing process (of suitable size) can be 'blocked ' so that the blocks are asymptotically independent (more or less by definition of mixing), mixingale theory will not serve here; near-epoch dependent functions can be dealt with only by a direct approximation argument. What we shall show is that, if the difference between two stochastic processes is op (l) and one of them exhibits asymptotic independence, so must the other, in a sense to be defined. Near-epoch dependent functions can be approximated in the required way by their near-epoch conditional expectations, where the latter are functions of mixing variables. This result is established in the following lemma in terms of the independence of a pair of sequences, which in the application will be adjacent increments of a partial sum process.
29.5 Lemma (Wooldridge and White 1986: Lemma A.3) If { l}n} and {.-0n } are real
stochastic sequences, and (a) lJn - .-4n � 0, for j = 1,2; (b) lJn � Yj for j = 1 ,2; (c) for any AI> A2 E :B ,
then
P({Z!n E A J } ("') {Z2n as n � oo ;
E
A2 } )
�
P(Zln
E
A 1)P(Z2n
E
A 2)
(29.35)
(29.36)
aBj) = 0) for j = 1 ,2. Proof Considering (Zl n,�n) and (Yln o Y2n) as points of !R 2 with the Euclidean metric, (a) implies de((Zl n,Z2n) , (Yln, Y2n)) � 0, and by an application of 26.24, for all lj-continuity sets (sets Bj
{h \ 1 m nl 1 Pc hnth {7.
7� \
�
E 'B
r v. V� \
such that P(Yj
=
1 ')
E
Wri tP
FCLTs for Dependent Processes
481 (29.37)
where )ln is the measure associated with the element (Z1n,Zzn). If )l is the measure associated with (Y1 , Y2), define the marginal measures � by �(Bj) = P( Yj E Bj) ; then �(dBj) = 0 for j = 1 ,2 implies )l(d(B1 x B2 )) = 0, in view of the fact that d(B t X B z)
c
(29.38)
(dB1 x rR ) u (rR x ()B 2).
Applying (e) of 26.10, it follows from the weak convergence of the joint distribu tions that, for all Yj-continuity sets Bj,
P( { Ztn
E B J } n { Zzn E Bz } )
lln (B t X Bz)
=
---7
=
)l(B t x B2 )
P( { Yt
E B t } n { Yz E Bz }).
(29.39)
And by the weak convergence of both sets of marginal distributions it follows that, for these same Bj,
(29.40) This completes the proof, since the limits of the left-hand sides of (29.39) and (29.40) are the same by condition (c). • 29.3 The FCLT for NED Functions of Mixing Processes
From 29.4 to a general invariance principle for dependent sequences is only a short step, even though some of the details in the following version of the result are quite fiddly. This is basically the one given by Wooldridge and White (1988).
29.6 Theorem Let { Uni } be a zero-mean stochastic array, { Cni } an array of positive constants, and {Kn(t), n E IN } a sequence of integer-valued, right-continuous, increasing functions of t, with Kn(O) = 0 for all n, and Kn(t) - Kn(s) ---7 oo as n ---7 oo if t > s. Also define X�(t) = l:�� { t)Uni· If (a) E(Uni) = 0; (b) SUPi,n ll UnJCni llr < 00, for r > 2; (c) Uni is L2-NED of size -y, for 1 $: y $: 1 , with respect to the constants { end , on an array { Vnd which is a-mixing of size -r/(r - 2); 2 Kn (t+8) v (t 8) (d) sup limsup T < oo, where v�(t,8) = L c�i ; tE [O,l),CE (O, l -t]
(e)
max Cni
{
}
n�=
= O(Kn(1 )1- 1),
1 5:: i 5:: Kn (l)
E(X�(t)2) ---7 t as n then X� � B. o (f)
---7 oo,
i=Kn (t)+ l
where y is defined in (c);
for each t E [0,1 ] ;
Right-continuity of Kn(t) ensures that v�(t,8) ---7 0 as 8 ---7 0, if we agree that a sum is equal to zero whenever the lower limit exceeds the upper.
The Functional Central Limit Theorem
482
If y is set to 1 in condition (c), condition (e) can be omitted. It is important to emphasize that this statement of the assumptions, while technically correct, is somewhat misleading in that condition (c) is not the only constraint on the depen dence. In the leading cases discussed below, condition (f) will as a rule imply a L2 -NED size of 1 Theorem 29.6 is very general, and it may help in getting to grips with it to extract a more basic and easily comprehended set of sufficient conditions. What we might think of as the 'standard' case of the FCLT - that of convergence of a partial sum process to Wiener measure - corresponds to the case Kn(t) = [nt]. We will omit the K superscript to denote this case, writing Xn(t) = I}� flUni · The full conditions of the theorem allow different modes of convergence to be defined for various kinds of heterogeneous processes, and these issues are taken up again in §29.4 below. But it might be a good plan to focus initially on the case Xn(t), mentally making the required substitutions of [nt] for Kn(t) in the formulae. In particular, consider the case Uni = U/sn where n n nl ni (29.41) s� = E � Ui 2 = � cr7 + 2�- �- <Ji,i+m• -
.
( )
with cr1 = Var(UD and cri, i+m = Cov(Ui,Ui+m). Also, require that supdi Udl r < oo, r > 2. Then we may choose cni = 1 /sn , and with Kn(t) = [nt], condition 29.6(d) reduces to the requirement that s�ln > 0, uniformly in n. In this case, 29.6(e) is satisfied for y = l If in addition s�ln ---7 � < =, then E(Xn(t))2 = sfnt]ls� ---7 t and 29.6(f) also holds. These conclusions are summarized in the following corollary. 29.7 Corollary Let the sequence { Ud have mean zero, be uniformly L,-bounded, and L2-NED of size -! on an a-mixing process of size -r/(r - 2), and let Xn(t) = n 112II�P ui. If n-1(I'i=t Ui) 2 ---7 �. 0 < � < then Xn � �B. o Be careful to note that � =
-
oo,
FCLTs for Dependent Processes
483
It is instructive to compare the conditions of 29.6 with those of 24.6 and 24.7. Since X�( l ) � B(l ) - N(O, l), the two theorems give alternative sets of condi tions for the central limit theorem. Although they are stated in very different terms, conditions 29.6(d) and (e) clearly have a role analogous to 24.6(d). While 24.6 required a L2-NED size of -1, it was pointed out above how the same condition is generally enforced by 29.6(f). However, 29.6(f) itself has no counterpart in the CLT conditions. It is not clear how tough this restriction is, given our free choice of Kn, and this is a question we attempt to shed light on in §29.4. What is clear is that the convergence of the partial sum process Xn to B requires stronger conditions than are required just for the convergence of Xn(l ) to B( l), which is the CLT. Proof of 29.6 We
will establish that the conditions of 29.4 hold for the sequence
{X�} . Condition 29.4(d) holds directly, by the present conditions (a) and (f). Conditions (a), (b), and (c) imply by 17.6(i) that { Uni,c:fnd is a �-mixingale of size -� with respect to the scaling constants { end , where c:Jni = a( Vi-j' j ;::::: 0). In view of the uniform Lr-boundedness with r > 2, the array { U�/c�i } is uniformly integrable. If we let k = Kn(t) and m = Kn(t + 8) - Kn (t) for 8 E [0, 1 t), it
-
follows by 16.14 (which holds irrespective of shifts in the coordinate index) that the set
(29.42) is uniformly integrable, for any t and 8. Further, because of condition (d) we may assume there is a positive constant M < oo such that for any t E [0, 1) and any 8 E (0, 1 - t]), there exists N(t,8) ;::::: 1 with the property v�(t,8)/8 � M for n ;::::: N(t,8). Therefore the set
{
(Sn ,k+j - Snk) 2 �ax ,n 8 ]S m
;:::::
1,
N(t,8)
,
}
(29.43)
is also uniformly integrable. If N* = sup oN(t, 8) condition (d) implies that !V is finite. Taking the case t = 0 and hence k = 0 and m = Kn(8) in (29.43) (but then writing t in place of 8 for consistency of notation), we deduce uniform integrability of {X�(t) }';;'= 1 for any t e (0,1 ] (the summands from 1 to N(O,t) - 1 can be included by condition (b)). In other words, condition 29.4(c) holds for {X�}';;'= l · Note that !.?P( I X I > A.) � E(X2 1 1 JXJ,.A. }) for any square-integrable r.v. X. There fore, the uniform integrability of (29 .43) implies that for any 8 e (0, 1 ), any t � 1 8, and any E > 0 and 11 > 0, 3 A. > 0 large enough that for n ;::::: !V,
-
(
P max I Sn,k+r Snk I l �S m
;:::::
)
AVO
�
�, 11-
8/..,2
(29.44)
where k amd m are defined as before. The argument now follows similar lines to the proof of 27.14. For the case 8 = £214!..2, (29.44) implies
484
The Functional Central Limit Theorem sup
P (t�s �t+ll I X�(s) - X�(t) l ;:::: �£) :::; �118, n ;:::: N*,
O � t � 1 -8
sup
(29.45)
which is identical to (27.71). Condition 29.4(b) now follows by 27.13, as before. The final step is to show asymptotic independence. Whereas the theorem requires us to show that (29.33) holds for any r, since the argument is based on the mixing property and the linear separation of the increments, it will suffice to show independence for adjacent pairs of increments (i, i + 1) having ti < si+ l · The extension to the general case is easy in principle, though tedious to write out. Hence we consider, without loss of generality, the pair of variables
Kn(t;) L Uni• j i=Kn(Sj)+ l
=
1 and 2,
(29.46)
where 0 :::; s 1 < t1 < s2 < t2 :::; 1 . We cannot show asymptotic independence of Yl n and Y2n directly because the increment process need not be mixing, but there is an approximation argument direct from the NED property. Defining '!f�,j = cr(Vnj· ... , Vnk), the r. v. E( Ytn I '!f�7(�_!_)oo) is '!f�7(:!?co-measurable, and similarly E( Y2n I '!f";,K11(s2) ) is '!f";,K11(s2)-measurable. By assumption (c), sup --7
0 as n --7 oo
(29.47) whenever ft < S2 , where the events A include those of the form { E(Yl n I '!f�7(�oo) E E } for E E 13 , and similarly events B include those of the type { E( Y2n I '!f";,Kn(s2) ) E E } . These conditional expectations are asymptotically independent r.v.s, and it remains to show that Yl n and Y2n share the same property. We show that the conditions of 29.5 are satisfied when Ztn = E(Ytn I '!F�7(�oo) and Z2n = E( Y2n I '!f";, Kn(s2)). This is sufficient in view of the fact that the Y;-continuity sets are a convergence-determining class for the sequences { Y;n } , by
26.10(e). The argument of the preceding paragraph has already established condi tion 29.5(c). To show condition 29.5(a) we have the inequalities
Kn(fJ ) L 11 Un; - E(Uni l '!f�7(�oo) lh i=Kn(SJ )+ l Kn(t J ) :::; 2 L 1 1 Un; - E(Und '!f�7g�kn (t1 )) 11 2 i=Kn(SJ )+ l Kn (t J ) :::; 2 L CniVKn(t J )- i i=Kn (SJ )+ l
E II Ytn - E(Yl n l '!f�7(�oo) lh :::;
FCLTs for Dependent Processes � 2
485
Cni
max
Kn (SJ)
---7
0 as n
---7
(29.48)
oo,
where we have applied Minkowski's inequality, 10.28, and finally assumptions (c) and (e), and 2.27. This implies that Ytn - E(Ytn l ��7c:�}oo) � 0. Note that condition (d) implies that sup;Cni ---7 0 as n ---7 oo, so in the case y = 1, (e) can be dispensed with. By the same reasoning, Y2n - E(Y2n I �";Xn( 2) ) � 0 also. Since we have established that conditions 29.4(b) and 29.4(d) hold, we know that the sequence of measures associated with {X�} is uniformly tight, and so contains at least one convergent subsequence - { nb k E IN } , say - such that X�k � xK (say) as k ---7 oo where P(XK E C) 1. It follows that the continuous mapping theorem applies to the coordinate projections ntCXK) = XK(t), and we may assert that X�k(t) � XK(t). Confining attention to this subsequence, condition 29.5(b) is satisfied for the case Ynhj = X�(tj) - X�(sj). All the conditions of 29.5 have now been confirmed, so these increments are asymptotically independent in the sense of (29.36). But since this is true for every convergent subsequence {nk }, we can conclude that the weak limit of {X�} has asymptotically independent increments whenever it exists. All the conditions of 29.4 are therefore fulfilled by {X� } , and the proof i s complete. • s
=
It is possible to relax the moment conditions of this theorem if we substitute a uniform mixing condition for the strong mixing in condition (c). 29.9 Theorem Let { Un;} , { en; } , {Kn(t) } , and {X�} be defined as in 29.6; assume that conditions 29.6(a), (d), (e) and (f) hold, but replace conditions 29.6(b) and (c) by the following: (b') sup;,n I I Un/Cni I I < oo, for r 2 2, and { U�;/c�; } is uniformly integrable; (c') Uni is L2-NED of size -y, for ! � y � 1, with respect to constants { en; } , on an array { Vn; } which i s -mixing of size -r/2 ( r - 1 ), for r 2 2; then X� � B. o r
The uniform integrability stipulation in (b') is required only for the case r = 2, and the difference between this and the a-mixing case is that this value of r is permitted, corresponding to a -mixing size of -1. By 17.6(ii), { Un; } i s again an L2-mixingale of size -! i n this case. The same arguments as before establish that conditions 29.4(b),(c) and (d) hold; and, since a(m) � <j>(m), condition (29.47) remains valid so that asymptotic indepen dence also holds by the same arguments as before. • Proof
29.4 Transformed Brownian Motion
To develop a fully general theory of weak convergence of partial sum processes, permitting global heterogeneity of the increments with possibly trending moments,
486
The Functional Central Limit Theorem
and particularly to accommodate the multivariate case, we shall need to extend the class of limit processes beyond ordinary Brownian motion. The desired generaliza tion has already been introduced as example 27.8, but now we consider the theory of these processes a little more formally. A transformed (or variance-transformed) B rownian motion BTl will be defined as a stochastic process on [0, 1] with finite dimensional distributions given by
(29.49) where B is a Brownian motion and 11 is an increasing homeomorphism on [0, 1 ] with T\(0) = 0. The increments of this process, BTI(t) - BTI(s) for 0 $ t < s $ 1 , are therefore independent and Gaussian with mean 0 and variance T\(t) - 11(s). Since T\(1) must be finite, the condition 110) = 1 can be achieved by a trivial normal BTJ(t) - B(Tl(t)), t
E
[0, 1].
ization. To appreciate the relevance of these processes, consider, as was done in §27 .4 the characterization of B as the limit of a partial-sum process with independent Gaussian summands. Here we let the variance of the terms change with time. Sup pose Si N(O,cr7>, and let s� = E(Ii= rSi = Ii= rcrr. Also suppose the variance sequence {cn}i= 1 has the property that, for each t e [0, 1], -
S2[ n t] 2 Sn
-
---)
T\(t) as n
---)
oo,
(29.50)
where the limit function 11: [0, 1] 1---7 [0, 1 ] is continuous and strictly increasing everywhere. In this case, according to the definition of BTJ, we have
I�r =n lt ] Si D -- � BTI(t), Sn
(29.5 1 )
for each t E [0, 1 ] . What mode of evolution of the variances might satisfy (29.50), and give rise to this limiting result? In what we called, in § 13.2, the globally stationary case, where the sequence { cn}'i' is Cesaro-summable and the Cesaro limit is strictly positive, it is fairly easy to see that Tl(t) = t is the only possible limit for (29.50). This conclusion extends to any case where the variances are uniformly bounded and the limit exists; however, the fact that uniform boundedness of the variances is not sufficient is illustrated by 24.11. (Try evaluating the sequence in (29.50) for this case.) Alternatively, consider the example in 27.8. It may be surprising to find that (for the case - 1 < � < 0) the partial sums have a well defined limit process even when the Cesaro limit of the variances is 0. However, 27.8 is more general than it may at first appear. Define a continuous function on [O,oo) by g(v)
=
sfvl + (v - [V])afv+ l l ·
(29.52)
If s� satisfies (29.50), g is regularly varying at infinity according to 2.32. g _2 I /
..._
FCLTs for Dependent Processes
487
g(n) + g'(n) for integer n, and note that by 2.33 (which holds for right deriva tives) g' is also regularly varying. The variance process of 27.8 can be general
ized at most by the inclusion of a slowly varying component. This is the situation for the case of unweighted partial sums, as in (29.53), the one that probably has the greatest relevance for applications. But remember there are other ways to define the limit of a partial-sum process, using an array formulation. There need only exist a sequence {gn } j of strictly increasing func tions on the integers such that gn([nt])lgn(n) � 11(t), and the partial sums of the array { �ni } , where �ni - N(O,a�D and
(J�i = (gn(i) - gn(i - 1 ))/gn(n ),
(29.53)
will converge to BTl. And since such a sequence can always be generated by setting gn([nt]) = ll(t)an, where a n is any monotone positive real sequence, any desired member of the family Bll. can be constructed from Gaussian increments in this manner. The results obtained in §29.1 and §29.2 are now found to have generalizations from B to the class BTl. For 29.1 we have the following corollary. 29.10
Corollary
Let condition 29.1(a) be replaced by
(a') E(X(t)) = 0, E(X(t) 2)
Then X - BTl.
=
11(t), 0
:S
t
:S
1.
Define x* (t) = X(11 - 1 (t)) and apply 29.1 to X* . 11 - 1 (.) is continuous, so condition 29.1(b) continues to hold. Strict monotonicity ensures that if { t 1 , ,tm } define arbitrary non-overlapping intervals, so also do {11 - 1 (t1 ), ... , 11 - \tm) } , so 29.1(c) continues to hold. • Proof ..•
Similarly, for 29.4 there is the following corollary. 29.11 Corollary Let the conditions (a), (b), and (c) of 29.4 hold, and instead of condition 29.4(d) assume (d') E(Xn(t)) � 0 and E(Xn(t) 2) � 11(t) as n � oo, each t E [0, 1]. Then Xn _!4 BTl. Proof The
for X.
•
argument in the proof of 29.4 shows that the conditions of 29.10 hold
29.12 Example Let { Ui } 7 denote a sequence satisfying the conditions of 29. 7, with the extra stipulation that the L2-NED size is - 1 . Define the cadlag process
[nt] 1 (29.54) Xn(t) = -m L)Uj. an j= l This differs from the process n - ll 2rj�[ 1 l!jla only by the multiplication of the summands by constant weightsj/n, taking values between lin and 1 . The arguments of 29.6 show that conditions 29.4(a), (b), and (c) are satisfied for this case, and it remains to check 29.1l(d'). We show that
488
The Functional Central Limit Theorem
(29.55) Choose a monotone sequence { bn E IN } such that bn --7 oo but bnfn --7 0; bn = [n 112] will do. Putting Tn = [ntlbn ] for t E (0, 1] and n large enough that Tn 2 1 , we
have
(29.56) The terms in this sum have the decomposition
ibn (29.57) L }Uj = ibnSn i + bnS�i• j=(i- 1 )bn+ 1 in which Sni = 'L}�(i - l )bn+ l Uj , and S�i = 'L}�(i- l )bn+ l anij Uj, where a nij = (ibn -})Ibn E [0, 1]. The assumptions, and 17.7, imply that b� 1 E(S�i) --7 c:J2 for each i = 1 , ... ,rn. and that b� 1 I E(SniSnd l = O( l i - i' I - Hi) for B > 0. Neither limsupnb� 1 E(S�T) nor limsupnb� 1 1 E(SniS�D I exceed c:Jl, whereas b� 1 1 E(S�iS�i') I and b� 1 E(SniS�r) are of O( l i - i' l - 1 - 0). The same results apply to Sn,rn+ 1 and S�,rn+ 1 , the analogous terms corresponding to the residual sum in (29.56). Thus, consider E(Xn(t)2). Multiplying out the square of (29.56) after substitut ing (29 .57), we have three types of summand: those involving squares and products of the Sni ( ( rn + 1) 2 terms); those involving squares and products of the S�i ((rn + 1 )2 terms); and those involving products of S�i with Sni (2(rn + 1 )2 terms). The terms of the second type are each of O(b�n-3) = O(n - 1 r� 2), and this block vanishes asymptotically. The terms in the third block (given ibn = O(n)) are of O(bnn - 2) = o(r� 2), and hence this block also vanishes. This leaves the terms of the first type, and this block has the form
(29.58) Noting that rnbnln --7 t, applying standard summation formulae and taking the limit yields (29.55). Thus, according to 29.11, Xn(t) � BTl where ll(t) = �t3. o There is an intimate connection between the generalization from B to BTl and the style of the result in 29.6. The latter theorem does not establish the convergence of the partial sum sequences Xn(t) = 'L��flUni• either to B or to any other limit. In fact there are two distinct possibilities. In the first, Kn(t)ln converges to 11- \ t) for t E [0, 1 ], for some 11 as in (29 .49). If this holds, there is no loss of generality in setting Kn(t) [n1l - 1 (t)], and under condition 29.6(f) this has the implication =
FCLTs for Dependent Processes
489 (29.59)
In other words, Xn � Brt by 29.11. Example 29.8 is a case in point, for which Tl(t) = t1 +P . In these cases the convergence of the process {X�} to Brownian motion can be also represented as the convergence of the partial sum process (Xn} to Brt. On the other hand, it is possible that no such 11 exists, and the partial sums have no weak limit, as the following case demonstrates. 29.13
Example
t
Let a sequence { Ud have the property
ui
=
22k � i
a.s.,
<
22k+l , k
=
0, 1 ,2,3, ...
(O,cr2), otherwise. Thus, U1 = 0, U4 = Us = U6 = U7 = 0, U1 6 = U1 7 = ... = U31 = 0, and so forth. Let Uni = U/sn as before, and put Xn(t) = L��flUni· Then, observe that for ! < t � 1, 1 when n = 2k - 1 for k even, Xn(t) = Xn(!) with probability 0 when n = 2k - 1 for k odd.
{
Since this 'cycling ' in the behaviour of Xn is present however large n is, Xn does not possess a limit in distribution. However, let Kn(t) be the integer that satisfies Kn (t)
L 1(22k - I � i i=2
<
2 2k, k E rN)
=
(29.60)
[nt],
where 1 (.) is the indicator function, equal to 1 when i is in the indicated range and 0 otherwise. With this arrangement, n counts the actual number of increments in the sum, while Kn(l ) counts the nominal number, including the zeros; Kn+ l (1) = Kn(l ) + 1 except when Kn(1) = 22k, in which case Kn+ 1 (1) 2 2k+l . The conditions of 29.6 are satisfied with Tl(t) = t, and X� � B. o =
Incidentally, since condition 29.6(f) imposes
(29.61) one might expect that KnO )In � 1 . The last example shows that this i s not neces sarily the case. To get multivariate versions of 29.6 and 29.9, as we undertake in the next section, it will be necessary to restate these theorems in a slightly more general form, following the lines of 29.10 and 29.11. 29.14 Corollary Let conditions 29.6(a), (b), (c), (d), and (e) hold, and replace 29.6(f) by (f') E(X�(t) 2) � Tl(t) as n � =, for each t E [0, 1]; then X� � Brt. The same modification i n 29.9 leads to the same result. o
The Functional Central Limit Theorem
490
The main practical reason why this extension is needed is because we shall wish to specify Kn in advance, rather than tailor it to a particular process; the same choice will need to work for a range of different processes - to be precise, for every linear combination of a vector of processes, for each of which a compatible ll will need to exist. However, the fact that partial sum processes may converge to limits different from simple Brownian motion may be of interest for its own sake, so that 29.14 (with Kn (t) = [nt]) becomes the more appropriate form of the FCLT. See Theorem 30.2 below for a case in point. 29.5 The Multivariate Case
To extend the FCLT to vector processes requires an approach similar in principle to that of §27.7. However, the results of this chapter have so far been obtained, unlike those of §27, without explicit derivation of the finite-dimensional distri butions. It has not been necessary to use the results of §28.4 at any point. Because we have to rely on the Cramer-Wold device to go from univariate to multi variate limits, it is now necessary to consider the finite dimensional sets of D, and indeed to generalize the results of §28.4. This section draws on Phillips and Durlauf ( 1986). We define Dm as the space of m- vectors of cadlag functions, which we endow with the metric max { d8(xj,yj) } , (29.62) l $;j$ m where d8 is the Billingsley metric as before. dj] induces the product topology, and the separability of (D,ds) implies both separability of (Dm,dj]) and also that 'B'B = 'Bv x 'Bv x ... x 'Bv is the Borel field of (Dm,d'Jl). Also let d'J](x,y)
=
(29.63) be the finite-dimensional sets of Dm, the field generated from the product of m copies oflfv. The following theorem extends 28.10 in a way which closely parallels the extension of 27.6 to 27.16. 29.15 Theorem Jf'B is a determining class for (Dm, 'B'B). Proof An open sphere in 'B'B is S(x, a)
=
=
{y E Dm : d'Jl(x,y)
{Y
E
<
a}
D"': 3 A E A s. t. II A II
< u, ��:':,'�PI
Define, for { tl ... ,tk E [0,1 ] , k E IN } ,
yj(t) xj(A,(t)) I -
< u}
(29. 64)
FeLTs for Dependent Processes
{
H,(x,a.) = y
E
Dm :
3 '/.. E
A s.t. ll '/.. 1 1
49 1
< a,
max max ! yj(t;) - xj(A.(tD) I
l�j � m l�i� k
}
< a
E
Jf'Jj.
(29.65)
It follows by direct generalization of the argument of 28.10 that, for any x E Dm and r > 0,
S(x,r) Hence, 'B'Jj
�
=
U n Hk(x, r - 1/n) n=l k=l
cr(Jf'Jj) as required.
'
E
cr(Jf'Jj).
(29.66)
•
The following can be thought of as a generic multivariate convergence theorem, in that the weak limit specified need only be a.s. continuous. It is not necessar ily Bm. 29.16 Theorem29 Let Xn E Dm be an m-vector of random elements. Xn � X, where P(X E e m) 1 , iff A'Xn � A'X for every fixed A with A'A = 1 . =
If x1 E D, j = l , ... ,m, 2..}= 1 A.1x1 possesses a left limit and is continuous on the right, since for t E [0, 1 ) , Proof
m
lim L �xj(t + £) £.!-0 }= I
=
m ."L A.1 limx1(t + £)
=
m ."L A.1x1(t).
(29.67)
£.!-0 }=1 Hence, x = (x1, ,xm)' E Dm implies A'x E D. It follows that A'Xn is a random element of D. It is clear similarly that X E e m implies A x E e, and hence P(A'X E C) = 1 . To prove sufficiency, let J.l� denote the sequence of measures corresponding to A'Xm and assume J.l� :::::} J.l"-. Fix t1 , ,tk E [0, 1], for finite k. Noting that 1t��, ... , 1k(B) n D E JfD c 'BD for each B E 'Bk (see 28.10), the projections are measurable and v� = ).l�1t��, . ,tk is a measure on (IR k,'Bk). Although 1t11, . ,tk is not continuous (see the discussion in §28.2), the stipulation J.l"'(C) = 1 implies that the discontinuity points have J.l"'-measure 0, and hence v� :::::} v"- by the contin uous mapping theorem (26.13). Since v� is the p.m. of a k-vector of r.v.s, and A is arbitrary, the Cramer-Wold theorem (25.5) implies that Vn :::::} v, where Vn = J.ln 1t �: .... . tk is the p.m. of an mk-vector, the distribution of Xn(ti), ... ,Xn(tk). Since t1, ... .tk are arbitrary, the finite dimensional distributions of Xn converge. To complete the proof of sufficiency, we must show that { J.ln} is uniformly tight. Choose A = e1, the vector with 1 in position j and 0 elsewhere, to show that XnJ � Xi; this means the marginal p.m.s are uniformly tight, and so {J.ln} is uni formly tight by 26.23. Then Xn � X by 29.15. To show necessity, on the other hand, simply apply the continuous mapping theo rem to the continuous functional h(x) = A'x. • }= I
.••
'
••.
. .
..
492
The Functional Central Limit Theorem
Although this is a general result, note the importance of the requirement j..!( C) = 1 . It is easy to devise a counter-example where this condition is violated, in which case convergence fails. 29.17 Example Suppose j.l is the p.m. on (D,'BD) which assigns probability 1 to elements x with
x(t) = •
{0,
{0,
t
Also, let l-ln assign probability 1 to elements with
x(t) =
t < ! +k . 1, t � �+ k
If X1 n - j.l all n, and X2n - l-lm then clearly (XJn,X2n) � (X1 ,X2) = (x,x) w.p. l . But X2n - Xl n is equal w.p. 1 to the function in (28. 12), which does not converge in (D ,ds). o
Now we are ready to state the main result. Let {B11(Q) } denote the family of m x 1 vector transformed-Brownian motion processes on [0, 1], whose members are defined by a vector of homeomorphisms (11 \ ... ,11P)' and a covariance matrix Q (m x m). If X - B11(Q ) , the finite-dimensional distributions of X are jointly Gaussian with independent increments, zero mean, and
E(X(t)X(t)') = DH(t)D', where D (m xp) has rank p, DD' = Q and H(t) = diag { 111 (t), .... , 11P (t) } , with H(1 ) = lp . In other words, the jth element of X may be expressed as a linear combination 2J= 1 djkzk where Z = (ZJ , . . . ,Zp)' is a vector of independent processes with Zk - B11k . With p < m, a singular limit is possible. Note, Z = (D'Df 1D'X. 29.18 Theorem Let { Uni } be an array of zero-mean stochastic m-vectors. For an increasing, integer-valued right-continuous function Kn( . ) define X� = L:f�ft)Uni' and suppose that (a) For each fixed m-vector .i\ satisfying .i\'A = 1 there exists a scalar array { c�i} and a homeomorphism 11"' on 1 ] with 11\ ) = and 11\ 1 ) = 1 , such that the conditions of 29.14 hold for the arrays { .i\'Und and { c�d , with respect to 11"'· (b) Letting H(t) be defined as above with elements � denoting 11"' for the case .i\ = ej (ith column of the identity matrix), for j = 1 , .. . ,p,
[0,
E(X�(t)X�(t)') Then X� � X - B11(0). o
---7
DH(t)D' as n
00
-7
oo
.
(29.68)
A point already indicated above is that under these conditions Kn must be the same function for each .i\, and must satisfy condition 29.6(d) in each case as well as 29.14(f'). The condition 11\ 0 1 can always be achieved by a renormalization, =
493
FCLTs for Dependent Processes
and simply requires that differences in scale of the vector elements be absorbed in the matrix D. Proof Consider first the case m = p and D = Im . Condition (a) is sufficient under
A.
The 29.14 for A.'X� � BrJ''-, where this limit is a.s. continuous, for each convergence of the joint distribution of X� now follows by 29. 16. The form of the marginal distributions is implied by 29.14, independence of the vector elements following from condition (b). If D t:. Im , the theorem can be applied to the array { (D'Df 1D'Und , for which the limit in (29.68) is H(t) as before. Since linear transformations preserve Gaussianity, the general conclusion follows by the continuous mapping theorem. • Theorem 29.18 is a highly general result, and the interest lies in establishing how the conditions might come to be satisfied in practice. While we permit Kn(.) t:. [nt] to allow cases like 29.13, these are of relatively small importance, and it will simplify the discussion if it is conducted in terms of the case Kn(t) = [nt] . The K superscript can then be dropped and Xn becomes the vector of ordinary partial sum processes. Even then, the result has considerable generality thanks to the array formul ation, and its interpretation requires care. We can invoke the decomposition Q = 'i:.. + A + A',
(29.69)
n 'F. = lim L E(UniU�i), n--7oo i=l
(29.70)
where
n A = lim L n--7oo i=2
i- 1 _L E(Un,i- m U� i). m= I
(29.71 )
But it should be observed that the conditions of 29.18 do not explicitly impose summability of the covariances. While 'F. and A are finite by construction, without summability it would be possible to have 'F. = 0. We noted previously that condition 29.6(f) appeared to impose summability, but it remains a conjecture that the more general 29.14(f') must always do the same. This conjecture is supported by the need for a summability condition in 24.6, whose conclusion must hold whenever 29.14 holds for the partial sums, but is yet to be demonstrated. Replacing 29.14(f') with more primitive conditions on the increment processes would be a useful extension of the present results, but would probably be difficult at the present level of generality. Note that Q, not 'F., is the covariance of the process, notwithstanding the fact that B,1(Q) is a process with independent increments. The condition Q = I, such that the elements of B11 are independent, neither implies nor is implied by the contemporaneous uncorrelatedness of the Un i· While uncorrelatedness at all lags is sufficient, with 'F. = I and A = 0, it is important to note that when the elements of Uni are related arbitrarily (contemporaneously and/or with a lag) there always
494
The Functional Central Limit Theorem
exists a linear transformation (D'D) - 1D', under which the elements of the limiting process are independent of one another. As we did for the scalar case, we review some of the simplest sets of sufficient conditions. Let Uni = S� 112Ui where Sn = E(Lt= l UiUi) . For this choice, D = Im is imposed automatically. lf { Ud is uniformly Lr-bounded, choose c�i = (A.'S� 1 A.) 112 • Then A.'Un/c�i is a linear combination of the Ui with weights summing to 1 and supi n ii A.'Unilc�d l r < oo holds for any A, so that conditions (a) and (b) of 29.6 are satisfied. The multivariate analogue of 29.7 is then easily obtained:
,
29.19 Corollary Let { Vi} be a zero-mean, uniformly Lr-bounded m-vectorsequence, with each element L2-NED of size -1 on an a-mixing process of size - rl(r - 2); and assume n - 1 Sn ----7 Q < If Xn(t) = n 1 12I\� fl ui , then Xn � B(Q). o
=
-
.
Compare this formulation with (27.82), and as with 29.7, note the important difference from the martingale difference case, with Q taking the place of �- It is also worth reiterating how the statement of conditions is potentially mislead ing, given that the last one is typically hard to fulfil without a NED size of 1 Somewhat trickier is the case of trending moments, where different elements of the vector may even be subject to different trends. The discussion here will have some close parallels with §24.4. Diagonalize Sn as -
.
(29.72) where Mn is diagonal of rank m, and CnC� = C�Cn = lm. Assume, to fix ideas, that Cn ----7 C, which can be thought of as imposing a form of global stationarity on the cross-correlations. Then S � 1 12 = Kn 112C� and E(X"(t)X"(t)' )
=
:::::
M;; 112C;E
(f, ; )
Ui C"M; w
U,
M� 112M!ntJKn 112
----7
(29.73)
H(t),
where the approximation is got by setting Cn to C, and can be made as good as desired by taking n large enough. The status of conditions 29.18(a) and (b) must be checked by evaluating the elements of H in (29. 73 ). An example is the best way to illustrate the possibilities. 29.20
Example
Let
m =
0]
2, and assume E(UPi-m)
=
0 for m i: 0, but let
C' (29.74) i �2 for fixed C. Then, Mn = diag{n�1 + 1 , n �z+ l } and H(t) = diag { t� 1 + 1 �z+ l } . For � 1 , �2 > -1 , �1 +1 and �z+ l are increasing homeomorphisms on the unit square, and
,
condition 29.18(b) is satisfied. It remains to check 29.18(a). Condition 29.14(f') holds for the array { A.'Und with respect to
T) '\t)
=
A.rt� 1 +1
+
A.�t�2+ \
(29.75)
FCLTs for Dependent Processes
495
which, since AI + /0 = 1 , is an increasing homeomorphism on the unit square with 11 \1) = 1 and 11\0) = 0 whenever � 1 , �2 > - 1 . Assuming that 29.6(b) holds for 112 ip 1 ip2 2 2 t.. C ni = I IA Un db = A 1 n ' +l + A2 n (29.76) , P P2+ 1
, ,
[ ( ) ( )]
we can check conditions 29.6(d) and 29.6(e). The latter holds for y = 1· We also find that
/..y((t + 8)p, + l _ tP ' + l) + /..�((t + 8)Pz+l _ tP2+ 1 ) 8 (29.77) ---7 Af( � l + l ) tp , + MC �2 + l )tp2 < oo as 8 ---7 0, where the approximation is as good as desired with large enough n . Condition 29.6(d) is satisfied, and hence 29.18(a) holds. This completes the veri fication of the conditions. o
30 Weak Convergence to S tochastic Integrals
30. 1 Weak Limit Results for Random Functionals
The main task of this chapter is to treat an important corollary to the functional central limit theorem: convergence of a particular class of partial sums to a limit distribution which can be identified with a stochastic integral with respect to Brownian motion, or another Gaussian process. But before embarking on this topic, we first review another class of results involving integrals, superficially similar to what follows, but actually different and rather more straightforward. There will, in fact, turn out to be an unexpected correspondence in certain cases between the results obtained by each approach. For a probability space (Q,C!f,P), we are familiar with the notion of a measurable mapping f : Q f---7 C, where C is C[O, I J as usual. We now want to extend measurability to functionals on C, and especially to integrals. Let F(t) = gfds: C f---7 IR denote the ordinary Riemann integral of f over [O,t] . 30.1
Theorem
If f is �/:Be-measurable, the composite mapping
F(t)0f:
Q
f---7
is �/:B-measurable for t E
IR [0 , 1 ] .
Proof It i s sufficient to show that F(t) i s continuous o n (C, du). This follows since, for G(t) = J�ds, g E C, and 0 ::::; t ::::; 1 , t (30. 1) I F(t) - G(t) l ::::; l f - g l ds ::::; sup l f(s) - g(s) l . •
J0
s
This shows that F(t) is a random variable for any t. Now, writing F: C f---7 C as the mapping whose range is the set of functions assuming the values F(t) at t, it can further be shown that F is a new random function whose distribution is uniquely found by extension from the finite-dimensional distributions, just as for f. The same reasoning extends to F2(t) = f(t'ds , to P, and so on. Other important examples of measurable functionals under du include the extrema, supt { f(t) } and inf { f(t) } . As a simple example of technique, here is an ingenious argument which shows that if B is standard Brownian motion, supt { B(t) } has the half-normal distribution (see (8.26)). Consider the partial sum process Sn t
Weak Convergence to Stochastic Integrals =
497
I'i= 1 �i· where the �i are independent binary r.v.s with P(�i = 1) = P(�i = -1) =
1 · Straightforward enumeration of the sample space shows that
(
P max Si � an 1$ i $ n
)
=
2P(Sn > an) + P(Sn
=
an),
(30.2)
for any an � 0 (see Billingsley 1968: ch. 2. 10). Since this holds for any n, on putting an = Yiia the FCLT implies that the limiting case of (30.2) applies to B, in respect of any constant a � 0. This also defines the limit in distribution of supr{ Xn(t) } for every process Xn satisfying the conditions of 29.4. This is a neat demonstration of the method of extending a limit result from a special case to a general case, using an invariance principle. Limit results for the integrals (i.e, sample means) of partial-sum processes, or continuous functions thereof, are obtained by the straightforward method of teaming a functional central limit theorem with the continuous mapping theorem.
Sno = 0 and Snj = IJ= 1 Uni for j = 1 , ... ,n - 1. If Xn(t) = Sn,[nt]• assume that Xn � BTJ (see 29.1 1) . For any continuous function g: rR � rR , J1 g(B )dt. 1- n - 1 (30.3) � _Lg(Sn) TJ n j =v o 30.2 Theorem Let
·
Proof Formally,
g(Snj)
=
__{\
f(j+1)/ndt nf(j+1)/ng(Xn(t))dt. j!n j/n n - 1 f(j+1)/n f 1 g(Xn(t))dt. g(Xn(t))dt L o 1/n
ng(Snj)
Hence,
n-1 -n1 L g(Snj) ._0 }-
=
.__{\
} �v
(30.4)
=
(30.5)
=
.
Since fbg(x(t) )dt, x E C, is a continuous mapping from C to lR, the result follows by the continuous mapping theorem (26.13). •
Note how g(Snn) is omitted from these sums in accordance with the convention that elements of D are right-continuous. Since the limit process is continuous almost surely, its inclusion would change nothing material. These results illustrate the importance of having 29.14 (with Kn (t) = [nt]) as an alternative to 29.6 as a representation of the invariance principle. The processes X�(t) are defined in [0, 1 ] , and cannot be mapped onto the integers l , ... ,n by setting t = j/n. There is no obvious way of defining the sample average of g(X�) in the manner of (30.3), and for this purpose the partial-sum process Xn with limit BTJ has no substitute. The leading cases of g(.) include the identity function, and the square. For the · former case, 30.2 should be compared with 29.12. Observe that I,}:lsnj = Ii: } (n - i)Uni· If Uni = n-1 12 Un - /CJ, reversing the order of summation in 29.12 shows, in effect, that n I,j: i s j � BTJ(l ), for the case ll(t) = !f. In other words, fl/Jdt - N(O, !).
-1
� 1 -
_
__ _
___ _
L
n
- � --- -- 1 -
- -- -
� � -- -
- -- - -
£ _ __ .._1_ _
£.. _ _
_
4,� � - - 1
rl n2 .J"'-
.LL -
The Functional Central Limit Theorem
498
limit for the case g(.) = (i. These limit results do not generally yield close1 formulae for the c.d.f., so there are no exact tabulations of the percentag' points such as we have for the Gaussian case. Their main practical value is i1 letting us know that the limits exist. Applications in statistical inferenc' usually involve estimating the percentiles of the distributions by Monte Carl< simulation; in other words, tabulating random, variates generated as the average of large but finite samples of g evaluated at a Gaussian drawing, to approximat1 integrals of g(B11). Knowledge of the weak convergence assures us that such approx imations can be made as close as desired by taking n large enough. Given a basic repertoire of limit results, it is not difficult to find th1 distributions of other limit processes and random variables in the same manner. T< take a simple case, if { Vi} is a sequence with constant variance c?, an< n - 112Srn1]fa � B(t) where Srntl = Il�f l vi, we can deduce from the continuou: mapping theorem that the partial sums of the sample mean deviations converge t< the Brownian bridge; i.e.,
(30.6 where Vn n - 1 Vi. On the other hand, if we express the partial sum proces� itself in mean deviations, s1 - Sn where Sn = n- 1 I.'j:6s1, we find convergenc( according to =
Ii=1
1
-
-----vz (Srntl - Sn)
an
D --==--7
f1
B(t) - Bds.
(30.7:
o
The limit process on the right-hand side of (30.7) is the de-meaned B rowniar motion. One must be careful to distinguish the last two cases. The integral of th<: latter over [0,1 ] is identically zero. The mean square of the mean deviatiom converges similarly, according to
(3o.s: There is also an easy generalization of these results to vector processes. The following is the vector counterp art of the leading cases of 30.2, the details oj whose proof the reader can readily supply. 30.3
Corollary
Let { Uni } satisfy the conditions of 29.18. If SnJ
1
D I 1 n- 1 - 2_ SnJ --==--7 B11dt, n ].= I Io
n- 1
fl
-2:, n SnjS�j � BllB�dt. . ]=
1
o
=
E=l Uni• then (30.9)
D
(30. 10j
Weak Convergence to Stochastic Integrals
499
The same approach of applying the
Note in particular that for B, the m-dimensional standard Brownian motion,
fl/Jdt N(O, 1Im)·
continuous mapping theorem yields an important result involving the product of the partial-sum process with its increment. The limits obtained do not appear at first sight to involve stochastic integrals, although there will tum out to be an intimate connection. 30.4 Theorem Let the assumptions of 30.2 hold, with 11 0 ) = 1. Then
-
n- 1 L Snj Un, j+ l � 1(x2( 1 ) cr2) , j= l where cf = limn =n- 1 I'i= l �i · Proof Letting Snj = L{= l Uni = Sn, j 1 + Unj• note the identity Sn2,j+ l = Snj2 + 2SnjUn, j+ l + Un2, j+l ·
-7
-
-
Summing from 0 to n 1 , setting Sno
2 Snn or
=
(30. 1 1)
=
(30. 12)
0, yields
n n- 1 n- 1 " " " 2 2 2 L (Sn , j+ l - S.'lj) = 2L SnjUn , j+ ! + L Unj • j= l j=O j= l
(30. 1 3)
SnjUn,j+l = 1 (s�n - i U�j). � j= j=
(30. 14) l Under the assumptions, Snn � B11 (1) N(O, 1 ) and I'J=I U�i � cf. The result l
follows on applying the continuous mapping theorem and 22.14(i).
•
This is an unexpectedly universal result, for it actually does not depend on the FCLT at all for its validity. It is true so long as { Und satisfies the conditions for a CLT. Since cr2 = 1 2/.. where /.. = limn-7=Lt=2L�:lECUn, i- m UnD, the left-hand side of (30. 1 1) has a mean of zero in the limit if and only if the sequence { Und is serially uncorrelated. There is again a generalization to the vector case, although only in a restricted sense. Let Snj = I{= I Uni• and then generalizing (30. 12) we have the identity
-
(30. 15) Summing and taking limits in the same manner as before leads to the following result. 30.5 Theorem Let { Un d satisfy the conditions of 29.18. Then
n n � + n L S jU ,j+ l L Un, j+I S�j � BTJ (l)B11 (1)' - :E j=l j=l - B(l )B(l)' - :E. o
(30. 16)
Details of the nroof are left to the reader. The oeculiaritv of this result is
500
The Functional Central Limit Theorem
that it does not lead to a limiting distribution for the stochastic matrix n- 1 L.j:lsnjU�, j+ l · This must be obtained by an entirely different approach, which is explored in §30.4. 30.2 S tochastic Processes
m
Continuous Time
To understand how stochastic integrals are constructed requires some additional theory for continuous stochastic processes on [0, 1]. Much of this material is a natural analogue of the results for random sequences studied in Part lll. A filtration is a collection { :¥(t), t E [0, 1 ] } of cr-subfields of events in a complete probability space (Q.,:¥,P) with the property
:¥(t)
c
:¥(s) when t � s. The filtration { :¥(t) } is said to be right-continuous if :¥(t) = :¥(t+) n :¥(s).
(30. 1 7) (30.1 8)
=
s>t A stochastic process X = {X(t), t E [0, 1 ] } is said to be adapted to { :¥(t) } if X(t) is :¥(f)-measurable for each t (compare § 1 5 . 1 ) . Note that right-continuity of the filtration is not the same thing as right-continuity of X, but if X E D (which will be the case in all our examples) adaptation of X(t) to :¥(t) implies adapta tion to :¥(t+ ) and there is typically no loss of generality in assuming (30. 1 8). A stronger notion of measurability is needed for defining stochastic integrals of the X process. {X(t) } is said to be progressively measurable with respect to { :¥(t)} if the mappings X(.,.): Q. x [O,t] 1---7 lR are :¥(t) ® :B[O,tl/:8-measurable, for each t E [0, 1]. Every progressively measurable process is adapted (just consider the rectangles E x [O,t] for E E c:f(t)) but the converse is not always true; with arbitrary functions, measurability problems can arise. However, we do have the following result. 30.6 Theorem An adapted cadlag process is progressively measurable. Proof For an adapted process X E D and any t E
on [O,t] :
(0, 1], define the simple process
X(n) (W,s) = X(CO , Tnk), s E [2 -n (k - 1), Tnk), k = 1 , ... ,[2n t], (30. 19) with X(n) (w,t) = X(w,t). X(n) need not be adapted, but it is a right-continuous function on Q. x [O,t]. If Ek = { ro: X(w, 2 -nk) � x} E :¥(t), then Ax = { (ro,s): X(n) (W,s) � x} [2 -n (k - 1), 2-nk) x £k u {t} x E[znt]+ l =
(y
)
(30.20)
is a finite union of measurable rectangles, and so Ax E :¥(t) ® :Bro, t] · This is true for each x E !R , and hence X(n) is :¥(t) ® :B[O,tl/:8-measurable. Fix ro and s,
Weak Convergence to Stochastic Integrals
501
and note that for each n
oo
X(n)(ffi,s) = X(co,u), (30.21 ) where u > s, and u J. s as n ---7 Since X(co,u ) ---7 X(co,s) by right-continuity, it follows that Xcn) (ffi,s) ---7 X(co,s) everywhere on n x [O,t] and hence X is r:i (t) ® 13[0,tJ/:B-measurable (apply 3.26). This holds for any t, and the theorem
follows.
.
•
Since we are dealing with time as a continuum, we can think of the moment at which some event in the evolution of X occurs as a real random variable. For example, the first time X(t) exceeds some positive constant M in absolute value is
T(co)
inf { t: I X(co,t) l > M}.
=
! E [O, l ]
(30.22)
T( co) is called a stopping time of the filtration { r:J' (t), t E [0, 1 ] } if { co: T(co) $ t } E r:J(t) (compare § 1 5 .2). It is a simple exercise to show that, if X is progres sively measurable, so is the stopped process xT where XT(t) = X(t A 1). Let X E D, and let X(t) be an r:J'(t)-measurable r.v. for each t E [0, 1]. The adapted pair {X(t),�(t) } is said to be a martingale in continuous time if sup E I X(t) l < t
E(X(s) I �(t))
=
=,
X(t) a.s. [P], 0 $ t $ s $ 1.
(30.23) (30.24)
It is called a semimartingale (sub- or super-) if (30.23) plus one of the inequal ities
E(X(s) l r:f(t))
{�} X(t) a.s.[P], 0 $ t $ s $ 1
(30.25)
hold. One way to generate a continuous-time martingale is by mapping a discrete time martingale {Sj,�j}l into [0, 1], rather in the manner of (27.55) and (27.56). If we let X(t) = S[nt]+ J , this is a right-continuous simple function which jumps at the points where [nt] = nt. It is �(f)-measurable where r:J(t) = r:i [ntJ+l , and the collection { r:J' (t), 0 $ t < 1 } is right-continuous. Properties of the martingale can often be generalized from the discrete case. The following result extends the maximal inequalities of 15.14 and 15.15. 30.7 Theorem Let
(
{ (X(t),�(t)) t E [0, 1 ] } be a martingale. Then
E) $ E I XE�) I P, p � 1 , (Kolmogorov inequality). (ii) E ( sup I X(s) IP) $ �� rE I X(t) IP, p > 1 , (Doob inequality). 1 (i) P sup I X(s) I > 5 E [ O, t] S E [O,t]
Proof These inequalities hold if they hold for the supremum over the interval $[0,t)$, noting in (i) that the case $s = t$ is just the Chebyshev inequality. Given a discrete martingale $\{S_k,\mathcal{F}_k\}_{k=1}^m$ with $m = [2^n t]$, define a continuous-time martingale $X_{(n)}$ on $[0,t]$ as in the previous paragraph, by setting $X_{(n)}(s) = S_{[2^n s]+1}$ for $s \in [0,t)$, with $X_{(n)}(t) = X_{(n)}(t-) = S_{[2^n t]}$. The inequalities hold for $X_{(n)}$ by 15.14 and 15.15, noting that
$$\sup_{s\in[0,t)}|X_{(n)}(s)|^p = \max_{1\le k\le m}|S_k|^p \tag{30.26}$$
for $p \ge 1$. Now, given an arbitrary continuous-time martingale $\{X(t),\mathcal{F}(t)\}$, a discrete martingale is defined by setting
$$S_k = X(2^{-n}k), \quad \mathcal{F}_k = \mathcal{F}(2^{-n}k), \quad k = 1,\dots,[2^n t]. \tag{30.27}$$
For this case we have $X_{(n)}(s) = X(u)$ for $u = 2^{-n}([2^n s]+1)$, so that $u \downarrow s$ as $n \to \infty$. Hence $X_{(n)}(s) \to X(s)$ for $s \in [0,t)$, by right continuity. ∎

The class of martingale processes we shall be mainly concerned with satisfy two extra conditions: almost sure continuity ($P(X \in C) = 1$), and square-integrability. A martingale $X$ is said to be square-integrable if $E(X(t)^2) < \infty$ for each $t \in [0,1]$. For such processes, the inequality
$$E(X(s)^2 \mid \mathcal{F}(t)) \ge X(t)^2 \tag{30.28}$$
holds a.s.$[P]$ for $s \ge t$ in view of (30.24), and it follows that $X^2$ is a submartingale. The Doob-Meyer (DM) decomposition of an integrable submartingale, when it exists, is the unique decomposition
$$X(t) = M(t) + A(t), \tag{30.29}$$
where $M$ is a martingale and $A$ an integrable increasing process. The DM decomposition has been shown to exist, with $M$ uniformly integrable, if the set $\{X(\tau),\ \tau \in \mathcal{T}\}$ is uniformly integrable, where $\mathcal{T}$ denotes the set of stopping times of $\{\mathcal{F}(t)\}$ (see e.g. Karatzas and Shreve 1988: th. 4.10). In particular, suppose there exists for a martingale $\{X(t),\mathcal{F}(t)\}$ an increasing, adapted stochastic process $\{\langle X\rangle(t),\mathcal{F}(t)\}$ on $[0,1]$, whose conditionally expected variations match those of $X^2$ almost surely; that is,
$$E(\langle X\rangle(s) \mid \mathcal{F}(t)) - \langle X\rangle(t) = E(X(s)^2 \mid \mathcal{F}(t)) - X(t)^2 \quad \text{a.s.}[P] \tag{30.30}$$
for $s \ge t$. Rearranging (30.30) gives
$$E(X(s)^2 - \langle X\rangle(s) \mid \mathcal{F}(t)) = X(t)^2 - \langle X\rangle(t) \quad \text{a.s.}[P], \tag{30.31}$$
which shows that $\{X(t)^2 - \langle X\rangle(t),\mathcal{F}(t)\}$ is a martingale, and this process accordingly defines the DM decomposition of $X^2$. An increasing adapted process $\{\langle X\rangle(t),\mathcal{F}(t)\}$ satisfying (30.30), which is unique a.s. if it exists, is called the quadratic variation process of $X$.

30.8 Example The Brownian motion process $B$ is a square-integrable martingale with respect to the filtration $\mathcal{F}(t) = \sigma(B(s),\ s \le t)$. The martingale property is an obvious consequence of the independence of the increments of $B$. A special feature of $B$ is that the quadratic variation process is deterministic. Definition 27.7 implies that, for $s \ge t$,
$$E(B(s)^2 \mid \mathcal{F}(t)) - B(t)^2 = E([B(s) - B(t)]^2 \mid \mathcal{F}(t)) = s - t \quad \text{a.s.}[P], \tag{30.32}$$
and rearrangement of the equality shows that $B(t)^2 - t$ is a martingale; that is, $\langle B\rangle(t) = t$. □

Two additional pieces of terminology often arise in this context. A Markov process is an adapted process $\{X(t),\mathcal{F}(t)\}$ having the property
$$P(X(t+s) \in A \mid \mathcal{F}(t)) = P(X(t+s) \in A \mid \sigma(X(t))) \quad \text{a.s.}[P] \tag{30.33}$$
for $A \in \mathcal{B}$ and $t,s \ge 0$. This means that all the information capable of predicting the future path of a Markov process is contained in its current realized value. A diffusion process is a Markov process having continuous sample paths. The sample paths of a diffusion process must be describable in terms of a stochastic mechanism generating infinitesimal increments, although these need not be independent or identically distributed, nor for that matter Gaussian. A Brownian motion, however, is both a Markov process and a diffusion process. We shall not pursue these generalizations very far, but works such as Cox and Miller (1965) or Karatzas and Shreve (1988) might be consulted for further details.

The family $B_\eta$ defined in (29.49) are diffusion processes. They are also martingales, and it is easy to verify that in this case $\langle B_\eta\rangle = \eta$. However, a diffusion process need not be a martingale. An example with increments that are Gaussian but not independent is $X(t) = \theta(t)B(t)$ (see 27.9). Observe that
$$E(X(t+s) - X(t) \mid \mathcal{F}(t)) = (\theta(t+s) - \theta(t))B(t) = \left(\frac{\theta(t+s)}{\theta(t)} - 1\right)X(t) \ne 0 \quad \text{a.s.}[P]. \tag{30.34}$$
A larger class of diffusion processes is defined by the scheme $X(t) = \theta(t)B_\eta(t)$, for eligible choices of $\theta$ and $\eta$. The Ornstein-Uhlenbeck process (27.10) is another example. However, the class $B_\eta$ is the only one we shall be concerned with here.
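Example 30.8 can be checked numerically. The following Python sketch (illustrative only) computes the sum of squared increments of a simulated Brownian path over $[0,t]$, which approximates $\langle B\rangle(t) = t$, and also the sum of absolute increments, which diverges, anticipating the unbounded-variation point made in §30.3:

```python
import numpy as np

rng = np.random.default_rng(1)
t = 0.7
for n in (10, 100, 1000, 10000):
    dB = rng.normal(0.0, np.sqrt(t / n), size=n)   # increments over an n-point partition of [0, t]
    print(n, np.sum(dB ** 2), np.sum(np.abs(dB)))  # quadratic variation -> t; total variation -> infinity
```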
30.3 Stochastic Integrals
In this section we introduce a class of stochastic integrals on $[0,1]$. Let $\{M(t),\mathcal{F}(t)\}$ denote a martingale having a deterministic quadratic variation process $\langle M\rangle$. For a function $f \in D$, satisfying a prescribed set of properties to be detailed below, a stochastic process on $[0,1]$ will be represented by
$$I(\omega,t) = \int_0^t f(\omega,\tau)\,dM(\omega,\tau), \quad t \in [0,1], \tag{30.35}$$
more compactly written as $I(t) = \int f\,dM$. The notation corresponds, for fixed $\omega$, to what we would use for the Riemann-Stieltjes integral of $f(\omega,\cdot)$ over $[0,t]$ with respect to $M(\omega,\cdot)$. However, it is important to appreciate that, for almost every $\omega$, this Riemann-Stieltjes integral does not exist; quite simply, we have not required $M(\omega,\cdot)$ to be of bounded variation, and the example of Brownian motion shows that this requirement could fail for almost all $\omega$. Hence, a different interpretation of the process $I(t)$ is called for.

The results we shall obtain are actually available for a larger class of integrator functions, including martingales whose quadratic variation is a stochastic process. However, it is substantially easier to prove the existence of the integral for the case indicated, and this covers the applications of interest to us.

We assume the existence of a filtration $\{\mathcal{F}(t),\ t \in [0,1]\}$ on a probability space $(\Omega,\mathcal{F},P)$. Let $a\colon [0,1] \mapsto \mathbb{R}$ be a positive, increasing element of $D$, and let $a(0) = 0$ and $a(1) = 1$, with no loss of generality as it turns out. For any $t \in (0,1]$, the restriction of $a$ to $[0,t]$ induces a finite Lebesgue-Stieltjes measure. That is to say, $a$ is a c.d.f., and the function $\int_B da(s)$ assigns a measure to each $B \in \mathcal{B}_{[0,t]}$. Accordingly we can define on the product space $(\Omega \times [0,t],\ \mathcal{F}(t) \otimes \mathcal{B}_{[0,t]})$ the product measure $\mu_a$, where
$$\mu_a(A) = \int_\Omega \int_0^t 1_A(\omega,s)\,da(s)\,dP(\omega) = E\left(\int_0^t 1_A(\omega,s)\,da(s)\right) \tag{30.36}$$
for each $A \in \mathcal{F}(t) \otimes \mathcal{B}_{[0,t]}$. Let $\mathbb{L}_a$ denote the class of functions $f\colon \Omega \mapsto \mathbb{R}^{[0,1]}$ which are (a) progressively measurable (and hence adapted to $\{\mathcal{F}(t)\}$), and (b) square-integrable with respect to $\mu_a$; that is to say, $\|f\| < \infty$, where
$$\|f\| = \left[E\left(\int_0^1 f^2\,da\right)\right]^{1/2}. \tag{30.37}$$
It is then easy to verify that $\|f - g\|$ is a pseudo-metric on $\mathbb{L}_a$. While $\|f - g\| = 0$ does not guarantee that $f(\omega) = g(\omega)$ for every $\omega \in \Omega$, it does imply that the integrals of $f$ and $g$ with respect to $a$ will be equal almost surely $[P]$. In this case we call the functions $f$ and $g$ equivalent.

The chief technical result we need is to show that a class of simple functions is dense in $\mathbb{L}_a$. Let $\mathcal{E}_a \subset \mathbb{L}_a$ denote the class such that $f(t) = f(t_k)$ for $t \in [t_k,t_{k+1})$, $k = 0,\dots,m-1$, and $f(1) = f(1-)$, where $\{t_1,\dots,t_m\} = \Pi_m$ is a partition of $[0,1]$ for some $m \in \mathbb{N}$.

30.9 Lemma (after Kopp 1984) For each $f \in \mathbb{L}_a$, there exists a sequence $\{f_{(n)} \in \mathcal{E}_a,\ n \in \mathbb{N}\}$ with $\|f_{(n)} - f\| \to 0$ as $n \to \infty$.

Proof Let the domain of $f$ be extended to $\mathbb{R}$ by setting $f(t) = 0$ for $t \notin [0,1]$. By square-integrability, $\int_{-\infty}^{+\infty} f(\omega,t)^2\,da(t) < \infty$ a.s.$[P]$, and $\int_{-\infty}^{+\infty}(f(\omega,t+h) - f(\omega,t))^2\,da(t) \to 0$ a.s.$[P]$ as $h \to 0$; hence
$$\lim_{h\to 0} E\left(\int_{-\infty}^{+\infty}(f(s+h) - f(s))^2\,da(s)\right) = 0 \tag{30.38}$$
by the bounded convergence theorem. This holds for any sequence of points going to 0, so, given a partition $\Pi_{m(n)}$ such that $\|\Pi_{m(n)}\| \to 0$ as $n \to \infty$, and $t \in [0,1]$, consider the case $h = k_n(t) - t$, where
$$k_n(t) = t_i, \quad t \in [t_i,t_{i+1}),\ i = 0,\dots,m-1, \tag{30.39}$$
$$k_n(1) = t_{m-1}. \tag{30.40}$$
Clearly, $k_n(t) \to t$. Hence, (30.38) implies that
$$\begin{aligned}
\int_{-\infty}^{+\infty} E\left(\int_0^1 (f(s+k_n(t)) - f(s+t))^2\,da(t)\right) da(s)
&= \int_0^1 E\left(\int_{-\infty}^{+\infty}(f(s+k_n(t)) - f(s+t))^2\,da(s)\right) da(t) \\
&= \int_0^1 E\left(\int_{-\infty}^{+\infty}(f(s+k_n(t)-t) - f(s))^2\,da(s)\right) da(t) \\
&\to 0 \text{ as } n \to \infty,
\end{aligned} \tag{30.41}$$
where the first equality is an application of Fubini's theorem. Since the inner integral on the left-hand side is non-negative, (30.41) implies
$$\lim_{n\to\infty} E\left(\int_0^1 (f(s+k_n(t)) - f(s+t))^2\,da(t)\right) = 0 \tag{30.42}$$
for almost all $s \in \mathbb{R}$. Fixing $s \in [0,1]$ and making a change of variable from $t$ to $t - s$ gives
$$\lim_{n\to\infty} E\left(\int_s^{1+s}(f(t+l_n(t)) - f(t))^2\,da(t-s)\right) = 0, \tag{30.43}$$
where $l_n(t) = k_n(t-s) - (t-s)$. Define a function
$$f_{(n)}(t) = \begin{cases} f(t+l_n(t)), & t+l_n(t) \in [0,1] \\ 0, & \text{otherwise}, \end{cases} \tag{30.44}$$
noting that $f_{(n)}(t) = f(t_i + s)$ for $t \in [t_i+s,\ t_{i+1}+s) \cap [0,1]$, and hence $f_{(n)} \in \mathcal{E}_a$. Given (30.43), the proof is completed by noting that $[0,1] \subseteq [s-1,\ 1+s]$, and hence
$$\|f_{(n)} - f\|^2 = E\left(\int_0^1 (f(t+l_n(t)) - f(t))^2\,da(t)\right) \to 0 \text{ as } n \to \infty. \quad \blacksquare \tag{30.45}$$
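The content of Lemma 30.9 is easy to visualize. In the following sketch (ours, not part of the formal development) we take $a(t) = t$, so that $\|\cdot\|$ reduces to an ordinary $L_2$ norm, and a deterministic cadlag $f$ chosen purely for illustration; the approximation error of the simple function shrinks as the partition is refined:

```python
import numpy as np

f = lambda t: np.where(t < 0.4, np.sin(6.0 * t), 1.0 + 0.5 * t)  # cadlag, with a jump at t = 0.4

grid = np.linspace(0.0, 1.0, 100001)
for m in (4, 16, 64, 256):
    tk = np.linspace(0.0, 1.0, m + 1)                      # partition Pi_m
    idx = np.minimum(np.searchsorted(tk, grid, side='right') - 1, m - 1)
    f_m = f(tk[idx])                                       # simple function: f_(m)(t) = f(t_k) on [t_k, t_{k+1})
    print(m, np.sqrt(np.mean((f_m - f(grid)) ** 2)))       # discretized ||f_(m) - f||, decreasing to 0
```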
$$\int_0^t B\,dB = \tfrac{1}{2}(B(t)^2 - t). \tag{30.54}$$
Put $t = 1$ and compare this with 30.4. It is apparent (and will be proved rigorously in 30.13 below) that the limit in (30.11) can be expressed as $\int_0^1 B\,dB + \lambda$, where $\lambda = \tfrac{1}{2}(1 - \sigma^2)$, as before. □

A form of Itô's rule holds for a general class of continuous semimartingales. The proof of the general result is lengthy (see for example Karatzas and Shreve 1988 or McKean 1969 for details) and we will give the proof just for the case of 30.11, to avoid complications with the possible unboundedness of $g''$. However, there is little extra difficulty in extending from ordinary Brownian motion to the class of diffusion processes $B_\eta$.

30.12 Theorem $\int_0^t B_\eta\,dB_\eta = \tfrac{1}{2}(B_\eta(t)^2 - \eta(t))$, a.s.
Proof Let $\Pi_n$ denote the partition of $[0,t]$ in which $t_j = jt/n$ for $j = 1,\dots,n$. Use Taylor expansions to second order to obtain the identity
$$B_\eta(t)^2 = 2\sum_{j=0}^{n-1} B_\eta(t_j)(B_\eta(t_{j+1}) - B_\eta(t_j)) + \sum_{j=0}^{n-1}(B_\eta(t_{j+1}) - B_\eta(t_j))^2. \tag{30.55}$$
We show the $L_2$ convergence of each of the sums in the right-hand member. $B_\eta \in \mathbb{L}_\eta$, so define $P_n \in \mathcal{E}_\eta$ by
$$P_n(s) = B_\eta(t_j), \quad s \in [t_j,t_{j+1}),\ j = 0,\dots,n-1, \tag{30.56}$$
and $P_n(s) = B_\eta(s)$ for $t \le s \le 1$. This is a construction similar to that used in the proof of 30.9, and $\|P_n - B_\eta\| \to 0$ as $n \to \infty$. We may write
$$\sum_{j=0}^{n-1} B_\eta(t_j)(B_\eta(t_{j+1}) - B_\eta(t_j)) = \sum_{j=0}^{n-1}\int_{t_j}^{t_{j+1}} B_\eta(t_j)\,dB_\eta(s) = \int_0^t P_n\,dB_\eta, \tag{30.57}$$
and it follows that
$$E\left(\sum_{j=0}^{n-1} B_\eta(t_j)(B_\eta(t_{j+1}) - B_\eta(t_j)) - \int_0^t B_\eta\,dB_\eta\right)^2 \to 0 \text{ as } n \to \infty. \tag{30.58}$$
Considering the second sum on the right-hand side of (30.55), we have
$$\begin{aligned}
E\left(\sum_{j=0}^{n-1}(B_\eta(t_{j+1}) - B_\eta(t_j))^2 - \eta(t)\right)^2
&= E\left(\sum_{j=0}^{n-1}\left[(B_\eta(t_{j+1}) - B_\eta(t_j))^2 - (\eta(t_{j+1}) - \eta(t_j))\right]\right)^2 \\
&= \sum_{j=0}^{n-1} E\left[(B_\eta(t_{j+1}) - B_\eta(t_j))^2 - (\eta(t_{j+1}) - \eta(t_j))\right]^2 \\
&= 2\sum_{j=0}^{n-1}(\eta(t_{j+1}) - \eta(t_j))^2 \\
&\le 2\eta(t)\max_{0\le j\le n-1}\{\eta(t_{j+1}) - \eta(t_j)\} \\
&\to 0 \text{ as } n \to \infty.
\end{aligned} \tag{30.59}$$
The second equality here is due to the fact that the cross-products disappear in expectation, thanks to the law of iterated expectations and the fact that
$$E\left((B_\eta(t_{j+1}) - B_\eta(t_j))^2 \mid \mathcal{F}(t_j)\right) = \eta(t_{j+1}) - \eta(t_j) \quad \text{a.s.} \tag{30.60}$$
The third equality applies the Gaussianity of the increments, together with 9.7 for the fourth moments, and the inequality uses the continuity of $\eta$ and the fact that $\|\Pi_n\| \to 0$. Thus, $B_\eta(t)^2$ can be decomposed as the sum of sequences converging in $L_2$-norm to, respectively, $2\int_0^t B_\eta\,dB_\eta$ and $\eta(t)$. However, according to 18.6, $L_2$ convergence implies convergence with probability 1 on a subsequence $\{n_k\}$. Since the choice of partitions is arbitrary so long as $\|\Pi_{n_k}\| \to 0$, the theorem follows. ∎

The special step in this result is of course (30.59). In a continuous function of bounded variation, the sum of the squared increments is dominated by the largest increment and so must vanish by continuity, just as happens with $\eta(t)$ in the last line of the expression. It is because its variation is unbounded a.s. that the same sort of thing does not happen with $B_\eta$.
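The decomposition used in this proof can be verified by simulation in the standard case $\eta(t) = t$. This sketch (ours, with a uniform partition of $[0,1]$) computes both right-hand-side sums of (30.55):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 2000, 5000
dB = rng.normal(0.0, 1.0 / np.sqrt(n), size=(reps, n))        # Brownian increments, t = 1
B = np.cumsum(dB, axis=1)
B_left = np.hstack([np.zeros((reps, 1)), B[:, :-1]])          # B(t_j) at left endpoints
ito_sum = np.sum(B_left * dB, axis=1)                         # sum_j B(t_j)(B(t_{j+1}) - B(t_j))
qv = np.sum(dB ** 2, axis=1)                                  # sum of squared increments
print(np.mean((ito_sum - 0.5 * (B[:, -1] ** 2 - 1.0)) ** 2))  # ~0: the sum approximates (B(1)^2 - 1)/2
print(np.mean((qv - 1.0) ** 2))                               # ~0: quadratic variation term -> eta(1) = 1
```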
30.4 Convergence to Stochastic Integrals

Let $\{U_{nj}\}$ and $\{W_{nj}\}$ be a pair of stochastic arrays, let $X_n(t) = \sum_{j=1}^{[nt]} U_{nj}$ and $Y_n(t) = \sum_{j=1}^{[nt]} W_{nj}$, and suppose that $(X_n,Y_n) \Rightarrow (B_x,B_y)$, where $B_x$ and $B_y$ are a pair of transformed Brownian motions from the class $B_\eta$, with quadratic variation processes $\eta_x$ and $\eta_y$, the latter being homeomorphisms on the unit interval. In what follows it is always possible, for fixing ideas, to think of $B_x$ and $B_y$ as simple Brownian motions, having $\eta_x(t) = \eta_y(t) = t$. However, the extensions required to relax this assumption are fairly trivial. The problem we wish to consider is the convergence of the partial sums
n -1 "2_ XnU1n ) ( Yn ((j + 1)/n) - Yn (jln) ) .
(30.61) j= l This problem differs from those of §30. 1 because it cannot be deduced merely from combining the functional CLT with the continuous mapping theorem. None the less, it is possible to show that the convergence holds under the conditions of 29.14. The following theorem generalizes one given by Chan and Wei ( 1 988). See inter alia Strasser ( 1986), Phillips (1988), Kurtz and Protter ( 1 99 1 ), Hansen ( 1992c), for alternative approaches to this type of result. 30.13 Theorem Let { Unj• Wnj} be a (2 x 1) stochastic array satisfying the condi tions of 29.18 for the case Kn(t) = [nt]. In addition, assume that both Unj and Wnj are �-NED of size -1 on { Vni } . Then =
Gn
where, with Anjk
=
D �
f1 BxdBy + Axy,
(30.62)
0
E(UnjWnk), Axr
=
n -1 i- 1 lim L L An,i-m,i+ l · n�oo i= l m=O
(30.63)
D
An admissible case here is $U_{nj} = W_{nj}$, in which case the relevant joint distribution is singular. Setting the $L_2$-NED size at $-1$ ensures that the covariances are summable in the sense of 17.7. This strengthening of the conditions of 29.18 is typically only nominal, in the light of the discussions that follow 29.6 and 29.18. However, be careful to see that summability is not required to ensure that $|\Lambda_{xy}| < \infty$, which holds under the conditions of 29.18 merely by choice of normalization. Its role will become apparent in the course of the proof.

Proof The main ingredient of this proof is the Skorokhod representation theorem, 26.25, which at crucial steps in the argument allows us to deduce weak convergence from the a.s. convergence of a random sequence, and vice versa. Let $(X_n,Y_n)$ be an element of the separable, complete metric space $(D^2,d_B^2)$ (see §29.5). Since
$$(X_n,Y_n) \Rightarrow (B_x,B_y) \tag{30.64}$$
by 29.18, Skorokhod's theorem implies the existence of a sequence $\{(X^n,Y^n) \in D^2,\ n \in \mathbb{N}\}$ such that $(X^n,Y^n)$ is distributed like $(X_n,Y_n)$, and $d_B^2((X^n,Y^n),(B_x,B_y)) \to 0$ a.s. According to Egoroff's theorem (18.4) and the equivalence of $d_S$ and $d_B$ in $D$, (30.64) implies that, for a set $C_\varepsilon \in \mathcal{F}$ with $P(C_\varepsilon) \ge 1 - \varepsilon$,
$$\sup_{\omega\in C_\varepsilon} d_B^2\left((X^n(\omega),Y^n(\omega)),(B_x(\omega),B_y(\omega))\right) \to 0 \tag{30.65}$$
for each $\varepsilon > 0$. Since $B_x$ is a.s. continuous, there exists a set $E_X$ with $P(E_X) = 1$ and the following property: if $\omega \in E_X$, then for any $\eta > 0$ there is a constant $\delta > 0$ such that, if $d_B(X^n(\omega),B_x(\omega)) \le \delta$,
$$\sup_t |X^n(\omega,t) - B_x(\omega,t)| \le \sup_t |X^n(\omega,t) - B_x(\omega,\lambda(t))| + \sup_t |B_x(\omega,t) - B_x(\omega,\lambda(t))| \le \eta, \tag{30.66}$$
where $\lambda(\cdot)$ is the function from (28.7). The same result holds for $Y$ in respect of a set $E_Y$ with $P(E_Y) = 1$. It follows from (30.65) that, for $\omega \in C^*_\varepsilon = C_\varepsilon \cap E_X \cap E_Y$,
$$d_B^2\left((X^n(\omega),Y^n(\omega)),(B_x(\omega),B_y(\omega))\right) = \delta_n \to 0, \tag{30.67}$$
where the equality defines $\delta_n$. Note too that $P(C^*_\varepsilon) = P(C_\varepsilon)$. For each member of an increasing integer subsequence $\{k_n,\ n \in \mathbb{N}\}$, choose an ordered subset $\{n_1,n_2,\dots,n_{k_n}\}$ of the integers $1,\dots,n$, with $n_{k_n} = n$, such that $\min_{1\le j\le k_n}\{n_j - n_{j-1}\} \to \infty$. Use these sets to define partitions of $[0,1]$, $\Pi_n = \{t_1,\dots,t_{k_n}\}$, where $t_j = n_j/n$. Assume that $\{k_n\}$ is increasing slowly enough that $k_n\delta_n^2 \to 0$ and $k_n/n \to 0$, but note that, provided $k_n \uparrow \infty$, it is always possible to have $\|\Pi_n\| \to 0$. For example, choosing $n_j = [nj/k_n]$ will satisfy these conditions. The main steps to be taken are now basically two. Define
$$G^*_n = \sum_{j=1}^{k_n} X_n(t_{j-1})(Y_n(t_j) - Y_n(t_{j-1})), \tag{30.68}$$
and also let $\tilde{G}^*_n$ represent the same expression except that the Skorokhod variables $X^n$ and $Y^n$ are substituted for $X_n$ and $Y_n$. In view of 22.18 and the fact that $\tilde{G}^*_n$ and $G^*_n$ have the same distribution, to establish $G^*_n \xrightarrow{D} \int_0^1 B_x\,dB_y$ it will suffice to prove that
$$\left|\tilde{G}^*_n - \int_0^1 B_x\,dB_y\right| \xrightarrow{pr} 0. \tag{30.69}$$
The proof will then be completed, in view of 22.14(i), by showing
$$G_n - G^*_n - \Lambda_{xy} \xrightarrow{pr} 0. \tag{30.70}$$
The Cauchy-Schwartz inequality and (30.67) give, for each $\omega \in C^*_\varepsilon$,
$$\begin{aligned}
\left(\sum_{j=1}^{k_n}(X^n(\omega,t_{j-1}) - B_x(\omega,t_{j-1}))(Y^n(\omega,t_j) - Y^n(\omega,t_{j-1}))\right)^2
&\le \sum_{j=1}^{k_n}(X^n(\omega,t_{j-1}) - B_x(\omega,t_{j-1}))^2 \sum_{j=1}^{k_n}(Y^n(\omega,t_j) - Y^n(\omega,t_{j-1}))^2 \\
&\le k_n\delta_n^2 \sum_{j=1}^{k_n}(Y^n(\omega,t_j) - Y^n(\omega,t_{j-1}))^2.
\end{aligned} \tag{30.71}$$
Also, the assumptions on $Y_n$, and equivalence of the distributions, imply that
$$E\left(\sum_{j=1}^{k_n}(Y^n(t_j) - Y^n(t_{j-1}))^2\right) = O(1), \tag{30.72}$$
and hence, from (30.71),
$$E\left(\sum_{j=1}^{k_n}(X^n(t_{j-1}) - B_x(t_{j-1}))(Y^n(t_j) - Y^n(t_{j-1}))\,1_{C^*_\varepsilon}\right)^2 \to 0. \tag{30.73}$$
Closely similar arguments give
$$E\left(\sum_{j=1}^{k_n}(Y^n(t_j) - B_y(t_j))(B_x(t_j) - B_x(t_{j-1}))\,1_{C^*_\varepsilon}\right)^2 \to 0, \tag{30.74}$$
and also
$$E\left(B_x(1)(Y^n(1) - B_y(1))\,1_{C^*_\varepsilon}\right)^2 \le \delta_n^2 \to 0. \tag{30.75}$$
We now use the method of 'summation by parts'; given arbitrary real numbers $\{a_j,b_j,\alpha_j,\beta_j,\ j = 1,\dots,k\}$ with $a_0 = b_0 = \alpha_0 = \beta_0 = 0$, we have the identity
$$\sum_{j=1}^{k} a_{j-1}(b_j - b_{j-1}) - \sum_{j=1}^{k}\alpha_{j-1}(\beta_j - \beta_{j-1}) = \sum_{j=1}^{k}(a_{j-1} - \alpha_{j-1})(b_j - b_{j-1}) + \alpha_k(b_k - \beta_k) - \sum_{j=1}^{k}(b_j - \beta_j)(\alpha_j - \alpha_{j-1}). \tag{30.76}$$
Put $k = k_n$, $a_j = X^n(\omega,t_j)$, $b_j = Y^n(\omega,t_j)$, $\alpha_j = B_x(\omega,t_j)$, and $\beta_j = B_y(\omega,t_j)$. Then the left-hand side of (30.76) corresponds to $\tilde{G}^*_n - P_n$, where
$$P_n = \sum_{j=1}^{k_n} B_x(t_{j-1})(B_y(t_j) - B_y(t_{j-1})) = \sum_{j=1}^{k_n}\int_{t_{j-1}}^{t_j} B_x(t_{j-1})\,dB_y(t), \tag{30.77}$$
and the squares of the right-hand side terms correspond to the integrands in (30.73), (30.74), and (30.75). Since $\varepsilon$ is arbitrary, $P(C^*_\varepsilon)$ can be set arbitrarily close to 1, so that each of these terms vanishes in $L_2$-norm. We may conclude that $|\tilde{G}^*_n - P_n| \xrightarrow{pr} 0$. So, to get (30.69), it suffices to show that $|P_n - \int_0^1 B_x\,dB_y| \xrightarrow{pr} 0$. But
$$\begin{aligned}
E\left(P_n - \int_0^1 B_x\,dB_y\right)^2 &= E\left(\sum_{j=1}^{k_n}\int_{t_{j-1}}^{t_j}(B_x(t_{j-1}) - B_x(t))\,dB_y(t)\right)^2 \\
&= \sum_{j=1}^{k_n}\int_{t_{j-1}}^{t_j}(\eta_x(t) - \eta_x(t_{j-1}))\,d\eta_y(t) \\
&\le \max_{1\le j\le k_n}\{\eta_x(t_j) - \eta_x(t_{j-1})\}\,\eta_y(1) \\
&\to 0,
\end{aligned} \tag{30.78}$$
where the second equality applies (30.51) and then Fubini's theorem, and the convergence is by continuity of $\eta_x$. This completes the proof of (30.69).

To show (30.70), we use the fact that
$$Y_n(t_j) - Y_n(t_{j-1}) = \sum_{i=n_{j-1}}^{n_j-1}\left(Y_n((i+1)/n) - Y_n(i/n)\right),$$
and so
$$\begin{aligned}
G_n - G^*_n &= \sum_{j=1}^{k_n}\left[\sum_{i=n_{j-1}}^{n_j-1} X_n(i/n)\left(Y_n((i+1)/n) - Y_n(i/n)\right) - X_n(t_{j-1})(Y_n(t_j) - Y_n(t_{j-1}))\right] \\
&= \sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{n_j-1}(X_n(i/n) - X_n(t_{j-1}))\left(Y_n((i+1)/n) - Y_n(i/n)\right) \\
&= \sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{n_j-1}\sum_{m=0}^{i-n_{j-1}} U_{n,i-m}W_{n,i+1},
\end{aligned} \tag{30.79}$$
where we have formally set $U_{n0} = 0$. The final equality represents the shift from summing the elements of a triangular array by rows to summing by diagonals, and we use whichever of the two versions of the expression is most convenient. In view of (30.63), $G_n - G^*_n - \Lambda_{xy} = A_n - B_n$, where (summing by diagonals)
$$A_n = \sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{n_j-1}\sum_{m=0}^{i-n_{j-1}}(U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}) \tag{30.80}$$
and (summing by rows)
$$B_n = \sum_{j=1}^{k_n}\left(\sum_{i=n_{j-1}}^{n_j-1}\sum_{m=i-n_{j-1}+1}^{i-1}\lambda_{n,i-m,i+1}\right). \tag{30.81}$$
The problem is to show that both $A_n \xrightarrow{pr} 0$ and $B_n \to 0$. Choose a finite integer $N$ and break up $A_n$ into $N+1$ additive components $A_{n0},\dots,A_{nN}$, where, by taking $n$ large enough that $N \le \min_{1\le j\le k_n}\{n_j - n_{j-1}\}$, we can write
$$A_{nm} = \sum_{j=1}^{k_n}\sum_{i=m+n_{j-1}}^{n_j-1}(U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}) \tag{30.82}$$
for $m = 0,\dots,N-1$, and
$$A_{nN} = \sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{n_j-1}\sum_{m=N}^{i-n_{j-1}}(U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}). \tag{30.83}$$
For fixed finite $m$,
the process $\{U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1},\ \sigma(V_{nk},\ k \le i+1)\}$ is, according to 17.11 and 17.6, an $L_1$-mixingale array of size $-1$ with respect to the constant array $\{4c^u_{n,i-m}c^w_{n,i+1}\}$, where $\{c^u_{ni}\}$ and $\{c^w_{ni}\}$ are the constants specified by 29.18 for the cases $(1,0)'$ and $(0,1)'$ respectively. We next show that the conditions of 19.11 are satisfied by these terms, so that
$$A^*_{nm} = \sum_{i=m+1}^{n-1}(U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}) \xrightarrow{pr} 0. \tag{30.84}$$
First, for $r > 2$ in the $\alpha$-mixing case (29.6) or for $r \ge 2$ in the $\phi$-mixing case (29.9),
$$\sup_{i,n} E\left|\frac{U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}}{c^u_{n,i-m}c^w_{n,i+1}}\right|^{r/2} \le 2^{r/2}\sup_{i,n} E\left|\frac{U_{n,i-m}W_{n,i+1}}{c^u_{n,i-m}c^w_{n,i+1}}\right|^{r/2} \le 2^{r/2}\sup_{i,n}\left\|\frac{U_{n,i-m}}{c^u_{n,i-m}}\right\|_r^{r/2}\left\|\frac{W_{n,i+1}}{c^w_{n,i+1}}\right\|_r^{r/2} < \infty, \tag{30.85}$$
where the first inequality makes use successively of Loève's $c_r$ inequality and Jensen's inequality, the second one is by the Cauchy-Schwartz inequality, and the finiteness is because the arrays satisfy either 29.6(b) or 29.9(b$'$) by assumption. In the latter case, note that the assumptions include uniform square-integrability. Therefore the array
$$\left\{\frac{U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}}{c^u_{n,i-m}c^w_{n,i+1}}\right\}$$
is uniformly integrable in either case, and condition 19.11(a) is met. Next, the arrays $\{c^u_{ni}\}$ and $\{c^w_{ni}\}$ satisfy condition 29.6(d) by assumption, which by the Cauchy-Schwartz inequality implies that
$$\sup_{t\in[0,1),\,\delta\in(0,1-t]}\ \limsup_{n\to\infty}\ \frac{1}{\delta}\sum_{i=[nt]+m+1}^{[n(t+\delta)]-1} c^u_{n,i-m}c^w_{n,i+1} < \infty. \tag{30.86}$$
Setting $t = 0$ and $\delta = 1$ in (30.86) gives
$$\limsup_{n\to\infty}\sum_{i=m+1}^{n-1} c^u_{n,i-m}c^w_{n,i+1} < \infty, \tag{30.87}$$
which is condition 19.11(b), whereas setting $\delta = 1/n$ gives
$$\max_{m+1\le i\le n-1}\{c^u_{n,i-m}c^w_{n,i+1}\} = O(1/n), \tag{30.88}$$
and (30.87) and (30.88) together imply
$$\sum_{i=m+1}^{n-1}(c^u_{n,i-m}c^w_{n,i+1})^2 \to 0, \tag{30.89}$$
which is condition 19.11(c). So $A^*_{nm} \xrightarrow{pr} 0$ is proved. But for $m \ge 1$, according to (30.82),
$$E|A_{nm} - A^*_{nm}| \le \sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{m+n_{j-1}-1} E|U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}| = O(k_n/n), \tag{30.90}$$
where the order of magnitude is by (30.85) and (30.88), so $A_{nm} \xrightarrow{pr} 0$ also holds, for each $m = 0,\dots,N-1$. Similarly, $E|U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}| \le 2|\lambda_{n,i-m,i+1}|$, and applying 17.7 yields
$$E|A_{nN}| = O\left(\sum_{j=1}^{k_n}\sum_{m=N}^{n_j-n_{j-1}-1}\sum_{i=m+n_{j-1}}^{n_j-1} c^u_{n,i-m}c^w_{n,i+1}\,\zeta_{m+1}\right) = O(N^{-\delta}) \tag{30.91}$$
for some $\delta > 0$, where the order of magnitude follows by a combination of (30.87) with the fact that the sequence $\{\zeta_m\}$ is of size $-1$, according to the mixing and $L_2$-NED size assumptions. Thus, $\lim_{n\to\infty}E|A_n| \le \lim_{n\to\infty}E|A_{nN}|$, which by taking $N$ large enough can be made as close to 0 as desired. In the same manner, recalling $N \le \min_{1\le j\le k_n}\{n_j - n_{j-1}\}$,
$$E|B_n| = O\left(\sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{n_j-1}\sum_{m=i-n_{j-1}+1}^{i-1} c^u_{n,i-m}c^w_{n,i+1}\,\zeta_{m+1}\right) = O(N^{-\delta}). \tag{30.92}$$
It follows that $G_n - G^*_n - \Lambda_{xy} \xrightarrow{pr} 0$, and this completes the proof of (30.70), and of the theorem. ∎
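A small simulation illustrates the role of $\Lambda_{xy}$. Taking $U_{nj} = W_{nj} = u_j/\sqrt{n}$ with $u_j$ an MA(1) process (an assumption made purely for illustration), the one-sided covariance sum equals the MA coefficient $\theta$, and the Monte Carlo mean of $G_n$ settles near $\theta$ because $E(\int_0^1 B_x\,dB_y) = 0$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, theta = 1000, 4000, 0.5
e = rng.normal(size=(reps, n + 1))
u = e[:, 1:] + theta * e[:, :-1]                 # MA(1) increments; E(u_t u_{t+1}) = theta, higher lags 0
S = np.cumsum(u, axis=1)                         # partial sums: S_j = sqrt(n) * X_n(j/n)
G = np.sum(S[:, :-1] * u[:, 1:], axis=1) / n     # G_n = sum_j X_n(j/n)(Y_n((j+1)/n) - Y_n(j/n))
print(np.mean(G))                                # ~ Lambda_xy = theta = 0.5
```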
Now let $\{U_{nj}\}$ ($m \times 1$) be a vector array satisfying the conditions of 29.18, plus the extra condition that the $L_2$-NED size of the increments is $-1$ for each element. Since 30.13 holds for each element paired with each element, including itself, the argument may be generalized in the following manner.

30.14 Theorem Let $S_{nj} = \sum_{i=1}^{j} U_{ni}$. Then
$$\sum_{j=1}^{n-1} S_{nj}U'_{n,j+1} \xrightarrow{D} \int_0^1 B_\eta\,dB'_\eta + \Lambda \quad (m \times m). \tag{30.93}$$
Proof For arbitrary $m$-vectors of unit length, $\lambda$ and $\mu$, the scalar arrays $\{\lambda'S_{nj}\}$ and $\{\mu'U_{n,j+1}\}$ satisfy the conditions of 30.13. Letting $G_n$ denote the matrix on the left-hand side of (30.93), and $G$ the matrix on the right-hand side, the result $\lambda'G_n\mu \xrightarrow{D} \lambda'G\mu$ is therefore given. A well-known matrix formula (see e.g. Magnus and Neudecker 1988: th. 2.2) yields
$$\lambda'G_n\mu = (\mu' \otimes \lambda')\,\mathrm{Vec}\,G_n, \tag{30.94}$$
where $\mu' \otimes \lambda'$ is the Kronecker product of the vectors, the row vector $(\mu_1\lambda_1,\dots,\mu_1\lambda_m,\mu_2\lambda_1,\dots,\mu_m\lambda_m)$ ($1 \times m^2$), and $\mathrm{Vec}\,G_n$ ($m^2 \times 1$) is the vector consisting of the columns of $G_n$ stacked one above the other. $\mu' \otimes \lambda'$ is of unit length, and applying the Cramér-Wold theorem (25.5) in respect of (30.94) implies that $G_n \xrightarrow{D} G$, as asserted in (30.93). ∎
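The matrix identity (30.94) invoked in the proof is easily checked numerically; a minimal sketch (the array names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 3
G = rng.normal(size=(m, m))
lam = rng.normal(size=m); lam /= np.linalg.norm(lam)   # unit-length lambda
mu = rng.normal(size=m);  mu /= np.linalg.norm(mu)     # unit-length mu
vecG = G.flatten(order='F')                            # Vec G: columns stacked one above the other
print(np.allclose(lam @ G @ mu, np.kron(mu, lam) @ vecG))  # lambda'G mu = (mu' (x) lambda') Vec G
```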
This result is to be compared with 30.5. Between them they provide the intriguing incidental information that
$$\int_0^1 B_\eta\,dB'_\eta + \int_0^1 (dB_\eta)B'_\eta = B_\eta(1)B_\eta(1)' - \Omega. \tag{30.95}$$
(Note that the stochastic matrix on the right has rank 1.) Of the two, 30.14 is much the stronger result, since it derives from the FCLT and is specific to the pattern of the increment variances. Between them, 30.3 and 30.14 provide the basic theoretical tools necessary to analyse the linear regression model in variables generated as partial-sum processes (integrated processes). See Phillips and Durlauf (1986), and Park and Phillips (1988, 1989), among many other recent references.
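As a pointer to these applications, the following sketch (our illustration, assuming i.i.d. unit-variance increments) simulates the sample moment $n^{-1}\sum_t y_{t-1}u_t$ for a random walk $y_t$; by the results of this section its limit is $\int_0^1 B\,dB = \tfrac{1}{2}(B(1)^2 - 1)$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 1000, 20000
u = rng.normal(size=(reps, n))                       # iid increments, sigma^2 = 1
y = np.cumsum(u, axis=1)                             # random walk (integrated process)
stat = np.sum(y[:, :-1] * u[:, 1:], axis=1) / n      # n^{-1} sum_t y_{t-1} u_t
limit = 0.5 * (rng.normal(size=reps) ** 2 - 1.0)     # draws of (B(1)^2 - 1)/2
print(np.mean(stat), np.var(stat))                   # ~ 0 and ~ 0.5,
print(np.mean(limit), np.var(limit))                 # matching the limit law
```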
Notes
1. See Billingsley (1979, 1986). The definition of a $\lambda$-system is given as 1.25 in Billingsley (1979), and as 1.26 in Billingsley (1986).
2. The 'prime' symbol $'$ denotes transposition. $f$ is a column vector, written as a row for convenience.
3. An affine transformation is a linear transformation $x \mapsto Ax$ followed by a translation, addition of a constant vector $b$. By an accepted abuse of terminology, such transformations tend to be referred to as 'linear'.
4. That is, $|x + y| \le |x| + |y|$. See §5.1 for more details.
5. The notations $\int f\,d\mu$, $\int f\,\mu(d\omega)$, or simply $\int f$ when the relevant measure is understood, are used synonymously by different authors.
6. I thank Elizabeth Boardman for supplying this proof.
7. Elizabeth Boardman also suggested this proof.
8. If there is a subset $N \subset \Omega$ such that every $\mathcal{F}$-set either contains $N$ or is disjoint from it, the elements of $N$ cease to be distinguishable as different outcomes. An equivalent model of the random experiment is obtained by redefining $\Omega$ to have $N$ itself as an element, replacing its individual members.
9. Random variables may also be complex-valued; see §11.2.
10. In statements of general definitions and results we usually consider the case of a one-sided sequence $\{X_t\}_1^\infty$. There is no difficulty in extending the concepts to the case $\{X_t\}_{-\infty}^{+\infty}$, and this is left implicit, except when the mapping from $\mathbb{Z}$ plays a specific role in the argument.
11. We adhere somewhat reluctantly to the convention of defining size as a negative number. The use of terms such as 'large size' to mean a slow (or rapid?) rate of mixing can obviously lead to confusion, and is best avoided.
12. This is a problem of the weak convergence of distributions; see §22.4 for further details.
13. This is similar to an example due to Athreya and Pantula (1986a).
14. In the theory of functions of a complex variable, an analytic function is one possessing finite derivatives everywhere in its domain.
15. I am grateful to Graham Brightwell for this argument.
16. For convergence to fail, the discontinuities of $f$ (which must be Borel measurable) would have to occupy a set of positive Lebesgue measure.
17. Conventionally, and for ease of notation, the symbol $\mathcal{F}_t$ is used here to denote
what has been previously written as $\mathcal{F}^t_{-\infty}$. No confusion need arise, since a $\sigma$-subfield bearing a time subscript but no superscript will always be interpreted in this way.
18. Some quoted versions of this result (e.g. Hall and Heyde 1980) are for $p > 1$, whereas the present version, adapted from Karatzas and Shreve (1988), extends to $0 < p \le 1$ as well.
19. The norm, or length, of a $k$-vector $X$ is $\|X\| = (X'X)^{1/2}$. To avoid confusion with the $L_2$ norm of a r.v., the latter is always written with a subscript.
20. The original St Petersburg Paradox, enunciated by Daniel Bernoulli in 1738, considered a game in which the player wins £$2^{n-1}$ if the first head appears on the $n$th toss, for any $n$. The expected winnings in this case are infinite, but the principle involved is the same in either case. See Shafer (1988).
21. See the remarks following 3.18. It is true that in topological spaces projections are continuous, and hence measurable, under the product topology (see §6.5), but of course the abstract space $(\Omega,\mathcal{F})$ lacks topological structure and this reasoning does not apply.
22. Since $\boldsymbol{\theta}$ is here a real $k$-vector it is written in bold face by convention, notwithstanding that $\theta$ is used to denote the generic element of $(\Theta,\rho)$, in the abstract.
23. This is the basis of a method for generating random numbers having a distribution $F$. Take a drawing from the uniform distribution on $[0,1]$ (i.e., a random string of digits with a decimal point placed in front) and apply the transformation $F^{-1}$ (or $Y$) to give a drawing from the desired distribution.
24. $\lambda$ is used here as the argument of the ch.f. instead of the $t$ used in Chapter 11, to avoid confusion with the time subscript.
25. The symbol $i$ appearing as a factor in these expressions denotes $\sqrt{-1}$. The context distinguishes this use from that of the same symbol as an array index.
26. In practice, of course, $U_t$ usually has to be estimated by a residual $\hat{U}_t$, depending on consistent estimates of model parameters. In this case, a result such as 21.6 is also required to show convergence.
27. More precisely, of course, $W$ models the projection of the motion of a particle in three-dimensional space onto an axis of the coordinate system.
28. A cautionary note: these combinations cannot be constructed as residuals from least squares regressions. If $\Sigma$ has full rank, the regression of one element of $Y_n$ onto the rest yields coefficients which are asymptotically random. $\Sigma$ must be estimated from the increments using the methods discussed in §25.1.
29. Compare Wooldridge and White (1988: Prop. 4.1). Wooldridge and White's result is incorrect as stated, since they omit the stipulation of almost sure continuity.
References
Amemiya, Takeshi (1985), Advanced Econometrics, Basil Blackwell, Oxford.
Andrews, Donald W. K. (1984), 'Non-strong mixing autoregressive processes', Journal of Applied Probability 21, 930-4.
——— (1987a), 'Consistency in nonlinear econometric models: a generic uniform law of large numbers', Econometrica 55, 1465-71.
——— (1988), 'Laws of large numbers for dependent non-identically distributed random variables', Econometric Theory 4, 458-67.
——— (1991), 'Heteroscedasticity and autocorrelation consistent covariance matrix estimation', Econometrica 59, 817-58.
——— (1992), 'Generic uniform convergence', Econometric Theory 8, 241-57.
Apostol, Tom M. (1974), Mathematical Analysis (2nd edn.), Addison-Wesley, Menlo Park.
Ash, R. (1972), Real Analysis and Probability, Academic Press, New York.
Athreya, Krishna B. and Pantula, Sastry G. (1986a), 'Mixing properties of Harris chains and autoregressive processes', Journal of Applied Probability 23, 880-92.
——— (1986b), 'A note on strong mixing of ARMA processes', Statistics and Probability Letters 4, 187-90.
Azuma, K. (1967), 'Weighted sums of certain dependent random variables', Tohoku Mathematical Journal 19, 357-67.
Barnsley, Michael (1988), Fractals Everywhere, Academic Press, Boston.
Bates, Charles and White, Halbert (1985), 'A unified theory of consistent estimation for parametric models', Econometric Theory 1, 151-78.
Bernstein, S. (1927), 'Sur l'extension du théorème du calcul des probabilités aux sommes de quantités dépendantes', Mathematische Annalen 97, 1-59.
Bierens, Herman (1983), 'Uniform consistency of kernel estimators of a regression function under generalized conditions', Journal of the American Statistical Association 77, 699-707.
——— (1989), 'Least squares estimation of linear and nonlinear ARMAX models under data heterogeneity', Working Paper, Department of Econometrics, Free University of Amsterdam.
Billingsley, Patrick (1968), Convergence of Probability Measures, John Wiley, New York.
——— (1979), Probability and Measure, John Wiley, New York.
Borowski, E. J. and Borwein, J. M. (1989), The Collins Reference Dictionary of Mathematics, Collins, London and Glasgow.
Bradley, Richard C., Bryc, W. and Janson, S. (1987), 'On dominations between measures of dependence', Journal of Multivariate Analysis 23, 312-29.
Breiman, Leo (1968), Probability, Addison-Wesley, Reading, Mass.
Brown, B. M. (1971), 'Martingale central limit theorems', Annals of Mathematical Statistics 42, 59-66.
Burkholder, D. L. (1973), 'Distribution function inequalities for martingales', Annals of Probability 1, 19-42.
Chan, N. H. and Wei, C. Z. (1988), 'Limiting distributions of least squares estimates of unstable autoregressive processes', Annals of Statistics 16, 367-401.
Chanda, K. C. (1974), 'Strong mixing properties of linear stochastic processes', Journal of Applied Probability 11, 401-8.
Chow, Y. S. (1971), 'On the $L_p$ convergence for $n^{-1/p}s_n$, $0 < p < 2$', Annals of Mathematical Statistics 36, 393-4.
——— and Teicher, H. (1978), Probability Theory: Independence, Interchangeability and Martingales, Springer-Verlag, Berlin.
Chung, Kai Lai (1974), A Course in Probability Theory (2nd edn.), Academic Press, Orlando, Fla.
Cox, D. R. and Miller, H. D. (1965), The Theory of Stochastic Processes, Methuen, London.
Cramér, Harald (1946), Mathematical Methods of Statistics, Princeton University Press, Princeton, NJ.
Davidson, James (1992), 'A central limit theorem for globally nonstationary near-epoch dependent functions of mixing processes', Econometric Theory 8, 313-29.
——— (1993a), 'An L1-convergence theorem for heterogeneous mixingale arrays with trending moments', Statistics and Probability Letters 16, 301-4.
——— (1993b), 'The central limit theorem for globally nonstationary near-epoch dependent functions of mixing processes: the asymptotically degenerate case', Econometric Theory 9, 402-12.
de Jong, R. M. (1992), 'Laws of large numbers for dependent heterogeneous processes', Working Paper, Free University of Amsterdam (forthcoming in Econometric Theory, 1995).
——— (1994), 'A strong law for L2-mixingale sequences', Working Paper, Department of Econometrics, University of Tilburg.
Dellacherie, C. and Meyer, P.-A. (1978), Probabilities and Potential, North-Holland, Amsterdam.
Dhrymes, Phoebus J. (1989), Topics in Advanced Econometrics, Springer-Verlag, New York.
Dieudonné, J. (1969), Foundations of Modern Analysis, Academic Press, New York and London.
Domowitz, I. and White, H. (1982), 'Misspecified models with dependent observations', Journal of Econometrics 20, 35-58.
Donsker, M. D. (1951), 'An invariance principle for certain probability limit theorems', Memoirs of the American Mathematical Society 6, 1-12.
Doob, J. L. (1953), Stochastic Processes, John Wiley, New York; Chapman & Hall, London.
Dudley, R. M. (1966), 'Weak convergence of probabilities on nonseparable metric spaces and empirical measures on Euclidean spaces', Illinois Journal of Mathematics 10, 109-26.
——— (1967), 'Measures on non-separable metric spaces', Illinois Journal of Mathematics 11, 109-26.
——— (1989), Real Analysis and Probability, Wadsworth and Brooks/Cole, Pacific Grove, Calif.
Dvoretsky, A. (1972), 'Asymptotic normality of sums of dependent random variables', in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, ii, University of California Press, Berkeley, Calif., 513-35.
Eberlein, Ernst and Taqqu, Murad S. (eds.) (1986), Dependence in Probability and Statistics: a Survey of Recent Results, Birkhäuser, Boston.
Engle, R. F., Hendry, D. F. and Richard, J.-F. (1983), 'Exogeneity', Econometrica 51, 277-304.
Feller, W. (1971), An Introduction to Probability Theory and its Applications, ii, John Wiley, New York.
Gallant, A. Ronald (1987), Nonlinear Statistical Models, John Wiley, New York.
——— and White, Halbert (1988), A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models, Basil Blackwell, Oxford.
Gastwirth, Joseph L. and Rubin, Herman (1975), 'The asymptotic distribution theory of the empiric CDF for mixing stochastic processes', Annals of Statistics 3, 809-24.
Gnedenko, B. V. (1967), The Theory of Probability (4th edn.), Chelsea Publishing, New York.
Gorodetskii, V. V. (1977), 'On the strong mixing property for linear sequences', Theory of Probability and its Applications 22, 411-13.
Halmos, Paul R. (1956), Lectures in Ergodic Theory, Chelsea Publishing, New York.
——— (1960), Naive Set Theory, Van Nostrand Reinhold, New York.
——— (1974), Measure Theory, Springer-Verlag, New York.
Hall, P. and Heyde, C. C. (1980), Martingale Limit Theory and its Application, Academic Press, New York and London.
Hannan, E. J. (1970), Multiple Time Series, John Wiley, New York.
Hansen, L. P. (1982), 'Large sample properties of generalized method of moments estimators', Econometrica 50, 1029-54.
Hansen, Bruce E. (1991), 'Strong laws for dependent heterogeneous processes', Econometric Theory 7, 213-21.
——— (1992a), 'Errata', Econometric Theory 8, 421-2.
——— (1992b), 'Consistent covariance matrix estimation for dependent heterogeneous processes', Econometrica 60, 967-72.
——— (1992c), 'Convergence to stochastic integrals for dependent heterogeneous processes', Econometric Theory 8, 489-500.
Herrndorf, Norbert (1984), 'A functional central limit theorem for weakly dependent sequences of random variables', Annals of Probability 12, 141-53.
——— (1985), 'A functional central limit theorem for strongly mixing sequences of random variables', Z. Wahrscheinlichkeitstheorie verw. Gebiete 69, 540-50.
Hoadley, Bruce (1971), 'Asymptotic properties of maximum likelihood estimators for the independent not identically distributed case', Annals of Mathematical Statistics 42, 1977-91.
Hoeffding, W. (1963), 'Probability inequalities for sums of bounded random variables', Journal of the American Statistical Association 58, 13-30.
Ibragimov, I. A. (1962), 'Some limit theorems for stationary processes', Theory of Probability and its Applications 7, 349-82.
——— (1965), 'On the spectrum of stationary Gaussian sequences satisfying the strong mixing condition. I: Necessary conditions', Theory of Probability and its Applications 10, 85-106.
——— and Linnik, Yu. V. (1971), Independent and Stationary Sequences of Random Variables, Wolters-Noordhoff, Groningen.
Iosifescu, M. and Theodorescu, R. (1969), Random Processes and Learning, Springer-Verlag, Berlin.
Karatzas, Ioannis and Shreve, Steven E. (1988), Brownian Motion and Stochastic Calculus, Springer-Verlag, New York.
Kelley, John L. (1955), General Topology, Springer-Verlag, New York.
Kingman, J. F. C. and Taylor, S. J. (1966), Introduction to Measure and Probability, Cambridge University Press, London and New York.
Kolmogorov, A. N. (1950), Foundations of the Theory of Probability, Chelsea Publishing, New York (published in German as Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer-Verlag, Berlin, 1933).
——— and Rozanov, Yu. A. (1960), 'On strong mixing conditions for stationary Gaussian processes', Theory of Probability and its Applications 5, 204-8.
Kopp, P. E. (1984), Martingales and Stochastic Integrals, Cambridge University Press.
Kurtz, T. G. and Protter, P. (1991), 'Weak limit theorems for stochastic integrals and stochastic differential equations', Annals of Probability 19, 1035-70.
Loève, Michel (1977), Probability Theory, i (4th edn.), Springer-Verlag, New York.
Lukacs, Eugene (1975), Stochastic Convergence (2nd edn.), Academic Press, New York.
Magnus, J. R. and Neudecker, H. (1988), Matrix Differential Calculus with Applications in Statistics and Econometrics, John Wiley, Chichester.
Mandelbrot, Benoit B. (1983), The Fractal Geometry of Nature, W. H. Freeman, New York.
Mann, H. B. and Wald, A. (1943a), 'On the statistical treatment of linear stochastic difference equations', Econometrica 11, 173-220.
——— (1943b), 'On stochastic limit and order relationships', Annals of Mathematical Statistics 14, 390-402.
McKean, H. P., Jr. (1969), Stochastic Integrals, Academic Press, New York.
McLeish, D. L. (1974), 'Dependent central limit theorems and invariance principles', Annals of Probability 2, 620-8.
——— (1975a), 'A maximal inequality and dependent strong laws', Annals of Probability 3, 329-39.
——— (1975b), 'Invariance principles for dependent variables', Z. Wahrscheinlichkeitstheorie verw. Gebiete 32, 165-78.
——— (1977), 'On the invariance principle for nonstationary mixingales', Annals of Probability 5, 616-21.
Nagaev, S. V. and Fuk, A. Kh. (1971), 'Probability inequalities for sums of independent random variables', Theory of Probability and its Applications 6, 643-60.
Newey, W. K. (1991), 'Uniform convergence in probability and stochastic equicontinuity', Econometrica 59, 1161-8.
——— and West, K. (1987), 'A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix', Econometrica 55, 703-8.
Park, J. Y. and Phillips, P. C. B. (1988), 'Statistical inference in regressions with integrated processes, Part 1', Econometric Theory 4, 468-98.
——— (1989), 'Statistical inference in regressions with integrated processes, Part 2', Econometric Theory 5, 95-132.
Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic Press, New York and London.
Pham, Tuan D. and Tran, Lanh T. (1985), 'Some mixing properties of time series models', Stochastic Processes and their Applications 19, 297-303.
Phillips, P. C. B. (1988), 'Weak convergence to the matrix stochastic integral $\int_0^1 B\,dB'$', Journal of Multivariate Analysis 24, 252-64.
——— and Durlauf, S. N. (1986), 'Multiple time series regression with integrated processes', Review of Economic Studies 53, 473-95.
Pollard, David (1984), Convergence of Stochastic Processes, Springer-Verlag, New York.
Pötscher, B. M. and Prucha, I. R. (1989), 'A uniform law of large numbers for dependent and heterogeneous data processes', Econometrica 57, 675-84.
——— (1994), 'Generic uniform convergence and equicontinuity concepts for random functions: an exploration of the basic structure', Journal of Econometrics 60, 23-63.
——— (1991a), 'Basic structure of the asymptotic theory in dynamic nonlinear econometric models, Part I: Consistency and approximation concepts', Econometric Reviews 10, 125-216.
——— (1991b), 'Basic structure of the asymptotic theory in dynamic nonlinear econometric models, Part II: Asymptotic normality', Econometric Reviews 10, 253-325.
Prokhorov, Yu. V. (1956), 'Convergence of random processes and limit theorems in probability theory', Theory of Probability and its Applications 1, 157-213.
Rao, C. Radhakrishna (1973), Linear Statistical Inference and its Applications (2nd edn.), John Wiley, New York.
Révész, Pál (1968), The Laws of Large Numbers, Academic Press, New York.
Rosenblatt, M. (1956), 'A central limit theorem and a strong mixing condition', Proceedings of the National Academy of Sciences, USA 42, 43-7.
——— (1972), 'Uniform ergodicity and strong mixing', Z. Wahrscheinlichkeitstheorie verw. Gebiete 24, 79-84.
——— (1978), 'Dependence and asymptotic independence for random processes', in Studies in Probability Theory (ed. M. Rosenblatt), Mathematical Association of America, Washington DC.
Royden, H. L. (1968), Real Analysis, Macmillan, New York.
Seneta, E. (1976), Regularly Varying Functions, Springer-Verlag, Berlin.
Serfling, R. J. (1968), 'Contributions to central limit theory for dependent variables', Annals of Mathematical Statistics 39, 1158-75.
——— (1980), Approximation Theorems of Mathematical Statistics, John Wiley, New York.
Shafer, G. (1988), 'The St Petersburg Paradox', in Encyclopaedia of the Statistical Sciences, viii (ed. S. Kotz and N. L. Johnson), John Wiley, New York.
Shiryayev, A. N. (1984), Probability, Springer-Verlag, New York.
Skorokhod, A. V. (1956), 'Limit theorems for stochastic processes', Theory of Probability and its Applications 1, 261-90.
——— (1957), 'Limit theorems for stochastic processes with independent increments', Theory of Probability and its Applications 2, 138-71.
Slutsky, E. (1925), 'Über stochastische Asymptoten und Grenzwerte', Math. Annalen 5, 93.
Stinchcombe, M. B. and White, H. (1992), 'Some measurability results for extrema of random functions over random sets', Review of Economic Studies 59, 495-514.
Stone, Charles (1963), 'Weak convergence of stochastic processes defined on semi-infinite time intervals', Proceedings of the American Mathematical Society 14, 694-6.
Stout, W. F. (1974), Almost Sure Convergence, Academic Press, New York.
Strasser, H. (1986), 'Martingale difference arrays and stochastic integrals', Probability Theory and Related Fields 72, 83-98.
Varadarajan, V. S. (1958), 'Weak convergence of measures on separable metric spaces', Sankhyā 19, 15-22.
von Bahr, Bengt and Esseen, Carl-Gustav (1965), 'Inequalities for the rth absolute moment of a sum of random variables, $1 \le r \le 2$', Annals of Mathematical Statistics 36, 299-303.
White, Halbert (1984), Asymptotic Theory for Econometricians, Academic Press, New York.
——— and Domowitz, I. (1984), 'Nonlinear regression with dependent observations', Econometrica 52, 143-62.
Wiener, Norbert (1923), 'Differential space', Journal of Mathematical Physics 2, 131-74.
Willard, Stephen (1970), General Topology, Addison-Wesley, Reading, Mass.
Withers, C. S. (1981a), 'Conditions for linear processes to be strong-mixing', Z. Wahrscheinlichkeitstheorie verw. Gebiete 57, 477-80.
——— (1981b), 'Central limit theorems for dependent variables, I', Z. Wahrscheinlichkeitstheorie verw. Gebiete 57, 509-34.
Wooldridge, Jeffrey M. and White, Halbert (1986), 'Some invariance principles and central limit theorems for dependent heterogeneous processes', University of California (San Diego) Working Paper.
——— (1988), 'Some invariance principles and central limit theorems for dependent heterogeneous processes', Econometric Theory 4, 210-30.
Index
a-mixing, see strong mixing Abel's partial summation formula 34, 254 absolute convergence 31 absolute moments 132 absolute value 162 absolutely continuous function 120 absolutely continuous measure 69 absolutely regular sequence 209 abstract integral 128 accumulation point 21, 94 adapted process 500 adapted sequence 229 addition modulo 1 46 adherent point 77 affine transformation 126 Aleph nought 8 algebra of sets 14 almost everywhere 38, 1 13 almost sure convergence 178, 281 method of subsequences 295 uniform 33 1 almost surely 1 1 3 analytic function 328 analytic set 328 of C 449 Andrews, D. W. K. 216, 261 , 301 , 336, 338, 341 antisymmetric relation 5 Apostol, T. M. 29, 32, 33, 126 approximable in probability 274 approximable process 273 weak law of large numbers 304 approximately mixing 262 AR process, see autoregressive process ARMA process 215 array 33 array convergence 35 Arzela-Ascoli theorem 91, 335, 439, 447, 469
asymptotic equicontinuity 90 asymptotic expectation 357 asymptotic independence 479 asymptotic negligibility 375 asymptotic uniform equicontinuity 90, 335 asymptotic unpredictability 247 Athreya, Krishna B. 215 atom 37 of a distribution 129 of a p.m. 1 12 autocovariance 193 autocovariances, summability 266 autoregressive process 215 non-strong mixing 216 non-uniform mixing 218 autoregressive-moving average process 215 axiom of choice 47 axioms of probability 1 1 1 Azuma, K. 245 Azuma ' s inequality 245
�-mixing 207 backshift transformation 191 ball 76 band width 401 Bartlett kernel 403, 407 base 79 for point 94 for space of measures 418 for topology 93 Bernoulli distribution 122 expectation 129 Bernoulli r.v.s 216 Bernstein sums 386, 401 Berry-Esseen theorem 407 betting system 233-4 Bierens, Herman 261 , 336 Big Oh 3 1 , 1 87 Billingsley, Patrick 17, 18, 261 ,
528 421 , 422, 447, 448, 457, 466, 469, 472, 473, 474-5, 496 Billingsley metric 462 equivalent to Skorokhod metric 464 binary expansion 10 binary r.v. 122, 133 binary sequence 1 80 binomial distribution 122, 348, 364 bivariate Gaussian 144 blocking argument 194, 299 Borel field of C 440 of D 465 infinite-dimensional 1 8 1 real line 22, 1 6 metric space 77, 413 topological space 413 Borel function 55, 57, 1 17 expectation of 130 Borel sets 47 Borel-Cantelli lemma 282, 295, 307 Borel's normal number theorem 290 boundary point 2 1 , 77 bounded convergence theorem 64 bounded function 28 bounded set 22, 77 bounded variation 29 Brown, Robert 443 Brownian bridge 445 mean deviations 498 Brownian motion 443 de-meaned 498 distribution of supremum 496 transformed 445, 486 vector 454 with drift 444 Burkholder' s inequality 242 c 437 c.d.f., s ee cumulative distribution function cadlag function 90, 456 cadlag process 48 8 progressively measurable 500
Index Cantor, G. 10 cardinal number 8 cardinality 8 Cartesian product 5, 83, 102 Cauchy criterion 25 Cauc�y distribution 123 as stable distribution 362 characteristic function 167 no expectation 129 Cauchy family 124 Cauchy sequence 80-2, 97 real numbers 25 vectors 29 Cauchy-Schwartz inequality 138 central limit theorem ergodic L1 -mixingales 385 functional 450, 480 independent sequences 368 martingale differences 383 NED functions of mixing processes 386 three series theorem 3 12 trending variances 3 79 central moments 1 3 1 central tendency 128 centred sequence 23 1 Cesaro sum 3 1 ch.f., s ee characteristic function Chan, N. H. 5 1 0 Chanda, K . C . 2 1 5, 2 1 9 characteristic function 53, 162 derivatives 164 independent sums 166 multivariate distributions 168 series expansion 165 weak convergence 357 Chebyshev' s inequality 132 Chebyshev' s theorem 293 chi-squared distribution 124 chord 133 Chow, Y. S. 298 Chung, K. L. 407, 409 closed under set operations 14 closed interval 1 1 closed set 77
Index real line 21 closure point 20, 77, 94 cluster point 24, 80, 94 coarse topology 93 coarseness 438 codomain 6 coin tossing 1 80, 191 collection 3 compact 95 compact set 22, 77 compact space 77 compactificaton 107 complement 3 complete measure space 39 complete space 80 completely regular sequence 209 completely regular space 100 completeness 97 completion 39 complex conjugate 162 complex number 162 composite mapping 7 concave function 133 conditional characteristic function 171-3 distribution function 143 expectation 144, 147 linearity 1 5 1 optimal predictor 150 variance of 156 versions of 148 Fatou's lemma 152 Jensen' s inequality 1 53 Markov inequality 1 52 modulus inequality 1 5 1 monotone convergence theorem 152 probability 1 13 variance 238, 316 conditionally heteroscedastic 214 consistency properties 1 83, 435 consistency theorem function spaces 436 sequences 1 84 contingent probability 1 14 continuity 1 12
529 of a measure 38 continuous distribution 122 continuous function 27, 84, 436 continuous mapping 97 continuous mapping theorem 355, 497 metric spaces 421 continuous time martingale 501 continuous time process 500 continuous truncation 271 , 309 continuously differentiable 29 contraction mapping 263 converge absolutely 3 1 convergence 94 almost sure 178, 28 1, 33 1 in distribution 347, 367 in Lp norm 287, 33 1 in mean square 179, 287 in probability 179, 284, 349, 359, 367 in probability law 347 metric space 80 on a subsequence 284 real sequence 23 space of probability measures 418 stochastic function 333 transformations 285 uniform, 30, 33 1 weak 179 with probability 1 179, 33 1 convergence lemma 306 convergence-determining class 420 convex function 133, 339 convolution 161 coordinate 1 77 coordinate projections 102, 434 coordinate space 48 coordinate transformation 126 correlation coefficient 138 correspondence 6 countability axioms 94 countable additivity 36 countable set 8 countable subadditivity 37, 1 1 1 countably compact 95
530 covariance 136 covariance matrix 137 covariance stationarity 193 covering 22, 77 Cox, D. R. 503 c inequality 140 Cramer, Harald 141 Cramer-Wold device 405, 455, 490, 516 Cramer's theorem 355 cross moments 136 cumulative distribution function 1 17 cylinder set 48, 1 15 of R 434 r
D 456 Billingsley metric on 464 Davidson, James 301 , 386 de Jong, R. M. 3 1 9, 326 de Morgan's laws 4 decimal expansion 10 decreasing function 29 decreasing sequence of real numbers 23 of sets 12 degenerate distribution 349 degree of belief 1 1 1 Dellacherie, C. 328 dense 23 dense set 77 density function 74, 120 denumerable 8 dependent events 1 13 derivative 29 conditional expectation 153 expectation 141 derived sequence 177 determining class 40, 121, 127, 420 diagonal argument 10 diagonal method 35 diffeomorphism 126 difference, set 3 differentiable 29 differential calculus 28
Index diffusion process 503 discontinuity, jump 457 discontinuity of first kind 457 discrete distribution 122, 129 discrete metric 76 disc�ete subset 80 discrete topology 93 disjoint 3 distribution 1 17 domain 6 dominated convergence theorem 63 Donsker, M. D. 450 Donsker' s theorem 450 Doob, J. L. 196, 216, 235, 3 14 Doob decomposition 23 1 Doob's inequality 241 continuous time 501 Doob-Mayer decomposition 502 double integral 66, 136 drift 444 Dudley, R. M. 328, 457 Durlauf, S. N. 490, 5 1 6 dyadic rationals 26, 99, 439 dynamic stability 263 Dynkin, E. B. 1 8 Dynkin's 7t-A theorem 1 8 £-neighbourhood 20 see also sphere E-net 78 Egoroff s theorem 282, 5 1 0 element, set 3 embedding 86, 97, 105 empirical distribution function 332 empty set 3, 77 Engle, R. F. 403 ensemble average 195 envelope 329 equally likely 1 80 equicontinuity 90, 335 stochastic 336 strong stochastic 336 equipotent 6 equivalence class 5, 46 equivalence relation 5 equivalent measures 69
Index equivalent metrics 76 equivalent sequences 307, 382 ergodic theorem 200 law of large numbers 291 ergodicity 199 asymptotic independence 202 Cesaro-summability of autocovariances 201 Esseen, Carl-Gustav 171 essential supremum 1 17, 132 estimator 177 Euclidean distance 20, 75 Euclidean k-space 23 Euclidean metric 75, 105 evaluation map 105 even numbers 8 exogenous 403 expectation 128 exponential function 162 exponential inequality 245 extended functions 52 extended real line 1 2 extended space 1 1 7 extension theorem 1 84 existence part 40 uniqueness part 44, 127 <J>-mixing, see uniform mixing �-analytic sets 449 factor space 48 fair game 289 Fatou' s lemma 63 conditional 152 Feller, W. 32 Feller's theorem 373 field of sets 14 filter 94 filtration 500 fine toplogy 93 fineness 438 finite additivity 36 finite dimensional cylinder sets 1 8 1 , 435 finite dimensional distributions of C 440, 442, 446
53 1 of D 466 Wiener measure 446 finite intersection property 95 finite measure 36 first countable 95 fractals 443 frequentist model 1 1 1 law of large numbers 292 Fubini' s theorem 66, 69, 125 Fuk, A. Kh. 220 function 6 convex 339 of a real variable 27 of bounded variation 29 function space 84, 434 nonseparable 89 functional 84 functional central limit theorem 450, 480 martingale differences 440 multivariate 454, 490 NED functions of strong mixing processes 48 1 NED functions of uniform mixing processes 485 Gallant, A. Ronald 261 , 263, 271 , 401-2 gambling policy 233 gamma function 124 Gaussian distribution 123 characteristic function 167 stable distribution 363 Gaussian family 123 expectation 129 moments 1 3 1 Gaussian vector 126 generic uniform convergence 336 geometric series 3 1 Glivenko-Cantelli theorem 332 global stationarity 194, 388, 450, 486 Gordin, M. 385 Gorodetskii, V. 215, 219, 220 graph 6
532 Hahn decomposition 70, 72 half line 1 1 , 2 1 , 52, 1 18 half-Gaussian distribution 385 half-normal density 124 half-open interval 1 1 , 15, 1 1 8 Hall, P. 250, 3 14, 385, 409 Halmos, Paul R. 202 Hansen, Bruce E. 3 1 8, 403, 5 10 Hartman-Wintner theorem 408 Hausdorff metric 83, 469 Hausdorff space 98 Heine-Bore! theorem 23 Helly-Bray theorem 353 Helly's selection theorem 360 Heyde, C. C. 250, 3 14, 385, 409 Hoadley, Bruce 336 Hoeffding' s inequality 245 Holder's inequality 138 homeomorphism 27, 51, 86, 97, 105 i.i.d. 193 Ibragimov, I. A. 204-5, 210, 21 1 , 2 1 5 , 216, 261 identically distributed 193 image 6 imaginary number 162 inclusion-exclusion formula 37, 420 increasing 29 . function .mcreasmg sequence of real numbers 23 of sets 12 independence 1 14 independent Brownian motions 455 independent r.v.s 127, 1 54-5, 161 independent sequence 192 strong law of large numbers 3 1 1 independent subfields 1 54 index set 3, 177 indicator function 53, 128 inferior limit of real sequence 25 of set sequence 13 infimum 12 infinite run of heads 1 80
Index infinite-dimensional cube 83, 105 infinite-dimensional Euclidean space 83, 104 infinitely divisible distribution 362 initial conditions 194 inner measure 41 innovation sequence 215, 23 1 integers 9 integrability 61 integral 57 integration by parts 58, 129, 134 interior 21 , 77 intersection 3 interval 1 1 into 6 invariance principle 450, 497 invariant event 195 invariant r.v. 196 inverse image 6 inverse projection 48 inversion theorem 168-70 irrational numbers 10, 80 isolated point 2 1 , 77 isometry 86 isomorphic spaces 5 1 isomorphism 145 iterated integral 66, 135 Ito integral 507 J1 metric 459 Jacobian matrix 126 Jensen's inequality 133 conditional 1 53 Jordan decomposition 7 1 jump discontinuities 457 jump points 120 Karatzas, Joannis 502-3, 508 kernel estimator 403 Khinchine' s theorem 368 Kolmogorov, A. N. 209-10, 3 1 1-2 Kolmogorov consistency theorem 1 84 Kolmogorov's inequality 240
Index Kolmogorov ' s zero-one law 204 Kopp, P. E. 504 Kronecker product 5 1 6 Kronecker' s lemma 34, 293, 307 Kurtz, T. G. 5 1 0 A.-system 1 8 Hopital's rule 361 Lp convergence 287 uniform 33 1 Lp norm 132 Lp-approximable 274 Lp-bounded 132 £,-dominated 330 largest element 5 latent variables 177 law of iterated expectations 149 law of large numbers 200, 289 Cauchy r.v.s 291 definition of expectation 292 frequentist model 292 random walk 291 uniform 340 law of the iterated logarithm 408 Lebesgue decomposition 69, 72 probability measure 120 Lebesgue integral 57 Lebesgue measure 37, 45-6, 74, 1 12 plane 66 product measure 135 Lebesgue-integrable r.v.s 132 Lebesgue-Stieltjes integral 57-8, 128 left-continuity 27 left-hand derivative 29 Levy, P. 3 1 2 Levy continuity theorem 358 Levy's metric 424, 468 lexicographic ordering 10 Liapunov condition 373 Liapunov' s inequality 139 Liapunov ' s theorem 372 LIE 149 liminf of real sequence 25 I'
533 of set sequence 13 limit 80 expectation of 141 of set sequence 13 limit point 26 lim sup of real sequence 25 of set sequence 13 Lindeberg condition 369, 371 , 380 asymptotic negligibility 376 uniform integrability 372 Lindeberg theorem 369 Lindeberg-Feller theorem, see Lindeberg theorem; Feller's theorem Lindeberg-Levy theorem 366 FCLT 449 LindelOf property 78-79, 95 LindelOf space 98 LindelOf' s covering theorem 22 linear ordering 5 linear process 193, 247, 252 strong law of large numbers 326 linearity of conditional expectation 1 5 1 of integral 62 Linnik, Yu. V. 204-5, 210, 215, 216 Lipschitz condition 28, 86, 269 stochastic 338 Little Oh 3 1 , 1 87 Loeve, M. 32, 407 Loeve's c, inequality 140 log-likelihood 177 lower integral 57 lower semicontinuity 86 lower variation 71 MA process, see moving average process Magnus, J. R. 5 1 6 Mandelbrot, Benoit 443 Mann, H. B. 187 mapping 6 marginal c.d.f. 126 marginal distributions of a
534 sequence 1 86 marginal measures 64 marginal probability measures 1 1 5 tightness 430 Markov inequality 132, 135 conditional 152 Markov process 503 martingale 229 continuous time 501 convergence 235 martingale difference array 232 weak law of large numbers 298 martingale difference sequence 230 strong law of large numbers 3 14 maximal inequalities for linear processes 256 for martingales 240 for mixingales 252 maximum metric 76 McKean, H. P., Jr. 508 McLeish, D. L. 247, 261 , 318, 380 mean 128 mean reversion 214 mean stationarity 193 mean value theorem 340 mean-square convergence 293 measurability of suprema 327, 449 measurable function 1 17 measurable isomorphism 5 1 measurable rectangle 50, 48 measurable set 4 1 measurable space 36 measurable transformation 50 measure 36 measure space 36 measure-preserving transformation 191 mixes outcomes 200 memory 192 method of subsequences 295 metric 75 metric space 75, 96 metrically transitive 199 metrizable space 93 metrization 107
Index Meyer, P.-A. 328 Miller, H. D. 503 Minkowski's inequality 139 mixed continuous-discrete distribution 129 mixed, Gaussian distribution 404 mixing 202 inequalities 2 1 1-4 MA processes 2 1 5 martini example 202 measurable functions 21 1 mixing process strong law of large numbers 323 mixing sequence 204 mixing size 210 see also size mixingale 247 stationary 250 strong law of large numbers 3 1 8 weak law of large numbers 301 mixingale array 249 modulus 162 modulus inequality 63 complex r.v.s 163 conditional 151 modulus of continuity 91, 335, 439, 479 cadlag functions 458, 468 moment generating function 162 moments 1 3 1 Gaussian distribution 123 monkeys typing Shakespeare 1 80 monotone class 17 monotone convergence theorem 60, 6: conditional 1 52 envelope theorem 329 monotone function 29 monotone sequence of real numbers 23 of sets 12 monotonicity of measure 3 7 of p.m. 1 1 1 Monte Carlo 498 moving average process 193 mixing 2 1 5
  strong law of large numbers 326
multinormal distribution, see multivariate Gaussian
multinormal p.d.f. 126
multivariate c.d.f. 125
multivariate FCLT 490
multivariate Gaussian
  affine transformations 170
  characteristic function 168
mutually singular measures 69
Nagaev, S. V. 220
naive set theory 47
natural numbers 8
near-epoch dependence 261
  mixingales 264
  transformations 267
near-epoch dependent process
  strong law of large numbers 323
  weak law of large numbers 302
nearly measurable set 329
NED, see near-epoch dependence
negative set 70
neighbourhood 20
  see also sphere
nested subfields 155
net 94
Neudecker, H. 516
Newey, W. K. 336, 401
non-decreasing function 29
non-decreasing sequence
  of real numbers 23
  of sets 12
non-increasing function 29
non-increasing sequence
  of real numbers 23
  of sets 12
non-measurable function 55
non-measurable set 46
norm inequality 139
  for prediction errors 157
normal distribution 123
normal law of error 364
normal number theorem 290
normal space 98
null set 3, 70
odd-order moments 131
one-dimensional cylinder 182
one-to-one 6
onto 6, 97
open covering 22, 77
open interval 11
open mapping 27
open rectangles 102
open set 77, 93
  of real line 20
order-preserving mapping 6
ordered pairs 7
orders of magnitude 31
origin 10
Ornstein-Uhlenbeck process 445
  diffusion process 503
outer measure 41
outer product 137
π-λ theorem 18, 44, 49, 67
π-system 18
p.d.f., see probability density function
p.m., see probability measure
pairwise independence 114
pairwise independent r.v.s 127, 136
Pantula, Sastry G. 215
parameter space 327
Park, J. Y. 516
Parthasarathy, K. R. 418, 422, 426, 427, 429, 469
partial knowledge 145
partial ordering 5
partial sums 31
partition 3
  of [0, 1] 438
permutation of indices 436
Pham, Tuan D. 215
Phillips, P. C. B. 490, 510, 516
piecewise linear functions 437
pigeon hole principle 8
Pitman drift 369
pointwise convergence 30
  stochastic 331
Poisson distribution 122, 348
  characteristic function 167
  expectation 129
  infinitely divisible 362
polar coordinates 162
Pollard, David 457
positive semi-definite 137
positive set 70
Pötscher, B. M. 261, 274, 277, 336, 342
power set 13
precompact set 78
predictable component 231
probability density function 122
probability measure 111
  weak topology 418
probability space 111, 117
product measure 64
product space 7, 48, 102, 115
product topology 102
  function spaces 453
product, set 5
progressively measurable 500
progressively measurable functions 504
projection 8, 48, 50
  suprema 328
  not measurable 328
projection σ-field 415, 435
  of C 440
  of D 456
Prokhorov, Yu. V. 422, 423, 457
Prokhorov metric 424, 467
Protter, P. 510
Prucha, I. 261, 274, 277, 336, 342
pseudo-metric 75, 504
q-dependence 215
quadratic variation 238, 502
  deterministic 503
ρ-mixing 207
R∞, the space 434
r.v., see random variable
Radon-Nikodym derivative 70, 74, 120, 148
Radon-Nikodym theorem 70, 72-4, 122
random element 413, 111
random event 111
random experiment 111, 128
random field 178
random pair 124
random sequence 177
  memory of 192
random variable 112, 117
random vector 137
random walk 230, 291
random weighting 316
range 6
Rao, C. R. 143
rate of convergence 294
rational numbers 9
real numbers 10
real plane 11
real-valued function 27, 87, 434
realization 179
refinement 438
reflexive relation 5
regression coefficients 177
regular measure 413
regular sequence 204
regular space 98
regularly varying function 32
relation 5
relative compactness 422
relative frequencies 128
relative topology 20, 93
relatively compact 77
remote σ-field 203
repeated sampling 179
restriction of a measure space 37
Riemann integral 57
  expectation 141
  of a random function 496
Riemann zeta function 32
Riemann-Stieltjes integral 58
  and stochastic integral 503
right continuous 27
  c.d.f. 119-20
  filtration 500
right-hand derivative 29
ring 14
Rozanov, Yu. A. 209-10
σ-algebra 15
σ-field 15
σ-finite measure 36
St Petersburg paradox 289
sample 177
sample average 128
sample path 179
sample space 111
scale transformation 178
seasonal adjustment 262
second countable 79, 95
Serfling, R. J. 213
self-similarity 443
semi-algebra 15
semi-ring 15, 40, 48, 49, 118
semimartingale 233
  in continuous time 501
separable set 78
separable space 95
separating function 98
separation axioms 97
sequence 9, 94
  metric space 80
  real 23
sequentially compact 95, 422
serial independence 192
series 31
set 3
set function 36
set of measure zero 38, 70
shift transformation 191
shocks 215
Shreve, Steven E. 502-3, 508
signed measure 70
simple function 53
  approximation by 54
  integral of 59
simple random variables 128
singleton 11
singular Gaussian distribution 170
singular measures 69, 120
size
  mixing 210
  mixingale 247
  near-epoch dependence 262
Skorokhod, A. V. 350, 457, 459
Skorokhod metric 459, 469
Skorokhod representation 350, 431, 510
Skorokhod topology 461
slowly varying function 32
Slutsky's theorem 287
smallest element 5
Souslin space 328
spectral density function 209
  MA process 215
sphere 76
  of C 440
  of D 466
stable distribution 362
stationarity 193
step function 119, 349
Stinchcombe, M. B. 328
stochastic convergence, see under convergence
stochastic equicontinuity 336
  termwise 341
stochastic integral 503
stochastic process 177
  continuous time 500
stochastic sequence 177
stopped process 234
stopping rule 234
stopping time 233
  filtration 501
Stout, W. F. 314, 409
Strasser, H. 510
strict ordering 5
strict stationarity 193
strong law of large numbers 289
  for independent sequence 310
  for Lp-bounded sequence 295, 312, 314
  for martingales 314-17
  for mixingales 319-23
  for NED functions of mixing processes 324-6
strong mixing
  autoregressive processes 216
  coefficient 206
  negligible events 207
  smoothness of the p.d.f. 225
  sufficient conditions in MA processes 219-27
  see also mixing
strong mixing sequence 209
  law of large numbers 295, 298
strong topology 93
strongly dependent sequences 210
strongly exogenous 403
sub-base 101, 103
subcovering 22
subfield measurability 145
subjective probability 111
submartingale 233
  continuous time 501
subsequence 24, 80
subset 3
subspace 20, 93
summability 31
superior limit
  of real sequence 25
  of set sequence 13
supermartingale 233
  continuous time 501
support 37, 119
supremum 12
sure convergence 178
symmetric difference 3
symmetric relation 5
T1-space 98
T2-space 98
T3-space 98
T3½-space 100
T4-space 98
tail sums 31
taxicab metric 75
Taylor's theorem 165
telescoping sum 250
  proof of weak law 301
termwise stochastic equicontinuity 341
three series theorem 311
tight measure 360, 427
time average 195
  and ensemble average 200
  limit under stationarity 196
time plot 437
time series 177
Toeplitz's lemma 34, 35
Tonelli's theorem 68
topological space 93
topology 93
  real line 20
  weak convergence 418
torus 102
total independence 114
total variation 72
totally bounded 78, 79
totally independent r.v.s 127
trace 112
Tran, Lanh T. 215
transformation 6
transformed Brownian motion 486
  diffusion process 503
transitive relation 5
trending moments 262
triangle inequality 31, 139
triangular array 34
  stochastic 178
trivial topology 93
truncated kernel 403
truncation 298, 308
  continuous 271, 309
two-dimensional cylinder 182
Tychonoff space 100
Tychonoff topology 103, 181, 461
Tychonoff's theorem 104
uncorrelated sequence 230, 293
uncountable 10
uniform conditions 186
uniform continuity 28, 85
uniform convergence 30
uniform distribution 112, 122, 364
  convolution 161
  expectation 129
uniform equicontinuity 90
uniform integrability 188
  squared partial sums 257
uniform laws of large numbers 340
uniform Lipschitz condition 28, 87
uniform metric 87
  parameter space 327
uniform mixing
  autoregressive process 218
  coefficient 206
  moving average process 227
  see also mixing
uniform mixing sequence 209
  strong law of large numbers 298
  weak law of large numbers 295
uniform stochastic convergence 331
uniform tightness 359, 427
  Arzelà-Ascoli theorem 447
  measures on C 447, 448
uniformly bounded
  a.s. 186
  in Lp norm 132, 186
  in probability 186
union 3
universally measurable set 328
upcrossing inequality 236
upper integral 57
upper semicontinuity 86, 470
upper variation 71
Urysohn's embedding theorem 106
Urysohn's lemma 98
Urysohn's metrization theorem 98
usual topology of the line 20
Varadarajan, V. S. 425
variable 117
variance 131
  sample mean 293
variance-transformed Brownian motion 486
vector 29
vector Brownian motion 498
vector martingale difference 234
Venn diagram 4
versions of conditional expectation 148
von Bahr, Bengt 171
Wald, A. 187
weak convergence 179, 347
  in metric space 418
  of sums 361
weak dependence 210
weak law of large numbers 289
  for L1-approximable process 304
  for L1-mixingale 302
  for L2-bounded sequence 293
  for partial sums 312
weak topology 93, 101
  space of probability measures 418
Wei, C. Z. 510
well-ordered 5
West, K. 401
White, Halbert 210, 261, 263, 271, 328, 401-2, 480-1
wide-sense stationarity 193
Wiener, Norbert 442
Wiener measure 442, 474
  existence of 446, 452
with probability 1 113
Withers, C. S. 215
Wooldridge, Jeffrey M. 480-1
zero-one law 204