This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
0, am � 0 sufficiently fast that
( 14.6)
m=l The term size has been coined to describe the rate of convergence of the mixing numbers, although different definitions have been used by different authors, and the terminology should be used with caution. One possibility is to say that the sequence is of size -
(Sn),<Jn } is a submartingale if { Sn,<Jn } a submartingale. Proof For the martingale case, 1 , form an Lp-mixingale of size - <po . According to the Minkowski and conditional modulus inequalities, �o. so that (p3 -P4)� - 1, which in (p3 4) � P turn implies (20.81). This completes the proof. Noting that 1 � �0 � 2, the condition in (20. 7 3) may be compared with (20. 5 9). Put <po i and r = 2 and we obtain �0 = �. whereas with <po = 1, we get �o = 2(2r - 1 )/(3 r - 1 ) which does not exceed r in the relevant range, taking values between 1 when r = 1 and � when r = 2. Square-summability of ct lat is sufficient only in the limit as both <po and r oo Thus, this theorem does not contain 20.16. On the other hand, in the cases where { Ct } is uniformly bounded and at = t, we need only �0 > 1, so that any r > 1 and
21 1
Mixing
§'';'+m cr(Yt+m.Yr+m+ b ·· .). Since Y1 is measur able on any cr-field on which each of X1,X1_ 1 , ... ,X1-'t are measurable, §'_:"" � rg;_:"" and §'';'+m c '!f';'+m- 't· Let ar,m = sup1 (§'_:oo,§'7+m) and it follows that ar,m � Um - 't for m � 't. With 't finite, am -1: = O(m-q>) if am = O(m - q>) and the conclusion follows. The same argument follows word for word with '<)> ' replacing 'a ' . • Proof Let §'.:== cr( ... , Y1- J,f1), and
=
1 4 . 2 Mixing Inequalities
Strong and uniform mixing are restrictions on the complete joint distribution of the sequence, and to make practical use of the concepts we must know what they imply about particular measures of dependence. This section establishes a set of fundamental moment inequalities for mixing processes. The main results bound the m-step-ahead predictions, E(Xt+m I '!f .: ""). Mixing implies that, as we try to forecast the future path of a sequence from knowledge of its history to date, looking further and further forward, we will eventually be unable to improve on the predictor based solely on the distribution of the sequence as a whole, E(Xt+m). The r.v. E(Xt+m l rg;_:"") - E(Xt+m) is tending to zero as m increases. We prove convergence of the Lp-norm. 14.2 Theorem (lbragimov 1962) For r � p � 1 and with am defined in (14. 1),
I I E(Xt+m I rg;_:oo) - E(Xr+m) li P � 2(2 1 1P + 1)a�lp - l lr 11 Xr+m ll r·
(14.7)
Proof To simplify notation, substitute X for Xt+m• §' for '!! .:=, :1-e for '!f7+m • and
a
for am . It will be understood that X is an Jf-measurable random variable where §', :1-e � 'lf . The proof is in two stages, first to establish the result for I X I � Mx < oo a.s., and then to extend it to the case where X is £,-bounded for finite r. Define the §'-measurable r.v. 1 , E(XI §') � E(X), 11 = sgn(E(XI §') - E(X)) = (14.8) - 1 , otherwise. Using 10.8 and 10.10,
{
El E(XI §') - E(X) I
=
E[11 (E(XI §') - E(X))]
= E[(E(11Xl §') - 11E(X)] =
Cov(11,X)
=
I Cov(11,X) 1 .
(14.9)
Let Y be any §'-measurable r.v., such as 11 for example. Noting that � sgn(E(YI Jf) - E(Y)) is :1-e-measurable, similar arguments give I Cov(X, Y) I = I E(X(E(YI H) - E(Y))) I
� E( I XI I CEC YI H) - E(Y)) I ) � MxE I E(YI H) - E(Y) I
=
Theory of Stochastic Processes
L. l L.
(14. 10)
� Mx i Cov(I;, Y) I ,
where the first inequality is the modulus inequality. 1; and 11 are simple random variables taking only two distinct values each, so define the sets A + = { 11 = 1 } , A - = {11 = -1 }, B+ = {1; = 1 } , and B - = {1; = - 1 } . Putting (14.9) and (14. 10) together gives
E I E(XI §') - E(X) I � Mx 1 Cov(11 1;) 1 = Mxi E(s11 ) - E(1;)E(11) 1 = Mx l [P(A+ n B+) + P(A - n B- ) - P(A + n B -) - P(A - n B+)] - [P(A +)P(B+) + P(A - )P(B - ) - P(A +)P(B - ) - P(A - )P(B+)] I � 4Mxa. (14. 1 1) ,
I E(X) I � 2Mx, it follows that, for p 2 1, (14. 12) II ECX I §') - E(X) Iip � 2Mx(2a) 1 1P. This completes the first part of the proof. The next step is to let X be L,-bounded. Choose a finite positive Mx, and define X1 = 1 I I X !:<> Mx}X and X2 = X - X1 . By the Minkowski inequality and ( 14 . 1 1 ) Since I E(X I §') - E(X) I � I E(XI §') I
+
,
I! E(XI §') - E(X) II p � II E(Xd §') - E(X, ) II p + II (Xz l §') - E(Xz) ll p � 2Mx(2a) 1 1P + 2 11 Xz ll v•
(14. 1 3)
and the problem is to bound the second right-hand-side member. But
I! Xz llv
�
M}- riP II X II�1P
( 14. 14)
for r 2 p, so we arrive at
I ! E(XI §') - E(X) Ilv � 2Mx(2a) 1 1P + 2M_k- rlp iiX II�1P. Finally, choosing Mx = II X !I ,a- 11' and simplifying yields II ECXI §') - E(X) II p � 2(2 1 1P + 1)allp -l !r iiX I I,,
which is the required result.
( 14. 15)
•
There is an easy corollary bounding the autocovariances of the sequence. >
1 and r 2 pl(p - 1 ), I Cov(Xr.Xt+m) l � 2(2 1 - 1 1P + 1 )a� - 1 /p - l !rii Xr ll p i! Xt+mll r·
14.3 Corollary For p
(14. 16)
Proof
I Cov(Xr.Xt+m) I = I E(XrXt+m) - E(X,)E(Xt+m) I = j E[X1(E(Xt+m l �,) - E(Xt+m))] j � I! Xr llpi i E(Xr+m l ��) - E(Xr+m) ll p!(p 1 ) p /p 1 1', 1 / 1 -1 � 2(2 - + l) I! Xr llp ll Xr+m ll r<X�
(14. 17)
Mixing
213
where the second equality is by 10.8 and 10.10, noting that Xr is :1ft-measurable, the first inequality is the Holder inequality, and the second inequality is by 14.2 . • 14.4 Theorem (Serfling 1968: th. 2.2) For r ;::: p
;:::
1,
(14. 18) II E(Xt+m l :1f�oo) - E(Xr+m) ii p :::; 2<j>�- 1 1'11 Xt+m l l r · where
:::; {� l xd I P(Ad :1f�oo) - P(Ai) l r (� I Xj I I P(Ai I :1f:oo) - P(Ai) 1 11' I P(Ai I :1f�oo) - P(Ai) 1 1/q) r q $ (� I x;l ' I P(A t l P(A;) I r P(A;) I ) (� I P(A, I =
:1' '�) -
:1' '�) -
(14. 19)
The second inequality here is by 9.25. The sets A i partition Q, and P(A i U A i' I :1f �"") = P(A i I :1f :"") + P(A i' I :1f �oo) a.s. and P(A i U A i') = P(A i) + P(A i') for i =I= i'. Letting Aj denote the union of all those A i for which P(A d :1f :"") - P(A i) ;::: 0, and A j the complement of A j on n,
L I P(A il :1f �oo) - P(A) I i
=
I P(Aj l :y;:oo) - P(A 1) I + I P(A f I :1f�oo) - P(A j) 1 . (14.20)
By 13.22, the inequalities
I P(AJ I :1f �oo) - P(A1) I :::; <J>m I P(A l I :1f �oo) - P(A j) I :::; <J>m hold with probability 1 . Substituting into (14. 19) gives I E(Xt+m l :1f�oo) - E(Xt+m) l r ;::: [E( I Xt+m l r l :y;�oo) + EI Xt+m l r] (2<J>m)rlq, a.s. (14.21) Taking expectations and using the law of iterated expectations then gives
EI E(Xt+m l :1f�oo) - E(Xt+m) l r :::; 2EI Xt+m l r(2<J>m) r/q' and, raising both sides to the power 1/r,
(14.22)
Theory of Stochastic Processes
214
(14.23) Inequality (14. 1 8) follows by Liapunov's inequality. The result extends from simple to general r.v.s using the construction of 3.28. For any r. v. Xr+m there exists a monotone sequence of simple r. v .s { X(k)t+m• k E IN } such that I X(k)t+m(ffi) - Xr+m(ro) l -7 0 as k � =, for all ro E .Q. This conver gence transfers a.s. to the sequences { E(�k)t+m I r:f _:oo) - E(X(k)r+m), k E IN } by 10.15. Then, assuming Xr+m is L,-bounded, the inequality in (14.22) holds as k � oo by the dominated convergence theorem applied to each side, with I Xr+m l ' as the dominating function, thanks to 10.13(ii). This completes the proof. • The counterpart of 14.3 is obtained similarly. 14.5 Corollary For r � 1, where, if r
=
I Cov(Xr+m.Xr) l S 2<j>�1' 11 Xr ll r i! Xr+mll rl(r- I ) • 1, replace I! Xr+m ll rl(r- 1 ) by I ! Xr+m ll oo = ess sup Xt+m·
(14.24)
Proof
I Cov(Xr+m .Xr) I S I !Xr ll r li E(Xt+m I r:f _:oo) - E(Xt+m) II rl(r- I )
S 2<j>�1' 11 Xr ll r 11 Xt+m ll rl(r- 1) •
(14.25)
where the first inequality corresponds to the one in (14. 17), and the second one is by 14.4. • These results tell us a good deal about the behaviour of mixing sequences. A fundamental property is mean reversion. The mean deviation sequence { X1 - E(X1) } must change sign frequently when the rate of mixing is high. If the sequence exhibits persistent behaviour with X1 - E(X1) tending to have the same sign for a large number of successive periods, then I E(Xt+m I r:f .:. ) - E(Xr+m) I would likewise tend to be large for large m. If this quantity is small the sign of the mean deviation m periods hence is unpredictable, indicating that it changes frequently. But while mixing implies mean reversion, mean reversion need not imply mixing. Theorems 14.2 and 14.4 isolate the properties of greatest importance, but not the only ones. A sequence having the property that I I Var(Xr m l rg, _:"") - Var(Xr+m) ll p > 0 is called conditionally heteroscedastic. Mixing also requires this sequence of norms to converge as m � =, and similarly for other integrable functions of Xt+m· Comparison of 14.2 and 14.4 also shows that being able to assert uniform mixing can give us considerably greater flexibility in applications with respect to the existence of moments. In (14. 1 8), the rate of convergence of the left-hand side to zero with m does not depend upon p, and in particular, El E(Xr+m I r:f .:oo) - E(Xr+m) I converges whenever IIXr+mll 1+o exists for o > 0, a condition infinitesimally stronger than uniform integrability. In the corresponding inequality for am in 14.2, p < r is required for the restriction to 'bite' . Likewise, 14.5 for the case p = 2 yields ""
+
Mixing
215 (14.26)
but to be useful 14.3 requires that either Xr or Xt+m be �+0-bounded, for 8 > 0. Mere existence of the variances will not suffice. 1 4 . 3 Mixing in Linear Processes
A type of stochastic sequence { Xr } ':"" which arises very frequently in econometric modelling applications has the representation
q (14.27) Xr =L 8jZr-j• 0 ::;; q ::;; oo, j=O where {Zr} �: (called the innovations or shocks) is an independent stochastic sequence, and { ej }J=O is a sequence of fixed coefficients. Assume without loss of generality that the Z1 have zero means and that 80 = 1 . ( 14.27) is called a moving average process of order q (MA(q)). Into this class fall the finite-order auto
regressive and autoregressive-moving average CARMA) processes commonly used to model economic time series. We would clearly like to know when such sequences are mixing, by reference to the properties of the innovations and of the sequence { 8j } . Several authors have investigated this question, including Ibragimov and Linnik (1971 ), Chanda (197 4), Gorodetskii (1977), Withers (1981 a), Pham and Tran ( 1985), and Athreya and Pantula (1986a, 1986b). Mixing is an asymptotic property, and when q < oo the sequence is mixing infinitely fast. This case is called q-dependence. The difficulties arise with the cases with q = oo. Formally, we should think of the MA(oo) as the weak limit of a sequence of MA(q) processes; the characteristic function of Xr has the form
q (14.28) <J>qrO�) = TI
sary that the sequence be square-summable. Note that the solutions of finite order ARMA processes are characterized by the approach of l 8j I to 0 at an exp onential rate, beyond a finite point in the sequence. If { Zr} is i.i.d. with mean 0 and variance d-, Xr is stationary and has spectral density function fCA.) =
oo
� j "'j:.. ejiN j 2. j=O
( 14.29)
The theorem of lbragimov and Linnik cited in § 14. 1 yields the condition I}=0 I ej I < as sufficient for strong-mixing in the Gaussian case. However, another standard result (see Doob 1 953: ch. X.8, or lbragimov and Linnik 1 97 1 : ch. 1 6.7)
Theory of Stochastic Processes
216
states that every wide-sense stationary sequence admitting a spectral density has a (doubly-infinite) moving average representation with orthogonal increments and square summable coefficients. But allowing more general distributions for the innovations yields surprising results. Contrary to what might be supposed, having the ei tend to zero even at an exponential rate is not sufficient by itself for strong mixing. Here is a simple illustration. Recall that the first-order autoregressive process X1 = pX1_ 1 + Z1, I p i < 1, has the MA(oo) form with ei = p i, j 0, 1 ,2, .. =
.
14.6 Example Let { Z1} 0 be an independent sequence of Bernoulli r. v .s, with P(Z1 = 1) = P(Z1 = 0) = !· Let Xo = Zo and
Xt It
=
�t- 1 + Z1
=
t L 2 -izt-J• t = 1 ,2,3, ...
(14.30)
l=O
is not difficult to see that the term
t 'L. Tizt -J j=O
=
t 2 -1L 2kzk
(14.3 1)
k=O
belongs for each t to the set of dyadic rationals W1 = {k/21 , k = 0,1 ,2, ... ,2t+ 1 - 1 } Each element of W1 corresponds to one of the 21+ 1 possible drawings {Zo, ... ,Z1}, and has equal probability of T1- 1 . Iff Zo 0, .
Xt E Bt whereas iff Zo
=
=
{k/21, k
=
=
0,2,4, ... ,2(2 1 - 1) } ,
1, 1 xt E Wt - Br = {k/2 , k = 1,3,5, . . .,21+ 1 - 1 } . It follows that {Xo = 1 } n {X1 E B1} = 0 , for every finite t. But it i s clear that P(X1 E B1) = P(Xo = 0) = !· Hence for every finite m, CXm :2:
I P( { Xo
which contradicts
CXm
=
1 } n {Xm
-7 0.
E
Bm }) - P(Xo
=
l)P(Xm
E
Bm) I
(14.32)
o
Since the process starts at t = 0 in this case it is not stationary, but the ex ample is easily generalized to a wider class of processes, as follows. 14.7 Theorem (Andrews 1984) Let {Z1}':.'= be an independent sequence of Bern oulli r.v.s, taking values 1 and 0 with fixed probabilities p and 1 - p. If X1 = pX1- 1 + Z1 for p E (O,!], {X1}':'oo is not strong mixing. o
Note, the condition on p is purely to expedite the argument. The theorem surely holds for other values of p, although this cannot be proved by the present approach. Proof Write
Xt+s
=
psX1 + X1,s where
Mixing xt,s
=
217
s- 1 L pjZt+s-j · j=O
(14.33)
The support of X1,s is finite for finite p, having at most 25 distinct members. Call this set W5 , so that W1 = (0, 1) , W2 = (0, 1, p, 1 + p), and so on. In general, Ws+ l is obtained from Ws by adding p 5 to each of its elements and forming the union of these elements with those of W5 ; formally,
(14.34) For given s denote the distinct elements of Ws by Wj, ordered by magnitude with w1 5 < . . . < WJ, for J � 2 • Now suppose that X1 E (O,p), so that p5X1 E (O,p 5+1 ). This means that Xt+s assumes a value between Wj and Wj + ps+ 1 , for some }. Defining events A = { X1 E (O,p)} and Bs = { Xt+s E Uf= l (wj, w + r s+l ) } , we have P(Bs i A) = 1 for any s, however large. To see that P(A) > 0, consider the case Z1 = Zt-1 = Zt-2 = 0 and Z1-3 = 1 and note that 3 (14.35) L PjZt-j � L Pj = 1 � p < p j=3 j=3 for p E (0,1J. So, unless P(Bs) = 1, strong mixing is contradicted. The proof is completed by showing that the set D = {X1 E [p, 1 ] } has positive probability, and is disjoint with Bs . D occurs when Z1 = 0 and Zt- l = 1 , since then, for p E (0,1], =
P
=
� pj - _P_ pjzt-1 <- L 1 -p j= l j= l
< � - L
.
-
and hence P(D) > 0. Suppose that min { Wj+ 1 - Wj } Then, if D occurs,
j <': l
�
<
-
ps- I .
1
'
(14.36) (14.37)
W1· + P s+l -< w·1 + p sXt < w1· + p s- I < Wj+ I ' (14.38) hence, Xr+s = Wj + psX1 E Uf= l (wj, Wj + p s+ l ), or in other words, Bs n D = 0. The assertion in (14.37) is certainly true when s = 1, so consider the following inductive argument. Suppose the distance between two points in Ws is at least p s- I . Then by (14.34), the smallest distance between two points of Ws+ l cannot be less than the smaller of ps and p s- I - ps. But when p E (0,�), ps � 1Ps- l , which implies p s- I - p 5 � p s. It follows that (14.37) holds for every s. • These results may appear surprising when one thinks of the rate at which p s -
approaches 0 with s; but if so, this is because we are unconsciously thinking about the problem of predicting gross features of the distribution of Xt+s from time t, things like P(Xt+s � x i A) for fixed x, for example. The notable feature of the sets Bs is their irrelevance to such concerns, at least for large s. What we
Theory of Stochastic Processes
218
have shown is that from a practical viewpoint the mixing concept has some undesi able features. The requirement of a decline of dependence is imposed over c events, whereas in practice it might serve our purposes adequately to tolera certain uninteresting events, such as the Bs defined above, remaining dependent c the initial conditions even at long range. In the next section we will derive some sufficient conditions for strong mixint and it turns out that certain smoothness conditions on the marginal distribution of the increments will be enough to rule out this kind of counter-example. But no\ consider uniform mixing. 14.8 Example 13 Consider an AR(1 ) process with i.i.d. increments,
Xr = pXr- 1 + Zr. 0 < p < 1 ,
i n which the marginal distribution of Z1 has unbounded support. We show that { X1] is not uniform mixing. For 8 > 0 choose a positive constant M to satisfy
(14.39 ) Then consider the events
A = {Xo � p -m (L + M) } E � �oo B = {Xm :S: L} e ��00, where L is large enough that P(B) � 1 - 8. We show P(A) > 0 for every m. Let PK = P(Zo < K), for any constant K. Since Z0 has unbounded support, either PK < 1 for every K > 0 or, at worst, this holds after substituting { -Z1} for {Z1 } and hence { -X1 } for {X1 } . Pk < 1 for all K implies, by stationarity, P(X- 1 < 0) = P(Xo < 0) < 1 . Since { X0 < K} c {Zo < K} u ({Zo � K} n {X- 1 < 0 } ) , independence of the { Z1 } implies that
P(Xo < K) :S: PK+ (1 - PK)P(X- t < 0) < 1 . So P(A) > 0, since K is arbitrary. Since Xm = pmXo + L.}:A p1Z
that
P(B iA)
�
(
P pmXo +
�
m
l
)
p jZ, -j S L p mXo ;, L + M <
by ( 14.39). Hence m � I P(B I A) - P(B) I means m = 1 for every m. o
>
6
(14.40) -J •
it is clear
(14.4 1 )
1 - 28, and since 8 i s arbitrary, this
Processes with Gaussian increments fall into the category covered by this example, and if -mixing fails in the first-order AR case it is pretty clear that counter examples exist for more general MA(oo) cases too. The conditions for uniform mix ing in linear processes are evidently extremely tough, perhaps too tough for this mixing condition to be very useful. In the applications to be studied in later chapters, most of the results are found to hold i n some form for strong mixing processes, but the ability to assert uniform mixing usually allows a relaxation of
Mixing
219
conditions elsewhere i n the problem, s o i t i s still desirable to develop the parallel results for the uniform case. The strong restrictions needed to ensure processes are mixing, which these exam ples point to (to be explored further in the next section), threaten to limit the usefulness of the mixing concept. However, technical infringements like the ones demonstrated are often innocuous in practice. Only certain aspects of mixing, encapsulated in the concept of a mixingale, are required for many important limit results to hold. These are shared with so-called near-epoch dependentfunctions of mixing sequences, which include cases like 14.7. The theory of these dependence concepts is treated in Chapters 16 and 17. While Chapter 15 contains some neces sary background material for those chapters, the interested reader might choose to skip ahead at this point to find out how, in essence, the difficulty will be resolved. 14.4 Sufficient Conditions for Strong and Uniform Mixing
The problems in the counter-examples above are with the form ofthe marginal shock distributions - discrete or unbounded, as the case may be. For strong mixing, a degree of smoothness of the distributions appears necessary in addition to summa bility conditions on the coefficients of linear processes. Several sufficient conditions have been derived, both for general MA(oo) processes and for auto regressive and ARMA processes. The sufficiency result for strong mixing proved below is based on the theorems of Chanda (1974) and Gorodetskii (1977). These conditions are not the weakest possible in all circumstances, but they have the virtues of generality and comparative ease of verification. 14.9 Theorem Let X1 = LJ==OejZt-j define a random sequence {X1}:oo , where, for either 0 < r :::; 2 or r an even positive integer, (a) Z1 is unifor-mly Lr-bounded, independent, continuous with p.d.f. fz1, and s �p
J:] fdzt + a)
-
fziz) J dz
:::;
Mi a I , M <
oo,
(14.42)
whenever I a I :::; 3, for some () > 0; (b) "L7==0Gt(r) 11( l +r) < oo , where
( 00 )r/2 2r-l � j==t
j
e]
r :::; 2, (14.43) '
r � 2;
(c) 8(x) = "L}== 1 8_;.x -:;:. 0 for all complex numbers x with l xl Then {Xd is strong mixing with <Xm = O("L7==m+ ! GtCr) 1 10 +r)). o
:::;
1.
Before proceeding to the proof, we must discuss the implications of these three conditions in a bit more detail. Condition 14.9(a) may be relaxed somewhat, as we
Theory of Stochastic Processes
220
show below, but we begin with this case for simplicity. The following lemma extends the condition to the joint distributions under independence. 14.10 Lemma Inequality
(14.42) implies that for ! a, !
:::;;
�.
t
=
1 , ... ,k, ( 14.44)
Proof Using Fubini's theorem,
J I Ok fzlzr + ar) - ilk fzlzr) I dzt · · ·dZk IRk
t=l
t=l
:::;; M l a 1 l +
J I r-rlfzlzr + ar) - r-fi2tz,(zr) I dz2...dzk· 2 IRk-!
The lemma follows on applying the same inequality to the second term on the right, iteratively for t = 2, . . ,k. • .
Condition 14.9(b) is satisfied when I ej I « F11 for Jl > 1 + 2/r when r 5 2 and ll > 3/2 + l!r when r � 2. The double definition of Gr(r) is motivated by the fact that for cases with r s 2 we use the von Bahr-Esseen inequality (11.15) to bound a certain sequence in the proof, whereas with r > 2 we mly on Lemma 14. 1 1 below. Since the latter result requires r to be an even integer, the conditions in the theorem are to be applied in practice by taking r as the nearest even integer below the highest existing absolute moment. Gorodetskii (1977) achieves a further weakening of these summability conditions for r > 2 by the use of an inequality due to Nagaev and Fuk ( 1 97 1). We will forgo this extension, both because proof of
the Nagaev-Fuk inequalities represents a rather complicated detour, and because the present version of the theorem permits a generalization (Corollary 14.13) which would otherwise be awkward to implement. Define Wr = I}:6ejZt-j and Vr = I,j=tejZt-j• so that Xt = Wt + V,, and Wr and Vt are independent. Think of Vt as the � � 00-measurable 'tail ' of Xr, whose contribu tion to the sum should become negligible as t -7 oo .
14.11 Lemma If the sequence { Zs } is independent with zero mean, then
E(Vi"') S z2m-I (t, sf:�� E(Zl'") for each positive integer m such that sup
s �o
E(Z�m)
<
(14.45) oo
.
Mixing
221
Proof First consider the case where the r.v.s Zr-j are symmetrically distributed,
meaning that -Zr-j and Zr-j have the same distributions. In this case all existing odd-order integer moments about 0 are zero, and +k +k t+k L j t-j = . . .
t t 2 m E ( e Z ) � L eh .. . ehmE(Zr-jj . . Zt-hm) 11 =t 12m=t
1=t
t+k
t+k
h=t
jm=t
L · · · L 8]1 eJmE(Z7-h- . .z7 ) ,; (� aJr:�� E(z;m) ( 14 . 46 ) The second equality holds since E(Zr-h . . Zt-hm) vanishes unless the factors form matching pairs, and the inequality follows since, for any r.v. Y possessing the j j requisite moments, E(Y +k) � E(Y )E(Y k) (i.e., Cov(Y< Y k) � 0) for j,k > 0. The result for symmetrically distributed Zs follows on letting k oo For general Z5, let z; be distributed identically as, and independent of, Z5, for each s s 0. Then v� LJ te z� - is independent of Vr, and Vr-v� has symmet rically distributed independent increments Zr-j- Z�-j- Hence E(V?m) s E(Vr-V�)2m s (t eJ) msup E(Zr-j-Z�-j)2m =
-jm
•••
---7
=
= j
.
j
1=t
1
(14.47) where the first inequality is by 10.19, the second by (14.45), and the third is the C r inequality. •
Lastly, consider condition 14.9(c). This is designed to pin down the properties of the inverse transformation, taking us from the coordinates of to those of It ensures that the function of a complex variable S(x) possesses an ana lytic 1 4 inverse (x) = for l x l s 1 . The particular property needed and implied by the condition is that the coefficient sequence { 'tj is absolutely summable. If = under 14.9(c) the inverse representation is also defined, as = Note that 'to = 1 if = 1 . An effect of 14.9( c) is to rule out 'over-differenced' cases, as for example where S(x) = (x)( l - x) with (.) a summable polynomial. The differencing transformation does not yield a mixing process in general, the exception being where it reverses the previous integration of a mixing process. For a finite number of terms the transformation is conveniently expressed using matrix notation. Let
{Zr} .
81
't 2-}=o't� Zr Xr 2.}2-}= =O'tjXr0ej-jZ· t-j•
So
{Xr} }
81
222
Theory of Stochastic Processes 1 0
1
e1
e1
An =
1
(n x n),
(14.48)
Sn- 2 Sn- 1
Sn-2
so that the equations x1 = I.j:be1zt-J• t = 1, ... ,n can be written x = AnZ where x = (x1 , ,xn )' and z = (z1 , ,zn)'. A � 1 is also lower triangular, with elements 'tj replacing ej for j = O, . . ,n - 1 . If v = (Vt . · · ··Vn)' the vector v = A�1v has elements I.}:d'tJVt-J• for t = l , . . . ,n. These operations can in principle be taken to the limit as n --7 =, subject to 14.9(c). .••
• • . .
Proof of 14.9 Without loss of generality, the object is to show that the a-fields
��= = cr( . . . .X- J ,Xo) and �;;;+ ! = cr(Xm+I .Xm+2•· · · ·) are independent as m --7 = The result does not depend on the choice of origin for the indices. This is shown for a sequence { X1} 7��-p for finite p and k, and since k and p are arbitrary, it then follows by the consistency theorem (12.4) that there exists a sequence {X1 }'::' = whose finite-dimensional distributions possess the property for every k and p. This sequence is strong mixing on the definition. Define a p + m + k-vector X = (X0,Xi,X2)' where Xo = (XI -p·· · ·,X0)' (p x 1 ), X1 = (Xt , . . . ,Xm)' (m X 1 ), and X2 = (Xm+J , . . . ,Xm+d (k x 1), and also vectors W = (Wi , W2)' and V = (Vi, V2)' such that X1 = W1 + VI and X2 = W2 + Vz. (The elements of W and V are defined above 14.11.) The vectors Xo and V are independent of W. Now, use the notation �; = a(Xs, . . . ,X1) and define the following sets: G = { ro: Xo(ro) E C} E �?-p, for some C E <.BP, .
H = { ro: X2Cro) E D } E ��!1. for some D E <.Bk, E
�� oo, where B = { v2 : I v2 l :::; 1l } E <.Bk, I v2 l denotes the vector whose elements are the absolute values of v2, and 1l = ('Tlm+J, . . . ,TJm+ )' is a vector of positive constants. k Also define E = { ro: Vz(ffi) E B}
D - v2 = { w2: w2 + v2 E D } E <.Bk.
H may be thought of as the random event that has occurred when, first V2 = v 2 is realized, and then W2 E D - v2. By independence, the joint c.d.f. of the variables (Wz, Vz,Xo) factorizes as F = Fw2Fv2 ,x0 (say) and we can write
P(H)
= P(X2 E D) =
J J kJ IRP
1R
D-vz
dF(w 2, v2 , xo) (14.49)
Mixing
223
where
(14.50) These definitions set the scene for the main business of the proof, which is to show that events G and H are tending to independence as rn becomes large. Given m �/;Ep+ +k_measurability of X, this is sufficient for the result, since C and D are arbitrary. By the same reasoning that gave (14.49), we have
fcf8x(v2)dFv2,x0(v2,xo). sup 2 X( 2) and X* inf 2 E X( 2), and (14.51) implies P(G n H n E)
Define X*
=
v es
=
=
V
v
s
(14.51)
V
X * P(G n E) � P(G n H n E) � X* P(G n E).
(14.52)
Hence we have the bounds
P(G n H)
=
P(G n H n E) + P(G n H n Ee) � X* P(G) + P(Ee),
(14.53)
and similarly, since X * � 1,
P(G n H) � X* P(G n E) + P(G n Hn Ee) = X* P(G) - X * P(G n Ee) + P(G n H n Ee) � X* P(G) - P(Ee). Choosing G
=
(14.54)
.Q (i.e., C = [RP) in (14.53) and (14.54) gives in particular X* - P(Ee) � P(H) � X * + P(Ee), (14.55)
and combining all these inequalities yields I P(G n H) - P( G)P(H) I
� X* - X* + 2P(Ee).
(14.56) Write W Am+kz, where Z = (Z1 , ... ,Zm+k)' and Am+k is defined by (14.48). Since I A m+k l = 1 and the {Z1 , ... ,Zm+k } are independent, the change of variable formula =
from 8.18 yields the result that W is continuously distributed with m+k
(14.57) fw(w) = fz(z) fl !Zt(z,). m Define B' { v : v 1 0, v2 B } B +k. Then the following relations hold: X* - X* � 2 sup I X(v2)- X(O) I V2 EB � 2 sup J I fw2(w2 + v2)-fw2(w2) I dw2 V2 EB =
=
=
E
D
E
t=l
Theory of Stochastic Processes
224
= {f { 2 su�
I
n fz/Zt + Vr) - n fdzr)
IR m+k t:= !
VEB
m+k
m+k
m+k
}
t:=l
I
dz
}
� 2M sup L l vrl , v E B'
t:=m+l
( 14.58)
where it is understood in the final inequality (which is by 14.10) that I vr I � 8 where 8 is defined in condition 14.9(a). The third equality substitutes v = A;;;!kv and uses the fact that v1 = 0 if v1 = 0 by lower triangularity of A m+k· For v E B', note that m+k
L l vr l
t=m+l
= .L
m+k
t-m-1
t:=m+!
j=O
L
'tjVt-j
(14.59) assuming Tt has been chosen with elements small enough that the terms in parenthe ses in the penultimate member do not exceed 8. This is possible by condition 14.9(c). For the final step, choose r to be the largest order of absolute moment if this is does not exceed 2, and the largest even integer moment, otherwise. Then
m+k
m+k
t:=m+!
t:=m+!
� L P( l Vrl > 1l r) � L El Vrl r1l � r,
(14.60)
by the Markov inequality, and
(14.61) El Vrl r � sup E I Zs l rG,(r), s where Gr(r) is given by (14.43), applying 11.15 for r � 2 (see (1 1 .65) for the required extension) and Lemma 14.11 for r > 2. Substituting inequalities (14.58), ( 14.59), (14.60), and (14.61 ) into (14.56) yields j P(G n H) - P(G)P(H) I
m+k
«
L (1lr + Gr(r)1l�r).
t:=m+!
(14.62)
Mixing
225
Since Gr(r) � 0 by 14.9(b), it is possible to choose m large enough that (14.59) and hence (14.62) hold with Tl t = Gr(r)1 1(l +r) = Gr(r)Tl? for each t > m. We obtain I P(G n H) - P(G)P(H) I
«
::=:;
m+k
L
t=m+ l 00
L
t=m+ l
Gr(r)ll(l +r) Gr(r) ll(l +r) ,
(14.63)
where the right-hand sum is finite by 14.9(b), and goes to zero as m � oo . This completes the proof. • It is worth examining this argument with care to see how violation of the condi tions can lead to trouble. According to (14.56), mixing will follow from two conditions: the obvious one is that the tail component V2, the � �""-measurable part of X2 , becomes negligible, such that is, P(E) gets close to 1 when m is large, even when 1\ is allowed to approach 0. But in addition, to have X* - X * disappear, P( W2 E D - v 2) must approach a unique limit as v2 � 0, for any D, and whatever the path of convergence. When the distribution has atoms, it is easy to devise examples where this requirement fails. In 14.6, the set B1 becomes W1 - B1 on 1 being translated a distance of 2 - • For such a case these probabilities evidently do not converge, in the limiting case as t � oo. However, this is a sufficiency result, and it remains unclear just how much more than the absence of atoms is strictly necessary. Consider an example where the distribution is continuous, having differentiable p.d.f., but condition ( 14.42) none the less fails.
14.12 Example Let f(z) = C0 z-2sin\z\ z E IR . This is non-negative, continuous everywhere, and bounded by Co z -2 and hence integrable. By choice of Co we can have t:J(z)dz = 1 , so f is a p.d.f. By the mean value theorem,
I f(z + a) - f(z) I = I a I I f'(z + a(z)a) I , a(z) E [0, 1], ( 14.64) where f'(z) = 8Cosin(z4)cos(z4)z - 2Cosin2(i)z - 3 . But note that t: l f'(z) l dz = oo, and hence,
1
J
TQT
+oo _""
l f(z + a) - f(z) l dz �
oo
as l a l � 0,
( 14.65)
which contradicts (14.42). The problem is that the density is varying too rapidly in the tails of the distribution, and I f(z + a) - f(z) I does not diminish rapidly enough in these regions as a � 0. The rate of divergence in (14.65) can be estimatedY For fixed (small) a, l f(z + a) - f(z) l is at a local maximum at points at which sin (z + a)4 = 1 (or 0) and sin z4 = 0 (or 1), or in other words where (z + a)4 - i = 4az3 + O(a2) = ±rr/2 . The solutions to these approximate relations can be written as z 1 ±C1 1 a l - 13 for C1 > 0. At these points we can write, again approximately (orders of magnitude are all we need here), =
Theory of Stochastic Processes
226
l f(z + a) - f(z) l � 2f(z) � 2CoC 12 I a l 213.
The integral is bounded within the interval [-C1 I a l - 1 13, C1 l a l - 1 13] by 4 C0C 11 1 a 1 1 13, the area of the rectangle having height 2Co C 1 2 1 a 1 213 . Outside the interval, f is bounded by C0 z-2, and the integral over this region is bounded by
J+oo
2C0
z- 2dz Cl l a i - 113
=
/ J 2Co C 1 l a l 1 3 .
Adding up the approximations yields
for M <
oo. o
s:: l f(z + a) - f(z) l dz � M l a l 113
(14.66)
The rate of divergence is critical for relaxing the conditions. Suppose instead of (14.42) that
J:: l f(z + a) - f(z) l dz � Mh( l a i ), I a l � B, B > 0
( 14.67)
could be shown sufficient, where h(.) is an arbitrary increasing function with h( I a I ) t 0 as I a I t 0. Since
s:: l f(z + a) - f(z) l dz � 2s::f(z)dz
=
(14.68)
2
for any a, ( 14.67) effectively holds for any p.d.f., by the dominated convergence theorem. Simple continuity of the distributions would suffice. 1 6 This particular result does not seem to be available, but it is possible to relax 14.9(a) substantially, at the cost of an additional restriction on the moving average coefficients. 14.13 Corollary Modify the conditions of 14.9 as follows: for 0 < � � 1 , assume that (a') Zr is uniformly Lr-bounded, independent, and continuously distributed with p.d.f. fz1, and
(14.69) whenever l a I � B, for some 8 > 0;
(b') I.7=0Gr(r) 13!(13+r) < oo, where Gr(r) is defined in (14.43); (c') 1/S (x) = 't(x) = LJ=I'tjXj for l xl � 1 , and I.'J=t l -rj l l3 < oo
Then Xr is strong mixing with
CXm
=
0(2:7=m+ 1 Gr(r) j3/(j3+r)).
Proof This follows the proof of 14.9 until
.
(14.58), which becomes
Mixing
227
{ m+k lvr l �}'
x * - x * :::; 2M sup L vE B'
(14.70)
t=m+i
applying the obvious extension of Lemma 14.10. Note that
t-Lm- I -j � oo l 1 l 13 m+k 11� (14.71 ) t=m+ I t=m+I j=O 'tjVt :::; .L}=0 t t=.Lm+I using (9.63), since 0 < � :::; 1 . Applying assumption 14.13(c'), m+k (14.72) I P(G n H) - P(G)P(H) I L (TJ � + Gr(r)TJ?), t=m+I and the result is obtained as before, but in this case setting TJ r G�1(�+r). Condition 14.13(b') is satisfied when I ej l for 1-l > l!J3 2/r when r :::; 2, and 1-1 > 1/2 + 1/r 11� when r � 2, which shows how the summability restrictions have «
•
=
+
« F11
+
to be strengthened when J3 is close to 0. This is none the less a useful extension because there are important cases where 14.13(b') and 14.13(c') are easily satisfied. In particular, if the process is finite-order ARMA, both I I and I I either decline geometrically or vanish beyond some finite j, and (b') and (c') both hold. Condition 14.13(a') is a strengthening of continuity since there exist functions h(.) which are slowly varying at 0, that is, which approach 0 more slowly than any positive power of the argument. Look again at 14.12, and note that setting � = ! will satisfy condition 14.13(a') according to (14.65). It is easy to generalize the example. Putting j(z) = Csin2(l)z - 2 for k � 4, the earlier argument is easily modified to show that the integral converges at the rate ! a ! and this 2 2 choice of J3 is appropriate. But for f(z) = Csin (ez)z the integral converges more slowly than l a l � for all � > 0, and condition 14.13(a') fails. To conclude this chapter, we look at the case of uniform mixing. Manipulating inequalities (14.52)--{14.55) yields
ej
'tj
1/(k-I \
( (�J
I P(H I G) - P(H) I � x* - x * + P(Ec) l + P (14.73) . which shows that uniform mixing can fail unless P(E) = 1 for all m exceeding a finite value. Otherwise, we can always construct a sequence of events G whose probability is positive but approaching 0 no slower than P(Ec). When the support of (X-p, . . . ,X0) is unbounded this kind of thing can occur, as illustrated by 14.8. The essence of this example does not depend on the AR(l ) model, and similar cases could be constructed in the general MA(=) framework. Sufficient conditions must include a.s. boundedness of the distributions, and the summability conditions are also modified. We will adapt the extended version of the strong mixing condition in 14.13, although it is easy to deduce the relationship between these conditions and 14.9 by setting J3 = 1 below.
Theory of Stochastic Processes
228
14.14 Theorem Modify the conditions of 14.13 as follows. Let (a') and (c') hold as before, but replace (b') by (b") )13 < oo, and add (d) is uniform!y bounded a. s. = Then is uniform mixing with )13). Proof Follow the proof of 14.9 up to (14.55), but replace (14.56) by (14.73). By condition 14.14(d), there exists < oo such that sup1 l < a.s., and hence a.s. It further follows, recalling the definition of V2, that P(E) = 1 < when m + 1 , ... ,m + k. Substituting directly into (14.73) < from (14.70) and (14.71), and making this choice of 11, gives (for any G with P(G)
L,";'=O(Lj=r l Sjl {Xr}{ Z1}
KL,j=0l Sjl Tl r KLj=rlSjl for t =
IXrl
> 0)
(14.74) The result now follows by the same considerations as before.
•
These summability conditions are tougher than in 14.13. Letting r � oo in the latter case for comparability, 14.13(b') is satisfied when I = OU 11) for 11 > 1/2 + 1/�, while the corresponding implication of 14.14(b") is 11 > 1 + 1/�.
Sjl
15 Martingales
1 5 . 1 Sequential Conditioning
It is trivial to observe that the arrow of time is unidirectional. Even though we can study a sample realization ex post, we know that, when a random sequence is generated, the 'current' member X1 is determined in an environment in which the previous members, Xt- k for k > 0, are given and conditionally fixed, whereas the members following remain contingent. The past is known, butthe future is unknown. The operation of conditioning sequentially on past events is therefore of central importance in time-series modelling. We characterize partial know ledge by specify ing a <J-subfield of events from c:J, for which it is known whether each of the events belonging to it has occurred or not. The accumulation of information by an observer as time passes is represented by an increasing sequence of a-fields, { c:J1}':'oo , such that . .. � cg;_ 1 c c:Jo � c:J 1 c c:J2 � ... � cg;_ I ? If X1 is a random variable that is c:J1-measurable for each t, { c:J1} ':'00 is said to be adapted to the sequence {X1}':'oo . The pairs {X1,c:Jr}':'oo are called an adapted sequence. Setting c:J1 = a(Xs, oo < s � t) defines the minimal adapted sequence, but c:Jt typically has the interpretation of an observer' s information set, and can contain more information than the history of a single variable. When X1 is inte grable, the conditional expectations E(X1 1 c:J1_ 1 ) are defined, and can be thought of as the optimal predictors of X1 from the point of view of observers looking one period ahead (compare 10.12). Consider an adapted sequence {Sn.c:Jn} ':'oo on a probability space (Q,c:J,P), where { c:Jn } is an increasing sequence. If the properties -
E I Sn l < E(Sn I cg;n - 1 )
=
00,
Sn - 1 , a.s.,
(15.1) (15.2)
hold for every n, the sequence is called a martingale. In old-fashioned gambling parlance, a martingale was a policy of attempting to recoup a loss by doubling one's stake on the next bet, but the modern usage of the term in probability theory is closer to describing a gambler's worth in the course of a sequence of fair bets. In view of (10. 1 8), an alternative version of condition (15.2) is
JAsndP JAsn-ldP, each A =
E cg;n_ 1 .
(15 .3)
Sometimes the sequence has a finite initial index, and may be written {Sn.c:Jn}1 where is an arbitrary integrable r.v.
s1
Theory of Stochastic Processes
230
15.1 Example Let {Xr}1 be an i.i.d. integrable sequence with zero mean. If Sn
'L7= 1Xr and r:Ji
=
=
o"(Xn.Xn - J , ... ,Xt), {Sn,r:Jin }1 is a martingale, also known as a random walk sequence. Note that E I Sn l :::; 'L7= tEI Xr l < oo. o n
15.2 Example Let
2 be
an integrable, r:Ji/:8-measurable, zero-mean r.v., { r:Jin}':oo an increasing sequence of a-fields with limn�oorg;n = r:Ji, and Sn = E(ZI r:Ji ) Then n .
(15.4) where the second equality is by 10.26(i). El Sn l :::; E I ZI < oo by 10.27, so Sn is a martingale. o Following on the last definition, a martingale difference (m.d.) sequence { Xr,r:Jir}':'oo is an adapted sequence on (Q,r:Ji,P) satisfying the properties E l Xr l < oo ,
E(Xr l r:Jit- I )
=
(15.5) (15.6)
0, a.s.,
for every t. Evidently, if { Sn } is a martingale and Xr = Sr - Sr- I , then { Xr} is a m.d. Conversely, we may define a martingale as the partial sum of a sequence of m.d.s, as in 15.1 (an independent integrable sequence is clearly a m.d.). However, if Xr has positive variance uniformly in t, condition (15. 1) holds for all finite n but not uniformly in n. To define a martingale by Sn = L�=- = Xt can therefore lead to difficulties. Example 15.2 shows how a martingale can arise without reference to summation of a difference sequence. It is important not to misunderstand the force of the integrability requirement in (15. 1). After all, if we observe Sn - 1 , predicting Sn might seem to be just a matter of knowing something about the distribution of the increment. The problem is that we cannot treat E(Sn I Sn - 1 , ... ) as a random variable without integrability of Sn . Conditioning on Sn- l is not the same proposition as treating it as a constant, which entails restricting the probability space entirely to the set of repeated random drawings of Xn. The latter problem has no connection with the theory of random sequences. A fundamental result is that a m.d. is uncorrelated with any measurable function of its lagged values. 15.3 Theorem If {Xr,r:Jir} is a m.d., then
Cov(Xr,cJ>(Xr- t ,Xt- 2•· ··))
=
0,
where cp is any Borel-measurable, integrable function of the arguments. Proof By 10.11 (see also the remarks following) noting that cj>(Xr- t ,Xr-2 , . . . ), is r:Jir- 1 -measurable. • 15.4 Corollary If {Xt,r:Jir} is a m.d., then E(XtXr-k) = 0, for all t and all k '# 0.
Proof Put cp
and t' - I k l
=
=
Xr-k in 15.3. For k < 0, redefine the subscripts, putting t' t, so as to make the two cases equivalent. •
=
t-k
Martingales
23 1
One might think of the m.d. property as intermediate between uncorrelatedness and independence in the hierarchy of constraints on dependence. However, note the asymmetry with respect to time. Reversing the time ordering of an independent sequence yields another independent sequence, and likewise a reversed uncorrelated sequence is uncorrelated; but a reversed m.d. is not a m.d. in general. The Doob decomposition of an integrable sequence {Sn,:!Fn }o is
(15.7) 0, Mo = So, and (15.8) Mn = Mn - 1 + Sn - E(Sn i :!Fn - 1 ), (15.9) An = An - I + E(Sn i :!Fn - J ) - Sn - 1 · An is an :!Fn- 1 -measurable sequence called the predictable component of Sn. Writing !:l.Sn = Yn and !:l.Mn = Xn, we find Mn = E(Yn l :!Ft- 1 ), and (15. 10) Xn is known as a centred sequence, and also as the innovation sequence of Sn. It is adapted if { Yn,:!Fn }o is, and since £1 Yn I < oo by assumption, where Ao
=
E I Xn l � E I Yn l + E I E(Yn i :!Fn - 1 ) 1 � E I Yn l + E(E( I Yn l l :!fn - 1 )) = 2£ 1 Yn l < 00 ,
(15. 1 1)
by (respectively) Minkowski' s inequality, the conditional modulus inequality, and the LIE. Since it is evident that E(Xt l :!Ft- 1 ) = 0, {Xn,:!Fn}o is a m.d. and so { Mn,:!Fn } o is a martingale. Martingales play an indispensable role in modern probability theory, because m.d.s behave in many important respects like independent sequences. Independence is the simplifying property which permitted the 'classical ' limit results, laws of large numbers and central limit theorems, to be proved. But independence is a constraint on the entire joint distribution of the sequence. The m.d. property is a much milder restriction on the memory and yet, as we shall see in later chapters, most limit theorems which hold for independent sequences can also be proved for m.d.s, with few if any additional restrictions on the marginal distributions. For time series applications, it makes sense to go directly to the martingale version of any result of interest, unless of course a still weaker assumption will suffice. We will rarely need a stronger one. Should we prefer to avoid the use of a definition involving a-fields on an abstract probability space, it is possible to represent a martingale difference as, for example, a sequence with the property
(15. 12) When a random variable appears in a conditioning set it is to be understood as representing the corresponding minimal cr-subfield, in this case cr(Xt- I .Xt-2 , ... ).
Theory of Stochastic Processes
232
This is appealing at an elementary level since it captures the notion of informa tion available to an observer, in this case the sequence realization to date. But since, as we have seen, the conditioning information can extend more widely than the history of the sequence itself, this type of notation is relatively clumsy. Suppose we have a vector sequence and though not necessarily is a m.d. with respect to = ... ) in the sense of ( 1 5.6). This case is distinct from (15. 12), and shows that that definition is inadequate, although ( 1 5. 16) implies (15. 12). More important, the representation of conditioning information is not unique, and we have seen (10.3(ii)) that any measurably isomorphic transformation of the conditioning variables contains the same information as the original variables. Indeed, the information need not even be represented by a variable, but is merely knowledge of the occurrence/non occurrence of certain abstract events.
{ (Xr,Zr)}, Xrrg;r cr(Xr,Zr,Xt- 1 ,Zr- 1 ,
Zr -
1 5. 2 Extensions of the Martingale Concept
{ {Xnr, rg;nr}��J};=I• where {kn};=l is some increasing ( 1 5 . 13) E i Xm i < (15. 14) E(Xnrl rg;n,t-1 ) 0 a.s. for each t l , ... , kn and n � 1 , is called a martingale difference array. In many applications we would have just kn n. The double subscripting of the subfield rg;nt may be superfluous if the information content of the array does not depend on n, with rg;nt rg;t for each n, but the additional generality given by the definition is harmless and could be useful. The sequence { Sn, rg;n} i where Sn L�� 1 Xm and rg;n rg;n,kn is not a martingale, but the properties of martingales can be profitably used to analyse its behaviour. Consider the case Sn n - 1 12.L7=1 Xr where { Xr, rg; t} is a m.d. Such scaling by sample size may ensure that the distribution of Sn has a non-degenerate limit. Sn is not a martingale since (15. 15) E(Sn l rg;n - 1 ) [(n 1)/n] 1 12Sn 1 , An adapted triangular array sequence of integers, for which
=,
=
=
=
=
=
=
=
=
-
-
but each column of the m.d. array
x1 T 112X1 3 -1 12Xt 4- 112X1 T 1 '2x2 3 - lt2x2 4-lt2x2 3 -1'2x3 4-112X3 4-112x4
(15. 1 6)
is a m.d. sequence, and Sn is the sum of column n. It is a term in a martingale sequence even though this is not the sequence A n adapted sequence of L 1 -bounded variables satisfying
{Smrg;nf:'oo
{Sn}.
Martingales
233 (15. 17)
is called a submartingale, in which case Xn = Sn - Sn - ! is a submartingale differ ence, having the property E(Xn+! l <Jn) ;::: 0 a.s. In the Doob decomposition of a submartingale, the predictable sequence An is non-decreasing. Reversing the inequality defines a supermartingale, although, since -Sn is a supermartingale whenever Sn is a submartingale, this is a minor extension. A supermartingale might represent a gambler' s worth when a sequence of bets is unfair because of a house percentage. The generic term semimartingale covers all the possibilities. 15.5 Theorem Let <)>(.): IR f-) IR be continuous and convex. If {Sm<Jn} is a martin gale and E I <J>(Sn) l < =, then {<J>(Sn),<Jn} is a submartingale. If
(15. 1 8) by the conditional Jensen inequality (10.18). For the submartingale case, '= ' becomes •;::: • in (15. 1 8) when X! � x2 ==> <j>(x1 ) � <J> (x2). • If {X1,'J1}i is a (sub)martingale difference, {Z1,'J1}o any adapted sequence, and n Sn = _LXrZr-J , (15. 19) t:=l
then {Sn,<Jn }i is a (sub)martingale since
n E(Sn+ d <Jn) = _LXrZt- 1 + ZnE(Xn+d <Jn) = Sn (;::: Sn). t:=l
(15.20)
We might think of Xt as the random return on a stake of 1 unit in a sequence of bets, and the sequence { Z1} as representing a betting system, a rule based on information available at time t - 1 for deciding how many units to bet in the next game. The implication of (15.20) is that, if the basic game (in which the same stake is bet every time) is fair, there is no betting system (based on no more than information about past play) that can turn it into a game favouring the player - or for that matter, a game favouring the house into a fair game. For an increasing sequence { <J t } of cr-subfields of (Q,<J ,P), a stopping time 't( co) is a random integer having the property { co: t = 't(co) } e 'J1• The classic example is a gambling policy which entails withdrawing from the game whenever a certain condition depending only on the outcomes to date (such as one' s losses exceeding some limit, or a certain number of successive wins) is realized. If 't is the random variable defined as the first time the said condition is met in a sequence of bets, it is a stopping time. Let 't be a stopping time of { <Jn }, and consider
SnA't =
{
Sn, n � 't Sr;, n > 't
(15.21)
Theory of Stochastic Processes
234
{
where n A 't stands for min n , 't } . {Snl\'t,�n } ';;'= 1 is called a stopped process.
{ Sn , �n }t is a martingale (submartingale), then { Sn 't �n lt is a martingale (submartingale). Proof Since { �n} 1 is increasing, { ro: k ro)} � n for k < n, and hence also n 't(W)} �n. by complementation. Write SnNr: IZ: l sk l {lr-'t) +Snl {n:O:'t )• {ro: where the indicator functions are all �n-measurable. It follows by 3.25 and 3.33 that sn/\'t is �n-measurable, and n -1 E I SM't l 2: E I Sk l ( k='tJ I +EI Sn l { n�'t J I k=1 15.6 Theorem If
:S
l\ •
=
E
't(
E
=
:S
n- 1 :S L E I Sk l + E I Sn l < oo, n ;?: 1.
k=1
(15.22)
{Sn,�n}i is a martingale then for A �n. applying (15.3), fAS
E
=
�(D't)
=
•
The general conclusion is that a gambler cannot alter the basic fairness charact eristics of a game, whatever gambling policy (betting system plus stopping rule) he or she selects. All these concepts have a natural extension to random vectors. An adapted sequence is defined to be a vector martingale diffe rence if and only if �1} :'"" is a scalar m .d. sequence for all conformable fixed vectors ::1= 0. It has the property
{A'Xr. {X1, �r}:'oo
A
(15.24) The one thing to remember is that a vector martingale difference is not the same thing as a vector of martingale differences. A simple counter-example is the two element vector = where is a m.d.; is an adapted sequence, but
Xr (X1,X1- d, X1 {f-.1Xr + �Xr- 1 •�t} E(AIXr + A2Xt- d �r- I) A2Xr- 1 0, so it is not a m.d.. On the other hand, E(A.IXr+l + X l � ) 0, but {AIXr+I +�X1, �r} is not adapted, since Xr+ l is not �1-measurable. :f.
=
A2
t
t-
l
=
Martingales
235
1 5 . 3 Martingale Convergence
Applyi ng 15.5 to the case <j>(.) = I . IP and taking unconditional expectations shows that every martingale or submartingale has the property
(15.25) By 2.1 1 the sequence of pth absolute moments converges as n --7 oo, either to a finite limit or to +oo. In the case where the L 1 -norms are uniformly bounded, (sub)martingales also exhibit a substantially stronger property; they converge, almost surely, to some point which is random in the sense of having a distribution over realizations, but does not change from one time period t to the next. The intuition is reasonably transparent. { r:Jn } is an increasing sequence of a-fields which converges to a limit r:J"" � r:J, the a-field that contains r:Jn for every n. Since E(Sn l r:Jn) = Sm the convergence of the sequence { r:J'n } implies that of a uniformly bounded sequence with the property E(Sn+ l l r:Jn) � Sn, so long as these expectations remain well-defined in the limit. Thus, we have the following.
M < oo, then Sn --7 S a.s. where S is a r:J'-measurable random variable with El S l -5, M. o The proof of 15.7, due to Doob, makes use of a result called the upcrossing inequality, which is proved as a preliminary lemma. Considering the path of a submartingale through time, an upcrossing of an interval [a,�] is a succession of steps starting at or below a and terminating at or above �· To complete more than 15.7 Theorem If {Smr:J'n}i is a submartingale sequence and supnE I Sn l
-5,
one upcrossing, there . must be one and only one intervening downcrossing, so downcrossings do not require separate consideration. Fig. 15.1 shows two upcross ings of [a,�], spanning the periods marked by dots on the abscissa.
�
0
0
0 0
a
0 °o 0 0
0 00
0
0
0
0
0
�o �
· ············································ ········ ··························-
0
0
·
······
o
····································································
0
00 0 0
0
············
0
0
0
0 0
0 0 0 0
0
0
0
················ ················· ····
o
·············
0
0
0
c;-··························
· ·_·_·_·_ --�---·_ ·_ · _ · ·_·_·_ ·_ ·_ · _______·_·_ ·_ ·_ · _ · ·_·_·_ ·_ · __________� k ··_ ·_
yk
000001 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1000000001 1 1 1 1 1 1 1 1 1 1000000000000 Fig. 15. 1
Let the r.v. Yk be the indicator of an upcrossing. To be precise, set Y1 and then, for k = 2,3, ... ,n,
=
0,
236
{
Theory of Stochastic Processes
0 if either Yk-1 yk = 1 if either Yk - 1
= =
-
0, Sk 1 > a, or Yk- 1 0, Sk-1 :5 a, or Yk- 1
=
=
1, Sk-1 � �. 1 , Sk-1 < �-
(15.26)
The values of Yk appear at the bottom of Fig. 15. 1 . Observe that an upcrossing begins the period after Sk falls to or below a, and ends at the first step there after where � is reached or exceeded. Yk is a function of Sk- 1 and an �k - 1 -measur able random variable. The number of upcrossings of [a, � ) up to time n of the sequence {Sn(ro) } i, to be denoted Un(ro), is an �n-measurable random variable. The sequence { Un(ro)} i is monotone, but it satisfies the following condition. 15.8 Upcrossing inequality The number of upcrossings of [a,�) by a submartin gale {Sm�n }i satisfies
(15.27) Proof Define S� = max{Sn,a} , a continuous, convex, non-decreasing function of
Sm such that { S� .� n } is an adapted sequence and also a submartingale. Un is the set of upcrossings up to n for {S� } as well as for {Sn } . Write n n n (1 5.28) S� - S! = ,L xk ,L YkXk + ,L ( l - Yk)Xk, k=2 k=2 k=2 where Yk is from (15.26), and Xk is a submartingale difference. Then
=
=if
E(Xk l �k- l )dP � 0,
(15.29)
k=2 { Y,rO} using the definition of a conditional expectation in the second equality (recall ing that Yk is �k -1-measurable), and the submartingale property, to give the in equality. We have therefore shown that
(i
)
(15.30) Ykxk . k=2 I,k=2 YkXk is the sum of the steps made during upcrossings, by definition of Yk · Since the sum of the Xk over an upcrossing equals at least � - a by definition, we must have n (15.3 1) ,L YkXk � (� - a)Un, k=2 where Un is the number of upcrossings completed by time n. Taking the expectation of ( 15.3 1 ) and substituting (15.30), we obtain, as required, E(S� - Sl) � E
Martingales
237
(� - a)E( Vn) � E(S� - S j ) � E(S� - a) =
f
I Sn >a)
(15.32)
(Sn - a)dP
� E I Sn - a l � EI Sn l + l a l .
•
The upcrossing inequality contains the implication that, if the sequence is uniformly bounded in L1 , the expected number of upcrossings is finite, even as n ---7 oo. This is the heart of the convergence proof, for it means that the sequence has to be settling down somewhere beyond a certain point. Proof of 15.7 Fix a and �
E(Un) �
> a. By 15.8,
E I Sn l + l a l M + l a l < � A p-a �-a
oo.
(15.33)
For CD E Q, { Un(CD) } i is a positive, non-decreasing sequence and either diverges to +oo or converges to a finite limit U(CD) as n ---7 oo. Divergence for CD E C with P(C) > 0 would imply E(Un) --7 oo, which contradicts (15.33), so Un ---7 U a.s., where E(U) < oo. Define S(CD) = limsupn�ooSn and �(CD) = liminfn�ooSn. If �(co) < a < � < S(CD), the interval [a,�] is crossed an infinite number of times as n --7 oo, so it must be the case that P(S. < a < � < S) = 0. This is true for any pair a,�. Hence consider { co: �(CD) < S(CD) } = U {S. � a < � � S } , a,J3
(15.34)
where the union on the right is taken over rational values of a and �- Evidently, P(S. < S) = 0 by 3.6(ii), which is the same as S. = S = S a.s., where S is the limit of {Sn } . Finally, note that
(1 5.35) E I S I � liminf E I Sn l � sup E I Sn l � M, n n�oo where the first inequality is from Fatou ' s lemma and the last is by assumption. This completes the proof. • Of the examples quoted earlier, 15.1 does not satisfy the conditions of 15.7. A random walk does not converge, but wanders forever with a variance that is an increasing function of time. But in 15.2, X1 is of course converging to Z. 15.9 Corollary Let {Su,?:Fn }:'"" be a doubly infinite martingale. Then Sn --7 S_"" a.s.
as n
---7 - oo ,
where s_oo is an £!-bounded r.v.
Proof Let U- n denote the number of upcrossings of [a,�] performed by the sequence { Sj, -1 � j � n } The argument of 15.8 shows that -
E( U- n) �
.
E I S1 I + l a l � , all n � 1 . a _
Arguments precisely analogous to those of 15.7 show that
(15.36)
Theory of Stochastic Processes
238
)
(
P 1!:��f Sn < 1���� sn = 0,
( 1 5.37)
so that the limit s_"" exists a.s. The sequence { E I Sn l } ::� is non-negative, non increasing as n decreases by ( 1 5 .25), and E I S-d < oo by definition of a martin gale. Hence E I S -.,.,1 < oo. • If a martingale does not converge, it must not be thought of as converging in R, of heading off to +oo or -oo, never to return. This is an event that occurs only with probability 0. Subject to the increments having a suitably bounded distribu tion, a nonconvergent martingale eventually visits all regions of the real line, almost surely. 15.10 Theorem Let {X1,?1'1} be a m.d. sequence with E(sup1 I X1 1 ) < oo, and let Sn = I7=tX1• If C = { ffi: Sn(ffi) converges } , and
E = { co: either infnSn(CO) > -oo or supnSn ( (I)) < oo } , then P(E - C) = 0. Proof For a constant M > 0, define the stopping time 'tM(ro) as the smallest integer n such that Sn( ro) > M, if one exists, and 'tM(ro) = oo otherwise. The stopped process { SnAtu • ?/'nAtu l';;'= l is a martingale (15.6), and Scn -l )Nr:u � M for all n. Letting S�"tM = max { SnA M'O } , t
s �A'tM � s (n - l )A'tM + x�A'tM � M + sup I Xn I .
n
( 1 5.38)
Since E(SnA'tM) = 0, El snA u l = 2E(�A'tM), and hence supnEI SnNrul < 00 , and snA M converges a.s., by 15.7. And since snATu(ro) = Sn(CO) on the set { (1): supnSn(ffi) � M}, Sn(co) converges a.s. on the same set. Letting M � oo, and then applying the same argument to -Sn, we obtain the conclusion that Sn(ro) converges a.s. on the set E; that is, P(C n E) = P(E), from which the theorem follows. • 't
t
Note that E c = { co: supnSn( co) = +oo, infnSn(ro) = -oo } . Since P(Ec) = P((C n E)c) = P( cc u E c) , a direct consequence of the theorem is that cc c Ec u N where P(N) = 0, which is the claim made above. 1 5 .4 Convergence and the Conditional Variances
If { Sn } is a square-integrable martingale with differences {Xn } ,
E(S� I ?Fn- t) = E(S�- 1 + X; + 2XnSn -d ?l'n-t ) ;:::: S�- 1 .
and s; i s a submartingale. The Doob decomposition of the sequence of squares has the form s; = Mn + An where �n = X� - E(X; I ?l'n- J ), and Mn = E(X� I ?l'n- t ). The sequence {An } is called the quadratic variation of { Sn } . The following theorem reveals an intimate link between martingale convergence and the summability of the conditional variances; the latter property implies the former almost surely, and in particular, if I'7= 1 ECX7 1 ?l't- t ) < oo a.s. then Sn � S a.s.
Martingales
239
15. 1 1 Theorem Let {X1,�r}j be a m.d. sequence, and Sn = I.7= I X1. If
D
=
{ w: I7=IE(X7 1 �t-J)(w) < oo } E � .
C = { w: Sn(W) converges } E �.
=
0. Proof Fix M > 0, and define the stopping time 'tM(W) as the smallest value of n then P(D - C)
having the property
n
_L ecx7 1 �t-J)(w) � M. t=i
(15.39)
If there is no finite integer with this property then 'tM(W) = oo. If DM = { w: 'tM(w) = oo } , D = limM�J)M· The r.v. 1 (<M �nJ(w) is �n- 1 -measurable, since it is known at time n - 1 whether the inequality in (15.39) is true. Define the stopped process n (15.40) _L xt 1 !<M� t ) = SnA<M· t=i
SnA,M is a martingale by 15.6. The increments are orthogonal, and sup E(S�I\'M) n
=
sup e n
(f x; t l,M� nl) t=i
(15.41) where the final inequality holds for the expectation since it holds for each w E Q by definition of 'tM(ro). By Liapunov ' s inequality, sup E I SnA,M I � sup I! SnA<Mih < M112, n n
and hence SnA,M converges a.s., by 15.7. If w E DM, Sn1\,�ro) = Sn(w) for every E IN , and hence Sn(w) converges, except for w in a set of zero measure. That is, P(DM n C) = P(DM). The theorem follows on taking complements, and then letting n
M � oo.
•
15.12 Example To get an idea of what convergence entails, consider the case of {Xr } an i.i.d. sequence (compare 15.1). Then {X/a1} is a m.d. sequence for any
sequence {ar} of positive constants. Since E(X7 1 �t-I) = E(X7) = c:Jl, a constant which we assume finite, Sn = 'L7=IX/a1 is an a.s. convergent martingale whenever I7= dla7 < oo. For example, a1 = t would satisfy the requirement. o
In the almost sure case of Theorem 15.11 (when P(C) = P(D) = 1), the summa bility of the conditional variances transfers to that of the ordinary variances, a7 = E(X7). Also when E(sup1X7) < oo, the summability of the conditional variances is almost equivalent to the summability of the x7 themselves. These are conse quences of the following pair of useful results.
Theory of Stochastic Processes
240
{Z1} ifbeanda anyonlynon-negative stochastic sequence. �r- 1) < oo a.s. if 'L7:: : J E(Z 'L7=IE(Z1 ) < rl (i) (ii) If E(sup1Z1) < then P(D E) 0, where D {co:'L7=JE(Zrl �r- l )(co) < oo } , E = {co: L7=1Zr(CO) < oo } Proof (i) The first of the sums is the expected value of the second, so the 'only if' part is immediate. Since E(Z11 �t- d is undefined unless E(Z1) < we may assume 'L7=1E(Z1) < oo for each finite n. These partial sums form a monotone series which either converges to a finite limit or diverges to +oo. Suppose 'L7= I E(Zrl �t- d converges a.s, implying (by the Cauchy criterion) that 'L7�'::+ JE(Z11 �t- d -7 0 a.s. as m 1\ n -7 oo. Then 7�':: E(Z1) -7 0 by the monotone con vergence theorem, so that by the same criterion 'L7::: 1E(Z1) -7 'L7::: 1 E(Z1) < oo , as required. (ii) Define the m.d. sequence X1 = Z1-E(Z11 �1- J), and let Sn 'L7::: I X1. Clearly supnSn(CO) s 'L7= 1 Zr(co), and if the majorant side of this inequality is finite, Sn(CO) converges in almost every case, by 15.10. Given the definition of X1, this implies in turn that 'L7::: 1 E(Z1I�t- J)(co) < oo. In other words, P(E - D) = 0. Now apply the same argument to -X1 = E(Z11 �1- 1 ) -Z1 to show that the reverse implica tion holds almost surely, and P(D - E) = 0 also. 15.13 Theorem Let
oo
Ll
oo
=
=
.
oo,
1:.
+1
=
•
1 5 . 5 Martingale Inequalities
Of the many interesting results that can be proved for martingales, certain inequalities are essential tools of limit theory. Of particular importance are maximal inequalities, which place bounds on the extreme behaviour a sequence is capable of over a succession of steps. We prove two related results of this type. The first, a sophisticated cousin of the Markov inequality, was originally proved by Kolmogorov for the case where { X is an independent sequence rather than a m.d., and in this form is known as Kolmogorov' s inequality. 15.14 Theorem Let be a martingale. For any p ;?: 1 ,
r}
{Sn,�n}i ) EISn i P P { max ISkl > £ s E l� k�n Proof Define the events A1 {co: IS1(co)l > £}, and for k = 2, Ak = {co: 1max j5 < k ISj(co) l s £, I Sk(co)l > £} �k· The collection A 1 , .An is disjoint, and tJAk = { max !�k sn ISkl > £}. P
.
=
( 15.42) . . . ,n,
E
•••
k=l
( 15.43)
Martingales
241
{ I Sk i > £}, the Markov inequality (9.10) gives (15.44) P(A k) :::;; £-pE( I Sk i P 1 Ak). By 15.5, I Sn I p for p ;?: 1 is a submartingale, so I sk I p :::;; E( I Sn I p I �k) a.s., for 1 :::;; k :::;; n. Since A k E �k' it follows that (15.45) where the equality applies- (10. 18). Noting .LZ=t l Ak = 1 u�=tAk' we obtain from (15.43)-{15.45), as required, Since A k
P
c
t�::
n
i Sk l > £
) � =
� ( i )
P(A k) :::;; £-p E( I Sn i P 1 Ak)
£-pE I Sn i P
1 Ak :::;; £-PE( I Sni P).
• (15.46) = � k1 The second result converts the probability bound of 15.14 into a moment inequal ity. 15.15 Doob's inequality Let {Sn,�n }l be a martingale. For p > 1 , =
(15.47) Proof
Consider the penultimate member of (15.46) for the case p
)
(
=
1, that is,
(15.48) P max I Sk l > £ :::;; £- 1 E( I Sn l 1 tmaxts k s n i Skl> e } ), 1 :'> k :<> n and apply the following ingenious lemma involving integration by parts. 15.16 Lemma Let X and Y be non-negative r.v.s. If P(X > £) :::;; £- 1 E(Y1 1x > e J ) for all £ > 0, then E(XP) :::;; [p/(p - 1)]PE(YP), for p > 1 . Proof Letting Fx denote the marginal c.d.f. of X, and integrating by parts, using d(l - Fx) = -dFx and [xP( l - Fx(x))]o = 0, E(Xp) = =
J:�PdFx(�) J:�Pd(l Fx(�)) J:p;p- 1 (1 Fx(;))d; J:p;p- 1 P(X > ;)d;. =
-
-
-
=
(15.49)
Define the function 1 tx>s J(x) = 1 when x > ;, and 0 otherwise. Letting Fx, Y denote the j oint c.d.f. of X and Y and substituting the assumption of the lemma into (15.49), we have
I
E(XP ) :::;; p :;p - 2E(Yl { X >I;) )d;
Theory of Stochastic Processes
242
f (5
oo = p c-,P -2 0
=
)
y l {X>�)(x)dFx,r (x,y) df:,
(IR2)+
�� l ) E(fXP- 1 ).
(15.50)
Here (rR 2t denotes the non-negative orthant of [R 2 , or [O,oo) X [O,oo) . The second equality is permitted by Tonelli's theorem, noting that the function FxyS defines a a-finite product measure on (rR 3 t. By Holder's inequality, E( fXP- 1 ) :::; £ 1 1P( fP)£ 1 - 1 1P(_xP ) . Substituting into the m�orant side of ( 15.50) and simplifying gives the result. • To complete the proof of 15.15, apply the lemma to (15.48 ) to yield (15.47), putting X = max l 5 k 5 n 1 Sk l and Y = ! Sn l · • Because of the orthogonality of the differences, we have the interesting property of a martingale {Sn}i that E(S�)
=
E
(it=1 x7) ,
(15.51)
where, with S0 = 0, Xr = Sr - St- l · This lets us extend the last two inequalities for the case p = 2, to link P(m ax t 5 k :::; n ! Sn l > E) and E(max l $k$ nsi) directly with the variance of the increments. It would be most useful if this type of property extended to other values of p, in particular for p E (0,2). One approach to this problem is the von Bahr-Esseen inequality of § 1 1 .5. Obviously, 1 1.15 has a direct application to martingales. 15.17 Theorem If {X,�r}i is a m.d. sequence and Sn = I.�=l Xt, n (15.52) E ! Sn l p :::; 2_L E !Xt ! P t=l for 0 < p :::; 2. Proof This is by iterating 11.15 with Y = Xm '§ = �n J, and X = Sn - t. as in the argument leading to ( 1 1 .65); note that the latter holds for m.d. sequences just as for independent sequences. • Another route to this type of result is Burkholder's inequality (Burkholder 1 973). 15.18 Theorem Let { Sn,�n } i be a martingale with increments Xt = St - St- 1 , and
Martingales
243
So 0. For 0 < p :::;; 1 there exist positive constants cp and Cp , depending only on p, such that =
(15.53) On the majorant side, this extends by the cr inequality to n p n 2p ( 15.54) :::; ; C S EI n l pE _Lx7 :::;; cp.LE1Xr i 2P, 0 < p :::;; 1 , t=l t=l which differs from (15.52) only in the specified constant. 1 8 In fact, the Burkholder inequality holds for p > 1 also, although we shall not need to use this result and extending the proof is fairly laborious. Concavity of ( . f becomes convexity, so that the arguments have to be applied in reverse. Readers may like to attempt this as an exercise. The proof employs the following non-probabilistic lemma. 15.19 Lemma Let { Yr }1 be a sequence of non-negative �s with y1 > 0, and let Yn = I.7= 1Yt for t � 1 and Yo = 0. Then, for 0 < p $ 1 , n (15.55) y� :::;; yl) + p_L y�::: � Yt :::;; (1 + Bp)Y�, t=2 where Bp � 0 is a finite constant depending only on p. Proof For p = 1 this is trivial, with Bp = 0. Otherwise, expand Y� = ( Yn- 1 + Ynf in a Taylor series of first order to get (15.56) Ypn = Ypn- 1 + p( Yn- 1 + enYnJ\P- 1Yn , where en E [0,1]. Solving the difference equation in (15.56) yields n (15.57) y� = y l) + p L (Yr- 1 + etYtf -1Yt· t=2 Defining 1 1 ( 15.58) Kt - ypt - I - < Yr-1 + 0tYt!\[J - , we obtain the result by showing that n 0 $ p_L KtYt $ Bp Y�. (15.59) t=2 The left-hand inequality is immediate. For the right-hand one, note that l - xr :::;; (y - xr, for y > X > 0 and 0 < r :::;; 1 (see (9.63)). Hence, ( 15.58) implies that 1 -p 1 1 - (Yr-t(Yr-1 + 8rYr))p-1 O 1r -pYt1-p . ( 15.60) Kr $ O yt-1 yt-1 + tYt _
(
It follows that
)
Theory of Stochastic Processes
244
and hence
0 <- KrYt <- y2p-2 t - 1 Y2t -p ,
(15.6 1 )
n n rY p y�P L., K t S P L (yr! Yr-1)2( 1 -p)(y/Ynf t=2 t=2 n -- P L.. "" (y'!Y' t t-1 )2(1-p)y'tP t=2
(15.62) where y� = yrf Yn for t = 1, . . ,n is a collection of non-negative numbers summing to 1, r; = I.!= 1 y;, and Bp (n) denotes the supremum of the indicated sum over all such collections, given p and n. The terms y;IY ;- 1 = y11 Yr- 1 for t ;::: 2 are finite since Y 1 > 0. If at most a finite number of the Yt are positive the majorant side of (15.62) is certainly finite, so assume otherwise. Without loss of generality, by reinterpreting n if necessary as the number of nonzero terms, we can also assume Yt > 0 for every t. Then, y;!r;_ 1 = O(t-1 ) and y; = O(n - 1), and applying 2.27 yields the result Bp (n) o(1) for all p e (0, 1 ). Putting Bp = supnBp(n) < oo completes the proof. • Proof of 15.18 Put An I.7= t X7, and for £ > 0 and b > 0, set n (15.63) Yn = £ + S� + b(E +An) = (1 + b)(E +An) + 2L_Sr-1Xt, t=l so that in the notation of 15.19, Yr = Yr - Yr-1 = (1 + b)X7 + 2Sr- 1 Xr for t ;::: 2, with y 1 = ( 1 + b)(E + Xb > 0. Then by the left-hand inequality of ( 15.55), n n (15.64) Y� s ( l + bf(E + Xtf + (l + b)pL_ Y�:�x7 + 2pL_ Y�:}Sr- IXt . t=2 t=2 .
=
=
However, E is arbitrary in (15.64), and we may allow it to approach 0. Taking expectations through, using the law of iterated expectations, and the facts that E(X11 �t- l ) = 0 and that (. f- 1 is decreasing in its argument, we obtain 1 E(S� + BA n f s E (l + BfXIP + (1 + b)p± (S7-l + BAt-1f- X7 t=2
) ( S E ((l + bfXyP + (1 + b)bp - lp± A�::: � x7) . t=2
( 15.65)
But if we put now Yn = E + An , with Y1 = E + Xy and Yt = X7 for t ;::: 2, the right hand inequality of ( 15.55) yields (again, as the limiting case as E -t 0) n (15.66) XyP + pL_A�::: �x7 s (1 + Bp)E(A�). t=2
Martingales
245
and since (1 + 8f � (1 + 8)8 p- I' this combines with (15.65) to give
(1 + 8)8p - I (1 + Bp)E(A�) :?: E(S� + 8An)P (1 5.67) ;;::: 2p- l [E J Sn J 2P + 8PE(A�)], where the second inequality is by the concavity of the function (.f for p � 1 . Rearrangement yields (15.68) which is the right-hand inequality in (15.54), where Cp is given by choosing 8 to minimize the expression on the majorant side of (15.68). In a similar manner, combining the right-hand inequality of (15.55) with Yn = £ + with (15.65) and (15.67), and using concavity, yields
s;
t, s;��-ox;) :?: E ((l + 8)PXyP + (1 + 8)pt, (S7 - I + 8At- I f-1 X7) (
(1 + 8)(1 + Bp)EI Sn l 2p :?: (1 + 8)E xrp + p
:?:
E(S� + 8Ant
(15.69) which rearranges as (15.70) EI Sn l 2p :?: 8P [2 1 -p(l + 8)(1 + Bp) - 1 ] - 1 E(A�), which is the left hand inequality of (15.54), with Cp given by choosing 8 to maximize the expression on the majorant side . 11 For the case p = 1 , Bp = 0 identically in (15.55) and c 1 = C1 = 1 for any 8, reproducing the known orthogonality property. Our final result is a so-called exponential inequality. This gives a probability bound for martingale processes whose increments are a.s. bounded, which is accordingly related directly to the bounding constants, rather than to absolute moments. 15.20 Theorem If {Xt5'tli is a m.d. sequence with I Xtl � Bt a.s., where {Bt} is a sequence of positive constants, and Sn = 'L7=I Xt, (15.71) This is due, in a slightly different form, to Azuma (1967), although the cor responding result for independent sequences is Hoeffding' s inequality, (Hoeffding 1963). The chief interest of these results is the fact that the tail probabilities decline exponentially as £ increases. To fix ideas, consider the case B t = B for all t, so that the probability bound becomes P( I Sn l > £) � 2exp{ -£212nB 2 } . This ineaualitv is trivial when n is small. since of course P( I S, I > nB) = 0 bv con-
Theory of Stochastic Processes
246
struction. However, choosing E = O(n 1 12) allows us to estimate the tail probabili ties associated with the quantity n - J I2Sn. The fact that these are becoming exponential suggests an interesting connection with the central limit results to be studied in Chapter 24. Proof of 15.20 By convexity, every x E [-B1,B1] satisfies (Bt + x)ea.B r + (B1 -x)e -a.Br ax < e ------��( 1 5.72) 28 t -
__ __ __ _
for any a > 0. Hence by the m.d. property, ( 15.73) E( eaX1 I :Y't- J ) :::; �(eaB1 + e- aB 1) :::; exp{ ia2B7 } a.s., where the second inequality can be verified using the series expansion of the exponential function. Now employ a neat recursion of 10.10: ( 1 5.74) E(eaSn i :J'n - d = E(eaSn- t+aXn i :J'n - l)
Generalizing this idea yields E(eaSn) = E(E( ... E(E(eaSn l :Y'n - 1 ) I :Y'n - 2) . . . 1 :1' 1 )) :::; exp {ia2B� } E(E( ... E(eaSn - t l :Y'n -2} · · 1 :¥ 1 ))
(1 5.75)
:::; . ..
Combining ( 1 5 .75) with the generalized Markov inequality 9.11 gives (15.76) P(Sn > E) :::; exp { -ac + !a.li,7= l B7 } for E > 0, which for the choice a = EI('L7= JB7) becomes (15. 77) P(Sn > c) :::; 2exp { -E212(I,7= 1 B7) } . The result follows on repeating the argument of ( 1 5.75)--(15.76) in respect of -Sn and summing the two inequalities. • A practical application of this sort of result is to team it with a truncation or uniform integrability argument, under which the probabilities of the bound B being exceeded can also be suitably controlled.
16
Mixingales
1 6. 1 Definition and Examples
Martingale differences are sequences of a rather special kind. One-step-ahead unpredictability is not a feature we can always expect to encounter in observed time series. In this chapter we generalize to a concept of asymptotic unpredict ability. 16.1 Definition On a probability space (Q,'!f ,P), the sequ� of pairs { 1, r} :"" , where { 1 } is an increasing sequence of cr-subfields of and tile X1 are integrable r.v.s, is called an Lp-mixingale if, for p � 1, there exist sequences of non negative constants { cr}':"" and { �}0 such that � 0 as m � =, and
'!F
�m I E(Xrl '!Fr-m) I P :::; Cr�m I Xr-E(Xr i '!Ft+m) I P :::; Cr�m+l
X '!F
'!F
(16. 1) (16.2)
hold for all t, and m � 0. o A martingale difference is a mixingale having = 0 for all m > 0. Indeed, 'mixing ale differences' might appear the more logical terminology, but for the fact that the counterpart of the martingale (i.e. the cumulation of a mixingale sequence) does not play any direct role in this theory. The present terminology, due to Donald McLeish who invented the concept, is standard. Many of the results of this chapter are basically due to McLeish, although his theorems are for the case p = 2. Unlike martingales, mixingales form a very general class of stochastic processes; many of the processes for which limit theorems are known to hold can be characterized as mixingales, although supplementary conditions are generally needed. Note that mixingales are not adapted sequences, in general. is not assumed to be ?F1-measurable, although if it is, (16.2) holds trivially for every m � 0. The mixingale property captures the idea that the sequence { s } contains progressively more information about as s increases; in the remote past nothing is known according to (16.1 , whereas in the remote future everything will eventu ally be known according to (16.2). The constants c1 are scaling factors to make the choice of scale-independent, will often fulfil this role. As for mixing processes (see and multiples of say that the sequence is of size -
�m
)
I Xrllp
'!F
Xr
Sm
Xr
248
Theory of Stochastic Processes
16.2 Example Consider a linear process (16.3) }= -oo where { Us}�: is a Lp-bounded martingale difference sequence, with p ::2: 1 . Also let '3'1 = a(Us, s � t). Then E(Xr l 'ff1-m)
=
Xr - E(Xr l 'ff t+m)
=
'L, S}Ur-j' a.s. }=m
( 16.4)
(16.5) L s_j ut+}' a.s. }=m+ l Assuming { Us } :'oo to be uniformly Lp-bounded, the Minkowski inequality shows that (16. 1) and ( 16.2) are satisfied with c1 = supsll Usll P for every t, and Sm = Ij=m( I SJ I + I S -} I ). { Xr.'ff 1} is therefore a Lp-mixingale if 2..}=m( I SJ I + I S-} I ) -7 0 as m -7 oo, and hence if the coefficients { S1 } :'oo are absolutely summable. The 'one sided' process in which SJ = 0 for j < 0 arises more commonly in the econometric modelling context. In this case X1 is 'ff ;-measurable and X1 - E(X1 I 'ffr+m) = 0 a.s., but we may set c 1 = sups$r11 Usllp which may increase with t, and does not have to be bounded in the limit to satisfy the definition. To prove X1 integrable, given integrability of the Us, requires the absolute summability of the coefficients, and in this sense, integrability is effectively sufficient for a linear process to be an L1-mixingale o
We could say that mixingales are to mixing processes as martingale differences are to independent processes; in each case, a restriction on arbitrary dependence is replaced by a restriction on a simple type of dependence, predictability of the level of the process. Just as martingale differences need not be independent, so mixingales need not be mixing. However, application of 14.2 shows that a mixing zero-mean process is an adapted Lp-mixingale for some p 2 1 with respect to the subfields '3'1 = cr(X1,Xt- 1, ... ), provided it is bounded in the relevant norm. To be precise, the mean deviations ofany L,-bounded sequence whichis a-mixing of size -
Mixingales
249
The next examples show the type of case arising in the sequel. 16.3 Example An Lr-bounded, zero-mean adapted sequence is an L2-mixingale of size -� if either r > 2 and the sequence is a-mixing of size - r/(r - 2), or r ?: 2 and it is -mixing of size -r/2(r - 1). o 16.4 Example Consider for any j ?: 0 the adapted zero-mean sequence { XtXt+i - at,t+J• � t+i } ,
where at, t+J = E(XtXt+j) , and { Xt l is defined as in 16.3. By 14.1 this is a-mixing (<j>-mixing) of the same size as Xt for finite j, and is Lr12-bounded, since IIXtXt+j ll r12
S
IIXtllriiXt+jllr-
by the Cauchy-Schwartz inequality. Assuming r > 2 and applying 14.2, this is an L1-mixingale of size - 1 in the a-mixing case. To get this result under -mixing also requires a size of -r/(r - 2), by 14.4, but such a sequence is also a-mixing of size -r/(r- 2) so there is no separate result for the <j>-m� case. o ' Mixingales generalize naturally from sequences to arrays. 16.5 Definition The integrable array { { Xnt• �nt l7=- oo}�= l is an Lp-mixingale if, for p ?: 1, there exists an array of non-negative constants { end :oo, and a non negative sequence { �m } 0 such that �m ---7 0 as m ---7 oo , and IIE(Xnt l �n.t-m) llp
S
Cnt�m
II Xnt - E(Xnt I �n,t+m) II p S Cnt�m+ 1
(16.6) (16.7)
hold for all t, n, and m ?: 0. o The other details of the definition are as in 16.1. All the relevant results for mixingales can be proved for either the sequence or the array case, and the proofs generally differ by no more than the inclusion or exclusion of the extra sub script. Unless the changes are more fundamental than this, we generally discuss the sequence case, and leave the details of the array case to the reader. One word of caution. This is a low-level property adapted to the easy proof of convergence theorems, but it is not a useful construct at the level of time-series modelling. Although examples such as 16.4 can be exhibited, the mixingale prop erty is not generally preserved under transformations, in the manner of 14.1 for example. Mixingales have too little structure to permit results of that sort. The mixingale concept is mainly useful in conjunction with either mixing assumptions, or approximation results of the kind to be studied in Chapter 17. There we will find that the mixingale property holds for processes for which quite general results on transformations are available. 1 6.2 Telescoping Sum Representations
Mixingale theory is useful mainly because of an ingenious approximation method. A sum of mixingales is 'nearly' a martingale process, involving a remainder which
Theory of Stochastic Processes
250
can be neglected asymptotically under various assumptions limiting the dependence. For the sake of brevity, let EsXr stand for E(Xrl �s). Then note the simple identity, for any integrable random variable X1 and any m � 1, m (16.8) Xr = L (Et+kXr - Er+k - lXr) + Er-m - l Xt + (Xr - Er+mXt) . k=-m
Verify that each term on the right-hand side of (16.8) appears twice with opposite signs, except for X1 . For any k, the sequence { Er+kXt - Et+k- l Xr, lfft+k } 7= 1 is a martingale difference, since Et+k-l (Er+kXr - Er+k- I Xr) = 0 by the LIE. When { X1,lff1 } is a mixingale, the remainder terms can be made negligible by taking m large enough. Observe that { Er+mXt, lfft+m l:=-oo is a martingale, and since supmE I Er+mXrl ::; E I Xr l < by 10.14, it converges a.s. both as m � oo and as m � -oo, by 15.7 and 15.9, respectively. In view of the fact that II Er-mXr ll p � 0 and II Xr - Er+mXr ll p � 0, the respective a.s. limits must be 0 and X1, and hence we are able to assert that oo
Xr = L (Et+kXr - Er+k- 1 Xr), a.s. Letting Sn
where
k=-oo
=
I,�=1X1, we similarly have the decomposition m n n = r Sn L Ynk + L E - m- I Xt + L (Xr - Et+mXr) k= -m t=l t=l
n Ynk = L (Er+kXr - Er+k- 1Xr), t= l
(16.9)
(16. 10)
(16. 1 1)
and the processes { Ynk> lffn+d are martingales for each k. By taking m large enough, for fixed n, the remainders can again be made as small as desired. The advantage of this approach is that martingale properties can be exploited in studying the convergence characteristics of sequences of the type Sn. Results of this type are elaborated in § 16.3 and § 16.4. If the sequence {Xr } is stationary, the constants {c1} can be set to 1 with no loss of generality. In this case, a modified form of telescoping sum actually yields a representation of a partial sum of mixingales as a single martingale process, plus a remainder whose behaviour can be suitably controlled by limiting the dependence. 16.6 Theorem (after Hall and Heyde 1980: th. 5.4) Let { Xr,lffr } be a stationary L 1 -mixingale of size - 1 . There exists the decomposition (16. 12) Xr = Wr + Zr - Zr+ l • where E I Zr l < and { W1,lff 1 } is a stationary m.d. sequence. o oo
Mixingales There is the immediate corollary that Sn = Yn + Z1 - Zn+ 1 where { Ym�n } is a martingale. Proof Start with the identity
25 1 (16. 13)
(16. 14) where, for m � 1, m (16. 15) Wmt = L (EtXt+s - Et-1Xt+s) + EtXt+m+1 + Xt-m-1 - Et- 1 Xt-m- 1 s=-m m (16. 16) Zmt = L (Et-1Xt+s - Xt-s- 1 + Et- 1 Xt-s- 1 ) . s=O J As in (1 6.8), every term appears twice with different sign in (16. yi), except for X1• Consider the limiting cases of these random variables as m ----7 oo, to be designated W1 and Z1 respectively. By stationarity, E I Et-1 Xt+s1 = E i Et-s - 1 Xt l and E I Xt-s- 1 - Et-1Xt-s- 1 1 = E I Xt - Et+sXt l ; hence, applying the triangle inequality, 00
E I Zt l :::; L E I Et-s - 1Xt l + L E I Xt - Et+sXt l s=O
:::; 2L ss < s=O
s=O
oo .
Writing W1 = X1 - Z1 + Zt+ 1 • note that
(16. 17)
(16. 1 8) and it remains to show that W1 is a m.d. sequence. Applying 10.26(i) to (16. 15), (16. 19) and stationarity and (16.1) imply that (16.20) E I Et- 1Xt+m+t l = EI Er-m-2Xt l ----7 0 as m ----7 oo, so that El £1_ 1 Wm1 l ----7 0 also. Anticipating a result from the theory of stochastic convergence (18.6), this means that every subsequence {mb k E IN } contains a further subsequence { mk(i) • j E IN } such that I E1- 1 Wmk(i)· t I ----7 0 a.s. as j ----7 oo Since Wmk(i)J ----7 W1 for every such subsequence, it is possible to conclude that E(W11 �r- 1 ) = 0 a.s. This completes the proof. • .
252
Theory of Stochastic Processes
The technical argument in the final paragraph of this proof can be better appre ciated after studying Chapter 18. It is neither possible nor necessary in this approach to assert that E(Wmt l �t- I ) ---7 0 a.s. Note how taking conditional expectations of (16. 12) yields E(Xt l �t - I ) = Zt - Zt+I a.s. (16.21 ) It follows that Wt is almost surely equal to the centred r.v. Xt - E(Xt l �t- J). 16.7 Example Consider the linear process from 16.2, with { Ut } a stationary inte grable sequence. Then Xt is stationary, and 00
E l Ud L I Sj l < OO, j=-<X) If the coefficients satisfy a stronger summability condition, i.e. EI Xd
00
�
00
00
00
(16.22) L _L ( ! Sjl + 1 8-jl ) = ,L m l 8m l + ,L m l 8-ml < 00, m= l m=l m=l j=m then Xt is an L1-mixingale of size - 1 . By a rearrangement of terms we obtain the decomposition of (16. 12) with (16.23) and
(16.24) where E ! Zt l <
oo
by (16.22). o
1 6. 3 Maximal Inequalities
As with martingales, maximal inequalities are central to applications of the mixingale concept in limit theory. The basic idea of these results is to extend Doob's inequality (15.15) by exploiting the representation as a telescoping sum of martingale differences. MacLeish's idea is to let m go to oo in (16. 10), and accordingly write
Sn
=
L Ynb a.s.
k=- oo
(16.25)
Suppose { Snll has the representation in (16 . 25) . Let { ak}':oo be a summable collection of non-negative real numbers, with ak = 0 if Ynk = 0 a.s., and ak > 0 otherwise. For any p > 1 , I P ( 1 6.26) E m�x ! Sj ! P � {_� 1 i ak p- L ar1 E I Ynk iP. \P k=-oo l:'>j :'> n ak>O 16.8 Lemma
(
)
)( )
Mixingales
253
Proof For a real sequence { xd : and positive real sequence {ad :DO, let K = Lk= -=ak and note that DO
(16.27)
where the weights ak/K sum to unity, and the inequality follows by the convexity of the power transformation (Jensen's inequality). Clearly, (16.27) remains true if the terms corresponding to zero xk are omitted from the sums, and for these cases set ak = 0 without loss of generality. Put xk = Ynk> take the max over 1 � j � n , and then take expectations, to give
/
(16.28)
To get (16.26), apply Doob's inequality on the right-hand side. • This lemma yields the key step in the proof of the next theorem, a maximal inequ ality for L2-mixingales. This may not appear a very appealing result at first sight, but of course the interesting applications arise by judicious choice of the sequence { ak} . 16.9 Theorem (Macleish 1975a: th. 1 .6) Let {X1,'!f1} ':= be an L2-mixingale, let Sn = I,�= 1X1, and let {ad 0 be any summable sequence of positive reals. Then E
(::: ) (� ) ( s]
<;
8
a, (1;5 + i;llaO 1 + 2
� i;[(ak 1 - ai� 1 )) (� cl) .
(16.29)
get a doubly infinite sequence {ad : put a_k = ak for k > 0. Then, applying 16.8 for the case p = 2,
Proof To
E
(�':: ) t ) {.t_ sJ
<;
{
a•
DO,
)
ai1E(Y�,) .
(1 6.30)
Since the terms making up Ynk are martingale differences and pairwise uncorre lated, we have n (16.3 1 ) E(Y�k) = L_E(Er+kXr - Et+k-tXrf t= l
Now, E(Et+kXrEt+k- tXr) = E(Et+k- t (Er+kXrEt+k-t Xr) ) = E(E7+k- tXr) by the LIE, from which it follows that (16.32) E(Er+kXt - Et+k - tXr)2 = E(E7+kXt - E7+k- tXr) . Also let Z1k = X1 - Et+kX1, and it is similarly easy to verify that E(Et+kXr - Et+k-l Xt)2 = E(Zt,k- l - Ztk)2 = E(Z7,k- l - z7k).
(16.33)
Theory of Stochastic Processes
254
Now apply Abel's partial summation formula (2.25), to get
oo
n
oo
L ai1E(Y�k) = L L ai 1 E(Er+kXr - Et+k- 1 Xr)2
k= -oo
t::
I k=- oo
+ a! 1 E(Z}o) +
ik=l E(Z}k)(a:kll - a:k 1 ))
(16.34)
where the second equality follows by substituting (16.32) for the cases k � 0, and (16.33) for the cases k > 0. (16.29) now follows, noting from (16. 1) that E(E}-kXr) � c}�� and from (16 . 2) that E(Z}k) � c}��+ l · • Putting (16.35)
oo
this result poses the question, whether there exists a summable sequence { ad o such that K < There is no loss of generality in letting the sequence { Sk } o be monotone. If Sm = 0 for m < then Sm+j = 0 for all j > 0, and in this case one may choose ak = 1 , k = O, .. ,m + 1 , and K reduces to (m + l)(s5 + sb. Alter natively, consider the case where Sk > 0 for every k. If we put a0 = �0, and then define the recursion .
.
oo ,
(16.36) ak
is real and positive if this is true of ak- l and the relation -I -I ak - ak1
-
�o;,r k 2ak is satisfied for each k. Since a0 1 C s6 + t;,b � 2a0, we have
K=s
=
� I 6 (£ ak) 2• (i ak) (ao\s5 + sh + 2�ak) k==O k-I k=O
(16.37)
(16.38)
In this case, for k > 0 we find si 2
so that
=
(ai 1 - a:k�J)ai 1
� a :k 2 - a:k : I
(16.39)
m
"
Mixingales
y--2 k L ':! k=O
255
m
" (a k-2 - a -2 $ y-':10-2 + L k-1) k=l
K {m=Ooo ( m
Substituting into (16.38), we get
)
=
}
am
-2 .
- l/2 2 . $ 16 L L s "k 2 k=O
(16.40)
( 1 6.41)
This result links the maximal inequality directly with the issue of the summa bility of the mixingale coefficients. In particular, we have the following corollary. 16.10 Corollary Let
where
K
{ Xt,:1't} be an L2-mixingale of size -!. Then
(
E max s;
< oo.
l �j � n
) ::;; Ki d, t= l
(16.42)
O(k - 1 12-6) for o > 0, as the theorem imposes, then Lk= I S"k2 = O(m2+26) by 2.27, and (I'k= 1 s"i2t 1 12 = O(m- 1 -6) and hence is summable over m. The theorem follows by (16.41). • Proof If
Sk
=
However, it should be noted that the condition -1 2 s "k2 1 < 00
� (�
)
(16.43)
is weaker than Sk = O(k- 1 12-6). Consider the case Sk = (k + 2r 1 12 (log k + 2) - l -E for E > 0, so that k 1 12+6Sk ----7 oo for every 0 > 0. Then
m
m
L s"k 2 = L (k + 2)(log k + 2)2+2E s (m + 2) 2(log m + 2)2+2E, k=O
k=O
(16.44)
and (16.43) follows by 2.31. One may therefore prefer to define the notion of 'size = -!' in terms of the summability condition (16.43), rather than by orders of magnitude in m. However, in a practical context assigning an order of magnitude to Sm is a convenient way to bound the dependence, and we shall find in the sequel that these summability arguments are greatly simplified when the order-of magnitude calculus can be routinely applied. Theorem 16.9 has no obvious generalization from the �-mixingale case to general Lp for p > 1 , as in 15.15, because (16.31) hinges on the uncorrelatedness of the terms. But because second moments may not exist in the cases under consideration, a comparable result for 1 < p < 2 would be valuable. This is attainable by a slightly different approach, although at the cost of raising the mixingale size from -! to - 1 ; in other words, the mixingale numbers will need to be summable. 16.11 Theorem Let {Xr,:1't } ':oo be an Lp-mixingale, 1 < p < 2, of size - 1 , and let Sn
=
I7= tXt ; then
256
Theory of Stochastic Processes
(16.45) Cp is a positive constant. Proof Let Ynk be defined as in (16.11), and apply Burkholder' s inequality (15.18) and then Loeve ' s inequality with r p/2 C1, 1 ) to obtain E I Y,,j P ,;; CpE I t (E..,X, - E,.,_,X,)2 1 ,n where
Cr
=
�
E
n
C�LEI t=:l (Et+kXt-Et+k- IXt! P.
(16.46)
Now we have the mixingale inequalities,
(16.47)
I! Er+kXr-Er+k- I Xtl lp � I Er+kXtllp + I Et+k- I Xtl lp � 2cl,k
0 and I !Er+kXr-Er+k-IXrllp I Zr,k-I - Ztkl ip � I Zt,k- I I p + I ! Ztkl lp � 2 cr�k (16.48) for k > 0, where Ztk is defined above (16. 3 3). Hence, (16.49) £ 1 Ynkl p � 2P Cp�{ L � t=:l (put �o 1), and substitution in (16. 26), with {ado a positive sequence and -ak ak, gives P p-l (16.50) £ { m� I Sj l p) � 2p+l cP I_� 1 ) (iak) (ial-psr)±c�. \jJ Both ak and al-psf can be summable for p > 1 only in the case Sk O(ak), and the conclusion follows. for k <
=
n
=
C
=
k=O
l:'>J � n
k=O
•
t=:l
=
A case of special importance is the linear process of 16.2. Here we can special
ize 16.1 1 as follows: 16.12 Corollary For
1 < p � 2,
X, I}=-ooejut-j• then P E ( max I Sj i P) � Cp (_� ) ( 1 So l + .f c 1 ek 1 + 18-ki)) Pn sup £1 Us ! P ; s \jJ 1 l�j:'>n k=I
(i) if
(ii) if x,
=
=
Ij=Oejut -j• then
E (�� I Sj i ) ,;; Cp �� �n� 1 e, 1 ) t :�� E I U,l''
'
ales 257 Proof In this case, Er- kXr-Er-k- I Xr = 8k Ur-k· Letting {ad 0 be any non-negative constant sequence and a_k = ak> n p p (16.5 1) _L ak- EI Ynki P :::;; Cp_L al l 8ki P_L0 t=I , where c1 = sups II Us l i p i n case (i), and c1 = sups rl l Us l i P in case (ii). Choosing ak = l 8k I and substituting in (16. 26) yields the results. Recall that the mixing ale coefficients in this case are Sm = Ij=m( l 8j I 18 -j I ), so linearity yields a dramatic relaxation of the conditions for the inequalities to be satisfied. Absolute summability of the 8j is sufficient. This corresponds simply to Sm � 0. A mixingale size of zero suffices. Moreover, there is no separate result for £2-bounded linear processes. Putting p = 2 yields a result that is correspondingly superior in terms of mixingale size restrictions to 16. 1 1. Mixing
al(i{)
al(i{)
::;
•
+
1 6.4 Uniform Square-integrability
One of the most important of McLeish's mixingale theorems is a further conse quence of 16.9. It is not a maximal inequality, but belongs to the same family of results and has a related application. The question at issue is the uniform inte grability of the sequence of squared partial sums.
1975b: lemma 6.5; 1977: lemma 3.5) Let {X1,�1} I7 X1 , and v� = I7= Ic7 where c1 is defined in =I { tc }7 I is uniformly integrable, then so is the (16.1)-{16.2) . '::'} =1 · X7 7 = Proof A preliminary step is to decompose X1 into three components. Choose posi tive numbers B and m (to be specified below), let 1 � = 11 and then define (16.52) U1 = X1 - Et+mXr + E1-mX1 (16.53) Yr = Et+mXr 1�-Er-mXr1� (16. 54) Zr = Et+mXrO - 1 �) -Er-mXrO - 1 �), such that X1 = U1 Y1 Z1 • This decomposition allows us to exploit the following collection of properties. (To verify these, use various results from Chapter 10 on 16.13 Theorem (from MacLeish be an �-mixingale of size -�, Sn = If the sequence sequence { maxi::;j ::;nSJIV�
IXr i :5:Bcr J •
+
+
conditional expectations, and consider the cases k � m and k < m separately.) First,
(16. 55) (16.56) for k �
0, where k v m = max { k,m} . Second, EE7-kYt E(E7-(kAm�t l �-E7-mXr 1 �), =
(16. 57)
258
Theory of Stochastic Processes
(16.58)
m m k,m}. E(Xh �) Bc1• EEt-kZr = E(E;-(kNn�r( l - 1�) - E;_mXr( l - 1 �)),
The terms are both zero ifk � and are otherwise bounded where k 1\ = min { by Third, �
(16.59) (16.60)
m E(X7(1 - 1 �)) otherwise. Note E(X7(1 - 1�))!c7 E((Xrfc1)2 1t1Xr'cri>B)) -) 0 B -) oo uniformly in t, by
where the terms are zero for k � and bounded by that = as the assumption of uniform integrability. The inequality
(16.61) 1
X1
Z1,
for � j � n follows from substituting = U1 + Y1 + multiplying out, and applying the Cauchy-Schwartz inequality. For brevity, write
(16.61)
xj = sJtv�, uj = (L/r= I uYtv�, lvn, Yj = lvn. Zj = is equivalent to Xj � 3(uj + yj + Zj), for each j =
(L'r..=I Yr)2 2 (L'r=IZr)2 2
1, .
Then .. ,n. Also let = max 1 :5j:5nXj, and define Um Yn · and Zn similarly; then clearly, Xn � Un + Yn + Z n . For any r. v. and constant M > introduce the notation £:5 = £( so that the object of the proof is to show that supn£:5MCxn) -) as M -) As a consequence of and 9.29,
Xn
"
3(
"
"
"
(16.62) M(X) 1 {X>MIX), 0
)
X� 0 0, (16.62) EEM(Xn) � 3f5Mt3(un + Yn + Zn)
oo.
(16.63)
0, (16.63) ( 16.55) (16.56), Sm -l/2-o), ak m-l -o m. m, ak o l.lr= r U1 E(un) � 8 ((m + 1)m- I -o + i k - I o) (s�m l +o + 2 i. siko) k=m+ 1 k==m+l 0
We now show that for any E > each of the expectations on the right-hand side of can be bounded by E by choosing M large enough. First consider given and we can apply 16.9 to this and assuming = O(m I case, setting = for k > Applying for k � and = k with substituted for Sj in that expression produces -
-
E(un); (16.29)
-
= O(m - ),
(16.64)
259
Mixingales
m
where the order of magnitude in follows from 2.27(iii). Evidently we can choose < £. Henceforth, let be fixed at this value. large enough that A similar argument is applied to but in view of and we and = otherwise. Write, formally, may choose = k = O, s; and where s;
m E(un) E(zn), 2c ty':l2k E(Zrak-Er+1,kZr)2 ...c2t,my':l2k,• ak 0
m
(16.59) (16.670) EE -kzt k < m, (16.65) k � m,
(16. 29) leads to max E((Xrfcr)2 1 I X/ r i> Bl) . (16.66) E(zn) s; 16(m + 1) I�t� c n This term goes to zero as B � oo, so let B be fixed at a value large enough that < £. E(zn) For the remaining term, notice that Y1 Lk= - m+ I �tk where (16.67) For each k, { �tk • �t+k } is a m.d. sequence. If 16.8 is applied for the case 4 and ak 1 for I k I s; m, 0 otherwise, we obtain (not forgetting that for Yj > 0, (maxm)2 maxj{ YJ }) j 1 4) 1 (4) 4 m E(�k), (16.68) 1 ( 1) (2m+ max LYt EQ�) V4En I�j� 3 L s; 4 n I t=I k=-m Vn where Ynk L� =I �tk · Now, given Ynk Yn - I ,k + �nk. we have the recursion E(Y�k) E(Y�- i,k)+4E(Y�- l,k�nk)+6E(Y�-l,k��k) + 4E(Yn- I,��k) + E(��k). (16.69) The �tk are bounded absolutely by 2Bc1; hence consider the terms on the right-hand side of (16. 69). The second one vanishes, by the m.d. property. For the third one, we have (16.70) E(Yn2- I,k-�':l2nk) E(Yn2- ! ,k)(2Bcn)2 s; (2B) Yn2- J Cm2 and for the fourth one, note that by the Cauchy-Schwartz inequality, (16.7 1) El Yn- I,��kl s; (2B)4Yn- J C�. Making these substitutions into (16. 69) and solving the implied inequality and then application of
1
=
p =
=
=
=
3
=
=
=
s;
4
recursively yields
(16.72)
Theory of Stochastic Processes
260
Plugging this bound into (16.68), and applying the inequality a 0 (X) s E(X2 ) for X· � 0 and a > 0, yields finally a
()
r�n ) <- J_E-\Y r�2n) <- ± 4 6(2m + 1 )4 1 1 (2B)4 �M/6\Y M 3 M C?
( 16.73)
·
By choice of M, this quantity can be made smaller than £. Thus, according to ( 1 6.63) we have shown that 0MCxn) < 1 8£ for large enough M, or, equivalently,
(16.74) By assumption, the foregoing argument applies uniformly in n, so the proof is complete. • The array version of this result, which is effectively identical, is quoted for the record. 16.14 Corollary Let {Xnr.?l'nr} be an L2-mixingale array of size -�, and let Sn = 'L7=t C�1, where Cnr is given by (16.6)-( 16.7); if {X�/c�1 } is uniformly integrable, { maxl: o:;j � nS]!v; }�=! is uniformly integrable.
'L7=tXnr and v�
=
Proof As for 16.13, after inserting the subscript n as required.
•
17 Near-Epoch Dependence
1 7 . 1 Definitions and Examples
As noted in § 14.3, the mixing concept has a serious drawback from the viewpoint of applications in time-series modelling, in that a function of a mixing sequence (even an independent sequence) that depends on an infinite number of lags and/or leads of the sequence is not generally mixing. Let (17.1) where V1 is a vector of mixing processes. The idea to be developed in this chapter is that although X1 may not be mixing, if it depends almost entirely on the 'near epoch ' of { Vt} it will often have properties permitting lhe application of limit theorems, of which the mixingale property is the most important. This idea goes back to lbragimov (1962), and had been formalized in different ways by Billingsley (1968), McLeish (1975a), Bierens (1983), Gallant and White ( 1 988), Andrews (1988), and Potscher and Prucha (1991a), among others. The following definitions encompass and extend most existing ones. Consider first a definition for sequences.
{ Vr }�:. possibly vector-valued, on a ;:: = cr(V probability space (Q,':J,P), let :!F�� r-m • ··· .Vt+m), such that { :!f��;:: } ;;;=O is an increasing sequence of a-fields. If, for p > 0, a sequence of integrable r.v.s { X1 } � : satisfies 17.1 Definition For a stochastic sequence
(1 7.2) · where Vm ----7 0, and {d1 }�: is a sequence of positive constants, X1 will be said to be near-epoch dependent in Lp-norm (Lp-NED) on { V1 }�:. o Many results in this literature are proved for the case p = 2 (Gallant and White, 1988, for example) and the term near-epoch dependence, without qualification, may be used in this case. As for mixingales, there is an extension to the array case.
{ { Vnrl7:'- oo }';= l • possibly vector-valued, on a probability space (Q,:!f,P), let :!f��"!-m = a(Vn ,t- m•· ·· · Vn,t+m). If an integrable array { {Xnrl7:'- oo}';= l • satisfies 17.2 Definition For a stochastic array
(17.3) where Vm -7 0, and {dnr} is an array of positive constants, it is said to be Lp-NED on { Vn1 } . o
262
Theory of Stochastic Processes
We discuss the sequence case below with the extensions to the array case being easily supplied when needed. The size terminology which has been defined for mixing processes and mixingales is also applicable here. We will say that the sequence or array is Lp-NED of size -
I Xr -E(Xrl �::z:) liP :s; I Xr - J.Lr l p + I E(Xr - J.Lr l ��:Z:) I p 2 I Xr - J.Lr l p, (17.4) where J.Lr E(X1). The role of the sequence {d1} in (17. 2 ) is usually to account for the possibility of trending moments, and when I Xr -J.Lr l p is uniformly bounded, we should expect to set d1 equal to a finite constant for all t. However, a drawback with the definition is that { d1} can always be chosen in such a way that . t { I Xr -E(XdrtI �::z:) liP} 0, mf for every m, so that the near-epoch dependence property can break down in the limit without violating (17. 2 ). Indeed, (17. 2) might not hold except with such a choice of constants. In many applications this would represent an undesirable weakening of the condition, which can be avoided by imposing the requirement d1 2can case, dnt 2 I Xnr -J.Lnr l p · Under this restriction we I X1set-J.Lr liP, :s;or1 forwiththenoarray loss of generality. :s;
=
=
:s;
:s;
vm
Near-epoch dependence is not an alternative to a mixing assumption; it is a property of the mapping from to not of the random variables themselves. The concept acquires importance when is a mixing process, because then inherits certain useful characteristics. Note that �::z:) is a finite-lag, �::Z:J;B-measurable function of a mixing process and hence is also mixing, by 14.1. Near-epoch dependence implies that is 'approximately' mixing in the sense of being well approximated by a mixing process. And as we show below, a near-epoch dependent function of a mixing process, subject to suitable restrictions on the moments, can be a mixingale, so that the various inequalities of § can be exploited in this case. From the point of view of applications, near-epoch dependence captures nicely the characteristics of a stable dynamic econometric model in which a dependent variable depends mainly on the recent histories of a collection of explanatory variables or shock processes which might be assumed to be mixing. The symmetric dependence on past and future embodied in the definition of a Lp-NED function has no obvious relevance to this case, but it is at worst a harmless generalization. In fact, such cases do arise in various practical contexts, such as the application of two-sided seasonal adjustment procedures, or similar smoothing filters; since most published seasonally adjusted time series are the output of a two-sided filter, none of these variables is strictly measurable with out reference to future events.
{ V1} { X1},{ V1}
{ X1}
E(X11
{X1}
16.2
X1
17.3 Example Let
V1,
{ V1} :: be a zero-mean, L0-bounded scalar sequence. and define
Near Epoch Dependence
263 (17.5)
}=-oo
Then, by the Minkowski inequality,
I Xt-ECXtl ���;;:) liP = I J.=m£+! (8j( Vt-r E;�;;:vt-1) + 8-j( Vt+J- E;�;;:vt+j)) I P (17.6)
), 8 8 1 I(I jl jl Vm I,J=m + { 81}
and dt = 2supsi1 Vs l lp, all t. Clearly, Vm --7 0 if + where = the sequence is absolutely summable, and vm is of size -
e1
1 81 18-J 1
The second example, suggested by Gallant and White (1 988), illustrates how near epoch dependence generalizes to a wide class of lag functions subject to a dynamic stability condition, analogous to the summability condition in the linear example.
{ V1}
17.4 Example Let be a Lp-bounded stochastic sequence for p � 2 and let a sequence be generated by the nonlinear difference equation
{ X1}
where
{f1(.,. ) }
Xt
=
frC Vt, Xt-d , is a sequence of differentiable functions satisfying
(1 7.7)
dft(v,x) I � b < 1 . (1 7.8) �; I dx As a function of x, f1 is called a contraction mapping. Abstracting from the stochastic aspect of the problem, write v1 as the dummy first argument of f1• By repeated substitution, we have s
(1 7.9) and, by the chain rule for differentiation of composite functions,
dgt !' I dVt-j
-- � br I .
I ddVftt--j l . j
( 1 7. 1 0)
' --
Define a ��-m/B-measurable approximation to g1 by replacing the arguments with lag exceeding m by zeros: ( 1 7. 1 1) By a Taylor expansion about 0 with respect to
v1_1 for j > m, ( 1 7 . 1 2)
264
Theory of Stochastic Processes
where * denotes evaluation of the derivatives at points in the intervals [0, Vr-jl Now define the stochastic sequence {X1} by evaluating g1 at (V1Y1- 1 , ). Note that • . .
(17. 1 3) by 10.12. The Minkowski inequality, ( 1 7 . 1 2), and then (17. 10) further imply that
� �
00
L II Gr-jVt-j ll z
j=m+l
b j- I II Fr-j Vt-j l b L j=m+l 00
(17. 14) where Gr-j is the random variable defined by evaluating [(d!dv1-j)g1]* with respect to the random point (V1-j , Vr-j- J , ... ), and Fr-j bears the corresponding relationship with (dldv1-)fr-j· X1 is therefore L2-NED of size with constants dr « sup � r ii F V ll z , if this norm exists. In particular, Holder ' s inequality allows us to make this derivation whenever II Vr llzr and II Fr l brtr- 1) exist for r > 1, and also if II Vr ll 2 < oc and F1 is a.s. bounded. o s
s
-oc,
s
1 7. 2 Near-Epoch Dependence and Mixingales
The usefulness of the near-epoch depencence concept is due largely to the next theorem. 17.5 Theorem Let {Xr}:'oo be an Lr-bounded zero-mean sequence, for r
> 1.
(i) Let { V1} be a-mixing of size -a. If X1 is Lp-NED of size b on { V1} for 1 � p < r with constants { d1} , { X,,�_:oo} is an Lp-mixingale of size -min { b, a(l!p - 1/r)} with constants c1 « max { IIXr ll n drJ . (ii) Let { V,} be -mixing, of size -a. If X1 is Lp-NED of size -b on { V1} for 1 � p :S r with constants { d1 } , {X1,�:""} is an Lp-mixingale of size -min{ b, a ( l 1 /r) } , with constants c1 « max { I ! Xr ll , d,J . -
-
= £(. 1 �!) where �� = cr(Vs, ... , V,). Also, for m � [m/2] , the largest integer not exceeding m/2. By the Minkowski
Proof For brevity we write £�(.)
1 , let k
=
inequality,
IIEL:mXr ll p � IIE.:.:m (X, - E��fXr) liP + IIE_:.:;m(E���Xr) liP ,
and we bound each of the right-hand-side terms. First,
(17 . 1 5)
Near Epoch Dependence
265
( 1 7. 1 6)
using the conditional Jensen inequality and law of iterated expectations. Second, Er�ZX1 is a finite-lag measurable function of Vr-h···,Vt+b and hence mixing of the same size as { V1 } for finite k. Hence for part (i), we have, from 14.2, ( 1 7 . 17) IIE_: .:;m (E���X,) Ii p � 6a;llp- llr ii .Er�ZX,II , � 6a;llp- llr iiX,II , . Combining ( 1 7. 16) and (17. 1 7) into ( 1 7. 1 5) yields ( 1 7. 1 8) I I E_:.:;mx, IIP � max { II Xr ll ,dr} sm , p where Sm = 6aY - llr + Vk. Also, applying 10.28 gives ( 1 7. 19) II X, - E_:.;!;mXr iiP � 2 II Xr - Er��Xrii P � 2drVm � 2drSm · Since Sm is of size -min {b, a( l/p - 1/r) } , part (i) of the theorem holds with c1 « max { II X, I I ,d1 } . The proof of part (ii) is identical except that in place of ( 17 . 1 7) we must substitute, by 14.4, ( 17.20) IIE.:.:m (E��ZX,) II p � 2¢1- 11'IIE;�ZXr llr � 2¢1- 11'11 Xr llr· • Let us also state the following corollary for future reference. 17.6 Corollary Let { {Xnr};:'- oo }'::'= I be an £,-bounded zero-mean array,
r > 1. (i) If Xnt is Lp-NED of size -b for 1 � p < r with constants { dnr l on an array { Vnr} which is a-mixing of size -a, then {Xn1,:1'�, -oo } is an Lp-mixingale of size -min { b, a(l/p - 1/r) } , with respect to constants Cnr « max { II Xntll ,dnr } . (ii) If Xnr is Lp-NED of size -b for 1 � p � r with constants {dnr} on an array { Vnr} which is ¢-mixing of size -a, then {Xnr,:J�, -oo } is an Lp-mixingale of size -min{ b, a(l - 1/r) } , with respect to constants Cnr « max{ I IXnr ll ,dnr } .
Proof Immediate on inserting
last proof.
•
n before the t subscript wherever required i n the
The replacement of V1 by Vn1 and :1<� by ��s is basically a formality here, since none of our applications will make use of it. The role of the array notation will always be to indicate a transfonnation by a function of sample size, typically the normalization of the partial sums to zero mean and unit variance, and in these cases :1< �s = :1<.� for all n. Reconsider the AR process of 14.7. As a special case of 17.3, it is clear that in that example X1 is Lp-NED of size -oo on Z1, an independent process, and hence is a Lp-mixingale of size -oo, for every p > 0. There is no need to impose smoothness assumptions on the marginal distributions to obtain these properties, which will usually be all we need to apply limit theory to the process. These results allow us to fine-tune assumptions on the rates of mixing and near epoch dependence to ensure specific low-level properties which are needed to prove convergence theorems. Among the most important of these is summability of the sequences of autocovariances. If for example we have a sum of terms, Sn = I�= IX,,
Theory of Stochastic Processes
266
we should often like to know at what rate the variance of this sum grows with n. Assuming E(X1) = 0 with no loss of generality, E(S;)
=
t E(XlJ + 2� (� E(X,X,.m)) ,
a sum of n 2 terms. If the sequence { X,} is uncorrelated only the n variances appear and, assuming uniformly bounded moments, E(S�) = O(n). For general dependent processes, summability of the sequences of autocovariances { E(X1X1-j) . j E IN } implies that on a global scale the sequence behaves like an uncorrelated sequence, in the sense that, again, E(S�) = O(n). For future reference, here is the basic result on absolute summability. To fulfil subsequent requirements, this incorporates two easy generalizations. First we consider a pair (X1, Y1), which effectively permits a generalization to random vectors since any element of an autocovariance matrix can be considered. To deal with the leading case discussed above we simply set Y1 = X1• Second, we frame the result in such a way as to accommodate trending moments, imposing L,-boundedness but not uniform L,-boundedness. It is also noted that, like previous results, this extends trivially to the array case. 17.7 Theorem Let {Xr.Ytl be a pair of sequences, each Lr1(r- l )-NED of size - 1 , with respect to constants {d},dr} for r > 2, on either (i) an a-mixing process of size -r/(r - 2), or (ii) a $-mixing process of size -r/(r - 1), where d} « IIXrllr and dr « II Yt II,. Then the sequences
(17.21) are summable for each t. Also, if arrays { Xnr,YnrJ are similarly Lrl(r- n-NED of size -1 with respect to constants {�1,d�r } with cfnt « IIXntll r and drr « II Ynr ll n the sequences
(17.22) are summable for each n and t. o Since r > 2, the constants appearing in ( 17.21) and ( 17.22) are smaller (absolutely) than the autocorrelations, and the latter need not converge at the same rate. But notice too that rl(r- 1 ) < 2, so it is always sufficient for the result if the functions are L2-NED. Proof As before, let £�(.) = E(. l ��), and let k = [m/2]. By the triangle inequal ity, ( 17.23) I E( Yt+mXt) I ::;; I E( Yt+m(Xt - Fr�ZXr)) I + I E( Yt+mE��ZXr) 1 . The modulus and Holder inequalities give
Near Epoch Dependence
I E(Yt+rn(X, - Er�ZX,)) I � II Yt+rn ll riiXr - Er�ZXrllrt(r- 1 ) � II Yr+m ll�v{,
267 (17.24)
where v� is of size - 1 . Also, applying 17.5(i) with p = rl(r - 1), a = r/(r - 2), and b = 1,
I E(Yt+rnEr�fX,) I = I E(Er�ZYt+rnE��ZXr) I � IIEr�ZYt+rn II rt(r- 1) 11-Fr�zx,) II r � II E�!kYt+m ll r/(r-1) 11Xrll r
� IIX, II r c ;+msk.
(17.25)
where cr « max{ II Y l l ndr } and t;� is of size - 1 . Combining the inequalities in (17.24) and (17.25) in (17.23) gives r
I E(X,Yt+m) I � max { II Yr+rn II r di, II X, II r c r+rn } (v{ + /; [)
(17.26)
where �rn = (v{ + /; [) is of size - 1 . This completes the proof of (i). The proof of (ii) is similar using 17.5(ii) with p = a = r/(r - 1 ) and b = 1. For the array generalization, simply insert the n subscript after every random variable and scaling constant. The argument is identical except that 17.6 is applied in place of 17.5. • 1 7 . 3 Near-Epoch Dependence and Transformations
Suppose that (X1 , , ... ,Xw) ' = X, = g( ... ,V,- I , Vt,Vt+ 1 , ... ) is a v-vector of Lp-NED functions, and interest focuses on the scalar sequence { <)>r(X,) } , where <)>,: 'U' H IR, 'U' � IR v, i s a �v/�-measurable function. We may presume that, under certain condi tions on the function, <)>,(X,) will be near-epoch dependent on { V, } if the elements of X, are. This setup subsumes the important case v = 1 , in which the question at issue is the effect of nonlinear transformations on the NED property. The depen dence of the functional form <)>r(.) on t is only occasionally needed, but is worth making explicit. The first cases we look at are the sums and products of pairs of sequences, for which specialized results exist. 17.8 Theorem Let X, and Y, be Lp-NED on { V,} of respective sizes -
II (X, + Y,) - E���(X, + Y,) l i P � II X, - E;��Xr l l p + II Y, - Er��Yt li P � ctiv� + drv� � dtYrn,
(17.27)
Theory of Stochastic Processes
268
y wh ere d = max { a.Xi, d 1 } and Vm = vXm + vmY = O(m-min { q>x,q> r) ) . • A variable that is Lq -NED is Lp-NED for 1 5 p 5 q, by the Liapunov inequality, so there is no loss of generality in equating the orders of norm in this result. The same consideration applies to the next theorem. 17.9 Theorem Let X1 and Y1 be Lz-NED on { V1} of respective sizes -
E I XtYr - E��:cxrYr) l
!
E (XrYr - XrEr�: Yr) + (XrEr�: Yr - (Er�:x,)(Er�: r,)) - E��: ((Xr - Er�:Xr)(Yr - E��: Yr)) l 5 II X, Ibll Y, - Er�:r,lb + IIEr�:r,lbii Xr - Er�:xr lb + II Xr - Er�:Xr lbll Yr - Er�: Yr lb 5 I I Xr ll zdrv;; + I! Yr ll z�! + dfv�div! =
(17.28) where dr
=
max{ II Xr ll zdr, II Yr ll zd?, d;d? } and Vm = v ;; + v! +V�v! = O(m-min {cpx,CJlrl ) .
•
In both of the last results we would like to be able to set Y, = Xr+J for some finite j. A slight modification of the argument is required here. 17.10 Theorem If X, is Lp-NED on { V, } , so is Xr+J for 0 j < Proof If X1 is Lp-NED, then
<
oo
.
ll Xr+j - E(Xr+j I ���J�:) liP S 2 l1 Xr+j - E(Xr+j I ��!)�:) li P x ( 17.29) -< d't Vm, using 10.28, where d;x 2J{+J· We can write ( 17.30) I I Xr+j - E(Xr+j l ���:) lip 5 d;�;,.., where v0, m 1 v;. = . Vm-) • m > j and v/n is of size
-
{
<"
_
Near Epoch Dependence size -min {
269
o
{=
By considering Z1 = Xr- [k/2] Yr+k- [kl2] , the L1 -NED numbers can
be
given here as
m � [k/2] + 1 Vo, Y m- [k/21 _ 1 , m > [k/2] + 1 :n where v = v� + v� + v� v�, and the constants are 4d}- [kl2] d;+k- [kl2] • assummg that d} and d; are not smaller than the corresponding L2 norms. v
m
All these results extend to the array case as before, by simply including the extra subscript throughout. Corollary 17.11 should be compared with 17.7, and care taken not to confuse the two. In the former we have k fixed and finite, whereas the latter result deals with the case as m ----7 oo The two theorems naturally complement each other in applying truncation arguments to infinite sums of prod ucts. Applications will arise in subsequent chapters. More general classes of function can be treated under an assumption of continu ity, but in this case we can deal only with cases where <\lr(X1) is L2-NED. Let <j>(x): lr 1--7 !R, lr s;; !Rv be a function of v real variables, and use the taxicab metric on !Rv, .
v I XI - X2 , "" p(x I , X 2) =Li i1
(17.3 1 )
i= l
to measure the distance between points x 1 and x2. We consider a set of results that impose restrictions of differing severity on the types of function allowed, but offer a trade-off with the severity of the moment restrictions. To begin with, impose the uniform Lipschitz condition,
(17.32) where B1 is a finite constant. 17.12 Theorem Let Xit be L2-NED of size -a on
{ V1} for i = 1 , . . . ,v, with constants dit . If (17.32) holds, { <j>1(X1)} is also L2-NED on { Vr } of size -a, with constants a finite multiple of maxd dit } . Proof Let �� denote a ?:1'���-measurable approximation to <j>r(X1). Then (17.33) li<J>r(Xr) - E���<J>tCXr) lb � II <J>rCXr) - �r lb by 10.12. Since <j>1(E���1) is an ?:1'���-measurable random variable, ( 17.33) holds for this choice of <j>1, and by Minkowski ' s inequality, A
v
� BrL II Xir - Er��Xir lb i= l v � BrL ditV im i=l
Theory of Stochastic Processes
270
( 1 7.34) where d, = vB,maxi {dit } and Vm = v-1Ii'= IVim• the latter sequence being of size -a by assumption. •
If we can assume only that the Xit are Lp-NED on V, for some p E [1 ,2), this argument fails. There is a way to get the result, however, if the functions <j>, are bounded almost surely. 17.13 Theorem Let Xit be Lp-NED of size -a on { V, } , for 1 s p s 2, with constants dir. i = 1 , ... ,v. Suppose that, for each t, I <j>,(X,) I s M < oo a.s., and also that I <J>,(X 1 ) - <j>,(X2) 1
s
min {B,p (X1 ,X2), 2M} a.s.,
(17.35)
where B, < oo . Then { <j>,(X,) } is L -NED on { V, } of size -ap/2, with constants a 2 finite multiple of max-i{ £iii? } . Proof For brevity, write <j>; ·
=
1 � - 7 1 s 2M min{Z, 1 } . Then E(<j>�
<j>1(X1), and let Z ·
=
-7? f { Zsl } (<j>} - <j>7)2dP + J {Z>l!(<j>} - <j>7)2dP S (2M)2 (f { Zs l } z 2dP + f iZ>II dP ) =
S (2M)2E(ZP)
=
where B lt
=
B,p(X I , X2)12M, so that
By,E( p(X 1 , X2f) ,
(17.36)
B�12(2M)1 -P12. Combining (17.33) with (17.36), we can write
l!(l>t(X,) - E��Z!<J>,(Xr) lb
s B �ti!P (X,,E���X,) II�12
(17.37) where d, = B ,vP12maxi{d�? } and V m = (v-1 I/= Vim)P12, which is of size -ap/2 by 1 1 assumption. •
An important example of this case (with v = 1) is the truncation of X,, although this must be defined as a continuous transformation. 17.14 Example For M
> 0 let
Near Epoch Dependence
�(x) or, equivalently, �(x) so set B
=
=
l
x, M, -M,
271
l xl $ M x>M X < -M
(17.38)
x1 { 1xlgf) +M(x! l x l ) l { lx i>MJ · In this case I �(XI ) - � (X2) I $ l XI - X2 1 ' 1 , and 17.13 can be used to show that { �(X1)} is L2-NED if {Xr } is =
{
Lp -NED. The more conventional truncation, Xr, I Xrl $ M Xr 1 { 1 x i�M) =
(17.39)
0, otherwise,
cannot be shown to be near-epoch dependent by this approach, because of the lack of continuity. o A further variation on 17.12 is to relax the Lipschitz condition ( 17.32) by letting the scale factor B be a possibly unbounded function of the random vari ables. Assume
(17.40) where, for each t,
(17.4 1 )
i s a non-negative, 152v/:B-measurable function. To deal with this case requires a lemma due to Gallant and White (1988). 17.15 Lemma Let B and p be non-negative r.v.s and assume < oo, and II B p ll r < oo, for q � 1, and r > 2. Then
l l p ll q < oo , IIB I I q!(q - 1 )
l II B p llz $ 2 ( ll p ii� -2 II B II �/Zq- 1 ) 1 1 Bp l l �) 1 12(r- ) .
( 1 7.42)
Proof Define
(17.43) and let B 1
=
l ( Bp:> C } B . Then by the Minkowski inequality, I I Bp l lz $ II B IP I I z + IICB - B I ) P IIz .
( 1 7.44)
) 1 12
The right-hand-side terms are bounded by the same quantity. First,
IIB I P I Iz =
J( Bp:> C(Bp)2dP
( 17.45)
Theory of Stochastic Processes
272
applying the Holder inequality. Second, IICB - B t )P II z =
(J
Bp > C
) 112
(Bp)2dP
(17.46) where the first inequality follows from r > 2 and Bp/C > 1 . Substituting for C in (17.45) and (17.46) and applying (17.44) yields the result. • The general result is then as follows. 17.16 Theorem Let { X1} be a v-dimensioned random sequence, of which each ele ment is Lz-NBD of size -a on { V1} , and suppose that cpr(X1) is Lz-bounded. Suppose further that for 1 :$; q :$; 2, ll p(Xr,E�::tr) ll q < oo, II Br(Xr,Er:::;:xt) ll q!(q - 1 ) < oo, and for r > 2, II B,(Xt ,Er::::xt) P (Xr, E�:;;:xt) II r < oo , Then { cj)r(X1) } is Lz-NBD on { V1} of size -a ( r - 2)/2( r - 1 ). Proof For ease of notation, write p for p(X1,E�::;:x1) and B for Br(X1,E�::t1). As in the previous two theorems, the basic inequality (17.33) is applied, but now we have llcp(Xr) - E�:;::cp(Xr) liz :$; ilcp(Xr) - <J>(E�::;:xr) liz :$; II Bp llz ( :$; 2 11 p ll � r-2)/2(r-l ) l i B II �/(� � wr- 1) II Bp 11 12 r- 1 ), where the last step is by 17.15. For q ll p ll q
:$;
IIP II 2
:$;
v
:$;
2,
L 1 1Xit - Er::::xitll 2 �1
(17.47)
:$;
v
_L ditv im �1
=
dtYm ,
(17.48)
where d1 = v maxd dtt } and Vm = v-1Lt=IVim • which is of size -a by assumption. Hence, under the stated assumptions, (17 .49) ll $tCXr) - Er:Z:cptCXr) llz :$; d�v�r- Z)IZ(r- 1 ) , where d; = II B II �/(�� f �Cr-l ) II Bp ll�'2<'- l )d1. • Observe the important role of 17.15 in tightening this result. Without it, the best we could do in ( 17.47) would be to apply Holder's inequality directly, to obtain
Near Epoch Dependence
273 ( 1 7.50)
The minimum requirement for this inequality to be useful is that B is bounded almost surely permitting the choice q = 1, which is merely the case covered by 1 2 17.12 with the constant scale factors set to ess sup Br(X ,X ). The following application of this theorem may be contrasted with 17.9. The moment conditions have to be strengthened by a factor of at least 2 to ensure that the product of L2-NED functions is also L2-NED, rather than just L 1 -NED. There is also a penalty in terms of the L2-NED size which does not occur in the other case. 17.17 Example Let X, = (Xr.Y1) and <)>(X1) = X1Y1• Assume that 1/ Xr /b < DO and II Yr /b < DO for r > 2, and that X1 and Y1 are L2-NED on { Vr} of size -a. Then
1 x } r} - x;r; I � I x} I I r} - r7 1 + 1 x} - x; I I r7 1
� ( I X} I + I Y7 1 )( I Y} - Y7 1 + I X} - X7 1 ) ( 1 7.5 1)
defining B and p. For any
q
i n the range [4/3,4], the assumptions imply
II B(X} ,x7 ) /l qt(q - l ) � II X} II qt(q- 1 ) + II Y7 11 qt(q- 1)
< DO ,
(1 7.52) (17.53)
and
I /B(X} ,X7)p(X } ,X7) // , � II X} Ib ii Y} Ib + II X} Ibii Y7 /b + II X} I/�, + II X} lb i i X7 1 b + II r7 1 12rll Y} lb + II r7 1 1�, + II Y7 1bi/ X} /b + II Y7 1 b i i X7/ b < DO ,
( 1 7.54)
Putting X } = X1 and x; = E/:.':;X1, the conditions of 17.16 are satisfied for the range [4/3,2] and X1Y1 is L2-NED of size -a(r - 2)/2(r- 1). o
q
in
1 7 .4 Approximability
In § 1 7.2 we showed that Lp-NED functions of mixing processes were mixingales, and most of the subsequent applications will exploit this fact. Another way to look at the Lp -NED property is in terms of the existence of a finite lag approx imation to the process. The conditional mean E(X1 I ���:;:) can be thought of as a function of the variables V1-m , ... ,Vt+m' and if { V1} is a mixing sequence so is { E(X,I ���:;:) } , by 14.1 . One approach to proving limit theorems is to team a limit theorem for mixing processes with a proof that the difference between the actual sequence and its approximating sequence can be neglected. This is an alternative way to overcome the problem that lag functions of mixing processes need not be m1xmg.
Theory of Stochastic Processes But once this idea occurs to us, it is clear that the conditional mean might not be the only function to possess the desired approximability property. More gener ally, we might introduce a definition of the following sort. Letting V1 be l x 1 , we shall think of h7 : !R1(2m+ l ) � lR as a ?F:�;::/1�-measurable function, where ?}�:�;;: =
a( Vr-m , ... , Vt+m).
17.18 Definition The sequence {X1 } will be called Lp-approximable (p > 0) on the sequence { V1} if for each m E !N there exists a sequence { h7} of ?}�:�;;:-measurable random variables, and
(17 .55) where {d,} is a non-negative constant sequence, and Vm -7 0 as m -7 oo. {Xr} will also be said to be approximable in probability (or Lo-approximable) on { V,} if there exist { h7 } , { d1} , and { vm } as above such that, for every o > 0, (17.56) The usual size terminology can be applied here. There is also the usual extension to arrays, by inclusion of the additional subscript wherever appropriate. If a sequence is Lp-approximable for p > 0, then by the Markov inequality
(17.57) P( ! Xr - h7 1 > dro) s; (dro) -p i! Xr - h7 11� s; v:n , where v/n = ()-Pvf:t; hence an Lp-approximable process is also Lo-approximable. An Lp-NED sequence is Lp-approximable, although only in the case p = 2 are we able to claim (from 10.12) that E(Xrl ?f ��;;:) is the best Lp-approximator in the sense that the p-norms in ( 17.55) are smaller than for any alternative choice of h7. 17.19 Example Consider the linear process of 17.3. The function
(17.58) j:::. -m is different from E(Xr I ?f��;;:) unless { Vr} is an independent process, but is also an Lp-approximator for X1 since (17.59) p j=m+ l where Vm = I,j:::.m+ l( l ejl + I e_j l ) and dr = SUPr ll Vr llp· D 17.20 Example In 17.4, the functions g7 are Lp-approximators for X1, of infinite size, whenever sups�r ii Fs Vs ll p < oo . D One reason why approximability might have advantages over the Lp-NED prop erty is the ease of handling transformations. As we found above, transferring the Lp-NED property to transformations of the original functions can present diffi culties, and impose undesirable moment restrictions. With approximability, these difficulties can be largely overcome. The first step is to show that, subject to Lr-boundedness, a sequence that is approximable in probability is also Lp-approx imable for p < r, and moreover, that the approximator functions can be bounded for
Near Epoch Dependence
275
each finite m. The following is adapted from Potscher and Prucha (1991a). 17.21 Theorem Suppose { X1} is L,-bounded, r> 1, and L.v-approximable by h 7. Then
for 0 < p < r it is Lp-approximable by h7 = h7 1 1 l h7l.,; d1Mml • where Mm < oo for each
mE
!N .
Proof Since h7 i s an L0-approximator of X1, we may choose a positive sequence
{ 8m } such that Dm --7 0 and yet P( l X1 - h7 1 > d1Dm) :::; Vm --7 0 as m --7 oo. Also choose a sequence of numbers { Mm } having the properties Mm --7 =, but Mmvm --7 0. For example, Mm = v::n 1 12 would serve. There is no loss of generality in assuming supmM::n 1 :::; 1 . By Minkowski ' s inequality we are able to write II Xr - 'fz 7 1 1p :::; A }m +A 7m + A ;m , (17.60) where
A}m = A 7m = A ;m =
II (Xr - h7)1 ( 1X1- h';'l > dr'im J IIP ' II (Xr - h ';') 1 { IX1- h';'l<;; dr)m, l hrl > drMm) l i P ' II (Xr - h7)1 ( IX1- h';'l<;; dr)m,lh71<;; drMmJ i lp ·
HOlder's inequality implies that
(17.61 ) II XYII P :::; II X II pq II Yllpq/(q- 1 ) · q > 1. Choose q = rip and apply (17.61) to A�1 for i = 1,2,3. Noting that II Xr - 'fz 7 ll r :::; II Xr ll , + drMm, again by Minkowski's inequality, and that 11 1 £ 1 1� = P(E), we obtain
the following inequalities. First,
A 1r m :::; II Xr - h� mt il ,P( l Xr - �hmt l > drDm) :::; dr( II Xr 1dr ii , M::n 1 + 1)Mmvm.
(17.62)
Second, observe that
{ 1 Xr - h7 1 :::; drDm , l h7 1 > dr Mm } C { I Xr l > dr(Mm - Dm) } and hence, when Mm > Dm , P( I Xr - h7 1 :::; drDm , l h7 1 > drMm) :::; P( I Xr l > dr(Mm - Dm)) :::; II Xr ll�ti; '(Mm - Dm) - r
(17.63)
by the Markov inequality, so that
A7m :::; 11 Xr - h7 11 , P( l Xr - h7 1 :::; drDm , l h7 1 > drMm) :::; ( II Xr ll r + drMm) II Xr ll�ti;'(Mm - Dmf' (17.64) :::; dr( II Xr ldr ii�+ I + II Xrldr ii�Mm(Mm - Dm) - r. (The final inequality is from replacing M::n 1 by 1 .) And lastly, A ;m :::; drDm (17.65) in view of the fact that 'h7 = h7 on the set { I 'fz7 1 :::; d1Mm } . We have therefore
� . ·�v,
5 u1
0wcnasnc Processes
established that
(17.66) where
v/n
=
MmVm + Mm(Mm - '6m) -' + Om
(17.67)
and v/n � 0 by assumption, since r > 1, and
d�
=
2d,max { II X/d, ll ,, II X/dr ll�+l , II Xrldr ll�, 1 } .
•
(17.68)
If Lo-approximability is satisfied with d1 « II X1 1! , then d; = 2d1• The value of this result is that we have only to show that the transformation of an Lo-approximable variable is also Lo-approximable, and establish the existence of the requisite moments, to preserve Lp-approximability under the transformation. Consider the Lipschitz condition specified in (17.40). The conditions that need to be imposed on B(.,.) are notably weaker than those in 17.16 for the Lp-NED case. ,
17.22 Theorem Let h"! = (h'J.1, ... ,h�1) be the Lo-approximator of X1 = (X1 1, ... ,Xv1)' of size -
then <MXr) is Lo-approximable of size
.
Proof Fix 8 > 0 and M > 0, and define d1 = Li=tdif. The Markov inequality gives
P( I <MXr) - <j>,(h"!) I > d1'6) :::;; P(B1(Xr.h"!)p(X1, h"!) > d1'6, Br(X1,h"!) > M) + P(Br(X1,h"!)p(X1,h"!) > d1'6, BtCXt,h"!)
:::;;
M)
(17.69) Since M is arbitrary the first term on the majorant side can be made as small as desired. The proof is completed by noting that
(� I Xit - h'i\1 > d,'61M) :::;; P (Q { ! Xit - hirl > dir'61M })
P(p(X1,h"!) > d1'61M) = P
v :::;; .L P( I xit - hit I > dit'6!M) i=l v :::;; L Vim � 0 as m � oo. • i= l
(17.70)
It might seem as if teaming 17.22 with 17.21 would allow us to show that, given an L,-bounded, L2 -NED sequence, r > 2 , which is accordingly L2-approximable and hence Lo-approximable, any transformation satisfying the conditions of 17.22 is L2 -approximable, and therefore also L2-NED, by 10.12. The catch with circum-
Near Epoch Dependence
277
venting the moment restrictions of 17. 16 in this manner is that it is not possible to specify the L2-NED size of the transformed sequence. In (17 .67), one cannot put a bound on the rate at which the sequence { 8m } may converge without specifying the distributions of the Xit in greater detail. However, if it is possible to do this in a given application, we have here an alternative route to dealing with trans formations. Potscher and Prucha (1991a), to whom the concepts of this section are due, define approximability in a slightly different way, in terms of the convergence of the Cesaro-sums of the p-norms or probabilities. These authors say that Xt is Lp-approximable (p > 0) if limsup n�oo
1
n
n
L II Xt - h"/ llp --7 0 as m --7
(17.71)
00 ,
t=I
and is Lo-approximable if, for every 8 > 0, n 1 limsup - _L P( I Xt - h"/ 1 > 8) n t=I n�oo
--7
0 as m --7 =
.
(17.72)
It is clear that we might choose to define near-epoch dependence and the mixingale property in an analogous manner, leading to a whole class of alternative converg ence results. Comparing these alternatives, it turns out that neither definition dominates, each permitting a form of behaviour by the sequences which is ruled out by the other. If (17.55) holds, we may write
(17.73) so long as the limsup on the majorant side is bounded. On the other hand, if (17.71) holds we may define Vm
=
limsup ! n�oo
i
n t= I II Xt - h"/ I J p ,
(17.74)
and then dt = supm{ I IXt - h": J lp lv m } will satisfy 17.18 so long as it is finite for finite t. Evidently, (17.71) permits the existence of a set of sequence coordin ates for which the p-norms fail to converge to 0 with m, so long as these are ultimately negligible, accumulating at a rate strictly less than n as n increases. On the other hand, (17.55) permits trending moments, with for example dt = O(tJ...), A > 0, which would contradict (17.71). Similarly, for 8m > 0, and Vm > 0, define dtm by the relation
(17.75) and then, allowing Vm --7 0 and 8m --7 0, define dt = supmdtm · (17.56) is satisfied if dt < oo for each finite t; this latter condition need not hold under (17. 72). On the other hand, (17. 72) could fail in cases where, for fixed 8 and every m, P( I Xt - h"/ 1 > 8) is tending to unity as t --7 = .
IV THE LAW OF LARGE NUMBERS
18 Stochastic Convergence
1 8 . 1 Almost Sure Convergence
Almost sure convergence was defined formally in § 12.2. Sometimes the condition is stated in the form
)
(
P limsup i Xn - XI > E = 0, for all E > 0. (18.1) n-';oo Yet another way to express the same idea i s to say that P( C) = 1 where, for each ro E C and any E > 0, I Xn(W) - X(ro) I > E at most a finite number of times as we pass down the sequence. This is also written as P( I Xn - XI > E, i.o.) = 0, all E > 0,
(18.2)
where i.o. stands for 'infinitely often ' . Note that the probability in (1 8.2) is assigned to an attribute of the whole sequence, not to a particular n. One way to grasp the 'infinitely often ' idea is to consider the event U';;'=m { I Xn - X I > E } ; in words, 'the event that has occurred whenever { I Xn - XI > E} occurs for at least one n beyond a given point m in the sequence ' . If this event occurs for every m, no matter how large, { I Xn - X I > E} occurs infinitely often. In other words, { I Xn - XI > £, i.o. } =
00
n u { I Xn - XI
m= l n=m
> E}
= limsup { I Xn - XI > E } .
(1 8.3)
Useful facts about this set and its complement are contained in the following lemma. 18.1 Lemma Let {En E � } i be an arbitrary sequence. Then
{ ) ( lJ ) (ii) P (liminf En) = lim P ( fl Em) . n....:;oo m=n n-';oo
(i) P limsup En = lim P Em . n....:;oo m=n n....:;oo
{U';;'=mEn J:= l is decreasing monotonically to limsupn En . Part (i) therefore follows by 3.4. Part (ii) follows in exactly the same way, since the
Proof The sequence
The Law of Large Numbers
282
sequence {n';;'=mEn };= I increases monotonically to liminf En. • A fundamental tool in proofs of a.s. convergence is the Borel-Cantelli lemma. This has two parts, the 'convergence' part and the 'divergence' part. The former is the most useful, since it yields a very general sufficient condition for convergence, whereas the second part, which generates a necessary condition for convergence, requires independence of the sequence. 18.2 Borel-Cantelli lemma
(i) For an arbitrary sequence of events {En
E
?f}'!,
'L_ P(En) < oo � P(Em i.o.) 0. 00
n= l (ii) For a sequence {En
=
� } i of independent events,
E
'L_ P(En) = oo � P(En i.o.) = 1. 00
n= l
Proof
(18.4)
(18.5)
By countable subadditivity,
(18.6) The premise in (18.4) is that the majorant side of (18.6) is finite for m = 1 . This implies L,';;'=mP(En) � 0 as m � oo (by 2.25), which further implies
( )
lim P LJ En = 0 . (18.7) m---too n=m Part (i) now follows by part (i) of 18.1. To prove (ii), note by 7.5 that the collection { g,; E ?f } i is independent; hence for any m > 0, and m' > m,
(18.8) by hypothesis, since e -x ;:::: 1 - x. (18.8) holds for all m, so P(liminf g,;)
=
( g,;)
lim P n m-+oo n=m
=
0,
(18.9)
by 18.1(ii). Hence, (18. 10) P(En i.o.) = P(limsup En) = 1 P(liminf E,;) = 1. • To appreciate the role of this result (the convergence part) in showing a.s. convergence, consider the particular case
-
Stochastic Convergence
283
If 2.";;= 1 P(En) < oo, the condition P(En) > 0 can hold for at most a finite number of n. The lemma shows that P(En i.o.) has to be zero to avoid a contradiction. Yet another way to characterize a.s. convergence is suggested by the following theorem. 18.3 Theorem {Xn } converges a.s. to X if and only if for all £ > 0
(
lim P sup / Xn - X / m�oo n ?: m
:s;
)
£
=
1.
(18. 1 1)
Proof Let
(18. 12) n=m and then (18. 1 1) can be written in the form limm�ooPCA m(E)) = 1 . The sequence {Am(E) } i is non-decreasing, so Am(£) = Ui= IAj(£); letting A(£) = u;;;= !Am(E), (18. 1 1) can be stated as P(A(£)) = 1 . Define the set C by the property that, for each ffi E C, { Xn( ffi)} i converges. That is, for ffi E C, 3 m(ffi) such that sup / Xn(ffi) - X(ffi) / :s; £, for all £ > 0.
(18. 13) n ?: m(w) Evidently, ffi E C=::} co E Am(E) for some m, so that C �A(E). Hence P(C) = 1 implies P(A(E)) = 1 , proving 'only if . To show 'if , assume P(A(£)) = 1 for all E > 0. Set E = llk for positive integer k, and define c (18. 14) A* = n A(llk) = ( LJ A(llkt . k=I
k=I
)
The second equality here is 1.1(iv). By 3.6(ii), P(A*) = 1 - P(Uk'=1A(l/k)c) = 1 . But every element of A * is a convergent outcome in the sense of (18. 13), hence A * c C, and the conclusion follows. • The last theorem characterizes a.s. convergence in terms of the uniform prox imity of the tail sequences { / Xn(ffi) - X(ffi) I }"::=m to zero, on a set Am whose measure approaches 1 as m ----7 = A related, but distinct, result establishes a direct link between a.s. convergence and uniform convergence on subsets of .Q. 18.4 Egoroff's Theorem If and only if Xn ...!!:4 X, there exists for every 8 > 0 a set C(8) with P(C(8)) � 1 - <5, such that Xn(ffi) ----7 X(ffi) uniformly on C(8). Proof To show 'only if', suppose Xn(ffi) converges uniformly on sets C(1/k), k = 1 ,2,3, ... The sequence { C(1/k) , k E rN } can be chosen as non-decreasing by mono tonicity of P, and P(Uk'= I C( 1 /k) ) = 1 by continuity of P. To show 'if, let .
The Law of Large Numbers
284 A m(O)
=
00
n { m: I Xn(ffi) - X(m) l < lim},
n=k(m)
(18.15)
k(m) being chosen to satisfy the condition P(A m(O)) � 1 - Tmo. In view of a.s. convergence and 18.3, the existence of finite k(m) is assured for each m. Then if C(o)
=
00
n A m(O),
m= I
(18. 16)
convergence is uniform on C(o) by construction; that is, for every ffi e C(O), I Xn(ffi) - X(m) I < l im for n � k(m), for each m > 0. Applying 1.1(iii) and subaddi tivity, we find, as required,
00
� 1 - L ( 1 - P(A m(O))) m= l � 1 - 8. II
(18. 17)
1 8 . 2 Convergence in Probability
In spite of its conceptual simplicity, the theory of almost sure convergence cannot easily be appreciated without a grasp of probability fundamentals, and traditionally, an alternative convergence concept has been preferred in econo metric theory. If, for any £ > 0, lim P( ! Xn - X I > £) = 0,
(18. 18)
Xn is said to converge in probability (in pr.) to X. Here the convergent sequences are specified to be, not random elements { Xn(ffi) } i, but the nonstochastic sequences { P( I Xn - X I > £) }}. The probability of the convergent subset of Q is left unspec ified. However, the following relation is immediate from 18.3, since (18. 1 1) implies (18. 18). 18.5 Theorem If Xn � X then Xn � X. o The converse does not hold. Convergence in probability imposes a limiting condi tion on the marginal distribution of the nth member of the sequence as n � = The probability that the deviation of Xn from X is negligible approaches 1 as we move down the sequence. Almost sure convergence, on the other hand, requires that beyond a certain point in the sequence the probability that deviations are negli gible from there on approaches 1 . While it may not be intuitively obvious that a sequence can converge in pr. but not a.s., in 18. 16 below we show that convergence in pr. is compatible with a.s. nonconvergence. However, convergence in probability is equivalent to a.s. convergence on a .
Stochastic Convergence
285
subsequence; given a sequence that converges in pr., it is always possible, by throwing away some of the members of the sequence, to be left with an a.s. conver gent sequence.
18.6 Theorem Xn � X if and only if every subsequence { Xnk ' k E IN } contains a further subsequence {Xnk(J) ' j E IN } which converges a.s. to X. Proof To prove 'only if : suppose P( I Xn - XI > E) � 0 for any E > 0. This means
that, for any sequence of integers { nb k E IN } , P( I Xnk - X I > E) � 0. Hence for each j E IN there exists an integer k(j) such that
(18. 19) Since this sequence of probabilities is summable over j, we conclude from the first Borel-Cantelli lemma that
(18.20) It follows, by consideration of the infinite subsequences U � J} for 1 > liE, that P( I Xnk(J) - XI > E i.o.) = 0 for every E > 0, and hence the subsequence {XnkVl } converges a.s. as required. To prove 'if : if {Xn } does not convergence in probability, there must exist a subsequence {nk } such that infk { P( I Xnk - X I > E) } � E, for some E > 0. This rules out convergence in pr. on any subsequence of {nk } , which rules out convergence a.s. on the same subsequence, by 18.5. • 1 8 .3 Transformations and Convergence
The following set of results on convergence, a.s. and in pr., are fundamental tools of asymptotic theory. For completeness they are given for the vector case, even though most of our own applications are to scalar sequences. A random k-vector Xn is said to converge a.s. (in pr.) to a vector X if each element of Xn converges a.s. (in pr.) to the corresponding element of X. 18.7 Lemma Xn � X a.s. (in pr.) if and only if IIXn - X I I � 0 a.s. (in pr.). 1 9
Proof Take first the case of a.s. convergence. The relation IIXn - X I I � 0 may be
expressed as
P
(urn ± (Xni - Xi)2 < E2) n--7oo r=l
for any E > 0. But ( 1 8.21 ) implies that
(
P lim I Xni - Xi ! n--7oo
< E, i =
=
)
1, . ,k . .
(18.21)
1
=
1,
(18.22)
proving 'if . To prove 'only if , observe that if (18.22) holds, P(Iimn--7oo i iXn - XII < k 1 12E) = 1, for any E > 0. To get the proof for convergence in pr., replace P(limn--7oo · .. ) everywhere by limn--7ooP( ... ), and the arguments are identical. •
The Law of Large Numbers
286
There are three different approaches, established in the following theorems, to the problem of preserving convergence (a.s. or in pr.) under transformations.
IR k ----7 IR be a Borel function, let C8 c IR k be the set of conti nuity points of g, and assume P(X e C8) = 1 . (i) If Xn � X then g(Xn) .-!!:4 g(X). 18.8 Theorem Let g:
(ii) If Xn � X then g(Xn) � g(X).
Proof For case (i), there is by hypothesis a set D e 'J', with P(D) = 1 , such that Xn(ro) ----7 X(ro), each ro e D. Continuity and 18.7 together imply that g(Xn(ro)) ----?
g(X(w)) for each (!) E x-1(Cg) n D. This set has probability 1 by 3.6(iii). Toprove (ii), analogous reasoning shows that, for each £ > 0, 3 8 > 0 such that { w: ll� (ro) - X(ro) ll < 8 } n K 1(C8) � { ro: l g(� (ro)) - g(X(ro)) l < e } . (18.23)
Note that if P(B) = 1 then for any A E
'fF,
( 1 8.24) by de Morgan ' s law and subadditivity of P. In particular, when P(X e C8) = 1 , ( 1 8.23) and monotonicity imply P( !l Xn - X II < 8) :::; P( j g(Xn) - g(X) j < e) . (18.25) Taking the limit of each side of the inequality, the rninorant side tends to 1 by
hypothesis.
•
We may also have cases where only the difference of two sequences is convergent. 18.9 Theorem Let { Xn } and { Zn } be sequences of randomk-vectors (not necessarily converging) and g the function defined in 18.8, and let P(Xn E C8) = P(Zn E C8) =
1 for every
n.
(i) If IIXn - Zn II � 0 then I g(Xn) - g(Zn) I � 0. (ii) If II Xn - Zn ll � 0 then l g(Xn) - g(Zn) l � 0.
Proof Put Efz = X�1(C8), E� = Z�\C8), Ex = n�=tE!z, and e2 = n�=l�· P(Ex) =
P(£2) = 1 , by assumption and 3.6(iii). Also let D be the set on which I I Xn - Zn I I converges. The proof is now a straightforward variant of the preceding one, with the set Ex n £2 playing the role of .A1(C8). • The third result specifies convergence to a constant limit, but relaxes the conti nuity requirements.
IR k ----? IR be a Borel function, continuous at a. (i) If Xn � a then g(Xn) � g(a).
18.10 Theorem Let g:
(ii)
If
Xn �
a
then g(Xn) � g(a).
with P(D) = 1 , such that Xn(ro) ----? a, each ro E D. Continuity implies g(Xn(ro)) ----7 g(a) for ro E D, proving (i). Likewise,
Proof By hypothesis there is a set D E
'J',
{ ro: II Xn(w) - a ll < 8 } � { ro : l g(Xn(ro)) - g(a) l < £ } ,
and (ii) follows much as in the preceding theorems.
•
( 18.26)
Stochastic Convergence
287
Theorem 18.10(ii) is commonly known as Slutsky ' s theorem (Slutsky 1925). These results have a vast range of applications, and represent one of the chief reas ons why limit theory is useful. Having established the convergence of one set of statistics, such as the first few empirical moments of a distribution, one can then deduce the convergence of any continuous function of these. Many commonly used estimators fall into this category. 18.11 Example Let An be a random matrix whose elements converge a.s. (in pr.) to a limit A . Since the matrix inversion mapping is continuous everywhere, the results a.s.lim A � 1 = A - l (plimA� 1 = A - l ) follow on applying 18.8 element by element. o The following is a useful supplementary result, for a case not covered by the Slutsky theorem because Yn is not required to converge in any sense. 18.12 Theorem Let a sequence { Yn}i be bounded in probability (i.e., Op(1) as n � oo); if Xn � 0, then XnYn � 0. Proof For a constant B > 0, define
Y� = Yn1 { 1 Yn l :o;B} · The event { I XnYn l � £} for
£ > 0 is expressible as a disjoint union:
(18.27) { I XnYn l � £} = { IXn i i Y� I � £} u { I Xn i i Yn - Y� I � £ } . For any £ > 0, { I Xn l l Y� l � £} c { I Xn l � £1B } , and (18.28) P( I Xn l l Y� l � E) � P( I Xn l � £/B) � 0. By the Op (l) assumption there exists, for each 8 > 0, B0 < oo such that P( l Yn - y�& I > 0) < 8 for n E lN . Since { I Xn l l Yn - Y� l � £ } c { I Yn - Y� l > 0 } , (1 8.27) and additivity imply, putting B = B0 i n (8.28), that (18.29) The theorem follows since both £ and 8 are arbitrary.
•
1 8 .4 Convergence in LP Norm
Recall that when E( I Xn I P) < oo, we have said that Xn is Lp-bounded. Consider, for p > 0, the sequence { II Xn - X l l p } i . If E( II Xn llp) < oo, all n, and limn�oo ii Xn - X l l p = 0, Xn is said to converge in 4 norm to X (write Xn � X). When p = 2 we speak of convergence in mean square (m.s.). Convergence in probability is sometimes called LQ-convergence, terminology which can be explained by the fact that Lp-convergence implies Lq-convergence for 0 < q < p by Liapunov ' s inequality, together with the following relationship, which is immediate from the Markov inequality. 18.13 Theorem If Xn � X for any p > 0, then Xn � X.
o
The converse does not follow in general, but see the following theorem. 18.14 Theorem If Xn � X, and { I Xn l P }i is uniformly integrable, then Xn � X.
The Law of Large Numbers
288 Proof
For E > 0,
E I Xn - XI P = E( l { IXn -X IP>EJ I Xn - X I P) + E( l { IXn -X (�;eJ I Xn - XI P)
$ E( l i i Xn -X I P>eJ I Xn - XI P) + E.
(18.30)
Convergence in pr. means that P( I Xn - X I > E) � 0 as n � co. Uniform integrab ility therefore implies, by 12.9, that the expectation on the majorant side of (18.30) converges to zero. The theorem follows since E is arbitrary. • We proved the a.s. counterpart of this result, in effect, as 12.8, whose conclus ion can be written as: I Xn - XI � O implies E I Xn - XI � 0. The extension from the L 1 case to the Lp case is easily obtained by applying 18.8(i) to the case g(.) =
I . I P.
One of the useful features of Lp convergence is that the Lp norms of Xn - X define a sequence of constants whose order of magnitude in n may be determined, providing a measure of the rate of approach to the limit. We will say for example that Xn converges to X in mean square at the rate nk if IIX1 - X l i z = O(n -k), but not o(n -k) . This is useful in that the scaled random variable nk(X1 - X) may be non-degenerate in the limit, in the sense of having positive but finite limiting variance. Determining this rate of convergence is often the first step in the analysis of limiting distributions, as discussed in Part V below. 1 8 . 5 Examples
Convergence in pr. is a weak mode of convergence in that without side conditions it does not imply, yet is implied by, a.s. convergence and Lp convergence. How ever, there is no implication from a.s. convergence to Lp convergence, or vice versa. A good way to appreciate the distinctions is to consider 'pathological' cases where one or other mode of convergence fails to hold. 18.15 Example Look again at 12.7, in whichXn = 0 with probability 1 - 1/n, and Xn = n with probability l in, for n = 1,2,3, ... . A convenient model for this sequence is to let ro be a drawing from the space ([0,1] , 13 [o,I],m) where m is Lebesgue measure, and define the random variable ro E [0, l in), Xn(ro) = ( 1 8.3 1 ) 0, otherwise. The set { ro: limnXn(OO) :f:. 0 } consists of the point { 0 } , and has p.m. zero, so that p Xn � 0 according to (18. 1). But E I Xn i P = 0.(1 - l in) + nP!n = n - l . It will be recalled that this sequence is not uniformly integrable. It fails to converge in Lp for any p > 1, but for the case p = 1 we obtain E(Xn) = 1 for every n. The limiting expectation of Xn is therefore different from its almost sure limit. o The same device can be used to define a.s. convergent sequences which do not converge in Lp for any p > 0. It is left to the reader to construct examples.
{n,
Stochastic Convergence
289
18.16 Example Let a sequence be generated as follows: X1 = 1 with probability 1 ; (X2 ,X3) are either (0, 1 ) or ( 1 ,0) with equal probability; (X4,X5 ,X6) are chosen from (1 ,0,0), (0, 1 ,0), (0,0, 1) with equal probability; and so forth. For k = 1 ,2,3, ... the next k members of the sequence are randomly selected such that one of them is unity, the others zero. Hence, for n in the range [�k(k - 1) + 1, !k(k + 1 )] , P(Xn = 1) = Ilk, as well as E I Xn i P = llk for p > 0. Since k � as n � =, it is clear that Xn converges to zero both in pr. and in Lp norm. But since, for any n, Xn+J = 1 a.s. for infinitely many j, oo
P( I Xn l
< £, i.o.)
=
0
(18.32)
for 0 :::; £ :::; 1 . The sequence not only fails to converge a.s., but actually converges with probability 0. Consider also the sequence { k 1 1rXn} , whose members are either 0 or k1 1r in the range [1k(k - 1) + 1 , 1k(k + 1 )]. Note that E( l k11rXn i P) = Jtlr- t , and by suitable choice of r we can produce a sequence that does not converge in Lp for p > r. With r = 1 we have E(kXn) = 1 for all n, but as in 18.15, the sequence is not uniformly integrable. The limiting expectation of the sequence exists, but is different from the probability limit. o In these non-uniformly integrable cases in which the sequence converges in L 1 but not in L 1 +9 for any e > 0, one can see the expectation remaining formally well-defined in the limit, but breaking down in the sense of losing its intuitive interpretation as the limit of a sample average. Example 18.15 is a version of the well-known St Petersburg Paradox. Consider a game of chance in which the player announces a number n E IN , and bets that a succession of coin tosses will produce n heads before tails comes up, the pay-off for a correct prediction being £2 n+ l . The n probability of winning is T - I, so the expected winnings are £ 1 ; that is to say, it is a 'fair game ' if the stake is fixed at £ 1 . The sequence of random winnings Xn generated by choosing n = 1,2,3, ... is exactly the process specified in 18.15 ?0 If n is chosen to be a very large number, a moment's reflection shows that the probability limit is a much better guide to one's prospective winnings in a finite number of plays than the expectation. The paradox that with large n no one would be willing to bet on this apparently fair game has been explained by appeal to psychological notions such as risk aversion, but it would appear to be an adequate explanation that, for large enough n, the expectation is simply not a practical predictor of the outcome. 1 8 . 6 Laws of Large Numbers
Let { Xr}I be a stochastic sequence and define Xn = n - l L7: I Xr. Suppose that E(Xr) = 11r and n - l I7=tl1t ---7 11 with 1 11 1 < oo; this is trivial in the mean-stationary case in which l1t = 11 for all t. In this simple setting, the sequence is said to obey the weak law of large numbers (WLLN) when Xn � 11, and the strong law of large numbers (SLLN) when Xn � 11· These statements of the LLNs are standard and familiar, but as characterizations
290
The Law of Large Numbers
of a class of convergence results they are rather restrictive. We can set 1-1 = 0 with no loss of generality, by simply considering the centred sequence { X1 - IJ.r } i; centring is generally a good idea, because then it is no longer necessary for the time average of the means to converge in the manner specified. We can quite easily have n - 1 :L7=I I-lr -j oo at the same time that n - 1 :L7=t(X1(ro) - !-11) -j 0. In such cases the law of large numbers requires a modified interpretation, since it ceases to make sense to speak of convergence of the sequence of sample means. More general modes of convergence also exist. It is possible that Xn does not converge in the manner specified, even after centring, but that there exists a sequence of positive constants {an } i such that an t oo and a� 1 :L7= 1 X1 -j 0. Results below will subsume these possibilities, and others too, in a fully general array formulation of the problem. If { {Xnr}�� I }-;:'= 1 is a triangular stochastic array with { kn} -;:'= 1 an increasing integer sequence, we will discuss conditions for kn
Sn = L Xnr � 0. t=l
(1 8.33)
A result in this form can be specialized to the familiar case with Xnr = a� 1 (X1 - !-11) and an = kn = n, but there are important applications where the greater generality is essential. We have already encountered two cases where the strong law of large numbers applies. According to 13.12, Xn � 1-1 = E(X1 ) when {Xr} is a stationary ergodic sequence and E I X1 1 < oo . We can illustrate the application of this type of result by an example in which the sequence is independent, which is sufficient for ergodicity. 18.17 Example Consider a sequence of independent Bernoulli variables X1 with P(X1 = 1) = P(X1 = 0) = !; that is, of coin tosses expressed in binary form (see 12.1). The conditions of the ergodic theorem are clearly satisfied, and we can conclude that n - 1 :L7= 1 X1 � E(X1) = !· This is called Borel's normal number theorem, a normal number being defined as one in which Os and 1 s occur in its binary expansion with equal frequency, in the limit. The normal number theorem therefore states that almost every point of the unit interval is a normal number; that is, the set of normal numbers has Lebesgue measure 1 . Any number with a terminating expansion is clearly non-normal and we know that all such numbers are rationals; however, rationals can be normal, as for example 1, which has the binary expansion 0.01010101010101... This is a different result from the well-known zero measure of the rationals, and is much stronger, because the non-normal numbers include irrationals, and form an uncountable set. For example, anynumberwithabinary expansionoftheform0.1 1bt 1 1b 2 1 1b 3 1 1 ... where the bi are arbitrary digits is non-normal; yet this set can be put into 1-1 cor respondence with the expansions O.b 1 b2b3 , ... , in other words, with the points of the whole interval. The set of non-normal numbers is equipotent with the reals, but it none the less has Lebesgue measure 0. o A useful fact to remember is that the stationary ergodic propery is preserved under measurable transformations; that is, if {X1} is stationary and ergodic, so
Stochastic Convergence
291
is the sequence {g(X1) } whenever g: !R f---7 !R is a measurable function. For example, we only need to know that E(XI) < to be able to assert that n - 1 I7= 1x7 ...!!:4 E(Xi). The ergodic theorem serves to establish the strong law for most stationary sequences we are likely to encounter; recall from § 13.5 that ergodicity is a weaker property than regularity or mixing. The interesting problems in stochastic convergence arise when the distributions of sequence coordinates are hetero geneous, so that it is not trivial to assume that averaging of coordinates is a stable procedure in the limit. Another result we know of which yields a strong law is the martingale conver gence theorem (15.7), which has the interpretation that a � 1 I7= 1 Xr ...!!:4 0 whenever {L7= 1 Xr} is a submartingale with El I7= 1 Xr l < oo uniformly in n, and an -7 This particular strong law needs to be combined with additional results to give it a broad application, but this is readily done, as we shall show in §20.3. But, lest the law of large numbers appear an altogether trivial problem, it might also be a good idea to exhibit some cases where convergence fails to occur.
oo
co
.
-
18.18 Example Let {X1} denote a sequence of independent Cauchy random vari ables with characteristic function G>xP,.) = e I A. I for each t (11.9). It is easy to
n verify using formulae (1 1 .30) and (1 1 .33) that
-
t
(18.34) Xr = L 'l'sZs = Xr 1 + 'JfrZr, t = 1,2,3, ... s= 1 with X0 = 0, where { Z1 } j is an independent stationary sequence with mean 0 and variance �, and { 'Vs } j is a sequence of constant coefficients. Notice, these are indexed with the absolute date rather than the lag relative to time t, as in the linear processes considered in § 14.3. For m > 0, t
Cov(X1, Xt+m) = Var(X1) = cr2 L Vs · s= 1
(18.35)
oo
For {X1} } to be uniformly L2-bounded requires I7= 1 'l'; < ; in this case the effect of the innovations declines to zero with t and Xr approaches a limiting random variable X, say. Without the square-summability assumption, Var(X1) -7 An example of the latter case is the random walk process, in which 'Vs = 1, all s. Since Cov(X1 ,X1) = 'Jficr2 for every t, these processes are not mixing. Xn has zero mean, but
)
oo
.
1 (18.36) f var(Xj) . ± v ar(X1) + 2i n2 t=1 t=2 ]= 1 If 'L7:::: 1 'VJ < co, then limn�ooVar(Xn) = �L7= 1 'JIJ; otherwise Var(Xn) -7 In either case the sequence { Xn } fails to converge to a fixed limit, being either stochastic Var(Xn) =
(
oo
.
292
The Law of Large Numbers
asymptotically, or divergent. o These counter-examples illustrate the fact that, to obey the law of large numbers, a sequence must satisfy regularity conditions relating to two distinct factors: the probability of outliers (limited by bounding absolute moments) and the degree of dependence between coordinates. In 18.18 we have a case where the mean fails to exist, and in 18.19 an example of long-range dependence. In neither case can Xn be thought of as a sample statistic which is estimating a parameter of the under lying distribution in any meaningful fashion. In Chapters 19 and 20 we devise sets of regularity conditions sufficient for weak and strong laws to operate, constraining both characteristics in different configurations. The necessity of a set of regularity conditions is usually hard to prove (the exception is when the sequences are independent), but various configurations of mixing and Lp-boundedness conditions can be shown to be sufficient. These results usually exhibit a trade-off between the two dimensions of regularity; the stronger the moment restrictions are, the weaker dependence restrictions can be, and vice versa. One word of caution before we proceed to the theorems. In §9 . 1 we sought to motivate the idea of an expectation by viewing it as the limit of the empirical average. There is a temptation to attempt to define an expectation as such a limit; but to do so would inevitably involve us in circular reasoning, since the arguments establishing convergence are couched in the language of probability. The aim of the theory is to establish convergence in particular sampling schemes. It cannot, for example, be used to validate the frequentist interpretation of probability. However, it does show that axiomatic probability yields predictions that accord with the frequentist model, and in this sense the laws of large numbers are among the most fundamental results in probability theory.
19
Convergence in Lp Norm
1 9. 1 Weak Laws by Mean-Square Convergence
This chapter surveys a range of techniques for proving (mainly) weak laws oflarge numbers, ranging from classical results to recent additions to the literature. The common theme in these results is that they depend on showing convergence in Lp-norrn, where in general p lies in the interval [ 1 ,2]. Initially we consider the case p = 2. The regularity conditions for these results relate directly to the variances and covariances of the process. While for subsequent results these moments will not need to exist, the � case is of interest both because the conditions are familiar and intuitive, and because in certain respects the results available are more powerful. Consider a stochastic sequence {X1}}, with sequence of means {J.l1 }}, and vari ances {cr� }i. There is no loss of generality in setting 1-lr = 0 by simply consider ing the case of { X1 - j..Lr } i, but to focus the discussion on a familiar case, let us initially assume iln n - 1 I�= II-lr ----7 !-l (finite), and so consider the question, what are sufficient conditions for E(Xn - j..L)2 ----7 0? An elementary relation is (19.1) where the second term on the right-hand side converges to zero by definition of j..l . Thus the question becomes: when does Var(Xn) ----7 0? We have =
(19.2) where � = Var(X1) and <J15 = Cov(X1,X5). Suppose, to make life simple, we assume that the sequence is uncorrelated, with <J15 = 0 for t-::f. s in (19.2). Then we have the following well-known result. 19.1 Theorem If {Xr }i is uncorrelated sequence and
(19.3) then Xn � j..l . Proof This is an application of Kronecker's lemma (2.35), by which ( 19.3) implies Var(Xn) = n - 2L1� ----7 0. • This result yields a weak law of large numbers by application of 18.13, known as Chebyshev' s theorem. An (amply) sufficient condition for (19.3) is that the
The Law of Large Numbers variances are uniformly bounded with, say, sup1o7 ::;; B < oo . Wide-sense stationary sequences fall into this class. In such cases we have Var(Xn) O(n - 1 ). But since all we need is Var(Xn) o(l ), o7 � oo is evidently permissable. If o7 - tHi for > 0, I,� 1 t- 2o7 has terms of O(t- 1 -8), and therefore converges by 2.27. Looking at (19. 2) again, it is also clear that uncorrelatedness is an unnec 294
=
=
=
o
essarily tough condition. It will suffice if the magnitude of the covariances can be suitably controlled. Imposing uniform L2-boundedness to allow the maximum relaxation of constraints on dependence, the Cauchy-Schwartz inequality tells us that I a15 I ::;; for all and Rearranging the formula in
B
t s.
(19. 2),
n- 1 n n 1 2 o7 + 2 L L l at,t-m I ::;; n2 L: n m=l t=m+ l t= l
(19.4) Bm ::;; B, m
Bm
where = sup1/ a1,1 ml , and all � 1 . This suggests the following variant on 19.1. oo 19.2 Theorem If {X,} i is a uniformly Lrbounded sequence, and I.;= L where = suprl ar,r- m l , then Xn ---=4 11· Proof Since it is sufficient by to show the convergence of to zero. This follows immediately from the stated condition and Kronecker's lemma. • o > 0; a very A sufficient condition, in view of 2.30, is = 0 ( (log mild restriction on the autocovariances. There are two observations that we might make about these results. The first is to point to the trade-off between the dimensions of dependence and the growth of the variances. Theorems 19.1 and 19.2 are easily combined, and it is found that by tightening the rate at which the covariances diminish the variances can grow faster, and vice versa. The reader can explore these possibilities using the rather simple techniques of the above proofs, although remember that the I a ,t I will need to be treated as growing with as well as diminishing with Analogous trade-offs are derived in a different context below. The order of magnitude in of Var(Xn), which depends on these factors, can be thought of as a measure of the of convergence. With no correlation and bounded variances, convergence is at the rate - 1 12 in the sense that Var(Xn) = - 1 ); but from = If convergence implies that Var(Xn) = rates are thought of as indicating the number of sample observations required to get Xn close to j..L with high confidence, the weakest sufficient conditions evidently yield convergence only in a notional sense. It is less easy in some of thl" mnrl". oP:nP.r�l rP�ml ts helow to l i nk exnlicitlv the rate of conver_gence with the
I m- I Bm<
Bm (n-m)/n < 1, (2/n)I,�:�Bm
(19.4)
mr 1-8),
Bm
t
m.
n
O(n
rate (19.4), Bm O(m-8)
n
O(n-8).
1
-
m
Convergence in Lp Norm
295
degree of dependence and/or nonstationarity; this is always an issue to keep in mind. Mixing sequences have the property that the covariances tend to zero, and the mixing inequalities of § 14.2 gives the following corollary to 19.2. 19.3 Corollary If {Xr}] is either (i) uniformly L2-bounded and uniform mixing with
(19.5) Lm= m- 1<1>�12 < 1 or (ii) uniformly L2+a-bounded for 8 > 0, and strong mixing with ( 19 . 6) Lm=l m - 1a�/(2+8) < j.l. then Xn Proof For part (i), 14.5 for the case r 2 yields the inequality Bm ::::;; 2B<j>�12• For r 2 + 8 yields Bm ::::;; 6 IIXrll�+8a�1(2+8). Noting part (ii), 14.3 for the case that B II Xrll�+li• the conditions of 19.2 are satisfied in either case. A sufficient condition for 19.3(i) is
oo ,
00
-
oo ,
L
�
p =
::::;;
=
=
•
=
19.3(ii), Um = O((log mf(l 2l (l e ) for E > 0 is sufficient. In the size termi nology of § 14. 1 , mixing of any size will ensure these conditions. The most signif icant cost of using the strong mixing condition is that simple existence of the variances is not sufficient. This is not of course to say that no weak law exists for L2-bounded strong mixing processes, but more subtle arguments, such as those of § 19 .4, are needed for the proof. 1 9 .2 Almost Sure Convergence by the Method of Subsequences
Almost sure convergence does not follow from convergence in mean square (a counter-example is 18.16), but a clever adaptation of the above techniques yields a result. The proof of the following theorems makes use of the exploiting the relation between convergence in pr. and convergence a.s. demonstrated in 18.6. Mainly for the sake of clarity, we first prove the result for the uncorrelated case. Notice how the conditions have to be strengthened, relative to 19.1.
method of
subsequences,
19.4 Theorem If {Xr}] is uniformly �-bounded and uncorrelated, Xn ...E4 j.l . o A natural place to start in a sufficiency proof of the strong law is with the convergence part of the Borel-Cantelli lemma. The Chebyshev inequality yields, under the stated conditions,
(19.7)
B
for < oo , with the probability on the left-hand side going to zero with the right-hand side as -7 oo . One approach to the problem of bounding the quantity
n
The Law of Large Numbers
296
P( ! Xn - iin l > £, i.o.) would be to add up the inequalities in (19.7) over n. Since the partial sums of l in form a divergent sequence, a direct attack on these lines does not succeed. However, I';;':: 1 n -2 :::::: 1 .64, and we can add up the subsequence of the probabilities in (19.7), for n = 1 ,4,9, 16, ... , as follows. Proof of 19.4 By
(19.7),
" ,< 1 .64B L P( Xnz - iinz I > £) - -2n2 £
<
oo .
(19.8)
Now 18.2(i) yields the result that the subsequence {Xnz, n E IN } converges a.s. The proof is completed by showing that the maximum deviation of the omitted terms from the nearest member of { Xnz } also converges in mean square. For each n define max I Xk - Xnz l (19.9) n2 S k <(n+l)2 and consider the variance of Dnz. Given the assumptions, the sequence of the Var(Xn) = (1/n2)I,7== I � tends monotonically to zero. For n2 < k < (n + 1) 2, re
Dnz
=
arrangement of the terms produces
(19. 1 0) and when the sequence is uncorrelated the two terms on the right are also uncorre lated. Hence
( �r� � (: �) (: � )
� 1 =
B
2
+
_
2
�B
-
(k - n2)
. 2 - (n 1 )2
(19. 1 1)
Var(Dnz) cannot exceed the last term in (19. 1 1), and n2
n
(19.12)
so the Chebyshev inequality gives
B (19. 1 3) _L P(Dnz > £) � 2 < oo, n2 £ and the subsequence { Dnz, n E !N } also converges a.s. Dn 2 � I xk - Xnz l for any k between n2 and (n + 1 )2, and hence, by the triangle inequality,
Convergence in Lp Norm
297 (19 . 14)
The sequences on the majorant side are positive and converge a.s. to zero, hence so does their sum. But (19. 14) holds for n2 :S; k < (n + 1)2 for { n2, n E IN } , so that k ranges over every integer value. We must conclude that Xn � Jl. • We can generalize the same technique to allow autocorrelation. 19.5 Corollary If {Xr}j is uniformly L2-bounded, and
(19. 15) where Bm
=
supt I O"t,t -m I , then Xn � Jl.
o
Note how much tougher these conditio ns are than those of 19.2. It will suffice here for Bm = O(m - 1 (log mf 1 -0) for o > 0. Instead of, in effect, having the autocovariances merely decline to zero, we now require their summability. Proof of 19.5 By (19 .4) , V ar(Xn) :S; (B + 2B * )In and hence equation (19. 7) holds in the modified form,
L p( l x -n2 - Jln2 I > £) n2 -
< _
1 .64(B + 2B * ) £2
<
oo
.
(19. 1 6)
Instead of ( 19. 1 1) we have, on multiplying out and taking expectations, - n2 Var(Xk - Xnz) = Var k1 L Xnz � X1 - 1 - k
[
(
t=n2+!
)
_
]
2 k t- 1 ( ( - k 1 - � ) L L at,t-m) . t=n2+ m=t n2
2
1
-
(19. 1 7)
The first term on the right-hand side is bounded by ( 1 - n2/k) 2(B + 2B * )/n2 , the second by (k - n2)(B + 2B*)!k2, and the third (absolutely) by 2(1 - n2/k) 2B* . Adding together these latter terms and simplifying yields
-<
(_!_n2 - (n +1 1 )2)s + 2 L!ln2 - (n +1 1)2 + ( 1 - (n +n21)2) 2] s* .
Note, ( 1 - n2!(n + 1 ) 2)2 (19.13) we can write
=
(19. 1 8)
O(n -2), so the term in B* is summable. In place of
The Law of Large Numbers
298
L P(Dn2 > E) s; n2
B + KtB * E2
< oo ,
(19. 19)
where K1 is a finite constant. From here on the proof follows that of 19.4.
•
Again there is a straightforward extension to mixing sequences by direct analogy with 19.3. 19.6 Corollary If { X1 } 1 is either (i) uniformly L2-bounded and uniform mixing with
m=l
"L
or (ii) uniformly L2+0-bounded for 8 > 0, and strong mixing with L a�t(2+B) < oo,
m=l 00
then Xn � 0.
D
( 19.20)
(19.21 )
Let it be emphasized that these results have no pretensions to being sharp ! They are given here as an illustration of technique, and also to define the limits of this approach to strong convergence. In Chapter 20 we will see how they can be improved upon. 1 9 . 3 A Martingale Weak Law
We now want to relax the requirement of finite variances, and prove Lp-convergence for p < 2. The basic idea underlying these results is a truncation argument. Given a sequence { X1}} which we assume to have mean 0, define Y1 = 1 { IXri�B)X1, which equals X1 when I X1 I s; B < oo , and 0 otherwise. Letting Z1 = X1 - Y1, the 'tail component ' of X1, notice that E(Z1) = -E(Y1) by construction, and Xn = Yn + Zn. Since Y1 is a.s. bounded and possesses all its moments, arguments of the type used in § 19 . 1 might be brought to bear to show that Yn � !-lr (say). Some other approach must then be used to show that Zn � !-lz = -j..l y. An obvious technique is to assume uniform integrability of { I X1 j P } . In this case, sup1E I Z1 I P can be made as small as desired by choosing B large enough, leading (via the Minkowski inequality, for example) to an Lp-convergence result for ZnA different approach to limiting dependence is called for here. We cannot assume that Y1 is serially uncorrelated just because X1 is. The serial independence assumption would serve, but is rather strong. However, if we let X1 be a martin gale difference, a mild strengthening of uncorrelatedness, this property can also be passed on to Y1, after a centring adjustment. This is the clever idea behind the next result, based on a theorem of Y. S. Chow (1971). Subsequently (see § 1 9.4) the m.d. assumption can be relaxed to a mixingale assumption. We will take this opportunity to switch to an array formulation. The theorems are easily specialized to the case of ordinary sample averages (see § 1 8.6), but in subsequent chapters, array results will be indispensable.
Convergence in Lp Norm
299
19.7 Theorem Let {Xnr,!¥nr l be a m.d. array, { Cnr } a positive constant array, and { kn } an increasing integer sequence with kn t =. If, for 1 :::; p :::; 2, (a) { I Xnr !Cnr iP} is uniformly integrable, kn
(b) limsup L Cnt <
n -too
kn
t= l
(c) lim L
c�1
n-too t= l then L��!Xnr � 0.
=,
and
0,
=
D
The leading specialization of this result is where Xnr = Xrfan, where { X1,:¥,} is a m.d. sequence with :¥
(b) L hr t= l
=
O (an), and
=
o (a � ) ;
kn
(c) L h; t=l
then a� 1 :L��IXr � 0.
Xnr X11an and Cnt brfan. • Be careful to distinguish the constants an and kn. Although both are equal to in
Proof Immediate from 19.7, defining
=
=
n
the sample-average case, more generally their roles are quite different. The case with kn different from typically arises in 'blocking' arguments, where the array coordinates are generated from successive blocks of underlying sequence coor dinates. We might have kn = for a E (0, 1) denoting the largest integer below x) where the length of a block does not exceed For an application of this sort see §24.4. Conditions 19.8(b) and (c) together imply an t =, so this does not need to be separately asserted. To form a clear idea of the role of the assumptions, it is helpful to suppose that b1 and an are regularly varying functions of their argu ments. It is easily verified by 2.27 that the conditions are observed if b1 - t � for any � � -1, by choosing an - n 1 +� for � > - 1 , and an - log for � - 1 . In particular, setting b1 = 1 for all t, an = kn = yields
n
[na]
([x] 1 [n a].
n
n
=
(19.22) Choosing an = I,�� 1 b1 will automatically satisfy condition (a), and condition (b) will also hold when b1 = O(t �). On the other hand, a case where the conditions
The Law of Large Numbers 1 and, for t > 1, b1 .L,�:lbs In this case condition (a) fail is where imposes the requirement bn O(an), so that b� O(a�). contradicting condition (b). The growth rate of b1 exceeds that of tl3 for every � > 0. 300
=
h =
=
=
=
t.
Proof of 19.7 Uniform integrability implies that
n,t E(IXnr Cnr P 1
sup
1
i
1 1 Xn1lcnr i >M )
)
-7 0
as
One may therefore find, for £ > 0, a constant B£ <
n.t {
sup IIXnrl ! IXnti>BEcnrJ II p !cn
Define
} :::; £.
M -7 oo .
such that
=
(19.23)
Ynt Xnrl IXntl gl£Cntl ' and Znt xnt- Ynt. Then since E(Xnt I ?}n,t- 1) Xnt Ynr-E(Ynrl?in,t- J)+Znr-E(Znr l?in,t- 1 ). =
{
=
=
=
0,
By the Minkowsk:i inequality, kn
p
:::; 2 t= l
(Ynt -E(Ynr l ?in,t- 1)) p
+
kn
2 t=!
(Znt-E(Znrl ?fn,t-J )) p
(19.24)
Consider each of these right-hand-side terms. First,
2(Ynr-E(Ynrl ?in,r-I )) :::; L (Ynr-E(Ynrl ?in,t- I )) 2 �
t=l
�
p
(i E(Ynt-E(Ynr l ?in, d)2)1/2 (i EY�t) 1/2 (i c�t) 1 2 t= l
=
$
k
r
t=l
k
$ B£
t=l
-
k
t=l
(19.25)
The first inequality in (19.25) is Liapunov' s inequality, and the equality follows because { is a m.d., and hence orthogonal. Second,
Yn1 -E(Ynrl ?in,r-1 )} kn kn kn L (Znt -E(Zntl ?i n, t -1 )) :::; L I!Znrl! p + L I ! E(Zntl ?in,t- J ) I ! p p t= l
t= 1
t= l
kn
kn
t=l
t= l
$ 2_L IIZntllp $ 2£L Cnt .
(19.26)
The second inequality here follows because
EIE(Znrl?in,t- J )I P $ E(E(IZnri P I?fn,t- J)) EIZnti P, =
from, respectively, the conditim1al Jensen inequality and the law of iterated expectations. The last is by (19.23). It follows by (c) that for £ > 0 there exists N£ � 1 such that, for n � N£,
Convergence in Lp Norm k ""'n C2
<
L,_, nt -
t=l
301
B - 2£2·
(19.27)
E
Putting together (19.24) with (1 9.25) and (19.23) shows that kn
(19.28)
L Xnt p :::; B£ t=l
22:��� nr < oo, by condition (b). Since £ is arbitrary,
n Ne,
where B = 1 + for 2 this completes the proof. •
C
The weak law for martingale differences follows directly, on applying 18.13. 19.9 Corollary Under the conditions of 19.7 or 19.8,
p
r 1/n
n
.L��1 Xnr
�
0.
o
If we take the case = 1 and set Cn = and kn = as above, we get the result that uniform integrability of { Xr} is sufficient for convergence in probability of the sample mean Xn. This cannot be significantly weakened even if the martingale difference assumption is replaced by independence. If we assume identically distributed coordinates, the explicit requirement of uniform integrability can be dropped and L 1 -boundedness is enough; but of course, this is only because the uniform property is subsumed under the stationarity. You may have observed that (b) in 19.7 can be replaced by kn
(b') limsup /ln - 1 L d,;1 < = . t=I n�oo It suffices for the two terms on the majorant side of (19.24) to converge in Lp, and the Cr inequality can be used instead of the Minkowski inequality in (19.26) to obtain p (19.29) , ( , :J' c.;,.
It
1
•.
£(2k.r't
However, the gain in generality here is notional. Condition (b') requires that limsup1,n�oo'lnd,;t < = , and if this is true the same property obviously extends to { kncntl· For concreteness, put Cnt = as in 19.8 with - and where � and y can be any real constants. With kn - for a > 0, note that the of the value majorant side of (19.29) is bounded if a(1 + �) - y :::; 0, of p. This condition is automatically satisfied as an equality by setting = but note how the choice of can accommodate different choices of kn. None the less, in some situations condition (b) is stronger than what we know to be sufficient. For the case = 2 it can be omitted, in addition to weakening the martingale difference assumption to uncorrelatedness, and uniform integrability to simple L2-boundedness. Here is the array version of 19.1, with the conditions cast in the framework of 19.7 for comparability, although all they do is to ensure that the variance of the partial sums goes to zero.
brian
.L�� 1 br,
an
p
bt t � an nY, na independent an
The Law of Large Numbers
302
19.10 Corollary If {Xnr} is a zero-mean stochastic array with E(XnrXns) = 0 for t # s, and (a) {Xnr!Cnr } is uniformly L2-bounded, and kn
(b) lim ,L c�,
=
0,
1 9.4 A Mixingale Weak Law
To generalize the last results from martingale differences to rnixingales is not too difficult. The basic tool is the 'telescoping series' argument developed in § 16.2. The array element Xnr can be decomposed into a finite sum of martingale differences, to which 19.7 can be applied, and two residual components which can be treated as negligible. The following result, from Davidson (1993a), is an extension to the heterogeneous case of a theorem due to Andrews (1988). 19.11 Theorem Let the� {Xnr.�n1}'::'oo be a L 1 -mixingale with respect to a constant array { cnr } . If (a) {Xnrlcnr} is uniformly integrable, kn
(b) limsup L cnt < oo, and n �oo t=l kn
(c) lim ,L c�r = 0, n �oo t=l where kn is an increasing integer-valued function of n and kn t � O. o
oo,
then I,��1 Xn
r
There is no restriction on the size here. It suffices simply for the mixingale coefficients to tend to zero. The remarks following 19.7 apply here in just the same way. In particular, if X, is a L1 -mixingale sequence and { X,Ib,} is uniformly integrable for positive constants { b,} , the theorem holds for Xnr = X,lan and Cnr = b, lan where an L�= I b r. Theorems 14.2 and 14.4 give us the corresponding results for mixing sequences, and 17.5 and 17.6 for NED processes. It is sufficient for, say, Xnr to be Lr-bounded for r > 1 , and Lp-NED, for p :?: 1 , o n a a-mixing process. Again, n o size restrictions need to b e specified. Uniform integrability of { Xnrlcnr} will obtain in those cases where I! Xnr II is finite for r > 1 and each t, and the NED constants likewise satisfy dnr » II Xnr I I A simple lemma is required for the proof: =
r
r·
19.12 Lemma If the array {Xnrlcnr} is uniformly integrable for p :?: 1 , so is the array {Er-)Xnrlcnr} for j > 0. Proof By the necessity part of 12.9, for any
£ > 0 .:3 8 > 0 such that
Convergence in
{
sup sup
n,t
Lp
Norm
fE I Xnt!cnrl dP}
303
< £,
( 1 9.30)
where the inner supremum is taken over all E E ':; satisfying P(E) < o. Since ':Jn,t -J c ':J, ( 1 9.30) also holds when the supremum is taken over E E ':Jn, t J satisfying P(E) < o. For any such E,
fE i Xntlcnt l dP fEEt-J I Xntlcnt l dP � fE i Et-JXnrlcnrl dP, =
(19.3 1 )
by definition of Et-J(.), and the conditional Jensen inequality (10.18). We may accordingly say that, for £ > 0 :3 o > 0 such that
:;
s
{
sup
J. I E,-jX.,!c.,l dP} < £,
(19.32)
taking the inner supremum over E E ':Jn,t -J satisfying P(E) < o. Since Et-JXnr is ':Jn,t-rmeasurable, uniform integrability holds by the sufficiency part of 12.9. • Proof of 19.11 Fix an integer j and let
Ynj
kn
L (Et+}Xnt - Et+j- l Xnt) . t= l
=
The sequence { Yn1, ':Jn,n+J } ;= l is a martingale, for each j. Since the array
{ (Er+}Xnr - Et+j - l Xnr)lcnr } is uniformly integrable by (a) and 19.12, it follows by (b) and (c) and 19.7 that
Ynj � 0.
(19.33)
We now express L.�� 1 Xnt as a telescoping sum. For any M
M- 1 L Ynj }= 1 -M
1,
kn
kn
=
�
L Et+M- l Xnt - L Et-MXnt' t=l t= 1
( 1 9.34)
and hence
M- 1 kn kn + ) Xn + (Xnr Et Y "L_ Xnt L nj "L_ +M- 1 t "L_ Et-MXnt· t= 1 t= 1 t= 1 }= 1 -M kn
( 19.35)
=
The triangle inequality and the L1 -mixingale property now give
kn kn M- 1 E "L_ Xnt :::; L El YnJI + "L_ E I Xnr - Et+M- 1 Xnrl + "L_ EI Et-MXnr l t= 1 t= 1 t=1 }= 1 -M kn M- 1 + 2 ( 19.36) E l Ynjl �ML Cnr . :::; L t = M = } 11 kn
According to the assumptions, the second member on the right-hand side of (19.36) �
_
/ "'\ /
... 1-0"
r
�
�
1"\
_ _ _ _1
_
� - - - --
_
�
n
� 1
1
,,
•
Y
�lr-
The Law of Large Numbers
304
�E for M � Me. By choosing n large enough, the sum of 2M - 1 terms on the right hand side of ( 19.36) can be made smaller than �E for any finite M, by ( 19.33). So, < E when n is large enough. The theorem is by choosing M � Me we have £ 1 now proved since E is arbitrary. •
I��lXntl
A comparison with the results of § 19. 1 is instructive. In an L2-bounded process, the L2-mixingale property would be a stronger form of dependence restriction than the limiting uncorrelatedness specified in 19.2, just as the martingale property is stronger than simple uncorrelatedness. The value of the present result is the substantial weakening of the moment conditions. 1 9. 5 Approximable Processes
There remains the possibility of cases in which the mixingale property is not easily established -perhaps because of a nonlinear transformation of a Lp-NED process which cannot be shown to preserve the requisite moments for application of the results in § 1 7 .3. In such cases the theory of § 1 7.4 may yield a result. On the assumption that the approximator sequence is mixing, so that its mean deviations converge in probability by 19. 1 1 , it will be sufficient to show that this implies the convergence of the approximable sequence. This is the object of the following theorem.
{h';; t } is a stochastic array and the centred array {h';:1 -E(h';:1) } satisfies the conditions of 19. 1 1 . If the array {Xnt l is L 1 -approximable by { h'::r } with respect to a constant array { dnt}, and limsupn�oo l:��l dnt � B < then L��lXnt 0. Establishing the conditions of the theorem will typically be achieved using 17.21 , by showing that Xm i s Lr-bounded for r > 1 , and approximable i n probability on h';:1 for each m, the latter being m-order lag functions of a mixing array of any 19.13 Theorem Suppose that, for each m E =,
�
IN ,
D
size.
Proof Since ( 19.37)
by the triangle inequality, we have for () > 0
( 1 9.38)
by subadditivity, since the event whose probability is on the minorant side implies at least one of those on the majorant. By the Markov inequality,
Convergence in Lp Norm
3
:::; 3
kn Vm· (2.:.dnt ) t=l
305
(19.39)
P(stochastic is equal to either 0 or 1 , according to whether the non c) l > 8/3)holds IL�� IE(h'::inequality or does not hold. By the fact that E(Xnc) 0 and L -approximability, =
1
(19.40)
and hence (19.4 1 )
We therefore find that for each m E IN
n�= P ( I i Xnc I > 8) :::; 3:v m limsupP (I i (h'::c - E(h'::c)) l > �) + 1 {svm> o/3} n �=
limsup
t= I
+
t=I
(19.42)
h'::c
by the assumption that satisfies the WLLN for each m E IN . The proof is completed by letting m � oo • .
20 The Strong Law of Large Numbers
20. 1 Technical Tricks for Proving LLNs
In this chapter we explore the strong law under a range of different assumptions, from independent sequences to near-epoch dependent functions of mixing processes. Many of the proofs are based on one or more of a collection of ingenious technical lemmas, and we begin by studying these results. The reader has the option of skipping ahead to §20.2, and referring back as necessary, but there is something to be said for forming an impression of the method of attack at the outset. These theorems are found in several different versions in the literature, usually in a form adapted to the particular problem in hand. Here we will take note of the minimal conditions needed to make each trick work. We start with the basic convergence result that shows why maximal inequalities (for example, 15.14, 15.15, 16.9, and 16.11) are important. 20.1 Convergence lemma Let {X,}} be a stochastic sequence on a probability space (O.,'Jf,P), and let Sn = L�= 1 X, and So = 0. For ro E 0., let
If P(M > E)
=
(
}
inf �up i Sj(co) - Sm(ro) l . m J> m 0 for all E > 0 , then Sn � S. M(ro)
=
(20. 1 )
Proof By the Cauchy criterion for convergence, the realization {Sn(ro) } converges if we can find an m such that I Sj - Sm I :::; E for all j > m, for all E > 0; in other words, it converges if M(ro) :::; E for all E > 0. •
This result is usually applied in the following way. 20.2 Corollary Let { cr} 1 be a sequence of constants, and suppose there exists p > 0 such that, for every m � 0 and n > m, and every E > 0,
(
)
�
i
P m �x I Sr Sm l > E s c�, E t=m+ l m <J:O, n where K is a finite constant. If I-7= 1 c� < oo, then Sn � S.
(20.2)
Proof Since { c� } is summable it follows by 2.25 that limm�oa"L'7=m+lc � = 0. Let M be the r.v. in (20. 1). By definition, M :::; supj>m i Sr Sm l for any m > 0, and hence
(
P(M > E) :::; lim �up j Sj - Sm l > E m�oo j> m
)
The Strong Law of Large Numbers 00 K fP m--)oo t=m+ 1
::; - lim L c�
=
307
0,
(20.3)
where the final inequality is the limiting case of (20.2). 20.1 completes the proof. •
Notice how this proof does not make a direct appeal to the Borel-Cantelli lemma to get a.s. convergence. The method is closer to that of 18.3. The essential trick with a maximal inequality is to put a bound on the probability of all occurrences of a certain type of event as we move down the sequence, by specifying a probabil ity for the most extreme of them. Since S is finite almost surely, Xn � 0 is an instant corollary of 20.2. How ever, the result can be also used in a more subtle way in conjunction with Kronecker ' s lemma. If i7=t Yt converges a.s., where { Yt} = {X11at} and {at} is a sequence of positive constants with an t oo , it follows that a� 1 I7= 1 X1 � 0. This is of course a much weaker condition than the convergence of I7=tXt itself. Most applications feature at = t, but the more general formulation also has uses. There is a standard device for extending a.s. convergence to a wider class of sequences, once it has been proved for a given class: the method of equivalent sequences. Sequences {Xt}j and { Yt}j are said to be equivalent if (20.4)
By the first Borel-Cantelli lemma (18.2(i)), (20.4) implies P(Xt 7:. Yt, i.o.) = 0. In other words, only on a set of probability measure zero are there more than a finite number of t for which Xt(ro) 7:. Yr(ro): 20.3 Theorem If Xt and Yt are equivalent, I�= J(Xt - Yt) converges a.s. Proof By definition of equivalence and 18.2(i) there exists a subset C of Q, with
P(Q - C) = 0, and with the following property: for all ro n0(ro) such that X1(ro) = Y1(ro) for t > n0(ro) . Hence n
L (Xt(ro) - Yt(ro) ) t=l
=
no(ro)
L (Xt(ro) - YtCro)), t= l
and the sum converges, for all ro E C.
'\/
E
C, there is a finite
n 2 no(ro),
•
The equivalent sequences concept is often put to use by means of the following theorem.
/w.4 Theorem Let {Xt}j be a zero-mean random sequence satisfying ,L E I Xr i Pfa� t= l
< oo
(20.5)
for some p 2 1, and a sequence of positive constants {at } . Then, putting 1 � for the indicator function 1 { IX11s;arJ(ro),
The Law of Large Numbers
308
_L P( I Xr l > a1)
< 00,
(20.6)
L I E(X11 �) I la1
< 00,
(20.7)
t= 1
t= 1
and for any
r
� p,
_LE( I Xr l r 1�)/a� <
(20.8)
00• D
t=l
The idea behind this result may be apparent. The indicator function is used to truncate a sequence, replacing a member by 0 if it exceeds a given absolute bound. The ratio of the truncated sequence to the bound cannot exceed 1 and possesses all its absolute moments, while inequality (20.6) tells us that the truncated sequence is equivalent to the original under condition (20.5). Proving a strong law under (20.5) can therefore be accomplished by proving a strong law for a truncated sequence, subject to (20.7) and (20.8). Proof of Theorem 20.4 We prove the following three inequalities:
P( I Xr l > ar)
E(l - 1�) $ E( I Xr i P (1 - 1 �))Ia� $ E( I Xri P)fa�. =
(20.9)
Here the inequalities are because I Xr(ro) I P/� > 1 for ro E { I Xt I > at } , and because E(I Xt i P 1�) is non-negative, respectively.
I E(Xt 1 �) I fat
=
I E(Xr(1 - 1 �) I Iat $ E(I Xti 0 - 1�))/at $ E( I Xt i P (l - 1�))/a� $ E( I Xt iP)fa�.
(20. 10)
The equality in (20. 10) is because E(Xt) = 0, hence E(Xt1 �) = -E(Xr(l - 1 �)). The first inequality is the modulus inequality, and the second is because on the event { I Xr l > ad , ( I Xt l latf � I Xr l lat for p � 1 . Finally, by similar arguments to the above,
E( I Xt l r1�)/a� $ E( I X1 1 P l�)la� for p $ $ E( I Xri P)faf. The theorem follows on summing over t.
r
(20. 1 1)
•
There are a number of variations on this basic result. The first is a version for martingale differences in terms of the one-step-ahead conditional moments, where
The Strong Law of Large Numbers
309
the weight sequence is also allowed to be stochastic. The style of this result is appropriate to the class of martingale limit theorems we shall examine in §20.4, in which we establish almost-sure equivalence between sets on which certain conditions obtain and on whkh sequences converge. 20.5 Corollary Let {X1,:g;1} be a m.d. sequence, let { W1} be a sequence of positive � t-1 -measurable r.v.s, and for some p 2 1 let
D
=
Also define
D1
=
{ro: £ t= l
{ro:
ro ro) }
E( I Xr I P I :g;t- 1 )( )/Wr(
.f P( I Xr l > Wr l �r- I )( m)
< oo
< oo
t=l
}
(20. 12)
:g;.
E
(20. 13)
:g;
E
(20. 14) (20. 15) and let D'
P(D')
=
1.
=
D1
n
D2 n D3. Then P(D - D') = 0. In particular, If P(D) = 1 then
(20.9), (20. 10), and (20. 1 1) for the case of conditional expectations. Noting that E(X1 1 :g;1_!) = 0 a.s. and using
Proof It suffices to prove the three inequalities
the fact that W1 is :g;r_1-measurable, all of these go through unchanged, except that the conditional modulus inequality 10.14 is used to get (20. 14) It follows that almost every m E D is in D'. • .
Another version of this theorem uses a different truncation, with the truncated variable chosen to be a continuous function of Xr; see 17.13 to appreciate why this variation might be useful. 20.6 Corollary Let {Xr}'i' be a zero-mean random sequence satisfying (20.5) for 2 1 . Define
p
(20 . .16) -1 ,
Xr
<
- at .
Then,
2:: I E(Yr) I t=l
< oo,
(20. 17)
The Law of Large Numbers
3 10 00
_L E I Yrl t=l
Proof Write
r
<
00,
r 2:: p.
(20. 1 8)
± ar to denote a1X/I X1 1 . Inequalities (20.10) and (20. 1 1 ) of 20.4
are adapted as follows.
I E(Yr) I = I E(Xr1 � + (1 - 1 �)(±ar)) I /at = I E(Xr - (± ar))( I - 1�) 1 /at ::;; E I Xr i 0 - 1�)/ar + EI 1 - 1� 1 :s; E( I Xri P(l - 1�))/a � + P( ! Xrl > ar) ::;; 2E( I X1 ! P)fa�.
(20. 19)
The second equality in (20. 19) is again because E(Xr) = 0. The first inequality is an application of the modulus inequality and triangle inequalities in succession, and the last one uses (20.9). By similar arguments, except that here the c, inequality is used in the second line, we have
E( l Yr l ') ::;; E I Xr1� + (1 - l�)(±ar) l '!a� 1 :s; 2'- (EI X11� I '/a� + EI (l - 1� ) 1 ') ::;; 2'- \E( I Xr iP1�)/a� + P( ! Xr l > ar)) for p ::;; 2rE( I Xr !P)fa�. The theorem follows on summing over t as before.
::;;
r
(20.20)
•
Clearly, 20.5 could be adapted to this case if desired, but that extension will not be needed for our results. The last extension is relatively modest, but permits summability conditions for norms to be applied. 20.7 Corollary (20.6), (20.7), (20.8), (20. 17), and (20. 1 8) all continue to hold if (20.5) is replaced by
_L E( ! Xr ! P) 1 1qla�1q 00
t=l
for any
q 2::
<
oo
(20.21)
1.
Proof The modified forms of
(20.9), and of (20. 19) and (20.20) (say) are (20.22) (20.23)
The Strong Law of Large Numbers
311 (20.24)
where in each case the first inequality is because the left-hand-side member does not exceed 1 . •
For example, by choosing p = q the condition that the sequence { IIX1/a1 11 p } is summable is seen to be sufficient for 20.4 and 20.6. 20. 2 The Case of Independence
The classic results on strong convergence are for the case of independent sequences. The following is the 'three series theorem ' of Kolmogorov: 20.8 Three series theorem Let {X1} be an independent sequence, and Sn = L�= 1 X1• Sn � S if and only if the following conditions hold for some fixed a > 0:
l:: PC I Xr l > a)
(20.25)
< oo,
t=l
2: E( 1 ! 1 Xtl � a)Xr)
(20.26)
< oo,
t=l 00
2:: Var ( 1 { IXtl �a)Xr) t= l
< oo. D
(20.27)
Since the event {S � S} is the same as the event {Sn+ l � S}, convergence is invariant to shift transformations. It is a remote event by 13.19 and hence in independent sequences occurs with probability either 0 or 1 , according to 13.17. 20.8 gives the conditions under which the probability is 1 , rather than 0. The theorem has the immediate corollary that Sn la � 0, whenever an t oo The basic idea of these proofs is to prove the convergence result for the trun cated variables 1 1 J X1J� a )X1, and then use the equivalent sequences theorem to extend it to X1 itself. In view of 20.4, the condition n
n
l:: E I Xr l p t=l
< oo,
1 � p � 2,
.
(20.28)
is sufficient for convergence, although not necessary. Another point to notice about the proof is that the necessity part does not assign a value to a. Conver gence implies that (20.25)-{20.27) hold for every a > 0. Proof of 20.8 Write Y1 = 1 1 J X1J�a1X1, so that the summands in (20.26) and (20.27)
are respectively the means and variances of Y1• The sequence { Y1 - E(Y1) } is inde pendent and hence a martingale difference, so that s� - s:n = L�=m+1 ( Yr - E(Yr)) is a martingale for fixed m 2 0, and I,7=m+ t Var(Y1) = Var(S� - s:n). Theorem 15.14 combined with 20.2, setting p = 2 in each case and putting c7 = Var(Y1) and K = 1 , together yield the result that S� � S' when (20.27) holds. If (20.26) holds, this further implies that I,7=t Y1 converges. And then if (20.25) holds the
The Law of Large Numbers
3 12
sequences {X1 } and { Y1 } are equivalent, and so Sn � S, by 20.3. This proves sufficiency of the three conditions. Conversely, suppose Sn � S. By 2.25 applied to Sn(ffi) for each ffi E Q, it follows that limm�ooL7=mX = 0 a.s. This means that P( I Xr l > a , i.o.) = 0, for any a > 0, and so (20.25) must follow by the divergence part of the Borel-Cantelli lemma (18.2(ii)). 20.3 then assures us that I7=I Y1 also converges a.s. Write s� = L�= !Var(Y1). If s� ---7 oo as n --7 oo, L,7= t(Y1 - E(Y1))1sn fails to converge, but is asymptotically distributed as a standard Gaussian r.v. (This is the central limit theorem - see 23.6.) This fact contradicts the possibility of I7= 1 Y1 converging, so we conclude that s� is bounded in the limit, which is equiv alent to (20.27). Finally, consider the sequence { Y1 - E(Y1) } . This has mean zero, the same vari ance as Y1, and P( l Y1 - E(Y1) ! > 2a) = 0 for all t. Hence, it satisfies the condi tions (20.25)-(20.27) (in respect of the constant 2a) and the sufficiency part of the theorem implies that L,7= 1 (Y1 - E(Y1)) converges. And since L,7= 1 Yt converges, (20.26) must hold. This completes the proof of necessity. • r
The sufficiency part of this result is subsumed under the weaker conditions of 20.10 below, and is now mainly of historical interest; it is the necessity proof that is interesting, since it has no counterpart in the LLNs for dependent sequences. In these cases we cannot use the divergence part of the Borel-Cantelli lemma, and it appears difficult to rule out special cases in which convergence is achieved with arbitrary moment conditions. Incidentally, Kolmogorov originally proved the maximal inequality of 15.14, cited in the proof, for the independent case; but again, his result can now be subsumed under the case of martingale differences, and does not need to be quoted separately. Another reason why the independent case is of interest is because of the follow ing very elegant result due to Levy. This shows that, when we are dealing with partial sums of independent sequences, the concepts of weak and strong conver gence coincide. 20.9 Theorem When {Xz} is an independent sequence and Sn = L,7= 1 X1, Sn � S if and only if Sn � S.
Proof Sufficiency is by 18.5. It is the necessity that is unique to the particular
case cited. Let Smn = L�=m+ I X1, and for some £ > 0 consider the various ways in which the event { ! Smn l > £ } can occur. In particular, consider the disjoint collection
{
max ! Smjl
msjs k- l
S:
}
2£, ! Smk l > 2£ , k
=
m + l , . . ,n. .
For each k, this is the event that the sum from m onwards exceeds 2£ absolutely for the first time at time k, and thus
lJ
{
max ! Smj l
k=m+ l m sj s k- 1
S:
} {
2£, ! Smk l > 2£
=
}
max I Smjl > 2£ ,
m sj s n
(20.29)
The Strong Law of Large Numbers
313
where the sets of the union are disjoint. It is also the case that
YJm �::'- ] I Smj l 5 2£, I Smd > 21} > { I S,. I 5 £ }
ko
(;;
{ I S= I > E } ,
(20.30)
where the inclusion is ensured by imposing the extra condition for each k. The events in this union are still disjoint, and by the assumption of an independent sequence they are the intersections of independent pairs of events. On applying (20.29), we can conclude from (20.30) that
{�;;'.
L':':: C�::
I Smi l > 2E
�
, "' ' P
.
P( I S, I 5 t)
_ , I S,; I ,
21:.
)
I S"" I > 2e P( I S, I , ,) (20.3 1 )
If Sn � S, there exists by definition m 2 1 such that
P( ! Smn l > £) < £ (20.32) for all n > m. According to (20.32), the second factor on the minorant side of (20. 3 1 ) is at least as great as 1 - £, so for 0 < £ < 1 , P
(:::;::. I Smi I > 2!:) < 1 � £.
Letting n � oo and then m
�
(20.33)
oo, the theorem now follows by 18.3.
•
This equivalence of weak and strong results is one of the chief benefits stemming from the independence assumption. Since the three-series theorem is equivalent to a weak law according to 20.9, we also have necessary conditions for convergence in probability. As far as sufficiency results go, however, practically nothing is lost by passing from the independent to the martingale case, and since showing convergence is usually of disproportionately greater importance than showing nonconvergence, the absence of necessary conditions may be regarded a small price to pay. However, a feature of the three series theorem that is common to all the strong law results of this chapter is that it is not an array result. Being based on the convergence lemma, all these proofs depend on teaming a convergent stochastic sequence with an increasing constant sequence, such that their ratio goes to zero. Although the results can be written down in array form, there is no counterpart of the weak law of 19.7, more general than its specialization in 19.8. 20.3 Martingale Strong Laws
Martingale limit results are remarkably powerful. So long as a sequence is a mart1 n cr<>lP rl 1 ffPrPn<'P
no f1 1rthPr rPctri r'tion<: on
it�:
riPnPnrlP.ncP.
arf'. rf'.nnirf'.O anrl the
The Law of Large Numbers
314
moment assumptions called for are scarcely tougher than those imposed in the inde pendent case. Moreover, while the m.d. property is stronger than the uncorrelated ness assumed in § 19.2, the distinction is very largely technical. Given the nature of econometric time-series models, we are usually able to assert that a sequence is uncorrelated because it is a m.d., basically a sequence which is not fore castable in mean one step ahead. The case when it is uncorrelated with its own past values but not with some other function of lagged information could arise, but would be in the nature of a special case. The results in this section and the next one are drawn or adapted chiefly from Stout (1974) and Hall and Heyde (1980), although many of the ideas go back toDoob (1953). We begin with a standard SLLN for L 2-bounded sequences. 20.10 Theorem Let {Xr,:!Fr} o be a m.d. sequence with variance sequence { a7 } , and {ar } a positive constant sequence with ar i oo. Snfan � 0 if 00
L a7ta7 r-= 1
< =. o
(20.34)
There are (at least) two ways to prove this result. The first is to use the mart ingale convergence theorem (15.7) directly, and the second is to combine the maxi mal inequality of 15.14 with the convergence lemma 20.2. In effect, the second line of argument provides an alternative proof of martingale convergence for the square-integrable case, providing an interesting comparison of techniques. First proof Define Tn = L�== IXrlat, so that { Tm:!Fn } is a square-integrable martin
gale. We can say, using the norm inequality and orthogonality of { Xr } , l/2 < oo, sup E I Tn l � sup E(T �) 1 12 = ,La7ta7 (20.35) n n I r-=
( 00 )
leading directly to the conclusion Tn � T a.s., by 15.7. Now apply the Kronecker lemma to the sequences { Tn(ro) } for (0 E n, to show that Snfan � 0 . •
m ;?: 0, { Tn - Tm, :!Fn } is a martingale with (20.36) E(Tn - Tm) 2 = L�==m+I <J7ta7. Apply 15.14 for p = 2, and then 20.2 with c7 = a7ta7. Finally, apply the Kronecker S �cond proof For
lemma as before.
•
Compare this result with 19.4. If Var(Xr) = <J7 � B < oo, say, then setting at = t, we have I7== 1 a7tP � B'L7== 1 11P 1 . 64B < =, and the condition of the theorem is satisfied, hence Xn = Snfn -E:4 0, the same conclusion as before. But the conditions on the variances are now a lot weaker, and in effect we have converted the weak law of 19.1 into a strong law, at the small cost of substituting the m.d. assumption for orthogonality. As an example of the general formulation, suppose the sequence satisfies z
The Strong Law of Large Numbers 00
L E(X7)1i"
315
< 00 •
t= l
(20.37)
We cannot then rely upon Xn converging to zero, but (putting ar = P) we can show that n -2.L7= 1 Xr = Xnln will do so. The limitation of 20.10 is that it calls for square integrability. The next step is to use 20.4 to extend it to the class of cases that satisfy
_L E ! Xr!Pfa�
< oo
t=l
(20.38)
for 1 :::;; p :::;; 2, and some {ar } t oo It is important to appreciate that (20.3 8) for p < 2 is not a weaker condition than for p = 2, and the latter does not imply the former. For contrast, consider p = 1 . The Kronecker lemma applied to (20.38) implies that .
n
a� 1 :.'L E ! Xrl t=l
-7
0.
(20.39)
For an - n, such a sequence has got to be zero or very close to it most of the time. In fact, there is a trivially direct proof of convergence. Applying the monotone convergence theorem (4.9),
(
) (
E lim a� 1 1 Sn l :::;; E lim a� 1 n�oo
n�oo
n
± I Xrl ) t=l
1 = lim a� ,L E ! Xr l · n�oo
t=l
(20.40)
For any random variable X, EIXI = 0 if and only if X = 0 a.s .. Nothing more is needed to show that Snlan converges, regardless of other conditions. Thus, having latitude in the value of p for which the theorem may hold is really a matter of being able to trade off the existence of absolute moments against the rate ofdamping necessary to make them summable. We may meet interesting cases in which (20.38) holds for p < 2 only rarely, but since this extension is available at small extra cost in complexity, it makes sense to take advantage of it. 20.1 1 Theorem If Snlan � 0.
{ Xr,�r}i is a m.d. sequence satisfying (20.38) for 1
:::;;
p :::;; 2,
Proof Let Yr = 1 { IXti�1)Xr, and note that {Xr } and { Yr } are equivalent under (20.3 8), by 20.4. Yr is also �rmeasurable, and hence the centred sequence
{ Zr,� t }, where Zr = Yr - E(Yr l �r- l), is a m.d. Now,
E(Z7) = E(E(Z� ! � t- 1 )) = E(E(Y7 1 �t- l ) - E(Yr l �t- 1 )2)
316
The Law of Large Numbers (20.41)
According to 20.4 with r = 2 , (20.38) implies that 2.7=1E(Y7)1a7 < =, and so, since E(Z7) s; E(Y7) by (20.41 ), 00
(20.42) L E(Z7)1a7 < =. t= l By 20.10, this is sufficient for 2.7= 1 Z11a1 � S t , where S1 is some random variable. But
n n n L Zr lat = L Yrfat - L E(Yr l ?l'r-t )lar. t= J t= I t=I By 15.13(i), (20.38) is equivalent to
(20.43)
(20.44) L E( I Xr i P I ?l't-t )ldr < oo, a.s. t= l According to 20.5, (20.44) implies that 2.7=I I E(Y1 I ?1'1-J ) l ia1 < =, a.s. Absolute
convergence of a series implies convergence by 2.24, so we may say that 2.7=tE(Yt f ?i'r-t)lar � Sz. Hence, I.7= t Yt lar � St + Sz and so a�12.�= 1 Yr � 0 by the Kronecker lemma. It follows by 20.3 and the equivalence of X1 and Y1 implied by (20.38) that Snlan � 0. • Notice that in this proof there are no short cuts through the martingale conver gence theorem. While we know that 2.7= 1 X,Ia1 is a martingale, the problem is to establish that it is uniformly L1 -bounded, given only information about the joint distribution of {X1}, in the form of (20.38). We have to go by way of a result for p = 2 to exploit orthogonality, which is where the truncation arguments come in handy. 20.4 Conditional Variances and Random Weighting
A feature of martingale theory exploited in the last theorem is the possibility of relating convergence to the behaviour of the sequences of one-step-ahead condi tional moments; we now extend this principle to the conditional variances E(X71 ?ft- J). The elegant results of this section contain those such as 20.10 and 20.11. The conditional variance of a centred coordinate is the variance of the innova tion, that is, of X1 - E(X1 I ?f1_1), and in some circumstances it may be more nat ural to place restrictions on the behaviour of the innovations than on the orig inal sequence. In regression models, for example, the innovations may correspond to the regression disturbances. Moreover, the fact that the conditional moments are ?1'1_ 1-measurable random variables, so that any constraint upon them is probabilistic, permits a generalization of the concept of convergence, following the results of § 15.4; our confidence in the summability of the weighted condi tional variances translates into a probability that the sequence converges, in the
The Strong Law of Large Numbers
317
manner of the following theorem. A nice refinement is that the constant weight sequence {a,} can be replaced by a sequence of �1- 1-measurable random weights. 20.12 Theorem Let {X1,�1 } 0 be a m.d. sequence, { W,} a non-decreasing sequence of positive, ��- 1-measurable r.v.s, and Sn = I7= 1X1. Then
The last statement is perhaps a little opaque, but roughly translated it says that the probability of convergence, of the event {Sn!Wn --7 0}, is not less than that of the intersection of the two other events in (20.45). In particular, when one probability is 1 , so is the other. Proof If {Xr } is a m.d. sequence so is {X1/Wr } , since W, is ��- 1 -measurable, and
oo
Tn = I7= 1X11W, is a martingale. For W E Q, if Tn(W) --7 T(w) and Wn(W) t then Sn(w)!Wn(ro) --7 0 by Kronecker' s lemma. Applying 15.11 completes the proof. • See how this result contains 20.10, corresponding to the case of a fixed, diver gent weight sequence and a.s. summability. As before, we now weaken the summ ability conditions from conditional variances to pth absolute moments for 1 � p � 2. However, to exploit 20.5 outside the almost sure case requires a modification to the equivalent sequences argument (20.3), as follows. 20.13 Theorem If {Xr } and { Y1} are sequences of �,-measurable r.v.s, P
({�
P(X,
*
Y, I :l' H) <
+'< {�
(X, - Y,) converges
})
�
0.
(20.46)
Proof Let E1 = {X1 t: Yr } E �1, so that P(X1 t: Y1 1 �r-1) = E(lEt l �r- 1). According to
15.13(ii), P
({
ro:
� E(IE, I :J',_J)(ro) <
oo
}{ � < oo (20.47)
= !'<
ro:
l E,(ro) <
=
})
�
0.
(20.47)
But 2.7= 1 1 Et(w) < means that the number of coordinates for which Xr(ro) t: Yr(w) is finite, and hence I7= 1 (X1(ro) - Yr(w)) therefore implies (20.46). • .
Now we are able to prove the following extension of 20.11. 20.14 Theorem For 1 � p � 2, let £ 1 = { I7= 1 EC I Xr iP I �r- 1)/Wf: { W1 t } Under the conditions of 20.12,
oo
<
oo }
and £2
=
.
(20.48)
Proof The basic line of argument follows closely that of 20.1 1 . As before, let Y1
= l i i Xtl:!> atJ Xr, so that Zr
=
Yr - E(Yr l �r- 1 ) is a m.d. and
ECZ7 1 �r- 1 ) = E(Y7 1 �r- 1 ) - (E(Yr l �r- 1) )2
The Law of Large Numbers
318
(20.49) Applying 20.5 and the last inequality,
( g
P E1 -
E(Z; J 3',- I )IW;
It follows by 15.1 1 and the fact that
< =
})
=
0.
£1 - C � (£1 - D)
(20.50) u
(D - C) that (20.5 1 )
where S1 is some a.s. finite random variable. A second application of 20.5 gives
( {�
I E(Y, l !f,_I ) l ! W,
( {i
E(Yr l ?ir- J )IW,
P EI
-
which is equivalent (by 2.24) to P
£1 -
t:=l
})
=
--7 })
=
< =
S2
J
0,
(20.52)
0,
(20.53)
where S2 is another a.s. finite r.v. And a third application of 20.5 together with 20.13 gives (20.54) for some a.s. finite r.v. S3 . Now (20.5 1 ), (20.53), (20.54), the definition of Z1, the Kronecker lemma and some more set algebra yield, as required,
(20.55) 20.5 Two Strong Laws for Mixingales
The martingale difference assumption is specialized, and the last results are not sufficient to support a general treatment of dependent processes, although they are the central prop. The key to extending them, as in the weak law case, is the mixingale concept. In this section we contrast two approaches to proving mixing ale strong convergence. The first applies a straightforward generalization of the methods introduced by McLeish (1975a); see also Hansen (199 1 , 1992a) for related
319
The Strong Law of Large Numbers
results. We have two versions of the theorem to choose from, a milder constraint on the dependence being available in return for the existence of second moments.
{X1,?11}':: ""
20.15 Theorem Let the sequence be a Lp-mixingale with respect to constants { for either with mixingale size -�, or (i) p (ii) 1 < p < with rnixingale size - 1 . If 2.7= 1 d; < oo then Sn � S.
c1}, 2, =
2,
Proof We have the maximal inequality,
)
(
K±c�,
(20.56)
E max I Sj i P � 1 5)5, n t=l
K
where is a finite constant. This is by 16.10 in the case of (i) and 16.11 case (ii). By relabelling coordinates It can be expressed in the form
E
(�:
,
)
I Sr Sm i P
,;
for any choice of m and n. Moreover,
P
l�:
,
I Sr Sm l
> e)
=
K!1 cl
{':':
,
m
(20.57)
I Sr Sm i P
>
e"
)
m�x I Sj - Sm i P) �E(m<]'O, (20.58) n by the Markov inequality. Inequalities (20. 5 7) and (20. 5 8) combine to yield (20.2), and the convergence lemma 20.2 now yields the result. �
£
We can add the usual corollary from Kronecker' s lemma.
{ Xrfat>?F1}':""
•
20.16 Corollary Let satisfy either (i} or (ii) of 20.15 with respect to constants for a positive sequence with t oo. If
{ c1/ar},
L t=l
c�/a�
{ a1}
a1
(20.59)
< oo,
0. The second result exploits a novel and remarkably powerful argument due to R. M. de Jong (1992). 20.17 Theorem Let { X1, ?1 1} be an L,-bounded, L1-mixingale with respect to con stants { c1} for r � 1, and let { a1}, { B1} be positive constant sequences, and { M1} a positive integer sequence, with an t oo If �M,ex{e2a�/(32M�t.s;) } (20.60)
then Snlan �
D
.
<
�.
The Law of Large Numbers
320 00
_Lt=l B �-rEIXtl rlat Lt=l SM1Ctlat
(20.61 )
< 00,
(20.62)
< oo,
{ sm };=O are the mixingale coefficients, then Snlan 0. Here {B1} and {M1} are chosen freely to satisfy the conditions, given {a1} and { c1} , which suggests a considerable amount of flexibility in application. The sequence {B1} will be used to define a truncation of { Xc}, the role which was played by {a1} in 20.11. The most interesting of the conditions is (20.62), which explicitly trades off the rate of decrease of the mixingale numbers with that of the sequence {c1/a1}. This approach is in contrast with the McLeish method of where
�
o
defining separate summability conditions for the moments and rnixingale numbers, as detailed in § 16.3.
Proof Writing
1 � for 1 1 start by noting that Er+jXt Et+)�Xr + Er+/1 - l�)Xt. IXti :SBtl' =
Hence, we have the identity
Xt = (Et+M1- tl�Xt-Et-M1 l �Xt) + (Et+M1- t ( l - l �)Xr - Et-M/1 - l�)Xr) (20.63) + (Xt-Et+M1- tXt) + Er-MXr, and, by the usual 'telescoping sum' argument, M1- 1 , (20.64) Et+M1- t l�Xt - Et-M11 �Xt = j=Ll-M10t where Zjt Et+jl�X1- Et+j- d �X1, and { Zjt, ?Jt+j} is a m.d. sequence. Note that application of 10 13(ii) Summing yields 1 0tl 2B1 na.s., by a double Mr- 1 _Lxt t=l = Lt=l j=lL-M10t + _LEt t=l +M1- t (1 - 1 �)Xt n - Lt=l Et-MP - 1�)Xr '+ Lt=l (Xr-Er+M1- lXr) s
=
n
n
.
.
n
(20.65)
1,
The object will be to show that Sknlan � 0 for k = . . . ,5. Starting with S1 n , the main task is to reorganize the double sum. It can be verified by inspection that
The Strong Law of Large Numbers n
M1- l
Mn- 1
n
Mn- 1
n
32 1
L J=LI -M1ZJ1 = L L ZJt t= l
=
=.2::1 -Mn LZJt -
}
(20.66)
t=l
= -M1 x�,
<j s -M1- I and Mr-1 s j < M1, where q1 = 1 for -M1 < j < M1 , and q1 t for > £} � for t = 2, ... ,n. Note that, for arbitrary numbers ... ,xk, > flk } . Hence by subadditivity, and Azuma's inequality (15.20),
U�=d lxd
{ a;/ (32 ; � B?)} -i'
,; 4M"exp
{ I I�=1xd
(20.67)
M
Under (20.60), these probabilities are summable over n and so S1 n first Borel-Cantelli lemma. Now let { be any integrable sequence and define
Y1}
la � 0 by the n
(20.68) By the Markov inequality,
)
( I
P max Sj - S� I > £ s m <j 5, n
�E ( m�x I Sj - S� I m <J 5, n
)
(20.69)
If 00
L E I Yrlla1 < t=l
00,
(20.70)
The Law of Large Numbers by an application of 20.2, and hence S,:!an then 0 by Kronecker' s lemma. We apply this result to each of the remaining terms. For S2n, put Y1 Ez+M1- I Xz( l n- 1 �), and noten that n B - laz, a Y 1�) (20.7 1) 1az :::; z :::; _LE1(1 zlf _LEI Xrl _L t=l t=l t=I } 'EIXrl' using the fact that IXtC I - 1 �)/B11 :::; 1Xr( l - 1�)/B11'. S3n is dealt with i n ex actly the same way. For S4n and Ssn , put successively Yz Xr-Et+M1- I X1 and Y1 assumption, Er-M1X1, and by the mixingale n n (c a ) (20.72) ,LEI t=l Yzlfaz :::; _Lt=I zf z SMt" The proof is completed by noting that the majorant terms of (20. 7 1) and (20. 7 2) 322
�
s,: � sa
=
=
are bounded in the limit by assumption.
=
•
The conditions of 20.17 are rather difficult to apply and interpret. We will restrict them very slightly, to derive a simple summability condition which can be compared directly with 20. 15.
{ X1,�tl { cr}. { c1} { a1} I X1I I , c1 an
-
20.18 Corollary Let be an L,-bounded, L1-mixingale of size with respect to constants If and are positive, regularly varying sequences of constants with and t oo, and «
(20.73) where
0. Proof Define o1 then
Snfan
-E4
l) oS = (1 2r
(20.74)
0.
= (log for 8 > This is slowly varying at infinity by 2.28, and the sequence is summable by 2.31. Apply the conditions of 20. 17 with the added stipulation that and are regularly varying, increasing sequences, and so consider the conditions for summability of a series of the form converges, summability follows for 11 > Since from � Taking logarithms, this is equivalent to
{B1} {M1} (n)}, 0. InConln) -11 U LnUl (n)exp{ 2 (nlon)U1 (n)exp{ -11 U2(n)} 0.
(20.75)
U(n) nPL(n) where L(n) is slowly varying, this condition has the form (20.76) where P I and are non-negative constants and L1 (n) and L2(n) are slowly varying. The terms log (on) and log(L1(n)) can be neglected here. Put P 2 0 and L2(n) lion (log n) 1 +8, and the condition reduces to Since
=
pz
=
=
=
The Strong Law of Large Numbers p1 , B } r
0
which holds for all for any 11 > and o > satisfied (recalling that { is monotone) if
323 (20.77)
0. Condition (20.60) is therefore (20.78)
(2.61) and (2. 62) are satisfied if, respectively, B) -rEIXtlrlat B) -rc�lat tlt, (20.79) and (20.80) We can identify the bounding cases of B t and Mt by replacing the second order-of magnitude inequality sign in (20. 79), and that in (20. 80), by equalities, leaving the required scaling constants implicit. Solving for Mt and B t in this way, sub stituting into (20. 78), and simplifying yields the condition (20.81) where � [2r
«
« O
=
>
•
<
=
---7 oo
---7
.
theorem shows that
(n(log n) 1 +3t 1 _L Xt 0 a.s. t=l n
---7
(20.82)
This amounts to saying that the sequence of sample means is almost surely slowly varying as ---7 =; it could diverge, but no faster than a power of log
n
n.
20.6 Near-Epoch Dependent and Mixing Processes Tn viP.w nf
thP h l�t re�ul ts_ there are two nossihle
annroaches to the NED case. It
The Law of Lar e Numbers
324
g
turns out that neither approach dominates the other in terms of permissable condi tions. We begin with the simplest of the arguments, the straightforward extension of 20.18.
{ Xr } �00 -b, { J..lr } �00 , dr I Xr-J..Lr l p { Vr} �oo 2, -a. (20.83) L I CXr -J.l. r)larl l � < 00 t=l for q > p in the a-mixing case (q � p in the -mixing case) where . {(1 2qb+2(q1) ' 2q(a+ 1) } - mm �(20.84) +q)b + 2(q - 1) (1 +q)a +2q ' then a�1L�= I(Xr-J..lr) -.!!:4 0. Proof By 17.5, {Xr-J..Lr } is a L 1 -mixingale of size -min {b, a(l - 1/q)} with respect to constants fer }, with Cr I Xr -J..Lr l q· This is by 17.5(i) in the a-mixing case and by 17.5(ii) in the -mixing case. The theorem follows by 20.18, after substituting for <po in (20.74) and simplifying. This permits arbitrary mixing and NED sizes and arbitrary moment restrictions, so long as (20. 8 3) holds with � arbitrarily close to 1. By letting b one obtains a result for mixing sequences, and by letting a a result for sequences that are Lp-NED on an independent underlying process. Interestingly, in each of these special cases � ranges over the interval (1, 2ql(q + 1)) as the mixing/Lp-NED size
20.19 Theorem Let a sequence be Lp-NED of size for with means on a (possibly vector-valued) sequence 1 ::;; p ::;; with constants « which is a-mixing (<j>-mixing) of size If , 00
«
•
---7
oo
---7
oo
is allowed to range from zero to oo By contrast, a result based on 20.15 would be needed if we could claim only square-summability of the sequence for finite p; this rules out for any choices of a and The first of these results comes directly by applying 17.5. -
.
{ b. I (Xr-J..lr)larllp } 20.20 Theorem For real numbers b, p and r, let a sequence {Xr}�oo with means {J..lr } �oo be Lp-NED of size -b on a sequence {Vr}�=• with constants dr I Xr- J..Lr l p· For a positive constant sequence {ar } let {(Xr -J..Lr)lar}i be uniformly Lr-bounded, and let (20.83)
«
t oo,
(20.85) I CXr -J..lr)larll� < Then a� 1 .L�=I(Xr- J..L1) 0 in each of the following cases: (i) b = -1, p = 2, r > 2, {Vr } is a-mixing of size -r/(r - 2); (ii) b = 1 , 1 < p < 2, r > p, {Vr} is a-mixing of size -prl(r -p); (iii) b -1, p = 2, r � 2, {Vr } is -mixing of size -r/2(r - 1); (iv) b = 1 , 1 < p < 2, � p, {Vr} is -mixing of size -r/(r - 1). Proof By 17.5, conditions (i)-(iv) are all sufficient for { (Xr-J..lt)/at, ?fr} to be L t=I
-.!!:4
=
r
oo
.
The Strong Law of Large Numbers 325 an Lp-mixingale of size -b, where '!11 cr(V5, s :::; t). The mixingale constants are crlar max{dr.IIXr-�tl l r }lat I Xt- �tl l ,!ar. The theorem follows by 20.16. As an example, let X1 possess moments of all orders and be L2-NED of size -�. on an a-mixing process of size close to -1 (letting r --7 oo). Summability of the terms Var(X/a1) is sufficient by (20. 8 5). The same numbers yield s � on putting q 2 and a 1 in (20.84 ), which is not far from requiring summability of the �-norms. However, this theorem requires L,-boundedness, which if r is small constrains the permitted mixing size, as well as offering poor NED size characteristics for cases with p < 2. It can be improved upon in these situations by introducing a =
«
•
=
=
=
=
truncation argument. The third of our strong laws is the following.
1} �} Lp-NED of size -lip X � V1} which is d1 I X1-�� liP' :::; 2, r 2 -r/(r-2) -r/2(r- 1) r 1, r :::; r a1} oo, let 00 (20.86) L I (Xr-�t)larl l 1n{ p,2qlr } < oo ; t=l then a� 1 :L�=l (Xt-�1) 0. Note the different roles of the three constants specified in the conditions. p controls the size of the NED numbers, q is the minimum order of moment required to exist, and r controls the mixing of { V1 } . The distribution of X1 does not otherwise depend on r. Proof The strategy is to show that there is a sequence equivalent to { (X1-�1)/at }, and satisfying the conditions of 20.15(i). As in 20.6, let (20. 87) where 1 � 1 1 and '±' denotes if X1 > M-1, ' otherwise. Note that { (X1-�1)/a1} is Lp-NED with constants d1/a1, and Y1 is a continuous function of (size X1-�-�1)/a1withwithconstants I Yr l :::; 1 a.s. Applying 17.13 shows that Y1 is L2-NED on { V1} of 21 -p12(d11a1)P12. Since I Yrl l , < oo for- every finite r, it further follows by 17.5 that if '!11 cr(V1-5, s � 0), { Y1 E(Y1), '!f1}o is an L2-mixingale of size -� with constants (20. 88) Cr max { (drfat)P12, I Ytll r } . Here, dr :::; 2 I Xr-�r l q for any � p, and I Ytl l r :::; 2( 1 Xt -�t l q la1) qlr for any q :::; r by the second inequality of (20.24). Condition- (20.86) is therefore sufficient for the sequence { d} to be summable, and { Y1 E(Y1) } satisfies the conditions of 20.15(i). We can conclude that L�=l (Y1 - E(Y1)) S1 , where S1 is some random variable. According to 20.6, condition (20.86) is sufficient for :L7= d E(Y1) I < oo. The '\'00 20.21 Theorem Let a sequence { :'oo with means { :oo be on a sequence { for 1 p :::; with constants « either for > or (i) a-mixing of size for > and � p; (ii) $-mixing of size and for q with p q :::; and a constant positive sequence {
�
=
:'""
t
o
' +'
I Xr l :::: a tl
'
=
«
q
�
cPrtPC
)n
•
J<'{V.\ thPrpfnrP t'nn"PrctPc tn
<=�
fi n i tP l i m i t hu '? ?Ll
"""
•
PfV \ -
The Law of Large Numbers
326
n
Lt=l Yr
� St + Ct .
(20.89)
Inequalities (20.22) and (20.6) further imply that lent sequences, and hence
Y1 and (X1- �1)fa1 are equiva
n
Lt=l ((Xr -!-lr)lar- Yr)
� Sz,
(20.90)
where S2 is another random variable, by 20.3. We conclude that n
.�)Xr-1-l r)lar t=l
� S 1 + Sz + Ct
=
say. It follows by Kronecker's lemma that a� 1 .L�=I conclusion. •
S3 ,
(X1 -j.l1)
(20.9 1 ) � 0,
the required
Here is a final, more specialized, result. The linear function of martingale differences with summable coefficients is a case of particular interest since it unifies our two approaches to the strong law.
X1
20.22 Theorem Let = .Lj=- aa81Ur-j where { U1} is a uniformly Lp-bounded m.d. sequence with p > 1, and L}:-ool e) I < Then
oo
.
(20.92) xr ..!!:4 0. Proof Y1 X1/t is a Lp-mixingale with c1 lit and arbitrary size. It was shown in 16.12 that the maximal inequality (20.56) holds for this case. Application of the convergence lemma and Kronecker's lemma lead directly to the result. Alterna tively, apply 20.18 to X1 with a1 t.
n± t= I l.
=
«
=
•
In these results, four features summarize the relevant characteristics of the stochastic process: the order of existing moments, the summability characteristics of the moments, and the sizes ofthe mixing and near-epoch dependence numbers. The way in which the currently available theorems trade off these features suggests that some unification should be possible. The McLeish-style argument is revealed by de Jong' s approach to be excessively restrictive with respect to the dependence conditions it imposes, whereas the tough summability conditions the latter's theorem requires may also be an artefact of the method adopted. The repertoire of dependent strong laws is currently being extended (de Jong, 1994) in work as yet too recent for incorporation in this book.
21 Uniform Stochastic Convergence
2 1 . 1 Stochastic Functions on a Parameter Space
The setting for this chapter is the class of functions f: Q x 8 � iR, where (Q,� ,j.t) is a measure space, and (8,p) is a metric space. We write f( ro,S) to denote the real value assumed by f at the point (ro,S), which is a random variable for fixed 8. But f(ro,.) , alternatively written just f(ro), is not a random vari able, but a random element of a space of functions. Econometric analysis is very frequently concerned with this type of object. Log Ukelihoods, sums of squares, and other criterion functions for the estimation of econometric models, and also the first and second derivatives of these criterion functions, are all the subject of important convergence theorems on which proofs of consistency and the derivation of limiting distributions are based. Except in a restricted class of linear models, all of these are typically functions both of the model parameters and of random data. To deal with convergence on a function space, it is necessary to have a crit erion by which to judge when two functions are close to one another. In this chapter we examine the questions posed by stochastic convergence (almost sure or in probability) when the relevant space of functions is endowed with the uniform metric. A class of set functions that are therefore going to be central to our discussion have the form f*: Q � iR, where f*(ro)
=
sup f(ro,S).
(2 1 . 1)
6E0
For example, if g and h are two stochastic functions whose uniform proximity is at issue, we would be interested in the supremum of f(ro,S) = l g(ro,S) - h(ro,S) I . An important technical problem arises here which ought to be confronted at the outset. We have not so far given any results that would justify treating f* as a random variable, when (8,p) may be an arbitrary metric space. We can write { ro: f * (ro) > x}
=
U { ro : f(S,ro) > x},
(2 1 .2)
and the results of 3.26 show that { ro: f*( ro) > x} E � when { ro: f(S,ro) > x} E � for each 8, when 8 is a countable set. But typically 8 is a subset of (R k,de) or something of the kind, and is uncountable.
The Law of Large Numbers
328
This is one of a class of measurability problems having ramifications far beyond the uniform convergence issue, and to handle it properly requires a mathematical apparatus going beyond what is covered in Chapter We shall not attempt to deal with this question in depth, and will offer no proofs in this instance. We will merely outline the main features of the theory required for its solution. The essential step is to recognize that the set on the left-hand side of can be expressed as a projection. Let denote the Borel field of subsets of that is, the smallest a-field containing the sets of that are open with respect to p. Then let (.Q X lJi denote the product space endowed with the product a-field (the a-field generated from the measurable rectangles of lJi and and suppose that is cy; /:B-measurable. Observe that, if
3.
(21.2)
:Be
8,
8
8, ®:Be) f(.,. ) (21.3 )
:Be), Ax = { (co,S): f(ro,S) > x} ®:Be, the projection of Ax into .Q is Ex = {co: f(co,S) > X, 8} = {co: f *(ro) > x}. (21.4) In view of 3.24, measurability of r is equivalent to the condition that Ex for rational x. Projections are not as a rule measurable transformations,21 but under certain conditions it can be shown that Ex cy;P , where (.Q,lJP,P) is the completion of the probability space. The key notion is that of an analytic set. A standard reference on this topic is Dellacherie and Meyer (1978); see also Dudley (1989: ch. 13 ) , and Stinchcombe and White (1992). The latter authors provide the following definition. Letting (.Q,cyl) be a measurable space, a set E .Q is called lJ-analytic if there exists a compact metric space (8,p) such that E is the projection onto .Q of a set A ® :Be. The collection of cy;-analytic sets is written dJ(cyl). Also, a function f: .Q is called cyl-analytic if {co: f( co) :::; x} dJ(cyl) for each x Since every E cy; is the projection of E x 8 ®:Be, cy; dl(lJ). A measurable ®:Be
E lJi
e E
E tj
E
c
E lJi
E iR. �
E
E cy;
E
1-7
iR
set (or function) is therefore also analytic. dJ(cyl) is not in general a a-field, although it can. be shown to be closed under countable unions and countable intersections. The conditions under which an image under projection is known to be analytic are somewhat weaker than the definition might suggest, and it will actually suffice to let be a that is, a space that is measurably isomorphic to an analytic subset of a compact metric space. A suffi cient condition, whose proof can be extracted from the results in Stinchcombe and White is the following:
(8,:Be)
Souslin space,
(1992), 21.1 Theorem Let (.Q,cyl) be a measurable space and (8, :B e) a Souslin space. If dl(lJ ®:Be), the projection of onto .Q is in dl(lJ). Now, given the measurable space (.Q,cyl), define = n)lcy;ll, where (.Q,cyl11,ji) is the completion of the probability space (.Q,cyl,f...l) (see 3.7) and the intersection is taken over all p.m.s defined on the space. The elements of are called univerBE
B
o
rg;U
f...l
lJi u
Uniform Stochastic Convergence
329
sally measurable sets. The key conclusion, from Dellacherie and Mayer (1978: III.33(a)), is the following.
21.2 Theorem For a measurable space (05), :A('!F) � rg; u_ o
(21 .5)
Since by definition rg; u c '!F� for any choice of J.L, it follows that the analytic sets of '!F are measurable under the completion of (Q,'!f ,Jl) for any choice of Jl. In other words, if E is analytic there exist A, B E '!F such that A � E c B and J.L(A) = J.L(B). In this sense we say that analytic sets are 'nearly ' measurable. All the standard probabilistic arguments, and in particular the values of integrals, will be unaffected by this technical non-measurability, and we can ignore it. We can legitimately treat f*(ro) as a random variable, the conditions on 8 are observed and we can assume f(. , .) to be (near-) '!F ® :130 /:B-measurable. An analytic subset of a compact space need not be compact but must be totally bounded. It is convenient that we do not have to insist on compactness of the parameter space, since the latter is often required to be open, thanks to strict inequality constraints (think of variances, stable roots of polynomials and the like). In the convergence results below, we find that 8 will in any case have to be totally bounded for completely different reasons: to ensure equicontinuity; to ensure that the stochastic functions have bounded moments; and that when a stoch astic criterion function is being optimized with respect to 8, the optimum is usually required to lie almost surely in the interior of a compact set. Hence, total boundedness is not an extra restriction in practice. The measurability condition on f(ro,S) might be verifiable using an argument from simple functions. It is certainly necessary by 4.19 that the cross-section functions f(.,S) : .Q f---7 iR and f(ro, .): f---7 iR be, respectively, '!f/:B-measurable for each e E and :Be /:B-measurable for each 0) E .Q. For a finite partition { , . . . ,em } of by :Be-sets, consider the functions
provided
e e
where
ej
e
fem) (ro, S)
is a point of
=
ej.
f(ro, �), e E
If Ej
=
el
ej,
j
=
I, . . .
,m,
(2 1 .6)
{ ro: f(0), �) :::; X} E '!F for each j, then (21 .7)
j
being a finite union of measurable rectangles. Since this is true for any x, fern) is '!F ® :80 /:B-measurable. The question to be addressed in any particular case is whether a sequence of such partitions can be constructed such that fern) --7 f as
m
--7 oo .
Henceforth we shall assume without further comment that suprema of stochastic functions are random variables. The following result should be carefully noted, not least because of its deceptive similarity to the monotone convergence theorem, although this inequality goes the opposite way. The monotone convergence theorem concerns the expectation of the supremum of a class of functions { fnCro) } , whereas the present one is more precisely concerned with the of a class of
envelope
The Law of Large Numbers
330
functions, the function j*(w) which assumes the value supB e ef(ro, O) at each point of n.
(
21.3 Theorem sup E(f(O)) � E sup f(O)) . J Be e Be e
Proof Appealing to 3.28, it will suffice to prove this inequality for simple
functions. A simple function depending on e has the form
=
=
m
_L a/9) 1 E; (W) i=l
ai(9), W E Ei.
(
)
=
a;, w E Ei.
i
sup E(cp(O)) - E sup cp(e) = sup ( ai(O) - ai)P(Ei) � 0, B e e i=l Be e where the final inequality is by definition of a;. • Bee
(2 1 .8)
SUPB e eai(O), sup cp(ro,e) Be e
Hence
=
(2 1 .9)
(2 1 . 10)
2 1 . 2 Pointwise and Uniform Stochastic Convergence
Consider the convergence (a.s., in pr., in Lp, etc.) of the sequence { Qn(O) } to a limit function Q(8), Typically this is a law-of-large-numbers-type problem, with n (2 1 . 1 1) Qn(O) = L qnr(O) t=l (we use array notation for generality, but the case qnt = q1/n may usually be assumed), and Q(O) = limn�coE'(Qn(O)). Alternatively, we may want to consider the case Gn(8) --7 0 where n (2 1 . 12) Gn(O) = L (qnr(O) - E(qnr(O ))) . t= l By considering (2 1 . 1 2) we divide the problem into two parts, the stochastic convergence of the sum of the mean deviations to zero, and the nonstochastic convergence assumed in the definition of Q( O). This raises the separate question of whether the latter convergence is uniform, which is a matter for the problem at hand and will not concern us here. As we have seen in previous chapters, obedience to a law of large numbers calls for both the boundedness and the dependence of the sequence to be controlled. In the case of a function on 0, the dependence question presents no extra difficulty; for example, if qntCOt) is a mixing or near-epoch dependent array of a given class, the property will generally be shared by qntC82), for any e l ' 92 E But the existence of particular moments is clearly not independent of e . If there
e.
Uniform Stochastic Convergence 331 exists a positive array {Dnr} such that I qnr(8) I Dnt for all 8 e, and I Dnrl l r < oo, uniformly in t and n, qntC8) is said to be Lr-dominated. To ensure pointwise convergence on e, we need to postulate the existence of a dominating array. There is no problem if the qnr( 8) are bounded functions of 8. More generally it is E
::;
necessary to bound e, but since e will often have to be bounded for a different set of reasons, this does not necessarily present an additional restriction. Given restrictions on the dependence plus suitable domination conditions, as an ordinary pointwise stochastic convergence follows by considering stochastic sequence, for each E e. However, this line of argument does not guarantee that there is a minimum rate of convergence which applies for all the condition of uniform convergence. If pointwise convergence of to the limit is defined by
{ Gn(8)}
8
{ Gn(8)} 8, G(8) Gn(8) 0 (a.s., in or in pr.), each 8 e, (21.13) a sequence of stochastic functions { Gn(8)} is said to converge unifo rmly (a.s., in or in pr.) on e if sup I Gn(8) I or in pr.). 0 (a.s., in (21.14) �
E
Lp,
Lp ,
�
Lp ,
9 E8
To appreciate the difference, consider the following example.
[O,oo), and define a zero-mean array {gnr(8)} where o 8 ll2n Z8, h t (21.15) gntC8) + Z(lln-8), ll2n < 8 1/n 0, 1/n < 8 < oo where {hr} is a zero-mean stochastic sequence, and Z is a binary r.v. with P(Z 1)and P(Z -1) �· Then Gn(8) L�=lgnr(8) Hn + Kn(8), where Hn n L7= ht, Zn8 , 0 8 1/2n Kn(8) Z(l - n8), ll2n < 8 1/n. (21.16) 0, lin < 8 < oo We assume Hn ...E4 0. Since Gn(8) Hn for 8 > lin as well as for 8 0, Gn(8) ...E4 0 for each fixed 8 e. In other words, Gn(8) converges pointwise to zero, a.s. However, supe I Kn(8) I I �Z I = � for every n ;:::: 1. Because Hn converges a.s. there will exist N such that I Hnl < ! for all n ;:::: N, with probability 1. You can verify that when I Hn I < ! the supremum on e of I Hn + Kn(8) I is always attained at the point 8 1/2n. Hence, \Yith probability 1, sup I GnC8) 1 I Hn+�ZI for n ;:::: N, 21.4 Example Let e
=
=
=
!
n
=
E
E8
=
!
::;
=
::;
::;
=
=
=
=
9E 8
::;
::;
=
=
=
::;
=
-
I
t
=
The Law of Large Numbers � as n oo It follows that the uniform a.s. limit of Gn(S) is not zero. Similarly, for n � N, P (6supE 8 I Gn(S) I � E) P( IHn +�ZI � E) P( I�ZI � E) 1, 332
----7
----7
.
(21 . 1 7)
=
=
----7
(21 . 1 8)
so that the uniform probability limit is not zero either, although the pointwise probability limit must equal the pointwise a.s. limit. o Our first result on uniform a.s. convergence is a classic of the probability literature, the This is also of interest as being a case outside the class of functions we shall subsequently consider. For a collec tion of identically distributed r.v.s on the probability space (Q.,'!f,P), the is defined as
Glivenko-Cantelli theorem. { X1(co), ... ,Xn(CO)} empirical distribution function n 1 (21 . 19) Fn(x,co) -n L 1 (-co,rJ(Xr(co)). In other words, the random variable Fn(x,co) is the relative frequency of the variables in the set not exceeding x. A natural question to pose is whether (and in what sense) Fn converges to F, the true marginal c.d.f. for the distribution. For fixed x, {Fn(x,co)} i is a stochastic sequence, the sample mean of n Bern oulli-distributed random variables which take the value 1 with probability F(x) and 0 otherwise. If these form a stationary ergodic sequence, for example, we know that Fn (x,co) F(x) a.s. for each x IR . We may say that the strong law of large numbers holds pointwise on IR in such a case. Convergence is achieved at x for all co Cx, where P(Cx) 1. The problem is that to say that thefunctions Fn converge a.s. requires that a.s. convergence is achieved at each of uncountable set of points. We cannot appeal to 3.6(iii) to claim that P(nx E IRCx) 1 , and hence the assertion that Fn (x,co) F(x) with probability 1 at a point x not specified beforehand cannot be proved in this manner. This is a problem for a.s. convergence additional to the possibility of convergence breaking down at certain points of =
t= l
----7
E
E
=
an
----7
=
the parameter space, illustrated by 21.4. However, uniform convergence is the condition that suffices to rule out either difficulty. In this case, thanks to the special form of the c.d.f. which as we know is bounded, monotone, and right-continuous, uniform continuity can be proved by establishing a.s. convergence just at a countable collection of points of IR .
Fn(x,co) F(x) a.s. pointwise, for x sup I Fn(x,co)-F(x) I 0 a.s. Proof First define, in parallel with Fn, 21.5 Glivenko-Cantelli theorem If X
----7
----7
o
E
IR , then (21 .20)
Uniform Stochastic Convergence
333
(2 1 . 2 1 ) F�(x,w) F(x-) for all w in a set C�, where P(C�) __,
and note that integer > 1 let
m
=
Xjm inf{x E lR : F(x) � jim} , j 1 , .. . ,m - 1 , and also let Xom = and Xmm so that, by construction, F(Xjm-) - F(Xj-I ,m) � l im, j 1, ... ,m. =
-=
Lastly let
= +=,
=
1 . For an (21 .22)
=
(2 1 2 3 ) .
=
iFn(Xjm,W)-F(Xjm)l , IF�(Xjm ,W) -F(Xjm-) 1 l} (21 .24) Mmn(W) 1max{maxf 5j 5 m Then, for j 1 , . .. ,m and x E [Xj- l,m ,Xjm), 1 F(x) --m Mmn(W) � F(x1·- 1 ,m) -Mmn(W) � Fn(Xj- l ,m,(J)) � Fn(X,ffi) � F�(Xjm,W) 1 (21 .25) � F(Xjm-)+Mmn(W) � F(x) +-+Mmn(W). m That is to say, IFn(x,w) -F(x)l � llm +Mmn(W) for every x E IR . By pointwise strong con vergence we may say that limn�ooMmnCro) 0 for finite m, and hence that limn�ooSUPx I Fn(x,ro) -F(x) I � 1 /m , for all c;, where m = (2 1 . 2 6 ) c; n ccxmjn c�m). =j l But P(limm oo c;) 1 by 3.6(iii), and this completes the proof. =
(J) (It
�
=
•
=
Another, quite separate problem calling for uniform convergence is when a sample statistic is not merely a stochastic function of parameters, but is to be eval uated at a random point in the parameter space. Estimates of covariance matrices of estimators generally have this character, for example. One way such estimates are obtained is as the inverted negative Hessian matrix of the associated sample log-likelihood function, evaluated at estimated parameter values. The problem of proving consistency involves two distinct stochastic convergence phenomena, and it does not suffice to appeal to an ordinary law of large numbers to establish convergence to the true function evaluated at the true point. The following theorem gives sufficient conditions for the double convergence to hold. 21.6 Theorem Let (Q,c:J,P) be a probability space and (8,p) a metric space, and let H IR be c:J/:8-measurable for each 8 E If (a) e� � eo , and (b) � Q( e) uniformly on an open set Bo containing eo, where Q (e) is a nonstochastic function continuous at eo,
Qn: exn Qn(e)
e.
The Law of Large Numbers
334
then Qn(S� ) � Q(So) . Proof Uniform convergence in probability of Qn on Bo implies that, for any £ > 0 � 1 large enough that, for � and 8 > 0, there exists
(
N1
n N1,
)
(2 1 .27)
P sup I Qn(S) - Q (S) I < �£ � 1 - !8. B E Bo
Also, since S� � S0, there exists
P(s�
e
Nz such that, for � Nz, n
(21 .28)
Bo) � 1 - a8.
To consider the joint occurrence of these two events, use the elementary relation
P(A n B) � P(A) + P(B) - 1 . Since
{ 8� for
E
}
{
Bo } n sup I Qn(S ) - Q (S) I < 1£ B E Bo
n � max(NI ,Nz),
c
(21 .29)
{ I Qn(S�) - Q( S�) I < �£} , (21 .30)
P { I Qn(S�) - Q(S� ) I < �£} � 2( 1 - !P) - I = 1 - 18. (21 .3 1 ) Using continuity at S0 and 18.10(ii), there exists large enough that, for �
N3
N3,
n
(21 .32)
P( I Q(S�) - Q (So) I < �£) � 1 - �8. By the triangle inequality,
(21 .33) and hence
{ I QnCS�) - Q (S�) I < 1£} n { I Q(S� ) - Q(So) I < �£} � {
Applying (21 .29) again gives, for
I Qn(S�) - Q (So) I < £ } .
(2 1 .34)
n � max(N1 ,N2,N3),
P( I Qn(S�) - Q(S o) I < £) � 1 - 8. The theorem follows since 8 and £ are arbitrary. •
(21 .35)
Notice why we need uniform convergence here. Pointwise convergence would not allow us to assert (21 .27) for a which works for all S e B0• There would be the risk of a sequence of points existing in B0 on which is diverging. Suppose So = 0 and Gn(S) = Qn(S) - Q(S) in 21.4. A sequence approaching So, say { 1 /m, m e IN } , has this property; we should have
single N1
N1
(21 .36) P( I Qn(1/m) - Q(llm) I < �£) � 1 - !P for arbitrary £ > 0 and 8 > 0, for > m. Therefore we would not be able to
only n
Uniform Stochastic Convergence 335 claim the existence of a finite n for which (2 1 . 3 1 ) holds, and the proof collapses. In this example, the sequence of functions { Gn(9 ) } is continuous for each n, but
the continuity breaks down in the limit. This points to a link between uniform convergence and continuity. We had no need of continuity to prove the Glivenko Cantelli theorem, but the c.d.f. is rather a special type of function, with its behaviour at discontinuities (and elsewhere) subject to tight limitations. In the wider class of functions, not necessarily bounded and monotone, continuity is the condition that has generally been exploited to get uniform convergence results. 2 1 . 3 S tochastic Equicontinuity
Example 21.4 is characterized by the breakdown of continuity in the limit of the sequence of continuous functions. We may conjecture that to impose continuity uniformly over the sequence would suffice to eliminate failures of uniform conver gence. A natural comparison to draw is with the uniform integrability property of sequences, but we have to be careful with our terminology because, of course, uniform continuity is a well-established term for something completely different. The concept we require is or, to be more precise, see (5.47). Our results will be based on the following version of the ArzeUt-Ascoli theorem (5.28).
equicontinuity,
uniform equicontinuity;
n
asymptotic
21.7 Theorem Let Un(9 ), E [N } be sequence of (nonstochastic) functions on a totally bounded parameter space (8,p ) . Then, supe E e I fn( 9) I 0 if and only if and Un } is fn(e) --7 0 for all e E 8o, where 8o is a dense subset of asymptotically uniformly equicontinuous. o
--7 e,
n
The set IF = Un, E [N } u { 0 } , endowed with the uniform metric, is a subspace of ( Ce ,du), and by definition, convergence of fn to 0 in the uniform metric is the same thing as uniform convergence on 8. According to 5.12, compactness of IF is equivalent to the property that every sequence in IF has a cluster point. In view of the pointwise convergence, the cluster point must be unique and equal to 0, so that the conclusion of this theorem is really identical with the ArzeUt-Ascoli theorem, although the method of proof will be adapted to the present case. Where convenient, we shall use the notation wUn l>)
=
sup
sup I fn( 9') - fn(8) 1 .
(2 1 .37)
9 E 8 9'E S(8,0)
The function w(fn,.): IR + 1--7 IR + is called the of fn· Asymp totic uniform equicontinuity of the sequence Un } is the property that limsupnw(fn,8) -1- 0 as 8 -1- 0.
modulus of continuity
Proof of 21.7 To prove 'if : given £ > 0, there exists by assumption 8 > 0 to
satisfy
limsup w(fn,8) < £.
(2 1 .38)
The Law of Large Numbers
336
Since e is totally bounded, it has a cover {S(S))/2), i = 1 , ... ,m}. For each i, choose ei E eo such that p( Sj,Sj) < o/2 (possible because eo is dense in 8) and note that { S(Sj,O), i = 1 , ... ,m} is also a cover for e. Every e E e is contained in S(S i,o) for some i, and for this i,
I fn(S ) I
:::;
:::;
sup I fn(S') I
e' e s(6;,5)
sup I fnCS') - fCSD I + I f(8i) 1 .
(2 1 .39)
e' e S(S;,5)
We can therefore write sup I fn(S ) I
eee
:::;
max
sup I fn( S') - f(S i) I + max I f(Si) I J=:; i =:;m
J =:; i =:; m e' e S(e;,o)
(21 .40) I =:;; i =:;; m
Sufficiency follows on taking the limsup of both sides of this inequality. 'Only if follows simply from the facts that uniform convergence entails point wise convergence, and that w(fmO) :::; 2 sup I fn (S ) 1 . eee
•
(21 .41)
To apply this result to the stochastic convergence problem, we must define concepts of stochastic equicontinuity. Several such definitions can be devised, of which we shall give only two: respectively, a weak convergence (in pr.) and a strong convergence (a.s.) variant. Let (8,p) be a metric space and (0./!F,P) a probability space, and let { Gn(S ,co)), n E IN } be a sequence of stochastic functions Gn : e X Q. i---7 IR , ?1'/:B-measurable for each 8 E e. The sequence is said to be asymp totically uniformly stochastically equicontinuous (in pr.) if for all E > 0 3 o > 0 such that Iimsup P(w(GmO)
::?:
E) < E.
(21 .42)
And it is said to be strongly asymptotically uniformly stochastically equicontin uous if for all E > 0 3 o > 0 such that
( n-too
)
P umsup w(Gn.o) :::: E
=
0.
(2 1 .43)
Clearly, there is a bit of a terminology problem here ! The qualifiers 'asymptotic' and 'uniform' will be adopted in all the applications in this chapter, so let these be understood, and let us speak simply of stochastic equicontinuity and strong stochastic equicontinuity. The abbreviations s.e. and s.s.e. will sometimes be used.
Uniform Stochastic Convergence
337
2 1 .4 Generic Uniform Convergence
Uniform convergence results and their application in econometrics have been researched by several authors including Hoadley (197 1 ), Bierens ( 1989), Andrews ( 1987a, 1992), Newey ( 1 99 1 ), and Potscher and Prucha ( 1989, 1994). The material in the remainder of this chapter is drawn mainly from the work of Andrews and Potscher and Prucha, who have pioneered alternative approaches to deriving 'generic' uniform convergence theorems, applicable in a variety of modelling situations. These methods rely on establishing a stochastic equicontinuity condition. Thus, once we have 21.7, the proof of uniform almost sure convergence is direct.
n
21.8 Theorem Let { Gn(S), E IN } be a sequence of stochastic real-valued functions on a totally bounded metric space (8,p). Then sup I Gn(S) I ...!!4 0 8E 0
(2 1 .44)
if and only if (a) Gn(S) ...!!4 0 for each 8 E 8o, where 8o is a dense subset of (b) { Gn } is strongly stochastically equicontinuous.
e,
Proof Because (8,p) is totally bounded it is separable (5.7) and 80 can be chosen
to be a countable set, say 80 = { Sh k E IN } . Condition (a) means that for k = 1,2, ... there is a set with = 1 such that Gn(Sbro) 0 for ro E Condition (b) means that the sequences { Gn( ro) } are asymptotically equicontinuous for all ro E with = 1 . By the sufficiency part of 21.7, sup8E e i Gn(S,ro) l --7 0 for ro E c* = nk=l c = 1 by 3.6(iii), proving 'if . 'Only if' follows from the necessity part of 21.7 applied to { Gn(ro) } for each ro
Ck P(Ck) C, P(C) k (') C. P(C*)
e
c*.
--7
Ck.
•
The corresponding 'in probability' result follows very similar lines. The proof cannot exploit 21.7 quite so directly, but the family resemblance in the arguments will be noted. 21.9 Theorem Let { Gn(S), n E IN } be a sequence of stochastic real-valued functions on a totally bounded metric space (8,p ). Then sup I Gn(S) I � 0 8E 0 if and only if (a) Gn(S) � 0 for each 8 E 8o, where 8o is a dense subset of (b) { Gn } is stochastically equicontinuous.
S
Proof To show 'if ' let { ( S i, 8)
,
(2 1 .45)
e,
i = 1 ,. . . ,m} with ei E 8o be a finite cover for
8. This exists by the assumption of total boundedness and the argument used in the proof of 21.7. Then,
The Law of Large Numbers
338
(
2£) ::; P (�·�X
P :�� 1 Gn(8) 1
�
'
Lt_m e
sup
( I Gn (9') - Gn(9i) I + I Gn(Si) I )
) t c:) + P (Q { I Gn(SD I c:} )
eS(S;,o)
::; P(w(Gmo) � c:) + P ��: I Gn(SD I � c: � ::; P(w(Gn,o) �
2£)
�
m
::; P(w(Gmo) � £) + _LP( I Gn(Si) l i=l
where we used the fact that
{x+y � 2£ }
�
{x � c: }
c
u
�
(21.46)
£),
{y � £ }
(21.47)
for real numbers x and y, to get the third inequality. Taking the limsup of both sides of (a) and (b) impiy that
(21.46),
n--7oo P
limsup
(
sup
eee
I Gn(8) I
�
2c:) < £.
(21.48)
To prove 'only if' , pointwise convergence follows immediately from uniform convergence, so it remains to show that s.e. holds; but this follows easily in view of the fact (see that
(21.4 1))
P(w(Gn ,O) � c:) ::; P
(
sup
eee
I Gn(9) I
�
c:/2) .
(21.49)
•
0
There is no loss of generality in considering the case Gn � in these theorems. We canjust as easily apply them to the case where Gn(8) = Qn(8) - Qn(9) and Qn is a nonstochastic function which may really depend on n, or just be a limit func tion, so that Qn = Q. In the former case there is no need for Qn to converge, as long as Qn - Qn does. Applying the triangle inequality and taking complements in we obtain
(21.47),
(21. 50)
This means that { Qn - Qn} is s.e., or s.s.e. as the case may be, provided that { Qn} is s.e., or s.s.e., and { Qn} is asymptotically equicontinuous in the ordinary sense of This extension of 21.8 is obvious, and in 21.9 we can insert the step
§5. 5 .
(21.46),
0 1
(21. 5 1)
into where the second term on the right is or depending o n whether the indicated nonstochastic condition holds, and this term will vanish when n � N
Uniform Stochastic Convergence
339
for some N � 1 , by assumption. The s.e. and s.s.e. conditions may not be particularly easy to verify directly, and the existence of Lipschitz-type sufficient conditions could then be very convenient. Andrews (1 992) suggests conditions of the following sort. 21.10 Theorem Suppose there exists N � 1 such that
I Qn(S') - Qn(S) I
n
::;
(21 .52)
Bnh( p (S,8')), a.s.
holds for all 8,8' E 8 and � N, where h is nonstochastic and and { Bn } is a stochastic sequence not depending on e. Then (i) { Qn } is s.e. if Bn = Op (l). (ii) { Qn } is s.s.e. if limsupnBn < oo, a.s. Proof The definitions imply that w(Qn,8) ::; Bnh(8) a.s. for note that, for any £ > 0 and 8 > 0,
limsup P(w(Qn,8) � E)
::;
h(x)
-1-
0 as
x
-1-
0,
n � N. To prove (i), (21 .53)
limsup P(Bn � £/h(8)).
By definition of Op(l), the right-hand side can be made arbitrarily small by choosing £1h(8) large enough. In particular, fix £ > 0, and then by definition of we may take 8 small enough that limsupn�ooP(Bn � < £. For (ii), we have in the same way that, for small enough 8,
h
£/h(8))
(
) (
)
P limsup w(Qm8) � £ ::; P limsup Bn � £/h(8) < £. n �oo n �oo
•
(21 .54)
A sufficient condition for Bn Op (l) is to have Bn uniformly bounded in L1 norm, i.e., supnE(Bn) < oo (see 12.11), and it is sufficient for limsupnBn to be a.s. bounded if, in addition to this, Bn - E(Bn) � 0. The conditions of 21.10 offer a striking contrast in restrictiveness. Think of (21 .52) as a continuity condition, which says that Qn(8') must be close to Qn(8) when 8' is close to e. When Qn is stochastic these conditions are very hard to satisfy for Bn, because random changes of scale may lead the condition to be violated from time to time even if Qn(S,oo) is a continuous function for all w and The purpose of the factor Bn is to allow for such random scale variations. Under s.e., we require that the probability of large variations declines as their magnitude increases; this is what Op(l) means. But in the s.s.e. case, the requirement that { Bn } be bounded a.s. except for at most a finite number of terms implies that { Qn } must satisfy the same condition. This is very restrictive. It means for example that Qn(8) cannot be Gaussian, nor have any other distribution with infinite support. In such a case, no matter what { Bn } and were chosen, the condition in (21 .52) would be violated eventually. It does not matter that the probability of large deviations might be extremely small, because over an number of sequence coordinates they will still arise with probability 1 . Thus, strong uniform convergence is a phenomenon confined, as far as we are able to show, to a.s. bounded sequences. Although (21 . 52) is only a sufficient condition, it can be verified that this feature of s.s.e. is implicit in the =
fixed
n.
h
infinite
340
The Law of Large Numbers
definition. This fact puts the relative merits of working with strong and weak laws of large numbers in a new light. The former are simply not available in many important cases. Fortunately, 'in probability' results are often sufficient for the purpose at hand, for example, determining the limits in distribution of estimators and sample statistics; see §25 . 1 for more details. Supposing (8,p) c (lRk,d£) , suppose further that Qn (O) is differentiable a.s. at each point of 8; to be precise, we must specify differentiability a.s. at each point of an open convex set 8* containing 8. (A set B c IRk is said to be convex if x E B and y E B imply Ax + (1 - A.)y E B for A E [0, 1].) The mean value theorem yields the result that, at a pair of points 8,8' E 8* , 22 (2 1 .55) where 8 * E 8* is a point on the line segment joining 8 and 8', which exists by convexity of 8*. Applying the Cauchy-Schwartz inequality, we get
I Qn (B) - Qn (B') I
where
l
I
k aQn I Oj ' · l I ei I a e i I S=e
s�
S Bn ii B - B'll a.s.,
(2 1 .56)
I � j .1 .
(2 1 . 57)
a n 8n = sup a _ S-O 9 * E 8*
Here II . 11 denotes the Euclidean length, and dQnldB is the gradient vector whose elements are the partials of Qn with respect to the Oi. Clearly, (2 1 .52) is satisfied by taking h as the identity function, and Bn defined in (2 1 .57) is a random variable for all n. Subject to this condition, and Bn satisfying the conditions specified in 21.10, a.s. differentiability emerges as a sufficient condition for s.e .. 2 1 .5 Uniform Laws of Large Numbers
In the last section it was shown that stochastic equicontinuity (strong or in pr.) is a necessary and sufficient condition to go from pointwise to uniform convergence (strong or in pr.). The next task is to find sufficient conditions for stochastic equicontinuity when { Qn(O) } is a sequence of partial sums, and hence to derive uniform laws of large numbers. There are several possible approaches to this problem, of which perhaps the simplest is to establish the Lipschitz condition of 21.10. 21.11 Theorem Let { { qnr( ro,O) } �= 1 } �= 1 denote a triangular array of real stochastic functions with domain (8,p), satisfying, for � 1 ,
N
Uniform Stochastic Convergence
341
I qnr(S') - qnr(S) I � Bnrh(p(8 , 8')), a.s., (2 1 .58) for all 8,8' e 8 and 2 N, where is nonstochastic and (x) -1, 0 as x -1, 0, and { Bnr l is a stochastic array not depending on a with L�= !E(Bnr) = 0( 1 ). If Qn(S) = L�= l qnr(S ), then (i) Qn is s.e. ; (ii) Qn is s.s.e. if L�=!(Bnr - E(Bnr)) � 0.
n
h
h
Proof For (i) it is only necessary by 21.10(i) and the triangle inequality to
establish that I�= !Bn1 = Op (l). This follows from the stated condition by the Markov inequality. Likewise, (ii) follows directly from 21.10(ii). • A second class of conditions is obtained by applying a form of s.e. to the summands. For these results we need to specify Gn to be an unweighted average of functions, since the conditions to be imposed take the form of Cesaro summability of certain related sequences. It is convenient to confine attention to the case
n
Gn(ro,S)
=
1 n - L (qtCXr(CO),S) - E(qr(Xr,S ))),
n
t=l
(2 1 .59)
where X1 e tz is a random element drawn from the probability space (tz,X, �1). Typically, though necessarily, X1 is a vector of real r.v.s with K a subset of m !R , m 2 1 , X being the restriction of 13m to K. The point here is not to restrict the form of the functional relation between q1 and ro, but to specify the existence of marginal derived measures �1, with �1(A) = P(X1 E A) for A e X. The usual context will have Gn the sample average of functions that are stochastic through their dependence on some kind of data set, indexed on t. The functions themselves, not just their arguments, can be different for different t. We must find conditions on both the functions qr(.,.) and the p.m.s J.l1 which yield the s.e. condition on Gn. The first stage of the argument is to establish conditions on the stochastic functions q1(8) which have to be satisfied for s.e. to hold. Andrews (1 992) gives the following result.
not
21.12 Theorem If (a) there exists a positive stochastic sequence { d1} satisfying sup I qr (S ) I � d,, all
and
eee
1 n . hmsup _L E(drl ldt>M} )
t
---7 0 as M ---7 oo ; n�oo t=l (b) for every £ > 0, there exists 0 > 0 such that n 1 limsup - _L P(w(q1,0) > E) < £;
n
-
n�oo
n
t=l
(2 1 .60)
(2 1 .6 1 )
(21 .62)
The Law of Large Numbers
342
Gn is s.e. Condition (21. 6 1) is an interesting Cesaro-sum variation on uniform integrability, and actual uniform integrability of { d1} is sufficient, although not necessary. Condition (a) is a domination condition, while condition (b) is called by Andrews termwise stochastic equicontinuity. Proof Given £ > 0, choose M such that limsupn�oon I7= t E(2d11 < t£2 , and then o such that limsup ! i, r(w(q1,o) > !£2 ) < !M-1 £2• (21.63) n�oo n then
o
-I
1 2dr>MJ)
t:=l
The first thing to note is that
w((q1 - E(qr)),o) ::; w(qr,ro) + w(E(qr), o) $ w(q1,ffi) +E(w(q1,0)),
(21.64)
where the last inequality is an application of 21.3. Applying Markov's inequality,
P(w(Gn,O) > £) ::; r (* i w(q1 -E(q1),o) > £)
(21.64) and using
t= l
(21.65) 1.
where the indicator functions in the last member add up to Using the fact that > M } , and taking the limsup, we now and hence > M} c $ obtain n £2 £2 >6 limsup > £) $ 6 + M limsup ;z L P E n�oo t:=l n�oo
w(q1,0) 2d1,
{ w(q1,0) P(w(Gn,O) 2 [
{2d1
::; £, in view of the values chosen for M and
1 (w(q1,0) )
o.
•
(21.66)
Uniform Stochastic Convergence
343
Clearly, whether condition 21.12(a) is satisfied depends on both the distribution of X1 and functional form of qr(.). But something relatively general can be said about termwise s.e. (condition 21.12(b)). Assume, following Potscher and Prucha (1989), that p
(2 1 .67) L rk1(x)skr(x,9), k=i where rkr: � � IR, and Skt(., S): � � IR for fixed e , are :X/:B-measurable functions.
qr(x,S)
=
The idea here is that we can be more liberal in the behaviour allowed to the factors rkt as functions of Xr than to the factors sk1; discontinuities are permitted, for example. To be exact, we shall be content to have the rkt uniformly L1-bounded in Cesaro mean: 1 n sup - :L E 1 rkr(X1) 1 :5 B < oo, k n
n
t=l
=
1 , . . . ,p.
(21 .68)
As to the factors skr(x, S ), we need these to be asymptotically equicontinuous for a sufficiently large set of x values. Assume there is a sequence of sets {Km E :X, m = 1 ,2, ... } , such that 1 n limsup - L �r(K�) n�oo
and that for each
m
n t=i
�
0 as
m �
oo,
(2 1 .69)
� 1 and E > 0, there exists 8 > 0 such that
limsup sup w(skr(x, .),8) < E, k
=
1 , . . . ,p.
(2 1 . 70)
Notice that (2 1 . 70) is a nonstochastic equicontinuity condition, but under condi tion (2 1 .69) it holds (as one might say) 'almost surely, on average ' when the r.v. Xr is substituted into the formula. These conditions suffice to give termwise s.e., and hence can be used to prove s.e. of Gn by application of 21.12. 21.13 Theorem If qrCXr,S) is defined by (21 .67), and (2 1 . 68), (2 1 .69), and (2 1 .70) hold, then for every E > 0 there exists 8 > 0 such that 1 n limsup - :L P(w (q,,8) > E) < E. n�oo
Proof Fix
n t=i
E > 0, and first note that
(±k=l l rkr l w(skr,8) > E) :5 P ( LJ{ I rkr I w(skr,8) > Elp } ) k=i
P(w(qr,8) > E) :5 P
(2 1 .7 1 )
The Law of Large Numbers
344 p
::;; _L P(I rkrl w(skr.8) > Elp) ,; � [P(ir,,jw(s.,,O)i K. > 2�) + P ( I rkt I w(skr.8) 1 K,;. > 2�) l (21. 72) Consider any one of these p terms. Choose m large enough that (21.73) and for this m choose 8 small enough that limsup sup w(skr(x,. ) , 8 ) < -(21.74) 4Bp EK x m k=l
n �oo
2 £
2.
Then, by the Markov inequality,
n
1 _L E I rktl 28Ep
::;; limsup n n�oo t= l and by
(21.75)
(21.73), (21.76)
Substituting these bounds into
(21.72) yields the result.
•
v THE CENTRAL LIMIT THEOREM
22 Weak Convergence of Distributions
22. 1 B asic Concepts
The objects we examine in this part of the book are not sequences of random vari ables, but sequences of marginal distribution functions. There will of course be associated sequences of r. v .s generated from these distributions, but the concept of convergence arising here is quite distinct. Formally, if { Fn }i is a sequence of c.d.f.s, we say that the sequence converges weakly to a limit F if Fn(x) ----7 F(x) pointwise for each x e C, where C � [R is the set of points at which F is continu ous. Then, if Xn has c.d.f. Fn and X has c.d.f. F, we say that Xn converges in distribution to X. These terms are in practice used more or less interchangeably for the distributions and associated r.v.s. Equivalent notations for weak convergence are Fn � F, and Xn � X. Although the latter notation is customary, it is also slightly irregular, since to say a sequence of r.v.s converges in distribution means only that the limiting r.v. has the given distribution. If both X and Y have the distribution specified by F, then Xn � X and Xn � Y are equivalent statements. Moreover, we write things like Xn � N(O, 1 ) to indicate that the limiting distribution is standard Gaussian, although 'N(O, 1 )' is shorthand for 'a r.v. having the standard Gaussian distribu tion' ; it does not denote a particular r.v .. Also used by some authors is the notation '�' standing for 'convergence in probability law' , but we avoid this form because of possible confusion with convergence in Lp-norm. Pointwise convergence of the distribution functions is all that is needed, remembering that F is non-decreasing, bounded by 0 and 1 , and that every point is either a continuity point or a jump point. It is possible that F could possess a jump at a point x0 which is a continuity point of Fn for all finite n, and in these cases Fn(x0) does not have a unique limit since any point between F(x0-) and F(x0) is a candidate. But the jump points of F are at most countable in number, and according to 8.4 the true F can be constructed by assigning the value F(x0) at every jump point x0; hence, the above definition is adequate. If ll represents the corresponding probability measure such that F(x) = !l(( -oo, x]) for each x e [R, we know (see §8.2) that ll and F are equivalent repre sentations of the same measure, and similarly for !ln and Fn. Hence, the statement J..ln � ll is equivalent to Fn � F. The corresponding notion of weak convergence for the sequence of measures {!ln } is given by the following theorem. 22.1 Theorem lln � !l iff !ln(A) ----7 !l(A) for every A e 'B for which !l(dA) = 0.
o
The proof of this theorem is postponed to a later point in the development. Note
The Central Limit Theorem
348
meanwhile that the exclusion of events whose boundary points have positive proba bility corresponds to the exclusion of jump points of F, where the events in question have the form oo ,x . Just as the theory of the expectation is an application of the general theory of integrals, so the theory of weak convergence is a general theory for sequences of finite measures. The results below do not generally depend upon the condition J..ln(lR ) for their validity, provided definitions are adjusted appropriately. However, a serious concern of the theory is whether a sequence of distribution functions has a distribution function as its limit; more specifically, should it follow because J..ln (lR ) for every that This is a question that is taken up in Meanwhile, the reader should not be distracted by the use of the convenient notations £(.) and P(.) from appreciating the generality of the theory.
{ (- ] }
=1
§22.5. = 1
n J.l. {lR) = 1?
{
) n=
22.2 Example Consider the sequence of binomial distributions B(n)Jn , ... } , where the probability of successes in n Bernoulli trials is given by
1,2,3,
x = x) = (:) (A/n)\1 - Aint-x, x = O, ... ,n
P(Xn
(22.1)
(see 8.7). Here, I� i s a constant parameter, so that the probability of a success falls linearly as the number of trials increases. Note that E(Xn) = 'A for every x For fixed as ---7 oo, and taking the binomial expansion of - � oo, whereas ---7 1 . We shows that � e -J.. as may therefore conclude that
1/x! n x, (�)n (1 -Aint (1 -Aint P(Xn
and accordingly, Fn(a)
=
n.
(1 -A/n)-x
n�
= x) � X.A� e-A, x = 0,1,2, ... , L
O$x$a
P(Xn
= x) � e-A. L AX�. O$x$a
(22.2) (22.3)
at all points a < oo. Thus the limit (and hence the weak limit) of the sequence is the Poisson distribution with parameter A. o
{B(n,Ain)}
[0, 1] is defined by = x) = { 1/n, x = iln , i = 1, ... ,n. (22.4) 0, otherwise This sequence actually converges weakly to Lebesgue measure m on [0, 1], although this fact may be less than obvious; it will be demonstrated below. For any x [0,1], J.l. n([O,x]) = [nx]/n � x = m([O,x]), where [nx] denotes the largest integer less than nx. There are sets for which convergence fails, notably the set of all rationals in [0,1], in view of the fact that = 1 for every n, and m((Q[o. n ) = 0. But [0,1] and = 1, thus the definition of weak convergence in 22.1 is not violated. 22.3 Example A sequence of discrete distributions on P(Xn
E
(Q [O,l J
(Q r0,11 =
m(o((Qr0,11)) o
J..ln ((Q [O, l])
Weak Convergence of Distributions
349
Although convergence in distribution is fundamentally different from converg ence a.s. and in pr., the latter imply the former. In the next result, · � · can be substituted for ·� ' , by 18.5.
0,
22.4 Theorem If Xn � X, then Xn � X. Proof For £ >
we have
P(Xn ::::; x) = P({Xn ::::; x} n { I Xn - X I ::::; £})
+ P( { Xn ::::; X} rl { I Xn - X I > £})
� P(X ::::; x + £) + P( I Xn - X I > £),
(22.5)
where the events whose probabilities appear on the right-hand side of the inequal ity contain (and hence are at least as probable as) the corresponding events on the left. P( I Xn - X I > £) -7 by hypothesis, and hence
0
limsup P(Xn ::::; x) � P(X � x + £).
(22.6)
Similarly, P(X � x - £)
P( {X � x - £} n { I Xn - XI � £})
=
+ P( { X � X - £} (l { I Xn - X I > £ } )
� P(Xn � x) + P( I Xn - X I > £),
(22.7)
P(X � x - £) � liminf P(Xn ::::; x).
(22.8)
and so
0,
Since £ is arbitrary, it follows that limn--+ooP(Xn � x) = P(X ::::; x) at every point x for which P(X = x) = such that lime,!, oF(X � x - £) = P(X � x). This condition is equivalent to weak convergence. •
The converse of 22.4 is not true in general, but the two conditions are equiva lent when the probability limit in question is a constant. A degenerate distribu tion has the form F(x)
=
{0,
x
1, x
;?:
a
(22.9)
If a random variable is converging to a constant, its c.d.f. converges to the step function (22.9), through a sequence of the sort illustrated in Fig. 22. 1 . 22.5 Theorem Xn converges in probability to a constant a iff its c.d.f. converges to a step function with jump at a. Proof For any £ >
0
PC I Xn - a l < £)
=
P(a - £ � Xn � a + £)
The Central Limit Theorem
350
=
Fn(a + £) - Fn((a - £)-). (22. 10) Convergence to a step function with jump at a implies limn�ooFn(a + £) = F(a + £) = 1 , and similarly limn�ooFn((a - £)-) = F((a - £)-) = 0 for all £ > 0. The suffi ciency part follows from (22. 10) and the definition of convergence in probability. For the necessity, let the left-hand side of (22.10) have a limit of 1 as n -7 oo, for all £ > 0. This implies lim [Fn(a + £) - Fn((a - £)-) ]
=
1.
(22. 1 1)
Since 0 � F � 1 , (22. 1 1) will be satisfied for all £ > 0 only if F(a) F(a-) = 0, which defines the function in (22.9). •
a
=
1 and
X
Fig. 22. 1 . 22.2 The S korokhod Representation Theorem
Notwithstanding the fact that Xn � X does not imply Xn ...E4 X, whenever a sequence of distributions { Fn } converges weakly to F one can construct a sequence of r.v.s with distributions Fn, which converges almost surely to a limit having distribution F. Shown by Skorokhod (1956) in a more general context (see §26.6), this is an immensely useful fact for proving results about weak convergence. Consider the sequence { Fn } converging to F. Each of these functions is a mono tone mapping from [R to the interval [0,1]. The idea is to invert this mapping. Let a random variable ro be defined on the probability space ([0,1],i3ro. 1 1 ,m), where :B ro.I J is the Borel field on the unit interval and m is the Lebesgue measure. Define for ro E [0, 1] Yn(W)
=
inf{x: ro � Fn(x) } .
(22. 1 2)
In words, Yn is the random variable obtained by using the inverse distribution function to map from the uniform distribution on [0,1 ] onto iR, taking care of any discontinuities in Fn 1 (ro) (corresponding to intervals with zero probability mass under Fn) by taking the infimum of the eligible values. Yn is therefore a non decreasing, left-continuous function. Fig. 22.2 illustrates the construction,
Weak Convergence of Distributions
35 1
essentially the same as used in the proof of 8.5 (compare Fig. 8.2). When Fn has discontinuities it is only possible to assert (by right-continuity) that Fn(Yn(w)) :2: ro, whereas Yn(Fn(x)) ::::; x, by left-continuity of Yn. - The first important feature of the Skorokhod construction is that, for any constant a E IR ,
(22. 13) where the last equality follows from the fact that w is uniformly distributed on [0, 1 ] . Thus, Fn is the c.d.f. of Yn. 23 Letting F be a c.d.f. and Y the r.v. corres ponding to F according to (22. 12), the second important feature of the construc tion is contained in the following result. 1
0 �----�--+ X Yn(W2) = Yn(W3)
Fig. 22.2 22.6 Theorem If Fn
=>
F then Yn -----7 Y a.s.[m] as n
-----7
=. o
In working through the proof, it may be helpful to check each assertion about the functions F and Y against the example in Fig. 22.3. This represents the extreme case where F, and hence also Y, is a step function; of course, if F is everywhere continuous and increasing, the mappings are 1-1 and the problem becomes trivial. Proof Let w be any continuity point of Y, excluding the end points 0 and
1 . For
any E > 0, choose x as a continuity point of F satisfying Y( co) - E < x < Y( ro). Given the countability of the discontinuities of F, such a point will always exist, and according to the definition of Y, it must have the property F(x) < w. If Fn(x) -----7 F(x), there will be n large enough that Fn(x) < W, and hence x < Yn(W), by definition. We therefore have Y(w) - E < x < Yn(CO).
(22. 14)
Without presuming that limn oo Yn(w) exists, since E is arbitrary (22. 14) allows us to conclude that liminfn�ooYn(W) :2: Y(w). Next, choose y as a continuity point of F satisfying Y(w) < y < Y(w) + E. The properties of F give co ::::; F(Y(co)) ::::; F(y). For large enough n we must also have co ::::; �
The Central Limit Theorem
352
and hence, again by definition of Yn, Yn(w) � y Y(w) + £. (22. 15) In the same way as before, we may conclude that limsupn�oofn(ffi) � Y(w). The superior and inferior limits are therefore equal, and limn�oofn(W) = Y(w). Fn (y) ,
<
F(y), w' 0)
F(x)
A
- - - - - - - - - - - -----------------------------------
----------------------------- ---------·---------------------
B
T
Y(w) - £
.
X
T Y( w), Y(w')
T
Y
Y( w) + £
Yn( w')
Fig. 22.3 This result only holds for continuity points of Y. However, there is a 1-1 correspondence between the discontinuity points of Y and intervals having zero probability under J.l in IR . A collection of disjoint intervals on the line is at most countable (1.11), and hence the discontinuities of Y (plus the points 0 and 1 ) are countable, and have Lebesgue measure zero. Hence, Yn ---7 f w.p. l [m], as asserted. • In Fig. 22.3, notice how both functions take their values at the discontinuities at the points marked A and B. Thus, F(Y(w)) = w' > w. Inequality (22. 15) holds for ro, but need not hold for ro', a discontinuity point. A counter-example is the sequence of functions Fn obtained by vertical translations of the fixed graph from below, as illustrated. In this case Yn(w') > Y(w') + £ for every n. 22.7 Corollary Define random variables Y�, so that f�(w) = Yn(W) at each w where the function is continuous, and f�(w) = 0 at discontinuity points and at w = 0 and 1 . Define Y' similarly. If Fn => F then f�(ro) ---7 Y'(ro) for every w E [0, 1], and Fn and F are the distribution functions of Y� and Y'. Proof The convergence for every ro is immediate. The equivalence of the distribu tions follows from 8.4, since the discontinuity points are countable and their complement is dense in [0, 1], by 2.10. • In the form given, 22.6 does not generalize very easily to distributions in IR k for k > 1 , although a generalization does exist. This can be deduced as a special case of 26.25, which derives the Skorokhod representation for distributions on general metric spaces of suitable type. A final point to observe about Skorokhod' s representation is its generalization
Weak Convergence of Distributions
353
to any finite measure. If Fn is a non-decreasing right-continuous function with codomain [a, b] , (22. 12) defines a function Yn(co) on a measure space ([a,b], :B[a,b]• m), where m is Lebesgue measure as before. With appropriate modifications, all the foregoing remarks continue to apply in this case. The following application of the Skorokhod representation yields a different, but equivalent, characterization of weak convergence. 22.8 Theorem Xn � X iff lim E(f(Xrz)) = E(f(X) ) (22. 16) n--)oo for every bounded, continuous real function f. o The necessity half of this result is known as the Helly-Bray theorem. Proof To prove sufficiency, construct an example. For a E [R and 8 > 0, let 1, x $ a-8 f(x) = (a - x)/8, a - 8 < x $ a (22. 1 7) 0, x>a
{
We call this the 'smoothed indicator' of the set (-oo, a]. (See Fig. 22.4.) It is a continuous function with the properties
f
Fn(a - 8) $ fdFn $ Fn(a), all F(a - 8)
n,
:::; JfdF :::; F(a).
(22. 1 8) (22. 19)
By hypothesis, ffdFn � ffdF, and hence
J
(22.20)
limsup Fn(a-) $ F(a),
(22.21)
F(a-) $ liminf Fn(a ).
(22.22)
limsup Fn(a - 8) .$ fdF $ liminf Fn(a). n�oo n�oo Letting 8 ·� 0, combining (22. 19) and (22.20) yields
These inequalities show that IimnFn(a) exists and is equal to F(a) whenever F(a-) = F(a), that is, Fn � F. To prove necessity, let f be a bounded function whose points of discontinuity are contained in a set D1, where �(D1) = 0, � being the p.m. such that F(x) = �(( -oo,.x]). When Fn � F CFn being the c.d.f. of Xn and F that of X) Y�(co) � Y'(co) for every co e [0, 1], where Y�(co) and Y'(co) are the Skorokhod variables defined in 22.7. Since m(co: Y'(co) e D1) = �(D1) = 0, f(Y�) � f(Y') a.s. [�] by 18.8(i). The bounded convergence theorem then implies E(f(Y�)) � E(f(Y')), or
The Central Limit Theorem
354
ff(y)dJln(y) � ff(y)dJl(y),
(22.23)
where Jln is the p.m. corresponding to Fn. But 9.6 allows us to write 1 (22.24) f(y)dJln(y) = ydJlnf- (y) = xdJlnf- 1 (x) = E(f(Xn)),
f
f
f
with a similar equality for E(f(X)) . (The trivial change of dummy argument from y to x is just to emphasize the equivalence of the two formulations.) Hence we have E(f(Xn )) ---7 E(f(X)) . The result certainly holds for the case D1 = 0, so 'only if is proved. •
/(x )
a- 8 a
X
Fig. 22.4 Notice how the proof cleverly substitutes ([0, 1] ,:Bro, 1 1 , m) for the fundamental probability space (Q, ':J,P) generating {Xn }, exploiting the fact that the derived distributions are the same. This result does not say that the expectations converge only for bounded continuous functions; it is simply that convergence is implied at least for all members of this large class of functions. The theorem also holds if we substitute any subclass of the class of bounded continuous functions which contains at least the smoothed indicator functions of half-lines, for example the bounded uniformly continuous functions. 22.9 Example We now give the promised proof of weak convergence for 22.3. Clearly, in that example, 1 n f(iln). (22.25) fdJln = n L i==l
f
-
The limit of the expression on the right of (22.25) as n � oo is by definition the Riemann integral of f on the unit interval. Since this agrees with the Lebesgue integral, we have a proof of weak convergence in this case. o We shall subsequently require the generalization of Theorem 22.8 to general finite measures. This will be stated as a corollary, the modifications to the proof being left to the reader to supply; it is mainly a matter of modifying the notation to suit.
Weak Convergence of Distributions
355
22.10 Corollary Let {Fn } be a sequence of bounded, non-decreasing, right continuous functions. Fn � F if and only if
JtdFn � ftdF
(22.26)
for every bounded, continuous real function f. o Another proof which was deferred earlier can now be given. Proof of 22.1 To show sufficiency, consider A = (-oo, x], for which dA = {x} . Weak convergence is defined by the condition J.ln( { ( -oo, x] } ) � J.l({ ( -oo, x] }) whenever J.l( { x}) = 0. To show necessity, consider in the necessity part of 22.8 the case f(x) = lA(x) for any A E 'B. The discontinuity points of this function are contained in the set dA, and if J.l(dA) = 0, we have J.ln(A) � J.l(A) as a case of (22. 16), when Fn � F. • 22.3 Weak Convergence and Transformations
The next result might be thought of as the weak convergence counterpart of 18.8. 22.1 1 Continuous mapping theorem Let h: IR f----7 IR be Borel-measurable with discontinuity points confined to a set Dh, where J.l(Dh) = 0. If J.ln � J.l, then J.lnh 1 1 � J.lh - . Proof By the argument used to prove the Helly-Bray theorem, h(Y�) � h(Y') a.s.[J.l]. It follows from 22.4 that h(Y�) � h(Y'). Since m(ro: Y�(w) E A) = J.ln(A), (22.27) for each A E 'B, using 3.21. Similarly, m(h(Y') E A) = J.lh - 1 (A). According to the definition of weak convergence, h(Y�) � h(Y') is equivalent to J.lnh - 1 � J.lh - 1 . • -
22.12 Corollary If h is the function of 22.11 and Xn � X, then h(Xn) � h(X). Proof Immediate from the theorem, given that Xn - Fn and X - F. • 22.13 Example If Xn � N(O, l) , then X� � x2 (1). o
Our second result on transformations is from Cramer (1946), and is sometimes called Cramer's theorem: 22.14 Cramer's theorem If Xn � X and Yn � a (constant), then (i) (Xn + Yn) � (X + a). (ii) YnXn � aX. (iii) XnfYn � X!a, for a :f. 0. Proof This is by an extension of the type of argument used in 22.4.
The Central Limit Theorem
356
+ P(Xn + Yn ::::; X, I Yn - a I :2:
E).
P(Xn ::::; x - a - E) ::::; P(Xn + Yn ::::; x) + P( I Yn -: a l
:2:
::::; P(Xn ::::; x - a + E) + P( I Yn - a l
:2:
E)
(22.28)
Similarly, E),
(22.29)
and putting these inequalities together, we have P(Xn ::::; x - a - E) - P( I Yn - a l
:2:
E) ::::; P(Xn + Yn ::::; x)
::::; P(Xn ::::; x - a + E) + P( I Yn - a l :2: E) .
(22.30)
Let Fxn and Fxn+ Yn denote the c.d.f.s of Xn and Xn + Yn respectively, and let Fx be the c.d.f. of X, such that Fx = Iimn�ooFxn(x) at all continuity points of Fx. Since Iimn�=P( I Yn - a I :2: E) = 0 for all E > 0 by assumption, (22.30) implies
(22.3 1 )
n �oo
Taking E arbitrarily close to zero shows that
(22.32) whenever x - a is a continuity point of FX· This proves (i). To prove (ii), suppose first that a > 0. By taking E > 0 small enough we can ensure a - E > 0, and applying the type of argument used in (i) with obvious variations, we obtain the inequalities P(Xn(a + E) ::::; x) - P( I Yn - a I :2: E) ::::; P(XnYn ::::; x)
::::; P(Xn(a - E) ::::; x) + P( I Yn - a I
:2:
E).
(22.33)
Taking limits gives Fx(x/(a + E)) ::::; liminfFxnrJx) n�oo ::::; limsup Fxn yn(x) ::::; Fx(xl(a - E)), n �oo
and thus
<
(22.34) (22.35)
n �oo
If a 0, replace Yn by - Yn and a by -a, repeat the preceding argument, and then apply 22.12. And if a = 0, (22.33) becomes P(XnE ::::; x) - P( I Yn I
:2:
E) ::::; P(XnYn ::::; x)
Weak Convergence of Distributions
357 (22.36)
<
For x > 0, this yields in the limit FXnYn (x) = 1, and for x 0, FXnYn(x) = 0, which defines the degenerate distribution with the mass concentrated at 0. In this case XnYn � 0 in view of 22.5. (Alternatively, see 18.12.) To prove (iii) it suffices to note by 18.10(ii) that plim llYn = lla if a =t. 0. Replacing Yn by llYn in (ii) yields the result directly. • 22.4 Convergence of Moments and Characteristic Functions
Paralleling the sequence of distribution functions, there may be sequences of moments. If Xn � X where the c.d.f. of X is F, then E(X) = fxdF(x), where it exists, is sometimes called the asymptotic expectation of Xn. There is a tempta tion to write E(X) = limn�ooE(Xn), but there are cases where E(Xn) does not exist for any finite n while E(X) exists, and also cases where E(Xn) exists for every n but E(X) does not. This usage is therefore best avoided except in specific circum stances when the convergence is known to obtain. Theorem 22.8 assures us that expectations of bounded random variables converge under weak convergence of the corresponding measures. The following theorems indicate how far this result can be extended to more general cases. Recall that El XI is defined for every X, although it may take the value +oo. 22.15 Theorem If Xn � X then E I XI � liminfn�ooE I Xnl · Proof The function ha(x) = I x I I { I xi::; a } is real and bounded. If P( I XI = a) = 0, it follows by 22.11 that ha(Xn) � ha(X), and from 22.8 (letting f be the identity function which is bounded in this case) that E(ha(X))
=
lim E(ha(Xn)) � liminf E I Xn l ·
(22.37)
The result follows on letting a approach += through continuity points of the distribution. • The following theorem gives a sufficient condition for E(X) to exist, given that E(Xn) exists for each n. 22.16 Theorem If Xn � X and {Xn } is uniformly integrable, then E I XI oo and E(Xn) ---7 E(X). Proof Let Yn and Y be the Skorokhod variables of (22. 12), so that Yn --E4 Y. Since Xn and Yn have the same distribution, uniform integrability of {Xn } implies that of { Yn } . Hence we can invoke 12.8 to show that E(Yn) ---7 E(Y), Y being integrable. Reversing the argument then gives E I XI oo and E(Xn) ---7 E(X) as required. • Uniform integrability is a sufficient condition, and although where it fails the existence of E(X) may not be ruled out, 12.7 showed that its interpretation is questionable in these circumstances. A sequence of complex r.v.s which is always uniformly integrable is f e itXn l. for
<
<
The Central Limit Theorem any sequence { since I eitXn I = 1 . Given the sequence of characteristic func tions {
Xn}, F.
Xn � X
F,
Fn F,
X.
Fn}
F
Fn
(22.39)
F
fdF
and <)>(t) is continuous at the point t = 0, then is a c.d.f. (i.e., = 1) and <)> is its ch.f. o The fact that the conditions imposed on the limit F in this theorem are not un reasonable will be established by the Helly selection theorem, to be discussed in the next section. Proof Note that = = 1 for any n, by (22.39) and the fact that is a c.d.f. For v > 0, e = (22.40) (t) t =
n(O) fdFn
Fn
�f>n d s:: (�f:eitxdt) dFn s:: ( i::: 1) dFn,
the change in order of integration being permitted by 9.32. By 22.10, which extends to complex-valued functions by linearity of the integral, we have, as n -7 oo, 1 5 <)>(t)dt, e e 1 = (22.41) -7 . lVX V 0 lVX
f+"" ( ivx - 1) dFn f+"" ivx - ) dF - v oc ( oc .
-
-
where the equality is by (22.40) and the definition of <)>. Since <)> is continuous at t = 0, lim v 1 (}<)>(t) t = <)>(0) and since -7 <)> we must have <)>(0) = 1 . It follows from (22.41 ) that
v oo f d ---t
-
+oo lim eivx : ) dF +oodF. (22.42) J-oo ( iv f-oo In view of the other conditions imposed, this means F is a c.d.f. It follows by 22.8 that <)>(t) E(e itX) where X is a random variable having c.d.f. F. This 1
=
1
=
completes the proof.
V---tO
•
=
Weak Convergence of Distributions
359
The continuity theorem provides the basic justification for investigating limit ing distributions by evaluating the limits of sequences of ch.f.s, and then using the inversion theorem of § 1 1 .5. The next two chapters are devoted to developing these methods. Here we will take the opportunity to mention one useful applica tion, a result similar to 22.4 which may also be proved as a corollary. 22.18 Theorem If I Xn - Zn l � 0 and {Xn } converges in distribution, then {Zn } converges in distribution to the same limit. Proof I eitXn - eitZn I � 0 by 18.9(ii). Since I eitX I = 1 these functions are Leo-bounded, and the sequence { I eitXn - eitZn I } is uniformly integrable. So by 18.14, I eitXn - eitZn I � 0. However, the complex modulus inequality (11.3) gives (22.43) so that a further consequence is I
=
22.5 Criteria for Weak Convergence
Not every sequence of c.d.f.s has a c.d.f. as its limit. Counter-examples are easy to construct. 22.19 Example Consider the uniform distribution on the interval [-n,n], such that Fn(a) = �(1 + aln), -n :::; a :::; n. Then Fn(a) --7 1 for all a E IR . D 22.20 Example Consider the degenerate r.v., Xn n w.p. l . The c.d.f. is a step function with jump at n. Fn(a) ---7 0, all a E IR . o Although Fn is a c.d.f. for all n, in neither of these cases is the limit F a c.d.f., in the sense that F(a) ---7 1 (0) as a ---7 ( ) Nor does intuition suggest to us that the limiting distributions are well defined. The difficulty in the first example is that the probability mass is getting smeared out evenly over an infinite support, so that the density is tending everywhere to zero. It does not make sense to define a random variable which can take any value in IR with equal probability, any more than it does to make a random variable infinite almost surely, which is the limiting case of the second example. In view of these pathological cases, it is important to establish the conditions under which a sequence of measures can be expected to converge weakly. The condi tion that ensures the limit is well-defined is called uniform tightness. A se quence { Jln } of p.m.s on fR is uniformly tight if there exists a finite interval (a,b] such that, for any £ > 0, SUPnlln((a,b]) > 1 - £. Equivalently, if {Fn } is the sequence of c.d.f.s corresponding to {J..Ln } , uniform tightness is the condition that for £ > 0 3 a,b with b - a < and =
co
co
-co
.
360
The Central Limit Theorem
sup { Fn(b) - Fn(a) } > 1 - E. (22.44) n It is easy to see that examples 22.19 and 22.20 both fail to satisfy the uniform tightness condition. However, we can show that, provided a sequence of p.m.s { Jln } is uniformly tight, it does converge to a limit Jl which is a p.m. This terminology derives from the designation tight for a measure with .the property that for every E > 0 there is a compact set Ke such that J.!(K�) :::; E. Every p.m. on (IR ,B) is tight, although this is not necessarily the case in more general probability spaces. See §26.5 for details on this. An essential ingredient in this argument is a classic result in analysis, Helly ' s selection theorem.
22.21 Belly's selection theorem If { Fn} is any sequence of c.d.f.s, there exists a subsequence { nk, k = 1 ,2, ... } such that Fnk � F, where F is bounded, non-decreas ing and right-continuous, and 0 :::; F :::; 1 . Proof Consider the bounded array { { Fn(Xi), n E IN } , i E IN } , where {xi, i E IN } is an
enumeration of the rationals. By 2.36, this array converges on a subsequence, so that limk�Jn/Xi) = F'(xD for every i. Note that F*(xi1) :::; F'(xi2) whenever Xi 1 < xi2, since this property is satisfied by Fn for every n. Hence consider the non decreasing function on IR , F(x)
=
(22.45)
inf F*(xi).
x; > x
Clearly 0 :::; F*(xi) :::; 1 for all i, since the Fnk(xa have this property for every k. By definition of F, for x E IR 3 xi > x such that F(x) s F'(xt) < F(x) + E for any E > 0, showing that F is right-continuous since F*(xD = F(xt). Further, for continuity points x of F, there exist xi1 < x and Xi2 > x such that
(22.46) The following inequalities hold in respect of these points for every k:
(22.47) Combining (22.46) with (22.47), F(x) - E < liminf Fnk(x) s limsup Fnk(x) < F(x) + E. Since E is arbitrary, limk�ooFnk(x)
=
F(x) at all continuity points of F.
(22.48) •
The only problem here is that F need not be a c.d.f., as in 22.19 and 22.20. We need to ensure that F(x) � 1 (0) as x � oo ( oo ), and tightness is the required property. -
22.22 Theorem Let { Fn } be a sequence of c.d.f.s. If
Weak Convergence of Distributions
361
(a) Fnk ==> F for every convergent subsequence {nk }, and (b) the sequence is uniformly tight, then Fn ==> F, where F is a c.d.f. Condition (b) is also necessary. o Helly ' s theorem tells us that { Fn } has a cluster point F. Condition (a) requires that this F be the unique cluster point, regardless of the subsequence chosen, and the argument of 2.13 applied pointwise to {Fn } implies that F is the actual limit of the sequence. Uniform tightness is necessary and sufficient for this limit F to be a c.d.f. Proof of 22.22 Let x be a continuity point of F, and suppose Fn(x) A F(x). Then I Fn(x) - F(x) I � £ > 0 for an infinite subsequence of integers, say { nh k E IN } . Define a sequence of c.d.f.s by Fk = Fnk' k 1,2, ... According to Helly's theorem, this sequence contains a convergent subsequence, { ki, i E IN }, say, such that Fie; ==> F'. But by (a), F' F, and we have a contradiction, given how the sub sequence { ki} was constructed. Hence, Fn ==> F. Since Fn is a c.d.f. for every n, Fn(b) - Fn(a) > 1 - £ for some b - a < oo, for any £ > 0. Since Fn -7 F at continuity points, increase b and reduce a as neces sary to make them continuity points of F. Assuming uniform tightness, we have by (22.44) that F(b) - F(a) > 1 - £, as required. It follows that 1imx---7ooF(x) = 1 and limx---7 - J(x) 0. Given the monotonicity and right continuity of F established by Helly' s theorem, this means that F is a c.d.f. On the other hand, if the sequence is not uniformly tight, F(b) - F(a) � 1 - £ for some £ > 0, and every b > a. Letting b -7 +oo and a -7 -oo, we have F(+<XJ) - F(-oo) � 1 - £ < 1 . Hence, either F(+oo) < 1 or F( oo) > 0 or both, and F is not a c.d.f. • The role of the continuity theorem (22.17) should now be apparent. Helly' s theorem ensures that the limit F of a sequence of c.d.f.s has all the properties of a c.d.f. except possibly that of fdF 1 . Uniform tightness ensures this property, and the continuity of the limiting ch.f. at the origin can now be interpreted as a sufficient condition for tightness of the sequence. It is of interest to note what happens in the case of our counter-examples. The ch.f. corresponding to example 22.19 is n sin v n _ (22.49) <J>n(Y) = (2n ) - 1 -n (cos vx + i sin vx)dx vn We may show (use l ' Hopital's rule) that <J>n(O) 1 for every n, whereas <J>n(Y) -7 0 for all v :f. 0. In the case of 22.20 we get <J>n(Y) cos v n + i sin vn, (22.50) which fails to converge except at the point v = 0. =
=
=
-
=
J
=
=
=
22.6 Convergence of Random Sums
Most of the important weak convergence results concern sequences of partial sums of a random array {Xn,, t 1 , ... ,n, n E IN }. Let =
The Central Limit Theorem
362
Sn
=
n L Xnt• t=l
(22.51)
and consider the distributions of the sequence {Sn } as n � oo . The array notation (double indexing) permits a normalization depending on n to be introduced. Central limit theorems, the cases in which typically Xnr = n - 1 12X1, and Sn converges to the Gaussian distribution, are to be examined in detail in the following chapters, but these are not the only possibility. 22.23 Example The B(n)Jn) distribution is the distribution of the sum of n inde pendent Bernoulli random variables Xnt• where P(Xnr = 1) = IJn and P(Xnr = 0) = 1 - IJn. From 22.2 we know that in this case n (22.52) Sn = L Xnt � Poisson with mean A. o t= l
From 11.1 we know that the distribution of a sum of independent r.v.s is given by the convolution of the distributions of the summands. The weak limits of independent sum distributions therefore have to be expressible as infinite con volutions. The class of distributions that have such a representation is necessar ily fairly limited. A distribution F is called infinitely divisible if for every n E fN there exists a distribution Fn such that F has a representation as the n-fold convolution (22.53) In view of (1 1 .33), infinite divisibility implies a corresponding multiplicative rule for the ch.f.s. 22.24 Example For the Poisson distribution, <J>x(t;A) = exp{Aeit - 1 } , from (1 1 .34), and x(t; A) = (exp{ (IJn)(eit _ 1 ) } Y = (<J>x(t; IJn)t. (22.54) The sum of n independent Poisson variates having parameter IJn is therefore a Poisson variate with parameter A. o In certain infinitely divisible distributions, Fn and F have a special relation ship, expressed through their characteristic functions. A distribution with ch.f.
Weak Convergence of Distributions <j)x(t)
=
363 (22.56)
exp { -a ! t iP } , a � 0.
22.25 Example The Cauchy distribution is stable with p = 1 and b(n) = 0, having ch.f.
-
-
23 The Classical Central Limit Theorem
2 3 . 1 The i . i.d. Case
The 'normal law of error' is justly the most famous result in statistics, and to the susceptible mind has an almost mystical fascination. If a sequence of random variables {X1} I have means of zero, and the partial sums L�= tX1, n = 1 ,2,3, ... have variances s� tending to infinity with n although finite for each finite n, then, subject to rather mild additional conditions on the distributions and the sampling process, 1 n D (23. 1) Sn = L Xt � N(O, 1). -
Sn t=l
This is the central limit theorem (CLT). Establishing sets of sufficient condi tions is the main business of this chapter and the next, but before getting into the formal results it might be of interest to illustrate the operation of the CLT as an approximation theorem. Particularly if the distribution of the X1 is symmet ric, the approach to the limit can be very rapid. 23.1 Example In 11.2 we derived the distribution of the sum of two independent U[O, 1 ] drawings. Similarly, the sum of three such drawings has density fx+Y+z(w)
r t rt
= J 0 J 0 1 ( w-z- I.w-z )dydz,
(23.2)
which is plotted in Fig. 23. 1. This function is actually piecewise quadratic (the three segments are on [0 , 1], [1 ,2] and [2,3] respectively), but lies remarkably close to the density of the Gaussian r.v. having the same mean and variance as X + Y + Z (also plotted). The sum of 10 or 12 independent uniform r.v.s is almost indistinguishable from a Gaussian variate; indeed, the formula S = 'L�� 1 Xi - 6, which has mean 0 and variance 1 when Xi - U[O, l ] and independent, provides a simple and perfectly adequate device for simulating a standard Gaussian variate in computer modelling exercises. o 23.2 Example For a contrast in the manner of convergence consider the B(n,p ) distribution, the sum of n Bernoulli(p) variates for fixed p E (0, 1). The proba bilities for p = � and n = 20 are plotted in Fig. 23.2, together with the Gaussian density with matching mean and variance. These distributions are of course dis crete for every finite n, and continuous only in the limit. The correspondence of the ordinates is remarkably close, although remember that for p :t � the binomial distribution is not symmetric and the convergence is correspondingly slower. This
The Classical Central Limit Theorem
365
example should be compared with 22.2, the non-Gaussian limit in the latter case being obtained by having p decline as a function of h. o
Sum of 3 U[O, 1 ]s
1 .5
3
Fig. 23. 1 0.2 I Bin(20, 0.5)
1-
0. 1
1-
'-
0
0
__
..
(1
I
I
I
I I I I
I I
I
, '
\
�
N( 1 0,5) \ \
- - -
\ \ \
\ \
\
�'
10
t ..
... _
20
Fig. 23.2 Proofs of the CLT, like proofs of stochastic convergence, depend on establishing properties for certain nonstochastic sequences. Previously we considered sample points I Xn(ro) - X(ro) I for ro E C with P( C) = 1 , probabilities P( I Xn - X I > £) , and moments E I Xn - XI P , as different sequences to be shown to converge to 0 to estab lish the convergence of Xn to X, respectively a.s., in pr., or in Lp . In the present case we consider the expectations of certain functions of the Sn; the key result is Theorem 22.8. The practical trick is to find functions that will finger print the limiting distribution conclusively. The characteristic function is by
The Central Limit Theorem
366
common consent the convenient choice, since we can exploit the multiplicative property for independent sums. This is not the only possible method though, and the reader can find an alternative approach in Pollard (1984: lll.4), for example. The simplest case is where the sequence { X1 } is both stationary and indepen dently drawn. 23.3 Lindeberg-Levy theorem If {X1}] is an i.i.d. sequence having zero mean and variance cr2, D ,. X1 I(J � = n - 1 12L... N(0 , 1 ) . 0 (23.3) t= l Proof The ch.f.s <j>x(A.) of X1 are identical for all t, 24 so from ( 1 1 .30) and
Sn
n
(1 1 .33),
Applying 11.6 with k = 2 yields the expansion
jcA.Xr)2 A.Xr
(23.4)
}
I 13 , 1 <\>x(A.cr -1 n - 1/2 ) - l - A.212n l � E mm �· l fTn 6cr3n312 .
(23.5)
which makes it possible to write, for fixed A., <\>x(A.cr- 1 n - 1 12) = l - A.212n + O(n-312) .
(23.6)
Applying the binomial expansion, ( 1 + alnt = I.}=o('})(alnY � 'Lj=naj!p = ea as n -7 oo, and setting a = -�A.2 + O(n- 112), we find lim
The Classical Central Limit Theorem
367
alternative to the central limit property under the specified conditions, that is, zero mean and finite variance. The earlier remark that symmetry of the distribution of n -Inx, improves the rate of convergence to the limit can be also 0, the expansion in (23.5) can be taken appreciated here. If we can assume to third order, and the remainder in (23.6) is of O(n- 2). On the other hand, if the variance does not exist the expansion of (23.6) fails. Indeed, in the specific case we know of, in which the are centred Cauchy, n 2Xn = O(n 1 12) ; the sequence of distributions of {n 1 12Xn } is not tight, and there is no weak convergence. The limit law for the sum would itself be Cauchy under the appropriate scaling of n - I . The distinction between convergence in distribution and convergence in probabil ity, and in particular the fact that the former does not imply the latter, can be demonstrated here by means of a counter-example. Consider the sequence {X,}i defined in the statement of the Lindeberg-Levy theorem, and the corresponding Sn in (23.3). 23.4 Theorem Sn does not converge in probability. Proof If it was true that plimn---7ooSn = Z, it would also be the case that p1imn---7=S2n = Z, implying plim (S2n - Sn) = 0. (23.8)
E(X�)
=
X,
11
We will show that (23. 8) is false. We have (23.9) where S'n
=
n
- 112"' 2n X "- t =n+ I
p1(J,
hence (23. 10)
According to the Lindeberg-Levy theorem,
X,
X(ro).
=
ro,
n- 1122:7= 1Xr(ro)
{X(ro) ro
roE
X
ro,
The Central Limit Theorem
368
not necessarily close to Sn(W) no matter how large n becomes. Weak convergence of the distribution functions does not imply convergence of the random sequence. Characteristic function-based arguments can also be used to show convergence in distribution to a degenerate limit. The following is a well-known proof of the weak law oflarge numbers for i.i.d. sequences, which circumvents the need to show L 1 convergence. 23.5 Khinchine's theorem If { Xr} i is an identically and independently distributed sequence· with finite mean �' then Xn = n -1 'L7=1 Xt � �· Proof The characteristic function of Xn has the form
The Lindeberg-Levy theorem imposes conditions which are too strong for the result to have wide practical applications in econometrics. In the remainder of this chapter we retain the assumption of independence (to be relaxed in Chapter 24), but allow the summands to have different distributions. In this theory it is convenient to work with normalized variables, such that the partial sums always have unit variance. This entails a double indexing scheme. Define the triangular array {Xnt' t = 1, ... ,n, n E IN }, the elements having zero mean and variances O'�t' such that if
(23. 15) then (under independence) E(S�)
=
n L �t t=l
=
1.
(23. 16)
Typically we would have Xnt = (Yr - �r)lsn where { Yr} is the 'raw' sequence under study, with means {�r}, and s� = 'L7= 1 E(Yr - �i. In this case O'�t = E(Yt - �t) 2/s�, so that these variances sum to unity by construction. It is also possible to have Xnt = (Ynt - �nt)lsn, the double indexing of the mean arising in situations where the sequence depends on a parameter whose value in turn depends on n. This case arises, for example, in the study of the limiting distributions of
The Classical Central Limit Theorem
369
test statistics under a sequence of 'local' deviations from the null hypothesis, the device known as Pitman drift. The existence of each variance �1 is going to be a necessary baseline condition in all the theorems, just as the existence of the common variance cJl was required in the Lindeberg-Levy theorem. However, with heterogeneity, not even uniformly bounded variances are sufficient to get a central limit result. If the Y1 are identically distributed, we do not have to worry about a small (i.e. finite) number of members of the sequence exhibiting such extreme behaviour as to in fluence the distribution of the sum as a whole, even in the limit. But in a heter ogeneous sequence this is possible, and could interfere with convergence to the normal, which usually depends on the contribution of each individual member of the sequence being negligible. The standard result for independent, non-identically distributed sequences is the Lindeberg-Feller theorem, which establishes that a certain condition on the distributions of the summands is sufficient, and in some circumstances also neces sary. Lindeberg is credited with the sufficiency part, and Feller the necessity; we look at the latter in the next section. 23.6 Lindeberg theorem Let the array { Xnr } be independent with zero mean and variance sequence {�r} satisfying (23. 16). Then, Sn � N(O, 1) if X�1 dP = 0, for all c: > 0. o (23 . 1 7) lim i n�oo t= l { I Xnr l > t:} Equation (23. 17) is known as the Lindeberg condition. The proof of the Lindeberg theorem requires a couple of purely mechanical lemmas. 23.7 Lemma If x 1 , . .. ,xn and y1 , ... ,yn are collections of complex numbers with l xr l � 1 and I Y r l � 1 for t = 1 , ... ,n, then
J
(23 . 1 8) Proof
For n
=
2, i x 1 x2 - Y1Yz l
I (x t - YI )xz + (xz - Yz)Y I I � i x1 - Yi i i xz l + i xz - Yzi I Yd � l xl - Y l l + l xz - Yzi . =
(23. 19)
The general case follows easily by induction. • 23.8 Lemma If z is a complex number and l z l � �' then i ez - 1 - z l � l z l 2 . Proof Using the triangle inequality, � l z li � l z l 2f=o (23.20) U + 2)!"
The Central Limit Theorem
370
Since I.j=02 -J = 2, the infinite series on the right-hand side cannot exceed 1 . • Proof of 23.6 We have to show that
I
-
n
2
e - 'A 12 1 = Il
I j fi
s
n 'A2crnl2 1 - Il e- 2 t=l
t=l
t=l
2 2 e I D _" (J�� - fi o - 1A-2�a I , where the equality is by definition, using the fact that I7=10"�t +
(23.21)
= 1, and the inequality is the triangle inequality. The proof will be complete if we can show that each of the right-hand side terms converges to zero. The integrals in (23. 17) may be expressed in the form E( l 1 \Xnr\>e} X�t), and
O"�t = E(l ( \ Xnr\�e}X�t) + E(l(\ Xnr \ >E}X�t) ::; c? + E(l { \Xnr \>E} X�t) e2 as n oo, all 1 s t s n, ----7
----7
(23.22)
since the Lindeberg condition implies that the second term on the right-hand side of the inequality (which is positive) goes to zero; since E can be chosen arbit rarily small, this shows that max O"�t ----7 0. (23.23)
l�ts n
In the first of the two terms on the majorant side of (23.21), the ch.f.s are all less than 1 in modulus, and by taking n large enough we can make 1 1 - 1A-2a�1 I s 1 for any fixed value of A-. Hence by 23.7,
I fi
(23.24)
To break down the terms on the majorant side of (23.24), note that 11.6 for the case k = 2, combined with (1 1 .29), yields
I
s A-2£(1 1 \Xnrl>e}X�t) + �� A-I 3E�t·
I7=1 O"�t = 1 , j
Hence, recalling
i
t=l
(23.25)
The Classical Central Limit Theorem �
� I A-1 \: as n
371 (23.26)
� oo ,
since the first majorant-side term vanishes by the Lindeberg condition. Since £ is arbitrary, this limit can be made as small as desired. Similarly, for the second term of (23.21), take n large enough so that
I TI e-A.2cr�/2 - TI (1 - �A2�r) I t=!
t=l
:::;
i I e-A.2a�/2 -
1
t= l
-
!"-2�t I ·
(23.27)
Setting z = -1A-2a�t (a real number, actually) in 23.8 and applying the result to the majorant side of (23.27) gives n 4 (23.28) :::; !"- .L cr�t· t=I
But,
n L <J�t
n :::; (max <J�r) L <J�r = max <J�r � 0 as n � (23.29) t=I t=I by (23.23). The proof is therefore complete. • The Lindeberg condition is subtle, and its implications for the behaviour of random sequences call for careful interpretation. This will be easier if we look at the case Xnr = Xtlsn, where Xr has mean 0 and variance cr7 and s� = I7=1cr7. Then the Lindeberg condition becomes n lim 21 L (23.30) x7dP = 0, for all £ > 0. n-too Sn t=I I { I Xr J >snE ) One point easily verified is that, when the summands are identically distributed, sn = Vncr, and (23.30) reduces to limn-too s� 2E(XTl { IXJ I > minE J ) = 0. The Lindeberg condition then holds if and only if X1 has finite variance, so that the Lindeberg theorem contains the Lindeberg-Levy as a special case. The problematical cases the Lindeberg condition is designed to exclude are those where the behaviour of a finite subset of sequence elements dominates all the others, even in the limit. This can occur either by the sequence becoming exces sively disorderly in the limit, or (the other side of the same coin, really) by its being not disorderly enough, beyond a certain point. Thus, the condition clearly fails if the variance sequence { cr7 } is tending to zero in such a way that s� = I7=1cr7 is bounded in n. On the other hand, if s� � oo, then sn£ � oo for any fixed positive £, and the Lindeberg condition resembles a condition of 'average' uniform integrability of {X7 } . The sum of the terms £(1 { I Xr l > snE }X7) must grow less fast than s�, no matter how close £ is to zero. The following is a counter-example (compare 12.7). 23.9 Example Let Xr = Yr - E(Yt) where Yt = 0 with probability 1 - t-2 , and t with probability t-2 . Thus E(Yr) = t- 1 and Xt is converging to a degenerate r.v., equal oo,
The Central Limit Theorem
372
to 0 with probability 1 , although Var(Yr) = 1 - t-2 for every t. The Lindeberg condition fails here. s� = n - 'L7= 1 t-2 , and for 0 < £ S. 1 we certainly have t > Sn£ whenever t > n 1 12. Therefore, (23.3 1)
as n --7 = And indeed, we can show rather easily that the CLT fails here. For any no � 1 , if we put Bo = 'L7� 1 t, then .
(23.32)
where the majorant side can be made as small as desired by choosing no large enough. It follows that 2:.7= 1 Y1 = Op(l) and hence, since we also have 'L7= 1 E(Yr) = O(log n), that 'L7= 1 Xr1Sn � 0, confirming that the CLT does not operate in this case. o Uniform square-integrability is neither sufficient nor necessary for the Lindeberg condition, so parallels must be drawn with cauiion. However, the following theorem gives a simple sufficient condition. 23.10 Theorem If {X7 } is uniformly integrable, and s�/n � B > 0 for all n , then (23.30) holds. Proof For any n and £ > 0, the latter assumption implies (23.33)
Hence lim � n ---7= Sn
i t=l
E(l { IX11>snEl X7) S. � limsup max {E(l { IXtl>snEl X7)} n---?oo i �t� n S. =
E( 1 1 1 x1 >sn£ 1 X7) � supt nlim ---?oo 1
0,
(23.34)
where the last equality follows by uniform integrability. • There is no assumption here that the sequence is independent. The conditions involve only the sequence of marginal distributions of the variables. None the less, the conditions are stronger than necessary, and a counter-example as well as further discussion of the conditions appears in §23.4. The following is a popular version of the CLT for independent processes. 23.11 Liapunov's theorem A sufficient condition for (23. 17) is
The Classical Central Limit Theorem
373
n
lim .2::: El Xn1 1 2+8 = 0, for some 8 > 0. (23 .35) n�oo t:= l Proof For 8 > 0 and £ > 0, (23.36) EIXnr l 2+l5 � E(l{ IXnt i>£J I Xnrl 2+8) � £8E(l I IXnri>£)X�r) . The theorem follows since, if £8limn�oo I7=1 E(l 1 1 Xnt i >£)X�1) = 0 for fixed £ > 0, then the same holds with £0 replaced by 1. •
Condition (23.35) is called the Liapunov condition, although this term is also used to refer to Liapunov's original result, in which the condition was cast in terms of integer moments, i.e.
n
lim L EIXnrl3
n�oo
t:= l
=
(23 .37)
0.
Although stronger than necessary, the Liapunov condition has the advantage of being more easily checkable, at least in principle, than the Lindeberg condition, as the following example illustrates. 23.12 Theorem Liapunov ' s condition holds if s�/n > 0 uniformly in n and E I X1 1 2+8 < oo uniformly in t, 8 > 0. Proof Under the stated conditions,
n nB 2 Bl < 2 8 + X EI tl L 2 Sn B2 Sn2 1
- � t:= l
<
-
-
-
� E I X1 1 2+8 lim -1- L
n�oo Sn2+8
t:= l
=
<
00
0
(23.38)
(23.39)
follows immediately. • Note that these conditions imply those of 23.10, by 12.10. It is sufficient to avoid the 'knife-edge' condition in which variances exist but no moments even fractionally higher, provided the sum of those variances is also O(n). 2 3 . 3 Feller' s Theorem and Asymptotic Negligibility
We said above that the Lindeberg condition is sufficient and sometimes also neces sary. The following result specifies the side condition which implies necessity. 23.13 Feller's theorem Let { Xnr} be an indep�ndent sequence with zero mean and variance sequence { 0'�1}. If Sn � N(O, 1) and
374
The Central Limit Theorem
max P( I Xnr l > E) � 0 as n � oo, any E > 0, (23.40) :=::: t :5n 1 the Lindeberg condition must hold. o The proof of Feller' s theorem is rather fiddly and mechanical, but necessary conditions are rare and difficult to obtain in this theory, and it is worth a little study for that reason alone. Several of the arguments are of the type used already in the sufficiency part. Proof Since �t � 0 for every t, the series expansion of the ch.f. suggests that I
I
(23.41) (23.42)
In each case the second inequality is by (1 1 .29), setting E = 0 for (23.42). Squaring I
t. 1
I AI 3 E as n
(23.43)
� oo,
using (23.40) and (23.41). This result is used to show that
exp
{t. (�x",(A.) 1 )} ··· D�x"'(A.) t. exp{ �x"'(A.) - 1 } - �x",(A.) I -
5
I
1
n (23.44) L I x../A) - 1 . The condition of the lemma can be satisfied for large enough n according to (23.42) and (23.23). By hypothesis
The Classical Central Limit Theorem 1 og
ex
��
(�x"'(A.)
-
} � E(cos AX., - 1 )
1)
=
375 -->
-1'-2,
(23.45)
using ( 1 1.12 ) and ( 1 1.13 ) to get the equality. Taking the real part of the expansion in ( 1 1.24) up to k = 2 gives cos x = 1 - ¥2cos ax for 0 :S: a :S: 1 , so that ¥2 - 1 + cos x � 0 for any x. Fix £ > 0 and choose A > 21£, so that the contents of the parentheses on the minorant side below is positive. Then we have n ')...2 2 n I A?X2nt - 2 dP x2 dP nt .L L 2 2 - t=I { IXnti>E} 2 £ t=I { IXnti>E}
(
)
f
< f
:S:
if t= l n
(
)
( �'A2 1 + cos AXnr) dP r X� { IXnti>E}
_L EqA.2X�1 - 1 + cos AXnr) ----7 0,
(23.46) t= l where the last inequality holds since the integrand is positive by construction for every Xn1, and the convergence is from (23.45) after substituting Lr�t = 1 . Since £ is arbitrary, the Lindeberg condition must hold according to (23.46). • Condition (23.40) is a condition of 'asymptotic negligibility' , under which no single summand may be so influential as to dominate the sum as a whole. The chief reason why we could have
-
f
f
where Z is a standard Gaussian variate. A condition related to (23.40) is
P ( max I Xnr I
)
(23.47) o
----7 0 as n ----7 oo, any £ > 0, I :> t :> n which says that the largest Xnr converges in probability to zero. 23.15 Theorem (23.48) implies (23.40).
>£
(23.48)
The Central Limit Theorem
376
P(max1s;1s;n I Xnrl $ £) 1. But P t��� I Xnr l £) = P (!] { I Xnr l £}) $ min P(! Xnt l $ £) Is;ts;n = 1 - max P( I Xn zl > £) , (23 .49) Is;ts;n where the inequality is by monotonicity of P. If the first member of (23.49) converges to 1 , so does the last. Proof
(23 .48) is the same as
�
s;
s;
•
Also, interestingly enough, we have the following. 23.16 Theorem The Lindeberg condition implies (23.48). Proof Another way to write (23. 17) (interchanging the order of summation and integration) is
nL 1!1Xnrl>E}X�, 0, all £ > (23.50) t=I According to 18.13 this implies 2:7=1 1 1 I Xnr / > E}X�1 or, equivalently, (23.51) P (it=I 1 1 /Xnr/ > E}X�1 > £2) 0 as n for any £ 0. But notice that {ro: i111 s;t�n Xnz(CO)j > £}, (23.52) t=l Xnr / > E}(ro)X�r(ro) > £2} = {ro: Imax! �
0.
� 0,
�
�
=,
>
so (23.5 1) is equivalent to (23.48). • Note that the last two results hold generally, and do not impose independence on the sequence. The foregoing theorems establish a network of implications which it may be helpful to summarize symbolically. Let L = the Lindeberg condition; I = independence of the sequence; � e- J...ZI2 ); AG = asymptotic Gaussianity AN = asymptotic negligibility (condition (23.40)); and PM = � 0 (condition (23.48)). Then we have the implications L + I ==> AG +PM + I ==> AG + AN + I ==> L + l, (23.53) where the first implication is the Lindeberg theorem and 23.16, the second is by 23.15, and the third is by the Feller theorem. Under independence, conditions L, AG + PM, and AG + AN are therefore equivalent to one another.
maxi Xnr l
(�sn(A)
The Classical Central Limit Theorem
377
However, this is not quite the end of the story. The following example shows the possibility of a true CLT operating under asymptotic negligibility, without the Lindeberg condition. 23.17 Example Let X1 = 1 and -1 with probabilities 1C l - t- 2) each, and t and -t with probabilities 1t- 2 each, so that E(X1) = 0 and Var(X1) = � - !?, and s� = �n + O(l). This case has similar characteristics to 23.9. Since I Xnrl = ts� 1 with probability t- 2 and 1s� 1 otherwise, we have for any E > 0 that, whenever n is large enough that ESn > 1, (23.54) P( \ Xnr l > E) � n; 2 where n£ is the smallest integer such that n£s� 1 > E. Since n£ = O(n 1 12), (23.40) holds in this case. However, the argument used in 23.9 shows that the Lindeberg condition is not satisfied. E( l { 1 Xn�I >E }X�1) � s� 2 for t � n£, and hence
it=1 f
{ IXnti>E}
X�rdP � (n - n£)s�2 � � > 0.
(23.55)
However, consider the random sequence { W1} , where W1 X1 when I X1 I = 1 and Wr = 0 otherwise. As t increases, W1 tends to a centred Bernoulli variate with p = 1, and defining Wnr = W 1sn , it is certainly the case that n (23.56) L Wnt � N(O, !). t= 1 However, I X1 - Wrl is distributed like Y1 in 23.9, and applying (23.32) shows that .L7=J I X, - W, I = Op(l ), and hence I .L7=1 Xnr - L7= 1 Wnr \ � 2:.7=1 \ Xnr - Wnr l = Op(n - 1 12). It follows that .L7=1Xn1 � N(0, !), according to 22.18. o A CLT therefore does operate in this case. Feller's theorem is not contradicted because the limit is not the standard Gaussian. The clue to this apparent paradox is that the sequence is not uniformly square-integrable, having a component which contributes to the variance asymptotically in spite of vanishing in probability. In these circumstances Sn can have a 'variance' of 1 for every n despite the fact that its limiting distribution has a variance of ! ! =
1
23 .4 The Case of Trending Variances
The Lindeberg-Feller theorems do not impose uniform integrability or Lr-bounded ness conditions, for any r. A trending variance sequence, with no uniform bound, is compatible with the Lindeberg condition. It would be sufficient if, for example, cr7 < oo for each finite t and the unit-variance sequence {X1 /cr1} is uniformly square-integrable, provided that the variances do not grow so fast that the largest of them dominates the Cesaro sum of the sequence. The following is an extension of the sufficient condition of 23.10. 23.18 Theorem Suppose { x7td} is uniformly integrable, where { c 1} is a sequence of positive constants. Then {X1} satisfies the Lindeberg condition, (23.30), if
The Central Limit Theorem
378
sup nMnf 2 sn2 n
=
C
<
(23.57)
oo,
One way to construct the c1 might be as max { 1 ,a1 } . The variances of the trans formed sequence are then bounded by 1 , but a7 = 0 is not ruled out for some t. Proof of 23.18 The inequality of (23.33) extends to
{c7E(1{1X/c11>s,Eic1)(X,Icr)2)} _!_£E(1{ 1 Xrl>s,E)X7) Sn� max $t$ n
s�
:s;
t:= l
:s;
l
C max {E( 1 ( IX/crl>s,Eictl (X/cr) 2)} l$t$ n
(23.58)
The analogous modification of (23.34) then gives sup l im E( 1 { IX!crl>s,.Eicr) (X,Icr)2) t n�oo
=
0.
•
(23.59)
Notice how (23.57) restricts the growth of the variances whether this be positive or negative. Regardless of the choice of { c1}, it requires that s�ln > 0 uniformly in n. It permits the c1 to grow without limit so long as they are finite for all t, so the variances can do the same; but the rate of increase must not be so rapid as to have a single coordinate dominate the whole sequence. If we let c1 = max{ 1 ,0"1} as above, (23.57) is1 satisfied (according to 2.27) when a7 - {x for any a � 0, but not when a7 2 • In fact, the conditions of 23.18 are stronger than necessary in the case of decreasing variances. The variance sequence may actually decline to zero without violating the Lindeberg condition, but in this case it is not possible to state a general sufficient condition on the sequence. If a7 - P with - 1 < a < 0, we would have to replace (23.33) by the condition' -
it:=l
max {E( 1 (IX1i>s,e}X7)}. (23.60) E( 1 { 1Xrl>s,e)X7) :s; Bn � Ists n s� where B = infn (s�ln 1 +a) > 0 by assumption (note, s� - n 1 +a under independence). Convergence of the majorant side of (23.60) to zero as n --7 oo is not ruled out, but depends on the distribution of the X1• The following example illustrates both possibilities. 23.19 Example Let { X1} be a zero-mean independent sequence with X1- U[-ta /v.] for some real a, such that a7 = ft2a either growing with t (a > 0) or declining with t (a < 0). However, X1 is Loo-bounded for finite t (see 8.13). The integrals in (23.30) each take the form _!_
,
(23.61 )
The Classical Central Limit Theorem
379
where s; = !I�= 1 't2a. Now, I,�=1 't2a = O(n2a+ l ) for a > -! and O(log n) for a = -! (2.27). Condition (23.57) is satisfied when a � 0. Note that (23.61) is zero if (!I,�=1 't2a) 1 12£ > ta ; (I,�= l 't2a) 1 12 grows faster than na for all a � 0, and hence (23.61) vanishes in the limit for every t in these cases, and the Lindeberg condition is satisfied. But if X1 - U[ -21,2 1], 2n grows at the same rate as (I,�= 1 22-r) 1 12 , the above argument does not apply, and the Lindeberg condition fails. Note how condition (23.57) is violated in this case. However, the fact that condition (23.57) is not necessary is evident from the fact that the variance sum diverges at a positive rate when X1 - U[ -ta,ta] for any a � -! even though the variance sequence itself goes to zero. It can be verified that (23.61) vanishes in the limit, and accordingly the Lindeberg condition holds, for these cases too. On the other hand, if a < ! s; is bounded in the limit and (23 . 1 7) becomes -
,
(23.62) where B = limn-tooL7=1 t2a < oo, and by choice of small enough £, (23.62) can be made arbitrarily close to 1 . This is the other extreme at which the Lindeberg condition fails. o
24 CL Ts for Dependent Processes
24. 1 A General Convergence Theorem
The results of this chapter are derived from the following fundamental theorem, due to McLeish ( 1 974). 24.1 Theorem Let {Zni• i = l , . . . ,rn , n E [N } denote a zero-mean stochastic array, where rn is a positive, increasing integer-valued function of n, and let25 rn (24. 1 ) Trn = n ( 1 + iAZni), A > 0. i= l Then, S rn = Lt�tZni � N(O, 1) if the following conditions hold: (a) Trn is uniformly integrable, (b) E(Trn) � 1 as n � oo, rn (c) L Z�i � 1 as n � oo, i= l (d) maXt::;i::;rn i Znd � 0 as n � oo. D There are a number of features requiring explanation here, regarding both the theorem and the way it has been expressed. This is a generic result in which the elements of the array need not be data points in the conventional way, so that their number r does not always correspond with the number of sample observations, n. rn = n is a leading case, but see 24.6 for another possibility. It is interesting to note that the Lindeberg condition is not imposed in 24.1, nor is anything specific assumed about the dependence of the sequence. Condition 24.1(d) is condition PM defined in (23.48), and by 23.16 it follows from the Lindeberg condition. We noted in (23.53) that under independence the condition is equivalent to the Lindeberg condition in cases where (as we shall prove here) the central limit theorem holds. But without independence, conditions 24.1(a)-(d) need not imply the Lindeberg condition. Proof of 24.1 Consider the series expansion of the logarithmic function, defined for l x l < 1 , log(l +x) = x - �2 + !x3 Although a complex number does not possess a unique logarithm, the arithmetic identity obtained by taking the exponential of both sides of this equation is well-defined when x is complex. The formula yields n
••.
CLTs for Dependent Processes
38 1
(24.2) where the remainder satisfies I r(x) I :::;; I x 1 3 for I x I < 1 . Multiplying up the terms for i = 1 , ... ,rn yields exp{ iAS,J = T,n U'n ' where T,n is defined in (24. 1) and
{-1A.'� z;, - �r(O..z,+
(24.3)
u,, = exp
Taking expectations produces
s,n (A,) = E(T,n U,n ) = e - 'A212E(T,n ) + E( T,n( U,n - e-'A212)),
(24.4)
so given condition (b) of the theorem,
-
lim E , T,n( U,n e - 'A212) '
n.....:;oo
The sequence
=
0.
(24.5) (24.6)
is uniformly integrable in view of condition (a), the first term on the right-hand side having unit modulus. So in view of 18.14, it suffices to show that Plim T'n (U'n - e -'A212) = 0. (24.7)
n....oo.:;
Since T,n is clearly Op(l), the problem reduces, by 18.12, to showing that plimn.....:;oo U,n = e -'A212, and for this in tum it suffices, by condition (c), if 'n plim :L r(iAZnD = 0. (24.8)
n---';oo
i=l
To show that this convergence obtains, we have by the triangle inequality 'n { 'n 'n (24.9) I Znd � Z�;. � r(iAZn;) :::;; I A 1 3 � I Znd :::;; I A
I
I
3 l3l:;i::
)
The result now follows from conditions (c) and (d), and 18.12. • It is instructive to compare this proof with that of the Lindeberg theorem. A different series approximation of the ch.f. is used, and the assumption from independence, that is avoided. Of course, we have yet to show that conditions 24.1(a) and 24.1(b) hold under convenient and plausible assumptions about the sequence. The rest of
The Central Limit Theorem
382
this chapter is devoted to this question. 24.l(b) will tum out to result from suitable restrictions on the dependence. 24.l(a) can be shown to follow from a more primitive moment condition by an argument based on the 'equivalent sequences ' idea.
24.2 Theorem For an array { Zni } , let (i) The sequence T,n
=
ll/� 1 ( 1
(
+ iAZni) is uniformly integrable if
)
sup E max Z�i < oo. n l:S: i :S: rn And if I:i�1Z�i � 1, then 'n (ii) _Lz�i � 1 ; i=l rn (iii) Srn = _L z i has the same limiting distribution as Srn· i=l Proof Let
(24. 1 1)
n
(24. 12)
otherwise such that Zni
=
0, if at all, from the point i i=I
=
ln + 1 onwards. Note that (24. 1 3)
i=l
The terms A.2Z�i are real and positive. The inequality 1 +x < ex for x > 0 implies that ll;(l +x;) :::; nieXi for Xi > 0. Hence, ln- 1 2 I 't,n 1 = fl (1 + t..2z�i)C l + t..2Z�)
i=l
:::; exp { ki:{�1 1 Z�i) } ( l + kZ�) :::;
e2'-2( 1 + 1..2Z�).
(24. 14)
where the last inequality is by definition of ln. Then by (24. 1 1), sup E I T, 1 2 :::; e 2'-\ I + A2 sup £(Z�n)) < 00 (24. 15) n n Uniform boundedness of E I T,n 1 2 is sufficient for uniform integrability of T,n , proving (i). Since by construction I:i� 1 Z�i :::; 2, n
•
CLTs for Dependent Processes P
(� z�i * � z�i) = P (� z�i > 2) (I � z�i - 1 1 > E) ::; p
'
383
for 0 < E < 1 '
-----7
0 as n ----7 oo, (24. 16) by assumption, which proves (ii). In addition, P(Zni * Zni• some 1 :::;; i :::;; rn) = P(� 1 Z�k > 2) -----7 0 as n ----7 oo, so I Sn - Sn I � 0, and by 22.18, Sn and Sn have the same limiting distribution, proving (iii). • 24.2 The Martingale Case
Although it permits a law of large numbers, uncorrelatedness is not a strong enough assumption to yield a central limit result. But the martingale difference assumption is similar to uncorrelatedness for practical purposes, and is attrac tive in other ways too. The next theorem shows how 24.1 applies to this case. 24.3 Theorem Let {Xnr,r::Fn r} be a martingale difference array with finite uncondi tional variances { 0'�1} , and L-7=1 �t = 1 . If n (a) _L X�1 � 1 , and t= 1 (b) max1�t�n i Xnr l � 0, n then Sn = _LXnr � N(0, 1). t=1 We use 24.1 and 24.2, setting rn = n, i = t, and Zni = Xnr. Conditions (a) and (b) are the same as (c) and (d) of 24.1, so it remains to show that the other conditions of 24.1 are satisfied; not actually by Xnt• but by an equivalent sequence in the sense of 24.2(iii). If Tn = ll7=1 ( 1 + i'AXn1), we show that limn ooE(Tn) = 1 when { Xnr} is a m.d. array. By repeated multiplying out, n Tn = fl( l + i'AXnr) = Tn - 1 + if..Tn -1Xnn t= 1 Proof
-7
=
Tr- 1
=
n i L 1 = + f.. Tr-1Xnr· t=1 ll�: }(1 + i'AXns) is an r:!r- 1-measurable r.v., so by the LIE,
(24. 1 7)
The Central Limit Theorem
384
E(Tn)
=
n 1 + iA-L E(Tr- IXnr) t:=l
n (24. 18) 1 + iA-L E(Tr- I E(Xnrl ?Fn ,r- 1 )) =. 1 . t:= 1 This is an exact result for any n, so certainly holds in the limit. If Xnr is a m.d., so is Xnr = Xnr1 (IJ:}x�k � 2), and this satisfies 24.1(b) as above, and certainly also 24.1(d) according to condition (b) of the theorem. Since I.7= 1 E(X�1) = 1 , condition (24. 1 1) holds for Xnr. Hence, Xnr satisfies 24.1(a) and 24.1(c) according to 24.2(i) and (ii), and so obeys the CLT. The theorem now follows by 24.2(iii). • This theorem holds for independent sequences as a special case of m.d. sequences, but the conditions are slightly stronger than those of the Lindeberg theorem. Under independence, we know by (23. 53) that24.3(b) is equivalent to the Lindeberg condition when the CLT holds. However, 24.3(a) is not a consequence of the Lin de berg condition. For the purpose of discussion, assume that Xnr = X/sn , where s� = I7= 1 cr? and a7 = E(X7). Under independence, a sufficient extra condition for 24.3(a) is that the sequence {X71a7} be uniformly integrable. In this case, inde pendence of {X1} implies independence of {X7 } , {X7 - a7, ?f1} is a m.d., and 19.8 (put Gn = S� and b r = a7) gives sufficient conditions for s�2L7: 1 (X7 - a7) � 0. This is equivalent to 24.3(a). But of course, it is not the case that {X? - a?, ?f1} is a m.d. merely because {Xr.�t } is a m.d. If 24.3(a) cannot be imposed in any other manner, we should have to require E(X71 ?Fr- 1 } = a?, a significant strengthen ing of the assumptions. On the other hand, the theorem does not rule out trending variances. Following the approach of §23.5, we can obtain such a case as follows. 24.4 Theorem If {Xr.?f1} is a square-integrable m.d. sequence and E(X71 ?Fr- 1 ) = a7 a.s., and there exists a sequence of positive constants { c1} such that {X71d} is uniformly integrable and =
2 n2 < oo sup nMnls (24. 1 9) n where M� = max l :o; r :o; n d, conditions 24.3(a) and 24.3(b) hold for Xnr = Xr fsn . Proof By 23.18 the sequence {Xnr } satisfies the Lindeberg condition, and hence, 24.3(b) holds by 23.16. Note that neither of these results imposes restrictions on the dependence of the sequence. To get 24.3(a), apply 19.8 to the m.d. sequence CX7 - a7), putting p = 1 , b1 = c7, and an = s�. The sequence { CX7 - a7)1d} is uniformly integrable on the assumptions, and note that I.7= 1 b1 � nM� = O(s�) and that I.7= 1 b7 :::; nM;; = o(s�), both as consequences of (24. 1 9). The conditions of 19.8 are therefore satisfied, and the required result follows by 19.9. •
CLTs for Dependent Processes
385
24. 3 Stationary Ergodic Sequences
It is easy to see that any stationary ergodic martingale difference having finite variance satisfies the conditions of 24.3. Under stationarity, finite variance is sufficient for the Lindeberg condition, which ensures 24.3(b) by 23.16, and 24.3(a) follows from the ergodicity by 13.12. The interest of this case stems from the following result, attributed by Hall and Heyde (1980: 137) to unpublished work of M. I. Gordin. 24.5 Theorem Let {X1,:1'1 } be a stationary ergodic L 1 -mixingale of size -1, and if Sn = I7= 1 X, assume that limsup n - 1 12E I Sn l < oo . (24.20) Then n - 1 12E I Sn l -7 A, 0 :::;; A < oo. If A > 0, n- 1 12Sn � N(0, 1:nA2). o Notice that the assumptions for this CLT do not include E(Xf) < oo . Let X� = Xrl ( IXr i � Bl · The centred sequence Y� = � - E(X� I :J
(
)
(24.21 ) E n - 1 12± Y� 2 = crl t-= 1 The sequence { n - 1 12 I I7= 1 Y� I } is therefore uniformly integrable (12.11), and by the continuous mapping theorem it converges in distribution to the half-Gaussian limit (see 8.16); hence, by 9.8, n - 1 12E
i it y� � -7 crs(2/n) 112•
I
I
(24.22) -=1 Now define Y1 = X1 - E(Xrl :J
(24.23)
Noting that cr� = f1 IXt l � B )XIdP, { cr�, B = 1 ,2, ... } is a monotone sequence converg ing either to a finite limit cr2 (say) or to +oo. In the latter case, in view of (24.22) there would exist N such that n - 1 12EI I.7=t Y11 > A for N � n, which contradicts (24.23), so we can conclude that cr2 < oo . Taking B = oo in (24.22) yields, in view of the fact that n - 1 12E I Zt - Zn+ I I -7 0,
The Central Limit Theorem
386
n- 1 12EI Sn I = n - 1 12E
I ± Yr + Zt - Zn+l I � a(2ht) 112.
(24.24)
t= l
Hence, A = a(2/rc) 1 12 . Since Y1 is now known to be a stationary ergodic m.d. with finite variance if, and if > 0, 24.3 and 22.18 imply that n - 1 12Sn � N(O,a2 ), which completes the proof. • This result can be thought of as the counterpart for dependent sequences of the Lindeberg-Levy theorem, but unlike that case, we do not have to assume explicitly that E(XT) < oo. Independence of {Xr} enforces the condition Z1 = 0 for all t, and then the conditions of the theorem imply E(XT) = if. It might appear that the existence of dependence weakens the moment restrictions required for the CLT, but this gain is more technical than real, for it is not obvious how to construct a stationary sequence such that X1 is not square-integrable but Y1 is. The most useful implication is that the independence assumption can be replaced by arbi trary local dependence (controlled by the mixing ale assumption) without weakening any of the conclusions of the Lindeberg-Levy theorem. 24.4 The CLT for NED Functions of Strong Mixing Processes
The traditional approach to the problem of general dependence is the so-called method of 'Bernstein sums ' (Bernstein 1927). That is, break up Sn into blocks (partial sums), and consider the sequence of blocks. Each block must be so large, relative to the rate at which the memory of the sequence decays, that the degree to which the next block can be predicted from current information is negligible; but at the same time, the number of blocks must increase with n so that a CLT argument can be applied to this derived sequence. It would suffice to require the sequence of blocks to approach independence in the limit, but a result can also be obtained if it behaves asymptotically like a martingale difference. This is the approach we adopt. The theorem weprove (from Davidson 1992, 1993b) is given in two versions. The first, 24.6, is fully general but the conditions are complicated and not very intuitive. The second, 24.7, is a special case, whose conditions are simpler but cover almost all the possibilities. The exceptional cases for which 24.6 is essen tial are those in which the variances of the process tend to 0 as t increases. 24.6 Theorem Let {Xnr• t = 1 , .. ,n, n � 1 } be a triangular stochastic array, let { Vnt• -oo < t < oo, n � 1 } be a (possibly vector-valued) stochastic array, and let r:f��7 = a(Vns• t - m :::; s :::; t + m). Also, let Sn = L�=IXnr. Suppose the following assumptions hold: (a) Xnr is r:f�. - ooi:B-measurable, with E(Xnr) = 0 and E(S�) = 1. (b) There exists a positive constant array {cnr } such that SUPn,r i i Xnlcnr llr < oo for r > 2. (c) Xnr is �-NED of size -1 on { Vnr} with respect to the constants { cnr } speci fied in (b), and { Vnr } is a-mixing of size -(1 + 28)r/(r - 2), 0 :::; 8 < �.
-m
CLTs for Dependent Processes
387
(d) Letting bn = [n 1 - a] and rn = [nlbn ] for some a E (0, 1 ] , and defining Mn i = maX(i- l )bn
The Central Limit Theorem
388
s�
=
n n t- 1 ,L a7 + 2L .L ar,t-k. t= 1 t=2 k= 1
(24.30)
where at t-k Cov ( Yr,Yt-k). Assumptions 24.6(c) or 24.7(c') imply, by the norm inequality, that (24.3 1 ) 11 Xnr - E(Xnt l ���7-m) llp � CntYm, for 0 < p � 2 , where v is of size - 1 . The following lemma is an immediate consequence of 17.6. 24.8 Lemma Under assumptions 24.6(a), (b), and (c), { Xn1, ��. oo } is an Lp-mixingale of size -min { l , (1 + 28)(r -p)/p(r - 2) } for 1 � p � 2, with constants {cnt l · o In particular, when p = 2 the size is -{1 + 8), and when p = r/(r - 1) the size is - 1 . Under the assumptions of 24.7 the same conclusions apply, except that 8 = 0. There is also a more subtle implication which happens to be very convenient for the present result. This is the following. 24.9 Lemma Under 24.6(a) and (b), plus either 24.6(c) or 24.7(c'), ==
m
-
(24.32) and
(24.33) for each k E IN, where ants = E(Xn�Xns), and �m = O(m-1 -8) for 8 > 0. o These inequalities are convenient for the subsequent analysis, but in effect the lemma asserts that for each fixed k the products XnrXn,t+k• after centring, form L 1 -mixingales of size -1 with constants given by max { c�1,c�,t+d . One of these might be written as, say, { Un1,��. -oo L where Unt = Xn,t-[k/2]Xn,t+k-[k/2] - an,t-[k/2],t+k-[k/2]· The mixingale coefficients here are �� = �0 for m = O, ... ,[k/2], and �� = �m-[k!Z] for m > [k/2]. Proof of 24.9 The array {Xn�Xn,t+d is L 1 -NED on { Vnt l of size -1 by 17.11. The conclusion then follows by 17.6(i), noting that any constant factors generated in forming the inequalities have been absorbed into �m in (24.32) and (24.33). • Now consider 24.6(d) and 24.7(d'). These assumptions permit global nonstation arity (see § 13.2). This is a fact worthy of emphasis, because in the functional CLT (see Chapters 27 and 29 for details) global stationarity is a requirement. In this respect the ordinary CLT is notably unrestrictive, provided we normalize by sn as we do here. The following cases illustrate what is allowed. 24.10 Example Let { Y,} be an independent sequence with variances a7 - t � for any � :?: 0 (compare 13.6). It is straightforward to verify that assumption 24.7(d') is
CLTs for Dependent Processes
389
satisfied for Xnr Yr lsn where Cnr = ar lsn , and in this case s� = L�=I cri· It is however violated when cr7 - 2t, a case that is incompatible with the asymptotic negligibility of individual summands (compare 23.19). It is also violated when � < 0 (see below). o =
24.11 Example Let { Yr} be an independent sequence with variance sequence gen erated by the scheme described in 13.7. Putting Xnr = Yr lsn, 24.7(d') is satis fied with Cnr = llsn for all t. o Among the cases that 24. 7(d') rules out are asymptotically degenerate sequences, having a7 � 0 as t � oo. In these cases, max 1 �r�ncri = 0(1) as n � oo, but given 24.7(c'), it will usually be the case that s�ln � 0. It is certain of these cases that assumption 24.6(d) is designed to allow. To see what is going on here, it is easiest to think in terms of the array { Cnt } as varying regularly with n and t, within certain limits to be determined. We have the following lemma, whose conditions are somewhat more general than we require, but are easily specialized. 24.12 Lemma Suppose c�t - tpn-y- I for �,y E IR . Then, 24.6(d) holds iff (24. 34)
and � < 2[8 +y(l + 8)].
0
(24. 3 5)
Notice that � and y can be of either sign, subject to the indicated constraints. Proof We establish that an a E (0, 1] exists such that each of conditions (24.25) (24.27) are satisfied. We have either M�i - (ibn)Pn -y- I for � � 0, or M�i ((i - 1)bn) Pn -y- I for � < 0, but in both cases Tn v �
M2 ·
rni+PbnPn -y- 1
na.O
+P)+C I - a.) p -y- 1
(24. 36) i=l Simplifying shows that condition (24.34) is necessary and sufficient for (24.27), independently of the value of a, note. Next, (24.25) is equivalent to max Cnr2 = o (na.-1 ) , (24.37) l �t� n which (since the maximum is at t = n for � � 0, and t = 1 otherwise) imposes the requirement max{ �,O} - y < a. (24. 38 ) In view of (24.34) this constraint only binds if y < 0, and is equivalent to just -y < a. (24. 39 ) Also, m
_
rn
_
" . i+ I2 I2 - ( + I )I2 � M - rn P bnP n y i= l m
·
The Central Limit Theorem
390
a(! +P/2)+(( 1 -a)p-y- 1 )/2 -n '
(24.40)
and (24.26) reduces to
28 +y- � a -< 1 + 28 ·
(24.41) i
The existence of a satisfying (24.41 ) and (24.39) requires two strict inequalities to be satisfied by �. y, and 8, but since 8 is positive the first of these, which is 8 > �� - y), holds by (24.34). The second is -y(1 + 29) < 28 + y- �. which is the same as (24.35). It follows that (24.34) and (24.36) are equivalent to 24.6(d), as asserted. • In terms of the leading case with Xm = (Yt - �t)lsn, with s� given by (24.30), consider first the case a7 - rP, � � 0. To have Cnt monotone in t (not essential, but analytically convenient), we could often set
(24.42) = max {<Js } fsn. 1 $s $t In this case, under 24.6(b) and 24.7(c'), the conditions of 17.7 are satisfied. Note however that I I Yt - �til (Jt is required by 24.6(b ). Since { sd is of size - 1 we have, substituting into (24.30), n n t- 1 p (24.43) s� « 2 t + 22 rP12 L (t - k)P12 Sk O(n l+P) . t= 1 t= 1 k=l This only provides an upper bound, and condition 24.7(c') alone does not exclude the possibility that s� - n 1+y for some y < �- However, compliance with condition (24.34), which follows in tum from 24.7(d') (note, M� = c�n in this case), enforces y = �- This condition says that the variance of the sum must grow no more slowly than the sum of the variances. Now consider the case a7 tp with � < 0. Here, we might be able to set (24.44) C nt
r
-
=
-
s :?: t
Under 24.7(c') and (24.34) we would again have c� t l3n - l3 -1 , but here M� = <J7 •1s� for some t* < oo; hence M� - n - 13- 1 , and 24.7(d') ceases to hold. However, with � y, condition (24.35) reduces to r
-
=
�
> 1-+2828'
(24.45)
and it is possible for the conditions of 24.12 to be satisfied, although only with 8 > 0. As 8 is increased, limiting the dependence, a can be increased according to (24.36) and this allows smaller �. increasing the permitted rate of degeneration. As 8 approaches �' so that the mixing size approaches -2rl(r - 2), � may approach -�, with a also approaching �These conclusions are summarized in the following corollary. Part (i) is a case
CLTs for Dependent Processes
391
of 24.7, and part (ii) a case of 24.6. 24.13 Corollary Let Xnt = (Y1 - J.l1)1sn where � t � and s� - n 1 +f3. If either (i) 0 :::; � < oo, and 24.6(b) and 24.7(c') hold with Cnt defined in (24.42); or (ii) there exists 8 such that (24.45), 24.6(b) and 24.6(c) hold with Cnt defined by (24.44); then Sn � N(O, 1). o By an apparent artefact of the proof, t- 112 represents a limit on the permitted rate of degeneration of the variances. We may conjecture that with mixing sizes exceeding 2rl(r - 2), the CLT can hold with � :::; -� , but a different method of attack would be necessary to obtain this result. Also, both the above cases appear to require that a E (0, �), and it is not clear whether larger values of a might be possible generally. The plausibility of both these conjectures is strengthened by the existence of the following special case. 24.14 Corollary If conditions 24.6(a) and (b) hold, and also (c") {Xn1,��, - oo} is a martingale difference array, and (d") conditions (24.25) and (24.27) hold with bn = 1 , then Sn � N(O, l). o In the case Xnt = (Y1 - J...L1)1sn, where by the m.d. assumption s� = I,�=J�, note that (24.27) is satisfied with bn = 1 by construction, and (24.25) requires only that s� t oo, so that � t� is permitted for any � :2: -1 under 2.14(d"). This result may be compared with 24.4, whose conditions it extends rather as 24.6 extends 24.7. The proof will be given below after the intermediate results for the proof of 24.6 have been established. As an example, let Y1 - J.l1 be a m.d. with a7 = �It where � is constant, so that s� log n. Corollary 24.14 establishes that (log nf 1 12L�= l (Yt - J.lt) � N(O, a2). The limit on the permissable rate of degeneration here is set by the requirement s� --7 oo as n --7 oo. If the variances are summable, the central limit theorem surely fails. Here is a well-known case where the non-summability condition is violated. 24.15 Example Consider the sequence { Y1} of first differences, with Y1 = Z1 and Y1 = Z1 - Z1- I , t = 2,3, . .. , where { Z1} is a zero-mean, uniformly L,-bounded, independent sequence, with r > 2. Here { Y1} satisfies 24.6(a)-(c), but L�= l Yt = Zn and s� = Var(Zn) = 0(1 ). o -
-
-
24. 5 Proving the CLT by the Bernstein Blocking Method
We shall prove only 24.6, the arguments being at worst a mild extension of those required for 24.7. We show that with a suitable Bernstein-type blocking scheme the blocks will satisfy the conditions of 24.1. In effect, we show that the blocks behave like martingale differences. In most applications of the Bernstein approach alternating big and small blocks are defined, with the small blocks containing of
The Central Limit Theorem
392
the order of [n �( !-a)] summands for some � E (0, 1 ) , small enough that their omis sion is negligible in the limit but increasing with n so that the big blocks are asymptotically independent. Our martingale difference approximation method has the advantage that the small blocks can be dispensed with. Define bn and rn as in condition 24.6(d) for some a E (0 , 1), and let ibn
Zni = L Xnt• i = 1 , . .. ,rm t= i )bn+ ( -I
such that n
(24.46)
I
'n
(24.47) Zni + (Xn, nbn+l + .. . + Xnn) . L t= ! i== l The final fragment has fewer than bn terms, and is asymptotically negligible in the sense that bnrnln ---7 1 . Our method is to show the existence of a > 0 for which an array { Zni } can be constructed according to (24.46), such that 24.1(c) is satisfied, and such that the truncated sequence {Znd defined in (24.10) satisfies the other conditions of 24.1. This will be sufficient to prove the theorem, since then Lt� I Zn i � 1 according to 24.2(ii), and Zni and Zni are equivalent sequences when 24.1(c) holds, by 24.2(iii). Since { Xnr} is a L2-mixingale array of size -� according to 24.8, we may apply 16.14 to establish that the sequences (24.48) . ib 2 At 1 east, thIS" 1011ows "" 1y mtegrable, where Vn2i = " are unl!orm .L.. t=(i-!)bn+!Cnr. directly in the case i = 1, and it is also clear that the result generalizes to any i, for, although the starting point (i - 1)bn + 1 is increasing with n, the nth coordinate of the sequence in (24.48) can be embedded in a sequence with fixed starting point which is uniformly integrable by 16.14. This holds uniformly in n and i, and it follows in particular that the array {Z�/Y�i , 1 � i � rm rn � 1 } is uniformly integrable. This result leads to the following theorem. 24.16 Theorem Under assumptions 24.6(a)-(d), {Znd satisfies the Lindeberg condition. Proof For any i, v�i � bnM�i � 0 as n � oo by (24.25) and hence, for E > 0, 1 { 1Znil>£}z�i) 1 { I Z,/vn; I >Eivn;}z�i) < max max bnMn2i Y2ni ! :S i :S r ! :S i :S r Sn = L Xnt =
r
"
{E(
n
} {E( _
n
� O as n � = by uniform integrability. The conclusion,
}
(24.49)
CLTs for Dependent Processes
393
rn
_L £( 1 { l2nj l >£ l z�j) j= l
-----) 0 as n -----) oo, (24 . 50) now follows since the sum of rn terms on the majorant side is 0(1 ) from (24.27). • This theorem leads in turn to two further crucial results. First, by way of 23.16 we know that condition 24.l(d) is satisfied (and if this holds for Zni it is certainly true for Zn;); and second, note that for any E > 0 max Zn2i
rn
"' 1 { I Z,.;I>£ } zn2i· E2 + L
(24.5 1 ) i= l Taking expectations of both sides of this inequality, and then the limit as n -----) oo, shows that (24. 1 1 ) holds, and therefore that Trn is uniformly integrable (that is, condition 24.1(a) holds) by 24.2(i). This leaves two results yet to be proven, E( Tr ) -----) 1 (corresponding to 24.1(b) for the truncated array), and 24.1(c). By the latter result we shall establish parts (ii) and (iii) of 24.2, and then in view of 24.2(iii) the proof will be complete. We tackle 24.1(c) first. Consider ] :s; j :s; rn
::;;
n
(24.52) say, where rn
An = _L (�; - E(Z�;)) i= l (24.53) and Bn
= B�
+
B�, B�
where
[
(
ibn n -t Tn = 2 _L _L L crnt,t+k , i= l t=(i- l )bn+ l k=ibn -t+l
)]
(24.54) (24.55)
Here, ants = E(XntXns), and recall that E(S�) = .L.7: } (cr�t + 2.L.}=t+l crnt,t+j) + �n = 1 . It may be helpful to visualize these expressions as made up of elements from outer product and covariance matrices divided into � blocks of dimension bn x bn, with a border corresponding to the final n - rnbn terms, if any; see Fig. 24. 1 .
The Central Limit Theorem
394
The terms in An correspond to the rn diagonal blocks, and B� and B� contain the remaining covariances, those from the off-diagonal and final blocks. 2
3
0 1hn hn 0 �
2 3 :
0 :
:
0
:
:
'" I I I I � .
.
Fig. 24. 1
An is stochastic and so must be shown to converge in probability, whereas the components of Bn are nonstochastic. The nonstochastic part of the problem is the
more straightforward, so we start with this. 24.17 Theorem Under assumptions 24.6(a)-(d), Bn --7 0. Proof Since r > 2, r/(r - 1) < 2 and conditions 24.6(b) and (c) imply by 17.7 that (24.56) I
rn rn+ l bn bn I B� I :5: 2� � � L I
(24.57) To determine the order of magnitude of the majorant expression, verify that the terms in the parentheses are O(b�-8(1 - i) -I -&) for I = i + l , . . ,rn + 1 , and some b > 0. Changing the order of summation and putting k = l - i allows us to write .
(24.58) But for every k in this sum, the Cauchy-Schwartz inequality and (24.27) give
rn+ l -k rn+ l L MniMn,i+k :5: ,LM�i = O(b�1 ) , i=l i=l
(24.59)
CLTs for Dependent Processes
395
so that (24.58) implies, as required, that I B� I O(b�8) O(n-S( l -a)) . (24.60) To complete the proof we can also show, by a similar kind of argument but applying (24.25), that (24.61) =
=
To solve the stochastic part of the problem, decompose the terms Z�i - E(Z�i) in (24.53) into individual mixingale components, each indexed by i = 1 , ... ,rn. For a pair of positive integers j and k, let (24.62) It is an immediate consequence of 24.9, specifically of (24.32) and (24.33), that for fixed j and k the triangular array { Wni(i ,k), ��� '2= ; 1 :::; i :::; rn, n � N(j,k) } , (24.63) where N(j,k) = min{n: rn � 1, bn � j + k}, is an L 1 -mixingale of size -1, with mixing coefficients 'Vo = �o and 'Jip �(p - l )bn+J for p � 1 , and constants (24.64) Substituting from (24.62) into (24.53), we have the sum of b� such rnixingales, =
Z�i - E(Z�i) =
I:1 (wniU,O) + 2rk=! WniU, k)) + Wn;(bn,O). j=l
(24.65)
Although this definition entails considering k of order bn as n --7 oo, note that the inter-block dependence of the summands does not depend on k. The designation of 'mixingale ' is convenient here, but it need not be taken more literally than the inequalities (24.32) and (24.33) require. The crucial conclusion is that a weak law of large numbers can be applied to these terms. 24.18 Theorem Under assumptions 24.6(a)-(d), A n � 0. o The object here is to show that the array (24.66) { (Z�i - E(Z�;) ), ��� '2= ; 1 :::; i :::; rn , rn � 1 } is an L1-mixingale with constants {and which satisfy conditions (a), (b), and (c) of 19.11. The proof could be simplified in the present case by using the fact that Zni is ���'2=-measurable (by 24.6(a)) so that the minorant side of (24.68) below actually disappears identically for p � 0. But since the result could find other applications, we establish the mixingale property formally, without appealing to this assumption. Proof By multiplying out Z�i and applying 24.9 term by term, we obtain
/
E E(Z�i - E(Z�a I � �� -p)bn ).
j
The Central Limit Theorem
396
, �� ( 1
e e(w.,{i,O) i !l' ! � -p)b,) I
+
2
�I
e e(w.,{i,k) i !l'!�-p)b, )
E ' E( Wni(bmO) j � ��-p)bn ) I
+
I)
(24.67) and similarly,
I
E z�i - E(Z�d � � �+p)bn ) I «
1 -0)
M�i
[�1 (
�(p+
l )bn-j + 2
) ]
�
�(p+l)b -j -k
+ �Pb
'
(24.68)
n n pl k=l where �j = 0([ for o > 0. Write, formally, ani'I'; to denote the larger of the two majorant expressions in (24.67) and (24.68), such that \jf; � 0 and ani is fixed by setting 'l'o = 1. Evaluating (24.68) at p = 1 and (24.68) at p = 0 respec tively gives
I
E E(Z�, - E(�,) J !!'!� - l )h,) I and also, putting
I
j' j =
( 1 + 2(b, -j))j+O + b� I -O
M�,
«
M�;bn,
bn - and k'
E z�i - E(Z�il � ��n) l
) (� j' [�1 (j'-1-o �k'=O ) ]
((
«
M�i
=
(24.69)
- k,
/= 1
+2
k' - l -o
+
1
(24.70)
2 Mnibn. Hence, ani = BM�ibn for some finite constant B. Since 'n 'n (24.7 1 ) L �i $ max M�; _LM�i = o(b�2) i= l i=l l$i$rn+l i n view of (24.27) and (24.25), these constants satisfy conditions 19.1l(b) and (c). And since Z�/bnM�; s Z�/V�i where Z�/V�; is uniformly integrable, they also satisfy condition 19.11(a). It follows that An � 0, and the proof is complete. • This brings us to the final step in the argument, establishing the asymptotic m.d. property of the Bernstein blocks. 24.19 Theorem Under 24.6(a)-(d), limn--?c£(f,n) = 1 . Proof Applying (24.17),
«
CLTs for Dependent Processes
397
'n 'n ) = 1 i 1 i Z + A (24.72) + A L 1\-! Zni• ni ( TI i=l i=l where Ti -l = Tij:t ( l + iA-Znj) is an ��:=�)bn_measurable r.v. by 24.6(a), and hence 'n E(Ti- I ZnD ,) = 1 + iAL E(T i=l T,n =
(24.73) By the Cauchy-Schwartz inequality,
(24 . 74) i=l i=l where II Ti- 1 1 l 2 is uniformly bounded by (24 . 1 5), which follows in tum by (24. 1 1 ), so the result hinges on the rate of approach to zero of II E(Znd ��:=�)bn) lb. This cannot be less than that for Zni • so consider, more conveniently, the latter case.
[�
E(E(Xntl ��:= �)bn))2 t=(i- 1 )bn+ I 1 12 i l i t I ( ( l � i � i + 2 L..., L..., � E(E(Xnt � �n,-oo)bn)E(Xnt+k � �n,-oo)bn)) l t=(i- I )bn+ I k=l Applying 24.8, �
]
(24 . 75) (24.76)
and
E I E(E(Xm l �t=�)bn)E(Xnt+k l ��:=�)bn)) I
� II E(Xm l ��:= �)bn) ll 2 II E(Xnt+k l ��:= �)bn) lb
where Sj
O(j - 1 12- 11) for
(24.77)
> 8 for the 8 defined in 24.6(c). Hence, 1 12 (24.78) II E(Znd ��:= �)bn) ll 2 « Mni sJ + 2 \j i Sk , k j=l =! =j+! ] where the sum of squares is 0(1 ) and the double sum is O(b� - 211) . Applying (24.26) , assumption 24.6(d) implies =
J.l
(�
�
)
The Central Limit Theorem
398 �
L j E(Ti- J E(Znd ��:=�)bn )) l i=l
�
L MnP (max { b� 1 - 211)12, 1 }) i=l - -I = O(max { b� 11 , b� I2 }) = o ( l ). =
(24.79 )
This ensures that L,j� 1 E(Ti- !Zni) ----7 0 as n ----7 oo, which is the desired result. • Proof of 24.6 We have established that L.i� 1 Zni = L.;�fnXnt � N(O,l). There remains the formality of extending the same conclusion to Sm but this is easy smce (24.80) and Sn has the same limiting distribution by 22.18. The proof of 24.6 is therefore complete. • It remains just to modify the proof for the martingale difference case, as was promised above. Proof of 24.14 It is easily seen that 24.16 holds for rn = n and bn = 1 . In 24.17, Bn = 0 identically since �k 0 for k > 0 in (24.57). In 24.18 one may put Zni = Xnt• and the conditions for 19.10 follow directly from the assumptions. Lastly, 24.19 holds since the sum in (24.79) vanishes identically under the martingale difference assumption. The proof is completed just as for 24.6. • =
25 S ome Extensions
25 . 1 The CLT with Estimated Normalization
The results of the last two chapters, applied to the case Xnt = X11sn where E(X1) = 0 and s� = E(I.7= 1 X1) 2 , would not be particularly useful if it were necessary to know the sequences { cr7} , and { cr1,1-d for k � 1 , in order to apply them. Obviously, the relevant normalizing constants must be estimated in practice. Consider the independent case initially, and let Sn = "'L7= 1 X11sn where s� = I.7=Icr7. Also let s� I.7 =Ix7, and we may write n 1 Sn = --;::- L Xr = dnSn, (25.1) S n t=! where dn snfsn. If dn � 1 , we could appeal to 22.14 to show that Sn � N(0,1 ) whenever Sn � N(O, l). The interesting question is whether the minimal conditions sufficient for the CLT are also sufficient for the relevant convergence in probability. If the sequence is stationary as well as independent, existence of the variance 2 cr is sufficient for both the CLT (23.3) and for n -1 I.7= 1 X7 � d (applying 23.5 to X7). In the heterogeneous case, we do not have a weak law of large numbers for { x7 } based solely on the Linde berg condition. However, the various sufficient conditions for the Lindeberg condition given in Chapter 23, based on uniform integrability, are sufficient for a WLLN. Without loss of generality, take the case of possibly trending variances. 25.1 Theorem If {X1} is an independent sequence satisfying the conditions of 23.18, then =
�
=
(25.2) the sequence (X7 - a7)1c7. By assumption this has zero mean, is independent (and hence an m.d.), and uniformly integrable. The conditions of 19.8, with p = 1 , b1 = d, and an = s�, are satisfied since, by (23.57), n " 2 2 2 (25.3) L- c t :$ nMn = O(sn) ,
Proof Consider
t=!
where Mn
=
max, � 1 � n{c1 } . Hence
The Central Limit Theorem
400 E
=
I-7=1 x? E --2- - 1 --7 0, Sn
(25.4)
which is sufficient for convergence in probability. • When the sequence { X1} is a martingale difference, supplementary conditions are needed for {X?} to obey a WLLN, but these tum out to be the same as are needed for the martingale CLTs of 24.3 and 24.4. In fact, condition 24.3(a) corresponds precisely to the required result. We have, immediately, the following theorem. 25.2 Theorem Let {X1,�1} be a m.d., and let the conditions of 24.3 or 24.4 be satisfied; then (25.2) holds. o Although we have spoken of estimating the variance, there is clearly no neces sity for {s�/n } , or any other sequence apart from { dn } , to converge. Example 24.1 1 is a case in point. In those globally covariance stationary cases (see § 1 3 .2) where s�/n converges to a finite positive constant, say -& , the 'average variance' of the sequence, we conventionally refer to s�ln as a consistent estimator of cr2. But more generally, the same terminology can always be applied to s� with respect to s� , in the sense of (25.2). Alternative variance estimators can sometimes be defined which exploit the particular structure of the random sequence. In regression analysis, we typically apply the CLT to sequences of the form X1 = W1U1 where { U1,�1} is assumed to be a m.d. with fixed variance � (the disturbance), and where W1 (a regressor) is �1- 1 -measurable. In this case, {Xt � l is a m.d. with variances a? = �E( W?) , which suggests the estimator s� = (n-1 2-7= 1 u?) L-7= 1 w?, for s�. This is the usual approach in regression analysis, but of course the method is not robust to the failure of the fixed-variance assumption. By contrast, s� = 2-7= 1 w?u? possesses the property cited in (25.2) under the stated conditions, regardless of the distributions of the sequence. The latter type of estimator is termed hetero scedasticity-consistent. 26 Now consider the case of general dependence. The complicating factor here is that s� contains covariances as well as variances, and s� is no longer a suitable estimator. A sample analogue of s� must include terms of the form X1Xt+j for IJ I � 1 as well as for j = 0, but the problem is to know how many of these to include. If we include all of them, in other words for j = 1 - t, ... ,n - t, the resulting sum is equal to (I-7= 1 X1) 2, and the ratio of this quantity to s� is converging, not in probability to 1 , but in distribution to x2 (1) . For consistent estimation we must make use of the knowledge that all but a finite number of the covariances are arbitrarily close to 0, and omit the corresponding sample products from the sum. Similarly to the m.d. case, the conditions of 24.6 contain the required convergence result. Consider (24.46) and (24.47) where Xnt = X!sm but now write >
4z;
=
SnZni =
t
ibn
1
L X1, i'= 1 , ... , rn.
t=(i- 1 )bn+
(25.5)
Some Extensions
401
In the proof of the CLT the construction of the Bernstein blocks was purely conceptual, but we might consider the option of actually computing them. The sum of squares of the unweighted blocks, s� 1 = L.i� 1 Z�T, is consistent for s� in the sense that r sn2l � 2 pr (25.6) 2 L Zni --"-----7 1 ' Sn i= l according to 24.17 and 24.18. An important rider to this proposal is that 24.17 and 24.18 were proved using only (24.25) and (24.27), so that, as noted previ ously, any a E (0, 1 ) will serve to construct the blocks. In the context of the conditions of 24.12 at least, the only constraint imposed by 24.6(d) is repre sented by (24.39), which puts a possible lower bound on a in the case of decreas ing variances (y > 0), but no upper bound strictly less than 1 . It is sufficient for consistency if bn goes to infinity at a positive rate, and we are not bound to use the a that satisfies the conditions of the CLT to construct the blocks in (25.6). But although consistent, s� 1 is not the obvious choice of estimator. It would be more natural to follow authors such as Newey and West (1987) and Gallant and White (1988), inter alia, who consider estimators based on all the cross-products X1Xr-k for t = k + 1 , ... ,n and k = O, . . . ,bn. In terms of the array represention of Fig. 24. 1, these are the elements in the diagonal band of width 2bm rather than the diagonal blocks only. (In this context, bn is referred to as the band width.) The simplest such estimator is n bn n s�2 = .L X7 + 2L L XrXr-k· (25.7) t= l k=l t=k+ l 25.3 Theorem Under the conditions of 24.6, applied to X/sn, s�2/s� � 1 . Proof Let Xnr = X/sn in (24.53), so that s!4 n denotes the same sum constructed from the X1 in place of the Xnr . The difference between An and (s�2 - E(s�2))1s� is the quantity
(25.8) The components of this sum correspond to the rn - 1 'triangles' which separate the diagonal blocks in Fig. 24. 1 , each containing �bn(bn - 1) terms, plus the terms from the lower-right corner blocks. Reasoning closely analogous to the proof of 24.18 shows that A� � 0. The sums of the corresponding covariances converge absolutely to 0 by 24.17, since they are components of B� in (24.54), and it follows that
402
The Central Limit Theorem I s�2 - s� d
2 -- � 0 . --Sn
The theorem therefore follows from 24.18. • Since this estimator uses sample data which are, discarded in s� 1 , there are informal reasons for preferring it in small samples. But there is a problem in that (the chosen notation notwithstanding) s�2 is not a square, and not always non-negative except in the limit. This difficulty can be overcome by inserting fixed weights in the sum, as in bn n n s;3 = _Lx7 + 2,L wnk L XrXt-k· t=l k=l t=k+l
(25 . 9 )
Suppose Wnk � 1 as n � oo for every k ::; K, for every fixed K < n I S�2 - s;3 l 2 bn ) ( 1 = 2 L - Wnk L XrXr-k � r(K), s; t=k+l Sn k=l where
1
---
I
bn n 1 (1 r(K) = 2 plim 2 L Wnk) L XrXt-k . n-too Sn k=K+I t=k+l
I
-
I
oo.
Then (25. 10)
(25 . 1 1 )
Since r(K) can be made as small as desired by taking K large enough, in view of 25.3 and 24.18, s�3 is consistent when the weights have this property. It remains to see if they can also be chosen in such a way that (25 . 9) is a sum of squares. Following Gallant and White (1988), choose b n + 1 real numbers an J , ... ,an,bn+ l , satisfying 2:j�t 1 a�j = 1 , and consider the n + bn variables Yni = a 1 X1, Yn2 = an1X2 + an2X1 , Yn'bn+ 1 = an!Xbn+1 + ... + an'bn+lXb Yn,bn+2 = an 1Xbn+2 + ... + an,bn+ 1X2, Ynn Yn,n+ 1
==
==
aniXn + ... + an,bn+lXn-bn ' an2Xn + ... + anbnXn-bn+b
Observe that
(25. 1 2)
Some Extensions
403
which shows that any weights of the form Wnk = I,j�tl l anJanJ -k, k = O, ... ,bn impose non-negativity, and also give wno = 1 . A case that fulfils the consistency requirement is anj = (b n + 1) - 1!2 , all j, which yields Wnk = 1 - k/(bn + 1). Variance estimators having the general form n n L L w( I t - S I lbn)XrXs t=l s=l are known as kernel estimators. The function w(x) = 1 - l x l for l xl 5 1, 0 other wise, which corresponds to �3 , is the Bartlett kernel. The estimator �2 , by contrast, uses the truncated kernel, w(x) = 1 1 l x l :o; I l . Other possibilities exist; see Andrews ( 1991) for details. One other point to note is that much of the literature on covariance matrix estimation relies on L2-convergence results for the sums and products, and accordingly requires L,-boundedness of the variables for r � 4. The present results hold under the milder moment conditions sufficient for the CLT, by using a weak law of large numbers based on 19.11. See Hansen (1992b) for a comparable approach to the problem. 25 . 2 The CLT with Random Norming
Here is a problem not unconnected with the last one. We will discuss it in the context of the m.d. case for simplicity, but it clearly exists more generally. Consider a m.d. sequence {X1} which instead of (25.2) has the property �7=Ix7 pr 2 (25. 13) 11 ' Sn2 � where 11 2 is a random variable. This might arise in the following manner. Let X1 = W1 U1 where { U1, '!F 1 } :00 is a m.d. and { W1 } := a sequence of r. v .s which are measurable with respect to the remote a-field 'R = n7=- oo'ffs . The implication is that W1 is 'strongly exogenous' (see Engle et al. 1983). with respect to the generation mechanism of U1• Then (25. 14) since 'R � r:Ji1_ 1 for every t, hence X1 is a m.d. Provided the W1 are distributed in such a way that �7= 1 w7u7Js� � 1, the analysis can proceed exactly as in §24.2. There is no practical need to draw a distinction between nonstochastic Wr and 'R-measurable Wr. But this need not be the case, as the following example shows. 25.4 Example Anticipating some distributional results from Part VI, let Wr = ��=I Vs where { V5 } is a stationary m.d. sequence with E(V�) = 't2 for all s. Also assume, for simplicity, that E(U7) = cr2, all t. Then, E(W7) = t't2 and s� = !n(n + l)'t2cr2. If we further assume that { Vs } satisfies the conditions of 24.3, it will be shown below (see 27.14) that
The Central Limit Theorem
404
'
f B(r)2dr, 1 � W12 � D (25. 15) L 22 o 't n t=l where B(r) is a Brownian motion process. Under the distribution conditional upon 'R, we may treat the sequence { Wr } as nonstochastic and apply the weak LLN (by an argument paralleling 25.2) to say that . I�=,x;
I�=' w7u7 . I�=' w; = 2 hm (25 . 1 6) 2 phm 2 .-2 = 11 2 , 2 2 't 't n s� n�oo crn (n + 1) n�oo n�oo where 11 2 merely denotes the limit indicated. Under the joint distribution of { U1 } and { W1}, !11 2 is a drawing from the limiting distribution specified in (25.15). o The application of the CLT to { X1} can proceed under the conditional distribution (defined according to 10.30), replacing each expectation by E(. l 'R). Let cr7 = E(U7 1 'R), defining an 'R-measurable random sequence, so that
phm
.
=
E
(it=l x7 1 'R) it=l w7a7. =
(25. 17)
We can then apply a result for heterogeneously distributed sequences, such as 24.4, letting Cr = Wrat· Assuming that ECX71 �t-1 ) = w;a;, and that condition (24. 19) is almost surely satisfied, the conditional CLT takes the form
}
( {
)
lim E exp n i� 2 1 12 i x� 'R = e - ".2!2 , a.s. (Lr =l Wrcrr) t= l n�oo
(25. 1 8)
But if we normalize by the unconditional variance, the situation is quite different. W1 must be treated as stochastic and E(X71 �1-1 ) :f:. ECX7), so the condi tions of 24.4 are violated. However, if (25.13) holds with 11 an 'R-measurable r.v., then according to 22.14(ii) the conditional distribution has the property
1 � X 'R -n L r S t= l
I
D �
N(O, 11 2) , a.s.
{ }I ( )
(25 . 1 9)
(see § 10.6 for the relevant theory). This result can also be expressed in the form
iA n Xr 'R lim E exp :S L n n�oo t=l
=
2 2
e - "- 11 12, a.s.
(25.20)
Hence, the limiting unconditional distribution is given by
(25.21) This is a novel central limit result, because we have established that L,�= 1 X11sn is not asymptotically Gaussian. The right-hand side of (25.21) is the ch.f. of a mixed Gaussian distribution. One may visualize this distribution by noting that
Some Extensions
405
drawings can be generated in the following way. First, draw 11 2 from the appropri ate distribution, for example the functional of Brownian motion defined in (25 . 1 5); then draw a standard Gaussian variate and multiply it by YJ . If X is mixed Gaussian with respect to a marginal c.d.f. G(YJ) (say), and <J>T'I is the Gaussian density with mean 0 and variance YJ 2, the moments of the distribution are easily computed, and as well as E(X) = E(X3) = 0 we find
E(X2) =
f""f 00 x2<J>T'l (x)dxdG(YJ) = f""0 YJ2dG(YJ) = E(YJ 2). 0
-
oo
(25.22)
However, the kurtosis is non-Gaussian, for
(25.23) where the right-hand side is in general different from iE(YJ 2)2 (see 9.7). 25 . 3 The Multivariate CLT
An array {Xnt} of p-vectors of random variables is said to satisfy the CLT if the joint distribution of Sn = L�= lXnt converges weakly to the multivariate Gaussian. In its multivariate version, the central limit theorem contributes a new and powerful approximation result. Given a vector of stochastic processes exhibiting arbitrary (contemporaneous) dependence among themselves, we can show that there exist different linear combinations of the processes which are asymptotically independent of one another (uncorrelated Gaussian variables being of course independent). This is a fundamental result in the general theory of asymptotic inference in econometrics. The main step in the solution to the multivariate problem sometimes goes by the name of the 'Cramer-Wold device' . 25.5 Cramer-Wold theorem A vector random sequence { Sn } i, Sn E IR \ con verges in distribution to a random vector S if and only if a'Sn � a'S for every fixed k-vector a t:. 0. Proof For given a the characteristic function of the scalar a'Sn is E(exp{ iA.a'Sn } ) =
--7
--7
The Central Limit Theorem
406
(25.25) where Ln ::: CnA�12, Cn and An being respectively the eigenvector matrix (satisfying CnC� = C�Cn= lp) and the diagonal, non-negative matrix of eigenvalues. Let Xnt ::: L-;,X1, where if l:n has full rank then L-;; = A� 1 12C� so that L-;,I:nL-;; ' lp· However, l:n need not have full rank for every n. If it is singular with An = I n ,
[� �]
=
let L-;; ' = [Cn 1 A!�12 : 0] where Cnl is the appropriate submatrix of Cn. In this case, L -;; l:nL -;; ' has either ones or zeros on the diagonal. We do however require l:n to be asymptotically of full rank, in the sense that L-;;l:nL-;; ' � lp. If Sn = L,7=IXn1, then for any p-vector a with a'a = 1 , we have E(a'Sn) 2 � 1 . If this condition fails, and there exists a ::f. 0 such that E(a'Sn)2 � 0, the asymptotic distribution of Sn is said to be singular. In this case, some elements of the limiting vector are linear combinations of the remainder. Their distribution is therefore determined, and nothing is lost by dropping these variables from the analysis. To obtain the multivariate CLT it is necessary to show that the scalar sequences { a'Xnt } satisfy the ordinary scalar CLT, for any a. If sufficient conditions hold for a'Sn � N(O, l), the Cramer-Wold theorem allows us to say that Sn � S, and it remains to determine the distribution of S. For any a, the ch.f. of a'S is <j>(A.) = E(exp { iA.a'S} ) = e - "-212 . (25.26) But letting t = A« be a vector of length A., it follows from ( 1 1 .41) that (25.26) is the ch.f. of a standard multi-Gaussian vector. (Recall that ex.'ex. = 1 .) By the inversion theorem we get the required result that S - N(O, lp) . We have therefore proved the following theorem. 25.6 Theorem Let {Xr} be a stochastic sequence of p-vectors and let l:n = E((L,7=IX1)(L7=IXt)'). If L-;;l:nL-;; ' � lp and L,7=Ia'L-;;Xt � N(0,1 ) for every ex. satisfying ex.'a = 1, then n (25.27) _LL-;;Xt � N(O, lp). D t=l
In this result the elements of l:n need not have the same orders of magnitude in n. The variances can be tending to infinity for some elements of X1, and to zero for others, within the bounds set by the Lindeberg condition. However, in the case when all of the elements of l:n have the same order of magnitude, say n° for some 8 > 0 such that n -ol:n � 1:: , a finite constant matrix, it is easy to manipulate (25.27) into the form n 0 2 (25.28) n 1 _LXr � N(O, l:) . t= l
Techniques for estimating the normalization factors generalize naturally from
Some Extensions
407
the scalar case discussed in §25 . 1 , just like the CLT itself. Consider the m.d. case in which l:n = L�=JE(XrX;), and assume this matrix has rankp asymptotically in the sense defined above. Under the assumptions of 25.2, (2 5 .29 ) for any a: with a:'a: = 1 , where the ratio is always well defined on taking n large enough, by assumption. This suggests that the positive semidefinite matrix i:n = L,�= 1 XrX; is the natural estimator for l:n. To be more precise: (25.29) says that P( I a:'i:na:/a:'l:n« - I I ) can be made as small as desired for arbitrary a: ::f. 0 by taking n large enough, since the normalization to unit length cancels in the ratio. This is therefore true in the particular case a:* = L;;'a:, and since a:�'l:n«� = 1, we are able to conclude that a:'L;;:i:nL;;'a: � 1 . We can further deduce from this fact that L;;:i:nL;;' � lp. To show this, note that if a matrix B is nonsingular, and g (a:) = a:'Ba:!a:'a: = 1 for every a: ::f. 0, g has the gradient vector g'(a:) = 2Ba:/a:'a: - 2a:/a:'a:, for any a:, and the system of equations Ba:/a:' a: - a:/a:'a: = 0 has the unique solution B = lp. If Ln is the factor ization of l:n. since Ln is asymptotically of rank p it follows by 18.10(ii) that LnL ;; � lp, and we arrive at the desired conclusion, for comparison with (25.27): n A " D N(O, lp). L -;; L Xt � (25. 30)
(pxp)
t=l
The extension to general dependence is a matter of estimating l:n by a general ization of the consistent methods discussed in §25. 1 , either i:n 1 = I,j�1Z�iZ� i where Z �i = L�(i-l )bn+ IXr, or, letting weights { Wnd represent the Bartlett kernel, n n bn (25. 3 1 ) i:n3 = :LXrX; + :L wnk L (XrX;_k + Xt-kX;). + t=l
k=l
t=k l
The latter matrix is assuredly positive definite, since a:'i:n3« � 0 with arbitrary a: by application of (25. 12) with Xr = Xt«. 25.4 Error Estimation
There are a number of celebrated theorems in the classical probability literature on the rates at which the deviations of distributions from their weak limits, and also stochastic sequences from their almost sure limits, decline with n. Most are for the independent case, and the extensions to general dependent sequences are not much researched to date. These results will not be treated in detail, but it is useful to know of their existence. For details and proofs the reader is referred to texts such as Chung ( 1974) and Loeve (1977). If {Fn } is a sequence of c.d.f.s and Fn � (the Gaussian c.d.f.), the Berry Esseen theorem sets limits on the largest deviation of Fn from . The setting for
The Central Limit Theorem
408
this result is the integer-moment case of the Liapunov CLT; see (23.37). 25.7 Berry-Esseen theorem Let {X1} be a zero-mean, independent, L3 -bounded random sequence, with variances { cr?L let s� = I,�= 1 cr?, and let Fn be the c.d.f. of Sn = 'L7= I X1 1sn. There exists a constant C > 0 such that, for all n, n sup i Fn (x) (x) l � C 2:: £ 1 Xrl 31s�. D (25 . 32) X t= l
-
The measure of distance between functions Fn and appearing on the left-hand side of (25.32) is the uniform metric (see §5.5). As was noted in §23 . 1 , convergence to the Gaussian limit can be very rapid, with favourable choice of Fn. The Berry Esseen bounds represent the 'worst case' scenario, the slowest rate of uniform convergence of the c.d.f.s to be expected over all sampling distributions having third absolute moments. For the uniformly L3-bounded case in which s� = O(n), inequality (25.32) establishes convergence at the rate n 1 12 . Another famous set of results on rates of convergence goes under the name of the law of the iterated logarithm (LIL). These results yield error bounds for the strong law of large numbers although they tell us something important about the rate of weak convergence as well. The best known is the following. 25.8 Hartman-Wintner theorem If {X1} is i.i.d. with mean j.l and variance a2, then I.7=I xr - 1-l = 1 a.s. o (25.33) limsup n--7= a(2n log log n) l/2 Notice the extraordinary delicacy of this result, being an equality. It is equiva lent to the condition that, for any £ > 0, both
P
(
and
P
I.7= 1 Xr - j.l � 1 + £, i.o. = 0, a(2n log log n) 1 12
(
)
- )
I.�= l xr - 1-l � 1 £, i.o. = 1 . a(2n log log n) 112
(25.34)
(25.35)
In words, infinitely many of these sequence coordinates will come arbitrarily close to 1, but no more than a finite number will exceed it, almost surely. By symmetry, there is a similar a.s. bound of -1 on the liminf of the sequence. Under these assumptions, n- 1 122.� = 1 (X1 - j.l)la � N(0, 1) according to the Lindeberg-Levy theorem, and so is asymptotically supported on the whole real line. On the other hand, n - 1 I.�=I (X1 - j..L)/a ----7 0 a.s. by the law of large numbers. It is clear there is a function of n, lying between 1 and n 1 12, representing the 'knife edge' between the degenerate and non-degenerate asymptotic distributions, being the smallest scaling factor which frustrates a.s. convergence on being applied to
Some Extensions
409
the sequence of means. The Hartman-Wintner law tells us that the knife-edge is 1 12 . l - 1 12 . precrse y n (2 1og 1 og n ) A feel for the precision involved can be grasped by trying some numbers: (2 log log 1099) 1 12 3.3! A check with the tabulation of the standard Gaussian probabilities will show that 3.3 is far enough into the tail that the probability of exceeding it is arbitrarily close to zero. What the LIL reveals is that for the scaled partial sums this probability is zero for some n not exceeding 1099, although not for yet larger n. Be careful to note how this is true even if the Xr variables have the whole real line as their support. For nonstationary sequences there is the following version of the LIL. 25.9 Theorem (Chung 197 4: th. 7.5 . 1) Let { Xr} be independent and L3 -bounded with variance sequence { a7 } , and let s� = I7= 1 cr7. Then (25.33) holds if for £ > 0, I7= 1 £1Xr l 3 = O(s�/( Iog Sn) 1 +e) . D ""
Generalizations to martingale differences also exist; see Stout (1974) and Hall and Heyde (1980) inter alia for further details.
VI THE FUNCTIONAL CENTRAL LIMIT THEOREM
26 Weak Convergence in Metric Spaces
26. 1 Probability Measures o n a Metric Space
In any topological space 5>, for which open and closed subsets are defined, the Borel field of 5l is defined as the smallest a-field containing the open sets (and hence also the closed sets) of 5>. In this chapter we are concerned with the properties of measurable spaces (Sl,!f), where s; is a metric space endowed with a metric d, and !f will always be taken to be the Borel field of 5>. If a probability measure Jl is defined on the elements of !f, we obtain a proba bility space ((Sl,d),!f,J..t) , and an element x E s; is referred to as a random element. As in the theory of random variables, it is often convenient to specify an under lying probability space (Q,'!F,P) and let ((Sl,d),!f,Jl) be a derived space with the prope1iy Jl(A) = P(x- \A ) ) for each A E !f, where x: � cs,d) is a measurable mapping. We shall often write 5l for (Sl,d) when the choice of metric is understood, but it is important to keep in mind that d matters in this theory, because !f is not invariant to the choice of metric; unless d1 and d2 are equivalent metrics, the open sets of (5l,d1 ) are not the same as those of (5l,d2). A property of measure spaces that is sometimes useful to assume is regularity (yet another usage of an overworked word, not to be confused with regularity of sequences etc.): (S>,J',J..t) is called a regular measure space (or Jl a regular measure with respect to (Sl,!f)) if for each A E !f and each £ > 0 there exists an open set 010 and a closed set C10 such that (26 . 1) and (26.2)
n
Happily, as the following theorem shows, this condition can be relied upon when 5l is a metric space. 26.1 Theorem On a metric space ((Sl,d),!f), every measure is regular. Proof Call a set A E !f regular if it satisfies (26. 1 ) and (26.2). The first step is to show that any closed set is regular. Let An = { x: d(A ,x) < lin}, n = 1 ,2,3, ... denote a family of open sets. (Think of A with a 'halo' of width 1/n.) When A is closed we may write A = nc;;= tAm and An -1- A as n � · By continuity of the measure this means J..L(An - A) � 0. For any £ > 0 there therefore exists N such that J..L(AN - A) < £. Choosing 010 = AN and C10 = A shows that A is regular. ""
The Functional Central Limit Theorem
414
Since § is both open and closed, it is clearly regular. If a set A is regular, so is its complement, since 0� is closed, C� is open, 0� � A c � C� and C� - 0� = Oe - Ce. If we can show that the class of regular sets is also closed under count-able unions, we will have shown that every Borel set is regular, which is the required result. Let Ar, A2 , ... be regular sets, and define A = U�= lAn. Fixing £ > 0, let One and Cne be open and closed sets respectively, satisfying (26.3)
and (26.4)
Let Oe = U�= l One, which is open, and A � Oe. Also let Ce = U�= l Cne, where the latter set is not necessarily closed, but C� = U�= l Cne where k is finite is closed, and C� � A; and since C� t Ce, continuity of the measure implies that k can be chosen large enough that Jl(Ce - C�) < £12. For such a k,
Jl(Oe - C�) $ Jl(Oe - Ce) + Jl(Ce - C�) 00
$ L Jl(One - Cne) + Jl(Ce - C�) < £. (26.5) n= l It follows that A is regular, and this completes the proof. • Often the theory of random variables has a straightforward generalization to the case of random elements. Consider the properties of mappings, for example. If (§,d) and (lf,p) are metric spaces with Borel fields !I and 'J, and f: § � lf is a function, there is a natural extension of 3.32(i), as follows. 26.2 Theorem If f is continuous, it is Borel-measurable. Proof Direct froms 5.19 and 3.22, and the fact that !I and 'J contain the open sets of § and lf respectively. • Let ((§ d), !!) and ((lf,p),'JJ be two measurable spaces, and let h: § � lf define a measurable mapping, such thatA E 'J implies that h - 1 (A) E !!; then each measure Jl on 5i has the property that Jlh- I , defined by Jlh - 1 (A) = Jl(h- \A)), A E 'J, (26.6) is a measure on ((lf,p),'J). This is just an application of 3.21, which does not use topological properties of the spaces and deals solely with the set mappings involved. However, the theory also presents some novel difficulties. A fundamental one concerns measurability. It is not always possible to assign probabilities to the Borel sets of a metric space - not, at least, without violating the axiom of choice. 26.3 Example Consider the space (D [o,I ] ,du), the case of 5.27 with a = 0 and b = 1 . Recall that each of the random elements fe specified by (5.43) are at a mntn�l rfi <:.:t�nce of 1 from one another. Hence, the spheres BCfe , 1) are all ,
Weak Convergence in Metric Spaces
415
disjoint, and any union of them is an open set (5.4). This means that the Borel field Vro, l ] on (Dr0, 1 1 ,du) contains all of these sets. Suppose we attempt to construct a probability space on ((Dro, l ] ,du), Vro, JJ) which assigns a uniform distribution to the fa, such that J.l( {fa: a e � b}) = b - a for 0 � a b � 1. Superficially this appears to be a perfectly reasonable project. The problem is formally identical to that of constructing the uniform distribution on [0, 1 ] . But there is one crucial difference: here, sets of fa functions corresponding to every subset of the interval are elements of Vro, l ] · We know that there are subsets of [0, 1] that are not Lebesgue-measurable unless the axiom of choice is violated; see 3.17. Hence, there is no consistent way of constructing the probability space ((Dro, l ] ,du),Vro, l J•J.l), where J.l assigns the uniform measure to sets of fa elements. This is merely a simple case, but any other scheme for assigning probabilities to these events would founder in a similar way. o There is no reason why we should not assign probabilities consistently to smaller a-fields which exclude such odd cases, and in the case of (Dro,n,du) the so called projection a-field will serve this purpose (see §28. 1 below for details). The point is that with spaces like this we have to move beyond the familiar intuitions of the random variable case to avoid contradictions. The space (Dr0, 1 1,du) is of course nonseparable, and nonseparability is the source of the difficulty encountered in the last example. The characteristic of a separable metric space which matters most in the present theory is the following. 26.4 Theorem In a separable metric space, there exists a countable collection V of open spheres, such that a(V) is the Borel field. Proof This is direct from 5.6, V being any collection of spheres S(x,r) where x ranges over a countable dense subset of s; and r over the positive rationals. • The possible failure of the extension of a p.m. to ('£,f/) is avoided when there is a countable set which functions as a determining class for the space. Measurabil ity difficulties on rR were avoided in Chapter 3 by sticking to the Borel sets (which are generated from countable collections of intervals, you may recall) and this dictum extends to other metric spaces so long as they are separable. Another situation where separability is a useful property is the construction of product spaces. In §3 .4 some aspects of measures on product spaces were discussed, but we can now extend the theory in the light of the additional structure contributed by the product topology. Let and ("IT" /J) be a pair of measurable topological spaces, with !I and <J the respective Borel fields. If 'R denotes the set of open rectangles of s; and !I ® <J = a('R), we have the following result. 26.5 Theorem If s; and "IT" are separable spaces, !I ® <J is the Borel field of with the product topology. Proof Under the product topology, 'R is a base for the open sets (see §6.5). Since s; "IT" is separable by 6.16, any open set of s; "IT" can be generated as a countable union of 'R-sets. It follows that any a-field containing 'R also contains the open sets of s; and in particular, !I ® <J contains the Borel field. Since the sets
<
x"U",
x
x"U",
<
(5>,:1)
5> x"U"
x
The Functional Central Limit Theorem
416
of � are open, it is also true that any a-field containing the open sets of s; x T also contains � ' and it follows likewise that the Borel field contains !I® <J . • If either s; or T are nonseparable, the last result does not generally hold. A counter-example is easily exhibited. 26.6 Example Consider the space (D[O, I J X D[o, 1 1 , Pu), where Pv is the max metric defined by (6. 1 3) with du for each of the componentmetrics. Let E denote the union of the open balls B( (x9,y9), 1) over 8 E [0, 1 ] , where xe and Ye are func tions of the form fe in (5.43). In this metric the sets B((xe,Ye), 1) are mutually disjoint rectangles, of which E is the uncountable union; if � denotes the open rectangles of (D[o, l ] x D[o, 1 1 , P u), E e a(�), even though E is in the Borel field of D[o, l ] x D[o, 1 1 , being an open set. o The importance of this last result is shown by the following case. Given a proba bility space (.Q,� ,P), let x and y be random elements of derived probability spaces ((Si,d),!f,!lx) and ((Si,d),!f,!ly). Implicitly, the pair (x,y) can always be thought of as a random element of a product space of which the llx and llY are the marginal measures. Since x and y are points in the same metric space, for given ro E Q a distance d(x(ro),y(ro)) is a well-defined non-negative real number. The question of obvious interest is whether d is also a measurable function on (.Q,�). This we can answer as follows. 26.7 Theorem If (Si,d) is a separable space, d(x,y) is a random variable. Proof The inverse image of a rectangle A x B under the mapping (x,y) : f--7 s; x§ lies in � ' being the intersection of the �-sets x- 1 (A) and y 1 (B) . The mapping is therefore �/!! ® !/-measurable by 3.22. But under separability, !I ® !I is the Borel field of § x § according to 26.5. Hence (x,y)(ro) (x(w),y(ro)) is a �/Borel measurable random element of§ x §. If the spaces; x s; is endowed with the product topology, the function d: § X § f--7 IR+ is continuous by construction, and this mapping is also Borel-measurable. The composite mapping (x,y) od: .Q f--7 1R is therefore �/:B-measurable, and the theorem follows. •
n
-
=
26.2 Measures and Expectations
As well as taking care to avoid measurability problems, we must learn to do with out various analytical tools which proved fundamental in the study of random variables, in particular the c.d.f. and ch.f. as representations of the distrib ution. These handy constructions are available only for r.v.s. However, if U5 is the set of bounded, uniformly continuous real functions f: s; f--7 IR, the expect ations
Weak Convergence in Metric Spaces
417 (26.7)
are always well defined. (From now on, the domain of integration will be understood to be s; unless otherwise specified.) The theory makes use of this family of expectations to fingerprint a distrib ution uniquely, a device that works regardless of the nature of the underlying space. While there is no single all-purpose function that will do this job, like eiAX in the case X E IR , the expectations in (26. 7) play a role in this theory analogous to that of the ch.f. in the earlier theory. As a preliminary, we give here a pair of lemmas which establish the unique represention of a measure on (Sl,:f) in terms of expectations of real functions on Sl. The first establishes the uniqueness of the representation by integrals. 26.8 Lemma If J.l and v are measures on ((Sl,d),:f) (:f the Borel field), and
ffdJ.l f dv, all =
f
fE
Us,,
(26.8)
then J.l = v . Proof We show that Us, contains an element for which (26.8) directly yields the conclusion. Let B E :f be closed, and define Bn = { x: d(x,B) < 1/n } . Think of Bn as B with an open halo of width 1/n. Bn -l- B as n --7 oo, B and B� are closed and mutu ally disjoint, and infxe BJ;,ye sd(x,y) :2 1/n for each n. Let gs:;,s E U'£ be a separating function such that g8:;,8 (x) = 0 for x E B� and 1 for x E B (see 6.13). Then j.l(B)
:::;
fgs:;,sdJ.l fgs:;,sdv fBngs:;,sdv :::; v(Bn). =
=
(26.9)
where the last inequality is because g8:;,8 (x ) :::; 1 . Letting n --7 oo, we have J.l(B) :::; v(B). But J.l and v can be interchanged, so J.l(B ) = v(B). This holds for all closed sets, which form a determining class for the space, so the theorem follows. • Since Us, c C'£, the set of all continuous functions on Sl, this result remains true if we substitute Cs, for Us, ; the point is that Us, is the smallest class of general functions for which it holds, by virtue of the fact that it contains the required separating function for each closed set. The second result, although intuitively very plausible, is considerably deeper. Given a p.m. J.l on a space Sl, define A(f) = ffdJ.l for f E Us,. We know that A is a functional on Us, with the following properties: (26. 10) f(x) :2 0, all X E Sl =::} A(f) :2 0 (26. 1 1) f(x) = 1, all X E Sl =::} A(f) = 1 (26. 12) A(afl + bh) = aA(fi) + bA(/2), !1> h E U, a,b E IR, where (26. 1 1) holds since fdJ.l = 1 , and (26. 12) is the linearity property of integrals. The following lemma states that on compact spaces the implication also
The Functional Central Limit Theorem
418
runs the other way. 26.9 Lemma Let be a compact metric space, and let A Us -7 1R define a func tional satisfying (26. 10)--(26. 12). There exists a unique p.m. on satisfy ing = A ( ) , each E Us. o In other words, functionals A and measures are uniquely paired. At a later stage we use this result to establish the existence of a measure (the limit of a sequence) by exhibiting the corresponding A functional. We shall not attempt to give a proof of this result here; see Parthasarathy (1967: ch. 2.5) for the details. Note that because is compact, Us and Cs coincide here; see 5.21.
ffdll f
5>
(f): ll (5>,Jl)
f
ll
5>
26.3 Weak Convergence
Consider IM, the set of all probability measures on ((5>,d),!f). As a matter of fact, we can extend our results to cover the set of all finite measures, and there are a couple of cases in the sequel where we shall want to apply the results of this chapter to measures where "# 1 . However, the modifications required for the extension are trivial. It is helpful in the proofs to have an agreed normaliz ation, and = 1 is as good as any, so let 1M be the p.m.s, while keeping the possibility of generalization in mind. Weak convergence concerns the properties of sequences in IM, and it is mathemati cally convenient to approach this problem by treating 1M as a topological space. The natural means of doing this is to define a collection of real-valued functions on IM, and adopt the weak topology that they induce. And in view of (26.7), a natural class to consider are the integrals of bounded, continuous real-valued functions with respect to the elements of IM. For a point E IM, define the base sets
ll fdll
fdll
f
ll
v,.tCk, t ... ,fk,£)
fi
,
=
{
v: v E IM,
J Jfidv-Jtidll/
< £,
i
=
}
1, ... ,k
(26. 1 3)
,llfk·
where E Us for each i, and £ > 0. By ranging over all the possible ft , ... and £, for each k E IN , (26. 1 3) defines a collection of open neighbourhoods of The base collection V11(k,ft, . . ,fk,£), E IM, defines the weak topology on IM. The idea is that two measures are close to one another when the expectations of various elements of Us are close to one another. The more functions this applies to, and the closer they are, the closer are the measures. This is not the conse quence of some more fundamental notion of closeness, but is the defining property itself. This simple yet remarkable application illustrates the power of the topo logical ideas developed in Chapter 6. The weak topology is the basic trick which allows distributions on general metric spaces to be handled by a single theory. Given a concept of closeness, we have immediately a companion concept of convergence. A sequence of measures n E IN } is said to converge in the weak topology, or converge weakly, to a limit written => if, for every neigh bourhood V11, 3 N such that E V11 for all n � N. If is a random element from a probability space and => !l, we shall say that converges in distri Essenhntion to and write xM � x. where x is a random element from
.
x
ll n (5>,Jl,!ln), lln
ll
{!ln!l,,
Xnlln Xn!l, (5>,Jl,!l).
Weak Convergence in Metric Spaces
419
tially, the same caveats noted in §22. 1 apply in the use of this terminology. The following theorem shows that there are several ways to characterize weak convergence. 26.10 Theorem The following conditions are equivalent to one another: (a) �n � �(b) ffdJln -c) ffd� for every f E U§. (c) limsupn�n(C) s �(C) for every closed set C E !f. (d) liminfn�n(B) � �(B) for every open set B E !f. (e) limn�n(A) = �(A) for every A E !f for which �(oA) = 0. o The equivalence of (a) and (b), and of (a) and (e), were proved for the case of measures on the line as 22.8 and 22.1 respectively; in that case weak convergence was identified with the convergence of the sequence of c.d.f.s, but this charac terization has no counterpart here. A noteworthy consequence of the theorem is the fact that the sets (26. 1 3) are not the only way to generate the topology of weak convergence. The alternative corresponding to part (e) of the theorem, for exam ple, is the system of neighbourhoods, V�(k,A 1 , . . . ,Ak>£)
=
{v: v E
[M,
j vCAi) - �(AD j < £, i
=
l,
. . ,k},
(26. 14)
where Ai E !f, i = l, ... ,k and �(()Ai) = 0. Proof of 26.10 This theorem is proved by showing the circular set of implications, (a) � (b) � (c) � (c),(d) � (e) � (a). The first is by definition. To show that (b) � (c), we can use the device of 26.8; let B be any closed set in !f, and put Bm = {x: d(x,B) < 1 /m } , so that B and B� are closed and inf B;. ,y B d(x,y ) � lim. Letting gs;.,B E u$ be the separating function defined above (26.9), we have xE
f
f
E
f
limsup Jln(B) s limsup gs;.,sd�n = gs;.,sd)l = B gs;.,sd� s �(Bm), (26. 1 5) m n---')oo n---')oo where the first equality is by (b). (c) now follows on letting m -c) oo . (c) � (d) is immediate since every closed set is the complement of an open set relative to and �(Sl) 1 . To show (c) and (d) � (e): for any A E !f, A0 c A � A, where A0 is open and A is closed, and ()A = A - A0• From (c), limsup Jln(A) s limsup �n (A) s J.L(A) = �(A), (26. 1 6) n---')oo and from (d), (26. 1 7) liminf�n(A) � liminf�n(A0) � �(A0) = �(A),
5>,
=
hence limn�n(A) �(A). The one relatively tricky step is to show (e) � (a). Let f E U§, and define (what is easily verified to be) a measure, �' on the real line (IR,15) by =
The Functional Central Limit Theorem
420
(26 . 1 8) 1-t'CB) = !l({x: f(x) E B}), B E 'B. f is bounded, so there exists an interval (a,b) such that a < f(x) < b, all x E Sl. Recall that a distribution on (lR ,'B) has at most a countable number of atoms. Also, a finite interval can be divided into a finite collection of disjoint subintervals of width not exceeding £, for any £ > 0. Therefore it is possible to choose m points t1, with a = to < t1 < ... < tm = b, such that t1 - tJ-1 < £, and 1-t'C { t1}) = 0 , for each j. Use these to construct a simple r.v. m (26. 19) gm(X) = 2)}- dAjCx), }= I
where AJ = {x: fJ- 1 :::::;; f(x) :::::;; fJ } , and note that supxl f(x) - gm(x) I
< £.
[ ffd!Jn - Jfd!l l :::::;; J l f - gm l d!ln + J l f - gm l d!l + [ fgmd!ln - fgmd!l l :::::;; 2£
m + L I tj- l l l !ln(Aj) - !l(Aj l ·
Thus,
(26.20)
}= I
Since !l(dA1) (e),
=
0
by the choice of t;, so that limn!ln(Aj) li:n��p
I ffdJ.ln - Ifd!l l
:::::;; 2£.
=
!l(Aj), for each j by (26.2 1 )
Since £ can be chosen arbitrarily small, (a) follows and the proof is complete. • A convergence-determining class for (Sl,:f) is a class of sets 'U c :.t which satisfy the following condition: if !ln(A) ---7 !l(A) for every A E 'U with !l(dA) = 0, then lln => J..l . This notion may be helpful for establishing weak convergence in cases where the conditions of 26.10 are difficult to show directly. The following theorem is just such an example. 26.11 Theorem If 'U is a class of sets which is closed under finite intersections, and such that every open set is a finite or countable union of 'U-sets, then 'U is con vergence-determining. Proof We first show that the measures lln converge for a finite union of 'U-sets A 1 , ,Am . Applying the inclusion-exclusion formula (3 .4), • • .
(26.22)
where the sets Ck consist of the AJ and all their mutual intersections and hence are in 11 whenever the A1 are, and '±' indicates that the sign of the term is given in accordance with (3 .4). By hypothesis, therefore, (26.23 )
Weak Convergence in Metric Spaces
421
To extend this result to a countable union B = U}= IAj , note that continuity of � implies �(U} = IAj) t �(B) as m -7 oo, so for any £ > 0 a finite m may be chosen large enough that �(B) - �(Uj= IAj) < £. Then
(LJ ) (LJ )
(26.24) liminf �n(B) � liminf �n Aj = � Aj > �(B) - £. n �oo n �oo �= I :t= l Since £ is arbitrary and (26.24) holds for any open B E !f by hypothesis on 'U, condition (d) of 26.10 is satisfied. • A convergence-determining class must also be a determining class for the space (see §3.2). But caution is necessary since the converse does not hold, as the following counter-example given by Billingsley (1968) shows. 26.12 Example Consider the family of p.m.s {�n } on the half-open unit interval [0 , 1) with �n assigning unit measure to the singleton set { 1 - 1/n } . That is, �n( { 1 - 1/n}) = 1 . Evidently, { �n } does not have a weak limit. The collection b' of half-open intervals [a ,b) for 0 < a < b < 1 generate the Borel field of [0 , 1 ) , and so are a determining class. But �n([a,b)) -7 0 for every fixed a > 0 and b < 1 , and the p.m. � for which �( { 0}) = 1 has the property that �([a,b)) = 0 for all a > 0. It is therefore valid to write (26.25) �n(A) -7 �(A), all A E b', even though �n � � in this case, so b' is not convergence-determining. o The last topic we need to consider in this section is the preservation of weak convergence under mappings from one metric space to another. Since �n => � means ffdllr. -7 ffd� for any f E U$, it is clear, since j oh E U$ when h is continuous, that ff(h(x))dllr.(x) -7 ff(h(x))d�(x). Writing y for h(x), we have the result
fJ(y)dJ..!nh- I (y) -7 Jf(y)d�h- I (y).
(26.26)
So much is direct, and relatively trivial. But what we can also show, and is often much more useful, is that mappings that are 'almost' continuous have the same property. This is the continuous mapping theorem proper, the desired generaliz ation of 22.11. 26.13 Continuous mapping theorem Let h: S H T be a measurable function, and let Dh c $ be the set of discontinuity points of h. If �n => � and �(Dh) = 0, then �nh- I => �h - I . Proof Let C be a closed subset of T. Recalling that (A) - denotes the closure of A, limsup �nh - 1 (C) = limsup �n(h - I ( C)) =::; limsup �n((h - I ( C)) -) n �oo =::; �((h - 1 (C)) -) =::; �(h - 1 (C) u Dh) =::;
�(h - l (C)) + �(Dh)
=
�(h- l (C))
=
�h - I (C),
(26.27)
422
The Functional Central Limit Theorem
noting for the third inequality that (h- 1 (C))- k h- 1 (C) u Dh; i.e., a closure point of h- 1 (C) is either in h- 1 (C), or is not a continuity point of h. The second inequality is by 26.10(c), and the conclusion follows similarly. • 26.4 Metrizing the S pace of Measures
We can now outline the strategy for determining the weak limit of a sequence of measures {�n} on (£,:!). The problem falls into two parts. One of these is to determine the limits of the sequences {�n(A) } for each A E t:i', where t:i' is a deter mining class for the space. This part of the programme is specific to the particu lar space under consideration. The other part, which is quite general, is to verify conditions under which the sequence of measures as a whole has a weak limit. Without this reassurance, the convergence of measures of elements of t:i' is not generally sufficient to ensure that the extensions to :f also converge. It is this second aspect of the problem that we focus on here. It is sufficient if every sequence of measures on the space is shown to have a cluster point. If a subsequence converges to a limit, this must agree with the unique ordinary limit we have (by assumption) established for the determining class. Our goal is achieved by finding conditions under which the relevant topo logical space of measures is sequentially compact (see §6.2). This is similar to what Billingsley (1968) calls 'relative' compactness, and the required results can be derived in his framework. However, we shall follow Prokhorov (1956) and Parthasarathy ( 1967) in making 1M a metric space which will under appropriate circumstances be compact. The following theorem shows that this project is feasi ble; the basic idea is an application of the embedding theorem (6.20/6.22). 26.14 Theorem (Parthasarathy 1967: th. ll.6.2) If and only if ('5,d) is separable, 1M can be metrized as a separable space and embedded in [0, 1 ]=. Proof Assume ('5,d) is separable. The first task is to show that Us is also separa ble. According to 6.22, '5 can be metrized as a totally bounded space ('5,d') where d' is equivalent to d. Let S denote the completion of '5 under d' (including the limits of all Cauchy sequences on '5) and then S is a compact space (5.12). The space ofcontinuous functions Cs; is accordingly separable under the uniform metric (5.26(ii)). Now, every continuous function on a compact set is also uniformly continuous (5.21), so that Us; = Cs;. Moreover, the spaces Cs; and Us are isometric (see §5.5) and if the former is separable so is the latter. Let {gm, m E lN } be a dense subset of U'£, and define the mapping T: 1M -7 IR= by (26.28) The object is to show that T embeds 1M in IR=. Suppose T(�) = T(v), so that fgmd� = fgmdv for all m. Since {gm } is dense in u'£, f E u'£ implies that
I ffd� -fgmd� I :5 f I f - gm I d� ::; du(f,&n)
<
£
(26.29)
Weak Convergence in Metric Spaces
423
for some m, and every £ > 0. (The second inequality is because fdJ.l = 1 , note.) The same inequalities hold for v, and hence we may say that ffdJ.l = ffdv for all f E U5, . It follows by 26.8 that Jl = v , so T is 1 - 1 . Continuity of T follows from the equivalence of (a) and (b) in 26.10. To show that T - 1 is continuous, let { Jln } be a sequence of measures and assume T(Jln) ----7 T(Jl). For f E U5, and any m � 1 ,
I ffd� - ffdJ.l l
=
I f(f - gm)dJ.ln + f(g + fg fg l g + I fgmdJ.ln - fg l · m - f)dJ.l
:S: 2du(f,
m
mdJ.ln -
mdJ.l
mdJ.l
)
(26.30)
Since the second term of the majorant side converges to zero by assumption, limsupn
If f l fd� - fdJ.l
:S:
2du(f,&n ) < 2£
(26 .3 1 )
for some m, and £ > 0, by the right-hand inequality of (26.29). Hence limn l ffd� - ffdJ.l i = 0, and lln ==> Jl by 26.10(b). We have therefore shown that 1M is homeomorphic with the set T(IM) � IR"", and IR"" is homeomorphic to [0, 1 ]"" as noted in 5.22. The distance d"" between the images of points of 1M under T defines a metric on 1M which induces the weak topology. The space T(IM) with the product topology is separable (see 6.16), so applying 6.9(i) to T - 1 yields the result that 1M is separable. This completes the sufficiency part of the proof. The necessity part requires a lemma, which will be needed again later on. Let Px E 1M be the degenerate p.m. with unit mass at x, that is, px( { x } ) = 1 and Px('5 - { x } ) = 0, and so let D = {px : x E '5 } � IM. 26.15 Lemma The topological spaces '5 and D are homeomorphic.
'5 1---7 D taking points x E '5 to points Px E D is clearly 1 - 1 , onto. For f E C5,, ffdpx = f(x), and Xn ----7 x implies f(.xn) ----7 f(x) and hence Pxn ==> Px by 26.10, establishing continuity of p. Conversely, suppose Xn A x. There is then an open set A containing x, such that for every N E IN, Xn E '5 - A for some n � N. Let f be a separating function such that f(x) = 0, f(y) = 1 for y E '5 - A, and 0 :S: f :S: 1. Then ffdPxn = 1 and ffdPx = 0, so Pxn � Px· This establishes Proof The mapping p:
continuity of p - 1 , and p is a homeomorphism, as required.
•
Proof of 26.14, continued Now suppose 1M is a separable metric space. It can be embedded in a subset of [0, 1 ]"" , and the subsets of 1M are homeomorphic to their images in [0, 1 ]"" under the embedding, which are separable sets, and hence are
themselves separable (again, by 6.16 and 6.9(i)). Since D c IM, D is separable and hence '5 must be separable since it is homeomorphic to D by 26.15. This proves necessity. •
The last theorem showed that 1M is metrizable, but did not exhibit a specific metric on IM. Note that different collections of functions {gm } yield different metrics, given how d= is defined. Another aooroach to the oroblem is to construct
The Functional Central Limit Theorem
424
such a metric directly, and one such was proposed by Prokhorov ( 1 956). For a setA e !1, define the open set A li = {x: d(x,A) < 3}, that is, 'A with a 3-halo' . The Prokhorov distance between measures !l, v e IM, is L(!l,V) inf{3 > 0: !l(Aii) + 3 ;;::: v(A), all A e !!}. (26.32) Since !f contains complements and !l(Sl) = v(Sl) = 1 , it must be the case, unless !l = v, that !l(A) ;;::: v(A) for some sets A e !1, and !l(A) < v(A) for others. The idea of the Prokhorov distance is to focus on the latter cases, and see how much has to be added to both the sets and their !l-measures, to reverse all the inequalities. When the measures are close this amount should be small, but you might like to convince yourself that both the adjustments are necessary to get the desired properties. As we show below, L is a metric, and hence is symmetric in !l and v. The properties are most easily appreciated in the case of measures on the real line, in which case the metric has the representation in terms of the c.d.f.s, L*(F1 ,F2) = inf { <5 > 0 : F2(x - <5) <5 � F1 (x) � F2 (x+ 3) + <5, V x e [R } , (26.33) for c.d.f.s F 1 and F2 . This is also known as Levy 's metric. =
-
Fig.
26. 1
Fig. 26. 1 sketches this case, and F2 has been given a discontinuity, so that the form of the bounding functions F2 (x + <5) + <5 and F2(x - <5) - 3 can be easily discerned. Any c.d.f. lying wholly within the region defined by these extremes, such as the one shown, is within () of F2 in the L* metric. 26.16 Theorem L is a metric. Proof L(!l,V) = L(V,!l) is not obvious from the definition; but for any 3 > 0 consider B = (A ii)c. If x e A, then d(x,y) 2 () for each y e B, whereas if x e Bli, d(x,y) <5 for some y e B; or in other words, Bli = A c If L(!l,V) � <5, then !l(A c) + 3 = !l(B 0) + () ;;::: v(B). (26.34) Subtracting both sides of (26.34) from 1 gives !l(A) � v(Bc) + () = v(Ali) + 8, (26.35)
<
.
Weak Convergence in Metric Spaces
425
and hence L(V,Il) � B. This means there is no B for which L(!l,V) > B ;::: L(V,!l), nor, by symmetry, for which L(V,Il) > B ;::: L(!l,V), and equality follows. It is immediate that L(!l,V) = 0 if ll = v. To show the converse holds, note that n n if L(!l,V) = 0, ll(A Il ) + lin ;::: v(A) for A E !f, and any n E IN. If A is closed, A Il n .J, A as n ---7 By continuity of 11, ll(A) = limn(ll(A li ) + lin) ;::: v(A), and by n symmetry, v(A) = limn(v(A li ) + lin) ;::: ll(A) likewise. It follows that ll(A) = v(A) for all closed A. Since the closed sets are a determining class, ll v. Finally, for measures 11, v, and 't let L(!l,V) = B and L(V,'t) = ll · Then for any A E !f, ll(A) � v(A8) + B � 't((A 8Y1 ) + B + ll � 't(A&+11) + B + ll , (26.36)
oo
.
=
where the last inequality holds because (A8)11 = {x: d(x,A 8) < ll } � {x: d(x,A ) < B + ll) }
=
A&+11 ,
(26.37)
the inclusion being valid since d satisfies the triangle inequality. Hence L(!l,'t) � B + ll = L(!l,V) + L(V,'t). • We can also show that L induces the topology of weak convergence. 26.17 Theorem If {lln} is a sequence of measures in IM , lln L(lln ,ll) ---7 0 .
===>
ll if and only if
show 'if , suppose L(lln•ll) ---7 0. For each closed set A E !f, limsupnlln(A) � ll(A 8) + B for every B > 0, and hence, letting B .J, 0, limsupnlln(A) � ll(A) by continuity. Weak convergence follows by (c) of 26.10. To show 'only if , consider for A E !f and fixed B the bounded function Proof To
fA (x) Note that fA (x) Since
=
=
{
}
d(x,A)
(26.38)
max 0, 1 - -- . B
1 for x E A, 0 < fA (x) � 1 for x E A 8, and fA (x)
=
0 for x � A 8 .
(26.39) independent of A, the family { fA , A E !f } is uniformly equicontinuous (see §5.5) and so is a subset of U5 . If lln ===> 11, then by 26.10(b), ,1n
=
sup
AE 9'
I ffAdlln - f Ad l.
(26.40)
f ll ---7 0.
Hence, n can be chosen large enough that ,1n � B, for any B > 0. For this n or larger,
f
f
f
lln(A) � fAdlln � fAdll + ,1n � fAdll + B � ll(A 8) + B, or, equivalently, L(lln,ll) � B. It follows that L(lln.ll)
---7
0.
•
(26.41)
It i s possible to establish the theory of convergence o n 1M by working explicitly
426
The Functional Central Limit Theorem
in the metric space (lM,L). However, we will follow the approach of Varadarajan (1958), of working in the equivalent space derived in 26.14. The treatment in this section and the following one draws principally on Parthasarathy (1967). The Prokhorov metric has an application in a different context, in §28.5. The next theorem leads on from 26.14 by answering the crucial question - when is 1M compact? 26.18 Theorem (Parthasarathy 1967: th. II.6.4) 1M is compact if and only if § is compact. Proof First, let § be compact, and recall that in this case Cs = U5 (5.21), and C5 is separable (5.26(ii)). For simplicity of notation write just C for C5, and write 0 for that element of C which takes the value 0 everywhere in §. Let Sc(0, 1) denote the closed unit sphere around 0 in C, such that sup1 l f(t) I ::;; 1 for all f E Sc(O, 1 ), and let {gm , m E IN } be a sequence that is dense in Sc(O, 1 ). For this sequence of functions, the map T defined in (26.28) is a homeomorphism taking 1M into T(IM), a subset of the compact space [ - 1 , 1 r. This follows by the argument used in 26.14. It must be shown that T(IM) is closed and therefore compact. Let {�n } be a sequence of measures in 1M such that T(�n) -7 y E [ -1 , 1r. What we have to show to prove sufficiency is that y E T(lM). Since the mapping T - 1 onto 1M is continuous, this would imply (6.9(ii)) that 1M itself is compact. Write An(f) = fjd�, and note that, since I ffd� I ::;; sup1 l f(t) I ::;; 1 , this defines a functional An(f): Sc(O, l) t--7 [- 1 , 1] . (26.42) In this notation we have T(�n) = (An(gt), An(g2), ... ). Since Sc(O, 1 ) is compact and {gm } is dense in it, we can choose for every f E Sc(0, 1) a subsequence {gmk ' k E IN } converging to f. Then, as in (26.30), (26.43) The second term of the majorant side contains a coordinate of T(J.ln) - T(J.ln') and converges to 0 as n and n' --7 oo by assumption. Letting k --7 =, we obtain, as in (26.31 ), (26.44) lim I An (f) - 1\z, (f) I = 0. n,n'�oo This says that {An } is a Cauchy sequence of real functionals on [- 1 , 1], and so must have a limit A; in particular, y = (A(g 1 ), A(g2), ... ). It is easy to verify that each An(f), and hence also A(f), satisfy conditions (26.10)-{26. 12) for f E Sc(0,1 ). Since for every f E C there is a constant c > 0 such that cf e Sc(0, 1), we may further say, by (26. 12), that A(f) = cA*(f/c) where A * (.) is a functional on C which must also satisfy (26. 10)-(26.12). From 26.9, there exists a unique J.l E 1M such that A* (f)= ffd�, fe C. Hence, we may write y = T(Jl). It follows that T(IM) contains its limit points, and being also bounded is compact; and since T - 1 is a homeomorphism, 1M is also compact. This completes the proof of sufficiency.
Weak Convergence in Metric Spaces
427
To prove necessity, consider D = {px : x E S> } � IM, the set shown to be homeo morphic to S> in 26.15. If D is compact, then so is D is totally bounded when 1M is compact, so by 5.12 it suffices to show completeness. Every sequence in D is the image of a sequence { Xn E S> } , and can be written as {Pxn } , so suppose Pxn ==> q E IM . If Xn � x E Sl, then q = Px E D by 26.15, so it suffices to show that Xn p x is impossible. The possibility that { Xn } has two or more distinct cluster points in 5l is ruled out by the assumption Pxn ==> q, so Xn P x means that the sequence has no cluster points in Sl. We assume this, and obtain a contradiction. Let E {xt,x2 , ... } � 5l be the set of the sequence coordinates, and let E 1 be any infinite subset of E. If the sequence has no cluster points, every point y E E1 is isolated, in that Et n S(y, E) - {y } is empty for some E > 0. Otherwise, there would have to exist a sequence {Yn E Ed such that Yn E S(y, lin) for every n, and y would be a cluster point of {xn } contrary to assumption. A set containing only isolated points is closed, so E1 is closed and, by 26.10(c),
5>.
=
(26.45) where the equality must obtain since Et contains Xn for some n � N, for every N E [N . Since q E !M, this has to mean q(E1 ) = 1 . But clearly we can choose another subset from E, say E2, such that E 1 and E2 are disjoint, and the same logic would give q(E2) = 1 . This is impossible. The contradiction is shown, concluding the proof. • 26.5 Tightnes s and Convergence
In §22.5 we met the idea of a tight probability measure, as one whose mass is concentrated on a compact subset of the sample space. Formally, a measure 1-1 on a space (S>,!f) is said to be tight if, for every E > 0, there exists a compact set K£ E !f such that j.l(K�) � E. Let TI � 1M denote any family of measures. The family n is said to be uniformly tight if supJlE rrj.l(K�) ::::; E. Tightness is a property of general measures, although we shall concentrate here on the case of p.m.s. In the applications below, TI typically represents the sequence of p.m.s associated with a stochastic sequence {Xn}l'. If a p.m. 1-1 is tight, then of course j.l(K£) > 1 - E for compact K£. In §22.5 uniform tightness of a sequence of p.m.s on the line was shown to be a necessary condition for weak convergence of the sequence, and here we shall obtain the same result for any metric space that is separable and complete. The first result needed is the following. 26.19 Theorem (Parthasarathy 1967: th. 11.3.2) When 5l is separable and complete, every p.m. on the space is tight. o Notice, this proves the earlier assertion that every measure on � .�) is tight, given that IR is a separable, complete space. Another lemma is needed for the proof, and also subsequently.
The Functional Central Limit Theorem
428
26.20
Lemma
Let S be a complete space, and let (26.46)
where Sni is a sphere of radius lin in S, Sni is its closure, and in is a finite integer for each n. Then K is compact. Proof Being covered by a finite collection of the Sni for each n, K is totally bounded. If { Xj, i E IN } is a Cauchy sequence in K, completeness of S implies that Xj � x E S. For each n, since K c U{� 1 Sni• infinitely many of the sequence coor dinates must lie in Kn = K n Snk for some k, 1 � k � in · Since Snk has radius 1/n, taking n to the limit leads to the conclusion that nnKn = {x}, and hence, X E K; K is therefore complete, and the lemma follows by 5.12. • Proof of 26.19 By separability, a covering of S by 1/n-balls Sn = S(x, 1/n), x E S, has a countable subcover, say {Sn i i E IN } , for each n = 1 ,2, ....nFix n. For any E > 0 there must exist in large enough that Jl(An) � 1 - El2 , where An = U{� ! Sni; otherwise we would have Jl(Ui'= I Sn i) = Jl(S) < 1 - El2n , which is a contradiction since Jl is a p.m. Given E, choose An in this manner for each n and let Ke = n;= IAn where An = U{�1Sni• note. Then Ke is compact by 26.20. Further, since (26.47)
n= l
n=l
(26.48)
or, in other words, Jl(Ke) > 1 - E. • Before moving on, note that the promised proof of 12.6 can be obtained as a corollary of 26.19. 26.21 Corollary Let (S,Jl,Jl) be a separable complete probability space. For any E E Jl, there is for any E > 0 a compact subset K of E such that Jl(E - K) < E. Proof Let the compact set d E Jl satisfy Jl(d) > 1 - El2, as is possible by 26.19, and let (d,Jl�. Jl�) denote the trace of (S,Jl,Jl) on d. This is a compact space, such that every set in Jl� is totally bounded. By regularity of the measure (26.1) there exists for any A E Jl� an open set A' ::J A such that Jl�(A' - A) < E /2 . Moving to the complements, A'c is a closed, and hence compact, set contained in Ac. But A c - A'c = A' - A, and Jl(A c - A'c) = Jl�(A c -A'c)Jl(d) < El2. Now for any set E E Jl let A = (E n dl, and let K = A'c, and this argument shows that there is a compact subset K of E n d (and hence of E) such that Jl((E n d) - K) < fi2. Since Jl(E n de) � Jl(dc) � fi2, Jl(E - K) < E, as required. •
Weak Convergence in Metric Spaces
429
k Lemma 12.6 follows from this result on noting that IR is a separable complete space. Theorem 26.19 tells us that on a separable complete space, every measure Jln of a sequence is tight. It remains to be established whether the same property applies to the weak limit of any such sequence. Here the reader should review examples 22.19 and 22.20 to appreciate how this need not be the case. The next theorem is a partial parallel of 22.22, although the latter result goes further in giving sufficient conditions for a weak limit to exist. Here we merely establish the possibility of weak convergence, via an application of theorems 5.10 and 5.11, by showing the link between uniform tightness and compactness. 26.22 Theorem (Parthasarathy 1967: th. II.6.7) Let (S,d) be a separable complete space, and let I1 � 1M be a family of p.m.s on (S,!f). TI is compact if and only if it is uniformly tight.
(S,d) is separable, it is homeomorphic to a subset of [0, 1 ]00, by 6.22. Accordingly, there exists a metric d' equivalent to d such that (S,d') is rela Proof Since
tively compact. In this metric, let § be a compact space containing '5, and let !f be the Borel field on § . We cannot assume that '5, E '!1, but !f, the Borel field of '£, is the trace of !f on '£. Define a family of measures fi on § such that, for jl E fi, jl(A) = Jl(A n '£), Jl E TI, for each A E !f. To prove that I1 is compact, we show that a sequence of measures {Jln, n E [N } from I1 has a cluster point in TI. Consider the counterpart sequence { Jln, n E [N } in fi. Since § is compact, fi is compact by 26.18, so this sequence has one or more cluster points in fi. Let v be such a cluster point. The object is to show that there exists a p.m. Jl E I1 such that J1 = v. Tightness of I1 means that for every integer r there is a compact set Kr c '5, such that Jl(Kr) � 1 1/r, for all Jl E TI. Being closed in § , Kr E !f and jl(Kr) = f.l(Kr n '5,) = 1-l(Kr), all !-1 E Il. Since Kr is closed we have for some subsequence { nh k E IN } -
(26.49) by 26.10(c). Since UrKr E '!1, we have in particular that v(UrKr) = 1 . Now, suppose we let v*('S,) denote the outer measure of 'S, in terms of coverings by !f-sets. Since UrKr c '£, we must have v*(S) � v*(UrKr) = v(UrKr) = 1 . Applying 3.10, note that '5, is v-measurable since the inequality in (3. 19) becomes
(26.50) which holds for all B � § . Since !f is the trace of !f on S, all the sets of !f are v*(B n '£) � v*(B),
accordingly v-measurable and there exists a p.m. Jl E I1 such Athat J1 = v, as required. For any closed subset C of '£, there exists a closed D � '5, such that C = D n '£, the assertions limsupkJlnk(D) � jl(D) and limsupkJln/C) � J.L(C) are equiv alent, and hence, by 26.10, Jlnk ::::::> Jl. This means that { J.!n } has a convergent sub sequence, proving sufficiency. Notice that completeness of '5, is not needed for this part of the proof. To prove necessitv. assume I1 is comoact. Letti n g f S_,_ i E IN 1 he a r.onntahlP.
430
The Functional Central Limit Theorem
covering of § by 1/n-spheres, and Un, n E iJ'.J } any increasing subsequence of inte gers, define An = U{�1Sni · We show first that the assumption, ::3 J.l E II such that, for 8 > 0, (26.51) leads to a contradiction, and so has to be false. If (26.51) is true for at least one element of (compact) II, there is a convergent sequence { J.lkl k E iJ'.J } in II, with J.lk ==:} J.l, such that it holds for all J.lk· (Even if there is only one such element, we can put J.lk = J.l, all k). Fix m. By 26.10, J.l(An) � limsup J.lk(An) � 1 - 8.
(26.52)
k-7oo
Letting n � oo yields J.l(§) � 1 8, which is a contradiction. Putting 8 = El2n , we may therefore assert n J.l(An) > 1 - El2 , all n, all J.l E II. (26.53) Letting Ke = n�=1An , this set is compact by 26.20 (§ being complete) and it follows as in (26.48) above that J.l(Ke) > 1 - E. Since J.l is an arbitrary element of II, the family is uniformly tight. • We conclude this section with a useful result for measures on product spaces. See §7.4 for a discussion of the marginal measures. 26.23 Theorem A p.m. J.l on the space (§ x lr, with the product topology is tight iff the marginal p.m.s J.lx and J.l:y are tight. Proof For a set K E § x lr, let Kx = nx(K) denote the projection of K onto §. Since the projection is continuous (see §6.5), Kx is compact if K is compact (5.20). Since (26.54) tightness of J.l implies tightness of J.lx· Repeating the argument for J.l:y proves the necessity. For sufficiency we have to show that there exists a compact set K E having measure exceeding 1 - E. Consider the set K =A x B where A E :1 and J.lx{A) > 1 - E/2, and B E where J.ly(B) > 1 - E/2. Note that (26.55) Kc = (A x Bc) u (Ac x B) U (Ac x Bc), where the sets of the union on the right are disjoint. Thus, -
:1 ® 'J)
:1 ® 'J,
'J
J.l(Kc) � J.l(A c X B) + J.l(A X Be) +
+
2J.L(A c X Be)
J.l(§ X Be) = J.lx{A c) + J.l:y(Bc) � E. = J.l(A c X lf)
(26.56)
If A and B are compact they are separable in the relative topologies generated from § and lr (5.7), and hence K is compact by 6.17 • .
Weak Convergence in Metric Spaces
43 1
26.6 Skorokhod ' s Representation
Considering a sequence of random elements, we can now give a generalization of some familiar ideas from the theory of random variables. Recall from 26.7 that separability ensures that the distance functions in the following definitions are r.v.s. Let {xn } be a sequence of random elements and x a given random element of a separable space ('£,!/). If (26.57) d(xn( W),x(w)) ----7 0 for ffi E C, with P(C) 1, we say that Xn converges almost surely to x, and also write Xn � x. Also, if (26.58) P(d(xn,x) ;?: £) ----7 0, all £ > 0, we say that Xn converges in probability to x, and write Xn � x. A.s. convergence is sufficient for convergence in probability, which in tum is sufficient for Xn � x. A case subsumed in the above definition is where x = a with probability 1 , a being a fixed element of '£. We now have the following result generalizing 22.18. 26.24 Theorem Given a probability space (Q.,'!J,P), let {xn( W) } and {Yn(ffi) } be random sequences on a separable space ('£,!/). If Xn � x and d(xn,Yn) � 0, then D Yn --===-? X. Proof Let A E !f be a closed set, and for £ > 0 put A £ = { x: d(x,A) � £} E !f, also a closed set for each £, and A £ -!- A as £ -!- 0. Since { w: Yn( W) E A } � {w: Xn(W) E A£ } U { ffi: d(xn(W),yn(W)) ;::: £ } , we have (26.59) and, letting n ----7 oo, =
limsup P(yn E A) S limsup J.ln(A £) � J.l(A£), (26.60) n -toa n-toa where J.ln is the measure associated with Xm J.l the measure associated with x, and the second inequality of (26.60) is by hypothesis on {xn } and 26.10(c). Since this inequality holds for every £ > 0, we have limsup P(yn E A) � J.l(A), (26.61) n -toa by continuity of the measure. This is sufficient for the result by 26.10. • In §22.2, we showed that the weak convergence of a sequence of distributions on the line implies the a.s. convergence of a sequence of random variables. This is the Skorokhod representation of weak convergence. That result was in fact a special case of the final theorem of this chapter. 26.25 Theorem (Skorokhod 1956: 3. 1) Let {J.ln } be a sequence of measures on the
The Functional Central Limit Theorem
432
'Bro.n/:1-
separable, complete metric space ('£,:/). There exists a sequence of measurable functions Xn: [0, 1] 1--7 '£ such that /1-n (A) m( { ro: Xn(W) E A }) for each A E where m is Lebesgue measure. If 1-ln => /-!, there exists a function x( w) such that /1-(A) m( { ro: x(w) E A }) for each and d(xn(W),x(w)) -7 0 a.s.[m] as n -7 = AE Proof This is by construction of the functions Xn(W). For some k E IN let {x�k), i E IN } denote a countable collection of points in '£ such that, for every x E '£, d(x,x�k)) � 112k+l for some i. Such sequences exist for every k by separability. Let S(x�k).Tk), for 1 !2k+ l < rk < 1!2k, denote a system of spheres in '£ having the property /1-(dS(x�k),rk)) 0 for every i. An rk satisfying this condition exists, since there can be at most a countable number of points r such that /1-(dS(x�k\r)) > 0 for one or more i; this fact follows from 7.4. For given k, the system {S(x�k),rk), i E IN } covers '£, and accordingly the sets i-1 (26.62) D7 S(x)k),rk) - US(xjk),rk), i E IN,
:1,
==
:1,
.
==
==
==
j= l
form a partition of '£. By letting each of the k integers i1, ... ,ik range indepen dently over IN, define the countable collection of sets (26.63) Each Si1, . .,ik is a subset of a sphere of radius rk < 1/2k, and /-!(dSi1, ...,ik) 0. By construction, any pair Si1,... ,ik and Si;, ... ,i!: are disjoint unless ik ik. Fixing iJ , . . . ,ik-1 we have (26.64)
== ==
and in particular, (26.65) That is to say, for any k the collection {Si�o .. -,ik} forms a partition of '£, which gets finer as k increases. These sets are not all required to be non-empty. For any n E IN and k E IN , define a partition of [0, 1 ] into intervals .1.)��---.ik' where it is understood that .1.���---.ik lies to the left of .1�(� ... ,ik if i1 ij for j 1 , ... ,r- 1 and ir < i; for some r, and the lengths of the segments equal the probabilities /1-n(Si� o ... ,ik) . We are now ready to define a measurable mapping from [0, 1] to '£. Choose an element xi�o ... ,ik from each non-empty Si1, ... .ik• and for w E [0, 1 ] put (26.66) X�((J)) = Xi1, ...,ik if (J) E .1)��---.ikNote that by construction d(x�(ro),x�+m(m)) � I/2k for m � 1 , and taking k = 1 ,2, . . . defines a Cauchy sequence in '£ which is convergent since '£ is a complete hu .f). C" C"lu·nnt1nn '"XT.,.. f+.c:t. _kt ==
==
C f"V.:l. l' P.
I r., \
1 ; _...,
'·'
\
Weak Convergence in Metric Spaces
433
To show that Xn(CO) is a random element with distribution defined by 1-ln, it is sufficient to verify that (26.67) 1-ln(A) P(xn E A) = m( { CO: Xn(CO) E A }), for, at least, all A E !f such that 1-ln(aA) 0. If we let A(k) denote the union of all Si 1 , ... ,ik � A and A'(k) the union of all Si 1, . .. ,ik � Ac, it is clear that A (k) � A � A'(k) , and that (26.67) holds in respect of A(k) and A' (k) . Let c
=
=
=
. . .•
.
oc,
oc,
=
27 Weak Convergence in a Function Space
2 7 . 1 Measures on Function Spaces
This chapter is mainly about the space of continuous functions on the unit interval, but an important preliminary is to consider the space R[O, Il of all real functions on [0, 1]. We shall tend to write just R for this space, for brevity, when the context is clear. In this chapter and the following ones we also tend to use the symbols x,y etc. to denote functions, and t,s etc. to denote their arguments, instead of f,g and x,y respectively as in previous chapters. This is conventional, and reflects the fact that the objects under consideration are usually to be interpreted as empirical processes in the time domain. Thus, x: [0, 1 ] f---7 IR will be the function which assumes the value x(t) at the point t. In what follows the element x will typically be stochastic, a measurable mapping from a probability space (Q,c:Jf,P). We may legitimately write x: Q f---7 R, assigning x(co) as the image of the element co, but also x : Q x [0, 1] f---7 IR, where x(co , t) denotes the value of x at (co , t). We may also write x(t) to denote the ordinate at t where dependence on ro is left implicit. The potential ambiguity should be resolved by the context. Sometimes one writes x1 to denote the random ordinate where x1(co) = x(co,t), but we avoid this as far as possible, given our use of the subscript notation in the context of a sequence with countable domain. The notion of evaluating the function at a point is formalized as a projection mapping. The coordinate projections are the mappings 1t1: R[o,I] -7 1R , where n1(x) = x(t). The projections define cylinder sets in R; for example, the set n� 1 (a), a E IR , is the collection of all functions on [0,1] which pass through the point of the plane with coordinates (a , t). This sort of thing is familiar from § 12.3, and the union or intersection of a collection of k such cylinders with different coordinates is a k-dimensional cylinder; the difference is that the number of coordinates we have to choose from here is uncountable. Let { t1 , ... ,td be any finite collection of points in [0, 1], and let k (27. 1 ) 1tr 1, ... ,t/X) = (1tr1 (x), ... ,1t1/x)) E IR denote the k-vector of projections from these coordinates. The sets of the collection
Weak Convergence in a Function Space 1f
=
{1t�� .... ,riB)
� R [o,l]:
435
B E :Bk, tt , .. . , tk E [0, 1], k E IN },
(27.2)
are called the finite-dimensional sets of R[O,IJ· It is easy to verify that Jf is a field. The projection a-field is defined as 'P = cr(Jf). Fig. 27. 1 shows a few of the elements of a rather simple Jf-set, with k 1 , and B an interval [a,b] of IR . The set H = 1t��([a,b]) E Jf consists of all those functions that succeed in passing through a hole of width b - a in a barrier erected at the point t1 of the interval. Similarly, the set of all the functions passing through holes in two such barriers, at t1 and t2, is the image under 1t �� .t2 of a rectangle in the plane - and so forth. =
Fig. 27. 1 If the domain of the function had been countable, the projection <J-field 'P would be effectively the same collection as :B"" of 12.3. But since the domain is uncountable, 'P is strictly smaller than the Borel field on R. The sets of example 26.3 are Borel sets but are not in 'P, since their elements are restricted at uncountably many points of the interval. As that example showed, the Borel sets of R are not generally measurable; but (R,'P) is a measurable space, as we now show. Define for k 1,2,3, ... the family of finite-dimensional p.m.s I'"""I J , ... , tk on k (IR ,:Bk), indexed on the collection of all the k-vectors of indices, =
1 1
{ (tr , ... , tk): ti
[0, 1], j
1 , ... ,k} . This family will be required to satisfy two consistency properties. The first is (27.3) �t1, .. . ,t/E) = �t1, ... ,tm(E x 1R m -k) for E E 'Bk and all m > k > 0. In other words, a k-dimensional distribution can be obtained from an m-dimensional distribution with m > k, by the usual operation of E
=
marginalization. This is simply the generalization to arbitrary collections of coordinates of condition (12.7). The second is where n( l )
_____
n(k)
IS a
nermutation of the i ntep-ers 1
_
- .k. anct m:
(27.4) IR k � IR k
The Functional Central Limit Theorem
436
denotes the (measurable) transformation which reorders the elements of a k-vector according to the inverse permutation; that is, <j>(xp (I) •···Xp(kJ ) = x 1 , ... ,xk. This condition basically means that re-ordering the vector elements transforms the measure in the way we would expect if the indices were 1 , ... ,k instead of tJ, ... ,tk. The following extends the consistency theorem, 12.4. 27.1 Theorem For any family of finite-dimensional p.m.s { J.Lr 1 , ... ,rk } satisfying conditions (27.3) and (27.4), there exists a unique p.m. Jl on (R,'P), such that llr 1 , ... , t = Jl1t�� ... ,t for each finite collection of indices. k k Proof Let T denote the set of countable sequences of real numbers from [0, 1]; that is, 't E T if 't = { Sj E [0, 1], j E lN } . Define the projections 1t-r: R f---7 Iff"' by (27.5) For any 't, write v� = llsk··sn for n = 1 ,2, ... Then by 12.4, which applies thanks to (27.3), there exist p.m.s v-r on (IR"",�"") such that v� v-rn� \ where nn (y) is the projection of the first n coordinates of y, for y E IR "". Consistency requires that v� = v�' if sequences 't and -r' have their first n coordinates the same. Since evidently 'P c { n� 1 (B) : B E :B"", 't E T} , we may define a p.m. J.L on (R,'P) by setting (27.6) for each B E :B"". No extension is necessary here, since the measure is uniquely defined for each element of 'P. It remains to show that the family { J.Lr1 , ... , rk } corresponds to the finite dimensional distributions of J.L. For any Jlr 1 , .. ,tk there exists 't E T such that { tJ. ... ,tk } c { s 1 , ... ,sn } , for some n large enough. Construct a mapping \jf: IR n f---7 IR\ by first applying a permutation p to the indices s 1 , ... ,sn which sets x(p(s iJ) n to IR k by suppressing the = x(ti) for i = 1 , ... ,k, and then projecting from IR indices Sp(k+ l ) , ... , Sp(n) · The consistency properties imply that .
=
(27.7) Since \jf 0 1tn °1t-r = 1tr 1 , ... , tk is a projection, Jlr 1 , . .. . tk is a finite-dimensional distribution of J.L. • If we have a scheme for assigning a joint distribution to any finite collection of coordinate functions {x(t 1 ), ... ,x(tk) } with rational coordinates, this can be extended, according to the theorem, to define a unique measure on (R,'P). These p.m.s are called the finite-dimensional distributions of the stochastic process x. The sets generated by considering this vector of real r.v.s are elements of 1f., and hence there is a corollary which exactly parallels 12.5. 27.2 Corollary 1f. is a determining class for (R , 'P). o
Weak Convergence in a Function Space
437
27 .2 The Space C
Visualize an element of Cro. 1 1 , the space of continuous real-valued functions on [0 , 1], as a curve drawn by the pen of a seismograph or similar instrument, as it traverses a sheet of paper of unit width, making arbitrary movements up and down, but never being lifted from the paper. Since [0 , 1] is a compact set, the elements of Cr0, 1 1 are actually uniformly continuous. To get an idea why distributions on Cr0.1 1 might be of interest to us, imagine observing a realization of a stochastic sequence {Sj(ro) }]Z, from a probability space (O.,'!J,P), for some finite n. A natural way to study these data is to display them on a page or a computer screen. We would typically construct a graph of Sj against the integer values of j from 1 to n on the abscissa, the discrete points being joined up with ruled lines to produce a 'time plot ' , the kind of thing shown in Fig. 27.2.
n
1
(1)
Fig. 27.2 We will then have done rather more than just drawn a picture; by connecting the points we have defined a random continuous function, a random drawing (the word here operates in both its senses !) from the space C[1 ,n]. It is convenient, and there is obviously no loss of generality, if instead of plotting the points at unit intervals we plot them at intervals of ll(n - 1 ) ; in other words, let the width of the paper or computer screen be set at unity by choice of units of measurement. Also, relocating the origin at 0, we obtain by this means an element of Cro, 1 1 , a member of the subclass of piecewise linear functions, with formula (27.8) x(t) = (i - tm)x((i - 1)/m) + (1 + tm - i)x(ilm) for t E [(i - 1)/m, ilm], and i = 1 , ... ,m, m = n - 1 . The points x(ilm) E IR for i = O, ... ,m are the m + 1 vertices of the function. In effect, we have defined a measurable mapping between points of IR n and elements of Cr0, 1 1 , and hence a family of distributions on Cro, I J derived from (O.,'!J,P), indexed on n. The specific problem to be studied is the distribution of these graphs as n tends to infinity, under particular assumptions about the sequence { Sj } . When { Sj } is a sequence of scaled partial sums of independent or asymptotically independent random variables, we shall obtain a useful generalization of the central limit theorem.
438
The Functional Central Limit Theorem
As in §5.5, we metrize Cro.I J with the uniform metric du(x,y) = sup l x(t) - y(t) l . t
(27.9)
Imagine tying two pens to a rod, so that moving the rod up and down as it traverses a sheet of paper draws a band of fixed width. The uniform distance du(x,y) between two elements of C[O, I J is the width of the narrowest such band that will contain both curves at all points. We will henceforth tend to write C for CCro, I J•du) when the context is clear. C is a complete space by 5.24, and, since [0, 1] is compact, is also separable by 5.26(ii). In this case an approximating function for any element of C, fully determined by its values at a finite number of points of the interval (compare 5.25) is available in the form of a piecewise linear function. A set IIm = { t 1 , ... ,tm } satisfying 0 = to < t 1 < ... < tm = 1 is called a partition of [0, 1 ] . This i s a slight abuse of language, an abbreviated way of saying that the collect ion defines such a partition into subintervals, say, A; = (ti- l ,t;) for i = 1 , ... ,m - 1 together with A m = [tm - J,1]. The norm II IIm II = max { t; - t;- d (27. 10) l�i�m is called the fineness of the partition, and a refinement of IIm is any partition of which IIm is a proper subset. We could similarly refer to mint�i�m{ t; - t;- d as the coarseness of Tim . The following approximation lemma specializes 5.25 with the partition II2n = { i/2n , i = 1 , ... , 2n } for n ;?: 1 playing the role of the B-net on the domain, with in this case () < 2!2n . n 27.3 Theorem Given x E C, let Yn E C be piecewise linear, having 2 + 1 vertices, with max { l x(2 -n i) -Yn(2 -n i) I < �£. (27. 1 1) n l �i� 2 There exists n large enough that du(x,yn) < £. n n n Proof Write A; = [T (i - 1), 2 - i], i = 1 , ... ,2 . (Inclusion of both endpoints is innocuous here.) Applying (27.8) we find that, for t E A;, Yn(t) = A-yn(t') + (1 - A.)yn(t") where t' = rn (i - 1), t" = rn i, and A. = i - 2n t. Noting that
}
l x(t) - yn(t) i � A- l x(t) - x(t') l + (1 - A.) Ix(t) -x((') l + A-l x(t') - yn(t') l + (1 - A.) I x(t") -yn((') l ,
(27. 1 2)
and that for n large enough, sups,teA; i x(s) - x(t) l < �£ by continuity, it follows that for such n,
du(x,y,) =
,:d::� I
x(t) - y.(t)
I}
<
E.
•
(27. 13)
Weak Convergence in a Function Space
439
Note that as n ---7 =, II2n ---7 [[) (the dyadic rationals). There is the following important implication. 27.4 Theorem If x,y E C and x(t) = y (t) whenever t E [[), then x = y. Proof Let Zn be piecewise linear with Zn (t) = x(t) = y(t) for t E TI2n . By assumption, such a Zn exists for every n E IN. Fix £, and by taking n large enough that max { 4J(x,zn) , du(y,zn } } < �£, as is possible by 27.3, we can conclude by the triangle inequality that du(x,y) < £. Since £ is arbitrary it follows that du(x,y) = 0, and hence x = y since du is a metric. • The continuity of certain elements of R, particularly the limits of sequences of functions, is a crucial feature of several of the limit arguments to follow. An important tool is the modulus of continuity of a function x E R, the monotone function Wx: (0, 1 ] f-7 IR + defined by wx(o) = sup lx(s) -x(t) 1 . (27. 14) I s- t i d )
Wx has already been encountered in the more general context of the ArzeUt-Ascoli theorem in §5.5. It tells us how rapidly x may change over intervals of width o. Setting o = 1 , for example, defines the range of x. But in particular, the fact that the x are uniformly continuous functions implies that, for every x E C, wx(o) -!, 0 as o -!, 0. (27. 15) For fixed o, we may think of wx(o) = w(x,o) as a function on the domain C. Since I w(x o) - w(y,o) I ::; 2du(x,y) , w(x , o) is continuous on C, and hence a measurable function of x. The following is the version of the ArzeUt-Ascoli theorem relevant to C. 27.5 Theorem A set A c C is relatively compact iff sup I x(O) I < =, (27. 16) XEA lim sup wx(o) = 0. o (27. 17) 0---70 xE A
,
These conditions together impose total boundedness and uniform equicontinuity on A. Consider, for some t E [0, 1] and k E IN,
[ x(t) [ ,; [ x(O) [
+
�
x
(H -x (i � 1 t)
(27. 1 8)
Equality (27.17) implies that for large enough k, supxEA wx(1/k) < =. Therefore (27. 16) and (27. 17) together imply that (27.19) sup sup I x(t) I < =. t xE A In other words, all the elements of A must be contained in a band of finite width ::�rnnnci
n
Thi .;;: thP:nrf'.m i.;;: thP:rP:fnrP: ::� .;;:tr::�i ohtfnrw::�n-1 r.nrnlhrv nf �-2"'
The Functional Central Limit Theorem
440
27. 3 Measures on C
We now see how 27.1 specializes when we restrict the class of functions under consideration to the members of C. The open spheres of C are sets with the form S(x,r) = {y E C: du(x,y) < r,} (27.20) for x E C.
Fig. 27.3 Such sets can be visualized as a bundle of continuous graphs, with radius r and the function x at the core, traversing the unit interval - for example all the functions lying within the shaded band in Fig. 27.3. We shall write :B e for the Borel field on C, and since (C,du) is separable each open set has a countable covering by open spheres and :B e can be thought of as the a-field generated by the open spheres of C. Each open sphere can be represented as a countable union of closed spheres, S(x,r)
=
00 U S(x, r 1/n) , n= l
(27.21)
-
and hence :Be is also the a-field generated from the closed spheres. Now consider the coordinate projections on C. Happily we know these to be continuous (see 6.15), and hence the image of an open (closed) finite-dimen sional rectangle under the inverse projection mapping is an open (closed) element of 'P. Letting 1-fe = { H n C: H E Jf } with Jf defined in (27 .2), and so defining 'P e = cr(Jfe), we have the following important property: 27.6 Theorem :B e = 'Pe. Proof Let H,(x,a) = E C: max I y(rki) - x(rki) I < a E 1-f e, (27.22)
{Y
and so let
l � iQk
}
Weak Convergence in a Function Space H(x,a)
=
n Hk(x,a) k=l
{
441
}
= y E C: sup l y(t) -x(t) l :::; a E 'Pc, tE ID
(27.23)
where [) denotes the dyadic rationals. Note that we cannot rely on the inequality in (27.22) remaining strict in the limit, but we can say by 27.4 that H(x,a) = S(x,a), (27.24) where S is the closure of S. Using (27.21), we obtain 00
S(x,r) = U H(x, r - lin). n=l It follows that the open spheres of C lie in 'Pc. and so 'Be � 'Pc.
(27.25)
X
Fig. 27.4 To show 'P c c 'Be consider, for a E IR and to E the restriction to [0, 1] of the functions on IR ,
!
[0, 1], functions Xn E
a + n(n + 1/n)(t + 1/n - t0), to - 1/n :::; t < to Xn(t) = a + n(n + 1/n)(to + lin - t), to :::; t < to + 1/n
C defined by
(27.26)
a, otherwise. Every element y of the set S(xmn) E 'Be has the property y(to) > a. (This is the shaded region in Fig. 27.4.) Note that 00
(27.27) G(a,to) = {y E C: 1tro(y) > a} = u scxn,n) E 'Be. l n= Now, G(a,t0) is an element of the collection 1f00 where for general t we define 1fcr = {1t� 1 (B), B E 'B } . (27.28)
442
The Functional Central Limit Theorem
In words, the elements of 1f.e1 are the sets of continuous functions x having x(t) E B, for each B E 'B. In view of parts (ii) and (iii) of 1.2, and the fact that 'B can be generated by the collection of open half-lines ( a,oo), it is easy to see that 1f.0 is the a-field generated from the sets of the form G(a,t) for fixed t and a E rR . Moreover, 1f.e is the a-field generated by {1f.0, t E [0, 1 ] } . Since G(a, t) E 'Be for any a and t by (27.27), it follows that 1f.e c 'Be and hence cp e � 'Be. • It will be noted that the limit Xoo(t) of (27.26) is not an element of C, taking the value a at all points except t0, and +oo at t0. Of course, {xn } is not a Cauchy sequence. However, the countable union of open spheres in (27 .27) is an open set (the inverse projection of the open half line) and omits this point. cp e is the projection a-field on C with respect to arbitrary points of the continuum [0, 1], but consider the collection cp(; = { H l't C: H E cp'}, where cp' is the collection of cylinder sets of Rro, 1 1 having rational coordinates as a base. In other words, the sets of cp' contain functions whose values x(t) are unrestricted except at rational t. Since elements of C which agree on the rational coordinates agree everywhere by 27.4, (27.29) This argument is just an alternative route to the conclusion (from 6.22) that C is homeomorphic to a subset of rR"". However, it is not true that cp = cp', because cp is generated from the projections of every point of the continuum [0,1], and arbitrary functions can be distinct in spite of agreeing on rational t. Evidently (C,'Bc) is a measurable space, and according to 27.2 and 27.6, 1f.e is a determining class for the space. In other words, the finite-dimensional distribu tions of a space of continuous functions uniquely determine a p.m. on the space. Every p.m. on R[O,l J must satisfy the consistency conditions, but the elements of C have the special property that x(t1) and x(t2) are close together whenever t1 and t2 are close together, and this puts a further restriction on the class of finite-dimensional distributions which can generate distributions on C. Such distributions must have the property that for any E > 0, ::3 8 > 0 such that (27.30) The class of p.m.s in (C,'Bc), whose finite-dimensional distributions satisfy this requirement, will be denoted !Me. Note that, thanks to 26.14, we are able to treat !Me as a separable metric space. This fact will be most important below. 27.4 Brownian Motion
The original and best-known example of a p.m. on C, whose theory is due to Norbert Wiener (Wiener 1923) is also the one that matters most from our point of view, since in the theory of weak convergence it plays the role of the attractor measure which the Gaussian distribution plays on the line. It is in fact the natural generalization of that distribution to function spaces. 27.7 Definition Wiener measure W is the p.m. on (C,'Bc) having these properties:
Weak Convergence in a Function Space (a) W(x(O)
=
0)
(b) W(x(t) ::::; a)
= =
443
1;
-1-fa e-s1121df;, 0
V2iti
-oo
<
t ::::;
1;
(c) for every partition { tb ... ,tk l of [0,1], the increments x(t 1 ) - x(t0), x(t3 ) - x(t2), ... ,x(tk) - x(tk- 1 ) are totally independent. o Parts (a) and (b) of the definition give the marginal distributions of the coord inate functions, while condition (c) fixes their joint distribution. Any finite collection of process coordinates {x(ti), i = 1 , ... ,k} has the multivariate Gauss ian distribution, with x(tj) - N(O,tj), and E(x(tj)x(tf)) = min { tj,tj' } . Hence, x(tt) - x(t2) - N(O, l tt - t2 l ), which agrees with the requirements of continuity. This full specification of the finite-dimensional distributions suffices to define a unique measure on ( C,'Bc). This does not amount to proving that such a measure exists, but we shall show this below; see 27.15. W may equally well be defined on the interval [O,b) for any b > 0, including b = oo, but the cases with b ::t 1 will not usually concern us here. A random element distributed according to W is called a Wiener process or a Brownian motion process. The latter term refers to the use of this p.m. as a mathematical model of the random movements of pollen grains suspended in water resulting from thermal agitation of the water molecules, first observed by the botanist Robert Brown. 27 In practice, the terms Wiener process and Brownian motion tend to be used synonymously. The symbol W conventionally stands for the p.m., and we also follow convention in using the symbol B to denote a random element from the derived probability space ( C, 'Bc, l-ll). In terms of the underlying proba bility space (Q,�,P) on which we assume B: Q � C to be a �/'Be-measurable mapping, we have W(E) = P(B E E) for each set E E 'Be. The continuous graph of a random element of Brownian motion, B(co) for co E Q, is quite a remarkable object (see Fig. 27 .5). It belongs to the class of geometrical forms namedfractals (Mandelbrot 1983). These are curves possessing the property of self-similarity, meaning essentially that their appearance is invariant to scaling operations. It is straightforward to verify from the defini tion that if B is a Brownian motion so is B * , where (27.31) B *(ro,t) = k - 1 12(B(OJ ,s + kt) - B( ro,s)) for any s E [0,1) and k E (0, 1 - s] . Varying s and k can be thought of as 'zooming in' on the portion of the process from s to s + k. The key property is the one contained in part (iii) of the definition, that of independent increments. A little thought is required to see what this means. In the definition, the points tJ, ... ,tk may be arbitrarily close together. Consider ing a pair of points t and t + �. the increment B( OJ,t + �) - B( OJ,t) is Gaussian with variance �. and independent of B(ro,t). Symmetry of the Gaussian density implies that P( OJ: (B(ro,t + M B(OJ,t) )(B(OJ,t) - B(ro,t - �)) < 0) = 1 -
The Functional Central Limit Theorem
444
for L1 � t � 1 L1 and every L1 > 0. This is compatible with continuity, but completely rules out smoothness; in any realization of the process, almost every point of the graph is a comer, and has po tangent. This property is also apparent when we attempt to differentiate B(ro). Note from the definition that -
B(ro,t + � - B(ro,t)
_
N(O, llh).
(27.32)
The sequence of measures defined by letting h � 0 in (27.32) is not uniformly tight, and fails to converge to any limit. To be precise, the probability that the difference quotients in (27.32) fall in any finite interval is zero, another way of saying that the sample path x(t,ro) is non-differentiable at t, almost surely. A way to think about Brownian motion which makes its relation to the problem of weak convergence fairly explicit is as the limit of the sequence of partial sums of n independent standard Gaussian r.v.s, scaled by n - 1 n . Note that (27.33) �(ro) = n 1 12(B(ro,jln) - B(ro,(j - 1)/n)) - N(0, 1) for j = 1, ... ,n and B(ro,j/n) = n - 1 12E= I�i(ro). By taking n large enough, we can express B(ro,t) in this form for any rational t, and by a.s. continuity of the process we may write [nt] (27.34) B(ro,t) = lim n - 1 12L �/ro) a.s. n�oo i= l for any t E [0, 1], where [nt] denotes the integer part of nt. Consider the expected sum of the absolute increments contributing to B(t). According to 9.8, l �j l has mean (2/n) 1 12 and variance 1 - 2/n, and so by indepen dence of the increments the r.v. An(t) = n - I/2L��fl l �d has mean [nt](21nn) 1 12 = m(t,n) (say) and variance 1 - 2/n. Applying Chebyshev ' s inequality, we have that for t > 0,
P(An(t) > �m(t,n)) � P( I An(t) - m(t,n) I � �m(t,n)) (27.35) � 1 _ 4(1 - 2/n) m(t,n) 2 _ Since m(t,n) = O(n 1 12), An(t) � oo a.s.[WJ for all t > 0. This means that the random element B(ro) is a function of unbounded variation, almost surely. Since limn�ooAn(t) is the total distance supposedly travelled by a Brownian particle as it traverses the interval from 0 to t, and this turns out to be infinite for t > 0, Brownian motion cannot be taken as a literal description of such things as particles undergoing thermal agitation. Rather, it provides a simple limiting approximation to actual behaviour when the increments are small. Standard Brownian motion is merely the leading member of an extensive family of a.s. continuous processes on [0, 1 ] having Gaussian characteristics. For example, if we multiply B by a constant cr > 0, we obtain what is called a Brownian motion with variance �- Adding the deterministic function �t to the process defines a Rrownian motion with drift LL Thu X(t) = rTR(t) -1- tlt rP-nrP-"P-nt" a fami l v of .;_
.
Weak Convergence in a Function Space
445
processes having independent increments X(t) -X(s) - N(J.l(t - s), d I t - s 1 ). More elaborate generalizations of Brownian motion include the following. 27.8 Example Let X(t) = B(t 1 +J3) for -1 < � < oo. X is a Brownian motion which has been subjected to stretching and squeezing of the time domain. Like B, it is a.s. continuous with independent Gaussian increments. It can be thought of as the limit of a partial sum process whose increments have trending variance. Suppose �;(ro) N(O }), which means the variances are tending to 0 if � < 0, or to infinity if � > 0. Then n-l-J3E(If�{l�;) 2 ---) t1 +J3 , and [nt] ) +J3 !2 O n _L �;(ro) ---) B(ro,t 1 +J3) a.s. o (27.36) i=l
27.9 Example Let X(t) = O(t)B(t) where 9: [0, 1] f--7 1R is any continuous determin istic function, and B is a Brownian motion. For s < t, (27.37) X(t) -X(s) = O(t)(B(t) - B(s)) + (9(t) - 9(s))B(s), which means that the increments of this process, while Gaussian, are not inde pendent. It can be thought of as the almost sure limit as n � oo of a double partial sum process, n-
J n� [
6 (iln) E;;( ffi) + ( 6(i/n) - 6 ((i - I )in))
� ]
E;,{ffi) ,
(27.38)
where �; - N(0, 1). o 27.10 Example Letting B denote standard Brownian motion on [O,oo), define (27.39) X(t) = e- J3tB(iJ3t) for fixed � > 0. This is a zero-mean Gaussian process, having dependent increments like 27.9. The remarkable feature of this process is that it is stationary, with X(t) - N(O, 1) for all t > 0, and (27.40) E(X(t)X(s)) = eJ3(2min{ t,sJ -t-s) = e- J3 1 t-s i . This is the Ornstein-Uhlenbeck process. o 27.11 Example The Brownian bridge is the process B0 E C where B0(t) = B(t) - tB(l ), t E [0, 1]. (27.41) This is a Brownian motion tied down at both ends, and has E(B0(t)B0(s)) = min { t,s} - ts. A natural way to think about B0 is as the limit of the partial sums of a mean-deviation process, that is
B" (t,ro)
=
�: n -
where �;(ro) - N(O,l).
o
w
�{
E;;(ro) -
� t E;,{ro)) a.s.
(27.42)
The Functional Central Limit Theorem
446
We have asserted the existence of Wiener measure, but we have not so far offered a proof. The consistency theorem (27.1) establishes the existence of a measure on (C, :B c) whose finite-dimensional distributions satisfy conditions (a)-(c) of 27.7, so we might attempt to construct a continuous process having these properties. Consider
-
Y.(t,co) : n - I /2
{% �,{co) + (nt - [nt])�l.,J+I (co)) ,
(27.43)
where Si N(O,l ) and the set {sJ, ... , sn } are totally independent. For given ro, Yn(.,ro) is a piecewise linear function of the type sketched in Fig. 27.2, although with Yn(O,ro) = 0 (the Si represent the vertical distances from one vertex to the next), and is an element of C. Yn(t) is Gaussian with mean 0, and E(Yn(t))2 = n - \ [nt] + (nt - [tft])2) 2 1 = t + n - ([nt] - nt+ (nt - [nt]) ) (27.44) = t + K(n,t)ln, (say) where 0 < K(n,t) < 2. Moreover, the Gaussian pair Yn( t) and Yn(t + s + n - 1 ) - Yn(t + n - 1 ) , s > 0, are independent. Extrapolating the same argument to general collections of non-overlapping increments, it becomes clear Yn(t) � N(O , t), and more generally that if Yn � Y, then Y is a stochastic process whose finite dimensional distributions match those of W. Fig. 27 .5, which plots the partial sums of around 8000 (computer-generated) independent random numbers, shows the typical appearance of a realization of the process approaching the limit. 1 . 3 ,-------� 1.0 0.5 0 �------� 0
1 Fig. 27.5
This argument does not show that the measure on ( C,:Bc) corresponding to Y actually is W. There are attributes of the sample paths of the process which are not specified by the finite dimensional distributions. According to the continuous mapping theorem, Yn � Wwould imply that h( Yn) � h(W) for any a.s. continuous function h. For example, suptl Yn(t) I is such a function, and there are no grounds .f.rnm th.,. <>rmnn<>nt" f'ron " ; ri PrPrl �hnvP. fnr •mnnn"i n u thM "lln. l y_(t) I � SUO, I W( t) I .
Weak Convergence in a Function Space
447
However, if we are able to show that the sequence of measures corresponding to Yn converges to a unique limit, this can only be W, since the finite-dimensional cylinder sets of C are a determining class for distributions on (C/Bc). This is what we were able to conclude from 27.6, in view of 27.1. This question is taken up in the next section, and the proof of existence will eventually emerge as a corollary to the main weak convergence result in §27.6. 27 . 5 Weak Convergence on C
Let { J.ln } be a sequence of probability measures in IMc. For example, consider the distributions associated with a sequence like { Ym n E IN } , whose elements are defined in (27.43). According to 26.22, the necessary and sufficient condition for the family {J.ln } to be compact, and hence to possess (by 5.10) a cluster point in IMc, is that it is uniformly tight. Theorem 27.5 provides us with the relevant compactness criteria. The message of the following theorem is that the uniform tightness of measures on C is equivalent to boundedness at the origin and continuity arising with sufficiently high probability, in the limit. Since tightness is the concentration of the mass of the distribution in a compact set, this is just a stochastic version of the Arzela-Ascoli theorem. 27.12 Theorem (Billingsley 1968: th. 8.2) {J.ln } is uniformly tight iff there exists N E IN such that, for all 11 > 0 and for all n � N, (a) there exists M < oo such that (27.45) J.ln({x: l x(O) I > M}) :::; T) ; (b) for each £ > 0, there exists 8 E (0, 1) such that J.ln({x: wx(o) � E}) :::; T) . o (27.46) Condition (b) is a form of stochastic equicontinuity (compare §21 .3). It is easier to appreciate the connection with the notions of equicontinuity defined in §5.5 if we write it in the form P(w(Xn,8) � E) :::; T) , where {Xn} is the sequence of stochas tic functions on [0, 1] having derived measures J.ln · Asymptotic equicontinuity is sufficient in this application, and the conditions need hold only over n � N, for some finite N. Since C is a separable complete space, each individual member of { J.ln} is tight, and for uniform tightness it suffices to show that the conditions hold 'in the tail ' . Proof of 27.12 To prove the necessity, let U..tn } be uniformly tight, and for 11 > 0 choose a compact set K with J.ln(K) > 1 - T) . By 27.5, there exist M < oo and 8 E (0, 1) such that K � {x: l x(O) J :::; M} n {x: wx(8) < E} (27.47) for any £ > 0. Applying the De Morgan law,
1l � J.ln (�) 2: J.ln ({x: J x(O) J > M} U {x: wx(8)
2:
£})
The Functional Central Limit Theorem
448 ;?:
max { J.Ln( { x : l x(O) I > M}), Jln({x: wx(8) ;?: E} ) } .
(27.48)
Hence (27.45) and (27.46) hold, for all n E !N . Write J.L*(.) as shorthand for SUPn <-Nf..Ln(.) . To prove sufficiency, consider for k = 1,2, . . the sets (27.49) where { 8d is a sequence chosen so that ll*(Ak) > 1 - 8/2k+l , for 8 > 0. This is possible by condition (b). Also set B = { x: I x(O) I :::;; M } , where M is chosen so that Jl*(B) > 1 - 8/2, which is possible by condition (a). Then define a closed set K = (nk'=IAk n B) -, and note that conditions (27.16) and (27.17) hold for the case A = K. Hence by 27.5, K is compact. But
.
11* CKc) :::;; J.L*
(UAZu Be) k=l
:::;; L J.L*(AZ} + J.L*(Bc) k=l
:::;; 2e .L 112k+2 + en = e. 00
k=l
(27.50)
This last inequality is to be read as SUPn<- Niln(Kc) :::;; e, or equivalently infn<- Nf..Ln(K) > 1 - e. Since e is arbitrary, and every individual Jln is tight by 26.19, in particular for 1 :::;; n < N, it follows that the sequence { Jln} is uniformly tight. • The following lemma is a companion to the last result, supplying in conjunction with it a relatively primitive sufficient condition for uniform tightness. 27.13 Lemma (adapted from Billingsley 1968: th 8.3) Suppose that, for some 8 E (0, 1),
({
sup Jln x: sup I x(s) -x(t) I O:St:SI -o
t :5 s :5 t+o
;?:
1£
})
:::;; 1no.
(27.51 )
Then (27 .46) holds. Proof Fixing 8, consider the partition { t1 , . . . ,t, } of [0, 1], for r = 1 + [1/o], where t; = i8 for i = 1 , ... ,r - 1 and t, = 1 . Thus, for 1 < o < 1 we have r = 2 and the partition { 0, 1 }, for t < 0 :::;; � we have r = 2 and the partition { o, 28, 1 } , and so on. The width of these intervals is at most 8. A given interval [t,t'] with I t' - tl :::;; 8 must either lie within an interval of the partition, or at most overlap two adjoining intervals; it cannot span three or more. In the event that I x(t') - x(t) I ;?: E , x must change absolutely by at least �£ in at least one of the interval(s) overlapping [t,t'], and the probability of the latter event is at least that of the former. In other words, considering all such intervals,
Weak Convergence in a Function Space !J-n{ {x : wx(o) � £ } ) ::; 1-ln
(U{ t=]
x:
S,S
449
})
sup l x(s') - x(s) l � 1£ ,E [t;-!, 1;]
(27.52) where the third of these inequalities applies (27.51), and the final one follows because rO ::; 2. • These results provoke a technical query over measurability. In §21 . 1 we indicated difficulties with standard measure theory in showing that functions such as sup1 � � .�;l x(s) -x(t) l in (27.51), and wx(o) in (27.46), are random variables. However, it is possible to show that sets such as the one in (27.51) are �-analytic, and hence nearly measurable. In other words, complacency about this issue can be justified. The same qualification can be taken as implicit wherever such sets arise below. s
t+
27.6 The Functional Central Limit Theorem
Let Sno = 0 and Snj = TI=I Uni for j = 1 , ... ,n, where { Und is a zero-mean stochastic array, normalized so that E(S�n) = 1 . As in the previous applications of array notation, in Part V and elsewhere, the leading example is Un i = U/sn, where { U;} is a zero-mean sequence and s� = E(Ii= l ui. Define an element Yn of Cro. lJ• somewhat as in (27.43) above, as follows: Yn(t) = Sn,[nt] + (nt - [n t] )Un,[nt]+ I , = Snj- 1 + (nt -j + 1)Unj for (j - 1)/n ::; t < j/n, j
=
l , . . . ,n ;
Yn(l ) = Snn·
(27.53) (27.54)
This is the type of process sketched in Fig. 27.2. The question of whether the distribution of Yn possesses a weak limit as n -> oo is the one we now address. The interpolation terms in Yn(t) are necessary to generate a continuous func tion, but from an algebraic point of view they are a nuisance; dropping them, we obtain Xn (t) = Sn,[nt] Xn(l ) = Snn ·
=
Snj- 1 for U - l)ln ::; t < j/n, j = 1 , ... ,n,
(27 .55) (27.56)
If conditions of the type discussed in Chapters 23 and 24 are imposed on { Un i } , XnO) � N(O, l) as n � oo. If for example U; - i.i.d.(O,d), so that Uni = UJsn where s� = ncr2, this is just the Lindeberg-Levy theorem. However, the Lindeberg Levy theorem yields additional conclusions which are less often remarked; it is easv to verifv that. for each distinct oair tJ .f? E rO , l l,
The Functional Central Limit Theorem
450
(27.57) Since non-overlapping partial sums of independent variates are independent, we find for example that, for any 0 :::; t1 < t2 < t3 :::; 1, Xn(t2) - XnCti ) and XnCt3) Xn(t2) converge to a pair of independent Gaussian variates with variances t2 - t1 and t3 - t2, so that their sum Xn(t3) - XnCti) is asymptotically Gaussian with variance t3 - t1 , as required. Under our assumptions, (27.58) so that Yn(t) and Xn(t) have the same asymptotic distribution. Since Yn(O) = 0, the finite-dimensional distributions of Yn converge to those of a Brownian motion process as n -----7 oo. As noted in §27.4, this is not a sufficient condition for the convergence of the p.m.s of Yn to Wiener measure. But with the aid of27.12 we can prove that { Yn} is uniformly tight, and hence that the sequence has at least one cluster point in rMc. Since all such points must have the finite-dimensional distributions of W, and the finite-dimensional cylinders are a determining class for (C,'Bc), W must be the weak limit of the sequence. This convergence will be expressed either by writing !-ln :::::::> W, or, more commonly in what follows, by Yn � B. This type of result is called a functional central limit theorem (FCLT), although the term invariance principle is also used. The original FCLT for i.i.d. increments (the generalization of the Lindeberg-Levy theorem) is known as Donsker ' s theorem (Donsker 1951). Using the results of previous chapters, in particular 24.3, we shall generalize the theorem to the case of a heterogeneously distributed martingale difference, although the basic idea is the same. 27.14 Theorem Let Yn be defined by (27.53) and (27.54), where { Uni.�nd is a martingale difference array with variance array { CJ�;}, and Lt= l �i = 1 . If n (a) _L u�; � 1 , i= l (b) (c)
max I Und � 0, ]$;i$; n [nt] lim L CJ�; = t, for all t n�oa i= l
E
[0, 1 ],
then Yn � B. o Conditions (a) and (b) reproduce the corresponding conditions of 24.3, and their role is to establish the finite-dimensional distributions of the process, via the conventional CLT. Condition (c) is a global stationarity condition (see § 13.2) which has no counterpart in the CLT conditions of Chapter 24. Its effect is to rule out cases such as 24.10 and 24.11. By simple subtraction, the condition is sufficient for
Weak Convergence in a Function Space
·
,!
;
·-·
45 1
[ns] (27.59) lim L a;i = s - t, n �oo i=[nt]+l tor 0 � t < s � 1 . Clearly, without this restriction condition (27 .57) could not hold for any t1 and t2 . Proof of 27.14 Conditions 24.3(a) and 24.3(b) are satisfied, on writing Uni for Xnr · In view of the last remarks, the finite-dimensional distributions of Yn converge to those of W, and it remains to prove that { Yn} is uniformly tight (i.e., that the sequence of p.m.s of the Yn is uniformly tight). m Define, for positive integers k and m with k + m � n, s;km = "'C' "-k+i= k+ I<Jni2 = E(Sn,k+m - Snk) 2, where Sn,k+j - Snk = I7�{+1 Uni· The maximal inequality for martingales in 15.14 implies that, for A > 0, E I Sn,k+m - Snk l p p max I Sn,k+j - Snk I > Asnkm � (27.60) . (Asnkmf l s;j �m In particular, set k = [nt] and m = [n8] for fixed 8 E (0, 1) and t E [0 , 1 - 8] so that m increases with n, and then we may say that (Sn ,k+m - Snk)lsnkm � Z N(0, 1), by 24.3. For given positive numbers 11 and £, choose A satisfying A > max{E/8, 256E I ZI 3/T) £2 } , (27.61) and consider the case 8 = £2/64A2 < 1 . There must exist N0 ;::: 1 for which, with n ;::: N0, the Gaussian approximation is sufficiently close that El Sn,k+m - Snk 1 3 11 £2 � _' I _ = 1T 8 . (27.62) ) (Asnkm)3 256A2 4 Also observe from (27.59) that limn�oos�km = 8. For the choice of 8 indicated there exists N1 ;::: 1 such that As nkm < £E for n ;::: N1 , and hence, combining (27 .60) with p = 3 with (27.62), such that
)
(
-
------
_
)
(
P max I Sn ,k+j - Snk l ;::: £E � £11 8, l �j�m for n ;::: max{N0,N1 } . Now, Yn(s) - Yn(t) = Sn,[ns] - Sn, [nt] + Rn(s,t) for s > t, from (27.53) and (27.54), where Rn(s,t) = (ns - [ns]) Un,[ns]+ l - (nt - [nt])Un, [nt] +l · For t E [0, 1 8], there exists s' E [t, t + 8] such that
(27.63)
(27.64) (27.65)
-
I Yn(s') - Yn(t) I
=
sup I Yn(s) - Yn(t) 1 . t �s� t+o
(27.66)
The Functional Central Limit Theorem
452
[ns'] .:::; [nt] + [n8] and hence max I Sn,[nt] +j - Sn, [nt] l · � I J �[no] It follows (also invoking the triangle inequality) that for n
I Sn,[ns'] - Sn, [ nt] l
(27.67)
.:::;
:2:
N2 ,
=
I Sn,[ns'] - Sn,[nt] + Rn(s', t) I (27.68) :::; max I Sn ,[nt] +J - Sn,[nt] l + I Rn(s',t) I . I �J�[n o] By condition (b) of the theorem, P(sup i Rn(s, t) l :2: ac) ---7 0 as n ---7 oo, and hence there exists N3 :2: 1 such that, for n :2: N3, (27.69) Inequalities (27.69) and (27.63) jointly imply that, for all t E [0, 1 - 8] and n max{No,NbN2 ,N3 } , :2: N* I Yn(s') Yn(t) I -
s, r
=
P( I Yn(s') - Yn(t) I
( ({ (
:2:
!c)
.:::; P max I Sn, [nt] +J - Sn, [nt] I + I Rn(s' , t) I I �J �[no] :::; p
max I Sn,[nt] +j - Sn,[n t] l I�J[no]
:::; P max ! Sn,[ntl +J - Sn ,[ntJ I I �J[n o]
:::; 111 B . The conclusion may be written as
(
:2:
:2:
}
aE
)
aE
)
)
!c
})
{ I Rn(s',t) I
:2:
aE
P( I Rn(s',t) l
:2:
aE)
u
+
:2:
(27.70)
s p (27.7 1 ) 1 Yn(s) - Yn(t) I :2: !E .:::; !11 8, n :2: N*. � � :� 0� t f o � +8 Note that (27.51) i s identical with (27.71 ) for the case J.ln (A) P(Yn E A), and that 11 and c are arbitrary. Therefore, uniform tightness of the corresponding sequence of measures follows by 27.12 and 27.13. This completes the proof. • We conclude this section with the result promised in §27.4: 27.15 Corollary Wiener measure exists. o The existence is actually proved in 27.14, since we derived a unique limiting distribution which satisfied the specifications of 27.7. The points which are conveniently highlighted by a separate statement are that the tightness argument developed to prove 27.14 holds independently of the existence of W as such, and that the central limit theorem plays no role in the proof of existence. -
=
Weak Convergence in a Function Space
453
the process Yn of (27.41). We shall show that, on putting Uni -l 2 n l � i' conditions 27.14(a), (b), and (c) are satisfied. It will follow by the reasoning of 27.12 that the associated sequence of measures is uniformly tight, and possesses a limit. This limit has been shown above by direct calculation to have the finite-dimensional distributions specified by 27. 7, which will conclude the proof. Condition 27.14(c) holds by construction. Condition 27.12(a)) follows from an application of the weak law of large numbers (e.g. Khinchine's theorem), recalling that since �i is Gaussian { �1 } is an independent sequence possessing all its moments. Finally, condition 27.14(b) holds by 23.16 if the collection {�J, ... ,�n } satisfy the Lindeberg condition, which is obvious given their Gaussianity and 23.10 . • Proof Consider
=
27 .7 The Multivariate Case
We would like to extend these results to vector-valued processes, and there is no difficulty in extending the approach of §25.3. Define the space C[o, 1 1m, which we write as en for brevity, as the space of continuous vector functions m H IR m, ' X = (X J , . . . ,Xk) : [O, l ] where [0, l ] m and IR m are the Cartesian products of m copies of [0, 1] and IR respect ively. em is itself the product of m copies of C. It can be endowed with a metric such as (27.72) = max 15:j 5: m which induces the product topology, and coordinate projections remain continuous. Since C is separable, e m is also separable by 6.16, and 'Be = 'B e ® 'B e ® . .. ® 'B e (the a-field generated by the open rectangles of [0, l] m) is the Borel field of e m by m-fold iteration of 26.5. (C m, 'B(!) is therefore a measurable space. Let (27.73) denote the finite-dimensional sets of e m. Again thanks to the product topology, 1f(! is the field generated from the sets in the product of m copies of 1fe. 27.16 Theorem 1f(! is a determining class for (C m , 'B[;). Proof An open sphere in 'B(! is a set S(x,a) = E C m : < a} ,
d'U(x, y)
=
The set
{du(x1,y1) },
{y d'{j(x, y) yj(t) {y E
em : max sup I 15:j5: m
t
xj(t) I
}
< a .
(27.74)
The Functional Central Limit Theorem
454
(27.75) is an element 1-fe. It follows by the argument of 27.6 that
n=l ( , lin)
S(x,r) = U n Hk x r k= l
E
Te = cr(1fe),
(27.76)
and hence, that :Be c 'Pe. Since 1-fe is a field, the result follows by the extension theorem. • It is also straightforward to show that :Be = T(;, by a similar generalization from 27.6, but the above is all that is required for the present purpose. A leading example of a measure on (C m ,c.B(;) is wm, the p.m. of m-dimensional standard Brownian motion. A m-vector B distributed according to wm has as its elements m mutually independent Brownian motions, such that (27 .77) B(t) - N(O, m where Im is the m x m identity matrix, and the process has independent increments with E( (B(s) - B(t))(B(s) - B(t))') = (s - t)Im (27.78) for t<s The following general result can now be proved. 27. 17 Theorem Let be a m-vector martingale difference array with variance matrix array such that = Im . Then let = (27.79) for (j - 1)/n � < j/n, = 0 and for j = and = where
ti ),
0 S S 1.
{ Uni{l:niL>�nil Lt=ll:ni Yn(t) SnJ-t+(nt-j+1)Unj t l,. .,n, Yn(l) Snn'j Sno (27.80) Snj Li=l Uni' j = 1, . . ,n. If n (a) L UniU�i Im, i=l 0, (b) max U�iUni l:S::iSnn [ t ] (c) lim L l:ni tim , for all t [0 ,1 , n -'t oo then Yn � B. Consider for an m-vector A of unit length the scalar process A'Yn, having increments ;\'Uni· By definition, {;\'Uni, �ni} is a scalar martingale difference array with variance sequence ;\'l:niA . It is easily verified that all the conditions =
�
�
i=l
Proof
=
E
]
Weak Convergence in a Function Space
455
of 27.14 are satisfied, and so �'Yn � B . This holds for any choice of �- In particular, �'Yn(t) � N(O, t) , with similar conclusions regarding all the finite dimensional distributions of the process. It follows by the Cramer-Wold theorem that Yn(t) � N(O, tim), (27.81) with similar conclusions regarding all the finite-dimensional distributions of the process; these are identical to the finite-dimensional distributions of wm . Since 1-f'C is a determining class for (em, 73m), any weak limit of the p.m. s of { Yn} can only be wm . It remains to show that these p.m.s are uniformly tight. But this is true provided the marginal p.m.s of the process are uniformly tight, by 26.23. Picking � to be the jth column of Im for j = l , ... ,m and applying the argument of 27.14 shows that this condition holds, and completes the proof. • The arguments of §25.3 can be extended to convert 27.17 into provide an unusually powerful limit result. The conditions of the theorem are easily generalized, by replacing 27.17(c) by [nt] (c') lim L :Eni = t:E , n�oo i= l where :£ is an arbitrary variance matrix. Defining L- such that L -u = Im as in 25.6, 27.17 holds for the transformed vector process Zn = L - yn · The limit of the process Yn itself can then be determined by applying the continuous mapping theorem. This is a linear combination of independent Brownian motions, the finite dimensional distributions of which are jointly Gaussian by 11.13. We call it an m-dimensional correlated Brownian motion, having covariance matrix :r, and denoted B(:r). The result is written in the form Yn � B(:r). (27.82) An invariance principle can be used in this way to convert propositions about dependence between stochastic processes converging to Brownian motion into more tractable results about correlation in large samples. Given an arbitrarily related set of such processes, there always exist linear combinations of the set which are asymptotically independent of one another. 28 _,
28 Cadlag Functions
2 8 . 1 The Space
D
The proof of the FCLT in the last chapter was made more complicated by the presence of terms necessary to ensure that the random functions under considera tion lay in the space C. Since these terms were shown to be asymptotically negli gible, it might reasonably be asked whether they are needed. Why not, in other words, work directly with Xn of (27.55) and (27.56), instead of Yn of (27.53) and (27.54)? Fig. 28. 1 shows (apart from the omission of the point Xn(l), to be explained below) the graph of the process Xn corresponding to the Yn sketched in Fig. 27.2. Xn as shown does not lie in Cro, I J but it does lie in Dro, I J , the space of cadlag functions on the unit interval (see 5.27), of which Cro,I J is a subset. Henceforth, we will write D to mean D ro, I J when there is no risk of confusion with other usages.
Fig. 28. 1 As shown in 26.3, D is not a separable space under the uniform metric, which means that the convergence theory of Chapter 26 will not apply to (D,du). du is not the only metric that can be defined on D, and it is worth investigating alternatives because, once the theory can be shown to work on D in the same kind of way that it does on C, a great simplification is achieved. Abandoning du is not the only way of overcoming measurability problems. Another approach is simply to agree to exclude the pathological cases from the field of events under consideration. This can be achieved by working with the a-field <JJn , the restriction to D of the projection a-field (see §27. 1). In con trast with the case of C, <JJn c c.Bn (compare 27.6) and all the awkward cases such as uncountable discrete subsets are excluded from <JJn , while all the ones likely to arise in our theory (which exclusively concerns convergence to limit points lying
Cadlag
Functions
457
in C) are included. Studying measures on the space ((D,du),Tv) is an interesting line of attack, proposed originally by Dudley (1966, 1967) and described in detail in the book by Pollard (1984). While this approach represents a large potential simplification (much of the present chapter could be dispensed with), an early decision has to be made about which line to adopt; there is little overlap between this theory and the methods pioneered by Skorokhod (1956, 1957), Prokhorov (1956), and Billingsley (1968), which involves metrizing D as a separable complete space. Although the technical overheads of the latter approach are greater, it has the advantage that, once the investment is made, the probabilistic environment is familiar; at whatever remove, one is still working in an analogue of Euclidean space for which all sorts of useful topological and metric properties are known to hold. There is scope for debate on the relative merits of the two approaches, but we follow the majority of subsequent authors who take their cue from Billingsley's work. The possibility of metrizing D as a separable space depends crucially on the fact that in D the permitted departures from continuity are of a relatively limited kind. The only ones possible are jump discontinuities (also called 'discontinuities of the first kind'): points t at which l x(t) - x(t-) 1 > 0. There is no possibility of isolated discontinuity points t at which both lx(t) - x(t-) I and I x(t) - x( t+) I are positive, because that would contradict right-continuity. There is however the possibility that x(l) is isolated; it will be necessary to discard this point, and let x(l) = x(l-) . This is a little unfortunate, but since we shall be studying convergence to a limit lying in C[O I J (e.g., B), it will not change anything material. We adopt the following definition. 28.1 Definition D[o J ] is the space of functions satisfying the following condi tions: (a) x(t+) exists for t E [0, 1 ) ; (b) x(t-) exists for t E (0 , 1 ]; (c) x(t) = x(t+), t < 1, and x(1) = x(1-). o The first theorem shows how, under these conditions, the maximum number of jumps is limited. 28.2 Theorem There exists, for all x E D and every £ > 0, a finite partition { tJ, ... ,tr } of [0, 1] with the property sup lx(t) - x(s) I < £ (28 . 1 ) ,
,
s,te [t;-[,t;)
for each i = 1 , . . . ,r. Proof This is by showing that tr = 1 for a collection { t1 , ... ,tr } satisfying (28. 1), with to = 0. For given x and £ let t = sup{ tr } , the supremum being taken over all these collections. Since x(t-) exists for all t > 0, t belongs to the set; that is, there exists r such that t = tr. Suppose tr < 1, and consider the point tr + 8 s 1 , for some 8 > 0. By definition of tn l x(tr + 8) - xUr- d l � £. Hence consider the interval [tntr + 8). By
458
The Functional Central Limit Theorem
choice of () we can ensure by right continuity that l x(t, + D) -x(t,) I < E. Hence there exists an (r + 1 )-fold collection satisfying the conditions of the theorem. We must have t :?: tr+ l = t, + D, and the assertion that t, = t is contradicted. It follows that t, = 1. • This elementary but slightly startling result shows that the number of jump points at which I x(t) -x(t-) I exceeds any given positive number are at most finite. The number of jumps such that I x(t) - x(t-) I > 1/n is finite for every n, and the entire set of discontinuities is a countable union of finite sets, hence count able. Further, we see that (28.2) sup l x(t) l < =, t
since for any t E [0,1], x(t) is expressible according to (28.1) as a finite sum of finite increments. The modulus of continuity wx(D) in (27 . 14) provides a means of discriminating between functions in Cro,I J and functions outside the space. For just the same reasons, it is helpful to have a means of discriminating between cadlag functions and those with arbitrary discontinuities. For () E (0 , 1 ), let Il0 denote a partition { t 1 , ... ,t,} with r s [1/D] and mini { ti - ti- d > (), and then define
{ {
w;(<>) = inf max 110
l :S t :S r
sup I x(t) - x(s) I
s,tE [t;-J,t;)
}}
·
(28.3)
Let' s attempt to say this in English! w;(<>) is the smallest value, over all partitions of [0, 1 ] coarser than (), of the largest change in x within an interval of the partition. This notion differs from, and weakens, that of wx(D) , in that w;(<>) can be small even if the points ti are jump points such that wx(D) would be large. For () < � there is always a partition I10 in which ti - ti- l < 2() for some i, so that for any x E D, (28.4) for () < �· So obviously, lim0 0w;(<>) = 0 for any x E C. On the other hand, �
lim w�(D) = 0
3�0
(28.5)
is a property which holds for elements of D, but not for more general functions. 28.3 Theorem If and only if x E D, 3 () such that w;(<>) < E, for any E > 0. Proof Sufficiency is immediate from 28.2. Necessity follows from the fact that if x i: D there is a point other than 1 at which x is not right-continuous; in other words, a point t at which I x(t) - x(t+) I :?: E for some E > 0. Choose arbitrary () and consider (28.3). If t i:. ti for any i, then w�(D) :?: E by definition. But even if t = ti for some i, ti E [ti,ti+ I) and l x(ti) - x(t,+) l :?: E, and again w;(<>) :?: E.
•
Cadlag Functions 28.2 Metrizing
459
D
Recall the difficulty presented by the existence of uncountable discrete sets in (D,du), such as the sets of functions o, o � t < e xe(t) = (28. 6) 1, e � t � 1, the case of (5.43) with a = 0 and b = 1 . We need a topology in which xe and Xe' are regarded as close when I 9 - 9' 1 is small. Skorokhod (1956) devised a metric with this property. Let A denote the collection of all homeomorphisms A: [0, 1] f--7 [0, 1] with A(O) = 0 and A( 1 ) = 1 ; think of these as the set of increasing graphs connecting the opposite corners of the unit square (see Fig. 28.2). The Skorokhod J1 metric is defined as
{
ds(x,y) = inf {E > 0: sup I A(t) - tl � E, sup l x(t) -y(A(t)) I � E} . t
AEA
t
(28.7)
In his 1956 paper Skorokhod proposes four metrics, denoted J l , J2, M l, and M2. We shall not be concerned with the others, and will refer to ds as is customary, as 'the' Skorokhod metric.
0
t
1
Fig. 28.2 It is easy to verify that ds is a metric, if you note that sup 1 I A(t) - t I = suprl t - A- \ t) l and suprl x(t) -y(A(t)) l = sup1 l x(A- 1(t)) - y(t) l , where A- 1 E A if A E A. While in the uniform metric two functions are close only if their vertical separation is uniformly small, the Skorokhod metric also takes into account the possibility that the horizontal separation is small. If x is uniformly close to y except that it jumps slightly before or slightly after y, the functions would be considered close as measured by ds, if not by du. Consider xe in (28.6), and another element Xe+o· The uniform distance between these elements is 1 , as noted above. To calculate the Skorokhod distance, note that the quantity in braces in (28.7) will be 1 for any A for which A(9) * 9 + 8. Confining consideration to the subclass of A with A(9 ) = 9 + 8, choose a case
460
The Functional Central Limit Theorem
where I A(t) - t l ::s; 8 (for example, the graph { t, A(t) } , obtained by joining the three points (0,0), (9,9 + 8), and (1, 1) with straight lines, will fulfil the definition) and hence, (28.8) This distance approaches zero smoothly as 8 J- 0, whi�h might conform better to our intuitive idea of 'proximity' than the uniform metric in these circumstances. 28.4 Theorem On C, ds, and du are equivalent metrics. Proof Obviously ds(x,y) � du(x,y), since the latter corresponds to the case where A is the identity function in (28. 7). On the other hand, for any A,
du(x,y) � sup lx(t) -y(A(t)) ! + sup l y(A(t)) - y(t) l . t
t
(28.9)
Suppose y is uniformly continuous. For every E > 0 there must exist 8 > 0 such that, if ds(x,y) < 8 (and hence sup1 I A(t) - t I < 8), then sup, I y(A(t)) -y(t) I < E. In other words, (28. 10) ds(x,y) < 8 => du(x,y) < 8 + E. The criteria of (5.5) and (5.6) are therefore satisfied. Uniform continuity is equivalent to continuity on [0, 1], and so the stated inequalities hold for all y E C. . The following result explains our interest in the Skorokhod metric. 28.5 Theorem (D,ds) is separable. Proof As usual, this is shown by exhibiting a countable dense subset. The counter part in D of the piecewise linear function defined for C is the piecewise constant function (as in Fig. 28. 1) defined as y(t) = y(t;), t E [ti, t;+ I ) i = O, ... ,m - 1 , (28. 1 1 ) where the y(ti) are specified real numbers. For some n E IN , define the set A n as the countable collection of the piecewise constant functions of form (28 . 1 1), with ti = i/2n for i = 0, ... ,2n - 1 , and y(t;) assuming rational values for each i. Letting A denote the limit of the sequence {An} , A is a set of functions taking rational values at a set of points indexed on the dyadic rationals [), and hence is countable by 1.5. According to 28.2, there exists for x E D a finite partition (t1 , ... ,tm } of [0, 1], such that, for each i, sup lx(s) - x(t) l < £. s,t E [t;-],t;)
Let y be a piecewise constant function constructed on the same intervals, assuming rational values Y I · · · · ·Ym where y; differs by no more than E from a value assumed by x on [t;,t;+1 ). Then, ds(x,y) < 2E. Now, given n � 1, choose z E An such that n Zj = Yi when j/2 E [t;,t;+ 1 ). Since [) is dense in [0, 1], ds(y,z) � 0 as n � oo .
461
Cadlag Functions
Hence, ds(x,z) � ds(x,y) + ds(y,z) is tending to a value not exceeding 2£. Since by taking m large enough E can be made as small as desired, x is a closure point of A. And since x was arbitrary, we have shown that A is dense in D. • Notice how this argument would fail under the uniform metric in the cases where x has discontinuities at one or more of the points ti . Then, du(y,z) will be small only if the two sets of intervals overlap precisely, such that ti = j/2n for some j. If ti were irrational, this would not occur for any finite m, since j/2n is rational. Under these circumstances x would fail to be a closure point of A. This shows why we need the Skorokhod topology (that is, the topology induced by the Skorokhod metric) to ensure separability. Working with ds will none the less complicate matters somewhat. For one thing, ds does not generate the Tychonoff topology, and the coordinate projections are not in general continuous mappings. The fact that x and y are close in the Skorokhod metric does not imply that x(t) is close to y(t) for every t, the examples of xe and xe+o cited above being a case in point. We must therefore find alternative ways of showing that the projections are measurable . •
i
1
.
: � - �;
I n
0
I 2
�
i:
t
Fig. 28.3 And there is another serious problem: (D,ds) is not complete. This is easily seen by considering the sequence of elements { Xn} where 1, t E [�, 1 + �) (28. 12) xn(t) = 0, otherwise (see Fig. 28.3). The limit of this sequence is a function having an isolated point of discontinuity at �' and hence is not in D. However, to calculate ds(xn,Xm) A. must be chosen so that A.(�) = �' and A.(� + n = � + k; the distance is 1 for any other choice. The piecewise-linear graph with vertices at (0,0), G,�), (� + k � + k), and (1, 1 ) fulfils the definition, and satisfies (28. 7). It appears that ds(Xn,Xm) = I k - k I , and so { Xn } is a Cauchy sequence.
{
,
28.3 Billingsley' s Metric
The solution to this problem is to devise a metric that is equivalent to ds (in the sense of generating the same topology, and hence a separable space) but in
The Functional Central Limit Theorem
462
which sequences such as the one in (28. 12) are not Cauchy sequences. Ingenious alternatives have been suggested by different authors. The following is due to Billingsley (1968), from which source the results of this section and the next are adapted. Let A be the collection of homeomorphisms /.. from [0, 1 ] to [0, 1 ] with /..(0) = 0 and /..( 1 ) = 1 , and satisfying lit.. I I
=
l
sup log/..(t)t -- s/..(s) t#s
I
< =.
(28. 1 3)
Here, I I/.. II : A � [R + is a functional measuring the maximum deviation of the gradient of /.. from 1 , so that in particular ll t.. l l = 0 for the case /..(t) = t. The set A is like the one defined for the Skorokhod metric with the added proviso that l it.. I I be finite; both /.. and /..- I must be strictly increasing functions. Then define
ds(x,y)
=
inf {c: > 0: llt.. l l ::; E, sup lx(t) - y(/..(t)) l t
AEA
::;
}
c: .
(28. 14)
We review the essential properties of d8. 28.6 Theorem d8 is a metric. Proof d8(x,y) = 0 iff x = y is immediate. d8(x,y) = d8(y,x) is also easy once it is noted that llt..- 1 1 1 = llt.. l l . To show the triangle inequality, note that
{ {
� sup t# s
� sup log f *- S
(f...t (t) - At (s))(/..2(t') - /..2(s')) -....,. . ' -. s ') (t - s)-..,.( t..,...
-----.,.---=-
for arbitrary t' and s'. On setting t'
=
_ _ _
/..1 (t) and s'
=
}
(28. 15)
/..2(s), we obtain
(28. 16) where /..1 o/..2(t) = /.. 1 (�(t)), and f... 1 of...2 is clearly an element of A Since (28. 1 7) sup l x(t) - z(/..2(/.. t (t))) l ::; sup l x(t) -y(f...t (t)) l + sup l y(t) - z(/..2(t)) l t
t
t
by the ordinary triangle inequality for points of IR , the condition ds(x,z) :::;; ds(x,y) + ds(y,z) follows from the definition. • Next we explore the relationship between ds and d8, and verify that they are equivalent metrics. Inequalities going in both directions can be derived provided the distances are sufficientlv small. Given functions x and v for which do(x_v) =
Cadlag Functions
463
£ s �' consider A E A satisfying the definition of d8 for this pair, such that, in particular, II A II s £. Since A(O) = 0, there evidently must exist t E (0, 1] such that l log(A(t)/t) I s II All , or e
-E
< _
A(t) < e E. (28 . 1 8) t we find eE - 1 s 2£ for £ s �' and e -£ - 1 � - 2£ -
-
Using the series expansion of eE, similarly, which implies that (28. 1 9) -2£ s t( e -£ - 1) s A(t) - t s t(e£ - 1) s 2£, or, I A(t) - tl s 2£. And in view of our assumption about A, suptl x(t) - y(A(t)) I s £ and hence ds(x,y) cannot exceed 2£. In other words, ds(x,y) s 2d8(x,y) (28.20) whenever d8(x,y) s �Now consider a function Jl E A which is piecewise-linear with vertices at the points of a partition III), as defined above (28.3) for a suitable choice of 3 to be specified. The slope of Jl is equal to (J.L(ti} - J.LUi- 1 ))/(ti - ti l ) on the inter vals [tH,ti), where ti - ti- 1 > 3. Notice that, if supt i J.L(t) - tl s- 32 ,
I
J.LUa - J.LCti-1 ) ti - ti - l
1
d (28.21) 1 s I J.LCti) - td + I J.L(ti- 1 ) - ti - s 23. 3 For I x I s �' the series expansion log { 1 + x} = x - ¥2 + !x3 - . .. implies (28.22) l log{ 1 + x} l s max { lx l , l x - x2 1 } s 2 l x l . Substituting for x in (28.22) the quantity whose absolute value is the minorant side of (28.21), we must conclude that, if supt l J.L(t) - tl s 32 for 0 < 3 s !, then (28.23) 111111 s 43 . Now, suppose ds(x,y) = 32, which means there exists A E A satisfying sup i A(t) - t l s 32, and supt ly(t) - x(A(t)) l s 32. Choose Jl as the piecewise linear function with J.L(tD = A(tD for i = O, . . . , r. The function A- 1 Jl is 'tied down ' to the diagonal at the points of the partition; that is, it is increasing on the intervals [ti-l ,tD with A-1 J.L(t) E [tH,ti) if and only if t E [ti- I , ti). Therefore, choosing II8 to correspond to the definition of w�(3), we can say I x(t) -x(J.L(t)) I s I x(t) - x(A- 1 J.L(t)) I + I x(A - 1 J.L(t)) - x(J.L(t)) I 2 s w�(3) + 3 . (28.24) -
t
Putting this together with (28.23) gives for 0 < 3 s ! the inequality d8(x,y) s max { 43, w�(3) + 32 } s w�(3) + 43. (28.25) Since for x E D we may make w�(3) arbitrarily small by choice of 3, we have
The Functional Central Limit Theorem
464
ds(x,y) S 4ds(x,y) 1 12 (28.26) whenever ds(x,y) is sufficiently small. We may conclude as follows. 28.7 Theorem In D, metrics ds and ds are equivalent. Proof Given £ > 0, choose 8 s a' and also sma)l enough that w;(8) + 48 s £. Then, for 11 s min { 82, 1£} , ds(x,y) < 11 � ds(x,y) < £, (28.27) ds(x,y) < 11 � ds(x,y) < £, (28.28) by (28.20) and (28.25) respectively, The criteria of (5.5) and (5.6) are therefore satisfied. • Equivalence means that the two metrics induce the same topology on D (the Skorokhod topology). Given a sequence of elements {xn}, ds(Xn,x) � 0 if and only if ds(xmx) � 0, whenever x E D. But it does not imply that {xn } is a Cauchy sequence in (D,ds) whenever it is a Cauchy sequence in (D,ds), because the latter space is incomplete and a sequence may have its limit outside the space. It is clear in particular that ds(xn,x) ----7 0 only if ds(xn,X) � 0 and lim.s�ow�(8) = 0. For example, the sequence of functions in (28. 12) is not a Cauchy sequence in (D,dB). To define dB(Xn,Xm) (for n � 3, m � 4) it is necessary to find the element of A for which "-G) 1 and A.(1 + k) 1 + A, and whose gradient deviates as little as possible from 1 . This is obviously the same piecewise-linear function, with vertices at the points (0,0), (1.-D, G + k, 1 + k) and ( 1 , 1 ), as defined for ds. But the maximum gradient is n/m, corresponding to the segment connecting the second and third vertices. dB(Xn,Xm) = min { 1 , I log(nlm) I } , which does not approach zero for large n and m (set m = 2n for example). 28.8 Theorem The space (D,ds) is complete. Proof Let {yk , k E IN } be a Cauchy sequence in (D,ds) satisfying dsCYbYk+ I ) < 1 12k, implying the existence of a sequence of functions { Ilk E A} with (28.29) sup I Yk(t) -Yk+l (llk(t)) I < 1 12k, =
=
,
t
(28.30) It follows from (28.20) that SUPrl llk+m(t) - t l S 212k+m for m > 0. Define llk,m f.Lk+m 0 f.Lk+m - I 0 0f.Lb also an element of A for each finite m; the sequence {llk,m , m = 1 ,2, ... } is a Cauchy sequence in (C,du) because sup l llk,m+ l (t) - llk,m(t) I = sup l l!k+m+ l (s) - s I S 112k+m. (28.3 1) t s Since (C,du) is complete there exists a limit function Ak = Iimk�oo llk,m· To show that Ak E A, it is sufficient to show that II A.k il < oo. But by (28. 16),
=
•••
Cadlag Functions
465
m m 1 1 (28.32) m k L k < k k+ 1 1 !-! L j ll ll :=:; + �;+- :=:; N' :=:; ll !-! 0!-l l 0 ••• 01-l + j=O j=O 2 J 2 for any m, so 111�- k I I :::; 112k- I . Note that Ak = 'Ak+1 °1-lh so that 'Akll = !-!k 0 Ak 1 and hence, by (28.29), sup 1 Yk(Ak 1 (t)) - Yk+l('AklJ(t)) = sup I Yk(s) - Yk+I(!-!k(s)) < 112k. (28.33) 11 !-!k,m ll
I
t
s
I
So consider the sequence {yk o')...k 1 E D, k E [N } . According to (28.33) this is actu ally a Cauchy sequence in (D,du). But the latter space is complete; this is easily shown as a corollary of 5.24, whose proof shows completeness of (C,du) without using any of the properties of C, so that it applies without modification to the case of D. Hence Yk 0 A"k 1 has a limit y E D. Since this means both that supr l yk(t) - y(Ak(t)) I = supr iYk(Ak 1 (t)) - y(t) I � 0 and that II'Adl = II'Ak1 11 � 0, d8(yk,y) � 0 and so {yd has a limit y in (D,dn). We began by assuming that {yd was a Cauchy sequence with dn(yk >Yk+I ) < 1 12 k. But this involves no loss of generality because it suffices to show that any Cauchy sequence {xm n E [N } contains a convergent subsequence {Yk = Xnk' k E [N } . Clearly, a Cauchy sequence cannot have a cluster point which is not a limit point. Every Cauchy sequence contains a subsequence with the required property; if dn(XmXn+I ) < 1/g (n) � 0 (say), choosing nk ? g - 1 (2 k) is appropriate. This completes the proof. • 28.4 Measures
on
D
We write 'Bv for the Borel field on (D,dn). Henceforth, we will also write just D to denote (D,dn), and will indicate the metric specifically only if one different from dn is intended. The basic property we need the measurable space (D,'Bv) to possess is that measures can be fully specified by the finite-dimensional sets. An argument analogous to 27.6 is called for, although some elaboration will be needed. In particular, we have to show, without appealing to continuity of the projections, that the finite-dimensional distributions are well-defined and that there are finite-dimensional sets which constitute a determining class for (D,'Bv). We start with a lemma. Define the field of finite-dimensional sets of D as 1fv = { H n D: H E 1f } , where 1f was defined in § 27 . I . 28.9 Lemma Given x E D, a > 0, and any t1 , ... ,tm E [0, 1], let Hm(x,a)
=
{
y E D: 3 '!. E A s.t. 11'1.11 < a,
Then Hm(x,a) E 1-fv. Proof Since Hm (x,a)
}
���:: I y(t;) - x(A.(t;)) I < a
(28.34)
D, all we have to show according to (27.2) is that 1t1 1 , ... ,rm(Hm(x,a)) E 'B . This is the set whose elements are (y(tt ), ... ,y(tm )) for each y E Hm(x,a). To identify these, first define the set s
m
The Functional Central Limit Theorem
466
(28.35) Then it is apparent that 1t11, ... ,tm (Hm(x,a)) =
{
bJ, . . . ,bm: max l ai - b d
< a,
(a1, ... ,am)
lsism
E Am(x,a)
}
m
� IR .
(28.36)
In words, this is the set Am(x,a) with an open a-halo, and it is an open set. It therefore belongs to :Em . • To compare the present situation with that for C, it may be helpful to look at the case k = 1 . The one-dimensional projection nr(Hr(x,a)), where (28.37) Hr(x,a) = {y E D: 3 'A E A s.t. IIJ... I I < a, l y(t) -x(J...(t)) l < a } , is in general different from S(x(t),a), that is, the interval of width 2a centred on x(t). If x is continuous at t the difference between these two sets can be made arbitrarily small by taking a small enough, and at these points the projections are in fact continuous. Since the discontinuity points are at most countable, they can be ignored in specifying finite-dimensional distributions for x, as will be apparent in the next theorem. However, the point that matters here is that we have the material for the extension of a measure to (D,:BD) from the finite-dimensional distributions. It is easily verified that JfD, like Jf, is a field. The final link in the chain is to show that JfD is a determining class for (D,BD). 28. 10 Theorem (cf. Billingsley 1968: th. 14.5) :ED = cr(JfD). Proof An open sphere in (D,d8) is a set of the form S(x,a)
=
{ y E D : d8(y,x) < a}
=
{y E D: 3 'A E A s.t.
I I J... I I
< a,
s�p l y(t) - x(J...(t)) l
}
< a
(28.38)
for x E D, a > 0. Since these sets generate :ED, it will suffice to show they can be constructed by countable unions and complements (and hence also countable intersections) of sets in JfD. Let H(x,a) = n'k'=rHk(x,a), where Hk(x,a) is a set with the form of Hm defined in (28.34), but with m = 2k - 1 and ti = i/2k, so that the set { t1 , ... hk-d converges on [) (the dyadic rationals) as k � oo Consider y E H(x,a). Since y E Hk(x,a) for every k, we may choose a sequence { 'Ad such that, for each k � 1 , (28.39) I I A.k l l < a, k (28.40) max I y(2 - i) - x(J...k(Tki)) I < a. .
l s i s 2L 1
Making use of the fortuitous fact that J...k has the properties of a c.d.f. on [0, 1 ] , Helly' s theorem (22.21) may be applied to show that there is a subsequence f 'Ar--_,
Cadlag Functions n
467
e
IJ'.J } converging to a limit function /.., which is non-decreasing on [0, 1 ] with /..(0) = 0 and /..( 1) = 1 . /.., is necessarily in A, satisfying l l t.. l l � a (28.41) according to (28.39). And in view of (28.40), and the facts that /..k(t) � /..(t) and x is right-continuous on [0, 1), it must also satisfy either l y(t) -x(/..(t)) l � a or I y(t) - x(/..(t)-) I � a for every t e []) . Since []) is dense in [0, 1], this is equivalent to sup I y(t) - x(/..(t)) I � a. (28.42) t
The limiting inequalities (28.41) or (28.42) cannot be relied on to be strict, but comparing with (28.38) we can conclude that y e S(x,a). This holds for all such y, so that H(x,a) � S(x,a). Put a = r- lin, and take the countable union to give 00
U H(x, r - lln)
n= l
�
U S(x, r- lln)
n=l
=
S(x,r).
(28.43)
It is also evident on comparing (28.34) with (28.38) that S(x,a) � Hk(x,a) for a > 0. Again, put a = r - 1/n, and
S(x,r)
=
00
U S(x, r - 1/n)
n=l
�
U H(x, r- 1/n).
n=l
(28.44)
It follows that, for any X E D and r > 0, S(x,r) u;= l nk=!Hk(x, r - lin) where Hk(x, r - 1/n) e JfD. This completes the proof. • The defining of measures on (D,'BD) is now possible by arguments that broadly parallel those for C. The one pitfall we may encounter when assigning measures to finite-dimensional sets is that the coordinate projections of 'BD sets may have no 'natural' interpretation in terms of observed increments of the random process. For example, suppose Xn e D is the process defined in (27.55) and (27.56), with respect to the underlying space (Q.,Wf,P). It is not necessarily the case that ntCXn (W)) is measurable with respect to W'n,[nt] = a(Uni, i � [nt]), as casual intuition might suggest. A 'BD-set like HtCx,a) in (28.37) is the image under the mapping Xn : Q H D of a set E e W'; in fact, we could write E = K,; 1 (n�1 (B)), where B e 'B. But E depends on the value that x assumes at /..(t), and if /..(t) > t then E cannot be in W'n, [nt] · However, this difficulty goes away for processes lying in C almost surely. In view of28.4, we may 'embed ' ((C,du),'Bc) in ((D,d8),'BD) and a p.m. defined on the former space can be extended to the latter, with support in C. In particular, Wiener measure is defined on (D,'BD) by simply augmenting the conditions in 27.7 with the stipulation that W(x e C) = 1 . =
2 8 . 5 Prokhorov ' s Metric
The material of this section is not essential to the development since Billings ley ' s metric is all that we need to work successfully in D. But it is interesting
The Functional Central Limit Theorem
468
to compare it with the alternative approach due to Prokhorov (1956). We begin with an alternative approach to defining a continuity modulus for cadlag functions. Let
Wx(O)
�
max
{_
'<,
��� '
'
(min { I x(t') - x(t) I . I x(f') -x(t) I ) ),
}
sup l x(O) - x(o) l , sup l x(o) - x( 1 ) 1 . 1 -o< t s l
O�t
(28.45)
Again, it may be helpful to restate this definition in English. The idea is that, for every t E [8, 1 - 8], a pair of adjacent intervals of width 8 are constructed around the point, and we determine the maximum change over each of these inter vals; the smaller of these two values measures the 8-continuity at point t, and this quantity is supped over the interval. This means that the function can jump discontinuously without affecting wx(o), so long as no two jumps are too close together. The exceptions are the two points 0 and 1 , which for wx(8) ----:f 0 must be true continuity points from the right and left respectively. The following theorem parallels 28.3. 28.11 Theorem If and only if x E D, lim wx(o) Proof Suppose x E
=
0.
(28.46)
D. By 28.1(c), the second and third terms under the 'max' in (28.45) definitely go to zero with o. Hence consider the first term. Let { tk>tk,t'k } denote the sequence of points at which the supremum is attained on setting o = llk for k = 1 ,2, . .. Assume tk --7 t. (If need be consider a convergent subsequence.) Then tk --7 t and t'[; --7 t. Since x(t) = x(t+) , this implies l x(tk) - x(t'k) l --7 0, which proves sufficiency. Now suppose w (8) --7 wx(O) > 0. Since wx(O) = max { l x(O) - x(O+) l . l x(1) - x(1 -) I . min { l x(t) - x(t-) I , l x(t) - x(t+) I } } , it follows that x 12: D, proving necessity. • Now define the function wx(e +), z < 0, (28.47) wx (z ) = wxC 1), z 2 o. This is non-decreasing, right-continuous, bounded below by 0 and above by wx(1). It therefore defines a finite measure on IR , just as a c.d.f. defines a p.m. on IR . B y defining a family of measures i n this way (indexed on x) on a separable space, we can exploit the fact that a space of measures is metrizable. In fact, we can use Levy' s metric L* defined in (26.33). The Prokhorov metric for D is (28.48) x
_
{�
z
Cadlag Functions
469
where rx and ry are the graphs of x and y and dH is the Hausdorff metric. The idea here should be clear. With the first term alone, we should obtain a property similar to that of the Skorokhod metric; if we write d(x(t),ry) = inf1'dE(x(t),y(t')), then dH(rx,ry)
=
{
}
max sup d(x(t),ry), sup d(rx,y(t')) . t
t'
(28.49)
In words, the smallest Euclidean distances between x(t) and a point of y, and y(t) and a point of x, are supped over t. For comparison, the Skorokhod metric minimizes the greater of the horizontal and vertical distances separating points on rx and ry in the plane, subject to the constraints imposed on the choice of 'A such as continuity. In cases such as the functions xe of (28.6), xe and xe+o are close in (D,dH) when 8 is small. (Think in terms of the distances the graphs would have to be moved to fit over one another.) The purpose of the second term is to ensure completeness. By 28.11, limz--7 -oowx(z) = 0 if and only if x E D; otherwise this limit will be strictly positive. Unlike the case of (D,dH), it is not possible to have a Cauchy sequence in (D,dp) aproaching a point outside the space. It can be shown that dp is equivalent to ds, and hence of course also to dB, and that the space (D,dp) is complete. The proofs of these propositions can be found in Parthasarathy ( 1967). For practical purposes, therefore, there is nothing to choose between dp and dB. 2 8 . 6 Compactness and Tightness in
D
The remaining task is to characterize the compact sets of D, in parallel with the earlier application of the ArzeUt-Ascoli theorem for C. 28. 12 Theorem (Billingsley 1968: th. 14.3) A set A c D is relatively compact in (D,dB) if and only if sup sup l x(t) I < =, (28.50) XEA
t
lim sup w�(8)
8--70 X E A
=
0.
o
(28.51 )
This theorem obviously parallels 27.5 but there are significant differences in the conditions. The modulus of continuity w� appears in place of Wx which is a weaken ing of the previous conditions, but, on the other hand, (28.50) replaces (27. 16). Instead of sup1 I x(t) I we could write dB( I xI ,0), where 0 denotes the element of D which is identically zero everywhere on [0,1]. It is no longer sufficient to bound the elements at one point of the interval to ensure that they are bounded every where: the whole element must be bounded. A feature of the proof that follows, which is basically similar to that of 5.28, is that we can avoid invoking completeness of the space until, so to speak, the last moment. The sufficiency argument establishing total boundness ofA is couched in terms of the more tractable Skorokhod metric, and then we can exploit the Pnni v�lPnf'P of rln with � t'omnlPtP mPtri f' <;:Jwh �" rln to
o-Pt thP: romn::�rtnP:"" of A
The Functional Central Limit Theorem
470
The argument for necessity also uses ds to prove upper semicontinuity of w�(8), a property that, as we show, implies (28.5 1 ) when the space is compact. Proof of 28.12 Let sup E A sup l x(t) l = M. To show sufficiency, fix E > 0 and choose m as the smallest integer such that both l im < !E and sup E A w�( l lm) < !E. Such an m exists by (28.51). Construct the finite collection Em of piecewise constant functions, whose values at the discontinuity points t = jim for j = O, . . . ,m - 1, are drawn from the set {M(2u/v - 1), u = 0, 1 , ... ,v } where v is an integer exceeding 2M/E; hence, Em has (v + 1 )m different elements. This set is shown to be an E-net for A. Given the definition of m, one can choose for x E A a partition il1 1m { t , ,tr } , defined as above (28.3), to satisfy max sup l x(t) - x(s ) l < �E. (28.52) x
1
1
x
}
{
• . •
=
l s i s r s,t E (t;_1,t;)
For i = O, . .. ,r - 1 let ji be the integer such that j/m � ti < (ji + 1)/m, noting that, since the ti are at a distance more than 1/m, there is at most one of them in any one of the intervals Ulm, (j + 1)/m), j = O, ... ,m - 1 . Choose a piecewise linear function A, E A with vertices /..(j/m) ti , i = O, ... ,r. Since I ti - Nm l � lim, maxos i s r l A(j/m) -Nm l � 1E, and the linearity of A between these points means that sup I A(t) - t l � �E. (28.53) =
t
By construction, A maps points in Ulm, (j + 1)/m) into (ti, ti+ l ) whenever ji � j $ ii+l • and since x varies by at most !E over intervals [ti,ti+J), the composite function xoA, can vary by at most 1E over intervals rJim,(j + 1 )/m). An example with m = 10 and r 4 is sketched in Fig. 28.4; here, }1 2, h = 4 and h = 6. The points to, . .. ,t4 must be more than a distance 1/10 apart in this instance. One can therefore choose y E Em such that (28.54) j y(jlm) - x(A(jlm)) l < !t:, j O, .. . ,m - 1. =
=
=
1
to -J<--.---f--L-r--+'--r--+--'-r-.--l 0 1 2 Fig. 28.4
Functions 471 Since y(t) y(j/m) for t E Ulm, (j + 1)/m), we have by (28. 52) and (28. 54), s�p I y(t) -x(A(t)) I ,:�J I y(jlm) -x(A(j/m)) I 0 sup l x(A(j/m))-x(A(t)) l } + < E. (28.55) Together, (28. 5 5) and (28. 5 3) imply ds(x,y) :::; £, showing that is an £-net for A Cadlag
=
,;
te Ulm,(j+ l )lm)
Em
as required. This proves that A is totally bounded in (D,ds). But since ds and d8 are equivalent (28.7), A is also totally bounded in (D,d8); in particular, if Em is an £-net for A in (D,ds), then we can find 11 such that it is also an 11-net for and where 11 can be set arbitrarily A in (D,d8) according to small. Since (D,d8) is complete, A is therefore compact, proving sufficiency. When A is totally bounded it is bounded, proving the necessity of To show the necessity of we show that the functions = are upper semicontinuous on (D,ds) for each This means that the sets Bm = < E} are open in (D,ds) for each E > 0. By equivalence of the metrics, they are also open in (D,d8). In this case, for any such £, the sets Bm, IN } are an open covering for D by 28.3. Any compact subset of D then has a finite subcovering, or in other words, if A is compact there is an such that A � Bm. By definition of Bm, this implies that holds. To show upper semicontinuity, fix E > 0, () > 0, and D, and choose a parti tion I1 0 satisfying
(28.27)
(28.28),
(28.51),
(28.50). w'(x,llm) w�(llm)
m.
w�(llm)
{x:
{ mE
(28.51)
max
{
sup
l :::; i :::; r s,t e [ti-J.fi)
m
xE
l x(t)-x(s) l } < w�(D) + !£.
Also choose 11 < !£, and small enough that > D + 211. max
(28. 56)
(28.57) {ti -ti-d Our object is to show, after (5. 3 2), that if y E D and ds(x,y) < 11 then w;(8) < w�(D) + E. (28.58) If ds(x,y) < 11 there is A. E A such that sup jy("A.(t))-x(t)l < 11 (28.59) and sup I A.(t)-t l < (28. 60) Letting A.(tD, (28.57) and (28.60) and the triangle inequality imply that (28.6 1) 1 :::;: i :::;; r
t
fl .
t
si =
l :::; i :::; r
l :::; i:::; r
The Functional Central Limit Theorem
472
If both s and t lie in [t;- J,t;) A(s) and A(t) must both lie in [s;-1 ,s;). It follows by (28.56), (28.59), and the choice of 11 that
�t:�
�
p J y(s) - y(t) l , '"l
}
<
wili) + £.
(28.62)
In view of (28.61), this shows that (28.58) holds, and since E and x are arbitrary the proof is complete. • This result is used to characterize uniform tightness of a sequence in D. The next theorem directly parallels 27.12. We need completeness for this argument to avoid having to prove tightness of every �n' so it is necessary to specify an appropri ate metric. Without loss of generality, we can cite d8 where required. 28.13 Theorem (Billingsley 1968: th. 15.2) A sequence { �n } of p.m.s on ((D,d8), <.En) is uniformly tight iff there exists N E IN such that, for all n � N, (a) For each 11 > 0 there exists M such that �n ( {x: sup l x(t) l > M}) t
(28.63)
:<::; TJ ;
(b) for each E > 0, 11 > 0 there exists 8 E (0, 1) such that �n( {x: w�8) � E}) :<::; TJ .
(28.64)
be uniformly tight, and for 11 > 0 choose a compact set K with �n(K) By 28.12 there exist M < oo and 8 E (0, 1 ) such that
Proof Let { �n } >
1
- TJ .
K
c
{x: sup l x(t) l t
:<::;
M} n { x: w;(8)
<
E}
(28.65)
for any E > 0. Inequalities (28.63) and (28.64) follow for n E IN , proving neces sity. The object is now to find a set satisfying the conditions of 28.12, whose closure K satisfies SUPn ::>:NJ..ln (K) > 1 - e for some N E IN and all e > 0. Because (D,d8) is a complete separable space, each �n is tight (26.19) and the above is sufficient for uniform tightness. As in 27.12, let �* stand for SUPn::>:NJ..ln · For e > 0, define (28.66) where { 8k} is chosen so that �*(Ak) > 1 - 8/2k+l , possible by condition (b). Also set B = {x: sup l x(t) l :<::; M} such that �*(B) > 1 - �8, possible by condition (a). Let K = (nk=IAk n B)-, and note that K satisfies the conditions in (28.50) and (28.51 ), and hence is compact by 28.12. With these definitions, the argument follows that of 27.12 word for word. • The last result of this chapter concerns an issue of obvious relevance to the functional CLT; how to characterize a sequence in D which is converging to a limit in C. Since in all our applications the weak limit we desire to establish is in C, no other case has to be considered here. The modulus of continuitv w_ 1s thP: t
Cadlag
Functions
473
natural medium for expressing this property of a sequence. Essentially, the following theorem amounts to the result that the sufficiency part of 27.12 holds in (D,d8) just as in (C,du). 28.14 Theorem (Billingsley 1968: th. 15.5) Let {J..ln } be a sequence of measures on ((D,d8),'BD). If there exists N E IN such that, for n :?: N, (a) for each 11 > 0 there is a finite M such that (28.67) J..ln ({x: lx(O) I > M}) $ 11 ; (b) for each E > 0, 11 > 0 there is a 8 E (0, 1) such that (28.68) )..ln({x: wx(8) :?: £}) $ 11; then {J..ln } is uniformly tight, and if J..l is any cluster point of the sequence, J..l(C) = 1. Proof By (28.4), if (28.68) holds for a given 8 then (28.64) holds for 8/2. Let k = [1 /8] + 1 (so that k8 > 1) where 8 > 0 is specified by condition (b). Then according to (28.68), J..ln({x: l x(ti/k) - x(t(i - 1)/k) l :?: E}) $ 11 for i = 1 , ... ,k, and t E [0, 1]. We have noted previously that
(28.69) where each of the k intervals indicated has width less than 8. It follows by (28.68) and (28.67) that
J..ln ({x: sup l x(t) l > M + kE}) :::; J..ln( {x: lx(O) I > M}) :::; 11, t
(28.70)
so that (28.63) also holds for finite M. The conditions of 28.13 are therefore satisfied, proving uniform tightness. Let J..l be a cluster point such that J.lnk :::::} J..l for some subsequence { nk> k E IN } . Defining A = {x: wx(8) :?: £}, consider the open set AD, the interior of A; for example, x E AD if wx(8/2) :?: 2£. Then by (d) of 26.10, and (28.68),
J..l(AD) :::; liminf J.ln/AD) k�oo
:::;
11·
(28 . 71)
Hence J..l(B) :::; 11 for any set B c AD. Since E and 11 are arbitrary here, it is pos sible to choose a decreasing sequence { 8j } such that J..l(Bj) :::; 1/j, where Bj = { x: wx(8j) :?: 1/j}. For each m :?: 1 , J..L(n}=mBj) = 0, and so, by subadditivity, J..L(B) = 0 where B = liminfBj. But suppose X E Be, where Be = n;;;= l uJ=mB} is the set {x: wx(8j) < 1/j, some j :?: m; all m E IN } . Since { 8j } is monotonic, it must be the case that lim0�0wx(8) = 0 for this x. Hence Be c C, and since J..l(Bc) = 1 , J..l(C) = 1 follows. •
29 FCLTs for Dependent Variables
29. 1 The Distribution o f Continuous Functions o n
D
A surprising fact about Wiener measure is that definition 27.7 is actually
redundant; if part (b) of that definition is replaced by the specification merely of the first two moments of x(t), Gaussianity of x(t) must follow. This fact leads to a class of functional CLTs of considerably greater power and generality than is possible with the approach of §27.6. 29.1 Theorem (Billingsley 1968: th. 19. l ) LetX be a random element of D ro,IJ with the following properties: (a) E(X(t)) = 0, E(X(t)2) = t, 0 :::;; t ::; 1 . (b) P(X E C) = 1 . (c) For any partition { t1 , ... ,tk } of [0, 1], the increments X(t2) - X(t 1 ), X(t3 ) - X(t2), ... , X(tk) - XCtk-J), are totally independent. Then X - B. o This is a remarkable theorem, in the apparent triviality of the conditions; if an element of D is a.s. continuous, independence of its increments is equivalent to Gaussianity ! The essential insight it provides is that continuity of the sample paths is equivalent to the Lindeberg condition being satisfied by the increments. The virtuosity of Billingsley's proof is also remarkable. The two preliminary lemmas are technical, and in the second case the proof is rather lengthy; the reader might prefer to take this one on trust initially. If S t. · ·Sm is a random sequence, and we define Sj = E=t Si for 1 ::; j ::; m, and S0 = 0, the problem is to bound the probability of I Sm I exceeding a given value. The lemmas are obviously designed to work together to this end. 29.2 Lemma I Sm ! ::; 2 max min { I Sji . I Sm - Sj l } + max l sj l · 0 5.j 5. m 0 5.j $ m ..
Proof Let I s;;;; { O, . ,m} denote the set of integers k for which I Sk l :::;; I Sm - Sk i · If Sm = 0 the lemma holds, .and if Sm :f: 0 then m fit I. On the other hand, 0 E /. It follows that there is a k fit I such that k - 1 E /. For this choice of k, ! Sm l ::; I Sm - Sk i + I Sk l ::; I Sm - Sk i + l sk - t l + l skl (29. 1) ::; 2 max min{ I Sj i , I Sm - Sj l } + max I S; I . • j j m m O O$ 5. $$ . .
FCLTs for Dependent Processes
475
The second lemma is a variation on the maximal inequality for partial sums. 29.3 Lemma (Billingsley 1968: th. 12. 1) If
� (± b1) , j = i, ... ,k, (29.2) l=i+l for each pair i,k with 0 � i � k � m, where {b 1 , ... , bm } is a collection of posi 2
2
2
E((Sj - Si) (Sk - Sj) )
tive numbers, then ::3 K > 0 such that, for all a > 0 and all m,
P
(0�:: min { I Sj i , I Sm - Sj l } � a) � KBa4 ,
where B = LJ== l bj. Proof For 0 � i � k
2
(29.3)
� m and a > 0, we have
P(min { I Sj Sd , I Sk - Sj l } � a) = P( { I Sr Sd � a} n { I Sk - Sj l � a} ) � P( I Sj - Si i i Sk - Sj l � a2) ' k 2 � � a "L b� , (29.4) l=i+ 1 -
( ) '
where Chebyshev ' s inequality and (29.2) give the final inequality. If m = 1 , the minorant side of (29.3) is zero. If m = 2, (29.4) with i = 0 and k = 2 yields
( b 1 + b 2)2 , (29.5) P(max{ O, min{ I Sd , I S2 - SJ i } } � a) � a4 so that (29.3) holds for K. = 1 and hence for any K � 1 . The proof now proceeds by induction. Assuming there is a K for which (29.3) holds when m is replaced by any integer between 1 and m - 1, we show it holds for m itself, with the same K. The basic idea is to split the sum into two parts, each with fewer than m terms, obtain valid inequalities for each part, and combine these. Choose h to be the largest integer such that L..j:}bj � B/2 (the sum is zero if h = 1); it is easy to see that Ij=h+ 1 bj � B/2 also (the sum being zero if h = m). First define (29.6) U1 = max min{ I Sji . I Sh - 1 - Sji } O�j�h- 1 (29.7) Evidently, h 1 2 KB2 K (29.8) P(U1 � a) � 4 L, bj � -4 4a a j==l
( )
by the induction hypothesis. Also, by (29.4) with i = 0 and k = m,
The Functional Central Limit Theorem
476
2 (29.9) P(D I � a) s 84 . a The object is now to show that (29. 10) min{ I SJ I . I Sm - Sj l } :::; UI + DJ, 0 s j :::; h - 1 . If I Sj l :::; UJ, (29.10) holds, hence suppose I Sh -1 sj I :s; U}, the only other possi bility according to (29.6). If D 1 = I Sh - I I , then min{ I SJ I , I Sm - Sjl } S I SJ I S I Sh - 1 - SJ I + I Sh - I I :s; U1 + D 1 . And if D 1 I Sm - Sh - I l then again, min{ I SJ I , I Sm - Sj l } :::; I Sm - Sj l :::; I Sh - 1 - Sj i + I Sm - Sh - d :::; Ul + D J . Hence (29. 10) holds in all cases. Now, for 0 :s; J..l s 1 , -
==
P(U 1 + D I � a) s P({ UI � J..la } u {DI � (1 - J..L) a}) s P(U1 � J..la) + P(D I � (1 - J..L)a) B2 KB2 < - 4 4+ 4a J..L a\1 - J..L/ --
(29. 1 1 )
Choosing J..l to minimize K/4J..l4 + 11( 1 - J..L)4 yields J..l = (i-K) 1 15/(1 + (!K) 1 15] (use calculus). Back-substituting for J..l and simplifying yields, for K � 2[1 - (�) 1 15r5 55,021,
z
(29. 12) According to (29. 10), we have bounded min { I S1 1 , I Sm - S11 } in the range 0 s j s h - 1 . To do . the same for the range h s j s m, define (29. 13) Uz max min{ I SJ - Sh l , I Sm - SJI } h 5J 5m =
(29. 14) It can be verified by variants of the previous arguments that min{ I SJI , I Sm - SJI } s Uz+ Dz, h s j s m,
(29. 15)
and also that
(29. 16) for the same choice of K. Combining (29. 16) with (29. 12), we obtain
FCLTs for Dependent Processes P
l':':: j
)
477
min{ I Sj l , I Sm - Sj l ) 2 a � P(max { U 1 + D l > U1 + D2 ) ;, a)
= P( { Ur +D1 � a} u { U2 + D2 � a}) ::::; P(Ur + Dt � a) + P(U2 + D2 � a) 2 <- KB . (29. 17) a4 "
29.1 Let the characteristic function of X(t) be (29. 1 8) <J>(t, A) = E(e i'AX(t)) . We can write, by ( 1 1 .25), eiu = 1 + iu - 1u2 + r(u), (29.19) where l r(u) l ::::; l u l 3. We shall write either 115,1 or 11(s,t), as is most convenient, to denote X(s) - X(t) for 0 ::::; t ::::; s ::::; 1 . Observe that by conditions (a) and (c) of the theorem, E(l17+h,r) = h. Hence, <J>(t + h,A.) - <J>(t, A) = E[ei'AX(t)(eiUr+h ,t - 1)] = E [eiAX(t)(iA-11r+h,t - 1A-2117+h,t + r(A-11r+h, r)) ] 2 (29.20) = <J>(t,A)[ -1A- h + E(r(Mr+h,r)] , Proof of
where the last equality is because X(t) and 11r+h,t are independent by condition (c). Since E(r(A-11r+h, r) ) ::::; A-3EI 11r+h, rl 3, it follows that <\> (t, A)A3EI 11r+h, r l 3 th(t + h A) - th(t A) 2 ' 'V ' ! 'V (t,A.) ::::; (29.21) + A<J> h h Now, suppose that
l
I
(29.22) lim * E I 11r+h, rl 3 = 0. h ,l.O It will then follow that, for all 0 ::::; t < 1 , possesses a right-hand derivative, Im (t + h,A.)h - (t,A.) - 21'\""2th'V(t, l\,) . (29.23) h ,l.O Further, for h > 0 and h ::::; t ::::; 1 , (29 .21) holds at the point t - h, so by consid ering a path to the limit through such points we may also conclude that 1.
_
'\
_
) 2th Im (t,A.) - h (t - h,A.) = _t'l (29.24) 2"" ( t ' "" . 'V h ,J, o Since <J> (t-,A) = <J>(t,A) because is continuous in t, by condition (b) of the theorem, is differentiable on (0, 1 ) and 1.
_
'\
The Functional Central Limit Theorem
478
(29.25) This differential equation is well known to have the solution -()..212, :?: 0. = (29.26) (Verify this by differentiating log with respect to Since = 0 a.s., = 1, and applying the inversion theorem we conclude that for each E (0, 1 . By continuity of <\l at 1 , the result also extends to = 1 . Hence, the task is to prove (29.22). This requires the application of 29.2 and 29.3. For some finite let S; = for j = 1 , By assumption, the �J are independent r.v.s with variances of If SJ = L{= l �i = then E((s1 - sicsk - Sj)2) = (29.27) we have By 29.3, setting bj =
<\l(t,A.) <\l(O,A.)e
<\J(O,A.) t
t
<\l
t.)
X(t)(O) N(O,t) X t !1(t+hj/m, t+h(j- l)lm)him. ... ,m. �
)
m
11(t +}him, t),
(}-i)(k-J)h21m2 :::; h2.
him, 2 P ( m�x minfl l 11(t + �' I, I !1(t + h, t +�h) I L a}) K� . a J o��m ;:::
t)
Hence by 29.2,
P( l
d(t+ h,t) I
;,
(
a) $ P 2 x min 0�� m
K*h2 + m X C
(
:::;
(29.28)
{ 1,\.(t +�, t) I , I d(t+ h, t+ �h) I }
)
. . 1 :::; -4- P max (29.29) h) I :?: ¥x , a O�j <:;,m where K* = 44K. Letting -7 oo, the second term of the majorant member must go to zero, since E with probability 1 by condition (b), so we can say that
1!1(t+ �, t +1�
(29.30) We may now use 9.15 to give E 1 11t+h,t 1 3 =
J: 1 11t+h,t1 3 F + EP( 1 11t+h,t 1 3 d
:?:
E)
+ f�PC 1 11t+h,t 1 3 > s)ds (29.3 1 )
FCLTs for Dependent Processes 479 . . . . Ch oose £ = (K* )314h 312 to m1mm1ze the 1ast member above, and we obtam (29.32) E l �t+h, r l 3 :::::; 4(K* ) 314h312 . This condition verifies (29.22), and completes the proof. • Notice how (29.30) is a substantial strengthening of the Chebyshev inequality, which gives merely P( l �t+h, r l ;::=:: a) :::::; hla2 . We have not assumed the existence of the third moment at the outset; this emerges (along with the Gaussianity) from the assumption of independent increments of arbitrarily small width, which allows us to take (29.29) to the limit. 2 9 . 2 Asymptotic Independence
Let {Xn } i denote a stochastic sequence in (D,'Bv). We say that Xn has asymptoti cally independent increments if, for any collection of points {si ,ti, i = 1 , ... ,r} such that 0 :::::; SJ :::::; t1 < S2 :::::; t2 < . . . < S r :::::; t, :::::; 1 , and all collections of linear Borel sets B 1 , ... ,B, E 'B, r
1 , ... ,r) -7 flP(Xn(ti) - Xn (si) E Bi) (29.33) i= l as n -7 oo. Notice that in this definition, gaps of positive width are allowed to separate the increments, which will be essential to establish asymptotic indepen dence in the partial sums of mixing sequences. The gaps can be arbitrarily small, however, and continuity allows us to ignore them as we see below. Given this idea, we have the following consequence of 29.1. 29.4 Theorem Let { Xn } ;;'= 1 have the following properties: (a) The increments are asymptotically independent. (b) For any £ > 0 and 11 > 0, 3 8 E (0, 1 ) s.t. limsupn---+ooP(w(Xn.8) ;::=:: £) :::::; T) . (c) {Xn(t)2 }';;'= 1 is uniformly integrable for each t E [0, 1]. (d) E(Xn Ct)) -7 0 and E(Xn(t)2) -7 t as n -7 oo, each t E [0, 1]. Then Xn � B. o Be careful to note that w(.,8) in (b) is the modulus of continuity of (27. 14), not w' of (28.3). Proof Condition (b), and the fact that E l Xn(O) l -7 0 by (d), imply by 28.14 that the associated sequence of p.m.s is uniformly tight. Theorem 26.22 then implies that the latter sequence is compact, and so has one or more cluster points. To complete the proof, we show that all such cluster points must have the characteristics of Wiener measure, and hence that the sequence has this p.m. as its unique weak limit. Consider the properties the limiting p.m. must possess. Writing X for the random element, 28.14 also gives P(X E C) = 1 . Uniform integrability of Xn(t) 2, and hence of Xn(t), implies that E(X(t)) = 0 and E(X(t)2) = t, by 22.16. By condition (a) we P(Xn(tD - Xn(si)
E
Bi, i
=
The Functional Central Limit Theorem
480
may say that the increments X(t 1 ) - X(s 1 ), ... ,X(t,) - X(s,) are totally indepen dent according to (29.33). Specifically, consider increments X(ti) - X(si) and X(ti+ I ) - X(si+ I ) for the case where si+ I = ti + 1/m. By a.s. continuity, (29.34) so that asymptotic independence extends to contiguous increments. All the condi tions of 29.1 are therefore satisfied by X, and X - B. • Our aim is now to get a FCLT for partial-sum processes by linking up the asymptotic independence idea with our established characterization of a dependent increment process; that is to say, as a near-epoch dependent function of a mixing process. Making this connection is perhaps the biggest difficulty we still have to surmount. An approach comparable to the 'blocking' argument used in the CLTs of §24.4 is needed; and in the present context we can proceed by mapping an infinite sequence into [0, 1 ] and identifying the increments with asymptotically independent blocks of summands. This is a particularly elegant route to the result. However, an asymptotic martingale difference-type property of the type exploited in 24.6 is not going to work in our present approach to the problem. vv'hile the terms of a mixing process (of suitable size) can be 'blocked ' so that the blocks are asymptotically independent (more or less by definition of mixing), mixingale theory will not serve here; near-epoch dependent functions can be dealt with only by a direct approximation argument. What we shall show is that, if the difference between two stochastic processes is op (l) and one of them exhibits asymptotic independence, so must the other, in a sense to be defined. Near-epoch dependent functions can be approximated in the required way by their near-epoch conditional expectations, where the latter are functions of mixing variables. This result is established in the following lemma in terms of the independence of a pair of sequences, which in the application will be adjacent increments of a partial sum process.
29.5 Lemma (Wooldridge and White 1986: Lemma A.3) If { l}n} and {.-0n } are real
stochastic sequences, and (a) lJn - .-4n � 0, for j = 1,2; (b) lJn � Yj for j = 1 ,2; (c) for any AI> A2 E :B ,
then
P({Z!n E A J } ("') {Z2n as n � oo ;
E
A2 } )
�
P(Zln
E
A 1)P(Z2n
E
A 2)
(29.35)
(29.36)
aBj) = 0) for j = 1 ,2. Proof Considering (Zl n,�n) and (Yln o Y2n) as points of !R 2 with the Euclidean metric, (a) implies de((Zl n,Z2n) , (Yln, Y2n)) � 0, and by an application of 26.24, for all lj-continuity sets (sets Bj
{h \ 1 m nl 1 Pc hnth {7.
7� \
�
E 'B
r v. V� \
such that P(Yj
=
1 ')
E
Wri tP
FCLTs for Dependent Processes
481 (29.37)
where )ln is the measure associated with the element (Z1n,Zzn). If )l is the measure associated with (Y1 , Y2), define the marginal measures � by �(Bj) = P( Yj E Bj) ; then �(dBj) = 0 for j = 1 ,2 implies )l(d(B1 x B2 )) = 0, in view of the fact that d(B t X B z)
c
(29.38)
(dB1 x rR ) u (rR x ()B 2).
Applying (e) of 26.10, it follows from the weak convergence of the joint distribu tions that, for all Yj-continuity sets Bj,
P( { Ztn
E B J } n { Zzn E Bz } )
lln (B t X Bz)
=
---7
=
)l(B t x B2 )
P( { Yt
E B t } n { Yz E Bz }).
(29.39)
And by the weak convergence of both sets of marginal distributions it follows that, for these same Bj,
(29.40) This completes the proof, since the limits of the left-hand sides of (29.39) and (29.40) are the same by condition (c). • 29.3 The FCLT for NED Functions of Mixing Processes
From 29.4 to a general invariance principle for dependent sequences is only a short step, even though some of the details in the following version of the result are quite fiddly. This is basically the one given by Wooldridge and White (1988).
29.6 Theorem Let { Uni } be a zero-mean stochastic array, { Cni } an array of positive constants, and {Kn(t), n E IN } a sequence of integer-valued, right-continuous, increasing functions of t, with Kn(O) = 0 for all n, and Kn(t) - Kn(s) ---7 oo as n ---7 oo if t > s. Also define X�(t) = l:�� { t)Uni· If (a) E(Uni) = 0; (b) SUPi,n ll UnJCni llr < 00, for r > 2; (c) Uni is L2-NED of size -y, for 1 $: y $: 1 , with respect to the constants { end , on an array { Vnd which is a-mixing of size -r/(r - 2); 2 Kn (t+8) v (t 8) (d) sup limsup T < oo, where v�(t,8) = L c�i ; tE [O,l),CE (O, l -t]
(e)
max Cni
{
}
n�=
= O(Kn(1 )1- 1),
1 5:: i 5:: Kn (l)
E(X�(t)2) ---7 t as n then X� � B. o (f)
---7 oo,
i=Kn (t)+ l
where y is defined in (c);
for each t E [0,1 ] ;
Right-continuity of Kn(t) ensures that v�(t,8) ---7 0 as 8 ---7 0, if we agree that a sum is equal to zero whenever the lower limit exceeds the upper.
The Functional Central Limit Theorem
482
If y is set to 1 in condition (c), condition (e) can be omitted. It is important to emphasize that this statement of the assumptions, while technically correct, is somewhat misleading in that condition (c) is not the only constraint on the depen dence. In the leading cases discussed below, condition (f) will as a rule imply a L2 -NED size of 1 Theorem 29.6 is very general, and it may help in getting to grips with it to extract a more basic and easily comprehended set of sufficient conditions. What we might think of as the 'standard' case of the FCLT - that of convergence of a partial sum process to Wiener measure - corresponds to the case Kn(t) = [nt]. We will omit the K superscript to denote this case, writing Xn(t) = I}� flUni · The full conditions of the theorem allow different modes of convergence to be defined for various kinds of heterogeneous processes, and these issues are taken up again in §29.4 below. But it might be a good plan to focus initially on the case Xn(t), mentally making the required substitutions of [nt] for Kn(t) in the formulae. In particular, consider the case Uni = U/sn where n n nl ni (29.41) s� = E � Ui 2 = � cr7 + 2�- �- <Ji,i+m• -
.
( )
with cr1 = Var(UD and cri, i+m = Cov(Ui,Ui+m). Also, require that supdi Udl r < oo, r > 2. Then we may choose cni = 1 /sn , and with Kn(t) = [nt], condition 29.6(d) reduces to the requirement that s�ln > 0, uniformly in n. In this case, 29.6(e) is satisfied for y = l If in addition s�ln ---7 � < =, then E(Xn(t))2 = sfnt]ls� ---7 t and 29.6(f) also holds. These conclusions are summarized in the following corollary. 29.7 Corollary Let the sequence { Ud have mean zero, be uniformly L,-bounded, and L2-NED of size -! on an a-mixing process of size -r/(r - 2), and let Xn(t) = n 112II�P ui. If n-1(I'i=t Ui) 2 ---7 �. 0 < � < then Xn � �B. o Be careful to note that � =
-
oo,
FCLTs for Dependent Processes
483
It is instructive to compare the conditions of 29.6 with those of 24.6 and 24.7. Since X�( l ) � B(l ) - N(O, l), the two theorems give alternative sets of condi tions for the central limit theorem. Although they are stated in very different terms, conditions 29.6(d) and (e) clearly have a role analogous to 24.6(d). While 24.6 required a L2-NED size of -1, it was pointed out above how the same condition is generally enforced by 29.6(f). However, 29.6(f) itself has no counterpart in the CLT conditions. It is not clear how tough this restriction is, given our free choice of Kn, and this is a question we attempt to shed light on in §29.4. What is clear is that the convergence of the partial sum process Xn to B requires stronger conditions than are required just for the convergence of Xn(l ) to B( l), which is the CLT. Proof of 29.6 We
will establish that the conditions of 29.4 hold for the sequence
{X�} . Condition 29.4(d) holds directly, by the present conditions (a) and (f). Conditions (a), (b), and (c) imply by 17.6(i) that { Uni,c:fnd is a �-mixingale of size -� with respect to the scaling constants { end , where c:Jni = a( Vi-j' j ;::::: 0). In view of the uniform Lr-boundedness with r > 2, the array { U�/c�i } is uniformly integrable. If we let k = Kn(t) and m = Kn(t + 8) - Kn (t) for 8 E [0, 1 t), it
-
follows by 16.14 (which holds irrespective of shifts in the coordinate index) that the set
(29.42) is uniformly integrable, for any t and 8. Further, because of condition (d) we may assume there is a positive constant M < oo such that for any t E [0, 1) and any 8 E (0, 1 - t]), there exists N(t,8) ;::::: 1 with the property v�(t,8)/8 � M for n ;::::: N(t,8). Therefore the set
{
(Sn ,k+j - Snk) 2 �ax ,n 8 ]S m
;:::::
1,
N(t,8)
,
}
(29.43)
is also uniformly integrable. If N* = sup oN(t, 8) condition (d) implies that !V is finite. Taking the case t = 0 and hence k = 0 and m = Kn(8) in (29.43) (but then writing t in place of 8 for consistency of notation), we deduce uniform integrability of {X�(t) }';;'= 1 for any t e (0,1 ] (the summands from 1 to N(O,t) - 1 can be included by condition (b)). In other words, condition 29.4(c) holds for {X�}';;'= l · Note that !.?P( I X I > A.) � E(X2 1 1 JXJ,.A. }) for any square-integrable r.v. X. There fore, the uniform integrability of (29 .43) implies that for any 8 e (0, 1 ), any t � 1 8, and any E > 0 and 11 > 0, 3 A. > 0 large enough that for n ;::::: !V,
-
(
P max I Sn,k+r Snk I l �S m
;:::::
)
AVO
�
�, 11-
8/..,2
(29.44)
where k amd m are defined as before. The argument now follows similar lines to the proof of 27.14. For the case 8 = £214!..2, (29.44) implies
484
The Functional Central Limit Theorem sup
P (t�s �t+ll I X�(s) - X�(t) l ;:::: �£) :::; �118, n ;:::: N*,
O � t � 1 -8
sup
(29.45)
which is identical to (27.71). Condition 29.4(b) now follows by 27.13, as before. The final step is to show asymptotic independence. Whereas the theorem requires us to show that (29.33) holds for any r, since the argument is based on the mixing property and the linear separation of the increments, it will suffice to show independence for adjacent pairs of increments (i, i + 1) having ti < si+ l · The extension to the general case is easy in principle, though tedious to write out. Hence we consider, without loss of generality, the pair of variables
Kn(t;) L Uni• j i=Kn(Sj)+ l
=
1 and 2,
(29.46)
where 0 :::; s 1 < t1 < s2 < t2 :::; 1 . We cannot show asymptotic independence of Yl n and Y2n directly because the increment process need not be mixing, but there is an approximation argument direct from the NED property. Defining '!f�,j = cr(Vnj· ... , Vnk), the r. v. E( Ytn I '!f�7(�_!_)oo) is '!f�7(:!?co-measurable, and similarly E( Y2n I '!f";,K11(s2) ) is '!f";,K11(s2)-measurable. By assumption (c), sup --7
0 as n --7 oo
(29.47) whenever ft < S2 , where the events A include those of the form { E(Yl n I '!f�7(�oo) E E } for E E 13 , and similarly events B include those of the type { E( Y2n I '!f";,Kn(s2) ) E E } . These conditional expectations are asymptotically independent r.v.s, and it remains to show that Yl n and Y2n share the same property. We show that the conditions of 29.5 are satisfied when Ztn = E(Ytn I '!F�7(�oo) and Z2n = E( Y2n I '!f";, Kn(s2)). This is sufficient in view of the fact that the Y;-continuity sets are a convergence-determining class for the sequences { Y;n } , by
26.10(e). The argument of the preceding paragraph has already established condi tion 29.5(c). To show condition 29.5(a) we have the inequalities
Kn(fJ ) L 11 Un; - E(Uni l '!f�7(�oo) lh i=Kn(SJ )+ l Kn(t J ) :::; 2 L 1 1 Un; - E(Und '!f�7g�kn (t1 )) 11 2 i=Kn(SJ )+ l Kn (t J ) :::; 2 L CniVKn(t J )- i i=Kn (SJ )+ l
E II Ytn - E(Yl n l '!f�7(�oo) lh :::;
FCLTs for Dependent Processes � 2
485
Cni
max
Kn (SJ)
---7
0 as n
---7
(29.48)
oo,
where we have applied Minkowski's inequality, 10.28, and finally assumptions (c) and (e), and 2.27. This implies that Ytn - E(Ytn l ��7c:�}oo) � 0. Note that condition (d) implies that sup;Cni ---7 0 as n ---7 oo, so in the case y = 1, (e) can be dispensed with. By the same reasoning, Y2n - E(Y2n I �";Xn( 2) ) � 0 also. Since we have established that conditions 29.4(b) and 29.4(d) hold, we know that the sequence of measures associated with {X�} is uniformly tight, and so contains at least one convergent subsequence - { nb k E IN } , say - such that X�k � xK (say) as k ---7 oo where P(XK E C) 1. It follows that the continuous mapping theorem applies to the coordinate projections ntCXK) = XK(t), and we may assert that X�k(t) � XK(t). Confining attention to this subsequence, condition 29.5(b) is satisfied for the case Ynhj = X�(tj) - X�(sj). All the conditions of 29.5 have now been confirmed, so these increments are asymptotically independent in the sense of (29.36). But since this is true for every convergent subsequence {nk }, we can conclude that the weak limit of {X�} has asymptotically independent increments whenever it exists. All the conditions of 29.4 are therefore fulfilled by {X� } , and the proof i s complete. • s
=
It is possible to relax the moment conditions of this theorem if we substitute a uniform mixing condition for the strong mixing in condition (c). 29.9 Theorem Let { Un;} , { en; } , {Kn(t) } , and {X�} be defined as in 29.6; assume that conditions 29.6(a), (d), (e) and (f) hold, but replace conditions 29.6(b) and (c) by the following: (b') sup;,n I I Un/Cni I I < oo, for r 2 2, and { U�;/c�; } is uniformly integrable; (c') Uni is L2-NED of size -y, for ! � y � 1, with respect to constants { en; } , on an array { Vn; } which i s -mixing of size -r/2 ( r - 1 ), for r 2 2; then X� � B. o r
The uniform integrability stipulation in (b') is required only for the case r = 2, and the difference between this and the a-mixing case is that this value of r is permitted, corresponding to a -mixing size of -1. By 17.6(ii), { Un; } i s again an L2-mixingale of size -! i n this case. The same arguments as before establish that conditions 29.4(b),(c) and (d) hold; and, since a(m) � <j>(m), condition (29.47) remains valid so that asymptotic indepen dence also holds by the same arguments as before. • Proof
29.4 Transformed Brownian Motion
To develop a fully general theory of weak convergence of partial sum processes, permitting global heterogeneity of the increments with possibly trending moments,
486
The Functional Central Limit Theorem
and particularly to accommodate the multivariate case, we shall need to extend the class of limit processes beyond ordinary Brownian motion. The desired generaliza tion has already been introduced as example 27.8, but now we consider the theory of these processes a little more formally. A transformed (or variance-transformed) B rownian motion BTl will be defined as a stochastic process on [0, 1] with finite dimensional distributions given by
(29.49) where B is a Brownian motion and 11 is an increasing homeomorphism on [0, 1 ] with T\(0) = 0. The increments of this process, BTI(t) - BTI(s) for 0 $ t < s $ 1 , are therefore independent and Gaussian with mean 0 and variance T\(t) - 11(s). Since T\(1) must be finite, the condition 110) = 1 can be achieved by a trivial normal BTJ(t) - B(Tl(t)), t
E
[0, 1].
ization. To appreciate the relevance of these processes, consider, as was done in §27 .4 the characterization of B as the limit of a partial-sum process with independent Gaussian summands. Here we let the variance of the terms change with time. Sup pose Si N(O,cr7>, and let s� = E(Ii= rSi = Ii= rcrr. Also suppose the variance sequence {cn}i= 1 has the property that, for each t e [0, 1], -
S2[ n t] 2 Sn
-
---)
T\(t) as n
---)
oo,
(29.50)
where the limit function 11: [0, 1] 1---7 [0, 1 ] is continuous and strictly increasing everywhere. In this case, according to the definition of BTJ, we have
I�r =n lt ] Si D -- � BTI(t), Sn
(29.5 1 )
for each t E [0, 1 ] . What mode of evolution of the variances might satisfy (29.50), and give rise to this limiting result? In what we called, in § 13.2, the globally stationary case, where the sequence { cn}'i' is Cesaro-summable and the Cesaro limit is strictly positive, it is fairly easy to see that Tl(t) = t is the only possible limit for (29.50). This conclusion extends to any case where the variances are uniformly bounded and the limit exists; however, the fact that uniform boundedness of the variances is not sufficient is illustrated by 24.11. (Try evaluating the sequence in (29.50) for this case.) Alternatively, consider the example in 27.8. It may be surprising to find that (for the case - 1 < � < 0) the partial sums have a well defined limit process even when the Cesaro limit of the variances is 0. However, 27.8 is more general than it may at first appear. Define a continuous function on [O,oo) by g(v)
=
sfvl + (v - [V])afv+ l l ·
(29.52)
If s� satisfies (29.50), g is regularly varying at infinity according to 2.32. g _2 I /
..._
FCLTs for Dependent Processes
487
g(n) + g'(n) for integer n, and note that by 2.33 (which holds for right deriva tives) g' is also regularly varying. The variance process of 27.8 can be general
ized at most by the inclusion of a slowly varying component. This is the situation for the case of unweighted partial sums, as in (29.53), the one that probably has the greatest relevance for applications. But remember there are other ways to define the limit of a partial-sum process, using an array formulation. There need only exist a sequence {gn } j of strictly increasing func tions on the integers such that gn([nt])lgn(n) � 11(t), and the partial sums of the array { �ni } , where �ni - N(O,a�D and
(J�i = (gn(i) - gn(i - 1 ))/gn(n ),
(29.53)
will converge to BTl. And since such a sequence can always be generated by setting gn([nt]) = ll(t)an, where a n is any monotone positive real sequence, any desired member of the family Bll. can be constructed from Gaussian increments in this manner. The results obtained in §29.1 and §29.2 are now found to have generalizations from B to the class BTl. For 29.1 we have the following corollary. 29.10
Corollary
Let condition 29.1(a) be replaced by
(a') E(X(t)) = 0, E(X(t) 2)
Then X - BTl.
=
11(t), 0
:S
t
:S
1.
Define x* (t) = X(11 - 1 (t)) and apply 29.1 to X* . 11 - 1 (.) is continuous, so condition 29.1(b) continues to hold. Strict monotonicity ensures that if { t 1 , ,tm } define arbitrary non-overlapping intervals, so also do {11 - 1 (t1 ), ... , 11 - \tm) } , so 29.1(c) continues to hold. • Proof ..•
Similarly, for 29.4 there is the following corollary. 29.11 Corollary Let the conditions (a), (b), and (c) of 29.4 hold, and instead of condition 29.4(d) assume (d') E(Xn(t)) � 0 and E(Xn(t) 2) � 11(t) as n � oo, each t E [0, 1]. Then Xn _!4 BTl. Proof The
for X.
•
argument in the proof of 29.4 shows that the conditions of 29.10 hold
29.12 Example Let { Ui } 7 denote a sequence satisfying the conditions of 29. 7, with the extra stipulation that the L2-NED size is - 1 . Define the cadlag process
[nt] 1 (29.54) Xn(t) = -m L)Uj. an j= l This differs from the process n - ll 2rj�[ 1 l!jla only by the multiplication of the summands by constant weightsj/n, taking values between lin and 1 . The arguments of 29.6 show that conditions 29.4(a), (b), and (c) are satisfied for this case, and it remains to check 29.1l(d'). We show that
488
The Functional Central Limit Theorem
(29.55) Choose a monotone sequence { bn E IN } such that bn --7 oo but bnfn --7 0; bn = [n 112] will do. Putting Tn = [ntlbn ] for t E (0, 1] and n large enough that Tn 2 1 , we
have
(29.56) The terms in this sum have the decomposition
ibn (29.57) L }Uj = ibnSn i + bnS�i• j=(i- 1 )bn+ 1 in which Sni = 'L}�(i - l )bn+ l Uj , and S�i = 'L}�(i- l )bn+ l anij Uj, where a nij = (ibn -})Ibn E [0, 1]. The assumptions, and 17.7, imply that b� 1 E(S�i) --7 c:J2 for each i = 1 , ... ,rn. and that b� 1 I E(SniSnd l = O( l i - i' I - Hi) for B > 0. Neither limsupnb� 1 E(S�T) nor limsupnb� 1 1 E(SniS�D I exceed c:Jl, whereas b� 1 1 E(S�iS�i') I and b� 1 E(SniS�r) are of O( l i - i' l - 1 - 0). The same results apply to Sn,rn+ 1 and S�,rn+ 1 , the analogous terms corresponding to the residual sum in (29.56). Thus, consider E(Xn(t)2). Multiplying out the square of (29.56) after substitut ing (29 .57), we have three types of summand: those involving squares and products of the Sni ( ( rn + 1) 2 terms); those involving squares and products of the S�i ((rn + 1 )2 terms); and those involving products of S�i with Sni (2(rn + 1 )2 terms). The terms of the second type are each of O(b�n-3) = O(n - 1 r� 2), and this block vanishes asymptotically. The terms in the third block (given ibn = O(n)) are of O(bnn - 2) = o(r� 2), and hence this block also vanishes. This leaves the terms of the first type, and this block has the form
(29.58) Noting that rnbnln --7 t, applying standard summation formulae and taking the limit yields (29.55). Thus, according to 29.11, Xn(t) � BTl where ll(t) = �t3. o There is an intimate connection between the generalization from B to BTl and the style of the result in 29.6. The latter theorem does not establish the convergence of the partial sum sequences Xn(t) = 'L��flUni• either to B or to any other limit. In fact there are two distinct possibilities. In the first, Kn(t)ln converges to 11- \ t) for t E [0, 1 ], for some 11 as in (29 .49). If this holds, there is no loss of generality in setting Kn(t) [n1l - 1 (t)], and under condition 29.6(f) this has the implication =
FCLTs for Dependent Processes
489 (29.59)
In other words, Xn � Brt by 29.11. Example 29.8 is a case in point, for which Tl(t) = t1 +P . In these cases the convergence of the process {X�} to Brownian motion can be also represented as the convergence of the partial sum process (Xn} to Brt. On the other hand, it is possible that no such 11 exists, and the partial sums have no weak limit, as the following case demonstrates. 29.13
Example
t
Let a sequence { Ud have the property
ui
=
22k � i
a.s.,
<
22k+l , k
=
0, 1 ,2,3, ...
(O,cr2), otherwise. Thus, U1 = 0, U4 = Us = U6 = U7 = 0, U1 6 = U1 7 = ... = U31 = 0, and so forth. Let Uni = U/sn as before, and put Xn(t) = L��flUni· Then, observe that for ! < t � 1, 1 when n = 2k - 1 for k even, Xn(t) = Xn(!) with probability 0 when n = 2k - 1 for k odd.
{
Since this 'cycling ' in the behaviour of Xn is present however large n is, Xn does not possess a limit in distribution. However, let Kn(t) be the integer that satisfies Kn (t)
L 1(22k - I � i i=2
<
2 2k, k E rN)
=
(29.60)
[nt],
where 1 (.) is the indicator function, equal to 1 when i is in the indicated range and 0 otherwise. With this arrangement, n counts the actual number of increments in the sum, while Kn(l ) counts the nominal number, including the zeros; Kn+ l (1) = Kn(l ) + 1 except when Kn(1) = 22k, in which case Kn+ 1 (1) 2 2k+l . The conditions of 29.6 are satisfied with Tl(t) = t, and X� � B. o =
Incidentally, since condition 29.6(f) imposes
(29.61) one might expect that KnO )In � 1 . The last example shows that this i s not neces sarily the case. To get multivariate versions of 29.6 and 29.9, as we undertake in the next section, it will be necessary to restate these theorems in a slightly more general form, following the lines of 29.10 and 29.11. 29.14 Corollary Let conditions 29.6(a), (b), (c), (d), and (e) hold, and replace 29.6(f) by (f') E(X�(t) 2) � Tl(t) as n � =, for each t E [0, 1]; then X� � Brt. The same modification i n 29.9 leads to the same result. o
The Functional Central Limit Theorem
490
The main practical reason why this extension is needed is because we shall wish to specify Kn in advance, rather than tailor it to a particular process; the same choice will need to work for a range of different processes - to be precise, for every linear combination of a vector of processes, for each of which a compatible ll will need to exist. However, the fact that partial sum processes may converge to limits different from simple Brownian motion may be of interest for its own sake, so that 29.14 (with Kn (t) = [nt]) becomes the more appropriate form of the FCLT. See Theorem 30.2 below for a case in point. 29.5 The Multivariate Case
To extend the FCLT to vector processes requires an approach similar in principle to that of §27.7. However, the results of this chapter have so far been obtained, unlike those of §27, without explicit derivation of the finite-dimensional distri butions. It has not been necessary to use the results of §28.4 at any point. Because we have to rely on the Cramer-Wold device to go from univariate to multi variate limits, it is now necessary to consider the finite dimensional sets of D, and indeed to generalize the results of §28.4. This section draws on Phillips and Durlauf ( 1986). We define Dm as the space of m- vectors of cadlag functions, which we endow with the metric max { d8(xj,yj) } , (29.62) l $;j$ m where d8 is the Billingsley metric as before. dj] induces the product topology, and the separability of (D,ds) implies both separability of (Dm,dj]) and also that 'B'B = 'Bv x 'Bv x ... x 'Bv is the Borel field of (Dm,d'Jl). Also let d'J](x,y)
=
(29.63) be the finite-dimensional sets of Dm, the field generated from the product of m copies oflfv. The following theorem extends 28.10 in a way which closely parallels the extension of 27.6 to 27.16. 29.15 Theorem Jf'B is a determining class for (Dm, 'B'B). Proof An open sphere in 'B'B is S(x, a)
=
=
{y E Dm : d'Jl(x,y)
{Y
E
<
a}
D"': 3 A E A s. t. II A II
< u, ��:':,'�PI
Define, for { tl ... ,tk E [0,1 ] , k E IN } ,
yj(t) xj(A,(t)) I -
< u}
(29. 64)
FeLTs for Dependent Processes
{
H,(x,a.) = y
E
Dm :
3 '/.. E
A s.t. ll '/.. 1 1
49 1
< a,
max max ! yj(t;) - xj(A.(tD) I
l�j � m l�i� k
}
< a
E
Jf'Jj.
(29.65)
It follows by direct generalization of the argument of 28.10 that, for any x E Dm and r > 0,
S(x,r) Hence, 'B'Jj
�
=
U n Hk(x, r - 1/n) n=l k=l
cr(Jf'Jj) as required.
'
E
cr(Jf'Jj).
(29.66)
•
The following can be thought of as a generic multivariate convergence theorem, in that the weak limit specified need only be a.s. continuous. It is not necessar ily Bm. 29.16 Theorem29 Let Xn E Dm be an m-vector of random elements. Xn � X, where P(X E e m) 1 , iff A'Xn � A'X for every fixed A with A'A = 1 . =
If x1 E D, j = l , ... ,m, 2..}= 1 A.1x1 possesses a left limit and is continuous on the right, since for t E [0, 1 ) , Proof
m
lim L �xj(t + £) £.!-0 }= I
=
m ."L A.1 limx1(t + £)
=
m ."L A.1x1(t).
(29.67)
£.!-0 }=1 Hence, x = (x1, ,xm)' E Dm implies A'x E D. It follows that A'Xn is a random element of D. It is clear similarly that X E e m implies A x E e, and hence P(A'X E C) = 1 . To prove sufficiency, let J.l� denote the sequence of measures corresponding to A'Xm and assume J.l� :::::} J.l"-. Fix t1 , ,tk E [0, 1], for finite k. Noting that 1t��, ... , 1k(B) n D E JfD c 'BD for each B E 'Bk (see 28.10), the projections are measurable and v� = ).l�1t��, . ,tk is a measure on (IR k,'Bk). Although 1t11, . ,tk is not continuous (see the discussion in §28.2), the stipulation J.l"'(C) = 1 implies that the discontinuity points have J.l"'-measure 0, and hence v� :::::} v"- by the contin uous mapping theorem (26.13). Since v� is the p.m. of a k-vector of r.v.s, and A is arbitrary, the Cramer-Wold theorem (25.5) implies that Vn :::::} v, where Vn = J.ln 1t �: .... . tk is the p.m. of an mk-vector, the distribution of Xn(ti), ... ,Xn(tk). Since t1, ... .tk are arbitrary, the finite dimensional distributions of Xn converge. To complete the proof of sufficiency, we must show that { J.ln} is uniformly tight. Choose A = e1, the vector with 1 in position j and 0 elsewhere, to show that XnJ � Xi; this means the marginal p.m.s are uniformly tight, and so {J.ln} is uni formly tight by 26.23. Then Xn � X by 29.15. To show necessity, on the other hand, simply apply the continuous mapping theo rem to the continuous functional h(x) = A'x. • }= I
.••
'
••.
. .
..
492
The Functional Central Limit Theorem
Although this is a general result, note the importance of the requirement j..!( C) = 1 . It is easy to devise a counter-example where this condition is violated, in which case convergence fails. 29.17 Example Suppose j.l is the p.m. on (D,'BD) which assigns probability 1 to elements x with
x(t) = •
{0,
{0,
t
Also, let l-ln assign probability 1 to elements with
x(t) =
t < ! +k . 1, t � �+ k
If X1 n - j.l all n, and X2n - l-lm then clearly (XJn,X2n) � (X1 ,X2) = (x,x) w.p. l . But X2n - Xl n is equal w.p. 1 to the function in (28. 12), which does not converge in (D ,ds). o
Now we are ready to state the main result. Let {B11(Q) } denote the family of m x 1 vector transformed-Brownian motion processes on [0, 1], whose members are defined by a vector of homeomorphisms (11 \ ... ,11P)' and a covariance matrix Q (m x m). If X - B11(Q ) , the finite-dimensional distributions of X are jointly Gaussian with independent increments, zero mean, and
E(X(t)X(t)') = DH(t)D', where D (m xp) has rank p, DD' = Q and H(t) = diag { 111 (t), .... , 11P (t) } , with H(1 ) = lp . In other words, the jth element of X may be expressed as a linear combination 2J= 1 djkzk where Z = (ZJ , . . . ,Zp)' is a vector of independent processes with Zk - B11k . With p < m, a singular limit is possible. Note, Z = (D'Df 1D'X. 29.18 Theorem Let { Uni } be an array of zero-mean stochastic m-vectors. For an increasing, integer-valued right-continuous function Kn( . ) define X� = L:f�ft)Uni' and suppose that (a) For each fixed m-vector .i\ satisfying .i\'A = 1 there exists a scalar array { c�i} and a homeomorphism 11"' on 1 ] with 11\ ) = and 11\ 1 ) = 1 , such that the conditions of 29.14 hold for the arrays { .i\'Und and { c�d , with respect to 11"'· (b) Letting H(t) be defined as above with elements � denoting 11"' for the case .i\ = ej (ith column of the identity matrix), for j = 1 , .. . ,p,
[0,
E(X�(t)X�(t)') Then X� � X - B11(0). o
---7
DH(t)D' as n
00
-7
oo
.
(29.68)
A point already indicated above is that under these conditions Kn must be the same function for each .i\, and must satisfy condition 29.6(d) in each case as well as 29.14(f'). The condition 11\ 0 1 can always be achieved by a renormalization, =
493
FCLTs for Dependent Processes
and simply requires that differences in scale of the vector elements be absorbed in the matrix D. Proof Consider first the case m = p and D = Im . Condition (a) is sufficient under
A.
The 29.14 for A.'X� � BrJ''-, where this limit is a.s. continuous, for each convergence of the joint distribution of X� now follows by 29. 16. The form of the marginal distributions is implied by 29.14, independence of the vector elements following from condition (b). If D t:. Im , the theorem can be applied to the array { (D'Df 1D'Und , for which the limit in (29.68) is H(t) as before. Since linear transformations preserve Gaussianity, the general conclusion follows by the continuous mapping theorem. • Theorem 29.18 is a highly general result, and the interest lies in establishing how the conditions might come to be satisfied in practice. While we permit Kn(.) t:. [nt] to allow cases like 29.13, these are of relatively small importance, and it will simplify the discussion if it is conducted in terms of the case Kn(t) = [nt] . The K superscript can then be dropped and Xn becomes the vector of ordinary partial sum processes. Even then, the result has considerable generality thanks to the array formul ation, and its interpretation requires care. We can invoke the decomposition Q = 'i:.. + A + A',
(29.69)
n 'F. = lim L E(UniU�i), n--7oo i=l
(29.70)
where
n A = lim L n--7oo i=2
i- 1 _L E(Un,i- m U� i). m= I
(29.71 )
But it should be observed that the conditions of 29.18 do not explicitly impose summability of the covariances. While 'F. and A are finite by construction, without summability it would be possible to have 'F. = 0. We noted previously that condition 29.6(f) appeared to impose summability, but it remains a conjecture that the more general 29.14(f') must always do the same. This conjecture is supported by the need for a summability condition in 24.6, whose conclusion must hold whenever 29.14 holds for the partial sums, but is yet to be demonstrated. Replacing 29.14(f') with more primitive conditions on the increment processes would be a useful extension of the present results, but would probably be difficult at the present level of generality. Note that Q, not 'F., is the covariance of the process, notwithstanding the fact that B,1(Q) is a process with independent increments. The condition Q = I, such that the elements of B11 are independent, neither implies nor is implied by the contemporaneous uncorrelatedness of the Un i· While uncorrelatedness at all lags is sufficient, with 'F. = I and A = 0, it is important to note that when the elements of Uni are related arbitrarily (contemporaneously and/or with a lag) there always
494
The Functional Central Limit Theorem
exists a linear transformation (D'D) - 1D', under which the elements of the limiting process are independent of one another. As we did for the scalar case, we review some of the simplest sets of sufficient conditions. Let Uni = S� 112Ui where Sn = E(Lt= l UiUi) . For this choice, D = Im is imposed automatically. lf { Ud is uniformly Lr-bounded, choose c�i = (A.'S� 1 A.) 112 • Then A.'Un/c�i is a linear combination of the Ui with weights summing to 1 and supi n ii A.'Unilc�d l r < oo holds for any A, so that conditions (a) and (b) of 29.6 are satisfied. The multivariate analogue of 29.7 is then easily obtained:
,
29.19 Corollary Let { Vi} be a zero-mean, uniformly Lr-bounded m-vectorsequence, with each element L2-NED of size -1 on an a-mixing process of size - rl(r - 2); and assume n - 1 Sn ----7 Q < If Xn(t) = n 1 12I\� fl ui , then Xn � B(Q). o
=
-
.
Compare this formulation with (27.82), and as with 29.7, note the important difference from the martingale difference case, with Q taking the place of �- It is also worth reiterating how the statement of conditions is potentially mislead ing, given that the last one is typically hard to fulfil without a NED size of 1 Somewhat trickier is the case of trending moments, where different elements of the vector may even be subject to different trends. The discussion here will have some close parallels with §24.4. Diagonalize Sn as -
.
(29.72) where Mn is diagonal of rank m, and CnC� = C�Cn = lm. Assume, to fix ideas, that Cn ----7 C, which can be thought of as imposing a form of global stationarity on the cross-correlations. Then S � 1 12 = Kn 112C� and E(X"(t)X"(t)' )
=
:::::
M;; 112C;E
(f, ; )
Ui C"M; w
U,
M� 112M!ntJKn 112
----7
(29.73)
H(t),
where the approximation is got by setting Cn to C, and can be made as good as desired by taking n large enough. The status of conditions 29.18(a) and (b) must be checked by evaluating the elements of H in (29. 73 ). An example is the best way to illustrate the possibilities. 29.20
Example
Let
m =
0]
2, and assume E(UPi-m)
=
0 for m i: 0, but let
C' (29.74) i �2 for fixed C. Then, Mn = diag{n�1 + 1 , n �z+ l } and H(t) = diag { t� 1 + 1 �z+ l } . For � 1 , �2 > -1 , �1 +1 and �z+ l are increasing homeomorphisms on the unit square, and
,
condition 29.18(b) is satisfied. It remains to check 29.18(a). Condition 29.14(f') holds for the array { A.'Und with respect to
T) '\t)
=
A.rt� 1 +1
+
A.�t�2+ \
(29.75)
FCLTs for Dependent Processes
495
which, since AI + /0 = 1 , is an increasing homeomorphism on the unit square with 11 \1) = 1 and 11\0) = 0 whenever � 1 , �2 > - 1 . Assuming that 29.6(b) holds for 112 ip 1 ip2 2 2 t.. C ni = I IA Un db = A 1 n ' +l + A2 n (29.76) , P P2+ 1
, ,
[ ( ) ( )]
we can check conditions 29.6(d) and 29.6(e). The latter holds for y = 1· We also find that
/..y((t + 8)p, + l _ tP ' + l) + /..�((t + 8)Pz+l _ tP2+ 1 ) 8 (29.77) ---7 Af( � l + l ) tp , + MC �2 + l )tp2 < oo as 8 ---7 0, where the approximation is as good as desired with large enough n . Condition 29.6(d) is satisfied, and hence 29.18(a) holds. This completes the veri fication of the conditions. o
30 Weak Convergence to S tochastic Integrals
30. 1 Weak Limit Results for Random Functionals
The main task of this chapter is to treat an important corollary to the functional central limit theorem: convergence of a particular class of partial sums to a limit distribution which can be identified with a stochastic integral with respect to Brownian motion, or another Gaussian process. But before embarking on this topic, we first review another class of results involving integrals, superficially similar to what follows, but actually different and rather more straightforward. There will, in fact, turn out to be an unexpected correspondence in certain cases between the results obtained by each approach. For a probability space (Q,C!f,P), we are familiar with the notion of a measurable mapping f : Q f---7 C, where C is C[O, I J as usual. We now want to extend measurability to functionals on C, and especially to integrals. Let F(t) = gfds: C f---7 IR denote the ordinary Riemann integral of f over [O,t] . 30.1
Theorem
If f is �/:Be-measurable, the composite mapping
F(t)0f:
Q
f---7
is �/:B-measurable for t E
IR [0 , 1 ] .
Proof It i s sufficient to show that F(t) i s continuous o n (C, du). This follows since, for G(t) = J�ds, g E C, and 0 ::::; t ::::; 1 , t (30. 1) I F(t) - G(t) l ::::; l f - g l ds ::::; sup l f(s) - g(s) l . •
J0
s
This shows that F(t) is a random variable for any t. Now, writing F: C f---7 C as the mapping whose range is the set of functions assuming the values F(t) at t, it can further be shown that F is a new random function whose distribution is uniquely found by extension from the finite-dimensional distributions, just as for f. The same reasoning extends to F2(t) = f(t'ds , to P, and so on. Other important examples of measurable functionals under du include the extrema, supt { f(t) } and inf { f(t) } . As a simple example of technique, here is an ingenious argument which shows that if B is standard Brownian motion, supt { B(t) } has the half-normal distribution (see (8.26)). Consider the partial sum process Sn t
Weak Convergence to Stochastic Integrals =
497
I'i= 1 �i· where the �i are independent binary r.v.s with P(�i = 1) = P(�i = -1) =
1 · Straightforward enumeration of the sample space shows that
(
P max Si � an 1$ i $ n
)
=
2P(Sn > an) + P(Sn
=
an),
(30.2)
for any an � 0 (see Billingsley 1968: ch. 2. 10). Since this holds for any n, on putting an = Yiia the FCLT implies that the limiting case of (30.2) applies to B, in respect of any constant a � 0. This also defines the limit in distribution of supr{ Xn(t) } for every process Xn satisfying the conditions of 29.4. This is a neat demonstration of the method of extending a limit result from a special case to a general case, using an invariance principle. Limit results for the integrals (i.e, sample means) of partial-sum processes, or continuous functions thereof, are obtained by the straightforward method of teaming a functional central limit theorem with the continuous mapping theorem.
Sno = 0 and Snj = IJ= 1 Uni for j = 1 , ... ,n - 1. If Xn(t) = Sn,[nt]• assume that Xn � BTJ (see 29.1 1) . For any continuous function g: rR � rR , J1 g(B )dt. 1- n - 1 (30.3) � _Lg(Sn) TJ n j =v o 30.2 Theorem Let
·
Proof Formally,
g(Snj)
=
__{\
f(j+1)/ndt nf(j+1)/ng(Xn(t))dt. j!n j/n n - 1 f(j+1)/n f 1 g(Xn(t))dt. g(Xn(t))dt L o 1/n
ng(Snj)
Hence,
n-1 -n1 L g(Snj) ._0 }-
=
.__{\
} �v
(30.4)
=
(30.5)
=
.
Since fbg(x(t) )dt, x E C, is a continuous mapping from C to lR, the result follows by the continuous mapping theorem (26.13). •
Note how g(Snn) is omitted from these sums in accordance with the convention that elements of D are right-continuous. Since the limit process is continuous almost surely, its inclusion would change nothing material. These results illustrate the importance of having 29.14 (with Kn (t) = [nt]) as an alternative to 29.6 as a representation of the invariance principle. The processes X�(t) are defined in [0, 1 ] , and cannot be mapped onto the integers l , ... ,n by setting t = j/n. There is no obvious way of defining the sample average of g(X�) in the manner of (30.3), and for this purpose the partial-sum process Xn with limit BTJ has no substitute. The leading cases of g(.) include the identity function, and the square. For the · former case, 30.2 should be compared with 29.12. Observe that I,}:lsnj = Ii: } (n - i)Uni· If Uni = n-1 12 Un - /CJ, reversing the order of summation in 29.12 shows, in effect, that n I,j: i s j � BTJ(l ), for the case ll(t) = !f. In other words, fl/Jdt - N(O, !).
-1
� 1 -
_
__ _
___ _
L
n
- � --- -- 1 -
- -- -
� � -- -
- -- - -
£ _ __ .._1_ _
£.. _ _
_
4,� � - - 1
rl n2 .J"'-
.LL -
The Functional Central Limit Theorem
498
limit for the case g(.) = (i. These limit results do not generally yield close1 formulae for the c.d.f., so there are no exact tabulations of the percentag' points such as we have for the Gaussian case. Their main practical value is i1 letting us know that the limits exist. Applications in statistical inferenc' usually involve estimating the percentiles of the distributions by Monte Carl< simulation; in other words, tabulating random, variates generated as the average of large but finite samples of g evaluated at a Gaussian drawing, to approximat1 integrals of g(B11). Knowledge of the weak convergence assures us that such approx imations can be made as close as desired by taking n large enough. Given a basic repertoire of limit results, it is not difficult to find th1 distributions of other limit processes and random variables in the same manner. T< take a simple case, if { Vi} is a sequence with constant variance c?, an< n - 112Srn1]fa � B(t) where Srntl = Il�f l vi, we can deduce from the continuou: mapping theorem that the partial sums of the sample mean deviations converge t< the Brownian bridge; i.e.,
(30.6 where Vn n - 1 Vi. On the other hand, if we express the partial sum proces� itself in mean deviations, s1 - Sn where Sn = n- 1 I.'j:6s1, we find convergenc( according to =
Ii=1
1
-
-----vz (Srntl - Sn)
an
D --==--7
f1
B(t) - Bds.
(30.7:
o
The limit process on the right-hand side of (30.7) is the de-meaned B rowniar motion. One must be careful to distinguish the last two cases. The integral of th<: latter over [0,1 ] is identically zero. The mean square of the mean deviatiom converges similarly, according to
(3o.s: There is also an easy generalization of these results to vector processes. The following is the vector counterp art of the leading cases of 30.2, the details oj whose proof the reader can readily supply. 30.3
Corollary
Let { Uni } satisfy the conditions of 29.18. If SnJ
1
D I 1 n- 1 - 2_ SnJ --==--7 B11dt, n ].= I Io
n- 1
fl
-2:, n SnjS�j � BllB�dt. . ]=
1
o
=
E=l Uni• then (30.9)
D
(30. 10j
Weak Convergence to Stochastic Integrals
499
The same approach of applying the
Note in particular that for B, the m-dimensional standard Brownian motion,
fl/Jdt N(O, 1Im)·
continuous mapping theorem yields an important result involving the product of the partial-sum process with its increment. The limits obtained do not appear at first sight to involve stochastic integrals, although there will tum out to be an intimate connection. 30.4 Theorem Let the assumptions of 30.2 hold, with 11 0 ) = 1. Then
-
n- 1 L Snj Un, j+ l � 1(x2( 1 ) cr2) , j= l where cf = limn =n- 1 I'i= l �i · Proof Letting Snj = L{= l Uni = Sn, j 1 + Unj• note the identity Sn2,j+ l = Snj2 + 2SnjUn, j+ l + Un2, j+l ·
-7
-
-
Summing from 0 to n 1 , setting Sno
2 Snn or
=
(30. 1 1)
=
(30. 12)
0, yields
n n- 1 n- 1 " " " 2 2 2 L (Sn , j+ l - S.'lj) = 2L SnjUn , j+ ! + L Unj • j= l j=O j= l
(30. 1 3)
SnjUn,j+l = 1 (s�n - i U�j). � j= j=
(30. 14) l Under the assumptions, Snn � B11 (1) N(O, 1 ) and I'J=I U�i � cf. The result l
follows on applying the continuous mapping theorem and 22.14(i).
•
This is an unexpectedly universal result, for it actually does not depend on the FCLT at all for its validity. It is true so long as { Und satisfies the conditions for a CLT. Since cr2 = 1 2/.. where /.. = limn-7=Lt=2L�:lECUn, i- m UnD, the left-hand side of (30. 1 1) has a mean of zero in the limit if and only if the sequence { Und is serially uncorrelated. There is again a generalization to the vector case, although only in a restricted sense. Let Snj = I{= I Uni• and then generalizing (30. 12) we have the identity
-
(30. 15) Summing and taking limits in the same manner as before leads to the following result. 30.5 Theorem Let { Un d satisfy the conditions of 29.18. Then
n n � + n L S jU ,j+ l L Un, j+I S�j � BTJ (l)B11 (1)' - :E j=l j=l - B(l )B(l)' - :E. o
(30. 16)
Details of the nroof are left to the reader. The oeculiaritv of this result is
500
The Functional Central Limit Theorem
that it does not lead to a limiting distribution for the stochastic matrix n- 1 L.j:lsnjU�, j+ l · This must be obtained by an entirely different approach, which is explored in §30.4. 30.2 S tochastic Processes
m
Continuous Time
To understand how stochastic integrals are constructed requires some additional theory for continuous stochastic processes on [0, 1]. Much of this material is a natural analogue of the results for random sequences studied in Part lll. A filtration is a collection { :¥(t), t E [0, 1 ] } of cr-subfields of events in a complete probability space (Q.,:¥,P) with the property
:¥(t)
c
:¥(s) when t � s. The filtration { :¥(t) } is said to be right-continuous if :¥(t) = :¥(t+) n :¥(s).
(30. 1 7) (30.1 8)
=
s>t A stochastic process X = {X(t), t E [0, 1 ] } is said to be adapted to { :¥(t) } if X(t) is :¥(f)-measurable for each t (compare § 1 5 . 1 ) . Note that right-continuity of the filtration is not the same thing as right-continuity of X, but if X E D (which will be the case in all our examples) adaptation of X(t) to :¥(t) implies adapta tion to :¥(t+ ) and there is typically no loss of generality in assuming (30. 1 8). A stronger notion of measurability is needed for defining stochastic integrals of the X process. {X(t) } is said to be progressively measurable with respect to { :¥(t)} if the mappings X(.,.): Q. x [O,t] 1---7 lR are :¥(t) ® :B[O,tl/:8-measurable, for each t E [0, 1]. Every progressively measurable process is adapted (just consider the rectangles E x [O,t] for E E c:f(t)) but the converse is not always true; with arbitrary functions, measurability problems can arise. However, we do have the following result. 30.6 Theorem An adapted cadlag process is progressively measurable. Proof For an adapted process X E D and any t E
on [O,t] :
(0, 1], define the simple process
X(n) (W,s) = X(CO , Tnk), s E [2 -n (k - 1), Tnk), k = 1 , ... ,[2n t], (30. 19) with X(n) (w,t) = X(w,t). X(n) need not be adapted, but it is a right-continuous function on Q. x [O,t]. If Ek = { ro: X(w, 2 -nk) � x} E :¥(t), then Ax = { (ro,s): X(n) (W,s) � x} [2 -n (k - 1), 2-nk) x £k u {t} x E[znt]+ l =
(y
)
(30.20)
is a finite union of measurable rectangles, and so Ax E :¥(t) ® :Bro, t] · This is true for each x E !R , and hence X(n) is :¥(t) ® :B[O,tl/:8-measurable. Fix ro and s,
Weak Convergence to Stochastic Integrals
501
and note that for each n
oo
X(n)(ffi,s) = X(co,u), (30.21 ) where u > s, and u J. s as n ---7 Since X(co,u ) ---7 X(co,s) by right-continuity, it follows that Xcn) (ffi,s) ---7 X(co,s) everywhere on n x [O,t] and hence X is r:i (t) ® 13[0,tJ/:B-measurable (apply 3.26). This holds for any t, and the theorem
follows.
.
•
Since we are dealing with time as a continuum, we can think of the moment at which some event in the evolution of X occurs as a real random variable. For example, the first time X(t) exceeds some positive constant M in absolute value is
T(co)
inf { t: I X(co,t) l > M}.
=
! E [O, l ]
(30.22)
T( co) is called a stopping time of the filtration { r:J' (t), t E [0, 1 ] } if { co: T(co) $ t } E r:J(t) (compare § 1 5 .2). It is a simple exercise to show that, if X is progres sively measurable, so is the stopped process xT where XT(t) = X(t A 1). Let X E D, and let X(t) be an r:J'(t)-measurable r.v. for each t E [0, 1]. The adapted pair {X(t),�(t) } is said to be a martingale in continuous time if sup E I X(t) l < t
E(X(s) I �(t))
=
=,
X(t) a.s. [P], 0 $ t $ s $ 1.
(30.23) (30.24)
It is called a semimartingale (sub- or super-) if (30.23) plus one of the inequal ities
E(X(s) l r:f(t))
{�} X(t) a.s.[P], 0 $ t $ s $ 1
(30.25)
hold. One way to generate a continuous-time martingale is by mapping a discrete time martingale {Sj,�j}l into [0, 1], rather in the manner of (27.55) and (27.56). If we let X(t) = S[nt]+ J , this is a right-continuous simple function which jumps at the points where [nt] = nt. It is �(f)-measurable where r:J(t) = r:i [ntJ+l , and the collection { r:J' (t), 0 $ t < 1 } is right-continuous. Properties of the martingale can often be generalized from the discrete case. The following result extends the maximal inequalities of 15.14 and 15.15. 30.7 Theorem Let
(
{ (X(t),�(t)) t E [0, 1 ] } be a martingale. Then
E) $ E I XE�) I P, p � 1 , (Kolmogorov inequality). (ii) E ( sup I X(s) IP) $ �� rE I X(t) IP, p > 1 , (Doob inequality). 1 (i) P sup I X(s) I > 5 E [ O, t] S E [O,t]
Proof These inequalities hold if they hold for the supremum over the interval $[0,t)$, noting in (i) that the case $s = t$ is just the Chebyshev inequality. Given a discrete martingale $\{S_k,\mathcal{F}_k\}_{k=1}^m$ with $m = [2^n t]$, define a continuous-time martingale $X_{(n)}$ on $[0,t]$ as in the previous paragraph, by setting $X_{(n)}(s) = S_{[2^n s]+1}$ for $s \in [0,t)$, with $X_{(n)}(t) = X_{(n)}(t-) = S_{[2^n t]}$. The inequalities hold for $X_{(n)}$ by 15.14 and 15.15, noting that
$$\sup_{s\in[0,t)}|X_{(n)}(s)|^p = \max_{1\le k\le m}|S_k|^p \tag{30.26}$$
for $p \ge 1$. Now, given an arbitrary continuous-time martingale $\{X(t),\mathcal{F}(t)\}$, a discrete martingale is defined by setting
$$S_k = X(2^{-n}k), \quad \mathcal{F}_k = \mathcal{F}(2^{-n}k), \quad k = 1,\dots,[2^n t]. \tag{30.27}$$
For this case we have $X_{(n)}(s) = X(u)$ for $u = 2^{-n}([2^n s]+1)$, so that $u \downarrow s$ as $n \to \infty$. Hence $X_{(n)}(s) \to X(s)$ for $s \in [0,t)$, by right continuity. ∎

The class of martingale processes we shall be mainly concerned with satisfy two extra conditions: almost sure continuity ($P(X \in C) = 1$), and square-integrability. A martingale $X$ is said to be square-integrable if $E(X(t)^2) < \infty$ for each $t \in [0,1]$. For such processes, the inequality
$$E(X(s)^2 \mid \mathcal{F}(t)) \ge X(t)^2 \tag{30.28}$$
holds a.s.$[P]$ for $s \ge t$ in view of (30.24), and it follows that $X^2$ is a submartingale. The Doob-Meyer (DM) decomposition of an integrable submartingale, when it exists, is the unique decomposition
$$X(t) = M(t) + A(t), \tag{30.29}$$
where $M$ is a martingale and $A$ an integrable increasing process. The DM decomposition has been shown to exist, with $M$ uniformly integrable, if the set $\{X(\tau),\ \tau \in \mathcal{T}\}$ is uniformly integrable, where $\mathcal{T}$ denotes the set of stopping times of $\{\mathcal{F}(t)\}$ (see e.g. Karatzas and Shreve 1988: th. 4.10). In particular, suppose there exists for a martingale $\{X(t),\mathcal{F}(t)\}$ an increasing, adapted stochastic process $\{\langle X\rangle(t),\mathcal{F}(t)\}$ on $[0,1]$, whose conditionally expected variations match those of $X^2$ almost surely; that is,
$$E(\langle X\rangle(s) \mid \mathcal{F}(t)) - \langle X\rangle(t) = E(X(s)^2 \mid \mathcal{F}(t)) - X(t)^2 \quad \text{a.s.}[P] \tag{30.30}$$
for $s \ge t$. Rearranging (30.30) gives
$$E(X(s)^2 - \langle X\rangle(s) \mid \mathcal{F}(t)) = X(t)^2 - \langle X\rangle(t) \quad \text{a.s.}[P], \tag{30.31}$$
which shows that $\{X(t)^2 - \langle X\rangle(t),\mathcal{F}(t)\}$ is a martingale, and this process accordingly defines the DM decomposition of $X^2$. An increasing adapted process $\{\langle X\rangle(t),\mathcal{F}(t)\}$ satisfying (30.30), which is unique a.s. if it exists, is called the quadratic variation process of $X$.

30.8 Example The Brownian motion process $B$ is a square-integrable martingale with respect to the filtration $\mathcal{F}(t) = \sigma(B(s),\ s \le t)$. The martingale property is an obvious consequence of the independence of the increments of $B$. A special feature of $B$ is that the quadratic variation process is deterministic. Definition 27.7 implies that, for $s \ge t$,
$$E(B(s)^2 \mid \mathcal{F}(t)) - B(t)^2 = E([B(s) - B(t)]^2 \mid \mathcal{F}(t)) = s - t \quad \text{a.s.}[P], \tag{30.32}$$
and rearrangement of the equality shows that $B(t)^2 - t$ is a martingale; that is, $\langle B\rangle(t) = t$. □

Two additional pieces of terminology often arise in this context. A Markov process is an adapted process $\{X(t),\mathcal{F}(t)\}$ having the property
$$P(X(t+s) \in A \mid \mathcal{F}(t)) = P(X(t+s) \in A \mid \sigma(X(t))) \quad \text{a.s.}[P] \tag{30.33}$$
for $A \in \mathcal{B}$ and $t,s \ge 0$. This means that all the information capable of predicting the future path of a Markov process is contained in its current realized value. A diffusion process is a Markov process having continuous sample paths. The sample paths of a diffusion process must be describable in terms of a stochastic mechanism generating infinitesimal increments, although these need not be independent or identically distributed, nor for that matter Gaussian. A Brownian motion, however, is both a Markov process and a diffusion process. We shall not pursue these generalizations very far, but works such as Cox and Miller (1965) or Karatzas and Shreve (1988) might be consulted for further details.

The family $B_\eta$ defined in (29.49) are diffusion processes. They are also martingales, and it is easy to verify that in this case $\langle B_\eta\rangle = \eta$. However, a diffusion process need not be a martingale. An example with increments that are Gaussian but not independent is $X(t) = \theta(t)B(t)$ (see 27.9). Observe that
$$E(X(t+s) - X(t) \mid \mathcal{F}(t)) = (\theta(t+s) - \theta(t))B(t) = \left(\frac{\theta(t+s)}{\theta(t)} - 1\right)X(t) \ne 0 \quad \text{a.s.}[P]. \tag{30.34}$$
A larger class of diffusion processes is defined by the scheme $X(t) = \theta(t)B_\eta(t)$, for eligible choices of $\theta$ and $\eta$. The Ornstein-Uhlenbeck process (27.10) is another example. However, the class $B_\eta$ is the only one we shall be concerned with here.
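Example 30.8 can be checked numerically. The following Python sketch (illustrative only) computes the sum of squared increments of a simulated Brownian path over $[0,t]$, which approximates $\langle B\rangle(t) = t$, and also the sum of absolute increments, which diverges, anticipating the unbounded-variation point made in §30.3:

```python
import numpy as np

rng = np.random.default_rng(1)
t = 0.7
for n in (10, 100, 1000, 10000):
    dB = rng.normal(0.0, np.sqrt(t / n), size=n)   # increments over an n-point partition of [0, t]
    print(n, np.sum(dB ** 2), np.sum(np.abs(dB)))  # quadratic variation -> t; total variation -> infinity
```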
30.3 Stochastic Integrals
In this section we introduce a class of stochastic integrals on $[0,1]$. Let $\{M(t),\mathcal{F}(t)\}$ denote a martingale having a deterministic quadratic variation process $\langle M\rangle$. For a function $f \in D$, satisfying a prescribed set of properties to be detailed below, a stochastic process on $[0,1]$ will be represented by
$$I(\omega,t) = \int_0^t f(\omega,\tau)\,dM(\omega,\tau), \quad t \in [0,1], \tag{30.35}$$
more compactly written as $I(t) = \int f\,dM$. The notation corresponds, for fixed $\omega$, to what we would use for the Riemann-Stieltjes integral of $f(\omega,\cdot)$ over $[0,t]$ with respect to $M(\omega,\cdot)$. However, it is important to appreciate that, for almost every $\omega$, this Riemann-Stieltjes integral does not exist; quite simply, we have not required $M(\omega,\cdot)$ to be of bounded variation, and the example of Brownian motion shows that this requirement could fail for almost all $\omega$. Hence, a different interpretation of the process $I(t)$ is called for.

The results we shall obtain are actually available for a larger class of integrator functions, including martingales whose quadratic variation is a stochastic process. However, it is substantially easier to prove the existence of the integral for the case indicated, and this covers the applications of interest to us.

We assume the existence of a filtration $\{\mathcal{F}(t),\ t \in [0,1]\}$ on a probability space $(\Omega,\mathcal{F},P)$. Let $a\colon [0,1] \mapsto \mathbb{R}$ be a positive, increasing element of $D$, and let $a(0) = 0$ and $a(1) = 1$, with no loss of generality as it turns out. For any $t \in (0,1]$, the restriction of $a$ to $[0,t]$ induces a finite Lebesgue-Stieltjes measure. That is to say, $a$ is a c.d.f., and the function $\int_B da(s)$ assigns a measure to each $B \in \mathcal{B}_{[0,t]}$. Accordingly we can define on the product space $(\Omega \times [0,t],\ \mathcal{F}(t) \otimes \mathcal{B}_{[0,t]})$ the product measure $\mu_a$, where
$$\mu_a(A) = \int_\Omega \int_0^t 1_A(\omega,s)\,da(s)\,dP(\omega) = E\left(\int_0^t 1_A(\omega,s)\,da(s)\right) \tag{30.36}$$
for each $A \in \mathcal{F}(t) \otimes \mathcal{B}_{[0,t]}$. Let $\mathbb{L}_a$ denote the class of functions $f\colon \Omega \mapsto \mathbb{R}^{[0,1]}$ which are (a) progressively measurable (and hence adapted to $\{\mathcal{F}(t)\}$), and (b) square-integrable with respect to $\mu_a$; that is to say, $\|f\| < \infty$, where
$$\|f\| = \left[E\left(\int_0^1 f^2\,da\right)\right]^{1/2}. \tag{30.37}$$
It is then easy to verify that $\|f - g\|$ is a pseudo-metric on $\mathbb{L}_a$. While $\|f - g\| = 0$ does not guarantee that $f(\omega) = g(\omega)$ for every $\omega \in \Omega$, it does imply that the integrals of $f$ and $g$ with respect to $a$ will be equal almost surely $[P]$. In this case we call the functions $f$ and $g$ equivalent.

The chief technical result we need is to show that a class of simple functions is dense in $\mathbb{L}_a$. Let $\mathcal{E}_a \subset \mathbb{L}_a$ denote the class such that $f(t) = f(t_k)$ for $t \in [t_k,t_{k+1})$, $k = 0,\dots,m-1$, and $f(1) = f(1-)$, where $\{t_1,\dots,t_m\} = \Pi_m$ is a partition of $[0,1]$ for some $m \in \mathbb{N}$.

30.9 Lemma (after Kopp 1984) For each $f \in \mathbb{L}_a$, there exists a sequence $\{f_{(n)} \in \mathcal{E}_a,\ n \in \mathbb{N}\}$ with $\|f_{(n)} - f\| \to 0$ as $n \to \infty$.

Proof Let the domain of $f$ be extended to $\mathbb{R}$ by setting $f(t) = 0$ for $t \notin [0,1]$. By square-integrability, $\int_{-\infty}^{+\infty} f(\omega,t)^2\,da(t) < \infty$ a.s.$[P]$, and $\int_{-\infty}^{+\infty}(f(\omega,t+h) - f(\omega,t))^2\,da(t) \to 0$ a.s.$[P]$ as $h \to 0$; hence
$$\lim_{h\to 0} E\left(\int_{-\infty}^{+\infty}(f(s+h) - f(s))^2\,da(s)\right) = 0 \tag{30.38}$$
by the bounded convergence theorem. This holds for any sequence of points going to 0, so, given a partition $\Pi_{m(n)}$ such that $\|\Pi_{m(n)}\| \to 0$ as $n \to \infty$, and $t \in [0,1]$, consider the case $h = k_n(t) - t$, where
$$k_n(t) = t_i, \quad t \in [t_i,t_{i+1}),\ i = 0,\dots,m-1, \tag{30.39}$$
$$k_n(1) = t_{m-1}. \tag{30.40}$$
Clearly, $k_n(t) \to t$. Hence, (30.38) implies that
$$\begin{aligned}
\int_{-\infty}^{+\infty} E\left(\int_0^1 (f(s+k_n(t)) - f(s+t))^2\,da(t)\right) da(s)
&= \int_0^1 E\left(\int_{-\infty}^{+\infty}(f(s+k_n(t)) - f(s+t))^2\,da(s)\right) da(t) \\
&= \int_0^1 E\left(\int_{-\infty}^{+\infty}(f(s+k_n(t)-t) - f(s))^2\,da(s)\right) da(t) \\
&\to 0 \text{ as } n \to \infty,
\end{aligned} \tag{30.41}$$
where the first equality is an application of Fubini's theorem. Since the inner integral on the left-hand side is non-negative, (30.41) implies
$$\lim_{n\to\infty} E\left(\int_0^1 (f(s+k_n(t)) - f(s+t))^2\,da(t)\right) = 0 \tag{30.42}$$
for almost all $s \in \mathbb{R}$. Fixing $s \in [0,1]$ and making a change of variable from $t$ to $t - s$ gives
$$\lim_{n\to\infty} E\left(\int_s^{1+s}(f(t+l_n(t)) - f(t))^2\,da(t-s)\right) = 0, \tag{30.43}$$
where $l_n(t) = k_n(t-s) - (t-s)$. Define a function
$$f_{(n)}(t) = \begin{cases} f(t+l_n(t)), & t+l_n(t) \in [0,1] \\ 0, & \text{otherwise}, \end{cases} \tag{30.44}$$
noting that $f_{(n)}(t) = f(t_i + s)$ for $t \in [t_i+s,\ t_{i+1}+s) \cap [0,1]$, and hence $f_{(n)} \in \mathcal{E}_a$. Given (30.43), the proof is completed by noting that $[0,1] \subseteq [s-1,\ 1+s]$, and hence
$$\|f_{(n)} - f\|^2 = E\left(\int_0^1 (f(t+l_n(t)) - f(t))^2\,da(t)\right) \to 0 \text{ as } n \to \infty. \quad \blacksquare \tag{30.45}$$
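The content of Lemma 30.9 is easy to visualize. In the following sketch (ours, not part of the formal development) we take $a(t) = t$, so that $\|\cdot\|$ reduces to an ordinary $L_2$ norm, and a deterministic cadlag $f$ chosen purely for illustration; the approximation error of the simple function shrinks as the partition is refined:

```python
import numpy as np

f = lambda t: np.where(t < 0.4, np.sin(6.0 * t), 1.0 + 0.5 * t)  # cadlag, with a jump at t = 0.4

grid = np.linspace(0.0, 1.0, 100001)
for m in (4, 16, 64, 256):
    tk = np.linspace(0.0, 1.0, m + 1)                      # partition Pi_m
    idx = np.minimum(np.searchsorted(tk, grid, side='right') - 1, m - 1)
    f_m = f(tk[idx])                                       # simple function: f_(m)(t) = f(t_k) on [t_k, t_{k+1})
    print(m, np.sqrt(np.mean((f_m - f(grid)) ** 2)))       # discretized ||f_(m) - f||, decreasing to 0
```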
$$\int_0^t B\,dB = \tfrac{1}{2}(B(t)^2 - t). \tag{30.54}$$
Put $t = 1$ and compare this with 30.4. It is apparent (and will be proved rigorously in 30.13 below) that the limit in (30.11) can be expressed as $\int_0^1 B\,dB + \lambda$, where $\lambda = \tfrac{1}{2}(1 - \sigma^2)$, as before. □

A form of Itô's rule holds for a general class of continuous semimartingales. The proof of the general result is lengthy (see for example Karatzas and Shreve 1988 or McKean 1969 for details) and we will give the proof just for the case of 30.11, to avoid complications with the possible unboundedness of $g''$. However, there is little extra difficulty in extending from ordinary Brownian motion to the class of diffusion processes $B_\eta$.

30.12 Theorem $\int_0^t B_\eta\,dB_\eta = \tfrac{1}{2}(B_\eta(t)^2 - \eta(t))$, a.s.
Proof Let $\Pi_n$ denote the partition of $[0,t]$ in which $t_j = jt/n$ for $j = 1,\dots,n$. Use Taylor expansions to second order to obtain the identity
$$B_\eta(t)^2 = 2\sum_{j=0}^{n-1} B_\eta(t_j)(B_\eta(t_{j+1}) - B_\eta(t_j)) + \sum_{j=0}^{n-1}(B_\eta(t_{j+1}) - B_\eta(t_j))^2. \tag{30.55}$$
We show the $L_2$ convergence of each of the sums in the right-hand member. $B_\eta \in \mathbb{L}_\eta$, so define $P_n \in \mathcal{E}_\eta$ by
$$P_n(s) = B_\eta(t_j), \quad s \in [t_j,t_{j+1}),\ j = 0,\dots,n-1, \tag{30.56}$$
and $P_n(s) = B_\eta(s)$ for $t \le s \le 1$. This is a construction similar to that used in the proof of 30.9, and $\|P_n - B_\eta\| \to 0$ as $n \to \infty$. We may write
$$\sum_{j=0}^{n-1} B_\eta(t_j)(B_\eta(t_{j+1}) - B_\eta(t_j)) = \sum_{j=0}^{n-1}\int_{t_j}^{t_{j+1}} B_\eta(t_j)\,dB_\eta(s) = \int_0^t P_n\,dB_\eta, \tag{30.57}$$
and it follows that
$$E\left(\sum_{j=0}^{n-1} B_\eta(t_j)(B_\eta(t_{j+1}) - B_\eta(t_j)) - \int_0^t B_\eta\,dB_\eta\right)^2 \to 0 \text{ as } n \to \infty. \tag{30.58}$$
Considering the second sum on the right-hand side of (30.55), we have
$$\begin{aligned}
E\left(\sum_{j=0}^{n-1}(B_\eta(t_{j+1}) - B_\eta(t_j))^2 - \eta(t)\right)^2
&= E\left(\sum_{j=0}^{n-1}\left[(B_\eta(t_{j+1}) - B_\eta(t_j))^2 - (\eta(t_{j+1}) - \eta(t_j))\right]\right)^2 \\
&= \sum_{j=0}^{n-1} E\left[(B_\eta(t_{j+1}) - B_\eta(t_j))^2 - (\eta(t_{j+1}) - \eta(t_j))\right]^2 \\
&= 2\sum_{j=0}^{n-1}(\eta(t_{j+1}) - \eta(t_j))^2 \\
&\le 2\eta(t)\max_{0\le j\le n-1}\{\eta(t_{j+1}) - \eta(t_j)\} \\
&\to 0 \text{ as } n \to \infty.
\end{aligned} \tag{30.59}$$
The second equality here is due to the fact that the cross-products disappear in expectation, thanks to the law of iterated expectations and the fact that
$$E\left((B_\eta(t_{j+1}) - B_\eta(t_j))^2 \mid \mathcal{F}(t_j)\right) = \eta(t_{j+1}) - \eta(t_j) \quad \text{a.s.} \tag{30.60}$$
The third equality applies the Gaussianity of the increments, together with 9.7 for the fourth moments, and the inequality uses the continuity of $\eta$ and the fact that $\|\Pi_n\| \to 0$. Thus, $B_\eta(t)^2$ can be decomposed as the sum of sequences converging in $L_2$-norm to, respectively, $2\int_0^t B_\eta\,dB_\eta$ and $\eta(t)$. However, according to 18.6, $L_2$ convergence implies convergence with probability 1 on a subsequence $\{n_k\}$. Since the choice of partitions is arbitrary so long as $\|\Pi_{n_k}\| \to 0$, the theorem follows. ∎

The special step in this result is of course (30.59). In a continuous function of bounded variation, the sum of the squared increments is dominated by the largest increment and so must vanish by continuity, just as happens with $\eta(t)$ in the last line of the expression. It is because its variation is unbounded a.s. that the same sort of thing does not happen with $B_\eta$.
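The decomposition used in this proof can be verified by simulation in the standard case $\eta(t) = t$. This sketch (ours, with a uniform partition of $[0,1]$) computes both right-hand-side sums of (30.55):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 2000, 5000
dB = rng.normal(0.0, 1.0 / np.sqrt(n), size=(reps, n))        # Brownian increments, t = 1
B = np.cumsum(dB, axis=1)
B_left = np.hstack([np.zeros((reps, 1)), B[:, :-1]])          # B(t_j) at left endpoints
ito_sum = np.sum(B_left * dB, axis=1)                         # sum_j B(t_j)(B(t_{j+1}) - B(t_j))
qv = np.sum(dB ** 2, axis=1)                                  # sum of squared increments
print(np.mean((ito_sum - 0.5 * (B[:, -1] ** 2 - 1.0)) ** 2))  # ~0: the sum approximates (B(1)^2 - 1)/2
print(np.mean((qv - 1.0) ** 2))                               # ~0: quadratic variation term -> eta(1) = 1
```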
30.4 Convergence to Stochastic Integrals

Let $\{U_{nj}\}$ and $\{W_{nj}\}$ be a pair of stochastic arrays, let $X_n(t) = \sum_{j=1}^{[nt]} U_{nj}$ and $Y_n(t) = \sum_{j=1}^{[nt]} W_{nj}$, and suppose that $(X_n,Y_n) \Rightarrow (B_x,B_y)$, where $B_x$ and $B_y$ are a pair of transformed Brownian motions from the class $B_\eta$, with quadratic variation processes $\eta_x$ and $\eta_y$, the latter being homeomorphisms on the unit interval. In what follows it is always possible, for fixing ideas, to think of $B_x$ and $B_y$ as simple Brownian motions, having $\eta_x(t) = \eta_y(t) = t$. However, the extensions required to relax this assumption are fairly trivial. The problem we wish to consider is the convergence of the partial sums
n -1 "2_ XnU1n ) ( Yn ((j + 1)/n) - Yn (jln) ) .
(30.61) j= l This problem differs from those of §30. 1 because it cannot be deduced merely from combining the functional CLT with the continuous mapping theorem. None the less, it is possible to show that the convergence holds under the conditions of 29.14. The following theorem generalizes one given by Chan and Wei ( 1 988). See inter alia Strasser ( 1986), Phillips (1988), Kurtz and Protter ( 1 99 1 ), Hansen ( 1992c), for alternative approaches to this type of result. 30.13 Theorem Let { Unj• Wnj} be a (2 x 1) stochastic array satisfying the condi tions of 29.18 for the case Kn(t) = [nt]. In addition, assume that both Unj and Wnj are �-NED of size -1 on { Vni } . Then =
Gn
where, with Anjk
=
D �
f1 BxdBy + Axy,
(30.62)
0
E(UnjWnk), Axr
=
n -1 i- 1 lim L L An,i-m,i+ l · n�oo i= l m=O
(30.63)
D
An admissible case here is $U_{nj} = W_{nj}$, in which case the relevant joint distribution is singular. Setting the $L_2$-NED size at $-1$ ensures that the covariances are summable in the sense of 17.7. This strengthening of the conditions of 29.18 is typically only nominal, in the light of the discussions that follow 29.6 and 29.18. However, be careful to see that summability is not required to ensure that $|\Lambda_{xy}| < \infty$, which holds under the conditions of 29.18 merely by choice of normalization. Its role will become apparent in the course of the proof.

Proof The main ingredient of this proof is the Skorokhod representation theorem, 26.25, which at crucial steps in the argument allows us to deduce weak convergence from the a.s. convergence of a random sequence, and vice versa. Let $(X_n,Y_n)$ be an element of the separable, complete metric space $(D^2,d_B^2)$ (see §29.5). Since
$$(X_n,Y_n) \Rightarrow (B_x,B_y) \tag{30.64}$$
by 29.18, Skorokhod's theorem implies the existence of a sequence $\{(X^n,Y^n) \in D^2,\ n \in \mathbb{N}\}$ such that $(X^n,Y^n)$ is distributed like $(X_n,Y_n)$, and $d_B^2((X^n,Y^n),(B_x,B_y)) \to 0$ a.s. According to Egoroff's theorem (18.4) and the equivalence of $d_S$ and $d_B$ in $D$, (30.64) implies that, for a set $C_\varepsilon \in \mathcal{F}$ with $P(C_\varepsilon) \ge 1 - \varepsilon$,
$$\sup_{\omega\in C_\varepsilon} d_B^2\left((X^n(\omega),Y^n(\omega)),(B_x(\omega),B_y(\omega))\right) \to 0 \tag{30.65}$$
for each $\varepsilon > 0$. Since $B_x$ is a.s. continuous, there exists a set $E_X$ with $P(E_X) = 1$ and the following property: if $\omega \in E_X$, then for any $\eta > 0$ there is a constant $\delta > 0$ such that, if $d_B(X^n(\omega),B_x(\omega)) \le \delta$,
$$\sup_t |X^n(\omega,t) - B_x(\omega,t)| \le \sup_t |X^n(\omega,t) - B_x(\omega,\lambda(t))| + \sup_t |B_x(\omega,t) - B_x(\omega,\lambda(t))| \le \eta, \tag{30.66}$$
where $\lambda(\cdot)$ is the function from (28.7). The same result holds for $Y$ in respect of a set $E_Y$ with $P(E_Y) = 1$. It follows from (30.65) that, for $\omega \in C^*_\varepsilon = C_\varepsilon \cap E_X \cap E_Y$,
$$d_B^2\left((X^n(\omega),Y^n(\omega)),(B_x(\omega),B_y(\omega))\right) = \delta_n \to 0, \tag{30.67}$$
where the equality defines $\delta_n$. Note too that $P(C^*_\varepsilon) = P(C_\varepsilon)$. For each member of an increasing integer subsequence $\{k_n,\ n \in \mathbb{N}\}$, choose an ordered subset $\{n_1,n_2,\dots,n_{k_n}\}$ of the integers $1,\dots,n$, with $n_{k_n} = n$, such that $\min_{1\le j\le k_n}\{n_j - n_{j-1}\} \to \infty$. Use these sets to define partitions of $[0,1]$, $\Pi_n = \{t_1,\dots,t_{k_n}\}$, where $t_j = n_j/n$. Assume that $\{k_n\}$ is increasing slowly enough that $k_n\delta_n^2 \to 0$ and $k_n/n \to 0$, but note that, provided $k_n \uparrow \infty$, it is always possible to have $\|\Pi_n\| \to 0$. For example, choosing $n_j = [nj/k_n]$ will satisfy these conditions. The main steps to be taken are now basically two. Define
$$G^*_n = \sum_{j=1}^{k_n} X_n(t_{j-1})(Y_n(t_j) - Y_n(t_{j-1})), \tag{30.68}$$
and also let $\tilde{G}^*_n$ represent the same expression except that the Skorokhod variables $X^n$ and $Y^n$ are substituted for $X_n$ and $Y_n$. In view of 22.18 and the fact that $\tilde{G}^*_n$ and $G^*_n$ have the same distribution, to establish $G^*_n \xrightarrow{D} \int_0^1 B_x\,dB_y$ it will suffice to prove that
$$\left|\tilde{G}^*_n - \int_0^1 B_x\,dB_y\right| \xrightarrow{pr} 0. \tag{30.69}$$
The proof will then be completed, in view of 22.14(i), by showing
$$G_n - G^*_n - \Lambda_{xy} \xrightarrow{pr} 0. \tag{30.70}$$
The Cauchy-Schwartz inequality and (30.67) give, for each $\omega \in C^*_\varepsilon$,
$$\begin{aligned}
\left(\sum_{j=1}^{k_n}(X^n(\omega,t_{j-1}) - B_x(\omega,t_{j-1}))(Y^n(\omega,t_j) - Y^n(\omega,t_{j-1}))\right)^2
&\le \sum_{j=1}^{k_n}(X^n(\omega,t_{j-1}) - B_x(\omega,t_{j-1}))^2 \sum_{j=1}^{k_n}(Y^n(\omega,t_j) - Y^n(\omega,t_{j-1}))^2 \\
&\le k_n\delta_n^2 \sum_{j=1}^{k_n}(Y^n(\omega,t_j) - Y^n(\omega,t_{j-1}))^2.
\end{aligned} \tag{30.71}$$
Also, the assumptions on $Y_n$, and equivalence of the distributions, imply that
$$E\left(\sum_{j=1}^{k_n}(Y^n(t_j) - Y^n(t_{j-1}))^2\right) = O(1), \tag{30.72}$$
and hence, from (30.71),
$$E\left(\sum_{j=1}^{k_n}(X^n(t_{j-1}) - B_x(t_{j-1}))(Y^n(t_j) - Y^n(t_{j-1}))\,1_{C^*_\varepsilon}\right)^2 \to 0. \tag{30.73}$$
Closely similar arguments give
$$E\left(\sum_{j=1}^{k_n}(Y^n(t_j) - B_y(t_j))(B_x(t_j) - B_x(t_{j-1}))\,1_{C^*_\varepsilon}\right)^2 \to 0, \tag{30.74}$$
and also
$$E\left(B_x(1)(Y^n(1) - B_y(1))\,1_{C^*_\varepsilon}\right)^2 \le \delta_n^2 \to 0. \tag{30.75}$$
We now use the method of 'summation by parts'; given arbitrary real numbers $\{a_j,b_j,\alpha_j,\beta_j,\ j = 1,\dots,k\}$ with $a_0 = b_0 = \alpha_0 = \beta_0 = 0$, we have the identity
$$\sum_{j=1}^{k} a_{j-1}(b_j - b_{j-1}) - \sum_{j=1}^{k}\alpha_{j-1}(\beta_j - \beta_{j-1}) = \sum_{j=1}^{k}(a_{j-1} - \alpha_{j-1})(b_j - b_{j-1}) + \alpha_k(b_k - \beta_k) - \sum_{j=1}^{k}(b_j - \beta_j)(\alpha_j - \alpha_{j-1}). \tag{30.76}$$
Put $k = k_n$, $a_j = X^n(\omega,t_j)$, $b_j = Y^n(\omega,t_j)$, $\alpha_j = B_x(\omega,t_j)$, and $\beta_j = B_y(\omega,t_j)$. Then the left-hand side of (30.76) corresponds to $\tilde{G}^*_n - P_n$, where
$$P_n = \sum_{j=1}^{k_n} B_x(t_{j-1})(B_y(t_j) - B_y(t_{j-1})) = \sum_{j=1}^{k_n}\int_{t_{j-1}}^{t_j} B_x(t_{j-1})\,dB_y(t), \tag{30.77}$$
and the squares of the right-hand side terms correspond to the integrands in (30.73), (30.74), and (30.75). Since $\varepsilon$ is arbitrary, $P(C^*_\varepsilon)$ can be set arbitrarily close to 1, so that each of these terms vanishes in $L_2$-norm. We may conclude that $|\tilde{G}^*_n - P_n| \xrightarrow{pr} 0$. So, to get (30.69), it suffices to show that $|P_n - \int_0^1 B_x\,dB_y| \xrightarrow{pr} 0$. But
$$\begin{aligned}
E\left(P_n - \int_0^1 B_x\,dB_y\right)^2 &= E\left(\sum_{j=1}^{k_n}\int_{t_{j-1}}^{t_j}(B_x(t_{j-1}) - B_x(t))\,dB_y(t)\right)^2 \\
&= \sum_{j=1}^{k_n}\int_{t_{j-1}}^{t_j}(\eta_x(t) - \eta_x(t_{j-1}))\,d\eta_y(t) \\
&\le \max_{1\le j\le k_n}\{\eta_x(t_j) - \eta_x(t_{j-1})\}\,\eta_y(1) \\
&\to 0,
\end{aligned} \tag{30.78}$$
where the second equality applies (30.51) and then Fubini's theorem, and the convergence is by continuity of $\eta_x$. This completes the proof of (30.69).

To show (30.70), we use the fact that
$$Y_n(t_j) - Y_n(t_{j-1}) = \sum_{i=n_{j-1}}^{n_j-1}\left(Y_n((i+1)/n) - Y_n(i/n)\right),$$
and so
$$\begin{aligned}
G_n - G^*_n &= \sum_{j=1}^{k_n}\left[\sum_{i=n_{j-1}}^{n_j-1} X_n(i/n)\left(Y_n((i+1)/n) - Y_n(i/n)\right) - X_n(t_{j-1})(Y_n(t_j) - Y_n(t_{j-1}))\right] \\
&= \sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{n_j-1}(X_n(i/n) - X_n(t_{j-1}))\left(Y_n((i+1)/n) - Y_n(i/n)\right) \\
&= \sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{n_j-1}\sum_{m=0}^{i-n_{j-1}} U_{n,i-m}W_{n,i+1},
\end{aligned} \tag{30.79}$$
where we have formally set $U_{n0} = 0$. The final equality represents the shift from summing the elements of a triangular array by rows to summing by diagonals, and we use whichever of the two versions of the expression is most convenient. In view of (30.63), $G_n - G^*_n - \Lambda_{xy} = A_n - B_n$, where (summing by diagonals)
$$A_n = \sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{n_j-1}\sum_{m=0}^{i-n_{j-1}}(U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}) \tag{30.80}$$
and (summing by rows)
$$B_n = \sum_{j=1}^{k_n}\left(\sum_{i=n_{j-1}}^{n_j-1}\sum_{m=i-n_{j-1}+1}^{i-1}\lambda_{n,i-m,i+1}\right). \tag{30.81}$$
The problem is to show that both $A_n \xrightarrow{pr} 0$ and $B_n \to 0$. Choose a finite integer $N$ and break up $A_n$ into $N+1$ additive components $A_{n0},\dots,A_{nN}$, where, by taking $n$ large enough that $N \le \min_{1\le j\le k_n}\{n_j - n_{j-1}\}$, we can write
$$A_{nm} = \sum_{j=1}^{k_n}\sum_{i=m+n_{j-1}}^{n_j-1}(U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}) \tag{30.82}$$
for $m = 0,\dots,N-1$, and
$$A_{nN} = \sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{n_j-1}\sum_{m=N}^{i-n_{j-1}}(U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}). \tag{30.83}$$
For fixed finite $m$,
the process $\{U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1},\ \sigma(V_{nk},\ k \le i+1)\}$ is, according to 17.11 and 17.6, an $L_1$-mixingale array of size $-1$ with respect to the constant array $\{4c^u_{n,i-m}c^w_{n,i+1}\}$, where $\{c^u_{ni}\}$ and $\{c^w_{ni}\}$ are the constants specified by 29.18 for the cases $(1,0)'$ and $(0,1)'$ respectively. We next show that the conditions of 19.11 are satisfied by these terms, so that
$$A^*_{nm} = \sum_{i=m+1}^{n-1}(U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}) \xrightarrow{pr} 0. \tag{30.84}$$
First, for $r > 2$ in the $\alpha$-mixing case (29.6) or for $r \ge 2$ in the $\phi$-mixing case (29.9),
$$\sup_{i,n} E\left|\frac{U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}}{c^u_{n,i-m}c^w_{n,i+1}}\right|^{r/2} \le 2^{r/2}\sup_{i,n} E\left|\frac{U_{n,i-m}W_{n,i+1}}{c^u_{n,i-m}c^w_{n,i+1}}\right|^{r/2} \le 2^{r/2}\sup_{i,n}\left\|\frac{U_{n,i-m}}{c^u_{n,i-m}}\right\|_r^{r/2}\left\|\frac{W_{n,i+1}}{c^w_{n,i+1}}\right\|_r^{r/2} < \infty, \tag{30.85}$$
where the first inequality makes use successively of Loève's $c_r$ inequality and Jensen's inequality, the second one is by the Cauchy-Schwartz inequality, and the finiteness is because the arrays satisfy either 29.6(b) or 29.9(b$'$) by assumption. In the latter case, note that the assumptions include uniform square-integrability. Therefore the array
$$\left\{\frac{U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}}{c^u_{n,i-m}c^w_{n,i+1}}\right\}$$
is uniformly integrable in either case, and condition 19.11(a) is met. Next, the arrays $\{c^u_{ni}\}$ and $\{c^w_{ni}\}$ satisfy condition 29.6(d) by assumption, which by the Cauchy-Schwartz inequality implies that
$$\sup_{t\in[0,1),\,\delta\in(0,1-t]}\ \limsup_{n\to\infty}\ \frac{1}{\delta}\sum_{i=[nt]+m+1}^{[n(t+\delta)]-1} c^u_{n,i-m}c^w_{n,i+1} < \infty. \tag{30.86}$$
Setting $t = 0$ and $\delta = 1$ in (30.86) gives
$$\limsup_{n\to\infty}\sum_{i=m+1}^{n-1} c^u_{n,i-m}c^w_{n,i+1} < \infty, \tag{30.87}$$
which is condition 19.11(b), whereas setting $\delta = 1/n$ gives
$$\max_{m+1\le i\le n-1}\{c^u_{n,i-m}c^w_{n,i+1}\} = O(1/n), \tag{30.88}$$
and (30.87) and (30.88) together imply
$$\sum_{i=m+1}^{n-1}(c^u_{n,i-m}c^w_{n,i+1})^2 \to 0, \tag{30.89}$$
which is condition 19.11(c). So $A^*_{nm} \xrightarrow{pr} 0$ is proved. But for $m \ge 1$, according to (30.82),
$$E|A_{nm} - A^*_{nm}| \le \sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{m+n_{j-1}-1} E|U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}| = O(k_n/n), \tag{30.90}$$
where the order of magnitude is by (30.85) and (30.88), so $A_{nm} \xrightarrow{pr} 0$ also holds, for each $m = 0,\dots,N-1$. Similarly, $E|U_{n,i-m}W_{n,i+1} - \lambda_{n,i-m,i+1}| \le 2|\lambda_{n,i-m,i+1}|$, and applying 17.7 yields
$$E|A_{nN}| = O\left(\sum_{j=1}^{k_n}\sum_{m=N}^{n_j-n_{j-1}-1}\sum_{i=m+n_{j-1}}^{n_j-1} c^u_{n,i-m}c^w_{n,i+1}\,\zeta_{m+1}\right) = O(N^{-\delta}) \tag{30.91}$$
for some $\delta > 0$, where the order of magnitude follows by a combination of (30.87) with the fact that the sequence $\{\zeta_m\}$ is of size $-1$, according to the mixing and $L_2$-NED size assumptions. Thus, $\lim_{n\to\infty}E|A_n| \le \lim_{n\to\infty}E|A_{nN}|$, which by taking $N$ large enough can be made as close to 0 as desired. In the same manner, recalling $N \le \min_{1\le j\le k_n}\{n_j - n_{j-1}\}$,
$$E|B_n| = O\left(\sum_{j=1}^{k_n}\sum_{i=n_{j-1}}^{n_j-1}\sum_{m=i-n_{j-1}+1}^{i-1} c^u_{n,i-m}c^w_{n,i+1}\,\zeta_{m+1}\right) = O(N^{-\delta}). \tag{30.92}$$
It follows that $G_n - G^*_n - \Lambda_{xy} \xrightarrow{pr} 0$, and this completes the proof of (30.70), and of the theorem. ∎
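A small simulation illustrates the role of $\Lambda_{xy}$. Taking $U_{nj} = W_{nj} = u_j/\sqrt{n}$ with $u_j$ an MA(1) process (an assumption made purely for illustration), the one-sided covariance sum equals the MA coefficient $\theta$, and the Monte Carlo mean of $G_n$ settles near $\theta$ because $E(\int_0^1 B_x\,dB_y) = 0$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, theta = 1000, 4000, 0.5
e = rng.normal(size=(reps, n + 1))
u = e[:, 1:] + theta * e[:, :-1]                 # MA(1) increments; E(u_t u_{t+1}) = theta, higher lags 0
S = np.cumsum(u, axis=1)                         # partial sums: S_j = sqrt(n) * X_n(j/n)
G = np.sum(S[:, :-1] * u[:, 1:], axis=1) / n     # G_n = sum_j X_n(j/n)(Y_n((j+1)/n) - Y_n(j/n))
print(np.mean(G))                                # ~ Lambda_xy = theta = 0.5
```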
Now let $\{U_{nj}\}$ ($m \times 1$) be a vector array satisfying the conditions of 29.18, plus the extra condition that the $L_2$-NED size of the increments is $-1$ for each element. Since 30.13 holds for each element paired with each element, including itself, the argument may be generalized in the following manner.

30.14 Theorem Let $S_{nj} = \sum_{i=1}^{j} U_{ni}$. Then
$$\sum_{j=1}^{n-1} S_{nj}U'_{n,j+1} \xrightarrow{D} \int_0^1 B_\eta\,dB'_\eta + \Lambda \quad (m \times m). \tag{30.93}$$
Proof For arbitrary $m$-vectors of unit length, $\lambda$ and $\mu$, the scalar arrays $\{\lambda'S_{nj}\}$ and $\{\mu'U_{n,j+1}\}$ satisfy the conditions of 30.13. Letting $G_n$ denote the matrix on the left-hand side of (30.93), and $G$ the matrix on the right-hand side, the result $\lambda'G_n\mu \xrightarrow{D} \lambda'G\mu$ is therefore given. A well-known matrix formula (see e.g. Magnus and Neudecker 1988: th. 2.2) yields
$$\lambda'G_n\mu = (\mu' \otimes \lambda')\,\mathrm{Vec}\,G_n, \tag{30.94}$$
where $\mu' \otimes \lambda'$ is the Kronecker product of the vectors, the row vector $(\mu_1\lambda_1,\dots,\mu_1\lambda_m,\mu_2\lambda_1,\dots,\mu_m\lambda_m)$ ($1 \times m^2$), and $\mathrm{Vec}\,G_n$ ($m^2 \times 1$) is the vector consisting of the columns of $G_n$ stacked one above the other. $\mu' \otimes \lambda'$ is of unit length, and applying the Cramér-Wold theorem (25.5) in respect of (30.94) implies that $G_n \xrightarrow{D} G$, as asserted in (30.93). ∎
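The matrix identity (30.94) invoked in the proof is easily checked numerically; a minimal sketch (the array names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 3
G = rng.normal(size=(m, m))
lam = rng.normal(size=m); lam /= np.linalg.norm(lam)   # unit-length lambda
mu = rng.normal(size=m);  mu /= np.linalg.norm(mu)     # unit-length mu
vecG = G.flatten(order='F')                            # Vec G: columns stacked one above the other
print(np.allclose(lam @ G @ mu, np.kron(mu, lam) @ vecG))  # lambda'G mu = (mu' (x) lambda') Vec G
```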
This result is to be compared with 30.5. Between them they provide the intriguing incidental information that
$$\int_0^1 B_\eta\,dB'_\eta + \int_0^1 (dB_\eta)B'_\eta = B_\eta(1)B_\eta(1)' - \Omega. \tag{30.95}$$
(Note that the stochastic matrix on the right has rank 1.) Of the two, 30.14 is much the stronger result, since it derives from the FCLT and is specific to the pattern of the increment variances. Between them, 30.3 and 30.14 provide the basic theoretical tools necessary to analyse the linear regression model in variables generated as partial-sum processes (integrated processes). See Phillips and Durlauf (1986), and Park and Phillips (1988, 1989), among many other recent references.
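As a pointer to these applications, the following sketch (our illustration, assuming i.i.d. unit-variance increments) simulates the sample moment $n^{-1}\sum_t y_{t-1}u_t$ for a random walk $y_t$; by the results of this section its limit is $\int_0^1 B\,dB = \tfrac{1}{2}(B(1)^2 - 1)$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 1000, 20000
u = rng.normal(size=(reps, n))                       # iid increments, sigma^2 = 1
y = np.cumsum(u, axis=1)                             # random walk (integrated process)
stat = np.sum(y[:, :-1] * u[:, 1:], axis=1) / n      # n^{-1} sum_t y_{t-1} u_t
limit = 0.5 * (rng.normal(size=reps) ** 2 - 1.0)     # draws of (B(1)^2 - 1)/2
print(np.mean(stat), np.var(stat))                   # ~ 0 and ~ 0.5,
print(np.mean(limit), np.var(limit))                 # matching the limit law
```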
Notes
1. See Billingsley (1979, 1986). The definition of a $\lambda$-system is given as 1.25 in Billingsley (1979), and as 1.26 in Billingsley (1986).
2. The 'prime' symbol $'$ denotes transposition. $f$ is a column vector, written as a row for convenience.
3. An affine transformation is a linear transformation $x \mapsto Ax$ followed by a translation, addition of a constant vector $b$. By an accepted abuse of terminology, such transformations tend to be referred to as 'linear'.
4. That is, $|x + y| \le |x| + |y|$. See §5.1 for more details.
5. The notations $\int f\,d\mu$, $\int f\,\mu(d\omega)$, or simply $\int f$ when the relevant measure is understood, are used synonymously by different authors.
6. I thank Elizabeth Boardman for supplying this proof.
7. Elizabeth Boardman also suggested this proof.
8. If there is a subset $N \subset \Omega$ such that every $\mathcal{F}$-set either contains $N$ or is disjoint from it, the elements of $N$ cease to be distinguishable as different outcomes. An equivalent model of the random experiment is obtained by redefining $\Omega$ to have $N$ itself as an element, replacing its individual members.
9. Random variables may also be complex-valued; see §11.2.
10. In statements of general definitions and results we usually consider the case of a one-sided sequence $\{X_t\}_1^\infty$. There is no difficulty in extending the concepts to the case $\{X_t\}_{-\infty}^{+\infty}$, and this is left implicit, except when the mapping from $\mathbb{Z}$ plays a specific role in the argument.
11. We adhere somewhat reluctantly to the convention of defining size as a negative number. The use of terms such as 'large size' to mean a slow (or rapid?) rate of mixing can obviously lead to confusion, and is best avoided.
12. This is a problem of the weak convergence of distributions; see §22.4 for further details.
13. This is similar to an example due to Athreya and Pantula (1986a).
14. In the theory of functions of a complex variable, an analytic function is one possessing finite derivatives everywhere in its domain.
15. I am grateful to Graham Brightwell for this argument.
16. For convergence to fail, the discontinuities of $f$ (which must be Borel measurable) would have to occupy a set of positive Lebesgue measure.
17. Conventionally, and for ease of notation, the symbol $\mathcal{F}_t$ is used here to denote
what has been previously written as $\mathcal{F}^t_{-\infty}$. No confusion need arise, since a $\sigma$-subfield bearing a time subscript but no superscript will always be interpreted in this way.
18. Some quoted versions of this result (e.g. Hall and Heyde 1980) are for $p > 1$, whereas the present version, adapted from Karatzas and Shreve (1988), extends to $0 < p \le 1$ as well.
19. The norm, or length, of a $k$-vector $X$ is $\|X\| = (X'X)^{1/2}$. To avoid confusion with the $L_2$ norm of a r.v., the latter is always written with a subscript.
20. The original St Petersburg Paradox, enunciated by Daniel Bernoulli in 1738, considered a game in which the player wins £$2^{n-1}$ if the first head appears on the $n$th toss, for any $n$. The expected winnings in this case are infinite, but the principle involved is the same in either case. See Shafer (1988).
21. See the remarks following 3.18. It is true that in topological spaces projections are continuous, and hence measurable, under the product topology (see §6.5), but of course the abstract space $(\Omega,\mathcal{F})$ lacks topological structure and this reasoning does not apply.
22. Since $\boldsymbol{\theta}$ is here a real $k$-vector it is written in bold face by convention, notwithstanding that $\theta$ is used to denote the generic element of $(\Theta,\rho)$, in the abstract.
23. This is the basis of a method for generating random numbers having a distribution $F$. Take a drawing from the uniform distribution on $[0,1]$ (i.e., a random string of digits with a decimal point placed in front) and apply the transformation $F^{-1}$ (or $Y$) to give a drawing from the desired distribution.
24. $\lambda$ is used here as the argument of the ch.f. instead of the $t$ used in Chapter 11, to avoid confusion with the time subscript.
25. The symbol $i$ appearing as a factor in these expressions denotes $\sqrt{-1}$. The context distinguishes this use from that of the same symbol as an array index.
26. In practice, of course, $U_t$ usually has to be estimated by a residual $\hat{U}_t$, depending on consistent estimates of model parameters. In this case, a result such as 21.6 is also required to show convergence.
27. More precisely, of course, $W$ models the projection of the motion of a particle in three-dimensional space onto an axis of the coordinate system.
28. A cautionary note: these combinations cannot be constructed as residuals from least squares regressions. If $\Sigma$ has full rank, the regression of one element of $Y_n$ onto the rest yields coefficients which are asymptotically random. $\Sigma$ must be estimated from the increments using the methods discussed in §25.1.
29. Compare Wooldridge and White (1988: Prop. 4.1). Wooldridge and White's result is incorrect as stated, since they omit the stipulation of almost sure continuity.
References
Amemiya, Takeshi (1985), Advanced Econometrics, Basil Blackwell, Oxford.
Andrews, Donald W. K. (1984), 'Non-strong mixing autoregressive processes', Journal of Applied Probability 21, 930-4.
——— (1987a), 'Consistency in nonlinear econometric models: a generic uniform law of large numbers', Econometrica 55, 1465-71.
——— (1988), 'Laws of large numbers for dependent non-identically distributed random variables', Econometric Theory 4, 458-67.
——— (1991), 'Heteroscedasticity and autocorrelation consistent covariance matrix estimation', Econometrica 59, 817-58.
——— (1992), 'Generic uniform convergence', Econometric Theory 8, 241-57.
Apostol, Tom M. (1974), Mathematical Analysis (2nd edn.), Addison-Wesley, Menlo Park.
Ash, R. (1972), Real Analysis and Probability, Academic Press, New York.
Athreya, Krishna B. and Pantula, Sastry G. (1986a), 'Mixing properties of Harris chains and autoregressive processes', Journal of Applied Probability 23, 880-92.
——— (1986b), 'A note on strong mixing of ARMA processes', Statistics and Probability Letters 4, 187-90.
Azuma, K. (1967), 'Weighted sums of certain dependent random variables', Tohoku Mathematical Journal 19, 357-67.
Barnsley, Michael (1988), Fractals Everywhere, Academic Press, Boston.
Bates, Charles and White, Halbert (1985), 'A unified theory of consistent estimation for parametric models', Econometric Theory 1, 151-78.
Bernstein, S. (1927), 'Sur l'extension du théorème du calcul des probabilités aux sommes de quantités dépendantes', Mathematische Annalen 97, 1-59.
Bierens, Herman (1983), 'Uniform consistency of kernel estimators of a regression function under generalized conditions', Journal of the American Statistical Association 77, 699-707.
——— (1989), 'Least squares estimation of linear and nonlinear ARMAX models under data heterogeneity', Working Paper, Department of Econometrics, Free University of Amsterdam.
Billingsley, Patrick (1968), Convergence of Probability Measures, John Wiley, New York.
——— (1979), Probability and Measure, John Wiley, New York.
Borowski, E. J. and Borwein, J. M. (1989), The Collins Reference Dictionary of Mathematics, Collins, London and Glasgow.
Bradley, Richard C., Bryc, W. and Janson, S. (1987), 'On dominations between measures of dependence', Journal of Multivariate Analysis 23, 312-29.
Breiman, Leo (1968), Probability, Addison-Wesley, Reading, Mass.
Brown, B. M. (1971), 'Martingale central limit theorems', Annals of Mathematical Statistics 42, 59-66.
Burkholder, D. L. (1973), 'Distribution function inequalities for martingales', Annals of Probability 1, 19-42.
Chan, N. H. and Wei, C. Z. (1988), 'Limiting distributions of least squares estimates of unstable autoregressive processes', Annals of Statistics 16, 367-401.
Chanda, K. C. (1974), 'Strong mixing properties of linear stochastic processes', Journal of Applied Probability 11, 401-8.
Chow, Y. S. (1971), 'On the $L_p$ convergence for $n^{-1/p}s_n$, $0 < p < 2$', Annals of Mathematical Statistics 36, 393-4.
——— and Teicher, H. (1978), Probability Theory: Independence, Interchangeability and Martingales, Springer-Verlag, Berlin.
Chung, Kai Lai (1974), A Course in Probability Theory (2nd edn.), Academic Press, Orlando, Fla.
Cox, D. R. and Miller, H. D. (1965), The Theory of Stochastic Processes, Methuen, London.
Cramér, Harald (1946), Mathematical Methods of Statistics, Princeton University Press, Princeton, NJ.
Davidson, James (1992), 'A central limit theorem for globally nonstationary near-epoch dependent functions of mixing processes', Econometric Theory 8, 313-29.
——— (1993a), 'An L1-convergence theorem for heterogeneous mixingale arrays with trending moments', Statistics and Probability Letters 16, 301-4.
——— (1993b), 'The central limit theorem for globally nonstationary near-epoch dependent functions of mixing processes: the asymptotically degenerate case', Econometric Theory 9, 402-12.
de Jong, R. M. (1992), 'Laws of large numbers for dependent heterogeneous processes', Working Paper, Free University of Amsterdam (forthcoming in Econometric Theory, 1995).
——— (1994), 'A strong law for L2-mixingale sequences', Working Paper, Department of Econometrics, University of Tilburg.
Dellacherie, C. and Meyer, P.-A. (1978), Probabilities and Potential, North-Holland, Amsterdam.
Dhrymes, Phoebus J. (1989), Topics in Advanced Econometrics, Springer-Verlag, New York.
Dieudonné, J. (1969), Foundations of Modern Analysis, Academic Press, New York and London.
Domowitz, I. and White, H. (1982), 'Misspecified models with dependent observations', Journal of Econometrics 20, 35-58.
Donsker, M. D. (1951), 'An invariance principle for certain probability limit theorems', Memoirs of the American Mathematical Society 6, 1-12.
Doob, J. L. (1953), Stochastic Processes, John Wiley, New York; Chapman & Hall, London.
Dudley, R. M. (1966), 'Weak convergence of probabilities on nonseparable metric spaces and empirical measures on Euclidean spaces', Illinois Journal of Mathematics 10, 109-26.
——— (1967), 'Measures on non-separable metric spaces', Illinois Journal of Mathematics 11, 109-26.
——— (1989), Real Analysis and Probability, Wadsworth and Brooks/Cole, Pacific Grove, Calif.
Dvoretsky, A. (1972), 'Asymptotic normality of sums of dependent random variables', in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, ii, University of California Press, Berkeley, Calif., 513-35.
Eberlein, Ernst and Taqqu, Murad S. (eds.) (1986), Dependence in Probability and Statistics: a Survey of Recent Results, Birkhäuser, Boston.
Engle, R. F., Hendry, D. F. and Richard, J.-F. (1983), 'Exogeneity', Econometrica 51, 277-304.
Feller, W. (1971), An Introduction to Probability Theory and its Applications, ii, John Wiley, New York.
Gallant, A. Ronald (1987), Nonlinear Statistical Models, John Wiley, New York.
——— and White, Halbert (1988), A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models, Basil Blackwell, Oxford.
Gastwirth, Joseph L. and Rubin, Herman (1975), 'The asymptotic distribution theory of the empiric CDF for mixing stochastic processes', Annals of Statistics 3, 809-24.
Gnedenko, B. V. (1967), The Theory of Probability (4th edn.), Chelsea Publishing, New York.
Gorodetskii, V. V. (1977), 'On the strong mixing property for linear sequences', Theory of Probability and its Applications 22, 411-13.
Halmos, Paul R. (1956), Lectures in Ergodic Theory, Chelsea Publishing, New York.
——— (1960), Naive Set Theory, Van Nostrand Reinhold, New York.
——— (1974), Measure Theory, Springer-Verlag, New York.
Hall, P. and Heyde, C. C. (1980), Martingale Limit Theory and its Application, Academic Press, New York and London.
Hannan, E. J. (1970), Multiple Time Series, John Wiley, New York.
Hansen, L. P. (1982), 'Large sample properties of generalized method of moments estimators', Econometrica 50, 1029-54.
Hansen, Bruce E. (1991), 'Strong laws for dependent heterogeneous processes', Econometric Theory 7, 213-21.
——— (1992a), 'Errata', Econometric Theory 8, 421-2.
——— (1992b), 'Consistent covariance matrix estimation for dependent heterogeneous processes', Econometrica 60, 967-72.
——— (1992c), 'Convergence to stochastic integrals for dependent heterogeneous processes', Econometric Theory 8, 489-500.
Herrndorf, Norbert (1984), 'A functional central limit theorem for weakly dependent sequences of random variables', Annals of Probability 12, 141-53.
——— (1985), 'A functional central limit theorem for strongly mixing sequences of random variables', Z. Wahrscheinlichkeitstheorie verw. Gebiete 69, 540-50.
Hoadley, Bruce (1971), 'Asymptotic properties of maximum likelihood estimators for the independent not identically distributed case', Annals of Mathematical Statistics 42, 1977-91.
Hoeffding, W. (1963), 'Probability inequalities for sums of bounded random variables', Journal of the American Statistical Association 58, 13-30.
Ibragimov, I. A. (1962), 'Some limit theorems for stationary processes', Theory of Probability and its Applications 7, 349-82.
——— (1965), 'On the spectrum of stationary Gaussian sequences satisfying the strong mixing condition. I: Necessary conditions', Theory of Probability and its Applications 10, 85-106.
——— and Linnik, Yu. V. (1971), Independent and Stationary Sequences of Random Variables, Wolters-Noordhoff, Groningen.
Iosifescu, M. and Theodorescu, R. (1969), Random Processes and Learning, Springer-Verlag, Berlin.
Karatzas, Ioannis and Shreve, Steven E. (1988), Brownian Motion and Stochastic Calculus, Springer-Verlag, New York.
Kelley, John L. (1955), General Topology, Springer-Verlag, New York.
Kingman, J. F. C. and Taylor, S. J. (1966), Introduction to Measure and Probability, Cambridge University Press, London and New York.
Kolmogorov, A. N. (1950), Foundations of the Theory of Probability, Chelsea Publishing, New York (published in German as Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer-Verlag, Berlin, 1933).
——— and Rozanov, Yu. A. (1960), 'On strong mixing conditions for stationary Gaussian processes', Theory of Probability and its Applications 5, 204-8.
Kopp, P. E. (1984), Martingales and Stochastic Integrals, Cambridge University Press.
Kurtz, T. G. and Protter, P. (1991), 'Weak limit theorems for stochastic integrals and stochastic differential equations', Annals of Probability 19, 1035-70.
Loève, Michel (1977), Probability Theory, i (4th edn.), Springer-Verlag, New York.
Lukacs, Eugene (1975), Stochastic Convergence (2nd edn.), Academic Press, New York.
Magnus, J. R. and Neudecker, H. (1988), Matrix Differential Calculus with Applications in Statistics and Econometrics, John Wiley, Chichester.
Mandelbrot, Benoit B. (1983), The Fractal Geometry of Nature, W. H. Freeman, New York.
Mann, H. B. and Wald, A. (1943a), 'On the statistical treatment of linear stochastic difference equations', Econometrica 11, 173-220.
——— (1943b), 'On stochastic limit and order relationships', Annals of Mathematical Statistics 14, 390-402.
McKean, H. P., Jr. (1969), Stochastic Integrals, Academic Press, New York.
McLeish, D. L. (1974), 'Dependent central limit theorems and invariance principles', Annals of Probability 2, 620-8.
——— (1975a), 'A maximal inequality and dependent strong laws', Annals of Probability 3, 329-39.
——— (1975b), 'Invariance principles for dependent variables', Z. Wahrscheinlichkeitstheorie verw. Gebiete 32, 165-78.
——— (1977), 'On the invariance principle for nonstationary mixingales', Annals of Probability 5, 616-21.
Nagaev, S. V. and Fuk, A. Kh. (1971), 'Probability inequalities for sums of independent random variables', Theory of Probability and its Applications 6, 643-60.
Newey, W. K. (1991), 'Uniform convergence in probability and stochastic equicontinuity', Econometrica 59, 1161-8.
——— and West, K. (1987), 'A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix', Econometrica 55, 703-8.
Park, J. Y. and Phillips, P. C. B. (1988), 'Statistical inference in regressions with integrated processes, Part 1', Econometric Theory 4, 468-98.
——— (1989), 'Statistical inference in regressions with integrated processes, Part 2', Econometric Theory 5, 95-132.
Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic Press, New York and London.
Pham, Tuan D. and Tran, Lanh T. (1985), 'Some mixing properties of time series models', Stochastic Processes and their Applications 19, 297-303.
Phillips, P. C. B. (1988), 'Weak convergence to the matrix stochastic integral $\int_0^1 B\,dB'$', Journal of Multivariate Analysis 24, 252-64.
——— and Durlauf, S. N. (1986), 'Multiple time series regression with integrated processes', Review of Economic Studies 53, 473-95.
Pollard, David (1984), Convergence of Stochastic Processes, Springer-Verlag, New York.
Pötscher, B. M. and Prucha, I. R. (1989), 'A uniform law of large numbers for dependent and heterogeneous data processes', Econometrica 57, 675-84.
——— (1994), 'Generic uniform convergence and equicontinuity concepts for random functions: an exploration of the basic structure', Journal of Econometrics 60, 23-63.
——— (1991a), 'Basic structure of the asymptotic theory in dynamic nonlinear econometric models, Part I: Consistency and approximation concepts', Econometric Reviews 10, 125-216.
——— (1991b), 'Basic structure of the asymptotic theory in dynamic nonlinear econometric models, Part II: Asymptotic normality', Econometric Reviews 10, 253-325.
Prokhorov, Yu. V. (1956), 'Convergence of random processes and limit theorems in probability theory', Theory of Probability and its Applications 1, 157-213.
Rao, C. Radhakrishna (1973), Linear Statistical Inference and its Applications (2nd edn.), John Wiley, New York.
Révész, Pál (1968), The Laws of Large Numbers, Academic Press, New York.
Rosenblatt, M. (1956), 'A central limit theorem and a strong mixing condition', Proceedings of the National Academy of Sciences, USA 42, 43-7.
——— (1972), 'Uniform ergodicity and strong mixing', Z. Wahrscheinlichkeitstheorie verw. Gebiete 24, 79-84.
——— (1978), 'Dependence and asymptotic independence for random processes', in Studies in Probability Theory (ed. M. Rosenblatt), Mathematical Association of America, Washington DC.
Royden, H. L. (1968), Real Analysis, Macmillan, New York.
Seneta, E. (1976), Regularly Varying Functions, Springer-Verlag, Berlin.
Serfling, R. J. (1968), 'Contributions to central limit theory for dependent variables', Annals of Mathematical Statistics 39, 1158-75.
——— (1980), Approximation Theorems of Mathematical Statistics, John Wiley, New York.
Shafer, G. (1988), 'The St Petersburg Paradox', in Encyclopaedia of the Statistical Sciences, viii (ed. S. Kotz and N. L. Johnson), John Wiley, New York.
Shiryayev, A. N. (1984), Probability, Springer-Verlag, New York.
Skorokhod, A. V. (1956), 'Limit theorems for stochastic processes', Theory of Probability and its Applications 1, 261-90.
——— (1957), 'Limit theorems for stochastic processes with independent increments', Theory of Probability and its Applications 2, 138-71.
Slutsky, E. (1925), 'Über stochastische Asymptoten und Grenzwerte', Math. Annalen 5, 93.
Stinchcombe, M. B. and White, H. (1992), 'Some measurability results for extrema of random functions over random sets', Review of Economic Studies 59, 495-514.
Stone, Charles (1963), 'Weak convergence of stochastic processes defined on semi-infinite time intervals', Proceedings of the American Mathematical Society 14, 694-6.
Stout, W. F. (1974), Almost Sure Convergence, Academic Press, New York.
Strasser, H. (1986), 'Martingale difference arrays and stochastic integrals', Probability Theory and Related Fields 72, 83-98.
Varadarajan, V. S. (1958), 'Weak convergence of measures on separable metric spaces', Sankhyā 19, 15-22.
von Bahr, Bengt and Esseen, Carl-Gustav (1965), 'Inequalities for the rth absolute moment of a sum of random variables, $1 \le r \le 2$', Annals of Mathematical Statistics 36, 299-303.
White, Halbert (1984), Asymptotic Theory for Econometricians, Academic Press, New York.
——— and Domowitz, I. (1984), 'Nonlinear regression with dependent observations', Econometrica 52, 143-62.
Wiener, Norbert (1923), 'Differential space', Journal of Mathematical Physics 2, 131-74.
Willard, Stephen (1970), General Topology, Addison-Wesley, Reading, Mass.
Withers, C. S. (1981a), 'Conditions for linear processes to be strong-mixing', Z. Wahrscheinlichkeitstheorie verw. Gebiete 57, 477-80.
——— (1981b), 'Central limit theorems for dependent variables, I', Z. Wahrscheinlichkeitstheorie verw. Gebiete 57, 509-34.
Wooldridge, Jeffrey M. and White, Halbert (1986), 'Some invariance principles and central limit theorems for dependent heterogeneous processes', University of California (San Diego) Working Paper.
——— (1988), 'Some invariance principles and central limit theorems for dependent heterogeneous processes', Econometric Theory 4, 210-30.
Index
a-mixing, see strong mixing Abel's partial summation formula 34, 254 absolute convergence 31 absolute moments 132 absolute value 162 absolutely continuous function 120 absolutely continuous measure 69 absolutely regular sequence 209 abstract integral 128 accumulation point 21, 94 adapted process 500 adapted sequence 229 addition modulo 1 46 adherent point 77 affine transformation 126 Aleph nought 8 algebra of sets 14 almost everywhere 38, 1 13 almost sure convergence 178, 281 method of subsequences 295 uniform 33 1 almost surely 1 1 3 analytic function 328 analytic set 328 of C 449 Andrews, D. W. K. 216, 261 , 301 , 336, 338, 341 antisymmetric relation 5 Apostol, T. M. 29, 32, 33, 126 approximable in probability 274 approximable process 273 weak law of large numbers 304 approximately mixing 262 AR process, see autoregressive process ARMA process 215 array 33 array convergence 35 Arzela-Ascoli theorem 91, 335, 439, 447, 469
asymptotic equicontinuity 90 asymptotic expectation 357 asymptotic independence 479 asymptotic negligibility 375 asymptotic uniform equicontinuity 90, 335 asymptotic unpredictability 247 Athreya, Krishna B. 215 atom 37 of a distribution 129 of a p.m. 1 12 autocovariance 193 autocovariances, summability 266 autoregressive process 215 non-strong mixing 216 non-uniform mixing 218 autoregressive-moving average process 215 axiom of choice 47 axioms of probability 1 1 1 Azuma, K. 245 Azuma ' s inequality 245
�-mixing 207 backshift transformation 191 ball 76 band width 401 Bartlett kernel 403, 407 base 79 for point 94 for space of measures 418 for topology 93 Bernoulli distribution 122 expectation 129 Bernoulli r.v.s 216 Bernstein sums 386, 401 Berry-Esseen theorem 407 betting system 233-4 Bierens, Herman 261 , 336 Big Oh 3 1 , 1 87 Billingsley, Patrick 17, 18, 261 ,
528 421 , 422, 447, 448, 457, 466, 469, 472, 473, 474-5, 496 Billingsley metric 462 equivalent to Skorokhod metric 464 binary expansion 10 binary r.v. 122, 133 binary sequence 1 80 binomial distribution 122, 348, 364 bivariate Gaussian 144 blocking argument 194, 299 Borel field of C 440 of D 465 infinite-dimensional 1 8 1 real line 22, 1 6 metric space 77, 413 topological space 413 Borel function 55, 57, 1 17 expectation of 130 Borel sets 47 Borel-Cantelli lemma 282, 295, 307 Borel's normal number theorem 290 boundary point 2 1 , 77 bounded convergence theorem 64 bounded function 28 bounded set 22, 77 bounded variation 29 Brown, Robert 443 Brownian bridge 445 mean deviations 498 Brownian motion 443 de-meaned 498 distribution of supremum 496 transformed 445, 486 vector 454 with drift 444 Burkholder' s inequality 242 c 437 c.d.f., s ee cumulative distribution function cadlag function 90, 456 cadlag process 48 8 progressively measurable 500
Index Cantor, G. 10 cardinal number 8 cardinality 8 Cartesian product 5, 83, 102 Cauchy criterion 25 Cauc�y distribution 123 as stable distribution 362 characteristic function 167 no expectation 129 Cauchy family 124 Cauchy sequence 80-2, 97 real numbers 25 vectors 29 Cauchy-Schwartz inequality 138 central limit theorem ergodic L1 -mixingales 385 functional 450, 480 independent sequences 368 martingale differences 383 NED functions of mixing processes 386 three series theorem 3 12 trending variances 3 79 central moments 1 3 1 central tendency 128 centred sequence 23 1 Cesaro sum 3 1 ch.f., s ee characteristic function Chan, N. H. 5 1 0 Chanda, K . C . 2 1 5, 2 1 9 characteristic function 53, 162 derivatives 164 independent sums 166 multivariate distributions 168 series expansion 165 weak convergence 357 Chebyshev' s inequality 132 Chebyshev' s theorem 293 chi-squared distribution 124 chord 133 Chow, Y. S. 298 Chung, K. L. 407, 409 closed under set operations 14 closed interval 1 1 closed set 77
Index real line 21 closure point 20, 77, 94 cluster point 24, 80, 94 coarse topology 93 coarseness 438 codomain 6 coin tossing 1 80, 191 collection 3 compact 95 compact set 22, 77 compact space 77 compactificaton 107 complement 3 complete measure space 39 complete space 80 completely regular sequence 209 completely regular space 100 completeness 97 completion 39 complex conjugate 162 complex number 162 composite mapping 7 concave function 133 conditional characteristic function 171-3 distribution function 143 expectation 144, 147 linearity 1 5 1 optimal predictor 150 variance of 156 versions of 148 Fatou's lemma 152 Jensen' s inequality 1 53 Markov inequality 1 52 modulus inequality 1 5 1 monotone convergence theorem 152 probability 1 13 variance 238, 316 conditionally heteroscedastic 214 consistency properties 1 83, 435 consistency theorem function spaces 436 sequences 1 84 contingent probability 1 14 continuity 1 12
529 of a measure 38 continuous distribution 122 continuous function 27, 84, 436 continuous mapping 97 continuous mapping theorem 355, 497 metric spaces 421 continuous time martingale 501 continuous time process 500 continuous truncation 271 , 309 continuously differentiable 29 contraction mapping 263 converge absolutely 3 1 convergence 94 almost sure 178, 28 1, 33 1 in distribution 347, 367 in Lp norm 287, 33 1 in mean square 179, 287 in probability 179, 284, 349, 359, 367 in probability law 347 metric space 80 on a subsequence 284 real sequence 23 space of probability measures 418 stochastic function 333 transformations 285 uniform, 30, 33 1 weak 179 with probability 1 179, 33 1 convergence lemma 306 convergence-determining class 420 convex function 133, 339 convolution 161 coordinate 1 77 coordinate projections 102, 434 coordinate space 48 coordinate transformation 126 correlation coefficient 138 correspondence 6 countability axioms 94 countable additivity 36 countable set 8 countable subadditivity 37, 1 1 1 countably compact 95
530 covariance 136 covariance matrix 137 covariance stationarity 193 covering 22, 77 Cox, D. R. 503 c inequality 140 Cramer, Harald 141 Cramer-Wold device 405, 455, 490, 516 Cramer's theorem 355 cross moments 136 cumulative distribution function 1 17 cylinder set 48, 1 15 of R 434 r
D 456 Billingsley metric on 464 Davidson, James 301 , 386 de Jong, R. M. 3 1 9, 326 de Morgan's laws 4 decimal expansion 10 decreasing function 29 decreasing sequence of real numbers 23 of sets 12 degenerate distribution 349 degree of belief 1 1 1 Dellacherie, C. 328 dense 23 dense set 77 density function 74, 120 denumerable 8 dependent events 1 13 derivative 29 conditional expectation 153 expectation 141 derived sequence 177 determining class 40, 121, 127, 420 diagonal argument 10 diagonal method 35 diffeomorphism 126 difference, set 3 differentiable 29 differential calculus 28
Index diffusion process 503 discontinuity, jump 457 discontinuity of first kind 457 discrete distribution 122, 129 discrete metric 76 disc�ete subset 80 discrete topology 93 disjoint 3 distribution 1 17 domain 6 dominated convergence theorem 63 Donsker, M. D. 450 Donsker' s theorem 450 Doob, J. L. 196, 216, 235, 3 14 Doob decomposition 23 1 Doob's inequality 241 continuous time 501 Doob-Mayer decomposition 502 double integral 66, 136 drift 444 Dudley, R. M. 328, 457 Durlauf, S. N. 490, 5 1 6 dyadic rationals 26, 99, 439 dynamic stability 263 Dynkin, E. B. 1 8 Dynkin's 7t-A theorem 1 8 £-neighbourhood 20 see also sphere E-net 78 Egoroff s theorem 282, 5 1 0 element, set 3 embedding 86, 97, 105 empirical distribution function 332 empty set 3, 77 Engle, R. F. 403 ensemble average 195 envelope 329 equally likely 1 80 equicontinuity 90, 335 stochastic 336 strong stochastic 336 equipotent 6 equivalence class 5, 46 equivalence relation 5 equivalent measures 69
Index equivalent metrics 76 equivalent sequences 307, 382 ergodic theorem 200 law of large numbers 291 ergodicity 199 asymptotic independence 202 Cesaro-summability of autocovariances 201 Esseen, Carl-Gustav 171 essential supremum 1 17, 132 estimator 177 Euclidean distance 20, 75 Euclidean k-space 23 Euclidean metric 75, 105 evaluation map 105 even numbers 8 exogenous 403 expectation 128 exponential function 162 exponential inequality 245 extended functions 52 extended real line 1 2 extended space 1 1 7 extension theorem 1 84 existence part 40 uniqueness part 44, 127 <J>-mixing, see uniform mixing �-analytic sets 449 factor space 48 fair game 289 Fatou' s lemma 63 conditional 152 Feller, W. 32 Feller's theorem 373 field of sets 14 filter 94 filtration 500 fine toplogy 93 fineness 438 finite additivity 36 finite dimensional cylinder sets 1 8 1 , 435 finite dimensional distributions of C 440, 442, 446
53 1 of D 466 Wiener measure 446 finite intersection property 95 finite measure 36 first countable 95 fractals 443 frequentist model 1 1 1 law of large numbers 292 Fubini' s theorem 66, 69, 125 Fuk, A. Kh. 220 function 6 convex 339 of a real variable 27 of bounded variation 29 function space 84, 434 nonseparable 89 functional 84 functional central limit theorem 450, 480 martingale differences 440 multivariate 454, 490 NED functions of strong mixing processes 48 1 NED functions of uniform mixing processes 485 Gallant, A. Ronald 261 , 263, 271 , 401-2 gambling policy 233 gamma function 124 Gaussian distribution 123 characteristic function 167 stable distribution 363 Gaussian family 123 expectation 129 moments 1 3 1 Gaussian vector 126 generic uniform convergence 336 geometric series 3 1 Glivenko-Cantelli theorem 332 global stationarity 194, 388, 450, 486 Gordin, M. 385 Gorodetskii, V. 215, 219, 220 graph 6
532 Hahn decomposition 70, 72 half line 1 1 , 2 1 , 52, 1 18 half-Gaussian distribution 385 half-normal density 124 half-open interval 1 1 , 15, 1 1 8 Hall, P. 250, 3 14, 385, 409 Halmos, Paul R. 202 Hansen, Bruce E. 3 1 8, 403, 5 10 Hartman-Wintner theorem 408 Hausdorff metric 83, 469 Hausdorff space 98 Heine-Bore! theorem 23 Helly-Bray theorem 353 Helly's selection theorem 360 Heyde, C. C. 250, 3 14, 385, 409 Hoadley, Bruce 336 Hoeffding' s inequality 245 Holder's inequality 138 homeomorphism 27, 51, 86, 97, 105 i.i.d. 193 Ibragimov, I. A. 204-5, 210, 21 1 , 2 1 5 , 216, 261 identically distributed 193 image 6 imaginary number 162 inclusion-exclusion formula 37, 420 increasing 29 . function .mcreasmg sequence of real numbers 23 of sets 12 independence 1 14 independent Brownian motions 455 independent r.v.s 127, 1 54-5, 161 independent sequence 192 strong law of large numbers 3 1 1 independent subfields 1 54 index set 3, 177 indicator function 53, 128 inferior limit of real sequence 25 of set sequence 13 infimum 12 infinite run of heads 1 80
Index infinite-dimensional cube 83, 105 infinite-dimensional Euclidean space 83, 104 infinitely divisible distribution 362 initial conditions 194 inner measure 41 innovation sequence 215, 23 1 integers 9 integrability 61 integral 57 integration by parts 58, 129, 134 interior 21 , 77 intersection 3 interval 1 1 into 6 invariance principle 450, 497 invariant event 195 invariant r.v. 196 inverse image 6 inverse projection 48 inversion theorem 168-70 irrational numbers 10, 80 isolated point 2 1 , 77 isometry 86 isomorphic spaces 5 1 isomorphism 145 iterated integral 66, 135 Ito integral 507 J1 metric 459 Jacobian matrix 126 Jensen's inequality 133 conditional 1 53 Jordan decomposition 7 1 jump discontinuities 457 jump points 120 Karatzas, Joannis 502-3, 508 kernel estimator 403 Khinchine' s theorem 368 Kolmogorov, A. N. 209-10, 3 1 1-2 Kolmogorov consistency theorem 1 84 Kolmogorov's inequality 240
Index Kolmogorov ' s zero-one law 204 Kopp, P. E. 504 Kronecker product 5 1 6 Kronecker' s lemma 34, 293, 307 Kurtz, T. G. 5 1 0 A.-system 1 8 Hopital's rule 361 Lp convergence 287 uniform 33 1 Lp norm 132 Lp-approximable 274 Lp-bounded 132 £,-dominated 330 largest element 5 latent variables 177 law of iterated expectations 149 law of large numbers 200, 289 Cauchy r.v.s 291 definition of expectation 292 frequentist model 292 random walk 291 uniform 340 law of the iterated logarithm 408 Lebesgue decomposition 69, 72 probability measure 120 Lebesgue integral 57 Lebesgue measure 37, 45-6, 74, 1 12 plane 66 product measure 135 Lebesgue-integrable r.v.s 132 Lebesgue-Stieltjes integral 57-8, 128 left-continuity 27 left-hand derivative 29 Levy, P. 3 1 2 Levy continuity theorem 358 Levy's metric 424, 468 lexicographic ordering 10 Liapunov condition 373 Liapunov' s inequality 139 Liapunov ' s theorem 372 LIE 149 liminf of real sequence 25 I'
533 of set sequence 13 limit 80 expectation of 141 of set sequence 13 limit point 26 lim sup of real sequence 25 of set sequence 13 Lindeberg condition 369, 371 , 380 asymptotic negligibility 376 uniform integrability 372 Lindeberg theorem 369 Lindeberg-Feller theorem, see Lindeberg theorem; Feller's theorem Lindeberg-Levy theorem 366 FCLT 449 LindelOf property 78-79, 95 LindelOf space 98 LindelOf' s covering theorem 22 linear ordering 5 linear process 193, 247, 252 strong law of large numbers 326 linearity of conditional expectation 1 5 1 of integral 62 Linnik, Yu. V. 204-5, 210, 215, 216 Lipschitz condition 28, 86, 269 stochastic 338 Little Oh 3 1 , 1 87 Loeve, M. 32, 407 Loeve's c, inequality 140 log-likelihood 177 lower integral 57 lower semicontinuity 86 lower variation 71 MA process, see moving average process Magnus, J. R. 5 1 6 Mandelbrot, Benoit 443 Mann, H. B. 187 mapping 6 marginal c.d.f. 126 marginal distributions of a
534 sequence 1 86 marginal measures 64 marginal probability measures 1 1 5 tightness 430 Markov inequality 132, 135 conditional 152 Markov process 503 martingale 229 continuous time 501 convergence 235 martingale difference array 232 weak law of large numbers 298 martingale difference sequence 230 strong law of large numbers 3 14 maximal inequalities for linear processes 256 for martingales 240 for mixingales 252 maximum metric 76 McKean, H. P., Jr. 508 McLeish, D. L. 247, 261 , 318, 380 mean 128 mean reversion 214 mean stationarity 193 mean value theorem 340 mean-square convergence 293 measurability of suprema 327, 449 measurable function 1 17 measurable isomorphism 5 1 measurable rectangle 50, 48 measurable set 4 1 measurable space 36 measurable transformation 50 measure 36 measure space 36 measure-preserving transformation 191 mixes outcomes 200 memory 192 method of subsequences 295 metric 75 metric space 75, 96 metrically transitive 199 metrizable space 93 metrization 107
Index Meyer, P.-A. 328 Miller, H. D. 503 Minkowski's inequality 139 mixed continuous-discrete distribution 129 mixed, Gaussian distribution 404 mixing 202 inequalities 2 1 1-4 MA processes 2 1 5 martini example 202 measurable functions 21 1 mixing process strong law of large numbers 323 mixing sequence 204 mixing size 210 see also size mixingale 247 stationary 250 strong law of large numbers 3 1 8 weak law of large numbers 301 mixingale array 249 modulus 162 modulus inequality 63 complex r.v.s 163 conditional 151 modulus of continuity 91, 335, 439, 479 cadlag functions 458, 468 moment generating function 162 moments 1 3 1 Gaussian distribution 123 monkeys typing Shakespeare 1 80 monotone class 17 monotone convergence theorem 60, 6: conditional 1 52 envelope theorem 329 monotone function 29 monotone sequence of real numbers 23 of sets 12 monotonicity of measure 3 7 of p.m. 1 1 1 Monte Carlo 498 moving average process 193 mixing 2 1 5
  strong law of large numbers 326
multinormal distribution, see multivariate Gaussian
multinormal p.d.f. 126
multivariate c.d.f. 125
multivariate FCLT 490
multivariate Gaussian
  affine transformations 170
  characteristic function 168
mutually singular measures 69
Nagaev, S. V. 220
naive set theory 47
natural numbers 8
near-epoch dependence 261
  mixingales 264
  transformations 267
near-epoch dependent process
  strong law of large numbers 323
  weak law of large numbers 302
nearly measurable set 329
NED, see near-epoch dependence
negative set 70
neighbourhood 20
  see also sphere
nested subfields 155
net 94
Neudecker, H. 516
Newey, W. K. 336, 401
non-decreasing function 29
non-decreasing sequence
  of real numbers 23
  of sets 12
non-increasing function 29
non-increasing sequence
  of real numbers 23
  of sets 12
non-measurable function 55
non-measurable set 46
norm inequality 139
  for prediction errors 157
normal distribution 123
normal law of error 364
normal number theorem 290
normal space 98
null set 3, 70
odd-order moments 131
one-dimensional cylinder 182
one-to-one 6
onto 6, 97
open covering 22, 77
open interval 11
open mapping 27
open rectangles 102
open set 77, 93
  of real line 20
order-preserving mapping 6
ordered pairs 7
orders of magnitude 31
origin 10
Ornstein-Uhlenbeck process 445
  diffusion process 503
outer measure 41
outer product 137
π-λ theorem 18, 44, 49, 67
π-system 18
p.d.f., see probability density function
p.m., see probability measure
pairwise independence 114
pairwise independent r.v.s 127, 136
Pantula, Sastry G. 215
parameter space 327
Park, J. Y. 516
Parthasarathy, K. R. 418, 422, 426, 427, 429, 469
partial knowledge 145
partial ordering 5
partial sums 31
partition 3
  of [0, 1] 438
permutation of indices 436
Pham, Tuan D. 215
Phillips, P. C. B. 490, 510, 516
piecewise linear functions 437
pigeon hole principle 8
Pitman drift 369
pointwise convergence 30
  stochastic 331
Poisson distribution 122, 348
  characteristic function 167
  expectation 129
  infinitely divisible 362
polar coordinates 162
Pollard, David 457
positive semi-definite 137
positive set 70
Pötscher, B. M. 261, 274, 277, 336, 342
power set 13
precompact set 78
predictable component 231
probability density function 122
probability measure 111
  weak topology 418
probability space 111, 117
product measure 64
product space 7, 48, 102, 115
product topology 102
  function spaces 453
product, set 5
progressively measurable 500
progressively measurable functions 504
projection 8, 48, 50
  suprema 328
  not measurable 328
projection σ-field 415, 435
  of C 440
  of D 456
Prokhorov, Yu. V. 422, 423, 457
Prokhorov metric 424, 467
Protter, P. 510
Prucha, I. 261, 274, 277, 336, 342
pseudo-metric 75, 504
q-dependence 215
quadratic variation 238, 502
  deterministic 503
ρ-mixing 207
R∞, the space 434
r.v., see random variable
Radon-Nikodym derivative 70, 74, 120, 148
Radon-Nikodym theorem 70, 72-4, 122
random element 413, 111
random event 111
random experiment 111, 128
random field 178
random pair 124
random sequence 177
  memory of 192
random variable 112, 117
random vector 137
random walk 230, 291
random weighting 316
range 6
Rao, C. R. 143
rate of convergence 294
rational numbers 9
real numbers 10
real plane 11
real-valued function 27, 87, 434
realization 179
refinement 438
reflexive relation 5
regression coefficients 177
regular measure 413
regular sequence 204
regular space 98
regularly varying function 32
relation 5
relative compactness 422
relative frequencies 128
relative topology 20, 93
relatively compact 77
remote σ-field 203
repeated sampling 179
restriction of a measure space 37
Riemann integral 57
  expectation 141
  of a random function 496
Riemann zeta function 32
Riemann-Stieltjes integral 58
  and stochastic integral 503
right continuous 27
  c.d.f. 119-20
  filtration 500
right-hand derivative 29
ring 14
Rozanov, Yu. A. 209-10
σ-algebra 15
σ-field 15
σ-finite measure 36
St Petersburg paradox 289
sample 177
sample average 128
sample path 179
sample space 111
scale transformation 178
seasonal adjustment 262
second countable 79, 95
Serfling, R. J. 213
self-similarity 443
semi-algebra 15
semi-ring 15, 40, 48, 49, 118
semimartingale 233
  in continuous time 501
separable set 78
separable space 95
separating function 98
separation axioms 97
sequence 9, 94
  metric space 80
  real 23
sequentially compact 95, 422
serial independence 192
series 31
set 3
set function 36
set of measure zero 38, 70
shift transformation 191
shocks 215
Shreve, Steven E. 502-3, 508
signed measure 70
simple function 53
  approximation by 54
  integral of 59
simple random variables 128
singleton 11
singular Gaussian distribution 170
singular measures 69, 120
size
  mixing 210
  mixingale 247
  near-epoch dependence 262
Skorokhod, A. V. 350, 457, 459
Skorokhod metric 459, 469
Skorokhod representation 350, 431, 510
Skorokhod topology 461
slowly varying function 32
Slutsky's theorem 287
smallest element 5
Souslin space 328
spectral density function 209
  MA process 215
sphere 76
  of C 440
  of D 466
stable distribution 362
stationarity 193
step function 119, 349
Stinchcombe, M. B. 328
stochastic convergence, see under convergence
stochastic equicontinuity 336
  termwise 341
stochastic integral 503
stochastic process 177
  continuous time 500
stochastic sequence 177
stopped process 234
stopping rule 234
stopping time 233
  filtration 501
Stout, W. F. 314, 409
Strasser, H. 510
strict ordering 5
strict stationarity 193
strong law of large numbers 289
  for independent sequence 310
  for Lp-bounded sequence 295, 312, 314
  for martingales 314-17
  for mixingales 319-23
  for NED functions of mixing processes 324-6
strong mixing
  autoregressive processes 216
  coefficient 206
  negligible events 207
  smoothness of the p.d.f. 225
  sufficient conditions in MA processes 219-27
  see also mixing
strong mixing sequence 209
  law of large numbers 295, 298
strong topology 93
strongly dependent sequences 210
strongly exogenous 403
sub-base 101, 103
subcovering 22
subfield measurability 145
subjective probability 111
submartingale 233
  continuous time 501
subsequence 24, 80
subset 3
subspace 20, 93
summability 31
superior limit
  of real sequence 25
  of set sequence 13
supermartingale 233
  continuous time 501
support 37, 119
supremum 12
sure convergence 178
symmetric difference 3
symmetric relation 5
T1-space 98
T2-space 98
T3-space 98
T3½-space 100
T4-space 98
tail sums 31
taxicab metric 75
Taylor's theorem 165
telescoping sum 250
  proof of weak law 301
termwise stochastic equicontinuity 341
three series theorem 311
tight measure 360, 427
time average 195
  and ensemble average 200
  limit under stationarity 196
time plot 437
time series 177
Toeplitz's lemma 34, 35
Tonelli's theorem 68
topological space 93
topology 93
  real line 20
  weak convergence 418
torus 102
total independence 114
total variation 72
totally bounded 78, 79
totally independent r.v.s 127
trace 112
Tran, Lanh T. 215
transformation 6
transformed Brownian motion 486
  diffusion process 503
transitive relation 5
trending moments 262
triangle inequality 31, 139
triangular array 34
  stochastic 178
trivial topology 93
truncated kernel 403
truncation 298, 308
  continuous 271, 309
two-dimensional cylinder 182
Tychonoff space 100
Tychonoff topology 103, 181, 461
Tychonoff's theorem 104
uncorrelated sequence 230, 293
uncountable 10
uniform conditions 186
uniform continuity 28, 85
uniform convergence 30
uniform distribution 112, 122, 364
  convolution 161
  expectation 129
uniform equicontinuity 90
uniform integrability 188
  squared partial sums 257
uniform laws of large numbers 340
uniform Lipschitz condition 28, 87
uniform metric 87
  parameter space 327
uniform mixing
  autoregressive process 218
  coefficient 206
  moving average process 227
  see also mixing
uniform mixing sequence 209
  strong law of large numbers 298
  weak law of large numbers 295
uniform stochastic convergence 331
uniform tightness 359, 427
  Arzelà-Ascoli theorem 447
  measures on C 447, 448
uniformly bounded
  a.s. 186
  in Lp norm 132, 186
  in probability 186
union 3
universally measurable set 328
upcrossing inequality 236
upper integral 57
upper semicontinuity 86, 470
upper variation 71
Urysohn's embedding theorem 106
Urysohn's lemma 98
Urysohn's metrization theorem 98
usual topology of the line 20
Varadarajan, V. S. 425
variable 117
variance 131
  sample mean 293
variance-transformed Brownian motion 486
vector 29
vector Brownian motion 498
vector martingale difference 234
Venn diagram 4
versions of conditional expectation 148
von Bahr, Bengt 171
Wald, A. 187
weak convergence 179, 347
  in metric space 418
  of sums 361
weak dependence 210
weak law of large numbers 289
  for L1-approximable process 304
  for L1-mixingale 302
  for L2-bounded sequence 293
  for partial sums 312
weak topology 93, 101
  space of probability measures 418
Wei, C. Z. 510
well-ordered 5
West, K. 401
White, Halbert 210, 261, 263, 271, 328, 401-2, 480-1
wide-sense stationarity 193
Wiener, Norbert 442
Wiener measure 442, 474
  existence of 446, 452
with probability 1 113
Withers, C. S. 215
Wooldridge, Jeffrey M. 480-1
zero-one law 204