Asymptotic Statistics
This book is an introduction to the field of asymptotic statistics. The treatment is both practical and mathematically rigorous. In addition to most of the standard topics of an asymptotics course, including likelihood inference, M-estimation, asymptotic efficiency, U-statistics, and rank procedures, the book also presents recent research topics such as semiparametric models, the bootstrap, and empirical processes and their applications. One of the unifying themes is the approximation by limit experiments. This entails mainly the local approximation of the classical i.i.d. set-up with smooth parameters by location experiments involving a single, normally distributed observation. Thus, even the standard subjects of asymptotic statistics are presented in a novel way. Suitable as a text for a graduate or Master's level statistics course, this book also gives researchers in statistics, probability, and their applications an overview of the latest research in asymptotic statistics. A.W. van der Vaart is Professor of Statistics in the Department of Mathematics and Computer Science at the Vrije Universiteit, Amsterdam.
CAMBRIDGE SERIES IN STATISTICAL AND PROBABILISTIC MATHEMATICS
Editorial Board:
R. Gill, Department of Mathematics, Utrecht University
B.D. Ripley, Department of Statistics, University of Oxford
S. Ross, Department of Industrial Engineering, University of California, Berkeley
M. Stein, Department of Statistics, University of Chicago
D. Williams, School of Mathematical Sciences, University of Bath

This series of high-quality upper-division textbooks and expository monographs covers all aspects of stochastic applicable mathematics. The topics range from pure and applied statistics to probability theory, operations research, optimization, and mathematical programming. The books contain clear presentations of new developments in the field and also of the state of the art in classical methods. While emphasizing rigorous treatment of theoretical methods, the books also contain applications and discussions of new techniques made possible by advances in computational practice.

Already published
1. Bootstrap Methods and Their Application, by A.C. Davison and D.V. Hinkley
2. Markov Chains, by J. Norris
Asymptotic Statistics
A.W. VAN DER VAART
CAMBRIDGE UNIVERSITY PRESS
PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE The Pitt Building, Trumpington Street, Cambridge, United Kingdom
CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, UK
http://www.cup.cam.ac.uk
40 West 20th Street, New York, NY 10011-4211, USA
http://www.cup.org
10 Stamford Road, Oakleigh, Melbourne 3166, Australia Ruiz de Alarcon 13, 28014 Madrid, Spain
© Cambridge University Press 1998

This book is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 1998
First paperback edition 2000
Printed in the United States of America
Typeset in Times Roman 10/12.5 pt in LaTeX 2e [TB]
A catalog record for this book is available from the British Library
Library of Congress Cataloging in Publication data
Vaart, A.W. van der
Asymptotic statistics / A.W. van der Vaart.
p. cm. - (Cambridge series in statistical and probabilistic mathematics)
Includes bibliographical references.
1. Mathematical statistics - Asymptotic theory. I. Title. II. Series: Cambridge series in statistical and probabilistic mathematics.
QA276.V22 1998
519.5-dc21 98-15176

ISBN 0 521 49603 9 hardback
ISBN 0 521 78450 6 paperback
To Maryse and Marianne
Contents

Preface
Notation

1. Introduction
 1.1. Approximate Statistical Procedures
 1.2. Asymptotic Optimality Theory
 1.3. Limitations
 1.4. The Index n

2. Stochastic Convergence
 2.1. Basic Theory
 2.2. Stochastic o and O Symbols
 *2.3. Characteristic Functions
 *2.4. Almost-Sure Representations
 *2.5. Convergence of Moments
 *2.6. Convergence-Determining Classes
 *2.7. Law of the Iterated Logarithm
 *2.8. Lindeberg-Feller Theorem
 *2.9. Convergence in Total Variation
 Problems

3. Delta Method
 3.1. Basic Result
 3.2. Variance-Stabilizing Transformations
 *3.3. Higher-Order Expansions
 *3.4. Uniform Delta Method
 *3.5. Moments
 Problems

4. Moment Estimators
 4.1. Method of Moments
 *4.2. Exponential Families
 Problems

5. M- and Z-Estimators
 5.1. Introduction
 5.2. Consistency
 5.3. Asymptotic Normality
 *5.4. Estimated Parameters
 5.5. Maximum Likelihood Estimators
 *5.6. Classical Conditions
 *5.7. One-Step Estimators
 *5.8. Rates of Convergence
 *5.9. Argmax Theorem
 Problems

6. Contiguity
 6.1. Likelihood Ratios
 6.2. Contiguity
 Problems

7. Local Asymptotic Normality
 7.1. Introduction
 7.2. Expanding the Likelihood
 7.3. Convergence to a Normal Experiment
 7.4. Maximum Likelihood
 *7.5. Limit Distributions under Alternatives
 *7.6. Local Asymptotic Normality
 Problems

8. Efficiency of Estimators
 8.1. Asymptotic Concentration
 8.2. Relative Efficiency
 8.3. Lower Bound for Experiments
 8.4. Estimating Normal Means
 8.5. Convolution Theorem
 8.6. Almost-Everywhere Convolution Theorem
 *8.7. Local Asymptotic Minimax Theorem
 *8.8. Shrinkage Estimators
 *8.9. Achieving the Bound
 *8.10. Large Deviations
 Problems

9. Limits of Experiments
 9.1. Introduction
 9.2. Asymptotic Representation Theorem
 9.3. Asymptotic Normality
 9.4. Uniform Distribution
 9.5. Pareto Distribution
 9.6. Asymptotic Mixed Normality
 9.7. Heuristics
 Problems

10. Bayes Procedures
 10.1. Introduction
 10.2. Bernstein-von Mises Theorem
 10.3. Point Estimators
 *10.4. Consistency
 Problems

11. Projections
 11.1. Projections
 11.2. Conditional Expectation
 11.3. Projection onto Sums
 *11.4. Hoeffding Decomposition
 Problems

12. U-Statistics
 12.1. One-Sample U-Statistics
 12.2. Two-Sample U-Statistics
 *12.3. Degenerate U-Statistics
 Problems

13. Rank, Sign, and Permutation Statistics
 13.1. Rank Statistics
 13.2. Signed Rank Statistics
 13.3. Rank Statistics for Independence
 *13.4. Rank Statistics under Alternatives
 13.5. Permutation Tests
 *13.6. Rank Central Limit Theorem
 Problems

14. Relative Efficiency of Tests
 14.1. Asymptotic Power Functions
 14.2. Consistency
 14.3. Asymptotic Relative Efficiency
 *14.4. Other Relative Efficiencies
 *14.5. Rescaling Rates
 Problems

15. Efficiency of Tests
 15.1. Asymptotic Representation Theorem
 15.2. Testing Normal Means
 15.3. Local Asymptotic Normality
 15.4. One-Sample Location
 15.5. Two-Sample Problems
 Problems

16. Likelihood Ratio Tests
 16.1. Introduction
 *16.2. Taylor Expansion
 16.3. Using Local Asymptotic Normality
 16.4. Asymptotic Power Functions
 16.5. Bartlett Correction
 *16.6. Bahadur Efficiency
 Problems

17. Chi-Square Tests
 17.1. Quadratic Forms in Normal Vectors
 17.2. Pearson Statistic
 17.3. Estimated Parameters
 17.4. Testing Independence
 *17.5. Goodness-of-Fit Tests
 *17.6. Asymptotic Efficiency
 Problems

18. Stochastic Convergence in Metric Spaces
 18.1. Metric and Normed Spaces
 18.2. Basic Properties
 18.3. Bounded Stochastic Processes
 Problems

19. Empirical Processes
 19.1. Empirical Distribution Functions
 19.2. Empirical Distributions
 19.3. Goodness-of-Fit Statistics
 19.4. Random Functions
 19.5. Changing Classes
 19.6. Maximal Inequalities
 Problems

20. Functional Delta Method
 20.1. von Mises Calculus
 20.2. Hadamard-Differentiable Functions
 20.3. Some Examples
 Problems

21. Quantiles and Order Statistics
 21.1. Weak Consistency
 21.2. Asymptotic Normality
 21.3. Median Absolute Deviation
 21.4. Extreme Values
 Problems

22. L-Statistics
 22.1. Introduction
 22.2. Hajek Projection
 22.3. Delta Method
 22.4. L-Estimators for Location
 Problems

23. Bootstrap
 23.1. Introduction
 23.2. Consistency
 23.3. Higher-Order Correctness
 Problems

24. Nonparametric Density Estimation
 24.1. Introduction
 24.2. Kernel Estimators
 24.3. Rate Optimality
 24.4. Estimating a Unimodal Density
 Problems

25. Semiparametric Models
 25.1. Introduction
 25.2. Banach and Hilbert Spaces
 25.3. Tangent Spaces and Information
 25.4. Efficient Score Functions
 25.5. Score and Information Operators
 25.6. Testing
 *25.7. Efficiency and the Delta Method
 25.8. Efficient Score Equations
 25.9. General Estimating Equations
 25.10. Maximum Likelihood Estimators
 25.11. Approximately Least-Favorable Submodels
 25.12. Likelihood Equations
 Problems

References
Index
Preface
This book grew out of courses that I gave at various places, including a graduate course in the Statistics Department of Texas A&M University, Master's level courses for mathematics students specializing in statistics at the Vrije Universiteit Amsterdam, a course in the DEA program (graduate level) of Université de Paris-Sud, and courses in the Dutch AIO-netwerk (graduate level).

The mathematical level is mixed. Some parts I have used for second year courses for mathematics students (but they find it tough), other parts I would only recommend for a graduate program. The text is written both for students who know about the technical details of measure theory and probability, but little about statistics, and vice versa. This requires brief explanations of statistical methodology, for instance of what a rank test or the bootstrap is about, and there are similar excursions to introduce mathematical details. Familiarity with (higher-dimensional) calculus is necessary in all of the manuscript. Metric and normed spaces are briefly introduced in Chapter 18, when these concepts become necessary for Chapters 19, 20, 21 and 22, but I do not expect that this would be enough as a first introduction. For Chapter 25 basic knowledge of Hilbert spaces is extremely helpful, although the bare essentials are summarized at the beginning. Measure theory is implicitly assumed in the whole manuscript but can at most places be avoided by skipping proofs, by ignoring the word "measurable" or with a bit of handwaving. Because we deal mostly with i.i.d. observations, the simplest limit theorems from probability theory suffice. These are derived in Chapter 2, but prior exposure is helpful.

Sections, results or proofs that are preceded by asterisks are either of secondary importance or are out of line with the natural order of the chapters. As the chart in Figure 0.1 shows, many of the chapters are independent from one another, and the book can be used for several different courses.

A unifying theme is approximation by a limit experiment. The full theory is not developed (another writing project is on its way), but the material is limited to the "weak topology" on experiments, which in 90% of the book is exemplified by the case of smooth parameters of the distribution of i.i.d. observations. For this situation the theory can be developed by relatively simple, direct arguments. Limit experiments are used to explain efficiency properties, but also why certain procedures asymptotically take a certain form.

A second major theme is the application of results on abstract empirical processes. These already have benefits for deriving the usual theorems on M-estimators for Euclidean parameters but are indispensable if discussing more involved situations, such as M-estimators with nuisance parameters, chi-square statistics with data-dependent cells, or semiparametric models. The general theory is summarized in about 30 pages, and it is the applications that we focus on.
Figure 0.1. Dependence chart. A solid arrow means that a chapter is a prerequisite for a next chapter. A dotted arrow means a natural continuation. Vertical or horizontal position has no independent meaning.
In a sense, it would have been better to place this material (Chapters 18 and 19) earlier in the book, but instead we start with material of more direct statistical relevance and of a less abstract character. A drawback is that a few (starred) proofs point ahead to later chapters.

Almost every chapter ends with a "Notes" section. These are meant to give a rough historical sketch, and to provide entries in the literature for further reading. They certainly do not give sufficient credit to the original contributions by many authors and are not meant to serve as references in this way.

Mathematical statistics obtains its relevance from applications. The subjects of this book have been chosen accordingly. On the other hand, this is a mathematician's book in that we have made some effort to present results in a nice way, without the (unnecessary) lists of "regularity conditions" that are sometimes found in statistics books. Occasionally, this means that the accompanying proof must be more involved. If this means that an idea could get lost, then an informal argument precedes the statement of a result. This does not mean that I have strived after the greatest possible generality. A simple, clean presentation was the main aim.

Leiden, September 1997
A.W. van der Vaart
Notation
A*                           adjoint operator
D*                           dual space
C_b(T), UC(T), C(T)          (bounded, uniformly) continuous functions on T
$\ell^\infty(T)$             bounded functions on T
$L_r(Q)$, $\mathcal{L}_r(Q)$ measurable functions whose rth powers are Q-integrable
$\|\cdot\|_{Q,r}$            norm of $L_r(Q)$
$\|z\|_\infty$, $\|z\|_T$    uniform norm
lin                          linear span
$\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$   number fields and sets
$\lesssim$                   smaller than up to a constant
$\rightsquigarrow$           convergence in distribution
$\xrightarrow{P}$            convergence in probability
$\xrightarrow{as}$           convergence almost surely
$N(\varepsilon, T, d)$, $N_{[\,]}(\varepsilon, T, d)$    covering and bracketing number
$J(\varepsilon, T, d)$, $J_{[\,]}(\varepsilon, T, d)$    entropy integral
$o_P(1)$, $O_P(1)$           stochastic order symbols
1 Introduction
Why asymptotic statistics? The use of asymptotic approximations is twofold. First, they enable us to find approximate tests and confidence regions. Second, approximations can be used theoretically to study the quality (efficiency) of statistical procedures.
1.1 Approximate Statistical Procedures
To carry out a statistical test, we need to know the critical value for the test statistic. In most cases this means that we must know the distribution of the test statistic under the null hypothesis. Sometimes this is known exactly, but more often only approximations are available. This may be because the distribution of the statistic is analytically intractable, or perhaps the postulated statistical model is considered only an approximation of the true underlying distributions. In both cases the use of an approximate critical value may be fully satisfactory for practical purposes.

Consider for instance the classical t-test for location. Given a sample of independent observations $X_1, \ldots, X_n$, we wish to test a null hypothesis concerning the mean $\mu = \mathrm{E}X$. The t-test is based on the quotient of the sample mean $\bar X_n$ and the sample standard deviation $S_n$. If the observations arise from a normal distribution with mean $\mu_0$, then the distribution of $\sqrt{n}(\bar X_n - \mu_0)/S_n$ is known exactly: It is a t-distribution with $n - 1$ degrees of freedom. However, we may have doubts regarding the normality, or we might even believe in a completely different model. If the number of observations is not too small, this does not matter too much. Then we may act as if $\sqrt{n}(\bar X_n - \mu_0)/S_n$ possesses a standard normal distribution. The theoretical justification is the limiting result, as $n \to \infty$,

$$\frac{\sqrt{n}(\bar X_n - \mu_0)}{S_n} \rightsquigarrow N(0, 1),$$

provided the variables $X_i$ have a finite second moment. This variation on the central limit theorem is proved in the next chapter. A "large sample" level $\alpha$ test is to reject $H_0: \mu = \mu_0$ if $\bigl|\sqrt{n}(\bar X_n - \mu_0)/S_n\bigr|$ exceeds the upper $\alpha/2$ quantile of the standard normal distribution. Table 1.1 gives the significance level of this test if the observations are either normally or exponentially distributed, and $\alpha = 0.05$. For $n \ge 20$ the approximation is quite reasonable in the normal case. If the underlying distribution is exponential, then the approximation is less satisfactory, because of the skewness of the exponential distribution.
Table 1.1. Level of the test with critical region $|\sqrt{n}(\bar X_n - \mu_0)/S_n| > 1.96$ if the observations are sampled from the normal or exponential distribution.

  n      Normal    Exponential (a)
  5      0.122     0.19
  10     0.082     0.14
  15     0.070     0.11
  20     0.065     0.10
  25     0.062     0.09
  50     0.056     0.07
  100    0.053     0.06

(a) The third column gives approximations based on 10,000 simulations.
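The exponential column of Table 1.1 can be approximated by simulation. The sketch below is an added illustration, not part of the original text; it assumes NumPy is available and, as in the table, uses 10,000 simulated samples per sample size.

```python
import numpy as np

def rejection_rate(sample_size, n_reps=10_000, crit=1.96, seed=0):
    """Estimate P(|sqrt(n) * (mean - mu0) / S_n| > crit) for Exponential(1) data."""
    rng = np.random.default_rng(seed)
    # Exponential(1) observations have true mean mu0 = 1.
    x = rng.exponential(scale=1.0, size=(n_reps, sample_size))
    mean = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)                  # sample standard deviation S_n
    t_stat = np.sqrt(sample_size) * (mean - 1.0) / s
    return np.mean(np.abs(t_stat) > crit)

for n in (5, 10, 20, 50, 100):
    print(n, rejection_rate(n))
```

The estimated levels exceed the nominal 0.05 for small n and approach it as n grows, in line with the table.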
In many ways the t-test is an uninteresting example. There are many other reasonable test statistics for the same problem. Often their null distributions are difficult to calculate. An asymptotic result similar to the one for the t-statistic would make them practically applicable at least for large sample sizes. Thus, one aim of asymptotic statistics is to derive the asymptotic distribution of many types of statistics.

There are similar benefits when obtaining confidence intervals. For instance, the given approximation result asserts that $\sqrt{n}(\bar X_n - \mu)/S_n$ is approximately standard normally distributed if $\mu$ is the true mean, whatever its value. This means that, with probability approximately $1 - 2\alpha$,

$$-z_\alpha \le \frac{\sqrt{n}(\bar X_n - \mu)}{S_n} \le z_\alpha.$$

This can be rewritten as the confidence statement $\mu = \bar X_n \pm z_\alpha S_n/\sqrt{n}$ in the usual manner. For large $n$ its confidence level should be close to $1 - 2\alpha$.

As another example, consider maximum likelihood estimators $\hat\theta_n$ based on a sample of size $n$ from a density $p_\theta$. A major result in asymptotic statistics is that in many situations $\sqrt{n}(\hat\theta_n - \theta)$ is asymptotically normally distributed with zero mean and covariance matrix the inverse of the Fisher information matrix $I_\theta$. If $Z$ is $k$-variate normally distributed with mean zero and nonsingular covariance matrix $\Sigma$, then the quadratic form $Z^T \Sigma^{-1} Z$ possesses a chi-square distribution with $k$ degrees of freedom. Thus, acting as if $\sqrt{n}(\hat\theta_n - \theta)$ possesses an $N_k(0, I_\theta^{-1})$ distribution, we find that the ellipsoid

$$\Bigl\{\theta : (\theta - \hat\theta_n)^T I_{\hat\theta_n} (\theta - \hat\theta_n) \le \frac{\chi^2_{k,\alpha}}{n}\Bigr\}$$

is an approximate $1 - \alpha$ confidence region, if $\chi^2_{k,\alpha}$ is the appropriate critical value from the chi-square distribution. A closely related alternative is the region based on inverting the likelihood ratio test, which is also based on an asymptotic approximation.

1.2 Asymptotic Optimality Theory
For a relatively small number of statistical problems there exists an exact, optimal solution. For instance, the Neyman-Pearson theory leads to optimal (uniformly most powerful) tests in certain exponential family models; the Rao-Blackwell theory allows us to conclude that certain estimators are of minimum variance among the unbiased estimators. An important and fairly general result is the Cramér-Rao bound for the variance of unbiased estimators, but it is often not sharp.

If exact optimality theory does not give results, be it because the problem is untractable or because there exist no "optimal" procedures, then asymptotic optimality theory may help. For instance, to compare two tests we might compare approximations to their power functions. To compare estimators, we might compare asymptotic variances rather than exact variances. A major result in this area is that for smooth parametric models maximum likelihood estimators are asymptotically optimal. This roughly means the following. First, maximum likelihood estimators are asymptotically consistent: The sequence of estimators converges in probability to the true value of the parameter. Second, the rate at which maximum likelihood estimators converge to the true value is the fastest possible, typically $1/\sqrt{n}$. Third, their asymptotic variance, the variance of the limit distribution of $\sqrt{n}(\hat\theta_n - \theta)$, is minimal; in fact, maximum likelihood estimators "asymptotically attain" the Cramér-Rao bound. Thus asymptotics justify the use of the maximum likelihood method in certain situations. It is of interest here that, even though the method of maximum likelihood often leads to reasonable estimators and has great intuitive appeal, in general it does not lead to best estimators for finite samples. Thus the use of an asymptotic criterion simplifies optimality theory considerably.

By taking limits we can gain much insight in the structure of statistical experiments. It turns out that not only estimators and test statistics are asymptotically normally distributed, but often also the whole sequence of statistical models converges to a model with a normal observation. Our good understanding of the latter "canonical experiment" translates directly into understanding other experiments asymptotically. The mathematical beauty of this theory is an added benefit of asymptotic statistics. Though we shall be mostly concerned with normal limiting theory, this theory applies equally well to other situations.
1.3 Limitations
Although asymptotics is both practically useful and of theoretical importance, it should not be taken for more than what it is: approximations. Clearly, a theorem that can be interpreted as saying that a statistical procedure works fine for $n \to \infty$ is of no use if the number of available observations is $n = 5$.

In fact, strictly speaking, most asymptotic results that are currently available are logically useless. This is because most asymptotic results are limit results, rather than approximations consisting of an approximating formula plus an accurate error bound. For instance, to estimate a value $a$, we consider it to be the 25th element $a = a_{25}$ in a sequence $a_1, a_2, \ldots$, and next take $\lim_{n\to\infty} a_n$ as an approximation. The accuracy of this procedure depends crucially on the choice of the sequence in which $a_{25}$ is embedded, and it seems impossible to defend the procedure from a logical point of view. This is why there is good asymptotics and bad asymptotics and why two types of asymptotics sometimes lead to conflicting claims. Fortunately, many limit results of statistics do give reasonable answers. Because it may be theoretically very hard to ascertain that approximation errors are small, one often takes recourse to simulation studies to judge the accuracy of a certain approximation.
Just as care is needed if using asymptotic results for approximations, results on asymptotic optimality must be judged in the right manner. One pitfall is that even though a certain procedure, such as maximum likelihood, is asymptotically optimal, there may be many other procedures that are asymptotically optimal as well. For finite samples these may behave differently and possibly better. Then so-called higher-order asymptotics, which yield better approximations, may be fruitful. See e.g., [7], [52] and [114]. Although we occasionally touch on this subject, we shall mostly be concerned with what is known as "first-order asymptotics."
1.4 The Index n
In all of the following n is an index that tends to infinity, and asymptotics means taking limits as $n \to \infty$. In most situations n is the number of observations, so that usually asymptotics is equivalent to "large-sample theory." However, certain abstract results are pure limit theorems that have nothing to do with individual observations. In that case n just plays the role of the index that goes to infinity.
1.5 Notation
A symbol index is given on page xv.
For brevity we often use operator notation for evaluation of expectations and have special symbols for the empirical measure and process.
For $P$ a measure on a measurable space $(\mathcal{X}, \mathcal{B})$ and $f: \mathcal{X} \mapsto \mathbb{R}^k$ a measurable function, $Pf$ denotes the integral $\int f \, dP$; equivalently, the expectation $\mathrm{E}_P f(X_1)$ for $X_1$ a random variable distributed according to $P$. When applied to the empirical measure $\mathbb{P}_n$ of a sample $X_1, \ldots, X_n$, the discrete uniform measure on the sample values, this yields

$$\mathbb{P}_n f = \frac{1}{n} \sum_{i=1}^n f(X_i).$$

This formula can also be viewed as simply an abbreviation for the average on the right. The empirical process $\mathbb{G}_n f$ is the centered and scaled version of the empirical measure, defined by

$$\mathbb{G}_n f = \sqrt{n}\,(\mathbb{P}_n f - P f).$$
This is studied in detail in Chapter 19, but is used as an abbreviation throughout the book.
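As a small added illustration of this notation (not from the original text; a minimal sketch assuming NumPy), the following computes $\mathbb{P}_n f$ and $\mathbb{G}_n f$ for one concrete function $f$ and a simulated sample:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)              # sample X_1, ..., X_n from P = N(0, 1)
f = lambda t: t ** 2                   # a measurable function f

Pf = 1.0                               # here Pf = E f(X_1) = E X_1^2 = 1 is known
Pn_f = f(x).mean()                     # empirical measure: P_n f = average of f(X_i)
Gn_f = np.sqrt(len(x)) * (Pn_f - Pf)   # empirical process: G_n f = sqrt(n)(P_n f - P f)

print(Pn_f, Gn_f)
```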
2 Stochastic Convergence
This chapter provides a review of basic modes of convergence of sequences of stochastic vectors, in particular convergence in distribution and in probability.
2.1 Basic Theory
A random vector in $\mathbb{R}^k$ is a vector $X = (X_1, \ldots, X_k)$ of real random variables.† The distribution function of $X$ is the map $x \mapsto P(X \le x)$. A sequence of random vectors $X_n$ is said to converge in distribution to a random vector $X$ if

$$P(X_n \le x) \to P(X \le x),$$

for every $x$ at which the limit distribution function $x \mapsto P(X \le x)$ is continuous. Alternative names are weak convergence and convergence in law. As the last name suggests, the convergence only depends on the induced laws of the vectors and not on the probability spaces on which they are defined. Weak convergence is denoted by $X_n \rightsquigarrow X$; if $X$ has distribution $L$, or a distribution with a standard code, such as $N(0, 1)$, then also by $X_n \rightsquigarrow L$ or $X_n \rightsquigarrow N(0, 1)$.
Let $d(x, y)$ be a distance function on $\mathbb{R}^k$ that generates the usual topology. For instance, the Euclidean distance

$$d(x, y) = \|x - y\| = \Bigl(\sum_{i=1}^k (x_i - y_i)^2\Bigr)^{1/2}.$$
11
This is denoted by Xn p d(X , X) -+ 0. t
p -+
11
P (d(X , X)
>
8 ) -+ 0.
>
0
X. In this notation convergence in probability is the same as
More formally it is a Borel measurable map from some probability space in
JFtk.
Throughout it is implic
itly understood that variables X, g(X), and so forth of which we compute expectations or probabilities are measurable maps on some probability space.
5
6
Stochastic Convergence
As we shall see, convergence in probability is stronger than convergence in distribution. An even stronger mode of convergence is almost-sure convergence. The sequence X11 is said to converge almost surel y to X if d(X11, X) --? 0 with probability one:
P( lim d(X11, X) = 0) = 1 . This is denoted by X11 � X . Note that convergence in probability and convergence almost surely only make sense if each of X11 and X are defined on the same probability space. For convergence in distribution this is not necessary. Let Y11 be the average of the first n of a sequence of independent, identically distributed random vectors Y1 , Y2 , . . . . If E ll Y1 11 < oo , then Yn � EY1 bythestrong law oflarge numbers. Under the stronger assumption that EII Y1 112 < oo, the central limit theorem asserts that .Jli (Y11 - E Y1 ) � N (0, Cov Y1 ). The central limit theorem plays an important role in this manuscript. It is proved later in this chapter, first for the case of real variables, and next it is extended to random vectors. The strong law of large numbers appears to be of less interest in statistics. Usually the weak law of large numbers, according to which Y 11 � E Y1 , suffices. This is proved later in this chapter. D 2.1
Example (Classical limit theorems).
The portmanteau lemma gives a number of equivalent descriptions of weak convergence. Most of the characterizations are only useful in proofs. The last one also has intuitive value. 2.2 Lemma (Portmanteau). For an y random vectors X11 and X the following statements are equivalent. (i) P(X11 ::S x) --? P(X ::s; x)for all continuity points ofx f----'3> P(X ::s; x) ; (ii) Ef(X11) --? Ef (X) for all bounded, continuousfunctions f; (iii) Ef(X11) --? Ef (X) for all bounded, Lipschitzt functions f; (iv) lim infEf (X11) � Ef(X) for all nonnegative, continuous functions f; (v) lim infP(X11 E G) �P(X E G) for every open set G; (vi) lim sup P(X11 E F) ::s; P(X E F) for every closed set F; (vii) P(X11 E B) --? P(X E B) for all Borel sets B with P(X E 8B) 0, where o B = jj- B is the boundary of B.
(i) ===> (ii). Assume first that the distribution function of X is continuous. Then condition (i) implies that P(X11 E I) --? P(X E I) for every rectangle I. Choose a sufficiently large, compact rectangle I with P(X tf: I) < 8. A continuous function f is uniformly continuous on the compact set I. Thus there exists a partition I U 1 I 1 into finitely many rectangles I1 such that f varies at most 8 on every I 1 . Take a point x 1 from each IJ and define fc: = .L 1 f(x 1 ) 1 I1 • Then I f - fc: l < 8 on I, whence if f takes its values in [- 1 , 1 ] ,
Proof
=
1 Ef(X11) - Efc: (X11) I I Ef(X) - Efc: (X) I t
A
y.
::'S
:::::
8 + P(X11 t/: I) , 8 + P(X tf: I) < 28.
function is called Lipschitz if there exists a number L such that If(x) The least such number L is denoted llfllhp·
-
f(y) I
:::0: Ld (x,
y),
for every x and
7
2. 1 Basic Theory
For sufficiently large n, the right side of the first equation is smaller than 2s as well. We combine this with
I Eft: (Xn) - Efe (X) I
:S
L I PCXn E Ij ) - P( X E Ij ) llf(xj ) l � 0. j
Together with the triangle inequality the three displays show that I Ef(Xn) - Ef(X) I is bounded by 5s eventually. This being true for every E > 0 implies (ii). Call a set B a continuity set if its boundary 8B satisfies P(X E 8B) 0. The preceding argument is valid for a general X provided all rectangles I are chosen equal to continuity sets . This is possible, because the collection of discontinuity sets is sparse. Given any collection of pairwise disjoint measurable sets, at most countably many sets can have positive probability. Otherwise the probability of their union would be infinite. Therefore, given any collection of sets {Ba :a E A} with pairwise disjoint boundaries, all except at most countably many sets are continuity sets. In particular, for each j at most countably many sets of the form {x : x1 :s a} are not continuity sets. Conclude that there exist dense subsets Q 1 , . . . , Qk of lR such that each rectangle with comers in the set Q 1 x · · · x Qk is a continuity set. We can choose all rectangles I inside this set. (iii) => (v). For every open set G there exists a sequence of Lipschitz functions with 0 :s fm t 1 G. For instance fm (x) ( md (x ' GC)) !\ 1. For every fixed m, =
=
lim n-+infP(Xn
oo
E
G)=::: lim n-+inf Efm C Xn )
oo
=
Efm (X) .
As m � oo the right side increases to P(X E G) by the monotone convergence theorem. (v) {} (vi). Because a set is open if and only if its complement is closed, this follows by taking complements. (v) + (vi)=> (vii). Let B and B denote the interior and the closure of a set, respectively. By (iv) P(X
E
B)
:S
lim inf P(Xn
E
B)
:S
lim sup P(Xn
E
B)
:S
P(X
E
B),
by (v) . If P(X E 8B) 0, then left and right side are equal, whence all inequalities are equalities. The probability P(X E B) and the limit lim P(Xn E B) are between the expressions on left and right and hence equal to the common value. (vii) => (i). Every cell ( - oo , x] such that x is a continuity point of x f----+ P(X :S x) is a continuity set. The equivalence (ii) {} (iv) is left as an exercise. • =
The continuous-mapping theorem is a simple result, but it is extremely useful. If the sequence of random vectors Xn converges to X and g is continuous, then g (Xn) converges to g(X ). This is true for each of the three modes of stochastic convergence. 2.3
Theorem (Continuous mapping). a set C such that P(X E C) = 1.
Let g : JRk f----+ Rm be continuous at every point of
(i) If Xn -v-+ X, then g ( Xn ) -v-+ g ( X ) ; (ii) If Xn � X, then g ( Xn ) � g ( X ) ; (iii) If Xn � X, then g(Xn) � g(X).
(i). The event {g(Xn) closed set F,
Proof.
E
F } is identical to the event {Xn E g - 1 ( F ) } . For every
To see the second inclusion, take x in the closure of g - 1 ( F ) . Thus, there exists a sequence Xm with Xm � x and g (xm ) E F for every F. If x E C, then g (xm ) � g (x ) , which is in F because F is closed; otherwise x E c c . By the portmanteau lemma, Because P(X E c c ) 0, the probability on the right is P(X E g - 1 ( F ) ) P( g ( X ) E F) . Apply the portmanteau lemma again, in the opposite direction, to conclude that =
=
g(Xn) -v-+ g(X).
(ii). Fix arbitrary E: > 0. For each 8 > 0 let B8 be the set of x for which there exists y with d(x , y) < 8, but d ( g(x ) , g (y) ) > E:. If X tf- B0 and d (g(Xn), g(X)) > E: , then d(Xn , X) ::=: 8. Consequently,
The second term on the right converges to zero as $n \to \infty$ for every fixed $\delta > 0$. Because $B_\delta \cap C \downarrow \emptyset$ by continuity of $g$, the first term converges to zero as $\delta \downarrow 0$.
Assertion (iii) is trivial. ∎

Any random vector $X$ is tight: For every $\varepsilon > 0$ there exists a constant $M$ such that $P(\|X\| > M) < \varepsilon$. A set of random vectors $\{X_\alpha : \alpha \in A\}$ is called uniformly tight if $M$ can be chosen the same for every $X_\alpha$: For every $\varepsilon > 0$ there exists a constant $M$ such that

$$\sup_\alpha P(\|X_\alpha\| > M) < \varepsilon.$$
Thus, there exists a compact set to which all Xa give probability "almost" one. Another name for uniformly tight is bounded in probability. It is not hard to see that every weakly converging sequence Xn is uniformly tight. More surprisingly, the converse of this statement is almost true: According to Prohorov's theorem, every uniformly tight sequence contains a weakly converging subsequence. Prohorov's theorem generalizes the Heine-Borel theorem from deterministic sequences Xn to random vectors . 2.4
Let Xn be random vectors in JRk . (i) If Xn -v-+ X for some X, then {Xn : n E N} is uniformly tight; (ii) If Xn is uniformly tight, then there exists a subsequence with Xn 1 -v-+ X as j � oo, for some X . Theorem (Prohorov 's theorem).
(i). Fix a number M such that P( II X II ::=: M) < E: . By the portmanteau lemma P( ll Xn I ::=: M ) exceeds P( II X I ::=: M ) arbitrarily little for sufficiently large n . Thus there exists N such that P( II Xn ll ::=: M ) < 2E:, for all n ::=: N. Because each of the finitely many variables Xn with n < N is tight, the value of M can be increased, if necessary, to ensure that P( II Xn II ::=: M) < 2E: for every n .
Proof.
2. 1
Basic Theory
9
(ii) . By Helly's lemma (described subsequently), there exists a subsequence Fn1 of the sequence of cumulative distribution functions Fn (x) = P(Xn :S x) that converges weakly to a possibly "defective" distribution function F. It suffices to show that F is a proper distribution function: F (x) --* 0, 1 if X i --* - oo for some i , or x --* oo. By the uniform tightness, there exists M such that Fn (M) > 1 - 8 for all n. By making M larger, if necessary, it can be ensured that M is a continuity point of F. Then F(M) = lim Fn1 (M) :=::: 1 - 8 . Conclude that F(x)--* 1 as x--* oo . That the limits at -oo are zero can be seen in a similar manner. • The crux of the proof of Prohorov's theorem is Helly's lemma. This asserts that any given sequence of distribution functions contains a subsequence that converges weakly to a possibly defective distribution function. A defective distribution function is a function that has all the properties of a cumulative distribution function with the exception that it has limits less than 1 at oo and/or greater than 0 at -oo .
Each given sequence Fn of cumulative distribution func tions on �k possesses a subsequence Fn1 with the property that Fn1 (x) --* F(x) at each continuity point x of a possibly defective distribution function F. 2.5
Lemma (Belly 's lemma).
Let «i = {q 1 , q2 , . . . } be the vectors with rational coordinates, ordered in an arbitrary manner. Because the sequence Fn (q 1 ) is contained in the interval [0, 1 ] , it has a converging subsequence. Call the indexing subsequence {n } } j:1 and the limit G (q 1 ) . Next, extract a further subsequence {n ] } c {n } } along which Fn (q 2 ) converges to a limit G(q 2 ) , a further subsequence {n ] } c {n ] } along which Fn (q 3 ) converges to a limit G (q 3 ) , . . . , and so forth. The "tail" of the diagonal sequence n j := n j belongs to every sequence n � . Hence Fn/q i ) --* G (q i ) for every i = 1 , 2, . . . . Because each Fn is nonde creasing, G (q) :s G (q') if q :s q'. Define
Proof.
F(x ) = inf G (q). q>x
Then F is nondecreasing. It is also right-continuous at every point x , because for every 8 > 0 there exists q > x with G (q) - F (x ) < 8, which implies F (y) - F (x) < 8 for every x :s y :s q . Continuity of F at x implies, for every 8 > 0, the existence of q < x < q' such that G (q') - G (q) < 8. By monotonicity, we have G(q) :S F(x) :S G (q ' ) , and
Conclude that I lim inf Fn1 (x) - F (x) I < 8. Because this is true for every 8 > 0 and the same result can be obtained for the lim sup, it follows that Fn1(x) � F (x ) at every continuity point of F. In the higher-dimensional case, it must still be shown that the expressions defining masses of cells are nonnegative. For instance, for k = 2, F is a (defective) distribution function only if F(b) + F(a) - F(a1 , b 2 ) - F(a 2 , bi) :=::: 0 for every a :S b. In the case that the four corners a, b , (a 1 , b 2 ) , and (a2 , b 1 ) of the cell are continuity points; this is immediate from the convergence of Fn1 to F and the fact that each Fn is a distribution function. Next, for general cells the property follows by right continuity. •
Stochastic Convergence
10
Example (Markov 's inequality). A sequence Xn of random variables with EI Xn i P = 0(1) for some p > 0 is uniformly tight. This follows because by Markov 's inequality
2.6
The right side can be made arbitrarily small, uniformly in n , by choosing sufficiently large M. Because EX� = var Xn + (EXn ) 2 , an alternative sufficient condition for uniform tight ness is EXn = 0(1) and var Xn = 0(1) . This cannot be reversed. 0 Consider some of the relationships among the three modes of convergence. Convergence in distribution is weaker than convergence in probability, which is in tum weaker than almost-sure convergence, except if the limit is constant. 2.7
Let Xn, X and Yn be random vectors. Then Xn � X implies Xn !,. X; X n !,. X implies X n X; Xn !,. cfor a constant c if and only if Xn c; if Xn X and d ( Xn , Yn) !,. 0, then Yn X; if Xn X and Yn !,. cfor a constant c, then (Xn , Yn) if Xn !,. X and Yn !,. Y, then (Xn , Yn ) !,. (X, Y).
Theorem.
(i) (ii) (iii) (iv) (v) (vi)
'V'-7
'V'-7
'V'-7
'V'-7
'V'-7
'V'-7
(X , c);
(i) . The sequence of sets An = Um �n {d(Xm , X) > .s} is decreasing for every .s > 0 and decreases to the empty set if X n (w) -+ X (w) for every w. If X n � X , then P (d (Xn , X) > .s) ::::; P(An) -+ 0. (iv). For every f with range [0, 1] and Lipschitz norm at most 1 and every .s > 0,
Proof
The second term on the right converges to zero as n -+ oo. The first term can be made arbitrarily small by choice of .s. Conclude that the sequences Ef(Xn) and Ef(Yn) have the same limit. The result follows from the portmanteau lemma. (ii). Because d (Xn , X) !,. 0 and trivially X X, it follows that Xn X by (iv). (iii). The "only if" part is a special case of (ii). For the converse let ball(c, .s) be the open ball of radius .s around c. Then P(d CXn , c) ::=: .s) = P(Xn E ball(c, .s ) ) If Xn c, then the lim sup of the last probability is bounded by P( c E ball(c, .s) c ) = 0, by the portmanteau lemma. (v). First note that d (CXn , Yn) , (Xn , c)) = d(Yn , c) !,. 0. Thus, according to (iv), it suffices to show that (Xn , c) (X, c) . For every continuous, bounded function (x , y) f---+ f (x , y), the function x f---+ f (x , c) is continuous and bounded. Thus Ef (Xn , c) -+ Ef (X, c) if Xn X . (vi). This follows from d (Cx 1 , YI ) , (x z , Yz)) ::::; d (x 1 , xz) + d ( y 1 , Yz) . • 'V'-7
'V'-7
c
.
'V'-7
'V'-7
'V'-7
According to the last assertion of the lemma, convergence in probability of a sequence of vectors X11 = (X11,1 , . . . , Xn,d is equivalent to convergence of every one of the sequences of components Xn,i separately. The analogous statement for convergence in distribution
2. 1
11
Basic Theory
is false: Convergence in distribution of the sequence Xn is stronger than convergence of every one of the sequences of components Xn ,i. The point is that the distribution of the components Xn ,i separately does not determine their joint distribution: They might be independent or dependent in many ways . We speak of joint convergence in distribution versus marginal convergence . Assertion (v) of the lemma has some useful consequences. If Xn -v-+ X and Yn -v-+ c, then (Xn , Yn) -v-+ (X, c) . Consequently, by the continuous mapping theorem, g(Xn , Yn ) -v-+ g(X, c) for every map g that is continuous at every point in the set IRk x {c} in which the vector (X, c) takes its values. Thus, for every g such that lim
x�xo,y � c
g (x , y) = g (xo , c),
for every x0 .
Some particular applications of this principle are known as Slutsky's lemma.
Let Xn, X and Yn be random vectors or variables. If Xn -v-+ X and Yn -v-+ c for a constant c, then (i) Xn + Yn -v-+ X + c; (ii) YnXn -v-+ eX ; (iii) yn- l Xn -v-+ c - 1 X provided c f. 0.
2.8
Lemma (Slutsky).
In (i) the "constant" c must be a vector of the same dimension as X, and in (ii) it is probably initially understood to be a scalar. However, (ii) is also true if every Yn and c are matrices (which can be identified with vectors, for instance by aligning rows, to give a meaning to the convergence Yn -v-+ c), simply because matrix multiplication (x, y) f-+ yx is a continuous operation. Even (iii) is valid for matrices Yn and c and vectors Xn provided c f. 0 is understood as c being invertible, because taking an inverse is also continuous. Let Y1 , Y2 , . . . be independent, identically distributed random variables with EY1 = 0 and EYf < oo. Then the t-statistic ,J'iiYn/ Sn, where S� = (n n- l 2.::: 7= 1 (Yi - Yn ) 2 is the sample variance, is asymptotically standard normal. To see this, first note that by two applications of the weak law of large numbers and the continuous-mapping theorem for convergence in probability 2.9
Example (t-statistic).
(
)
n 1 � 2 -2 P ( 2 2 ) = var Yr . - L...li - yn � 1 EYI - (EYI ) sn2 = -n - 1 n i= l
Again by the continuous-mapping theorem, Sn converges in probability to sd Y1 . By the cen tral limit theorem ,JfiY n converges in law to the N (0, var Y1 ) distribution. Finally, Slutsky's lemma gives that the sequence of t-statistics converges in distribution to N (0, var Y1 ) j sd Y1 = N (O, 1 ) . D 2.10
fying
Example (Confidence intervals).
Let Tn and Sn be sequences of estimators satis
for certain parameters f3 and 0'2 depending on the underlying distribution, for every distri bution in the model. Then e = Tn ± Sn/ ,Jn Za is a confidence interval for f3 of asymptotic
Stochastic Convergence
12
level 1 - 2a . More precisely, we have that the probability that e is contained in [T,1 Sn l Jn Za , Tn + Sn l Jn Za ] converges to 1 - 2a . This is a consequence of the fact that the sequence Jn (Tn - 8 ) I Sn is asymptotically standard normally distributed. D If the limit variable X has a continuous distribution function, then weak convergence X implies P (Xn ::::: x) � P(X ::::: x) for every x . The convergence is then even uniform in x .
Xn
-v-7
2.1 1 Lemma. Suppose that Xn X for a random vector X with a continuous distribution function. Then SUPx i P C Xn :'S x) - P(X ::::: x) l � 0. -v-7
Let Fn and F be the distribution functions of Xn and X. First consider the one dimensional case. Fix k E N. By the continuity of F there exist points -oo x0 < x 1 < · · · < Xk = oo with F(x i) = i I k. By monotonicity, we have, for X i- I :::; x :::; x i,
Proof
=
Fn (x) - F(x)
:'S
Fn (X i) - F(x i- I) = Fn (x i) - F(x i) + 1 1 k � Fn (X i_ r) - F(x i) = Fn (X i_ r) - F(X i I) - 1 1 k . -
Thus I Fn (x) - F(x) I is bounded above by sup i I Fn (X i) - F(x i) I + ll k, for every x . The latter, finite supremum converges to zero as n � oo, for each fixed k. Because k is arbitrary, the result follows . In the higher-dimensional case, we follow a similar argument but use hyperrectangles, rather than intervals. We can construct the rectangles by intersecting the k partitions obtained by subdividing each coordinate separately as before. •
2.2
Stochastic
o and 0 Symbols
It is convenient to have short expressions for terms that converge in probability to zero or are uniformly tight. The notation op ( 1 ) ("small oh-P-one") is short for a sequence of random vectors that converges to zero in probability. The expression Op ( l ) ("big oh P-one") denotes a sequence that is bounded in probability. More generally, for a given sequence of random variables R n,
Xn = Op (Rn) means Xn = Yn Rn and Yn �p 0; Xn = Op (Rn) means Xn = Yn Rn and Yn = Op ( l ) . This expresses that the sequence X11 converges in probability to zero or is bounded in probability at the "rate" R 11 • For deterministic sequences X11 and R 11, the stochastic "oh"
symbols reduce to the usual o and 0 from calculus. There are many rules of calculus with o and 0 symbols, which we apply without com ment. For instance,
Op ( 1) + Op (1) = Op ( 1) Op ( 1) + Op (1) = Op ( l ) 0p (J )op ( 1) = op ( 1 )
2.3
Characteristic Functions
( 1 + op ( l )r 1 Op (Rn) Op (Rn) op ( Op ( 1))
13
Op(1) = Rnop (1) = Rn Op (l) = op ( 1 ) . =
To see the validity of these rules it suffices to restate them in terms of explicitly named vectors, where each op ( 1 ) and Op (1) should be replaced by a different sequence of vectors that converges to zero or is bounded in probability. In this way the first rule says: If Xn � 0 and Yn � 0, then Zn = Xn + Yn � 0. This is an example of the continuous-mapping theorem. The third rule is short for the following: If Xn is bounded in probability and Yn � 0, then Xn Yn � 0. If Xn would also converge in distribution, then this would be statement (ii) of Slutsky's lemma (with c = 0). But by Prohorov's theorem, Xn converges in distribution "along subsequences" if it is bounded in probability, so that the third rule can still be deduced from Slutsky's lemma by "arguing along subsequences." Note that both rules are in fact implications and should be read from left to right, even though they are stated with the help of the equality sign. Similarly, although it is true that o p ( 1) + o p ( 1) = 2o p ( 1 ) , writing down this rule does not reflect understanding of the o p symbol. Two more complicated rules are given by the following lemma.
Let R be a function defined on domain in Rk such that R (0) = 0. Let Xn be a sequence of random vectors with values in the domain of R that converges in probability to zero. Then, for every p > 0, (i) if R (h ) = o ( ll h li P) as h ----+ 0 , then R (Xn) = Op ( I I Xn li P) ; (ii) if R ( h ) = O (ll h 11 P) as h----+ 0, then R (Xn) = Op ( I I Xn ii P) . 2.12
Lemma.
Define g (h) as g (h) = R (h)/ ll h ii P for h i= 0 and g (O) = 0. Then R (Xn ) = g(Xn) II Xn ii P. (i) Because the function g is continuous at zero by assumption, g ( Xn ) � g (O) = 0 by the continuous-mapping theorem. (ii) By assumption there exist M and 8 > 0 such that I g (h) I ::: M whenever I h I ::: 8. Thus P ( l g ( Xn ) l > M) ::: P ( I I Xn I > 8)----+ 0, and the sequence g ( Xn ) is tight. •
Proof.
*2.3
Characteristic Functions
It is sometimes possible to show convergence in distribution of a sequence of random vectors directly from the definition. In other cases "transforms" of probability measures may help. The basic idea is that it suffices to show characterization (ii) of the portmanteau lemma for a small subset of functions f only. The most important transform is the characteristic function
Each of the functions lemma, ----+
rx is continuous and bounded. Thus, by the portmanteau eu itr x
x r--+
Eeur Xn Ee
for every t if Xn -v--t X . By Levy's continuity theorem the
14
Stochastic Convergence
converse is also true: Pointwise convergence of characteristic functions is equivalent to weak convergence.
Let Xn and X be random vectors in Rk . Then for every t E JRk . Moreover, con verges pointwise to a function tjJ(t) that is continuous at zero, then tjJ is the characteristic function of a random vector X and Xn """ X. 2.13
ifEei tr x" Eei tr x
Theorem (Levy 's continuity theorem). � Xn """ X and only
if
if Eeur x"
If Xn """ X, then Eh (Xn) � Eh (X) for every bounded continuous function h, in particular for the functions h (x) = x . This gives one direction of the first statement. For the proof of the last statement, suppose first that we already know that the sequence Xn is uniformly tight. Then, according to Prohorov's theorem, every subsequence has a further subsequence that converges in distribution to some vector Y. By the preceding paragraph, the characteristic function of Y is the limit of the characteristic functions of the converging subsequence. By assumption, this limit is the function tjJ (t) . Conclude that every weak limit point Y of a converging subsequence possesses characteristic function ¢. Because a characteristic function uniquely determines a distribution (see Lemma 2. 15), it follows that the sequence Xn has only one weak limit point. It can be checked that a uniformly tight sequence with a unique limit point converges to this limit point, and the proof is complete. The uniform tightness of the sequence Xn can be derived from the continuity of tjJ at zero. Because marginal tightness implies joint tightness, it may be assumed without loss of generality that Xn is one-dimensional. For every x and 8 > 0, sin ox = � ( 1 - costx) dt. 1 { 1 ox l > 2} :::::; 2 1 8 ox Replace x by Xn , take expectations, and use Fubini's theorem to obtain that
e i tr
Proof
) !-88 8 P ( I Xn l � ) :::::; � ! Re( l - Eeit X" ) dt. 8 8 -8 By assumption, the integrand in the right side converges pointwise to Re ( 1 - tjJ (t) ) . By the (
>
dominated-convergence theorem, the whole expression converges to
18
� Re ( 1 - tjJ (t) ) dt. 8 -o Because tjJ is continuous at zero, there exists for every E > 0 a 8 > 0 such that 1 1 - tjJ (t) I < E for l t l < 8 . For this 8 the integral is bounded by 2£ . Conclude that P(I Xn l > 2/8) :::::; 2£ for sufficiently large n, whence the sequence Xn is uniformly tight. • 2.14
Example (Normal distribution).
bution is the function
The characteristic function of the Nk (/1-, I: ) distri
Indeed, if X is Nk (O, /) distributed and L:11 2 is a symmetric square root of I: (hence I: = (I:11 2 ) 2 ) , then L;11 2 X + 11- possesses the given normal distribution and
15
Characteristic Functions
2.3
For real-valued z, the last equality follows easily by completing the square in the exponent. Evaluating the integral for complex z, such as z = it, requires some skill in complex function theory. One method, which avoids further calculations, is to show that both the left- and righthand sides of the preceding display are analytic functions of z. For the right side this is obvious ; for the left side we can justify differentiation under the expectation sign by the dominated-convergence theorem. Because the two sides agree on the real axis, they must agree on the complex plane by uniqueness of analytic continuation. D 2.15 Lemma. Random vectors Ee i ' x = Ee i t r Y for every t E
IRk .
t
Proof
and y
E
X and Y in
IRk are equal in distribution if and only if
By Fubini's theorem and calculations as in the preceding example, for every a
IRk ,
J i tr y e - � tr to.2 e-
r Ee i r x
dt
=
E
J
r e i r cx- y )
>
0
r 2 e - � t t cr dt
JT: )kk/2 Ee 21 (X r (X /cr 2 . By the convolution formula for densities, the righthand side is (2n ) k times the density =
(2
-
- y)
a
- y)
p x +u z (y) of the sum of X and a Z for a standard normal vector Z that is independent of X . Conclude that if X and Y have the same characteristic function, then the vectors X + a Z and Y + a Z have the same density and hence are equal in distribution for every a > 0. By Slutsky's lemma X + a Z -v-+ X as a + 0, and similarly for Y. Thus X and Y are equal in distribution. • The characteristic function of a sum of independent variables equals the product of the characteristic functions of the individual variables. This observation, combined with Levy's theorem, yields simple proofs of both the law of large numbers and the central limit theorem.
Let Y1 , . . . , Yn be i. i. d. random variables with characteristic function ¢>. Then Yn !. f-L for a real number f-L if and only if¢> is differ entiable at zero with i f-L = ¢>' (0). 2.16
Proposition (Weak law of large numbers).
We only prove that differentiability is sufficient. For the converse, see, for exam ple, [ 1 27, p. 52] . Because ¢ (0) = 1 , differentiability of ¢> at zero means that rf> (t) = 1 + t ¢>' (0 ) + o (t) as t --+ 0. Thus, by Fubini's theorem, for each fixed t and n --+ oo,
Proof
Ee ' '
y
"
�
ql'
m
�
(I
�
+ i JL + o
G) r -+ e''" ·
The right side is the characteristic function of the constant variable f-L. By Levy's theorem, Yn converges in distribution to f-L. Convergence in distribution to a constant is the same as convergence in probability. •
it
A sufficient but not necessary condition for ¢> (t) = Ee Y to be differentiable at zero is that E l Y I < oo . In that case the dominated convergence theorem allows differentiation
16
Stochastic Convergence
under the expectation sign, and we obtain
cp ' (t)
=
d . -Ee 1 1 y dt =
=
. y
Ei Ye1 1
•
EY
� EY1 .
In particular, the derivative at zero is ¢' (0) i and hence Yn If < oo, then the Taylor expansion can be carried a step further and we can obtain a version of the central limit theorem.
EY 2
2.17
EY;
Y1 ,
Let . . . , Yn be i. i. d. random variables with 1. Then the sequence .JflYn converges in distribution to the standard
Proposition (Central limit theorem). =
EY?
=
0 and normal distribution.
A second differentiation under the expectation sign shows that ¢ " (0) Because ¢' (0) i 0, we obtain
Proof.
=
Ee
EY
' ' � Y.
=
=
(&"
(;,)
(I - �� EY2
=
+o
G) r
_, e
=
i
2EY 2 .
+'EY'
The right side is the characteristic function of the normal distribution with mean zero and variance The proposition follows from Levy's continuity theorem. •
EY 2 .
ir r x Ee Eei u (t r X)
The characteristic function t f--+ of a vector X is determined by the set of all characteristic functions u f--+ of linear combinations t r X of the components of X. Therefore, Levy's continuity theorem implies that weak convergence of vectors is equivalent to weak convergence of linear combinations :
Xn "-""+ X if and only if t r Xn "-""+ t r X for all t
E
JRk .
This is known as the Cramer-Wold device. It allows to reduce higher-dimensional problems to the one-dimensional case. 2.18
Y1 , Y2 , E(Y1 - (Y
Let . . . be i.i.d. random vec and covariance matrix I: �J-) 1 - �J-) r .
Example (Multivariate central limit theorem).
tors in JRk with mean vector fJThen
=
1
EY
=
(The sum is taken coordinatewise.) By the Cramer-Wold device, this can be proved by finding the limit distribution of the sequences of real variables
� - ) ( -�(Y; i= l
1 � -�(t Y; t �J-) . i= .Jri l .Jri Because the random variables t r Y - t r tr Y2 - t r . . . are i.i.d. with zero mean and variance t r L:t, this sequence is asymptotically (0, t r I:t) -distributed by the univariate
tr
1
M)
1
!J- ,
T
=
N1
T
!J- ,
central limit theorem. This is exactly the distribution of t r X if X possesses an Nk (O, I:) distribution. 0
2.5
*2.4
17
Convergence of Moments
Almost-Sure Representations
Convergence in distribution certainly does not imply convergence in probability or almost surely. However, the following theorem shows that a given sequence Xn "-"'+ X can always be replaced by a sequence Xn "-"'+ X that is, marginally, equal in distribution and converges almost surely. This construction is sometimes useful and has been put to good use by some authors, but we do not use it in this book.
Suppose that the sequence ofrandom vec tors Xn converges in distribution to a random vector Xo. Then there exists a probability space ( Q , U, P) and random vectors Xn defined on it such that Xn is equal in distribution to Xn for every n � 0 and Xn -+ Xo almost surely. 2.19
Theorem (Almost-sure representations).
For random variables we can simply define Xn F;; 1 (U) for Fn the distribu tion function of Xn and U an arbitrary random variable with the uniform distribution on [0, 1]. (The "quantile transformation," see Section 2 1 . 1 .) The simplest known construc tion for higher-dimensional vectors is more complicated. See, for example, Theorem 1 . 1 0.4 in [ 146] , or [4 1]. • =
Proof.
*2.5
Convergence of Moments
By the portmanteau lemma, weak convergence Xn "-"'+ X implies that Ef (Xn) -+ Ef(X) for every continuous, bounded function f. The condition that f be bounded is not superfluous : It i s not difficult to find examples of a sequence Xn "-"'+ X and an unbounded, continuous function f for which the convergence fails. In particular, in general convergence in distri bution does not imply convergence EX!: -+ EX P of moments. However, in many situations such convergence occurs, but it requires more effort to prove it. A sequence of random variables Yn is called asymptotically uniformly integrable if lim lim sup EI Yn i 1 { 1 Yn l > M} 0. n -HXl Uniform integrability is the missing link between convergence in distribution and conver gence of moments. =
M-HXJ
Rk
Let f : � JR. be measurable and continuous at every point in a set C. Let Xn "-"'+ X where X takes its values in C. Then Ef (Xn) -+ E f (X) zf and only if the sequence of random variables f (Xn ) is asymptotically uniformly integrable.
2.20
Theorem.
We give the proof only in the most interesting direction. (See, for example, [146] (p. 69) for the other direction.) Suppose that Yn f (Xn) is asymptotically uniformly integrable. Then we show that EYn -+ EY for Y f (X). Assume without loss of generality that Yn is nonnegative; otherwise argue the positive and negative parts separately. By the continuous mapping theorem, Yn "-"'+ Y. By the triangle inequality,
Proof.
=
=
I E Yn - EY I :::: IEYn - EYn A M l + IEYn A M - EY A M l + lEY A M - EY I .
Because the function y � y !\ M is continuous and bounded on [0, oo) , it follows that the middle term on the right converges to zero as n -+ oo. The first term is bounded above by
18
Stochastic Convergence
EYn 1 { Yn > M}, and converges to zero as n � oo followed by M � oo, by the uniform integrability. By the portmanteau lemma (iv), the third term is bounded by the lim inf as n � oo of the first and hence converges to zero as M t oo . • 2.21 Example. Suppose Xn is a sequence of random variables such that Xn -v-+ X and lim sup EI Xn I P < oo for some p. Then all moments of order strictly less than p converge also: EX! � EX k for every k < p. By the preceding theorem, it suffices to prove that the sequence X! is asymptotically uniformly integrable. By Markov's inequality
The limit superior, as n �
oo
followed by M �
oo,
of the right side is zero if k < p.
D
The moment function p r--+ EX P can be considered a transform of probability distributions, just as can the characteristic function. In general, it is not a true transform in that it does determine a distribution uniquely only under additional assumptions. If a limit distribution is uniquely determined by its moments, this transform can still be used to establish weak convergence. 2.22
every
Theorem. Let Xn and X be random variables such that EX� � EXP < oo for p E N. If the distribution of X is uniquely determined by its moments, then Xn -v-+ X. =
Because EX� 0 ( 1 ) , the sequence Xn is uniformly tight, by Markov's inequality. By Prohorov's theorem, each subsequence has a further subsequence that converges weakly to a limit Y. By the preceding example the moments of Y are the limits of the moments of the subsequence. Thus the moments of Y are identical to the moments of X. Because, by assumption, there is only one distribution with this set of moments, X and Y are equal in distribution. Conclude that every subsequence of Xn has a further subsequence that converges in distribution to X . This implies that the whole sequence converges to X . •
Proof.
Example. The normal distribution is uniquely determined by its moments. (See, for example, [ 1 23] or [133, p. 293] .) Thus EX� � 0 for odd p and EX� � (p - 1) (p - 3) · . . 1 for even p implies that Xn -v-+ N (0, 1). The converse is false. D
2.23
*2.6
Convergence-Determining Classes
A class :F of functions f : Rk � R is called convergence-determining if for every sequence of random vectors Xn the convergence Xn -v-+ X is equivalent to Ef (Xn ) � Ef (X) for every f E :F. By definition the set of all bounded continuous functions is convergence determining, but so is the smaller set of all differentiable functions, and many other classes. The set of all indicator functions 1 c - oo ,tl would be convergence-determining if we would restrict the definition to limits X with continuous distribution functions. We shall have occasion to use the following results. (For proofs see Corollary 1 .4.5 and Theorem 1 . 1 2.2, for example, in [ 146] .)
2. 7
19
Law of the Iterated Logarithm
On IR_k IR.1 x IR.rn the set offunctions (x , y) f-+ f (x)g(y) with f and g ranging over all bounded, continuous functions on IR.1 and IR.rn, respectively, is convergence determining.
2.24
Lemma.
=
There exists a countable set of continuous functions f : IR.k f-+ [0, 1] that is convergence-determining and, moreover, Xn -v-+ X implies that Ef (Xn ) ---+ Ef (X) uni formly in f E :F. 2.25
Lemma.
*2.7
Law of the Iterated Logarithm
The law of the iterated logarithm is an intriguing result but appears to be of less interest to statisticians. It can be viewed as a refinement of the strong law of large numbers. If Y1 , Y2 , . . . are i.i.d. random variables with mean zero, then Y1 + · · · + Yn o (n) almost surely by the strong law. The law of the iterated logarithm improves this order to 0 ( y'n log log n), and even gives the proportionality constant. =
Proposition (Law of the iterated logarithm). Let Y1 , Y2 , . . . be i. i. d. random vari ables with mean zero and variance 1. Then Y1 + · · · + Yn . r:::. hm sup a.s. v 2, n --;.oo y'n log log n Conversely, if this statement holds for both Yi and - Yi , then the variables have mean zero and variance 1.
2.26
=
The law of the iterated logarithm gives an interesting illustration of the difference between almost sure and distributional statements. Under the conditions of the proposition, the sequence n - 1 12 (Y1 + · · · + Yn ) is asymptotically normally distributed by the central limit theorem. The limiting normal distribution is spread out over the whole real line. Apparent! y division by the factor y'log log n is exactly right to keep n - 1 12 (Y1 + · · · + Yn) within a compact interval, eventually. A simple application of Slutsky's lemma gives Y1 + · · · + Yn P ---+ O. Zn . y'n log log n Thus Zn is with high probability contained in the interval ( - E , E ) eventually, for any E > 0. This appears to contradict the law of the iterated logarithm, which asserts that Zn reaches the interval cv0. - E , yl2 + E ) infinitely often with probability one. The explanation is that the set of w such that Zn (w) is in ( - E , E ) or cv0. - E , yl2 + E ) fluctuates with n. The convergence in probability shows that at any advanced time a very large fraction of w have Zn (w) E ( - E , E ) . The law of the iterated logarithm shows that for each particular w the sequence Zn (w) drops in and out of the interval cv0. - E , yl2 + E ) infinitely often (and hence out of ( - E , E )) . The implications for statistics can be illustrated by considering confidence statements. If f.L and 1 are the true mean and variance of the sample Y1 , Y2 , . . . , then the probability that 2 2 < ,..-vI I _< Y n + Y n Jn _ Jn . _
-
-
-
Stochastic Convergence
20
converges to (2) - (-2) � 95%. Thus the given interval is an asymptotic confidence interval of level approximately 95%. (The confidence level is exactly (2) - (- 2) if the observations are normally distributed. This may be assumed in the following; the accuracy of the approximation is not an issue in this discussion.) The point fJ., = 0 is contained in the interval if and only if the variable satisfies 2 < .Jlog log n Assume that fJ., = is the true value of the mean, and consider the following argument. By the law of the iterated logarithm, we can be sure that hits the interval ( .J2 - 8, .J2 + 8) infinitely often. The expression 2/ .Jlog log n is close to zero for large n . Thus we can be sure that the true value fJ., = is outside the confidence interval infinitely often. How can we solve the paradox that the usual confidence interval is wrong infinitely often? There appears to be a conceptual problem if it is imagined that a statistician collects data in a sequential manner, computing a confidence interval for every n. However, although the frequentist interpretation of a confidence interval is open to the usual criticism, the paradox does not seem to rise within the frequentist framework. In fact, from a frequentist point of view the curious conclusion is reasonable. Imagine statisticians , all of whom set 95% confidence intervals in the usual manner. They all receive one observation per day and update their confidence intervals daily. Then every day about five of them should have a false interval. It is only fair that as the days go by all of them take turns in being unlucky, and that the same five do not have it wrong all the time. This, indeed, happens according to the law of the iterated logarithm. The paradox may be partly caused by the feeling that with a growing number of observa tions, the confidence intervals should become better. In contrast, the usual approach leads to errors with certainty. However, this is only true if the usual approach is applied naively in a sequential set-up. In practice one would do a genuine sequential analysis (including the use of a stopping rule) or change the confidence level with n . There is also another reason that the law of the iterated logarithm is of little practical consequence. The argument in the preceding paragraphs is based on the assumption that 2/ .Jlog log n is close to zero and is nonsensical if this quantity is larger than .J2. Thus the argument requires at least n :=:: 1 6 1 9, a respectable number of observations .
Z11 I Zn l
0
·
Zn
0
100
*2.8
Lindeberg-Feller Theorem
Central limit theorems are theorems concerning convergence in distribution of sums of random variables. There are versions for dependent observations and nonnormal limit distributions. The Lindeberg-Feller theorem is the simplest extension of the classical central limit theorem and is applicable to independent observations with finite variances. 2.27
Yn, kn
For each n let Yn , I , . . . be independent random vectors with finite variances such that Proposition (Lindeberg-Feller central limit theorem).
kn EI I Yn . i i 2 1{I I Yn. i l > 8 } � 0, L i =l k:Ln cov Yn, i � I; . i =l
every E >
0,
,
Then the sequence L�== 1 (Yn , i -
distribution.
21
Linde berg-Feller Theorem
2.8
EYn, i ) converges in distribution to a normal N (0, I:)
A result of this type is necessary to treat the asymptotics of, for instance, regression problems with fixed covariates . We illustrate this by the linear regression model. The application is straightforward but notationally a bit involved. Therefore, at other places in the manuscript we find it more convenient to assume that the covariates are a random sample, so that the ordinary central limit theorem applies. In the linear regression problem, we observe a vector Y = XfJ + for a known (n x p) matrix X of full rank, and an (unobserved) error vector with i.i.d. components with mean zero and variance The least squares estimator of fJ is
2.28
Example (Linear regression).
e
e
0' 2 .
e
This estimator is unbiased and has covariance matrix C5 2 (X T X) - 1 . If the error vector is normally distributed, then � is exactly normally distributed. Under reasonable conditions on the design matrix, the least squares estimator is asymptotically normally distributed for a large range of error distributions. Here we fix p and let n tend to infinity. This follows from the representation
(X T X) 1 / 2 (� -
an 1 , . . . , ann
n {J ) = cxT x ) - 1 12 xT e = I>ni ei, i=
1 are the columns of the (p x n) matrix (X T X) - 1 1 2 x r =: A. This sequence
where is asymptotically normal if the vectors satisfy the Lindeberg conditions . 1 2 T The norming matrix (X X) 1 has been chosen to ensure that the vectors in the display have covariance matrix I for every n. The remaining condition is
0' 2
an 1 e 1 , . . . , annen
n 2 :L i = 1 ) an t i Eef 1{1 1 an t l l etl } � 0. This can be simplified to other conditions in several ways. Because L I ani 1 2 = trace(AA T ) = p, it suffices that maxEef1{ l ani l l e; l } � 0, which is equivalent to si sn l an; l � 0. 1max Alternatively, the expectation Ee 2 1{a l e l } can be bounded by c k E i e l k + 2 a k and a second set of sufficient conditions is n k (k 2) . L) i = 1 ani I l � 0 ; Both sets of conditions are reasonable. Consider for instance the simple linear regression model = {30 + {J 1 x; + e;. Then )2 - 1 /2 ( 1 1 x 1 x2 x It is reasonable to assume that the sequences and x 2 are bounded. Then the first matrix > 8
> 8
>
8
>
Y;
x
x
22
Stochastic Convergence
on the right behaves like a fixed matrix, and the conditions for asymptotic normality simplify to
Every reasonable design satisfies these conditions. D
Convergence in Total Variation
*2.9
A sequence of random variables converges in total variation to a variable X if sup j P(Xn B
E
B) - P(X E B) \ -+ 0,
where the supremum is taken over all measurable sets B. In view of the portmanteau lemma, this type of convergence is stronger than convergence in distribution. Not only is it required that the sequence P(X E B) converges for every Borel set B , the convergence must also be uniform in B . Such strong convergence occurs less frequently and is often more than necessary, whence the concept is less useful. A simple sufficient condition for convergence in total variation is pointwise convergence of densities. If Xn and X have densities and p with respect to a measure f.L, then
n
Pn
sup j P(Xn B
E
B) - P(X E B) \
=
2 J I Pn - P I df.L.
�
Thus, convergence in total variation can be established by convergence theorems for inte grals from measure theory. The following proposition, which should be compared with the monotone and dominated convergence theorems, is most appropriate.
fn f
Suppose that and are arbitrary measurable functions such that df.L J df.L < n -+ J.L-almost everywhere (or in J.L-measure) and lim sup J dt-L -+ 0. oo, for some p � 1 and measure f.L. Then J 2.29
Proposition.
fn i P ::::; I J I P l I P lfn - J By the inequality ( a + b)P ::::; 2PaP + 2Pb P , valid for every a , b 0, and the assumption, 0 ::::; 2 P i fn1 P + 2P i f 1 P - l fn - J I P -+ 2P + 1 1 f i P almost everywhere. By Fatou's lemma, f 2P+ 1 1 f i P df.L lim inf f (2P i fn l p + 2P I J I P - l fn - JI P ) dJ.L 2p+ I J I J I P df.L - lim sup J l fn - JI P d f-L , f f
�
Proof.
:S
::::;
by assumption. The proposition follows.
•
Pnn
Let Xn and X be random vectors with densities and p with respect to a measure f.L. If -+ p J.L-almost everywhere, then the sequence X converges to X in total variation. 2.30
Corollary (Schefje).
Pn
23
Notes
The central limit theorem is usually formulated in terms of convergence in distribution. Often it is valid in terms of the total variation distance, in the sense that
i
l
1-Jhr e - � (x - n J.L ) 2 / n o- 2 dx � 0. sup P ( Yr + + Yn E B) - { B JB .jfie5 2n Here f-L and C5 2 are mean and variance of the Yi , and the supremum is taken over all Borel sets. An integrable characteristic function, in addition to a finite second moment, suffices. ·
·
·
Let Y1 , Y2 , . . . be i. i.d. random variables with .finite second moment and characteristic function ¢ such that J l ¢ ( t ) l v dt < oo for some v 2:: 1. Then Yr + · · · + Yn satisfies the central limit theorem in total variation.
2.31
Theorem (Central limit theorem in total variation).
=
=
It can be assumed without loss of generality that EY1 0 and var Y1 1 . By the inversion formula for characteristic functions (see [47, p. 509]), the density P n of Yr + · · · + Yn / .jfi can be written n 1 t t dt. P n (X) 2n e - z x ¢ vn By the central limit theorem and Levy's continuity theorem, the integrand converges to e - i tx exp( - � t 2 ) . It will be shown that the integral converges to
Proof.
=
l
-
f . ( ) -
f e- z.tx e_ l2 12 dt
=
- l2 x 2 e-J2n .
2n Then an application of Scheffe's theorem concludes the proof. The integral can be split into two parts. First, for every 8 > 0, { e - i tx ¢ Jltl >svfn
( �n ) n dt v
:S
n -Jii sup l c/J ( t ) l - v ltl>s
f l cfJ (t) l v dt.
Here sup ltl>s I ¢ (t) I < 1 by the Riemann-Lebesgue lemma and because ¢ is the characteristic function of a nonlattice distribution (e.g., [47, pp. 50 1 , 5 1 3]). Thus, the first part of the integral converges to zero geometrically fast. Second, a Taylor expansion yields that cp (t) l - � t 2 + o ( t 2 ) as t � 0, so that there exists 8 > 0 such that I ¢ (t) I :S l - t 2 I 4 for every I t I < 8 . It follows that =
The proof can be concluded by applying the dominated convergence theorem to the remaining part of the integral. • Notes
The results of this chapter can be found in many introductions to probability theory. A standard reference for weak convergence theory is the first chapter of [1 1 ] . Another very readable introduction is [4 1 ] . The theory of this chapter is extended to random elements with values in general metric spaces in Chapter 18.
24
Stochastic Convergence
PROBLEMS 1. If Xn possesses a t-distribution with Show this .
n
degrees of freedom, then
Xn "-"'+ N(O, 1) as n ---+ oo.
2 . Does i t follow immediately from the result of the previous exercise that EX� ---+ EN(O, I ) P for every p E N? Is this true? 3. If Xn "-"'+ N (O, 1) and Yn
�
CJ ,
then
Xn Yn "-"'+ N(O, CJ 2 ) . Show this .
4 . I n what sense i s a chi-square distribution with distribution?
n
degrees of freedom approximately a normal
5. Find an example of s equences such that Xn "-"'+ X and Yn "-"'+ Y, but the j oint sequence (Xn , Y11 ) does not converge in law. 6. If X11 and Yn are independent random vectors for every n, then X11 "-"'+ X and Yn "-"'+ Y imply that (X11 , Y11 ) "-"'+ (X, Y ) , where X and Y are independent. Show this. 7. If every Xn and X possess discrete distributions supported on the integers, then Xn "-"'+ X if and only if P (Xn = x ) ---+ P(X = x ) for every integer x . Show thi s . 8. If P(Xn = i / n ) = I / n for every i = 1 , 2, P ( Xn E B) = 1 for every n, but P(X E B)
then Xn "-"'+ 0. Show this.
. . . , n,
=
X , but there exist B orel sets with
9. If P(Xn = Xn ) = 1 for numbers Xn and Xn ---+ x , then Xn "-"'+ x. Prove this (i) by considering distributions functions (ii) by using Theorem 2.7.
10. State the rule o p (I) + 0 p ( I )
=
0 p (I) in terms of random vectors and show its validity.
11. In what sense is it true that o p ( l )
=
Op ( l ) ? Is it true that O p (l)
12. The rules given by Lemma 2.12 are not simple plug-in rules . (i) Give an example of a function R with R (h) = o ( llh II) as variables X11 such that R (X11) is not equal to op (X11 ) .
=
op (l)?
h ---+ 0 and a sequence of random
(ii) Given an example of a function R such R (h) = 0 ( I I h I I ) a s h ---+ 0 and a sequence of random variables Xn such that Xn = Op ( 1 ) but R (Xn ) is not equal to Op (Xn ) .
13. Find an example of a sequence of random variables such that Xn "-"'+ 0 , but EXn ---+ oo .
14. Find an example of a s equence of random variables such that Xn � 0 , but X11 does not converge almost surely. 15. Let X1 , . . . , Xn be i.i.d. with density !A, a (x) = A. e - !c (x - a ) l { x 2: a } . Calculate the maximum
On ) of (A. , a) and show that (An , 011 ) � (A. , a ) . 16. Let X1 , . . . , X 11 b e i.i.d. standard normal variables . Show that the vector U = (X1 , . . . , Xn ) / N , n n 1 where N 2 = L7= 1 Xf, is uniformly distributed over the unit sphere s - in JR , in the sense that U and 0 U are identically distributed for every orthogonal transformation 0 of lR11 • n 1 n 17. For each n , let U11 be uniformly distributed over the unit sphere s - in JR . Show that the vectors ..jli(Un, 1 , Un, 2 ) converge in distribution to a pair of independent standard normal variables. 18. If ..jli(Tn - e ) converges in distribution, then Tn converges in probability to e . Show this. likelihood estimator of (An ,
19. If EXn ---+
f-L
and var Xn ---+ 0, then Xn
20. I f L� 1 P ( I Xn I
>
8
)
<
oo for every 8
�
>
f-L .
Show this .
0, then X11 converges almost surely t o zero. Show this.
21. Use characteristic functions to show that binomial(n , A./ n) "-"'+ Poisson(A.) . Why does the central limit theorem not hold? 22. If X 1 , . . . , Xn are i.i.d. standard Cauchy, then Xn is standard Cauchy. (i) Show this by using characteristic functions (ii) Why does the weak law not hold?
23. Let X 1 , . . . , Xn be i.i.d. with finite fourth moment. Find constants a , b, and c11 such that the s equence c11 (X11 - a , X� - b) converges in distribution, and determine the limit law. Here Xn and X� are the averages of the Xt and the Xf, respectively.
3 Delta Method
The delta method consists of using a Taylor expansion to approximate a random vector of the form ¢ (T11) by the polynomial cp (e ) +
Basic Result
Suppose an estimator Tn for a parameter e is available, but the quantity of interest is ¢ (e ) for some known function ¢ . A natural estimator is ¢ (T11). How do the asymptotic properties of ¢(T11) follow from those of Tn ? A first result is an immediate consequence of the continuous-mapping theorem. If the sequence T11 converges in probability to e and ¢ is continuous at e , then ¢ (T11) converges in probability to cp (e). Of greater interest is a similar question concerning limit distributions. In particular, if Jfi(T11 -e) converges weakly to a limit distribution, is the same true for Jfi (¢ (T11) -
vn(
-v-+
=
cp (e + h) - ¢ (e) =
h � 0.
All the expressions in this equation are vectors of length m, and II h II is the Euclidean norm. The linear map h f-+ ¢� (h) is sometimes called a "total derivative," as opposed to 25
Delta Method
26
partial derivatives. A sufficient condition for ¢ to be (totally) differentiable is that all partial derivatives arp j (x ) I a xi exist for X in a neighborhood of e and are continuous at e . (Just existence of the partial derivatives is not enough.) In any case, the total derivative is found from the partial derivatives. If ¢ is differentiable, then it is partially differentiable, and the derivative map h r--+ ¢� (h) is matrix multiplication by the matrix � (e)
)
axk
a rpm a xk
(e)
If the dependence of the derivative ¢� on e is continuous, then ¢ is called continuously
differentiable.
It is better to think of a derivative as a linear approximation h r--+ ¢� ( h ) to the function h r--+ ¢ ( e + h) - ¢ (e) than as a set of partial derivatives. Thus the derivative at a point e is a linear map. If the range space of ¢ is the real line (so that the derivative is a horizontal vector), then the derivative is also called the gradient of the function. Note that what is usually called the derivative of a function ¢ : Itt r--+ Itt does not com pletely correspond to the present derivative. The derivative at a point, usually written ¢' (e) , i s written here as ¢� . Although ¢' (e) i s a number, the second object ¢� i s identified with the map h r--+ rjJ� ( h ) = ¢'(e) h. Thus in the present terminology the usual derivative function e r--+ ¢' (e) is a map from Itt into the set of linear maps from Itt r--+ Itt , not a map from Itt r--+ JEt. Graphically the "affine" approximation h r--+ ¢ (e) + ¢� (h) is the tangent to the function ¢ at e .
Let ¢ : lDrp C Ittk r--+ Ittm be a map defined on a subset of Ittk and dif ferentiable at e. Let Tn be random vectors taking their values in the domain of ¢. If rn ( Tn - e) """"' T for numbers rn � oo, then rn ( rfJ C Tn ) - ¢ (e) ) 'V'-7 rjJ� (T). Moreover, the diffe rence between rn (rfJ C Tn ) - rp (e)) and rfJHrn C Tn - e) ) converges to zero in probability. 3.1
Theorem.
Because the sequence rn ( Tn - e) converges in distribution, it is uniformly tight and Tn - e converges to zero in probability. By the differentiability of ¢ the remainder function R (h) = ¢ (e + h) - rp (e) - ¢� (h) satisfies R (h) = o ( ll h II) as h � 0. Lemma 2. 12 allows to replace the fixed h by a random sequence and gives
Proof.
rjJ (Tn) - ¢ (e) - rfJ� ( Tn - e) = R ( Tn - e) = op ( II Tn - e 11 ) .
Multiply this left and right with rn , and note that 0 p ( rn II Tn - e II ) = Op ( 1 ) by tightness of the sequence rn (Tn - e ) . This yields the last statement of the theorem. Because matrix multiplication is continuous, ¢� ( rn (Tn - e) ) ¢� (T) by the continuous-mapping theorem. Apply Slutsky's lemma to conclude that the sequence rn (¢ (Tn) - rp (e)) has the same weak limit. • 'V'--7
A common situation is that Jn(Tn - e ) converges to a multivariate normal distribution Nk (fl.,, L: ) . Then the conclusion of the theorem is that the sequence ,Jn ( ¢ (Tn) - ¢ (e) ) converges in law to the Nm (r/J�fl.,, rjJ� L: (rjJ� ) T ) distribution. Example (Sample variance). The sample variance of n observations X 1 , . . . , Xn is defined as S 2 = n - 1 L7= 1 ( X i - X) 2 and can be written as ¢ ( X , X 2 ) for the function
3.2
3. 1
Basic Result
27
-x 2 . (For simplicity of notation, we divide by n rather than n - 1 .) Suppose that 5 2 is based on a sample from a distribution with finite first to fourth moments a1 , a 2 , a3 , a4 .
cp (x , y) =
y
By the multivariate central limit theorem,
The map ¢ is differentiable at the point () = (a1 , a 2 )T, with derivative ¢c'aI , a2 ) = ( -2a1 , 1 ) . Thus if the vector (T1 , T2 ) ' possesses the normal distribution in the last display, then The latter variable is normally distributed with zero mean and a variance that can be ex pressed in a1 , . . . , a4 • In case a1 = 0, this variance is simply a4 - a � . The general case can be reduced to this case, because 52 does not change if the observations Xi are replaced by the centered variables Yi = Xi - a1. Write f.L k = EYik for the central moments of the Xi . Noting that 5 2 = ¢ (Y , Y 2 ) and that ¢ (f.lr , f.l 2 ) = f.l 2 is the variance of the original observations, we obtain In view of Slutsky's lemma, the same result is valid for the unbiased version nj(n - 1)5 2 of the sample variance, because Jli (n/(n - 1) - 1) -+ 0. 0
3.3 Example (Level of the chi-square test). As an application of the preceding example, consider the chi-square test for testing variance. Normal theory prescribes to reject the null hypothesis H0 : f.L 2 :S 1 for values of n 5 2 exceeding the upper a point x; a of the xL 1 distribution. If the observations are sampled from a normal distribution, then the test has exactly level a. Is this still approximately the case if the underlying distribution is not normal? Unfortunately, the answer is negative. For large values of n, this can be seen with the help of the preceding result. The central limit theorem and the preceding example yield the two statements Xn2- 1 - (n - 1 ) v'2n - 2
-v-t
N (O ' 1) '
where K = f.L4 / f.l � - 3 is the kurtosis of the underlying distribution. The first statement implies that ( x ;, a - (n - 1 )) j v'2n - 2) converges to the upper a point Za of the standard normal distribution. Thus the level of the chi-square test satisfies
( r:: ( 52
)
)
(
)
x; a - n za v'2 -+ 1 - Xn2, a ) = P v n f.l - 1 > ,J1i v'K + 2 · 2 The asymptotic level reduces to 1 -
distribution is 0. This is the case for normal distributions. On the other hand, heavy-tailed distributions have a much larger kurtosis. If the kurtosis of the underlying distribution is "close to" infinity, then the asymptotic level is close to 1 -
Delta Method
28
Level of the test that rejects if nS 2 I M exceeds the 0. 95 quantile ofthe xf9 distribution.
Table 3 . 1 .
Law
Level
Laplace
0.12 0.12
0.95 N (O, 1) + 0.05 N (O, 9)
Note: Approximations based on simulation of 1 0,000 samples.
normal approximation to the distribution of ,.jii ( S 2 I fJ, 2 - 1) the problem would not arise, provided the asymptotic variance K + 2 is estimated accurately. Table 3. 1 gives the level for two distributions with slightly heavier tails than the normal distribution. D In the preceding example the asymptotic distribution of ,Jri(S 2 - CJ 2 ) was obtained by the delta method. Actually, it can also and more easily be derived by a direct expansion. Write vr::n (S 2 - (J 2 )
r:::
= vn
( �{=( 1 -;;_
) -
(X i - 1-i ) 2 - (J 2 - vr:::n (X - fJ,) 2 .
The second term converges to zero in probability; the first term is asymptotically normal by the central limit theorem. The whole expression is asymptotically normal by Slutsky's lemma. Thus it is not always a good idea to apply general theorems. However, in many exam ples the delta method is a good way to package the mechanics of Taylor expansions in a transparent way.
3.4 Example. Consider the joint limit distribution of the sample variance S2 and the t -statistic X IS . Again for the limit distribution it does not make a difference whether we use a factor or - 1 to standardize S 2 . For simplicity we use Then ( S 2 , X IS) can be written as ¢ (X, X2) for the map ¢ : Itt2 r--+ Itt2 given by n
n
n.
¢ (x, y)
=
(y - x 2 , (y - :2) 1 /2 ) .
The joint limit distribution of ,.jii ( X - a 1 , X2 - a 2 ) is derived in the preceding example. The map ¢ is differentiable at e = (a l , a 2 ) provided CJ 2 = a 2 - ar is positive, with derivative
It follows that the sequence ,Jri(S2 - CJ 2 , X 1 S - a l /CJ) is asymptotically bivariate normally distributed, with zero mean and covariance matrix,
It is easy but uninteresting to compute this explicitly. D
3. 1
3.5
Basic Result
29
The sample skewness of a sample X 1 , . . , , Xn is defined as n n - 1 L... '\' r = 1 (X · - X) 3 ln /2 . 1 (n - L7= 1 (Xi - X) 2 ) 3 Not surprisingly it converges in probability to the skewness of the underlying distribution, defined as the quotient A. = f..L 3 /C5 3 of the third central moment and the third power of the standard deviation of one observation. The skewness of a symmetric distribution, such as the normal distribution, equals zero, and the sample skewness may be used to test this aspect of normality of the underlying distribution. For large samples a critical value may be determined from the normal approximation for the sample skewness. The sample skewness can be written as ¢ (X, X 2 , X3) for the function ¢ given by c - 3ab + 2a 3 . ¢ (a , b, c) = (b - a 2 ) 3 / 2 The sequence .Jfi.(X - a 1 , X 2 - a 2 , X3 - a 3 ) is asymptotically mean-zero normal by the central limit theorem, provided EXf is finite. The value ¢ (a 1 , a2 , a3 ) is exactly the popu lation skewness. The function ¢ is differentiable at the point (a 1 , a 2 , a 3 ) and application of the delta method is straightforward. We can save work by noting that the sample skewness is location and scale invariant. With Yi = (Xi - a 1 ) j C5, the skewness can also be written as ¢ (Y , Y 2 , Y3). With A. = t.L 3 / C5 3 denoting the skewness of the underlying distribution, the Ys satisfy Example (Skewness).
z
( � ) (a, ( �
� y 2 1 """ N K + 3 y3 - A The derivative of ¢ at the point (0, 1 , A.) equals ( - 3 , - 3/.. / 2, 1 ) . Hence, if T possesses the normal distribution in the display, then Jfi(ln - A.) is asymptotically normal distributed with mean zero and variance equal to var(- 3 T1 - 3A.T2 j2 + T3 ) . If the underlying distribution is normal, then A. = f..L s = 0, K = 0 and f..L 6 /C5 6 = 15. In that case the sample skewness is asymptotically N (0, 6) -distributed. An approximate level a test for normality based on the sample skewness could be to reject normality if .Jfi l ln I > .J6 Zaf 2 · Table 3.2 gives the level of this test for different values of n . D 3 .2. Level of the test that rejects if .Jrl l ln I / .J6 exceeds the 0. 975 quantile of the normal distribution, in the case that the observations are normally distributed.
Table
n
Level
10 20 30 50
0.02 0.03 0.03 0.05
Note: Approximations based on simula tion of 10,000 samples.
Delta Method
30 3.2
Variance-Stabilizing Transformations
Given a sequence of statistics Tn with Jfi(Tn - 8) -!. N (0, 0" 2 (8)) for a range of values of e ' asymptotic confidence intervals for e are given by
These are asymptotically of level 1 - 2a in that the probability that e is covered by the interval converges to 1 - 2a for every e . Unfortunately, as stated previously, these intervals are useless, because of their dependence on the unknown e . One solution is to replace the unknown standard deviations 0" (8) by estimators. If the sequence of estimators is chosen consistent, then the resulting confidence interval still has asymptotic leve1 1 - 2a . Another approach is to use a variance-stabilizing transformation, which often leads to a better approximation. The idea is that no problem arises if the asymptotic variances 0" 2 (8) are independent of e . Although this fortunate situation i s rare, it i s often possible to transform the parameter into a different parameter 77 = ¢ (8), for which this idea can be applied. The natural estimator for 77 is cp (Tn). If cp is differentiable, then
=
For cp chosen such that ¢'(8)0" (8) 1 , the asymptotic variance is constant and finding an asymptotic confidence interval for 77 = cp (8) is easy. The solution ¢ (8)
=
f -0" (8)1- dB
is a variance-stabililizing transformation. variance stabililizing transformation. If it is well defined, then it is automatically monotone, so that a confidence interval for 77 can be transformed back into a confidence interval for e .
3.6 Example (Correlation). Let ( X1 , Y1 ) , . . . , (Xn , Yn) be a sample from a bivariate nor mal distribution with correlation coefficient p . The sample correlation coefficient is defined as
With the help of the delta method, it is possible to derive that Jfi(r - is asymptotically zero-mean normal, with variance depending on the (mixed) third and fourth moments of (X, Y) . This is true for general underlying distributions, provided the fourth moments exist. Under the normality assumption the asymptotic variance can be expressed in the correlation of X and Y. Tedious algebra gives
p)
It does not work very well to base an asymptotic confidence interval directly on this result.
Higher-Order Expansions
3.3
31
Table 3.3. Coverage probability of the asymptotic 95% confidence interval for the correlation coefficient, for two values ofn andfive different values of the true correlation p.
n 15 25
p
=
0
0.92 0.93
p
=
0.2
p
=
0.4
p
0.92 0.94
0.92 0.94
=
0.6
0.93 0.94
p
=
0.8
0.92 0.94
Note: Approximations based on simulation of 10,000 samples.
1
-1 .4
-1 . 0
-0. 8
rh -0. 2
Figure 3.1. Histogram of 1000 sample correlation coefficients, based on 1000 independent samples of the the bivariate normal distribution with c orrelation 0.6, and histogram of the arctanh of these values .
The transformation
¢(p) f 1 1 p2 dp 21 1og 11 + pp arctanh p is variance stabilizing. Thus, the sequence .Jri ( arctanh r -arctanh p) converges to a standard normal distribution for every p. This leads to the asymptotic confidence interval for the correlation coefficient p given by (tanh (arctanh r - Za I -Jfi) , tanh (arctanh r + Za I -Jfi)) . =
_
=
_
=
Table 3.3 gives an indication of the accuracy of this interval. Besides stabilizing the vari ance the arctanh transformation has the benefit of symmetrizing the distribution of the sample correlation coefficient (which is perhaps of greater importance), as can be seen in Figure 5.3. D
*3.3
Higher-Order Expansions
To package a simple idea in a theorem has the danger of obscuring the idea. The delta method is based on a Taylor expansion of order one. Sometimes a problem cannot be exactly forced into the framework described by the theorem, but the principle of a Taylor expansion is still valid.
Delta Method
32
In the one-dimensional case, a Taylor expansion applied to a statistic Tn has the form
Usually the linear term (Tn - e)¢>' (e) is of higher order than the remainder, and thus determines the order at which cf> (Tn) - cp (e) converges to zero: the same order as Tn - e . Then the approach of the preceding section gives the limit distribution of ¢> (Tn) - ¢> (e). If ¢>'(e) = 0, this approach is still valid but not of much interest, because the resulting limit distribution is degenerate at zero. Then it is more informative to multiply the difference ¢> (Tn) - ¢> (e) by a higher rate and obtain a nondegenerate limit distribution. Looking at the Taylor expansion, we see that the linear term disappears if cf>' (e) = 0, and we expect that the quadratic term determines the limit behavior of cp (Tn ) .
3.7 Example. Suppose that .jn X converges weakly to a standard normal distribution. Because the derivative of x r--+ cos x is zero at x = 0, the standard delta method of the preceding section yields that Jn (cos X - cos 0) converges weakly to 0. It should be concluded that Jn is not the right norming rate for the random sequence cos - A more informative statement is that -2n (cos X - converges in distribution to a chi -square distribution with one degree of freedom. The explanation is that 0) 2 (cos x) ;�=O · · · . 0)0 cos 0 cos
X 1.
1) + X - = (X - + �(X That the remainder term is negligible after multiplication with n can be shown along the same lines as the proof of Theorem 3.1. The sequence n X2 converges in law to a x � distribution by the continuous-mapping theorem; the sequence -2n (cos X - 1) has the same limit, by Slutsky's lemma. D
A more complicated situation arises if the statistic Tn is higher-dimensional with coor dinates of different orders of magnitude. For instance, for a real-valued function ¢> ,
If the sequences Tn,i - ei are of different order, then it may happen, for instance, that the linear part involving Tn,i - ei is of the same order as the quadratic part involving (Tn,j - ej ) 2 . Thus, it is necessary to determine carefully the rate of all terms in the expansion, and to rearrange these in decreasing order of magnitude, before neglecting the "remainder."
*3.4
Uniform Delta Method
Sometimes we wish to prove the asymptotic normality of a sequence .jn(¢> (Tn ) - ¢> (en)) for centering vectors en changing with n, rather than a fixed vector. If .jn(en - e) -* h for certain vectors e and h , then this can be handled easily by decomposing
Moments
3. 5
33
Several applications of Slutsky's lemma and the delta method yield as limit in law the vector tjJ� (T + h) - tjJ� (h ) = tjJ� (T), if T is the limit in distribution of y'ri(Tn - en) . For en � e at a slower rate, this argument does not work. However, the same result is true under a slightly stronger differentiability assumption on ¢.
Let tjJ : JRk f--+ JRm be a map defined and continuously diffe rentiable in a neighborhood of e. Let Tn be random vectors taking their values in the domain of ¢. lf rn (Tn - en) T for vectors en � e and numbers rn � oo, then rn ( t/J (Tn ) tjJ (en)) -v-+ ¢� (T). Moreover, the diffe rence between rn ( ¢ (Tn) - tjJ (en)) and ¢� ( rn (Tn - en)) converges to zero in probability.
3.8
Theorem.
-v-+
It suffices to prove the last assertion. Because convergence in probability to zero of vectors is equivalent to convergence to zero of the components separately, it is no loss of generality to assume that tjJ is real-valued. For 0 :::: t :::: and fixed h, define gn (t) = tjJ (en + th). For sufficiently large n and sufficiently small h, both en and en + h are in a ball around e inside the neighborhood on which tjJ is differentiable. Then gn : [0, f--+ lR is continuously differentiable with derivative g� (t) = ¢� + t h (h). By the mean -value theorem, gn - gn (0) = g� (�) for some 0 :=:: � :=:: In other words
Proof
1
1]
"
(1)
1.
By the continuity of the map e f--+ ¢� , there exists for every s > 0 a 8 > 0 such that ll tP� (h) - ¢� (h) I < s ll h ll for every li s - e 11 < 8 and every h. For sufficiently large n and llh l l < 8 / 2, the vectors en + �h are within distance 8 of e , so that the norm I R n (h) I of the right side of the preceding display is bounded by s ll h 11 . Thus, for any 1J > 0,
The first term converges to zero as n by choosing s small. •
� oo.
*3.5
The second term can be made arbitrarily small
Moments
So far we have discussed the stability of convergence in distribution under transformations. We can pose the same problem regarding moments: Can an expansion for the moments of tjJ (Tn) - tjJ (e) be derived from a similar expansion for the moments of Tn - e? In principle the answer is affirmative, but unlike in the distributional case, in which a simple derivative of tjJ is enough, global regularity conditions on tjJ are needed to argue that the remainder terms are negligible. One possible approach is to apply the distributional delta method first, thus yielding the qualitative asymptotic behavior. Next, the convergence of the moments of tjJ (Tn) - tjJ (e) (or a remainder term) is a matter of uniform integrability, in view of Lemma 2.20. If tjJ is uniformly Lipschitz, then this uniform integrability follows from the corresponding uniform integrability of Tn - e . If tjJ has an unbounded derivative, then the connection between moments of tjJ ( Tn) - tjJ (e) and Tn - e is harder to make, in general.
Delta Method
34
Notes
The Delta method belongs to the folklore of statistics. It is not entirely trivial; proofs are sometimes based on the mean-value theorem and then require continuous differentiability in a neighborhood. A generalization to functions on infinite-dimensional spaces is discussed in Chapter 20. PROBLEMS
1.
Find the j oint limit distribution of ( fL) , if and are based on a sample of size from a distribution with finite fourth moment. Under what condition on the underlying distribution are t-L) and asymptotically independent?
,Jfi(X - ,Jfi(S2 - a2)) X S2 n ,Jfi(X ,Jfi(S2 - a2) 2. Find the asymptotic distribution of ,Jn (r - p) if r is the correlation coefficient of a sample of n
bivariate vectors with finite fourth moments. (This is quite a bit of work. It helps to assume that the mean and the variance are equal to 0 and 1, respectively.)
3. Investigate the asymptotic robustness of the level of the t-test for testing the mean that rejects Ho : fL ::::: 0 if is larger than the upper a quantile of the tn - 1 distribution.
,JfiX IS
n - 1 L7= 1 (Xi - X)4 I S4 -
4. Find the limit distribution of the sample kurtosis kn = 3, and design an asymptotic level a test for normality based on kn . (Warning : At least 500 observations are needed to make the normal approximation work in this case.) 5. Design an asymptotic level a test for normality based on the sample skewness and kurtosis jointly. 6. Let be i.i.d. with expectation fL and variance converges in distribution if fL = 0 or fL i= 0.
X 1 , . . . , Xn 1 . Find constants such that a n (X� - bn) 7. Let X 1 , . . . , X n be a random sample from the Poisson distribution with mean e. Find a variance stabilizing transformation for the sample mean, and construct a confidence interval for e based on this.
X 1 , 1. . . , Xn be i.i.d. with expectation 1 and finite variance. Find the limit distribution of ,Jfi(X� - 1) . If the random variables are sampled from f that i s bounded and strictly 1 aoodensity positive in a neighborhood of zero, show that EIX� 1 for every n. (The density of Xn is bounded away from zero in a neighborhood of zero for every n.)
8. Let
=
4 Moment Estimators
The method of moments determines estimators by comparing sample and theoretical moments. Moment estimators are useful for their simplicity, although not always optimal. Maximum likelihood estimators for full ex ponentialfamilies are moment estimators, and their asymptotic normality can be proved by treating them as such. 4.1
Method of Moments
Let 1 , be a sample from a distribution that depends on a parameter e, ranging over some set 8. The method of moments consists of estimating e by the solution of a system of equations
X . . . , Xn
Pe
1- Ln fJ(Xt) Ee fj (X), n i=l =
j
=
1, . . . , k,
for given functions f1 , . . . , fk . Thus the parameter is chosen such that the sample moments (on the left side) match the theoretical moments. If the parameter is k-dimensional one usually tries to match k moments in this manner. The choices fJ (x) = x lead to the method of moments in its simplest form. Moment estimators are not necessarily the best estimators, but under reasonable condi tions they have convergence rate Jn and are asymptotically normal. This is a consequence of the delta method. Write the given functions in the vector notation f = (!I , . . . , fk ), and let e : 8 r-:>- �k be the vector-valued expectation e( F3 ) = Then the moment estimator en solves the system of equations
j
Pe f .
r?n f
=
1 Ln f (Xt )
-
n i=l
=
e( F3)
=
Pe f.
For existence of the moment estimator, it is necessary that the vector r?n f be in the range of the function e. If e is one-to-one, then the moment estimator is uniquely determined as en = e - 1 Wn f) and A
If r?n f is asymptotically normal and e - 1 is differentiable, then the right side is asymptoti cally normal by the delta method. 35
Moment Estimators
36
The derivative of e - 1 at e ( e0) is the inverse e�� 1 of the derivative of e at 80. Because the function e - 1 is often not explicit, it is convenient to ascertain its differentiability from the differentiability of e. This is possible by the inverse function theorem. According to this theorem a map that is (continuously) differentiable throughout an open set with nonsingular derivatives is locally one-to-one, is of full rank, and has a differentiable inverse. Thus we obtain the following theorem.
Suppose that e(e) = Pe f is one-to-one on an open set 8 c JRk and con tinuously differentiable at e0 with nonsingular derivative e�a . Moreover, assume that Pea I f 11 2 < oo. Then moment estimators en exist with probability tending to one and satisfy 4.1
Theorem.
�
Continuous differentiability at e0 presumes differentiability in a neighborhood and the continuity of e r--+ e� and nonsingularity of e�a imply nonsingularity in a neighborhood. Therefore, by the inverse function theorem there exist open neighborhoods U of e0 and V of Pea f such that e : U r--+ V is a differentiable bijection with a differentiable inverse e - 1 : V r--+ U . Moment estimators en = e - 1 (Pn f) exist as soon as Pn f E V, which happens with probability tending to by the law of large numbers. The central limit theorem guarantees asymptotic normality of the sequence Jn Wn f Pea f) . Next use Theorem on the display preceding the statement of the theorem. •
Proof.
1
3.1
For completeness, the following two lemmas constitute, if combined, a proof of the inverse function theorem. If necessary the preceding theorem can be strengthened somewhat by applying the lemmas directly. Furthermore, the first lemma can be easily generalized to infinite-dimensional parameters, such as used in the semiparametric models discussed in Chapter 25.
Let 8 c JRk be arbitrary and let e : 8 r--+ JRk be one-to-one and differentiable at a point e0 with a nonsingular derivative. Then the inverse e - 1 (defined on the range of e) is diffe rentiable at e(e0) provided it is continuous at e(e0). Proof. Write ry = e(e0) and l:!.. h = e - 1 (ry + h) - e - 1 (ry) . Because e - 1 is continuous at ry, we have that l:!.. h r--+ 0 as h r--+ 0. Thus 4.2
Lemma.
as h r--+ 0, where the last step follows from differentiability of e. The displayed equation can be rewritten as e�a ( I:!.. h) = h + o( ll l:!.. h II ) . By continuity of the inverse of e�a' this implies that
l:!.. h = e�� 1 (h) + o( ll i:!.. h ll ) . In particular, ll l:!.. h II ( + o( :::: I e�� 1 (h) I = 0 ( II h II) . Insert this in the displayed equation to obtain the desired result that i:!.. h = e�� 1 (h) + o( ll h ll ) . •
1 1) )
Let e : 8 r--+ JRk be defined and differentiable in a neighborhood of a point e0 and continuously diffe rentiable at e0 with a nonsingular derivative. Then e maps every 4.3
Lemma.
Exponential Families
4. 2
37
sufficiently small open neighborhood U of80 onto an open set V and e - 1 : V f---+ U is well defined and continuous. By assumption, e� -? A- 1 := e�0 as f) f---+ 80. Thus Il l - Ae � ll ::::; � for every e in a sufficiently small neighborhood U of 80 . Fix an arbitrary point ry 1 = e(f)I) from V = e(U) (where 81 E U). Next find an c > 0 such that ball(81 , c ) c U, and fix an arbitrary point ry with II ry - rJ1 II < 8 := � II A l l - 1 c . It will be shown that ry = e (f)) for some point e E ball(8 1 , c). Hence every ry E ball(ry 1 , 8) has an original in ball(81 , c). If e is one-to-one on U, so that the original is unique, then it follows that V is open and that e - 1 is continuous at ry1 . Define a function ¢ (8) = f) + A (ry - e (8)) . Because the norm of the derivative ¢� = I - Ae� is bounded by � throughout U, the map ¢ is a contraction on U . Furthermore, if 1 1 8 - 8 1 ll :S c,
Proof
Consequently, ¢ maps ball(81 , c ) into itself. Because ¢ is a contraction, it has a fixed point f) E ball(8 1 , c ) : a point with ¢ (8) = f) . By definition of ¢ this satisfies e (f)) = ry. Any other B with e ( B ) = ry is also a fixed point of ¢. In that case the difference B - f) = ¢ ( B ) - ¢ (8) has norm bounded by li B - 8 11 - This can only happen if B = e . Hence e is one-to-one throughout U . •
�
4.4 Example. Let X 1 , . . . , Xn be a random sample from the beta-distribution: The com mon density is equal to
x f---+
rca + fJ) x a -1 (1 - x) {3 - 1 lo<x
r (a)r ({J)
(a, {3 ) is the solution of the system of equations a Xn = Ea R X 1 = -- , a + fJ(a + l)a Xn2 - Ea ,f3 X 21 (a + fJ + l)(a + fJ) The righthand side is a smooth and regular function of (a, {3 ), and the equations can be
The moment estimator for
· �-"
solved explicitly. Hence, the moment estimators exist and are asymptotically normal. D
*4.2
Exponential Families
Maximum likelihood estimators in full exponential families are moment estimators. This can be exploited to show their asymptotic normality. Actually, as shown in Chapter 5, maximum likelihood estimators in smoothly parametrized models are asymptotically normal in great generality. Therefore the present section is included for the benefit of the simple proof, rather than as an explanation of the limit properties. Let X 1 , . . . , Xn be a sample from the k-dimensional exponential family with density
pe (x) = c (f)) h(x)
eeT t(x)_
Moment Estimators
38
Thus h and t = (t 1 , . . . , tk ) are known functions on the sample space, and the family is given in its natural parametrization. The parameter set e must be contained in the natural parameter space for the family. This is the set of e for which Pe can define a probability density. If fJ., is the dominating measure, then this is the right side in
It is a standard result (and not hard to see) that the natural parameter space is convex. It is usually open, in which case the family is called "regular." In any case, we assume that the true parameter is an inner point of 8. Another standard result concerns the smoothness of the function e f--+ c ( 8 ) , or rather of its inverse. (For a proof of the following lemma, see [ 1 00, p. 59] or [17, p. 39] . )
4.5 Lemma. Thefunction () f--+ j h (x ) e e rt(x) dfl,(X) is analytic on the set {() E Ck : Re (3 8 }. Its derivatives can be found by differentiating (repeatedly) under the integral sign: 0
E
for any natural numbers p and i 1 + · · · + ik = p. The lemma implies that the log likelihood £0 (x ) = log p0 (x ) can be differentiated (in finitely often) with respect to e . The vector of partial derivatives (the score function) satisfies
.
c c
£0 (x ) = - (8) + t (x) = t (x) - Ee t (X) .
Here the second equality is an example of the general rule that score functions have zero means. It can formally be established by differentiating the identity J Pe dfl, 1 under the integral sign: Combine the lemma and the Leibniz rule to see that =
� i
� pe dfl, = ae
f a aec({)i ) h (X) ee
T t(x) dfl,(X)
+
f c({) ) h(X) ti (X) ee
T t (x) dfl,(X)
.
The left side is zero and the equation can be rewritten as 0 = cj c (e ) + Ee t ( X ) . It follows that the likelihood equations L l e (Xi ) = 0 reduce to the system of k equations 1
n
-n L t ( Xi ) = Ee t (X) . i=l
Thus, the maximum likelihood estimators are moment estimators . Their asymptotic prop erties depend on the function e ( 8 ) = Ee t (X) , which is very well behaved on the interior of the natural parameter set. By differentiating Ee t (X) under the expectation sign (which is justified by the lemma), we see that its derivative matrices are given by
e� = Cove t (X) . The exponential family is said to be of full rank if no linear combination L � =l J... jtj (X) is constant with probability 1 ; equivalently, if the covariance matrix of t (X) is nonsingular. In
4. 2
Exponential Families
39
view of the preceding display, this ensures that the derivative e� is strictly positive-definite throughout the interior of the natural parameter set. Then e is one-to-one, so that there exists at most one solution to the moment equations. (Cf. Problem 4.6.) In view of the expression for le , the matrix - ne � is the second-derivative matrix (Hessian) of the log likelihood L 7= 1 ie (X d . Thus, a solution to the moment equations must be a point of maximum of the log likelihood. A solution can be shown to exist (within the natural parameter space) with probability 1 if the exponential family is "regular," or more generally "steep" (see [ 17]); it is then a point of absolute maximum of the likelihood. If the true parameter is in the interior of the parameter set, then a (unique) solution en exists with probability tending to 1 as n f-+ 00, in any case, by Theorem 4. 1 . Moreover, this theorem shows that the sequence ,Jfi(en - 80) is asymptotically normal with covariance matrix
eg1 0 -1 Cove0 t (X) ( eg1 0 - l ) T = ( Cove0 t (X)) - 1 .
So far we have considered an exponential family in standard form. Many examples arise in the form
pe (x ) = d (8) h (x) e Q (B/ t (x ) ,
(4.6)
where Q = ( Q 1 , , Q k ) is a vector-valued function. If Q is one-to-one and a maximum likelihood estimator en exists, then by the invariance of maximum likelihood estimators under transformations ' Q (en) is the maximum likelihood estimator for the natural parameter Q (8) as considered before. If the range of Q contains an open ball around Q (80) , then the preceding discussion shows that the sequence .Jfi ( Q ( en ) - Q (80) ) is asymptotically normal. It requires another application of the delta method to obtain the limit distribution of .J1i (en - 80) . As is typical of maximum likelihood estimators, the asymptotic covariance matrix is the inverse of the Fisher information matrix .
.
•
Ie
=
.
.
Ee ie (X)ie (X) T .
Let 8 c ffi.k be open and let Q : 8 f-+ ffi.k be one-to-one and continuously differentiable throughout 8 with nonsingular derivatives. Let the (exponential) family of densities Pe be given by (4. 6) and be offull rank. Then the likelihood equations have a unique solution en with probability tending to 1 and ,jfi( en - 8) � N (O, Ie-1) for every e. 4.6
Theorem.
According to the inverse function theorem, the range of Q is open and the inverse map Q- 1 is differentiable throughout this range. Thus, as discussed previously, the delta method ensures the asymptotic normality. It suffices to calculate the asymptotic covariance matrix. By the preceding discussion this is equal to
Proof.
By direct calculation, the score function for the model is equal to l e (x ) dj d (8) + ( Q�) T t (x) . As before, the score function has mean zero, so that this can be rewritten as le (x) = ( Q�) T ( t (x) - Ee t (X) ) . Thus, the Fisher information matrix equals Ie = ( Q�) T Cove t (X) Q� . This is the inverse of the asymptotic covariance matrix given in the preceding display. •
Moment Estimators
40
Not all exponential families satisfy the conditions of the theorem. For instance, the normal N (8 , 8 2 ) family is an example of a "curved exponential family." The map Q (8) = (8 - 2 , 8- 1 ) (with t (x) = ( - x 2 / 2, x)) does not fill up the natural parameter space of the normal location-scale family but only traces out a one-dimensional curve. In such cases the result of the theorem may still hold. In fact, the result is true for most models with "smooth parametrizations," as is seen in Chapter 5. However, the "easy" proof of this section is not valid. PROBLEMS 1.
X . . . , Xn
[ -e, e].
Let 1 , be a sample from the uniform distribution on Find the moment estimator of based on X 2 . Is it asymptotically normal? Can you think of an estimator for e that converges faster to the parameter?
e
X 1 , . . . , Xn
2. Let be a sample from a density pe and f a function such that e (e ) = Ee f (X) is differentiable with e ' = Ee t e f for £ e = log pe . (i) Show that the asymptotic variance of the moment estimator based on f equals vare (f) I . cove (f, £ e ) 2 . (ii) Show that this is bigger than Ie- l with equality for all if and only if the moment estimator is the maximum likelihood estimator.
(e)
(X) (X)
e
(iii) Show that the latter happens only for exponential family members .
3. To what extent does the result of Theorem
4. 1 require that the observations are i.i.d.?
4. Let the observations be a sample of size n from the N (tL , a 2 ) distribution. Calculate the Fisher information matrix for the parameter = (tL , a 2 ) and its inverse. Check directly that the maximum 1 likelihood estimator is asymptotically normal with zero mean and covariance matrix I() .
e
5. Establish the formula e(; = Cove t (X) by differentiating e (e ) = Ee t (X) under the integral sign. (Differentiating under the integral sign is justified by Lemma 4.5, because Ee t is the first derivative of
(X)
c(e)- 1 .)
6. Suppose a function e : 8 r+ lltk is defined and continuously differentiable on a convex subset 8 C lltk with strictly positive-definite derivative matrix. Then e has at most one zero in B . (Consider the function g(A.) = 2 ) e (A. l + (1 - A. ) 2 ) for given =f. 2 and 0 :S A. :S 1. I f g(O) = g(l) = 0, then there exists a point A.o with g' (A.o) = 0 by the mean-value theorem.)
(e l - e T e
e
e1 e
5 M- and Z -Estimators
This chapter gives an introduction to the consistency and asymptotic normality of M -estimators and Z-estimators. Maximum likelihood esti mators are treated as a special case.
5.1
Introduction
Suppose that we are interested in a parameter (or "functional") e attached to the distribution of observations X 1 , . . . , Xn . A popular method for finding an estimator en = en (X 1 , . . . , Xn) is to maximize a criterion function of the type (5. 1 ) Here m e : X � lR are known functions . An estimator maximizing Mn (e) over 8 is called an M -estimator. In this chapter we investigate the asymptotic behavior of sequences of M -estimators. Often the maximizing value is sought by setting a derivative (or the set of partial deriva tives in the multidimensional case) equal to zero. Therefore, the name M -estimator is also used for estimators satisfying systems of equations of the type
n
Wn (e)
1 = -:2: 1/re (X i ) = 0. n i= l
(5.2)
Here 1./le are known vector-valued maps. For instance, if e is k-dimensional, then 1/re typically has k coordinate functions 1/re = (1/re , 1 , , 1/re , k ), and (5.2) is shorthand for the system of equations .
•
.
n
L Yre,j (Xi) = 0, j = 1 , 2, . . . , k. i= l Even though in many examples 1/re ,j is the j th partial derivative of some function me , this
is irrelevant for the following. Equations, such as (5.2), defining an estimator are called estimating equations and need not correspond to a maximization problem. In the latter case it is probably better to call the corresponding estimators Z-estimators (for zero), but the use of the name M -estimator is widespread. 41
42
M- and Z-Estimators
Sometimes the maximum of the criterion function Mn is not taken or the estimating equation does not have an exact solution. Then it is natural to use as estimator a value that almost maximizes the criterion function or is a near zero. This yields approximate M -estimators or Z-estimators. Estimators that are sufficiently close to being a point of maximum or a zero often have the same asymptotic behavior. An operator notation for taking expectations simplifies the formulas in this chapter. We write P for the marginal law of the observations X 1 , . . . , Xn , which we assume to be identically distributed. Furthermore, we write P f for the expectation E f (X) = J f d P and abbreviate the average n - 1 :L 7= 1 f(X J to rl'n f· Thus rl'n is the empirical distribution: the (random) discrete distribution that puts mass 1 In at every of the observations X 1 , . . . , Xn . The criterion functions now take the forms We also abbreviate the centered sums n - 1 1 2 :L 7= 1 ( f(Xi) - Pf) to G n f, the empirical process at f.
5.3 Example (Maximum likelihood estimators). Suppose X 1 , . . . , Xn have a common density Pe . Then the maximum likelihood estimator maximizes the likelihood TI 7= 1 Pe (X i ), or equivalently the log likelihood (J r--+
n L log pe (Xi ). i= 1
Thus, a maximum likelihood estimator is an M-estimator as in (5. 1 ) with me = log pe . If the density is partially differentiable with respect to (] for each fixed x, then the maximum likelihood estimator also solves an equation of type (5.2), with lJ;e equal to the vector of partial derivatives le , j = a I aej log Pe . The vector-valued function le is known as the score function of the model. The definition (5. 1 ) of an M-estimator may apply in cases where (5.2) does not. For instance, if X 1 , . . . , Xn are i.i.d. according to the uniform distribution on [0, (]], then it makes sense to maximize the log likelihood (]
r--+
n
L (log 1 [O, e J (Xi ) - log (] ) . i= 1
=
(Define log 0 -oo.) However, this function is not smooth in (] and there exists no natural version of (5.2). Thus, in this example the definition as the location of a maximum is more fundamental than the definition as a zero. D
5.4 Example (Location estimators). Let X 1 , . . . , Xn be a random sample of real-valued observations and suppose we want to estimate the location of their distribution. "Location" is a vague term; it could be made precise by defining it as the mean or median, or the center of symmetry of the distribution if this happens to be symmetric. Two examples of location estimators are the sample mean and the sample median. Both are Z-estimators, because they solve the equations
n n L (Xi - (]) = 0 ; and L sign( X i - (]) = 0, i= 1 i=1
5. 1
43
Introduction
respectively. t Both estimating equations involve functions of the form 1/J (x - 8) for a function 1/J that is monotone and odd around zero. It seems reasonable to study estimators that solve a general equation of the type n
2: 1/J ( X i - 8) = 0. i=l We can consider a Z-estimator defined by this equation a "location" estimator, because it has the desirable property of location equivariance. If the observations Xi are shifted by a fixed amount a, then so is the estimate: iJ + a solves 2::: 7= 1 1/f(X i + a - 8) = 0 if iJ solves the original equation. Popular examples are the Huber estimators corresponding to the functions if X :S -k, if l x l :S k, if X 2: k . The Huber estimators were motivated by studies in robust statistics concerning the influ ence of extreme data points on the estimate. The exact values of the largest and smallest observations have very little influence on the value of the median, but a proportional influ ence on the mean. Therefore, the sample mean is considered nonrobust against outliers. If the extreme observations are thought to be rather unreliable, it is certainly an advantage to limit their influence on the estimate, but the median may be too successful in this respect. Depending on the value of k, the Huber estimators behave more like the mean (large k) or more like the median (small k) and thus bridge the gap between the nonrobust mean and very robust median. Another example are the quantiZes. A p th sample quantile is roughly a point e such that pn observations are less than e and (1 - p)n observations are greater than e . The precise definition has to take into account that the value pn may not be an integer. One possibility is to call a pth sample quantile any iJ that solves the inequalities n
-1
<
2: { ( 1 - p) 1 { X i i=l
<
8 } - p 1 {Xi
>
8 })
<
1.
(5.5)
This is an approximate M-estimator for 1/f (x) = 1 - p, 0, -p if x < 0, x = 0, or x > 0, respectively. The "approximate" refers to the inequalities: It is required that the value of the estimating equation be inside the interval ( - 1 , 1), rather than exactly zero. This may seem a rather wide tolerance interval for a zero. However, all solutions turn out to have the same asymptotic behavior. In any case, except for special combinations of p and n, there is no hope of finding an exact zero, because the criterion function is discontinuous with jumps at the observations. (See Figure 5. 1 .) If no observations are tied, then all jumps are of size one and at least one solution iJ to the inequalities exists. If tied observations are present, it may be necessary to increase the interval ( 1 1 ) to ensure the existence of solutions. Note that the present 1/J function is monotone, as in the previous examples, but not symmetric about zero (for p =1- 1/2). -
t
,
The sign-function is defined as sign(x) - 1, 0, 1 if x < 0, x 0 or x > 0, respectively. Also x+ means x v 0 max(x , 0) . For the median we assume that there are no tied observations (in the middle). =
=
=
M- and Z-Estimators
44 l!)
0 C\l 0
0
0 0
0 � '
4
6
10
8
12
-2
-1
0
2
3
Figure 5.1. The functions e H> (e) for the 80% quantile and the Huber estimator for samples of size 15 from the gamma(S , 1) and standard normal distribution, respectively.
Wn
All the estimators considered so far can also be defined as a solution of a maximization problem. Mean, median, Huber estimators, and quantiles minimize 2:7= 1 m (X i - e) for m equal to x 2 , l x l , x 2 l l x l:s k + (2k l x l - k 2 ) 1 1 x l> k and ( 1 - p)x - + px + , respectively. D
5.2
Consistency
If the estimator is used to estimate the parameter e , then it is certainly desirable that the sequence converges in probability to e . If this is the case for every possible value of the parameter, then the sequence of estimators is called asymptotically consistent. For instance, the sample mean X is asymptotically consistent for the population mean EX (provided the population mean exists). This follows from the law of large numbers. Not surprisingly this extends to many other sample characteristics. For instance, the sam ple median is consistent for the population median, whenever this is well defined. What can be said about M -estimators in general? We shall assume that the set of possible parameters is a metric space, and write for the metric. Then we wish to prove that e0) 0 for some value e0, which depends on the underlying distribution of the observations. Suppose that the M -estimator maximizes the random criterion function
en
en
n
d
d(Bn, �
Bn
Clearly, the "asymptotic value" of depends on the asymptotic behavior of the functions Under suitable normalization there typically exists a deterministic "asymptotic criterion function" e f--+ M (e) such that
Bn
Mn.
every e .
(5 .6)
For instance, if is an average of the form JID m e as in (5 . 1 ), then the law of large numbers gives this result with M (e) = Pm11 , provided this expectation exists. It seems reasonable to expect that the maximizer of converges to the maximizing value e0 of M. This is what we wish to prove in this section, and we say that is (asymptotically) consistent for e0. However, the convergence (5.6) is too weak to ensure
Mn(e)
n en Mn
Bn
5. 2
45
Consistency
Figure 5.2. Example of a function whose point of maximum is not well separated.
the convergence of fJn. Because the value fJn depends on the whole function e r--+ Mn (e) , an appropriate form of "functional convergence" of Mn to M is needed, strengthening the pointwise convergence (5.6). There are several possibilities. In this section we first discuss an approach based on uniform convergence of the criterion functions. Admittedly, the assumption of uniform convergence is too strong for some applications and it is sometimes not easy to verify, but the approach illustrates the general idea. Given an arbitrary random function e r--+ Mn (e) , consider estimators fJn that nearly maximize Mn , that is,
MnCfJn) :::: supe Mn (e) - Op ( l ) . Then certainly MnCfJn) > Mn (eo) - op (l), which turns out to be enough to ensure con sistency. It is assumed that the sequence Mn converges to a nonrandom map M: 8 r--+ JR. Condition (5. 8) of the following theorem requires that this map attains its maximum at a · unique point e0, and only parameters close to e0 may yield a value of M (e) close to the maximum value M(e0) . Thus, e0 should be a well-separated point of maximum of M . Figure 5 .2 shows a function that does not satisfy this requirement.
5.7 Theorem. Let Mn be random functions and let M be a fixed function of e such that for every E > Qt
supiMn (e) - M(e) l � 0,
0 E8
sup
e : d(O,Oo ) ?:: c:
M (e) < M(e0).
Then any sequence of estimators fJn with MnCfJn) bility to eo. t
:=::
(5.8)
Mn (eo) - op (l) converges in proba
Some of the expressions in this display may be nonmeasurable. Then the probability statements are understood in terms of outer measure.
46
M- and Z-Estimators
By the property of en , we have Mn (e n) � Mn (f3o) - op ( 1 ) . Because the uniform convergence of Mn to M implies the convergence of Mn (610) � M(610) , the right side equals M(610) - op ( 1 ) . It follows that Mn ( 8 n) � M (610) - op ( 1 ) , whence
Proof
M( &o) - M (8 n) :::: Mn ( e n ) - M( e n ) o p ( ) p :=:: sup I Mn - M l ( & ) o p (1 )----+ 0. e by the first part of assumption (5 .8). By the second part of assumption (5. 8), there exists for every fl > 0 a number 17 > 0 such that M( & ) < M(610) - 17 for every e with d( & , 610) � fl . Thus, the event d ( e n , 610) � fl } is contained in the event MC 8 n) < M(610) - 17}. The probability of the latter event converges to 0, in view of the preceding display. •
+ l +
{
{
Instead of through maximization, an M -estimator may be defined as a zero of a criterion function e � \lin ( & ) . It is again reasonable to assume that the sequence of criterion functions converges to a fixed limit:
Then it may be expected that a sequence of (approximate) zeros of \lin converges in prob ability to a zero of \ll . This is true under similar restrictions as in the case of maximizing M estimators. In fact, this can be deduced from the preceding theorem by noting that a zero of \lin maximizes the function 61 � - I \lin (61) II ·
5.9 Theorem. Let \lin be random vector-valued functions and let \II be a fixed vector valued function of& such that for every f:l > 0 supe E e I \lin (& ) - \II ( & ) I infe : d(e,eo)�c l 'l1(&) 1 > 0 =
� 0, 11 \ll (&o ) l -
Then any sequence of estimators e n such that \lln ( e n) = O p ( 1 ) converges in probability to &o . This follows from the preceding theorem, on applying it to the functions Mn (61) = & - I \lin ( ) I and M (& ) = - 1 \ll ( &) l · •
Proof
The conditions of both theorems consist of a stochastic and a deterministic part. The deterministic condition can be verified by drawing a picture of the graph of the function. A helpful general observation is that, for a compact set 8 and continuous function M or \II , uniqueness of 610 as a maximizer or zero implies the condition. (See Problem 5 .27.) For Mn ( & ) or \lin ( & ) equal to averages as in (5. 1) or (5. 2) the uniform convergence required by the stochastic condition is equivalent to the set of functions {me : e E 8} or {1/fe f 61 E 8, j = 1 , . . . , k} being Glivenko-Cantelli. Glivenko-Cantelli classes of functions are discussed in Chapter 19. One simple set of sufficient conditions is that 8 be compact, that the functions e � m e (x) or e � 1/fe (x) are continuous for every x, and that they are dominated by an integrable function. Uniform convergence of the criterion functions as in the preceding theorems is much stronger than needed for consistency. The following lemma is one of the many possibilities to replace the uniformity by other assumptions.
5. 2
Consistency
47
Let 8 be a subset of the real line and let Wn be random functions and \ll a fixed function of e such that Wn (e) � w (e) in probability for every e. Assume that each map e r--:>- Wn (e) is continuous and has exactly one zero {J n, or is nondecreasing with Wn ( {J n ) = o p (1). Let eo be a point such that \ll (eo - .s) < 0 < \ll ceo + .s) for every .s > 0. p Then en � eo.
5.10
Lemma.
A
Proof
If the map e
r--:>-
\lln Ce) is continuous and has a unique zero at {J n , then
The left side converges to one, because Wn Ceo ± .s) � \ll Ceo ± .s) in probability. Thus the right side converges to one as well, and Bn is consistent. If the map e r--:>- 'lln (e) is nondecreasing and {J n is a zero, then the same argument is valid. More generally, if e r--:>- Wn (e) is nondecreasing, then Wn (eo - .s) < -ry and {J n ::::: eo - .s imply Wn ({J n) < -ry, which has probability tending to zero for every rJ > 0 if {J n is a near zero. This and a similar argument applied to the right tail shows that, for every .s, rJ > 0,
For 2ry equal to the smallest of the numbers -\ll (eo - .s) and \ll (eo + .s) the left side still converges to one. • 5.11
Example (Median).
The sample median 8 n is a (near) zero of the map e
n - 1 L7= 1 sign( Xi - e). By the law of large numbers, Wn (e)
r--:>-
Wn (e)
=
!,. w (e) = E sign(X - e) = P(X > e) - P(X < e ) ,
for every fixed e . Thus, we expect that the sample median converges i n probability to a point e0 such that P(X > e0) = PCX < e0) : a population median. This can be proved rigorously by applying Theorem 5.7 or 5.9. However, even though the conditions of the theorems are satisfied, they are not entirely trivial to verify. (The uniform convergence of Wn to \ll is proved essentially in Theorem 19. 1) In this case it is easier to apply Lemma 5 . 10. Because the functions e r--:>- Wn (e) are nonincreasing, it follows that {J n eo provided that w (e0 - .s) > 0 > w (e0 + .s) for every .s > 0. This is the case if the population median is unique: P(X < e0 - .s) < < P(X < e0 + .s) for all .s > 0. D
!,.
�
*5.2.1
Wald's Consistency Proof
Consider the situation that, for a random sample of variables X 1 , . . . , Xn , M (e)
=
Pme .
In this subsection we consider an alternative set of conditions under which the maximizer Bn of the process Mn converges in probability to a point of maximum e0 of the function M. This "classical" approach to consistency was taken by Wald in 1949 for maximum likelihood estimators. It works best if the parameter set 8 is compact. If not, then the argument must
48
M- and Z -Estimators
be complemented by a proof that the estimators are in a compact set eventually or be applied to a suitable compactification of the parameter set. Assume that the map e f--+ me (x) is upper-semicontinuous for almost all X : For every e a.s . . lim sup men (x) ::: me (x) ' (5. 1 2) en -+ e (The exceptional set of x may depend on 8 .) Furthermore, assume that for every sufficiently small ball U C 8 the function x r--+ sup e EU me (x) is measurable and satisfies P sup me < oo . (5 . 1 3) e EU Typically, the map 8 r--+ Pme has a unique global maximum at a point 80, but we shall allow multiple points of maximum, and write 8o for the set { 8o E 8: Pme0 = supe Pme } of all points at which M attains its global maximum. The set 80 is assumed not empty. The maps me : X r--+ R are allowed to take the value - oo , but the following theorem assumes implicitly that at least Pme0 is finite.
5.14 Theorem. Let 8 r--+ me (x) be upper-semicontinuousfor almost all x and let (5. 1 3) be satisfied. Thenfor any estimators {J n such that Mn (en) :::: Mn (8o) - o p (1)for some 8o E 8o, for every E > 0 and every compact set K C 8, If the function 8 r--+ Pme is identically -oo, then 8o = 8, and there is nothing to prove. Hence, we may assume that there exists 80 E 80 such that Pme0 > - oo , whence P lme0 I < oo by (5. 1 3). Fix some e and let U1 -!- 8 be a decreasing sequence of open balls around 8 of diameter converging to zero. Write m u (x) for supe EU me (x) . The sequence mu1 is decreasing and greater than me for every l . Combination with (5 . 1 2) yields that mu1 -!- me almost surely. In view of (5. 1 3), we can apply the monotone convergence theorem and obtain that Pmu1 -!- Pme (which may be - oo ) . For 8 ¢ 80, we have Pme < Pme0 • Combine this with the preceding paragraph to see that for every 8 ¢ 80 there exists an open ball Ue around 8 with Pmu8 < Pme0 • The set B = {e E K : d (8 , 80) :::: E } is compact and is covered by the balls {Ue : 8 E B}. Let Ue1 , • • • , Uep be a finite subcover. Then, by the law of large numbers, as sup Pnme :=: sup Pnm u8 ----;.. sup Pmu8 < Pme0 • e EB j j =l . ... , p
Proof.
1
1
If {J n E B , then SUPe EB JPlnme is at least JPlnm(jn ' which by definition of en is at least JPlnmeo o p ( 1 ) = Pme0 - op (l), by the law of large numbers. Thus {en
E
B}
c
{ esupEB JPlnme :::: Pme0 - o ( 1 ) } . p
In view of the preceding display the probability of the event on the right side converges to zero as n ----;.. oo . • Even in simple examples, condition (5 . 1 3) can be restrictive. One possibility for relax ation is to divide the n observations in groups of approximately the same size. Then (5 . 1 3)
49
Consistency
5. 2
may be replaced by, for some k and every k :::; l < 2k, l
L
P 1 sup me (xi ) < e EU i= l
(5. 15)
oo .
Surprisingly enough, this simple device may help. For instance, under condition (5. 1 3) the preceding theorem does not apply to yield the asymptotic consistency of the maximum likelihood estimator of (fL , O") based on a random sample from the N (fL, 0" 2 ) distribution (unless we restrict the parameter set for O" ), but under the relaxed condition it does (with k = 2). (See Problem 5 .25.) The proof of the theorem under (5 . 1 5) remains almost the same. Divide the n observations in groups of k observations and, possibly, a remainder group of l observations ; next, apply the law of large numbers to the approximately n / k group sums.
5.16 Example (Cauchy likelihood). The maximum likelihood estimator for e based on a random sample from the Cauchy distribution with location e maximizes the map e r+ JPlnme for me (x)
=
- log ( 1 + (x - e) 2 ) .
The natural parameter set lR is not compact, but we can enlarge it to the extended real line, provided that we can define me in a reasonable way for e = ± oo . To have the best chance of satisfying (5. 1 3), we opt for the minimal extension, which in order to satisfy (5 . 1 2) is m _00 (x)
=
lim sup me (x) e f-+- 00
= - oo ;
m00 (x)
=
lim sup me (x) e �-+oo
= - oo .
These infinite values should not worry us: They are permitted in the preceding theorem. Moreover, because we maximize e r+ Pnme , they ensure that the estimator {J n never takes the values ±oo, which is excellent. We apply Wald's theorem with 8 = JR, equipped with, for instance, the metric d (er , e2 ) = I arctg e1 - arctg e2 1 . Because the functions e r+ me (x) are continuous and nonpositive, the conditions are trivially satisfied. Thus, taking K = JR, we obtain that d (en , 80) !,. 0. This conclusion is valid for any underlying distribution P of the observations for which the set 80 is nonempty, because so far we have used the Cauchy likelihood only to motivate me . To conclude that the maximum likelihood estimator in a Cauchy location model is con sistent, it suffices to show that 80 = { e0} if P is the Cauchy distribution with center e0. This follows most easily from the identifiability of this model, as discussed in Lemma 5.35. 0
5.17 Example (Current status data). Suppose that a "death" that occurs at time T is only observed to have taken place or not at a known "check-up time" C . We model the obser vations as a random sample X 1 , , Xn from the distribution of X = (C, 1 { T :::; C}) , where T and C are independent random variables with completely unknown distribution functions F and G , respectively. The purpose is to estimate the "survival distribution" 1 - F. If G has a density g with respect to Lebesgue measure /.. , then X = (C, £.. ) has a density •
P F (c , 8)
=
•
•
(8 F ( c) + ( 1 - 8) ( 1 - F) (c) ) g (c)
M- and Z-Estimators
50
with respect to the product of A. and counting measure on the set {0, 1 } . A maximum like lihood estimator for F can be defined as the distribution function F that maximizes the likelihood F
c-+
n
n ( � i F ( Ci ) + ( 1 - � i ) ( 1 - F ) ( Ci ) ) i =l
over all distribution functions on [0, oo) . Because this only involves the numbers F ( CI ) , . . . , F ( Cn ) , the maximizer of this expression is not unique, but some thought shows that there is a unique maximizer F that concentrates on (a subset of) the observation times C 1, . . . , Cn . This is commonly used as an estimator. We can show the consistency of this estimator by Wald's theorem. By its definition F maximizes the function F c-+ JlDn log p F, but the consistency proof proceeds in a smoother way by setting
2p m F = log FP FF__ = log F---=-+ F F P Po PC + o ) /2 Because the likelihood is bigger at F than it is at � F + � Fo , it follows that JlD nm fr 2:: 0 = J!DnmFo · (It is not claimed that F maximizes F c-+ J!DnmF; this is not true.) Condition (5. 1 3) is satisfied trivially, because mF _:::: log 2 for every F. We can equip the _..:::.__
-
--
set of all distribution functions with the topology of weak convergence. If we restrict the parameter set to distributions on a compact interval [ 0, r], then the parameter set is compact by Prohorov's theorem. t The map F c-+ mF ( c , 8) is continuous at F, relative to the weak topology, for every (c, 8) such that c is a continuity point of F. Under the assumption that G has a density, this includes almost every (c, 8), for every given F. Thus, Theorem 5 . 14 shows that F n converges under Fo in probability to the set :F0 of all distribution functions that maximize the map F c-+ PFo mF, provided Fo E Fa . This set always contains Fo , but it does not necessarily reduce to this single point. For instance, if the density g is zero on an interval [a , b], then we receive no information concerning deaths inside the interval [a , b ] , and there can be no hope that F n converges to F0 on [a , b] . In that case, F0 is not "identifiable" on the interval [a , b ] . We shall show that :F0 i s the set o f all F such that F = Fo almost everywhere according to G . Thus, the sequence F n is consistent for F0 "on the set of time points that have a positive probability of occurring." Because p F = p Fo under PF if and only if F = Fo almost everywhere according to G , it suffices to prove that, for every pair ofprobability densities p and p0, Po log 2 pI (p+ p0) _:::: 0 with equality if and only if p = p0 almost surely under P0. If P0 (p = 0) > 0, then log 2p I (p + p0) = - oo with positive probability and hence, because the function is bounded above, Po 1og 2p l (p + p0) = -oo. Thus we may assume that Po (p = 0) = 0. Then, with f(u) = -u log( � + �u), 0
2p Po log (p + o ) P
t
=
( ) ( )
Po _:::: f P Po Pf P P
=
f( l) = 0,
Alternatively, consider all probability distributions on the compactification [0, oo] again equipped with the weak topology.
5.3
51
Asymptotic Normality
by Jensen's inequality and the concavity of f, with equality only if p0j p under P, and then also under P0. This completes the proof. D
5.3
=
1 almost surely
Asymptotic Normality
Suppose a sequence of estimators en is consistent for a parameter e that ranges over an open subset of a Euclidean space. The next question of interest concerns the order at which the discrepancy en - e converges to zero. The answer depends on the specific situation, but for estimators based on n replications of an experiment the order is often n - 1 1 2 . Then multiplication with the inverse of this rate creates a proper balance, and the sequence .J1i (en - e) converges in distribution, most often a normal distribution. This is interesting from a theoretical point of view. It also makes it possible to obtain approximate confidence sets . In this section we derive the asymptotic normality of M -estimators. We can use a characterization of M -estimators either by maximization or by solving estimating equations. Consider the second possibility. Let X 1 , . . . , Xn be a sample from some distribution P, and let a random and a "true" criterion function be of the form:
\ll (e) = Pl/fe . Assume that the estimator en is a zero of 'lin and converges in probability to a zero eo of \ll . Because en ---+ eo, it makes sense to expand 'lln (en) in a Taylor series around eo. Assume for simplicity that e is one-dimensional. Then where Bn is a point between en and eo. This can be rewritten as r:: A
- .fo \lln (eo) (5. 1 8) A 'lin (eo) 21 (en - eo) 'lin (en) If Pljf� is finite, then the numerator -.fo'lln (e0) = -n - 1 1 2 L 1/fe0 (X i ) is asymptotically normal by the central limit theorem. The asymptotic mean and variance are Pl/fe0 = \ll (eo) = 0 and P1/JJ0 , respectively. Next consider the denominator. The first . term �n (eo) . p is an average and can be analyzed by the law of large numbers : 'lln (eo) ---+ Pl/f e0 , provided the expectation exists. The second term in the denominator is a product of en - e = o p ( 1) and -if; n (B n) and converges in probability to zero under the reasonable condition that �n (fJ n) v ft
(en - eo) =
•
+
••
-
0
(which is also an average) is Op ( l ) . Together with Slutsky's lemma, these observations yield
(5. 19) The preceding derivation can be made rigorous by imposing appropriate conditions, often called "regularity conditions." The only real challenge is to show that �n CfJn) = Op (1) (see Problem 5 .20 or section 5.6). The derivation can be extended to higher-dimensional parameters. For a k-dimensional parameter, we use k estimating equations. Then the criterion functions are maps 'lin : Rk r--+
M- and Z-Estimators
52
IRk and the derivatives �n (610) are (k x k)-matrices that converge to the (k x k) matrix P� eo with entries p a I ae j'l/feo , i . The final statement becomes (5.20) Here the invertibility of the matrix P � eo is a condition. In the preceding derivation it is implicitly understood that the function e f-+ 1/Je (x) possesses two continuous derivatives with respect to the parameter, for every x . This is true in many examples but fails, for instance, for the function 1/Je (x) = sign(x - 61), which yields the median. Nevertheless, the median is asymptotically normal. That such a simple, but important, example cannot be treated by the preceding approach has motivated much effort to derive the asymptotic normality of M -estimators by more refined methods . One result is the following theorem, which assumes less than one derivative (a Lipschitz condition) instead of two derivatives.
For each e in an open subset ofEuclidean space, let x f-+ 1/Je (x) be a mea surable vector-valued function such . that, for every 61 1 and 612 in a neighborhood of Bo and . a measurable function 1/J with P1/f 2 < oo, 5.21
Theorem.
Assume that P 11 1/Je0 11 2 < oo and that the map e f-+ P1/Je is differentiable at a zero Bo, with nonsingular derivative matrix Ve0 • lfrl'n 1/Je" = op ( n - 1 1 2 ) , and en .!,. Bo, then
In particular, the sequence Jn(f}n - 610) is asymptotically normal with mean zero and covarzance matrzx ye-o l p ,trr eo ,t,T r e0 ( Ve0- l ) T • ·
·
For a fixed measurable function f, we abbreviate Jn(r?n - P ) f to G f , the empirical process evaluated at f. The consistency of {J n and the Lipschitz condition on the maps e f-+ 1/Je imply that n
Proof.
(5.22) For a nonrandom sequence {J n this is immediate from the fact that the means of these variables are zero, while the variances are bounded by P 11 1/Jen - 1/!e0 f ::S P� 2 ll tln - Bo 11 2 and hence converge to zero. A proof for estimators {Jn under the present mild conditions takes more effort. The appropriate tools are developed in Chapter 19. In Example 19.7 it is seen that the functions 1/Je form a Donsker class. Next, (5.22) follows from Lemma 19.24. Here we accept the convergence as a fact and give the remainder of the proof. By the definitions of {jn and Bo, we can rewrite Gn 1/Je" as ,JfiP(1/!e0 - 1/!e) + o p ( l ) Combining this with the delta method (or Lemma 2. 12) and the differentiability of the map e f-+ p 1/Je we find that .
'
5.3
Asymptotic Normality
53
In particular, by the invertibility of the matrix Ve0 , This implies that (} n is .jll- consistent: The left side is bounded in probability. Inserting this in the previous display, we obtain that JfiVe0 ({}n - eo) = - CGn 1/leo + op (1). We conclude the proof by taking the inverse Ve� 1 left and right. Because matrix multiplication is a continous map, the inverse of the remainder term still converges to zero in probability. • The preceding theorem is a reasonable compromise between simplicity and general applicability, but, unfortunately, it does not cover the sample median. Because the function e f-+ sign(x - e) is not Lipschitz, the Lipschitz condition is apparently still stronger than necessary. Inspection of the proof shows that it is used only to ensure (5.22). It is seen in Lemma 19.24, that (5.22) can be ascertained under the weaker conditions that the collection of functions x f-+ 1/Je (x) are a "Donsker class" and that the map e f-+ 1/Je is continuous in probability. The functions sign(x - e) do satisfy these conditions, but a proof and the definition of a Donsker class are deferred to Chapter 19. If the functions e f-+ 1/Je (x) are continuously differentiable, then the natural candidate for t (x) is supe II t e II , with the supremum taken over a neighborhood of eo . Then the main condition is that the partial derivatives are "locally dominated" by a square-integrable function: There should exist a square-integrable function -(p with I -(p e I .:::: -(p for every e close to eo . If e f-+ ;p e (x ) is also continuous at eo, then the dominated-convergence theorem readily yields that Ve0 = P-(p eo · The properties of M estimators can typically be obtained under milder conditions by using their characterization as maximizers. The following theorem is in the same spirit as the preceding one but does cover the median. It concerns M -estimators defined as maximizers of a criterion function e f-+ IfDnme, which are assumed to be consistent for a point of maximum e0 of the function e f-+ P m e . If the latter function is twice continuously differentiable at e0, then, of course, it allows a two-term Taylor expansion of the form It is this expansion rather than the differentiability that is needed in the following theorem.
5.23 Theorem. For each e in an open subset ofEuclidean space let x f-+ m e (x) be a mea surable function such that e f-+ m e (x) is differentiable at e0 for P -almost every x t with derivative me0 (x) and such that, for every e 1 and e2 in a neighborhood of eo and a measur able function m with Pm 2 < 00 Furthermore, assume that the map e f-+ Pme admits a second-order Taylor expansion at a point of maximum e0 with nonsingular symmetric second derivative matrix Ve0 • If IfDn m en � SUPe IfDnme - Op (n - 1 ) and en -+p eo, then A
t
Alternatively, it suffices that e
!---+
me
is differentiable at 8o in ?-probability.
M- and Z-Estimators
54
In particular, the sequence Jfi(en - eo) is asymptotically normal with mean zero and covariance matrix Ve� 1 Prhe0rh� Ve� 1 . *Proof. The Lipschitz property and the differentiability of the maps e for every random sequence hn that is bounded in probability,
r-+
m e imply that,
Gn [ Jn (m eo+hnlvfn - me0) - h � rhe0 J !,. 0.
For nonrandom sequences hn this follows, because the variables have zero means, and vari ances that converge to zero, by the dominated convergence theorem. For general sequences hn this follows from Lemma 19.3 1 . A second fact that we need and that is proved subsequently i s the Jfi -consistency of the sequence en. By Corollary 5.53, the Lipschitz condition, and the twice differentiability of the map e r-+ P m e , the sequence Jfi({jn - e) is bounded in probability. The remainder of the proof is self-contained. In view of the twice differentiability of the map e r-+ P m e , the preceding display can be rewritten as . + h- ny Gn me0 + Op ( 1 ) . n JP>n (m eo+hnlvfn - m e0 ) = 21 h- ny Ve0hn Because the sequence en i s Jfi -consistent, this is valid both for hn equal to h n = Jfi Cen - eo) and for hn = - Ve� 1 Gnrhe0 • After simple algebra in the second case, we obtain the equations
By the definition of en, the left side of the first equation is larger than the left side of the second equation (up to o p ( 1)) and hence the same relation is true for the right sides. Take the difference, complete the square, and conclude that
� (hn + Ve� 1 Gnrhe0) T Ve0 (hn + Ve� 1 Gnrhe0) + o p ( l )
2:
0.
Because the matrix Ve0 is strictly negative-definite, the quadratic form must converge to zero in probability. The same must be true for llh n + Ve� 1 Gnrhe0 11 . • The assertions of the preceding theorems must be in agreement with each other and also with the informal derivation leading to (5.20). If e r-+ m e (x) is differentiable, then a maximizer of e r-+ lfDn m e typically solves lfDn 1/Je = 0 for 1/Je = rhe . Then the theorems and (5 .20) are in agreement provided that . a a2 Ve = ae 2 P m e = ae Pl/Je = Pl/J e = P m e . This involves changing the order of differentiation (with respect to e) and integration (with respect to x ), and is usually permitted. However, for instance, the second derivative of Pme may exist without e r-+ m e (x) being differentiable for all x, as is seen in the following example.
5.24
Example (Median).
The sample median maximizes the criterion function e
r-+
- L7= 1 1 Xi -e 1 . Assume that the distribution function F of the observations is differentiable
5.3
-0.5
55
Asymptotic Nonnality
0.0
0.5
Figure 5.3. The distribution function o f the s ample median (dotted curve) and its normal approxi mation for a sample of size 25 from the Laplace distribution.
at its median e0 with positive derivative j(e0). Then the sample median is asymptotically normal. This follows from Theorem 5.23 applied with me (x) = lx - e I - l x 1 . As a consequence of the triangle inequality, this function satisfies the Lipschitz condition with m (x) 1. Furthermore, the map e f---+ me (x) is differentiable at eo except if X = eo, with mea (x) = - sign(x - e0) . By partial integration, =
eF(O) + J{ (e - 2x) dF(x) - e (1 - F(e)) = 2 lr F(x) dx - e. o (o , e ] If F is sufficiently regular around e0, then Pme is twice differentiable with first derivative 2F(e) - 1 (which vanishes at e0) and second derivative 2j(e). More generally, under the minimal condition that F is differentiable at e0, the function Pme has a Taylor expansion Pmea + � (e - eo) 2 2j (eo) + o(le - eo l 2 ) , so that we set Vea = 2j(eo). Because Pm�a = E1 = 1 , the asymptotic variance of the median is 1/(2j(e0)) 2 . Figure 5.3 gives an Pme
=
impression of the accuracy of the approximation.
D
5.25 Example (Misspecified model). Suppose an experimenter postulates a model {pe : e E 8 } for a sample of observations X 1 , . . . , X n . However, the model is misspecified in that the true underlying distribution does not belong to the model. The experimenter decides to use the postulated model anyway, and obtains an estimate {J n from maximizing the likelihood L log Pe (Xi). What is the asymptotic behaviour of {Jn ? At first sight, it might appear that {Jn would behave erratically due to the use of the wrong model. However, this is not the case. First, we expect that {Jn is asymptotically consistent for a value eo that maximizes the function e f---+ P log pe , where the expectation is taken under the true underlying distribution P . The density P ea can be viewed as the "projection"
M- and Z-Estimators
56
of the true underlying distribution P on the model using the Kullback-Leibler divergence, which is defined as - P log (pe I p) , as a "distance" measure: pe0 minimizes this quantity over all densities in the model. Second, we expect that Jfi({jn - e0) is asymptotically normal with mean zero and covariance matrix
P log pe . The Here fe log pe , and Ve0 is the second derivative matrix of the map e preceding theorem with me log pe gives sufficient conditions for this to be true. The asymptotics give insight into the practical value of the experimenter's estimate {jn · This depends on the specific situation. However, if the model is not too far off from the truth, then the estimated density Pe " may be a reasonable approximation for the true density. D
=
f--+
=
5.26 Example (Exponential frailty model). Suppose that the observations are a random sample (X 1 , Y1 ), . . . , (Xn , Yn) of pairs of survival times. For instance, each X i is the survival time of a "father" and Yi the survival time of a "son." We assume that given an unobservable value Z i , the survival times X i and Yi are independent and exponentially distributed with parameters Zi and e Zi , respectively. The value Zi may be different for each observation. The problem is to estimate the ratio e of the parameters. To fit this example into the i.i.d. set-up of this chapter, we assume that the values Z 1 , . . . , Zn are realizations of a random sample Z 1 , . . . , Zn from some given distribution (that we do not have to know or parametrize). One approach is based on the sufficiency of the variable X i + e Yi for Zi in the case that e is known. Given Zi z, this "statistic" possesses the gamma-distribution with shape parameter 2 and scale parameter z. Corresponding to this, the conditional density of an observation (X, Y) factorizes, for a given z, as he (x , y) ge (x + e y I z) , for ge (s I z) a-density and z 2 se -z s the g
=
amm
he (x , y)
=
e
X
+ ey
.
Because the density of X i + e Y; depends on the unobservable value Zi , we might wish to discard the factor ge (s I z) from the likelihood and use the factor he (x , y) only. Unfor tunately, this "conditional likelihood" does not behave as an ordinary likelihood, in that the corresponding "conditional likelihood equation," based on the function h e I he (x , y) a 1 a e log he (x , y), does not have mean zero under e . The bias can be corrected by condi tioning on the sufficient statistic. Let
=
= 2e _!_hh e (X, Y) - 2eEe (h_!_he (X, Y) I X + e y) = XX +- ee YY Next define an estimator {} n as the solution of n 1/fe = 0. This works fairly nicely. Because the function e f--+ 1/fe (x , y) is continuous, and de creases strictly from 1 to - 1 on (0, oo) for every x , y 0, the equation n 1/le = 0 has a 1/le (X, Y)
.
TID
>
TID
unique solution. The sequence of solutions Bn can be seen to be consistent by Lemma 5. 10. By straightforward calculation, as e -+ eo,
5.3
Asymptotic Normality
57
Hence the zero of e r--+ Pe0 1/!e is taken uniquely at e = eo. Next, the sequence ,Jri({jn - eo) can be shown to be asymptotically normal by Theorem 5.21 . In fact, the functions � e (x, y) are uniformly bounded in x , y > 0 and e ranging over compacta in (0, oo) , so that, by the mean value theorem, the function � in this theorem may be taken equal to a constant. On the other hand, although this estimator is easy to compute, it can be shown that it is not asymptotically optimal. In Chapter 25 on semiparametric models, we discuss estimators with a smaller asymptotic variance. D Suppose that we observe a random sample (X 1 , Y1 ), . . . , (Xn , Yn ) from the distribution of a vector (X, Y) that follows the regression model
5.27
Example (Nonlinear least squares).
Y = fe0 (X) + e,
E(e I X) = 0.
Here fe is a parametric family of regression functions, for instance fe (x) = e 1 + e2 e e3x , and we aim at estimating the unknown vector e. (We assume that the independent variables are a random sample in order to fit the example in our i.i.d. notation, but the analysis could be carried out conditionally as well.) The least squares estimator that minimizes n
e r--+
L (li - fe (Xi) ) 2 i= 1
is an M-estimator for me (x, y) = (y - fe (x)) 2 (or rather minus this function). It should be expected to converge to the minimizer of the limit criterion function
Thus the least squares estimator should be consistent if e0 is identifiable from the model, in the sense that e =I= e0 implies that fe (X) =I= fe0 (X) with positive probability. For sufficiently regular regression models, we have
This suggests that the conditio 1_1 s of Theorem 5.23 are satisfied with Ve0 = 2P j eo j� and rhe0 (x, y) = -2(y - fe0 (x)) f e0 (x). If e and X are independent, then this leads to the asymptotic covariance matrix Ve� 1 2Ee 2 . D Besides giving the asymptotic normality of Jfi({jn - e0), the preceding theorems give an asymptotic representation
If we neglect the remainder term, t then this means that {J n - e0 behaves as the average of the variables Ve� 1 1/Je0 (X i )· Then the (asymptotic) "influence" of the nth observation on the t
To make the following derivation rigorous, more information concerning the remainder term would be necessary.
M - and Z-Estimators
58
value of {j can be computed as n
Because the "influence" of an extra observation x is proportional to Ve- 1 1/Je (x), the function x f--+ Ve- 1 1/Je (x) is called the asymptotic influence function of the estimator {j n . Influence functions can be defined for many other estimators as well, but the method of Z-estimation is particularly convenient to obtain estimators with given influence functions. Because Ve0 is a constant (matrix), any shape of influence function can be obtained by simply choosing the right functions 1/Je . For the purpose of robust estimation, perhaps the most important aim is to bound the influence of each individual observation. Thus, a Z-estimator is called B-robust if the function 1/Je is bounded.
5.28 Example (Robust regression). Consider a random sample of observations (X 1 , Y1 ) , . . . , (Xn , Yn ) following the linear regression model
for i.i.d. errors e 1 , , en that are independent of X 1 , , Xn . The classical estimator for the regression parameter (] is the least squares estimator, which minimizes :L�= 1 (Yi - (] T X i ) 2 . Outlying values of Xi ("leverage points") or extreme values of (X i , Yi) jointly ("influence points") can have an arbitrarily large influence on the value of the least-squares estimator, which therefore is nonrobust. As in the case of location estimators, a more robust estimator for (] can be obtained by replacing the square by a function m (x) that grows less rapidly as x ---+ oo, for instance m (x) = lx I or m (x) equal to the primitive function of Huber's 1/1 . Usually, minimizing an expression of the type 2:: �= 1 m (Yi - (] Xi ) i s equivalent to solving a system of equations .
.
.
•
n
1/I ( Yi L i=1
-
.
•
e r xi)xi =
o.
Because E1jf (Y - el X) X = E1jf (e)EX, we can expect the resulting estimator to be consistent provided E1jf ( e ) = 0. Furthermore, we should expect that, for Ve0 = E1jf ( e )XXT , '
Consequently, even for a bounded function 1/1 , the influence function (x , y) f--+ Ve- 1 1/1 (y e T x)x may be unbounded, and an extreme value of an Xi may still have an arbitrarily large influence on the estimate (asymptotically). Thus, the estimators obtained in this way are protected against influence points but may still suffer from leverage points and hence are only partly robust. To obtain fully robust estimators, we can change the estimating
5.3
59
Asymptotic Normality
equations to
n L o/ (( Yi - e Tx i)v(Xi))w (Xd = 0. i=l Here we protect against leverage points by choosing w bounded. For more flexibility we have also allowed a weighting factor v (X i ) inside 1/J. The choices 1/J (x) = x, v (x) = 1 and w(x) = x correspond to the (nonrobust) least-squares estimator. The solution {j n of our final estimating equation should be expected to be consistent for the solution of
(
)
0 = E o/ ( ( Y - e T X)v (X) )w (X) = E o/ ( e + etf X - e T X) v (X) w (X) .
If the function 1/J is odd and the error symmetric, then the true value e0 will be a solution whenever e is symmetric about zero, because then E 1/J (eO' ) = 0 for every 0' . Precise conditions for the asymptotic normality of Jfi({jn - e0) can be obtained from Theorems 5.21 and 5. 9. The verification of the conditions of Theorem 5.2 1 , which are "local" in nature, is relatively easy, and, if necessary, the Lipschitz condition can be relaxed by using results on empirical processes introduced in Chapter 19 directly. Perhaps proving the consistency of en is harder. The biggest technical problem may be to show that {jn = Op (1), so it would help if e could a priori be restricted to a bounded set. On the other hand, for bounded functions 1/J, the case of most interest in the present context, the functions (x, y) � 1/J ((y - e T x) v(x)) w (x) readily form a Glivenko-Cantelli class when e ranges freely, so that verification of the strong uniqueness of e0 as a zero becomes the main challenge when applying Theorem 5.9. This leads to a combination of conditions on 1/J, v, w, and the distributions of e and X. 0
5.29 Example (Optimal robust estimators). Every sufficiently regular function 1/J defines a location estimator en through the equation L 7= 1 1/f (Xi - e) = 0. In order to choose among the different estimators, we could compare their asymptotic variances and use the one with the smallest variance under the postulated (or estimated) distribution P of the observations. On the other hand, if we also wish to guard against extreme obervations, then we should find a balance between robustness and asymptotic variance. One possibility is to use the estimator with the smallest asymptotic variance at the postulated, ideal distribution P under the side condition that its influence function be uniformly bounded by some constant c. In this example we show that for P the normal distribution, this leads to the Huber estimator. The Z-estimator is consistent for the solution eo of the equation P o/ (· - e) = Eo/(X 1 e) = 0. Suppose that we fix an underlying, ideal P whose "location" e0 is zero. Then the problem is to find 1/J that minimizes the asymptotic variance Po/ 2 /(Po/') 2 under the two side conditions, for a given constant c, sup X
I 1/JPo/(x)' I -<
and Po/ = 0.
c '
The problem is homogeneous in 1/J, and hence we may assume that Po/' = 1 without loss of generality. Next, minimization of Po/ 2 under the side conditions Po/ = 0, Po/' = 1 and I 1/J I � c can be achieved by using Lagrange multipliers, as in problem 14.6 This leads to minimizing 00
M- and Z-Estimators
60
for fixed "multipliers" A and f.1, under the side condition 11 1/J II :::: c with respect to 1/J . This expectation is minimized by minimizing the integrand pointwise, for every fixed x . Thus the minimizing 1/J has the property that, for every x separately, y = 1/J (x ) minimizes the parabola y 2 + A Y + f.l,y (p ' jp) (x) over y E [ c , c] . This readily gives the solution, with [y ]� the value y truncated to the interval [ c, d ] , oo
-
1/J (x)
=
[
1 2
- -A -
1 p' - f.l, - (X) 2 p
]c -c
.
The constants A and f.1, can be solved from the side conditions P l/J = 0 and Pl/J ' = 1 . The normal distribution P = has location score function p'jp(x) = -x, and by symmetry it follows that A = 0 in this case. Then the optimal l/J reduces to Huber's 1/J function. 0
*5.4
Estimated Parameters
In many situations, the estimating equations for the parameters of interest contain prelim inary estimates for "nuisance parameters." For example, many robust location estimators are defined as the solutions of equations of the type (5.30) Here a is an initial (robust) estimator of scale, which is meant to stabilize the robustness of the location estimator. For instance, the "cut-off" parameter k in Huber's 1/J-function determines the amount of robustness of Huber's estimator, but the effect of a particular choice of k on bounding the influence of outlying observations is relative to the range of the observations. If the observations are concentrated in the interval [ -k, k] , then Huber's 1/J yields nothing else but the sample mean, if all observations are outside [ -k, k] , we get the median. Scaling the observations to a standard scale gives a clear meaning to the value of k. The use of the median absolute deviation from the median (see. section 2 1 .3) is often recommended for this purpose. If the scale estimator is itself a Z-estimator, then we can treat the pair (fJ , a) as a Z estimator for a system of equations, and next apply the preceding theorems. More generally, we can apply the following result. In this subsection we allow a condition in terms of Donsker classes, which are discussed in Chapter 19. The proof of the following theorem follows the same steps as the proof of Theorem 5 .2 1 . 5.31 Theorem. For each () in an open subset oflR k and each 77 in a metric space, let x f-+ 1/Je, ,7 (x) be an 1R k -valued measurable function such that the class offunctions 1/Je . r; : I I () eo II < 8 , d (71 , 7J0) < 8 } is Donsker for some 8 > 0, and such that P 11 1/Je. r; - 1/Jeo. r;o 11 2 -+ 0 as (() , 77) -+ (eo , 77o). Assume that Pl/Jeo . r;o = 0, and that the maps () f-+ Pl/Je. r; are diffe r entiable at ()0, uniformly in 77 in a neighborhood of 7Jo with nonsingular derivative matrices p Ve0, r; such that Ve0 . r; -+ Ve0, ,70 • If .Jii fl!n l/Je ,, ry, = o p ( 1 ) and (()n , � n) -+ (eo , 7Jo ), then
{
A
5. 5
Maximum Likelihood Estimators
61
Under the conditions of this theorem, the limiting distribution of the sequence .Jfi({Jn
eo) depends on the estimator fin through the "drift" term .Jfi P 1/Je0,r1" . In general, this gives a contribution to the limiting distribution, and fin must be chosen with care. If fin is .Jfi consistent and the map rJ f--+ P1/fe0,r1 is differentiable, then the drift term can be analyzed
using the delta-method. It may happen that the drift term is zero. If the parameters e and rJ are "orthogonal" in this sense, then the auxiliary estimators fin may converge at an arbitrarily slow rate and affect the limit distribution of {Jn only through their limiting value rJo .
5.32 Example (Symmetric location). Suppose that the distribution of the observations is symmetric about e0. Let x f--+ 1/f (x ) be an antisymmetric function, and consider the Z-estimators that solve equation (5.30). Because P1/f ((X - e0)ja) = 0 for every a, by the symmetry of P and the antisymmetry of 1/J, the "drift term" due to fi in the pre ceding theorem is identically zero. The estimator {Jn has the same limiting distribu tion whether we use an arbitrary consistent estimator of a "true scale" a0 or a0 itself. D 5.33 Example (Robust regression). In the linear regression model considered in Exam ple 5.28, suppose that we choose the weight functions v and w dependent on the data and solve the robust estimator {Jn of the regression parameters from
This corresponds to defining a nuisance parameter rJ = ( v , w ) and setting 1/fe,v,w (x , y) = 1/! ((y - e T x ) v (x ) ) w (x ) . If the functions 1/!e , v,w run through a Donsker class (and they easily do), and are continuous in (e , v , w ) , and the map e f--+ P1/Je,v , w is differentiable at e0 uniformly in ( v, w ) , then the preceding theorem applies. If E1j; (ea) = 0 for every a , then P 1/fe0, v , w = 0 for any v and w , and the limit distribution of .Jfi({Jn - e0) is the same, whether we use the random weight functions (vn, wn) or their limit ( v0 , w0) (assuming that this exists). The purpose of using random weight functions could be, besides stabilizing the robust ness, to improve the asymptotic efficiency of {Jn· The limit ( v 0 , w0) typically is not the same for every underlying distribution P, and the estimators (vn, wn) can be chosen in such a way that the asymptotic variance is minimal. D
5.5
Maximum Likelihood Estimators
Maximum likelihood estimators are examples of M -estimators. In this section we special ize the consistency and the asymptotic normality results of the preceding sections to this important special case. Our approach reverses the historical order. Maximum likelihood estimators were shown to be asymptotically normal first by Fisher in the 1 920s and rigor ously by Cramer, among others, in the 1940s . General M -estimators were not introduced and studied systematically until the 1960s, when they became essential in the development of robust estimators.
62
M- and Z-Estimators
If X 1 , . . . , Xn are a random sample from a density pe , then the maximum likelihood estimator en maximizes the function (] f-+ L log Pe (Xi ) , or equivalently, the function Pe (X = Pn log Pe . log Mn (f3) = -1 � � n i = l P ea d Pea
(Subtraction of the "constant" L log Pea (Xi ) turns out to be mathematically convenient.) If we agree that log 0 = -oo, then this expression is with probability 1 well defined if Pea is the true density. The asymptotic function corresponding to Mn is t Pe (X) = Pe log Pe . M(f3) = Eea log a Pea P ea The number -M(f3) is called the Kullback-Leibler divergence of Pe and Pea ; it is often considered a measure of "distance" between pe and Pea , although it does not have the properties of a mathematical distance. Based on the results of the previous sections, we may expect the maximum likelihood estimator to converge to a point of maximum of M(f3). Is the true value f30 always a point of maximum? The answer is affirmative, and, moreover, the true value is a unique point of maximum if the true measure is identifiable: every f3 f. f3o.
(5.34)
This requires that the model for the observations is not the same under the parameters f3 and f30. Identifiability is a natural and even a necessary condition: If the parameter is not identifiable, then consistent estimators cannot exist. Lemma. Let { pe : f3 E 8} be a collection of subprobability densities such that (5.34) holds and such that Pea is a probability measure. Then M(f3) = Pea log pe / P ea
5.35
attains its maximum uniquely at f30.
First note that M (f30) = Pea log 1 = 0. Hence we wish to show that M (f3) is strictly negative for f3 f. f3o. Because log x 2( .JX - 1) for every x � 0, we have, writing J.-L for the dominating measure,
Proof
:::
Pea log .!!!_ Pea
:S
2 Pea
- 1 ) 2 J JPe Pea d f.-L - 2 (V{E_ P ea =
::: - f (yfpe - �f d
J.-L .
(The last inequality is an equality if J pe d J.-L = 1 .) This is always nonpositive, and is zero only if pe and Pea are equal. By assumption the latter happens only if f3 = f3o. • Thus, under conditions such as in section 5 .2 and identifiability, the sequence of maxi mum likelihood estimators is consistent for the true parameter. t
Presently we take the expectation Pea under the parameter Bo , whereas the derivation in section 5.3 is valid for a generic underlying probability structure and does not conceptually require that the set of parameters 8 indexes a set of under!ying distributions.
5. 5
Maximum Likelihood Estimators
63
This conclusion is derived from viewing the maximum likelihood estimator as an M estimator for me = log pe . Sometimes it is technically advantageous to use a different starting point. For instance, consider the function e+ e me = 1o g P P o 2peo ::..._---=. .._
By the concavity of the logarithm, the maximum likelihood estimator fJ satisfies 1 Pe + lP'n -1 log 1 0 = lP'nme lP'nm e :=:: lP'n - log :=:: 0• 2 2 pe0 Even though fJ does not maximize (] f-+ lP'n me , this inequality can be used as the starting point for a consistency proof, since Theorem 5.7 requires that Mn (B ) :=:: Mn ((]0) - op (1) only. The true parameter is still identifiable from this criterion function, because, by the preceding lemma, Pe0 me = 0 implies that (pe + pe0 )/2 = pe0 , or Pe = pe0 • A technical advantage is that me :=:: log(1/2) . For another variation, see Example 5 . 1 7 . Consider asymptotic normality. The maximum likelihood estimator solves the likelihood equations a n - L: log pe (X i ) = 0. ae i = 1 Hence it is a Z-estimator for 1/Je equal to the score function i e = a I ae log Pe of the model. In view of the results of section 5.3, we expect that the sequence ,Jn ( en - (]) is, under (] , asymptotically normal with mean zero and covariance matrix (5.36) Under regularity conditions, this reduces to the inverse of the Fisher information matrix To see this in the case of a one-dimensional parameter, differentiate the identity J pe d JL 1 twice with respect to (] . Assuming that the order of differentiation and integration can be reversed, we obtain J Pe dJL J P e dJL 0. Together with the identities Pe 2 Pe Pe le = le - - ; Pe Pe Pe this implies that Pe l e = 0 (scores have mean zero), and Pe f e = -Ie (the curvature of the likelihood is equal to minus the Fisher information). Consequently, (5.36) reduces to Ie- 1 • The higher-dimensional case follows in the same way, in which we should interpret the identities Pe l e = 0 and Pe f e = - Ie as a vector and a matrix identity, respectively. We conclude that maximum likelihood estimators typically satisfy =
=
·
=
··
_
( )
This is a very important result, as it implies that maximum likelihood estimators are asymp totically optimal. The convergence in distribution means roughly that the maximum likeli hood estimator B n is N (e , (n /e ) - 1 ) -distributed for every (] , for large n . Hence, it is asymp totically unbiased and asymptotically of variance (n /e ) - 1 . According to the Cramer-Rao
64
M- and Z-Estimators
theorem, the variance of an unbiased estimator is at least (n le )- 1 . Thus, we could in fer that the maximum likelihood estimator is asymptotically uniformly minimum-variance unbiased, and in this sense optimal. We write "could" because the preceding reasoning is informal and unsatisfying. The asymptotic normality does not warrant any conclusion about the convergence of the moments Ee {J n and vare {J n ; we have not introduced an asymptotic version of the Cramer-Rao theorem; and the Cramer-Rao bound does not make any assertion concerning asymptotic normality. Moreover, the unbiasedness required by the Cramer-Rao theorem is restrictive and can be relaxed considerably in the asymptotic situation. However, the message that maximum likelihood estimators are asymptotically efficient is correct. We give a precise discussion in Chapter 8. The justification through asymptotics appears to be the only general justification of the method of maximum likelihood. In some form, this result was found by Fisher in the 1920s, but a better and more general insight was only obtained in the period from 1 950 through 1 970 through the work of Le Cam and others. In the preceding informal derivations and discussion, it is implicitly understood that the density pe possesses at least two derivatives with respect to the parameter. Although this can be relaxed considerably, a certain amount of smoothness of the dependence (} f-+ Pe is essential for the asymptotic normality. Compare the behavior of the maximum likelihood estimators in the case of uniformly distributed observations : They are neither asymptotically normal nor asymptotically optimal.
5.37 Example (Uniform distribution). Let X 1 , . . . , Xn be a sample from the uniform distribution on [0, (} ] . Then the maximum likelihood estimator is the maximum X (n) of the observations. Because the variance of X en) is of the order O (n- 2 ), we expect that a suitable norming rate in this case is not ,Jn, but n. Indeed, for each x < 0
Thus, the sequence -n (X (n) - 8) converges in distribution to an exponential distribution with mean e . Consequently, the sequence ,Jii ( Xcn) - 8) converges to zero in probability. Note that most of the informal operations in the preceding introduction are illegal or not even defined for the uniform distribution, starting with the definition of the likelihood equa tions. The informal conclusion that the maximum likelihood estimator is asymptotically optimal is also wrong in this case; see section 9.4. 0 We conclude this section with a theorem that establishes the asymptotic normality of maximum likelihood estimators rigorously. Clearly, the asymptotic normality follows from Theorem 5.23 applied to me = log pe , or from Theorem 5.21 applied with 1/re = le equal to the score function of the model. The following result is a minor variation on the first theorem. Its conditions somehow also ensure the relationship Pele = -Ie and the twice differentiability of the map (} f-+ Pe0 log Pe , even though the existence of second derivatives is not part of the assumptions. This remarkable phenomenon results from the trivial fact that square roots of probability densities have squares that integrate to 1 . To exploit this, we require the differentiability of the maps (} f-+ �. rather than of the maps (} f-+ log Pe . A statistical model (Pe : (} E 8) is called differentiable in q uadratic mean if there exists a
5. 5 Maximum Likelihood Estimators
measurable vector-valued function le0 such that, as e
---+
65
e0, (5 .38)
This property also plays an important role in asymptotic optimality theory. A discussion, including simple conditions for its validity, is given in Chapter 7. It should be noted that 1 a 1(a ) e = log Pe §e. = P ae 5e 25e a e 2 a e a
Thus, the function fe0 in the integral really is the score function of the model (as the notation suggests), and the expression le0 Pe0fei,� defines the Fisher information matrix. However, condition (5 .38) does not require existence of a;ae pe (x) for every x.
=
Suppose that the model (Pe : e E 8 ) is differentiable in quadratic mean at an inner point e0 of 8 c ffi.k . Furthermore, suppose that there exists a measurable function .£ with Pe0l 2 < oo such that, for every e 1 and e2 in a neighborhood ofe0, 5.39
Theorem.
If the Fisher information matrix le0 is nonsingular and {jn is consistent, then
In particular, the sequence y'ri(Bn - e0) is asymptotically normal with mean zero and covariance matrix Ie� 1 . This theorem is a corollary of Theorem 5.23. We shall show that the conditions of the latter theorem are satisfied for m e log pe and Ve0 - le0 • Fix an arbitrary converging sequence of vectors hn ---+ h, and set *Proof
=
=
By the differentiability in quadratic mean, the sequence ,fii, Wn converges in L 2 (Pe0) to the function h T le0 • In particular, it converges in probability, whence by a delta method
vfn (log Peo+hnl y'n - log Peo)
= 2vfn log ( 1 + � Wn ) � h T le0 •
In view of the Lipschitz condition on the map e f---+ log pe , we can apply the dominated convergence theorem to strengthen this to convergence in L 2 (Pe0). This shows that the map e f---+ log P e is differentiable in probability, as required in Theorem 5.23. (The preceding argument considers only sequences ell of the special form eo + hn/ y'n approaching eo. Because hn can be any converging sequence and Jn+I; Jn ---+ 1 , these sequences are actually not so special. By re-indexing the result can be seen to be true for any ell ---+ e0.) Next, by computing means (which are zero) and variances, we see that
M- and Z-Estimators
66
Equating this result to the expansion given by Theorem 7 .2, we see that
Hence the map e r--+ Pe0 log Pe is twice-differentiable with second derivative matrix - Ie0, or at least permits the corresponding Taylor expansion of order 2. • Suppose that we observe a random sample (X 1 , Y1 ), . . . , (Xn , Y11) consisting of k-dimensional vectors of "covariates" X; , and 0- 1 "response variables" Y; , following the model
5.40
Example (Binary regression).
Here \lf : ffi. r--+ [0, 1] is a known continuously differentiable, monotone function. The choices \lf (e) = 1 1 ( 1 + e- e ) (the logistic distribution function) and \lf = (the normal distribution function) correspond to the logit model and probit model, respectively. The maximum likelihood estimator e n maximizes the (conditional) likelihood function n n Y e r--+ n pe (Y; I X i) : = n \lf (e T X;)yi (1 - \lf (e T X i) ) I - ; .
i =l
i =l
The consistency and asymptotic normality of en can be proved, for instance, by combining Theorems 5.7 and 5 .39. (Alternatively, we may follow the classical approach given in sec tion 5 .6. The latter is particularly attractive for the logit model, for which the log likelihood is strictly concave in e , so that the point of maximum is unique.) For identifiability of e we must assume that the distribution of the X i is not concentrated on a (k - I)-dimensional affine subspace of ffi.k . For simplicity we assume that the range of Xi is bounded. The consistency can be proved by applying Theorem 5.7 with me = log( pe + pe0)12. Because pe0 is bounded away from 0 (and oo), the function me is somewhat better behaved than the function log pe . By Lemma 5.35, the parameter e is identifiable from the density pe . We can redo the proof to see that, with ;S meaning "less than up to a constant,"
This shows that eo is the unique point ofmaximum of e r--+ Pe0me . Furthermore, if Pe0mek -? Pe0me0, then e{ X !,. e[ X. If the sequence ek is also bounded, then E( (ek - eo) T X) 2 -? 0, whence ek r--+ eo by the nonsingularity of the matrix EX x T . On the other hand, llek I cannot have a diverging subsequence, because in that case e{ I II ek I x !,. o and hence ek f II ek II -? o by the same argument. This verifies condition (5 .8). Checking the uniform convergence to zero of supe IIfDnme - Pme I is not trivial, but it becomes an easy exercise if we employ the Glivenki-Cantelli theorem, as discussed in Chapter 19. The functions x r--+ \lf (e r x ) form a VC-class, and the functions me take the form me (x ' y) = ¢ ( \lf (e T X ) , y , \lf eel X) ) , where the function ¢ ( y ' y ' T}) is Lipschitz in its first argument with Lipschitz constant bounded above by 1 I rJ + 1 I (1 - TJ ) . This is enough to
5. 6
67
Classical Conditions
ensure that the functions me form a Donsker class and hence certainly a Glivenko-Cantelli class, in view of Example 1 9.20. The asymptotic normality of Jfi(Bn 8) is now a consequence of Theorem 5.39. The score function -
is uniformly bounded in x , y and e ranging over compacta, and continuous in 8 for every x and y . The Fisher information matrix is continuous in 8, and is bounded below by a multiple of EX x r and hence is nonsingular. The differentiability in quadratic mean follows by calculus, or by Lemma 7 .6. D
*5.6
Classical Conditions
In this section we discuss the "classical conditions" for asymptotic normality of M -estima tors. These conditions were formulated in the 1 930s and 1940s to make the informal deriva tions of the asymptotic normality of maximum likelihood estimators, for instance by Fisher, mathematically rigorous . Although Theorem 5.23 requires less than a first derivative of the criterion function, the "classical conditions" require existence of third derivatives. It is clear that the classical conditions are too stringent, but they are still of interest, because they are simple, lead to simple proofs, and nevertheless apply to many examples . The classical conditions also ensure existence of Z -estimators and have a little to say about their consistency. We describe the classical approach for general Z-estimators and vector-valued parame ters. The higher-dimensional case requires more skill in calculus and matrix algebra than is necessary for the one-dimensional case. When simplified to dimension one the argu ments do not go much beyond making the informal derivation leading from (5 . 1 8) to (5 . 19) rigorous. Let the observations X 1 , . . . , Xn be a sample from a distribution P , and consider the estimating equations \ll (8)
=
P 1/Je .
The estimator B is a zero of \lln, and the true value 80 a zero of \ll . The essential condition of the following theorem is that the second-order partial derivatives of 1/Je (x) with respect to e exist for every x and satisfy n
for some integrable measurable function 1f . This should be true at least for every 8 in a neighborhood of 80•
68
M- and Z-Estimators
5.41 Theorem. For each e in an open subset of Euclidean space, let e � 1/fe (x ) be twice continuously differentiable for every x. Suppose that P1/fe0 = 0, that P 1 1/le0 11 2 < oo and that the matrix P'if; e0 exists and is nonsingular. Assume that the second-order partial derivatives are dominated by a fixed integrable function 1f; (x) for every e in a neighborhood ofeo. Then every consistent estimator sequence {jn such that 'lin ({}n) = Ofor every n satisfies
In particular, the sequence ,Jfi(fJn - e0). is asymptotically normal with mean zero and . covariance matrix ( P 1/1 e0 )- 1 P 1/leo 1/1eTo ( P 1/1 e0 )- 1 •
By Taylor's theorem there exists a (random) vector iJ n on the line segment between eo and {jn such that
Proof
The first term on the right 'lln (eo) is an average of the i.i.d. random vectors 1/Je0 (X i ), which have mean P1/Je0 = 0. By the central limit theorem, the sequence ,Jfi'll n (e0) converges in distribution to a multivariate normal distribution with mean 0 and covariance matrix P1/Je0 1/f[.0 The derivative �n (eo) in the second term is an average also. By the law of large numbers it converges in probability to the matrix V = P 1/f eo . The second derivative � n (iJn) is a k-vector of (k x k) matrices depending on the second-order derivatives 1/; e . By assumption, there exists a ball B around e0 such that 1/; e is dominated by 11 1/; II for every e E B . The probability of the event { {} n E B } tends to 1 . On this event •
This is bounded in probability by the law of large numbers. Combination of these facts allows us to rewrite the preceding display as
because the sequence ({}n - e0) Op (1) = op (1) 0 p ( 1 ) converges to 0 in probability if {}n is consistent for e0. The probability that the matrix Ve0 + op (1) is invertible tends to 1 . Multiply the preceding equation by Jn and apply ( V + o p ( l ) ) - 1 left and right to complete the proof. • In the preceding sections, the existence and consistency of solutions {}n of the estimating equations is assumed from the start. The present smoothness conditions actually ensure the existence of solutions. (Again the conditions could be significantly relaxed, as shown in the next proof.) Moreover, provided there exists a consistent estimator sequence at all, it is always possible to select a consistent sequence of solutions.
5.42 Theorem. Under the conditions ofthe preceding theorem, the probability that the eq uation lP'n 1/fe = 0 has at least one root tends to 1, as n -+ oo, and there exists a sequence of roots {} n such that {} n -+ eo in probability. If 1/1e = me is the gradient of some function
5. 6
Classical Conditions
69
me and &o is a point of local maximum of& f--:>- Pme, then the sequence {jn can be chosen to be local maxima of the maps e f--:>- JPlnme. Proof point {j
Integrate the Taylor expansion of e f--:>- 1/.re (x ) with respect to x to find that, for a = {j (x) on the line segment between &o and e'
B y the domination condition, II PVie- 11 i s bounded by P II Vi ll < 00 i f e i s sufficiently close to 80 • Thus, the map \ll (&) = P1/re is differentiable at 80. By the same argument \ll is differentiable throughout a small neighborhood of &0, and by a similar expansion (but now to first order) the derivative P � e can be seen to be continuous throughout this neighborhood. Because P � eo is nonsingular by assumption, we can make the neighborhood still smaller, if necessary, to ensure that the derivative of \ll is nonsingular throughout the neighborhood. Then, by the inverse function theorem, there exists, for every sufficiently small o > 0, an open neighborhood Go of &0 such that the map \ll : G0 f--:>- ball(O, o) is a homeomorphism. The diameter of G0 is bounded by a multiple of o, by the mean-value theorem and the fact that the norms of the derivatives (P� e )- 1 of the inverse w- 1 are bounded. Combining the preceding Taylor expansion with a similar expansion for the sample version Wn (&) = Yln 1/.re , we see sup ll wn (&) - \II (&) I
e EGs
:S
op ( l ) + oop ( l ) + o 2 0p ( 1 ) ,
where the op ( 1 ) terms and the Op ( l ) term result from the law of large numbers, and are uniform in small o . Because P ( op ( 1 ) + oop ( l ) > � o ) � 0 for every o > 0, there exists On .,[,. 0 such that P (op ( 1 ) + on op ( 1 ) > � on ) � 0. If Kn,o is the event where the left side of the preceding display is bounded above by o, then P(Kn,oJ � 1 as n � oo. On the event Kn,o the map e f--:>- e - Wn w- 1 (&) maps ball(O, o) into itself, by the definitions of Go and Kn,D · Because the map is also continuous, it possesses a fixed-point in ball(O, o), by Brouwer's fixed point theorem. This yields a zero of Wn in the set G0, whence the first assertion of the theorem. For the final assertion, first note that the Hessian P � eo of e f--:>- P me at 80 is negative definite, by assumption. A Taylor expansion as in the proof of Theorem 5.41 shows that . . . JPln 1fre" - Pn 1/r e0 ::+p 0 for every &n �p &o. Hence the Hessian Yln l/r e" of 8 f--:>- Pn � e at any consistent zero &n converges in probability to the negative-definite matrix P1fr e0 and is negative-definite with probability tending to 1 . • o
�
The assertion of the theorem that there exists a consistent sequence of roots of the estimating equations is easily misunderstood. It does not guarantee the existence of an asymptotically consistent sequence of estimators. The only claim is that a clairvoyant statistician (with preknowledge of &0 ) can choose a consistent sequence of roots. In reality, it may be impossible to choose the right solutions based only on the data (and knowledge of the model). In this sense the preceding theorem, a standard result in the literature, looks better than it is. The situation is not as bad as it seems. One interesting situation is if the solution of the estimating equation is unique for every n. Then our solutions must be the same as those of the clairvoyant statistician and hence the sequence of solutions is consistent.
70
M- and Z-Estimators
In general, the deficit can be repaired with the help of a preliminary sequence of estimators iJ n . If the sequence iJ n is consistent, then it works to choose the root {) n of lJ:Dn 1/re = 0 that is closest to iJn . Because ll(jn - iJn ll is smaller than the distance 118� - iJn ll between the clairvoyant sequence {)� and iJ n , both distances converge to zero in probability. Thus the sequence of closest roots is consistent. The assertion of the theorem can also be used in a negative direction. The point (}0 in the theorem is required to be a zero of (} f----+ P 1/re , but, apart from that, it may be arbitrary. Thus, the theorem implies at the same time that a malicious statistician can always choose a sequence of roots (Jn that converges to any given zero. These may include other points besides the "true" value of (} . Furthermore, inspection of the proof shows that the sequence of roots can also be chosen to jump back and forth between two (or more) zeros. If the function (} f----+ P l/re has multiple roots, we must exercise care. We can be sure that certain roots of (} f----+ lJ:Dn 1/re are bad estimators. Part of the problem here is caused by using estimating equations , rather than maximiza tion to find estimators, which blurs the distinction between points of absolute maximum, local maximum, and even minimum. In the light of the results on consistency in section 5.2, we may expect the location of the point of absolute maximum of (} f----+ Jl:Dnme to converge to a point of absolute maximum of (} f----+ Pme . As long as this is unique, the absolute maximizers of the criterion function are typically consistent.
5.43 Example (Weibull distribution). Let X 1 , tribution with density !!._ e Pe, a (x) = (j x - I e -xB / a '
•
.
X
•
, Xn be a sample from the Weibull dis >
0, (}
>
0, a
>
0.
(Then a I / e is a scale parameter.) The score function is given by the partial derivatives of the log density with respect to (} and a :
f e, a (x) =
(1 + log x - : log x , w - � + ;: ) .
The likelihood equations I; ie , a (x i ) = 0 reduce to n 1" 1 1� I:7=1 xf log x; a = � x�� ·· "D + - � log x; " n x e = 0. n n L... i =I i i=I i=l The second equation is strictly decreasing in (} , from oo at (} = 0 to log x -log X (n) at (} = oo. Hence a solution exists, and i s unique, unless all x ; are equal. Provided the higher-order derivatives of the score function exist and can be dominated, the sequence of maximum likelihood estimators (en , an) is asymptotically normal by Theorems 5.41 and 5.42. There exist four different third-order derivatives, given by 2 xe 3 = - - log x () 3 (j xe = 2 log 2 x (j a3 £e, a (X) 2x e = - - log x aeaa2 a3 a3 £e , a (X) 2 6x e =- +. aa 3 a 3 a4
-
0
-
------,----- -
5. 7
71
One-Step Estimators
For e and a ranging over sufficiently small neighborhoods of e0 and a0 , these functions are dominated by a function of the form M(x)
=
A ( l + x B ) ( l + l log xl +
·
·
·
+ l log x n ,
for sufficiently large A and B . Because the Weibull distribution has an exponentially small tail, the mixed moment Eeo,cro X P I log Xl q is finite for every p , q ::::_ Thus, all moments of le and ee exist and M is integrable. D
0.
*5.7
One-Step Estimators
The method of Z-estimation as discussed so far has two disadvantages. First, it may be hard to find the roots of the estimating equations. Second, for the roots to be consistent, the estimating equation needs to behave well throughout the parameter set. For instance, existence of a second root close to the boundary of the parameter set may cause trouble. The one-step method overcomes these problems by building on and improving a preliminary estimator iJ The idea is to solve the estimator from a linear approximation to the original estimating equation Wn (e) = Given a preliminary estimator en , the one-step estimator is the solution (in e) to n.
0.
This corresponds to replacing Wn (e) by its tangent at iJ n , and is known as the method of Newton-Rhapson in numerical analysis. The solution e = {Jn is In numerical analysis this procedure is iterated a number of times, taking {Jn as the new preliminary guess, and so on. Provided that the starting point iJ n is well chosen, the sequence of solutions converges to a root of Wn . Our interest here goes in a different direction. We suppose that the preliminary estimator {}n is already within range n - 1 12 of the true value of e. Then, as we shall see, just one iteration of the Newton-Rhapson scheme produces an estimator {J that is as good as the Z -estimator defined by \IIn . In fact, it is better in that its consistency is guaranteed, whereas the true Z-estimator may be inconsistent or not uniquely defined. In this way consistency and asymptotic normality are effectively separated, which is useful because these two aims require different properties of the estimating equations. Good initial estimators can be constructed by ad-hoc methods and take care of consistency. Next, these initial estimators can be improved by the one-step method. Thus, for instance, the good properties of maximum likelihood estimation can be retained, even in cases in which the consistency fails . In this section we impose the following condition on the random criterion functions \1111 • For every constant M and a given nonsingular matrix �0 , n
sup
.Jll i iB-Bo I I < M
1 -Jn ( Wn (e) - Wn (eo)) - �o-Jn (e - eo) I � 0 .
(5.44)
Condition (5 .44) suggests that Wn is differentiable at eo, with derivative tending to �o, but this is not an assumption. We do not require that a derivative �n exists, and introduce
72
M- and Z-Estimators
a further refinement of the Newton-Rhapson scheme by replacing �n (en) by arbitrary estimators. Given nonsingular, random matrices �n .o that converge in probability to �0 define the one-step estimator
Call an estimator sequence en Jfi-consistent if the sequence Jn(en - e0) is uniformly tight. The interpretation is that en already determines the value e0 within n -1 1 2 -range .
Let Jfi\lln (e0) Z and let (5. 44) hold. Then the one-step estimator en, for a given Jfi-consistent estimator sequence en and estimators . p . Wn , O_,. \llo , satisfies
5.45
5.46
Theorem (One-step estimation).
Addendum. Theorem 5. 21 with
Proof.
-v-7
For \lin (e) = fiDn 1/fe condition (5. 44) is satisfied under the conditions of �o = Ve0 , and under the conditions of Theorem 5. 41 with �o = P V, 80 •
The standardized estimator �n ,o Jfi(en - e0) equals
By (5 .44) the second term can be replaced by - �0Jfi(en - e0) + o p (1). Thus the expression can be rewritten as
The first term converges to zero in probability, and the theorem follows after application of Slutsky's lemma. For a proof of the addendum, see the proofs of the corresponding theorems. • If the sequence Jn(en - e0) converges in distribution, then it is certainly uniformly tight. Consequently, a sequence of one-step estimators is Jfi-consistent and can itself be used as preliminary estimator for a second iteration of the modified Newton-Rhapson algorithm. Presumably, this would give a value closer to a root of \lin . However, the limit distribution of this "two-step estimator" is the same, so that repeated iteration does not give asymptotic improvement. In practice a multistep method may nevertheless give better results. We close this section with a discussion of the discretization trick. This device is mostly of theoretical value and has been introduced to relax condition (5 .44) to the following. For every nonrandom sequence en = e0 + O (n -1 1 2 ), (5 .47) This new condition is less stringent and much easier to check. It is sufficiently strong if the preliminary estimators en are discretized on grids of mesh width n -1 1 2 . For instance, en is suitably discretized if all its realizations are points of the grid n -1 1 2 £} (consisting of the points n -1 1 2 (i 1 , . . . , i k ) for integers i 1 , . . . , i k ). This is easy to achieve, but perhaps unnatural. Any preliminary estimator sequence e n can be discretized by replacing its values
5. 7
One-Step Estimators
73
by the closest points of the grid. Because this changes each coordinate by at most n -112, Jn-consistency of en is retained by discretization. Define a one-step estimator Bn as before, but now use a discretized version of the pre liminary estimator.
5.48 Theorem (Discretized one-step estimation). Let Jfl\lln (e0) Z and let (5.47) hold. Then the one-step estimator en, for a given ,.Jii- consistent, discretized estimator sequence . . en and estimators \lln , O�p \llo , satisfies -v--7
5.49 Addendum. For \lin (e) = lP'n 1/Je and lP'n the empirical measure of a random sample from a. density pe that is differentiable in quadratic mean ( 5.38 ), condition ( 5.47), is satisfied, ·T with \llo = - Pe0 1/fel- eo ' if, as e � eo,
Proof.
The arguments of the previous proof apply, except that it must be shown that
converges to zero in probability. Fix £ > 0. By the ,.Jii - consistency, there exists M with P(Jn ll en - eo ll > M) < c . If Jll l l en - eo I ::::; M, then en equals one of the values in the set Sn = {e E n-1127J..k: 11 e - eo ll ::::; n - 1 1 2M } . For each M and n there are only finitely many elements in this set. Moreover, for fixed M the number of elements is bounded independently of n . Thus
::::; £ +
L P( II R (en) ll > c ) .
Bn E Sn
The maximum of the terms in the sum corresponds to a sequence of nonrandom vectors en with en = e0 + O (n -11 2) . It converges to zero by (5 .47). Because the number of terms in the sum is bounded independently of n, the sum converges to zero. For a proof of the addendum, see proposition A. 10 in [ 1 39] . • If the score function le of the model also satisfies the conditions of the addendum, then the estimators �n ,o = -Pen 1/Jen i�, are consistent for �o - This shows that discretized one-step estimation can be carried through under very mild regularity conditions. Note that the addendum requires only continuity of e f--+ 1/Je , whereas (5 . 47) appears to require differentiability.
5.50 Example (Cauchy distribution). Suppose X 1 , . . . , Xn are a sample from the Cauchy location family pe (x) = n-1 (1 + (x - e) 2 r 1 . Then the score function is given by _
2_ (x_ -_e_ )_
i e (x ) - 1 (x - e) 2 · + _
M- and Z -Estimators
74
0 0
0 LO
0 0
C';J
0 LO
C';J
0 0
'?
-200
- 1 00
0
1 00
Figure 5.4. Cauchy log likelihood function of a sample of 25 observations , showing three local maxima. The value of the absolute maximum is well-separated from the other maxima, and its location is close to the true value zero of the parameter.
This function behaves like 1 j x for x -+ ±oo and is bounded in between. The second moment of fe (X1 ) therefore exists , unlike the moments of the distribution itself. Because the sample mean possesses the same (Cauchy) distribution as a single observation X 1 , the sample mean is a very inefficient estimator. Instead we could use the median, or another M -estimator. However, the asymptotically best estimator should be based on maximum likelihood. We have . 4(x - 8) ((x - 8)2 - 3) fe (x) = ( 1 + (x - 8)2) 3 .:..._ .._ _=---'-
___
The tails of this function are of the order 1 j x3 , and the function is bounded in between. These bounds are uniform in 8 varying over a compact interval. Thus the conditions of Theorems 5.41 and 5 .42 are satisfied. Since the consistency follows from Example 5 . 1 6, the sequence of maximum likelihood estimators is asymptotically normal. The Cauchy likelihood estimator has gained a bad reputation, because the likelihood equation L.)e (Xd = 0 typically has several roots. The number of roots behaves asymp totically as two times a Poisson(ljn) variable plus 1 . (See [126] .) Therefore, the one-step (or possibly multi-step method) is often recommended, with, for instance, the median as the initial estimator. Perhaps a better solution is not to use the likelihood equations, but to deter mine the maximum likelihood estimator by, for instance, visual inspection of a graph of the likelihood function, as in Figure 5.4. This is particularly appropriate because the difficulty of multiple roots does not occur in the two parameter location-scale model. In the model with density P e (x fer ) fer , the maximum likelihood estimator for (8 , er) is unique. (See [25].) D
5.51 Example (Mixtures). Let f and g be given, positive probability densities on the real line. Consider estimating the parameter 8 = (M , v , er, r , p) based on a random sample from
the mixture density
X
1---+
pj
75
Rates of Convergence
5. 8
(X -- ) 1 + ( 1 - p)g (X - V) 1 · -a
f-L
-T- -;;
-;;
If f and g are sufficiently regular, then this is a smooth five-dimensional parametric model, and the standard theory should apply. Unfortunately, the supremum of the likelihood over the natural parameter space is oo , and there exists no maximum likelihood estimator. This is seen, for instance, from the fact that the likelihood is bigger than n V 1 f-L 1- n pf ( 1 - p)g -- - . r r a a i= Z If we set J-L = x 1 and next maximize over a > 0, then we obtain the value oo whenever p > 0, irrespective of the values of v and r . A one-step estimator appears reasonable in this example. In view of the smoothness of the likelihood, the general theory yields the asymptotic efficiency of a one-step estimator if started with an initial -fo-consistent estimator. Moment estimators could be appropriate initial estimators. D
(X! - )
*5.8
(Xi - )
Rates of Convergence
In this section we discuss some results that give the rate of convergence of M -estimators . These results are useful as intermediate steps in deriving a limit distribution, but also of interest on their own. Applications include both classical estimators of "regular" parameters and estimators that converge at a slower than .Jli-rate. The main result is simple enough, but its conditions include a maximal inequality, for which results such as in Chapter 1 9 are needed. Let JlDn be the empirical distribution of a random sample of size n from a distribution P , and, for every () in a metric space 8, let x �----+ me (x ) be a measurable function. Let (Jn (nearly) maximize the criterion function () �----+ Pn me . The criterion function may be viewed as the sum of the deterministic map () �----+ P me and the random fluctations 8 t---+ Pnme - Pme . The rate of convergence of en depends on the combined behavior of these maps. If the deterministic map changes rapidly as () moves away from the point of maximum and the random fluctuations are small, then (Jn has a high rate of convergence. For convenience of notation we measure the fluctuations in terms of the empirical process Gnme = -Jfl(JIDn me - Pme) .
5.52 Theorem (Rate of convergence). Assume that forfixed constants C and a every n , andfor every sufficiently small 8 > 0, sup P (me - me0 )
:S
E* sup I Gn (me - me0 ) 1
:S
d(e , eo)
>
fJ, for
- C8a ,
C8 !3 .
If the sequence en satisfies JIDn menA 2:. JIDnmeo - Op (nrx / ( Z/3 - Zrx ) ) and converges in outer probability to 8o, then n 1 / ( Za - Zf3 ) d (8n , 8o) = 0� ( 1 ).
76 Proof.
M- and Z -Estimators
Set rn
=
n 1 1 ( 2a - 2f3 ) and suppose that {jn maximizes the map e
variable R n = Op (r;; a ) .
f--+
lFnme up to a
For each n, the parameter space minus the point e0 can be partitioned into the "shells" SJ,n = {e : 2i - I < rnd(e , eo) :::::: 2i }, with j ranging over the integers. If rn d(en , eo) is larger than 2M for a given integer M, then {jn is in one of the shells S1 , n with j � M. In that case the supremum of the map e f--+ lFnme - lFnmea over this shell is at least R n by the property of en . Conclude that, for every E > 0, -
If the sequence {j n is consistent for e0, then the second probability on the right converges to 0 as n --* oo, for every fixed E > 0. The third probability on the right can be made arbitrarily small by choice of K , uniformly in n . Choose E > 0 small enough to ensure that the conditions of the theorem hold for every 8 :::=: E . Then for every j involved in the sum, we have 2 (j - l )a sup P (me - mea) :::=: -C
e ESJ ,n
For � C2CM - I ) a
�
---
r�
K , the series can be bounded in terms of the empirical process Gn by
by Markov's inequality and the definition of r11 • The right side converges to zero for every
M = Mn --*
oo .
•
Consider the special case that the parameter e is a Euclidean vector. If the map e f--+ P me is twice-differentiable at the point of maximum e0, then its first derivative at e0 vanishes and a Taylor expansion of the limit criterion function takes the form
Then the first condition of the theorem holds with a = 2 provided that the second-derivative matrix V is nonsingular. The second condition of the theorem is a maximal inequality and is harder to verify. In "regular" cases it is valid with f3 = 1 and the theorem yields the "usual" rate of convergence .Jfi. The theorem also applies to nonstandard situations and yields, for instance, the rate n 1 1 3 if a = 2 and fJ = � . Lemmas 1 9.34, 19.36 and 19.38 and corollary 1 9.35 are examples of maximal inequalities that can be appropriate for the present purpose. They give bounds in terms of the entropies of the classes of functions {me - mea : d ( e, eo) < 0}. A Lipschitz condition on the maps e f--+ me is one possibility to obtain simple estimates on these entropies and is applicable in many applications. The result of the following corollary is used earlier in this chapter.
5. 8
Rates of Convergence
77
5.53 Corollary. For each e in an open subset of Euclidean space let x f-+ m e (x) be a measurable function such that, for every e 1 and e2 in a neighborhood ofeo and a measurable function m such that Pm 2 < oo,
Furthermore, suppose that the map e f-+ P me admits a second-order Taylor expansion at the point ofmaximum eo with nonsingular second derivative. Jf lfDnmen 2:: lfDnme0 - 0 p ( n - 1 ) , p then � cen - eo) = 0 p (1 ) , provided that en -+ eo. A
A
By assumption, the first condition of Theorem 5 .52 is valid with a = 2. To see that the second one is valid with f3 = 1 , we apply Corollary 1 9.35 to the class of functions :F = { m e - mea : I e - eo II < This class has envelope function F = m , whence
Proof.
8}.
8
111mi P.28 J
E* sup 1 Gn (me - me0 ) 1 � log N[] (8, :F, L 2 (P)) d8. 0 ll e -Bo l d The bracketing entropy of the class :F is estimated in Example 1 9.7. Inserting the upper bound obtained there into the integral, we obtain that the preceding display is bounded above by a multiple of
111 m 1 P.20 F:(D8) 0
log - d 8. 8
Change the variables in the integral to see that this is a multiple of
8.
•
Rates of convergence different from ,fii are quite common for M -estimators of infinite dimensional parameters and may also be obtained through the application of Theorem 5.52. See Chapters 24 and 25 for examples. Rates slower than ,fii may also arise for fairly simple parametric estimates.
5.54 Example (Modal interval). Suppose that we define an estimator en oflocation as the center of an interval of length 2 that contains the largest possible fraction of the observations. This is an M-estimator for the functions me = 1 [ e - 1 ,8+1 1 . For many underlying distributions the first condition of Theorem 5.52 holds with a = 2. It suffices that the map e f-+ P me = P [ e - 1, e + 1] is twice-differentiable and has a proper maximum at some point e0. Using the maximal inequality Corollary 19.3 5 (or Lemma 1 9.38), we can show that the second condition is valid with f3 = � · Indeed, the bracketing entropy of the intervals in the real line is of the order j 8 2 , and the envelope function of the class of functions 1 [e - 1 ,8+1] - 1 [ ()o - 1 ,eo +1] as e ranges over ceo - eo + is bounded by 1 [eo - 1 - 8 ,eo - 1 HJ + 1 [eo +1 - Mo + 1 Hl • whose squared L 2 -norm is bounded by liP ll oo 28 . Thus Theorem 5.52 applies with a = 2 and f3 = � and yields the rate of convergence 1 n 13 • The resulting location estimator is very robust against outliers. However, in view of its slow convergence rate, one should have good reasons to use it. The use of an interval of length 2 is somewhat awkward. Every other fixed length would give the same result. More interestingly, we can also replace the fixed-length interval by the smallest interval that contains a fixed fraction, for instance 1 /2, of the observations. This
8
8, 8)
78
M- and Z-Estimators
still yields a rate of convergence of n 1 13 . The intuitive reason for this is that the length of a "shorth" settles down by a ,Jn-rate and hence its randomness is asymptotically negligible relative to its center. D The preceding theorem requires the consistency of en as a condition. This consistency is implied if the other conditions are valid for every 8 > 0, not just for small values of 8 . This can be seen from the proof or the more general theorem in the next section. Because the conditions are not natural for large values of 8 , it is usually better to argue the consistency by other means.
5.8.1
Nuisance Parameters
In Chapter 25 we need an extension of Theorem 5.52 that allows for a "smoothing" or "nuisance" parameter. We also take the opportunity to insert a number of other refinements, which are sometimes useful. Let x f--+ me, 11 (x) be measurable functions indexed by parameters (8 , YJ), and consider estimators e n contained in a set en that, for a given fin contained in a set Hn , maximize the map The sets en and Hn need not be metric spaces, but instead we measure the discrepancies between e n and 80, and �n and a limiting value YJo, by nonnegative functions 8 f--+ d11 ( 8 , 80) and YJ f--+ d(YJ, YJo), which may be arbitrary.
5.55 Theorem. Assume that, for arbitraryfunctions en : Bn x Hn f--+ lR and c/Jn : (0, oo) f--+ lR such that 8 f--+ c/Jn (8) j8� is decreasing for some f3 < 2, every (8 , YJ) E en x Hn, and every 8 > 0, P (me,11 - me0 , 11) + en (8 , YJ) � -d; (8 , 8o) + d 2 (YJ, YJo) , E* sup 1 Gn (me,11 - me0, 11 ) - Jll en (8 , YJ) I � c/Jn (o ) . d� (B.IJo)
(O,ry)E8n xHn
Let On > 0 satisfy c/Jn (8n) � ,Jn 8� for every n. If P (e n E Bn , fin E Hn ) l!Dnmen .iln :::: l!Dnmeo . iln - Op (o�), then diln ( e n , 8o ) = O � (o n + d ( fl n , YJo ) ) .
---+
1 and
For simplicity assume that l!Dn men . iln =::: l!Dn meo . iln • without a tolerance term. For each n E N, j E Z and M > 0, let Sn , J, M be the set { (8 , YJ) E Bn x Hn : 21 - 1 8n < d17 (8 , 8o ) � 21 8n, d(YJ, YJo) � 2- M d17 (8 , 8o ) } .
Proof.
Then the intersection ofthe events ( e n , fin) E en X Hn , and diln (en , 8o) :::: 2 M ( on +d( fl n , YJo)) is contained in the union of the events { ( e n , �n ) E Sn , J, M } over j =::: M. By the definition of en, the supremum of l!Dn (me, 11 - me0,11) over the set of parameters (8 , YJ) E Sn , J, M is nonnegative on the event { (e11 , fin ) E Sn , J ,M } . Conclude that P* dij. (en , 8o) =::: 2 M (8n + d ( fl n , YJo) ) , (en , fin) E Bn x Hn
(
�
I: r *
j�M
(
(
sup
ti , I) ) ESn 1 M
)
)
l!Dn (me,11 - me0,17) :::: 0 .
Argmax Theorem
5. 9
For every j , (e, rJ )
E
79
Sn,J ,M · and every sufficiently large M,
-d; (e, eo) + d 2 ( rJ , rJo) < - (1 - 2 - 2M ) d1)2 (e ' e0 ) < -2 21 -4 8n2 . From here on the proof is the same as the proof of Theorem 5 . 52, except that we use that ¢>n (c8) ::::; cf3¢>n (8) for every c > 1 , by the assumption on tf>n · • P (m e ,l) - me0,1) ) + en (e , rJ )
::::;
_
*5.9
Argmax Theorem
The consistency of a sequence of M -estimators can be understood as the points of maximum en of the criterion functions e r--+ Mn (e) converging in probability to a point of maximum of the limit criterion function e r--+ M(e). So far we have made no attempt to understand the distributional limit properties of a sequence of M -estimators in a similar way. This is possible, but it is somewhat more complicated and is perhaps best studied after developing the theory of weak convergence of stochastic processes, as in Chapters 1 8 and 19. Because the estimators en typically converge to constants, it is necessary to rescale them before studying distributional limit properties. Thus, we start by searching for a sequence of numbers rn r--+ oo such that the sequence hn = rn (en - e) is uniformly tight. The results of the preceding section should be useful. If en maximizes the function e r--+ Mn (e), then the rescaled estimators hn are maximizers of the local criterion functions
h
r--+
( �) - Mn (eo) .
Mn e +
Suppose that these, if suitably normed, converge to a limit process h r--+ M ( h ) . Then the general principle is that the sequence hn converges in distribution to the maximizer of this limit process. For simplicity of notation we shall write the local criterion functions as h r--+ Mn ( h ) . Let { Mn (h) : h E Hn } be arbitrary stochastic processes indexed by subsets Hn of a given metric space. We wish to prove that the argmax-functional is continuous : If Mn -v-+ M and Hn ----+ H in a suitable sense, then the (near) maximizers hn of the random maps h r--+ Mn (h) converge in distribution to the maximizer h of the limit process h r--+ M (h) . It is easy to find examples in which this is not true, but given the right definitions it is, under some conditions. Given a set B , set
M (B )
=
sup M (h ) . hEB
Then convergence in distribution of the vectors ( Mn (A) , Mn ( B ) ) for given pairs of sets A and B is an appropriate form of convergence of Mn to M . The following theorem gives some flexibility in the choice of the indexing sets. We implicitly either assume that the suprema Mn (B ) are measurable or understand the weak convergence in terms of outer probabilities, as in Chapter 18. The result we are looking for is not likely to be true if the maximizer of the limit process is not well defined. Exactly as in Theorem 5 .7 , the maximum should be "well separated." Because in the present case the limit is a stochastic process, we require that every sample path h r--+ M (h ) possesses a well-separated maximum (condition (5.57)).
80
M- and Z -Estimators
5.56 Theorem (Argmax theorem). Let Mn and M be stochastic processes indexed by sub sets Hn and H of a given metric space such that, for every pair of a closed set F and a set K in a given collection K, Furthermore, suppose that every sample path of the process h � M(h) possesses a well separated point of maximum h in that, for every open set G and every K E K, M( h ) > M(G c n K n H) ,
if h E G,
a.s ..
If Mn ( h n) :=:: Mn (Hn) - o p ( 1 ) andfor every 8 > 0 there exists K E K) < 8 and P( h � K) < 8, then h n "-""+ h .
K such that sup n
(5 .57) P( h n
�
If h n E F n K , then Mn (F n K n Hn ) :::: Mn (B ) - op (l) for any set B . Hence, for every closed set F and every K E K ,
Proof.
P( h n E F n K)
:S
:S
P(Mn (F n K n Hn) :::: Mn (K n Hn) - op ( l ) ) P(M(F n K n H) :::: M(K n H) ) + o ( l ) ,
by Slutsky's lemma and the portmanteau lemma. If h E F e , then M ( F n K n H ) i s strictly smaller than M (h ) by (5.57) and hence on the intersection with the event in the far right side h cannot be contained in K n H . It follows that lim sup P( h n E F n K)
:S
P(h E F) + P( h
tf.
K n H) .
By assumption we can choose K such that the left and right sides change by less than 8 if we replace K by the whole space. Hence h n "-""+ h by the portmanteau lemma. • The theorem works most smoothly if we can take K to consist only of the whole space. However, then we are close to assuming some sort of global uniform convergence of Mn to M, and this may not hold or be hard to prove. It is usually more economical in terms of conditions to show that the maximizers h n are contained in certain sets K , with high probability. Then uniform convergence of Mn to M on K is sufficient. The choice of compact sets K corresponds to establishing the uniform tightness of the sequence h n before applying the argmax theorem. If the sample paths of the processes Mn are bounded on K and Hn = H for every n, then the weak convergence of the processes Mn viewed as elements of the space .C00 (K) implies the convergence condition of the argmax theorem. This follows by the continuous-mapping theorem, because the map
z � (z (A n K), z (B n K) ) from .C00 (K) to IR2 is continuous, for every pair of sets A and B . The weak convergence in .C00 (K) remains sufficient if the sets Hn depend on n but converge in a suitable way. Write Hn --+ H if H is the set of all limits lim hn of converging sequences hn with hn E Hn for every n and, moreover, the limit h = limi hn, of every converging sequence hn, with hn, E Hn, for every i is contained in H .
5. 9
Argmax Theorem
81
5.58 Corollary. Suppose that M11 "'-"+ M in £00 ( K) for every compact subset K of IRk , for a limit process M with continuous sample paths that have unique points of maxima h . If H11 � H, M11 (h 11) :=: M11 (H11 ) - o p ( l ) , and the sequence h 11 is unzformly tight, then h 11 "'-"+ h . The compactness of K and the continuity of the sample paths h f-+ M (h) imply that the (unique) points of maximum h are automatically well separated in the sense of (5 .57). Indeed, if this fails for a given open set G 3 h and K (and a given w in the underlying probability space), then there exists a sequence h m in G c n K n H such that M( h m ) � M( h ) . If K is compact, then this sequence can be chosen convergent. The limit ho must be in the closed set G c and hence cannot be h . By the continuity of M it also has the property that M (h0 ) = lim M( h m ) = M ( h ) . This contradicts the assumption that h is a unique point of maximum. If we can show that (M11 ( F n Hn) , M11 ( K n Hn)) converges to the corresponding limit for every compact sets F c K, then the theorem is a corollary of Theorem 5.56. If H11 = H for every n, then this convergence is immediate from the weak convergence of M11 to M in .C00(K), by the continuous-mapping theorem. For H11 changing with n this convergence may fail, and we need to refine the proof of Theorem 5.56. This goes through with minor changes if
Proof.
lim sup P(M11 (F n H11 ) - M11 ( K H11) ::: x ) ::: P(M(F n H) - M ( K n H) ::: x ) , 11 � 00 for every x , every compact set F and every large closed ball K . Define functions g11 : £00 (K) f-+ lR by �"',
gn (Z) = sup z (h) - sup z (h) , hEFnH,, hEKnH, and g similarly, but with H replacing H11 • By an argument as in the proof of Theo rem 1 8 . 1 1 , the desired result follows if lim sup g11 (Z11) ::: g(z) for every sequence Z n � z in .C00 (K) and continuous function z . (Then lim sup P(g11 (M11) ::: x) ::: P(g(M) ::: x) for every x , for any weakly converging sequence M11 "'-"+ M with a limit with continuous sample paths.) This in turn follows if for every precompact set B c K, sup z (h) ::: lim sup Z11 (h) ::: sup z (h). n � oo hEBnH, hEBnH To prove the upper inequality, select h11 E B n H11 such that sup Z11 (h) = Z11 (h11) + o(l) = z (h11) + o ( l ) . hEBnH,, Because B is compact, every subsequence of h11 has a converging subsequence. Because H11 � H, the limit h must be in B n H. Because z (h11) � z (h), the upper bound follows. To prove the lower inequality, select for given 8 > 0 an element h E B n H such that sup z (h) ::: z (h) + 8. hEBnH
Because H11 � H , there exists h11 E H11 with h11 � h. This sequence must be in B C B eventually, whence z(h) = lim z (h11) = lim Zn (h11) is bounded above by lim inf sup hEBnH, Zn (h) . •
M- and Z-Estimators
82
The argmax theorem can also be used to prove consistency, by applying it to the original criterion functions (] f--+ Mn (e). Then the limit process (] f--+ M(e) is degenerate, and has a fixed point of maximum e0. Weak convergence becomes convergence in probability, and the theorem now gives conditions for the consistency {Jn !,. e0. Condition (5.57) reduces to the well-separation of e0, and the convergence sup
e eFnK ne.
Mn (e) �p
sup
e eFnKne
Mn (e)
is, apart from allowing 8n to depend on n , weaker than the uniform convergence of Mn to
M.
Notes
In the section on consistency we have given two main results (uniform convergence and Wald's proof) that have proven their value over the years, but there is more to say on this subject. The two approaches can be unified by replacing the uniform convergence by "one sided uniform convergence," which in the case of i.i.d. observations can be established under the conditions of Wald's theorem by a bracketing approach as in Example 19.8 (but then one-sided). Furthermore, the use of special properties, such as convexity of the 1j; or m functions, is often helpful. Examples such as Lemma 5 . 1 0, or the treatment of maximum likelihood estimators in exponential families in Chapter 4, appear to indicate that no single approach can be satisfactory. The study of the asymptotic properties of maximum likelihood estimators and other M -estimators has a long history. Fisher [48], [50] was a strong advocate of the method of maximum likelihood and noted its asymptotic optimality as early as the 1920s. What we have labelled the classical conditions correspond to the rigorous treatment given by Cramer [27] in his authoritative book. Huber initiated the systematic study of M -estimators, with the purpose of developing robust statistical procedures. His paper [78] contains important ideas that are precursors for the application of techniques from the theory of empirical processes by, among others, Pollard, as in [ 1 17], [ 1 1 8], and [ 1 20] . For one-dimensional parameters these empirical process methods can be avoided by using a maximal inequality based on the L 2 -norm (see, e.g., Theorem 2.2.4 in [146]). Surprisingly, then a Lipschitz condition on the Hellinger distance (an integrated quantity) suffices; see for example, [80] or [94] . For higher-dimensional parameters the results are also not the best possible, but I do not know of any simple better ones. The books by Huber [79] and by Hampel, Ronchetti, Rousseeuw, and Stahel [73] are good sources for applications of M -estimators in robust statistics. These references also discuss the relative efficiency of the different M -estimators, which motivates, for instance, the use of Huber's 1/J -function. In this chapter we have derived Huber's estimator as the solution of the problem of minimizing the asymptotic variance under the side condition of a uniformly bounded influence function. Originally Huber derived it as the solution to the problem of minimizing the maximum asymptotic variance sup P a� for P ranging over a contamination neighborhood: P = ( 1 c:)
Problems
83
their asymptotic efficiency by Le Cam in 1 956, who later developed them for general locally asymptotically quadratic models, and also introduced the discretization device, (see [93]). PROBLEMS 1. Let X1 , . . . , Xn be a sample from a density that is strictly positive and symmetric about some
point. Show that the Huber M -estimator for location is consistent for the symmetry point.
2. Find an expression for the asymptotic variance of the Huber estimator for location if the obser
vations are normally di stributed. 3. Define 1/f (x ) = 1 - p , 0, p if x e) ::S: p ::S: P(X ::S: e ) .
<
0, 0,
>
0. Show that E1/f (X - e) = 0 implies that P(X
<
4. Let X J , . . . , Xn b e i.i.d. N ( � , cr 2 )-distributed. Derive the maximum likelihood estimator for
(� , cr 2 ) and show that it is asymptotically normal . Calculate the Fisher information matrix for this parameter and its inverse.
5. Let X 1 , . . . , Xn be i.i.d. Poisson ( l j e )-distributed. Derive the maximum likelihood estimator for
e and show that it is asymptotically normal.
6. Let X1 , . . . , Xn be i.i.d. N ( e , e) -distributed. Derive the maximum likelihood estimator for e and show that it is asymptotically normal.
7. Find a sequence of fixed (nonrandom) functions Mn : lR H- lR that converges pointwise to a limit Mo and such that each Mn has a unique maximum at a point en , but the sequence en does not converge to eo . Can you also find a sequence Mn that converges uniformly?
8. Find a sequence of fixed (nonrandom) functions Mn : lR H- lR that converges pointwise but not uniformly to a limit Mo such that each Mn has a unique maximum at a point en and the sequence en converges to eo . 9. Let X 1 , . . . , Xn be i.i.d. observations from a uniform distribution on [0, e ] . Show that the sequence of maximum likelihood estimators is asymptotically consistent. Show that it is not asymptotically normal.
10. Let X1 , . . . , Xn be i.i.d. observations from an exponential density e exp ( - e x ) . Show that the
sequence of maximum likelihood estimators is asymptotically normal. 1 11. Let JF;;- (p) be a pth sample quantile of a sample from a cumulative distribution F on JR. that is 1 differentiable with positive derivative at the population pth-quantile p - (p) = inf x : F (x ) ::=: 1 p . Show that .fii JF;;- 1 (p) - p - (p) is asymptotically normal with mean zero and variance p ( l - p)/f F - I (p)
}
(
(
f
)
{
12. Derive a minimal condition on the distribution function F that guarantees the consistency of the
sample pth quantile. 13. Calculate the asymptotic variance of .jii (en - e ) in Example 5 .26. 14. Suppose that we observe a random sample from the distribution of (X, Y) in the following
errors-in-variables model: X = Z+e
Yi = a + f3 Z + f, where (e, f) is bivariate normally distributed with mean 0 and covariance matrix cr 2 I and is independent from the unobservable variable Z . In analogy to Example 5 .26, construct a system of estimating equations for (a, (3) based on a conditional likelihood, and study the limit properties of the corresponding estimators. 15. In Example 5 .27, for what point is the least squares estimator iJn consistent if we drop the
condition that E(e I X) = 0? Derive an (implicit) solution in terms of the function E(e I X ) . Is it necessarily eo if Ee = 0?
M - and Z-Estimators
84
16. In Example 5 .27, consider the asymptotic behavior of the least absolute-value estimator e that minimizes L: 7= 1 1 Y - ¢ e (X i ) I ·
i
17. Let X 1 , . . . , Xn be i.i.d. with density h. , a (x ) = A.e - f. (x - a) l {x � a ) , where the parameters A. > 0 and a E are unknown. Calculate the maximum likelihood estimator of (A. , a) and derive its asymptotic properties.
(An , fin)
lR
18. Let X be Poisson-distributed with density p_θ(x) = θˣe^{−θ}/x!. Show by direct calculation that E_θ ℓ̇_θ(X) = 0 and E_θ ℓ̈_θ(X) = −E_θ ℓ̇_θ²(X). Compare this with the assertions in the introduction. Apparently, differentiation under the integral (sum) is permitted in this case. Is that obvious from results from measure theory or (complex) analysis?
19. Let X1, . . . , Xn be a sample from the N(θ, 1) distribution, where it is known that θ ≥ 0. Show that the maximum likelihood estimator is not asymptotically normal under θ = 0. Why does this not contradict the theorems of this chapter?
20. Show that (θ̂n − θ0)Ψ̈n(θ̃n) in formula (5.18) converges in probability to zero if θ̂n → θ0 in probability and there exist an integrable function M and δ > 0 with ‖ψ̈_θ(x)‖ ≤ M(x) for every x and every ‖θ − θ0‖ < δ.
21. If θ̂n maximizes Mn, then it also maximizes Mn⁺. Show that this may be used to relax the conditions of Theorem 5.7 to sup_θ |Mn⁺ − M⁺|(θ) → 0 in probability (if M(θ0) > 0).
22. Suppose that for every ε > 0 there exists a set Θ_ε with lim inf P(θ̂n ∈ Θ_ε) ≥ 1 − ε. Then uniform convergence of Mn to M in Theorem 5.7 can be relaxed to uniform convergence on every Θ_ε.
23. Show that Wald's consistency proof yields almost sure convergence of θ̂n, rather than convergence in probability, if the parameter space is compact and Mn(θ̂n) ≥ Mn(θ0) − o(1).
24. Suppose that (X1, Y1), . . . , (Xn, Yn) are i.i.d. and satisfy the linear regression relationship Yi = θᵀXi + ei for (unobservable) errors e1, . . . , en independent of X1, . . . , Xn. Show that the mean absolute deviation estimator, which minimizes Σ|Yi − θᵀXi|, is asymptotically normal under a mild condition on the error distribution.
25. (i) Verify the conditions of Wald's theorem for m_θ the log likelihood function of the N(μ, σ²)-distribution if the parameter set for θ = (μ, σ²) is a compact subset of ℝ × ℝ⁺.
(ii) Extend m_θ by continuity to the compactification of ℝ × ℝ⁺. Show that the conditions of Wald's theorem fail at the points (μ, 0).
(iii) Replace m_θ by the log likelihood function of a pair of two independent observations from the N(μ, σ²)-distribution. Show that Wald's theorem now does apply, also with a compactified parameter set.
26. A distribution on ℝᵏ is called ellipsoidally symmetric if it has a density of the form x ↦ g((x − μ)ᵀΣ⁻¹(x − μ)) for a function g: [0, ∞) → [0, ∞), a vector μ, and a symmetric positive-definite matrix Σ. Study the Z-estimators for location μ̂ that solve an equation of the form
Σᵢ₌₁ⁿ ψ(t̂n⁻¹(Xᵢ − θ)) = 0
for given estimators t̂n and, for instance, Huber's ψ-function. Is the asymptotic distribution of t̂n important?
27. Suppose that Θ is a compact metric space and M: Θ → ℝ is continuous. Show that (5.8) is equivalent to the point θ0 being a point of unique global maximum. Can you relax the continuity of M to some form of "semi-continuity"?
6 Contiguity
"Contiguity " is another name for "asymptotic absolute continuity." Contiguity arguments are a technique to obtain the limit distribution of a sequence of statistics under underlying laws Qn from a limiting distribution under laws Pn. Typically, the laws Pn describe a null distri bution under investigation, and the laws Qn correspond to an alternative hypothesis.
6.1 Likelihood Ratios
Let P and Q be measures on a measurable space (Ω, A). Then Q is absolutely continuous with respect to P if P(A) = 0 implies Q(A) = 0 for every measurable set A; this is denoted by Q ≪ P. Furthermore, P and Q are orthogonal if Ω can be partitioned as Ω = Ω_P ∪ Ω_Q with Ω_P ∩ Ω_Q = ∅ and P(Ω_Q) = 0 = Q(Ω_P). Thus P "charges" only Ω_P and Q "lives on" the set Ω_Q, which is disjoint from the "support" of P. Orthogonality is denoted by P ⊥ Q.
In general, two measures P and Q need be neither absolutely continuous nor orthogonal. The relationship between their supports can best be described in terms of densities. Suppose P and Q possess densities p and q with respect to a measure μ, and consider the sets
Ω_P = {p > 0},   Ω_Q = {q > 0}.
See Figure 6.1. Because P(Ω_P^c) = ∫_{p=0} p dμ = 0, the measure P is supported on the set Ω_P. Similarly, Q is supported on Ω_Q. The intersection Ω_P ∩ Ω_Q receives positive measure from both P and Q provided its measure under μ is positive. The measure Q can be written as the sum Q = Q^a + Q^⊥ of the measures
Q^a(A) = Q(A ∩ {p > 0}),   Q^⊥(A) = Q(A ∩ {p = 0}).   (6.1)
As proved in the next lemma, Q^a ≪ P and Q^⊥ ⊥ P. Furthermore, for every measurable set A,
Q^a(A) = ∫_A (q/p) dP.
The decomposition Q = Q^a + Q^⊥ is called the Lebesgue decomposition of Q with respect to P. The measures Q^a and Q^⊥ are called the absolutely continuous part and the orthogonal
Figure 6.1. Supports of measures. (The diagram partitions Ω into the regions {p > 0, q = 0}, {p > 0, q > 0}, {p = 0, q > 0}, and {p = q = 0}.)
part (or singular part) of Q with respect to P, respectively. In view of the preceding display, the function q/p is a density of Q^a with respect to P. It is denoted dQ/dP (not: dQ^a/dP), so that
dQ/dP = q/p,   P-a.s.
As long as we are only interested in the properties of the quotient q/p under P-probability, we may leave the quotient undefined for p = 0. The density dQ/dP is only P-almost surely unique by definition. Even though we have used densities to define them, dQ/dP and the Lebesgue decomposition are actually independent of the choice of densities and dominating measure. In statistics a more common name for a Radon-Nikodym density is likelihood ratio. We shall think of it as a random variable dQ/dP: Ω → [0, ∞) and shall study its law under P.
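To make the decomposition concrete, the following sketch (a minimal numerical illustration with invented measures, not taken from the text) computes Q^a, Q^⊥, and dQ/dP for two probability measures on a small finite set, and checks that ∫(q/p) dP equals 1 exactly when Q puts no mass on {p = 0}, as in the lemma below.

```python
# Two probability measures on {0, 1, 2, 3} with densities relative to counting measure
# (assumed toy example, not from the book).
p = [0.5, 0.3, 0.2, 0.0]          # P puts no mass on the point 3
q = [0.4, 0.0, 0.3, 0.3]          # Q puts mass 0.3 on {p = 0}

# Lebesgue decomposition of Q with respect to P, as in (6.1)
Q_abs  = [qi if pi > 0 else 0.0 for pi, qi in zip(p, q)]   # Q^a concentrated on {p > 0}
Q_perp = [qi if pi == 0 else 0.0 for pi, qi in zip(p, q)]  # Q^perp lives on {p = 0}

# Likelihood ratio dQ/dP = q/p, left undefined (here: 0) where p = 0
dQdP = [qi / pi if pi > 0 else 0.0 for pi, qi in zip(p, q)]

# Q^a(A) = integral over A of (q/p) dP, here for A equal to the whole space: both are 0.7
print(sum(Q_abs), sum(r * pi for r, pi in zip(dQdP, p)))

# The integral is 1 iff Q(p = 0) = 0; here Q(p = 0) = 0.3, so it is 0.7 and Q is not << P
print(sum(Q_perp))
```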
6.2 Lemma. Let P and Q be probability measures with densities p and q with respect to a measure μ. Then, for the measures Q^a and Q^⊥ defined in (6.1):
(i) Q = Q^a + Q^⊥, Q^a ≪ P, Q^⊥ ⊥ P.
(ii) Q^a(A) = ∫_A (q/p) dP for every measurable set A.
(iii) Q ≪ P if and only if Q(p = 0) = 0 if and only if ∫(q/p) dP = 1.

Proof. The first statement of (i) is obvious from the definitions of Q^a and Q^⊥. For the second, we note that P(A) can be zero only if p(x) = 0 for μ-almost all x ∈ A. In this case, μ(A ∩ {p > 0}) = 0, whence Q^a(A) = Q(A ∩ {p > 0}) = 0 by the absolute continuity of Q with respect to μ. The third statement of (i) follows from P(p = 0) = 0 and Q^⊥(p > 0) = Q(∅) = 0. Statement (ii) follows from
Q^a(A) = ∫_{A∩{p>0}} q dμ = ∫_{A∩{p>0}} (q/p) p dμ = ∫_A (q/p) dP.
For (iii) we note first that Q ≪ P if and only if Q^⊥ = 0. By (6.1) the latter happens if and only if Q(p = 0) = 0. This yields the first "if and only if." For the second, we note that by (ii) the total mass of Q^a is equal to Q^a(Ω) = ∫(q/p) dP. This is 1 if and only if Q^a = Q. ■
It is not true in general that ∫f dQ = ∫f (dQ/dP) dP. For this to be true for every measurable function f, the measure Q must be absolutely continuous with respect to P. On the other hand, for any P and Q and nonnegative f,
∫f (dQ/dP) dP = ∫f dQ^a ≤ ∫f dQ.
This inequality is used freely in the following. The inequality may be strict, because dividing by zero is not permitted.†

6.2 Contiguity
If a probability measure Q is absolutely continuous with respect to a probability measure P, then the Q-law of a random vector X: Ω → ℝᵏ can be calculated from the P-law of the pair (X, dQ/dP) through the formula
E_Q f(X) = E_P f(X) (dQ/dP).
With P^{X,V} equal to the law of the pair (X, V) = (X, dQ/dP) under P, this relationship can also be expressed as
Q(X ∈ B) = E_P 1_B(X) (dQ/dP) = ∫_{B×ℝ} v dP^{X,V}(x, v).
The validity of these formulas depends essentially on the absolute continuity of Q with respect to P, because a part of Q that is orthogonal with respect to P cannot be recovered from any P-law. Consider an asymptotic version of the problem. Let (Ωn, An) be measurable spaces, each equipped with a pair of probability measures Pn and Qn. Under what conditions can a Qn-limit law of random vectors Xn: Ωn → ℝᵏ be obtained from suitable Pn-limit laws? In view of the above it is necessary that Qn is "asymptotically absolutely continuous" with respect to Pn in a suitable sense. The right concept is contiguity.

6.3 Definition. The sequence Qn is contiguous with respect to the sequence Pn if Pn(An) → 0 implies Qn(An) → 0 for every sequence of measurable sets An. This is denoted Qn ◁ Pn. The sequences Pn and Qn are mutually contiguous if both Pn ◁ Qn and Qn ◁ Pn; this is denoted Pn ◁▷ Qn.
The name "contiguous" is standard, but perhaps conveys a wrong image. "Contiguity" suggests sequences of probability measures living next to each other, but the correct image is "on top of each other" (in the limit) . t
The algebraic identify d Q (d Ql d P) d P is false. because the notation d Ql d P is used as shorthand for d Qa I d P: If we write d Q I d P, then we are not implicitly assuming that Q « P . =
Before answering the question of interest, we give two characterizations of contiguity in terms of the asymptotic behavior of the likelihood ratios of Pn and Qn. The likelihood ratios dQn/dPn and dPn/dQn are nonnegative and satisfy
E_{Pn} (dQn/dPn) ≤ 1   and   E_{Qn} (dPn/dQn) ≤ 1.
Thus, the sequences of likelihood ratios dQn/dPn and dPn/dQn are uniformly tight under Pn and Qn, respectively. By Prohorov's theorem, every subsequence has a further weakly converging subsequence. The next lemma shows that the properties of the limit points determine contiguity. This can be understood in analogy with the nonasymptotic situation. For probability measures P and Q, the following three statements are equivalent by (iii) of Lemma 6.2:
Q ≪ P,   Q(p = 0) = 0,   E_P (dQ/dP) = 1.
This equivalence persists if the three statements are replaced by their asymptotic counterparts: Sequences Pn and Qn satisfy Qn ◁ Pn, if and only if the weak limit points of dPn/dQn under Qn give mass 0 to 0, if and only if the weak limit points of dQn/dPn under Pn have mean 1.
6.4 Lemma (Le Cam's first lemma). Let Pn and Qn be sequences of probability measures on measurable spaces (Ωn, An). Then the following statements are equivalent:
(i) Qn ◁ Pn.
(ii) If dPn/dQn ⇝ U along a subsequence under Qn, then P(U > 0) = 1.
(iii) If dQn/dPn ⇝ V along a subsequence under Pn, then EV = 1.
(iv) For any statistics Tn: Ωn → ℝᵏ: If Tn → 0 in Pn-probability, then Tn → 0 in Qn-probability.

Proof. The equivalence of (i) and (iv) follows directly from the definition of contiguity: Given statistics Tn, consider the sets An = {‖Tn‖ > ε}; given sets An, consider the statistics Tn = 1_{An}.
(i) ⇒ (ii). For simplicity of notation, we write just {n} for the given subsequence along which dPn/dQn ⇝ U under Qn. For given n, we define the function gn(ε) = Qn(dPn/dQn < ε) − P(U < ε). By the portmanteau lemma, lim inf gn(ε) ≥ 0 for every ε > 0. Then, for εn ↓ 0 at a sufficiently slow rate, also lim inf gn(εn) ≥ 0. Thus,
P(U = 0) = lim P(U < εn) ≤ lim inf Qn(dPn/dQn < εn).
On the other hand, with pn and qn densities of Pn and Qn,
Pn(dPn/dQn < εn) = ∫ 1{pn < εn qn} pn dμn ≤ εn ∫ qn dμn = εn → 0.
If Qn is contiguous with respect to Pn, then the Qn-probability of the set on the left goes to zero also. But this is the probability on the right in the first display. Combination shows that P(U = 0) = 0.
(iii) ⇒ (i). If Pn(An) → 0, then the sequence 1_{Ωn−An} converges to 1 in Pn-probability. By Prohorov's theorem, every subsequence of {n} has a further subsequence along which
(dQn/dPn, 1_{Ωn−An}) ⇝ (V, 1) under Pn, for some weak limit V. The function (v, t) ↦ vt is continuous and nonnegative on the set [0, ∞) × {0, 1}. By the portmanteau lemma,
lim inf Qn(Ωn − An) ≥ lim inf ∫ 1_{Ωn−An} (dQn/dPn) dPn ≥ E 1·V.
Under (iii) the right side equals EV = 1. Then the left side is 1 as well and the sequence Qn(An) = 1 − Qn(Ωn − An) converges to zero.
(ii) ⇒ (iii). The probability measures μn = ½(Pn + Qn) dominate both Pn and Qn, for every n. The sum of the densities of Pn and Qn with respect to μn equals 2. Hence, each of the densities takes its values in the compact interval [0, 2]. By Prohorov's theorem every subsequence possesses a further subsequence along which
dPn/dQn ⇝ U under Qn,   dQn/dPn ⇝ V under Pn,   Wn := dPn/dμn ⇝ W under μn,
for certain random variables U, V, and W. Every Wn has expectation 1 under μn. In view of the boundedness, the weak convergence of the sequence Wn implies convergence of moments, and the limit variable has mean EW = 1 as well. For a given bounded, continuous function f, define a function g: [0, 2] → ℝ by g(w) = f(w/(2 − w))(2 − w) for 0 ≤ w < 2 and g(2) = 0. Then g is bounded and continuous. Because dPn/dQn = Wn/(2 − Wn) and dQn/dμn = 2 − Wn, the portmanteau lemma yields
E_{Qn} f(dPn/dQn) = E_{μn} f(dPn/dQn) (dQn/dμn) = E_{μn} g(Wn) → E f(W/(2 − W))(2 − W),
where the integrand on the right side is understood to be g(2) = 0 if W = 2. By assumption, the left side converges to Ef(U). Thus Ef(U) equals the right side of the display for every continuous and bounded function f. Take a sequence of such functions with 1 ≥ f_m ↓ 1_{{0}}, and conclude by the dominated convergence theorem that
P(U = 0) = E 1_{{0}}(W/(2 − W))(2 − W) = 2 P(W = 0).
By a similar argument, Ef(V) = E f((2 − W)/W) W for every continuous and bounded function f, where the integrand on the right is understood to be zero if W = 0. Take a sequence 0 ≤ f_m(x) ↑ x and conclude by the monotone convergence theorem that
EV = E ((2 − W)/W) W = E (2 − W) 1_{W>0} = 2 P(W > 0) − 1.
Combination of the last two displays shows that P(U = 0) + EV = 1. ■
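A small simulation (a sketch with assumed parameter choices, not from the text) illustrates criterion (iii). For Qn = N(h/√n, 1)ⁿ and Pn = N(0, 1)ⁿ the log likelihood ratio is exactly N(−½h², h²) under Pn, so the limit V of dQn/dPn is log normal with mean 1, and the sequences are mutually contiguous; this anticipates the next example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, reps = 400, 1.0, 10000            # assumed sample size, local parameter, replications

# Data generated under P_n = N(0,1)^n; Q_n = N(h/sqrt(n), 1)^n is the local alternative.
x = rng.normal(0.0, 1.0, size=(reps, n))
log_lr = (h / np.sqrt(n)) * x.sum(axis=1) - h**2 / 2    # log dQ_n/dP_n, computed under P_n

print(log_lr.mean(), log_lr.var())      # close to -h^2/2 = -0.5 and h^2 = 1
print(np.exp(log_lr).mean())            # close to 1: the "mean 1" criterion (iii) for Q_n <| P_n
```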
6.5 Example (Asymptotic log normality). The following special case plays an important role in the asymptotic theory of smooth parametric models. Let Pn and Qn be probability measures on arbitrary measurable spaces such that
dPn/dQn ⇝ e^{N(μ,σ²)} under Qn.
Then Qn ◁ Pn. Furthermore, Qn ◁▷ Pn if and only if μ = −½σ². Because the (log normal) variable on the right is positive, the first assertion is immediate from (ii) of the lemma. The second follows from (iii) with the roles of Pn and Qn switched, on noting that E exp N(μ, σ²) = 1 if and only if μ = −½σ².
A mean equal to minus half times the variance looks peculiar, but we shall see that this situation arises naturally in the study of the asymptotic optimality of statistical procedures. □
The following theorem solves the problem of obtaining a Qn-limit law from a Pn-limit law that we posed in the introduction. The result, a version of Le Cam's third lemma, is in perfect analogy with the nonasymptotic situation.
6.6 Theorem. Let Pn and Qn be sequences of probability measures on measurable spaces (Ωn, An), and let Xn: Ωn → ℝᵏ be a sequence of random vectors. Suppose that Qn ◁ Pn and
(Xn, dQn/dPn) ⇝ (X, V) under Pn.
Then L(B) = E 1_B(X) V defines a probability measure, and Xn ⇝ L under Qn.

Proof. Because V ≥ 0, it follows with the help of the monotone convergence theorem that L defines a measure. By contiguity, EV = 1 and hence L is a probability measure. It is immediate from the definition of L that ∫f dL = Ef(X)V for every measurable indicator function f. Conclude, in steps, that the same is true for every simple function f, any nonnegative measurable function, and every integrable function. If f is continuous and nonnegative, then so is the function (x, v) ↦ f(x)v on ℝᵏ × [0, ∞). Thus
lim inf E_{Qn} f(Xn) ≥ lim inf ∫ f(Xn) (dQn/dPn) dPn ≥ Ef(X)V,
by the portmanteau lemma. Apply the portmanteau lemma in the converse direction to conclude the proof that Xn ⇝ L under Qn. ■
6.7 Example (Le Cam's third lemma). The name Le Cam's third lemma is often reserved for the following result. If
(Xn, log dQn/dPn) ⇝ N_{k+1}( (μ, −½σ²), [Σ, τ; τᵀ, σ²] ) under Pn,
then
Xn ⇝ N_k(μ + τ, Σ) under Qn.
In this situation the asymptotic covariance matrices of the sequence Xn are the same under Pn and Qn, but the mean vectors differ by the asymptotic covariance τ between Xn and the log likelihood ratios.†
The statement is a special case of the preceding theorem. Let (X, W) have the given (k + 1)-dimensional normal distribution. By the continuous mapping theorem, the sequence (Xn, dQn/dPn) converges in distribution under Pn to (X, e^W). Because W is N(−½σ², σ²)-distributed, the sequences Pn and Qn are mutually contiguous. According to the abstract
† We set log 0 = −∞; because the normal distribution does not charge the point −∞, the assumed asymptotic normality of log dQn/dPn includes the assumption that Pn(dQn/dPn = 0) → 0.
version of Le Cam's third lemma, Xn ⇝ L under Qn with L(B) = E 1_B(X) e^W. The characteristic function of L is ∫ e^{itᵀx} dL(x) = E e^{itᵀX} e^W. This is the characteristic function of the given normal distribution at the vector (t, −i). Thus
E e^{itᵀX} e^W = e^{itᵀμ − ½σ² − ½tᵀΣt + itᵀτ + ½σ²} = e^{itᵀ(μ + τ) − ½tᵀΣt}.
The right side is the characteristic function of the N_k(μ + τ, Σ) distribution. □
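A quick simulation (a sketch with an assumed model and local parameter, not part of the text) makes the mean shift visible. For Tn = √n X̄n computed from a N(h/√n, 1) sample, the log likelihood ratio against N(0, 1)ⁿ equals hTn − ½h², so the asymptotic covariance with Tn is τ = h, and under the alternative Tn should be approximately N(0 + τ, 1).

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, reps = 400, 1.5, 20000        # assumed sample size, local parameter, replications

# Under the null P_n = N(0,1)^n: the statistic T_n and the log likelihood ratio.
x0 = rng.normal(0.0, 1.0, size=(reps, n))
T0 = np.sqrt(n) * x0.mean(axis=1)
lr0 = h * T0 - h**2 / 2                       # log dQ_n/dP_n with Q_n = N(h/sqrt(n), 1)^n

tau = np.cov(T0, lr0)[0, 1]                   # asymptotic covariance tau, close to h

# Under the alternative Q_n: the limit of T_n should be N(mu + tau, Sigma) = N(h, 1).
x1 = rng.normal(h / np.sqrt(n), 1.0, size=(reps, n))
T1 = np.sqrt(n) * x1.mean(axis=1)
print(tau, T1.mean(), T1.var())               # roughly 1.5, 1.5, and 1
```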
Notes
The concept and theory of contiguity was developed by Le Cam in [92]. In his paper the results that were later to become known as Le Cam's lemmas are listed as a single theorem. The names "first" and "third" appear to originate from [71]. (The second lemma is on product measures, and the first lemma is actually only the implication (iii) ⇒ (i).)
PROBLEMS
1. Let Pn = N(0, 1) and Qn = N(μn, 1). Show that the sequences Pn and Qn are mutually contiguous if and only if the sequence μn is bounded.
2. Let Pn and Qn be the distributions of the mean of a sample of size n from the N(0, 1) and the N(θn, 1) distribution, respectively. Show that Pn ◁▷ Qn if and only if θn = O(1/√n).
3. Let Pn and Qn be the laws of a sample of size n from the uniform distribution on [0, 1] or [0, 1 + 1/n], respectively. Show that Pn ◁ Qn. Is it also true that Qn ◁ Pn? Use Lemma 6.4 to derive your answers.
4. Suppose that ‖Pn − Qn‖ → 0, where ‖·‖ is the total variation distance ‖P − Q‖ = sup_A |P(A) − Q(A)|. Show that Pn ◁▷ Qn.
5. Given ε > 0, find an example of sequences such that Pn ◁▷ Qn, but ‖Pn − Qn‖ → 1 − ε. (The maximum total variation distance between two probability measures is 1.) This exercise shows that it is wrong to think of contiguous sequences as being close. (Try measures that are supported on just two points.)
6. Give a simple example in which Pn ◁ Qn, but it is not true that Qn ◁ Pn.
7. Show that the constant sequences {P} and {Q} are contiguous if and only if P and Q are absolutely continuous.
8. If P ≪ Q, then Q(An) → 0 implies P(An) → 0 for every sequence of measurable sets. How does this follow from Lemma 6.4?
7 Local Asymptotic Normality
A sequence of statistical models is "locally asymptotically normal" if, asymptotically, their likelihood ratio processes are similar to those for a normal location parameter. Technically, this means that the likelihood ratio processes admit a certain quadratic expansion. An important example in which this arises is repeated sampling from a smooth parametric model. Local asymptotic normality implies convergence of the models to a Gaussian model after a rescaling of the parameter.
7.1 Introduction
Suppose we observe a sample X1, . . . , Xn from a distribution P_θ on some measurable space (X, A) indexed by a parameter θ that ranges over an open subset Θ of ℝᵏ. Then the full observation is a single observation from the product Pⁿ_θ of n copies of P_θ, and the statistical model is completely described as the collection of probability measures {Pⁿ_θ: θ ∈ Θ} on the sample space (Xⁿ, Aⁿ). In the context of the present chapter we shall speak of a statistical experiment, rather than of a statistical model.
In this chapter it is shown that many statistical experiments can be approximated by Gaussian experiments after a suitable reparametrization. The reparametrization is centered around a fixed parameter θ0, which should be regarded as known. We define a local parameter h = √n(θ − θ0), rewrite Pⁿ_θ as Pⁿ_{θ0+h/√n}, and thus obtain an experiment with parameter h. In this chapter we show that, for large n, the experiments
(Pⁿ_{θ0+h/√n}: h ∈ ℝᵏ)   and   (N(h, I_{θ0}⁻¹): h ∈ ℝᵏ)
are similar in statistical properties, whenever the original experiments θ ↦ P_θ are "smooth" in the parameter. The second experiment consists of observing a single observation from a normal distribution with mean h and known covariance matrix (equal to the inverse of the Fisher information matrix). This is a simple experiment, which is easy to analyze, whence the approximation yields much information about the asymptotic properties of the original experiments. This information is extracted in several chapters to follow and concerns both asymptotic optimality theory and the behavior of statistical procedures such as the maximum likelihood estimator and the likelihood ratio test.
We have taken the local parameter set equal to ℝᵏ, which is not correct if the parameter set Θ is a true subset of ℝᵏ. If θ0 is an inner point of the original parameter set, then the vector θ = θ0 + h/√n is a parameter in Θ for a given h, for every sufficiently large n, and the local parameter set converges to the whole of ℝᵏ as n → ∞. Then taking the local parameter set equal to ℝᵏ does not cause errors. To give a meaning to the results of this chapter, the measure P_{θ0+h/√n} may be defined arbitrarily if θ0 + h/√n ∉ Θ.

7.2 Expanding the Likelihood
The convergence of the local experiments is defined and established later in this chapter. First, we discuss the technical tool: a Taylor expansion of the logarithm of the likelihood. Let p_θ be a density of P_θ with respect to some measure μ. Assume for simplicity that the parameter is one-dimensional and that the log likelihood ℓ_θ(x) = log p_θ(x) is twice differentiable with respect to θ, for every x, with derivatives ℓ̇_θ(x) and ℓ̈_θ(x). Then, for every fixed x,
log (p_{θ+h}/p_θ)(x) = h ℓ̇_θ(x) + ½ h² ℓ̈_θ(x) + o_x(h²).
The subscript x in the remainder term is a reminder of the fact that this term depends on x as well as on h. It follows that
Σᵢ₌₁ⁿ log (p_{θ+h/√n}/p_θ)(Xᵢ) = (h/√n) Σᵢ₌₁ⁿ ℓ̇_θ(Xᵢ) + ½ (h²/n) Σᵢ₌₁ⁿ ℓ̈_θ(Xᵢ) + Σᵢ₌₁ⁿ o_{Xᵢ}(h²/n).
Here the score has mean zero, P_θℓ̇_θ = 0, and −P_θℓ̈_θ = P_θℓ̇_θ² = I_θ equals the Fisher information for θ (see, e.g., section 5.5). Hence the first term can be rewritten as hΔ_{n,θ}, where Δ_{n,θ} = n^{−1/2} Σᵢ₌₁ⁿ ℓ̇_θ(Xᵢ) is asymptotically normal with mean zero and variance I_θ, by the central limit theorem. Furthermore, the second term in the expansion is asymptotically equivalent to −½h²I_θ, by the law of large numbers. The remainder term should behave as o(1/n) times a sum of n terms and hopefully is asymptotically negligible. Consequently, under suitable conditions we have, for every h,
Σᵢ₌₁ⁿ log (p_{θ+h/√n}/p_θ)(Xᵢ) = h Δ_{n,θ} − ½ I_θ h² + o_{P_θ}(1).
In the next section we see that this is similar in form to the likelihood ratio process of a Gaussian experiment. Because this expansion concerns the likelihood process in a neighborhood of θ, we speak of "local asymptotic normality" of the sequence of models {Pⁿ_θ: θ ∈ Θ}. The preceding derivation can be made rigorous under moment or continuity conditions on the second derivative of the log likelihood. Local asymptotic normality was originally deduced in this manner. Surprisingly, it can also be established under a single condition that only involves a first derivative: differentiability of the root density θ ↦ √p_θ in quadratic mean. This entails the existence of a vector of measurable functions ℓ̇_θ = (ℓ̇_{θ,1}, . . . , ℓ̇_{θ,k})ᵀ such that
∫ [√p_{θ+h} − √p_θ − ½ hᵀℓ̇_θ √p_θ]² dμ = o(‖h‖²),   h → 0.   (7.1)
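As a numerical sanity check of the quadratic expansion above (a sketch with an assumed model and parameter values, not part of the text), one can compare the exact local log likelihood ratio with hΔ_{n,θ} − ½I_θh² for i.i.d. Poisson(θ) observations, where ℓ̇_θ(x) = x/θ − 1 and I_θ = 1/θ.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, h, n = 3.0, 1.2, 2000              # assumed true parameter, local parameter, sample size
x = rng.poisson(theta, size=n)            # data generated under P_theta

theta_n = theta + h / np.sqrt(n)
# Exact log likelihood ratio for the Poisson model (the x! terms cancel).
exact = np.sum(x * (np.log(theta_n) - np.log(theta)) - (theta_n - theta))

# Quadratic approximation h * Delta_n - 0.5 * I_theta * h^2,
# with score l'_theta(x) = x/theta - 1 and Fisher information I_theta = 1/theta.
Delta_n = np.sum(x / theta - 1.0) / np.sqrt(n)
approx = h * Delta_n - 0.5 * (1.0 / theta) * h**2

print(exact, approx)                      # the two numbers agree up to a small remainder
```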
If condition (7.1) is satisfied, then the model (P_θ: θ ∈ Θ) is called differentiable in quadratic mean at θ.
Usually, ½hᵀℓ̇_θ(x)√p_θ(x) is the derivative of the map h ↦ √p_{θ+h}(x) at h = 0 for (almost) every x. In this case
ℓ̇_θ(x) = 2 (1/√p_θ(x)) (∂/∂θ) √p_θ(x) = (∂/∂θ) log p_θ(x).
Condition (7.1) does not require differentiability of the map θ ↦ p_θ(x) for any single x, but
7.2 Theorem. Suppose that 8 is an open subset of rr:f..k and that the model (Pe : e E 8) is differentiable in quadratic mean at e. Then Pefe = 0 and the Fisher information matrix Ie = Pefef� exists. Furthermore, for every converging sequence h n h, as n oo,
---+
---+
---+
Given a converging sequence h n h, we use the abbreviations Pn , p, and g for Pe + h,J.Jii • p8 , and h T fe , respectively. By ( 7. 1 ) the sequence .y'ri(#n - ,jP) converges in quadratic mean (i.e., in L 2 (J.L)) to � g,JP. This implies that the sequence ffn converges in quadratic mean to ,JP. By the continuity of the inner product,
Proof.
The right side equals .y'ri(1 - 1) = 0 for every n, because both probability densities integrate to 1 . Thus Pg = 0. The random variable Wn i = 2[ .JPn! p (X i ) - 1] is with P -probability 1 well defined. By ( 7. 1 ) var
E
(8 n
1 n g ( Xi ) Wn i - .y'ri
8
)
�
E (.Jfi Wn i - g (X i ) ) 2
---+ 0,
t Wni = 2n (f y/p;;y/p dj.L - 1 ) = -n J [Jii;; - v'Pf dj.L ---+ - l Pg2 .
(7.3)
Here P g 2 = J g 2 dP = h T Ie h by the definitions of g and !8 . If both the means and the variances of a sequence of random variables converge to zero, then the sequence converges to zero in probability. Therefore, combining the preceding pair of displayed equations, we find (7.4)
7. 2
Expanding the Likelihood
95
Next, we express the log likelihood ratio in L 7=1 Wni through a Taylor expansion of the logarithm. If we write log(l + x) = x - � x 2 + x 2 R (2x) , then R (x ) --+ 0 as x --+ 0, and
As a consequence of the right side of (7.3), it is possible to write n Wn� = g 2 ( X i ) + Ani for random variables Ani such that E I Ani I --+ 0. The averages An converge in mean and hence in probability to zero. Combination with the law of large numbers yields
By the triangle inequality followed by Markov's inequality, nP ( I Wni I >
c:
J2)
::S
::S
n P (g 2 (Xi ) > n c: 2 ) + nP ( I Ani I > n c: 2 ) c;- 2 Pg 2 { i > n c: 2 } + 8 - 2 E I Ani I --+ 0.
The left side is an upper bound for P (maxl<:: i <:: n I Wni I > c: J2) . Thus the sequence maxl <:: i <:: n I Wni I converges to zero in probability. By the property of the function R , the sequence maxl <:: i <:: n I R (Wni) I converges in probability to zero as well. The last term on the right in (7.5) is bounded by maxl<:: i <:: n I R (Wni) I L 7=1 Wn� . Thus it is o p ( l) 0 p (1), and converges in probability to zero. Combine to obtain that n n l Pn n """ log - (Xi ) = L Wni - - Pg 2 + o p ( l ) . 4 i =l i=l p Together with (7 .4) this yields the theorem. • To establish the differentiability in quadratic mean of specific models requires a conver gence theorem for integrals. Usually one proceeds by showing differentiability of the map e f--+ Pe (x) for almost every X plus p,-equi-integrability (e.g., domination). The following lemma takes care of most examples.
7.6 Lemma. For every e in an open subset ofiT?.k let Pe be a J-L-probability density. Assume that the map e f--+ s e (x) = Jpe (x) is continuously differentiablefor every x. If the elements of the matrix Ie = j(fJ e !Pe) (p� fpe) Pe d p, are well defined and continuous in e, then the map e f--+ ffe is differentiable in quadratic mean (7. 1 ) with ie given by P e l Pe ·
By the chain rule, the map e f--+ pe (x) = s� (x) is differentiable for every x with gradient P e = 2sese . Because se is nonnegative, its gradient se at a point at which se = 0 must be zero. Conclude that we can write se = � (P e !Pe ) ffe, where the quotient P e !Pe may be defined arbitrarily if Pe = 0. By assumption, the map e f--+ Ie = 4 J ses[ dJ-L is continuous . Because the map e f--+ Se (x) is continuously differentiable, the difference S()+h (x) - Se (x) can be written as the integral J01 h T S()+uh (x) du of its derivative. By Jensen's (or Cauchy Schwarz's) inequality, the square of this integral is bounded by the integral J01 (h T S()+uh (x) ) 2
Proof.
Local Asymptotic Normality
96
du of the square. Conclude that
J(
)
SH t h - S g 2 d fL � �
J1 (h'{ S11+uth, ) 2 du dfL = � lal h'{ lll+uth, h t du , 1
where the last equality follows by Fubini's theorem and the definition of le . For h1 -+ h the right side converges to � h T Ie h = J (h T se ) 2 d JL by the continuity of the map 8 1---+ Ie . By the differentiability of the map 8 1---+ Sg (x) the integrand in
! [ SII+th� - Sg - h Tsg r dJL
converges pointwise to zero. The result of the preceding paragraph combined with Propo sition 2.29 shows that the integral converges to zero. •
7.7 Example (Exponential families). The preceding lemma applies to most exponential family models
p_θ(x) = d(θ) h(x) e^{Q(θ)ᵀ t(x)}.
An exponential family model is smooth in its natural parameter (away from the boundary of the natural parameter space). Thus the maps θ ↦ √p_θ(x) are continuously differentiable if the maps θ ↦ Q(θ) are continuously differentiable and map the parameter set Θ into the interior of the natural parameter space. The score function and information matrix equal
ℓ̇_θ(x) = Q̇_θ (t(x) − E_θ t(X)),   I_θ = Q̇_θ Cov_θ t(X) Q̇_θᵀ.
Thus the asymptotic expansion of the local log likelihood is valid for most exponential families. □
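For a concrete case (a sketch with an assumed parametrization, not from the text), take the Poisson family written in this form with t(x) = x and Q(θ) = log θ, so that Q̇_θ = 1/θ and the formulas above give ℓ̇_θ(x) = (x − θ)/θ and I_θ = 1/θ. The sketch checks the information formula by simulation.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n = 2.5, 200000                    # assumed parameter and Monte Carlo size
x = rng.poisson(theta, size=n)

# Poisson(theta) as an exponential family: t(x) = x, Q(theta) = log(theta), Qdot = 1/theta.
score = (1.0 / theta) * (x - x.mean())    # Qdot * (t(X) - E t(X)), with E t(X) estimated by the mean
I_emp = (1.0 / theta) * x.var() * (1.0 / theta)   # Qdot * Cov t(X) * Qdot

print(score.mean(), I_emp, 1.0 / theta)   # score mean ~ 0; empirical information ~ 1/theta = 0.4
```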
7.8 Example (Location models). The preceding lemma also includes all location models {f(x − θ): θ ∈ ℝ} for a positive, continuously differentiable density f with finite Fisher information for location
I_f = ∫ (f′/f)²(x) f(x) dx.
The score function ℓ̇_θ(x) can be taken equal to −(f′/f)(x − θ). The Fisher information is equal to I_f for every θ and hence certainly continuous in θ.
By a refinement of the lemma, differentiability in quadratic mean can also be established for slightly irregular shapes, such as the Laplace density f(x) = ½e^{−|x|}. For the Laplace density the map θ ↦ log f(X − θ) fails to be differentiable at the single point θ = X. At other points the derivative exists and equals sign(x − θ). It can be shown that the Laplace location model is differentiable in quadratic mean with score function ℓ̇_θ(x) = sign(x − θ). This may be proved by writing the difference √f(x − h) − √f(x) as the integral ∫₀¹ ½ h sign(x − uh) √f(x − uh) du of its derivative, which is possible even though the derivative does not exist everywhere. Next the proof of the preceding lemma applies. □
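The following sketch (an illustration with standard densities, not from the text) approximates the Fisher information for location I_f = ∫(f′/f)²f dx by numerical integration; for the logistic density this gives 1/3, and for the Laplace and standard normal densities it gives 1 (for the Laplace density the almost-everywhere derivative suffices).

```python
import numpy as np

# Numerical Fisher information for location: integrate (d/dx log f)^2 * f over a fine grid.
def fisher_location(logf, lo=-40.0, hi=40.0, m=200001):
    x = np.linspace(lo, hi, m)
    lf = logf(x)
    score = np.gradient(lf, x)                 # numerical derivative of log f
    dx = x[1] - x[0]
    return np.sum(score**2 * np.exp(lf)) * dx  # Riemann sum of (f'/f)^2 f

logistic = lambda x: -x - 2 * np.log1p(np.exp(-x))     # log of e^{-x}/(1+e^{-x})^2
laplace  = lambda x: np.log(0.5) - np.abs(x)
normal   = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

print(fisher_location(logistic), fisher_location(laplace), fisher_location(normal))
# approximately 1/3, 1, and 1, respectively
```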
7.9 Counterexample (Uniform distribution). The family of uniform distributions on [0, θ] is nowhere differentiable in quadratic mean. The reason is that the support of the uniform distribution depends too much on the parameter. Differentiability in quadratic mean (7.1) does not require that all densities p_θ have the same support. However, restriction of the integral (7.1) to the set {p_θ = 0} yields
P_{θ+h}(p_θ = 0) = ∫_{p_θ=0} p_{θ+h} dμ = o(h²).
Thus, under (7.1) the total mass P_{θ+h}(p_θ = 0) of P_{θ+h} that is orthogonal to P_θ must "disappear" as h → 0 at a rate faster than h². This is not true for the uniform distribution, because, for h ≥ 0,
P_{θ+h}(p_θ = 0) = ∫_{[0,θ]^c} (1/(θ + h)) 1_{[0,θ+h]}(x) dx = h/(θ + h).
The orthogonal part does converge to zero, but only at the rate O(h).
7.3
D
Convergence to a Normal Experiment
The true meaning of local asymptotic normality is convergence of the local statistical experiments to a normal experiment. In Chapter 9 the notion of convergence of statistical experiments is introduced in general. In this section we bypass this general theory and establish a direct relationship between the local experiments and a normal limit experiment. The limit experiment is the experiment that consists of observing a single observation X with the N (h , le- l )-distribution. The log likelihood ratio process of this experiment equals log
r dN(h , le- 1 ) 1 r X X -h leh . h le ( ) = 2 dN(O, le- 1 )
The right side is very similar in form to the right side of the expansion of the log likelihood ratio log d P;+h / .,fii I d Pen given in Theorem 7 .2. In view of the similarity, the possibility of a normal approximation is not a complete surprise. The approximation in this section is "local" in nature: We fix () and think of
as a statistical model with parameter h, for "known" () . We show that this can be approxi mated by the statistical model (N(h , le- 1 ) : h E Il�n. A motivation for studying a local approximation is that, usually, asymptotically, the "true" parameter can be known with unlimited precision. The true statistical difficulty is therefore determined by the nature of the measures Pe for () in a small neighbourhood of the true value. In the present situation "small" turns out to be "of size 0 ( 1 I ,jn) ." A relationship between the models that can be statistically interpreted will be described through the possible (limit) distributions of statistics. For each n , let Tn = Tn ( X 1 , , X11 ) be a statistic in the experiment (Pen+ h / .,fii : h E �k ) with values in a fixed Euclidean space. Suppose that the sequence of statistics T11 converges in distribution under every possible (local) parameter: •
every h .
.
.
98
Local Asymptotic Normality
Here !;.., means convergence in distribution under the parameter e + hj.jfi, and Le ,h may be any probability distribution. According to the following theorem, the distributions {Le , h : h E IRk } are necessarily the distributions of a statistic T in the normal experiment ( N (h , Ie- 1 ) : h E IRk ) . Thus, every weakly converging sequence of statistics is "matched" by a statistic in the limit experiment. (In the present set-up the vector e is considered known and the vector h is the statistical parameter. Consequently, by "statistics" Tn and T are understood measurable maps that do not depend on h but may depend on e .) This principle of matching estimators is a method to give the convergence of models a statistical interpretation. Most measures of quality of a statistic can be expressed in the distribution of the statistic under different parameters. For instance, if a certain hypothesis is rejected for values of a statistic Tn exceeding a number c , then the power function h c---+ Ph (Tn > c) is relevant; alternatively, if Tn is an estimator of h, then the mean square error h c---+ Eh ( Tn - h) 2 , or a similar quantity, determines the quality of Tn . Both quality measures depend on the laws of the statistics only. The following theorem asserts that as a function of h the law of a statistic Tn can be well approximated by the law of some statistic T . Then the quality of the approximating T is the same as the "asymptotic quality" of the sequence Tn . Investigation of the possible T should reveal the asymptotic performance of possible sequences Tn . Concrete applications of this principle to testing and estimation are given in later chapters. A minor technical complication is that it is necessary to allow randomized statistics in the limit experiment. A randomized statistic T based on the observation X is defined as a measurable map T = T (X, U) that depends on X but may also depend on an independent variable U with a uniform distribution on [0, 1 ] . Thus, the statistician working in the limit experiment is allowed to base an estimate or test on both the observation and the outcome of an extra experiment that can be run without knowledge of the parameter. In most situations such randomization is not useful, but the following theorem would not be true without it. t
7.10 Theorem. Assume that the experiment (Pe : e E 8) is differentiable in quadratic mean (7.1) at the point e with nonsingular Fisher information matrix Ie. Let Tn be statistics in the experiments (Pen+h ! .fii : h E IRk ) such that the sequence Tn converges in distribution under every h. Then there exists a randomized statistic T in the experiment ( N (h , Ie- 1 ) : h E IRk ) such that Tn !;.., T for every h. Proof.
For later reference, it is useful to use the abbreviations J = Ie ,
By assumption, the marginals of the sequence (Tn , ,6. n ) converge in distribution under h = 0; hence they are uniformly tight by Prohorov's theorem. Because marginal tightness implies joint tightness, Prohorov's theorem can be applied in the other direction to see the existence of a subsequence of {n} along which
t
It is not important that U is uniformly distributed. Any randomization mechanism that is sufficiently rich will do.
7.3
99
Convergence to a Normal Experiment
jointly, for some random vector (S, �). The vector � is necessarily a marginal weak limit of the sequence �n and hence it is N (O , f)-distributed. Combination with Theorem 7.2 yields
( Tn , log dPn,h ) o ( S, h T � - -h21 T Jh ) . --
�
dPn,O In particular, the sequence log dPn,h / dPn, o converges to the normal N(- � h T Jh, h T ]h) distribution. By Example 6.5, the sequences Pn,h and Pn, o are contiguous. The limit law L h of Tn under h can therefore be expressed in the joint law on the right, by the general form of Le Cam ' s third lemma: For each Borel set B
We need to find a statistic T in the normal experiment having this law under h (for every h), using only the knowledge that � is N (0, f)-distributed. By the lemma below there exists a randomized statistic T such that, with U uniformly distributed and independent of �, t
( T (� , U) , � ) "" (S, �) . Because the random vectors on the left and right sides have the same second marginal distribution, this is the same as saying that T ( 8 , U) is distributed according to the conditional distribution of S given � = 8, for almost every 8. As shown in the next lemma, this can be achieved by using the quantile transformation. Let X be an observation in the limit experiment ( N ( h , J - 1 ) : h E Il�_k) . Then J X is under h = 0 normally N (0, f)-distributed and hence it is equal in distribution to � . Furthermore, by Fu bini's theorem,
Ph ( T ( J X , U) E B) =
f P ( T ( Jx, U)
E
B) e - � (x - h )r l(x - h )
= Eo 1 B ( T ( J X , U) ) e h r J X-� h r J h .
det J (2n ) k
-- dx
This equals L h ( B ) , because, by construction, the vector ( T (J X, U) , IX) has the same distribution under h = 0 as (S, �). The randomized statistic T (J X, U) has law L h under h and hence satisfies the requirements. •
7.11 Lemma. Given a random vector (S, �) with values in �d x �k and an independent uniformly [0, 1] random variable U (defined on the same probability space), there exists a jointly measurable map T on �k x [0, 1] such that ( T (� , U) , � ) and (S, �) are equal in distribution. For simplicity of notation we only give a construction for d = 2. It is possible to produce two independent uniform [0, 1 ] variables ul and u2 from one given [0, 1] variable U. (For instance, construct U1 and U2 from the even and odd numbered digits in the decimal expansion of U. ) Therefore it suffices to find a statistic T = T(� , U1 , U2 ) such that (T, �) and (S, �) are equal in law. Because the second marginals are equal, it
Proof.
t
The symbol � means "equal-in-law."
1 00
Local Asymptotic Normality
suffices to construct T such that T(8, U 1 , U2 ) is equal in distribution to S given t. = 6 , for every 8 E IR.k . Let Q 1 (u 1 I 8) and Q 2 (u 2 I 8 , s t ) be the quantile functions of the conditional distributions and respectively. These are measurable functions in their two and three arguments , respectively. Furthermore, Q 1 (U1 18) has law P 51 1 t.. =o and Q 2 ( U2 1 8 , s 1 ) has law p Sz l t.. =o , S1 =s1 , for every 8 and s 1 • Set Then the first coordinate Q 1 (U1 18) of T (8 , U1 , U2 ) possesses the distribution p S1 1 6= 8 . Given that this first coordinate equals s 1 , the second coordinate is distributed as Q 2 (U2 1 8 , s 1 ) , which has law p S2 1 t.. =o , S1 =s1 by construction. Thus T satisfies the requirements. • 7.4
Maximum Likelihood
Maximum likelihood estimators in smooth parametric models were shown to be asymp totically normal in Chapter 5. The convergence of the local experiments to a normal limit experiment gives an insightful explanation of this fact. By the representation theorem, Theorem 7. 10, every sequence of statistics in the local ex periments (P;+ h /fo : h E IR.k ) is matched in the limit by a statistic in the normal experiment. Although this does not follow from this theorem, a sequence of maximum likelihood esti mators is typically matched by the maximum likelihood estimator in the limit experiment. Now the maximum likelihood estimator for h in the experiment ( N (h , !8- 1 ) : h E IR.k ) is the observation X itself (the mean of a sample of size one), and this is normally distributed. Thus, we should expect that the maximum likelihood estimators hn for the local param eter h in the experiments (P;+h / fn : h E IR.k ) converge in distributiAon to X. In terms of the original parameter 8, the local maximum likelihood estimator hn is the standardized maximum likelihood estimator hn = Jn(Bn 8) . Furthermore, the local parameter h = 0 corresponds to the value 8 of the original parameter. Thus, we should expect that under 8 the sequence Jn(Bn 8) converges in distribution to X under h = 0, that is, to the N( O , /8- 1 )-distribution. As a heuristic explanation of the asymptotic normality of maximum likelihood estimators the preceding argument is much more insightful than the proof based on linearization of the score equation. It also explains why, or in what sense, the maximum likelihood estimator is asymptotically optimal: in the same sense as the maximum likelihood estimator of a Gaussian location parameter is optimal. This heuristic argument cannot be justified underjust local asymptotic normality, which is too weak a connection between the sequence oflocal experiments and the normal limit exper iment for this purpose. Clearly, the argument is valid under the conditions of Theorem 5.39, because the latter theorem guarantees the asymptotic normality of the maximum likelihood estimator. This theorem adds a Lipschitz condition on the maps 8 f---'3> log Pe (x) , and the "global" condition that Bn is consistent to differentiability in quadratic mean. In the fol lowing theorem, we give a direct argument, and also allow that 8 is not an inner point of the parameter set, so that the local parameter spaces may not converge to the full space m.k . -
-
7. 4
101
Maximum Likelihood
X
Then the maximum likelihood estimator in the limit experiment is a "projection" of and the limit distribution of Jfi(Bn 19) may change accordingly. Let 8 be an arbitrary subset of IRk and define Hn as the local parameter space Hn Jn(8 19). Then hn is the maximizer over Hn of the random function (or "process") d Pn h f-+ lo g d Pn If the experiment 19 E 8) is differentiable in quadratic mean, then this sequence of processes converges (marginally) in distribution to the process
-
-
Hh / ..fii e
(Pe :
h
f-+
log
dN ( h, dN ( 0,
li1)1 (X) = - -(X 1 1 h) h) + -X leX. Ie(X 2 2 Ie- ) T
T
If the sequence of sets Hn converges in a suitable sense to a set H, then we should expect, under regularity conditions, that the sequence hn converges to the maximizer h of the latter process over H . This maximizer is the projection of the vector onto the set H relative to the metric d (x , y) (x - yl y) (where a "projection" means a closest point); if k H IR , this projection reduces to itself. An appropriate notion of convergence of sets is the following. Write Hn ---+ H if H is the set of all limits lim hn of converging sequences hn with h n E Hn for every n and, moreover, the limit h limi hn, of every converging sequence hn, with h n, E Hn, for every i is contained in H. t =
=
X
Ie(xX -
=
Suppose that the experiment : 19 E 8) is differentiable in quadratic Furthermore, suppose that for mean at 19o with nonsingular Fisher information matrix every and 192 in a neighborhood of 190 and a measurable function .€ with < oo, 7.12
(Pe
Theorem.
le0 •
81
Pe0 l2
If the sequence of maximum likelihood estimators Bn is consistent and the sets Hn Jn( t9n 19o) Jfi(8 19o) converge to a nonempty, convex set H, then the sequence converges under 190 in distribution to the projection of a standard normal vector onto the H. set
le10/2
le10/2
-
- Pe0 ) L 2 ( Pe0 ) Pe e -le00 • P Pe, 1 e sup nJPln log P o +h!..fii - h Gn f e0 + -h le0 h 2 I I 0. Pea
A
-
=
Let Gn Jfl(JPln be the empirical process. In the proof of Theorem 5.39 it is shown that the map 19 f-+ log pe is differentiable at 19o in with derivative and that the map t9 f-+ log permits a Taylor expansion of order 2 at 19o, with "second-derivative matrix" Therefore, the conditions of Lemma 19.3 1 are satisfied for m log whence, for every M, *Proof.
le0
e
=
=
l l h ii �M
T
·
T
---+
P
By Corollary 5.53 the estimators Bn are Jn-consistent under 190. The preceding display is also valid for every sequence Mn that diverges to sufficiently slowly. Fix such a sequence. By the Jn-consistency of Bn , the local maximum likelihood oo
t
See Chapter 16 for examples.
Local Asymptotic Normality
102
estimators h11 are bounded in probability and hence belong to the balls of radius Mn with probability tending to 1 . Furthermore, the sequence of intersections 11 n ball (0 , M11) converges to as the original sets Thus, we may assume that the h11 are the maximum likelihood estimators relative to local parameter sets that are contained in the balls of radius M11 • Fix an arbitrary closed set If h11 E then the log likelihood is maximal on Hence (h E is bounded above by
H,
H
Hn. F.
Hn
F, P n F) P ( sup n og PeaP+eha/ ::::_ sup n og PeaP+eha! v'fi) P ( sup hT Gn-Cea - � h T lea h ::::_ sup hT Gnfea - � h T lea h + o p ( l ) ) P ( l le� 1 12 Gn iea - 1�:\ F n Hn) l :::: l le� 1 12 Gn iea - 1�:2 Hn II + O p ( l) ) , by completing the square. By Lemma 7 . 1 3 (ii) and (iii) ahead, we can replace Hn by H on both sides, at the cost of adding a further op ( l)-term and increasing the probability. Next, by the continuous mapping theorem and the continuity of the map I - A I for every F.
h E Fni-1, =
1Tll 11
I
v'ii
h EI-1,
1Tll 11
1
___:____c_ _:_ _
h EFni-fn
h EI-1,
=
z �----+
z
set A, the probability is asymptotically bounded above by, with Z a standard normal vector,
1�:2 H
The projection ITZ of the vector Z on the set is unique, because the latter set is convex by assumption and automatically closed. If the distance of Z to n is smaller than its distance to the set then IT Z must be in n Consequently, the probability in the last display is bounded by P(IT Z E The theorem follows from the portmanteau lemma. •
1�:2 H,
2 (F H) 1�: 2 (F H). 1�: 1�: 2 F).
Hn
H
If the sequence of subsets of TJf.k converges to a non empty set and the sequence of random vectors converges in distribution to a random vector then (i) n + O p ( l), for every closed set n (ii) ::::_ n G il + op (l), for every open set G. (iii) n G il
7.13
Lemma.
Xn X, I Xn - Hn l -v'+ I X - H l F. I Xn - Hn F l I XnXn - HH F l I Xn - Hn :::: I (i). Because the map x x - H l is (Lipschitz) continuous for any set H, l we have that I X11 - H I -v'+l l X - H I by the continuous-mapping theorem. If we also show that I X11 - Hn I - I X11 - H I � 0, then the proof is complete after an application of Slutsky's lemma. By the uniform tightness of the sequence X11 , it suffices to show that for x ranging over compact sets, or equivalently that l xxn- HnHnI ---+ l xx- HHI uniformly every converging sequence x11 ---+ x. - l X11for, there l For- everyl ---+fixedl vector exists a vector h n Hn with l xn-Hn I l xn - h n 11 - l f n . Unless l xn - Hn I is unbounded, we can choose the sequence hn bounded. Then every subsequence of hn has a further subsequence along which it converges, to a limit h in H. �----+
Proof
E
::::__
Conclude that, in any case,
Conversely, for every 8 > 0 there exists h
E
H and a sequence hn ---+ h with h11
E
l x - H l ::::_ l x - h l - 8 = lim l xn - hn l - 8 ::::_ lim sup l xn - Hn l - 8.
Hn and
Local Asymptotic Normality
7. 6
-
103
Combination of the last two displays yields the desired convergence of the sequence llxn Hn I to llx - H l l (ii). The assertion is equivalent to the statement P II Xn - Hn n F ll - II Xn - H n F ll > - 8 --+ 1 for every 8 > 0. In view of the uniform tightness of the sequence Xn , this follows if lim inf ll xn - Hn n F II ::: llx - H n F II for every converging sequence Xn -+ x . We can prove this by the method of the first half of the proof of (i), replacing Hn by Hn n F. (iii) . Analogously to the situation under (ii), it suffices to prove that lim sup llxn - Hn n G I ::::: llx - H n G I for every converging sequence Xn -+ x . This follows as the second half of the proof of (i). •
)
(
*7.5
Limit Distributions under Alternatives
Local asymptotic normality is a convenient tool in the study of the behavior of statistics under "contiguous alternatives." Under local asymptotic normality, n d Pe+h ! v'n !, N - -1 h T 1 h h T 1 h . lo g 2 d Pn Therefore, in view of Example 6.5 the sequences of distributions P8n+h ! v'n and Pen are mutually contiguous. This is of great use in many proofs. With the help of Le Cam ' s third lemma it also allows to obtain limit distributions of statistics under the parameters e + hI -Jii, once the limit behavior under e is known. Such limit distributions are of interest, for instance, in studying the asymptotic efficiency of estimators or tests. The general scheme is as follows. Many sequences of statistics Tn allow an approxima tion by an average of the type
e
(
e, e)
According to Theorem 7 .2, the sequence of log likelihood ratios can be approximated by an average as well: It is asymptotically equivalent to an affine transformation of is asymptotically multivariate The sequence of joint averages normal under e by the central limit theorem (provided has mean zero and finite second moment). With the help of Slutsky's lemma we obtain the joint limit distribution of Tn and the log likelihood ratios under e :
n - 1 12 .2:: ( 1/re (Xi ), le(Xi)) 1/re
n - 1 12 _L le (Xi ).
Finally we can apply Le Cam ' s third Example 6.7 to obtain the limit distribution of under e + hj -Jfi. Concrete examples of this scheme are discussed in later Jn(Tn chapters.
f-Le )
*7.6
Local Asymptotic Normality
The preceding sections of this chapter are restricted to the case of independent, identically distributed observations. However, the general ideas have a much wider applicability. A
Local Asymptotic Normality
1 04
wide variety of models satisfy a general form of local asymptotic normality and for that reason allow a unified treatment. These include models with independent, not identically distributed observations, but also models with dependent observations, such as used in time series analysis or certain random fields. Because local asymptotic normality underlies a large part of asymptotic optimality theory and also explains the asymptotic normality of certain estimators, such as maximum likelihood estimators, it is worthwhile to formulate a general concept. Suppose the observation at "time" n is distributed according to a probability measure Pn , e , for a parameter () ranging over an open subset 8 of rn;.k . Definition. The sequence of statistical models ( Pn , e : () E 8) is locally asymptoti cally normal (LAN) at () if there exist matrices rn and Ie and random vectors t-.. n , e such that
7.14
t-.. n , e � N (O, Ie ) and for every converging sequence hn
log
-+
h
d Pn , e + rn- l h " 1 = h T t-.. 11 e - - h T fe h + O p" d Pn , e 2 '
0
(1).
If the experiment ( Pe : () E e) is differentiable in quadratic mean, then the sequence of models (P; : () E 8) is locally asymptotically normal with norming matrices rn = Jli!. D
7.15
Example.
An inspection of the proof of Theorem 7. 10 readily reveals that this depends on the local asymptotic normality property only. Thus, the local experiments of a locally asymptotically normal sequence converge to the experiment (N (h , /0- 1 ) : h E Il�_k), in the sense of this theorem. All results for the case of i.i. d. observations that are based on this approximation extend to general locally asymptotically normal models. To illustrate the wide range of applications we include, without proof, three examples, two of which involve dependent observations. An autoregressive process {Xr : t E Z} of or der 1 satisfies the relationship Xr = () Xr - 1 + Zr for a sequence of independent, identically distributed variables . . . , Z_ 1 , Z0 , Z1 , . . . with mean zero and finite variance. There ex ists a stationary solution . . . , X _ 1 , X0 , X 1 , to the autoregressive equation if and only if I() I =f. 1 . To identify the parameter it is usually assumed that I() I 1 . If the density of the noise variables Z1 has finite Fisher information for location, then the sequence of models corresponding to observing X 1 , , Xn with parameter set ( - 1 , 1) is locally asymptotically normal at () with norming matrices r11 = Jli!. The observations in this model form a stationary Markov chain. The result extends to general ergodic Markov chains with smooth transition densities (see [ 1 30]). D 7.16
Example (Autoregressive processes).
•
.
.
<
•
•
•
This example requires some knowledge of time series models. Suppose that at time n the observations are a stretch X 1 , . . . , X from a stationary, Gaussian time series {Xr : t E Z} with mean zero. The covariance matrix of n
7.17
Example (Gaussian time series).
n
7. 6
Local Asymptotic Normality
1 05
consecutive variables is given by the (Toeplitz) matrix
Tn(fe) (/_: e i (t -s )Jc fe (A) dA ) s t l =
fe
_
,-
....
,n
.
The function is the spectral density of the series. It is convenient to let the parameter enter the model through the spectral density, rather than directly through the density of the observations. Let n , be the distribution (on �n ) of the vector ( X 1 , . . . , Xn) , a normal distribution with mean zero and covariance matrix Tn The periodogram of the observations is the function
Pe
(fe).
Suppose that fe is bounded away from zero and infinity, and that there exists a vector-valued function : � �d such that, as ----?- 0, . T fe] 2 dA = o ( 2 )
h l hl . f Ue +h - fe - h fe Then the sequence of experiments (Pn , e : e 8) is locally asymptotically normal at () with Ie 4n1 f £8 £8 dA . le r-+
E
= -
·
·
T
The proof is elementary, but involved, because it has to deal with the quadratic forms in the n-variate normal density, which involve vectors whose dimension converges to infinity (see [30]). D Consider estimating a location parameter () based on a sample of size n from the density ( - 8). If is smooth, then this model is differentiable in quadratic mean and hence locally asymptotically normal by Example 7.8. If possesses points of discontinuity, or other strong irregularities, then a locally asymptot ically normal approximation is impossible. t Examples of densities that are on the boundary between these "extremes" are the triangular density = (1 and the gamma density = e - 1 > 0}. These yield models that are locally asymptotically normal, but with norming rate Jn log n rather than ,Jn. The existence of singularities in the density makes the estimation of the parameter e easier, and hence a faster rescaling rate is necessary. (For the triangular density, the true singularities are the points - 1 and 1 , the singularity at 0 is statistically unimportant, as in the case of the Laplace density.) For a more general result, consider densities that are absolutely continuous except pos sibly in small neighborhoods U1 , . . . , Uk of finitely many fixed points c 1 , . . . , ck . Suppose that f 'I ..jJ is square-integrable on the complement of U 1 u1 , that (c 1) = 0 for every j , and that, for fixed constants a 1 , . . . , ak and b1 , . . . , bk . each of the functions 7.18
Example (Almost regular densities).
fx
f
f
f(x)
f(x) x x {x
- lx lt
f
f
t
See Chapter 9 for some examples.
1 06
Local Asymptotic Normality
is twice continuously differentiable. If I)ai + bi ) > 0, then the model is locally asymp totically normal at () = 0 with, for equal to the interval (n - 1 1 (Iog n ) - 1 14 , (log n)- 1 ) around zero, t
2
Vn
n
r
=
Jn log n ,
j
1 {Xi - c Vn} 1 1 ( -----'-- f (x + c d x -- 6.n , O Jn log1 n �i = l � Xi ) J= l The sequence 6.n , o may be thought of as "asymptotically sufficient" for the local parameter h . Its definition of 6.n , o shows that, asymptotically, all the "information" about the parameter is contained in the observations falling into the neighborhoods Vn + c Thus, asymptotically, :;== = --;:.::::=
i
LL
E
Cj
v, ,
.
i)
X
1.
the problem is determined by the points of irregularity. The remarkable rescaling rate Jn log n can be explained by computing the Hellinger distance between the densities f (x - ()) and f (x) (see section 14.5). D
Notes
Local asymptotic normality was introduced by Le Cam [92], apparently motivated by the study and construction of asymptotically similar tests. In this paper Le Cam defines two sequences of models (P_{n,θ} : θ ∈ Θ) and (Q_{n,θ} : θ ∈ Θ) to be differentially equivalent if

sup_{h∈K} ‖P_{n,θ+h/√n} − Q_{n,θ+h/√n}‖ → 0,

for every bounded set K and every θ. He next shows that a sequence of statistics T_n in a given asymptotically differentiable sequence of experiments (roughly LAN) that is asymptotically equivalent to the centering sequence Δ_{n,θ} is asymptotically sufficient, in the sense that the original experiments and the experiments consisting of observing the T_n are differentially equivalent. After some interpretation this gives roughly the same message as Theorem 7.10. The latter is a concrete example of an abstract result in [95], with a different (direct) proof.
PROBLEMS

1. Show that the Poisson distribution with mean θ satisfies the conditions of Lemma 7.6. Find the information.

2. Find the Fisher information for location for the normal, logistic, and Laplace distributions.

3. Find the Fisher information for location for the Cauchy distributions.

4. Let f be a density that is symmetric about zero. Show that the Fisher information matrix (if it exists) of the location-scale family f((x − μ)/σ)/σ is diagonal.

5. Find an explicit expression for the o_{P_θ}(1)-term in Theorem 7.2 in the case that p_θ is the density of the N(θ, 1)-distribution.

6. Show that the Laplace location family is differentiable in quadratic mean.
7. Find the form of the score function for a location-scale family f((x − μ)/σ)/σ with parameter θ = (μ, σ) and apply Lemma 7.6 to find a sufficient condition for differentiability in quadratic mean.

8. Investigate for which parameters k the location family f(x − θ), for f the gamma(k, 1) density, is differentiable in quadratic mean.

9. Let P_{n,θ} be the distribution of the vector (X_1, …, X_n) if {X_t : t ∈ ℤ} is a stationary Gaussian time series satisfying X_t = θX_{t−1} + Z_t for a given number |θ| < 1 and independent standard normal variables Z_t. Show that the model is locally asymptotically normal.

10. Investigate whether the log normal family of distributions with density

(1/(σ(x − ξ)√(2π))) e^{−(log(x−ξ) − μ)²/(2σ²)} 1{x > ξ}

is differentiable in quadratic mean with respect to θ = (ξ, μ, σ).
8 Efficiency of Estimators
One purpose of asymptotic statistics is to compare the performance of estimators for large sample sizes. This chapter discusses asymptotic lower bounds for estimation in locally asymptotically normal models. These show, among other things, in what sense maximum likelihood estimators are asymptotically efficient.
8.1
Asymptotic Concentration
Suppose the problem is to estimate ψ(θ) based on observations from a model governed by the parameter θ. What is the best asymptotic performance of an estimator sequence T_n for ψ(θ)?

To simplify the situation, we shall in most of this chapter assume that the sequence √n(T_n − ψ(θ)) converges in distribution under every possible value of θ. Next we rephrase the question as: What are the best possible limit distributions? In analogy with the Cramér–Rao theorem a "best" limit distribution is referred to as an asymptotic lower bound. Under certain restrictions the normal distribution with mean zero and covariance the inverse Fisher information is an asymptotic lower bound for estimating θ in a smooth parametric model. This is the main result of this chapter, but it needs to be qualified.

The notion of a "best" limit distribution is understood in terms of concentration. If the limit distribution is a priori assumed to be normal, then this is usually translated into asymptotic unbiasedness and minimum variance. The statement that √n(T_n − ψ(θ)) converges in distribution to a N(μ(θ), σ²(θ))-distribution can be roughly understood in the sense that eventually T_n is approximately normally distributed with mean and variance given by

ψ(θ) + μ(θ)/√n   and   σ²(θ)/n.

Because T_n is meant to estimate ψ(θ), optimal choices for the asymptotic mean and variance are μ(θ) = 0 and variance σ²(θ) as small as possible. These choices ensure not only that the asymptotic mean square error is small but also that the limit distribution N(μ(θ), σ²(θ)) is maximally concentrated near zero. For instance, the probability of the interval (−a, a) is maximized by choosing μ(θ) = 0 and σ²(θ) minimal.

We do not wish to assume a priori that the estimators are asymptotically normal. That normal limits are best will actually be an interesting conclusion. The concentration of a general limit distribution L_θ cannot be measured by mean and variance alone. Instead, we
can employ a variety of concentration measures, such as

∫ x² dL_θ(x);   ∫ |x| dL_θ(x);   ∫ 1{|x| > a} dL_θ(x);   ∫ (|x| ∧ a) dL_θ(x).

A limit distribution is "good" if quantities of this type are small. More generally, we focus on minimizing ∫ ℓ dL_θ for a given nonnegative function ℓ. Such a function is called a loss function and its integral ∫ ℓ dL_θ is the asymptotic risk of the estimator. The method of measuring concentration (or rather lack of concentration) by means of loss functions applies to one- and higher-dimensional parameters alike. The following example shows that a definition of what constitutes asymptotic optimality is not as straightforward as it might seem.
8.1 Example (Hodges' estimator). Suppose that T_n is a sequence of estimators for a real parameter θ with standard asymptotic behavior in that, for each θ and certain limit distributions L_θ,

√n(T_n − θ) ⇝_θ L_θ.

As a specific example, let T_n be the mean of a sample of size n from the N(θ, 1)-distribution. Define a second estimator S_n through

S_n = T_n  if |T_n| ≥ n^{−1/4};   S_n = 0  if |T_n| < n^{−1/4}.

If the estimator T_n is already close to zero, then it is changed to exactly zero; otherwise it is left unchanged. The truncation point n^{−1/4} has been chosen in such a way that the limit behavior of S_n is the same as that of T_n for every θ ≠ 0, but for θ = 0 there appears to be a great improvement. Indeed, for every sequence r_n,

r_n S_n ⇝_0 0,   and   √n(S_n − θ) ⇝_θ L_θ for every θ ≠ 0.

To see this, note first that the probability that T_n falls in the interval (θ − Mn^{−1/2}, θ + Mn^{−1/2}) converges to L_θ(−M, M) for most M and hence is arbitrarily close to 1 for M and n sufficiently large. For θ ≠ 0, the intervals (θ − Mn^{−1/2}, θ + Mn^{−1/2}) and (−n^{−1/4}, n^{−1/4}) are centered at different places and eventually disjoint. This implies that truncation will rarely occur: P_θ(T_n = S_n) → 1 if θ ≠ 0, whence the second assertion. On the other hand, the interval (−Mn^{−1/2}, Mn^{−1/2}) is contained in the interval (−n^{−1/4}, n^{−1/4}) eventually. Hence under θ = 0 we have truncation with probability tending to 1 and hence P_0(S_n = 0) → 1; this is stronger than the first assertion.

At first sight, S_n is an improvement on T_n. For every θ ≠ 0 the estimators behave the same, while for θ = 0 the sequence S_n has an "arbitrarily fast" rate of convergence. However, this reasoning is a bad use of asymptotics. Consider the concrete situation that T_n is the mean of a sample of size n from the normal N(θ, 1)-distribution. It is well known that T_n = X̄ is optimal in many ways for every fixed n and hence it ought to be asymptotically optimal also. Figure 8.1 shows why S_n = X̄ 1{|X̄| ≥ n^{−1/4}} is no improvement. It shows the graph of the risk function θ ↦ E_θ(S_n − θ)² for three different values of n. These functions are close to 1 on most
[Figure 8.1. Quadratic risk functions of the Hodges estimator based on the means of samples of size 10 (dashed), 100 (dotted), and 1000 (solid) observations from the N(θ, 1)-distribution.]
of the domain but possess peaks close to zero. As n → ∞, the locations and widths of the peaks converge to zero but their heights to infinity. The conclusion is that S_n "buys" its better asymptotic behavior at θ = 0 at the expense of erratic behavior close to zero. Because the values of θ at which S_n is bad differ from n to n, the erratic behavior is not visible in the pointwise limit distributions under fixed θ. ∎
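The risk curves in Figure 8.1 are easy to reproduce. The following sketch (illustrative code, not part of the original text; the grid of parameter values and the number of replications are arbitrary choices) approximates the normalized quadratic risk θ ↦ n E_θ(S_n − θ)² of the Hodges estimator by Monte Carlo.

```python
import numpy as np

def hodges_risk(n, thetas, reps=20000, seed=None):
    """Monte Carlo approximation of n * E_theta (S_n - theta)^2 for the
    Hodges estimator S_n = Xbar * 1{|Xbar| >= n^(-1/4)}, Xbar the sample mean."""
    rng = np.random.default_rng(seed)
    risks = []
    for theta in thetas:
        # sample means of samples of size n from N(theta, 1)
        xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
        s = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)
        risks.append(n * np.mean((s - theta) ** 2))
    return np.array(risks)

thetas = np.linspace(-2, 2, 201)
for n in (10, 100, 1000):
    r = hodges_risk(n, thetas, seed=0)
    print(n, float(r.max()))   # the peak of the risk curve near zero grows with n
```

The output reproduces the qualitative picture of the figure: the risk is close to 1 over most of the range of θ, while the maximal risk, attained near θ = 0, increases with n.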
8.2
Relative Efficiency
In order to choose between two estimator sequences, we compare the concentration of their limit distributions. In the case of normal limit distributions and convergence rate √n, the quotient of the asymptotic variances is a good numerical measure of their relative efficiency. This number has an attractive interpretation in terms of the numbers of observations needed to attain the same goal with each of two sequences of estimators.

Let ν → ∞ be a "time" index, and suppose that it is required that, as ν → ∞, our estimator sequence attains mean zero and variance 1 (or 1/ν). Assume that an estimator T_n based on n observations has the property that, as n → ∞,

√n(T_n − ψ(θ)) ⇝ N(0, σ²(θ)).

Then the requirement is to use at time ν an appropriate number n_ν of observations such that, as ν → ∞,

√ν(T_{n_ν} − ψ(θ)) ⇝ N(0, 1).

Given two available estimator sequences, let n_{ν,1} and n_{ν,2} be the numbers of observations
needed to meet the requirement with each of the estimators. Then, if it exists, the limit

lim_{ν→∞} n_{ν,2}/n_{ν,1}

is called the relative efficiency of the estimators. (In general, it depends on the parameter θ.) Because √ν(T_{n_ν} − ψ(θ)) can be written as √(ν/n_ν) √n_ν(T_{n_ν} − ψ(θ)), it follows that necessarily n_ν → ∞, and also that n_ν/ν → σ²(θ). Thus, the relative efficiency of two estimator sequences with asymptotic variances σ_1²(θ) and σ_2²(θ) is just

lim_{ν→∞} n_{ν,2}/n_{ν,1} = lim_{ν→∞} (n_{ν,2}/ν)/(n_{ν,1}/ν) = σ_2²(θ)/σ_1²(θ).

If the value of this quotient is bigger than 1, then the second estimator sequence needs proportionally that many observations more than the first to achieve the same (asymptotic) precision.
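As a concrete illustration (a sketch, not part of the original text; the model, sample size, and replication count are arbitrary choices), the following code estimates the asymptotic variances of the sample mean and the sample median for N(θ, 1) data by simulation and forms their quotient; the theoretical value of the relative efficiency of the median with respect to the mean is π/2 ≈ 1.57.

```python
import numpy as np

def asymptotic_variance(estimator, n=200, reps=20000, theta=0.0, seed=None):
    """Monte Carlo estimate of the variance of sqrt(n) * (T_n - theta)
    for an estimator T_n computed row-wise on N(theta, 1) samples."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(theta, 1.0, size=(reps, n))
    t = estimator(samples)
    return n * np.var(t - theta)

var_mean = asymptotic_variance(lambda x: x.mean(axis=1), seed=1)
var_median = asymptotic_variance(lambda x: np.median(x, axis=1), seed=2)

# Relative efficiency: roughly how many times more observations the median
# needs to match the precision of the mean (theory: pi/2 ~ 1.57).
print(var_median / var_mean)
```

The printed quotient is close to π/2, in line with the sample-size interpretation given above.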
8.3
Lower Bound for Experiments
It is certainly impossible to give a nontrivial lower bound on the limit distribution of a standardized estimator √n(T_n − ψ(θ)) for a single θ. Hodges' example shows that it is not even enough to consider the behavior under every θ, pointwise for all θ. Different values of the parameters must be taken into account simultaneously when taking the limit as n → ∞. We shall do this by studying the performance of estimators under parameters in a "shrinking" neighborhood of a fixed θ. We consider parameters θ + h/√n for fixed θ and h ranging over ℝ^k and suppose that, for certain limit distributions L_{θ,h},

√n(T_n − ψ(θ + h/√n)) ⇝_{θ+h/√n} L_{θ,h},   every h.    (8.2)

Then T_n can be considered a good estimator for ψ(θ) if the limit distributions L_{θ,h} are maximally concentrated near zero. If they are maximally concentrated for every h and some fixed θ, then T_n can be considered locally optimal at θ.

Unless specified otherwise, we assume in the remainder of this chapter that the parameter set Θ is an open subset of ℝ^k, and that ψ maps Θ into ℝ^m. The derivative of θ ↦ ψ(θ) is denoted by ψ̇_θ. Suppose that the observations are a sample of size n from a distribution P_θ. If P_θ depends smoothly on the parameter, then

(P^n_{θ+h/√n} : h ∈ ℝ^k) ⇝ (N(h, I_θ^{-1}) : h ∈ ℝ^k)

as experiments, in the sense of Theorem 7.10. This theorem shows which limit distributions are possible and can be specialized to the estimation problem in the following way.

8.3 Theorem. Assume that the experiment (P_θ : θ ∈ Θ) is differentiable in quadratic mean (7.1) at the point θ with nonsingular Fisher information matrix I_θ. Let ψ be differentiable at θ. Let T_n be estimators in the experiments (P^n_{θ+h/√n} : h ∈ ℝ^k) such that
(8.2) holds for every h. Then there exists a randomized statistic T in the experiment (N(h, I_θ^{-1}) : h ∈ ℝ^k) such that T − ψ̇_θ h has distribution L_{θ,h} for every h.

Proof. Apply Theorem 7.10 to S_n = √n(T_n − ψ(θ)). In view of the definition of L_{θ,h} and the differentiability of ψ, the sequence

S_n = √n(T_n − ψ(θ + h/√n)) + √n(ψ(θ + h/√n) − ψ(θ))

converges in distribution under h to L_{θ,h} ∗ δ_{ψ̇_θ h}, where ∗δ_h denotes a translation by h. According to Theorem 7.10, there exists a randomized statistic T in the normal experiment such that T has distribution L_{θ,h} ∗ δ_{ψ̇_θ h} for every h. This satisfies the requirements. ∎

This theorem shows that for most estimator sequences T_n there is a randomized estimator T such that the distribution of √n(T_n − ψ(θ + h/√n)) under θ + h/√n is, for large n, approximately equal to the distribution of T − ψ̇_θ h under h. Consequently the standardized distribution of the best possible estimator T_n for ψ(θ + h/√n) is approximately equal to the standardized distribution of the best possible estimator T for ψ̇_θ h in the limit experiment. If we know the best estimator T for ψ̇_θ h, then we know the "locally best" estimator sequence T_n for ψ(θ). In this way, the asymptotic optimality problem is reduced to optimality in the experiment based on one observation X from a N(h, I_θ^{-1})-distribution, in which θ is known and h ranges over ℝ^k.

This experiment is simple and easy to analyze. The observation itself is the customary estimator for its expectation h, and the natural estimator for ψ̇_θ h is ψ̇_θ X. This has several optimality properties: It is minimum variance unbiased, minimax, best equivariant, and Bayes with respect to the noninformative prior. Some of these properties are reviewed in the next section.

Let us agree, at least for the moment, that ψ̇_θ X is a "best" estimator for ψ̇_θ h. The distribution of ψ̇_θ X − ψ̇_θ h is normal with zero mean and covariance ψ̇_θ I_θ^{-1} ψ̇_θᵀ for every h. The parameter h = 0 in the limit experiment corresponds to the parameter θ in the original problem. We conclude that the "best" limit distribution of √n(T_n − ψ(θ)) under θ is the N(0, ψ̇_θ I_θ^{-1} ψ̇_θᵀ)-distribution. This is the main result of the chapter. The remaining sections discuss several ways of making this reasoning more rigorous.

Because the expression ψ̇_θ I_θ^{-1} ψ̇_θᵀ is precisely the Cramér–Rao lower bound for the covariance of unbiased estimators for ψ(θ), we can think of the results of this chapter as asymptotic Cramér–Rao bounds. This is helpful, even though it does not do justice to the depth of the present results. For instance, the Cramér–Rao bound in no way suggests that normal limiting distributions are best. Also, it is not completely true that an N(h, I_θ^{-1})-distribution is "best" (see section 8.8). We shall see exactly to what extent the optimality statement is false.
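For a concrete one-dimensional illustration of the bound ψ̇_θ I_θ^{-1} ψ̇_θᵀ (an assumed example, not from the original text), take the N(θ, 1) model and the functional ψ(θ) = P_θ(X ≤ 0) = Φ(−θ). Then I_θ = 1 and ψ̇_θ = −φ(θ), so the bound equals φ(θ)². The plug-in estimator Φ(−X̄_n) attains it, whereas the empirical distribution function has asymptotic variance Φ(−θ)(1 − Φ(−θ)). The sketch below checks this by simulation.

```python
import numpy as np
from scipy.stats import norm

def simulated_asymptotic_variance(estimator, theta=0.7, n=300, reps=10000, seed=0):
    """Variance of sqrt(n) * (T_n - psi(theta)) for psi(theta) = P_theta(X <= 0)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, 1.0, size=(reps, n))
    psi = norm.cdf(-theta)
    return n * np.var(estimator(x) - psi)

theta = 0.7
bound = norm.pdf(theta) ** 2                     # psi-dot * I^{-1} * psi-dot
plug_in = simulated_asymptotic_variance(lambda x: norm.cdf(-x.mean(axis=1)))
empirical = simulated_asymptotic_variance(lambda x: (x <= 0).mean(axis=1))
print(bound, plug_in, empirical)                 # plug-in ~ bound < empirical
```

The plug-in estimator, which uses the parametric model, reaches the asymptotic lower bound; the "model-free" empirical estimator does not.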
8.4
Estimating Normal Means
According to the preceding section, the asymptotic optimality problem reduces to optimality in a normal location (or "Gaussian shift") experiment. This section has nothing to do with asymptotics but reviews some facts about Gaussian models.
Based on a single observation X from a N (h , �)-distribution, it is required to estimate Ah for a given matrix A. The covariance matrix � is assumed known and nonsingular. It is well known that AX is minimum variance unbiased. It will be shown that AX is also best-equivariant and minimax for many loss functions. A randomized estimator T is called equivariant-in-law for estimating Ah if the distri bution of T - Ah under h does not depend on h. An example is the estimator AX, whose "invariant law" (the law of AX - Ah under h) is the N (0, A� A T )-distribution. The follow ing proposition gives an interesting characterization of the law of general equivariant-in-law estimators: These are distributed as the sum of AX and an independent variable.
8.4 Proposition. The null distribution L of any randomized equivariant-in-law estimator of Ah can be decomposed as L = N(0, AΣAᵀ) ∗ M for some probability measure M. The only randomized equivariant-in-law estimator for which M is degenerate at 0 is AX.
The measure M can be interpreted as the distribution of a noise factor that is added to the estimator AX. If no noise is best, then it follows that AX is best equivariant-in-law. A more precise argument can be made in terms of loss functions. In general, convoluting a measure with another measure decreases its concentration. This is immediately clear in terms of variance: The variance of a sum of two independent variables is the sum of the variances, whence convolution increases variance. For normal measures this extends to all "bowl-shaped" symmetric loss functions. The name should convey the form of their graph. Formally, a function is defined to be bowl-shaped if the sublevel sets {x : ℓ(x) ≤ c} are convex and symmetric about the origin; it is called subconvex if, moreover, these sets are closed. A loss function is any function with values in [0, ∞). The following lemma quantifies the loss in concentration under convolution (for a proof, see, e.g., [80] or [114]).

8.5 Lemma (Anderson's lemma). For any bowl-shaped loss function ℓ on ℝ^k, every probability measure M on ℝ^k, and every covariance matrix Σ,

∫ ℓ dN(0, Σ) ≤ ∫ ℓ d[N(0, Σ) ∗ M].
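A quick numerical check of the lemma (an illustrative sketch, not part of the original text; the covariance matrix, noise distribution, and loss are arbitrary choices): adding independent noise to a N(0, Σ) vector increases the expected value of a bowl-shaped loss, here ℓ(x) = min(‖x‖, a).

```python
import numpy as np

rng = np.random.default_rng(42)
k, a, reps = 3, 2.0, 200000
sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])                          # covariance Sigma
loss = lambda z: np.minimum(np.linalg.norm(z, axis=1), a)    # bowl-shaped loss

g = rng.multivariate_normal(np.zeros(k), sigma, size=reps)   # N(0, Sigma) draws
m = rng.uniform(-1, 1, size=(reps, k))                       # independent noise ~ M

print(loss(g).mean(), loss(g + m).mean())   # the first (unconvolved) risk is smaller
```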
Next consider the minimax criterion. According to this criterion the "best" estimator, relative to a given loss function, minimizes the maximum risk

sup_h E_h ℓ(T − Ah),

over all (randomized) estimators T. For every bowl-shaped loss function ℓ, this leads again to the estimator AX.

8.6 Proposition. For any bowl-shaped loss function ℓ, the maximum risk of any randomized estimator T of Ah is bounded below by E_0 ℓ(AX). Consequently, AX is a minimax estimator for Ah. If Ah is real and E_0 (AX)² ℓ(AX) < ∞, then AX is the only minimax estimator for Ah up to changes on sets of probability zero.

Proofs. For a proof of the uniqueness of the minimax estimator, see [18] or [80]. We prove the other assertions for subconvex loss functions, using a Bayesian argument.
Let H be a random vector with a normal N(0, Λ)-distribution, and consider the original N(h, Σ)-distribution as the conditional distribution of X given H = h. The randomization variable U in T(X, U) is constructed independently of the pair (X, H). In this notation, the distribution of the variable T − AH is equal to the "average" of the distributions of T − Ah under the different values of h in the original set-up, averaged over h using a N(0, Λ)-"prior distribution." By a standard calculation, we find that the "a posteriori" distribution, the distribution of H given X, is the normal distribution with mean (Σ^{-1} + Λ^{-1})^{-1} Σ^{-1} X and covariance matrix (Σ^{-1} + Λ^{-1})^{-1}. Define the random vectors

G_Λ = A(Σ^{-1} + Λ^{-1})^{-1} Σ^{-1} X − AH,   W_Λ = T − A(Σ^{-1} + Λ^{-1})^{-1} Σ^{-1} X.

These vectors are independent, because W_Λ is a function of (X, U) only, and the conditional distribution of G_Λ given X is normal with mean 0 and covariance matrix A(Σ^{-1} + Λ^{-1})^{-1} Aᵀ, independent of X. As Λ = λI for a scalar λ → ∞, the sequence G_Λ converges in distribution to a N(0, AΣAᵀ)-distributed vector G. The sum of the two vectors yields T − AH, for every Λ. Because a supremum is larger than an average, we obtain, where on the left we take the expectation with respect to the original model,

sup_h E_h ℓ(T − Ah) ≥ E ℓ(T − AH) = E ℓ(G_Λ + W_Λ) ≥ E ℓ(G_Λ),

by Anderson's lemma. This is true for every Λ. The lim inf of the right side as λ → ∞ is at least E ℓ(G), by the portmanteau lemma. This concludes the proof that AX is minimax.

If T is equivariant-in-law with invariant law L, then the distribution of G_Λ + W_Λ = T − AH is L, for every Λ. It follows that

∫ e^{itᵀx} dL(x) = E e^{itᵀ G_Λ} E e^{itᵀ W_Λ}.

As λ → ∞, the left side remains fixed; the first factor on the right side converges to the characteristic function of G, which is positive. Conclude that the characteristic functions of W_Λ converge to a continuous function, whence W_Λ converges in distribution to some vector W, by Lévy's continuity theorem. By the independence of G_Λ and W_Λ for every Λ, the sequence (G_Λ, W_Λ) converges in distribution to a pair (G, W) of independent vectors with marginal distributions as before. Next, by the continuous-mapping theorem, the distribution of G_Λ + W_Λ, which is fixed at L, "converges" to the distribution of G + W. This proves that L can be written as a convolution, as claimed in Proposition
8. 4.
is independent of h. By the completeness of the normal location family, we conclude that t - AX is constant, almost surely. If T has the same law as AX, then the constant is zero. Furthermore, T must be equal to its projection t almost surely, because otherwise it would have a bigger second moment than t = AX. Thus T = AX almost surely. •
8.5
Convolution Theorem
An estimator sequence T_n is called regular at θ for estimating a parameter ψ(θ) if, for every h,

√n(T_n − ψ(θ + h/√n)) ⇝_{θ+h/√n} L_θ,   every h.

The probability measure L_θ may be arbitrary but should be the same for every h. A regular estimator sequence attains its limit distribution in a "locally uniform" manner. This type of regularity is common and is often considered desirable: A small change in the parameter should not change the distribution of the estimator too much; a disappearing small change should not change the (limit) distribution at all. However, some estimator sequences of interest, such as shrinkage estimators, are not regular. In terms of the limit distributions L_{θ,h} in (8.2), regularity is exactly that all L_{θ,h} are equal, for the given θ.

According to Theorem 8.3, every estimator sequence is matched by an estimator T in the limit experiment (N(h, I_θ^{-1}) : h ∈ ℝ^k). For a regular estimator sequence this matching estimator has the property

T − ψ̇_θ h ∼ L_θ,   every h.    (8.7)

Thus a regular estimator sequence is matched by an equivariant-in-law estimator for ψ̇_θ h. A more informative name for "regular" is asymptotically equivariant-in-law. It is now easy to determine a best estimator sequence from among the regular estimator sequences (a best regular sequence): It is the sequence T_n that corresponds to the best equivariant-in-law estimator T for ψ̇_θ h in the limit experiment, which is ψ̇_θ X by Proposition 8.4. The best possible limit distribution of a regular estimator sequence is the law of this estimator, a N(0, ψ̇_θ I_θ^{-1} ψ̇_θᵀ)-distribution.

The characterization as a convolution of the invariant laws of equivariant-in-law estimators carries over to the asymptotic situation.
8.8 Theorem (Convolution). Assume that the experiment (P_θ : θ ∈ Θ) is differentiable in quadratic mean (7.1) at the point θ with nonsingular Fisher information matrix I_θ. Let ψ be differentiable at θ. Let T_n be an at θ regular estimator sequence in the experiments (P_θ^n : θ ∈ Θ) with limit distribution L_θ. Then there exists a probability measure M_θ such that

L_θ = N(0, ψ̇_θ I_θ^{-1} ψ̇_θᵀ) ∗ M_θ.

In particular, if L_θ has covariance matrix Σ_θ, then the matrix Σ_θ − ψ̇_θ I_θ^{-1} ψ̇_θᵀ is nonnegative definite.

Proof. Apply Theorem 8.3 to conclude that L_θ is the distribution of an equivariant-in-law estimator T in the limit experiment, satisfying (8.7). Next apply Proposition 8.4. ∎
Almost-Everywhere Convolution Theorem
Hodges ' example shows that there is no hope for a nontrivial lower bound for the limit distribution of a standardized estimator sequence Jn ( Tn - 1/J (fJ) ) for every e . It is always
116
Efficiency of Estimators
possible to improve on a given estimator sequence for selected parameters. In this section it is shown that improvement over an N (0, ;pe le- t ;pe T )-distribution can be made on at most a Lebesgue null set of parameters. Thus the possibilities for improvement are very much restricted.
Assume that the experiment (Pe : E 8) is differentiable in quadratic mean (7 .1) at every with nonsingular Fisher information matrix Ie. Let 1/f be differentiable at every Let Tn be an estimator sequence in the experiments (P0 : E 8) such that Jn ( T11 - 1/f converges to a limit distribution Le under every Then there exist probability distributions Me such that for Lebesgue almost every 8.9
Theorem.
e.
e
e
(e))
e
e.
e
.
.
In particular, if Le has covariance matrix I:e, then the matrix I:e - 1/fe le- 1 1/fe T is nonnegative definite for Lebesgue almost every
e.
The theorem follows from the convolution theorem in the preceding section combined with the following remarkable lemma. Any estimator sequence with limit distributions is automatically regular at almost every along a subsequence of { n}.
e
Lemma. Let Tn be estimators in experiments (Pn . e E 8 ) indexed by a measurable subset 8 of iR.k . Assume that the map f--+ Pn, e (A) is measurable for every measurable set A and every n, and that the map f--+ 1/f is measurable. Suppose that there exist distributions Le such thatfor Lebesgue almost every 8.10
e e
:e
(e)
e
Then for every Yn � 0 there exists a subsequence of {n} such that, for Lebesgue almost every h), along the subsequence,
(e,
e0
Assume without loss of generality that 8 IRk ; otherwise, fix some and let Pn,e Pn,eo for every not in 8. Write Tn,e rn (Tn There exists a countable collection :F of uniformly bounded, left- or right-continuous functions f such that weak convergence of a sequence of maps Tn is equivalent to Ef (Tn) � J f d L for every f E :F. t Suppose that for every f there exists a subsequence of {n } along which
Proof.
=
e
=
=
- ljf(e)).
A. 2k - a.e.
(e, h) .
Even in case the subsequence depends on f, we can, by a diagonalization scheme, con struct a subsequence for which this is valid for every f in the countable set :F. Along this subsequence we have the desired convergence. t
For continuous distributions L we can use the indicator functions of cells ( - oo , c] with c ranging over Qk. For general L replace every such indicator by an approximating sequence of continuous functions. Alternatively, see, e.g., Theorem 1 . 12.2 in [ 146] . Also see Lemma 2.25.
8. 7
1 17
Local Asymptotic Minimax Theorem
e
e
Setting gn ( ) = Ee f (Tn,e) and g ( ) = f f dLe , we see that the lemma is proved once we have established the following assertion: Every sequence of bounded, measurable functions gn that converges almost everywhere to a limit g, has a subsequence along which "A 2k - a.e.
(e, h).
We may assume without loss of generality that the function g is integrable; otherwise we first multiply each gn and g with a suitable, fixed, positive, continuous function. It should also be verified that, under our conditions, the functions gn are measurable. Write p for the standard normal density on Rk and Pn for the density of the N(O, I +Y1' /) distribution. By Scheffe's lemma, the sequence Pn converges to p in L 1 . Let 8 and H denote independent standard normal vectors. Then, by the triangle inequality and the dominated convergence theorem, E l gn ( 8 + Yn H) - g ( 8 + Yn H ) I
=
J l gn (U ) - g(u) I Pn (u) du � 0.
Secondly for any fixed continuous and bounded function gc: the sequence E l gc: ( 8 + Yn H ) gc: ( 8 ) I converges to zero as n � oo by the dominated convergence theorem. Thus, by the triangle inequality, we obtain E l g( 8 + Yn H ) - g ( 8 ) l
J l g - gc: l (u) (pn + p) (u) du + o(l) = 2 J l g - gc: l (u) p(u) d u + o(l). :S
Because any measurable integrable function g can be approximated arbitrarily closely in L 1 by continuous functions, the first term on the far right side can be made arbitrarily small by choice of gc: . Thus the left side converges to zero. By combining this with the preceding display, we see that El gn ( 8 + Yn H ) - g( 8) I � 0. In other words, the sequence of functions f-+ gn (e + Yn h) - g(e ) converges to zero in mean and hence in probability, under the standard normal measure. There exists a subsequence along which it converges to zero almost surely. •
(e, h)
*8.7
Local Asymptotic Minimax Theorem
The convolution theorems discussed in the preceding sections are not completely satisfying. The convolution theorem designates a best estimator sequence among the regular estimator sequences, and thus imposes an a priori restriction on the set of permitted estimator se quences. The almost-everywhere convolution theorem imposes no (serious) restriction but yields no information about some parameters, albeit a null set of parameters. This section gives a third attempt to "prove" that the normal N(O, 1/Je Ie- 1 1/Je T )-distribution is the best possible limit. It is based on the minimax criterion and gives a lower bound for the maximum risk over a small neighborhood of a parameter In fact, it bounds the expression
(
e.
lim lim inf sup e,.e v'fi(Tn o--+ 0 n --+oo llfJ' - fJ I <8 E
1/f(e'))).
e.
This is the asymptotic maximum risk over an arbitrarily small neighborhood of The following theorem concerns an even more refined (and smaller) version of the local maxi mum risk.
118
Efficiency of Estimators
Let the experiment (Pe : E 8) be differentiable in quadratic mean at with nonsingular Fisher information matrix Ie . Let 1/1 be diffe rentiable at Let Tn be any estimator sequence in the experiments ( P; : E JRk ). Then for any bowl-shaped loss function £ 8.11
e
Theorem.
e
(7.1)
e.
e
( <J; (e + �))) f e dN(O, {fr0 !0- 1 {fr0 T )
(
s �p l���f ��r Eo +h/ ,/ii e .fii T" -
:>
Here the first supremum is taken over all finite subsets I of ffi.k .
-
We only give the proof under the further assumptions that the sequence ,jn ( Tn is uniformly tight under and that £ is (lower) semicontinuous .t Then Prohorov's theorem shows that every subsequence of {n} has a further subsequence along which the vectors
Proof.
e
'ljJ(e) )
(Jll ( Tn - 'ljJ(e) ), }. L fe (Xi )) e.
7. 2 e
converge in distribution to a limit under By Theorem and Le Cam's third lemma, the sequence ,Jn ( Tn - 1/1 converges in law also under every +h j ,Jn along the subsequence. By differentiability of 1/J, the same is true for the sequence ,Jri (Tn + hj ,jn) , whence is satisfied. By Theorem the distributions Le , h are the distributions of T - ;;;e h under h for a randomized estimator T based on an N(h , lg- 1 )-distributed observation. By Proposition
(e))
(8.2)
-'ljJ(e
8. 3 ,
8. 6 ,
)
J
sup Eh £ ( T - ;;;e h) :::: Eo £ ( ;;;e X) = £ dN(O, ;;;e le- 1 ;;;e T ) . h EIRk It suffices to show that the left side of this display is a lower bound for the left side of the theorem. The complicated construction that defines the asymptotic minimax risk (the lim inf sand wiched between two suprema) requires that we apply the preceding argument to a carefully chosen subsequence. Place the rational vectors in an arbitrary order, and let h consist of the first k vectors in this sequence. Then the left side of the theorem is larger than
(
�)))·
( - 'ljJ (e
inf sup Ee +h/.fo e v�n rn R : = lim lim + k --'>oo n --'>oo h Eh vn There exists a subsequence {nd of {n} such that this expression is equal to
We apply the preceding argument to this subsequence and find a further subsequence along which Tn satisfies For simplicity of notation write this as {n'} rather than with a double subscript. Because £ is nonnegative and lower semicontinuous, the portmanteau lemma gives, for every h,
(8. 2 ).
'�1'1/� Eoc, 1 -Re t
(
(
# T" - o/
(e + J.,))) J e d L o , h :>
See, for example, [146, Chapter 3. 1 1] for the general result, which can be proved along the same lines, but using a compactification device to induce tightness.
8. 8
Shrinkage Estimators
119
Every rational vector h is contained in h for every sufficiently large k. Conclude that R
2:
sup h EQk
J £ dLe,h = hsupEQk Eh£ (T - {pe h ) .
The risk function in the supremum on the right is lower semicontinuous in h, by the continuity of the Gaussian location family and the lower semicontinuity of £. Thus the expression on the right does not change if Ql is replaced by �k . This concludes the proof. •
*8.8
Shrinkage Estimators
The theorems of the preceding sections seem to prove in a variety of ways that the best possible limit distribution is the N(O, {pe ie- 1 {pe T )-distribution. At closer inspection, the situation is more complicated, and to a certain extent optimality remains a matter of taste, asymptotic optimality being no exception. The "optimal" normal limit is the distribution of the estimator {pe X in the normal limit experiment. Because this estimator has several optimality properties, many statisticians consider it best. Nevertheless, one might prefer a Bayes estimator or a shrinkage estimator. With a changed perception of what constitutes "best" in the limit experiment, the meaning of "asymptotically best" changes also. This becomes particularly clear in the example of shrinkage estimators. Example (Shrinkage estimator). Let X 1 , . . . , Xn be a sample from a multivariate normal distribution with mean fJ and covariance the identity matrix. The dimension k of the observations is assumed to be at least This is essential ! Consider the estimator
8.12
3.
Because X n converges in probability to the mean e , the second term in the definition of Tn is 0 p (n - 1 ) if fJ =j:. In that case ,Jfi(Tn - X n) converges in probability to zero, whence the estimator sequence Tn is regular at every e =j:. For e = hj ,Jn, the variable ,JriXn is distributed as a variable X with an N(h , I)-distribution, and for every n the standardized estimator ylri(Tn - hj ,Jfi) is distributed as T - h for
0.
0.
X . II X II 2 This is the Stein shrinkage estimator. Because the distribution of T - h depends on h, the sequence Tn is not regular at e The Stein estimator has the remarkable property that, for every h (see, e.g., [99, p. T(X) = X - (k - 2)
-
0. 300]), =
Eh ii T - h ll 2 < Eh ii X - h ll 2 = k. It follows that, in terms of joint quadratic loss £(x) = ll x 11 2 , the local limit distributions Lo,h of the sequence ,Jri(Tn - hj ylri) under e = hj ,Jn are all better than the N (O, I)-limit distribution of the best regular estimator sequence X n . 0 The example of shrinkage estimators shows that, depending on the optimality criterion, a normal N {pe Ie- 1 {pe T )-limit distribution need not be optimal. In this light, is it reasonable
(0,
1 20
Efficiency of Estimators
to uphold that maximum likelihood estimators are asymptotically optimal? Perhaps not. On the other hand, the possibility of improvement over the N ;pe !8- 1 ;pe T )-limit is restricted in two important ways. First, improvement can be made only on a null set of parameters by Theorem Second, improvement is possible only for special loss functions, and improvement for one loss function necessarily implies worse performance for other loss functions. This follows from the next lemma. Suppose that we require the estimator sequence to be locally asymptotically minimax for a given loss function f in the sense that
(0,
8. 9 .
This is a reasonable requirement, and few statisticians would challenge it. The following lemma shows that for one-dimensional parameters local asymptotic minimaxity for even a single loss function implies regularity. Thus, if it is required that all coordinates of a certain estimator sequence be locally asymptotically minimax for some loss function, then the best regular estimator sequence is optimal without competition.
'ljr(e)
Assume that the experiment Pe E 8) is diffe rentiable in quadratic mean at e with nonsingular Fisher information matrix !8 . Let Vr be a real-valued map k that is differentiable at Then an estimator sequence in the experiments E IR ) can be locally asymptotically minimax at for a bowl-shaped loss function f such that < f x 2 f(x) dN O , ;pe l8- 1 ;pe T ) x ) < oo only if Tn is best regular at
8.13
(7.1)
( :e
Lemma.
e.
0
(
(
(P8n : e
e
e.
We only give the proof under the further assumption that the sequence ,Jn(Tn is uniformly tight under Then by the same arguments as in the proof of Theo rem every subsequence of {n} has a further subsequence along which the sequence y'n ( Tn - Vr + hI y'ri)) converges in distribution under + hI y'n to the distribution Le,h of T - ;pe h under h, for a randomized estimator T based on an N(h , /8- 1 )-distributed observation. Because Tn is locally asymptotically minimax, it follows that
Proof.
'ljr(e) ) 8.11,
e.
(e
e
Thus T is a minimax estimator for ψ̇_θ h in the limit experiment. By Proposition 8.6, T = ψ̇_θ X, whence L_{θ,h} is independent of h. ∎
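Returning to Example 8.12, the strict risk inequality E_h‖T − h‖² < E_h‖X − h‖² = k for the Stein estimator is easy to check numerically. The following sketch (illustrative code, not part of the original text; the dimension, shift values, and replication count are arbitrary choices) estimates the risk by Monte Carlo for k = 5.

```python
import numpy as np

def stein_risk(h, reps=200000, seed=0):
    """Monte Carlo risk E_h ||T - h||^2 of T(X) = X - (k-2) X / ||X||^2, X ~ N(h, I)."""
    rng = np.random.default_rng(seed)
    h = np.asarray(h, dtype=float)
    k = h.size
    x = rng.normal(h, 1.0, size=(reps, k))
    t = x - (k - 2) * x / np.sum(x ** 2, axis=1, keepdims=True)
    return np.mean(np.sum((t - h) ** 2, axis=1))

k = 5
for scale in (0.0, 1.0, 3.0):
    h = scale * np.ones(k)
    print(scale, stein_risk(h))   # always below E_h ||X - h||^2 = k = 5
```

The improvement is largest near h = 0 and vanishes as ‖h‖ grows, matching the discussion of (ir)regularity of the sequence T_n at θ = 0.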
*8.9
Achieving the Bound
If the convolution theorem is taken as the basis for asymptotic optimality, then an estimator sequence is best if it is asymptotically regular with a N(O, ;p8 !8- 1 ;p8 T )-limit distribution. An estimator sequence has this property if and only if the estimator is asymptotically linear in the score function. 8.14
mean
E
Assume that the experiment (Pe : e 8) is differentiable in quadratic (7.1) at e with nonsingular Fisher information matrix !8 . Let be differentiable at Lemma.
Vr
8. 9
Achieving the Bound
e. Let Tn be an estimator sequence in the experiments (P; : ()
121 E
JRk ) such that
Then Tn is best regular estimator for 1/J (()) at (). Conversely, every best regular estimator sequence satisfies this expansion. The sequence 11n. e = n - 1 12 L Ce ( X i ) converges in distribution to a vector !':!.. e with Ie )-distribution. By Theorem the sequence log d P;+h !-fo/ d P; is asymptotically equivalent to h T 11n.e - �h T Ieh. If Tn is asymptotically linear, then ,Jn(Tn - 1/f (())) is asymptotically equivalent to the function .(p0 /0- 1 !':!.. n , e . Apply Slutsky's lemma to find that
Proof. a N (0,
7. 2
(� (Tn - 1/f(()) ) , log dP��:Jn ) !!.. ( .(pe /0- 1 !':!..e , h T !':!..e - � h T Ieh ) ""--' N
(( - 1 h0T Ieh) ( .(p01/JehJ0- 1 {p0T T 2
0
The limit distribution of the sequence ,Jn(Tn - 1/f (())) under () + h/ ,Jn follows by Le Cam's third lemma, Example 6. and is normal with mean {p0 h and covariance matrix {p0 I0- 1 .(p0 T . Combining this with the differentiability of 1/J , we obtain that Tn is regular. Next suppose that Sn and Tn are both best regular estimator sequences. By the same arguments as in the proof of Theorem it can be shown that, at least along subsequences, the joint estimators ( Sn , Tn) for ( 1/J (()), 1/J (() ) ) satisfy for every h
7,
8.11
for a randomized estimator (S, T) in the normal-limit experiment. Because Sn and Tn are best regular, the estimators S and T are best equivariant-in-law. Thus S = T = .(p0 X almost surely by Proposition 8.6, whence ,Jfi(Sn - Tn) converges in distribution to S - T = Thus every two best regular estimator sequences are asymptotically equivalent. The second assertion of the lemma follows on applying this to Tn and the estimators
0.
Sn = 1/f (()) +
1 1/Je le- 1 11n,O ·
,Jn
°
Because the parameter () is known in the local experiments (P;+h / Jn : h E lRk ), this indeed defines an estimator sequence within the present context. It is best regular by the first part of the lemma. • Under regularity conditions, for instance those of Theorem hood estimator Bn in a parametric model satisfies
5. 3 9, the maximum likeli
Then the maximum likelihood estimator is asymptotically optimal for estimating () in terms of the convolution theorem. By the delta method, the estimator 1/f (Bn) for 1/J (() ) can be seen
1 22
Efficiency of Estimators
to be asymptotically linear as in the preceding theorem, so that it is asymptotically regular and optimal as well. Actually, regular and asymptotically optimal estimators for e exist in every parametric model (Pe : e E 8) that is differentiable in quadratic mean with nonsingular Fisher infor mation throughout 8, provided the parameter e is identifiable. This can be shown using the discretized one-step method discussed in section (see
5. 7 [93]).
*8.10
Large Deviations
Consistency of an estimator sequence T_n entails that the probability of the event d(T_n, ψ(θ)) > ε tends to zero under θ, for every ε > 0. This is a very weak requirement. One method to strengthen it is to make ε dependent on n and to require that the probabilities P_θ(d(T_n, ψ(θ)) > ε_n) converge to 0, or are bounded away from 1, for a given sequence ε_n → 0. The results of the preceding sections address this question and give very precise lower bounds for these probabilities using an "optimal" rate ε_n = r_n^{-1}, typically n^{-1/2}.
Another method of strengthening the consistency is to study the speed at which the probabilities Pe (d ( Tn , 1/f (e) ) > c ) converge to for a fixed E > This method appears to be of less importance but is of some interest. Typically, the speed of convergence is exponential, and there is a precise lower bound for the exponential rate in terms of the Kullback-Leibler information. We consider the situation that Tn is based on a random sample of size n from a distribution Pe , indexed by a parameter e ranging over an arbitrary set 8. We wish to estimate the value of a function 1/J : 8 f-+ ][J) that takes its values in a metric space.
0
8.15
Theorem.
0.
Suppose that the estimator sequence Tn is consistentfor 1/J (e) under every
0
e. Then, for every E > and every eo,
(
lim sup - ..!_ log Pe0 d ( Tn , 1/J (eo) ) > c n --+co n
)
:S
inf
e : d(l/f( e ) , l/f (eo)) >c
- Pe log
P eo _ Pe
If the right side is infinite, then there is nothing to prove. The Kullback-Leibler information - Pe log P eo I Pe can be finite only if Pe « Pea . Hence, it suffices to prove that - Pe log P eo I Pe is an upper bound for the left side for every e such that Pe « Pea and d ( 1/f (e ) , 1/f (eo) ) > E . The variable An = (n - 1 ) I: 7= 1 log( p e i Pea ) ( X i ) is well defined (possibly - oo ) For every constant M,
Proof
.
(
Pe0 d ( Tn , 1/f (eo) ) > c
)
:=::. :=::. ::::_
An < M ) ( Ee l { d ( Tn , 1/f (eo) ) > An < M } e -n An e - n M Pe (d ( Tn , 1/f (eo) ) > An < M) . Pea d ( Tn , 1/J (eo) ) >
E,
E,
E,
Take logarithms and multiply by - ( l i n ) to conclude that
-� log Pea (d (Tn , 1/J (eo)) >
E
)
:S
M
- � log Pe (d ( Tn , 1/J (eo) ) > 1
E,
)
An < M .
For M > Pe log pe i Pe0 , we have that Pe (An < M) -+ by the law of large numbers . Furthermore, by the consistency of Tn for 1/J (e ) , the probability Pe ( d ( Tn , 1/f (eo) ) > c )
Problems
123
converges to 1 for every θ such that d(ψ(θ), ψ(θ₀)) > ε. Conclude that the probability in the right side of the preceding display converges to 1, whence the lim sup of the left side is bounded by M. ∎
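As a sanity check on the Kullback–Leibler bound (an illustrative computation, not part of the original text), take the N(θ, 1) location model with T_n = X̄_n, ψ(θ) = θ, θ₀ = 0 and ε = 1. The bound on the right side is inf_{|θ|>1} θ²/2 = 1/2, and the exact exponent −n⁻¹ log P₀(|X̄_n| > 1) converges to this value.

```python
import numpy as np
from scipy.stats import norm

theta0, eps = 0.0, 1.0
kl_bound = eps ** 2 / 2      # inf over |theta - theta0| > eps of KL(P_theta || P_theta0)

for n in (10, 100, 1000, 10000):
    # log P_0(|mean of n N(0,1) draws| > eps), computed exactly and stably
    logp = np.log(2) + norm.logsf(eps * np.sqrt(n))
    rate = -logp / n
    print(n, rate, kl_bound)   # the exponent approaches the KL bound as n grows
```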
Notes
Chapter 32 of the famous book by Cramer [27] gives a rigorous proof of what we now know as the Cramer-Rao inequality and next goes on to define the asymptotic efficiency of an estimator as the quotient of the inverse Fisher information and the asymptotic variance. Cramer defines an estimator as asymptotically efficient if its efficiency (the quotient men tioned previously) equals one. These definitions lead to the conclusion that the method of maximum likelihood produces asymptotically efficient estimators, as already conjectured by Fisher [48, 50] in the 1 920s. That there is a conceptual hole in the definitions was clearly realized in 195 1 when Hodges produced his example of a superefficient estimator. Not long after this, in 1 953, Le Cam proved that superefficiency can occur only on a Lebesgue null set. Our present result, almost without regularity conditions, is based on later work by Le Cam (see [95] .) The asymptotic convolution and minimax theorems were obtained in the present form by Hajek in [69] and [70] after initial work by many authors . Our present proofs follow the approach based on limit experiments, initiated by Le Cam in [95].
PROBLEMS 1. Calculate the asymptotic relative efficiency of the sample mean and the sample median for estimating based on a sample of size n from the normal N 1 ) distribution.
e,
(e,
2. As the previous problem, but now for the Laplace distribution (density p (x )
=
! e - l x l ).
3. Consider estimating the distribution function P(X ::::: x ) at a fixed point x based o n a sample X , . . . , X11 from the distribution of X . The "nonparametric" estimator is n - 1 #(Xi ::::: x ) . If it 1 is known that the true underlying distribution is normal N ( , 1), another possible estimator is (x X ) . Calculate the relative efficiency o f these estimators .
e
-
4. Calculate the relative efficiency of the empirical p-quantile and the estimator - 1 (p) +X for the estimating the p-th quantile of the distribution of a sample from the normal N (f.L , 0" 2 ) distribution. 5. Consider estimating the population variance by either the sample variance 2 (which is unbiased) or else n - 1 L7= (Xi - :X)2 = (n - 1 ) / n S 2 . Calculate the asymptotic relative efficiency. 1 6. Calculate the asymptotic relative efficiency of the sample standard deviation and the interquartile range (corrected for unbiasedness) for estimating the standard deviation based on a sample of size n from the normal N (/L , 0" 2 ) -distribution.
Sn
n
S
e],
en )
7. Given a sample of size n from the uniform distribution on [0, the maximum X of the observations is biased downwards . Because Ee (e - X Ee X ( l ) • the bias can be removed by adding the minimum of the observations. Is X ( 1 ) + X a good estimator for e from an asymptotic point of view?
en ) ) = en )
S11 based on the mean of a sample from the N (e , I )-distribution. Jn(Sn - en)� - oo, if en ---+ 0 in such a way that n 1 14en ---+ 0 and n 1 12 en ---+ oo. Show that Sn is not regular at e = 0.
8. Consider the Hodges estimator (i) Show that
(ii)
124
Efficiency of Estimators (iii) Show that SUP-8 <e <8 Pe sufficiently slowly.
9. Show that a loss function � for a nondecreasing function
f:
(
Jni Sn
r--+
fo.
10. Show that a function of the form
-
eI
>
kn)
--+
1 for every
kn
that converges to infinity
� is bowl-shaped if and only if it has the form
f(x) = f o (lx I)
f(x) = fo (llx II) for a nondecreasing function fo is bowl-shaped .
11. Prove Anderson's lemma for the one-dimensional case, for instance b y calculating the derivative of + h) dN (0, Does the proof generalize to higher dimensions ?
J f(x
1)(x). Lemma 8.1 3 imply
12. What does about the coordinates of the S tein estimator. Are they good estimators of the coordinates of the expectaction vector?
13. All results in this chapter extend in a straightforward manner to general locally asymptotically normal models. Formulate Theorem and Lemma for such models.
8.9
8. 14
9 Limits of Experiments
A sequence of experiments is defined to converge to a limit experiment if
the sequence of likelihood ratio processes converges marginally in dis tribution to the likelihood ratio process of the limit experiment. A limit experiment serves as an approximation for the converging sequence of experiments. This generalizes the convergence of locally asymptotically normal sequences of experiments considered in Chapter 7. Several ex amples of nonnormal limit experiments are discussed.
9. 1
Introduction
This chapter introduces a notion of convergence of statistical models or "experiments" to a limit experiment. In this notion a sequence of models, rather than just a sequence of estimators or tests, converges to a limit. The limit experiment serves two purposes. First, it provides an absolute standard for what can be achieved asymptotically by a sequence of tests or estimators, in the form of a "lower bound": No sequence of statistical procedures can be asymptotically better than the "best" procedure in the limit experiment. For instance, the best limiting power function is the best power function in the limit experiment; a best sequence of estimators converges to a best estimator in the limit experiment. Statements of this type are true irrespective of the precise meaning of "best." A second purpose of a limit experiment is to explain the asymptotic behaviour of sequences of statistical pro cedures. For instance, the asymptotic normality or (in)efficiency of maximum likelihood estimators. Many sequences of experiments converge to normal limit experiments. In particular, the local experiments in a given locally asymptotically normal sequence of experiments, as considered in Chapter converge to a normal location experiment. The asymptotic representation theorem given in the present chapter is therefore a generalization of Theo rem (for the LAN case) to the general situation. The importance of the general concept is illustrated by several examples of non-Gaussian limit experiments. In the present context it is customary to speak of "experiment" rather than model, al though these terms are interchangeable. Formally an experiment is a measurable space (X, A) , the sample space, equipped with a collection of probability measures (Ph : h E H). The set of probability measures serves as a statistical model for the observation, written as X. In this chapter the parameter is denoted by h (and not 8), because the results are typi cally applied to "local" parameters (such as h = 80)). The experiment is denoted
7,
7.10
Jfi(e
-
by (X, A, Ph : h E H) and, if there can be no misunderstanding about the sample space, also by (Ph : h E H) . Given a fixed parameter h0 E H, the likelihood ratio process with base h0 is formed as
Each likelihood ratio process is a (typically infinite-dimensional) vector of random variables dPh/dPho (X) . According to the results of section the right side of the display is Ph0 almost surely the same for any given densities Ph and Pho with respect to any measure J.L. Because we are interested only in the laws under Pho of finite subvectors of the likelihood processes, the nonuniqueness is best left unresolved.
6.1,
Definition. A sequence En = (Xn , An , Pn,h : h E H) of experiments converges to a limit experiment £ = (X, A, Ph : h E H) if, for every finite subset l C H and every ho E H,
9.1
dPn ,h ( dPn,h ( Xn) ) o
h E!
�
dPh ( dPh (X) ) o
h E!
.
The objects in this display are random vectors of length I I 1 . The requirement is that each of these vectors converges in law, under the assumption that h0 is the true parameter, in the ordinary sense of convergence in distribution in Tit! . This type of convergence is sometimes called marginal weak convergence: The finite-dimensional marginal distributions of the likelihood processes converge in distribution to the corresponding marginals in the limit experiment. Because a weak limit of a sequence of random vectors is unique, the marginal distributions of the likelihood ratio process of a limit experiment are unique. The limit experiment itself is not unique; even its sample space is not uniquely determined. This causes no problems. Two experiments of which the likelihood ratio processes are equal in marginal distributions are called equivalent or of the same type. Many examples of equivalent experiments arise through sufficiency. 9.2 Example (Equivalence by sufficiency). Let S : X r-+ Y be a statistic in the statistical experiment (X, A, Ph : h E H) with values in the measurable space (Y, B) . The experiment of image laws (Y, B, Ph s - 1 : h E H) corresponds to observing S. If S is a sufficient statistic, then this experiment is equivalent to the original experiment (X, A, Ph : h E H) . This may be proved using the Neyman factorization criterion of sufficiency. This shows that there exist measurable functions gh and f such that Ph (x) = g h ( S(x ) ) f (x ) , so that the likelihood ratio Ph/ Ph o (X) is the function gh/ gho (S) of S. The likelihood ratios of the measures Ph s - 1 take the same form. Consequently, if (Ph : h E H) is a limit experiment, then so is (Ph s- 1 : h E H). A very simple example that we encounter frequently is as follows: For a given invertible matrix J the experiments ( N ( J h , J) : h E IRd ) and ( N (h, J - 1 ) : h E IRd ) are equivalent. D o
o
o
9.2
Asymptotic Representation Theorem
In this section it is shown that a limit experiment is always statistically easier than a given sequence. Suppose that a sequence of statistical problems involves experiments
9. 3
1 27
Asymptotic Normality
En = (Pn,h : h E H) and statistics Tn . For instance, the statistics are test statistics for testing certain hypotheses concerning the parameter h, or estimators of some function of h. Most of the quality measures of the procedures based on the statistics Tn can be expressed in their laws under the different parameters. For simplicity we assume that the sequence of statistics Tn converges under a given parameter h in distribution to a limit Lh , for every parameter h. Then the asymptotic quality of the sequence Tn may be judged from the set of limit laws {Lh : h E H}. According to the following theorem the only possible sets of limit laws are the laws of randomized statistics in the limit experiment: Every weakly converging se quence of statistics converges to a statistic in the limit experiment. One consequence is that asymptotically no sequence of statistical procedures can be better than the best procedure in the limit experiment. This is true for every meaning of "good" that is expressible in terms of laws. In this way the limit experiment obtains the character of an asymptotic lower bound. We assume that the limit experiment E = (Ph : h E H) is dominated: This requires the existence of a O"-finite measure J.L such that Ph « J.L for every h. Recall that a randomized statistic T in the experiment (X, A, Ph : h E H) with values in IRk is a measurable map T : X x [ , 1] f-+ IRk for the product O"-field A x B orel sets on the space X x [ , 1]. Its law under h is to be computed under the product measure Ph x uniform[O, 1 ] .
0
0
Let En = (Xn , An , Pn,h : h E H) be a sequence of experiments that conver ges to a dominated experiment E = (X, A, Ph : h E H). Let Tn be a sequence of statistics in En that converges in distribution for every h. Then there exists a randomized statistic T in E such that Tn � T for every h. 9.3
Theorem.
The proof of the theorem starting from the definition of convergence of experi ments is long and can best be broken up into parts of independent interest. This goes beyond the scope of this book. The proof for the case of local asymptotic normal sequences of experiments is given in Chapter (It is shown in Theorem 9.4 that such a sequence of experiments converges to a Gaussian location experiment.) Many other examples can be treated by the same method of proof.t •
Proof
7.
9.3
Asymptotic Normality
7
As in much of statistics, normal limits are of prime importance. In Chapter a sequence of statistical models (Pn,e : e E 8) indexed by an open subset 8 c IRd is defined to be locally asymptotically normal at e if the log likelihood ratios log dPn,e + r;l hn / dPn,e allow a certain quadratic expansion. This is shown to be valid in the case that Pn,e is the distribution of a sample of size n from a smooth parametric model. Such experiments converge to simple normal limit experiments if they are reparametrized in terms of the "local parameter" h. This follows from the following theorem. Theorem. Let En = (Pn,h : h E H) be a sequence of experiments indexed by a subset H d IR of (with E H) such that 9.4
0
log t
dPn h ' = h T fj,n n, O
dP
--
-
1
-h T l h + Opn . (1),
2
For a proof of the general theorem see, for instance, [ 141].
0
1 28
Limits of Experiments
0
for a sequence of statistics � n that converges weakly under h = to a N Then the sequence En converges to the experiment (N(Jh, I) : h E H).
(0, J) -distribution.
The log likelihood ratio process with base h0 for the normal experiment has coordinates
Proof
log
1 1 d N ( J h ' I) (X) = (h - h0) T X - -h T Ih + -h 0T Ih0 . dN(Jh0 , I)
2
2
If I is nonsingular, then this follows by simple algebra, because the left side is the quotient of two normal densities. The case that I is singular perhaps requires some thought. By the assumption combined with Slutsky's lemma, the sequence log Pn , h I P n , o is under h = asymptotically normal with mean - �h T Ih and variance h T I h). This implies con tiguity of the sequences of measures Pn ,h and Pn , o for every h, by Example Therefore, the probability of the set on which one of P n , o , Pn , h , or P n , h o is zero converges to zero. Outside this set we can write
0
6. 5 .
1
n
n
Pn , h o
P n ,O
--
P ,h P ,h og -= log - log P
n , ho
Pn , O
.
Because this is true with probability tending to 1 , the difference between the left and the right sides converges to zero in probability. Apply the (local) asymptotic normality assumption twice to obtain that 1 1 Pn h - = (h - ho) T �n - -h T Ih + -h 0T I h o + Op ( 1 ) . log -'
2
Pn ,h o
2
ll ,
h0
On comparing this to the expression for the normal likelihood ratio process, we see that it suffices to show that the sequence �n converges under ho in law to X: In that case the vector (P n , h i P n , h o h E I converges in distribution to (dN(Jh, I)ldN(O, I) ( X ) ) h E J ' by Slutsky's lemma and the continuous-mapping theorem. By assumption, the sequence (�11 , h'5 �11) converges in distribution under h = to a vector (� , h'5 �), where � is N(O, I) -distributed. By local asymptotic normality and Slutsky's lemma, the sequence of vectors (�n , log P n ,ho I P n , o) converges to the vector (� , h'5 � - �h6 Iho) . In other words
0
( �n , log -- ) -v--+0 N( ( _ l h0T ih0 ) ' ( h TII Pn , h o Pn ,O
2 0
0
Iho h'5 Iho
) )·
6. 5 ,
By the Gaussian form of Le Cam's third lemma, Example the sequence �n converges in distribution under h0 to a N(Jh0 , I)-distribution. This is equal to the distribution of X under h0 . •
9.5 Corollary. Let $\Theta$ be an open subset of $\mathbb{R}^d$, and let the sequence of statistical models $(P_{n,\theta}: \theta \in \Theta)$ be locally asymptotically normal at $\theta$ with norming matrices $r_n$ and a nonsingular matrix $I_\theta$. Then the sequence of experiments $(P_{n,\theta + r_n^{-1}h}: h \in \mathbb{R}^d)$ converges to the experiment $(N(h, I_\theta^{-1}): h \in \mathbb{R}^d)$.

9.4 Uniform Distribution

The model consisting of the uniform distributions on $[0, \theta]$ is not differentiable in quadratic mean (see Example 7.9). In this case an asymptotically normal approximation is impossible. Instead, we have convergence to an exponential experiment.
9.6 Theorem. Let $P_\theta^n$ be the distribution of a random sample of size $n$ from a uniform distribution on $[0, \theta]$. Then the sequence of experiments $(P^n_{\theta - h/n}: h \in \mathbb{R})$ converges for each fixed $\theta > 0$ to the experiment consisting of observing one observation from the shifted exponential density $z \mapsto e^{-(z-h)/\theta}\,1\{z > h\}/\theta$.†

† Define $P_\theta$ arbitrarily for $\theta < 0$.

Proof. If $Z$ is distributed according to the given exponential density, then

$$\frac{dP_h}{dP_{h_0}}(Z) = \frac{e^{-(Z-h)/\theta}\,1\{Z > h\}/\theta}{e^{-(Z-h_0)/\theta}\,1\{Z > h_0\}/\theta} = e^{(h - h_0)/\theta}\,1\{Z > h\},$$

almost surely under $h_0$, because the indicator $1\{Z > h_0\}$ in the denominator equals 1 almost surely if $h_0$ is the true parameter. The joint density of a random sample $X_1, \ldots, X_n$ from the uniform $[0, \theta]$ distribution can be written in the form $(1/\theta)^n\,1\{X_{(n)} \le \theta\}$. The likelihood ratios take the form

$$\frac{dP^n_{\theta - h/n}}{dP^n_{\theta - h_0/n}}(X_1, \ldots, X_n) = \frac{(\theta - h_0/n)^n\,1\{X_{(n)} \le \theta - h/n\}}{(\theta - h/n)^n\,1\{X_{(n)} \le \theta - h_0/n\}}.$$

Under the parameter $\theta - h_0/n$, the maximum of the observations is certainly bounded above by $\theta - h_0/n$ and the indicator in the denominator equals 1. Thus, with probability 1 under $\theta - h_0/n$, the likelihood ratio in the preceding display can be written

$$\bigl(e^{(h - h_0)/\theta} + o(1)\bigr)\,1\{-n(X_{(n)} - \theta) \ge h\}.$$
By direct calculation, $-n(X_{(n)} - \theta) \rightsquigarrow Z$. By the continuous-mapping theorem and Slutsky's lemma, the sequence of likelihood processes converges under $\theta - h_0/n$ marginally in distribution to the likelihood process of the exponential experiment. ∎

Along the same lines it may be proved that in the case of uniform distributions with both endpoints unknown a limit experiment based on observation of two independent exponential variables pertains. These types of experiments are completely determined by the discontinuities of the underlying densities at their left and right endpoints. It can be shown more generally that exponential limit experiments are obtained for any densities that have jumps at one or both of their endpoints and are smooth in between. For densities with discontinuities in the middle, or weaker singularities, other limit experiments pertain.

The convergence to a limit experiment combined with the asymptotic representation theorem, Theorem 9.3, allows one to obtain asymptotic lower bounds for sequences of estimators, much as in the locally asymptotically normal case in Chapter 8. We give only one concrete statement.
9.7 Corollary. Let $T_n$ be estimators based on a sample $X_1, \ldots, X_n$ from the uniform distribution on $[0, \theta]$ such that the sequence $n(T_n - \theta)$ converges under $\theta$ in distribution to a limit $L_\theta$, for every $\theta$. Then for Lebesgue almost-every $\theta$ we have $\int |x|\,dL_\theta(x) \ge \mathrm{E}|Z - \operatorname{med} Z|$ and $\int x^2\,dL_\theta(x) \ge \mathrm{E}(Z - \mathrm{E}Z)^2$ for the random variable $Z$ exponentially distributed with mean $\theta$.

Proof (Sketch). By Lemma 8.10, the estimator sequence $T_n$ is automatically almost regular in the sense that $n(T_n - \theta + h/n)$ converges under $\theta - h/n$ in distribution to $L_\theta$ for Lebesgue almost every $\theta$ and $h$, at least along a subsequence. Thus, it is matched in the limit experiment by an equivariant-in-law estimator for almost every $\theta$. More precisely, for almost every $\theta$ there exists a randomized statistic $T_\theta$ such that the law of $T_\theta(Z + h, U) - h$ does not depend on $h$ (if $Z$ is exponentially distributed with mean $\theta$). By classical statistical decision theory the given lower bounds are the (constant) risks of the best equivariant-in-law estimators in the exponential limit experiment in terms of absolute-error and mean-square-error loss functions, respectively. ∎
In view of this corollary, the maximum likelihood estimator $X_{(n)}$ is asymptotically inefficient. This is not surprising given its downward bias, but it is encouraging for the present approach that the small bias, which is of the order $1/n$, is visible in the "first-order" asymptotics. The bias can be corrected by a multiplicative factor, which, unfortunately, must depend on the loss function. The sequences of estimators

$$\frac{n + \log 2}{n}\,X_{(n)} \qquad\text{and}\qquad \frac{n+1}{n}\,X_{(n)}$$

are asymptotically efficient in terms of absolute-value and quadratic loss, respectively.
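A small simulation sketch makes the comparison concrete for quadratic loss (the parameter value, sample size, and number of replications below are arbitrary choices for illustration). It contrasts the limiting quadratic risk of $n(X_{(n)} - \theta)$ with that of the corrected estimator $\tfrac{n+1}{n}X_{(n)}$, which attains the bound $\mathrm{E}(Z - \mathrm{E}Z)^2 = \theta^2$ of Corollary 9.7.

import numpy as np

# Illustrative sketch: quadratic risk of n*(T_n - theta) for the MLE X_(n)
# and for the bias-corrected estimator (n+1)/n * X_(n) in the U[0, theta] model.
rng = np.random.default_rng(1)
theta, n, reps = 2.0, 500, 200_000
# The maximum of n i.i.d. U[0, theta] variables has the same law as theta * U^(1/n).
x_max = theta * rng.random(reps) ** (1.0 / n)

risk_mle = (n * (x_max - theta)) ** 2
risk_corr = (n * ((n + 1) / n * x_max - theta)) ** 2
print(risk_mle.mean())    # approximately 2 * theta^2 = 8: inefficient
print(risk_corr.mean())   # approximately theta^2 = 4: the bound of Corollary 9.7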
9.5 Pareto Distribution

The Pareto distributions are a two-parameter family of distributions on the real line with parameters $\alpha > 0$ and $\mu > 0$ and density

$$p_{\alpha,\mu}(x) = \frac{\alpha \mu^\alpha}{x^{\alpha+1}}\,1\{x > \mu\}.$$

This density is smooth in $\alpha$, but it resembles a uniform distribution as discussed in the preceding section in its dependence on $\mu$. The limit experiment consists of a combination of a normal experiment and an exponential experiment. The likelihood ratio for a sample of size $n$ from the Pareto distributions with parameters $(\alpha + g/\sqrt{n}, \mu + h/n)$ and $(\alpha + g_0/\sqrt{n}, \mu + h_0/n)$, respectively, is equal to

$$\prod_{i=1}^n \frac{(\alpha + g/\sqrt{n})\,(\mu + h/n)^{\alpha + g/\sqrt{n}}\,X_i^{-(\alpha + g/\sqrt{n}) - 1}\,1\{X_i > \mu + h/n\}}{(\alpha + g_0/\sqrt{n})\,(\mu + h_0/n)^{\alpha + g_0/\sqrt{n}}\,X_i^{-(\alpha + g_0/\sqrt{n}) - 1}\,1\{X_i > \mu + h_0/n\}}.$$

Here, under the parameters $(\alpha + g_0/\sqrt{n}, \mu + h_0/n)$, the sequence

$$\Delta_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n \Bigl(\frac{1}{\alpha} - \log\frac{X_i}{\mu}\Bigr)$$

converges weakly to a normal distribution with mean $g_0/\alpha^2$ and variance $1/\alpha^2$; and the sequence $Z_n = n(X_{(1)} - \mu)$ converges in distribution to the (shifted) exponential distribution with mean $\mu/\alpha + h_0$ and variance $(\mu/\alpha)^2$. The two sequences are asymptotically independent. Thus the likelihood is a product of a locally asymptotically normal and a "locally asymptotically exponential" factor. The local limit experiment consists of observing a pair $(\Delta, Z)$ of independent variables $\Delta$ and $Z$ with a $N(g, \alpha^2)$-distribution and an $\exp(\alpha/\mu) + h$-distribution, respectively.

The maximum likelihood estimators for the parameters $\alpha$ and $\mu$ are given by

$$\hat\alpha_n = \frac{n}{\sum_{i=1}^n \log(X_i/X_{(1)})} \qquad\text{and}\qquad \hat\mu_n = X_{(1)}.$$

The sequence $\sqrt{n}(\hat\alpha_n - \alpha - g/\sqrt{n})$ converges in distribution under the parameters $(\alpha + g/\sqrt{n}, \mu + h/n)$ to the variable $\Delta - g$. Because the distribution of $Z$ does not depend on $g$, and $\Delta$ follows a normal location model, the variable $\Delta$ can be considered an optimal estimator for $g$ based on the observation $(\Delta, Z)$. This optimality is carried over into the asymptotic optimality of the maximum likelihood estimator $\hat\alpha_n$. A precise formulation could be given in terms of a convolution or a minimax theorem. On the other hand, the maximum likelihood estimator for $\mu$ is asymptotically inefficient. Because the sequence $n(\hat\mu_n - \mu - h/n)$ converges in distribution to $Z - h$, the estimators $\hat\mu_n$ are asymptotically biased upwards.
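The mixed normal/exponential character of the limit is easy to visualize by simulation. The sketch below (parameter values arbitrary, chosen only for illustration) draws Pareto samples and examines $\sqrt{n}(\hat\alpha_n - \alpha)$ and $n(\hat\mu_n - \mu)$: the first is approximately $N(0, \alpha^2)$, the second approximately exponential with mean $\mu/\alpha$.

import numpy as np

# Illustrative sketch: limit behaviour of the Pareto maximum likelihood estimators.
rng = np.random.default_rng(2)
alpha, mu, n, reps = 3.0, 1.0, 1_000, 10_000

# Pareto(alpha, mu) variables can be generated as X = mu * U**(-1/alpha), U uniform.
x = mu * rng.random((reps, n)) ** (-1.0 / alpha)
mu_hat = x.min(axis=1)
alpha_hat = n / np.log(x / mu_hat[:, None]).sum(axis=1)

a_stat = np.sqrt(n) * (alpha_hat - alpha)   # approximately N(0, alpha^2)
m_stat = n * (mu_hat - mu)                  # approximately exponential with mean mu/alpha
print(a_stat.mean(), a_stat.std())          # near 0 and near alpha = 3
print(m_stat.mean(), mu / alpha)            # both near 1/3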
9.6 Asymptotic Mixed Normality
The likelihood ratios of some models allow an approximation by a two-term Taylor expansion without the linear term being asymptotically normal and the quadratic term being deterministic. Then a generalization of local asymptotic normality is possible. In the most important example of this situation, the linear term is asymptotically distributed as a mixture of normal distributions. A sequence of experiments $(P_{n,\theta}: \theta \in \Theta)$ indexed by an open subset $\Theta$ of $\mathbb{R}^d$ is called locally asymptotically mixed normal at $\theta$ if there exist matrices $\gamma_{n,\theta} \to 0$ such that

$$\log \frac{dP_{n,\theta + \gamma_{n,\theta}h_n}}{dP_{n,\theta}} = h^T \Delta_{n,\theta} - \tfrac12 h^T J_{n,\theta} h + o_{P_{n,\theta}}(1),$$

for every converging sequence $h_n \to h$, and random vectors $\Delta_{n,\theta}$ and random matrices $J_{n,\theta}$ such that $(\Delta_{n,\theta}, J_{n,\theta}) \rightsquigarrow (\Delta_\theta, J_\theta)$ for a random vector $(\Delta_\theta, J_\theta)$ such that the conditional distribution of $\Delta_\theta$ given that $J_\theta = J$ is normal $N(0, J)$.

Locally asymptotically mixed normal is often abbreviated to LAMN. Locally asymptotically normal, or LAN, is the special case in which the matrix $J_\theta$ is deterministic. Sequences of experiments whose likelihood ratios allow a quadratic approximation as in the preceding display (but without the specific limit distribution of $(\Delta_{n,\theta}, J_{n,\theta})$) and that are such that $P_{n,\theta + \gamma_{n,\theta}h}$ and $P_{n,\theta}$ are mutually contiguous are called locally asymptotically quadratic, or LAQ. We note that LAQ or LAMN requires much more than the mere existence of two derivatives of the likelihood: There is no reason why, in general, the remainder would be negligible.

9.8 Theorem. Assume that the sequence of experiments $(P_{n,\theta}: \theta \in \Theta)$ is locally asymptotically mixed normal at $\theta$. Then the sequence of experiments $(P_{n,\theta + \gamma_{n,\theta}h}: h \in \mathbb{R}^d)$ converges to the experiment consisting of observing a pair $(\Delta, J)$ such that $J$ is marginally distributed as $J_\theta$ for every $h$ and the conditional distribution of $\Delta$ given $J$ is normal $N(Jh, J)$.
Proof. Write $P_{\theta,h}$ for the distribution of $(\Delta, J)$ under $h$. Because the marginal distribution of $J$ does not depend on $h$ and the conditional distribution of $\Delta$ given $J$ is Gaussian,

$$\frac{dP_{\theta,h}}{dP_{\theta,0}}(\Delta, J) = \exp\bigl( h^T \Delta - \tfrac12 h^T J h \bigr).$$

By Slutsky's lemma and the assumptions, the sequence $dP_{n,\theta+\gamma_{n,\theta}h}/dP_{n,\theta}$ converges under $\theta$ in distribution to $\exp(h^T \Delta_\theta - \tfrac12 h^T J_\theta h)$. Because the latter variable has mean one, it follows that the sequences of distributions $P_{n,\theta+\gamma_{n,\theta}h}$ and $P_{n,\theta}$ are mutually contiguous. In particular, the probability under $\theta$ that $dP_{n,\theta+\gamma_{n,\theta}h}$ is zero converges to zero for every $h$, so that

$$\log \frac{dP_{n,\theta+\gamma_{n,\theta}h}}{dP_{n,\theta+\gamma_{n,\theta}h_0}} = \log \frac{dP_{n,\theta+\gamma_{n,\theta}h}}{dP_{n,\theta}} - \log \frac{dP_{n,\theta+\gamma_{n,\theta}h_0}}{dP_{n,\theta}} + o_{P_{n,\theta}}(1).$$

Conclude that it suffices to show that the sequence $(\Delta_{n,\theta}, J_{n,\theta})$ converges under $\theta + \gamma_{n,\theta}h_0$ to the distribution of $(\Delta, J)$ under $h_0$. Using the general form of Le Cam's third lemma we obtain that the limit distribution of the sequence $(\Delta_{n,\theta}, J_{n,\theta})$ under $\theta + \gamma_{n,\theta}h$ takes the form

$$B \mapsto \mathrm{E}\,1_B(\Delta_\theta, J_\theta)\,e^{h^T \Delta_\theta - \frac12 h^T J_\theta h}.$$

On noting that the distribution of $(\Delta, J)$ under $h = 0$ is the same as the distribution of $(\Delta_\theta, J_\theta)$, we see that this is equal to $\mathrm{E}_0\,1_B(\Delta, J)\,dP_{\theta,h}/dP_{\theta,0}(\Delta, J) = P_h\bigl((\Delta, J) \in B\bigr)$. ∎
It is possible to develop a theory of asymptotic "lower bounds" for LAMN models, much as is done for LAN models in Chapter 8. Because conditionally on the ancillary statistic $J$, the limit experiment is a Gaussian shift experiment, the lower bounds take the form of mixtures of the lower bounds for the LAN case. We give only one example, leaving the details to the reader.

9.9 Corollary. Let $T_n$ be an estimator sequence in a LAMN sequence of experiments $(P_{n,\theta}: \theta \in \Theta)$ such that $\gamma_{n,\theta}^{-1}\bigl(T_n - \psi(\theta + \gamma_{n,\theta}h)\bigr)$ converges weakly under every $\theta + \gamma_{n,\theta}h$ to a limit distribution $L_\theta$, for every $h$. Then there exist probability distributions $M_J$ (or rather a Markov kernel) such that $L_\theta = \mathrm{E}\,N(0, \dot\psi_\theta J_\theta^{-1} \dot\psi_\theta^T) * M_{J_\theta}$. In particular, $\operatorname{cov}_\theta L_\theta \ge \mathrm{E}\,\dot\psi_\theta J_\theta^{-1} \dot\psi_\theta^T$.
We include two examples to give some idea of the application of local asymptotic mixed normality. In both examples the sequence of models is LAMN rather than LAN due to an explosive growth of information, occurring at certain supercritical parameter values. The second derivative of the log likelihood, the information, remains random. In both examples there is also (approximate) Gaussianity present in every single observation. This appears to be typical, unlike the situation with LAN, in which the normality results from sums over (approximately) independent observations. In explosive models of this type the likelihood is dominated by a few observations, and normality cannot be brought in through (martingale) central limit theorems.

9.10 Example (Branching processes). In a Galton-Watson branching process the "$n$th generation" is formed by replacing each element of the $(n-1)$th generation by a random number of elements, independently from the rest of the population and from the preceding generations. This random number is distributed according to a fixed distribution, called the offspring distribution. Thus, conditionally on the size $X_{n-1}$ of the $(n-1)$th generation the size $X_n$ of the $n$th generation is distributed as the sum of $X_{n-1}$ i.i.d. copies of an offspring variable $Z$. Suppose that $X_0 = 1$, that we observe $(X_1, \ldots, X_n)$, and that the offspring distribution is known to belong to an exponential family of the form

$$P_\theta(Z = z) = a_z\,\theta^z c(\theta), \qquad z = 0, 1, 2, \ldots,$$

for given numbers $a_0, a_1, \ldots$. The natural parameter space is the set of all $\theta$ such that $c(\theta)^{-1} = \sum_z a_z \theta^z$ is finite (an interval). We shall concentrate on parameters in the interior of the natural parameter space such that $\mu(\theta) := \mathrm{E}_\theta Z > 1$. Set $\sigma^2(\theta) = \operatorname{var}_\theta Z$. The sequence $X_1, X_2, \ldots$ is a Markov chain with transition density

$$p_\theta(y \mid x) = P_\theta(X_n = y \mid X_{n-1} = x) = \Bigl(\sum_{z_1 + \cdots + z_x = y} a_{z_1} \cdots a_{z_x}\Bigr)\,\theta^y c(\theta)^x.$$
To obtain a two-term Taylor expansion of the log likelihood ratios, let $\ell_\theta(y \mid x)$ be the log transition density, and calculate that

$$\dot\ell_\theta(y \mid x) = \frac{y - x\mu(\theta)}{\theta}, \qquad \ddot\ell_\theta(y \mid x) = -\frac{y - x\mu(\theta)}{\theta^2} - \frac{x\dot\mu(\theta)}{\theta}.$$

(The fact that the score function of the model $\theta \mapsto P_\theta(Z = z)$ has mean zero yields the identity $\mu(\theta) = -\theta(\dot c/c)(\theta)$, as is usual for exponential families.) Thus, the Fisher information in the observation $(X_1, \ldots, X_n)$ equals (note that $\mathrm{E}_\theta(X_j \mid X_{j-1}) = X_{j-1}\mu(\theta)$)

$$-\mathrm{E}_\theta \sum_{j=1}^n \ddot\ell_\theta(X_j \mid X_{j-1}) = \frac{\dot\mu(\theta)}{\theta} \sum_{j=1}^n \mathrm{E}_\theta X_{j-1} = \frac{\dot\mu(\theta)}{\theta}\,\frac{\mu(\theta)^n - 1}{\mu(\theta) - 1}.$$

For $\mu(\theta) > 1$, this converges to infinity at a much faster rate than "usually." Because the total information in $(X_1, \ldots, X_n)$ is of the same order as the information in the last observation $X_n$, the model is "explosive" in terms of growth of information. The calculation suggests the rescaling rate $\gamma_{n,\theta} = \mu(\theta)^{-n/2}$, which is roughly the inverse root of the information.
A Taylor expansion of the log likelihood ratio yields the existence of a point $\theta_n$ between $\theta$ and $\theta + \gamma_{n,\theta}h$ such that

$$\log \frac{dP_{n,\theta+\gamma_{n,\theta}h}}{dP_{n,\theta}}(X_1, \ldots, X_n) = h\gamma_{n,\theta} \sum_{j=1}^n \dot\ell_\theta(X_j \mid X_{j-1}) + \tfrac12 h^2 \gamma_{n,\theta}^2 \sum_{j=1}^n \ddot\ell_{\theta_n}(X_j \mid X_{j-1}).$$

This motivates the definitions

$$\Delta_{n,\theta} = \gamma_{n,\theta} \sum_{j=1}^n \frac{X_j - X_{j-1}\mu(\theta)}{\theta}, \qquad J_{n,\theta} = \gamma_{n,\theta}^2\,\frac{\dot\mu(\theta)}{\theta} \sum_{j=1}^n X_{j-1}.$$

Because $\mathrm{E}_\theta(X_n \mid X_{n-1}, \ldots, X_1) = X_{n-1}\mu(\theta)$, the sequence of random variables $\mu(\theta)^{-n} X_n$ is a martingale under $\theta$. Some algebra shows that its second moments are bounded as $n \to \infty$. Thus, by a martingale convergence theorem (e.g., Theorem 10.5.4 of [42]), there exists a random variable $V$ such that $\mu(\theta)^{-n} X_n \to V$ almost surely. By the Toeplitz lemma (Problem 9.6) and again some algebra, we obtain that, almost surely under $\theta$,

$$\frac{1}{\mu(\theta)^n} \sum_{j=1}^n X_j \to \frac{\mu(\theta)}{\mu(\theta) - 1}\,V, \qquad \frac{1}{\mu(\theta)^n} \sum_{j=1}^n X_{j-1} \to \frac{1}{\mu(\theta) - 1}\,V.$$

It follows that the point $\theta_n$ in the expansion of the log likelihood can be replaced by $\theta$ at the cost of adding a term that converges to zero in probability under $\theta$. Furthermore,

$$J_{n,\theta} \to \frac{\dot\mu(\theta)\,V}{\theta\bigl(\mu(\theta) - 1\bigr)}, \qquad P_\theta\text{-almost surely}.$$
It remains to derive the limit distribution of the sequence $\Delta_{n,\theta}$. If we write $X_j = \sum_{i=1}^{X_{j-1}} Z_{j,i}$ for independent copies $Z_{j,i}$ of the offspring variable $Z$, then

$$\Delta_{n,\theta} = \frac{\mu(\theta)^{-n/2}}{\theta} \sum_{j=1}^n \sum_{i=1}^{X_{j-1}} \bigl(Z_{j,i} - \mu(\theta)\bigr) = \frac{\mu(\theta)^{-n/2}}{\theta} \sum_{i=1}^{V_n} \bigl(Z_i - \mu(\theta)\bigr),$$

for independent copies $Z_1, Z_2, \ldots$ of $Z$ and $V_n = \sum_{j=1}^n X_{j-1}$. Even though $Z_1, Z_2, \ldots$ and the total number $V_n$ of variables in the sum are dependent, a central limit theorem applies to the right side: conditionally on the event $\{V > 0\}$ (on which $V_n \to \infty$), the sequence $V_n^{-1/2} \sum_{i=1}^{V_n} \bigl(Z_i - \mu(\theta)\bigr)$ converges in distribution to $\sigma(\theta)$ times a standard normal variable $G$.† Furthermore, if we define $G$ independent of $V$, conditionally on $\{V > 0\}$,

$$(\Delta_{n,\theta}, J_{n,\theta}) \rightsquigarrow \Bigl(\frac{\sigma(\theta)\,G}{\theta}\sqrt{\frac{V}{\mu(\theta) - 1}},\; \frac{\dot\mu(\theta)\,V}{\theta\bigl(\mu(\theta) - 1\bigr)}\Bigr). \qquad (9.11)$$

† See the appendix of [81] or, e.g., Theorem 3.5.1 and its proof in [146].
It is well known that the event $\{V = 0\}$ coincides with the event $\{\lim X_n = 0\}$ of extinction of the population. (This occurs with positive probability if and only if $a_0 > 0$.) Thus, on the set $\{V = 0\}$ the series $\sum_{j=1}^\infty X_j$ converges almost surely, whence $\Delta_{n,\theta} \to 0$. Interpreting zero as the product of a standard normal variable and zero, we see that (9.11) is again valid. Thus the sequence $(\Delta_{n,\theta}, J_{n,\theta})$ converges also unconditionally to this limit. Finally, note that $\sigma^2(\theta)/\theta = \dot\mu(\theta)$, so that the limit distribution has the right form. The maximum likelihood estimator for $\mu(\theta)$ can be shown to be asymptotically efficient (see, e.g., [29] or [81]). □
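The random limit $V$ is easy to see in a simulation. The sketch below is an illustration only; it uses Poisson$(\mu)$ offspring, which is of the exponential-family form above with $a_z = 1/z!$, $\theta = \mu$ and $c(\theta) = e^{-\theta}$, and arbitrary values for $\mu$ and the number of generations. Across sample paths, $\mu^{-n}X_n$ settles down to different values (including $0$ on extinct paths), so the normalized information is random in the limit.

import numpy as np

# Illustrative sketch: for a supercritical Galton-Watson process with Poisson(mu)
# offspring, mu^(-n) * X_n converges almost surely to a random limit V.
rng = np.random.default_rng(3)
mu, generations, paths = 1.5, 25, 5

for _ in range(paths):
    x = 1
    for _ in range(generations):
        x = rng.poisson(mu, size=x).sum() if x > 0 else 0
    # One (approximate) realization of V; it varies over paths and is 0 on extinct paths.
    print(x / mu**generations)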
9.12 Example (Gaussian AR). The canonical example of an LAMN sequence of experiments is obtained from an explosive autoregressive process of order one with Gaussian innovations. (The Gaussianity is essential.) Let $|\theta| > 1$ and $\varepsilon_1, \varepsilon_2, \ldots$ be an i.i.d. sequence of standard normal variables independent of a fixed variable $X_0$. We observe the vector $(X_0, X_1, \ldots, X_n)$ generated by the recursive formula $X_t = \theta X_{t-1} + \varepsilon_t$. The observations form a Markov chain with transition density $p(\cdot \mid x_{t-1})$ equal to the $N(\theta x_{t-1}, 1)$-density. Therefore, the log likelihood ratio process takes the form

$$\log \frac{p_{n,\theta+\gamma_{n,\theta}h}}{p_{n,\theta}}(X_0, \ldots, X_n) = h\gamma_{n,\theta} \sum_{t=1}^n (X_t - \theta X_{t-1})X_{t-1} - \tfrac12 h^2 \gamma_{n,\theta}^2 \sum_{t=1}^n X_{t-1}^2.$$

This already has the appropriate quadratic structure. To establish LAMN, it suffices to find the right rescaling rate and to establish the joint convergence of the linear and the quadratic term. The rescaling rate may be chosen proportional to the inverse root of the Fisher information and is taken $\gamma_{n,\theta} = \theta^{-n}$. By repeated application of the defining autoregressive relationship, we see that

$$\theta^{-t} X_t = X_0 + \sum_{j=1}^t \theta^{-j}\varepsilon_j \to V := X_0 + \sum_{j=1}^\infty \theta^{-j}\varepsilon_j,$$
almost surely as well as in second mean. Given the variable $X_0$, the limit is normally distributed with mean $X_0$ and variance $(\theta^2 - 1)^{-1}$. An application of the Toeplitz lemma (Problem 9.6) yields

$$J_{n,\theta} = \theta^{-2n} \sum_{t=1}^n X_{t-1}^2 \to \frac{V^2}{\theta^2 - 1}, \qquad \text{almost surely}.$$

The linear term in the quadratic representation of the log likelihood can (under $\theta$) be rewritten as $\Delta_{n,\theta} = \theta^{-n} \sum_{t=1}^n \varepsilon_t X_{t-1}$ and satisfies, by the Cauchy-Schwarz inequality and the Toeplitz lemma,

$$\mathrm{E}\,\Bigl|\theta^{-n} \sum_{t=1}^n \varepsilon_t X_{t-1} - \theta^{-n} \sum_{t=1}^n \varepsilon_t\,\theta^{t-1} V\Bigr| \to 0.$$
It follows that the sequence of vectors $(\Delta_{n,\theta}, J_{n,\theta})$ has the same limit distribution as the sequence of vectors $\bigl(\theta^{-n} \sum_{t=1}^n \varepsilon_t\,\theta^{t-1} V,\; V^2/(\theta^2 - 1)\bigr)$. For every $n$ the vector $\bigl(\theta^{-n} \sum_{t=1}^n \varepsilon_t\,\theta^{t-1}, V\bigr)$ possesses, conditionally on $X_0$, a bivariate-normal distribution. As $n \to \infty$ these distributions converge to a bivariate-normal distribution with mean $(0, X_0)$ and covariance matrix $I/(\theta^2 - 1)$. Conclude that the sequence $(\Delta_{n,\theta}, J_{n,\theta})$ converges in distribution as required by the LAMN criterion. □
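The randomness of the limiting information can be checked numerically. The sketch below is illustrative only (it starts the process at $X_0 = 0$ and uses arbitrary values of $\theta$ and $n$): across replications the normalized observed information $\theta^{-2n}\sum X_{t-1}^2$ does not settle down to a constant but agrees, path by path, with $V^2/(\theta^2 - 1)$.

import numpy as np

# Illustrative sketch: in the explosive Gaussian AR(1) model the normalized
# observed information theta^(-2n) * sum(X_{t-1}^2) has a random limit V^2/(theta^2 - 1).
rng = np.random.default_rng(4)
theta, n, paths = 1.2, 200, 5

for _ in range(paths):
    eps = rng.standard_normal(n)
    x = np.zeros(n + 1)                      # take X_0 = 0 for simplicity
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + eps[t - 1]
    v = theta ** (-n) * x[n]                 # approximately V
    info = theta ** (-2 * n) * np.sum(x[:-1] ** 2)
    print(info, v**2 / (theta**2 - 1))       # the two numbers approximately agree, but vary across paths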
9.7 Heuristics

The asymptotic representation theorem, Theorem 9.3, shows that every sequence of statistics in a converging sequence of experiments is matched by a statistic in the limit experiment. It is remarkable that this is true under the present definition of convergence of experiments, which involves only marginal convergence and is very weak. Under appropriate stronger forms of convergence more can be said about the nature of the matching procedure in the limit experiment. For instance, a sequence of maximum likelihood estimators converges to the maximum likelihood estimator in the limit experiment, or a sequence of likelihood ratio statistics converges to the likelihood ratio statistic in the limit experiment. We do not introduce such stronger convergence concepts in this section but only note the potential of this argument as a heuristic principle. See section 5.9 for rigorous results.

For the maximum likelihood estimator the heuristic argument takes the following form. If $\hat h_n$ maximizes the likelihood $h \mapsto dP_{n,h}$, then it also maximizes the likelihood ratio process $h \mapsto dP_{n,h}/dP_{n,h_0}$. The latter sequence of processes converges (marginally) in distribution to the likelihood ratio process $h \mapsto dP_h/dP_{h_0}$ of the limit experiment. It is reasonable to expect that the maximizer $\hat h_n$ converges in distribution to the maximizer of the process $h \mapsto dP_h/dP_{h_0}$, which is the maximum likelihood estimator for $h$ in the limit experiment. (Assume that this exists and is unique.) If the converging experiments are the local experiments corresponding to a given sequence of experiments with a parameter $\theta$, then the argument suggests that the sequence of local maximum likelihood estimators $\hat h_n = r_n(\hat\theta_n - \theta)$ converges, under $\theta$, in distribution to the maximum likelihood estimator in the local limit experiment, under $h = 0$.

Besides yielding the limit distribution of the maximum likelihood estimator, the argument also shows to what extent the estimator is asymptotically efficient. It is efficient, or inefficient, in the same sense as the maximum likelihood estimator is efficient or inefficient in the limit experiment. That maximum likelihood estimators are often asymptotically efficient is a consequence of the fact that often the limit experiment is Gaussian and the maximum likelihood estimator of a Gaussian location parameter is optimal in a certain sense. If the limit experiment is not Gaussian, there is no a priori reason to expect that the maximum likelihood estimators are asymptotically efficient.

A variety of examples shows that the conclusions of the preceding heuristic arguments are often but not universally valid. The reason for failures is that the convergence of experiments is not well suited to allow claims about maximum likelihood estimators. Such claims require stronger forms of convergence than marginal convergence only. For the case of experiments consisting of a random sample from a smooth parametric model, the argument is made precise in section 7.4. Next to the convergence of experiments, it is required only that the maximum likelihood estimator is consistent and that the log density is locally Lipschitz in the parameter. The preceding heuristic argument also extends to the other examples of convergence to limit experiments considered in this chapter. For instance, the maximum likelihood estimator based on a sample from the uniform distribution on $[0, \theta]$ is asymptotically inefficient, because it corresponds to the estimator $Z$ for $h$ (the maximum likelihood estimator) in the exponential limit experiment. The latter is biased upwards and inefficient for all of the usual loss functions.

Notes
This chapter presents a few examples from a large body of theory. The notion of a limit experiment was introduced by Le Cam in [95]. He defined convergence of experiments through convergence of all finite subexperiments relative to his deficiency distance, rather than through convergence of the likelihood ratio processes. This deficiency distance introduces a "strong topology" next to the "weak topology" corresponding to convergence of experiments. For experiments with a finite parameter set, the two topologies coincide. There are many general results that can help to prove the convergence of experiments and to find the limits (also in the examples discussed in this chapter). See [82], [89], [96], [97], [115], [138], [142] and [144] for more information and more examples. For nonlocal approximations in the strong topology see, for example, [96] or [110].

PROBLEMS

1. Let $X_1, \ldots, X_n$ be an i.i.d. sample from the normal $N(h/\sqrt{n}, 1)$ distribution, in which $h \in \mathbb{R}$. The corresponding sequence of experiments converges to a normal experiment by the general results. Can you see this directly?

2. If the $n$th experiment corresponds to the observation of a sample of size $n$ from the uniform $[0, 1 - h/n]$ distribution, then the limit experiment corresponds to observation of a shifted exponential variable $Z$. The sequences $-n(X_{(n)} - 1)$ and $\sqrt{n}(2\bar{X}_n - 1)$ both converge in distribution under every $h$. According to the representation theorem their sets of limit distributions are the distributions of randomized statistics based on $Z$. Find these randomized statistics explicitly. Any implications regarding the quality of $X_{(n)}$ and $\bar{X}_n$ as estimators?
3. Let the $n$th experiment consist of one observation from the binomial distribution with parameters $n$ and success probability $h/n$ with $0 < h < 1$ unknown. Show that this sequence of experiments converges to the experiment consisting of observing a Poisson variable with mean $h$.

4. Let the $n$th experiment consist of observing an i.i.d. sample of size $n$ from the uniform $[-1 - h/n, 1 + h/n]$ distribution. Find the limit experiment.

5. Prove the asymptotic representation theorem for the case in which the $n$th experiment corresponds to an i.i.d. sample from the uniform $[0, \theta - h/n]$ distribution with $h > 0$ by mimicking the proof of this theorem for the locally asymptotically normal case.
6. (Toeplitz lemma.) If $a_n$ is a sequence of nonnegative numbers with $\sum a_n = \infty$ and $x_n \to x$ an arbitrary converging sequence of numbers, then the sequence $\sum_{j=1}^n a_j x_j / \sum_{j=1}^n a_j$ converges to $x$ as well. Show this.
7. Derive a limit experiment in the case of Galton-Watson branching with $\mu(\theta) < 1$.

8. Derive a limit experiment in the case of a Gaussian AR(1) process with $\theta = 1$.
10. In the case of sampling from the U [O, e ] distribution show that the maximum likelihood estimator
for e converges to the maximum likelihood estimator in the limit experiment. Why is the latter not a good estimator?
1 1 . Formulate and prove a local asymptotic minimax theorem for estimating e from a sample from
a U [O, e ] distribution, using
£(x)
=
x 2 as los s function.
10 Bayes Procedures
In this chapter Bayes estimators are studied from afrequentist perspec tive. Both posterior measures and Bayes point estimators in smooth parametric models are shown to be asymptotically normal. 10.1
Introduction
In Bayesian terminology the distribut�on Pn,e of an observation X n under a parameter () is viewed as the conditional law of X n given that a random variable Gn is equal to () . The distribution [] of the "random parameter" B n is called the prior distribution, and the conditional distribution of en given X n is the posterior distribution. If Bn possesses a density n and Pn,e admits a density Pn,e (relative to given dominating measures), then the density of the posterior distribution is given by Bayes' formula
This expression may define a probability density even if n is not a probability density itself. A prior distribution with infinite mass is called improper. The calculation of the posterior measure can be considered the ultimate aim of a Bayesian analysis. Alternatively, one may wish to obtain a "point estimator" for the parameter (), using the posterior distribution. The posterior mean E(Gn I X n) = f () Pe" 1 x" (()) d() is often used for this purpose, but other location estimators are also reasonable. A choice of point estimator may be motivated by a loss function. The Bayes risk of an estimator Tn relative to the loss function e and prior measure n is defined as
Here the expectation Ee £ (Tn - ()) is the risk function of Tn in the usual set-up and is identical to the conditional risk E ( £(Tn - Gn ) I en = e) in the Bayesian notation. The corresponding Bayes estimator is the estimator Tn that minimizes the Bayes risk. Because the Bayes risk can be written in the form EE ( £ C Tn - Gn) I X n ) , the value Tn = Tn (x ) minimizes, for every fixed x, the "posterior risk" -:;-
....
E (£(Tn - Gn) I Xn
_ -
X
)
-
j £ (Tn - ()) pn,e (x ) d f1 ( fl ) . J Pn,e (x ) d[] (()) 138
1 0. 1
Introduction
139
Minimizing this expression may again be a well-defined problem even for prior densities of infinite total mass. For the loss function .£ (y) = I y f, the solution Tn is the posterior mean E(8n I X n ), for absolute loss .£ (y) = II Y II , the solution is the posterior median. Other Bayesian point estimators are the posterior mode, which reduces to the maximum likelihood estimator in the case of a uniform prior density; or a maximum probability estimator, such as the center of the smallest ball that contains at least posterior mass (the "posterior shorth" in dimension one). If the underlying experiments converge, in a suitable sense, to a Gaussian location experiment, then all these possibilities are typically asymptotically equivalent. Consider the case that the observation consists of a random sample of size n from a density Pe that depends smoothly on a Euclidean parameter e . Thus the density Pn,e has a product form, and, for a given prior Lebesgue density n , the posterior density takes the form
1/2
Typically, the distribution corresponding to this measure converges to the measure that is degenerate at the true parameter value e0, as n � oo. In this sense Bayes estimators are usually consistent. A further discussion is given in sections and To obtain a more interesting limit, we rescale the parameter in the usual way and study the sequence of posterior distributions of JfiC8 n e0) , whose densities are given by
10. 2 10. 4 .
-
If the prior density n is continuous, then n (eo + hj Jfi), for large n, behaves like the constant n (e0) , and n cancels from the expression for the posterior density. For densities Pe that are sufficiently smooth in the parameter, the sequence of models (Peo + h! -fii : h E JR1.k ) is locally asymptotically normal, as discussed in Chapter This means that the likelihood ratio processes h r----+ n7=1 Peo +h! -fii / Pea (Xi ) behave asymptotically as the likelihood ratio process of the normal experiment ( N (h , Ie� 1 ) : h E IR1. k ) . Then we may expect the preceding display to be asymptotically equivalent in distribution to
7.
dN (h ' /e-o 1 ) (X) J dN(h, Ie� 1 ) (X)dh
=
dN(X ' Ie-o 1 ) (h) '
where dN(JL, I; ) denotes the density of the normal distribution. The expression in the preceding display is exactly the posterior density for �he experiment ( N (h , Ie� 1 ) : h E IR1.k ) , relative to the (improper) Lebesgue prior distribution. The expression on the right shows that this is a normal distribution with mean X and covariance matrix Ie� 1 . This heuristic argument leads us to expect that the posterior distribution of Jfi(8 n e0) "converges" under the true parameter e0 to the posterior distribution of the Gaussian limit experiment relative to the Lebesgue prior. The latter is equal to the N (X, Ie� 1 ) distribution, for X possessing the N Ie� 1 ) -distribution. The notion of convergence in this statement is a complicated one, because a posterior distribution is a conditional, and hence stochastic, probability measure, but there is no need to make the heuristics precise at this point. On the other hand, the convergence should certainly include that "nice" Euclidean valued functionals applied to the posterior laws converge in distribution in the usual sense. -
(0,
140
Bayes Procedures
Consequently, a sequence of Bayes point estimators, which can be viewed as location functionals applied to the posterior distributions, should converge to the corresponding Bayes point estimator in the limit experiment. Most location estimators (all reasonable ones) map symmetric distributions, such as the normal distribution, into their center of symmetry. Then the Bayes point estimator in the limit experiment is $X$, and we should expect Bayes point estimators to converge in distribution to the random vector $X$, that is, to a $N(0, I_{\theta_0}^{-1})$-distribution under $\theta_0$. In particular, they are asymptotically efficient and asymptotically equivalent to maximum likelihood estimators (under regularity conditions).

A remarkable fact about this conclusion is that the limit distribution of a sequence of Bayes estimators does not depend on the prior measure. Apparently, for an increasing number of observations one's prior beliefs are erased (or corrected) by the observations. To make this true an essential assumption is that the prior distribution possesses a density that is smooth and positive in a neighborhood of the true value of the parameter. Without this property the conclusion fails. For instance, in the case in which one rigorously sticks to a fixed discrete distribution that does not charge $\theta_0$, the sequence of posterior distributions of $\Theta_n$ cannot even be consistent.

In the next sections we make the preceding heuristic argument precise. For technical reasons we separately consider the distributional approximation of the posterior distributions by a Gaussian one and the weak convergence of Bayes point estimators. Even though the heuristic extends to convergence to other than Gaussian location experiments, we limit ourselves in this chapter to the locally asymptotically normal case. More precisely, we even assume that the observations are a random sample $X_1, \ldots, X_n$ from a distribution $P_\theta$ that admits a density $p_\theta$ with respect to a measure $\mu$ on a measurable space $(\mathcal{X}, \mathcal{A})$. The parameter $\theta$ is assumed to belong to a measurable subset $\Theta$ of $\mathbb{R}^k$ that contains the true parameter $\theta_0$ as an interior point, and we assume that the maps $(\theta, x) \mapsto p_\theta(x)$ are jointly measurable. All theorems in this chapter are frequentist in character in that we study the posterior laws under the assumption that the observations are a random sample from $P_{\theta_0}$ for some fixed, nonrandom $\theta_0$. The alternative, which we do not consider, would be to make probability statements relative to the joint distribution of $(X_1, \ldots, X_n, \Theta_n)$, given a fixed prior marginal measure for $\Theta_n$ and with $P_{\Theta_n}$ being the conditional law of $(X_1, \ldots, X_n)$ given $\Theta_n$.
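The heuristic is easy to observe in a conjugate example. The following sketch is an illustration only (a Bernoulli model with a Beta prior; all numerical values are arbitrary): the posterior distribution of $\sqrt{n}(\Theta_n - \hat\theta_n)$ is approximately $N(0, I_{\theta_0}^{-1})$ with $I_\theta = 1/(\theta(1-\theta))$, whatever reasonable prior parameters are chosen.

import numpy as np

# Illustrative sketch of the heuristic: for Bernoulli(theta) data with a Beta(a, b)
# prior, the posterior of sqrt(n)*(theta - theta_hat) is approximately N(0, 1/I(theta0)).
rng = np.random.default_rng(5)
theta0, n, a, b = 0.3, 5_000, 2.0, 7.0           # prior parameters are arbitrary

s = rng.binomial(n, theta0)                      # sufficient statistic of the sample
theta_hat = s / n                                # maximum likelihood estimator
post = rng.beta(a + s, b + n - s, size=200_000)  # draws from the posterior distribution
z = np.sqrt(n) * (post - theta_hat)

print(z.mean(), z.std())                         # near 0 and near sqrt(theta0*(1-theta0))
print(np.sqrt(theta0 * (1 - theta0)))            # the standard deviation of the normal limit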
The heuristic argument in the preceding section indicates that posterior distributions in dif ferentiable parametric models converge to the Gaussian posterior distribution N (X, Ii/ ) . The Bernstein-von Mises theorem makes this approximation rigorous and actually yields the approximation in a stronger sense than discussed so far. In Chapter it is seen that the observation X in the limit experiment is the asymptotic analogue of the "locally sufficient" statistics
7
where fe is the score function of the model. The Bernstein-von Mises theorem asserts that the total variation distance between the posterior distribution of yln(8n - e0) and the random distribution N (l:!.n,e0 , /8� 1 ) converges to zero. Because l:!.n,eo X, this has as a -v--+
1 0. 2
Bernstein-von Mises Theorem
141
consequence that the posterior distribution of .Jii( Gn - 80) converges, in any reasonable sense, in distribution to N (X , 18� 1 ) . The conditions of the following version of the Bernstein-von Mises theorem are re markably weak. Besides differentiability in quadratic mean of the model, it is assumed that there exists a sequence of uniformly consistent tests for testing H0 : e = 80 against In other words, it must be possible to separate the H1 : 118 - 80 11 ::::_ 8, for every 8 > true value 80 from the complements of balls centered at 80. Because the theorem implies that the posterior distributions eventually concentrate on balls of radii Mn f .Jn around 80, for every Mn --+ oo, this separation hypothesis appears to be very reasonable. Even more so, since, as is noted in Lemmas and under continuity and identifiability of the model, separation by tests of Ho : (} = eo from H1 : II (} - eo II ::::. 8 for a single (large) 8 > already implies separation for every 8 > Furthermore, if 8 is compact and the model continuous and identifiable, then even the separation condition is superfluous (because it is automatically satisfied). t
0.
10. 4 10. 6 , 0.
0
Let the experiment (Pe : (} E 8) be diffe rentiable in quadratic mean at 8o with nonsingular Fisher information matrix le0, and suppose that for every 8 > there exists a sequence of tests
Theorem (Bernstein-von Mises).
0
sup P; ( l -
11 8 - 8o ll :>-s
0.
(10. 2)
Furthermore, let the prior measure be absolutely continuous in a neighborhood of 80 with a continuous positive density at 80. Then the corresponding posterior distributions satisfy
Throughout the proof we rescale the parameter (} to the local parameter h = .jn(e - 8o) . Let On be the corresponding prior distribution on h (hence Dn (B) = 0 (8o + B j .Jii ) ), and for a given set C let n; be the probability measure obtained by restricting nn to C and next renormalizing. Write Pn,h for the distribution of Xn = (X 1 , . . . , Xn) under the original parameter 8o + hj .Jii, and let Pn, c = f Pn,h dn; (h) . Finally, let H n .jn(811 - 8o) , and denote the posterior distributions relative to On and n; by Pfin 1 x" and pH!::n I Xn , respectively. The proof consists of two steps. First, it is shown that the difference between the posterior measures relative to the priors nn and n;" , for Cn the ball with radius Mn , is asymptotically negligible, for any Mn --+ oo. Next it is shown that the difference between N (t:.... n ,e0 , /8� 1 ) and the posterior measures relative to the priors n;" converges to zero in probability, for some Mn --+ oo. For U, a ball of fixed radius around zero, we have Pn, u <1 r>Pn,o, because Pn,h n <1 r>Pn ,O for every bounded sequence hn , by Theorem Thus, when showing convergence to zero in probability, we may always exchange Pn,o and Pn, u . Proof.
=
-
7.2.
t
Recall that a test is a measurable function of the observations taking values in the interval [0, 1 ] ; in the present context this means a measurable function ¢n : xn r-+ [0, 1].
Bayes Procedures
142
Let Cn be the ball of radius Mn . By writing out the conditional densities we see that, for any measurable set B , PH
n
I X (B) - PH�" 1 xn
n
n
(B) = PH
n
I X (C� n B ) n
PH
n
I X (C� ) PH�" I X- (B) . n
n
n
Taking the supremum over B yields the bound The right side will be shown to converge to zero in mean under Pn , u for U a ball of fixed radius around zero. First, by assumption and because Pn, u <J Pn , o , Manipulating again the expressions for the posterior densities, we can rewrite the first term on the right as
For the tests given in the statement of the theorem, the integrand on the right converges to zero pointwise, but this is not enough. By Lemma there automatically exist tests ¢n for which the convergence is exponentially fast. For the tests given by the lemma the preceding display is bounded above by
10.3,
{ e - cC I h ll zAn ) d On (h) . On (U) Jll h ii�M"
1
Here nn (U) = n (Ito + u I Jn) is bounded below by a term of the order 1/ Jnk , by the positivity and continuity of the density n at 80 . Splitting the integral into the domains sufficiently small that n (tl) is uniformly Mn ::::: l l h II ::::: D,Jfi and ll h I 2: D,Jfi for D bounded on 1111 - 110 11 ::::: D, we see that the expression is bounded above by a multiple of
::::: 1
This converges to zero as n , Mn -+ oo. In the second part of the proof, let C be the ball of fixed radius M around zero, and let c N (f-L, :E ) be the normal distribution restricted and renormalized to C. The total variation distance between two arbitrary probability measures P and Q can be expressed in the form l i P - Q ll = j ( l - pjq) + d Q . It follows that
2
H N c ( t. n , eo , Ie� l ) = <
-
PlL xJ
+ d N C ( t. n,eo , /-1 e0 ) (h) ( ) 1 f l c (h)Pn , h (Xn)nn (h)j fc Pn, g (Xn) nn (g ) dg d P�" I X_ (h) !! ( 1 - Pn, g (Xn) nn (g )dN c ( t.n , eo , le� 1 ) (h) ) + d N c ( t.n,eo /-eo ! ) (g ) d P� _
-+
-+
Pn,h (Xn) nn (h) d N C ( t. n,e0 , le-o 1 ) ( g ) -+
H
"
'
H " I X_ " (h) '
1 0. 2 Bernstein-von Mises Theorem
143
(1
because - EY ) + :::; E ( l - Y ) + . This can be further bounded by replacing the third occurr ence of N c ( D. n,e0 , Ie� 1 ) by a multiple of the uniform measure A c on C. By the dominated-convergence theorem, the double integral on the right side converges to zero in mean under Pn, C if the integrand converges to zero in probability under the measure -+
(Note that Pn, c is the marginal distribution of Xn under the Bayesian model with prior []� .) Here []� is bounded up to a constant by A c for every sufficiently large n. Because Pn,h <1 1> Pn,o for every h, the sequence of measures on the right is contiguous with respect to the measures "A c (dh) Pn,o (dx) "A c (dg ) . The integrand converges to zero in probability under the latter measure by Theorem 7.2 and the continuity of n at e0 . This is true for every ball C of fixed radius M and hence also for some Mn � oo. •
Under the conditions of Theorem 10. 1, there exists for every Mn � oo a sequence of tests such that, for every sufficiently large n and every 10.3
Lemma.
0
11 e - eo ll :::: Mnf y'ii,
P�
0,
We shall construct two sequences of tests, which "work" for the ranges Mn/ y'ii :::; I e - eo I ::::: E and I e - eo II > E' respectively, and a given E > Then the such that the matrix Pe0 l� l� is nonsingular. Fix such an L and define
Proof
0.
[-
By the central limit theorem, Pe: Wn � triangle inequality,
0
0, so that Wn satisfies the first requirement. By the
Because, by the differentiability of the model, Pel� - Pe0 #0 = ( Pe0 l� l� + o ( l ) ) (e - eo ) , the first term on the right i s bounded below b y c 11 e - e0 II for some c > for every e that is sufficiently close to e0, say for 11 e - e0 II < E . If Wn = then the second term (without the minus sign) is bounded above by .j Mnfn. Consequently, for every c 11 e - eo I :::: 2.j Mnfn, and hence for every 11 e - e0 II :::: Mn/ y'ii and every sufficiently large n,
0,
0,
[117]),
by Hoeffding's inequality (e.g., Appendix B in for a sufficiently small constant C . Next, consider the range 11 e - e0 II > E for an arbitrary fixed E > B y assumption there exist tests
P�
0,
sup P; ( l -
lle - ea ll > c:
0.
0.
Bayes Procedures
1 44
It suffices to show that these tests can be replaced, if necessary, by tests for which the convergence to zero is exponentially fast. Fix k large enough such that P� ¢k and Pt ( 1 - ¢k ) for every ll 6l - 6lo II > 8. Let n = mk + r for :::: r < k, and define are smaller than Yn, I , . . . , Yn, m as ¢k applied in turn to X I , . . . , Xb to Xk + I , . . . , X 2b and so forth. Let Y n, m be their average and then define Wn = 1 { Yn, m � 1 2 } . Because Ee Yn,J � for every > 8 and every j , Hoeffding's inequality implies that 6l 80 I ll < e- 2m ( � - � f < e- m /8 . pen ( 1 - wn ) = Pe (Y n, m < 1 2) -
0
1/4
/
3/4
/
Because m is proportional to n , this gives the desired exponential decay. Because Ee0 Yn,J :::: 1 j the expectations P� Wn are similarly bounded. •
4,
The Bernstein-von Mises theorem is sometimes written with a different "centering se quence." By Theorem 8. 14 any sequence of standardized asymptotically efficient estimators y'ri (en - 6l) is asymptotically equivalent in probability to �n . e · Because the total variation distance is bounded by a multiple of I �n . e - y'ri (en - 6l) II , any such sequence y'ri(en - 6l) may replace �n . e in the Bernstein-von Mises theorem. By the invariance of the total variation norm under location and scale changes, the resulting statement can be written
Under regularity conditions this is true for the maximum likelihood estimators en . Com bining this with Theorem we then have, informally,
5. 3 9
and since conditioning en on en = e gives the usual "frequentist" distribution of en under e . This gives a remarkable symmetry. Le Cam's version of the Bernstein-von Mises theorem requires the existence of tests that are uniformly consistent for testing H0 : 6l = 80 versus HI : ll 6l - 80 11 � 8, for every 8 > Such tests certainly exist if there exist estimators Tn that are uniformly consistent, in that, for every 8 >
0.
0,
sup Pe ( II Tn - e I � 8 ) --*
e
0.
In that case, we can define ¢n = 1 { II Tn - 6lo I � 8 j2 } . Thus the condition of the Bernstein von Mises theorem that certain tests exist can be replaced by the condition that uniformly consistent estimators exist. This is often the case. For instance, the next lemma shows that this is the case for a Euclidean sample space X provided, for Fe the distribution functions corresponding to the Pe ,
e ' l >c II l e -inf
Fe - Fe , ll oo >
0.
10.2 Bernstein-von Mises Theorem
145
For compact parameter sets, this is implied by identifiability and continuity of the maps (] f-+ Fe . We generalize and formalize this in a second lemma, which shows that uniformity on compact subsets is always achievable if the model (Pe : f3 E 8) is differentiable in quadratic mean at every f3 and the parameter f3 is identifiable. A class of measurable functions :F is a uniform Glivenko-Cantelli class (in probability) if, for every c; > 0, sup Pp ( II JPln - P ll .r > 8) --* 0. p
Here the supremum is taken over all probability measures P on the sample space, and I I Q ll .r = sup / EF I Qf l . An example is the collection of indicators of all cells ( -oo, t] in a Euclidean sample space. 10.4
Lemma.
every c; > 0,
Suppose that there exists a uniform Glivenko-Cantelli class :F such that,for inf I Pe - Pe ' ll .r > 0. d (B, e ' ) > s
(10. 5 )
Then there exists a sequence of estimators that is uniformly consistent on 8 for estimat ing e. 10.6 Lemma. Suppose that 8 is CJ -compact, Pe f3 f-+ Pe are continuous for the total variation
f. Pe ' for every pair f3 f. fJ', and the maps
norm. Then there exists a sequence of estimators that is uniformly consistent on every compact subset of 8.
For the proof of the first lemma, define en to be a point of (near) minimum of the map (] f-+ II JPln - Pe ll .r· Then, by the triangle inequality and the definition of en , II Pe" - Pe ll .r :::: 2 11 JPln - Pe ll .r + ljn, if the near minimum is chosen within distance ljn of the true infimum. Fix c; > 0, and let o be the positive number given in condition ( 1 0.5). Then
Proof.
By assumption, the right side converges to zero uniformly in f3 . For the proof of the second lemma, first assume that 8 i s compact. Then there exists a uniform Glivenko-Cantelli class that satisfies the condition of the first lemma. To see this, first find a sequence A 1, A 2 , . . . of measurable sets that separates the points Pe . Thus, for every pair fJ , (]' E 8, if Pe (Ad = Pe , ( A d for every i , then f3 = fJ'. A separating collection exists by the identifiability of the parameter, and it can be taken to be countable by the continuity of the maps f3 f-+ Pe . (For a Euclidean sample space, we can use the cells ( -oo, t] for t ranging over the vectors with rational coordinates. More generally, see the lemma below.) Let :F be the collection of functions x f-+ i - 1 1 A , (x). Then the map h : 8 f-+ £00 (F) given by f3 f-+ ( Pe f) fEF is continuous and one-to-one. By the compactness of 8, the inverse h - 1 : h ( 8 ) f-+ 8 is automatically uniformly continuous. Thus, for every c; > 0 there exists o > 0 such that
ll h(fJ) - h(fJ') ll .r :::: o implies d(fJ, (]') :::: c:.
Bayes Proudurt!s
1 46 This
means that ( I 0.5) is satis fied . The class :F is
because by Chebyshev's inequality,
This co nc lud e s the
also a u n i fonn Gl ivenko-Cantclli class,
proo f of the second Jemma for compact 0.
To remove the compactness condition. wri te e as the union of an i ncreasing sequence
of compact sets K, c Kz c T,,,, that is unifonnly co n s i st e n t
· · · .
fixed m ,
a,.. m
6,.
=
O t: K.,
the p reced i ng
(d(T,,,m, 0) .!.) 2:
111
sequ e nce m,J _.. oo such that a11 .m,. Tn.m,. sat i s fi es the requirements� •
Then the re e x i st s that
: = s up Po
f-or every m there cxisrs
on Km . by
a
a
seq u en ce of e�timators .
argu ment.
-.
0,
�
0 as " __.
ll _..
Thus. for every
00.
oo.
It is not h a rd
to see
s econd lemma, if there cx isrs a sequence of tests t/Jn such that . ( J 0.2) holds fo r some t: > 0, th e n it holds for every & > 0. In t h a t ca� we can replace the g i \'e n sequence r/Jn by t h e minimum of t/J, and the tes t s I { nT,. - Oo ll 2: £/2 J for a sequ e nce of estimators T,, that is un i fonn ly consistent on a s uftici en r J y large subset of B. As
a
consequence of the
1 0.7 !..emma. Let the .fel ofprobabilit)' mt•a.wre.'i P mr a uu•a.wrablt• ,'ifWCe (,\'. A> be .'it>parable for tlu.• total \'tlriation llorm. Then the re exists ll countable .mb.H•t Ao C A .m ch tlwt P1 = P2 011 Ao implies Pa = 1'2 for f!\ el) Pa . P!. E P. '
Proof.
'
The set P can be i de n t i fi e d with a subset of L a (Jl )
for a su it able prob abil i ty m�asurc
11. . For i n s t a n ce. 11. cim be t a ke n a con\'ex li near co mb i nat i o n of a cou nt abl e dense set. Let
Po be a cou n t_abl e dense·subset, and Jet Ao be the set of all finite intersections of the sets p -• ( B) for p ran g i ng over a choice of densi ties of rhc set Po c L 1 (J.I. ) and R ranging O\'er
a
countable generator of the Borel seas in lR.
l11en every density p E Po is O"(Ao)- mcasumblc by construction. A den sity of a measure
P E P - P0 can be approximated in L 1 (Jt ) by a seq u e n ce from Pu and hence can be chosen a (A0)-measurable. without loss of ge nc m l i ty .. Because Ao is intersection-stable (a rr-systcm ) two probabi lity measures th�1t &Jgrec on Ao a uto m at i cal ly agree on the a -ficJd a (Ao ) generated by A0• Then t hey also gi\'c the same ex pec t a t io n to every a CAo)·mcasurable fu nct ion f : X .- [0, 1 1. If the measures have u (.A0)-measurabJ c d e n s i ti e s then they must agree on A, because P(t\ ) = E,, J " p = E,, E,. ( I 11 I a (..40)) p if p is a (.Ao)-measumble. • .
..
.
.
1 0.3
Point
:Estimators
The Bcmstcin-von M ises t he orem shows that the post e rio r laws con ve rge in distribution to a G a u ss i a n posterior l aw in total variation d i s ta n ce As a con seq ue n ce, any loca t i on functional .
that is sui tably cont inuous re l is t i ve to the total v�lriation norm appl ied
to the sequence of
10.3 posterior laws c onverge s
Poilll Estimators
1 47
to the same location functional applied to the limiring G"ussian X, or a N (0. Ii. ' )-distribution.
pos erior distribution. For most choices thi s means to
t
this sect ion we cons i der more general Bayes point est i mators that arc de fi ned as the e u i s relati\'e to some l os s fu nction. For a given loss function l : R4 H> [0, oo), le t Tn . for fixed X 1 X,. minimize the posrerior risk In
mini mizers of th posterior risk f nct on
•
, H>
• • • •
f e( Jti
the minimizing values T,, can be selected as a measurable is an im plic it assumption, or ot herwise the statem� n ts arc lo be understO 0, It is not imme d at ely clear that
i
fun cti n of the o bse rva ti on s . This
o
.
-
sup JhU::M
t'(lz ) ::;
inf ((h ). lh11�2M
with strict i nequality for at least one AI. t TI1is is tr e, for instance, for loss f n c o ns
u
u ti
of
= lo(lllr n) for a nondecre as i n g function to : [0. 00) H- [0, 00) that is not (0, oo). Fu n herm orc, we suppose that t. grows at most poly no m ia lly: For some constant p � 0,
the form l(h)
con s ta n t on
t !: 1 +
n'z ll ''·
10. 1 hold, ami /el l saris/)· the conclition.'i as .jii( TN - Oo) (·om·ery:es tmder Oo in distribution to the minimi:.er ofl � f ((1 - h ) d N (X, 1,� 1 )(h ). for X possessing 1 the N (0, /� )-cli.ftribution, prm•iclecl rlrat cmy two minimi�ers ofrhis process c:oindde almost .wrt!ly. In particular, for C\'er_\· nun :.ero, .wbcom·t•x loss fwtc·ticm it (.'On\'('rges w X. 10.8
Theorem. Let tile conditions ofThearem
a J ne Jl'' cl n {0 )
list(•d, for a p .'illc:ll tlz l
*Proof.
We ad op the notation as
t
< 00.
Then the .f('CJIIt!IICI!
li�ted in the fi rst paragraph of the proof of Theorem I 0. 1 .
Th e last assertion of the theorem is a consequence of Anderson's l emm a, Lemma 8.5. The standardized e st i m at or
where t, is t h
.jii( T,,
-
0o) minimizes
the function
e funct ion II ._,. l(t - h ). TI1c proof consists of t h ree parts. First it is shown
that i n te grals over the �els llll II � M, can be neglected for every M, _,
that t h e
sequence J,j( T,,
proc e sses 1 � the process
Next, it is proved
- 0o) is unifonn ly t ight. Fi n a l l y, it is shown that the stochastic Z,(l ) conve rge in distribution in the space t'"' ( K), for every compact K. to
I �
t
oo.
Z(t) =
J
l (t - h ) cl
Tho! 2 i s for cunvcnicm:c. any other numhcr would do.
N(X. 1,� 1 )(!z).
Bayes Procedures
148
t,
The sample paths of this limit process are continuous in in view of the subexponential growth of and the smoothness of the normal density. Hence the theorem follows from the argmax theorem, Corollary Let Cn be the ball of radius Mn for a given, arbitrary sequence Mn --+ oo. We first show that, for every measurable function f that grows subpolynomially of order p ,
.e
5. 5 8.
(1 0 .9 ) To see this, we utilize the tests
Here Tin ( U) is bounded below by a term of the order 1 j Jnk , by the positivity and continuity at (}0 of the prior density n . Split the integral over the domains Mn I :::; D,jii and I I ::: D,jii, and use the fact that J II (} d TI ((}) < oo to bound the right side of the display by terms of the order e - A M; and ,jnk +P e -B n , for some A , B > 0. These converge to zero, whence ( 1 0.9) has been proved. Define .C(M) as the supremum of £(h) over the ball of radius M, and f_(M) as the infimum over the complement of this ball. By assumption, there exists 8 > 0 such that ry : = f_(28) - l(8) > 0. Let U be the ball of radius 8 around 0. For every I :=: 3Mn and sufficiently large Mn , we have .C ( - h) - £(-h) :=: 'f] if h E U, and .C ( - h) - £(-h) :=: fo_(2Mn) - .C(Mn) ::: 0 if h E uc n Cn , by assumption. Therefore,
lh
:::; l h
liP
t
ti l
t
[(.e (t -
h) - £(-h)) ( 1 u + 1 ucn c" + 1 c�) ] Zn (t) - Zn (O) = PH" I Xn ::: ry Piln i X" (U) - pil" l x " (£ (-h) 1 cJ . Here the posterior probability Piln 1 X n (U) of U converges in distribution to N(X, !8� 1 ) (U) , by the Bernstein-von Mises theorem. This limit is positive almost surely. The second term in the preceding display converge� to zero in probability by ( 1 0 .9 ) . Conclude that the infimum of Zn - Zn (0) over the set of t with II I :=: 3Mn is bounded below by variables that converge in distribution to a strictly positive variable. Thus this infimum is positive with probability tending to one. This implies that the probability that H> Zn has a minimizer in the set II ::: 3Mn converges to zero. Because this is true for any Mn --+ oo, it follows that the sequence .jn (Tn - (}0) is uniformly tight. Let C be the ball of fixed radius M around 0, and fix some compact set K c �k . Define stochastic processes
(t)
t
t
lit
N (li n ,e0 , l8� 1 ) (£ , 1 c ) , WM = N(X, /8� 1 ) (£ 1 1 c ) .
Wn, M
=
(t)
1 0.4
Consistency
149
The function h 1---+ C (t - h) l c (h) is bounded, uniformly if t ranges over the compact K . Hence, b y the Bernstein-von Mises theorem, Zn , M - Wn,M � 0 in £00(K) as n -+ oo, for every fixed M. Second, by the continuous-mapping theorem, Wn,M WM in £00(K), as n -+ oo, for fixed M. Next WM � Z in £00(K) as M -+ oo, or equivalently C t JR.k . Conclude that there exists a sequence Mn -+ oo such that the processes Zn,M" Z in £00(K). Because, by ( 1 0.9), Zn (t) - Zn,M" (t) 0, we finally conclude that Zn Z in £00 (K) . a '¥'7
'¥'7
�
*10.4
'¥'7
Consistency
A sequence of posterior measures Pe" 1 x1 , . . . , x" is called consistent under () if under Pe00probability it converges in distribution to the measure 8e that is degenerate at (), in proba bility; it is strongly consistent if this happens for almost every sequence X 1 , X2 , . . . . Given that, usually, ordinarily consistent point estimators of () exist, consistency of posterior measures is a modest requirement. If we could know () with almost complete accuracy as n -+ oo, then we would use a Bayes estimator only if this would also yield the true value with similar accuracy. Fortunately, posterior measures are usually consistent. The following famous theorem by Doob shows that under hardly any conditions we already have consistency under almost every parameter. Recall that 8 is assumed to be Euclidean and the maps () 1---+ Pe (A) to be measurable for every measurable set A. 10.10 Theorem (Doob 's consistency theorem). Suppose that the sample space ( X , A) is a subset of Euclidean space with its Borel CJ -field. Suppose that Pe =f. Pw whenever () =f. ()'. Then for every prior probability measure n on 8 the sequence of posterior measures is consistent for n -almost every ().
On an arbitrary probability space construct random vectors 8 and X 1 , X 2 , . . . such that 8 is marginally distributed according to n and such that given 8 = () the vectors X 1 , X2 , are i.i.d. according to Pe . Then the posterior distribution based on the first n observations is Pe l X t , . . . , Xn . Let Q be the distribution of (X 1 ' X2 , . . . ' 8) on x oo X e . The main part o f the proof consists of showing that there exists a measurable function h : X00 1---+ 8 with
Proof.
.
.
•
Q-a.s .. Suppose that this is true. Then, for any bounded, measurable function f : 8 Doo b 's martingale convergence theorem,
(10. 1 1 ) 1---+
JR., by
E ( / (8) I X 1 , . . . , Xn ) -+ E ( / (8) I X 1 , X 2 , . . . ) f (h ( X l , X 2 , . . . ) ) , Q-a.s . .
2. 25
B y Lemma there exists a countable collection :F of bounded, continuous functions f that are determining for convergence in distribution. Because the countable union of the associated null sets on which the convergence of the preceding display fails is a null set, we have that Q-a. s . .
IJuy�s Procedures
I SO
This statement refers to the marginal distribution of (X1, X2, ...) under Q. We wish to translate it into a statement concerning the P_θ^∞-measures. Let C ⊂ X^∞ × Θ be the intersection of the sets on which the weak convergence holds and on which (10.11) is valid. By Fubini's theorem it follows that P_θ^∞(C_θ) = 1 for Π-almost every θ, where C_θ = {x: (x, θ) ∈ C} is the horizontal section of C at height θ. For every θ such that P_θ^∞(C_θ) = 1, we have (x, θ) ∈ C for P_θ^∞-almost every sequence x1, x2, ..., and hence both (10.11) and the weak convergence of the posterior distributions to δ_θ hold. This is the assertion of the theorem.

In order to establish (10.11), call a measurable function f: Θ → ℝ accessible if there exists a sequence of measurable functions h_n: X^n → ℝ such that

∫ |h_n(x) − f(θ)| ∧ 1 dQ(x, θ) → 0.

(Here we abuse notation in viewing h_n also as a measurable function on X^∞ × Θ.) Then there also exists a (sub)sequence with h_n(x) → f(θ) almost surely under Q, whence every accessible f is almost everywhere equal to an A^∞ × {∅, Θ}-measurable function. This is a measurable function of x = (x1, x2, ...) alone. If we can show that the functions f(θ) = θ_i are accessible, then (10.11) follows. We shall in fact show that every Borel measurable function is accessible.

By the strong law of large numbers, h_n(x) = n⁻¹ Σ_{i=1}^n 1_A(x_i) → P_θ(A) almost surely under P_θ^∞, for every θ and measurable set A. Consequently, by the dominated convergence theorem,

∫ |h_n(x) − P_θ(A)| ∧ 1 dQ(x, θ) → 0.

Thus each of the functions θ ↦ P_θ(A) is accessible. Because (X, A) is Euclidean by assumption, there exists a countable measure-determining subcollection A0 ⊂ A. The functions θ ↦ P_θ(A) are measurable by assumption and separate the points of Θ as A ranges over A0, in view of the choice of A0 and the identifiability of the parameter θ. This implies that these functions generate the Borel σ-field on Θ, in view of Lemma 10.12.
The proof is complete once it is shown that every function that is measurable in the σ-field generated by the accessible functions (which is the Borel σ-field) is accessible. From the definition it follows easily that the set of accessible functions is a vector space, contains the constant functions, is closed under monotone limits, and is a lattice. The desired result therefore follows by a monotone class argument, as in Lemma 10.13. ∎

The merit of the preceding theorem is that it imposes hardly any conditions, but its drawback is that it gives the consistency only up to null sets of possible parameters (depending on the prior). In certain ways these null sets can be quite large, and examples have
been constructed where Bayes estimators behave badly. To guarantee consistency under every parameter it is necessary to impose some further conditions. Because in this chapter we are mainly concerned with asymptotic normality of Bayes estimators (which implies consistency with a rate), we omit a discussion.
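As a concrete illustration of posterior consistency (an addition, not part of the original text), the following small Python sketch simulates Bernoulli data with a Beta prior and shows the posterior concentrating around the true parameter as n grows. The prior parameters and the true value 0.3 are arbitrary choices made only for this example.

import numpy as np

rng = np.random.default_rng(0)
theta0 = 0.3            # true parameter (arbitrary choice for the illustration)
a, b = 2.0, 2.0         # Beta(a, b) prior, also arbitrary

for n in [10, 100, 1000, 10000]:
    x = rng.binomial(1, theta0, size=n)
    a_post, b_post = a + x.sum(), b + n - x.sum()    # conjugate Beta posterior
    mean = a_post / (a_post + b_post)
    sd = np.sqrt(a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1)))
    print(n, round(mean, 4), round(sd, 4))
# The posterior mean approaches theta0 and the posterior spread shrinks,
# in line with consistency of the posterior at (almost) every parameter.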
10.12 Lemma. Let F be a countable collection of measurable functions f: Θ ⊂ ℝ^k → ℝ that separates the points of Θ. Then the Borel σ-field and the σ-field generated by F on Θ coincide.
Proof. By assumption, the map h: Θ → ℝ^F defined by h(θ)_f = f(θ) is measurable and one-to-one. Because F is countable, the Borel σ-field on ℝ^F (for the product topology) is equal to the σ-field generated by the coordinate projections. Hence the σ-fields generated by h and F (viewed as Borel measurable maps in ℝ^F and ℝ, respectively) on Θ are identical. Now h⁻¹, defined on the range of h, is automatically Borel measurable, by Proposition 8.3.5 in [24], and hence Θ and h(Θ) are Borel isomorphic. ∎
10.13 Lemma. Let F be a linear subspace of L1(Ω) with the properties (i) if f, g ∈ F, then f ∧ g ∈ F; (ii) if 0 ≤ f1 ≤ f2 ≤ ⋯ belong to F and f_n ↑ f ∈ L1(Ω), then f ∈ F; (iii) 1 ∈ F. Then F contains every σ(F)-measurable function in L1(Ω).
Proof. Because any σ(F)-measurable nonnegative function is the monotone limit of a sequence of simple functions, it suffices to prove that 1_A ∈ F for every A ∈ σ(F). Define A0 = {A: 1_A ∈ F}. Then A0 is an intersection-stable Dynkin system and hence a σ-field. Furthermore, for every f ∈ F and a ∈ ℝ, the functions n(f − a)⁺ ∧ 1 are contained in F and increase pointwise to 1_{f > a}. It follows that {f > a} ∈ A0. Hence σ(F) ⊂ A0. ∎
Notes
The Bernstein-von Mises theorem has that name because, as Le Cam and Yang [97] write, it was first discovered by Laplace. The theorem that is presented in this chapter is considerably more elegant than the results by these early authors, and also much better than the result in Le Cam [91], who revived the theorem in order to prove results on superefficiency. We adapted it from Le Cam [96] and Le Cam and Yang [97]. Ibragimov and Has'minskii [80] discuss the convergence of Bayes point estimators in greater generality, and also cover non-Gaussian limit experiments, but their discussion of the i.i.d. case as treated in the present chapter is limited to bounded parameter sets and requires stronger assumptions. Our treatment uses some elements of their proof, but is heavily based on Le Cam's Bernstein-von Mises theorem. Inspection of the proof shows that the conditions on the loss function can be relaxed significantly, for instance allowing exponential growth. The potential null sets of inconsistency that Doob's theorem leaves open really exist in some situations, particularly if the parameter set is infinite dimensional,
and have attracted much attention. See [34] , which is accompanied by evaluations of the phenomenon by many authors, including Bayesians.
PROBLEMS
1. Verify the conditions of the Bernstein-von Mises theorem for the experiment where P_θ is the Poisson measure of mean θ.
2. Let P_θ be the k-dimensional normal distribution with mean θ and covariance matrix the identity. Find the a posteriori law for the prior Π = N(τ, Λ) and some nonsingular matrix Λ. Can you see directly that the Bernstein-von Mises theorem is true in this case?
3. Let P_θ be the Bernoulli distribution with mean θ. Find the posterior distribution relative to the beta-prior measure, which has density proportional to θ^{α−1}(1 − θ)^{β−1} on (0, 1).
4. Suppose that, in the case of a one-dimensional parameter, we use the loss function ℓ(h) = 1_{(−1,2)}(h). Find the limit distribution of the corresponding Bayes point estimator, assuming that the conditions of the Bernstein-von Mises theorem hold.
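For the setting of Problem 2, the following Python sketch (an added illustration, not part of the original text, and restricted to the one-dimensional case for simplicity) computes the conjugate normal posterior and compares it with the N(x̄, 1/n) approximation; the true mean, prior mean, and prior variance are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
theta0, tau, lam2 = 0.7, 0.0, 4.0      # true mean, prior mean, prior variance (all arbitrary)
n = 200
x = rng.normal(theta0, 1.0, size=n)

# Conjugate update for a N(tau, lam2) prior and N(theta, 1) observations:
post_var = 1.0 / (n + 1.0 / lam2)
post_mean = post_var * (x.sum() + tau / lam2)

print("posterior mean and variance:", post_mean, post_var)
print("N(mean(x), 1/n):           ", x.mean(), 1.0 / n)
# For large n the posterior is essentially N(mean(x), 1/n), which is the content
# of the Bernstein-von Mises approximation; here it is almost exact.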
11 Projections
A projection of a random variable is defined as a closest element in a given set of functions. We can use projections to derive the asymptotic distribution of a sequence of variables by comparing these to projections of a simple form. Conditional expectations are special projections. The Hajek projection is a sum of independent variables; it is the leading term in the Hoeffding decomposition.
11.1
Projections
A common method to derive the limit distribution of a sequence of statistics Tn is to show that it is asymptotically equivalent to a sequence Sn of which the limit behavior is known. The basis of this method is Slutsky's lemma, which shows that the sequence Tn = Tn − Sn + Sn converges in distribution to S if both Tn − Sn → 0 in probability and Sn ⇝ S. How do we find a suitable sequence Sn? First, the variables Sn must be of a simple form, because the limit properties of the sequence Sn must be known. Second, Sn must be close enough. One solution is to search for the closest Sn of a certain predetermined form. In this chapter, "closest" is taken as closest in square expectation.

Let T and {S: S ∈ S} be random variables (defined on the same probability space) with finite second moments. A random variable Ŝ is called a projection of T onto S (or L2-projection) if Ŝ ∈ S and minimizes

S ↦ E(T − S)²,   S ∈ S.
Often S is a linear space in the sense that a1 S1 + a2 S2 is in S for every a1, a2 ∈ ℝ, whenever S1, S2 ∈ S. In this case Ŝ is the projection of T if and only if T − Ŝ is orthogonal to S for the inner product ⟨S1, S2⟩ = ES1S2. This is the content of the following theorem.

11.1 Theorem. Let S be a linear space of random variables with finite second moments. Then Ŝ is the projection of T onto S if and only if Ŝ ∈ S and

E(T − Ŝ)S = 0,   every S ∈ S.

Every two projections of T onto S are almost surely equal. If the linear space S contains the constant variables, then ET = EŜ and cov(T − Ŝ, S) = 0 for every S ∈ S.
Figure 11.1. A variable T and its projection Ŝ on a linear space.

Proof. For any S̃ and S in S,

E(T − S)² = E(T − S̃)² + 2E(T − S̃)(S̃ − S) + E(S̃ − S)².

If S̃ satisfies the orthogonality condition, then the middle term is zero, and we conclude that E(T − S)² ≥ E(T − S̃)², with strict inequality unless E(S̃ − S)² = 0. Thus, the orthogonality condition implies that S̃ is a projection, and also that it is almost surely unique. Conversely, for any number a,

E(T − S̃ − aS)² = E(T − S̃)² − 2aE(T − S̃)S + a²ES².

If S̃ is a projection, then this expression is nonnegative for every a. But the parabola a ↦ a²ES² − 2aE(T − S̃)S is nonnegative if and only if the orthogonality condition E(T − S̃)S = 0 is satisfied. If the constants are in S, then the orthogonality condition implies E(T − S̃)c = 0, whence the last assertions of the theorem follow. ∎

The theorem does not assert that projections always exist. This is not true: The infimum inf_S E(T − S)² need not be achieved. A sufficient condition for existence is that S is closed for the second-moment norm, but existence is usually more easily established directly. The orthogonality of T − Ŝ and S yields the Pythagorean rule ET² = E(T − Ŝ)² + EŜ². If the constants are contained in S, then this is also true for variances instead of second moments. (See Figure 11.1.)

Now suppose a sequence of statistics Tn and linear spaces Sn is given. For each n, let Ŝn be the projection of Tn onto Sn. Then the limiting behavior of the sequence Tn follows from that of Ŝn, and vice versa, provided the quotient var Tn / var Ŝn converges to 1.

11.2 Theorem. Let Sn be linear spaces of random variables with finite second moments that contain the constants. Let Tn be random variables with projections Ŝn onto Sn. If var Tn / var Ŝn → 1, then

(Tn − ETn)/sd Tn − (Ŝn − EŜn)/sd Ŝn → 0 in probability.
Proof. We shall prove convergence in second mean, which is stronger. The expectation of the difference is zero. Its variance is equal to

2 − 2 cov(Tn, Ŝn) / (sd Tn sd Ŝn).

By the orthogonality of Tn − Ŝn and Ŝn, it follows that ETnŜn = EŜn². Because the constants are in Sn, this implies that cov(Tn, Ŝn) = var Ŝn, and the theorem follows. ∎
The condition var Tn / var Ŝn → 1 in the theorem implies that the projections Ŝn are asymptotically of the same size as the original Tn. This explains why "nothing is lost" in the limit, and why the difference between Tn and its projection converges to zero. In the preceding theorem it is essential that the Ŝn are the projections of the variables Tn, because the condition var Tn / var Sn → 1 for general sequences Sn and Tn does not imply anything.
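For readers who like to experiment, the following Python sketch (an added illustration, not from the original text) approximates an L2-projection onto a finite-dimensional linear span by least squares on simulated data and checks the orthogonality relation E(T − Ŝ)S ≈ 0 of Theorem 11.1; the particular choices of T and of the spanning variables are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
T = x1 * x2 + x1 ** 2                      # an arbitrary square-integrable variable

# Linear span of {1, x1, x2}: the sample least-squares fit approximates the projection.
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, T, rcond=None)
S_hat = A @ coef

for S in (np.ones(n), x1, x2):
    print(np.mean((T - S_hat) * S))        # all approximately 0 (orthogonality)
print(T.mean(), S_hat.mean())              # equal means, since constants are in the span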
11.2
Conditional Expectation
The expectation EX of a random variable X minimizes the quadratic form a ↦ E(X − a)² over the real numbers a. This may be expressed as follows: EX is the best prediction of X, given a quadratic loss function, and in the absence of additional information. The conditional expectation E(X | Y) of a random variable X given a random vector Y is defined as the best "prediction" of X given knowledge of Y. Formally, E(X | Y) is a measurable function g0(Y) of Y that minimizes

E(X − g(Y))²

over all measurable functions g. In the terminology of the preceding section, E(X | Y) is the projection of X onto the linear space of all measurable functions of Y. It follows that the conditional expectation is the unique measurable function E(X | Y) of Y that satisfies the orthogonality relation

E(X − E(X | Y)) g(Y) = 0,   every g.
If E(X | Y) = g0(Y), then it is customary to write E(X | Y = y) for g0(y). This is interpreted as the expected value of X given that Y = y is observed. By Theorem 11.1 the projection is unique only up to changes on sets of probability zero. This means that the function g0(y) is unique up to sets B of values y such that P(Y ∈ B) = 0. (These could be very big sets.) The following examples give some properties and also describe the relationship with conditional densities.

11.3 Example. The orthogonality relationship with g = 1 yields the formula EX = EE(X | Y). Thus, "the expectation of a conditional expectation is the expectation." □

11.4 Example. If X = f(Y) for a measurable function f, then E(X | Y) = X. This follows immediately from the definition, in which the minimum can be reduced to zero. The interpretation is that X is perfectly predictable given knowledge of Y. □
11.5 Example. Suppose that (X, Y) has a joint probability density f(x, y) with respect to a σ-finite product measure μ × ν, and let f(x | y) = f(x, y)/f_Y(y) be the conditional density of X given Y = y. Then

E(X | Y) = ∫ x f(x | Y) dμ(x).

(This is well defined only if f_Y(Y) > 0.) Thus the conditional expectation as defined above concurs with our intuition. The formula can be established by writing

E(X − g(Y))² = ∫ [∫ (x − g(y))² f(x | y) dμ(x)] f_Y(y) dν(y).

To minimize this expression over g, it suffices to minimize the inner integral (between square brackets) by choosing the value of g(y) for every y separately. For each y, the integral ∫ (x − a)² f(x | y) dμ(x) is minimized for a equal to the mean of the density x ↦ f(x | y). □

11.6 Example. If X and Y are independent, then E(X | Y) = EX. Thus, the extra knowledge of an unrelated variable Y does not change the expectation of X. The relationship follows from the fact that independent random variables are uncorrelated: Because E(X − EX)g(Y) = 0 for all g, the orthogonality relationship holds for g0(Y) = EX. □

11.7 Example. If f is measurable, then E(f(Y)X | Y) = f(Y)E(X | Y) for any X and Y. The interpretation is that, given Y, the factor f(Y) behaves like a constant and can be "taken out" of the conditional expectation. Formally, the rule can be established by checking the orthogonality relationship. For every measurable function g,
E(f(Y)X − f(Y)E(X | Y)) g(Y) = E(X − E(X | Y)) f(Y)g(Y) = 0,

because X − E(X | Y) is orthogonal to all measurable functions of Y, including those of the form f(Y)g(Y). Because f(Y)E(X | Y) is a measurable function of Y, it must be equal to E(f(Y)X | Y). □

11.8 Example. If X and Y are independent, then E(f(X, Y) | Y = y) = Ef(X, y) for every measurable f. This rule may be remembered as follows: The known value y is substituted for Y; next, because Y carries no information concerning X, the unconditional expectation is taken with respect to X. The rule follows from the equality

E(f(X, Y) − g(Y))² = ∫∫ (f(x, y) − g(y))² dP_X(x) dP_Y(y).

Once again, this is minimized over g by choosing for each y separately the value g(y) to minimize the inner integral. □

11.9 Example. For any random vectors X, Y and Z,

E(E(X | Y, Z) | Y) = E(X | Y).
This expresses that a projection can be carried out in steps: The projection onto a smaller set can be obtained by projecting the projection onto a bigger set a second time. Formally, the relationship can be proved by verifying the orthogonality relationship E(E(X | Y, Z) − E(X | Y)) g(Y) = 0 for all measurable functions g. By Example 11.7, the left side of this equation is equivalent to EE(Xg(Y) | Y, Z) − EE(g(Y)X | Y) = 0, which is true because conditional expectations retain expectations. □
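To see these orthogonality properties numerically, here is a small Python sketch (added for illustration, not from the original text) in which Y takes finitely many values, so that E(X | Y) can be estimated by group averages; the particular model X = Y² + noise is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(3)
n = 200_000
Y = rng.integers(0, 3, size=n)            # Y uniform on {0, 1, 2}
X = Y ** 2 + rng.normal(size=n)           # so E(X | Y = y) = y**2

# Estimate E(X | Y) by the average of X within each level of Y.
cond_mean = np.array([X[Y == y].mean() for y in range(3)])
EXgivenY = cond_mean[Y]

print(cond_mean)                                   # approximately [0, 1, 4]
print(np.mean(X - EXgivenY))                       # EX = E E(X | Y): approximately 0
for g in (np.ones(n), Y, Y ** 2):                  # orthogonality to functions of Y
    print(np.mean((X - EXgivenY) * g))             # approximately 0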
Proj ection onto Sums
Let X 1 , . . . , Xn be independent random vectors, and let S be the set of all variables of the form
n
L gi (Xi ),
i=1 for arbitrary measurable functions gi with Eg[ (Xi )
<
oo. This class is of interest, because
the convergence in distribution of the sums can be derived from the central limit theorem. The projection of a variable onto this class is known as its Hajek projection.
Let X 1 , . . . , Xn be independent random vectors. Then the projection of an arbitrary random variable T with finite second moment onto the class S is given by n S = L E(T I Xd - (n - 1)ET. i=1 11.10
Lemma.
The random variable on the right side is certainly an element of S . Therefore, the assertion can be verified by checking the orthogonality relation. Because the variables X; are independent, the conditional expectation E (E(T I Xi) I Xi ) is equal to the expectation EE(T I X; ) = ET for every i -=1- j . Consequently, E(S I X1 ) = E(T I X1 ) for every j , whence
Proof.
This shows that T - S is orthogonal to S.
•
Consider the special case that X 1 , . . . , Xn are not only independent but also identically distributed, and that T = T (X 1 , . . . , Xn) is a permutation-symmetric, measurable function of the X; . Then
E(T I Xi = x) = ET (x , X 2 , . . . , Xn) . Because this does not depend on i , the projection S is also the projection of T onto the smaller set of variables of the form L 7= 1 g(Xi ), where g is an arbitrary measurable function. * 1 1 .4
Hoeffding Decomposition
The Hajek projection gives a best approximation by a sum of functions of one X; at a time. The approximation can be improved by using sums of functions of two, or more, variables. This leads to the Hoeffding decomposition.
Projections
1 58
B ecau se a projection onto a sum of orthogonal spaces i s the sum of the projections onto the individual spaces, it is convenient to decompose the proposed projection space into a su m of orthogonal sp aces . Given independent variables X� o , X, and a su bset A C { I • . . . • 11 }. let 11" denote the set of all squ are-integrable mndom variables of the ty� • • •
g,.. (X, : i
E
A).
for measurable fu nctions g,.. o f lA I argument� such that
E(gA (Xl : i E A ) I Xj : j E B) = 0,
every
B : IBI
<
l A I.
(Define E(T ) �) = E T. ) By the independence of X 1 , • • • • X, the condi tion in the last d i sp lay is automatically valid for any B c { I . 2 , "} that docs not contain A. Con sequently, the spaces HA. when A ranges o\'er all subset� of { I , n}, are pairwise orthogonal. Stated in i ts present fonn. the condit ion reflects the i n ten ti on to bu ild ap proximations of increasing complexity by project ing a gi\'en vari able in tum onto the spaces • . . .
. . . •
[ 1 ).
....
where 81i J € Hr; J . gli.jl e 1/ri. J t . - and so forth. Each new space is chosen orthogonal to the precedi ng s paces . Let P,. T denote the project ion of T onto HA . Then, by the orthogonality of the //" , th·e projection onto the sum of the fi rst r spaces is the sum 2:1,.. 1 �, PA T of the projections onto the i nd ivi du al spaces. The projection onto the sum of the fi rst two spaces is the Hajek projection. 1\-fore general ly. the projections of zero, first. and seco nd order can be seen to be
PttT = ET, Plit T = E ( T I X; ) - ET, PII.J I T = E(T I X; . Xj ) - E ( T I X1> - E(T I X1 ) + ET. Now the general fonnula given by the fol lowing lemma should n ot be surprisi ng. Lemma. Let X 1 ,
1 1.1 1
, Xn be independelll random \'a riables, and let T be till arbi· ET2 < oo. Th en the projection of T onto HA i.r gi,·en by
• • •
trary random ''ariable u.oith
PA T = L: C - 1 )1"1-IBIE(T I X, : i
E
B ).
BCA
If T .L 1/8 for every subset
B C A
of a given set
quently. the sum of the .'ipaces H8 with
(Xt : i
Proof.
E
A).
BCA
A.
then E( T I X1 : i e
A)
=
0. Collse of
contains all square- imegrab/efunctions
8A (X; : i E A) to g,.. . By the inde B) = E( T I A n B) for every su bsets A
Proof. Abbreviate E(T | X_i: i ∈ A) to E(T | A). By the independence of X1, ..., Xn it follows that E(E(T | A) | B) = E(T | A ∩ B) for every subsets A and B of {1, ..., n}. Thus, for P_A T as defined in the lemma and a set C strictly contained in A,

E(P_A T | C) = Σ_{B⊂A} (−1)^{|A|−|B|} E(T | B ∩ C) = Σ_{D⊂C} Σ_{j=0}^{|A|−|C|} (−1)^{|A|−|D|−j} (|A|−|C| choose j) E(T | D).

By the binomial formula, the inner sum is zero for every D. Thus the left side is zero. In view of the form of P_A T, it was not a loss of generality to assume that C ⊂ A. Hence P_A T is contained in H_A. Next we verify the orthogonality relationship. For any measurable function g_A,

E(T − P_A T) g_A = E(T − E(T | A)) g_A − Σ_{B⊂A, B≠A} (−1)^{|A|−|B|} E E(T | B) E(g_A | B).

This is zero for any g_A ∈ H_A. This concludes the proof that P_A T is as given.

We prove the second assertion of the lemma by induction on r = |A|. If T ⊥ H_∅, then E(T | ∅) = ET = 0. Thus the assertion is true for r = 0. Suppose that it is true for 0, ..., r − 1, and consider a set A of r elements. If T ⊥ H_B for every B ⊂ A, then certainly T ⊥ H_C for every C ⊂ B. Consequently, the induction hypothesis shows that E(T | B) = 0 for every B ⊂ A of r − 1 or fewer elements. The formula for P_A T now shows that P_A T = E(T | A). By assumption the left side is zero. This concludes the induction argument.

The final assertion of the lemma follows if the variable T_A := T − Σ_{B⊂A} P_B T is zero for every T that depends on (X_i: i ∈ A) only. But in this case T_A depends on (X_i: i ∈ A) only and hence equals E(T_A | A), which is zero, because T_A ⊥ H_B for every B ⊂ A. ∎
If T = T(X1, ..., Xn) is permutation-symmetric and X1, ..., Xn are independent and identically distributed, then the Hoeffding decomposition of T can be simplified to

T = Σ_{r=0}^{n} Σ_{|A|=r} g_r(X_i: i ∈ A),

for

g_r(x1, ..., xr) = Σ_{B⊂{1,...,r}} (−1)^{r−|B|} E T(x_i (i ∈ B), X_j (j ∉ B)).

The inner sum in the representation of T is for each r a U-statistic of order r (as discussed in Chapter 12), with degenerate kernel. All terms in the sum are orthogonal, whence the variance of T can be found as var T = Σ_{r=1}^{n} (n choose r) E g_r²(X1, ..., Xr).
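As an added illustration (not in the original text), the following Python sketch computes the Hoeffding decomposition of a small permutation-symmetric statistic of three i.i.d. fair coin flips by exact enumeration, and checks that the pieces reconstruct T and that their variances add up to var T; the statistic T = (X1 + X2 + X3)² is an arbitrary choice.

import itertools
import numpy as np

outcomes = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)  # 8 equally likely points
T = outcomes.sum(axis=1) ** 2                   # T = (X1 + X2 + X3)^2, permutation symmetric

def cond_mean(values, coords):
    # E(T | X_i : i in coords), computed exactly by averaging over the other coordinates.
    out = np.empty(len(outcomes))
    for k, row in enumerate(outcomes):
        mask = np.all(outcomes[:, coords] == row[coords], axis=1) if coords else np.ones(8, bool)
        out[k] = values[mask].mean()
    return out

# Projections P_A T for all subsets A of {0, 1, 2} via the inclusion-exclusion formula.
parts = {}
for r in range(4):
    for A in itertools.combinations(range(3), r):
        P = np.zeros(8)
        for s in range(len(A) + 1):
            for B in itertools.combinations(A, s):
                P += (-1) ** (len(A) - len(B)) * cond_mean(T, list(B))
        parts[A] = P

total = sum(parts.values())
print(np.allclose(total, T))                    # the projections add up to T
print(np.allclose(sum(np.mean(P ** 2) for A, P in parts.items() if A), T.var()))  # variances add up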
Notes Orthogonal projections in H ilbert spaces (complete inner product spaces) arc a classical sub ject in fu nctional analys is. We have limited our discussion to the li ilbert space L2{Q, U. P) of all square-integrable random variables on a probabi lity space. Another popu lar method to
Projections
1 60
introduce conditional expectation is based on the Radon-Nikodym theorem. Then E(X I Y) is naturally defined for every integrable X. Hajek stated his projection lemma in [68] when proving the asymptotic normality of rank statistics under alternatives. Hoeffding [7 5] had already used it implicitly when proving the asymptotic normality of U -statistics. The "Ho effding" decomposition appears to have received its name (for instance in [15 1]) in honor of Hoeffding's 194 8 paper, but we have not been able to find it there. It is not always easy to compute a projection or its variance, and, if applied to a sequence of statistics, a projection may take the form L gn (Xi ) for a function gn depending on n even though a simpler approximation of the form L g (Xi ) with g fixed is possible.
PROBLEMS 1. Show that "projecting decreases second moment" : If S is the projection of T onto a linear space, 2 then ES :=:: ET 2 • If S contains the constants, then also var S :=:: varT.
2. Another idea of projection is based on minimizing variance instead of second moment. Show that var(T - S) is minimized over a linear space S by S if and only if cov( T - S , S) = 0 for every S E S .
3 . I f X :=::
Y almost surely, then E ( X I Z )
::::_ E ( Y I Z ) .
4 . For an arbitrary random variable X :=:: 0 (not necessarily square-integrable), define a conditional expectation E(X I
Y) by limM-+oo E(X 1\ M I Y).
(i) Show that this is well defined (the limit exists almost surely) . (ii) Show that this coincides with the earlier definition i f EX 2 < oo .
(iii) I f E X
<
oo show that E(X - E ( X I
(iv) Show that E ( X 1
Y) )g (Y)
= O for every bounded, measurable function g.
Y) is the almost surely unique measurable function of Y that satisfies the
orthogonality relationship of (iii) .
How would you define E(X I
Y)
for a random variable with E I X I
<
oo ?
5 . Show that a projection S o f a variable T onto a convex set S is almost surely unique. 6. Find the conditional expectation E(X I
Y) if (X, Y) possesses a bivariate normal distribution.
7. Find the conditional expectation E(X1 I X(n) ) if X 1 , . . . , Xn are a random sample of standard uniform variables. 8. Find the conditional expectation E(X 1 I X n ) if X 1 , . . . , Xn are i.i.d.
9. Show that for any random variables S and T (i) sd(S + T) :=:: sd S + sd T , and (ii) I sd S - sd T I :=:: sd(S - T ) .
10. I f Sn and Tn are arbitrary sequences o f random variables such that var(Sn - Tn ) /varTn then Tn - ETn
---
Moreover, varSn fvarTn
11. Show that PA h (Xj : Xi
---+ E
sd Tn
1 . Show this.
p ---+
0.
B ) = 0 for every set B that does not contain A .
---+
0,
12 U-Statistics
One-sample U -statistics can be regarded as generalizations of means. They are sums of dependent variables, but we show them to be asymptoti cally normal by the projection method. Certain interesting test statistics, such as the Wilcoxon statistics and Kendall 's T -statistic, are one-sample U -statistics. The Wilcoxon statistic for testing a difference in location between two samples is an example of a two-sample U-statistic. The Cramer-von Mises statistic is an example of a degenerate U-statistic.
12.1
One-Sample U-Statistics
Let X 1, . . . , X n be a random sample from an unknown distribution. Given a known function h, consider estimation of the "parameter"
In order to simplify the formulas, it is assumed throughout this section that the function h is permutation symmetric in its r arguments. (A given h could always be replaced by a symmetric one.) The statistic h ( X 1 , , X r ) is an unbiased estimator for 8 , but it is unnatural, as it uses only the first r observations. A U-statistic with kernel h remedies this; it is defined as 1 .
u= )
C
•
.
� h (Xf31 ' . . . ' Xf3J ,
where the sum is taken over the set of all unordered subsets f3 of r different integers chosen from { 1 , . . . , n}. Because the observations are i.i.d., U is an unbiased estimator for 8 also. Moreover, U is permutation symmetric in X 1 , . . . X n , and has smaller variance than h ( X 1 , . . . , X r ). In fact, if Xc l) , Xcnl denote the values X 1 , . . . , Xn stripped from their order (the order statistics in the case of real-valued variables), then •
.
.
.
Because a conditional expectation is a projection, and projecting decreases second mo ments, the variance of the U -statistic U is smaller than the variance of the naive estimator h (X 1 , . . . , X r ) .
161
1 62
U-Statisrks
In this section it is shown that the sequence J,i(U - 6) the condition lh3t Elr 2 (X. , . . . , X, ) < oo.
I is a mean , - • 'L?.11l (X;). The asserted asymptotic no nn al i ty is then just the central limit theorem. 0
1 2.1
Example. A
U-statistic of d eg ree r
is a sy m ptot i cal l y nonnal under
=
4<x•
Example. For the kernel h (.t� w X:! ) = - xz)2 of degree 2, the pammetcr 0 = Elz(X 1 , X2) = var X 1 is th e variance of the observations. The corres pon d ing U -statistic can be calc u lated to be 1 2.2
Thus, the sample variance
is a
U -s ta t i s ti c of order 2.
0
The asymptotic normality of a sequence of U -statistics, if 11 --.. oo '"' ith the kernel rem a i ni ng fixed, can be established by the projection method. The projection of U - 9 onto the set of all statistics of the form L:�... 1 .�; (X1) is given by·
where the function II 1
is
given by
h1 (:r) = Eh(x, X2
• . . .
, X,) - 0.
The first equ3Jity in the formula for {; is t he IUjck projection principle. The second equality
is established in the proof below. The sequence of projeCtions {; is a�ymptolically nomtal by the central limit thcorein, provided Elri(X 1 ) < oo. The difference between U - 0 and its projection is asymptotically negligible. Theorem.
/f Eir2(X � o . . . ,
.jii( U - .0
D) ..!'. 0.
Consequelll(\� the :requence ,Jii( U - 0) is asymptotically normal with mean 0 and \'llriance r2(', , where, with X l ' • • • ' Xr. XI ' . . . ' x; denoting i.i. cl. \"Clriab/es, 1 2.3
(I
=
Xr)
cov (h ( X l ,
< oo,
tlzetr
-
X2, . . . , X,). h(X 1 , x; . . . . . x;)).
\Ve first veri fy the fonnula for the projection U. It su ffi ce s to show that E(U () I X1) = h 1 (X1 ). By the independence of the observations and pcnnutation symmetry of Proof. Jr.
E (h (XtJ• • • • • • X,,, ) - 8 I Xi
=X
)
=
{
h l (:c) 0
if i e {J if i f. fJ.
To ca l c u l ate E(U - 0 1 X;). we take the average over all {3. Then the first case occurs for (� : :> of the vectors fJ in the definition of U. The factor rfn in the fonn u J a for the projection
(J arises as rIn = (� : l )/(�).
12. 1
163
Ollt'-Samplt' U-Sraristirs
The projection (J has mean zero, and \'ariancc equal to .,
"
var U
r·
=
.,
- Ehj ( X I ) II
rl
= -
n
J
E{lt(x. X2
• • • • •
X,) - 9)Eir (x.
X2
r2
• . . . •
x: ) dPx. (x) = -�� · n
finite, the sequence .Jii D converges weakl y to the N(O. r2�1 )-distribution limit theorem. By Theorem 1 1 .2 and Slutsky's lemma, the sequence �(U 0 - U) converges in probability to zero, provided \'ar Ufvar 0 -+ I . In view of the permutation symmetry of the keme I h. an expression of the type CO\' (Ir (Xp1 • • • • • X� ), h(Xf.l; • • • • , Xf.!; )) depends· only on the number of vari abl es X, that arc common to Xp1 , • • • Xp, and X11; • • • • , Xp; . Let ('(" be this covariance if c variables are in common. Then Because this
is
by the central
,
va r U
=
=
(; )
-2
L L cov (lr ( XJf,
• . . .
, Xp, ), /r ( X11j •
(;t�b (�) (; =�) �,.
• • • Xf.l; )) •
The last step follows, because a pair ({3. /J') with c indexes in common can be chosen by first choosing the r indexes in {J. next the c common indexes from {J. and finally the re maining r - c indexes in /J' from { I • . . . • n } f1. Th e expression can be simpl ified to
-
� var U = � r•l
r !2
,
c ! (r - c) !..
(n - r)(11 - r - I )
· ·
· (n -
2r + c + I )
n (n - I ) · · · (II - r + I }
,
(
.
.
sum the first term is 0 ( 1 /n), the second term is 0 ( 1 /11 2 ), and so forth. Be cause n t i mes the first tem1 converges to r2('1 • the desired limit result \'ar U f\'ar D � In this
follows.
•
Example (Sig11cd rank statistic). The parameter 0 = P(X 1 + X2 > to the kemel lz (:r1 • x2) = I {x1 + x1 > 0} . 1lte co rrespo nd i n g U-statistic is 1 2 .4
u
-( ) L L l {X, I
=
"
2
I <J
+
Xj
>
0) corresponds
0).
TI1is statistic is the average number of pairs (X, , X1 ) with positive sum X; + X1 > 0. and can be used as a test statistic for invcstig•tting whether the distribution of the observations is located at zero. If m any pairs (X, . X1 ) y ie l d a positive sum (relative to the total number of pairs), then we have an indication that the distribution is centered to the right of zero. The sequence .jii( U - 0) is asymptoticalJy nomtal with mean zero and variance 4{1 • I f F denotes the cumulative distribution function of the observations. then the projection of U - 0 can be written
TI1is formula is useful in subsequent discussion and is also convenient to express the asymp totic variance in F.
1 64
U-Statistics
The statistic is particularly useful for testing the null hypothesis that the underlying distribution function is continuous and symmetric about zero: (x) x) for every Under this hypothesis the parameter e equals e and the asymptotic variance reduces to 4 var because is uniformly distributed. Thus, under the null hypothesis of continuity and symmetry, the limit distribution of the sequence Jn ( U is normal N (O, independent of the underlying distribution. The last property means that the sequence is asymptotically distribution free under the null hypothesis of symmetry and makes it easy to set critical values. The test that rejects H0 if ,J3,i(U :=:: Za is asymptotically of level for every in the null hypothesis. This test is asymptotically equivalent to the signed rank test of Wilcoxon. Let ..., K; denote the ranks of the absolute values of the observations: k means that is the kth smallest in the sample of absolute values. More precisely, Then Suppose that there are no pairs of tied observations :S: the signed rank statistic is defined as 0} . Some algebra shows that
X.
F(X 1 ) = 1/3, 1/3), Un
= 1/2,
F(Xd
F = 1 - F (-
- 1 /2)
1/2) Rt, Rt = Rt = X; = X1.
F
a
l XII , I Xn I •
I X; I L�= I 1{1X1 I I X; 1} .
.
.
.
w+ = L 7= I Rt1{X; > w+ = (;) � 1{X; > O}. u+
The second term on the right is of much lower order than the first and hence it follows that ""' N(O, D
n -312 (W+ - EW+ )
1/12) .
Example (Kendall's r ).
The U -statistic theorem requires that the observations X I, . . . , Xn are independent, but they need not be real-valued. In this example the observations are a sample of bivariate vectors, for convenience (somewhat abusing notation) written as (XI, YI), . . . , (Xn, Yn) · Kendall 's T -statistic is 4 r= I: I: l {(Y1 - Y;)(XJ - X;) > 0 } n (n This statistic is a measure of dependence between X and Y and counts the number of concordant pairs (X;, Yi ) and (X 1, Y1) in the observations. Two pairs are concordant if the indicator in the definition of r is equal to 1. Large values indicate positive dependence (or concordance), whereas small values indicate negative dependence. Under independence of X and Y and continuity of their distributions, the distribution of r is centered about zero, and in the extreme cases that all or none of the pairs are concordant r is identically 1 or - 1,Therespectively. statistic r 1 is a U -statistic of order 2 for the kernel 12.5
- l)
l <]
- 1.
+
1 - 2P((Y2 - YI)(X2 - XI) > F1(x, = P(X
Hence the sequence Jn(r + o)) is asymptotically normal 4 with mean zero and variance si · With the notation y) < x, < y) and p r (x , y) y y), the projection of u - e takes the form
= P(X > X, >
Y
X and Y are independent and have continuous marginal distribution functions, then Er = 0 and the asymptotic variance 4si can be calculated to be 4/ 9, independent of the If
12.2
Two-Sample U-statistics
1 65
y
()
concordant X
Figure 12.1. Concordant and discordant pairs of points .
marginal distributions. Then ylrir -v--7 N (0, 4/9) which leads to the test for "independence": Reject independence if .fti174 1 r I > Za/2 · D 12.2
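The following Python sketch (an added illustration, not from the original text) computes Kendall's τ by counting concordant pairs and applies the asymptotic N(0, 4/9) approximation for the independence test; the data-generating choices are arbitrary, and under independence the test should reject only rarely.

import itertools
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
y = rng.normal(size=n)                     # independent of x, so the null hypothesis holds

conc = sum((y[j] - y[i]) * (x[j] - x[i]) > 0
           for i, j in itertools.combinations(range(n), 2))
tau = 4 * conc / (n * (n - 1)) - 1

z = np.sqrt(9 * n / 4) * abs(tau)          # compare with the normal critical value
print(tau, z, z > 1.96)                    # rejection at level 0.05 is rare under independence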
Two-Sample U-statistics
Suppose the observations consist of two independent samples X 1 , . . . , Xm and Y1 , , Yn , i.i.d. within each sample, from possibly different distributions. Let h (x 1 , . . . , Xr , y 1 , . . . , Ys ) be a known function that is permutation symmetric in x 1 , . . . , Xr and Y l , . . . , Ys separately. A two-sample U -statistic with kernel h has the form .
•
.
where and f3 range over the collections of all subsets of r different elements from {1, , m } and of s different elements from { 1 , . . . , n } , respectively. Clearly, U is an unbiased estimator of the parameter a
2, . . .
2,
The sequence Um, n can be shown to be asymptotically normal by the same arguments as for one-sample U -statistics. Here we let both m ---+ oo and n ---+ oo, in such a way that the number of Xi and Y1 are of the same order. Specifically, if N = m + n is the total number of observations we assume that, as m, n ---+ oo,
m
n
0 < A < l. - ---+ A ' - ---+ 1 - A ' N N To give an exact meaning to m , n ---+ oo, we may think of m = mv and n = n v indexed by a third index v E N. Next, we let mv ---+ oo and nv ---+ oo as v ---+ oo in such a way that mv/ Nv A. The projection of U -e onto the set of all functions of the form L�=l ki (Xi ) + L: ) = 1 l1 ( Y1 ) is given by ---+
1 66
U.Staristics
where the fu nctions 11 1 •0 and h0• 1 h � ,o(.t)
=
ho. J (y)
=
are
defined
by
Elt (x, X2 . . . . , X, , Y� t . . . . Y" ) - 0, Eh (X� . , X,, y, Y2· · · · · Y, ) - 0. • • •
This follows, as before, by first ap p ly i ng the Hajek projection lem m a , and next expressing
E( U I X; ) and E( U I Yj ) in the kernel fu nction. If the kernel is square-integrable, then the sequence {J is asymp tot i caJJy nonnal by the centml Ji mit theorem. The di fference between {J a nd U - 0 is asymptoticaJJy neg l ig i b le . 1 2.6 Theorem. I/ Eh2(X1 , • • • , X, , Y1 • • • • , Y.. ) < oo. then the sequence Jil(U 8 0 ) com·erges in probability· to zero. · consequellll)� the sequence Jil(U - 0) £·om·erges in di.t:tribution to the non1wl /aw with meall zero mu/ J.•ariance r2��,0/). + s2("o. d( l - .A.). ,....here. with the X; being i. i.d. l'ariable:r illdepelldent of the i.i.d. l'llritlbles r,. -
�'".J
=
cov (hCX� t
. • •
-
, X, Y� t • • • , Y.. ) .
h ( X. , • . • • Xr ,
x;+l • . . . . x; . r. . . . . . YJ, Y�+ . . . . . . r;>).
Proof. The argu m e nt is sim i l ar to the one given previously for one-sample U - st at i st ic s. The variances of U and it s proje cti o n are given by
var U
=
(;);(:)2 t.t. (;) (�) ('; ::;) (�) W (;=�) �··.d·
It can be checked fro m this that both the sequence Nvar D and the sequence Nvar U co nverge to the number r 2 �1 •0!). + s2�o. J /( I - l). • 12.7
Example (t.famr- U1titllej' statistic). Th e kcmcJ for the
lt (x, y) = I ( X ::;
parameter 0 = P(X ::; Y) is Y }, which is of order I i n both x and )'. The correspondi ng U - s tat i M i c is
The statistic mn U is known as the },famr- \\'ltimey statistic a nd is used to test for a di fference
i n location between the two sa mp l e s . A large value indicates that the Y1 arc '"stochastically larger" than the X, . If the X1 and r1 have c u m ul ati ve distribution functions F a nd G. resp ec t i vel y, then the p rojec t i on of U - 0 ca n be written
It is easy to obrain the limir distribur ion of the projections iJ (and hence of U) from this for mula. In particular. under the null h ypothesis that the poole d sam p le X 1 • • • • • Xm , r• . . . . . Y, is i.i.d. with conti nuous dislribu tion fu nction F = G. the sequence Jl2imz/N(U - 1 /2)
12.3
Dt!gmt!mtt" U-Swrisric.f
1 67
converges to a standard nonnal distribution. (The pammetcr equals 0 = 1 /2 and {0• 1 = {J.o = 1 / 1 2.) If no observations in the pooled sample arc tied. then mnU + 4n
* 1 2.3
Degenerate U-Statistics
A
sequence of U-statistics (or, better, their kernel fu nction) is cal led degenerate if the aliymptotic va ri a nce r2{1 (fou nd in Theorem 1 2.3) is zero. The fonnula for the variance of a U -statistic (in the proof of Theorem 1 2.3) shows that var U is of the order ,- r if {1 = · · · = �r- l = 0 < {c· In this case. the sequence n''f2(U11 - 0) is asymptotically tight. In this section we derive its limit distribution. Consider the Hoeffding decomposition as discussed in section 1 1 .4. For a U -Matistic U11 with kernel It of order based on observations X 1 X,• • this can be simpl ified to
r,
U,.
'
=
•
1
L L (" L PA h ( Xtfp I A I=(' )
c=O
IIere, for each 0 !::
r
c
!::
Jf
•
•
•
•
Xifr ) =
•
• • •
L '
(r) £'
U, , r
(say).
c·=O
r. the variable U,,,. is a U -statistic" of order c with kernel h r(X • ·
. • • •
X(' )
=
Pn .. ... (·th( X 1 , .
• • •
X, ).
To sec this� fix a se t A with c clements. Because the space HA is orthogonal to all functions g(Xj : j e B) (i.e the space LccB lie) for every set B that docs not contai n A, the projection PA/z(Xp1 Xp. ) is zero unless A c fJ = (fJ1 fJ, }. For the remaining p the projection PA h ( Xp 1 • • • • • XI'. ) docs not depend on jJ (i.e on the r - c clements of jJ - A) and is a fixed fu nction It,. of ( X1 : j e A). This fol lows by symmetry. or expl icitly from the fommla for the projections in section 1 1 .4. The function It(" is indeed the fu nction as given previously. There arc (� : �) vectors {J that contain the set A. The claim th at U,• . r is a U-statistic with kernel hr: now follows by simple algebra. using the fact that (� : �-)/(�.><:> = 1 /(�). By the defining properties of the space 1/f l.�c·J· it follows that the kemel /r� is degenerate for c � 2. In fact. it is strongly degenerate in the sense that the conditional expectation of hc(X1 X�) given any strict su bset of the variabks X 1 • • • • • X(' is zero. In other words. the integral J h(:c. X2 . , Xc> d P(:c) with respect to any single argument vanishes. By the same reasoni ng. U,,,. is u ncorrclatcd with every measurable fu nction that depends on strictly fewer than ,. elements of X I X11 • V.'e shall show that the sequence n''/2 Un.(' converges in d istribution to a l i mit with variance c! FJz;(x 1 • • • • • X(') for every c � I . Then it follows that the sequence n('12 (U,, - 0) converges i n distribution for c equal to the smallest value such that /r(' ¢ 0. For c � 2 the limit distribution is not nonnal but is known as Gaussian chaos. Because the idea is simple, bu t th� statement of the theor�m (apparently) necessarily compl icated, first consider a special case: c = 3 and a "'product kernel .. of the. form .•
•
• • • •
•
• • • •
.•
•
•
•
•
•
•
• •
•
• • • •
168
U-Statistics
A U -statistic corresponding to a product kernel can be rewritten as a polynomial in sums of the observations. For ease of notation, let Pn i = n - 1 L.7= 1 i(Xi ) (the empirical mea sure), and let Gn i = Jll Wn - ) i (the empirical process), for the distribution of the observations X 1 , . . . , X n . If the kernel h 3 is strongly degenerate, then each function ii has mean zero and hence Gn fi = JllPn ii for every i . Then, with (i 1 , i2, i 3 ) ranging over all triplets of three different integers from { 1 , . . . , n } (taking position into account),
P
P
Pi
By the law of large numbers, Pn � almost surely for every i, while, by the central limit theorem, the marginal distributions of the stochastic processes i 1---+ Gn i converge weakly to multivariate Gaussian laws. If G i : i E L 2 (X, A, } denotes a Gaussian process with mean zero and covariance function EGiGg = (a ?-Brownian bridge process), then Gn G. Consequently,
{
P) Pig - Pi Pg
-v-+
3
The limit is a polynomial of order in the Gaussian vector (GJI , Gf2, Gf3) . There is no similarly simple formula for the limit of a general sequence of degenerate U statistics. However, any kernel can be written as an infinite linear combination of product kernels. Because a U -statistic is linear in its kernel, the limit of a general sequence of degenerate U -statistics is a linear combination of limits of the previous type. To carry through this program, it is convenient to employ a decomposition of a given kernel in terms of an orthonormal basis of product kernels. This is always possible. We assume that L 2 (X, A, is separable, so that it has a countable basis.
P)
12.8
P),
Example (General kernel).
If 1
=
io , JI , f2, . . . is an orthonormal basis of L 2 (X, ike with (k 1 , . . , kc) ranging over the nonnegative c c c
x A, then the functions ik 1 x integers form an orthonormal basis of L 2 (X , A , p ). Any square-integrable kernel can be written in the form hc (X 1 , . . . , Xc) = L a (k 1 , . . . , kc ) fk 1 X x ike for a(k 1 , . . . , kc) = (h e , ik 1 x x ike ) the inner products of h e with the basis functions. D ·
·
·
.
·
·
·
·
·
·
In the case that c = 2, there is a choice that is spe cially adapted to our purposes. Because the kernel h 2 is symmetric and square-integrable by assumption, the integral operator K : L 2 (X, A, 1---+ L 2 (X, A, defined by Ki (x) = J h 2 (x , y) i (y) d ( y ) is self-adjoint and Hilbert-Schmidt. Therefore, it has at most count ably many eigenvalues A o , A 1 , , satisfying L_ A� and there exists an orthonormal basis of eigenfunctions io , i1 , . . . . (See, for instance, Theorem VI. 1 6 in [ 1 24].) The kernel h 2 can be expressed relatively to this basis as 12.9
Example (Second-order kernel).
P
P)
P)
.
•
< oo,
.
h 2 (x , y)
=
00
L A k ik (x) fk (y) . k =O
For a degenerate kernel h 2 the function 1 is an eigenfunction for the eigenvalue 0, and we can take io = 1 without loss of generality.
12.3
169
Degenerate U-Statistics
The gain over the decomposition in the general case is that only product functions of the type f x f are needed. D The (nonnormalized) Hermite polynomial H1 is a polynomial of degree j with leading coefficient xi such that J Hi (x) H1 (x) ¢ (x) dx = 0 whenever i 1- j . The Hermite polyno mials of lowest degrees are H0 = 1 , H1 (x) = x , H2 (x) = x 2 - 1 and H3 (x ) = x 3 - 3x . Let he : xe 1--+ 1Ft be a permutation-symmetric, measurable function of c arguments such that Eh� (X 1 , . . . , Xe) < oo and Ehe (X l , . . . , Xe- 1 , X c ) 0. Let 1 = fo , fi , h , . . . be an orthonormal basis of L2 (X, A., P). Then the sequence of U statistics Un,e with kernel h e based on n observations from P satisfies d (k ) X he, X ik f n e/ 2 Un , e kt . . . ( J n Hai (k) (!Gl/fi (k)) . i= l 12.10
Theorem.
=
-v'-7
Here !G is a P -Brownian bridge process, the functions 1/f 1 (k) , . . . , lfd (k ) (k) are the different elements in fk l ' . . . , ike' and ai (k) is number of times lfi (k) occurs among fk 1 , , fke · The variance of the limit variable is equal to c ! Eh� (X 1 , . . . , Xe). •
•
•
The function he can be represented in L2(Xc , A.e , P e ) as the series L k ( he , fk 1 x . . . , fkJ fh x · · · x ike . By the degeneracy of he the sum can be restricted to k = (k 1 , . . . , ke) with every k1 ::::_ 1 . If Un , c h denotes the U-statistic with kernel h (x 1 , . . . , Xe) , then, for a pair of degenerate kernels h and g,
Proof.
cov(Un ' eh, Un ' c g)
=
c! P c hg. n (n - 1) · · · (n - c + 1)
This means that the map h 1--+ n e f2 �cf Un,eh is close to being an isometry between L 2 (P c ) and L 2 (P n ) . Consequently, the series Lk ( he, fk 1 x · · · x fk) Un,efk1 x · · · x fke con verges in L 2 (P n ) and equals Un, c h c = Un,e · Furthermore, if it can be shown that the finite-dimensional distributions of the sequence of processes { Un,efk1 x · · · x fke : k E r�:n converge weakly to the corresponding finite-dimensional distributions of the process { n:�k{ Hai (k) (!Gl/fi (k)) : k E then the partial sums of the series converge, and the proof can be concluded by approximation arguments. There exists a polynomial Pn,e of degree c, with random coefficients, such that
P�F},
(See the example for c = 3 and problem 12. 1 3). The only term of degree c in this polynomial is equal to !Gn fk1 !Gn fkz · · · !Gn fke . The coefficients of the polynomials Pn,e converge in probability to constants. Conclude that the sequence n e 2 c ! Un,e fk1 x · · · x ike converges in distribution to Pc (!G fkp . . . , !G fkJ for a polynomial Pc of degree c with leading term, and only term of degree c, equal to !G fk1 !G fk2 !G ike . This convergence is simultaneous in sets of finitely many k. It suffices to establish the representation of this limit in terms of Hermite polynomials. This could be achieved directly by algebraic and combinatorial arguments, but then the occurrence of the Hermite polynomials would remain somewhat mysterious. Alternatively,
1
•
•
•
U-Sratb:rics
1 70
the representation can be derived from the defi n ition of the Hermite polyno m ial s and co variance calculations. By the degeneracy of the kerne l J,.. , x x fi:. . the U stati stic U,,(·/t, x · · · x /t, is orthogonal to all me as u rable fu nc t i o ns of c.· - I or fewer dements of X I • • X,. . and t he i r linear combinations. Th i s includes the functions n; (Gn.'.'r ) for a rbh ra ry fun ct ions g; a n d nonn egati ve i nt egers a; w it h L: a, < c. Tak ing l i m i t s. we con c l u de that PcCGJ,., , . Gft.. ) mu s t be orth ogon a l to every polyno m i al in Gjj, 1 • • • • Gfs.. of degree Jess than c - I . By th e orthonormal ity of the basis f, , the variables G f, are i ndepe nde n t s£andard n orm a l variables. Because the Hermite polynomials fonn a b�1sis for the po lyn om i al s in one variable, their ( tensor) prod ucl" fo rm a bas is fo r the polynomials o f more than one argument. The polyno m i al P,. can be wri tten as a J incar combination of c l e m e n ts from this bas is. By the orthogonality. t he cocfficienrs of base clements of d eg ree < ( vanish. From the base elcmenl" of deg ree c.· only the product as in the t heorem can occ ur, as follows from considerat ion of the lead i ng tenn of P.· . • •
·
•
..
"'
0 0 .
o • •
0
'
1 2. 1 1 Example. For c = 2 and a basis I hz, we obtain a l i mi t of the form L,,.
s ta nd ard nonnal variables.
1 2. 1 2
= x
(Zf -
0
/o. /1 • • • • of eigenfunctions of the kernel fd lll.(GJJ J . By the orth o normali ty of t h e I ) for Z 1 • Z:z • • • . a sequ e nce of independent
Example (Samp le �·aria11ce). The kernel h (x 1 • x2 )
=
� (x1 -x2 ) 2 yields the samp le
v ari ance s,; . Because this has asymptot i c vari anc e 11� - Jl � (sec Examp l e 3.2). t he kernel is dege ne ra te if and on ly if JA� = Jt3. Th i s can happen onl y if ( X 1 - a1 )2 is con st a n t . for i:r t = EX, . If we center the observations. so th at 0'1 = 0. then t hi s means that X 1 only t3kes the va lu es -u and a = .j'iG. each with probability J /2. Th i s is a very degenerate si tuat ion. and ir is easy to find t h e limit distribution directly. but perhaps it is i nstructive to apply the ge ne ral theorem. The kemels "� take the forms (Sec s ect io n I J .4).
flo =
E! < X t - X2) � = a2•
ll . c.r.> = E! <x1 - X2)2 - u :! .
h1(x, , .\'.2)
=
! (xr - X;! )2 - E� (x, - X:!)2 - E� ( X 1 - X:! ):! + a 2 •
The ke mel is degenerate if h 1
lt�(x 1 . x2) =
! Cx1 - x2 ) 2. -
=
u '!..
0 a l m os t �urcJy,
an
d
then th� seco nd-orde r kernel is
B L-ca u sc the u n derlying di�tribution has on ly two sup
o , the eigenfunctions f of the corresponding in tegral operator ca n be iden ti fied w ith vectors (/(-a ). f
degeneracy the kernel allows the decomposition
\Ve can conclude rhar lh� sequence
1 2. 1 3
F..xamp le
n(s,; - Jl � ) con ve rge s in dis t ribution to -a2(Zf - I ).
(Cramir-l·on JUises).
distribution function of a random sa mple
0
Let F, (.r ) , - r L::'= 1 I { .\', � .d be the e mp i.r i ca l X 1 , • • • , X,, of r�;•l-valued nrndom \'ariahlcs. The =
Cmmer-\�m /.-fi.\'t!S statistic for tes t i n g the (nu l l ) hypothes i s t ha t t h e unde rly i n g cumu lative
Probl�m.\' distribution is a given
171
function F is given by
The double sum restricted to the off-diagonal terms i s a U-stat istic. with. under llo. a degenerate kernel. Thus, this s ta t i st i c converges to a nondcgcnc mte li mit distribution. The di agon a l tenns contribute the constant J F( I - F) d F to the limit distribution. by the law of large numbers. If F is uni form. then the kernel of the U-.srati�tic is
h(x. y)
=
�x2 + !Y:! - x v y + . �·
To find the e ige nva l u e s of the corresponding integ ral operator K . we differentiate the identity Kf
=
l�f twice, to find the eq uation .A..f"+ f
=
f f(s) ds.
Because the kerne l is degenerate,
the constants arc eigenfunctions for the eigenvalue 0. The eigenfunctions corresponding to nonzero eigenvalues arc ort hogo nal to this e i ge nspace . whence J f(s ) cis = 0. The eq u a t i on >.f '' + f = 0 has solutions cos a.r und sin ax for a:! = ).- I . Rei nsert ing these in the original equ�ttion or uti lizing the relation J f(s) ds = 0, we find that the nonzero eigenvalues are equal to· j -2rr-2 for j e N, with eigenfunctions J2 cos rrj.t . Tims. 1he Cramer-Von �1ises s tat i s t i c converges in distribut ion to 1 /6 + I ). For an othe r derivation of the li mit distribution. sec Chapter I 9. D
E7=1 j-'!rr-2
Notes The main part of this chapter has its roots in 1he paper by 1-locffding [76] . Because the asymp t oti c variance is s ma l le r than the true vari ance of a U -stati stic. l loeffd ing recommends to apply a s ta nd ard nonnal approxi mation to (U - EU)/ sd U . Degene rate U-stati stics were con s i de red . among others. in [ 1 3 1 ) within the context of more general linear combina tions of symmetric kemds. A rco ne s and Gi ne (2 1 have s t u d i ed the we<1k c o nve rgence of ··u-processes··. stochastic processes i ndexed by classes of kernels. in spaces of bou nded functions as discussed i n Cha pte r 1 8. PROHLE!\·IS
1 . De: rive the asy mptotic d i M ri b ut ion of Giui'.v mt'atl tlij}""t' ft'IICt'. w h ich is deli ned as <2>- 1
IX, - X1 1.
L L; < 1
2. De rive the proj ec t i on of the sample variance.
3. Find
.a. Find
a kcrnd for the parametcr O a
=
E( X - EX) :l .
kernel for t h e p;u··o.� mclcr 0 = CU\'(X, }' ). Show lhat the corresponding U-stat isric is the .L:;'= 1 ( X� - X ) ( Y; - Y)/(u - I ).
sample covariance
=(2)- 1 L Li ..:J f Y1 - Y, HX1 - X, ). Let U, 1 and fl,2 be U -stat ist ics with ke rne l s II 1 and /1 :! · rc:-.p.:ctivc ly. Derive the joint a symptoti c
5. Find the limil di �tribution oi' U
6.
distribution of (U, I • Un�J.
7. Suppose EX r
Derive the asymptotic d is t ri bu ti on of the seq ue nce , - I L E,�1 X; Xj . Can y ou g i ve a two l i ne proof wi tho u t using the U-:\tat istk the o re m '! What happens i f EX 1 = 0'! <
oo .
8. (Mann's h.-st agaht!it trend.) To lc:st the: null h ypoth c:s i s that a sample X I • the alternative hypothes i s th;lt the di �tri bu t ions uf lhc
• • • •
X, is i.i.d. a�ain!.l
X; arc �roch;•srically increasi ng in i. �1ann
1 72
U-Statistics suggested to reject the null hypothesis if the number of pairs (Xi , Xi ) with i
is large. How can we choose the critical value for large n ?
<
j and Xi
<
Xi
9. Show that the U -statistic U with kernel 1 {x1 + x2 > 0} , the signed rank statistic w + , and the positive-sign statistic S = = 1 1 {Xi > 0} are related by w + = G)U + S in the case that there are no tied observations. 10. A V -statistic of order 2 is of the form n - 2 2::: = 1 :L'f = 1 h (Xi , Xi ) where h (x , y) is symmetric in x and y . Assume that Eh 2 (X1 , X 1 ) < oo and Eh 2 (X1 , X 2 ) < oo . Obtain the asymptotic distribution of a V -statistic from the corresponding result for a U -statistic.
2:::7
7
1 1 . Define a V -statistic of general order r and give conditions for its asymptotic normality.
12. Derive the asymptotic distribution of n ( s; - t.Jd in the case that M4 = f.J.,� by using the delta method (see Example 1 2 . 1 2). Does it make a difference whether we divide by n or n - 1 ?
13. For any (n
x
c) matrix aij we have
L ai , i
,
1
·
·
·
ai , c c
=
n
L fl ( - 1) I B I l ( B I - 1 ) ! L fl aij . -
i=1 }EB
B B EB
Here the sum o n the left ranges over all ordered subsets ( i 1 , . . . , i c ) of different integers from { 1 , . . . , n } and the first sum on the right ranges over all partitions B of { 1 , . . . , c} into nonempty sets (see Example [ 1 3 1]).
14. Given a sequence of i.i.d. random variables X 1 , X 2 , . . . , let An be the a -field generated by all functions of (X 1 , X 2 , . . . ) that are symmetric in their first n arguments . Prove that a sequence Un of U -statistics with a fixed kernel h of order r is a reverse martingale (for n :=:: r) with respect to the filtration Ar ::) Ar + 1 ::) · · · .
15. (Strong law.) If E j h (X 1 , · · · , Xr) J
< oo,
then the sequence
converges almost surely to Eh (X 1 , · · · , Xr ) . (For r
>
Un
of
U -statistics with kernel h
1 the condition is not necessary, but a
simple necessary and sufficient condition appears to be unknown.) Prove this. (Use the preceding
problem, the martingale convergence theorem, and the Hewitt-Savage 0- 1 law.)
13 Rank, Sign, and Permutation Statistics
Statistics that depend on the observations only through their ranks can be used to test hypotheses on departuresfrom the null hypothesis that the ob servations are identically distributed. Such rank statistics are attractive, because they are distribution-free under the null hypothesis and need not be less efficient, asymptotically. In the case of a sample from a symmetric distribution, statistics based on the ranks of the absolute values and the signs of the observations have a similar property. Rank statistics are a special example ofpermutation statistics.
13.1
Rank Statistics
The order statistics XN(1) ::::; XNc2J ::::; · · • ::::; XN (N) of a set of real-valued observations X 1 , , X N i th order statistic are the values of the observations positioned in increasing order. The rank RNi of Xi among X1 , . . . , XN is its position number in the order statistics. More precisely, if X 1 , . . . , XN are all different, then R Ni i s defined by the equation .
•
.
If Xi is tied with some other observations, this definition is invalid. Then the rank RNi is defined as the average of all indices j such that Xi = XN(j) (sometimes called the midrank), or alternatively as L:: 7= 1 {X1 ::::; Xd (which is something like an up rank). In this section it is assumed that the random variables X 1 , . . . , XN have continuous distribution functions, so that ties in the observations occur with probability zero. We shall neglect the latter null set. The ranks and order statistics are written with double subscripts, because N varies and we shall consider order statistics of samples of different sizes. The vectors of order statistics and ranks are abbreviated to X No and RN , respectively. A rank statistic is any function of the ranks. A linear rank statistic is a rank statistic of the special form L : 1 aN (i, RNi ) for a given (N X N) matrix (aN ( i, n) . In this chapter we are be concerned with the subclass of simple linear rank statistics, which take the form N L CN i aN, RN, . i=1 Here ( cN1 , . . . , CN N ) and (aN1 , . . . , aNN) are given vectors in ffi.N and are called the coeffi cients and scores, respectively. The class of simple linear rank statistics is sufficiently large
1
173
1 74
Rank. Sign, and P�rmmmilm Srati.ui,·.fl
to contai n i nte res tin g statistics for test ing a variety of hypotheses.
part icula r,
In
\ve s hal l
sec that it contains all .. locally most powerful" rank stat istics, whic h i n a not h e r c h a pt e r are
shown to be asy mptot ically efficient wi thi n the class of all tests.
Some e le m ent ary propcnics of ranks and order statistics arc ga t here d in the following
lemma.
13.1 /,.emma. Ut X I X N be a random sample ftvm a continuous distribmionfimc rion F with density f. Then (i) the vectors XNc , and RN are independelll: < .rN : (ii) rhe vector X N o has density N ! nf:: 1 f(x; ) o11 the set x1 < •
• • • •
·
·
·
N I (iii) rite \'ariable XNCo has densir.v NC�:t') F(x)1- 1 {I - F<x>) - f(x): for F the wu'· form diJtriburlon 011 [0. 1 ). it has mean ij ( N + I ) and \'ariance i (N - i + 1 )/ (CN + 1 )1 (N + 2 )): (i\•) the \'telor R,.. is uniformly distributed on the set ofall N! pt•rmutarion'i of I. 2 . . . . •
N:
(l') for any statiJtic T aizd pennutation r E(T(Xr
•
.
.
.
= (r1 ,
•
• • •
r"' ) of I , 2
, X.v ) l R.v = r) = ET( X.,•cra J ·
(�·i) for em}" simple linear rank .fla tistit: T
=
•
• • • •
• • .
, N.
X,\'fr"· � ) :
Ei� 1 cN;a.v.R.,
.• •
S ta te m ents (i) through { iv) are wel l-known and e l e me nta ry. For the proof of ( v). it is he lpful to write T( X1 XN) as a function of the ranks an d the order statistics. Nex t. we app ly (i), For the proof of statement (vi). we u se that the d i st ributions of the vari ab l es R,w and the \'ectors CRt�; . RN, ) for i -I= j arc u niform on th e sets I = ( I N ) and (i. j) e 1 2 : i i: j respectively. Furthermore. a doubl e sum of the form Li'FI (b, - b) (bi - h) is
Proof.
•
• • • •
• . . . •
J.
equal to -
'L, (b; - b)'·.
{
•
It follows that rank st at i st ic s arc di.�rribmion-free over the set of all models in which the
observations arc indepe ndent and identica l ly distributed . On the one hand. this makes them
�tatistically useless in si tuati o n s i n which the observations are. indee d a random s a mp le .
from some di stribution. On the othe r hand, it makes them of great int ere st to detect cena in
differences in distribution between the obse rva tion s . such as i n the two-sample pro b l em .
If
the null hypothesi s is taken to assert that the observations are i de nt ical l y distributed. then
the critical values for a rank test can be chose n in such a way that the p robabi lity of an
error of the f i rst k i nd is eq ua l to a g iven Jc\'cl a. for any probabi lity distri bution in the nul l hypothes i s. Somewhat s u rprisin g ly. this gain is not necess aril y counteracted by a lo�s i n
asymptot ic effi c ie ncy a s w e sec in Ch a pt e r .
13.2
1 4.
Example (Two·sample location problem). Suppose that the total set of observ at io n s
consists of two i ndependent random samples. inconsistently with the precedi ng notat ion
X1 X the pooletl sample X 1 written as
•
•
•
•
•
m
and Y
• • • •
1
•
•
• • •
, X,. . Y� o
Y" . Set N
. . . •
Yn .
=
m +n
and
let RN rn:
the rank vec to r of
I 3. 1
Rank Stari.titic.,·
1 75
\Ve arc interested in resting t he null hypothesis that the two samples arc identically dis tributed (according to a continuous distribut ion) ag ai n st the alternative that the distribution of the second sample is stochastically larger than the distribution of the fir�t sample. Even wit hour a more precise description of the alternat ive hypothesis. we can discuss a tion of usefu l rank stat istics. I f the Y1 arc a sample from a stochastically larger distribu tion, then the ranks of the Y1 in the pooled sample should be rclati\'ely large. Thus. any measure of rhe size of the ranks RN.m + J · • • • • RN .v can be used as a test statistic. It will be distribution-free under the null hypothesis. TI1e most popu lar choice in this prob l e m is the Wilcoxon .'ilatistic
collec
N
W =
L
i=lll +- I
RNr ·
This i s a si mple linear rank stati�tic with coe flicients c = (0, . . . , 0, I , . . . , I }, and scores a = (I, , N ) . The null hypot he s is is rej ected for large values of the \Vi lcoxon �tatiMic. (The \Vi lcoxon statistic is equi \ a l e nt to the J-fmm- \Vhit11ey stati.vtic U = Lr. J I (X1 � Y1 } in that W = U + !n (ll + 1 ). ) There arc many mher reasonable choices of rank stat istics. some o f which are o f special interest and h ave names. For instance, the ''an dt•r U�1erden stati.\'lic is defi ned as . • •
'
,\'
L
r .=m + I
- 1 ( RN, ).
Here - 1 is the standard normal quanti lc function. We s ha ll sec ahead that this statistic is part icul arl y attract ive if it is bel ieved that the underlying distribution of the obse rvat i on s is approxi mately normal. A ge n era l method to ge n cn1 te useful mnk statistics is discussed below. 0 A cri t i ca l value for a rest based on a (distribution-free) rnnk statistic can be found by simply tabulating its null distri bution. For a l arge number of observations this is a bit ted ious. In most cases it is a so unnecessary. bc.."Causc there ex ist accurate asymptotic approx imations. The remainder of this section i s concerned with proving asymptot ic normality of si mple l i near r.mk statistics under the null hypothesi�. Apart from bei ng useful for fi nd ing crit ical values. the theorem is used su bsequ e ntly to �t udy the &�symptotic efliciency of rank tests. Consider a rank statistic of the fonn T.v = L� 1 c,.,, a.v.N., ,· For a sequence of this type to be asy mptot i ca ll y normal, some restrictions on the coeflicicnts c and scores a arc necessary. I n most cases of interest. the scores arc ge nerated .. through a g i ve n fu nction t/J : [0, I J � R in one of two ways. Either
l
"
where UNC J i o UNcN• distribution on (0. I ) ; or • • • •
arc
the order stat istics of
a
sample
"N• = ¢(N� l }
( 1 3.3) of size N from the ·unifonn
( 13.4 )
For \Vei l -behaved function� c/J. these defi nit ions ure closely related and al most ident ica l. because i I ( N + 1 ) = EUN c 1 • · Scores of the first type correspond to the locally most
1 76
Rank, Sign, and Permutation Statistics
powerful rank tests that are discussed ahead; scores of the second type are attractive in view of their simplicity. 13.5 Theorem. Let R N be the rank vector of an i. i.d. sample X 1 , . . . , XN from the continuous distribution function F. Let the scores aN be generated according to ( 13.3) for 1 a measurable function ¢ that is not constant almost everywhere, and satisfies J0 ¢ 2 (u) du < oo. Define the variables N N t N TN = a (cNi - cN )¢ (F(X i ) ) . = + N CNi N RN cNZiN i =l i =l
L
,
, '
L
Then the sequences TN and TN are asymptotically equivalent in the sense that ETN = ETN and var (TN - TN) jvar TN ---+ 0. The same is true if the scores are generated according to ( 13.4) for a function ¢ that is continuous almost everywhere, is nonconstant, and satisfies N - 1 "'£ � 1 ¢ 2 (i /(N + 1)) ---+ f01 ¢ 2 (u) du < oo. Set Ui = F ( X i ) , and view the rank vector R N as the ranks of the first N elements of the infinite sequence U1 , U2 , . . . . In view of statement (v) of the Lemma 1 3 . 1 the definition (13.3) is equivalent to
Proof.
This immediately yields that the projection of TN onto the set of all square-integrable functions of R N is equal to TN = E(TN I R N ) . It is straightforward to compute that N var aN , RN I 1 / (N - 1) "'£ (cNi - eN ) 2 "£ (aN i - ZiN ) 2 N - 1 var ¢ (U1 ) L ( CN i - CN ) 2 var cp (UI ) If it can be shown that the right side converges to 1 , then the sequences TN and TN are asymptotically equivalent by the projection theorem, Theorem 1 1 .2, and the proof for the scores (13.3) is complete. Using a martingale convergence theorem, we shall show the stronger statement
( 1 3.6) Because each rank vector R j - l is a function of the next rank vector R j (for one observation more), it follows that aN, RN ! = E (¢ ( U1 ) I R 1 , . . . , RN ) almost surely. Because ¢ is square integrable, a martingale convergence theorem (e.g., Theorem 10.5.4 in [42]) yields that the sequence aN, R N 1 converges in second mean and almost surely to E (¢ (U1 ) I R 1 , R2, . . . ) . If ¢ (U1 ) is measurable with respect to the a--field generated by R 1 , R 2 , . . . , then the condi tional expectation reduces to ¢ (UI ) and ( 13.6) follows. The projection of U1 onto the set of measurable functions of R N I equals the conditional expectation E(U1 I RN 1 ) = R N i f(N + 1). By a straightforward calculation, the sequence var (R N I / (N + 1)) converges to 1 / 1 2 = var U1 . By the projection Theorem 1 1 .2 it follows that R N i f (N + 1 ) ---+ U 1 in quadratic mean. Because R N I is measurable in the a- -field generated by R 1 , R 2 , . . . , for every N, so must be its limit U1 . This concludes the proof that ¢ (U1 ) is measurable with respect to the a--field generated by R 1 , R 2 , . . . and hence the proof of the theorem for the scores 13.3.
13. 1
bNi
Rank Statistics
177
a Ni
Next, consider the case that the scores are generated by ( 13 .4). To avoid confusion, write = ¢ ( 1 / (N + 1 ) ) and let these scores as be defined by (1 3.3) as before. We shall prove that the sequences of rank statistics SN and TN defined from the scores and bN, respectively, are asymptotically equivalent. Because RNJ !(N + 1) converges in probability to U1 and ¢ is continuous almost ev erywhere, it follows that ¢ (RNJ / (N + 1) ) � ¢ (U1 ) . The assumption on ¢ is exactly that E¢ 2 (RNJ / (N + 1) converges to E¢ 2 (U 1 ) . By Proposition 2.29, we conclude that ¢ ( RNJ/(N + 1) � ¢ (U 1 ) in second mean. Combining this with ( 1 3 .6), we obtain that 2 b Nt ) 2 = E R, ¢ -+ ,
aN
)
)
� t (a m -
(a N
-
(::\ ) ) 0.
By the formula for the variance of a linear rank statistic, we obtain that - CaN - b N ) ) 2 var (S TN) 2:=� 1 (a Ni � ' var TN (a )2 because var aN, R N 1 � var ¢ ( U1 ) > This implies that var SN fvar TN � 1 . The proof is complete. •
N
- bNi N Li= l Ni - aN
-
0.
0
iaN,
Under the conditions of the preceding theorem, the sequence of rank statistics l:= cN RN , is asymptotically equivalent to a sum of independent variables. This sum is asymptotically normal under the Lindeberg-Feller condition, given in Proposition 2.27 . In the present case, because the variables ¢ ( F(X J ) are independent and identically distributed, this is implied by max l -si -sN C cN i - eN ) 2 ( 1 3 .7) N ( - ) 2 � o. c N i cN This is satisfied by the most important choices of vectors of coefficients.
Li= l
If the vector of coefficients CN satisfies ( 1 3.7), and the scores are genera ted according to ( 1 3 .3) for a measurable, nonconstant, square-integrable function ¢, then the sequence of standardized rank statistics (TN ETN ) /sd TN converges weakly to an N(O, I)-distribution. The same is true if the scores are generated by ( 1 3 .4) for a function ¢ that is continuous almost everywhere, is nonconstant, and satisfies N - 1 2:=� 1 ¢ 2 ( i / (N + 1) � ¢ 2 (u) du. 13.8
Corollary.
-
) fol
13.9 Example (Monotone score generating functions). Any nondecreasing, nonconstant function ¢ satisfies the conditions imposed on score-generating functions of the type (13.4) in the preceding theorem and corollary. The same is true for every ¢ that is of bounded variation, because any such ¢ is a difference of two monotone functions. To see this, we recall from the preceding proof that it is always true that RNJ ! (N + 1) � U1 , almost surely. Furthermore,
E¢ 2
i ) N 1 N l(i + l)j ( N + l) 2 ( NRNl 1 ) = -N1 LN ¢2 ( -i =l N 1 :::: N Li = l i /(N+ l l ¢ (u) du . --
+
+
+
--
The right side converges to J ¢ 2 (u) du. Because ¢ is continuous almost everywhere, it follows by Proposition 2.29 that ¢ (RNJ / (N + 1) � ¢ (UJ ) in quadratic mean. 0
)
Rank, Sign, and Permutation Statistics
178
In a two-sample problem, in which the first m observations constitute the first sample and the remaining observations n = N - m the second, the coefficients are usually chosen to be 13.10
Example (Two-sample problem).
i = 1, . , m CNi = 1 l = m + 1 , . , m + n .
{0
.
.
.
. .
In this case eN = nj N and L � 1 (cNi -eN ) 2 = mnj N. The Lindeberg condition i s satisfied provided both m ---+ and n ---+ D oo
oo.
The function ¢ ( u) = u generates the scores aNi = i / (N + 1). Combined with "two-sample coefficients," it yields a multiple of the Wilcoxon statistic. According to the preceding theorem, the sequence of Wilcoxon statistics WN = L �m + 1 RNi / (N + 1) is asymptotically equivalent to 13.11
Example (Wilcoxon test).
The expectations and variances of these statistics are given by EWN var WN = mnf(12(N + 1) ) , and var WN = mnj(12N). D
n/2,
13.12 Example (Median test). The median test is a two-sample rank test with scores of the form aN i = ¢ (i j(N + 1)) generated by the function ¢ (u) = 1 co, 1 ;21 (u). The corresponding test statistic is
N
i
J;_
}
N+1 1 RNi ::::; -- . 2 1
{
This counts the number of Y1 less than the median of the pooled sample. Large values of this test statistic indicate that the distribution of the second sample is stochastically smaller than the distribution of the first sample. D The examples of rank statistics discussed so far have a direct intuitive meaning as statistics measuring a difference in location. It is not always obvious to find a rank statistic appropriate for testing certain hypotheses. Which rank statistics measure a difference in scale, for instance? A general method of generating rank statistics for a specific situation is as follows. Suppose that it is required to test the null hypothesis that X 1 , . . . , X N are i.i.d. versus the alternative that X 1 , . . . , X N are independent with Xi having a distribution with density fcNJh for a given one-dimensional parametric model e �----+ fe . According to the Neyman-Pearson lemma, the most powerful rank test for testing H0 : e = against the simple alternative H1 : e = e rejects the null hypothesis for large values of the quotient Pe (R N - r) = N ! Pe (RN = r ) . Po (RN = r )
0
__ _ _
Equivalently, the null hypothesis is rejected for large values of P8 (RN = r). This test depends on the alternative e , but this dependence disappears if we restrict ourselves to
179
13. 1 Rank Statistics
alternatives e that are sufficiently close to 0. Indeed, under regularity conditions,
Pe(RN = r) - Po(RN = r) =
r . L �, (fv'"'' (x, ) - !] to(x, ) ) dx l
N 1 = 8 - L cNi Eo N .,
i= I
dxN
( _Qjfo (Xi ) I RN = r ) + o (&) .
Pe ( R r) TN 2..::{: 1 R aNi = Eo �� ( XN<• l ) = E �� ( F0- 1 ( UN<, J ) .
Conclude that, for small e > 0, large values of N = correspond to large values of the simple linear rank statistic = C N i aN , N , , for the vector aN of scores given by
(j
F0- 1 .
These scores are of the form ( 1 3 .3), with score-generating function ¢ = 0 /fo) Thus the corresponding rank statistics are asymptotically equivalent to the statistics e Nd olfo (X J . Rank statistics with scores generated as in the preceding display yield locally most pow erfu l rank tests. They are most powerful within the class of all rank tests, uniformly in a sufficiently small neighbourhood (0, c) of 0. (For a precise statement, see problem 1 3 . 1). Such a local optimality property may seem weak, but it is actually of considerable im portance, particularly if the number of observations is large. In the latter situation, any reasonable test can discriminate well between the null hypothesis and "distant" alterna tives. A good test proves itself by having high power in discriminating "close" alternatives.
LiN=I .
13.13
Example (Two-sample scale).
o
To generate a test statistic for the two-sample scale
problem, let fe (x) = e- B f (e - 8 x) for a fixed density f. If xi has density fcN , B and the
vector c is chosen equal to the usual vector of two-sample coefficients, then the first m observations have density fo = f; the last n = N - m observations have density fe . The alternative hypothesis that the second sample has larger scale corresponds to e > 0. The scores for the locally most powerful rank test are given by
For instance, for f equal to the standard normal density this leads to the sank statistic with scores
L �=m + I aN,RN,
The same test is found for f equal to a normal density with a different mean or variance. This follows by direct calculation, or alternatively from the fact that rank statistics are location and scale invariant. The latter implies that the probabilities p.,a, B = of the rank vector of a sample of independent variables N with Xi distributed according to e - B f (e- 8 (x - J-L)/CJ ) /CJ do not depend on (J-L, CJ). Thus the procedure to generate locally most powerful scores yields the same result for any (J-L, CJ). 0
RN
X1 , . . . , X
P (RN r)
Rank, Sign, and Permutation Statistics
1 80
13.14 Example (Two-sample location). In order to find locally most powerful tests for location, we choose fe (x) = f (x - 8) for a fixed density f and the coefficients c equal to the two-sample coefficients. Then the first m observations have density f (x) and the last n = N - m observations have density f (x - 8). The scores for a locally most powerful rank test are
For the standard normal density, this leads to a variation of the van der Waerden statistic. The Wilcoxon statistic corresponds to the logistic density. D 13.15 Example (Log rank test). The cumulative hazardfunction corresponding to a con tinuous distribution function F is the function A = -log(l - F). This is an important modeling tool in survival analysis. Suppose that we wish to test the null hypothesis that two samples with cumulative hazard functions Ax and Ay are identically distributed against the alternative that they are not. The hypothesis of proportional hazards postulates that Ay = 8Ax for a constant 8 , meaning that the second sample is a factor e more "at risk" at any time. If we wish to have large power against alternatives that satisfy this postulate, then it makes sense to use the locally most powerful scores corresponding to a family defined by Ae = 8A 1 . The corresponding family of cumulative distribution functions Fe satisfies 1 - Fe = (1 - F1 ) e and is known as the family of Lehmann alternatives. The locally most powerful scores for this family correspond to the generating function
¢ (u)
=
a ae
a ax
- log - ( 1 - Fe ) (x) 1e_- 1 ' x - F-I I (u )
=
1 - log( l - u).
It is fortunate that the score-generating function does not depend on the baseline hazard function A 1 . The resulting test is known as the log rank test. The test is related to the Savage test, which uses the scores
N 1 a N , i = L -: J= N -i + 1 1
R:!
( -i ) . N 1
- log 1 -
+
The log rank test is a very popular test in survival analysis. Then usually it needs to be extended to the situation that the observations are censored. D 13.16 Example (More-sample problem). Suppose the problem is to test the hypothesis that k independent random samples X1 , . . . , XN1 , X N1 + 1 , . . . , XN2 , . . . , X Nk_ 1 + 1 , . . . , XNk are identical in distribution. Let N = Nk be the total number of observations, and let R N be the rank vector of the pooled sample X 1 , . . . , X N . Given scores aN inference can be based on the rank statistics
Nk N1 N2 TN1 = L aN , RNr ' TN 2 = L aN ,RNr ' . . . ' TNk = L aN ,RN, . i=1
i= N 1 +1
i =Nk-1 +1
The testing procedure can consist of several two-sample tests, comparing pairs of (pooled) subsamples, or on an overall statistic. One possibility for an overall statistic is the chi-square
13.2
Signed Rank Statistics
181
statistic. For n j = Nj - Nj _ 1 equal to the number of observations in the jth sample, define
� (TN1 - n/(iN) 2 2 = eN L...t = n j var ¢ (U 1 ) j 1
If the scores are generated by (13.3) or (13.4) and all sample sizes n j tend to infinity, then every sequence TN1 is asymptotically normal under the null hypothesis, under the conditions of Theorem 1 3.5. In fact, because the approximations N1 are jointly asymptotically normal by the multivariate central limit theorem, the vector TN = (TN1 , . . . , TNk) is asymptotically normal as well. By elementary calculations, if n i f N ---+ Ai ,
T
TN - ETN
v'ii sd ¢ (UI )
-'A2'A 1
-'A 1 'A2 'A2 ( 1 - 'A2 )
-'A 1 'Ak -'A2'Ak
-'Ak'A 1
-'Ak'A2
'Ak ( l - 'Ak )
)q (1
� Nk 0,
- 'A I )
This limit distribution is similar to the limit distribution of a sequence of multinomial vectors. Analogously to the situation in the case of Pearson's chi-square tests for a multinomial distribution (see Chapter 17), the sequence C1 converges in distribution to a chi-square distribution with k - 1 degrees of freedom. There are many reasonable choices of scores. The most popular choice is based on ¢ (u) = u and leads to the Kruskal-Wallis test. Its test statistic is usually written in the form
(-
)
k 12 N+1 2 n j R j - -- ' . 2 N(N - 1 )
j;
This test statistic measures the distance of the average scores of the k samples to the average score (N + 1 )/2 of the pooled sample. An alternative is to use locally asymptotically powerful scores for a family of distribu tions of interest. Also, choosing the same score generating function for all subsamples is convenient, but not necessary, provided the chi-square statistic is modified. D
13.2
Signed Rank Statistics
The sign of a number x, denoted sign(x), is defined to be - 1 , 0, or 1 if x < 0, x = 0 or x > 0, respectively. The absolute rank R ti of an observation Xi in a sample X 1 , , X N is defined as the rank of I Xi I in the sample of absolute values I X 1 1 , . . . , I X N 1 . A simple linear signed rank statistic has the form .
.
•
The ordinary ranks of a sample can always be derived from the combined set of absolute ranks and signs. Thus, the vectors of absolute ranks and signs are together statistically more informative than the ordinary ranks. The difference is dramatic if testing the location of a symmetric density of a given form, in which case the class of signed rank statistics contains asymptotically efficient test statistics in great generality.
1 82
Rank, Sign, and Permutation Statistics
The main attraction of signed rank statistics is their simplicity, particularly their being distribution-free over the set of all symmetric distributions. Write l X I , Rt, and signN (X) for the vectors of absolute values, absolute ranks, and signs.
Let X 1 , . . . , XN be a random sample from a continuous distribution that is symmetric about zero. Then (i) the vectors ( l X I , Rt) and signN (X) are independent; (ii) the vector Rt is uniformly distributed over { 1 , . . . , N}; N (iii) the vector signN (X) is uniformly distributed over { - 1, 1} ; (iv) forany signed rank statistic, var 'L;: 1 aN, , sign(Xi ) = L ;: 1 a�i · 13.17
Lemma.
Rt
Consequently, for testing the null hypothesis that a sample is i.i.d. from a continuous, symmetric distribution, the critical level of a signed rank statistic can be set without further knowledge of the "shape" of the underlying distribution. The null hypothesis of symmetry arises naturally in the two-sample problem with paired observations. Suppose that, given independent observations (X 1 , Y1 ) , . . . , (XN , YN), it is desired to test the hypothesis that the distribution of Xi - Yi is "centered at zero." If the observations (Xi , Yi) are exchangeable, that is, the pairs (Xi , Yi) and (Yi , Xi ) are equal in distribution, then Xi - Yi is symmetrically distributed about zero. This is the case, for instance, if, given a third variable (usually called "factor"), the observations Xi and Yi are conditionally independent and identically distributed. For the vector of absolute ranks to be uniformly distributed on the set of all permutations it is necessary to assume in addition that the differences are identically distributed. For the signs alone to be distribution-free, it suffices, of course, that the pairs are inde pendent and that P(Xi < YJ P(Xi > Yi ) � for every Consequently, tests based on only the signs have a wider applicability than the more general signed rank tests. However, depending on the model they may be less efficient.
=
i.
=
Let X 1 , . . . , XN be a random sample from a continuous distribution that is symmetric about zero. Let the scores aN be generated according to ( for a measurable function ¢ such that f01 ¢2 ( u) d u For p+ the distribution function of I X 1 l , define N N + '""" TN = i 1 aN sign(Xi) , TN = L ¢ ( F ( I Xi l ) ) sign(Xd . i 1 1 var (TN Then the sequences TN and TN are asymptoticall y equi v alent in the sense that N - 1 12 TN is asymptotically normal with mean zero TN) 0. Consequentl y , the sequence N 1 and variance J0 ¢ 2 (u) d u. The same is true if the scores are generated -accordi ng2 to 1 for a function ¢ that is continuous almost everywhere and satisfies N L;: 1 ¢ (i j ( N + 1 2 1) ) fo ¢ (u) du Because the vectors signN (X) and ( l X I , Rt) are independent and E signN (X) = 0, the means of both TN and TN are zero. Furthermore, by the independence and the 13.18
Theorem.
< oo.
L =
'
---+
---+
13. 3)
R+
Nt
=
(13.4)
< oo.
Proof
orthogonality of the signs,
13.4
1 83
Rank Statistics for Independence
The expectation on the right side is exactly the expression in ( 1 3.6), evaluated for the special choice U1 = p + ( I X 1 1 ) . This can be shown to converge to zero as in the proof of Theorem 1 3 .5. •
Wilcoxon signed rank statlstzc u.
The Large = 2.:= � 1 R t i sign(Xi ) is obtained from the score-generating function ¢ ( u ) = values of this statistic indicate that large absolute values I Xi I tend to go together with pos itive Xi . Thus large values of the Wilcoxon statistic suggest that the location of the Xi is larger than zero. Under the null hypothesis that X 1 , . . . , X N are i.i.d. and symmetrically distributed about zero, the sequence is asymptotically normal N(O, 1 13). The variance of is equal to N(2 N + 1 ) (N + 1 ) 1 6. The signed rank statistic is asymptotically equivalent to the U -statistic with kernel h (x 1 , x 2 ) = 1 {x 1 + x2 > 0}. (See problem 12.9.) This connection yields the limit distri bution also under nonsymmetric distributions. 0 13.19
WN
Example (Wilcoxon signed rank statistic).
N -312 WN
WN
Signed rank statistics that are locally most powerful can be obtained in a similar fashion as locally most powerful rank statistics were obtained in the previous section. Let f be a symmetric density, and let X 1 , , XN be a random sample from the density f(- -e). Then, under regularity conditions, .
Pe ( signN (X)
= = =
.
•
s, R t N r ) - Po ( signN (X) s, R t =
=
j ' ( I Xi l ) { signN (x) -e Eo L sign(Xi ) f i= 1 N 1 f' -e -N-, L iEo - ( I Xi l ) I R t i f 2 N . i =1
s (
r)
=
=
s, R t r } o (e ) ) o (e ) . ri =
=
+
+
In the second equality it is used that f' If (x) is equal to sign(x) f' If ( l x I) by the skew symmetry of f' If. It follows that for testing · f against f ( -e) are obtained from the scores
locally most poweiful signed rank statistics
These scores are of the form (1 3.3) with score-generating function ¢ = - (f' 1f) (F + ) - 1 , whence locally most powerful rank statistics are asymptotically linear by Theorem 1 3 . 1 8. By the symmetry of F, we have (F + ) - 1 = p- 1 ( + 1) 12) . o
(u)
(u
Example. The Laplace density has score function f' lf (x) = sign(x) = 1 , for x :::= 0. This leads to the locally most powerful scores 1 . The corresponding test statistic is the TN = 2.:= � 1 sign(Xi ) . Is it surprising that this simple statistic possesses an optimality property? It is shown to be asymptotically optimal for testing Ho : e = 0 in Chapter 1 5 . 0 13.20
sign statistic
13.21
aNi
=
Example. The locally most powerful score for the normal distribution are N i + 1) 12) . These are appropriately known as the normal (absolute) scores. 0
E
a
1 84
Rank, Sign, and Permutation Statistics 13.3
Rank Statistics for Independence
, (XN, YN)
Let (X 1 , Y1 ) , . . . be independent, identically distributed bivariate vectors, with continuous marginal distributions. The problem is to determine whether, within each pair, Xi and Yi are independent. Let and be the rank vectors of the samples X 1 , . . . , X and Y1 , . . . , Y respec tively. If Xi and Yi are positively dependent, then we expect the vectors and to be roughly parallel. Therefore, rank statistics of the form
RN SN
N
N, RN SN
N bN aN ,R , sN, , L N, i= l
aN bN
with and increasing vectors, are reasonable choices for testing independence. Under the null hypothesis of independence of Xi and the vectors and are independent and both uniformly distributed on the permutations of { 1 , . . . , N}. Let be the vector of ranks of . . . , XN if first the pairs (X 1 , y have been put in increasing order of Y1 < Y2 < < Y The coordinates of are called the Under the null hypothesis, the antiranks are also uniformly distributed on the permutations of { 1 , . . . , N}. By the definition of the antiranks,
xl ,
Yi , R N SN Rc; Yl ), . . . , (XN, N) Rc; antiranks.
· · · N.
N N L: i= l aN , R�, bNi · i = l aN , RN, bN, SN, L: =
The right side is a simple linear rank statistic and can be shown to be asymptotically normal by Theorem 1 3 .5. The simplest choice of scores corresponds to the generating function ¢ (u) = u. This leads to the which is the ordinary sample correlation coefficient of the rank vectors and Indeed, be cause the rank vectors are permutations of the numbers 1 , 2, . . . , N, their sample mean and variance are fixed, at (N + 1)/2 and N (N + 1 ) / 12, respectively, and hence 13.22
Example (Spearman rank correlation).
rank correlation coefficient PN, RN SN.
PN
Thus the tests based on the rank correlation coefficient are equivalent to tests based on the signed rank statistic L It is straightforward to derive from Theorem 1 3 .5 that the sequence ffi is asymptot ically standard normal under the null hypothesis of independence. 0
RNi SNi .
* 1 3.4
RN
PN
Rank Statistics under Alternatives
N
Let be the rank vector of the independent random variables X 1 , . . . , X with continu ous distribution functions Theorem 1 3 .5 gives the asymptotic distribution of simple, linear rank statistics under very mild conditions on the score-generating function,
F1 , . . . , FN.
13.4
1 85
Rank Statistics under Alternatives
Fi
but under the strong assumption that the distribution functions are all equal. This is suffi cient for setting critical values of rank tests for the null hypothesis of identical distributions, but for studying their asymptotic efficiency we also need the asymptotic behavior under alternatives. For instance, in the two-sample problem we are interested in the asymptotic distributions under alternatives of the form G, . . . , G , where and G are the distributions of the two samples. For alternatives that converge to the null hypothesis "sufficiently fast," the best approach is to use Le Cam's third lemma. In particular, if the log likelihood ratios of the alternatives allow Fn , . . . , Fn , G . . . , G with respect to the null distributions an asymptotic approximation by a sum of the type then the joint asymptotic distribution of the rank statistics and the log likelihood ratios under the null hypothesis can be obtained from the multivariate central limit theorem and Slutsky's lemma, because Theorem 1 3 .5 yields a similar approximation for the rank statistics. Next, we can apply Le Cam's third lemma, as in Example 6.7, to find the limit distribution of the rank statistics under the alternatives. This approach is relatively easy, and is sufficiently general for most of the questions of interest. See sections 7.5 and 14. 1 . 1 for examples. More general alternatives must be handled directly and appear to require stronger con ditions on the score-generating function. One possibility is to write the rank statistic as a functional of the empirical distribution function IF and the weighted empirical distribution IF� (x) = N- 1 _L f: 1 ::S of the observations. Because = N IF , we have
F, . . . , F,
n,
n
F
.L .ei (Xi ),
N,
cNi 1 { Xi x}
F, . . . , F, F, . . . , F
RNi
N(Xi )
Next, we can apply a von Mises analysis, using the convergence of the empirical distribution functions to Brownian bridges. This method is explained in general in Chapter In this section we illustrate another method, based on Hajek's projection lemma. To avoid technical complications, we restrict ourselves to smooth score-generating functions. Let be the average of be the weighted sum N- 1 and and let define
20.
-
FN
-c
F1 , . . . , FN
FN
.
.LNi= l cNi Fi ,
TN
We shall show that the variables are the Hajek projections of approximations to the variables up to centering at mean zero. The Hajek projections of the variables themselves give a better approximation but are more complicated.
TN,
TN
f ¢ : [0, 1] � is twice continuously diffe rentiable, then there exists a universal constantI K such that 1� var (TN - TN) K - (cN i - CN) 2 ( I ¢ / ll 2oo + I ¢ // ll 2oo ) N i= t Because the inequality is for every fixed N, we delete the index N in the proof. Furthermore, because the assertion concerns a variance and both TN and TN change by a 13.23
�
Lemma.
A
Proof.
::S
L
·
1 86
Rank, Sign, and Permutation Statistics
cNi cNi - eN, TN Xi Ri 1 + Lk#i 1{Xk _::::: Xi }. 1 I 1 - -F(X . ) - P. (X . ) I - 1 Ri X . ) - -F(X . ) -(E -N+1 N N+ 1 I
constant if the are replaced by it is not a loss of generality to assume that = 0. (Evaluate the integral defining to see this.) The rank of can be written as = This representation and a little algebra show that
CN
l
l
l
l
=
l
< -.
Furthermore, applying the Marcinkiewitz-Zygmund inequality (e.g., [23, p. 356]) condi tionally on we obtain that
Xi ,
F(Xi ),
Next, developing ¢ in a two-term Taylor expansion around for each term in the sum that defines T, we see that there exist random variables that are bounded by II¢" I such that
Ki
N RNi - . ) 2 +�t= l c· ( -- F(X ) N+ 1 �
1
1
00
K1
Using the Cauchy-Schwarz inequality and the fourth-moment bound obtained previously, we see that the quadratic term T2 is bounded above in second mean as in the lemma. The leading term T0 is a sum of functions of the single variables and is the first part of t . We shall show that the linear term T1 is asymptotically equivalent to its Hajek projection, which, moreover, is asymptotically equivalent to the second part of T , up to a constant. The Hajek projection of T1 is equal to, up to a constant,
Xi ,
4= ci l:: E [ N: 1 ¢'( F(Xi ) ) I xj ] - 4= ci F(Xi )¢'( F(Xi ) ) l
J
l
The second term is bounded in second mean as in the lemma; the first term is equal to
13.4
1 87
Rank Statistics under Alternatives
If we replace (N + 1 ) by N, write out the conditional expectation, add the diagonal terms, and remove the constant, then we obtain the second term in the definition of T . The difference between these two expressions is bounded above in second mean as in the lemma. To conclude the proof it suffices to show that the difference between and its Hajek projection is negligible. We employ the Hoeffding decomposition. Because each of the variables RiC// ( ) is contained in the space LI A I :s2 HA , the difference between and its Hajek projection is equal to the projection of onto the space LI A I =2 HA . This projection has second moment
T1
F(Xi)
T1
T1
{Xk ::::; Xd¢' ( F(Xi ) ) , which is contained in the space H{k,i) > {a, b} {k, i}. Thus, the expression in the preceding
The projection of the variable 1 onto the space Hra ,b} is zero unless display is equal to
C
This is bounded by the upper bound of the lemma, as desired. The proof is complete.
•
(TN - ETN )jsd TN and (TN - E TN )jsd TN L.: t: l (CNi - CN ) 2 -+ 0 Nvar TN
As a consequence ofthe lemma, the sequences have the same limiting distribution (if any) if A
•
This condition is certainly satisfied if the observations are identically distributed. Then the rank vector is uniformly distributed on the permutations, and the explicit expression for var given by Lemma 1 3 . 1 shows that the left side (with var instead of var T ) is of the order 0 ( 1 j N) . Because this leaves much too spare, the condition remains satisfied under small departures from identical distributions, but the general situation requires a calculation. Under the conditions of the lemma we have the approximation
TN
TN
N
The square of the difference is bounded by the upper bound of the lemma. The preceding lemma is restricted to smooth score-generating functions. One possibility to extend the result to more general scores is to show that the difference between the rank statistics of interest and suitable approximations by rank statistics with smooth scores is small. The following lemma is useful for this purpose, although it is suboptimal if the observations are identically distributed. (For a proof, see Theorem 3 . 1 , in [68] .) 13.24
Lemma (Variance inequality).
and arbitrary scores cN l , . . . , CNN·
For nondecreasing coefficients aNI
::::;
·
·
·
::::; aNN
1 88
Rank, Sign, and Permutation Statistics 13.5
Permutation Tests
permutation tests .
Rank tests are examples of General permutation tests also possess a distribution-free level but still use the values of the observations next to their ranks. In this section we illustrate this for the two-sample problem. Suppose that the null hypothesis H0 that two independent random samples and are identically distributed is rejected for large values of a test statistic Write for the values of the pooled sample stripped of its original order. (N = + Under the null hypothesis each permutation of the N values is equally likely to lead back to the original observations. More precisely, the conditional null distribution of given is uniform on the N ! permutations of the latter sample. Thus, it would be reasonable to reject H0 if the observed value is among the l OOa % largest values as ranges over all permutations. Then we obtain a test of level a, conditionally given the observed values and hence also unconditionally. Does this procedure work? Does the test have the desired power? The answer is affirmative for statistics that are sums, in the sense that, asymptotically, the permutation test is equivalent to the test based on the normal approximation to If the latter test performs well, then so does the permutation test. We consider statistics of the form, for a given measurable function
X 1 , . . . , Xm
Y1 , . . . , Yn TN(Xt, . . . , Xm , Yt, . . . , Yn). Zn1, , ZnN •
•
Z( l ) • . . . , Z(N) m n. ) X 1 , . . . , Xm , Y1 , . . . , Yn TN(Xt, . . . , Xm , y 1 , , Yn) TN (Zn1 , , ZnN) n •
.
•
•
.
Z(l ) , . . . , Z(N)
•
•
TN
TN. f,
These statistics include, for instance, the score statistics for testing that the two samples have distributions and Pe , respectively, for which we take equal to the score function P o l of the model. Because a permutation test is conditional on the observed values, and is fixed once and are fixed, it would be equivalent to consider statistics of the form Let . be uniformly distributed on the N ! permutations of the numbers 2, , N, and be independent of
p0
p0 TN
f
) Li f(Zi) LL:j f(Y j f (Y ). (nN 1 , . . , lfNN) j j 1, . . . Xt, . . . , Xm , Yt, . . . , Yn. Let both Ef 2 (X 1 ) and Ej 2 (Y1 ) be finite, and suppose that m, n -+ such that mj N -+ A. (0, 1). Then, given almost every sequence X 1 , X2 , . . . , Y1 , Y2 , . . . , the sequence ,JF.iTN(ZnN 1 , , ZnNN) is asymptotically normal with mean zero. Under the null hypothesis the asymptotic variance is equal to var f(X 1 )/('A(1 - 'A) ) . Conditionally on the values of the pooled sample, the statistic N TN(ZnNl , . . . , ZnNN) is distributed as the simple linear rank statistic L:� 1 CNiaN, RN, with coefficients and scores N' m aNi = n i > m l ' Here RNt, . . . , RNN are the antiranks of Nt, . . . , lfNN defined by the equation L cN , nN , aNi = L: cNiaN ,RN, (for any numbers CNi and aNi). 13.25
Theorem.
oo
E
•
•
•
Proof.
_ !!..
n
13. 5 Pennutation Tests
1 89
(13.
X 1 , X2 , . . . , Y1 , Y2 , ,
The coefficients satisfy relation 7) for almost every sequence because, by the law of large numbers,
.
•
.
c� AEjk (XI) + (1 - A)Efk (Y1 ), k 1, 2, 1 N c2Ni -+ 0. - max N l� i � The scores are generated as aN i ¢N (i j (N + 1) ) for the functions N ' U - Nm+ l ' { m ¢N(u) m U > N+ l ' These functions depend on N, unlike the situation of Theorem 13. 5 , but they converge to the fixed function ¢ A - l 1t o , .J. ) - (1 - A) - 1 1c.J. . Il · By a minor extension of Theo rem 13. 5 , the sequence L cN i aN ,RN , is asymptotically equivalent to L (cN i - eN )¢(Ui ), for a uniform sample U1 , . . . , UN. The (asymptotic) variance of the latter variable is easy to compute. �
=
as
=
=
<
_ !!__
n'
=
•
By the central limit theorem, under the null hypothesis,
1) 2 varA( lf(X - A) The limit is the same as the conditional limit distribution of the sequence � TN ( ZrrNl , . . . , ()'
=
.
ZrrNN ) under the null hypothesis. Thus, we have a choice of two sequences of tests, both of asymptotic level a, rejecting H0 if : or 2:: Z a O' ; where 2:: is the upper a-quantile of the conditional distribution of ZrrNN ) given Z o) • . . . Z e (ZrrN 1 , The second test is just the permutation test discussed previously. By the preceding theorem the "random critical values" converge in probability to Za O' under H0 . Therefore the two tests are asymptotically equivalent under the null hypoth esis. Furthermore, this equivalence remains under "contiguous alternatives" (for which again � Za O' ; see Chapter 6), and hence the local asymp totic power functions as discussed in Chapter are the same for the two sequences of tests. The preceding theorem also shows that the sequence of "critical values" remains bounded in probability under every alternative. Because ,JN the power at any alternative with this > property converges to Thus, permutation tests are an attractive alternative to both rank and classical tests. Their main drawback is computational complexity. The dependence of the null distribution on the observed values means that it cannot be tabulated and must be computed for every new data set.
- �TN(X 1 , .. .. .. ,, XXm ,, YY1 ,, .. .. .. ,, Yn)Yn) cN(X 1 , . . . , Xm , Y1 , . . . , Yn), - eN�TN(XI, (X 1 , . . . , Xm , Ym1 , . . 1. , Yn) , N) · �TN , eN (X 1 , . . . , Xm , Y1 , . . . , Yn) •
•
•
cN(X 1 , . . . , Xm , Y1 , . . . , Yn)
Y1 , . . . , Yn) (X 1 , . . . , Xm , Y1 , . . . , Yn) 1.
� oo
14
ifEj(X 1 ) Ef(Y1 ),
cN(X 1 , . . . , Xm , TN
190
Rank, Sign, and Permutation Statistics *13.6
Rank Central Limit Theorem
The rank central limit theorem Theorem 13.5, is slightly special in that the scores aNi are assumed to be of one of the forms ( 1 3 .3) or ( 1 3 .4) . In this section we record what is com monly viewed as the rank central limit theorem. For a proof see [ 67] . For given coefficients and scores, let
c� = L (cN i - CN) 2 ,
2 A� = L i= l (aNi - 7iN) . i=l Let TN = L cN i aN , RN , be the simple lin ear rank statistic with coefficients and scores such that max l <e:i ::; N l a N i - aN I I AN --+ 0 and max 1 < i
13.26
n
Theorem (Rank central limit theorem).
8 >
--+
Notes
The classical reference on rank statistics is the book by Hajek and Sidak [7 1 ] , which still makes wonderful reading and gives extensive references. Its treatment of rank statistics for nonidentically distributed observations is limited to contiguous alternatives, as in the first sections of this chapter. The papers [43] and [68] remedied this, shortly after the publication of the book. Section 1 3.4 reports only a few of the results from these papers, which, as does the book, use the proj ection method. An alternative approach to obtaining the limit distribution of rank statistics, initiated by Chernoff and Savage in the late 1 950s and refined many times, is to write them as functions of empirical measures and next apply the von Mises method. We discuss examples of this approach in Chapter 20. See [1 34] for a more comprehensive treatment and further references. PROBLEMS
1. This problem asks one to give a precise meaning to the notion of a locally most powerful test. Let be a rank statistic based on the "locally most powerful scores ." Let a = > ca ) for a given number Ca . (Then a is a natural level of the test statistic, a level that is attained without randomization.) Then there exists E: > 0 such that the test that rej ects the null hypothesis if > Ca is most powerful within the class of all rank tests at level a uniformly in the alternatives E (O, e ) . (i) Prove the statement.
TN
Po(TN
TNe
(ii) Can the statement be extended to arbitrary levels ?
2 . Find the asymptotic distribution o f the median test statistic under the null hypothesis that the two samples are identically distributed and continuous. 3. Show that -Jii times Spearman 's rank correlation coefficient is asymptotically standard normal. 4. Find the scores for a locally most powerful two-sample rank test for location for the Laplace family of densities .
Problems
191
5 . Find the scores for a locally most powerful two-sample rank test for location for the Cauchy family of densities . 6. For which density is the Wilcoxon signed rank statistic locally most powerful? 7. Show that Spearman's rank correlation coefficient is a linear combination of Kendall's r and the
U -statistic with (asymmetric) kernel h (x , y , z) = sign(x 1 - Y l ) sign(x 2 - z 2 ) . This decompo sition yields another method to prove the asymptotic normality.
8. The symmetrized Siegel-Tukey test is a two-sample test with score vector of the form aN ( 1 , 3 , 5, . . . , 5, 3 , 1 ) . For which type of alternative hypothesis would you use this test? 9. For any aNi given by ( 1 3 .3), show that aN
=
J� ¢ (u) du .
14 Relative Efficiency of Tests
The quality of sequences of tests can be judged from their power at alter natives that become closer and closer to the null hypothesis. This moti vates the study oflocal asymptotic powerfunctions. The relative efficiency of two sequences of tests is the quotient of the numbers of observations needed with the two tests to obtain the same level and power. We discuss several types of asymptotic relative efficiencies.
14.1
Asymptotic Power Functions
Consider the problem of testing a null hypothesis Ho : 8 E G o versus the alternative H1 : 8 E 8 1 . The power function of a test that rejects the null hypothesis if a test statistic falls into a critical region Kn is the function 8 1--:)o nn (8) = Pe (Tn E Kn ), which gives the probability of rejecting the null hypothesis. The test is of level if its size sup { nn (8) : 8 E G0 } does not exceed A sequence of tests is called asymptotically of level if a
a.
a
lim sup sup lfn (8) n -+oo
BE8o
::::::
a.
(An alternative definition is to drop the supremum and require only that lim sup nn (8) :::::: for every 8 E Go.) A test with power function nn is better than a test with power function !f n if both a
lfn (8) :S !f n (8), and lfn (8) ::: !f n (8),
The aim of this chapter is to compare tests asymptotically. We consider sequences of tests with power functions nn and !f n and wish to decide which of the sequences is best as n ----+ Typically, the tests corresponding to a sequence n 1 , n2 , . . . are of the same type. For instance, they are all based on a certain U -statistic or rank statistic, and only the number of observations changes with n. Otherwise the comparison would have little relevance. A first idea is to consider limiting power functions of the form oo.
Jf
(8 )
=
lfn (8) . nlim -+ oo
If this limit exists for all 8, and the same is true for the competing tests ?1 n , then the se quence Jfn is better than the sequence ?1 n if the limiting power function Jf is better than the 192
193
14. 1 Asymptotic Power Functions
limiting power function n . It turns out that this approach is too naive. The limiting power functions typically exist, but they are trivial and identical for all reasonable sequences of tests. 14.1 Example (Sign test). Suppose the observations X 1 , . . . , Xn are a random sample from a distribution with unique median The null hypothesis H0 : = 0 can be tested against the alternative H1 : > 0 by means of the Sn = n - 1 2..: 7= 1 1 {Xz > 0}. If F (x - is the distribution function of the observations, then the expectation and variance of Sn are equal to = 1 = (1 respectively. By the normal approximation to the binomial distribution, the sequence Jn( Sn - t-L ( ) is asymptotically normal N (O, CT 2 ( ) . Under the null hypothesis the mean and variance are equal to t-L(O) = 1 /2 and CT 2 (0) = 1 /4, respectively, so that .jn(Sn - 1/2) � N(O, 1/4) . The test that rej ects the null hypothesis if .jn(Sn - 1 /2) exceeds the critical value � Za has power function
e)
e
t-L(e)
e.
e
sign statistic F( - e) andCT 2 (e)jn F(-e) ) F( -e)jn, e)
F - F (-e) > 0 for every e > 0, it follows that for a if e o , if e > o.
Because (0) slowly
=
an
----+
e)
0 sufficiently
=
The limit power function corresponds to the perfect test with all error probabilities equal to zero. 0 The example exhibits a sequence of tests whose (pointwise) limiting power function is the perfect power function. This type of behavior is typical for all reasonable tests. The point is that, with arbitrarily many observations, it should be possible to tell the null and alternative hypotheses apart with complete accuracy. The power at every fixed alternative should therefore converge to 1 . A sequence of tests with power functions �----+ nn is asymptotically at level a at (or against) the alternative if it is asymptotically of level a and nn ----+ 1 . If a family of sequences of tests contains for every level a E (0, 1) a sequence that is consistent against every alternative, then the corresponding tests are simply called consistent. 14.2
Definition.
consistent (e)
e
e
(e)
Consistency is an optimality criterion for tests, but because most sequences of tests are consistent, it is too weak to be really useful. To make an informative comparison between sequences of (consistent) tests, we shall study the performance of the tests in problems that become harder as more observations become available. One way of making a testing problem harder is to choose null and alternative hypotheses closer to each other. In this section we fix the null hypothesis and consider the power at sequences of alternatives that converge to the null hypothesis.
194
Relative Efficiency of Tests
Figure 14.1. Asymptotic power function.
Example (Sign test, continued). Consider the power of the sign test at sequences of alternatives en t 0. Suppose that the null hypothesis Ho : e = 0 is rejected if vfrl(Sn - � ) ::::: 14.3
� Za . Extension of the argument of the preceding example yields TCn (en)
=
)
(
l za - y!fi(F(O) - F(-en ) ) n 1 -
Since a (0) = � , the levels nn (0) of the tests converge to 0. Then a.
a:
oo ,
This is bounded away from zero and infinity if en converges to zero at rate en = 0 (n - l/2 ) . For such rates the power nn (en) is asymptotically strictly between and 1 . In particular, for every h, a
The form of the limit power function is shown in Figure 14. 1 . D In the preceding example only alternatives en that converge to the null hypothesis at rate 0 ( 1 / Jn) lead to a nontrivial asymptotic power. This is typical for parameters that depend "smoothly" on the underlying distribution. In this situation a reasonable method for asymptotic comparison of two sequences of tests for H0 : e = 0 versus H0 : e > 0 is to consider local limiting power functions, defined as n (h)
=
-HXJ
nlim TCn
( �), n '\/
h ?: 0.
These limits typically exist and can be derived by the same method as in the preceding example. A general scheme is as follows. Let e be a real parameter and let the tests reject the null hypothesis Ho : e = 0 for large values of a test statistic Tn . Assume that the sequence Tn is asymptotically normal in the
195
14. 1 Asymptotic Power Functions
sense that, for all sequences of the form en = h 1 ,.jii, ( 14.4) Often J.-L (e) and cr 2 (e ) can be taken to be the mean and the variance of Tn, but this is not necessary. Because the convergence (14.4) is under a law indexed by en that changes with n, the convergence is not implied by ,.Jii (Tn - J.-L (e ) ) 0'
(e)
8
-v-+
every e .
N (O, 1 ) ,
(14.5)
On the other hand, this latter convergence uniformly in the parameter e is more than is needed in (14.4). The convergence (14.4) is sometimes referred to as "locally uniform" asymptotic normality. "Contiguity arguments" can reduce the derivation of asymptotic normality under en = hi Jn to derivation under e = 0. (See section 14. 1 . 1 ). Assumption ( 14.4) includes that the sequence ,.fii ( Tn - J.-L (O) ) converges in distribution to a normal N ( O, cr 2 (0) ) -distribution under e = 0. Thus, the tests that reject the null hypothesis H0 : e = 0 if ,.jii ( Tn - J.-L (O) ) exceeds cr (O) za are asymptotically of level The power functions of these tests can be written a.
For en = hi ,.jii, the sequence ,.Jii ( J.-L (en ) - J.-L (O) ) converges to hJ.-L' (O) if J.-L is differentiable at zero. If cr (en) ---+ cr (O), then under (14.4)
Jin
( h ) ---+ 1 -
0'
(O)
(14.6)
For easy reference we formulate this result as a theorem.
Let J.-L and cr be functions of e such that (14.4) holds for every sequence en = hi ,jfi. Suppose that J.-L is differentiable and that cr is continuous at e = 0. Then the power functions nn of the tests that reject Ho : e = 0 for large values of Tn and are asymptotically of level satisfy (14. 6) for every h. 14.7
Theorem.
a
The limiting power function depends on the sequence of test statistics only through the quantity f.-L 1 (0) I cr (0) . This is called the slope of the sequence of tests. Two sequences of tests can be asymptotically compared by just comparing the sizes of their slopes. The bigger the slope, the better the test for Ho : e = 0 versus H1 : e > 0. The size of the slope depends on the rate f.-L1 (0) of change of the asymptotic mean of the test statistics relative to their asymptotic dispersion cr (0) . A good quantitative measure of comparison is the square of the quotient of two slopes. This quantity is called the asymptotic relative efficiency and is discussed in section 14.3. If e is the only unknown parameter in the problem, then the available tests can be ranked in asymptotic quality simply by the value of their slopes. In many problems there are also nuisance parameters (for instance the shape of a density), and the slope is a function of the nuisance parameter rather than a number. This complicates the comparison considerably. For every value of the nuisance parameter a different test may be best, and additional criteria are needed to choose a particular test.
1 96
Relative Efficiency of Tests
14. 3 ,
2f
According to Example the sign test has slope (0). This can also be obtained from the preceding theorem, in which we can choose p,(8) = 1 - F( -8) F(-8) ) F(-8). o and rr 2 (8) = 14.8
Example (Sign test).
(1 -
Let X 1 , . . . , Xn be a random sample from a distribution with mean 8 and finite variance. The t -test rejects the null hypothesis for large values of L;. The sample variance S2 converges in probability to the variance rr 2 of a single observation. The central limit theorem and Slutsky's lemma give 14.9
Example (t-test).
0i( X - hj � ) � (X - hj �) + h ( _!_ - _!_ ) h!J!i N(O , 1). s
Thus Theorem rr .t 0
1/
(J
=
s
s
(J
14. 7 applies with p,(8) 8 j rr and rr (8) 1. The slope of the t-test equals =
=
Let X 1 , . . . , Xn be a random sample from a density - 8), where is symmetric about zero. We shall compare the performance of the sign test and the t -test for testing the hypothesis H0 : 8 = 0 that the observations are symmetrically distributed about zero. Assume that the distribution with density has a unique median and a finite second moment. It suffices to compare the slopes of the two tests. By the preceding examples these 1 are and ( f 2 r 1 2 , respectively. Clearly the outcome of the comparison depends on the shape It is interesting that the two slopes depend on the underlying shape in an almost orthogonal manner. The slope of the sign test depends only on the height of at zero; the slope of the t -test depends mainly on the tails of For the standard normal distribution the slopes are .J2Tif and The superiority of the t-test in this case is not surprising, because the t-test is uniformly most powerful for every n. For the Laplace distribution, the ordering is reversed: The slopes are and � .J2. The superiority of the sign test has much to do with the "unsmooth" character of the Laplace density at its mode. The relative efficiency of the sign test versus the t -test is equal to 14.10
f(x
Example (Sign versus t-test).
f
f
2j(O)
x f(x) dx f.
f
f.
1.
1
14.1
Table summarizes these numbers for a selection of shapes. For the uniform distribution, the relative efficiency of the sign test with respect to the t-test equals It can be shown that this is the minimal possible value over all densities with mode zero (problem On the other hand, it is possible to construct distributions for which this relative efficiency is arbitrarily large, by shifting mass into the tails of the distribution. The sign test is "robust" against heavy tails, the t -test is not. 0
1/3.
14. 7 ).
The simplicity of comparing slopes is attractive on the one hand, but indicates the potential weakness of asymptotics on the other. For instance, the slope of the sign test was seen to be (0) , but it is clear that this value alone cannot always give an accurate indication
f
t
Although ( 14.4) holds with this choice of iJ- and o-, it is not true that the sequence � (X IS - fJ I o- ) is asymp totically standard normal for every fixed fJ . Thus ( 1 4.5) is false for this choice of iJ- and o- . For fixed fJ the contribution of S - o- to the limit distribution cannot be neglected, but for our present purpose it can.
197
14. 1 Asymptotic Power Functions
Table 14.1. Relative efficiencies of the sign test versus the t-test for some distributions.

Distribution    Efficiency (sign/t-test)
Logistic        π²/12
Normal          2/π
Laplace         2
Uniform         1/3
Consider a density that is basically a normal density, but a tiny proportion of 10⁻¹⁰ % of its total mass is located under an extremely thin but enormously high peak at zero. The large value f(0) would strongly favor the sign test. However, at moderate sample sizes the observations would not differ significantly from a sample from a normal distribution, so that the t-test is preferable. In this situation the asymptotics are only valid for unrealistically large sample sizes. Even though asymptotic approximations should always be interpreted with care, in the present situation there is actually little to worry about. Even for n = 20, the comparison of slopes of the sign test and the t-test gives the right message for the standard distributions listed in Table 14.1.
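The entries of Table 14.1 can be reproduced by evaluating 4 f(0)² ∫ x² f(x) dx numerically. The following sketch is an illustration only (it is not part of the original text); the standard parametrizations of the densities and the use of SciPy are assumptions.

```python
# Numerical check of Table 14.1: relative efficiency of the sign test versus
# the t-test, computed as 4 f(0)^2 * E X^2 for some standard shapes.
import numpy as np
from scipy import stats
from scipy.integrate import quad

shapes = {
    "Logistic": stats.logistic(),               # expected pi^2/12
    "Normal": stats.norm(),                     # expected 2/pi
    "Laplace": stats.laplace(),                 # expected 2
    "Uniform": stats.uniform(loc=-1, scale=2),  # uniform on [-1, 1]; expected 1/3
}
for name, dist in shapes.items():
    f0 = dist.pdf(0.0)
    second_moment, _ = quad(lambda x: x**2 * dist.pdf(x), *dist.support())
    eff = 4 * f0**2 * second_moment
    print(f"{name:8s}  4 f(0)^2 E X^2 = {eff:.4f}")
```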
14.11 Example (Mann-Whitney). Suppose we observe two independent random samples X₁, ..., Xₘ and Y₁, ..., Yₙ from distributions F(x) and G(y − θ), respectively. The base distributions F and G are fixed, and it is desired to test the null hypothesis H₀: θ = 0 versus the alternative H₁: θ > 0. Set N = m + n and assume that m/N → λ ∈ (0, 1). Furthermore, assume that G has a bounded density g. The Mann-Whitney test rejects the null hypothesis for large values of U = (mn)⁻¹ Σᵢ Σⱼ 1{Xᵢ ≤ Yⱼ}. By the two-sample U-statistic theorem, this readily yields the asymptotic normality (14.5) for every fixed θ, with

μ(θ) = 1 − ∫ G(x − θ) dF(x),    σ²(θ) = (1/λ) var G(X − θ) + (1/(1 − λ)) var F(Y).

To obtain the local asymptotic power function, this must be extended to sequences θ_N = h/√N. It can be checked that the U-statistic theorem remains valid and that the Lindeberg central limit theorem applies to the right side of the preceding display with θ_N replacing θ. Thus, we find that (14.4) holds with the same functions μ and σ. (Alternatively, we can use contiguity and Le Cam's third lemma.) Hence, the slope of the Mann-Whitney test equals μ'(0)/σ(0) = ∫ g dF / σ(0). □
14.12 Example (Two-sample t-test). In the set-up of the preceding example suppose that the base distributions F and G have equal means and finite variances. Then θ = E(Y − X)
Table 14.2. Relative efficiencies of the Mann-Whitney test versus the two-sample t-test if f = g equals a number of distributions.

Distribution       Efficiency (Mann-Whitney/two-sample t-test)
Logistic           π²/9
Normal             3/π
Laplace            3/2
Uniform            1
t₃                 1.90
t₅                 1.24
c(1 − x²) ∨ 0      108/125
and the t-test rejects the null hypothesis H₀: θ = 0 for large values of the statistic (Ȳ − X̄)/S, where S²/N = S²_X/m + S²_Y/n is the unbiased estimator of var(Ȳ − X̄). The sequence S² converges in probability to σ² = var X/λ + var Y/(1 − λ). By Slutsky's lemma and the central limit theorem

√N ( Ȳ − X̄ − h/√N )/S ⇝ N(0, 1)   under θ_N = h/√N.

Thus (14.4) is satisfied and Theorem 14.7 applies with μ(θ) = θ/σ and σ(θ) = 1. The slope of the t-test equals μ'(0)/σ(0) = 1/σ. □
14.13 Example (t-test versus Mann-Whitney test). Suppose we observe two independent random samples X₁, ..., Xₘ and Y₁, ..., Yₙ from distributions F(x) and G(x − θ), respectively. The base distributions F and G are fixed and are assumed to have equal means and bounded densities. It is desired to test the null hypothesis H₀: θ = 0 versus the alternative H₁: θ > 0. Set N = m + n and assume that m/N → λ ∈ (0, 1). The slopes of the Mann-Whitney test and the t-test depend on the nuisance parameters F and G. According to the preceding examples the relative efficiency of the two sequences of tests equals

( (1 − λ) var X + λ var Y ) ( ∫ g dF )²  /  ( (1 − λ) var₀ G(X) + λ var₀ F(Y) ).

In the important case that F = G, this expression simplifies. Then the variables G(X) and F(Y) are uniformly distributed on [0, 1]. Hence they have variance 1/12 and the relative efficiency reduces to 12 var X (∫ f²(y) dy)². Table 14.2 gives the relative efficiency if F = G are both equal to a number of standard distributions. The Mann-Whitney test is inferior to the t-test if F = G equals the normal distribution, but better for the logistic, Laplace, and t-distributions. Even for the normal distribution the Mann-Whitney test does remarkably well, with a relative efficiency of 3/π ≈ 95%. The density that is proportional to (1 − x²) ∨ 0 (and any member of its scale family) is least favorable for the Mann-Whitney test. This density yields the lowest possible relative efficiency, which is still equal to 108/125 ≈ 86% (problem 14.8). On the other hand, the relative efficiency of the Mann-Whitney test is large for heavy-tailed distributions; the supremum value is infinite. Together with the fact that the Mann-Whitney test is distribution-free under the null hypothesis, this makes the Mann-Whitney test a strong competitor to the t-test, even in situations in which the underlying distribution is thought to be approximately normal. □
*14.1.1 Using Le Cam's Third Lemma
In the preceding examples the asymptotic normality of sequences of test statistics was established by direct methods. For more complicated test statistics the validity of (14.4) is more easily checked by means of Le Cam's third lemma. This is illustrated by the following example.
14.14 Example (Median test). In the two-sample set-up of Example 14.11, suppose that F = G is a continuous distribution function with finite Fisher information for location I_g. The median test rejects the null hypothesis H₀: θ = 0 for large values of the rank statistic T_N = N⁻¹ Σ_{i=m+1}^{N} 1{R_{Ni} ≤ ½(N + 1)}. By the rank central limit theorem, Theorem 13.5, under the null hypothesis,

√N ( T_N − n/(2N) ) = (m/(N√N)) Σ_{j=1}^{n} 1{F(Yⱼ) ≤ ½} − (n/(N√N)) Σ_{i=1}^{m} 1{F(Xᵢ) ≤ ½} + o_P(1).

Under the null hypothesis the sequence of variables on the right side is asymptotically normal with mean zero and variance σ²(0) = λ(1 − λ)/4. By Theorem 7.2, for every θ_N = h/√N,

log [ Π_{i=1}^{m} f(Xᵢ) Π_{j=1}^{n} g(Yⱼ − θ_N) / ( Π_{i=1}^{m} f(Xᵢ) Π_{j=1}^{n} g(Yⱼ) ) ] = −(h/√N) Σ_{j=1}^{n} (g'/g)(Yⱼ) − ½ h² (1 − λ) I_g + o_P(1).

By the multivariate central limit theorem, the linear approximations on the right sides of the two preceding displays are jointly asymptotically normal. By Slutsky's lemma the same is true for the left sides. Consequently, by Le Cam's third lemma the sequence √N(T_N − n/(2N)) converges under the alternatives θ_N = h/√N in distribution to a normal distribution with variance σ²(0) and mean the asymptotic covariance τ(h) of the linear approximations. This is given by

τ(h) = −h λ(1 − λ) ∫_{F(y) ≤ 1/2} (f'/f)(y) dF(y).

Conclude that (14.4) is valid with μ(θ) = τ(θ) and σ(θ) = σ(0). (Use the test statistics T_N − n/(2N) rather than T_N.) The slope of the median test is given by −2 √(λ(1 − λ)) ∫₀^{1/2} (f'/f)(F⁻¹(u)) du. □
14.2 Consistency

After noting that the power at fixed alternatives typically tends to 1, we focused attention on the performance of tests at alternatives converging to the null hypothesis. The comparison of local power functions is only of interest if the sequences of tests are consistent at
fixed alternatives. Fortunately, establishing consistency is rarely a problem. The following lemmas describe two basic methods.
14.15 Lemma. Let Tₙ be a sequence of statistics such that Tₙ → μ(θ) in probability under θ, for every θ. Then the family of tests that reject the null hypothesis H₀: θ = 0 for large values of Tₙ is consistent against every θ such that μ(θ) > μ(0).
14.16 Lemma. Let μ and σ be functions of θ such that (14.4) holds for every sequence θₙ = h/√n. Suppose that μ is differentiable and that σ is continuous at zero, with μ'(0) > 0 and σ(0) > 0. Suppose that the tests that reject the null hypothesis for large values of Tₙ possess nondecreasing power functions θ ↦ πₙ(θ). Then this family of tests is consistent against every alternative θ > 0. Moreover, if πₙ(0) → α, then πₙ(θₙ) → α or πₙ(θₙ) → 1 when √n θₙ → 0 or √n θₙ → ∞, respectively.
Proofs. For the first lemma, suppose that the tests reject the null hypothesis if Tₙ exceeds the critical value cₙ. By assumption, the probability under θ = 0 that Tₙ is outside the interval (μ(0) − ε, μ(0) + ε) converges to zero as n → ∞, for every fixed ε > 0. If the asymptotic level lim P₀(Tₙ > cₙ) is positive, then it follows that cₙ < μ(0) + ε eventually. On the other hand, under θ the probability that Tₙ is in (μ(θ) − ε, μ(θ) + ε) converges to 1. For sufficiently small ε and μ(θ) > μ(0), this interval is to the right of μ(0) + ε. Thus for sufficiently large n, the power P_θ(Tₙ > cₙ) can be bounded below by P_θ( Tₙ ∈ (μ(θ) − ε, μ(θ) + ε) ) → 1.

For the proof of the second lemma, first note that by Theorem 14.7 the sequence of local power functions πₙ(h/√n) converges to π(h) = 1 − Φ( z_α − h μ'(0)/σ(0) ), for every h, if the asymptotic level is α. If √n θₙ → 0, then eventually θₙ < h/√n for every given h > 0. By the monotonicity of the power functions, πₙ(θₙ) ≤ πₙ(h/√n) for sufficiently large n. Thus lim sup πₙ(θₙ) ≤ π(h) for every h > 0. For h ↓ 0 the right side converges to π(0) = α. Combination with the inequality πₙ(θₙ) ≥ πₙ(0) → α gives πₙ(θₙ) → α. The case that √n θₙ → ∞ can be handled similarly. Finally, the power πₙ(θ) at fixed alternatives is bounded below by πₙ(θₙ) eventually, for every sequence θₙ ↓ 0. Thus πₙ(θ) → 1, and the sequence of tests is consistent at θ. ∎
The following examples show that the t-test and Mann-Whitney test are both consistent against large sets of alternatives, albeit not exactly the same sets. They are both tests to compare the locations of two samples, but the pertaining definitions of "location" are not the same. The t-test can be considered a test to detect a difference in mean; the Mann-Whitney test is designed to find a difference of P(X ≤ Y) from its value 1/2 under the null hypothesis. This evaluation is justified by the following examples and is further underscored by the consideration of asymptotic efficiency in nonparametric models. It is shown in Section 25.6 that the tests are asymptotically efficient for testing the parameters EY − EX or P(X ≤ Y) if the underlying distributions F and G are completely unknown.

14.17 Example (t-test). The two-sample t-statistic (Ȳ − X̄)/S converges in probability to E(Y − X)/σ, where σ² = lim N var(Ȳ − X̄). If the null hypothesis postulates that EY = EX, then the test that rejects the null hypothesis for large values of the t-statistic is consistent against every alternative for which EY > EX. □
14.18 Example (Mann-Whitney test). The Mann-Whitney statistic U converges in probability to P(X ≤ Y), by the two-sample U-statistic theorem. The probability P(X ≤ Y) is equal to 1/2 if the two samples are equal in distribution and possess a continuous distribution function. If the null hypothesis postulates that P(X ≤ Y) = 1/2, then the test that rejects for large values of U is consistent against any alternative for which P(X ≤ Y) > 1/2. □

14.3 Asymptotic Relative Efficiency
Sequences of tests can be ranked in quality by comparing their asymptotic power functions. For the test statistics we have considered so far, this comparison only involves the "slopes" of the tests. The concept of relative efficiency yields a method to quantify the interpretation of the slopes.

Consider a sequence of testing problems consisting of testing a null hypothesis H₀: θ = 0 versus the alternative H₁: θ = θ_ν. We use the parameter ν to describe the asymptotics; thus ν → ∞. We require a priori that our tests attain asymptotically level α and power γ ∈ (α, 1). Usually we can meet this requirement by choosing an appropriate number of observations at "time" ν. A larger number of observations allows smaller level and higher power. If πₙ is the power function of a test if n observations are available, then we define n_ν to be the minimal number of observations n such that both

πₙ(0) ≤ α   and   πₙ(θ_ν) ≥ γ.
If two sequences of tests are available, then we prefer the sequence for which the numbers n_ν are smallest. Suppose that n_{ν,1} and n_{ν,2} observations are needed for two given sequences of tests. Then, if it exists, the limit

lim_{ν→∞} n_{ν,2} / n_{ν,1}

is called the (asymptotic) relative efficiency or Pitman efficiency of the first with respect to the second sequence of tests. A relative efficiency larger than 1 indicates that fewer observations are needed with the first sequence of tests, which may then be considered the better one. In principle, the relative efficiency may depend on α, γ, and the sequence of alternatives θ_ν. The concept is mostly of interest if the relative efficiency is the same for all possible choices of these parameters. This is often the case. In particular, in the situations considered previously, the relative efficiency turns out to be the square of the quotient of the slopes.
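As a numerical illustration of this definition (not from the original text; the level α = 0.05, power γ = 0.8, standard normal observations, and the use of SciPy are assumptions), one can compute the minimal sample sizes of the sign test and of the one-sample t-test at shrinking normal-location alternatives and watch their quotient approach the squared quotient of slopes 2/π ≈ 0.64 found in Example 14.10.

```python
# Minimal sample sizes for the (non-randomized) sign test and the t-test at
# level 0.05 and power 0.8 against N(theta, 1); their quotient approximates
# the Pitman efficiency 2/pi of the sign test relative to the t-test.
import numpy as np
from scipy import stats

alpha, gamma = 0.05, 0.8

def n_sign(theta, n_max=100000):
    p = stats.norm.cdf(theta)                    # P(X_i > 0) under N(theta, 1)
    for n in range(5, n_max):
        # smallest critical value c with P(Bin(n, 1/2) >= c) <= alpha
        c = int(stats.binom.ppf(1 - alpha, n, 0.5)) + 1
        if stats.binom.sf(c - 1, n, p) >= gamma:
            return n

def n_ttest(theta, n_max=100000):
    for n in range(5, n_max):
        crit = stats.t.ppf(1 - alpha, df=n - 1)
        if stats.nct.sf(crit, df=n - 1, nc=np.sqrt(n) * theta) >= gamma:
            return n

for theta in [0.5, 0.3, 0.2, 0.1]:
    ns, nt = n_sign(theta), n_ttest(theta)
    print(f"theta={theta:4.2f}  n_sign={ns:5d}  n_t={nt:5d}  n_t/n_sign={nt/ns:.3f}")
```

(The non-randomized sign test has attained level below α, so the quotient oscillates slightly before settling near 2/π.)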
14.19 Theorem. Consider statistical models (P_{n,θ}: θ ≥ 0) such that ‖P_{n,θ} − P_{n,0}‖ → 0 as θ → 0, for every n. Let T_{n,1} and T_{n,2} be sequences of statistics that satisfy (14.4) for every sequence θₙ ↓ 0 and functions μᵢ and σᵢ such that μᵢ is differentiable at zero and σᵢ is continuous at zero, with μᵢ'(0) > 0 and σᵢ(0) > 0. Then the relative efficiency of the tests that reject the null hypothesis H₀: θ = 0 for large values of T_{n,i} is equal to

( μ₁'(0)/σ₁(0) )² / ( μ₂'(0)/σ₂(0) )²,

for every sequence of alternatives θ_ν ↓ 0, independently of α > 0 and γ ∈ (α, 1). If the power functions of the tests based on T_{n,i} are nondecreasing for every n, then the assumption of asymptotic normality of T_{n,i} can be relaxed to asymptotic normality under every sequence θₙ = O(1/√n) only.
Proof. Fix α and γ as in the introduction and, given alternatives θ_ν ↓ 0, let n_{ν,i} observations be used with each of the two tests. The assumption that ‖P_{n,θ_ν} − P_{n,0}‖ → 0 as ν → ∞ for each fixed n forces n_{ν,i} → ∞. Indeed, the sum of the probabilities of errors of the first and second kind of the test with critical region Kₙ equals

∫_{Kₙ} dP_{n,0} + ∫_{Kₙᶜ} dP_{n,θ_ν} = 1 + ∫_{Kₙ} ( p_{n,0} − p_{n,θ_ν} ) dμₙ.

This sum is minimized for the critical region Kₙ = {p_{n,0} − p_{n,θ_ν} < 0}, and then equals 1 − ½‖P_{n,θ_ν} − P_{n,0}‖. By assumption, this converges to 1 as ν → ∞, uniformly in every finite set of n. Thus, for every bounded sequence n = n_ν and any sequence of tests, the sum of the error probabilities is asymptotically bounded below by 1 and cannot be bounded above by α + 1 − γ < 1, as required.

Now that we have ascertained that n_{ν,i} → ∞ as ν → ∞, we can use the asymptotic normality of the test statistics T_{n,i}. The convergence to a continuous distribution implies that the asymptotic level and power attained for the minimal numbers of observations (minimal for obtaining at most level α and at least power γ) is exactly α and γ. In order to obtain asymptotic level α the tests must reject H₀ if √n_ν ( T_{n_ν,i} − μᵢ(0) ) > σᵢ(0) z_α + o(1). The powers of these tests are equal to

π_{n_ν,i}(θ_ν) = 1 − Φ( z_α − √n_ν θ_ν μᵢ'(0)/σᵢ(0) ) + o(1).

This sequence of powers tends to γ < 1 if and only if the argument of Φ tends to z_γ, that is, if and only if n_ν θ_ν² → (z_α − z_γ)² / (μᵢ'(0)/σᵢ(0))². Hence

lim_{ν→∞} n_{ν,2}/n_{ν,1} = lim_{ν→∞} (n_{ν,2} θ_ν²)/(n_{ν,1} θ_ν²) = [ (z_α − z_γ)² / (μ₂'(0)/σ₂(0))² ] / [ (z_α − z_γ)² / (μ₁'(0)/σ₁(0))² ].

This proves the first assertion of the theorem. If the power functions of the tests are monotone and the test statistics are asymptotically normal for every sequence θₙ = O(1/√n), then π_{n,i}(θₙ) → α or 1 if √n θₙ → 0 or ∞, respectively (see Lemma 14.16). In that case the sequences of tests can only meet the (α, γ) requirement for testing alternatives θ_ν such that √n_ν θ_ν = O(1). For such sequences the preceding argument is valid and gives the asserted relative efficiency. ∎
*14.4 Other Relative Efficiencies
The asymptotic relative efficiency defined in the preceding section is known as the Pitman relative efficiency. In this section we discuss some other types of relative efficiencies. Define nᵢ(α, γ, θ), for i ∈ {1, 2} and two given sequences of tests, as the minimal numbers of observations needed to test the null hypothesis H₀: θ = 0 against the alternative θ at level α and with power at least γ. Then the Pitman efficiency against a sequence of alternatives θ_ν → 0 is defined as (if the limit exists)

lim_{ν→∞} n₂(α, γ, θ_ν) / n₁(α, γ, θ_ν).
The device to let the alternatives θ_ν tend to the null hypothesis was introduced to make the testing problems harder and harder, so that the required numbers of observations tend to infinity, and the comparison becomes an asymptotic one. There are other possibilities that can serve the same end. The testing problem is harder as α is smaller, as γ is larger, and (typically) as θ is closer to the null hypothesis. Thus, we could also let α tend to zero, or γ tend to one, keeping the other parameters fixed, or even let two or all three of the parameters vary. For each possible method we could define the relative efficiency of two sequences of tests as the limit of the quotient of the minimal numbers of observations that are needed. Most of these possibilities have been studied in the literature. Next to the Pitman efficiency the most popular efficiency measure appears to be the Bahadur efficiency, which is defined as

lim_{ν→∞} n₂(α_ν, γ, θ) / n₁(α_ν, γ, θ).
Here α_ν tends to zero, but γ and θ are fixed. Typically, the Bahadur efficiency depends on θ, but not on γ, and not on the particular sequence α_ν ↓ 0 that is used. Whereas the calculation of Pitman efficiencies is most often based on distributional limit theorems, Bahadur efficiencies are derived from large deviations results. The reason is that the probabilities of errors of the first or second kind for testing a fixed null hypothesis against a fixed alternative usually tend to zero at an exponential speed. Large deviations theorems quantify this speed. Suppose that the null hypothesis H₀: θ = 0 is rejected for large values of a test statistic Tₙ, and that
−(2/n) log P₀( Tₙ ≥ t ) → e(t),   every t,   (14.20)

Tₙ → μ(θ) in probability under θ,   every θ > 0.   (14.21)
The first result is a large deviation type result, and the second a "law of large numbers." The observed significance level of the test is defined as P₀(Tₙ ≥ t)|_{t=Tₙ}. Under the null hypothesis, this random variable is uniformly distributed if Tₙ possesses a continuous distribution function. For a fixed alternative θ, it typically converges to zero at an exponential rate. For instance, under the preceding conditions, if e is continuous at μ(θ), then (because e is necessarily monotone) it is immediate that
−(2/n) log P₀( Tₙ ≥ t )|_{t=Tₙ} → e( μ(θ) )   in probability under θ.
The quantity e(μ(θ)) is called the Bahadur slope of the test (or rather the limit in probability of the left side, if it exists). The quotient of the slopes of two sequences of test statistics gives the Bahadur relative efficiency.
14.22 Theorem. Let T_{n,1} and T_{n,2} be sequences of statistics in statistical models (P_{n,0}, P_{n,θ}) that satisfy (14.20) and (14.21) for functions eᵢ and numbers μᵢ(θ) such that eᵢ is continuous at μᵢ(θ). Then the Bahadur relative efficiency of the sequences of tests that reject for large values of T_{n,i} is equal to e₁(μ₁(θ)) / e₂(μ₂(θ)), for every α_ν ↓ 0 and every 1 > γ > supₙ P_{n,θ}( p_{n,0} = 0 ).
Proof. For simplicity of notation, we drop the index i ∈ {1, 2} and write n_ν for the minimal number of observations needed to obtain level α_ν and power γ with the test statistics Tₙ.
The sample sizes n_ν necessarily converge to ∞ as ν → ∞. If not, then there would exist a fixed value n and a (sub)sequence of tests with levels tending to 0 and powers at least γ. However, for any fixed n, and any sequence of measurable sets Kₘ with P_{n,0}(Kₘ) → 0 as m → ∞, the probabilities P_{n,θ}(Kₘ) = P_{n,θ}( Kₘ ∩ {p_{n,0} = 0} ) + o(1) are eventually strictly smaller than γ, by assumption.

The most powerful level α_ν-test that rejects for large values of Tₙ has critical region {Tₙ ≥ cₙ} or {Tₙ > cₙ} for cₙ = inf{c: P₀(Tₙ ≥ c) ≤ α_ν}, where we use ≥ if P₀(Tₙ ≥ cₙ) ≤ α_ν and > otherwise. Equivalently, with the notation Lₙ = P₀(Tₙ ≥ t)|_{t=Tₙ}, this is the test with critical region {Lₙ ≤ α_ν}. By the definition of n_ν we conclude that

P_θ( −(2/n) log Lₙ ≥ −(2/n) log α_ν )   { ≥ γ for n = n_ν,   < γ for n = n_ν − 1 }.

By (14.20) and (14.21), the random variable inside the probability converges in probability to the number e(μ(θ)) as n → ∞. Thus, the probability converges to 0 or 1 if −(2/n) log α_ν is asymptotically strictly bigger or smaller than e(μ(θ)), respectively. Conclude that

lim sup_{ν→∞} −(2/n_ν) log α_ν ≤ e(μ(θ)),   lim inf_{ν→∞} −(2/(n_ν − 1)) log α_ν ≥ e(μ(θ)).

Combined, this yields the asymptotic equivalence n_ν ~ −2 log α_ν / e(μ(θ)). Applying this for both n_{ν,1} and n_{ν,2} and taking the quotient, we obtain the theorem. ∎
Bahadur and Pitman efficiencies do not always yield the same ordering of sequences of tests. In numerical comparisons, the Pitman efficiencies appear to be more relevant for moderate sample sizes. This is explained by their method of calculation. By the preceding theorem, Bahadur efficiencies follow from a large deviations result under the null hypothesis and a law of large numbers under the alternative. A law of large numbers is of less accuracy than a distributional limit result. Furthermore, large deviation results, while mathematically interesting, often yield poor approximations for the probabilities of interest. For instance, condition (14.20) shows that P₀(Tₙ ≥ t) = exp(−½ n e(t)) exp o(n). Nothing guarantees that the term exp o(n) is close to 1.

On the other hand, often the Bahadur efficiencies as a function of θ are more informative than Pitman efficiencies. The Pitman slopes are obtained under the condition that the sequence √n(Tₙ − μ(0)) is asymptotically normal with mean zero and variance σ²(0). Suppose, for the present argument, that Tₙ is normally distributed for every finite n, with the parameters μ(0) and σ²(0)/n. Then, because 1 − Φ(t) ~ φ(t)/t as t → ∞,

−(2/n) log P₀( Tₙ ≥ μ(0) + t ) = −(2/n) log ( 1 − Φ( √n t/σ(0) ) ) → t²/σ²(0),   every t > 0.

In this case the Bahadur slope at the alternative θ would be (μ(θ) − μ(0))²/σ²(0), which for θ close to zero leads to the same comparison as the Pitman slopes.
Now, the preceding argument is completely false if Tn is only approximately normally distributed: Departures from normality that are negligible in the sense of weak convergence need not be so for large-deviation probabilities. The difference between the "approximate
Bahadur slopes" just obtained and the true slopes is often substantial. However, the argument tends to be "more correct" as t approaches μ(0), and the conclusion that limiting Bahadur efficiencies are equal to Pitman efficiencies is often correct.†

The main tool needed to evaluate Bahadur efficiencies is the large-deviation result (14.20). For averages Tₙ, this follows from the Cramér-Chernoff theorem, which can be thought of as the analogue of the central limit theorem for large deviations. It is a refinement of the weak law of large numbers that yields exponential convergence of probabilities of deviations from the mean. The cumulant generating function of a random variable Y is the function u ↦ K(u) = log E e^{uY}. If we allow the value ∞, then this is well-defined for every u ∈ ℝ. The set of u such that K(u) is finite is an interval that may or may not contain its boundary points and may be just the point {0}.
Proposition (Cramer-Chernoff theorem).
1
-
- log P(Y n
:=:
t) --+ inf( K ( u ) - tu) . u ?:: O
The cumulant generating function of the variables Yi - t is equal to u 1---+ K (u) - u t . Therefore, we can restrict ourselves to the case t 0. The proof consists of separate upper and lower bounds on the probabilities P( Y :=: 0) . The upper bound is easy and is valid for every n . By Markov's inequality, for every u ::: 0, P( Y :=: 0) P (e u n Yn :=: 1) ::S Ee u n Yn en K ( u ) .
Proof
=
=
=
Take logarithms, divide by n, and take the infimum over u ≥ 0 to find one half of the proposition.

For the proof of the lower bound, first consider the cases that Yᵢ is nonnegative or nonpositive. If P(Yᵢ < 0) = 0, then the function u ↦ K(u) is monotonely increasing on ℝ and its infimum over u ≥ 0 is equal to 0 (attained at u = 0); this is equal to n⁻¹ log P(Ȳₙ ≥ 0) for every n. Second, if P(Yᵢ > 0) = 0, then the function u ↦ K(u) is monotonely decreasing on ℝ with K(∞) = log P(Y₁ = 0); this is equal to n⁻¹ log P(Ȳₙ ≥ 0) for every n. Thus, the theorem is valid in both cases, and we may exclude them from now on.

First, assume that K(u) is finite for every u ∈ ℝ. Then the function u ↦ K(u) is analytic on ℝ, and, by differentiating under the expectation, we see that K'(0) = EY₁. Because Yᵢ takes both negative and positive values, K(u) → ∞ as u → ±∞. Thus, the infimum of the function u ↦ K(u) over u ∈ ℝ is attained at a point u₀ such that K'(u₀) = 0. The case that u₀ < 0 is trivial but requires an argument. By the convexity of the function u ↦ K(u), K is nondecreasing on [u₀, ∞). If u₀ < 0, then K attains its minimum value over u ≥ 0 at u = 0, which is K(0) = 0. Furthermore, in this case EY₁ = K'(0) > K'(u₀) = 0 (strict inequality under our restrictions, for instance because K''(0) = var Y₁ > 0) and hence P(Ȳₙ ≥ 0) → 1 by the law of large numbers. Thus, the limit of the left side of the proposition (with t = 0) is 0 as well.
† In [85] a precise argument is given.
For u₀ ≥ 0, let Z₁, Z₂, ... be i.i.d. random variables with the distribution given by dP_Z(z) = e^{−K(u₀)} e^{u₀ z} dP_Y(z). Then Z₁ has cumulant generating function u ↦ K(u₀ + u) − K(u₀), and, as before, its mean can be found by differentiating this function at u = 0: EZ₁ = K'(u₀) = 0. For every ε > 0,

P( Ȳₙ ≥ 0 ) = E 1{ Z̄ₙ ≥ 0 } e^{−u₀ n Z̄ₙ} e^{n K(u₀)} ≥ P( 0 ≤ Z̄ₙ ≤ ε ) e^{−u₀ n ε} e^{n K(u₀)}.

Because Z̄ₙ has mean 0, the sequence P(0 ≤ Z̄ₙ ≤ ε) is bounded away from 0, by the central limit theorem. Conclude that n⁻¹ times the limit inferior of the logarithm of the left side is bounded below by −u₀ε + K(u₀). This is true for every ε > 0 and hence also for ε = 0.

Finally, we remove the restriction that K(u) is finite for every u, by a truncation argument. For a fixed, large M, let Y₁^M, Y₂^M, ... be distributed as the variables Y₁, Y₂, ... given that |Yᵢ| ≤ M for every i; that is, they are i.i.d. according to the conditional distribution of Y₁ given |Y₁| ≤ M. Then, with u ↦ K_M(u) = log E e^{uY₁} 1{|Y₁| ≤ M},

lim inf (1/n) log P( Ȳₙ ≥ 0 ) ≥ lim inf (1/n) log ( P( Ȳₙ^M ≥ 0 ) P( |Y₁| ≤ M )ⁿ ) ≥ inf_{u ≥ 0} K_M(u),

by the preceding argument applied to the truncated variables. Let s be the limit of the right side as M → ∞, and let A_M be the set {u ≥ 0: K_M(u) ≤ s}. Then the sets A_M are nonempty and compact for sufficiently large M (as soon as K_M(u) → ∞ as u → ±∞), with A₁ ⊃ A₂ ⊃ ···, whence ∩A_M is nonempty as well. Because K_M converges pointwise to K as M → ∞, any point u₁ ∈ ∩A_M satisfies K(u₁) = lim K_M(u₁) ≤ s. Conclude that s is bigger than the right side of the proposition (with t = 0). ∎
14.24 Example (Sign statistic). The cumulant generating function of a variable Y that is −1 and 1, each with probability ½, is equal to K(u) = log cosh u. Its derivative is K'(u) = tanh u and hence the infimum of K(u) − tu over u ∈ ℝ is attained for u = arctanh t. By the Cramér-Chernoff theorem, for 0 < t < 1,

−(2/n) log P( Ȳₙ ≥ t ) → e(t) := −2 log cosh(arctanh t) + 2t arctanh t.
We can apply this result to find the Bahadur slope of the sign statistic Tₙ = n⁻¹ Σ_{i=1}^{n} sign(Xᵢ). If the null distribution of the random variables X₁, ..., Xₙ is continuous and symmetric about zero, then (14.20) is valid with e(t) as in the preceding display and with μ(θ) = E_θ sign(X₁). Figure 14.2 shows the slopes of the sign statistic and the sample mean for testing the location of the Laplace distribution. The local optimality of the sign statistic is reflected in the Bahadur slopes, but for detecting large differences of location the mean is better than the sign statistic. However, it should be noted that the power of the sign test in this range is so close to 1 that improvement may be irrelevant; for example, the power is 0.999 at level 0.007 for n = 25 at θ = 2. □
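Because Σ sign(Xᵢ) has a binomial null distribution, the limit (14.20) for the sign statistic can be checked against exact binomial tail probabilities. The sketch below is an illustration only (it is not from the original text); it assumes SciPy and a continuous, symmetric null distribution, under which #{Xᵢ > 0} is Binomial(n, 1/2).

```python
# Exact check of -(2/n) log P0(T_n >= t) -> e(t) for the sign statistic,
# where T_n = n^{-1} sum sign(X_i) and #{X_i > 0} ~ Binomial(n, 1/2) under H0.
import numpy as np
from scipy import stats

def e_sign(t):
    a = np.arctanh(t)
    return -2 * np.log(np.cosh(a)) + 2 * t * a

t = 0.3
for n in [20, 100, 500, 2500]:
    # T_n >= t  iff  #{X_i > 0} >= n(1 + t)/2
    k = int(np.ceil(n * (1 + t) / 2))
    tail = stats.binom.sf(k - 1, n, 0.5)          # P0(T_n >= t), exactly
    print(f"n={n:5d}  -(2/n) log P0 = {-2 / n * np.log(tail):.4f}   e(t) = {e_sign(t):.4f}")
```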
Figure 14.2. Bahadur slopes of the sign statistic (solid line) and the sample mean (dotted line) for testing that a random sample from the Laplace distribution has mean zero versus the alternative that the mean is θ, as a function of θ.
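The two curves of Figure 14.2 can be recomputed directly from the slope formulas. The sketch below is a non-authoritative illustration: it assumes the standard Laplace density ½e^{−|x|} (so that E_θ sign(X₁) = 1 − e^{−θ} for θ > 0 and the null cumulant generating function of a single observation is −log(1 − u²)), and it uses the closed-form minimizer of K(u) − tu for the sample mean.

```python
# Bahadur slopes for Laplace location: sign statistic versus sample mean.
import numpy as np

def slope_sign(theta):
    t = 1 - np.exp(-theta)                  # mu(theta) = E_theta sign(X_1)
    a = np.arctanh(t)
    return 2 * t * a - 2 * np.log(np.cosh(a))

def slope_mean(theta):
    # e(t) = -2 inf_u (K(u) - t u) with K(u) = -log(1 - u^2), evaluated at t = theta;
    # the infimum is attained at u* = (sqrt(1 + t^2) - 1)/t.
    t = theta
    u = (np.sqrt(1 + t**2) - 1) / t
    return 2 * t * u + 2 * np.log(1 - u**2)

for theta in [0.25, 0.5, 1.0, 1.5, 2.0, 3.0]:
    print(f"theta={theta:4.2f}   sign {slope_sign(theta):.3f}   mean {slope_mean(theta):.3f}")
```

The output shows the sign statistic ahead for small θ (local optimality) and the sample mean ahead for larger θ, as in the figure.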
14.25 Example (Student statistic). Suppose that X₁, ..., Xₙ are a random sample from a normal distribution with mean μ and variance σ². We shall consider σ² known and compare the slopes of the sample mean X̄ₙ and the Student statistic X̄ₙ/Sₙ for testing H₀: μ = 0. The cumulant generating function of the normal distribution is equal to K(u) = uμ + ½u²σ². By the Cramér-Chernoff theorem, for t > 0,

−(2/n) log P₀( X̄ₙ ≥ t ) → e(t) := t²/σ².

Thus, the Bahadur slope of the sample mean is equal to μ²/σ², for every μ > 0.

Under the null hypothesis, the statistic √n X̄ₙ/Sₙ possesses the t-distribution with (n − 1) degrees of freedom. Thus, for a random sample Z₀, Z₁, ... of standard normal variables, for every t > 0,

P₀( √(n/(n − 1)) X̄ₙ/Sₙ ≥ t ) = ½ P( Z₀² / Σ_{i=1}^{n−1} Zᵢ² ≥ t² ) = ½ P( Z₀² − t² Σ_{i=1}^{n−1} Zᵢ² ≥ 0 ).

This probability is not of the same form as in the Cramér-Chernoff theorem, but it concerns almost an average, and we can obtain the large deviation probabilities from the cumulant generating function in an analogous way. The cumulant generating function of the square of a standard normal variable is equal to u ↦ −½ log(1 − 2u), and hence the cumulant generating function of the variable Z₀² − t² Σ_{i=1}^{n−1} Zᵢ² is equal to

Kₙ(u) = −½ log(1 − 2u) − ½ (n − 1) log(1 + 2t²u).

This function is nicely differentiable and, by straightforward calculus, its minimum value can be found to be

inf_u Kₙ(u) = −½ log( (t² + 1)/(t² n) ) − ½ (n − 1) log( (n − 1)(t² + 1)/n ).

The minimum is achieved on [0, ∞) for t² ≥ (n − 1)⁻¹. This expression divided by n is the analogue of inf_u ( K(u) − tu ) in the Cramér-Chernoff theorem. By an extension of this theorem,
for every t > 0,

−(2/n) log P₀( √(n/(n − 1)) X̄ₙ/Sₙ ≥ t ) → e(t) = log(t² + 1).
Thus, the Bahadur slope of the Student statistic is equal to log(1 + μ²/σ²). For μ/σ close to zero, the Bahadur slopes of the sample mean and the Student statistic are close, but for large μ/σ the slope of the sample mean is much bigger. This suggests that the loss in efficiency incurred by unnecessarily estimating the standard deviation σ can be substantial. This suggestion appears to be unrealistic and also contradicts the fact that the Pitman efficiencies of the two sequences of statistics are equal. □

14.26 Example (Neyman-Pearson statistic). The sequence of Neyman-Pearson statistics Π_{i=1}^{n} (p_θ/p_{θ₀})(Xᵢ) has Bahadur slope −2P_θ log(p_{θ₀}/p_θ). This is twice the Kullback-Leibler divergence of the measures P_{θ₀} and P_θ and shows an important connection between large deviations and the Kullback-Leibler divergence.

In regular cases this result is a consequence of the Cramér-Chernoff theorem. The variable Y = log(p_θ/p_{θ₀}) has cumulant generating function K(u) = log ∫ p_θ^u p_{θ₀}^{1−u} dμ under P_{θ₀}. The function K(u) is finite for 0 ≤ u ≤ 1, and, at least by formal calculus, K'(1) = P_θ log(p_θ/p_{θ₀}) = μ(θ), where μ(θ) is the asymptotic mean of the sequence n⁻¹ Σ log(p_θ/p_{θ₀})(Xᵢ). Thus the infimum of the function u ↦ K(u) − uμ(θ) is attained at u = 1 and the Bahadur slope is given by

e( μ(θ) ) = −2( K(1) − μ(θ) ) = 2 P_θ log ( p_θ/p_{θ₀} ).

In section 16.6 we obtain this result by a direct, and rigorous, argument. □
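For two specific distributions the slope formula of the preceding example can be verified numerically, by computing K(u), its infimum over u, and the Kullback-Leibler divergence by quadrature. The sketch below is an illustration only; the choice P_{θ₀} = N(0, 1), P_θ = N(θ, 1) and the use of SciPy are assumptions, not part of the text.

```python
# Check e(mu(theta)) = -2 inf_u (K(u) - u mu(theta)) = 2 * KL(P_theta, P_0)
# for the Neyman-Pearson statistic, with P_0 = N(0,1) and P_theta = N(theta,1).
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

theta = 1.5
p0 = stats.norm(0, 1).pdf
pt = stats.norm(theta, 1).pdf

def K(u):
    # cumulant generating function of log(p_theta/p_0)(X) under P_0
    val, _ = quad(lambda x: pt(x)**u * p0(x)**(1 - u), -np.inf, np.inf)
    return np.log(val)

kl, _ = quad(lambda x: pt(x) * np.log(pt(x) / p0(x)), -np.inf, np.inf)  # mu(theta)
res = minimize_scalar(lambda u: K(u) - u * kl, bounds=(0, 1), method="bounded")
print(f"slope -2 inf(K(u) - u mu) = {-2 * res.fun:.4f}")
print(f"2 * KL(P_theta, P_0)      = {2 * kl:.4f}   (exact value theta^2 = {theta**2:.4f})")
```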
For statistics that are not means, the Cramér-Chernoff theorem is not applicable, and we need other methods to compute the Bahadur efficiencies. An important approach applies to functions of means and is based on more general versions of Cramér's theorem. A first generalization asserts that, for certain sets B, not necessarily of the form [t, ∞),

(1/n) log P( Ȳₙ ∈ B ) → −inf_{y ∈ B} I(y),    I(y) = sup_u ( uy − K(u) ).

For a given statistic of the form φ(Ȳₙ), the large deviation probabilities of interest P(φ(Ȳₙ) ≥ t) can be written in the form P(Ȳₙ ∈ B_t) for the inverse images B_t = φ⁻¹[t, ∞). If B_t is an eligible set in the preceding display, then the desired large deviations result follows, although we shall still have to evaluate the repeated "inf sup" on the right side. Now, according to Cramér's theorem, the display is valid for every set B such that the right side does not change if B is replaced by its interior or its closure. In particular, if φ is continuous, then B_t is closed and its interior contains the set φ⁻¹(t, ∞). Then we obtain a large deviations result if the difference set φ⁻¹{t} is "small" in that it does not play a role when evaluating the right side of the display.

Transforming a univariate mean Ȳₙ into a statistic φ(Ȳₙ) can be of interest (for example, to study the two-sided test statistics |Ȳₙ|), but the real promise of this approach is in its applications to multivariate and infinite-dimensional means. Cramér's theorem has been generalized to these situations. General large deviation theorems can best be formulated
as separate upper and lower bounds. A sequence of random maps Xₙ: Ω ↦ 𝔻 from a probability space (Ω, 𝒰, P) into a topological space 𝔻 is said to satisfy the large deviation principle with rate function I if, for every closed set F and for every open set G,

lim sup_{n→∞} (1/n) log P*( Xₙ ∈ F ) ≤ −inf_{y ∈ F} I(y),

lim inf_{n→∞} (1/n) log P*( Xₙ ∈ G ) ≥ −inf_{y ∈ G} I(y).

The rate function I: 𝔻 ↦ [0, ∞] is assumed to be lower semicontinuous and is called a good rate function if the sublevel sets {y: I(y) ≤ M} are compact, for every M ∈ ℝ. The inner and outer probabilities that Xₙ belongs to a general set B are sandwiched between the probabilities that it belongs to the interior int B and the closure cl B. Thus, we obtain a large deviation result with equality for every set B such that inf{I(y): y ∈ int B} = inf{I(y): y ∈ cl B}. An implication for the slopes of test statistics of the form φ(Xₙ) is as follows.
14.27 Lemma. Suppose that φ: 𝔻 ↦ ℝ is continuous at every y such that I(y) < ∞, and suppose that inf{I(y): φ(y) > t} = inf{I(y): φ(y) ≥ t}. If the sequence Xₙ satisfies the large-deviation principle with the rate function I under P₀, then Tₙ = φ(Xₙ) satisfies (14.20) with e(t) = 2 inf{I(y): φ(y) ≥ t}. Furthermore, if I is a good rate function, then e is continuous at t.

Proof. Define sets A_t = φ⁻¹(t, ∞) and B_t = φ⁻¹[t, ∞), and let 𝔻₀ be the set where I is finite. By the continuity of φ, cl B_t ∩ 𝔻₀ = B_t ∩ 𝔻₀ and int B_t ∩ 𝔻₀ ⊃ A_t ∩ 𝔻₀. (If y ∉ int B_t, then there is a net yₙ ∈ B_tᶜ with yₙ → y; if also y ∈ 𝔻₀, then φ(y) = lim φ(yₙ) ≤ t and hence y ∉ A_t.) Consequently, the infimum of I over int B_t is at most the infimum over A_t, which is the infimum over B_t by assumption, and also the infimum over cl B_t; because int B_t ⊂ B_t ⊂ cl B_t, all these infima are equal. Condition (14.20) follows upon applying the large deviation principle to int B_t and cl B_t.

The function e is nondecreasing. The condition on the pair (I, φ) is exactly that e is right-continuous, because e(t+) = 2 inf{I(y): φ(y) > t}. To prove the left-continuity of e, let tₘ ↑ t. Then e(tₘ) ↑ a for some a ≤ e(t). If a = ∞, then e(t) = ∞ and e is left-continuous. If a < ∞, then there exists a sequence yₘ with φ(yₘ) ≥ tₘ and 2I(yₘ) ≤ a + 1/m. By the goodness of I, this sequence has a converging subnet yₘ′ → y. Then 2I(y) ≤ lim inf 2I(yₘ′) ≤ a by the lower semicontinuity of I, and φ(y) ≥ t by the continuity of φ. Thus e(t) ≤ 2I(y) ≤ a. ∎
Empirical distributions can be viewed as means (of Dirac measures), and are therefore potential candidates for a large-deviation theorem. Cramér's theorem for empirical distributions is known as Sanov's theorem. Let 𝕃₁(X, 𝒜) be the set of all probability measures on the measurable space (X, 𝒜), which we assume to be a complete, separable metric space with its Borel σ-field. The τ-topology on 𝕃₁(X, 𝒜) is defined as the weak topology generated by the collection of all maps P ↦ Pf, for f ranging over the set of all bounded, measurable functions f: X ↦ ℝ.†

† For a proof of the following theorem, see [31], [32], or [65].

14.28 Theorem (Sanov's theorem). Let ℙₙ be the empirical measure of a random sample of size n from a fixed measure P. Then the sequence ℙₙ viewed as maps into 𝕃₁(X, 𝒜)
satisfies the large deviation principle relative to the τ-topology, with the good rate function I(Q) = −Q log(p/q).
For X equal to the real line, 𝕃₁(X, 𝒜) can be identified with the set of cumulative distribution functions. The τ-topology is stronger than the topology obtained from the uniform norm on the distribution functions. This follows from the fact that if both Fₙ(x) → F(x) and Fₙ{x} → F{x} for every x ∈ ℝ, then ‖Fₙ − F‖_∞ → 0 (see problem 19.9). Thus any function φ that is continuous with respect to the uniform norm is also continuous with respect to the τ-topology, and we obtain a large collection of functions to which we can apply the preceding lemma. Trimmed means are just one example.

14.29 Example (Trimmed means). Let 𝔽ₙ be the empirical distribution function of a random sample of size n from the distribution function F, and let 𝔽ₙ⁻¹ be the corresponding quantile function. The function φ(𝔽ₙ) = (1 − 2α)⁻¹ ∫_α^{1−α} 𝔽ₙ⁻¹(s) ds yields a version of the α-trimmed mean (see Chapter 22). We assume that 0 < α < ½ and (partly for simplicity) that the null distribution F₀ is continuous. If we show that the conditions of Lemma 14.27 are fulfilled, then we can conclude, by Sanov's theorem,

−(2/n) log P_{F₀}( φ(𝔽ₙ) ≥ t ) → e(t) := 2 inf_{G: φ(G) ≥ t} ( −G log (f₀/g) ).
oo,
dGm (x) = { 1 dG
l_
1+�
if if
X
X
C, > C.
::'S
Then Gm is a probability distribution for suitably chosen Em > 0, and, by the dominated convergence G m log(fafgm ) ----+ G log(fa/g) Because Gm (x) ::=:: G(x) for all x , with strict inequality (at least) for all x ::::: c such that G (x) > 0, we have that G ;; / (s) ::=:: a- 1 (s) for all s, with strict inequality for all s E (0, G(c)]. Hence the trimmed mean ¢ (G m ) is strictly bigger than the trimmed mean ¢ (G), for every 0 as m ----+
oo.
m.
14.5 * 14.5
211
Rescaling Rates Rescaling Rates
The asymptotic power functions considered earlier in this chapter are the limits of "local power functions" of the form h 1---+ lfn (h/ Jn) . The rescaling rate Jn is typical for testing smooth parameters of the model. In this section we have a closer look at the rescaling rate and discuss some nonregular situations. Suppose that in a given sequence of models (Xn ' An ' Pn ,e : e E 8) it is desired to test the null hypothesis Ho : e eo versus the alternatives H1 : e en . For probability measures P and Q define the total variation distance II P - Q II as the L 1 -distance J I P - q I df-L between two densities of P and Q . =
14.30
Lemma.
=
The power function nn of any test in (Xn , An , Pn , e : e
E
8) satisfies
For any e and eo there exists a test whose power function attains equality.
If lfn is the power function of the test ¢n , then the difference on the left side can be written as J ¢n ( Pn , e - P n , e ) df-Ln This expression is maximized for the test function ¢n l {pn , e > Pn , e }. Next, for any pair of probability densities p and q we have Jq > p (q p) d f.J- � J I P - q l df-L, since j (p - q) df-L 0. •
Proof.
0
=
0
·
=
=
This lemma implies that for any sequence of alternatives en : (i) If I Pn , en - Pn , Bo I ---;. 2, then there exists a sequence of tests with power lfn (en ) tending to 1 and size lfn (e0) tending to 0 (a perfect sequence of tests). (ii) If II Pn ,e" - Pn , eo II ___,. 0, then the power of any sequence of tests is asymptotically less than the level (every sequence of tests is worthless). (iii) If I Pn , en - Pn ,eo II is bounded away from O and 2, then there exists no perfect sequence of tests, but not every test is worthless. The rescaling rate hj Jn used earlier sections corresponds to the third possibility. These ex amples concern models with independent observations. Because the total variation distance between product measures cannot be easily expressed in the distances for the individual factors, we translate the results into the Hellinger distance and next study the implications for product experiments. The Hellinger distance H (P, Q) between two probability measures is the L 2 -distance between the square roots of the corresponding densities. Thus, its square H 2 (P, Q) is equal to J C.jp - Jfj) 2 df-L. The distance is convenient if considering product measures. First, the Hellinger distance can be expressed in the Hellinger affinity A (P , Q) J fi..j{j df-L, through the formula =
H 2 ( P , Q)
=
2 - 2 A ( P , Q).
Next, by Fubini's theorem, the affinity of two product measures is the product of the affinities. Thus we arrive at the formula
Relative Efficiency of Tests
212
Given a statistical model (Pe : 8 ::::-_ 8o ) set Pn,e = Pt. Then the possi bilities (i), (ii), and (iii) arise when nH 2 (Pe" , Pea) converges to oo, converges to 0, or is bounded away from 0 and oo, respectively. In particular, if H 2 (Pe , Pea) = 0 ( 18 - 8o I a ) as e � e0, then the possibilities (i), (ii), and (iii) are valid when n I fa IBn - 80 I converges to oo, converges to 0, or is bounded away from 0 and oo, respectively. 14.31
Lemma.
The possibilities (i), (ii), and (iii) can equivalently be described by replacing the total variation distance II Pt - Pt II by the squared Hellinger distance H 2 (Pt , Pt ) . This follows from the inequalities, for any probability measures P and Q,
Proof.
n
0
n
H 2 (P , Q) .:::; li P - Q ll .:::; (2 - A 2 (P , Q))
!\
0
2 H ( P, Q) .
The inequality on the left is immediate from the inequality I ,JP - JfiP _:::; I p - q I , valid for any nonnegative numbers p and q . For the inequality on the right, first note that pq = (p v q ) (p !\ q) _:::; (p + q ) (p !\ q ) , whence A 2 (P , Q) _:::; 2 j (p !\ q) dJL, by the Cauchy-Schwarz inequality. Now f (p !\ q) d fL is equal to 1 - i II P - Q II , as can be seen by splitting the domains of both integrals in the sets p < q and p ::::-_ q . This shows that li P - Q ll _:::; 2 - A 2 (P, Q) . That li P - Q ll _:::; 2 H ( P, Q) is a direct consequence of the Cauchy-Schwarz inequality. We now express the Hellinger distance of the product measures in the Hellinger distance of Pe" and Pea and manipulate the nth power function to conclude the proof. • If the model (X, A, Pe : 8 E 8) is differentiable in quadratic mean at 80 , then H 2 (Pe , Pea) = 0 ( 18 - 80 1 2 ) . The intermediate rate of conver gence (case (iii)) is .Jli. 0 14.32
Example (Smooth models).
If Pe is the uniform measure on [0, 8], then H 2 (Pe , Pea) = 0 ( 18 - 80 I ) . The intermediate rate of convergence is n . In this case we would study asymptotic power functions defined as the limits of the local power functions of the form h f---+ nn (eo + hI n) . For instance, the level tests that reject the null hypothesis Ho : e = eo for large values of the maximum X en) of the observations have power functions 14.33
Example (Uniform law).
a
( �) = Pea+h/n (X (n)
nn eo +
::::-_
8o ( l - ) 1 n ) a
1
�
-
1 - ( 1 - a ) e h l ea .
Relative to this rescaling rate, the level tests that reject the null hypothesis for large values of the mean X n have asymptotic power function (no power). 0 a
a
Let Pe be the probability distribution with density + x f---+ ( 1 - I x - e I) on the real line. Some clever integrations show that H 2 ( Pe , Po) = i B 2 log( l j 8 ) + 0 (8 2 ) as e � 0. (It appears easiest to compute the affinity first.) This leads to the intermediate rate of convergence .jn log n. 0 14.34
Example (Triangular law).
The preceding lemmas concern testing a given simple null hypothesis against a simple alternative hypothesis. In many cases the rate obtained from considering simple hypotheses does not depend on the hypotheses and is also globally attainable at every parameter in the parameter space. If not, then the global problems have to be taken into account from the beginning. One possibility is discussed within the context of density estimation in section 24. 3 .
Problems
213
Lemma 14.3 1 gives rescaling rates for problems with independent observations. In mod els with dependent observations quite different rates may pertain. Consider the Galton-Watson branching process, discussed in Example 9. 1 0. If the offspring distribution has mean t-t ( 8) larger than 1 , then the parameter is estimable at the exponential rate t-t (8) n . This is also the right rescaling rate for defining asymptotic power functions. 0 14.35
Example (Branching).
Notes
Apparently, E.J.G. Pitman introduced the efficiencies that are named for him in an unpub lished set of lecture notes in 1949. A published proof of a slightly more general result can be found in [109]. Cramer [26] was interested in preciser approximations to probabilities oflarge deviations than are presented in this chapter and obtained the theorem under the condition that the moment-generating function is finite on 1Ft Chernoff [20] proved the theorem as presented here, by a different argument. Chernoff used it to study the minimum weighted sums of error probabilities of tests that reject for large values of a mean and showed that, for any 0 < rr < 1 , � log inf ( nP0 ( Y > t ) + ( 1 - n)P 1 ( Y ::::: t) ) t n inf inf( K0 (u) - ut ) V inf ( K 1 (u) - ut ) .
--+
Eo Y1 < t <E1 Y1 u
u
Furthermore, for Y the likelihood ratio statistic for testing Po versus P1 , the right side of this display can be expressed in the Hellinger integral of the experiment (Po , P1 ) as inf log
O
f dP0dP11 -u .
Thus, this expression is a lower bound for the lim infn-Hxl n - 1 log(an + f3n ) for an and f3n the error probabilities of any test of Po versus P1 • That the Bahadur slope of Neyman-Pearson tests is twice the Kullback-Leibler divergence (Example 14.26) is essentially known as Stein 's lemma and is apparently among those results by Stein that he never cared to publish. A first version of Sanov's theorem was proved by Sanov in 1957. Subsequently, many authors contributed to strengthening the result, the version presented here being given in [65]. Large-deviation theorems are subject of current research by probabilists, particularly with extensions to more complicated objects than sums of independent variables. See [3 1 ] and [32] . For further information and references concerning applications in statistics, we refer to [4] and [ 61], as well as to Chapters 8, 16, and 17. For applications and extensions of the results on rescaling rates, see [37].
PROBLEMS
1 . Show that the power function of the Wilcoxon two sample test is monotone under shift of location. 2. Let X , . . . , Xn be a random sample from the N (�-t, cr 2 )-distribution, where cr 2 is known. A test
1
for Ho
:
0 =
0 against H 1
:
0
>
0 can be based on either
-
X1
a
-
or X 1 S . Show that the asymptotic
214
Relative Efficiency of Tests relative efficiency of the two sequences of tests is or t-critical values are used?
1.
Does it make a difference whether normal
3. Let X 1 , . . . , Xn be a random sample from a density f (x - e ) where f is symmetric about zero. Calculate the relative efficiency of the t-test and the test that rejects for large values of L L i <j 1 { Xi + X j > 0} for f equal to the logistic, normal, Laplace, and uniform shapes . 4. Calculate the relative efficiency of the van der Waerden test with respect to the t-test i n the two-sample problem. 5. Calculate the relative efficiency of the tests based on Kendall' s r and the sample correlation coefficient to test independence for bivariate normal pairs of observations . 6. Suppose ¢ : F
f-+
lR and 1/J : F
f-+
IRk
are arbitrary maps on an arbitrary set F and we wish to find the minimum value of ¢ over the set { f E F : 1/1 (f) = 0 } . If the map f f-+ ¢ (f) + a T 1/1 (f) attains its minimum over F at fa , for each fixed a in an arbitrary set A, and there exists ao E A such that 1/J (fa0 ) = 0, then the desired minimum value is ¢ (/a0 ) . This is a rather trivial use of Lagrange multipliers, but it is helpful to solve the next problems . (¢ Cfa0 ) = ¢ Cfa0 ) + a lj 1/f (fa 0 ) is the minimum of ¢ (/) + alj 1/f (f) over F and hence smaller than the minimum of ¢ ( /) + a lj 1/f ( f) over { f E F : 1/f ( f) = 0}.) 2: 1/3 for every probability density f that has its mode at 0. (The minimum is equal to the minimum of 4 J y 2 f (y) dy over all probability densities f that are bounded by 1 .)
7 . Show that
4 / ( 0 ) 2 J y 2 f (y) dy
12(! j 2 (y) dy ) 2 J y 2 f (y) dy 2: 108/ 1 25 for every probability density f with mean zero. (The minimum is equal to 12 times the minimum of the square of ¢ (/ ) = J j 2 (y) dy over all probability densities with mean 0 and variance 1 .)
8. Show that
9. Study the asymptotic power function of the sign test if the observations are a sample from
a distribution that has a positive mass at its median. Is it good or bad to have a nonsmooth distribution?
10. Calculate the Hellinger and total variation distance between two uniform U [O, e ] measures . 11. Calculate the Hellinger and total variation distance between two normal N ( f.L , o- 2 ) measures . 12. Let X 1 , . . . , Xn be a sample from the uniform distribution on [ -e , e ] . (i) Calculate the asymptotic power functions of the tests that rej ect Ho : e of X(n) , X(n) v (-Xcl)) and X(n) - X(l) · (ii) Calculate the asymptotic relative efficiencies of these tests.
=
eo for large values
13. If two sequences of test statistics satisfied (14.4) for every en ,J.. 0, but with norming rate n01 instead of ,Jfi, how would Theorem 14. 19 have to be modified to find the Pitman relative efficiency?
15 Efficiency of Tests
It is shown that, given converging experiments, every limiting power function is the power function of a test in the limit experiment. Thus, uniformly most poweiful tests in the limit experiment give absolute upper bounds for the power of a sequence of tests. In normal experiments such uniformly most powerful tests exist for linear hypotheses of codimension one. The one-sample location problem and the two-sample problem are discussed in detail, and appropriately designed (signed) rank tests are shown to be asymptotically optimal. 15.1
Asymptotic Representation Theorem
A randomized test (or test function) ¢ in an experiment (X, A, Ph : h E H) is a measurable map ¢ : X f---+ [0, 1] on the sample space. The interpretation is that if x is observed, then a null hypothesis is rejected with probability ¢ (x ) . The power function of a test ¢ is the function h f---+ n (h) Eh ¢ (X). This gives the probabilities that the null hypothesis is rejected. A test is of level for testing a null hypothesis H0 if its size sup n (h) : h E H0 } does not exceed The quality of a test can be judged from its power function, and classical testing theory is aimed at finding, among the tests of level a test with high power at every alternative. The asymptotic quality of a sequence of tests may be judged from the limit of the sequence of local power functions. If the tests are defined in experiments that converge to a limit experiment, then a pointwise limit of power functions is necessarily a power function in the limit experiment. This follows from the following theorem, which specializes the asymptotic representation theorem, Theorem 9.3, to the testing problem. Applied to the special case of the local experiments En ( P; h !vn : h E IR.k ) of a differentiable + parametric model as considered in Chapter 7, which converge to the Gaussian experiment k 1 ( N(h , Ie- ) , h E IR. ) , the theorem is the parallel for testing of Theorem 7. 10. =
a
{
a.
a,
=
15.1 Theorem. Let the sequence of experiments En (Pn , h : h E H) converge to a domi nated experiment E (Ph : h E H). Suppose that a sequence of power functions nn of tests in En converges pointwise: nn (h) � n (h), for every h and some arbitrary function n. Then n is a power function in the limit experiment: There exists a test ¢ in E with n (h) Eh¢ (X) for every h. =
=
=
215
216
Efficiency of Tests
We give the proof for the special case of experiments that satisfy the following assumption: Every sequence of statistics Tn that is tight under every given parameter h possesses a subsequence (not depending on h) that converges in distribution to a limit under every h. See problem 15.2 for a method to extend the proof to the general situation. The additional condition is valid in the case of local asymptotic normality. With the notation of the proof of Theorem 7 . 1 0, we argue first that the sequence (Tn , L'ln) is uni formly tight under h 0 and hence possesses a weakly convergent subsequence by Prohorov's theorem. Next, by the expansion of the likelihood and Slutsky's lemma, the se quence (Tn , log d Pn,h/ d Pn,o) converges under h 0 along the same sequence, for every h . Finally, we conclude by Le Cam 's third lemma that the sequence Tn converges under h, along the subsequence. Let ¢n be tests with power functions nn . Because each ¢n takes its values in the com pact interval [0, 1], the sequence of random variables ¢n is certainly uniformly tight. By assumption, there exists a subsequence of {n } along which ¢n converges in distribution un der every h. Thus, the assumption of the asymptotic representation theorem, Theorem 9.3 or Theorem 7 . 1 0, is satisfied along some subsequence of the statistics ¢n . By this theorem, there exists a randomized statistic T T (X, U) in the limit experiment such that ¢n !;.., T along the subsequence, for every h. The randomized statistic may be assumed to take its values in [0, 1]. Because the ¢n are uniformly bounded, Eh¢n Eh T. Combination with the assumption yields n( h ) Eh T for every h. The randomized statistic T is not a test function (it is a "doubly randomized" test). However, the test ¢ (x ) E ( T (X, U) I X x ) satisfies the requirements. •
Proof.
=
=
=
--+
=
=
=
The theorem suggests that the best possible limiting power function is the power function of the best test in the limit experiment. In classical testing theory an "absolutely best" test is defined as a uniformly most powerful test of the required level. Depending on the experiment, such a test may or may not exist. If it does not exist, then the classical solution is to find a uniformly most powerful test in a restricted class, such as the class of all unbiased or invariant tests; to use the maximin criterion; or to use a conditional test. In combination with the preceding theorem, each of these approaches leads to a criterion for asymptotic quality. We do not pursue this in detail but note that, in general, we would avoid any sequence of tests that is matched in the limit experiment by a test that is considered suboptimal. In the remainder of this chapter we consider the implications for locally asymptotically normal models in more detail. We start by reviewing testing in normal location models.

15.2 Testing Normal Means
Suppose that the observation $X$ is $N_k(h, \Sigma)$-distributed, for a known covariance matrix $\Sigma$ and unknown mean vector $h$. First consider testing the null hypothesis $H_0\colon c^T h = 0$ versus the alternative $H_1\colon c^T h > 0$, for a known vector $c$. The "natural" test, which rejects $H_0$ for large values of $c^T X$, is uniformly most powerful. In other words, if $\pi$ is a power function such that $\pi(h) \le \alpha$ for every $h$ with $c^T h = 0$, then, for every $h$ with $c^T h > 0$,
$$\pi(h) \le 1 - \Phi\Bigl(z_\alpha - \frac{c^T h}{\sqrt{c^T \Sigma c}}\Bigr).$$
15.2 Proposition. Suppose that $X$ is $N_k(h, \Sigma)$-distributed for a known nonnegative-definite matrix $\Sigma$, and let $c$ be a fixed vector with $c^T \Sigma c > 0$. Then the test that rejects $H_0$ if $c^T X > z_\alpha \sqrt{c^T \Sigma c}$ is uniformly most powerful at level $\alpha$ for testing $H_0\colon c^T h = 0$ versus $H_1\colon c^T h > 0$, based on $X$.
Proof. Fix $h_1$ with $c^T h_1 > 0$. Define $h_0 = h_1 - (c^T h_1/c^T \Sigma c)\,\Sigma c$. Then $c^T h_0 = 0$. By the Neyman-Pearson lemma, the most powerful test for testing the simple hypotheses $H_0\colon h = h_0$ and $H_1\colon h = h_1$ rejects $H_0$ for large values of the likelihood ratio of $N(h_1, \Sigma)$ and $N(h_0, \Sigma)$; because $h_1 - h_0 = (c^T h_1/c^T \Sigma c)\,\Sigma c$, this likelihood ratio is an increasing function of $c^T X$. Thus the Neyman-Pearson test is equivalent to the test that rejects for large values of $c^T X$. More precisely, the most powerful level $\alpha$ test for $H_0\colon h = h_0$ versus $H_1\colon h = h_1$ is the test given by the proposition. Because this test does not depend on $h_0$ or $h_1$, it is uniformly most powerful for testing $H_0\colon c^T h = 0$ versus $H_1\colon c^T h > 0$. •
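As a small illustration only (not part of the original text), the following sketch implements the test of the proposition and its power function; the use of NumPy/SciPy and the function names are my own choices.

```python
# Sketch of the UMP one-sided test of Proposition 15.2: reject H0: c'h = 0 in favour of
# H1: c'h > 0 when c'X > z_alpha * sqrt(c' Sigma c).
import numpy as np
from scipy.stats import norm

def one_sided_normal_mean_test(x, c, sigma, alpha=0.05):
    """Return (reject, statistic) for X ~ N_k(h, Sigma)."""
    c = np.asarray(c, dtype=float)
    sd = np.sqrt(c @ sigma @ c)          # standard deviation of c'X
    stat = (c @ x) / sd                  # N(0, 1) under the null c'h = 0
    return stat > norm.ppf(1 - alpha), stat

def one_sided_power(h, c, sigma, alpha=0.05):
    """Power 1 - Phi(z_alpha - c'h / sqrt(c' Sigma c)) at an alternative h."""
    c, h = np.asarray(c, float), np.asarray(h, float)
    return 1 - norm.cdf(norm.ppf(1 - alpha) - (c @ h) / np.sqrt(c @ sigma @ c))
```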
The natural test for the two-sided problem $H_0\colon c^T h = 0$ versus $H_1\colon c^T h \ne 0$ rejects the null hypothesis for large values of $|c^T X|$. This test is not uniformly most powerful, because its power is dominated by the uniformly most powerful tests for the two one-sided alternatives whose union is $H_1$. However, the test with critical region $\{x : |c^T x| \ge z_{\alpha/2}\sqrt{c^T \Sigma c}\}$ is uniformly most powerful among the unbiased level $\alpha$ tests (see problem 15.1).

A second problem of interest is to test a simple null hypothesis $H_0\colon h = 0$ versus the alternative $H_1\colon h \ne 0$. If the parameter set is one-dimensional, then this reduces to the problem in the preceding paragraph. However, if the parameter is of dimension $k > 1$, then there exists no uniformly most powerful test, not even among the unbiased tests. A variety of tests are reasonable, and whether a test is "good" depends on the alternatives at which we desire high power. For instance, the test that is most sensitive for detecting the alternatives with $c^T h > 0$ (for a given $c$) is the test given in the preceding proposition. Probably in most situations no particular "direction" is of special importance, and we would use a test that distributes the power over all directions. It is known that any test whose critical region is the complement of a closed, convex set $C$ is admissible (see, e.g., [138, p. 137]). In particular, complements of closed, convex, and symmetric sets are admissible critical regions and cannot easily be ruled out a priori. The shape of $C$ determines the power function, the directions in which $C$ extends little receiving large power (although the power also depends on $\Sigma$).

The most popular test rejects the null hypothesis for large values of $X^T \Sigma^{-1} X$. This test arises as the limit version of the Wald test, the score test, and the likelihood ratio test. One advantage is a simple choice of critical values, because $X^T \Sigma^{-1} X$ is chi square-distributed with $k$ degrees of freedom. The power function of this test is, with $Z$ a standard normal vector,
$$\pi(h) = P\bigl(\|Z + \Sigma^{-1/2} h\|^2 > \chi^2_{k,\alpha}\bigr).$$
By the rotational symmetry of the standard normal distribution, this depends only on the noncentrality parameter $\|\Sigma^{-1/2} h\|$. The power is relatively large in the directions $h$ for which $\|\Sigma^{-1/2} h\|$ is large. In particular, it increases most steeply in the direction of the eigenvector corresponding to the smallest eigenvalue of $\Sigma$. Note that the test does not distribute the power evenly, but depending on $\Sigma$. Two optimality properties of this test are given in problems 15.3 and 15.4, but these do not really seem convincing.
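The noncentral chi-square representation of this power function is easy to evaluate numerically; the following sketch (an illustration of mine, not the book's) uses SciPy's noncentral chi-square distribution.

```python
# Power of the test that rejects H0: h = 0 when X' Sigma^{-1} X > chi^2_{k, alpha}:
# P(chi^2_k(delta^2) > chi^2_{k, alpha}) with noncentrality delta = ||Sigma^{-1/2} h||.
import numpy as np
from scipy.stats import chi2, ncx2

def chisq_test_power(h, sigma, alpha=0.05):
    h = np.asarray(h, dtype=float)
    k = h.size
    crit = chi2.ppf(1 - alpha, df=k)           # chi^2_{k, alpha}
    delta2 = h @ np.linalg.solve(sigma, h)     # ||Sigma^{-1/2} h||^2
    return ncx2.sf(crit, df=k, nc=delta2)      # P(chi^2_k(delta2) > crit)
```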
Due to the lack of an acceptable optimal test in the limit problem, a satisfactory asymptotic optimality theory of testing simple hypotheses on multidimensional parameters is impossible.
15.3 Local Asymptotic Normality

A normal limit experiment arises, among others, in the situation of repeated sampling from a differentiable parametric model. If the model $(P_\theta : \theta \in \Theta)$ is differentiable in quadratic mean, then the local experiments converge to a Gaussian limit:
$$\bigl(P^n_{\theta_0 + h/\sqrt{n}} : h \in \mathbb{R}^k\bigr) \rightsquigarrow \bigl(N(h, I_{\theta_0}^{-1}) : h \in \mathbb{R}^k\bigr).$$
A sequence of power functions $\theta \mapsto \pi_n(\theta)$ in the original experiments induces the sequence of power functions $h \mapsto \pi_n(\theta_0 + h/\sqrt{n})$ in the local experiments. Suppose that $\pi_n(\theta_0 + h/\sqrt{n}) \to \pi(h)$ for every $h$ and some function $\pi$. Then, by the asymptotic representation theorem, the limit $\pi$ is a power function in the Gaussian limit experiment. Suppose for the moment that $\theta$ is real and that the sequence $\pi_n$ is of asymptotic level $\alpha$ for testing $H_0\colon \theta \le \theta_0$ versus $H_1\colon \theta > \theta_0$. Then $\pi(0) = \lim \pi_n(\theta_0) \le \alpha$, and hence $\pi$ corresponds to a level $\alpha$ test for $H_0\colon h = 0$ versus $H_1\colon h > 0$ in the limit experiment. It must be bounded above by the power function of the uniformly most powerful level $\alpha$ test in the limit experiment, which is given by Proposition 15.2. Conclude that
$$\lim_{n\to\infty} \pi_n\Bigl(\theta_0 + \frac{h}{\sqrt{n}}\Bigr) \le 1 - \Phi\bigl(z_\alpha - h\sqrt{I_{\theta_0}}\bigr), \qquad \text{every } h > 0.$$
(Apply the proposition with $c = 1$ and $\Sigma = I_{\theta_0}^{-1}$.) We have derived an absolute upper bound on the local asymptotic power of level $\alpha$ tests. In Chapter 14 a sequence of power functions such that $\pi_n(\theta_0 + h/\sqrt{n}) \to 1 - \Phi(z_\alpha - hs)$ was said to have slope $s$; thus the best attainable slope is $\sqrt{I_{\theta_0}}$, and the quotient $I_{\theta_0}/s^2$ is the relative efficiency of the best test and the test with slope $s$. It can be interpreted as the number of observations needed with the given sequence of tests with slope $s$ divided by the number of observations needed with the best test to obtain the same power.

With a bit of work, the assumption that $\pi_n(\theta_0 + h/\sqrt{n})$ converges to a limit for every $h$ can be removed. Also, the preceding derivation does not use the special structure of i.i.d. observations but only the convergence to a Gaussian experiment. We shall rederive the result within the context of local asymptotic normality and also indicate how to construct optimal tests.

Suppose that at "time" $n$ the observation is distributed according to a distribution $P_{n,\theta}$ with the parameter $\theta$ ranging over an open subset $\Theta$ of $\mathbb{R}^k$. The sequence of experiments $(P_{n,\theta} : \theta \in \Theta)$ is locally asymptotically normal at $\theta_0$ if
$$\log \frac{dP_{n,\theta_0 + r_n^{-1}h}}{dP_{n,\theta_0}} = h^T \Delta_{n,\theta_0} - \tfrac{1}{2}\, h^T I_{\theta_0} h + o_{P_{n,\theta_0}}(1), \qquad (15.3)$$
for a sequence of statistics $\Delta_{n,\theta_0}$ that converges in distribution under $\theta_0$ to a normal $N_k(0, I_{\theta_0})$-distribution.
15.4 Theorem. Let $\Theta \subset \mathbb{R}^k$ be open and let $\psi\colon \Theta \to \mathbb{R}$ be differentiable at $\theta_0$, with nonzero gradient $\dot\psi_{\theta_0}$, and such that $\psi(\theta_0) = 0$. Let the sequence of experiments $(P_{n,\theta} : \theta \in \Theta)$ be locally asymptotically normal at $\theta_0$ with nonsingular Fisher information, for constants $r_n \to \infty$. Then the power functions $\theta \mapsto \pi_n(\theta)$ of any sequence of level $\alpha$ tests for testing $H_0\colon \psi(\theta) \le 0$ versus $H_1\colon \psi(\theta) > 0$ satisfy, for every $h$ such that $\dot\psi_{\theta_0} h > 0$,
$$\limsup_{n\to\infty} \pi_n\Bigl(\theta_0 + \frac{h}{r_n}\Bigr) \le 1 - \Phi\biggl(z_\alpha - \frac{\dot\psi_{\theta_0} h}{\sqrt{\dot\psi_{\theta_0} I_{\theta_0}^{-1} \dot\psi_{\theta_0}^T}}\biggr).$$

15.5 Addendum. Let $T_n$ be statistics such that
$$T_n = \frac{\dot\psi_{\theta_0} I_{\theta_0}^{-1} \Delta_{n,\theta_0}}{\sqrt{\dot\psi_{\theta_0} I_{\theta_0}^{-1} \dot\psi_{\theta_0}^T}} + o_{P_{n,\theta_0}}(1).$$
Then the sequence of tests that reject for values of $T_n$ exceeding $z_\alpha$ is asymptotically optimal in the sense that the sequence $P_{\theta_0 + r_n^{-1}h}(T_n \ge z_\alpha)$ converges to the right side of the preceding display, for every $h$.
Proofs. The sequence of localized experiments $(P_{n,\theta_0 + r_n^{-1}h} : h \in \mathbb{R}^k)$ converges by Theorem 7.10, or Theorem 9.4, to the Gaussian location experiment $(N_k(h, I_{\theta_0}^{-1}) : h \in \mathbb{R}^k)$. Fix some $h_1$ such that $\dot\psi_{\theta_0} h_1 > 0$, and a subsequence of $\{n\}$ along which the lim sup of $\pi_n(\theta_0 + h_1/r_n)$ is attained. There exists a further subsequence along which $\pi_n(\theta_0 + r_n^{-1}h)$ converges to a limit $\pi(h)$ for every $h \in \mathbb{R}^k$ (see the proof of Theorem 15.1). The function $h \mapsto \pi(h)$ is a power function in the Gaussian limit experiment. For $\dot\psi_{\theta_0} h < 0$, we have $\psi(\theta_0 + r_n^{-1}h) = r_n^{-1}(\dot\psi_{\theta_0} h + o(1)) < 0$ eventually, whence $\pi(h) \le \limsup \pi_n(\theta_0 + r_n^{-1}h) \le \alpha$. By continuity, the inequality $\pi(h) \le \alpha$ extends to all $h$ such that $\dot\psi_{\theta_0} h \le 0$. Thus, $\pi$ is of level $\alpha$ for testing $H_0\colon \dot\psi_{\theta_0} h \le 0$ versus $H_1\colon \dot\psi_{\theta_0} h > 0$. Its power function is bounded above by the power function of the uniformly most powerful test, which is given by Proposition 15.2. This concludes the proof of the theorem.

The asymptotic optimality of the sequence $T_n$ follows by contiguity arguments. We start by noting that the sequence $(\Delta_{n,\theta_0}, \Delta_{n,\theta_0})$ converges under $\theta_0$ in distribution to a (degenerate) normal vector $(\Delta, \Delta)$. By Slutsky's lemma and local asymptotic normality, the sequence $\bigl(\Delta_{n,\theta_0}, \log dP_{n,\theta_0 + r_n^{-1}h}/dP_{n,\theta_0}\bigr)$ converges under $\theta_0$ in distribution to $\bigl(\Delta, h^T\Delta - \tfrac{1}{2}h^T I_{\theta_0} h\bigr)$. By Le Cam's third lemma, the sequence $\Delta_{n,\theta_0}$ converges in distribution under $\theta_0 + r_n^{-1}h$ to a $N(I_{\theta_0}h, I_{\theta_0})$-distribution. Thus, the sequence $T_n$ converges under $\theta_0 + r_n^{-1}h$ in distribution to a normal distribution with mean $\dot\psi_{\theta_0}h/(\dot\psi_{\theta_0} I_{\theta_0}^{-1}\dot\psi_{\theta_0}^T)^{1/2}$ and variance 1. •
The point $\theta_0$ in the preceding theorem is on the boundary of the null and the alternative hypotheses. If the dimension $k$ is larger than 1, then this boundary is typically $(k-1)$-dimensional, and there are many possible values for $\theta_0$. The upper bound is valid at every possible choice. If $k = 1$, the boundary point $\theta_0$ is typically unique and hence known, and we could use $T_n = I_{\theta_0}^{-1/2}\Delta_{n,\theta_0}$ to construct an optimal sequence of tests for the problem $H_0\colon \theta = \theta_0$. These are known as score tests. Another possibility is to base a test on an estimator sequence. Not surprisingly, efficient estimators yield efficient tests.
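For concreteness, a minimal sketch (my own illustration, with assumed inputs) of a one-dimensional score test in the i.i.d. case, where $\Delta_{n,\theta_0} = n^{-1/2}\sum_i \dot\ell_{\theta_0}(X_i)$:

```python
# One-sided score test: T_n = I(theta0)^{-1/2} * n^{-1/2} * sum(score at theta0),
# rejecting when T_n >= z_alpha.
import numpy as np
from scipy.stats import norm

def score_test(x, score, fisher_info, alpha=0.05):
    """x: data; score(xi): the score function at theta0; fisher_info: I_{theta0} (scalar)."""
    x = np.asarray(x, dtype=float)
    t_n = np.sum(score(x)) / np.sqrt(len(x) * fisher_info)
    return t_n >= norm.ppf(1 - alpha), t_n

# Example: N(theta, 1) data, H0: theta = theta0 = 0; the score is x and I = 1,
# so T_n reduces to sqrt(n) * mean(x).
# reject, t = score_test(data, lambda x: x, 1.0)
```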
15.6 Example (Wald tests). Let $X_1, \ldots, X_n$ be a random sample in an experiment $(P_\theta : \theta \in \Theta)$ that is differentiable in quadratic mean with nonsingular Fisher information. Then the sequence of local experiments $(P^n_{\theta + h/\sqrt{n}} : h \in \mathbb{R}^k)$ is locally asymptotically normal with $r_n = \sqrt{n}$, $I_\theta$ the Fisher information matrix, and
$$\Delta_{n,\theta} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot\ell_\theta(X_i).$$
A sequence of estimators $\hat\theta_n$ is asymptotically efficient for estimating $\theta$ if (see Chapter 8)
$$\sqrt{n}\,(\hat\theta_n - \theta) = I_\theta^{-1}\,\frac{1}{\sqrt{n}} \sum_{i=1}^n \dot\ell_\theta(X_i) + o_{P_\theta}(1).$$
Under regularity conditions, the maximum likelihood estimator qualifies. Suppose that $\theta \mapsto I_\theta$ is continuous, and that $\psi$ is continuously differentiable with nonzero gradient. Then the sequence of tests that reject $H_0\colon \psi(\theta) \le 0$ if
$$\frac{\sqrt{n}\,\psi(\hat\theta_n)}{\sqrt{\dot\psi_{\hat\theta_n} I_{\hat\theta_n}^{-1} \dot\psi_{\hat\theta_n}^T}} \ge z_\alpha$$
is asymptotically optimal at every point $\theta_0$ on the boundary of $H_0$. Furthermore, this sequence of tests is consistent at every $\theta$ with $\psi(\theta) > 0$. These assertions follow from the preceding theorem, upon using the delta method and Slutsky's lemma. The resulting tests are called Wald tests if $\hat\theta_n$ is the maximum likelihood estimator. □
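The following sketch (mine, not the book's; the estimator and information matrix are assumed to be supplied by the user) shows the Wald statistic of the example in code form.

```python
# Wald test of H0: psi(theta) <= 0 based on an efficient estimator theta_hat:
# reject when sqrt(n) psi(theta_hat) / sqrt(psi_grad I^{-1} psi_grad^T) >= z_alpha.
import numpy as np
from scipy.stats import norm

def wald_test(theta_hat, n, psi, psi_grad, fisher_info, alpha=0.05):
    """theta_hat: efficient estimator (e.g. the MLE); psi: constraint function;
    psi_grad: gradient of psi at theta_hat; fisher_info: I_{theta_hat} (k x k)."""
    g = np.atleast_1d(psi_grad(theta_hat)).astype(float)
    se2 = g @ np.linalg.solve(fisher_info, g)        # psi_grad I^{-1} psi_grad^T
    t_n = np.sqrt(n) * psi(theta_hat) / np.sqrt(se2)
    return t_n >= norm.ppf(1 - alpha), t_n
```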
15.4 One-Sample Location
Let $X_1, \ldots, X_n$ be a sample from a density $f(x - \theta)$, where $f$ is symmetric about zero and has finite Fisher information for location $I_f$. It is required to test $H_0\colon \theta = 0$ versus $H_1\colon \theta > 0$. The density $f$ may be known or (partially) unknown. For instance, it may be known to belong to the normal scale family.

For fixed $f$, the sequence of experiments $\bigl(\prod_{i=1}^n f(x_i - \theta) : \theta \in \mathbb{R}\bigr)$ is locally asymptotically normal at $\theta = 0$ with $\Delta_{n,0} = -n^{-1/2}\sum_{i=1}^n (f'/f)(X_i)$, norming rate $\sqrt{n}$, and Fisher information $I_f$. By the results of the preceding section, the best asymptotic level $\alpha$ power function (for known $f$) is
$$1 - \Phi\bigl(z_\alpha - h\sqrt{I_f}\bigr).$$
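As a numerical aside (my own illustration, not part of the text), the Fisher information for location and the resulting power bound are easy to compute for a given symmetric density; the standard logistic density, for instance, has $I_f = 1/3$.

```python
# Numerical Fisher information for location, I_f = int (f'(x)/f(x))^2 f(x) dx,
# and the local power bound 1 - Phi(z_alpha - h*sqrt(I_f)).
import numpy as np
from scipy.stats import norm, logistic
from scipy.integrate import quad

def fisher_info_location(pdf, dpdf, lo=-50.0, hi=50.0):
    integrand = lambda x: dpdf(x) ** 2 / pdf(x)
    return quad(integrand, lo, hi)[0]

dpdf = lambda x, eps=1e-6: (logistic.pdf(x + eps) - logistic.pdf(x - eps)) / (2 * eps)
I_f = fisher_info_location(logistic.pdf, dpdf)        # about 1/3 for the logistic
power_bound = lambda h, alpha=0.05: 1 - norm.cdf(norm.ppf(1 - alpha) - h * np.sqrt(I_f))
```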
This function is an upper bound for $\limsup \pi_n(h/\sqrt{n})$, for every $h > 0$, for every sequence of level $\alpha$ power functions. Suppose that $T_n$ are statistics with
$$T_n = \frac{1}{\sqrt{n\,I_f}} \sum_{i=1}^n -\frac{f'}{f}(X_i) + o_{P_0}(1). \qquad (15.7)$$
Then, according to the second assertion of Theorem 15.4, the sequence of tests that reject the null hypothesis if $T_n \ge z_\alpha$ attains the bound and hence is asymptotically optimal. We shall discuss several ways of constructing test statistics with this property.

If the shape of the distribution is completely known, then the test statistics $T_n$ can simply be taken equal to the right side of (15.7), without the remainder term, and we obtain the score test. It is more realistic to assume that the underlying distribution is only known up to scale. If the underlying density takes the form $f(x) = f_0(x/\sigma)/\sigma$ for a known density $f_0$ that is symmetric about zero, but for an unknown scale parameter $\sigma$, then
$$\frac{1}{\sqrt{n\,I_f}} \sum_{i=1}^n -\frac{f'}{f}(X_i) = \frac{1}{\sqrt{n\,I_{f_0}}} \sum_{i=1}^n -\frac{f_0'}{f_0}\Bigl(\frac{X_i}{\sigma}\Bigr).$$
15.8 Example (t-test). The standard normal density possesses score function $f_0'/f_0(x) = -x$ and Fisher information $I_{f_0} = 1$. Consequently, if the underlying distribution is normal, then the optimal test statistics should satisfy $T_n = \sqrt{n}\,\overline{X}_n/\sigma + o_{P_0}(1)$. The $t$-statistics $\sqrt{n}\,\overline{X}_n/S_n$ fulfill this requirement. This is not surprising, because in the case of normally distributed observations the $t$-test is uniformly most powerful for every finite $n$ and hence is certainly asymptotically optimal. □
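A minimal sketch of the example (my own code, using the normal critical value that the asymptotics suggest):

```python
# One-sided t-test: T_n = sqrt(n) * mean(X) / S_n, asymptotically equivalent to the
# optimal score statistic sqrt(n) * mean(X) / sigma for normal data.
import numpy as np
from scipy.stats import norm

def one_sided_t_test(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = x.size
    t_n = np.sqrt(n) * x.mean() / x.std(ddof=1)   # Student statistic
    return t_n >= norm.ppf(1 - alpha), t_n
```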
The $t$-statistic in the preceding example simply replaces the unknown standard deviation $\sigma$ by an estimate. This approach can be followed for most scale families. Under some regularity conditions, the statistics
$$T_n = \frac{1}{\sqrt{n\,I_{f_0}}} \sum_{i=1}^n -\frac{f_0'}{f_0}\Bigl(\frac{X_i}{\hat\sigma_n}\Bigr)$$
should yield asymptotically optimal tests, given a consistent sequence of scale estimators $\hat\sigma_n$. Rather than using score-type tests, we could use a test based on an efficient estimator for the unknown symmetry point and efficient estimators for possible nuisance parameters, such as the scale; for instance, the maximum likelihood estimators. This method is indicated in general in Example 15.6 and leads to the Wald test.

Perhaps the most attractive approach is to use signed rank statistics. We summarize some definitions and conclusions from Chapter 13. Let $R_{n1}^+, \ldots, R_{nn}^+$ be the ranks of the absolute values $|X_1|, \ldots, |X_n|$ in the ordered sample of absolute values. A linear signed rank statistic takes the form
$$T_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n a_{nR_{ni}^+}\,\operatorname{sign}(X_i),$$
•
•
.
for given numbers a11 1 , , ann , which are called the scores of the statistic. Particular examples are the Wilcoxon signed rank statistic, which has scores an i i , and the sign statistic, which corresponds to scores an i 1 . In general, the scores can be chosen to weigh •
•
•
=
=
the influence of the different observations. A convenient method of generating scores is through a fixed function $\phi\colon [0, 1] \to \mathbb{R}$, by $a_{ni} = \mathrm{E}\,\phi(U_{n(i)})$. (Here $U_{n(1)}, \ldots, U_{n(n)}$ are the order statistics of a random sample of size $n$ from the uniform distribution on $[0, 1]$.) Under the condition that $\int_0^1 \phi^2(u)\,du < \infty$, Theorem 13.18 shows that, under the null hypothesis, and with $F^+(x) = 2F(x) - 1$ denoting the distribution function of $|X_1|$,
$$T_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n \phi\bigl(F^+(|X_i|)\bigr)\operatorname{sign}(X_i) + o_P(1).$$
Because the score-generating function $\phi$ can be chosen freely, this allows the construction of an asymptotically optimal rank statistic for any given shape $f$. The choice
$$\phi(u) = -\frac{1}{\sqrt{I_f}}\,\frac{f'}{f}\bigl((F^+)^{-1}(u)\bigr) \qquad (15.9)$$
yields the locally most powerful scores, as discussed in Chapter 13. Because $(f'/f)(|x|)\operatorname{sign}(x) = (f'/f)(x)$ by the symmetry of $f$, it follows that the signed rank statistics $T_n$ then satisfy (15.7). Thus, the locally most powerful scores yield asymptotically optimal signed rank tests. This surprising result, that the class of signed rank statistics contains asymptotically efficient tests for every given (symmetric) shape of the underlying distribution, is sometimes expressed by saying that the signs and absolute ranks are "asymptotically sufficient" for testing the location of a symmetry point.
15.10 Corollary. Let $T_n$ be the simple linear signed rank statistic with scores $a_{ni} = \mathrm{E}\,\phi(U_{n(i)})$ generated by the function $\phi$ defined in (15.9). Then $T_n$ satisfies (15.7) and hence the sequence of tests that reject $H_0\colon \theta = 0$ if $T_n \ge z_\alpha$ is asymptotically optimal at $\theta = 0$.
Signed rank statistics were originally constructed because of their attractive property of being distribution-free under the null hypothesis. Apparently, this can be achieved without losing (asymptotic) power. Thus, rank tests are strong competitors of classical parametric tests. Note also that signed rank statistics automatically adapt to the unknown scale: even though the definition of the optimal scores appears to depend on $f$, they are actually identical for every member of a scale family $f(x) = f_0(x/\sigma)/\sigma$ (since $(F^+)^{-1}(u) = \sigma (F_0^+)^{-1}(u)$). Thus, no auxiliary estimate for $\sigma$ is necessary for their definition.
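A short sketch (my own code; the normalization and the approximation of the scores by $\phi(i/(n+1))$ are simplifying assumptions, not a quotation of the book) of a simple linear signed rank statistic with a given score-generating function:

```python
# Linear signed rank statistic with scores generated by phi on (0, 1).
# phi(u) = u gives a Wilcoxon-type statistic, phi(u) = 1 the sign statistic,
# and phi(u) = Phi^{-1}((u+1)/2) the normal-scores statistic.
import numpy as np
from scipy.stats import norm, rankdata

def signed_rank_statistic(x, phi):
    x = np.asarray(x, dtype=float)
    n = x.size
    ranks = rankdata(np.abs(x))             # ranks of |X_1|, ..., |X_n|
    scores = phi(ranks / (n + 1.0))         # approximate scores a_{n,i} ~ phi(i/(n+1))
    return np.sum(scores * np.sign(x)) / np.sqrt(n)

normal_scores = lambda u: norm.ppf((u + 1.0) / 2.0)
```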
=
15.11 Example (Laplace). The sign statistic $T_n = n^{-1/2}\sum_{i=1}^n \operatorname{sign}(X_i)$ satisfies (15.7) for $f$ equal to the Laplace density. Thus the sign test is asymptotically optimal for testing location in the Laplace scale family. □
15.12 Example (Normal). The standard normal density has score function for location $f_0'/f_0(x) = -x$ and Fisher information $I_{f_0} = 1$. The optimal signed rank statistic for the normal scale family has score-generating function
$$\phi(u) = \Phi^{-1}\Bigl(\frac{u + 1}{2}\Bigr),$$
and hence scores $a_{ni} = \mathrm{E}\,\Phi^{-1}\bigl((U_{n(i)} + 1)/2\bigr)$, approximately $\Phi^{-1}\bigl(\tfrac{1}{2} + \tfrac{i}{2(n+1)}\bigr)$.
We conclude that the corresponding sequence of rank tests has the same asymptotic slope as the $t$-test if the underlying distribution is normal. (For other distributions the two sequences of tests have different asymptotic behavior.) □

Even the assumption that the underlying distribution of the observations is known up to scale is often unrealistic. Because rank statistics are distribution-free under the null hypothesis, the level of a rank test is independent of the underlying distribution, which is the best possible protection of the level against misspecification of the model. On the other hand, the power of a rank test is not necessarily robust against deviations from the postulated model. This might lead to the use of the best test for the wrong model. The dependence of the power on the underlying distribution may be relaxed as well, by a procedure known as adaptation. This entails estimating the underlying density from the data and next using an optimal test for the estimated density. A remarkable fact is that this approach can be completely successful: there exist test statistics that are asymptotically optimal for any shape $f$. In fact, without prior knowledge of $f$ (other than that it is symmetric with finite and positive Fisher information for location), estimators $\hat\theta_n$ and $\hat I_n$ can be constructed such that, for every $\theta$ and $f$,
$$\sqrt{n}\,(\hat\theta_n - \theta) = -\frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{1}{I_f}\,\frac{f'}{f}(X_i - \theta) + o_{P_\theta}(1); \qquad \hat I_n \xrightarrow{P} I_f.$$
We give such a construction in section 25.8.1. Then the test statistics $T_n = \sqrt{n}\,\hat\theta_n \hat I_n^{1/2}$ satisfy (15.7) and hence are asymptotically (locally) optimal at $\theta = 0$ for every given shape $f$. Moreover, for every $\theta > 0$ and every $f$,
$$T_n \xrightarrow{\;P_{\theta,f}\;} \infty.$$
Hence, the sequence of tests based on $T_n$ is also consistent at every $(\theta, f)$ in the alternative hypothesis $H_1\colon \theta > 0$.
15.5 Two-Sample Problems
Suppose we observe two independent random samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ from densities $p_\mu$ and $q_\nu$, respectively. The problem is to test the null hypothesis $H_0\colon \nu \le \mu$ versus the alternative $H_1\colon \nu > \mu$. There may be other unknown parameters in the model besides $\mu$ and $\nu$, but we shall initially ignore such "nuisance parameters" and parametrize the model by $(\mu, \nu) \in \mathbb{R}^2$. Null and alternative hypotheses are shown graphically in Figure 15.1. We let $N = m + n$ be the total number of observations and assume that $m/N \to \lambda$ as $m, n \to \infty$.
15.13 Example (Testing shift). If $p_\mu(x) = f(x - \mu)$ and $q_\nu(y) = g(y - \nu)$ for two densities $f$ and $g$ that have the same "location," then we obtain the two-sample location problem. The alternative hypothesis asserts that the second sample is "stochastically larger." □
=
The alternatives of greatest interest for the study of the asymptotic performance of tests are sequences $(\mu_N, \nu_N)$ that converge to the boundary between the null and alternative hypotheses. In the study of relative efficiency, in Chapter 14, we restricted ourselves to
[Figure 15.1. Null and alternative hypotheses.]
vertical perturbations $(\theta, \theta + h/\sqrt{N})$. Here we shall use the sequences $(\theta + g/\sqrt{N}, \theta + h/\sqrt{N})$, which approach the boundary in the direction of a general vector $(g, h)$. If both $p_\mu$ and $q_\nu$ define differentiable models, then the sequence of experiments $\bigl(P_\mu^m \otimes Q_\nu^n : (\mu, \nu) \in \mathbb{R}^2\bigr)$ is locally asymptotically normal with norming rate $\sqrt{N}$. If the score functions are denoted by $\dot\ell_\mu$ and $\dot\ell_\nu$, and the Fisher informations by $I_\mu$ and $J_\nu$, respectively, then the parameters of local asymptotic normality are
$$\Delta_{N,(\mu,\nu)} = \Bigl(\frac{1}{\sqrt{N}}\sum_{i=1}^m \dot\ell_\mu(X_i),\; \frac{1}{\sqrt{N}}\sum_{j=1}^n \dot\ell_\nu(Y_j)\Bigr), \qquad I_{(\mu,\nu)} = \begin{pmatrix} \lambda I_\mu & 0 \\ 0 & (1-\lambda) J_\nu \end{pmatrix}.$$
The corresponding limit experiment consists of observing two independent normally distributed variables with means $g$ and $h$ and variances $\lambda^{-1} I_\mu^{-1}$ and $(1-\lambda)^{-1} J_\nu^{-1}$, respectively.

15.14 Corollary. Suppose that the models $(P_\mu : \mu \in \mathbb{R})$ and $(Q_\nu : \nu \in \mathbb{R})$ are differentiable in quadratic mean, and let $m, n \to \infty$ such that $m/N \to \lambda \in (0, 1)$. Then the power functions of any sequence of level $\alpha$ tests for $H_0\colon \nu = \mu$ satisfy, for every $\mu$ and for every $h > g$,
$$\limsup_{m,n\to\infty} \pi_{m,n}\Bigl(\mu + \frac{g}{\sqrt{N}},\, \mu + \frac{h}{\sqrt{N}}\Bigr) \le 1 - \Phi\biggl(z_\alpha - (h - g)\sqrt{\frac{\lambda(1-\lambda)\, I_\mu J_\mu}{\lambda I_\mu + (1-\lambda) J_\mu}}\biggr).$$
lim sup nm, n n ,m -+oo
(
fL
+
h g fhr ' fL + r;:r vN vN
)
:::=::
1 -
(
Za -
(h - g)
)
A ( 1 - A) IJJ., JJJ., . A fl-l + (1 - A) JI-l
This is a special case of Theorem 15.4, with 1/J (JL , v) = v - fL and Fisher in formation matrix diag (All-l , (1 - A) JIJ- ) · It is slightly different in that the null hypothesis H0 : 1/1 (8) = 0 takes the form of an equality, which gives a weaker requirement on the sequence Tn . The proof goes through because of the linearity of 1/J . •
Proof
1 5. 5 Two-Sample Problems
225
It follows that the optimal slope of a sequence of tests is equal to Sopt (/h)
=
A ( l - A) lfJ., JJ.J. A lfJ., + ( 1 - A) JfJ.,
The square of the quotient of the actual slope of a sequence of tests and this number is a good absolute measure of the asymptotic quality of the sequence of tests. According to the second assertion of Theorem 15 .4, an optimal sequence of tests can be based on any sequence of statistics such that
(The multiplicative factor Sopt(/h) ensures that the sequence TN is asymptotically normally distributed with variance 1 .) Test statistics with this property can be constructed using a variety of methods. For instance, in many cases we can use asymptotically efficient esti mators for the parameters fh and combined with estimators for possible nuisance param eters, along the lines of Example 15.6. If pfJ., = qtJ., = ftJ., are equal and are densities on the real line, then rank statistics are attractive. Let RN 1 , . . . , RN N be the ranks of the pooled sample X 1 , . . . , Xm , Y1 , . . . , Yn . Consider the two-sample rank statistics v,
for the score generating function
Up to a constant these are the locally most powerful scores introduced in Chapter 13. By Theorem 1 3 .5, because aN = fo1 ¢ (u) du = 0,
Thus, the locally most powerful rank statistics yield asymptotically optimal tests. In general, the optimal rank test depends on f.h, and other parameters in the model, which must be estimated from the data, but in the most interesting cases this is not necessary. For ftJ., equal to the logistic density with mean fh , the scores a N,i are proportional to i. Thus, the Wilcoxon (or Mann-Whitney) two-sample statistic is asymptotically uniformly most powerful for testing a difference in location between two samples from logistic densities with different means. 0 15.15
Example (Wilcoxon statistic).
15.16 Example (Log rank test). The log rank test is asymptotically optimal for testing proportional hazard alternatives, given any baseline distribution. 0
226
Efficiency of Tests Notes
Absolute bounds on asymptotic power functions as developed in this chapter are less known than the absolute bounds on estimator sequences given in Chapter 8. Testing problems were nevertheless an important subject in Wald [ 149] , who is credited by Le Cam for having first conceived of the method of approximating experiments by Gaussian experiments, albeit in a somewhat different way than later developed by Le Cam. From the point of view of statistical decision theory, there is no difference between testing and estimating, and hence the asymptotic bounds for tests in this chapter fit in the general theory developed in [99] . Wald appears to use the Gaussian approximation to transfer the optimality of the likelihood ratio and the Wald test (that is now named for him) in the Gaussian experiment to the sequence of experiments. In our discussion we use the Gaussian approximation to show that, in the multidimensional case, "asymptotic optimality" can only be defined in a somewhat arbitrary manner, because optimality in the Gaussian experiment is not easy to define. That is a difference of taste.
PROBLEMS
1. Consider the two-sided testing problem Ho : c T h = 0 versus H 1 : c T h =/= 0 based on an Nk (h , b ) distributed observation X . A test for testing Ho versus H 1 i s called unbiased i f sup h E Ho n (h) :::: infh E H1 n (h ) . The test that rej ects Ho for large values of l e T X I is uniformly mos t powerful among the unbiased tests . More precisely, for every power function n of a test based on X the conditions
n (h) ::S a
if h T c = O
and
imply that, for every c T h =f= 0,
Formulate an asymptotic upper bound theorem for two-sided testing problems in the spirit of Theorem 1 5 .4.
2. (i) Show that the set of power functions h 1--+ n11 (h) in a dominated experiment ( Ph : h compact for the topology of pointwise convergence (on H).
E
H) is
(ii) Give a full proof o f Theorem 1 5 . 1 along the following lines. First apply the proof a s given for every finite subset I C H. This yields power functions lT J in the limit experiment that coincide with JT on I .
3 . Consider testing Ho : h = 0 versus H1 : h =/= 0 based on an observation X with an N (h , :E ) -distribution. Show that the testing problem is invariant under the transformations x 1--+ :E 1 12 o :E - 1 1 2 x for 0 ranging over the orthonormal group . Find the best invariant test.
4. Consider testing Ho : h = 0 versus H1 : h =/= 0 based on an observation X with an N (h , :E ) distribution. Find the test that maximizes the minimum power over { h : I I :E - 1 1 2 h I I = c } . (B y the Hunt-Stein theorem the best invariant test is maximin, so one can apply the preceding problem. Alternatively, one can give a direct derivation along the following lines. Let n be the distribution of :E 1 1 2 u if U is uniformly distributed on the set { h : l l h ll = c } . Derive the Neyman-Pearson test for testing Ho : N (O , :E) versus H1 : J N ( h , :E) dn (h) . Show that its power is constant on { h : II :E - l/ 2 h II = c } . The minimum power of any test on this set is always smaller than the average power over this set, which is the power at J N (h , :E) dn(h).)
16 Likelihood Ratio Tests
The critical values of the likelihood ratio test are usually based on an asymptotic approximation. We derive the asymptotic distribution of the likelihood ratio statistic and investigate its asymptotic quality through its asymptotic power function and its Bahadur efficiency.
16.1
Introduction
Suppose that we observe a sample X 1 , . . . , Xn from a density pg , and wish to test the null hypothesis H0 : e E 80 versus the alternative H1 : e E 8 1 . If both the null and the alternative hypotheses consist of single points, then a most powerful test can be based on the log likelihood ratio, by the Neyman-Pearson theory. If the two points are eo and 8 1 , respectively, then the optimal test statistic is given by
For certain special models and hypotheses, the most powerful test turns out not to depend on 8 1 , and the test is uniformly most powerful for a composite hypothesis 8 1 . Sometimes the null hypothesis can be extended as well, and the testing problem has a fully satisfac tory solution. Unfortunately, in many situations there is no single best test, not even in an asymptotic sense (see Chapter 15). A variety of ideas lead to reasonable tests. A sensible extension of the idea behind the Neyman-Pearson theory is to base a test on the log likelihood ratio
The single points are replaced by maxima over the hypotheses. As before, the null hypoth esis is rejected for large values of the statistic. Because the distributional properties of An can be somewhat complicated, one usually replaces the supremum in the numerator by a supremum over the whole parameter set 8 80 U 8 1 . This changes the test statistic only if An ::::; 0, which is inessential, because in most cases the critical value will be positive. We study the asymptotic properties of the =
227
228
Likelihood Ratio Tests
(log) likelihood ratio statistic
The most important conclusion of this chapter is that, under the null hypothesis, the sequence An is asymptotically chi squared-distributed. The main conditions are that the model is differentiable in 8 and that the null hypothesis Go and the full parameter set G are (locally) equal to linear spaces. The number of degrees of freedom is equal to the difference of the (local) dimensions of G and G0 . Then the test that rejects the null hypothesis if An exceeds the upper a-quantile of the chi-square distribution is asymptotically of level a. Throughout the chapter we assume that 8 c JE.k . The "local linearity" of the hypotheses is essential for the chi-square approximation, which fails already in a number of simple examples. An open set is certainly locally linear at every of its points, and so is a relatively open subset of an affine subspace. On the other hand, a half line or space, which arises, for instance, if testing a one-sided hypothesis H0 : J-Le :::::: 0, or a ball H0 : 118 I :::::: 1 , is not locally linear at its boundary points. In that case the asymptotic null distribution of the likelihood ratio statistic is not chi-square, but the distribution of a certain functional of a Gaussian vector. Besides for testing, the likelihood ratio statistic is often used for constructing confidence regions for a parameter 1/1 (0) . These are defined, as usual, as the values r for which a null hypothesis Ho : 1/1 (8) r is not rejected. Asymptotic confidence sets obtained by using the chi-square approximation are thought to be of better coverage accuracy than those obtained by other asymptotic methods. The likelihood ratio test has the desirable property of automatically achieving reduction of the data by sufficiency: The test statistic depends on a minimal sufficient statistic only. This is immediate from its definition as a quotient and the characterization of sufficiency by the factorization theorem. Another property of the test is also immediate: The likelihood ratio statistic is invariant under transformations of the parameter space that leave the null and alternative hypotheses invariant. This requirement is often imposed on test statistics but is not necessarily desirable. =
A vector N (N1 , . . . , Nk ) that possesses the mul tinomial distribution with parameters n and p (p 1 , . . . , Pk ) can be viewed as the sum of n independent multinomial vectors with parameters 1 and p. By the sufficiency reduction, the likelihood ratio statistic based on N is the same as the statistic based on the single observations. Thus our asymptotic results apply to the likelihood ratio statistic based on N, if n ---+ oo. If the success probabilities are completely unknown, then their maximum likelihood estimator is N j n. Thus, the log likelihood ratio statistic for testing a null hypothesis Ho : p E Po against the alternative H1 : p ¢:. Po is given by ( N n N ) (N1 /n) N1 (Nk fn ) N' k . f "N 2 1 og l ··· k ( n ) N 2 i 1 og - . L m pE Po . 1 np i suppE Po N1 ... Nk p 1 . . . p kNk 16.1
Example (Multinomial vector).
=
=
•
•
(N) z
•
I
=
z=
1 The full parameter set can be identified with an open subset of JE.k , if p with zero coordi nates are excluded. The null hypothesis may take many forms. For a simple null hypothesis -
229
Taylor Expansion
1 6. 2
the statistic is asymptotically chi-square distributed with k 1 degrees of freedom. This follows from the general results in this chapter. t Multinomial variables arise, among others, in testing goodness-of-fit. Suppose we wish to test that the true distribution of a sample of size n belongs to a certain parametric model { Pe : e E 8}. Given a partition of the sample space into sets X1 , . . . , Xk . define N1 , . . . , Nk as the numbers of observations falling into each of the sets of the partition. Then the vector N = (N1 , . . . , Nk ) possesses a multinomial distribution, and the original problem can be translated in testing the null hypothesis that the success probabilities p have the form (Pe (XJ ) , . . . , Pe (Xk )) for some e . D -
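As a computational aside (my own sketch, not the book's code), the multinomial likelihood ratio statistic of Example 16.1 for a simple null hypothesis $p = p_0$ and its chi-square critical value can be evaluated as follows.

```python
# Likelihood ratio statistic 2 * sum N_i log(N_i / (n p_i)) for a simple multinomial
# null hypothesis, compared with the chi-square(k - 1) upper alpha-quantile.
import numpy as np
from scipy.stats import chi2

def multinomial_lr_test(counts, p0, alpha=0.05):
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), counts.size
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(counts > 0, counts * np.log(counts / (n * np.asarray(p0))), 0.0)
    lam = 2 * terms.sum()
    return lam > chi2.ppf(1 - alpha, df=k - 1), lam
```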
Suppose that the observations are sampled from a density Pe in the k-dimensional exponential family 16.2
Example (Exponentialfamilies).
Let 8 c IR.k be the natural parameter space, and consider testing a null hypothesis 80 versus its complement 8 - 80. The log likelihood ratio statistic is given by
c
8
This is closely related to the Kullback-Leibler divergence of the measures Pe0 and Pe , which is equal to Pe K(e , eo) = Pe log - = (e - eo) T Pe t + log c(e) - log c(eo). Peo If the maximum likelihood estimator {} exists and is contained in the interior of 8, which is the case with probability tending to 1 if the true parameter is contained in 8, then 8 is the moment estimator that solves the equation Pe t = tn . Comparing the two preceding displays, we see that the likelihood ratio statistic can be written as An = 2n K (8 , 80) , where K (e , Bo) is the infimum of K (e , eo) over eo E Bo . This pretty formula can be used to study the asymptotic properties of the likelihood ratio statistic directly. Alternatively, the general results obtained in this chapter are applicable to exponential families. D
* 16.2
Taylor Expansion
Write en, O and en for the maximum likelihood estimators for e if the parameter set is taken equal to 80 or 8, respectively, and set = log Pe . In this section assume that the true value of the parameter fJ is an inner point of 8. The likelihood ratio statistic can be rewritten as
.C e
n L
An = -2 i= l (.ee11_0 (Xi ) - .Ce, CXi )).
.Ce (Xi ) .C e (x)
To find the limit behavior of this sequence of random variables, we might replace L by its Taylor expansion around the maximum likelihood estimator e = {}n · If e �----;>t
It is also proved in Chapter 17 by relating the likelihood ratio statistic to the chi-square statistic .
Likelihood Ratio Tests
230
is twice continuously differentiable for every X , then there exists a vector {jn between en . O and en such that the preceding display is equal to
n , -2(en O - en ) L l en (X i) - cen, O - en l L len ( Xi ) (e n, O - en ) . i=1
Because en is the maximum likelihood estimator in the unrestrained model, the linear term in this expansion vanishes as soon as en is an inner point of G. If the averages -n- 1 L ie (X i) converge in probability to the Fisher information matrix IfJ and the sequence ,Jn (en ,o - en ) is bounded in probability, then we obtain the approximation
( 16.3) In view of the results of Chapter 5, the latter conditions are reasonable if 7J E G 0 , for then both {jn and {Jn, O can be expected to be .Jfi-consistent. The preceding approximation, if it can be justified, sheds some light on the quality of the likelihood ratio test. It shows that, asymptotically, the likelihood ratio test measures a certain distance between the maximum likelihood estimators under the null and the full hypotheses. Such a procedure is intuitively reasonable, even though many other distance measures could be used as well. The use of the likelihood ratio statistic entails a choice as to how to weigh the different "directions" in which the estimators may differ, and thus a choice of weights for "distributing power" over different deviations. This is further studied in section 16.4. If the null hypothesis is a single point Go = {8o}, then en, O = eo, and the quadratic form in the preceding display reduces under Ho : 8 = 8o (i.e., 7J = 8o ) to hn ifJhn for hn = ..jn ({Jn iJ ) T . In view of the results of Chapter 5, the sequence h n can be expected to converge in distribution to a variable h with a normal N (O, Ii 1 )-distribution. Then the sequence An converges under the null hypothesis in distribution to the quadratic form h T IfJ h. This is the squared length of the standard normal vector I� 1 2 h , and possesses a chi-square distribution with k degrees of freedom. Thus the chi-square approximation announced in the introduction follows. The situation is more complicated if the null hypothesis is composite. If the sequence ..jn (en,O - iJ, {Jn iJ) converges jointly to a variable (h0, h), then the sequence A 11 is asymptotically distributed as (h - h0) T IfJ (h - h0) . A null hypothesis Go that is (a seg ment of) a lower dimensional affine linear subspace is itself a "regular" parametric model. If it contains iJ as a relative inner point, then the maximum likelihood estimator en, O may be expected to be asymptotically normal within this affine subspace, and the pair ..jn(en,o - iJ, {Jn - iJ) may be expected to be jointly asymptotically normal. Then the like lihood ratio statistic is asymptotically distributed as a quadratic form in normal variables. Closer inspection shows that this quadratic form possesses a chi-square distribution with k - l degrees of freedom, where k and l are the dimensions of the full and null hypothe ses. In comparison with the case of a simple null hypothesis, l degrees of freedom are "lost." Because we shall rigorously derive the limit distribution by a different approach in the next section, we make this argument precise only in the particular case that the null hypothesis Go consists of all points ( 81 , , 81 , 0, . . . , 0) , if e ranges over an open subset 8 of IR.k . Then the score function for e under the null hypothesis consists of the first l coordinates of the score function e fJ for the whole model, and the information matrix under the null hypothesis is equal to the (l x l) principal submatrix of IfJ . Write these as i fJ, 5.1 and IfJ, 5. l , 5.l , respective! y, and use a similar partitioning notation for other vectors and matrices. -
•
.
•
1 6. 3
23 1
Using Local Asymptotic Normality
Under regularity conditions we have the linear approximations (see Theorem 5.39)
Given these approximations, the multivariate central limit theorem and Slutsky's lemma yield the joint asymptotic normality of the maximum likelihood estimators. From the form of the asymptotic covariance matrix we see, after some matrix manipulation, .Jn cen, <: l - {) n, O , <:l ) = - l;J, ;l , <:l ItJ, <:1 , >l .Jii en, > I + 0 p ( 1 ) .
(Alternatively, this approximation follows from a Taylor expansion of 0 2..::7= 1 f e, , <:l around en,O, <:l · ) Substituting this in ( 1 6.3) and again carrying out some matrix manipulations, we find that the likelihood ratio statistic is asymptotically equivalent to (see problem 16.5) 1 ( 1 6. 4) .Jii e: >l (I;- t l , >l - I .Jii en, >l ·
(
)
The matrix (/i I ) >i, > I is the asymptotic covariance matrix of the sequence Jn en, > I ' whence we obtain an asymptotic chi-square distribution with k - l degrees of freedom, by the same argument as before. We close this section by relating the likelihood ratio statistic to two other test statistics. Under the simple null hypothesis 80 = { 80}, the likelihood ratio statistic is asymptotically equivalent to both the maximum likelihood statistic (or Wald statistic) and the score statistic. These are given by n(!J" - eol ' le, (iJ"
- 8o )
and
Ht i "" r [t i l (X, )
Ig,; l
" (X, )
The Wald statistic is a natural statistic, but it is often criticized for necessarily yielding ellipsoidal confidence sets, even if the data are not symmetric. The score statistic has the advantage that calculation of the supremum of the likelihood is unnecessary, but it appears to perform less well for smaller values of n. In the case of a composite hypothesis, a Wald statistic is given in ( 16.4) and a score statistic can be obtained by substituting the approximation n en, >i � (/i 1 ) >i, >i I: i. e, (X; ) in (1 6.4). (This approximation is obtainable from linearizing l..: Cf e, - i. e,) ·) In both cases we also replace the unknown parameter iJ by an estimator. 0 >I
16.3
Using Local Asymptotic Normality
An insightful derivation of the asymptotic distribution of the likelihood ratio statistic is based on convergence of experiments. This approach is possible for general experiments, but this section is restricted to the case of local asymptotic normality. The approach applies also in the case that the (local) parameter spaces are not linear. Introducing the local parameter spaces Hn = ,jfi(B - iJ ) and Hn,o = ,j11 ( 80 - iJ ) , we can write the likelihood ratio statistic in the form ) ) 2 if _ 2 sup 1 og n 7= 1 nPtJ+h/ Jilcx; sup 1 og n 7= 1 nP +h/Jil cx ; . n i = 1 PtJ (X ; ) n i = I P tJ (X ; ) hEH, hEH, o A l � Jl
-
_
232
Likelihood Ratio Tests
In Chapter 7 it is seen that, for large n, the rescaled likelihood ratio process in this display is similar to the likelihood ratio process of the normal experiment ( N (h , I;; 1 ) : h E Il�_k) . This suggests that, if the sets Hn and Hn, O converge in a suitable sense to sets H and H0 , the sequence An converges in distribution to the random variable A obtained by substituting the normal likelihood ratios, given by A
=
dN ( h , I;; 1 ) 2 sup log (X) hEH dN ( 0, I;; 1 )
-
dN ( h , I;; 1 ) 2 sup log (X) . h EHo dN ( 0, I;; 1 )
This is exactly the likelihood ratio statistic for testing the null hypothesis Ho : h E Ho versus the alternative H1 : h E H - Ho based on the observation X in the normal experiment. Because the latter experiment is simple, this heuristic is useful not only to derive the asymptotic distribution of the sequence An , but also to understand the asymptotic quality of the corresponding sequence of tests. The likelihood ratio statistic for the normal experiment is A
=
=
inf (X - h l I,y (X - h) - hinf (X - h l I,y (X - h)
h E Ho
EH
I I; f2 x - I; ;2 Ho ll 2 - II I; f2 x - I; ;2 H II 2 ·
( 1 6.5)
The distribution of the sequence An under iJ corresponds to the distribution of A under h 0. Under h 0 the vector I; 12 X possesses a standard normal distribution. The following lemma shows that the squared distance of a standard normal variable to a linear subspace is chi square-distributed and hence explains the chi-square limit when H0 is a linear space. =
=
16.6 Lemma. Let X be a k-dimensional random vector with a standard normal distri bution and let H0 be an ! -dimensional linear subspace of m.,_k . Then I X - H0 11 2 is chi square-distributed with k - l degrees offreedom.
Take an orthonormal base of m.,_k such that the first l elements span H0 . By Pythago ras' theorem, the squared distance of a vector z to the space H0 equals the sum of squares Li>i zf of its last k - l coordinates with respect to this basis. A change of base corresponds to an orthogonal transformation of the coordinates. Because the standard normal distribu tion is invariant under orthogonal transformations, the coordinates of X with respect to any orthonormal base are independent standard normal variables. Thus II X Ho 11 2 Li >l Xf is chi square-distributed. •
Proof.
-
=
If fJ is an inner point of 8, then the set H is the full space m.,_k and the second term on the right of ( 1 6.5) is zero. Thus, if the local null parameter spaces ..jii ( 8 0 - fJ ) converge to a linear subspace of dimension l , then the asymptotic null distribution of the likelihood ratio statistic is chi-square with k - l degrees of freedom. The following theorem makes the preceding informal derivation rigorous under the same mild conditions employed to obtain the asymptotic normality of the maximum likelihood estimator in Chapter 5. It uses the following notion of convergence of sets. Write H11 ---+ H if H is the set of all limits lim hn of converging sequences hn with h 11 E Hn for every n and, moreover, the limit h limi hn, of every converging sequence hn, with h 11, E Hn, for every i is contained in H. =
1 6. 3
233
Using Local Asymptotic Normality
Let the model (Pe : e E 8) be differentiable in quadratic mean at iJ with nonsingular Fisher information matrix, and suppose that for every 8 1 and 82 in a neighbor. 2 hood of iJ and for a measurable function f such that PiJ f· < oo, 16.7
Theorem.
If the maximum likelihood estimators Bn ,O and en are consistent under i} and the sets Hn , O and Hn converge to sets H0 and H, then the sequence of likelihood ratio statistics An converges under iJ + hj ,Jli in distribution to A given in ( 16. 5), for X normally distributed with mean h and covariance matrix Li 1 . *Proof.
Zn by
Let IGn
=
,JlioPn - PiJ ) be the empirical process, and define stochastic processes Zn (h)
=
nlfDn log
P H h/ .fii -
PiJ
h T IGn l iJ + � h T liJ h . 2
The differentiability of the model implies that Zn (h ) !,. 0 for every h. In the proof of Theorem 7 . 1 2 this is strengthened to the uniform convergence sup J zn (h) j !,. 0,
ll h i i ::OM
every M.
Furthermore, it follows from this proof that both Bn , O and en are ,Jfi-consistent under i} (These statements can also be proved by elementary arguments, but under stronger regularity conditions.) The preceding display is also valid for every sequence Mn that increases to suffi ciently slowly. Fix such a sequence. By the ,Jfi-consistency, the estimators Bn, O and en are contained in the ball of radius Mn / ,Jn around iJ with probability tending to 1 . Thus, the limit distribution of An does not change if we replace the sets Hn and Hn ,o in its definition by the sets Hn n ball(O, Mn ) and Hn ,o n ball(O, Mn ) . These "truncated" sequences of sets still converge to H and H0, respectively. Now, by the uniform convergence to zero of the processes Zn (h) on Hn and Hn , o , and simple algebra, 0
oo
by Lemma 7 . 1 3 (ii) and (iii). The theorem follows by the continuous-mapping theorem.
•
In a generalized linear model a typical ob servation (X, Y), consisting of a "covariate vector" X and a "response" Y, possesses a density of the form 16.8
Example (Generalized linear models).
(It may be more natural to model the covariates as (observed) constants, but to fit the model into our i.i.d. setup, we consider them to be a random sample from a density Px .) Thus, given
Likelihood Ratio Tests
234
X, the variable Y follows an exponential family density eY8
varfl ,
o
b" k(f3 T X) 0
¢
The function (b ' k) - 1 is called the link function of the model and is assumed known. To make the parameter f3 identifiable, we assume that the matrix EX XT exists and is nonsin gular. To judge the goodness-of-fit of a generalized linear model to a given set of data (X 1 , Y1 ) , . . , (X n , Yn ), it is customary to calculate, for fixed ¢, the log likelihood ratio statistic for testing the model as described previously within the model in which each Yi , given Xi , still follows the given exponential family density, but in which the parameters e (and hence the conditional means E( Yi I X i )) are allowed to be arbitrary values ei , umelated across the n observations (Xi , Yi ) . This statistic, with the parameter ¢ set to 1 , is known as the deviance, and takes the form, with �n the maximum likelihood estimator for f3 , t o
.
n = - 2 'L Yi k (�: xi ) - (b ' ) - 1 ( YJ b k (�: xi ) + b o (b ' ) - 1 ( Yi) . i=1 In our present setup, the codimension of the null hypothesis within the "full model" is equal to n k, if f3 is k-dimensional, and hence the preceding theory does not apply to the deviance. (This could be different if there are multiple responses for every given covariate and the asymptotics are relative to the number of responses.) On the other hand, the preceding theory allows an "analysis of deviance" to test nested sequences of regression models corresponding to inclusion or exclusion of a given covariate (i.e., column of the regression matrix). For instance, if D1 (Y n , flci l ) is the deviance of the model in which the i -±;- 1 , i + 2 , . . , ktj coordinates of f3 are a priori set to zero, then the difference Di - l ( Yn , fl u - l J ) - Di ( Yn , P,(i)) is the log likelihood ratio statistic for testing that the ith coordinate of f3 is zero within the model in which all higher coordinates are zero. According to the theory of this chapter, ¢ times this statistic is asymptotically chi square-distributed with one degree of freedom under the smaller of the two models. To see this formally, it suffices to verify the conditions of the preceding theorem. Using the identities for exponential families based on Lemma 4.5, the score function and Fisher information matrix can be computed to be
)-
[ (
o
J
-
.
C fl (x , y) = ( y - b' k(f3 T x) ) k'(f3 T x)x , lfl = Eb " k(f3 T X)k' (f3 T X) 2 X X T . o
o
Depending on the function k, these are very well-behaved functions of f3 , because b is a strictly convex, analytic function on the interior of the natural parameter space of the family, as is seen in section 4.2. Under reasonable conditions the function sup flEU II i fl ll is t
The arguments Y n and fl of D are the vectors of estimated (conditional) means of Y given the full model and the generalized linear model, respectively. Thus /l; b' o k(S: X; ) . =
1 6. 3
Using Local Asymptotic Normality
235
square-integrable, for every small neighborhood U, and the Fisher information is continu ous. Thus, the local conditions on the model are easily satisfied. Proving the consistency of the maximum likelihood estimator may be more involved, depending on the link function. If the parameter f3 is restricted to a compact set, then most approaches to proving consistency apply without further work, including Wald's method, Theorem 5.7, and the classical approach of section 5.7. The last is particularly attractive in the case of canonical link functions, which correspond to setting k equal to the identity. Then the second-derivative matrix lfl is equal to -b " (f3 T x )xx T , whence the likelihood is a strictly concave function of f3 whenever the observed covariate vectors are of full rank. Consequently, the point of maximum of the likelihood function is unique and hence consistent under the conditions of Theorem 5 . 1 4. t 0 16.9 Example (Location scale). Suppose we observe a sample from the density f ( (x p., ) 1 CJ ) 1 CJ for a given probability density J, and a location-scale parameter e = (p.,, CJ )
ranging over the set 8 = .lR x .JR + . We consider two testing problems. (i). Testing H0 : p., = 0 versus H1 : p., =f 0 corresponds to setting G0 = {0} x .IR+ . For a given point � = (0, CJ ) from the null hypothesis the set ,Jn' (80 - �) equals {O} x (- ,Jn'CJ, oo) and converges to the linear space {0} x R Under regularity conditions on j, the sequence of likelihood ratio statistics is asymptotically chi square-distributed with 1 degree of freedom. (ii). Testing H0 : p., ::::; 0 versus H1 : p., > 0 corresponds to setting Go = ( - oo , 0] x .JR+ . For a given point � = (0, CJ) on the boundary of the null hypothesis, the sets ,Jn' (80 �) converge to H0 = ( - oo , 0] x R In this case, the limit distribution of the likelihood ratio statistics is not chi-square but equals the distribution of the square distance of a standard normal vector to the set 1� 12 Ho = { h : ( h, r;; 1 12e 1 ) ::::; 0 } . The latter is a half-space with boundary line through the origin. Because a standard normal vector is rotationally symmetric, the distribution of its distance to a half-space of this type does not depend on the orientation of the half-space. Thus the limit distribution is equal to the distribution of the squared distance of a standard normal vector to the half-space {h : h 2 :::: 0}: the distribution of (Z v 0) 2 for a standard normal variable Z. Because P ( (Z v 0) 2 > c ) = � P(Z 2 > c) for every c > 0, we must choose the critical value of the test equal to the upper 2a-quantile of the chi-square distribution with 1 degree of freedom. Then the asymptotic level of the test is a for every � on the boundary of the null hypothesis (provided a < 1 12). For a point � in the interior of the null hypothesis H0 : p., ::::; 0 the sets ,Jn' (G0 - �) converge to .lR x .lR and the sequence of likelihood ratio statistics converges in distribution to the squared distance to the whole space, which is zero. This means that the probability of an error of the first kind converges to zero for every � in the interior of the null hypothesis. 0 Suppose we wish to test the null hypothesis Ho : I e I ::::; 1 that the parameter belongs to the unit ball versus the alternative H1 : 118 I > 1 that this is not case. If the true parameter � belongs to the interior of the null hypothesis, then the sets Jn ( 80 - �) converge to the whole space, whence the sequence of likelihood ratio statistics converges in distribution to zero. 16.10
t
Example (Testing a ball).
For a detailed study of sufficient conditions for consistency see [45] .
Likelihood Ratio Tests
236
For 1J on the boundary of the unit ball, the sets Jfi(80 - 1J) grow to the half-space H0 = { h : (h , 1J ) :::::; 0} . The sequence of likelihood ratio statistics converges in distribution
to the distribution of the square distance of a standard normal vector to the half-space 1� 12 H0 = { h : (h , r; 1 12 1J ) :::::; 0} . By the same argument as in the preceding example, this is the distribution of (Z v 0) 2 for a standard normal variable Z . Once again we find an asymptotic level-a test by using a 2a-quantile. 0
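The critical value in this boundary case is easy to check numerically; the following snippet (an illustration of mine) verifies that the upper $2\alpha$ chi-square(1) quantile gives asymptotic level $\alpha$ for the $(Z \vee 0)^2$ limit.

```python
# P((max(Z,0))^2 > c) = 0.5 * P(chi^2_1 > c), so c = chi^2_{1, 2*alpha} gives level alpha.
import numpy as np
from scipy.stats import chi2

alpha = 0.05
crit = chi2.ppf(1 - 2 * alpha, df=1)              # upper 2*alpha quantile of chi^2_1
z = np.random.default_rng(0).standard_normal(200_000)
print(np.mean(np.maximum(z, 0.0) ** 2 > crit))    # approximately alpha by simulation
```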
Suppose that the null hypothesis is equal to the image 80 = g(T) of an open subset T of a Euclidean space of dimension l ::::::: k. If g is a homeomorphism, continuously differentiable, and of full rank, then the sets Jli( 80 - g(r)) converge to the range of the derivative of g at r , which is a subspace of dimension l. Indeed, for any rJ E ffi. 1 the vectors r + rJ / J1i are contained in T for sufficiently large n, and the sequence Jli(g(r + TJ/Jli) - g(r)) converges to g� T]. Furthermore, if a sub sequence of Jli(g( tn) - g (r)) converges to a point h for a given sequence tn in T, then the corresponding subsequence of JliCtn - r) converges to rJ = (g- 1 )� C rl h by the differ entiability of the inverse mapping g- 1 and hence J1i (g(tn) - g( r)) ---+ g� TJ. (We can use the rank theorem to give a precise definition of the differentiability of the map g-1 on the manifold g(T).) 0 16.11
Example (Testing a range).
16.4
Asymptotic Power Functions
Because the sequence of likelihood ratio statistics converges to the likelihood ratio statistic in the Gaussian limit experiment, the likelihood ratio test is "asymptotically efficient" in the same way as the likelihood ratio statistic in the limit experiment is "efficient." If the local limit parameter set H0 is a half-space or a hyperplane, then the latter test is uniformly most powerful, and hence the likelihood ratio tests are asymptotically optimal (see Proposition 1 5.2).This is the case, in particular, for testing a simple null hypothesis in a one-dimensional parametric model. On the other hand, if the hypotheses are higher dimensional, then there is often no single best test, not even under reasonable restrictions on the class of admitted tests. For different (one-dimensional) deviations of the null hypothesis, different tests are optimal (see the discussion in Chapter 15). The likelihood ratio test is an omnibus test that gives reasonable power in all directions. In this section we study its local asymptotic power function more closely. We assume that the parameter 1J is an inner point of the parameter set and denote the true parameter by 1J + h i Jfi. Under the conditions of Theorem 16.7, the sequence oflikelihood ratio statistics is asymptotically distributed as A = li z + I� /2 h - /�/2 Ho 11 2 for a standard normal vector Z . Suppose that the limiting local parameter set H0 is a linear subspace of dimension l, and that the null hypothesis is rejected for values of An exceeding the critical value xf- 1 a · Then the local power functions of the resulting tests satisfy 7Tn
(f} + �) = PHh/fo (A n > xLz. a ) ---+ P (A > xL/, a ) = : n (h ) . h
The variable A is the squared distance of the vector Z to the affine subspace - I� 12 h + I� 12 H0 . By the rotational invariance of the normal distribution, the distribution of A does not de pend on the orientation of the affine subspace, but only on its codimension and its distance
1 6. 4
Asymptotic Power Functions
237
‖I_θ^{1/2}h − I_θ^{1/2}H₀‖ to the origin. This distribution is known as the noncentral chi-square distribution with noncentrality parameter δ. Thus

π(h) = P(χ²_{k−l}(δ) > χ²_{k−l,α}),    δ = ‖I_θ^{1/2}h − I_θ^{1/2}H₀‖.

The noncentral chi-square distributions are stochastically increasing in the noncentrality parameter. It follows that the likelihood ratio test has good (local) power at alternatives h that yield a large value of the noncentrality parameter. The shape of the asymptotic power function is easiest to understand in the case of a simple null hypothesis. Then H₀ = {0}, and the noncentrality parameter reduces to the square root of hᵀI_θ h. For h = μ h_e equal to a multiple of an eigenvector h_e (of unit norm) of I_θ with eigenvalue λ_e, the noncentrality parameter equals √λ_e μ. The asymptotic power function in the direction of h_e equals

π(μ h_e) = P(χ²_k(√λ_e μ) > χ²_{k,α}).
The test performs best for departures from the null hypothesis in the direction of the eigenvector corresponding to the largest eigenvalue. Even though the likelihood ratio test gives power in all directions, it does not treat the directions equally. This may be worrisome if the eigenvalues are very inhomogeneous.
Further insight is gained by comparing the likelihood ratio test to tests that are designed to be optimal in given directions. Let X be an observation in the limit experiment, having a N(h, I_θ^{-1})-distribution. The test that rejects the null hypothesis H₀ = {0} if √λ_e |h_eᵀX| > z_{α/2} has level α and power function

h ↦ 1 − Φ(z_{α/2} − √λ_e h_eᵀh) + Φ(−z_{α/2} − √λ_e h_eᵀh).
For large k this is a considerably better power function than the power function of the likelihood ratio test (Figure 1 6. 1 ), but the forms of the power functions are similar. In particular, the optimal power functions show a similar dependence on the eigenvalues of
[Figure 16.1. The functions μ ↦ P(χ²_k(μ) > χ²_{k,α}) for k = 1 (solid), k = 5 (dotted) and k = 15 (dashed), respectively, for α = 0.05.]
the covariance matrix. In this sense, the apparently unequal distribution of power over the different directions is not unfair in that it reflects the intrinsic difficulty of detecting changes in different directions. This is not to say that we should never change the (automatic) emphasis given by the likelihood ratio test.
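The local asymptotic power functions just described can be evaluated directly from the noncentral chi-square distribution. The following sketch is a minimal illustration assuming NumPy and SciPy are available; the function name and the eigenvalue used are illustrative choices, and note that SciPy parameterizes the noncentral chi-square by the squared noncentrality δ², whereas δ above is a distance.

```python
import numpy as np
from scipy.stats import chi2, ncx2

def lr_local_power(delta, k, alpha=0.05):
    """P(chi2_k(delta) > chi2_{k,alpha}) for the LR test of a simple null.

    `delta` is the noncentrality as a distance; SciPy expects its square."""
    crit = chi2.ppf(1 - alpha, df=k)
    return ncx2.sf(crit, df=k, nc=np.asarray(delta) ** 2)

# Power in the direction of a unit eigenvector h_e with (hypothetical)
# eigenvalue lam_e of I_theta: the noncentrality is sqrt(lam_e) * mu.
lam_e = 1.0
mu = np.linspace(0.5, 5.0, 10)
for k in (1, 5, 15):
    print(k, np.round(lr_local_power(np.sqrt(lam_e) * mu, k), 3))
```

Plotting these curves for several k reproduces the qualitative picture of Figure 16.1: higher-dimensional alternatives require a larger noncentrality for the same power.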
16.5
Bartlett Correction
The chi-square approximation to the distribution of the likelihood ratio statistic is relatively accurate but can be much improved by a correction. This was first noted in the example of testing for inequality of the variances in the one-way layout by Bartlett and has since been generalized. Although every approximation can be improved, the Bartlett correction appears to enjoy a particular popularity.
The correction takes the form of a correction of the (asymptotic) mean of the likelihood ratio statistic. In regular cases the distribution of the likelihood ratio statistic is asymptotically chi-square with, say, r degrees of freedom, whence its mean ought to be approximately equal to r. Bartlett's correction is intended to make the mean exactly equal to r, by replacing the likelihood ratio statistic Λ_n by

Λ_n · r / E_{θ₀} Λ_n.
The distribution of this statistic is next approximated by a chi-square distribution with r degrees of freedom. Unfortunately, the mean E_{θ₀} Λ_n may be hard to calculate, and may depend on an unknown null parameter θ₀. Therefore, one first obtains an expression for the mean of the form

E_{θ₀} Λ_n / r = 1 + b(θ₀)/n + ⋯ .
Next, with b̂_n an appropriate estimator for the parameter b(θ₀), the corrected statistic takes the form

Λ_n / (1 + b̂_n/n).
The surprising fact is that this recipe works in some generality. Ordinarily, improved approximations would be obtained by writing down and next inverting an Edgeworth expansion of the probabilities P(Λ_n ≤ x); the correction would depend on x. In the present case this is equivalent to a simple correction of the mean, independent of x. The technical reason is that the polynomial in x in the (1/n)-term of the Edgeworth expansion is of degree 1.†
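As a rough illustration of the recipe, one can calibrate the correction factor from the null mean of the statistic. The sketch below is an assumption-laden stand-in: in practice the factor comes from an analytic expansion of E_{θ₀}Λ_n, whereas here it is estimated by Monte Carlo, and the function name and the toy numbers are invented for the example.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_corrected(lr_stats_null, lr_observed, r):
    """Rescale an observed LR statistic so its null mean matches r, the
    degrees of freedom of the limiting chi-square distribution.
    `lr_stats_null` are LR statistics obtained under the null (here by
    simulation, as a stand-in for the expansion E Lambda_n = r(1 + b/n + ...))."""
    correction = r / np.mean(lr_stats_null)
    return lr_observed * correction

# Toy usage with hypothetical numbers: r = 2 degrees of freedom, and null
# statistics whose mean is inflated by 5% relative to the chi-square limit.
rng = np.random.default_rng(0)
simulated_null = chi2.rvs(df=2, size=10_000, random_state=rng) * 1.05
print(bartlett_corrected(simulated_null, lr_observed=6.3, r=2))
```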
* 16.6
Bahadur Efficiency
The claim in Section 16.4 that in many situations "asymptotically optimal" tests do not exist refers to the study of efficiency relative to the local Gaussian approximations described
† For a further discussion, see [5], [9], and [83], and the references cited there.
in Chapter 7. The purpose of this section is to show that, under regularity conditions, the likelihood ratio test is asymptotically optimal in a different setting, the one of Bahadur efficiency. For simplicity we restrict ourselves to the testing of finite hypotheses. Given finite sets P₀ and P₁ of probability measures on a measurable space (𝒳, 𝒜) and a random sample X₁, …, X_n, we study the log likelihood ratio statistic

Λ_n = log [ sup_{Q∈P₁} ∏_{i=1}^n q(X_i) / sup_{P∈P₀} ∏_{i=1}^n p(X_i) ].

More general hypotheses can be treated, under regularity conditions, by finite approximation (see, e.g., Section 10 of [4]). The observed level of a test that rejects for large values of a statistic T_n is defined as

L_n = sup_{P∈P₀} P(T_n ≥ t) |_{t=T_n}.

The test that rejects the null hypothesis if L_n ≤ α has level α. The power of this test is maximal if L_n is "minimal" under the alternative (in a stochastic sense). The Bahadur slope under the alternative Q is defined as the limit in probability under Q (if it exists) of the sequence −(2/n) log L_n. If this is "large," then L_n is small and hence we prefer sequences of test statistics that have a large slope. The same conclusion is reached in Section 14.4 by considering the asymptotic relative Bahadur efficiencies. It is indicated there that the Neyman-Pearson tests for testing the simple null and alternative hypotheses P and Q have Bahadur slope −2Q log(p/q). Because these are the most powerful tests, this is the maximal slope for testing P versus Q. (We give a precise proof in the following theorem.) Consequently, the slope for a general null hypothesis cannot be bigger than inf_{P∈P₀} −2Q log(p/q). The sequence of likelihood ratio statistics attains equality, even if the alternative hypothesis is composite.
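For finite hypotheses the upper bound on the slope can be evaluated directly. The following sketch (the function name and example distributions are illustrative, not from the text) computes inf_{P∈P₀} −2Q log(p/q) = 2 inf_{P∈P₀} Q log(q/p) for probability vectors on a common finite sample space; by the theorem below, the likelihood ratio test attains this value.

```python
import numpy as np

def bahadur_slope_bound(null_pmfs, q):
    """Upper bound 2 * inf_{P in P0} Q log(q/p) on the Bahadur slope for
    testing a finite null set P0 against a fixed alternative pmf q."""
    q = np.asarray(q, dtype=float)
    slopes = []
    for p in null_pmfs:
        p = np.asarray(p, dtype=float)
        with np.errstate(divide="ignore", invalid="ignore"):
            # term is +inf when some p_x = 0 < q_x, and 0 when q_x = 0
            terms = np.where(q > 0, q * np.log(q / p), 0.0)
        slopes.append(2 * terms.sum())
    return min(slopes)

# Illustrative null set P0 = {p1, p2} and alternative q on three points.
P0 = [np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])]
q = np.array([0.2, 0.3, 0.5])
print(bahadur_slope_bound(P0, q))
```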
16.12 Theorem. The Bahadur slope of any sequence of test statistics for testing an arbitrary null hypothesis H₀: P ∈ P₀ versus a simple alternative H₁: P = Q is bounded above by inf_{P∈P₀} −2Q log(p/q), for any probability measure Q. If P₀ and P₁ are finite sets of probability measures, then the sequence of likelihood ratio statistics for testing H₀: P ∈ P₀ versus H₁: P ∈ P₁ attains equality for every Q ∈ P₁.
Proof. Because the observed level is a supremum over P₀, it suffices to prove the upper bound of the theorem for a simple null hypothesis P₀ = {P}. If −2Q log(p/q) = ∞, then there is nothing to prove. Thus, we can assume without loss of generality that Q is absolutely continuous with respect to P. Write Λ_n for log ∏_{i=1}^n (q/p)(X_i). Then, for any constants B > A > Q log(q/p),

Q(L_n < e^{−nB}, Λ_n < nA) = E_P 1{L_n < e^{−nB}, Λ_n < nA} e^{Λ_n} ≤ e^{nA} P_P(L_n < e^{−nB}).
Because L_n is superuniformly distributed under the null hypothesis, the last expression is bounded above by exp −n(B − A). Thus, the sum of the probabilities on the left side over n ∈ ℕ is finite, whence −(2/n) log L_n ≤ 2B or Λ_n ≥ nA for all sufficiently large n, almost surely under Q, by the Borel-Cantelli lemma. Because the sequence n^{-1}Λ_n
converges almost surely under Q to Q log(q/p) < A, by the strong law of large numbers, the second possibility can occur only finitely many times. It follows that −(2/n) log L_n ≤ 2B eventually, almost surely under Q. This having been established for any B > Q log(q/p), the proof of the first assertion is complete.
To prove that the likelihood ratio statistic attains equality, it suffices to prove that its slope is bigger than the upper bound. Write Λ_n for the log likelihood ratio statistic, and write sup_P and sup_Q for suprema over the null and alternative hypotheses. Because (1/n)Λ_n is bounded above by sup_Q ℙ_n log(q/p), we have, by Markov's inequality,

P_P((1/n)Λ_n ≥ t) ≤ ∑_{Q∈P₁} P_P(∏_{i=1}^n (q/p)(X_i) ≥ e^{nt}) ≤ |P₁| max_{Q∈P₁} e^{−nt} E_P ∏_{i=1}^n (q/p)(X_i).

The expectation on the right side is the nth power of the integral ∫(q/p) dP = Q(p > 0) ≤ 1. Take logarithms left and right and multiply with −(2/n) to find that

−(2/n) log P_P((1/n)Λ_n ≥ t) ≥ 2t − (2/n) log|P₁|.

Because this is valid uniformly in t and P, we can take the infimum over P on the left side; next evaluate the left and right sides at t = (1/n)Λ_n. By the law of large numbers, ℙ_n log(q/p) → Q log(q/p) almost surely under Q, and this remains valid if we first take the infimum over the (finite) set P₀ on both sides. Thus, the limit inferior of the sequence (1/n)Λ_n ≥ inf_P ℙ_n log(q/p) is bounded below by inf_P Q log(q/p) almost surely under Q, where we interpret Q log(q/p) as ∞ if Q(p = 0) > 0. Insert this lower bound in the preceding display to conclude that the Bahadur slope of the likelihood ratio statistics is bounded below by 2 inf_P Q log(q/p). ∎
Notes
The classical references on the asymptotic null distribution of the likelihood ratio statistic are papers by Chernoff [21] and Wilks [150]. Our main theorem appears to be better than Chernoff's, who uses the "classical regularity conditions" and a different notion of approximation of sets, but is not essentially different. Wilks' treatment would not be acceptable to present-day referees but maybe is not so different either. He appears to be saying that we can replace the original likelihood by the likelihood for having observed only the maximum likelihood estimator (the error is asymptotically negligible), next refers to work by Doob to infer that this is a Gaussian likelihood, and continues to compute the likelihood ratio statistic for a Gaussian likelihood, which is easy, as we have seen. The approach using a Taylor expansion and the asymptotic distributions of both likelihood estimators is one way to make the argument rigorous, but it seems to hide the original intuition. Bahadur [3] presented the efficiency of the likelihood ratio statistic at the fifth Berkeley symposium. Kallenberg [84] shows that the likelihood ratio statistic remains asymptotically optimal in the setting in which both the desired level and the alternative tend to zero, at least in exponential families. As the proof of Theorem 16.12 shows, the composite nature of the alternative hypothesis "disappears" elegantly by taking (1/n) log of the error probabilities; too elegantly, perhaps, to attach much value to this type of optimality?
PROBLEMS
1. Let (X₁, Y₁), …, (X_n, Y_n) be a sample from the bivariate normal distribution with mean vector (μ, ν) and covariance matrix the diagonal matrix with entries σ² and τ². Calculate (or characterize) the likelihood ratio statistic for testing H₀: μ = ν versus H₁: μ ≠ ν.
2. Let N be a kr-dimensional multinomial variable written as a (k × r) matrix (N_{ij}). Calculate the likelihood ratio statistic for testing the null hypothesis of independence H₀: p_{ij} = p_{i·} p_{·j} for every i and j. Here the dot denotes summation over all columns and rows, respectively. What is the limit distribution under the null hypothesis?
3. Calculate the likelihood ratio statistic for testing H₀: μ = ν based on independent samples of size n from multivariate normal distributions N_r(μ, Σ) and N_r(ν, Σ). The matrix Σ is unknown. What is the limit distribution under the null hypothesis?
4. Calculate the likelihood ratio statistic for testing H₀: μ₁ = ⋯ = μ_k based on k independent samples of size n from N(μ_i, σ²)-distributions. What is the asymptotic distribution under the null hypothesis?
5. Show that ((I_θ)⁻¹)_{>l,>l} is the inverse of the matrix I_{θ,>l,>l} − I_{θ,>l,≤l} I_{θ,≤l,≤l}⁻¹ I_{θ,≤l,>l}.
6. Study the asymptotic distribution of the sequence Λ_n if the true parameter is contained in both the null and alternative hypotheses.
7. Study the asymptotic distribution of the likelihood ratio statistics for testing the hypothesis H₀: σ = τ based on a sample of size n from the uniform distribution on [σ, τ]. Does the asymptotic distribution correspond to a likelihood ratio statistic in a limit experiment?
17 Chi-Square Tests
The chi-square statistic for testing hypotheses concerning multinomial distributions derives its name from the asymptotic approximation to its distribution. Two important applications are the testing of independence in a two-way classification and the testing of goodness-of-fit. In the second application the multinomial distribution is created artificially by grouping the data, and the asymptotic chi-square approximation may be lost if the original data are used to estimate nuisance parameters.
17.1
Quadratic Forms in Normal Vectors
The chi-square distribution with k degrees of freedom is (by definition) the distribution of ∑_{i=1}^k Z_i² for i.i.d. N(0, 1)-distributed variables Z₁, …, Z_k. The sum of squares is the squared norm ‖Z‖² of the standard normal vector Z = (Z₁, …, Z_k). The following lemma gives a characterization of the distribution of the norm of a general zero-mean normal vector.
17.1 Lemma. If the vector X is N_k(0, Σ)-distributed, then ‖X‖² is distributed as ∑_{i=1}^k λ_i Z_i² for i.i.d. N(0, 1)-distributed variables Z₁, …, Z_k and λ₁, …, λ_k the eigenvalues of Σ.
Proof. There exists an orthogonal matrix O such that OΣOᵀ = diag(λ_i). Then the vector OX is N_k(0, diag(λ_i))-distributed, which is the same as the distribution of the vector (√λ₁ Z₁, …, √λ_k Z_k). Now ‖X‖² = ‖OX‖² has the same distribution as ∑(√λ_i Z_i)². ∎
The distribution of a quadratic form of the type ∑_{i=1}^k λ_i Z_i² is complicated in general. However, in the case that every λ_i is either 0 or 1, it reduces to a chi-square distribution. If this is not naturally the case in an application, then a statistic is often transformed to achieve this desirable situation. The definition of the Pearson statistic illustrates this.
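Lemma 17.1 is easy to check by simulation. The sketch below compares the two representations for an arbitrary covariance matrix; the particular Σ and the sample sizes are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical covariance matrix Sigma (k = 3).
A = rng.standard_normal((3, 3))
Sigma = A @ A.T
eigvals = np.linalg.eigvalsh(Sigma)

# ||X||^2 for X ~ N_k(0, Sigma) versus sum_i lambda_i Z_i^2 (Lemma 17.1).
X = rng.multivariate_normal(np.zeros(3), Sigma, size=100_000)
Z = rng.standard_normal((100_000, 3))
lhs = (X ** 2).sum(axis=1)
rhs = (eigvals * Z ** 2).sum(axis=1)
print(lhs.mean(), rhs.mean())                       # both close to trace(Sigma)
print(np.quantile(lhs, [0.5, 0.9]), np.quantile(rhs, [0.5, 0.9]))
```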
17.2 Pearson Statistic
Suppose that we observe a vector X_n = (X_{n,1}, …, X_{n,k}) with the multinomial distribution corresponding to n trials and k classes having probabilities p = (p₁, …, p_k). The Pearson
statistic for testing the null hypothesis H₀: p = a is given by

C_n(a) = ∑_{i=1}^k (X_{n,i} − n a_i)² / (n a_i).
We shall show that the sequence C_n(a) converges in distribution to a chi-square distribution if the null hypothesis is true. The practical relevance is that we can use the chi-square table to find critical values for the test. The proof shows why Pearson divided the squares by n a_i and did not propose the simpler statistic ‖X_n − n a‖².

17.2 Theorem. If the vectors X_n are multinomially distributed with parameters n and a = (a₁, …, a_k) > 0, then the sequence C_n(a) converges under a in distribution to the χ²_{k−1}-distribution.
Proof. The vector X_n can be thought of as the sum of n independent multinomial vectors Y₁, …, Y_n with parameters 1 and a = (a₁, …, a_k). Then E Y₁ = a and Cov Y₁ = diag(a) − a aᵀ.
By the multivariate central limit theorem, the sequence n^{-1/2}(X_n − n a) converges in distribution to the N_k(0, Cov Y₁)-distribution. Consequently, with √a the vector with coordinates √a_i,

( (X_{n,1} − n a₁)/√(n a₁), …, (X_{n,k} − n a_k)/√(n a_k) ) ⇝ N(0, I − √a √aᵀ).
Because ∑ a_i = 1, the matrix I − √a √aᵀ has eigenvalue 0, of multiplicity 1 (with eigenspace spanned by √a), and eigenvalue 1, of multiplicity k − 1 (with eigenspace equal to the orthocomplement of √a). An application of the continuous-mapping theorem and next Lemma 17.1 concludes the proof. ∎
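A small simulation makes Theorem 17.2 concrete. In the sketch below (function name and null probabilities are illustrative) the simulated rejection frequency of the test based on the χ²_{k−1} critical value is close to the nominal level when the null hypothesis holds.

```python
import numpy as np
from scipy.stats import chi2

def pearson_statistic(x, a):
    """C_n(a) = sum_i (X_{n,i} - n a_i)^2 / (n a_i) for a multinomial count vector x."""
    x = np.asarray(x, dtype=float)
    n = x.sum()
    return ((x - n * a) ** 2 / (n * a)).sum()

rng = np.random.default_rng(2)
a = np.array([0.2, 0.3, 0.5])          # illustrative null cell probabilities
n, k = 200, len(a)

stats = np.array([pearson_statistic(x, a) for x in rng.multinomial(n, a, size=5000)])
crit = chi2.ppf(0.95, df=k - 1)
print("simulated level:", (stats > crit).mean())   # close to 0.05 under the null
```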
The number of degrees of freedom in the chi-square approximation for Pearson's statistic is the number of cells of the multinomial vector that have positive probability, minus one. However, the quality of the approximation also depends on the size of the cell probabilities a_i. For instance, if 1001 cells have null probabilities 10⁻²³, …, 10⁻²³, 1 − 10⁻²⁰, then it is clear that for moderate values of n all cells except one are empty, and a huge value of n is necessary to make a χ²₁₀₀₀-approximation work. As a rule of thumb, it is often advised to choose the partitioning sets such that each number n a_i is at least 5. This criterion depends on the (possibly unknown) null distribution and is not the same as saying that the number of observations in each cell must satisfy an absolute lower bound, which could be very unlikely if the null hypothesis is false. The rule of thumb is meant to protect the level.
The Pearson statistic is oddly asymmetric in the observed and the true frequencies (which is motivated by the form of the asymptotic covariance matrix). One method to symmetrize
the statistic leads to the Hellinger statistic

H_n(a) = 4n ∑_{i=1}^k (√(X_{n,i}/n) − √a_i)².

Up to a multiplicative constant this is the Hellinger distance between the discrete probability distributions on {1, …, k} with probability vectors a and X_n/n, respectively. Because X_n/n − a → 0 in probability, the Hellinger statistic is asymptotically equivalent to the Pearson statistic.
17.3
Estimated Parameters
Chi-square tests are used quite often, but usually to test more complicated hypotheses. If the null hypothesis of interest is composite, then the parameter a is unknown and cannot be used in the definition of a test statistic. A natural extension is to replace the parameter by an estimate â_n and use the statistic

C_n(â_n) = ∑_{i=1}^k (X_{n,i} − n â_{n,i})² / (n â_{n,i}).
The estimator â_n is constructed to be a good estimator if the null hypothesis is true. The asymptotic distribution of this modified Pearson statistic is not necessarily chi-square but depends on the estimators â_n being used. Most often the estimators are asymptotically normal, and the statistics

(X_{n,i} − n â_{n,i})/√(n a_i) = (X_{n,i} − n a_i)/√(n a_i) − √n(â_{n,i} − a_i)/√a_i

are asymptotically normal as well. Then the modified chi-square statistic is asymptotically distributed as a quadratic form in a multivariate-normal vector. In general, the eigenvalues determining this form are not restricted to 0 or 1, and their values may depend on the unknown parameter. Then the critical value cannot be taken from a table of the chi-square distribution. There are two popular possibilities to avoid this problem. First, the Pearson statistic is a certain quadratic form in the observations that is motivated by the asymptotic covariance matrix of a multinomial vector. If the parameter a is estimated, the asymptotic covariance matrix changes in form, and it is natural to change the quadratic form in such a way that the resulting statistic is again chi-square distributed. This idea leads to the Rao-Robson-Nikulin modification of the Pearson statistic, of which we discuss an example in section 17.5. Second, we can retain the form of the Pearson statistic but use special estimators â_n. In particular, the maximum likelihood estimator based on the multinomial vector X_n, or the minimum-chi-square estimator ã_n defined by, with P₀ being the null hypothesis,

C_n(ã_n) = inf_{p∈P₀} ∑_{i=1}^k (X_{n,i} − n p_i)² / (n p_i).
The right side of this display is the "minimum-chi-square distance" of the observed frequencies to the null hypothesis and is an intuitively reasonable test statistic. The null hypothesis
is rejected if the distance of the observed frequency vector X_n/n to the set P₀ is large. A disadvantage is greater computational complexity.
These two modifications, using the minimum-chi-square estimator or the maximum likelihood estimator based on X_n, may seem natural but are artificial in some applications. For instance, in goodness-of-fit testing, the multinomial vector is formed by grouping the "raw data," and it is more natural to base the estimators on the raw data rather than on the grouped data. On the other hand, using the maximum likelihood or minimum-chi-square estimator based on X_n has the advantage of a remarkably simple limit theory: If the null hypothesis is "locally linear," then the modified Pearson statistic is again asymptotically chi-square distributed, but with the number of degrees of freedom reduced by the (local) dimension of the estimated parameter. This interesting asymptotic result is most easily explained in terms of the minimum-chi-square statistic, as the loss of degrees of freedom corresponds to a projection (i.e., a minimum distance) of the limiting normal vector. We shall first show that the two types of modifications are asymptotically equivalent and are asymptotically equivalent to the likelihood ratio statistic as well. The likelihood ratio statistic for testing the null hypothesis H₀: p ∈ P₀ is given by (see Example 16.1) L_n(â_n), where

L_n(p) = 2 ∑_{i=1}^k X_{n,i} log (X_{n,i} / (n p_i)).

17.3 Lemma. Let P₀ be a closed subset of the unit simplex, and let â_n be the maximum likelihood estimator of a under the null hypothesis H₀: a ∈ P₀ (based on X_n). Then

inf_{p∈P₀} ∑_{i=1}^k (X_{n,i} − n p_i)² / (n p_i) = C_n(â_n) + o_P(1) = L_n(â_n) + o_P(1).
Proof. Let ã_n be the minimum-chi-square estimator of a under the null hypothesis. Both sequences of estimators ã_n and â_n are √n-consistent. For the maximum likelihood estimator this follows from Corollary 5.53. The minimum-chi-square estimator satisfies, by its definition,

∑_{i=1}^k (X_{n,i} − n ã_{n,i})² / (n ã_{n,i}) ≤ ∑_{i=1}^k (X_{n,i} − n a_i)² / (n a_i) = O_P(1).

This implies that each term in the sum on the left is O_P(1), whence n|ã_{n,i} − a_i|² = O_P(ã_{n,i}) + O_P(|X_{n,i} − n a_i|²/n) and hence the √n-consistency. Next, the two-term Taylor expansion log(1 + x) = x − ½x² + o(x²) combined with Lemma 2.12 yields, for any √n-consistent estimator sequence p̂_n,

2 ∑_{i=1}^k X_{n,i} log (X_{n,i} / (n p̂_{n,i})) = −2 ∑_{i=1}^k X_{n,i} log (1 + (n p̂_{n,i}/X_{n,i} − 1))
  = −2 ∑_{i=1}^k X_{n,i} (n p̂_{n,i}/X_{n,i} − 1) + ∑_{i=1}^k X_{n,i} (n p̂_{n,i}/X_{n,i} − 1)² + o_P(1)
  = 0 + ∑_{i=1}^k (X_{n,i} − n p̂_{n,i})² / X_{n,i} + o_P(1).
In the last expression we can also replace X_{n,i} in the denominator by n p̂_{n,i}, so that we find the relation L_n(p̂_n) = C_n(p̂_n) + o_P(1) between the likelihood ratio and the Pearson statistic, for
every √n-consistent estimator sequence p̂_n. By the definitions of ã_n and â_n, we conclude that, up to o_P(1)-terms, C_n(ã_n) ≤ C_n(â_n) = L_n(â_n) ≤ L_n(ã_n) = C_n(ã_n). The lemma follows. ∎
The asymptotic behavior of likelihood ratio statistics is discussed in general in Chapter 16. In view of the preceding lemma, we can now refer to this chapter to obtain the asymptotic distribution of the chi-square statistics. Alternatively, a direct study of the minimum-chi-square statistic gives additional insight (and a more elementary proof). As in Chapter 16, say that a sequence of sets H_n converges to a set H if H is the set of all limits lim h_n of converging sequences h_n with h_n ∈ H_n for every n and, moreover, the limit h = lim_i h_{n_i} of every converging subsequence h_{n_i} with h_{n_i} ∈ H_{n_i} for every i is contained in H.
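Before turning to the general theorem, the equivalence of Lemma 17.3 (and the degrees-of-freedom reduction formalized below) can be checked numerically. The sketch fits a one-parameter null family inside a trinomial model; the Hardy-Weinberg-type curve, the function names, and the sample size are illustrative choices, not taken from the text.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_theta(theta):
    # One-parameter null curve in the unit simplex (l = 1, k = 3 cells).
    return np.array([(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2])

def pearson(x, p):
    n = x.sum()
    return ((x - n * p) ** 2 / (n * p)).sum()

def lr(x, p):
    # Assumes all cells are nonempty (very likely for this n and theta).
    n = x.sum()
    return 2 * (x * np.log(x / (n * p))).sum()

rng = np.random.default_rng(3)
x = rng.multinomial(500, p_theta(0.3))

# Maximum likelihood estimator of theta from the multinomial counts.
theta_ml = (x[1] + 2 * x[2]) / (2 * x.sum())
# Minimum-chi-square estimator, found numerically.
theta_mc = minimize_scalar(lambda t: pearson(x, p_theta(t)),
                           bounds=(1e-6, 1 - 1e-6), method="bounded").x

# The three statistics agree up to o_P(1); their common limit is chi-square
# with k - 1 - l = 3 - 1 - 1 = 1 degree of freedom.
print(pearson(x, p_theta(theta_mc)), pearson(x, p_theta(theta_ml)), lr(x, p_theta(theta_ml)))
```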
17.4 Theorem. Let P₀ be a subset of the unit simplex such that the sequence of sets √n(P₀ − a) converges to a set H (in ℝ^k), and suppose that a > 0. Then, under a,

inf_{p∈P₀} ∑_{i=1}^k (X_{n,i} − n p_i)² / (n p_i) ⇝ inf_{h∈H} ‖X − (1/√a) h‖²,

for a vector X with the N(0, I − √a √aᵀ)-distribution. Here (1/√a)H is the set of vectors (h₁/√a₁, …, h_k/√a_k) as h ranges over H.

17.5 Corollary. Let P₀ be a subset of the unit simplex such that the sequence of sets √n(P₀ − a) converges to a linear subspace of dimension l (of ℝ^k), and let a > 0. Then both the sequence of minimum-chi-square statistics and the sequence of modified Pearson statistics C_n(â_n) converge in distribution to the chi-square distribution with k − 1 − l degrees of freedom.
Proof. Because the minimum-chi-square estimator ã_n (relative to P₀) is √n-consistent, the asymptotic distribution of the minimum-chi-square statistic is not changed if we replace n ã_{n,i} in its denominator by the true value n a_i. Next, we decompose, for any p ∈ P₀,

(X_{n,i} − n p_i)/√(n a_i) = (X_{n,i} − n a_i)/√(n a_i) − √n(p_i − a_i)/√a_i.
The first vector on the right converges in distribution to X. The (modified) minimum-chi-square statistics are the distances of these vectors to the sets H_n = √n(P₀ − a)/√a, which converge to the set H/√a. The theorem now follows from Lemma 7.13.
The vector X is distributed as Z − Π_a Z for Π_a the projection onto the linear space spanned by the vector √a and Z a k-dimensional standard normal vector. Because every element of H is the limit of a multiple of differences of probability vectors, 1ᵀh = 0 for every h ∈ H. Therefore, the space (1/√a)H is orthogonal to the vector √a, and Π Π_a = 0 for Π the projection onto the space (1/√a)H. The distance of X to the space (1/√a)H is equal to the norm of X − ΠX, which is distributed as the norm of Z − Π_a Z − ΠZ. The latter projection is multivariate normally distributed with mean zero and covariance matrix the projection matrix I − Π_a − Π with k − l − 1 eigenvalues 1. The corollary follows from Lemma 17.1 or 16.6. ∎
17.6 Example (Parametric model). If the null hypothesis is a parametric family P₀ = {p_θ : θ ∈ Θ} indexed by a subset Θ of ℝ^l with l ≤ k and the maps θ ↦ p_θ from Θ into the unit simplex are differentiable and of full rank, then √n(P₀ − p_θ) → ṗ_θ(ℝ^l) for every θ ∈ Θ (see Example 16.11). Then the chi-square statistics C_n(p_θ̂) are asymptotically χ²_{k−l−1}-distributed. This situation is common in testing the goodness-of-fit of parametric families, as discussed in section 17.5 and Example 16.1. □
17.4
Testing Independence
Suppose that each element of a population can be classified according to two characteristics, having k and r levels, respectively. The full information concerning the classification can be given by a (k × r) table of the form given in Table 17.1.
Often the full information is not available, but we do know the classification X_{n,ij} for a random sample of size n from the population. The matrix X_{n,ij}, which can also be written in the form of a (k × r) table, is multinomially distributed with parameters n and probabilities p_{ij} = N_{ij}/N. The null hypothesis of independence asserts that the two categories are independent: H₀: p_{ij} = a_i b_j for (unknown) probability vectors a_i and b_j.
The maximum likelihood estimators for the parameters a and b (under the null hypothesis) are â_i = X_{n,i·}/n and b̂_j = X_{n,·j}/n. With these estimators the modified Pearson statistic takes the form

C_n(â_n ⊗ b̂_n) = ∑_{i=1}^k ∑_{j=1}^r (X_{n,ij} − n â_i b̂_j)² / (n â_i b̂_j).
The null hypothesis is a (k + r − 2)-dimensional submanifold of the unit simplex in ℝ^{kr}. In a shrinking neighborhood of a parameter in its interior this manifold looks like its tangent space, a linear space of dimension k + r − 2. Thus, the sequence C_n(â_n ⊗ b̂_n) is asymptotically chi-square distributed with kr − 1 − (k + r − 2) = (k − 1)(r − 1) degrees of freedom.
Table 17.1. Classification of a population of N elements according to two categories, N_{ij} elements having value i on the first category and value j on the second. The borders give the sums over each row and column, respectively.

  N_11   N_12   ...   N_1r   |  N_1.
  N_21   N_22   ...   N_2r   |  N_2.
   ...    ...   ...    ...   |   ...
  N_k1   N_k2   ...   N_kr   |  N_k.
  -----------------------------------
  N_.1   N_.2   ...   N_.r   |  N
17.7 Corollary. If the (k × r) matrices X_n are multinomially distributed with parameters n and p_{ij} = a_i b_j > 0, then the sequence C_n(â_n ⊗ b̂_n) converges in distribution to the χ²_{(k−1)(r−1)}-distribution.

Proof. The map (a₁, …, a_{k−1}, b₁, …, b_{r−1}) ↦ a × b from ℝ^{k+r−2} into ℝ^{kr} is continuously differentiable and of full rank. The true values (a₁, …, a_{k−1}, b₁, …, b_{r−1}) are interior to the domain of this map. Thus the sequence of sets √n(P₀ − a × b) converges to a (k + r − 2)-dimensional linear subspace of ℝ^{kr}. ∎
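A direct computation of the independence statistic of this section is straightforward; the sketch below (illustrative function name and a made-up table) uses the expected counts n â_i b̂_j and compares the statistic with the χ²_{(k−1)(r−1)} distribution.

```python
import numpy as np
from scipy.stats import chi2

def independence_chi2(table):
    """Modified Pearson statistic for independence in a k x r table of counts,
    with maximum likelihood estimates a_i = X_{i.}/n and b_j = X_{.j}/n."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    stat = ((table - expected) ** 2 / expected).sum()
    k, r = table.shape
    df = (k - 1) * (r - 1)
    return stat, df, chi2.sf(stat, df)

# Illustrative 3 x 4 table of counts.
counts = np.array([[20, 15, 25, 10],
                   [30, 25, 20, 15],
                   [10, 20, 15, 25]])
print(independence_chi2(counts))
```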
* 17.5
Goodness-of-Fit Tests
Chi-square tests are often applied to test goodness-of-fit. Given a random sample X₁, …, X_n from a distribution P, we wish to test the null hypothesis H₀: P ∈ P₀ that P belongs to a given class P₀ of probability measures. There are many possible test statistics for this problem, and a particular statistic might be selected to attain high power against certain alternatives. Testing goodness-of-fit typically focuses on no particular alternative. Then chi-square statistics are intuitively reasonable.
The data can be reduced to a multinomial vector by "grouping." We choose a partition 𝒳 = ∪_j 𝒳_j of the sample space into finitely many sets and base the test only on the observed numbers of observations falling into each of the sets 𝒳_j. For ease of notation, we express these numbers in the empirical measure of the data. For a given set A we denote by ℙ_n(A) = n^{-1} #(1 ≤ i ≤ n : X_i ∈ A) the fraction of observations that fall into A. Then the vector n(ℙ_n(𝒳₁), …, ℙ_n(𝒳_k)) possesses a multinomial distribution, and the corresponding modified chi-square statistic is given by

∑_{j=1}^k (n ℙ_n(𝒳_j) − n P̂(𝒳_j))² / (n P̂(𝒳_j)).
Here P̂(𝒳_j) is an estimate of P(𝒳_j) under the null hypothesis and can take a variety of forms. Theorem 17.4 applies but is restricted to the case that the estimates P̂(𝒳_j) are based on the frequencies n(ℙ_n(𝒳₁), …, ℙ_n(𝒳_k)) only. In the present situation it is more natural to base the estimates on the original observations X₁, …, X_n. Usually, this results in a non-chi-square limit distribution. For instance, Table 17.2 shows the "errors" in the level of a chi-square test for testing normality, if the unknown mean and variance are estimated by the sample mean and the sample variance but the critical value is chosen from the chi-square distribution. The size of the errors depends on the number of cells, the errors being small if there are many cells and few estimated parameters.

17.8 Example (Parametric model). Consider testing the null hypothesis that the true distribution belongs to a regular parametric model {P_θ : θ ∈ Θ}. It appears natural to estimate the unknown parameter θ by an estimator θ̂_n that is asymptotically efficient under the null hypothesis and is based on the original sample X₁, …, X_n, for instance the maximum likelihood estimator. If 𝔾_n = √n(ℙ_n − P_θ) denotes the empirical process, then efficiency entails the approximation √n(θ̂_n − θ) = I_θ^{-1} 𝔾_n ℓ̇_θ + o_P(1). Applying the delta method to
Table 17.2. True levels of the chi-square test for normality using χ²_{k−3,α}-quantiles as critical values but estimating unknown mean and variance by sample mean and sample variance. Chi-square statistic based on partitions of [−10, 10] into k = 5, 10, or 20 equiprobable cells under the standard normal law.

           α = 0.20    α = 0.10    α = 0.05    α = 0.01
  k = 5      0.30        0.15        0.08        0.02
  k = 10     0.22        0.11        0.06        0.01
  k = 20     0.21        0.10        0.05        0.01
Note: Values based on 2000 simulations of standard normal samples of size 100.
the variables √n(P_θ̂_n(𝒳_j) − P_θ(𝒳_j)) and using Slutsky's lemma, we find

√n(ℙ_n(𝒳_j) − P_θ̂_n(𝒳_j)) / √(P_θ(𝒳_j)) = 𝔾_n 1_{𝒳_j} / √(P_θ(𝒳_j)) − Ṗ_θ(𝒳_j) I_θ^{-1} 𝔾_n ℓ̇_θ / √(P_θ(𝒳_j)) + o_P(1).

(The map θ ↦ P_θ(A) has derivative Ṗ_θ(A) = P_θ 1_A ℓ̇_θ.) The sequence of vectors (𝔾_n 1_𝒳, 𝔾_n ℓ̇_θ) converges in distribution to a multivariate-normal distribution. Some matrix manipulations show that the vectors in the preceding display are asymptotically distributed as a Gaussian vector X with mean zero and covariance matrix

I − √a_θ √a_θᵀ − C_θᵀ I_θ^{-1} C_θ,    where a_{θ,j} = P_θ(𝒳_j) and (C_θ)_{ij} = P_θ 1_{𝒳_j} ℓ̇_{θ,i} / √(P_θ(𝒳_j)).
In general, the covariance matrix of X is not a projection matrix, and the variable ‖X‖² does not possess a chi-square distribution. Because P_θ ℓ̇_θ = 0, we have that C_θ √a_θ = 0 and hence the covariance matrix of X can be rewritten as the product (I − √a_θ √a_θᵀ)(I − C_θᵀ I_θ^{-1} C_θ). Here the first matrix is the projection onto the orthocomplement of the vector √a_θ and the second matrix is a positive-definite transformation that leaves √a_θ invariant, thus acting only on the orthocomplement √a_θ^⊥. This geometric picture shows that Cov_θ X has the same system of eigenvectors as the matrix I − C_θᵀ I_θ^{-1} C_θ, and also the same eigenvalues, except for the eigenvalue corresponding to the eigenvector √a_θ, which is 0 for Cov_θ X and 1 for I − C_θᵀ I_θ^{-1} C_θ. Because both matrices C_θᵀ I_θ^{-1} C_θ and I − C_θᵀ I_θ^{-1} C_θ are nonnegative-definite, the eigenvalues are contained in [0, 1]. One eigenvalue (corresponding to eigenvector √a_θ) is 0, dim N(C_θ) − 1 eigenvalues (corresponding to eigenspace N(C_θ) ∩ √a_θ^⊥) are 1, but the other eigenvalues may be contained in (0, 1) and then typically depend on θ. By Lemma 17.1, the variable ‖X‖² is distributed as

∑_{i=1}^{dim N(C_θ) − 1} Z_i² + ∑_{i = dim N(C_θ)}^{k−1} λ_i Z_i².

This means that it is stochastically "between" the chi-square distributions with dim N(C_θ) − 1 and k − 1 degrees of freedom. The inconvenience that this distribution is not standard and depends on θ can be remedied by not using efficient estimators θ̂_n or, alternatively, by not using the Pearson statistic.
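The effect summarized in Table 17.2 is easy to reproduce by simulation. The sketch below (illustrative function name; assumes NumPy and SciPy) estimates mean and variance from the raw data, groups the observations into k equiprobable cells under the fitted normal, and uses the χ²_{k−3} critical value, which is not exactly valid here.

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(4)

def chi2_test_normality(x, k=10, alpha=0.05):
    """Pearson statistic for normality with k equiprobable cells, parameters
    estimated by sample mean and variance, and (incorrectly, in general) the
    chi-square quantile with k - 3 degrees of freedom as critical value."""
    mu, sigma = x.mean(), x.std(ddof=1)
    cuts = norm.ppf(np.arange(1, k) / k, loc=mu, scale=sigma)  # k-1 interior cutpoints
    observed = np.bincount(np.searchsorted(cuts, x), minlength=k)
    expected = len(x) / k
    stat = ((observed - expected) ** 2 / expected).sum()
    return stat > chi2.ppf(1 - alpha, df=k - 3)

# Estimate the true level under the standard normal, as in Table 17.2.
rejections = [chi2_test_normality(rng.standard_normal(100)) for _ in range(2000)]
print(np.mean(rejections))   # typically exceeds the nominal 0.05, more so for small k
```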
The square root of the matrix I − C_θᵀ I_θ^{-1} C_θ is the positive-definite matrix with the same eigenvectors, but with the square roots of the eigenvalues. Thus, it also leaves the vector √a_θ invariant and acts only on the orthocomplement √a_θ^⊥. It follows that this square root commutes with the matrix I − √a_θ √a_θᵀ and hence the vectors (I − C_θᵀ I_θ^{-1} C_θ)^{-1/2} times the vectors in the earlier display converge in distribution to a vector with the N(0, I − √a_θ √a_θᵀ)-distribution. (We assume that the matrix I − C_θᵀ I_θ^{-1} C_θ is nonsingular, which is typically the case; see Problem 17.6.) By the continuous-mapping theorem, the squared norm of the left side is asymptotically chi-square distributed with k − 1 degrees of freedom. This squared norm is the Rao-Robson-Nikulin statistic. □
It is tempting to choose the partitioning sets 𝒳_j dependent on the observed data X₁, …, X_n, for instance to ensure that all cells have positive probability under the null hypothesis. This is permissible under some conditions: The choice of a "random partition" typically does not change the distributional properties of the chi-square statistic. Consider partitioning sets 𝒳̂_j = 𝒳_j(X₁, …, X_n) that possibly depend on the data, and a further modified Pearson statistic of the type

∑_{j=1}^k (n ℙ_n(𝒳̂_j) − n P̂(𝒳̂_j))² / (n P̂(𝒳̂_j)).
If the random partitions settle down to a fixed partition eventually, then this statistic is asymptotically equivalent to the statistic for which the partition had been set equal to the limit partition in advance. We discuss this for the case that the null hypothesis is a model { Pe : e E 8} indexed by a subset 8 of a normed space. We use the language of Donsker classes as discussed in Chapter 19.
17.9 Theorem. Suppose that the sets 𝒳̂_j belong to a P_{θ₀}-Donsker class 𝒞 of sets and that P_{θ₀}(𝒳̂_j △ 𝒳_j) → 0 in probability under P_{θ₀}, for given nonrandom sets 𝒳_j such that P_{θ₀}(𝒳_j) > 0. Furthermore, assume that √n‖θ̂ − θ₀‖ = O_P(1), and suppose that the map θ ↦ P_θ from Θ into ℓ^∞(𝒞) is differentiable at θ₀ with derivative Ṗ_{θ₀} such that Ṗ_{θ₀}(𝒳̂_j) − Ṗ_{θ₀}(𝒳_j) → 0 in probability for every j. Then

∑_{j=1}^k (n ℙ_n(𝒳̂_j) − n P_θ̂(𝒳̂_j))² / (n P_θ̂(𝒳̂_j)) − ∑_{j=1}^k (n ℙ_n(𝒳_j) − n P_θ̂(𝒳_j))² / (n P_θ̂(𝒳_j)) → 0 in probability.
Proof. Let 𝔾_n = √n(ℙ_n − P_{θ₀}) be the empirical process and define ℍ_n = √n(P_θ̂ − P_{θ₀}). Then √n(ℙ_n(𝒳̂_j) − P_θ̂(𝒳̂_j)) = (𝔾_n − ℍ_n)(𝒳̂_j), and similarly with 𝒳_j replacing 𝒳̂_j. The condition that the sets 𝒳̂_j belong to a Donsker class, combined with the continuity condition P_{θ₀}(𝒳̂_j △ 𝒳_j) → 0 in probability, implies that 𝔾_n(𝒳̂_j) − 𝔾_n(𝒳_j) → 0 in probability (see Lemma 19.24). The differentiability of the map θ ↦ P_θ implies that
=
sup_{C∈𝒞} |P_θ(C) − P_{θ₀}(C) − Ṗ_{θ₀}(C)(θ − θ₀)| = o(‖θ − θ₀‖).

Together with the continuity Ṗ_{θ₀}(𝒳̂_j) − Ṗ_{θ₀}(𝒳_j) → 0 in probability and the √n-consistency of θ̂, this
p
shows that ℍ_n(𝒳̂_j) − ℍ_n(𝒳_j) → 0 in probability. In particular, because P_{θ₀}(𝒳̂_j) → P_{θ₀}(𝒳_j) in probability, both P_θ̂(𝒳̂_j) and P_θ̂(𝒳_j) converge in probability to P_{θ₀}(𝒳_j) > 0. The theorem follows. ∎
A
A
The conditions on the random partitions that are imposed in the preceding theorem are mild. An interesing choice is a partition in sets X1 (B ) such that P8 (X1 (e) ) a 1 is independent of e . The corresponding modified Pearson statistic is known as the Watson-Roy statistic and takes the form =
Here the null probabilities have been reduced to fixed values again, but the cell frequencies are "doubly random." If the model is smooth and the parameter and the sets X1 (e) are not too wild, then this statistic has the same null limit distribution as the modified Pearson statistic with a fixed partition. Consider testing a null hypothesis that the true under lying measure of the observations belongs to a location-scale family Fo ( ( · f.L) I a) : f.L E lR, a > 0}, given a fixed distribution F0 on R It is reasonable to choose a partition in sets and estimators X1 {1, + a ( c1 _ 1 , c1], for a fixed partition c0 < c 1 < · · · < ck {1, and a of the location and scale parameter. The partition could, for instance, be chosen equal to c1 F0- 1 (j I k), although, in general, the partition should depend on the type of deviation from the null hypothesis that one wants to detect. If we use the same location and scale estimators to "estimate" the null probabilities ( F0 ( X1 J.L) I a) of the random cells X1 {1, + a ( c 1 _ 1 , c1], then the estimators cancel, and we find the fixed null probabilities F0 ( c1 ) F0 ( c1 _ 1 ). D 17.10
Example (Location-scale).
{
=
-
-
= oo
oo =
=
=
-
-
* 1 7.6
Asymptotic Efficiency
The asymptotic null distributions of various versions of the Pearson statistic enable us to set critical values but by themselves do not give information on the asymptotic power of the tests. Are these tests, which appear to be mostly motivated by their asymptotic null distribution, sufficiently powerful? The asymptotic power can be measured in various ways. Probably the most important method is to consider local limiting power functions, as in Chapter 14. For the likelihood ratio test these are obtained in Chapter 16. Because, in the local experiments, chi-square statistics are asymptotically equivalent to the likelihood ratio statistics (see Theorem 17.4) , the results obtained there also apply to the present problem, and we shall not repeat the discussion. A second method to evaluate the asymptotic power is by Bahadur efficiencies. For this nonlocal criterion, chi-square tests and likelihood ratio tests are not equivalent, the second being better and, in fact, optimal (see Theorem 16. 12). We shall compute the slopes of the Pearson and likelihood ratio tests for testing the simple hypothesis H0 : p a . A multinomial vector Xn with parameters n and p (p 1 , . . . , Pk ) can be thought of as n times the empirical measure JPln of a random sample of size n from the distribution P on the set { 1 , . . . , k } defined by P {i } Pi . Thus we can view both the =
=
=
252
Chi-Square Tests
Pearson and the likelihood ratio statistics as functions of an empirical measure and next can apply Sanov's theorem to compute the desired limits of large deviations probabilities. Define maps C and K by
k a Pi K(p, a) = - P log - = L Pi log - . ai p i= l Then the Pearson and likelihood ratio statistics are equivalent to C (JPln , a) and K (JPl a), respectively. Under the assumption that a > 0, both maps are continuous in p on the k-dimensional unit simplex. Furthermore, for t in the interior of the ranges of C and K, the sets B1 = p : C (p , a) 2:: t } and B1 = p : K (p , a) 2:: t } are equal to the closures of their interiors. Two applications of Sanov's theorem yield 1 - log Pa ( C (JPln , a) 2:: t ) --+ - inf K (p, a), n,
{
{
n
pe�
1 - log Pa ( K (JPln , a)
n
2::
t ) --+ - inf K (p, a) = - t. pe�
We take the function e(t) of ( 14.20) equal to minus two times the right sides. Because Pn {i } --+ Pi by the law of large numbers, whence C (JPln , a) � C(P, a) and K (JPln , a) � K ( P , a), the Bahadur slopes of the Pearson and likelihood ratio tests at the alternative H1 : p = q are given by 2
inf
p:C(p, a)�C(q ,a)
K (p , a)
and 2K(q , a) .
It is clear from these expressions that the likelihood ratio test has a bigger slope. This is in agreement with the fact that the likelihood ratio test is asymptotically Bahadur optimal in any smooth parametric model. Figure 17. 1 shows the difference of the slopes in one particular case. The difference is small in a neighborhood of the null hypothesis a, in agreement with the fact that the Pitman efficiency is equal to 1 , but can be substantial for alternatives away from a . Notes
Pearson introduced his statistic in 1900 in [ 1 12] The modification with estimated para meters, using the multinomial frequencies, was considered by Fisher [49], who corrected the mistaken belief that estimating the parameters does not change the limit distribution. Chernoff and Lehmann [22] showed that using maximum likelihood estimators based on the original data for the parameter in a goodness-of-fit statistic destroys the asymptotic chi-square distribution. They note that the errors in the level are small in the case of testing a Poisson distribution and somewhat larger when testing normality.
253
Problems
N
0
a .t.
a .'O
Figure 17. 1. The difference of the B ahadur slopes of the likelihood ratio and Pearson tests for testing Ho : p = ( 1 / 3 , 1 / 3 , 1 / 3 ) based on a multinomial vector with parameters n and p = ( P I , p 2 , p3 ) , as a function of ( PI , P2 ) .
The choice of the partition in chi-square goodness-of-fit tests is an important issue that we have not discussed. Several authors have studied the optimal number of cells in the partition. This number depends, of course, on the alternative for which one desires large power. The conclusions of these studies are not easily summarized. For alternatives p such that the likelihood ratio p j Peo with respect to the null distribution is "wild," the number of cells k should tend to infinity with n . Then the chi-square approximation of the null distribution needs to be modified. Normal approximations are used, because a chi-square distribution with a large number of degrees of freedom is approximately a normal distribution. See [40], [60], and [86] for results and further references.
PROBLEMS
1. Let N = (Nij ) be a multinomial matrix with success probabilities Pi} . Design a test statistic for the null hypothesis of symmetry Ho : Pii = P) i and derive its asymptotic null distribution.
2. Derive the limit distribution of the chi-square goodness-of-fit statistic for testing normality if using the sample mean and s ample variance as estimators for the unknown mean and variance. Use two or three cells to keep the calculations simple. Show that the limit distribution is not chi-square. 3. Suppose that X m and Yn are independent multinomial vectors with parameters (m , a I , . . . , ak ) and (n , h , . . , bk ) , respectively. Under the null hypothesis Ho : a = b, a natural estimator of the unknown probability vector a = b is c = (m + n) - I (Xm + Yn ), and a natural test statistic is given by .
Show that c is the maximum likelihood estimator and show that the sequence of test statistics is asymptotically chi square-distributed if m, n -+ oo.
254
Chi-Square Tests
4. A matrix I: - is called a generalized inverse of a matrix I: if x = I: - y solves the equation I:x = y for every y in the range of I: . Suppose that X is Nk (O, 2:: )-distributed for a matrix I: of rank r . Show that (i) 2:;- Y is the same for every generalized inverse 2:: - , with probability one; (ii) (iii)
yT yT 2:: - Y possesses a chi-square distribution with r degrees of freedom; if Y T C Y possesses a chi -square distribution with r degrees offreedom and C is a nonnegative definite symmetric matrix, then C is a generalized inverse of I: .
5. Find the limit distribution of the Dzhaparidze-Nikulin statistic
6. Show that the matrix I - CJ
!(} 1 Ce in Example 1 7 . 8 is nonsingular unless the empirical estima tor (lPn (X1 ), . . . , lPn (Xk )) is asymptotically efficient. (The estimator ( Pe (XI), . . . , Pe (Xk )) is asymptotically efficient and has asymptotic covariance matrix diag (,.jae) CJ !8- 1 Ce diag ( ,.jae ) ; the empirical estimator has asymptotic covariance matrix diag (,.jae) (I - ,.Jae
,.JaeT) diag (,.jae) . )
18 Stochastic Convergence in Metric Spaces
This chapter extends the concepts of convergence in distribution, in prob ability, and almost surely from Euclidean spaces to more abstract metric spaces. We are particularly interested in developing the theory for ran dom functions, or stochastic processes, viewed as elements of the metric space of all bounded functions. 18.1
Metric and Normed Spaces
In this section we recall some basic topological concepts and introduce a number of examples of metric spaces. A metric space is a set lDi equipped with a metric. A metric or distance function is a map d : lDi x lDi r--+ [0, oo) with the properties (i) d(x , y) d(y, x); (ii) d(x , z ) :S d(x, y) + d(y, z ) (triangle inequality); (iii) d (x , y) = O if and only if x = y. =
A semimetric satisfies (i) and (ii), but not necessarily (iii). An open ball is a set of the form { y : d (x, y) < r } . A subset of a metric space is open if and only if it is the union of open balls; it is closed if and only if its complement is open. A sequence Xn converges to x if and only if d (xn , x) ---+ 0; this is denoted by Xn ---+ x. The closure A of a set A c ]]]I consists of all points that are the limit of a sequence in A; it is the smallest closed set containing A . The interior A is the collection of all points x such that x E G c A for some open set G ; it is the largest open set contained in A . A function f : lDi r--+ lE between two metric spaces is continuous at a point x if and only if f (xn ) ---+ f (x) for every sequence Xn ---+ x ; it is continuous at every x if and only if the inverse image f - 1 (G) of every open set G c lE is open in lDi . A subset of a metric space is dense if and only if its closure is the whole space. A metric space is separable if and only if it has a countable dense subset. A subset K of a metric space is compact if and only if it is closed and every sequence in K has a converging subsequence. A subset K is totally bounded if and only if for every 8 > 0 it can be covered by finitely many balls of radius 8. A semimetric space is complete if every Cauchy sequence, a sequence such that d(xn , Xm ) ---+ 0 as n, m ---+ oo, has a limit. A subset of a complete semimetric space is compact if and only if it is totally bounded and closed. A normed space ]]]I is a vector space equipped with a norm. A norm is a map I I : lDi r--+ [0, oo) such that, for every x, y in lDi, and E R, ·
a
255
256
Stochastic Convergence in Metric Spaces
(i) ll x + Y I :::: ll x I + I y II (triangle inequality); (ii) I la x II = Ia l ll x II ; (iii) ll x II = 0 if and only if x = 0. A seminorm satisfies (i) and (ii), but not necessarily (iii). Given a norm, a metric can be defined by d (x , y) = ll x - Y ll . The Borel a -field on a metric space Jill is the smallest a -field that contains the open sets (and then also the closed sets). A function defined relative to (one or two) metric spaces is called Baret-measurable if it is measurable relative to the Borel a-field(s). A Borel-measurable map X : Q 1---7 Jill defined on a probability space (Q , U, P) is referred to as a random element with values in Jill . 18.1
Definition.
For Euclidean spaces, Borel measurability is just the usual measurability. Borel measur ability is probably the natural concept to use with metric spaces. It combines well with the topological structure, particularly if the metric space is separable. For instance, continuous maps are Borel-measurable. 18.2
Lemma.
A continuous map between metric spaces is Borel-measurable.
A map g : Jill 1---7 IE is continuous if and only if the inverse image g - 1 (G) of every open set G c IE is open in Jill . In particular, for every open G the set g - 1 (G) is a Borel set in Jill . By definition, the open sets in IE generate the Borel a -field. Thus, the inverse image of a generator of the Borel sets in IE is contained in the Borel a -field in Jill . Because the inverse image g - 1 (9) of a generator g of a a-field B generates the a -field g - 1 (B) , it follows that the inverse image of every Borel set is a Borel set. •
Proof.
The Euclidean space IRk is a normed space with re spect to the Euclidean norm (whose square is ll x 11 2 = L�= 1 x[), but also with respect to many other norms, for instance ll x II = maxi l x i I , all of which are equivalent. By the Heine Bore! theorem a subset of IRk is compact if and only if it is closed and bounded. A Euclidean space is separable, with, for instance, the vectors with rational coordinates as a countable dense subset. The Borel a -field is the usual a -field, generated by the intervals of the type ( -oo, x ]. D 18.3
Example (Euclidean spaces).
The extended real line lR = [ -oo, oo] is the set consisting of all real numbers and the additional elements -oo and oo. It is a metric space with respect to 18.4
Example (Extended rea/ line).
d(x , y) = I
1 8. 1
Metric and Nonned Spaces
257
set T, let £00 (T) be the collection of all z 1 + z 2 and products with scalars az pointwise. For instance, z 1 + z2 is the element of £00 (T) such that (z 1 + z2 ) (t ) z 1 (t) + z2 (t ) for every t. The uniform norm is defined as 18.5
Example (Uniform norm). Given an arbitrary bounded functions z : T f---+ R Define sums
=
l l z ii T
=
sup J z(t) J . tET
With this notation the space £00 (T) consists exactly of all functions z : T f---+ II z 11 T < The space £00 (T) is separable if and only if T is countable. 0
lR
such that
oo .
Let T [a , b] be an interval in the extended real line. We denote by C [a , b] the set of all continuous functions z : [a , b] f---+ lR and by D [a , b] the set of all functions z : [a , b] f---+ lR that are right continuous and whose limits from the left exist everywhere in [a , b] . (The functions in D [a , b] are called cadlag: continue a droite, limites a gauche.) It can be shown that C[a , b] c D [a , b] c £00[a , b]. We always equip the spaces C [a , b] and D[a, b] with the uniform norm li z li T , which they "inherit" from £00[a , b] . The space D [a , b] is referred to here as the Skora hod space, although Skorohod did not consider the uniform norm but equipped the space with the "Skorohod metric" (which we do not use or discuss). The space C[a, b] is separable, but the space D [a , b] is not (relative to the uniform norm). 0 18.6
Example (Skorohod space).
=
Example (Uniformly continuous functions). Let T be a totally bounded semimetric space with semimetric p . We denote by U C (T, p) the collection of all uniformly continuous functions z : T f---+ R Because a uniformly continuous function on a totally bounded set is necessarily bounded, the space UC (T, p) is a subspace of £00 (T) . We equip UC (T, p) 18.7
with the uniform norm. Because a compact semimetric space is totally bounded, and a continuous function on a compact space is automatically uniformly continuous, the spaces C(T, p) for a compact semimetric space T, for instance C [a, b ], are special cases of the spaces U C(T, p ) . Actu ally, every space UC (T, p) can be identified with a space C(T, p ) , because the completion T of a totally bounded semimetric T space is compact, and every uniformly continuous function on T has a unique continuous extension to the completion. The space UC (T, p) is separable. Furthermore, the Borel a -field is equal to the a-field generated by all coordinate projections (see Problem 18.3). The coordinate projections are the maps z f---+ z( t ) with t ranging over T. These are continuous and hence always Borel-measurable. 0 18.8 Example (Product spaces). Given a pair of metric spaces ]]]I and lE with metrics d and e, the Cartesian product ]]]I x lE is a metric space with respect to the metric
Cxn, Yn) --+ (x,
Xn --+ x
For this metric, convergence of a sequence y) is equivalent to both and y. For a product metric space, there exist two natural a -fields: The product of the Borel a-fields and the Borel a -field of the product metric. In general, these are not the same,
Yn --+
Stochastic Convergence in Metric Spaces
258
the second one being bigger. A sufficient condition for them to be equal is that the metric spaces IDl and lE are separable (e.g., Chapter 1 .4 in [146])). The possible inequality of the two O" -fields causes an inconvenient problem. If X : Q 1---+ IDl and Y : Q 1---+ lE are Borel-measurable maps, defined on some measurable space (Q , U), then (X, Y) : Q 1---+ IDl x lE is always measurable for the product of the Borel O"-fields. This is an easy fact from measure theory. However, if the two O" -fields are different, then the map (X, Y) need not be Borel-measurable. If they have separable range, then they are. D
1 8.2
Basic Properties
In Chapter 2 convergence in distribution of random vectors is defined by reference to their distribution functions. Distribution functions do not extend in a natural way to random elements with values in metric spaces. Instead, we define convergence in distribution using one of the characterizations given by the portmanteau lemma. A sequence of random elements Xn with values in a metric space IDl is said to converge in distribution to a random element X if E f (X n) E f (X) for every bounded, continuous function f : IDl 1---+ R In some applications the "random elements" of interest tum out not to be Borel-measurable. To accomodate this situation, we extend the preceding definition to a sequence of arbitrary maps Xn : Qn t--+ IDl, defined on probability spaces (Qn , Un , Pn ) · Because E f ( Xn ) need no longer make sense, w e replace expectations by outer expectations. For an arbitrary map X : Q 1--+ IDl, define
--+
E* f(X)
=
inf {EU : U : Q 1--+ JR., measurable, U :::: j (X) , EU exists}.
Then we say that a sequence of arbitrary maps Xn : s-2n 1---+ IDl converges in distribution to a random element X if E* f(Xn ) --+ Ef (X) for every bounded, continuous function f : IDl 1---+ R Here we insist that the limit X be Borel-measurable. In the following, we do not stress the measurability issues. However, throughout we do write stars, if necessary, as a reminder that there are measurability issues that need to be taken care of. Although Qn may depend on n, we do not let this show up in the notation for E* and P* . Next consider convergence in probability and almost surely. An arbitrary sequence of maps Xn : s-2n 1---+ IDl converges in probability to X if P* (d (X n , X) > E ) --+ 0 for all E > 0. This is denoted by Xn X. The sequence Xn converges almost surely to X if there exists a sequence of (measurable) random variables l::!.. n such that d (Xn , X) :S l::!.. n and l::!.. n � 0. This is denoted by Xn � X. These definitions also do not require the Xn to be Borel-measurable. In the definition of convergence of probability we solved this by adding a star, for outer probability. On the other hand, the definition of almost-sure convergence is unpleasantly complicated. This cannot be avoided easily, because, even for Borel-measurable maps Xn and X, the distance d(Xn , X) need not be a random variable. The portmanteau lemma, the continuous-mapping theorem and the relations among the three modes of stochastic convergence extend without essential changes to the present defini tions. Even the proofs, as given in Chapter 2, do not need essential modifications. However, we seize the opportunity to formulate and prove a refinement of the continuous-mapping theorem. The continuous-mapping theorem furnishes a more intuitive interpretation of
�
1 8. 2
Basic Properties
259
n
weak convergence in terms of weak convergence of random vectors: X X in the metric space [ll if and only if g(X ) g(X) for every continuous map g : [ll f---+ JR.k .
n
'V'-7
'V'-7
n : S"2n
For arbitrary maps X f---+ [ll and every random ele ment X with values in [ll, the following statements are equivalent. (i) E* f(Xn) ---+ Ef(X) for all bounded, continuousfunctions f. (ii) E* f(X ) ---+ Ef(X) for all bounded, Lipschitz .functions f. (iii) lim infP* (Xn E G) 2:: P(X E G) for every open set G. ( iv) lim sup P* (X E F) ::::: P(X E F) for every closed set F. (v) P* (Xn E B) ---+ P(X E B) for all Borel sets B with P(X E 8B) 0. 18.9
Lemma (Portmanteau).
n
n
=
n Yn : S"2n
18.10 Theorem. For arbitrary maps X , f---+ [ll and every random element X with values. in [ll : . 1 zes · X ---+ ( l ) X ---+ as* X zmp P X. (ii) Xn X implies Xn X. (iii) Xn c for a constant c and only if Xn c. (iv) Xn X and d(Xn , 0, then X. (v) Xn X and c for a constant c, then (X , (X, c). (vi) if Xn X and then (X , (X,
n
n
� �
-v--7
if if Yn) � Yn if Yn � n Yn) � Yn � Y, n Yn) � Y). Let [lln [ll be arbitrary subsets and gn : [lln lE be arbitrary maps (n 0) such that for every sequence Xn [lln : if Xn' ---+ x along a subsequence and x [llo, then gn' (xn' ) ---+ go (x ). Then, for arbitrary maps Xn : S"2n [lln and every random element X with values in [ll 0 such that go (X) is a random element in (i) If Xn X, then gn (X n ) go(X). (ii) If Xn � X, then gn (Xn) � go (X). (iii) If Xn X, then g (Xn ) go (X). -v--7
-v--7
-v--7
-v--7
18. 1 1
-v--7
C
Theorem (Continuous mapping).
E
2::
f---+
E
f---+
JE:
-v--7
-v--7
�
�
n
[lln
n
The proofs for [ll and g g fixed, where g is continuous at every point of [ll0 , are the same as in the case of Euclidean spaces. We prove the refinement only for (i). The other refinements are not needed in the following. For every closed set F, we have the inclusion
Proof.
=
=
Indeed, suppose that x is in the set on the left side. Then for every k there is an m k ::=:: k and an element E g;;;; (F) with x) < 1 / k. Thus, there exist a sequence m k ---+ oo and elements with ---+ x . Then either E ---+ go (x) or x ¢'. [ll o . Because the set F is closed, this implies that g0 (x) E F or x ¢'. [ll0 . Now, for every fixed k, by the portmanteau lemma,
Xm k d(xmk ' Xmk [llmk Xmk
lim sup P* ( gn (Xn )
n ---+ oo
E
F)
gmk (xmk )
=dx [llm : gm (x) n ---+ oo ( ....,-u:: ::::: P( X u::= g;1 (F)) .
::'S
lim sup P * Xn E
E
---
k
-
E
E
F} )
Stochastic Convergence in Metric Spaces
260
As k --+ oo, the last probability converges to P ( X E n� 1 u;:=k g;;; 1 (F) ) , which is smaller than or equal to P ( g0 (X) E F) , by the preceding paragraph. Thus, gn (Xn) g0 (X) by the portmanteau lemma in the other direction. • -v-+
The extension of Prohorov's theorem requires more care.† In a Euclidean space, a set is compact if and only if it is closed and bounded. In general metric spaces, a compact set is closed and bounded, but a closed, bounded set is not necessarily compact. It is the compactness that we employ in the definition of tightness. A Borel-measurable random element $X$ into a metric space is tight if for every $\varepsilon > 0$ there exists a compact set $K$ such that $\mathrm{P}(X \notin K) < \varepsilon$. A sequence of arbitrary maps $X_n : \Omega_n \to \mathbb{D}$ is called asymptotically tight if for every $\varepsilon > 0$ there exists a compact set $K$ such that
$$\limsup_{n \to \infty} \mathrm{P}^*\bigl(X_n \notin K^\delta\bigr) < \varepsilon, \qquad \text{every } \delta > 0.$$
Here $K^\delta$ is the $\delta$-enlargement $\{y : d(y, K) < \delta\}$ of the set $K$. It can be shown that, for Borel-measurable maps in $\mathbb{R}^k$, this is identical to "uniformly tight," as defined in Chapter 2. In order to obtain a theory that applies to a sufficient number of applications, again we do not wish to assume that the $X_n$ are Borel-measurable. However, Prohorov's theorem is true only under, at least, "measurability in the limit." An arbitrary sequence of maps $X_n$ is called asymptotically measurable if
$$\mathrm{E}^* f(X_n) - \mathrm{E}_* f(X_n) \to 0, \qquad \text{every } f \in C_b(\mathbb{D}).$$
Here $\mathrm{E}_*$ denotes the inner expectation, which is defined in analogy with the outer expectation, and $C_b(\mathbb{D})$ is the collection of all bounded, continuous functions $f : \mathbb{D} \to \mathbb{R}$. A Borel-measurable sequence of random elements $X_n$ is certainly asymptotically measurable, because then both the outer and the inner expectations in the preceding display are equal to the expectation, and the difference is identically zero.

18.12 Theorem (Prohorov's theorem). Let $X_n : \Omega_n \to \mathbb{D}$ be arbitrary maps into a metric space.
(i) If $X_n \rightsquigarrow X$ for some tight random element $X$, then $\{X_n : n \in \mathbb{N}\}$ is asymptotically tight and asymptotically measurable.
(ii) If $X_n$ is asymptotically tight and asymptotically measurable, then there is a subsequence and a tight random element $X$ such that $X_{n_j} \rightsquigarrow X$ as $j \to \infty$.
18.3
Bounded Stochastic Processes
A stochastic process $X = \{X_t : t \in T\}$ is a collection of random variables $X_t : \Omega \to \mathbb{R}$, indexed by an arbitrary set $T$ and defined on the same probability space $(\Omega, \mathcal{U}, \mathrm{P})$. For a fixed $\omega$, the map $t \mapsto X_t(\omega)$ is called a sample path, and it is helpful to think of $X$ as a random function, whose realizations are the sample paths, rather than as a collection of random variables. If every sample path is a bounded function, then $X$ can be viewed as a
† The following Prohorov's theorem is not used in this book. For a proof see, for instance, [146].
map $X : \Omega \to \ell^\infty(T)$. If $T = [a, b]$ and the sample paths are continuous or cadlag, then $X$ is also a map with values in $C[a, b]$ or $D[a, b]$. Because $C[a, b] \subset D[a, b] \subset \ell^\infty[a, b]$, we can consider the weak convergence of a sequence of maps with values in $C[a, b]$ relative to $C[a, b]$, but also relative to $D[a, b]$, or $\ell^\infty[a, b]$. The following lemma shows that this does not make a difference, as long as we use the uniform norm for all three spaces.

18.13 Lemma. Let $\mathbb{D}_0 \subset \mathbb{D}$ be arbitrary metric spaces equipped with the same metric. If $X$ and every $X_n$ take their values in $\mathbb{D}_0$, then $X_n \rightsquigarrow X$ as maps in $\mathbb{D}_0$ if and only if $X_n \rightsquigarrow X$ as maps in $\mathbb{D}$.
Proof. Because a set $G_0$ in $\mathbb{D}_0$ is open if and only if it is of the form $G \cap \mathbb{D}_0$ for an open set $G$ in $\mathbb{D}$, this is an easy corollary of (iii) of the portmanteau lemma. ∎
Thus, we may concentrate on weak convergence in the space $\ell^\infty(T)$, and automatically obtain characterizations of weak convergence in $C[a, b]$ or $D[a, b]$. The next theorem gives a characterization by finite approximation. It is required that, for any $\varepsilon > 0$, the index set $T$ can be partitioned into finitely many sets $T_1, \ldots, T_k$ such that (asymptotically) the variation of the sample paths $t \mapsto X_{n,t}$ is less than $\varepsilon$ on every one of the sets $T_i$, with large probability. Then the behavior of the process can be described, within a small error margin, by the behavior of the marginal vectors $(X_{n,t_1}, \ldots, X_{n,t_k})$ for arbitrary fixed points $t_i \in T_i$. If these marginals converge, then the processes converge.
18.14 Theorem. A sequence of arbitrary maps $X_n : \Omega_n \to \ell^\infty(T)$ converges weakly to a tight random element if and only if both of the following conditions hold:
(i) The sequence $(X_{n,t_1}, \ldots, X_{n,t_k})$ converges in distribution in $\mathbb{R}^k$ for every finite set of points $t_1, \ldots, t_k$ in $T$;
(ii) for every $\varepsilon, \eta > 0$ there exists a partition of $T$ into finitely many sets $T_1, \ldots, T_k$ such that
$$\limsup_{n \to \infty} \mathrm{P}^*\Bigl(\sup_i \sup_{s, t \in T_i} |X_{n,s} - X_{n,t}| \ge \varepsilon\Bigr) \le \eta.$$
Proof. We only give the proof of the more constructive part, the sufficiency of (i) and (ii). For each natural number $m$, partition $T$ into sets $T_1^m, \ldots, T_{k_m}^m$ as in (ii) corresponding to $\varepsilon = \eta = 2^{-m}$. Because the probabilities in (ii) decrease if the partition is refined, we can assume without loss of generality that the partitions are successive refinements as $m$ increases. For fixed $m$, define a semimetric $\rho_m$ on $T$ by $\rho_m(s, t) = 0$ if $s$ and $t$ belong to the same partitioning set $T_i^m$, and by $\rho_m(s, t) = 1$ otherwise. Every $\rho_m$-ball of radius $0 < \varepsilon < 1$ coincides with a partitioning set. In particular, $T$ is totally bounded for $\rho_m$, and the $\rho_m$-diameter of a set $T_i^m$ is zero. By the nesting of the partitions, $\rho_1 \le \rho_2 \le \cdots$. Define $\rho(s, t) = \sum_{m=1}^{\infty} 2^{-m} \rho_m(s, t)$. Then $\rho$ is a semimetric such that the $\rho$-diameter of $T_i^m$ is smaller than $\sum_{k > m} 2^{-k} = 2^{-m}$, and hence $T$ is totally bounded for $\rho$. Let $T_0$ be the countable $\rho$-dense subset constructed by choosing an arbitrary point $t_j^m$ from every $T_j^m$. By assumption (i) and Kolmogorov's consistency theorem (e.g., [133, p. 244] or [42, p. 347]), we can construct a stochastic process $\{X_t : t \in T_0\}$ on some probability space such that $(X_{n,t_1}, \ldots, X_{n,t_k}) \rightsquigarrow (X_{t_1}, \ldots, X_{t_k})$ for every finite set of points $t_1, \ldots, t_k$ in $T_0$. By the
portmanteau lemma and assumption (ii), for every finite set $S \subset T_0$,
$$\mathrm{P}\Bigl(\max_i \sup_{s, t \in T_i^m \cap S} |X_s - X_t| \ge 2^{-m}\Bigr) \le 2^{-m}.$$
By the monotone convergence theorem this remains true if $S$ is replaced by $T_0$. If $\rho(s, t) < 2^{-m}$, then $\rho_m(s, t) < 1$ and hence $s$ and $t$ belong to the same partitioning set $T_i^m$. Consequently, the event in the preceding display with $S = T_0$ contains the event in the following display, and
$$\mathrm{P}\Bigl(\sup_{\rho(s, t) < 2^{-m},\; s, t \in T_0} |X_s - X_t| > 2^{-m}\Bigr) \le 2^{-m}.$$
This sums to a finite number over $m \in \mathbb{N}$. Hence, by the Borel-Cantelli lemma, for almost all $\omega$, $|X_s(\omega) - X_t(\omega)| \le 2^{-m}$ for all $\rho(s, t) < 2^{-m}$ and all sufficiently large $m$. This implies that almost all sample paths of $\{X_t : t \in T_0\}$ are contained in $UC(T_0, \rho)$. Extend the process by continuity to a process $\{X_t : t \in T\}$ with almost all sample paths in $UC(T, \rho)$.
Define $\pi_m : T \to T$ as the map that maps every partitioning set $T_j^m$ onto the point $t_j^m \in T_j^m$. Then, by the uniform continuity of $X$ and the fact that the $\rho$-diameter of $T_j^m$ is smaller than $2^{-m}$, $X \circ \pi_m \rightsquigarrow X$ in $\ell^\infty(T)$ as $m \to \infty$ (even almost surely). The processes $\{X_n \circ \pi_m(t) : t \in T\}$ are essentially $k_m$-dimensional vectors. By (i), $X_n \circ \pi_m \rightsquigarrow X \circ \pi_m$ in $\ell^\infty(T)$ as $n \to \infty$, for every fixed $m$. Consequently, for every Lipschitz function $f : \ell^\infty(T) \to [0, 1]$,
$$\mathrm{E}^* f(X_n \circ \pi_m) \to \mathrm{E} f(X \circ \pi_m) \to \mathrm{E} f(X),$$
as $n \to \infty$, followed by $m \to \infty$. Conclude that, for every $\varepsilon > 0$,
$$|\mathrm{E}^* f(X_n) - \mathrm{E} f(X)| \le |\mathrm{E}^* f(X_n) - \mathrm{E}^* f(X_n \circ \pi_m)| + o(1) \le \|f\|_{\mathrm{lip}}\, \varepsilon + \mathrm{P}^*\bigl(\|X_n - X_n \circ \pi_m\|_T > \varepsilon\bigr) + o(1).$$
For $\varepsilon = 2^{-m}$ this is bounded by $\|f\|_{\mathrm{lip}}\, 2^{-m} + 2^{-m} + o(1)$, by the construction of the partitions. The proof is complete. ∎
In the course of the proof of the preceding theorem a semimetric $\rho$ is constructed such that the weak limit has uniformly $\rho$-continuous sample paths, and such that $(T, \rho)$ is totally bounded. This is surprising: even though we are discussing stochastic processes with values in the very large space $\ell^\infty(T)$, the limit is concentrated on a much smaller space of continuous functions. Actually, this is a consequence of imposing the condition (ii), which can be shown to be equivalent to asymptotic tightness. It can be shown, more generally, that every tight random element $X$ in $\ell^\infty(T)$ necessarily concentrates on $UC(T, \rho)$ for some semimetric $\rho$ (depending on $X$) that makes $T$ totally bounded. In view of this connection between the partitioning condition (ii), continuity, and tightness, we shall sometimes refer to this condition as the condition of asymptotic tightness or asymptotic equicontinuity.
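The partitioning condition (ii) can also be probed by simulation. The following sketch (illustrative only; the grid, sample size, and thresholds are arbitrary choices, and the uniform empirical process of the next chapter is used as the test case) estimates the continuity modulus of $t \mapsto \sqrt{n}(\mathbb{F}_n(t) - t)$ over pairs of points at distance less than $\delta$, which becomes small as $\delta$ shrinks, in line with asymptotic equicontinuity:

```python
# Illustrative sketch: estimate the modulus sup_{|s-t|<delta} |X_n(s)-X_n(t)|
# of the uniform empirical process X_n(t) = sqrt(n)(F_n(t) - t) on a grid.
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 201)

def empirical_process(n):
    x = rng.uniform(size=n)
    fn = np.searchsorted(np.sort(x), grid, side="right") / n
    return np.sqrt(n) * (fn - grid)

def modulus(path, delta):
    step = grid[1] - grid[0]
    k = max(1, int(round(delta / step)))            # grid lags within delta
    return max(np.max(np.abs(path[j:] - path[:-j])) for j in range(1, k + 1))

n = 2000
for delta in (0.2, 0.05, 0.01):
    mods = [modulus(empirical_process(n), delta) for _ in range(200)]
    print(delta, np.mean(np.array(mods) >= 0.5))    # estimate of P(modulus >= 1/2)
```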
We record the existence of the semimetric for later reference and note that, for a Gaussian limit process, this can always be taken equal to the "intrinsic" standard deviation semimetric.

18.15 Lemma. Under the conditions (i) and (ii) of the preceding theorem there exists a semimetric $\rho$ on $T$ for which $T$ is totally bounded, and such that the weak limit of the sequence $X_n$ can be constructed to have almost all sample paths in $UC(T, \rho)$. Furthermore, if the weak limit $X$ is zero-mean Gaussian, then this semimetric can be taken equal to $\rho(s, t) = \mathrm{sd}(X_s - X_t)$.
Proof. A semimetric $\rho$ is constructed explicitly in the proof of the preceding theorem. It suffices to prove the statement concerning Gaussian limits $X$.
Let $\rho$ be the semimetric obtained in the proof of the theorem and let $\rho_2$ be the standard deviation semimetric. Because every uniformly $\rho$-continuous function has a unique continuous extension to the $\rho$-completion of $T$, which is compact, it is no loss of generality to assume that $T$ is $\rho$-compact. Furthermore, assume that every sample path of $X$ is $\rho$-continuous.
An arbitrary sequence $t_n$ in $T$ has a $\rho$-converging subsequence $t_{n'} \to t$. By the $\rho$-continuity of the sample paths, $X_{t_{n'}} \to X_t$ almost surely. Because every $X_t$ is Gaussian, this implies convergence of means and variances, whence $\rho_2(t_{n'}, t)^2 = \mathrm{E}(X_{t_{n'}} - X_t)^2 \to 0$ by Proposition 2.29. Thus $t_{n'} \to t$ also for $\rho_2$ and hence $T$ is $\rho_2$-compact.
Suppose that a sample path $t \mapsto X_t(\omega)$ is not $\rho_2$-continuous. Then there exist an $\varepsilon > 0$, a $t \in T$, and a sequence $t_n$ such that $\rho_2(t_n, t) \to 0$, but $|X_{t_n}(\omega) - X_t(\omega)| \ge \varepsilon$ for every $n$. By the $\rho$-compactness and continuity, there exists a subsequence such that $\rho(t_{n'}, s) \to 0$ and $X_{t_{n'}}(\omega) \to X_s(\omega)$ for some $s$. By the argument of the preceding paragraph, $\rho_2(t_{n'}, s) \to 0$, so that $\rho_2(s, t) = 0$ and $|X_s(\omega) - X_t(\omega)| \ge \varepsilon$. Conclude that the path $t \mapsto X_t(\omega)$ can only fail to be $\rho_2$-continuous for $\omega$ for which there exist $s, t \in T$ with $\rho_2(s, t) = 0$, but $X_s(\omega) \ne X_t(\omega)$. Let $N$ be the set of $\omega$ for which there do exist such $s, t$. Take a countable, $\rho$-dense subset $A$ of $\{(s, t) \in T \times T : \rho_2(s, t) = 0\}$. Because $t \mapsto X_t(\omega)$ is $\rho$-continuous, $N$ is also the set of all $\omega$ such that there exist $(s, t) \in A$ with $X_s(\omega) \ne X_t(\omega)$. From the definition of $\rho_2$, it is clear that for every fixed $(s, t)$, the set of $\omega$ such that $X_s(\omega) \ne X_t(\omega)$ is a null set. Conclude that $N$ is a null set. Hence, almost all paths of $X$ are $\rho_2$-continuous. ∎
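For the standard (uniform) Brownian bridge the intrinsic semimetric of the preceding lemma is available in closed form, $\rho_2(s, t) = \sqrt{|s - t| - (s - t)^2}$, because $\operatorname{Cov}\bigl(\mathbb{G}_\lambda(s), \mathbb{G}_\lambda(t)\bigr) = s \wedge t - st$. The following small simulation sketch (numpy; the tied-down random walk construction and the sample sizes are arbitrary choices) checks this numerically:

```python
# Small check of rho_2(s, t) = sd(G(s) - G(t)) for the uniform Brownian bridge.
import numpy as np

rng = np.random.default_rng(6)

def bridge_paths(reps, m=2000):
    steps = rng.normal(size=(reps, m)) / np.sqrt(m)
    W = np.concatenate([np.zeros((reps, 1)), np.cumsum(steps, axis=1)], axis=1)
    t = np.linspace(0.0, 1.0, m + 1)
    return t, W - t * W[:, -1:]          # G(t) = W(t) - t W(1)

t, G = bridge_paths(reps=2000)
for s, u in ((0.2, 0.3), (0.1, 0.9)):
    i, j = int(s * 2000), int(u * 2000)
    emp = np.std(G[:, i] - G[:, j])
    print((s, u), emp, np.sqrt(abs(u - s) - (u - s) ** 2))
```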
Notes
The theory in this chapter was developed in increasing generality over the course of many years. Work by Donsker around 1950 on the approximation of the empirical process and the partial sum process by the Brownian bridge and Brownian motion processes was an important motivation. The first type of approximation is discussed in Chapter 19. For further details and references concerning the material in this chapter, see, for example, [76] or [146].

PROBLEMS
1. (i) Show that a compact set is totally bounded. (ii) Show that a compact set is separable.
2. Show that a function $f : \mathbb{D} \to \mathbb{E}$ is continuous at every $x \in \mathbb{D}$ if and only if $f^{-1}(G)$ is open in $\mathbb{D}$ for every open $G \subset \mathbb{E}$.
3. (Projection σ-field.) Show that the σ-field generated by the coordinate projections $z \mapsto z(t)$ on $C[a, b]$ is equal to the Borel σ-field generated by the uniform norm. (First, show that the space $C[a, b]$ is separable. Next show that every open set in a separable metric space is a countable union of open balls. Next, it suffices to prove that every open ball is measurable for the projection σ-field.)
4. Show that $D[a, b]$ is not separable for the uniform norm.
5. Show that every function in $D[a, b]$ is bounded.
6. Let $h$ be an arbitrary element of $D[-\infty, \infty]$ and let $\varepsilon > 0$. Show that there exists a grid $u_0 = -\infty < u_1 < \cdots < u_m = \infty$ such that $h$ varies at most $\varepsilon$ on every interval $[u_i, u_{i+1})$. Here "varies at most $\varepsilon$" means that $|h(u) - h(v)|$ is less than $\varepsilon$ for every $u, v$ in the interval. (Make sure that all points at which $h$ jumps more than $\varepsilon$ are grid points.)
7. Suppose that $H_n$ and $H_0$ are subsets of a semimetric space $H$ such that $H_n \to H_0$ in the sense that (i) every $h \in H_0$ is the limit of a sequence $h_n \in H_n$; (ii) if a subsequence $h_{n_j}$ converges to a limit $h$, then $h \in H_0$. Suppose that $A_n$ are stochastic processes indexed by $H$ that converge in distribution in the space $\ell^\infty(H)$ to a stochastic process $A$ that has uniformly continuous sample paths. Show that $\sup_{h \in H_n} A_n(h) \rightsquigarrow \sup_{h \in H_0} A(h)$.
19 Empirical Processes
The empirical distribution of a random sample is the uniform discrete measure on the observations. In this chapter, we study the convergence of this measure and in particular the convergence of the corresponding distribution function. This leads to laws of large numbers and central limit theorems that are uniform in classes of functions. We also discuss a number of applications of these results.

19.1 Empirical Distribution Functions
Let $X_1, \ldots, X_n$ be a random sample from a distribution function $F$ on the real line. The empirical distribution function is defined as
$$\mathbb{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} 1\{X_i \le t\}.$$
It is the natural estimator for the underlying distribution $F$ if this is completely unknown. Because $n\mathbb{F}_n(t)$ is binomially distributed with mean $nF(t)$, this estimator is unbiased. By the law of large numbers it is also consistent,
$$\mathbb{F}_n(t) \xrightarrow{as} F(t), \qquad \text{every } t.$$
By the central limit theorem it is asymptotically normal,
$$\sqrt{n}\bigl(\mathbb{F}_n(t) - F(t)\bigr) \rightsquigarrow N\bigl(0, F(t)(1 - F(t))\bigr).$$
In this chapter we improve on these results by considering $t \mapsto \mathbb{F}_n(t)$ as a random function, rather than as a real-valued estimator for each $t$ separately. This is of interest on its own account but also provides a useful starting point for the asymptotic analysis of other statistics, such as quantiles, rank statistics, or trimmed means. The Glivenko-Cantelli theorem extends the law of large numbers and gives uniform convergence. The uniform distance
$$\|\mathbb{F}_n - F\|_\infty = \sup_t |\mathbb{F}_n(t) - F(t)|$$
is known as the Kolmogorov-Smirnov statistic.
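As a computational aside (a minimal Python sketch, not part of the text; the normal sampling distribution is an arbitrary choice), the empirical distribution function and the Kolmogorov-Smirnov distance are easy to compute, and the distance visibly shrinks as the sample size grows, in line with the Glivenko-Cantelli theorem below:

```python
# Minimal sketch: Kolmogorov-Smirnov distance ||F_n - F||_infty for growing n.
import math
import numpy as np

rng = np.random.default_rng(2)

def Phi(t):                       # standard normal distribution function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ks_distance(x):
    x = np.sort(x)
    n = len(x)
    F = np.array([Phi(t) for t in x])
    # The supremum is attained at a jump of F_n: compare F with the values of
    # F_n just after and just before each observation.
    upper = np.arange(1, n + 1) / n - F
    lower = F - np.arange(0, n) / n
    return max(upper.max(), lower.max())

for n in (100, 1000, 10000):
    print(n, ks_distance(rng.normal(size=n)))
```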
19.1 Theorem (Glivenko-Cantelli). If $X_1, X_2, \ldots$ are i.i.d. random variables with distribution function $F$, then $\|\mathbb{F}_n - F\|_\infty \xrightarrow{as} 0$.
Proof. By the strong law of large numbers, both $\mathbb{F}_n(t) \xrightarrow{as} F(t)$ and $\mathbb{F}_n(t-) \xrightarrow{as} F(t-)$ for every $t$. Given a fixed $\varepsilon > 0$, there exists a partition $-\infty = t_0 < t_1 < \cdots < t_k = \infty$ such that $F(t_i-) - F(t_{i-1}) < \varepsilon$ for every $i$. (Points at which $F$ jumps more than $\varepsilon$ are points of the partition.) Now, for $t_{i-1} \le t < t_i$,
$$\mathbb{F}_n(t) - F(t) \le \mathbb{F}_n(t_i-) - F(t_{i-1}) \le \mathbb{F}_n(t_i-) - F(t_i-) + \varepsilon,$$
$$\mathbb{F}_n(t) - F(t) \ge \mathbb{F}_n(t_{i-1}) - F(t_i-) \ge \mathbb{F}_n(t_{i-1}) - F(t_{i-1}) - \varepsilon.$$
The convergence of $\mathbb{F}_n(t)$ and $\mathbb{F}_n(t-)$ for every fixed $t$ is certainly uniform for $t$ in the finite set $\{t_1, \ldots, t_{k-1}\}$. Conclude that $\limsup \|\mathbb{F}_n - F\|_\infty \le \varepsilon$, almost surely. This is true for every $\varepsilon > 0$ and hence the limit superior is zero. ∎
The extension of the central limit theorem to a "uniform" or "functional" central limit theorem is more involved. A first step is to prove the joint weak convergence of finitely many coordinates. By the multivariate central limit theorem, for every $t_1, \ldots, t_k$,
$$\Bigl(\sqrt{n}(\mathbb{F}_n - F)(t_1), \ldots, \sqrt{n}(\mathbb{F}_n - F)(t_k)\Bigr) \rightsquigarrow \bigl(\mathbb{G}_F(t_1), \ldots, \mathbb{G}_F(t_k)\bigr),$$
where the vector on the right has a multivariate-normal distribution, with mean zero and covariances
$$\mathrm{E}\,\mathbb{G}_F(s)\,\mathbb{G}_F(t) = F(s \wedge t) - F(s)F(t). \tag{19.2}$$
This suggests that the sequence of empirical processes $\sqrt{n}(\mathbb{F}_n - F)$, viewed as random functions, converges in distribution to a Gaussian process $\mathbb{G}_F$ with zero mean and covariance function as in the preceding display. According to an extension of Donsker's theorem, this is true in the sense of weak convergence of these processes in the Skorohod space $D[-\infty, \infty]$ equipped with the uniform norm. The limit process $\mathbb{G}_F$ is known as an $F$-Brownian bridge process, and as a standard (or uniform) Brownian bridge if $F$ is the uniform distribution $\lambda$ on $[0, 1]$. From the form of the covariance function it is clear that the $F$-Brownian bridge is obtainable as $\mathbb{G}_\lambda \circ F$ from a standard bridge $\mathbb{G}_\lambda$. The name "bridge" results from the fact that the sample paths of the process are zero (one says "tied down") at the endpoints $-\infty$ and $\infty$. This is a consequence of the fact that the difference of two distribution functions is zero at these points.
19.3 Theorem (Donsker). If $X_1, X_2, \ldots$ are i.i.d. random variables with distribution function $F$, then the sequence of empirical processes $\sqrt{n}(\mathbb{F}_n - F)$ converges in distribution in the space $D[-\infty, \infty]$ to a tight random element $\mathbb{G}_F$, whose marginal distributions are zero-mean normal with covariance function (19.2).

Proof. The proof of this theorem is long. Because there is little to be gained by considering the special case of cells in the real line, we deduce the theorem from a more general result in the next section. ∎
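Realizations of the uniform empirical process such as those in Figure 19.1 below can be generated with a few lines of code (an illustrative sketch; it assumes matplotlib is available for plotting, and the grid and sample sizes are arbitrary choices):

```python
# Sketch producing realizations of sqrt(n)(F_n(t) - t), in the spirit of Fig. 19.1.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
grid = np.linspace(0.0, 1.0, 1001)

fig, axes = plt.subplots(3, 1, sharex=True)
for ax, n in zip(axes, (50, 500, 5000)):
    x = np.sort(rng.uniform(size=n))
    fn = np.searchsorted(x, grid, side="right") / n   # empirical distribution function
    ax.step(grid, np.sqrt(n) * (fn - grid), where="post")
    ax.set_ylabel(f"n = {n}")
plt.savefig("uniform_empirical_process.png")
```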
Figure 19.1 shows some realizations of the uniform empirical process. The roughness of the sample path for $n = 5000$ is remarkable, and typical. It is carried over onto the limit
Figure 19.1. Three realizations of the uniform empirical process, of 50 (top), 500 (middle), and 5000 (bottom) observations, respectively.
process, for it can be shown that, for every $t$,
$$0 < \liminf_{h \to 0} \frac{|\mathbb{G}_\lambda(t+h) - \mathbb{G}_\lambda(t)|}{\sqrt{|h \log\log h|}} \le \limsup_{h \to 0} \frac{|\mathbb{G}_\lambda(t+h) - \mathbb{G}_\lambda(t)|}{\sqrt{|h \log\log h|}} < \infty, \qquad \text{a.s.}$$
Thus, the increments of the sample paths of a standard Brownian bridge are close to being of the order $\sqrt{|h|}$. This means that the sample paths are continuous, but nowhere differentiable.
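A standard Brownian bridge can be approximated by tying down a scaled random walk, which also makes the $\sqrt{|h|}$ order of the increments visible numerically (an illustrative sketch with arbitrary discretization choices; this is one of several possible constructions):

```python
# Sketch: approximate a standard Brownian bridge by G(t) = W(t) - t W(1) and
# look at the size of its increments over gaps h.
import numpy as np

rng = np.random.default_rng(4)
m = 100_000                                  # time steps on [0, 1]
t = np.linspace(0.0, 1.0, m + 1)
W = np.concatenate([[0.0], np.cumsum(rng.normal(size=m)) / np.sqrt(m)])
G = W - t * W[-1]                            # Brownian bridge: G(0) = G(1) = 0

for h in (1e-1, 1e-2, 1e-3):
    lag = int(h * m)
    incr = np.abs(G[lag:] - G[:-lag])
    # root-mean-square increment is close to sqrt(h(1-h)), i.e. of order sqrt(h)
    print(h, np.sqrt(np.mean(incr ** 2)), np.sqrt(h * (1.0 - h)))
```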
A related process is the Brownian motion process, which can be defined by $\mathbb{Z}_\lambda(t) = \mathbb{G}_\lambda(t) + tZ$ for a standard normal variable $Z$ independent of $\mathbb{G}_\lambda$. The addition of $tZ$ "liberates" the sample paths at $t = 1$ but retains the "tie" at $t = 0$. The Brownian motion process has the same modulus of continuity as the Brownian bridge and is considered an appropriate model for the physical Brownian movement of particles in a gas. The three coordinates of a particle starting at the origin at time 0 would be taken equal to three independent Brownian motions.
The one-dimensional empirical process and its limits have been studied extensively.† For instance, the Glivenko-Cantelli theorem can be strengthened to a law of the iterated logarithm,
$$\limsup_{n \to \infty} \sqrt{\frac{n}{2 \log\log n}}\, \|\mathbb{F}_n - F\|_\infty \le \tfrac{1}{2}, \qquad \text{a.s.},$$
with equality if $F$ takes on the value $\tfrac{1}{2}$. This can be further strengthened to Strassen's theorem
$$\sqrt{\frac{n}{2 \log\log n}}\, (\mathbb{F}_n - F) \rightrightarrows \mathcal{H} \circ F, \qquad \text{a.s.}$$
Here $\mathcal{H} \circ F$ is the class of all functions $h \circ F$ if $h : [0, 1] \to \mathbb{R}$ ranges over the set of absolutely continuous functions‡ with $h(0) = h(1) = 0$ and $\int_0^1 h'(s)^2\, ds \le 1$. The notation $h_n \rightrightarrows \mathcal{H}$ means that the sequence $h_n$ is relatively compact with respect to the uniform norm, with the collection of all limit points being exactly equal to $\mathcal{H}$. Strassen's theorem gives a fairly precise idea of the fluctuations of the empirical process $\sqrt{n}(\mathbb{F}_n - F)$ when striving in law to $\mathbb{G}_F$. The preceding results show that the uniform distance of $\mathbb{F}_n$ to $F$ is maximally of the order $\sqrt{\log\log n / n}$ as $n \to \infty$. It is also known that
$$\liminf_{n \to \infty} \sqrt{2n \log\log n}\, \|\mathbb{F}_n - F\|_\infty = \frac{\pi}{2}, \qquad \text{a.s.}$$
Thus the uniform distance is asymptotically (along the sequence) at least of the order $1/\sqrt{n \log\log n}$. A famous theorem, the DKW inequality after Dvoretzky, Kiefer, and Wolfowitz, gives a bound on the tail probabilities of $\|\mathbb{F}_n - F\|_\infty$: For every $x$,
$$\mathrm{P}\bigl(\sqrt{n}\, \|\mathbb{F}_n - F\|_\infty > x\bigr) \le 2 e^{-2x^2}.$$
The original DKW inequality did not specify the leading constant 2, which cannot be improved. In this form the inequality was found as recently as 1990 (see [103]). The central limit theorem can be strengthened through strong approximations. These give a special construction of the empirical process and Brownian bridges, on the same probability space, that are close not only in a distributional sense but also in a pointwise sense. One such result asserts that there exists a probability space carrying i.i.d. random variables $X_1, X_2, \ldots$ with law $F$ and a sequence of Brownian bridges $\mathbb{G}_{F,n}$ such that
$$\limsup_{n \to \infty} \frac{\sqrt{n}}{(\log n)^2}\, \bigl\|\sqrt{n}(\mathbb{F}_n - F) - \mathbb{G}_{F,n}\bigr\|_\infty < \infty, \qquad \text{a.s.}$$
+
l oo
See [ 1 34] for the following and many other results on the univariate empirical process. A function is absolutely continuous if it is the primitive function J� g (s) ds of an integrable function g . Then it is almost-everywhere differentiable with derivative g .
1 9. 2
269
Empirical Distributions
CGF,n
CGp,
.fo(IFn
Because, by construction, every is equal in law to this implies that as a process (Donsker 's theorem), but it implies a lot more. Apparently, the F) distance between the sequence and its limit is of the order 0 ( (log n ) 2 j After the method of proof and the country of origin, results of this type are also known as Hungarian embeddings. Another construction yields the estimate, for fixed constants a , b, and c and every x > 0, "Y"-7
CGF
.fo) .
Empirical Distributions
19.2
n JPln
P
Let X 1 , . . . , X be a random sample from a probability distribution on a measurable space (X, A) . The empirical distribution is the discrete uniform measure on the observations. We denote it by = n - 1 I:7= 1 8 x, , where 8 is the probability distribution that is degenerate at x . Given a measurable function : X 1--+ IR, we write for the expectation of under the empirical measure, and for the expectation under Thus x
f JPln f Pf P. 1 n (XJ , JPln f = -n L f Pf = f f dP. i=l
f
JPln
Actually, this chapter is concerned with these maps rather than with as a measure. By the law of large numbers, the sequence converges almost surely to for every such that is defined. The abstract Glivenko-Cantelli theorems make this result uniform in ranging over a class of functions. A class :F of measurable functions X1--+ IR is called -Glivenko-Cantelli if
f
f
JPln f
Pf
P
Pf, f:
as*
I JPln f - Pf i F = jEF sup I JPln f - Pfl ---+ 0. The empirical process evaluated at f is defined as CGnf = .fo Wnf - Pf). By the multivariate central limit theorem, given any finite set of measurable functions fi with Pf? < oo, where the vector on the right possesses a multivariate-normal distribution with mean zero and covariances
ECG f CG g Pfg - Pf Pg The abstract Donsker theorems make this result "uniform" in classes of functions. A class :F of measurable functions f : X1--+ IR is called P-Donsker if the sequence of processes {CGnf : f :F} converges in distribution to a tight limit process in the space C00(:F ) . Then the limit process is a Gaussian process CG with zero mean and covariance function as given in the preceding display and is known as a ?-Brownian bridge. Of course, the Donsker property includes the requirement that the sample paths f 1--+ CGn f are uniformly bounded for every n and every realization of X . . . , X n. This is the case, for instance, if the class :F P
=
P
E
p
1,
·
270
Empirical Processes
has a finite and integrable envelope function F: a function such that I J Cx) l :S F(x) < oo, for every x and f. It is not required that the function x f-+ F (x) be uniformly bounded. For convenience of terminology we define a class :F of vector-valued functions f : x f-+ �k to be Glivenko-Cantelli or Donsker if each of the classes of coordinates fi : x f-+ � with f = (fi , . . . , fk ) ranging over :F (i = 1 , 2, . . . , k) is Glivenko-Cantelli or Donsker. It can be shown that this is equivalent to the union of the k coordinate classes being Glivenko Cantelli or Donsker. Whether a class of functions is Glivenko-Cantelli or Donsker depends on the "size" of the class. A finite class of integrable functions is always Glivenko-Cantelli, and a finite class of square-integrable functions is always Donsker. On the other hand, the class of all square-integrable functions is Glivenk:o-Cantelli, or Donsker, only in trivial cases. A relatively simple way to measure the size of a class :F is in terms of entropy. We shall mainly consider the bracketing entropy relative to the L r (P)-norm Given two functions l and u, the bracket [l , u] is the set of all functions f with l :S f :S u. An c-bracket in L r (P) i s a bracket [l , u] with P ( u - lY < c r . The bracketing number N[ J ( c, :F, L r (P) ) is the minimum number of c-brackets needed to cover :F. (The bracketing functions l and u must have finite L r (P)-norms but need not belong to :F.) The entropy with bracketing is the logarithm of the bracketing number. A simple condition for a class to be P -Glivenko-Cantelli is that the bracketing numbers in L1 (P) are finite for every c > 0. The proof is a straightforward generalization of the proof of the classical Glivenko-Cantelli theorem, Theorem 19. 1 , and is omitted.
Every class :F of measurable functions such that N[ 1 ( c, :F, L1 (P) ) < oo jor every c > O is P-Glivenko-Cantelli. Theorem (Glivenko- Cantelli).
19.4
For most classes of interest, the bracketing numbers N[ 1 ( c, :F, L r (P) ) grow to infinity as c + 0. A sufficient condition for a class to be Donsker is that they do not grow too fast. The speed can be measured in terms of the bracketing integral
If this integral is finite-valued, then the class :F is P-Donsker. The integrand in the integral is a decreasing function of c. Hence, the convergence of1 the integral depends only on the size of the bracketing numbers for c + 0. Because J0 c - r converges for r < 1 and diverges for r � 1 , the integral condition roughly requires that the entropies grow of slower order than (1 I c ) 2 .
de
19.5
< OO
Theorem (Donsker).
is P-Donsker.
Every class :F of measurable functions with f[ J ( 1 , :F, L 2 (P) )
Let g be the collection of all differences f - g if f and g range over :F. With a given set of c-brackets [li , ud over :F we can construct 2c-brackets over g by tak ing differences [l i - u i , u i - lj ] of upper and lower bounds. Therefore, the bracket ing numbers N[ J (c, Q, L 2 (P) ) are bounded by the squares of the bracketing numbers
Proof
27 1
Empirical Distributions
1 9. 2
Nr 1 ( s/2, :F, L 2 (P) ) . Taking a logarithm turns the square into a multiplicative factor 2, and hence the entropy integrals of :F and g are proportional. For a given, small 8 > 0 choose a minimal number of brackets of size 8 that cover :F, and use them to form a partition of :F U i :Fi in sets of diameters smaller than 8 . The subset of g consisting of differences - g of functions and g belonging to the same partitioning set consists of functions of L 2 (P)-norm smaller than 8 . Hence, by Lemma ahead, there exists a finite number a (8) such that
f
E* supi j,supg E:Fi (f - g ) I Gn
=
f
I � Ir 1 ( 8 ,
19. 3 4
:F, L 2 (P) ) + ,Jn P F l { F > a(8) ,J11 } .
Here the envelope function F can be taken equal to the supremum of the absolute values of the upper and lower bounds of finitely many brackets that cover :F, for instance a minimal set of brackets of size This F is square-integrable. The second term on the right is bounded by a (8 ) - 1 PF 2 1 { F > a (8) Jn} and hence converges to zero as n ---+ oo for every fixed 8. The integral converges to zero as 8 ---+ 0. The theorem follows from Theorem in view of Markov's inequality. •
1.
18.14,
If :F is equal to the collection of all indicator functions of the form with t ranging over JR., then the empirical process Gn is the classical empirical process Jn(IFn (t) - F (t)) . The preceding theorems reduce to the classical theorems by Glivenko-Cantelli and Donsker. We can see this by bounding the bracketing numbers of the set of indicator functions Consider brackets of the form for a grid of points - oo to < t 1 < · · · < tk oo with the property F (ti -) - FCti - l ) < s for each i . These brackets have L 1 (F)-size s . Their total number k can be chosen smaller than 2/s. Because Fj 2 :s F for every 0 :S :S the L 2 (F)-size of the brackets is bounded by y'c. Thus Nr 1 ( y'c, :F, L 2 (F)) :s (2/s) , whence the bracketing numbers are of the polynomial order ( /s ) 2 . This means that this class of functions is very small, because a function of the type log( /s ) satisfies the entropy condition of Theorem easily. 0 19.6
Example (Distribution function).
ft 1 (- oo, t J •
ft
=
f. Dc- oo. ti - t l • 1c - oo, ti J t]
=
f 1
=
f 1,
1
19. 5
Ue :
Let :F 8 E G } be a collection of measurable d functions indexed by a bounded subset G c IR. . Suppose that there exists a measurable function m such that 19.7
Example (Parametric class).
=
If P l m l r < oo, then there exists a constant K , depending on G and d only, such that the bracketing numbers satisfy diam G d Nr J (c l l m i i P,n :F, L r (P) ) :S K , every 0 < s < diam G . 8 Thus the entropy is of smaller order than log( l js). Hence the bracketing entropy integral certainly converges, and the class of functions :F is Donsker. To establish the upper bound we use brackets of the type - sm, + sm] for 8 ranging over a suitably chosen subset of G. These brackets have L r (P)-size 2s l l m i i P . r · If 8 ranges over a grid of meshwidth s over 8, then the brackets cover :F, because by the Lipschitz condition, - sm :S :S + sm if l l 81 - 82 ll :S s. Thus, we need as many brackets as we need balls of radius s/2 to cover 8.
(
)
[fe
fe1
fe2 fe1
fe
Empirical Processes
272
The size of 8 in every fixed dimension is at most diam 8. We can cover 8 with fewer than ( diam 8 I 8 ) d cubes of size 8. The circumscribed balls have radius a multiple of 8 and also cover 8 . If we replace the centers of these balls by their projections into 8, then the balls of twice the radius still cover 8. D The parametric class in Example 19.7 is cer tainly Glivenko-Cantelli, but for this a much weaker continuity condition also suffices. Let :F Ue : e E 8 } be a collection of measurable functions with integrable envelope function F indexed by a compact metric space 8 such that the map e �----+- fe (x ) is continuous for every x. Then the L 1-bracketing numbers of :F are finite and hence :F is Glivenko-Cantelli. We can construct the brackets in the obvious way in the form [ fs , J B ] , where is an open ball and is and J B are the infimum and supremum of fe for e E respectively. Given a sequence of balls with common center a given e and radii decreasing to 0, we have f Bm - ism ,j._ fe - fe 0 by the continuity, pointwise in x and hence also in L 1 by the dominated-convergence theorem and the integrability of the envelope. Thus, given 8 > 0, for every e there exists an open ball around e such that the bracket [ fB , f B ] has size at most 8. By the compactness of 8, the collection of balls constructed in this way has a finite subcover. The corresponding brackets cover :F. This construction shows that the bracketing numbers are finite, but it gives no control on their sizes. D 19.8
Example (Pointwise Compact Class).
=
Bm
B,
B
=
B
Let IR.d Ui 11 be a partition in cubes of volume 1 and let :F be the class of all functions f : IR.d ---+ JR. whose partial derivatives up to order exist and are uniformly bounded by constants M1 on each of the cubes 11 . (The condition includes bounds on the "zero-th derivative," which is f itself.) Then the bracketing numbers of :F satisfy, for every V ::::_ d I and every probability measure P, 19.9
Example (Smooth functions).
=
a
a
The constant K depends on V , r , and d only. If the series on the right converges for r 2 and some d I :::; V < 2, then the bracketing entropy integral of the class :F converges and hence the class is P- Donsker. t This requires sufficient smoothness > d 12 and sufficiently small tail probabilities P (11 ) relative to the uniform bounds M1 . If the functions f have compact support (equivalently M1 0 for all large }), then smoothness of order > dl2 suffices. D a,
=
a
a
=
a
19.10 Example (Sobolev classes). Let :F be the set of all functions f : [0, 1] �----+- JR. such that I f I :::; 1 and the (k - 1 )-th derivative is absolutely continuous with J ( f (k) ) 2 (x) dx :::; 1 for some fixed k E N. Then there exists a constant K such that, for every 8 > o,+ 00
log N[ J (8, :F, II · ll oo) :::; Thus, the class :F is Donsker for every k t
+
::::_
(1) K -;;
ljk
1 and every P. D
The upper bound and this sufficient condition can be slightly improved. For this and a proof of the upper bound, see e.g., [ 146, Corollary 2.74] . See [ 1 6] .
19.2
Empirical Distributions
273
Let :F be the collection of all monotone functions f : lft 1---7 [ - 1 , 1 ] , or, bigger, the set of all functions that are of variation bounded by 1 . These are the differences of pairs of monotonely increasing functions that together increase at most 1 . Then there exists a constant K such that, for every r � 1 and probability measure P , t 19.11
Example (Bounded variation).
Thus, this class of functions is P-Donsker for every P . D Let w : (0, 1) I---7 Ift+ be a fixed, con tinuous function. The weighted empirical process of a sample of real-valued observations is the process 19.12
Example (Weighted distribution function).
t !---7 er: (t)
=
-Ji1(IFn - F) (t)w ( F(t) )
(defined to be zero if F(t) 0 or F(t) 1). For a bounded function w, the map z 1---7 z woF is continuous from .t:00 [ -oo, oo] into .t:00[ -oo, oo] and hence the weak convergence of the weighted empirical process follows from the convergence of the ordinary empirical process and the continuous-mapping theorem. Of more interest are weight functions that are unbounded at 0 or 1 , which can be used to rescale the empirical process at its two extremes -oo and oo. Because the difference F ) ( t ) converges to 0 as t -+ ±oo, the sample paths of the process t 1---7 er: (t) may be bounded even for unbounded w, and the rescaling increases our knowledge of the behavior at the two extremes. A simple condition for the weak convergence of the weighted empirical process in .t:00( -oo, oo) is that the weight function w is monotone around 0 and 1 and satisfies J01 w 2 (s) ds < oo. The square-integrability is almost necessary, because the convergence is known to fail for w ( t ) 1/ Jt ( 1 - t ) . The Chibisov-O 'Reilly theorem gives necessary and sufficient conditions but is more complicated. We shall give the proof for the case that w is unbounded at only one endpoint and decreases from w(O) oo to w(l) 0. Furthermore, we assume that F is the uni form measure on [0, 1 ] . (The general case can be treated in the same way, or by the quantile transformation.) Then the function v (s) w 2 (s ) with domain [0, 1] has an inverse v - 1 (t ) w- 1 ( Jt) with domain [0, oo]. A picture of the graphs shows that J000 w - 1 (Jl) dt f01 w 2 ( t ) dt , which is finite by assumption. Thus, given an c > 0, we can choose partitions 0 so < s 1 < < sk 1 and 0 to < t 1 < < t, oo such that, for every i , =
=
·
CIFn -
=
=
=
=
=
=
=
·
·
=
·
=
·
·
·
=
This corresponds to slicing the area under w 2 both horizontally and vertically in pieces of size c2 . Let the partition 0 u0 < u 1 < < Um 1 be the partition consisting of all points si and all points w- 1 (Jtj) . Then, for every i , =
t
See, e.g., [ 146, Theorem 2.75] .
·
·
·
=
274
Empirical Processes
It follows that the brackets have L 1 (A.)-size 2s 2 . Their square roots are brackets for the functions of interest x � w (t) l [o . tJ (x ) , and have L 2 (A.) -size V2s, because P I Ju - .J[f ::::; P l u - l l . Because the number m of points in the partitions can be chosen of the order ( 1 /s) 2 for small s, the bracketing integral of the class of functions x � w(t) l [o , 11 (x ) converges easily. 0 The conditions given by the preceding theorems are not necessary, but the theorems cover many examples. Simple necessary and sufficient conditions are not known and may not exist. An alternative set of relatively simple conditions is based on "uniform covering numbers." The covering number N (s, :F, L 2 ( Q) ) is the minimal number of L 2 ( Q)-balls of radius s needed to cover the set :F. The entropy is the logarithm of the covering number. The following theorems show that the bracketing numbers in the preceding Glivenko-Cantelli and Donsker theorems can be replaced by the uniform covering numbers sup N ( s ii F II Q , r , :F, L r ( Q) ) . Q Here the supremum is taken over all probability measures Q for which the class :F is not identically zero (and hence II F I Q , r Q p r > 0). The uniform covering numbers are relative to a given envelope function F . This is fortunate, because the covering numbers under different measures Q typically are more stable if standardized by the norm I F I Q , r of the envelope function. In comparison, in the case of bracketing numbers we consider a single distribution P , and standardization by an envelope does not make much of a difference. The uniform entropy integral is defined as =
log sup N ( s ii F II Q , 2 , :F, L 2 ( Q) ) ds. Q
Let :F be a suitably measurable class ofmeasurable functions with sup Q N (s ii F II Q , l , :F, L 1 ( Q) ) < oo for every s > 0. If P* F < oo, then :F is P -Glivenko-Cantelli. 19.13
Theorem (Glivenko-Cantelli).
19.14
Theorem (Donsker)
with J ( 1 , :F, L 2 )
< oo.
Let :F be a suitably measurable class ofmeasurablefunctions If P * F 2 < oo, then :F is P -Donsker. .
The condition that the class :F be "suitably measurable" is satisfied in most examples but cannot be omitted. We do not give a general definition here but note that it suffices that there exists a countable collection g of functions such that each f is the pointwise limit of a sequence gm in 9. t An important class of examples for which good estimates on the uniform covering numbers are known are the so-called Vapnik-Cervonenkis classes, or VC classes, which are defined through combinatorial properties and include many well-known examples. t
See, for example, [ 1 1 7], [ 120], or [ 146] for proofs of the preceding theorems and other unproven results in this section.
19.2
275
Empirical Distributions
Figure 19.2. The subgraph of a function.
Say that a collection C of subsets of the sample space X picks out a certain subset A of the finite set {xi , . . . , Xn } C X if it can be written as A {xi , . . . , Xn } n C for some C E C. The collection C is said to shatter {xi , . . . , Xn } if C picks out each of its 2n subsets. The VC index V (C) of C is the smallest n for which no set of size n is shattered by C. A collection C of measurable sets is called a VC class if its index V (C) is finite. More generally, we can define VC classes of functions. A collection F is a VC class of functions if the collection of all subgraphs { (x , t) : f(x) < t } , if f ranges over F, forms a VC class of sets in X x � (Figure 19 .2). It is not difficult to see that a collection of sets C is a VC class of sets if and only if the collection of corresponding indicator functions lc is a VC class of functions. Thus, it suffices to consider VC classes of functions. By definition, a VC class of sets picks out strictly less than 2n subsets from any set of n � V (C) elements. The surprising fact, known as Sauer's lemma, is that such a class can necessarily pick out only a polynomial number 0 (n v (C) - I ) of subsets, well below the 2n - 1 that the definition appears to allow. Now, the number of subsets picked out by a collection C is closely related to the covering numbers of the class of indicator functions { lc : C E C} in L1 (Q) for discrete, empirical type measures Q. By a clever argument, Sauer's lemma can be used to bound the uniform covering (or entropy) numbers for this class. =
19.15
There exists a universal constant K such that for any VC class F of 1 and 0 < 8 < 1, 1 r (V(.F) - I ) .FJ V( --; s�p N(8 II F II Q,n :F, L r (Q) ) :S K V (F ) (16e)
Lemma.
functions, any r
�
()
Consequently, VC classes are examples of polynomial classes in the sense that their covering numbers are bounded by a polynomial in 1 j 8. They are relatively small. The
Empirical Processes
276
upper bound shows that VC classes satisfy the entropy conditions for the Glivenko-Cantelli theorem and Donsker theorem discussed previously (with much to spare). Thus, they are P Glivenko-Cantelli and P-Donsker under the moment conditions P * F < oo and P * F 2 < oo on their envelope function, if they are "suitably measurable." (The VC property does not imply the measurability.) 19.16 Example (Cells). The collection of all cells ( -oo, t] in the real line is a VC class of index V(C) = 2. This follows, because every one-point set {xd is shattered, but no two-point set {xi , x2 } is shattered: If X I < x 2 , then the cells (-oo, t] cannot pick out {x2 } . D 19.17 Example (Vector spaces). Let :F be the set of all linear combinations L: A.i fi of a given, finite set of functions f1 , . . . , fk on X. Then :F is a VC class and hence has a finite uniform entropy integral. Furthermore, the same is true for the class of all sets {f > c} if f ranges over f and c over R For instance, we can construct :F to be the set of all polynomials of degree less than some number, by taking basis functions 1 , x , x 2 , on JR. and functions x�1 x�d more generally. For polynomials of degree up to 2 the collection of sets {f > 0} contains already all half-spaces and all ellipsoids. Thus, for instance, the collection of all ellipsoids is Glivenko-Cantelli and Donsker for any P . To prove that :F is a VC class, consider any collection of n = k + 2 points (x 1 , t 1 ) , . . . , (xn , tn) in X x JR.. We shall show this set is not shattered by :F, whence V (:F ) :S n. By assumption, the vectors (f(xi) - ti , . . . , f(xn) - tn) Y are contained in a (k + I) dimensional subspace of IR.n . Any vector a that is orthogonal to this subspace satisfies .
i : a, >O
.
•
·
·
·
i : a,
(Define a sum over the empty set to be zero.) There exists a vector a with at least one strictly positive coordinate. Then the set { (xi , ti ) : ai > 0} is nonempty and is not picked out by the subgraphs of :F. If it were, then it would be of the form { (xi , ti ) : ti < f (ti ) } for some f , but then the left side of the display would be strictly positive and the right side nonpositive. D A number of operations allow to build new VC classes or Donsker classes out of known VC classes or Donsker classes. 19.18 Example (Stability properties). The class of all complements c c, all intersections C n D, all unions C U D, and all Cartesian products C x D of sets C and D that range over VC classes C and D is VC. The class of all suprema f v g and infima f 1\ g of functions f and g that range over VC classes :F and g is VC. The proof that the collection of all intersections is VC is easy upon using Sauer 's lemma, according to which a VC class can pick out only a polynomial number of subsets. From n given points C can pick out at most O (n V(Cl ) subsets. From each of these subsets D can pick out at most O (n V(Dl ) further subsets. A subset picked out by C n D is equal to the subset picked out by C intersected with D. Thus we get all subsets by following the
Goodness-of-Fit Statistics
1 9. 3
277
two-step procedure and hence C n D can pick out at most 0 (n V(C J + V(Dl ) subsets. For large n this is well below whence C n D cannot pick out all subsets. That the set of all complements is VC is an immediate consequence of the definition. Next the result for the unions follows by combination, because C U D ( c c n Dey . The results for functions are consequences of the results for sets, because the subgraphs of suprema and infima are the intersections and unions of the subgraphs, respectively. D
2n ,
=
If :F and g possess a finite uniform entropy inte gral, relative to envelope functions and G, then so does the class :FQ of all functions x �---+ f (x)g(x), relative to the envelope function FG. More generally, suppose that ¢ : �2 �---+ � is a function such that, for given functions L 1 and L g and every x, 19.19
Example (Uniform entropy).
F
1 ¢ ( !1 (x ) , g 1 (x) ) - ¢ ( h (x), g2 (x)) I
-
I - hI
:S
L J (x) f1 (x) + L g (x) l g 1 - g2 l (x) . ¢ (f0 , g0) has a finite uniform entropy integral
Then the class of all functions ¢ (f, g ) relative to the envelope function L 1 + L g G, whenever :F and g have finite uniform entropy integrals relative to the envelopes and G . D
F F
For any fixed Lipschitz function ¢ : �2 �---+ � ' the class of all functions of the form ¢ (f, g) is Donsker, if f and g range over Donsker classes :F and g with integrable envelope functions. For example, the class of all sums f + g, all minima f 1\ g, and all maxima f v g are Donsker. If the classes :F and g are uniformly bounded, then also the products fg form a Donsker class, and if the functions f are uniformly bounded away from zero, then the functions 1 /f form a Donsker class. D 19.20
Example (Lipschitz transformations).
19.3
Goodness-of-Fit Statistics
An important application of the empirical distribution is the testing of goodness-of-fit. Because the empirical distribution $\mathbb{F}_n$ is always a reasonable estimator for the underlying distribution $P$ of the observations, any measure of the discrepancy between $\mathbb{F}_n$ and $P$ can be used as a test statistic for testing the hypothesis that the true underlying distribution is $P$. Some popular global measures of discrepancy for real-valued observations are
$$\sqrt{n}\, \|\mathbb{F}_n - F\|_\infty \quad \text{(Kolmogorov-Smirnov)}, \qquad n \int (\mathbb{F}_n - F)^2\, dF \quad \text{(Cram\'er-von Mises)}.$$
These statistics, as well as many others, are continuous functions of the empirical process. The continuous-mapping theorem and Theorem 19.3 immediately imply the following result. 19.21
Corollary. If X 1 ,
X 2 , . . . are i. i.d. random variables with distribution function
F,
then the sequences of Kolmogorov-Smirnov statistics and Cramer-von Mises statistics con verge in distribution to F and J d respectively. The distributions of these limits are the same for every continuous distribution function
IG I
00
G} F,
F.
Empirical Processes
278
The maps z 1-+ ll z ll oo and z 1-+ f z 2 (t) dt from D[-oo, oo] into !R. are continuous with respect to the supremum norm. Consequently, the first assertion follows from the continuous-mapping theorem. The second assertion follows by the change of variables F(t) 1-+ u in the representation G F G). o F of the Brownian bridge. Alternatively, use the quantile transformation to see that the Kolmogorov-Smirnov and Cramer-von Mises statistics are distribution-free for every fixed n. •
Proof.
=
It is probably practically more relevant to test the goodness-of-fit of composite null hypotheses, for instance the hypothesis that the underlying distribution $P$ of a random sample is normal, that is, it belongs to the normal location-scale family. To test the null hypothesis that $P$ belongs to a certain family $\{P_\theta : \theta \in \Theta\}$, it is natural to use a measure of the discrepancy between $\mathbb{P}_n$ and $P_{\hat\theta}$, for a reasonable estimator $\hat\theta$ of $\theta$. For instance, a modified Kolmogorov-Smirnov statistic for testing normality is
IFn
s�p v'ri
t - X) IFn (t) - ( 5-
.
For many goodness-of-fit statistics of this type, the limit distribution follows from the limit distribution of Jl1 (IF - Pe). This is not a Brownian bridge but also contains a "drift," due to e . Informally, if e I-+ Pe has a derivative Pe in an appropriate sense, then
n Jfi Wn - Pe) = vfriWn - Pe) - y'ri (Pe - Pe)T P, ( 19.22) y'ri ( IFn - Pe) - y'ri (e - e) e . By the continuous-mapping theorem, the limit distribution of the last approximation can be derived from the limit distribution of the sequence ,Jii ( IFn - Pe , 8 - 8). The first component converges in distribution to a Brownian bridge. Its joint behavior with -fo(B - 8) can most easily be obtained if the latter sequence is asymptotically linear. Assume that �
for "influence functions" 1/Je with Pe 1/Je
= 0 and Pe 11 1/Je f < oo.
19.23 Theorem. Let 1 , . . . be a random sample from a distribution Pe indexed by k () E IR. . Let :F be a Pe-Donsker class of measurable functions and let be estimators that are asymptotically linear with influence function 1/Je. Assume that the map () 1-+ Pe from IR.k
X , Xn
Bn
to f00 (:F ) is Frichet differentiable at e. t Then the sequence - Pe) converges under P G G 1/J e in distribution in f 00 (:F ) to the process f I-+ pe f - pe l e f.
-foWn
Proof.
In view of the differentiability of the map () 1-+ Pe and Lemma 2. 12,
I Pe, - Pe - (8 - () ) T Pe I ;- = (I I Bn - e I ) . Op
This justifies the approximation (19.22). The class Q obtained by adding the k components of 1/Je to :F is Donsker. (The union of two Donsker classes is Donsker, in general. In t
This means that there exists a map Pe : :F .,._.,. IR.k such that II Pe+h - Pe Chapter 20.
-
h T Pe IIF
=
o( I h II) as h
�
0 ; see
279
19.4 Random Functions
the present case, the result also follows directly from Theorem 18. 14.) The variables are obtained from the empirical process seen as an element L oo of � (9) by a continuous map. Finally, apply Slutsky's lemma. •
( JnWn -Pe), n - 1 12 1/re (X;))
The preceding theorem implies, for instance, that the sequences of modified Kolmogorov Smimov statistic converge in distribution to the supremum of a certain Gaussian process. The distribution of the limit may depend on the model 8 Hthe estimators and even on the parameter value 8. Typically, this distribution is not known in closed form but has to be approximated numerically or by simulation. On the other hand, the limit distribution of the true Kolmogorov-Srnimov statistic under a continuous distribution can be derived from properties of the Brownian bridge, and is given byt
en,
.Jfi l lFn - Fe l oo
P(I I GJcl l oo >
Fe ,
x)
=
2
00
( - 1)1+ 1e -2/x2 • L }=1
With the Donsker theorem in hand, the route via the Brownian bridge is probably the most convenient. In the 1940s Smirnov obtained the right side as the limit of an explicit expression for the distribution function of the Kolmogorov-Smirnov statistic.
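The series for $\mathrm{P}(\|\mathbb{G}_\lambda\|_\infty > x) = 2\sum_{j \ge 1} (-1)^{j+1} e^{-2j^2x^2}$ can be checked against a direct simulation of the Kolmogorov-Smirnov statistic (an illustrative sketch; the sample size, the number of replications, and the choice $x = 1$ are arbitrary):

```python
# Sketch: simulated tail of sqrt(n)||F_n - F||_infty under a continuous F
# versus the limiting series for the supremum of the Brownian bridge.
import numpy as np

rng = np.random.default_rng(5)

def ks_statistic(u):                        # u: uniform(0, 1) sample (continuous F case)
    u = np.sort(u)
    n = len(u)
    return np.sqrt(n) * max((np.arange(1, n + 1) / n - u).max(),
                            (u - np.arange(0, n) / n).max())

def limit_tail(x, terms=100):
    j = np.arange(1, terms + 1)
    return 2.0 * np.sum((-1.0) ** (j + 1) * np.exp(-2.0 * j ** 2 * x ** 2))

n, reps, x = 1000, 20_000, 1.0
stats = np.array([ks_statistic(rng.uniform(size=n)) for _ in range(reps)])
print(np.mean(stats > x), limit_tail(x))    # both should be close to about 0.27
```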
19.4
Random Functions
The language of Glivenko-Cantelli classes, Donsker classes, and entropy appears to be convenient to state the "regularity conditions" needed in the asymptotic analysis of many statistical procedures. For instance, in the analysis of Z- and M -estimators, the theory of empirical processes is a powerful tool to control remainder terms. In this section we consider the key element in this application: controlling random sequences of the form for functions that change with and depend on an estimated parameter. If a class :F of functions is -Glivenko-Cantelli, then the difference JPl converges to zero uniformly in varying over :F, almost surely. Then it is immediate that also ....... "' "' as JPl -+ 0 for every sequence of random functions that are contained in :F. If converges almost surely to a function and the sequence is dominated (or uniformly "" "' as as integrable), so that p -+ p then it follows that JPl -+ p Here by "random functions" we mean measurable functions x H- j (x ; w) that, for every fixed x , are real maps defined on the same probability space as the observations (w . . . (w) . In many examples the function (x ) (x ; is a func tion of the observations, for every fixed x . The notations JPl j and j are abbreviations for the expectations of the functions x H- j (x ; w) with w fixed. A similar principle applies to Donsker classes of functions. For a Donsker class :F, the empirical process converges in distribution to a Brownian bridge process p "uni formly in E :F." In view of Lemma 1 8 . 1 5 , the limiting process has uniformly continuous sample paths with respect to the variance semimetric. The uniform convergence combined with the continuity yields the weak convergence j for every sequence j of p random functions that are contained in :F and that converges in the variance semimetric to a function
2::7= 1 fn, e" (X;)
I n f - Pf I fn n n
f
fn
fo '
X 1 ), , Xn f
n
n, e P!
fo
fn
n fn
fo.
n fn fn X 1 , . . . , Xn) nn Pn =
Gn f
n
Gf
P-
Gn n G fo -v--7
fo.
t
I n f-Pf I
See, for instance, [42, Chapter 12] , or [134] .
n
280
Empirical Processes
n
Suppose that :F is a P -Donsker class of measurable functions and j is a sequence of randomfunctions that take their values in F such that J fo ( ) 2 d P ( converges in probability to 0 for some fo E L 2 (P). Then fo) 0 and hence Gp fo19.24
Gnfn
Lemma.
(fn (x)- x) GnCfn - !,.
'V'-7
x)
Assume without of loss of generality that fo is contained in :F. Define a function g : £00(:F ) X :F � lR by g(z, f) z(f) - z(fo). The set :F is a semimetric space relative to the L 2 (P)-metric. The function g is continuous with respect to the product semimetric at every point (z , f) such that f � z(f) is continuous. Indeed, if (z , f) in the z(f) if z z uniformly and hence z f + o( space i00 (:F ) x :F, then is continuous at f. By assumption, fo as maps in the metric space :F. Because :F is Donsker, in the space i00 (:F ) , and it follows that ( , ( p , fo in the space £00 (:F ) x :F. By Lemma almost all sample paths of Gp are continuous on :F. Thus the function g is continuous at almost every point (Gp , f0). By the continuous-mapping theorem, g(Gp , fo) 0. The lemma follows, because fo) convergence in distribution and convergence in probability are the same for a degenerate limit. •
Proof
=
Zn -+ fn !,. Gn 'V'-?Gp 18.15, GnCfn - g(Gn, fn ) =
Czn, fn) -+ Zn C fn) C n) 1) -+ =
Gn fn ) 'V'-? G )
'V'-7
=
The preceding lemma can also be proved by reference to an almost sure representation for the converging sequence G p . Such a representation, a generalization of Theorem exists. However, the correct handling of measurability issues makes its application involved.
Gn
2.19
19.25
sample
'V'-7
Example (Mean absolute deviation).
The mean absolute deviation of a random
X 1 , . . . , Xn is the scale estimator n 1 Mn -n 2) i= l Xi - Xn l =
The absolute value bars make the derivation of its asymptotic distribution surprisingly difficult. (Try and do it by elementary means.) Denote the distribution function of the observations by and assume for simplicity of notation that they have mean equal to zero. We shall write for the stochastic process � and use the notations and in a similar way. If < oo , then the set of functions � with ranging over a compact, such as [ is Donsker by Example Because, by the triangle inequality, - p p 0, the preceding lemma shows that -+ 0. ::S This can be rewritten as
Fx 2 - 1, 1], Xn l - l x l ) 2
F, Fx IFn l x - 81 8 n - 1 L:7= 1 1 X i - 8 1 , Gn l x - 81 F i x - 81 x x 81 8 F19. 7 . l F( lx Gnl x - Xn l - Gnl x l 1 X n l 2 -+
8 F l x - 8 I is differentiable at 0, then, with the derivative written in the form 2F(O) - 1, the first term on the right is asymptotically equivalent to (2F(O) - 1 ) Gnx, by the delta method. Thus, the mean absolute deviation is asymptotically normal with mean zero and asymptotic variance equal to the variance of ( 2F(O) - 1 ) X 1 + I X 1 1. If the mean and median of the observations are equal (i.e., F(O) � ), then the first term If the map
�
=
is 0 and hence the centering of the absolute values at the sample mean has the same effect
19.4 Random Functions
28 1
as centering at the true mean. In this case not knowing the true mean does not hurt the scale estimator. In comparison, for the sample variance this is true for any F. D Perhaps the most important application of the preceding lemma is to the theory of Z estimators. In Theorem 5.21 we imposed a pointwise Lipschitz condition on the maps 1---+ 1/Je to ensure the convergence 5.22:
e
In view of Example 19.7, this is now seen to be a consequence of the preceding lemma. The display is valid if the class of functions 1/Je : I I < 8 } is Donsker for some 8 > 0 and 1/Je 1/Je0 in quadratic mean. Imposing a Lipschitz condition is just one method to ensure these conditions, and hence Theorem 5.21 can be extended considerably. In particular, in its generalized form the theorem covers the sample median, corresponding to the choice 1/Je (x ) The sign functions can be bracketed just as the indicator functions sign(x of cells considered in Example 19.6 and thus form a Donsker class. For the treatment of semiparametric models (see Chapter 25), it is useful to extend the results on Z-estimators to the case of infinite-dimensional parameters. A differentiability or Lipschitz condition on the maps 1---+ 1/Je would preclude most applications of interest. However, if we use the language of Donsker classes, the extension is straightforward and useful. If the parameter ranges over a subset of an infinite-dimensional normed space, then we use an infinite number of estimating equations, which we label by some set H and assume to be sums. Thus the estimator en (nearly) solves an equation JPln 1/Je , h 0 for every h E H . We assume that, for every fixed x and the map h 1---+ 1/Je, h (x ), which we denote by 1/Je (x ), is uniformly bounded, and the same for the map h 1---+ P l/le , h , which we denote by P l/Je .
{ e - e0
---+
e) .
=
e
e
=
e,
19.26
Theorem.
e
For each in a subset G ofa normed space and every h in an arbitrary set
e - eo
{
H' let X 1---+ 1/le , h (x) be a measurable function such that the class 1/le , h : II II < 8 ' h E H } is P -Donsker for some 8 > 0, with finite envelope function . Assume that, as a map into £00 (H), the map 1---+ P l/Je is Frechet-diffe rentiable at a zero with a derivative V : lin G 1---+ £00(H) that has a continuous inverse on its range. Furthermore, assume that
e II P C1/!e , h - 1/Jeo h ) 2 I H ---+ O as e ---+ eo. lf ii Pn l/JeJ H V Jn (en - eo) = ,
eo,
=
O p (n - 1 12 ) and en
� eo, then
-Gn l/le0 + op ( 1 ) .
This follows the same lines as the proof of Theorem 5 .2 1 . The only novel aspect is that a uniform version of Lemma 19.24 is needed to ensure that Gn ( 1/Je - 1/le0 ) converges to zero in probability in £00(H). This is proved along the same lines. Assume without loss of generality that en takes its values in Go E 8: < 8} and define a map g : £00 (Go X H ) X G o 1---+ £00 (H) by g(z, e)h z ( , h) - z ( , h). This map is continuous at every point (z ' such that I z ( h) - z h) I H 0 as The sequence (Gn 1/Je ' en ) converges in distribution in the space £00 (Go X H) X Go to a pair (Gl/Je , As we have that suph P ( 1/le, h - 1/le0, h ) 2 0 by assumption, and thus II G 1/fe - G1/fe0 I H 0 almost surely, by the uniform continuity of the sample paths of the Brownian bridge. Thus, we can apply the continuous-mapping theorem and conclude that g (Gn l/Je , en ) � g (Gl/Je , 0, which is the desired result. •
Proof.
"
=
eo )
eo). e ---+ eo , ---+
eo) =
e'
=
---+
(eo '
{e e
1 e -e01 1 eo ---+ e ---+ eo.
282
Empirical Processes
19.5
Changing Classes
The Glivenko-Cantelli and Donsker theorems concern the empirical process for different n, but each time with the same indexing class :F. This is sufficient for a large number of applications, but in other cases it may be necessary to allow the class :F to change with n. For instance, the range of the random function fn in Lemma 1 9.24 might be different for every n . We encounter one such a situation in the treatment of M -estimators and the likelihood ratio statistic in Chapters 5 and 1 6, in which the random functions of interest Jli (m 0 - me0 ) - Jli({}n - 8o)me0 are obtained by rescaling a given class of functions. It turns out that the convergence of random variables such as Gn jn does not require the ranges :Fn of the functions jn to be constant but depends only on the sizes of the ranges to stabilize. The nature of the functions inside the classes could change completely from n to n (apart from a Lindeberg condition). Directly or indirectly, all the results in this chapter are based on the maximal inequalities obtained in section 19.6. The most general results can be obtained by applying these inequalities, which are valid for every fixed n, directly. The conditions for convergence of quantities such as Gn fn are then framed in terms of (random) entropy numbers. In this section we give an intermediate treatment, starting with an extension of the Donsker theorems, Theorems 19.5 and 19. 14, to the weak convergence of the empirical process indexed by classes that change with n. Let :Fn be a sequence of classes of measurable functions fn , t : X 1-+ JR. indexed by a parameter t, which belongs to a common index set T. Then we can consider the weak convergence of the stochastic processes t 1-+ Gn fn t as elements of f 00(T) , assuming that the sample paths are bounded. By Theorem 1 8 . 1 4 weak convergence is equivalent to marginal convergence and asymptotic tightness. The marginal convergence to a Gaussian process follows under the conditions of the Lindeberg theorem, Proposition 2.27. Sufficient conditions for tightness can be given in terms of the entropies of the classes :Fn . We shall assume that there exists a semimetric p that makes T into a totally bounded space and that relates to the L 2 -metric in that (19.27) sup P n .s - fn t ) 2 ---+ o, every On + 0. "
,
p (s , t ) < 8n
U
,
Furthermore, we suppose that the classes :Fn possess envelope functions Fn that satisfy the Lindeberg condition
P F};
=
0(1), every E: > 0.
Then the central limit theorem holds under an entropy condition. As before, we can use either bracketing or uniform entropy.
Let :Fn Un , t : t E T} be a class of measurable functions indexed by a totally bounded semimetric space (T, p) satisfying ( 19.27) and with envelope function that satisfies the Lindeberg condition. If l[ J (on , :Fn , L 2 (P) ) ---+ O for every On + 0, or alternatively, every :Fn is suitably measurable and J (on , :Fn , L 2 ) ---+ 0 for every On + 0, then the sequence {Gn fn , t : t E T} converges in distribution to a tight Gaussian process, provided the sequence of covariance functions Pfn , s fn , t - Pfn , s Pfn , t converges pointwise on T x T . 19.28
Theorem.
=
19.5 Changing Classes
283
Under bracketing the proof of the following theorem is similar to the proof of Theorem 19.5. We omit the proof under uniform entropy. For every given o > 0 we can use the semimetric p and condition to partition T into finitely many sets T1 , . . . , Tk such that, for every sufficiently large
Proof.
(19. 27) n,
sup sup P (fn,s - fn,r) 2 < o 2 .
i s, t ET,
(This is the only role for the totally bounded semimetric p ; alternatively, we could assume the existence of partitions as in this display directly.) Next we apply Lemma to obtain the bound
19. 3 4
19. 3 4
Here an (o) is the number given in Lemma evaluated for the class of functions Fn - Fn and Fn is its envelope, but the corresponding number and envelope of the class Fn differ only by constants. Because l[ On , Fn , L 2 ( P)) ----+ 0 for every On .,!.. 0, we must have that for every o > 0 and hence an (o) is bounded away from zero. l[ J (o , Fn , L 2 (P)) = Then the second term in the preceding display converges to zero for every fixed o > 0, by the Lindeberg condition. The first term can be made arbitrarily small as ----+ by choosing o small, by assumption. •
0(1)
l(
n
oo
1ca , a +t8n l (1/n)# (Xi
19.29 Example (Local empirical measure). Consider the functions fn, t = rn for t ranging over a compact in R say [0, a fixed number a, and sequences On .,!.. 0 and rn ----+ This leads to a multiple of the local empirical measure IT!n fn.r = E (a , a + ton]), which counts the fraction of observations falling into the shrinking intervals
1],
oo.
(a, a + ton].
Assume that the distribution of the observations is continuous with density p. Then
P/;, , = r; P (a, a + ton ] = r; p(a)ton + o(r; on) ·
1.
1.
Thus, we obtain an interesting limit only if r;on From now on, set r;on = Then the variance of every Gn fn,r converges to a nonzero limit. Because the envelope function is Fn = fn, l • the Lindeberg condition reduces to r; P (a , a + On l l rn >cJll ----+ 0, which is true provided non ----+ This requires that we do not localize too much. If the intervals become too small, then catching an observation becomes a rare event and the problem is not within the domain of normal convergence. The bracketing numbers of the cells with t E [0, are of the order O (on fs 2 ) . 2 Multiplication with rn changes this in 0 s ). Thus Theorem applies easily, and we conclude that the sequence of processes t �----+- Gn fn,r converges in distribution to a Gaussian process for every On .,!.. 0 such that non ----+ The limit process is not a Brownian bridge, but a Brownian motion process, as follows by computing the limit covariance of ( Gn fn, s • Gn fn,1). Asymptotically the local empirical process "does not know" that it is tied down at its extremes. In fact, it is an interesting exercise to check that two different local empirical processes (fixed at two different numbers a and b) converge jointly to two independent Brownian motions. 0 "'
oo.
1c , 8 l (1/a a+t n
1] 19. 2 8
oo .
16,
In the treatment of M -estimators and the likelihood ratio statistic in Chapters 5 and we encountered random functions resulting from rescaling a given class of functions. Given
Empirical Processes
284
functions x � m e (x ) indexed by a Euclidean parameter e, we needed conditions that ensure that, for a given sequence r -+ and any random sequence o;(l),
n
hn
oo
=
(19.30)
We shall prove this under a Lipschitz condition, but it should be clear from the following proof and the preceding theorem that there are other possibilities.
)
19.31 Lemma. For each e in an open subset of Euclidean space let x � m e (x be a measurable function such that the map e � m e (x ) is differentiable at eo for almost every x (or in probability) with derivative meo (x ) and such that,for every e l and in a neighborhood < of eo, and for a measurable function such that
Pm 2
m
e2
oo,
(19. 3 0) is validfor every random sequence hn that is bounded in probability. The random variables CGn ( rn (meo + h / rn - me0 ) - h T m e0 ) have mean zero and their variance converges to 0, by the differentiability of the maps e m e and the Lipschitz con Then
Proof
�
dition, which allows application of the dominated-convergence theorem. In other words, this sequence seen as stochastic processes indexed by h converges marginally in distribu tion to zero. Because the sequence is bounded in probability, it suffices to strengthen this to uniform convergence in h ::=:: This follows if the sequence of processes con verges weakly in the space £00 h : h ::::: because taking a supremum is a continuous operation and, by the marginal convergence, the weak limit is then necessarily zero. By Theorem we can confine ourselves to proving asymptotic tightness (i.e., condition (ii) of this theorem). Because the linear processes h � h T are trivially tight, we may concentrate on the processes h � r (meo + h / rn - m e , the empirical process indexed by the classes of functions r M 1 ; r for M o = m e - meo : 11 e - eo ll ::::: B y Example the bracketing numbers of the classes of functions M 8 satisfy
hn I I 1. ( I I 1),
18.14,
CGn ( n
n
19. 7 , N[J(88I I m i P, 2 ,
{
" '
CGnm e )0 ) 0
8}.
L 2 (P) ) c(l) 0 8 8. The constant C is independent of 8 and 8. The function M8 8m is an envelope function of M8. The left side also gives the bracketing numbers of the rescaled classes 8 I 8 relative to the envelope functions M818 m. Thus, we compute d Log (l) Log C d 8 . The right side converges to zero as 8n 0 uniformly in 8. The envelope functions M8 I 8 m also satisfy the Lindeberg condition. The lemma follows from Theorem 19. 2 8. :S
Mo ,
d'
<
=
<
M
=
+
+
=
•
Maximal Inequalities
19.6
The main aim of this section is to derive the maximal inequality that is used in the proofs of Theorems and We use the notation :S for "smaller than up to a universal constant" and denote the function v log x by Log x .
19. 5 19. 2 8.
1
285
19. 6 Maximal Inequalities
A maximal inequality bounds the tail probabilities or moments of a supremum of random variables. A maximal inequality for an infinite supremum can be obtained by combining two devices: a chaining argument and maximal inequalities for finite maxima. The chaining argument bounds every element in the supremum by a (telescoping) sum of small deviations. In order that a sum of small terms is small, each of the terms must be exponentially small. So we start with an exponential inequality. Next we apply this to obtain bounds on finite suprema, and finally we derive the desired maximal inequality.
19.32
Lemma (Bernstein 's inequality).
For any bounded, measurable function
ft
P( I Gnf l > x ) < 2 exp ( - -41 Pf2 + xxl 2fl l oo / v'n ) ' every x > 0. The leading term 2 results from separate bounds on the right and left tail probabil ities. It suffices to bound the right tail probabilities by the exponential, because the left tail inequality follows from the right tail inequality applied to -f. By Markov's inequality, for P
_
Proof.
every A > 0,
by Fubini's theorem and next developing the exponential function in its power series. The term for k 1 vanishes because 0, so that a factor 1 j n can be moved outside the sum. We apply this inequality with the choice
P (f - Pf)
=
=
k A 1 A� - 2 A I PCf - Pf) k l Pf2 (21 1 f l oo ) k 2 , n ( ) 00 1 1 1 P(Gnf > x) e -'-x 1 + -n L=l k . -A2 x kn e - 2 1 and (1 + aY e , the right side of this inequality is
Next, with A 1 and A 2 defined as in the preceding display, we insert the bound A and we obtain ::=: and use the inequality ::S
�
Because L: O ! k!) ::=: ::=: ::=: a bounded by exp(-Ax/2), which is the exponential in the lemma. 19.33
For any finite class elements,
Lemma.
::=:
•
:F of bounded, measurable, square-integrable func
I :F I EP I Gn i F ::S maxf l fv lnoo log ( 1 + I :F I ) + maxf ll f ii P.2 Jlog ( 1 + I :F I ) . Define a 24 1 f l oo /v'n and b 24Pf2 . For x bja and2 x bja the exponent in Bernstein ' s inequality is bounded above by - 3x ja and - 3x jb, respectively.
tions, with
r;:
Proof.
t
=
=
:::::
::=:
The constant 1 /4 can be replaced by 1 /2 (which is the best possible constant) by a more precise argument.
Empirical Processes
286
{ l x --) , P( I AJ I > x ) 2 exp ( -a3x
For the truncated variables A f Gn f l I Gn f > bla } and B f Bernstein ' s inequality yields the bounds, for all > 0, =
=
{ l
Gn f l I Gn f
:::S
b la } ,
:::S
Combining the first inequality with Fubini's theorem, we obtain, with 1/fP
P
(x) exp x - 1 , l A ) E { lAt l!a ex dx t"' P(I AJ I > x a ) ex dx ::::; 1 . Eo/ 1 ( -!Jo Jo 1
=
=
=
B y a similar argument w e find that Eo/2 ( I B f I I -Jb) ::::; 1 . Because the function o/ 1 is convex and nonnegative, we next obtain, by Jensen's inequality,
Because 1/f[ 1 (u) log(l + u) is increasing, we can apply it across the display, and find a bound on E max I A f I that yields the first term on the right side of the lemma. An analogous inequality is valid for max f I B f I I -Jb, but with 1/f2 instead of o/1 . An application of the triangle inequality concludes the proof. • =
For any class F of measurable functions f : X f---+ JR. such that Pf 2 < 8 2 for every f, we have, with a (8) 8 1 JLog N[ J ( 8 , F, L 2 ( P ) ) , and F an envelopefunction, 19.34
Lemma.
=
l
Because I Gn f :::S JflC JPln for F an envelope function of F,
Proof.
+
g
P ) for every pair of functions I f I
:::S
g, we obtain,
The right side is twice the second term in the bound of the lemma. It suffices to bound E* I Gn f F :::S ,Jna ( 8) } I F by a multiple of the first term. The bracketing numbers of the class of functions f F ::::; a(8)Jl1} if f ranges over F are smaller than the bracketing numbers of the class F. Thus, to simplify the notation, we can assume that every f E F is bounded by ,Jna(8). Fix an integer q0 such that 48 ::::; 2 - qo ::::; 88. There exists a nested sequence of partitions F U � 1 :Fq i of F, indexed by the integers q ::::_ q0 , into Nq disjoint subsets and measurable functions D.q i 2F such that
{
{
=
::::;
sup I f
f,g E:Fq,
- gl
:::S
D. q i ,
To see this, first cover F with minimal numbers of L 2 (P)-brackets of size 2- q and re place these by as many disjoint sets, each of them equal to a bracket minus "previous" brackets. This gives partitions that satisfy the conditions with D.q i equal to the difference
287
19. 6 Maximal Inequalities
of the upper and lower brackets. If this sequence of partitions does not yet consist of suc cessive refinements, then replace the partition at stage q by the set of all intersections of the form This gives partitions into N sets. Using the inequal 1 2 1 1 2 ity ( log 0 NP ) ::; I: Clog Np ) 1 and rearranging sums, we see that the first of the two displayed conditions is still satisfied. Choose for each q � q0 a fixed element from each partitioning set and set
n�=qo Fp , ip .
q Nq0 Nq =
•
•
•
fqi Fqi, l:!..q f l:!..q i• if f E Fq i· Then nq f and l:!.. q f run through a set of Nq functions if f runs through F. Define for each fixed n and numbers and indicator functions aq 2 -q /)Log Nq + 1 · Aq - 1 ! 1{ l:!..q0 j ,Jnaq0 , , l:!..q - 1 ! ,Jnaq_I}, Bqf 1{ 1:!..qo f ,Jnaq0 , , l:!.. q - d ,Jnaq - 1 , l:!..q f > ,Jnaq} . Then Aq f and Bq f are constant in f on each of the partitioning sets Fqi at level because the partitions are nested. Our construction of partitions and choice of also ensure that 2a(8) ::; aq0 , whence Aq0 f 1 . Now decompose, pointwise in (which is suppressed in =
q � q0
=
:S
=
:S
=
:S
•••
:S
•••
q,
q0
x
=
the notation),
00
00
f - nqo f qLo +1 C! - nqf)Bqf + qLo + 1 (nqf - nq - 1 j)Aq - 1 f. The idea here is to write the left side as the sum of f - nq1 f and the telescopic sum L��+ 1 (nqf - lfq - 1 f) for the largest 1 1 (f, such that each of the bounds l:!.. q f on the "links" nq f - lfq - 1 f in the "chain" is uniformly bounded by ,Jriaq (with q 1 possibly infinite). We note that either all Bq f are 1 or there is a unique 1 > with Bq1 f 1 . In the first case Aq f 1 for every in the second case Aq f 1 for 1 and Aq f 0 for Next we apply the empirical process Gn to both series on the right separately, take absolute values, and next take suprema over f E F. We shall bound the means of the resulting two variables. First, because the partitions are nested, l:!.. q f Bq f ::; l:!.. q - 1 f Bq f ,Jriaq_ 1 trivially P(l:!.. q f) 2 Bqf 2 -2q . Because I Gnf l Gng + 2,JriPg for every pair of functions l f l ::; g, we obtain, by the triangle inequality and next Lemma 19.33, E * qLo + 1 Gn U - nqf)Bqf qLo 1 E* I Gn l:!..q J Bqf i F + qLo + 1 2v'li i P I:!..q J Bq J I F F + =
q
=
q
x)
q
q;
=
q � q1 .
=
00
:S
00
=
=
:S
:S
:S
q0 q < q
00
In view of the definition of aq , the series on the right can be bounded by a multiple of the series I::+ 1 JLog N .
2 -q
q
Empirical Processes
288
nq -nq - 1
Second, there are at most Nq functions f f and at most Nq - 1 indicator functions A q _ 1 f. Because the partitions are nested, the function is bounded is bounded by ::::=: ,fii The L 2 (P)-norm of by Apply Lemma 19.33 to find
l:!..q - d Aq - d
aq - I ·
00
nqf - :rrq - If i A q - If 2-q + 1 l . l nqf - nq - 1 f l
00
[aq - 1 Log Nq + 2 - q JLog Nq ] . L qo + 1 Again this is bounded above by a multiple of the series 2.:= ,:+ 1 2 -q JLog Nq . To conclude the proof it suffices to consider the terms nq 0 f. Because I nqo f I F a (o) ,fii ,fiiaq0 and P(nq0 f) 2 8 2 by assumption, another application of Lemma 19 .33 yields
qLo + 1 Gn (nqf - TCq - 1 f)Aq - 1 f
E*
::::=:
F
<
::::=:
::::=:
::::=:
By the choice of q0 , this is bounded by a multiple of the first few terms of the series JLog Nq . •
2.:=:+ 1 2-q
19.35
Corollary.
For any class :F of measurable functions with envelope function F,
)
Because :F is contained in the single bracket [-F, F], we have Nl 1 ( o, :F, L 2 (P) 1 for 8 P, 2 . Then the constant a (o) as defined in the preceding lemma reduces to a multiple of P, 2 , and ,fii P * F F > ,fiia (o) } is bounded above by a multiple of F P, 2 , by Markov's inequality. •
Proof
=
21 Fl I Fl
{
=
I I
The second term in the maximal inequality Lemma 19.34 results from a crude majoriza tion in the first step of its proof. This bound can be improved by taking special properties of the class of functions :F into account, or by using different norms to measure the brackets. The following lemmas, which are used in Chapter exemplify this. t The first uses the L 2 (P)-norm but is limited to uniformly bounded classes; the second uses a stronger norm, which we call the "Bernstein norm" as it relates to a strengthening of Bernstein's inequality. Actually, this is not a true norm, but it can be used in the same way to measure the size of brackets. It is defined by
25,
l f i �, B
19.36
and
=
2P (e lfl - 1 -
l fl) .
For any class :F of measurable functions f : X M- lR such that Pf 2 ::::=: M for every f,
Lemma.
l f l oo
t For a proof of the following lemmas and further results, see Lemmas 3.4.2 and Also see [ 1 4], [ 1 5], and [5 1 ] .
<
82
3.4.3 and Chapter 2. 1 4, in [ 1 46]
289
Problems
19.37 <
Lemma.
8 for every f,
I I P, B
For any class :F of measurable functions f : X � JR. such that f
E* I Gn I p
< I[ ] ( 8 , :F
F ""
, I · I P, B ) ( 1 + 1[ 1 (8 , 8:F,2 .jfiI · I P, B ) ) .
Instead of brackets, we may also use uniform covering numbers to obtain maximal inequalities. As is the case for the Glivenko-Cantelli and Donsker theorem, the inequality given by Corollary has a complete uniform entropy counterpart. This appears to be untrue for the inequality given by Lemma for it appears difficult to use the information that a class :F is contained in a small L 2 (P)-ball directly in a uniform entropy maximal inequality. t
19. 3 5
19.38
Lemma.
we have, with
19 . 3 4,
For any suitably measurable class :F ofmeasurable functions f : X � JR., 2 2 jEF 1P'n f f 1P'n F ,
8'j; = sup
Notes
The law of large numbers for the empirical distribution function was derived by Glivenko and Cantelli in the The Kolmogorov-Smirnov and Cramer-von Mises statistics were introduced and studied in the same period. The limit distributions of these statistics were obtained by direct methods. That these were the same as the distribution of corresponding functions of the Brownian bridge was noted and proved by Doob before Donsker formalized the theory of weak convergence in the space of continuous func tions in Donsker's main examples were the empirical process on the real line, and the partial sum process. Abstract empirical processes were studied more recently. The bracketing central limit presented here was obtained by Ossiander and the uniform entropy central limit theorem by Pollard and Kolcinskii In both cases these were generalizations of earlier results by Dudley, who also was influential in developing a theory of weak convergence that can deal with the measurability problems, which were partly ignored by Donsker. The maximal inequality Lemma was proved in The first Vapnik-Cervonenkis classes were considered in For further results on the classical empirical process, including an introduction to strong approximations, see For the abstract empirical process, see and For connections with limit theorems for random elements with values in Banach spaces, see
[59]
[19]
1930s.
[38] 1952.
[116]
[111] [88].
19. 3 4 [147].
[146].
[134] .
[119].
[57], [117], [120]
[98].
PROBLEMS 1. Derive a formula for the covariance function of the Gaus sian process that appears in the limit of the modified Kolmogorov-Smirnov statistic for estimating normality.
t
For a proof of the following lemma, see, for example, [120] , or Theorem 2.14. 1 in [ 1 46].
Empirical Processes
290
2. Find the covariance function of the Brownian motion process.
3. If Z is a standard Brownian motion, then
Z(t) - tZ(l) is a Brownian bridge.
4. Suppose that X 1 , . . . , X m and Y1 , . . . , Yn are independent samples from distribution functions F and G, respectively. The Kolmogorov-Smirnov statistic for testing the null hypothesis Ho : F = G is the supremum distance Km,n = l iiFm - Gn l l oo between the empirical distribution functions of the two samples. (i) Find the limit distribution of Km ,n under the null hypothesis. (ii) Show that the Kolmogorov-Smirnov test is asymptotically consistent against every alterna tive F -=/= G .
(iii) Find the asymptotic power function a s a function of (g , h ) for alternatives (Fg / Fni• Gh / -.fii ) belonging to smooth parametric models r-+ Fe and r-+ Ge .
e
e
5. Consider the clas s of all functions f : [0, 1 ] r-+ [0, 1 ] such that I J (x ) - f (y) l .::: lx - y l . Construct a set of .s -brackets for this class of functions of cardinality bounded by exp (C / .s ) . 6. Determine the V C index of
(i) The collection of all cells (a , b] in the real line; (ii) The collection of all cells ( -oo, t] in the plane;
(iii) The collection of all translates { 1/1 ( · -
e) : e E JR;,} of a monotone function 1/f : JR;,
r-+
R
7. Suppose that the clas s of functions :F is VC. Show that the following classes are VC as well: (i) The collection of sets { f > 0 } as f ranges over :F; (ii) The collection of functions x
r-+
(iii) The collection of functions x
r-+
f (x) + g (x ) as f ranges over :F and g is fixed; f (x ) g (x) as f ranges over :F and g is fixed.
8. Show that a collection of sets is a VC class of sets if and only if the corresponding class of
indicator functions is a VC class of functions.
9. Let Fn and F be distribution functions on the real line. Show that: (i) If Fn (x ) __,. F (x ) for every x and F is continuous, then II Fn - F II 00 (ii) If Fn (x )
__,.
F (x) and Fn {x}
__,.
F{x } for every x, then II Fn - F l l oo
__,. __,.
0. 0.
10. Find the asymptotic distribution of the mean absolute deviation from the median.
20 Functional Delta Method
The delta method was introduced in Chapter 3 as an easy way to turn the weak convergence of a sequence of random vectors rn (Tn - 8) into the weak convergence of transformations ofthe type rn ( ¢ ( Tn) - ¢ (8)). It is useful to apply a similar technique in combination with the more powerful convergence ofstochastic processes. In this chapter we consider the delta method at two levels. The first section is of a heuristic character and limited to the case that Tn is the empirical distribution. The second section establishes the delta method rigorously and in general, completely parallel to the delta method for for Hadamard differentiable maps between normed spaces.
IRk ,
20.1
von Mises Calculus
Let JP>n be the empirical distribution of a random sample X 1 , . . . , Xn from a distribution P . Many statistics can be written in the form ¢ ( 1P'n) , where ¢ is a function that maps every distribution of interest into some space, which for simplicity is taken equal to the real line. Because the observations can be regained from lP'n completely (unless there are ties), any statistic can be expressed in the empirical distribution. The special structure assumed here is that the statistic can be written as a fixed function ¢ of lP'n , independent of n, a strong assumption. Because lP'n converges to P as n tends to infinity, we may hope to find the asymptotic behavior of ¢ (1P'n) - ¢ (P) through a differential analysis of ¢ in a neighborhood of P . A first-order analysis would have the form
�
¢ (1P'n ) - ¢ (P) = ¢ (1P'n - P) +
·
·
·
,
where ¢� is a "derivative" and the remainder is hopefully negligible. The simplest approach towards defining a derivative is to consider the function t H- ¢ ( P + t H) for a fixed perturbation H and as a function of the real-valued argument t. If ¢ takes its values in JR., then this function is just a function from the reals to the reals. Assume that the ordinary derivatives of the map t H- ¢ ( P + t H) at t = 0 exist for k = 1 , 2, . . . , m . Denoting them by ¢ � ) ( H ) , we obtain, by Taylor's theorem,
k
29 1
292
Functional Delta Method
Substituting t 1/.Jfi and H Gn , for Gn .J1i (Pn - P) the empirical process of the observations, we obtain the von Mises expansion =
=
=
Actually, because the empirical process Gn is dependent on n, it is not a legal choice for H under the assumed type of differentiability: There is no guarantee that the remainder is smalL However, we make this our working hypothesis. This is reasonable, because the remainder has one factor 1/.Jfi more, and the empirical process Gn shares at least one property with a fixed H : It is "bounded." Then the asymptotic distribution of ¢ (IFn) - ¢ (P) should be determined by the first nonzero term in the expansion, which is usually the first order term ¢� (Gn)· A method to make our wishful thinking rigorous is discussed in the next section. Even in cases in which it is hard to make the differentation operation rigorous, the von Mises expansion still has heuristic value. It may suggest the type of limiting behavior of ¢ (Pn) - ¢ (P), which can next be further investigated by ad-hoc methods. We discuss this in more detail for the case that 1 . A first derivative typically gives a linear approximation to the original function. If, indeed, the map H � ¢� (H) is linear, then, writing IFn as the linear combination IFn n - l L ox, of the Dirac measures at the observations, we obtain m =
=
(20. 1) Thus, the difference ¢ Wn) - cp (P) behaves as an average of the independent random variables ¢� (ox, - P). If these variables have zero means and finite second moments, then a normal limit distribution of .Jfi ( ¢ Wn) - cp (P) ) may be expected. Here the zero mean ought to be automatic, because we may expect that
J ¢� (ox - P) d P (x )
=
¢�
(/ (ox - P) d P (x ) )
=
¢� (0)
=
0.
The interchange of order of integration and application of ¢� is motivated by linearity (and continuity) of this derivative operator. The function x � ¢� (o x - P) is known as the influence function of the function ¢. It can be computed as the ordinary derivative ¢� (o x - P)
=
-dtd
l t=O
¢ ( (1 - t) P + fo x ) ·
The name "influence function" originated in developing robust statistics. The function measures the change in the value ¢ (P) if an infinitesimally small part of P is replaced by a pointmass at x. In robust statistics, functions and estimators with an unbounded influence function are suspect, because a small fraction of the observations would have too much influence on the estimator if their values were equal to an x where the influence function is large. In many examples the derivative takes the form of an "expectation operator" ¢� (H) J ¢ p d H, for some function ¢ p with J ¢ p d P 0, at least for a subset of H. Then the influence function is precisely the function ¢ p . =
=
20. 1
von Mises Calculus
293
The sample mean is obtained as ¢(IP'n ) from the mean function ¢(P) J s dP(s). The influence function is ¢� (8x - P) !!__dt J s d [ (1 - t)P + t8x ] (s) x - J s d P(s). In this case, the approximation (20.1) is an identity, because the function is linear already. If 20.2
Example (Mean).
=
=
=
l t=O
the sample space is a Euclidean space, then the influence function is unbounded and hence the sample mean is not robust. D Let (X1 , Y1 ) , . . . , (Xn , Yn ) be a random sample from a bivari ate distribution. Write n and n for the empirical distribution functions of the X; and Y1 , respectively, and consider the Mann-Whitney statistic 20.3
Example (Wilcoxon).
lF
G
1 8 f; 1 { ::::; YJ } . Tn f lFn dGn This statistic corresponds to the function ¢ (F, G) J F d G, which can b e viewed as a function of two distribution functions, or also as a function of a bivariate distribution function with marginals F and G. (We have assumed that the sample sizes of the two =
=
n2
n
n
X;
=
samples are equal, to fit the example into the previous discussion, which, for simplicity, is restricted to i.i.d. observations.) The influence function is
¢(F, G) (8x . y - P) !!__dt / [ (1 - t)F + t8x ] d [ (1 - t)G + t8y] F(y) + 1 - G_(x) - 2 f F dG. The last step follows on multiplying out the two terms between square brackets: The function that is to be differentiated is simply a parabola in t. For this case (20.1) reads =
l t=O
=
12. 6 ,
From the two-sample U -statistic theorem, Theorem it is known that the difference between the two sides of the approximation sign is actually o p .Jfi) . Thus, the heuristic calculus leads to the correct answer. In the next section an alternative proof of the asymptotic normality of the Mann-Whitney statistic is obtained by making this heuristic approach rigorous. D
( 11
20.4 Example (Z-functions). For every e in an open subset of JR.\ let x 1--+ 1/Je (x) be a given, measurable map into The corresponding Z-function assigns to a probability measure a zero of the map e 1--+ (Consider only for which a unique zero exists.) If applied to the empirical distribution, this yields a Z -estimator (lP'n ) . Differentiating with respect to across the identity
P
¢(P)
IR.k .
P1/Je .
P
t
e [ !!._dt ¢(? + t8x )]
and assuming that the derivatives exist and that e
0 (!__ae P1/Je ) =
e =
1--+
¢
1/J is continuous, we find t= o
+ 1/I¢(P ) (x ) .
Functional Delta Method
294
The expression enclosed by squared brackets is the influence function of the Z-function. Informally, this is seen to be equal to
1 ( a ) - -ae P 'l/fe
B=
1/f
In robust statistics we look for estimators with bounded influence functions. Because the influence function is, up to a constant, equal to 1/f
)
¢ -¢ (P)).
The pth quantile of a distribution function F is, roughly, the 1 number ¢(F) = F - (p) such that F F - 1 (p) = p. We set Ft = ( 1 - t)F + tox , and differentiate with respect to t the identity 20.5
Example (Quantiles).
t,
This "identity" may actually be only an inequality for certain values of p, and x , but we do not worry about this. We find that 0
=
- F ( F- 1 (p)) + f(F- 1 (p) ) [ !!._dt pt- 1 (p) ] + 8x(F- 1 (p)). l t=O
The derivative within square brackets is the influence function of the quantile function and can be solved from the equation as
The graph of this function is given in Figure 20. 1 and has the following interpretation. Suppose the pth quantile has been computed for a large sample, but an additional observation x is obtained. If x is to the left of the pth quantile, then the pth quantile decreases; if x is to the right, then the quantile increases. In both cases the rate of change is constant, irrespective of the location of x . Addition of an observation x at the pth quantile has an unstable effect. p
f(F- 1 (p)) -1
F (p)
1 -p
f(F 1 (p))
Figure 20.1. Influence function of the pth quantile.
20. 1
von Mises Calculus
295
.fo (IF;;- 11 -2 F - (p) .
The von Mises calculus suggests that the sequence of empirical quantiles (t) varF variance with normal tically p)/f 1 ( p asympto is ) t In Chapter 21 this is proved rigorously by the delta method of the following section. Alter natively, a pth quantile may be viewed as an M -estimator, and we can apply the results of Chapter 5. D
F- 1 ( )
¢�(8 x) =
20.1.1
o
Higher-Order Expansions
In most examples the analysis of the first derivative suffices. This statement is roughly equivalent to the statement that most limiting distributions are normal. However, in some important examples the quadratic term dominates the von Mises expansion. The second derivative ought to correspond to a bilinear map. Thus, it is better to write it as If the first derivative in the von Mises expansion vanishes, then we expect that
¢�(H, H).
¢�(H)
h p (x, y) = � ¢� (8x Php(X, y) = 0 for
The right side is a V-statistic of degree 2 with kernel function equal to The kernel ought to be symmetric and degenerate in that every y, because, by linearity and continuity,
P, 8y - P).
J ¢� (8x - P, 8y - P) dP(x) = ¢� (/ (8x - P) dP(x), 8y - P )
= ¢� (0, 8y - P) = 0. If we delete the diagonal, then a V -statistic turns into a U -statistic and hence we can apply Theorem 12. 10 to find the limit distribution of n ( ¢(1Pn ) - cp(P) ) . We expect that x h p (x, x) P
If the function 1--+ is -integrable, then the second term on the right only contributes a constant to the limit distribution. If the function y) 1--+ y) is x integrable, then the first term on the right converges to an infinite linear combination of independent x f -variables, according to Example 12. 12.
(x,
h� (x, (P P)
The Cramer-von Mises statistic is the function 2 ¢(lFn) for ¢(F) = j(F - F0) dF0 and a fixed cumulative distribution function F0. By 20.6
Example (Cramer-von Mises).
direct calculation,
cp(F + tH) = ¢(F) + 2t J (F - Fo)H dF0 + t2 J H 2 dF0. Consequently, the first derivative vanishes at F = Fo and the second derivative is equal to ¢�0 (H) = 2 J H2 dF0. The von Mises calculus suggests the approximation
296
Functional Delta Method
This is certainly correct, because it is just the definition of the statistic. The preceding discussion is still of some interest in that it suggests that the limit distribution is nonnormal and can be obtained using the theory of V -statistics. Indeed, by squaring the sum that is hidden in G� , we see that
n n 1 n ¢ n ) = -n L L J (l x, _9 - Fo(x) )( 1 x1_.::x - Fo(x) ) dFo(x). i=l j=l In Example 12. 13 we used this representation to find that the sequence n ¢ (1Fn) (1/6) + 2:: )': 1 j - 2 rr - 2 (Z] - 1 ) for an i.i.d. sequence of standard normal variables Z1 , Z2 , . . . , if the true distribution F0 is continuous. D Cll
""-"+
20.2
Tn
Hadamard-Differentiable Functions
rn (Tn rn n,
Let be a sequence of statistics with values in a normed space lDl such that 8) converges in distribution to a limit for a given, nonrandom 8 , and given numbers -+ In the previous section the role of was played by the empirical distribution JFl which might, for instance, be viewed as an element of the normed space D[ -oo, oo] . We wish to prove that ¢ ( - ¢ (8) ) converges to a limit, for every appropriately differentiable map ¢, which we shall assume to take its values in another normed space IE. There are several possibilities for defining differentiability of a map ¢ : lDl f---+ IE between normed spaces. A map ¢ is said to be at 8 E lDl if for every fixed there exists an element ¢� E IE such that
T, Tn
oo .
rn ( Tn)
Gateaux differentiable (h) as t + 0. ¢ (8 + th) - ¢ (8) = t¢�(h) + o(t),
h
For IE the real line, this is precisely the differentiability as introduced in the preceding section. Gateaux differentiability is also called "directional differentiability," because for every possible direction in the domain the derivative value measures the direction term in of the infinitesimal change in the value of the function ¢. More formally, the the previous displayed equation means that ¢ (8) ¢ (8 + (20.7) o. as 0.
h
¢�(h)
o(t)
t�) - - ¢�(h) t -+ t+ The suggestive notation ¢�(h) for the "tangent vectors" encourages one to think of the directional derivative as a map ¢� : lDl f---+ which approximates the difference map cf;(8 + h) - ¢ (8) : lDl f---+ It is usually included in the definition of Gateaux differentiability that
I
IE,
IE.
this map ¢� : lDl f---+ IE be linear and continuous. However, Gateaux differentiability is too weak for the present purposes, and we need a stronger concept. A map ¢ : lDlq, f---+ IE, defined on a subset lDlq, of a normed space lDl that contains 8 , is called at 8 if there exists a continuous, linear map ¢� : lDl f---+ IE such that ¢ (8 + ¢ (8) as 0, every -+ 0,
Hadamard differentiable I th;> - - ¢�(h) \I E -+ t+ h1 h. (More precisely, for every h1 h such that 8 + t h1 is contained in the domain of ¢ for all small t > 0.) The values ¢�(h) of the derivative are the same for the two types -+
297
Hadamard-Differentiable Functions
20. 2
h1
of differentiability. The difference is that for Hadamard-differentiability the directions are allowed to change with (although they have to settle down eventually), whereas for Gateaux differentiability they are fixed. The definition as given requires that ¢� : lDJ 1--+ IE exists as a map on the whole of ll.l If this is not the case, but ¢� exists on a subset lDJ0 and the sequences --+ are restricted to converge to limits E lDJ0 , then ¢ is called Hadamard differentiable to this subset. It can be shown that Hadamard differentiability is equivalent to the difference in (20. 7) tending to zero uniformly for in compact subsets of lDJ. For this reason, it is also called Because weak convergence of random elements in metric spaces is intimately connected with compact sets, through Prohorov's theorem, Hadamard differ entiability is the right type of differentiability in connection with the delta method. The derivative map ¢� : lDJ 1--+ IE is assumed to be linear and continuous. In the case of finite-dimensional spaces a linear map can be represented by matrix multiplication and is automatically continuous. In general, linearity does not imply continuity. Continuity of the map ¢� : lDJ 1--+ IE should not be confused with continuity of the depen dence 8 1--+ ¢� (if ¢ has derivatives in a neighborhood of 8-values). If the latter continuity holds, then ¢ is called This concept requires a norm on the set of derivative maps but need not concern us here. For completeness we discuss a third, stronger form of differentiability. The map ¢ : lDJq, 1--+ IE is called at 8 if there exists a continuous, linear map ¢� : lDJ 1--+ IE such that
t
h1 h tangentially
compact differentiability.
h
h
continuously differentiable.
Frechet diffe rentiable l ¢(8 + h) - ¢ (8) - ¢�(h) li E = o ( l h l ) . as l h l -!- 0. Because sequences of the type t h1, as employed in the definition of Hadamard differentia bility, have norms satisfying I t h1 I = 0 (t ), Frechet differentiability is the most restrictive
of the three concepts. In statistical applications, Frechet differentiability may not hold, whereas Hadamard differentiability does. We did not have this problem in Section 3 . 1 , because Hadamard and Frechet differentiability are equivalent when lDJ ffi.k .
= Let ]IJ) and be normed linear spaces. Let ¢ : lDJq, lDJ be Hadamard differentiable at 8 tangentially to lDJo. Let Tn : Qn ]IJ)q, be maps such that rn (Tn - 8) T for some sequence of numbers rn and a random element T that takes its values in ]IJ)o. Then rn (¢ (Tn) - ¢ (8) ) ¢� (T). If¢� is defined and continuous on the whole space ]IJ), then we also have rn (
IE
20.8 Theorem (Delta method). 1--+ IE
1--+
--+
�
oo
�
Proof.
p
�
:
.
--+
--+
�
Theorem 1 8. 1 1 , which is the first assertion. The seemingly stronger last assertion of the theorem actually follows from this, if applied to the function 1/1 (¢ , ¢�) : lDJ 1--+ IE x IE. This is Hadamard-differentiable at (8 , 8) with derivative 1/1� ( ¢� , ¢� ) . Thus, by the preceding paragraph, ( 1/1 - 1/1 (8) ) converges weakly to ( � ) in IE x IE . By the continuous-mapping theorem, the diffe�ence ( ¢ (8) ) - H 8) ) converges weakly to 0. Weak convergence to a constant is equivalent to convergence in probability. •
= = ¢�(T), ¢ (T) rn ¢(Tn)
rn (Tn) ¢�(T) - ¢�(T) =
298
Functional Delta Method
chain rule,
Without the Hadamard differentiability would not be as interesting. Con sider maps lDl � lE and 1/f : lE � IF that are Hadamard-differentiable at and respectively. Then the composed map 1/J : lDl � IF is Hadamard-differentiable at and the derivative is the map obtained by composing the two derivative maps. (For Euclidean spaces this means that the derivative can be found through matrix multiplication of the two derivative matrices.) The attraction of the chain rule is that it allows a calculus of Hadamard differentiable maps, in which differentiability of a complicated map can be established by decomposing this into a sequence of basic maps, of which Hadamard differentiability is known or can be proven easily. This is analogous to the chain rule for real functions, which allows, for instance, to see the differentiability of the map x � exp cos log ( I + x in a glance.
¢:
o
e ¢(e), e,
¢
2)
:
Let¢ lDlrp � and 1/J � IF be maps defined on sub sets lDlrp and of normed spaces lDl and respectively. Let ¢ be Hadamard-differentiable ate tangentially to lDlo and let 1/1 be Hadamard-differentiable at¢ (e) tangentially to¢� (l1Jl0). Then 1/1 ¢ lDlrp � IF is Hadamard-difef rentiable at e tangentially to lDl0 with derivative 1/1¢ (1}) ¢�. Take an arbitrary converging path h, ----+ h in lDl. With the notation g, t - 1 (¢(e + th1) - ¢(e) ) , we have 1/f ¢(e + th,) - 1/f ¢(e) 1/f ( ¢(e) + tg, ) - 1/! (¢ Ce) ) t t By Hadamard differentiability of ¢, g1 ----+ ¢�(h). Thus, by Hadamard differentiability of 1/J, the whole expression goes to 1/!¢ce J (¢� (h) ) . 20.9
Theorem (Chain rule).
:
lEl/r
0
o
lE,
lEl/r
lEl/r
:
Proof.
=
o
o
•
20.3
Some Examples
In this section we give examples of Hadamard-differentiable functions and applications of the delta method. Further examples, such as quantiles and trimmed means, are discussed in separate chapters. The Mann-Whitney statistic can be obtained by substituting the empirical distribution functions of two samples of observations into the function ( F , G) � This function also plays a role in the construction of other estimators. The following lemma shows that it is Hadamard-differentiable. The set ] is the set of all cadlag functions z: b] � M, M ] c JR. of variation bounded by M (the set of differences of z 1 - z 2 of two monotonely increasing functions that together increase no more than M).
J F dG.
[a,
BVM [a, b
[-
Let ¢ : [0, 1] � JR. be twice continuously difef rentiable. Then the func tion (F1 , F2 ) � J ¢ (F1 ) d F2 is Hadamard-differentiable from the domain D[ -oo, oo] BV1 [ -oo, oo] D[ -oo, oo] D[ -oo, oo] into JR. at every pair offunctions of bounded variation (F1 , F2 ). The derivative is given byt h2) � h2 ¢ J h2- d¢ + J ¢'(F1 )h 1 dF2 .
20. 10
Lemma.
c
(h 1 ,
x
x
o
F1 I Cl0oo -
o
F1
t We denote by h_ the left-continuous version of a cadlag function h and abbreviate h i�
=
h (b) - h (a).
20. 3
299
Some Examples
Furthermore, the function (F , F ) 1---+ · ¢(F ) dF is Hamamard-diffe rentiable as a map into D[ - oo , oo]. 1 2 f(- oo, J 1 2 Let h 1 1 � h 1 and h 21 � h 2 in D[ -oo, oo] be such that F21 F2 + th 21 is a function of variation bounded by 1 for each t. Because F2 is of bounded variation, it follows that h 21 is of bounded variation for every t. Now, with F1 1 F1 + t h 1 r .
Proof.
=
=
¢ F1 h 21 l �oo I ¢"1 1 oo t I h 1t I oo I ¢' I oo I h 1t uo F1 h 1 D[
By partial integration, the second term on the right can be rewritten as Under the assumption on this converges to the first part ofthe derivative + as given in the lemma. The first term is bounded above by ( ) Because the measures are of total variation at most 1 by assumption, this expression converges to zero. To analyze the third term on the right, take a grid -oo < < < oo such that the function ¢ ' varies less than a prescribed value 8 > 0 on each interval Such a grid exists for every element of -oo, oo] (problem 1 8.6). Then
J h 21 _ d¢ F1 • h 1 l oo f d i F2r 1 . u1 U rn ·
·
·
o
h 21, F21
o
=
=
o
[u i _ 1 , u i ).
I / ¢' (F1 )h 1 d(F2t - F2) 1 8 (/ d i F2rl + d i F2 I ) :S
m+l + L I C¢' F1 h 1 )(u i - l ) I I F2r [Ui - l • Ui ) - F2 [u i - 1 , Ui ) l . i= l The first term is bounded by 8 0 ( 1) , in which the 8 can be made arbitrarily small by the choice of the partition. For each fixed partition, the second term converges to zero as t 0. Hence the left side converges to zero as t 0. o
+
This proves the first assertion. The second assertion follows similarly.
+
•
IFm Gn 1 , m m Y1 , n. . . , Yn
Example (Wilcoxon). Let and be the empirical distribution functions of two independent random samples X . . . , X and from distribution functions and G, respectively. As usual, consider both and as indexed by a parameter v, let N + and assume that N � A E (0, 1) as v � oo. By Donsker's theorem and Slutsky's lemma,
20. 1 1
F
= m
n,
mj
D[
D[
in the space -oo, oo] x -oo, oo], for a pair of independent Brownian bridges and The preceding lemma together with the delta method imply that
Gc.
.JN (j iFm dGn - J FdG ) � - f � dF + j � dG.
Gp
The random variable on the right is a continuous, linear function applied to Gaussian processes. In analogy to the theorem that a linear transformation of a multivariate Gaussian vector has a Gaussian distribution, it can be shown that a continuous, linear transformation of a tight Gaussian process is normally distributed. That the present variable is normally
Functional Delta Method
300
distributed can be more easily seen by applying the delta method in its stronger form, which implies that the limit variable is the limit in distribution of the sequence
This can be rewritten as the difference of two sums of independent random variables, and next we can apply the central limit theorem for real variables. D Let IHIN be the empirical distribution func tion of a sample X 1 , , Xm , Y1 , , Yn obtained by "pooling" two independent random samples from distributions F and G, respectively. Let RN l , . . . , RN N be the ranks of the pooled sample and let Gn be the empirical distribution function of the second sample. If no observations are tied, then NIHIN (Yj ) is the rank of Yj in the pooled sample. Thus, 20. 12
Example (Two-sample rank statistics). .
.
•
•
.
.
is a two-sample rank statistic. This can be shown to be asymptotically normal by the preceding lemma. Because NIHIN m + nCGn , the asymptotic normality of the pair (IHIN , CGn ) can be obtained from the asymptotic normality of the pair CIFm , Gn ) , which is discussed in the preceding example. D =
mlF
cumulative hazard function corresponding to a cumulative distribution function F oo] is defined as { 1 -dFF_ . AF(t) j[O,t] In particular, if F has a density f, then AF has a density A.F f/(1 - F) . If F(t) gives the probability of "survival" of a person or object until time t , then dAF (t) can be interpreted as the probability of "instant death at time t given survival until t ." The hazard function is
The on [0,
=
=
an important modeling tool in survival analysis. The correspondence between distribution functions and hazard functions is one-to-one. The cumulative distribution function can be explicitly recovered from the cumulative hazard function as the of (see the proof of Lemma 25.74), T1 ( 1 { } ) (t J . (20. 13) O<s<:: t Here is the jump of at and is the continuous part of Under some restrictions the maps F ++ are Hadamard differentiable. Thus, from an asymptotic-statistical point of view, estimating a distribution function and estimating a cumulative hazard function are the same problem.
product integral -A 1 - FA (t) = - A s e A A { s} A s Ac (s) AF -
'
A.
Let be the set of all nondecreasing cadlag functions F : [0, r] with F(O) 0 and 1 - F ( r ) ::::: £ > 0 for some £ > 0, and let be the set of all nondecreasing cadlagfunctions A : [0, r] with A(O) 0 and A(r) M for some
20. 14
Lemma.
][])1
=
1--+ lR
M E JR.
=
El/1
1--+ lR
:-:=
Some Examples
20. 3
(i) The map ¢ : lDl.p entiable. (ii) The map 1/J : entiable.
C
D[O, r]
R
lEt C
D[O, r ]
f---+
301
defined by ¢ (F) = A p is Hadamard differ D[O, r ] defined by 1/J (A) = FA is Hadamard differ
D[O, r]
Part (i) follows from the chain rule and the Hadamard differentiability of each of the three maps in the decomposition 1 F R (F, 1 - F_ ) R F, R . 1 F_ }[O , t] 1 F_ The differentiability of the first two maps is easy to see. The differentiability of the last one follows from Lemma 20. 10. The proof of (ii) is longer; see, for example, [54] or [55]. •
Proof.
(
) { dF
-
-
Consider estimating a distribution function based on We wish to estimate the distribution function F (or the corre sponding cumulative hazard function A) of a random sample of "failure times" T1 , . . . , Tn . Unfortunately, instead of � we only observe the pair (Xi , L'l i ) , in which Xi = Ti 1\ Ci is the minimum of Ti and a "censoring time" C , and L'l i = 1 { i _:::: Cz } records whether Ti is censored (L'i i = 0) or not (L'i i = 1). The censoring time could be the closing date of the study or a time that a patient is lost for further observation. The cumulative hazard function of interest can be written 1 1 A (t) = = J[O,t] 1 F_ J[O,t] 1 - _ for 1 = (1 - F) (1 G) and = ( 1 - G_)dF, and every choice of distribution function G. If we assume that the censoring times C 1 , . . . , Cn are a random sample from G and are independent of the failure times � , then is precisely the distribution function of Xi and is a "subdistribution function," 20. 15
Example (Nelson-Aalen estimator).
right-censored data.
T
{
-
H
dH1
-
H1
1
-
dF {
-
H dH1 ,
H
H (x) = P(X i > x ) ,
An estimator for A is obtained by estimating these functions by the empirical distributions of the data, given by Tilln (x) = 2:7= 1 1 {X i _:::: x } and Till 1 n (x) = 1 2:7= 1 1 { X i _:::: 1 } , and next substituting these estimators in the formula for A . This yields the x , L'li
n- 1
=
Nelson-Aalen estimator
An (t) = �
n-
1
1[O, t] 1 - JH!n - dlHI1n ·
Because they are empirical distribution functions, the pair (lH!n , lHI 1 n) is asymptotically normal in the space D[ oo] x D[ -oo, oo]. The easiest way to see this is to consider them as continuous transformations of the (bivariate) empirical distribution function of the pairs (Xi , L'lz). The Nelson-Aalen estimator is constructed through the maps 1 1 -, (A , R ( 1 - A , R R 1 A }[O,tl 1 - A _ These are Hadamard differentiable on appropriate domains, the main restrictions being that 1 A should be bounded away from zero and of uniformly bounded variation. The - oo ,
B)
-
B) (
-
-
B) {
B
dB.
3 02
Functional Delta Method
An
asymptotic normality of the Nelson-Aalen estimator (t) follows for every t such that H ( t ) < 1 , and even as a process in D[O, r] for every r such that H ( r ) < 1 . If we apply the product integral given in (20. 13) to the Nelson-Aalen estimator, then we obtain an estimator 1 - F for the distribution function, known as the product limit estimator or Kaplan-Meier estimator. For a discrete hazard function the product integral is an ordinary product over the jumps, by definition, and it can be seen that
n
�
n
1 - F (t)
=
n
. . X, :::: t
z .
> X.) - � · X J· . : X j :::: x ) i #( ]
#( J.
:
l
l
=
n
z· .· x <' l ::: r
(
n-i n - l. + 1
)
L'> < , l
This estimator sequence is asymptotically normal by the Hadamard differentiability of the product integral. 0
Notes
A calculus of "differentiable statistical functions" was proposed by von Mises [ 1 04] . Von Mises considered functions ¢ (IF of the empirical distribution function (which he calls the "repartition of the real quantities x 1 , . . . " ) as in the first section of this chapter. Following Volterra he calls ¢ m times differentiable at F if the first m derivatives of the map t 1---+ ¢ (F + tH) at t 0 exist and have representations of the form
n)
, Xn
=
This representation is motivated in analogy with the finite-dimensional case, in which H would be a vector and the integrals sums. From the perspective of our section on Hadamard differentiable functions, the representation is somewhat arbitrary, because it is required that a derivative be continuous, whence its general form depends on the norm that we use on the domain of ¢. Furthermore, the Volterra representation cannot be directly applied to, for instance, a limiting Brownian bridge, which is not of bounded variation. Von Mises' treatment is not at all informal, as is the first section of this chapter. After developing moment bounds on the derivatives, he shows that n 1 (¢ CIFn ) - ¢ (F) ) is asymp totically equivalent to ¢ if the first m - 1 derivatives vanish at F and the (m + 1 )th derivative is sufficiently regular. He refers to the approximating variables de generate V -statistics, as "quantics" and derives the asymptotic distribution of quantics of degree 2, first for discrete observations and next in general by discrete approximation. Hoeffding ' s work on U -statistics, which was published one year later, had a similar aim of approximating complicated statistics by simpler ones but did not consider degenerate U -statistics. The systematic application of Hamadard differentiability in statistics appears to have first been put forward in the (unpublished) thesis [125] of J Reeds and had a main focus on robust functions. It was revived by Gill [53] with applications in survival analysis in mind. With a growing number of functional estimators available (beyond the empirical distribution and product-limit estimator), the delta method is a simple but useful tool to standardize asymptotic normality proofs. Our treatment allows the domain lDl.p of the map ¢ to be arbitrary. In particular, we do not assume that it is open, as we did, for simplicity, when discussing the Delta method for
�m) (Gn)
m2
Prob lem s
303
Euclidean spaces. This is convenient, because many functions of statistical interest, such as zeros, inverses or integrals, are defined only on irregularly shaped subsets of a normed space, which, besides a linear space, should be chosen big enough to support the limit distribution of Tn .
PROBLEMS 1. Let cp ( P ) = J j h ( u , v ) d P (u) d P (v) for a fixed given function h . The corresponding estimator cf> Wn) is known as a V -statistic. Find the influence function.
2. Find the influence function of the function cp ( F) = J a ( FI + F2 ) d F2 if F1 and F2 are the marginals of the bivariate distribution function F, and a is a fixed, smooth function. Write out ¢ (IFn ). What asymptotic variance do you expect? 3. Find the influence function of the map F
r+
_ )-1
J[ O,t] ( 1 - F
dF (the cumulative hazard function) .
4. Show that a map ¢> : IIJ) r+ lE is Hadamard differentiable at a point e if and only if for every compact set K c IIJ) the expression in (20.7) converges to zero uniformly in h E K as t ---+ 0. 5. Show that the symmetrization map (e , F) r+ differentiable under appropriate conditions .
� ( F (t) + 1 - F (2e - t) ) is (tangentially) Hadamard
g : [a , b] r+ lR be a continuously differentiable function. Show that the map z r+ g o z with domain the functions z : T r+ [a , b] contained in £00(T) is Hadamard differentiable. What does this imply for the function z r+ 1 / z ?
6. Let
7 . Show that the map F r+ f[ a ,b] s d F (s) is Hadamard differentiable from the domain of all distri bution functions to lR, for each pair of finite numbers a and b. View the distribution functions as a subset of D[ - oo , oo] equipped with supremum norm. What if a or b are infinite? 8. Find the first- and second-order derivative of the function cp ( F) = j (F - Fo ) 2 dF at F = Fo . What limit distribution do you expect for ¢> (IF n ) ?
21 Quantiles and Order Statistics
In this chapter we derive the asymptotic distribution of estimators of quantiles from the asymptotic distribution of the corresponding estimators of a distribution function. Empirical quantiles are an example, and hence we also discuss some results concerning order statistics. Furthermore, we discuss the asymptotics of the median absolute deviation, which is the empirical 1/2-quantile of the observations centered at their 1/2-quantile.
21.1 Weak Consistency

The quantile function of a cumulative distribution function $F$ is the generalized inverse $F^{-1}: (0,1) \mapsto \mathbb{R}$ given by
$$F^{-1}(p) = \inf\{x : F(x) \ge p\}.$$
It is a left-continuous function with range equal to the support of $F$ and hence is often unbounded. The following lemma records some useful properties.

21.1 Lemma. For every $0 < p < 1$ and $x \in \mathbb{R}$,
(i) $F^{-1}(p) \le x$ iff $p \le F(x)$;
(ii) $F \circ F^{-1}(p) \ge p$ with equality iff $p$ is in the range of $F$; equality can fail only if $F$ is discontinuous at $F^{-1}(p)$;
(iii) $F_- \circ F^{-1}(p) \le p$;
(iv) $F^{-1} \circ F(x) \le x$; equality fails iff $x$ is in the interior or at the right end of a "flat" of $F$;
(v) $F^{-1} \circ F \circ F^{-1} = F^{-1}$; $F \circ F^{-1} \circ F = F$;
(vi) $(F \circ G)^{-1} = G^{-1} \circ F^{-1}$.

Proof. The proofs of the inequalities in (i) through (iv) are best given by a picture. The equalities (v) follow from (ii) and (iv) and the monotonicity of $F$ and $F^{-1}$. If $p = F(x)$ for some $x$, then, by (ii), $p \le F \circ F^{-1}(p) = F \circ F^{-1} \circ F(x) \le F(x) = p$, by (iv). This proves the first statement in (ii); the second is immediate from the inequalities in (ii) and (iii). Statement (vi) follows from (i) and the definition of $(F \circ G)^{-1}$. ∎
Consequences of (ii) and (iv) are that $F \circ F^{-1}(p) = p$ on $(0,1)$ if and only if $F$ is continuous (i.e., has range $[0,1]$), and $F^{-1} \circ F(x) = x$ on $\mathbb{R}$ if and only if $F$ is strictly increasing (i.e., has no "flats"). Thus $F^{-1}$ is a proper inverse if and only if $F$ is both continuous and strictly increasing, as one would expect.

By (i) the random variable $F^{-1}(U)$ has distribution function $F$ if $U$ is uniformly distributed on $[0,1]$. This is called the quantile transformation. On the other hand, by (i) and (ii) the variable $F(X)$ is uniformly distributed on $[0,1]$ if and only if $X$ has a continuous distribution function $F$. This is called the probability integral transformation.

A sequence of quantile functions is defined to converge weakly to a limit quantile function, denoted $F_n^{-1} \rightsquigarrow F^{-1}$, if and only if $F_n^{-1}(t) \to F^{-1}(t)$ at every $t$ where $F^{-1}$ is continuous. This type of convergence is not only analogous in form to the weak convergence of distribution functions, it is the same.

21.2 Lemma. For any sequence of cumulative distribution functions, $F_n^{-1} \rightsquigarrow F^{-1}$ if and only if $F_n \rightsquigarrow F$.
Proof. Let $U$ be uniformly distributed on $[0,1]$. Because $F^{-1}$ has at most countably many discontinuity points, $F_n^{-1} \rightsquigarrow F^{-1}$ implies that $F_n^{-1}(U) \to F^{-1}(U)$ almost surely. Consequently, $F_n^{-1}(U)$ converges in law to $F^{-1}(U)$, which is exactly $F_n \rightsquigarrow F$ by the quantile transformation.
For a proof of the converse, let $V$ be a standard normal random variable. If $F_n \rightsquigarrow F$, then $F_n(V) \to F(V)$ almost surely, because convergence can fail only at discontinuity points of $F$. Thus $\Phi\bigl(F_n^{-1}(t)\bigr) = P\bigl(F_n(V) < t\bigr)$ (by (i) of the preceding lemma) converges to $P\bigl(F(V) < t\bigr) = \Phi\bigl(F^{-1}(t)\bigr)$ for every $t$ such that $P\bigl(F(V) = t\bigr) = 0$, hence for all but at most countably many $t$. For those $t$ we conclude that $F_n^{-1}(t) \to F^{-1}(t)$, and by monotonicity this extends to every $t$ at which $F^{-1}$ is continuous. ∎
A statistical application of the preceding lemma is as follows. If a sequence of estimators $\hat F_n$ of a distribution function $F$ is weakly consistent, then the sequence of estimators $\hat F_n^{-1}$ is weakly consistent for the quantile function $F^{-1}$.
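The quantile transformation and the probability integral transformation are easy to check numerically. The following Python sketch is an illustration added here (not part of the original text); the grid-based inverse and the exponential example are arbitrary choices. It computes the generalized inverse by the defining infimum and verifies that $F^{-1}(U)$ has distribution $F$.

```python
import numpy as np

def generalized_inverse(F, p, grid):
    """Return inf{x in grid : F(x) >= p}, a grid approximation of F^{-1}(p)."""
    values = F(grid)                                  # nondecreasing on the grid
    idx = np.searchsorted(values, p, side="left")     # first index with F(x) >= p
    return grid[np.minimum(idx, len(grid) - 1)]

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 20.0, 20001)
F = lambda x: 1.0 - np.exp(-x)                        # standard exponential cdf

u = rng.uniform(size=100_000)
x = generalized_inverse(F, u, grid)                   # quantile transformation: x ~ F
print(np.mean(x), np.var(x))                          # both close to 1 for Exp(1)
print(np.mean(F(x)))                                  # probability integral transform: close to 1/2
```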
21.2 Asymptotic Normality
In the absence of information concerning the underlying distribution function $F$ of a sample, the empirical distribution function $\mathbb{F}_n$ and empirical quantile function $\mathbb{F}_n^{-1}$ are reasonable estimators for $F$ and $F^{-1}$, respectively. The empirical quantile function is related to the order statistics $X_{n(1)}, \ldots, X_{n(n)}$ of the sample through
$$\mathbb{F}_n^{-1}(p) = X_{n(i)} \qquad \text{for } p \in \Bigl(\frac{i-1}{n}, \frac{i}{n}\Bigr].$$
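As a small added illustration (not in the original text), the display can be implemented directly: the empirical quantile at $p$ is the $\lceil np\rceil$-th order statistic. The comparison with NumPy's `inverted_cdf` option rests on the assumption that a recent NumPy version is used and follows this convention.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
x_sorted = np.sort(x)                    # order statistics X_{n(1)} <= ... <= X_{n(n)}

def empirical_quantile(sample_sorted, p):
    """F_n^{-1}(p) = X_{n(i)} for p in ((i-1)/n, i/n], i.e. i = ceil(n p)."""
    n = len(sample_sorted)
    i = int(np.ceil(n * p))
    return sample_sorted[i - 1]

print(empirical_quantile(x_sorted, 0.5))                 # sample median
print(np.quantile(x, 0.5, method="inverted_cdf"))        # assumed to use the same convention
```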
One method to prove the asymptotic normality of empirical quantiles is to view them as $M$-estimators and apply the theorems given in Chapter 5. Another possibility is to express the distribution function $P(X_{n(i)} \le x)$ in terms of binomial probabilities and apply approximations to these. The method that we follow in this chapter is to deduce the asymptotic normality of quantiles from the asymptotic normality of the distribution function, using the delta method.
An advantage of this method is that it is not restricted to empirical quantiles but applies to the quantiles of any estimator of the distribution function. For a nondecreasing function $F \in D[a,b]$, $[a,b] \subset [-\infty,\infty]$, and a fixed $p \in \mathbb{R}$, let $\phi(F) \in [a,b]$ be an arbitrary point in $[a,b]$ such that
$$F\bigl(\phi(F)-\bigr) \le p \le F\bigl(\phi(F)\bigr).$$
The natural domain $\mathbb{D}_\phi$ of the resulting map $\phi$ is the set of all nondecreasing $F$ such that there exists a solution to the pair of inequalities. If there exists more than one solution, then the precise choice of $\phi(F)$ is irrelevant. In particular, $\phi(F)$ may be taken equal to the $p$th quantile $F^{-1}(p)$.
21.3 Lemma. Let $F \in \mathbb{D}_\phi$ be differentiable at a point $\xi_p \in (a,b)$ such that $F(\xi_p) = p$, with positive derivative. Then $\phi: \mathbb{D}_\phi \subset D[a,b] \mapsto \mathbb{R}$ is Hadamard-differentiable at $F$ tangentially to the set of functions $h \in D[a,b]$ that are continuous at $\xi_p$, with derivative $\phi_F'(h) = -h(\xi_p)/F'(\xi_p)$.
Proof. Let $h_t \to h$ uniformly on $[a,b]$ for a function $h$ that is continuous at $\xi_p$. Write $\xi_{pt} = \phi(F + th_t)$. By the definition of $\phi$, for every $\varepsilon_t > 0$,
$$(F + th_t)(\xi_{pt} - \varepsilon_t) \le p \le (F + th_t)(\xi_{pt}).$$
Choose $\varepsilon_t$ positive and such that $\varepsilon_t = o(t)$. Because the sequence $h_t$ converges uniformly to a bounded function, it is uniformly bounded. Conclude that $F(\xi_{pt} - \varepsilon_t) + O(t) \le p \le F(\xi_{pt}) + O(t)$. By assumption, the function $F$ is monotone and bounded away from $p$ outside any interval $(\xi_p - \varepsilon, \xi_p + \varepsilon)$ around $\xi_p$. To satisfy the preceding inequalities the numbers $\xi_{pt}$ must be to the right of $\xi_p - \varepsilon$ eventually, and the numbers $\xi_{pt} - \varepsilon_t$ must be to the left of $\xi_p + \varepsilon$ eventually. In other words, $\xi_{pt} \to \xi_p$. By the uniform convergence of $h_t$ and the continuity of the limit, $h_t(\xi_{pt} - \varepsilon_t) \to h(\xi_p)$ for every $\varepsilon_t \to 0$. Using this and Taylor's formula on the preceding display yields
$$p + (\xi_{pt} - \xi_p)F'(\xi_p) - o(\xi_{pt} - \xi_p) + O(\varepsilon_t) + th(\xi_p) - o(t) \le p \le p + (\xi_{pt} - \xi_p)F'(\xi_p) + o(\xi_{pt} - \xi_p) + O(\varepsilon_t) + th(\xi_p) + o(t).$$
Conclude first that $\xi_{pt} - \xi_p = O(t)$. Next, use this to replace the $o(\xi_{pt} - \xi_p)$ terms in the display by $o(t)$ terms and conclude that $(\xi_{pt} - \xi_p)/t \to -(h/F')(\xi_p)$. ∎
Instead of a single quantile we can consider the quantile function $F \mapsto \bigl(F^{-1}(p)\bigr)_{p_1 < p < p_2}$, for fixed numbers $0 \le p_1 < p_2 \le 1$. Because any quantile function is bounded on an interval $[p_1, p_2]$ strictly contained in $(0,1)$, we may hope that a quantile estimator converges in distribution in $\ell^\infty(p_1, p_2)$ for such an interval. The quantile function of a distribution with compact support is bounded on the whole interval $(0,1)$, and then we may hope to strengthen the result to weak convergence in $\ell^\infty(0,1)$. Given an interval $[a,b] \subset \mathbb{R}$, let $\mathbb{D}_1$ be the set of all restrictions of distribution functions on $\mathbb{R}$ to $[a,b]$, and let $\mathbb{D}_2$ be the subset of $\mathbb{D}_1$ of distribution functions of measures that give mass 1 to $(a,b]$.
21.4 Lemma.
(i) Let $0 < p_1 < p_2 < 1$, and let $F$ be continuously differentiable on the interval $[a,b] = [F^{-1}(p_1) - \varepsilon, F^{-1}(p_2) + \varepsilon]$ for some $\varepsilon > 0$, with strictly positive derivative $f$. Then the inverse map $G \mapsto G^{-1}$ as a map $\mathbb{D}_1 \subset D[a,b] \mapsto \ell^\infty[p_1, p_2]$ is Hadamard differentiable at $F$ tangentially to $C[a,b]$.
(ii) Let $F$ have compact support $[a,b]$ and be continuously differentiable on its support with strictly positive derivative $f$. Then the inverse map $G \mapsto G^{-1}$ as a map $\mathbb{D}_2 \subset D[a,b] \mapsto \ell^\infty(0,1)$ is Hadamard differentiable at $F$ tangentially to $C[a,b]$.
In both cases the derivative is the map $h \mapsto -(h/f)\circ F^{-1}$.
Proof. It suffices to make the proof of the preceding lemma uniform in $p$. We use the same notation.
(i). Because the function $F$ has a positive density, it is strictly increasing on an interval $[\xi_{p_1'}, \xi_{p_2'}]$ that strictly contains $[\xi_{p_1}, \xi_{p_2}]$. Then on $[p_1', p_2']$ the quantile function $F^{-1}$ is the ordinary inverse of $F$ and is (uniformly) continuous and strictly increasing. Let $h_t \to h$ uniformly on $[\xi_{p_1'}, \xi_{p_2'}]$ for a continuous function $h$. By the proof of the preceding lemma, $\xi_{p_i' t} \to \xi_{p_i'}$, and hence every $\xi_{pt}$ for $p_1 \le p \le p_2$ is contained in $[\xi_{p_1'}, \xi_{p_2'}]$ eventually. The remainder of the proof is the same as the proof of the preceding lemma.
(ii). Let $h_t \to h$ uniformly in $D[a,b]$, where $h$ is continuous and $F + th_t$ is contained in $\mathbb{D}_2$ for all $t$. Abbreviate $F^{-1}(p)$ and $(F + th_t)^{-1}(p)$ to $\xi_p$ and $\xi_{pt}$, respectively. Because $F$ and $F + th_t$ are concentrated on $(a,b]$ by assumption, we have $a < \xi_{pt}, \xi_p \le b$ for all $0 < p < 1$. Thus the numbers $\varepsilon_{pt} = t^2 \wedge (\xi_{pt} - a)$ are positive, whence, by definition,
$$(F + th_t)(\xi_{pt} - \varepsilon_{pt}) \le p \le (F + th_t)(\xi_{pt}).$$
By the smoothness of $F$ we have $F(\xi_p) = p$ and $F(\xi_{pt} - \varepsilon_{pt}) = F(\xi_{pt}) + O(\varepsilon_{pt})$, uniformly in $0 < p < 1$. It follows that
$$-t\,h(\xi_{pt}) + o(t) \le F(\xi_{pt}) - F(\xi_p) \le -t\,h(\xi_{pt} - \varepsilon_{pt}) + O(\varepsilon_{pt}) + o(t).$$
The $o(t)$ terms are uniform in $0 < p < 1$. The far left side and the far right side are $O(t)$; the expression in the middle is bounded above and below by a constant times $|\xi_{pt} - \xi_p|$. Conclude that $|\xi_{pt} - \xi_p| = O(t)$, uniformly in $p$. Next, the lemma follows by the uniform differentiability of $F$. ∎
Thus, the asymptotic normality of an estimator of a distribution function (or another nondecreasing function) automatically entails the asymptotic normality of the corresponding quantile estimators. More precisely, to derive the asymptotic normality of even a single quantile estimator $\hat F_n^{-1}(p)$, we need to know that the estimators $\hat F_n$ are asymptotically normal as a process, in a neighborhood of $F^{-1}(p)$. The standardized empirical distribution function is asymptotically normal as a process indexed by $\mathbb{R}$, and hence the empirical quantiles are asymptotically normal.

21.5 Corollary. Fix $0 < p < 1$. If $F$ is differentiable at $F^{-1}(p)$ with positive derivative $f\bigl(F^{-1}(p)\bigr)$, then
$$\sqrt{n}\bigl(\mathbb{F}_n^{-1}(p) - F^{-1}(p)\bigr) = -\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{1\{X_i \le F^{-1}(p)\} - p}{f\bigl(F^{-1}(p)\bigr)} + o_P(1).$$
Consequently, the sequence $\sqrt{n}\bigl(\mathbb{F}_n^{-1}(p) - F^{-1}(p)\bigr)$ is asymptotically normal with mean 0 and variance $p(1-p)/f^2\bigl(F^{-1}(p)\bigr)$. Furthermore, if $F$ satisfies the conditions (i) or (ii) of the preceding lemma, then $\sqrt{n}\bigl(\mathbb{F}_n^{-1} - F^{-1}\bigr)$ converges in distribution in $\ell^\infty[p_1, p_2]$ or $\ell^\infty(0,1)$, respectively, to the process $\mathbb{G}_\lambda(p)/f\bigl(F^{-1}(p)\bigr)$, where $\mathbb{G}_\lambda$ is a standard Brownian bridge.

Proof. By Theorem 19.3, the empirical process $\mathbb{G}_{n,F} = \sqrt{n}(\mathbb{F}_n - F)$ converges in distribution in $D[-\infty,\infty]$ to an $F$-Brownian bridge process $\mathbb{G}_F = \mathbb{G}_\lambda \circ F$. The sample paths of the limit process are continuous at the points at which $F$ is continuous. By Lemma 21.3, the quantile function $F \mapsto F^{-1}(p)$ is Hadamard-differentiable tangentially to the range of the limit process. By the functional delta method, the sequence $\sqrt{n}\bigl(\mathbb{F}_n^{-1}(p) - F^{-1}(p)\bigr)$ is asymptotically equivalent to the derivative of the quantile function evaluated at $\mathbb{G}_{n,F}$, that is, to $-\mathbb{G}_{n,F}\bigl(F^{-1}(p)\bigr)/f\bigl(F^{-1}(p)\bigr)$. This is the first assertion. Next, the asymptotic normality of the sequence $\sqrt{n}\bigl(\mathbb{F}_n^{-1}(p) - F^{-1}(p)\bigr)$ follows by the central limit theorem. The convergence of the quantile process follows similarly, this time using Lemma 21.4. ∎
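The corollary lends itself to a quick Monte Carlo check. The sketch below is an illustration added here (the exponential distribution and the sample sizes are arbitrary choices, not part of the original text); it compares the empirical variance of $\sqrt{n}(\mathbb{F}_n^{-1}(p) - F^{-1}(p))$ with $p(1-p)/f^2(F^{-1}(p))$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, p = 400, 5000, 0.5
true_q = np.log(2.0)        # F^{-1}(1/2) for the standard exponential
f_at_q = 0.5                # density exp(-x) evaluated at log 2

samples = rng.exponential(size=(reps, n))
# np.quantile interpolates slightly differently from F_n^{-1}, which is asymptotically negligible.
medians = np.quantile(samples, p, axis=1)
z = np.sqrt(n) * (medians - true_q)

print(np.var(z))                      # close to p(1-p)/f^2 = 1
print(p * (1 - p) / f_at_q**2)        # theoretical asymptotic variance
```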
21.6 Example. The uniform distribution function has derivative 1 on its compact support. Thus, the uniform empirical quantile process converges weakly in $\ell^\infty(0,1)$. The limiting process is a standard Brownian bridge. The normal and Cauchy distribution functions have continuous derivatives that are bounded away from zero on any compact interval. Thus, the normal and Cauchy empirical quantile processes converge in $\ell^\infty[p_1, p_2]$, for every $0 < p_1 < p_2 < 1$. □
The empirical quantile function at a point is equal to an order statistic of the sample. In estimating a quantile, we could also use the order statistics directly, not necessarily in the way that $\mathbb{F}_n^{-1}$ picks them. For the $k_n$-th order statistic $X_{n(k_n)}$ to be a consistent estimator for $F^{-1}(p)$, we need minimally that $k_n/n \to p$ as $n \to \infty$. For mean-zero asymptotic normality, we also need that $k_n/n \to p$ faster than $1/\sqrt{n}$, which is necessary to ensure that $X_{n(k_n)}$ and $\mathbb{F}_n^{-1}(p)$ are asymptotically equivalent. This still allows considerable freedom for choosing $k_n$.

21.7 Lemma. Let $F$ be differentiable at $F^{-1}(p)$ with positive derivative and let $k_n/n = p + c/\sqrt{n} + o(1/\sqrt{n})$. Then
$$\sqrt{n}\bigl(X_{n(k_n)} - \mathbb{F}_n^{-1}(p)\bigr) \xrightarrow{\;P\;} \frac{c}{f\bigl(F^{-1}(p)\bigr)}.$$
Proof. First assume that $F$ is the uniform distribution function. Denote the observations by $U_i$, rather than $X_i$. Define a function $g_n: \ell^\infty(0,1) \mapsto \mathbb{R}$ by $g_n(z) = z(k_n/n) - z(p)$. Then $g_n(z_n) \to z(p) - z(p) = 0$, whenever $z_n \to z$ for a function $z$ that is continuous at $p$. Because the uniform quantile process $\sqrt{n}(\mathbb{G}_n^{-1} - G^{-1})$ converges in distribution in $\ell^\infty(0,1)$, the extended continuous-mapping theorem, Theorem 18.11, yields $g_n\bigl(\sqrt{n}(\mathbb{G}_n^{-1} - G^{-1})\bigr) = \sqrt{n}\bigl(U_{n(k_n)} - \mathbb{G}_n^{-1}(p)\bigr) - \sqrt{n}\bigl(k_n/n - p\bigr) \rightsquigarrow 0$. This is the result in the uniform case.
A sample from a general distribution function $F$ can be generated as $F^{-1}(U_i)$, by the quantile transformation. Then $\sqrt{n}\bigl(X_{n(k_n)} - \mathbb{F}_n^{-1}(p)\bigr)$ is equal to
$$\sqrt{n}\bigl[F^{-1}(U_{n(k_n)}) - F^{-1}(p)\bigr] - \sqrt{n}\bigl[F^{-1}\bigl(\mathbb{G}_n^{-1}(p)\bigr) - F^{-1}(p)\bigr].$$
Apply the delta method to the two terms to see that $f\bigl(F^{-1}(p)\bigr)$ times their difference is asymptotically equivalent to $\sqrt{n}\bigl(U_{n(k_n)} - p\bigr) - \sqrt{n}\bigl(\mathbb{G}_n^{-1}(p) - p\bigr)$. ∎

21.8 Example (Confidence intervals for quantiles). If $X_1, \ldots, X_n$ is a random sample from a continuous distribution function $F$, then $U_1 = F(X_1), \ldots, U_n = F(X_n)$ are a random sample from the uniform distribution, by the probability integral transformation. This can be used to construct confidence intervals for quantiles that are distribution-free over the class of continuous distribution functions. For any given natural numbers $k$ and $l$, the interval $\bigl(X_{n(k)}, X_{n(l)}\bigr]$ has coverage probability
$$P\bigl(X_{n(k)} < F^{-1}(p) \le X_{n(l)}\bigr) = P\bigl(U_{n(k)} < p \le U_{n(l)}\bigr).$$
Because this is independent of $F$, it is possible to obtain an exact confidence interval for every fixed $n$, by determining $k$ and $l$ to achieve the desired confidence level. (Here we have some freedom in choosing $k$ and $l$ but can obtain only finitely many confidence levels.) For large $n$, the values $k$ and $l$ can be chosen equal to
$$\frac{k}{n},\ \frac{l}{n} = p \pm z_\alpha \sqrt{\frac{p(1-p)}{n}}.$$
To see this, note that, by the preceding lemma,
$$U_{n(k)},\ U_{n(l)} = \mathbb{G}_n^{-1}(p) \pm z_\alpha \sqrt{\frac{p(1-p)}{n}} + o_P\Bigl(\frac{1}{\sqrt n}\Bigr).$$
Thus the event $U_{n(k)} < p \le U_{n(l)}$ is asymptotically equivalent to the event $\sqrt{n}\,\bigl|\mathbb{G}_n^{-1}(p) - p\bigr| \le z_\alpha\sqrt{p(1-p)}$. Its probability converges to $1 - 2\alpha$. An alternative is to use the asymptotic normality of the empirical quantiles $\mathbb{F}_n^{-1}$, but this has the unattractive feature of having to estimate the density $f\bigl(F^{-1}(p)\bigr)$, because this appears in the denominator of the asymptotic variance. If using the distribution-free method, we do not even have to assume that the density exists. □

The application of the Hadamard differentiability of the quantile transformation is not limited to empirical quantiles. For instance, we can also immediately obtain the asymptotic normality of the quantiles of the product-limit estimator, or any other estimator of a distribution function in semiparametric models. On the other hand, the results on empirical quantiles can be considerably strengthened by taking special properties of the empirical distribution into account. We discuss a few extensions, mostly for curiosity value.† Corollary 21.5 asserts that $R_n(p) \xrightarrow{P} 0$, for, with $\xi_p = F^{-1}(p)$,
$$f(\xi_p)\,\sqrt{n}\bigl(\mathbb{F}_n^{-1}(p) - \xi_p\bigr) + \sqrt{n}\,(\mathbb{F}_n - F)(\xi_p) =: R_n(p).$$
The expression on the left is known as the standardized empirical difference process. "Standardized" refers to the leading factor $f(\xi_p)$. That a sum is called a difference is curious but stems from the fact that minus the second term is approximately equal to the first term. The identity shows an interesting symmetry between the empirical distribution and quantile processes, particularly in the case that $F$ is uniform, if $f(\xi_p) = 1$ and $\xi_p = p$. The result that $R_n(p) \xrightarrow{P} 0$ can be refined considerably. If $F$ is twice-differentiable at $\xi_p$ with positive first derivative, then, by the Bahadur-Kiefer theorems,
$$\limsup_{n\to\infty} \frac{n^{1/4}}{(\log\log n)^{3/4}}\,\bigl|R_n(p)\bigr| = \Bigl(\frac{32}{27}\,p(1-p)\Bigr)^{1/4}, \quad \text{a.s.},$$
$$n^{1/4} R_n(p) \rightsquigarrow \int_0^\infty \Phi\Bigl(\frac{x}{\sqrt y}\Bigr)\,\frac{2}{\sqrt{p(1-p)}}\,\phi\Bigl(\frac{y}{\sqrt{p(1-p)}}\Bigr)\,dy.$$

† See [134, pp. 586-587] for further information.
The right side in the last display is a distribution function as a function of the argument $x$. Thus, the magnitude of the empirical difference process is $O_P(n^{-1/4})$, with the rate of its fluctuations being equal to $n^{-1/4}(\log\log n)^{3/4}$. Under some regularity conditions on $F$, which are satisfied by, for instance, the uniform, the normal, the exponential, and the logistic distribution, versions of the preceding results are also valid in supremum norm,
$$\limsup_{n\to\infty} \frac{n^{1/4}}{(\log n)^{1/2}(2\log\log n)^{1/4}}\,\|R_n\|_\infty = \frac{1}{\sqrt 2}, \quad \text{a.s.},$$
$$\frac{n^{1/4}}{(\log n)^{1/2}}\,\|R_n\|_\infty \rightsquigarrow \sqrt{\|\mathbb{Z}_\lambda\|_\infty}.$$
Here $\mathbb{Z}_\lambda$ is a standard Brownian motion indexed by the interval $[0,1]$.
21.3
Median Absolute Deviation
The median absolute deviation of a sample X 1 , defined by MADn
=
)
.
•
•
, Xn is the robust estimator of scale
)
xi · med x i - med 1 :S i :Sn 1 :S i :-o n
It is the median of the deviations of the observations from their median and is often recom mended for reducing the observations to a standard scale as a first step in a robust procedure. Because the median is a quantile, we can prove the asymptotic normality of the median absolute deviation by the delta method for quantiles, applied twice. If a variable X has distribution function F, then the variable I X - 8 I has the distribution function x 1--+ F(8 + x ) - F_ (8 - x ) . Let (8 , F) 1--+ ¢2 (8 , F) be the map that assigns to a given number 8 and a given distribution function F the distribution function F(8 + x) F ( 8 - x ) , and consider the function ¢ = ¢3 o ¢2 o ¢ 1 defined by F � (8 : = F - 1 (1/2) , F) � G: = F(8 + · ) - F_ (8 - · ) � c - 1 ( 1 /2) . _
If we identify the median with the 1/2-quantile, then the median absolute deviation is exactly ¢ (IFn ) . Its asymptotic normality follows by the delta method under a regularity condition on the underlying distribution. Let the numbers mF and me satisfy F (m F ) = � = F (mF + me) - F (m F me). Suppose that F is differentiable at m F with positive derivative and is continuously diffe rentiable on neighborhoods of m F + m e and m F - m e with positive derivative at 21.9
Lemma.
21.3
Median Absolute Deviation
311
m F + m a and/or m F - ma. Then the map ¢ : D [ - oo, oo ] �---+ 1ft, with as domain the dis tribution functions, is Hadamard-differentiable at F, tangentially to the set of functions that are continuous both at m F and on neighborhoods of m F + m a and mF - ma. The derivative cp� (H) is given by H (mF) f (m F + ma) - f (mF - m a ) H (m F + ma) - H (mF - m a ) f(mF) f (mF + ma) + f (mF - m G ) f (m F + m a ) + f(m F - m a)
Define the maps ¢i as indicated previously. By Lemma 2 1 .3, the map ¢ 1 : D [ - oo, oo] �---+ IR x D [ -oo, oo] is Hadamard-differenti able at F tangentially to the set of functions H that are continuous at m F . The map ¢2 : IR x D [ -oo, oo] �---+ D [m - c, m + c ] is Hadamard-differentiable at the point (m F , F) tangentially to the set of points (g , H) such that H is continuous on the intervals [m F ± ma - 2£ , m F ± ma + 2£] , for sufficiently small c > 0. This follows because, if at ___,. a and Ht ___,. H uniformly, (F + t Ht ) (mF + tat + x ) - F(mF + x ) ___,. aj(m F + x ) + H (mF + x) , t uniformly in x � m a , and because a similar statement is valid for the differences (F + t Ht ) - (m F + ta1 - x ) - F_ (mF - x ) . The range of the derivative is contained in C[m a c, m a + c]. Finally, by Lemma 2 1 .3, the map ¢3 : D [m a - c, m a + c] �---+ IR is Hadamard-differenti able at G ¢2 (m F , F) , tangentially to the set of functions that are continuous at m a , because G has a positive derivative at its median, by assumption. The lemma follows by the chain rule, where we ascertain that the tangent spaces match up properly. •
Proof
G
G
------
=
The F-Brownian bridge process G F has sample paths that are continuous everywhere that F is continuous. Under the conditions of the lemma, they are continuous at the point mF and in neighborhoods of the points mF + m a and mF - m a . Thus, in view of the lemma and the delta method, the sequence ,J1i ( ¢ (IFn ) - ¢ (F) ) converges in distribution to the variable ¢ � (GF) . If F has a density that is symmetric about 0, then its median m F is 0 and the median absolute deviation m is equal to F- 1 (3/4). Then the first term in the definition of the derivative vanishes, and the derivative ¢ � (G F) at the F-Brownian bridge reduces to - (G) J3/4) - GJc (l /4)) /2f(F- 1 (3/4)) for a stan dard Brownian bridge GJc. Then the asymptotic variance of ,Jli(MADn - ma) is equal to ( 1 / 16)/f o F- 1 (3/4) 2 . 0 21.10
Example (Symmetric F).
G
If F is equal to the normal distribution with mean 2 zero and variance O" , then m F 0 and m O" - 1 (3/4) . We find an asymptotic variance (0" 2 / 1 6) ¢ o - 1 (3/4)- 2 . As an estimator for the standard deviation O" , we use the estimator MADn / - 1 (3/4), and as an estimator for 0" 2 the square of this. By the delta method, the latter estimator has asymptotic variance equal to ( 1 /4)0"4¢ o - 1 (3/4)- 2 - 1 (3/4) - 2 , which is approximately equal to 5.440"4. The relative efficiency, relative to the sample variance, is approximately equal to 37%, and hence we should not use this estimator without a good reason. 0 21 .11
Example (Normal distribution). =
G =
QuantiZes and Order Statistics
312
21.4
Extreme Values
-
The asymptotic behavior of order statistics Xn (kn ) such that kn / n ---+ 0 or 1 is, of course, different from that of central-order statistics. Because Xn (kn ) :=: Xn means that at most n kn of the X; can be bigger than Xn , it follows that, with Pn P(X; Xn ) , =
>
Therefore, limit distributions of general-order statistics can be derived from approximations to the binomial distribution. In this section we consider the most extreme cases, in which kn n - k for a fixed number k, starting with the maximum Xn (n ) · We write F (t) P(X i t) for the survival distribution of the observations, a random sample of size n from F. The distribution function of the maximum can be derived from the preceding display, or directly, and satisfies n n F(xn) n P(Xn (n ) :S Xn ) F (xn ) 1 n This representation readily yields the following lemma. =
=
>
=
=
(-
For any sequence of numbers Xn and any r Xn ) ---+ e - ' ifand only ifnF(xn ) ---+ r . 21.12
Lemma.
)
E
[0, oo], we have P(Xn (n )
:=:
In view of the lemma we can find "interesting limits" for the probabilities P(Xn (n ) :=: Xn) only for sequences Xn such that F(xn ) 0 ( 1 / n) . Depending on F this may mean that Xn is bounded or converges to infinity. Suppose that we wish to find constants an and bn 0 such that b;; 1 (Xn (n ) - an) con verges in distribution to a nontrivial limit. Then we must choose an and bn such that F (an + bnx) 0 ( 1 / n) for a nontrivial set of x . Depending on F such constants may or may not exist. It is a bit surprising that the set of possible limit distributions is extremely sman. t =
>
=
If b;; 1 (Xn (n ) - an ) """' G for a nondegenerate cumula tive distribution function G, then G belongs to the location-scale family of a distribution of one of the following forms: (i) e - e -x with support IR.; (ii) e - C l /x " ) with support [0, oo) and 0; ) ( -x (iii) e " with support ( -oo, 0] and 0. 21.13
Theorem (Extremal types).
a >
a >
Example (Uniform). If the distribution has finite support [0, 1] with F(t) (1 t)01 , then nF( l + n - 1 101x) ---+ (-x)01 for every x ::: 0. In view of Lemma 2 1 . 1 2, the sequence n 1 101 (Xn (n ) - 1) converges in distribution to a limit of type (iii). The uniform distribution is the special case with 1 , for which the limit distribution is the negative of an exponential distribution. D
21.14
=
a =
t
For a proof of the following theorem, see [66] or Theorem 1 .4.2 in [90].
313
21.4 Extreme Values
The survival distribution of the Pareto distribution satisfies F( t ) (JJ.) tY' for t ::::: J.L. Thus n F(n 1 1cx J.LX ) � 1/xcx for every x > 0. In view of Lemma 2 1 . 12, the sequence n - 1 /cx Xn(n)/ f.L converges in distribution to a limit of type (ii). 0
21.15
Example (Pareto).
=
21.16
Example (Normal).
more delicate. We choose an
=
For the normal distribution the calculations are similar, but
1 loglog n + log 4n , J2 log n - 2 J2 1og n -
bn
=
1j j2 1og n.
Using Mill 's ratio, which asserts that
,...._,
The problem of convergence in distribution of suitably normalized maxima is solved in general by the following theorem. Let TF sup{t : F(t) 1 } be the right endpoint of F (possibly ). =
<
oo
There exist constants an and bn such that the sequence b;; 1 (Xn (n) - an ) converges in distribution if and only if, as t � TF, (i) There exists a strictly positive function g on IR. such that F(t + g (t)x)j F(t) � e - x , for every x E R (ii) TF = and F(tx)/ F(t) � 1jxcx, for every x > 0; (iii) TF and F ( TF - ( rF - t)x ) / F (t) � xcx, for every x > 0. The constants (an , bn) can be taken equal to ( Un , g (un) ) , (0, Un ) and ( rF , TF - Un ), respec tively, for Un = p - 1 (1 - 1 / n ).
21.17
Theorem.
oo
< oo
We only give the proof for the "only if" part, which follows the same lines as the preceding examples. In every. of the three cases, n F (un) � 1 . To see this it suffices to show that the jump F(un) - F ( un - ) o(1 /n). In case (i) this follows because, for every x 0, the jump is smaller than F ( u + g ( un )x ) - F (un ), which is of the order F(un ) (e- x - 1) ::::; (1/n)(e- x - 1). The right side can be made smaller than c ( 1 /n) for any c > 0, by choosing x close to 0. In case (ii), we can bound the jump at U n by F (xun) - F (un) for every x 1 , which is of the order F(un ) ( 1 jxcx - 1) ::::; ( l j n) ( 1 jxcx - 1). In case (iii) we argue similarly. We conclude the proof by applying Lemma 2 1 . 1 2. For instance, in case (i) we have nF ( un + g (un)x ) n F (un)e - x � e - x for every x, and the result follows. The argument under the assumptions (ii) or (iii) is similar. •
Proof.
=
n
<
<
rv
If the maximum converges in distribution, then the (k + 1)-th largest-order statistics Xn(n - k) converge in distribution as well, with the same centering and scaling, but a dif ferent limit distribution. This follows by combining the preceding results and the Poisson approximation to the binomial distribution. lfb,-; 1 (Xn(n) - an) G, then b;; 1 (Xn(n -k ) - an ) i function H (x) = G (x) L �= 0 ( - log G (x) ) / i ! . 21.18
Theorem.
-v-+
-v-+
H for the distribution
Quantiles and Order Statistics
3 14
If Pn = F(an + b nx), then npn ---+ - log G (x) for every x where G is continuous (all x), by Lemma 2 1 . 12. Furthermore, P b,;- 1 (Xn (n - k) - an ) :S x = P bin(n, Pn ) :S k) . This converges to the probability that a Poisson variable with mean - log G (x) is less than or equal to k. (See problem 2.2 1 .) •
Proof
(
) (
By the same, but more complicated, arguments, the sample extremes can be seen to converge jointly in distribution also, but we omit a discussion. Any order statistic depends, by its definition, on all observations. However, asymptot ically central and extreme order statistics depend on the observations in orthogonal ways and become stochastically independent. One way to prove this is to note that central-order statistics are asymptotically equivalent to means, and averages and extreme order statistics are asymptotically independent, which is a result of interest on its own. Let g be a measurable function with Fg = and Fg 2 = 1, and sup 1 pose that b;; (Xn(n ) -an ) G for a nondegenerate distribution G. Then n - 1 1 2 _L 7= 1 g (Xi ) , b;; 1 (Xn(n ) - an) ) (U, V) for independent random variables U and V with distributions N (O , 1) and G. 21.19
0
Lemma.
�
�
(
Let Un = n - 1 12 .L 7::l g (Xn(i ) ) and Vn = b;; 1 (Xn(n ) - an ) . Because Fg 2 it follows that max 1 �i�n l g (Xi ) l = op (Jfi) . Hence n - 1 /2 l g (Xn(n ) ) l � 0, whence the distance between (CGng, Vn ) and (Un , Vn) converges to zero in probability. It suffices to show that (Un , Vn ) (U, V). Suppose that we can show that, for every u, Proof
<
oo ,
�
Then, by the dominated-convergence theorem, EFn (u I Vn) 1 { Vn :::: v} = Cl> (u)E 1 { Vn :::: v} + o(1), and hence the cumulative distribution function EFn (u l Vn ) 1 { Vn :::: v} of (Un , Vn ) converges to Cl> (u)G(v). The conditional distribution of Un given that Vn = Vn is the same as the distribution of n - 1 12 .L Xni for i.i.d. random variables Xn , 1 , . . . , Xn,n - 1 distributed according to the conditional distribution of g (X 1 ) given that X 1 :::: Xn : = an + bn Vn . These variables have absolute mean
If Vn ---+ v, then P(Vn :::: Vn ) ---+ G(v) by the continuity of G, and, by Lemma 2 1 . 12, n F (xn ) = 0 ( 1) whenever G (v) > 0. We conclude that ,JfiEXn 1 ---+ 0. Because we also have that EX 1 ---+ Fg 2 and EX 1 1 1 Xn 1 1 ::=:: s ,Jfi ---+ 0 for every s > 0, the Lindeberg-Feller theorem yields that Fn (u I Vn) ---+ Cl> (u). This implies Fn (u I Vn ) """"' Cl> (u) by Theorem 18. 1 1 or a direct argument. •
�
� {
}
By taking linear combinations, we readily see from the preceding lemma that the em pirical process CGn and b;; 1 (Xn (n ) - an ) , if they converge, are asymptotically independent as well. This independence carries over onto statistics whose asymptotic distribution can
Problems
315
be derived from the empirical process by the delta method, including central order statis tics Xn (kn / n ) with kn / n p + 0 ( 1 / ..jn) , because these are asymptotically equivalent to averages. =
Notes
For more results concerning the empirical quantile function, the books [28]and [1 34] are good starting points. For results on extreme order statistics, see [66] or the book [90] .
PROBLEMS 1 1 1. Suppose that Fn ---+ F uniformly. Does this imply that Fn- ---+ F - uniformly or pointwise? Give a counterexample.
2. Show that the asymptotic lengths of the two types of asymptotic confidence intervals for a quan tile, discussed in Example 2 1 . 8 , are within op (lj.jil) . Assume that the asymptotic variance of 1 the sample quantile (involving 1 /f o F - (p)) can be estimated consistently. 3. Find the limit distribution of the median absolute deviation from the mean, med 1 �i �n I Xi - Xn 1 .
4 . Find the limit distribution of the pth quantile of the absolute deviation from the median. 5. Prove that Xn and Xn (n - 1 ) are asymptotically independent.
22 L-Statistics
In this chapter we prove the asymptotic normality of linear combinations of order statistics, particularly those used for robust estimation or testing, such as trimmed means. We present two methods: The projection method presumes knowledge of Chapter 11 only; the second method is based on the functional delta method of Chapter 20.
Introduction
22. 1
Let Xn (l) , . . . , Xn (n ) be the order statistics of a sample of real-valued random variables. A linear combination of (transformed) order statistics, or L-statistic, is a statistic of the form n
$$\sum_{i=1}^{n} c_{ni}\,a\bigl(X_{n(i)}\bigr).$$
The coefficients Cn ; are a triangular array of constants and a is some fixed function. This "score function" can without much loss of generality be taken equal to the identity function, for an L-statistic with monotone function a can be viewed as a linear combination of the order statistics of the variables a (X 1 ) , . . . , a (Xn ) , and an L-statistic with a function a of bounded variation can be dealt with similarly, by splitting the L-statistic into two parts. The simplest example of an L -statistic is the sample mean. More interesting are the a-trimmed meanst 22.1
Example (Trimmed and Winsorized means).
n - Lan j
1
L
n - 2 l anj l. = Lan J + 1
Xn ( i ) •
and the a - Winsorized means
[
n
n
- La j 1 Xn ( i ) + l an J Xn (n - La n J +l) - lan J Xn ( La n J ) + n i = La n J + l
t
L
]
·
The notation Lx J is used for the greatest integer that is less than or equal to x. Also 1x l denotes the smallest integer greater than or equal to x . For a natural number n and a real number 0 ::S x ::S n one has Ln x J n - Ix l and ln xl n LxJ . -
-
=
=
-
316
[Figure 22.1 appears here: four panels (Cauchy, Laplace, normal, logistic) plotting the asymptotic variance of the trimmed mean against the trimming fraction $\alpha$.]
� �
'-..
C\1
0 0
c:i
0.0
0.2
0.4
0.0
Figure 22.1. Asymptotic variance of the $\alpha$-trimmed mean of a sample from a distribution $F$ as a function of $\alpha$, for four distributions $F$.
The $\alpha$-trimmed mean is the average of the middle $(1 - 2\alpha)$-th fraction of the observations, the $\alpha$-Winsorized mean replaces the $\alpha$th fractions of smallest and largest data by $X_{n(\lfloor \alpha n\rfloor)}$ and $X_{n(n - \lfloor \alpha n\rfloor + 1)}$, respectively, and next takes the average. Both estimators were already used in the early days of statistics as location estimators in situations in which the data were suspected to contain outliers. Their properties were studied systematically in the context of robust estimation in the 1960s and 1970s. The estimators were shown to have good properties in situations in which the data follow a heavier-tailed distribution than the normal one. Figure 22.1 shows the asymptotic variances of the trimmed means as a function of $\alpha$ for four distributions. (A formula for the asymptotic variance is given in Example 22.11.) The four graphs suggest that 10% to 15% trimming may give an improvement over the sample mean in some cases and does not cost much even for the normal distribution. □
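A small simulation (added here as an illustration; the $t_3$ distribution, the trimming fraction, and the sample sizes are arbitrary choices, not part of the original text) shows the effect that Figure 22.1 describes: for heavier-tailed data the trimmed mean has a smaller variance than the sample mean.

```python
import numpy as np

def trimmed_mean(x, alpha):
    """Average of the middle (1 - 2*alpha) fraction of the order statistics."""
    x = np.sort(x)
    n = len(x)
    k = int(np.floor(alpha * n))
    return x[k:n - k].mean()

rng = np.random.default_rng(5)
n, reps, alpha = 200, 5000, 0.10
samples = rng.standard_t(df=3, size=(reps, n))      # heavier tails than the normal

tm = np.array([trimmed_mean(row, alpha) for row in samples])
sm = samples.mean(axis=1)

print(n * np.var(tm), n * np.var(sm))   # trimming reduces the variance for t_3 data
```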
Example (Ranges).
We present two methods to prove the asymptotic normality of L-statistics. The first method is based on the Hajek projection; the second uses the delta method. The second method is preferable in that it applies to more general statistics, but it necessitates the study of empirical processes and does not cover the simplest L-statistic: the sample mean.
L-Statistics
318
Haj ek Proj ection
22.2
The Hajek projection of a general statistic is discussed in section 1 1 .3. Because a projection is linear and an L-statistic is linear in the order statistics, the Hajek projection of an L-statistic can be found from the Hajek projections of the individual order statistics. Up to centering at mean zero, these are the sums of the conditional expectations E(Xn (i) I Xk ) over k. Some thought shows that the conditional distribution of Xn (i) given Xk is given by P(Xn (i) ::::;
=
P(Xn - 1( i) ::::; y) Yl Xk = x) = { P(Xn - 1 (i - 1 ) y)
=
if y if y
::::;
=
< X,
::::
X.
-
This is correct for the extreme cases i 1 and i n provided that we define Xn _ 1 (OJ and Xn - 1 (n ) Thus, we obtain, by the partial integration formula for an expectation, for x :::: 0, oo.
E (Xn C i l I Xk
=
oo
= x) = fo x P(Xn - 1Cil > y) dy + 1oo P(Xn - 1Ci -ll > y) dy
- i: P(Xn - 1Ci l ::::; y) dy = - 100 (P(Xn - 1(il > y) - P(Xn - 1Ci - l l > y) ) dy + EXn -1CiJ ·
The second expression is valid for x 0 as well, as can be seen by a similar argument. Because Xn - 1 (i - 1 ) ::::; Xn - 1 (i) • the difference between the two probabilities in the last integral is equal to the probability of the event { Xn - 1 (i - 1 ) ::::; y Xn -1 (i ) } . This is precisely the probability that a binomial (n - 1 , F (y) ) -variable is equal to i 1 . If we write this probability as Bn - 1 F (y) (i 1), then the Hajek projection Xn(i ) of Xn (i) satisfies, with IFn the empirical distribution function of X 1 , . . . , Xn , <
<
,
-
-
= - f n (IFn - F) (y) Bn - 1 F ( ) (i - 1) dy . For the projection of the L-statistic Tn = :L7= 1 Cni Xn(i ) we find Tn - E Tn = - J n (IFn - F) (y) f >ni Bn - 1 F ( ) (i - 1) dy . i= 1 y
,
,
y
Under some conditions on the coefficients Cni • this sum (divided by Jn) is asymptotically normal by the central limit theorem. Furthermore, the projection Tn can be shown to be asymptotically equivalent to the L-statistic Tn by Theorem 1 1 .2. Sufficient conditions on the Cni can take a simple appearance for coefficients that are "generated" by a function ¢ as in ( 1 3 .4).
=
22.3 Theorem. Suppose that EX ? and that Cni ¢ ( i / (n + 1) ) for a boundedfunction ¢ that is continuous at F (y)for Lebesgue almost-every y. Then the sequence n - 1 1 2 (Tn -ETn) < oo
Hajek Projection
22. 2
319
converges in distribution to a normal distribution with mean zero and variance
CJ 2 ( ¢ , F) = Proof.
ff ¢ ( F(x)) ¢ ( F(y)) (F(x A y) - F(x ) F (y)) dx dy .
Define functions e (y) = ¢ ( F (y)) and
(
n 1 en (Y) = I>ni Bn - l.F(y) ( l. - 1) = E¢ Bn + n+1 i=l
)
'
for Bn binomially distributed with parameters (n - 1 , F (y)) . By the law of large numbers (Bn + 1 )/(n + 1) � F (y) . Because ¢ is bounded, en (y) -+ e (y) for every y such that ¢ is continuous at F(y), by the dominated-convergence theorem. By assumption, this includes almost every y . By Theorem 1 1 . 2, the sequence n-112 (Tn - Tn ) converges in second mean to zero if the variances of n - l / 2 Tn and n - l i 2 Tn converge to the same number. Because n - l i 2 (Tn - E Tn ) = - Gn (Y) en (Y) dy , the second variance is easily computed to be
J
± var Tn = JJ (F(x
!\
y) - F (x)F(y)) en (x)en ( Y ) dx dy .
This converges to CJ 2 (¢ , F) by the dominated-convergence theorem. The variance of n - l i 2 Tn can be written in the form
where, because cov(X, Y) = ( X , Y),
JJ cov ( {X ::::; x } , { Y ::::; y}) dx dy for any pair of variables
( )( )
1 n n i j ¢ -- cov ( { Xn (i) ::::; x } , { Xn (J) ::::; y}) . y) = Rn (X , -n LL ¢ -n+1 i=l J=l n + 1 Because the order statistics are positively correlated, all covariances in the double sum are nonnegative. Furthermore,
1 n n -n LL cov ( {Xn(i) ::'S x } , {Xn(J) i=l J=l
::'S
y}) = cov ( Gn (x) , Gn (y)) = ( F(x !\ y) - F(x ) F (y) )
.
For pairs (i , j) such that i � n F (x) and j � nF(y), the coefficient of the covariance is approximately e(x)e(y) by the continuity of ¢. The covariances corresponding to other pairs (i , j) are negligible. Indeed, for i ::::_ n F (x) + nEn ,
0
::'S
cov ( { Xn (i) ::'S x } , { Xn(j)
::'S
y})
2P ( Xn(i) ::'S x) ::::; 2P(bin (n , F (x)) ::::; 2 exp - 2nE; , ::'S
::::_
n F (x) + nEn )
320
L-Statistics
by Hoeffding's inequality. t Thus, because ¢ is bounded, the terms with i ::=: n F (x) + nt:n contribute exponentially little as En -----+ 0 not too fast (e.g., £� n -1 12). A similar argument applies to the terms with i :::: n F (x) - nt:n or I J - n F (y) l ::=: nt:n . Conclude that, for every (x , y) such that ¢ is continuous at both F (x) and F (y) , =
Rn (x , y) -----+ e (x)e(y) (F(x 1\ y) - F (x) F (y)) .
Finally, we apply the dominated convergence theorem to see that the double integral of this expression, which is equal to n -1 var Tn , converges to a 2 (¢ , F) . This concludes the proof that Tn and 't are asymptotically equivalent. To show that the sequence n -1 12 ('t - ETn ) is asymptotically normal, define Sn - Gn (y) e(y) dy . Then, by the same arguments as before, n -1 var(Sn - Tn ) -----+ 0. Furthermore, the sequence n -1 12Sn is asymptotically normal by the central limit theorem. • =
22.3
J
Delta Method
The order statistics of a sample X 1 , . . . , Xn can be expressed in their empirical distribution IFn , or rather the empirical quantile function, through i-1 i < s :::: - . n n Consequently, an £-statistic can be expressed in the empirical distribution function as well. Given a fixed function a and a fixed signed measure K on (0, 1) :1 , consider the function
for
--
View ¢ as a map from the set of distribution functions into R Clearly, (22.4) The right side is an £-statistic with coefficients Cn i K ( (i - 1) In, i In] . Not all possible arrays of coefficients Cn i can be "generated" through a measure K in this manner. However, most £-statistics of interest are almost of the form (22.4 ) , so that not much generality is lost by assuming this structure. An advantage is simplicity in the formulation of the asymptotic properties of the statistics, which can be derived with the help of the von Mises method. More importantly, the function ¢ (F) can also be applied to other estimators besides IFn . The results of this section yield their asymptotic normality in general. =
The a-trimmed mean corresponds to the uniform distribution K on the interval (a, 1 - a) and a the identity function. More precisely, the £-statistic generated by 22.5
t
+
Example.
See for example, the appendix of [ 1 17]. This inequality gives more than needed. For instance, it also works to apply Markov 's inequality for fourth moments. A signed measure is a difference K K 1 Kz of two finite measures K 1 and K2 · =
-
22.3 Delta Method
321
[
this measure is 1 -a 1 1 ( fan l - an)Xnc ra n l ) IF,;- 1 (s) ds = n - 2an 1 - 2a a
1
+
n - ra n l
]
L Xn (i) + ( fan l - cm ) Xn (n - ran l +1 ) .
i= ra n l + l
Except for the slightly different weight factor and the treatment of the two extremes in the averages, this agrees with the a-trimmed mean as introduced before. Because Xn (kn ) converges in probability to F- 1 (p) ifk11ln -+ p and (n - 2 lanJ ) I(n - 2an) = 1 + 0 ( l ln), the difference between the two versions of the trimmed mean can be seen to be 0 p ( 1 In) . For the purpose of this chapter this is negligible. The a-Winsorized mean corresponds to the measure K that is the sum of Lebesgue measure on (a, 1 - a) and the discrete measure with pointmasses of size a at each of the points a and 1 - a . Again, the difference between the estimator generated by this K and the Winsorized mean is negligible. The interquartile range corresponds to the discrete, signed measure K that has point masses of sizes 1 and - 1 at the points 1 14 and 3 I 4, respectively. 0 Before giving a proof of asymptotic normality, we derive the influence function of an (empirical) £-statistic in an informal way. If F1 = ( 1 - t)F + t8x , then, by definition, the influence function is the derivative of the map t 1--+ ¢ ( F1) at t = 0. Provided a and K are sufficiently regular,
Here the expression within square brackets if evaluated at t = 0 is the influence function of the quantile function and is derived in Example 20.5. Substituting the representation given there, we see that the influence function of the £-function ¢ (F) = a(F- 1 ) dK takes the form 1 1 ( F- 1 (u) ) - u dK(u) ¢'F (8 x - F) = - a' ( F - 1 (u) ) [x , oo) o f ( F - 1 (u) ) (22.6) 1 [x , oo) ( Y ) - F (y) d K o F (y) . = - a'(y) f (y) The second equality follows by (a generalization of) the quantile transformation. An alternative derivation of the influence function starts with rewriting ¢ (F) in the form
J
1 f
¢ (F) =
f a dK o F = 1(O,oo) (K o F)_ da - 1(- oo,O] (K o F)_ da .
(22.7)
Here K o F (x ) = K o F ( oo) - K o F (x ) and the partial integration can be justified for a a function of bounded variation with a (O) = 0 (see problem 22.6; the assumption that a(O) = 0 simplifies the formula, and is made for convenience). This formula for ¢ (F) suggests as influence function ¢ � ( 8x - F) = -
J K' ( F (y) ) ( l [x, oo) (Y ) - F(y) ) da (y) .
(22.8)
L-Statistics
322
Under appropriate conditions each of the two formulas (22.6) and (22. 8) for the influence function is valid. However, already for the defining expressions to make sense very different conditions are needed. Informally, for equation (22.6) it is necessary that a and F be differentiable with a positive derivative for F, (22.8) requires that K be differentiable. For this reason both expressions are valuable, and they yield nonoverlapping results. Corresponding to the two derivations of the influence function, there are two basic approaches towards proving asymptotic normality of £-statistics by the delta method, valid under different sets of conditions. Roughly, one approach requires that F and a be smooth, and the other that K be smooth. The simplest method is to view the £-statistic as a function of the empirical quantile function, through the map IF;;- 1 �---+ a o IF;;- 1 d K , and next apply the functional delta method to the map Q �---+ a o Q d K . The asymptotic normality of the empirical quantile function is obtained in Chapter 2 1 .
J
J
22.9 Lemma. Let a : IR �---+ IR be continuously diffe rentiable with a bounded derivative. Let K be a signed measure on the interval (a , /3) C (0, 1). Then the map Q 1---+ a ( Q) dK from l00 (a, /3) to IR is Hadamard-differentiable at every Q. The derivative is the map H �---+ a' (Q) H dK.
J
f
Proof.
Let H1
---+
H in the uniform norm. Consider the difference
f a ( Q + t ; ) - a ( Q) - a' (Q) H I dK. I
The integrand converges to zero everywhere and is bounded uniformly by II a' II oo ( II Ht ll oo+ II H II oo ) · Thus the integral converges to zero by the dominated-convergence theorem. • If the underlying distribution has unbounded support, then its quantile function is un bounded on the domain (0, 1), and no estimator can converge in £ 00 (0, 1). Then the pre ceding lemma can apply only to generating measures K with support (a , /3) strictly within (0, 1 ) . Fortunately, such generating measures are the most interesting ones, as they yield bounded influence functions and hence robust £-statistics. A more serious limitation of using the preceding lemma is that it could require unnec essary smoothness conditions on the distribution of the observations. For instance, the empirical quantile process converges in distribution in £00 (a , /3) only if the underlying dis tribution has a positive density between its a- and /3-quantiles. This is true for most standard distributions, but unnecessary for the asymptotic normality of empirical £-statistics gen erated by smooth measures K . Thus we present a second lemma that applies to smooth measures K and does not require that F be smooth. Let D F[ - oo , oo] be the set of all distribution functions.
J
Let a : IR 1---+ IR be of bounded variation on bounded intervals with (a + + a - ) d i K o F l oo and a (O) = 0. Let K be a signed measure on (0, 1) whose distribution function K is diffe rentiable at F (x ) for a almost-every x and satisfies I K (u + h) - K (u) I :=:: M (u)hfor every sufficiently small I h i , and some function M such that M(F_ ) d i a l oo. Then the map F �---+ a o p- I d K from D F[ -oo, oo] c D[ -oo, oo] to IR is Hadamard differentiable at F, with derivative H 1---+ - (K' o F_ ) H da. 22.10
Lemma. <
J
J
j
<
22.4 L-Estimators for Location
323
First rewrite the function in the form (22.7). Suppose that H1 ----+ H uniformly and set F1 = F + tH1 • By continuity of K , (K o F)_ = K (F_) . Because K o F (oo) = K ( 1 ) for all F , the difference ¢ (F1) - ¢ (F) can be rewritten as - j (K o F1_ - K o F_ ) da . Consider the integral K(F_ + t Ht - ) - K (F_ ) ------ - K (F_ ) H dia l . t
Proof.
/I
I
I
The integrand converges a-almost everywhere to zero and is bounded by M(F_ ) ( II Hr ll oo + II H II oo) ::::: M(F_) (2 11 H II oo + 1 ) , for small t . Thus, the lemma follows by the dominated convergence theorem. •
Because the two lemmas apply to nonoverlapping situations, it is worthwhile to combine the two approaches. A given generating measure K can be split in its discrete and continuous part. The corresponding two parts of the £-statistic can next be shown to be asymptotically linear by application of the two lemmas. Their sum is asymptotically linear as well and hence asymptotically normal. The cumulative distribution function K of the uniform distribution on (a, 1 - a) is uniformly Lipschitz and fails to be differentiable only at the points a and 1 - a . Thus, the trimmed-mean function is Hadamard-differentiable at every F such that the set {x : F (x) = a, or 1 - a } has Lebesgue measure zero. (We assume that a > 0.) In other words, F should not have flats at height a or 1 - a . For such F the trimmed mean is asymptotically normal with asymptotic influence function - : -a ( 1 x sy - F (y)) dy (see (22. 8)), and asymptotic variance 22.1 1
Example (Trimmed mean).
J
1 p-l (l -a)l p-l (1 -a) (F(x 1\ y) - F (x) F(y)) dx dy. p-l (a)
p-l (a)
Figure 22. 1 shows this number as a function of a for a number of distributions.
D
The generating measure of the Winsorized mean is the sum of a discrete measure on the two points a and 1 - a, and Lebesgue measure on the interval (a, 1 - a). The Winsorized mean itself can be decomposed correspondingly. Suppose that the underlying distribution function F has a positive derivative at the points F - 1 (a) and p- I ( 1 - a). Then the first part of the decomposition is asymptotically linear in view of Lemma 22.9 and Lemma 21 .3, the second part is asymptotically linear by Lemma 22. 10 and Theorem 19.3. Combined, this yields the asymptotic linearity of the Winsorized mean and hence its asymptotic normality. D 22.12
Example (Winsorized mean).
22.4
L-Estimators for Location
The a-trimmed mean and the a-Winsorized mean were invented as estimators for location. The question in this section is whether there are still other attractive location estimators within the class of £-statistics. One possible method of generating £-estimators for location is to find the best £ estimators for given location families { f (x - 8) : e E � } , in which f is some fixed density. For instance, for the f equal to the normal shape this leads to the sample mean.
L-Statistics
324
According to Chapter 8, an estimator sequence Tn is asymptotically optimal for estimat ing the location of a density with finite Fisher information I1 if Jn(Tn - 8)
=
1 n 1 J'
-.L (X Jn i = I IJ f - -
i
-
8) + O p ( l ) .
Comparison with equation (22.8) for the influence function of an L -statistic shows that the choices of generating measure K and transformation a such that K' (F(x - 8)) a'(x)
= -
( -I1 -f'f (x - 8) ) ' f
lead to an L-statistic with the optimal asymptotic influence function. This can be accom modated by setting a (x) x and =
The class of L-statistics is apparently large enough to contain an asymptotically efficient estimator sequence for the location parameter of any smooth shape. The L-statistics are not as simplistic as they may seem at first.
Notes
This chapter gives only a few of the many results available on L-statistics. For instance, the results on Hadamard differentiability can be refined by using a weighted uniform norm combined with convergence of the weighted empirical process. This allows greater weights for the extreme-order statistics. For further results and references, see [7 4], [ 1 34] , and [1 36] .
PROBLEMS 1. Find a formula for the asymptotic variance of the Winsorized mean. 2. Let T (F) = J F - 1 (u) k(u) du. (i) Show that T (F) = 0 for every distribution F that is symmetric about zero if and only if k is symmetric about 1 /2. (ii) Show that T (F) is location equivariant if and only if J k(u) du = 1 . (iii) Show that "efficient" L -statistics obtained from symmetric densities possess both prop erties (i) and (ii) . 3. Let X 1 , . . . , X n be a random sample from a continuous distribution function. Show that con ditionally on (Xn (k ) , Xn (l ) ) = (x , y) , the variables Xn (k + l) , . . . , Xn ( l - 1 ) are distributed as the order statistics of a random s ample of size l k 1 from the conditional distribution of X 1 given that x :S X 1 :S y. How can you use this to study the properties of trimmed means? 4. Find an optimal L -statistic for estimating the location in the logis tic and Laplace location families . 5. Does there exist a distribution for which the trimmed mean is asymptotically optimal for esti mating location?
- -
Problems
325
6 . (Partial Integration.) If a : JR 1----i> JR is right-continuous and nondecreasing with a(O) b: JR 1----i> JR is right-continuous , nondecreasing and bounded, then
= 0, and
f a db = lc{o, oo) (b(oo) - b- ) da + J( oo,OJ (b( - oo) - b - ) da . -
a is J b_ da .
Prove this. If
als o bounded, then the righthand side can be written more succinctly as (Substitute a (x) = fco ,x] da for x > 0 and a(x) = - fcx , O] da for x :::: 0 into the left side of the equation, and use Fubini' s theorem separately on the integral over the positive and negative part of the real line.)
ab l�oo
-
23 Bootstrap
This chapter investigates the asymptotic properties of bootstrap estima tors for distributions and confidence intervals. The consistency of the bootstrap for the sample mean implies the consistency for many other statistics by the delta method. A similar result is valid with the empirical process.
23.1
Introduction
In most estimation problems it is important to give an indication of the precision of a given estimate. A simple method is to provide an estimate of the bias and variance of the estimator; more accurate is a confidence interval for the parameter. In this chapter we concentrate on bootstrap confidence intervals and, more generally, discuss the bootstrap as a method of estimating the distribution of a given statistic. Let 8 be an estimator of some parameter 8 attached to the distribution P of the obser vations. The distribution of the difference 8 - 8 contains all the information needed for assessing the precision of 8 . In particular, if �a is the upper a-quantile of the distribution of (8 - 8)/ff , then P(8 - �fJ ff :::; 8 :::; 8 - � 1 -aff I P) :::: 1 - fJ - a. Here ff may be arbitrary, but it is typically an estimate of the standard deviation of 8 . It follows that the interval [ 8 - �fJ fj , 8 - � 1 - a fj ] is a confidence interval of level 1 - fJ - a. Unfortunately, in most situations the quantiles and the distribution of 8 - 8 depend on the unknown distribution P of the observations and cannot be used to assess the performance of 8 . They must be replaced by estimators. If the sequence (8 - 8)/ff tends in distribution to a standard normal variable, then the normal N(O, ff 2 )-distribution can be used as an estimator of the distribution of 8 - 8 , and we can substitute the standard normal quantiles Za for the quantiles �a · The weak convergence implies that the interval [8 - Z fJ ff , 8 - Z 1 - a ff ] is a confidence interval of asymptotic level 1 - a - fJ . Bootstrap procedures yield an alternative. They are based on an estimate P of the un derlying distribution P of the observations. The distribution of (8 - 8)/ff under P can, in principle, be written as a function of P . The bootstrap estimator for this distribution is the "plug-in" estimator obtained by substituting P for P in this function. Bootstrap estimators 326
327
23. 1 Introduction
for quantiles, and next confidence intervals, are obtained from the bootstrap estimator for the distribution. The following type of notation is customary. Let e* and 8* be computed from (hypo thetic) observations obtained according to p in the same way e and a are computed from the true observations with distribution P . If fJ is related to P in the same way 8 is related to p ' then the bootstrap estimator for the distribution of ce - 8)18 under p is the distribution of (e* - e) I a* under p . The latter is evaluated given the original observations, that is, for a fixed realization of P . A bootstrap estimator for a quantile �a of ( e - 8 ) I a is a quantile of the distribution of (e* - e) Ia* under P . This is the smallest value X = �a that satisfies the inequality
(
P e* � e 8
::: x
1?
)
�
(23 . 1 )
1 - a.
The notation P(- I P ) indicates that the distribution of (e*, 8*) must be evaluated assum ing that the observations are sampled according to P given the original observations. In particular, in the preceding display e is to be considered nonrandom. The left side of the preceding display is a function of the original observations, whence the same is true for �a . If P is close to the true underlying distribution P , then the bootstrap quantiles should be close to the true quantiles, whence it should be true that
(
e - 8 ;P ---;
:S
A
�a I P
)
�
1 - CY .
In this chapter we show that this approximation is valid in an asymptotic sense: The probability on the left converges to 1 - a as the number of observations tends to infinity. Thus, the bootstrap confidence interval A
A
A
[e - �f3 a , 8
e-8 - � l -a B] = 8 : � 1 -a ::; ---;;A
{
A
:S
A
�f3
}
possesses asymptotic confidence level 1 - a - f3 . The statistic a i s typically chosen equal to an estimator of the (asymptotic) standard deviation of e . The resulting bootstrap method is known as the percentile t -method, in view of the fact that it is based on estimating quantiles of the "studentized" statistic ce - 8) 1 a . (The notion of a t -statistic is used here in an abstract manner to denote a centered statistic divided by a scale estimate; in general, there is no relationship with Student's t-distribution from normal theory.) A simpler method is to choose a independent of the data. If we choose a = a* = 1 , then the bootstrap quantiles �a are the quantiles of the centered statistic e* - fJ . This is known as the percentile method. Both methods yield asymptotically correct confidence levels, although the percentile t-method is generally more accurate. A third method, Efron 's percentile method, proposes the confidence interval [E 1 - f3 , Ea ] for Ea equal to the upper a-quantile of e*: the smallest value X = Ea such that
$$P\bigl(\hat\theta^* \le x \mid \hat P\bigr) \ge 1 - \alpha.$$

Thus, $\hat\tau_\alpha$ results from "bootstrapping" $\hat\theta$, while $\hat\xi_\alpha$ is the product of bootstrapping $(\hat\theta - \theta)/\hat\sigma$. These quantiles are related, and Efron's percentile interval can be reexpressed in the quantiles $\hat\xi_\alpha$ of $\hat\theta^* - \hat\theta$ (employed by the percentile method with $\hat\sigma = 1$) as

$$[\hat\tau_{1-\beta},\ \hat\tau_\alpha] = \bigl[\hat\theta + \hat\xi_{1-\beta},\ \hat\theta + \hat\xi_\alpha\bigr].$$
The logical justification for this interval is less strong than for the intervals based on bootstrapping $\hat\theta - \theta$, but it appears to work well. The two types of intervals coincide in the case that the conditional distribution of $\hat\theta^* - \hat\theta$ is symmetric about zero. We shall see that the difference is asymptotically negligible if $\hat\theta^* - \hat\theta$ converges to a normal distribution.

Efron's percentile interval is the only one among the three intervals that is invariant under monotone transformations. For instance, when setting a confidence interval for the correlation coefficient, the sample correlation coefficient might be transformed by Fisher's transformation before carrying out the bootstrap scheme. Next, the confidence interval for the transformed correlation can be transformed back into a confidence interval for the correlation coefficient. This operation would have no effect on Efron's percentile interval but can improve the other intervals considerably, in view of the skewness of the statistic. In this sense Efron's method automatically "finds" useful (stabilizing) transformations. The fact that it does not become better through transformations of course does not imply that it is good, but the invariance appears desirable.

Several elements of the bootstrap scheme are still unspecified. The missing probability $\alpha + \beta$ can be distributed over the two tails of the confidence interval in several ways. In many situations equal-tailed confidence intervals, corresponding to the choice $\alpha = \beta$, are reasonable. In general, these do not have $\hat\theta$ exactly as the midpoint of the interval. An alternative is the symmetric interval

$$\bigl[\hat\theta - \hat\xi_\alpha^{\mathrm{abs}}\,\hat\sigma,\ \hat\theta + \hat\xi_\alpha^{\mathrm{abs}}\,\hat\sigma\bigr],$$
with $\hat\xi_\alpha^{\mathrm{abs}}$ equal to the upper $\alpha$-quantile of $|\hat\theta^* - \hat\theta|/\hat\sigma^*$. A further possibility is to choose $\alpha$ and $\beta$ under the side condition that the difference $\hat\xi_\beta - \hat\xi_{1-\alpha}$, which is proportional to the length of the confidence interval, is minimal.

More interesting is the choice of the estimator $\hat P$ for the underlying distribution. If the original observations are a random sample $X_1, \ldots, X_n$ from a probability distribution $P$, then one candidate is the empirical distribution $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{X_i}$ of the observations, leading to the empirical bootstrap. Generating a random sample from the empirical distribution amounts to resampling with replacement from the set $\{X_1, \ldots, X_n\}$ of original observations. The name "bootstrap" derives from this resampling procedure, which might be surprising at first, because the observations are "sampled twice." If we view the bootstrap as a nonparametric plug-in estimator, we see that there is nothing peculiar about resampling. We shall be mostly concerned with the empirical bootstrap, even though there are many other possibilities. If the observations are thought to follow a specified parametric model, then it is more reasonable to set $\hat P$ equal to $P_{\hat\theta}$ for a given estimator $\hat\theta$. This is what one would have done in the first place, but it is called the parametric bootstrap within the present context. That the bootstrapping methodology is far from obvious is clear from the fact that the literature also considers the exchangeable, the Bayesian, the smoothed, and the wild bootstrap, as well as several schemes for bootstrap corrections. Even "resampling" can be carried out differently, for instance, by sampling fewer than $n$ variables, or without replacement.

It is almost never possible to calculate the bootstrap quantiles $\hat\xi_\alpha$ numerically. In practice, these estimators are approximated by a simulation procedure. A large number of independent bootstrap samples $X_1^*, \ldots, X_n^*$ are generated according to the estimated distribution $\hat P$. Each sample gives rise to a bootstrap value $(\hat\theta^* - \hat\theta)/\hat\sigma^*$ of the standardized statistic. Finally, the bootstrap quantiles $\hat\xi_\alpha$ are estimated by the empirical quantiles of these bootstrap
values. This simulation scheme always produces an additional (random) error in the coverage probability of the resulting confidence interval. In principle, by using a sufficiently large number of bootstrap samples, possibly combined with an efficient method of simulation, this error can be made arbitrarily small. Therefore the additional error is usually ignored in the theory of the bootstrap procedure. This chapter follows this custom and concerns the "exact" distribution and quantiles of $(\hat\theta^* - \hat\theta)/\hat\sigma^*$, without taking a simulation error into account.
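The simulation scheme just described is easy to carry out. The following sketch (Python with NumPy; the function names and the exponential example are ours, not from the text) approximates the percentile $t$-interval $[\hat\theta - \hat\xi_\beta\hat\sigma,\ \hat\theta - \hat\xi_{1-\alpha}\hat\sigma]$ by replacing the bootstrap quantiles with empirical quantiles of $B$ simulated values of $(\hat\theta^* - \hat\theta)/\hat\sigma^*$ under the empirical bootstrap.

```python
import numpy as np

def bootstrap_t_interval(x, estimator, scale, alpha=0.025, beta=0.025, B=2000, seed=None):
    """Percentile-t bootstrap confidence interval for a real parameter.

    estimator(x) returns the point estimate theta_hat, scale(x) an estimate
    sigma_hat of its standard deviation.  Bootstrap samples are drawn with
    replacement from the data x (empirical bootstrap).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    theta_hat, sigma_hat = estimator(x), scale(x)
    t_star = np.empty(B)
    for b in range(B):
        xb = x[rng.integers(0, n, size=n)]            # resample with replacement
        t_star[b] = (estimator(xb) - theta_hat) / scale(xb)
    # upper alpha-quantile xi_alpha satisfies P(T* <= xi_alpha) >= 1 - alpha
    xi_beta = np.quantile(t_star, 1 - beta)
    xi_1_minus_alpha = np.quantile(t_star, alpha)
    return theta_hat - xi_beta * sigma_hat, theta_hat - xi_1_minus_alpha * sigma_hat

# Example: confidence interval for a population mean.
rng = np.random.default_rng(0)
data = rng.exponential(size=50)
mean = lambda y: y.mean()
se = lambda y: y.std(ddof=0) / np.sqrt(len(y))
print(bootstrap_t_interval(data, mean, se, seed=1))
```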
23.2 Consistency
A confidence interval $[\hat\theta_{n,1}, \hat\theta_{n,2}]$ is (conservatively) asymptotically consistent at level $1 - \alpha - \beta$ if, for every possible $P$,

$$\liminf_{n\to\infty} P\bigl(\hat\theta_{n,1} \le \theta \le \hat\theta_{n,2} \mid P\bigr) \ge 1 - \alpha - \beta.$$

The consistency of a bootstrap confidence interval is closely related to the consistency of the bootstrap estimator of the distribution of $(\hat\theta_n - \theta)/\hat\sigma_n$. The latter is best defined relative to a metric on the collection of possible laws of the estimator. Call the bootstrap estimator for the distribution consistent relative to the Kolmogorov-Smirnov distance if

$$\sup_x \Bigl| P\Bigl(\frac{\hat\theta_n - \theta}{\hat\sigma_n} \le x \,\Big|\, P\Bigr) - P\Bigl(\frac{\hat\theta_n^* - \hat\theta_n}{\hat\sigma_n^*} \le x \,\Big|\, \hat P_n\Bigr)\Bigr| \xrightarrow{P} 0.$$
It is not a great loss of generality to assume that the sequence $(\hat\theta_n - \theta)/\hat\sigma_n$ converges in distribution to a continuous distribution function $F$ (in our examples a normal distribution function), in which case the Kolmogorov-Smirnov consistency is equivalent to the requirement that, for every $x$,

$$P\Bigl(\frac{\hat\theta_n - \theta}{\hat\sigma_n} \le x \,\Big|\, P\Bigr) \to F(x), \qquad P\Bigl(\frac{\hat\theta_n^* - \hat\theta_n}{\hat\sigma_n^*} \le x \,\Big|\, \hat P_n\Bigr) \xrightarrow{P} F(x). \qquad (23.2)$$
(See Problem 23.1.) This type of consistency implies the asymptotic consistency of confidence intervals.

23.3 Lemma. Suppose that $(\hat\theta_n - \theta)/\hat\sigma_n \rightsquigarrow T$ and that $(\hat\theta_n^* - \hat\theta_n)/\hat\sigma_n^* \rightsquigarrow T$ given the original observations, in probability, for a random variable $T$ with a continuous distribution function. Then the bootstrap confidence intervals $[\hat\theta_n - \hat\xi_{n,\beta}\,\hat\sigma_n,\ \hat\theta_n - \hat\xi_{n,1-\alpha}\,\hat\sigma_n]$ are asymptotically consistent at level $1 - \alpha - \beta$. If the conditions hold for nonrandom $\hat\sigma_n = \hat\sigma_n^*$ and $T$ is symmetrically distributed about zero, then the same is true for Efron's percentile intervals.

Proof. Every subsequence has a further subsequence along which the sequence $(\hat\theta_n^* - \hat\theta_n)/\hat\sigma_n^*$ converges weakly to $T$, conditionally, almost surely. For simplicity, assume that the whole sequence converges almost surely; otherwise, argue along subsequences. If a sequence of distribution functions $F_n$ converges weakly to a distribution function $F$, then the corresponding quantile functions $F_n^{-1}$ converge to the quantile function $F^{-1}$ at every continuity point (see Lemma 21.2). Apply this to the (random) distribution functions $\hat F_n$ of $(\hat\theta_n^* - \hat\theta_n)/\hat\sigma_n^*$ and a continuity point $1 - \alpha$ of the quantile function $F^{-1}$ of $T$ to conclude
that $\hat\xi_{n,\alpha} = \hat F_n^{-1}(1 - \alpha)$ converges almost surely to $F^{-1}(1 - \alpha)$. By Slutsky's lemma, the sequence $(\hat\theta_n - \theta)/\hat\sigma_n - \hat\xi_{n,\alpha}$ converges weakly to $T - F^{-1}(1 - \alpha)$. Thus

$$P\bigl(\theta \ge \hat\theta_n - \hat\sigma_n\,\hat\xi_{n,\alpha}\bigr) = P\Bigl(\frac{\hat\theta_n - \theta}{\hat\sigma_n} \le \hat\xi_{n,\alpha} \,\Big|\, P\Bigr) \to P\bigl(T \le F^{-1}(1 - \alpha)\bigr) = 1 - \alpha.$$
This argument applies to all except at most countably many $\alpha$. Because both the left and the right sides of the preceding display are monotone functions of $\alpha$ and the right side is continuous, it must be valid for every $\alpha$. The consistency of the bootstrap confidence interval follows.

Efron's percentile interval is the interval $[\hat\tau_{n,1-\beta},\ \hat\tau_{n,\alpha}]$, where $\hat\tau_{n,\alpha} = \hat\theta_n + \hat\xi_{n,\alpha}$. By the preceding argument,

$$P\bigl(\theta \ge \hat\tau_{n,1-\beta}\bigr) = P\bigl(\hat\theta_n - \theta \le -\hat\xi_{n,1-\beta} \mid P\bigr) \to P\bigl(T \le -F^{-1}(\beta)\bigr) = 1 - \beta.$$

The last equality follows by the symmetry of $T$. The consistency follows.
•
From now on we consider the empirical bootstrap; that is, $\hat P_n = \mathbb{P}_n$ is the empirical distribution of a random sample $X_1, \ldots, X_n$. We shall establish (23.2) for a large class of statistics, with $F$ the normal distribution. Our method is first to prove the consistency for $\hat\theta_n$ equal to the sample mean and next to show that the consistency is retained under application of the delta method. Combining these results, we obtain the consistency of many bootstrap procedures, for instance for setting confidence intervals for the correlation coefficient.

In view of Slutsky's lemma, weak convergence of the centered sequence $\sqrt n(\hat\theta_n - \theta)$ combined with convergence in probability of $\sqrt n\,\hat\sigma_n$ yields the weak convergence of the studentized statistics $(\hat\theta_n - \theta)/\hat\sigma_n$. An analogous statement is true for the bootstrap statistic, for which the convergence in probability of $\sqrt n\,\hat\sigma_n^*$ must be shown conditionally on the original observations. Establishing (conditional) consistency of $\sqrt n\,\hat\sigma_n$ and $\sqrt n\,\hat\sigma_n^*$ is usually not hard. Therefore, we restrict ourselves to studying the nonstudentized statistics.

Let $\overline X_n$ be the mean of a sample of $n$ random vectors from a distribution with finite mean vector $\mu$ and covariance matrix $\Sigma$. According to the multivariate central limit theorem, the sequence $\sqrt n(\overline X_n - \mu)$ is asymptotically normal $N(0, \Sigma)$-distributed. We wish to show the same for $\sqrt n(\overline X_n^* - \overline X_n)$, in which $\overline X_n^*$ is the average of $n$ observations from $\hat P_n$, that is, of $n$ values resampled from the set of original observations $\{X_1, \ldots, X_n\}$ with replacement.
23.4 Theorem (Sample mean). Let $X_1, X_2, \ldots$ be i.i.d. random vectors with mean $\mu$ and covariance matrix $\Sigma$. Then, conditionally on $X_1, X_2, \ldots$, for almost every sequence $X_1, X_2, \ldots$,

$$\sqrt n\bigl(\overline X_n^* - \overline X_n\bigr) \rightsquigarrow N(0, \Sigma).$$
Proof. For a fixed sequence $X_1, X_2, \ldots$, the variable $\overline X_n^*$ is the average of $n$ observations $X_1^*, \ldots, X_n^*$ sampled from the empirical distribution $\mathbb{P}_n$. The (conditional) mean and covariance matrix of these observations are

$$\mathrm{E}\bigl(X_i^* \mid \mathbb{P}_n\bigr) = \overline X_n, \qquad \mathrm{Cov}\bigl(X_i^* \mid \mathbb{P}_n\bigr) = \frac1n\sum_{i=1}^n \bigl(X_i - \overline X_n\bigr)\bigl(X_i - \overline X_n\bigr)^T.$$
By the strong law of large numbers, the conditional covariance converges to $\Sigma$ for almost every sequence $X_1, X_2, \ldots$.

The asymptotic distribution of $\overline X_n^*$ can be established by the central limit theorem. Because the observations $X_1^*, \ldots, X_n^*$ are sampled from a different distribution $\mathbb{P}_n$ for every $n$, a central limit theorem for a triangular array is necessary. The Lindeberg central limit theorem, Theorem 2.27, is appropriate. It suffices to show that, for every $\varepsilon > 0$,

$$\mathrm{E}\bigl(\|X_1^*\|^2\,1\{\|X_1^*\| > \varepsilon\sqrt n\} \,\big|\, \mathbb{P}_n\bigr) \to 0.$$
The left side is smaller than $n^{-1}\sum_{i=1}^n \|X_i\|^2\,1\{\|X_i\| > M\}$ as soon as $\varepsilon\sqrt n \ge M$. By the strong law of large numbers, the latter average converges to $\mathrm{E}\|X_i\|^2\,1\{\|X_i\| > M\}$ for almost every sequence $X_1, X_2, \ldots$. For sufficiently large $M$, this expression is arbitrarily small. Conclude that the limit superior of the left side of the preceding display is smaller than any number $\eta > 0$ almost surely, and hence the left side converges to zero for almost every sequence $X_1, X_2, \ldots$. •

Assume that $\hat\theta_n$ is a statistic and that $\phi$ is a given differentiable map. If the sequence $\sqrt n(\hat\theta_n - \theta)$ converges in distribution, then so does the sequence $\sqrt n\bigl(\phi(\hat\theta_n) - \phi(\theta)\bigr)$, by the delta method. The bootstrap estimator for the distribution of $\phi(\hat\theta_n) - \phi(\theta)$ is $\phi(\hat\theta_n^*) - \phi(\hat\theta_n)$. If the bootstrap is consistent for estimating the distribution of $\hat\theta_n - \theta$, then it is also consistent for estimating the distribution of $\phi(\hat\theta_n) - \phi(\theta)$.

23.5 Theorem (Delta method for bootstrap). Let $\phi : \mathbb{R}^k \mapsto \mathbb{R}^m$ be a measurable map defined and continuously differentiable in a neighborhood of $\theta$. Let $\hat\theta_n$ be random vectors taking their values in the domain of $\phi$ that converge almost surely to $\theta$. If $\sqrt n(\hat\theta_n - \theta) \rightsquigarrow T$ and $\sqrt n(\hat\theta_n^* - \hat\theta_n) \rightsquigarrow T$ conditionally almost surely, then both $\sqrt n\bigl(\phi(\hat\theta_n) - \phi(\theta)\bigr) \rightsquigarrow \phi'_\theta(T)$ and $\sqrt n\bigl(\phi(\hat\theta_n^*) - \phi(\hat\theta_n)\bigr) \rightsquigarrow \phi'_\theta(T)$, conditionally almost surely.
Proof. By the mean value theorem, the difference $\phi(\hat\theta_n^*) - \phi(\hat\theta_n)$ can be written as $\phi'_{\tilde\theta_n}(\hat\theta_n^* - \hat\theta_n)$ for a point $\tilde\theta_n$ between $\hat\theta_n^*$ and $\hat\theta_n$, if the latter two points are in the ball around $\theta$ in which $\phi$ is continuously differentiable. By the continuity of the derivative, there exists for every $\eta > 0$ a constant $\delta > 0$ such that $\|\phi'_{\theta'}h - \phi'_\theta h\| \le \eta\|h\|$ for every $h$ and every $\|\theta' - \theta\| \le \delta$. If $n$ is sufficiently large, $\delta$ sufficiently small, $\sqrt n\,\|\hat\theta_n^* - \hat\theta_n\| \le M$, and $\|\hat\theta_n - \theta\| \le \delta$, then

$$\bigl\|\sqrt n\bigl(\phi(\hat\theta_n^*) - \phi(\hat\theta_n)\bigr) - \phi'_\theta\bigl(\sqrt n(\hat\theta_n^* - \hat\theta_n)\bigr)\bigr\| \le \eta\,\sqrt n\,\|\hat\theta_n^* - \hat\theta_n\| \le \eta M.$$

Fix a number $\varepsilon > 0$ and a large number $M$. For $\eta$ sufficiently small to ensure that $\eta M < \varepsilon$,

$$P\Bigl(\bigl\|\sqrt n\bigl(\phi(\hat\theta_n^*) - \phi(\hat\theta_n)\bigr) - \phi'_\theta\bigl(\sqrt n(\hat\theta_n^* - \hat\theta_n)\bigr)\bigr\| > \varepsilon \,\Big|\, \hat P_n\Bigr) \le P\bigl(\sqrt n\,\|\hat\theta_n^* - \hat\theta_n\| > M \,\big|\, \hat P_n\bigr) + 1\bigl\{\|\hat\theta_n - \theta\| > \delta\bigr\}.$$

Because $\hat\theta_n$ converges almost surely to $\theta$, the right side converges almost surely to $P(\|T\| \ge M)$ for every continuity point $M$ of $\|T\|$. This can be made arbitrarily small by choice of $M$. Conclude that the left side converges to $0$ almost surely. The theorem follows by an application of Slutsky's lemma. •

23.6 Example (Sample variance). The (biased) sample variance $s_n^2 = n^{-1}\sum_{i=1}^n (X_i - \overline X_n)^2$ equals $\phi(\overline X_n, \overline{X^2_n})$ for the map $\phi(x, y) = y - x^2$. The empirical bootstrap is consistent
for estimation of the distribution of $(\overline X_n, \overline{X^2_n}) - (\alpha_1, \alpha_2)$, by Theorem 23.4, provided that the fourth moment of the underlying distribution is finite. The delta method shows that the empirical bootstrap is consistent for estimating the distribution of $s_n^2 - \sigma^2$, in that, for almost every sequence of observations, $\sqrt n\,(s_n^{*2} - s_n^2)$ converges conditionally in distribution to the same $N(0, \mu_4 - \sigma^4)$-limit as $\sqrt n\,(s_n^2 - \sigma^2)$, with $\mu_4$ the fourth central moment.
The asymptotic variance of $s_n^2$ can be estimated by $S_n^4(k_n + 2)$, in which $k_n$ is the sample kurtosis. The law of large numbers shows that this estimator is asymptotically consistent. The bootstrap version of this estimator can be shown to be consistent given almost every sequence of the original observations. Thus, the consistency of the empirical bootstrap extends to the studentized statistic $(s_n^2 - \sigma^2)/\bigl(s_n^2\sqrt{k_n + 2}\bigr)$. □
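As a concrete illustration of this example, the following sketch (Python with NumPy; the helper names and the gamma-distributed sample are ours, not from the text) bootstraps the studentized sample variance. Under the theory above, the bootstrap quantiles of this statistic should approach standard normal quantiles.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, size=200)          # skewed sample with finite fourth moment
n = len(x)

def studentized_variance(sample, center):
    """sqrt(n) * (s^2 - center) / (s^2 * sqrt(k + 2)), k the sample kurtosis."""
    s2 = sample.var(ddof=0)                  # biased sample variance
    m4 = ((sample - sample.mean()) ** 4).mean()
    k = m4 / s2**2 - 3.0                     # sample kurtosis
    return np.sqrt(len(sample)) * (s2 - center) / (s2 * np.sqrt(k + 2.0))

s2_hat = x.var(ddof=0)
t_star = np.array([
    studentized_variance(x[rng.integers(0, n, size=n)], s2_hat)
    for _ in range(2000)
])
# Bootstrap estimates of the 5% and 95% quantiles of the studentized statistic.
print(np.quantile(t_star, [0.05, 0.95]))
```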
*23.2.1 Empirical Bootstrap
In this section we follow the same method as previously, but we replace the sample mean by the empirical distribution and the delta method by the functional delta method. This is more involved, but more flexible, and yields, for instance, the consistency of the bootstrap of the sample median.

Let $\mathbb{P}_n$ be the empirical distribution of a random sample $X_1, \ldots, X_n$ from a distribution $P$ on a measurable space $(\mathcal X, \mathcal A)$, and let $\mathcal F$ be a Donsker class of measurable functions $f : \mathcal X \mapsto \mathbb{R}$, as defined in Chapter 19. Given the sample values, let $X_1^*, \ldots, X_n^*$ be a random sample from $\mathbb{P}_n$. The bootstrap empirical distribution is the empirical measure $\mathbb{P}_n^* = n^{-1}\sum_{i=1}^n \delta_{X_i^*}$, and the bootstrap empirical process $\mathbb{G}_n^*$ is

$$\mathbb{G}_n^* = \sqrt n\,\bigl(\mathbb{P}_n^* - \mathbb{P}_n\bigr) = \frac{1}{\sqrt n}\sum_{i=1}^n (M_{ni} - 1)\,\delta_{X_i},$$
in which $M_{ni}$ is the number of times that $X_i$ is "redrawn" from $\{X_1, \ldots, X_n\}$ to form $X_1^*, \ldots, X_n^*$. By construction, the vector of counts $(M_{n1}, \ldots, M_{nn})$ is independent of $X_1, \ldots, X_n$ and multinomially distributed with parameters $n$ and (probabilities) $1/n, \ldots, 1/n$.

If the class $\mathcal F$ has a finite envelope function $F$, then both the empirical process $\mathbb{G}_n$ and the bootstrap process $\mathbb{G}_n^*$ can be viewed as maps into the space $\ell^\infty(\mathcal F)$. The analogue of Theorem 23.4 is that the sequence $\mathbb{G}_n^*$ converges in $\ell^\infty(\mathcal F)$ conditionally in distribution to the same limit as the sequence $\mathbb{G}_n$, a tight Brownian bridge process $\mathbb{G}_P$. To give a precise meaning to "conditional weak convergence" in $\ell^\infty(\mathcal F)$, we use the bounded Lipschitz metric. It can be shown that a sequence of random elements $\mathbb{G}_n$ in $\ell^\infty(\mathcal F)$ converges in distribution to a tight limit $\mathbb{G}$ in $\ell^\infty(\mathcal F)$ if and only if†

$$\sup_{h \in \mathrm{BL}_1(\ell^\infty(\mathcal F))} \bigl| \mathrm{E}^* h(\mathbb{G}_n) - \mathrm{E}\,h(\mathbb{G}) \bigr| \to 0.$$

We use the notation $\mathrm{E}_M$ to denote "taking the expectation conditionally on $X_1, \ldots, X_n$," or the expectation with respect to the multinomial vectors $M_n$ only.‡
† For a metric space $\mathbb{D}$, the set $\mathrm{BL}_1(\mathbb{D})$ consists of all functions $h : \mathbb{D} \mapsto [-1, 1]$ that are uniformly Lipschitz: $|h(z_1) - h(z_2)| \le d(z_1, z_2)$ for every pair $(z_1, z_2)$. See, for example, Chapter 1.12 of [146].
‡ For a proof of Theorem 23.7, see the original paper [58], or, for example, Chapter 3.6 of [146].
23.7 Theorem (Empirical bootstrap). For every Donsker class $\mathcal F$ of measurable functions with finite envelope function $F$,

$$\sup_{h \in \mathrm{BL}_1(\ell^\infty(\mathcal F))} \bigl| \mathrm{E}_M h(\mathbb{G}_n^*) - \mathrm{E}\,h(\mathbb{G}_P) \bigr| \xrightarrow{P} 0.$$

Furthermore, the sequence $\mathbb{G}_n^*$ is asymptotically measurable. If $P^*F^2 < \infty$, then the convergence is outer almost surely as well.
Next, consider an analogue of Theorem 23.5, using the functional delta method. Theorem 23.5 goes through without too many changes. However, for many infinite-dimensional applications of the delta method the condition of continuous differentiability imposed in Theorem 23.5 fails. This problem may be overcome in several ways. In particular, continuous differentiability is not necessary for the consistency of the bootstrap "in probability" (rather than "almost surely"). Because this appears to be sufficient for statistical applications, we shall limit ourselves to this case.

Consider sequences of maps $\hat\theta_n$ and $\hat\theta_n^*$ with values in a normed space $\mathbb{D}$ (e.g., $\ell^\infty(\mathcal F)$) such that the sequence $\sqrt n(\hat\theta_n - \theta)$ converges unconditionally in distribution to a tight random element $T$, and the sequence $\sqrt n(\hat\theta_n^* - \hat\theta_n)$ converges conditionally given $X_1, X_2, \ldots$ in distribution to the same random element $T$. A precise formulation of the second is that

$$\sup_{h \in \mathrm{BL}_1(\mathbb{D})} \bigl| \mathrm{E}_M h\bigl(\sqrt n(\hat\theta_n^* - \hat\theta_n)\bigr) - \mathrm{E}\,h(T) \bigr| \xrightarrow{P} 0. \qquad (23.8)$$
Here the notation $\mathrm{E}_M$ means the conditional expectation given the original data $X_1, X_2, \ldots$, and is motivated by the application to the bootstrap empirical distribution.† By the preceding theorem, the empirical distribution $\hat\theta_n = \mathbb{P}_n$ satisfies condition (23.8) if viewed as a map in $\ell^\infty(\mathcal F)$, for a Donsker class $\mathcal F$.
23.9 Theorem (Delta method for bootstrap). Let $\mathbb{D}$ be a normed space and let $\phi : \mathbb{D}_\phi \subset \mathbb{D} \mapsto \mathbb{R}^k$ be Hadamard differentiable at $\theta$ tangentially to a subspace $\mathbb{D}_0$. Let $\hat\theta_n$ and $\hat\theta_n^*$ be maps with values in $\mathbb{D}_\phi$ such that $\sqrt n(\hat\theta_n - \theta) \rightsquigarrow T$ and such that (23.8) holds, in which $\sqrt n(\hat\theta_n^* - \hat\theta_n)$ is asymptotically measurable and $T$ is tight and takes its values in $\mathbb{D}_0$. Then the sequence $\sqrt n\bigl(\phi(\hat\theta_n^*) - \phi(\hat\theta_n)\bigr)$ converges conditionally in distribution to $\phi'_\theta(T)$, given $X_1, X_2, \ldots$, in probability.
Proof. By the Hahn-Banach theorem it is not a loss of generality to assume that the derivative $\phi'_\theta : \mathbb{D} \mapsto \mathbb{R}^k$ is defined and continuous on the whole space. For every $h \in \mathrm{BL}_1(\mathbb{R}^k)$, the function $h \circ \phi'_\theta$ is contained in $\mathrm{BL}_{\|\phi'_\theta\|}(\mathbb{D})$. Thus (23.8) implies

$$\sup_{h \in \mathrm{BL}_1(\mathbb{R}^k)} \bigl| \mathrm{E}_M h\bigl(\phi'_\theta(\sqrt n(\hat\theta_n^* - \hat\theta_n))\bigr) - \mathrm{E}\,h\bigl(\phi'_\theta(T)\bigr) \bigr| \xrightarrow{P} 0.$$

Because $|h(x) - h(y)|$ is bounded by $2 \wedge d(x, y)$ for every $h \in \mathrm{BL}_1(\mathbb{R}^k)$,

$$\sup_{h \in \mathrm{BL}_1(\mathbb{R}^k)} \bigl| \mathrm{E}_M h\bigl(\sqrt n(\phi(\hat\theta_n^*) - \phi(\hat\theta_n))\bigr) - \mathrm{E}_M h\bigl(\phi'_\theta(\sqrt n(\hat\theta_n^* - \hat\theta_n))\bigr) \bigr| \le \varepsilon + 2\,P_M\Bigl(\bigl\|\sqrt n\bigl(\phi(\hat\theta_n^*) - \phi(\hat\theta_n)\bigr) - \phi'_\theta\bigl(\sqrt n(\hat\theta_n^* - \hat\theta_n)\bigr)\bigr\| > \varepsilon\Bigr). \qquad (23.10)$$

† It is assumed that $h\bigl(\sqrt n(\hat\theta_n^* - \hat\theta_n)\bigr)$ is a measurable function of $M$.
The theorem is proved once it has been shown that the conditional probability on the right converges to zero in outer probability. The sequence $\sqrt n(\hat\theta_n^* - \hat\theta_n,\ \hat\theta_n - \theta)$ converges (unconditionally) in distribution to a pair of two independent copies of $T$. This follows, because conditionally given $X_1, X_2, \ldots$, the second component is deterministic, and the first component converges in distribution to $T$, which is the same for every sequence $X_1, X_2, \ldots$. Therefore, by the continuous-mapping theorem both sequences $\sqrt n(\hat\theta_n - \theta)$ and $\sqrt n(\hat\theta_n^* - \theta)$ converge (unconditionally) in distribution to separable random elements that concentrate on the linear space $\mathbb{D}_0$. By Theorem 20.8,

$$\sqrt n\bigl(\phi(\hat\theta_n^*) - \phi(\theta)\bigr) = \phi'_\theta\bigl(\sqrt n(\hat\theta_n^* - \theta)\bigr) + o_P^*(1), \qquad \sqrt n\bigl(\phi(\hat\theta_n) - \phi(\theta)\bigr) = \phi'_\theta\bigl(\sqrt n(\hat\theta_n - \theta)\bigr) + o_P^*(1).$$
Subtract the second from the first equation to conclude that the sequence ,Jn(¢ Ce;) ¢ (en)) - ¢� ( ,Jfi(e; - en)) converges (unconditionally) to 0 in outer probability. Thus, the conditional probability on the right in (23. 10) converges to zero in outer mean. This concludes the proof. • Because the cells ( t] c lR form a Donsker class, the empirical distribution function IFn of a random sample of real-valued variables satisfies the condition of the preceding theorem. Thus, conditionally on X 1 , X 2 , . . . , the sequence ,Jn ( ¢ (IF� ) - ¢ (IFn)) converges in distribution to the same limit as ,Jn( ¢ (IFn) ¢ (F) ) , for every Hadamard-differentiable function ¢. This includes, among others, quantiles and trimmed means, under the same conditions on the underlying measure F that ensure that empirical quantiles and trimmed means are asymptotically normal. See Lemmas 2 1 .3, 22.9, and 22. 10. 0 23.11
Example (Empirical distribution function).
23.3 Higher-Order Correctness
The investigation of the performance of a bootstrap confidence interval can be refined by taking into account the order at which the true level converges to the desired level. A confidence interval is (conservatively) correct at level $1 - \alpha - \beta$ up to order $O(n^{-k})$ if

$$P\bigl(\hat\theta_{n,1} \le \theta \le \hat\theta_{n,2} \mid P\bigr) \ge 1 - \alpha - \beta - O\Bigl(\frac{1}{n^k}\Bigr).$$
Similarly, the quality of the bootstrap estimator for distributions can be assessed more precisely by the rate at which the Kolmogorov-Smirnov distance between the distribution function of $(\hat\theta_n - \theta)/\hat\sigma_n$ and the conditional distribution function of $(\hat\theta_n^* - \hat\theta_n)/\hat\sigma_n^*$ converges to zero. We shall see that the percentile $t$-method usually performs better than the percentile method. For the percentile $t$-method, the Kolmogorov-Smirnov distance typically converges to zero at the rate $O_P(n^{-1})$, whereas the percentile method attains "only" an $O_P(n^{-1/2})$ rate of correctness. The latter is comparable to the error of the normal approximation.

Rates for the Kolmogorov-Smirnov distance translate directly into orders of correctness of one-tailed confidence intervals. The correctness of two-tailed or symmetric confidence intervals may be higher, because of the cancellation of the coverage errors contributed by
the left and right tails. In many cases the percentile method, the percentile $t$-method, and the normal approximation all yield correct two-tailed confidence intervals up to order $O(n^{-1})$. Their relative qualities may be studied by a more refined analysis. This must also take into account the length of the confidence intervals, for an increase in length of order $O_P(n^{-3/2})$ may easily reduce the coverage error to the order $O(n^{-k})$ for any $k$.

The technical tool to obtain these results is the Edgeworth expansion. Edgeworth's classical expansion is a refinement of the central limit theorem that shows the magnitude of the difference between the distribution function of a sample mean and its normal approximation. Edgeworth expansions have subsequently been obtained for many other statistics as well. An Edgeworth expansion for the distribution function of a statistic $(\hat\theta_n - \theta)/\hat\sigma_n$ is typically an expansion in increasing powers of $1/\sqrt n$ of the form
( en
-
A
�
e
::::: X I
p)
=
P (x I (x) + 1 .J1i
P) ¢ (x) + P2 (x I P) ¢ (x) + . . .
(23 . 12)
n
The remainder is of lower order than the last included term, uniformly in the argument x . Thus, in the present case the remainder i s o(n - 1 ) (or even O (n - 3 1 2 )). The functions Pi are polynomials in x, whose coefficients depend on the underlying distribution, typically through (asymptotic) moments of the pair (Bn , &n ) . Example (Sample mean). Let X n b e the mean of a random sample of size n, and let S� n - 1 2:7= 1 (X i - Xn) 2 be the (biased) sample variance. If f-1,, 0" 2 , A and K are the mean, variance, skewness and kurtosis of the underlying distribution, then A (x 2 - 1) Xn - f-1, P (x) ¢ (x) ::::; x I 6 v;:;n O" l v;:;n 1 3K (x 3 - 3x) + A 2 (x5 - 10x 3 + 15x) . ¢ (x) + 0 72n n.Jfi 23.13
=
(
P)
=
( ) -
These are the first two terms of the classical expansion of Edgeworth. If the standard deviation of the observations is unknown, an Edgeworth expansion of the t -statistic is of more interest. This takes the form (see [72, pp. 7 1-73]) P
( Xn
- f-1,
P)
A (2x 2 + 1) x (x) + ¢ (x) I ::::; 6 v;:;n Sn l v;:;n 3K (x 3 - 3x) - 2A 2 (x5 + 2x 3 - 3x) - 9 (x 3 + 3x) 1 ¢ (x) + 0 + 36n n Jn =
( )
·
Although the polynomials are different, these expansions are of the same form. Note that the polynomial appearing in the 11 Jn term is even in both cases. These expansions generally fail if the underlying distribution of the observations is discrete. Cramer's condition requires that the modulus of the characteristic function of the observations be bounded away from unity on closed intervals that do not contain the origin. This condition is satisfied if the observations possess a density with respect to Lebesgue measure. Next to Cramer ' s condition a sufficient number of moments of the observations must exist. 0
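The size of such a correction is easy to examine numerically. The following sketch (Python with NumPy; the function names and the exponential example are ours) evaluates the classical one-term Edgeworth approximation $\Phi(x) - \lambda(x^2-1)\phi(x)/(6\sqrt n)$ for the standardized sample mean and compares it with the normal approximation and a simulation.

```python
import numpy as np
from math import erf, exp, pi, sqrt

def Phi(x):   # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi(x):   # standard normal density
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def edgeworth_cdf(x, n, skewness):
    """One-term Edgeworth approximation to P(sqrt(n)(Xbar - mu)/sigma <= x)."""
    return Phi(x) - skewness * (x**2 - 1.0) * phi(x) / (6.0 * sqrt(n))

# Exponential(1) observations: mu = sigma = 1, skewness 2.
rng = np.random.default_rng(0)
n, x = 20, 1.5
sims = rng.exponential(size=(100_000, n))
t = np.sqrt(n) * (sims.mean(axis=1) - 1.0)
print("empirical:", (t <= x).mean())
print("normal   :", Phi(x))
print("edgeworth:", edgeworth_cdf(x, n, skewness=2.0))
```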
F-1 1 IF;;np.
The p th quantile (p) of a distribution func tion may be estimated by the empirical pth quantile (p). This is the rth order statistic of the sample for r equal to the largest integer not greater than Its mean square error can be computed as 23.14
F
Example (Studentized quantiles).
IF;;- 1
An empirical estimator fin for the mean square error of (p) is obtained by replacing by the empirical distribution function. If the distribution has a differentiable density f, then
F
p (IF;;- 1 (p) � F - 1 (p) I F) = cf> (x) + P1 �F) ¢ (x) + o ( ;/4 ) , n where p 1 (x I F) is the polynomial of degree 3 given by (see [72, pp. 3 1 8-32 1 ] ) P1 Cx I F) 12jp ( l - p) = fi3 x 3 + [2 - lOp - 12p ( l - p) f' ( F - 1 (p) ) ] x ::: X
!2
+
2
3 + 6� x - 8 + 4p - 12(r - np) . fi
(n
This expansion is unusual in two respects. First, the remainder is of the order 0 - 3 14) rather than of the order 0 ( ). Second, the polynomial appearing in the first term is not even. For this reason several of the conclusions of this section are not valid for sample quantiles. In particular, the order of correctness of all empirical bootstrap procedures is 0 p - 1 12 ) , not greater. In this case, a "smoothed bootstrap" based on "resampling" from a density estimator (as in Chapter 24) may be preferable, depending on the underlying distribution. D
n-1
(n
If the distribution function of (Bn - 8)/fin admits an Edgeworth expansion (23 . 1 2), then it is immediate that the normal approximation is correct up to order 0 ( 1 / Jn) . Evaluation of the expansion at the normal quantiles Z f3 and z 1 -a yields
Thus, the level of the confidence interval [Bn - Z f3 fin , en - Z 1 - a fin ] is 1 - a - (3 up to order 0 ( 1 / Jn) . For a two-tailed, symmetric interval, a and (3 are chosen equal. Inserting Z f3 Za - z a in the preceding display, we see that the errors of order 1 / Jn resulting from the left and right tails cancel each other if is an even function. In this common situation the order of correctness improves to 0 It is of theoretical interest that the coverage probability can be corrected up to any order by making the normal confidence interval slightly wider than first-order asymptotics would suggest. The interval may be widened by using quantiles Za" with an a , rather than Za · In view of the preceding display, for any an ,
= = 1
_
p (n -1 1 ).
<
23.3
Higher-Order Correctness
337
The 0 (n - 1 ) term results from the Edgeworth expansion (23 . 12) and is universal, indepen dent of the sequence an . For an = a - MIn and a sufficiently large constant M, the right side becomes 2 1 - 2a + . +0 � 1 - 2a - 0
: (�)
( :k )
Thus, a slight widening of the normal confidence interval yields asymptotically correct (conservative) coverage probabilities up to any order O (n -k ). If &n = 0 p (n - 1 1 2 ), then the widened interval is 2 (z an - Za ) &n = 0 p (n - 31 2 ) wider than the normal confidence interval. This difference is small relatively to the absolute length of the interval, which is 0 p (n - 1 1 2 ). Also, the choice of the scale estimator &n (which depends on en ) influences the width of the interval stronger than replacing �a by �a" . An Edgeworth expansion usually remains valid in a conditional sense if a good estimator P n is substituted for the true underlying distribution P. The bootstrap version of expansion (23 . 1 2) is
(
x:
)
p e; -:. en � X I Pn =
In this expansion the remainder term is a random variable, which ought to be of smaller order in probability than the last term. In the given expansion the remainder ought to be o p (n - 1 ) uniformly in x. Subtract the bootstrap expansion from the unconditional expansion (23 . 1 2) to obtain that
( en - e
sup p -A- � X I X
�
p ) p ( e*n A-* en � X I pA n ) -
�
()
P (x I P) - P 1 (x I P n) + P2 Cx I P) - P2 (x I P n) ¢ (x) + op -1 � sup 1 n n . -yr:;;n x
I
I
The polynomials Pi typically depend on P in a smooth way, and the difference P n - P is typically of the order 0 p (n - 1 1 2 ). Then the Kolmogorov-Smimov distance between the true distribution function of (en - 8) I &n and its percentile t-bootstrap estimator is of the order
Op (n - 1 ).
The analysis of the percentile method starts from an Edgeworth expansion of the distri bution function of the unstudentized statistic en - 8. This has as leading term the normal distribution with variance O";, the asymptotic variance of en - 8, rather than the standard normal distribution. Typically it is of the form
( ;n ) + � q1 ( ;n I P) ¢ ( ;n ) + � q2 ( n p ) ¢ ( ;n ) + . . . . :l
P(en - 8 � x i P) =
The functions qi are polynomials, which are generally different from the polynomials occurring in the Edgeworth expansion for the studentized statistic. The bootstrap version of this expansion is
Bootstrap
338
The Kolmogorov-Smimov distance between the distribution functions on the left in the pre ceding displays is of the same order as the difference between the leading terms (x fan) (x I an) on the right. Because the estimator an is typically not closer than 0 p (n - 1 1 2 ) to a ' this difference may be expected to be at best of the order 0 p (n - 11 2 ) . Thus, the percentile method for estimating a distribution is correct only up to the order 0 p (n - 11 2 ) , whereas the percentile t- method is seen to be correct up to the order 0 p (n - 1 ) . One-sided bootstrap percentile t and percentile confidence intervals attain orders of correctness that are equal to the orders of correctness of the bootstrap estimators of the distribution functions: Op (n - 1 ) and Op (n - 1 1 2 ), respectively. For equal-tailed confidence intervals both methods typically have coverage error of the order Op (n - 1 ) . The dec rease in coverage error is due to the cancellation of the errors contributed by the left and right tails, just as in the case of normal confidence intervals. The proofs of these assertions are somewhat technical. The coverage probabilities can be expressed in probabilities of the type
(
en - e P �
:'S
A
�n,a I
P
)
(23 . 15)
·
Thus we need an Edgeworth expansion of the distribution of (en - e) fan - �n,a ' or a related quantity. A technical complication is that the random variables �n, a are only implicitly defined, as the solution of (23 . 1). To find the expansions, first evaluate the Edgeworth expansion for (e; - en) fa; at its the upper quantile �n, a to find that
After expanding <1>, p 1 and ¢ in Taylor series around Za , we can invert this equation to obtain the (conditional) Cornish-Fisher expansion 2
_
s n,a - Za
_
P1 (za I P) r.;; vn
+
0P
(�) n
.
In general, Cornish-Fisher expansions are asymptotic expansions of quantile functions, much in the same spirit as Edgeworth expansions are expansions of distribution functions. The probability (23. 15) can be rewritten P
( --
()
1 en - e (Za I P) - Op - < z a - P1 an n Jn
I
)
P .
For a rigorous derivation it is necessary to characterize the 0 p (n - 1 ) term. Informally, this term should only contribute to terms of order O (n - 1 ) in an Edgeworth expansion. If we just ignore it, then the probability in the preceding display can be expanded with the help of (23 . 1 2) as
(Za - P1 (Za I P) )
(
) o( � )
P l (Za - n - 11 2 Pl (Za I P) I P) P l ( Za I P) + . r.;; r.;; If' Za r.;; n vn vn vn The linear term of the Taylor expansion of cancels the leading term of the Taylor expansion of the middle term. Thus the expression in the last display is equal to 1 - a up to the order <�>
+
r�-.
Problems
339
O (n - 1 ) , whence the coverage error of a percentile t-confidence interval is of the order O (n - 1 ) . For percentile intervals w e proceed in the same manner, this time inverting the Edgeworth expansion of the unstudentized statistic. The (conditional) Cornish-Fisher expansion for the quantile �n . a of e: - en takes the form � ,a : an
=
Za
_
q 1 (Za I F n) Jn
+
Op
( �n ) .
The coverage probabilities of percentile confidence intervals can be expressed in probab ilities of the type A
A
( en CJ-n e
)
�n . a Ip (Jn Insert the Cornish-Fisher expansion, again neglect the 0 p (n - 1 ) term, and use the Edgeworth P (en - e
::'::
�n .a I P)
=
p
A -
::'::
A-
0
expansion (23. 12) to rewrite this as
Because p 1 and q 1 are different, the cancellation that was found for the percentile t-method does not occur, and this is generally equal to 1 - a up to the order 0 (n - 1 1 2 ) . Consequently, asymmetric percentile intervals have coverage error of the order O (n - 1 12 ) . On the other hand, the coverage probability of the symmetric confidence interval [en - �n ,a , en - �n, 1 -a ] is equal to the expression in the preceding display minus this expression evaluated for 1 - a instead of a. In the common situation that both polynomials p 1 and q 1 are even, the terms of order 0 (n - 1 1 2 ) cancel, and the difference is equal to 1 - 2a up to the order 0 (n - 1 ) . Then the percentile two-tailed confidence interval has the same order of correctness as the symmetric normal interval and the percentile t-intervals.
Notes
For a wider scope on the applications of the bootstrap, see the book [44] , whose first author Efron is the inventor of the bootstrap. Hall [72] gives a detailed treatment of higher order expansions of a number of bootstrap schemes. For more information concerning the consistency of the empirical bootstrap, and the consistency of the bootstrap under the application of the delta method, see Chapter 3.6 and Section 3.9.3 of [146], or the paper by Gine and Zinn [58] .
PROBLEMS 1. Let F n be a sequence of random distribution functions and function. Show that the following statements are equivalent: (i) (ii)
A
p
Fn (x) ---+ F(x) for every x . supx I Fn (x) - F(x) j _.;. 0.
F
a continuous, fixed-distribution
Bootstrap
340
2. Compare in a simulation study Efron's percentile method, the normal approximation in combina
tion with Fisher's transformation, and the percentile method to set a confidence interval for the correlation coefficient.
3. Let X (n ) be the maximum of a sample of size n from the uniform distribution on [0, 1 ] , and let X (n ) be the maximum of a sample of size n from the empirical distribution lP'n of the first sample. Show that P(X (n ) = X(n ) l lP'n ) -+ 1 - e - 1 . What does this mean regarding the consistency of the empirical bootstrap estimator of the distribution of the maximum? 4. Devise a bootstrap scheme for setting confidence intervals for Yi = a + f3xi + ei . Show consistency.
f3
in the linear regression model
5. (Parametric bootstrap.) Let en be an estimator based on observations from a parametric model Pe such that -Jfi(en h n / -Jfi) converges under + hn/ Fn to a continuous distribution Le for every converging sequence hn and every (This is slightly stronger than regularity as defined in the chapter on asymptotic efficiency.) Show that the parametric bootstrap is consistent: If e; is en computed from observations obtained from Pe , then -Jfi(e; - en ) � Le conditionally on n the original observations, in probability. (The conditional law of -Jfi(e; - en ) is L n , e if Ln , e is the distribution of -Jfi(en under
e-
e.
e)
e
e.)
6. Suppose that -Jfi(en - e ) � T and Fnce; - en ) � T in probability given the original observations.
Show that Fn(¢ Ce;) - ¢ (en )) � ¢� (T) in probability for every map ¢ that is differentiable at
e.
7. Let Un b e a U -statistic based on a random sample X 1 , . . . , X11 with kernel h (x , y ) such that both Eh (X 1 , X 1 ) and Eh 2 (X1 , X 2 ) are finite. Let u; be the same U-statistic based on a sample X ! , . . . , X � from the empirical distribution of X 1 , . . . , X11 • Show that -Jfi(U; - U11 ) converges conditionally in distribution to the same limit as -Jfi ( U11 almost surely.
- e),
8. Suppose that -Jfi(en - e ) � T and -Jnce; - en ) � T in probability given the original observations. Show that, unconditionally, -Jfi(en e; - e11 ) � (S, T ) for independent copies S and T of T .
- e,
Deduce the unconditional limit distribution o f -Jnce;
- e).
24 Nonparametric Density Estimation
This chapter is an introduction to estimating densities if the underlying density of a sample of observations is considered completely unknown, up to existence of derivatives. We derive rates of convergence for the mean square error of kernel estimators and show that these cannot be improved. We also consider regularization by monotonicity.
24.1 Introduction
Statistical models are called parametric models if they are described by a Euclidean param eter (in a nice way). For instance, the binomial model is described by a single parameter p, and the normal model is given through two unknowns: the mean and the variance of the observations. In many situations there is insufficient motivation for using a particular parametric model, such as a normal model. An alternative at the other end of the scale is a nonparametric model, which leaves the underlying distribution of the observations essentially free. In this chapter we discuss one example of a problem of nonparametric estimation: estimating the density of a sample of observations if nothing is known a priori. From the many methods for this problem, we present two: kernel estimation and monotone estimation. Notwithstanding its simplicity, this method can be fully asymptotically efficient.
24.2 Kernel Estimators
The most popular nonparametric estimator of a distribution based on a sample of observations is the empirical distribution, whose properties are discussed at length in Chapter 19. This is a discrete probability distribution and possesses no density. The most popular method of nonparametric density estimation, the kernel method, can be viewed as a recipe to "smooth out" the pointmasses of sizes $1/n$ in order to turn the empirical distribution into a continuous distribution. Let $X_1, \ldots, X_n$ be a random sample from a density $f$ on the real line. If we would know that $f$ belongs to the normal family of densities, then the natural estimate of $f$ would be the normal density with mean $\overline X_n$ and variance $S_n^2$, or the function

$$x \mapsto \frac{1}{S_n\sqrt{2\pi}}\,\exp\Bigl(-\frac{(x - \overline X_n)^2}{2 S_n^2}\Bigr).$$
Figure 24. 1. The kernel estimator with normal kernel and two observations for three bandwidths : small (left), intermediate (center) and large (right). The figures show both the contributions of the two observations separately (dotted lines) and the kernel estimator (solid lines) , which is the sum of the two dotted lines .
In this section we suppose that we have no prior knowledge of the form of $f$ and want to "let the data speak as much as possible for themselves." Let $K$ be a probability density with mean $0$ and variance $1$, for instance the standard normal density. A kernel estimator with kernel or window $K$ is defined as

$$\hat f(x) = \frac1n \sum_{i=1}^n \frac1h\, K\Bigl(\frac{x - X_i}{h}\Bigr).$$
Here $h$ is a positive number, still to be chosen, called the bandwidth of the estimator. It turns out that the choice of the kernel $K$ is far less crucial for the quality of $\hat f$ as an estimator of $f$ than the choice of the bandwidth. To obtain the best convergence rate, the requirement that $K \ge 0$ may have to be dropped.

A kernel estimator is an example of a smoothing method. The construction of a density estimator can be viewed as a recipe for smoothing out the total mass 1 over the real line. Given a random sample of $n$ observations it is reasonable to start with allocating the total mass in packages of size $1/n$ to the observations. Next a kernel estimator distributes the mass that is allocated to $X_i$ smoothly around $X_i$, not homogeneously, but according to the kernel and bandwidth. More formally, we can view a kernel estimator as the sum of $n$ small "mountains" given by the functions

$$x \mapsto \frac{1}{nh}\,K\Bigl(\frac{x - X_i}{h}\Bigr).$$

Every small mountain is centered around an observation $X_i$ and has area $1/n$ under it, for any bandwidth $h$. For a small bandwidth the mountain is very concentrated (a peak), while for a large bandwidth the mountain is low and flat. Figure 24.1 shows how the mountains add up to a single estimator. If the bandwidth is small, then the mountains remain separated and their sum is peaky. On the other hand, if the bandwidth is large, then the sum of the individual mountains is too flat. Intermediate values of the bandwidth should give the best results. Figure 24.2 shows the kernel method in action on a sample from the normal distribution. The solid and dotted lines are the estimator and the true density, respectively. The three pictures give the kernel estimates using three different bandwidths, small, intermediate, and large, each time with the standard normal kernel.
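The definition translates directly into code. A minimal sketch (Python with NumPy; the function name and the illustrative sample are ours) evaluates the kernel estimator with the standard normal kernel on a grid, mimicking the setting of Figure 24.2.

```python
import numpy as np

def kernel_density(x_grid, data, h):
    """Kernel density estimate f_hat(x) = (1/(n*h)) * sum_i K((x - X_i)/h),
    with K the standard normal density."""
    data = np.asarray(data)
    u = (x_grid[:, None] - data[None, :]) / h          # (grid, n) matrix
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (len(data) * h)

# A sample of size 15 from the standard normal density, evaluated with a
# small, an intermediate, and a large bandwidth.
rng = np.random.default_rng(3)
sample = rng.standard_normal(15)
grid = np.linspace(-3.0, 3.0, 201)
for h in (0.68, 1.82, 4.5):
    f_hat = kernel_density(grid, sample, h)
    print(f"h = {h}: total mass ~ {np.trapz(f_hat, grid):.3f}")   # close to 1
```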
Figure 24.2. Kernel estimates for the density of a sample of size 15 from the standard normal density for three different bandwidths, h = 0.68 (left), 1.82 (center), and 4.5 (right), using a normal kernel. The dotted line gives the true density.
A popular criterion to judge the quality of density estimators is the mean integrated square error (MISE), which is defined as

$$\mathrm{MISE}_f(\hat f) = \int \mathrm{E}_f\bigl(\hat f(x) - f(x)\bigr)^2\,dx = \int \mathrm{var}_f \hat f(x)\,dx + \int \bigl(\mathrm{E}_f \hat f(x) - f(x)\bigr)^2\,dx.$$
This is the mean square error $\mathrm{E}_f(\hat f(x) - f(x))^2$ of $\hat f(x)$ as an estimator of $f(x)$, integrated over the argument $x$. If the mean integrated square error is small, then the function $\hat f$ is close to the function $f$. (We assume that $\hat f_n$ is jointly measurable to make the mean square error well defined.) As can be seen from the second representation, the mean integrated square error is the sum of an integrated "variance term" and a "bias term." The mean integrated square error can be small only if both terms are small. We shall show that the two terms are of the orders

$$\frac{1}{nh} \qquad\text{and}\qquad h^4,$$

respectively. Then it follows that the variance and the bias terms are balanced for $(nh)^{-1} \sim h^4$, which implies an optimal choice of bandwidth equal to $h \sim n^{-1/5}$ and yields a mean integrated square error of order $n^{-4/5}$.

Informally, these orders follow from simple Taylor expansions. For instance, the bias of $\hat f(x)$ can be written

$$\mathrm{E}_f \hat f(x) - f(x) = \int \frac1h K\Bigl(\frac{x - t}{h}\Bigr) f(t)\,dt - f(x) = \int K(y)\bigl(f(x - hy) - f(x)\bigr)\,dy.$$

Developing $f$ in a Taylor series around $x$ and using that $\int y K(y)\,dy = 0$, we see, informally, that this is equal to

$$\tfrac12\,h^2 f''(x)\int y^2 K(y)\,dy, \qquad\text{up to terms of smaller order}.$$

Thus, the squared bias is of the order $h^4$. The variance term can be handled similarly. A precise theorem is as follows.

24.1 Theorem. Suppose that $f$ is twice continuously differentiable with $\int |f''(x)|^2\,dx < \infty$. Furthermore, suppose that $\int y K(y)\,dy = 0$ and that both $\int y^2 K(y)\,dy$ and $\int K^2(y)\,dy$ are finite. Then there exists a constant $C_f$ such that, for small $h > 0$,

$$\mathrm{MISE}_f(\hat f) \le C_f\Bigl(\frac{1}{nh} + h^4\Bigr).$$
Consequently, for $h_n \sim n^{-1/5}$, we have $\mathrm{MISE}_f(\hat f_n) = O(n^{-4/5})$.
Because a kernel estimator is an average of n independent random variables, the variance of ] (x) is (1/n) times the variance of one term. Hence x X1 2 x X1 E K � var1 K varf j (x) 1 n 2 1 - K 2 (y) f (x - hy) dy . nh Take the integral with repect to x on both left and right sides. Because J f (x - hy) dx 1 is the same for every value of hy, the right side reduces to (nh) - 1 J K 2 (y) dy, by Fubini's theorem. This concludes the proof for the variance term. To upper bound the bias term we first write the bias E f J (x) - f (x) in the form as given preceding the statement of the theorem. Next we insert the formula 1 f (x + h ) - f (x) hf ' (x) + h 2 j" (x + sh) ( l - s) ds .
Proof.
=� � = J =
(� ) � (� )
=
1
This is a Taylor expansion with the Laplacian representation of the remainder. We obtain 1 E f j (x) - f(x) K (y) [ -hyj ' (x) + (hy) 2 j" (x - shy) (l - s) J ds dy .
= !1
Because the kernel K has mean zero by assumption, the first term inside the square brackets can be deleted. Using the Cauchy-Schwarz ineqnality (EU V) 2 � EU 2 E V 2 on the variables U Y and V Yf " (x - Sh Y) (1 - S) for Y distributed with density K and S uniformly distributed on [0, 1] independent of Y, we see that the square of the bias is bounded above by 1 h 4 K (y) l dy K (y) l f" (x - shy) 2 (1 - s) 2 ds dy .
=
=
J
J1
The integral of this with respect to x is bounded above by
This concludes the derivation for the bias term. The last assertion of the theorem is trivial. • The rate 0 (n -41 5) for the mean integrated square error is not impressive if we compare it to the rate that could be achieved if we knew a priori that f belonged to some parametric family of densities fe . Then, likely, we would be able to estimate e by an estimator such that 8 e + Op (n - 1 1 2 ) , and we would expect
=
MISEe Cfe )
= J Ee (fe (x) - fe (x )) 2 dx "'"' Ee (8 - e) 2 = ( � ) a
.
This is a factor n - 1 15 smaller than the mean square error of a kernel estimator. This loss in efficiency is only a modest price. After all, the kernel estimator works for every density that is twice continuously differentiable whereas the parametric estimator presumably fails miserably if the true density does not belong to the postulated parametric model.
346
Nonparametric Density Estimation
Moreover, the lost factor $n^{-1/5}$ can be (almost) recovered if we assume that $f$ has sufficiently many derivatives. Suppose that $f$ is $m$ times continuously differentiable. Drop the condition that the kernel $K$ is a probability density, but use a kernel $K$ such that

$$\int K(y)\,dy = 1, \qquad \int y^j K(y)\,dy = 0 \quad\text{for } j = 1, \ldots, m - 1.$$

Then, by the same arguments as before, the bias term can be expanded in the form

$$\mathrm{E}_f \hat f(x) - f(x) = \int K(y)\bigl(f(x - hy) - f(x)\bigr)\,dy = \int K(y)\,\frac{(-1)^m h^m y^m}{m!}\,f^{(m)}(x)\,dy + \cdots.$$
Thus the squared bias is of the order $h^{2m}$, and the bias-variance trade-off $(nh)^{-1} \sim h^{2m}$ is solved for $h \sim n^{-1/(2m+1)}$. This leads to a mean square error of the order $n^{-2m/(2m+1)}$, which approaches the "parametric rate" $n^{-1}$ as $m \to \infty$. This claim is made precise in the following theorem, whose proof proceeds as before.
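As a brief aside, here is a concrete kernel satisfying the displayed conditions with $m = 4$ (this particular choice is not taken from the text, but is a standard example): with $\phi$ the standard normal density,

$$K(y) = \tfrac12\,(3 - y^2)\,\phi(y)$$

integrates to $\tfrac12(3 - 1) = 1$, has all odd moments equal to zero by symmetry, and satisfies $\int y^2 K(y)\,dy = \tfrac12(3\cdot 1 - 3) = 0$, so it is a kernel of order 4. It takes negative values for $|y| > \sqrt 3$, which illustrates why the requirement $K \ge 0$ must be dropped to obtain the faster rate.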
24.2 Theorem. Suppose that $f$ is $m$ times continuously differentiable with $\int |f^{(m)}(x)|^2\,dx < \infty$. Then there exists a constant $C_f$ such that, for small $h > 0$,

$$\mathrm{MISE}_f(\hat f) \le C_f\Bigl(\frac{1}{nh} + h^{2m}\Bigr).$$

Consequently, for $h_n \sim n^{-1/(2m+1)}$, we have $\mathrm{MISE}_f(\hat f_n) = O\bigl(n^{-2m/(2m+1)}\bigr)$.
In practice, the number of derivatives of $f$ is usually unknown. In order to choose a proper bandwidth, we can use cross-validation procedures. These yield a data-dependent bandwidth and also solve the problem of choosing the constant preceding $n^{-1/(2m+1)}$. The combined procedure of density estimator and bandwidth selection is called rate-adaptive if the procedure attains the upper bound $n^{-2m/(2m+1)}$ for the mean integrated square error for every $m$.
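Cross-validation is straightforward to carry out numerically. A minimal sketch of least-squares cross-validation for a normal-kernel estimator (Python with NumPy; the function names are ours, and this is only one of several possible criteria) chooses $h$ to minimize an estimate of the integrated squared error, up to a constant not depending on $h$.

```python
import numpy as np

def lscv_score(h, data):
    """Least-squares cross-validation criterion for a normal-kernel estimator:
    integral(f_hat^2) - (2/n) * sum_i f_hat_{-i}(X_i)."""
    data = np.asarray(data)
    n = len(data)
    d = data[:, None] - data[None, :]
    # N(0, s^2) density, used both for the kernel and for the closed form of
    # integral(f_hat^2) = (1/n^2) sum_{i,j} phi_{sqrt(2) h}(X_i - X_j)
    gauss = lambda u, s: np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    term1 = gauss(d, np.sqrt(2.0) * h).sum() / n**2
    K = gauss(d, h)
    np.fill_diagonal(K, 0.0)                  # leave-one-out: exclude i = j
    term2 = 2.0 * K.sum() / (n * (n - 1))
    return term1 - term2

rng = np.random.default_rng(1)
data = rng.standard_normal(200)
hs = np.linspace(0.1, 1.5, 29)
best_h = hs[np.argmin([lscv_score(h, data) for h in hs])]
print("cross-validated bandwidth:", best_h)
```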
24.3
Rate Optimality
In this section we show that the rate n - lm / (lm + 1 l of a kernel estimator, obtained in The orem 24.2, is the best possible. More precisely, we prove the following. Inspection of the proof of Theorem 24.2 reveals that the constants C 1 in the upper bound are uniformly bounded in f such that J I J <m l (x) l 2 dx is uniformly bounded. Thus, letting :Fm , M be the class of all probability densities such that this quantity is bounded by M, there is a constant Cm , M such that the kernel estimator with bandwidth hn = n - 1 / (lm + 1 l satisfies 1 2m / ( 2m + 1) sup Ef ( fn (x) - f (x)) 2 dx :S Cm , M n j E:Fm M
J
()
24. 3 Rate Optimality
347
In this section we show that this upper bound is sharp, and the kernel estimator rate optimal, in that the maximum risk on the left side is bounded below by a similar expression for every density estimator Jn , for every fixed m and M. The proof is based on a construction of subsets :Fn C :Fm , M , consisting of 2rn functions, with rn = Ln 1 / (Zm + l J J , and on bounding the supremum over :Fm , M by the average over :Fn · Thus the number of elements in the average grows fairly rapidly with n . An approach, such as in section 14.5, based on the comparison of fn at only two elements of :Fm , M does not seem to work for the integrated risk, although such an approach readily yields a lower bound for the maximum risk sup 1 E J (fn (x) - i(x) ) 2 at a fixed x . The subset :Fn is indexed by the set of all vectors 8 E {0, 1 Yn consisting of sequences of rn zeros or ones. For h n = n - l / (Zm + l) , let Xn, l < Xn, 2 < < Xn,n be a regular grid of meshwidth 2hn. For a fixed probability density i and a fixed function K with support ( - 1 , 1), define, for every 8 E {0, 1 yn , ·
·
·
If i is bounded away from zero on an interval containing the grid, I K I is bounded, and J K dx = 0, then in,e is a probability density, at least for large n . Furthermore,
(x)
It follows that there exist many choices of i and K such that in ,e E :Fm , M for every 8 . The following lemma gives a lower bound for the maximum risk over the parameter set {0, 1 Y , in an abstract form, applicable to the problem of estimating an arbitrary quantity 1/! (8) belonging to a metric space (with metric d). Let H (8, 8') = L: = l l 8i - 8[ 1 be the Hamming distance on {0, 1 Y , which counts the number of positions at which 8 and 8' differ. For two probability measures P and Q with densities p and q, write II P 1\ Q II for p l\ q d J,L .
;
f
Lemma (Assouad). For any estimator T : 8 E {0, 1 Y ) , and any p > 0,
24.3
(Pe
max e 2P Ee d P ( T , 1/! (8)) :::: Hcmin e . e ' ) o.l
based on an observation in the experiment
dP 1/J (8 ) , 1/! (8')) (
H(8, 8 ' )
min II Pe 1\ Pe ' ll · 2 H(e , e ' ) =l
!._
Define an estimator S, taking its values in 8 = { 0, 1 Y , by letting S = 8 if 8' �---+ d ( T, 1/J ( 8')) is minimal over 8 at 8' = 8 . (If the minimum is not unique, choose a point of minimum in any consistent way.) By the triangle inequality, for any 8, d ( 1/J (S) , 1/!(8)) < d ( 1/J (S) , T + d ( 1/J (8), T , which is bounded by 2d ( 1/J (8), T , by the definition of S. If dP ( 1/f (8 ) , 1j;(8')) :::: a H (8, 8') for all pairs 8 # 8', then
Proof.
)
)
)
The maximum of this expression over 8 is bounded below by the average, which, apart
Nonparametric Density Estimation
348
from the factor a , can be written
r 1 1 � J S · dPe + 1 1 � J (1 - S ) dPe . 1 �r ( -21r �e � � E e i S · - 8 · 1 = -2 � r r 2 - e : 81 = 1 1=1 1=1 2 - e : 81 =0 ) This is minimized over S by choosing s1 for each j separately to minimize the jth term in the sum. The expression within brackets is the sum of the error probabilities of test of 1 1 L Pe. versus P 1 ' = -r 2 - e : 81 = 1 Equivalently, it is equal to 1 minus the difference of power and level. In Lemma 14. 3 0 this was seen to be at least 1 - �I I Po, } - P l , J I = I Po, } /\ P 1 , 1 1 . Hence the preceding display is bounded below by �
1
1
1
�
1
�
a
· 1
}=1 · A P 1 · I .
�2 � � I Po
,]
,]
/\ I Pi /\ Qi l .
Because the minimum pm 7jm of two averages of numbers is bounded below by the average of the minima, the same is true for the total variation norm of a minimum: m-1 and The 2r - 1 terms and in the averages m-1 L can be ordered and matched such that each pair e and differ only in their jth coordinate. Conclude that the preceding display is bounded below by 2..: } = 1 min P8 , in which the minimum is taken over all pairs e and that differ by exactly one coordinate. •
L/\ Pi /\ qi I P m Qm l :::::
Po, } P l , J I Pe /\ I ,
Pe Pw 8' �
8'
We wish to apply Assouad's lemma to the product measures resulting from the densities is useful. It fn , B · Then the following inequality, obtained in the proof of Lemma relates the total variation, affinity, and Hellinger distance of product measures:
14. 3 1,
24.4
Theorem.
There exists a constant Dm, M such that for any density estimator fn
J ( )-
sup Ef ( fn x
/ E fm,M
( ) 2m / ( 2m + 1 ) 1 d Dm,M -;;
f (x) ) 2 x
=::::
Because the functions fn,e are bounded away from zero and infinity, uniformly in e , the squared Hellinger distance
Proof.
J
( l /2 - l /: ) 2 n,B
n,e
dx = J (
e
fn . - fn , B ' 1/2 f 1/2 fn,B n,B'
+
)
2 dx
is up to constants equal to the squared L 2 -distance between fn,e and fn . B ' · Because the
24. 4 Estimating a Unimodal Density
j ) hn ) have disjoint supports, the latter is equal to h;m t I Bj - Bjl 2 J K ' C �:"· j ) dx h;m+ l H (B , B') J K 2 (x) dx
349
functions K ( (x - X n I ,
=
This is of the order 1 In. Inserting this in the lower bound given by Assouad' s lemma, with and d ( l{f (8), 1/f (8')) the L 2 -distance, we find up to a constant the lower bound 1/f (8) = (rn l2) ( 1 - 0 ( 1 1 n 2n . •
fn. e m 1 h� +
))
24.4
Estimating a Unimodal Density
In the preceding sections the analysis of nonparametric density estimators is based on the assumption that the true density is smooth. This is appropriate for kernel-density estimation, because this is a smoothing method. It is also sensible to place some a priori restriction on the true density, because otherwise we cannot hope to achieve much beyond consistency. However, smoothness is not the only possible restriction. In this section we assume that the true density is monotone, or unimodal. We start with monotone densities and next view a unimodal density as a combination of two monotone pieces. It is interesting that with monotone densities we can use maximum likelihood as the estimating principle. Suppose that is a random sample from a Lebesgue density on [0, oo) that is known to be nonincreasing. Then the maximum likelihood estimator fn is defined as the nonincreasing probability density that maximizes the likelihood n
X 1 , . . . , Xn
f
1
f-+
ni= 1 f(X;).
f
This optimization problem would not have a solution if were only restricted by possessing a certain number of derivatives, because very high peaks at the observations would yield an arbitrarily large likelihood. However, under monotonicity there is a unique solution. The solution must necessarily be a left-continuous step function, with steps only at the observations. Indeed, if for a given the limit from the right at is bigger than the limit from the left at then we can redistribute the mass on the interval by raising the value for instance by setting equal to and lowering :c· > f (t) dt on the whole interval, resulting in an the constant value J increase of the likelihood. By the same reasoning we see that the maximum likelihood estimator must be zero on oo) (and ( -oo, 0)). Thus, with j; = fn finding the maximum likelihood estimator reduces to maximizing n7 j; under the side conditions = 0) (with
f X; , f(X( l(iJ) - 1 f(X (i - 1 ) +), (XuJ - X(i - l J) (X (nl•
X (i _ 1 )
(1 - 1 )
X (O)
fn L i= l f;(x
n
· · ·
=:::
=::: =
=1
(X (i - 1 ) , X(; J] f
(X(i J),
0,
1.
This problem has a nice graphical solution. The least concave majorant of the empirical distribution function IFn is defined as the smallest concave function F n with F (x) =::: IFn (x) for every x . This can be found by attaching a rope at the origin (0, 0) and winding this (from above) around the empirical distribution function IFn (Figure 24.3). Because Fn is n
Figure 24.3. The empirical distribution and its concave majorant of a sample of size 75 from the exponential distribution.
Figure 24.4. The derivative of the concave maj orant of the empirical distribution and the true density of a sample of size 75 from the exponential distribution.
concave, its derivative is nonincreasing. Figure 24.4 shows the derivative of the concave majorant in Figure 24.3.

24.5 Lemma. The maximum likelihood estimator $\hat f_n$ is the left derivative of the least concave majorant $\hat F_n$ of the empirical distribution $\mathbb{F}_n$; that is, on each of the intervals $(X_{(i-1)}, X_{(i)}]$ it is equal to the slope of $\hat F_n$ on this interval.
Proof. In this proof, let $\hat f_n$ denote the left derivative of the least concave majorant. We shall show that this maximizes the likelihood. Because the maximum likelihood estimator
is necessarily constant on each interval (X (i - l l , Xci l l . we may restrict ourselves to densities f with this property. For such an f we can write log f = L a i 1 [o, x,,l1 for the constants a i = 1og ft /fi+l (with fn + l = 1), and we obtain
For f = in this becomes an equality. To see this, let y 1 ::::: y2 ::::: be the points where ft touches IF . Then i is constant on each of the intervals (yi - 1 , Yi ] , so that we can write n n n log in = L b i 1 [o,y, J • and obtain ·
·
·
Third, by the identifiability property of the Kullback-Leibler divergence (see Lemma 5 .35), for any probability density f,
with strict inequality unless in = f. Combining the three displays, we see that in is the unique maximizer of f � J log f diFn . • Maximizing the likelihood is an important motivation for taking the derivative of the concave majorant, but this operation also has independent value. Taking the concave majo rant (or convex minorant) of the primitive function of an estimator and next differentiating the result may be viewed as a "smoothing" device, which is useful if the target function is known to be monotone. The estimator in can be viewed as the result of this procedure applied to the "naive" density estimator -
$$\bar f_n(x) = \frac{1}{n\,(X_{(i)} - X_{(i-1)})}, \qquad x \in (X_{(i-1)}, X_{(i)}].$$
This function is very rough and certainly not suitable as an estimator. Its primitive function is the polygon that linearly interpolates the extreme points of the empirical distribution function $\mathbb{F}_n$, and its smallest concave majorant coincides with that of $\mathbb{F}_n$. Thus the derivative of the concave majorant of $\bar F_n$ is exactly $\hat f_n$.

Consider the rate of convergence of the maximum likelihood estimator. Is the assumption of monotonicity sufficient to obtain a reasonable performance? The answer is affirmative if a rate of convergence of $n^{1/3}$ is considered reasonable. This rate is slower than the rate $n^{m/(2m+1)}$ of a kernel estimator if $m > 1$ derivatives exist and is comparable to this rate given one bounded derivative (even though we have not established a rate under $m = 1$). The rate of convergence $n^{1/3}$ can be shown to be best possible if only monotonicity is assumed. It is achieved by the maximum likelihood estimator.

24.6 Theorem. If the observations are sampled from a compactly supported, bounded, monotone density $f$, then
Proof. This result is a consequence of a general result on maximum likelihood estimators of densities (e.g., Theorem 3.4.4 in [146]). We shall give a more direct proof using the convexity of the class of monotone densities.

The sequence $\|\hat f_n\|_\infty = \hat f_n(0)$ is bounded in probability. Indeed, by the characterization of $\hat f_n$ as the slope of the concave majorant of $\mathbb{F}_n$, we see that $\hat f_n(0) > M$ if and only if there exists $t > 0$ such that $\mathbb{F}_n(t) > Mt$. The claim follows, because, by concavity, $F(t) \le f(0)\,t$ for every $t$, and, by Daniels's theorem ([134, p. 642]),
$$P\Bigl(\sup_{t > 0} \frac{\mathbb{F}_n(t)}{F(t)} > M\Bigr) = \frac{1}{M}.$$
It follows that the rate of convergence of $\hat f_n$ is the same as the rate of the maximum likelihood estimator under the restriction that $f$ is bounded by a (large) constant. In the remainder of the proof, we redefine $\hat f_n$ as the latter estimator. Denote the true density by $f_0$. By the definition of $\hat f_n$ and the inequality $\log x \le 2(\sqrt{x} - 1)$,
Therefore, we can obtain the rate of convergence of $\hat f_n$ by an application of Theorem 5.52 or 5.55 with $m_f = \sqrt{2f/(f + f_0)}$. Because $(m_f - m_{f_0})(f_0 - f) \le 0$ for every $f$ and $f_0$, it follows that $F_0(m_f - m_{f_0}) \le F(m_f - m_{f_0})$ and hence
$$F_0\bigl(m_f - m_{f_0}\bigr) \le \tfrac12 (F_0 + F)\bigl(m_f - m_{f_0}\bigr) = -h^2\bigl(f, \tfrac12 f + \tfrac12 f_0\bigr) \lesssim -h^2(f, f_0),$$
in which the last inequality is elementary calculus. Thus the first condition of Theorem 5.52 is satisfied relative to the Hellinger distance $h$, with $\alpha = 2$. The map $f \mapsto m_f$ is increasing. Therefore, it turns brackets $[f_1, f_2]$ for the functions $x \mapsto f(x)$ into brackets $[m_{f_1}, m_{f_2}]$ for the functions $x \mapsto m_f(x)$. The squared $L_2(F_0)$-size of these brackets satisfies
$$F_0\bigl(m_{f_2} - m_{f_1}\bigr)^2 \lesssim h^2(f_1, f_2).$$
It follows that the $L_2(F_0)$-bracketing numbers of the class of functions $m_f$ can be bounded by the $h$-bracketing numbers of the functions $f$. The latter are the $L_2(\lambda)$-bracketing numbers of the functions $\sqrt{f}$, which are monotone and bounded by assumption. In view of Example 19.11,
$$\log N_{[\,]}\bigl(2\varepsilon, \{m_f : f \in \mathcal{F}\}, L_2(F_0)\bigr) \le \log N_{[\,]}\bigl(\varepsilon, \mathcal{H}, L_2(\lambda)\bigr) \lesssim \frac{1}{\varepsilon}.$$
Because the functions $m_f$ are uniformly bounded, the maximal inequality Lemma 19.36 gives, with $J(\delta) = \int_0^\delta \sqrt{1/\varepsilon}\, d\varepsilon = 2\sqrt{\delta}$,
$$E_{f_0} \sup_{h(f, f_0) < \delta}\bigl|\mathbb{G}_n\bigl(m_f - m_{f_0}\bigr)\bigr| \lesssim \sqrt{\delta}\,\Bigl(1 + \frac{\sqrt{\delta}}{\delta^2\sqrt{n}}\Bigr).$$
Therefore, Theorem 5.55 applies with $\phi_n(\delta)$ equal to the right side and the Hellinger distance, and we conclude that $h(\hat f_n, f_0) = O_P(n^{-1/3})$.
Figure 24.5. If $\hat f_n(x) \le a$, then a line of slope $a$ moved down vertically from $+\infty$ first hits $\mathbb{F}_n$ to the left of $x$. The point where the line hits is the point at which $\mathbb{F}_n$ is farthest above the line of slope $a$ through the origin.
The $L_2(\lambda)$-distance between uniformly bounded densities is bounded up to a constant by the Hellinger distance, and the theorem follows. •

The most striking known results about estimating a monotone density concern limit distributions of the maximum likelihood estimator, for instance at a point.

24.7 Theorem. If $f$ is differentiable at $x > 0$ with derivative $f'(x) < 0$, then, with $\{Z(h) : h \in \mathbb{R}\}$ a standard Brownian motion process (two-sided, with $Z(0) = 0$),
$$n^{1/3}\bigl(\hat f_n(x) - f(x)\bigr) \rightsquigarrow \bigl|4 f'(x) f(x)\bigr|^{1/3}\, \operatorname*{argmax}_{h \in \mathbb{R}}\bigl\{Z(h) - h^2\bigr\}.$$
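The limit variable $\operatorname{argmax}_h\{Z(h) - h^2\}$ has no simple closed form, but it is straightforward to approximate by Monte Carlo. The sketch below is ours and not from the text; the grid range, step size, and number of replications are arbitrary choices.

```python
import numpy as np

def simulate_argmax(n_rep=2000, half_width=3.0, n_grid=601, seed=1):
    """Monte Carlo draws of argmax_h {Z(h) - h^2} on a finite grid,
    with Z a two-sided standard Brownian motion, Z(0) = 0."""
    rng = np.random.default_rng(seed)
    h = np.linspace(-half_width, half_width, n_grid)
    step = h[1] - h[0]
    mid = n_grid // 2                    # grid index of h = 0 (grid is symmetric)
    draws = np.empty(n_rep)
    for r in range(n_rep):
        incr = rng.normal(scale=np.sqrt(step), size=n_grid - 1)
        z = np.concatenate(([0.0], np.cumsum(incr)))
        z -= z[mid]                      # pin the path at Z(0) = 0
        draws[r] = h[np.argmax(z - h**2)]
    return draws

# n^{1/3}(fhat_n(x) - f(x)) is approximately |4 f'(x) f(x)|^{1/3} times such a draw.
limit_sample = simulate_argmax()
```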
Proof. For simplicity we assume that $f$ is continuously differentiable at $x$. Define a stochastic process $\{\hat f_n^{-1}(a) : a > 0\}$ by
$$\hat f_n^{-1}(a) = \operatorname*{argmax}_{s \ge 0}\bigl\{\mathbb{F}_n(s) - as\bigr\},$$
in which the largest value is chosen when multiple maximizers exist. The suggestive notation is justified, as the function $\hat f_n^{-1}$ is the inverse of the maximum likelihood estimator $\hat f_n$, in that $\hat f_n(x) \le a$ if and only if $\hat f_n^{-1}(a) \le x$, for every $x$ and $a$. This is explained in Figure 24.5.

We first derive the limit distribution of $\hat f_n^{-1}$. Let $\delta_n = n^{-1/3}$. By the change of variable $s \mapsto x + h\delta_n$ in the definition of $\hat f_n^{-1}$, we have
$$n^{1/3}\bigl(\hat f_n^{-1} \circ f(x) - x\bigr) = \operatorname*{argmax}_{h \ge -n^{1/3} x}\bigl\{\mathbb{F}_n(x + h\delta_n) - f(x)(x + h\delta_n)\bigr\}.$$
Because the location of a maximum does not change by a vertical shift of the whole function, we can drop the term $f(x)\,x$ on the right side, and we may add a term $\mathbb{F}_n(x)$. For the same
reason we may also multiply the process on the right side by $n^{2/3}$. Thus the preceding display is equal to the point of maximum $\hat h_n$ of the process
$$h \mapsto n^{2/3}\bigl[(\mathbb{F}_n - F)(x + h\delta_n) - (\mathbb{F}_n - F)(x)\bigr] + n^{2/3}\bigl[F(x + h\delta_n) - F(x) - f(x)\,h\delta_n\bigr].$$
The first term is the local empirical process studied in Example 19.29, and converges in distribution to the process $h \mapsto \sqrt{f(x)}\, Z(h)$, for $Z$ a standard Brownian motion process, in $\ell^\infty(K)$, for every compact interval $K$. The second term is a deterministic "drift" process and converges on compacta to $h \mapsto \tfrac12 f'(x)\, h^2$. This suggests that
$$n^{1/3}\bigl(\hat f_n^{-1} \circ f(x) - x\bigr) \rightsquigarrow \operatorname*{argmax}_{h \in \mathbb{R}}\Bigl\{\sqrt{f(x)}\, Z(h) + \tfrac12 f'(x)\, h^2\Bigr\}.$$
This argument remains valid if we replace $x$ by $x_n = x - \delta_n b$ throughout, where the limit is the same for every $b \in \mathbb{R}$. We can write the limit in a more attractive form by using the fact that the processes $h \mapsto Z(\sigma h)$ and $h \mapsto \sqrt{\sigma}\, Z(h)$ are equal in distribution for every $\sigma > 0$. First, apply the change of variables $h \mapsto \sigma h$, next pull $\sigma$ out of $Z(\sigma h)$, then divide the process by $\sqrt{f(x)\sigma}$, and finally choose $\sigma$ such that the quadratic term reduces to $-h^2$, that is, $\sqrt{f(x)\sigma} = -\tfrac12 f'(x)\sigma^2$. Then we obtain, for every $b \in \mathbb{R}$,
$$n^{1/3}\bigl(\hat f_n^{-1} \circ f(x_n) - x_n\bigr) \rightsquigarrow \Bigl|\frac{4 f(x)}{f'(x)^2}\Bigr|^{1/3}\, \operatorname*{argmax}_{h \in \mathbb{R}}\bigl\{Z(h) - h^2\bigr\}.$$
The connection with the limit distribution of $\hat f_n(x)$ is that
$$P\bigl(n^{1/3}(\hat f_n(x) - f(x)) \le -b f'(x)\bigr) = P\bigl(\hat f_n(x) \le f(x - \delta_n b)\bigr) + o(1) = P\bigl(n^{1/3}\bigl(\hat f_n^{-1} \circ f(x - \delta_n b) - (x - \delta_n b)\bigr) \le b\bigr) + o(1).$$
Combined with the preceding display and simple algebra, this yields the theorem.

The preceding argument can be made rigorous by application of the argmax continuous mapping theorem, Corollary 5.58. The limiting Brownian motion has continuous sample paths, and maxima of Gaussian processes are automatically unique (see, e.g., Lemma 2.6 in [87]). Therefore, we need only check that $\hat h_n = O_P(1)$, for which we apply Theorem 5.52 with
$$m_g = 1_{[0, x_n + g]} - 1_{[0, x_n]} - f(x_n)\, g.$$
(In Theorem 5.52 the function $m_g$ can be allowed to depend on $n$, as is clear from its generalization, Theorem 5.55.) By its definition, $\hat g_n = \delta_n \hat h_n$ maximizes $g \mapsto \mathbb{F}_n m_g$, whence we wish to show that $\hat g_n = O_P(\delta_n)$. By Example 19.6 the bracketing numbers of the class of functions $\{1_{[0, x_n + g]} - 1_{[0, x_n]} : |g| < \delta\}$ are of the order $\delta/\varepsilon^2$; the envelope function $|1_{[0, x_n + \delta]} - 1_{[0, x_n - \delta]}|$ has $L_2(F)$-norm of the order $\sqrt{f(x)\delta}$. By Corollary 19.35,
By the concavity of $F$, the function $g \mapsto F(x_n + g) - F(x_n) - f(x_n)\,g$ is nonpositive and nonincreasing as $g$ moves away from 0 in either direction (draw a picture). Because $f'(x_n) \to f'(x) < 0$, there exists a constant $C$ such that, for sufficiently large $n$,
$$F m_g \le -C\bigl(g^2 \wedge |g|\bigr).$$
If we knew already that $\hat g_n \overset{P}{\to} 0$, then Theorem 5.52, applied with $\alpha = 2$ and $\beta = \tfrac12$, yields that $\hat g_n = O_P(\delta_n)$. The consistency of $\hat g_n$ can be shown by a direct argument. By the Glivenko-Cantelli theorem, for every $\varepsilon > 0$,
$$\sup_{|g| \ge \varepsilon} \mathbb{F}_n m_g \le \sup_{|g| \ge \varepsilon} F m_g + o_P(1) \le -C \inf_{|g| \ge \varepsilon}\bigl(g^2 \wedge |g|\bigr) + o_P(1).$$
Because the right side is strictly smaller than $0 = \mathbb{F}_n m_0$, the maximizer $\hat g_n$ must be contained in $[-\varepsilon, \varepsilon]$ eventually. •

Results on density estimators at a point are perhaps not of greatest interest, because it is the overall shape of a density that counts. Hence it is interesting that the preceding theorem is also true in an $L_1$-sense, in that
$$n^{1/3} \int \bigl|\hat f_n(x) - f(x)\bigr|\, dx \rightsquigarrow \int \bigl|4 f'(x) f(x)\bigr|^{1/3}\, dx \;\cdot\; E\operatorname*{argmax}_{h \in \mathbb{R}}\bigl\{Z(h) - h^2\bigr\}.$$
This is true for every strictly decreasing, compactly supported, twice continuously differentiable true density $f$. For boundary cases, such as the uniform distribution, the behavior of $\hat f_n$ is very different. Note that the right side of the preceding display is degenerate. This is explained by the fact that the random variables $n^{1/3}\bigl(\hat f_n(x) - f(x)\bigr)$ for different values of $x$ are asymptotically independent, because they depend only on the observations $X_i$ very close to $x$, so that the integral aggregates a large number of approximately independent variables. It is also known that $n^{1/6}$ times the difference between the left and right sides converges in distribution to a normal distribution with mean zero and variance not depending on $f$. For uniformly distributed observations, the estimator $\hat f_n$ remains dependent on all $n$ observations, even asymptotically, and attains a $\sqrt{n}$-rate of convergence (see [62]).

We define a density $f$ on the real line to be unimodal if there exists a number $M_f$ such that $f$ is nondecreasing on the interval $(-\infty, M_f]$ and nonincreasing on $[M_f, \infty)$. The mode $M_f$ need not be unique. Suppose that we observe a random sample from a unimodal density. If the true mode $M_f$ is known a priori, then a natural extension of the preceding discussion is to estimate the distribution function $F$ of the observations by the distribution function $\hat F_n$ that is the least concave majorant of $\mathbb{F}_n$ on the interval $[M_f, \infty)$ and the greatest convex minorant on $(-\infty, M_f]$. Next we estimate $f$ by the derivative $\hat f_n$ of $\hat F_n$. Provided that none of the observations takes the value $M_f$, this estimator maximizes the likelihood, as can be shown by arguments as before. The limit results on monotone densities can also be extended to the present case. In particular, because the key in the proof of Theorem 24.7 is the characterization of $\hat f_n$ as the derivative of the concave majorant of $\mathbb{F}_n$, this theorem remains true in the unimodal case, with the same limit distribution.

If the mode is not known a priori, then the maximum likelihood estimator does not exist: The likelihood can be maximized to infinity by placing an arbitrarily large mode at some fixed observation. It has been proposed to remedy this problem by restricting the likelihood
to densities that have a modal interval of a given length (on which $f$ must be constant and maximal). Alternatively, we could estimate the mode by an independent method and next apply the procedure for a known mode. Both of these possibilities break down unless $f$ possesses some additional properties. A third possibility is to try every possible value $M$ as a mode, calculate the estimator $\hat f_{n,M}$ for known mode $M$, and select the best-fitting one. Here "best" could be operationalized as (nearly) minimizing the Kolmogorov-Smirnov distance $\|\hat F_{n,M} - \mathbb{F}_n\|_\infty$. It can be shown (see [13]) that this procedure renders the effect of the mode being unknown asymptotically negligible (up to an arbitrarily small tolerance parameter, provided $\hat M$ only approximately achieves the minimum of $M \mapsto \|\hat F_{n,M} - \mathbb{F}_n\|_\infty$). This extra "error" is of lower order than the rate of convergence $n^{1/3}$ of the estimator with a known mode.
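For a known mode the estimator described above is again simple to compute: apply the monotone construction separately on each side of the mode (by reflection for the increasing part) and reweight by the empirical mass of each side. The sketch below is ours, not from the text; it reuses the `grenander` routine sketched earlier in this chapter and ignores boundary subtleties at the mode.

```python
import numpy as np

def unimodal_mle(x, mode):
    """Shape-restricted density estimate for a known mode.

    Right of the mode: Grenander estimate of the decreasing conditional
    density of x - mode given x >= mode.  Left of the mode: the same after
    reflecting the observations below the mode.  Each piece is reweighted by
    the fraction of observations on its side.
    """
    x = np.asarray(x, dtype=float)
    right = x[x >= mode] - mode
    left = mode - x[x < mode]
    w_right = len(right) / len(x)
    w_left = 1.0 - w_right
    xr, fr = grenander(right)        # nonincreasing piece on [mode, infinity)
    xl, fl = grenander(left)         # reflected piece; nondecreasing left of the mode
    return (mode + xr, w_right * fr), (mode - xl, w_left * fl)
```

The procedure for an unknown mode discussed above can then be approximated by evaluating such a routine on a grid of candidate modes and selecting the one whose integrated estimate is closest to $\mathbb{F}_n$ in Kolmogorov-Smirnov distance.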
Notes
The literature on nonparametric density estimation, or "smoothing," is large, and there is an equally large literature concerning the parallel problem of nonparametric regression. Next to kernel estimation, popular methods are based on classical series approximations, spline functions, and, most recently, wavelet approximation. Besides different methods, a good deal is known concerning other loss functions, for instance $L_1$-loss, and automatic methods to choose a bandwidth. Most recently, there is a revived interest in obtaining exact constants in minimax bounds, rather than just rates of convergence. See, for instance, [14], [15], [36], [121], [135], [137], and [148] for introductions and further references. The kernel estimator is often named after its pioneers in the 1960s, Parzen and Rosenblatt, and was originally developed for smoothing the periodogram in spectral density estimation. A lower bound for the maximum risk over Hölder classes for estimating a density at a single point was obtained in [46]. The lower bound for the $L_2$-risk is more recent. Birgé [12] gives a systematic study of upper and lower bounds and their relationship to the metric entropy of the model. An alternative for Assouad's lemma is Fano's lemma, which uses the Kullback-Leibler distance and can be found in, for example, [80]. The maximum likelihood estimator for a monotone density is often called the Grenander estimator, after the author who first characterized it in 1956. The very short proof of Lemma 24.5 is taken from [64]. The limit distribution of the Grenander estimator at a point was first obtained by Prakasa Rao in 1969; see [121]. Groeneboom [63] gives a characterization of the limit distribution and other interesting related results.

PROBLEMS

1. Show, informally, that under sufficient regularity conditions
$$\mathrm{MISE}_f(\hat f) \approx \frac{1}{nh}\int K^2(y)\, dy + \frac{1}{4}\, h^4 \int f''(x)^2\, dx\, \Bigl(\int y^2 K(y)\, dy\Bigr)^2.$$
What does this imply for an optimal choice of the bandwidth?
2. Let $X_1, \ldots, X_n$ be a random sample from the normal distribution with variance 1. Calculate the mean square error of the estimator $\phi(x - \bar X_n)$ of the common density.
3. Using the argument of section 14.5 and a submodel as in section 24.3, but with $r_n = n^{1/(2m+1)}$, show that the best rate for estimating a density at a fixed point is also $n^{-m/(2m+1)}$.

4. Using the argument of section 14.5, show that the rate of convergence $n^{1/3}$ of the maximum likelihood estimator for a monotone density is best possible.

5. (Marshall's lemma.) Suppose that $F$ is concave on $[0, \infty)$ with $F(0) = 0$. Show that the least concave majorant $\hat F_n$ of $\mathbb{F}_n$ satisfies the inequality $\|\hat F_n - F\|_\infty \le \|\mathbb{F}_n - F\|_\infty$. What does this imply about the limiting behavior of $\hat F_n$?
25 Semiparametric Models
This chapter is concerned with statistical models that are indexed by infinite-dimensional parameters. It gives an introduction to the theory of asymptotic efficiency, and discusses methods of estimation and testing.
25.1 Introduction
Semiparametric models are statistical models in which the parameter is not a Euclidean vector but ranges over an "infinite-dimensional" parameter set. A different name is "model with a large parameter space." In the situation in which the observations consist of a random sample from a common distribution $P$, the model is simply the set $\mathcal{P}$ of all possible values of $P$: a collection of probability measures on the sample space. The simplest type of infinite-dimensional model is the nonparametric model, in which we observe a random sample from a completely unknown distribution. Then $\mathcal{P}$ is the collection of all probability measures on the sample space, and, as we shall see and as is intuitively clear, the empirical distribution is an asymptotically efficient estimator for the underlying distribution. More interesting are the intermediate models, which are not "nicely" parametrized by a Euclidean parameter, as are the standard classical models, but do restrict the distribution in an important way. Such models are often parametrized by infinite-dimensional parameters, such as distribution functions or densities, that express the structure under study. Many aspects of these parameters are estimable by the same order of accuracy as classical parameters, and efficient estimators are asymptotically normal. In particular, the model may have a natural parametrization $(\theta, \eta) \mapsto P_{\theta,\eta}$, where $\theta$ is a Euclidean parameter and $\eta$ runs through a nonparametric class of distributions, or some other infinite-dimensional set. This gives a semiparametric model in the strict sense, in which we aim at estimating $\theta$ and consider $\eta$ as a nuisance parameter. More generally, we focus on estimating the value $\psi(P)$ of some function $\psi : \mathcal{P} \mapsto \mathbb{R}^k$ on the model. In this chapter we extend the theory of asymptotic efficiency, as developed in Chapters 8 and 15, from parametric to semiparametric models and discuss some methods of estimation and testing. Although the efficiency theory (lower bounds) is fairly complete, there are still important holes in the estimation theory. In particular, the extent to which the lower bounds are sharp is unclear. We limit ourselves to parameters that are $\sqrt{n}$-estimable, although in most semiparametric models there are many "irregular" parameters of interest that are outside the scope of "asymptotically normal" theory. Semiparametric testing theory has
little more to offer than the comforting conclusion that tests based on efficient estimators are efficient. Thus, we shall be brief about it.

We conclude this introduction with a list of examples that shows the scope of semiparametric theory. In this description, $X$ denotes a typical observation. Random vectors $Y$, $Z$, $e$, and $f$ are used to describe the model but are not necessarily observed. The parameters $\theta$ and $\nu$ are always Euclidean.

25.1 Example (Regression). Let $Z$ and $e$ be independent random vectors and suppose that $Y = \mu_\theta(Z) + \sigma_\theta(Z)\, e$ for functions $\mu_\theta$ and $\sigma_\theta$ that are known up to $\theta$. The observation is the pair $X = (Y, Z)$. If the distribution of $e$ is known to belong to a certain parametric family, such as the family of $N(0, \sigma^2)$-distributions, and the independent variables $Z$ are modeled as constants, then this is just a classical regression model, allowing for heteroscedasticity. Semiparametric versions are obtained by letting the distribution of $e$ range over all distributions on the real line with mean zero, or, alternatively, over all distributions that are symmetric about zero. □
25.2 Example (Projection pursuit regression). Let Z and e be independent random vec tors and let Y = rJ (e T Z) + e for a function rJ ranging over a set of (smooth) functions, and e having an N (O, o- 2 )-distribution. In this model e and rJ are confounded, but the direction of e is estimable up to its sign. This type of regression model is also known as a single-index model and is intermediate between the classical regression model in which rJ is known and the nonparametric regression model Y = rJ (Z) + e with rJ an unknown smooth function. An extension is to let the error distribution range over an infinite-dimensional set as well. D 25.3
Example (Logistic regression).
Given a vector Z, let the random variable Y take
the value 1 with probability 1 /(1 + e - r (Zl) and be 0 otherwise. Let Z = (Z 1 , Z2 ), and let the function r be of the form r (Z! , Z 2 ) = rJ (Z J ) + er Z 2 - Observed is the pair X = (Y, Z).
This is a semiparametric version of the logistic regression model, in which the response is allowed to be nonlinear in part of the covariate. D 25.4 Example (Paired exponential). Given an unobservable variable Z with completely unknown distribution, let X = (X 1 , X2 ) be a vector of independent exponentially distributed random variables with parameters Z and ze . The interest is in the ratio e of the conditional hazard rates of X 1 and X 2 . Modeling the "baseline hazard" Z as a random variable rather than as an unknown constant allows for heterogeneity in the population of all pairs (X 1 , X2 ) , and hence ensures a much better fit than the two-dimensional parametric model in which the value z is a parameter that is the same for every observation. D
The observation is a pair X = (X 1 , X2 ) , where X 1 = Z + e and X2 = + f3 Z + f for a bivariate normal vector (e, f) with mean zero and unknown covariance matrix. Thus X2 is a linear regression on a variable Z that is observed with error. The distribution of Z is unknown. D
25.5
Example (Errors-in-variables). a
Suppose that X = (Y, Z), where the ran dom vectors Y and Z are known to satisfy rJ (Y) = e r z + e for an unknown map 1'J and independent random vectors e and Z with known or parametrically specified distributions. 25.6
Example (Transformation regression).
Semiparametric Models
360
The transformation rJ ranges over an infinite-dimensional set, for instance the set of all monotone functions. 0 25.7 Example (Cox). The observation is a pair X = (T, Z) of a "survival time" T and a covariate Z. The distribution of Z is unknown and the conditional hazard function of T given Z is of the form e 8 r z A (t) for A being a completely unknown hazard function. The parameter e has an interesting interpretation in terms of a ratio of hazards. For instance, if the ith coordinate Zi of the covariate is a 0- 1 variable then e 8• can be interpreted as the ratio of the hazards of two individuals whose covariates are Zi = 1 and Zi = 0, respectively, but who are identical otherwise. 0
The observation X is two-dimensional with cumulative distri bution function of the form C8 ( G 1 (x 1 ) , G 2 (x2 ) ) , for a parametric family of cumulative distribution functions C8 on the unit square with uniform marginals. The marginal distri bution functions G i may both be completely unknown or one may be known. 0
25.8
Example (Copula).
Two survival times Y1 and Y2 are conditionally independent given variables (Z, W) with hazard function of the form We 8 r 2 A(y) . The random variable W is not observed, possesses a gamma( v, v) distribution, and is independent of the variable Z which possesses a completely unknown distribution. The observation is X = (Y1 , Y2 , Z). The variable W can be considered an unobserved regression variable in a Cox model. 0
25.9
Example (Frailty).
A "time of death" T is observed only if death oc curs before the time C of a "censoring event" that is independent of T; otherwise C is observed. A typical observation X is a pair of a survival time and a 0- 1 variable and is distributed as ( T 1\ C, l {T _:::: C}) . The distributions of T and C may be completely unknown. 0 25.10
Example (Random censoring).
25. 1 1 Example (Interval censoring). A "death" that occurs at time T is only observed to have taken place or not at a known "check-up time" C. The observation is X = ( C, 1 {T _:::: C}) , and T and C are assumed independent with completely unknown or partially specified distributions. 0
A variable of interest Y is observed only if it is larger than a censoring variable C that is independent of Y ; otherwise, nothing is observed. A typical observation X = (X 1 , X 2 ) is distributed according to the conditional distribution of (Y, C) given that Y C. The distributions of Y and C may be completely unknown. 0
25.12
Example (Truncation).
>
25.2
Banach and Hilbert Spaces
In this section we recall some facts concerning Banach spaces and, in particular, Hilbert spaces, which play an important role in this chapter. Given a probality space (X, A, P), we denote by L 2 ( P) the set of measurable functions g : X H- lR with Pg 2 = J g 2 d P < oo, where almost surely equal functions are identi fied. This is a Hilbert space, a complete inner-product space, relative to the inner product
25. 2
Banach and Hilbert Spaces
361
and norm
ll g ll
=
jPii .
Given a Hilbert space IHI, the projection lemma asserts that for every g E lHl and convex, closed subset C c IHI, there exists a unique element ITg E C that minimizes c 1-+ ll g - e ll over C. If C is a closed, linear subspace, then the projection ITg can be characterized by the orthogonality relationship
(g - ITg , c)
=
0,
every c E C.
The proof is the same as in Chapter 1 1 . If C 1 c C2 are two nested, closed subspaces, then the projection onto cl can be found by first projecting onto c2 and next onto c] . Two subsets C 1 and C2 are orthogonal, notation C l_ C2 , if (c 1 , c2 ) = 0 for every pair of c; E C; . The projection onto the sum of two orthogonal closed subspaces is the sum of the proj ections. The orthocomplement c .L of a set C is the set of all g j_ C. A Banach space is a complete, normed space. The dual space JBl,* of a Banach space JBl, is the set of all continuous, linear maps b* : JBl, 1-+ Tit Equivalently, all linear maps such that l b* (b) l :::::; ll b* ll ll b ll for every b E JBl, and some number l i b* II - The smallest number with this property is denoted by I b* I and defines a norm on the dual space. According to the Riesz representation theorem for Hilbert spaces, the dual of a Hilbert space IHI consists of all maps
h I-+ (h, h * ) , where h * ranges over IHI. Thus, in this case the dual space IHI* can be identified with the space IHI itself. This identification is an isometry by the Cauchy-Schwarz inequality I (h, h*) I :::::;
ll h ll ll h* II A
linear map A : llli 1 1-+ llli 2 from one Banach space into another is continuous if and only if II A h ll 2 :::::; II A II II b 1 ll for every h E llli 1 and some number II A II - The smallest number with this property is denoted by I A II and defines a norm on the set of all continuous, linear maps, also called operators, from llli 1 into llli2 . Continuous, linear operators are also called "bounded," even though they are only bounded on bounded sets. To every continuous, linear operator A : llli 1 1-+ llli 2 exists an adjoint map A* : llli� 1-+ JBl,r defined by (A*b;>b 1 = b�A h . This is a continuous, linear operator of the same norm I A* II = I A II - For Hilbert spaces the dual space can be identified with the original space and then the adjoint of A : IHI 1 1-+ IHI2 is a map A* : IHI2 1-+ IHI 1 . It is characterized by the property An operator between Euclidean spaces can be identified with a matrix, and its adjoint with the transpose. The adjoint of a restriction A0 : IHI 1 ,0 c IHI 1 1-+ IHI2 of A is the composition IT o A* of the projection IT : IHI 1 1-+ IHI 1 , 0 and the adjoint of the original A. The range R(A) = {Ab 1 : b 1 E llli d of a continuous, linear operator is not necessarily closed. By the "bounded inverse theorem" the range of a 1 - 1 continuous, linear operator between Banach spaces is closed if and only if its inverse is continuous. In contrast the kernel N(A) = {b 1 : Ab 1 = 0} of a continuous, linear operator is always closed. For an operator between Hilbert spaces the relationship R(A).l = N(A *) follows readily from the
362
Semiparametric Models
characterization of the adjoint. The range of A is closed if and only if the range of A* is closed if and only if the range of A* A is closed. In that case R (A*) = R (A* A). If A* A : IHI 1 M- IHI 1 is continuously invertible (i.e., is 1 - 1 and onto with a continuous inverse), then A (A* A ) 1 A* : lHh M- R(A) is the orthogonal projection onto the range of A, as follows easily by checking the orthogonality relationship. -
25.3
Tangent Spaces and Information
Suppose that we observe a random sample X 1 , . . . , Xn from a distribution P that is known to belong to a set P of probability measures on the sample space (X, A) . It is required to estimate the value 1/J ( P) of a functional 1/J : P M- JR.k . In this section we develop a notion of information for estimating 1/J(P) given the model P, which extends the notion of Fisher information for parametric models. To estimate the parameter 1/J (P) given the model P is certainly harder than to estimate this parameter given that P belongs to a submodel Po c P. For every smooth parametric submodel Po = { Pe : e E 8} c P, we can calculate the Fisher information for estimating 1/J(Pe). Then the information for estimating 1/J(P) in the whole model is certainly not bigger than the infimum of the informations over all submodels. We shall simply define the information for the whole model as this infimum. A submodel for which the infimum is taken (if there is one) is called least favorable or a "hardest" submodel. In most situations it suffices to consider one-dimensional submodels P0. These should pass through the "true" distribution P of the observations and be differentiable at P in the sense of Chapter 7 on local asymptotic normality. Thus, we consider maps t M- P1 from a neighborhood of 0 E [0, oo) to P such that, for some measurable function g : X M- JR., t
] 2 [ dp 1 /2 - dP 1 /2 1 - g d p /2 -+ J t 1 2
t
0.
(25 . 1 3)
In other words, the parametric submodel { P1 : 0 < t < E:} is differentiable in quadratic mean at t = 0 with score function g. Letting t M- P1 range over a collection of submodels, we obtain a collection of score functions, which we call a tangent set of the model P at P and denote by Pp . Because Ph 2 is automatically finite, the tangent space can be identified with a subset of L 2 (P), up to equivalence classes. The tangent set is often a linear space, in which case we speak of a tangent space . Geometrically, we may visualize the model P, or rather the corresponding set of "square roots of measures" d P 1 12 , as a subset of the unit ball of L 2 ( P), and Pp , or rather the set of all objects 1 g d P 1 12 , as its tangent set. Usually, we construct the submodels t M- P1 such that, for every x ,
g(x) t
a log dP1 (x ) . at l t=o
= -
If P and every one of the measures P1 possess densities p and p1 with respect to a measure f-tt , then the expressions d P and d Pt can be replaced by p and Pt and the integral can be understood relative to I-tt (add d f-tt on the right). We use the notations d Pt and d P, because some models P of interest are not dominated, and the choice of I-tt is irrelevant. However, the model could be taken dominated for simplicity, and then d Pt and d P are just the densities of Pt and P .
363
Tangent Spaces and Information
25.3
However, the differentiability (25 . 1 3) is the correct definition for defining information, because it ensures a type of local asymptotic normality. The following lemma is proved in the same way as Theorem 7 .2. 25.14
Lemma. If the path t f---+
log
P1 in P satisfies (25 . 1 3), then P g = 0, Pg 2
ll n d P1 / ..fo. ( X i ) = 1r;; � 1 2 � g ( X i ) - P g + op ( l ) . 2 dP n
and
-
v i=1
i= 1
< oo,
For defining the information for estimating 1./f ( P ) , only those submodels t f---+ P1 along which the parameter t f---+ 1/f ( P1) is differentiable are of interest. Thus, we consider only submodels t f---+ P1 such that t f---+ 1/f ( P1) is differentiable at t = 0. More precisely, we define 1./f : P f---+ JH.k to be differentiable at P relative to a given tangent set Pp if there exists a continuous linear map � P : L 2 (P) f---+ JH.k such that for every g E PP and a submodel t f---+ P1 with score function g, . 1./f ( Pt ) - 1./f ( P) -----
t
-+ 1./f p g.
This requires that the derivative of the map t f---+ 1./f ( P1) exists in the ordinary sense, and also that it has a special representation. (The map � P is much like a Hadamard derivative of 1./f viewed as a map on the space of "square roots of measures.") Our definition is also relative to the submodels t f---+ Pr . but we speak of "relative to P p" for simplicity. B y the Riesz representation theorem for Hilbert spaces, the map Vr P can always be written in the form of an inner product with a fixed vector-valued, measurable function 1./f p : A' f---+ lRk '
Vr p g = ( Vf p , g) p =
f Vf p g dP.
Here the function 1f P is not uniquely defined by the functional l./f and the model P, because only inner products of 1f p with elements of the tangent set are specified, and the tangent set does not span all of L 2 (P) . However, it is always possible to find a candidate 1f P whose coordinate functions are contained in lin Pp , the closure of the linear span of the tangent set. This function is unique and is called the efficient influence function. It can be found as the projection of any other "influence function" onto the closed linear span of the tangent set. In the preceding set-up the tangent sets PP are made to depend both on the model P and the functional l./f. We do not always want to use the "maximal tangent set," which is the set of all score functions of differentiable submodels t f---+ P1 , because the parameter 1./f may not be differentiable relative to it. We consider every subset of a tangent set a tangent set itself. The maximal tangent set is a cone: If g E PP and a ::::: 0, then ag E Pp , because the path t f---+ Pa t has score function ag when t f---+ P1 has score function g. It is rarely a loss of generality to assume that the tangent set we work with is a cone as well. 25. 1 5 Example (Parametric model). Consider a parametric model with parameter e rang ing over an open subset e of JH.m given by densities Pe with respect to some measure f-t · Suppose that there exists a vector-valued measurable map i e such that, as h -+ 0, 2 1 /2 1 / 2 - 1 h T · e 1 /2 d = o( ll h II 2 ) t-t Pe + h - P e 2 -f Pe
j[
]
.
3 64
Semiparametric Models
Then a tangent set at Pe is given by the linear space {h T ee : h E JR.m } spanned by the score functions for the coordinates of the parameter 8 . If the Fisher information matrix Ie Pe lel� i s invertible, then every map x : 8 �---+ IRk that is differentiable in the ordinary sense as a map between Euclidean spaces is differentiable as a map 1/f(Pe) x (8) on the model relative to the given tangent space. This follows because the submodel t �---+ Pe+t h has score h T l e and =
=
Xe h Pe [( xe le- l € e ) h T R e ] . This equation shows that the function {if Pe Xe li 1 ee is the efficient influence function. .!!._ o X (8 + th) at lt=
=
=
=
In view of the results of Chapter 8, asymptotically efficient estimator sequences for x (8) are asymptotically linear in this function, which justifies the name "efficient influence function." D
25.16 Example (Nonparametric models). Suppose that P consists of all probability laws on the sample space. Then a tangent set at P consists of all measurable functions g satisfying J g dP 0 and J g 2 d P oo . Because a score function necessarily has mean zero, this is the maximal tangent set. It suffices to exhibit suitable one-dimensional submodels. For a bounded function g, consider for instance the exponential family p1 (x) c(t) exp (tg(x) ) p0 (x) or, alterna tively, the model p1 (x) (1 + tg(x) ) p0 (x). Both models have the property that, for every =
<
=
=
X,
a log Pr (x) . a t lt = o By a direct calculation or by using Lemma 7.6, we see that both models also have score function g at t 0 in the L 2 -sense (25. 13). For an unbounded function g, these submodels are not necessarily well-defined. However, the models have the common structure p1 (x) c (t) k ( tg (x) ) p0 (x) for a nonnegative function k with k(O) k'(O) 1 . The function k (x) 2(1 + e - 2x ) - 1 is bounded and can be used with any g. D g(x)
= -
=
=
=
=
=
25. 17
form
Example (Cox model).
The density of an observation in the Cox model takes the
Differentiating the logarithm of this expression with respect to 8 gives the score function for 8 , We can also insert appropriate parametric models s �---+ A s and differentiate with respect to s . If a is the derivative of log A s at s 0, then the corresponding score for the model for the observation is a (t) - e 8T z a dA . J[O,t] Finally, scores for the density pz are functions b (z) . The tangent space contains the linear span of all these functions. Note that the scores for A can be found as an "operator" working on functions a . D =
{
25. 3
Tangent Spaces and Information
365
If the transformation ry is increas ing and e has density ¢, then the density of the observation can be written ¢ ( ry (y) e T z) rJ'(y) p z (z) Scores for e and 'f/ take the forms 25.18
Example (Transformation regression model). .
a' ¢' - (ry(y) - e T z)a (y) + - (y), ry ' ¢ where a is the derivative for ry. If the distributions of e and Z are (partly) unknown, then there are additional score functions corresponding to their distributions. Again scores take the form of an operator acting on a set of functions. 0
To motivate the definition of information, assume for simplicity that the parameter 1/J (P) is one-dimensional. The Fisher information about t in a submodel t � P1 with score function g at t 0 is P g 2 . Thus, the "optimal asymptotic variance" for estimating the function t � 1/J (P1) , evaluated at t 0, is the Cramer-Rao bound =
=
-
( 1/J P • g ) 2p
(g, g) p
The supremum of this expression over all submodels, equivalently over all elements of the tangent set, is a lower bound for estimating 1/J ( P) given the model P, if the "true measure" is P . This supremum can be expressed in the norm of the efficient influence function 1f P .
Suppose that the junctional 1/J : P � lR is differentiable at P relative to the tangent set Pp. Then 1/J P • g ) 2p ( sup p ,/, 2p · p g) g Elin Pp (g
25.19
Lemma.
=
'f'
,
This is a consequence of the Cauchy-Schwarz inequality (Plfr p g) 2 ::::::; Plfr � Pg 2 and the.fact that, by definition, the efficient influence function 1f P is contained in the closure of lin Pp . •
Proof
Thus, the squared norm P lfr � of the efficient influence function plays the role of an "optimal asymptotic variance," just as does the expression Vr 8 Ii 1 Vr in Chapter 8. Similar considerations (take linear combinations) show that the "optimal asymptotic covariance" for estimating a higher-dimensional parameter 1/J : P � IRk is given by the covariance matrix P 1f P 1f � of the efficient influence function. In Chapter 8, we developed three ways to give a precise meaning to optimal asymptotic covariance: the convolution theorem, the almost-everywhere convolution theorem, and the minimax theorem. The almost-everywhere theorem uses the Lebesgue measure on the Euclidean parameter set, and does not appear to have an easy parallel for semiparametric models. On the other hand, the two other results can be generalized. For every g in a given tangent set Pp , write Pr , g for a submodel with score function g along which the function 1/J is differentiable. As usual, an estimator Tn is a measurable function Tn (X 1, . . , Xn ) of the observations. An estimator sequence Tn is called regular at P for estimating 1/J (P) (relative to P p) if there exists a probability measure L such that
�
.
every g E Pp .
366
Semiparametric Models
Let the function 1/J : P r--+ �k be differentiable at P rel ative to the tangent cone PP with efficient influence function 1iJ p · Then the asymptotic covariance matri:: of every regular sequence of estimators is bounded below by P 1jf p 1if�. Furthermore, ifPp is a convex cone, then every limit distribution L of a regular sequence of estimators can be written L P1jf P 1if� ) * M for some probability distribu tion M. 25.20
Theorem (Convolution).
=
N (O,
Let the function 1/J : P r--+ �k be differentiable at P relative to the tangent cone PP with efficient influence function 1iJ P · IfPp is a convex cone, then, for any estimator sequence {Tn } and subconvexfunction -C : �k r--+ [0, oo), 25.21
Theorem (LAM).
Here the first supremum is taken over all finite subsets I of the tangent set. Proofs. These results follow essentially by applying the corresponding theorems for para metric models to sufficiently rich finite-dimensional submodels. However, because we have defined the tangent set using one-dimensional submodels t r--+ P1,g , it is necessary to rework the proofs a little. Assume first that the tangent set is a linear space, and fix an orthonormal base g p ( g 1 , . . . , gm ) T of an arbitrary finite-dimensional subspace. For every g E lin g p select a sub model t r--+ P1,g as used in the statement of the theorems. Each of the submodels t r--+ P1,g is locally asymptotically normal at t 0 by Lemma 25 . 14. Therefore, because the covariance matrix of g p is the identity matrix, =
=
in the sense of convergence of experiments. The function 1/Jn (h)
=
1/J ( P1 1 .fii , F gp ) satisfies
For the same (k x m) matrix the function Agp is the orthogonal projection of 1if P onto lin g p, and it has covariance matrix AA T . Because 1if P is, by definition, contained in the closed linear span of the tangent set, we can choose g p such that 1if P is arbitrarily close to its projection and hence AA T is arbitrarily close to P 1if P 1if �. Under the assumption of the convolution theorem, the limit distribution of the sequence ..jii ( Tn - 1/Jn (h)) under P1 ;.fii, hTgp is the same for every h E �m . By the asymptotic representation theorem, Proposition 7 . 10, there exists a randomized statistic T in the limit experiment such that the distribution of T - Ah under h does not depend on h. By Proposition 8.4, the null distribution of T contains a normal AAT)-distribution as a convolution - T The proof of the convolution theorem is complete upon letting AAT - factor. tend to P ljf p ljf p · Under the assumption that the sequence ..jii ( Tn - 1/J (P) ) is tight, the minimax theorem is proved similarly, by first bounding the left side by the minimax risk relative to the submodel corresponding to gp, and next applying Proposition 8.6. The tightness assumption can be dropped by a compactification argument. (see, e.g., [ 1 39] , or [ 146]). If the tangent set is a convex cone but not a linear space, then the submodel constructed previously can only be used for h ranging over a convex cone in �m . The argument can
N(O,
25.3
367
Tangent Spaces and Information
remain the same, except that we need to replace Propositions 8.4 and 8.6 by stronger results that refer to convex cones. These extensions exist and can be proved by the same Bayesian argument, now choosing priors that flatten out inside the cone (see, e.g., [ 1 39]). If the tangent set is a cone that is not convex, but the estimator sequence is regular, then we use the fact that the matching randomized estimator T in the limit experiment satisfies Eh T Ah + Eo T for every eligible h, that is, every h such that h T g p E Pp . Because the tangent set is a cone, the latter set includes parameters h t h i for t � 0 and directions hi spanning JR.m . The estimator T is unbiased for estimating Ah + Eo T on this parameter set, whence the covariance matrix of T is bounded below by AA T , by the Cramer-Rao inequality. • =
=
Both theorems have the interpretation that the matrix Pl/1 p lfr� is an optimal asymptotic covariance matrix for estimating 1/r(P) given the model P. We might wish that this could be formulated in a simpler fashion, but this is precluded by the problem of superefficiency, as is already the case for the parametric analogues, discussed in Chapter 8. That the notion of asymptotic efficiency used in the present interpretation should not be taken absolutely is shown by the shrinkage phenomena discussed in section 8.8, but we use it in this chapter. We shall say that an estimator sequence is asymptotically efficient at P, if it is regular at P with limit distribution L N(O, Pl/1 p lfr�).t The efficient influence function 1j/ P plays the same role as the normalized score function 1 /9- fe in parametric models. In particular, a sequence of estimators Tn is asymptotically efficient at P if =
(25 .22)
This justifies the name "efficient influence function."
Let the function 1/r : P � JR.k be differentiable at P relative to the tangent cone PP with efficient influence function 1j/ p · A sequence of estimators Tn is regular at P with limiting distribution N (0, P1j/ p 1j/�) if and only if it satisfies (25 .22). 25.23
Lemma.
Because the submodels t � Pr, g are locally asymptotically normal at t 0, "if" follows with the help of Le Cam's third lemma, by the same arguments as for the analogous result for parametric models in Lemma 8. 14. To prove the necessity of (25 .22), we adopt the notation of the proof of Theorem 25 .20. The statistics Sn 1/r (P) + n - 1 L_ 7= 1 1j/ p ( Xi ) depend on P but can be considered a true estimator sequence in the local subexperiments. The sequence Sn trivially satisfies (25 .22) and hence is another asymptotically efficient estimator sequence. We may assume for sim plicity that the sequence ,Jii ( Sn - 1/r(P1 ; ..;n, g ) , Tn - 1/r(P1 ; ..;n, g )) converges under every local parameter g in distribution. Otherwise, we argue along subsequences, which can be
Proof
=
=
t
If the tangent set is not a linear space, then the situation becomes even more complicated. If the tangent set is a convex cone, then the minimax risk in the left side of Theorem 25.21 cannot fall below the normal risk on the right side, but there may be nonregular estimator sequences for which there is equality. If the tangent set is not convex, then the assertion of Theorem 25 .21 may fail. Convex tangent cones arise frequently ; fortunately, nonconvex tangent cones are rare.
368
Semiparametric Models
selected with the help of Le Cam's third lemma. By Theorem 9.3, there exists a match ing randomized estimator (S, T) = (S, T) (X, U) in the normal limit experiment. By the efficiency of both sequences Sn and Tn , the variables S - Ah and T - Ah are, under h, marginally normally distributed with mean zero and covariance matrix P -[f p lf� - In partic ular, the expectations Eh S = Eh T are identically equal to Ah . Differentiate with respect to h at h = 0 to find that
It follows that the orthogonal projections of S and T onto the linear space spanned by the coordinates of X are identical and given by OS = D T = AX, and hence
(The inequality means that the difference of the matrices on the right and the left is nonnegative-definite.) We have obtained this for a fixed orthonormal set g p = (g 1 , . . . , gm ) If we choose g p such that AA T is arbitrarily close to P -[f p lf� , equivalently Cov0 D T = AA T = Covo OS is arbitrarily close to Covo T = P {f p lf� = Covo S, and then the right side of the preceding display is arbitrarily close to zero, whence S - T � 0. The proof is complete on noting that .Jfi(Sn - Tn ) � S - T . • 25.24 Example (Empirical distribution). The empirical distribution is an asymptotically efficient estimator if the underlying distribution P of the sample is completely unknown. To give a rigorous expression to this intuitively obvious statement, fix a measurable func tion f : X 1-+ JR. with Pf 2 oo, for instance an indicator function f = 1 A , and consider JP!n f = n - 1 '£7= 1 f(XJ as an estimator for the function 1/r (P) = Pf. In Example 25. 16 it is seen that the maximal tangent space for the nonparametric model is equal to the set of all g E L 2 (P) such that Pg = 0. For a general function f, the parameter 1/r may not be differentiable relative to the maximal tangent set, but it is differentiable relative to the tangent space consisting of all bounded, measurable functions g with P g = 0. The closure of this tangent space is the maximal tangent set, and hence working with this smaller set does not change the efficient influence functions. For a bounded function g with P f = 0 we can use the submodel defined by dP1 = ( 1 + tg) dP, for which 1fr ( P1) = Pf + t Pfg. Hence the derivative of 1/r is the map g 1-+ Vr P g = Pfg, and the efficient influence function relative to the maximum tangent set is the function 1f p = f - Pf. (The function f is an influence function; its projection onto the mean zero functions is f - Pf.) The optimal asymptotic variance for estimating P 1-+ Pf is equal to P 1f � = P (j Pf) 2 . The sequence of empirical estimators I¥n f is asymptotically efficient, because it satisfies (25.22), with the o p (I) -remainder term identically zero. D <
25.4
Efficient Score Functions
A function 1/r (P) of particular interest is the parameter 8 in a semiparametric model { Pe , 11 : 8 E 8 , 17 E H}. Here 8 is an open subset of IR.k and H is an arbitrary set, typically of infinite dimension. The information bound for the functional of interest 1/r (Pe , 11) = 8 can be conveniently expressed in an "efficient score function."
25.4 Efficient Score Functions
Pe +ta , 111 ,
369
As submodels, we use paths of the form t � for given paths t � rJr in the parameter set H. The score functions for such submodels (if they exist) typically have the form of a sum of "partial derivatives" with respect to e and rJ. If is the ordinary score function for e in the model in which rJ is fixed, then we expect a log = ay · + g. at l t = O The function g has the interpretation of a score function for rJ if e is fixed and runs through an infinite-dimensional set if we are concerned with a "true" semiparametric model. We refer to this set as the tangent set for rJ , and denote it by 17 PPe , ry . The parameter e + ta is certainly differentiable with respect to t in the ordinary sense but is, by definition, differentiable as a parameter on the model if and only if there exists a function 1f such that . a T · + g) Pe ,ry ' a E �k , g E 17 pPe , ry . a = e� t =0
fe, 17
d Pe +ra , 111 le, 11
1/r (Pe +ra ,11J = e, 11 It 1/r(Pe +ra , 11) = ( 1/r e, 11 , a le, 11 Setting a = 0, we see that 1f e,11 must be orthogonal to the tangent set 17 PPe ,ry for the nuisance p�ameter. Define De, 17 as the orthogonal projection onto the closure of the linear span of 17 PPe, ry in L 2 (Pe, 11 ) . The function defined by le . 11 = fe, 11 - De, 11 fe, 11 is called the efficient score function for e , and its covariance matrix l e, 11 = Pe, 11 le, 11 l�. 11 is the efficient information matrix. -
u
25.25
t�
TJr
e,11
Suppose that for every a E �k and every g E 11PPe ,ry there exists a path in H such that Lemma.
l+/2ta , 17r - dP()l,/217 dP [ () J t
1 - l (a T
] 2 fe , 17 + g) dP�.�2
---+
0.
Pe, 11 relative to 1/r e, 11 = Ie, 11 le, 11 • = le, 11 The given set PPe ,ry is a tangent set by assumption. The function 1/r is differentiable (Pe, 11 ) =
If l is nonsingular, then the functional lfr e is differentiable at the tangent set PPe ,ry lin + 17 PPe ,ry with efficient influence function Proof
(25 .26)
.
.
.
-
--
1
-
with respect to this tangent set because
The last equality follows, because the inner product of a function and its orthogonal projection is equal to the square length of the projection. Thus, we may replace by •
le, 17 •
fe, 11
Consequently, an estimator sequence is asymptotically efficient for estimating e if
370
Semiparametric Models
This equation is very similar to the equation derived for efficient estimators in parametric models in Chapter 8. It differs only in that the ordinary score function ie,ry has been replaced by the efficient score function (and similarly for the information). The intuitive explanation is that a part of the score function for 8 can also be accounted for by score functions for the nuisance parameter 17· If the nuisance parameter is unknown, a part of the information for e is "lost," and this corresponds to a loss of a part of the score function. Example (Symmetric location). Suppose that the model consists of all densities 17 (x - 8) with e E lR and the "shape" 17 symmetric about 0 with finite Fisher informat
25.27
x f-+ ion for location Iry . Thus, the observations are sampled from a density that is symmetric about e . B y the symmetry, the density can equivalently be written as 17 (1x - 8 1 ) . It follows that any score function for the nuisance parameter 17 is necessarily a function of l x - 8 1 . This suggests a tangent set containing functions of the form a (17 ' /lJ) (x - 8) + b ( l x - 8 1) . It is not hard to show that all square-integrable functions of this type with mean zero occur as score functions in the sense of (25 .26). t A symmetric density has an asymmetric derivative and hence an asymmetric score func tion for location. Therefore, for every b, 17' Ee , ry - (X - 8) b ( I X - 8 1 ) = 0 . 17 Thus, the projection of the 8-score onto the set of nuisance scores is zero and hence the efficient score function coincides with the ordinary score function. This means that there is no difference in information about 8 whether the form of the density is known or not known, as long as it is known to be symmetric. This surprising fact was discovered by Stein in 1956 and has been an important motivation in the early work on semiparametric models. Even more surprising is that the information calculation is not misleading. There exist estimator sequences for e whose definition does not depend on 17 that have asymptotic variance 1;; 1 under any true 17· See section 25 .8. Thus a symmetry point can be estimated as well if the shape is known as if it is not, at least asymptotically. 0 25.28 Example (Regression). Let g8 be a given set of functions indexed by a parameter e E JE.k , and suppose that a typical observation (X, Y) follows the regression model
Y = ge (X) + e ,
E(e I X) = 0 .
This model includes the logistic regression model, for g8 (x) = 1 /( 1 + e _ e r x ) . It is also a version of the ordinary linear regression model. However, in this example we do not assume that X and e are independent, but only the relations in the preceding display, apart from qualitative smoothness conditions that ensure existence of score functions, and the existence of moments. We shall write the formulas assuming that (X, e) possesses a density 17· Thus, the observation (X, Y) has a density 17 (x , y - ge (x)) , in which 17 is (essentially) only restricted by the relations J e17(x , e) de = 0. Because any perturbation 171 of 17 within the model must satisfy this same relation J e171 (x , e) de = 0, it is clear that score functions for the nuisance parameter 17 are functions t
That no other functions can occur is shown in, for example, [8, p. 56-57] but need not concern us here.
25. 5 Score and Information Operators
37 1
a (x , y - ge (x) ) that satisfy E ( ea (X, e) I X )
_
-
J ea (X, e) rJ(X, e) de 0 rJ(X, e) de _
f
-
.
By the same argument as for nonparametric models all bounded square-integrable functions of this type that have mean zero are score functions. Because the relation E ( ea (X, e) I X ) = 0 is equivalent to the orthogonality in L 2 (rJ) of a(x , e) to all functions of the form eh(x), it follows that the set of score functions for rJ is the orthocomplement of the set eH, of all functions of the form (x , y) 1-+ (y - g8 (x))h (x) within L 2 (Pe , TJ), up to centering at mean zero. Thus, we obtain the efficient score function for e by projecting the ordinary score function fe, TJ (x , y) = - 1}2 / rJ (x , e)ge (x) onto eH. The projection of an arbitrary function b(x , e) onto the functions eH is a function eh0(x) such that Eb(X, e)eh(X) = Eeh0 (X)eh(X) for all measurable functions h. This can be solved for ho to find that the projection operator takes the form E ( b (X , e)e I X ) Il eHb(X, e) = e E (e 2 I X) This readily yields the efficient score function ·
ege (X) J rJ2 (X, e)e de (Y - ge (X))ge (X) . = fe, TJ (X, Y) = - E(e 2 X) rJ(X, e) de E(e 2 1 X) 1 The efficient information takes the form l e,TJ = E (ge gJ (X) jE(e 2 I X) ) . D
f
25.5
Score and Information Operators
The method to find the efficient influence function of a parameter given in the preceding section is the most convenient method if the model can be naturally partitioned in the parameter of interest and a nuisance parameter. For many parameters such a partition is impossible or, at least, unnatural. Furthermore, even in semiparametric models it can be worthwhile to derive a more concrete description of the tangent set for the nuisance parameter, in terms of a "score operator." Consider first the situation that the model P = {PTJ : rJ E H} is indexed by a parameter rJ that is itself a probability measure on some measurable space. We are interested in estimating a parameter of the type 1/f (PTJ) = x (rJ) for a given functiorr x : H 1-+ �k on the model H . The model H gives rise to a tangent set if TJ at rJ . If the map rJ 1-+ PTJ is differentiable in an appropriate sense, then its derivative maps every score b E if TJ into a score g for the model P. To make this precise, we assume that a smooth parametric submodel t 1-+ rJt induces a smooth parametric submodel t 1-+ PTJ, , and that the score functions b of the submodel t 1-+ rJ1 and g of the submodel t 1-+ PTJt are related by Then ATJ ii TJ is a tangent set for the model P at PTJ . Because ATJ turns scores for the model H into scores for the model P it is called a score operator . It is seen subsequently here that if rJ
Semiparametric Models
372
and PrJ are the distributions of an unobservable Y and an observable X = m (Y), respectively, then the score operator is a conditional expectation. More generally, it can be viewed as a derivative of the map M- Pry . We assume that Ary, as a map ArJ : lin if rJ c L 2 (17) M- L 2 (PrJ), is continuous and linear. Next, assume that the function M- x (17) is differentiable with influence function XrJ relative to the tangent set if rJ · Then, by definition, the function ljr ( PrJ) = x (17) is pathwise differentiable relative to the tangent set Pp71 = ArJif rJ if and only if there exists a vector valued function 1jr P such that
1J
1J
'I
This equation can be rewritten in terms of the adjoint score operator A � : L 2 (PrJ) M- lin H rJ · By definition this satisfies (h, ArJb) p71 = (A � h, b) rJ for every h E L 2 (Pry) and b E if rJ · t The preceding display is equivalent to (25.29) We conclude that the function ljr ( PrJ) = x (1J) is differentiable relative to the tangent set Pp71 = ArJ if rJ if and only if this equation can be solved for 1jr P" ; equivalently, if and only if XrJ is contained in the range of the adjoint A � . Because A � is not necessarily onto lin if rJ , not even if it is one-to-one, this i s a condition. For multivariate functionals (25 .29) is to be understood coordinate-wise. Two solutions 1jr p71 of (25.29) can differ only by an element of th.e kernel N(A � ) of A � , which is the orthocomplement R(ArJ)l_ of the range of ArJ : lin H rJ M- L 2 (Pry ) . Thus, there is at most one solution 1jr P that is contained in R(Ary) = lin Ary if rJ ' the closure of the range of Ary, as required. If Xry is contained in the smaller range of A � ArJ, then (25 .29) can be solved, of course, and the solution can be written in the attractive form 'I
(25.30) Here A � Ary is called the information operator, and (A � ArJ)- is a "generalized inverse." (Here this will not mean more than that b = (A � ArJ)- XrJ is a solution to the equation A � ArJb = j(rJ .) In the preceding equation the operator A � Ary performs a similar role as the matrix X T X in the least squares solution of a linear regression model. The operator ArJ (A � ArJ) - 1 A � (if it exists) is the orthogonal projection onto the range space of Ary. So far we have assumed that the parameter is a probability distribution, but this is not necessary. Consider the more general situation of a model P = { PrJ E H} indexed by a parameter 17 running through an arbitrary set H . Let IHiry be a subset of a Hilbert space that indexes "directions" b in which can be approximated within H . Suppose that there exist continuous, linear operators ArJ : lin IH!rJ M- L 2 (PrJ) and XrJ : lin IH!rJ M- lRk , and for every b E IH!rJ a path t M- 1}1 such that, as t {.. 0, 2 d PrJl 2 d PrJl 2 1 2 1 - - A rJ b dPrJ 1 r 0, xr]b. t 2 t
1J
1J
f[ / - /
t
] ----+
: 1J
X ( 1Jt ) - X ( 1J ) ----+
-----
r-+
Note that we define A � to have range lin if 1 , so that it is the adjoint of A1 : if 1 L2(P1 ) . This is the adjoint of an extension A1 : L2 (ry) L2 (P1 ) followed by the orthogonal projection onto lin if 1 .
r-+
25.5 Score and Information Operators
373
By the Riesz representation theorem for Hilbert spaces, the "derivative" Xry has a repre sentation as an inner product )( ,1 b {X.,P b)IHI� for an element Xry E lin iHI� . The preceding discussion can be extended to this abstract set-up. =
The map 1/J : P 't---+ TR.k given by 1/J ( Pry) = x (17) is differentiable at Pry relative to the tangent set AryiHiry if and only if each coordinate function of Xry is contained in the range of A � : L 2 (Pry) 't---+ lin IHiry. The efficient influence function 1f p� satisfies (25 .29). If each coordinate function of Xry is contained in the range of A � Ary : lin iHiry 't---+ lin lH!ry, then it also satisfies (25 .30). 25.31
Theorem.
By assumption, the set ArylHiry is a tangent set. The map 1/J is differentiable relative to this tangent set (and the corresponding submodels t 't---+ Pry, ) by the argument leading up to (25 .29). •
Proof
The condition (25 .29) is odd. By definition, the influence function Xry is contained in the closed linear span of lH!ry and the operator A � maps L 2 (Pry) into lin iHiry . Therefore, the condition is certainly satisfied if A � is onto. There are two reasons why it may fail to be onto. First, its range R(A � ) may be a proper subspace of lin lH!rJ . Because b _L R(A � ) if and only if b E N(Ary), this can happen only if Ary is not one-to-one. This means that two different directions b may lead to the same score function Ary b , so that the information matrix for the corresponding two-dimensional submodel is singular. A rough interpretation is that the parameter is not locally identifiable. Second, the range space R(A � ) may be dense but not closed. Then for any Xry there exist elements in R(A � ) that are arbitrarily close to Xry , but (25 .29) may still fail. This happens quite often. The following theorem shows that failure has serious consequences. t Suppose that 1J 't---+ x (17) is differentiable with influence function Xry ¢:. R(A � ). Then there exists no estimator sequence for x (7]) that is regular at Pry.
25.32
Theorem.
25.5.1
Semiparametric Models
In a semiparametric model {Pe,ry : 13 E 8 , 17 E H } , the pair (13, 17) plays the role of the single 1J in the preceding general discussion. The two parameters can be perturbed independently, and the score operator can be expected to take the form Ae, ry ( a , b) = a T l' e,11 + Be,ryb. Here Be,rJ : JHI11 't---+ L 2 (Pe,ry ) is the score operator for the nuisance parameter. The domain of the operator Ae,11 : TR.k x lin lH!ry 't---+ L 2 (Pe,IJ) is a Hilbert space relative to the inner product Thus this example fits in the general set-up, with TR.k x JHI11 playing the role of the earlier lHI11 • We shall derive expressions for the efficient influence functions of 13 and 1J. The efficient influence function for estimating 13 is expressed in the efficient score function for 13 in Lemma 25.25, which is defined as the ordinary score function minus its projection t
For a proof, see [ 1 40].
Semiparametric Models
374
onto the score-space for ry. Presently, the latter space is the range of the operator Be,ry · If the operator Be, ry Be, ry is continuously invertible (but in many examples it is not), then the operator Be, ry (Be,ry Be,ry) - 1 Be,ry is the orthogonal projection onto the nuisance score space, and (25.33) This means that b = -(Be,ry Be,ry)- 1 Be. i ·e, ry is a "least favorable direction" in H , for esti mating e . If e is one-dimensional, then the submodel t f---+ Pe + t,ry1 where 17t approaches 17 in this direction, has the least information for estimating t and score function le, ry , at t = 0. A function x ( 17) of the nuisance parameter can, despite the name, also be of interest. The efficient influence function for this parameter can be found from (25 .29). The adjoint of Ae,ry : Rk x IHiry f---+ L 2 (Pe,ry), and the corresponding information operator Ae ,ry Ae,ry : Rk x IHiry f---+ Rk x lin iHiry are given by, with Be,ry : L 2 (Pe,ry f---+ lin iHiry the adjoint of Be,ry,
The diagonal elements in the matrix are the information operators for the parameters e and ry, respectively, the former being just the ordinary Fisher information matrix Ie,ry for e . If 1J f---+ X ( 1J) is differentiable as before, then the function ( e , 1J) f---+ X ( 1J) is differentiable with influence function (0, Xry). Thus, for a real parameter x ( ry ) , equation (25 .29) becomes If l e,ry is invertible and Xry is contained in the range of Be,ry Be,ry, then the solution lf Pe." of these equations is
The second part of this function is the part of the efficient score function for x ( 1J) that is "lost" due to the fact that e is unknown. Because it is orthogonal to the first part, it adds a positive contribution to the variance. 25.5.2
Information Loss Models
Suppose that a typical observation is distributed as a measurable transformation X = m(Y) of an unobservable variable Y . Assume that the form of m is known and that the distribution 17 of Y is known to belong to a class H . This yields a natural parametrization of the distribution Pry of X. A nice property of differentiability in quadratic mean is that it is preserved under "censoring" mechanisms of this type: If t f---+ 1Jr is a differentiable submodel of H, then the induced submodel t f---+ Pry1 is a differentiable submodel of { Pry : 1J E H } . Furthermore, the score function g = A ry b (at t = 0) for the induced model t f---+ Pry1 can be obtained from the score function b (at t = 0) of the model t f---+ 1Jt by taking a conditional expectation:
25.5 Score and Information Operators
375
If we consider the scores b and g as the carriers of information about t in the variables Y with law 1]1 and X with law PrJt , respectively, then the intuitive meaning of the condi tional expectation operator is clear. The information contained in the observation X is the information contained in Y diluted (and reduced) through conditioning. t Lemma. Suppose that {ry1 : 0 < t < 1 } is a collection of probability measures on a measurable space (Y, B) such that for some measurable function b : Y � IR
25.34
For a measurable map m : Y � X let PTJ be the distribution of m(Y) if Y has law 1J and let A TJ b (x) be the conditional expectation of b(Y) given m (Y) = x. Then 2 d pTJl / 2 - dPTJI / 2 1 1 2 , - - A TJ b dPTJ / � 0. t 2
]
f[
f---+
If we consider ATJ as an operator ATJ : L 2 (ry) � L 2 (PTJ), then its adjoint A � : L 2 (PTJ) L 2 (ry) is a conditional expectation operator also, reversing the roles of X and Y,
This follows because, by the usual rules for conditional expectations, EE (g (X) I Y ) b(Y) = Eg(X)b(Y) = Eg(X)E (b(Y) I X ) . In the "calculus of scores" of Theorem 25 .3 1 the adjoint is understood to be the adjoint of ATJ : IHITJ f---+ L 2 ( PTJ) and hence to have range lin IHITJ c L 2 (ry) . Then the conditional expectation in the preceding display needs to be followed by the orthogonal projection onto lin IHITJ . Example (Mixtures). Suppose that a typical observation X possesses a conditional density p(x I z) given an unobservable variable Z = z . If the unobservable Z possesses an unknown probability distribution ry, then the observations are a random sample from the mixture density 25.35
P TJ (x) =
j p(x I z) dry(z) .
This is a missing data problem if we think of X as a function of the pair Y = (X, Z) . A score for the mixing distribution rJ in the model for Y is a function b (z) . Thus, a score space for the mixing distribution in the model for X consists of the functions ATJ b (x ) = ETJ ( b (Z) I X = x ) =
J b(z) p(x I z) dry (z) . J p(x I z) dry (z)
If the mixing distribution is completely unknown, which we assume, then the tangent set if ,1 for rJ can be taken equal to the maximal tangent set { b E L 2 ( rJ) : ryb = 0} . In particular, consider the situation that the kernel p(x I z) belongs to an exponential family, p(x I z) c(z)d(x) exp (zT x ) . We shall show that the tangent set A ,/J TJ is dense =
t
For a proof of the following lemma, see, for example, [ 1 39, pp. 1 88-193].
Semiparametric Models
376
in the maximal tangent set { g E L 2 (P1J) : P!Jg = 0 } , for every 1J whose support contains an interval. This has as a consequence that empirical estimators JPl g, for a fixed squared integrable function g, are efficient estimators for the functional 1fi (1J) = P!Jg. For instance, the sample mean is asymptotically efficient for estimating the mean of the observations. Thus nonparametric mixtures over an exponential family form very large models, which are only slightly smaller than the nonparametric model. For estimating a functional such as the mean of the observations, it is of relatively little use to know that the underlying distribution is a mixture. More precisely, the additional information does not decrease the asymptotic variance, although there may be an advantage for finite On the other hand, the mixture structure may express a structure in reality and the mixing distribution 1J may define the functional of interest. The closure of the range of the operator AIJ is the orthocomplement of the kernel N (A � ) of its adjoint. Hence our claim is proved if this kernel is zero. The equation n
n.
0 A � g(z ) = E (g(X) I Z = z ) = =
j g (x) p(x I z) d v (x)
says exactly that g (X) is a zero-estimator under p (x I z ) . Because the adjoint is defined on L 2 (7]), the equation O = A � g should be taken to mean A � g(Z) = 0 almost surely under 1J. In other words, the display is valid for every z in a set of 1}-measure 1 . If the support of 1J contains a limit point, then this set is rich enough to conclude that g = 0, by the completeness of the exponential family. If the support of 1J does not contain a limit point, then the preceding approach fails. However, we may reach almost the same conclusion by using a different type of scores. The paths 1J t = (1 - ta) 1J + ta1} 1 are well-defined for 0 :::: at :::: 1, for any fixed a :::: 0 and 1J 1 , and lead to scores
( p1J
)
� log p!J, (x) = a IJ1 (x) - 1 . P of lt = O
This is certainly a score in a pointwise sense and can be shown to be a score in the L 2 -sense provided that it is in L 2 (P1J). If g E L 2 (P1J) has P!J g = 0 and is orthogonal to all scores of this type, then 0 = P1J1 g = P!Jg
( �: - 1 ) ,
every 7] 1 .
If the set of distributions { PIJ : 1J E H} is complete, then we can typically conclude that g 0 almost surely. Then the closed linear span of the tangent set is equal to the nonparametric, maximal tangent set. Because this set of scores is also a convex cone, Theorems 25 .20 and 25.21 next show that nonparametric estimators are asymptotically efficient. D =
Example (Semiparametric mixtures). In the preceding example, replace the den sity p(x I z) by a parametric family pe (x I z ) . Then the model pe (x I z) d17 (z) for the un observed data Y = (X, Z) has scores for both 8 and 7J. Suppose that the model t 1---0- 1}1 is differentiable with score b, and that
25.36
25.5 Score and Information Operators
377
Then the function a T £8 (x I z) + b(z) can be shown to be a score function corresponding to the model t f-+ PB + t a Cx I z) dry1 (z) . Next, by Lemma 25.34, the function f (C e (x I z) + b (z)) Pe (x I z) dry (z) C T Ee 17 (a e (X I Z) + b( Z ) I X = x) = f pe (x l z) dry(z) is a score for the model corresponding to observing X only. D '
25.37 Example (Random censoring). Suppose that the time T of an event is only ob served if the event takes place before a censoring time C that is generated independently of T ; otherwise we observe C. Thus the observation X = (Y, .6.) is the pair of transformations Y = T 1\ C and .6. = 1 { T :::; C} of the "full data" (T, C) . If T has a distribution function F and t f-+ F1 is a differentiable path with score function a, then the submodel t f-+ PF,,G for X has score function � , a dF A F' aa (x) = EF (a (T) I X = (y , 8)) = (1 - 8) (y oo ) + 8 a (y) . 1 - F(y) A score operator for the distribution of C can be defined similarly, and takes the form, with G the distribution of C,
fr bdG
, BF, ab(x) = (1 - 8 )b(y) + 8 [ y oo) . 1 - G_ (y) The scores AF, Ga and BF, Gb form orthogonal spaces, as can be checked directly from the formulas, because EA Fa (X) Ba b (X) = F aGb . (This is also explained by the product structure in the likelihood.) A consequence is that knowing G does not help for estimating F in the sense that the information for estimating parameters of the form 1/f (PF , G) = x (F) is the same in the models in which G is known or completely unknown, respectively. To see this, note first that the influence function of such a parameter must be orthogonal to every score function for G, because d / d t 1/f ( PF, G, ) = 0. Thus, due to the orthogonality of the two score spaces, an influence function of this parameter that is contained in the closed linear span of R(AF , G) + R(BF,G) is automatically contained in R(AF, a ) . D Example (Current status censoring). Suppose that we only observe whether an event at time T has happened or not at an observation time C. Then we observe the trans formation X = ( C, 1 { T :::; C} ) = (C, .6.) of the pair (C, T). If T and C are independent with distribution functions F and G , respectively, then the score operators for F and G are given by, with x = (c, 8), F d_ , a dF O , c--'-J -a _ J[--'A F, G a (x) = E F (a (T) I C = c , .6. = 8 ) = ( 1 - 8) fcc oo) + 8 ----' ' 1 - F (c ) F (c) BF , ab(x) = E(b( C ) I C = c, .6. = 8 ) = b (c) . 25.38
These score functions can be seen to be orthogonal with the help of Fubini's theorem. If we take F to be completely unknown, then the set of a can be taken all functions in L 2 (F) with Fa = 0, and the adjoint operator A � G restricted to the set of mean-zero functions in L 2 (PF,G) is given by A �. a h (c) =
1[c, oo)
h (u ,
1) d G (u) +
1[O, c)
h(u,
0) dG (u) .
Semiparametric Models
378
For simplicity assume that the true F and G possess continuous Lebesgue densities, which are positive on their supports. The range of A �. G consists of functions as in the preceding display for functions h that are contained in L 2 (PF, G), or equivalently
J h2 (u, 0) (1 - F) (u) dG (u)
< oo
and
J h2 (u, 1) F(u) dG (u)
< oo .
Thus the functions h(u, 1) and h ( u , 0) are square-integrable with respect to G on any interval inside the support of F. Consequently, the range of the adjoint A� G contains only absolutely continuous functions, and hence (25.29) fails for every parameter x (F) with an influence function XF that is discontinuous. More precisely, parameters x (F) with influence functions that are not almost surely equal under F to an absolutely continuous function. Because this includes the functions 1 [0 ,tl - F (t) , the distribution function F 1---+ x (F) = F(t) at a point is not a differentiable functional of the model. In view of Theorem 25 .32 this means that this parameter is not estimable at ,Jn-rate, and the usual normal theory does not apply to it. On the other hand, parameters with a smooth influence function XF may be differentiable. The score operator for the model PF, G is the sum (a , b) 1---+ AF, Ga + BF, Gb of the score operators for F and G separately. Its adjoint is the map h 1---+ (A�, G h, B} , G h). A parameter of the form (F, G) 1---+ x (F) has an influence function of the form (XF, 0) . Thus, for a parameter of this type equation (25 .29) takes the form
The kernel N(A�, G ) consists of the functions h E L 2 (PF,G) such that h(u, 0) = h (u , 1) almost surely under F and G. This is precisely the range of BF, G, and we can conclude that
Therefore, we can solve the preceding display by first solving A �. G h = XF and next project ing a solution h onto the closure of the range of A F,G · By the orthogonality of the ranges of AF, G and BF, G, the latter projection is the identity minus the projection onto R(BF, G). This is convenient, because the projection onto R(BF, G) is the conditional expectation relative to C. For example, consider a function x ( F ) = Fa for some fixed known, continuously dif ferentiable function a . Differentiating the equation a = A�,G h, we find a' (c) = (h (c , 0) h (c, 1) ) g (c) . This can happen for some h E L 2 (PF,G) only if, for any r such that 0 < F(r) < 1 , fXJ a' 2 ( 1 - F) dG (h (u, 0) - h(u, 1)) 2 (1 - F) (u) dG (u) < oo , g ) r a' 2 (h (u , 0) - h ( u , 1) ) 2 F (u) dG (u) < oo. g F dG fo If the left sides of these equations are finite, then the parameter PF, G 1---+ Fa is differentiable. An influence function is given by the function h defined by ,
( )
( )
h (c ' O) =
100
1'
a' (c) 1 [ r, oo) (c) ' g (c)
and
h (c, 1) = -
a' (c) 1 [ 0 , r ) (c) . g (c)
25.5 Score and Information Operators
379
The efficient influence function is found by projecting this onto R(AF,G), and is given by (h (c, 1 ) - h(c, 0) ) (8 - F(c) ) 1 - F(c) ' F (c) a (c) + (1 - 8) a ' (c) . -8 g (c) g (c) For example, for the mean x (F) = J u dF(u), the influence function certainly exists if the density g is bounded away from zero on the compact support of F. 0 h (c, 8) - EF, c ( h(C, L'l) I C = c)
*25.5.3
Missing and Coarsening at Random
Suppose that from a given vector (Y1 , Y2 ) we sometimes observe only the first coordinate Y1 and at other times both Y1 and Y2 . Then Y2 is said to be missing at random (MAR) if the conditional probability that Y2 is observed depends only on Y1 , which is always observed. We can formalize this definition by introducing an indicator variable L\ that indicates whether Y2 is missing (L'l = 0) or observed (L'l = 1). Then Y2 is missing at random if P(L'l = 0 I Y) is a function of Y1 only. If next to P( L\ = 0 I Y) we also specify the marginal distribution of Y, then the distribution of (Y, L'l) is fixed, and the observed data are the function X = ( ¢ (Y, L'l) , L\) defined by (for instance) cp (y , 1 ) = y . c/J (y , 0) = Y I , The tangent set for the model for X can be derived from the tangent set for the model for (Y, L'l) by taking conditional expectations. If the distribution of (Y, L'l) is completely unspecified, then so is the distribution of X, and both tangent spaces are the maximal "nonparametric tangent space". If we restrict the model by requiring MAR, then the tangent set for (Y, L'l) is smaller than nonparametric. Interestingly, provided that we make no further restrictions, the tangent set for X remains the nonparametric tangent set. We shall show this in somewhat greater generality. Let Y be an arbitrary unobservable "full observation" (not necessarily a vector) and let L\ be an arbitrary random variable. The distribution of (Y, L'l) can be determined by specifying a distribution Q for Y and a conditional density r (8 I y) for the conditional distribution of L\ given Y . t As before, we observe X = (¢ ( Y, L'l) , ll) , but now ¢ may be an arbitrary measurable map. The observation X is said to be coarsening at random (CAR) if the conditional densities r (8 I y) depend on x = (¢ (y , 8 ) , 8 ) only, for every possible value (y , 8). More precisely, r (8 1 y) is a measurable function of x . Example (Missing at random). If L\ E {0, 1 } the requirements are both that P(L'l = 0 I Y = y) depends only on ¢ (y , 0) and 0 and that P(L'l = 1 1 Y = y) depends only on cp (y , 1) and 1 . Thus the two functions y M- P(L'l = 0 I Y = y) and y M- P(L'l = 1 1 Y = y) may be different (fortunately) but may depend on y only through cp (y, 0) and ¢ (y , 1), respectively. If ¢ (y , 1) = y, then 8 = 1 corresponds to observing y completely. Then the require ment reduces to P( L\ = 0 I Y = y) being a function of ¢ (y , 0) only. If Y = (Y1 , Y2 ) and ¢ (y , 0) = y 1 , then CAR reduces to MAR as defined in the introduction. 0
25.39
t
The density is relative to a dominating measure v on the sample space for /';., and we suppose that (8 , y) r--+ r (8 I y) is a Markov kernel.
380
Semiparametric Models
Denote by Q and R the parameter spaces for the distribution Q of Y and the kernels r (8 I y) giving the conditional distribution of 6 given Y , respectively. Let Q x R = ( Q x R : Q E Q, R E R) and P = (PQ, R : Q E Q, R E R) be the models for (Y, 6) and X, respectively. Suppose that the distribution Q of Y is completely unspecified and the Markov kernel r (8 I y) is restricted by CAR, and only by CAR. Then there exists a tangent set PPQ . R for the model P = (PQ, R : Q E Q, R E R) whose closure consists of all mean-zero junctions in L 2 (PQ, R). Furthermore, any element of PPQ . R can be orthogonally decompo sed as 25.40
Theorem.
EQ, R (a (Y) I X = x ) + b (x) , where a E Q Q and b E R R. The functions a and b range exactly over the functions a E L 2 ( Q) with Qa = 0 and b E L 2 (PQ, R) with ER (b (X) I Y ) = 0 almost surely, respectively.
Fix a differentiable submodel t 1---7 Q1 with score a. Furthermore, for every fixed y fix a differentiable submodel t 1---7 r1 ( I y) for the conditional density of 6 given Y = y with score b0 (8 I y) such that
Proof
•
ff [ r,' I' (O I y) � r ' I' (O I y) - � bo (O I y)r ' I' (O I y) r dv (O) d Q (y) -+ 0.
Because the conditional densities satisfy CAR, the function b0 (8 I y) must actually be a function b(x) of x only. Because it corresponds to a score for the conditional model, it is further restricted by the equations J bo (8 1 y) r (8 1 y) dv(8) = ER ( b (X) I Y = y ) = 0 for every y . Apart from this and square integrability, b0 can be chosen freely, for instance bounded. By a standard argument, with Q x R denoting the law of (Y, 6) under Q and r , 2 (d Q 1 X R 1 ) 1 1 2 - (d Q X R ) l / 2 1 2 1 / - 2 (a (y ) + b (x) ) (d Q x R ) -+ 0. t
f[
]
Thus a (y) + b(x) is a score function for the model of (Y, 6), at Q x R. By Lemma 25.34 its conditional expectation EQ, R (a (Y) + b(X) I X = x ) is a score function for the model of X. This proves that the functions as given in the theorem arise as scores. To show that the set of all functions of this type is dense in the nonparametric tangent set, suppose that some function g E L 2 (PQ, R) is orthogonal to all functions EQ , R (a (Y) I X = x ) + b (x). Then EQ.Rg(X)a(Y) = EQ, Rg (X)EQ, R (a (Y) I X ) = 0 for all a . Hence g is orthogonal to all functions of Y and hence is a function of the type b. If it is also orthogonal to all b, then it must be 0. • The interest of the representation of scores given in the preceding theorem goes beyond the case that the models Q and R are restricted by CAR only, as is assumed in the theorem. It shows that, under CAR, any tangent space for P can be decomposed into two orthogonal pieces, the first part consisting of the conditional expectations EQ,R (a (Y) I X ) of scores a for the model of Y (and their limits) and the second part being scores b for the model R
25. 5 Score and Information Operators
381
describing the "missingness pattern." CAR ensures that the latter are functions of x already and need not be projected, and also that the two sets of scores are orthogonal. By the product structure of the likelihood q (y)r (8 I y), scores a and b for q and r in the model Q x R are always orthogonal. This orthogonality may be lost by projecting them on the functions of x, but not so under CAR, because b is equal to its projection. In models in which there is a positive probability of observing the complete data, there is an interesting way to obtain all influence functions of a given parameter PQ, R f-+ x ( Q) . Let C be a set of possible values of fj. leading t o a complete observation, that is, ¢ (y , 8) y whenever 8 E C, and suppose that R (C I y) PR (fj. E C I Y y) is positive almost surely. Suppose for the moment that R is known, so that the tangent space for X consists only of functions of the form EQ , R (a (Y) I X ) . If X Q ( Y ) is an influence function of the parameter Q f-+ x ( Q) on the model Q, then . 1 {8 E C} . 1/1 Po. R (x ) R (C I y) X Q ( Y ) is an influence function for the parameter 1/r (PF, G) x (Q) on the model 'P . To see this, first note that, indeed, it is a function of x, as the indicator 1 { 8 E C} is nonzero only if (y , 8) x. Second, 1 { /j. E C} EQ , R XQ (Y) a (Y) R (C I Y) EQ , RXQ (Y) a (Y ) . =
=
=
=
=
=
The influence function we have found is just one of many influence functions, the other ones being obtainable by adding the orthocomplement of the tangent set. This particular influence function corresponds to ignoring incomplete observations altogether but reweighting the influence function for the full model to eliminate the bias caused by such neglect. Usually, ignoring all partial observations does not yield an efficient procedure, and correspondingly this influence function is usually not the efficient influence function. All other influence functions, including the efficient influence function, can be found by adding the orthocomplement of the tangent set. An attractive way of doing this is: - by varying XQ over all possible influence functions for Q f-+ x (Q), combined with - by adding all functions b (x ) with ER ( b(X) I Y ) 0. This is proved in the following lemma. We still assume that R is known; if it is not, then the resulting functions need not even be influence functions. =
Suppose that the parameter Q f-+ x (Q) on the model Q is differentiable at Q, and that the conditional probability R (C I Y) = P(fj. E C I Y) of having a complete observation is bounded away from zero. Then the parameter PQ, R f-+ x ( Q) on the model (PQ, R : Q E Q) is differentiable at PQ, R and any of its influence functions can be written in the form
25.41
Lemma.
1 {8 E C} . Q ( ) + b (x ) , R (C I y) X Y for ;:( Q an influence function of the parameter Q f-+ x ( Q) on the model Q and a function b E L 2 (PQ, R) satisfying ER (b (X) I Y ) = 0. This decomposition is unique. Conversely, every function of this form is an influence function.
Semiparametric Models
382
The function in the display with b = 0 has already been seen to be an influence function. (Note that it is square-integrable, as required.) Any function b (X) such that ER ( b ( X ) Y ) = 0 satisfies E Q . Rb ( X )E Q,R (a (Y) X ) = 0 and hence is orthogonal to the tangent set, whence it can be added to any influence function. To see that the decomposition is unique, it suffices to show that the function as given in the lemma can be identically zero only if x Q = 0 and b = 0. If it is zero, then its conditional expectation with respect to Y, which is XQ• is zero, and reinserting this we find that b = 0 as well. Conversely, an arbitrary influence function � PQ R of P Q, R f---+ x ( Q) can be written in the form . 1 {8 E C} . 1 {8 E C} . 1/f PQ.R (x ) = R (C y) x ( y ) + 1/f PQ.R (x ) - R (C y) x ( y ) .
Proof
I
I
I
I
[·
I
]
For x (Y) = ER ( � PQ , R (X) Y ) , the conditional expectation of the part within square brackets with respect to Y is zero and hence this part qualifies as a function b. This function x is an influence function for Q f---+ x ( Q) , as follows from the equality E Q , R E R ( � PQ.R (X) Y ) a (Y) = E Q,R � PQ , R ( X )E Q,R (a (Y) for every a . •
I x)
I
Even though the functions x Q and b in the decomposition given in the lemma are uniquely determined, the decomposition is not orthogonal, and (even under CAR) the decomposition does not agree with the decomposition of the (nonparametric) tangent space given in Theorem 25 .40. The second term is as the functions b in this theorem, but the leading term is not in the maximal tangent set for Q. The preceding lemma is valid without assuming CAR. Under CAR it obtains an inter esting interpretation, because in that case the functions b range exactly over all scores for the parameter r that we would have had if R were completely unknown. If R is known, then these scores are in the orthocomplement of the tangent set and can be added to any influence function to find other influence functions. A second special feature of CAR is that a similar representation becomes available in the case that R is (partially) unknown. Because the tangent set for the model (PQ,R : Q E Q, R E R) contains the tangent set for the model (PQ,R : Q E Q) in which R is known, the influence functions for the bigger model are a subset of the influence functions of the smaller model. Because our parameter x (Q) depends on Q only, they are exactly those influence functions in the smaller model that are orthogonal to the set R PPQ.R of all score functions for R . This is true in general, also without CAR. Under CAR they can be found by subtracting the projections onto the set of scores for R . Suppose that the conditions of the preceding lemma hold and that the ta"! gent space PPQ R for the model (PQ, R : Q E Q, R E R) is taken to be the sum q PPQ.R + . R PpQ R of tangent spaces of scores for Q and R separately. If Q PPQ.R and R PPQ.R are orthogonal, in particular under CAR, any influence function of P Q, R f---+ x ( Q) for the model (PQ,R : Q E Q, R E R) can be obtained by_!!!_ki"!g the functions given by the preceding lemma and subtracting their projection onto lin R PPQ . R .
25.42
Corollary.
The influence functions for the bigger model are exactly those influence functions for the model in which R is known that are orthogonal to R P PQ R . These do not change Proof
383
25. 5 Score and Information Operators
by subtracting their projection onto this space. Thus we can find all influence functions as claimed. If the score spaces for Q and R are orthogonal, then the projection of an influence function onto lin R PPQ . R is orthogonal to Q PPQ . R , and hence the inner products with elements of this set are unaffected by subtracting it. Thus we necessarily obtain an influence function. • The efficient influence function 1jr pQ , R is an influence function and hence can be written in the form of Lemma 25.41 for some XQ and b. By definition it is the unique influence function that is contained in the closed linear span of the tangent set. Because the parameter of interest depends on Q only, the efficient influence function is the same (under CAR or, more generally, if QPPQ . R ..l R PPQ . R ) , whether we assume R known or not. One way of finding the efficient influence function is to minimize the variance of an arbitrary influence function as given in Lemma 25.4 1 over XQ and b. 25.43 Example (Missing at random). In the case of MAR models there is a simple rep resentation for the functions b (x) in Lemma 25.4 1 . Because MAR is a special case of CAR, these functions can be obtained by computing all the scores for R in the model for ( Y, �) under the assumption that R is completely unknown, by Theorem 25 .40. Suppose that � takes only the values 0 and 1 , where 1 indicates a full observation, as in Example 25.39, and set n (y) : P(� 1 1 Y y) . Under MAR n (y) is actually a function of ¢ (y, 0) only. The likelihood for (Y, �) takes the form q (y)r (8 1 y) q (y)n (y) 8 (1 - n (y)) 1 - 8 . =
=
=
=
Insert a path n1 n + tc , and differentiate the log likelihood with respect to t at t 0 to obtain a score for R of the form 8 1 -8 8 - n (y) c(y) c(y) . -c(y) n (y) 1 - n (y) n (y) (l - n ) (y) To remain within the model the functions n1 and n, whence c, may depend on y only through ¢ (y , 0) . Apart from this restriction, the preceding display gives a candidate for b in Lemma 25.41 for any c, and it gives all such b. Thus, with a slight change of notation any influence function can be written in the form 8 . 8 - n (y) Q (Y) c(y). X n (y) n (y) One approach to finding the efficient influence function in this case is first to minimize the variance of this influence function with respect to c and next to optimize over x Q . The first step of this plan can be carried out in general. Minimizing with respect to c is a weighted least-squares problem, whose solution is given by =
=
=
c (Y)
=
EQ, R ( X Q (Y) 1 ¢ (Y, O) ) .
To see this it suffices to verify the orthogonality relation, for all c, 8 . 8 - n (y) 8 - n (y) c(y) . Q (Y ) c(y) ..l X n (y) n (y) n (y) Splitting the inner product of these functions on the first minus sign, we obtain two terms, both of which reduce to EQ,RXQ (Y)c(Y) ( 1 - n ) (Y) / n (Y) . 0 _
3 84
Semiparametric Models 25.6
Testing
The problem oftesting a null hypothesis H0 : 'lj! (P) ::::; O versus the alternative H1 : 'lj! ( P ) 0 is closely connected to the problem of estimating the function 1jJ (P). It ought to be true that a test based on an asymptotically efficient estimator of 'lj! (P) is, in an appropriate sense, asymptotically optimal. For real-valued parameters 1jJ (P) this optimality can be taken in the absolute sense of an asymptotically (locally) uniformly most powerful test. With higher dimensional parameters we run into the same problem of defining a satisfactory notion of asymptotic optimality as encountered for parametric models in Chapter 1 5 . We leave the latter case undiscussed and concentrate on real-valued functionals 1jJ : P r--+ 1Ft Given a model P and a measure P on the boundary of the hypotheses, that is, 1jJ (P) 0, we want to study the "local asymptotic power" in a neighborhood of P . Defining a local power function in the present infinite-dimensional case is somewhat awkward, because there is no natural "rescaling" of the parameter set, such as in the Euclidean case. We shall utilize submodels corresponding to a tangent set. Given an element g in a tangent set Pp , let t r--+ P1,g be a differentiable submodel with score function g along which 1jJ is differentiable. For every such g for which � p g P 1(; p g 0, the submodel P1,g belongs to the alternative hypothesis H1 for (at least) every sufficiently small, positive t, because 'lj! (P1,g) t P 1(; p g + o(t) if 'lj! (P) 0. We shall study the power at the alternatives Ph ; .jii, g · >
=
=
=
>
=
25.44 Theorem. Let thefunctional 'lj! : P r--+ JR. be differentiable at P relative to the tangent space Pp with efficient influence function 1(; P· Suppose that 1jJ (P) 0. Then for every sequence of power functions P r--+ nn (P) of level-a tests for Ho : 1jJ (P) ::::; 0, and every g E PP with P 1(; p g 0 and every h 0, =
>
>
This theorem is essentially Theorem 15.4 applied to sufficiently rich submodels. Because the present situation does not fit exactly in the framework of Chapter 15, we rework the proof. Fix arbitrary h 1 and g 1 for which we desire to prove the upper bound. For notational convenience assume that P g f 1 . Fix an orthonormal base g p ( g 1 , , gm ) T of an arbitrary finite-dimensional subspace of Pp (containing the fixed g 1 ). For every g E lin g p , let t r--+ P1 ,g be a submodel with score g along which the parameter 1jJ is differentiable. Each of the submodels t r--+ P1,g is locally asymptotically normal at t 0 by Lemma 25. 14. The�efore, with sm - 1 the unit sphere of IR.m ,
Proof
=
=
•
•
•
=
in the sense of convergence of experiments. Fix a subsequence along which the limsup in the statement of the theorem is taken for h = h 1 and g g 1 . By contiguity arguments, we can extract a further subsequence along which the functions n n (P�z ; .jii,a r g) converge pointwise to a limit n (h, a) for every (h , a). By Theorem 15. 1 , the function n (h , a) is the power function of a test in the normal limit experiment. If it can be shown that this test is of level a for testing H0 : a T P 1(; p g p 0, then Proposition 15.2 shows that, for every (a , h) =
=
25. 6 Testing
with
a T P l.j!- pgp
>
385
0,
( P{f pg�)g p, gp P{f p g�P{f pg p. gp (h, a) = (h 1 , e 1 ) P{f�. n(P ) n(h 1 , e 1 ), h aT P{f pg p {f P
The orthogonal projection of and has length onto lin is equal to By choosing lin large enough, we can ensure that this length is arbitrarily close to completes the proof, because Choosing lim sup n h 1 f-v'n,g 1 ::::; by construction. To complete the proof, we show that n is of level Fix any 0 and an E such that 0. Then a.
a sm - l
>
<
n. Hence Ph ; y'n,aTg belongs to and n(h, a) = 1im n11 (Ph; y'n,aTg) Thus, the test with power function is of level for testing aT P{f p 8 P continuity it is of level for testing aT P{f pg p 0.
is negative for sufficiently large
H0
::::;
n
H0 :
a
::::;
H0 :
a
a.
<
0. By
•
As a consequence of the preceding theorem, a test based on an efficient estimator for is automatically "locally uniformly most powerful" : Its power function attains the upper bound given by the theorem. More precisely, suppose that the sequence of estimators is asymptotically efficient at and that is a consistent sequence of estimators of its asymptotic variance. Then the test that rejects Ho : 0 for � Za attains the upper bound of the theorem.
1./I(P) Tn
Sn
P
1./I(P) = ,.jii T1 / Sn
Let the functional l.j! : P lR be diffe rentiable at P with (P) 0. Sup pose that the sequence Tn- 2 is regular at P with a N(O, P{f�)-li. mitdistribution. Furthermore, suppose that Sn2 Pl.j! p · Then, for every h � 0 and g Pp, .n --? oo Ph; g ( ,JriTn � ) = 1 - ( - h P{f2pg ) . hm Sn (Pl.j! p ) l /2 By the efficiency of Tn and the differentiability of the sequence ,.jii T1 converges under Ph ; y'n,g to a normal distribution with mean h P {f pg and variance P{f�. 25.45
�
Lemma.
p
-+
1./1
=
E
v'n
·
--
Za
Za
_
1./1 ,
Proof
•
Suppose that the observations are two independent ran , Yn from distribution functions and respectively. To fit this two-sample problem in the present i.i.d. set-up, we pair the two samples and think of (Xi , Yi ) as a single observation from the product measure x on We wish to test the null hypothesis Ho : J ::::; � versus the alternative H1 : J � - The Wilcoxon test, which rejects for large values of J IF11 is asymptotically efficient, rel ative to the model in which and are completely unknown. This gives a different perspective on this test, which in Chapters 14 and 15 was seen to be asymptotically effi cient for testing location in the logistic location-scale family. Actually, this finding is an 25.46
Example (Wilcoxon test). dom samples X 1 , . . . , X and Y1 ,
n
.
.
•
F dG F G
dG11 ,
F G, F G JR2 . F dG
>
Semiparametric Models
386
example of the general principle that, in the situation that the underlying distribution of the observations is completely unknown, empirical-type statistics are asymptotically efficient for whatever they naturally estimate or test (also see Example 25.24 and section 25 .7). The present conclusion concerning the Wilcoxon test extends to most other test statistics. By the preceding lemma, the efficiency of the test follows from the efficiency of the Wilcoxon statistic as an estimator for the function 1/f ( F x G ) = J F dG. This may be proved by Theorem 25.47, or by the following direct argument. The model P is the set of all product measures F x G . To generate a tangent set, we can perturb both F and G . If t f---+ F1 and t f---+ G 1 are differentiable submodels (of the collection of all probability distributions on JR.) with score functions a and b at t = 0, respectively, then the submodel t f---+ F1 x G 1 has score function a (x ) + b (y) . Thus, as a tangent space we may take the set of all square-integrable functions with mean zero of this type. For simplicity, we could restrict ourselves to bounded functions a and b and use the paths dF1 = ( 1 + ta) dF and dG1 = ( 1 + tb) dG . The closed linear span of the resulting tangent set is the same as before. Then, by simple algebra,
:
J
J
� F x G (a , b) = t 1/f ( F1 X G1) 1 t = 0 = ( 1 - G _ )a dF + Fb dG. We conclude that the function (x , y) f---+ ( 1 - G ) (x ) + F (y) is an influence function of 1/f . This i s of the form a (x ) + b (y) but does not have mean zero; the efficient influence function is found by subtracting the mean. The efficiency of the Wilcoxon statistic is now clear from Lemma 25.23 and the asymp totic linearity of the Wilcoxon statistic, which is proved by various methods in Chapters 12, 13, and 20. D
*25.7
_
Efficiency and the Delta Method
Many estimators can be written as functions ¢ (Tn ) of other estimators. By the delta method asymptotic normality of Tn carries over into the asymptotic normality of cp (Tn ), for every differentiable map ¢. Does efficiency of Tn carry over into efficiency of ¢ (Tn ) as well? With the right definitions, the answer ought to be affirmative. The matter is sufficiently useful to deserve a discussion and turns out to be nontrivial. Because the result is true for the functional delta method, applications include the efficiency of the product-limit estimator in the random censoring model and the sample median in the nonparametric model, among many others. If Tn is an estimator of a Euclidean parameter 1/1 ( P) and both ¢ and 1/1 are differentiable, then the question can be answered by a direct calculation of the normal limit distributions. In view of Lemma 25 .23, efficiency of Tn can be defined by the asymptotic linearity (25 .22). By the delta method,
The asymptotic efficiency of cp (Tn ) follows, provided that the function x f---+ cp� ( P ) {r p (x ) is the efficient influence function of the parameter P f---+ ¢ 1/1 (P) . If the coordinates of o
25. 7 Efficiency and the Delta Method
387
1j/ p are contained in the closed linear span of the tangent set, then so are the coordinates of ¢� ( P ) 1j/ p , because the matrix multiplication by ¢� ( P ) means taking linear combinations.
Furthermore, if 1jJ is differentiable at P (as a statistical parameter on the model P) and ¢ is differentiable at 1jJ (P) (in the ordinary sense of calculus), then ¢ o 1/1 ( P1) - ¢ o 1/1 ( P) -+ ¢ 1/r( P ) 1/1 p g = P¢1/r( P ) 1/1 p g. t Thus the function ¢� ( P ) 1j/ p is an influence function and hence the efficient influence func tion. More involved is the same question, but with Tn an estimator of a parameter in a Banach space, for instance a distribution in the space D [ -oo, oo] or in a space l 00 (:F) . The question is empty until we have defined efficiency for this situation. A definition of asymptotic efficiency of Banach-valued estimators can be based on generalizations of the convolution and minimax theorems to general Banach spaces. t We shall avoid this route and take a more naive approach. The dual space ][J)* of a Banach space []) is defined as the collection of all continuous, linear maps d* : ][J) r-+ JR. If Tn is a ][J)-valued estimator for a parameter 1/1 (P) E []), then d* Tn is a real-valued estimator for the parameter d*o/ (P) E JR. This suggests to defining Tn to be asymptotically efficient at P E P if Jn(Tn - 1/I (P) ) converges under P in distribution to a tight limit and d* Tn is asymptotically efficient at P for estimating d* 1jJ ( P), for every d* E [)l* . This definition presumes that the parameters d*o/ are differentiable at P in the sense of section 25 .3. We shall require a bit more. Say that 1jJ : P r-+ []) is differentiable at P relative to a given tangent set Pp if there exists a continuous linear map Vr P : L 2 ( P) 1-+ lDl such that, for every g E Pp and a submodel t r-+ P1 with score function g, . 1/1 (Pr) - 1/1 (P) -+ 1/1 p g . t This implies that every parameter d*o/ : P r-+ JR. is differentiable at P , whence, for every d* E lDl* , there exists a function 1jl p d* : X r-+ JR. in the closed linear span of Pp such that d* 1jJ p ( g ) = P ljl p d* g for every g � Pp . The efficiency of d* Tn for d*o/ can next be understood in ten�'ls of asymptotic linearity of d* Jn(Tn - 1/J (P) ) , as in (25 .22), with influence function 1j/ p d* . To avoid measurability issues, we also allow nonmeasurable functions Tn = Tn (X 1 , . . . , Xn) of the data as estimators in this section. Let both lDl and IE be Banach spaces. 1
·
1
-
-----
Suppose that 1/1 : P r-+ lDl is diffe rentiable at P and takes its values in a subset lDlq, c lDl, and suppose that ¢ : lDlq, c lDl r-+ IE is Hadamard-differentiable at 1jJ (P) tangentially to lin Vr p (Pp ) . Then ¢ o 1/1 : P r-+ IE is differentiable at P. If Tn is a sequence of estimators with values in lDlq, that is asymptotically efficient at P for estimating 1/J (P), then ¢ (Tn ) is asymptotically efficient at P for estimating ¢ o 1/J (P).
25.47
Theorem.
The differentiability of ¢ o 1/1 is essentially a consequence of the chain rule for Hadamard-differentiable functions (see Theorem 20.9) and is proved in the same way. The derivative is the composition ¢� ( P ) o Vr P ·
Proof
t
See for example, Chapter 3 . 1 1 in [ 146] for some possibilities and references.
388
Semiparametric Models
(
First, we show that the limit distribution L of the sequence Jn Tn - 1/1 ( P) ) concentrates on the subspace lin Vt p (Fp). By the Hahn-Banach theorem, for any S c lDl , For a separable set S, we can replace the intersection by a countable subintersection. Be cause L is tight, it concentrates on a separable set S, and hence L gives mass 1 to the left side provided L (d : d* d 0) 1 for every d* as on the right side. This probability is equal to II � d * P II� ) {0} 1 . Now we can conclude that under the assumptions the sequence Jn ¢ (Tn ) - ¢ o 1j; ( P) ) converges in distribution to a tight limit, by the functional delta method, Theorem 20.8. Furthermore, for every e* E JE *
N (O,
=
=
=
(
where, if necessary, we can extend the definition of d* e* cp � ( P ) to all of lDl in view of the Hahn-Banach theorem. Because d* E ][J) * , the asymptotic efficiency of the sequence Tn implies that the latter sequence is asymptotically linear in the influence function 1f p d* . This is also the influence function of the real-valued map e* ¢ o 1jJ , because =
Thus, e*¢ (Tn) is asymptotically efficient at P for estimating e*¢ o 1jJ (P), for every e* E JE * . • The proof of the preceding theorem is relatively simple, because our definition of an efficient estimator sequence, although not unnatural, is relatively involved. Consider, for instance, the case that lDl .f00 ( S) for some set S. This corresponds to estimating a (bounded) function s !---7 1/J ( P ) (s) by a random function s 1---7 Tn (s). Then the "marginal estimators" d* Tn include the estimators lfs Tn Tn (s) for every fixed s - the coordinate projections lfs : d 1---7 d(s) are elements of the dual space .f00 (S)*-, but include many other, more complicated functions of Tn as well. Checking the efficiency of every marginal of the general type d* Tn may be cumbersome. The deeper result of this section is that this is not necessary. Under the conditions of Theorem 17. 14, the limit distribution of the sequence Jn(Tn - 1/J(P) ) in .f00 (S) is determined by the limit distributions of these processes evaluated at finite sets of "times" s 1 , . . . , sk . Thus, we may hope that the asymptotic efficiency of Tn can also be characterized by the behavior of the marginals Tn (s) only. Our definition of a differentiable parameter 1jJ : P 1---7 lDl is exactly right for this purpose. =
=
Theorem (Efficiency in .e= (s)). Suppose that 1jJ : P 1---7 £00 (S) is differentiable at P, and suppose that Tn (s) is asymptotically efficient at P for estimating 1/J (P)( s ), for every s E S. Then Tn is asymptotically efficient at P provided that the sequence Jn ( Tn - 1/J (P) ) converges under P in distribution to a tight limit in .f00 (S). 25.48
The theorem is a consequence of a more general principle that obtains the efficiency of Tn from the efficiency of d* Tn for a sufficient number of elements d* E ][J) * . By definition, efficiency of Tn means efficiency of d* Tn for all d* E ][J) * . In the preceding theorem the efficiency is deduced from efficiency of the estimators lfs T11 for all coordinate projections lfs
25. 7 Efficiency and the Delta Method
389
on .£00(5). The coordinate projections are a fairly small subset of the dual space of .£00(5). What makes them work is the fact that they are of norm 1 and satisfy ll z ll s = sups l ns z l . 25.49 Lemma. Suppose that 1/f : P H- ID> is differentiable at P, and suppose that d' Tn is asymptotically efficient at P for estimating d''t/f ( P) for every d' in a subset ID>' C []) * such that, for some constant C,
ll d ll
:S
C
sup
d' E lll>' , ll d' II :S: I
l d' (d) l .
Then Tn is asymptotically efficient at P provided that the sequence Jn ( Tn - 1/f ( P)) is asymptotically tight under P.
The efficiency of all estimators d'Tn for every d' E ID>' implies their asymptotic linearity. This shows that d'Tn is also asymptotically linear and efficient for every d' E lin ID>'. Thus, it is no loss of generality to assume that ID>' is a linear space. By Prohorov ' s theorem, every subsequence of Jn ( Tn - 1/f ( P)) has a further subsequence that converges weakly under P to a tight limit T. For simplicity, assume that the whole sequence converges; otherwise argue along subsequences. By the continuous-mapping the orem, d* Jn ( Tn - 1/f (P)) converges in distribution to d* T for every d* E ID> * . By the assumption of efficiency, the sequence d* ,Jn ( Tn - 1/f(P)) is asymptotically linear in the influence function 1f p d * for every d* E [])' . Thus, the variable d* T is normally distributed with mean zero and v�riance Plfr � d* for every d* E ID>'. We show below that this is then automatically true for every d* E ID>* . By Le Cam's third lemma (which by inspection of its proof can be seen to be valid for general metric spaces), the sequence Jn ( Tn - 1/f(P)) is asymptotically tight under P1 1 � as well, for every differentiable path t H- Pt . By the differentiability of 1/f, the sequence Jn ( Tn - 1/f (P1 ; � )) is tight also. Then, exactly as in the preceding paragraph, we can conclude that the sequence d* ,Jn ( Tn - 1/f(P1 ;_f! )) converges in distribution to a normal distribution with mean zero and variance P 1/f P d* , for every d* E ID> * . Thus, d* Tn is asymptotically efficient for estimating d* 't/f (P) for every d* E ID> * and hence Tn is asymptotically efficient for estimating 1/f (P), by definition. It remains to prove that a tight, random element T in []) such that d* T has law N ( 0, ll d*t p 11 2 ) for every d* E ID>' necessarily verifies this same relation for every d* E ID>* .t First assume that lD> = .£00 (S) and that [])' is the linear space spanned by all coordinate projections. Because T is tight, there exists a semimetric p on S such that S is totally bounded and almost all sample paths of T are contained in U C (S, p) (see Lemma 1 8 . 1 5). Then automatically the range of Vr P is contained in U C (S, p) as well. To see the latter, we note first that the map s H- ET (s) T ( u) is contained in U C ( S, p ) for every fixed u : If p (sm , tm ) --+ 0, then T (sm ) - T (tm ) --+ 0 almost surely and hence in second mean, in view of the zero-mean normality of T (sm ) - T (tm ) for every m , whence I ET (sm ) T (u) - ET (tm ) T (u) l --+ 0 by the Cauchy-Schwarz inequality. Thus, the map
Proof
S H- Vr p ( lfr p , ) ( S ) = ns Vr p ( lfr p , ) = ( lfr p , , lfr p , rr,
t
rr"
rr,,
rr,
)
P
= E T ( U) T ( S )
The proof of this lemma would be considerably shorter if we knew already that there exists a tight random element T with values in lill such that d* T has a N (0, li d* v;. p 11 . 2 -distribution for every d* E IIll * . Then it suffices to show that the distribution of T is uniquely determined by the distributions of d* T for d* E IIll ' .
�)
Semiparametric Models
390
is contained in the space UC ( S , p ) for every u. By the linearity and continuity of the derivative Vr P , the same is then true for the map s �----+ Vr P (g) (s) for every g in the closed linear span of the gradients 1/r P :rr as u ranges over S. It is even true for every g in the tangent set, because Vr p (g) (s) , � p (Ilg) (s) for every g and s , and Il the projection onto the closure of lin 1/r P , :rr" . By a minor extension of the Riesz representation theorem for the dual space of C (S, p ) , the restriction of a fixed d* E ][J) * to U C (S, p ) takes the form
d*z =
Is z(s) d
fL (S
),
for fL a signed Borel measure on the completion S of S, and z the unique continuous extension of z to S. By discretizing fL, using the total boundedness of S, we can construct a sequence d! in lin {ns : s E S} such that d! � d* pointwise on U C (S, p ). Then d!Vr P � d*VrP pointwise on Pp . Furthermore, d! T � d*T almost surely, whence in distribution, so that d*T is normally distributed with mean zero. Because d! T - d; T � 0 almost surely, we also have that
E(d� T - d: T) 2 = li d� Vr p - d: Vr p 11� , 2 � 0, whence d! Vr p is a Cauchy sequence in L 2 (P). We conclude that d! Vr p � d*Vr p also in norm and E(d! T) 2 = li d! Vr p I I � 2 � ll d*Vr p II� , 2 . Thus, d*T is normally distributed with mean zero and variance ll d*Vr p 1 i � . 2 . This concludes the proof for lDJ equal to f00 (S). A general Banach space lDJ can be ernbed ded in f00 (lDl� ), for ][J)'1 = {d' E lDl' , lid' II _:::: 1 } , by the map d � Zd defined as z d (d ' ) = d'(d). By assumption, this map is a norm homeomorphism, whence T can be considered to be a tight random element in f00 (lDl'1 ) . Next, the preceding argument applies.
•
Another useful application of the lemma concerns the estimation of functionals 1/J (P) = ( 1/1 1 (P), 1/12 (P) ) with values in a product lDJ1 x lDJ2 of two Banach spaces. Even though marginal weak convergence does not imply joint weak convergence, marginal efficiency implies joint efficiency! Suppose that 1/Ji : P �----+ JD) i is diffe rentiable at P, and suppose that Tn,i is asymptotically efficient at P for estimating 1/Ji (P), for i = 1 , 2. Then (Tn , 1 , Tn , 2 ) is asymptotically efficient at P for estimating ( 1/1 1 (P), 1/12 ( P ) ) provided that the sequences Jli(Tn,i - 1/!i ( P ) ) are asymptotically tight in JD)i under P, for i = 1 , 2. 25.50
Theorem (Efficiency in product spaces).
Let lDl' be the set of all maps (d1 , d2 ) �----+ dt ( di ) for dt ranging over lDl7 , and i = 1 , 2. By the Hahn-Banach theorem, ll di ll = sup{ l dt(di ) l : ll dt ll = 1 , dt E lDl; } . Thus, the product norm II Cd 1 , d2 ) II ll d1 ll V ll d2 ll satisfies the condition of the preceding lemma (with C = 1 and equality). •
Proof
=
In section 25 . 1 0. 1 it is seen that the distribution of X ( C 1\ T, 1 { T ::::; C}) in the random censoring model can be any distribution on the sample space. It follows by Example 20. 1 6 that the empirical subdistribution functions IHion and IHI 1 n are asymptotically efficient. By Example 20. 15 the product limit estimator is a Hadamard-differentiable functional of the empirical subdistribution functions. Thus, the product limit-estimator is asymptotically efficient. D
25.51 =
Example (Random censoring).
391
25. 8 Efficient Score Equations 25.8
Efficient Score Equations
The most important method of estimating the parameter in a parametric model is the method of maximum likelihood, and it can usually be reduced to solving the score equations L �=l e() (Xi) = 0, if necessary in a neighborhood of an initial estimate. A natural gener alization to estimating the parameter 8 in a serniparametric model 8 E 8 , 1J E H} is to solve 8 from the n (Xi ) = 0.
{P() , 11 :
efficient score equations _l). i = 1 () , 1),
Here we use the efficient score function instead of the ordinary score function, and we substitute an estimator fi n for the unknown nuisance parameter. A refinement of this method has been applied successfully to a number of examples, and the method is likely to work in many other examples. A disadvantage is that the method requires an explicit form of the efficient score function, or an efficient algorithm to compute it. Because, in general, the efficient score function is defined only implicitly as an orthogonal projection, this may preclude practical implementation. A variation on this approach is to obtain an estimator fi n (8) of 1J for each given value of 8 ' and next to solve 8 from the equation n LC() , l], ( ()) (Xi ) = 0. i=l
If en is a solution, then it is also a solution of the estimating equation in the preceding display, for fi n = fi n (en ) . The asymptotic normality of en can therefore be proved by the same methods as applying to this estimating equation. Due to our special choice of estimating function, the nature of the dependence of fi n (8) on 8 should be irrelevant for the limiting distribution of .Jli(en - 8). Informally, this is because the partial derivative of the estimating equation relative to the 8 inside fi n (8 ) should converge to zero, as is clear from our subsequent discussion of the "no-bias" condition (25.52). The dependence of fi n (8) on 8 does play a role for the consistency of en , but we do not discuss this in this chapter, because the general methods of Chapter 5 apply. For simplicity we adopt the notation as in the first estimating equation, even though for the construction of en the two-step procedure, which "profiles out" the nuisance parameter, may be necessary. In a number of applications the nuisance parameter 7] , which is infinite-dimensional, cannot be estimated within the usual order 0 for parametric models. Then the classical approach to derive the asymptotic behavior of Z-estimators - linearization of the equation in both parameters - is impossible. Instead, we utilize the notion of a Donsker class, as developed in Chapter 19. The auxiliary estimator for the nuisance parameter should satisfy t
(n - 1 12 )
Pe, , i{j,, 1), =o p (n - 1 12 + 11en - 8 11 ) , P(), 1) l le,, l]/1 - l() , ry 1 2 � 0, P(j, , 17 Fe", 1),1 2 = Op (l). t
(25.52) (25.53)
The notation Pf.� is an abbreviation for the integral J f. � (x) dP (x ) . Thus the expectation is taken with respect to x only and not with respect to fj .
392
Semiparametric Models
The second condition (25 .53) merely requires that the "plug-in" estimator le , fi, is a consistent estimator for the true efficient influence function. Because Pe, ,r/9",1J = 0, the first condition (25.52) requires that the "bias" of the plug-in estimator, due to estimating the nuisance parameter, converge to zero faster than 1 j ,Jfi. Such a condition comes out naturally of the proofs. A partial motivation is that the efficient score function is orthogonal to the score functions for the nuisance parameter, so that its expectation should be insensitive to changes in 1J . 25.54 Theorem. Suppose that the model {Pe , ry : 8 E 8} is differentiable i n quadratic mean with respect to e at (8 , 17) and let the efficient information matrix l e, ry be nonsingular. Assume that (25 .52) and (25 .53) hold. Let {jn satisfy ,Jii lPnle",i/" = op ( 1 ) and be consistent for e. Furthermore, suppose that there exists a Donsker class with square-integrable envelope junction that contains every function le, , fi , with probability tending to 1. Then the sequence {j n is asymptotically efficient at (e , 1}).
Let Gn (8', 1} 1 ) = ,Jfi(lPn - Pe,11)fe',1J' be the empirical process indexed by the func tions le',1J'· By the assumption that the functions le, fj are contained in a Donsker class, together with (25 .53),
Proof
Gn ({Jn, � n ) = Gn(8, 17) + o p ( l ) . (see Lemma 1 9 . 24. ) B y the defining relationship of {Jn and the "no-bias" condition (25.52), this is equivalent to
The remainder of the proof consists of showing that the left side is asymptotically equivalent to (l e,11 + o p ( 1 ) ),Jn({Jn - 8), from which the theorem follows. Because l e,11 = Pe,11le,11C �. 11 , the difference of the left side of the preceding display and l e, 11 ,Jn ( {J n - e ) can be written as the sum of three terms: 1 r.: vn l-e, , fi" (Pe1 /2,11 + Pe,111 /2 ) Pe1 /2,11 - Pe,111 /2 ) - 2 (8n - 8) T l. e,11 Pe,111 /2 ] d p,
f
,
[( ,
A
f l-e, , 1), (Pe1,/2,11 - Pe,111 /2) 2l1 . e,11T Pe,111/2 dp, n (8n - 8) - J (le, , r1, - le,11) £ �. 11 Pe,11 d p, y'n(e n - 8). r.: v
+
A
(
The first and third term can easily be seen to be o p ,Jii l fJn - e II ) by applying the Cauchy Schwarz inequality together with the differentiability of the model and (25.53). The square of the norm of the integral in the middle term can for every sequence of constants � oo be bounded by a multiple of m 11
m�
j l le,, 1 , I P�:�� P�:,�11 - P�:�l d p,2 +
J l le,, fi, i 2 (Pe,, ry + Pe, ry ) d p, 1 .
ll ie.n
2 Pe, ry d p, . le, l ry 1 ll > m,
In view of (25.53), the differentiability of the model in e , and the Cauchy-Schwarz inequal ity, the first term converges to zero in probability provided � oo sufficiently slowly m 11
25. 8 Efficient Score Equations
393
to ensure that m n I I B n 8 I � 0. (Such a sequence exists. If Zn � 0, then there exists a sequence e n + 0 such that P( I Zn I en) --+ 0. Then c;: 1 1 2 Zn � 0.) In view of the last part of (25 .53), the second term converges to zero in probability for every m n --+ oo. This concludes the proof of the theorem. • -
>
The preceding theorem is best understood as applying to the efficient score functions le , 11 • However, its proof only uses this to ensure that, at the true value (8 , rJ),
I e , 17 = Pe ,,l- e , 17 f e17 • -
-
·T
The theorem remains true for arbitrary, mean-zero functions le , 17 provided that this identity holds. Thus, if an estimator (B , f]) only approximately satisfies the efficient score equation, then the latter can be replaced by an approximation. The theorem applies to many examples, but its conditions may be too stringent. A modification that can be theoretically carried through under minimal conditions is based on the one-step method. Suppose that we are given a sequence of initial estimators B n that is Jn-consistent for e. We can assume without loss of generality that the estimators are discretized on a grid of meshwidth n - 1 1 2 , which simplifies the constructions and proof. Then the one-step estimator is defined as
The estimator B n can be considered a one-step iteration of the Newton-Raphson algorithm for solving the equation 'L le , fi (X; ) = 0 with respect to e, starting at the initial guess Bn . For the benefit of the simple proof, we have made the estimators Fin i for 17 dependent on the index i . In fact, we shall use only two different values for f] n i , one for the first half of the sample and another for the second half. Given estimators Fin = F! n (X 1 , , Xn) define F! n,i by, with m = Ln/2J , .
.
•
if i m if i :::: m. >
Thus, for X; belonging to the first half of the sample, we use an estimator Fin i based on the second half of the sample, and vice versa. This sample-splitting trick is convenient in the proof, because the estimator of 17 used in le , 17 ( Xi ) is always independent of X; , simultaneously for X; running through each of the two halves of the sample. The discretization of en and the sample-splitting are mathematical devices that rarely are useful in practice. However, the conditions of the preceding theorem can now be relaxed to, for every deterministic sequence Bn = 8 + O (n - 1 1 2 ), (25 .55) (25 .56) Suppose that the model { Pe , 17 : 8 E 8 } is differentiable in quadratic mean with respect to 8 at (8, 1J ) and let the efficient information matrix l e , 17 be nonsingular.
25.57
Theorem.
,
394
Semiparametric Models
Assume that (25 .55) and (25 .56) hold. Then the sequence 1]).
(e,
en is asymptotically efficient at
en = e+
0 (n- 1 1 2 ) . By the sample-splitting, Fix a deterministic sequence of vectors the first half of the sum L (Xi ) is a sum of conditionally independent terms, given the second half of the sample. Thus,
Proof
le,. , fi,.
..
Ee,. , 11 (JnllfDm (le,. , fi,. , - le,. , 11 ) I Xm + l , . . . , Xn ) vare , ( JnllfDm (lel ,fin . • - le,.,1) ) I Xm + l , . . . , Xn ) n
1J
Vr=: ffl
P,e len• 17,n,J ' n ,
rJ
<
Both expressions converge to zero in probability by assumption (25 .55). We conclude that the sum inside the conditional expectations converges conditionally, and hence also unconditionally, to zero in probability. By symmetry, the same is true for the second half of the sample, whence
(en,
We have proved this for the probability under 17), but by contiguity the convergence is also under 17). The second part of the proof is technical, and we only report the result. The condition of differentiabily of the model and (25.56) imply that
(e,
JlllfDn (le" , 11 - le,11) + Jl1Pe, 17 le,17 C ;, 11 (en - e) !,. 0 (see [139] , p. 1 85). Under stronger regularity conditions, this can also be proved by a Taylor expansion of le, 11 in e.) By the definition of the efficient score function as an orthogonal projection, Pe,11 le,11 e ;,11 = le, 11 • Combining the preceding displays, we find that n,
en
In view of the discretized nature of fJ this remains true if the deterministic sequence is replaced by fJ see the argument in the proof of Theorem 5 .48. Next we study the estimator for the information matrix. For any vector h E JR.k , the triangle inequality yields
n;
(en,
By (25.55), the conditional expectation under ry ) of the right side given Xm + l , . . . , Xn converges in probability to zero. A similar statement is valid for the second half of the observations. Combining this with (25.56) and the law of large numbers, we see that
n,
In view of the discretized nature of fJ this remains true if the deterministic sequence replaced by fJ
n.
en is
395
25. 8 Efficient Score Equations
The theorem follows combining the results of the last two paragraphs with the definition of e . •
n
A further refinement is not to restrict the estimator for the efficient score function to be a plug-in type estimator. Both theorems go through if � is replaced by a general estimator provided that this satisfies the appropriately modi I fied conditions of the theorems, and in the second theorem we use the sample-splitting scheme. In the generalization of Theorem 25 .57, condition (25.55) must be replaced by
ln, (} =Cn, (} C X 1 , . . . , Xn),
le
.
(25.58)
The proofs are the same. This opens the door to more tricks and further relaxation of the regularity conditions. An intermediate theorem concerning one-step estimators, but without discretization or sample-splitting, can also be proved under the conditions of The orem 25 .54. This removes the conditions of existence and consistency of solutions to the efficient score equation. The theorems reduce the problem of efficient estimation of 8 to estimation of the efficient score function. The estimator of the efficient score function must satisfy a "no-bias" and a consistency conditions. The consistency is usually easy to arrange, but the no-bias condition, such as (25.52) or the first part of (25.58), is connected to the structure and the size of the model, as the bias of the efficient score equations must converge to zero at a rate faster than l 1 ,Jn. Within the context of Theorem 25.54 condition (25.52) is necessary. If it fails, then the sequence is not asymptotically efficient and may even converge at a slower rate than ,Jn. This follows by inspection of the proof, which reveals the following adaptation of the theorem. We assume that is the efficient score function for the true parameter (8, 17) but allow it to be arbitrary (mean-zero) for other parameters.
Bn
le, TJ
Theorem. Suppose that the conditions ofTheorem 25 .54 hold except possibly con dition (25.52). Then
25.59
JnCB n - 8) = � t,I;� l(} , TJ(Xi ) + JnPen.le". �" + op ( l ) . B ecause by Lemma 25.23 the sequence B n can be asymptotically efficient (regular with
N (0, 1; � )-limit distribution) only if it is asymptotically equivalent to the sum on the right, condition (25 .52) is seen to be necessary for efficiency. The verification of the no-bias condition may be easy due to special properties of the model but may also require considerable effort. The derivative of with respect to 8 ought to converge to a 1 (}8 0. Therefore, condition (25 .52) can usually be simplified to
P(} , TJ l(}, TJ =
P(} , TJ le, fi
The dependence on r, is more interesting and complicated. The verification may boil down to a type of Taylor expansion of in r, combined with establishing a rate of convergence for r, . Because 17 is infinite-dimensional, a Taylor series may be nontrivial. If r, 1J can
Pe,TJ le,fi
-
396
Semiparametric Models
occur as a direction of approach to 'f/ that leads to a score function Bg, 17 (� - ry), then we can write
(25.60)
We have used the fact that Pg, ryl.g,17Bg,17h = 0 for every h, by the orthogonality property of the efficient score function. (The use of Bg, 17 (fj - ry) corresponds to a score operator that yields scores Bg,17h from paths of the form 'fit = 'f/ + th. If we use paths d'f/t = ( 1 + th) dry, then Bg, 17 (d�jdry - 1) is appropriate.) The display suggests that the no-bias condition (25.52) is certainly satisfied if II � - 'f/ I = 0 p (n - 1 1 2 ) , for I I II a norm relative to which the two terms on the right are both of the order o p ( I I � - 'f/ II ) . In cases in which the nuisance parameter is not estimable at .fii-rate the Taylor expansion must be carried into its second order term. If the two terms on the right are both 0 p ( I I � - 'f/ 11 2 ) , then it is still sufficient to have II � - 'fi ll = op (n - 1 14). This observation is based on a crude bound on the bias, an integral in which cancellation could occur, by norms and can therefore be too pessimistic (See [35] for an example.) Special properties of the model may also allow one to take the Taylor expansion even further, with the lower order derivatives vanishing, and then a slower rate of convergence of the nuisance parameter may be sufficient, but no examples of this appear to be known. However, the extreme case that the expression in (25.52) is identically zero occurs in the important class of models that are convex-linear in the parameter. ·
Example (Convex-linear models). Suppose that for every fixed e the model { Pg ,17 : 'f/ E H } is convex-linear : H is a convex subset of a linear space, and the depen dence 'f/ 1---+ Pg , 17 is linear. Then for every pair (ry 1 , ry) and number 0 :=: t :=: 1 , the convex combination 'fit = try 1 + ( 1 - t)ry is a parameter and the distribution t Pg, 171 + ( 1 - t) Pg, 17 = Pg, 17, belongs to the model. The score function at t = 0 of the submodel t �---+ Pg, 171 is 25.61
--
a dPg,171 log dPg,t171 + ( 1 - t ) 17 = - 1. dPg,17 i3t l t = O
Because the efficient score function for e is orthogonal to the tangent set for the nuisance parameter, it should satisfy
(
)
dPg,171 0 = Pg, 17£g, 11 � - 1 = Pg,17 1 £g, 11 •
(i, ry
This means that the unbiasedness conditions in (25 .52) and (25. 55) are trivially satisfied, with the expectations Pg, 17lg,ri even equal to 0. A particular case in which this convex structure arises is the case of estimating a linear functional in an information-loss model. Suppose we observe X = m (Y) for a known function m and an unobservable variable Y that has an unknown distribution 'f/ on a measurable space (Y, A) . The distribution P17 = 'f/ m - 1 of X depends linearly on ry. Furthermore, if we are interested in a linear function e = x (ry), then the nuisance parameter space Hg = { 'f/ : x ( 'f/) = e } is a convex subset of the set of probability measures on (Y, A) . o o
397
25. 8 Efficient Score Equations
25.8.1
Symmetric Location
Suppose that we observe a random sample from a density 17 (x - 8) that is symmetric about e. In Example 25.27 it was seen that the efficient score function for e is the ordinary score function, -
f e ' 17 (x)
'
=
17 - - (x - 8). 1J
We can apply Theorem 25.57 to construct an asymptotically efficient estimator sequence for e under the minimal condition that the density 17 has finite Fisher information for location. First, as an initial estimator en , we may use a discretized Z-estimator, solving JPln 1/f (x 8) 0 for a well-behaved, symmetric function 1/f . For instance, the score function of the logistic density. The ,Jn-consistency can be established by Theorem 5.21 . Second, it suffices to construct estimators ln,e that satisfy (25.58). By symmetry, the variables Ti I Xi - e I are, for a fixed e, sampled from the density 21](s) 1 0} . We use these variables to construct an estimator kn for the function I and next we set =
g(s) g' g,
=
=
{s
>
Because this function is skew-symmetric about the point e , the bias condition in (25.58) is satisfied, with a bias of zero. Because the efficient score function can be written in the form
-
f e, r] (x)
=
g' g
- - ( l x - 8 1) sign(x - 8),
g' g
the consistency condition in (25.58) reduces to consistency of kn for the function I in that
2 (s) g(s) ds � 0. ( kn � ) J
(25.62)
Estimators kn can be constructed by several methods, a simple one being the kernel method of density estimation. For a fixed twice continuously differentiable probability density with compact support, a bandwidth parameter CJ11 , and further positive tuning parameters CYn , f3n , and Yn , set 1 n - T.· --� ' gn ( ) CJn =l Cln
w
s
= -
,Lw ( s ) i
(25.63) Then (25.58) is satisfied provided an t oo , f3n + 0, Yn + O, and CJ11 + O at appropriate speeds. The proof is technical and is given in the next lemma. This particular construction shows that efficient estimators for e exist under minimal conditions. It is not necessarily recommended for use in practice. However, any good initial estimator en and any method of density or curve estimation may be substituted and will lead to a reasonable estimator for e , which is theoretically efficient under some regularity conditions.
398
Semiparametric Models
Let T1 , . . . , Tn be a random sample from a density g that is supported and absolutely continuous on [0, oo) and satisfies j (g ' I _,jg) 2 (s) ds oo. Then kn given by (25.63) for a probability density w that is twice continuously differentiable and supported on [ - 1 , 1 ] satisfies (25.62), if an t oo, Yn � 0, f3n � 0, and CJn � 0 in such a way that CJn ::S Yn • a;CJn l {3; -+ 0, nCJ: {3; -+ 00. 25.64
Lemma.
<
Start by noting that ll g ll oo :::: f l g' (s ) l ds :::: JT;, by the Cauchy-Schwarz inequal ity. The expectations and variances of g n and its derivative are given by
Proof.
� ( s � Tl ) = J g (s - CJy) w (y) dy , s - T1 varg n (s ) = � varw ( ll w ll � , ) � nCJ nCJ CJ Eg� (s) = g� (s ) = J g ' (s - CJy) w (y) dy , (s :;:::
gn (s ) : = Eg n (s ) = E w
::::
y),
B y the dominated-convergence theorem, gn (s ) -+ g (s), for every s > 0. Combining this with the preceding display, we conclude that g n (s) � g (s). If g' is sufficiently smooth, then the analogous statement is true for g� (s ) . Under only the condition of finite Fisher information for location, this may fail, but we still have that g� (s ) - g� (s ) � 0 for every s ; furthermore, g� 1 [ a, oo) -+ g ' in L 1 , because
100 l g� - g' l (s) ds :::: JJ l g' (s - CJ Y) - g' (s ) l ds w (y) dy -+ 0,
by the L 1 -continuity theorem on the inner integral, and next the dominated-convergence theorem on the outer integral. The expectation of the integral in (25.62) restricted to the complement of the set iJ n is equal to
J( � ) \s) g (s) P( I g� l (s ) > a or gn (s )
<
f3 or s
<
y)
ds .
This converges to zero by the dominated-convergence theorem. To see this, note first that P(g n (s ) f3) converges to zero for all s such that g (s) > 0. Second, the probability P( l g� l (s ) > a) is bounded above by 1 { lg� l (s ) > al2 } + o(l), and the Lebesgue measure of the set {s : lg� l (s ) > al2 } converges to zero, because g� -+ g' in L 1 . On the set iJ n the integrand in (25 .62) is the square of the function (g� I g n - g ' I g) g 1 12 . This function can be decomposed as ( g n - gn ) gn1 / 2 g n (g n - gn ) g n 1 /2 _1_ 2 1 / + � gn ) + (g L 1 1 2 / / gn gn g 1 /2 • gn g n gn <
A/
A
A/
_
f
A
f
_
A
(
/
/
_
)
On fJ n the sum of the squares of the four terms on the right is bounded above by
399
25. 8 Efficient Score Equations
The expectations of the integrals over B n of these four terms converge to zero. First, the integral over the first term is bounded above by a2 a dt . l g ' (t) I dt l g (s - a t) - g (s) I w (t) dt ds ::::: f3 f3 s> y
: !1
�J
Next, the sum of the second and third terms gives the contribution 1 ll 2 -4 na f3 2 ll w oo I
f gn (s) ds
+
1 ll w ll 2 na 2f3 2 oo
J i t l w (t)
f ( g� ) 2 ds . 172
gn The first term in this last display converges to zero, and the second as well, provided the integral remains finite. The latter is certainly the case if the fourth term converges to zero. By the Cauchy-Schwarz inequality, (J g ' (s - ay) w (y) dy) 2 g' 2 1/2 (s - ay) w (y) dy . ::::: J g s - ay ) w ( y ) dy g
f( )
(
Using Fubini's theorem, we see that, for any set S , and sa its a-enlargement,
f. C1;, )
' c·
) ds
:0
L u;, ) , ds.
In particular, we have this for S = sa = ffi., and S = { s : g (s ) = 0 } . For the second choice of S , the sets sa decrease to S , by the continuity of g . On the complement of S , g� j g� 1 2 � g' j g 1 12 in Lebesgue measure. Thus, by Proposition 2.29, the integral of the fourth term converges to zero. • 25.8.2
Errors-in- Variables
Let the observations be a random sample of pairs
X Y
(Xi , Yi ) with the same distribution as
Z+e = a + f3 Z + f, =
for a bivariate normal vector (e, f) with mean zero and covariance matrix 'E and a random variable Z with distribution 1J, independent of (e, f) . Thus is a linear regression on a variable Z which is observed with error. The parameter of interest is e = (a, {3 , 'E ) and the nuisance parameter is 1J . To make the parameters identifiable one can put restrictions on either 'E or 1J. It suffices that 17 is not normal (if a degenerate distribution is considered normal with variance zero); alternatively it can be assumed that 'E is known up to a scalar. Given (e , 'E ) the statistiqlr g a) T is sufficient (and complete) = ( 1 , {3 ) 'E for 1J . This suggests to define estimators for (a, {3, 'E) as the solution of the "conditional score equation" = 0, for
Y
(X, Y)
- l (X, Y -
Pn le. i) le,ry(X, Y) le,17(X, Y) - Ee(le,17(X, Y) 1 1/Je (X, Y)). =
This estimating equation has the attractive property of being unbiased in the nuisance parameter, in that every e , 1J, 1} 1
0
400
Semiparametric Models
Therefore, the no-bias condition is trivially satisfied, and the estimator fJ need only be consistent for 1J (in the sense of (25.53)). One possibility for fJ is the maximum likelihood estimator, which can be shown to be consistent by Wald ' s theorem, under some regularity conditions. As the notation suggests, the function le , 17 is equal to the efficient score function for e . We can prove this by showing that the closed linear span of the set of nuisance scores contains all measurable, square-integrable functions of 1/fe (x , y ) , because then projecting on the nuisance scores is identical to taking the conditional expectation. As explained in Example 25.6 1 , the functions Pe , 17 1 / Pe.17 - 1 are score functions for the nuisance parameter (at (8 , ry) ). As is clear from the factorization theorem or direct calcu lation, they are functions of the sufficient statistic 1/fe (X, Y) . If some function b ( 1/fe (x , y) ) is orthogonal to all scores of this type and has mean zero, then
Ee , 17 1 b ( 1/fe (X, Y)) = Ee , 17 b ( 1/fe (X, Y)) ( PPee,,TJ171 - 1 ) = 0.
Consequently, b = 0 almost surely by the completeness of 1/fe (X, Y). The regularity conditions of Theorem 25 .54 can be shown to be satisfied under the condition that J l z l 9 dry(z) < oo. Because all coordinates of the conditional score function can be written in the form Qe (x , y) + Pe (x , y) E17 ( Z I o/e ( X , Y)) for polynomials Qe and Pe of orders 2 and 1 , respectively, the following lemma is the main part of the verification. t Lemma. For every 0 < a ::::; 1 and every probability distribution 1Jo on R and compact K c (0, oo), there exists an open neighborhood U of 1Jo in the weak topology such that the class F of all functions 25.65
z e z (bo +b t x +bzy ) e - c z 2 dry (z) J (x , y) � (ao + a l x + a 2 y) 2 +b +h bo y) ( x z z 2 e - c dry (z) ' Je
with 1J ranging over U, c ranging over K, and a and b ranging over compacta in R3 , satisfies v2 1 v 4 v P ( 1 + 1x l + l y l ) 5+2o: + I +8 1 , log N[ 1 ( .s , F, L 2 (P) ) ::::; C -;;
()
for every V :::=:: 1 /a, every measure P on R2 and 8 on a, 1Jo, U, V, the compacta, and 8.
25.9
)
(
>
0, and a constant C depending only
General Estimating Equations
Taking the efficient score equation as the basis for estimating a parameter is motivated by our wish to construct asymptotically efficient estimators. Perhaps, in certain situations, this is too much to ask, and it is better to aim at estimators that come close to attaining efficiency or are efficient only at the elements of a certain "ideal submodel." The pay off could be a gain in robustness, finite-sample properties, or computational simplicity. The information bounds then have the purpose of quantifying how much efficiency has possibly been lost. t
See [ l OS] for a proof.
40 1
25. 9 General Estimating Equations
We retain the requirement that the estimator is ,Jn-consistent and regular at every dis tribution P in the model. A somewhat stronger but still reasonable requirement is that it be asymptotically linear in that
�
This type of expansion and regularity implies that P is an influence function of the parameter 1/1 (P), and the difference p - 1f p must be orthogonal to the tangent set Pp . This suggests that we compute the set of all influence functions to obtain an indication of which estimators Tn might be possible. If there is a nice parametrization of these sets of functions in terms of a parameter of interest e and a nuisance parameter r , then a possible estimation procedure is to solve e from the estimating equation, for given r ,
�
�e ,
n
L � e, , (X;) = 0. i=l
The choice of the parameter r determines the efficiency of the estimator 8. Rather than fixing it at some value we also can make it data-dependent to obtain efficiency at every element of a given submodel, or perhaps even the whole model. The resulting estimator can be analyzed with the help of, for example, Theorem 5 .3 1 . If the model is parametrized by a partitioned parameter ce ' 1] , then any influence function for e must be orthogonal to the scores for the nuisance parameter lJ · The parameter r might be indexing both the nuisance parameter 1J and "position" in the tangent set at a given (e , 1J ) . Then the unknown 1J (or the aspect of it that plays a role in r ) must be replaced by an estimator. The same reasoning as for the "no-bias" condition discussed in (25.60) allows us to hope that the resulting estimator for e behaves as if the true 1J had been used.
)
Example (Regression). In the regression model considered in Example 25.28, the set of nuisance scores is the orthocomplement of the set e'H of all functions of the form (x , y ) H- (y x x , up to centering at mean zero. The efficient score function for e is equal to the projection of the score for e onto the set e'H, and an arbitrary influence function is obtained, up to a constant, by adding any element from e'H to this. The estimating equation
25.66
ge ( ))h( )
n
L ( Y; - ge(X;))h(X;) = 0 i=l
- ge ) h( e ( X
leads to an estimator with influence function in the direction of (y (x ) x ). Because the equation is unbiased for any we easily obtain ,Jn-consistent estimators, even for data dependent The estimator is more efficient if is closer to the function g (x ) /EIJ e 2 I = x ) , which gives the efficient influence function. For full efficiency it is necessary to estimate the function x H- EIJ (e 2 I = x ) nonparametrically, where consistency (for the right norm) suffices. 0
h,
h.
X
h
In Lemma 25.41 and Example 25 .43 the influence functions in a MAR model are characterized as the sums of reweighted influence functions in the original model and the influence functions obtained from the MAR specification. If 25.67
Example (Missing at random).
Semiparametric Models
402
n
the function is known, then this leads to estimating equations of the form
n
n i).i - n(Yi ) (X ) t; n(Yd 1/19 , -r t 8 n(Yi ) c(Yt ) i).i
.
=
0.
For instance, if the original model is the regression model in the preceding example, then (y) is (y (x) ) h (x) . The efficiency of the estimator is influenced by the choice of c (the optimal choice is given in Example 25.43) and the choice of 9 , , . (The efficient influence function of the original model need not be efficient here.) If is correctly specified, then the second part of the estimating equation is unbiased for any c, and the asymptotic variance when using a random c should be the same as when using the limiting value of c. 0
�9,,
- g9
�
25.10
n
Maximum Likelihood Estimators
Estimators for parameters in semiparametric models can be constructed by any method for instance, M -estimation or Z -estimation. However, the most important method to obtain asymptotically efficient estimators may be the method of maximum likelihood, just as in the case of parametric models. In this section we discuss the definitions of likelihoods and give some examples in which maximum likelihood estimators can be analyzed by direct methods. In Sections 25. 1 1 and 25. 12 we discuss two general approaches for analyzing these estimators. Because many semiparametric models are not dominated or are defined in terms of densities that maximize to infinity, the functions that are called the "likelihoods" of the models must be chosen with care. For some models a likelihood can be taken equal to a density with respect to a dominating measure, but for other models we use an "empirical likelihood." Mixtures of these situations occur as well, and sometimes it is fruitful to incor porate a "penalty" in the likelihood, yielding a "penalized likelihood estimator"; maximize the likelihood over a set of parameters that changes with yielding a "sieved likelihood estimator"; or group the data in some way before writing down a likelihood. To bring out this difference with the classical, parametric maximum likelihood estimators, our present estimators are sometimes referred to as "nonparametric maximum likelihood estimators" (NPMLE), although semiparametric rather than nonparametric seems more correct. Thus we do not give an abstract definition of "likelihood," but describe "likelihoods that work" for particular examples. We denote the likelihood for the parameter P given one observation x by lik(P)(x). Given a measure P, write P {x} for the measure of the one-point set {x } . The function x 1--+ P {x } may be considered the density of P , or its absolutely continuous part, with respect is the function, to counting measure. The empirical likelihood of a sample 1 , . . . n,
P
�---+
n ni= l P{Xi }.
X , Xn
Given a model P, a maximum likelihood estimator could be defined as the distribution P that maximizes the empirical likelihood over P. Such an estimator may or may not exist. Let P be the set of all probability distributions on the measurable space (X, A) (in which one-point sets are measurable). Then, for
25.68
Example (Empirical distribution).
n
25. 1 0 Maximum Likelihood Estimators
x1 ,
403
( P{x l }, . . . , P{x11} )
ranges over all vectors fixed different values . . . , X11 , the vector p :::: 0 such that L P i :::: 1 when p ranges over p . To maximize p f---+ n i Pi ' it is clearly best to choose p maximal : L i Pi = 1 . Then, by symmetry, the maximizer must be p = ( 1 In, . . . , 1 In). Thus, the empirical distribution JPl = n -1 L 8 maximizes the empirical likelihood over the nonparametric model, whence it is referred to as the nonparametric maximum likelihood estimator. If there are ties in the observations, this argument must be adapted, but the result is the same. The empirical likelihood is appropriate for the nonparametric model. For instance, in the case of a Euclidean space, even if the model is restricted to distributions with a continuous Lebesgue density p, we still cannot use the map p R n7= 1 p(X i ) as a likelihood. The supremum of this likelihood is infinite, for we could choose p to have an arbitrarily high, very thin peak at some observation. 0 n
xi
Given a partitioned parameter ( e , 77), it is sometimes helpful to consider the profile likelihood. Given a likelihood lik11 (8 , ry) (X 1 , . . . , X11), the profile likelihood for 8 is defined as the function e
R
sup lik11 (8 , ry)(X 1 , . . . , X ) . n
1}
The supremum is taken over all possible values of ry. The point of maximum of the profile likelihood is exactly the first coordinate of the maximum likelihood estimator (8 , i}) . We are simply computing the maximum of the likelihood over (8 , ry) in two steps. It is rarely possible to compute a profile likelihood explicitly, but its numerical evaluation is often feasible. Then the profile likelihood may serve to reduce the dimension of the likelihood function. Profile likelihood functions are often used in the same way as (ordinary) likelihood functions of parametric models. Apart from taking their points of maximum as estimators e , the second derivative at e is used as an estimate of minus the inverse of the asymptotic covariance matrix of e . Recent research appears to validate this practice. 25.69 Example (Cox model). Suppose that we observe a random sample from the distri bution of X = (T, Z) , where the conditional hazard function of the "survival time" T with covariate Z takes the form
A
The hazard function is completely unspecified. The density of the observation X = (T, Z) is equal to
A
A
A(O)
= 0). The usual estimator for (8 , where is the primitive function of (with based on a sample of size n from this model is the maximum likelihood estimator ce , where the likelihood is defined as, with the jump of at
A { t} A t, (8 , A) n e e z i A{ td e _ eeZi A(td . i= l f---+
A) A),
11
A(t)
This is the product of the density at the observations, but with the hazard function replaced by the jumps of the cumulative hazard function. (This likelihood is close
A {t}
Semiparametric Models
404
but not exactly equal to the empirical likelihood of the model.) The form of the likelihood forces the maximizer to be a jump function with jumps at the observed "deaths" ti , only and hence the likelihood can be reduced to a function of the unknowns A {td , . . . , A {tn } . It appears to be impossible to derive the maximizers ce , in closed-form formulas, but we can make some headway in characterizing the maximum likelihood estimators by "profiling out" the nuisance parameter A . Elementary calculus shows that, for a fixed e, the function
A
A)
n
() q , . . . ' An ) �
e
n e ll z; A i e -e 'i Lj : tj 9i Aj i=l
is maximal for
The profile likelihood for e is the supremum of the likelihood over f\ for fixed e. In view of the preceding display this is given by
The latter expression is known as the Co partial likelihood. The original motivation for this criterion function is that the terms in the product are the conditional probabilities that the ith subject dies at time i given that one of the subjects at risk dies at that time. The maximum likelihood estimator for A is the step function with jumps
x
A
The estimators e and are asymptotically efficient, under some restrictions. (See sec tion 25. 1 2. 1 .) We note that we have ignored the fact that jumps of hazard functions are smaller than 1 and have maximized over all measures A . 0 Suppose we observe a sample from the distribution of X e + Zs, where the unobservable variables Z and s are independent with completely unknown distribution rJ and a known density ¢, respectively. Thus, the observation has a mixture density J p11 (x I z) d rJ ( z ) for the kernel 1 e pe (x l z) � ¢ -- . z If ¢ is symmetric about zero, then the mixture density is symmetric about e, and we can estimate e asymptotically efficiently with a fully adaptive estimator, as discussed in Section 25.8. 1 . Alternatively, we can take the mixture form of the underlying distribution into account and use, for instance, the maximum likelihood estimator, which maximizes the likelihood 25.70
Example (Scale mixture).
=
=
(e , rJ ) �
(x - )
J Pe (Xt 1 z) d rJ (z) . fi i=
l Under some conditions this estimator is asymptotically efficient.
405
25. 1 0 Maximum Likelihood Estimators
Because the efficient score function for e equals the ordinary score function for e, the maximum likelihood estimator satisfies the efficient score equation JIDn le , 1J 0. By the convexity of the model in 1J, this equation is unbiased in 1J . Thus, the asymptotic efficiency of the maximum likelihood estimator e follows under the regularity conditions of Theorem 25 .54. Consistency of the sequence of maximum likelihood estimators (en , f7 n) for the product of the Euclidean and the weak topology can be proved by the method of Wald. The verification that the functions le , ry form a Donsker class is nontrivial but is possible using the techniques of Chapter 19. 0 =
25.71
Example (Penalized logistic regression).
ple from the distribution of X regression model
Pe , ry (Y
In this model we observe a random sam
(V, W, Y), for a 0- 1 variable Y that follows the logistic
=
=
1 1 V, W)
=
\ll ( e V + ry (W) ) ,
where \ll (u) 1 /(1 + e - u ) is the logistic distribution function. Thus, the usual linear regression of (V, W) has been replaced by the partial linear regression e V + ry (W), in which ranges over a large set of "smooth functions." For instance, is restricted to the Sobolev class of functions on [0, 1 ] whose (k - 1)st derivative exists and is absolutely continuous with J(ry) < oo, where =
17
17
12 (17) 1 1 ( 17 (k) (w) ) 2 dw. Here k 1 is a fixed integer and ry Ck) is the kth derivative of 17 with respect to The density of an observation is given by P e , ry (x ) \ll (e v + ry(w) Y ( 1 - \11 (8 v + ry(w)) 1 - y fv. w ( v , w). We cannot use this directly for defining a likelihood. The resulting maximizer f] would be such that f)(w t ) for every w1 with Yt 1 and f](w t ) when Yt 0, or at least we could construct a sequence of finite, smooth approaching this extreme choice. The problem is that qualitative smoothness assumptions such as J (ry) do not restrict 17 on a finite set of points w 1 , . . . , Wn in any way. To remedy this situation we can restrict the maximization to a smaller set of which we allow to grow as n ---)for instance, the set of all ry such that J (ry) ::::; Mn for Mn =
z.
::=::
=
=
oo
=
=
-oo
=
1J m
< oo
1J ,
oo ;
at a slow rate, or a sequence of spline approximations. An alternative is to use a penalized likelihood, of the form (8 ,
5-.n
17)
f-+
t oo
A2 2
IFn log pe, ry - 'An i (ry) .
Here is a "smoothing parameter" that determines the importance of the penalty J 2 (ry) . A large value of Jcn leads to smooth maximizers for small values the maximizer is more like the unrestricted maximum likelihood estimator. Intermediate values are best and are often chosen by a data-dependent scheme, such as cross-validation. The penalized estimator e can be shown to be asymptotically efficient if the smoothing parameter is constructed to and satisfy (see [102]). 0
f),
5-.n op(n - 1 12) 5-.� 1 Op(nk f(2k + l ) ) =
25.72
=
Example (Proportional odds).
the distribution of the variable X
=
Suppose that we observe a random sample from ::::; C}, Z ) , in which, given Z, the variables
( T 1\ C, 1 { T
Semiparametric Models
406
T and C are independent, as in the random censoring model, but with the distribution function F (t I z ) of T given Z restricted by F(t I z) e2 T e ry (t). 1 - F(t I z) In other words, the conditional odds given z of survival until t follows a Cox -type regression model. The unknown parameter rJ is a nondecreasing, cadlag function from [0, oo) into itself with ry (O) 0. It is the odds of survival if e 0 and T is independent of Z. If is absolutely continuous, then the density of X (Y, .:::\ , Z) is ( e -zT e rJ' (y ) ( 1 - Fc(Y - I z) ) ) 8 ( e -zT e fc(Y 1Tz ) ) 1 - 8 fz (z). ry (y) + e-z e (ry (y) + e _ 2 T g) 2 =
=
=
lJ
=
We cannot use this density as a likelihood, for the supremum is infinite unless we restrict lJ in an important way. Instead, we view lJ as the distribution function of a measure and use the empirical likelihood. The probability that X x is given by =
For likelihood inference concerning e, lJ) only, we may drop the terms involving z and define the likelihood for one observation as
( Fc and F .hk(e , l}) (X) ( (ry y e-zeT-zB)(ryT e ry {yy } e-zT B) ) 8 ( ry y e -zTe-ze T e ) 1 -8 ( )+ ( -) + ( )+ The presence of the jumps ry {y} causes the maximum likelihood estimator � to be a step function with support points at the observed survival times (the values corresponding to 8; 1). First, it is clear that yeach of these points must receive a positive mass. Second, mass to the right of the largest ; such that 8; 1 can be deleted, meanwhile increasing the likelihood. Third, mass assigned to other points can be moved to the closest Yi to the right such that 8t 1 , again increasing the likelihood. If the biggest observation Yt has 81 1 , then �{ y1} oo and that observation gives a contribution 1 to the likelihood, because the function p p 1 (p + r) attains for p � 0 its maximal value 1 at p oo. On the other hand, if 81 0 for the largest y1, then all jumps of � must be finite. =
Yi
=
=
=
=
=
=
1-+
=
The maximum likelihood estimators have been shown to be asymptotically efficient under some conditions in [105]. D 25.10.1
Random Censoring
Suppose that we observe a random sample . . . , (Xn , L\ n from the distribution of 1\ 1 in which the "survival time" and the "censoring time" C are independent with completely unknown distribution functions and respectively. The distribution of a typical observation (X, satisfies
(X1, .:::\ 1 ),
( T C , {T ::::; C} ) ,
T
.:::\)
P F,a (X ::::; x , P F , a (X ::::; x ,
.:::\ .:::\
=
0)
=
=
1)
=
( (1 J [O,x]
)
F G,
F) dG, ( ( 1 - G _ ) dF . } [O , x ]
407
25. 1 0 Maximum Likelihood Estimators
Consequently, if F and G have densities f and g (relative to some dominating measures), then (X, .::\ ) has density 1 (x , 8) � (C l - F) (x )g(x ) ) 8 ( ( 1 - G _ ) (x ) f (x ) ) - 8 . For f and g interpreted as Lebesgue densities, we cannot use this expression as a factor in a likelihood, as the resulting criterion would have supremum infinity. (Simply choose f or g to have a very high, thin peak at an observation Xi with L\i 1 or L\i 0, respectively.) Instead, we may take f and g as densities relative to counting measure. This leads to the empirical likelihood =
n
1 - ""
=
n
' TI(o - G_ ) (Xi ) F{X d ) "" · . i =l In view of the product form, this factorizes in likelihoods for F and G separately. The maximizer F of the likelihood F � fT= 1 ( 1 - F) (Xi ) l - L'., F{X d ""' turns out to be the product limit estimator, given in Example 20. 15. (F, G) � n(o - F) (Xi ) G { x z } ) i =l
That the product limit estimator maximizes the likelihood can be seen by direct ar guments, but a slight detour is more insightful. The next lemma shows that under the present model the distribution PF, G of (X, .::\ ) can be any distribution on the sample space [0, oo) x { 0, 1 } . In other words, if F and G range over all possible probability distributions on [0, oo ], then PF , G ranges over all distributions on [0, oo) x {0, 1 } . Moreover, the relationship (F, G) � PF , G is one-to-one on the interval where ( 1 - F) ( l - G) 0. As a consequence, there exists a pair ( F , G) such that Pt , G is the empirical distribution JPl of the observations >
n
1 :::; i :::; n .
Because the empirical distribution maximizes P � TI7= 1 P {Xi , L\i } over all distributions, it follows that ( F , G) maximizes (F, G) � TI7= 1 PF, c { Xi , L\i } over all (F, G). That F is the product limit estimator next follows from Example 20. 15. To complete the discussion, we study the map (F, G) � PF, G . A probability distribution on [0, oo) x {0, 1 } can be identified with a pair (H0, H1 ) of subdistribution functions on [0, oo) such that H0 (oo) + H1 (oo) 1 , by letting Hi (x ) be the mass of the set [O, x ] x {i } . A given pair of distribution functions (F0 , FI ) on [0, oo) yields such a pair of subdistribution functions (Ho , H1 ) , by =
Ho (x )
=
1[O, x] ( 1 - F1 ) dFo ,
H1 (x )
=
1[O,x] ( 1 - Fo- ) d F1 .
(25 .73)
Conversely, the pair (F0 , F1) can b e recovered from a given pair (H0 , HI ) by, with L\Hi the jump in Hi , H Ho + H1 and Af the continuous part of Ai , dHl dHo A I (x ) Ao (x ) ' ' [O,xJ 1 - H_ - L\H1 [ O,xJ 1 - H_ 1 - Fi (x ) n ( 1 - Az { s })e - A� (x ) _ 0 :-;:s::s x =
=
1
=
1
=
Lemma. Given any pair (Ho , HI ) of subdistributionfunctions on [0, oo) such that Ho (oo) + H1 (oo) = 1, the preceding display defines a pair (F0 , F1 ) of subdistribution functions on [0, oo) such that (25 .73) holds. 25.74
Semiparametric Models
408
For any distribution function A and cumulative hazard function B on [0, oo) , with Be the continuous part of B, dA . 1 - A (t) = n ( 1 - B {s } ) e - B c (t ) iff B (t) = [ O, t ] 1 - A_ 0 :9:st To see this, rewrite the second equality as (1 - A_) dB = dA and B (O) A (O), and integrate this to rewrite it again as the Volterra equation
Proof.
1
(1 - A) = 1 +
1[0, .] (1 - A_) d(-B) .
It is well known that the Volterra equation has the first equation of the display as its unique solution. t Combined with the definition of Fi , the equivalence in the preceding display implies immediately that dAi = dFd(l - Fi_). Secondly, as immediate consequences of the definitions,
(Ao + A 1 ) (t) -
L �Ao (s) � A l (s) = 1[O, t] 1 -dHH_ . sg
(Split dHo/(1 - H_ - �H1 ) into the parts corresponding to dHg and �Ho and note that �H1 may be dropped in the first part.) Combining these equations with the Volterra equation, we obtain that 1 - H = (1 - F0) ( 1 - F1 ) . Taken together with dH1 = (1 - H_) dA 1 , we conclude that dH1 = (1 - F0_)(1 - F1 _) dA 1 = (1 - Fo_) dF1 , and similarly dH0 = ( 1 - F1 ) dFo . • 25. 1 1
Approximately Least-Favorable Submodels
If the maximum likelihood estimator satisfies the efficient score equation IFn le . » = 0, then Theorem 25.54 yields its asymptotic normality, provided that its conditions can be verified for the maximum likelihood estimator � · Somewhat unexpectedly, the efficient score function may not be a "proper" score function and the maximum likelihood estimator may not satisfy the efficient score equation. This is because, by definition, the efficient score function is a projection, and nothing guarantees that this projection is the derivative of the log likelihood along some submodel. If there exists a "least favorable" path t 1--+ rJ1 (8 , �) SUCh that 'f/O ( e , �) = �, and, for every a log lik ( + t , rJ 1 (8 , �) ) (x ) , C e . » (x) = at l t =o then the maximum likelihood estimator satisfies the efficient score equation; if not, then this is not clear. The existence of an exact least favorable submodel appears to be particularly uncertain at the maximum likelihood estimator ce ' �), as this tends to be on the "boundary" of the parameter set. X,
e A
t
See, for example, [133, p. 206] or [55] for an extended discussion.
A
409
25. 11 Approximately Least-Favorable Submodels A
method around this difficulty is to replace the efficient score equation by an approxi mation. First, it suffices that (8 , f]) satisfies the efficient score equation approximately, for Theorem 25.54 goes through provided .Jli = o p ( l ) Second, it was noted following the proof of Theorem 25.54 that this theorem is valid for estimating equations of the form = 0 for arbitrary mean-zero functions its assertion remains correct provided that at the true value of ( e, 17) the function is the efficient score function. This suggests to replace, in our proof, the function that are proper score functions by functions and are close to the efficient score function, at least for the true value of the parameter. These are derived from "approximately-least favorable submodels." We define such submodels as maps t 1--+ ry, (e , 17) from a neighborhood of O E to the parameter set for 17 with ry0 (8 , 17) = 17 (for every (8 , ry)) such that
Pn le, i)
le, 17
Pn le, i) e , 11 ; l le, 11
.
Re ,17
IRk
Re , 17 (x)
a log lik (e + t , ry, (e , ry) (x) , at l t = o
)
= -
exists (for every x) and is equal to the efficient score function at (8 , 17) = (80 , 17 0 ) . Thus, the path t 1--+ 171 (8 , 17) must pass through 17 at t = 0, and at the true parameter (80 , 17 0 ) the submodel is truly least favorable in that its score is the efficient score for e . We need such a submodel for every fixed (8 , 17), or at least for the true value (80 , 17 0) and every possible value of (8 , f]). If (8 , f]) maximizes the likelihood, then the function t 1--+ log lik (e + t , is maximal at t = 0 and hence (fJ , f]) satisfies the stationary equation = 0. Now Theorem 25.54, with replaced by yields the asymptotic efficiency of e For easy reference we reformulate the theorem.
Pn
JPln Re , i)
ry,(fJ, f7 ))
n Re , 17 , Pen. 1'/o Ken. iln = o p (n 1 12 + l fJn - 8o l ) (25.75) Peo , ryo I Ren. iln - Keo , 1Jo 1 2 !,. 0, Pen . 11o I Re, , i)J 2 = 0 p ( l) . (25.76) 8}, is differentiable in quadratic Suppose that the model {Pe , 17 : mean with respect to e at 17o) and let the efficient information matrix e0, 170 be nonsingu lar. Assume that Re , 11 are the scorefunctions of approximately least-favorable submodels (at that the functions Re , i) belong to a Pe0, 170 -Donsker class with square-integrable en velope with probability tending to and that (25.75) and (25.76) hold. Then the maximum le , 11
.
-
25.77
8 E
Theorem.
(8o ,
l
(8o , 1Jo ) ),
1,
likelihood estimator en is asymptotically efficient at (eo 17o) provided that it is consistent. '
le , i)
Re , i) ·
The no-bias condition (25.75) can be analyzed as in (25.60), with replaced by Alternatively, it may be useful to avoid evaluating the efficient score function at or f], and (25.60) may be adapted to
fJ
Pe , ryo Re, i) = (Pe, ryo - Pe , i) )(Re, i) - Reo , 1Jo ) -/ Keo. 11o [Pe , i) - Pe, 110 - Beo , 1Jo ( f7 - 17o) Peo , 1Jo ] dfL . (25.78) Replacing by should make at most a difference of o p ( l fJ - 8 l ) , which is negligible in the preceding display, but the presence of f] may require a rate of convergence for f]. e
eo
o
Theorem 5.55 yields such rates in some generality and can be translated to the present setting as follows.
410
Semiparametric Models
Consider estimators T11 contained in a set Hn that, for a given Jc11 contained in a set A11 C IR, maximize a criterion t f--+ JIDn m , c , • or at least satisfy JID nm r , ::=:: l¥ nm ,0)c, · ) Assume that for every A E An , every t E Hn and every o 0, )
,
>
P (m o, - m ,0,; J ,:S - d� ( t , to) + )...2 , E* sup I GAmo, - m ,0 ,;. J I .:S ¢n (o ) .
(25 .79) (25.80)
d, (r, ro ) < i5 I.. E A , , T EH,
Suppose that (25.79) and (25 .80) are valid for functions ¢n such that o f--+ ¢n ( o ) / oa is decreasing for some a < 2 and sets An X Hn such that P (Jcn E A n , Tn E Hn ) -+ 1. Then d). ( rn , to) :::; 0 � ( on + Jc n) for any sequence of positive numbers O n such that ¢n ( on) :::; ..Jii o� for every n. 25.81
Theorem.
25. 11.1
Cox Regression with Current Status Data
Suppose that we observe a random sample from the distribution of X ( C, b. , Z), in which b. 1 {T :::; C}, that the "survival time" T and the observation time C are independent given Z, and that T follows a Cox model. The density of X relative to the product of Fc , z and counting measure on {0, 1 } is given by =
=
Pe ,A (x)
=
(1
-
T exp ( - e e 2 A ( c) )
) 0 (exp ( - ee
T
2
A ( c) )
) [ -i5 .
We define this as the likelihood for one observation x . In maximizing the likelihood we restrict the parameter 8 to a compact in IRk and restrict the parameter A to the set of all cumulative hazard functions with A ( t ) :::; M for a fixed large constant M and t the end of the study. We make the following assumptions. The observation times C possess a Lebesgue density that is continuous and positive on an interval [0', t ] and vanishes outside this interval. The true parameter A o is continuously differentiable on this interval, satisfies 0 < A o ( 0' -) :::; A o ( t ) < M, and is continuously differentiable on [ 0', t ] . The covariate vector Z is bounded and E cov(Z I C) 0. The function h eo , A o given by (25.82) has a version that is differentiable with a bounded derivative on [ 0', t ] . The true parameter 80 is an inner point of the parameter set for 8 . The score function for 8 takes the form >
-Ce , A (x)
=
z A (c) Qe , A (x) ,
for the function Qe , A given by
For every nondecreasing, nonnegative function h and positive number t , the sub model A 1 A + t h is well defined. Inserting this in the log likelihood and differentiating with respect to t at t 0, we obtain a score function for A of the form =
=
Be . A h (x)
=
h (c) Qe, A (x) .
41 1
25. 11 Approximately Least-Favorable Submodels
Be , A h
h
The linear span of these score functions contains for all bounded functions of bounded variation. In view of the similar structure of the scores for and projecting onto the closed linear span of the nuisance scores is a weighted least-squares problem with weight function The solution is given by the vector-valued function IC (25.82) IC The efficient score function for takes the form
fe , A
8 A,
Qe, A . hg, A (c) = A(c) EeEe, A, A( (QZQ��. >A X)(X) ==cc) ) · 8 le, A (x) = (zA(c) - he , A (c) ) Qe , A (x). Formally, this function is the derivative at t = 0 of the log likelihood evaluated at (8 + t, A t The, A ). However, the second coordinate of the latter path may not define a nondecreasing, nonnegative function for every t in a neighborhood of 0 and hence cannot be used to obtain a stationary equation for the maximum likelihood estimator. This is true in particular for discrete cumulative hazard functions for which t is nondecreasing for both t < 0 and t 0 only if is constant between the jumps of This suggests that the maximum likelihood estimator does not satisfy the efficient score equation. To prove the asymptotic normality of e , we replace this equation by an approxi mation, obtained from an approximately least favorable submodel. For fixed and a fixed bounded, Lipschitz function ¢, define
A,
h
>
A+ h A.
(8, A),
A t (8, A) = A - tT cp(A) (heo , Ao o A 0 1 ) (A). Then A1 (8, A) is a cumulative hazard function for every t that is sufficiently close to zero, because for every u ::::; v, A t (8, A)(v) - A t (8, A)(u) � ( A(v) - A(u) ) ( 1 - l t l l ¢ heo, Ao o A 0 1 l up )· Inserting (8 + t , A1(8, A)) into the log likelihood, and differentiating with respect to t at t = yields the score function k'e , A (x) = (zA(c) - ¢ ( A(c) ) (he0 , A0 o A 0 1 )(A(c)) ) Qe , A (x). If evaluated at (80, Ao) this reduces to the efficient score function le0 , A 0 (x) provided ¢ (A o) = 1, whence the submodel is approximately least favorable. To prove the asymptotic 0,
efficiency of en it suffices to verify the conditions of Theorem 25.77. The function ¢ is a technical device that has been introduced in order to ensure that 0 ::::; ::::; M for all t that are sufficiently close to 0. This is guaranteed if 0 ::::; y ¢ (y) ::::; c ( y 1\ (M - y) ) for every 0 ::::; y ::::; M, for a sufficiently large constant Because by assumption [ 0' - ) , r) J c (0, M) , there exists such a function ¢ that also fulfills 1 on [0', r]. In order to verify the no-bias condition (25.52) we need a rate of convergence for An .
A1 (8, A)
¢(Ao) = 25.83
0
Lemma.
p(n- 1 13 ).
Proof
Ao ( Ao (
c.
Under the conditions listedpreviously, en is consistent and II An -
Denote the index
(80, Ao) by 0, and define functions me , A = log ( Pe, A + Po)/2.
Ao II Po, 2
=
412
Semiparametric Models
Pe , A
are bounded above by 1 , and under our assumptions the density p0 i� The densities bounded away from zero. It follows that the functions (x) are uniformly bounded in (e , A) and x . B y the concavity of the logarithm and the definition of (8 ,
me , A
A),
IFnme , A � � IFn log pe , J. + � IFn log po � IFn log po
=
m
IFn o .
Therefore, Theorem 25.81 is applicable with r (e , A) and without A.. For technical reasons it is preferable first to establish the consistency of (e , by a separate argument. We apply Wald' s proof, Theorem 5. 14. The parameter set for e is compact by assumption, and the parameter set for A is compact relative to the weak topology. Wald's theorem shows that the distance between (e , and the set of maximizers of the Kullback-Leibler divergence converges to zero. This set of maximizers contains (e0 , A0), but this parameter is not fully identifiable under our assumptions: The parameter Ao is identifiable only on p p the interval ( o-, r ) . It follows that e ___,. eo and A (t) ___,. A o ( t ) for every o- t r . (The convergence of at the points o- and r does not appear to be guaranteed.) By the proof of Lemma 5.35 and Lemma 25.85 below, condition (25.79) is satisfied with By Lemma 25.84 below, the bracketing d ( (e ' A) ' (eo ' A o) ) equal to I e - eo II + II A entropy of the class of functions is of the order ( l jc-). By Lemma 19.36 condition (25. 80) is satisfied for =
A)
A)
�
�
<
A
Ao 1 2 ·
me , A
rPn (8)
=
<
( 8f!rn) .
y'8 1 +
A Ao l 2 -
This leads to a convergence rate of n- 1 1 3 for both 11 8 - e0 I and I I -
•
To verify the no-bias condition (25 .75), we use the decomposition (25.78). The inte grands in the two terms on the right can both be seen to be bounded, up to a constant, by A0) 2 , with probability tending to one. Thus the bias Pe , 1J0 K:e , r, is actually of the order Op (n- 21 3). The functions x �--+ (x) can be written in the form 1/J (z , e 13 T z , A (c) , 8) for a function '1/J that is Lipschitz in its first three coordinates, for 8 E {0, 1 } fixed. (Note that A �--+ A is Lipschitz, as A �--+ A o Ar / Ao) o A0 1 (A).) The functions z 1--+ z, z 1--+ exp e z, c 1--+ A (c) and 8 1--+ 8 form Donsker classes if e and A range freely. Hence the functions x �--+ A(c) (x) form a Donsker class, by Example 19.20. The efficiency of en follows by Theorem 25.77.
(A -
T
K:e , A heo , Ao (A)/ (heo , Ao / Qe , A
Qe , A
=
Under the conditions listed previously, there exists a constant C such that, for every c > 0,
25.84
Lemma.
me , A
First consider the class of functions for a fixed e . These functions depend on A monotonely if considered separately for 8 0 and 8 1 . Thus a bracket A 1 � A � for A leads, by substitution, readily to a bracket for Furthermore, because this dependence is Lipschitz, there exists a constant D such that
Proof.
=
=
me , A ·
J (me , A 1 - me , AJ2 dFc, z D 1r (A 1 (c) - A2 (c) ) 2 de . �
A2
25. 11 Approximately Least-Favorable Submodels
me , A
413
(Pe , A ) ( a ;ae me , A
Thus, brackets for A of L 2 -size translate into brackets for of L 2 -size propor tional to E. By Example 1 9. 1 1 we can cover the set of all A by exp C 1 IE) brackets of siZe (x) Next, we allow to vary freely as well. Because is finite-dimensional and is uniformly bounded in A , x , this increases the entropy only slightly. • c
c.
e
e
(e, )
Under the conditions listed previously there exist constants C, that, for all A and all II I E,
25.85
Lemma.
e - eo f (p�:� - P�[�A0 ) 2 dt-t � C 1' (A - A o)2 (c) de + C 1 e - eo f .
c >
0 such
<
Proof.
The left side of the lemma can be rewritten as
f (Pe1/2, A - Pe1/2o, Ao ))22 df-t . (Pe , A + Pe0, A0
Pe , A
Because p0 is bounded away from zero, and the densities are uniformly bounded, the denominator can be bounded above and below by positive constants. Thus the Hellinger distance (in the display) is equivalent to the L 2 -distance between the densities, which can be rewritten 2 eT , 2 e -e e -e dFY, Z (c, z) .
I
[ ' A ( c) 8T
o A o ( c) ]
_
-e
Let g (t) be the function exp ( e r z A (c) ) evaluated at t + (1 - t)80 and A1 t A + (1 - t)A0, for fixed (c, z) . Then the integrand is equal to 0 2 , and hence, by the mean value theorem, there exists 0 :=:: t t (c, z) :=:: 1 such that the preceding display is equal to
e1 e (g(l) - g( ) ) =
=
eT Here the multiplicative factor e - e e ,r z is bounded away from zero. By dropping this term we obtain, up to a constant, a lower bound for the left side of the lemma. Next, because the function is bounded away from zero and infinity, we may add a factor and obtain the lower bound, up to a constant,
A ( c) ' e ,
,
Qeo,Ao
Q�o, Ao '
T z ) is uniformly close to 1 if is close to Here the function h ( 1 + t (e Furthermore, for any function and vector a , - )2 )2 T· T · 2 T :=::
e - eo) g ( Po(Be0, A 0 g)a fe0, A0 ( Po(Be0, A0 g)a (fe0, A0 - fo) Po(Be0, A 0 g) a Uo - Io)a, =
eo.
=
by the Cauchy-Schwarz inequality. Because the efficient information ]0 is positive-definite, the term a T (10 - ]0)a on the right can be written a T I0ac for a constant 0 c 1 . The lemma now follows by application of Lemma 25.86 ahead. • <
<
414
Semiparametric Models
Let h, g 1 and g2 be measurable functions such that c 1 ::::; h ::::; c 2 and 2 (Pg 1 g2 ) ::::; c Pgf Pgi for a constant c < 1 and constants c 1 < 1 < c2 close to 1. Then
25.86
Lemma.
for a constant C depending on c, c 1 and c2 that approaches 1 - .Jc as c 1 Proof
t
1 and c2
+
1.
We may first use the inequalities (hg 1 + g2 ) 2 ::: c 1 hg f + 2hg l g2 + c2 1 hgi = h (g 1 + g2 ) 2 + (c 1 - 1)hg f + ( 1 - c2 1 ) hgi ::: c l (g f + 2g l g2 + gi ) + (c 1 - 1)c2 g f + (c2 1 - 1)gi .
Next, we integrate this with respect to P, and use the inequality for P g 1 g2 on the second term to see that the left side of the lemma is bounded below by
Finally, we apply the inequality 2xy ::::; x 2 + y 2 on the second term. 25.1 1.2
•
Exponential Frailty
Suppose that the observations are a random sample from the density of X =
Pe, 17 (u, v) = f ze - zu 8ze -Bz v dry(z).
(U, V) given by
This is a density with respect to Lebesgue measure on the positive quadrant of JR2 , and we may take the likelihood equal to just the joint density of the observations. Let (en , fln ) maximize n V; ) . 17 ) 1-+ i= l This estimator can be shown to be consistent, under some conditions, for the Euclidean and weak topology, respectively, by, for instance, the method of Wald, Theorem 5. 14. The "statistic" 1/re V) = + V is, for fixed and known sufficient for the nuisance parameter. Because the likelihood depends on 17 only through this statistic, the tangent set 17 pPe , ry for 1] consists of functions of u + v only. Furthermore, because u + v is distributed according to a mixture over an exponential family (a gamma-distribution with shape parameter 2), the closed linear span of 17 PPe ry consists of all mean-zero, square integrable functions of + V , by Example 25.35. Thus, the projection onto the closed linear span of 17 PPe , ry is the conditional expectation with respect to + V, and the efficient score function for e is the "conditional score," given by
(8,
(U,
0Pe , 11 (U; ,
U e
e,
e
U e
e
U e
le, 11 (x) = i e, 11 (x) - Ee (i e , 17 (X) I 1/re (X) = 1/re (x)) J � (u J
ev)z3 e -z (u +13v) dry(z) ez2 e -z (u +13v) dry(z)
25. 1 1 Approximately Least-Favorable Submodels
+ e s,
415
e
where we may use that, given U V the variables U and V are uniformly distributed on the interval [0, s]. This function turns out to be also an actual score function, in that there exists an exact least favorable submodel, given by =
q,
(,
(e ,
q)(B)
=
q
( e( ;e )) 1-
z t
z,
Inserting rJr () ry ) in the log likelihood, making the change of variables ( 1 / (28) ) -+ and computing the (ordinary) derivative with respect to t at 0, we obtain It follows that the maximum likelihood estimator satisfies the efficient score equation, and its asymptotic normality can be proved with the help of Theorem 25.54. The linearity of the model in rJ (or the formula involving the conditional expectation) implies that
t
every
e,
rJ , r]o
·-
=
ie ,ry (x).
0
Thus, the "no-bias" condition (25.52) is trivially satisfied. The verification that the functions
ie , ry form a Donsker class is more involved but is achieved in the following lemma. t Suppose that J (z2 + z -5 ) d (z) oo. Then there exists a neighborhood V of for the weak topology such that the class offunctions (x, y) �-+ j (a l + a2Jzxz2+ a3 zy) z2 d (z) dry(z) where (a 1 , . . . , a23 ) ranges over a bounded subset of JR3 , b2 ) ranges over a compact subset of(O, oo) , and ranges over V, is Pe0, ry0 -Donsker with square-integrable envelope. 25.87
rJo
Lemma.
<
'fl o
e -z(b 1 x + h y l
e -z (b 1 x + b2Yl
,
rJ
(h ,
rJ
25. 1 1.3
Partially Linear Regression
Suppose that we observe a random sample from the distribution of X which for some unobservable error e independent of (V, Y
e
=
ev + ry(W) + e.
W),
W)
(V,
W, Y), in
Thus, the independent variable Y is a regression on (V, that is linear in V with slope but may depend on in a nonlinear way. We assume that V and take their values in the unit interval [0, 1], and that rJ is twice differentiable with < for
W
J2 (ry)
=
1 1 ry"(w)2 dw .
W J(ry) oo,
e
This smoothness assumption should help to ensure existence of efficient estimators of and will be used to define an estimator. If the (unobservable) error is assumed to be normal, then the density of the observation X ( V, W, Y) is given by 1 1 ( )) / Pe , t] Pv ' w ( v =
(x)
t
For a proof see [I 06] .
=
e Cy ev ry w 2 a2 -
CJ �
-
-
, w).
416
Semiparametric Models
We cannot use this directly to define a maximum likelihood estimator for as a maximizer for will interpolate the data exactly: A choice of such that i i Yi for every maximizes Il Pe , ., (xi ) but does not provide a useful estimator. The problem is that so far has only been restricted to be differentiable, and this does not prevent it from being very wiggly. To remedy this we use a penalized log likelihood estimator, defined as the minimizer of
i
17
17
17
(e, 17), 17 ( w ) - e v =
Here �n is a "smoothing parameter" that may depend on the data, and determines the weight of the "penalty" A large value of �n gives much influence to the penalty term and hence leads to a smooth estimate of and conversely. Intermediate values are best. For the purpose of estimating we may use any values in the range
12 ( 17).
17,
e � 2n
=
0
p
(n 1 2 ) ' -
1
There are simple numerical schemes to compute the maximizer (en , f7n ) , the function f7n being a natural cubic spline with knots at the values W I , . ' Wn . The sequence en can be shown to be asymptotically efficient provided that the regression components involving and are not confounded or degenerate. More precisely, we assume that the conditional distribution of given is nondegenerate, that the distribution of has at least two support points, and that has a version with < oo. Then, we I have the following lemma on the behavior of (en , Let II denote the norm of
..
W
W W h0(w) E(V W w) 1(h0) n). ·lw L2 CPw). f7 Under the conditions listed previously, the sequence en is consistent for eo, l f7n l oo 0p (1), 1( f7n ) O p ( 1 ), and l f7n - 11 i l w Op (�n) , under (eo, 17o). Write g(v, w) ev + 17(w), let IP'n and Po be the empirical and true distribution of the variables ' vi wi ) , and define functions
25.88
V
V
=
=
Lemma. =
=
Proof
=
=
(ei
'
and g(v, w) e v + f](w) minimizes g 2e(go - g) + (go - g)2 + A2 12 (17) - A2 12 (170). By the orthogonality property of a conditional expectation and the Cauchy-Schwarz in equality, (EV17(W)) 2 EE(V I W) 2 E17 2 (W) EV 2 i 1 111 i fv . Therefore, by Lemma 25.86, Po(g - go f � 1 e - eo l 2 + 1 17 - 17o l fv . Consequently, because Po e 0 and e is independent of (V, W),
Then
�---+
=
mg,'A - m 80 , 'A
IP'n m 8 ;_ , ,
=
�
<
=
This suggests to apply Theorem 25.81 with t and i t equal to the sum of the first three terms on the right. 1 5 , it is not a real loss of generality to assume Because �� 1 0 p ( 1 / for n that � n E An oo). Then < 8 and E An implies that I < 8, that < 8 and that � 8 / n . Assume first that it is known already that e and =
An) A n 2 [An, d'A(t, to) 1(17) A
=
1 17 - 11o l w
=
=
-
(e, 17) d ( , to) A
1 e - eo
11
417
25. 1 1 Approximately Least-Favorable Submodels
I f] I are bounded in probability, so that it is not a real loss of generality to assume that I B I l fJ I oo 1 . Then 00
v
::s:
I I
ery + [e r-, 1 -
Thus a bound on the w-norm of r] yields a bound on the "Bernstein norm" of (given on the left) of proportional magnitude. A bracket for r] induces a bracket for the functions In view of Lemma 1 9.37 and Example 1 9. 10, we obtain ·
e - r-,2 , e + r-,2 - e - r,d
[r-, 1 , r-,2 ]
ery.
for
<
( 1 + 8/'An ) 1/2 dc rv83/4 + 8 · 174 An c
This bound remains valid if we replace ry by for the parametric p art adds little to the entropy. We can obtain a similar maximal inequality for the process CGn - go) 2 , in view of the inequality ::S: still under our assumption that 1 . We conclude that Theorem 25.81 applies and yields the rate of convergence v ry ::S: = 5- n) = O 5- n . are bounded in probability. By the Cauchy Finally, we must prove that e and Schwarz inequality, for every w and r] ,
- r-,0 g-g0, P0(g - g0)4 4P0(g - g0)2 , Ie I I I p( ) I B - Bol + l f7 - rJol l w O p(n -215 + I f] I
ev (g
00
(X)
r,Cw) - r-,(0) - r, ' (O) w l 1 w1 u l r, " l (s) ds du J(ry). This implies that l rJI I oo l ry(O)I + l r, ' (O) I + J(ry), whence it sufficies to show that f](O), f}'(O), and J(f]) remain bounded. The preceding display implies that l e v + ry(O) + ry '(O)w l l g(v, w) l + J(ry) . The empirical measure applied to the square of the left side is equal to a T Ana for a = (e, ry (0) , ry ' (0) ) and An = n ( v, 1 , w) ( v, 1 , w l the sample second moment matrix of the ::S:
::S:
l
::s:
e,
::s:
lP'
variables (Vi , 1 , Wi ). By the conditions on the distribution of (V, W), the corresponding population matrix is positive-definite, whence we can conclude that a is bounded in prob ability as soon as a T An a is bounded in probability, which is certainly the case if 1P'ng 2 and 1 are bounded in probability. We can prove the latter by applying the preceding argument conditionally, given the sequence V Given these variables, the variables are the only random part in m g , t,. - m go , J. and the parts only contribute to the centering function. We apply Theorem 25.81 with square distance equal to
(f])
1 , W1 , V2 , W2 ,
.
•
.
.
ei
(g - g0)2
An appropriate maximal inequality can be derived from, for example Corollary 2.2.8 in because the stochastic process IP'n )-metric is sub-Gaussian relative to the on the set of Because 82 , 8 implies that IP'n ::S: 8/'An , and ) for C dependent on the smallest eigenvalue of v ::S: C( IP'n
[146], g. 2 1 8 1 l r-,1 1 �
Gneg dJ.(T, r0)2 2 (g g0) + J (r,) <
-
(g - g0)2
<
L( J(ry)2
Semiparametric Models
418
the second moment matrix An , the maximal inequality has a similar form as before, and we conclude that Pn C� - g0 ) 2 + � 2 J 2 (f]) O p (� \ This implies the desired result. • =
e
The normality of the error motivates the least squares criterion and is essential for the efficiency of e . However, the penalized least-squares method makes sense also for nonnormal error distributions. The preceding lemma remains true under the more general condition of exponentially small error tails: E c l e l oo for some 0. Under the normality assumption (with IJ 1 for simplicity) the score function for e is given by
e
c
<
=
>
(y - e v - ry (w) ) v . Given a function h with J (h) the path ry 1 ry + t h defines a submodel indexed by the nuisance parameter. This leads to the nuisance score function Be, 17 h(x) ( y - e v - ry (w) ) h (w). On comparing these expressions, we see that finding the projection of l 8 , 17 onto the set of ry-scores is a weighted least squares problem. By the independence of e and (V, W), it le , ry (x) <
=
oo ,
=
=
follows easily that the projection is equal to B8 , 17 h0 for h0 (w) E(V I W the efficient score function for e is given by ev - ry (w) ) ( v - o (w) ) . le,ry (x) =
(y -
=
h
=
w), whence
th
Therefore, an exact least-favorable path is given by ry1 (8 , ry) 1J - 0 . Because (en , fJn) maximizes a penalized likelihood rather than an ordinary likelihood, it certainly does not satisfy the efficient score equation as considered in section 25.8. However, it satisfies this equation up to a term involving the penalty. Inserting (e + ry1(e , fJ)) into the least-squares criterion, and differentiating at 0, we obtain the stationary equation =
-
A2 {I
t
A I/
t,
= II
dw t h o n
2A o 1J (w)h 0 (w) 0. J The second term is the derivative of � 2 1 2 (ry1 (e , fJ)) at 0. By the Cauchy-Schwarz 2 o inequality, it is bounded in absolute value by 2� J (f]) J( ) p ( - 1 1 2 ), by the first as sumption on � and because J(f]) O p (l ) by Lemma 25 .88. We conclude that (en , fJn ) satisfies the efficient score equation up to a op (n- 1 1 2 )-term. Within the context of Theo rem 25 .54 a remainder term of this small order is negligible, and we may use the theorem to obtain the asymptotic normality of en . A formulation that also allows other estimators f] is as follows. Pw ee, i) -
=
=
=
=
Let fJn be any estimators such that ll fJn ll oo Then any consistent sequence of estimators en such that Jn o ) . ry cally efficient at (8o,
25.89
Proof
Theorem.
=
Op ( l )
P1Je , i)
It suffices to check the conditions of Theorem 25 .54. Since
for every (8 , ry), the no-bias condition (25 .52) is satisfied.
and J(fJn) 0 p ( l ) . o p ( l) is asymptoti
=
=
25. 1 2 Likelihood Equations
419
That the functions le , f] are contained in a Donsker class, with probability tending to 1 , follows from Example 19. 10 and Theorem 1 9.5. The remaining regularity conditions of Theorem 25.54 can be seen to be satisfied by standard arguments. • In this example we use the smoothness of rJ to define a penalized likelihood estimator for e . This automatically yields a rate of convergence of -lfS for � . However, efficient estimators for e exist under weaker smoothness assumptions on ry , and the minimal smooth ness of 1J can be traded against smoothness of the function g (w) E(V I W w), which also appears in the formula for the efficient score function and is unknown in practice. The trade-off is a consequence of the bias Pe , 11, 8 le ,f),g being equal to the cross product of the biases in � and g. The square terms in the second order expansion (25 .60), in which the derivative relative to ( ry , g) (instead of ry) is a (2 x 2)-matrix, vanish. See [35] for a detailed study of this model. n
=
25 .12
=
Likelihood Equations
The "method of the efficient score equation" isolates the parameter e of interest and charac terizes an estimator fJ as the solution of a system of estimating equations. In this system the nuisance parameter has been replaced by an estimator � . If the estimator � is the maximum likelihood estimator, then we may hope that a solution fJ of the efficient score equation is also the maximum likelihood estimator for e , or that this is approximately true. Another approach to proving the asymptotic normality of maximum likelihood estimators is to design a system of likelihood equations for the parameter of interest and the nuisance parameter jointly. For a semiparametric model, this necessarily is a system of infinitely many equations. Such a system can be analyzed much in the same way as a finite-dimensional system. The system is linearized in the estimators by a Taylor expansion around the true parameter, and the limit distribution involves the i�verse of the derivative applied to the system of equations. However, in most situations an ordinary pointwise Taylor expansion, the classical argument as employed in the introduction of section 5.3, is impossible, and the argument must involve some advanced tools, in particular empirical processes. A general scheme is given in Theorem 1 9.26, which is repeated in a different notation here. A limitation of this approach is that both fJ and � must converge at y'n-rate. It is not clear that a model can always appropriately parametrized such that this is the case; it is certainly not always the case for the natural parametrization. The system of estimating equations that we are looking for consists of stationary equa tions resulting from varying either the parameter e or the nuisance parameter ry . Suppose that our maximum likelihood estimator (fJ , �) maximizes the function for lik(e , 1] ) ( ) being the "likelihood" given one observation The parameter e can be varied in the usual way, and the resulting stationary equation takes the form X
X.
420
Semiparametric Models
This is the usual maximum likelihood equation, except that we evaluate the score function at the joint estimator (e , f]), rather than at the single value e. A precise condition for this equation to be valid is that the partial derivative of log lik(e , 1J )(x) with respect to e exists and is equal to le, IJ (x), for every x, (at least for 1J f] and at e = e). Varying the nuisance parameter 1J is conceptually more difficult. Typically, we can use a selection of the submodels 1--0- 1J t used for defining the tangent set and the information in the model. If scores for 1J take the form of an "operator" B9 , 1J working on a set of indices then a typical likelihood equation takes the form =
t
h,
Here we have made it explicit in our notation that a score function always has mean zero, by writing the score function as x 1--0- Be,1Jh(x) - Pe, 1J Be,1Jh rather than as x 1--0- Be,1Jh(x). The preceding display is valid if, for every (e , 1J ) , there exists some path 1--0- 1J t (e , 1J) such that 1J o ( e , 1J ) 1J and, for every x,
t
=
Be,l)h(x) - Pe , 1)Be,1)h
a at
= -
h
l t =o
.
log hk (e +
t,
1J t (e , 1J ) ) .
Assume that this is the case for every in some index set H, and suppose that the latter is chosen in such a way that the map 1--0- Be,1Jh(x) - Pe,1JBe,1Jh is uniformly bounded on H, for every x and every ( e , 1J ) . Then we can define random maps Wn : IRk x H 1--0- IRk x l00 (H) by Wn ( Wn l , Wn 2 ) with
h
=
Wn l (e , 1] ) Wn 2 (e , 1J )
h
Pnle , IJ , PnBe,IJh - Pe , 1JBe,1Jh,
= =
h E H.
The expectation of these maps under the parameter (eo , 1J o) is the deterministic map ( W 1 , W2 ) given by
W 1 (e , 1] ) W2 (e , 1] )h
= =
Peo, 1Jo le, 1J , Pe0, 1J0 Be,1Jh - Pe , 1JBe,1Jh,
w
=
h E H.
By construction, the maximum likelihood estimators (en , fJn) and the "true" parameter (eo , 1J o ) are zeros of these maps,
Wn (en , fJn)
=
0
=
w (eo , 1J o ) .
The argument next proceeds by linearizing these equations. Assume that the parameter set H for 1J can be identified with a subset of a Banach space. Then an adaptation of Theorem 19.26 is as follows.
Suppose that the functions le,l) and Be, l)h, ifh ranges overH and over a neighborhood of are contained in a Peo,1Jo -Donsker class, and that
25.90
Theorem.
(e , 1J )
(eo , 1J o),
Furthermore, suppose that the mapk with a derivative �0 : IR ceo , 1J o),
\)1 : e
X
X
lin H
H
1--0-
1--0-
IRk
IRk X l00(H) X
l00 (H)
is Frechet-differentiable at that has a continuous inverse
42 1
25. 1 2 Likelihood Equations
on its- range. If the sequence (en , �n) is consistent for (e0 , 17o) and satisfies Wn (en , �n) = 11 2 o (n ), then p
The theorem gives the joint asymptotic distribution of en and � n · Because .Jfiwn (eo, 17o) is the empirical process indexed by the Donsker class consisting of the functions .€ e0 , 170 and Be0, 170 h, this process is asymptotically normally distributed. Because normality is retained under a continuous, linear map, such as �0 1 , the limit distribution of the sequence .Jfi(en - e0, � n - 170) is Gaussian as well. The case of a partitioned parameter (e , 17) is an interesting one and illustrates most aspects of the application of the preceding theorem. Therefore, we continue to write the formulas in the corresponding partitioned form. However, the preceding theorem, applies more generally. In Example 25.5 . 1 we wrote the score operator for a semiparametric model in the form
Ae, 17 (a , b)
= a .€e, 17 + Be, 17 b . ·
y
Corresponding to this, the system of likelihood equations can be written in the form
(a, b). If the partitioned parameter (e, 17) and the partitioned "directions" (a, b) are replaced by a general parameter and general direction c, then this formulation extends to general every
r
models. The maps 'l1n and 'l1 then take the forms
The theorem requires that these can be considered maps from the parameter set into a Banach space, for instance a space .€ 00 (C). To gain more insight, consider the case that 17 is a measure on a measurable space (Z , C ) . Then the directions h can often be taken equal to bounded functions h : Z �---+ JR., corresponding to the paths (1 + if 17 is a completely unknown measure, or 17 t ( 1 + (h - 17 h ) ) 17 if the total mass of each 17 is fixed to one. In the remainder of the discussion, we assume the latter. Now the derivative map �0 typically takes the form
d
=
d17 = d t
t
th) d17
where
= - Pe0 , 170 fe0 ,170.€eo. t]o (e - eo), = - f s;0,1)0 ieo, t]o d(17 - 17o) , W2 1 (e - eo)h = - Pe0, 170 (Be0 ,ry0 h) .€eo,ryo (e - eo), �22 (17 - 17o)h = - f s;o,1)o Be0, 170 h d(17 - 17o). 'll u .
(e - eo) � 12 (17 - 17o)
.
.
·
T
·
T
(25.91)
422
Semiparametric Models
For instance, to find the last identity in an informal manner, consider a path 17 t in the direction of g, so that d17 t - d170 = t g d1]o + o (t ) . Then by the definition of a derivative On the other hand, by the definition of \II , for every h,
\112 (8o , 17 t )h - \11 (8o , 1}o )h = - (Pg0, 17, - Pg0, 170 )Bg0, 17, h �
=
-t Pg0 , 170 (Bg0, 170 g )(Bg0 , 170 h) + o(t)
J
- (B;0, 170 Bg0, 170 h) tg d1]o + o (t) .
On comparing the preceding pair of displays, we obtain the last line of (25 .9 1), at least for = g d 7]0 . These arguments are purely heuristic, and this form of the derivative must be established for every example. For instance, within the context of Theorem 25.90, we may need to apply �0 to 17 that are not absolutely continuous with respect to 1} o. Then the validity of (25.9 1 ) depends on the version that is used to define the adjoint operator B:!: By definition, an adjoint is an operator between L 2 -spaces and hence maps equivalence classes into equivalence classes. The four partial derivatives �ij in (25 .9 1 ) involve the four parts of the information operator A� , 17 Ag , 17 , which was written in a partitioned form in Example 25.5 . 1 . In particular, the map � ll is exactly the Fisher information for e, and the operator �22 is defined in terms of the information operator for the nuisance parameter. This is no coincidence, because the formulas can be considered a version of the general identity "expectation of the second derivative is equal to minus the information." An abstract form of the preceding argument applied to the map \ll ( r )c = Pr0 Arc - Pr Arc leads to the identity, with Tt a path with derivative io at t = 0 and score function Ar0 d,
d17 - d1]o
17Q , . ,Q n
•
�o (io)c = (A �0 Ar0C , d ) ro · In the case of a partitioned parameter r = ( e , 17), the inner inner product on the right is defined as ( (a , b) , (a , f3)) r0 = a T a + J bf3 d7]o, and the four formulas in (25.91) follow by
Example 25 .5. 1 and some algebra. A difference with the finite-dimensional situation is that the derivatives i0 may not be dense in the domain of �0, so that the formula determines �0 only partly. An important condition in Theorem 25 .90 is the continuous invertibility of the derivative. Because a linear map between Euclidean spaces is automatically continuous, in the finite dimensional set-up this condition reduces to the derivative being one-to-one. For infinite dimensional systems of estimating equations, the continuity is far from automatic and may be the condition that is hardest to verify. Because it refers to the g oo ('H)-norm, we have some control over it while setting up the system of estimating equations and choosing the set of functions H . A bigger set 1{ makes �0 1 more readily continuous but makes the differentiability of \II and the Donsker condition more stringent. In the partitioned case, the continuous invertibility of �0 can be verified by ascertaining the continuous invertibility of the two operators � l l and V = �22 - �2 1 �[i 1 � 1 2 . In that case we have
25. 12 Likelihood Equations
423
The operator W 11 is the Fisher information matrix for e if rJ is known. If this would not be invertible, then there would be no hope of finding asymptotically normal estimators for e . The operator V has the form
where the operator K is defined as The operator V : lin H r--+ £ 00 (H) is certainly continuously invertible if there exists a positive number E such that sup i V ( rJ - rJo ) h l ::::: c ll rJ - rJ o ll h EH
In the case that rJ is identified with the map h r--+ ryh in £ 00(1{), the norm on the right is given by suph EH I (17 - rJo ) h I Then the display is certainly satisfied if, for some E 0, >
·
This condition has a nice interpretation if 1{ is equal to the unit ball of a Banach space IB of functions. Then the preceding display is equivalent to the operator B;o, rJo Beo , rJo + K : IB r--+ IB being continuously invertible. The first part of this operator is the information operator for the nuisance parameter. Typically, this is continuously invertible if the nuisance parameter is regularly estimable at a Jn-rate (relatively to the norm used) if e is known. The following lemma guarantees that the same is then true for the operator B;o ,rJo Beo,rJo + K if the efficient information matrix for e is nonsingular, that is, the parameters e and rJ are not locally confounded.
Let Iffi be a Banach space contained in £ 00 ( £ ). If leo, rJo is nonsingu lar, B;0 ,rJ0 Be0, 110 : Iffi r--+ IB is onto and continuously invertible and B;o, rJo i e0 , 110 E IB, then B;0 , 110 Be0 , rJo + K : IB r--+ IB is onto and continuously invertible.
25.92
Lemma.
Abbreviate the index (8o , ry 0) to 0. The operator K is compact, because it has a finite-dimensional range. Therefore, by Lemma 25.93 below, the operator B� Bo + K is continuously invertible provided that it is one-to-one. Suppose that (B� B0 + K ) h = 0 for some h E lB. By assumption there exists a path t r--+ rJ t with score function B0h = Bah - P0B0h at t = 0. Then the submodel indexed by t r--+ (8o + t ao , rJt ) , for ao = - /0- 1 Po ( Bo h ) f. o , has score function a(; io + Bah at t = 0, and information
Proof
2 2 a0T Io ao + Po(Boh) - a0T Ioao. + 2a0T Po £. o ( Bo h ) = Po (Boh)
Because the efficient information matrix is nonsingular, this information must be strictly positive, unless a0 = 0. On the other hand, 0 = rJoh(B� Bo + K ) h
=
Po (Boh) 2 + a0T Po(Boh) £. o .
Semiparametric Models
424
This expression is at least the right side of the preceding display and is positive if a0 f 0. Thus ao = 0, whence K h = 0. Reinserting this in the equation (B� B0 + K)h = 0, we find that B� B0h = 0 and hence h = 0. • The proof of the preceding lemma is based on the Fredholm theory of linear operators. An operator K : lBl � lBl is compact if it maps the unit ball into a totally bounded set. The following lemma shows that for certain operators continuous invertibility is a consequence of their being one-to-one, as is true for matrix operators on Euclidean space. t It is also useful to prove the invertibility of the information operator itself.
Let lBl be a Banach space, let the operator A : lBl � lBl be continuous, onto and continuously invertible and let K : lBl � lBl be a compact operator. Then R(A + K) is closed and has codimension equal to the dimension ofN(A + K). In particular, if A + K is one-to-one, then A + K is onto and continuously invertible. 25.93
Lemma.
The asymptotic covariance matrix of the sequence .Jfi (Bn - 80) can be computed from the expression for �0 and the covariance function of the limiting process of the sequence Jii 'ltn ( 80 , rJo) . However, it is easier to use an asymptotic representation of Jii (Bn - 80) as a sum. For a continuously invertible information operator Be0 , 170 Be0 , 170 this can be obtained as follows. In view of (25.9 1 ) , the assertion of Theorem 25.90 can be rewritten as the system of equations, with a subscript 0 denoting (80, 17 0 ) , - Io (Bn - 8o) - U7 n - rJo) B�f-o - Po ( Bo h )i � (Bn - 8o) - (�n - rJo)B� Boh
= - ( JIDn - Po)lo + Op ( 1 / Jn) , = - ( JPln - Po)Boh + op ( 1 / Jn) .
The o p (1 / .Jfi)-term in the second line is valid for every h E 1{ (uniformly in h). If we can also choose h = ( B� Bo) - 1 B� £ 0 , and subtract the first equation from the second, then we arrive at Ie0 , 170 Jn (Bn - 8o)
= Jn ( JPln - Po)le0 , 170 + Op ( l ) .
Here le0, 170 is the efficient score function for 8 , as given by ( 25.33 ) , and Ie0, 170 is the ef ficient information matrix. The representation shows that the sequence .Jfi (Bn - 80) is asymptotically linear in the efficient influence function for estimating 8 . Hence the maxi mum likelihood estimator fJ is asymptotically efficient. + The asymptotic efficiency of the estimator �h for 17h follows similarly. We finish this section with a number of examples. For each example we describe the general structure and main points of the verification of the conditions of Theorem 25.90, but we refer to the original papers for some of the details. 25. 12. 1
Cox Model
Suppose that we observe a random sample from the distribution of the variable X (T 1\ C, 1 { T ::=: C}, z) , where, given Z, the variables T and C are independent, as in the t
t
For a proof see, for example, [ 1 32, pp. 99-103]. This conclusion also can be reached from general results on the asymptotic efficiency of the maximum likelihood estimator. See [56] and [ 1 43] .
425
25. 12 Likelihood Equations
random censoring model, and T follows the Cox model. Thus, the density of X is given by
=
(Y, � , Z)
We define a likelihood for the parameters (8 , A) by dropping the factors involving the distribution of (C, Z), and replacing .A.(y) by the pointmass A {y}, lik(8 , A ) (x)
=
e e (ee z A { y}e _ e ' A (yl ) s (e _ e ' A (y ) ) l - s .
This likelihood is convenient in that the profile likelihood function for 8 can be computed explicitly, exactly as in Example 25.69. Next, given the maximizer e , which must be calculated numerically, the maximum likelihood estimator is given by an explicit formula. Given the general results put into place so far, proving the consistency of ce ' is the hardest problem. The methods of section 5.2 do not apply directly, because of the empirical factor A {y} in the likelihood. These methods can be adapted. Alternatively, the consistency can be proved using the explicit fom1 of the profile likelihood function. We omit a discussion. For simplicity we make a number of partly unnecessary assumptions. First, we assume that the covariate Z is bounded, and that the true conditional distributions of T and C given Z possess continuous Lebesgue densities. Second, we assume that there exists a finite number r 0 such that P(C � r) = P(C = r) 0 and Pe0 ,A 0 (T r) 0. The latter condition is not unnatural: It is satisfied if the survival study is stopped at some time r at which a positive fraction of individuals is still "at risk" (alive). Third, we assume that, for any measurable function h, the probability that Z =f. h (Y) is positive. The function A now matters only on [0, r ] ; we shall identify A with its restriction to this interval. Under these conditions the maximum likelihood estimator ce , can be shown to be consistent for the product of the Euclidean topology and the topology of uniform convergence on [0, r ] . The score function for 8 takes the form
A
>
A)
>
>
>
A)
For any bounded, measurable function h : [0, r] 1-+ R the path defined by dA1 = (1 + th) dA defines a submodel passing through A at t = 0. Its score function at t = 0 takes the form
{[O, y] h dA .
Be,Ah(x) = 8h (y) - e11 z J
The function h 1-+ Be,Ah(x) is bounded on every set of uniformly bounded functions h, for any finite measure A , and is even uniformly bounded in x and in (8, A) ranging over a neighborhood of (8o, Ao). It is not difficult to find a formula for the adjoint Be,A of Be,A : L 2 (A) �--+ L 2 (Pe,A), but this is tedious and not insightful. The information operator Be, A Be, A : L 2 (A) 1-+ L 2 (A) can be calculated from the identity Pe,A (Be,Ag) (Be,Ah) = Ag(Be,A Be,Ah) . For continuous A it takes the surprisingly simple form
Semiparametric Models
426
To see this, write the product Be,Ag Be,Ah as the sum of four terms
Take the expectation under Pe,A and interchange the order of the integrals to represent BZ,A Be,Ah also as a sum of four terms. Partially integrate the fourth term to see that this cancels the second and third terms. We are left with the first term. The function BZ,A ie,A can be obtained by a similar argument, starting from the identity Pe,A ie,A Be,Ah = A( Be,A ie,A)h. It is given by
The calculation of the information operator in this way is instructive, but only to check (25.91) for this example. As in other examples a direct derivation of the derivative of the map W = (\11 1 , \11 2 ) given by \IJ 1 (e , A) = Po fe,A and W2 (e , A) h = PoBe,Ah requires less work. In the present case this is almost trivial, for the map \II is already linear in A. Writing Go (y I Z) for the distribution function of Y given Z, this map can be written as
J Go(Y I Z) dAo(Y) - EZee z J A(y) dGo (Y I Z) , Ee 80 z J h(y)Go (Y I Z) dAo( Y ) - Ee e z JJ( h dA dGo( Y I Z). [O , y]
W 1 (e , A) = EZe 80 z \ll 2 (e , A)h
=
If we take 1t equal to the unit ball of the space BV[O, r] of bounded functions of bounded variation, then the map w : ffi. X C00 (1t) 1--7 ffi. X C00 (1t) is linear and continuous in A, and its partial derivatives with respect to e can be found by differentiation under the expectation and are continuous in a neighborhood of (eo , Ao). Several applications of Fubini 's theorem show that the derivative takes the form (25.91). We can consider B� B0 as an operator of the space BV[O, r] into itself. Then it is continuously invertible if the function y 1--7 Eeo ,A o 1 r::: y e eo z is bounded away from zero on [0, r ]. This we have (indirectly) assumed. Thus, we can apply Lemma 25 .92. The efficient score function takes the form (25.33), which, with Mi ( Y ) = Eeo.Ao lr::: y Zi e eo z , reduces to
(
M
)
(
M
)
le0 ,A 0 (X) = 8 z - Mo1 (y) - e 80 z }( z - Mo1 (t) dAo (t). [O , y] The efficient information for e can be computed from this as
leo ,A o
e0 = Ee z
J(
)
2M 1 Z - (y) Go( Y I Z) dAo(y) . Mo
This is strictly positive by the assumption that Z is not equal to a function of Y. The class 1t is a universal Donsker class, and hence the first parts 8 h (y) of the functions Be,Ah form a Donsker class. The functions of the form J[O , y] h dA with h ranging over 1t and A ranging over a collection of measures of uniformly bounded variation are functions of uniformly bounded variation and hence also belong to a Donsker class. Thus the functions Be,Ah form a Donsker class by Example 1 9.20.
25. 12 Likelihood Equations
25.12.2
427
Partially Missing Data
Suppose that the observations are a random sample from a density of the form
(x , y, z) f--+
J Pe (x I s) d 1J (s) Pe (y I z) d l] (z) =: Pe (x l lJ) pe (Y I z) d l] (z) .
Here the parameter 1J is a completely unknown distribution, and the kernel p11 ( I s ) is a given parametric model indexed by the parameters e and s, relative to some density f.-L. Thus, we obtain equal numbers of bad and good (direct) observations concerning lJ. Typically, by themselves the bad observations do not contribute positive information concerning the cumulative distribution function lJ , but along with the good observations they help to cut the asymptotic variance of the maximum likelihood estimators. •
This model can arise if we are interested in the relationship between a response Y and a covariate Z, but because of the cost of measurement we do not observe Z for a fraction of the population. For instance, a full observation (Y, Z) = (D, W, Z) could consist of - a logistic regression D on exp Z with intercept and slope {30 and {3 1 , respectively, and - a linear regression W on Z with intercept and slope a 0 and a 1 , respectively, and an N (0, a 2 )-error. Given Z the variables D and W are assumed independent, and Z has a completely unspec ified distribution 1J on an interval in R The kernel is equal to, with \II denoting the logistic distribution function and ¢ denoting the standard normal density, 25.94
Example.
Pe (d, w I z) = \II (.Bo +
f3 1 ez ) d ( 1 - \ll (f3o + f3 1 ez )) 1 -d -;;1 ¢ ( w
- ao - a
1z ) .
a The precise form of this density does not play a major role in the following. In this situation the covariate Z is a gold standard, but, in view of the costs of measure ment, for a selection of observations only the "surrogate covariate" W is available. For instance, Z corresponds to the LDL cholesterol and W to total cholesterol, and we are interested in heart disease D = 1 . For simplicity, each observation in our set-up consists of one full observation (Y, Z) = (D, W, Z) and one reduced observation X = (D , W) . 0 Example. If the kernel Pe (y I z) is equal to the normal density with mean z and variance e , then the observations are a random sample Z . . . , Zn from 1J, a random sample X . . . , Xn from 1J perturbed by an additive (unobserved) normal error, and a sample Y , . . . , Yn of random variables that given Z 1 , . . . , Zn are normally distributed with means Zi and variance e . In this case the interest is perhaps focused on estimating lJ , rather than e. o
25.95
1, 1
1,
The distribution of an observation (X, Y, Z) is given by two densities and a nonparametric part. We choose as likelihood lik(e , lJ )(x , y, z) = Pe (x l lJ) Pe ( Y I z) l] {z} . Thus, for the completely unknown distribution 1J of Z we use the empirical likelihood for the other part of the observations we use the density, as usual. It is clear that the maximum
428
Semiparametric Models
likelihood estimator � charges all observed values z 1 , . . . , Zn , but the term p8 (x 1 77 ) leads to some additional support points as well. In general, these are not equal to values of the observations. The score function for e is given by
Ke (x I s) Pe (x I s) drJ(s) . . € e, 17 (x , y, z) = Ke, 11 (x) + Ke (y I z) = f Pe (x 1 17 ) ·
.
+ Ke ( Y
I z) .
Here Ke (y I z) a I a e log Pe (y I z) is the score function for e for the conditional density p8 (y I z), and Ke,ry(x) is the score function for e of the mixture density pe (x l rJ) . Paths of the form d171 = (1 + th) d17 (with 17h = 0) yield scores =
h (s)pe (x I s) drJ (s) Be,ryh (x , z) = Ce,ryh (x) + h (z) = f + h(z) . Pe (x 1 77 ) The operator Ce,11 : L 2 (17) H- L 2 ( Pe (· 1 17) ) is the score operator for the mixture part of the model. Its Hilbert-space adjoint is given by
c;,r] g(z) =
J g(x) Pe (x I z) df-L (X) .
The range of Be,ry is contained in the subset G of L 2 ( pe (· l rJ) x 17 ) consisting of functions of the form (x , z) H- g 1 (x) + g2 (z) + c. This representation of a function of this type is unique if both g 1 and g2 are taken to be mean-zero functions. With Pe,ry the distribution of the observation (X, Y, Z) , Thus, the adjoint Be,ry : G H- L 2 (17) of the operator Be,11 : L 2 (17)
H-
G is given by
B;,11 (g l EB g2 EB c) = c;,ry g l + g2 + 2c. Consequently, on the set of mean-zero functions in L 2 (17) we have the identity Be,ry Be,11 = Ce,ry Ce,ry + I . Because the operator Ce, ry Ce,11 is nonnegative definite, the operator Be, ry Be,11 is strictly positive definite and hence continuously invertible as an operator of L 2 (17) into itself. The following lemma gives a condition for continuous invertibility as an operator on the space C01 (Z) of all "a-smooth functions." For a0 :::=: a the smallest integer strictly smaller than a, these consist of the functions h : Z c ffi.d H- ffi. whose partial derivatives up to order a0 exist and are bounded and whose a0-order partial derivatives are Lipschitz of order a - a0 . These are Banach spaces relative to the norm, with Dk a differential operator a k l . . . a kd I az�I . . . z�d '
The unit ball of one of these spaces is a good choice for the set H indexing the likelihood equations if the maps z H- Pea (x I z) are sufficiently smooth.
Let Z be a bounded, convex subset of ffi.d and assume that the maps z H Po (x I z) are continuously differentiable for each X with partial derivatives a I a Z i PiJo (x I z) 25.96
Lemma.
25. 12 Likelihood Equations
satisfying, for all z, z' in Z and fixed constants K and a
>
429
0,
Then B;0 , 170 Be0 ,r10 : Cf3 (Z) � Cf3 (Z) is continuously invertiblefor every f3
< a.
By its strict positive-definiteness in the Hilbert-space sense, the operator B0 B0 : £00 (Z) � £00 (Z) is certainly one-to-one in that B0 B0 h = 0 implies that h = 0 almost surely under ry0 . On reinserting this we find that -h = C0Coh = C00 = 0 everywhere. Thus B0 Bo is also one-to-one in a pointwise sense. If it can be shown that C0C0 : Cf3 (Z) � c f3 (Z) is compact, then B0 B0 is onto and continuously invertible, by Lemma 25.93. It follows from the Lipschitz condition on the partial derivatives that C0h(z) is differ entiable for every bounded function h : X � lP1. and its partial derivatives can be found by differentiating under the integral sign:
Proof.
a
C0h(z) = a zi
J h(x) -aazi po(x I z) dfl, (x).
The two conditions of the lemma imply that this function has Lipschitz norm of order a bounded by K ll h ll oo· Let h n be a uniformly bounded sequence in £00 (X). Then the partial derivatives of the sequence C0hn are uniformly bounded and have uniformly bounded Lipschitz norms of order a. Because Z is totally bounded, it follows by a strengthening of the Arzela-Ascoli theorem that the sequences of partial derivatives are precompact with respect to the Lipschitz norm of order f3 for every f3 a. Thus there exists a subsequence along which the partial derivatives converge in the Lipschitz norm of order f3. By the Arzela-Ascoli theorem there exists a further subsequence such that the functions C0h n (z) converge uniformly to a limit. If both a sequence of functions itself and their continuous partial derivatives converge uniformly to limits, then the limit of the functions must have the limits of the sequences of partial derivatives as its partial derivatives. We conclude that C0hn converges in the I · 11 1 +!3 -norm, whence C0 : £00 (X) � Cf3 (Z) is compact. Then the operator C0C0 is certainly compact as an operator from cf3 (Z) into itself. • <
Because the efficient information for 8 is bounded below by the information for 8 in a "good" observation (Y, Z) , it is typically positive. Then the preceding lemma together with Lemma 25.92 shows that the derivative �0 is continuously invertible as a map from ffi. k x £00 (11) x ffi.k x £00(11) for 1t the unit ball of Cf3 (Z) . This is useful in the cases that the dimension of Z is not bigger than 3, for, in view of Example 19.9, we must have that f3 d/2 in order that the functions Be, 17 h = Ce, 17 h EB h form a Donsker class, as required by Theorem 25.90. Thus a 1 /2, 2, 3/2 suffice in dimensions 1 , 2, 3, but we need f3 2 if Z is of dimension 4. Sets Z of higher dimension can be treated by extending Lemma 25.96 to take into account higher-order derivatives, or alternatively, by not using a ca (Z)-unit ball for 'H. The general requirements for a class 1t that is the unit ball of a Banach space IDS are that 1t is ry0-Donsker, that C0C0IDS C IDS, and that C0C0 : IDS � IDS is compact. For instance, if Pe (x I z) corresponds to a linear regression on z, then the functions z � C0C0h (z) are of the form z � g(a T z) >
>
>
Semiparametric Models
430
for functions g with a one-dimensional domain. Then the dimensionality of Z does not really play an important role, and we can apply similar arguments, under weaker conditions than required by treating Z as general higher dimensional, with, for instance, lffi equal to the Banach space consisting of the linear span of the functions z � g (a T z) in C i (Z) and 7-( its unit ball. The second main condition of Theorem 25.92 is that the functions i8 ,., and B8 , ., h form a Donsker class. Dependent on the kernel p8 (x I z) , a variety of methods may be used to verify this condition. One possibility is to employ smoothness of the kernel in x in combination with Example 1 9.9. If the map x � p8 (x I z) is appropriately smooth, then so is the map x � C8 , 71h(x) . Straightforward differentiation yields _!_ Ce , ., h (x )
a�
=
(
)
covx h (Z) , _!___ log pe (x I Z) ,
a�
where for each x the covariance is computed for the random variable Z having the (condi tional) density z � p8 (x I z) dTJ(z)jp8 (x I TJ) . Thus, for a given bounded function h ,
I
a a xi
- C,,u
, "f/
I-
h(x) < ll h ll
00
J l a: log Pe (x I z ) l Pe (x I z) drJ(Z) J Pe (x I z ) d rJ (z ) '
·
Depending on the function a I a x i log Pe (x I z), this leads to a bound on the first derivative of the function x � C8 ,.,h(x ) . If Xis an interval in ffi., then this is sufficient for applicability of Example 19.9. If X is higher dimensional, the we can bound higher -order partial derivatives in a similar manner. If the main interest is in the estimation of TJ rather than e, then there is also a nontechnical criterion for the choice of H, because the final result gives the asymptotic distribution of �h for every h E H, but not necessarily for h tJ_ H. Typically, a particular h of interest can be added to a set 7-( that is chosen for technical reasons without violating the results as given previously. The addition of an infinite set would require additional arguments. Reference [107] gives more details concerning this example.
Notes
Most of the results in this chapter were obtained during the past 1 5 years, and the area is still in development. The monograph by Bickel, Klaassen, Ritov, and Wellner [8] gives many detailed information calculations, and heuristic discussions of methods to construct estimators. See [77], [101], [102] , [1 13], [ 1 22], [145] for a number of other, also more re cent, papers. For many applications in survival analysis, counting processes offer a flexible modeling tool, as shown in Andersen, Borgan, Gill, and Keiding [1], who also treat semi parametric models for survival analysis. The treatment of maximum likelihood estimators is motivated by (partially unpublished) joint work with Susan Murphy. Apparently, the present treatment of the Cox model is novel, although proofs using the profile likelihood function and martingales go back at least 1 5 years. In connection with estimating equations and CAR models we profited from discussions with James Robins, the representation in section 25.53 going back to [ 1 29]. The use of the empirical likelihood goes back a long way, in particular in survival analysis. More recently it has gained popularity as a basis for constructing likelihood ratio based confidence intervals. Limitations of the information
Likelihood Equations
25. 12
43 1
bounds and the type of asymptotics discussed in this chapter are pointed out in [128]. For further information concerning this chapter consult recent journals, both in statistics and econometrics.
PROBLEMS 1. Suppose that the underlying distribution of a random sample of real-valued observations is known to have mean zero but is otherwise unknown. (i) Derive a tangent set for the model. (ii) Find the efficient influence function for es timating 1/J (P)
=
P (C) for a fixed set C .
(iii) Find an asymptotically efficient sequence o f estimators for 1/J (P ) .
2 . Suppose that the model consists o f densities p (x - ()) o n JR k , where p i s a smooth density with p (x) = p(- X). Find the efficient influence function for estimating e . 3 . In the regres sion model of Example 25.28, assume in addition that e and X are independent. Find the efficient score function for e .
4 . Find a tangent set for the set o f mixture distributions J p (x I z ) d F (z) for x r-+ p (x I z ) the uniform distribution on [z , z + 1 ]. Is the linear span of this set equal to the nonparametric tangent set? 5. (Neyman-Scott problem) Suppose that a typical observation is a pair (X, Y) of variables that are conditionally independent and N (Z , e) -distributed given an unobservable variable Z with a completely unknown distribution 11 on lit A natural approach to estimating () is to "eliminate" the unobservable Z by taking the difference X - Y. The maximum likelihood estimator based on a sample of such differences is Tn = !n - 1 L 7=1 (X; - Y; ) 2 . (i) Show that the closed linear span of the tangent set for 11 contains all square-integrable, mean-zero functions of X + Y. (ii) Show that Tn is asymptotically efficient. (iii) Is Tn equal to the semiparametric maximum likelihood estimator? 6. In Example 25.72, calculate the score operator and the information operator for 11 ·
7. In Example 25. 12, express the density of an observation X in the marginal distributions F and G of Y and C and (i) Calculate the score operators for F and G . (ii) Show that the empirical distribution functions P * and 6 * of the Y; and Ci are asymptotically efficient for estimating the marginal distributions F* and G* of Y and C, respectively ;
(iii) Prove the asymptotic normality of the es timator for F given by
F(y)
=
1-
n
O_:s:s _:::: y
(1 - A{s}) ,
�
A (y)
=
1
[O , y]
dF*
� G*
A ;
- F*
(iv) Show that this es timator is asymptotically efficient. 8. (Star-shaped distributions) Let F be the collection of all cumulative distribution functions on
[0, 1] such that x
r-+
F (x ) / x is nondecreasing. (This is a famous example in which the maximum likelihood estimator is inconsistent.) (i) Show that there exists a maximizer F n (over F) of the empirical likelihood F r-+ fl?=l F {x; }, and show that this satisfies ft n (x ) -+ x F (x ) for every x . (ii) Show that at every F E F there i s a convex tangent cone whose closed linear span is the nonparametric tangent space . What does this mean for efficient estimation of F ?
432
Semiparametric Models
9. Show that a U -statistic is an asymptotically efficient estimator for its expectation if the model is nonparametric. 10. Suppose that the model consists of all probability distributions on the real line that are symmetric . (i) If the symmetry point is known to be 0, find the maximum likelihood estimator relative to the empirical likelihood. (ii) If the symmetry point is unknown, characterize the maximum likelihood estimators relative to the empirical likelihood; are they useful?
e
11. Find the profile likelihood function for the parameter in the Cox model with censoring discussed in Section 25 . 1 2. 1 .
12. Let P be the set of all probability distributions on lR with a positive density and let 1/J (P) be the median of P . (i) Find the influence function of 1/J . (ii) Prove that the sample median is asymptotically efficient.
References
[ 1 ] Andersen, P. K., Borgan, 0., Gill, R.D., and Keiding, N. ( 1 992). Statistical Models Based on Counting Processes. Springer, Berlin. [2] Arcones, M.A., and Gine, E. ( 1 993). Limit theorems for U-processes . Annals of Probability 21 , 1494-1542. [3] Bahadur, R.R. ( 1 967) . An optimal property of the likelihood ratio statistic . Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability ( 1965/66) I, 1 3-26. University of California Press, Berkeley.
[4] Bahadur, R.R. ( 1 97 1). Some limit theorems in statistics . Conference Board of the Mathematical Sciences Regional Conference Series in Applied Mathematics 4. Society for Industrial and Applied Mathematics, Philadelphia.
[5] Bamdoff-Nielsen, O.E. , and Hall, P. (1988). On the level-error after Bartlett adjustment of the likelihood ratio statistic. Biometrika 75, 37 8-37 8 . [6] Bauer, H . ( 1 9 8 1 ) . Probability Theory and Elements of Measure Theory. Holt, Rinehart, and Winston, New York .
[7] Bentkus, V. , Gotze, F. , van Zwet, W.R. ( 1 997). An Edgeworth expansion for symmetric statis tics. Annals of Statistics 25, 85 1-896. [8] Bickel, P.J., Klaass en, C.A.J., Ritov, Y., and Wellner, J.A. ( 1 993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore. [9] Bickel, P.J., and Ghosh, J.K. ( 1 990) . A decomposition for the likelihood ratio statistic and the Bartlett correction-a Bayesian argument. Annals of Statistics 18, 1070-1090. [ 1 0] Bickel, P.J., and Rosenblatt, M . ( 1 973). On some global measures of the deviations of density function es timates. Annals of Statistics 1, 1 07 1-1095 . [ 1 1 ] Billingsley, P. (1968). Convergence of Probability Measures. John Wiley, New York. [ 1 2] Birge, L. ( 1 983). Approximation dans les espaces metriques et theorie de ! ' estimation. Zeitschriftfur Wahrscheinlichkeitstheorie und Verwandte Gebiete 65, 1 8 1-23 8. [ 1 3 ] Birge, L . ( 1 997). Estimation o f unimodal densities without smoothness assumptions . Annals of Statistics 25, 970-9 8 1 . [ 1 4] Birge, L., and Massart, P. ( 1 993). Rates o f convergence for minimum contrast estimators . Probability Theory and Related Fields 97, 1 1 3-1 50. [ 1 5 ] Birge, L., and Massart, P. ( 1 997). From model selection to adaptive es timation. Festschrift for Lucien Le Cam. Springer, New York, 55-87. [ 1 6] Birman, M.S., and Solomj ak, M.Z. ( 1 967). Piecewise-polynomial approximation of functions of the classes Wp . Mathematics of the USSR Sbornik 73, 295-3 1 7 . [17] Brown, L. ( 1 987). Fundamentals of Statistical Exponential Families with Applications in Sta tistical Decision Theory. Institute of Mathematical Statistics, California. [ 1 8] Brown, L.D., and Fox, M. ( 1 974) . Admissibility of procedures in two-dimensional location parameter problems . Annals of Statistics 2, 248-266. [ 1 9] Cantelli, F.P. (1933). Sulla determinazione empirica delle leggi di probabilitiL Giornale dell 'lstituto Italiano degli Attuari 4, 42 1-424.
433
References
434
[20] Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics 23, 493-507.
[21] Chernoff, H. (1954). On the distribution of the likelihood ratio statistic. Annals of Mathematical Statistics 25, 573-578.
[22] Chernoff, H., and Lehmann, E.L. (1954). The use of maximum likelihood estimates in χ² tests for goodness of fit. Annals of Mathematical Statistics 25, 579-586.
[23] Chow, Y.S., and Teicher, H. (1978). Probability Theory. Springer-Verlag, New York.
[24] Cohn, D.L. (1980). Measure Theory. Birkhäuser, Boston.
[25] Copas, J. (1975). On the unimodality of the likelihood for the Cauchy distribution. Biometrika 62, 701-704.
[26] Cramér, H. (1938). Sur un nouveau théorème-limite de la théorie des probabilités. Actualités Scientifiques et Industrielles 736, 5-23.
[27] Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.
[28] Csörgő, M. (1983). Quantile Processes with Statistical Applications. CBMS-NSF Regional Conference Series in Applied Mathematics 42. Society for Industrial and Applied Mathematics (SIAM), Philadelphia.
[29] Dacunha-Castelle, D., and Duflo, M. (1993). Probabilités et Statistiques, tome II. Masson, Paris.
[30] Davies, R.B. (1973). Asymptotic inference in stationary Gaussian time-series. Advances in Applied Probability 4, 469-497.
[31] Dembo, A., and Zeitouni, O. (1993). Large Deviation Techniques and Applications. Jones and Bartlett Publishers, Boston.
[32] Deuschel, J.D., and Stroock, D.W. (1989). Large Deviations. Academic Press, New York.
[33] Devroye, L., and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. John Wiley & Sons, New York.
[34] Diaconis, P., and Freedman, D. (1986). On the consistency of Bayes estimates. Annals of Statistics 14, 1-26.
[35] Donald, S.G., and Newey, W.K. (1994). Series estimation of semilinear models. Journal of Multivariate Analysis 50, 30-40.
[36] Donoho, D.L., and Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425-455.
[37] Donoho, D.L., and Liu, R.C. (1991). Geometrizing rates of convergence II, III. Annals of Statistics 19, 633-701.
[38] Donsker, M.D. (1952). Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Annals of Mathematical Statistics 23, 277-281.
[39] Doob, J. (1948). Application of the theory of martingales. Le Calcul des Probabilités et ses Applications. Colloques Internationaux du CNRS, Paris, 22-28.
[40] Drost, F.C. (1988). Asymptotics for Generalized Chi-Square Goodness-of-Fit Tests. CWI Tract 48. Centrum voor Wiskunde en Informatica, Amsterdam.
[41] Dudley, R.M. (1976). Probability and Metrics: Convergence of Laws on Metric Spaces. Mathematics Institute Lecture Notes Series 45. Aarhus University, Denmark.
[42] Dudley, R.M. (1989). Real Analysis and Probability. Wadsworth, Belmont, California.
[43] Dupač, V., and Hájek, J. (1969). Asymptotic normality of simple linear rank statistics under alternatives II. Annals of Mathematical Statistics 40, 1992-2017.
[44] Efron, B., and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman and Hall, London.
[45] Fahrmeir, L., and Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Annals of Statistics 13, 342-368. (Correction: Annals of Statistics 14, 1643.)
[46] Farrell, R.H. (1972). On the best obtainable asymptotic rates of convergence in estimation of a density function at a point. Annals of Mathematical Statistics 43, 170-180.
[47] Feller, W. (1971). An Introduction to Probability Theory and Its Applications, vol. II. John Wiley & Sons, New York.
[48] Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 222, 309-368.
[49] Fisher, R.A. (1924). The conditions under which χ² measures the discrepancy between observations and hypothesis. Journal of the Royal Statistical Society 87, 442-450.
[50] Fisher, R.A. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society 22, 700-725.
[51] van de Geer, S.A. (1988). Regression Analysis and Empirical Processes. CWI Tract 45. Centrum voor Wiskunde en Informatica, Amsterdam.
[52] Ghosh, J.K. (1994). Higher Order Asymptotics. Institute of Mathematical Statistics, Hayward.
[53] Gill, R.D. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method (part I). Scandinavian Journal of Statistics 16, 97-128.
[54] Gill, R.D. (1994). Lectures on survival analysis. Lecture Notes in Mathematics 1581, 115-241.
[55] Gill, R.D., and Johansen, S. (1990). A survey of product-integration with a view towards application in survival analysis. Annals of Statistics 18, 1501-1555.
[56] Gill, R.D., and van der Vaart, A.W. (1993). Non- and semi-parametric maximum likelihood estimators and the von Mises method (part II). Scandinavian Journal of Statistics 20, 271-288.
[57] Giné, E., and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. Lecture Notes in Mathematics 1221, 50-113.
[58] Giné, E., and Zinn, J. (1990). Bootstrapping general empirical measures. Annals of Probability 18, 851-869.
[59] Glivenko, V. (1933). Sulla determinazione empirica della legge di probabilità. Giornale dell'Istituto Italiano degli Attuari 4, 92-99.
[60] Greenwood, P.E., and Nikulin, M.S. (1996). A Guide to Chi-Squared Testing. John Wiley & Sons, New York.
[61] Groeneboom, P. (1980). Large Deviations and Bahadur Efficiencies. MC Tract 118. Centrum voor Wiskunde en Informatica, Amsterdam.
[62] Groeneboom, P. (1985). Estimating a monotone density. Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer 2, 539-555. Wadsworth, Monterey, California.
[63] Groeneboom, P. (1988). Brownian motion with a parabolic drift and Airy functions. Probability Theory and Related Fields 81, 79-109.
[64] Groeneboom, P., and Lopuhaä, H.P. (1993). Isotonic estimators of monotone densities and distribution functions: basic facts. Statistica Neerlandica 47, 175-183.
[65] Groeneboom, P., Oosterhoff, J., and Ruymgaart, F. (1979). Large deviation theorems for empirical probability measures. Annals of Probability 7, 553-586.
[66] de Haan, L. (1976). Sample extremes: An elementary introduction. Statistica Neerlandica 30, 161-172.
[67] Hájek, J. (1961). Some extensions of the Wald-Wolfowitz-Noether theorem. Annals of Mathematical Statistics 32, 506-523.
[68] Hájek, J. (1968). Asymptotic normality of simple linear rank statistics under alternatives. Annals of Mathematical Statistics 39, 325-346.
[69] Hájek, J. (1970). A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 14, 323-330.
[70] Hájek, J. (1972). Local asymptotic minimax and admissibility in estimation. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability 1, 175-194.
[71] Hájek, J., and Šidák, Z. (1967). Theory of Rank Tests. Academic Press, New York.
[72] Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer Series in Statistics. Springer-Verlag, New York.
[73] Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
[74] Helmers, R. (1982). Edgeworth Expansions for Linear Combinations of Order Statistics. Mathematical Centre Tracts 105. Mathematisch Centrum, Amsterdam.
[75] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics 19, 293-325.
[76] Hoffmann-Jørgensen, J. (1991). Stochastic Processes on Polish Spaces. Various Publication Series 39. Aarhus Universitet, Aarhus, Denmark.
[77] Huang, J. (1996). Efficient estimation for the Cox model with interval censoring. Annals of Statistics 24, 540-568.
[78] Huber, P. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1, 221-233. University of California Press, Berkeley.
[79] Huber, P. (1974). Robust Statistics. Wiley, New York.
[80] Ibragimov, I.A., and Has'minskii, R.Z. (1981). Statistical Estimation: Asymptotic Theory. Springer-Verlag, New York.
[81] Jagers, P. (1975). Branching Processes with Biological Applications. John Wiley & Sons, London-New York-Sydney.
[82] Janssen, A., and Mason, D.M. (1990). Nonstandard Rank Tests. Lecture Notes in Statistics 65. Springer-Verlag, New York.
[83] Jensen, J.L. (1993). A historical sketch and some new results on the improved log likelihood ratio statistic. Scandinavian Journal of Statistics 20, 1-15.
[84] Kallenberg, W.C.M. (1983). Intermediate efficiency, theory and examples. Annals of Statistics 11, 498-504.
[85] Kallenberg, W.C.M., and Ledwina, T. (1987). On local and nonlocal measures of efficiency. Annals of Statistics 15, 1401-1420.
[86] Kallenberg, W.C.M., Oosterhoff, J., and Schriever, B.F. (1980). The number of classes in chi-squared goodness-of-fit tests. Journal of the American Statistical Association 80, 959-968.
[87] Kim, J., and Pollard, D. (1990). Cube root asymptotics. Annals of Statistics 18, 191-219.
[88] Kolcinskii, V.I. (1981). On the central limit theorem for empirical measures. Theory of Probability and Mathematical Statistics 24, 71-82.
[89] Koul, H.L., and Pflug, G.C. (1990). Weakly adaptive estimators in explosive autoregression. Annals of Statistics 18, 939-960.
[90] Leadbetter, M.R., Lindgren, G., and Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer-Verlag, New York.
[91] Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. University of California Publications in Statistics 1, 277-330.
[92] Le Cam, L. (1960). Locally asymptotically normal families of distributions. University of California Publications in Statistics 3, 37-98.
[93] Le Cam, L. (1969). Théorie Asymptotique de la Décision Statistique. Les Presses de l'Université de Montréal, Montréal.
[94] Le Cam, L. (1970). On the assumptions used to prove asymptotic normality of maximum likelihood estimates. Annals of Mathematical Statistics 41, 802-828.
[95] Le Cam, L. (1972). Limits of experiments. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability 1, 245-261. University of California Press, Berkeley.
[96] Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York.
[97] Le Cam, L.M., and Yang, G. (1990). Asymptotics in Statistics: Some Basic Concepts. Springer-Verlag, New York.
[98] Ledoux, M., and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, Berlin.
[99] Lehmann, E.L. (1983). Theory of Point Estimation. Wiley, New York.
[100] Lehmann, E.L. (1991). Testing Statistical Hypotheses, 2nd edition. Wiley, New York.
[101] Levit, B.Y. (1978). Infinite-dimensional informational lower bounds. Theory of Probability and its Applications 23, 388-394.
[102] Mammen, E., and van de Geer, S.A. (1997). Penalized quasi-likelihood estimation in partial linear models. Annals of Statistics 25, 1014-1035.
[103] Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Annals of Probability 18, 1269-1283.
[104] von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. Annals of Mathematical Statistics 18, 309-348.
[105] Murphy, S.A., Rossini, T.J., and van der Vaart, A.W. (1997). MLE in the proportional odds model. Journal of the American Statistical Association 92, 968-976.
[106] Murphy, S.A., and van der Vaart, A.W. (1997). Semiparametric likelihood ratio inference. Annals of Statistics 25, 1471-1509.
[107] Murphy, S.A., and van der Vaart, A.W. (1996). Semiparametric mixtures in case-control studies.
[108] Murphy, S.A., and van der Vaart, A.W. (1996). Likelihood ratio inference in the errors-in-variables model. Journal of Multivariate Analysis 59, 81-108.
[109] Noether, G.E. (1955). On a theorem of Pitman. Annals of Mathematical Statistics 25, 64-68.
[110] Nussbaum, M. (1996). Asymptotic equivalence of density estimation and Gaussian white noise. Annals of Statistics 24, 2399-2430.
[111] Ossiander, M. (1987). A central limit theorem under metric entropy with L2 bracketing. Annals of Probability 15, 897-919.
[112] Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5, 50, 157-175. (Reprinted in: Karl Pearson's Early Statistical Papers, Cambridge University Press, 1956.)
[113] Pfanzagl, J., and Wefelmeyer, W. (1982). Contributions to a General Asymptotic Statistical Theory. Lecture Notes in Statistics 13. Springer-Verlag, New York.
[114] Pfanzagl, J., and Wefelmeyer, W. (1985). Asymptotic Expansions for General Statistical Models. Lecture Notes in Statistics 31. Springer-Verlag, New York.
[115] Pflug, G.C. (1983). The limiting loglikelihood process for discontinuous density families. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 64, 15-35.
[116] Pollard, D. (1982). A central limit theorem for empirical processes. Journal of the Australian Mathematical Society A 33, 235-248.
[117] Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.
[118] Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory 1, 295-314.
[119] Pollard, D. (1989). A maximal inequality for sums of independent processes under a bracketing condition.
[120] Pollard, D. (1990). Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics 2. Institute of Mathematical Statistics and American Statistical Association, Hayward, California.
[121] Prakasa Rao, B.L.S. (1983). Nonparametric Functional Estimation. Academic Press, Orlando.
[122] Qin, J., and Lawless, J. (1994). Empirical likelihood and general estimating equations. Annals of Statistics 22, 300-325.
[123] Rao, C.R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
[124] Reed, M., and Simon, B. (1980). Functional Analysis. Academic Press, Orlando.
[125] Reeds, J.A. (1976). On the Definition of von Mises Functionals. Ph.D. dissertation, Department of Statistics, Harvard University, Cambridge, MA.
[126] Reeds, J.A. (1985). Asymptotic number of roots of Cauchy location likelihood equations. Annals of Statistics 13, 775-784.
[127] Révész, P. (1968). The Laws of Large Numbers. Academic Press, New York.
[128] Robins, J.M., and Ritov, Y. (1997). Towards a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine 16, 285-319.
[129] Robins, J.M., and Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology: Methodological Issues, 297-331, eds: N. Jewell, K. Dietz, and V. Farewell. Birkhäuser, Boston.
[130] Roussas, G.G. (1972). Contiguity of Probability Measures. Cambridge University Press, Cambridge.
[131] Rubin, H., and Vitale, R.A. (1980). Asymptotic distribution of symmetric statistics. Annals of Statistics 8, 165-170.
[132] Rudin, W. (1973). Functional Analysis. McGraw-Hill, New York.
[133] Shiryayev, A.N. (1984). Probability. Springer-Verlag, New York-Berlin.
[134] Shorack, G.R., and Wellner, J.A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York.
[135] Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
[136] Stigler, S.M. (1974). Linear functions of order statistics with smooth weight functions. Annals of Statistics 2, 676-693. (Correction: Annals of Statistics 7, 466.)
[137] Stone, C.J. (1990). Large-sample inference for log-spline models. Annals of Statistics 18, 717-741.
[138] Strasser, H. (1985). Mathematical Theory of Statistics. Walter de Gruyter, Berlin.
[139] van der Vaart, A.W. (1988). Statistical Estimation in Large Parameter Spaces. CWI Tract 44. Centrum voor Wiskunde en Informatica, Amsterdam.
[140] van der Vaart, A.W. (1991). On differentiable functionals. Annals of Statistics 19, 178-204.
[141] van der Vaart, A.W. (1991). An asymptotic representation theorem. International Statistical Review 59, 97-121.
[142] van der Vaart, A.W. (1994). Limits of Experiments. Lecture Notes, Yale University.
[143] van der Vaart, A.W. (1995). Efficiency of infinite dimensional M-estimators. Statistica Neerlandica 49, 9-30.
[144] van der Vaart, A.W. (1994). Maximum likelihood estimation with partially censored observations. Annals of Statistics 22, 1896-1916.
[145] van der Vaart, A.W. (1996). Efficient estimation in semiparametric models. Annals of Statistics 24, 862-878.
[146] van der Vaart, A.W., and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
[147] Vapnik, V.N., and Červonenkis, A.Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications 16, 264-280.
[148] Wahba, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics 59. Society for Industrial and Applied Mathematics, Philadelphia.
[149] Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society 54, 426-482.
[150] Wilks, S.S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics 9, 60-62.
[151] van Zwet, W.R. (1984). A Berry-Esseen bound for symmetric statistics. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 66, 425-440.
Index
α-Winsorized means, 316
α-trimmed means, 316
absolute rank, 181
absolutely continuous, 85, 268
  continuous part, 85
accessible, 150
adaptation, 223
adjoint map, 361
adjoint score operator, 372
antiranks, 184
Assouad's lemma, 347
asymptotic
  consistent, 44, 329
  differentiable, 106
  distribution free, 164
  efficient, 64, 367, 387
  equicontinuity, 262
  influence function, 58
  of level α, 192
  linear, 401
  lower bound, 108
  measurable, 260
  relative efficiency, 195
  risk, 109
  tight, 260
  tightness, 262
  uniformly integrable, 17
Bahadur
  efficiency, 203
  relative efficiency, 203
  slope, 203, 239
Bahadur-Kiefer theorems, 310
Banach space, 361
bandwidth, 342
Bartlett correction, 238
Bayes
  estimator, 138
  risk, 138
Bernstein's inequality, 285
best regular, 115
bilinear map, 295
bootstrap
  empirical distribution, 332
  empirical process, 332
  parametric, 328, 340
Borel
  σ-field, 256
  measurable, 256
bounded
  Lipschitz metric, 332
  in probability, 8
bowl-shaped, 113
bracket, 270
bracketing
  integral, 270
  number, 270
Brownian
  bridge, 266
  bridge process, 168
  motion, 268
cadlag, 257
canonical link functions, 235
Cartesian product, 257
Cauchy sequence, 255
central
  limit theorem, 6
  moments, 27
chain rule, 298
chaining argument, 285
characteristic function, 13
chi-square distribution, 242
Chibisov-O'Reilly theorem, 273
closed, 255
closure, 255
coarsening at random (CAR), 379
coefficients, 173
compact, 255, 424
compact differentiability, 297
complete, 255
completion, 257
concordant, 164
conditional expectation, 155
consistent, 3, 149, 193, 329
contiguous, 87
continuity set, 7
continuous, 255
continuously differentiable, 297
converge
  almost surely, 6
  in distribution, 5, 258
  in probability, 5
  weakly, 305
convergence
  in law, 5
  of sets, 101, 232
convergence-determining, 18
converges, 255
  almost surely, 258
  in distribution, 258
  to a limit experiment, 126
  in probability, 258
coordinate projections, 257
Cornish-Fisher expansion, 338
covering number, 274
Cox partial likelihood, 404
Cramér's
  condition, 335
  theorem, 208
Cramér-von Mises statistic, 171, 277, 295
Cramér-Wold device, 16
critical region, 192
cross-validation, 346
cumulant generating function, 205
cumulative hazard function, 180, 300
defective distribution function, 9
deficiency distance, 137
degenerate, 167
dense, 255
deviance, 234
differentiable, 25, 363, 387
  in quadratic mean, 64, 94
differentially equivalent, 106
discretization, 72
discretized, 72
distance function, 255
distribution
  free, 174
  function, 5
dominated, 127
Donsker, 269
dual space, 361, 387
Dzhaparidze-Nikulin statistic, 254
ε-bracket, 270
Edgeworth expansions, 335
efficient
  function, 369, 373
  influence function, 363
  information matrix, 369
  score equations, 391
Efron's percentile method, 327
ellipsoidally symmetric, 84
empirical
  bootstrap, 328
  difference process, 309
  distribution, 42, 269
  distribution function, 265
  likelihood, 402
  process, 42, 266, 269
entropy, 274
  with bracketing, 270
  integral, 274
  uniform, 274
envelope function, 270
equivalent, 126
equivariant-in-law, 113
errors-in-variables, 83
estimating equations, 41
experiment, 125
exponential
  family, 37
  inequality, 285
finite approximation, 261
Fisher
  information for location, 96
  information matrix, 39
Fréchet differentiable, 297
full rank, 38
Gâteaux differentiable, 296
Gaussian chaos, 167
generalized inverse, 254
Gini's mean difference, 171
Glivenko-Cantelli, 46
Glivenko-Cantelli theorem, 265
good rate function, 209
goodness-of-fit, 248, 277
gradient, 26
Grenander estimator, 356
Hájek projection, 157
Hadamard differentiable, 296
Hamming distance, 347
Hellinger
  affinity, 211
  distance, 211
  integral, 213
  statistic, 244
Hermite polynomial, 169
Hilbert space, 360
Hoeffding decomposition, 157
Huber estimators, 43
Hungarian embeddings, 269
hypothesis of independence, 247
identifiable, 62
improper, 138
influence function, 292
information operator, 372
interior, 255
interquartile range, 317
joint convergence, 11
(k × r) table, 247
Kaplan-Meier estimator, 302
Kendall's τ-statistic, 164
kernel, 161, 342
  estimator, 342
  method, 341
Kolmogorov-Smirnov, 265, 277
Kruskal-Wallis, 181
Kullback-Leibler divergence, 56, 62
kurtosis, 27
L-statistic, 316
Lagrange multipliers, 214
LAN, 104
large
  deviation, 203
  deviation principle, 209
law of large numbers, 6
Le Cam's third lemma, 90
least
  concave majorant, 349
  favorable, 362
Lebesgue decomposition, 85
Lehmann alternatives, 180
level, 192
  α, 215
likelihood
  ratio, 86
  ratio process, 126
  ratio statistic, 228
linear signed rank statistic, 221
link function, 234
Lipschitz, 6
local
  criterion functions, 79
  empirical measure, 283
  limiting power, 194
  parameter, 92
  parameter space, 101
locally asymptotically
  minimax, 120
  mixed normal, 131
  normal, 104
  quadratic, 132
locally most powerful, 179
  scores, 222, 225
  signed rank statistics, 183
  test, 190
log rank test, 180
logit model, 66
loss function, 109, 113
M-estimator, 41
Mann-Whitney statistic, 166, 175
marginal
  convergence, 11
  vectors, 261
  weak convergence, 126
Markov's inequality, 10
Marshall's lemma, 357
maximal inequality, 76, 285
maximum
  likelihood estimator, 42
  likelihood statistic, 231
mean
  absolute deviation, 280
  integrated square error, 344
median
  absolute deviation, 310
  absolute deviation from the median, 60
  test, 178
metric, 255
  space, 255
midrank, 173
Mill's ratio, 313
minimax criterion, 113
minimum chi-square estimator, 244
missing at random (MAR), 379
mode, 355
model, 358
mutually contiguous, 87
√n-consistent, 72
natural
  level, 190
  parameter space, 38
nearly maximize, 45
Nelson-Aalen estimator, 301
Newton-Raphson, 71
noncentral chi-square distribution, 237
noncentrality parameter, 217
nonparametric
  maximum likelihood estimator, 403
  model, 341, 358
norm, 255
normed space, 255
nuisance parameter, 358
observed
  level, 239
  significance level, 203
odds, 406
offspring distribution, 133
one-step
  estimator, 72
  method, 71
open, 255
  ball, 255
operators, 361
order statistics, 173
orthocomplement, 361
orthogonal, 85, 153, 361
  part, 85
outer probability, 258
P-Brownian bridge, 269
P-Donsker, 269
P-Glivenko-Cantelli, 269
pth sample quantile, 43
parametric
  bootstrap, 328, 340
  models, 341
Pearson statistic, 242
percentile
  t-method, 327
  method, 327
perfect, 211
permutation tests, 188
Pitman
  efficiency, 201
  relative efficiency, 202
polynomial classes, 275
pooled sample, 174
posterior distribution, 138
power function, 215
prior distribution, 138
probability integral transformation, 305
probit model, 66
product
  integral, 300
  limit estimator, 302, 407
profile likelihood, 403
projection, 153
  lemma, 361
proportional hazards, 180
quantile, 43
  function, 304, 306
  transformation, 305
random
  element, 256
  vector, 5
randomized statistic, 98, 127
range, 317
rank, 164, 173
  correlation coefficient, 184
  statistic, 173
Rao-Robson-Nikulin statistic, 250
rate function, 209
rate-adaptive, 346
regular, 115, 365
regularity, 340
relative efficiency, 111, 201
right-censored data, 301
robust statistics, 43
sample
  correlation coefficient, 30
  path, 260
  space, 125
Sanov's theorem, 209
Savage test, 180
score, 173, 221
  function, 42, 63, 362
  operator, 371
  statistic, 231
  tests, 220
semimetric, 255
seminorm, 256
semiparametric models, 358
separable, 255
shatter, 275
shrinkage estimator, 119
Siegel-Tukey, 191
sign, 181
  statistic, 183, 193, 221
sign-function, 43
signed rank statistic, 181
signed rank test, 164
simple linear rank statistics, 173
single-index, 359
singular part, 86
size, 192, 215
skewness, 29
Skorohod space, 257
slope, 195, 218
smoothing method, 342
solid, 237
spectral density, 105
standard (or uniform) Brownian bridge, 266
statistical experiment, 92
Stein's lemma, 213
stochastic process, 260
Strassen's theorem, 268
strong
  approximations, 268
  law of large numbers, 6
strongly
  consistent, 149
  degenerate, 167
subconvex, 113
subgraphs, 275
τ-topology, 209
tangent set, 362
  for η, 369
tangent space, 362
tangentially, 297
test, 191, 215
  function, 215
tight, 8, 260
total variation, 22
total variation distance, 211
totally bounded, 255
two-sample U-statistic, 165
variance stabilizing transformation, 30
VC class, 275
VC index, 275
Volterra equation, 408
von Mises expansion, 292
U-statistic, 161
unbiased, 226
uniform
  covering numbers, 274
  entropy integral, 274
  Glivenko-Cantelli, 145
  norm, 257
uniformly
  most powerful test, 216
  tight, 8
unimodal, 355
uprank, 173
Wald
  statistic, 231
  tests, 220
Watson-Roy statistic, 251
weak
  convergence, 5
  law of large numbers, 6
weighted empirical process, 273
well-separated, 45
Wilcoxon
  signed rank statistic, 183, 221
  statistic, 175
  two-sample statistic, 167
window, 342
V-statistic, 172, 295, 303
van der Waerden statistic, 175
Vapnik-Červonenkis classes, 274
Z-estimators, 41