Contents

Eustasio del Barrio
Empirical and Quantile Processes in the Asymptotic Theory of Goodness-of-fit Tests
1 Introduction
2 Testing fit to a fixed distribution
  2.1 Some notes on principal component decompositions of quadratic statistics
3 Some tools from empirical processes theory
  3.1 Some inequalities
  3.2 The Central Limit Theorem
  3.3 Strong approximations
4 Testing fit to a family of distributions
  4.1 Adaptation of tests coming from the fixed-distribution setup
  4.2 The empirical process with estimated parameters
  4.3 Correlation and regression tests
5 Tests based on Wasserstein distance
6 Asymptotics for L2-functionals of the quantile process
  6.1 Weak convergence of L2 linear combinations of exponential r.v.’s
  6.2 Weak convergence of weighted L2 functionals of the quantile process
  6.3 Weighted Wasserstein tests of fit to location-scale families of distributions
References

Paul Deheuvels
Topics on Empirical Processes
1 Introduction, notation and preliminaries
  1.1 Introduction
  1.2 Distribution and quantile functions
  1.3 Topologies on spaces of measures and functions
  1.4 The quantile transform
2 Fluctuations of partial sums
  2.1 Some large deviations theory
  2.2 Martingale inequalities
  2.3 Some useful examples of large deviation inequalities
  2.4 Strong approximations of partial sums of i.i.d. random variables
  2.5 The Erdős-Rényi theorem
3 Empirical Processes
  3.1 Uniform empirical distribution and quantile functions
  3.2 Uniform empirical and quantile processes
  3.3 Some further martingale inequalities
  3.4 Relations between empirical and Poisson processes
  3.5 Strong approximations of empirical and quantile processes
  3.6 Some results for weighted processes
  3.7 Finkelstein’s theorem via invariance principles
  3.8 Local and tail empirical process
  3.9 Modulus of continuity of αn and βn
  3.10 The Bahadur-Kiefer representation
  3.11 Application to density estimation
4 Auxiliary results
  4.1 Some Gaussian process theory
  4.2 A functional LIL for superpositions of Gaussian processes
  4.3 Karhunen-Loève expansions
  4.4 The RKHS of the Wiener process and Brownian bridge
  4.5 KL expansions for weighted Wiener processes and Brownian bridges
  4.6 Bessel functions
References

Sara van de Geer
Oracle Inequalities and Regularization
1 Statistical models
  1.1 Parametric models
  1.2 The empirical distribution
  1.3 Regularization
  1.4 Bibliographical remarks
2 M-estimators
  2.1 Some examples
  2.2 General framework
  2.3 Estimation and approximation error
  2.4 Where empirical process theory comes in
  2.5 Some first results, assuming ready-to-use empirical process theory
  2.6 Balancing estimation and approximation error
  2.7 Bibliographical remarks
3 The sequence space formulation
  3.1 Reformulation of the regression problem
  3.2 Estimating the mean of a normal vector
  3.3 A collection of models
  3.4 The model an oracle would select
  3.5 Hard- and soft-thresholding
  3.6 A probability inequality for the empirical process
  3.7 Bibliographical remarks
4 Overruling the variance
  4.1 Estimation and approximation error
  4.2 Finite models
  4.3 Nested, finite models
  4.4 General penalties
  4.5 Application to the “classical” penalty of Chapter 1
  4.6 Bibliographical remarks
5 The ℓ1-penalty
  5.1 Robust regression
  5.2 Density estimation
  5.3 Binary classification
  5.4 The behavior on Ξn
  5.5 Bibliographical remarks
6 Tools from empirical process theory
  6.1 Concentration inequalities
  6.2 Symmetrization
  6.3 Contraction principle
  6.4 Weighted empirical processes
  6.5 The case of a Lipschitz transformation of a linear space
  6.6 Modulus of continuity of the empirical process
  6.7 Bibliographical remarks
References

Empirical and Quantile Processes in the Asymptotic Theory of Goodness-of-fit Tests

Eustasio del Barrio

Abstract. This text contains the material presented at the European Mathematical Society Summer School on Theory and Statistical Applications of Empirical Processes in Laredo, Spain in August-September 2004.

1. Introduction

In 1900 Karl Pearson proposed the first test of goodness of fit: the χ² test. The subsequent research devoted to improving this elementary goodness-of-fit procedure became a major source of motivation for the development of key areas in Probability and Statistics, such as the theory of weak convergence in general spaces and the asymptotic theory of empirical processes. In this course we will analyze some suggestive aspects of the development of the asymptotic theory of goodness-of-fit tests over the last century. We will pay special attention to the parallel evolution of the theory of empirical processes and the asymptotic theory of goodness-of-fit tests. This evolution is a good indicator of the vast transformation that Probability and Statistics experienced during that century, and the names of those who contributed to the theory are the main guarantee for this assertion. Pearson, Fisher, Cramér, von Mises, Kolmogorov, Smirnov, Feller, . . . laid the foundations of the theory. In some cases, the mathematical derivation of the asymptotic distribution of goodness-of-fit tests in that period had the added merit that, in a certain sense, the limit law was blindly pursued. In Mathematics the main difficulty in showing convergence is, without doubt, obtaining a convincing candidate for the limit. Thus, the proofs of that period can be considered major pieces of precision and inventiveness. A systematic method for handling adequate candidates for the limit law begins in 1950 with the heuristic work by Doob [36], made precise by Donsker through the Invariance Principle. The subsequent construction of adequate metric spaces and the development of the corresponding weak convergence theory as the right
probabilistic setup for the study of asymptotic distributions spread widely and quickly, with notable advances due to Prohorov and Skorohod, among others. The contribution of Billingsley’s book [7] to this diffusion must also be pointed out. The study of Probability in Banach spaces has been another source of useful results for goodness-of-fit theory. The names of Varadhan, Dudley, Araujo, Giné, Zinn, Ledoux, Talagrand, . . . are necessary references for anyone interested in asymptotics in Statistics. For example, the Central Limit Theorem in Hilbert spaces played a main role in obtaining the asymptotic behavior of Cramér-von Mises type statistics. Last, we must indicate the significance of the “Hungarian school”, which developed the strong approximation techniques initiated by Skorohod with his “embedding”. Breiman’s book [8] had the merit of initially spreading Skorohod’s embedding. Now, the strong approximations due to Komlós, Major, Tusnády, M. and S. Csörgő, Révész, Deheuvels, Horváth, Mason, . . . are an invaluable tool in the study of asymptotics in Statistics, as we will point out in this course.

2. Testing fit to a fixed distribution

The simplest goodness-of-fit problem consists of testing fit to a single fixed distribution, namely, given a random sample of real r.v.’s $X_1, X_2, \dots, X_n$ with common d.f. $F$, testing the null hypothesis $H_0: F = F_0$ for a fixed d.f. $F_0$. While this procedure is usually of limited interest in applications, the solutions proposed to this problem provided the main ideas for subsequent generalizations designed for testing fit to composite null hypotheses.

Pearson’s chi-squared test can be considered as the first approach to the problem of testing fit to a fixed distribution. The solution proposed by Pearson consisted of dividing the real line into $k$ disjoint categories or “cells” $C_1, \dots, C_k$ into which data would fall, under the null hypothesis, with probabilities $p_1, \dots, p_k$. That is, if $H_0$ were true, then $P(X_1 \in C_i) = p_i$, $i = 1, \dots, k$. If $O_i$ is the number of observations in cell $i$, then $O_i$ has a binomial distribution with parameters $n$ and $p_i$; hence, the De Moivre-Laplace central limit theorem (C.L.T.) states that $(np_i(1-p_i))^{-1/2}(O_i - np_i) \to_w N(0,1)$. The multivariate C.L.T. shows that, if $l \le k$, then $B_l = n^{-1/2}(O_1 - np_1, \dots, O_l - np_l)^T$ has a limit distribution which is centered Gaussian with covariance matrix $\Sigma_l$ whose $(i,j)$ element, $\sigma_{i,j}$, satisfies $\sigma_{i,j} = -p_i p_j$ for $i \ne j$ and $\sigma_{i,i} = p_i(1-p_i)$. On the other hand, if $p_i > 0$, $i = 1, \dots, k$, then $\Sigma_{k-1}$ is nondegenerate and $\Sigma_{k-1}^{-1}$ has $(i,j)$ element $\nu_{i,j}$ satisfying $\nu_{i,j} = p_k^{-1}$ for $i \ne j$ and $\nu_{i,i} = p_i^{-1} + p_k^{-1}$. Simple matrix algebra shows that $B_{k-1}^T \Sigma_{k-1}^{-1} B_{k-1}$ converges in law to a $\chi^2_{k-1}$ distribution. Further, straightforward computations show that

$$\chi^2 := \sum_{j=1}^{k} \frac{(O_j - np_j)^2}{np_j} = B_{k-1}^T \Sigma_{k-1}^{-1} B_{k-1},$$

thus providing a well-known result in the asymptotic theory of tests of fit:

Theorem 2.1. Under $H_0$, $\chi^2$ has asymptotic distribution $\chi^2_{k-1}$.

Theorem 2.1 reduces the problem of testing fit to a fixed distribution to the analysis of a multinomial distribution, thus providing a widely applicable and easy-to-use method for testing fit which immediately carries over to the multivariate setup. Moreover, this test also allows some freedom in choosing the number, the location or the size of the cells $C_1, \dots, C_k$. This point will be discussed in the next section. Still, as pointed out by many authors (see, e.g., [68]), considering only the cell frequencies when $F$ is continuous produces a loss of information that results in lack of power (the $\chi^2$ statistic will not distinguish two different distributions sharing the same cell probabilities). Therefore, in order to improve our method for testing fit we should try to make use of the complete information provided by the data. However, the multivariate C.L.T. and elementary matrix algebra were the only tools needed in the derivation of the asymptotic distribution in Theorem 2.1. This will not be the case when handling more complicated statistics.

A way to improve Pearson’s statistic consists of employing a functional distance to measure the discrepancy between the hypothesized d.f. $F_0$ and the empirical d.f. $F_n$. The first representatives of this method were proposed in the late 1920s and in the 1930s. Cramér [15] and, in a more general form, von Mises [103] proposed

$$\omega_n^2 = n \int_{-\infty}^{\infty} (F_n(x) - F_0(x))^2 \rho(x)\, dx$$

for some suitable weight function $\rho$ as an adequate measure of discrepancy. Kolmogorov [53] studied

$$D_n = \sqrt{n} \sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|$$

and Smirnov [90], [91] the closely related statistics

$$D_n^+ = \sqrt{n} \sup_{-\infty < x < \infty} (F_n(x) - F_0(x)), \qquad D_n^- = \sqrt{n} \sup_{-\infty < x < \infty} (F_0(x) - F_n(x)),$$

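more adequate for tests against one-sided alternatives. The statistics $D_n$, $D_n^+$ and $D_n^-$ are known as Kolmogorov-Smirnov statistics and have the advantage of being distribution-free: for any continuous d.f. $F_0$, $D_n$ has (under $H_0$) the same distribution as $\sup_{0<t<1} |\alpha_n(t)|$, where $\alpha_n$ denotes the uniform empirical process, that is, the empirical process associated with a sample of i.i.d. r.v.’s uniformly distributed on $(0,1)$. This distribution-free property is not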
satisfied by $\omega_n^2$, but it also holds for the following modification:

$$W_n^2(\Psi) = n \int_{-\infty}^{\infty} \Psi(F_0(x)) (F_n(x) - F_0(x))^2 \, dF_0(x), \qquad (2.1)$$

which was proposed by Smirnov [88], [89]. All the statistics which can be obtained by varying $\Psi$ are usually referred to as statistics of Cramér-von Mises type. Consideration of different weight functions $\Psi$ allows the statistician to put special emphasis on the detection of particular sets of alternatives. For this reason, some weighted versions of Kolmogorov’s statistics have also been proposed, namely,

$$K_n(\Psi) = \sqrt{n} \sup_{-\infty < x < \infty} \frac{|F_n(x) - F_0(x)|}{\Psi(F_0(x))}. \qquad (2.2)$$
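
For a continuous $F_0$, the suprema and the integral in these definitions reduce to finite computations on the transformed order statistics $u_{(i)} = F_0(X_{(i)})$. The following Python sketch illustrates this for $D_n^+$, $D_n^-$, $D_n$ and for $W_n^2(\Psi)$ in (2.1) with $\Psi \equiv 1$. The names `fit_statistics` and `uniform_cdf` are illustrative choices, not notation from the text, and the closed-form expressions are the usual computing formulas, which assume that $F_0$ is continuous.

```python
import math


def fit_statistics(sample, cdf):
    """Return (D_n^+, D_n^-, D_n, W_n^2) for testing fit to the continuous d.f. `cdf`.

    Assumption: `cdf` is the hypothesized d.f. F_0 and is continuous, so the
    suprema are attained at the jump points of the empirical d.f.
    """
    n = len(sample)
    u = sorted(cdf(x) for x in sample)  # u_(1) <= ... <= u_(n)
    root_n = math.sqrt(n)
    # With 0-based index i, the order statistic u[i] has rank i + 1.
    d_plus = root_n * max((i + 1) / n - u[i] for i in range(n))
    d_minus = root_n * max(u[i] - i / n for i in range(n))
    d = max(d_plus, d_minus)
    # Computing formula for n * int (F_n - F_0)^2 dF_0 (the case Psi = 1).
    w2 = 1.0 / (12 * n) + sum((u[i] - (2 * i + 1) / (2 * n)) ** 2 for i in range(n))
    return d_plus, d_minus, d, w2


def uniform_cdf(x):
    """D.f. of the uniform distribution on (0, 1), used as a toy null hypothesis."""
    return min(1.0, max(0.0, x))


d_plus, d_minus, d, w2 = fit_statistics([0.75, 0.25], uniform_cdf)
```

For the two-point sample above the integral defining $W_n^2$ can be evaluated by hand, which gives a simple sanity check of the computing formula.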

The convenience of employing $W_n^2(\Psi)$ instead of $D_n^2$ as a test statistic can be understood by taking into account that $D_n^2$ accounts only for the largest deviation between $F_n(t)$ and $F_0(t)$, while $W_n^2(\Psi)$ is a weighted average of all the deviations between $F_n(t)$ and $F_0(t)$. Thus, as observed by Stephens [96], $W_n^2(\Psi)$ should have more chance of detecting alternatives that, while not presenting a very large deviation from $F_0$ at any point $t$, are moderately far from $F_0$ over a large range of points $t$ (think of location alternatives). These heuristic considerations are confirmed by simulation studies (see [96] for references). Two particular statistics have received special attention in the literature. When $\Psi = 1$,

$$W_n^2 = n \int_{-\infty}^{\infty} (F_n(x) - F_0(x))^2 \, dF_0(x)$$

is called the Cramér-von Mises statistic; when $\Psi(t) = (t(1-t))^{-1}$,

$$A_n^2 = n \int_{-\infty}^{\infty} \frac{(F_n(x) - F_0(x))^2}{F_0(x)(1 - F_0(x))} \, dF_0(x)$$

is called the Anderson-Darling statistic. $A_n^2$ has the additional appeal of weighting the deviations according to their expected value, and this results in a more powerful statistic for testing fit to a fixed distribution, see [96].

To use any of these appealing statistics in practice we should be able to obtain the corresponding significance levels. Smirnov [91], using combinatorial techniques, obtained an explicit expression for the exact distribution of $D_n^+$. Kolmogorov [53] also gave an expression that enabled the tabulation of the distribution of $D_n$. Some more difficulties were found when dealing with the exact distributions of statistics of Cramér-von Mises type. But even in those cases in which some formula made it possible to compute the exact p-values, the interest in obtaining the asymptotic distribution of the test statistic was clear, for it would greatly decrease the computational effort needed to obtain the (approximate) p-values (and this was of crucial importance at the time these tests were proposed). The celebrated first asymptotic results about $D_n$ and $D_n^+$ are summarized in the following theorem:

Theorem 2.2. For every $x > 0$:
i) (Kolmogorov, 1933) $P(D_n \le x) \to \sum_{j=-\infty}^{\infty} (-1)^j e^{-2j^2x^2}$.
ii) (Smirnov, 1941) $P(D_n^+ > x) \to e^{-2x^2}$.
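
Both limit laws in Theorem 2.2 are easy to evaluate numerically, which is what made them useful for approximate p-values. The sketch below evaluates the Kolmogorov series by truncation and the Smirnov tail in closed form. The function names and the truncation point `terms` are illustrative assumptions, and the truncated series is only reliable for moderate and large $x$, since for $x$ close to zero many more terms would be needed.

```python
import math


def kolmogorov_cdf(x, terms=100):
    """Truncated version of sum_{j=-inf}^{inf} (-1)^j exp(-2 j^2 x^2), x > 0.

    The terms for j and -j coincide, so the series equals
    1 + 2 * sum_{j>=1} (-1)^j exp(-2 j^2 x^2).
    """
    if x <= 0:
        return 0.0
    total = 1.0
    for j in range(1, terms + 1):
        total += 2.0 * (-1) ** j * math.exp(-2.0 * j * j * x * x)
    # Guard against tiny negative values caused by truncation and rounding.
    return min(1.0, max(0.0, total))


def smirnov_tail(x):
    """Limit of P(D_n^+ > x) for x > 0."""
    return math.exp(-2.0 * x * x)


approx_level_two_sided = 1.0 - kolmogorov_cdf(1.358)
approx_level_one_sided = smirnov_tail(1.2239)
```

Both quantities computed at the end are close to 0.05, so 1.358 and 1.2239 act as approximate 5% asymptotic critical values for $D_n$ and $D_n^+$, respectively.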

Kolmogorov’s proof of i) was based on the consideration of a limiting diffusion equation. Smirnov used the exact expression of $P(D_n^+ > x)$ to show ii). Smirnov [88] also derived the asymptotic distribution of statistics of Cramér-von Mises type. Feller [40] claimed that Kolmogorov’s and Smirnov’s proofs were “very intricate” and were “based on completely different methods” and presented his paper as an attempt to give “unified proofs” of those theorems (which could provide a systematic method of deriving the asymptotic distribution of other test statistics expressible as a functional of the empirical d.f.). It seemed unnatural that, $D_n$, $D_n^+$ and $W_n^2$ being measures of discrepancy between $F_n$ and $F_0$ based on the same object, namely, the empirical process, a particular technique had to be used in the derivation of the asymptotic distribution of each statistic. Thus, Feller’s paper is a remarkable step in the development of a unified asymptotic theory for tests of fit based on the empirical process. But still, a study of the empirical process itself and of its asymptotic distribution (a concept which would have to be made precise) was not considered and, as claimed in Doob [36], all these proofs (including Feller’s) “conceal to some extent . . . the naturalness of the results (qualitatively at least) and their mutual relations”.

It was Doob [36] who, considering the finite-dimensional distributions, conjectured the convergence of the uniform empirical process to the Brownian bridge. A useful consequence of this fact would be that, under some (not explicit) hypotheses, the derivation of the asymptotic distribution of a functional of the uniform empirical process could be reduced to the derivation of the distribution of the same functional of the Brownian bridge. Doob proved that

$$P\Big(\sup_{0 \le t \le 1} |B(t)| \le x\Big) = \sum_{j=-\infty}^{\infty} (-1)^j e^{-2j^2x^2} \qquad (2.3)$$

and

$$P\Big(\sup_{0 \le t \le 1} B(t) > x\Big) = e^{-2x^2}. \qquad (2.4)$$

Thus, justification of Doob’s conjecture would provide a new simpler proof of the results of Kolmogorov and Smirnov. This justification was given by Donsker by employing his invariance principle in [34] and [35]. His results showed that the distribution of a continuous functional of the partial sum process (obtained from a sequence of i.i.d. r.v.’s with finite second moment) converges to the distribution of the corresponding functional of a Brownian motion, and that the distribution of a continuous functional
of the uniform empirical process converges to the distribution of the corresponding functional of a Brownian bridge. The development of the theory of weak convergence in metric spaces due, among others, to Kolmogorov, Prohorov and Skorohod in the fifties (see [75], [54], [76] and [87]) led to a better understanding of this invariance principle, as presented in [7]. The space $C[0,1]$ was one of the first metric spaces for which this theory was developed, through the work of Prohorov [76]. The scheme consisting of proving the convergence of the finite-dimensional distributions plus a tightness condition made it possible to obtain distributional limit theorems for slight modifications of the partial sum and the uniform empirical processes, because both processes could be approximated by “equivalent” processes obtained from them by linear interpolation, so that all the random objects considered in the limit theorems remained in $C[0,1]$. This last approximation is somewhat artificial. In order to avoid it, a wider space had to be considered. A proper study of the weak convergence of the uniform empirical process could be attempted in the space $D[0,1]$. The fact that the empirical process is not measurable when the uniform norm is considered led to the introduction of a more involved topology, namely the Skorohod topology, which turned $D[0,1]$ into a separable and complete metric space in which the empirical process was measurable. In this setup the weak convergence of the empirical process could be properly stated (see, e.g., [7] p. 141):

Theorem 2.3. If we consider $\alpha_n$ and $B$ as random elements taking values in $D[0,1]$, then $\alpha_n \to_w B$.

Theorem 2.3 made it possible to rederive Theorem 2.2 in a very natural way. Notice that $D_n = \|\alpha_n\|_\infty$ and that the map $x \mapsto \|x\|_\infty$ is continuous for the Skorohod topology outside a set of $B$-measure zero. Thus, we can conclude that $D_n \to_w \|B\|_\infty$ and this, combined with (2.3), gives a proof of the first statement in Theorem 2.2. The same method works for $D_n^+$.

The use of the Skorohod space is not the only way to circumvent the difficulty posed by the nonmeasurability of the empirical process. A different approach to the problem could be based on the following scheme. If we could define, on a rich enough probability space, a sequence of i.i.d. r.v.’s uniformly distributed on $(0,1)$ with associated empirical process $\alpha_n^*(t)$ and a Brownian bridge $B(t)$ such that

$$\sup_{0 \le t \le 1} |\alpha_n^*(t) - B(t)| \to_P 0, \qquad (2.5)$$

then we would trivially obtain that, for any functional $H$ defined on $D[0,1]$ and continuous on $C[0,1]$, $H(\alpha_n^*) \to_P H(B)$, obtaining a new proof of Theorem 2.2. The study of results of type (2.5), generically known as strong approximations, began with the Skorohod embedding, which consists of imitating the partial sum process by
using a Brownian motion evaluated at random times (see [8]). Successive refinements of this idea became one of the most important methodologies in the research related to empirical processes.

Turning back to the applications of Theorem 2.3 in the asymptotic theory of tests of fit, we should note that the functional $x \mapsto \int_0^1 x(t)^2 \, dt$ is also continuous for the Skorohod topology outside a set of $B$-measure zero. We can use this fact to obtain the asymptotic distribution of the Cramér-von Mises statistic. Namely,

$$W_n^2 \to_w \int_0^1 B(t)^2 \, dt. \qquad (2.6)$$

Then a Karhunen-Loève expansion of $B(t)$ makes it easy to compute the characteristic function of $\int_0^1 B(t)^2 \, dt$, and the inversion of this characteristic function makes it possible to tabulate the asymptotic distribution of $W_n^2$ (see, e.g., [86] p. 215 for details). This methodology made unnecessary the involved arguments used by Smirnov to derive the asymptotic distribution of $W_n^2$. A little extra effort extends this method to the derivation of the asymptotic distribution of other statistics of Cramér-von Mises type. As a consequence of the Law of the Iterated Logarithm for the Brownian motion, Anderson and Darling showed in [2] that, provided

$$\int_0^{\delta} \Psi(t)\, t \log\log\frac{1}{t} \, dt \quad \text{and} \quad \int_{\delta}^{1} \Psi(t)(1-t) \log\log\frac{1}{1-t} \, dt$$

are finite for some $\delta \in (0,1)$, the functional $x \mapsto \int_0^1 \Psi(t) x(t)^2 \, dt$ is continuous, with respect to the Skorohod distance, outside a set of $B$-measure zero and, consequently, $W_n^2(\Psi) \to_w \int_0^1 \Psi(t) B(t)^2 \, dt$. This result covers the Anderson-Darling statistic $A_n^2$.

Although all the limit theorems for goodness of fit that we have described so far are based on the weak convergence of the empirical process, considered as a random element taking values in the space of càdlàg functions with the Skorohod topology, plus the continuity of a suitable functional, there is a more natural way to study the asymptotic properties of statistics of Cramér-von Mises type and, more generally, of integral functionals of the empirical process. The uniform empirical process can be viewed as a random element taking values in the separable Hilbert space $L_2((0,1),\Psi)$ of all real, Borel measurable functions $f$ on $(0,1)$ such that $\int_0^1 \Psi(t) f(t)^2 \, dt$ is finite, where we consider the norm given by

$$\|f\|_{2,\Psi}^2 = \int_0^1 \Psi(t) f(t)^2 \, dt.$$

In this setup $W_n^2(\Psi) = \|\alpha_n\|_{2,\Psi}^2$. The theory of probability in Banach spaces, developed in the 60’s and 70’s, turned the problem of studying the asymptotic distribution of $W_n^2(\Psi)$ into an easier task, because the C.L.T. for random elements taking values in $L_2((0,1),\Psi)$ (see, e.g., [3] p. 205, ex. 14) asserts that a sequence
$\{Y_n(t)\}_n$ of i.i.d. $L_2((0,1),\Psi)$-valued random elements verifies

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i(t) \to_w Y(t)$$

if and only if $\int_0^1 E(Y_1(t))^2 \Psi(t) \, dt < \infty$ and, in that case, $Y$ is a Gaussian random element with the same covariance function as $Y_1$. Therefore, if we set

$$Y_i(t) = I_{\{U_i \le t\}} - t, \quad i = 1, \dots, n,$$

then $\alpha_n(t) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i(t)$ and $Y_1(t)$ has the same covariance function as the Brownian bridge $B(t)$. Hence, $\alpha_n \to_w B$ in $L_2((0,1),\Psi)$ if and only if $\int_0^1 t(1-t)\Psi(t) \, dt < \infty$.

Theorem 2.4. (Asymptotic distribution of statistics of Cramér-von Mises type) $\alpha_n$ converges weakly in $L_2((0,1),\Psi)$ if and only if $\int_0^1 t(1-t)\Psi(t) \, dt < \infty$. In that case

$$W_n^2(\Psi) \to_w \int_0^1 \Psi(t) B(t)^2 \, dt. \qquad (2.7)$$
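
As a quick check of the integrability condition in Theorem 2.4, the following worked computation (added for illustration, using only elementary integrals) evaluates it for the two classical weights and for one weight that fails it. The third weight, $\Psi(t) = (t(1-t))^{-2}$, is a hypothetical choice made here only to exhibit a divergent case.

```latex
\begin{align*}
\Psi \equiv 1: \quad
  & \int_0^1 t(1-t)\,dt = \tfrac{1}{2} - \tfrac{1}{3} = \tfrac{1}{6} < \infty, \\
\Psi(t) = (t(1-t))^{-1}: \quad
  & \int_0^1 t(1-t)\,\Psi(t)\,dt = \int_0^1 1\,dt = 1 < \infty, \\
\Psi(t) = (t(1-t))^{-2}: \quad
  & \int_0^1 t(1-t)\,\Psi(t)\,dt
    = \int_0^1 \Big(\frac{1}{t} + \frac{1}{1-t}\Big)\,dt = \infty .
\end{align*}
```

Thus the Cramér-von Mises and Anderson-Darling weights both satisfy the condition, while the heavier weight in the last line does not.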

While the development of probability in Banach spaces provides this final result for quadratic statistics (and we will have a better example of this later in Chapter 6), the use of strong approximations produces a similar result for supremum norm statistics. Chibisov [12] and O’Reilly [70] used the Skorohod embedding and a special representation of the uniform empirical process in terms of a Poisson process (see, e.g., [86] p. 339) to obtain necessary and sufficient conditions for the weak convergence of the empirical process to the Brownian bridge in weighted uniform metrics. If $\Psi$ is a positive function on $(0,1)$, nondecreasing in a neighborhood of 0 and nonincreasing in a neighborhood of 1, and we consider the norm given by $\|x\|_\Psi = \sup_{0<t<1} |x(t)|/\Psi(t)$, then $\alpha_n \to_w B$ in the $\|\cdot\|_\Psi$-metric if and only if

$$\int_0^1 \frac{1}{t(1-t)} \exp\Big(-\epsilon \frac{\Psi(t)^2}{t(1-t)}\Big) \, dt < \infty, \qquad (2.8)$$

for every $\epsilon > 0$. An immediate corollary of the Chibisov-O’Reilly theorem is that (2.8) is a sufficient condition for ensuring the convergence

$$K_n(\Psi) \to_w \sup_{0<t<1} \frac{|B(t)|}{\Psi(t)}.$$

A modification of the so-called Hungarian construction due to Komlós, Major and Tusnády [55], [56] and to Csörgő and Révész [21] was used in [17] to give the following final result for statistics of Kolmogorov-Smirnov type:

Theorem 2.5. (Asymptotic distribution of statistics of Kolmogorov-Smirnov type) If $\Psi$ is a positive function on $(0,1)$, nondecreasing in a neighborhood of 0 and nonincreasing in a neighborhood of 1, then $K_n(\Psi)$ converges in distribution to a nondegenerate limit law if and only if

$$\int_0^1 \frac{1}{t(1-t)} \exp\Big(-\epsilon \frac{\Psi(t)^2}{t(1-t)}\Big) \, dt < \infty$$

for some $\epsilon > 0$. In that case,

$$K_n(\Psi) \to_w \sup_{0<t<1} \frac{|B(t)|}{\Psi(t)}.$$
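
2.1. Some notes on principal component decompositions of quadratic statistics

Thus, if $\{f_j\}_j$ is a c.o.n.s. for $\langle \cdot, \cdot \rangle_{2,\Psi}$ we have

$$\int_0^1 (B(t))^2 \Psi(t) \, dt = \|B\|_{2,\Psi}^2 = \sum_{j=1}^{\infty} \langle B, f_j \rangle_{2,\Psi}^2.$$

On the other hand, $\langle B, f_j \rangle = \int_0^1 B(t) f_j(t) \Psi(t) \, dt$ are centered Gaussian r.v.’s and, if $\{f_j\}_j$ are the eigenfunctions of the covariance operator, namely, if

$$\lambda_j f_j(t) = \int_0^1 (s \wedge t - st) \Psi(s) f_j(s) \, ds, \qquad (2.9)$$

then $\langle B, f_j \rangle$ are independent $N(0, \lambda_j)$ r.v.’s, since

$$\mathrm{Var}(\langle B, f_j \rangle) = E \int_0^1 \!\! \int_0^1 B(s) f_j(s) \Psi(s) B(t) f_j(t) \Psi(t) \, ds \, dt = \int_0^1 \!\! \int_0^1 (s \wedge t - st) f_j(s) f_j(t) \Psi(s) \Psi(t) \, ds \, dt = \lambda_j \int_0^1 f_j(t)^2 \Psi(t) \, dt = \lambda_j.$$
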
As a consequence, we have that, under $H_0$,

$$\omega_n^2(\Psi) \to_w \sum_{j=1}^{\infty} \lambda_j Y_j^2,$$

where $\{Y_j\}_j$ are i.i.d. $N(0,1)$ and $\lambda_j \ge 0$. This shows that the characteristic function of the limiting distribution of $\omega_n^2(\Psi)$ can be written as

$$\phi(\lambda) = \prod_{j=1}^{\infty} (1 - 2i\lambda_j\lambda)^{-1/2}.$$

This can be used in some cases to find a useful expression of the limiting characteristic function and, hopefully, to find, via an inversion formula, exact expressions for the limiting distribution functions.
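
When no closed form is available, the eigenvalues in (2.9) can be approximated numerically. The sketch below assumes $\Psi \equiv 1$, replaces the covariance kernel $s \wedge t - st$ by a matrix on a midpoint grid, and runs power iteration to approximate the largest eigenvalue. The helper names (`bridge_kernel_matrix`, `top_eigenvalue`), the grid size and the number of iterations are illustrative choices. For this weight the result should be close to $\pi^{-2}$, the value stated in the example that follows, and the trace of the matrix approximates $\sum_j \lambda_j = \int_0^1 t(1-t) \, dt = 1/6$.

```python
import math


def bridge_kernel_matrix(m):
    """Midpoint discretization of the operator with kernel min(s, t) - s * t.

    Each entry carries the quadrature weight 1/m, so the matrix eigenvalues
    approximate the eigenvalues of the integral operator (case Psi = 1).
    """
    t = [(i + 0.5) / m for i in range(m)]
    return [[(min(s, r) - s * r) / m for r in t] for s in t]


def top_eigenvalue(matrix, iterations=50):
    """Largest eigenvalue of a symmetric nonnegative definite matrix by power iteration."""
    m = len(matrix)
    v = [1.0 / math.sqrt(m)] * m  # unit starting vector
    value = 0.0
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(m)) for row in matrix]
        value = math.sqrt(sum(x * x for x in w))  # norm of K v, with |v| = 1
        v = [x / value for x in w]
    return value


K = bridge_kernel_matrix(100)
lambda_1 = top_eigenvalue(K)
trace = sum(K[i][i] for i in range(len(K)))
```

Power iteration is used only to keep the sketch free of external dependencies; a full symmetric eigensolver from a numerical library would return the whole spectrum at once.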

Example. For the Cramér-von Mises statistic, $\omega_n^2$,

$$\lambda_j = (\pi j)^{-2}, \qquad f_j(t) = \sqrt{2} \sin(j\pi t)$$

and, therefore,

$$\omega_n^2 \to_d \omega^2 := \sum_{j=1}^{\infty} \frac{Y_j^2}{\pi^2 j^2}.$$

To see this, we observe that in this case equation (2.9) becomes

$$\lambda f(t) = \int_0^t s(1-t) f(s) \, ds + \int_t^1 t(1-s) f(s) \, ds,$$

from which, differentiating twice, we obtain $\lambda f''(t) = -f(t)$. Noting that $f(0) = f(1) = 0$ we prove the above claim. It can be used to obtain that

$$E(e^{i\lambda\omega^2}) = \Big( \frac{\sqrt{2i\lambda}}{\sin(\sqrt{2i\lambda})} \Big)^{1/2}.$$

This gives

$$P(\omega^2 > x) = \frac{1}{\pi} \sum_{j=1}^{\infty} (-1)^{j+1} \int_{((2j-1)\pi)^2}^{(2j\pi)^2} \frac{1}{y} \sqrt{\frac{-\sqrt{y}}{\sin\sqrt{y}}} \exp\Big(-\frac{xy}{2}\Big) \, dy.$$

In the case of the Anderson-Darling statistic, $A_n^2$, we can proceed similarly to obtain $\lambda_j = (j(j+1))^{-1}$, $f_1(t) = \sqrt{6}\, t(1-t)$, $f_2(t) = \sqrt{30}\, t(1-t)(2t-1)$, . . . , and

$$A_n^2 \to_d A^2 := \sum_{j=1}^{\infty} \frac{Y_j^2}{j(j+1)}.$$

Now

$$E(e^{i\lambda A^2}) = \Big( \frac{-2\pi i\lambda}{\cos((\pi/2)\sqrt{1 + 8i\lambda})} \Big)^{1/2}$$

and a similar expression is available for $P(A^2 > x)$.

The main interest of the orthogonal decomposition of quadratic statistics does not come from this type of inversion, but from the fact that it provides good insight into the power properties of the tests. Let us assume that $H_0$ does not hold but, instead, that $X_1, \dots, X_n$ are i.i.d. $H_n$ and $\sqrt{n}(H_n - G_0) \to \Delta(G_0)$ (these are called local alternatives). It can be shown then that

$$\alpha_n \to_w B + \Delta.$$

Hence,
Dn (Ψ) → sup |B(t) + ∆(t)|Ψ(t) d 0
and ωn2 (Ψ) → d
1
(B(t) + ∆(t))2 Ψ(t) dt 0
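Example. Location and scale alternatives. $X_1,\dots,X_n$ i.i.d. $H_n$, $\sqrt{n}(H_n - G_0) \to \Delta(G_0)$, with
\[
G_0 = G_{\theta_0}; \qquad H_n = G_{\theta_n}; \qquad \theta_n = \theta_0 + \gamma/\sqrt{n},
\]
\[
\Delta(t) = \begin{cases}
-\gamma\, g(G^{-1}(t)) & \text{if } \theta_0 = 0,\ G_\theta = G(\cdot - \theta), \\
-\gamma\, G^{-1}(t)\, g(G^{-1}(t)) & \text{if } \theta_0 = 1,\ G_\theta = G(\cdot/\theta).
\end{cases}
\]
Turning back to statistics of Cramér-von Mises type we obtain
\[
\omega_n^2(\Psi) \xrightarrow{\;w\;} \int_0^1 (B(t) + \Delta(t))^2\,\Psi(t)\,dt = \sum_{j=1}^{\infty} \lambda_j (Y_j^*)^2
\]
with $Y_j^*$ independent $N(\lambda_j^{-1/2}\langle\Delta, f_j\rangle, 1)$ r.v.'s. This suggests that we define the principal component in direction $f_j$:
\[
\omega_n^2(\Psi) = \sum_{j=1}^{\infty} Y_{n,j}^2, \qquad Y_{n,j} := \int_0^1 \alpha_n(t) f_j(t)\,dt .
\]
Under $H_0$, $Y_{n,j}$ are approximately independent $N(0,\lambda_j)$ r.v.'s; under local alternatives $Y_{n,j} \xrightarrow{\;d\;} N(\langle\Delta, f_j\rangle, \lambda_j)$. Thus, $Y_{n,j}$ measures deviations in direction $f_j$ and tests based on $Y_{n,j}$ can be expected to be powerful against alternatives in direction $f_j$.

Example. Cramér-von Mises components.
\[
Y_{n,1} = \frac{\sqrt{2n}}{n\pi}\sum_{i=1}^{n}\cos(\pi G_0(X_i)), \qquad
Y_{n,2} = \frac{\sqrt{2n}}{2n\pi}\sum_{i=1}^{n}\cos(2\pi G_0(X_i)),
\]
\[
\Delta(t) = \begin{cases}
-\gamma\, g(G^{-1}(t)) & \text{location alternatives}, \\
-\gamma\, G^{-1}(t)\, g(G^{-1}(t)) & \text{scale alternatives}.
\end{cases}
\]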
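The decomposition $\omega_n^2 = \sum_j Y_{n,j}^2$ for the unweighted Cramér-von Mises statistic can be illustrated with a short Python sketch. It assumes the sample has already been transformed to $u_i = G_0(X_i)$, uses the standard computing formula for $\omega_n^2$, and extends the two component formulas above to general $j$ as $Y_{n,j} = \frac{\sqrt{2n}}{nj\pi}\sum_i \cos(j\pi u_i)$. The function names, the seed and the truncation level are choices of this sketch.

```python
import math
import random


def cvm_statistic(u):
    """omega_n^2 = n * int_0^1 (G_n(t) - t)^2 dt, via the usual computing formula."""
    n = len(u)
    s = sorted(u)
    return 1.0 / (12 * n) + sum((s[i] - (2 * i + 1) / (2 * n)) ** 2 for i in range(n))


def cvm_component(u, j):
    """Y_{n,j} = sqrt(2n) / (n j pi) * sum_i cos(j pi u_i)."""
    n = len(u)
    return math.sqrt(2 * n) / (n * j * math.pi) * sum(math.cos(j * math.pi * x) for x in u)


if __name__ == "__main__":
    random.seed(0)
    u = [random.random() for _ in range(30)]  # stands in for G_0(X_i) under H_0
    partial = sum(cvm_component(u, j) ** 2 for j in range(1, 3001))
    print(cvm_statistic(u), partial)  # the two numbers should nearly agree
```

Since the components decay like $1/j$, a few thousand terms already reproduce the statistic to two or three decimals; in practice only the first few components are used as test statistics.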
[Figure: the first two Cramér-von Mises eigenfunctions, $f_1(t) = \sqrt{2}\sin(\pi t)$ and $f_2(t) = \sqrt{2}\sin(2\pi t)$, plotted against $t \in [0,1]$.]
If we assume $G$ to be (approximately) symmetric we can expect $Y_{n,1}$ to be good for detecting location alternatives, while $Y_{n,2}$ should be more powerful against scale alternatives. The table below provides some confirmation of this idea.

Asymptotic power of $\omega_n^2$, $Y_{n,1}$, $Y_{n,2}$ and $A_n^2$ against location and scale alternatives; normal model (one-sided tests). The column headings 0.50 and 0.95 give the power of the best test.

                 location             scale
  Test         0.50     0.95       0.50     0.95
  Y_{n,1}      0.466    0.930      0.05     0.05
  Y_{n,2}      0.05     0.05       0.336    0.787
  ω_n^2        0.342    0.877      0.072    0.205
  A_n^2        0.354    0.890      0.108    0.423
Similar considerations apply to other quadratic functionals of empirical, quantile or Gaussian processes, but we will not pursue these issues further here. The interested reader can find more details in [86], [96].
3. Some tools from empirical processes theory

3.1. Some inequalities

We collect here some inequalities involving tail probabilities and moments of sums of independent r.v.'s. They are useful for our goals in that they can be used to upgrade stochastic boundedness of some sequences of variables to uniform boundedness of their moments. We will assume that $X_i$ are independent $B$-valued r.v.'s ($B$ being a separable Banach space) and denote $S_n = \sum_{i=1}^{n} X_i$, $X_n^* = \max_{1\le i\le n}\|X_i\|$ and $S_n^* = \max_{1\le i\le n}\|S_i\|$. The proofs of these results can be found in [27]. We begin with the Lévy maximal inequalities, bounding the tail probabilities of $S_n^*$ by tail probabilities of $S_n$.

Theorem 3.1. Let $X_i$ be independent symmetric r.v.'s. Then
\[
P\Bigl(\max_{1\le k\le n}\Bigl\|\sum_{i=1}^{k} X_i\Bigr\| > t\Bigr) \le 2\Pr\Bigl(\Bigl\|\sum_{i=1}^{n} X_i\Bigr\| > t\Bigr), \qquad t > 0 .
\]
The symmetry hypothesis can be dropped if the $X_i$ are i.i.d., at the price of worse constants:

Theorem 3.2. Let $X_i$ be i.i.d. r.v.'s. Then
\[
P\Bigl(\max_{1\le k\le n}\Bigl\|\sum_{i=1}^{k} X_i\Bigr\| > t\Bigr) \le 9\Pr\Bigl(\Bigl\|\sum_{i=1}^{n} X_i\Bigr\| > t/30\Bigr), \qquad t > 0 .
\]
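Lévy's inequality (Theorem 3.1) can be verified exactly in a toy case. The sketch below, which is only an illustration and not part of the original argument, enumerates all sign patterns of a real-valued symmetric $\pm1$ random walk and computes both tail probabilities; the choice of a $\pm1$ walk and the function name are assumptions made here.

```python
from itertools import product


def levy_tails(n, t):
    """Return (P(max_k |S_k| > t), P(|S_n| > t)) for a symmetric +-1 walk of n steps."""
    hits_max = 0
    hits_end = 0
    for signs in product((-1, 1), repeat=n):
        s = 0
        running_max = 0
        for x in signs:
            s += x
            running_max = max(running_max, abs(s))
        hits_max += running_max > t
        hits_end += abs(s) > t
    total = 2 ** n
    return hits_max / total, hits_end / total


if __name__ == "__main__":
    for t in range(0, 10):
        p_max, p_end = levy_tails(10, t)
        print(t, p_max, 2 * p_end)  # first probability never exceeds the second column
```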
The Hoffmann-Jørgensen inequalities bound moments of sums by the corresponding moment of a maximum plus a quantile of the sum:
Theorem 3.3. For each $p > 0$ there exist constants $K_p$, $c_p$ such that, if $X_i$ are i.i.d. or independent and symmetric r.v.'s, then
\[
\|S_n\|_p \le K_p\,[\,t_0 + \|X_n^*\|_p\,],
\]
where $\|Y\|_p = (E\|Y\|^p)^{1/p}$ and
\[
t_0 = \inf\{t > 0 : \Pr(\|S_n\| > t) \le c_p\}.
\]
This inequality gives the following result on comparison of moments:

Theorem 3.4. For each $0 < p < q$ there exists a constant, $K$, such that, if $X_i$ are i.i.d. or independent and symmetric r.v.'s, then
\[
\|S_n\|_p \le K\,[\,\|S_n\|_q + \|X_n^*\|_p\,].
\]
The following randomization/symmetrization result by Rademacher variables (symmetric r.v.'s taking values in $\{-1, 1\}$) shows that the above Theorem is also valid for centered, independent r.v.'s.

Theorem 3.5. Let $X_i$ be independent, centered r.v.'s in $L_p$, $p \ge 1$, and let $\{\varepsilon_i\}$ be independent Rademacher r.v.'s, independent of the $X_i$. Then
\[
2^{-p}\,E\Bigl\|\sum_{i=1}^{n}\varepsilon_i X_i\Bigr\|^p \le E\Bigl\|\sum_{i=1}^{n} X_i\Bigr\|^p \le 2^{p}\,E\Bigl\|\sum_{i=1}^{n}\varepsilon_i X_i\Bigr\|^p
\]
and
\[
E(S_n^*)^p \le 2^{p+1}\,E\Bigl\|\sum_{i=1}^{n}\varepsilon_i X_i\Bigr\|^p .
\]
As an example of the usefulness of these inequalities we give the following result on the $L_1$-norm of the empirical process on the line.

Theorem 3.6. Let $X, X_i$, $i \in \mathbb{N}$, be i.i.d. random variables with common distribution $F$. Let
\[
Y(t) := I_{X>t} - \Pr\{X > t\}, \qquad -\infty < t < \infty, \tag{3.1}
\]
and let $Y_i$, $i \in \mathbb{N}$, denote the processes obtained by replacing $X$ by $X_i$ in (3.1). Then the sequence
\[
\Bigl\|\sum_{i=1}^{n}\frac{Y_i}{\sqrt{n}}\Bigr\|_{L_1} = \sqrt{n}\int_{-\infty}^{\infty} |F_n(t) - F(t)|\,dt, \qquad n \in \mathbb{N},
\]
is stochastically bounded if and only if
\[
\Lambda_{2,1}(X) := \int_0^{\infty}\sqrt{\Pr\{|X| > t\}}\,dt < \infty .
\]
Proof. The sufficiency part follows easily from Markov's inequality. For the necessity part we can assume, w.l.o.g., that $X \ge 0$. It is now convenient to write
\[
Z(t) := I_{X>t}, \qquad Z_i(t) = I_{X_i>t}, \qquad i \in \mathbb{N},\ t \in \mathbb{R},
\]
so that $Y(t) = Z(t) - EZ(t)$ and likewise for $Y_i$. The stochastic boundedness hypothesis simply asserts
\[
\lim_{M\to\infty}\sup_n \Pr\Bigl\{\frac{1}{\sqrt{n}}\Bigl\|\sum_{i=1}^{n}(Z_i - EZ_i)\Bigr\|_{L_1} > M\Bigr\} = 0 .
\]
The Lévy type inequality for i.i.d. random vectors then implies that
\[
\lim_{M\to\infty}\sup_n \Pr\Bigl\{\frac{1}{\sqrt{n}}\max_{1\le i\le n}\|Z_i - EZ_i\|_{L_1} > M\Bigr\} = 0 .
\]
The classical inequality for independent random variables, say $\xi_i$,
\[
\Pr\Bigl\{\max_i |\xi_i| > t\Bigr\} \ge 1 - \exp\Bigl(-\sum_i \Pr\{|\xi_i| > t\}\Bigr)
\]
then gives that there is a constant $M < \infty$ such that
\[
\sup_n\, n\Pr\Bigl\{\frac{1}{\sqrt{n}}\|Z - EZ\|_{L_1} > M\Bigr\} < \infty,
\]
or, equivalently,
\[
\Lambda_{2,\infty}(Z - EZ) := \sup_{t>0}\, t^2 \Pr\bigl\{\|Z - EZ\|_{L_1} > t\bigr\} < \infty .
\]
A first consequence of this is that
\[
\begin{aligned}
E\max_{1\le i\le n}\frac{\|Z_i - EZ_i\|_{L_1}}{\sqrt{n}}
&= \frac{1}{\sqrt{n}}\int_0^{\infty}\Pr\Bigl\{\max_{1\le i\le n}\|Z_i - EZ_i\|_{L_1} > t\Bigr\}\,dt \\
&\le 1 + \sqrt{n}\int_{\sqrt{n}}^{\infty}\Pr\bigl\{\|Z_i - EZ_i\|_{L_1} > t\bigr\}\,dt \\
&\le 1 + \sqrt{n}\,\Lambda_{2,\infty}(Z - EZ)\int_{\sqrt{n}}^{\infty} t^{-2}\,dt \\
&= 1 + \Lambda_{2,\infty}(Z - EZ) < \infty .
\end{aligned} \tag{3.2}
\]
A second consequence is that $EX = E\|Z\|_{L_1} < \infty$. To wit, if $t_0 < \infty$ is such that $\Pr\{X > t_0\} \le 1/2$, then, since finiteness of $\Lambda_{2,\infty}$ obviously implies $E\|Z - EZ\|_{L_1} < \infty$, we have
\[
\infty > E\|Z - EZ\|_{L_1} = E\int_0^{\infty}\bigl|I_{X>t} - \Pr\{X > t\}\bigr|\,dt \ge E\Bigl(I_{X>t_0}\int_{t_0}^{X}\frac{1}{2}\,dt\Bigr) = \frac{1}{2}E(X - t_0)^+ ,
\]
so that $EX = t_0 + E(X - t_0) \le t_0 + E(X - t_0)^+ < \infty$.

Now, Hoffmann–Jørgensen's inequality gives that for every $r > 0$ there exist finite positive constants $c_i$, $i = 1, 2$, depending only on $r$, such that
\[
E\Bigl\|\frac{\sum_{i=1}^{n}(Z_i - EZ_i)}{\sqrt{n}}\Bigr\|_{L_1}^{r} \le c_1\Bigl[\,t_{0,n}^{r} + E\max_{1\le i\le n}\frac{\|Z_i - EZ_i\|_{L_1}^{r}}{n^{r/2}}\Bigr],
\]
where
\[
t_{0,n} = \inf\Bigl\{t : \Pr\Bigl\{\frac{\bigl\|\sum_{i=1}^{n}(Z_i - EZ_i)\bigr\|_{L_1}}{\sqrt{n}} > t\Bigr\} \le c_2\Bigr\}.
\]
On one hand, the stochastic boundedness hypothesis implies $\sup_n t_{0,n} < \infty$, and on the other, inequality (3.2) asserts the finiteness of the sup over $n$ of the term involving the maximum on the right-hand side of the last inequality for $r = 1$. We thus conclude that
\[
\sup_n E\Bigl\|\frac{\sum_{i=1}^{n}(Z_i - EZ_i)}{\sqrt{n}}\Bigr\|_{L_1} < \infty . \tag{3.3}
\]
There exist positive finite constants $C_1$ and $C_2$ such that
\[
\mathcal{L}(\xi) = \mathrm{Bin}(n, p) \text{ with } \frac{C_1}{n} \le p \le \frac{1}{2} \quad\text{implies}\quad E|\xi - E\xi| \ge C_2\sqrt{np}
\]
(this follows, for instance, from symmetrization and Corollary 3.4 in Giné and Zinn, 1983). Applying this to the empirical process yields
\[
\sqrt{\Pr\{X > t\}} \le \frac{1}{C_2}\,E\Bigl|\frac{\sum_{i=1}^{n}(I_{X_i>t} - \Pr\{X_i > t\})}{\sqrt{n}}\Bigr| \qquad\text{for } \mathrm{med}(X) < t < Q(1 - C_1/n).
\]
Then, integrating and applying inequality (3.3), we obtain
\[
\sup_n \int_{\mathrm{med}(X)}^{Q(1-C_1/n)}\sqrt{\Pr\{X > t\}}\,dt
\le \frac{1}{C_2}\sup_n E\int_{\mathrm{med}(X)}^{Q(1-C_1/n)}\Bigl|\frac{\sum_{i=1}^{n}(I_{X_i>t} - \Pr\{X_i > t\})}{\sqrt{n}}\Bigr|\,dt < \infty .
\]
Since $Q(1 - C_1/n) \to \operatorname{ess\,sup} X$ as $n \to \infty$, this last inequality gives
\[
\int_0^{\infty}\sqrt{\Pr\{X > t\}}\,dt < \infty,
\]
that is, $\Lambda_{2,1}(X) < \infty$, proving the theorem.
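The binomial lower bound used in the proof, $E|\xi - E\xi| \ge C_2\sqrt{np}$ for $C_1/n \le p \le 1/2$, is easy to explore numerically. The sketch below computes the mean absolute deviation exactly from the probability mass function and reports the smallest ratio $E|\xi - E\xi|/\sqrt{np}$ over a grid. Taking $C_1 = 1$, the grid and the range of $n$ are assumptions of this illustration; the text does not specify the constants.

```python
import math


def binomial_mad(n, p):
    """Exact E|xi - n p| for xi ~ Bin(n, p), summing over the pmf."""
    mean = n * p
    return sum(
        math.comb(n, k) * p ** k * (1 - p) ** (n - k) * abs(k - mean)
        for k in range(n + 1)
    )


def min_ratio(max_n=60, grid=20):
    """Smallest E|xi - E xi| / sqrt(n p) over 1/n <= p <= 1/2 and 2 <= n <= max_n."""
    best = float("inf")
    for n in range(2, max_n + 1):
        for i in range(grid + 1):
            p = 1 / n + (0.5 - 1 / n) * i / grid
            best = min(best, binomial_mad(n, p) / math.sqrt(n * p))
    return best


if __name__ == "__main__":
    print(min_ratio())  # stays bounded away from zero on this grid
```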
We finally include here a different type of maximal inequality, the Birnbaum-Marshall inequality for martingales. It can be used to bound the tails of weighted supremum functionals of the empirical process, as we will see later.

Lemma 3.7. Let $\{|S_k|, \mathcal{F}_k\}_{0\le k\le n}$ be a submartingale with $S_0 = 0$. Let $b_1 \ge \cdots \ge b_n \ge b_{n+1} = 0$ and let $r \ge 1$. Call $M_n = \max_{1\le k\le n} b_k|S_k|$. Then for all $\lambda > 0$
\[
P(M_n \ge \lambda) \le \frac{1}{\lambda^r}\sum_{k=1}^{n}(b_k^r - b_{k+1}^r)\,E|S_k|^r = \frac{1}{\lambda^r}\sum_{k=1}^{n} b_k^r\,\bigl[E|S_k|^r - E|S_{k-1}|^r\bigr].
\]
Proof. Assume without loss of generality $r = 1$ and $S_k \ge 0$. Set $A_k = \bigl(\max_{1\le j<k} b_j S_j < \lambda \le b_k S_k\bigr)$. Then
\[
\begin{aligned}
\lambda P(M_n \ge \lambda) &= \lambda\sum_{k=1}^{n} P(A_k) \le \sum_{k=1}^{n}\int_{A_k} b_k S_k\,dP = \sum_{k=1}^{n}\sum_{j=k}^{n}(b_j - b_{j+1})\int_{A_k} S_k\,dP \\
&\le \sum_{k=1}^{n}\sum_{j=k}^{n}(b_j - b_{j+1})\int_{A_k} S_j\,dP = \sum_{j=1}^{n}(b_j - b_{j+1})\int_{M_j\ge\lambda} S_j\,dP \\
&\le \sum_{k=1}^{n}(b_k - b_{k+1})\,E|S_k| .
\end{aligned}
\]
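Lemma 3.7 can be checked exactly on a small example. The following sketch, given only as an illustration, takes $S_k$ to be a symmetric $\pm1$ random walk (so $E S_k^2 = k$), uses $r = 2$ and the nonincreasing weights $b_k = 1/(k+4)$, and compares the exact tail probability with the bound of the lemma. The walk, the weights and the level $\lambda = 0.4$ are assumptions chosen here so that the bound is nontrivial.

```python
from itertools import product


def weighted_max_tail(b, lam):
    """Exact P(max_k b_k |S_k| >= lam) for a symmetric +-1 walk with len(b) steps."""
    n = len(b)
    hits = 0
    for signs in product((-1, 1), repeat=n):
        s = 0
        for k, x in enumerate(signs):
            s += x
            if b[k] * abs(s) >= lam:
                hits += 1
                break
    return hits / 2 ** n


def birnbaum_marshall_bound(b, lam):
    """lam^-2 * sum_k (b_k^2 - b_{k+1}^2) E S_k^2, with b_{n+1} = 0 and E S_k^2 = k."""
    ext = list(b) + [0.0]
    return sum((ext[k] ** 2 - ext[k + 1] ** 2) * (k + 1) for k in range(len(b))) / lam ** 2


if __name__ == "__main__":
    weights = [1.0 / (k + 4) for k in range(1, 11)]
    print(weighted_max_tail(weights, 0.4), birnbaum_marshall_bound(weights, 0.4))
```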
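Lemma 3.8 (Birnbaum and Marshall). Let $\{|S_t|, \mathcal{F}_t\}_{0\le t\le\theta}$ be a submartingale with right-continuous sample paths. Assume $S(0) = 0$ and $\nu(t) = ES^2(t) < \infty$ on $[0, \theta]$. Let $q > 0$ be a nondecreasing and right-continuous function on $[0, \theta]$. Then
\[
P\Bigl(\sup_{0\le t\le\theta}\frac{|S(t)|}{q(t)} \ge 1\Bigr) \le \int_0^{\theta}\frac{1}{q(t)^2}\,d\nu(t).
\]
Proof. By right-continuity of sample paths and $S(0) = 0$,
\[
\begin{aligned}
P\Bigl(\sup_{0\le t\le\theta}\frac{|S(t)|}{q(t)} \le 1\Bigr)
&= P\Bigl(\max_{0\le i\le 2^n}\frac{|S(\theta i/2^n)|}{q(\theta i/2^n)} \le 1 \text{ for all } n\Bigr) \\
&\ge 1 - \lim_{n\to\infty}\sum_{i=1}^{2^n}\frac{E\bigl(S^2(\theta i/2^n) - S^2(\theta(i-1)/2^n)\bigr)}{q^2(\theta i/2^n)} \\
&= 1 - \int_0^{\theta}\frac{1}{q(t)^2}\,d\nu(t).
\end{aligned}
\]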
3.2. The Central Limit Theorem

The classical central limit problem consists in studying conditions for tightness and weak convergence of sequences of laws of sums of small independent random variables and determining their limits. The theory is built up around two main results, the Gaussian and the Poisson convergence theorems. Here we collect the main facts on this central limit problem, both for real and for Banach valued random variables, and refer to [3] for a complete treatment of the subject.

If $\{X_n\}_{n=1}^{\infty}$ are i.i.d. centered r.v.'s with unit variance, then the Lévy-Lindeberg CLT states that
\[
\frac{X_1 + \cdots + X_n}{\sqrt{n}} \xrightarrow{\;w\;} N(0, 1).
\]
We might wonder about the possibility of obtaining different limiting distributions (possibly at different rates) if we drop some of the assumptions. In the i.i.d. setup, but without the finite variance assumption, we have, for instance, that, if $\{X_n\}_{n=1}^{\infty}$ are i.i.d. Cauchy, then
\[
\frac{X_1 + \cdots + X_n}{n} \stackrel{d}{=} X_1
\]
and we get, trivially, a different limit distribution.

More generally we could consider the limiting distributions of sums of triangular arrays of row-wise independent r.v.'s, that is, we consider r.v.'s $\{X_{n,k} : n \in \mathbb{N}, 1 \le k \le k_n\}$, where $X_{n,1},\dots,X_{n,k_n}$ are independent, and call $S_n = \sum_{k=1}^{k_n} X_{n,k}$. If we now consider, for instance, i.i.d. $X_{n,k}$, $k = 1,\dots,n$, having Bernoulli distribution with parameter $p_n$ such that $np_n \to \lambda \in (0, \infty)$, we have that $S_n$ converges weakly to the Poisson distribution with mean $\lambda$.

Obviously, without some restriction on the relative weight of the summands in $S_n$ any kind of (trivial) limit distribution can result. The infinitesimality condition ensures that this is not the case. We say that the triangular array $\{X_{n,k} : n \in \mathbb{N}, 1 \le k \le k_n\}$ of row-wise independent random variables is infinitesimal if, for every $\varepsilon > 0$,
\[
\max_{1\le k\le k_n} P(|X_{n,k}| > \varepsilon) \to 0, \qquad\text{as } n \to \infty .
\]
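The Poisson convergence mentioned above can be made quantitative with a few lines of Python. The sketch below computes the total variation distance between $\mathrm{Bin}(n, \lambda/n)$ and the Poisson law with mean $\lambda$, working with logarithms of the probability masses to avoid overflow. The function names and the chosen values of $n$ and $\lambda$ are assumptions of this illustration.

```python
import math


def log_binomial_pmf(n, p, k):
    """log P(Bin(n, p) = k) for 0 < p < 1, via log-gamma."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )


def log_poisson_pmf(lam, k):
    """log P(Poisson(lam) = k)."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def total_variation(n, lam):
    """Total variation distance between Bin(n, lam/n) and Poisson(lam)."""
    p = lam / n
    diff = 0.0
    poisson_mass = 0.0
    for k in range(n + 1):
        b = math.exp(log_binomial_pmf(n, p, k))
        q = math.exp(log_poisson_pmf(lam, k))
        diff += abs(b - q)
        poisson_mass += q
    # The Poisson mass beyond n has no binomial counterpart.
    tail = max(1.0 - poisson_mass, 0.0)
    return 0.5 * (diff + tail)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, total_variation(n, 2.0))
```

The distances decrease roughly like $1/n$, in line with Le Cam's bound $\lambda^2/n$ for the total variation distance.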
It turns out that, under infinitesimality, the class of possible limit laws of $S_n - a_n$, where $a_n$ are possibly needed centering constants, is the class of the so-called infinitely divisible laws. A law $\gamma$ is said to be infinitely divisible if, for every $n$, it can be expressed as an $n$th convolution power: $\gamma = \gamma_n * \cdots * \gamma_n$ ($n$ factors; that is, $\gamma$ is the law of the sum of $n$ i.i.d. r.v.'s). Infinitely divisible laws can be characterized as the convolution of a Gaussian law and a Poissonization of a Lévy measure: $\gamma = N(\nu, \sigma^2) * c_\tau\mathrm{Pois}\,\mu$. If $\mu$ is a finite measure on $\mathbb{R}$, then the associated Poissonization is
\[
\mathrm{Pois}\,\mu := e^{-\mu(\mathbb{R})}\sum_{n=0}^{\infty}\frac{\mu^{*n}}{n!}, \qquad \mu^{*n} = \mu * \cdots * \mu \ (n \text{ factors}).
\]
If $\mu$ is a positive measure, not necessarily finite, but it integrates $\min(1, x^2)$, then the weak limit
\[
w\text{-}\lim_{\varepsilon\downarrow 0}\ \delta_{-\int_{\varepsilon<|x|\le\tau} x\,d\mu(x)} * \mathrm{Pois}\bigl(\mu|_{\{|x|>\varepsilon\}}\bigr) =: c_\tau\mathrm{Pois}\,\mu
\]
exists and defines the ($\tau$-centered) Poissonization of $\mu$. Positive measures on the real line satisfying $\int\min(1, x^2)\,d\mu(x) < \infty$ are called Lévy measures. Thus $c_\tau\mathrm{Pois}\,\mu$ is defined for Lévy measures. The characteristic function of the infinitely divisible law $\gamma = N(\nu, \sigma^2) * c_\tau\mathrm{Pois}\,\mu$ is
\[
\Psi(t) = \exp\Bigl(it\nu - \frac{\sigma^2 t^2}{2} + \int\bigl(e^{itx} - 1 - itx\,I\{|x| \le \tau\}\bigr)\,d\mu(x)\Bigr).
\]
We are now ready to state the CLT on the real line. We will denote $X_{n,k,\delta} = X_{n,k}\,I\{|X_{n,k}| \le \delta\}$ and $S_{n,\delta} = \sum_{1\le k\le k_n} X_{n,k,\delta}$.
Theorem 3.9 (CLT on the real line). Let $\{X_{n,k} : n \in \mathbb{N}, 1 \le k \le k_n\}$ be an infinitesimal array of row-wise independent real random variables and $\{a_n\}$ a sequence of constants. Then $S_n - a_n$ converges weakly iff

(i) There exists a Lévy measure $\mu$ such that
\[
\sum_{1\le k\le k_n} P(X_{n,k} > \delta) \to \mu(\delta, \infty), \qquad \sum_{1\le k\le k_n} P(X_{n,k} < -\delta) \to \mu(-\infty, -\delta)
\]
for every $\delta > 0$ such that $\mu\{-\delta, \delta\} = 0$.

(ii) There exists $\sigma \ge 0$ such that
\[
\lim_{\delta\downarrow 0}\limsup_{n}\sum_{1\le k\le k_n}\mathrm{Var}(X_{n,k,\delta}) = \sigma^2, \qquad
\lim_{\delta\downarrow 0}\liminf_{n}\sum_{1\le k\le k_n}\mathrm{Var}(X_{n,k,\delta}) = \sigma^2 .
\]
(iii) If $\delta > 0$ satisfies $\mu\{-\delta, \delta\} = 0$ then
\[
E(S_{n,\delta}) - a_n \to a_\delta .
\]
Then, if this is the case,
\[
S_n - a_n \xrightarrow{\;w\;} N(a_\delta, \sigma^2) * c_\delta\mathrm{Pois}\,\mu .
\]
If we restrict our interest to triangular arrays coming from i.i.d. sequences, that is, those arrays with $k_n = n$ and $X_{n,k} = X_k/a_n$ for some i.i.d. r.v.'s $\{X_n\}_n$ and a sequence of normalizing constants $\{a_n\}_n$, then the class of limiting distributions reduces to the stable laws. A law $\gamma$ is said to be stable if it is of the same type as any of its $n$th power convolutions, that is, if $X_1,\dots,X_n$ are i.i.d. $\gamma$, then
\[
X_1 + \cdots + X_n \stackrel{d}{=} a_n X_1 + b_n .
\]
The only possible constants $a_n$ in the above expression are of type $a_n = n^{1/\alpha}$ for some $\alpha \in (0, 2]$. This $\alpha$ is called the stability index of the stable law $\gamma$. The case $\alpha = 2$ corresponds to the Gaussian law. A stable law with index $\alpha < 2$ is an infinitely divisible law without normal part and Lévy measure $\mu(c_1, c_2; \alpha)$ defined by
\[
d\mu(c_1, c_2; \alpha)(x) = \begin{cases}
c_1\, x^{-1-\alpha}\,dx & \text{for } x > 0, \\
c_2\, |x|^{-1-\alpha}\,dx & \text{for } x < 0,
\end{cases}
\]
for some $c_1, c_2 \ge 0$.

Assume $\{X_n\}_n$ is a sequence of i.i.d. r.v.'s having law $\rho$. Then we say that $\rho$ belongs to the domain of attraction of the stable law $\gamma$ if there exist constants $a_n > 0$, $b_n \in \mathbb{R}$ such that
\[
\frac{X_1 + \cdots + X_n}{a_n} - b_n \xrightarrow{\;w\;} \gamma .
\]
Stable laws are the only laws having a nonvoid domain of attraction. Domains of attraction can be characterized in terms of regular variation of the tails and the truncated variances of a law. A function $L$ is regularly varying (at $\infty$) with exponent $\nu \in \mathbb{R}$ if
\[
\lim_{t\to\infty}\frac{L(tx)}{L(t)} = x^{\nu} .
\]
We say that it is slowly varying if the above exponent equals 0.
Theorem 3.10 (Domains of attraction on the line).

(i) A law $\rho$ belongs to the domain of attraction of the normal law iff the truncated moment
\[
U(x) = \int_{-x}^{x} y^2\,d\rho(y)
\]
is slowly varying or, equivalently, iff
\[
\frac{x^2\,\rho(-x, x)^c}{U(x)} \to 0 .
\]
(ii) A law $\rho$ belongs to the domain of attraction of the stable law with Lévy measure $\mu(c_1, c_2; \alpha)$, $\alpha < 2$, iff $\rho(-x, x)^c$ is regularly varying with exponent $-\alpha$ and
\[
\frac{\rho(x, \infty)}{\rho(-x, x)^c} \to \frac{c_1}{c_1 + c_2}, \qquad \frac{\rho(-\infty, -x)}{\rho(-x, x)^c} \to \frac{c_2}{c_1 + c_2} .
\]
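As a quick illustration of part (i), consider the symmetric law $\rho$ with density $|y|^{-3}$ on $\{|y| \ge 1\}$. This example is an assumption introduced here for illustration and is not taken from the text. Its variance is infinite, yet a direct computation shows that it lies in the domain of attraction of the normal law:

```latex
\rho(-x,x)^c = 2\int_x^{\infty} y^{-3}\,dy = x^{-2},
\qquad
U(x) = 2\int_1^{x} y^{2}\,y^{-3}\,dy = 2\log x,
\qquad x \ge 1,
\\[4pt]
\frac{U(tx)}{U(t)} = \frac{\log t + \log x}{\log t} \to 1 \quad (t\to\infty),
\qquad
\frac{x^{2}\,\rho(-x,x)^c}{U(x)} = \frac{1}{2\log x} \to 0 \quad (x\to\infty).
```

Both equivalent conditions of (i) hold: $U$ is slowly varying and the ratio tends to zero, although only at a logarithmic rate.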
We finish this section with some results on the CLT on separable Banach spaces. We consider a triangular array $\{X_{n,k} : n \in \mathbb{N}, 1 \le k \le k_n\}$, where $X_{n,1},\dots,X_{n,k_n}$ are independent $(B, \|\cdot\|)$-valued r.v.'s, call $S_n = \sum_{k=1}^{k_n} X_{n,k}$ and consider the problem of finding necessary and sufficient conditions for the weak convergence of $S_n - a_n$ and of characterizing the possible limiting distributions. The theory parallels to a great extent that for the real line. There are some important differences, though.

As in the real case, we say that the array is infinitesimal if, for every $\varepsilon > 0$,
\[
\max_{1\le k\le k_n} P(\|X_{n,k}\| > \varepsilon) \to 0, \qquad\text{as } n \to \infty .
\]
Under infinitesimality, the class of possible weak limits of $\mathcal{L}(S_n - a_n)$ is, still in this new setup, the class of infinitely divisible laws (the probability measures expressible as $n$th convolution powers for every $n$). Infinitely divisible laws, $\rho$, on $B$ can be characterized as convolutions of Gaussian measures and Poissonizations of Lévy measures: $\rho = \gamma * c_\tau\mathrm{Pois}\,\mu$. The characterization of Lévy measures, though, is not so straightforward as on the real line. A Lévy measure on $B$ is a $\sigma$-finite positive measure, $\mu$, for which there exist $\tau > 0$ and a probability measure $\eta$ having characteristic functional
\[
\Phi(f) = \exp\Bigl(\int\bigl(e^{if(x)} - 1 - if(x)\,I\{\|x\| \le \tau\}\bigr)\,d\mu(x)\Bigr)
\]
for all $f \in B'$ (here $B'$ denotes the topological dual of $B$). In this case we define $c_\tau\mathrm{Pois}\,\mu := \eta$. Integrability of $\min(1, \|x\|^2)$ is neither a necessary nor a sufficient condition for a measure to be a Lévy measure on a general separable Banach space. The following result is a general CLT in Banach spaces.

Theorem 3.11 (CLT in separable Banach spaces). Let $\{X_{n,k}\}$ be infinitesimal. Then $\mathcal{L}(S_n - a_n)$ is weakly convergent iff
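(i) There exists a $\sigma$-finite measure $\mu$ with $\mu\{0\} = 0$ such that
\[
\sum_j \mathcal{L}(X_{n,j})\big|_{B_\delta^c} \xrightarrow{\;w\;} \mu\big|_{B_\delta^c}
\]
for every $\delta > 0$ such that $\mu(\partial B_\delta) = 0$.

(ii) The limit
\[
\phi(f) = \lim_{\delta\downarrow 0}\left\{\begin{array}{c}\limsup_n \\ \liminf_n\end{array}\right\}\sum_{1\le k\le k_n}\mathrm{Var}\bigl(f(X_{n,k,\delta})\bigr)
\]
exists and is finite for every $f \in B'$ (at least for $f \in W \subset B'$ weak-star total).

(iii) There exists a (for all) sequence $\{F_k\}$ of finite-dimensional subspaces with $B = \overline{\cup_k F_k}$ and $\beta > 0$, $p > 0$ (for all $\beta > 0$, $p > 0$) such that
\[
\lim_k \limsup_n E\,d^p(S_{n,\beta} - ES_{n,\beta}, F_k) = 0 .
\]
Then

(a) $\mu$ is a Lévy measure and there exists a centered Gaussian p.m. $\gamma$ such that $\phi(f) = \int f^2\,d\gamma$, $f \in B'$.

(b) For every $\delta > 0$ such that $\mu(\partial B_\delta) = 0$,
\[
\mathcal{L}(S_n - ES_{n,\delta}) \xrightarrow{\;w\;} \gamma * c_\delta\mathrm{Pois}\,\mu .
\]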
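In the particular case of a separable Hilbert space $(H, \langle\cdot,\cdot\rangle)$ the CLT can be slightly simplified. In Hilbert space Lévy measures are positive Borel measures integrating $\min(1, \|x\|^2)$. Condition (iii) in Theorem 3.11 can be replaced by

(iii') There exists a c.o.n.s. $\{\varphi_i\}_{i\ge 1}$ in $H$ such that, for some (all) $\varepsilon > 0$,
\[
\lim_{k\to\infty}\limsup_{n\to\infty}\sum_{j=1}^{k_n}\Bigl[E\|X_{n,j,\varepsilon} - EX_{n,j,\varepsilon}\|^2 - \sum_{i=1}^{k} E\langle X_{n,j,\varepsilon} - EX_{n,j,\varepsilon}, \varphi_i\rangle^2\Bigr] = 0 .
\]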
Thus, convergence of sums to a Gaussian limit can be characterised with the following Theorem:

Theorem 3.12 (CLT in separable Hilbert spaces, normal case). Let $\{X_{n,k}\}$ be an infinitesimal $H$-valued array. Then $\mathcal{L}(S_n - a_n)$ is weakly convergent iff

(i) For every $\varepsilon > 0$
\[
\sum_j P(\|X_{n,j}\| > \varepsilon) \to 0 .
\]
(ii) The limit
\[
\phi(h) = \lim_{\delta\downarrow 0}\left\{\begin{array}{c}\limsup_n \\ \liminf_n\end{array}\right\}\sum_{1\le k\le k_n}\mathrm{Var}\bigl(\langle X_{n,k,\delta}, h\rangle\bigr)
\]
exists and is finite for every $h \in H$.

(iii) There exists a c.o.n.s. $\{\varphi_i\}_{i\ge 1}$ in $H$ such that, for some (all) $\varepsilon > 0$,
\[
\lim_{k\to\infty}\limsup_{n\to\infty}\sum_{j=1}^{k_n}\Bigl[E\|X_{n,j,\varepsilon} - EX_{n,j,\varepsilon}\|^2 - \sum_{i=1}^{k} E\langle X_{n,j,\varepsilon} - EX_{n,j,\varepsilon}, \varphi_i\rangle^2\Bigr] = 0 .
\]
Then there exists a centered Gaussian p.m. $\gamma$ such that $\phi(h) = \int\langle x, h\rangle^2\,d\gamma(x)$, $h \in H$, and
\[
\mathcal{L}(S_n - ES_{n,\delta}) \xrightarrow{\;w\;} \gamma
\]
for every $\delta > 0$.

The case of sequences of i.i.d. $H$-valued random variables also becomes simpler in the Hilbert space setup. We conclude the subsection with the Lindeberg-Lévy Theorem in Hilbert space. It is a very easy consequence of Theorem 3.12.

Theorem 3.13. Let $\{X_n\}$ be a sequence of centered i.i.d. $H$-valued r.v.'s. Then
\[
\frac{X_1 + \cdots + X_n}{\sqrt{n}}
\]
converges weakly iff $E\|X_1\|^2 < \infty$. In that case the limit law is centered Gaussian with covariance $\phi(h) = E\langle X_1, h\rangle^2$, $h \in H$.

3.3. Strong approximations

As we said before, a different approach to proving limit theorems for empirical processes on the line can be based on finding suitable versions of the empirical process, $\alpha_n(t)$, and Brownian bridges $B_n(t)$ such that
\[
\sup_{0\le t\le 1}|\alpha_n(t) - B_n(t)| \to 0, \tag{3.4}
\]
almost surely or in probability. The study of results of type (3.4), generically known as strong approximations, began with the Skorohod embedding, consisting of imitating the partial sum process by using a Brownian motion evaluated at random times (see [8]). Successive refinements of this idea became one of the most important methodologies in the research related to empirical processes. A major breakthrough in this line was achieved by the so-called Hungarian construction due to Komlós, Major and Tusnády. We summarize the main results in this section.

We assume that $\{X_n\}_n$ are i.i.d. centered r.v.'s with unit variance and finite moment generating function in a neighborhood of the origin, call $F$ their distribution function and set $S_k = X_1 + \cdots + X_k$. We also assume that $\{Y_n\}_n$ are i.i.d. standard normal and set $T_k = Y_1 + \cdots + Y_k$. Then

Theorem 3.14. We can define, on the probability space on which $\{Y_n\}_n$ are defined, a sequence $\{X_n\}_n$ of i.i.d. r.v.'s with d.f. $F$ such that
\[
P\Bigl(\max_{1\le k\le n}|S_k - T_k| > C\log n + x\Bigr) \le Ke^{-\lambda x}, \qquad x > 0,\ n \ge 1,
\]
where $C, K, \lambda$ are constants depending only on $F$.
Theorem 3.14 is a deep result with a long, difficult proof. It has, though, many important consequences for important processes in Statistics. A first, direct result in this line can be obtained for the partial sum process:
\[
S^{(n)}(t) := \frac{1}{\sqrt{n}}\sum_{k=1}^{[nt]} X_k, \qquad 0 \le t \le 1 .
\]
We will assume that $\{W(t)\}_{t\ge 0}$ is Brownian motion and $W^{(n)}(t) = \frac{1}{\sqrt{n}}W(nt)$ (observe that $W^{(n)}(t)$ is itself a Brownian motion).

Theorem 3.15. We can define, on a sufficiently rich probability space, versions of $\{X_n\}_n$ and $\{W(t)\}_{t\ge 0}$ such that
\[
P\Bigl(\sqrt{n}\sup_{0\le t\le 1}\bigl|S^{(n)}(t) - W^{(n)}(t)\bigr| > C\log n + x\Bigr) \le Ke^{-\lambda x},
\]
where $C, K, \lambda$ are positive constants not depending on $n$.

Proof. It follows easily from Theorem 3.14 and the reflection principle for Brownian motion.

Csörgő and Révész [21] used this result to give a strong approximation for the uniform quantile process:
\[
u_n(t) := \sqrt{n}\,\bigl(G_n^{-1}(t) - t\bigr), \qquad 0 < t < 1,
\]
where $G_n(t) = \frac{1}{n}\sum_{i=1}^{n} I(U_i \le t)$, $0 < t < 1$, and $\{U_n\}_n$ are i.i.d. uniform r.v.'s on $(0, 1)$. In fact, the well-known distributional equality
\[
(U_{(1)},\dots,U_{(n)}) \stackrel{d}{=} \Bigl(\frac{S_1}{S_{n+1}},\dots,\frac{S_n}{S_{n+1}}\Bigr),
\]
with $S_k = \xi_1 + \cdots + \xi_k$ and $\{\xi_k\}_k$ i.i.d. exponentials with mean 1, allows us to assume
\[
G_n^{-1}(t) = \frac{S_k}{S_{n+1}}, \qquad \frac{k-1}{n} < t \le \frac{k}{n} .
\]
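Now
\[
\begin{aligned}
P\Bigl(\sqrt{n}\sup_{1\le k\le n}\Bigl|u_n\Bigl(\frac{k}{n}\Bigr) - B_n\Bigl(\frac{k}{n}\Bigr)\Bigr| > A\log n + t\Bigr)
&\le 2P\Bigl(\sqrt{n}\sup_{0\le s\le 1}\bigl|S^{(n)}(s) - W^{(n)}(s)\bigr| > \frac{A\log n + t}{5}\Bigr) \\
&\quad + P\Bigl((1 + \varepsilon_n)\,\xi_{n+1} > \frac{A\log n + t}{5}\Bigr) \\
&\quad + 2P\Bigl(\sup_{1\le k\le n}|S_k - k| > \Bigl(\frac{A\log n + t}{5n}\Bigr)^{1/2}\Bigr) \\
&\quad + 2P\Bigl(|\varepsilon_n| > \Bigl(\frac{A\log n + t}{5n}\Bigr)^{1/2}\Bigr).
\end{aligned}
\]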
Using Theorem 3.15 and some other standard techniques we can give exponential bounds for all terms on the right-hand side of this last inequality to conclude (see [21] for details):

Theorem 3.16. We can define, on a sufficiently rich probability space, versions of $\{U_n\}_n$ and $\{W(t)\}_{t\ge 0}$ such that for every $n \ge 1$ and $|x| \le c\sqrt{n}$,
\[
P\Bigl(\sqrt{n}\sup_{0\le t\le 1}|u_n(t) - B_n(t)| > C\log n + x\Bigr) \le Ke^{-\lambda x} .
\]
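The representation of uniform order statistics through partial sums of exponentials, on which the above construction rests, is straightforward to simulate. The sketch below builds $S_k/S_{n+1}$, $k = 1,\dots,n$, and evaluates the uniform quantile process on the grid $k/n$, where $u_n(k/n) = \sqrt{n}\,(U_{(k)} - k/n)$. It is a plain simulation with hypothetical function names, and it does not attempt the coupling with a Brownian bridge.

```python
import math
import random


def uniform_order_stats_from_exponentials(n, rng):
    """Return S_1/S_{n+1}, ..., S_n/S_{n+1} with S_k sums of i.i.d. Exp(1) variables."""
    partial = []
    s = 0.0
    for _ in range(n + 1):
        s += rng.expovariate(1.0)
        partial.append(s)
    total = partial[-1]  # S_{n+1}
    return [partial[k] / total for k in range(n)]


def quantile_process_on_grid(order_stats):
    """u_n(k/n) = sqrt(n) * (U_(k) - k/n) for k = 1, ..., n."""
    n = len(order_stats)
    return [math.sqrt(n) * (order_stats[k - 1] - k / n) for k in range(1, n + 1)]


if __name__ == "__main__":
    rng = random.Random(1)
    u = uniform_order_stats_from_exponentials(50, rng)
    print(max(abs(v) for v in quantile_process_on_grid(u)))
```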
Komlós, Major and Tusnády [55, 56] also gave a construction, similar to the one in Theorem 3.14, for the uniform empirical process:

Theorem 3.17. We can define, on a sufficiently rich probability space, a sequence of Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
\[
P\Bigl(\sup_{0\le t\le 1}|\alpha_n(t) - B_n(t)| > n^{-1/2}(x + C\log n)\Bigr) \le K\exp(-\lambda x) \tag{3.5}
\]
for some absolute positive constants $C, K, \lambda$.

The best constants in (3.5) are $C = 12$, $K = 2$, $\lambda = 1/6$, see [9].

While the above Theorems give the best possible rates of approximation of the empirical or the quantile process by Brownian bridges, there was still room for some improvement through a careful study of the approximation at the tails. The following results come from [17]. They have important applications in the study of weighted norms of empirical and quantile processes.

Theorem 3.18. We can define, on a sufficiently rich probability space, a sequence $\{U_n\}_n$ and Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
\[
P\Bigl(\sup_{0\le t\le d/n}|\alpha_n(t) - B_n(t)| > n^{-1/2}(x + C\log d)\Bigr) \le K\exp(-\lambda x) \tag{3.6}
\]
\[
P\Bigl(\sup_{1-d/n\le t\le 1}|\alpha_n(t) - B_n(t)| > n^{-1/2}(x + C\log d)\Bigr) \le K\exp(-\lambda x) \tag{3.7}
\]
for some absolute positive constants $C, K, \lambda$.

Theorem 3.19. We can define, on a sufficiently rich probability space, a sequence $\{U_n\}_n$ and Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
\[
P\Bigl(\sup_{0\le t\le d/n}|u_n(t) - B_n(t)| > n^{-1/2}(x + C\log d)\Bigr) \le K\exp(-\lambda x) \tag{3.8}
\]
\[
P\Bigl(\sup_{1-d/n\le t\le 1}|u_n(t) - B_n(t)| > n^{-1/2}(x + C\log d)\Bigr) \le K\exp(-\lambda x) \tag{3.9}
\]
for some absolute positive constants $C, K, \lambda$.

3.3.1. Weighted approximations of empirical and quantile processes.

We will see here how the improved construction in [17] can be used to give weighted approximations of empirical and quantile processes.

Theorem 3.20. We can define, on a sufficiently rich probability space, a sequence of Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
\[
\sup_{0\le t\le 1}|\alpha_n(t) - B_n(t)| = O(n^{-1/2}\log n) \quad\text{a.s.} \tag{3.10}
\]
and
\[
n^{1/2-\nu}\sup_{\frac{\lambda}{n}\le t\le 1-\frac{\lambda}{n}}\frac{|\alpha_n(t) - B_n(t)|}{(t(1-t))^{\nu}} = O_P(1) \tag{3.11}
\]
for all $0 < \nu \le 1/2$ and $0 < \lambda < \infty$.

Proof. We use the construction in Theorem 3.18. (3.10) follows from (3.6) (taking $d = n$ and $x = (2/\lambda)\log n$) and the Borel-Cantelli lemma:
\[
P\Bigl(\sup_{0\le t\le 1}|\alpha_n(t) - B_n(t)| > n^{-1/2}\log n\,(2/\lambda + C)\Bigr) \le \frac{K}{n^2} .
\]
To show (3.11), take first $\lambda = 1$ and $0 < \nu \le 1/2$ and call
\[
\Delta^{(1)}_{n,\nu} := n^{1/2-\nu}\sup_{\frac{1}{n}\le t\le 1}\frac{|\alpha_n(t) - B_n(t)|}{t^{\nu}}, \qquad
\Delta^{(2)}_{n,\nu} := n^{1/2-\nu}\sup_{0\le t\le 1-\frac{1}{n}}\frac{|\alpha_n(t) - B_n(t)|}{(1-t)^{\nu}} .
\]
It is enough to prove that $\Delta^{(i)}_{n,\nu} = O_P(1)$, $i = 1, 2$. By symmetry it reduces to showing that $\Delta^{(1)}_{n,\nu} = O_P(1)$. Now, for $e < d < \infty$ let $d_i = id$, $i = 1, 2, \dots, i_n - 1$, and $d_{i_n} = n$, where $i_n = \max\{i : d_{i-1} \le n\}$. Set $I_1 = [1/n, d_1/n]$, $I_i = [d_{i-1}/n, d_i/n]$, $i = 2, \dots, i_n$, and $\delta_{n,i} = \sup\{|\alpha_n(t) - B_n(t)| : t \in [0, d_i/n]\}$, $i = 1, \dots, i_n$. Now
\[
P\bigl(\Delta^{(1)}_{n,\nu} > (C+1)\log d\bigr) \le P\bigl(\delta_{n,1} > n^{-1/2}(C+1)\log d\bigr) + \sum_{i=2}^{i_n} P\bigl(\delta_{n,i} > n^{-1/2}(C+1)\,d_{i-1}^{\nu}\bigr) =: \sum_{i=1}^{i_n} P_{i,n}(d) .
\]
Hence, using (3.6) and the fact that $d_{i-1}^{\nu} \ge \log d_i$ for $d$ large enough, we have, for large $d$, $P_{1,n}(d) \le K\exp(-\lambda\log d) = Kd^{-\lambda}$ and $P_{i,n}(d) \le K\exp(-\lambda d_{i-1}^{\nu}) \le K((i-1)d)^{-2}$. Thus
\[
\sum_{i=1}^{i_n} P_{i,n}(d) \le K\frac{1}{d^{\lambda}} + K\frac{1}{d^2}\sum_{i=1}^{\infty}\frac{1}{i^2},
\]
which can be made arbitrarily small by taking $d$ large enough. This shows $\Delta^{(1)}_{n,\nu} = O_P(1)$ for $\lambda = 1$. To prove this for general $\lambda$ it suffices to cover the range $[\lambda/n, 1/n]$ for fixed $0 < \lambda < 1$, but
\[
n^{1/2-\nu}\sup_{\frac{\lambda}{n}\le t\le\frac{1}{n}}\frac{|\alpha_n(t)|}{t^{\nu}} \le \lambda^{-\nu}\bigl(nG_n(1/n) + 1\bigr) = O_P(1),
\]
since $nG_n(1/n)$ converges weakly to a Poisson r.v., and
\[
n^{1/2-\nu}\sup_{\frac{\lambda}{n}\le t\le\frac{1}{n}}\frac{|B_n(t)|}{t^{\nu}} \le \lambda^{-\nu}n^{1/2}\sup_{0\le t\le\frac{1}{n}}|B_n(t)| \stackrel{d}{=} \lambda^{-\nu}\sup_{0\le t\le 1}\Bigl|W(t) - \frac{t}{n}W(n)\Bigr| = O_P(1).
\]
sup
U1:n ≤t≤Un:n
|αn (t) − Bn (t)| = OP (1) (t(1 − t))ν
(3.12)
for all 0 < ν ≤ 1/2. Corollary 3.22. We can define, on a sufficiently rich probability space, a sequence of Brownian bridges {Bn (t) : 0 ≤ t ≤ 1} such that |αn (t) − Bn (t)| = OP (1) (t(1 − t))ν 0
n1/2−ν sup
(3.13)
for all 0 < ν < 1/2. The proof of the corresponding result for the quantile process is similar to the proof of Theorem 3.20 and, hence, omitted.
Theorem 3.23. We can define, on a sufficiently rich probability space, a sequence of Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
\[
\sup_{0\le t\le 1}|u_n(t) - B_n(t)| = O(n^{-1/2}\log n) \quad\text{a.s.} \tag{3.14}
\]
and
\[
n^{1/2-\nu}\sup_{\frac{\lambda}{n}\le t\le 1-\frac{\lambda}{n}}\frac{|u_n(t) - B_n(t)|}{(t(1-t))^{\nu}} = O_P(1) \tag{3.15}
\]
for all $0 < \nu \le 1/2$ and $0 < \lambda < \infty$.

From Lemma 3.8 and the fact that $\{\alpha_n(t)/(1-t)\}_t$ is a martingale we obtain the following bound
\[
P\bigl(\|\alpha_n/q\|_0^{\theta} \ge x\bigr) \le \frac{1}{x^2}\int_0^{\theta}\frac{1}{q(t)^2}\,dt .
\]
And this already gives a simple result on convergence of $\alpha_n$ in weighted sup metrics:

Theorem 3.24. Suppose that $q$ is symmetric about 1/2, positive and nondecreasing on $[0, 1/2]$ and satisfies
\[
\int_0^1\frac{1}{q(t)^2}\,dt < \infty .
\]
Then there exist, on a sufficiently rich probability space, versions of $\alpha_n$ and a Brownian bridge $B$ such that
\[
\Bigl\|\frac{\alpha_n - B}{q}\Bigr\| \to_P 0 .
\]
There is room for improvement in this Theorem. In fact, finiteness of $\|B/q\|$ does not require integrability of $1/q^2$. We will see next how we can characterize finiteness of $\|B/q\|$.

Lemma 3.25. The following statements are equivalent:

(i) $\limsup_{t\to 0}|W(t)|/q(t) < \infty$
(ii) $\limsup_{t\to 0}|B(t)|/q(t) < \infty$
(iii) $\limsup_{t\to 0}|W(t)|/q(t) = \beta$ for some $0 \le \beta < \infty$
(iv) $\limsup_{t\to 0}|B(t)|/q(t) = \beta$ for some $0 \le \beta < \infty$

Proof. It follows easily from Blumenthal's 0-1 law.

We define
\[
FC_0 := \Bigl\{q : \inf_{\delta\le t\le 1/2} q(t) > 0 \text{ for all } \delta > 0,\ q \text{ nondecreasing in a neighbourhood of } 0\Bigr\},
\]
\[
E(q, c) = \int_0^{1/2} s^{-3/2}q(s)\exp(-cq^2(s)/s)\,ds, \qquad
I(q, c) = \int_0^{1/2}\frac{1}{s}\exp(-cq^2(s)/s)\,ds .
\]
Lemma 3.26. Assume $q \in FC_0$.

(i) If $I(q, c) < \infty$ then $E(q, d) < \infty$ for any $d > c$ and $q(t)/t^{1/2} \to \infty$ as $t \to 0$.
(ii) If $E(q, c) < \infty$ and $q(t)/t^{1/2} \to \infty$ as $t \to 0$ then $I(q, c) < \infty$.

Proof. (ii) Call $C = \inf_{0<t\le 1/2} q(t)/t^{1/2} > 0$. Then $E(q, c) \ge C\,I(q, c)$.

(i) If $I(q, c_1) < \infty$ then $I(q, c_2) < \infty$ for any $c_2 > c_1$. Hence we can assume $c > 1$. Fix $\theta$ small enough so that $q$ is non-decreasing on $(0, \theta]$. If $t \in (0, \theta/c]$ then
\[
\int_t^{ct}\frac{1}{s}\exp(-cq^2(s)/s)\,ds \ge \int_t^{ct}\frac{1}{s}\exp(-cq^2(s)/t)\,ds \ge \int_t^{ct}\frac{1}{s}\exp(-cq^2(ct)/t)\,ds = (\log c)\exp\bigl(-c^2q^2(ct)/(ct)\bigr).
\]
Hence $q(t)/t^{1/2} \to \infty$. We also have, for $d - c > 0$ and small enough $\eta$, $q(t)/t^{1/2} \le \exp((d-c)q^2(t)/t)$ for $t \in (0, \eta]$. Consequently,
\[
\int_0^{\eta}\frac{1}{t}\bigl(q(t)/t^{1/2}\bigr)\exp(-dq^2(t)/t)\,dt \le \int_0^{\eta}\frac{1}{t}\exp(-cq^2(t)/t)\,dt .
\]
This completes the proof.

Lemma 3.27. If $q \in FC_0$ and $\int_0^{1/2} 1/q^2(t)\,dt < \infty$ then $I(q, c) < \infty$ for all $c > 0$.

Proof. For all $x, c > 0$ we have that $x\exp(-cx) \le 1/(ec)$. Therefore,
\[
\int_0^{1/2}\frac{1}{t}\exp(-cq^2(t)/t)\,dt = \int_0^{1/2}\frac{1}{q^2(t)}\,\frac{q^2(t)}{t}\exp(-cq^2(t)/t)\,dt \le \frac{1}{ec}\int_0^{1/2} 1/q^2(t)\,dt .
\]
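Theorem 3.28. Assume $q \in FC_0$. Then $\limsup_{t\to 0}|W(t)|/q(t) = \beta$ for some $0 \le \beta < \infty$ iff
\[
I(q, c)\ \begin{cases} < \infty & \text{for any } c > \beta^2/2, \\ = \infty & \text{for any } c < \beta^2/2 . \end{cases}
\]
Proof. We assume that $I(q, c) < \infty$ and show that $\limsup_{t\to 0}|W(t)|/q(t) \le (2c)^{1/2}$. Assume $f$ is an increasing function on $(0, b]$ and let $0 < a = t_0 < t_1 < \cdots < t_n = b \le 1$. Writing $T_x$ for the first time $W$ hits the level $x$, we have
\[
\begin{aligned}
P(W(t) > f(t) \text{ for some } t \in [a, b]) &\le P(T_{f(a)} \le a) + \sum_{m=1}^{n} P(t_{m-1} < T_{f(t_{m-1})} \le t_m) \\
&\le \int_0^{a}(2\pi t^3)^{-1/2}f(a)\exp(-f^2(a)/(2t))\,dt \\
&\quad + \sum_{m=1}^{n}\int_{t_{m-1}}^{t_m}(2\pi t^3)^{-1/2}f(t_{m-1})\exp\Bigl(-\frac{f^2(t_{m-1})}{2t}\Bigr)\,dt .
\end{aligned}
\]
We fix now some $\lambda > 1$, take $a = b/\lambda^n$ and set $t_m = (1/\lambda)^{n-m}b$. We have now that, for $t \in [t_{m-1}, t_m]$,
\[
t^{-1/2}f(t_{m-1})\exp(-f^2(t_{m-1})/(2t)) \le f(t_{m-1})\,t_{m-1}^{-1/2}\exp(-f^2(t_{m-1})/(2\lambda t_{m-1})).
\]
Take $\rho > \lambda$; then $x\exp(-x^2/(2\lambda)) \le \exp(-x^2/(2\rho))$ for large $x$. If we define $f(t) = (2\rho^2c)^{1/2}q(t)$ then, for small $b$ (recall that $q(t)/t^{1/2} \to \infty$ as $t \to 0$) we get that, for $t \in [t_{m-1}, t_m]$,
\[
\begin{aligned}
t^{-1/2}f(t_{m-1})\exp(-f^2(t_{m-1})/(2t)) &\le \exp(-f^2(t_{m-1})/(2\rho t_{m-1})) \le \exp(-f^2(t_{m-1})/(2\rho t)) \\
&\le \exp(-f^2(t/\lambda)/(2\rho t)) \le \exp(-f^2(t/\rho)/(2\rho t)).
\end{aligned}
\]
Taking $n \to \infty$ we obtain that
\[
P\bigl(W(t) > (2\rho^2c)^{1/2}q(t) \text{ for some } t \in (0, b]\bigr) \le \frac{1}{(2\pi)^{1/2}}\int_0^{b}\frac{1}{t}\exp(-cq^2(t)/t)\,dt .
\]
Hence (letting $b \to 0$),
\[
\limsup_{t\to 0}|W(t)|/q(t) \le (2\rho^2c)^{1/2} \quad\text{a.s.}
\]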
Since $\rho$ can be taken arbitrarily close to 1, this proves our claim. It remains to be shown that, if $\limsup_{t\to 0}|W(t)|/q(t) = \beta$ for some $0 \le \beta < \infty$ then $I(q, c) < \infty$ for any $c > \beta^2/2$. The proof can be found in [19], pp. 181–188.

We summarize the consequences of the above results for the finiteness of $\|B/q\|_\infty$ in the following corollary. Now $FC_{0,1}$ is the class of functions nondecreasing in a neighborhood of 0, nonincreasing in a neighborhood of 1 and such that
\[
\inf_{\delta<x<1-\delta} q(x) > 0, \qquad \delta \in (0, 1/2).
\]
Corollary 3.29. Assume $q \in FC_{0,1}$; then the following statements are equivalent:

(i) $\sup_{0<t<1}|B(t)|/q(t) < \infty$ a.s.

(ii) For some $c > 0$
\[
\hat{I}(q, c) := \int_0^1\frac{1}{s(1-s)}\exp\bigl(-cq^2(s)/(s(1-s))\bigr)\,ds < \infty .
\]
(iii) $\lim_{t\to 0,1} q(t)/(t(1-t))^{1/2} = \infty$ and, for some $c > 0$,
\[
\hat{E}(q, c) := \int_0^1 (s(1-s))^{-3/2}q(s)\exp\bigl(-cq^2(s)/(s(1-s))\bigr)\,ds < \infty .
\]
With these results we can prove the Chibisov-O'Reilly Theorem.

Theorem 3.30 (Weak convergence of the empirical process in weighted sup norm). If $q \in FC_{0,1}$ the following statements are equivalent:
(i) There is a sequence of Brownian bridges $\{B_n(t), 0 \le t \le 1\}$ such that
\[
\sup_{0<t<1}\frac{|\alpha_n(t) - B_n(t)|}{q(t)} = o_P(1).
\]
(ii) For all $c > 0$
\[
\hat{I}(q, c) := \int_0^1\frac{1}{s(1-s)}\exp\bigl(-cq^2(s)/(s(1-s))\bigr)\,ds < \infty .
\]
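Proof. We show first that (ii) implies (i). Fix $\delta \in (0, 1/2)$. Then
\[
\begin{aligned}
\sup_{U_{1:n}\le t\le U_{n:n}}\frac{|\alpha_n(t) - B_n(t)|}{q(t)}
&\le \sup_{0<t\le\delta}\frac{t^{1/2}}{q(t)}\sup_{U_{1:n}\le t\le U_{n:n}}\frac{|\alpha_n(t) - B_n(t)|}{t^{1/2}} \\
&\quad + \sup_{1-\delta\le t<1}\frac{(1-t)^{1/2}}{q(t)}\sup_{U_{1:n}\le t\le U_{n:n}}\frac{|\alpha_n(t) - B_n(t)|}{(1-t)^{1/2}} \\
&\quad + \sup_{\delta\le t\le 1-\delta}\frac{1}{q(t)}\sup_{0\le t\le 1}|\alpha_n(t) - B_n(t)| = o_P(1).
\end{aligned}
\]
Now, observing that $nU_{1:n}$ is tight (it has an exponential limiting distribution) we have
\[
\sup_{0<t\le U_{1:n}}\frac{|\alpha_n(t)|}{q(t)} = n^{1/2}\sup_{0<t\le U_{1:n}}\frac{t}{q(t)} \le (nU_{1:n})^{1/2}\sup_{0<t\le U_{1:n}}\frac{t^{1/2}}{q(t)} = o_P(1).
\]
Conversely, if (i) holds, then necessarily
\[
\lim_{t\to 0} q(t)/t^{1/2} = \infty
\]
(a similar result also holds for the upper tail). Otherwise we can take a sequence $t_k \to 0$ such that $kt_k \to \lambda \in (0, \infty)$ and $\lim_k q(t_k)/t_k^{1/2} = \eta < \infty$. Then,
\[
\lim_k P\bigl(|B_k(t_k) + k^{1/2}t_k|/q(t_k) < \varepsilon\bigr) = \Phi(\varepsilon\eta - \lambda^{1/2}) - \Phi(-\varepsilon\eta - \lambda^{1/2}).
\]
Now, if $\lambda_k = kt_k$, $\lambda_k/k \to 0$ and
\[
\liminf_k P\bigl(|\alpha_k(t_k) + k^{1/2}t_k|/q(t_k) < \varepsilon\bigr) \ge \lim_k P\bigl(G_k(\lambda_k/k) = 0\bigr) = e^{-\lambda} .
\]
Hence, $\Phi(\varepsilon\eta - \lambda^{1/2}) - \Phi(-\varepsilon\eta - \lambda^{1/2}) \ge e^{-\lambda}$ for all $\varepsilon > 0$, a contradiction. The remainder of the proof follows easily from this.

In fact, with the aid of weighted approximations we can prove this refinement of the Chibisov-O'Reilly Theorem:

Theorem 3.31 (Asymptotic distributions of Kolmogorov-Smirnov-type statistics). If $q \in FC_{0,1}$ the following statements are equivalent: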
(i)
\[
\sup_{0<t<1}\frac{|\alpha_n(t)|}{q(t)} \xrightarrow{\;w\;} \sup_{0<t<1}\frac{|B(t)|}{q(t)} .
\]
(ii) For some $c > 0$
\[
\hat{I}(q, c) := \int_0^1\frac{1}{s(1-s)}\exp\bigl(-cq^2(s)/(s(1-s))\bigr)\,ds < \infty .
\]
Proof. Finiteness of the limit in (i) implies (ii). Thus, it suffices to show that (ii) implies (i). Under (ii) we have that $\lim_{t\to 0} q(t)/t^{1/2} = \lim_{t\to 1} q(t)/(1-t)^{1/2} = \infty$ and from this we get
\[
\sup_{0<t<1}\frac{|\alpha_n(t)|}{q(t)} = \sup_{0<t<1}\frac{|B_n(t)|}{q(t)} + o_P(1).
\]
4. Testing fit to a family of distributions

We consider in this section the problem of testing whether the underlying d.f. of the sample, $F$, belongs to a given family of distribution functions, $\mathcal{F}$. We will assume $\mathcal{F}$ is a parametric family, i.e., $\mathcal{F} = \{F(\cdot, \theta), \theta \in \Theta\}$, where $\Theta$ is some open set in $\mathbb{R}^d$. $F^{-1}(\cdot, \theta)$ is the quantile function associated to $F(\cdot, \theta)$.

Perhaps the most interesting case appears when $\mathcal{F}$ is the Gaussian family. It seems that the first statistics for detecting possible departures from normality were introduced in [43, 71, 107] and were based on the study of the standardized third and fourth moments, usually denoted by $\sqrt{b_1}$ and $b_2$, respectively. To strengthen these procedures, some omnibus tests, intended to take into account both features simultaneously, were proposed. For instance, in [72] the $K^2$ and the $R$ tests, consisting of handling two suitable functions of the $\sqrt{b_1}$ and $b_2$ statistics, namely, $K^2 = K(\sqrt{b_1}, b_2)$ and $R = R(\sqrt{b_1}, b_2)$, were introduced. In that paper a Monte Carlo study comparing those tests to the most popular normality tests was carried out. The authors select many alternative distributions and the power of both tests seems to be similar to that of the competing ones.

However, tests based on kurtosis and skewness are not too reliable because they are based on properties which do not characterize Gaussian distributions. For instance, [1] exhibits a sequence of distributions $\{P_k\}$ which converges to the standard Gaussian distribution while the kurtosis of $P_k$ goes to infinity. Thus, if we consider a random sample obtained from $P_k$, the bigger the index $k$, the greater the chance to reject the normality of the sample. On the opposite side, some examples of symmetric distributions with shapes very far from normality (some of them even multimodal) and $\beta_2 = 3$ are known (see, for instance, [4, 52]). As a consequence, none of the $\sqrt{b_1}$, $b_2$, $K^2$ or $R$ tests detects the non-normality of the parent distribution in all cases.
Other tests of normality are the u-test [25], based on the ratio between the range and the standard deviation in the sample, and the a-test [45], which studies the ratio of the sample mean to the standard deviation. These tests are broadly considered as not being too powerful against a wide range of alternatives (although it is known that the u-test has good power against alternatives with light tails [85]; in fact, see [99, 100], the u-test is the most powerful against the uniform distribution while the a-test is the most powerful against the double exponential distribution). For these reasons, some other tests, focusing on features that characterize completely (or, at least, more completely) the family under consideration, have been proposed. These tests can be divided, broadly speaking, into three categories. A first, more general, category consists of tests that adapt other tests devised in the fixed-distribution setup. When we specialize to location scale families, new types of tests, which try to take advantage of the particular structure of $\mathcal{F}$, can be employed. Tests based on the analysis of probability plots, usually referred to as correlation and regression tests, lie in this second class. A third category, whose representatives combine some of the most interesting features exhibited by goodness-of-fit tests lying in the first two categories, is composed of tests based on a suitable $L_2$-distance between the empirical quantile function and the quantile functions of the distributions in $\mathcal{F}$, the so-called Wasserstein distance. Tests based on Wasserstein distance are related to tests in the first category in the sense that all of them depend on functional distances. On the other hand, it happens that the study of Wasserstein tests gives some hints about several properties of the probability plot tests. Both facts have led us to present them separately.
Our approach will try to show that tests based on Wasserstein distance provide the right setup to apply the empirical and quantile process theory to study probability plot-based tests.

4.1. Adaptation of tests coming from the fixed-distribution setup

All the procedures considered in Section 2 were based on measuring some distance between a distribution obtained from the sample and a fixed distribution. A way to adapt this idea to the new setup consists of choosing some adequate estimator $\hat\theta$ of $\theta$ (assuming the null hypothesis is true) and then replacing the fixed distribution by $F(\cdot,\hat\theta)$. This simple idea was suggested by Pearson for his $\chi^2$-test (a good survey on $\chi^2$-tests is [67]). That is, Pearson suggested to use the statistic
$$\hat\chi^2 = \sum_{j=1}^{k}\frac{(O_j - n p_j(\hat\theta))^2}{n p_j(\hat\theta)},$$
where $p_j(\theta)$ denotes the probability, under $F(\cdot,\theta)$, that $X_1$ falls into cell $j$. However, he did not realize the change in the asymptotic distribution of $\hat\chi^2$ due to the estimation of parameters. It was Fisher, in the 1920s, who pointed out that the limiting distribution of $\hat\chi^2$ depends on the method of estimation and showed that, under regularity conditions, if $\hat\theta$ is the maximum likelihood estimator of $\theta$
from the grouped data $(O_1,\dots,O_k)$, then $\hat\chi^2$ has asymptotic $\chi^2_{k-d-1}$ distribution (see, e.g., [13] for a detailed review of Pearson's and Fisher's contributions). Fisher also observed that estimating $\theta$ from the grouped data instead of using the complete sample (e.g., by estimating $\theta$ from the complete likelihood) could produce a loss of information resulting in lack of power. Further, estimating $\theta$ from the original data is often computationally simpler. Fisher studied the asymptotic distribution of $\hat\chi^2$ when $\theta$ is unidimensional and $\hat\theta$ is its maximum likelihood estimator from the ungrouped data. His result was extended by Chernoff and Lehmann in [11] to a general $d$-dimensional parameter, showing that, under regularity conditions (essentially conditions to ensure the consistency and asymptotic normality of the maximum likelihood estimator),
$$\hat\chi^2 \xrightarrow{w} \sum_{j=1}^{k-d-1} Y_j^2 + \sum_{j=k-d}^{k-1}\lambda_j Y_j^2, \tag{4.1}$$
where the $Y_j$ are i.i.d. standard normal r.v.'s and the $\lambda_j \in [0,1]$ may depend on the parameter $\theta$. This dependence is a serious drawback for the use of $\hat\chi^2$ for testing fit to some families of distributions, the normal family being one of them (see [11]). The practical use of $\hat\chi^2$ for testing fit presented another difficulty: the choice of cells. The asymptotic $\chi^2_{k-1}$ distribution of Pearson's statistic was a consequence of the asymptotic normality of the cell frequencies. A cell with a very low expected frequency would cause a very slow convergence to normality and this could result in a poor approximation of the distribution of $\chi^2$. This (somewhat oversimplified) observation led to the diffusion of rules of thumb such as "use cells with number of observations at least 10". Hence, combining neighboring cells with few observations became a common practice (see, e.g., [13]). From a more theoretical point of view, in the setup of testing fit to a fixed distribution, Mann and Wald [63] and Gumbel [47] suggested using intervals that are equally likely under the null hypothesis as a reasonable way to reduce the arbitrariness in the choice of cells (this choice offers some good properties; for instance, it makes the $\chi^2$ test unbiased, see, e.g., [14]). Trying to adapt this idea to the case of testing fit to parametric families poses the problem that different distributions in the null hypothesis lead to different partitions into equiprobable cells. A natural solution to this problem is choosing for cells equally likely intervals under $F(\cdot,\hat\theta)$, where $\hat\theta$ is some suitable estimator of $\theta$. A consequence of this procedure is that, again, the cells are chosen at random. Allowing the cells to be chosen at random introduces a deep modification of the statistical structure of $\chi^2$ because the distribution of the random vector $(O_1,\dots,O_k)$ is no longer multinomial; remarkably, however, it can, in some important cases, eliminate the dependence on the parameter $\theta$ of the asymptotic distribution in (4.1). Watson ([104], [105]) noted that if $\hat\theta$ is the maximum likelihood estimator of $\theta$ (from the ungrouped data) and cell $j$ has boundaries $F^{-1}(\frac{j-1}{k},\hat\theta)$ and $F^{-1}(\frac{j}{k},\hat\theta)$, then the convergence in (4.1) remains true. Further, if $\mathcal{F}$ is a location scale family, then the $\lambda_j$'s do not depend on $\theta$, but only on the family $\mathcal{F}$.
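Watson's random-cell construction can be sketched in a few lines for the normal family. The following is only an illustration under stated assumptions: the function name `chi2_random_cells_normal` is hypothetical, the parameters are estimated by the ungrouped maximum likelihood estimators (sample mean, and standard deviation with divisor $n$), and the cells are the $k$ intervals that are equiprobable under the fitted law, so every expected count equals $n/k$.

```python
import math
from statistics import NormalDist, fmean


def chi2_random_cells_normal(sample, k):
    """Pearson-type statistic with k cells equiprobable under the fitted normal law."""
    n = len(sample)
    mean = fmean(sample)
    # Maximum likelihood estimate of the standard deviation (divisor n), ungrouped data.
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    fitted = NormalDist(mean, sd)
    # Interior cell boundaries F^{-1}(j/k; theta_hat), j = 1, ..., k-1.
    cuts = [fitted.inv_cdf(j / k) for j in range(1, k)]
    counts = [0] * k
    for x in sample:
        # The cell index is the number of boundaries lying strictly below x.
        counts[sum(1 for c in cuts if x > c)] += 1
    expected = n / k  # each random cell has fitted probability 1/k
    return sum((o - expected) ** 2 / expected for o in counts)
```

The value returned has to be compared with the limit law in (4.1), not with a plain $\chi^2_{k-1}$ or $\chi^2_{k-d-1}$ table; the sketch only computes the statistic.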
As a consequence, an improved $\chi^2$ method could be used for testing normality or exponentiality. The development of the theory of weak convergence in metric spaces provided valuable tools for further insights into $\chi^2$-testing. Using the weak convergence of the empirical process in $D[0,1]$, Moore obtained in [66] a short rigorous proof of Watson's result which was also valid for multivariate observations and random rectangular cells. Later Pollard [73], using a general Central Limit Theorem for empirical measures due to Dudley [38], extended the result to very general random cells under the mild assumption that these random cells were chosen from a Donsker class. Despite the fact that all these theoretical contributions have widened the applicability and reliability of $\chi^2$-tests, the limitations of this procedure, noted when testing fit to a fixed distribution, carry over to the case of testing fit to a family (see, e.g., [94] or [96]). The use of supremum or quadratic statistics based on the empirical d.f. with parameters estimated from the data could provide more powerful tests, just as in the fixed distribution setup. The adaptation of $W_n^2$ or $K_n$ to this situation can be easily carried out. Let $\hat\theta_n$ be some estimator of $\theta$. We can define the statistics
$$\hat W_n^2(\Psi) = n\int_{-\infty}^{\infty}\Psi(F(x;\hat\theta_n))\,(F_n(x) - F(x;\hat\theta_n))^2\,dF(x;\hat\theta_n)$$
and
$$\hat K_n(\Psi) = \sqrt{n}\sup_{-\infty<x<\infty}\frac{|F_n(x) - F(x;\hat\theta_n)|}{\Psi(F(x;\hat\theta_n))},$$
and use them as statistical tests, rejecting the null hypothesis when large values of $\hat W_n^2(\Psi)$ or $\hat K_n(\Psi)$ are observed. However, it took a long time until these statistics were considered as serious competitors to the $\chi^2$-test; little was known about these versions of Cramér-von Mises or Kolmogorov-Smirnov tests until the 1950s (see, e.g., [13]). The property exhibited by $W_n^2$ and $K_n$ of being distribution free does not carry over to $\hat W_n^2(\Psi)$ or $\hat K_n(\Psi)$. If we set $Z_i = F(X_i;\hat\theta_n)$ and $\hat G_n(t)$ denotes the empirical d.f. associated to $Z_1,\dots,Z_n$ then, obviously,
$$\hat W_n^2(\Psi) = n\int_0^1 \Psi(t)\,(\hat G_n(t) - t)^2\,dt, \tag{4.2}$$
$$\hat K_n(\Psi) = \sqrt{n}\sup_{0<t<1}\frac{|\hat G_n(t) - t|}{\Psi(t)}, \tag{4.3}$$
but, unlike in the fixed distribution case, $Z_1,\dots,Z_n$ are not i.i.d. uniform r.v.'s. However, in some important cases the distribution of $Z_1,\dots,Z_n$ does not depend on $\theta$, but only on $\mathcal{F}$. In those cases, the distribution of $\hat W_n^2(\Psi)$ or $\hat K_n(\Psi)$ is parameter free. This happens if $\mathcal{F}$ is a location scale family and $\hat\theta_n$ is an equivariant estimator, a fact noted by David and Johnson [26]. Therefore $\hat W_n^2(\Psi)$ or $\hat K_n(\Psi)$ can be used in a straightforward way as test statistics in this situation. Lilliefors
[59] took advantage of this property and, from a simulation study, constructed his popular table for using the Kolmogorov-Smirnov statistic when testing normality. The first attempt to derive the asymptotic distribution of any statistic of $\hat W_n^2(\Psi)$ or $\hat K_n(\Psi)$ type was due to Darling [24]. His study concerned the Cramér-von Mises statistic
$$\hat W_n^2 = n\int_{-\infty}^{\infty}(F_n(x) - F(x;\hat\theta_n))^2\,dF(x;\hat\theta_n) = n\int_0^1(\hat G_n(t) - t)^2\,dt, \tag{4.4}$$
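assuming that $\theta$ was one-dimensional. Let us define
$$H_n := n\int_{-\infty}^{\infty}\Big(F_n(x) - F(x;\theta) - (\hat\theta_n - \theta)\frac{\partial}{\partial\theta}F(x;\theta)\Big)^2 dF(x;\theta) = \int_0^1\big(\sqrt{n}(G_n(t) - t) - T_n g(t)\big)^2\,dt,$$
where $T_n = \sqrt{n}(\hat\theta_n - \theta)$, and
$$g(t) = g(t,\theta) = \frac{\partial}{\partial\theta}F(x;\theta)\Big|_{x=F^{-1}(t;\theta)}. \tag{4.5}$$
Darling's approach was based on showing that, when the underlying distribution of the sample is $F(\cdot,\theta)$ and $\mathcal{F}$ and $\hat\theta_n$ satisfy some adequate regularity conditions, then
$$\hat W_n^2 - H_n = o_P(1). \tag{4.6}$$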
Thus, the asymptotic distribution of $\hat W_n^2$ can be studied through that of $H_n$. Darling showed that the finite-dimensional distributions of $\sqrt{n}(G_n(t) - t) - T_n g(t)$ converge weakly to those of a Gaussian process $Y(t)$ with covariance function $K(s,t) = s\wedge t - st - \phi(s)\phi(t)$, where $\phi(t) = \sigma g(t)$ and $\sigma^2$ is the asymptotic variance of $T_n$. He showed, further, that, under some additional assumptions on $\hat\theta_n$, Donsker's invariance principle could be applied to conclude that
$$\hat W_n^2 \xrightarrow{w} \int_0^1 (Y(t))^2\,dt,$$
and, as in the fixed distribution case, a Karhunen-Loève expansion for $\int_0^1 (Y(t))^2\,dt$ can provide a good way to tabulate the limiting distribution of $\hat W_n^2$. Sukhatme [98] extended Darling's result to multidimensional parameters and gave very valuable information for the Karhunen-Loève expansion of the limiting Gaussian process. Instead of considering the process $\{\sqrt{n}(G_n(t) - t) - T_n g(t)\}_t$, a direct study of the estimated empirical process, $\{\sqrt{n}(\hat G_n(t) - t)\}_t$, could yield the asymptotic distribution of general $\hat W_n^2(\Psi)$ and $\hat K_n(\Psi)$ statistics (recall (4.2) and (4.3)) without having to rely on a different asymptotic equivalence as in (4.6) for every different statistic. Kac, Kiefer and Wolfowitz [51] were the first to study this estimated empirical process in a particular case: if we are testing fit to the family of normal distributions $N(\mu,\sigma^2)$ and we estimate $\theta = (\mu,\sigma^2)$ by $\hat\theta_n = (\bar X_n, S_n^2)$, then the
finite-dimensional distributions of $\{\sqrt{n}(\hat G_n(t) - t)\}_t$ converge weakly to those of a centered Gaussian process $Z(t)$ with covariance function
$$K(s,t) = s\wedge t - st - \phi(\Phi^{-1}(s))\,\phi(\Phi^{-1}(t)) - \frac{1}{2}\,\Phi^{-1}(s)\,\phi(\Phi^{-1}(s))\,\Phi^{-1}(t)\,\phi(\Phi^{-1}(t)), \tag{4.7}$$
where $\phi$ denotes the density function of the standard normal distribution, $\Phi$ its d.f. and $\Phi^{-1}$ the quantile function (the inverse of $\Phi$); notice that the difference between Darling's result and (4.7) is the introduction of an extra term corresponding to the second parameter to be estimated. Although they did not prove weak convergence of the estimated empirical process itself, they used this result (combined with a particular invariance result due to Kac) to conclude that $\hat W_n^2 \xrightarrow{w} \int_0^1 (Z(t))^2\,dt$, thus providing the asymptotic distribution of the Cramér-von Mises test of normality.

4.2. The empirical process with estimated parameters

A general study of the weak convergence of the estimated empirical process was carried out by Durbin [39]. We present here an approach to his main results using strong approximations. We will assume that $\mathcal{F}$ is a parametric family, $\mathcal{F} = \{F(\cdot,\theta),\ \theta\in\Theta\}$, where $\Theta$ is some open set in $\mathbb{R}^k$. The empirical process with estimated parameters is
$$\hat\alpha_n^{\hat\theta_n}(x) = \sqrt{n}\,(F_n(x) - F(x,\hat\theta_n)), \quad x\in\mathbb{R},$$
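where $\hat\theta_n$ is a sequence of estimators. We will assume this sequence to be efficient in the sense that
$$\sqrt{n}(\hat\theta_n - \theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} l(X_i;\theta) + o_P(1),$$
where $l(X_1;\theta)$ is centered and has finite second moments.

Example. Suppose $F(x,\theta)\in\mathcal{F}$ has density $f(x,\theta) = \frac{\partial F}{\partial x}(x,\theta)$. Take $\hat\theta_n$ as the maximum likelihood estimator: the maximizer of
$$v(\theta) := \sum_{i=1}^{n}\log f(X_i,\theta).$$
Under adequate regularity conditions $\int \frac{\partial}{\partial\theta}\log f(x,\theta)\,dF(x,\theta) = 0$ and
$$\int \frac{\partial}{\partial\theta}\log f(x,\theta)\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)^T dF(x,\theta) = -\int\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\,dF(x,\theta) =: I(\theta).$$
Since
$$v'(\theta) = \sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log f(X_i,\theta) \quad\text{and}\quad v''(\theta) = \sum_{i=1}^{n}\frac{\partial^2}{\partial\theta^2}\log f(X_i,\theta),$$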
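we obtain, from the Law of Large Numbers, that $\frac{1}{n}v''(\theta) \to -I(\theta)$ a.s. Now, a Taylor expansion of $v'$ around $\theta$ gives
$$\frac{1}{\sqrt{n}}\big(v'(\hat\theta_n) - v'(\theta)\big) = \frac{1}{n}v''(\theta)\,\sqrt{n}(\hat\theta_n - \theta) + o_P(1) = -I(\theta)\,\sqrt{n}(\hat\theta_n - \theta) + o_P(1),$$
which, taking into account that $v'(\hat\theta_n) = 0$, gives
$$\sqrt{n}(\hat\theta_n - \theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} l(X_i,\theta) + o_P(1),$$
with $l(x,\theta) = I(\theta)^{-1}\frac{\partial}{\partial\theta}\log f(x,\theta)$. Clearly $\int l(x,\theta)\,dF(x,\theta) = 0$, while
$$\int l(x,\theta)\,l(x,\theta)^T dF(x,\theta) = I(\theta)^{-1}I(\theta)I(\theta)^{-1} = I(\theta)^{-1}.$$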
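As a concrete instance of the linear representation above, consider the exponential scale family $f(x,\theta) = \theta^{-1}e^{-x/\theta}$ (an illustrative choice, not taken from the text). There the score is $(x-\theta)/\theta^2$, $I(\theta) = 1/\theta^2$, $l(x,\theta) = x - \theta$ and $\hat\theta_n = \bar X_n$, so the $o_P(1)$ remainder vanishes identically. The sketch below, with the hypothetical name `mle_linearization_exponential`, evaluates both sides of the representation.

```python
import math
from statistics import fmean


def mle_linearization_exponential(sample, theta):
    """Return (sqrt(n)(theta_hat - theta), n^{-1/2} sum l(X_i, theta)) for the exponential scale model."""
    n = len(sample)
    theta_hat = fmean(sample)                      # MLE of the scale parameter
    lhs = math.sqrt(n) * (theta_hat - theta)
    info = 1.0 / theta ** 2                        # Fisher information I(theta)
    # l(x, theta) = I(theta)^{-1} * d/dtheta log f(x, theta) = x - theta
    l_values = [((x - theta) / theta ** 2) / info for x in sample]
    rhs = sum(l_values) / math.sqrt(n)
    return lhs, rhs
```

For any sample and any $\theta > 0$ the two returned numbers agree up to rounding, which is the exact (remainder-free) version of the efficiency assumption in this particular model.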
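To obtain the null asymptotic distribution of $\hat\alpha_n^{\hat\theta_n}$ we assume that $F = F(\cdot,\theta)$ and write
$$\begin{aligned}
\hat\alpha_n^{\hat\theta_n}(x) &= \sqrt{n}(F_n(x) - F(x,\theta)) - \sqrt{n}(F(x,\hat\theta_n) - F(x,\theta))\\
&= \alpha_n^{F(\cdot,\theta)}(x) - \frac{\partial F}{\partial\theta}(x,\theta)^T\sqrt{n}(\hat\theta_n - \theta) + o_P(1)\\
&= \alpha_n^{F(\cdot,\theta)}(x) - \frac{\partial F}{\partial\theta}(x,\theta)^T\frac{1}{\sqrt{n}}\sum_{i=1}^{n} l(X_i,\theta) + o_P(1)\\
&= \alpha_n^{F(\cdot,\theta)}(x) - \frac{\partial F}{\partial\theta}(x,\theta)^T\int_{\mathbb{R}} l(y,\theta)\,d\alpha_n^{F(\cdot,\theta)}(y) + o_P(1)\\
&= \alpha_n(F(x,\theta)) - H(F(x,\theta),\theta)^T\int_0^1 L(t,\theta)\,d\alpha_n(t) + o_P(1)\\
&= \hat\alpha_n(F(x,\theta)) + o_P(1),
\end{aligned}$$
where $\alpha_n$ is the uniform empirical process, $H(t,\theta) = \frac{\partial F}{\partial\theta}(F^{-1}(t,\theta),\theta)$, $L(t,\theta) = l(F^{-1}(t,\theta),\theta)$ and
$$\hat\alpha_n(t) = \alpha_n(t) - H(t,\theta)^T\int_0^1 L(s,\theta)\,d\alpha_n(s), \quad 0<t<1, \tag{4.8}$$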
is the uniform estimated empirical process.

4.2.1. Some notes on stochastic integration. Equation (4.8) suggests that $\hat\alpha_n(t) \xrightarrow{w} B(t) - H(t,\theta)^T\int_0^1 L(s,\theta)\,dB(s)$, where $B$ is a Brownian bridge. We cannot give $\int_0^1 L(s,\theta)\,dB(s)$ the meaning of a Stieltjes integral since the trajectories of $B$ are not of bounded variation. It is possible, though, to make sense of expressions like $\int_0^1 f(s)\,dB(s)$ with $f\in L^2(0,1)$ through the following construction.
Assume first that $f$ is simple: $f(t) = \sum_{i=1}^{n} a_i I_{(t_{i-1},t_i]}(t)$ with $a_i\in\mathbb{R}$ and $0 = t_0 < t_1 < \cdots < t_n = 1$. Then
$$\int_0^1 f(s)\,dB(s) := \sum_{i=1}^{n} a_i\,(B(t_i) - B(t_{i-1})) = \sum_{i=1}^{n} a_i\,\Delta B_i,$$
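where $\Delta B_i = B(t_i) - B(t_{i-1})$. Writing $\Delta t_i = t_i - t_{i-1}$, it can be easily checked that $E\,\Delta B_i = 0$, $\mathrm{Var}(\Delta B_i) = \Delta t_i(1 - \Delta t_i)$ and $\mathrm{Cov}(\Delta B_i,\Delta B_j) = -\Delta t_i\Delta t_j$ if $i\neq j$. The random variable $\int_0^1 f(s)\,dB(s)$ is centered Gaussian with variance
$$\begin{aligned}
\sum_{i=1}^{n} a_i^2\,\mathrm{Var}(\Delta B_i) + 2\sum_{1\le i<j\le n} a_i a_j\,\mathrm{Cov}(\Delta B_i,\Delta B_j)
&= \sum_{i=1}^{n} a_i^2\,\Delta t_i - \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j\,\Delta t_i\Delta t_j\\
&= \sum_{i=1}^{n} a_i^2\,\Delta t_i - \Big(\sum_{i=1}^{n} a_i\,\Delta t_i\Big)^2\\
&= \int_0^1 f^2(t)\,dt - \Big(\int_0^1 f(t)\,dt\Big)^2.
\end{aligned}$$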
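The variance computation for simple integrands can be checked numerically. The sketch below (all function names are hypothetical) builds the covariance matrix of the Brownian bridge increments from $\mathrm{Var}(\Delta B_i) = \Delta t_i(1-\Delta t_i)$ and $\mathrm{Cov}(\Delta B_i,\Delta B_j) = -\Delta t_i\Delta t_j$, and compares the resulting quadratic form with the closed form $\int_0^1 f^2 - (\int_0^1 f)^2$.

```python
def bridge_increment_covariance(knots):
    """Covariance matrix of the increments of a Brownian bridge over the partition `knots`."""
    dt = [b - a for a, b in zip(knots, knots[1:])]
    m = len(dt)
    return [[dt[i] * (1 - dt[i]) if i == j else -dt[i] * dt[j] for j in range(m)]
            for i in range(m)]


def variance_by_covariances(coeffs, knots):
    """Variance of sum_i a_i * Delta B_i computed as the quadratic form a^T C a."""
    cov = bridge_increment_covariance(knots)
    m = len(coeffs)
    return sum(coeffs[i] * cov[i][j] * coeffs[j] for i in range(m) for j in range(m))


def variance_closed_form(coeffs, knots):
    """Closed form: integral of f^2 minus the squared integral of f, for the simple f."""
    dt = [b - a for a, b in zip(knots, knots[1:])]
    second_moment = sum(a * a * d for a, d in zip(coeffs, dt))
    first_moment = sum(a * d for a, d in zip(coeffs, dt))
    return second_moment - first_moment ** 2
```

A constant integrand gives variance zero in both versions, reflecting $B(1) - B(0) = 0$; this is why only the centered part of $f$ matters in the construction.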
Thus, $f \mapsto \int_0^1 f(s)\,dB(s)$ defines an isometry between the subspace of $L^2(0,1)$ consisting of centered, simple functions and its range. We can therefore extend the definition to all centered functions in $L^2(0,1)$. Finally, for a general $f\in L^2(0,1)$,
$$\int_0^1 f(s)\,dB(s) := \int_0^1 \tilde f(s)\,dB(s),$$
where $\tilde f(s) = f(s) - \int_0^1 f(t)\,dt$. The stochastic integral $\int_0^1 f(s)\,dB(s)$ is a centered, Gaussian r.v. with variance
$$\int_0^1 f^2(t)\,dt - \Big(\int_0^1 f(t)\,dt\Big)^2.$$
In fact, if $f_1,\dots,f_k\in L^2(0,1)$, then $\big(\int_0^1 f_1(s)\,dB(s),\dots,\int_0^1 f_k(s)\,dB(s)\big)$ has a joint centered, Gaussian law and from the isometry defining the integrals we see that
$$\mathrm{Cov}\Big(\int_0^1 f(s)\,dB(s),\int_0^1 g(s)\,dB(s)\Big) = \int_0^1 f(s)g(s)\,ds - \int_0^1 f(s)\,ds\int_0^1 g(s)\,ds. \tag{4.9}$$
We can similarly check that $\big(\{B(t)\}_{t\in[0,1]},\int_0^1 f_1(s)\,dB(s),\dots,\int_0^1 f_k(s)\,dB(s)\big)$ is Gaussian and
$$\mathrm{Cov}\Big(B(t),\int_0^1 f(s)\,dB(s)\Big) = \int_0^t f(s)\,ds - t\int_0^1 f(s)\,ds$$
(take $g(s) = I_{(0,t]}(s)$ in (4.9) to check it).
An integration-by-parts formula. Suppose $h$ is simple. Then
$$\int_0^1 h(t)\,dB(t) = \sum_{i=1}^{n} h(t_i)\,(B(t_i) - B(t_{i-1})) = -\sum_{i=0}^{n-1} B(t_i)\,(h(t_{i+1}) - h(t_i)) = -\int_0^1 B(t)\,dh(t).$$
This result can be easily extended to any $h$ of bounded variation and continuous on $[0,1]$:
$$\int_0^1 h(t)\,dB(t) = -\int_0^1 B(t)\,dh(t).$$
This integration-by-parts formula can be used to bound the difference between stochastic integrals and the corresponding integrals with respect to the empirical process:
$$\Big|\int_0^1 h(t)\,d\alpha_n(t) - \int_0^1 h(t)\,dB_n(t)\Big| \le \int_0^1 d|h|(t)\;\|\alpha_n - B_n\|_\infty.$$
We can now summarize the above arguments in the following theorem.

Theorem 4.1. Provided $H(t,\theta)$ is continuous on $[0,1]$ and $L(t,\theta)$ is continuous and of bounded variation on $[0,1]$, we can define $\hat\alpha_n$ and Brownian bridges $B_n$ such that
$$\|\hat\alpha_n - \hat B_n\|_\infty = O\Big(\frac{\log n}{\sqrt{n}}\Big) \quad\text{a.s.},$$
where $\hat B_n(t) = B_n(t) - H(t,\theta)^T\int_0^1 L(s,\theta)\,dB_n(s)$. $\hat B_n$ is a centered Gaussian process with covariance
$$\hat K(s,t) = s\wedge t - st - H(t,\theta)^T\int_0^s L(x,\theta)\,dx - H(s,\theta)^T\int_0^t L(x,\theta)\,dx + H(s,\theta)^T\int_0^1 L(x,\theta)L(x,\theta)^T dx\,H(t,\theta). \tag{4.10}$$
If $\hat\theta_n$ is the maximum likelihood estimator, then the covariance function in (4.10) simplifies to
$$\hat K(s,t) = s\wedge t - st - H(s,\theta)^T I(\theta)^{-1}H(t,\theta).$$
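For a feel of the simplified covariance, here is a minimal numerical sketch for the Gaussian location scale family with $\theta = (\mu,\sigma)$, a case treated in the example further on. It assumes the standard facts $H(t,\theta) = -\sigma^{-1}\phi(\Phi^{-1}(t))\,(1,\Phi^{-1}(t))^T$ and $I(\theta)^{-1} = \sigma^2\,\mathrm{diag}(1,\tfrac12)$, so that $\sigma$ cancels and may be set to 1; the function name `k_hat_normal` is hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal law: provides pdf and inv_cdf


def k_hat_normal(s, t):
    """Covariance s^t - st - H(s)^T I^{-1} H(t) for the Gaussian family, 0 < s, t < 1."""
    zs, zt = _STD.inv_cdf(s), _STD.inv_cdf(t)
    # H(t, theta) with sigma = 1 (the covariance does not depend on mu or sigma).
    h_s = (-_STD.pdf(zs), -_STD.pdf(zs) * zs)
    h_t = (-_STD.pdf(zt), -_STD.pdf(zt) * zt)
    inv_info = ((1.0, 0.0), (0.0, 0.5))  # I(theta)^{-1} for sigma = 1
    quad = sum(h_s[i] * inv_info[i][j] * h_t[j] for i in range(2) for j in range(2))
    return min(s, t) - s * t - quad
```

At $s = t = 1/2$ the second component of $H$ vanishes and the value reduces to $1/4 - 1/(2\pi)$, noticeably smaller than the Brownian bridge variance $1/4$: estimating the parameters reduces the variability of the process.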
Note that this covariance function can be expressed as $s\wedge t - st - \sum_{j=1}^{d}\phi_j(s)\phi_j(t)$ for some real functions $\phi_j$. A very complete study of the Karhunen-Loève expansion of Gaussian processes with this type of covariance function was carried out in [98].

Example. Assume $\mathcal{F} = \{G_0(\frac{\cdot-\mu}{\sigma}) : \mu\in\mathbb{R},\ \sigma>0\}$ is a location scale family ($G_0$ is a standardized distribution function with density $g_0$). Then
$$H(t,\theta) = -\frac{1}{\sigma}\,g_0(G_0^{-1}(t))\begin{pmatrix}1\\ G_0^{-1}(t)\end{pmatrix}$$
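and
$$I(\theta) = \frac{1}{\sigma^2}\begin{pmatrix}\int\frac{g_0'(x)^2}{g_0(x)}\,dx & \int x\,\frac{g_0'(x)^2}{g_0(x)}\,dx\\[4pt] \int x\,\frac{g_0'(x)^2}{g_0(x)}\,dx & \int x^2\,\frac{g_0'(x)^2}{g_0(x)}\,dx - 1\end{pmatrix}.$$
We can now write
$$I(\theta)^{-1} = \sigma^2\begin{pmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22}\end{pmatrix},$$
with $\sigma_{ij}$ depending only on $G_0$, but not on $\mu$ or $\sigma$, and
$$\hat K(s,t) = s\wedge t - st - \phi_1(s)\phi_1(t) - \phi_2(s)\phi_2(t).$$
Here
$$\phi_1(t) = -\sqrt{\sigma_{11} - \frac{\sigma_{12}^2}{\sigma_{22}}}\;g_0(G_0^{-1}(t))$$
and
$$\phi_2(t) = -\frac{\sigma_{12}}{\sqrt{\sigma_{22}}}\,g_0(G_0^{-1}(t)) - \sqrt{\sigma_{22}}\,g_0(G_0^{-1}(t))\,G_0^{-1}(t).$$
If $\mathcal{F}$ is the Gaussian family, then $G_0(x) = \Phi(x)$, $g_0(x) = \phi(x)$, $g_0'(x) = -x\phi(x)$ and
$$I(\theta) = \frac{1}{\sigma^2}\begin{pmatrix}1 & 0\\ 0 & 2\end{pmatrix}.$$
Hence, $\sigma_{11} = 1$, $\sigma_{12} = 0$, $\sigma_{22} = \frac12$ and
$$\hat K(s,t) = s\wedge t - st - \phi(\Phi^{-1}(s))\phi(\Phi^{-1}(t)) - \frac12\,\phi(\Phi^{-1}(s))\Phi^{-1}(s)\,\phi(\Phi^{-1}(t))\Phi^{-1}(t).$$
In this Gaussian case $L$ is not of bounded variation on $[0,1]$, but the above argument can be modified to prove that
$$\{\hat\alpha_n(t)\}_t \xrightarrow{w} \Big\{B(t) + \phi(\Phi^{-1}(t))\int_0^1\Phi^{-1}(s)\,dB(s) + \frac12\,\phi(\Phi^{-1}(t))\Phi^{-1}(t)\int_0^1\big(\Phi^{-1}(s)^2 - 1\big)\,dB(s)\Big\}_t$$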
as random variables in $D[0,1]$ or $L^2(0,1)$. Theorem 4.1 provides, as an easy corollary, the asymptotic distribution of a variety of $\hat W_n^2(\Psi)$ and $\hat K_n(\Psi)$ statistics under the null hypothesis. In fact, Durbin's results also give a valuable tool for studying their asymptotic power because they also include the asymptotic distribution of the estimated empirical process under contiguous alternatives. A survey of results connected to Theorem 4.1 as well as a simple derivation of it based on Skorohod embedding can be found in [86]. Among the statistics whose asymptotic distribution can be derived from Theorem 4.1, three representatives have deserved special attention in the literature: the Cramér-von Mises statistic, and
$$\hat K_n = \sqrt{n}\sup_{-\infty<x<\infty}|F_n(x) - F(x;\hat\theta_n)|$$
and
$$\hat A_n^2 = n\int_{-\infty}^{\infty}\frac{(F_n(x) - F(x;\hat\theta_n))^2}{F(x;\hat\theta_n)\,(1 - F(x;\hat\theta_n))}\,dF(x;\hat\theta_n),$$
which are known, as in the fixed distribution setup, as the Kolmogorov-Smirnov and Anderson-Darling statistics, respectively. Also, as in the fixed distribution case, quadratic statistics offer in general better power properties than $\hat K_n$, with $\hat A_n^2$ outperforming $\hat W_n^2$. Any of these statistics offers a considerable gain in power with respect to the $\chi^2$ test (see, e.g., [94] or [96]). Let us conclude this subsection by commenting, briefly, that subsequent advances in the theory of empirical processes have allowed the development of other goodness-of-fit procedures. For instance, in [41] the asymptotic distribution of the empirical characteristic function is obtained. This was applied in [69] and [49] to propose two normality tests (notice that the modulus of the characteristic function does not depend on the mean of the parent distribution). Simulations in [49] suggest that those tests behave quite well against symmetric alternatives. A different way to adapt the fixed-distribution tests is the minimum distance method. Assume that $\delta(F,G)$ is a distance between d.f.'s. Set $\Delta(F_n,\mathcal{F}) := \inf_\theta\delta(F_n,F(\cdot;\theta))$. $\Delta(F_n,\mathcal{F})$ is a reasonable measure of the discrepancy between the sample distribution and the family $\mathcal{F}$ that can also be used for testing fit to $\mathcal{F}$. Dudley's theory of weak convergence of empirical processes can be used for deriving the limiting distribution of $\Delta(F_n,\mathcal{F})$ when $\delta(F,G) = \|F - G\|$ with $\|\cdot\|$ being some norm on $D[0,1]$ or $D[-\infty,\infty]$ (see, e.g., [74]). An alternative derivation can be based on Skorohod embedding (see [86], pp. 254-257).

4.3. Correlation and regression tests

In this subsection we assume that $\mathcal{F}$ is a location scale family, i.e., given a probability measure $H_0$, we will assume that $\mathcal{F}$ is the family of d.f.'s obtained from $H_0$ by location and scale changes. We will assume $H_0$ to be standardized. Goodness-of-fit tests in this subsection focus on the analysis of the popular probability plot.
Some reviews on this subject have appeared recently (see, for instance, [62] or [97]). The idea behind the probability plot is the following. Let $X_1,\dots,X_n$ be a random sample whose common d.f. belongs to $\mathcal{F}$ and has mean $\mu$ and variance $\sigma^2$. Let $X_0 = (X_{(1)},\dots,X_{(n)})$ be the corresponding vector of order statistics. Let $Z_0 = (Z_{(1)},\dots,Z_{(n)})$ be an ordered sample with underlying d.f. $H_0$ and let $m = (m_1,\dots,m_n)$ and $V = (v_{ij})$ be, respectively, the mean vector and the covariance matrix of $Z_0$, that is, $m_i = EZ_{(i)}$ and $v_{ij} = E(Z_{(i)} - m_i)(Z_{(j)} - m_j)$. Then,
$$X_{(i)} = \mu + \sigma Z_{(i)}, \quad\text{in distribution},\ i = 1,\dots,n. \tag{4.11}$$
Thus, the plot of the ordered values $X_{(1)},\dots,X_{(n)}$ against the points $m_1,\dots,m_n$ should be approximately linear, and lack of linearity in this plot suggests that the d.f. of $X_1$ does not belong to $\mathcal{F}$. Checking this linearity is often done "by eye". However, some analytical procedures have been devised to test it. They were
proposed according to two different criteria, which essentially lead to equivalent tests, the main difference being the point of view employed by the proposer to justify his/her proposal. The first criterion relies on the idea of selecting an estimator $\hat\sigma$ of $\sigma$ assuming the linear model (4.11) is right and comparing $\hat\sigma^2$ with $S_n^2$ which, in any case, is a consistent estimator of $\sigma^2$. Under the null hypothesis $\hat\sigma^2/S_n^2$ should take values close to 1. Hence, values of $\hat\sigma^2/S_n^2$ far from 1 would lead to rejection of the null hypothesis. These procedures are called regression tests. A second class consists of tests assessing the linearity in (4.11) through the correlation coefficient between the vectors $X_0$ and $m$, $\rho(m,X_0)$ (notice that here we have no real correlation coefficient because $m$ is not random). When model (4.11) is true, we expect $\rho^2(m,X_0)$ to take values close to 1 and, consequently, small values of $\rho^2(m,X_0)$ would indicate that the null hypothesis is not true. Tests of this kind are called correlation tests. The vector $m$ can be replaced by other vectors $\beta = (\beta_1,\dots,\beta_n)$ satisfying, under the null hypothesis, some approximate linear relation with $X_0$. The coordinates of the vector $\beta$ are usually known as plotting positions. The first representative of these tests was the Shapiro-Wilk $W$-test of normality, proposed in [82]. There, the authors state that they are trying to provide an analytical procedure "to summarize formally indications of probability plots" (p. 591). The best linear unbiased estimator, BLUE, of $\mu$ and $\sigma$ in model (4.11) is
$$\hat\mu = \bar X_n \quad\text{and}\quad \hat\sigma = \frac{m^T V^{-1}X_0}{m^T V^{-1}m} \tag{4.12}$$
(this holds for any symmetric $H_0$). Hence, under the null hypothesis, $\hat\sigma^2/S_n^2$ should take values close to 1. The Shapiro-Wilk statistic, $W$, is a normalized version of $\hat\sigma^2/S_n^2$, namely,
$$W = \frac{(m^T V^{-1}X_0)^2}{m^T V^{-1}V^{-1}m\,\sum_i (X_i - \bar X)^2}. \tag{4.13}$$
The normalization ensures that $W$ always takes values between 0 and 1 (since $W$ equals $\rho^2(V^{-1}m, X_0)$). Small values of $W$ would lead to rejection of the null hypothesis. This is a regression test, since it is based on the comparison of $\hat\sigma^2$ and $S_n^2$, but, obviously, it can be seen as a correlation test with plotting positions $V^{-1}m$. According to simulations (provided, for instance, in [85]) it seems that the $W$-test is one of the most powerful normality tests against a wide range of alternatives. This fact has made this test very popular and it can be considered as the gold standard for comparisons. However, employing $W$ for testing normality presented several difficulties of different kinds. A first problem concerned computational aspects. Computation of $W$ requires previous computation of $m$ and $V^{-1}$. This task is difficult and, in fact, by the time $W$ was introduced, the elements of $V$ were tabulated only for $n\le 20$. For this reason some numerical approximations that allowed $W$ to be computed quite accurately for sample sizes up to 50 were proposed in [82].
An equally important concern about $W$ was the tabulation of its null distribution. Except in the case $n = 3$, when the $W$-test is equivalent to the u-test [82], the exact distribution of $W$ is unknown. Percentiles of $W$ were computed by simulation in [82] for sample sizes up to 50. However, the asymptotic distribution of $W$ remained unknown for a long time. In fact, it was not obtained until 20 years later, in [58], by showing the asymptotic equivalence, under normality, of $W$ and another correlation test whose distribution was already known at that time (see the considerations concerning the de Wet-Venter test below). Some transformations of $W$ that made its distribution approximately Gaussian were proposed (see [83] or [78]). However, these results must be used with some caution because, as shown in [57], they rely on some approximations which do not hold with the necessary accuracy. An additional weakness of the Shapiro-Wilk test is that the procedure may not be consistent for testing fit to non-normal families of distributions. For instance, if $\mathcal{F}$ is the exponential location scale family then the Shapiro-Wilk statistic becomes
$$W_E = \frac{(\bar X_n - X_{(1)})^2}{(n-1)S_n^2},$$
which is a function of the coefficient of variation. There are some families of distributions with the same coefficient of variation as the exponential family (see [79, 93]). Thus, the $W_E$-test is not consistent when applied for testing exponentiality. In particular, simulations in [93] suggest that the power of the $W_E$-test against the beta$(\frac14,\frac{5}{12})$ distribution decreases with the sample size. These limitations of the Shapiro-Wilk test led to the introduction of modifications of $W$ aimed at easing them. The first examples were the d'Agostino test [23] and the Shapiro-Francia test [81]. They were intended to replace the $W$-test for sample sizes greater than 50. Both tests are easier to compute than the $W$-test.
The d'Agostino test employs an estimator of $\sigma$ proposed in [37] to get the statistic
$$D = \frac{\sum_i (i - (n+1)2^{-1})\,X_{(i)}}{n^2 S}.$$
The Shapiro-Francia test is based on an idea suggested (without proof) in [48] (see also [95]) according to which the matrix $V^{-1}$ in (4.13) can be replaced by the identity $I$, thus obtaining the statistic
$$W' = \frac{(m^T X_0)^2}{m^T m\,\sum_i (X_i - \bar X)^2}. \tag{4.14}$$
Both tests are correlation tests. The plotting positions are $(1,2,\dots,n)$ for the $D$-test and $m$ for the $W'$-test. Simulation studies in [23] and [81], respectively, suggested that the proposed tests are approximately equivalent to the $W$-test. The $D$-test has the advantage of being asymptotically normal and its distribution can be approximated by Cornish-Fisher expansions for moderate sample sizes. Apart from the ease of computation, an interesting feature of the Shapiro-Francia $W'$-test seemed to be its consistency for testing fit to any location scale family
with finite second order moment, a fact shown in [79]. The meaning of this "consistency" needs some explanation. The fact really shown in [79] is that, if $W'_n$ denotes the Shapiro-Francia statistic for a sample of size $n$ and we fix $\lambda\in(0,1)$, then, under any fixed alternative,
$$P(W'_n < \lambda) \to 1. \tag{4.15}$$
If we try to choose the critical value so that the test has significance level $\alpha$, that critical value, $\lambda_n(\alpha)$, will depend on $n$. We cannot conclude $P(W'_n < \lambda_n(\alpha)) \to 1$ from (4.15) and, in fact, the Shapiro-Francia test fails to detect departures from some location scale families (an example of this is given by the exponential family, see the comments about the power of the Shapiro-Francia test below). A further simplification of the $W$-test was proposed by Weisberg and Bingham in [106] by replacing $m$ by the vector $\tilde m = (\tilde m_1,\dots,\tilde m_n)$, where
$$\tilde m_i = \Phi^{-1}\Big(\frac{i - 3/8}{n + 1/4}\Big), \quad i = 1,\dots,n,$$
and $\Phi$ denotes the standard Gaussian d.f. This statistic is easier to compute than $W'$, while a Monte Carlo study included in [106] suggests that both tests are equivalent. Another modification of $W$ was proposed in [32] by de Wet and Venter. It seems that the concept of correlation test was introduced for the first time in that paper. The de Wet and Venter test is the correlation test with plotting positions
$$\beta = \Big(\Phi^{-1}\Big(\frac{1}{n+1}\Big),\dots,\Phi^{-1}\Big(\frac{n}{n+1}\Big)\Big),$$
or, equivalently, the test which rejects normality when large values of
$$W^* = \sum_i\Big(\frac{X_{(i)} - \bar X_n}{S_n} - \Phi^{-1}\big[i/(n+1)\big]\Big)^2 \tag{4.16}$$
are observed. Some other tests continued this line. For instance, in [42], Filliben proposed a correlation test with the medians of the order statistics $Z_0$ as plotting positions. Some simulations comparing this and the $W$ and $W'$ tests are offered. The distribution of this statistic was also computed via the Monte Carlo method. An interesting feature of the $W^*$-test is that it was the first correlation normality test with known asymptotic distribution. To be precise, it was shown in [32] that, if $\{Z_i\}$ is a sequence of independent standard Gaussian r.v.'s, then
$$W^* - a_n \xrightarrow{w} \sum_{i=3}^{\infty}\frac{Z_i^2 - 1}{i}$$
for a certain sequence of constants $\{a_n\}$. The key of the proof relied on showing, through rather involved calculations, the asymptotic equivalence, under normality, of $W^*$ and a certain quadratic form and then using the asymptotic theory for quadratic forms given in [33].
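The correlation statistics based on explicit plotting positions are easy to compute. The sketch below uses only the formulas for $\tilde m$, $\beta$ and $W^*$ given above; the function names are hypothetical, and $S_n^2$ is taken with divisor $n$, which is an assumption about the normalization. It does not compute the exact means $m$ or the matrix $V$ needed for $W$ and $W'$.

```python
import math
from statistics import NormalDist, fmean

_STD = NormalDist()  # standard normal law, used for Phi^{-1}


def weisberg_bingham_positions(n):
    """Plotting positions Phi^{-1}((i - 3/8) / (n + 1/4)), i = 1, ..., n."""
    return [_STD.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def de_wet_venter_positions(n):
    """Plotting positions Phi^{-1}(i / (n + 1)), i = 1, ..., n."""
    return [_STD.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]


def squared_correlation(sample, positions):
    """Squared correlation between the ordered sample and the plotting positions."""
    x = sorted(sample)
    xbar, bbar = fmean(x), fmean(positions)
    sxy = sum((a - xbar) * (b - bbar) for a, b in zip(x, positions))
    sxx = sum((a - xbar) ** 2 for a in x)
    sbb = sum((b - bbar) ** 2 for b in positions)
    return sxy ** 2 / (sxx * sbb)


def de_wet_venter_statistic(sample):
    """W* of (4.16): squared distance between the standardized order statistics and beta."""
    x = sorted(sample)
    n = len(x)
    xbar = fmean(x)
    s_n = math.sqrt(sum((a - xbar) ** 2 for a in x) / n)  # assumed divisor n
    beta = de_wet_venter_positions(n)
    return sum(((a - xbar) / s_n - b) ** 2 for a, b in zip(x, beta))
```

Small values of the squared correlation, or large values of $W^*$, point against normality; critical values must come from tables, simulation or the limit law quoted above.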
Since the publication of [32], the possibility of obtaining the asymptotic distribution of other correlation tests of normality by showing their asymptotic equivalence with the $W^*$-test was considered. An important paper in this program was [102], where the asymptotic equivalence of correlation tests under some general conditions (satisfied by most of the correlation tests in the literature) is shown. In particular, it is shown that the Shapiro-Francia, the Weisberg-Bingham and the Filliben tests are asymptotically equivalent to the de Wet-Venter test, having, consequently, the same asymptotic distribution. The asymptotic distribution of the Shapiro-Wilk test could then be obtained using its asymptotic equivalence with the Shapiro-Francia test, shown in [58]. This solved an important problem which had been open for around twenty years. It would be unfair not to mention the paper [57], by Leslie, which proved the validity of the key step in previous heuristic reasonings based on assuming that the vector $m$ is an "asymptotic eigenvector" of $V^{-1}$. More precisely, the main result in that paper is that there exists a constant $C$ which does not depend on $n$ such that
$$\|V^{-1}m - 2m\| \le C(\log n)^{-1/2},$$
where, given the matrix $B = (b_{ij})$, $\|B\|^2 = \sum_{ij} b_{ij}^2$. The possibility of extending the use of correlation tests to cover goodness-of-fit to other families of distributions has been explored, for instance, in [92] for the exponential distribution or in [46] for the extreme value distributions. In this setup correlation tests do not present the same nice properties exhibited when testing normality. In [60] the asymptotic normality of the Shapiro-Francia test when applied to the exponential family is obtained. The rate of convergence is extremely slow: $(\log n)^{1/2}$. This result was generalized in [65] to cover extreme-value and logistic distributions with the same rate and the same asymptotic distribution as in the exponential case. However, the asymptotic efficiency of the Shapiro-Francia test in these situations was found to be 0 when compared with tests based on the empirical distribution function, since it was possible to find a sequence of contiguous alternatives such that the asymptotic power coincides with the nominal level of significance of the test (see also [61] on this question).
5. Tests based on Wasserstein distance

A different approach to correlation tests was suggested in [29] and will be developed at length in the remainder of this work. The methodology consists of analyzing the $L_2$-Wasserstein distance between a fixed distribution and a location scale family of probability distributions in $\mathbb{R}$. Our study will cover different kinds of distribution tails, including as key examples the uniform, normal, exponential and a heavier tailed law. Let $\mathcal{P}_2(\mathbb{R})$ be the set of probabilities on $\mathbb{R}$ with finite second moment. For probabilities $P_1$ and $P_2$ in $\mathcal{P}_2(\mathbb{R})$ the $L_2$-Wasserstein distance between $P_1$ and $P_2$ is defined as

$$\mathcal{W}(P_1, P_2) := \inf\Big\{ \big[E(X_1 - X_2)^2\big]^{1/2} : \mathcal{L}(X_1) = P_1,\ \mathcal{L}(X_2) = P_2 \Big\}.$$
To simplify notation we will identify probability laws with their d.f.'s. In particular, if $F_i$, $i = 1, 2$, are the d.f.'s associated to the probability laws $P_i \in \mathcal{P}_2(\mathbb{R})$, we will say that $F_i \in \mathcal{P}_2(\mathbb{R})$, $i = 1, 2$, and write $\mathcal{W}(F_1, F_2)$ instead of $\mathcal{W}(P_1, P_2)$. A main fact which makes $\mathcal{W}$ useful for statistics on the line (the multivariate setting is very different) is that it can be explicitly obtained in terms of quantile functions. If $F_i \in \mathcal{P}_2(\mathbb{R})$, $i = 1, 2$, then (see, e.g., [101], [5])

$$\mathcal{W}(F_1, F_2) = \left[ \int_0^1 \big(F_1^{-1}(t) - F_2^{-1}(t)\big)^2 \, dt \right]^{1/2}. \tag{5.1}$$
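As a numerical illustration of (5.1), the following sketch approximates the integral of squared quantile differences with a midpoint rule. The function name `wasserstein2` and the grid size are choices made here for illustration only; the two normal laws are an arbitrary example, picked because for normal laws the distance is known in closed form, $\mathcal{W}^2 = (m_1 - m_2)^2 + (\sigma_1 - \sigma_2)^2$.

```python
import math
from statistics import NormalDist


def wasserstein2(q1, q2, grid=20_000):
    """Approximate W(F1, F2) in (5.1) from two quantile functions.

    Uses a midpoint rule on (0, 1); midpoints never hit t = 0 or t = 1,
    where quantile functions of unbounded laws are infinite.
    """
    total = 0.0
    for k in range(grid):
        t = (k + 0.5) / grid
        total += (q1(t) - q2(t)) ** 2
    return math.sqrt(total / grid)


# Quantile functions of N(0, 1) and of the normal law with mean 1 and
# standard deviation 2 (illustrative choices, not taken from the text).
q_a = NormalDist(0.0, 1.0).inv_cdf
q_b = NormalDist(1.0, 2.0).inv_cdf

w = wasserstein2(q_a, q_b)
print(w)  # should be close to sqrt(2), the closed-form value for these laws
```

The midpoint rule is only a convenience; any quadrature that handles the (integrable) singularities of the quantile functions at 0 and 1 would serve.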
Some relevant well-known properties of the Wasserstein distance are included in the following proposition for future reference. The reader interested in properties and uses of the general $L_p$-Wasserstein distance can consult [5], [86] or [22].

Proposition 5.1. (a) Let $F_i \in \mathcal{P}_2(\mathbb{R})$, $i = 1, 2$. Call $m_i$ the mean value of $F_i$ and $F_i^*$ the d.f. defined by $F_i^*(x) = F_i(x - m_i)$. Then

$$\mathcal{W}^2(F_1, F_2) = \mathcal{W}^2(F_1^*, F_2^*) + (m_1 - m_2)^2.$$

(b) Let $\{F_n\}_n$ be a sequence in $\mathcal{P}_2(\mathbb{R})$. The following statements are equivalent:
i. $F_n \to F \in \mathcal{P}_2(\mathbb{R})$ in $\mathcal{W}$-distance (i.e., $\mathcal{W}(F_n, F) \to 0$).
ii. $F_n \to_w F$ and $\int |t|^2 dF_n \to \int |t|^2 dF < \infty$.
iii. $F_n^{-1} \to F^{-1}$ a.s. and in $L_2(0, 1)$.

As in Subsection 3.2 we assume $\mathcal{F}$ to be a location scale family of d.f.'s, that is, $\mathcal{F} = \{H : H(x) = H_0((x - \mu)/\sigma),\ \mu \in \mathbb{R},\ \sigma > 0\}$ for some $H_0 \in \mathcal{P}_2(\mathbb{R})$ which we choose, for simplicity, with zero mean and unit variance (thus, given $H(x) = H_0((x - \mu)/\sigma)$ in $\mathcal{F}$, $\mu$ and $\sigma$ are its mean and its standard deviation, respectively). Note that the quantile function associated to $H(x) = H_0((x - \mu)/\sigma)$ satisfies $H^{-1}(t) = \mu + \sigma H_0^{-1}(t)$. Therefore, if $F$ is a d.f. in $\mathcal{P}_2(\mathbb{R})$ with mean $\mu_0$ and standard deviation $\sigma_0$, (5.1) and Proposition 5.1 (a) imply that

$$\begin{aligned}
\mathcal{W}^2(F, \mathcal{F}) &= \inf\{\mathcal{W}^2(F, H),\ H \in \mathcal{F}\} = \inf_{\sigma > 0} \int_0^1 \big(F^{-1}(t) - \mu_0 - \sigma H_0^{-1}(t)\big)^2 dt \\
&= \inf_{\sigma > 0} \left[ \sigma_0^2 + \sigma^2 - 2\sigma \int_0^1 \big(F^{-1}(t) - \mu_0\big) H_0^{-1}(t)\, dt \right] \\
&= \sigma_0^2 - \left( \int_0^1 \big(F^{-1}(t) - \mu_0\big) H_0^{-1}(t)\, dt \right)^2 = \sigma_0^2 - \left( \int_0^1 F^{-1}(t) H_0^{-1}(t)\, dt \right)^2.
\end{aligned} \tag{5.2}$$
Thus, the law in $\mathcal{F}$ closest to $F$ is given by $\mu = \mu_0$ and $\sigma = \int_0^1 F^{-1}(t) H_0^{-1}(t)\, dt$, which is the covariance between $F^{-1}$ and $H_0^{-1}$ when seen as r.v.'s defined on $(0, 1)$. The ratio $\mathcal{W}^2(F, \mathcal{F})/\sigma_0^2$ is not affected by location or scale changes on $F$. Hence, it can be considered as a measure of dissimilarity between $F$ and $\mathcal{F}$. For example, if $\Phi$ denotes the standard normal d.f., the best $\mathcal{W}$-approximant to $F$ in the set $\mathcal{F}_N$ of normal laws will be the normal law with mean $\mu_0$ and standard deviation $\int_0^1 F^{-1}(t)\Phi^{-1}(t)\, dt$, and the ratio

$$\frac{\mathcal{W}^2(F, \mathcal{F}_N)}{\sigma_0^2} = 1 - \frac{\left( \int_0^1 F^{-1}(t)\Phi^{-1}(t)\, dt \right)^2}{\sigma_0^2}$$

measures the nonnormality of $F$. The invariance of $\mathcal{W}^2(F, \mathcal{F})/\sigma_0^2$ against location or scale changes on $F$ suggests the convenience of using a sample version of it for testing fit to the location scale family $\mathcal{F}$. More precisely, if $X_1, X_2, \ldots, X_n$ is a random sample with underlying d.f. $F$,

$$R_n = \frac{\mathcal{W}^2(F_n, \mathcal{F})}{S_n^2} = 1 - \frac{\hat\sigma_n^2}{S_n^2},$$

where $\hat\sigma_n = \int_0^1 F_n^{-1}(t) H_0^{-1}(t)\, dt$, can be used as a test statistic for the null hypothesis $F \in \mathcal{F}$. Large values of $R_n$ would lead to rejection of the null hypothesis. This testing procedure belongs to the class of minimum distance tests described in Subsection 3.1. A nice feature of Wasserstein tests is that we have an explicit expression for the minimum distance estimators and, consequently, for the minimum distance statistic, unlike what happens for other metrics (e.g., those metrics leading to Kolmogorov-Smirnov or Cramér-von Mises statistics). The connection of $R_n$ to correlation and regression tests can be clearly seen by noting that large values of $R_n$ correspond to small values of $\hat\sigma_n^2/S_n^2$, which, with the notation employed in Subsection 3.2, can be expressed as

$$\rho^2(\nu, X_0) = \frac{\left( \sum_{i=1}^n \nu_i X_{(i)} \right)^2}{S_n^2},$$

where $\nu = (\nu_1, \ldots, \nu_n)'$ and $\nu_i = \int_{(i-1)/n}^{i/n} H_0^{-1}(t)\, dt$, $i = 1, \ldots, n$ (observe that $\nu$ is centered and $\nu'\nu = 1$ since $H_0$ is assumed to be standardized). Hence, the $R_n$ test is equivalent to a correlation test with plotting positions $\nu = (\nu_1, \ldots, \nu_n)'$. In fact, the plotting positions in the Shapiro-Wilk, Shapiro-Francia or de Wet-Venter tests are approximations to the Wasserstein plotting positions. This had been noticed, in the particular context of normal probability plots, by Brown and Hettmansperger in [10], who considered the problem of finding the best choice for the plotting positions.
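The statistic $R_n$ for testing normality can be sketched in a few lines. The code below is an illustration under stated assumptions, not the author's implementation: the function names are invented here, $S_n^2$ is taken with divisor $n$, and the plotting positions use the closed form $\nu_i = \phi(\Phi^{-1}((i-1)/n)) - \phi(\Phi^{-1}(i/n))$, which follows from $\frac{d}{dt}\phi(\Phi^{-1}(t)) = -\Phi^{-1}(t)$.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal law, playing the role of H_0


def wasserstein_positions(n):
    """nu_i = integral of Phi^{-1} over ((i-1)/n, i/n), i = 1..n."""
    def phi_at_quantile(t):
        # phi(Phi^{-1}(t)); it tends to 0 as t -> 0 or t -> 1
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return STD.pdf(STD.inv_cdf(t))

    return [phi_at_quantile((i - 1) / n) - phi_at_quantile(i / n)
            for i in range(1, n + 1)]


def r_n(sample):
    """R_n = 1 - sigma_hat_n^2 / S_n^2 for the normal location scale family."""
    n = len(sample)
    xs = sorted(sample)                               # order statistics X_(i)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / n         # S_n^2 (divisor n, assumed)
    sigma_hat = sum(v * x for v, x in zip(wasserstein_positions(n), xs))
    return 1.0 - sigma_hat ** 2 / s2


rng = random.Random(0)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
expo_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = r_n(normal_sample)
r_shifted = r_n([3.0 + 2.0 * x for x in normal_sample])  # location scale change
r_expo = r_n(expo_sample)
print(r_normal, r_shifted, r_expo)
```

The second call illustrates the location scale invariance discussed above; the third shows that a clearly non-normal sample yields a much larger value of the statistic.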
That paper presented a heuristic explanation, based on an orthogonal expansion of $R_n$, of the good power properties of the $R_n$ normality test against general alternatives, observed by Stephens [95]. We will try to justify those heuristic considerations. The Wasserstein test of normality turns out to be equivalent to the well-known Shapiro-Wilk test, sharing its good power properties. However, both are inefficient procedures for testing fit to location scale families that, like the exponential family, have heavier tails (just as with tests of fit based on
the correlation coefficient: see Lockhart and Stephens (1998), Section 5). The null asymptotics of $R_n$ provides a good insight into the cause of this inefficiency. To study this asymptotic distribution under $H_0$ we can assume $F = G_0$ (by the location scale invariance of $R_n$) and denote the empirical quantile process as $v_n(t) = \sqrt{n}\,(F_n^{-1}(t) - F^{-1}(t))$, to get (see del Barrio et al. (1999)) that

$$\begin{aligned}
nR_n &= \frac{1}{\sigma^2(F_n)} \left[ \int_0^1 v_n^2(t)\, dt - \left( \int_0^1 v_n(t)\, dt \right)^2 - \left( \int_0^1 v_n(t) F^{-1}(t)\, dt \right)^2 \right] \\
&= \frac{1}{\sigma^2(F_n)} \int_0^1 \big( v_n(t) - \langle v_n, 1\rangle 1 - \langle v_n, F^{-1}\rangle F^{-1}(t) \big)^2 dt = \frac{1}{\sigma^2(F_n)} \int_0^1 \hat v_n^2(t)\, dt,
\end{aligned}$$

where $\langle f, g\rangle = \int_0^1 f \cdot g$ and $\hat v_n = v_n - \langle v_n, 1\rangle 1 - \langle v_n, F^{-1}\rangle F^{-1}$. It is shown in del Barrio et al. (1999) that under normality there exist constants $a_n$ such that $nR_n - a_n$ converges in law to a nondegenerate distribution. More precisely, if $\Phi$ (resp. $\phi$) denotes the standard normal distribution (resp. density) function and $B$ is a Brownian bridge, then

$$\begin{aligned}
nR_n - a_n \to_d{} & \int_0^1 \frac{B^2(t) - EB^2(t)}{\phi^2(\Phi^{-1}(t))}\, dt - \left( \int_0^1 \frac{B(t)}{\phi(\Phi^{-1}(t))}\, dt \right)^2 - \left( \int_0^1 \frac{B(t)\Phi^{-1}(t)}{\phi(\Phi^{-1}(t))}\, dt \right)^2 \\
={} & \int_0^1 \big( \hat B^2(t) - E\hat B^2(t) \big)\, dt,
\end{aligned}$$

where $\hat B = (B - \langle B, 1\rangle 1 - \langle B, \Phi^{-1}\rangle \Phi^{-1})/\phi(\Phi^{-1})$. This last integral admits a principal component decomposition, namely,

$$\int_0^1 \big( \hat B^2(t) - E\hat B^2(t) \big)\, dt = \sum_{j=3}^\infty \lambda_j (Y_j^2 - 1),$$

where $\lambda_j = 1/j$, $f_j(t) = H_j(\Phi^{-1}(t))$, $Y_j = \langle \hat B, f_j\rangle/\sqrt{\lambda_j}$ and $H_j$ is the Hermite polynomial of degree $j - 1$. $\lambda_j$ and $f_j$ are, respectively, the eigenvalues and eigenvectors of the covariance operator associated to $\hat B$, hence the $Y_j$ are i.i.d. $N(0, 1)$. We can, similarly, write

$$nR_n = \frac{1}{S_n^2} \int_0^1 (\hat v_n)^2(t)\, dt = \frac{1}{S_n^2} \sum_{j=1}^\infty Y_{n,j}^2,$$

where $Y_{n,j} = \int_0^1 \hat v_n(t) f_j(t)\, dt$ is the principal component in direction $f_j$. $\tilde Y_{n,j} = Y_{n,j}/S_n$ detects deviations in direction $f_j$. For local skewness-kurtosis alternatives:

$$\frac{X_i - \mu}{\sigma} \sim \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \left[ 1 + \frac{\gamma_1}{6\sqrt{n}} H_3(x) + \frac{\gamma_2}{24\sqrt{n}} H_4(x) \right],$$

it can be shown that

$$\tilde Y_{n,3} \to_d N(\gamma_1, 3), \quad \text{and} \quad \tilde Y_{n,4} \to_d N(\gamma_2, 4).$$

Since $\tilde Y_{n,3}$ and $\tilde Y_{n,4}$ are the components with the largest weight in the decomposition of $R_n$, this explains the good performance of the Wasserstein (or the Shapiro-Wilk) test against deviations from normality in skewness or kurtosis.
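To make the remark about weights concrete, the following sketch computes how much of the variance of the limit law $\sum_{j \ge 3} \lambda_j (Y_j^2 - 1)$, with $\lambda_j = 1/j$ and i.i.d. standard normal $Y_j$, is carried by the two leading components. It relies only on $\mathrm{Var}(Y_j^2 - 1) = 2$ and independence; the function names and the truncation level are choices made here for illustration.

```python
import math


def limit_variance(j_max):
    """Variance of sum_{j=3}^{j_max} (Y_j^2 - 1)/j for i.i.d. N(0,1) Y_j.

    Each term has variance 2 / j^2, and the terms are independent.
    """
    return 2.0 * sum(1.0 / j ** 2 for j in range(3, j_max + 1))


def leading_share(j_max):
    """Fraction of the (truncated) variance due to components j = 3 and j = 4."""
    leading = 2.0 / 3 ** 2 + 2.0 / 4 ** 2
    return leading / limit_variance(j_max)


total = limit_variance(100_000)
share = leading_share(100_000)
# Untruncated value: 2 * (pi^2 / 6 - 1 - 1/4)
exact_total = 2.0 * (math.pi ** 2 / 6.0 - 1.25)
print(total, exact_total, share)
```

Under these assumptions the two leading components account for somewhat more than two fifths of the total variance, which is consistent with the qualitative statement above.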
In the exponential case $nR_n$ is not shift tight, but it can be shown that, for some constants $a_n$, $(n/\sqrt{\log n})R_n - a_n$ is asymptotically normal. However, if we fix $\delta \in (0, 1/2)$ then

$$\frac{1}{\sqrt{\log n}} \int_\delta^{1-\delta} (\hat v_n(t))^2\, dt \to_{\Pr} 0,$$

hence the asymptotic distribution of $R_n$ depends only on the tails of $F$: the Wasserstein exponentiality test cannot detect alternatives that have approximately exponential tails. As a possible remedy to this inefficiency, de Wet (2000, 2002) and Csörgő (2002) proposed to replace the Wasserstein distance $\mathcal{W}$ by a weighted version

$$\mathcal{W}_w(F, G) := \left[ \int_0^1 \big(F^{-1}(t) - G^{-1}(t)\big)^2 w(t)\, dt \right]^{1/2},$$

for some positive measurable function $w$, and the test statistic $R_n$ by

$$R_n^w = \frac{\mathcal{W}_w^2(F_n, \mathcal{H})}{\sigma_w^2(F_n)},$$

where, here and in the sequel, we set

$$\mu_w(F) = \int_0^1 F^{-1}(t) w(t)\, dt \quad \text{and} \quad \sigma_w^2(F) = \int_0^1 (F^{-1}(t))^2 w(t)\, dt - \mu_w^2(F).$$
$R_n^w$ is location scale invariant, hence its null distribution can be studied assuming, as above, that $F = G_0$. Under the assumptions

$$\int_0^1 w(t)\, dt = 1, \tag{5.3}$$

$$\int_0^1 G_0^{-1}(t) w(t)\, dt = 0 \tag{5.4}$$

and

$$\int_0^1 (G_0^{-1}(t))^2 w(t)\, dt = 1, \tag{5.5}$$

we can mimic, step by step, the computations leading to (1.6) and obtain that

$$\begin{aligned}
nR_n^w &= \frac{1}{\sigma_w^2(F_n)} \left[ \int_0^1 v_n^2(t) w(t)\, dt - \left( \int_0^1 v_n(t) w(t)\, dt \right)^2 - \left( \int_0^1 v_n(t) F^{-1}(t) w(t)\, dt \right)^2 \right] \\
&= \frac{1}{\sigma_w^2(F_n)} \int_0^1 \big( v_n(t) - \langle v_n, 1\rangle_w 1 - \langle v_n, F^{-1}\rangle_w F^{-1}(t) \big)^2 w(t)\, dt \\
&= \frac{1}{\sigma_w^2(F_n)} \int_0^1 \hat v_n^2(t) w(t)\, dt,
\end{aligned} \tag{5.6}$$

where, now, $\langle f, g\rangle_w = \int_0^1 (f \cdot g) w$ and $\hat v_n = v_n - \langle v_n, 1\rangle_w 1 - \langle v_n, F^{-1}\rangle_w F^{-1}$. Thus, the asymptotic distribution of $nR_n^w$ under the null hypothesis can be obtained through the analysis of weighted $L_2$-functionals of the quantile process $v_n$.
6. Asymptotics for L2-functionals of the quantile process

We devote our last section to the derivation of a complete set of asymptotic results for weighted $L_2$-functionals of the quantile process. Our analysis is based on obtaining CLT's for linear combinations of $L_2$ functions with random exponential coefficients, a simpler problem to which the analysis of the quantile process can be reduced. This section is largely based on [31]. We will see first how the above mentioned reduction can be justified. Let $U_i$, $i \in \mathbb{N}$, be i.i.d. uniform (0,1) random variables; for each $n$, let $G_n(t)$ be the empirical c.d.f. associated to $U_1, \ldots, U_n$, let $G_n^{-1}(t)$ be the quantile function and let $u_n(t)$ be the associated uniform quantile process, that is,

$$u_n(t) = \sqrt{n}\,(G_n^{-1}(t) - t), \quad 0 < t < 1. \tag{6.1}$$

If $\{\xi_n\}_{n=1}^\infty$ are i.i.d. r.v.'s with common exponential distribution of mean 1 and $S_n = \xi_1 + \cdots + \xi_n$, then the well known distributional identity

$$(U_{n:1}, \ldots, U_{n:n}) =_d \left( \frac{S_1}{S_{n+1}}, \ldots, \frac{S_n}{S_{n+1}} \right)$$

allows us to rewrite $G_n^{-1}(t)$ as $S_j/S_{n+1}$ if $(j - 1)/n < t \le j/n$ and, consequently,

$$u_n(t) =_d \frac{n}{S_{n+1}} \sum_{j=1}^{n+1} Z_{n,j}(t), \tag{6.2}$$

where $Z_{n,j}(t) = n^{-1/2} a_{n,j}(t)\xi_j$ and $a_{n,j}(t) = (1 - t)I\{j - 1$ [...] (6.3)

$$L_n := \int_{1/n}^{1-1/n} \left( \frac{u_n(t)}{g(t)} \right)^2 dt \tag{6.4}$$

for some weight function $g$ non-vanishing on $(0, 1)$, then, with $\|\cdot\|_2$ denoting the $L_2$ norm with respect to Lebesgue measure on the unit interval,

$$L_n =_d \left( \frac{n}{S_{n+1}} \right)^2 \left\| \sum_{i=1}^{n+1} c_{n,i}\xi_i \right\|_2^2 \tag{6.5}$$

for certain functions $c_{n,i}(t)$ which we assume in $L_2(0, 1)$ (in the case of (6.5), but not always below, $c_{n,i} = n^{-1/2} a_{n,i}(t) I_{[1/n, 1-1/n]}(t)/g(t)$). By the law of large numbers, weak convergence of the statistic $a_n L_n - b_n$ then reduces to weak convergence of

$$a_n \left\| \sum_{i=1}^{n+1} c_{n,i}\xi_i \right\|_2^2 - b_n \left( \frac{S_{n+1}}{n} \right)^2,$$

and the second variable is almost a constant if $b_n$ does not grow too fast.
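The reduction to exponential variables is a pathwise algebraic identity once the uniform order statistics are written as $S_j/S_{n+1}$, and it can be checked numerically. The sketch below makes two assumptions that should be read as such: it takes $a_{n,j}(t) = (1 - t)I\{j - 1 < nt\} - tI\{j - 1 \ge nt\}$, which is the form suggested by the proof of Lemma 6.1 (each $a_{n,j}(t)$ equals $1 - t$ for the first $n - [n(1-t)]$ indices and $-t$ afterwards), and it uses the sign convention $u_n(t) = \sqrt{n}\,(G_n^{-1}(t) - t)$, under which the representation holds term by term. All function names are illustrative.

```python
import math
import random


def a_coef(n, j, t):
    """Assumed a_{n,j}(t): 1 - t while j - 1 < n t, and -t afterwards."""
    return (1.0 - t) if (j - 1) < n * t else -t


def quantile_from_spacings(xi, t):
    """G_n^{-1}(t) = S_j / S_{n+1} for (j-1)/n < t <= j/n, from n+1 exponentials."""
    n = len(xi) - 1
    j = math.ceil(n * t)
    return sum(xi[:j]) / sum(xi)


def u_direct(xi, t):
    """Uniform quantile process, sign convention sqrt(n) (G_n^{-1}(t) - t)."""
    n = len(xi) - 1
    return math.sqrt(n) * (quantile_from_spacings(xi, t) - t)


def u_linear(xi, t):
    """Right-hand side of (6.2): (n / S_{n+1}) * sum_j n^{-1/2} a_{n,j}(t) xi_j."""
    n = len(xi) - 1
    z_sum = sum(a_coef(n, j, t) * xi[j - 1] for j in range(1, n + 2)) / math.sqrt(n)
    return (n / sum(xi)) * z_sum


def m_tilde(n, t):
    """Sum of the coefficients a_{n,j}(t), j = 1..n+1 (see Lemma 6.1 i))."""
    return sum(a_coef(n, j, t) for j in range(1, n + 2))


rng = random.Random(1)
n = 10
xi = [rng.expovariate(1.0) for _ in range(n + 1)]
points = (0.05, 0.37, 0.62, 0.93)      # chosen so that n * t is not an integer
gaps = [abs(u_direct(xi, t) - u_linear(xi, t)) for t in points]
print(gaps)
```

The same coefficients can be used to check the closed form $\tilde m_n(t) = \mathrm{frac}(n(1 - t)) - t$ from Lemma 6.1 i), as done in `m_tilde`.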
Since the $a_{n,j}(t)$ have a relatively complicated expression, it is convenient to isolate here as a lemma some estimates on $a_{n,j}(t)$ to be used below. It is also convenient to introduce the notation $f \otimes g$ for the function of two variables $f(x)g(y)$. We recall that the map $(f, g) \mapsto f \otimes g$ is a continuous bilinear map between $L_2(0, 1) \times L_2(0, 1)$ and $L_2((0, 1) \times (0, 1))$ and that $\langle f_1 \otimes g_1, f_2 \otimes g_2\rangle = \langle f_1, f_2\rangle \langle g_1, g_2\rangle$.

Lemma 6.1. Set $\tilde m_n = \sum_{j=1}^{n+1} a_{n,j}$ and $\tilde K_n = \sum_{j=1}^{n+1} a_{n,j} \otimes a_{n,j}$. Then for every $0 < s, t < 1$,
i) $\tilde m_n(t) = \mathrm{frac}(n(1 - t)) - t$; also, if $1/n \le t \le 1 - 1/n$ then
$$|\tilde m_n(t)| \le 1 \le \frac{n}{n-1}\, nt(1 - t);$$
ii) if $s \le t$ then $\tilde K_n(s, t) = [n(1 - t)]s + \mathrm{frac}(n(1 - s))(1 - t) + st$;
iii) $\frac{1}{n}\tilde K_n(s, t) \to s \wedge t - st$; further, if $1/n \le s, t \le 1 - 1/n$, then
$$\frac{1}{2}(s \wedge t - st) \le \frac{1}{n}\tilde K_n(s, t) \le \frac{1}{n}\sum_{j=1}^{n+1} |a_{n,j}(s) a_{n,j}(t)| \le 3(s \wedge t - st).$$

Proof. To prove i) fix $t$ and observe that each term in $\sum_{j=1}^{n+1} a_{n,j}(t)$ equals either $1 - t$ (the first $n - [n(1 - t)]$ terms) or $-t$ (the remaining $[n(1 - t)] + 1$ terms). Hence

$$\tilde m_n(t) = (n - [n(1 - t)])(1 - t) - ([n(1 - t)] + 1)t = \mathrm{frac}(n(1 - t)) - t.$$

The identity in ii) can be proved in a similar way. Fix $s \le t$. In the corresponding sum for $\tilde K_n(s, t)$ there are three types of summands: the first $n - [n(1 - s)]$, each equal to $(1 - s)(1 - t)$; the next $[n(1 - s)] - [n(1 - t)]$, each equal to $-s(1 - t)$; and the remaining $[n(1 - t)] + 1$, each equal to $st$. This gives ii) and the right-hand side inequality in iii). The left-hand side inequality in iii) is a trivial consequence of ii).

We will also make use of the following simple but useful observation about hypercontractivity of linear combinations of exponential variables with coefficients in $L_2$. Its proof, based on standard symmetrization/randomization techniques, can be found in [31].

Lemma 6.2. Let $Y(t) = \sum_{k=1}^n c_k(t)\xi_k$ for some $n \in \mathbb{N}$ and $c_k \in L_2(0, 1)$, where the variables $\xi_k$ are independent exponential with parameter 1. Then, there exists an absolute constant $C < \infty$ such that

$$E\|Y\|_2^4 \le C \big( E\|Y\|_2^2 \big)^2. \tag{6.6}$$

Next we consider the general quantile process. Let $F$ be a twice differentiable distribution function such that $f := F'$ is non-vanishing on $\mathrm{supp}\, F := \{F \ne 0, 1\}^\circ := (a_F, b_F)$ and

$$r := \sup_{0 < t < 1} \frac{t(1 - t)\,|f'(F^{-1}(t))|}{f^2(F^{-1}(t))} < \infty, \tag{6.7}$$
where $F^{-1}(t)$ is the corresponding quantile function. Condition (6.7), which comes from Csörgő and Révész (1978), is a natural condition to have if we wish to relate general and uniform quantile processes: see Lemma 1.1, Ch. 6 and comments after its proof in this reference. Let $X_i$ be i.i.d. with common distribution $F$, let $F_n$ be the empirical distribution of $X_1, \ldots, X_n$, $n \in \mathbb{N}$, and let $F_n^{-1}$ denote the empirical quantile function. Since we are considering only distributional results, there is no loss of generality in taking $X_i = F^{-1}(U_i)$, where the $U_i$ are i.i.d. uniform on $[0, 1]$. In this case,

$$F_n^{-1}(t) = F^{-1}\big(G_n^{-1}(t)\big), \tag{6.8}$$

where $G_n^{-1}$ is the quantile function corresponding to the uniform variables $U_1, \ldots, U_n$. We let $v_n$ be the quantile process associated to the sequence $X_i$,

$$v_n(t) := \sqrt{n}\,\big(F_n^{-1}(t) - F^{-1}(t)\big), \quad n \in \mathbb{N}. \tag{6.9}$$

$v_n$ and $u_n$ are related by the limited Taylor expansion

$$\begin{aligned}
v_n(t) &= \sqrt{n}\,\Big( F^{-1}\big(G_n^{-1}(t)\big) - F^{-1}(t) \Big) \\
&= \frac{\sqrt{n}\,\big(G_n^{-1}(t) - t\big)}{f(F^{-1}(t))} - \frac{1}{2\sqrt{n}}\, \frac{f'(F^{-1}(\xi))}{f^3(F^{-1}(\xi))}\, \Big( \sqrt{n}\,\big(G_n^{-1}(t) - t\big) \Big)^2 \\
&= \frac{u_n(t)}{f(F^{-1}(t))} - \frac{1}{2\sqrt{n}}\, \frac{f'(F^{-1}(\xi))}{f^3(F^{-1}(\xi))}\, u_n^2(t)
\end{aligned} \tag{6.10}$$

for some $\xi$ between $t$ and $G_n^{-1}(t)$. This Taylor expansion can be used to relate the (weighted) $L_2$ norms of $v_n$ and $u_n/f(F^{-1})$. This is the content of our next result, see [31] for the proof. Here $w$ is a non-negative measurable function on $(0, 1)$ and $\|\cdot\|_{2,w,n}$ and $\langle \cdot, \cdot\rangle_{w,n}$ denote, respectively, the norm and the inner product in the space $L_2((1/n, 1 - 1/n), w(t)dt)$. Then we have:

Lemma 6.3. Let $F$ be a distribution function which is twice differentiable on its open support $(a_F, b_F)$, with $f(x) := F'(x) > 0$ for all $a_F < x < b_F$, and which satisfies condition (6.7). Assume further that $w$ is a non-negative measurable function such that

$$\lim_{n \to \infty} \frac{1}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{t^{1/2}(1 - t)^{1/2}}{f^2(F^{-1}(t))}\, w(t)\, dt = 0. \tag{6.11}$$

Then, if $u_n$ is the uniform quantile process and $v_n$ is the quantile process defined by (6.8) and (6.9),

$$\|v_n\|_{2,w,n}^2 - \left\| \frac{u_n}{f(F^{-1})} \right\|_{2,w,n}^2 \to 0 \quad \text{and} \quad \left\| v_n - \frac{u_n}{f(F^{-1})} \right\|_{2,w,n} \to 0 \tag{6.12}$$

in probability.

In fact, as mentioned in the introduction, we are interested not in $\|v_n\|_{2,w,n}$ but rather in $\|v_n\|_{2,w}$, where $\|\cdot\|_{2,w}$ denotes the $L_2$-norm with respect to the measure $w(t)dt$ over the whole interval $(0, 1)$. So, we must deal next with the integrals at the extremes. The problem can be solved by imposing conditions which are
related to, but weaker than, the usual von Mises conditions on domains of attraction (e.g., Parzen (1979) and Schuster (1984)). Again, we refer to [31] for a proof.

Lemma 6.4. Let $F$ be a distribution function which is twice differentiable on its open support $(a_F, b_F)$, with $f(x) := F'(x) > 0$ for all $a_F < x < b_F$. Assume that $F$ satisfies condition (6.7), that either $a_F > -\infty$ or

$$\liminf_{x \to 0+} \frac{|f'(F^{-1}(x))|\, x}{f^2(F^{-1}(x))} > 0, \tag{6.13}$$

and either $b_F < \infty$ or

$$\liminf_{x \to 0+} \frac{|f'(F^{-1}(1 - x))|\, x}{f^2(F^{-1}(1 - x))} > 0. \tag{6.14}$$

Assume further that $w$ is a bounded non-negative measurable function such that

$$\lim_{x \to 0+} \frac{x \int_0^x w(t)\, dt}{f^2(F^{-1}(x))} = 0 \quad \text{and} \quad \lim_{x \to 0+} \frac{x \int_{1-x}^1 w(t)\, dt}{f^2(F^{-1}(1 - x))} = 0. \tag{6.15}$$

Then,

$$\|v_n\|_{2,w}^2 - \|v_n\|_{2,w,n}^2 \to 0 \tag{6.16}$$
in probability.

As a consequence of these two lemmas we have:

Proposition 6.5. Let $F$ be a distribution function which is twice differentiable on its open support $(a_F, b_F)$, with $f(x) := F'(x) > 0$ for all $a_F < x < b_F$. Assume that $F$ satisfies conditions (6.7), (6.13) and (6.14). Let $w$ be a non-negative measurable function for which the limits (6.15) hold. Assume further that

$$\lim_{n \to \infty} \frac{1}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{t^{1/2}(1 - t)^{1/2}}{f^2(F^{-1}(t))}\, w(t)\, dt = 0.$$

Then,

$$\|v_n\|_{2,w}^2 - \left\| \frac{u_n}{f(F^{-1})} \right\|_{2,w,n}^2 \to 0 \tag{6.17}$$

in probability. If, moreover, $h \in L_2(w(t)dt)$ and the sequence $\langle u_n/f(F^{-1}), h\rangle_{w,n}$ is stochastically bounded, then

$$\langle v_n, h\rangle_w^2 - \left\langle \frac{u_n}{f(F^{-1})}, h \right\rangle_{w,n}^2 \to 0 \tag{6.18}$$

in probability.

Proof. The conclusion (6.17) is a direct consequence of the last two lemmas. The limit (6.18) follows from the same two lemmas, stochastic boundedness of $\langle u_n/f(F^{-1}), h\rangle_{w,n}$, Hölder's inequality and the identities

$$\langle v_n, h\rangle_{w,n}^2 - \left\langle \frac{u_n}{f(F^{-1})}, h \right\rangle_{w,n}^2 = \left\langle v_n - \frac{u_n}{f(F^{-1})}, h \right\rangle_{w,n} \left( \left\langle v_n - \frac{u_n}{f(F^{-1})}, h \right\rangle_{w,n} + 2 \left\langle \frac{u_n}{f(F^{-1})}, h \right\rangle_{w,n} \right)$$

and

$$\langle v_n, h\rangle_w^2 - \langle v_n, h\rangle_{w,n}^2 = \int_{(0,1/n] \cup [1-1/n,1)} v_n h w \left( \int_{(0,1/n] \cup [1-1/n,1)} v_n h w + 2\langle v_n, h\rangle_{w,n} \right).$$

(Note that, by the first identity, $\{\langle v_n, h\rangle_{w,n}\}$ is stochastically bounded.)
Obviously, combining this proposition with (6.3) and (6.4) for $g = f(F^{-1})$ reduces convergence in distribution of $\|v_n\|_{2,w}^2$ and of $\|\hat v_n\|_{2,w}^2$ to convergence in distribution of $L_n$, which is a function of exponential random variables that will be relatively easy to handle. The conditions under which this has been established here are weaker than those usually found in the literature.

6.1. Weak convergence of L2 linear combinations of exponential r.v.'s.

We should point out that the results that follow do not require the variables $\xi_i$ to be exponential, but only to be integrable enough; however, we stay with exponential variables, which is what we need. In this section, the functions $c_{n,i}$ in the expression (2.4) of $L_n$ are allowed to be arbitrary functions in $L_2(0, 1)$. Given a triangular array $c_{n,i}$, $i \le n$, $n \in \mathbb{N}$, of functions in $L_2(0, 1)$, we set

$$Y_n(t) := \sum_{i=1}^n c_{n,i}(t)\xi_i, \quad Z_n(t) = \frac{n-1}{S_n}\, Y_n(t), \quad t \in [0, 1]. \tag{6.19}$$

Define $c_{n,i,j} = c_{i,j}$ as

$$c_{n,i,j} = c_{i,j} = \int_0^1 c_{n,i}(t) c_{n,j}(t)\, dt := \langle c_{n,i}, c_{n,j}\rangle, \quad i, j = 1, \ldots, n,\ n \in \mathbb{N}. \tag{6.20}$$

It will also be convenient to introduce the functions

$$K_n(s, t) = \sum_{i=1}^n c_{n,i}(s) c_{n,i}(t) \quad \text{and} \quad m_n(t) = \sum_{i=1}^n c_{n,i}(t), \quad t \in [0, 1],\ n \in \mathbb{N}, \tag{6.21}$$

which are the covariance and the mean functions, respectively, of the random processes $Y_n(t)$. With tensor notation $K_n = \sum_i c_{n,i} \otimes c_{n,i}$. Obviously

$$\|K_n\|_2^2 = \sum_{i,j} \langle c_{n,i}, c_{n,j}\rangle^2 = \sum_{i,j} c_{n,i,j}^2 \quad \text{and} \quad \|m_n\|_2^2 = \sum_{i,j} \langle c_{n,i}, c_{n,j}\rangle = \sum_{i,j} c_{n,i,j}. \tag{6.22}$$

Before getting into convergence we examine some interesting integrability issues. The first result of this subsection is based on the Paley-Zygmund argument, e.g., de la Peña and Giné (1999), pp. 119-124, in particular Corollary 3.3.4 there, that we restate here for ease of reference:

Lemma 6.6. Let $V$ be a random variable such that $EV^4 \le C(EV^2)^2$. Then, for all $t > 0$,

$$I_{EV^2 \ge 2t^2} \le 4C \Pr\{|V| > t\}$$
that is, $EV^2 < 2a^2$ whenever $\Pr\{|V| > a\} < \frac{1}{4C}$.

Proof. It follows immediately upon observing that

$$EV^2 \le t^2 + E\big( V^2 I_{|V| > t} \big) \le t^2 + (EV^4)^{1/2} \big( \Pr\{|V| > t\} \big)^{1/2}.$$
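A quick sanity check of Lemma 6.6 can be made with a single mean-one exponential variable $V = \xi$, for which $EV^k = k!$ and $\Pr\{V > t\} = e^{-t}$. The sketch below is only an illustration; the helper name `lemma_6_6_holds` is invented here, and the smallest admissible constant $C = EV^4/(EV^2)^2$ is used.

```python
import math


def lemma_6_6_holds(ev2, ev4, tail, t):
    """Check I{EV^2 >= 2 t^2} <= 4 C Pr{|V| > t} with C = EV^4 / (EV^2)^2."""
    c = ev4 / ev2 ** 2
    indicator = 1.0 if ev2 >= 2.0 * t * t else 0.0
    return indicator <= 4.0 * c * tail(t)


# Mean-one exponential variable: EV^2 = 2! = 2 and EV^4 = 4! = 24.
EV2, EV4 = 2.0, 24.0
C = EV4 / EV2 ** 2


def exp_tail(t):
    """Pr{V > t} for a mean-one exponential variable, t >= 0."""
    return math.exp(-t)


results = {t: lemma_6_6_holds(EV2, EV4, exp_tail, t) for t in (0.1, 0.5, 1.0, 2.0, 5.0)}
print(C, results)
```

For $t \le 1$ the indicator equals one and the inequality requires $e^{-t} \ge 1/(4C) = 1/24$, which clearly holds; for $t > 1$ the indicator vanishes and the inequality is trivial.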
Proposition 6.7. The sequence $\{\|Y_n\|_2\}$, with $Y_n$ as defined in (6.19), is stochastically bounded if and only if both conditions

$$\sup_n \sum_{k=1}^n \|c_{n,k}\|_2^2 < \infty \tag{6.23}$$

and

$$\sup_n \|m_n\|_2^2 = \sup_n \sum_{1 \le i,j \le n} \langle c_{n,i}, c_{n,j}\rangle < \infty \tag{6.24}$$

are satisfied, and the same is true for the sequence $\{\|Z_n\|_2\}$. Moreover, $\lim_n \|Y_n\|_2 = 0$ in probability if and only if

$$\lim_n \sum_{k=1}^n \|c_{n,k}\|_2^2 = \lim_n \sum_{1 \le i,j \le n} \langle c_{n,i}, c_{n,j}\rangle = 0, \tag{6.25}$$

and the same is true for the sequence $\{\|Z_n\|_2\}$.

Proof. We will use the abbreviated notation $c_{i,j}$ for $c_{n,i,j}$. Since

$$E\|Y_n\|_2^2 = E\int_0^1 \Big( \sum_k c_{n,k}(t)\xi_k \Big)^2 dt = \int_0^1 \Big( \sum_k c_{n,k}^2(t) + \sum_{i,j} c_{n,i}(t) c_{n,j}(t) \Big) dt = \sum_k c_{k,k} + \sum_{i,j} c_{i,j}, \tag{6.26}$$

it follows that the conditions (6.23) and (6.24) are sufficient for tightness of the sequence $\{\|Y_n\|_2\}$ and that (6.25) is sufficient for its convergence to zero in probability. Sufficiency for tightness and convergence of $\{\|Z_n\|_2\}$ follows from this and the law of large numbers. Necessity in both cases follows immediately from Lemmas 6.6 and 6.2.

One can say more about the way $Y_n$ converges. In fact, stochastic boundedness of $\|Y_n\|_2^m$ implies uniform boundedness of moments of any order. This fact is proved in [31]. With these preliminaries on integrability out of the way, we consider now convergence in law of the sequence $\{\|Y_n\|_2\}$. We consider several cases, corresponding to the different cases for convergence of the square integral of the quantile process described

a) Convergence of the processes $Y_n$. Here we obtain necessary and sufficient conditions for weak convergence of $Y_n$ as $L_2$-valued random vectors; then convergence
of $\|Y_n\|_2^2$ will be an immediate consequence of the continuous mapping theorem for weak convergence. Note that $P(\|c_{n,i}\xi_i\|_2 > \varepsilon) = \exp(-\varepsilon/\|c_{n,i}\|_2)$ and therefore, the triangular array $\{c_{n,i}\xi_i : i = 1, \ldots, n;\ n \in \mathbb{N}\}$ is infinitesimal if and only if

$$\max_i \|c_{n,i}\|_2 \to 0 \tag{6.27}$$

as $n \to \infty$. The next theorem gives necessary and sufficient conditions for the convergence in law in $L_2(0, 1)$ of $\{Y_n\}$ under (6.27). Under infinitesimality, the only possible limits of $\{Y_n\}$ are Gaussian, with a trace-class covariance operator. $K_n$ and $m_n$ are defined as in (6.21).

Theorem 6.8. Assuming condition (6.27) holds, the sequence $\{Y_n\}$ converges in law in $L_2(0, 1)$ if and only if the following conditions hold:
i) There exists a symmetric, positive semi-definite, trace-class kernel $K(s, t) \in L_2((0, 1) \times (0, 1))$ such that
$$K_n \to_{L_2} K. \tag{6.28}$$
ii) If $\lambda_i \ge 0$ are the eigenvalues of $K$ then
$$\sum_{i=1}^n \|c_{n,i}\|_2^2 \to \sum_{i=1}^\infty \lambda_i. \tag{6.29}$$
iii) There exists $m \in L_2(0, 1)$ such that
$$m_n \to_{L_2} m. \tag{6.30}$$
If i), ii) and iii) hold, then $Y_n$ converges in law in $L_2(0, 1)$ to an $L_2(0, 1)$-valued Gaussian random variable $Y$ with mean function $m$ and covariance operator $\Phi_K$ given by

$$\Phi_K(f, g) = \int_0^1 \int_0^1 K(s, t) f(s) g(t)\, ds\, dt$$

for $f, g \in L_2(0, 1)$.
Proof. Necessity. Let us assume first that the $L_2$-valued random vectors $Y_n$ converge in law. Then $\{\|Y_n\|_2\}$ also converges in law and, moreover, by Proposition 6.3, its moments converge as well (to the moments of the limit). This implies, in particular, that the sequence

$$E\|Y_n\|_2^2 = \sum_{i=1}^n \|c_{n,i}\|_2^2 + \Big\| \sum_{i=1}^n c_{n,i} \Big\|_2^2, \quad n \in \mathbb{N}, \tag{6.31}$$

converges. Note also that convergence in law of $Y_n$ to $Y$ plus uniform integrability of $\{\|Y_n\|_2\}$, which is a consequence of moment convergence, ensure that $EY_n \to_{L_2} EY$ and, therefore, that

$$\sum_{i=1}^n c_{n,i} \to_{L_2} m := EY. \tag{6.32}$$
Now, (6.31) and (6.32) imply (6.30) and also that the left-hand side in (6.29) converges to a finite limit. We have also proved that $\{Y_n - EY_n\}$ converges in law, namely,

$$Y_n - EY_n = \sum_{i=1}^n c_{n,i}(\xi_i - 1) \to_d Y - EY. \tag{6.33}$$

Also, (6.33) and uniform integrability imply

$$\sum_{i=1}^n \|c_{n,i}\|_2^2 = E\|Y_n - EY_n\|_2^2 \to E\|Y - EY\|_2^2. \tag{6.34}$$

Another consequence of (6.33) is that

$$(Y_n - EY_n) \otimes (Y_n - EY_n) \to_d (Y - EY) \otimes (Y - EY) \tag{6.35}$$

in $L_2((0, 1) \times (0, 1))$ ((6.35) follows from (6.33) and continuity of the map $(f, g) \mapsto f \otimes g$). Now, since $\|f \otimes g\|_2 = \|f\|_2 \|g\|_2$, convergence of moments of $\|Y_n\|_2$ ensures also uniform integrability of $(Y_n - EY_n) \otimes (Y_n - EY_n)$ and, just as above, we obtain that

$$K_n = E\big[ (Y_n - EY_n) \otimes (Y_n - EY_n) \big] \to_{L_2} K := E\big[ (Y - EY) \otimes (Y - EY) \big].$$

Let now $\lambda_i$, $\phi_i$ be, respectively, the eigenvalues and the corresponding eigenfunctions associated to the kernel $K$. Then

$$\lambda_i = \int_0^1 \int_0^1 K(s, t)\phi_i(s)\phi_i(t)\, ds\, dt = E\langle Y - EY, \phi_i\rangle^2.$$

Therefore,

$$E\|Y - EY\|_2^2 = E\sum_{i=1}^\infty \langle Y - EY, \phi_i\rangle^2 = \sum_{i=1}^\infty \lambda_i,$$

which combined with (6.34) yields (6.28) and (6.29).

Sufficiency. Assume now that (6.27), (6.28), (6.29) and (6.30) hold. Let us denote $\xi_{i,\varepsilon} = \xi_i I\{\|c_{n,i}\xi_i\|_2 \le \varepsilon\}$ and $\xi_i^\varepsilon = \xi_i I\{\|c_{n,i}\xi_i\|_2 > \varepsilon\}$ for $\varepsilon > 0$ and $i \in \mathbb{N}$. Since $E\xi_i I\{\xi_i > t\} = (t + 1)e^{-t} \le 1/t^2$ for large enough $t$, condition (6.29) implies that, for large enough $n$,

$$\Big\| E\sum_{i=1}^n c_{n,i}\xi_i^\varepsilon \Big\|_2 = \Big\| \sum_{i=1}^n c_{n,i} E\xi_i^\varepsilon \Big\|_2 \le \sum_{i=1}^n \|c_{n,i}\|_2\, E\xi_i^\varepsilon \le \frac{1}{\varepsilon^2} \sum_{i=1}^n \|c_{n,i}\|_2^2\, \max_i \|c_{n,i}\|_2 \to 0.$$

Hence, by the Central Limit Theorem in Hilbert spaces (see, e.g., Araujo and Giné (1980), Corollary 3.7.8), if $\{Y_n\}$ is shift convergent in law, we can take the shifts to be the expected values $EY_n$ and therefore, by the same central limit theorem, the proof reduces to showing that
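i) $\sum_{j=1}^n P(\|c_{n,j}\xi_j\|_2 > \varepsilon) \to 0$ as $n \to \infty$ for every $\varepsilon > 0$,
ii) for every $\varepsilon > 0$ and every $f \in L_2(0, 1)$
$$\sum_{j=1}^n \mathrm{Var}\big( \langle c_{n,j}\xi_{j,\varepsilon}, f\rangle \big) \to \Phi_K(f, f)$$
and
iii) there exists a c.o.n.s. $\{\phi_i\}_{i \ge 1}$ in $L_2(0, 1)$ such that
$$\lim_{k \to \infty} \limsup_{n \to \infty} \sum_{j=1}^n \Big[ E\|c_{n,j}\xi_{j,\varepsilon} - Ec_{n,j}\xi_{j,\varepsilon}\|_2^2 - \sum_{i=1}^k E\langle c_{n,j}\xi_{j,\varepsilon} - Ec_{n,j}\xi_{j,\varepsilon}, \phi_i\rangle^2 \Big] = 0.$$

To check i), we see that, as a consequence of (6.27) and (6.28),

$$\sum_{j=1}^n P(\|c_{n,j}\xi_j\|_2 > \varepsilon) \le \frac{1}{\varepsilon^2} \sum_{j=1}^n E\|c_{n,j}\xi_j^\varepsilon\|_2^2 = \frac{1}{\varepsilon^2} \sum_{j=1}^n \|c_{n,j}\|_2^2\, E(\xi_j^\varepsilon)^2 \le \frac{1}{\varepsilon^2} \sum_{j=1}^n \|c_{n,j}\|_2^2\, E\xi_1^2 I\{\xi_1 > \varepsilon/\max_i \|c_{n,i}\|_2\} \to 0$$

as $n \to \infty$. This calculation shows also that for every $f \in L_2(0, 1)$

$$\sum_{j=1}^n \mathrm{Var}\big( \langle c_{n,j}\xi_j^\varepsilon, f\rangle \big) \le \sum_{j=1}^n E\langle c_{n,j}\xi_j^\varepsilon, f\rangle^2 \le \|f\|_2^2 \sum_{j=1}^n E\|c_{n,j}\xi_j^\varepsilon\|_2^2 \to 0,$$

which implies that ii) is equivalent to

$$\sum_{j=1}^n \mathrm{Var}\big( \langle c_{n,j}\xi_j, f\rangle \big) \to \Phi_K(f, f). \tag{6.36}$$

In order to prove (6.36), we recall that

$$\mathrm{Cov}\Big( \sum_{j=1}^n c_{n,j}(s)\xi_j, \sum_{j=1}^n c_{n,j}(t)\xi_j \Big) = K_n(s, t),$$

which, combined with (6.28), implies that

$$\begin{aligned}
\sum_{j=1}^n \mathrm{Var}\big( \langle c_{n,j}\xi_j, f\rangle \big) &= \mathrm{Var}\Big( \int_0^1 \sum_{j=1}^n c_{n,j}(t)\xi_j f(t)\, dt \Big) \\
&= \int_0^1 \int_0^1 \mathrm{Cov}\Big( \sum_{j=1}^n c_{n,j}(s)\xi_j, \sum_{j=1}^n c_{n,j}(t)\xi_j \Big) f(s) f(t)\, ds\, dt \\
&= \int_0^1 \int_0^1 K_n(s, t) f(s) f(t)\, ds\, dt \to \int_0^1 \int_0^1 K(s, t) f(s) f(t)\, ds\, dt = \Phi_K(f, f),
\end{aligned}$$

proving (6.36). Finally, to show that iii) holds, let $\{\phi_i\}$ be a c.o.n.s. of eigenfunctions of the covariance operator $\Phi_K$ and let $\{\lambda_i\}$ be the associated eigenvalues.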
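Then, using (6.29) and the fact that, as shown above, $\sum_{j=1}^n E\|c_{n,j}\xi_j^\varepsilon\|_2^2 \to 0$, we have

$$\lim_{n \to \infty} \sum_{j=1}^n E\|c_{n,j}\xi_{j,\varepsilon} - Ec_{n,j}\xi_{j,\varepsilon}\|_2^2 = \lim_{n \to \infty} \sum_{j=1}^n E\|c_{n,j}\xi_j - Ec_{n,j}\xi_j\|_2^2 = \lim_{n \to \infty} \sum_{j=1}^n \|c_{n,j}\|_2^2 = \sum_{i=1}^\infty \lambda_i$$

and, by (6.36),

$$\lim_{n \to \infty} \sum_{j=1}^n \sum_{i=1}^k E\langle c_{n,j}\xi_{j,\varepsilon} - Ec_{n,j}\xi_{j,\varepsilon}, \phi_i\rangle^2 = \sum_{i=1}^k \Phi_K(\phi_i, \phi_i) = \sum_{i=1}^k \lambda_i,$$

which completes the proof.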
The above proof shows that the sufficiency part of Theorem 6.8 holds as well if we only assume that the random variables $\xi_i$ are i.i.d. and square integrable (with trivial adjustments to account for mean and variance possibly different from 1). As an immediate consequence of Theorem 6.8 we obtain sufficient conditions for convergence in law of the $L_2$ norms of linear combinations of independent exponential random variables:

Corollary 6.9. Suppose (6.27), (6.28), (6.29) and (6.30) hold. Let $S$ be a metric space and let $H : L_2(0, 1) \to S$ be a continuous function. Then

$$H(Y_n) \to_d H(Y),$$

where $Y$ is an $L_2(0, 1)$-valued Gaussian random variable with mean function $m$ and covariance operator $\Phi_K$ given by

$$\Phi_K(f, g) = \int_0^1 \int_0^1 K(s, t) f(s) g(t)\, ds\, dt$$

for $f, g \in L_2(0, 1)$. Below, we will apply this corollary for $H(f) = \|f\|_2^2 - \sum_{k=1}^2 \langle f, h_k\rangle^2$ with $h_k \in L_2$.

Remark 6.9.1. The limiting random process $Y$ in Theorem 6.8 and Corollary 6.9 is centered if and only if $\|m_n\|_2^2 = \sum_{i,j} c_{n,i,j} \to 0$. The type of argument employed in the proof of Theorem 6.8 shows that, under the infinitesimality condition (6.27), conditions (6.28) and (6.29) are necessary and sufficient for convergence in law of the processes $Y_n - EY_n$ and that the limiting random process has then a centered Gaussian distribution with covariance operator $\Phi_K$.

b) Shift convergence of $\|Y_n\|_2^2$, I. It can be proved that shift tightness of $\{\|Y_n\|_2^2\}$ implies tightness of the sequence centered at expectations, and even tightness of the sequence $\{\|Z_n\|_2^2 - E\|Y_n\|_2^2\}$. We mention this to help appreciate the sharpness of the results that follow, but omit any proof of it. In the previous subsection, we
examined the case when the kernels $K_n$ associated to $Y_n$ converge in $L_2$ to a trace class kernel. In that case, $Y_n - EY_n \to_d Y - m = \sum_i \sqrt{\lambda_i}\,\phi_i Z_i$ and $\|Y - m\|_2^2 = \sum_i \lambda_i Z_i^2$, where $\{Z_i\}$ is an ortho-Gaussian sequence (a sequence of i.i.d. standard normal random variables). Of course, convergence of this series requires $\sum_i \lambda_i < \infty$. However, if we allow centering, then

$$\|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 \to_d \sum_i \lambda_i (Z_i^2 - 1)$$

and, clearly, in order to make sense of this limit it suffices (and is necessary as well) that $\sum_i \lambda_i^2 < \infty$, a weaker condition. We deal here with this situation, that is, we relax the assumptions on $K$ in Theorem 6.8 by only assuming that $K \in L_2((0, 1) \times (0, 1))$. In this case, the operator induced by $K$ on $L_2(0, 1)$ is Hilbert-Schmidt, that is, its eigenvalues $\{\lambda_k\}$ satisfy $\sum_k \lambda_k^2 < \infty$ (e.g., Dunford and Schwartz (1963), XI.6 and XI.8.44). Then, with considerable abuse of notation, we define

$$\|Y - EY\|_2^2 - E\|Y - EY\|_2^2 := \sum_k \lambda_k (Z_k^2 - 1), \tag{6.37}$$

where the variables $Z_k$ are independent standard normal ($Y$ may not exist but the series does converge a.s.). In fact, $\|Y - EY\|_2^2 - E\|Y - EY\|_2^2$ makes sense as a multiple Wiener integral, but this is of marginal interest for the sequel. We state now a useful lemma on the asymptotic normality of sums of independent exponential random variables. The proof is omitted, but it is just an easy exercise on the CLT on the line if one uses the fact that the Gaussian family is factor closed.

Lemma 6.10. If $\{a_{n,i} : i = 1, \ldots, n;\ n \in \mathbb{N}\}$ is a triangular array of real numbers then $\sum_{i=1}^n a_{n,i}(\xi_i - 1) \to_d \sigma Z$, where $Z$ is a standard normal random variable, if and only if

$$\max_i |a_{n,i}| \to 0 \quad \text{and} \quad \sum_{i=1}^n a_{n,i}^2 \to \sigma^2. \tag{6.38}$$

Moreover, if $\max_i |a_{n,i}| \to 0$ then the only possible limit laws of $\sum_{i=1}^n a_{n,i}(\xi_i - 1)$ are normal and convergence in law is equivalent to convergence of $\sum_{i=1}^n a_{n,i}^2$.

The main argument in this subsection is contained in the proof of the following proposition.

Proposition 6.11. If $\max_i \|c_{n,i}\|_2 \to 0$ and $K_n \to_{L_2} K$ (so that $K$ is necessarily in $L_2((0, 1) \times (0, 1))$), then

$$\|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 \to_d \|Y - EY\|_2^2 - E\|Y - EY\|_2^2,$$

where $\|Y - EY\|_2^2 - E\|Y - EY\|_2^2$ is as defined in (6.37).
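Proof. Let $\{\phi_k\}$ be a c.o.n.s. of eigenfunctions of $K$, $\phi_k$ of eigenvalue $\lambda_k$ for each $k$. Then

$$\|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 = \sum_{k=1}^\infty \big( \langle Y_n - EY_n, \phi_k\rangle^2 - E\langle Y_n - EY_n, \phi_k\rangle^2 \big),$$

where $\langle Y_n - EY_n, \phi_k\rangle = \sum_{i=1}^n \langle c_{n,i}, \phi_k\rangle(\xi_i - 1)$. By Lemma 6.10 under (6.38) the only possible limit laws of $\langle Y_n - EY_n, \phi_k\rangle$ are normal, and there is convergence if and only if $\{\sum_{i=1}^n \langle c_{n,i}, \phi_k\rangle^2\}$ converges. Since $K_n \to_{L_2} K$, we have in particular,

$$E\langle Y_n - EY_n, \phi_k\rangle \langle Y_n - EY_n, \phi_l\rangle = \int_0^1 \int_0^1 K_n(s, t)\phi_k(s)\phi_l(t)\, ds\, dt \to \int_0^1 \int_0^1 K(s, t)\phi_k(s)\phi_l(t)\, ds\, dt = \begin{cases} \lambda_k & \text{if } k = l, \\ 0 & \text{if } k \ne l, \end{cases}$$

and therefore $(\langle Y_n - EY_n, \phi_k\rangle)_{k=1}^M \to_d (\sqrt{\lambda_k}\, Z_k)_{k=1}^M$. Then, by the continuous mapping theorem for weak convergence,

$$\sum_{k=1}^M \big( \langle Y_n - EY_n, \phi_k\rangle^2 - E\langle Y_n - EY_n, \phi_k\rangle^2 \big) \to_d \gamma_M := \sum_{k=1}^M \lambda_k (Z_k^2 - 1). \tag{6.39}$$

Since $\sum_{k=1}^\infty \lambda_k^2 = \|K\|_2^2 < \infty$, $\gamma_M$ converges a.s. and in $L_2$ and, with some abuse of notation, as explained above, we denote this limit as $\|Y - EY\|_2^2 - E\|Y - EY\|_2^2$, that is, we have,

$$\gamma_M \to_{L_2} \|Y - EY\|_2^2 - E\|Y - EY\|_2^2. \tag{6.40}$$

Set $c_{n,i}^M = c_{n,i} - \sum_{k=1}^M \langle c_{n,i}, \phi_k\rangle \phi_k$, $Y_n^M = \sum_{i=1}^n c_{n,i}^M \xi_i$ and $K_n^M = \sum_{i=1}^n c_{n,i}^M \otimes c_{n,i}^M$. Observe that

$$\sum_{k=M+1}^\infty \big( \langle Y_n - EY_n, \phi_k\rangle^2 - E\langle Y_n - EY_n, \phi_k\rangle^2 \big) = \|Y_n^M - EY_n^M\|_2^2 - E\|Y_n^M - EY_n^M\|_2^2. \tag{6.41}$$

We claim that, under (6.28),

$$\lim_{M \to \infty} \limsup_{n \to \infty} \mathrm{Var}\big( \|Y_n^M - EY_n^M\|_2^2 \big) = 0. \tag{6.42}$$

We recall that $K = \sum_{k=1}^\infty \lambda_k \phi_k \otimes \phi_k$ and define $K^M = \sum_{k=M+1}^\infty \lambda_k \phi_k \otimes \phi_k$. We can easily see that $\|K^M\|_2^2 = \sum_{k=M+1}^\infty \lambda_k^2$. Now, since $\{\phi_k \otimes \phi_k\}_k \cup \{\frac{1}{\sqrt{2}}(\phi_k \otimes \phi_l + \phi_l \otimes \phi_k)\}_{k \ne l}$ is an orthonormal basis for $L_2((0, 1) \times (0, 1))$ we have that

$$\|K_n^M - K^M\|_2^2 = \sum_k \langle K_n^M - K^M, \phi_k \otimes \phi_k\rangle^2 + 2\sum_{k \ne l} \langle K_n^M - K^M, \phi_k \otimes \phi_l\rangle^2 \tag{6.43}$$

(here we have used the fact that $\langle f \otimes f, g \otimes h\rangle = \langle f \otimes f, h \otimes g\rangle$). Observe that $\langle K^M, \phi_k \otimes \phi_l\rangle = \langle K, \phi_k \otimes \phi_k\rangle$ if $k = l > M$ and $\langle K^M, \phi_k \otimes \phi_l\rangle = 0$, otherwise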
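and, also, that $\langle K_n^M, \phi_k \otimes \phi_l\rangle = \langle K_n, \phi_k \otimes \phi_l\rangle$ if $k, l > M$ and $\langle K_n^M, \phi_k \otimes \phi_l\rangle = 0$ otherwise. Combining this with (6.43) we obtain that

$$\begin{aligned}
\|K_n^M - K^M\|_2^2 &= \sum_{k > M} \langle K_n - K, \phi_k \otimes \phi_k\rangle^2 + 2\sum_{k \ne l;\ k,l > M} \langle K_n - K, \phi_k \otimes \phi_l\rangle^2 \\
&\le \sum_k \langle K_n - K, \phi_k \otimes \phi_k\rangle^2 + 2\sum_{k \ne l} \langle K_n - K, \phi_k \otimes \phi_l\rangle^2 = \|K_n - K\|_2^2.
\end{aligned}$$

The last inequality implies that $K_n^M \to_{L_2} K^M$ and also that $\|K_n^M\|_2 \to \|K^M\|_2$. Now this convergence combined with the fact that $\mathrm{Var}(\|Y_n^M - EY_n^M\|_2^2) \le 8\|K_n^M\|_2^2$ proves claim (6.42). Now, the proposition follows from (6.39), (6.40), (6.41) and (6.42) through a standard $3\varepsilon$ argument.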
We should remark that this proposition also holds true if we replace the sequence of exponential random variables by an i.i.d. sequence of square integrable random variables, with only formal changes in the proof. Both Theorem 6.8 and Proposition 6.11 are practically exercises on the central limit theorem in Hilbert space. However, Proposition 6.11 can be seen as a limit theorem for quadratic forms, and this subject has a long history, reviewed e.g. in Guttorp and Lockhart (1988). Theorem 1 in de Wet and Venter (1973) and Theorem 5 in Guttorp and Lockhart (1988) could seemingly apply to give Proposition 6.11; however, the conditions in either theorem are quite difficult to verify and we could not check them in our case, whereas the conditions in Proposition 3.5 are very easy to decide in general.

We conclude this subsection by giving sufficient conditions for convergence in law of $\{\|Y_n\|_2^2 - E\|Y_n\|_2^2\}$. The result is not directly applicable to Wasserstein distances, but the changes needed for that are straightforward and omitted here. A warning on notation: we write $\langle Y - EY, h\rangle = \sum_k \sqrt{\lambda_k}\,\langle \phi_k, h\rangle Z_k$ for $\phi_k$ orthonormal, $h \in L_2(0,1)$ and $\sum_k \lambda_k^2 < \infty$, although $Y - EY$ may not make sense.

Theorem 6.12. If $\max_i \|c_{n,i}\|_2 \to 0$, $K_n \to K$ in $L_2$ and $m_n \to m$ in $L_2$ then
\[
\|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_d \|Y - EY\|_2^2 - E\|Y - EY\|_2^2 + 2\langle Y - EY, m\rangle := \sum_{k=1}^\infty \lambda_k (Z_k^2 - 1) + 2\sum_{k=1}^\infty \sqrt{\lambda_k}\,\langle m, \phi_k\rangle Z_k,
\]
where $\|Y - EY\|_2^2 - E\|Y - EY\|_2^2$ is defined as in (6.37) and $\{Z_k\}$ is an ortho-Gaussian sequence.

Proof. Formally we require the proof of Proposition 6.11 rather than its statement. First we note
\[
\|Y_n\|_2^2 - E\|Y_n\|_2^2 = \big( \|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 \big) + 2\langle Y_n - EY_n, EY_n\rangle. \tag{6.44}
\]
As in the previous proof,
\[
E\big( \langle Y_n - EY_n, EY_n\rangle \langle Y_n - EY_n, \phi_k\rangle \big) = \langle K_n, m_n \otimes \phi_k\rangle \to \langle K, m \otimes \phi_k\rangle = \lambda_k \langle m, \phi_k\rangle .
\]
This implies that for each $M$ we have convergence in law of the vector
\[
\big( \langle Y_n - EY_n, \phi_1\rangle, \ldots, \langle Y_n - EY_n, \phi_M\rangle, \langle Y_n - EY_n, EY_n\rangle \big)
\]
to the Gaussian vector
\[
\Big( \sqrt{\lambda_1}\, Z_1, \ldots, \sqrt{\lambda_M}\, Z_M, \sum_{k=1}^\infty \sqrt{\lambda_k}\,\langle m, \phi_k\rangle Z_k \Big).
\]
This gives weak convergence, for every $M < \infty$, of the random variables
\[
\sum_{k=1}^{M} \big( \langle Y_n - EY_n, \phi_k\rangle^2 - E\langle Y_n - EY_n, \phi_k\rangle^2 \big) + 2\langle Y_n - EY_n, EY_n\rangle,
\]
in analogy with (6.39). By (6.44) these random variables are 'finite-dimensional' approximations of the sequence of interest, and the result now follows by the approximation argument in the previous proof, as a consequence of the limit (6.42).

The hypotheses in this theorem are very natural. We will not deal with the question of whether they are necessary (given infinitesimality); however, note that the existence of $K$ and $m$ is necessary in order to define the limit.

c) Shift convergence of $\|Y_n\|_2^2$, II. There are some situations in which $K_n$ is not convergent in $L_2$ but, nevertheless, $\{\|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2\}$ is weakly convergent. From the definitions we see that
\[
\|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 = \sum_{1 \le i \neq j \le n} c_{n,i,j}(\xi_i - 1)(\xi_j - 1) + \sum_{i=1}^n c_{n,i,i}\big( (\xi_i - 1)^2 - 1 \big)
\]
\[
= \sum_{i=1}^n \Big[ 2\Big( \sum_{j=1}^{i-1} c_{n,i,j}(\xi_j - 1) \Big)(\xi_i - 1) + c_{n,i,i}\big( (\xi_i - 1)^2 - 1 \big) \Big] = \sum_{i=1}^n x_{n,i},
\]
where
\[
x_{n,i} = 2\Big( \sum_{j=1}^{i-1} c_{n,i,j}(\xi_j - 1) \Big)(\xi_i - 1) + c_{n,i,i}\big( (\xi_i - 1)^2 - 1 \big)
\]
(and we use the convention that $\sum_{j=1}^{0} a_j = 0$). If $\{\xi_i'\}$ denotes an independent copy of the sequence $\{\xi_i\}$ and we set
\[
\tilde{x}_{n,i} = 2\Big( \sum_{j=1}^{i-1} c_{n,i,j}(\xi_j - 1) \Big)(\xi_i' - 1) + c_{n,i,i}\big( (\xi_i' - 1)^2 - 1 \big)
\]
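for $i = 1, \ldots, n$ and $\mathcal{F}_{n,i} = \sigma(\xi_1, \ldots, \xi_i)$, then, for each $n \in \mathbb{N}$, $\{x_{n,i}\}$ and $\{\tilde{x}_{n,i}\}$ are tangent sequences with respect to $\{\mathcal{F}_{n,i}\}$, that is, $\mathcal{L}(x_{n,i} \mid \mathcal{F}_{n,i-1}) = \mathcal{L}(\tilde{x}_{n,i} \mid \mathcal{F}_{n,i-1})$, and the random variables $\tilde{x}_{n,i}$ are conditionally independent given the sequence $\{\xi_i\}$. Hence, $\{\tilde{x}_{n,i}\}$ is a decoupled tangent sequence to $\{x_{n,i}\}$ (see, e.g., de la Peña and Giné (1999), Chapter 6). Decoupling introduces enough independence among the summands in $\sum_{i=1}^n \tilde{x}_{n,i}$ to enable us to use the CLT in order to obtain their asymptotic distribution. The principle of conditioning (Theorem 1.1 in Jakubowski (1986), reproduced in de la Peña and Giné (1999), Theorem 7.1.4) can then be used to conclude convergence in law of $\sum_{i=1}^n x_{n,i}$ itself. The proof of our next result follows this approach.

Theorem 6.13. Let $Z$ be a standard normal random variable. If
\[
\max_i \|c_{n,i}\| \to 0, \tag{6.45}
\]
\[
2\|K_n\|_2^2 + 6\sum_{i=1}^n \|c_{n,i}\|_2^4 \to \sigma^2, \tag{6.46}
\]
and
\[
\sum_{j \neq k} \Big( \sum_{i:\, i > j \vee k} \langle c_{n,i}, c_{n,j}\rangle \langle c_{n,i}, c_{n,k}\rangle \Big)^2 + \sum_{j} \Big( \sum_{i:\, i > j} \langle c_{n,i}, c_{n,j}\rangle \langle c_{n,i}, c_{n,i} + c_{n,j}\rangle \Big)^2 \to 0, \tag{6.47}
\]
then
\[
\|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 \to_d \sigma Z. \tag{6.48}
\]
If, instead of conditions (6.45), (6.46) and (6.47), we have
\[
\max_i \big( \|c_{n,i}\|_2^2 + |\langle c_{n,i}, m_n\rangle| \big) \to 0, \tag{6.49}
\]
\[
2\|K_n\|_2^2 + 2\sum_{i=1}^n \|c_{n,i}\|_2^4 + 4\sum_{i} \langle c_{n,i}, c_{n,i} + m_n\rangle^2 \to \sigma^2, \tag{6.50}
\]
and
\[
\sum_{j,k} \Big( \sum_{i:\, i > j \vee k} \langle c_{n,i}, c_{n,j}\rangle \langle c_{n,i}, c_{n,k}\rangle \Big)^2 + \sum_{j} \Big( \sum_{i:\, i > j} \langle c_{n,i}, c_{n,j}\rangle \langle c_{n,i}, c_{n,i} + m_n + c_{n,j}\rangle \Big)^2 \to 0, \tag{6.51}
\]
then
\[
\|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_d \sigma Z. \tag{6.52}
\]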
Proof. We first prove the limit in (6.48). If we set $\tilde{U}_n = \sum_{i=1}^n \tilde{x}_{n,i}$, with $\tilde{x}_{n,i}$ defined as above, the principle of conditioning (Jakubowski (1986)) reduces the proof to showing that
\[
\mathcal{L}(\tilde{U}_n \mid \{\xi_i\}) \to_w N(0, \sigma^2)
\]
in probability. Arguing as in the proof of Lemma 6.10, we can see that this is equivalent to proving that
\[
A_n := \max_i E(\tilde{x}_{n,i}^2 \mid \{\xi_j\}) = 4 \max_i \Big[ c_{i,i}^2 + \Big( c_{i,i} + \sum_{j=1}^{i-1} c_{i,j}(\xi_j - 1) \Big)^2 \Big] \to_{\Pr} 0 \tag{6.53}
\]
and
\[
B_n := \sum_i E(\tilde{x}_{n,i}^2 \mid \{\xi_j\}) = 4\sum_i c_{i,i}^2 + 4\sum_i \Big( c_{i,i} + \sum_{j=1}^{i-1} c_{i,j}(\xi_j - 1) \Big)^2 \to_{\Pr} \sigma^2. \tag{6.54}
\]
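The conditional second moment that appears in (6.53) and (6.54), and the value of $EB_n$, can be checked by hand. The following worked computation is only a sketch: the shorthand $s_i$ and the symbol $\xi_i'$ for the independent copy are notation used here for illustration, $c_{i,j}$ is assumed to stand for $\langle c_{n,i}, c_{n,j}\rangle$, and the computation uses the central moments of a standard exponential variable, $E(\xi-1)^2 = 1$, $E(\xi-1)^3 = 2$ and $E(\xi-1)^4 = 9$.

```latex
\begin{align*}
s_i &:= \sum_{j=1}^{i-1} c_{i,j}(\xi_j - 1), \qquad
\tilde{x}_{n,i} = 2 s_i(\xi_i' - 1) + c_{i,i}\big((\xi_i' - 1)^2 - 1\big),\\
E\big(\tilde{x}_{n,i}^2 \mid \{\xi_j\}\big)
 &= 4 s_i^2\, E(\xi_i'-1)^2
  + 4 s_i c_{i,i}\, E\big[(\xi_i'-1)^3 - (\xi_i'-1)\big]
  + c_{i,i}^2\, E\big[(\xi_i'-1)^2 - 1\big]^2\\
 &= 4 s_i^2 + 8 s_i c_{i,i} + 8 c_{i,i}^2
  = 4 c_{i,i}^2 + 4 (c_{i,i} + s_i)^2,\\
E B_n &= 4\sum_i c_{i,i}^2 + 4\sum_i \Big(c_{i,i}^2 + \sum_{j<i} c_{i,j}^2\Big)
  = 6\sum_i c_{i,i}^2 + 2\sum_{i,j} c_{i,j}^2 .
\end{align*}
```

Under the stated assumption, $c_{i,i} = \|c_{n,i}\|_2^2$ and $\sum_{i,j} c_{i,j}^2 = \|K_n\|_2^2$, so the last line is the quantity $2\|K_n\|_2^2 + 6\sum_i \|c_{n,i}\|_2^4$ that condition (6.46) controls.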
After a straightforward but cumbersome computation that we omit, we can see that
\[
EB_n = 2\|K_n\|_2^2 + 6\sum_{i=1}^n \|c_{n,i}\|_2^4
\]
and
\[
\mathrm{Var}(B_n) = 16 \sum_{j \neq k} \Big( \sum_{i:\, i > j \vee k} c_{i,j} c_{i,k} \Big)^2 + 4\sum_{j} \Big( \sum_{i:\, i > j} c_{i,j}(c_{i,i} + c_{i,j}) \Big)^2,
\]
which, by (6.46) and (6.47), immediately give (6.54). We check now (6.53), which is equivalent to $\max_i c_{i,i} \to 0$ and $\max_i |\sum_{j=1}^{i-1} c_{i,j}(\xi_j - 1)| \to_{\Pr} 0$. This last convergence follows from (6.47) and the use of a refined Ottaviani's maximal inequality (e.g., Proposition 1.1.2 in de la Peña and Giné (1999)):
\[
P\Big( \max_i \Big| \sum_{j=1}^{i-1} c_{i,j}(\xi_j - 1) \Big| > t \Big) \le 3 \max_i P\Big( \Big| \sum_{j=1}^{i-1} c_{i,j}(\xi_j - 1) \Big| > t/3 \Big)
\le \frac{27}{t^2} \max_i \sum_{j=1}^{i-1} c_{i,j}^2 \le \frac{27}{t^2} \max_i \sum_{j=1}^{n} c_{i,j}^2
\]
\[
= \frac{27}{t^2} \max_i \langle c_{n,i} \otimes c_{n,i}, K_n\rangle \le \frac{27}{t^2} \|K_n\|_2 \max_i \|c_{n,i}\|_2^2 \to 0.
\]
This concludes the proof of the limit (6.48). We pay attention now to the limit (6.52). The fact that
\[
\|Y_n\|_2^2 - E\|Y_n\|_2^2 = \sum_{1 \le i \neq j \le n} c_{i,j}(\xi_i - 1)(\xi_j - 1) + \sum_{i=1}^n \Big[ c_{i,i}\big( (\xi_i - 1)^2 - 1 \big) + 2\langle c_{n,i}, m_n\rangle(\xi_i - 1) \Big] = \sum_{i=1}^n y_{n,i},
\]
where
\[
y_{n,i} = 2\Big( \sum_{j=1}^{i-1} c_{n,i,j}(\xi_j - 1) \Big)(\xi_i - 1) + c_{n,i,i}\big( (\xi_i - 1)^2 - 1 \big) + 2\langle c_{n,i}, m_n\rangle(\xi_i - 1),
\]
can be used to conclude (6.52) by reproducing the proof of (6.48) almost verbatim.

The tool for Theorem 6.13, namely the principle of conditioning, which could be easily replaced by the Brown-Eagleson central limit theorem for martingales, has been used before in analogous situations. We will just mention P. Hall (1984), who uses it in density estimation, in order to prove a limit theorem for degenerate
U-statistics with varying kernels. His result is different from ours and does not apply here, but there are similarities in the proofs. The assumptions in the above theorem are quite tight (for instance, it can be shown that they are necessary for the limits (6.53) and (6.54)). An easier to check set of (stronger) sufficient conditions, more adapted to the quantile process case, can be stated with the following notation. We define
\[
(K_n \circ K_n)(s,t) := \int_0^1 K_n(s,u)K_n(t,u)\,du = \sum_{i,j} \langle c_{n,i}, c_{n,j}\rangle\, c_{n,i} \otimes c_{n,j}.
\]
It can be easily checked that
\[
\|K_n \circ K_n\|_2^2 = \sum_{j,k} \Big( \sum_{i} c_{n,i,j}\, c_{n,i,k} \Big)^2
\]
and also that
\[
\|K_n \circ K_n\|_2^2 = \int_0^1\!\!\int_0^1\!\!\int_0^1\!\!\int_0^1 K_n(s,t)K_n(u,v)K_n(s,u)K_n(t,v)\,ds\,dt\,du\,dv.
\]
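The first expression for $\|K_n \circ K_n\|_2^2$ follows from a short expansion. The sketch below assumes, as elsewhere in this section, that $c_{n,i,j} = \langle c_{n,i}, c_{n,j}\rangle$, and uses the identity $\langle f \otimes g, h \otimes m\rangle = \langle f, h\rangle\langle g, m\rangle$ together with the symmetry $c_{n,i,j} = c_{n,j,i}$.

```latex
\begin{align*}
\|K_n \circ K_n\|_2^2
 &= \Big\langle \sum_{i,j} c_{n,i,j}\, c_{n,i}\otimes c_{n,j},\
                \sum_{k,l} c_{n,k,l}\, c_{n,k}\otimes c_{n,l}\Big\rangle
  = \sum_{i,j,k,l} c_{n,i,j}\, c_{n,k,l}\, c_{n,i,k}\, c_{n,j,l}\\
 &= \sum_{j,k} \Big(\sum_i c_{n,i,j}\, c_{n,i,k}\Big)\Big(\sum_l c_{n,l,k}\, c_{n,l,j}\Big)
  = \sum_{j,k} \Big(\sum_i c_{n,i,j}\, c_{n,i,k}\Big)^2 .
\end{align*}
```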
The assumptions in the following result are often easier to deal with than (6.51). The proof can be found in [31].

Corollary 6.14. If
\[
\sum_{i=1}^n \langle c_{n,i}, m_n\rangle^2 \to 0, \quad \sum_{i=1}^n \|c_{n,i}\|_2^4 \to 0 \quad \text{and} \quad \|K_n \circ K_n\|_2 \to 0, \tag{6.55}
\]
\[
K_n^*(s,t) := \sum_i |c_{n,i}(s)c_{n,i}(t)| \le C K_n(s,t) \tag{6.56}
\]
for some absolute constant $C < \infty$, and
\[
\|K_n\|_2^2 \to \sigma^2/2, \tag{6.57}
\]
then
\[
\|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_d \sigma Z,
\]
where $Z$ is standard normal. Note that if $\|K_n \circ K_n\|_2 \to 0$ we cannot have $K_n \to K$ in $L_2$ unless $K = 0$.

Remark 6.14.1. The results on convergence or shift convergence in law of $\|Y_n\|_2^2$ derived so far in this article assume infinitesimality on the coefficients $c_{n,i}$ ($\max_i \|c_{n,i}\| \to 0$). Of course, if conditions of this type are removed, other asymptotic distributions can be obtained. It is straightforward to see, for instance, that if
\[
\sum_{1 \le i,j \le n} (c_{n,i,j} - \gamma_{i,j})^2 \to 0, \tag{6.58}
\]
for some real numbers $\{\gamma_{i,j}\}$ satisfying $\sum_{i,j} \gamma_{i,j}^2 < \infty$, then
\[
\|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 = \sum_{i,j=1}^n c_{n,i,j}\big( (\xi_i - 1)(\xi_j - 1) - \delta_{i,j} \big) \to_{L_2} \sum_{i,j=1}^\infty \gamma_{i,j}\big( (\xi_i - 1)(\xi_j - 1) - \delta_{i,j} \big).
\]
Note that the limiting random variable is well defined because the condition $\sum_{i,j} \gamma_{i,j}^2 < \infty$ implies that the associated partial sums are $L_2$ convergent. If, further,
\[
\sum_{i=1}^n \Big( \sum_{j=1}^n c_{n,i,j} - \beta_i \Big)^2 \to 0 \tag{6.59}
\]
for some real numbers $\beta_i$ such that $\sum_{i=1}^\infty \beta_i^2 < \infty$, then we also have that
\[
\|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_{L_2} \sum_{i,j=1}^\infty \gamma_{i,j}\big( (\xi_i - 1)(\xi_j - 1) - \delta_{i,j} \big) + 2\sum_{i=1}^\infty \beta_i(\xi_i - 1). \tag{6.60}
\]
d) Shift convergence of $\|Z_n\|_2^2$. $Y_n$ can be replaced by $Z_n$ in Theorem 6.8 as an immediate consequence of the law of large numbers, whereas it can be replaced in Theorem 6.12 and Corollary 6.14 because of the following proposition.

Proposition 6.15. Suppose $\|Y_n\|_2^2 - E\|Y_n\|_2^2$ converges in law. Then, $\|Z_n\|_2^2 - E\|Y_n\|_2^2$ converges in law to the same limit if and only if
\[
\frac{E\|Y_n\|_2^2}{\sqrt{n}} \to 0. \tag{6.61}
\]
In particular, this condition is satisfied if both conditions,
\[
\frac{\sum_i c_{n,i,i}}{\sqrt{n}} \to 0 \quad \text{and} \quad \sum_i \langle c_{n,i}, m_n\rangle^2 \to 0, \tag{6.62}
\]
hold. If (6.61) holds, we also have $\langle Z_n - EY_n, h\rangle - \langle Y_n - EY_n, h\rangle \to 0$ for any $h \in L_2(0,1)$.

Proof. Since
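\[
\|Z_n\|_2^2 - \|Y_n\|_2^2 = \Big( \frac{n-1}{S_n} \Big)^2 \Big( 1 + \frac{S_n}{n-1} \Big)\Big( 1 - \frac{S_n}{n-1} \Big)\|Y_n\|_2^2 = O_P\big(n^{-1/2}\big)\,\|Y_n\|_2^2,
\]
by the central limit theorem and the law of large numbers, the necessity and sufficiency of condition (6.61) follows from Lemmas 2.2 and 3.1. Now, by (6.26) and Cauchy-Schwarz,
\[
\frac{1}{\sqrt{n}} E\|Y_n\|_2^2 = \frac{1}{\sqrt{n}} \sum_i \Big( c_{i,i} + \sum_j c_{i,j} \Big) \le \frac{1}{\sqrt{n}} \sum_i c_{i,i} + \Big( \sum_i \Big\langle c_{n,i}, \sum_j c_{n,j} \Big\rangle^2 \Big)^{1/2},
\]
which gives the sufficiency of (6.62).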
6.2. Weak convergence of weighted L2 functionals of the quantile process

We distinguish between the uniform and the general quantile processes.

The uniform quantile process. Recall from the beginning of this section that if $u_n$ is the uniform quantile process, then
\[
L_n := \int_{1/n}^{1-1/n} \Big( \frac{u_n(t)}{g(t)} \Big)^2 dt =_d \Big( \frac{n}{S_{n+1}} \Big)^2 \Big\| \sum_{i=1}^{n+1} c_{n,i}\xi_i \Big\|_2^2,
\]
where $a_{n,i}$ are as defined in (6.3) and $c_{n,i}(t) = n^{-1/2} a_{n,i}(t) I_{[1/n, 1-1/n]}(t)/g(t)$. With the help of Lemma 6.1, the results of Section 6.1 can be easily specialized to this situation.

a) The infinitesimality condition (6.27): It follows from the definitions that
\[
\frac{1}{2n} \int_{1/n}^{1-1/n} \frac{t^2 + (1-t)^2}{g^2(t)}\,dt \le \max_i \|c_{n,i}\|_2^2 \le \frac{1}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt,
\]
and from this we conclude that condition (6.27) is equivalent to
\[
\frac{1}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0. \tag{6.63}
\]
b) Convergence of $K_n$ and definition of $K$. Also from the definitions (in (6.21) and Lemma 6.1), we have
\[
K_n(s,t) = \frac{1}{n} \frac{\tilde{K}_n(s,t)}{g(s)g(t)}\, I\{1/n \le s, t \le 1 - 1/n\},
\]
so that by Lemma 6.1 iii),
\[
K_n(s,t) \to K(s,t) := \frac{s \wedge t - st}{g(s)g(t)} \tag{6.64}
\]
pointwise, hence, by Lemma 6.1 iii) and dominated convergence, $K_n \to K$ in $L_2$ if and only if $K \in L_2((0,1)\times(0,1))$, if and only if
\[
\int_0^1\!\!\int_0^1 \frac{(s \wedge t - st)^2}{g^2(s)g^2(t)}\,ds\,dt < \infty. \tag{6.65}
\]
Next we see that the limiting kernel $K$ is trace-class and the limit (6.29) holds if and only if
\[
\int_0^1 \frac{t(1-t)}{g^2(t)}\,dt < \infty. \tag{6.66}
\]
In fact, by Lemma 6.1 iii),
\[
\sum_{i=1}^{n+1} \|c_{n,i}\|_2^2 = \frac{1}{n} \int_{1/n}^{1-1/n} \frac{\tilde{K}_n(t,t)}{g^2(t)}\,dt \to \int_0^1 \frac{t(1-t)}{g^2(t)}\,dt
\]
regardless of whether the limiting integral is finite or not. Thus (6.66) is necessary in order to get a finite limit in (6.29). On the other hand, if (6.66) holds and $B_g(t) = B(t)/g(t)$, where $B(t)$, $0 < t < 1$, is a Brownian bridge, then $B_g$ is a centered, $L_2(0,1)$-valued Gaussian process with covariance function $K(s,t)$. Thus, if $\lambda_i$ and $\phi_i$ denote the eigenvalues and eigenfunctions, respectively, of the kernel $K$, then
\[
\int_0^1 \frac{t(1-t)}{g^2(t)}\,dt = \int_0^1 \frac{EB^2(t)}{g^2(t)}\,dt = E\int_0^1 B_g^2(t)\,dt = E\sum_{i=1}^\infty \Big( \int_0^1 \frac{B(t)}{g(t)}\phi_i(t)\,dt \Big)^2
\]
\[
= \sum_{i=1}^\infty E\Big( \int_0^1 \frac{B(t)}{g(t)}\phi_i(t)\,dt \Big)^2 = \sum_{i=1}^\infty \int_0^1\!\!\int_0^1 \frac{s \wedge t - st}{g(s)g(t)}\phi_i(s)\phi_i(t)\,ds\,dt = \sum_{i=1}^\infty \lambda_i < \infty.
\]
Hence, if (6.66) holds then $K$ is trace-class and (6.29) holds.

c) Convergence of $m_n$ to $m = 0$ assuming infinitesimality. If the infinitesimality condition (6.63) holds, then
\[
\Big\| \sum_{i=1}^{n+1} c_{n,i} \Big\|_2^2 = \frac{1}{n} \int_{1/n}^{1-1/n} \frac{\tilde{m}_n(t)^2}{g^2(t)}\,dt \le \frac{1}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0,
\]
showing that $m_n \to 0$ in $L_2$.

Finally, note that condition (6.66) implies conditions (6.63) and (6.65): the first, by dominated convergence, and condition (6.65) because, since $(s \wedge t - st)^2 \le s(1-s)t(1-t)$, we have
\[
\int_0^1\!\!\int_0^1 \frac{(s \wedge t - st)^2}{g^2(s)g^2(t)}\,ds\,dt \le \int_0^1\!\!\int_0^1 \frac{s(1-s)t(1-t)}{g^2(s)g^2(t)}\,ds\,dt = \Big( \int_0^1 \frac{t(1-t)}{g^2(t)}\,dt \Big)^2.
\]
Summarizing, the preceding observations and the law of large numbers for $S_n/n$ give the following:

Theorem 6.16. Let $u_n(g)$ denote the weighted uniform quantile process, that is, $u_n(g)(t) = (u_n(t)/g(t))\, I\{1/n \le t \le 1 - 1/n\}$, $0 < t < 1$, where $g$ is a non-zero measurable function. Assume
\[
\frac{1}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0.
\]
Then the sequence of processes $\{u_n(g)\}$ is weakly convergent in $L_2(0,1)$ to a nondegenerate limit if and only if
\[
\int_0^1 \frac{t(1-t)}{g^2(t)}\,dt < \infty.
\]
In this case,
\[
u_n(g) \to_d B_g
\]
in $L_2(0,1)$, where $B_g(t) = B(t)/g(t)$ and $B$ is a Brownian bridge. In particular
\[
\int_{1/n}^{1-1/n} \frac{u_n^2(t)}{g^2(t)}\,dt \to_d \int_0^1 \frac{B^2(t)}{g^2(t)}\,dt.
\]
Only the necessity part of this theorem may be considered new; the sufficiency is well known (see e.g., Mason (1984), Csörgő and Horváth (1988) and (1993) p. 354).
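The representation of the uniform quantile process through exponential spacings makes the statistic in Theorem 6.16 easy to simulate. The following is a minimal sketch, not the author's procedure. It assumes the convention $u_n(t) = \sqrt{n}\,(U_{i:n} - t)$ for $(i-1)/n < t \le i/n$, with $U_{i:n} = S_i/S_{n+1}$ (the definition in (6.3) is not reproduced in this section, so this convention is an assumption), and the function names `uniform_order_stats` and `weighted_l2_statistic` are hypothetical helpers introduced only for illustration.

```python
import random


def uniform_order_stats(n, rng):
    """Return U_{1:n} <= ... <= U_{n:n}, built as S_i / S_{n+1} from n + 1
    independent Exp(1) spacings (assumed representation)."""
    xi = [rng.expovariate(1.0) for _ in range(n + 1)]
    total = sum(xi)
    partial, out = 0.0, []
    for x in xi[:n]:
        partial += x
        out.append(partial / total)
    return out


def weighted_l2_statistic(order_stats, g, grid_per_cell=20):
    """Midpoint-rule approximation of the integral over [1/n, 1 - 1/n] of
    u_n(t)^2 / g(t)^2, with u_n(t) = sqrt(n) * (U_{i:n} - t) on ((i-1)/n, i/n]."""
    n = len(order_stats)
    h = 1.0 / (n * grid_per_cell)
    total = 0.0
    # Cells i = 2, ..., n - 1 cover exactly the interval [1/n, 1 - 1/n].
    for i in range(2, n):
        u = order_stats[i - 1]
        for k in range(grid_per_cell):
            t = (i - 1) / n + (k + 0.5) * h
            total += n * (u - t) ** 2 / g(t) ** 2 * h
    return total


if __name__ == "__main__":
    rng = random.Random(2024)
    values = [
        weighted_l2_statistic(uniform_order_stats(200, rng), lambda t: 1.0)
        for _ in range(300)
    ]
    # For g = 1 the limit has mean equal to the integral of t(1 - t), i.e. 1/6.
    print(sum(values) / len(values))
```

With $g \equiv 1$ the limiting variable $\int_0^1 B^2(t)\,dt$ has mean $\int_0^1 t(1-t)\,dt = 1/6$, so the Monte Carlo average printed by the usage block should be close to that value for moderately large $n$. Other weights can be tried by passing a different `g`, keeping in mind that the integrability condition of the theorem decides whether a nondegenerate limit exists.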
Since, under infinitesimality, $m_n \to 0$ in $L_2$, we have that the analogue of Theorem 6.12 for $Y_n = (S_{n+1}/n)u_n$ holds under conditions (6.63) and (6.65). In order to get rid of the factor $S_{n+1}/n$, according to Proposition 6.15 we must have $E\|Y_n\|_2^2/\sqrt{n} \to 0$, which follows from conditions (6.62). Now, if conditions (6.63) and (6.65) hold, then so do conditions (6.62): the second condition in (6.62) is obvious because $m_n \to 0$ in $L_2$ and $\sup_n \|K_n\|_2 < \infty$ (as $K_n$ converges in $L_2$), so
\[
\sum_i \langle c_{n,i}, m_n\rangle^2 = \langle K_n, m_n \otimes m_n\rangle \le \|K_n\|_2 \|m_n\|_2^2 \to 0,
\]
and the first follows because, by Lemma 6.1 iii),
\[
\frac{1}{\sqrt{n}} \sum_i \langle c_{n,i}, c_{n,i}\rangle = \frac{1}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{\tilde{K}_n(t,t)}{n g^2(t)}\,dt \le \frac{3}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(t)}\,dt,
\]
and it is easy to see that this last expression tends to zero as a consequence of (6.63) (divide the domain of integration at the points $1/\sqrt{n}$, $1/2$ and $1 - 1/\sqrt{n}$). Hence, Theorem 6.12 together with Proposition 6.15 give:

Theorem 6.17. If conditions
\[
\frac{1}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0 \quad \text{and} \quad \int_0^1\!\!\int_0^1 \frac{(s \wedge t - st)^2}{g^2(s)g^2(t)}\,ds\,dt < \infty
\]
hold then
\[
\int_{1/n}^{1-1/n} \frac{u_n^2(t)}{g^2(t)}\,dt - \int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(t)}\,dt \to_d \int_0^1 \frac{B^2(t) - EB^2(t)}{g^2(t)}\,dt, \tag{6.67}
\]
where the integral of $(B^2 - EB^2)/g^2$ is defined in a limiting $L_2$ sense.

Proof. By Theorem 6.12 and the above observations, it only remains to show that we can actually replace $E\|Y_n\|_2^2$ by the centering constants in (6.67). By (6.63) and Lemma 6.1, we have
\[
\Big| E\|Y_n\|_2^2 - \int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(t)}\,dt \Big| \le \frac{1}{n} \int_{1/n}^{1-1/n} \frac{|nt(1-t) - \tilde{K}_n(t,t) - \tilde{m}_n^2(t)|}{g^2(t)}\,dt \le \frac{4}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0.
\]
For $g(t) = \phi(\Phi^{-1}(t))$, where $\phi$ and $\Phi$ denote the standard normal density and distribution function respectively, this result goes back, in one form or other, to de Wet and Venter (1972), but it seems to be new in the generality it is given here. See also Gregory (1977) and del Barrio, Cuesta-Albertos, Matrán and Rodríguez-Rodríguez (1999).

Next we examine the normal convergence case (as a consequence of Corollary 6.14). We will further relax integrability (so, condition (6.65) will not hold), but, for convenience, will impose regular variation of $g$ at at least one of the end points 0 and 1. Standard use of the basic properties of regular variation shows that,
if $g$ is regularly varying at 0 and at 1 with exponent $\alpha$, then the hypotheses of Theorem 6.17, (6.63) and (6.65), both hold for $\alpha < 1$ and fail if $\alpha > 1$. We study now the borderline case in which $\alpha = 1$ and (6.65) fails, that is,
\[
L(x) := 2\int_x^{1-x}\!\!\int_x^{1-x} \frac{(s \wedge t - st)^2}{g^2(s)g^2(t)}\,ds\,dt \to \infty \quad \text{as } x \to 0. \tag{6.68}
\]
This case will fall within the scope of Corollary 6.14 and we will obtain normal convergence. Besides the function $L(x)$ just defined, it is convenient to introduce two more functions,
\[
M(x) = 2\int_x^{1-x}\!\!\int_x^{1-x} \frac{s \wedge t - st}{g^2(s)g^2(t)}\,ds\,dt
\]
and
\[
R(x) = \int_{\Delta_x} \frac{(s \wedge t - st)(s \wedge u - su)(t \wedge v - tv)(u \wedge v - uv)}{g^2(s)g^2(t)g^2(u)g^2(v)}\,ds\,dt\,du\,dv,
\]
where $\Delta_x = (x, 1-x)^4$, and establish their relationship with $L$. The next lemma will be helpful in this goal. The proof (cumbersome, but routine calculations based on regular variation) is omitted. It can be found in [31].

Lemma 6.18. Assume $g > 0$ is regularly varying at 0 and at 1 with exponent 1, or that $g$ is regularly varying with exponent one at one of these points and with smaller exponent at the other. Assume also that $L(x) \to \infty$ as $x \to 0$. Then
\[
\lim_{x \to 0} \frac{xM(x)}{L(x)} = 0 \tag{6.69}
\]
and
\[
\lim_{x \to 0} \frac{R(x)}{L^2(x)} = 0. \tag{6.70}
\]
Theorem 6.19. Assume $g > 0$ is regularly varying at 0 and at 1 with exponent 1, or that $g$ is regularly varying with exponent one at one of these points and with smaller exponent at the other. Assume also that $L(x) \to \infty$ as $x \to 0$. Let $Z$ denote a standard normal random variable. Then
\[
\frac{1}{\sqrt{L(1/n)}} \Big( \int_{1/n}^{1-1/n} \frac{u_n^2(t)}{g^2(t)}\,dt - \int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(t)}\,dt \Big) \to_d Z.
\]
Proof. We only consider the case when $g$ is symmetric about 1/2. We can apply Corollary 6.14 with
\[
c_{n,i}(t) = \frac{1}{\sqrt{n L^{1/2}(1/n)}}\, \frac{a_{n,i}(t)}{g(t)}\, I\{1/n \le t \le 1 - 1/n\}.
\]
Now we have that
\[
2\|K_n\|_2^2 = \frac{2}{n^2 L(1/n)} \int_{1/n}^{1-1/n}\!\!\int_{1/n}^{1-1/n} \frac{\tilde{K}_n^2(s,t)}{g^2(s)g^2(t)}\,ds\,dt
\]
and
\[
\sum_{i=1}^{n+1} \|c_{n,i}\|_2^4 = \frac{1}{n^2 L(1/n)} \int_{1/n}^{1-1/n}\!\!\int_{1/n}^{1-1/n} \sum_{i=1}^{n+1} \frac{a_{n,i}^2(s)a_{n,i}^2(t)}{g^2(s)g^2(t)}\,ds\,dt.
\]
We claim that
\[
2\|K_n\|_2^2 \to 1, \quad \sum_{i=1}^{n+1} \|c_{n,i}\|_2^4 \to 0 \quad \text{and} \quad \sum_{i=1}^{n+1} \langle c_{n,i}, c_{n,i} + m_n\rangle^2 \to 0. \tag{6.71}
\]
In fact, from Lemma 6.1 we obtain that $\sum_{i=1}^{n+1} a_{n,i}^2(s)a_{n,i}^2(t) \le 3n(s \wedge t - st)$, which, by Lemma 6.18, implies that
\[
\sum_{i=1}^{n+1} \|c_{n,i}\|_2^4 \le 6\, \frac{\frac{1}{n} M(1/n)}{L(1/n)} \to 0
\]
and proves the second part of (6.71). The first part can be obtained using Lemma 6.1 ii) and iii) to see that $|\tilde{K}_n(s,t)^2 - n^2(s \wedge t - st)^2| = |\tilde{K}_n(s,t) + n(s \wedge t - st)|\,|\tilde{K}_n(s,t) - n(s \wedge t - st)| \le 8n(s \wedge t - st)$ and, consequently, that
\[
\big| 2\|K_n\|_2^2 - 1 \big| = \Big| \frac{2}{n^2 L(1/n)} \int_{1/n}^{1-1/n}\!\!\int_{1/n}^{1-1/n} \frac{\tilde{K}_n^2(s,t) - n^2(s \wedge t - st)^2}{g^2(s)g^2(t)}\,ds\,dt \Big| \le 16\, \frac{\frac{1}{n} M(1/n)}{L(1/n)} \to 0.
\]
Finally, the third part of claim (6.71) is a consequence of
\[
\sum_{i=1}^{n+1} \langle c_{n,i}, m_n\rangle^2 = \langle K_n, m_n \otimes m_n\rangle = \frac{1}{n^2 L(1/n)} \int_{1/n}^{1-1/n}\!\!\int_{1/n}^{1-1/n} \frac{\tilde{K}_n(s,t)\tilde{m}_n(s)\tilde{m}_n(t)}{g^2(s)g^2(t)}\,ds\,dt \le \frac{\frac{1}{n} M(1/n)}{L(1/n)} \to 0,
\]
since $(a+b)^2 \le 2a^2 + 2b^2$. The limits (6.71) prove the first two limits in (6.55) and the limit in (6.57) (with $\sigma^2 = 1$). Lemma 6.1 iii) gives that (6.56) is also satisfied (with $C = 6$). Finally, the third limit in (6.55) follows from Lemma 6.18 since
\[
\|K_n \circ K_n\|_2^2 \le \frac{81\, R(1/n)}{n^4 L^2(1/n)} \to 0.
\]
Corollary 6.14 implies now that $\|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_w N(0,1)$. The conditions (6.62) from Proposition 6.15 are also satisfied because of the last two limits in (6.71) (see the argument immediately before Theorem 6.17) and therefore we also have $\|Z_n\|_2^2 - E\|Y_n\|_2^2 \to_w N(0,1)$. Now we are only left with showing that we can replace $E\|Y_n\|_2^2$ by $L^{-1/2}(1/n)\int_{1/n}^{1-1/n} t(1-t)g^{-2}(t)\,dt$ as centering constants. Arguing as in the proof of Theorem 6.17 we see that
\[
\Big| E\|Y_n\|_2^2 - \frac{1}{L^{1/2}(1/n)} \int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(t)}\,dt \Big| \le \frac{4}{n L^{1/2}(1/n)} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0,
\]
where the last limit is a consequence of (6.59).
Finally, we consider smaller functions $g$, corresponding to Remark 6.16.1. The following result is only given for completeness and only symmetric weights are considered in the proof. The one sided analogue is already contained in Csörgő and Horváth (1988).

Proposition 6.20. Let $g$ be a positive function on $(0,1)$, regularly varying at 0 and at 1 with exponent $\alpha > 1$ and with equal or smaller exponent at the other extreme, and such that $\lim_{x \to 0} g(x)/g(1-x) := c \in [0, \infty]$. Set
\[
E(x) := \int_x^{1-x} \frac{t(1-t)}{g^2(t)}\,dt.
\]
Then,
\[
\frac{1}{E(1/n)} \int_{1/n}^{1-1/n} \frac{u_n^2(t)}{g^2(t)}\,dt \to_d \frac{1}{\alpha - 1} \Big[ \frac{c^2}{1 + c^2} \int_1^\infty \frac{(S^{(1)}_{[y]} - y)^2}{y^{2\alpha}}\,dy + \frac{1}{1 + c^2} \int_1^\infty \frac{(S^{(2)}_{[y]} - y)^2}{y^{2\alpha}}\,dy \Big],
\]
where $\{S^{(1)}_{[y]} : y \ge 1\}$ is the partial sum process associated to the sequence $\{\xi_j\}$ of independent exponential random variables, that is, $S^{(1)}_{[y]} = \sum_{j=1}^{[y]} \xi_j$, and $\{S^{(2)}_{[y]}\}$ is an independent copy of $\{S^{(1)}_{[y]}\}$.

Proof. As above, we only consider the case when $g$ is symmetric. Symmetry of $g$ and the fact that $a_{n,j}(1-t) = -a_{n,n+2-j}(t)$ show that
\[
\frac{1}{E(1/n)} \int_{1/n}^{1-1/n} \frac{u_n^2(t)}{g^2(t)}\,dt = \Big( \frac{n}{S_{n+1}} \Big)^2 \frac{1}{nE(1/n)} \int_{1/n}^{1-1/n} \frac{\big( \sum_{j=1}^{n+1} a_{n,j}(t)\xi_j \big)^2}{g^2(t)}\,dt = \Big( \frac{n}{S_{n+1}} \Big)^2 \big( V_n^{(1)} + V_n^{(2)} \big),
\]
where
\[
V_n^{(1)} = \frac{1}{nE(1/n)} \int_{1/n}^{1/2} \frac{\big( \sum_{j=1}^{n+1} a_{n,j}(t)\xi_j \big)^2}{g^2(t)}\,dt
\]
and
\[
V_n^{(2)} = \frac{1}{nE(1/n)} \int_{1/n}^{1/2} \frac{\big( \sum_{j=1}^{n+1} a_{n,j}(t)\xi_{n+2-j} \big)^2}{g^2(t)}\,dt.
\]
We set $b_{n,j}(t) = I\{j - 1 < nt\}$ and define
\[
W_n^{(1)} = \frac{1}{nE(1/n)} \int_{1/n}^{1/2} \frac{\big( \sum_{j=1}^{n+1} b_{n,j}(t)\xi_j - nt \big)^2}{g^2(t)}\,dt
\]
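and, similarly, $W_n^{(2)}$, replacing $\xi_j$ with $\xi_{n+2-j}$. Now, since
\[
\big| (V_n^{(1)})^{1/2} - (W_n^{(1)})^{1/2} \big|^2 \le \Big( \sqrt{n}\Big( 1 - \frac{S_{n+1}}{n} \Big) \Big)^2 \frac{1}{E(1/n)} \int_{1/n}^{1/2} \frac{t^2}{g^2(t)}\,dt \to_{\Pr} 0,
\]
the first factor being $O_P(1)$, and $V_n^{(1)} = O_P(1)$, we see that $V_n^{(1)} - W_n^{(1)} = o_P(1)$. Analogously, $V_n^{(2)} - W_n^{(2)} = o_P(1)$, showing that (6.72) is equivalent to convergence in law of $W_n^{(1)} + W_n^{(2)}$ to the right-hand side there. Obviously, $W_n^{(1)} =_d W_n^{(2)}$ and, moreover, $W_n^{(1)}$ and $W_n^{(2)}$ are asymptotically independent: they are indeed independent if $n$ is odd since $b_{n,j}(t) = 0$ if $t \le 1/2$ and $j > (n+1)/2$ and therefore $W_n^{(1)}$ depends only on $\xi_1, \ldots, \xi_{(n+1)/2}$ while $W_n^{(2)}$ depends only on $\xi_{(n+3)/2}, \ldots, \xi_{n+1}$; the overlapping that arises if $n$ is even is negligible. Hence, in order to prove (6.72) it suffices to show that
\[
W_n^{(1)} \to_d \frac{1}{\alpha - 1} \int_1^\infty \frac{(S^{(1)}_{[y]} - y)^2}{y^{2\alpha}}\,dy. \tag{6.72}
\]
To see this, we note that
\[
W_n^{(1)} = \sum_{i,j} c_{n,i,j}(\xi_i - 1)(\xi_j - 1) + 2\sum_i d_{n,i}(\xi_i - 1) + e_n
\]
with
\[
c_{n,i,j} = \frac{1}{n^2 E(1/n)} \int_{i \vee j - 1}^{n/2} \frac{1}{g^2(y/n)}\,dy \ \text{ if } i \vee j > 1, \qquad c_{n,1,1} = c_{n,2,2},
\]
\[
d_{n,i} = \frac{1}{n^2 E(1/n)} \int_{i-1}^{n/2} \frac{[y] - y}{g^2(y/n)}\,dy \ \text{ if } i > 1, \qquad d_{n,1} = d_{n,2},
\]
\[
\text{and} \quad e_n = \frac{1}{n^2 E(1/n)} \int_1^{n/2} \frac{([y] - y)^2}{g^2(y/n)}\,dy.
\]
Similarly,
\[
\frac{1}{\alpha - 1} \int_1^\infty \frac{(S^{(1)}_{[y]} - y)^2}{y^{2\alpha}}\,dy = \sum_{i,j} \gamma_{i,j}(\xi_i - 1)(\xi_j - 1) + 2\sum_i \delta_i(\xi_i - 1) + \varepsilon,
\]
where $\gamma_{i,j} = \frac{1}{\alpha - 1} \int_{i \vee j - 1}^\infty \frac{1}{y^{2\alpha}}\,dy$ if $i \vee j > 1$, $\gamma_{1,1} = \gamma_{2,2}$, $\delta_i = \frac{1}{\alpha - 1} \int_{i-1}^\infty \frac{[y] - y}{y^{2\alpha}}\,dy$ if $i > 1$, $\delta_1 = \delta_2$ and $\varepsilon = \frac{1}{\alpha - 1} \int_1^\infty \frac{([y] - y)^2}{y^{2\alpha}}\,dy$. Standard regular variation techniques show that $\sum_{i,j} (c_{n,i,j} - \gamma_{i,j})^2 \to 0$, $\sum_i (d_{n,i} - \delta_i)^2 \to 0$ and $e_n \to \varepsilon$, yielding as in Remark 3.10 that
\[
W_n^{(1)} \to_{L_2} \frac{1}{\alpha - 1} \int_1^\infty \frac{(S^{(1)}_{[y]} - y)^2}{y^{2\alpha}}\,dy
\]
and proving (6.72).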
The general quantile process. By Proposition 6.5, we can transfer the results in the previous subsection to the general quantile process just by taking $g(t) = f(F^{-1}(t))/\sqrt{w(t)}$. Let us denote as General Hypotheses or (GH) the following conditions on the cdf $F$ and the weight $w$:

(GH) $F$ is twice differentiable on its open support $(a_F, b_F)$ with $f(x) = F'(x) > 0$ there, and satisfies conditions (6.7), 6.13 and 6.14. $w$ is a non-negative measurable function on $(0,1)$ and satisfies conditions 6.15.

These, together with (6.11), are the conditions under which we can transfer results on $u_n$ to $v_n$ by Proposition 2.5. (6.11) is not included because it will be subsumed by other conditions, in fact, because, by dominated convergence,
\[
\int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt < \infty \ \Rightarrow\ (6.11) \ \Rightarrow\ \frac{1}{n} \int_{1/n}^{1-1/n} \frac{w(t)\,dt}{f^2(F^{-1}(t))} \to 0.
\]
We then have:

Theorem 6.21. Let $B$ be a Brownian bridge on $(0,1)$ and let $Z$ be a standard normal random variable.

a) If $F$ and $w$ satisfy (GH) and
\[
\int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt < \infty \tag{6.73}
\]
then
\[
\sqrt{w(t)}\,v_n(t) \to \frac{B(t)\sqrt{w(t)}}{f(F^{-1}(t))} \quad \text{in law in } L_2(0,1),
\]
in particular,
\[
\int_0^1 v_n^2(t)w(t)\,dt \to \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))}\,w(t)\,dt \quad \text{in distribution.}
\]
b) If $F$ and $w$ satisfy (GH) and
\[
\frac{1}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,w(t)\,dt \to 0
\]
and
\[
\int_0^1\!\!\int_0^1 \frac{(s \wedge t - st)^2}{f^2(F^{-1}(s))f^2(F^{-1}(t))}\,w(s)w(t)\,ds\,dt < \infty \tag{6.74}
\]
then
\[
\int_0^1 v_n^2(t)w(t)\,dt - \int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt \to_w \int_0^1 \frac{B^2(t) - EB^2(t)}{f^2(F^{-1}(t))}\,w(t)\,dt.
\]
c) Assume $F$ is twice differentiable on its open support $(a_F, b_F)$ with $f(x) = F'(x) > 0$ there, that $F$ satisfies condition (6.7) and that the function $g := f(F^{-1})$ is regularly varying with exponent one at at least one of the two points
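0 and 1, and with exponent not larger than one at the other. Assume also that
\[
L(x) := 2\int_x^{1-x}\!\!\int_x^{1-x} \frac{(s \wedge t - st)^2}{f^2(F^{-1}(s))f^2(F^{-1}(t))}\,ds\,dt \to \infty \tag{6.75}
\]
as $x \to 0$. Then,
\[
\frac{1}{\sqrt{L(1/n)}} \Big( \int_0^1 v_n^2(t)\,dt - \int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt \Big) \to Z \quad \text{in distribution.} \tag{6.76}
\]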
Proof. By Proposition ?? and the remark above on condition (6.11), the statements a) and b) of the theorem do not require proof. But part c) does (Proposition ?? does not apply in this case). As usual, we assume $f(F^{-1})$ symmetric about 1/2. If we can replace $\|v_n\|_2^2$ in (6.76) by $\|u_n/f(F^{-1})\|_{2,n}^2 = \int_{1/n}^{1-1/n} u_n^2(t)/f^2(F^{-1}(t))\,dt$, the result will follow from Theorem 6.19. By the proof of Lemma 6.4, we can replace $\|v_n\|_2^2$ by $\|v_n\|_{2,n}^2$ if we show that
\[
\lim_{x \to 0} \frac{1}{\sqrt{L(x)}}\, \frac{x^2}{f^2(F^{-1}(x))} = 0 \quad \text{and} \quad \lim_{x \to 0} \frac{1}{x\sqrt{L(x)}} \int_0^x (F^{-1}(x) - F^{-1}(t))^2\,dt = 0,
\]
and, by the proof of Lemma 6.3 (see (2.20) and (2.21)), we can replace $\|v_n\|_{2,n}^2$ by $\|u_n/f(F^{-1})\|_{2,n}^2$ if
\[
\lim_{n \to \infty} \frac{1}{\sqrt{nL(1/n)}} \int_{1/n}^{1-1/n} \frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,dt = 0.
\]
The first and third of these limits follow just as the limit (??) in the proof of Lemma 6.18, using L'Hôpital and the equivalence (??). To show that the second limit also holds, let us set $h(x) = \int_0^x (F^{-1}(x) - F^{-1}(t))^2\,dt$ and observe that
\[
h'(x) = \frac{2}{f(F^{-1}(x))} \int_0^x (F^{-1}(x) - F^{-1}(t))\,dt = \frac{2}{f(F^{-1}(x))} \int_0^x\!\!\int_t^x \frac{du\,dt}{f(F^{-1}(u))} = \frac{2}{f(F^{-1}(x))} \int_0^x \frac{u\,du}{f(F^{-1}(u))} \sim \frac{2x^2}{f^2(F^{-1}(x))},
\]
the last equivalence being a consequence of regular variation. This, (??) and regular variation imply, in turn, $h(x) = \int_0^x h'(y)\,dy \sim 2\int_0^x y^2/f^2(F^{-1}(y))\,dy \sim 2x^3/f^2(F^{-1}(x)) \sim 2x^2 \int_x^\delta 1/f^2(F^{-1}(y))\,dy$ and, consequently, that
\[
\lim_{x \to 0} \frac{1}{x\sqrt{L(x)}} \int_0^x (F^{-1}(x) - F^{-1}(t))^2\,dt = 2\lim_{x \to 0} \frac{x \int_x^\delta \frac{1}{f^2(F^{-1}(y))}\,dy}{\sqrt{L(x)}} = 0
\]
by (??).

Examples a) Consider the distribution functions
\[
F_\alpha(x) = \begin{cases} \frac{1}{2} e^{-|x|^\alpha} & \text{if } x \le 0, \\ 1 - \frac{1}{2} e^{-x^\alpha} & \text{if } x \ge 0, \end{cases} \qquad \text{for } \alpha > 0,
\]
and take w ≡ 1. Let fα be the corresponding densities, which are symmetric about zero. Then, it is easy but somewhat cumbersome to see that fα Fα−1 (x) ≈ x(1 − x) log(α−1)/α fα Fα−1 (x) ≈ x(1 − x) log(α−2)/α
1 , x(1 − x) 1 , x(1 − x)
where $a(x) \approx b(x)$ means that $0 < \lim_{x \to 0} a(x)/b(x) < \infty$ and likewise for $x \to 1$, whereas $0 < \inf_{t \in I} a(t)/b(t) \le \sup_{t \in I} a(t)/b(t) < \infty$ for any closed interval $I$ contained in $(0,1)$ (for instance $f_\alpha(F_\alpha^{-1}(x)) = \alpha x |\log 2x|^{(\alpha-1)/\alpha} + \alpha(1-x)|\log 2(1-x)|^{(\alpha-1)/\alpha}$). So, $f_\alpha(F_\alpha^{-1})$ is symmetric about 1/2 and of regular variation with exponent 1 at 0 (and at 1). It then follows easily that $F_\alpha$ is in case a) of the above theorem iff $\alpha > 2$, in case b) iff $4/3 < \alpha \le 2$, hence for the normal distribution, and in case c) iff $0 < \alpha \le 4/3$, in particular for the symmetric exponential distribution. As mentioned above, if the tail probabilities are of different order, the largest dominates and these theorems still hold, so that the same conclusions apply to the one sided families.

b) Likewise, if $f_\alpha(x) = \alpha x^{\alpha-1} e^{-x^\alpha}$, $x > 0$, $\alpha > 0$, is the Weibull family of densities, then $f(F^{-1}(u)) = \alpha(1-u)\log^{(\alpha-1)/\alpha}\frac{1}{1-u}$, and, as in the above example, $F_\alpha$ is in cases a), b) or c) according as to whether $\alpha > 2$, $4/3 < \alpha \le 2$ or $0 < \alpha \le 4/3$.

c) (McLaren and Lockhart (1987)). For the logistic distribution $F(x) = (1 + e^{-x})^{-1}$, $x \in \mathbb{R}$, and the exponential with parameter one, both falling in case c), computation of $L$ and the centering gives
\[
\frac{1}{2^{3/2}\sqrt{\log n}} \Big( \int_0^1 v_n^2(t)\,dt - 2\log n \Big) \to Z \quad \text{in distribution,}
\]
and for the extreme value distribution $H(x) = \exp(-e^{-x})$, also in case c),
\[
\frac{1}{2\sqrt{\log n}} \Big( \int_0^1 v_n^2(t)\,dt - \log n \Big) \to Z \quad \text{in distribution.}
\]
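For the logistic distribution the normalizing and centering sequences can be computed in closed form. The worked computation below is a sketch added for illustration; it relies only on the identity $f(F^{-1}(t)) = t(1-t)$, which holds because the logistic density satisfies $f = F(1-F)$, and on the definition (6.75) of $L$.

```latex
\begin{align*}
\int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt
 &= \int_{1/n}^{1-1/n} \frac{dt}{t(1-t)}
  = \Big[\log\frac{t}{1-t}\Big]_{1/n}^{1-1/n}
  = 2\log(n-1),\\
L(x) &= 4\int_x^{1-x}\!\!\int_x^{t} \frac{s^2(1-t)^2}{s^2(1-s)^2\,t^2(1-t)^2}\,ds\,dt
  = 4\int_x^{1-x} \frac{1}{t^2}\Big(\frac{1}{1-t} - \frac{1}{1-x}\Big)dt\\
 &= 8\log\frac{1-x}{x} - \frac{8}{1-x} + \frac{4}{(1-x)^2},\\
L(1/n) &= 8\log(n-1) - \frac{8n}{n-1} + \frac{4n^2}{(n-1)^2} \sim 8\log n .
\end{align*}
```

Hence $\sqrt{L(1/n)} \sim 2^{3/2}\sqrt{\log n}$ and the centering equals $2\log n + o(1)$, in agreement with the normalization displayed for the logistic case.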
We cannot apply Proposition ?? when $f(F^{-1}(t))$ is too large at 0 and 1 (regularly varying of exponent 1 or lower), as we saw in Case c) of the above theorem (exponent 1). Here is a situation where the exponent is larger than one.

Theorem 6.22. Assume $F$ satisfies conditions (6.7), $f(F^{-1})$ varies regularly at 0 or at 1 with exponent $\gamma \in (1, 3/2)$ and with equal or smaller exponent at the other extreme, and $\lim_{x \to 0} |F^{-1}(x)|/F^{-1}(1-x) = c \in [0, \infty]$. Then, denoting $\alpha = 1 - \gamma$,
\[
\frac{1}{E(1/n)} \int_0^1 v_n^2(t)\,dt \to_w \frac{2}{|\alpha|} \Big[ \frac{c^2}{1 + c^2} \int_0^\infty \big( (S^{(1)}_{[y]+1})^\alpha - y^\alpha \big)^2 dy + \frac{1}{1 + c^2} \int_0^\infty \big( (S^{(2)}_{[y]+1})^\alpha - y^\alpha \big)^2 dy \Big].
\]
If $F$ satisfies conditions (6.7), $f(F^{-1})$ varies regularly at 0 or at 1 with exponent $\gamma = 3/2$ (i.e., $\alpha = -1/2$) and with equal or smaller exponent at the other extreme, $\lim_{x \to 0} |F^{-1}(x)|/F^{-1}(1-x) = c \in [0, \infty]$ and $\int_0^1 (F^{-1}(t))^2\,dt < \infty$, then
\[
\frac{1}{E(1/n)} \Big( \int_0^1 v_n^2(t)\,dt - c_n \Big) \to_w 4\Big[ \frac{c^2}{1 + c^2} \Big( \frac{1}{S_1^{(1)}} - \frac{4}{\sqrt{S_1^{(1)}}} + \int_1^\infty \big( (S^{(1)}_{[y]+1})^{-1/2} - y^{-1/2} \big)^2 dy \Big)
\]
\[
\qquad + \frac{1}{1 + c^2} \Big( \frac{1}{S_1^{(2)}} - \frac{4}{\sqrt{S_1^{(2)}}} + \int_1^\infty \big( (S^{(2)}_{[y]+1})^{-1/2} - y^{-1/2} \big)^2 dy \Big) \Big],
\]
where $c_n = \int_0^{1/n} (F^{-1}(t))^2\,dt + \int_{1-1/n}^1 (F^{-1}(t))^2\,dt$. In both cases $\{S^{(1)}_{[y]+1} : y \ge 0\}$ is the partial sum process associated to the sequence $\{\xi_j\}$ of independent exponential random variables, that is, $S^{(1)}_{[y]+1} = \sum_{j=1}^{[y]+1} \xi_j$, and $\{S^{(2)}_{[y]+1} : y \ge 0\}$ is an independent copy of $\{S^{(1)}_{[y]+1}\}$.

Remark 6.22.1. Regular variation of $f(F^{-1})$ with exponent $\gamma$ is, essentially, equivalent to regular variation of $F^{-1}$ with exponent $\alpha \in (-1/2, 0)$. In fact, if $f(F^{-1}) \in RV_\gamma(0)$, then $F^{-1} \in RV_\alpha(0)$ and, provided $f$ is monotone in a neighborhood of $-\infty$, if $F^{-1} \in RV_\alpha(0)$ then $f(F^{-1}) \in RV_\gamma(0)$ (see, e.g., Resnick (1987), Propositions 0.6 and 0.7). With the assumption of regular variation, finiteness of $\int_0^1 v_n^2(t)\,dt$ requires $\alpha \ge -1/2$. Thus, Theorem 6.22 completes the picture of all the possible limiting distributions of $\int_0^1 v_n^2(t)\,dt$ for distributions with regularly varying tails.

Remark 6.22.2. It follows easily from the law of the iterated logarithm that
\[
\limsup_{y \to \infty} \frac{\big| (S^{(1)}_{[y]+1})^\alpha - y^\alpha \big|}{y^{\alpha - 1/2}\sqrt{2\log\log y}} = |\alpha| \quad \text{a.s.}
\]
for all $\alpha < 0$, see, e.g., Samorodnitsky and Taqqu (1994), p. 31. This implies that the limiting integrals in Theorem 6.22 are a.s. finite. Integrability of $((S^{(1)}_{[y]+1})^\alpha - y^\alpha)^2$ at 0 needs $\alpha > -1/2$. When $\alpha = -1/2$ the effect of the centering constants, $c_n$, is to remove this lack of integrability, still leading to a limiting distribution.

Next we collect some elementary properties of regularly varying functions that will be useful in our proof of Theorem 6.22.
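Lemma 6.23. a) If $l \in RV_0(0)$ is positive and $\varepsilon > 0$ then $\lim_{n \to \infty} (\log n)^{-\varepsilon}\, \dfrac{l(\log n / n)}{l(1/n)} = 0$.

b) If $l \in RV_\alpha(0)$ and $\alpha < 0$ then $\lim_{n \to \infty} \dfrac{l(\log n / n)}{l(1/n)} = 0$.

c) If $l \in RV_\alpha(0)$ and $\beta > -\alpha$ then $x^\beta l(x) \to 0$ as $x \to 0$.

Proof. b) is a trivial consequence of a) and the proof of c) can be found, e.g., in Resnick (1987), so we only prove a). By Karamata's theorem (see, e.g., Resnick (1987), p. 17) $l$ can be written as $l(x) = c(x)\exp\big( \int_x^1 (\epsilon(t)/t)\,dt \big)$ with $c(x) \to c \in (0, \infty)$ and $\epsilon(x) \to 0$ as $x \to 0$. Therefore, taking $n_0$ large enough to ensure that $|\epsilon(t)| < \varepsilon/2$ for $t \le \log n_0/n_0$ and $n \ge n_0$ we have that
\[
(\log n)^{-\varepsilon}\, \frac{l(\log n / n)}{l(1/n)} \le 2(\log n)^{-\varepsilon} \exp\Big( \frac{\varepsilon}{2} \int_{1/n}^{\log n / n} \frac{dt}{t} \Big) = 2(\log n)^{-\varepsilon/2} \to 0.
\]
Proof of Theorem 6.22. We will assume in this proof that $0 > \alpha > -1/2$. The case $\alpha = -1/2$ can be handled with straightforward changes. We set, as in the proof of Proposition 6.20, $b_{n,j}(t) = I\{j - 1 < nt\}$ and $H^{-1}(x) = -F^{-1}(1-x)$ and observe, using the fact that $b_{n,j}(1-t) = 1 - b_{n,n+2-j}(t)$ except in a null set, that
\[
\frac{1}{E(1/n)} \int_0^1 v_n^2(t)\,dt = \frac{n}{E(1/n)} \int_0^{\log n/n} \Big( F^{-1}\Big( \frac{1}{S_{n+1}} \sum_{j=1}^{n+1} b_{n,j}(t)\xi_j \Big) - F^{-1}(t) \Big)^2 dt
\]
\[
\quad + \frac{n}{E(1/n)} \int_{1 - \log n/n}^1 \Big( F^{-1}\Big( \frac{1}{S_{n+1}} \sum_{j=1}^{n+1} b_{n,j}(t)\xi_j \Big) - F^{-1}(t) \Big)^2 dt + \frac{1}{E(1/n)} \int_{\log n/n}^{1 - \log n/n} v_n^2(t)\,dt
\]
\[
= \frac{n}{E(1/n)} \int_0^{\log n/n} \Big( F^{-1}\Big( \frac{1}{S_{n+1}} \sum_{j=1}^{n+1} b_{n,j}(t)\xi_j \Big) - F^{-1}(t) \Big)^2 dt
\]
\[
\quad + \frac{n}{E(1/n)} \int_0^{\log n/n} \Big( H^{-1}\Big( \frac{1}{S_{n+1}} \sum_{j=1}^{n+1} b_{n,j}(t)\xi_{n+2-j} \Big) - H^{-1}(t) \Big)^2 dt + \frac{1}{E(1/n)} \int_{\log n/n}^{1 - \log n/n} v_n^2(t)\,dt
\]
\[
=: V_n^{(1)} + V_n^{(2)} + V_n^{(3)}.
\]
We also set
\[
W_n^{(1)} := \frac{n}{E(1/n)} \int_0^{\log n/n} \Big( F^{-1}\Big( \frac{1}{n} \sum_{j=1}^{n+1} b_{n,j}(t)\xi_j \Big) - F^{-1}(t) \Big)^2 dt
\]
and
\[
W_n^{(2)} := \frac{n}{E(1/n)} \int_0^{\log n/n} \Big( H^{-1}\Big( \frac{1}{n} \sum_{j=1}^{n+1} b_{n,j}(t)\xi_{n+2-j} \Big) - H^{-1}(t) \Big)^2 dt.
\]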
Observe that $W_n^{(1)}$ and $W_n^{(2)}$ are independent since they are functions of disjoint sets of independent exponential r.v.'s $\xi_j$. We will proceed now by showing that the central part, $V_n^{(3)}$, is negligible and that the upper and lower integrals are asymptotically independent and weakly convergent to the above stated limits. This will be achieved by proving the following three claims:

Claim 1. $V_n^{(3)} \to_{\Pr} 0$.

Claim 2. $(V_n^{(i)})^{1/2} - (W_n^{(i)})^{1/2} \to_{\Pr} 0$, $i = 1, 2$.

Claim 3.
\[ W_n^{(1)} \to_w \frac{2c^2}{|\alpha|(1+c^2)}\int_0^\infty \big((S^{(1)}_{[y]+1})^\alpha - y^\alpha\big)^2 dy, \qquad W_n^{(2)} \to_w \frac{2}{|\alpha|(1+c^2)}\int_0^\infty \big((S^{(2)}_{[y]+1})^\alpha - y^\alpha\big)^2 dy. \]
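The limit in Claim 3 can be explored numerically. The following is a minimal Monte Carlo sketch, not part of the original argument: it simulates the integral $\int_0^k ((S_{[y]+1})^\alpha - y^\alpha)^2 dy$ truncated at a finite $k$, assuming that $S_j$ denotes the sum of $j$ independent standard exponential variables. The function names, the truncation level and the seed are illustrative choices, and the constant factor in front of the integral is omitted.

```python
import random


def truncated_limit_integral(alpha, k, rng):
    """One draw of int_0^k ((S_{[y]+1})^alpha - y^alpha)^2 dy.

    S_j is the sum of j i.i.d. standard exponential variables. On each
    interval [j, j+1) the term S_{[y]+1} equals S_{j+1}, so the integral
    splits into k pieces that are integrated in closed form.
    """
    if not -0.5 < alpha < 0:
        raise ValueError("this sketch assumes -1/2 < alpha < 0")
    total = 0.0
    s = 0.0
    for j in range(k):
        s += rng.expovariate(1.0)  # s is now S_{j+1}
        a = s ** alpha
        # int_j^{j+1} y^alpha dy and int_j^{j+1} y^(2*alpha) dy
        i1 = ((j + 1) ** (alpha + 1) - j ** (alpha + 1)) / (alpha + 1)
        i2 = ((j + 1) ** (2 * alpha + 1) - j ** (2 * alpha + 1)) / (2 * alpha + 1)
        total += a * a - 2.0 * a * i1 + i2
    return total


def monte_carlo_mean(alpha, reps, k, seed=0):
    """Average of `reps` independent draws of the truncated integral."""
    rng = random.Random(seed)
    return sum(truncated_limit_integral(alpha, k, rng) for _ in range(reps)) / reps
```

Increasing `k` in this sketch mimics the passage $k \to \infty$ that Claim 3 handles through (6.77).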
Proof of Claim 1. We show first that
\[ V_n^{(3)} - \frac{1}{E(1/n)}\int_{\log n/n}^{1-\log n/n}\frac{u_n^2(t)}{f^2(F^{-1}(t))}\,dt \to_{\Pr} 0. \]
As in the proof of Proposition 6.3 this reduces to showing that
\[ \frac{1}{nE(1/n)}\int_{\log n/n}^{1-\log n/n}\frac{1}{f^2(F^{-1}(t))}\,dt \to 0 \quad\text{and}\quad \frac{1}{\sqrt{n}\,E(1/n)}\int_{\log n/n}^{1-\log n/n}\frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,dt \to 0. \]
To ease the computations we will assume in the remainder of the proof of this claim that $c = 1$ and replace $E(1/n)$ by $F^{-1}(1/n)^2$ in the last two denominators (the ratio of the two sequences converges, by regular variation, to a positive constant). Extension to general $c$ is straightforward. Regular variation implies that
\[ \lim_{x\to 0}\frac{x/f(F^{-1}(x))}{F^{-1}(x)} = \alpha \quad\text{and}\quad \lim_{x\to 0}\frac{x/f^2(F^{-1}(x))}{\int_x^{1-x}\frac{1}{f^2(F^{-1}(t))}\,dt} = \frac{1}{2} - \alpha, \]
which implies, in turn, that
\[ \lim_{n\to\infty}\frac{1}{nF^{-1}(1/n)^2}\int_{\log n/n}^{1-\log n/n}\frac{1}{f^2(F^{-1}(t))}\,dt = \frac{\alpha^2}{1/2-\alpha}\lim_{n\to\infty}\frac{l^{(1)}(\log n/n)}{l^{(1)}(1/n)} = 0, \]
where $l^{(1)}(x) = x/f^2(F^{-1}(x)) \in RV_{2\alpha-1}(0)$ and $2\alpha - 1 \in (-2,-1)$ and the last limit follows from Lemma 6.23. Similarly,
\[ \lim_{n\to\infty}\frac{1}{\sqrt{n}\,F^{-1}(1/n)^2}\int_{\log n/n}^{1-\log n/n}\frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,dt = \frac{\alpha^2}{1/4-\alpha}\lim_{n\to\infty}\frac{l^{(2)}(\log n/n)}{l^{(2)}(1/n)} = 0, \]
since $l^{(2)}(x) = x^{3/2}/f^2(F^{-1}(x)) \in RV_{2\alpha-1/2}(0)$ and $2\alpha - 1/2 \in (-3/2,-1/2)$. Now we can prove Claim 1 by showing that
\[ \frac{1}{F^{-1}(1/n)^2}\int_{\log n/n}^{1-\log n/n}\frac{u_n^2(t)}{f^2(F^{-1}(t))}\,dt \to_{\Pr} 0. \]
But taking expectations we can see that it suffices to show that
\[ \lim_{n\to\infty}\frac{1}{F^{-1}(1/n)^2}\int_{\log n/n}^{1-\log n/n}\frac{t(1-t)}{f^2(F^{-1}(t))}\,dt = 0. \]
Using again regular variation properties and Lemma 6.23 we have that
\[ \lim_{n\to\infty}\frac{1}{F^{-1}(1/n)^2}\int_{\log n/n}^{1-\log n/n}\frac{t(1-t)}{f^2(F^{-1}(t))}\,dt = |\alpha|\lim_{n\to\infty}\frac{l^{(3)}(\log n/n)}{l^{(3)}(1/n)} = 0, \]
now with $l^{(3)}(x) = x^2/f^2(F^{-1}(x)) \in RV_{2\alpha}(0)$ and $2\alpha \in (-1, 0)$. This completes the proof of Claim 1.

Proof of Claim 2. We will show that $(V_n^{(1)})^{1/2} - (W_n^{(1)})^{1/2} \to_{\Pr} 0$. It suffices to show that
\[ \frac{1}{F^{-1}(1/n)^2}\int_0^{\log n}\Big(F^{-1}\big(S^{(1)}_{[y]+1}/S_{n+1}\big) - F^{-1}\big(S^{(1)}_{[y]+1}/n\big)\Big)^2 dy \to_{\Pr} 0. \]
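To ease the notation we will omit the superscript from $S^{(1)}_{[y]+1}$. Similarly as in (6.10) we consider a Taylor expansion
\[ F^{-1}\Big(\frac{S_j}{S_{n+1}}\Big) - F^{-1}\Big(\frac{S_j}{n}\Big) = \frac{S_j}{S_{n+1}}\Big(1 - \frac{S_{n+1}}{n}\Big)\frac{1}{f(F^{-1}(S_j/n))} - \frac{1}{2}\Big(1 - \frac{S_{n+1}}{n}\Big)^2\Big(\frac{S_j}{S_{n+1}}\Big)^2\frac{f'(F^{-1}(\xi))}{f^3(F^{-1}(\xi))}, \]
for some $\xi$ between $S_j/S_{n+1}$ and $S_j/n$, which enables us to write, using the obvious analogues of (??) and (??), and the fact that $\sup_{n\ge 1} nE\big(1 - \frac{S_{n+1}}{n}\big)^2 < \infty$, that
\[ \Big|F^{-1}\Big(\frac{S_j}{S_{n+1}}\Big) - F^{-1}\Big(\frac{S_j}{n}\Big)\Big| \le O_P(1)\Big[\frac{1}{\sqrt{n}}\,\frac{S_j/n}{f(F^{-1}(S_j/n))} + \frac{1}{n}\,\frac{S_j/n}{f(F^{-1}(S_j/n))}\Big] \le O_P(1)\frac{1}{\sqrt{n}}\,\frac{S_j/n}{f(F^{-1}(S_j/n))}, \]
where $O_P(1)$ stands for a stochastically bounded sequence which does not depend on $j \in [1, \log n]$. We take now $\varepsilon > 0$ such that $2\gamma - 3 + \varepsilon < 0$. From this bound and regular variation (Lemma 6.23, c)) we obtain that
\[
\begin{aligned}
\int_0^{\log n}\Big(F^{-1}\Big(\frac{S_{[y]+1}}{S_{n+1}}\Big) - F^{-1}\Big(\frac{S_{[y]+1}}{n}\Big)\Big)^2 dy
&\le O_P(1)\frac{1}{n}\int_0^{\log n}\frac{(S_{[y]+1}/n)^2}{f^2(F^{-1}(S_{[y]+1}/n))}\,dy \\
&\le O_P(1)\frac{1}{n}\sum_{j=1}^{[\log n+1]}\frac{(S_j/n)^2}{f^2(F^{-1}(S_j/n))} \\
&\le O_P(1)\frac{1}{n}\sum_{j=1}^{[\log n+1]}\Big(\frac{S_j}{n}\Big)^{2-2\gamma-\varepsilon} \\
&\le O_P(1)\,n^{2\gamma-3+\varepsilon}\sum_{j=1}^{[\log n+1]} j^{2-2\gamma-\varepsilon} \to 0.
\end{aligned}
\]
This completes the proof of Claim 2 (note that we need not divide by $F^{-1}(1/n)^2$ to obtain the equivalence of the two sequences if $\gamma < 3/2$; if $\gamma = 3/2$ that division still gives the result).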
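The decay of the deterministic factor at the end of the proof of Claim 2 is easy to check numerically. The sketch below is an illustration only: it evaluates $n^{2\gamma-3+\varepsilon}\sum_{j=1}^{[\log n+1]} j^{2-2\gamma-\varepsilon}$ for user-supplied $\gamma$ and $\varepsilon$ satisfying $2\gamma - 3 + \varepsilon < 0$. The function name and the sample values are hypothetical choices, not taken from the text.

```python
import math


def claim2_bound(n, gamma, eps):
    """Return n^(2*gamma-3+eps) * sum_{j=1}^{[log n + 1]} j^(2-2*gamma-eps)."""
    if not 2 * gamma - 3 + eps < 0:
        raise ValueError("need 2*gamma - 3 + eps < 0")
    upper = int(math.log(n) + 1)  # integer part of log n + 1
    partial = sum(j ** (2 - 2 * gamma - eps) for j in range(1, upper + 1))
    return n ** (2 * gamma - 3 + eps) * partial


if __name__ == "__main__":
    for n in (10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8):
        print(n, claim2_bound(n, gamma=1.25, eps=0.1))
```

The sum has only about $\log n$ terms, so the negative power of $n$ in front of it dominates, which is the mechanism used in the proof.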
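Proof of Claim 3. We set
\[ W_{n,k}^{(1)} = \frac{n}{E(1/n)}\int_0^{k/n}\Big(F^{-1}\Big(\frac{1}{n}\sum_{j=1}^{n+1} b_{n,j}(t)\xi_j\Big) - F^{-1}(t)\Big)^2 dt \quad\text{and}\quad W_k^{(1)} = \frac{2c^2}{|\alpha|(1+c^2)}\int_0^k \big((S^{(1)}_{[y]+1})^\alpha - y^\alpha\big)^2 dy. \]
With the change of variable $t = y/n$ we can rewrite $W_{n,k}^{(1)}$ as
\[ W_{n,k}^{(1)} = b_n\int_0^k\Big(\frac{F^{-1}(S^{(1)}_{[y]+1}/n) - F^{-1}(y/n)}{F^{-1}(1/n)}\Big)^2 dy, \]
where $b_n = F^{-1}(1/n)^2/E(1/n) \to \frac{2c^2}{|\alpha|(1+c^2)}$, and we conclude that, by regular variation, $W_{n,k}^{(1)} \to_{\Pr} W_k^{(1)}$. To prove that $W_n^{(1)} \to_w \frac{2c^2}{|\alpha|(1+c^2)}\int_0^\infty ((S^{(1)}_{[y]+1})^\alpha - y^\alpha)^2 dy$ it suffices, using a $3\varepsilon$ argument, to show that
\[ \lim_{k\to\infty}\limsup_{n\to\infty} P\big(|W_{n,k}^{(1)} - W_n^{(1)}| > \varepsilon\big) = 0 \tag{6.77} \]
for all $\varepsilon > 0$. As in the proof of Claim 2 we consider a Taylor expansion
\[ F^{-1}\Big(\frac{S_{[y]+1}}{n}\Big) - F^{-1}\Big(\frac{y}{n}\Big) = \frac{S_{[y]+1} - y}{n f(F^{-1}(y/n))} - \frac{1}{2}\,\frac{(S_{[y]+1} - y)^2}{n^2}\,\frac{f'(F^{-1}(\xi))}{f^3(F^{-1}(\xi))}, \]
for some $\xi$ between $S_{[y]+1}/n$ and $y/n$, which enables us to write, using the obvious equivalents of (??) and (??), and the fact that $\sup_{y\ge 1} E\big((S_{[y]+1} - y)/y^{1/2}\big)^2 < \infty$, that
\[ \Big|F^{-1}\Big(\frac{S_{[y]+1}}{n}\Big) - F^{-1}\Big(\frac{y}{n}\Big)\Big| \le O_P(1)\Big[\frac{1}{\sqrt{n}}\,\frac{\sqrt{y}/\sqrt{n}}{f(F^{-1}(y/n))} + \frac{1}{n}\,\frac{1}{f(F^{-1}(y/n))}\Big], \]
where $O_P(1)$ stands for a stochastically bounded sequence which does not depend on $y \in [k, \log n]$. From this bound we obtain that
\[
\begin{aligned}
|W_{n,k}^{(1)} - W_n^{(1)}|
&= \frac{b_n}{F^{-1}(1/n)^2}\int_k^{\log n}\Big(F^{-1}(S^{(1)}_{[y]+1}/n) - F^{-1}(y/n)\Big)^2 dy \\
&\le O_P(1)\frac{1}{F^{-1}(1/n)^2}\Big[\frac{1}{n}\int_k^{\log n}\frac{y/n}{f^2(F^{-1}(y/n))}\,dy + \frac{1}{n^2}\int_k^{\log n}\frac{1}{f^2(F^{-1}(y/n))}\,dy\Big] \\
&= O_P(1)\frac{1}{F^{-1}(1/n)^2}\Big[\int_{k/n}^{\log n/n}\frac{t}{f^2(F^{-1}(t))}\,dt + \frac{1}{n}\int_{k/n}^{\log n/n}\frac{1}{f^2(F^{-1}(t))}\,dt\Big].
\end{aligned}
\]
From regular variation we obtain that
\[ \frac{1}{F^{-1}(1/n)^2}\Big[\int_{k/n}^{\log n/n}\frac{t}{f^2(F^{-1}(t))}\,dt + \frac{1}{n}\int_{k/n}^{\log n/n}\frac{1}{f^2(F^{-1}(t))}\,dt\Big] \to C_1 k^{2-2\gamma} + C_2 k^{1-2\gamma} \]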
and this, combined with the last estimate and the fact that $2 - 2\gamma < 0$, completes the proof of (6.77) and, consequently, of Claim 3.

6.3. Weighted Wasserstein tests of fit to location-scale families of distributions

Finally, we apply the foregoing to weighted Wasserstein tests. Recall that
\[ R_n^w = \frac{\mathcal{W}_w^2(F_n, \mathcal{H})}{\sigma_w^2(F_n)} \]
relates to the quantile process $v_n$ via equations (5.6) (assuming conditions (5.3)-(5.5)).

Theorem 6.24. Let $w$ be a non-negative measurable function satisfying condition (5.3). Let $\mathcal{H}$ be a location scale family of distributions as defined in the Introduction, such that $\int_0^1 (F^{-1}(t))^2 w(t)\,dt < \infty$ for any (hence for all) $F \in \mathcal{H}$, let $G_0 \in \mathcal{H}$ be chosen so as to satisfy conditions (5.4) and (5.5) and let $g_0 = G_0'$. Assume the distribution functions $F \in \mathcal{H}$ and the weight $w$ satisfy conditions (GH) and that moreover
\[ \int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt < \infty. \tag{6.78} \]
Then, under the null hypothesis $F \in \mathcal{H}$, we have
\[ nR_n^w \to_d \int_0^1 \frac{B^2(t)}{g_0^2(G_0^{-1}(t))}\,w(t)\,dt - \Big(\int_0^1 \frac{B(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\Big)^2 - \Big(\int_0^1 \frac{B(t)G_0^{-1}(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\Big)^2. \tag{6.79} \]
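To get a feeling for the shape of the limit in (6.79), the following sketch simulates its right-hand side in one simple special case. The assumptions are mine, not the text's: $w \equiv 1$ and $G_0$ uniform on $[-\sqrt{3}, \sqrt{3}]$ (mean zero, variance one), so that $1/g_0(G_0^{-1}(t)) = 2\sqrt{3}$ and $G_0^{-1}(t) = \sqrt{3}(2t-1)$. Whether a given family satisfies conditions (GH) is not checked here. The Brownian bridge is discretized at midpoints and the integrals are replaced by Riemann sums; all function names are hypothetical.

```python
import math
import random


def brownian_bridge_midpoints(m, rng):
    """Brownian bridge B(t) = W(t) - t W(1) at t_i = (i + 1/2)/m, i = 0..m-1."""
    times, w_vals = [], []
    w, prev = 0.0, 0.0
    for i in range(m):
        t = (i + 0.5) / m
        w += rng.gauss(0.0, math.sqrt(t - prev))  # Brownian increment
        times.append(t)
        w_vals.append(w)
        prev = t
    w_one = w + rng.gauss(0.0, math.sqrt(1.0 - prev))  # value W(1)
    return times, [wv - t * w_one for t, wv in zip(times, w_vals)]


def limit_679_uniform(m, rng):
    """One approximate draw of the right-hand side of (6.79), uniform case, w = 1."""
    times, bridge = brownian_bridge_midpoints(m, rng)
    inv_density = 2.0 * math.sqrt(3.0)  # 1 / g_0(G_0^{-1}(t)), constant here
    h = 1.0 / m
    quad = h * sum((inv_density * b) ** 2 for b in bridge)
    loc = h * sum(inv_density * b for b in bridge)
    scale = h * sum(inv_density * b * math.sqrt(3.0) * (2.0 * t - 1.0)
                    for t, b in zip(times, bridge))
    return quad - loc ** 2 - scale ** 2


def sample_limit(reps, m=100, seed=0):
    """List of `reps` independent approximate draws of the limit."""
    rng = random.Random(seed)
    return [limit_679_uniform(m, rng) for _ in range(reps)]
```

In this sketch the two subtracted squares are the squared projections of $B/g_0(G_0^{-1})$ onto the constant function and onto $G_0^{-1}$, which is one way to read the loss of two degrees of freedom for a location-scale family.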
Note that the hypotheses on $F \in \mathcal{H}$ are either satisfied by all or by none of the functions in $\mathcal{H}$.

Proof. By equivariance we can assume $F = G_0$. The result follows directly from Theorem 4.5 a) as soon as we show that $\sigma_w^2(F_n) \to 1$ in probability. By Theorem 4.5 a) $\|v_n\|_{2,w} = O_P(1)$, and therefore (recall $F = G_0$ and (5.5))
\[ |\sigma_w(F_n) - 1| = \big|\,\|F_n^{-1}\|_{2,w} - \|F^{-1}\|_{2,w}\big| \le \|F_n^{-1} - F^{-1}\|_{2,w} = \frac{1}{\sqrt{n}}\|v_n\|_{2,w} \to_{\Pr} 0. \]

If $\mathcal{H}$ in Theorem 6.24 were only a location family or only a scale family then the limit would exhibit only the loss of one degree of freedom, that is, one of the last two integrals would be absent from the limit in (6.79): see Csörgő (2002), where a theorem of this sort for scale families is proved.

Theorem 6.25. Under the hypotheses of Theorem 6.24, except that now condition (6.78) is replaced by the weaker conditions (6.11) and
\[ \int_0^1\int_0^1 \frac{(s\wedge t - st)^2}{g_0^2(G_0^{-1}(t))\,g_0^2(G_0^{-1}(s))}\,w(s)w(t)\,ds\,dt < \infty, \tag{6.80} \]
we have
\[ nR_n^w - \int_{1/n}^{1-1/n}\frac{t(1-t)}{g_0^2(G_0^{-1}(t))}\,w(t)\,dt \to_d \int_0^1 \frac{B^2(t) - EB^2(t)}{g_0^2(G_0^{-1}(t))}\,w(t)\,dt - \Big(\int_0^1 \frac{B(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\Big)^2 - \Big(\int_0^1 \frac{B(t)G_0^{-1}(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\Big)^2. \]

Proof. As above, we can take $F = G_0$. By Theorem 4.5 b) properly modified to account for the weighted integrals of the Brownian bridge (as done in Theorem 6.17), it suffices to prove that
\[ \sigma_w(F_n) \to_{\Pr} 1 \quad\text{and}\quad \Big(\frac{1}{\sigma_w^2(F_n)} - 1\Big)\int_{1/n}^{1-1/n}\frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt \to_{\Pr} 0, \]
which, by condition (6.11), reduce to proving $\sqrt{n}(\sigma_w^2(F_n) - 1) = O_P(1)$. We have
\[ \sqrt{n}\big(\sigma_w^2(F_n) - 1\big) = \frac{1}{\sqrt{n}}\|v_n\|_{2,w}^2 + 2\langle v_n, F^{-1}\rangle_w. \]
By checking the proof of Theorem 6.17 (by way of Theorem 6.12), it is easy to see that
\[ \langle u_n/f(F^{-1}), F^{-1}\rangle_{w,n} \to_d \langle B/f(F^{-1}), F^{-1}\rangle_{w,n} \]
(note that (5.5) implies $F^{-1} \in L_2(w(t)dt)$). Hence Proposition 2.5 gives $\langle v_n, F^{-1}\rangle_w = O_P(1)$. Likewise, by Theorem 4.5 b), $\|v_n\|_{2,w}^2$ is shift convergent in law with shifts $\int_{1/n}^{1-1/n}\frac{t(1-t)}{f^2(F^{-1}(t))}w(t)\,dt$ which, by (6.11), are $o(\sqrt{n})$, so that $\|v_n\|_{2,w}^2/\sqrt{n} \to_{\Pr} 0$.
A version of this theorem for scale families is proved in Csörgő (2002); however, the hypotheses there are stronger by factors of the order of $\log n$ or $(\log n)^2$, the integrals at the end points are not treated analytically and the proof is different (it relies on strong approximations, which account for the stronger assumptions).

Next we consider convergence to a normal distribution. This case is less interesting in connection with testing since, as indicated in the Introduction, $\int_\delta^{1-\delta} v_n^2(t)w(t)\,dt \to_d \int_\delta^{1-\delta}\frac{B^2(t)}{f^2(F^{-1}(t))}w(t)\,dt$ if $f$ does not vanish on $\mathrm{supp}\,F$, and therefore, if we divide by $L(1/n) \to \infty$, as we must by Theorem 6.21 c), this part of the statistic has no influence on the limit. So, when a distribution satisfies the hypotheses of Theorem 6.21 c) (meaning that $g = f(F^{-1})$ is regularly varying with exponent 1 and $L(x)$ with this $g$ tends to infinity), if one wishes to have a sensible test of fit, it is probably best to find a weight $w$ so that one can apply Theorems 6.24 or 6.25. Hence, we will only consider the normal convergence case with weight $w \equiv 1$.

Theorem 6.26. Let $\mathcal{H}$ be a location scale family of distributions and assume for simplicity that the distribution $G_0 \in \mathcal{H}$ with mean zero and variance one is the distribution function of a symmetric random variable. Assume that the following conditions hold for some (hence for all) $F \in \mathcal{H}$: $F$ is twice differentiable on $(a_F, b_F)$ and satisfies condition (6.7), its density $f$ is strictly positive on $(a_F, b_F)$, the function $f(F^{-1})$ is regularly varying of exponent 1 at 0 and at 1, and $L(x) \to \infty$ as $x \to 0$. Then, under the null hypothesis $F \in \mathcal{H}$,
\[ \frac{1}{L(1/n)}\Big(nR_n - \int_{1/n}^{1-1/n}\frac{t(1-t)}{g_0^2(G_0^{-1}(t))}\,dt\Big) \to_d Z, \tag{6.81} \]
where $Z$ is standard normal and $nR_n := nR_n^w$ is as defined in (5.6) for $w \equiv 1$.

Proof. As in the previous two theorems we can assume $F = G_0$. Since $f(F^{-1})$ is regularly varying of exponent 1 at 0 and at 1, it follows that $F^{-1}(x) = \int_{1/2}^x \frac{dt}{f(F^{-1}(t))}$ is slowly varying at 0 and at 1. Hence, $\int_0^1 |F^{-1}(t)|^r dt < \infty$ for all $r \in \mathbb{R}$ and therefore, all the moments $\int_0^\infty |x|^r dF(x)$, $r > 0$, are finite. In particular, if $X_i$ are i.i.d. with distribution $F$,
\[ \int_0^1 v_n(t)\,dt = \frac{1}{\sqrt{n}}\sum_{i=1}^n (X_i - EX_i) = O_P(1) \]
by the central limit theorem, and $\sigma^2(F_n) \to 1$ a.s. by the law of large numbers. So, it suffices to show that
\[ \frac{1}{L(1/n)}\Big(\|v_n\|_2^2 - \langle v_n, F^{-1}\rangle^2 - \int_{1/n}^{1-1/n}\frac{t(1-t)}{f^2(F^{-1}(t))}\,dt\Big) \to_d Z. \]
The arguments in the proof of Theorem 6.21 c) not only show that here we can replace $\|v_n\|_2^2$ by $\|u_n/f(F^{-1})\|_{2,n}^2$, but also that $\langle v_n, F^{-1}\rangle^2$ can be replaced by $\langle u_n/f(F^{-1}), F^{-1}\rangle_n^2$; therefore, the theorem will follow from Theorem 6.21 c) (hence from Theorem 6.19) if we show that the sequence
\[ \Big\langle \frac{u_n}{f(F^{-1})}, F^{-1}\Big\rangle_n := \int_{1/n}^{1-1/n}\frac{u_n(t)F^{-1}(t)}{f(F^{-1}(t))}\,dt, \quad n \in \mathbb{N}, \tag{6.82} \]
is stochastically bounded (as it will then tend to zero upon dividing by $L(1/n)$). For this, we show that the product of the $n$th variable in (6.82) by $S_{n+1}/n$ has expected value tending to zero and variance dominated by a constant independent of $n$. By (6.2), Lemma 6.1 i) and slow variation of $F^{-1}$ at 0 and 1, we have
\[
\begin{aligned}
\Big|E\Big(\frac{S_{n+1}}{n}\Big\langle \frac{u_n}{f(F^{-1})}, F^{-1}\Big\rangle_n\Big)\Big|
&= \Big|E\int_{1/n}^{1-1/n}\frac{\sum_{i=1}^{n+1} a_{n,i}\xi_i}{\sqrt{n}}\,\frac{F^{-1}(t)}{f(F^{-1}(t))}\,dt\Big| \\
&\le \frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n}\frac{|F^{-1}(t)|}{f(F^{-1}(t))}\,dt
= \frac{1}{\sqrt{n}}\int_{F^{-1}(1/n)}^{F^{-1}(1-1/n)} |u|\,du \\
&= \frac{(F^{-1}(1/n))^2 + (F^{-1}(1-1/n))^2}{2\sqrt{n}} \to 0.
\end{aligned}
\]
Let $X$ be a random variable with distribution $F$. By Lemma 6.1 iii) and finiteness of the absolute moments of $X$, we have
\[
\begin{aligned}
\mathrm{Var}\Big(\frac{S_{n+1}}{n}\Big\langle \frac{u_n}{f(F^{-1})}, F^{-1}\Big\rangle_n\Big)
&= E\Big(\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n}\frac{F^{-1}(t)\sum_{i=1}^{n+1} a_{n,i}(\xi_i - 1)}{f(F^{-1}(t))}\,dt\Big)^2 \\
&= \frac{1}{n}\int_{1/n}^{1-1/n}\int_{1/n}^{1-1/n}\frac{\tilde{K}_n(s,t)F^{-1}(s)F^{-1}(t)}{f(F^{-1}(s))\,f(F^{-1}(t))}\,ds\,dt \\
&\le 6\int_{1/n}^{1-1/n}\int_{1/n}^{t}\frac{s(1-t)|F^{-1}(s)||F^{-1}(t)|}{f(F^{-1}(s))\,f(F^{-1}(t))}\,ds\,dt \\
&= 6\int_{F^{-1}(1/n)}^{F^{-1}(1-1/n)}\int_{F^{-1}(1/n)}^{v} F(u)(1-F(v))|u||v|\,du\,dv \\
&\le 6\Big[\int_{F^{-1}(1/n)}^{0}\int_{F^{-1}(1/n)}^{v} F(u)|u||v|\,du\,dv
+ \int_{0}^{F^{-1}(1-1/n)}\int_{F^{-1}(1/n)}^{0} F(u)(1-F(v))|u||v|\,du\,dv \\
&\qquad + \int_{0}^{F^{-1}(1-1/n)}\int_{0}^{v} (1-F(v))|u||v|\,du\,dv\Big] \\
&\le \frac{3}{4}EX^4 + (EX^2)^2 < \infty,
\end{aligned}
\]
where at the last step we use Fubini and integration by parts.
As in Theorem 6.21 c), symmetry of $G_0$ is not necessary. Csörgő (2002) also proves a result for correlation tests where the limit is normal, but only for the special case of Weibull scale families. Likewise, Theorem 6.22 can be used to obtain the limiting distribution of $nR_n$ when $f(F^{-1})$ is regularly varying at the end points with exponent $\gamma > 1$, but we refrain from doing so, to avoid too much repetition.

Example. (Gauss-Laplace location-scale families) This is a modification of a result in Csörgő (2002). Consider the distribution functions from the above Example,
\[ F_\alpha(x) = \begin{cases} \frac{1}{2}e^{-|x|^\alpha} & \text{if } x \le 0, \\ 1 - \frac{1}{2}e^{-x^\alpha} & \text{if } x \ge 0, \end{cases} \qquad \text{for } \alpha > 0. \]
Then, just as in that example, Theorem 6.24 with $w \equiv 1$ holds for the location scale family based on $F_\alpha$ iff $\alpha > 2$, Theorem 6.25 with $w \equiv 1$ holds iff $4/3 < \alpha \le 2$, hence for the normal distribution (which gives Shapiro-Wilk), and Theorem 6.26 with $w \equiv 1$ holds for $0 < \alpha \le 4/3$, in particular for the symmetric exponential distribution. As mentioned above, if the tail probabilities are of different order, the largest dominates and the same conclusions apply to the one sided families.
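The three regimes of this example can be collected in a small lookup. The sketch below only encodes the thresholds stated above for $w \equiv 1$ together with the distribution function $F_\alpha$; the function names are illustrative and no claim is made beyond the ranges given in the example.

```python
import math


def gauss_laplace_cdf(x, alpha):
    """F_alpha(x): (1/2) exp(-|x|^alpha) for x <= 0, 1 - (1/2) exp(-x^alpha) for x >= 0."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if x <= 0:
        return 0.5 * math.exp(-abs(x) ** alpha)
    return 1.0 - 0.5 * math.exp(-x ** alpha)


def applicable_theorem(alpha):
    """Theorem that the example assigns to the family based on F_alpha (w = 1)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if alpha > 2:
        return "Theorem 6.24"
    if alpha > 4 / 3:
        return "Theorem 6.25"
    return "Theorem 6.26"
```

For instance, `applicable_theorem(2)` returns `"Theorem 6.25"` and `applicable_theorem(1)` returns `"Theorem 6.26"`, matching the ranges $4/3 < \alpha \le 2$ and $0 < \alpha \le 4/3$.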
Example. (Testing fit to the Laplace location scale family) It follows from the last example and the comments immediately before Theorem 6.26 that a weighted Wasserstein test would be convenient for the Gauss-Laplace location scale family when the index $\alpha$ is between 0 and 4/3. For any given $\alpha > 0$ these families are (in terms of the densities):
\[ \mathcal{H}_\alpha = \Big\{F_{\beta,\gamma} : f_{\beta,\gamma}(x) := \frac{\alpha}{2\gamma\Gamma(1/\alpha)}e^{-|(x-\beta)/\gamma|^\alpha},\ x \in \mathbb{R},\ \beta \in \mathbb{R},\ \gamma > 0\Big\}. \]
The weight should approach zero near 0 and 1. For simplicity we will only present a test for the Laplace family $\mathcal{H}_1$. Simple but tedious computations using the approximations in the previous example show that a weight of the order $w(t) \sim \frac{1}{|\log t(1-t)|^\tau}$ will allow us to apply Theorem 6.24 if $\tau > 1$ and Theorem 6.25 if $1/2 < \tau \le 1$ (the determining conditions are (6.78), which holds for all $\tau > 1$, and (6.80), which holds for $1/2 < \tau \le 1$). If $w$ is too small near 0 and 1, we make the extreme part of the distribution count less, whereas possibly the limit has more variability as the integral of $B^2 - EB^2$ is closer to being divergent. de Wet (2000) convincingly suggests taking $\tau = 1$ (see also Csörgő (2002)). Concretely, we define
\[ w(t) := \frac{1}{W\log\frac{e}{2t}}\,I_{(0,1/2]}(t) + \frac{1}{W\log\frac{e}{2(1-t)}}\,I_{(1/2,1)}(t), \qquad W := e\int_1^\infty u^{-1}e^{-u}\,du, \]
and set also
\[ V = \int_0^\infty \frac{u^2}{1+u}\,e^{-u}\,du. \]
Take $G_0 := F_{0,\sqrt{W/V}}$. Then $w$ and $G_0$ satisfy conditions (5.3)-(5.5), and the conditions (GH) and (6.80) hold as well (but not (6.78)). Then, Theorem 5.2 gives that, under the null hypothesis $F \in \mathcal{H}_1$,
\[
\begin{aligned}
nR_n^w - \frac{2}{V}\Big(\log\log\frac{ne}{2} - \frac{W}{2}\Big)
\to_d\; & \frac{1}{V}\Big[\int_0^{1/2}\frac{B^2(t) - EB^2(t)}{t^2\log\frac{e}{2t}}\,dt + \int_{1/2}^1\frac{B^2(t) - EB^2(t)}{(1-t)^2\log\frac{e}{2(1-t)}}\,dt\Big] \\
& - \frac{1}{VW}\Big[\int_0^{1/2}\frac{B(t)}{t\log\frac{e}{2t}}\,dt + \int_{1/2}^1\frac{B(t)}{(1-t)\log\frac{e}{2(1-t)}}\,dt\Big]^2 \\
& - \frac{1}{V^2}\Big[\int_0^{1/2}\frac{B(t)\log 2t}{t\log\frac{e}{2t}}\,dt - \int_{1/2}^1\frac{B(t)\log 2(1-t)}{(1-t)\log\frac{e}{2(1-t)}}\,dt\Big]^2.
\end{aligned}
\]
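The constants $W$ and $V$ and the normalizations of the weight can be checked numerically. The sketch below makes several assumptions of its own: the weight is taken to be $w(t) = 1/(W\log(e/(2\min(t,1-t))))$, conditions (5.3) and (5.5) are read as $\int_0^1 w = 1$ and $\int_0^1 (G_0^{-1})^2 w = 1$, the infinite integrals are truncated at 60, and plain Simpson and midpoint rules are used. All names are hypothetical. Since $u^2/(1+u) = u - 1 + 1/(1+u)$, one would expect $V$ and $W$ to coincide, and the numerical values agree with this.

```python
import math


def simpson(f, a, b, n=20000):
    """Composite Simpson rule with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def midpoint(f, a, b, n=200000):
    """Composite midpoint rule; never evaluates f at the end points."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


UPPER = 60.0  # truncation point; the neglected tails are of order exp(-60)

W = math.e * simpson(lambda u: math.exp(-u) / u, 1.0, UPPER)
V = simpson(lambda u: u * u / (1.0 + u) * math.exp(-u), 0.0, UPPER)


def weight(t):
    """Assumed weight w(t) = 1 / (W log(e / (2 min(t, 1 - t)))), 0 < t < 1."""
    s = min(t, 1.0 - t)
    return 1.0 / (W * math.log(math.e / (2.0 * s)))


def g0_quantile(t):
    """Quantile function of the Laplace law with location 0 and scale sqrt(W/V)."""
    scale = math.sqrt(W / V)
    if t <= 0.5:
        return scale * math.log(2.0 * t)
    return -scale * math.log(2.0 * (1.0 - t))


total_mass = midpoint(weight, 0.0, 1.0)
second_moment = midpoint(lambda t: g0_quantile(t) ** 2 * weight(t), 0.0, 1.0)

if __name__ == "__main__":
    print("W =", W, "V =", V)
    print("integral of w:", total_mass)
    print("integral of (G0^-1)^2 w:", second_moment)
```

Both normalizing integrals come out close to 1 in this sketch, which supports the reading of the weight used here.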
Eustasio del Barrio
Departamento de Estadística e Investigación Operativa
Facultad de Ciencias
Universidad de Valladolid
C/ Prado de la Magdalena S/N
E-47005 Valladolid, Spain
e-mail: [email protected]
Topics on Empirical Processes Paul Deheuvels
1. Introduction, notation and preliminaries 1.1. Introduction The present notes, starting at an elementary level, collect some basic mathematical facts and technical arguments which should be useful for research students working in the field of empirical processes and nonparametric statistics. There are many fine reviews and monographs to be recommended (such as [38, 40, 44, 77, 78, 79, 89, 87, 99, 100, 164, 178, 183, 204]) on this and related subjects. Obviously, one cannot expect a scholar to have read all these references prior to the start of his research activities, and it is of interest to provide a minimal selection of what one should know at the beginning. We have chosen here to focus our study on functional limit laws generated by empirical processes, for which there has been continuous interest in the last decades (see, e.g., [52, 60, 65, 66, 62, 98, 156, 199]). Such results provide a number of direct applications to statistics which motivate our interest. This is a rather delicate subject, mixing topological arguments and large deviation theory (see, e.g., [73]), together with other sophisticated probabilistic techniques. A basic knowledge of Gaussian processes (refer to [135]) is a natural prerequisite, as well as some understanding of weak and strong invariance principles (see, e.g., [19, 40, 155, 154, 156]). We have chosen not to speak much of the, so called, abstract empirical processes theory, which studies processes, indexed by sets or functions, on general spaces, and we advise motivated readers to orient their interests towards this direction, in a second step (refer to [6, 82, 99, 136, 164]). Our taste for functional limit laws takes root in the fact that these results have turned out to be most useful in practice, and await yet even more extensive investigations to be exploited in full. A second point is that abstract empirical process theory has remained mostly on the mathematical side, with relatively few explicit statistical applications. 
There are some exceptions, though, and, in fact, the recent evolution of statistical science has shown that one could solve several difficult problems of interest by using deviation inequalities obtained in this general framework (see, e.g., [95, 70, 201]). However, the relevant mathematical tools are, at times, so involved, that their
P. Deheuvels
study is a research domain in se, which I believe, should not be treated as a starting point. In particular, probability theory in Banach spaces, which comes into this field as a basic ingredient, constitutes a universe in itself (refer to [6, 135]). This introductory course has been presented at the summer school in Laredo, and later, taught to graduate students at the University of Paris VI. Of course, the contents of these notes had to be limited by lack of space, and their completion is, as one could expect, left to the reader, with the help of the enclosed exercises and references. 1.2. Distribution and quantile functions 1.2.1. Left- and right-continuous distribution and quantile functions. Let X ∈ R denote a real-valued random variable [rv] with probability law defined by PX (A) = P(X ∈ A), for all A ∈ BR belonging to the set BR of all Borel subsets of R (endowed with the usual topology). By definition, BR is the σ-algebra generated by the open (or closed) subsets of R. Thanks to the properties of the Lebesgue-Stieltjes integral, the probability law PX is uniquely determined (see, e.g., pp. 17–18 in [19]) by the (right-continuous version of the) distribution function [df] of X, defined by F (x) = P(X ≤ x) for
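$x \in \overline{\mathbb{R}} := \mathbb{R} \cup \{-\infty\} \cup \{\infty\}$. (1.1)
The df F(·) of a rv may be any nondecreasing, right-continuous function on $\overline{\mathbb{R}}$, such that
$$-\infty \le x \le y \le \infty \;\Rightarrow\; 0 \le F(x) \le F(y) \le 1; \tag{1.2}$$
$$F(-\infty) = 0 = \lim_{x \downarrow -\infty} F(x); \qquad F(\infty) = 1 = \lim_{x \uparrow \infty} F(x); \tag{1.3}$$
$$F(x+) := \lim_{y \downarrow x} F(y) = F(x) \quad \text{for } x \in \mathbb{R} \cup \{-\infty\}. \tag{1.4}$$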
The left-continuous version F(x−) := P(X < x) of the distribution function F(·) of X fulfills (1.2) and (1.3), but not (1.4). The latter relation is replaced by
$$F(x-) := \lim_{y \uparrow x} F(y) = \lim_{y \uparrow x} F(y-) = P(X < x) \le F(x) \quad \text{for } x \in \mathbb{R} \cup \{\infty\}. \tag{1.5}$$
The set DF of discontinuity points of F , namely, the set of all x ∈ R such that PX ({x}) = P(X = x) = F (x) − F (x−) > 0,
(1.6)
is at most countable, so that its complement, the set CF := R − DF , of all continuity points of F , is always dense in R. Since F is completely determined by the knowledge of the values of F (x) when x varies within a dense subset of R, this property holds when the dense subset is chosen equal to CF .
Topics on Empirical Processes
The (left-continuous version of the) quantile function [qf] G(·) of X is defined on [0, 1] by
$$G(u) = \begin{cases} \omega_- := \inf\{x : F(x) > 0\} & \text{for } u = 0, \\ \inf\{x \in \mathbb{R} : F(x) \ge u\} = \sup\{x \in \mathbb{R} : F(x) < u\} & \text{for } 0 < u < 1, \\ \omega_+ := \sup\{x : F(x) < 1\} & \text{for } u = 1. \end{cases} \tag{1.7}$$
The values of the distribution endpoints, ess inf(X) := ω− ≥ −∞ and ess sup(X) := ω+ ≤ ∞, of PX are possibly infinite. The distribution of X is said to be degenerate when X is almost surely [a.s.] constant, namely, when there exists an x0 ∈ R such that P(X = x0) = 1. In this case, one has
$$-\infty < \omega_- = \omega_+ = x_0 < \infty. \tag{1.8}$$
On the other hand, for non-degenerate distributions, the following inequalities hold:
$$-\infty \le \omega_- < \omega_+ \le \infty. \tag{1.9}$$
Non-degeneracy of PX is conveniently characterized by the property that
$$\sup_{x \in \mathbb{R}} P(X = x) < 1. \tag{1.10}$$
The quantile function G(·) in (1.7), as well as its right-continuous version, defined, for 0 < u < 1, by G(u+) := inf{x : F(x) > u}, and, for u = 0 or 1, through the relations (1.12) below, fulfills
$$0 \le u \le v \le 1 \;\Rightarrow\; \omega_- \le G(u) \le G(v) \le \omega_+; \tag{1.11}$$
$$G(0+) := G(0) = \omega_- = \lim_{u \downarrow 0} G(u); \qquad G(1+) := G(1) = \omega_+ = \lim_{u \uparrow 1} G(u); \tag{1.12}$$
$$G(u) = G(u-) := \lim_{\varepsilon \downarrow 0} G(u - \varepsilon) \quad \text{for } 0 < u \le 1. \tag{1.13}$$
The right-continuous version G(·+) of the quantile function G(·) of X fulfills (1.11) and (1.12), but not (1.13). The latter relation is replaced by
$$G(u+) := \lim_{\varepsilon \downarrow 0} G(u + \varepsilon) = \inf\{x : F(x) > u\} = \sup\{x : F(x) \le u\} \quad \text{for } 0 < u < 1. \tag{1.14}$$
One has the following refinement of (1.12) for G(·+):
$$G(0+) = \omega_- = \lim_{u \downarrow 0} G(u+); \qquad G(1+) = \omega_+ = \lim_{u \uparrow 1} G(u+). \tag{1.15}$$
Example 1.1. Let F and G be, respectively, the df and qf of a rv X.
1°) Show that the set of all x ∈ R such that F(x+) − F(x−) ≥ 1/n is finite or void, for each n ≥ 1. Conclude that the set DF of discontinuity points of F is, at most, countable.
2°) Show likewise that the set DG of discontinuity points of G is, at most, countable.
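The difference between G(u) in (1.7) and G(u+) in (1.14) is easiest to see on a purely atomic law, where both reduce to a table lookup. The following Python sketch is illustrative only: the helper name `make_df_and_qf` and the three-atom example are assumptions introduced here, not part of the notes.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def make_df_and_qf(atoms, probs):
    """Return (F, G, G_plus) for a discrete law.

    atoms: strictly increasing support points, each with positive mass.
    probs: the corresponding probabilities (summing to 1).
    """
    cum = list(accumulate(probs))
    cum[-1] = 1.0  # guard against rounding in the last partial sum

    def F(x):
        # Right-continuous df: F(x) = P(X <= x).
        k = bisect_right(atoms, x)
        return cum[k - 1] if k > 0 else 0.0

    def G(u):
        # Left-continuous qf, as in (1.7): inf{x : F(x) >= u} for 0 < u < 1.
        if u <= 0:
            return atoms[0]   # omega_-
        if u >= 1:
            return atoms[-1]  # omega_+
        return atoms[bisect_left(cum, u)]

    def G_plus(u):
        # Right-continuous version, as in (1.14): inf{x : F(x) > u} for 0 < u < 1.
        if u <= 0:
            return atoms[0]
        if u >= 1:
            return atoms[-1]
        return atoms[bisect_right(cum, u)]

    return F, G, G_plus


# Hypothetical example: P(X=0) = 1/4, P(X=1) = 1/2, P(X=2) = 1/4.
F, G, G_plus = make_df_and_qf([0, 1, 2], [0.25, 0.5, 0.25])
print(G(0.25), G_plus(0.25))  # the two versions differ at the level u = F(0)
print(G(0.5), G_plus(0.5))    # and agree at a level that is not a value of F
```

The two versions can only differ at levels u that F takes on a flat stretch, which is why the set of discontinuity points of G is at most countable.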
1.2.2. Generalized inverses of nondecreasing functions. The distribution and quantile functions are mutual inverses in a sense made precise below. Consider a non-decreasing function {H(t) : A ≤ t ≤ B}, with −∞ ≤ A ≤ B ≤ ∞ and −∞ ≤ H(A) ≤ H(B) ≤ ∞. We do not make continuity assumptions on H, except at end-points when A < B, in which case, we require that
$$H(A) = \lim_{t \downarrow A} H(t) \quad \text{and} \quad H(B) = \lim_{t \uparrow B} H(t). \tag{1.16}$$
Then, it is always possible to define a left-continuous inverse $H^{\rightarrow}(s)$ (resp. a right-continuous inverse $H^{\leftarrow}(s)$) of H(·), over H(A) ≤ s ≤ H(B), as follows.
(i) When H(A) < s ≤ H(B) (resp. H(A) ≤ s < H(B)), we set
$$H^{\rightarrow}(s) = \sup\{t : H(t) < s\} \quad (\text{resp. } H^{\leftarrow}(s) = \inf\{t : H(t) > s\}). \tag{1.17}$$
(ii) When H(A) = s < H(B) (resp. H(A) < s = H(B)), we define $H^{\rightarrow}(s)$ (resp. $H^{\leftarrow}(s)$) via (1.17) and
$$H^{\rightarrow}(H(A)) = \lim_{v \downarrow H(A)} H^{\rightarrow}(v) \quad (\text{resp. } H^{\leftarrow}(H(B)) = \lim_{v \uparrow H(B)} H^{\leftarrow}(v)). \tag{1.18}$$
(iii) There is one possibility uncovered by (1.17)–(1.18), when H(A) = H(B). In this case, we define $H^{\rightarrow}(s)$ (resp. $H^{\leftarrow}(s)$) for H(A) = s = H(B) by setting
$$H^{\rightarrow}(H(A)) = A \quad (\text{resp. } H^{\leftarrow}(H(B)) = B). \tag{1.19}$$
If, in addition to the previous assumptions, H(·) is right-continuous, then the definition (1.17)–(1.18)–(1.19) of the left-continuous inverse $H^{\rightarrow}(\cdot)$ of H(·) may be simplified into
$$H^{\rightarrow}(s) = \inf\{t : H(t) \ge s\} \quad \text{for } H(A) \le s \le H(B). \tag{1.20}$$
Likewise, when H(·) is left-continuous, the definition (1.17)–(1.18)–(1.19) of the right-continuous inverse $H^{\leftarrow}(\cdot)$ of H(·) may be simplified into
$$H^{\leftarrow}(s) = \sup\{t : H(t) \le s\} \quad \text{for } H(A) \le s \le H(B). \tag{1.21}$$
Under the additional assumption that H(A) < H(B), the right-continuous (resp. left-continuous) inversion operators, mapping H onto $H^{\rightarrow}$ (resp. H onto $H^{\leftarrow}$), are involutions. The condition H(A) < H(B) (which entails A < B) allows one to define intrinsic end-points, A∗ and B∗, pertaining to H(·), by setting
$$A^* = \inf\{t \in (A, B) : H(t) > H(A)\} \quad \text{and} \quad B^* = \sup\{t \in (A, B) : H(t) < H(B)\}. \tag{1.22}$$
We have −∞ ≤ A ≤ A∗ < B∗ ≤ B ≤ ∞, and −∞ ≤ H(A) = H(A∗) < H(B∗) = H(B) ≤ ∞. It is readily checked that, whenever H(·) is right-continuous (resp. left-continuous), for all A∗ ≤ t ≤ B∗,
$$H(t) = \{H^{\rightarrow}\}^{\leftarrow}(t) = \{H^{\leftarrow}\}^{\leftarrow}(t) \quad (\text{resp. } H(t) = \{H^{\leftarrow}\}^{\rightarrow}(t) = \{H^{\rightarrow}\}^{\rightarrow}(t)). \tag{1.23}$$
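We note that (1.23) is invalid when H(t) = µ is constant for t ∈ [A, B]. In this case (1.22) is meaningless, and the definitions, (1.19) of $H^{\rightarrow}$ and $H^{\leftarrow}$, reduce to $H^{\leftarrow}(s) = A$ and $H^{\rightarrow}(s) = B$ for s = µ, so that
$$\{H^{\leftarrow}\}^{\leftarrow}(t) = \{H^{\leftarrow}\}^{\rightarrow}(t) = \mu \quad \text{for } t = A, \qquad \text{and} \qquad \{H^{\rightarrow}\}^{\leftarrow}(t) = \{H^{\rightarrow}\}^{\rightarrow}(t) = \mu \quad \text{for } t = B. \tag{1.24}$$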
Both operators, $H \mapsto H^{\rightarrow}$ and $H \mapsto H^{\leftarrow}$, are natural extensions of the inversion operator, $H \mapsto H^{-1}$, with respect to composition of mappings. The interest of $H^{\rightarrow}$ and $H^{\leftarrow}$ is to be always defined, whereas such is not the case for $H^{-1}$. When H(·) is (strictly) increasing and continuous on [A, B], with a properly defined inverse mapping, $H^{-1}(\cdot)$, (strictly) increasing and continuous on [H(A), H(B)], all three definitions coincide, since then, A = A∗, B = B∗, and, for each s ∈ [A, B] (resp. t ∈ [H(A), H(B)]),
$$H(H^{-1}(t)) = t, \qquad H^{-1}(H(s)) = s, \qquad H^{-1}(t) = H^{\rightarrow}(t) = H^{\leftarrow}(t). \tag{1.25}$$
Proposition 1.1 below provides a variant of (1.25), tailored to the case of distribution and quantile functions. Its proof can be adapted to show that continuity of H and $H^{-1}$ is sufficient to imply (1.25).
1.2.3. Some useful properties of distribution and quantile functions. As follows from (1.7) and (1.20)–(1.21)–(1.23), in the special case where G is the qf of the df F, we have
$$G(u) = F^{\rightarrow}(u) \ \text{ for } 0 \le u \le 1 \quad \text{and} \quad F(x) = G^{\leftarrow}(x) \ \text{ for } \omega_- \le x \le \omega_+, \tag{1.26}$$
$$G(u+) = F^{\leftarrow}(u) \ \text{ for } 0 \le u \le 1 \quad \text{and} \quad F(x-) = G^{\rightarrow}(x) \ \text{ for } \omega_- \le x \le \omega_+. \tag{1.27}$$
In view of (1.26)–(1.27) and (1.20)–(1.21), we obtain that, for each x fulfilling ω− ≤ x ≤ ω+ , F (x) = sup{u : G(u) ≤ x} = sup{u : G(u+) ≤ x},
(1.28)
F (x−) = inf{u : G(u) ≥ x} = inf{u : G(u+) ≥ x}.
(1.29)
As follows from §1.2.2, the relations (1.28)–(1.29) become invalid when x ∉ [ω−, ω+]. We note, further, that the definition (1.7) of G(·) entails that, for each u ∈ [0, 1] such that −∞ < G(u) < ∞,
$$F(G(u) - \varepsilon) < u \le F(G(u) + \varepsilon) \quad \forall \varepsilon > 0. \tag{1.30}$$
This implies that the image set of [0, 1] by either G(·) or G(·+) is nothing else but the support of the distribution of X. The latter, denoted by supp(X), is a closed subset of R (see, e.g., p. 277 in [10]), collecting all x ∈ R such that P(Vx ) > 0 for each open neighborhood Vx of x in R. It is noteworthy that P(X ∈ supp(X)) = 1. The complement in R of the support of X, namely R − supp(X), is the union of all open subsets O of R such that P(X ∈ O) = 0.
A similar argument as that used to infer (1.30) from (1.7) shows that, for each 0 < u < 1 (this implying that −∞ < G(u) ≤ G(u+) < ∞) and ε > 0, we have
$$F(G(u) - \varepsilon) < u \le F(G(u) + \varepsilon) \quad \text{and} \quad F(G(u+) - \varepsilon) \le u < F(G(u+) + \varepsilon). \tag{1.31}$$
By letting ε ↓ 0 in (1.31), we obtain readily the inequalities F (G(u)−) ≤ F (G(u+)−) ≤ u ≤ F (G(u)) ≤ F (G(u+)).
(1.32)
Another application of (1.7) shows readily that, for each ω− < x < ω+,
$$G(F(x-)) \le G(F(x)) \le x \le G(F(x-)+) \le G(F(x)+). \tag{1.33}$$
A simple and useful consequence of these inequalities is stated in the next proposition.
Proposition 1.1. Assume that F(·) is continuous on (ω−, ω+), and that G(·) is continuous on (0, 1). Then, for each x ∈ [ω−, ω+] and u ∈ [0, 1], one has
$$G(F(x)) = x \quad \text{and} \quad F(G(u)) = u. \tag{1.34}$$
Proof. The assumption that F(·) and G(·) are continuous is equivalent to the identities F(·−) = F(·) and G(·) = G(·+). We have therefore F(x−) = F(x) and G(u) = G(u+) in (1.32)–(1.33), which, in turn, entail (1.34) for x ∈ (ω−, ω+) and u ∈ (0, 1). To conclude, we observe that (1.34) holds for x = ω± and u = 0 or 1, as a direct consequence of the definitions (1.1) and (1.7) of F and G.
Exercise 1.1. Let X denote a [0, 1]-valued rv with density f(·), continuous and positive on [0, 1]. Denoting by F(·) and G(·) the df and qf of X, show that the quantile density $g(u) = \frac{d}{du} G(u)$ is continuous and positive on [0, 1], and is equal to 1/f(G(u)).
Exercise 1.2. Let G be the qf of a rv X. Prove the equality
$$E(X) = \int_0^1 G(u)\,du.$$
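As a numerical sanity check of the identity in Exercise 1.2 (not a proof), one may take the standard exponential law, for which G(u) = −log(1 − u) and E(X) = 1. The Python sketch below approximates the integral of G over (0, 1) by a midpoint rule; the function names and the choice of quadrature are assumptions made here for illustration.

```python
import math


def exp_quantile(u):
    """Quantile function of the standard exponential law: G(u) = -log(1 - u)."""
    return -math.log(1.0 - u)


def midpoint_integral(func, n=100_000):
    """Midpoint-rule approximation of the integral of func over (0, 1)."""
    return sum(func((k + 0.5) / n) for k in range(n)) / n


approx_mean = midpoint_integral(exp_quantile)
print(approx_mean)  # should be close to E(X) = 1
```

The midpoint rule never evaluates G at u = 1, where G is infinite, so the logarithmic singularity at the right endpoint causes no difficulty.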
1.3. Topologies on spaces of measures and functions 1.3.1. Weak and vague convergence of measures. We first recall some basic topological facts. The notion of Polish, and of locally compact, topological space plays here an essential role. A topological space E is Polish, iff its topology can be defined by a metric δ, for which it is complete, and separable (meaning, see, e.g., p. 209 in [10], that there exists a countable dense subset of E, or, equivalently, that there exists a countable base of open sets). A locally compact space is a Hausdorff topological space (such that two distinct points have disjoint neighborhoods), for which each point has a compact neighborhood. A locally compact space E has a countable base (see, e.g., p. 209 in [10]), iff there exists a countable base of open sets of E (meaning that each open set of E is the union of some of these subsets). Both structures are closely related. In fact, the topology of an arbitrary separable metric space (E, δ) can be defined by a metric δ ∗ , such that the completion E of
E with respect to δ∗ is a separable compact metric space (see, e.g., p. 72 in [83]). Conversely, a locally compact space is metrizable if and only if it has a countable base, in which case it is Polish (see, e.g., p. 225 in [10]). Because of this, and since, in the applications we consider, E = R or $\mathbb{R}^d$, we will let, unless otherwise specified, E denote a metrizable locally compact space, which, as follows from the previous arguments, is necessarily a Polish space. To work on locally compact spaces simplifies matters with respect to measure theory, since we may restrict our interest to Radon measures, by definition finite on compact subsets. We recall below the usual definitions and properties of vague (resp. weak) convergence in the space $M^+(E)$ of (non-negative) Radon measures (resp. $M_f^+(E)$ of (non-negative) finite Radon measures) on a metrizable separable locally compact space E. In words, $M^+(E)$ (resp. $M_f^+(E)$) consists of all non-negative measures µ defined on the Borel algebra $B_E$ of E, and such that µ(A) < ∞ for each relatively compact Borel subset $A \in B_E$ of E (resp. µ(E) < ∞). We denote by M(E) (resp. $M_f(E)$) the set of all signed (Radon) measures (resp. totally bounded signed (Radon) measures) on E, and keep in mind that µ ∈ M(E) (resp. $\mu \in M_f(E)$) iff there exist two non-negative measures $\mu^*_+ \in M^+(E)$ and $\mu^*_- \in M^+(E)$ (resp. $\mu^*_+ \in M_f^+(E)$ and $\mu^*_- \in M_f^+(E)$), such that
$$\mu = \mu^*_+ - \mu^*_-. \tag{1.35}$$
Given µ ∈ M(E), the choice of $\mu^*_+$ and $\mu^*_-$ in (1.35) is not unique. However, the Hahn-Jordan decomposition theorem (see, e.g., p. 173 in [172] and pp. 178–181 in [83]) allows one to choose these two measures by setting $\mu^*_\pm = \mu_\pm$, where µ+ and µ− are defined through a decomposition E = E− ∪ E+ of E into the union of two disjoint measurable subsets E− and E+, such that, for each relatively compact $A \in B_E$,
$$\mu_+(A) = \mu(A \cap E_+) \quad \text{and} \quad \mu_-(A) = -\mu(A \cap E_-), \qquad E_- \cap E_+ = \emptyset. \tag{1.36}$$
The property (1.36) characterizes orthogonality of µ+ and µ− (denoted hereafter by µ+ ⊥ µ−). The measures µ± in (1.36) are uniquely defined for each $A \in B_E$, via the relations
$$\mu_+(A) = \sup_{B \subseteq A,\ B \in B_E} \mu(B \cap E_+) \quad \text{and} \quad \mu_-(A) = \sup_{B \subseteq A,\ B \in B_E} \{-\mu(B \cap E_-)\}. \tag{1.37}$$
In view of (1.37), we may define the total variation measure $|\mu| \in M_f(E)$ of $\mu \in M_f(E)$, via
$$|\mu|(A) = \mu_+(A) + \mu_-(A) = \sup_{B \subseteq A,\ B \in B_E} \mu(B) \quad \text{for each } A \in B_E. \tag{1.38}$$
In the special case where E = R, we may observe that the condition that $\mu \in M_f(\mathbb{R})$ is equivalent to the condition that the distribution function $H_\mu(x) = \mu((-\infty, x])$ of µ is of bounded variation on R, with $H_\mu(-\infty) := \lim_{x \downarrow -\infty} H_\mu(x) = 0$, and with total variation on R equal to $|dH_\mu|(\mathbb{R}) = |\mu|(\mathbb{R})$. In the sequel, the set of functions f of bounded variation on a sub-interval I of R will be denoted by BV(I).
Definition 1.1. A family of measures µα ∈ M(E), indexed by an oriented net A, is said to be vaguely convergent to µ ∈ M(E) (resp. a net-oriented family, $\mu_\alpha \in M_f(E)$, α ∈ A, is said to be weakly convergent to $\mu \in M_f(E)$), when
$$\int_E f\,d\mu_\alpha \to \int_E f\,d\mu \tag{1.39}$$
for all f, continuous on E with compact support (resp. continuous and bounded on E).
Vague convergence of $\mu_\alpha \in M^+(E)$ to $\mu \in M^+(E)$ is equivalent to the property that
$$\mu_\alpha(K) \to \mu(K), \tag{1.40}$$
for each compact subset K of E, with boundary ∂K such that µ(∂K) = 0. It is noteworthy (see, e.g., p. 235 in [10]) that weak convergence of $\mu_\alpha \in M_f^+(E)$ to $\mu \in M_f^+(E)$ is equivalent to vague convergence, with the additional condition that
$$\mu_\alpha(E) \to \mu(E). \tag{1.41}$$
Remark 1.1. The weak convergence of probability measures on a separable metric space can be defined by the Prohorov metric (see, e.g., pp. 393–399 in [82]). When E is locally compact, the vague topology on $M^+(E)$ is metrizable iff E has a countable base, or, equivalently, iff E is Polish (see, e.g., p. 243 in [10]).
Remark 1.2. Neither of the nice characterizations, (1.40) and (1.41), of vague and weak convergence holds when $M^+(E)$ and $M_f^+(E)$ are replaced, respectively, by M(E) and $M_f(E)$. The vague and weak topologies are rather nasty to handle for signed measures, since they are not metrizable on M(E). This (far from obvious) fact can be shown through the following arguments. First, the space $C_b(E)$ of bounded continuous functions on E, endowed with the sup-norm, is a Banach space. Its topological dual $C_b^*(E)$, endowed with the weak∗ topology, coincides with the set $M_f(E)$ of finite signed measures on E, endowed with the weak topology (see, e.g., p. 16 in [19]). Second, it turns out (see, e.g., p. 68 in [173]) that the topological dual X∗ of an infinite-dimensional Banach space X is not metrizable (and such is the case, therefore, in general for $(M_f(E), W)$). On the other hand, it can be shown (see, e.g., p. 68 in [173]), as a consequence of the Banach-Alaoglu theorem, that any weak∗-compact subset of X∗ is metrizable. This last fact underlies some of the results of the forthcoming §1.3.4.
Because of the mathematical difficulties underlined in Remark 1.2, in the remainder of this section, we will limit ourselves to the study of vague and weak convergence on the spaces of non-negative measures $M^+(E)$ and $M_f^+(E)$. In §§1.3.3–1.3.4, we will discuss again the case of signed measures, when E = [0, 1]. A subset $S \subseteq M^+(E)$ is vaguely relatively compact if and only if (see, e.g., p. 241 in [10])
$$\sup_{\mu \in S} \int f\,d\mu < \infty \quad \Bigl(\text{resp. } \sup_{\mu \in S} \mu(K) < \infty\Bigr), \tag{1.42}$$
for each bounded function f on E with compact support (resp. compact subset K of E). For the set P(E) of probability measures on E, (1.41) is automatic, so that vague and weak convergence are equivalent, for a sequence µn ∈ P(E), when the limit µ is in P(E). However, the set P(E) is not closed in $M^+(E)$ with respect to the vague topology, so that (1.42) is not sufficient to ensure weak relative compactness of a subset S of P(E). A useful characterization of this property is given by Prohorov's theorem (refer to p. 37 in [19]), which shows that weak relative compactness of S ⊆ P(E) is equivalent to the tightness of S, namely, to the condition that, for each ε > 0, there exists a compact subset Kε of E such that
$$\inf_{\mu \in S} \mu(K_\varepsilon) \ge 1 - \varepsilon. \tag{1.43}$$
In general (see, e.g., pp. 237–238 in [10]), weak convergence of $\mu_n \in M_f^+(E)$ to $\mu \in M_f^+(E)$ is equivalent to the convergence of µn(Q) to µ(Q) for each $Q \in B_E$ with boundary ∂Q of null µ measure. For the probability distributions on E = R, considered in §1.3.2, this last condition is equivalent to the pointwise convergence (1.45) of df's at each continuity point of the limiting df. The same property holds when $E = \mathbb{R}^d$ for an arbitrary d ≥ 1.
1.3.2. Weak convergence and the Lévy metric. We consider briefly in the present section the topology, denoted hereafter by W, of weak convergence on the space P(R) of probability measures on R, identified to the space $\mathcal{F}(\mathbb{R})$ of the corresponding (right-continuous) distribution functions. We refer to §1.3.1 for basic definitions, to [19] for more details on weak convergence of probability measures, and to §1.3.3 (resp. §1.3.4) below, for a study of this topology on the set of bounded non-negative (resp. signed) measures on [0, 1]. One important feature of the topological space (P(R), W) is that it is conveniently metricized by the Lévy metric (or distance), defined as follows. Consider two, possibly unbounded, nondecreasing functions H1 and H2 on [A, B] ⊆ R. Extend the definition of these functions to R, by setting, for i = 1, 2,
$$\widetilde{H}_i(x) = \begin{cases} H_i(A) & \text{for } x \le A, \\ H_i(B) & \text{for } x \ge B. \end{cases}$$
Given this notation, the Lévy distance between H1 and H2 is defined by
$$\delta_L(H_1, H_2) = \begin{cases} \inf\bigl\{\theta \ge 0 : \widetilde{H}_1(x - \theta) - \theta \le \widetilde{H}_2(x) \le \widetilde{H}_1(x + \theta) + \theta \ \ \forall x \in \mathbb{R}\bigr\} & \text{whenever such a } \theta \ge 0 \text{ exists,} \\ \infty & \text{otherwise.} \end{cases} \tag{1.44}$$
When H1 and H2 are bounded, (1.44) implies that δL(H1, H2) < ∞. It is readily checked that δL defines a metric on the set $\mathcal{F}(\mathbb{R})$ of all probability df's on R. Denoting, as usual, by CF the set of continuity points of F, a necessary and sufficient condition for a sequence {Fn : n ≥ 1} of df's to converge weakly to the
limiting df F is that (see, e.g., p. 18 in [19])
$$F_n \overset{W}{\to} F \;\Leftrightarrow\; \delta_L(F_n, F) \to 0 \;\Leftrightarrow\; \lim_{n \to \infty} F_n(x) = F(x) \quad \forall x \in C_F. \tag{1.45}$$
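The definition (1.44) lends itself to a direct numerical approximation for probability df's: admissibility of θ is monotone in θ, and θ = 1 is always admissible, so a bisection over [0, 1] locates the infimum. The Python sketch below is a rough illustration under stated assumptions: the two-sided inequality is only checked on a finite grid (so the result is an approximation from below, up to the grid step), and the function name `levy_distance`, the grid bounds and the tolerance are choices made here, not taken from the notes.

```python
def levy_distance(F1, F2, lo=-5.0, hi=5.0, n_grid=2001, tol=1e-6):
    """Grid-based approximation of the Levy distance between two probability df's."""
    xs = [lo + (hi - lo) * k / (n_grid - 1) for k in range(n_grid)]

    def admissible(theta):
        # The two-sided inequality of (1.44), checked on the grid only.
        return all(
            F1(x - theta) - theta <= F2(x) <= F1(x + theta) + theta
            for x in xs
        )

    if admissible(0.0):
        return 0.0
    a, b = 0.0, 1.0  # theta = 1 is always admissible for df's valued in [0, 1]
    while b - a > tol:
        mid = 0.5 * (a + b)
        if admissible(mid):
            b = mid
        else:
            a = mid
    return b


def dirac_df(point):
    """df of the unit mass at `point` (hypothetical helper)."""
    return lambda x: 1.0 if x >= point else 0.0


print(levy_distance(dirac_df(0.0), dirac_df(0.3)))  # close to 0.3
```

For two unit masses at distance a < 1, the Lévy distance equals a, which is what the sketch recovers up to the grid resolution.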
Consider now two quantile functions, G1 and G2, pertaining to the two distribution functions $F_1 \in \mathcal{F}$ and $F_2 \in \mathcal{F}$. One may extend the definition of G1 and G2 to R, by setting, for i = 1, 2,
$$\widetilde{G}_i(x) = \begin{cases} G_i(0) & \text{for } x \le 0, \\ G_i(1) & \text{for } x \ge 1. \end{cases}$$
Given this notation, one may define the Lévy distance δL(G1, G2) between G1 and G2, by the formal replacement of H1, H2 in (1.44) by $\widetilde{G}_1$, $\widetilde{G}_2$. We have the following useful proposition.
Proposition 1.2. When G1, G2 are the quantile functions of the distribution functions F1, F2, we have
$$\delta_L(F_1, F_2) = \delta_L(G_1, G_2). \tag{1.46}$$
Proof. Combine (1.32)–(1.33) with (1.45).
An interesting application of (1.45) and (1.46) is provided by the following characterization of weak convergence. Let $G_n = F_n^{\rightarrow}$ denote the quantile function of Fn for n = 1, 2, . . ., and let $G = F^{\rightarrow}$ denote the quantile function of F. Denote by CG ⊆ [0, 1] the set of all continuity points of G. Then,
$$F_n \overset{W}{\to} F \;\Leftrightarrow\; \delta_L(G_n, G) \to 0 \;\Leftrightarrow\; \lim_{n \to \infty} G_n(u) = G(u) \quad \forall u \in C_G. \tag{1.47}$$
1.3.3. Weak and uniform topologies for df's of non-negative measures on [0, 1]. We now specialize in the study of (non-negative bounded) Radon measures with support in a bounded closed interval of R. For convenience, and without loss of generality, we will limit ourselves to the case where this interval is E = [0, 1]. As follows from Definition 1.1, the vague and weak topologies coincide on M[0, 1]. Let $M_f^+[0, 1]$ denote the set of all bounded non-negative measures µ on R, with support in [0, 1], namely, fulfilling $\mu(\mathbb{R} - [0, 1]) = 0$ and µ(A) ≥ 0 for each $A \in B_{\mathbb{R}}$. We denote by I+[0, 1] (resp. I−[0, 1]) the set of all right-continuous (resp. left-continuous) distribution functions H(·+) (resp. H(·−)) of measures $\mu \in M_f^+[0, 1]$, of the form
$$H_\mu(x+) = \mu([0, x]) \quad \text{and} \quad H_\mu(x-) = \mu([0, x)) \quad \text{for } x \in \mathbb{R}. \tag{1.48}$$
We let I±[0, 1] denote either one of the sets I+[0, 1] or I−[0, 1]. The correspondence µ ↔ Hµ(·+) ↔ Hµ(·−) being one-to-one, we may endow I±[0, 1] with the weak topology W of convergence of measures in $M_f^+[0, 1]$. This topology is metricized by the Lévy metric δL(·, ·) defined via (1.44), and we let (I±[0, 1], W) denote the set I±[0, 1] endowed with the weak topology W. Likewise, we let (I±[0, 1], U) (resp. $(M_f^+[0, 1], U)$) denote the set I±[0, 1] (resp. $M_f^+[0, 1]$), endowed with the uniform
topology U. The latter topology is induced by the sup-norm distance, defined, for Hµ, Hν ∈ I±[0, 1], with $\mu, \nu \in M_f^+[0, 1]$, via
$$\delta_U(\mu, \nu) = \|H_\mu - H_\nu\| = \sup_{x \in \mathbb{R}} |H_\mu(x) - H_\nu(x)|. \tag{1.49}$$
In view of (1.42), for any (finite) constant C ≥ 0, the set $M^+_{f,C}[0, 1]$ of all $\mu \in M_f^+[0, 1]$ such that µ([0, 1]) = Hµ(1+) ≤ C is weakly compact. The corresponding set of distribution functions is
$$I^C_\pm[0, 1] := \{H \in I_\pm[0, 1] : H(1+) \le C\} = \{H_\mu(\cdot\pm) : \mu([0, 1]) \le C\}. \tag{1.50}$$
The Lebesgue decomposition of $\mu \in M_f^+[0, 1]$ shows that the df Hµ(·+) ∈ I+[0, 1] (resp. Hµ(·−) ∈ I−[0, 1]) of µ has a unique decomposition into the sum of two components, as follows. For all t ∈ R,
$$H_\mu(t-) = \int_{[0,t]} h(u)\,du + H_{\mu_S}(t-) \quad \Bigl(\text{resp. } H_\mu(t+) = \int_{[0,t]} h(u)\,du + H_{\mu_S}(t+)\Bigr). \tag{1.51}$$
Here, h is a non-negative function, Lebesgue-integrable on [0, 1], defined uniquely up to an almost everywhere [a.e.] equivalence, and $\mu_S := dH_{\mu_S}$ is a singular non-negative measure on [0, 1], with support of null Lebesgue measure, and such that
$$\mu([0, 1]) = H_\mu(1+) = \int_0^1 h(u)\,du + H_{\mu_S}(1+). \tag{1.52}$$
The function h in (1.51)–(1.52) is the Lebesgue derivative of H = Hµ, denoted hereafter by $\dot{H} = \frac{d}{dx} H$. By definition, Hµ is absolutely continuous (with respect to Lebesgue measure) iff $H_{\mu_S}(1+) = \mu_S([0, 1]) = 0$. In the sequel, the set of non-decreasing, absolutely continuous distribution functions Hµ of measures $\mu \in M_f^+[0, 1]$ will be denoted by $\mathrm{ACI}_0[0, 1]$. Naturally, we may combine the Lebesgue and Hahn-Jordan decompositions of a totally bounded signed measure $\mu \in M_f[0, 1]$, to write
$$H_\mu(t) = \mu([0, t]) = \int_{[0,t]} h(u)\,du + H_{\mu_S^+}(t) - H_{\mu_S^-}(t), \tag{1.53}$$
where $\mu_S^+ \perp \mu_S^-$ are singular non-negative measures on [0, 1], with supports of null Lebesgue measure, and $h = \dot{H}_\mu = \frac{d}{dx} H_\mu$ is the Lebesgue derivative of Hµ. The set of all absolutely continuous functions Hµ on [0, 1], fulfilling $H_{\mu_S^\pm}(1+) = 0$, will be denoted by $\mathrm{AC}_0[0, 1]$. A function $H \in \mathrm{AC}_0[0, 1]$ (resp. $\mathrm{ACI}_0[0, 1]$) is such that H(0) = 0. If we let H(0) be arbitrary, we obtain a general absolutely continuous (resp. absolutely continuous nondecreasing) function on [0, 1]. The corresponding sets of functions will be denoted by AC[0, 1] (resp. ACI[0, 1]).
1.3.4. Weak convergence of signed measures on [0, 1]. We now concentrate on the study of the weak topology W, on the space Mf [0, 1] of totally bounded signed measures µ on R, with support in [0, 1]. Recalling the Hahn-Jordan decomposition
µ = µ+ − µ− of µ in (1.36), we see that $\mu \in M_f[0, 1]$ and $|\mu| = \mu_+ + \mu_- \in M_f^+[0, 1]$ fulfill, in view of (1.38),
$$|\mu|([0, 1]) = \mu_+([0, 1]) + \mu_-([0, 1]) < \infty \quad \text{and} \quad \operatorname{supp} \mu := \operatorname{supp} |\mu| \subseteq [0, 1]. \tag{1.54}$$
The right-continuous distribution function [df] $H_\mu(x) = \mu([0, x])$ of $\mu \in M_f[0, 1]$ fulfills Hµ(0−) = 0, and is of bounded variation $|dH_\mu|(\mathbb{R}) = |\mu|([0, 1]) < \infty$ on R. Conversely, via the Lebesgue-Stieltjes integral, any right-continuous function H of bounded variation on R, with H(x) = 0 for x < 0, and H(x) = H(1) for x ≥ 1, is the df of some $\mu \in M_f[0, 1]$. Here, and elsewhere, we set
$$\mathrm{BV}_0[0, 1] = \{H_\mu : \mu \in M_f[0, 1]\}, \tag{1.55}$$
and, for an arbitrary C ≥ 0,
$$M_{f,C}[0, 1] = \{\mu \in M_f[0, 1] : |\mu|([0, 1]) \le C\} \quad \text{and} \quad \mathrm{BV}_{0,C}[0, 1] = \{H_\mu : \mu \in M_{f,C}[0, 1]\}. \tag{1.56}$$
Introduce now the metric (see, e.g., Högnäs [113]), defined, for $\mu, \nu \in M_f[0, 1]$, by
$$\delta_H(\mu, \nu) = \delta_H(H_\mu, H_\nu) = \int_0^1 |H_\mu(t) - H_\nu(t)|\,dt + |H_\mu(1) - H_\nu(1)|. \tag{1.57}$$
As mentioned in Remark 1.2, the weak topology on M(R) is not metrizable. For this reason, one should discuss the weak convergence of µα ∈ M[0, 1] to µ ∈ M[0, 1], along an oriented net A, rather than on a sequence. We have the following characterization of weak convergence in $M_f[0, 1]$, due to Högnäs [113].
Proposition 1.3. A net-oriented family $\mu_\alpha \in M_f[0, 1]$, with α ∈ A, of totally bounded signed measures on [0, 1] is weakly convergent to $\mu \in M_f[0, 1]$, if and only if:
1°) There exists a constant 0 < C < ∞ such that, ultimately along the net A, $\mu_\alpha \in M_{f,C}[0, 1]$;
2°) We have, along the net A, $\delta_H(\mu_\alpha, \mu) \to 0$.
Proof. See, e.g., Högnäs [113].
A direct consequence of Proposition 1.3 is that, for each C ∈ [0, ∞), the restriction of the weak topology W to $M_{f,C}[0, 1]$ is metrizable. This is captured in the following corollary of Proposition 1.3.
Corollary 1.1. For each 0 < C < ∞, the metric $\delta_H(\cdot, \cdot)$ endows $M_{f,C}[0, 1]$ with the weak topology.
We infer readily from the above arguments the following proposition.
Proposition 1.4. For each 0 < C < ∞, the set $M_{f,C}[0, 1]$ is weakly compact.
Proof. Consider any sequence $\{\mu_n : n \ge 1\} \subseteq M_{f,C}[0, 1]$ such that $\delta_H(\mu_m, \mu_n) \to 0$ as m ∧ n → ∞. For each n ≥ 1, denote by $\mu_n = \mu_n^+ - \mu_n^-$ the Hahn-Jordan decomposition of µn. Observe, via (1.38), that, for each n ≥ 1, $\mu_n^\pm \in M^+_{f,C}[0, 1]$.
By (1.42), $M^+_{f,C}[0, 1]$ is a weakly compact metrizable space. Therefore, for each increasing sequence {nj : j ≥ 1} of positive integers, there exists an increasing subsequence, along which $\mu_n^\pm \to_W \mu^\pm$, for some µ+ and µ− in $M^+_{f,C}[0, 1]$. This, in turn, entails that, along this subsequence, $\delta_H(\mu_n^\pm, \mu^\pm) \to 0$, and hence, $\delta_H(\mu_n, \mu) \to 0$, where µ := µ+ − µ−. Since the so-defined µ is necessarily unique, we infer from this argument that $(M_{f,C}[0, 1], \delta_H)$ is a complete metric space. The same argument shows readily that $M_{f,C}[0, 1]$ is sequentially compact, and therefore compact, with respect to W.
Remark 1.3. As mentioned above in §1.3.1, the fact that $M_{f,C}[0, 1]$ is metrizable with respect to the weak topology W is a general consequence of the Banach-Alaoglu theorem (refer to pp. 67–68 in [173]). The Högnäs metric (1.57) provides here a convenient example of metric endowing $M_{f,C}[0, 1]$ with W.
1.3.5. Compact sets based upon rate functions. We will consider below a series of examples of compact subsets of $M_f[0, 1]$, endowed either with the weak topology W, or with the uniform topology U, of the corresponding distribution functions. Throughout, we will identify $\mu \in M_f[0, 1]$ with its distribution function $H_\mu(t) = \mu([0, t])$, so that the sets we will consider will be defined equivalently, in terms of measures, or in terms of functions. In particular, the uniform topology U will be defined on $M_f[0, 1]$, by setting, for $\mu, \nu \in M_f[0, 1]$,
$$d_U(\mu, \nu) = \|H_\mu - H_\nu\| = \sup_{t \in \mathbb{R}} |H_\mu(t) - H_\nu(t)|. \tag{1.58}$$
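The contrast between the metric of (1.57) and the sup-norm distance (1.58) can be seen on unit masses: the mass at 1/2 + 1/n converges weakly to the mass at 1/2, while the corresponding df's stay at uniform distance 1. The Python sketch below computes both distances exactly for discrete signed measures with atoms in [0, 1]; the representation of a measure as an `(atoms, weights)` pair and all function names are assumptions introduced for this illustration.

```python
def df_of(atoms, weights):
    """Right-continuous df t -> mu([0, t]) of a discrete signed measure on [0, 1]."""
    def H(t):
        return sum(w for a, w in zip(atoms, weights) if a <= t)
    return H


def _knots(mu, nu):
    # Points of [0, 1] where either df may jump, plus the end-points.
    return sorted({0.0, 1.0, *mu[0], *nu[0]})


def hognas_distance(mu, nu):
    """The metric of (1.57), for measures given as (atoms, weights) pairs."""
    H_mu, H_nu = df_of(*mu), df_of(*nu)
    knots = _knots(mu, nu)
    # Both df's are constant on [s, t) between consecutive knots.
    integral = sum(abs(H_mu(s) - H_nu(s)) * (t - s)
                   for s, t in zip(knots, knots[1:]))
    return integral + abs(H_mu(1.0) - H_nu(1.0))


def uniform_distance(mu, nu):
    """The sup-norm distance of (1.58) between the two df's."""
    H_mu, H_nu = df_of(*mu), df_of(*nu)
    return max(abs(H_mu(s) - H_nu(s)) for s in _knots(mu, nu))


mu = ([0.5], [1.0])
for n in (10, 100, 1000):
    mu_n = ([0.5 + 1.0 / n], [1.0])
    print(n, hognas_distance(mu_n, mu), uniform_distance(mu_n, mu))
```

In this example the first distance behaves like 1/n, whereas the second one remains equal to 1 for every n.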
The compact sets we shall consider will be defined through rate functions, which, in the applications discussed later on, will be chosen as Chernoff functions, of the form (2.8). However, in the present section, this restriction is not necessary, and we will work in a more general setup. By rate function is meant a non-negative convex (possibly infinite) function {Ψ(α) : α ∈ R}, with the following properties, holding for some specified constant m ∈ R:
(Ψ.1) 0 ≤ Ψ(α) ≤ ∞;
(Ψ.2) Ψ(m) = 0 and Ψ is convex on R.
We set further, for Ψ fulfilling (Ψ.1–2),
$$-\infty \le t_1 := \lim_{\alpha \to -\infty} \frac{\Psi(\alpha)}{\alpha} \le 0 \le t_0 := \lim_{\alpha \to \infty} \frac{\Psi(\alpha)}{\alpha} \le \infty. \tag{1.59}$$
Example 1.2. The following three examples of rate functions deserve special attention.
1°) Ψ(α) = α², with m = 0, t1 = −∞ and t0 = ∞;
2°) Ψ(α) = h(α), with m = 1, t1 = −∞ and t0 = ∞, where h is defined by
$$h(\alpha) = \begin{cases} \alpha \log \alpha - \alpha + 1 & \text{for } \alpha > 0; \\ 1 & \text{for } \alpha = 0; \\ \infty & \text{for } \alpha < 0. \end{cases} \tag{1.60}$$
3°) Ψ(α) = ℓ(α), with m = 1, t1 = −∞ and t0 = 1, where ℓ is defined by
$$\ell(\alpha) = \begin{cases} \alpha - \log \alpha - 1 & \text{for } \alpha > 0; \\ \infty & \text{for } \alpha \le 0. \end{cases} \tag{1.61}$$
The following theorems will have useful consequences. For each H ∈ AC[0, 1] (the set of absolutely continuous functions on [0, 1], see §1.3.3), we denote by $\dot{H}(t) = \frac{d}{dt} H(t)$ the Lebesgue derivative of H. By (AC[0, 1], U) is meant the set AC[0, 1] endowed with the uniform topology U. The set $\mathrm{BV}_0[0, 1]$ collects all distribution functions Hµ of totally bounded signed measures $\mu \in M_f[0, 1]$ with support in [0, 1].
Theorem 1.1. Let Ψ, fulfilling (Ψ.1–2), be such that t1 = −∞ and t0 = ∞. Introduce the set
$$\Delta_{\Psi,c} = \Bigl\{H \in \mathrm{AC}[0, 1] : H(0) = 0 \text{ and } \int_0^1 c\,\Psi\bigl(c^{-1} \dot{H}(u)\bigr)\,du \le 1\Bigr\}. \tag{1.62}$$
Then ∆Ψ,c is a compact subset of (AC[0, 1], U).
Proof. We limit ourselves to c = 1, and set ∆Ψ = ∆Ψ,1. By the Arzelà-Ascoli theorem (see, e.g., p. 369 in [173]), it is enough to show that ∆Ψ is closed, uniformly equi-continuous and bounded. To establish this property, we fix an arbitrary ε > 0. The assumption that Ψ(α)/|α| → ∞ as |α| → ∞ ensures the existence of an Mε such that Ψ(α) ≥ (1/ε)|α| for all |α| ≥ Mε. Now, this ensures that, for each 0 ≤ a < b ≤ 1 and H ∈ ∆Ψ,
$$\frac{|H(b) - H(a)|}{b - a} \ge M_\varepsilon \;\Rightarrow\; \frac{|H(b) - H(a)|}{b - a} \le \varepsilon\,\Psi\Bigl(\frac{H(b) - H(a)}{b - a}\Bigr) \le \frac{\varepsilon}{b - a} \int_a^b \Psi(\dot{H}(u))\,du \le \frac{\varepsilon}{b - a},$$
where we have used the convexity of Ψ. Thus, we have |H(b) − H(a)| ≤ ε ∨ {Mε(b − a)}, whence |b − a| ≤ ε/Mε ⇒ |H(b) − H(a)| ≤ ε. This establishes the uniform equicontinuity of ∆Ψ, the boundedness of ∆Ψ following trivially, in turn, from the fact that H(0) = 0 for all H ∈ ∆Ψ. Finally, the fact that ∆Ψ is closed follows from the observation (see, e.g., Lynch and Sethuraman [144]) that the mapping
$$H \in \mathrm{AC}[0, 1] \mapsto \int_0^1 \Psi(\dot{H}(u))\,du \tag{1.63}$$
is lower semi-continuous.
Theorem 1.2. Let Ψ, fulfilling (Ψ.1–2), be such that Ψ(α) = ∞ for α ≤ 0, t1 = −∞ and t0 < ∞. For each c > 0, introduce the set
$$\Delta_{\Psi,c} = \Bigl\{H \in I_\pm[0, 1] : H(0) = 0,\ \int_0^1 c\,\Psi\bigl(c^{-1} \dot{H}(u)\bigr)\,du + t_0 H_S(1+) \le 1\Bigr\}. \tag{1.64}$$
Then ∆Ψ,c is a compact subset of (I±[0, 1], W).
Proof. See, e.g., Lynch and Sethuraman [144]. Example 1.3.
1°) The Strassen set of functions (see, e.g., Strassen [199] and (2.34) in the sequel), defined by
$$S=\Big\{f\in AC[0,1]:\ f(0)=0\ \text{and}\ \int_0^1\dot f(u)^2\,du\le1\Big\},\tag{1.65}$$
is a compact subset of the set $(C[0,1],\mathcal U)$ of continuous functions on $[0,1]$, endowed with the uniform topology $\mathcal U$. This follows from Theorem 1.1, taken with $\Psi(u)=u^2$. We note further that $\Psi_X(u)=\frac12u^2$ is the Chernoff function, (2.8), of a rv $X$ following a standard normal $N(0,1)$ law (see, e.g., §2.3.3).

2°) The Finkelstein set of functions (refer to Finkelstein (1971)) defined by
$$F=\Big\{f\in AC[0,1]:\ f(0)=f(1)=0\ \text{and}\ \int_0^1\dot f(u)^2\,du\le1\Big\},\tag{1.66}$$
is a compact subset of the set $(C[0,1],\mathcal U)$ of continuous functions on $[0,1]$, endowed with the uniform topology. This follows from 1°) and the observation that $F=\{f\in S:\ f(1)=0\}$ is closed in $(C[0,1],\mathcal U)$.

3°) The set of functions (refer to Deheuvels and Mason (1990, 1992)) defined for $c > 0$ by
$$\Delta_c=\Big\{f\in AC[0,1]:\ f(0)=0\ \text{and}\ \int_0^1 c\,h\big(c^{-1}\dot f(u)\big)\,du\le1\Big\},\tag{1.67}$$
is a compact subset of the set $(C[0,1],\mathcal U)$ of continuous functions on $[0,1]$, endowed with the uniform topology $\mathcal U$. This follows from Theorem 1.1, taken with $\Psi(u)=h(u)$, which is nothing else but the Chernoff function, (2.8), of a Poisson rv with expectation equal to 1 (see, e.g., (2.60)).

4°) The set of functions (refer to Deheuvels and Mason (1990, 1992)) defined for $c > 0$ by
$$\Gamma_c=\Big\{f\in I_{\pm}[0,1]:\ f(0)=0\ \text{and}\ \int_0^1 c\,\ell\big(c^{-1}\dot f(u)\big)\,du+f_S(1+)\le1\Big\},\tag{1.68}$$
is a compact subset of the set $(I_{\pm}[0,1],\mathcal W)$ of non-decreasing non-negative functions on $[0,1]$, endowed with the weak topology. This follows from Theorem 1.2, taken with $\Psi(u)=\ell(u)$. This last function is the Chernoff function, (2.8), of an exponentially distributed rv with mean 1.
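The membership conditions in (1.65), (1.67) and (1.68) can be checked numerically for piecewise-linear functions. The sketch below is illustrative only: it assumes the standard closed forms $h(u)=u\log u-u+1$ and $\ell(u)=u-1-\log u$ for the Poisson(1) and Exp(1) Chernoff functions (the definitions (1.60)–(1.61) are not reproduced here), and the helper names `rate_integral` and `in_ball` are hypothetical.

```python
import math


def h(u):
    # Assumed form of h in (1.60): Chernoff function of a Poisson(1) rv.
    if u < 0:
        return math.inf
    if u == 0:
        return 1.0
    return u * math.log(u) - u + 1.0


def ell(u):
    # Assumed form of ell in (1.61): Chernoff function of an Exp(1) rv.
    if u <= 0:
        return math.inf
    return u - 1.0 - math.log(u)


def rate_integral(values, psi, c=1.0):
    """Integral over [0, 1] of c * psi(f'(u) / c), where f is the
    piecewise-linear interpolant of `values` on a uniform grid."""
    n = len(values) - 1
    dt = 1.0 / n
    total = 0.0
    for left, right in zip(values, values[1:]):
        slope = (right - left) / dt
        total += c * psi(slope / c) * dt
    return total


def in_ball(values, psi, c=1.0, tol=1e-9):
    # A piecewise-linear f is absolutely continuous, so the singular
    # term f_S(1+) in (1.68) vanishes and only the integral matters.
    return values[0] == 0 and rate_integral(values, psi, c) <= 1.0 + tol


grid = [i / 100 for i in range(101)]
identity = grid                      # f(u) = u
double = [2 * v for v in grid]       # f(u) = 2u
```

With these inputs, `in_ball(identity, lambda x: x * x)` holds (the identity sits on the boundary of the Strassen set), whereas `double` fails the quadratic constraint but satisfies both the `h` and `ell` constraints with $c=1$.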
P. Deheuvels
The next theorem extends Theorems 1.1 and 1.2 to the case where $t_0$ and $t_1$ are arbitrary.

Theorem 1.3. Let $\Psi$, fulfilling ($\Psi$.1–2), be such that $-\infty\le t_1\le t_0\le\infty$. Recall (1.53), and set $BV_0[0,1]=\{H_\mu:\ \mu\in M_f[0,1]\}$. For each $c > 0$, introduce the set
$$\Delta_{\Psi,c}=\Big\{H_\mu\in BV_0[0,1]:\ \int_0^1 c\,\Psi\big(c^{-1}\dot H_\mu(u)\big)\,du+t_0H_{\mu^{+},S}(1+)-t_1H_{\mu^{-},S}(1+)\le1\Big\}.\tag{1.69}$$
Then $\Delta_{\Psi,c}$ is a compact subset of $(BV_0[0,1],\mathcal W)$.
Proof. See, e.g., Deheuvels [51].
In the forthcoming §2.1, we will give some examples of rate functions based upon large deviation theory.

Exercise 1.3.
1°) Let $h(\cdot)$ and $\ell(\cdot)$ be as in (1.60)–(1.61). Check the relations
$$\alpha\,h(1/\alpha)=\ell(\alpha)\quad\text{and}\quad\alpha\,\ell(1/\alpha)=h(\alpha)\quad\text{for}\quad\alpha > 0.$$
2°) Recall the results of Exercise 1.1. Let $\mathcal F$ denote the set of df's $F(x)$ of rv's $X$, continuous and with support in $[0,1]$, having density $f(x)=\frac{d}{dx}F(x)$ positive on $[0,1]$. Let likewise $\mathcal G$ denote the set of qf's of random variables, with density quantile function $g(u)=\frac{d}{du}G(u)$ continuous and positive on $[0,1]$. Show that the mapping $F\in\mathcal F\to G=F^{\to}\in\mathcal G$ defines a one-to-one mapping of $\mathcal F$ onto $\mathcal G$, continuous with respect to the weak topology $\mathcal W$. Show likewise that the mapping $G\in\mathcal G\to F=G^{\leftarrow}\in\mathcal F$ defines a one-to-one mapping of $\mathcal G$ onto $\mathcal F$, continuous with respect to the weak topology $\mathcal W$. Check that $\{F^{\to}\}^{\leftarrow}=F$ and $\{G^{\leftarrow}\}^{\to}=G$, for $F\in\mathcal F$ and $G\in\mathcal G$.
3°) Show that the sets $\mathcal F\cap\Delta_1$ and $\mathcal G\cap\Gamma_1$ are homeomorphic (for additional details, see, e.g., [65, 66]).

1.4. The quantile transform

1.4.1. The univariate quantile transform.

Theorem 1.4. For any random variable $X\in\mathbb R$, with distribution function $F$ and quantile function $G$, the following properties hold.
(i) Whenever $F$ is continuous, the random variable $U:=F(X)$ is uniformly distributed on $(0,1)$;
(ii) When $F$ is arbitrary, and $V$ is a uniform $(0,1)$ rv, one has the distributional identity
$$X\overset{d}{=}G(V).\tag{1.70}$$
Proof. The implications $F(X) < F(x)\Rightarrow X < x$ and $X\le x\Rightarrow F(X)\le F(x)$ entail that
$$P(F(X) < F(x))\le P(X < x)=F(x-)\le F(x)=P(X\le x)\le P(F(X)\le F(x)).\tag{1.71}$$
Topics on Empirical Processes
When $F$ is continuous, $u=F(x)$ reaches all possible values in $(0,1)$ when $x$ varies in $(\omega_-,\omega_+)$. Thus, by (1.71), the df's, $H(u):=P(F(X)\le u)$ and $H(u-):=P(F(X) < u)$, of $F(X)$ fulfill, for all $u\in(0,1)$, $H(u-)\le u\le H(u)$. This, in turn, readily implies that $H(u)=u$ for all $u\in C_H$, the set of continuity points of $H$. Since this, in turn, implies that $H(u)=u$ for all $u\in[0,1]$, we obtain readily Assertion (i) of the theorem.

To establish Assertion (ii), we define a random variable $Y\overset{d}{=}N(0,1)$, following a standard normal law, and independent of $X$ (in fact, any rv with a continuous and positive density on $\mathbb R$ will do). For each $n\ge1$, we observe that the rv $X_n:=X+n^{-1}Y\in\mathbb R$ has a density $f_n$, continuous and positive on $\mathbb R$. This, in turn, implies that the restriction of the df $F_n$ of $X_n$ to $\mathbb R$, and the restriction of the qf $G_n$ of $X_n$ to $(0,1)$, are continuous and increasing. By the just-proven Assertion (i), this implies that, for each $n\ge1$, $U_n:=F_n(X_n)$ is uniformly distributed on $(0,1)$. Moreover, an application of (1.34) in Proposition 1.1 shows that $X_n=G_n(F_n(X_n))=G_n(U_n)$ for each $n\ge1$. Now, as $n\to\infty$, $X_n=X+n^{-1}Y\to X$, whence $F_n\overset{W}{\to}F$ (recall that almost sure [a.s.] convergence implies convergence in probability, and hence, weak convergence of the underlying probability measures). By (1.47), this implies, in turn, that $G_n(u)\to G(u)$ for each $u\in C_G$, the set of continuity points of $G$.

We next observe that $X_n\overset{d}{=}G_n(U)$, where $U$ is a fixed random variable, uniformly distributed on $(0,1)$. Since $D_G:=\mathbb R-C_G$ is, at most, countable, we have $P(U\in C_G)=1$. By all this, it follows that, as $n\to\infty$, $X_n\overset{d}{=}G_n(U)\to G(U)$, and hence, $X_n\overset{W}{\to}G(U)$. Since $X_n\to X$ a.s., and hence, $X_n\overset{W}{\to}X$, as $n\to\infty$, it follows that $G(U)\overset{d}{=}X$, which was to be proved.
The quantile transform Theorem 1.4 plays an essential role in empirical process theory. By this theorem, it is always possible without loss of generality [w.l.o.g.] to define an arbitrary sequence $X_1,X_2,\ldots$ of independent and identically distributed [iid] random variables [rv's], with df $F(\cdot)$ and qf $G(\cdot)$, on a probability space $(\Omega,\mathcal A,P)$, carrying an iid sequence $U_1,U_2,\ldots$ of uniform $(0,1)$ rv's, such that
$$X_n=G(U_n)\quad\text{for each}\quad n\ge1.\tag{1.72}$$
Because of (1.72), it is, most of the time, more convenient to work directly on the sequence $U_1,U_2,\ldots$, rather than on $X_1,X_2,\ldots$, given that any result which can be obtained in this case has a counterpart for $X_1,X_2,\ldots$, which often follows by book-keeping arguments based upon the mapping $u\to G(u)$. A drawback of Theorem 1.4 is that it is specific to dimension 1, being only valid for real-valued random variables. In the next section, we show that this result can nevertheless be extended to $\mathbb R^d$-valued random vectors, at the price of some additional technicalities.
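The representation $X_n=G(U_n)$ in (1.72) is what is usually called inverse-transform sampling. The following sketch is illustrative and not part of the original text: `quantile_function` builds $G(u)=\inf\{x:F(x)\ge u\}$ for a discrete law on a sorted finite support, and `exp_quantile` is the closed-form qf of the Exp(1) law; both names are hypothetical.

```python
import bisect
import math
import random


def quantile_function(support, probs):
    """Return G(u) = inf{x : F(x) >= u} for a discrete law that puts
    mass probs[i] on support[i], with support sorted increasingly."""
    cumulative = []
    running = 0.0
    for p in probs:
        running += p
        cumulative.append(running)

    def G(u):
        # bisect_left returns the first index i with cumulative[i] >= u.
        i = bisect.bisect_left(cumulative, u)
        return support[min(i, len(support) - 1)]

    return G


def exp_quantile(u):
    # Quantile function of the Exp(1) law, F(x) = 1 - exp(-x).
    return -math.log(1.0 - u)


# Bernoulli(1/4): F(0) = 3/4, so G(u) = 0 for u <= 3/4 and G(u) = 1 above.
G = quantile_function([0, 1], [0.75, 0.25])
random.seed(0)
sample = [G(random.random()) for _ in range(20000)]
frequency_of_ones = sum(sample) / len(sample)
```

The discrete case shows why Assertion (ii) needs the generalized inverse: $F$ has a jump, so $F(X)$ is not uniform, yet $G(U)$ still has the right law.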
1.4.2. The multivariate quantile transform.

First, we need to recall some classical (but far from obvious) properties of conditional distributions. Recall (see, e.g., p. 209 in [10]) that a set $E$ is a Polish space when there is a complete metric defining its topology, with a countable dense subset in $E$. We will be concerned here with the case of $\mathbb R^d$, which is Polish with respect to the usual topology. Consider now two $E$-valued rv's, $X$ and $Y$, defined on the same probability space $(\Omega,\mathcal A,P)$. Denote by $\mathcal B_E$ the set of all Borel subsets of $E$ (the smallest σ-algebra generated by the open (and closed) subsets of $E$). We assume that $X$ and $Y$ are measurable in the sense that the events $X^{-1}(A)=\{X\in A\}$ and $Y^{-1}(A)=\{Y\in A\}$ belong to $\mathcal A$ for all $A\in\mathcal B_E$. We denote by $\mathcal A_Y$ the sub-σ-algebra of $\mathcal A$ generated by the sets $Y^{-1}(A)$, for $A\in\mathcal B_E$. In words, $\mathcal A_Y$ is the smallest sub-σ-algebra of $\mathcal A$ such that $Y^{-1}(A)=\{Y\in A\}=\{\omega\in\Omega:\ Y(\omega)\in A\}\in\mathcal A_Y$ for all $A\in\mathcal B_E$. Then, it is possible to define a regular conditional probability $P(X\in\cdot\mid Y=y)$, with the following properties.
(i) For each $y\in E$, $A\in\mathcal B_E\to P(X\in A\mid Y=y)$ defines a probability measure on $(E,\mathcal B_E)$;
(ii) For each $A\in\mathcal B_E$, the mapping $\omega\in\Omega\to y=Y(\omega)\in E\to P(X\in A\mid Y=y)$ is $\mathcal A_Y$-measurable;
(iii) The conditional probability $P(X\in A\mid Y=y)$, of $A\in\mathcal B_E$ given $y\in E$, is defined uniquely, with respect to $y$, up to an a.e. equivalence over a set $B\in\mathcal B_E$ such that $P(Y\in B)=0$;
(iv) For each $A\in\mathcal B_E$ and $B\in\mathcal B_E$,
$$\int_BP(X\in A\mid Y=y)\,dP(Y=y)=P(X\in A,\ Y\in B).\tag{1.73}$$
Defining the support, denoted by supp$(Y)$, of $Y$ as the set of all $y\in E$ such that $P(Y\in V_y) > 0$ for all open neighborhoods $V_y$ of $y$, we see that the definition of $P(X\in A\mid Y=y)$ is meaningful only for $y\in$ supp$(Y)$. When $y\notin$ supp$(Y)$, the conditional probability $P(X\in\cdot\mid Y=y)$ is, therefore, arbitrary.

Let $X=(X_1,\ldots,X_d)\in\mathbb R^d$ be a multivariate random vector. The distribution, $P_X=P(X\in\cdot)$, of $X$ on $\mathcal B_{\mathbb R^d}$ is then (see, e.g., p. 17 in [19]), uniquely determined by its distribution function
$$F(x)=F(x_1,\ldots,x_d)=P(X\le x)=P(X_1\le x_1,\ldots,X_d\le x_d)\quad\text{for}\quad x=(x_1,\ldots,x_d)\in\mathbb R^d.\tag{1.74}$$
We set here $x=(x_1,\ldots,x_d)\le y=(y_1,\ldots,y_d)$, for $x,y\in\mathbb R^d$, when $x_j\le y_j$ for $j=1,\ldots,d$. Of course, $F(\cdot)$ in (1.74) fulfills, whenever $x\le y$,
$$0\le F(x)\le F(y)\le1.\tag{1.75}$$
Moreover, setting $y\downarrow x$ (resp. $y\uparrow x$) when $y_j\downarrow x_j$ (resp. $y_j\uparrow x_j$) for $j=1,\ldots,d$, we have
$$\lim_{y\downarrow x}F(y)=F(x)\quad\text{for}\quad x\in\mathbb R^d;\tag{1.76}$$
$$F(-\infty,\ldots,-\infty)=0=\lim_{x\downarrow(-\infty,\ldots,-\infty)}F(x);\qquad F(\infty,\ldots,\infty)=1=\lim_{x\uparrow(\infty,\ldots,\infty)}F(x).\tag{1.77}$$
When $d=1$, the relations (1.75)–(1.76)–(1.77) reduce to (1.2)–(1.3)–(1.4), and are sufficient to define the distribution function of a probability measure on $\mathbb R$. On the other hand, this property does not hold for $d\ge2$. Introduce the following notation. For $j=1,\ldots,d$ and $u=(u_1,\ldots,u_d)\in\mathbb R^d$, set
$$\Delta_{j;u}F(x)=F(x_1,\ldots,x_{j-1},x_j+u_j,x_{j+1},\ldots,x_d)-F(x_1,\ldots,x_{j-1},x_j,x_{j+1},\ldots,x_d).\tag{1.78}$$
A necessary and sufficient condition for $F$ to define the df of a multivariate distribution is that, in addition to (1.75), (1.76) and (1.77), we have the $2^d-d+1$ inequalities
$$\Delta_{j_1;u}\circ\cdots\circ\Delta_{j_k;u}F(x)\ge0,\tag{1.79}$$
which must hold for all $x\in\mathbb R^d$, $u\in\mathbb R^d$, with $u_1\ge0,\ldots,u_d\ge0$, and $1\le j_1 < \cdots < j_k\le d$.

Theorem 1.5 below gives a multivariate version of Theorem 1.4, by combining results of Rosenblatt [169], Einmahl [92] (see his Theorem 6), and Deheuvels and Mason [68] (see their Lemma 3.1). The following notation will be needed. Let $U=(U_1,\ldots,U_d)$ be a uniform $(0,1)^d$ random vector. Let $F_1(x_1)=P(X_1\le x_1)$ denote the distribution function of $X_1$, and let, for $j=2,\ldots,d$,
$$F_j(x_j\mid x_1,\ldots,x_{j-1})=P(X_j\le x_j\mid X_1=x_1,\ldots,X_{j-1}=x_{j-1})$$
denote a regular conditional distribution function of $X_j$ given $X_1=x_1,\ldots,X_{j-1}=x_{j-1}$. Define a function $H(x)\in[0,1]^d$ of $x=(x_1,\ldots,x_d)\in\mathbb R^d$, by setting
$$H(x)=\big(F_1(x_1),\ F_2(x_2\mid x_1),\ \ldots,\ F_d(x_d\mid x_1,\ldots,x_{d-1})\big).\tag{1.80}$$
For $t\in(0,1)$, set $H_1(t)=\inf\{y:\ F_1(y)\ge t\}$, and, for $t\in(0,1)$, $j=2,\ldots,d$, and $x\in\mathbb R^d$, set
$$H_j(t\mid x_1,\ldots,x_{j-1})=\inf\{y:\ F_j(y\mid x_1,\ldots,x_{j-1})\ge t\}.$$
Also, set for $u=(u_1,\ldots,u_d)\in(0,1)^d$, $G_1(u_1)=H_1(u_1)$, and, for $j=2,\ldots,d$,
$$G_j(u_1,\ldots,u_j)=H_j\big(u_j\mid G_1(u_1),\ldots,G_{j-1}(u_1,\ldots,u_{j-1})\big).$$
Define a function $G(u)\in\mathbb R^d$ of $u=(u_1,\ldots,u_d)\in(0,1)^d$, by setting
$$G(u)=\big(G_1(u_1),\ G_2(u_1,u_2),\ \ldots,\ G_d(u_1,\ldots,u_d)\big).\tag{1.81}$$
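The recursive construction of $H$ in (1.80) and $G$ in (1.81) can be made concrete in a case where the conditional laws are explicit. The sketch below is an illustration under an assumption not made in the text: $X=(X_1,X_2)$ is a centered bivariate normal vector with unit variances and correlation `rho`, so that $X_2$ given $X_1=x_1$ is $N(\rho x_1,1-\rho^2)$. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1), provides cdf and inv_cdf


def H(x, rho):
    """Conditional-df map (1.80) for the assumed bivariate normal law."""
    x1, x2 = x
    scale = math.sqrt(1.0 - rho * rho)
    return (STD_NORMAL.cdf(x1), STD_NORMAL.cdf((x2 - rho * x1) / scale))


def G(u, rho):
    """Conditional-quantile map (1.81): G_1(u_1), then G_2(u_1, u_2)."""
    u1, u2 = u
    scale = math.sqrt(1.0 - rho * rho)
    g1 = STD_NORMAL.inv_cdf(u1)
    # Quantile of N(rho * g1, 1 - rho^2), i.e. H_2(u2 | g1).
    g2 = rho * g1 + scale * STD_NORMAL.inv_cdf(u2)
    return (g1, g2)


x = (0.3, -1.2)
u = H(x, rho=0.6)
x_back = G(u, rho=0.6)
```

Here `x_back` recovers `x` up to rounding, which is the inversion property underlying $X\overset{d}{=}G(U)$ and $U\overset{d}{=}H(X)$ when $H$ is continuous.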
Theorem 1.5. Under the above assumptions and notation, we have
$$X\overset{d}{=}G(U).\tag{1.82}$$
(i) If, in addition, $H(\cdot)$ is continuous on $\mathbb R^d$, then, we have
$$U\overset{d}{=}H(X).\tag{1.83}$$
(ii) If $H(\cdot)$ is continuously (resp. twice continuously) differentiable in a neighborhood of $x\in\mathbb R^d$, then $G(\cdot)$ is continuously (resp. twice continuously) differentiable in a neighborhood of $t=H(x)$, and such that
$$|G(u)-G(t)-DG(t)(u-t)|=o(|u-t|)\quad\text{as}\quad|u-t|\to0.\tag{1.84}$$
Moreover,
$$DG(t)=\{DH(x)\}^{-1},\tag{1.85}$$
where $DG(t)$ (resp. $DH(x)$) denotes the differential of $G(\cdot)$ at $t=H(x)$ (resp. of $H(\cdot)$ at $x$).
(iii) If $X$ has a density $f(\cdot)$, continuous and positive in a neighborhood of $x\in\mathbb R^d$, then, the Jacobian of $G(\cdot)$ at $t=H(x)$ is equal to
$$1/f(G(t)).\tag{1.86}$$
Proof. See, e.g., Deheuvels and Mason [68].
The next theorem gives a useful local version of the quantile transformation in Theorem 1.5.

Theorem 1.6. Assume that $X$ has a density $f$, continuously (resp. twice continuously) differentiable in a neighborhood of $x\in\mathbb R^d$, and that $f(x) > 0$. Then, there exists a $t\in(0,1)^d$, a neighborhood $V_x$ of $x$, a neighborhood $V_t$ of $t$, and a one-to-one mapping $G^*$ of $V_t$ onto $V_x$ such that $x=G(t)$. Moreover, on a suitable probability space, we may define $X$ jointly with a random vector $U$ uniformly distributed on $(0,1)^d$, and such that
$$X\,\mathbb 1_{\{X\in V_x\}}=G^*(U)\,\mathbb 1_{\{U\in V_t\}}.\tag{1.87}$$
The function $G^*(\cdot)$ is of the form
$$G^*(u)=\big(G_1^*(u_1),\ G_2^*(u_1,u_2),\ \ldots,\ G_d^*(u_1,\ldots,u_d)\big),\tag{1.88}$$
where, for $j=1,\ldots,d$, $G_j^*(u_1,\ldots,u_j)$ is an increasing function of $u_j$. In addition, $G^*(\cdot)$ is continuously (resp. twice continuously) differentiable on $V_t$, and such that
$$\big|G^*(u)-G^*(t)-f(x)^{-1/d}(u-t)\big|=o(|u-t|)\quad\text{as}\quad|u-t|\to0.\tag{1.89}$$
Proof. See, e.g., Deheuvels and Mason [68].
Theorems 1.5–1.6 allow us to identify, without loss of generality, an arbitrary iid sequence $X_1,X_2,\ldots$, of $\mathbb R^d$-valued random vectors, with $G(U_1),G(U_2),\ldots$, where $U_1,U_2,\ldots$, are iid random vectors, uniformly distributed on $(0,1)^d$, for some appropriate invertible mapping $G$ of $(0,1)^d$ onto $\mathbb R^d$. This is, of course, the best possible result of the kind since there is no homeomorphism (a continuous mapping with a continuous inverse) of an open subset of $\mathbb R^d$ onto an open subset of $\mathbb R^q$ unless $d=q$ (see, e.g., [114]). On the other hand (see Remark 1.4 below), if we let $G$ be non-invertible, the dimension argument breaks down. Unfortunately, Theorems 1.5–1.6 are difficult to apply for $d\ge2$, because of the inherent technicalities due to the conditioning used for the construction of $G$ in (1.82), making it difficult to relate the smoothness properties of $F$ and $G$. Moreover, the proof of these theorems relies on the choice of a basis in $\mathbb R^d$, which leads to technical difficulties. Because of this, we will limit ourselves in the sequel to $d=1$. The extension of the presently available results from dimension 1 to higher dimensions yields a series of challenging problems.

Remark 1.4.
1°) The multivariate quantile theorem was generalized as follows by Skorohod [188] (see, e.g., Dudley and Philipp [81]). Consider two Polish spaces $X_1$ and $X_2$, and let $L$ denote a probability law on $X_1\times X_2$, with marginal distribution $L_1$ on $X_1$. Then, there exists a measurable map $\Psi:[0,1]\times X_1\to X_1\times X_2$, such that, if $V\in X_1$ is a rv with distribution $L_1$, and if $U$ is a uniform $(0,1)$ rv independent of $V$, then $(V,\Psi(U,V))\in X_1\times X_2$ has distribution $L$.
2°) The following argument (see, e.g., Lemma A1 in Berkes and Philipp [14]) is often useful in the application of strong invariance principles. Consider three Polish spaces $X_1$, $X_2$ and $X_3$. We consider a probability law $L_{1,2}$ on $X_1\times X_2$, and a probability law $L_{2,3}$ on $X_2\times X_3$. We assume that the marginals of $L_{1,2}$ and $L_{2,3}$ on $X_2$ coincide. Then, there exists a probability space on which sits a rv $(X_1,X_2,X_3)$, where $(X_1,X_2)$ has law $L_{1,2}$, and $(X_2,X_3)$ has law $L_{2,3}$.

Exercise 1.4. Show that, for any probability law $L$ in a Polish space $X$, there exists a measurable map $\Psi:[0,1]\to X$, such that $\Psi(U)$ has probability law $L$ when $U$ is uniformly distributed on $(0,1)$.

Exercise 1.5. Let $d\ge2$ be an integer. A copula, $C(u)=C(u_1,\ldots,u_d)$, in $\mathbb R^d$ is, by definition, the df of a random vector $U=(U_1,\ldots,U_d)$ whose marginals are uniformly distributed on $[0,1]$. Given any random vector $X=(X_1,\ldots,X_d)$ with df $F(x)=P(X\le x)$ and marginal df's $F_j(x)=P(X_j\le x)$, for $j=1,\ldots,d$, a copula $C(\cdot)$ is associated with $X$ whenever the identity
$$F(x_1,\ldots,x_d)=C\big(F_1(x_1),\ldots,F_d(x_d)\big)\tag{1.90}$$
holds for all $x=(x_1,\ldots,x_d)$.
1°) Show that, whenever $F_1,\ldots,F_d$ are continuous, the copula $C(\cdot)$ of $F(\cdot)$ is defined uniquely by (1.90).
2°) Show that the set $\mathcal C_d$ of copulas in $\mathbb R^d$ is weakly compact. Adapt the proof of Theorem 1.4 to show that there exists a copula $C(\cdot)$ for each df $F(\cdot)$ [This is known as Sklar's theorem (see, e.g., [187, 42, 48])].
2. Fluctuations of partial sums

2.1. Some large deviations theory

Let $X$ denote a random variable with df $F(x)=P(X\le x)$. The corresponding Laplace transform, or moment-generating function [mgf], denoted by $\psi=\psi_X$, is defined, for $s\in\mathbb R$, by
$$\psi(s)=E(e^{sX})=\int_{-\infty}^{\infty}e^{sx}\,dF(x)\in(0,\infty].\tag{2.1}$$
The mgf $\psi(s)$ of $X$ is always finite for $s=0$, and such that $\psi(0)=1$. To characterize the domain of finiteness $I_\psi:=\{s\in\mathbb R:\ \psi(s) < \infty\}$ of $\psi$, set
$$-\infty\le t_1=\inf\{t\in\mathbb R:\ \psi(t) < \infty\}\le0\le t_0=\sup\{t\in\mathbb R:\ \psi(t) < \infty\}\le\infty.\tag{2.2}$$
The following proposition has a more or less straightforward analytic proof.

Proposition 2.1. The mgf $\psi(\cdot)$ of a random variable $X$ is always finite and $C^\infty$ in the neighborhood of each $s\in(t_1,t_0)$. Moreover, when $s\in(t_1,t_0)$, for each $m=1,2,\ldots$, we have
$$\psi^{(m)}(s)=\frac{d^m}{ds^m}\psi(s)=E\big(X^me^{sX}\big)=\int_{-\infty}^{\infty}x^me^{sx}\,dF(x)\in\mathbb R.\tag{2.3}$$
When $\psi(s) < \infty$ in a right (resp. left) neighborhood of 0, or equivalently, when $t_0 > 0$ (resp. $t_1 < 0$), then, the finiteness of the $m$th moment, $E(X^m)$, of $X$, is equivalent to the existence of a finite limit
$$\lim_{s\downarrow0}\psi^{(m)}(s)=E(X^m)\quad\Big(\text{resp.}\ \lim_{s\uparrow0}\psi^{(m)}(s)=E(X^m)\Big).\tag{2.4}$$
Proof. Omitted.

Remark 2.1. Recall the definition (2.2) of $t_0$ and $t_1$. When $t_0 < \infty$, the value of $\psi_X(t_0)$ may be either finite or infinite, depending upon the distribution of $X$. The same is true for $\psi_X(t_1)$. The domain of finiteness of $\psi=\psi_X$, denoted by
$$I_\psi=\{s\in\mathbb R:\ \psi(s) < \infty\},\tag{2.5}$$
is, depending upon the distribution of $X$, one among the four intervals $(t_1,t_0)$, $(t_1,t_0]$, $[t_1,t_0)$, $[t_1,t_0]$.

Proposition 2.2. The mgf $\psi(\cdot)$ of $X$ is always convex on its domain of finiteness. Moreover, the convexity of $\psi$ is always strict, with the exception of the case where $X=0$ almost surely.

Proof. The result is trivial for $t_1=t_0=0$. When $t_1 < t_0$, observe, via (2.3), that, for each $s\in(t_1,t_0)$,
$$\psi''(s)=E\big(X^2e^{sX}\big)\ge0,\tag{2.6}$$
with equality only possible when $X=0$ a.s.

The mgf $\psi$ of $X$ satisfies a stronger property than convexity, being log-convex (meaning that $\log\psi$ is convex), as shown in the next proposition.
Proposition 2.3. The mgf $\psi(\cdot)$ is always log-convex on its domain of finiteness. Moreover, the log-convexity of $\psi(\cdot)$ is always strict with the exception of the case where the distribution of $X$ is degenerate.

Proof. When $X$ is degenerate (see, e.g., (1.10)) with $X=c$ a.s., we have $\psi(s)=e^{cs}$, and $\frac{d^2}{ds^2}\log\psi(s)=0$. In the other cases, the property is trivial for $t_1=t_0=0$, and we may limit ourselves to giving a proof when $t_1 < t_0$. We then make use of the Schwarz inequality, to observe that, for each $s\in(t_1,t_0)$,
$$\frac{d^2}{ds^2}\log\psi(s)=\frac{E\big(X^2e^{sX}\big)E\big(e^{sX}\big)-\big\{E\big(Xe^{sX}\big)\big\}^2}{\big\{E\big(e^{sX}\big)\big\}^2}\ge0,\tag{2.7}$$
with equality only possible when $X$ is a.s. constant.

The Chernoff function of $X$ (or Legendre transform of the distribution function of $X$), is defined by
$$\Psi(\alpha)=\Psi_X(\alpha)=\sup_{s:\,\psi(s) < \infty}\{s\alpha-\log\psi(s)\}.\tag{2.8}$$
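The supremum in (2.8) can be approximated numerically, because $s\to s\alpha-\log\psi(s)$ is concave by Proposition 2.3. The sketch below is illustrative: `chernoff` is a hypothetical helper that runs a ternary search over a bounded interval of $s$, and the three log-mgf's are the standard ones for the $N(0,1)$, Poisson(1) and Exp(1) laws (stated here as assumptions, since those laws are only named in Example 1.3).

```python
import math


def chernoff(alpha, log_mgf, s_lo, s_hi, iters=200):
    """Approximate sup over s in [s_lo, s_hi] of s*alpha - log_mgf(s).
    The objective is concave in s, so ternary search converges."""
    def objective(s):
        return s * alpha - log_mgf(s)

    lo, hi = s_lo, s_hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2.0)


def log_mgf_normal(s):
    return 0.5 * s * s            # N(0, 1): finite for all s


def log_mgf_poisson(s):
    return math.exp(s) - 1.0      # Poisson(1): finite for all s


def log_mgf_exponential(s):
    return -math.log(1.0 - s)     # Exp(1): finite only for s < 1 (t0 = 1)


psi_normal = chernoff(1.5, log_mgf_normal, -50.0, 50.0)
psi_poisson = chernoff(2.0, log_mgf_poisson, -50.0, 50.0)
psi_exponential = chernoff(2.0, log_mgf_exponential, -50.0, 1.0 - 1e-9)
```

The three values can be compared with the closed forms $\frac12\alpha^2$, $\alpha\log\alpha-\alpha+1$ and $\alpha-1-\log\alpha$. The search interval is a truncation chosen for the illustration; for heavier-tailed laws the bounds would have to respect $t_1$ and $t_0$.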
Proposition 2.4. The Chernoff function $\Psi(\alpha)$ is a (possibly infinite) convex function on $\mathbb R$, fulfilling
$$0\le\Psi(\alpha)\le\infty,\tag{2.9}$$
and
$$\lim_{\alpha\to-\infty}\frac{\Psi(\alpha)}{\alpha}=t_1\quad\text{and}\quad\lim_{\alpha\to\infty}\frac{\Psi(\alpha)}{\alpha}=t_0.\tag{2.10}$$
Moreover, when $m=E(X)$ exists, we have
$$\Psi(m)=0.\tag{2.11}$$
Proof. The inequality (2.9) follows readily from the definition (2.8) of $\Psi$, and the observation that $0\times\alpha-\log\psi(0)=0\le\Psi(\alpha)$. The fact that $\Psi$ is convex follows from the fact that the supremum of an arbitrary family of convex functions is convex. Since, for each $s\in I_\psi=\{s\in\mathbb R:\ \psi(s) < \infty\}$, the function $\alpha\to s\alpha-\log\psi(s)$ is affine, and hence, convex, the conclusion is straightforward. To establish (2.10)–(2.11), consider first the case where $t_1=t_0=0$, so that the domain of definition of the mgf $\psi$ reduces to $I_\psi=\{0\}$. In this case $\Psi(\alpha)=0$ for all $\alpha\in\mathbb R$ (and in particular for $\alpha=m$ when $m=E(X)\in\mathbb R$ exists), and we see that (2.10)–(2.11) are trivial. Assume now that $t_1 < 0 < t_0$, in which case the existence of $m=E(X)=\psi'(0)$ is guaranteed by Proposition 2.1. We observe that the function
$$L_\alpha(s)=s\alpha-\log\psi(s)\tag{2.12}$$
has, via Proposition 2.3, a first derivative $L_\alpha'(s)=\alpha-\frac{d}{ds}\log\psi(s)$, and a second derivative $L_\alpha''(s)=-\frac{d^2}{ds^2}\log\psi(s) < 0$ on $I_\psi$. Now, because of (2.3), when $\alpha=m$, we have $L_m'(0)=m-\frac{d}{ds}\log\psi(0)=0$, so that $L_m'(s) > 0$ for $s < 0$ and $L_m'(s) < 0$ for $s > 0$. This, in turn, implies that
$$\Psi(m)=\sup_{s\in I_\psi}L_m(s)=L_m(0)=0,$$
which is (2.11). The remainder of the part of the proof of Proposition 2.4 relative to (2.10) is technical and omitted. We refer to Lemma 2.1, p. 266 in Deheuvels [51] for details.

In the remainder of this section, we consider a sequence $X_1,X_2,\ldots$ of independent replicæ of the rv $X$ with mgf $\psi$ and Chernoff function $\Psi$ as defined above. We set $S_0=0$ and $S_n=X_1+\cdots+X_n$ for $n\ge1$. The following hypotheses will be assumed, at times (recall (1.10)).
(C.1) $t_0=\sup\{s:\ \psi(s) < \infty\} > 0$;
(C.2) $t_1=\inf\{s:\ \psi(s) < \infty\} < 0$;
(C.3) The distribution of $X$ is nondegenerate, or, equivalently, $\sup_{x\in\mathbb R}P(X=x) < 1$;
(C.4) $m=E(X)\in\mathbb R$ exists.
Under (C.4) only, Chernoff's theorem (see, e.g., [9, 30]) may be stated as follows.

Theorem 2.1. Under (C.4), we have
$$P(S_n\ge n\alpha)\le\exp(-n\Psi(\alpha))\quad\text{for each}\quad\alpha\ge m;\tag{2.13}$$
$$P(S_n\le n\alpha)\le\exp(-n\Psi(\alpha))\quad\text{for each}\quad\alpha\le m.\tag{2.14}$$
Moreover,
$$\lim_{n\to\infty}\frac1n\log P(S_n\ge n\alpha)=-\Psi(\alpha)\quad\text{for each}\quad\alpha\ge m;\tag{2.15}$$
$$\lim_{n\to\infty}\frac1n\log P(S_n\le n\alpha)=-\Psi(\alpha)\quad\text{for each}\quad\alpha\le m.\tag{2.16}$$
Proof. The proof of (2.13)–(2.14) relies on the simple, but nevertheless powerful, Markov inequality. Below, we establish this result in a slightly more general form. Consider a random variable $Y$ for which $\psi_Y(t)=E(\exp(tY)) < \infty$ for some $t\ge0$. The inequality $e^{ty}\le e^{tu}$ for $y\le u$ entails that
$$\psi_Y(t)=E(\exp(tY))=\int_{-\infty}^{\infty}e^{tu}\,dP(Y\le u)\ \ge\ \int_{[y,\infty)}e^{tu}\,dP(Y\le u)\ \ge\ \int_{[y,\infty)}e^{ty}\,dP(Y\le u)=e^{ty}P(Y\ge y),$$
or, equivalently,
$$P(Y\ge y)\le\exp\Big(-\sup_{t\ge0:\,\psi_Y(t) < \infty}\{ty-\log\psi_Y(t)\}\Big).\tag{2.17}$$
A similar argument, which we omit, shows that
$$P(Y\le y)\le\exp\Big(-\sup_{t\le0:\,\psi_Y(t) < \infty}\{ty-\log\psi_Y(t)\}\Big).\tag{2.18}$$
Observe that the inequalities (2.17)–(2.18) are always true for all $y\in\mathbb R$, even when the only value of $t$ for which $\psi_Y(t) < \infty$ is $t=0$. Now, consider a random variable $Z$ with finite expectation $m_Z:=E(Z)$, and set $Y=Z-m_Z$, so that $E(Y)=0$. In this case, we see that (2.17)–(2.18) readily imply that
$$P(Y\ge y)\le\exp\Big(-\sup_{t:\,\psi_Y(t) < \infty}\{ty-\log\psi_Y(t)\}\Big)\quad\text{for}\quad y\ge0,\tag{2.19}$$
$$P(Y\le y)\le\exp\Big(-\sup_{t:\,\psi_Y(t) < \infty}\{ty-\log\psi_Y(t)\}\Big)\quad\text{for}\quad y\le0.\tag{2.20}$$
A crucial step in the proof relies on Proposition 2.1, which implies that, whenever $E(Y)=0$, we have
$$y > 0\ \Rightarrow\ \sup_{t\ge0:\,\psi_Y(t) < \infty}\{ty-\log\psi_Y(t)\}=\sup_{t:\,\psi_Y(t) < \infty}\{ty-\log\psi_Y(t)\},\tag{2.21}$$
$$y < 0\ \Rightarrow\ \sup_{t\le0:\,\psi_Y(t) < \infty}\{ty-\log\psi_Y(t)\}=\sup_{t:\,\psi_Y(t) < \infty}\{ty-\log\psi_Y(t)\}.\tag{2.22}$$
By setting $Y=Z-m_Z$ and $\psi_Z(t)=E(\exp(tZ))=e^{tm_Z}\psi_Y(t)$ in the above formulæ, we get $\psi_Y(t)=E(\exp(tY))=E(\exp(t\{Z-m_Z\}))=\psi_Z(t)e^{-tm_Z}$. This, when combined with (2.19)–(2.20) and (2.21)–(2.22), implies readily that
$$P(Z\ge z)\le\exp\Big(-\sup_{t:\,\psi_Z(t) < \infty}\{tz-\log\psi_Z(t)\}\Big)\quad\text{for}\quad z\ge m_Z,\tag{2.23}$$
$$P(Z\le z)\le\exp\Big(-\sup_{t:\,\psi_Z(t) < \infty}\{tz-\log\psi_Z(t)\}\Big)\quad\text{for}\quad z\le m_Z.\tag{2.24}$$
By setting $Z=S_n$ and $z=n\alpha$ in (2.23)–(2.24), and noting that $\psi_{S_n}(t)=\psi(t)^n$, so that $\sup_t\{tn\alpha-\log\psi_{S_n}(t)\}=n\Psi(\alpha)$, we obtain (2.13)–(2.14). The remainder of the proof is omitted.

The next theorem relies on the fact that, when $m=E(X)=0$, $\{S_n:\ n\ge0\}$ is a martingale (see, e.g., §2.2 below for details).

Theorem 2.2. Assume (C.4), with $m=E(X)=0$. Then, under (C.1), for any $\alpha\ge0$, we have
$$P\Big(\max_{0\le k\le n}S_k\ge n\alpha\Big)\le\exp(-n\Psi(\alpha)),\tag{2.25}$$
whereas, under (C.2), for each $\alpha\le0$, we have
$$P\Big(\min_{0\le k\le n}S_k\le n\alpha\Big)\le\exp(-n\Psi(\alpha)).\tag{2.26}$$
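The bounds (2.13) and (2.25) can be checked exactly in a small case. The sketch below is illustrative and rests on an assumption not made in the text: $X$ is a Rademacher rv ($P(X=\pm1)=\frac12$), for which $\psi(s)=\cosh s$ and $\Psi(\alpha)=\frac12\{(1+\alpha)\log(1+\alpha)+(1-\alpha)\log(1-\alpha)\}$ for $|\alpha|\le1$. The function names are hypothetical.

```python
import math


def psi_rate(alpha):
    """Chernoff function of a Rademacher rv (assumed example law)."""
    if abs(alpha) > 1:
        return math.inf

    def xlogx(v):
        return 0.0 if v == 0 else v * math.log(v)

    return 0.5 * (xlogx(1 + alpha) + xlogx(1 - alpha))


def tail_prob(n, level):
    """Exact P(S_n >= level), using S_n = 2B - n with B ~ Bin(n, 1/2)."""
    favourable = sum(math.comb(n, b) for b in range(n + 1) if 2 * b - n >= level)
    return favourable / 2 ** n


def max_tail_prob(n, level):
    """Exact P(max_{0<=k<=n} S_k >= level): propagate the law of the walk
    restricted to paths that have not yet reached `level`."""
    if level <= 0:
        return 1.0  # S_0 = 0 already reaches the level
    alive = {0: 1.0}
    hit = 0.0
    for _ in range(n):
        nxt = {}
        for pos, p in alive.items():
            for step in (-1, 1):
                q = pos + step
                if q >= level:
                    hit += p / 2
                else:
                    nxt[q] = nxt.get(q, 0.0) + p / 2
        alive = nxt
    return hit


n, alpha = 40, 0.25
level = 10  # n * alpha
bound = math.exp(-n * psi_rate(alpha))
```

In this example `tail_prob(n, level)` is at most `max_tail_prob(n, level)`, which is in turn at most `bound`, in line with (2.13) and (2.25). The maximal probability also agrees with the reflection-principle value `tail_prob(n, level) + tail_prob(n, level + 1)`, which serves as an independent check of the recursion.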
Proof. As follows from (2.31), we have, under the assumptions of the theorem,
$$P\Big(\max_{0\le k\le n}S_k\ge n\alpha\Big)\le\exp\Big(-n\sup_{s\ge0:\,s\in I_\psi}\{s\alpha-\log\psi(s)\}\Big).\tag{2.27}$$
Whenever $t_0 > 0$, we observe from the proof of Proposition 2.4 that, for each $\alpha > 0$, the function $L_\alpha(s)=s\alpha-\log\psi(s)$ has derivative $L_\alpha'(s)=\alpha-\frac{d}{ds}\log\psi(s)$ decreasing on $(0,t_0)$, with a right derivative at 0 given by $L_\alpha'(0+)=\alpha-m\ge0$. This implies in turn that
$$\sup_{s\ge0:\,s\in I_\psi}\{s\alpha-\log\psi(s)\}=\sup_{s\in I_\psi}\{s\alpha-\log\psi(s)\}.\tag{2.28}$$
We conclude (2.25) by combining (2.27) with (2.28). The proof of (2.26) is similar and omitted.

2.2. Martingale inequalities

The theory of martingales provides some very useful inequalities. We present below a minimal selection of some essential facts which are needed in the present framework, and refer to [80] and [106] for details. Let $I=[0,T]$ be an interval of either $\mathbb R$ or $\mathbb Z$. We consider a random process $\{Z_t:\ t\in I\}$ defined on a probability space $(\Omega,\mathcal A,P)$. We suppose given a family of σ-algebras of events $\{\mathcal A_t:\ t\in I\}$ such that, for each $s,t\in I$ with $s\le t$, $\mathcal A_s\subseteq\mathcal A_t\subseteq\mathcal A$. The family $\{Z_t,\mathcal A_t:\ t\in I\}$ defines a martingale, whenever the following properties hold.
(M.1) $Z_t$ is $\mathcal A_t$-measurable for each $t\in I$;
(M.2) $E(Z_t)$ exists for each $t\in I$;
(M.3) $E(Z_t\mid\mathcal A_s)=Z_s$ a.s. for each $s,t\in I$ with $s\le t$.
The family $\{Z_t,\mathcal A_t:\ t\in I\}$ defines a sub-martingale, whenever the following properties hold.
(M.1) $Z_t$ is $\mathcal A_t$-measurable for each $t\in I$;
(M.2) $E(Z_t)$ exists for each $t\in I$;
(M.3) $E(Z_t\mid\mathcal A_s)\ge Z_s$ a.s. for each $s,t\in I$ with $s\le t$.
In many cases of interest, in the above definition of martingales (resp. sub-martingales), the σ-algebras reduce to $\mathcal A_t=\sigma\{Z_s:\ s\in I,\ s\le t\}$ for $t\in I$. When such is the case, we will abbreviate the definition, by saying that $\{Z_t:\ t\in I\}$ defines a martingale (resp. sub-martingale). Otherwise, $\{Z_t:\ t\in I\}$ will be said to be a martingale (resp. sub-martingale) adapted to $\{\mathcal A_t:\ t\in I\}$. We have the following useful properties of martingales and sub-martingales.

Proposition 2.5. (i) Whenever $\{Z_t,\mathcal A_t:\ t\in I\}$ is a martingale, and $\varphi$ is a convex function such that $E(\varphi(Z_t))$ exists for all $t\in I$, $\{\varphi(Z_t),\mathcal A_t:\ t\in I\}$ is a sub-martingale.
(ii) Whenever $\{Z_t,\mathcal A_t:\ t\in I\}$ is a sub-martingale, and $\varphi$ is a convex nondecreasing function such that $E(\varphi(Z_t))$ exists for all $t\in I$, $\{\varphi(Z_t),\mathcal A_t:\ t\in I\}$ is a sub-martingale.

Proof. Omitted.
Proposition 2.6. (Doob's inequality) Whenever $\{Z_t,\mathcal A_t:\ t\in I\}$ is a sub-martingale, for each $u > 0$,
$$P\Big(\sup_{t\in I}Z_t\ge u\Big)\le\frac1uE(|Z_T|).\tag{2.29}$$
Proof. Omitted.
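Doob's inequality, applied to the sub-martingale $|Z_t|^r$ with $r\ge1$, can be verified by brute force on a short walk. The sketch below is illustrative only: it assumes the martingale is the simple symmetric random walk $S_k$, $0\le k\le n$, and enumerates all $2^n$ paths; `doob_check` is a hypothetical helper.

```python
from itertools import product


def doob_check(n, u, r=1):
    """Return (P(max_k |S_k| >= u), E(|S_n|^r) / u^r) for the simple
    symmetric random walk, by enumerating all 2^n sign sequences."""
    hits = 0
    moment = 0.0
    for steps in product((-1, 1), repeat=n):
        position = 0
        peak = 0
        for x in steps:
            position += x
            peak = max(peak, abs(position))
        hits += peak >= u
        moment += abs(position) ** r
    total = 2 ** n
    return hits / total, moment / total / u ** r


left_side, right_side = doob_check(n=12, u=4, r=2)
```

For $r=2$ the right-hand side is $n/u^2$, since $E(S_n^2)=n$; the enumeration is only practical for small $n$, which is enough for a sanity check.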
The following examples of interest provide some useful applications of Propositions 2.5 and 2.6.

Example 2.1.
1°) Let $\{Z_t,\mathcal A_t:\ t\in I\}$ be a martingale. Since $\varphi(t)=|t|^r$ is convex for each $r\ge1$, we have, for each $u > 0$,
$$P\Big(\sup_{t\in I}|Z_t|\ge u\Big)\le\frac1{u^r}E(|Z_T|^r).\tag{2.30}$$
To establish (2.30), we infer from Proposition 2.5 that $\{|Z_t|^r,\mathcal A_t:\ t\in I\}$ is a sub-martingale for each $r\ge1$; then, we make use of Proposition 2.6 to show that
$$P\Big(\sup_{t\in I}|Z_t|\ge u\Big)=P\Big(\sup_{t\in I}|Z_t|^r\ge u^r\Big)\le\frac1{u^r}E(|Z_T|^r).$$
2°) Let $\{Z_t,\mathcal A_t:\ t\in I\}$ be a martingale, and let $s\ge0$ be such that $E(\exp(sZ_t)) < \infty$ for all $t\in I$. Since $z\to\varphi(z)=e^{sz}$ is convex and nondecreasing, we have, for each $\alpha\in\mathbb R$,
$$P\Big(\sup_{t\in I}Z_t\ge\alpha\Big)\le\exp(-s\alpha)\,E(\exp(sZ_T))=\exp\big(-\{s\alpha-\log\psi_{Z_T}(s)\}\big).\tag{2.31}$$
To establish (2.31), we infer from Proposition 2.5 that $\{\exp(sZ_t),\mathcal A_t:\ t\in I\}$ is a sub-martingale. Then, we make use of Proposition 2.6 with $u=e^{s\alpha}$ to show that
$$P\Big(\sup_{t\in I}Z_t\ge\alpha\Big)=P\Big(\sup_{t\in I}\{\exp(sZ_t)\}\ge\exp(s\alpha)\Big)\le\exp(-s\alpha)\,E(\exp(sZ_T)).$$
Note here that the assumption $s\ge0$ is necessary for the validity of the above equality.

2.2.1. Functional large deviations and limit theorems for the Wiener process.

We start with Schilder's theorem (see, e.g., [174]), for the statement of which the following notation and definitions will be needed. Let $\{W(t):\ t\ge0\}$ denote a standard Wiener process. The reproducing kernel Hilbert space [RKHS] (see, e.g., §4.4) associated to the restriction, $\{W(t):\ 0\le t\le1\}$, of $W(\cdot)$ to $[0,1]$, is then defined by (2.33) below. The corresponding Hilbert norm is defined, for each $f\in B[0,1]$ (the set of all bounded functions on $[0,1]$), by
$$|f|_H=\begin{cases}\Big(\int_0^1\dot f(t)^2\,dt\Big)^{1/2}&\text{if } f\in AC[0,1]\text{ and } f(0)=0,\\ \infty&\text{otherwise.}\end{cases}\tag{2.32}$$
Here, as usual, $AC[0,1]$ denotes the set of all absolutely continuous functions $f$ on $[0,1]$, with Lebesgue derivative $\dot f=\frac{d}{dx}f$. The RKHS of $\{W(t):\ 0\le t\le1\}$ is, accordingly,
$$H=\{f\in AC[0,1]:\ |f|_H < \infty\}.\tag{2.33}$$
Its unit ball, often referred to as the Strassen set (see, e.g., [199]), is denoted by $S$ or $K$, and defined by
$$S=\{f\in AC[0,1]:\ |f|_H\le1\}.\tag{2.34}$$
The sup-norm of a function $f\in C[0,1]$ (the vector space of continuous functions on $[0,1]$), is denoted by
$$\|f\|=\sup_{0\le t\le1}|f(t)|.\tag{2.35}$$
For each $f\in C[0,1]$ and $\varepsilon > 0$, we set
$$N_\varepsilon(f)=\{g\in C[0,1]:\ \|g-f\| < \varepsilon\}.\tag{2.36}$$
For any $A\subseteq C[0,1]$ and $\varepsilon > 0$, we set
$$A^\varepsilon=\bigcup_{f\in A}N_\varepsilon(f)=\{g\in C[0,1]:\ \exists f\in A,\ \|g-f\| < \varepsilon\}.\tag{2.37}$$
Note here that, by convention, we set $\bigcup_\emptyset(\cdot)=\emptyset$, so that $\emptyset^\varepsilon=\emptyset$. Introduce a functional $J(\cdot)$ on the subsets of the set $B[0,1]$ of bounded functions on $[0,1]$ by setting, for each $A\subset B[0,1]$,
$$J(A)=\begin{cases}\inf_{f\in A}|f|_H^2&\text{if } A\ne\emptyset,\\ \infty&\text{if } A=\emptyset.\end{cases}\tag{2.38}$$
For each $\lambda > 0$, we denote by $W_{\{\lambda\}}$ the process defined by $W_{\{\lambda\}}(s)=(2\lambda)^{-1/2}W(s)$ for $s\in[0,1]$. Recall that $(C[0,1],\mathcal U)$ denotes the set $C[0,1]$ of continuous functions on $[0,1]$, endowed with the uniform topology $\mathcal U$, induced by $\|\cdot\|$, as in (2.35).

Theorem 2.3. (Schilder's theorem). For each closed subset $F$ of $(C[0,1],\mathcal U)$, we have
$$\limsup_{\lambda\to\infty}\frac1\lambda\log P(W_{\{\lambda\}}\in F)\le-J(F),\tag{2.39}$$
and, for each open subset $G$ of $(C[0,1],\mathcal U)$, we have
$$\liminf_{\lambda\to\infty}\frac1\lambda\log P(W_{\{\lambda\}}\in G)\ge-J(G).\tag{2.40}$$
Proof. See, e.g., Schilder [174], and Deuschel and Stroock [73], p. 12.
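Schilder's theorem can be checked by hand on a one-dimensional event. The sketch below is illustrative and uses a set chosen for this purpose, not taken from the text: $F=\{f:\ f(1)\ge a\}$ with $a > 0$. By the Cauchy–Schwarz inequality, $f(1)=\int_0^1\dot f\le|f|_H$, so $J(F)=a^2$ (attained at $f(t)=at$), and the same value holds for the interior of $F$. Since $W(1)$ is $N(0,1)$, the probability $P(W_{\{\lambda\}}\in F)=P(W(1)\ge a\sqrt{2\lambda})$ is available through `math.erfc`. The function name is hypothetical.

```python
import math


def schilder_rate_estimate(lam, a):
    """(1/lam) * log P(W_lam(1) >= a), where W_lam = (2*lam)**(-1/2) * W
    and W(1) is standard normal."""
    x = a * math.sqrt(2.0 * lam)
    # Standard normal upper tail: P(N >= x) = erfc(x / sqrt(2)) / 2.
    tail = 0.5 * math.erfc(x / math.sqrt(2.0))
    return math.log(tail) / lam


estimates = [schilder_rate_estimate(lam, a=1.0) for lam in (25, 100, 400)]
```

The estimates increase towards $-J(F)=-1$ as $\lambda$ grows, the gap being of order $\lambda^{-1}\log\lambda$. Values of $\lambda$ much beyond a few hundred would underflow the tail probability in double precision, so the sketch is not meant for large $\lambda$.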
The next lemma turns out to be useful to derive functional laws of the iterated logarithm.

Lemma 2.1. For each $\varepsilon\in(0,1)$ and $f\in S$ such that $0 < \varepsilon < |f|_H\le1$, we have
$$J\big(C[0,1]-S^\varepsilon\big)\ge(1+\varepsilon)^2,\quad\text{and}\quad J\big(N_\varepsilon(f)\big)\le\big(|f|_H-\varepsilon\big)^2\le|f|_H^2\,(1-\varepsilon)^2.\tag{2.41}$$
Proof. See, e.g., Lemma 2.5 in Deheuvels [54].
In 1964, Strassen [199] provided the first example of a functional law of the iterated logarithm [FLIL]. Below, we state the version of his theorem for the Wiener process.

Theorem 2.4. (Strassen's FLIL) Let $\{W(t):\ t\ge0\}$ denote a standard Wiener process. Set $\eta_T(u)=W(Tu)/\sqrt{2T\log_2T}$ for $0\le u\le1$ and $T\ge0$, where $\log_2u=\log_+\log_+u$ and $\log_+u=\log(u\vee e)$. Then, the following two results hold with probability 1.
$$\lim_{T\to\infty}\ \inf_{f\in S}\|\eta_T-f\|=0\quad\text{and}\quad\sup_{f\in S}\ \liminf_{T\to\infty}\|\eta_T-f\|=0.\tag{2.42}$$
Proof. A proof of Theorem 2.4 based upon Schilder's Theorem 2.3 is given in Exercises 2.2 and 2.3 below.

A typical application of Strassen's Theorem 2.4 is stated in the following corollary of this theorem.

Corollary 2.1. Let $\Theta:\ f\in C_0[0,1]\to\mathbb R$ denote a continuous functional, with respect to the uniform topology $\mathcal U$. Then, we have, with probability 1,
$$\limsup_{T\to\infty}\Theta(\eta_T)=\sup_{f\in S}\Theta(f).\tag{2.43}$$
Proof. Since $S$ is a compact subset of $(C[0,1],\mathcal U)$ (see, e.g., Example 1.3), by continuity of $\Theta$, there exists a $g\in S$ such that $L_0:=\Theta(g)=\sup_{f\in S}\Theta(f)$. As follows from (2.42), there exists a.s. a sequence $T_n\to\infty$ such that $\|\eta_{T_n}-g\|\to0$, whence
$$L_1:=\limsup_{T\to\infty}\Theta(\eta_T)\ge\lim_{n\to\infty}\Theta(\eta_{T_n})=\Theta(g)=L_0.$$
On the other hand, if $\{\tau_n:\ n\ge1\}$ is a sequence such that $\tau_n\to\infty$ and $\Theta(\eta_{\tau_n})\to L_1$, there exists a.s. a subsequence $\{\tau_n':\ n\ge1\}\subseteq\{\tau_n:\ n\ge1\}$, together with an $h\in S$ such that $\|\eta_{\tau_n'}-h\|\to0$, whence,
$$L_1=\lim_{n\to\infty}\Theta(\eta_{\tau_n'})=\Theta(h)\le\sup_{f\in S}\Theta(f)=L_0.$$
We have therefore $L_0=L_1$, as sought.
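As an illustration of Corollary 2.1, take the functional $\Theta(f)=\int_0^1f(t)\,dt$, an example chosen here rather than taken from the text. Writing $\int_0^1f(t)\,dt=\int_0^1(1-u)\dot f(u)\,du$ for $f(0)=0$, the Cauchy–Schwarz inequality gives $\sup_{f\in S}\Theta(f)=\{\int_0^1(1-u)^2du\}^{1/2}=1/\sqrt3$. The sketch below recovers this value on a grid of piecewise-linear functions; the function name is hypothetical.

```python
import math


def max_integral_over_strassen_ball(n):
    """Maximise the integral of f over [0, 1] among piecewise-linear f with
    f(0) = 0, constant slope on each of n equal cells, and energy
    sum(slope**2) / n <= 1. Returns (maximal value, energy of maximiser)."""
    dt = 1.0 / n
    # On cell i the slope d_i contributes d_i * dt * (1 - midpoint_i)
    # to the integral of f, so the objective is linear in the slopes.
    weights = [1.0 - (i + 0.5) * dt for i in range(n)]
    norm = math.sqrt(sum(w * w for w in weights) * dt)
    # Cauchy-Schwarz: the maximiser has slopes proportional to the weights.
    slopes = [w / norm for w in weights]
    energy = sum(d * d for d in slopes) * dt
    value = sum(w * d for w, d in zip(weights, slopes)) * dt
    return value, energy


value, energy = max_integral_over_strassen_ball(1000)
```

The computed `value` is close to $1/\sqrt3\approx0.577$ with `energy` equal to 1 up to rounding, so that (2.43) yields $\limsup_{T\to\infty}\int_0^1\eta_T(u)\,du=1/\sqrt3$ a.s. for this choice of $\Theta$.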
Remark 2.2. The sup-norm $\|\cdot\|$ topology $\mathcal U$, used in the statement of Theorem 2.4 and Corollary 2.1, can be replaced by the topology generated by a general norm $|\cdot|_0$ on $C[0,1]$, such that $|\cdot|_0$ is measurable with respect to $(C[0,1],\mathcal U)$, as long as the (necessary and sufficient) condition
$$P\Big(\sup_{0\le\theta\le\varepsilon}\big|W((1-\theta)\,\cdot)\big|_0 < \infty\Big) > 0\tag{2.44}$$
holds for some $\varepsilon > 0$ (see, e.g., [61, 62]).

The next result states a functional limit law for the increments of Wiener processes due to Révész [167] (see also [2] and [61]). Consider a function $\{a_T:\ T > 0\}$ such that $0 < a_T < T$ for $T > 0$. For each $T > 0$ and $t\ge0$, consider the function of $s\in[0,1]$ defined by
$$\zeta_{T,t}(s)=\{W(t+a_Ts)-W(t)\}\big/\sqrt{2a_T\{\log(T/a_T)+\log_2T\}},\tag{2.45}$$
where $\log_2T=\log_+\log_+T$, and $\log_+u=\log(u\vee e)$.

Theorem 2.5. Let $a_T\uparrow$ and $T^{-1}a_T\downarrow$. Then, with probability 1,
$$\lim_{T\to\infty}\ \sup_{0\le t\le T-a_T}\ \inf_{f\in S}\|\zeta_{T,t}-f\|=\sup_{f\in S}\ \liminf_{T\to\infty}\ \inf_{0\le t\le T-a_T}\|\zeta_{T,t}-f\|=0.\tag{2.46}$$
If, in addition to the above assumptions, $(\log(T/a_T))/\log_2T\to\infty$ as $T\to\infty$, then, with probability 1,
$$\lim_{T\to\infty}\ \sup_{f\in S}\ \inf_{0\le t\le T-a_T}\|\zeta_{T,t}-f\|=0.\tag{2.47}$$
Proof. Following the lines of Exercises 2.2 and 2.3 in the sequel, it is not too difficult to infer the above results from Schilder's Theorem 2.3. A more refined proof of Theorem 2.5, valid for possibly non-uniform norms, and based on the isoperimetric inequality (see, e.g., [20]) is to be found in [61].

Exercise 2.1.
1°) Show that the conclusion of Theorem 2.4 may be reformulated into the following two complementary statements.
(i) For each $\varepsilon > 0$, there exists a.s. a $T_\varepsilon < \infty$ such that $\eta_T\in S^\varepsilon$ for all $T\ge T_\varepsilon$;
(ii) For each $f\in S$ and $\varepsilon > 0$, there exists a.s. a sequence $T_n\to\infty$, such that $\eta_{T_n}\in N_\varepsilon(f)$ for each $n$.
2°) Assume that $a_T\uparrow$, $T^{-1}a_T\downarrow$ and $(\log(T/a_T))/\log_2T\to\infty$ as $T\to\infty$. Set $\mathcal F_T=\{\zeta_{T,t}:\ 0\le t\le T\}$. Show that, under these assumptions and notation, the conclusion of Theorem 2.5 may be reformulated as follows. For each $\varepsilon > 0$, there exists a.s. a $T_\varepsilon < \infty$ such that, for all $T\ge T_\varepsilon$, $\mathcal F_T\subseteq S^\varepsilon$ and $S\subseteq\mathcal F_T^\varepsilon$.

Exercise 2.2.
1°) Fix any $\gamma > 0$, and set $T_n=(1+\gamma)^n$ for $n\ge0$. Let $\{W(t):\ t\ge0\}$ denote a standard Wiener process, and set $\eta_T(u)=W(Tu)/\sqrt{2T\log_2T}$ for $0\le u\le1$ and $T\ge0$. Here and elsewhere, $\log_2u=\log_+\log_+u$ and $\log_+u=\log(u\vee e)$. Let $S$ be as in (2.34). Making use of (2.41) and (2.39), show that, for each $\varepsilon > 0$,
$$\sum_{n=0}^{\infty}P\big(\eta_{T_n}\notin S^\varepsilon\big) < \infty.$$
2°) Infer from the results of (1°) that, with probability 1, for each $0\le\rho\le1$,
$$\limsup_{n\to\infty}\|\eta_{T_n}\|\le1\quad\text{and}\quad\limsup_{n\to\infty}\ \sup_{0\le u,v\le1,\ |u-v|\le\rho}|\eta_{T_n}(u)-\eta_{T_n}(v)|\le\sqrt\rho.$$
Topics on Empirical Processes
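3°) Recall the notation (2.35). Show that, for all n sufficiently large, uniformly over $T_{n-1} \le T \le T_n$,
\[ 1 \le \frac{\sqrt{2T_n\log_2 T_n}}{\sqrt{2T\log_2 T}} \le \frac{\sqrt{2T_n\log_2 T_n}}{\sqrt{2T_{n-1}\log_2 T_{n-1}}} \le 1 + \gamma. \]
4°) Establish the inequality, for all large n,
\[ \sup_{T_{n-1}\le T\le T_n} \|\eta_T - \eta_{T_n}\| \le A_n + B_n, \]
where
\[ A_n = (1+\gamma) \sup_{\frac{1}{1+\gamma}\le\theta\le 1}\, \sup_{0\le u\le 1} |\eta_{T_n}(u) - \eta_{T_n}(\theta u)| \quad\text{and}\quad B_n = \gamma\,\|\eta_{T_n}\|. \]
5°) Show that, for each $\gamma > 0$, with probability 1,
\[ \limsup_{n\to\infty}\, \sup_{T_{n-1}\le T\le T_n} \|\eta_T - \eta_{T_n}\| \le 2\gamma. \]
Show that, for each $\varepsilon > 0$, there exists a.s. a $T_\varepsilon < \infty$, such that, for all $T \ge T_\varepsilon$, $\eta_T \in S^\varepsilon$.
6°) Show that, a.s. for each $0 \le \rho \le 1$,
\[ \limsup_{T\to\infty} \|\eta_T\| \le 1 \quad\text{and}\quad \limsup_{T\to\infty}\, \sup_{0\le u,v\le 1,\ |u-v|\le\rho} |\eta_T(u) - \eta_T(v)| \le \sqrt{\rho}. \tag{2.48} \]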
Exercise 2.3. Recall the notation (2.32) and (2.41). Fix any $\gamma > 0$, and set $T_n = (1+\gamma)^n$ for $n \ge 0$. Let $\eta_T(\cdot)$ be as in Exercise 2.2, and set, for $n \ge 1$,
\[ \eta^*_{T_n}(u) = \frac{W(T_{n-1} + (T_n - T_{n-1})u) - W(T_{n-1})}{\sqrt{2T_n\log_2 T_n}}. \]
1°) Making use of (2.48), show that the processes $\{\eta^*_{T_n}(u) : 0 \le u \le 1\}$ are independent, and such that
\[ \limsup_{n\to\infty} \|\eta^*_{T_n} - \eta_{T_n}\| \le \frac{2}{\sqrt{1+\gamma}} \quad\text{a.s.} \]
2°) Let f and $\varepsilon \in (0,1)$ be such that $|f|_H < 1 - \varepsilon$. Show that there exists a $\gamma_\varepsilon$ such that, for all $\gamma \ge \gamma_\varepsilon$,
\[ \sum_{n=1}^{\infty} P\big(\eta^*_{T_n} \in N_\varepsilon(f)\big) = \infty. \]
3°) Making use of the fact that S is compact in (C[0, 1], U), show that, for each $f \in S$,
\[ \liminf_{T\to\infty} \|\eta_T - f\| = 0 \quad\text{a.s.} \]
Exercise 2.4. 1°) Check that, whenever W(·) is a standard Wiener process, the process defined by $W^*(0) = 0$ and $W^*(t) = tW(1/t)$ for $t > 0$ is, again, a Wiener process.
2°) Making use of Corollary 2.1, show that Strassen's Theorem 2.4 implies the law of the iterated logarithm [LIL] for the Wiener process (due to Lévy [137])
\[ \limsup_{T\to 0} \frac{\pm W(T)}{\sqrt{2T\log_2(1/T)}} = \limsup_{T\to\infty} \frac{\pm W(T)}{\sqrt{2T\log_2 T}} = 1 \quad\text{a.s.} \tag{2.49} \]
3°) Let $\{\phi(t) : t > 0\}$ be a measurable positive function such that $\phi(t)/\sqrt{2t\log_2(1/t)} \to \infty$ as $t \downarrow 0$. Define a norm on C[0, 1] by setting |f |0 = sup0 0, assume that $\lambda^{-1}X(\lambda t)$, for $t \in [0, M]$, takes values in the càdlàg space (D[0, M], S) of right-continuous functions with left-hand limits, endowed with the Skorohod topology S (see, e.g., Ch. 3 in [19]). Set $\psi(s) = E(\exp(sX(1)))$, and, for each $\alpha \in \mathbb{R}$, define the associated Chernoff function (refer to (2.8)), via
\[ \Psi(\alpha) = \sup_{s:\,\psi(s)<\infty} \{ s\alpha - \log\psi(s) \}. \tag{2.50} \]
Assume further that $\psi(s) < \infty$ for all $s \in \mathbb{R}$, or, equivalently (see, e.g., (2.10)), that
\[ \text{(C)} \qquad \lim_{|\alpha|\to\infty} \frac{\Psi(\alpha)}{|\alpha|} = \infty. \]
Remark 2.3. 1◦ ) (D[0, M ], S) is a Polish space, and D[0, M ] ⊃ C[0, M ]. The uniform topology U is stronger on D[0, M ] than the Skorohod topology S. On the other hand, the restriction of the Skorohod topology S to C[0, M ] coincides there with the uniform topology (see, e.g., p. 112 in [19]). As a consequence of this, a compact subset of (C[0, M ], U) is also a compact subset of (D[0, M ], S). Moreover, for any f ∈ C[0, M ], a uniform neighborhood of f is a Skorohod neighborhood of f . For characterizations of compactness in D[0, M ], with respect to S, we refer to Ch. 3 in [19].
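2°) In view of (1.59) and (2.10), the assumption (C) is equivalent to the condition that
\[ t_0 = \sup\{t : \psi(t) < \infty\} = \infty \quad\text{and}\quad t_1 = \inf\{t : \psi(t) < \infty\} = -\infty. \tag{2.51} \]
As follows from Theorem 1.1, and (1.67) in Example 1.3, the set defined, for each c > 0, by
\[ \Delta_{\Psi,c} = \Big\{ f \in AC[0,M] : f(0) = 0 \ \text{and}\ \int_0^1 c\,\Psi\big(c^{-1}\dot f(u)\big)\,du \le 1 \Big\}, \tag{2.52} \]
is, under (C), a compact subset of (C[0, 1], U). For any function f of D[0, M], define a functional $J_{X,M}(f)$ by setting
\[ J_{X,M}(f) = \begin{cases} \int_0^M \Psi\big(\dot f(u)\big)\,du & \text{if } f \in AC[0,M] \text{ and } f(0) = 0, \\ \infty & \text{otherwise.} \end{cases} \tag{2.53} \]
Define further, for each $A \subseteq D[0,M]$,
\[ J_{X,M}(A) = \begin{cases} \inf_{f\in A} J_{X,M}(f) & \text{if } A \ne \emptyset, \\ \infty & \text{if } A = \emptyset. \end{cases} \tag{2.54} \]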
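Theorem 2.6. Under (C), for each closed subset F of (D[0, M], S), we have
\[ \limsup_{\lambda\to\infty} \frac{1}{\lambda} \log P\Big( \frac{X(\lambda\,\cdot)}{\lambda} \in F \Big) \le -J_{X,M}(F), \tag{2.55} \]
and, for each open subset G of (D[0, M], S), we have
\[ \liminf_{\lambda\to\infty} \frac{1}{\lambda} \log P\Big( \frac{X(\lambda\,\cdot)}{\lambda} \in G \Big) \ge -J_{X,M}(G). \tag{2.56} \]
Proof. See, e.g., Varadhan [206] and Lynch and Sethuraman [144].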
Exercise 2.5. 1°) Let {Π(t) : t ≥ 0} denote a standard Poisson process (see, e.g., §2.3.1). Making use of (2.60), show that this process has stationary and independent increments, with Chernoff function Ψ (refer to (2.50)), given by Ψ(α) = h(α), and fulfilling (C).
2°) Prove that the set of functions defined by
\[ \Delta(M) = \Big\{ f \in C[0,M] : f(0) = 0,\ \int_0^M h\big(\dot f(u)\big)\,du \le 1 \Big\} \]
is a compact subset of (C[0, M], U). Recalling (2.37), show that, for each ε > 0, there exists an η > 0 such that, for all T sufficiently large,
\[ P\Big( \frac{\Pi((\log T)\,\cdot)}{\log T} \notin \Delta(T)^{\varepsilon} \Big) \le \frac{1}{T^{1+\eta}}. \tag{2.57} \]
[Consult Lemma 2.8 in [65] for an example of application of (2.57).]
2.3. Some useful examples of large deviation inequalities
2.3.1. Inequalities for the Poisson process. By a standard Poisson process (see, e.g., §3.4) {Π(t) : t ≥ 0} is meant a homogeneous right-continuous Poisson process on R+ with E(Π(t)) = t for t ≥ 0. Set X = Π(1) to denote a Poisson random variable with expectation µ = 1. We have
\[ P(X = k) = \frac{e^{-1}}{k!} \quad\text{for } k = 0, 1, 2, \ldots, \tag{2.58} \]
so that
\[ \psi_X(t) = E(e^{tX}) = \exp\big(e^t - 1\big) \quad\text{for } t \in \mathbb{R}. \tag{2.59} \]
The corresponding Chernoff function is denoted by $h(\alpha) := \Psi_X(\alpha)$. Some calculations show that
\[ h(\alpha) = \begin{cases} \alpha\log\alpha - \alpha + 1 & \text{when } \alpha > 0, \\ 1 & \text{when } \alpha = 0, \\ \infty & \text{when } \alpha < 0. \end{cases} \tag{2.60} \]
A direct application of (2.23)–(2.24), in combination with (2.60), shows that, for an arbitrary T ≥ 0,
\[ P(\Pi(T) \ge T\alpha) \le \exp(-T h(\alpha)) \quad\text{for } \alpha \ge 1, \tag{2.61} \]
\[ P(\Pi(T) \le T\alpha) \le \exp(-T h(\alpha)) \quad\text{for } \alpha \le 1. \tag{2.62} \]
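The "some calculations" behind (2.60) amount to a one-variable optimization of the Chernoff function (2.50) with $\psi_X$ as in (2.59). The following is a sketch of that computation, written out only as a reading aid.

```latex
\begin{align*}
h(\alpha) &= \sup_{t \in \mathbb{R}} \{ t\alpha - \log \psi_X(t) \}
           = \sup_{t \in \mathbb{R}} \{ t\alpha - e^{t} + 1 \}, \\
\alpha > 0:\quad & \tfrac{d}{dt}\,(t\alpha - e^{t} + 1) = \alpha - e^{t} = 0
           \iff t = \log \alpha, \quad\text{so } h(\alpha) = \alpha \log \alpha - \alpha + 1, \\
\alpha = 0:\quad & \sup_{t}\,(1 - e^{t}) = 1 \quad (\text{approached as } t \to -\infty), \\
\alpha < 0:\quad & t\alpha - e^{t} + 1 \to \infty \quad \text{as } t \to -\infty .
\end{align*}
```

For $\alpha > 0$ the map $t \mapsto t\alpha - e^t + 1$ is concave, so the critical point $t = \log\alpha$ is indeed the maximizer.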
Since {Π(t) − t : t ≥ 0} is a martingale, we infer readily from (2.31) that, for an arbitrary T ≥ 0,
\[ P\Big( \sup_{0\le t\le T} \{\Pi(t) - t\} \ge T\alpha \Big) \le \exp(-T h(1+\alpha)) \quad\text{for } \alpha \ge 0, \tag{2.63} \]
\[ P\Big( \inf_{0\le t\le T} \{\Pi(t) - t\} \le -T\alpha \Big) \le \exp(-T h(1-\alpha)) \quad\text{for } \alpha \ge 0. \tag{2.64} \]
The following two propositions are simple consequences of (2.63)–(2.64).
Proposition 2.7. For each T ≥ 0 and each α ≥ 0, we have
\[ P\Big( \sup_{0\le t\le T} |\Pi(t) - t| \ge T\alpha \Big) \le 2\exp(-T h(1+\alpha)). \tag{2.65} \]
Proof. Observe that, for 0 ≤ α ≤ 1,
\[ h(1+\alpha) = (1+\alpha)\log(1+\alpha) - \alpha = \sum_{k=2}^{\infty} (-1)^k \frac{\alpha^k}{k(k-1)}, \tag{2.66} \]
\[ h(1-\alpha) = (1-\alpha)\log(1-\alpha) + \alpha = \sum_{k=2}^{\infty} \frac{\alpha^k}{k(k-1)}. \tag{2.67} \]
By combining (2.66)–(2.67) with the observation that h(1 − α) = ∞ for α > 1, we obtain the inequality
\[ h(1+\alpha) \le h(1-\alpha) \quad\text{for all } \alpha \ge 0. \tag{2.68} \]
By combining (2.68) with (2.63)–(2.64), we see that, for all α ≥ 0,
\[ P\Big( \sup_{0\le t\le T} |\Pi(t) - t| \ge T\alpha \Big) \le \exp(-T h(1+\alpha)) + \exp(-T h(1-\alpha)) \le 2\exp(-T h(1+\alpha)), \]
which is (2.65).
Proposition 2.8. For each T ≥ 0 and each 0 ≤ α ≤ 1, we have
\[ P\Big( \sup_{0\le t\le T} |\Pi(t) - t| \ge T\alpha \Big) \le 2\exp\Big( -\frac{T\alpha^2}{2}\Big(1 - \frac{\alpha}{3}\Big) \Big). \tag{2.69} \]
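Proof. By (2.66), we see that, for each 0 ≤ α ≤ 1,
\[ h(1+\alpha) = \sum_{k=2}^{\infty} (-1)^k \frac{\alpha^k}{k(k-1)} \ge \frac{\alpha^2}{2} - \frac{\alpha^3}{6}, \tag{2.70} \]
since the series is alternating with terms decreasing in absolute value. In view of (2.65), this suffices for (2.69).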
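The inequalities (2.68) and (2.70), together with the marginal bound (2.61), lend themselves to a quick numerical sanity check. The sketch below is illustrative only: the function names (`h`, `poisson_upper_tail`, `check_series_bounds`, `check_marginal_bound`) and the grid sizes are assumptions introduced here, and only the one-dimensional bound (2.61) is tested, not the maximal inequalities (2.63)–(2.65).

```python
import math


def h(alpha):
    """Chernoff function (2.60) of a Poisson variable with mean 1."""
    if alpha < 0:
        return math.inf
    if alpha == 0:
        return 1.0
    return alpha * math.log(alpha) - alpha + 1.0


def poisson_upper_tail(mean, k0, n_terms=400):
    """P(N >= k0) for N ~ Poisson(mean) and an integer k0 >= 0.

    The probability mass function is summed upward from k0; lgamma keeps
    the individual terms free of overflow.
    """
    total = 0.0
    for k in range(k0, k0 + n_terms):
        total += math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
    return total


def check_series_bounds(grid_size=100, tol=1e-12):
    """True if (2.68) and (2.70) hold on a grid of alpha in [0, 1]."""
    for j in range(grid_size + 1):
        a = j / grid_size
        if h(1 + a) > h(1 - a) + tol:                  # (2.68)
            return False
        if h(1 + a) < a * a / 2 - a ** 3 / 6 - tol:    # (2.70)
            return False
    return True


def check_marginal_bound(T=10.0, tol=1e-12):
    """True if P(Pi(T) >= k) <= exp(-T h(k/T)) for integers k >= T, cf. (2.61)."""
    start = math.ceil(T)
    for k in range(start, start + 40):
        if poisson_upper_tail(T, k) > math.exp(-T * h(k / T)) + tol:
            return False
    return True


if __name__ == "__main__":
    print(check_series_bounds(), check_marginal_bound())
```

Thresholds are taken at integers k, with α = k/T, so that no rounding of Tα is needed when evaluating the exact Poisson tail.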
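2.3.2. Inequalities for binomial distributions. Let X follow a Bernoulli Be(p) distribution, namely,
\[ P(X = 1) = 1 - P(X = 0) = p \in (0, 1). \tag{2.71} \]
The moment generating function of X is then given by
\[ \psi(t) = E(\exp(tX)) = 1 + p\,(e^t - 1) \quad\text{for } t \in \mathbb{R}. \tag{2.72} \]
Proposition 2.9. The Chernoff function of the Bernoulli Be(p) distribution is given by
\[ \Psi(\alpha) = \begin{cases} \log\frac{1}{1-p} & \text{if } \alpha = 0, \\ \alpha\log\frac{\alpha}{p} + (1-\alpha)\log\frac{1-\alpha}{1-p} & \text{if } 0 < \alpha < 1, \\ \log\frac{1}{p} & \text{if } \alpha = 1, \\ \infty & \text{if } \alpha \notin [0,1]. \end{cases} \tag{2.73} \]
Proof. Straightforward.
The following theorem turns out to be quite useful. Recall the definition (2.60) of h(·).
Theorem 2.7. Let $S_n$ follow a binomial B(n, p) distribution. Then
\[ P(S_n \ge n\alpha) \le \exp\Big( -np\, h\Big(\frac{\alpha}{p}\Big) \Big) \quad\text{for } \alpha \ge p, \tag{2.74} \]
whereas
\[ P(S_n \le n\alpha) \le \exp\Big( -np\, h\Big(\frac{\alpha}{p}\Big) \Big) \quad\text{for } \alpha \le p. \tag{2.75} \]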
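Proof. Let $X = X_1, \ldots, X_n$ be iid Bernoulli Be(p) distributed rv's. Then, $S_n = X_1 + \cdots + X_n$ follows a binomial B(n, p) distribution, and, via an application of (2.73) and (2.13)–(2.14), we obtain that
\[ P(S_n \ge n\alpha) \le \exp(-n\Psi(\alpha)) \quad\text{for } \alpha \ge p, \qquad P(S_n \le n\alpha) \le \exp(-n\Psi(\alpha)) \quad\text{for } \alpha \le p, \]
where $\Psi = \Psi_X$ is as in (2.73). Therefore, the proof of (2.74)–(2.75) boils down to showing that, for any 0 < α < 1, we have the inequality
\[ p\,h\Big(\frac{\alpha}{p}\Big) = \alpha\log\frac{\alpha}{p} - \alpha + p \le \Psi(\alpha) = \alpha\log\frac{\alpha}{p} + (1-\alpha)\log\frac{1-\alpha}{1-p}. \tag{2.76} \]
Now, (2.76) is a straightforward consequence of the inequality
\[ h\Big(\frac{1-\alpha}{1-p}\Big) = \frac{1-\alpha}{1-p}\log\frac{1-\alpha}{1-p} - \frac{1-\alpha}{1-p} + 1 \ge 0, \]
once both sides are multiplied by 1 − p. This completes the proof of the theorem.
Corollary 2.2. Let $\alpha_n(p) \stackrel{d}{=} n^{-1/2}(S_n - np)$, where, for 0 < p < 1, $S_n$ follows a binomial B(n, p) distribution. Then, for each $0 \le u \le \sqrt{np}$,
\[ P\big( |\alpha_n(p)| \ge u\sqrt{p} \big) = P\big( |S_n - np| \ge u\sqrt{np} \big) \le 2\exp\Big( -np\, h\Big(1 + \frac{u}{\sqrt{np}}\Big) \Big) \le 2\exp\Big( -\frac{u^2}{2}\Big(1 - \frac{u}{3\sqrt{np}}\Big) \Big). \tag{2.77} \]
Proof. Combine (2.74)–(2.75) with (2.68) and (2.70).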
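Theorem 2.7 and the comparison (2.76) can be checked against exact binomial tails for moderate n. The following sketch assumes nothing beyond the formulas (2.60), (2.73), (2.74)–(2.76); the helper names and the choice of thresholds α = k/n are illustrative assumptions, not part of the text.

```python
import math


def h(alpha):
    """Poisson Chernoff function (2.60)."""
    if alpha < 0:
        return math.inf
    if alpha == 0:
        return 1.0
    return alpha * math.log(alpha) - alpha + 1.0


def bernoulli_chernoff(alpha, p):
    """Chernoff function (2.73) of the Bernoulli Be(p) distribution."""
    if alpha < 0 or alpha > 1:
        return math.inf
    if alpha == 0:
        return -math.log(1.0 - p)
    if alpha == 1:
        return -math.log(p)
    return (alpha * math.log(alpha / p)
            + (1.0 - alpha) * math.log((1.0 - alpha) / (1.0 - p)))


def binomial_pmf(n, p, j):
    """P(S_n = j) for S_n ~ B(n, p)."""
    return math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)


def check_theorem_2_7(n, p, tol=1e-12):
    """True if (2.74), (2.75) and (2.76) hold at every threshold alpha = k/n."""
    for k in range(n + 1):
        alpha = k / n
        bound = math.exp(-n * p * h(alpha / p))
        upper = sum(binomial_pmf(n, p, j) for j in range(k, n + 1))
        lower = sum(binomial_pmf(n, p, j) for j in range(0, k + 1))
        if alpha >= p and upper > bound + tol:       # (2.74)
            return False
        if alpha <= p and lower > bound + tol:       # (2.75)
            return False
        if p * h(alpha / p) > bernoulli_chernoff(alpha, p) + tol:  # (2.76)
            return False
    return True


if __name__ == "__main__":
    print(check_theorem_2_7(50, 0.3))
```

The check illustrates that the Poisson-type bound exp(−np h(α/p)) is weaker than the exact Chernoff bound exp(−nΨ(α)), which is the content of (2.76).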
2.3.3. Inequalities for Wiener processes. Let {W(t) : t ≥ 0} denote a standard Wiener process. Obviously, {±W(t) : t ∈ [0, 1]} defines a martingale. Moreover, if we set X = W(1), then $\psi(u) = E(\exp(uX)) = \exp(\tfrac12 u^2)$, and the corresponding Chernoff function (refer to (2.9)) is then equal to $\Psi(\alpha) = \Psi_X(\alpha) = \tfrac12\alpha^2$ for all $\alpha \in \mathbb{R}$.
Proposition 2.10. For each T > 0 and u ≥ 0, we have
\[ P\Big( \sup_{0\le t\le T} \pm W(t) \ge u\sqrt{T} \Big) \le \exp\big(-\tfrac12 u^2\big), \tag{2.78} \]
\[ P\Big( \sup_{0\le t\le T} |W(t)| \ge u\sqrt{T} \Big) \le 2\exp\big(-\tfrac12 u^2\big). \tag{2.79} \]
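As a numerical sanity check on (2.78), one may compare the bound exp(−u²/2) with the exact value 2(1 − Φ(u)) given by the reflection principle (the identity quoted as (2.80) in the proof). The sketch below uses only the standard-library function `math.erfc`, through the relation 2(1 − Φ(u)) = erfc(u/√2); the function names and the grid are illustrative assumptions.

```python
import math


def reflection_tail(u):
    """Exact value 2(1 - Phi(u)) of P(sup_{0<=t<=T} W(t) >= u sqrt(T)), u >= 0."""
    return math.erfc(u / math.sqrt(2.0))


def chernoff_bound(u):
    """Right-hand side exp(-u^2/2) of (2.78)."""
    return math.exp(-0.5 * u * u)


def max_ratio(u_max=8.0, steps=800):
    """Largest value of exact tail / bound over a grid of u in [0, u_max]."""
    grid = (u_max * j / steps for j in range(steps + 1))
    return max(reflection_tail(u) / chernoff_bound(u) for u in grid)


if __name__ == "__main__":
    # A ratio not exceeding 1 on the grid is consistent with (2.78).
    print(max_ratio())
```

The ratio equals 1 at u = 0, where both sides are trivial, and decreases as u grows, which shows how much is lost by using the exponential bound instead of the exact Gaussian tail.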
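Proof. It is well known (see, for example, Csörgő and Révész (1981)), that
\[ P\Big( \sup_{0\le t\le T} \pm W(t) \ge u\sqrt{T} \Big) = 2(1 - \Phi(u)) = \frac{2}{\sqrt{2\pi}} \int_u^{\infty} e^{-t^2/2}\,dt. \tag{2.80} \]
Instead of using (2.80) for proving (2.78), we apply (2.31) to obtain directly that
\[ P\Big( \sup_{0\le t\le T} W(t) \ge u\sqrt{T} \Big) \le \exp\Big( -T\,\Psi\Big(\frac{u}{\sqrt{T}}\Big) \Big) = \exp\big(-\tfrac12 u^2\big). \]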
We so obtain (2.78) in the "+" case. The proof of (2.78) in the "−" case is identical, and omitted. Finally, (2.79) is a trivial consequence of (2.78).
2.4. Strong approximations of partial sums of i.i.d. random variables
We limit ourselves here to a few results which are relevant to our discussion, and refer to the monograph [39] of Csörgő and Révész for additional details and references. In particular, we will concentrate on the case where the random variables have exponential moments. Set $S_0 = 0$ and $S_n = X_1 + \cdots + X_n$ for n ≥ 1, where $X_1, X_2, \ldots$ are iid random replicæ of a rv X. We assume below that the mgf $\psi(t) = E(\exp(tX))$ of X fulfills the condition (C) = (C.1–2), where
(C.1) $t_0 = \sup\{s : \psi(s) < \infty\} > 0$; and
(C.2) $t_1 = \inf\{s : \psi(s) < \infty\} < 0$.
The assumption (C) = (C.1–2) implies the existence of $E(X^k)$ for each $k \in \mathbb{N}$. We will set, in particular, $m = E(X)$ and $\sigma^2 = \mathrm{Var}(X) = E\big((X - m)^2\big)$. We set, further, $S(t) = S_{\lfloor t\rfloor}$ for t ≥ 0, where $\lfloor t\rfloor \le t < \lfloor t\rfloor + 1$ denotes the integer part of t. The following Theorem 2.8 (1°) states a variant of the celebrated Komlós, Major and Tusnády strong invariance principle ([129, 130]), with a refinement due to Major [146] for (2°). Part (3°) of the theorem is due to Strassen [199] (see also [147, 148]). We set $\log_+ u = \log(u \vee e)$, and $\log_2 u = \log_+(\log_+ u)$, for $u \in \mathbb{R}$.
Theorem 2.8. 1°) Under (C), it is possible to define {S(t) : t ≥ 0} jointly with a standard Wiener process {W(t) : t ≥ 0}, in such a way that, for constants $a_1 > 0$, $a_2 > 0$ and $a_3 > 0$ depending only upon the distribution of X, we have, for all T ≥ 0 and $x \in \mathbb{R}$,
\[ P\Big( \sup_{0\le t\le T} |S(t) - mt - \sigma W(t)| \ge a_1\log_+ T + x \Big) \le a_2\exp(-a_3 x). \tag{2.81} \]
2°) If we only assume that $E(|X|^r) < \infty$ for some r > 2, then we may write, as T → ∞,
\[ |S(T) - mT - \sigma W(T)| = o(T^{1/r}). \tag{2.82} \]
3°) Finally, under the assumption that $E(|X|^2) < \infty$ only, we have, as T → ∞,
\[ |S(T) - mT - \sigma W(T)| = o\big(T^{1/2}(\log_2 T)^{1/2}\big). \tag{2.83} \]
Proof. The original papers of Komlós, Major and Tusnády [129, 130] were short on some important technical arguments, useful for the understanding of the proofs. The details were provided (in a multivariate framework) in the monograph of Einmahl [90] (see also [90, 92]). A routine argument (see, e.g., Exercises 2.6 and 2.7) allows one to write $\sup_{0\le t\le T}(\cdot)$ in (2.81), and to let T vary in R in (2.82)–(2.83), instead of restricting T to the integers 1, 2, . . . . The conclusion of the theorem holds when S(t) is taken as a process with independent increments (such as a Poisson process). The necessary details are given in Lemma 2.1, p. 84 in [69].
By combining (2.81) with the Borel-Cantelli lemma (see Exercises 2.6 and 2.7 below), we obtain readily that, under (C),
\[ |S(T) - mT - \sigma W(T)| = O(\log T) \quad\text{a.s. as } T \to \infty. \tag{2.84} \]
Let $\{a_T : T > 0\}$ be such that $0 < a_T \le T$ for T > 0, $a_T \uparrow$ and $T^{-1}a_T \downarrow$. Set, for each T > 0 and t ≥ 0,
\[ \xi_{T,t}(s) = \frac{S(t + s a_T) - S(t) - m s a_T}{\sigma\sqrt{2a_T\{\log(T/a_T) + \log_2 T\}}} \quad\text{for } 0 \le s \le 1. \tag{2.85} \]
Recall the definition (1.65) of the Strassen set S.
Theorem 2.9. Under (E.1–2), whenever $a_T/\log T \to \infty$, we have, with probability 1,
\[ \lim_{T\to\infty}\, \sup_{0\le t\le T-a_T}\, \inf_{f\in S} \|\xi_{T,t} - f\| = \sup_{f\in S}\, \liminf_{T\to\infty}\, \inf_{0\le t\le T-a_T} \|\xi_{T,t} - f\| = 0. \tag{2.86} \]
If, in addition to the above assumptions, $(\log(T/a_T))/\log_2 T \to \infty$ as $T \to \infty$, then, with probability 1,
\[ \lim_{T\to\infty}\, \sup_{f\in S}\, \inf_{0\le t\le T-a_T} \|\xi_{T,t} - f\| = 0. \tag{2.87} \]
Proof. Combine Theorem 2.5 with (2.84) and Theorem 2.8.
A direct consequence of Theorem 2.9 is that, whenever $a_T/\log T \to \infty$, we have, a.s. as T → ∞,
\[ \sup_{0\le t\le T-a_T} \Big| a_T^{-1}\{S(t + a_T) - S(t)\} - m \Big| = O\Big( \sqrt{a_T^{-1}\{\log(T/a_T) + \log_2 T\}} \Big) = o(1). \tag{2.88} \]
The limit law (2.88) breaks down when $a_T/\log T = O(1)$. This range is covered by the Erdős-Rényi new law of large numbers (see, e.g., [97]) discussed in §2.5 below. For further refinements, we refer to Deheuvels and Steinebach [71]. When combined, the above results yield readily Strassen's proof ([199]) of the classical law of the iterated logarithm, due originally to Hartman and Wintner [112].
Corollary 2.3. (The Hartman-Wintner-Strassen LIL) Under the assumption that $\mathrm{Var}(X) = \sigma^2 < \infty$, the sequence $(S_n - nm)/\sqrt{2n\log_2 n}$ is almost surely relatively compact in R (endowed with the usual topology), with limit set equal to the interval [−σ, σ]. We have
\[ \frac{S_n - nm}{\sqrt{2n\log_2 n}} \rightsquigarrow [-\sigma, \sigma] \quad\text{and}\quad \limsup_{n\to\infty} \pm\frac{S_n - nm}{\sqrt{2n\log_2 n}} = \sigma \quad\text{a.s.} \tag{2.89} \]
Proof. Combine (2.49) with (2.83).
We will use the convenient notation
\[ x_n \rightsquigarrow L, \tag{2.90} \]
to express that the sequence $\{x_n\}$ is relatively compact (in an appropriate topological space) with limit set equal to L.
Remark 2.4. 1°) Some basic knowledge of topology is needed to understand the notion of limit set for a relatively compact sequence. The following facts should be kept in mind. A topological space is Hausdorff if two distinct points have nonintersecting neighborhoods (see, e.g., p. 137 in [84]). A metric space is always Hausdorff. A Hausdorff space is compact, iff it is paracompact, meaning that each open cover has a finite sub-cover (see, e.g., pp. 162 & 222 in [84]). A topological space is sequentially compact (resp. countably compact), whenever each sequence contains an infinite convergent subsequence (resp. each countable open cover has a finite sub-cover). Sequential compactness (as well as compactness) implies countable compactness, but, in general, the converse is not true (refer to p. 20 in [194]). However, the notions of compactness, sequential compactness and countable compactness, are equivalent on a separable metric space (see, e.g., p. 209 in [10]). Thus, a separable metric space is compact iff it is sequentially compact and complete.
2°) One should distinguish the conditions of paracompactness and precompactness on a metric space. A metric space is said to be precompact or totally bounded, if, for each ε > 0, there exists a finite covering with balls of radius less than ε. A metrizable space is compact, iff there exists a metric for which it is both complete and precompact.
3°) A part of a topological space is said to be relatively compact whenever it is included in a compact set. By definition, the limit set of a relatively compact sequence in a metric space is the set of all limits of infinite convergent subsequences.
Exercise 2.6. Let {W(t) : t ≥ 0} denote a (standard) Wiener process. Making use of (2.79), show that, as n → ∞,
\[ \sup_{0\le t\le 1} |W(t + n) - W(n)| = O\big(\sqrt{\log n}\big) \quad\text{a.s.} \tag{2.91} \]
Exercise 2.7. 1°) Let $X_1, X_2, \ldots$ be iid replicæ of a random variable X fulfilling (C). Show that there exist constants $c_1 > 0$ and $c_2 > 0$, such that $P(|X| \ge t) \le c_1\exp(-c_2 t)$ for all t ≥ 0.
2°) Show that, as n → ∞,
\[ |X_n| = O(\log n) \quad\text{a.s.} \tag{2.92} \]
Exercise 2.8. Let $\{x_n : n \ge 1\}$ be a relatively compact sequence in a metric space. Show that the limit set of $\{x_n : n \ge 1\}$ (that is, the set of all limits of convergent subsequences) is compact.
2.5. The Erdős-Rényi theorem
2.5.1. The Classical Erdős-Rényi theorem. Set $S_0 = 0$ and $S_n = X_1 + \cdots + X_n$ for n ≥ 1, where $X_1, X_2, \ldots$ are iid random observations. The classical Erdős-Rényi [ER] theorem (see, e.g., [97]), stated in Theorem 2.10 below, plays an important role in the general theory of strong invariance principles for partial sums. In particular, it yields a lower bound for the best rates of strong approximations of partial sums by Gaussian processes (see Exercise 2.10 below and, e.g., §2.4, pp. 97–101 in [40]). The ER theorem describes the fluctuations of order $\lfloor c\log n\rfloor$ of the partial sum process $\{S_i : 0 \le i \le n\}$, in a sense made precise below. The following assumptions will be imposed, at times, on the mgf $\psi(t) = E(\exp(tX))$ of $X = X_1$.
(C.1) $t_0 = \sup\{s : \psi(s) < \infty\} > 0$;
(C.2) $t_1 = \inf\{s : \psi(s) < \infty\} < 0$;
(C.3) The distribution of X is nondegenerate, or, equivalently, $\sup_x P(X = x) < 1$;
(C.4) $m = E(X) \in \mathbb{R}$ exists.
We denote by (C) the joint assumption (C.1–2), and set
\[ \Psi(\alpha) = \sup_{t:\,\psi(t)<\infty} \{ t\alpha - \log\psi(t) \}, \]
for the Chernoff function pertaining to ψ. For each $c \in (0, \infty]$ (with the convention 1/∞ = 0), we set
\[ \alpha_c^+ = \inf\{\alpha \ge m : \Psi(\alpha) \ge 1/c\} \quad\text{and}\quad \alpha_c^- = \sup\{\alpha \le m : \Psi(\alpha) \ge 1/c\}. \tag{2.93} \]
Remark 2.5. Let X be degenerate with P(X = m) = 1. Then, $\psi(t) = e^{tm}$ for $t \in \mathbb{R}$, and
\[ \Psi(\alpha) = \begin{cases} 0 & \text{for } \alpha = m, \\ \infty & \text{for } \alpha \ne m. \end{cases} \]
In this case, (C.1–2–4) hold, and, for each $c \in (0, \infty]$, $\alpha_c^{\pm} = m$. As follows from the next proposition, the assumption that X is nondegenerate turns out to imply, under (C.1–2–4), that $\alpha_c^{\pm} \ne m$ for $c \in (0, \infty)$.
Proposition 2.11. Under (C.1–3–4) (resp. (C.2–3–4)), we have $\alpha_\infty^{\pm} = m$ and, for each $c \in (0, \infty)$,
\[ m < \alpha_c^+ < \infty \quad (\text{resp. } -\infty < \alpha_c^- < m). \tag{2.94} \]
Proof. Under (C.1–3–4), Proposition 2.4 shows that Ψ(α) is a convex nonnegative function, fulfilling Ψ(m) = 0 and $\Psi(\alpha)/\alpha \to t_0 > 0$ as α → ∞. This, in turn, readily implies the first half of (2.94). The proof of the second half, under (C.2–3–4), is similar and omitted.
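The constants defined in (2.93) rarely have a closed form, but they are easy to approximate numerically once Ψ is known. The sketch below locates them by bisection, under the assumptions (not checked by the code) that Ψ is convex with Ψ(m) = 0 and that Ψ eventually exceeds 1/c on the relevant side of m; the function names `alpha_plus`, `alpha_minus` and `poisson_h` are introduced here for illustration.

```python
import math


def alpha_plus(Psi, m, c, tol=1e-12):
    """Approximate inf{alpha >= m : Psi(alpha) >= 1/c}, cf. (2.93)."""
    target = 1.0 / c            # equals 0.0 when c = math.inf
    lo, hi = m, m + 1.0
    while Psi(hi) < target:     # expand the bracket to the right
        hi = m + 2.0 * (hi - m)
    for _ in range(200):        # bisection, keeping Psi(hi) >= target
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if Psi(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


def alpha_minus(Psi, m, c, tol=1e-12):
    """Approximate sup{alpha <= m : Psi(alpha) >= 1/c}, cf. (2.93)."""
    target = 1.0 / c
    lo, hi = m - 1.0, m
    while Psi(lo) < target:     # expand the bracket to the left
        lo = m - 2.0 * (m - lo)
    for _ in range(200):        # bisection, keeping Psi(lo) >= target
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if Psi(mid) >= target:
            lo = mid
        else:
            hi = mid
    return lo


def poisson_h(alpha):
    """Chernoff function (2.60) of a Poisson variable with mean m = 1."""
    if alpha < 0:
        return math.inf
    if alpha == 0:
        return 1.0
    return alpha * math.log(alpha) - alpha + 1.0


if __name__ == "__main__":
    # Poisson increments with mean 1 and window length about c log n, c = 1.
    print(alpha_plus(poisson_h, 1.0, 1.0), alpha_minus(poisson_h, 1.0, 2.0))
```

With c = ∞ the target level is 0 and both routines return m, in agreement with the first statement of Proposition 2.11.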
For each integer k such that 0 ≤ k ≤ n, consider the maximal and minimal increments
\[ I_n^+(k) := \max_{0\le i\le n-k} \{S_{i+k} - S_i\} \quad\text{and}\quad I_n^-(k) := \min_{0\le i\le n-k} \{S_{i+k} - S_i\}. \tag{2.95} \]
Fix a c > 0, and set $k_n = \lfloor c\log n\rfloor$ for n ≥ 1, where $\lfloor u\rfloor \le u < \lfloor u\rfloor + 1$ denotes the integer part of u.
Theorem 2.10. (Erdős and Rényi) We have, almost surely, under (C.1–4),
\[ \lim_{n\to\infty} k_n^{-1} I_n^+(k_n) = \alpha_c^+, \tag{2.96} \]
and, under (C.2–4),
\[ \lim_{n\to\infty} k_n^{-1} I_n^-(k_n) = \alpha_c^-. \tag{2.97} \]
Proof. In view of Remark 2.5, the statements (2.96)–(2.97) hold trivially when X is degenerate. Therefore, we may limit ourselves to establishing these statements in the remaining case, where (C.3) is satisfied.
Step 1. (Outer bounds) Assume (C.1–3–4). We define an integer sequence $\{n_k : k \ge k_0\}$, where
\[ k_0 := \inf\{k \ge 1 : e^{(k+1)/c} - e^{k/c} \ge 1\}, \]
and, for each $k \ge k_0$, by setting $n_k := \sup\{n \ge 1 : \lfloor c\log n\rfloor = k\}$. The inequalities $c\log n - 1 < \lfloor c\log n\rfloor = k_n \le c\log n$ readily imply that, for each $k \ge k_0$,
\[ e^{k/c} \le n_k < e^{(k+1)/c}. \tag{2.98} \]
Moreover, it is easy to see that, for all $k \ge k_0$,
\[ \max_{n_{k-1} < n \le n_k} k_n^{-1} I_n^+(k_n) = k^{-1} I_{n_k}^+(k). \tag{2.99} \]
Thus, by (2.13), (2.95), and (2.98)–(2.99), we obtain that, for each α ≥ m and $k \ge k_0$,
\[ P_k(\alpha) := P\Big( \bigcup_{n_{k-1} < n \le n_k} \{k_n^{-1} I_n^+(k_n) \ge \alpha\} \Big) = P\big( k^{-1} I_{n_k}^+(k) \ge \alpha \big) = P\Big( \bigcup_{i=0}^{n_k - k} \{S_{i+k} - S_i \ge k\alpha\} \Big) \]
\[ \le (n_k - k + 1)\, P(S_k \ge k\alpha) \le n_k \exp(-k\Psi(\alpha)) \le e^{1/c} \exp\Big( -k\Big\{\Psi(\alpha) - \frac{1}{c}\Big\} \Big). \]
Now, as follows from (2.94) and the definition (2.93) of $\alpha_c^+$, for each ε > 0, we have
\[ \eta := \Psi(\alpha_c^+ + \varepsilon) - \frac{1}{c} > 0. \]
By all this, we see that, for $k \ge k_0$, $P_k(\alpha_c^+ + \varepsilon) \le e^{1/c}\exp(-k\eta)$, and hence, $\sum_{k\ge k_0} P_k(\alpha_c^+ + \varepsilon) < \infty$. The Borel-Cantelli lemma implies therefore that, for each ε > 0,
\[ \limsup_{n\to\infty} k_n^{-1} I_n^+(k_n) \le \alpha_c^+ + \varepsilon \quad\text{a.s.} \]
Since this relation holds for each specified ε > 0, it implies, in turn, that
\[ \limsup_{n\to\infty} k_n^{-1} I_n^+(k_n) \le \alpha_c^+ \quad\text{a.s.} \tag{2.100} \]
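Step 2. (Inner bounds) Assuming (C.1–3–4), we set $L_n^+(k) = \max\{S_k,\ S_{2k} - S_k,\ \ldots,\ S_{kN} - S_{k(N-1)}\}$, where $N = N(n,k) := \lfloor n/k\rfloor - 1$ and $k = k_n = \lfloor c\log n\rfloor$. We have, for each α > m = E(X),
\[ Q_{n,k}(\alpha) := P\big( L_n^+(k) < k\alpha \big) = \big(1 - P(S_k \ge k\alpha)\big)^N \le \exp\big( -N\, P(S_k \ge k\alpha) \big). \]
We next make use of Chernoff's theorem (refer to (2.15)) to show that, for any $\varepsilon_1 \in (0, \alpha - m)$,
\[ \lim_{k\to\infty} \frac{1}{k} \log P\big( S_k \ge k(\alpha - \varepsilon_1) \big) = -\Psi(\alpha - \varepsilon_1), \]
so that, ultimately in k → ∞, for each $\varepsilon_2 > 0$,
\[ P\big( S_k \ge k(\alpha - \varepsilon_1) \big) \ge \exp\big( -k\{\Psi(\alpha - \varepsilon_1) + \varepsilon_2\} \big). \tag{2.101} \]
Now, letting $\alpha = \alpha_c^+$, with $\varepsilon_1 \in (0, \alpha_c^+ - m)$, we infer from (2.94) and the definition (2.93) of $\alpha_c^+$ that
\[ 2\theta := \frac{1}{c} - \Psi(\alpha_c^+ - \varepsilon_1) > 0. \]
By all this, if we set $\varepsilon_2 = \theta$ and $k = k_n = \lfloor c\log n\rfloor \le c\log n$ in (2.101), we obtain that the inequality
\[ P\big( S_{k_n} \ge k_n(\alpha_c^+ - \varepsilon_1) \big) \ge \exp\Big( -k_n\Big\{\frac{1}{c} - \theta\Big\} \Big) \ge \frac{n^{\theta c}}{n} \]
holds for all large n. This, in turn, readily entails that, for all n sufficiently large,
\[ Q_{n,k_n}(\alpha_c^+ - \varepsilon_1) \le \exp\Big( -N(n, k_n)\,\frac{n^{\theta c}}{n} \Big) \le \exp\big( -n^{\theta c/2} \big), \]
which is summable in n. Keeping in mind that $L_n^+(k) \le I_n^+(k)$, an application of the Borel-Cantelli lemma shows that, almost surely,
\[ \liminf_{n\to\infty} k_n^{-1} I_n^+(k_n) \ge \liminf_{n\to\infty} k_n^{-1} L_n^+(k_n) \ge \alpha_c^+ - \varepsilon_1. \]
Since this relation holds for an arbitrary $\varepsilon_1 \in (0, \alpha_c^+ - m)$, we obtain therefore that
\[ \liminf_{n\to\infty} k_n^{-1} I_n^+(k_n) \ge \alpha_c^+ \quad\text{a.s.} \tag{2.102} \]
By combining (2.100) with (2.102), we conclude (2.96). The proof of (2.97) is similar, and omitted.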
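The statistics in (2.95)–(2.97) are straightforward to compute from prefix sums. The sketch below does this for an arbitrary finite sequence of increments; the function names and the seeded ±1 example are assumptions made for illustration. Since the window length grows only like log n, a single simulated path should be expected to approach the limits of Theorem 2.10 slowly, so no limiting value is asserted here.

```python
import math
import random
from itertools import accumulate


def partial_sums(xs):
    """Return [S_0, S_1, ..., S_n] with S_0 = 0 and S_i = x_1 + ... + x_i."""
    return [0.0] + list(accumulate(xs))


def max_min_increments(S, k):
    """Return (I_n^+(k), I_n^-(k)) of (2.95) from the partial sums S_0..S_n."""
    n = len(S) - 1
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    increments = [S[i + k] - S[i] for i in range(n - k + 1)]
    return max(increments), min(increments)


def erdos_renyi_ratios(xs, c):
    """Return (I_n^+(k_n)/k_n, I_n^-(k_n)/k_n) with k_n = floor(c log n)."""
    n = len(xs)
    k = math.floor(c * math.log(n))
    hi, lo = max_min_increments(partial_sums(xs), k)
    return hi / k, lo / k


if __name__ == "__main__":
    rng = random.Random(0)                              # seeded for reproducibility
    xs = [rng.choice((-1, 1)) for _ in range(20000)]    # symmetric +-1 increments
    print(erdos_renyi_ratios(xs, c=2.0))
```

For bounded increments such as the ±1 example, both ratios necessarily stay within [−1, 1], which gives a simple consistency check on the implementation.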
Exercise 2.9. Assume that (C.1–4) is satisfied, and that m = E(X) = 0. Set
\[ J_n^+(k) = \max_{0\le i\le j\le i+k\le n} \{S_j - S_i\} \quad\text{and}\quad J_n^-(k) = \min_{0\le i\le j\le i+k\le n} \{S_j - S_i\}. \]
1°) Show that the conclusion (2.96)–(2.97) of Theorem 2.10 holds, with the formal replacements of $I_n^+(k)$ and $I_n^-(k)$ by $J_n^+(k)$ and $J_n^-(k)$, respectively.
2°) Show that these results fail to hold when m = E(X) ≠ 0 [for additional details, see, e.g., [58, 59]].
Exercise 2.10. Let {W(t) : t ≥ 0} be a Wiener process. Set
\[ I_n^+(k) = \max_{0\le i\le n-k} \{W(i+k) - W(i)\} \quad\text{and}\quad I_n^-(k) = \min_{0\le i\le n-k} \{W(i+k) - W(i)\}. \]
1°) Letting $k_n = \lfloor c\log n\rfloor$, for c > 0, find the a.s. limits
\[ \lim_{n\to\infty} k_n^{-1} I_n^+(k_n) \quad\text{and}\quad \lim_{n\to\infty} k_n^{-1} I_n^-(k_n). \]
2°) Show that if, on a suitable probability space, $|S_n - W(n)| = o(\log n)$ a.s., as n → ∞, then the Chernoff function of X is given by $\Psi(\alpha) = \tfrac12\alpha^2$ for $\alpha \in \mathbb{R}$. [As follows from a result of Bartfai [12] (see, e.g., p. 101 in [40]), this implies that $X \stackrel{d}{=} N(0, 1)$, so that the relation $|S_n - W(n)| = o(\log n)$ is impossible, unless $S_n$ coincides initially with $W^*(n)$, for some Wiener process $W^*(\cdot)$.]
2.5.2. Functional Erdős-Rényi laws. The original Erdős-Rényi law [97] has given rise to a number of refinements and variants in the literature (see, e.g., [58, 59] and the references therein). Below, we give a functional version of this theorem due to Deheuvels [51]. We will work in the framework and assumptions of §2.5.1, assuming throughout that (C.3–4) hold. We introduce the following additional notation. We set, for each t ≥ 0, $S(t) = S_{\lfloor t\rfloor}$ (which is right-continuous). For an arbitrary function $0 < a_T \le T$ of T ≥ 0, we set, for x ≥ 0,
\[ \upsilon_{x,T}(s) = a_T^{-1}\{S(x + s a_T) - S(x)\} \quad\text{for } 0 \le s \le 1, \tag{2.103} \]
and set, for each T > 0,
\[ F_T = \{\upsilon_{x,T} : 0 \le x \le T - a_T\}. \tag{2.104} \]
It is noteworthy that, for each T > 0, $F_T \subset BV_0[0,1]$, where $BV_0[0,1]$ is as in (1.55). Letting Ψ and $t_0$, $t_1$ be as in (2.8) and (2.10), we set, for each c > 0,
\[ \Delta_{\Psi,c} = \Big\{ H \in BV_0[0,1] : \int_0^1 c\,\Psi\big(c^{-1}\dot H(u)\big)\,du + t_0 H_{S-}(1+) - t_1 H_{S+}(1+) \le 1 \Big\}. \tag{2.105} \]
Note that, when $t_0 = \infty$ and $t_1 = -\infty$, the set defined by (2.105) reduces to
\[ \Delta_{\Psi,c} = \Big\{ H \in AC[0,1] : H(0) = 0 \ \text{and}\ \int_0^1 c\,\Psi\big(c^{-1}\dot H(u)\big)\,du \le 1 \Big\}. \tag{2.106} \]
In view of Theorems 1.1, 1.2 and 1.3, we see that:
– When $t_0$, $t_1$ are arbitrary, $\Delta_{\Psi,c} \subset BV_0[0,1]$ is compact with respect to the topology W, of weak convergence of distribution functions of signed measures in $M_f[0,1]$, defined via the Högnäs metric (1.57). The set $\Delta_{\Psi,c}$ may be identified, through the mapping $\mu \leftrightarrow H_\mu$, to a weakly compact metrizable (via $d_H$) subset of the (non-metrizable) topological space $(M_f[0,1], W)$;
– When $t_1 = -\infty$ and $t_0$ is arbitrary, $\Delta_{\Psi,c} \subset I_+[0,1]$ is compact with respect to the topology of weak convergence of distribution functions, defined by the Lévy distance $\Delta_L(\cdot,\cdot)$. This set of functions may be identified, through the mapping $\mu \leftrightarrow H_\mu$, to a weakly compact subset of the metrizable (via $\Delta_L$) topological space $(M_f^+[0,1], W)$;
– When $t_0 = \infty$ and $t_1 = -\infty$, $\Delta_{\Psi,c} \subset AC_0[0,1]$ is compact with respect to the uniform topology.
Recall the definition (1.56) of $BV_{0,C}[0,1]$. We have the following result (see, e.g., [51]).
Theorem 2.11. Fix any c > 0, and assume that $a_T/\log T \to c$ as T → ∞. Then, under either (C.1–3–4) or (C.2–3–4), there exists a 0 < C < ∞ such that $\Delta_{\Psi,c} \subseteq BV_{0,C}[0,1]$, and, a.s. ultimately as T → ∞, $F_T \subseteq BV_{0,C}[0,1]$. Moreover, if we set, for each ε > 0 and $A \subseteq BV_{0,C}[0,1]$,
\[ A^{[\varepsilon];C} = \{ f \in BV_{0,C}[0,1] : d_H(f, g) < \varepsilon \ \text{for some } g \in A \}, \tag{2.107} \]
then, there exists a.s. a $T_\varepsilon < \infty$ such that, for all $T \ge T_\varepsilon$, we have
\[ F_T \subseteq \Delta_{\Psi,c}^{[\varepsilon];C} \quad\text{and}\quad \Delta_{\Psi,c} \subseteq F_T^{[\varepsilon];C}. \tag{2.108} \]
Exercise 2.11. Show that, under (C), (2.108) may be replaced by
\[ F_T \subseteq \Delta_{\Psi,c}^{\varepsilon} \quad\text{and}\quad \Delta_{\Psi,c} \subseteq F_T^{\varepsilon}. \tag{2.109} \]
3. Empirical Processes
3.1. Uniform empirical distribution and quantile functions
Let $U_1, U_2, \ldots$ be a sequence of independent and identically distributed [iid] uniform (0, 1) random variables. For each n ≥ 1, the empirical distribution based upon $U_1, \ldots, U_n$ is defined by
\[ P_n(B) = \frac{1}{n} \sum_{i=1}^{n} 1\!\mathrm{I}_{\{U_i \in B\}}, \tag{3.1} \]
for each Borel set $B \in \mathcal{B}_{\mathbb{R}}$. Denote by #E the cardinality of E. The corresponding uniform empirical distribution function will be denoted hereafter by
\[ U_n(t) = P_n((-\infty, t]) = \frac{1}{n} \sum_{i=1}^{n} 1\!\mathrm{I}_{\{U_i \le t\}} = \frac{1}{n}\,\#\{U_i \le t : 1 \le i \le n\} \quad\text{for } t \in \mathbb{R}. \tag{3.2} \]
Remark 3.1. In general, each sequence $X_1, \ldots, X_n$ of random variables, with values in a locally compact separable metric space X, defines an empirical measure by $\mu_n = n^{-1}\sum_{i=1}^{n} \delta_{X_i}$, where $\delta_x$ stands for the Dirac measure at x. In this framework, $\mu_n$ is a Radon measure on X. Among other problems of interest, one may consider convergence properties, as n → ∞, of the restriction of $\mu_n$ to a class of subsets of $\mathcal{B}_X$, or to a class of Borel-measurable functions on X. One of the main issues to be considered is then the central limit theorem, namely, the asymptotic convergence of such an empirical process, indexed by sets or functions, to a limiting Gaussian process. Also, it is of interest to characterize the cases where this empirical process fulfills a Glivenko-Cantelli-type law of large numbers, or a functional law of the iterated logarithm. We refer to [6, 82, 99, 164, 204] for advanced courses on the subject. In what follows, we shall limit ourselves to the case of X = R, which is of interest in and of itself. In particular, the notion of empirical quantile process (see, e.g., §3.2 below) gives rise to developments which are very much specific to X = R, and difficult to extend properly in higher dimensions.
Remark 3.2. Let $X_1, \ldots, X_n$ be iid replicæ of a random variable X with continuous df $F(x) = P(X \le x)$ and quantile function $G(t) = \inf\{x : F(x) \ge t\}$. By Theorem 1.4, the rv's $U_1 = F(X_1), \ldots, U_n = F(X_n)$ are iid uniform (0, 1) rv's. The empirical measure $\mu_n$ based upon $X_1, \ldots, X_n$ has a df given by $F_n(x) = U_n(F(x))$, for $x \in \mathbb{R}$, and a qf given by $G_n(t) = G(V_n(t))$, for $t \in (0, 1)$. For this reason, the study of the limiting behavior of $F_n$ and $G_n$ can be reduced to the study of the limiting behavior of $U_n$ and $V_n$. In the sequel, we will deal mostly with uniform empirical and quantile processes. We refer to [40, 38, 85, 183], and the references therein for details on the general case.
For each n ≥ 0, set $U_{0,n} = 0$ and $U_{n+1,n} = 1$, and, for each n ≥ 1, denote by $U_{1,n} \le \cdots \le U_{n,n}$ the order statistics of $U_1, \ldots, U_n$, obtained by sorting these random variables increasingly.
Proposition 3.1. For each n ≥ 0, we have, with probability 1,
\[ 0 = U_{0,n} < U_{1,n} < \cdots < U_{n,n} < U_{n+1,n} = 1. \tag{3.3} \]
Proof. Obviously, for each i ≥ 1 and j ≥ 1 with i ≠ j, we have
\[ P(U_i = 0 \text{ or } 1) = 0 \quad\text{and}\quad P(U_i = U_j) = 0, \]
so that $U_1, \ldots, U_n$ are a.s. distinct and in (0, 1). The conclusion (3.3) is then straightforward.
Unless otherwise specified, we will assume, without loss of generality, that (3.3) holds on the probability space (Ω, A, P) on which sit $U_1, U_2, \ldots$. This allows us to write
\[ U_n(t) = \begin{cases} 0 & \text{for } t < U_{1,n}, \\ \frac{i}{n} & \text{for } U_{i,n} \le t < U_{i+1,n},\ i = 1, \ldots, n-1, \\ 1 & \text{for } t \ge U_{n,n}. \end{cases} \tag{3.4} \]
The uniform empirical quantile function is defined by
\[ V_n(t) = \begin{cases} U_{1,n} & \text{for } t = 0, \\ U_{i,n} & \text{for } \frac{i-1}{n} < t \le \frac{i}{n},\ i = 1, \ldots, n. \end{cases} \tag{3.5} \]
Let I(t) = t denote identity, and, for each bounded function f on [0, 1], set
\[ \|f\| = \sup_{0\le t\le 1} |f(t)|. \tag{3.6} \]
Proposition 3.2. We have, with probability 1,
\[ \|U_n(V_n) - I\| = \frac{1}{n}. \tag{3.7} \]
Proof. For each i = 1, . . . , n and $\frac{i-1}{n} < t \le \frac{i}{n}$, we have $U_n(V_n(t)) = U_n(U_{i,n}) = \frac{i}{n}$, so that
\[ \sup_{\frac{i-1}{n} < t \le \frac{i}{n}} |U_n(V_n(t)) - t| = \frac{1}{n}, \]
which yields (3.7).
Proposition 3.3. We have, with probability 1,
\[ \|U_n - I\| = \|V_n - I\| = \max_{1\le i\le n} \max\Big\{ \Big|U_{i,n} - \frac{i-1}{n}\Big|,\ \Big|U_{i,n} - \frac{i}{n}\Big| \Big\}. \tag{3.8} \]
Proof. For each i = 0, . . . , n and $U_{i,n} \le t < U_{i+1,n}$, we have $U_n(t) - t = \frac{i}{n} - t$, whereas, for $\frac{i-1}{n} < t \le \frac{i}{n}$, $V_n(t) - t = U_{i,n} - t$. By combining these two relations, the conclusion (3.8) is straightforward.
The following properties of order statistics play an essential role in the study of empirical processes.
Theorem 3.1. Let $\{\omega_i : i \ge 1\}$ denote a sequence of independent and identically distributed random variables with a standard exponential distribution. Namely, for each i ≥ 1,
\[ P(\omega_i > t) = e^{-t} \quad\text{for } t \ge 0. \tag{3.9} \]
Set $T_0 = 0$ and $T_m = \omega_1 + \cdots + \omega_m$ for m ≥ 1. Then, for each n ≥ 0, $T_{n+1}$ and the random vector $\{T_i/T_{n+1} : i = 0, \ldots, n+1\}$ are independent. Moreover, one has the distributional identity
\[ \{U_{i,n} : i = 0, \ldots, n+1\} \stackrel{d}{=} \{T_i/T_{n+1} : i = 0, \ldots, n+1\}. \tag{3.10} \]
Proof. The joint density of $\omega_1, \ldots, \omega_{n+1}$ is $f(u_1, \ldots, u_{n+1}) = e^{-s}$, with $s = u_1 + \cdots + u_{n+1}$. The linear change of variables $(u_1, \ldots, u_{n+1}) \to (u_1, \ldots, u_n, s)$ has Jacobian equal to 1, so that the joint density of $(u_1, \ldots, u_n, s)$ is equal to $f(u_1, \ldots, u_n, s) = e^{-s}$. Now, the change of variables $(u_1, \ldots, u_n, s) \to (u_1/s, \ldots, u_n/s, s)$ has Jacobian equal to $s^{-n}$, so that the joint density of $(v_1 := u_1/s, \ldots, v_n := u_n/s, s)$ equals
\[ f(v_1, \ldots, v_n, s) = n! \times \frac{s^n e^{-s}}{n!}. \]
We obtain therefore that $(T_1/T_{n+1}, \ldots, T_n/T_{n+1})$ and $T_{n+1}$ are independent. Moreover, (i) $(T_1/T_{n+1}, \ldots, T_n/T_{n+1})$ is uniformly distributed (with constant density, equal to n!) on the set $\{(v_1, \ldots, v_n) : 0 \le v_1 \le \cdots \le v_n \le 1\}$; (ii) $T_{n+1}$ follows a Γ(n + 1) distribution on R+. To conclude (3.10) we observe that the joint distributions of the random vectors $(T_1/T_{n+1}, \ldots, T_n/T_{n+1})$ and $(U_{1,n}, \ldots, U_{n,n})$ coincide.
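Propositions 3.2 and 3.3 are deterministic identities for any sample of distinct points in (0, 1), so they can be checked directly on simulated data. The sketch below builds $U_n$ and $V_n$ as in (3.2) and (3.5) and approximates the sup-norms (3.6) by evaluating the functions just to the left and right of their jump points; the offset `delta`, the function names and the seeded sample are assumptions made for illustration.

```python
import bisect
import math
import random


def empirical_df(sample):
    """Right-continuous empirical distribution function U_n of (3.2)."""
    xs = sorted(sample)
    n = len(xs)
    return lambda t: bisect.bisect_right(xs, t) / n


def empirical_qf(sample):
    """Empirical quantile function V_n of (3.5), for t in [0, 1]."""
    xs = sorted(sample)
    n = len(xs)

    def v(t):
        i = max(1, math.ceil(n * t))   # V_n(t) = U_{i,n} for (i-1)/n < t <= i/n
        return xs[min(i, n) - 1]

    return v


def sup_norm_check(sample, delta=1e-9):
    """Return (||U_n - I||, ||V_n - I||, ||U_n(V_n) - I||, right side of (3.8)).

    The sup-norms are approximated to within about delta by probing each
    function on both sides of its jump points.
    """
    xs = sorted(sample)
    n = len(xs)
    U, V = empirical_df(sample), empirical_qf(sample)
    pts_u = xs + [x - delta for x in xs]
    pts_v = ([i / n - delta for i in range(1, n + 1)]
             + [(i - 1) / n + delta for i in range(1, n + 1)])
    norm_u = max(abs(U(t) - t) for t in pts_u)
    norm_v = max(abs(V(t) - t) for t in pts_v)
    norm_uv = max(abs(U(V(t)) - t) for t in pts_v)
    formula = max(max(abs(x - (i - 1) / n), abs(x - i / n))
                  for i, x in enumerate(xs, start=1))
    return norm_u, norm_v, norm_uv, formula


if __name__ == "__main__":
    rng = random.Random(12345)
    sample = [rng.random() for _ in range(50)]
    print(sup_norm_check(sample))
```

Up to the probing offset, the first, second and fourth returned values agree, as stated in (3.8), and the third equals 1/n, as stated in (3.7).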
Theorem 3.2. For each $n \ge 1$, the random variables
\[ \Big\{\frac{U_{i,n}}{U_{i+1,n}}\Big\}^{i}, \quad i = 1, \ldots, n, \tag{3.11} \]
(with the convention $U_{n+1,n} = 1$) are independent and uniformly distributed on $(0, 1)$.

Proof. We change variables by setting $Y_i = -\log U_i$ for $i = 1, 2, \ldots$, so that $Y_1, Y_2, \ldots$ is an iid sequence of exponentially distributed random variables. For each $n \ge 1$, set $Y_{0,n} = 0$, and denote by $0 < Y_{1,n} < \cdots < Y_{n,n}$ the order statistics of $Y_1, \ldots, Y_n$. Keeping in mind that
\[ Y_{i,n} = -\log U_{n-i+1,n} \quad \text{for } i = 0, \ldots, n, \]
we see that (3.11) is equivalent to the property that
\[ (n - i + 1)\{Y_{i,n} - Y_{i-1,n}\}, \quad i = 1, \ldots, n, \tag{3.12} \]
are independent and exponentially distributed. To establish this property, we recall from the proof of Theorem 3.1 that the joint density of $(U_{1,n}, \ldots, U_{n,n})$ is constant and equal to $n!$ on the set $\{(v_1, \ldots, v_n) : 0 \le v_1 \le \cdots \le v_n \le 1\}$. The change of variables $(v_1, \ldots, v_n) \to (y_1 := -\log v_n, \ldots, y_n := -\log v_1)$ has Jacobian $1/(v_1 \cdots v_n) = \exp(y_1 + \cdots + y_n)$, so that the joint density of $(Y_{1,n}, \ldots, Y_{n,n})$ equals
\[ f(y_1, \ldots, y_n) = n!\, e^{-(y_1 + \cdots + y_n)} \]
on the domain $\{(y_1, \ldots, y_n) : 0 \le y_1 \le \cdots \le y_n\}$. We now make the change of variables $(y_1, \ldots, y_n) \to (t_1 := n y_1,\ t_2 := (n-1)(y_2 - y_1),\ \ldots,\ t_n := y_n - y_{n-1})$. Since $t_1 + \cdots + t_n = y_1 + \cdots + y_n$ and the corresponding Jacobian equals $n!$, we conclude that the joint density of $(nY_{1,n}, (n-1)(Y_{2,n} - Y_{1,n}), \ldots, Y_{n,n} - Y_{n-1,n})$ is equal to
\[ f(t_1, \ldots, t_n) = \prod_{i=1}^{n} e^{-t_i} \quad \text{for } t_i \ge 0,\ i = 1, \ldots, n. \]
This completes the proof of (3.12), and hence of (3.11).
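The property (3.12) lends itself to a quick Monte Carlo check. The following is a minimal sketch, not part of the original argument: it draws iid unit-mean exponential variables (an assumption matching $Y_i = -\log U_i$), forms the normalized spacings $(n-i+1)\{Y_{i,n} - Y_{i-1,n}\}$, and estimates their means, which should all be close to 1. The function names, sample size and seed are illustrative choices.

```python
import random


def normalized_spacings(n, rng):
    """Return (n-i+1)*(Y_(i) - Y_(i-1)) for i = 1..n, with Y_(0) = 0,
    where Y_(1) <= ... <= Y_(n) are order statistics of n iid Exp(1) draws."""
    ordered = sorted(rng.expovariate(1.0) for _ in range(n))
    previous = 0.0
    spacings = []
    for i, y in enumerate(ordered, start=1):
        spacings.append((n - i + 1) * (y - previous))
        previous = y
    return spacings


def spacing_means(n=5, reps=20000, seed=0):
    """Monte Carlo estimate of the mean of each normalized spacing."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(reps):
        for j, value in enumerate(normalized_spacings(n, rng)):
            totals[j] += value
    return [total / reps for total in totals]


means = spacing_means()
print(means)  # each entry should be close to 1 (unit-mean exponential)
```

Checking the means alone does not establish independence or the exponential shape; it is only a consistency check on the normalization $(n-i+1)$.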
P. Deheuvels
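Define the uniform spacings by
\[ S_{i,n} = U_{i,n} - U_{i-1,n} \quad \text{for } i = 1, \ldots, n+1. \tag{3.13} \]

Exercise 3.1. Denote the maximal and minimal uniform spacings, respectively, by
\[ M_n^+ = \max_{1 \le i \le n+1} S_{i,n} \quad \text{and} \quad M_n^- = \min_{1 \le i \le n+1} S_{i,n}. \]
Establish the limit laws, for $x \in \mathbb{R}$ and $y \ge 0$,
\[ \lim_{n\to\infty} P\big(nM_n^+ - \log n \le x\big) = e^{-e^{-x}} \quad \text{and} \quad \lim_{n\to\infty} P\big(n^2 M_n^- \le y\big) = 1 - e^{-y}. \]

Exercise 3.2. Fix a constant $c > 0$, and set $k_n = \lfloor c \log n \rfloor$. Letting $\ell(\cdot)$ be as in (1.61), set
\[ \gamma_c^+ = \inf\{x \ge 1 : \ell(x) \ge 1/c\} \quad \text{and} \quad \gamma_c^- = \sup\{x \le 1 : \ell(x) \ge 1/c\}. \]
Consider the statistics, for $1 \le k \le n+1$,
\[ M_{n,k}^+ = \max_{1 \le i \le n-k+2} \{U_{i+k-1,n} - U_{i-1,n}\} \quad \text{and} \quad M_{n,k}^- = \min_{1 \le i \le n-k+2} \{U_{i+k-1,n} - U_{i-1,n}\}. \]
Show that, in probability, as $n \to \infty$,
\[ \frac{n M_{n,k_n}^+}{k_n} \to \gamma_c^+ \quad \text{and} \quad \frac{n M_{n,k_n}^-}{k_n} \to \gamma_c^-. \]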
3.2. Uniform empirical and quantile processes

The uniform empirical process is defined by
\[ \alpha_n(t) = n^{1/2}(U_n(t) - t) \quad \text{for } t \in \mathbb{R}, \tag{3.14} \]
and the uniform quantile process is defined by
\[ \beta_n(t) = n^{1/2}(V_n(t) - t) \quad \text{for } t \in [0, 1]. \tag{3.15} \]
It is convenient to set $\beta_n(t) = 0$ for $t \notin [0, 1]$. The following results are well known, and will be discussed in the sequel.

Theorem 3.3. We have, for each $n \ge 1$ and $t \ge 0$,
\[ P(\|\beta_n\| \ge t) = P(\|\alpha_n\| \ge t) \le 2e^{-2t^2}. \tag{3.16} \]
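The bound (3.16) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it computes $\|\alpha_n\| = n^{1/2}\sup_t |U_n(t) - t|$ from the order statistics of a uniform sample (the supremum is attained at a jump of $U_n$), and compares the simulated exceedance frequency with $2e^{-2t^2}$. The choices $n = 50$, $t = 1$, the number of replications and the seed are arbitrary.

```python
import math
import random


def sup_alpha_n(sample):
    """Return sqrt(n) * sup_t |U_n(t) - t| for a sample of points in (0, 1).

    The supremum is attained just before or at one of the order statistics,
    so it suffices to inspect i/n - x_(i) and x_(i) - (i-1)/n.
    """
    n = len(sample)
    ordered = sorted(sample)
    deviation = 0.0
    for i, x in enumerate(ordered, start=1):
        deviation = max(deviation, i / n - x, x - (i - 1) / n)
    return math.sqrt(n) * deviation


def exceedance_frequency(n, t, reps, seed=0):
    """Monte Carlo estimate of P(||alpha_n|| >= t)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        if sup_alpha_n([rng.random() for _ in range(n)]) >= t:
            hits += 1
    return hits / reps


n, t = 50, 1.0
freq = exceedance_frequency(n, t, reps=4000)
bound = 2.0 * math.exp(-2.0 * t * t)
print(freq, bound)  # the frequency should not noticeably exceed the bound
```

A simulation of this kind cannot prove the inequality; it only shows that the empirical frequency is compatible with the stated bound for one choice of $n$ and $t$.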
Proof. Dvoretzky, Kiefer and Wolfowitz [86] showed that $P(\|\alpha_n\| \ge t) \le Ce^{-2t^2}$ for some universal constant $C \ge 2$, which Massart [157] proved may be taken equal to 2. We invoke (3.8) for the equality $\|\alpha_n\| = \|\beta_n\|$.

Remark 3.3. The extension of (3.16) to dimension $d \ge 2$ is delicate. For $x = (x_1, \ldots, x_d)$ and $y = (y_1, \ldots, y_d) \in \mathbb{R}^d$, write $x \le y$ when $x_j \le y_j$ for $j = 1, \ldots, d$.
Topics on Empirical Processes
Next, given an iid sequence $U_1, U_2, \ldots$ of random vectors, uniformly distributed on $[0, 1]^d$, define the $d$-dimensional uniform empirical process by
\[ \alpha_n(x) = n^{-1/2} \sum_{i=1}^{n} \Big( \mathbf{1}_{\{U_i \le x\}} - \prod_{j=1}^{d} x_j \Big), \tag{3.17} \]
for $x = (x_1, \ldots, x_d) \in [0, 1]^d$. An example of a Dvoretzky-Kiefer-Wolfowitz-type bound which can be obtained in this framework is given by Devroye [75, 76], who established the inequality
\[ P(\|\alpha_n\| \ge t) \le 2e^2 (2n)^d e^{-2t^2}, \tag{3.18} \]
for all $n \ge 1$ and $t \ge n^{-1/2} d^2$.

Theorem 3.4. (Chung) We have, with probability 1,
\[ \limsup_{n\to\infty} (2\log_2 n)^{-1/2} \|\alpha_n\| = \limsup_{n\to\infty} (2\log_2 n)^{-1/2} \|\beta_n\| = \frac{1}{2}. \tag{3.19} \]
Proof. This result is due to Chung [31]. Below, we give a simplified proof of (3.19) to illustrate the use of some important technical tools. We first observe that, for each specified $t \in [0, 1]$, we may write
\[ n^{1/2}\alpha_n(t) = \sum_{i=1}^{n} \big( \mathbf{1}_{\{U_i \le t\}} - t \big) = \sum_{i=1}^{n} (Y_i - E(Y_i)), \]
where $Y_i \overset{d}{=} Be(t)$, $i = 1, 2, \ldots$, is an iid sequence of random variables following Bernoulli distributions (denoted by $Be(t)$) with probability of success $P(Y_i = 1) = t$. Since $\mathrm{Var}(Y_i) = t(1-t)$, the law of the iterated logarithm [LIL] (see, e.g., (2.89)) implies that (for each specified $t \in [0, 1]$)
\[ \limsup_{n\to\infty} \frac{\pm\alpha_n(t)}{\sqrt{2\log_2 n}} = \sqrt{t(1-t)} \quad \text{a.s.} \tag{3.20} \]
By choosing $t = 1/2$ in (3.20), and recalling (3.8), we readily obtain that
\[ \limsup_{n\to\infty} (2\log_2 n)^{-1/2}\|\alpha_n\| = \limsup_{n\to\infty} (2\log_2 n)^{-1/2}\|\beta_n\| \ge \limsup_{n\to\infty} \frac{\pm\alpha_n(1/2)}{\sqrt{2\log_2 n}} = \frac{1}{2}. \tag{3.21} \]
To obtain the reverse inequality, we select $\gamma > 0$ and $\varepsilon > 0$, and set $n_k = \lfloor (1+\gamma)^k \rfloor$ for $k = 0, 1, 2, \ldots$, where $\lfloor u \rfloor \le u < \lfloor u \rfloor + 1$ denotes the integer part of $u$. Consider the events, for $k = 1, 2, \ldots$,
\[ A_k = \Big\{ \exists\, n \in (n_{k-1}, n_k] : n^{1/2}\|\alpha_n\| \ge (1 + 2\varepsilon) \sqrt{\tfrac{1}{2} n_k \log_2 n_k} \Big\}, \tag{3.22} \]
\[ B_k = \Big\{ n_k^{1/2}\|\alpha_{n_k}\| \ge (1 + \varepsilon) \sqrt{\tfrac{1}{2} n_k \log_2 n_k} \Big\}. \tag{3.23} \]
Making use of Ottaviani's lemma, stated in a general framework in Lemma 4.2 in the sequel, we readily obtain that, for all $k$ sufficiently large,
\[ P(A_k) \le 2P(B_k). \tag{3.24} \]
In view of (3.16) and (3.23), we obtain, in turn, that, for all large $k$,
\[ P(B_k) \le 2\exp\big( -(1+\varepsilon)^2 \log_2 n_k \big) = \frac{2}{(\log n_k)^{(1+\varepsilon)^2}}. \tag{3.25} \]
Since, as $k \to \infty$, $\log n_k = (1 + o(1))\, k \log(1+\gamma)$, we infer from (3.24) and (3.25) that
\[ \sum_{k=1}^{\infty} P(A_k) < \infty. \tag{3.26} \]
The Borel-Cantelli lemma, when combined with (3.26), implies that, with probability 1, only finitely many of the events $A_k$ occur. This, in turn, implies that, a.s., ultimately for all large $n$, with $k$ chosen such that $n_{k-1} < n \le n_k$,
\[ \|\alpha_n\| \le (1 + 2\varepsilon) \Big\{ \frac{n_k \log_2 n_k}{n_{k-1} \log_2 n_{k-1}} \Big\}^{1/2} \sqrt{\tfrac{1}{2}\log_2 n} \le (1 + 2\varepsilon)(1 + \gamma) \sqrt{\tfrac{1}{2}\log_2 n}. \tag{3.27} \]
Here, we have used the observation that, ultimately for all large $k$,
\[ \frac{n_k \log_2 n_k}{n_{k-1} \log_2 n_{k-1}} = (1 + o(1))(1 + \gamma) \le (1 + \gamma)^2. \]
Since $\varepsilon > 0$ and $\gamma > 0$ in (3.27) may be chosen as small as desired, it follows readily from this limiting statement that, with probability 1,
\[ \limsup_{n\to\infty} (2\log_2 n)^{-1/2}\|\alpha_n\| = \limsup_{n\to\infty} (2\log_2 n)^{-1/2}\|\beta_n\| \le \frac{1}{2}. \tag{3.28} \]
We conclude (3.19) by combining (3.21) with (3.28).

Exercise 3.3. Fix any $0 \le T \le 1$. Show that, almost surely,
\[ \limsup_{n\to\infty} (2\log_2 n)^{-1/2} \sup_{0 \le t \le T} |\alpha_n(t)| = \sup_{0 \le t \le T} \sqrt{t(1-t)}. \]
Exercise 3.4. Making use of (3.64) and (3.65), show that the constant $C$ in the Dvoretzky, Kiefer and Wolfowitz [86] inequality, $P(\|\alpha_n\| \ge t) \le Ce^{-2t^2}$, cannot be chosen less than 2.

3.3. Some further martingale inequalities

We have the following useful results concerning empirical processes and Brownian bridges.

Proposition 3.4. For each specified $n \ge 1$, the process
\[ \{\alpha_n(t)/(1-t) : 0 \le t < 1\} \tag{3.29} \]
defines a martingale.

Proof. It is equivalent to prove that the process $\{(nU_n(t) - nt)/(1-t) : 0 \le t < 1\}$ defines a martingale. Consider any $0 < s < t < 1$, and set $A = nU_n(s)$ and $B = nU_n(t) - nU_n(s)$. We readily obtain that
\begin{align*}
P\big(B = b \mid \sigma\{U_n(u) : 0 \le u \le s\}\big) &= P(B = b \mid A = a) \\
&= \frac{n!}{a!\, b!\, (n-a-b)!}\, s^a (t-s)^b (1-t)^{n-a-b} \Big/ \frac{n!}{a!\, (n-a)!}\, s^a (1-s)^{n-a} \\
&= \frac{(n-a)!}{b!\, (n-a-b)!}\, \frac{(t-s)^b (1-t)^{n-a-b}}{(1-s)^{n-a}} \\
&= \frac{(n-a)!}{b!\, (n-a-b)!} \Big( \frac{t-s}{1-t} \Big)^{b} \Big( \frac{1-t}{1-s} \Big)^{n-a}.
\end{align*}
Next, we write
\[ E(B \mid A = a) = \Big( \frac{1-t}{1-s} \Big)^{n-a} \sum_{b=0}^{n-a} b\, \frac{(n-a)!}{b!\, (n-a-b)!} \Big( \frac{t-s}{1-t} \Big)^{b} = (n-a)\, \frac{t-s}{1-s}, \]
whence
\[ E(B + A - nt \mid A = a) = \frac{n(t-s) + a(1-t)}{1-s} - nt = n(1-t) - (n-a)\, \frac{1-t}{1-s}, \]
and, finally,
\[ E\Big( \frac{B + A - nt}{1-t} \,\Big|\, A = a \Big) = \frac{a - ns}{1-s}, \]
which suffices for our needs.

Proposition 3.5. Let $\{B(t) : 0 \le t \le 1\}$ define a Brownian bridge. Then, the process
\[ \{B(t)/(1-t) : 0 \le t < 1\} \tag{3.30} \]
defines a martingale.

Proof. Omitted.
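The conditional-expectation identity used in the proof of Proposition 3.4 can be verified exactly for small $n$. The sketch below (function names are illustrative, not from the text) evaluates $E(B \mid A = a)$ from the conditional law of $B$ given $A = a$ written in the form $\frac{(n-a)!}{b!(n-a-b)!}\{(t-s)/(1-t)\}^b\{(1-t)/(1-s)\}^{n-a}$, and compares $E\{(B + A - nt)/(1-t) \mid A = a\}$ with $(a - ns)/(1-s)$.

```python
from math import comb


def cond_mean_B(n, a, s, t):
    """E(B | A = a), summing b * P(B = b | A = a) over b = 0..n-a."""
    m = n - a
    scale = ((1.0 - t) / (1.0 - s)) ** m
    ratio = (t - s) / (1.0 - t)
    return scale * sum(b * comb(m, b) * ratio ** b for b in range(m + 1))


def martingale_gap(n, a, s, t):
    """Absolute difference between the two sides of the martingale identity."""
    lhs = (cond_mean_B(n, a, s, t) + a - n * t) / (1.0 - t)
    rhs = (a - n * s) / (1.0 - s)
    return abs(lhs - rhs)


# Check every possible value a = 0..n of A = n U_n(s) for one choice of (n, s, t).
gaps = [martingale_gap(12, a, 0.3, 0.7) for a in range(13)]
print(max(gaps))  # should be zero up to floating-point rounding
```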
Proposition 3.4 has some interesting consequences. First, we obtain the following useful corollary.

Corollary 3.1. For each $n \ge 1$, $0 < T < 1$ and $\lambda > 0$, we have
\[ P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda \sqrt{T(1-T)} \Big) \le \frac{1}{\lambda^2}. \tag{3.31} \]

Proof. By Proposition 3.4 and (2.30), taken with $r = 2$, we obtain that, for $u > 0$ and $0 < T < 1$,
\begin{align*}
P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge u(1-T)\sqrt{T} \Big) &\le P\Big( \sup_{0 \le t \le T} \Big| \frac{\alpha_n(t)}{1-t} \Big| \ge u\sqrt{T} \Big) \\
&\le \frac{1}{u^2 T}\, E\Big( \frac{\alpha_n(T)}{1-T} \Big)^{2} = \frac{1}{u^2 T} \times \frac{T(1-T)}{(1-T)^2} = \frac{1}{u^2 (1-T)}.
\end{align*}
We readily obtain (3.31) by setting $u = \lambda/\sqrt{1-T}$ in this last inequality.
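Corollary 3.2. For each $n \ge 1$, $0 < T < 1$ and $\lambda > 0$, we have
\[ P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda\sqrt{T} \Big) \le 2\exp\Big( -nT\, h\Big( 1 + \frac{\lambda(1-T)}{\sqrt{nT}} \Big) \Big). \tag{3.32} \]
Moreover, if $0 < \lambda \le \sqrt{nT}/(1-T)$, we have
\begin{align*}
P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda\sqrt{T} \Big) &\le 2\exp\Big( -nT\, h\Big( 1 + \frac{\lambda(1-T)}{\sqrt{nT}} \Big) \Big) \\
&\le 2\exp\Big( -\frac{\lambda^2}{2}(1-T)^2 \Big\{ 1 - \frac{\lambda(1-T)}{3\sqrt{nT}} \Big\} \Big). \tag{3.33}
\end{align*}

Proof. First note that
\[ P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda\sqrt{T} \Big) \le P\Big( \sup_{0 \le t \le T} \frac{\alpha_n(t)}{1-t} \ge \lambda\sqrt{T} \Big) + P\Big( \sup_{0 \le t \le T} \frac{-\alpha_n(t)}{1-t} \ge \lambda\sqrt{T} \Big). \]
Next, we combine Proposition 3.4 with (2.31). We so obtain that
\begin{align*}
P\Big( \sup_{0 \le t \le T} \frac{\alpha_n(t)}{1-t} \ge u \Big) &\le \exp\Big( -\sup_{s} \Big\{ su - \log E \exp\Big( \frac{s\,\alpha_n(T)}{1-T} \Big) \Big\} \Big) \\
&= \exp\Big( -\sup_{s} \Big\{ \frac{s}{1-T}\,(u(1-T)) - \log E \exp\Big( \frac{s}{1-T}\,\alpha_n(T) \Big) \Big\} \Big) \\
&= \exp\Big( -\sup_{r} \big\{ r(u(1-T)) - \log E \exp( r\alpha_n(T) ) \big\} \Big).
\end{align*}
We combine this last inequality with Theorem 2.7 and (2.70), to obtain that
\[ P\Big( \sup_{0 \le t \le T} \frac{\alpha_n(t)}{1-t} \ge \lambda\sqrt{T} \Big) \le \exp\Big( -nT\, h\Big( 1 + \frac{\lambda(1-T)}{\sqrt{nT}} \Big) \Big). \]
A similar argument shows that
\[ P\Big( \sup_{0 \le t \le T} \frac{-\alpha_n(t)}{1-t} \ge \lambda\sqrt{T} \Big) \le \exp\Big( -nT\, h\Big( 1 - \frac{\lambda(1-T)}{\sqrt{nT}} \Big) \Big). \]
In view of (2.68) and (2.70), we readily conclude (3.33).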
Remark 3.4. In the forthcoming Proposition 3.9, we will obtain a refinement of the inequality (3.33) through a completely different argument.

3.4. Relations between empirical and Poisson processes

The empirical process is related to the Poisson process by the following general principle, which provides a construction of a Poisson process on a general locally compact separable metric space $\mathcal{X}$ (see, e.g., §1.3.1). Consider a sequence $X = X_1, X_2, \ldots$ of iid $\mathcal{X}$-valued random variables with common distribution $P_X(B) = P(X \in B)$, for $B$ belonging to the set $\mathcal{B}_{\mathcal{X}}$ of all Borel subsets of $\mathcal{X}$. Let further $K$ denote a random variable, independent of $X_1, X_2, \ldots$, and following a Poisson distribution with expectation $E(K) = \lambda$, namely
\[ P(K = k) = \frac{\lambda^k}{k!}\, e^{-\lambda} \quad \text{for } k = 0, 1, \ldots. \tag{3.34} \]
By definition, the Poisson process on $\mathcal{X}$ with mean measure $\nu(\cdot) = \lambda P_X(\cdot)$ is the random measure $\Pi(\cdot)$, defined on the set $\mathcal{B}_{\mathcal{X}}$ of Borel subsets of $\mathcal{X}$ by
\[ \Pi(B) = \sum_{i=1}^{K} \mathbf{1}_{\{X_i \in B\}}, \tag{3.35} \]
where, by convention, $\sum_{\emptyset}(\cdot) := 0$. Important features of the Poisson process $\Pi$ are the following.
1°) For each $B \in \mathcal{B}_{\mathcal{X}}$, $\Pi(B)$ follows a Poisson distribution with expectation $\nu(B)$;
2°) For each sequence $B_1, B_2, \ldots \in \mathcal{B}_{\mathcal{X}}$, with $\nu(B_i \cap B_j) = 0$ for all $i \ne j$, $\Pi(B_1), \Pi(B_2), \ldots$ are independent;
3°) Whenever $\Pi_1, \Pi_2, \ldots$ are independent Poisson processes with mean measures $\nu_1, \nu_2, \ldots$, the superposed process $\Pi_1 + \Pi_2 + \cdots$ is Poisson with mean measure $\nu_1 + \nu_2 + \cdots$.

Strictly speaking, the definition (3.35) is valid only for Poisson processes with bounded mean measures $\nu \in M_f^+(\mathcal{X})$, namely such that $\nu(\mathcal{X}) = \lambda P_X(\mathcal{X}) = \lambda < \infty$. To obtain a more general form of this definition, compatible with the above property (3°), we use the convention that a Poisson process with a possibly unbounded mean measure $\nu \in M^+(\mathcal{X})$ is defined as the superposition $\Pi_1 + \Pi_2 + \cdots$ of an infinite sequence of independent Poisson processes (in the sense of (3.35)), with $\nu_j(\mathcal{X}) < \infty$ for $j = 1, 2, \ldots$, and such that $\nu = \nu_1 + \nu_2 + \cdots$. At this point, we see why it is useful to assume that $\mathcal{X}$ is a locally compact separable metric space: this renders such a construction possible for an arbitrary nonnegative Radon measure $\nu$ on $\mathcal{X}$.

In the particular case where $\mathcal{X} = \mathbb{R}_+$ and $\nu$ is the Lebesgue measure on $\mathbb{R}_+$, we obtain the standard Poisson process, commonly defined through its right-continuous distribution function
\[ N(t) = \Pi\big((0, t]\big) \quad \text{for } t \ge 0. \tag{3.36} \]
The standard Poisson process $\{N(t) : t \ge 0\}$ has the following important property, which yields an alternative construction of this process.

Proposition 3.6. Let $T_k = \inf\{t : N(t) \ge k\}$ for $k = 0, 1, \ldots$. Then, $T_0 = 0$ and the random variables $\omega_k = T_k - T_{k-1}$, $k = 1, 2, \ldots$, are independent and exponentially distributed with unit mean.

Proof. Omitted.
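The alternative construction suggested by Proposition 3.6 can be sketched in a few lines: summing iid unit-mean exponential gaps gives the arrival times $T_k$, and $N(t)$ counts the arrivals up to time $t$. The code below is an illustration under that assumption (names, the horizon $t = 3$, and the number of replications are arbitrary); it checks that the simulated counts have mean and variance close to $t$, as a Poisson variable with expectation $t$ should.

```python
import random


def poisson_count(t, rng):
    """N(t) for a standard Poisson process built from iid Exp(1) gaps."""
    count = 0
    arrival = rng.expovariate(1.0)  # T_1
    while arrival <= t:
        count += 1
        arrival += rng.expovariate(1.0)  # T_{k+1} = T_k + omega_{k+1}
    return count


def mean_and_variance(t, reps, seed=0):
    """Sample mean and unbiased sample variance of N(t) over reps replications."""
    rng = random.Random(seed)
    counts = [poisson_count(t, rng) for _ in range(reps)]
    mean = sum(counts) / reps
    variance = sum((c - mean) ** 2 for c in counts) / (reps - 1)
    return mean, variance


m, v = mean_and_variance(t=3.0, reps=20000)
print(m, v)  # both should be close to 3
```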
For each $\lambda > 0$, the homogeneous Poisson process defined by $N_\lambda(t) = N(\lambda t)$ for $t \ge 0$, with $N(\cdot)$ as in (3.36), is such that $E(N_\lambda(t)) = \lambda t$ for $t \ge 0$. The next proposition is a straightforward consequence of the above definitions. Let $U_n(t)$ be, as in (3.2), the uniform empirical distribution function pertaining to the first $n \ge 1$ observations from an iid sequence $U_1, U_2, \ldots$ of uniform $(0, 1)$ random variables.

Proposition 3.7. For an arbitrary $\lambda > 0$ and $n \ge 1$, we have the distributional equality
\[ \mathcal{L}\big( \{nU_n(t) : 0 \le t \le 1\} \big) \overset{d}{=} \mathcal{L}\big( \{N_\lambda(t) : 0 \le t \le 1\} \,\big|\, N_\lambda(1) = n \big), \tag{3.37} \]
between the distribution $\mathcal{L}(\{nU_n(t) : 0 \le t \le 1\})$ of $\{nU_n(t) : 0 \le t \le 1\}$, and the conditional distribution $\mathcal{L}(\{N_\lambda(t) : 0 \le t \le 1\} \mid N_\lambda(1) = n)$ of $\{N_\lambda(t) : 0 \le t \le 1\}$, given that $N_\lambda(1) = n$.

Proof. Omitted.
The fact that a homogeneous Poisson process has independent increments renders this process more tractable than the empirical process. The next proposition gives a useful trick to replace empirical processes by Poisson processes in a series of calculations.

Proposition 3.8. For each $0 < T < 1$, there exists a universal constant $C_T$ with the following property. For any measurable subset $B$ of the set $(I_+[0, T], \mathcal{W})$ of nondecreasing right-continuous nonnegative functions on $[0, T]$, endowed with the weak topology, and each $n \ge 1$, we have
\[ P\big( \{nU_n(t) : 0 \le t \le T\} \in B \big) \le C_T\, P\big( \{N_n(t) : 0 \le t \le T\} \in B \big). \tag{3.38} \]
In addition, for each $\varepsilon > 0$, there exists an $n_\varepsilon < \infty$ such that (3.38) holds for all $n \ge n_\varepsilon$ with
\[ C_T = (1 + \varepsilon)/\sqrt{1 - T}. \tag{3.39} \]

Proof. This elementary result has been used repeatedly in a series of papers in the literature (see, e.g., Lemma 2.7 in Deheuvels [54]). Let $N_\lambda(\cdot)$ be as in (3.37), and set $\lambda = n$. Introduce the events $E_1 = \{\{nU_n(t) : 0 \le t \le T\} \in B\}$ and $E_2 = \{\{N_n(t) : 0 \le t \le T\} \in B\}$. Set $R = N_n(T)$ and $R' = N_n(1) - N_n(T)$. By (3.37), and the independence of $(E_2, R)$ and $R'$, we have
\begin{align*}
P(E_1) = P(E_2 \mid N_n(1) = n) &= P(E_2 \mid R + R' = n) = \frac{P(E_2 \cap \{R + R' = n\})}{P(R + R' = n)} \\
&= \frac{1}{P(R + R' = n)} \sum_{k=0}^{n} P(E_2 \cap \{R = n - k\})\, P(R' = k) \\
&\le \frac{1}{P(R + R' = n)} \Big\{ \sup_{k \ge 0} P(R' = k) \Big\} \sum_{k=0}^{n} P(E_2 \cap \{R = n - k\}) \\
&\le \frac{1}{P(R + R' = n)} \Big\{ \sup_{k \ge 0} P(R' = k) \Big\} P(E_2).
\end{align*}
Next, we make use of the observation that, if $K$ follows a Poisson distribution with expectation $\lambda$, then
\[ \sup_{k \ge 0} P(K = k) = P(K = \lfloor \lambda \rfloor), \tag{3.40} \]
where $\lfloor \lambda \rfloor \le \lambda < \lfloor \lambda \rfloor + 1$ is the integer part of $\lambda$. To establish this property, observe that, if
\[ p_k := P(K = k) = \frac{\lambda^k}{k!}\, e^{-\lambda} \quad \text{for } k = 0, 1, \ldots, \]
then
\[ \frac{p_k}{p_{k-1}} = \frac{\lambda}{k} \quad \begin{cases} \ge 1 & \text{for } k \le \lambda \Leftrightarrow k \le \lfloor \lambda \rfloor, \\ < 1 & \text{for } k > \lambda \Leftrightarrow k > \lfloor \lambda \rfloor, \end{cases} \]
from which (3.40) is straightforward. By all this, we infer from the Stirling formula,
\[ N! = (1 + o(1))\, (N/e)^N \sqrt{2\pi N} \quad \text{as } N \to \infty, \]
that, as $n \to \infty$,
\[ \frac{1}{P(R + R' = n)} \sup_{k \ge 0} P(R' = k) = \frac{n!}{n^n}\, e^{n} \times \frac{(n(1-T))^{\lfloor n(1-T) \rfloor}}{\lfloor n(1-T) \rfloor!}\, e^{-n(1-T)} = \frac{1 + o(1)}{\sqrt{1 - T}}. \]
This, in turn, suffices to show that
\[ C_T := \sup_{n \ge 1} \frac{1}{P(R + R' = n)} \sup_{k \ge 0} P(R' = k) < \infty. \]
Given this relation, we readily obtain (3.38) and (3.39), as sought.
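The two ingredients of this proof, the location of the Poisson mode in (3.40) and the Stirling limit $1/\sqrt{1-T}$, can be checked numerically. The following sketch is illustrative only: it works with logarithms of Poisson probabilities (via `math.lgamma`) to avoid overflow, and the helper names, the truncation level `kmax`, and the values $n = 10000$, $T = 1/2$ are assumptions made for the example.

```python
import math


def log_poisson_pmf(k, mu):
    """log P(K = k) for K Poisson with expectation mu > 0."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def mode_is_floor(mu, kmax=200):
    """Check (3.40) for a non-integer mu: the pmf is maximal at floor(mu)."""
    log_probs = [log_poisson_pmf(k, mu) for k in range(kmax + 1)]
    best = max(range(kmax + 1), key=log_probs.__getitem__)
    return best == math.floor(mu)


def ratio(n, T):
    """sup_k P(R' = k) / P(R + R' = n), with R' Poisson(n(1-T)), R + R' Poisson(n)."""
    mu = n * (1.0 - T)
    log_sup = log_poisson_pmf(math.floor(mu), mu)  # uses (3.40)
    log_den = log_poisson_pmf(n, float(n))
    return math.exp(log_sup - log_den)


value = ratio(10000, 0.5)
print(value, 1.0 / math.sqrt(0.5))  # the two numbers should nearly agree
```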
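The next proposition gives an example of how (3.38) may be applied.

Proposition 3.9. For any $0 < T_0 < 1$ and $\varepsilon > 0$, there exists an $n_\varepsilon < \infty$ such that, for all $n \ge n_\varepsilon$, we have, uniformly over $0 < T \le T_0$ and $0 \le \lambda \le \sqrt{nT}$,
\begin{align*}
P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda\sqrt{T} \Big) &\le \frac{2(1+\varepsilon)}{\sqrt{1 - T_0}} \exp\Big( -nT\, h\Big( 1 + \frac{\lambda}{\sqrt{nT}} \Big) \Big) \\
&\le \frac{2(1+\varepsilon)}{\sqrt{1 - T_0}} \exp\Big( -\frac{\lambda^2}{2} \Big\{ 1 - \frac{\lambda}{3\sqrt{nT}} \Big\} \Big). \tag{3.41}
\end{align*}

Proof. Let $\{\Pi(t) : t \ge 0\}$ denote a standard Poisson process. By combining (3.38) and (3.39) with (2.65), taken with $\alpha = \lambda/\sqrt{nT}$, we readily obtain that, for all $n$ sufficiently large,
\begin{align*}
P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda\sqrt{T} \Big) &\le \frac{1+\varepsilon}{\sqrt{1 - T_0}}\, P\Big( \sup_{0 \le t \le nT} |\Pi(t) - t| \ge nT\, \frac{\lambda}{\sqrt{nT}} \Big) \\
&\le \frac{2(1+\varepsilon)}{\sqrt{1 - T_0}} \exp\Big( -nT\, h\Big( 1 + \frac{\lambda}{\sqrt{nT}} \Big) \Big),
\end{align*}
which, in view of (2.70), readily yields (3.41).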
3.5. Strong approximations of empirical and quantile processes

The uniform empirical process $\alpha_n(\cdot)$ fulfills
\[ E(\alpha_n(t)) = 0 \quad \text{and} \quad E\big( \{m^{1/2}\alpha_m(s)\}\{n^{1/2}\alpha_n(t)\} \big) = \{m \wedge n\}\{s \wedge t - st\}, \]
for all $0 \le s, t \le 1$ and $m, n \ge 1$. For each $n \ge 1$, the Gaussian process with the same covariance structure as $\{\alpha_n(t) : 0 \le t \le 1\}$ is the Brownian bridge, conveniently defined by $B(t) = W(t) - tW(1)$, where $W(\cdot)$ is a (standard one-dimensional) Wiener process. The two-parameter Gaussian process with the same covariance structure as $\{n^{1/2}\alpha_n(t) : 0 \le t \le 1,\ n \ge 0\}$ (with the convention that $n^{1/2}\alpha_n(t) = 0$ for $n = 0$) is the Kiefer process $\{K(t, s) : s \ge 0,\ t \ge 0\}$ (refer to Kiefer [127]). For integer values of $n \ge 0$, the Kiefer process $K(t, n)$ is conveniently defined by setting
\[ K(t, n) = \sum_{i=1}^{n} B_i(t) \quad \text{for } 0 \le t \le 1, \tag{3.42} \]
where $B_1, B_2, \ldots$ is an iid sequence of Brownian bridges. Given a (standard two-dimensional) Wiener process $\{W(t, s) : t \ge 0,\ s \ge 0\}$, such that, for $s, t, s', t', s'', t'' \ge 0$,
\[ E(W(t, s)) = 0 \quad \text{and} \quad E\big( W(t', s')\, W(t'', s'') \big) = \{t' \wedge t''\}\{s' \wedge s''\}, \tag{3.43} \]
one defines a Kiefer process via
\[ K(t, s) = W(t, s) - tW(1, s) \quad \text{for } t \ge 0 \text{ and } s \ge 0. \tag{3.44} \]
The Doob-Donsker weak invariance principle (see, e.g., Billingsley [19], and the references therein) establishes the weak convergence
\[ \alpha_n \overset{W}{\to} B, \tag{3.45} \]
which holds for the probability measures induced by $\alpha_n$ and $B$ on the set $D[0, 1]$ endowed with the Skorohod topology (see, e.g., §2.2.2 and Ch. 3 in [19]). In many applications, it is more appropriate and convenient to make use of strong invariance principles, where the empirical processes and Brownian bridges are defined on the same probability space. Below, we cite the most useful theorems of this kind in the framework of our study. We start with the celebrated Komlós, Major and Tusnády [129, 130] invariance principle for the uniform empirical process (stated in Theorem 3.5), and the Csörgő and Révész variant for the empirical quantile process (stated in Theorem 3.6).

Theorem 3.5. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{B_n : n \ge 1\}$ of Brownian bridges, in such a way that, for appropriate constants $a_1 > 0$, $a_2 > 0$ and $a_3 > 0$, we have, for all $n \ge 1$ and $t \in \mathbb{R}$,
\[ P\Big( \|\alpha_n - B_n\| \ge \frac{a_1 \log n + t}{\sqrt{n}} \Big) \le a_2 \exp(-a_3 t). \tag{3.46} \]
Proof. The original papers of Komlós, Major and Tusnády [129, 130] give hardly any proof for (3.46), and the first self-contained proof of this inequality was given by Mason and van Zwet [155]. Some details are also given in Csörgő and Révész [39], Bretagnolle and Massart [26] and Csörgő and Horváth [38].

Remark 3.5. The optimal choice of $a_1, a_2, a_3$ in (3.46) (as well as that of $a_4, a_5, a_6$ in (3.47) below) remains an open problem. The best presently known result in this direction is due to Bretagnolle and Massart [26], who showed that (3.46) holds for $n \ge 2$ with $a_1 = 12$, $a_2 = 2$ and $a_3 = 1/6$.

Theorem 3.6. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{B_n^* : n \ge 1\}$ of Brownian bridges, in such a way that, for appropriate constants $a_4 > 0$, $a_5 > 0$ and $a_6 > 0$, we have, for all $n \ge 1$ and $t \in \mathbb{R}$,
\[ P\Big( \|\beta_n - B_n^*\| \ge \frac{a_4 \log n + t}{\sqrt{n}} \Big) \le a_5 \exp(-a_6 t). \tag{3.47} \]

Proof. See, e.g., Csörgő and Révész [39].
A very useful refinement of Theorem 3.5, due to Mason and van Zwet [155], is cited in the next theorem. Introduce the following notation. Set
\[ \log_+ u = \log_1 u = \log(u \vee e) \quad \text{for } u \in \mathbb{R}, \tag{3.48} \]
and define recursively, for each integer $p \ge 2$,
\[ \log_p u = \log_+\big( \log_{p-1}(u) \big) \quad \text{for } u \in \mathbb{R}. \tag{3.49} \]
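The recursive definition (3.48)–(3.49) translates directly into code. The sketch below is a plain transcription (the function names `log_plus` and `log_p` are chosen here for illustration); it shows in particular that the truncation at $e$ keeps every iterate at least equal to 1, so that expressions such as $\log_2 n$ and $\log_3 n$ are well defined for all $n \ge 1$.

```python
import math


def log_plus(u):
    """log_+ u = log(max(u, e)), as in (3.48); defined for every real u."""
    return math.log(max(u, math.e))


def log_p(u, p):
    """p-fold iterated truncated logarithm, as in (3.49), for an integer p >= 1."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    value = u
    for _ in range(p):
        value = log_plus(value)
    return value


print(log_p(-5.0, 1))        # truncation at e: equals log(e)
print(log_p(10.0 ** 6, 2))   # log(log(10^6))
print(log_p(3.0, 3))         # small arguments are clamped at every step
```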
Theorem 3.7. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{B_n : n \ge 1\}$ of Brownian bridges, in such a way that, for appropriate constants $a_7 > 0$, $a_8 > 0$ and $a_9 > 0$, we have, for all $n \ge 1$, $t \in \mathbb{R}$ and $0 < d \le n$,
\[ P\Big( \sup_{0 \le u \le d/n} |\alpha_n(u) - B_n(u)| \ge \frac{a_7 \log_+ d + t}{\sqrt{n}} \Big) \le a_8 \exp(-a_9 t). \tag{3.50} \]

Proof. See, e.g., Mason and van Zwet [155].

The following two theorems provide (very likely non-optimal) versions of Theorems 3.5 and 3.7 with respect to the approximation of the empirical process by Kiefer processes.

Theorem 3.8. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{\widetilde{B}_n : n \ge 1\}$ of independent Brownian bridges, in such a way that, for appropriate constants $a_{10} > 0$, $a_{11} > 0$ and $a_{12} > 0$, we have, for all $n \ge 1$ and $t \in \mathbb{R}$,
\[ P\Big( \Big\| \alpha_n - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{B}_i \Big\| \ge \frac{(\log_+ n)(a_{10} \log_+ n + t)}{\sqrt{n}} \Big) \le a_{11} \exp(-a_{12} t). \tag{3.51} \]

Proof. See, e.g., Komlós, Major and Tusnády [129, 130].
Theorem 3.9. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{\widetilde{B}_n : n \ge 1\}$ of independent Brownian bridges, in such a way that, for appropriate constants $a_{10} > 0$, $a_{11} > 0$ and $a_{12} > 0$, we have, for all $n \ge 1$ and $t \in \mathbb{R}$,
\[ P\Big( \sup_{0 \le u \le d/n} \Big| \alpha_n(u) - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{B}_i(u) \Big| \ge \frac{(\log_+ n)(a_{10} \log_+ n + t)}{\sqrt{n}} \Big) \le a_{11} \exp(-a_{12} t). \tag{3.52} \]

Proof. See, e.g., Castelle and Laurent-Bonvalot [27] and Castelle [28].

Theorem 3.10. On a suitable probability space, we have, almost surely as $n \to \infty$,
\[ \Big\| \alpha_n - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{B}_i \Big\| = O\Big( \frac{(\log n)^2}{\sqrt{n}} \Big), \tag{3.53} \]
and
\[ \Big\| \beta_n + \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{B}_i \Big\| = O\big( n^{-1/4} (\log n)^{1/2} (\log_2 n)^{1/4} \big), \tag{3.54} \]
where $\{\widetilde{B}_n : n \ge 1\}$ is a sequence of independent Brownian bridges. In addition, there does not exist a sequence $\{\widehat{B}_n : n \ge 1\}$ of independent Brownian bridges such that, almost surely,
\[ \Big\| \beta_n + \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widehat{B}_i \Big\| = o\big( n^{-1/4} (\log n)^{1/2} (\log_2 n)^{1/4} \big). \tag{3.55} \]

Proof. By combining Theorem 3.9 with the Borel-Cantelli lemma we obtain (3.53). This, when combined with the Bahadur-Kiefer representation (see, e.g., Theorem 3.21 in the sequel), yields (3.54), first stated by Csörgő and Révész [39]. The fact that (3.54) is optimal was proved by Deheuvels [55], who showed that (3.55) cannot hold a.s. for any possible choice of the iid sequence $\{\widehat{B}_n : n \ge 1\}$.

The following theorem is due to Csörgő, Csörgő, Horváth and Mason [41], and Csörgő and Horváth [37].

Theorem 3.11. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{B_n : n \ge 1\}$ of Brownian bridges, in such a way that the following limiting property holds. For any $0 \le \nu < \frac{1}{2}$, $0 \le \tau < \frac{1}{4}$ and $c > 0$, we have, as $n \to \infty$,
\[ n^{\nu} \sup_{c/n \le t \le 1 - c/n} \frac{|\alpha_n(t) - B_n(t)|}{\{t(1-t)\}^{\frac{1}{2} - \nu}} = O_P(1) \quad \text{and} \quad n^{\tau} \sup_{c/n \le t \le 1 - c/n} \frac{|\beta_n(t) + B_n(t)|}{\{t(1-t)\}^{\frac{1}{2} - \tau}} = O_P(1). \tag{3.56} \]
Exercise 3.5. Show that, under the assumptions of Theorems 3.5 and 3.6, we have, almost surely,
\[ \|\alpha_n - B_n\| = O\Big( \frac{\log n}{\sqrt{n}} \Big) \quad \text{and} \quad \|\beta_n - B_n^*\| = O\Big( \frac{\log n}{\sqrt{n}} \Big). \tag{3.57} \]

3.6. Some results for weighted processes

Below, we mention some useful results on weighted empirical processes. We refer to Einmahl and Mason [93, 94] for similar results for quantile processes and multivariate empirical processes. The standardized empirical process is defined by
\[ \gamma_n(t) = \frac{\alpha_n(t)}{\sqrt{t(1-t)}} \quad \text{for } 0 < t < 1, \tag{3.58} \]
and has the property that, for each $0 < t < 1$,
\[ \gamma_n(t) \overset{d}{\to} N(0, 1) \quad \text{as } n \to \infty. \]
The following limit laws are available for functionals of $\alpha_n$ and $\gamma_n$. Let
\[ T_n = \sup_{0 < t < 1} |\gamma_n(t)|, \qquad A_n = \int_0^1 \gamma_n^2(t)\, dt, \tag{3.59} \]
\[ D_n = \sup_{0 \le t \le 1} |\alpha_n(t)|, \qquad \omega_n^2 = \int_0^1 \alpha_n^2(t)\, dt, \tag{3.60} \]
\[ D_n^+ = \sup_{0 \le t \le 1} \alpha_n(t), \qquad D_n^- = \sup_{0 \le t \le 1} \{-\alpha_n(t)\}. \tag{3.61} \]
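Smirnov [189, 190, 192] showed that
\[ \omega_n^2 \overset{d}{\to} \int_0^1 B^2(t)\, dt \overset{d}{=} \sum_{k=1}^{\infty} \frac{Y_k^2}{k^2 \pi^2}, \tag{3.62} \]
where $B(\cdot)$ is a Brownian bridge, and $\{Y_k : k \ge 1\}$ an iid sequence of $N(0, 1)$ r.v.'s, and Anderson and Darling [5] showed likewise that
\[ A_n \overset{d}{\to} \int_0^1 \frac{B^2(t)}{t(1-t)}\, dt \overset{d}{=} \sum_{k=1}^{\infty} \frac{Y_k^2}{k(k+1)}. \tag{3.63} \]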
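A first-moment computation helps to see which series goes with which integral. Since $E(B^2(t)) = t(1-t)$, the unweighted integral $\int_0^1 B^2(t)\,dt$ has mean $\int_0^1 t(1-t)\,dt = 1/6$, while the weighted integral $\int_0^1 B^2(t)/\{t(1-t)\}\,dt$ has mean 1; on the series side, $E(Y_k^2) = 1$, so the mean is simply the sum of the weights. The sketch below (helper name and truncation level are illustrative) evaluates the partial sums of the two weight sequences.

```python
import math


def partial_sum(weight, terms):
    """Sum of weight(k) for k = 1..terms."""
    return sum(weight(k) for k in range(1, terms + 1))


terms = 100000
# Weights 1/(k^2 pi^2): the sum tends to (pi^2/6)/pi^2 = 1/6.
cvm_mean = partial_sum(lambda k: 1.0 / (k * k * math.pi ** 2), terms)
# Weights 1/(k(k+1)): telescoping sum, tends to 1.
ad_mean = partial_sum(lambda k: 1.0 / (k * (k + 1)), terms)

print(cvm_mean, 1.0 / 6.0)
print(ad_mean, 1.0)
```

The weights $1/(k^2\pi^2)$ therefore match the mean of $\int_0^1 B^2(t)\,dt$, and the weights $1/\{k(k+1)\}$ match the mean of the integral weighted by $1/\{t(1-t)\}$.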
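For the other statistics (see, e.g., Kolmogorov [128] and Smirnov [191]), it holds that, for each $t > 0$,
\[ \lim_{n\to\infty} P(D_n \ge t) = P\big( \|B\| \ge t \big) = 2 \sum_{k=1}^{\infty} (-1)^{k+1} e^{-2k^2 t^2}, \tag{3.64} \]
\[ \lim_{n\to\infty} P(D_n^{\pm} \ge t) = P\Big( \sup_{0 \le s \le 1} \pm B(s) \ge t \Big) = e^{-2t^2}, \tag{3.65} \]
and, for each $t \in \mathbb{R}$ (see, e.g., Eicker [88] and Jaeschke [116]),
\[ \lim_{n\to\infty} P\Big( T_n \sqrt{2\log_2 n} - 2\log_2 n - \tfrac{1}{2}\log_3 n - \tfrac{1}{2}\log \pi < t \Big) = e^{-2e^{-t}}. \tag{3.66} \]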
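Theorem 3.12. (Csáki) Let $\{a_n : n \ge 1\}$ be a sequence of positive constants. Then
\[ \sum_{n=1}^{\infty} a_n = \infty \ \Rightarrow\ \limsup_{n\to\infty} T_n \sqrt{n a_n} = \infty \quad \text{a.s.} \tag{3.67} \]
If, in addition, $n a_n \downarrow$, then
\[ \sum_{n=1}^{\infty} a_n < \infty \ \Rightarrow\ \lim_{n\to\infty} T_n \sqrt{n a_n} = 0 \quad \text{a.s.} \tag{3.68} \]
Moreover,
\[ \liminf_{n\to\infty} \frac{T_n}{\sqrt{2\log_2 n}} = 1 \quad \text{a.s.} \tag{3.69} \]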
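Theorem 3.12 has been extended by Einmahl and Mason [93] as follows. Consider the statistic
\[ T_{n,\nu} = n^{\nu - \frac{1}{2}} \sup_{0 < t < 1} \frac{|\alpha_n(t)|}{\{t(1-t)\}^{1-\nu}} = n^{\nu - \frac{1}{2}} \sup_{0 < t < 1} \{t(1-t)\}^{\nu - \frac{1}{2}} |\gamma_n(t)|, \tag{3.70} \]
for $0 \le \nu \le 1/2$. Keeping in mind that $T_n = T_{n,\frac{1}{2}}$, we have the following theorem.

Theorem 3.13. (Einmahl and Mason) Let $\{a_n : n \ge 1\}$ be a sequence of positive constants. Then
\[ \sum_{n=1}^{\infty} a_n = \infty \ \Rightarrow\ \limsup_{n\to\infty} T_{n,\nu} (n a_n)^{1-\nu} = \infty \quad \text{a.s.} \tag{3.71} \]
If, in addition, $n a_n \downarrow$, then
\[ \sum_{n=1}^{\infty} a_n < \infty \ \Rightarrow\ \lim_{n\to\infty} T_{n,\nu} (n a_n)^{1-\nu} = 0. \tag{3.72} \]

Proof. See, e.g., Einmahl and Mason [93].

We have the following easy corollary of Theorem 3.13.

Corollary 3.3. For each $0 \le \nu \le 1/2$, we have, almost surely,
\[ \limsup_{n\to\infty} \frac{\log T_{n,\nu}}{\log_2 n} = 1 - \nu. \tag{3.73} \]

Proof. Set $a_n = 1/\{n(\log n)^{1+\eta}\}$ in (3.71)–(3.72), with $\eta = 0$ in (3.71), where the series diverges, and with $\eta > 0$ arbitrarily small in (3.72), where it converges.

The following useful consequence of Corollary 3.3 has some interesting applications. For each $\varepsilon > 0$, we have, almost surely for all large $n$,
\[ T_n = T_{n,\frac{1}{2}} = \sup_{0 < t < 1} \frac{|\alpha_n(t)|}{\{t(1-t)\}^{\frac{1}{2}}} \le (\log n)^{\frac{1}{2} + \varepsilon}. \tag{3.74} \]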
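Exercise 3.6. 1°) Show, as a consequence of (3.66), that
\[ \lim_{n\to\infty} \frac{T_n}{\sqrt{2\log_2 n}} = 1 \quad \text{in probability.} \tag{3.75} \]
2°) Show, as a consequence of Theorem 3.12, that
\[ \liminf_{n\to\infty} \frac{T_n}{\sqrt{2\log_2 n}} = 1 \quad \text{and} \quad \limsup_{n\to\infty} \frac{T_n}{\sqrt{2\log_2 n}} = \infty \quad \text{a.s.} \tag{3.76} \]
3°) Show that the upper bound (3.74) is optimal, in the sense that
\[ \limsup_{n\to\infty} \frac{T_n}{\sqrt{\log n}} = \infty \quad \text{a.s.} \tag{3.77} \]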
Exercise 3.7. 1°) Show that, for any fixed $c > 0$, we have, almost surely,
\[ \int_0^1 \frac{\alpha_n^2(t)}{t(1-t)}\, dt - \int_{c/n}^{1 - c/n} \frac{\alpha_n^2(t)}{t(1-t)}\, dt \to 0. \tag{3.78} \]
2°) Show that, for any $\varepsilon_1 > 0$ and $\varepsilon_2 > 0$, there exists an $\eta = \eta_{\varepsilon_1, \varepsilon_2} \in (0, \frac{1}{2})$ such that
\[ \limsup_{n\to\infty} P\Big( \int_{1/n}^{\eta} \frac{\alpha_n^2(t)}{t(1-t)}\, dt + \int_{1-\eta}^{1 - 1/n} \frac{\alpha_n^2(t)}{t(1-t)}\, dt \le \varepsilon_1 \Big) \ge 1 - \varepsilon_2. \tag{3.79} \]
3°) Show that, for each $\eta \in (0, \frac{1}{2})$, we have
\[ \int_{\eta}^{1-\eta} \frac{\alpha_n^2(t)}{t(1-t)}\, dt \overset{d}{\to} \int_{\eta}^{1-\eta} \frac{B^2(t)}{t(1-t)}\, dt. \tag{3.80} \]
4°) Show that (3.80) also holds for $\eta = 0$ [Hint: make use of (3.46), (3.56) and (3.73)].

3.7. Finkelstein's theorem via invariance principles

In this section, we will denote by $(C[0, 1], \mathcal{U})$ (resp. $(B[0, 1], \mathcal{U})$) the set of continuous (resp. bounded) functions on $[0, 1]$, endowed with the uniform topology $\mathcal{U}$, defined by the sup-norm
\[ \|f\| = \sup_{0 \le t \le 1} |f(t)|. \tag{3.81} \]
We denote by $AC[0, 1]$ the set of all absolutely continuous functions $f$ on $[0, 1]$ with Lebesgue derivative $\dot{f} = \frac{df}{dx}$. The Strassen set, $\mathbb{S}$, and the Finkelstein set, $\mathbb{F}$, are as follows.
\[ \mathbb{S} = \Big\{ f \in AC[0, 1] : f(0) = 0,\ \int_0^1 \dot{f}^2(t)\, dt \le 1 \Big\}, \tag{3.82} \]
\[ \mathbb{F} = \Big\{ f \in AC[0, 1] : f(0) = f(1) = 0,\ \int_0^1 \dot{f}^2(t)\, dt \le 1 \Big\} = \big\{ f \in \mathbb{S} : f(1) = 0 \big\}. \tag{3.83} \]
The next theorem is due to Finkelstein [98].

Theorem 3.14. (Finkelstein) We have, with probability 1,
\[ \lim_{n\to\infty} \inf_{f \in \mathbb{F}} \big\| (2\log_2 n)^{-1/2} \alpha_n - f \big\| = \sup_{f \in \mathbb{F}} \liminf_{n\to\infty} \big\| (2\log_2 n)^{-1/2} \alpha_n - f \big\| = 0, \tag{3.84} \]
and
\[ \lim_{n\to\infty} \inf_{f \in \mathbb{F}} \big\| (2\log_2 n)^{-1/2} \beta_n - f \big\| = \sup_{f \in \mathbb{F}} \liminf_{n\to\infty} \big\| (2\log_2 n)^{-1/2} \beta_n - f \big\| = 0. \tag{3.85} \]

Proof. By Theorem 3.8, we may define, without loss of generality, the sequence $\{\alpha_n : n \ge 1\}$ on the same probability space as a sequence $\{B_n : n \ge 1\}$ of independent Brownian bridges, in such a way that
\[ \Big\| n^{1/2} \alpha_n - \sum_{i=1}^{n} B_i \Big\| = O\big( (\log n)^2 \big) \quad \text{a.s.} \tag{3.86} \]
In view of Proposition 4.3, an application of Theorem 4.3 shows that, with probability 1,
\[ \lim_{n\to\infty} \inf_{f \in \mathbb{F}} \Big\| (2n\log_2 n)^{-1/2} \sum_{i=1}^{n} B_i - f \Big\| = 0, \tag{3.87} \]
\[ \sup_{f \in \mathbb{F}} \liminf_{n\to\infty} \Big\| (2n\log_2 n)^{-1/2} \sum_{i=1}^{n} B_i - f \Big\| = 0. \tag{3.88} \]
The first half, (3.84), of Theorem 3.14 is a consequence of (3.86), (3.87) and (3.88). The second half, (3.85), of the theorem follows from (3.84), in combination with the Bahadur-Kiefer representation (3.125).

Exercise 3.8. 1°) Show that $\sup_{f \in \mathbb{F}} \|f\| = \frac{1}{2}$.
2°) By combining this result with Theorem 3.14 and Theorem 3.10, give a proof of Theorem 3.4.
3.8. Local and tail empirical process

Fix any $t_0 \in [0, 1]$, and select a sequence of positive constants $\{h_n : n \ge 1\}$ fulfilling
(H.1) $h_n \downarrow 0$, $nh_n \uparrow \infty$;
(H.4) $nh_n/\log_2 n \to r \in [0, \infty]$.
The local empirical and quantile processes at $t_0$ are the random functions of $u$ defined, respectively, by
\[ \xi_{n,t_0}(u) = h_n^{-1/2} \big\{ \alpha_n(t_0 + h_n u) - \alpha_n(t_0) \big\}, \tag{3.89} \]
and
\[ \zeta_{n,t_0}(u) = h_n^{-1/2} \big\{ \beta_n(t_0 + h_n u) - \beta_n(t_0) \big\}. \tag{3.90} \]
When $t_0 = 0$ and $u \in [0, 1]$, we obtain the tail empirical process, denoted hereafter by $\xi_n = \xi_{n,0}$, for $\alpha_n$, and the tail quantile process, denoted hereafter by $\zeta_n = \zeta_{n,0}$, for $\beta_n$. The case where $t_0 = 1$ and $u \in [-1, 0]$ being similar, by symmetry, the tail processes will be considered below for $t_0 = 0$ and $u \in [0, 1]$ only. We first state a series of results for the tail empirical and quantile processes, corresponding to $t_0 = 0$ and $u \in [0, 1]$. Kiefer [126] (see, e.g., [49]) showed the following limit laws. Set, for any $c > 0$,
\[ \delta_c^+ = \inf\{x > 1 : h(x) \ge 1/c\}, \qquad \delta_c^- = \sup\{x < 1 : h(x) \ge 1/c\}, \tag{3.91} \]
\[ \gamma_c^+ = \inf\{x > 1 : \ell(x) \ge 1/c\}, \qquad \gamma_c^- = \sup\{x < 1 : \ell(x) \ge 1/c\}. \tag{3.92} \]

Theorem 3.15. Under (H.1) and (H.4) with $nh_n/\log_2 n \to \infty$, we have, almost surely,
\[ \limsup_{n\to\infty} \pm(2\log_2 n)^{-1/2} \xi_n(h_n) = \limsup_{n\to\infty} \pm(2h_n\log_2 n)^{-1/2} \alpha_n(h_n) = 1, \tag{3.93} \]
\[ \limsup_{n\to\infty} \pm(2\log_2 n)^{-1/2} \zeta_n(h_n) = \limsup_{n\to\infty} \pm(2h_n\log_2 n)^{-1/2} \beta_n(h_n) = 1. \tag{3.94} \]
On the other hand, when $nh_n/\log_2 n \to c$, we have, almost surely,
\[ \limsup_{n\to\infty} \pm(2\log_2 n)^{-1/2} \xi_n(h_n) = \limsup_{n\to\infty} \pm(2h_n\log_2 n)^{-1/2} \alpha_n(h_n) = \pm(\delta_c^{\pm} - 1)(c/2)^{1/2}, \tag{3.95} \]
\[ \limsup_{n\to\infty} \pm(2\log_2 n)^{-1/2} \zeta_n(h_n) = \limsup_{n\to\infty} \pm(2h_n\log_2 n)^{-1/2} \beta_n(h_n) = \pm(\gamma_c^{\pm} - 1)(c/2)^{1/2}. \tag{3.96} \]

Proof. See, e.g., Kiefer [126], and Deheuvels [49].
Mason [156] established a functional version of (3.94), stated in Theorem 3.16 below. This result is a consequence of the following invariance principle, when combined with a general LIL due to Lai [134].
Proposition 3.10. Assume (H.1) and (H.4) with $nh_n/\log_2 n \to \infty$. Then, on a suitable probability space, there exists a sequence $W_k(\cdot)$, $k = 1, 2, \ldots$, of independent Wiener processes such that, almost surely as $n \to \infty$,
\[ \Big\| \alpha_n(h_n \cdot) - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} W_i(h_n \cdot) \Big\| = o\big( \sqrt{h_n \log_2 n} \big). \tag{3.97} \]

Proof. See, e.g., Mason [156].

As a consequence of Proposition 3.10, we obtain the following result.

Theorem 3.16. Assume (H.1) and (H.4) with $nh_n/\log_2 n \to \infty$. Then, as $n \to \infty$, the sequences
\[ (2\log_2 n)^{-1/2} \xi_n(h_n \cdot) = (2h_n\log_2 n)^{-1/2} \alpha_n(h_n \cdot), \tag{3.98} \]
\[ (2\log_2 n)^{-1/2} \zeta_n(h_n \cdot) = (2h_n\log_2 n)^{-1/2} \beta_n(h_n \cdot), \tag{3.99} \]
are almost surely relatively compact in $(B[0, 1], \mathcal{U})$, with limit set equal to $\mathbb{S}$.

Proof. See, e.g., Mason [156], and Einmahl and Mason [94].

When $0 < t_0 < 1$, the limiting behavior of the local empirical process, and that of the local quantile process, become distinct. The following theorem holds.

Theorem 3.17. 1°) Let $0 < t_0 < 1$. Assume (H.1) and (H.4) with $nh_n/\log_2 n \to \infty$. Then, as $n \to \infty$, the sequence
\[ (2\log_2 n)^{-1/2} \xi_{n,t_0}(h_n \cdot) = (2h_n\log_2 n)^{-1/2} \{\alpha_n(t_0 + h_n \cdot) - \alpha_n(t_0)\} \]
is almost surely relatively compact in $(B[-1, 1], \mathcal{U})$, with limit set equal to $\mathbb{S}$.
2°) Let $0 < t_0 < 1$. Assume (H.1), together with:
(H.4) $nh_n/\log_+\big( 1/(h_n\sqrt{n}) \big) \to \infty$;
(H.5) $\log_+\big( 1/(h_n\sqrt{n}) \big)/\log_2 n \to d \in [-\infty, \infty]$.
Then, the sequence
\[ \Big\{ \big( 2h_n \big\{ \log_+\big( 1/(h_n\sqrt{n}) \big) + \log_2 n \big\} \big)^{-1/2} \{\beta_n(t_0 + h_n \cdot) - \beta_n(t_0)\} : n \ge 1 \Big\} \]
is almost surely relatively compact in $(B[-1, 1], \mathcal{U})$, with limit set equal to $\mathbb{S}$.

Proof. See, e.g., Deheuvels and Mason [68], Mason [156], and Deheuvels [54].
3.9. Modulus of continuity of $\alpha_n$ and $\beta_n$

Let $\{h_n : n \ge 1\}$ denote a sequence of constants fulfilling the conditions
(H.1) $h_n \downarrow 0$, $nh_n \uparrow \infty$;
(H.2) $nh_n/\log n \to \infty$;
(H.3) $(\log(1/h_n))/\log_2 n \to \rho \in [0, \infty]$.
The following theorem collects a series of limiting results concerning the modulus of continuity of $\alpha_n$ and $\beta_n$. Below, we let $0 \le c < d \le 1$ denote specified constants.
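Theorem 3.18. Under (H.1–2–3), we have, almost surely for each $0 \le c < d \le 1$,
\begin{align*}
&\limsup_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\alpha_n(u) - \alpha_n(v)| \tag{3.100} \\
&\quad = \limsup_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{c \le u \le d} |\alpha_n(u + h_n) - \alpha_n(u)| \tag{3.101} \\
&\quad = \limsup_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\beta_n(u) - \beta_n(v)| \tag{3.102} \\
&\quad = \limsup_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{c \le u \le d} |\beta_n(u + h_n) - \beta_n(u)| = 1, \tag{3.103}
\end{align*}
and, setting $\rho/(\rho + 1) = 1$ when $\rho = \infty$,
\begin{align*}
&\liminf_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\alpha_n(u) - \alpha_n(v)| \tag{3.104} \\
&\quad = \liminf_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{c \le u \le d} |\alpha_n(u + h_n) - \alpha_n(u)| \tag{3.105} \\
&\quad = \liminf_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\beta_n(u) - \beta_n(v)| \tag{3.106} \\
&\quad = \liminf_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{c \le u \le d} |\beta_n(u + h_n) - \beta_n(u)| = \Big( \frac{\rho}{\rho + 1} \Big)^{1/2}. \tag{3.107}
\end{align*}
In addition, we have, in probability,
\begin{align*}
&\lim_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\alpha_n(u) - \alpha_n(v)| \tag{3.108} \\
&\quad = \lim_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{c \le u \le d} |\alpha_n(u + h_n) - \alpha_n(u)| \tag{3.109} \\
&\quad = \lim_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\beta_n(u) - \beta_n(v)| \tag{3.110} \\
&\quad = \lim_{n\to\infty} \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \sup_{c \le u \le d} |\beta_n(u + h_n) - \beta_n(u)| = \Big( \frac{\rho}{\rho + 1} \Big)^{1/2}. \tag{3.111}
\end{align*}

Proof. To establish (3.100)–(3.111), combine results of Stute [200], Mason, Shorack and Wellner [154], Mason [153], Deheuvels and Mason [66], Deheuvels [52], and Deheuvels and Einmahl [60].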
Theorem 3.18 turns out to be a direct consequence of functional limit laws due to Deheuvels and Mason [66], Deheuvels [52], Deheuvels [53], and Deheuvels and Einmahl [60]. Let $\mathbb{S}$ denote the Strassen set, as defined in (2.34). Consider the random sets of functions of $u \in [0, 1]$,
\[ E_n = \Big\{ \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \{\alpha_n(t + h_n u) - \alpha_n(t)\} : c \le t \le d \Big\} =: \big\{ \upsilon_{n,t}(u) : c \le t \le d \big\}, \tag{3.112} \]
\[ F_n = \Big\{ \big( 2h_n\{\log(1/h_n) + \log_2 n\} \big)^{-1/2} \{\beta_n(t + h_n u) - \beta_n(t)\} : c \le t \le d \Big\} =: \big\{ \rho_{n,t}(u) : c \le t \le d \big\}. \tag{3.113} \]
Denote the Hausdorff distance (pertaining to the uniform topology $\mathcal{U}$) between subsets of $B[0, 1]$ by setting, for arbitrary $A, B \subseteq B[0, 1]$,
\[ \delta_{\mathcal{U}}(A, B) = \inf\big\{ \varepsilon > 0 : A \subseteq B^{\varepsilon} \text{ and } B \subseteq A^{\varepsilon} \big\}, \tag{3.114} \]
whenever such an $\varepsilon > 0$ exists, and $\delta_{\mathcal{U}}(A, B) = \infty$ otherwise.

Theorem 3.19. Under (H.1–2–3) with $\rho = \infty$, we have, for each $0 \le c < d \le 1$,
\[ \lim_{n\to\infty} \delta_{\mathcal{U}}(E_n, \mathbb{S}) = \lim_{n\to\infty} \delta_{\mathcal{U}}(F_n, \mathbb{S}) = 0 \quad \text{a.s.} \tag{3.115} \]

Proof. See, e.g., Deheuvels and Mason [66].

Fix now a $T > 0$, set $h_n = n^{-1}\log n$, and consider the random sets of functions of $u \in [0, T]$ defined by
\[ G_n = \Big\{ \frac{n}{\log n} \{U_n(t + u n^{-1}\log n) - U_n(t)\} : c \le t \le d \Big\}, \tag{3.116} \]
\[ H_n = \Big\{ \frac{n}{\log n} \{V_n(t + u n^{-1}\log n) - V_n(t)\} : c \le t \le d \Big\}. \tag{3.117} \]
Introduce the sets of functions of $u \in [0, T]$ defined by
\[ \Delta_T = \Big\{ f \in AC[0, T] : f(0) = 0,\ \int_0^T h(\dot{f}(u))\, du \le 1 \Big\}, \tag{3.118} \]
\[ \Gamma_T = \Big\{ f \in I[0, T] : f(0) = 0,\ \int_0^T \ell(\dot{f}(u))\, du + f_S(T+) \le 1 \Big\}. \tag{3.119} \]
Here, $I[0, T]$ denotes the restriction to $[0, T]$ of left-continuous nondecreasing functions on $\mathbb{R}$. Each of these functions, say $f \in I[0, T]$ with $f(0) = 0$, defines, via the Lebesgue-Stieltjes integral, a nonnegative measure on $[0, T]$. We may therefore decompose this measure into an absolutely continuous component with Lebesgue derivative $\dot{f}$, and a singular component $df_S$. We then define $f_S(T+)$ by
\[ f_S(T+) = \int_{[0, T]} df_S. \tag{3.120} \]
Topics on Empirical Processes
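Recall the definition (1.44) of the Lévy distance $\Delta_L(f,g)$ between two functions $f,g\in I[0,T]$. We set, for any $f\in I[0,T]$,
$$N_{[\varepsilon]}(f)=\big\{g\in I[0,T]: \Delta_L(f,g)<\varepsilon\big\}, \tag{3.121}$$
and, for any subset $A\subseteq I[0,T]$,
$$A^{[\varepsilon]}=\bigcup_{f\in A}N_{[\varepsilon]}(f), \tag{3.122}$$
when $A\neq\emptyset$, and $A^{[\varepsilon]}=\emptyset^{[\varepsilon]}=\emptyset$ otherwise. Denote the Hausdorff distance (pertaining to the weak topology) between subsets of $I[0,T]$ by setting, for arbitrary $A,B\subseteq I[0,T]$,
$$\delta_{\mathcal{W}}(A,B)=\inf\big\{\varepsilon>0: A\subseteq B^{[\varepsilon]}\ \text{and}\ B\subseteq A^{[\varepsilon]}\big\}, \tag{3.123}$$
whenever such an $\varepsilon>0$ exists, and $\delta_{\mathcal{W}}(A,B)=\infty$ otherwise.

Theorem 3.20. We have, for each $T>0$ and $0\le c<d\le 1$,
$$\lim_{n\to\infty}\delta_{\mathcal{U}}(G_n,\Delta_T)=\lim_{n\to\infty}\delta_{\mathcal{W}}(H_n,\Gamma_T)=0\quad\text{a.s.} \tag{3.124}$$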
Proof. See, e.g., Deheuvels and Mason [66].
3.10. The Bahadur-Kiefer representation

The classical Bahadur-Kiefer representation, due to Bahadur [8], and Kiefer [124, 125], is stated in Theorems 3.21 and 3.22 below. It allows one to replace the quantile process $\beta_n$ by $(-1)$ times the empirical process $\alpha_n$, with an almost sure uniform error of order $n^{-1/4+o(1)}$. At times (see, e.g., Deheuvels [55, 57]), this yields the best possible rates of approximation. The strange constants in (3.125) and (3.132) below are generated in a quite natural way by functional limit theorems, and their derivation illustrates well the main topic of the present course.

Theorem 3.21. We have, with probability 1,
$$\limsup_{n\to\infty}\,n^{1/4}(\log n)^{-1/2}(\log_2 n)^{-1/4}\,\|\alpha_n+\beta_n\|=2^{-1/4}. \tag{3.125}$$

Proof. We refer to Shorack [182] and Deheuvels and Mason [64] for details. Below, we follow the elegant proof of Shorack [182] to establish (3.125). First, we combine Propositions 3.2 and 3.3, to obtain that, almost surely,
$$\sup_{0\le t\le 1}|\beta_n(t)+\alpha_n(V_n(t))|=\sup_{0\le t\le 1}\big|\beta_n(t)-n^{1/2}\{V_n(t)-U_n(V_n(t))\}\big|=\frac{1}{\sqrt n},$$
and hence, since $V_n(t)=t+n^{-1/2}\beta_n(t)$,
$$\big\|\{\alpha_n+\beta_n\}-\{\alpha_n-\alpha_n(I+n^{-1/2}\beta_n)\}\big\|=\frac{1}{\sqrt n}\quad\text{a.s.} \tag{3.126}$$
In view of (3.126), we see that (3.125) is equivalent to
$$\limsup_{n\to\infty}\,n^{1/4}(\log n)^{-1/2}(\log_2 n)^{-1/4}\,\|\alpha_n(I+n^{-1/2}\beta_n)-\alpha_n\|=2^{-1/4}\quad\text{a.s.} \tag{3.127}$$
The proof of (3.127) is achieved in two complementary parts.

Part 1. (Outer bounds). Fix any $\varepsilon>0$. By Chung's Theorem 3.4, we have, a.s. for all $n$ sufficiently large, $n^{-1/2}\|\beta_n\|\le h_n:=(1+\varepsilon)2^{-1/2}n^{-1/2}(\log_2 n)^{1/2}$. This, when combined with (3.100), shows that, a.s.,
$$\limsup_{n\to\infty}\,(2h_n\{\log(1/h_n)+\log_2 n\})^{-1/2}\,\|\alpha_n(I+n^{-1/2}\beta_n)-\alpha_n\|$$
$$\le\limsup_{n\to\infty}\,(2h_n\{\log(1/h_n)+\log_2 n\})^{-1/2}\sup_{\substack{0\le u,v\le 1\\ |u-v|\le h_n}}|\alpha_n(u)-\alpha_n(v)|=1.$$
We obtain readily from this last result that, a.s.,
$$\limsup_{n\to\infty}\,n^{1/4}(\log n)^{-1/2}(\log_2 n)^{-1/4}\,\|\alpha_n(I+n^{-1/2}\beta_n)-\alpha_n\|\le(1+\varepsilon)^{1/2}2^{-1/4}.$$
Since $\varepsilon>0$ may be chosen as small as desired, it follows that, a.s.,
$$\limsup_{n\to\infty}\,n^{1/4}(\log n)^{-1/2}(\log_2 n)^{-1/4}\,\|\alpha_n(I+n^{-1/2}\beta_n)-\alpha_n\|\le 2^{-1/4}. \tag{3.128}$$
Part 2. (Inner bounds). In this part, we invoke Finkelstein's Theorem 3.14. As follows from (3.83), the function $f(t)=\min\{t,1-t\}$ for $t\in[0,1]$ belongs to the Finkelstein set $\mathcal{F}$. Thus, a.s. for each $\varepsilon>0$, there exists an increasing sequence $\{n_j: j\ge 1\}$ along which $\|(2\log_2 n)^{-1/2}\beta_n-f\|\le\frac14\varepsilon$. Set now $c=\frac12-\frac14\varepsilon$ and $d=\frac12+\frac14\varepsilon$. Obviously, for any $t\in[c,d]$, we have $|f(t)-\frac12|\le\frac14\varepsilon$. Therefore, along $\{n_j: j\ge 1\}$, we have, uniformly over $t\in[c,d]$, $|(2\log_2 n)^{-1/2}\beta_n(t)-\frac12|\le\frac12\varepsilon$. Setting $h_n=2^{-1/2}n^{-1/2}(\log_2 n)^{1/2}$, we have therefore, along $\{n_j: j\ge 1\}$, and uniformly over $t\in[c,d]$, $|n^{-1/2}\beta_n(t)-h_n|\le\varepsilon h_n$. Now, by an application of (3.101) and (3.105), we have
$$\lim_{n\to\infty}\,(2h_n\log(1/h_n))^{-1/2}\sup_{c\le u\le d}|\alpha_n(u+h_n)-\alpha_n(u)|=1\quad\text{a.s.}, \tag{3.129}$$
whereas we infer from (3.101) that
$$\limsup_{n\to\infty}\,(2h_n\log(1/h_n))^{-1/2}\sup_{\substack{c\le u\le d\\ |v|\le\varepsilon h_n}}|\alpha_n(u+h_n)-\alpha_n(u+h_n+v)|=\sqrt{\varepsilon}\quad\text{a.s.} \tag{3.130}$$
Making use of (3.129)–(3.130), we obtain therefore that, along $\{n_j: j\ge 1\}$,
$$\liminf_{j\to\infty}\,(2h_{n_j}\log(1/h_{n_j}))^{-1/2}\sup_{c\le u\le d}\big|\alpha_{n_j}(u+n_j^{-1/2}\beta_{n_j}(u))-\alpha_{n_j}(u)\big|\ge 1-\sqrt{\varepsilon}\quad\text{a.s.}$$
Since $\varepsilon>0$ may be chosen as small as desired, and since $(2h_n\log(1/h_n))^{1/2}=(1+o(1))\,2^{-1/4}n^{-1/4}(\log n)^{1/2}(\log_2 n)^{1/4}$, it follows readily from this last statement that, a.s.,
$$\limsup_{n\to\infty}\,n^{1/4}(\log n)^{-1/2}(\log_2 n)^{-1/4}\,\|\alpha_n(I+n^{-1/2}\beta_n)-\alpha_n\|\ge 2^{-1/4}. \tag{3.131}$$
The proof of (3.127) is achieved by combining (3.128) with (3.131).
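The order of magnitude in (3.125) can be explored numerically. The following is only a minimal sketch, not part of the original argument: it assumes the usual conventions (not restated in this section) that $U_n$ is the right-continuous uniform empirical distribution function and $V_n$ the left-continuous empirical quantile function, and the function names are hypothetical. Since $\alpha_n(t)+\beta_n(t)=n^{1/2}\{U_n(t)+V_n(t)-2t\}$ is piecewise linear, its supremum can be evaluated exactly on a finite set of breakpoints.

```python
import bisect
import math
import random


def bahadur_kiefer_sup(sample):
    """Exact value of sup_t |alpha_n(t) + beta_n(t)| for a sample in (0, 1).

    Assumed conventions: U_n is the right-continuous empirical df and V_n the
    left-continuous empirical quantile function, V_n(t) = X_(i) for
    (i-1)/n < t <= i/n.
    """
    xs = sorted(sample)
    n = len(xs)
    # U_n jumps at the order statistics, V_n at the multiples of 1/n.
    pts = sorted(set(xs) | {i / n for i in range(n + 1)})
    sup = 0.0
    for a, b in zip(pts, pts[1:]):
        mid = 0.5 * (a + b)
        u_n = bisect.bisect_right(xs, mid) / n   # U_n is constant on (a, b)
        v_n = xs[min(n - 1, int(mid * n))]       # V_n is constant on (a, b)
        # On (a, b), U_n(t) + V_n(t) - 2t is linear in t, so its modulus is
        # largest at one of the two endpoints.
        c = u_n + v_n
        sup = max(sup, abs(c - 2.0 * a), abs(c - 2.0 * b))
    return math.sqrt(n) * sup


def normalized_statistic(n, seed=0):
    """n^{1/4} (log n)^{-1/2} (log log n)^{-1/4} ||alpha_n + beta_n||."""
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(n)]
    scale = n ** 0.25 / (math.sqrt(math.log(n)) * math.log(math.log(n)) ** 0.25)
    return scale * bahadur_kiefer_sup(sample)
```

For a single simulated sample the normalized statistic is expected to be of order one; since (3.125) is a lim sup statement, no single value of $n$ should be read as an estimate of $2^{-1/4}$.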
The pointwise Bahadur-Kiefer representation is stated in the following theorem.

Theorem 3.22. For each specified $t_0\in[0,1]$, we have, with probability 1,
$$\limsup_{n\to\infty}\,n^{1/4}(2\log_2 n)^{-3/4}\,|\alpha_n(t_0)+\beta_n(t_0)|=\{t_0(1-t_0)\}^{1/4}\,2^{1/2}\,3^{-3/4}. \tag{3.132}$$
Proof. See, e.g., Kiefer [124]. Below, we give a simple proof of this statement, given in Deheuvels and Mason [67], and §3.4 of Deheuvels and Mason [68]. We need the following two auxiliary results (see, e.g., Lemmas 3.3 and 3.4 in [68]). Set, for $0<t_0<1$ and $-1\le u\le 1$,
$$x_n=(2\log_2 n)^{-1/2}\alpha_n(t_0), \tag{3.133}$$
$$f_n(u)=\Big\{2\Big(\frac{2\log_2 n}{n}\Big)^{1/2}\log_2 n\Big\}^{-1/2}\times\Big\{\alpha_n(t_0)-\alpha_n\Big(t_0-u\Big(\frac{2\log_2 n}{n}\Big)^{1/2}\Big)\Big\}. \tag{3.134}$$
Then, we have, almost surely, as $n\to\infty$,
$$\big|n^{1/4}(2\log_2 n)^{-3/4}\{\alpha_n(t_0)+\beta_n(t_0)\}-f_n(x_n)\big|\to 0. \tag{3.135}$$
Moreover, the sequence $(x_n,f_n)$ is almost surely relatively compact in $\mathbb{R}\times B[-1,1]$, where $B[-1,1]$ denotes the set of bounded functions on $[-1,1]$, endowed with the uniform topology. The limit set is given by
$$(x_n,f_n)\rightsquigarrow K_1:=\Big\{(x,f)\in\mathbb{R}\times AC_0[-1,1]:\ \frac{x^2}{t_0(1-t_0)}+\int_{-1}^{1}\dot f(u)^2\,du\le 1\Big\}. \tag{3.136}$$
In view of (3.133), (3.134), (3.135) and (3.136), the lim sup in (3.132) equals, almost surely,
$$\sup_{(x,f)\in K_1}|f(x)|=\{t_0(1-t_0)\}^{1/4}\sup_{0\le s\le 1}\sqrt{s(1-s^2)}=\{t_0(1-t_0)\}^{1/4}\,2^{1/2}\,3^{-3/4}, \tag{3.137}$$
which yields the theorem.
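The numerical constant in (3.137) comes from an elementary maximization: $s(1-s^2)$ is largest at $s=3^{-1/2}$, where it equals $2\cdot 3^{-3/2}$, so that $\sup_{0\le s\le 1}\sqrt{s(1-s^2)}=2^{1/2}3^{-3/4}$. A short grid search, given here as an illustrative sketch with hypothetical function names, confirms the value.

```python
import math


def sup_sqrt_s_one_minus_s2(grid=100000):
    """Grid approximation of sup over 0 <= s <= 1 of sqrt(s (1 - s^2))."""
    return max(math.sqrt(s * (1.0 - s * s))
               for s in (i / grid for i in range(grid + 1)))


def pointwise_constant(t0):
    """Right-hand side of (3.132): {t0 (1 - t0)}^{1/4} 2^{1/2} 3^{-3/4}."""
    return (t0 * (1.0 - t0)) ** 0.25 * 2.0 ** 0.5 * 3.0 ** -0.75
```

At $t_0=1/2$ the factor $\{t_0(1-t_0)\}^{1/4}$ equals $2^{-1/2}$, so that the constant in (3.132) reduces to $3^{-3/4}$.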
Exercise 3.9. 1°) Letting $x_n$ be as in (3.133), show, as a consequence of Finkelstein's Theorem 3.14, that, almost surely as $n\to\infty$,
$$x_n\rightsquigarrow\Big[-\sqrt{t_0(1-t_0)},\ \sqrt{t_0(1-t_0)}\Big],$$
and show that this result is in agreement with (3.136).
2°) Letting $f_n$ be as in (3.134), show, as a consequence of Theorem 3.17, that, almost surely as $n\to\infty$, $f_n\rightsquigarrow S$, and show that this result is in agreement with (3.136).
3.11. Application to density estimation

The theorems of the previous sections provide a number of applications to nonparametric functional estimation (see, e.g., [4, 13, 45, 46, 47, 15, 16, 17, 18, 23, 25, 29, 33, 34, 35, 45, 46, 50, 52, 53, 56, 72, 74, 77, 75, 78, 79, 87, 96, 105, 107, 108, 109, 110, 111, 115, 117, 118, 131, 133, 143, 149, 158, 159, 160, 161, 162, 165, 168, 169, 170, 171, 175, 176, 177, 178, 180, 196, 197, 198, 200, 202, 203, 208, 209, 210, 212]). We will limit ourselves to one example, showing how the preceding functional limit laws can be applied. Consider an iid sequence $X_1,X_2,\ldots$ of rv's with common distribution $F(x)=P(X_1\le x)$, having density $f(x)$, assumed to be positive and continuous on $J=(c',d')\subseteq\mathbb{R}$. We estimate $f(x)$, for $x\in I:=[c,d]\subset J$, by the Akaike-Parzen-Rosenblatt kernel estimator (refer to [4, 163, 170])
$$f_{n,h}(x)=\frac{1}{nh}\sum_{i=1}^{n}K\Big(\frac{x-X_i}{h}\Big). \tag{3.138}$$
Here, $h>0$ is a positive bandwidth, and $K(\cdot)$ a kernel, namely a function of bounded variation on $\mathbb{R}$, such that, for some $0<M<\infty$,
(i) $K(t)=0$ for $|t|\ge M$;
(ii) $\int_{\mathbb{R}}K(t)\,dt=1$.
The following theorem was proved by Stute [200], under slightly more restrictive conditions. As stated, it is due to Deheuvels and Mason [66] (see also Deheuvels [52] and Deheuvels and Einmahl [60]).

Theorem 3.23. Assume that $\{h_n: n\ge 1\}$ is a sequence of positive constants, fulfilling
$$h_n\downarrow 0;\qquad nh_n\uparrow\infty;\qquad \frac{nh_n}{\log n}\to\infty;\qquad \frac{\log(1/h_n)}{\log_2 n}\to\infty. \tag{3.139}$$
Then, we have, almost surely,
$$\lim_{n\to\infty}\Big\{\frac{nh_n}{2\log(1/h_n)}\Big\}^{1/2}\sup_{x\in I}\pm\{f_{n,h_n}(x)-Ef_{n,h_n}(x)\}=\Big\{\sup_{x\in I}f(x)\int_{\mathbb{R}}K^2(t)\,dt\Big\}^{1/2}. \tag{3.140}$$
Proof. We will establish (3.140) in the special case where $J=(0,1)$, $f(x)=1$ for $0<x<1$, and $K(t)=0$ for $t\notin[\frac14,\frac34]$. The arguments of the proof in the general case are essentially identical, with the addition of minor technicalities, which can be omitted. We let $n_0$ be such that $t\pm h_n\in J$ for all $t\in I$, when $n\ge n_0$. Recall the definition (3.112) of $\upsilon_{n,x}(u)$. Our assumptions imply that $X_n=U_n$ is uniformly distributed on $(0,1)$ for $n=1,2,\ldots$, so that we may write, for $n\ge n_0$,
$$\begin{aligned}
f_{n,h_n}(x)-Ef_{n,h_n}(x)&=\frac{1}{h_n}\int_0^1K\Big(\frac{x-t}{h_n}\Big)\,d\{U_n(t)-t\}\\
&=\frac{1}{h_n\sqrt n}\int_0^1K(u)\,d\{\alpha_n(x-h_nu)-\alpha_n(x)\}\\
&=-\frac{1}{h_n\sqrt n}\int_0^1\{\alpha_n(x-h_nu)-\alpha_n(x)\}\,dK(u)\\
&=-\Big\{\frac{2(\log(1/h_n)+\log_2 n)}{nh_n}\Big\}^{1/2}\int_0^1\upsilon_{n,x}(-u)\,dK(u)\\
&=-(1+o(1))\Big\{\frac{2\log(1/h_n)}{nh_n}\Big\}^{1/2}\int_0^1\upsilon_{n,x}(-u)\,dK(u),
\end{aligned}$$
after integrating by parts. Now, as follows from Theorem 3.19, we have $\delta_{\mathcal{U}}(\{\upsilon_{n,x}: x\in I\},S)\to 0$ a.s. It is easily checked that, whenever $f(u)\in S$, we have $f(-u)\in S$. Therefore, since $S$ is symmetric, an easy argument shows that, almost surely as $n\to\infty$,
$$\Big\{\frac{nh_n}{2\log(1/h_n)}\Big\}^{1/2}\sup_{x\in I}\pm\{f_{n,h_n}(x)-Ef_{n,h_n}(x)\}=(1+o(1))\sup_{f\in S}\Big\{-\int_0^1f(-u)\,dK(u)\Big\}$$
$$\to\sup_{f\in S}\Big\{-\int_0^1f(u)\,dK(u)\Big\}=\sup_{f\in S}\int_0^1\dot f(u)K(u)\,du,$$
after integrating by parts. Finally, the Schwarz inequality shows that
$$\sup_{f\in S}\int_0^1\dot f(u)K(u)\,du=\Big\{\int_0^1K^2(u)\,du\Big\}^{1/2},$$
which suffices for proving (3.140) under our (simplified) assumptions.
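To make the objects in (3.138) and (3.140) concrete, here is a minimal sketch of the kernel estimator. The Epanechnikov kernel is an illustrative choice made for this example (it has bounded variation, vanishes for $|t|\ge 1$ and integrates to 1, with $\int K^2=3/5$); the function names are hypothetical.

```python
import math


def epanechnikov(t):
    """Kernel with support [-1, 1] (so M = 1) and integral 1."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def kernel_estimate(x, sample, h, kernel=epanechnikov):
    """f_{n,h}(x) = (nh)^{-1} sum_i K((x - X_i) / h), as in (3.138)."""
    return sum(kernel((x - xi) / h) for xi in sample) / (len(sample) * h)


def stute_scale(n, h, sup_f, int_k2):
    """Order of sup_x |f_{n,h}(x) - E f_{n,h}(x)| suggested by (3.140):
    {2 log(1/h) * sup f * int K^2 / (n h)}^{1/2}."""
    return math.sqrt(2.0 * math.log(1.0 / h) * sup_f * int_k2 / (n * h))
```

Under these assumptions, `stute_scale(n, h, 1.0, 0.6)` gives the asymptotic size of the uniform random deviation for a uniform density on $(0,1)$; it says nothing about the bias $Ef_{n,h}(x)-f(x)$, which Theorem 3.23 does not address.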
At this point, we leave it to the reader to extend the arguments used above, in order to establish similar results for other non-parametric estimators of interest.

Exercise 3.10. Making use of Theorem 3.17, show that, whenever
$$h_n\downarrow 0;\qquad nh_n\uparrow\infty;\qquad\frac{nh_n}{\log_2 n}\to\infty,$$
we have, for each specified $x_0$ in the neighborhood of which $f$ is continuous, almost surely,
$$\limsup_{n\to\infty}\ \pm\Big\{\frac{nh_n}{2\log_2 n}\Big\}^{1/2}\{f_{n,h_n}(x_0)-Ef_{n,h_n}(x_0)\}=\Big\{f(x_0)\int_{\mathbb{R}}K^2(t)\,dt\Big\}^{1/2}.$$
4. Auxiliary results

4.1. Some Gaussian process theory

Let $Z$ be a centered Gaussian random vector taking values in a separable Banach space $X$, with norm denoted by $|\cdot|_X$. Throughout, we will work under the assumption that

(G.1) $P(|Z|_X<\infty)=1$.

In the sequel, we will mainly consider the case where $X=C[0,1]$, the set of all continuous functions on $[0,1]$, endowed with the sup-norm uniform topology $\mathcal{U}$, defined by $|\cdot|_X=\|\cdot\|$, and, either, $Z=W$ is (the restriction to $[0,1]$ of) a standard Wiener process; or $Z=B$ is a Brownian bridge. The assumption (G.1) is clearly satisfied in either of these cases. Denoting by $\mathcal{B}_X$ the σ-algebra of Borel subsets of $X$, the distribution of $Z$ is defined on $\mathcal{B}_X$ via
$$P_Z(B)=P(Z\in B)\quad\forall B\in\mathcal{B}_X. \tag{4.1}$$
Denoting by $X^*$ the topological dual of $X$, that is, the space of all continuous linear forms $h^*: X\to\mathbb{R}$, one may define a linear mapping $J: X^*\to X$, by the Bochner integral
$$h^*\in X^*\mapsto Jh^*=E\big(Zh^*(Z)\big)=\int_X zh^*(z)\,P_Z(dz). \tag{4.2}$$
The image space $H^*:=J(X^*)$ is pre-Hilbertian with inner product defined by
$$\langle Jg^*,Jh^*\rangle_H=E\big(g^*(Z)h^*(Z)\big)=\int_X g^*(z)h^*(z)\,P_Z(dz),\quad\text{for } g^*,h^*\in X^*. \tag{4.3}$$
The reproducing kernel Hilbert space [RKHS] of $Z$ is then defined as the closure $H$ in $X$ of $H^*$ with respect to the Hilbert norm $|\cdot|_H=\langle\cdot,\cdot\rangle_H^{1/2}$. Of special interest here is the unit ball of $H$, denoted by
$$K=\{h\in H: |h|_H\le 1\}. \tag{4.4}$$
The following assumption will be needed, in addition to (G.1). Throughout, we assume that

(G.2) $K$ is a compact subset of $(X,|\cdot|_X)$.

We will make use of a generalized version of the Cameron-Martin formula stated in the following theorem.

Theorem 4.1. For each $h\in H$, there exists a measurable linear form $\widetilde h$ on $X$, fulfilling the equalities
$$P_Z(B+h)=\int_B\exp\big(\widetilde h(x)-\tfrac12|h|_H^2\big)\,P_Z(dx)\quad\text{for each } B\in\mathcal{B}_X, \tag{4.5}$$
$$E\big(\widetilde h^2(Z)\big)=\int_X\widetilde h^2(x)\,P_Z(dx)=|h|_H^2. \tag{4.6}$$

Proof. See, e.g., Borell (1976).
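The next proposition (see, e.g., Borell (1977)) states a useful consequence of (4.5)–(4.6).

Proposition 4.1. Let $h\in H$ and let $A\in\mathcal{B}_X$ denote a symmetric Borel subset of $X$. Then,
$$P_Z(A+h)\ge\exp\big(-\tfrac12|h|_H^2\big)\,P_Z(A). \tag{4.7}$$

Proof. There is nothing to prove if $P_Z(A)=0$. Assuming that $P_Z(A)>0$, we set $B=A$ in (4.5), then make use of the Jensen inequality to obtain, by symmetry of $A$ and linearity of $\widetilde h$ (which together imply that $\int_A\widetilde h(x)\,P_Z(dx)=0$), that
$$\begin{aligned}
P_Z(A+h)&=\exp\big(-\tfrac12|h|_H^2\big)\int_A\exp\big(\widetilde h(x)\big)\,P_Z(dx)\\
&=\exp\big(-\tfrac12|h|_H^2\big)\,P_Z(A)\int_A\exp\big(\widetilde h(x)\big)\,\frac{P_Z(dx)}{P_Z(A)}\\
&\ge\exp\big(-\tfrac12|h|_H^2\big)\,P_Z(A)\exp\Big(\int_A\widetilde h(x)\,\frac{P_Z(dx)}{P_Z(A)}\Big)\\
&\ge\exp\big(-\tfrac12|h|_H^2\big)\,P_Z(A),
\end{aligned}$$
which is (4.7).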
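In the simplest case $X=\mathbb{R}$ and $Z\sim N(0,1)$, the RKHS is $\mathbb{R}$ itself with $|h|_H=|h|$, and (4.7) becomes an elementary inequality for the normal distribution function: for the symmetric set $A=[-a,a]$, $P(Z\in A+h)\ge e^{-h^2/2}P(Z\in A)$. The sketch below (hypothetical function names, one-dimensional illustration only) evaluates both sides.

```python
import math


def std_normal_cdf(x):
    """Distribution function of the N(0, 1) law."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def shifted_mass(a, h):
    """P(Z in A + h) for Z ~ N(0, 1) and the symmetric set A = [-a, a]."""
    return std_normal_cdf(h + a) - std_normal_cdf(h - a)


def cameron_martin_lower_bound(a, h):
    """Right-hand side of (4.7) when X = R, where |h|_H = |h|."""
    return math.exp(-0.5 * h * h) * shifted_mass(a, 0.0)
```

Equality holds at $h=0$; for small $a$ the two sides are close, which reflects the fact that the Jensen step in the proof is nearly sharp for small balls.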
The next statement is the celebrated isoperimetric inequality due to Borell (1975) and Sudakov and Tsyrelson (1978). Denote by
$$\Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-t^2/2}\,dt, \tag{4.8}$$
the df of the standard $N(0,1)$ law, and by $\Phi^{-1}=\Phi^{\leftarrow}$ the corresponding quantile function, fulfilling
$$\Phi\big(\Phi^{-1}(u)\big)=u\quad\text{for all } 0<u<1. \tag{4.9}$$

Theorem 4.2. For each $A,B\in\mathcal{B}_X$ such that $B\cap(A+rK)=\emptyset$, we have
$$P_Z(B)\le 1-\Phi\big(\Phi^{-1}(P_Z(A))+r\big). \tag{4.10}$$

The following simple inequalities will be useful to evaluate the right-hand side of (4.10).

Lemma 4.1. We have, for every $t>0$,
$$\frac{e^{-t^2/2}}{t\sqrt{2\pi}}\Big(1-\frac{1}{t^2}\Big)\le 1-\Phi(t)\le\frac{e^{-t^2/2}}{t\sqrt{2\pi}}. \tag{4.11}$$
Moreover, for every $t\ge 0$, we have
$$1-\Phi(t)\le\frac{2}{\sqrt{2\pi}}\,e^{-t^2/2}\le e^{-t^2/2}. \tag{4.12}$$
Proof. By integrating by parts, we first observe that
$$1-\Phi(t)=\frac{1}{\sqrt{2\pi}}\int_t^\infty e^{-x^2/2}\,dx=-\frac{1}{\sqrt{2\pi}}\int_t^\infty\frac{1}{x}\,d\{e^{-x^2/2}\}=\frac{e^{-t^2/2}}{t\sqrt{2\pi}}+\frac{1}{\sqrt{2\pi}}\int_t^\infty\frac{1}{x^3}\,d\{e^{-x^2/2}\}\le\frac{e^{-t^2/2}}{t\sqrt{2\pi}}.$$
By integrating once more by parts, we obtain likewise that
$$1-\Phi(t)=\frac{e^{-t^2/2}}{t\sqrt{2\pi}}\Big(1-\frac{1}{t^2}\Big)-\frac{3}{\sqrt{2\pi}}\int_t^\infty\frac{1}{x^5}\,d\{e^{-x^2/2}\}\ge\frac{e^{-t^2/2}}{t\sqrt{2\pi}}\Big(1-\frac{1}{t^2}\Big).$$
The above inequalities readily yield (4.11). To establish (4.12), we infer from (4.11) that the inequality
$$1-\Phi(t)\le\frac{e^{-t^2/2}}{t\sqrt{2\pi}}\le\frac{2}{\sqrt{2\pi}}\,e^{-t^2/2},$$
always holds when $t\ge 1$. On the other hand, when $0\le t\le 1$, we may write
$$1-\Phi(t)=\frac{1}{\sqrt{2\pi}}\Big\{\int_t^1e^{-x^2/2}\,dx+\int_1^\infty e^{-x^2/2}\,dx\Big\}\le\frac{2}{\sqrt{2\pi}}\,e^{-t^2/2}.$$
By combining these two cases, we obtain readily (4.12) as sought.
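The bounds (4.11) and (4.12) are easy to inspect numerically. The following sketch (hypothetical function names) evaluates the Gaussian tail through the complementary error function of the standard library, using the identity $1-\Phi(t)=\frac12\,\mathrm{erfc}(t/\sqrt2)$, and returns the bounding quantities for comparison.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)


def normal_tail(t):
    """1 - Phi(t), computed through the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def mills_bounds(t):
    """Lower and upper bounds of (4.11), valid for t > 0."""
    upper = math.exp(-0.5 * t * t) / (t * SQRT_2PI)
    return upper * (1.0 - 1.0 / (t * t)), upper


def crude_bound(t):
    """Middle term of (4.12), valid for t >= 0."""
    return 2.0 * math.exp(-0.5 * t * t) / SQRT_2PI
```

The lower bound in (4.11) is negative, hence trivial, for $t<1$; the two-sided bound becomes tight as $t\to\infty$, since the ratio of its two sides is $1-t^{-2}$.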
4.2. A functional LIL for superpositions of Gaussian processes

To illustrate the Gaussian process methodology of §4.1, we give below a proof of an elementary version of the functional law of the iterated logarithm [FLIL] for superpositions of independent Gaussian processes. We refer to [93], [103], [134], for more refined results of the kind. We inherit the notation of §4.1, and consider a sequence $Z_1,Z_2,\ldots$ of independent replicae of $Z$. We set further
$$Y_n=\frac{1}{\sqrt n}\sum_{i=1}^{n}Z_i. \tag{4.13}$$
The following notation will be needed. For each $\varepsilon>0$ and $x\in X$, we set
$$V_{\varepsilon,x}=\{z\in X: |z-x|_X<\varepsilon\}. \tag{4.14}$$
For each $\varepsilon>0$ and $A\in\mathcal{B}_X$, $A\neq\emptyset$, we set
$$A^{\varepsilon}=\{z\in X:\ \exists x\in A,\ |z-x|_X<\varepsilon\}=\bigcup_{x\in A}V_{\varepsilon,x}. \tag{4.15}$$
For convenience, we set $\emptyset^{\varepsilon}=\emptyset$.

Theorem 4.3. Under (G.1–2), with probability 1 for any $\varepsilon>0$, there exists an $n_\varepsilon$ such that, for all $n\ge n_\varepsilon$,
$$(2\log_2 n)^{-1/2}\,Y_n\in K^{\varepsilon}. \tag{4.16}$$
Moreover, for each $z\in K$, we have, with probability 1,
$$\liminf_{n\to\infty}\big|(2\log_2 n)^{-1/2}\,Y_n-z\big|_X=0. \tag{4.17}$$
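Prior to the proof of Theorem 4.3, we will establish the following useful result, usually referred to as Ottaviani's lemma.

Lemma 4.2. Let $\zeta_0$ and $\xi_1,\ldots,\xi_N$ denote independent random vectors taking values in the Banach space $(X,|\cdot|_X)$. Set $\zeta_i=\zeta_0+\xi_1+\cdots+\xi_i$ for $i=1,\ldots,N$. Consider a Borel subset $A$ of $X$ and select an $\varepsilon>0$. Suppose that, for each $1\le i\le N$, $P(|\zeta_N-\zeta_i|_X\le\varepsilon)\ge 1/2$. Then, we have
$$P\big(\exists i=0,\ldots,N:\ \zeta_i\notin A^{2\varepsilon}\big)\le 2P\big(\zeta_N\notin A^{\varepsilon}\big). \tag{4.18}$$

Proof. Introduce the events
$$A_i(\varepsilon)=\{\zeta_i\notin A^{\varepsilon}\}\quad\text{for } i=0,\ldots,N,$$
$$B_i(\varepsilon)=\{|\zeta_N-\zeta_i|_X<\varepsilon\}\quad\text{for } i=0,\ldots,N.$$
Denote by $\overline{A}=\Omega-A$ the complement of $A$, and set $A_{-1}(2\varepsilon)=\emptyset$. We obtain readily that
$$\begin{aligned}
P\Big(\bigcup_{i=0}^{N}A_i(2\varepsilon)\Big)&=\sum_{i=0}^{N}P\big(A_i(2\varepsilon)\cap\overline{A}_{i-1}(2\varepsilon)\big)\\
&\le 2\sum_{i=0}^{N}P\big(A_i(2\varepsilon)\cap\overline{A}_{i-1}(2\varepsilon)\big)\,P\big(|\zeta_N-\zeta_i|_X\le\varepsilon\big)\\
&=2\sum_{i=0}^{N}P\big(A_i(2\varepsilon)\cap\overline{A}_{i-1}(2\varepsilon)\cap B_i(\varepsilon)\big)\\
&\le 2\sum_{i=0}^{N}P\big(A_i(2\varepsilon)\cap\overline{A}_{i-1}(2\varepsilon)\cap A_N(\varepsilon)\big)\le 2P\big(A_N(\varepsilon)\big),
\end{aligned}$$
which is (4.18).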
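Proof of Theorem 4.3. Part 1 – Outer Bounds. We select a $\gamma>0$ and an $\varepsilon>0$. Introduce the sequence of integers $n_k=\lfloor(1+\gamma)^k\rfloor$ for $k=0,1,\ldots$, and set $a_k=\varepsilon\sqrt{2n_k\log_2 n_k}$. Consider the events
$$C_k=\Big\{\exists n\in(n_{k-1},n_k]:\ n^{1/2}Y_n\notin\big\{n_k^{1/2}\sqrt{2\log_2 n_k}\,K\big\}^{2a_k}\Big\}, \tag{4.19}$$
$$D_k=\Big\{n_k^{1/2}Y_{n_k}\notin\big\{n_k^{1/2}\sqrt{2\log_2 n_k}\,K\big\}^{a_k}\Big\}. \tag{4.20}$$
Our assumptions imply that, ultimately as $k\to\infty$, for each $n\in(n_{k-1},n_k]$,
$$\begin{aligned}
P\Big(\big|n^{1/2}Y_n-n_k^{1/2}Y_{n_k}\big|_X\ge a_k\Big)&=P\Big((n_k-n)^{1/2}|Z|_X\ge\varepsilon\sqrt{2n_k\log_2 n_k}\Big)\\
&\le P\Big(|Z|_X\ge\varepsilon\Big(\frac{n_k}{n_{k-1}}\Big)^{1/2}\sqrt{2\log_2 n_k}\Big)\\
&\le P\Big(|Z|_X\ge\varepsilon\big(1+\tfrac12\gamma\big)\sqrt{2\log_2 n_k}\Big)\le\frac12.
\end{aligned}$$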
We may therefore apply Ottaviani's lemma (Lemma 4.2) to show that, for all $k$ sufficiently large,
$$P(C_k)\le 2P(D_k). \tag{4.21}$$
We now turn to the evaluation of $P(D_k)$. Toward this end, we make use of the Isoperimetric Inequality (Theorem 4.2) with choices of $A$, $B$ and $r$ which are specified below. We set, for a constant $m>0$ which will be specified later on,
$$A=D(0,m):=\{x\in X: |x|_X<m\},$$
$$B=X-\{A+rK\}=X-\{x+h: |x|_X<m,\ |h|_H\le r\}.$$
In view of (G.1), we may now choose $m>0$ so large that $P_Z(A)=P(|Z|_X<m)>1/2$, which, in turn, implies that $\delta:=\Phi^{-1}(P_Z(A))>0$. This, when combined with (4.10) and (4.12), shows that, for all $r>0$,
$$P_Z(B)\le 1-\Phi\big(\Phi^{-1}(P_Z(A))+r\big)\le\exp\Big(-\frac12(\delta+r)^2\Big). \tag{4.22}$$
We now choose $r=\sqrt{2\log_2 n_k}$ in (4.22). In view of the definition (4.20) of $D_k$, we observe that, ultimately as $k\to\infty$,
$$\begin{aligned}
P(D_k)&=P\Big(Z\notin\big\{\sqrt{2\log_2 n_k}\,K\big\}^{\varepsilon\sqrt{2\log_2 n_k}}\Big)\le P\Big(Z\notin\big\{\sqrt{2\log_2 n_k}\,K\big\}^{m}\Big)\\
&=P_Z(B)\le\exp\Big(-\frac12\big(\delta+\sqrt{2\log_2 n_k}\big)^2\Big)\\
&=\frac{1}{\log n_k}\exp\Big(-\frac{\delta^2}{2}-\delta\sqrt{2\log_2 n_k}\Big)=o\Big(\frac{1}{(\log n_k)(\log_2 n_k)^2}\Big).
\end{aligned}$$
Since $\log n_k=(1+o(1))\,k\log(1+\gamma)$ as $k\to\infty$, we infer from this last expression, when combined with (4.21), that
$$\sum_{k=1}^{\infty}P(C_k)<\infty.$$
The Borel-Cantelli lemma implies in turn that, with probability 1 for all $k$ sufficiently large, we have, uniformly over all $n$ such that $n_{k-1}<n\le n_k$,
$$\frac{Y_n}{\sqrt{2\log_2 n}}\in\Big\{\Big(\frac{n_k\log_2 n_k}{n_{k-1}\log_2 n_{k-1}}\Big)^{1/2}K\Big\}^{2\varepsilon\sqrt{n_k\log_2 n_k}/\sqrt{n_{k-1}\log_2 n_{k-1}}}\subseteq\big\{(1+\gamma)K\big\}^{2\varepsilon(1+\gamma)}. \tag{4.23}$$
Now, we make use of (G.2) to show that, for any fixed $\varepsilon_0>0$, we may select $\gamma>0$ and $\varepsilon>0$ so small that
$$\big\{(1+\gamma)K\big\}^{2\varepsilon(1+\gamma)}\subseteq K^{\varepsilon_0}.$$
This, when combined with (4.23), suffices for (4.16).
Part 2 – Inner Bounds. We first note that the assumption (G.2) implies that
$$M:=\sup_{g\in K}|g|_X<\infty. \tag{4.24}$$
Let, as in Part 1, $n_k=\lfloor(1+\gamma)^k\rfloor$, for $k=0,1,\ldots$. Under the assumptions of the theorem, it is enough to show that, for each specified $\varepsilon>0$ and $h\in K$ with $1-2\eta:=|h|_H^2<1$, there exists a $\gamma>0$ such that, almost surely,
$$\liminf_{k\to\infty}\big|(2\log_2 n_k)^{-1/2}\,Y_{n_k}-h\big|_X<\varepsilon. \tag{4.25}$$
In view of (4.24), and making use of Part 1, we obtain readily that, almost surely,
$$\begin{aligned}
\limsup_{k\to\infty}\big|(2n_k\log_2 n_k)^{-1/2}\sqrt{n_{k-1}}\,Y_{n_{k-1}}\big|_X&=\Big\{\lim_{k\to\infty}\frac{(2n_k\log_2 n_k)^{-1/2}}{(2n_{k-1}\log_2 n_{k-1})^{-1/2}}\Big\}\limsup_{k\to\infty}\big|(2\log_2 n_{k-1})^{-1/2}\,Y_{n_{k-1}}\big|_X\\
&\le(1+\gamma)^{-1/2}\sup_{g\in K}|g|_X\le 2M(1+\gamma)^{-1/2}.
\end{aligned}$$
Therefore, there exists a $\gamma_1>0$ such that, whenever $\gamma\ge\gamma_1$, almost surely,
$$\limsup_{k\to\infty}\big|(2n_k\log_2 n_k)^{-1/2}\sqrt{n_{k-1}}\,Y_{n_{k-1}}\big|_X<\tfrac12\varepsilon. \tag{4.26}$$
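Consider next the mutually independent events
$$E_k:=\Big\{\Big|\big(2(n_k-n_{k-1})\log_2(n_k-n_{k-1})\big)^{-1/2}\big(\sqrt{n_k}\,Y_{n_k}-\sqrt{n_{k-1}}\,Y_{n_{k-1}}\big)-h\Big|_X<\tfrac18\varepsilon\Big\}. \tag{4.27}$$
As follows from (4.7) in combination with (G.1), we obtain therefore that, ultimately as $k\to\infty$,
$$\begin{aligned}
P(E_k)&=P\Big(\big|Z-\{2\log_2(n_k-n_{k-1})\}^{1/2}h\big|_X<\tfrac18\varepsilon\{2\log_2(n_k-n_{k-1})\}^{1/2}\Big)\\
&\ge\exp\Big(-\tfrac12|h|_H^2\times 2\log_2(n_k-n_{k-1})\Big)\,P\Big(|Z|_X<\tfrac18\varepsilon\{2\log_2(n_k-n_{k-1})\}^{1/2}\Big)\\
&\ge\tfrac12\exp\Big(-\tfrac12(1-2\eta)\times 2\log_2(n_k-n_{k-1})\Big)\ge\frac{1}{k^{1-\eta}}.
\end{aligned}$$
Since this entails that
$$\sum_{k=1}^{\infty}P(E_k)=\infty,$$
the Borel-Cantelli lemma implies, in turn, that $P(E_k\ \text{i.o.})=1$. Now, by the definition (4.27) of $E_k$, it follows that, almost surely,
$$\begin{aligned}
&\liminf_{k\to\infty}\Big|(2n_k\log_2 n_k)^{-1/2}\big(\sqrt{n_k}\,Y_{n_k}-\sqrt{n_{k-1}}\,Y_{n_{k-1}}\big)-h\Big|_X\\
&\quad\le\tfrac18\varepsilon\lim_{k\to\infty}\frac{(2n_k\log_2 n_k)^{-1/2}}{(2(n_k-n_{k-1})\log_2(n_k-n_{k-1}))^{-1/2}}+|h|_X\Big\{1-\lim_{k\to\infty}\frac{(2n_k\log_2 n_k)^{-1/2}}{(2(n_k-n_{k-1})\log_2(n_k-n_{k-1}))^{-1/2}}\Big\}\\
&\quad\le\tfrac14\varepsilon\sqrt{\frac{\gamma}{1+\gamma}}+M\Big\{1-\sqrt{\frac{\gamma}{1+\gamma}}\Big\}.
\end{aligned} \tag{4.28}$$
In view of (4.28), we see that there exists a $\gamma_2>0$ such that, whenever $\gamma\ge\gamma_2$, we have, almost surely,
$$\liminf_{k\to\infty}\Big|(2n_k\log_2 n_k)^{-1/2}\big(\sqrt{n_k}\,Y_{n_k}-\sqrt{n_{k-1}}\,Y_{n_{k-1}}\big)-h\Big|_X<\tfrac12\varepsilon. \tag{4.29}$$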
By combining (4.26) with (4.29), we obtain readily that (4.25) holds for all $\gamma\ge\gamma_1\vee\gamma_2$. This concludes the proof of Theorem 4.3.

Exercise 4.1. Prove (3.87) and (3.88).

4.3. Karhunen-Loève expansions

We recall the following facts about Karhunen-Loève [KL] expansions (see, e.g., [3], [7], [119] and [122]). Denote by $\{Z(t): 0<t<1\}$ a centered Gaussian process with covariance function $\{R(s,t)=E(Z(s)Z(t)): 0<s,t<1\}$, fulfilling
$$0<\int_0^1R(t,t)\,dt<\infty. \tag{4.30}$$
Then, there exist nonnegative constants $\{\lambda_k: k\ge 1\}$, $\lambda_k\downarrow$, together with functions $\{e_k(t): k\ge 1\}\subset L^2(0,1)$ of $t\in(0,1)$ such that the properties (K.1–2–3–4) below hold.

(K.1) For all $i\ge 1$ and $k\ge 1$,
$$\int_0^1e_i(t)e_k(t)\,dt=\begin{cases}1&\text{if } i=k,\\ 0&\text{if } i\neq k.\end{cases}$$

(K.2) The $\{\lambda_k,e_k(\cdot): k\ge 1\}$ form a complete set of solutions of the Fredholm equation in $(\lambda,e(\cdot))$, $\lambda\neq 0$,
$$\lambda e(t)=\int_0^1R(s,t)e(s)\,ds\quad\text{for } 0<t<1\quad\text{and}\quad\int_0^1e^2(t)\,dt=1. \tag{4.31}$$
The $\lambda_k$ (resp. $e_k(\cdot)$) are the eigenvalues (resp. eigenfunctions) of the Fredholm transformation
$$f\in L^2(0,1)\mapsto Tf\in L^2(0,1):\quad Tf(t)=\int_0^1R(s,t)f(s)\,ds,\quad t\in(0,1).$$
(K.3) The series expansion
$$R(s,t)=\sum_{k\ge 1}\lambda_ke_k(s)e_k(t)\quad\text{for } 0<s,t<1, \tag{4.32}$$
is convergent in $L^2\big((0,1)^2\big)$.

(K.4) There exist independent and identically distributed [i.i.d.] $N(0,1)$ random variables $\{\omega_k: k\ge 1\}$ such that the Karhunen-Loève [KL] expansion
$$Z(t)=\sum_{k\ge 1}\sqrt{\lambda_k}\,\omega_ke_k(t),\quad 0<t<1, \tag{4.33}$$
of $Z(\cdot)$ holds, with the series (4.33) converging a.s., and in integrated mean square on $(0,1)$.

Remark 4.1. 1°) The sequence $\{\lambda_k,e_k(\cdot): k\ge 1\}$ in (K.1–2–3–4) may very well be finite. Below, we will implicitly exclude this case and specialize in infinite KL expansions with $k$ ranging through $\mathbb{N}^*=\{1,2,\ldots\}$, with $\lambda_1>\lambda_2>\cdots>0$.
2°) If, in addition to (4.30), $Z(\cdot)$ is a.s. continuous on $[0,1]$ with covariance function $R(\cdot,\cdot)$ continuous on $[0,1]^2$, then, we may choose the functions $\{e_k(\cdot): k\ge 1\}$ in the KL expansion (4.33) to be continuous on $[0,1]$. The series (4.32) is then absolutely and uniformly convergent on $[0,1]^2$, and the series (4.33) is a.s. uniformly convergent on $[0,1]$ (see, e.g., [3]).

There are few Gaussian processes of interest with respect to statistics for which the KL expansion is known through explicit values of $\{\lambda_k: k\ge 1\}$, and with simple forms of the functions $\{e_k(\cdot): k\ge 1\}$ (see, e.g., [152] for a review). It is useful to have a precise knowledge of the $\lambda_k$'s, since we infer from (4.33) that
$$D^2=\int_0^1Z^2(t)\,dt=\sum_{k\ge 1}\lambda_k\omega_k^2. \tag{4.34}$$
This readily implies (see, e.g., (6.23), p. 200 in [119]), that the moment-generating function of the distribution of $D^2$ is given by
$$\psi_{D^2}(z)=E\big(\exp(zD^2)\big)=\prod_{k=1}^{\infty}\Big(\frac{1}{1-2z\lambda_k}\Big)^{1/2}\quad\text{for } \mathrm{Re}(z)<\frac{1}{2\lambda_1}. \tag{4.35}$$
We have $|E(\exp(zD^2))|<\infty$ for all $z\in\mathbb{C}$ with $\mathrm{Re}(z)<\frac{1}{2\lambda_1}$, subject to the additional conditions that
$$\text{(i)}\ \ \lambda_1>\lambda_2>\cdots>0,\quad\text{and}\quad\text{(ii)}\ \ \sum_{k=1}^{\infty}\lambda_k=\sum_{k=1}^{\infty}\frac{1}{\gamma_k}<\infty,\quad\text{where } \gamma_k=\frac{1}{\lambda_k}\ \text{for } k\ge 1. \tag{4.36}$$
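Since $D^2$ is a weighted sum of independent $\chi_1^2$ components, its distribution is easy to compute under (4.36) via the Smirnov formula (see, e.g., [43], [150], [151], [189], [192]). For $t>0$,
$$P\big(D^2>t\big)=\frac{1}{\pi}\sum_{k=1}^{\infty}(-1)^{k+1}\int_{\gamma_{2k-1}}^{\gamma_{2k}}\frac{e^{-tu/2}\,du}{u\sqrt{|F(u)|}} \tag{4.37}$$
$$=\frac{1+o(1)}{\pi}\int_{\gamma_1}^{\gamma_2}\frac{e^{-tu/2}\,du}{u\sqrt{|F(u)|}}\quad\text{as } t\to\infty,$$
where $F(u)$ is the Fredholm determinant defined, under (4.36), by
$$F(u)=\prod_{k=1}^{\infty}\big(1-u\lambda_k\big)=\prod_{k=1}^{\infty}\Big(1-\frac{u}{\gamma_k}\Big)\quad\text{for } u\ge 0. \tag{4.38}$$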
In view of (4.35)–(4.38), we note that
$$\psi_{D^2}(z)=\{F(2z)\}^{-1/2}\quad\text{for } \mathrm{Re}(z)<\frac{1}{2\lambda_1}. \tag{4.39}$$
We refer to Martynov ([152], [150]) for a study of the convergence of the series (4.37), together with versions of this formula holding when some of the consecutive terms of the sequence $\lambda_1\ge\lambda_2\ge\cdots>0$ are equal.
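As an illustration of (4.38)–(4.39), take for $Z$ the Brownian bridge, whose KL eigenvalues are $\lambda_k=1/(k\pi)^2$ (see (4.44) in §4.4). Euler's product for the sine function then gives the closed form $F(u)=\sin(\sqrt u)/\sqrt u$, so that $\psi_{D^2}(z)=\{\sin(\sqrt{2z})/\sqrt{2z}\}^{-1/2}$ for real $0<z<\pi^2/2$, and $E(D^2)=\sum_k\lambda_k=1/6$. The following sketch, with hypothetical function names and a truncation level chosen only for illustration, compares truncated products with these closed forms.

```python
import math


def bridge_eigenvalue(k):
    """KL eigenvalue of the Brownian bridge, lambda_k = 1 / (k pi)^2."""
    return 1.0 / (k * math.pi) ** 2


def fredholm_determinant(u, n_terms=20000):
    """Truncated product approximating F(u) = prod_k (1 - u lambda_k)."""
    prod = 1.0
    for k in range(1, n_terms + 1):
        prod *= 1.0 - u * bridge_eigenvalue(k)
    return prod


def mgf(z, n_terms=20000):
    """psi_{D^2}(z) = F(2z)^{-1/2}, for real z < 1 / (2 lambda_1) = pi^2 / 2."""
    return fredholm_determinant(2.0 * z, n_terms) ** -0.5
```

The truncation error of the product is of order $u/(\pi^2 n)$ when $n$ terms are retained, which is why a fairly large number of terms is used here.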
4.4. The RKHS of the Wiener process and Brownian bridge

The standard Wiener process $W(t)$ admits on $[0,1]$ the Karhunen-Loève representation
$$W(t)=\sum_{k=1}^{\infty}Y_k\times\frac{1}{(k-\frac12)\pi}\times\sqrt2\sin\big((k-\tfrac12)\pi t\big), \tag{4.40}$$
where $\{Y_k: k\ge 1\}$ denotes an iid sequence of normal $N(0,1)$ r.v's. The proof of (4.40) is left to the reader in Exercise 4.2, and as a special case of Theorem 4.6 in §4.5, in the sequel. In view of (4.40), given any two functions $f$ and $g$ of the form
$$f(t)=\sum_{k=1}^{\infty}a_k\times\sqrt2\sin\big((k-\tfrac12)\pi t\big)\quad\text{and}\quad g(t)=\sum_{k=1}^{\infty}b_k\times\sqrt2\sin\big((k-\tfrac12)\pi t\big), \tag{4.41}$$
the Hilbert product of $f$ and $g$ within the reproducing kernel Hilbert space [RKHS] pertaining to the Wiener process on $[0,1]$ is given by
$$\langle f,g\rangle_H=\sum_{k=1}^{\infty}\big\{(k-\tfrac12)\pi\big\}^2a_kb_k=\int_0^1\dot f(t)\dot g(t)\,dt, \tag{4.42}$$
where $\dot f$ and $\dot g$ denote, respectively, the Lebesgue derivatives of $f$ and $g$. We obtain the proposition:
Proposition 4.2. The RKHS pertaining to the Wiener process on $[0,1]$ is the space $H$, with Hilbert norm $|\cdot|_H$, of all absolutely continuous functions $f$ on $[0,1]$ such that $|f|_H<\infty$, where
$$|f|_H=\begin{cases}\Big\{\int_0^1\dot f(t)^2\,dt\Big\}^{1/2}&\text{if } f\in AC[0,1]\ \text{fulfills } f(0)=0,\\ \infty&\text{otherwise.}\end{cases} \tag{4.43}$$
The standard Brownian bridge $B(t)$ admits on $[0,1]$ the Karhunen-Loève representation
$$B(t)=\sum_{k=1}^{\infty}Y_k\times\frac{1}{k\pi}\times\sqrt2\sin(k\pi t), \tag{4.44}$$
where $\{Y_k: k\ge 1\}$ denotes an iid sequence of normal $N(0,1)$ r.v's. Therefore, given any two functions $f$ and $g$ of the form
$$f(t)=\sum_{k=1}^{\infty}a_k\times\sqrt2\sin(k\pi t)\quad\text{and}\quad g(t)=\sum_{k=1}^{\infty}b_k\times\sqrt2\sin(k\pi t), \tag{4.45}$$
the Hilbert product of $f$ and $g$ within the RKHS pertaining to the Brownian bridge on $[0,1]$ is given by
$$\langle f,g\rangle_H=\sum_{k=1}^{\infty}(k\pi)^2a_kb_k=\int_0^1\dot f(t)\dot g(t)\,dt, \tag{4.46}$$
where $\dot f$ and $\dot g$ denote, respectively, the Lebesgue derivatives of $f$ and $g$. We obtain the proposition:

Proposition 4.3. The RKHS pertaining to the Brownian bridge on $[0,1]$ is the space $H$, with Hilbert norm $|\cdot|_H$, of all absolutely continuous functions $f$ on $[0,1]$ such that $|f|_H<\infty$, where
$$|f|_H=\begin{cases}\Big\{\int_0^1\dot f(t)^2\,dt\Big\}^{1/2}&\text{if } f\in AC[0,1]\ \text{fulfills } f(0)=f(1)=0,\\ \infty&\text{otherwise.}\end{cases} \tag{4.47}$$

Exercise 4.2. Consider the Fredholm transformation of $L^2[0,1]$ onto itself, defined by
$$f\mapsto Tf(x)=\int_0^1\min\{x,t\}f(t)\,dt,$$
and let $y(x)$ denote an eigenfunction of $T$, fulfilling $Ty=\lambda y$ for some $\lambda>0$.
1°) Show that $y(\cdot)$ is continuous on $[0,1]$, then, by a recursion, that $y$ is twice continuously differentiable on $(0,1)$ and such that $y(0)=0$ and $y'(1)=0$.
2°) Show that $y$ is a solution of the differential equation $\lambda y''+y=0$, and that the only possible values of $\lambda$ are of the form $\lambda=1/\{(k-\frac12)\pi\}^2$ for $k=1,2,\ldots$. Conclude that (4.40) holds.
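The eigenvalue claim of Exercise 4.2 can be checked numerically before it is proved. The sketch below is an illustration only (it is not the analytic argument asked for in the exercise, and the function name is hypothetical): it discretizes $T$ by the midpoint rule on a grid of $(0,1)$ and runs a plain power iteration, which should converge to the largest eigenvalue $\lambda_1=4/\pi^2$ and to an eigenvector proportional to $\sin(\pi t/2)$.

```python
import math


def top_eigenpair_min_kernel(n_grid=100, n_iter=40):
    """Power iteration for the kernel min(x, s), discretized on a midpoint grid."""
    h = 1.0 / n_grid
    t = [(i + 0.5) * h for i in range(n_grid)]      # midpoint grid on (0, 1)
    v = [1.0 / math.sqrt(n_grid)] * n_grid           # unit starting vector
    lam = 0.0
    for _ in range(n_iter):
        # Midpoint-rule version of (T v)(x) = int_0^1 min(x, s) v(s) ds.
        w = [h * sum(min(x, s) * vs for s, vs in zip(t, v)) for x in t]
        lam = math.sqrt(sum(wi * wi for wi in w))    # ||T v|| tends to lambda_1
        v = [wi / lam for wi in w]
    return lam, t, v
```

Power iteration converges quickly here because the ratio $\lambda_2/\lambda_1$ equals $1/9$; the remaining discrepancy with $4/\pi^2$ is the discretization error of the grid.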
4.5. KL expansions for weighted Wiener processes and Brownian bridges

In this section, we provide KL expansions for weighted Wiener processes and Brownian bridges due to Deheuvels and Martynov [63]. Throughout, $\{W(t): t\ge 0\}$ and $\{B(t): 0\le t\le 1\}$ denote, respectively, a Wiener process, and a Brownian bridge. These processes are centered with covariance functions
$$E(W(s)W(t))=s\wedge t\quad\text{for } s,t\ge 0,\quad\text{and}\quad E(B(s)B(t))=s\wedge t-st\quad\text{for } 0\le s,t\le 1.$$
Denote by $\{\psi(t): 0<t<1\}$ a positive and continuous function on $(0,1)$, whose definition will, at times, be extended by continuity to $(0,1]$ or $[0,1]$. Below, we will work under additional conditions taken among the following.

(L.1) $\psi(\cdot)$ is continuous on $(0,1]$;
(L.2) (i) $\lim_{t\downarrow 0}t\psi(t)=0$; (ii) $\int_0^1t\psi^2(t)\,dt<\infty$;
(L.3) (i) $\lim_{t\downarrow 0}t\psi(t)=\lim_{t\uparrow 1}(1-t)\psi(t)=0$; (ii) $\int_0^1t(1-t)\psi^2(t)\,dt<\infty$.

It is readily checked that (L.2)(ii) (resp. (L.3)(ii)) is the version of (4.30) corresponding to $Z(t)=Z_1(t)$ (resp. $Z(t)=Z_2(t)$), where
$$Z_1(t)=\psi(t)W(t)\quad\text{and}\quad Z_2(t)=\psi(t)B(t)\quad\text{for } 0<t<1.$$
To obtain the KL expansions of $Z_1(\cdot)$, $Z_2(\cdot)$, we will use the following theorems, in the spirit of Kac and Siegert ([121], [122]), and Kac (see, e.g., pp. 199-200 in [119] and Section 2 in [120]).

Theorem 4.4. Assume (L.1–2). Set $Z(t)=Z_1(t)=\psi(t)W(t)$ for $0<t\le 1$. Then, the $\{(\lambda_k,e_k(\cdot)): k\ge 1\}$ in the KL expansion of $Z(\cdot)$ are obtained by setting $\lambda=1/\gamma$ and $e(t)=y(t)\psi(t)$, where $y(\cdot)$ is a solution, continuous on $[0,1]$ and twice continuously differentiable on $(0,1]$, of the differential equation
$$y''(t)+\gamma\psi^2(t)y(t)=0, \tag{4.48}$$
subject to $\gamma>0$ and with limit conditions
$$y(0)=0\quad\text{and}\quad y'(1)=0. \tag{4.49}$$

Theorem 4.5. Assume (L.3). Set $Z(t)=Z_2(t)=\psi(t)B(t)$ for $0<t<1$. Then, the $\{(\lambda_k,e_k(\cdot)): k\ge 1\}$ in the KL expansion of $Z(\cdot)$ are obtained by setting $\lambda=1/\gamma$ and $e(t)=y(t)\psi(t)$, where $y(\cdot)$ is a solution, continuous on $[0,1]$ and twice continuously differentiable on $(0,1)$, of the differential equation
$$y''(t)+\gamma\psi^2(t)y(t)=0, \tag{4.50}$$
subject to $\gamma>0$ and with limit conditions
$$y(0)=0\quad\text{and}\quad y(1)=0. \tag{4.51}$$
Proof. The details of the proofs of Theorems 4.4 and 4.5 are given in Deheuvels and Martynov [63].

In the sequel, we will concentrate on the particular case where, for some constant $\beta\in\mathbb{R}$,
$$\psi(t)=t^{\beta}\quad\text{for } 0<t\le 1. \tag{4.52}$$
We note that (L.1–2–3) hold under (4.52) iff $\beta>-1$. In particular,
$$\beta>-1\ \Leftrightarrow\ \int_0^1t\psi^2(t)\,dt<\infty\ \Leftrightarrow\ \int_0^1t(1-t)\psi^2(t)\,dt<\infty.$$
For $\nu>-1$, consider the Bessel function $J_\nu(\cdot)$ of the first kind and index $\nu$ (see §4.6 below for details on the definition and properties of $J_\nu(\cdot)$). For $\nu>-1$, the positive zeros of $J_\nu(\cdot)$ (solutions of $J_\nu(z)=0$) form an infinite sequence, denoted hereafter by $0<z_{\nu,1}<z_{\nu,2}<\cdots$. These zeros are interlaced with the zeros $0<z_{\nu+1,1}<z_{\nu+1,2}<\cdots$ of $J_{\nu+1}(\cdot)$ (see, e.g., [207], p. 479), in such a way that
$$0<z_{\nu,1}<z_{\nu+1,1}<z_{\nu,2}<z_{\nu+1,2}<z_{\nu,3}<\cdots. \tag{4.53}$$
The next theorems provide KL expansions for $\{t^{\beta}W(t): 0\le t\le 1\}$ and $\{t^{\beta}B(t): 0\le t\le 1\}$.

Theorem 4.6. Let $\{W(t): t\ge 0\}$ denote a Wiener process. Then, for each $\beta=\frac{1}{2\nu}-1>-1$, or equivalently, for each $\nu=1/(2(1+\beta))>0$, the Karhunen-Loève expansion of $\{t^{\beta}W(t): 0<t\le 1\}$ is given by
$$t^{\beta}W(t)=t^{\frac{1}{2\nu}-1}W(t)=\sum_{k=1}^{\infty}\sqrt{\lambda_k}\,\omega_ke_k(t), \tag{4.54}$$
where $\{\omega_k: k\ge 1\}$ are i.i.d. $N(0,1)$ random variables, and, for $k=1,2,\ldots$,
$$\lambda_k=\Big(\frac{2\nu}{z_{\nu-1,k}}\Big)^2,\qquad e_k(t)=t^{\frac{1}{2\nu}-\frac12}\,\frac{J_\nu\big(z_{\nu-1,k}\,t^{\frac{1}{2\nu}}\big)}{\sqrt{\nu}\,J_\nu(z_{\nu-1,k})}\quad\text{for } 0<t\le 1. \tag{4.55}$$

Theorem 4.7. Let $\{B(t): 0\le t\le 1\}$ denote a Brownian bridge. Then, for each $\beta=\frac{1}{2\nu}-1>-1$, or equivalently, for each $\nu=1/(2(1+\beta))>0$, the Karhunen-Loève expansion of $\{t^{\beta}B(t): 0<t\le 1\}$ is given by
$$t^{\beta}B(t)=t^{\frac{1}{2\nu}-1}B(t)=\sum_{k=1}^{\infty}\sqrt{\lambda_k}\,\omega_ke_k(t), \tag{4.56}$$
where $\{\omega_k: k\ge 1\}$ are i.i.d. $N(0,1)$ random variables, and, for $k=1,2,\ldots$,
$$\lambda_k=\Big(\frac{2\nu}{z_{\nu,k}}\Big)^2,\qquad e_k(t)=t^{\frac{1}{2\nu}-\frac12}\,\frac{J_\nu\big(z_{\nu,k}\,t^{\frac{1}{2\nu}}\big)}{\sqrt{\nu}\,J_{\nu-1}(z_{\nu,k})}\quad\text{for } 0<t\le 1. \tag{4.57}$$
The KL expansion of $\{t^{\beta}B(t): 0<t\le 1\}$ in Theorem 4.7 is known for $\beta=0$ ($\Leftrightarrow\nu=1/2$) (see, e.g., [5] and §4.4, pp. 30-32 in [85]). For $\beta=-1/2$ ($\Leftrightarrow\nu=1$), we refer to Scott [179]. In the general case where $\beta>-1/2$ ($\Leftrightarrow 0<\nu<1$), this KL expansion turns out to be equivalent to a KL expansion given in Li [140].
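For $\nu=1/2$ (that is, $\beta=0$), the formulas (4.55) and (4.57) can be compared with (4.40) and (4.44), using the closed forms $J_{1/2}(x)=\sqrt{2/(\pi x)}\sin x$ and $J_{-1/2}(x)=\sqrt{2/(\pi x)}\cos x$, so that $z_{1/2,k}=k\pi$ and $z_{-1/2,k}=(k-\frac12)\pi$. The sketch below (hypothetical function names, restricted to this half-integer case) evaluates the Bessel expressions literally; up to the sign $(-1)^k$ or $(-1)^{k+1}$, which is immaterial for an eigenfunction, they should reduce to $\sqrt2\sin(k\pi t)$ and $\sqrt2\sin((k-\frac12)\pi t)$.

```python
import math

NU = 0.5  # the only index handled by this sketch (beta = 0)


def bessel_half(x):
    """J_{1/2}(x) = sqrt(2 / (pi x)) sin x, for x > 0."""
    return math.sqrt(2.0 / (math.pi * x)) * math.sin(x)


def bessel_minus_half(x):
    """J_{-1/2}(x) = sqrt(2 / (pi x)) cos x, for x > 0."""
    return math.sqrt(2.0 / (math.pi * x)) * math.cos(x)


def bridge_pair(k, t):
    """(lambda_k, e_k(t)) from (4.57) with nu = 1/2; z_{1/2,k} = k pi."""
    z = k * math.pi
    p = 1.0 / (2.0 * NU)
    e = t ** (p - 0.5) * bessel_half(z * t ** p) / (math.sqrt(NU) * bessel_minus_half(z))
    return (2.0 * NU / z) ** 2, e


def wiener_pair(k, t):
    """(lambda_k, e_k(t)) from (4.55) with nu = 1/2; z_{-1/2,k} = (k - 1/2) pi."""
    z = (k - 0.5) * math.pi
    p = 1.0 / (2.0 * NU)
    e = t ** (p - 0.5) * bessel_half(z * t ** p) / (math.sqrt(NU) * bessel_half(z))
    return (2.0 * NU / z) ** 2, e
```

For other values of $\nu$ the zeros $z_{\nu,k}$ have no elementary closed form and would have to be computed numerically, which this sketch does not attempt.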
The simple observation that the $\{\sqrt{\rho}\,t^{\frac12(\rho-1)}e_k(t^{\rho}): k\ge 1\}$ are orthonormal in $L^2(0,1)$ whenever such is the case for the $\{e_k(t): k\ge 1\}$ (see, e.g., [145]), allows us to derive, via (4.58)–(4.59)–(4.60) below, through the change of scale $t\to t^{\rho}$, a series of variants of the KL expansions of Theorems 4.6–4.7. We will make use of the convention of writing these KL expansions under the form
$$Z(t)=\sum_{k=1}^{\infty}\lambda_k^{1/2}\,\omega_ke_k(t),\quad 0<t\le 1, \tag{4.58}$$
where, as in (4.33), $\lambda_k$ and $e_k(\cdot)$ for $k\ge 1$, are, respectively, the eigenvalues and eigenfunctions pertaining to $\{Z(t): 0<t\le 1\}$. The arguments above show that, whenever the KL expansion (4.58) holds, then, for each choice of $\rho>0$, the KL expansion of $t^{\frac12(\rho-1)}Z(t^{\rho})$ on $(0,1]$ is given by
$$t^{\frac12(\rho-1)}Z(t^{\rho})=\sum_{k=1}^{\infty}\big\{\lambda_k/\rho\big\}^{1/2}\,\omega_k\Big\{\sqrt{\rho}\,t^{\frac12(\rho-1)}e_k(t^{\rho})\Big\},\quad 0<t\le 1, \tag{4.59}$$
with eigenvalues $\lambda_k^*$ and eigenfunctions $e_k^*(\cdot)$ for $k=1,2,\ldots$, given by
$$\lambda_k^*=\lambda_k/\rho\quad\text{and}\quad e_k^*(t)=\sqrt{\rho}\,t^{\frac12(\rho-1)}e_k(t^{\rho}). \tag{4.60}$$
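The orthonormality behind (4.60) follows from the change of variable $s=t^{\rho}$, since $\int_0^1\rho\,t^{\rho-1}e_j(t^{\rho})e_k(t^{\rho})\,dt=\int_0^1e_j(s)e_k(s)\,ds$. As a quick numerical illustration (a sketch with hypothetical function names, using the sine basis of (4.44) as the starting orthonormal system and a midpoint quadrature), one may compute the Gram matrix of the rescaled functions.

```python
import math


def sine_basis(k):
    """e_k(t) = sqrt(2) sin(k pi t), orthonormal in L^2(0, 1)."""
    return lambda t: math.sqrt(2.0) * math.sin(k * math.pi * t)


def rescaled(e, rho):
    """e*(t) = sqrt(rho) t^{(rho - 1)/2} e(t^rho), as in (4.60)."""
    return lambda t: math.sqrt(rho) * t ** (0.5 * (rho - 1.0)) * e(t ** rho)


def inner_product(f, g, n=4000):
    """Midpoint-rule approximation of the L^2(0, 1) inner product of f and g."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) * g((i + 0.5) * h) for i in range(n))
```

With $\rho=2$ or $\rho=3$, the computed Gram matrix of the first few rescaled functions should agree with the identity matrix up to quadrature error.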
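By combining (4.58)–(4.59)–(4.60) with Theorems 4.6–4.7, we obtain readily in (4.61)–(4.62) below the KL expansions of $t^{\theta}W(t^{\rho})$ and $t^{\theta}B(t^{\rho})$ (under the form (4.58)). For any $\rho>0$, $\theta>-\frac12(\rho+1)$, and $\nu=\rho/(2\theta+\rho+1)>0$, we get namely
$$t^{\theta}W(t^{\rho})=\sum_{k=1}^{\infty}\omega_k\,\frac{2\nu}{z_{\nu-1,k}\sqrt{\rho}}\,\Big\{\sqrt{\rho}\,t^{\frac{\rho}{2\nu}-\frac12}\,\frac{J_\nu\big(z_{\nu-1,k}\,t^{\frac{\rho}{2\nu}}\big)}{\sqrt{\nu}\,J_\nu(z_{\nu-1,k})}\Big\}, \tag{4.61}$$
$$t^{\theta}B(t^{\rho})=\sum_{k=1}^{\infty}\omega_k\,\frac{2\nu}{z_{\nu,k}\sqrt{\rho}}\,\Big\{\sqrt{\rho}\,t^{\frac{\rho}{2\nu}-\frac12}\,\frac{J_\nu\big(z_{\nu,k}\,t^{\frac{\rho}{2\nu}}\big)}{\sqrt{\nu}\,J_{\nu-1}(z_{\nu,k})}\Big\}. \tag{4.62}$$
In particular, by setting $\rho=1$, $\theta=\beta$ and $\nu=1/(2(\beta+1))$ in (4.61)–(4.62), we get (4.54)–(4.55) and (4.56)–(4.57). Of special interest here is the choice of $\rho=2\nu$ and $\theta=\frac12-\nu$ in (4.61)–(4.62). For these values of the constants $\rho$, $\theta$, we obtain that, for each $\nu>0$, the KL expansions of $t^{\frac12-\nu}W(t^{2\nu})$ and $t^{\frac12-\nu}B(t^{2\nu})$ are given by
$$t^{\frac12-\nu}W(t^{2\nu})=\sum_{k=1}^{\infty}\omega_k\,\frac{\sqrt{2\nu}}{z_{\nu-1,k}}\,\Big\{\sqrt2\,t^{\frac12}\,\frac{J_\nu(z_{\nu-1,k}\,t)}{J_\nu(z_{\nu-1,k})}\Big\}, \tag{4.63}$$
$$t^{\frac12-\nu}B(t^{2\nu})=\sum_{k=1}^{\infty}\omega_k\,\frac{\sqrt{2\nu}}{z_{\nu,k}}\,\Big\{\sqrt2\,t^{\frac12}\,\frac{J_\nu(z_{\nu,k}\,t)}{J_{\nu-1}(z_{\nu,k})}\Big\}. \tag{4.64}$$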
By multiplying both sides of (4.63) by t− 2 , we get a Dini series expansion of t−ν W (t2ν ) on (0, 1). Proceeding likewise with (4.64) one obtains the Fourier-Bessel series expansion of t−ν B(t2ν ) on (0, 1) (see, e.g., pp. 96–103 in [132]). Recall that,
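under suitable conditions on the functions $f(\cdot)$ and $g(\cdot)$ on $(0,1)$, it is possible to expand $f(\cdot)$ into the Fourier-Bessel expansion $f(t) = \sum_{k=1}^{\infty} a_k J_\nu(z_{\nu,k}\, t)$, with

$$a_k = \frac{2}{J_{\nu-1}^2(z_{\nu,k})} \int_0^1 t f(t) J_\nu(z_{\nu,k}\, t)\,dt, \qquad k \ge 1, \tag{4.65}$$

and $g(\cdot)$ into a Dini expansion $g(t) = \sum_{k=1}^{\infty} b_k J_\nu(z_{\nu-1,k}\, t)$, with

$$b_k = \frac{2}{J_{\nu}^2(z_{\nu-1,k})} \int_0^1 t g(t) J_\nu(z_{\nu-1,k}\, t)\,dt, \qquad k \ge 1. \tag{4.66}$$

By setting $f(t) = t^{-\nu} B(t^{2\nu})$ and $g(t) = t^{-\nu} W(t^{2\nu})$ in (4.65)–(4.66), we get

$$a_k = \frac{2\sqrt{\nu}\,\omega_k}{z_{\nu,k}\, J_{\nu-1}(z_{\nu,k})} \quad \text{and} \quad b_k = \frac{2\sqrt{\nu}\,\omega_k}{z_{\nu-1,k}\, J_\nu(z_{\nu-1,k})} \quad \text{for } k \ge 1. \tag{4.67}$$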
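The normalizing constants in (4.65)–(4.66) can be sanity-checked in the one case where everything is explicit. The sketch below assumes $\nu = \frac12$, uses the closed forms (4.79) for $J_{\pm 1/2}$ and the zeros (4.77), and compares a midpoint-rule value of $\int_0^1 t\, J_\nu(zt)^2\,dt$ with $\frac12 J_{\nu-1}^2(z_{\nu,k})$ (Fourier-Bessel case) and $\frac12 J_\nu^2(z_{\nu-1,k})$ (Dini case). Function names and the grid size are illustrative assumptions.

```python
import math

def j_half(x):
    """Closed form (4.79): J_{1/2}(x) = sqrt(2/(pi x)) sin x, for x > 0."""
    return math.sqrt(2.0 / (math.pi * x)) * math.sin(x)

def j_minus_half(x):
    """Closed form (4.79): J_{-1/2}(x) = sqrt(2/(pi x)) cos x, for x > 0."""
    return math.sqrt(2.0 / (math.pi * x)) * math.cos(x)

def weighted_norm_sq(z, n=20000):
    """Midpoint-rule value of the integral of t * J_{1/2}(z t)^2 over (0, 1)."""
    h = 1.0 / n
    return h * sum(((i + 0.5) * h) * j_half(z * (i + 0.5) * h) ** 2 for i in range(n))

if __name__ == "__main__":
    for k in (1, 2, 3):
        z_fb = k * math.pi            # zeros of J_{1/2}, see (4.77)
        z_dini = (k - 0.5) * math.pi  # zeros of J_{-1/2}, see (4.77)
        # Each pair of printed numbers should agree to many digits.
        print(weighted_norm_sq(z_fb), 0.5 * j_minus_half(z_fb) ** 2)
        print(weighted_norm_sq(z_dini), 0.5 * j_half(z_dini) ** 2)
```

Agreement of the two columns is what makes $2/J_{\nu-1}^2(z_{\nu,k})$ and $2/J_\nu^2(z_{\nu-1,k})$ the right reciprocals of the squared weighted norms in this special case.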
Put $\theta = 0$ and $\nu = \rho/(\rho+1)$ in (4.61)–(4.62). Set, for notational simplicity, $z_{\nu,k} = z_{\rho/(\rho+1),k}$ and $z_{\nu-1,k} = z_{\rho/(\rho+1)-1,k}$. We so obtain the KL expansions

$$W(t^\rho) = \sum_{k=1}^{\infty} \omega_k \Big\{\frac{2\sqrt{\rho}}{z_{\nu-1,k}(\rho+1)}\Big\} \Big\{\sqrt{\rho+1}\; t^{\frac{\rho}{2}}\, \frac{J_{\rho/(\rho+1)}\big(z_{\nu-1,k}\, t^{\frac{\rho+1}{2}}\big)}{J_{\rho/(\rho+1)}(z_{\nu-1,k})}\Big\}, \tag{4.68}$$

$$B(t^\rho) = \sum_{k=1}^{\infty} \omega_k \Big\{\frac{2\sqrt{\rho}}{z_{\nu,k}(\rho+1)}\Big\} \Big\{\sqrt{\rho+1}\; t^{\frac{\rho}{2}}\, \frac{J_{\rho/(\rho+1)}\big(z_{\nu,k}\, t^{\frac{\rho+1}{2}}\big)}{J_{\rho/(\rho+1)-1}(z_{\nu,k})}\Big\}. \tag{4.69}$$

The KL expansion (4.69) has been obtained by Li ([140]) (see the proof of Theorem 1.6, pp. 24-25 in [140]), up to the normalizing factor, for $k = 1, 2, \ldots$, $c_k = \sqrt{\rho+1}/\{J_{\rho/(\rho+1)-1}(z_{\nu,k})\}$, of the eigenfunction in (4.69) (with the notation (4.58)) $e_k(t) = c_k\, t^{\rho/2} J_{\rho/(\rho+1)}\big(z_{\nu,k}\, t^{\frac{\rho+1}{2}}\big)$, left implicit in his work. Although it is possible to reverse the previous arguments, starting with (4.69), in order to obtain an alternative proof of Theorem 4.6 based on [140], this works only for the values of $\nu = \rho/(\rho+1)$ with $0 < \nu < 1$ (since we must have $\rho > 0$).

4.6. Bessel functions

4.6.1. Definition of Bessel functions. For each real constant $\nu \in \mathbb{R}$, we denote by $J_\nu(\cdot)$ the Bessel function of the first kind of index $\nu$. For our needs, it will be useful to recall some important properties of these functions (refer to [132] and [207] for details). The second order homogeneous differential equation

$$x^2 y'' + x y' + (x^2 - \nu^2)\, y = 0 \tag{4.70}$$

has a fundamental set of solutions on $(0, \infty)$ of the form $y = C x^\nu \sum_{k=0}^{\infty} a_k x^k$, where $C$ is a constant. These solutions are proportional to the Bessel function of
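the first kind (see, e.g., 9.1.69 in [1]), explicitly defined, for an arbitrary $\nu \in \mathbb{R}$, by

$$J_\nu(x) = \frac{(\frac12 x)^\nu}{\Gamma(\nu+1)}\; {}_0F_1\big(\nu+1;\, -\tfrac14 x^2\big) = (\tfrac12 x)^\nu \sum_{k=0}^{\infty} \frac{(-\frac14 x^2)^k}{\Gamma(\nu+k+1)\,\Gamma(k+1)}. \tag{4.71}$$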
When $\nu = -n$ is a negative integer, $\Gamma(\nu+k+1) = \Gamma(-n+k+1) = \infty$ for $k = 0, \ldots, n-1$ so that, making use of the convention $a/\infty = 0$ when $a \in \mathbb{R}$, the first $n$ terms in the series (4.71) vanish. In this case, we have the relation

$$J_{-n}(x) = (-1)^n J_n(x). \tag{4.72}$$
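A partial sum of the series (4.71) is easy to evaluate directly, and doing so makes the convention $a/\infty = 0$ and the relation (4.72) concrete. The following is a minimal sketch, assuming moderate arguments (roughly $0 < x \le 12$) and a fixed, arbitrarily chosen number of terms; the names `reciprocal_gamma` and `bessel_j` are hypothetical helpers, not library functions.

```python
import math

def reciprocal_gamma(a):
    """1/Gamma(a), set to 0 at the poles a = 0, -1, -2, ... (the a/infinity = 0 convention)."""
    if a <= 0 and a == math.floor(a):
        return 0.0
    return 1.0 / math.gamma(a)

def bessel_j(nu, x, terms=60):
    """Partial sum of the series (4.71) for J_nu(x), intended for moderate x > 0."""
    total = 0.0
    for k in range(terms):
        total += (-0.25 * x * x) ** k * reciprocal_gamma(nu + k + 1) / math.factorial(k)
    return (0.5 * x) ** nu * total

if __name__ == "__main__":
    x = 1.7
    for n in (1, 2, 3):
        # The two printed values should coincide, illustrating (4.72).
        print(n, bessel_j(-n, x), (-1) ** n * bessel_j(n, x))
```

For larger $x$ the alternating terms cancel heavily, so this direct summation is only a device for checking identities at moderate arguments, not a general-purpose evaluator.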
In (4.71), we made use of the generalized hypergeometric function

$${}_0F_1(b; z) = \sum_{k=0}^{\infty} \frac{1}{(b)_k}\, \frac{z^k}{k!} \qquad \text{for } z \in \mathbb{C}, \tag{4.73}$$

where the Pochhammer symbol $(b)_k$ is defined for $k \in \mathbb{N}$ by $(b)_k = \Gamma(b+k)/\Gamma(b)$ when $b \ne 0, -1, -2, \ldots$, and, for an arbitrary $b \in \mathbb{R}$, by

$$(b)_0 = 1 \quad \text{and} \quad (b)_k = b(b+1)\cdots(b+k-1) \quad \text{for } k \ge 1. \tag{4.74}$$
The roots (or zeros) of $J_\nu(\cdot)$ have the following properties, in addition to (4.53) (see, e.g., Ch. XV, pp. 478–521 in [207], p. 96 in [132]). For any $\nu > -1$, $J_\nu(\cdot)$ has only real roots. Moreover, in this case, the positive roots of $J_\nu(\cdot)$ are isolated and form an increasing sequence

$$0 < z_{\nu,1} < z_{\nu,2} < \cdots, \tag{4.75}$$

such that, for any fixed $k \ge 1$, $z_{\nu,k}$ is a continuous and increasing function of $\nu > -1$. In addition, for any specified $\nu > -1$, as $k \to \infty$,

$$z_{\nu,k} = \big\{k + \tfrac12(\nu - \tfrac12)\big\}\pi - \frac{4\nu^2 - 1}{8\big\{k + \frac12(\nu - \frac12)\big\}\pi} + O\Big(\frac{1}{k^3}\Big). \tag{4.76}$$

Remark 4.2. For $\nu = -\frac12$ and $\nu = \frac12$, we have (see (4.79) below)

$$z_{-\frac12,k} = \big(k - \tfrac12\big)\pi \quad \text{and} \quad z_{\frac12,k} = k\pi \quad \text{for } k = 1, 2, \ldots, \tag{4.77}$$

so that, in either of these cases, $z_{\nu,k}$ reduces to the first term in (4.76).

An alternative definition of the Bessel function $J_\nu(\cdot)$ makes use of Euler's formula (see, e.g., (2)–(3) p. 498 in [207])

$$J_\nu(z) = \frac{(\frac12 z)^\nu}{\Gamma(\nu+1)} \prod_{k=1}^{\infty} \Big\{1 - \frac{z^2}{z_{\nu,k}^2}\Big\} \qquad \text{for } z > 0. \tag{4.78}$$
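The quality of the asymptotic formula (4.76) can be gauged by comparing its first two terms with zeros located numerically. The sketch below is an illustration under assumptions: it evaluates $J_\nu$ by a partial sum of (4.71) (adequate for moderate $x$ and $\nu > -1$), brackets each zero in an interval of half-width 0.5 around the two-term approximation (a heuristic that is assumed, not proved, to isolate one zero for, say, $\nu \ge -\frac12$), and refines by bisection. All function names are hypothetical.

```python
import math

def bessel_j(nu, x, terms=60):
    """Partial sum of the series (4.71); adequate for moderate x > 0 and nu > -1."""
    s = sum((-0.25 * x * x) ** k / (math.gamma(nu + k + 1) * math.factorial(k))
            for k in range(terms))
    return (0.5 * x) ** nu * s

def mcmahon_zero(nu, k):
    """First two terms of the expansion (4.76) of the k-th positive zero of J_nu."""
    b = (k + 0.5 * (nu - 0.5)) * math.pi
    return b - (4.0 * nu * nu - 1.0) / (8.0 * b)

def bessel_zero(nu, k, half_width=0.5, iterations=60):
    """Bisection for z_{nu,k}, started from a bracket around mcmahon_zero(nu, k)."""
    lo = mcmahon_zero(nu, k) - half_width
    hi = mcmahon_zero(nu, k) + half_width
    f_lo = bessel_j(nu, lo)
    if f_lo * bessel_j(nu, hi) > 0:
        raise ValueError("no sign change in the assumed bracket")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = bessel_j(nu, mid)
        if f_lo * f_mid <= 0:
            hi = mid              # the zero lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid  # the zero lies in [mid, hi]
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for nu in (0.0, 0.5, 1.0):
        for k in (1, 2, 3):
            z = bessel_zero(nu, k)
            print(nu, k, z, mcmahon_zero(nu, k) - z)
```

For $\nu = \frac12$ the printed differences are at rounding level, in line with Remark 4.2; for other indices they shrink as $k$ grows, consistent with the $O(1/k^3)$ remainder.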
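4.6.2. Some special cases. The expression (4.71) of the Bessel function of the first kind $J_\nu(\cdot)$ can be simplified when $\nu = m + \frac12$ for an integer $m = -1, 0, 1, \ldots$. In particular, for $m = -1$ and $m = 0$,

$$J_{-\frac12}(x) = \sqrt{\frac{2}{\pi x}}\, \cos x \quad \text{and} \quad J_{\frac12}(x) = \sqrt{\frac{2}{\pi x}}\, \sin x. \tag{4.79}$$

For $m \ge 0$, we get

$$J_{m+\frac12}(x) = (-1)^m \sqrt{\frac{2}{\pi x}}\; x^{m+1} \Big(\frac{1}{x}\frac{d}{dx}\Big)^{m} \Big(\frac{\sin x}{x}\Big). \tag{4.80}$$

In general, for an arbitrary integer $m \ge -1$, $J_{m+\frac12}(\cdot)$ is of the form

$$\sqrt{\frac{\pi x}{2}}\; J_{m+\frac12}(x) = Q_m\Big(\frac1x\Big) \sin x - P_m\Big(\frac1x\Big) \cos x, \tag{4.81}$$

where $P_m(\cdot)$ and $Q_m(\cdot)$ are polynomials. The first terms of the sequence are

$$P_{-1}(u) = -1, \quad Q_{-1}(u) = 0, \quad P_0(u) = 0, \quad Q_0(u) = 1. \tag{4.82}$$

Lemma 4.3. For an arbitrary $m \ge 0$, we have the recurrence formulas

$$Q_{m+1}(u) = (2m+1)\, u\, Q_m(u) - Q_{m-1}(u), \tag{4.83}$$

$$P_{m+1}(u) = (2m+1)\, u\, P_m(u) - P_{m-1}(u). \tag{4.84}$$

Proof. We have

$$\sqrt{\frac{\pi x}{2}}\; J_{m+\frac32}(x) = \frac{2m+1}{x} \sqrt{\frac{\pi x}{2}}\; J_{m+\frac12}(x) - \sqrt{\frac{\pi x}{2}}\; J_{m-\frac12}(x),$$

so that (4.83)–(4.84) is straightforward.

By combining (4.81)–(4.82) with (4.83)–(4.84), we get

$$J_{\frac32}(x) = \sqrt{\frac{2}{\pi x}} \Big\{\frac{\sin x}{x} - \cos x\Big\}, \tag{4.85}$$

$$J_{\frac52}(x) = \sqrt{\frac{2}{\pi x}} \Big\{\frac{3 \sin x}{x^2} - \frac{3 \cos x}{x} - \sin x\Big\}. \tag{4.86}$$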
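The recurrences (4.83)–(4.84), started from (4.82), can be run mechanically to produce the polynomials $P_m$, $Q_m$ and hence $J_{m+\frac12}$ through (4.81). The sketch below is one possible illustration: polynomials are stored as coefficient lists in increasing powers of $u$, and the helper names (`next_poly`, `pq_polynomials`, `bessel_half_integer`) are hypothetical.

```python
import math

def next_poly(m, cur, prev):
    """Coefficients of (2m+1) u cur(u) - prev(u); index i holds the coefficient of u^i."""
    shifted = [0.0] + [(2 * m + 1) * c for c in cur]   # multiply by (2m+1) u
    padded = prev + [0.0] * (len(shifted) - len(prev))  # align lengths
    return [a - b for a, b in zip(shifted, padded)]

def pq_polynomials(m_max):
    """Dictionaries P, Q with P[m], Q[m] for m = -1, ..., m_max, from (4.82)-(4.84)."""
    P = {-1: [-1.0], 0: [0.0]}
    Q = {-1: [0.0], 0: [1.0]}
    for m in range(0, m_max):
        P[m + 1] = next_poly(m, P[m], P[m - 1])
        Q[m + 1] = next_poly(m, Q[m], Q[m - 1])
    return P, Q

def evaluate(poly, u):
    """Value of the polynomial with coefficient list poly at the point u."""
    return sum(c * u ** i for i, c in enumerate(poly))

def bessel_half_integer(m, x):
    """J_{m+1/2}(x) for an integer m >= -1 and x > 0, computed through (4.81)."""
    P, Q = pq_polynomials(max(m, 0))
    u = 1.0 / x
    bracket = evaluate(Q[m], u) * math.sin(x) - evaluate(P[m], u) * math.cos(x)
    return math.sqrt(2.0 / (math.pi * x)) * bracket

if __name__ == "__main__":
    P, Q = pq_polynomials(3)
    for m in range(-1, 4):
        print(m, "P:", P[m], "Q:", Q[m])
```

With this convention the run gives $Q_1(u) = u$, $P_1(u) = 1$, $Q_2(u) = 3u^2 - 1$ and $P_2(u) = 3u$, which are the polynomials behind (4.85) and (4.86).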
References

[1] Abramowitz, M. and Stegun, I.A. (1965). Handbook of Mathematical Functions. Dover, New York.
[2] Acosta, A. de and Kuelbs, J. (1983). Limit theorems for moving averages of independent random vectors. Z. Wahrscheinlichkeitstheor. Verw. Geb. 64 67–123.
[3] Adler, R.J. (1990). An Introduction to Continuity, Extrema and Related Topics for General Gaussian Processes. IMS Lecture Notes-Monograph Series 12. Institute of Mathematical Statistics, Hayward, California.
[4] Akaike, H. (1954). An approximation of the density function. Ann. Inst. Statist. Math. 6 127–132.
[5] Anderson, T.W. and Darling, D.A. (1952). Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. Ann. Math. Statist. 23 193–212.
[6] Araujo, A. and Giné, E. (1980). The Central Limit Theorem for Real and Banach Valued Random Variables. Wiley, New York.
[7] Ash, R.B. and Gardner, M.F. (1975). Topics in Stochastic Processes. Academic Press, New York.
[8] Bahadur, R.R. (1967). A note on quantiles in large samples. Ann. Math. Statist. 37 577–580.
[9] Bahadur, R.R. (1971). Some Limit Theorems in Statistics. Regional Conference Series in Applied Mathematics. 4. S.I.A.M., Philadelphia.
[10] Bauer, H. (1981). Probability Theory and Elements of Measure Theory. Academic Press, New York.
[11] del Barrio, E., Cuesta-Albertos, J.A. and Matran, C. (2000). Contributions of empirical and quantile processes to the asymptotic theory of goodness-of-fit tests. Test 9 1–96.
[12] Bártfai, P. (1966). Die Bestimmung der zu einem wiederkehrenden Prozess gehörenden Verteilungsfunktion aus den mit Fehlern behafteten Daten einer einzigen Relation. Studia Sci. math. Hung. 1 161–168.
[13] Bartlett, M.S. (1963). Statistical estimation of density functions. Sankhyā. Ser. A 25 245–254.
[14] Berkes, I. and Philipp, W. (1979). Approximation theorems for independent and weakly dependent random vectors. Ann. Probab. 7 29–54.
[15] Berlinet, A. (1993). Hierarchies of higher-order kernels. Prob. Theor. Rel. Fields 94 489–504.
[16] Berlinet, A. and Devroye, L. (1994). A comparison of kernel density estimates. Publ. Inst. Statist. Univ. Paris 38 3–59.
[17] Bickel, P. and Rosenblatt, M. (1973). On some global measures of the deviation of density function estimates. Ann. Statist. 1 1071–1095.
[18] Bickel, P. and Rosenblatt, M. (1975). Corrections to “On some global measures of the deviation of density function estimates”. Ann. Statist. 3 1370.
[19] Billingsley, P. (1968). Convergence of Probability Measures. John Wiley & Sons, New York.
[20] Borell, C. (1975). The Brunn-Minkovski inequality in Gauss space. Invent. Math. 30 207–216.
[21] Borell, C. (1976). Gaussian Radon measures on locally convex spaces. Math. Scand. 38 265–284.
[22] Borell, C. (1977). A note on Gauss measures which agree on balls. Ann. Inst. H. Poincaré Ser. B 13 231–238.
[23] Bosq, D. and Lecoutre, J.-P. (1987). Théorie de l’Estimation Fonctionnelle. Economica, Paris.
[24] Bowman, F. (1958). Introduction to Bessel Functions. Dover, New York.
[25] Bowman, A., Hall, P. and Prvan, T. (1998). Bandwidth selection for the smoothing of distribution functions. Biometrika 85 799–808.
[26] Bretagnolle, J. and Massart, P. (1989). Hungarian constructions from the nonasymptotic viewpoint. Ann. Probab. 17 239–256.
[27] Castelle, N., and Laurent-Bonvalot, F. (1998). Strong approximations of bivariate uniform empirical processes. Ann. Inst. Henri Poincaré. 34 425–480.
[28] Castelle, N. (2002). Approximation fortes pour des processus bivariés. Canad. J. Math. 54 533–553.
[29] Cheng, P.E., and Bai, Z. (1995). Optimal strong convergence rates in nonparametric regression. Math. Meth. of Statist. 4 405–420.
[30] Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sums of observations. Ann. Math. Statist. 23 493–507.
[31] Chung, K.L. (1949). An estimate concerning the Kolmogorov limit distributions. Trans. Amer. Math. Soc. 67 36–50.
[32] Ciesielski, Z. and Taylor, S.J. (1962). First passage times and sojourn times for Brownian motion in space and the exact Hausdorff measure of the sample path. Trans. Amer. Math. Soc. 434–450.
[33] Collomb, G. (1977). Quelques propriétés de la méthode du noyau pour l’estimation non-paramétrique de la régression en un point fixé. C. R. Acad. Sci. Paris 285 A 289–292.
[34] Collomb, G. (1979). Conditions nécessaires et suffisantes de convergence uniforme d’un estimateur de la régression, estimation des dérivées de la régression. C. R. Acad. Sci. Paris 288 161–163.
[35] Collomb, G. (1981). Estimation non-paramétrique de la régression: revue bibliographique. Int. Statist. Rev. 49 75–93.
[36] Csáki, E. (1980). On the standardized empirical distribution function. Coll. Math. Soc. János Bolyai. 32 123–138. Nonparametric Statistical Inference. Akademiai Kiado, Budapest.
[37] Csörgő, M. and Horváth, L. (1986). Approximations of weighted empirical and quantile processes. Statist. and Probab. Letters. 4 275–280.
[38] Csörgő, M. and Horváth, L. (1993). Weighted Approximations in Probability and Statistics. Wiley, New York.
[39] Csörgő, M. and Révész, P. (1978). Strong approximation of the quantile process. Ann. Statist. 6 882–894.
[40] Csörgő, M. and Révész, P. (1981). Strong Approximations in Probability and Statistics. Academic Press, New York.
[41] Csörgő, M., Csörgő, S., Horváth, L. and Mason, D.M. (1986). Weighted empirical and quantile processes. Ann. Probab. 14 86–118.
[42] Dall’Aglio, G., Kotz, S. and Salinetti, G. (1991). Advances in Probability Distributions with Given Marginals. Kluwer, Dordrecht.
[43] Darling, D.A. (1957). The Kolmogorov-Smirnov, Cramér-von Mises tests. Ann. Math. Statist. 28 823–838.
[44] David, H.A. (1981). Order Statistics. 2nd Ed. Wiley, New York.
[45] Deheuvels, P. (1974). Conditions nécessaires et suffisantes de convergence presque sûre et uniforme presque sûre des estimateurs de la densité. C. R. Acad. Sci. Paris. Ser. A 278 1217–1220.
[46] Deheuvels, P. (1977a). Estimation non paramétrique de la densité par histogrammes généralisés. Publications de l’Institut de Statistique de l’Université de Paris. 22 1–24.
[47] Deheuvels, P. (1977b). Estimation non paramétrique de la densité par histogrammes généralisés (II). Revue de Statistique Appliquée 25 5–42.
[48] Deheuvels, P. (1979). Propriétés d’existence et propriétés topologiques des fonctions de dépendance. C. R. Acad. Sci. Paris, Ser. A 288 217–220.
[49] Deheuvels, P. (1986). Strong laws for the k-th order statistic when k ≤ c log2 n. Probab. Theory Related Fields. 72 133–154.
[50] Deheuvels, P. (1990). Laws of the iterated logarithm for density estimators. In: Nonparametric Functional Estimation and Related Topics. Roussas, G. (Ed.) 19–20. Kluwer, Dordrecht.
[51] Deheuvels, P. (1991). Functional Erdős-Rényi laws. Studia Sci. Math. Hungar. 26 261–295.
[52] Deheuvels, P. (1992). Functional laws of the iterated logarithm for large increments of empirical and quantile processes. Stochastic Processes Appl. 43 133–163.
[53] Deheuvels, P. (1996). Functional laws of the iterated logarithm for small increments of empirical processes. Statistica Neerlandica. 50 261–280.
[54] Deheuvels, P. (1997). Strong laws for local quantile processes. Ann. Probab. 25 2007–2054.
[55] Deheuvels, P. (1998). On the approximation of quantile processes by Kiefer processes. J. Theor. Probab. 11 997–1018.
[56] Deheuvels, P. (2000). Limit laws for kernel density estimators for kernels with unbounded supports. Asymptotics in Statistics and Probability. M.L. Puri (Ed.) 117–132. VSP. International Science Publishers, Amsterdam.
[57] Deheuvels, P. (2000). Strong approximations of quantile processes by iterated Kiefer processes. Ann. Probab. 28 909–945.
[58] Deheuvels, P., Devroye, L. and Lynch, J. (1986). Exact convergence rates in the limit theorems of Erdős-Rényi and Shepp. Ann. Probab. 14 209–223.
[59] Deheuvels, P. and Devroye, L. (1987). Limit laws of Erdős-Rényi-Shepp type. Ann. Probab. 15 1363–1386.
[60] Deheuvels, P. and Einmahl, J. H. J. (2000). Functional limit laws for the increments of Kaplan-Meier product-limit processes and applications. Ann. Probab. 28 1301–1335.
[61] Deheuvels, P. and Lifshits, M.A. (1993). Strassen-type functional laws for strong topologies. Probab. Theory Related Fields. 97 151–167.
[62] Deheuvels, P. and Lifshits, M.A. (1994). Necessary and sufficient conditions for the Strassen law of the iterated logarithm in nonuniform topologies. Ann. Probab. 22 1838–1856.
[63] Deheuvels, P. and Martynov, G. (2003). Karhunen-Loève expansions for weighted Wiener processes and Brownian bridges via Bessel functions. Progress in Probability. 55 57–93.
[64] Deheuvels, P. and Mason, D.M. (1990a). Bahadur-Kiefer-type processes. Ann. Probab. 18 669–697.
[65] Deheuvels, P. and Mason, D.M. (1990b). Nonstandard Functional laws of the iterated logarithm for tail empirical and quantile processes. Ann. Probab. 18 1693–1722.
[66] Deheuvels, P. and Mason, D.M. (1992a). Functional laws of the iterated logarithm for the increments of empirical and quantile processes. Ann. Probab. 20 1248–1287.
[67] Deheuvels, P. and Mason, D.M. (1992b). A functional L.I.L. approach to pointwise Bahadur-Kiefer theorems. In Probability in Banach Spaces. (R.M. Dudley, M. Hahn and J. Kuelbs, eds.) 8 255–266. Birkhäuser, Boston.
[68] Deheuvels, P. and Mason, D.M. (1994a). Functional laws of the iterated logarithm for local empirical processes indexed by sets. Ann. Probab. 22 1619–1661.
[69] Deheuvels, P. and Mason, D.M. (1994b). Random fractals generated by oscillations of processes with stationary and independent increments. Probability in Banach Spaces. 73–90, 9 Hoffman-Jørgensen, J., Kuelbs, J. and Marcus, M.B. Eds. Birkhäuser, Boston.
[70] Deheuvels, P. and Mason, D.M. (2004). General asymptotic confidence bands based on kernel-type function estimators. Statist. Inference for Stoch. Processes. 7 225–277.
[71] Deheuvels, P. and Steinebach, J. (1987). Exact convergence rates in strong approximation laws for large increments of partial sums. Probab. Theor. Related Fields. 76 369–393.
[72] Derzko, G. and Deheuvels, P. (2002). Estimation non-paramétrique de la régression dichotomique - application biomédicale. C. R. Acad. Sci. Paris Ser. I 334 59–63.
[73] Deuschel, J.D. and Stroock, D.W. (1989). Large Deviations. Academic Press, New York.
[74] Devroye, L. (1978). The uniform convergence of the Nadaraya-Watson regression function estimate. Canad. J. Statist. 6 179–191.
[75] Devroye, L. (1977). A uniform bound for the deviation of empirical distribution functions. J. Multivariate Analysis. 7 594–597.
[76] Devroye, L. (1982). Bounds for the uniform deviation of empirical measures. J. Multivariate Analysis. 12 72–79.
[77] Devroye, L. (1987). A Course in Density Estimation. Birkhäuser-Verlag, Boston.
[78] Devroye, L. and Lugosi, G. (2001). Combinatorial Methods in Density Estimation. Springer, New York.
[79] Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. Wiley, New York.
[80] Doob, J.L. (1953). Stochastic Processes. Wiley, New York.
[81] Dudley, R.M. and Philipp, W. (1983). Invariance principles for sums of Banach space valued random elements and empirical processes indexed by sets. Z. Wahrsch. Verw. Gebiete. 62 509–552.
[82] Dudley, R.M. (1999). Uniform Central Limit Theorems. Cambridge University Press, Cambridge.
[83] Dudley, R.M. (2002). Real Analysis and Probability. Cambridge University Press, Cambridge.
[84] Dugundji, J. (1966). Topology. Allyn and Bacon, Boston.
[85] Durbin, J. (1973). Distribution Theory for Tests Based on the Sample Distribution Function. Regional Conference Series in Applied Mathematics, 9 S.I.A.M., Philadelphia.
[86] Dvoretzky, A., Kiefer, J. and Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Statist. 33 642–669.
[87] Eggermont, P.P.B. and La Riccia, V.N. (2001). Maximum Penalized Likelihood Estimation. Springer, New York.
[88] Eicker, F. (1979). The asymptotic distribution of the suprema of the standardized empirical processes. Ann. Statist. 7 116–138.
[89] Einmahl, J.H.J. (1987). Multivariate Empirical Processes. C.W.I. Tract 32. Mathematisch Centrum, Amsterdam.
[90] Einmahl, U. (1986). A refinement of the KMT-inequality for partial sum strong approximation. Technical Report Series of the Laboratory for Research in Statistics and Probability – Carleton University. 88, Ottawa, Canada.
[91] Einmahl, U. (1988). Strong approximations for partial sums of i.i.d. B-valued r.v.’s in the domain of attraction of a Gaussian law. Probab. Theor. Rel. Fields. 77 65–85.
[92] Einmahl, U. (1989). Extensions of results of Komlós, Major and Tusnády to the multivariate case. J. Multivariate Anal. 28 20–68.
[93] Einmahl, J.H.J. and Mason, D.M. (1985). Bounds for weighted multivariate empirical distribution functions. Z. Wahrsch. Verw. Gebiete. 70 563–571.
[94] Einmahl, J.H.J. and Mason, D.M. (1988). Strong limit theorems for weighted quantile processes. Ann. Probab. 16 1623–1643.
[95] Einmahl, U. and Mason, D.M. (2000). An empirical process approach to the uniform consistency of kernel-type function estimators. J. Theoretical Prob., 13, 1–37.
[96] Epanechnikov, V.A. (1969). Nonparametric estimation of a multivariate probability density. Theor. Probab. Appl. 14 153–158.
[97] Erdős, P. and Rényi, A. (1970). On a new law of large numbers. J. Analyse Math. 23 103–111.
[98] Finkelstein, H. (1971). The law of the iterated logarithm for empirical distributions. Ann. Math. Statist. 42 607–615.
[99] Gaenssler, P. (1983). Empirical Processes. Vol. 3, IMS Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Hayward.
[100] Gaenssler, P. and Stute, W. (1979). Empirical process: a survey of results for independent and identically distributed random variables. Ann. Probab. 7 193–243.
[101] Gasser, T., Müller, H.G. and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. J. R. Statist. Soc. Ser. B 47 238–252.
[102] Gikhman, I.I. (1957). On a nonparametric criterion of homogeneity for k samples. Theor. Probab. Appl. 2 369–373.
[103] Goodman, V., Kuelbs, J. and Zinn, J. (1981). Some results on the LIL in Banach space with application of weighted empirical process. Ann. Probab. 8 713–752.
[104] Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
[105] Hall, P. (1991). On iterated logarithm laws for linear arrays and nonparametric regression estimators. Ann. Probab. 19 740–757.
[106] Hall, P. and Heyde, C.C. (1980). Martingale Limit Theory and its Application. Academic Press, New York.
[107] Hall, P. and Marron, J.S. (1991). Lower bounds for bandwidth selection in density estimation. Probab. Theor. Rel. Fields. 90 149–173.
[108] Hall, P., Marron, J.S. and Park, B.U. (1992). Smoothed cross-validation. Probab. Theor. Rel. Fields 92 1–20.
[109] Härdle, W. (1984). A law of the iterated logarithm for nonparametric regression function estimators. Ann. Statist. 12 624–635.
[110] Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge.
[111] Härdle, W., Janssen, P. and Serfling, R. (1988). Strong uniform consistency rates of estimators of conditional functionals. Ann. Statist. 16 1428–1449.
[112] Hartman, P. and Wintner, A. (1941). On the law of iterated logarithm. Amer. J. Math. 63 169–176.
[113] Högnäs, G. (1977). Characterization of weak convergence of signed measures on [0, 1]. Math. Scand. 41 175–184.
[114] Hurevicz, W. and Wallman, G. (1948). Dimension Theory. Princeton University Press, Princeton.
[115] Izenman, A.J. (1991). Recent developments in nonparametric density estimation. J. Amer. Statist. Assoc. 86 205–224.
[116] Jaeschke, D. (1979). The asymptotic distribution of the supremum of the standardized empirical distribution on subintervals. Ann. Statist. 7 108–115.
[117] Jones, M.C., Marron, J.S. and Park, B.U. (1991). A simple root n bandwidth selector. Ann. Statist. 19 1919–1932.
[118] Jones, M.C., Marron, J.S. and Sheather, S.J. (1996). A brief survey of bandwidth selection for density estimation. J. Amer. Statist. Assoc. 91 401–407.
[119] Kac, M. (1951). On some connections between probability theory and differential and integral equations. Proc. Second Berkeley Sympos. Math. Statist. Probab. 180–215.
[120] Kac, M. (1980). Integration in Function Spaces and Some of its Applications. Lezioni Ferniane, Academia Nazionale dei Lincei. Pisa.
[121] Kac, M. and Siegert, A.J.F. (1947). On the theory of noise in radio receivers with square law detectors. J. Appl. Physics. 18 383–397.
[122] Kac, M. and Siegert, A.J.F. (1947). An explicit representation of a stationary Gaussian process. Ann. Math. Statist. 18 438–442.
[123] Kiefer, J. (1959). K-sample analogues of the Kolmogorov-Smirnov and Cramér-V. Mises tests. Ann. Math. Statist. 30 420–447.
[124] Kiefer, J. (1967). On Bahadur’s representation of sample quantiles. Ann. Math. Statist. 38 1323–1342.
[125] Kiefer, J. (1970). Deviations between the sample quantile process and the sample d.f. In Nonparametric Techniques in Statistical Inference. (M. Puri, ed.) 299–319. Cambridge Univ. Press.
[126] Kiefer, J. (1972a). Iterated logarithm analogues for sample quantiles when pn ↓ 0. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 1 227–244. Univ. California Press, Berkeley.
[127] Kiefer, J. (1972b). Skorohod embedding of multivariate rv’s and the sample df. Z. Wahrsch. Verw. Gebiete. 24 1–35.
[128] Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giorn. Inst. Ital. Attauri. 4 83–91.
[129] Komlós, J., Major, P. and Tusnády, G. (1975). An approximation of partial sums of independent r.v.’s and the sample df. I. Z. Wahrsch. Verw. Gebiete. 32 111–131.
[130] Komlós, J., Major, P. and Tusnády, G. (1975). An approximation of partial sums of independent r.v.’s and the sample df. II. Z. Wahrsch. Verw. Gebiete. 34 33–58.
[131] Konakov, V.D. and Piterbarg, V.I. (1984). On the convergence rate of maximal deviation distribution of kernel regression estimates. J. Mult. Appl. 15 279–294.
[132] Korenev, B.G. (2002). Bessel Functions and their Applications. Taylor & Francis, London.
[133] Krzyżak, A., and Pawlak, M. (1984). Distribution-free consistency of a nonparametric kernel regression estimate and classification. IEEE Trans. on Information Theory. 30 78–81.
[134] Lai, T.L. (1974). Reproducing kernel Hilbert spaces and the law of the iterated logarithm for Gaussian processes. Z. Wahrscheinlichkeitstheorie Verw. Gebiete. 29 7–19.
[135] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin.
[136] Ledoux, M. (1996). On Talagrand’s deviation inequalities for product measures. ESAIM: Probab. Statist. 1 63–87. http//www.emath.fr/ps/
[137] Lévy, P. (1937). Théorie de l’Addition des Variables Aléatoires. Gauthier-Villars, Paris.
[138] Lévy, P. (1951). Wiener random functions and other Laplacian random functions. Proc. 6th Berkeley Sympos. Probab. Theory Math. Statist. 2 171–186.
[139] Lévy, P. (1953). La mesure de Hausdorff de la courbe du mouvement brownien. Giorn. ist. Ital. Attuari. 16 1–37.
[140] Li, W.V. (1992a). Comparison results for the lower tail of Gaussian seminorms. J. Theor. Probab. 5 1–31.
[141] Li, W.V. (1992b). Limit theorems for the square integral of Brownian motion and its increments. Stoch. Processes Appl. 41 223–239.
[142] Li, W.V. (1992c). Lim inf results for the Wiener process and its increments under the L2-norm. Prob. Th. Rel. Fields. 92 69–90.
[143] Loader, C.R. (1999). Bandwidth selection: classical or plug-in? Ann. Statist. 27 415–438.
[144] Lynch, J. and Sethuraman, J. (1987). Large deviations for processes with independent increments. Ann. Probab. 15 610–627.
[145] Maccone, C. (1984). Eigenfunctions and energy for time-rescaled Gaussian processes. Boll. Un. Mat. ital. 6 213–219.
[146] Major, P. (1976a). The approximation of partial sums of independent r.v.’s. Z. Wahrscheinlichkeitstheorie Verw. Gebiete. 35 213–220.
[147] Major, P. (1976b). The approximation of partial sums of i.i.d.r.v.’s. when the summands have only two moments. Z. Wahrscheinlichkeitstheorie Verw. Gebiete. 35 221–230.
[148] Major, P. (1979). An improvement of Strassen’s invariance principle. Ann. Probab. 7 55–61.
[149] Marron, J.S. and Nolan, D. (1988). Canonical kernels for density estimation. Statist. Probab. Letters. 7 195–199.
[150] Martynov, G.V. (1975). Computation of the distribution function of quadratic forms in normal random variables. Theor. Probab. Appl. 20 797–809.
[151] Martynov, G.V. (1977). Generalization of the Smirnov formula for the quadratic forms distributions. Theor. Probab. Appl. 22 614–620.
[152] Martynov, G.V. (1992). Statistical tests based on empirical processes and related questions. J. Soviet. Math. 61 2195–2275.
[153] Mason, D.M. (1984). A strong limit theorem for the oscillation modulus of the uniform empirical quantile process. Stoch. Processes Appl. 17 126–136.
[154] Mason, D.M., Shorack, G. and Wellner, J.A. (1983). Strong limit theorems for oscillation moduli of the empirical process. Z. Wahrscheinlichkeit. verw. Gebiete. 61 369–373.
[155] Mason, D.M. and van Zwet, W.R. (1987). A refinement of the KMT inequality for the uniform empirical process. Ann. Probab. 15 871–884.
[156] Mason, D.M. (1988). A strong invariance theorem for the tail empirical process. Ann. Inst. H. Poincaré Probab. Statist. 24 491–506.
[157] Massart, P. (1990). About the constant in the DKW inequality. Ann. Probab. 18 1269–1283.
[158] Müller, H.G. (1988). Nonparametric Regression Analysis of Longitudinal Data. Springer, New York.
[159] Nadaraya, E.A. (1964). On estimating regression. Theor. Probab. Appl. 9 141–142.
[160] Nadaraya, E.A. (1970). Remarks on nonparametric estimates for density function and regression curves. Theor. Probab. Appl. 15 134–137.
[161] Nadaraya, E.A. (1989). Nonparametric Estimation of Probability Densities and Regression Curves. Kluwer, Dordrecht.
[162] Park, B.U. and Marron, J.S. (1990). Comparison of data-driven bandwidth selectors. J. Amer. Statist. Assoc. 85 66–72.
[163] Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Statist. 33 1065–1076.
[164] Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.
[165] Prakasa Rao, B.L.S. (1983). Nonparametric Functional Estimation. Academic Press, New York.
[166] Révész, P. (1979a). On nonparametric estimation of the regression function. Problems of Control and Information Theory. 8 297–302.
[167] Révész, P. (1979b). A generalization of Strassen’s functional law of the iterated logarithm for Gaussian processes. Z. Wahrscheinlichkeitstheor. Verw. Geb. 50 257–264.
[168] Rice, J. (1984). Bandwidth choice for nonparametric regression. Ann. Statist. 12 1215–1230.
[169] Rosenblatt, M. (1952). Remarks on a multivariate transformation. Ann. Math. Statist. 23 470–472.
[170] Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27 832–837.
[171] Roussas, G. (1990). Nonparametric Functional Estimation and Related Topics. NATO ASI Series 355. Kluwer, Dordrecht.
[172] Rudin, W. (1979). Real and Complex Analysis. 3rd ed., McGraw Hill, New York.
[173] Rudin, W. (1973). Functional Analysis. Tata McGraw-Hill, New Delhi.
[174] Schilder, M. (1966). Asymptotic formulas for Wiener integrals. Trans. Amer. Math. Soc. 125 63–85.
[175] Schucany, W.R. (1989a). On nonparametric regression with higher-order kernels. J. Stat. Plann. Inference. 23 145–151.
[176] Schucany, W.R. (1989b). Locally optimal window widths for kernel density estimation with large samples. Statist. Probab. Letters 7 401–405.
[177] Schucany, W.R. and Sommers, J.P. (1977). Improvements of kernel type density estimators. J. Amer. Statist. Assoc. 72 420–423.
[178] Scott, D.W. (1992). Multivariate Density Estimation – Theory, Practice and Visualization. Wiley, New York.
[179] Scott, W.F. (1999). A weighted Cramér-von Mises statistic, with some applications to clinical trials. Commun. Statist. Theor. Methods. 28 3001–3008.
[180] Scott, D.W. and Terrel, G.R. (1987). Biased and unbiased cross-validation in density estimation. J. Amer. Statist. Assoc. 82 1131–1146.
[181] Sheather, S.J. and Jones, M.C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B 53 683–690.
[182] Shorack, G.R. (1982). Kiefer’s theorem via the Hungarian construction. Z. Wahrsch. Verw. Gebiete. 34 33–58.
[183] Shorack, G.R. and Wellner, J.A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York.
[184] Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
[185] Singh, R.S. (1979). Mean squared errors of estimates of a density and its derivatives. Biometrika 66 177–180.
[186] Singh, R.S. (1987). MISE of kernel estimates of a density and its derivatives. Stat. and Probab. Letters 5 153–159.
[187] Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris. 8 229–231.
[188] Skorohod, A.V. (1976). On a representation of random variables. Theory Probab. Appl. 21 628–632.
[189] Smirnov, N.V. (1936). Sur la distribution de $\omega^2$. C.R. Acad. Sci. Paris. 202 449–452.
[190] Smirnov, N.V. (1937). On the distribution of the $\omega^2$ criterion. Rec. Math. (Mat. Sbornik). 6 3–26.
[191] On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. de l’Université de Moscou. 2.
[192] Smirnov, N.V. (1948). Table for estimating the goodness of fit of empirical distributions. Ann. Math. Statist. 19 279–281.
[193] Spiegelman, J. and Sacks, J. (1980). Consistent window estimation of nonparametric regression. Ann. Statist. 8 240–246.
[194] Steen, L.A. and Seebach, J.A. (1978). Counterexamples in Topology. 2nd Ed. Springer, New York.
[195] Stein, E.M. (1970). Singular Integrals and Differentiability Properties of Functions, Princeton University Press, Princeton, New Jersey.
[196] Stone, C.J. (1977). Consistent nonparametric regression. Ann. Statist. 5 595–645.
[197] Stone, C.J. (1980). Optimal rates of convergence for nonparametric estimators. Ann. Statist. 8 1348–1360.
[198] Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10 1040–1053.
[199] Strassen, V. (1964). An invariance principle for the law of the iterated logarithm. Z. Wahrsch. Verw. Gebiete. 3 211–226.
[200] Stute, W. (1982). The oscillation behaviour of empirical processes. Ann. Probab. 10 86–107.
[201] Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann. Probability 22, 28–76.
[202] Tapia, R.A. and Thompson, J.R. (1978). Nonparametric Probability Density Estimation. Johns Hopkins University Press, Baltimore.
[203] Terrel, G.R. (1990). The maximal smoothing principle in density estimation. J. Amer. Statist. Assoc. 85 470–477.
[204] van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer Verlag, New York.
[205] Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.
[206] Varadhan, S.R.S. (1966). Asymptotic probabilities and differential equations. Comm. Pure Appl. Math. 19 261–286.
[207] Watson, G.N. (1952). A Treatise on the Theory of Bessel Functions. Cambridge University Press, Cambridge.
[208] Watson, G.S. (1964). Smooth regression analysis. Sankhyā A26 359–372.
[209] Watson, G.S. and Leadbetter, M.R. (1963). On the estimation of probability density, I. Ann. Math. Statist. 34 480–491.
[210] Wertz, W. (1972). Fehlerabschätzung für eine Klasse von nichtparametrischen Schätzfolgen. Metrika. 19 132–139.
[211] Wertz, W. (1978). Statistical Density Estimation: A Survey. Vandenhoeck & Ruprecht, Göttingen.
[212] Woodroofe, M. (1966). On the maximum deviation of the sample density. Ann. Math. Statist. 41 1665–1671.
[213] Zolotarev, V.M. (1961). Concerning a certain probability problem. Theor. Probab. Appl. 6 201–204.
[214] Zolotarev, V.M. (1983). Probability metrics. Theory Probab. Appl. 28 278–302.

Paul Deheuvels, L.S.T.A., Université Paris VI
Oracle Inequalities and Regularization Sara van de Geer Keywords. Classification, complexity penalties, complexity regularization, density estimation, empirical processes, empirical risk minimization, M-estimation, model selection, nonparametric regression, oracle inequalities.
1. Statistical models

In this chapter, the construction of a statistical model is discussed. We contemplate deviations from the model on the one hand, and simplicity of a model on the other. We introduce the concepts approximation error and estimation error. The idea of complexity regularization is illustrated in two situations: histograms in density estimation and smoothing splines in regression.

Here is a brief sketch of the contents of the other chapters. In Chapter 2, we introduce penalized M-estimators, or penalized empirical risk estimators. These are obtained by minimizing a loss function (e.g., least squares loss, minus maximum likelihood, or, in classification, support vector machine loss). A roughness penalty is added to the loss function to avoid overfitting. We study the behavior of the estimators in a general context. The excess risk of an estimator is a global measure for its performance. We consider so-called oracle inequalities for the excess risk. These inequalities relate the performance of the estimator to the procedure that chooses the optimal model by trading off bias and variance (or, more generally, approximation error and estimation error). In Chapter 2, we highlight the role of empirical process theory in this context. As an important particular case, we investigate high-dimensional linear spaces. The approximation error then comes from approximating curves or images by elements of a high-dimensional parameter space. Chapter 3 studies oracle inequalities in a regression framework, using the least squares estimators of the coefficients. That setup has the advantage that everything can be calculated explicitly. It serves as a preparation for more complicated situations.
Chapter 4 considers general least squares estimators. In Chapter 5, we look at robust regression estimators, density estimation, and binary classification. There we consider a penalty of $\ell_1$-type. Chapter 6 summarizes some tools from empirical process theory. Each chapter ends with bibliographical remarks.

Most of the work will be on estimation theory and on where this theory can make use of inequalities for empirical processes. Approximation theory (for example the approximating properties of truncated series expansions) will be touched upon only briefly.

We consider a data set consisting of $n$ observations on a variable, say $X$, with values in some space $\mathcal{X}$. These observations are denoted by $X_1, \ldots, X_n$. We assume that the observations are independent, and that each observation follows the same probability law as $X$, say $P$ (independent, non-identically distributed observations will also be considered). The probability distribution $P$ is in whole or in part unknown. Our aim is to estimate $P$ or certain aspects of it.

A statistical model is a set of candidate distributions $\mathcal{P}$ for $P$. If "nothing" is known, one might want to choose $\mathcal{P}$ as the set of "all" distributions on $\mathcal{X}$. However, if one has some idea about the form of $P$, one may want to incorporate this information into the model set $\mathcal{P}$. In that way the estimation problem is made easier, i.e., the accuracy is greater. If $P \notin \mathcal{P}$, the model is misspecified. In that case one usually has a systematic error (bias) in the estimator.

1.1. Parametric models

A parametric model is of the form $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where the parameter space $\Theta$ is a subset of Euclidean space $\mathbb{R}^N$. For example, if the $X_i$ are yes/no answers to a certain question (the binary case), we know that $P$ allows only two possibilities, say 1 and 0 (yes=1, no=0). There is only one parameter, say the probability $\theta$ of a yes answer. Here is another example.

Example 1.1. Vilfredo Pareto (1897) noticed that the number of people whose income exceeds level $x$ is often approximately proportional to $x^{-\theta}$, where $\theta$ is a parameter which differs from country to country. Therefore, as a model for the distribution of incomes, one may propose the Pareto distribution function
$$P_\theta(X \le x) = 1 - \frac{1}{x^\theta}, \quad x > 1,$$
with density
$$f_\theta(x) = \frac{\theta}{x^{\theta+1}}, \quad x > 1.$$

More generally, in a parametric model, there are only a finite number of parameters $\theta = (\theta_1, \ldots, \theta_N)$. When the model is well specified, one has $P = P_{\theta_0}$ for some $\theta_0 \in \Theta$. However, a low-dimensional, parametric model is often just a mathematical idealization. For example, there seems to be little physical reason for incomes to follow the Pareto law. That is, a model is only an approximation.
[Figure 1. The trade off between inaccuracy and systematic error. Axis: complexity. Labels: inaccuracy, systematic error, oracle.]
When there are infinitely many parameters in the model, it is called nonparametric. We will however not be very strict in our distinction between parametric and nonparametric, for the following reason. Throughout, the number of observations n is assumed to be "large". We will allow that the choice of the model P depends on n, and is in fact more "rich" for larger n. This is only natural, since when we have many observations, we may want to use more flexible models and get more information out of the data. Thus, in a parametric model, Θ may depend on n, and in particular its dimension N may depend on n, and in fact grow without limit as n → ∞. This means strictly speaking that we deal with a sequence of parametric models with nonparametric limiting model. We think of such a situation as a nonparametric one.

Parametric models (with N "small") are in a sense less rich than nonparametric models, and there is also a range in the complexity of various nonparametric models. The more complex a model, the larger the inaccuracy will be. On the other hand, too simple models have large systematic error. (Here, we use a generic terminology. We will be more precise in our definitions later on, e.g., in Section 2.3.) Both inaccuracy and systematic error depend on the model, and on the truth P. The optimal model trades off the inaccuracy and systematic error (see Figure 1). However, since P is unknown, it is also not known which model this will be. Only an oracle can tell you that. Our aim will be to mimic this oracle.

To evaluate the inaccuracy of a model, we will use empirical process theory. Empirical process theory is about comparing the theoretical distribution P with its empirical counterpart, the empirical distribution Pn, introduced in the next section.
1.2. The empirical distribution

The unknown $P$ can be estimated from the data in the following way. Suppose first that we are interested in the probability that an observation falls in $A$, where $A \subset \mathcal{X}$ is a certain set chosen by the researcher. We denote this probability by $P(A)$. It can be estimated by the frequency of $A$, i.e., by
$$P_n(A) = \frac{\text{number of times an observation } X_i \text{ falls in } A}{\text{total number of observations}} = \frac{\text{number of } X_i \in A}{n}.$$
In view of the law of large numbers, $P_n(A)$ should be close to $P(A)$ for $n$ large. We now define the empirical distribution $P_n$ as the probability law that assigns to a set $A$ the probability $P_n(A)$. We regard $P_n$ as an estimator of the unknown $P$.

More generally, for a function $f : \mathcal{X} \to \mathbb{R}$, we denote the expectation of $f(X)$ by $P(f) = Ef(X)$. Its empirical counterpart is denoted by $P_n(f) = \sum_{i=1}^n f(X_i)/n$. Thus when $f$ is the indicator function of the set $A$, which is denoted by $\mathrm{l}_A$, we have $P(f) = P(A)$, and $P_n(f) = P_n(A)$.
Example 1.2. The empirical distribution function. Suppose that $\mathcal{X} \subset \mathbb{R}$. The distribution function of $X$ is defined as $F_0(x) = P(-\infty, x]$, and the empirical distribution function is
$$F_n(x) = \frac{\text{number of } X_i \le x}{n}.$$
(Here, we have a slight abuse of notation: $F_0$ is the truth. It is not the empirical distribution having 0 observations, a nonsensical object.) Figure 2 plots the distribution function $F_0(x) = 1 - 1/x^2$, $x \ge 1$ (smooth curve) and the empirical distribution function $F_n$ (stair function) of a sample from $X$ with sample size $n = 200$.
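The comparison of $F_0$ and $F_n$ described in Example 1.2 can be reproduced numerically. The following is a minimal sketch, not the code behind the figure: the helper names `pareto_sample` and `empirical_cdf`, the seed, and the evaluation grid are assumptions made for illustration. The sample is drawn by inverse transform from the Pareto law with $\theta = 2$.

```python
import numpy as np

def pareto_sample(n, theta, rng):
    """Draw n observations with P(X <= x) = 1 - x**(-theta), x > 1 (inverse transform)."""
    u = rng.uniform(size=n)                  # u in [0, 1), so 1 - u lies in (0, 1]
    return (1.0 - u) ** (-1.0 / theta)

def empirical_cdf(sample, x):
    """F_n(x) = (number of X_i <= x) / n, evaluated at each point of x."""
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, x, side="right") / sample.size

rng = np.random.default_rng(0)               # arbitrary seed (assumption)
X = pareto_sample(200, theta=2.0, rng=rng)
grid = np.linspace(1.0, 11.0, 101)           # hypothetical evaluation grid
F0 = 1.0 - grid ** (-2.0)                    # theoretical distribution function
Fn = empirical_cdf(X, grid)                  # empirical distribution function
max_gap = np.max(np.abs(Fn - F0))
print(f"max over the grid of |F_n - F_0| = {max_gap:.3f}")
```

With $n = 200$ the printed gap is typically small, in line with the law of large numbers invoked in Section 1.2.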
1.3. Regularization

Many (nonparametric) estimation procedures involve the choice of a tuning parameter, also called regularization parameter or smoothing parameter. Here are two examples.

Example 1.3. Histograms. Suppose $X \in \mathbb{R}$ has density, say $f_0$, with respect to Lebesgue measure. Our aim is to estimate $f_0$. The density $f_0(x)$ at $x$ is defined as the derivative of the distribution function $F_0$ at $x$:
$$f_0(x) = \lim_{h \downarrow 0} \frac{F_0(x+h) - F_0(x-h)}{2h} = \lim_{h \downarrow 0} \frac{P(x-h, x+h]}{2h}.$$
Unfortunately, replacing $P$ by $P_n$ here does not work, as for $h$ small enough, $nP_n(x-h, x+h]$, the number of observations in the interval, will be either zero or one. Therefore, instead of taking the limit
as $h \downarrow 0$, we fix $h$ at a (small) positive value, called the bandwidth. The estimator of $f_0(x)$ becomes
$$\hat f_n(x) = \frac{P_n(x-h, x+h]}{2h} = \frac{\text{number of } X_i \in (x-h, x+h]}{2nh}.$$

[Figure 2. Theoretical and empirical distribution function. Curves: $F_0$ and $F_n$, plotted against $x$.]

A histogram is a plot of this estimator at points $x \in \{x_0, x_0 + 2h, x_0 + 4h, \ldots\}$. Figure 3 shows the histogram, with bandwidth $h = 0.25$, for the sample of size $n = 200$ from the Pareto distribution with parameter $\theta_0 = 2$ (i.e., with some abuse of notation, $f_0 = f_{\theta_0}$). The solid line is the density of this distribution.

The bandwidth $h$ is an example of a tuning parameter. Choosing a value for it is a complicated matter, as it leads to considering variance, bias, and related concepts. Such considerations will be the major topic in these notes. The variance, $\mathrm{var}(\hat f_n(x))$, quantifies the inaccuracy of the histogram at the point $x$. The systematic error can be measured by the squared bias
$$\mathrm{bias}^2(\hat f_n(x)) = (E\hat f_n(x) - f_0(x))^2.$$
The mean square error is
$$\mathrm{MSE}(\hat f_n(x)) = \mathrm{var}(\hat f_n(x)) + \mathrm{bias}^2(\hat f_n(x)).$$
The integrated mean square error is
$$\mathrm{IMSE}(\hat f_n) = \int \mathrm{MSE}(\hat f_n(x))\,dx.$$
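The trade off between variance and squared bias can be made concrete for the Pareto density with $\theta = 2$. The sketch below is an illustration under stated assumptions, not part of the original notes: the function names, the point $x = 2$ and the three bandwidths are chosen here for illustration. It uses the fact that $nP_n(x-h, x+h]$ is binomial with parameters $n$ and $p = P(x-h, x+h]$, so bias and variance are available in closed form.

```python
import numpy as np

def window_estimate(sample, x, h):
    """f_hat_n(x) = #{X_i in (x - h, x + h]} / (2 n h)."""
    sample = np.asarray(sample)
    count = np.sum((sample > x - h) & (sample <= x + h))
    return count / (2.0 * sample.size * h)

def exact_bias_variance(x, h, n, theta=2.0):
    """Bias and variance at a point x > 1 + h for the Pareto(theta) law."""
    p = (x - h) ** (-theta) - (x + h) ** (-theta)    # P(x - h, x + h]
    f0 = theta * x ** (-(theta + 1.0))               # true density at x
    bias = p / (2.0 * h) - f0
    var = p * (1.0 - p) / (n * (2.0 * h) ** 2)       # binomial variance, rescaled
    return bias, var

n, x = 200, 2.0                                      # hypothetical choices
for h in (0.05, 0.25, 0.75):
    bias, var = exact_bias_variance(x, h, n)
    print(f"h={h:4.2f}  bias^2={bias**2:.2e}  var={var:.2e}  MSE={bias**2 + var:.2e}")
```

In this setting the variance shrinks and the squared bias grows as $h$ increases, which is the trade off that the choice of the bandwidth has to resolve.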
The optimal – in the sense of IMSE – choice of the bandwidth h minimizes the integrated mean square error, i.e., trades off variance and squared bias. However,
to carry out this trade off, one needs to know certain aspects of $f_0$. But $f_0$ is unknown! In an attempt to mimic an oracle, one often uses part of the data to estimate $f_0$ with various choices of the bandwidth $h$, and then uses the rest of the data to decide on the choice for $h$. One may also estimate $\mathrm{IMSE}(\hat f_n)$ applying, for example, least squares cross validation. We will not present the details here. Instead of bandwidth selection, we will study regularization using complexity penalties, as illustrated in Example 1.4. In the intermezzo following this example, it is shown that the two approaches can be closely related.

[Figure 3. True density and a histogram. Curve label: $f_0$.]

Exercise 1.1. Suppose $f_0(x) = 2/x^3$, $x > 1$. Let $h > 0$ be the bandwidth. At a given $x > 1 + h$, calculate bias and variance of the histogram
$$\hat f_n(x) = \frac{P_n(x-h, x+h]}{2h}.$$
Show that for $x$ fixed, and $h \to 0$ and $nh \to \infty$, the bias is of order $h^2$ ($\mathrm{bias}(\hat f_n(x)) = O(h^2)$) and the variance is of order $1/(nh)$ ($\mathrm{var}(\hat f_n(x)) = O(1/(nh))$). The optimal (in the sense of MSE) choice for $h$ is thus $h_{\mathrm{opt}} = O(n^{-1/5})$. (For a definition of order symbols, see Section 3.5.)

Example 1.4. Penalized least squares. Consider the regression
$$EY_i = f_0(x_i), \quad i = 1, \ldots, n,$$
where $Y_i$ is a response variable, $x_i$ is a co-variable ($i = 1, \ldots, n$), and where $f_0$ is an unknown function. We examine here the case where $x_i = i/n \in [0, 1]$, $i = 1, \ldots, n$
and $f_0$ is defined on $[0, 1]$. We suppose $f_0$ is not changing too much, in the sense that the squared first derivative of $f_0$ is small, say in terms of the average $\int_0^1 |f_0'(x)|^2 dx$. As estimator of $f_0$ we propose
$$\hat f_n = \arg\min_f \left\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \lambda^2 \int_0^1 |f'(x)|^2 dx \right\}.$$
Here, "arg" stands for "argument", i.e., the location where the minimum is attained. Moreover, $\lambda$ is a tuning (or regularization) parameter. If $\lambda = 0$, the estimator $\hat f_n$ will just interpolate the data. On the other hand, if $\lambda = \infty$, $\hat f_n$ will be a constant function (namely, constantly equal to the average $\sum_{i=1}^n Y_i/n$ of the observations). To the least squares loss function, we have thus added a penalty for choosing a too wiggly function. This is called (complexity) regularization.

Figure 4 below plots the true $f$ (which is $f_0$) in blue together with the data (red). The aim is to recover $f_0$ from the data. Figure 5 shows the estimator $\hat f_n$
(in green) for two choices of the tuning parameter $\lambda$.

[Figure 4. Two panels: "True f" and "Noise added, noise level = 0.01".]

[Figure 5. Two panels: "Denoised, lambda=0.2, Fit=9.0531e-04" and "Denoised, lambda=0.1, Fit=3.4322e-04".]

The fit of $\hat f_n$ is defined as
$$\sum_{i=1}^n |Y_i - \hat f_n(x_i)|^2/n.$$
Obviously, the smaller value of $\lambda$ gives a better fit. Figure 6 plots the estimator $\hat f_n$ together with $f_0$, for two values of $\lambda$. The error (or "excess risk", see Chapter 2), which is defined here as
$$\sum_{i=1}^n |\hat f_n(x_i) - f_0(x_i)|^2/n,$$
turns out to be smaller for the smaller value of $\lambda$. Now, in real life situations, it is not possible to make the plots of Figure 6 and/or calculate the error, since the true $f$ is then unknown. Thus, again, we need an oracle to tell us which $\lambda$ to choose. In Section 4.5, we show that by penalizing small values of $\lambda$ one may arrive at an oracle inequality.
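The behavior of fit and error as functions of $\lambda$ in Example 1.4 can be explored with a discretized version of the penalized least squares problem. The sketch below is an assumption-laden illustration and not the computation behind the figures: the integral $\int_0^1 |f'(x)|^2 dx$ is replaced by the finite-difference sum $n \sum_i (f_{i+1} - f_i)^2$ on the design points $x_i = i/n$, and the truth `f0`, the noise level and the seed are hypothetical. Setting the gradient of the discretized criterion to zero gives the linear system $(I + \lambda^2 n^2 D^T D) f = Y$, with $D$ the first-difference matrix.

```python
import numpy as np

def penalized_ls(y, lam):
    """Minimize (1/n) sum (y_i - f_i)^2 + lam^2 * n * sum (f_{i+1} - f_i)^2 over f in R^n."""
    n = y.size
    D = np.diff(np.eye(n), axis=0)              # (n-1) x n first-difference matrix
    A = np.eye(n) + (lam * n) ** 2 * (D.T @ D)  # normal equations of the criterion
    return np.linalg.solve(A, y)

rng = np.random.default_rng(1)                  # arbitrary seed (assumption)
n = 100
x = np.arange(1, n + 1) / n
f0 = -0.1 * np.sin(np.pi * x) ** 2              # hypothetical smooth truth
y = f0 + 0.01 * rng.standard_normal(n)          # hypothetical noise level

for lam in (0.05, 0.1, 0.2):
    f_hat = penalized_ls(y, lam)
    fit = np.mean((y - f_hat) ** 2)             # computable from the data
    err = np.mean((f_hat - f0) ** 2)            # needs f0: only the oracle knows it
    print(f"lambda={lam:4.2f}  fit={fit:.3e}  error={err:.3e}")
```

The fit is nondecreasing in $\lambda$ by construction, whereas the error depends on the unknown truth, which is why the fit alone cannot be used to choose $\lambda$.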
[Figure 6. Two panels: "Denoised, lambda=0.05, Error=7.8683e-05" and "Denoised, lambda=0.1, Error=2.8119e-04".]

Intermezzo. As a continuous version of the problem studied in Example 1.4, consider
$$\hat f = \arg\min_f \left\{ \int_0^1 |y(x) - f(x)|^2 dx + \lambda^2 \int_0^1 |f'(x)|^2 dx \right\}.$$
In fact, let us formulate an extension, namely a continuous version corresponding to the so-called white noise model
$$dY(x) = f(x)dx + \sigma dW(x),$$
where $W$ is standard Brownian motion. In that case, the derivative $y(x) = dY(x)/dx$ does not exist, as Brownian motion is nowhere differentiable. We therefore use a formulation avoiding this derivative:
$$\hat f = \arg\min_f \left\{ -2\int_0^1 f(x)dY(x) + \int_0^1 f^2(x)dx + \lambda^2 \int_0^1 |f'(x)|^2 dx \right\}.$$
We show in Lemma 1.1 below that the solution $\hat f$ can be explicitly calculated (using variational calculus). This solution reveals that the tuning parameter $\lambda$ plays the role of a bandwidth parameter.

Lemma 1.1. The solution of the continuous version is
$$\hat f(x) = \frac{C}{\lambda} \cosh\Big(\frac{x}{\lambda}\Big) + \frac{1}{\lambda} \int_0^x \sinh\Big(\frac{u-x}{\lambda}\Big) dY(u),$$
where
$$C = \left\{ Y(1) + \frac{1}{\lambda} \int_0^1 Y(u) \sinh\Big(\frac{1-u}{\lambda}\Big) du \right\} \Big/ \sinh\Big(\frac{1}{\lambda}\Big).$$

Proof. Partial integration shows that we have to minimize
$$2\int Y f' + \int f^2 - 2Y(1)f(1) + \lambda^2 \int |f'|^2.$$
Now, replace $f'$ by the function $g$. We then have the restriction $f' = g$. We can express this restriction by adding a Lagrangian term to the function to be minimized, with Lagrange parameters in the function $h$, say. We then arrive at the problem of minimizing
$$L(f, g) = 2\int Y g + \int f^2 - 2Y(1)f(1) + \lambda^2 \int g^2 - 2\int h(g - f').$$
Invoking again partial integration, rewrite this as
$$L(f, g) = 2\int Y g + \int f^2 - 2Y(1)f(1) + \lambda^2 \int g^2 - 2\int gh - 2\int f h' + 2h(1)f(1).$$
The Euler equations (see, e.g., Goldstein (1980)) now become
$$f - h' = 0$$
and
$$Y - h + \lambda^2 g = 0.$$
Since $g = f'$, this gives the equation $h'' = (h - Y)/\lambda^2$, with boundary condition $h(1) = Y(1)$. Solving this differential equation yields the result of the lemma.
1.4. Bibliographical remarks

Histograms are special cases of kernel estimators. There are many books on kernel density estimation. We refer to Wand and Jones (1995). The penalized least squares estimator we have considered in Example 1.4 was chosen for ease of exposition. A penalized smoothing spline with penalty on the second derivative, instead of the first, is used more frequently. The penalty is then
$$\lambda^2 \int_0^1 |f''(x)|^2 dx.$$
We refer to Silverman (1985) and Wahba (1990). A nice discussion of more general (Sobolev) penalties (involving for instance higher derivatives) can be found in the book of Green and Silverman (1994). The proof of Lemma 1.1 is a simple example of variational calculus, which is generally developed in the context of classical mechanics (Goldstein (1980)).
2. M-estimators

We will start with two examples, and then present the general definition of an M-estimator. We furthermore give definitions of excess risk, estimation error and approximation error. Next, we highlight how empirical process theory can be invoked to assess the excess risk of an M-estimator.

2.1. Some examples

Examples of M-estimators are least squares estimators, maximum likelihood estimators, and estimators in binary classification using 0/1 loss. Let us consider the latter two.

Example 2.1. Maximum likelihood. Let $\mathcal{P}$ be a collection of probability measures dominated by a σ-finite measure $\mu$. Write
$$\mathcal{F} = \{f = d\tilde P/d\mu : \tilde P \in \mathcal{P}\}$$
for the collection of densities. Using the model class $\mathcal{F}$, the maximum likelihood estimator is
$$\hat f_n = \arg\max_{f \in \mathcal{F}} \sum_{i=1}^n \log f(X_i).$$
Recall that "arg" stands for "argument", i.e., $\hat f_n$ is the density in $\mathcal{F}$ where the likelihood of the observations is maximal. Note that we may write
$$\sum_{i=1}^n \log f(X_i)/n = P_n(\log f).$$
This is the empirical counterpart of $\int \log f \, dP = P(\log f)$. The maximum likelihood estimator $\hat f_n$ maximizes $P_n(\log f)$ over all $f \in \mathcal{F}$.
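As an illustration of Example 2.1, take for the model class the Pareto densities $f_\theta(x) = \theta/x^{\theta+1}$ of Example 1.1. Then $P_n(\log f_\theta) = \log\theta - (\theta+1)P_n(\log x)$, which is concave in $\theta$ and maximal at $\hat\theta = 1/P_n(\log x)$. The sketch below is a hedged illustration: the seed, the sample size and the search grid are assumptions. It compares this closed form with a direct grid maximization of $P_n(\log f_\theta)$.

```python
import numpy as np

def empirical_log_likelihood(theta, sample):
    """P_n(log f_theta) for the Pareto density f_theta(x) = theta / x**(theta + 1), x > 1."""
    return np.log(theta) - (theta + 1.0) * np.mean(np.log(sample))

rng = np.random.default_rng(2)                    # arbitrary seed (assumption)
theta0 = 2.0
X = (1.0 - rng.uniform(size=200)) ** (-1.0 / theta0)   # inverse-transform sample

# Closed form: the derivative 1/theta - P_n(log x) vanishes at theta = 1 / P_n(log x).
theta_closed = 1.0 / np.mean(np.log(X))

# Direct maximization of P_n(log f_theta) over a hypothetical grid with step 0.001.
grid = np.linspace(0.5, 5.0, 4501)
values = np.array([empirical_log_likelihood(t, X) for t in grid])
theta_grid = grid[np.argmax(values)]

print(f"closed form: {theta_closed:.4f}   grid search: {theta_grid:.4f}")
```

The two values agree up to the grid resolution, and both are close to $\theta_0 = 2$ when the model is well specified.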
Exercise 2.1. Verify that the true density $f_0 = dP/d\mu$ is a maximizer of $P(\log f)$ over all densities $f$.

Exercise 2.2. Check that a histogram with cells chosen a priori (see Example 1.3) is a maximum likelihood estimator for the density on $\mathbb{R}$, with model class
$$\mathcal{F} = \Big\{ f = \sum_{k=1}^N \alpha_k \mathrm{l}_{(a_{k-1}, a_k]},\ \alpha_k \ge 0,\ k = 1, \ldots, N,\ \int f(x)dx = 1 \Big\}.$$
Here, the cell boundaries $a_0 < a_1 < \cdots < a_N$ are fixed.

Example 2.2. Minimum empirical risk estimation in classification. Let $(X, Y)$ be random variables, with $X \in \mathcal{X}$ a co-variable and $Y \in \{0, 1\}$ a label. For example, $X$ may be the features of a mushroom (size, shape, color), and $Y$ indicates whether it is edible or not. A classifier is a function $f : \mathcal{X} \to \{0, 1\}$. So $f$ is the indicator function of a set $G \subset \mathcal{X}$, and instances $x \in G$ are classified with the label 1, whereas $x \notin G$ are classified with the label 0. We will usually identify sets with indicator functions. Define the regression
$$\eta_0(x) = P(Y = 1 | X = x).$$
Bayes rule $f_0$ is to classify observations $X$ with $\eta_0(X) > 1/2$ in the class with label 1 and those with $\eta_0(X) \le 1/2$ in the class with label 0, i.e., $f_0 = \mathrm{l}_{G_0}$, $G_0 = \{x : \eta_0(x) > 1/2\}$. Identifying sets and indicator functions, we also call the set $G_0$ Bayes rule.

Let $\{(X_i, Y_i)\}_{i=1}^n$ be a sample from $(X, Y)$. One may estimate Bayes rule in the following way. The number of misclassifications using the classifier $G \subset \mathcal{X}$ is
$$\#\{X_i \in G, Y_i = 0\} + \#\{X_i \notin G, Y_i = 1\} = \sum_{i=1}^n |Y_i - \mathrm{l}_G(X_i)| := nR_n(G),$$
where $R_n(G)$ is the proportion of misclassifications. The prediction error of a classifier $G$ is the theoretical counterpart $R(G)$ of $R_n(G)$:
$$R(G) = P(X \in G, Y = 0) + P(X \notin G, Y = 1) = ER_n(G).$$
The empirical risk minimizer using the model class $\mathcal{G}$ is defined as
$$\hat G_n = \arg\min_{G \in \mathcal{G}} R_n(G).$$
Exercise 2.3. Verify that Bayes rule G0 minimizes R(G) over all subsets G ⊂ X .
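Empirical risk minimization in classification can be tried out on a small simulated example. The sketch below rests on assumptions that are not in the text: $X$ is uniform on $[0, 1]$, $\eta_0(x) = x$ (so that Bayes rule is $G_0 = \{x > 1/2\}$ with $R(G_0) = 1/4$), and the model class consists of the sets $\{x > t\}$ for thresholds $t$ on a grid. The function names and the seed are likewise hypothetical.

```python
import numpy as np

def empirical_risk(threshold, x, y):
    """R_n(G) for G = {x > threshold}: the proportion of misclassified observations."""
    predicted = (x > threshold).astype(int)
    return np.mean(np.abs(y - predicted))

def theoretical_risk(threshold):
    """R(G) for G = {x > threshold} when X ~ U[0, 1] and eta_0(x) = x (assumed setting)."""
    return 0.5 * (1.0 - threshold) ** 2 + 0.5 * threshold ** 2

rng = np.random.default_rng(3)                     # arbitrary seed (assumption)
n = 500
x = rng.uniform(size=n)
y = (rng.uniform(size=n) < x).astype(int)          # P(Y = 1 | X = x) = x

thresholds = np.linspace(0.0, 1.0, 101)            # model class: G_t = {x > t}
risks = np.array([empirical_risk(t, x, y) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]               # empirical risk minimizer

print(f"t_hat = {t_hat:.2f}, R_n = {risks.min():.3f}, "
      f"R = {theoretical_risk(t_hat):.3f}, Bayes risk = {theoretical_risk(0.5):.3f}")
```

The prediction error of the selected threshold can never be below the Bayes risk, while its empirical risk may well be, which is the optimism that the later chapters quantify.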
2.2. General framework

Examples 2.1 and 2.2 are two examples of M-estimation. The general principle is as follows. We have observed i.i.d. copies $\{X_i\}_{i=1}^n$ of a random variable $X$ with distribution $P$. We choose a model class $\mathcal{F} \subset \bar{\mathcal{F}}$ for the (possibly infinite-dimensional) parameter of interest $f_0 \in \bar{\mathcal{F}}$. Here $\bar{\mathcal{F}}$ is a large class of candidate $f$'s, and we refer to it as the class of all $f$'s. For all $f$, let $\gamma_f : \mathcal{X} \to \mathbb{R}$ be a loss function. Define the theoretical risk
$$R(f) = P(\gamma_f).$$
The loss function $\gamma$ is chosen in such a way that the risk is smallest at the parameter of interest $f_0$:
$$f_0 = \arg\min_{\text{all } f} R(f).$$
Let the empirical risk be
$$R_n(f) = P_n(\gamma_f).$$
Let $\mathcal{F} \subset \bar{\mathcal{F}}$ be a model class. The M-estimator $\hat f_n$ ("M" stands for Minimization (or Maximization)) is then
$$\hat f_n = \arg\min_{f \in \mathcal{F}} R_n(f). \tag{2.1}$$
A model class $\mathcal{F}$ is chosen here in the empirical risk minimization, because minimizing the empirical risk over all $f$ generally leads to overfitting. The class $\mathcal{F}$ should on the other hand not be chosen too small, as that may result in a large systematic error. This problem is exactly our object of study, and it will be presented more formally in Section 2.7. Complexity regularization will be used as a procedure for choosing $\mathcal{F}$ "not too large and not too small".

In Example 2.1, the loss function is $\gamma_f = -\log f$, with $f$ a density. In Example 2.2 or a more general regression set-up, with i.i.d. co-variables (random design), the situation is within our framework, but we use a different notation. We let $\{(X_i, Y_i)\}_{i=1}^n$ be i.i.d. copies of a random variable $(X, Y)$ with distribution $P$, and consider a loss function $\gamma_f : \mathcal{X} \times \mathbb{R} \to \mathbb{R}$, $f \in \bar{\mathcal{F}}$. (For instance, in Example 2.2, the loss function is $\gamma_f(x, y) = |y - f(x)|$, with $f = \mathrm{l}_G$ the indicator function of a subset $G \subset \mathcal{X}$.) The distribution of the co-variable $X$ is denoted by $Q$, and the empirical distribution of $X_1, \ldots, X_n$ is denoted by $Q_n$.

When the co-variables are fixed (non-random design), the regression model concerns a situation with independent but not identically distributed observations. However, this will not really alter the general theory. We formulated this introduction for the i.i.d. case merely for ease of exposition.

2.3. Estimation and approximation error

We will now be somewhat more precise in our definitions of inaccuracy and systematic error. We will refer to the inaccuracy as estimation error and the systematic error as approximation error. Let $\gamma_f$ be a loss function, $R(f) = P(\gamma_f)$, and $f_0$ the overall minimizer of $R(f)$. Let $\mathcal{F}$ be the model class, and $f_*$ the minimizer of $R(f)$ over $\mathcal{F}$.
The empirical risk is $R_n(f) = P_n(\gamma_f)$. Using the M-estimator (or empirical risk minimizer) $\hat f_n$ given in (2.1), its estimation error is defined as
$$R(\hat f_n) - R(f_*).$$
Note that this is a random variable. Below, we will extend the concept, and for example refer to the average value of $R(\hat f_n) - R(f_*)$ (or a quantity of this order) as the estimation error. The approximation error is defined as
$$R(f_*) - R(f_0).$$
This is a non-random quantity. The difference $R(f) - R(f_0)$ is called the excess risk at $f$. Note that it is always non-negative, because $f_0$ minimizes $R$. We will investigate the behavior of the excess risk $R(\hat f_n) - R(f_0)$. This will lead to probability inequalities for $R(\hat f_n) - R(f_0)$, or bounds for, e.g., the average excess risk $ER(\hat f_n) - R(f_0)$.

Exercise 2.4. Consider a regression model $E(Y | X = x) = f_0(x)$. Let $\gamma_f(x, y) = (y - f(x))^2$ be the least squares loss function. Consider first the case of fixed design (i.e., work conditionally on $X_i = x_i$, $i = 1, \ldots, n$). Let
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n (Y_i - f(x_i))^2$$
and
$$R(f) = ER_n(f)$$
(where the expectation is conditional given $x_1, \ldots, x_n$). Check that the excess risk $R(f) - R(f_0)$ is equal to $\|f - f_0\|_n^2$, where $\|\cdot\|_n$ is the $L_2(Q_n)$-norm. Suppose that $\mathcal{F}$ is a (finite-dimensional) linear space. Show that $f_*$ is the projection of $f_0$ (in $L_2(Q_n)$) on $\mathcal{F}$ and that hence
$$\|\hat f_n - f_0\|_n^2 = \|\hat f_n - f_*\|_n^2 + \|f_* - f_0\|_n^2.$$
Conclude similarly for the random design case, where $\|\cdot\|_n$ is to be replaced by the $L_2(Q)$-norm $\|\cdot\|$.

2.4. Where empirical process theory comes in

To derive bounds for the excess risk of the M-estimator $\hat f_n$ given in (2.1), we use the inequalities
$$R(\hat f_n) - R(f_*) \ge 0,$$
and
$$R_n(\hat f_n) - R_n(f_*) \le 0.$$
These inequalities are true because $f_*$ minimizes the theoretical risk over $\mathcal{F}$ and $\hat f_n$ minimizes the empirical risk over $\mathcal{F}$.
Thus we may write
$$0 \le R(\hat f_n) - R(f_*) = -\left[ (R_n(\hat f_n) - R(\hat f_n)) - (R_n(f_*) - R(f_*)) \right] + R_n(\hat f_n) - R_n(f_*)$$
$$\le -\left[ (R_n(\hat f_n) - R(\hat f_n)) - (R_n(f_*) - R(f_*)) \right].$$
We write for $f \in \bar{\mathcal{F}}$,
$$\nu_n(f) = \sqrt{n}\,(R_n(f) - R(f)) = \sqrt{n}\,(P_n(\gamma_f) - P(\gamma_f)).$$
Let $\mathcal{F}_0$ be some subset of $\bar{\mathcal{F}}$. The empirical process indexed by $\mathcal{F}_0$ is
$$\{\nu_n(f) : f \in \mathcal{F}_0\}.$$
From the above, we know that
$$0 \le R(\hat f_n) - R(f_*) \le -[\nu_n(\hat f_n) - \nu_n(f_*)]/\sqrt{n}.$$
Adding $R(f_*) - R(f_0)$ to both sides of this inequality yields moreover
$$R(\hat f_n) - R(f_0) \le -[\nu_n(\hat f_n) - \nu_n(f_*)]/\sqrt{n} + [R(f_*) - R(f_0)] = I + II, \tag{2.2}$$
with
$$I = -[\nu_n(\hat f_n) - \nu_n(f_*)]/\sqrt{n}$$
and with $II$ the approximation error
$$II = [R(f_*) - R(f_0)].$$
This inequality reveals the two components $I$ and $II$ when studying the excess risk at $\hat f_n$. The expression $I$ is a bound for the estimation error. Empirical process theory is invoked to examine this term. Handling the approximation error $II$ involves approximation theory. We refer to inequalities like (2.2) as basic inequalities, because such inequalities will be the starting point for deriving bounds for the excess risk at $\hat f_n$.

Exercise 2.5. In Exercise 2.4, define the noise variables $\epsilon_i = Y_i - f_0(x_i)$, $i = 1, \ldots, n$. Verify that for the model with fixed design
$$\nu_n(f) - \nu_n(f_*) = -\frac{2}{\sqrt{n}} \sum_{i=1}^n \epsilon_i (f(x_i) - f_*(x_i)).$$
In the model with random design, it becomes
$$\nu_n(f) - \nu_n(f_*) = -\frac{2}{\sqrt{n}} \sum_{i=1}^n \epsilon_i (f(X_i) - f_*(X_i)) + \sqrt{n} \left[ (\|f - f_0\|_n^2 - \|f - f_0\|^2) - (\|f_* - f_0\|_n^2 - \|f_* - f_0\|^2) \right].$$
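The decomposition of Exercise 2.4 and the basic inequality (2.2) can be checked numerically for least squares with fixed design. The sketch below is an illustration under assumptions made here: the truth `f0`, the noise level, the seed, and the model class (polynomials of degree at most 2) are all hypothetical. In fixed design, the term $I$ equals $(2/n)\sum_i \epsilon_i(\hat f_n(x_i) - f_*(x_i))$ by the first identity of Exercise 2.5.

```python
import numpy as np

rng = np.random.default_rng(4)                  # arbitrary seed (assumption)
n = 50
x = np.arange(1, n + 1) / n                     # fixed design
f0 = np.sin(2 * np.pi * x)                      # hypothetical truth, not in the model class
eps = 0.3 * rng.standard_normal(n)              # noise eps_i = Y_i - f0(x_i)
y = f0 + eps

B = np.vander(x, 3, increasing=True)            # basis 1, x, x^2 of the linear space F

def project(v):
    """Projection of v on F in L2(Q_n), computed by least squares."""
    return B @ np.linalg.lstsq(B, v, rcond=None)[0]

def sq_norm(v):
    """Squared L2(Q_n)-norm."""
    return np.mean(v ** 2)

f_hat = project(y)                              # minimizes R_n over F
f_star = project(f0)                            # minimizes R over F

excess = sq_norm(f_hat - f0)                    # R(f_hat) - R(f0)
estimation = sq_norm(f_hat - f_star)            # R(f_hat) - R(f_star), by Pythagoras
approximation = sq_norm(f_star - f0)            # R(f_star) - R(f0)
term_I = 2.0 * np.mean(eps * (f_hat - f_star))  # -[nu_n(f_hat) - nu_n(f_star)] / sqrt(n)

print(f"excess={excess:.4f}  estimation={estimation:.4f}  "
      f"approximation={approximation:.4f}  I={term_I:.4f}")
```

For a linear space the bound is loose by exactly a factor two on the estimation part, since $\hat f_n - f_*$ is the projection of the noise and $I$ is twice its squared norm.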
The two main tools for studying M-estimators are empirical process theory and approximation theory. Empirical process theory is used to investigate the estimation error of an estimator. There is no straight answer on how to invoke which parts of empirical process theory. In Lemma 2.1 below, we give an example, which highlights the main idea. Empirical process theory supplies us with inequalities for suprema of empirical processes indexed by functions. We will in fact need the behavior of increments of the empirical process. This concerns the question: if two functions f and f∗ are “close”, then how “small” is the difference |νn (f ) − νn (f∗ )|? We welcome an answer that holds uniformly over f ∈ F in a neighborhood of f∗ , because we can then apply it to a random function (an estimator) in this neighborhood. Empirical process theory gives us probability or moment bounds for the suprema of empirical processes. These can be “directly” derived inequalities. However, a strong tool is based on so-called concentration inequalities. Concentration inequalities are (exponential, or even sub-Gaussian) probability inequalities for the concentration of a random variable around its mean. One can apply them to the random variable representing the supremum of an empirical process. Then, the only task left is to find good bounds for the mean of this supremum. This is in some cases as easy as Cauchy-Schwarz, perhaps preceded by a so-called symmetrization and/or a contraction inequality. These concepts (concentration, symmetrization and contraction) will be discussed in more detail in Chapter 6. Approximation theory will be used to illustrate how the behavior of estimators depends on how well the model approximates the “truth”. Since in real life we will actually never find out what the truth is, these illustrations are purely theoretical. 2.5. Some first results, assuming ready-to-use empirical process theory We will assume two inequalities. 
The first one, the margin condition, is purely deterministic. It depends very much on the problem under consideration. The second one, the empirical process condition, is an assumption on the increments of the empirical process $\{\nu_n(f) : f \in \mathcal{F}\}$. It is in a ready-to-use form. In the literature, such empirical process inequalities have indeed been established (see also Section 6.6). A very simple example can be established in Exercise 3.2. We assume that the two inequalities hold for some metric $d$ on $\mathcal{F}$. The inequalities are tied up in a technical condition on the parameters involved.

Margin condition. For some constants $c_2$ and $\kappa \ge 1$ and for all $f \in \mathcal{F}$,
$$R(f) - R(f_0) \ge d^\kappa(f, f_0)/c_2.$$

In the margin condition, $\kappa$ is an identifiability parameter. If it is small, then $f_0$ is well-identified for the metric $d$. Frequently, the margin condition only holds for those $f$ with $d(f, f_0)$ bounded by some constant. We skip this issue here to avoid digressions.

Exercise 2.6. In Exercise 2.4, check that the margin condition is met, with $\kappa = 2$ and $d(f, f_0) = \|f - f_0\|_n$ (fixed design) or $d(f, f_0) = \|f - f_0\|$ (random design).
Empirical process condition. For some constants $c_1$, $r > 0$, and $0 \le \beta \le 1$, we have
$$E\left( \sup_{f \in \mathcal{F}} \frac{|\nu_n(f) - \nu_n(f_*)|}{d^\beta(f, f_*)} \right)^r \le c_1^r.$$

In the empirical process condition, $\beta$ is a complexity parameter. If it is small, the class $\mathcal{F}$ is complex, and the increments of the empirical process can be large. When $f = f_*$, the weighted empirical process $(\nu_n(f) - \nu_n(f_*))/d^\beta(f, f_*)$ is defined to be zero. We remark now that meeting the empirical process condition frequently means that the class $\mathcal{F}$ does not contain $f$'s with distance $d(f, f_*)$ not zero but very small. This may not be true for our original $\mathcal{F}$. But actually, as we want to prove that $d(\hat f, f_*)$ is small, this generally turns out to be only a minor technical issue.

Technical condition. We have $\kappa > \beta$ and $r \ge \frac{\kappa}{\kappa - \beta}$.

In Lemma 2.1, we first consider the well specified case, i.e., the case where $f_0 \in \mathcal{F}$. We then show that in the misspecified case, the approximation error occurs as an additional term.

Lemma 2.1. Let $\hat f_n$ be the M-estimator defined in (2.1). Suppose the margin condition, the empirical process condition, and the technical condition are met. If $f_0 \in \mathcal{F}$ (well specified case), we have
$$ER(\hat f_n) - R(f_0) \le c_1^{\frac{\kappa}{\kappa-\beta}} c_2^{\frac{\beta}{\kappa-\beta}} n^{-\frac{\kappa}{2(\kappa-\beta)}}. \tag{2.3}$$
More generally, if possibly $f_0 \notin \mathcal{F}$ (misspecified case), we have for any $0 < \delta < 1$,
$$ER(\hat f_n) - R(f_0) \le \Big( \frac{1+\delta}{1-\delta} \Big) \{V_n + R(f_*) - R(f_0)\}, \tag{2.4}$$
where
$$V_n = 2 c_1^{\frac{\kappa}{\kappa-\beta}} (c_2/\delta)^{\frac{\beta}{\kappa-\beta}} n^{-\frac{\kappa}{2(\kappa-\beta)}}. \tag{2.5}$$

Proof. Define
$$Z_n = |\nu_n(\hat f_n) - \nu_n(f_*)|/d^\beta(\hat f_n, f_*).$$
From basic inequality (2.2) it follows that
$$R(\hat f_n) - R(f_0) \le Z_n d^\beta(\hat f_n, f_*)/\sqrt{n} + R(f_*) - R(f_0). \tag{2.6}$$
Suppose first that $f_0 \in \mathcal{F}$, so that $f_0 = f_*$. By the margin condition, we know that
$$d^\beta(\hat f_n, f_0) \le c_2^{\frac{\beta}{\kappa}} (R(\hat f_n) - R(f_0))^{\frac{\beta}{\kappa}}.$$
Hence, from (2.6)
$$R(\hat f_n) - R(f_0) \le c_2^{\frac{\beta}{\kappa}} (R(\hat f_n) - R(f_0))^{\frac{\beta}{\kappa}} Z_n/\sqrt{n}.$$
This implies
$$R(\hat f_n) - R(f_0) \le c_2^{\frac{\beta}{\kappa-\beta}} Z_n^{\frac{\kappa}{\kappa-\beta}} n^{-\frac{\kappa}{2(\kappa-\beta)}}.$$
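Result (2.3) now follows from
$$E Z_n^{\frac{\kappa}{\kappa-\beta}} \le (E Z_n^r)^{\frac{\kappa}{r(\kappa-\beta)}} \le c_1^{\frac{\kappa}{\kappa-\beta}} . \tag{2.7}$$
More generally, in the possibly misspecified case, we use that
$$d^\beta(\hat f_n, f_*) \le d^\beta(\hat f_n, f_0) + d^\beta(f_*, f_0) \le c_2^{\frac{\beta}{\kappa}} (R(\hat f_n) - R(f_0))^{\frac{\beta}{\kappa}} + c_2^{\frac{\beta}{\kappa}} (R(f_*) - R(f_0))^{\frac{\beta}{\kappa}} .$$
Now apply the Technical Lemma below twice. We find
$$\begin{aligned}
R(\hat f_n) - R(f_0) &\le c_2^{\frac{\beta}{\kappa}} (R(\hat f_n) - R(f_0))^{\frac{\beta}{\kappa}} Z_n/\sqrt{n} + c_2^{\frac{\beta}{\kappa}} (R(f_*) - R(f_0))^{\frac{\beta}{\kappa}} Z_n/\sqrt{n} + R(f_*) - R(f_0) \\
&\le \delta (R(\hat f_n) - R(f_0)) + \delta (R(f_*) - R(f_0)) + 2 (c_2/\delta)^{\frac{\beta}{\kappa-\beta}} Z_n^{\frac{\kappa}{\kappa-\beta}} n^{-\frac{\kappa}{2(\kappa-\beta)}} + R(f_*) - R(f_0).
\end{aligned}$$
Conclude by invoking (2.7).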
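The next lemma was used in the proof of Lemma 2.1. It will in fact also be of help in proofs in Chapter 5.

Technical Lemma. We have for all positive $v$, $t$, and $\delta$, and $\kappa > \beta$,
$$v t^{\frac{\beta}{\kappa}} \le \delta t + v^{\frac{\kappa}{\kappa-\beta}} \delta^{-\frac{\beta}{\kappa-\beta}} .$$

Proof. Suppose first that $v/\delta \le t^{\frac{\kappa-\beta}{\kappa}}$. Then obviously
$$v t^{\frac{\beta}{\kappa}} = \frac{v}{\delta} \, \delta t^{\frac{\beta}{\kappa}} \le \delta t .$$
Conversely, if $v/\delta \ge t^{\frac{\kappa-\beta}{\kappa}}$, then $t \le (v/\delta)^{\frac{\kappa}{\kappa-\beta}}$. So then
$$v t^{\frac{\beta}{\kappa}} \le v \left( \frac{v}{\delta} \right)^{\frac{\beta}{\kappa-\beta}} = v^{\frac{\kappa}{\kappa-\beta}} \delta^{-\frac{\beta}{\kappa-\beta}} .$$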
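As a quick sanity check of the Technical Lemma, the following minimal sketch evaluates both sides of the inequality on a small grid of values. The function name `technical_lemma_sides` and the chosen grid are illustrative assumptions, not part of the text; the check is numerical and of course no substitute for the proof.

```python
import itertools


def technical_lemma_sides(v, t, delta, kappa, beta):
    """Return (lhs, rhs) of  v * t^(beta/kappa) <= delta*t + v^(kappa/(kappa-beta)) * delta^(-beta/(kappa-beta)).

    Hypothetical helper for illustration; requires v, t, delta > 0 and kappa > beta >= 0.
    """
    if not (v > 0 and t > 0 and delta > 0 and kappa > beta >= 0):
        raise ValueError("need v, t, delta > 0 and kappa > beta >= 0")
    lhs = v * t ** (beta / kappa)
    rhs = delta * t + v ** (kappa / (kappa - beta)) * delta ** (-beta / (kappa - beta))
    return lhs, rhs


# Evaluate the inequality on an (arbitrary) grid of parameter values.
grid = [0.01, 0.5, 1.0, 7.0]
results = [
    technical_lemma_sides(v, t, delta, kappa=2.0, beta=1.0)
    for v, t, delta in itertools.product(grid, grid, grid)
]
```

With $\kappa = 2$ and $\beta = 1$, the values used in Exercise 3.2, the inequality reduces to $v\sqrt{t} \le \delta t + v^2/\delta$.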
Recall that we defined the estimation error as $R(\hat f_n) - R(f_*)$. Now, in a general context, the estimation error and approximation error cannot be studied separately. This is primarily because the margin condition is assumed at $f_0$, not at $f_*$. Also, we will generally bound the estimation error by the term I in (2.2) involving the empirical process. For this term we will in turn consider a (probability) bound. To understand more clearly the meaning of the formulas, we will be flexible in our use of terminology, and refer to such upper bounds as estimation error. For example, in the context of Lemma 2.1, $V_n$ in (2.5) is (an upper bound for) the (average) estimation error, and we simply refer to it as estimation error.
208
S. van de Geer
2.6. Balancing estimation and approximation error

Lemma 2.1 gives an upper bound for the average excess risk, involving the (bound for the) estimation error and the approximation error. More generally, suppose we have shown that for some constants $C$ and $V_n$
$$E R(\hat f_n) - R(f_0) \le C \{ V_n + R(f_*) - R(f_0) \} .$$
Suppose that the constant $C$ does not depend on $P$ or on the model class $\mathcal{F}$. Clearly, the estimation error $V_n$ depends on the model $\mathcal{F}$, for example in (2.5) through the parameter $\beta$. Moreover, $V_n$ may also depend on $P$, for example as in (2.5) through the parameter $\kappa$. It is clear that also the approximation error, which we write for short as $B_n^2 = R(f_*) - R(f_0)$, depends on $\mathcal{F}$ and $P$. To express these dependencies, let us write $V_n = V_n(\mathcal{F}, P)$ and $B_n^2 = B_n^2(\mathcal{F}, P)$. Given a collection of models $\{\mathcal{F}\}$, the optimal model $\mathcal{F}_*$ would now be
$$\mathcal{F}_* = \arg\min_{\{\mathcal{F}\}} \{ V_n(\mathcal{F}, P) + B_n^2(\mathcal{F}, P) \} .$$
But since the optimal model $\mathcal{F}_*$ depends on $P$, only an oracle knows what $\mathcal{F}_*$ is. The oracle is mimicked if we can construct an estimator with excess risk at most that of the estimator one would use if $\mathcal{F}_*$ were known. In our theory, we will not be able to arrive at exact oracle behavior. Instead, a trade off up to constants independent of the sample size $n$, or possibly up to $(\log n)$-factors, is established. Having essentially the large-$n$ situation in mind, we will not be too much concerned about such constants and $(\log n)$-factors. In conclusion, we look for an estimator $\hat f_n$ satisfying up to $(\log n)$-terms
$$E R(\hat f_n) - R(f_0) = O\big( V_n(\mathcal{F}_*, P) + B_n^2(\mathcal{F}_*, P) \big) .$$
The approach to this end will be to add a penalty to the empirical risk, so that complicated models are penalized. Such a method is called penalized empirical risk minimization, and it is a form of complexity regularization.

2.7. Bibliographical remarks

For the application of empirical process theory to evaluate M-estimators, we refer to van der Vaart and Wellner (1996), and van de Geer (2000) and the references in these books. We remark that much recent work uses refined concentration inequalities (e.g., Massart (2000b), see also Section 6.1). The empirical process condition is a condition on the weighted empirical process. See Section 6.4 for techniques. The margin condition is from Tsybakov (2004) (see also Mammen and Tsybakov (1999)). It in particular plays an important role in the classification problem (see Example 2.2), but also in other contexts (see van de Geer (2003) and Chapter 5). The Technical Lemma is as in Tsybakov and van de Geer (2005), albeit with a larger constant but a simpler proof.
3. The sequence space formulation

The aim of this chapter is to reveal the main ingredients of complexity regularization while keeping technicalities to a minimum. We will consider a situation where the average excess risk can be computed exactly and without much effort. We examine orthogonal series expansions of a regression function. The coefficients are estimated by least squares. The question we address is: which coefficients should we keep? Keeping many coefficients results in an estimator of the regression curve with large variance. On the other hand, killing many coefficients might lead to a large bias. An oracle can choose the optimal trade off between bias and variance. Hard- and soft-thresholding are presented as estimation methods that mimic this oracle.

In Section 3.2 the concept of sparseness is introduced as motivation for considering the collection of models given in Section 3.3. Section 3.4 presents the model an oracle would select from this collection. Section 3.5 states that hard- and soft-thresholding estimators mimic the oracle, and presents a proof of this for the soft-thresholding case, using the empirical process result of Section 3.6. The consequences of the theory for sparse signals are left as exercises (Exercises 3.4 and 3.5).

3.1. Reformulation of the regression problem

Consider the regression
$$Y_i = f_0(x_i) + \epsilon_i , \quad i = 1, \ldots, n,$$
where $\epsilon_1, \ldots, \epsilon_n$ are independent and centered noise variables, and $f_0$ is an unknown function on $\mathcal{X}$. Our objective is to put forward the basic idea of an oracle inequality. We want to facilitate the exposition as much as possible. In this spirit, we make the simplifying assumption of normally distributed errors, that is, $\epsilon_1, \ldots, \epsilon_n$ are assumed to be i.i.d. $N(0, \sigma^2)$-distributed.

We may collect the observations $Y = (Y_1, \ldots, Y_n)$ in a (random) vector in $\mathbb{R}^n$. The regression model takes the mean of $Y$ as (possibly partly) unknown vector in $\mathbb{R}^n$. The co-variables $x_1, \ldots, x_n$ are considered as fixed (nonrandom design) in this chapter. Recall that $Q_n$ denotes the empirical distribution of $x_1, \ldots, x_n$. Let $(\psi_1, \ldots, \psi_n)$ be certain given functions, chosen in such a way that they form an orthonormal basis in $L_2(Q_n)$. Thus $\psi_j : \mathcal{X} \to \mathbb{R}$, $j = 1, \ldots, n$, and
$$\frac{1}{n} \sum_{i=1}^n \psi_j^2(x_i) = 1, \quad \forall\, j,$$
i.e., the functions have length 1, and
$$\frac{1}{n} \sum_{i=1}^n \psi_j(x_i) \psi_k(x_i) = 0, \quad \forall\, k \ne j,$$
i.e., the functions are orthogonal to each other.
Define now the inner products
$$\tilde Y_j = \frac{1}{n} \sum_{i=1}^n Y_i \psi_j(x_i), \quad j = 1, \ldots, n,$$
and
$$\theta_{j,0} = \frac{1}{n} \sum_{i=1}^n f_0(x_i) \psi_j(x_i), \quad j = 1, \ldots, n.$$
We then have
$$\tilde Y_j = \theta_{j,0} + \tilde\epsilon_j , \quad j = 1, \ldots, n, \tag{3.1}$$
where
$$\tilde\epsilon_j = \frac{1}{n} \sum_{i=1}^n \epsilon_i \psi_j(x_i), \quad j = 1, \ldots, n.$$

Exercise 3.1. Show that $\tilde\epsilon_1, \ldots, \tilde\epsilon_n$ are i.i.d. and $N(0, \sigma^2/n)$-distributed.

We call (3.1) the sequence space formulation. If the regression function $f_0$ is completely unknown, the expectation $\theta_0 = (\theta_{1,0}, \ldots, \theta_{n,0})$ of the random vector $\tilde Y = (\tilde Y_1, \ldots, \tilde Y_n)$ is completely unknown. So then there are $n$ unknowns, that is, as many unknowns as there are observations. Nevertheless, when the signal is sparse, one can estimate the vector $\theta_0$ (in a global sense). With sparseness, we mean that most of the elements in $\theta_0$ are zero or almost zero. In the literature, the sparseness of a representation is often defined as the number of zero coefficients, so that one representation is sparser than another iff it has fewer non-zero coefficients. Section 3.2 gives a formally somewhat different definition, giving the possibility of a more refined comparison.

We end this section with some remarks. In practice, one has to make a choice for the orthonormal basis $\{\psi_j\}$. This is like choosing a language to interpret a given (noisy) text. Here, the special properties of the data set are of importance. For example, some signals are well described by taking a Fourier basis, others by wavelets or trigonometric series. The problem is closely related to data compression. One hopes to choose a basis such that $\theta_0$ is indeed sparse, i.e., that one has an economical approximate representation with only a few non-zero coefficients.

In the literature, most systems of basis functions are orthonormal for Lebesgue measure instead of $Q_n$. We assume orthonormality in $L_2(Q_n)$ because it makes it possible to avoid technical calculations. We note moreover that in many applications, sparse representations can only be obtained using non-orthogonal (and perhaps overcomplete) representations (for example for representing sharp edges in a picture). The mapping $x \mapsto \psi(x)$ from $\mathcal{X}$ to $\mathbb{R}^n$ is sometimes called the feature mapping.
For example, if $f_0$ is a picture, then $\psi$ may represent features like angles, directions, and shapes. In some other situations $\psi$ is of the form $\psi_j(x_i) = K(|x_i - x_j|/h)$ where $|\cdot|$ is some metric on $\mathcal{X}$, $K$ is some kernel and $h$ a regularization parameter, often called the width. In that case, sparseness may be less of an issue. It is rather the choice of the parameter $h$ that is of importance here.

If there are many sets of basis functions to choose from, we call them dictionaries. One may use a data dependent method to choose a dictionary, but we will not consider this issue.

In the next section, we take the sequence space formulation as a starting point, and we will omit the "tilde" in our notation.

3.2. Estimating the mean of a normal vector

In this section we study the estimation of the vector $\theta_0 \in \mathbb{R}^n$ from observations
$$Y_j = \theta_{j,0} + \epsilon_j , \quad j = 1, \ldots, n,$$
with $\epsilon_1, \ldots, \epsilon_n$ independent $N(0, \sigma^2/n)$-distributed. Denote the Euclidean norm on $\mathbb{R}^n$ as
$$\|\theta\|_n^2 = \sum_{j=1}^n |\theta_j|^2 , \quad \theta \in \mathbb{R}^n .$$

[Figure 7: A sparse signal.]

Now, we want to find an estimator $\hat\theta_n$ such that $\|\hat\theta_n - \theta_0\|_n$ is small. In general, one cannot guarantee that a good estimator exists. But if we believe that most of the coefficients $\theta_{j,0}$ are zero or at least not too big, we can construct an estimator
that exploits this as much as possible. A signal with only few large coefficients is called sparse. It is like the starry night sky: if you consider the sky as a set of pixels, then at most pixels there is no star (no signal), or it is too far away, and the light you see there is mainly background noise. As mathematical description of sparseness, we take:

Definition. The signal $\theta_0 \in \mathbb{R}^n$ is called sparse if for some $0 \le r < 2$,
$$\sum_{j=1}^n |\theta_{j,0}|^r \le 1. \tag{3.2}$$

Note that sparseness depends on the parameter $r$, which is a roughness parameter. Very small values of $r$ correspond to very sparse signals. The extreme case is $r = 0$, in which case there is at most one non-zero coefficient (using the convention $0^0 = 0$ and $z^0 = 1$ for $z > 0$). The constant 1 in the right-hand side of (3.2) is chosen for ease of exposition. It could be replaced by any other constant by a rescaling argument.

Lemma 3.1 below states that our definition of sparseness implies that a sparse signal has only a few large coefficients. In Lemma 3.2, it is shown that such a signal can be well approximated by one with only a few non-zero coefficients.

Lemma 3.1. Suppose $\theta_0$ is sparse. Then
$$\#\{ |\theta_{j,0}| > \lambda \} \le \lambda^{-r} .$$

Proof. By the definition of sparseness,
$$\sum_{j=1}^n |\theta_{j,0}|^r \le 1.$$
Conversely
$$\sum_{j=1}^n |\theta_{j,0}|^r \ge \sum_{|\theta_{j,0}| > \lambda} |\theta_{j,0}|^r \ge \#\{ |\theta_{j,0}| > \lambda \} \lambda^r .$$

Lemma 3.2. Let
$$\theta_{j,*} = \begin{cases} \theta_{j,0} & \text{if } |\theta_{j,0}| > \lambda \\ 0 & \text{if } |\theta_{j,0}| \le \lambda \end{cases} .$$
If $\theta_0$ is sparse we have
$$\|\theta_* - \theta_0\|_n^2 \le \lambda^{2-r} .$$

Proof. By the definition of $\theta_*$, we have
$$\|\theta_* - \theta_0\|_n^2 = \sum_{|\theta_{j,0}| \le \lambda} |\theta_{j,0}|^2 \le \lambda^{2-r} \sum_{j=1}^n |\theta_{j,0}|^r \le \lambda^{2-r} ,$$
using in the last inequality the assumed sparseness of $\theta_0$.
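Lemmas 3.1 and 3.2 can be illustrated numerically. The sketch below, in Python with NumPy, rescales an arbitrary coefficient vector so that it satisfies (3.2) with equality, and then evaluates the two quantities bounded in the lemmas. The helper names and the example signal are assumptions made for illustration only; the normalization step requires $r > 0$.

```python
import numpy as np


def normalize_to_sparse(theta, r):
    """Rescale theta so that sum_j |theta_j|^r = 1 (illustrative helper, needs 0 < r < 2)."""
    theta = np.asarray(theta, dtype=float)
    total = np.sum(np.abs(theta) ** r)
    return theta / total ** (1.0 / r)


def count_large(theta, lam):
    """Number of coefficients with |theta_j| > lam, the quantity bounded in Lemma 3.1."""
    return int(np.sum(np.abs(theta) > lam))


def truncation_error(theta, lam):
    """Squared Euclidean distance between theta and its truncation theta_* of Lemma 3.2."""
    theta_star = np.where(np.abs(theta) > lam, theta, 0.0)
    return float(np.sum((theta_star - theta) ** 2))


# Example signal (an assumption): polynomially decaying coefficients, r = 0.5.
r = 0.5
theta0 = normalize_to_sparse(1.0 / np.arange(1, 51) ** 2, r)
```

For such a `theta0`, Lemma 3.1 predicts `count_large(theta0, lam) <= lam ** (-r)` and Lemma 3.2 predicts `truncation_error(theta0, lam) <= lam ** (2 - r)` for every `lam > 0`.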
3.3. A collection of models

Recall that
$$Y_j = \theta_{j,0} + \epsilon_j , \quad j = 1, \ldots, n,$$
with $\epsilon_1, \ldots, \epsilon_n$ i.i.d. $N(0, \sigma^2/n)$. Now, we have in mind the situation where we believe that most coefficients in $\theta_0$ are small. But we do not know which coefficients are small, and we certainly do not know whether the signal is sparse with a given roughness parameter $r$. We stress that here the purpose of some definition of sparseness is to motivate our choice of model class. In the rest of this chapter, sparseness is only used as illustration of the implications of an oracle inequality, see Exercises 3.4 and 3.5.

So we believe and hope most coefficients can be best estimated as being zero, but we don't know which ones. Suppose we decide to set all coefficients with index $j \notin \mathcal{J}$ ($\mathcal{J} \subset \{1, \ldots, n\}$) to zero. In the language of the previous chapter, our model class is then
$$\mathcal{F} = \Theta(\mathcal{J}) = \{ \theta \in \mathbb{R}^n : \theta_j = 0 \ \forall\, j \notin \mathcal{J} \} .$$
As empirical risk, we take the least squares loss
$$R_n(\theta) = \sum_{j=1}^n (Y_j - \theta_j)^2$$
with theoretical counterpart
$$R(\theta) = E R_n(\theta) = \|\theta - \theta_0\|_n^2 + \sigma^2 .$$
Note thus that $R(\theta_0) = \sigma^2$, and that the excess risk is equal to
$$R(\theta) - R(\theta_0) = \|\theta - \theta_0\|_n^2 .$$
The least squares estimator over the model $\Theta(\mathcal{J})$ is
$$\hat\theta_n = \arg\min_{\theta \in \Theta(\mathcal{J})} R_n(\theta).$$
It is clear that this estimator has coefficients
$$\hat\theta_{j,n}(\mathcal{J}) = \begin{cases} Y_j & \text{if } j \in \mathcal{J} \\ 0 & \text{if } j \notin \mathcal{J} \end{cases} .$$
We will now show that the (average) estimation error of the estimator is in this case its variance and its approximation error equals its squared bias. The excess risk is therefore the mean square error. Define the best approximation within the class
$$\theta_*(\mathcal{J}) = \arg\min_{\theta \in \Theta(\mathcal{J})} R(\theta).$$
This vector has coefficients
$$\theta_{j,*}(\mathcal{J}) = \begin{cases} \theta_{j,0} & \text{if } j \in \mathcal{J} \\ 0 & \text{if } j \notin \mathcal{J} \end{cases} .$$
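The estimation error is
$$E R(\hat\theta_n(\mathcal{J})) - R(\theta_*(\mathcal{J})) = E \|\hat\theta_n(\mathcal{J}) - \theta_*(\mathcal{J})\|_n^2 = \sum_{j \in \mathcal{J}} E(Y_j - \theta_{j,0})^2 = \frac{\sigma^2 |\mathcal{J}|}{n} .$$
Indeed, this is the variance $E \|\hat\theta_n(\mathcal{J}) - E\hat\theta_n(\mathcal{J})\|_n^2$. The approximation error is
$$R(\theta_*(\mathcal{J})) - R(\theta_0) = \|\theta_*(\mathcal{J}) - \theta_0\|_n^2 = \sum_{j \notin \mathcal{J}} \theta_{j,0}^2 .$$
Here, we indeed recognize the squared bias $\|E\hat\theta_n(\mathcal{J}) - \theta_0\|_n^2$. The average excess risk is thus the mean square error
$$E R(\hat\theta_n(\mathcal{J})) - R(\theta_0) = \frac{\sigma^2 |\mathcal{J}|}{n} + \sum_{j \notin \mathcal{J}} \theta_{j,0}^2 .$$

Exercise 3.2. Let us compare this result with the one of Lemma 2.1. With the notation used there, one has model class $\mathcal{F} = \Theta(\mathcal{J})$, and empirical process
$$\nu_n(\theta) = -2\sqrt{n} \sum_{j=1}^n \epsilon_j \theta_j .$$
The metric we take is $d(\theta, \tilde\theta) = \|\theta - \tilde\theta\|_n$. Check that the margin condition holds with $\kappa = 2$ and $c_2 = 1$. The empirical process condition is met for $r = 2$, and for $\beta = 1$ and $c_1^2 = 4\sigma^2 |\mathcal{J}|$. The estimation error $V_n$ defined in (2.5) becomes with these values
$$V_n = \frac{8\sigma^2 |\mathcal{J}|}{\delta n} .$$
Now, verify the technical condition. The result (2.4) of Lemma 2.1 becomes
$$E R(\hat\theta_n(\mathcal{J})) - R(\theta_0) \le \frac{1+\delta}{1-\delta} \left\{ \frac{8\sigma^2 |\mathcal{J}|}{\delta n} + \sum_{j \notin \mathcal{J}} \theta_{j,0}^2 \right\} .$$
Thus, up to constants, Lemma 2.1 produces the same answer as direct calculations.

3.4. The model an oracle would select

An oracle chooses $\mathcal{J}$ as the set $\mathcal{J}_*$ which minimizes the mean square error. I.e.,
$$\mathcal{J}_* = \arg\min_{\mathcal{J} \subset \{1, \ldots, n\}} \left\{ \frac{\sigma^2 |\mathcal{J}|}{n} + \sum_{j \notin \mathcal{J}} \theta_{j,0}^2 \right\} .$$

Exercise 3.3. Show that the index set an oracle would select is
$$\mathcal{J}_* = \{ j : |\theta_{j,0}| > \sqrt{\sigma^2/n} \} .$$
(You can use similar arguments as in the proof of Lemma 3.5.)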
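The oracle choice can be checked by brute force on a small example. The following sketch computes the mean square error of the model $\Theta(\mathcal{J})$ as variance plus squared bias, and compares the index set of Exercise 3.3 with an exhaustive search over all subsets. The function names and the numerical values are illustrative assumptions; indices are zero-based in the code.

```python
import itertools

import numpy as np


def model_mse(theta0, J, sigma2, n):
    """Mean square error of the least squares estimator over Theta(J):
    variance sigma2*|J|/n plus squared bias sum_{j not in J} theta0_j^2."""
    theta0 = np.asarray(theta0, dtype=float)
    keep = np.array([j in J for j in range(theta0.size)])
    return sigma2 * keep.sum() / n + float(np.sum(theta0[~keep] ** 2))


def oracle_set(theta0, sigma2, n):
    """Index set of Exercise 3.3: keep j exactly when theta0_j^2 > sigma2/n."""
    theta0 = np.asarray(theta0, dtype=float)
    return {j for j in range(theta0.size) if theta0[j] ** 2 > sigma2 / n}


def brute_force_best(theta0, sigma2, n):
    """Smallest mean square error over all 2^n index sets (only feasible for tiny n)."""
    indices = range(len(theta0))
    return min(
        model_mse(theta0, set(J), sigma2, n)
        for k in range(len(theta0) + 1)
        for J in itertools.combinations(indices, k)
    )


# Hypothetical example with n = 6 coefficients and sigma2/n = 0.04.
theta0 = [0.9, 0.05, -0.4, 0.0, 0.3, -0.1]
sigma2, n = 0.24, 6
J_star = oracle_set(theta0, sigma2, n)
```

The exhaustive search is exponential in $n$ and is included only to confirm, on a toy example, that coordinate-wise comparison of $\theta_{j,0}^2$ with $\sigma^2/n$ attains the minimum.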
Exercise 3.4. Suppose that $\theta_0$ is sparse. Show that for the model $\Theta(\mathcal{J}_*)$ chosen by the oracle,
$$E \|\hat\theta_n(\mathcal{J}_*) - \theta_0\|_n^2 \le 2 \left( \frac{\sigma^2}{n} \right)^{\frac{2-r}{2}} .$$

3.5. Hard- and soft-thresholding

The oracle estimates only the coefficients $\theta_{j,0}$ which are in absolute value bigger than $\sqrt{\sigma^2/n}$. The idea is now to replace the unknown coefficients $\theta_{j,0}$ by the observations $Y_j$, $j = 1, \ldots, n$. First of all, it should then be noted that the noise level $\sigma^2$ is generally unknown. However, this problem is minor, as one may construct estimators of $\sigma^2$. Here, we assume for simplicity that $\sigma^2$ is known. A more severe problem is that we now estimate the $\theta_{j,0}$ in order to construct an estimator of the $\theta_{j,0}$! It actually turns out that this procedure works if we make the threshold a bit larger than $\sqrt{\sigma^2/n}$, namely, it should be chosen $\asymp \sqrt{2(\log n)\sigma^2/n}$. Then the oracle is "almost" mimicked (see Lemma 3.3).

We now first present the definitions of the hard-thresholding and soft-thresholding estimator. Lemmas 3.3 and 3.4 give the oracle inequality for these estimators, and Lemmas 3.5 and 3.6 put them in the framework of penalized M-estimation. We then have a clearer picture of the type of oracle inequalities we may expect for more general M-estimators.

Let $\lambda_n \ge 0$ be some threshold. This threshold will be the regularization parameter in the present context.

Definition of the hard-thresholding estimator.
$$\hat\theta_{j,n}(\text{hard}) = \begin{cases} Y_j & \text{if } |Y_j| > \lambda_n \\ 0 & \text{if } |Y_j| \le \lambda_n \end{cases} , \quad j = 1, \ldots, n.$$

Definition of the soft-thresholding estimator.
$$\hat\theta_{j,n}(\text{soft}) = \begin{cases} Y_j - \lambda_n & \text{if } Y_j > \lambda_n \\ Y_j + \lambda_n & \text{if } Y_j < -\lambda_n \\ 0 & \text{if } |Y_j| \le \lambda_n \end{cases} , \quad j = 1, \ldots, n.$$

It is shown in Lemmas 3.3 and 3.4 that the estimators $\hat\theta_n(\text{hard})$ and $\hat\theta_n(\text{soft})$ have similar oracle properties, i.e., they have up to $(\log n)$-terms the same mean square error as when using the model $\Theta(\mathcal{J}_*)$ given by the oracle. Lemmas 3.3 and 3.4 are proved in Donoho and Johnstone (1994b), by direct calculations. We will reconsider the oracle properties of the soft-thresholding estimator in Lemma 3.7, in a fashion that allows extension to other M-estimation contexts.

Lemma 3.3 for the hard-thresholding estimator is stated in asymptotic sense. We use the following notation and terminology. Let $\{z_n\}$ and $\{\delta_n\}$ be sequences of positive numbers. We say that $z_n = o(\delta_n)$ ($z_n$ is of smaller order than $\delta_n$) if $z_n/\delta_n \to 0$ as $n \to \infty$. Moreover, $z_n = O(\delta_n)$ ($z_n$ is of order $\delta_n$) means that $\limsup_{n \to \infty} z_n/\delta_n < \infty$, and $z_n \asymp \delta_n$ ($z_n$ is asymptotically equal to $\delta_n$) means $z_n/\delta_n \to 1$. If $\{\hat\vartheta_n\}$ is a sequence of estimators of some parameter $\vartheta_0$ in some metric space with metric $d$, we write $d(\hat\vartheta_n, \vartheta_0) = O_P(\delta_n)$ ($\hat\vartheta_n$ converges (to $\vartheta_0$) with rate $\delta_n$) if $d(\hat\vartheta_n, \vartheta_0)/\delta_n$ remains bounded in probability. Note that sufficient for the latter is $E d(\hat\vartheta_n, \vartheta_0) = O(\delta_n)$.

Lemma 3.3. Take for some $0 < \alpha < 1$,
$$(1 - \alpha)(\log\log n)\sigma^2/n \le \lambda_n^2 - 2(\log n)\sigma^2/n \le o((\log n)\sigma^2/n).$$
Then
$$E \|\hat\theta_n(\text{hard}) - \theta_0\|_n^2 \le L_n \left\{ \frac{\sigma^2(|\mathcal{J}_*| + 1)}{n} + \sum_{j \notin \mathcal{J}_*} \theta_{j,0}^2 \right\} ,$$
where $L_n \asymp 2\log n$.

Lemma 3.4. Take $\lambda_n^2 = 2(\log n)\sigma^2/n$. Then
$$E \|\hat\theta_n(\text{soft}) - \theta_0\|_n^2 \le (2\log n + 1) \left\{ \frac{\sigma^2(|\mathcal{J}_*| + 1)}{n} + \sum_{j \notin \mathcal{J}_*} \theta_{j,0}^2 \right\} .$$

Exercise 3.5. Suppose that $\theta_0$ is sparse. Show that when the threshold is chosen as in Lemma 3.3 (Lemma 3.4), the squared rate of convergence for the hard-thresholding (soft-thresholding) estimator is
$$\left( \frac{\sigma^2 \log n}{n} \right)^{\frac{2-r}{2}} .$$
Compare with Exercise 3.4.

Donoho and Johnstone (1994b) also prove that the soft-thresholding estimator can be improved by choosing the threshold $\lambda_n$ more carefully. We will not cite that result here, because our focus is not so much on the constants. In fact, in Lemma 3.7, we will treat the soft-thresholding estimator again, using an indirect method. This gives worse constants, but the approach has the advantage that the method is applicable to other situations as well, and in particular does not rely on the assumption of normally distributed errors.

The hard- and soft-thresholding estimators are penalized M-estimators, as is shown in Lemmas 3.5 and 3.6. This point of view allows one to define hard- and soft-thresholding type estimators for other loss functions as well. The hard-thresholding estimator comes from a penalty on the number of non-zero coefficients, $\#\{\theta_j \ne 0\} = \sum_{j=1}^n |\theta_j|^0$, which we refer to as the $\ell_0$-penalty. The soft-thresholding case corresponds to a penalty on the $\ell_1$-norm $\sum_{j=1}^n |\theta_j|$, and we refer to it as the $\ell_1$-penalty.

Lemma 3.5. The hard-thresholding estimator $\hat\theta_n(\text{hard})$ minimizes
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + \lambda_n^2 \#\{\theta_j \ne 0\} .$$
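Proof. Let $\theta_{\cdot,n}$ be the minimizer of
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + \lambda_n^2 \#\{\theta_j \ne 0\} .$$
We can carry out the minimization coefficient-by-coefficient. So (for $j = 1, \ldots, n$), $\theta_{j,n}$ minimizes
$$S_j^2(\theta_j) = (Y_j - \theta_j)^2 + \lambda_n^2 \, 1\{\theta_j \ne 0\} .$$
Clearly, if $\theta_{j,n} \ne 0$, we must take it equal to $Y_j$, and then $S_j^2(\theta_{j,n}) = \lambda_n^2$. If $\theta_{j,n} = 0$, we have $S_j^2(\theta_{j,n}) = Y_j^2$. So
$$S_j^2(\theta_{j,n}) = \begin{cases} \lambda_n^2 & \text{if } \theta_{j,n} = Y_j \\ Y_j^2 & \text{if } \theta_{j,n} = 0 \end{cases} .$$
Since $S_j^2(\theta_j)$ is minimized over $\theta_j$, we have $S_j^2(\theta_{j,n}) = \min(\lambda_n^2, Y_j^2)$. Hence, $\theta_{j,n} = \hat\theta_{j,n}(\text{hard})$.

Lemma 3.6. The soft-thresholding estimator $\hat\theta_n(\text{soft})$ minimizes
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + 2\lambda_n \sum_{j=1}^n |\theta_j| .$$

Proof. Let $\theta_{\cdot,n}$ be the minimizer of
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + 2\lambda_n \sum_{j=1}^n |\theta_j| .$$
The minimization can be done coefficient-by-coefficient, as in the previous lemma. So (for $j = 1, \ldots, n$), $\theta_{j,n}$ minimizes
$$S_j^2(\theta_j) = (Y_j - \theta_j)^2 + 2\lambda_n |\theta_j| .$$
If $\theta_{j,n} > 0$, the function $S_j^2(\theta_j)$ is differentiable near $\theta_{j,n}$, with derivative $-2(Y_j - \lambda_n) + 2\theta_j$. Setting this derivative to zero shows that $\theta_{j,n} = Y_j - \lambda_n$. So we conclude that $\theta_{j,n} > 0$ can only happen when $Y_j - \lambda_n > 0$, and in that case $\theta_{j,n} = Y_j - \lambda_n$. Similarly when $\theta_{j,n} < 0$. When $\theta_{j,n} = 0$, we find $S_j^2(\theta_{j,n}) = Y_j^2$. Hence
$$S_j^2(\theta_{j,n}) = \begin{cases} +2\lambda_n Y_j - \lambda_n^2 & \text{if } \theta_{j,n} = Y_j - \lambda_n > 0 \\ -2\lambda_n Y_j - \lambda_n^2 & \text{if } \theta_{j,n} = Y_j + \lambda_n < 0 \\ Y_j^2 & \text{if } \theta_{j,n} = 0 \end{cases} .$$
Obviously, we always have $\pm 2\lambda_n Y_j - \lambda_n^2 \le Y_j^2$, so $\theta_{j,n} \ne 0$ whenever $|Y_j| > \lambda_n$. In other words, $\theta_{j,n} = \hat\theta_{j,n}(\text{soft})$.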
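The two thresholding rules and the penalized criteria of Lemmas 3.5 and 3.6 are simple to write down in code. The sketch below uses NumPy; the function names are illustrative choices, not notation from the text. It can be used to compare the value of each criterion at the thresholded vector with its value at arbitrary competitors.

```python
import numpy as np


def hard_threshold(y, lam):
    """Keep Y_j when |Y_j| > lam, otherwise set the coefficient to zero."""
    y = np.asarray(y, dtype=float)
    return np.where(np.abs(y) > lam, y, 0.0)


def soft_threshold(y, lam):
    """Shrink every Y_j towards zero by lam; coefficients with |Y_j| <= lam become zero."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)


def l0_objective(theta, y, lam):
    """Criterion of Lemma 3.5: sum_j (Y_j - theta_j)^2 + lam^2 * #{theta_j != 0}."""
    return float(np.sum((y - theta) ** 2) + lam ** 2 * np.count_nonzero(theta))


def l1_objective(theta, y, lam):
    """Criterion of Lemma 3.6: sum_j (Y_j - theta_j)^2 + 2 * lam * sum_j |theta_j|."""
    return float(np.sum((y - theta) ** 2) + 2.0 * lam * np.sum(np.abs(theta)))


# Small hypothetical data vector and threshold.
y = np.array([2.0, -0.5, 1.0, -3.0])
lam = 1.0
theta_hard = hard_threshold(y, lam)
theta_soft = soft_threshold(y, lam)
```

Both criteria are sums over coordinates, which is why the minimization in the proofs can be carried out coefficient-by-coefficient.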
Now, we will reprove Lemma 3.4, albeit with less economic constants. The idea is writing down a basic inequality, similar to (2.1) but now for the penalized case. The basic inequality with penalty takes the form
$$\|\hat\theta_n - \theta_0\|_n^2 \le -[\nu_n(\hat\theta_n) - \nu_n(\theta_*)]/\sqrt{n} + \|\theta_* - \theta_0\|_n^2 + \text{pen}(\theta_*) - \text{pen}(\hat\theta_n).$$
Here, $\theta_*$ is the best penalized approximation of $\theta_0$. It means that $\theta_*$ is defined as in Lemma 3.2 with an appropriate choice of the threshold $\lambda$. The empirical process takes the simple form
$$\nu_n(\theta) = -2\sqrt{n} \sum_{j=1}^n \epsilon_j \theta_j .$$
We will bound the increments by
$$|\nu_n(\theta) - \nu_n(\theta_*)| \le 2\sqrt{n} \sum_{j=1}^n |\theta_j - \theta_{j,*}| \max_{j=1,\ldots,n} |\epsilon_j| .$$
Finally, we only consider the soft-thresholding estimator, that is, the penalty considered is the $\ell_1$-penalty
$$\text{pen}(\theta) = 2\lambda_n \sum_{j=1}^n |\theta_j| .$$
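For the regularization parameter $\lambda_n$, a value proportional to $\sqrt{\sigma^2 \log n/n}$ can be taken (see Lemma 3.8). Lemma 3.7 then states that the estimator with $\ell_1$-penalty satisfies an oracle inequality, where the oracle concerns the $\ell_0$-penalty.

Lemma 3.7. Let $\hat\theta_n = \hat\theta_n(\text{soft})$. Let $0 < \delta < 1$ be arbitrary. Set $V_n(\theta) = 16\lambda_n^2 \#\{\theta_j \ne 0\}/\delta$. On the set $\Xi_n = \{\max_{1 \le j \le n} |\epsilon_j| \le \lambda_n\}$, we have
$$\|\hat\theta_n - \theta_0\|_n^2 \le \left( \frac{2+\delta}{2-\delta} \right) \min_\theta \left\{ V_n(\theta) + \|\theta - \theta_0\|_n^2 \right\} .$$

Proof. At first, let $\theta_*$ be arbitrary. Write
$$\text{pen}(\theta) = 2\lambda_n \sum_{j=1}^n |\theta_j| = \text{pen}_1(\theta) + \text{pen}_2(\theta),$$
with
$$\text{pen}_1(\theta) = 2\lambda_n \sum_{\theta_{j,*} \ne 0} |\theta_j| , \quad \text{pen}_2(\theta) = 2\lambda_n \sum_{\theta_{j,*} = 0} |\theta_j| .$$
We use the short hand notation $N(\theta) = \#\{\theta_j \ne 0\}$. Then, on $\Xi_n$,
$$\begin{aligned}
\|\hat\theta_n - \theta_0\|_n^2 &\le -[\nu_n(\hat\theta_n) - \nu_n(\theta_*)]/\sqrt{n} + \text{pen}(\theta_*) - \text{pen}(\hat\theta_n) + \|\theta_* - \theta_0\|_n^2 \\
&= 2 \sum_{j=1}^n \epsilon_j (\hat\theta_{j,n} - \theta_{j,*}) + \text{pen}(\theta_*) - \text{pen}(\hat\theta_n) + \|\theta_* - \theta_0\|_n^2 \\
&\le 2\lambda_n \sum_{j=1}^n |\hat\theta_{j,n} - \theta_{j,*}| + \text{pen}(\theta_*) - \text{pen}(\hat\theta_n) + \|\theta_* - \theta_0\|_n^2 \\
&= \text{pen}_1(\hat\theta_n - \theta_*) + \text{pen}_2(\hat\theta_n) + \text{pen}_1(\theta_*) - \text{pen}_1(\hat\theta_n) - \text{pen}_2(\hat\theta_n) + \|\theta_* - \theta_0\|_n^2 \\
&\le 2\,\text{pen}_1(\hat\theta_n - \theta_*) + \|\theta_* - \theta_0\|_n^2 \\
&\le 4\lambda_n \sqrt{N(\theta_*)} \, \|\hat\theta_n - \theta_*\|_n + \|\theta_* - \theta_0\|_n^2 \\
&\le 4\lambda_n \sqrt{N(\theta_*)} \, \|\hat\theta_n - \theta_0\|_n + 4\lambda_n \sqrt{N(\theta_*)} \, \|\theta_* - \theta_0\|_n + \|\theta_* - \theta_0\|_n^2 .
\end{aligned}$$
Now, we proceed as in Lemma 2.1. Since for $a$ and $b$ non-negative, $\sqrt{ab} \le (a + b)/2$ (compare with the technical lemma at the end of Chapter 2), we have
$$4\lambda_n \sqrt{N(\theta_*)} \, \|\hat\theta_n - \theta_0\|_n \le \frac{\delta}{2} \|\hat\theta_n - \theta_0\|_n^2 + \frac{8\lambda_n^2 N(\theta_*)}{\delta} .$$
Here, we may also replace $\|\hat\theta_n - \theta_0\|_n$ by $\|\theta_* - \theta_0\|_n$. So we find on $\Xi_n$,
$$\|\hat\theta_n - \theta_0\|_n^2 \le \frac{\delta}{2} \|\hat\theta_n - \theta_0\|_n^2 + \frac{\delta}{2} \|\theta_* - \theta_0\|_n^2 + \frac{16\lambda_n^2 N(\theta_*)}{\delta} + \|\theta_* - \theta_0\|_n^2 ,$$
or
$$\left( 1 - \frac{\delta}{2} \right) \|\hat\theta_n - \theta_0\|_n^2 \le \frac{16\lambda_n^2 N(\theta_*)}{\delta} + \left( 1 + \frac{\delta}{2} \right) \|\theta_* - \theta_0\|_n^2 \le \left( 1 + \frac{\delta}{2} \right) \left\{ \frac{16\lambda_n^2 N(\theta_*)}{\delta} + \|\theta_* - \theta_0\|_n^2 \right\} .$$
To conclude the proof, take
$$\theta_* = \arg\min_\theta \left\{ 16\lambda_n^2 N(\theta)/\delta + \|\theta - \theta_0\|_n^2 \right\} .$$

We thus see that for $\lambda_n \asymp \sqrt{\sigma^2 \log n/n}$, we arrive at the same oracle rates as in Lemmas 3.3 and 3.4, provided $P(\max_j |\epsilon_j| > \lambda_n) \to 0$ for such a choice of $\lambda_n$. Indeed, this is shown to hold in Lemma 3.8.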
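The inequality of Lemma 3.7 is deterministic once one is on the set $\Xi_n$, so it can be checked on simulated data. In the sketch below the noise is drawn uniformly from $[-\lambda_n, \lambda_n]$ purely to force the event $\Xi_n$; this is an assumption of the illustration, not the Gaussian model of the text. The minimum over $\theta$ on the right-hand side is computed coordinate-wise: for each $j$ one either sets $\theta_j = 0$ (cost $\theta_{j,0}^2$) or $\theta_j = \theta_{j,0}$ (cost $16\lambda_n^2/\delta$). All names are hypothetical.

```python
import numpy as np


def soft_threshold(y, lam):
    """Soft-thresholding estimator with threshold lam."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)


def oracle_bound(theta0, lam, delta):
    """Right-hand side of Lemma 3.7.

    The minimum over theta of 16*lam^2*N(theta)/delta + ||theta - theta0||^2
    separates over coordinates into min(16*lam^2/delta, theta0_j^2).
    """
    theta0 = np.asarray(theta0, dtype=float)
    per_coordinate = np.minimum(16.0 * lam ** 2 / delta, theta0 ** 2)
    return (2.0 + delta) / (2.0 - delta) * float(np.sum(per_coordinate))


def squared_loss(theta_hat, theta0):
    """Squared Euclidean distance ||theta_hat - theta0||_n^2."""
    return float(np.sum((np.asarray(theta_hat) - np.asarray(theta0)) ** 2))


# Hypothetical example: n = 200 coefficients, five of them non-zero.
rng = np.random.default_rng(1)
n, sigma2 = 200, 1.0
lam = 2.5 * np.sqrt(sigma2 * np.log(n) / n)
theta0 = np.zeros(n)
theta0[:5] = [1.0, -0.7, 0.5, 0.3, -0.05]
eps = rng.uniform(-lam, lam, size=n)  # bounded noise, so that max |eps_j| <= lam holds
theta_hat = soft_threshold(theta0 + eps, lam)
loss = squared_loss(theta_hat, theta0)
```

On this event one expects `loss <= oracle_bound(theta0, lam, delta)` for every `delta` in $(0, 1)$.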
3.6. A probability inequality for the empirical process

Recall that we proved Lemma 3.7 on the set $\{\max_{1 \le j \le n} |\epsilon_j| \le \lambda_n\}$. In order to finish our oracle inequality, we need to show that for appropriate thresholds $\lambda_n$, this set has large probability.

Lemma 3.8. Let $Z$ be $N(0, 1)$-distributed. Then for all $a > 0$,
$$P(Z > a) \le \exp\left[ -\tfrac{1}{2} a^2 \right] .$$
Moreover, if $Z_1, \ldots, Z_N$ are $N(0, 1)$-distributed (and not necessarily independent), then for all $a \ge 2\sqrt{\log N}$
$$P\left( \max_{1 \le j \le N} Z_j > a \right) \le \exp\left[ -\tfrac{1}{4} a^2 \right] .$$

Proof. We have for all $a > 0$, and $\gamma > 0$, by Chebyshev's inequality
$$P(Z > a) \le \frac{E \exp[\gamma Z]}{\exp[\gamma a]} = \exp\left[ \tfrac{1}{2}\gamma^2 - \gamma a \right] .$$
Take $\gamma = a$ to arrive at
$$P(Z > a) \le \exp\left[ -\tfrac{1}{2} a^2 \right] .$$
To prove the second assertion of the lemma, we note that, by the union bound,
$$P\left( \max_{1 \le j \le N} Z_j > a \right) \le N \exp\left[ -\tfrac{1}{2} a^2 \right] .$$
Take $a \ge 2\sqrt{\log N}$ to get
$$P\left( \max_{1 \le j \le N} Z_j > a \right) \le \exp\left[ -\tfrac{1}{4} a^2 \right] .$$

It clearly follows from Lemma 3.8, that if $\epsilon_1, \ldots, \epsilon_n$ are independent $N(0, \sigma^2/n)$, we have for all $t \ge 2$,
$$P\left( \max_{1 \le j \le n} |\epsilon_j| > t\sqrt{\sigma^2 \log n/n} \right) \le 2\exp\left[ -\frac{t^2 \log n}{4} \right] .$$

Exercise 3.6. Let $Z_1, Z_2, \ldots$ be independent $N(0, 1)$-distributed, and $\epsilon_j = \sigma Z_j/\sqrt{n}$. Take $\lambda_n = t\sqrt{\sigma^2 \log n/n}$, with $t > 2$. Verify that $\max_{1 \le j \le n} |\epsilon_j| \le \lambda_n$ almost surely for all $n$ sufficiently large.
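The bounds of Lemma 3.8 can be compared with the exact Gaussian tail, which is available through the complementary error function in Python's standard library. The sketch below is illustrative only; the function names are assumptions, and the union bound it evaluates is itself an upper bound on the probability of interest, not that probability.

```python
import math


def gaussian_upper_tail(a):
    """Exact P(Z > a) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))


def union_bound_max(a, N):
    """Union bound N * P(Z > a) on P(max_j Z_j > a); independence is not needed."""
    return N * gaussian_upper_tail(a)


def threshold(t, sigma2, n):
    """lambda_n = t * sqrt(sigma2 * log(n) / n), the form of threshold used after Lemma 3.8."""
    return t * math.sqrt(sigma2 * math.log(n) / n)


# Compare the exact tail with exp(-a^2/2), and the union bound with exp(-a^2/4)
# at the smallest admissible a = 2*sqrt(log N).
first_assertion = [(gaussian_upper_tail(a), math.exp(-0.5 * a * a)) for a in (0.1, 1.0, 3.0)]
second_assertion = [
    (union_bound_max(2.0 * math.sqrt(math.log(N)), N), math.exp(-math.log(N)))
    for N in (2, 10, 1000)
]
```

For the noise variables $\epsilon_j = \sigma Z_j/\sqrt{n}$, the event $\{|\epsilon_j| > \lambda_n\}$ with `lambda_n = threshold(t, sigma2, n)` is the event $\{|Z_j| > t\sqrt{\log n}\}$, which is how the two-sided bound stated after the lemma arises.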
3.7. Bibliographical remarks

In Donoho and Johnstone (1994a), one can find minimax theory for $\ell_p$-spaces ($0 < p < \infty$). In Donoho and Johnstone (1994b), the relation with wavelet shrinkage is also there: it explicitly takes the step from the original regression problem via wavelets to the sequence space formulation. Roughly speaking, when using an appropriate wavelet basis to expand a function $f_0$ with coefficients $\theta_0$, the roughness parameter $r$ corresponds to the effective smoothness $s$ of $f_0$ via the relation $r = 2/(2s + 1)$. For more details we refer to Donoho and Johnstone (1996) and wavelet theory (e.g., Edmunds and Triebel (1992), Härdle, Kerkyacharian, Picard and Tsybakov (1998)). The oracle terminology is from Donoho and Johnstone (1994b). Soft-thresholding is also studied in Donoho (1995).

The $\ell_1$-penalty corresponding to soft-thresholding is called the LASSO in Tibshirani (1996). It is used for instance for regularization of least squares estimators in linear models with many co-variables, possibly with linear dependencies between the variables. We refer to the book of Hastie, Tibshirani and Friedman (2001). Indeed, linear dependent systems can often provide better sparse approximations, for instance wedgelets (see Donoho (1999)) or curvelets (see Candès and Donoho (2004)), which are sparse systems for representing edges or other singularities.

The similarity between the behavior of hard- and soft-thresholding type penalties is investigated in Donoho (2004a,b). There, it is shown that the $\ell_1$-penalized solution of an overdetermined least squares problem is often also (an approximation of) the $\ell_0$-penalized problem.
4. Overruling the variance

We revisit the regression problem of the previous chapter. One has observations $\{(x_i, Y_i)\}_{i=1}^n$, with $x_1, \ldots, x_n$ fixed co-variables, and $Y_1, \ldots, Y_n$ response variables, satisfying the regression
$$Y_i = f_0(x_i) + \epsilon_i , \quad i = 1, \ldots, n,$$
where $\epsilon_1, \ldots, \epsilon_n$ are independent and centered noise variables, and $f_0$ is an unknown function on $\mathcal{X}$. The errors are assumed to be $N(0, \sigma^2)$-distributed.

Let $\bar{\mathcal{F}}$ be a collection of regression functions. The penalized least squares estimator is
$$\hat f_n = \arg\min_{f \in \bar{\mathcal{F}}} \left\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \text{pen}(f) \right\} .$$
Here $\text{pen}(f)$ is a penalty on the complexity of the function $f$. Let $Q_n$ be the empirical distribution of $x_1, \ldots, x_n$ and $\|\cdot\|_n$ be the $L_2(Q_n)$-norm. Define
$$f_* = \arg\min_{f \in \bar{\mathcal{F}}} \left\{ \|f - f_0\|_n^2 + \text{pen}(f) \right\} .$$
Our aim is to show that
$$E \|\hat f_n - f_0\|_n^2 \le \text{const.} \left\{ \|f_* - f_0\|_n^2 + \text{pen}(f_*) \right\} . \tag{4.1}$$
When this aim is indeed reached, we loosely say that $\hat f_n$ satisfies an oracle inequality. In fact, what (4.1) says is that $\hat f_n$ behaves as the noiseless version $f_*$. That means so to speak that we "overruled" the variance of the noise.

In Section 4.1, we put our objectives of this chapter in the framework of Chapter 2. In particular, we recall there the definitions of estimation and approximation error. Section 4.2 calculates the estimation error when one employs least squares estimation, without penalty, over a finite model class. The estimation error turns out to behave as the log-cardinality of the model class. Section 4.3 shows that when considering a collection of nested finite models, a penalty $\text{pen}(f)$ proportional to the log-cardinality of the smallest class containing $f$ will indeed mimic the oracle over this collection of models. In Section 4.4, we consider general penalties. It turns out that the (local) entropy of the model classes plays a crucial role. Recall that in Chapter 3, the hard-thresholding estimator corresponds to a penalty proportional to the dimension (number of non-zero coefficients) of the model class. Indeed, the local entropy of a finite-dimensional space is proportional to its dimension. For a finite class, the entropy is (bounded by) its log-cardinality.

Whether or not (4.1) holds true depends on the choice of the penalty. In Section 4.4, we show that when the penalty is taken "too small" there will appear an additional term showing that not all variance was "killed". Section 4.5 presents an example.

Throughout this chapter, we assume the noise level $\sigma > 0$ to be known. In that case, by a rescaling argument, one can assume without loss of generality that $\sigma = 1$. In general, one needs a good estimate of an upper bound for $\sigma$, because the penalties considered in this chapter depend on the noise level. When one replaces the unknown noise level $\sigma$ by an estimated upper bound, the penalty in fact becomes data dependent.

4.1. Estimation and approximation error

Let $\mathcal{F}$ be a model class. Consider the least squares estimator without penalty
$$\hat f_n(\cdot, \mathcal{F}) = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 .$$
The excess risk $\|\hat f_n(\cdot, \mathcal{F}) - f_0\|_n^2$ of this estimator is the sum of estimation error and approximation error.

Now, if we have a collection of models $\{\mathcal{F}\}$, a penalty is usually some measure of the complexity of the model class $\mathcal{F}$. With some abuse of notation, write this penalty as $\text{pen}(\mathcal{F})$. The corresponding penalty on the functions $f$ is then
$$\text{pen}(f) = \min_{\mathcal{F} :\, f \in \mathcal{F}} \text{pen}(\mathcal{F}).$$
We may then write
$$\hat f_n = \arg\min_{\hat f_n(\cdot, \mathcal{F}) :\, \mathcal{F} \in \{\mathcal{F}\}} \left\{ \frac{1}{n} \sum_{i=1}^n |Y_i - \hat f_n(x_i, \mathcal{F})|^2 + \text{pen}(\mathcal{F}) \right\} ,$$
where $\hat f_n(\cdot, \mathcal{F})$ is the least squares estimator over $\mathcal{F}$. Similarly,
$$f_* = \arg\min_{f_*(\cdot, \mathcal{F}) :\, \mathcal{F} \in \{\mathcal{F}\}} \left\{ \|f_*(\cdot, \mathcal{F}) - f_0\|_n^2 + \text{pen}(\mathcal{F}) \right\} ,$$
where $f_*(\cdot, \mathcal{F})$ is the best approximation of $f_0$ in the model $\mathcal{F}$. As we will see, taking $\text{pen}(\mathcal{F})$ proportional to (an estimate of) a bound for the estimation error of $\hat f_n(\cdot, \mathcal{F})$ will (up to constants and possibly $(\log n)$-factors) balance estimation error and approximation error.

Exercise 4.1. Let $\{\psi_j\}_{j=1}^n$ be an orthonormal basis in $L_2(Q_n)$. Define $f_\theta = \sum_{j=1}^n \theta_j \psi_j$, where $\theta = (\theta_1, \ldots, \theta_n)$ is a vector in $\mathbb{R}^n$. Check that
$$\|f_\theta\|_n^2 = \sum_{j=1}^n |\theta_j|^2 := \|\theta\|_n^2 .$$
Let $\bar{\mathcal{F}} = \{f_\theta : \theta \in \mathbb{R}^n\}$, and $\mathcal{F} \subset \bar{\mathcal{F}}$ be a linear subspace, so that, as in Exercise 2.4,
$$\|\hat f_n(\cdot, \mathcal{F}) - f_0\|_n^2 = \|\hat f_n(\cdot, \mathcal{F}) - f_*(\cdot, \mathcal{F})\|_n^2 + \|f_*(\cdot, \mathcal{F}) - f_0\|_n^2 .$$
Write $\hat f_n(\cdot, \mathcal{F}) = f_{\hat\theta_n(\mathcal{F})}$, and likewise $f_*(\cdot, \mathcal{F}) = f_{\theta_*(\mathcal{F})}$. Then by Pythagoras' rule, the excess risk at $\hat f_n(\cdot, \mathcal{F})$ can be written as
$$\|\hat\theta_n(\mathcal{F}) - \theta_0\|_n^2 = \|\hat\theta_n(\mathcal{F}) - \theta_*(\mathcal{F})\|_n^2 + \|\theta_*(\mathcal{F}) - \theta_0\|_n^2 .$$
Consider now the hard-thresholding estimator $f_{\hat\theta_n}$ with $\hat\theta_n = \hat\theta_n(\text{hard})$ defined in Section 3.5. Recall that in that case, $\text{pen}(f_\theta) = \lambda_n^2 \times \#\{\theta_j \ne 0\}$, with $\lambda_n$ the threshold (see Lemma 3.5). Compare the oracle inequality (4.1) with the oracle inequality of Lemma 3.3.

Exercise 4.1 highlights again that the hard-thresholding estimator, which is based on an $\ell_0$-penalty overruling the variance, has oracle behavior. We remark here that the $\ell_1$-penalty is of a different nature, yet can still yield oracle behavior. Indeed, we will see in Chapter 5 that $\ell_1$-type penalties can lead to the right trade off between estimation error and approximation error. The $\ell_1$-penalty "kills" the variance in a rather implicit way. It is not based on (an estimate of) the estimation error. This is very useful in contexts other than penalized least squares, because generally, the estimation error can depend on the unknown underlying probability measure $P$ in a rather complicated way.

In this chapter, the empirical process takes the form
$$\nu_n(f) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \epsilon_i f(x_i),$$
with the function $f$ ranging over (some subclass of) $\bar{\mathcal{F}}$. Probability inequalities for the empirical process are derived using Lemma 3.8. The latter is for normally distributed random variables. It is exactly at this place where our assumption of normally distributed noise comes in. Relaxing the normality assumption is straightforward, provided a proper probability inequality, an inequality of sub-Gaussian
type, goes through. In fact, at the cost of additional, essentially technical, assumptions, an inequality of exponential type on the errors is sufficient as well (see van de Geer (2000)).

4.2. Finite models

Let $\mathcal F$ be a finite collection of functions, with cardinality $|\mathcal F| \ge 2$. Consider the least squares estimator over $\mathcal F$
$$\hat f_n = \arg\min_{f\in\mathcal F} \frac1n\sum_{i=1}^n |Y_i - f(x_i)|^2 .$$
In this section, $\mathcal F$ is fixed, and we do not explicitly express the dependency of $\hat f_n$ on $\mathcal F$. Define $f_*$ by
$$\|f_* - f_0\|_n = \min_{f\in\mathcal F}\|f - f_0\|_n .$$
The dependence of $f_*$ on $\mathcal F$ is also not expressed in the notation of this section. Alternatively stated, we take here
$$\mathrm{pen}(f) = \begin{cases} 0 & \forall f\in\mathcal F \\ \infty & \forall f\in\bar{\mathcal F}\setminus\mathcal F \end{cases} .$$
The result of Lemma 4.1 below implies that the estimation error is proportional to $\log|\mathcal F|/n$, i.e., it is logarithmic in the number of elements in the parameter space. We present the result in terms of a probability inequality. An inequality for, e.g., the average excess risk follows from this (see Exercise 4.2).

Lemma 4.1. We have for all $t > 0$ and $0 < \delta < 1$,
$$\mathbf P\left( \|\hat f_n - f_0\|_n^2 \ge \Big(\frac{1+\delta}{1-\delta}\Big)\Big\{ \|f_* - f_0\|_n^2 + \frac{4\log|\mathcal F|}{n\delta} + \frac{4t^2}{\delta} \Big\} \right) \le \exp[-nt^2].$$

Proof. We have the basic inequality
$$\|\hat f_n - f_0\|_n^2 \le \frac2n\sum_{i=1}^n \varepsilon_i\big(\hat f_n(x_i) - f_*(x_i)\big) + \|f_* - f_0\|_n^2 .$$
By Lemma 3.8, for all $t > 0$,
$$\mathbf P\left( \max_{f\in\mathcal F,\ \|f - f_*\|_n > 0} \frac{\frac1n\sum_{i=1}^n \varepsilon_i\big(f(x_i) - f_*(x_i)\big)}{\|f - f_*\|_n} > \sqrt{2\log|\mathcal F|/n + 2t^2} \right) \le |\mathcal F|\exp[-(\log|\mathcal F| + nt^2)] = \exp[-nt^2].$$
If $\frac1n\sum_{i=1}^n \varepsilon_i(\hat f_n(x_i) - f_*(x_i)) \le (2\log|\mathcal F|/n + 2t^2)^{1/2}\,\|\hat f_n - f_*\|_n$, we have, using $2\sqrt{ab} \le a + b$ for all non-negative $a$ and $b$,
$$\|\hat f_n - f_0\|_n^2 \le 2(2\log|\mathcal F|/n + 2t^2)^{1/2}\,\|\hat f_n - f_*\|_n + \|f_* - f_0\|_n^2 \le \delta\|\hat f_n - f_0\|_n^2 + \frac{4\log|\mathcal F|}{n\delta} + \frac{4t^2}{\delta} + (1+\delta)\|f_* - f_0\|_n^2 .$$
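The following minimal sketch illustrates the setting of Lemma 4.1 on simulated data. Everything specific in it is an assumption made for illustration: the model is a finite grid of constant functions, the regression function is the constant 0.33, the noise is standard normal, and the helper names are hypothetical. It computes the least squares estimator over the finite class and compares its excess risk with the right-hand side appearing inside the probability in Lemma 4.1.

```python
import math
import random

def least_squares_over_finite_class(ys, candidates):
    """Return the constant c in `candidates` minimizing (1/n) * sum (y_i - c)^2."""
    n = len(ys)
    return min(candidates, key=lambda c: sum((y - c) ** 2 for y in ys) / n)

def lemma_4_1_bound(approx_err, card, n, delta, t):
    """Threshold inside the probability of Lemma 4.1."""
    return (1 + delta) / (1 - delta) * (
        approx_err + 4 * math.log(card) / (n * delta) + 4 * t ** 2 / delta
    )

rng = random.Random(0)
n, f0 = 200, 0.33                          # assumed constant regression function
candidates = [k / 10 for k in range(11)]   # finite model: constants 0.0, 0.1, ..., 1.0
ys = [f0 + rng.gauss(0.0, 1.0) for _ in range(n)]  # assumed standard normal noise

f_hat = least_squares_over_finite_class(ys, candidates)
f_star = min(candidates, key=lambda c: (c - f0) ** 2)   # best approximation in the class
excess_risk = (f_hat - f0) ** 2            # ||f_hat - f0||_n^2 for constant functions
approx_err = (f_star - f0) ** 2            # ||f_* - f0||_n^2
bound = lemma_4_1_bound(approx_err, len(candidates), n, delta=0.5, t=0.2)
print(f"f_hat={f_hat}, excess risk={excess_risk:.4f}, bound={bound:.4f}")
```

The constants in the lemma are not tight, so in a small example like this one the bound is typically far above the observed excess risk; the point is only to show which quantities enter.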
Oracle Inequalities and Regularization
Exercise 4.2. Using Lemma 4.1, and the formula
$$\mathbf E Z = \int_0^\infty \mathbf P(Z \ge t)\,dt$$
for a non-negative random variable $Z$, derive bounds for the average excess risk $\mathbf E\|\hat f_n - f_0\|_n^2$.

4.3. Nested, finite models

Let $\mathcal F_1 \subset \mathcal F_2 \subset \cdots$ be a collection of nested, finite models, and let $\bar{\mathcal F} = \cup_{m=1}^\infty \mathcal F_m$. We assume $\log|\mathcal F_1| > 1$. As indicated in Section 4.1, it is a good strategy to take the penalty proportional to the estimation error. In the present context, this works as follows. Define
$$\mathcal F(f) = \mathcal F_{m(f)}, \quad m(f) = \arg\min\{m : f \in \mathcal F_m\},$$
and for some $0 < \delta < 1$,
$$\mathrm{pen}(f) = \frac{16\log|\mathcal F(f)|}{n\delta} .$$
In coding theory, a similar penalty is used. When encoding a message using an encoder from $\mathcal F_m$, one needs to send, in addition to the encoded message, $\log_2|\mathcal F_m|$ bits to tell the receiver which encoder was applied. Let
$$\hat f_n = \arg\min_{f\in\bar{\mathcal F}} \Big\{ \frac1n\sum_{i=1}^n |Y_i - f(x_i)|^2 + \mathrm{pen}(f) \Big\},$$
and
$$f_* = \arg\min_{f\in\bar{\mathcal F}} \big\{ \|f - f_0\|_n^2 + \mathrm{pen}(f) \big\}.$$
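Lemma 4.2. We have, for all $t > 0$ and $0 < \delta < 1$,
$$\mathbf P\left( \|\hat f_n - f_0\|_n^2 > \Big(\frac{1+\delta}{1-\delta}\Big)\Big\{ \|f_* - f_0\|_n^2 + \mathrm{pen}(f_*) + \frac{4t^2}{\delta} \Big\} \right) \le \exp[-nt^2].$$

Proof. Write down the basic inequality
$$\|\hat f_n - f_0\|_n^2 + \mathrm{pen}(\hat f_n) \le \frac2n\sum_{i=1}^n \varepsilon_i\big(\hat f_n(x_i) - f_*(x_i)\big) + \|f_* - f_0\|_n^2 + \mathrm{pen}(f_*).$$
Define $\bar{\mathcal F}_j = \{f \in \bar{\mathcal F} : 2^j < \log|\mathcal F(f)| \le 2^{j+1}\}$, $j = 0, 1, \ldots$. We have for all $t > 0$, using Lemma 3.8,
$$\mathbf P\left( \exists\, f\in\bar{\mathcal F} : \frac1n\sum_{i=1}^n \varepsilon_i\big(f(x_i) - f_*(x_i)\big) > (8\log|\mathcal F(f)|/n + 2t^2)^{1/2}\,\|f - f_*\|_n \right)$$
$$\le \sum_{j=0}^\infty \mathbf P\left( \exists\, f\in\bar{\mathcal F}_j : \frac1n\sum_{i=1}^n \varepsilon_i\big(f(x_i) - f_*(x_i)\big) > (2^{j+3}/n + 2t^2)^{1/2}\,\|f - f_*\|_n \right)$$
$$\le \sum_{j=0}^\infty \exp[2^{j+1} - (2^{j+2} + nt^2)] = \sum_{j=0}^\infty \exp[-(2^{j+1} + nt^2)]$$
$$\le \sum_{j=0}^\infty \exp[-(j + 1 + nt^2)] \le \int_0^\infty \exp[-(x + nt^2)]\,dx = \exp[-nt^2].$$
But if $\sum_{i=1}^n \varepsilon_i(\hat f_n(x_i) - f_*(x_i))/n \le (8\log|\mathcal F(\hat f_n)|/n + 2t^2)^{1/2}\,\|\hat f_n - f_*\|_n$, the basic inequality gives
$$\|\hat f_n - f_0\|_n^2 \le 2(8\log|\mathcal F(\hat f_n)|/n + 2t^2)^{1/2}\,\|\hat f_n - f_*\|_n + \|f_* - f_0\|_n^2 + \mathrm{pen}(f_*) - \mathrm{pen}(\hat f_n)$$
$$\le \delta\|\hat f_n - f_0\|_n^2 + \frac{16\log|\mathcal F(\hat f_n)|}{n\delta} - \mathrm{pen}(\hat f_n) + \frac{4t^2}{\delta} + (1+\delta)\|f_* - f_0\|_n^2 + \mathrm{pen}(f_*)$$
$$= \delta\|\hat f_n - f_0\|_n^2 + \frac{4t^2}{\delta} + (1+\delta)\|f_* - f_0\|_n^2 + \mathrm{pen}(f_*),$$
by the definition of $\mathrm{pen}(f)$.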
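As a rough illustration of the penalty used in Lemma 4.2, the sketch below takes nested dyadic grids of constant functions, F_m = {j / 2^m : j = 0, ..., 2^m}, so that |F_m| = 2^m + 1 and log|F_1| = log 3 > 1. The grids, the truncation of the union at level M, the constant regression function and the standard normal noise are all assumptions made for this example, and the function names are hypothetical.

```python
import math
import random

def model_index(k, M):
    """Smallest m >= 1 such that k / 2**M lies on the dyadic grid F_m = {j / 2**m}."""
    m = M
    while m > 1 and k % 2 == 0:
        k //= 2
        m -= 1
    return m

def penalty(m, n, delta):
    """pen(f) = 16 * log|F(f)| / (n * delta), with |F_m| = 2**m + 1."""
    return 16 * math.log(2 ** m + 1) / (n * delta)

rng = random.Random(1)
n, delta, M, f0 = 400, 0.5, 6, 0.3         # M truncates the infinite union (assumption)
ys = [f0 + rng.gauss(0.0, 1.0) for _ in range(n)]

def criterion(k):
    """Penalized empirical risk of the constant function k / 2**M."""
    c = k / 2 ** M
    rss = sum((y - c) ** 2 for y in ys) / n
    return rss + penalty(model_index(k, M), n, delta)

k_hat = min(range(2 ** M + 1), key=criterion)
f_hat = k_hat / 2 ** M
print(f"f_hat={f_hat}, selected model index m={model_index(k_hat, M)}")
```

Because the penalty grows with the index of the smallest model containing a candidate, a finer grid point is selected only when it reduces the empirical risk by more than the extra penalty.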
4.4. General penalties

In the general case with possibly infinite model classes $\mathcal F$, we may replace the log-cardinality of a class by its entropy.

Definition. Let $u > 0$ be arbitrary and let $N(u, \mathcal F, \|\cdot\|_n)$ be the minimum number of balls with radius $u$ necessary to cover $\mathcal F$. Then $\{H(u, \mathcal F, \|\cdot\|_n) = \log N(u, \mathcal F, \|\cdot\|_n) : u > 0\}$ is called the entropy of $\mathcal F$ (for the metric induced by the norm $\|\cdot\|_n$).

Recall the definition of the estimator
$$\hat f_n = \arg\min_{f\in\bar{\mathcal F}} \Big\{ \frac1n\sum_{i=1}^n |Y_i - f(x_i)|^2 + \mathrm{pen}(f) \Big\},$$
and of the noiseless version
$$f_* = \arg\min_{f\in\bar{\mathcal F}} \big\{ \|f - f_0\|_n^2 + \mathrm{pen}(f) \big\}.$$
We moreover define
$$\mathcal F(t) = \{f\in\bar{\mathcal F} : \|f - f_*\|_n^2 + \mathrm{pen}(f) \le t^2\}, \quad t > 0.$$
Consider the entropy $H(\cdot, \mathcal F(t), \|\cdot\|_n)$ of $\mathcal F(t)$. Suppose it is finite for each $t$, and in fact that the square root of the entropy is integrable, i.e., that for some continuous upper bound $\bar H(\cdot, \mathcal F(t), \|\cdot\|_n)$ of $H(\cdot, \mathcal F(t), \|\cdot\|_n)$ one has
$$\Psi(t) = \int_0^t \sqrt{\bar H(u, \mathcal F(t), \|\cdot\|_n)}\,du < \infty, \quad \forall t > 0. \tag{4.2}$$
This means that near $u = 0$, the entropy $H(u, \mathcal F(t), \|\cdot\|_n)$ is not allowed to grow faster than $1/u^2$. Assumption (4.2) is related to asymptotic continuity of the empirical process $\{\nu_n(f) : f\in\mathcal F(t)\}$. If (4.2) does not hold, one can still prove inequalities for the excess risk. To avoid digressions we will skip that issue here.
Lemma 4.3. Suppose that $\Psi(t)/t^2$ does not increase as $t$ increases. There exist constants $c$ and $c'$ such that for
$$\sqrt n\, t_n^2 \ge c\,(\Psi(t_n) \vee t_n), \tag{4.3}$$
we have
$$\mathbf E\big\{ \|\hat f_n - f_0\|_n^2 + \mathrm{pen}(\hat f_n) \big\} \le 2\big\{ \|f_* - f_0\|_n^2 + \mathrm{pen}(f_*) + t_n^2 \big\} + \frac{c'}{n} .$$

Lemma 4.3 is from van de Geer (2001). Comparing it to, e.g., Lemma 4.2, one sees that there is no arbitrary $0 < \delta < 1$ involved in the statement of Lemma 4.3. This is just because van de Geer (2001) has fixed $\delta$ at $\delta = 1/3$ for simplicity.

When $\Psi(t)/t^2 \le \sqrt n/C$ for all $t$, for some constant $C$, condition (4.3) is fulfilled if $t_n \ge c\,n^{-1/2}$, and, in addition, $C \ge c$. Thus, by choosing the penalty carefully, one can indeed ensure that the variance is overruled.

4.5. Application to the "classical" penalty of Chapter 1

We now come to an extension of Example 1.4. Suppose $\mathcal X = [0, 1]$. Let $\bar{\mathcal F}$ be the class of functions on $[0, 1]$ which have derivatives of all orders. The $s$th derivative of a function $f\in\bar{\mathcal F}$ on $[0, 1]$ is denoted by $f^{(s)}$. Define for a given $1 \le p < \infty$, and given smoothness $s\in\{1, 2, \ldots\}$,
$$I^p(f) = \int_0^1 |f^{(s)}(x)|^p\,dx, \quad f\in\bar{\mathcal F}.$$
We consider two cases. In Subsection 4.5.1, we fix a smoothing parameter $\lambda > 0$ and take the penalty $\mathrm{pen}(f) = \lambda^2 I^p(f)$. After some calculations, we then show that in general the variance has not been "overruled", i.e., we do not arrive at an estimator that behaves as a noiseless version, because there still is an additional term. However, this additional term can now be "killed" by including it in the penalty. It all boils down in Subsection 4.5.2 to a data-dependent choice for $\lambda$, or alternatively viewed, a penalty of the form $\mathrm{pen}(f) = \tilde\lambda^2 I^{\frac{2}{2s+1}}(f)$, with $\tilde\lambda > 0$ depending on $s$ and $n$. This penalty allows one to adapt to small values for $I(f_0)$.

4.5.1. Fixed smoothing parameter.

For a function $f\in\bar{\mathcal F}$, we define the penalty $\mathrm{pen}(f) = \lambda^2 I^p(f)$, with a given $\lambda > 0$.

Lemma 4.4. The entropy integral $\Psi$ can be bounded by
$$\Psi(t) \le c_0\Big( t^{\frac{2ps+2-p}{2ps}}\,\lambda^{-\frac{1}{ps}} + t\sqrt{\log\big(\tfrac1\lambda \vee 1\big)} \Big), \quad t > 0.$$
Here, $c_0$ is a constant depending on $s$ and $p$.
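Proof. This follows from the fact that
$$H_\infty(u, \{f\in\bar{\mathcal F} : I(f) \le 1,\ |f| \le 1\}) \le A u^{-1/s}, \quad u > 0,$$
where the constant $A$ depends on $s$ and $p$ (see Birman and Solomjak (1967)). Here, $H_\infty$ denotes the entropy for the sup-norm (see Section 6.6 for a definition). For $f\in\mathcal F(t)$, we have
$$I(f) \le \Big(\frac t\lambda\Big)^{\frac2p},$$
and
$$\|f - f_*\|_n \le t.$$
We therefore may write $f\in\mathcal F(t)$ as $f_1 + f_2$, with $|f_1| \le I(f_1) = I(f)$ and $\|f_2 - f_*\|_n \le t + I(f)$. It is now not difficult to show that for some constant $C_1$
$$H(u, \mathcal F(t), \|\cdot\|_n) \le C_1\Big( \Big(\frac t\lambda\Big)^{\frac{2}{ps}} u^{-\frac1s} + \log\Big(\frac{t}{(\lambda\wedge 1)u}\Big) \Big), \quad 0 < u < t.$$

Corollary 4.5. By applying Lemma 4.3, we find that for some constant $c_1$,
$$\mathbf E\{ \|\hat f_n - f_0\|_n^2 + \lambda^2 I^p(\hat f_n) \} \le 2\min_f \{ \|f - f_0\|_n^2 + \lambda^2 I^p(f) \} + c_1\Big( \Big(\frac{1}{n\lambda^{\frac{2}{ps}}}\Big)^{\frac{2ps}{2ps+p-2}} + \frac{\log(\frac1\lambda\vee 1)}{n} \Big).$$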
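The two rate terms in Corollary 4.5 can be traced back to condition (4.3) by a short calculation. The following is a sketch that suppresses all constants (written $\gtrsim$); it is offered as a plausibility check rather than as the argument of the original proof. Inserting the bound of Lemma 4.4 into (4.3), it suffices that $t_n$ satisfies both
$$\sqrt n\,t_n^2 \gtrsim t_n^{\frac{2ps+2-p}{2ps}}\lambda^{-\frac1{ps}} \iff t_n^{\frac{2ps+p-2}{2ps}} \gtrsim n^{-\frac12}\lambda^{-\frac1{ps}} \iff t_n^2 \gtrsim \Big(\frac{1}{n\lambda^{\frac2{ps}}}\Big)^{\frac{2ps}{2ps+p-2}},$$
and
$$\sqrt n\,t_n^2 \gtrsim t_n\sqrt{\log\big(\tfrac1\lambda\vee 1\big)} \iff t_n^2 \gtrsim \frac{\log(\frac1\lambda\vee 1)}{n}.$$
Note that $2ps + p - 2 \ge 1$ for $p \ge 1$ and $s \ge 1$, so the exponents are well defined. Taking $t_n^2$ of the order of the sum of the two right-hand sides gives the additional term in the corollary.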
4.5.2. Overruling the variance in this case.

For choosing the smoothing parameter $\lambda$, the above suggests the penalty
$$\mathrm{pen}(f) = \min_\lambda \Big\{ \lambda^2 I^p(f) + C_0\Big(\frac{1}{n\lambda^{\frac{2}{ps}}}\Big)^{\frac{2ps}{2ps+p-2}} \Big\},$$
with $C_0$ a suitable constant. The minimization within this penalty yields
$$\mathrm{pen}(f) = C_0'\, n^{-\frac{2s}{2s+1}}\, I^{\frac{2}{2s+1}}(f),$$
where $C_0'$ depends on $C_0$ and $s$. From the computational point of view (in particular, when $p = 2$), it may be convenient to carry out the penalized least squares as in the previous subsection, for all values of $\lambda$, yielding the estimators
$$\hat f_n(\cdot, \lambda) = \arg\min_f \Big\{ \frac1n\sum_{i=1}^n |Y_i - f(x_i)|^2 + \lambda^2 I^p(f) \Big\}.$$
Then the estimator with the penalty of this subsection is $\hat f_n(\cdot, \hat\lambda_n)$, where
$$\hat\lambda_n = \arg\min_{\lambda > 0} \Big\{ \frac1n\sum_{i=1}^n |Y_i - \hat f_n(x_i, \lambda)|^2 + C_0\Big(\frac{1}{n\lambda^{\frac{2}{ps}}}\Big)^{\frac{2ps}{2ps+p-2}} \Big\}.$$
From the same calculations as in the proof of Lemma 4.4, one arrives at the following corollary.
Corollary 4.6. For an appropriate, large enough, choice of $C_0'$ (or $C_0$), depending on $c$, $p$ and $s$, we have for a constant $c_0'$ depending on $c$, $c'$, $C_0'$ ($C_0$), $p$ and $s$,
$$\mathbf E\Big\{ \|\hat f_n - f_0\|_n^2 + C_0'\, n^{-\frac{2s}{2s+1}} I^{\frac{2}{2s+1}}(\hat f_n) \Big\} \le 2\min_f \Big\{ \|f - f_0\|_n^2 + C_0'\, n^{-\frac{2s}{2s+1}} I^{\frac{2}{2s+1}}(f) \Big\} + \frac{c_0'}{n} .$$
Thus, the estimator adapts to small values of $I(f_0)$. For example, when $s = 1$ and $I(f_0) = 0$ (i.e., when $f_0$ is the constant function), the excess risk of the estimator converges with parametric rate $1/n$. If we knew that $f_0$ is constant, we would of course use the sample mean $\sum_{i=1}^n Y_i/n$ as estimator. Thus, this penalized estimator mimics an oracle.

4.6. Bibliographical remarks

When $\{\mathcal F\}$ is a collection of finite-dimensional linear models, a classical penalty is the dimension of $\mathcal F$, properly normalized. This also has information-theoretic meaning as code length. More generally, one may think of taking the number of parameters in the penalty, or some other measure of "degrees of freedom". Important ideas go back to Akaike (1973) and Schwarz (1978). There is much literature on penalized least squares and other loss functions. We mention in particular Barron, Birgé and Massart (1999) for very general results.
5. The $\ell_1$-penalty

Generally, using almost exactly the same arguments, the results of the previous chapter for least squares estimation can be extended to other M-estimation procedures, provided the margin behavior is known. However, when the margin behavior is not known (i.e., when the parameters $c_2$ and especially $\kappa$ in the margin condition of Section 2.5 are not known), a simple extension is no longer possible. The reason is that the estimation error depends on this margin behavior. "Overruling the variance" is thus not straightforward as this "variance" is not known. In this chapter, we propose an $\ell_1$-penalty (i.e., a soft-thresholding type penalty) to overcome the margin problem.

Let $\mathcal F \subset \bar{\mathcal F}$ be a class of functions on $\mathcal X$. We assume that each $f\in\mathcal F$ can be written as a linear combination
$$f(x) = \sum_{j=1}^m \theta_j\psi_j(x), \quad x\in\mathcal X.$$
Here, $\{\psi_j\}_{j=1}^m$ is a given system of $m$ functions on $\mathcal X$, and $\theta = (\theta_1, \ldots, \theta_m)$ is an (in whole or in part) unknown vector. We consider the situation where the number of parameters $m$ is large, possibly larger than the number of observations $n$.

The $\ell_1$-penalty on $f_\theta = \sum_{j=1}^m \theta_j\psi_j$ is
$$\mathrm{pen}(f_\theta) = \lambda_n\sum_{j=1}^m |\theta_j|, \tag{5.1}$$
with $\lambda_n$ a regularization parameter. We study M-estimation with this penalty. We will denote the $\ell_1$-norm of a vector $\theta\in\mathbb R^m$ by
$$I(\theta) = \sum_{j=1}^m |\theta_j| .$$
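To make the remark about soft-thresholding concrete, here is a minimal sketch for a single coefficient. It uses the standard fact that minimizing (z - theta)^2 + lam * |theta| over theta gives theta = sign(z) * (|z| - lam/2)_+ . The one-dimensional objective, the brute-force check and all names are illustrative assumptions; this is not the M-estimator studied in this chapter.

```python
def l1_norm(theta):
    """I(theta) = sum_j |theta_j|."""
    return sum(abs(t) for t in theta)

def soft_threshold(z, lam):
    """Minimizer over theta of (z - theta)**2 + lam * abs(theta)."""
    shrunk = max(abs(z) - lam / 2.0, 0.0)
    return shrunk if z >= 0 else -shrunk

def brute_force_minimizer(z, lam, step=1e-3, radius=5.0):
    """Grid search over [-radius, radius], used only to sanity-check soft_threshold."""
    grid = [-radius + k * step for k in range(int(2 * radius / step) + 1)]
    return min(grid, key=lambda th: (z - th) ** 2 + lam * abs(th))

for z in (1.0, 0.1, -1.0):
    print(z, soft_threshold(z, 0.5), round(brute_force_minimizer(z, 0.5), 3))
```

Small inputs are set exactly to zero, which is how an $\ell_1$-type penalty can produce sparse coefficient vectors without counting non-zero coefficients explicitly.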
To simplify the exposition, we assume all coefficients $\theta_j$ are penalized. In practice, the first few coefficients are often left free (for instance the "constant term" or "constant + linear term", etc.).

In Section 5.1, robust estimation procedures in regression are studied, such as least absolute deviations. There, we consider the case of fixed design. Section 5.2 investigates exponential family approximations of a density. Section 5.3 considers the classification problem with random design, and we apply support vector machine loss.

We recall the notation of Chapter 2. The loss function considered is denoted by $\gamma_f$, $f\in\bar{\mathcal F}$. We write for each $f\in\bar{\mathcal F}$, $R_n(f) = P_n(\gamma_f)$ and $R(f) = \mathbf E P_n(\gamma_f)$. We define
$$\hat f_n = \arg\min_{f\in\mathcal F}\{R_n(f) + \mathrm{pen}(f)\}, \tag{5.2}$$
and
$$f_0 = \arg\min_{f\in\bar{\mathcal F}} R(f).$$
Note that our notation in this chapter differs somewhat from the previous chapter, in the sense that in the definition of $\hat f_n$, we minimize over $f\in\mathcal F$, instead of $f\in\bar{\mathcal F}$. (It can be brought in line with that notation by taking $\mathrm{pen}(f) = \infty$ for $f\in\bar{\mathcal F}\setminus\mathcal F$.) The empirical process, indexed by some subset $\mathcal F_0$ of $\bar{\mathcal F}$, is
$$\{\nu_n(f) = \sqrt n\,(P_n(\gamma_f) - P(\gamma_f)) : f\in\mathcal F_0\}.$$
Throughout, we fix some (arbitrary) $f_*\in\mathcal F$. In the results (Lemmas 5.1, 5.2 and 5.4) one may choose $f_*$ as being an (almost) oracle. Because the exact definition of this (almost) oracle is somewhat involved, we leave the choice of $f_*$ open in our general exposition. One can see a particular choice explained after the statement of the result for general $f_*$ (see below Lemma 5.1). We use the notation
$$\Xi_n = \{-[\nu_n(\hat f_n) - \nu_n(f_*)] \le \sqrt n\,[\mathrm{pen}(\hat f_n - f_*) + \lambda_n^2]\}. \tag{5.3}$$
The set $\Xi_n$ will play exactly the same role as in Lemma 3.7. Recall that in this chapter, $\mathrm{pen}(f_\theta) = \lambda_n I(\theta)$. We will choose the smoothing parameter $\lambda_n$ in such a way that (for all $f_*\in\mathcal F$) the probability of the set $\Xi_n$
is very large (tending to 1 as $n$ tends to $\infty$). Under the given conditions on the system $\{\psi_j\}$, this is true with the choice
$$\lambda_n = c\sqrt{\frac{\log m}{n}} .$$
But what is the value of $c$ here? Let
$$|f|_\infty = \sup_x |f(x)| .$$
In Sections 5.1 and 5.3, the constant $c$ depends on an assumed finite upper bound $K$ for $\sup_{f\in\mathcal F}|f|_\infty$ and, in Sections 5.2 and 5.3, on an upper bound for the density of $X$. To show that with such a choice for $c$, the set $\Xi_n$ has large probability, we need empirical process theory. Chapter 6 is devoted to that. In fact, Section 6.5 uses Theorem 6.2 and then presents all details for (robust) regression or classification with random design. In Section 5.1, we consider fixed design. One may then argue similarly, replacing Theorem 6.2 by Theorem 6.1. This is elaborated on in Loubes and van de Geer (2002) and van de Geer (2003). We remark here that in the latter two papers, the dependency of $c$ on $K$ is removed using convexity arguments. This however necessitates more subtle formulations. We skip the issue here to keep the exposition transparent.

The margin condition of Section 2.5 also appears in all three sections. It is a lower bound for the excess risk $R(f) - R(f_0)$, $f\in\mathcal F$, in terms of a metric $d$ on $\bar{\mathcal F}$. Let us repeat this margin condition here.

Margin condition. For some constants $c_2$ and $\kappa \ge 1$, we have for all $f\in\mathcal F$,
$$R(f) - R(f_0) \ge d^\kappa(f, f_0)/c_2 .$$
The metric $d$ is taken as the one induced by the $L_2(Q_n)$-norm in Section 5.1 (robust regression with fixed design). In Section 5.2, $\bar{\mathcal F}$ is a class of densities with respect to a $\sigma$-finite measure $\mu$, and the metric $d$ will be the one induced by the $L_2(\mu)$-norm. Section 5.3 takes the metric $d$ induced by the $L_1(Q)$-norm. In general, for $\nu$ some measure on $\mathcal X$, and $1 \le p < \infty$, we denote the $L_p(\nu)$-norm by
$$\|f\|_{p,\nu} = \Big(\int |f|^p\,d\nu\Big)^{1/p}, \quad f\in L_p(\nu),$$
and we write
$$\|\cdot\| = \|\cdot\|_{2,Q}, \quad \|\cdot\|_n = \|\cdot\|_{2,Q_n} .$$
The values of $\kappa$ and $c_2$ are not assumed to be known, although in Section 5.2 (density estimation), $\kappa$ is shown to be generally equal to 2. The latter means that in density estimation, we get oracle inequalities very similar to those for least squares estimators in regression. The situation is then completely analogous to the one of Lemma 3.7. The $\ell_1$-penalty allows one to adapt to other margin behavior as well (i.e., to the situation where the excess risk is bounded from below by some more general increasing function of the appropriate metric).
The proofs of the oracle inequalities all follow from one single argument, which was basically already used in Lemma 3.7. We give this argument in Section 5.4. It makes use of an assumed relation between the $\ell_1$-norm $I(\theta) = \sum_{j=1}^m |\theta_j|$ and the metric $d$ in the margin condition.

Condition on the metrics. For some constant $c_1$, $\alpha \ge 0$ and $0 \le \beta < \kappa$, we have for all $J\subset\{1, \ldots, m\}$,
$$\sum_{j\in J} |\theta_j - \tilde\theta_j| \le c_1 |J|^\alpha d^\beta(f_\theta, f_{\tilde\theta}), \quad \forall f_\theta, f_{\tilde\theta}\in\mathcal F.$$
In all Sections 5.1, 5.2 and 5.3, the condition on metrics is met with $\alpha = 1/2$. Sections 5.1 and 5.2 have $\beta = 1$, and Section 5.3 has $\beta = 1/2$.

5.1. Robust regression

Consider independent observations $Y_1, \ldots, Y_n$, satisfying, for $i = 1, \ldots, n$, the regression
$$Y_i = f_0(x_i) + \varepsilon_i,$$
where
$$f_0(x_i) = \arg\min_{c\in\mathbb R} \mathbf E\,\bar\gamma(Y_i - c),$$
with $\bar\gamma : \mathbb R\to[0, \infty)$ a given loss function. The error is defined as $\varepsilon_i = Y_i - f_0(x_i)$, $i = 1, \ldots, n$. We examine the estimator
$$\hat f_n = \arg\min_{f\in\mathcal F} \Big\{ \frac1n\sum_{i=1}^n \bar\gamma(Y_i - f(x_i)) + \mathrm{pen}(f) \Big\},$$
where
$$\mathcal F\subset\{f_\theta : \theta\in\mathbb R^m\},$$
with $f_\theta = \sum_{j=1}^m \theta_j\psi_j$, and where
$$\mathrm{pen}(f_\theta) = \lambda_n I(\theta), \quad I(\theta) = \sum_{j=1}^m |\theta_j| .$$
For technical reasons, we assume
$$\sup_{f\in\mathcal F} |f|_\infty \le K,$$
where $K$ is finite (possibly depending on $n$). We need this condition to handle the empirical process part of the problem. Throughout this section, we assume that $\bar\gamma$ is Lipschitz:
$$|\bar\gamma(z_1) - \bar\gamma(z_2)| \le |z_1 - z_2|, \quad z_1, z_2\in\mathbb R.$$
This is related to a certain robustness of the estimator. We call the estimator $\hat f_n$ of this section the $\ell_1$-penalized robust regression estimator. The Lipschitz assumption also makes it possible to apply the contraction principle (see Section 6.3). This means there is enough machinery to obtain a good bound for the empirical process.
Example 5.1. Least absolute deviations. If
$$\bar\gamma(z) = |z|, \quad z\in\mathbb R,$$
then $f_0(x_i)$ is the median of $Y_i$ (whenever it exists). Hence, in that case $\varepsilon_i$ has median zero. We call $\hat f_n$ the $\ell_1$-penalized least absolute deviations estimator.

Example 5.2. Quantile regression. More generally, if for a given $0 < \alpha < 1$,
$$\bar\gamma(z) = \alpha|z|\,\mathbf 1\{z \ge 0\} + (1-\alpha)|z|\,\mathbf 1\{z < 0\}, \quad z\in\mathbb R,$$
then $f_0(x_i)$ is the $\alpha$-quantile of the distribution of $Y_i$. (With the two weights interchanged, the minimizer would be the $(1-\alpha)$-quantile.) The estimator $\hat f_n$ is then called the $\ell_1$-penalized quantile estimator.

As usual, we employ the notation
$$R_n(f) = \frac1n\sum_{i=1}^n \bar\gamma(Y_i - f(x_i)), \quad R(f) = \mathbf E R_n(f), \quad \nu_n(f) = \sqrt n\,[R_n(f) - R(f)],$$
and
$$\|f\|_n^2 = \frac1n\sum_{i=1}^n f^2(x_i) .$$
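The loss functions of Examples 5.1 and 5.2 can be checked numerically. The sketch below uses the check loss with weight alpha on residuals z >= 0 and weight 1 - alpha on residuals z < 0 (the convention of Koenker and Bassett, stated here explicitly because the roles of the two weights are easy to interchange), and verifies on simulated standard normal data that the constant minimizing the empirical loss is a sample alpha-quantile. The data, the sample size and the function names are assumptions made for illustration; alpha = 1/2 gives half the absolute loss of Example 5.1.

```python
import random

def check_loss(z, alpha):
    """alpha * |z| for z >= 0 and (1 - alpha) * |z| for z < 0."""
    return alpha * z if z >= 0 else (alpha - 1.0) * z

def empirical_risk(c, ys, alpha):
    """Average check loss of the constant predictor c."""
    return sum(check_loss(y - c, alpha) for y in ys) / len(ys)

rng = random.Random(2)
ys = sorted(rng.gauss(0.0, 1.0) for _ in range(1001))
alpha = 0.8

# The empirical risk is piecewise linear in c, so it suffices to search the sample points.
c_hat = min(ys, key=lambda c: empirical_risk(c, ys, alpha))
print(f"minimizer of the empirical check loss: {c_hat:.3f}")
print(f"801st order statistic (of 1001):       {ys[800]:.3f}")
```

Between consecutive order statistics with k observations below c, the slope of the empirical risk is (k - alpha * n) / n, which changes sign at the alpha-quantile; this is the empirical counterpart of the statement in Example 5.2.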
Margin condition. For some constants $c_2$ and $\kappa > 1$ and for all $f\in\mathcal F$,
$$R(f) - R(f_0) \ge \|f - f_0\|_n^\kappa/c_2 .$$
The values of $\kappa$ and $c_2$ may depend on the density of the errors $\varepsilon_i$, $i = 1, \ldots, n$. Typically $\kappa = 2$, as is shown in Exercise 5.1.

Exercise 5.1. Consider least absolute deviations estimation, i.e., the case $\bar\gamma(z) = |z|$. Suppose that $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. and that their common distribution has density $g$ with respect to Lebesgue measure. Suppose also that $g(u) > t > 0$ for all $|u| \le K$. Show that when also $|f_0| \le K$, the margin condition holds with $\kappa = 2$ and $c_2 = 2/t$.

Positive eigenvalue condition on the system $\{\psi_j\}$. Define
$$\Sigma_n = \frac1n\sum_{i=1}^n \psi(x_i)\psi(x_i)^T,$$
with $\psi(x_i) = (\psi_1(x_i), \ldots, \psi_m(x_i))^T$, $i = 1, \ldots, n$. Let $\lambda_{\min}^2$ be the smallest eigenvalue of $\Sigma_n$. Assume that for some constant $0 < c_1 < \infty$,
$$\lambda_{\min} \ge 1/c_1 .$$
(Note that this condition on the system $\{\psi_j\}_{j=1}^m$ implies $m \le n$.)

Let for $f = \sum_{j=1}^m \theta_j\psi_j$,
$$N(f) = \#\{\theta_j \neq 0\}.$$
Define for $0 < \delta < 1$,
$$V_n(f) = 2(c_2/\delta)^{\frac{1}{\kappa-1}}\big(4c_1^2\lambda_n^2 N(f)\big)^{\frac{\kappa}{2(\kappa-1)}} .$$
We will show in Lemma 5.1 that on the set
$$\Xi_n = \{-[\nu_n(\hat f_n) - \nu_n(f_*)] \le \sqrt n\,[\mathrm{pen}(\hat f_n - f_*) + \lambda_n^2]\},$$
the excess risk of the estimator $\hat f_n$ satisfies an inequality involving the estimation error $V_n(f_*)$ and approximation error $R(f_*) - R(f_0)$.

Lemma 5.1. Assume the margin condition, and the positive eigenvalue condition on the system $\{\psi_j\}$. Then on the set $\Xi_n$ we have
$$R(\hat f_n) - R(f_0) \le \Big(\frac{1+\delta}{1-\delta}\Big)\big\{V_n(f_*) + R(f_*) - R(f_0) + \lambda_n^2\big\}.$$

Proof. Use the positive eigenvalue condition on the system $\{\psi_j\}$ to see that the condition on metrics holds with $d(f, \tilde f) = \|f - \tilde f\|_n$, $\alpha = 1/2$, and $\beta = 1$. Now, invoke the margin condition and apply Lemma 5.5.

In Lemma 5.1, we now choose
$$f_* = \arg\min_{f\in\mathcal F}\{V_n(f) + R(f) - R(f_0)\}$$
(where we assume for simplicity that the minimum is attained). This choice balances the estimation error $V_n(f_*)$ and approximation error $R(f_*) - R(f_0)$, so that one arrives at an oracle inequality on $\Xi_n$.

It can be shown that for any $f_*$, the set $\Xi_n$ has large probability for the choice $\lambda_n = c\sqrt{\log n/n}$, with $c$ a large enough constant depending only on $K$. This can be done using the tools from Section 6 (the concentration inequality of Theorem 6.1, symmetrization (Theorem 6.3), the contraction principle (Theorem 6.4), and the peeling device described in Section 6.4). The details are in Loubes and van de Geer (2002) and van de Geer (2003).

Thus, a good choice for $\lambda_n$ does not require knowledge of the constants $\kappa$ or $c_2$ appearing in the margin condition. Since the estimation error $V_n(f_*)$ does depend on these constants, one may conclude that the $\ell_1$-penalty yields a trade-off between estimation error and approximation error, without (directly) estimating the estimation error.

Exercise 5.2. Suppose that for all integers $N$,
$$\min_{f\in\mathcal F,\ N(f)\le N}[R(f) - R(f_0)] \le c_3 N^{-\kappa s} .$$
Here $s > 0$ is a "smoothness parameter". Verify that the trade-off between estimation error and approximation error gives
$$R(\hat f_n) - R(f_0) \le c_4\,\lambda_n^{\frac{2\kappa s}{2(\kappa-1)s+1}} .$$
The constant $c_4$ is determined by the values of $c_1$, $c_2$ and $c_3$ and those of $\delta$, $\kappa$ and $s$.
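One way to see where the exponent in Exercise 5.2 comes from is the following heuristic sketch. It treats $N$ as a continuous variable, absorbs all constants into the symbol $\asymp$, and ignores the $\lambda_n^2$ term of Lemma 5.1; these simplifications are assumptions made here, and the exercise asks for the careful version. For $f$ with $N(f) \le N$, the estimation error $V_n(f)$ is at most of order $(\lambda_n^2 N)^{\frac{\kappa}{2(\kappa-1)}}$, while the approximation error is at most $c_3 N^{-\kappa s}$. Balancing the two gives
$$(\lambda_n^2 N)^{\frac{\kappa}{2(\kappa-1)}} \asymp N^{-\kappa s} \iff N^{\frac{\kappa(2(\kappa-1)s+1)}{2(\kappa-1)}} \asymp \lambda_n^{-\frac{\kappa}{\kappa-1}} \iff N \asymp \lambda_n^{-\frac{2}{2(\kappa-1)s+1}},$$
and substituting this $N$ back into the approximation error yields
$$N^{-\kappa s} \asymp \lambda_n^{\frac{2\kappa s}{2(\kappa-1)s+1}} .$$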
5.2. Density estimation

Let $X_1, \ldots, X_n$ be i.i.d., with distribution $P$ on $\mathcal X$. Let $p_0 = dP/d\mu$ be the density with respect to some given $\sigma$-finite measure $\mu$. Define
$$f_0 = \log p_0 - \int \log p_0\,d\mu .$$
Take
$$\bar{\mathcal F} = \Big\{f : \int f\,d\mu = 0,\ \int e^f\,d\mu < \infty\Big\}.$$
Define for $f\in\bar{\mathcal F}$,
$$p_f = \exp[f - b(f)], \quad b(f) = \log\int e^f\,d\mu .$$
Then for each $f\in\bar{\mathcal F}$, the function $p_f$ is a density with respect to $\mu$. We examine the penalized maximum likelihood estimator $\hat f_n$ over the class $\mathcal F\subset\bar{\mathcal F}$:
$$\hat f_n = \arg\max_{f\in\mathcal F}\Big\{\frac1n\sum_{i=1}^n \log p_f(X_i) - \mathrm{pen}(f)\Big\}.$$
This means we take the loss function
$$\gamma_f(x) = -f(x) + b(f).$$
The empirical risk is
$$R_n(f) = -P_n(f) + b(f),$$
and the theoretical counterpart is
$$R(f) = -P(f) + b(f).$$

Exercise 5.3. Verify that
$$R(f) - R(f_0) = K(p_f | p_{f_0}),$$
where $K(p|p_0)$ is the Kullback-Leibler information between the densities $p$ and $p_0$:
$$K(p|p_0) = \int \log(p_0/p)\,p_0\,d\mu .$$

As stated in the beginning of this chapter, we assume that each $f\in\mathcal F$ can be written in the form
$$f = \sum_{j=1}^m \theta_j\psi_j,$$
for some coefficients $\theta_j$, $j = 1, \ldots, m$. In this section this means we are considering an $m$-parameter exponential family. To have an identifiable representation, the functions $\psi_j$ can be assumed to be centered with respect to $\mu$:
$$\int \psi_j\,d\mu = 0, \quad j = 1, \ldots, m.$$
The penalty is again taken to be the $\ell_1$-penalty
$$\mathrm{pen}(f_\theta) = \lambda_n I(\theta), \quad I(\theta) = \sum_{j=1}^m |\theta_j| .$$
Define now
$$\|f\|_{2,\mu}^2 = \int f^2\,d\mu .$$

Margin condition. For some $\kappa > 1$ and $c_2$, and for all $f\in\mathcal F$,
$$K(p_f|p_0) \ge \|f - f_0\|_{2,\mu}^\kappa/c_2 .$$
In fact, the value $\kappa = 2$ is quite natural in this case, as can be seen in the next exercise.

Exercise 5.4. Let $\{p_f : f\in\mathcal F\}$ be a class of densities, with respect to Lebesgue measure on $[0, 1]$, and suppose that $p_0 \equiv 1$ is the density of the uniform distribution. Note that $f_0 = 0$ and $b(f_0) = 0$ in this case. Suppose that $\sup_{f\in\mathcal F}|f|_\infty \le K$. Show that
$$R(f) - R(f_0) = \int f^2 p_1\,d\mu - \Big[\int f p_1\,d\mu\Big]^2,$$
where $p_1 = \exp[f_1 - b(f_1)]$, and $|f_1| \le |f|$. Moreover
$$\int f p_1\,d\mu = \int f f_1 p_2\,d\mu - \int f p_2\,d\mu\int f_1 p_2\,d\mu,$$
where $p_2 = \exp[f_2 - b(f_2)]$, and $|f_2| \le |f_1|$. Conclude that
$$R(f) - R(f_0) \ge e^{-2K}\int f^2\,d\mu - 2e^{4K}\Big(\int f^2\,d\mu\Big)^2 .$$
The margin condition is thus met with $\kappa = 2$ when $\mathcal F$ satisfies the requirement that $\|f\|_{2,\mu}$ is sufficiently small for all $f\in\mathcal F$. (Convexity arguments can remove this requirement on $\mathcal F$.)

Positive eigenvalue condition on the system $\{\psi_j\}$. Let
$$\Sigma = \int \psi\psi^T\,d\mu .$$
Let $\lambda_{\min}^2$ be the smallest eigenvalue of $\Sigma$. Assume that for some constant $0 < c_1 < \infty$,
$$\lambda_{\min} \ge 1/c_1 .$$
Now, we proceed exactly as in the previous section. Let for $f = \sum_{j=1}^m \theta_j\psi_j$,
$$N(f) = \#\{\theta_j \neq 0\},$$
and for $0 < \delta < 1$,
$$V_n(f) = 2(c_2/\delta)^{\frac{1}{\kappa-1}}\big(4c_1^2\lambda_n^2 N(f)\big)^{\frac{\kappa}{2(\kappa-1)}} .$$
Moreover, let $f_*$ be any function in $\mathcal F$. Recall the set
$$\Xi_n = \{-[\nu_n(\hat f_n) - \nu_n(f_*)] \le \sqrt n\,[\mathrm{pen}(\hat f_n - f_*) + \lambda_n^2]\}.$$

Lemma 5.2. Assume the margin condition, and the positive eigenvalue condition on the system $\{\psi_j\}$. Then on $\Xi_n$ we have
$$R(\hat f_n) - R(f_0) \le \Big(\frac{1+\delta}{1-\delta}\Big)\{V_n(f_*) + R(f_*) - R(f_0) + \lambda_n^2\}.$$

Proof. This is again a straightforward application of Lemma 5.5, as in the proof of Lemma 5.1.

It is in this case easy to handle the set $\Xi_n$. Note that
$$-\big\{[R_n(f_\theta) - R(f_\theta)] - [R_n(f_{\theta_*}) - R(f_{\theta_*})]\big\} = P_n(f_\theta - f_{\theta_*}) - P(f_\theta - f_{\theta_*}) \le \max_j |P_n(\psi_j) - P(\psi_j)|\,I(\theta - \theta_*).$$

Normalization condition on the system $\{\psi_j\}$. Assume that for all $j = 1, \ldots, m$,
$$|\psi_j|_\infty \le \sqrt{\frac{n}{\log m}}$$
(where we recall the notation $|\psi_j|_\infty = \sup_x |\psi_j(x)|$), and
$$\int |\psi_j|^2\,d\mu \le 1 .$$
Also assume that $p_0 \le c_0$, where $c_0 \ge 1$ is given.

Lemma 5.3. Suppose the normalization condition on $\{\psi_j\}$. Then for all $t \ge 1$,
$$\mathbf P\Big(\max_j |P_n(\psi_j) - P(\psi_j)| \ge 3tc_0\sqrt{\frac{\log m}{n}}\Big) \le 2\exp\Big[-\frac27\,tc_0\log m\Big].$$

Proof. This follows from Bernstein's inequality (Bernstein (1924), Bennett (1962)), which says that for each $a > 0$,
$$\mathbf P(|P_n(\psi_j) - P(\psi_j)| > a) \le 2\exp\Big[-\frac{na^2}{2a|\psi_j|_\infty + \sigma_j^2}\Big],$$
where $\sigma_j^2 = P|\psi_j|^2 - |P\psi_j|^2$.

In view of Lemma 5.3, we now know that for sequences of sets $\Xi_n$ with $\lambda_n = 3c_0\sqrt{\log m/n}$, with $m = m_n\to\infty$ as $n\to\infty$, we have
$$\lim_{n\to\infty}\mathbf P(\Xi_n) = 1 .$$
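The size of the threshold in Lemma 5.3 can be illustrated by simulation. The sketch below is built on assumptions chosen for convenience: the underlying density is uniform on [0, 1] (so one may take c0 = 1), and the system is the cosine system psi_j(x) = sqrt(2) cos(pi j x), which is centered and has unit L2 norm with respect to Lebesgue measure. It compares the largest deviation max_j |P_n(psi_j) - P(psi_j)| with lambda_n = 3 c0 sqrt(log m / n).

```python
import math
import random

def psi(j, x):
    """Cosine system on [0, 1]: integral zero and unit L2(Lebesgue) norm."""
    return math.sqrt(2.0) * math.cos(math.pi * j * x)

rng = random.Random(3)
n, m, c0 = 2000, 50, 1.0            # uniform density p0 = 1, hence p0 <= c0 = 1
xs = [rng.random() for _ in range(n)]

# Normalization condition: sup-norm of psi_j is sqrt(2), to be compared with sqrt(n / log m).
sup_norm_ok = math.sqrt(2.0) <= math.sqrt(n / math.log(m))

# P(psi_j) = 0 under the uniform distribution, so P_n(psi_j) is itself the deviation.
max_dev = max(abs(sum(psi(j, x) for x in xs) / n) for j in range(1, m + 1))
lam = 3.0 * c0 * math.sqrt(math.log(m) / n)
print(f"max_j |P_n psi_j - P psi_j| = {max_dev:.4f}, lambda_n = {lam:.4f}")
```

In runs of this kind the maximal deviation is usually well below lambda_n, which is consistent with the set Xi_n having large probability.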
5.3. Binary classification

Let $(X, Y)$ be random variables, with $X\in\mathcal X$ a feature and $Y\in\{-1, +1\}$ a label. A classifier is a function $f : \mathcal X\to\mathbb R$. Using the classifier $f$, we predict the label $+1$ when $f(X) \ge 0$, and the label $-1$ when $f(X) < 0$. Thus, a classification error occurs when $Yf(X) < 0$. We regard $(X, Y)$ as random variables with distribution $P$, and denote the distribution of $X$ by $Q$. Moreover, we write
$$\eta_0(x) = P(Y = 1 | X = x), \quad x\in\mathcal X.$$
Bayes rule is the classifier
$$f_0 = \begin{cases} +1 & \text{if } \eta_0 \ge 1/2 \\ -1 & \text{if } \eta_0 < 1/2 \end{cases} .$$
In Exercise 2.3, we have seen that $f_0$ minimizes the probability of a classification error.

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be observed i.i.d. copies of $(X, Y)$. These observations are called the training set. Let $\mathcal F$ be a collection of classifiers. In empirical risk minimization (see Vapnik (1995, 1998)), one chooses the classifier in $\mathcal F$ that has the smallest number of classification errors. However, if $\mathcal F$ is a rich set, this classifier will be hard to compute. Indeed, we will again consider a very high-dimensional class $\mathcal F$ in this section. We will use support vector machine (SVM) loss instead of the number of misclassifications, to overcome computational problems. The empirical SVM loss function is
$$R_n(f) = \frac1n\sum_{i=1}^n \bar\gamma(Y_i f(X_i)),$$
where $\bar\gamma(z) = (1 - z)_+$ is the hinge function, with $z_+$ denoting the positive part of $z\in\mathbb R$. Thus, our loss function $\gamma_f$ is now
$$\gamma_f(x, y) = (1 - yf(x))_+ .$$
The $\ell_1$-penalized SVM loss estimator $\hat f_n$ is
$$\hat f_n = \arg\min_{f\in\mathcal F}\Big\{\frac1n\sum_{i=1}^n (1 - Y_i f(X_i))_+ + \mathrm{pen}(f)\Big\},$$
where $\mathcal F\subset\{f_\theta : \theta\in\mathbb R^m\}$, and
$$\mathrm{pen}(f_\theta) = \lambda_n I(\theta), \quad I(\theta) = \sum_{j=1}^m |\theta_j| .$$
As in Section 5.1, we need to assume, for technical reasons, that
$$\sup_{f\in\mathcal F}|f|_\infty \le K,$$
where $K$ is a given finite constant.
Define the theoretical SVM loss
$$R(f) = \mathbf E R_n(f) = \int (1 - yf(x))_+\,dP(x, y).$$

Exercise 5.5. Verify that SVM loss is consistent, in the sense that Bayes rule $f_0$ satisfies
$$f_0 = \arg\min_{\text{all } f} R(f).$$
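A quick numerical check of the claim in Exercise 5.5 can be made pointwise in x. For a fixed value eta = P(Y = 1 | X = x), the conditional SVM risk of predicting the value v is eta * (1 - v)_+ + (1 - eta) * (1 + v)_+, and the sketch below minimizes it over a grid of values. The grid and the function names are illustrative assumptions, and the check is no substitute for the proof asked for in the exercise.

```python
def hinge(z):
    """Hinge function (1 - z)_+."""
    return max(1.0 - z, 0.0)

def conditional_svm_risk(value, eta):
    """E[(1 - Y f(X))_+ | X = x] when f(x) = value and P(Y = 1 | X = x) = eta."""
    return eta * hinge(value) + (1.0 - eta) * hinge(-value)

def pointwise_minimizer(eta):
    """Grid minimizer of the conditional SVM risk over values in [-2, 2]."""
    grid = [k / 10 for k in range(-20, 21)]
    return min(grid, key=lambda v: conditional_svm_risk(v, eta))

for eta in (0.8, 0.3):
    print(f"eta={eta}: minimizing value {pointwise_minimizer(eta)}")
```

For eta > 1/2 the minimizing value is +1 and for eta < 1/2 it is -1, matching Bayes rule; at eta = 1/2 every value in [-1, 1] attains the minimum, so the grid search is not informative there.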
To handle the empirical process, note first that $\bar\gamma(z) = (1 - z)_+$ is Lipschitz. This allows us to again apply the contraction principle of Section 6.3. It means that we can invoke the tools of Chapter 6 (in particular, the concentration inequality of Theorem 6.2, symmetrization, the contraction principle and the peeling device) to handle the empirical process part of the problem.

Margin condition. For some constants $\kappa \ge 1$ and $c_2$, we have for all $f\in\mathcal F$,
$$R(f) - R(f_0) \ge \|f - f_0\|_{1,Q}^\kappa/c_2 .$$
Note that we assumed the margin condition with an $L_1$-norm, instead of the $L_2$-norm. It turns out that under some mild assumptions on $\eta_0$ and $Q$, indeed the margin condition is met with $L_1(Q)$-norm.

Positive eigenvalue condition on the system $\{\psi_j\}$. Let
$$\Sigma = \int \psi\psi^T\,dQ .$$
Let $\lambda_{\min}^2$ be the smallest eigenvalue of $\Sigma$. Assume that $\lambda_{\min} > 0$.

The rest is as in the previous two sections, albeit that the condition on metrics is now met with the value $\beta = 1/2$ (instead of the value $\beta = 1$ of Sections 5.1 and 5.2). Let for $f = \sum_{j=1}^m \theta_j\psi_j$,
$$N(f) = \#\{\theta_j \neq 0\},$$
and for $0 < \delta < 1$,
$$V_n(f) = 2(c_2/\delta)^{\frac{1}{2\kappa-1}}\big(8K\lambda_n^2 N(f)/\lambda_{\min}^2\big)^{\frac{\kappa}{2\kappa-1}} .$$
Moreover, let $f_*$ be any function in $\mathcal F$. Recall the set
$$\Xi_n = \{-[\nu_n(\hat f_n) - \nu_n(f_*)] \le \sqrt n\,[\mathrm{pen}(\hat f_n - f_*) + \lambda_n^2]\}.$$
Assuming a proper normalization condition on the system $\{\psi_j\}$ (see Section 6.5), this set has large probability for the choice $\lambda_n = c\sqrt{\log n/n}$ with $c$ a suitable constant (see the conclusion at the end of Section 6.5).

Lemma 5.4. Assume the margin condition, and the positive eigenvalue condition on the system $\{\psi_j\}$. Then on $\Xi_n$ we have
$$R(\hat f_n) - R(f_0) \le \Big(\frac{1+\delta}{1-\delta}\Big)\{V_n(f_*) + R(f_*) - R(f_0) + \lambda_n^2\}.$$
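Proof. This follows from the fact that for $f_\theta\in\mathcal F$ and $f_{\tilde\theta}\in\mathcal F$
$$\sum_{j\in J}|\theta_j - \tilde\theta_j| \le |J|^{1/2}\Big(\sum_{j=1}^m |\theta_j - \tilde\theta_j|^2\Big)^{1/2} \le |J|^{1/2}\,\|f_\theta - f_{\tilde\theta}\|_{2,Q}/\lambda_{\min} \le |J|^{1/2}\sqrt{2K}\,\|f_\theta - f_{\tilde\theta}\|_{1,Q}^{1/2}/\lambda_{\min} .$$
So the condition on metrics holds with $d(f, \tilde f) = \|f - \tilde f\|_{1,Q}$ and with $\alpha = \beta = 1/2$ and $c_1 = \sqrt{2K}/\lambda_{\min}$. Apply Lemma 5.5.

5.4. The behavior on $\Xi_n$

We use straightforward calculus to show the inequality for the excess risk on $\Xi_n$. The arguments follow exactly those in the proof of Lemma 3.7.

Lemma 5.5. Assume the margin condition and the condition on the metrics, formulated in the beginning of this chapter. Then on the set $\Xi_n$, we have for any $0 < \delta < 1$,
$$R(\hat f_n) - R(f_0) \le \Big(\frac{1+\delta}{1-\delta}\Big)\{V_n + R(f_*) - R(f_0) + \lambda_n^2\},$$
where
$$V_n = 2(c_2/\delta)^{\frac{\beta}{\kappa-\beta}}(2c_1)^{\frac{\kappa}{\kappa-\beta}}(|J_*|^\alpha\lambda_n)^{\frac{\kappa}{\kappa-\beta}},$$
with $J_* = \{j : \theta_{j,*} \neq 0\}$.

Proof. The basic inequality says that
$$R(\hat f_n) - R(f_0) \le -[\nu_n(\hat f_n) - \nu_n(f_*)]/\sqrt n + \mathrm{pen}(f_*) - \mathrm{pen}(\hat f_n) + R(f_*) - R(f_0),$$
so that on $\Xi_n$,
$$R(\hat f_n) - R(f_0) \le \mathrm{pen}(\hat f_n - f_*) + \lambda_n^2 + \mathrm{pen}(f_*) - \mathrm{pen}(\hat f_n) + R(f_*) - R(f_0).$$
Now, use the same arguments as in the proof of Lemma 3.7. Let
$$\mathrm{pen}_1(f_\theta) = \lambda_n\sum_{j\in J_*}|\theta_j|,$$
and
$$\mathrm{pen}_2(f_\theta) = \lambda_n\sum_{j\notin J_*}|\theta_j| .$$
Then $\mathrm{pen}(f_\theta) = \mathrm{pen}_1(f_\theta) + \mathrm{pen}_2(f_\theta)$, and one easily sees
$$\mathrm{pen}(\hat f_n - f_*) = \mathrm{pen}_1(\hat f_n - f_*) + \mathrm{pen}_2(\hat f_n)$$
and
$$\mathrm{pen}(f_*) - \mathrm{pen}(\hat f_n) \le \mathrm{pen}_1(\hat f_n - f_*) - \mathrm{pen}_2(\hat f_n).$$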
So on $\Xi_n$,
$$R(\hat f_n) - R(f_0) \le 2\,\mathrm{pen}_1(\hat f_n - f_*) + \lambda_n^2 + R(f_*) - R(f_0).$$
By the condition on the metrics, on $\Xi_n$,
$$R(\hat f_n) - R(f_0) \le 2c_1\lambda_n|J_*|^\alpha d^\beta(\hat f_n, f_*) + \lambda_n^2 + R(f_*) - R(f_0).$$
After application of the triangle inequality
$$d(\hat f_n, f_*) \le d(\hat f_n, f_0) + d(f_*, f_0),$$
the margin condition gives us that on $\Xi_n$,
$$R(\hat f_n) - R(f_0) \le 2c_1\lambda_n|J_*|^\alpha c_2^{\frac\beta\kappa}\big(R(\hat f_n) - R(f_0)\big)^{\frac\beta\kappa} + 2c_1\lambda_n|J_*|^\alpha c_2^{\frac\beta\kappa}\big(R(f_*) - R(f_0)\big)^{\frac\beta\kappa} + \lambda_n^2 + R(f_*) - R(f_0).$$
Next, invoke the Technical Lemma of Chapter 2. Then, on $\Xi_n$,
$$R(\hat f_n) - R(f_0) \le 2(c_2/\delta)^{\frac{\beta}{\kappa-\beta}}(2c_1)^{\frac{\kappa}{\kappa-\beta}}(|J_*|^\alpha\lambda_n)^{\frac{\kappa}{\kappa-\beta}} + \delta\big(R(\hat f_n) - R(f_0)\big) + \lambda_n^2 + (1+\delta)\big(R(f_*) - R(f_0)\big).$$

5.5. Bibliographical remarks

For quantile regression, we refer to Koenker and Bassett (1978) and, in the nonparametric case, Koenker, Ng and Portnoy (1992, 1994) and Portnoy (1997). A nice comparison of least squares and least absolute deviations is in Portnoy and Koenker (1997). The result of Section 5.1 is from Loubes and van de Geer (2002) and van de Geer (2003). There, also convexity is used to show that one may take $\lambda_n$ not depending on the bound $K$ for the functions $f$ in $\mathcal F$. See also van de Geer (2002) for convexity arguments in density estimation (and high-dimensional exponential families).

A good reference for classification is the book by Devroye, Györfi and Lugosi (1996). Also the book of Hastie, Tibshirani and Friedman (2001) has classification as one of its subjects, including SVM's, but also other methods and algorithms. Adaptive results for empirical risk minimizers in classification, using for instance Rademacher complexities, are in, e.g., Koltchinskii (2001), Koltchinskii and Panchenko (2002), Koltchinskii (2003), and Lugosi and Wegkamp (2004). Support vector machines (SVM's) have been introduced by Boser, Guyon and Vapnik (1992). An important book on SVM's is Schölkopf and Smola (2002). Lin (2002) shows that the SVM is consistent. See Bartlett, Jordan and McAuliffe (2003) for results for more general loss functions.

The $\ell_1$-penalty is often referred to as the LASSO (see Tibshirani (1996)). Adaptivity of the SVM with $\ell_1$-penalty is studied in Tarigan and van de Geer (2005).
6. Tools from empirical process theory Let X1 , . . . , Xn be independent random variables with values in X and let Γ be some family of real-valued functions on X . Define Pn (γ) =
n
γ(Xi )/n, P (γ) = EPn (γ).
i=1
We review, in Sections 6.1–6.4, some general results for the empirical process √ n(Pn − P )(γ) : γ ∈ Γ. In Section 6.5, we apply the results of Sections 6.1 - 6.4 to arrive at a uniform probability bound for the case where the empirical process is indexed by a class ∈ F } with γ¯ : R → R and F a subset of a collection of linear {γf = γ¯ ◦ f : f functions {fθ = j θj ψj }. There, we moreover assume the Lipschitz condition |¯ γ (z1 ) − γ¯ (z2 ))| ≤ |z1 − z2 |, z1 , z2 ∈ R. The probability bound of Section 6.5 can be used to handle the set Ξn introduced in Chapter 5. Section 6.6 studies empirical processes indexed by functions in a function class Γ satisfying an entropy bound. It considers the modulus of continuity of the empirical process. The result says that a version of the empirical process condition of Section 2.5 holds. 6.1. Concentration inequalities Concentration inequalities are exponential probability inequalities for the concentration of the supremum of the empirical process around its mean. Remarkable is that the amount of concentration does not depend on the dimensionality of the problem. Define Z = sup |Pn (γ) − P (γ)|. γ∈Γ
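We start out with a Hoeffding-type concentration inequality.

Theorem 6.1 (Massart (2000a)). Suppose $\Gamma$ is a finite collection, satisfying $a_{i,\gamma} \le \gamma(X_i) \le b_{i,\gamma}$ for some real numbers $a_{i,\gamma}$ and $b_{i,\gamma}$ and for all $1 \le i \le n$ and $\gamma \in \Gamma$. Define
\[
L^2 = \sup_{\gamma \in \Gamma} \sum_{i=1}^n (b_{i,\gamma} - a_{i,\gamma})^2 / n.
\]
Then for any positive $t$,
\[
\mathbb{P}(Z \ge \mathbb{E}Z + t) \le \exp\Bigl[-\frac{n t^2}{2L^2}\Bigr].
\]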
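As a rough numerical illustration of Theorem 6.1, the sketch below evaluates $L^2$ and the bound $\exp[-nt^2/(2L^2)]$ for given ranges $[a_{i,\gamma}, b_{i,\gamma}]$. The function names `hoeffding_L2` and `hoeffding_tail_bound`, the list-of-lists layout of the ranges, and the numbers used are illustrative assumptions, not notation from the text.

```python
import math

def hoeffding_L2(lower, upper):
    """L^2 = sup_gamma sum_i (b_{i,gamma} - a_{i,gamma})^2 / n.

    lower[g][i] and upper[g][i] hold a_{i,gamma} and b_{i,gamma}
    for the g-th function of the finite class (hypothetical layout).
    """
    n = len(lower[0])
    return max(
        sum((b - a) ** 2 for a, b in zip(lo, up)) / n
        for lo, up in zip(lower, upper)
    )

def hoeffding_tail_bound(n, t, L2):
    """Upper bound exp[-n t^2 / (2 L^2)] on P(Z >= EZ + t)."""
    return math.exp(-n * t ** 2 / (2.0 * L2))

# Toy class: one function with values in [0, 1], one with values in [-1, 1].
n = 200
lower = [[0.0] * n, [-1.0] * n]
upper = [[1.0] * n, [1.0] * n]
L2 = hoeffding_L2(lower, upper)            # the widest range is 2, so L^2 = 4
bound = hoeffding_tail_bound(n, 0.2, L2)   # exp(-200 * 0.04 / 8) = exp(-1)
```

Note that the number of functions in $\Gamma$ enters only through the supremum defining $L^2$, which is the dimension-free behaviour mentioned above.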
The next theorem is a Bernstein-type concentration inequality.
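Theorem 6.2 (Massart (2000a)). Let $\Gamma$ be a countable family of functions, satisfying $|\gamma|_\infty\ (= \sup_x |\gamma(x)|) \le b < \infty$ for every $\gamma \in \Gamma$. Let
\[
\tau^2 = \sup_{\gamma \in \Gamma} \sum_{i=1}^n \mathrm{var}(\gamma(X_i))/n.
\]
Then for any positive $t$,
\[
\mathbb{P}\bigl(Z \ge 2\mathbb{E}Z + \tau t \sqrt{8} + 69 b t^2/2\bigr) \le \exp[-n t^2].
\]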
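To see how the two deviation terms of Theorem 6.2 behave, the following sketch computes the level $2\mathbb{E}Z + \tau t\sqrt{8} + 69bt^2/2$ for a $t$ chosen so that $\exp[-nt^2]$ equals a target probability. The helper names and the input values (including the value used for $\mathbb{E}Z$) are hypothetical and serve only as an illustration.

```python
import math

def bernstein_threshold(mean_Z, tau, b, t):
    """Deviation level 2 EZ + tau t sqrt(8) + 69 b t^2 / 2 from Theorem 6.2."""
    return 2.0 * mean_Z + tau * t * math.sqrt(8.0) + 69.0 * b * t ** 2 / 2.0

def bernstein_tail_bound(n, t):
    """Right-hand side exp[-n t^2] of Theorem 6.2."""
    return math.exp(-n * t ** 2)

def t_for_confidence(n, alpha):
    """Solve exp[-n t^2] = alpha for t."""
    return math.sqrt(math.log(1.0 / alpha) / n)

# Assumed inputs: n observations, target probability alpha, and guesses for EZ, tau, b.
n, alpha = 1000, 0.01
t = t_for_confidence(n, alpha)
level = bernstein_threshold(mean_Z=0.05, tau=0.5, b=1.0, t=t)
variance_term = 0.5 * t * math.sqrt(8.0)
range_term = 69.0 * 1.0 * t ** 2 / 2.0
```

The variance term is linear in $t$ while the range term is quadratic, so for small $t$ (large $n$ at a fixed probability level) the bound is driven by $\tau$ rather than by the sup-norm bound $b$.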
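6.2. Symmetrization

Definition. A Rademacher sequence $\varepsilon_1, \ldots, \varepsilon_n$ is a sequence of i.i.d. random variables with values in $\{\pm 1\}$, and with $\mathbb{P}(\varepsilon_i = 1) = \mathbb{P}(\varepsilon_i = -1) = 1/2$ ($i = 1, \ldots, n$).

Theorem 6.3 (van der Vaart and Wellner (1996)). Let $\varepsilon_1, \ldots, \varepsilon_n$ be a Rademacher sequence independent of $X_1, \ldots, X_n$. Then
\[
\mathbb{E}\Bigl(\sup_{\gamma \in \Gamma} |P_n(\gamma) - P(\gamma)|\Bigr) \le 2\,\mathbb{E}\Bigl(\sup_{\gamma \in \Gamma} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \gamma(X_i)\Bigr|\Bigr).
\]

6.3. Contraction principle

Theorem 6.4 (Ledoux and Talagrand (1991)). Let $x_1, \ldots, x_n$ be non-random elements of $\mathcal{X}$, and $\bar\gamma : \mathbb{R} \to \mathbb{R}$ be Lipschitz, i.e.,
\[
|\bar\gamma(z_1) - \bar\gamma(z_2)| \le |z_1 - z_2|, \qquad z_1, z_2 \in \mathbb{R}.
\]
Furthermore, let $\mathcal{F}$ be a class of functions on $\mathcal{X}$. Then for any $f_* \in \mathcal{F}$,
\[
\mathbb{E}\Bigl(\sup_{f \in \mathcal{F}} \Bigl|\sum_{i=1}^n \varepsilon_i \bigl[\bar\gamma(f(x_i)) - \bar\gamma(f_*(x_i))\bigr]\Bigr|\Bigr) \le 2\,\mathbb{E}\Bigl(\sup_{f \in \mathcal{F}} \Bigl|\sum_{i=1}^n \varepsilon_i \bigl(f(x_i) - f_*(x_i)\bigr)\Bigr|\Bigr).
\]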
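The contraction principle can be checked exactly on a small example by enumerating all $2^n$ sign patterns of the Rademacher sequence. The sketch below does this for the hinge function as Lipschitz map; the toy class `F_values` (each row lists the values $f(x_1), \ldots, f(x_n)$ of one function) is a made-up assumption chosen only for illustration.

```python
import itertools

def hinge(z):
    """Hinge function (1 - z)_+, which is Lipschitz with constant 1."""
    return max(0.0, 1.0 - z)

def expected_sup(vectors):
    """Exact E sup_v |sum_i eps_i v_i|, averaging over all 2^n sign patterns."""
    n = len(vectors[0])
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        total += max(abs(sum(e * v for e, v in zip(signs, vec))) for vec in vectors)
    return total / 2 ** n

# Hypothetical toy class: each row holds f(x_1), ..., f(x_n) for one f in F.
F_values = [
    [0.3, -1.2, 0.8, 2.0, -0.5, 1.1],
    [1.5, 0.4, -0.7, 0.9, 0.2, -1.8],
    [-0.6, 0.0, 1.3, -2.1, 0.7, 0.5],
]
f_star = F_values[0]

# Differences after and before applying the Lipschitz map.
loss_diffs = [[hinge(v) - hinge(s) for v, s in zip(f, f_star)] for f in F_values]
raw_diffs = [[v - s for v, s in zip(f, f_star)] for f in F_values]

lhs = expected_sup(loss_diffs)        # left-hand side of Theorem 6.4
rhs = 2.0 * expected_sup(raw_diffs)   # right-hand side of Theorem 6.4
```

Exhaustive enumeration is only feasible for small $n$; for larger samples one would typically replace it by Monte Carlo draws of the signs.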
6.4. Weighted empirical processes

Let $\Gamma = \{\gamma_f : f \in \mathcal{F}\}$ be a class of functions indexed by a set $\mathcal{F}$. Define the empirical process
\[
\nu_n(f) = \sqrt{n}(P_n - P)(\gamma_f), \qquad f \in \mathcal{F}.
\]
Consider a function $d_*$ on $\mathcal{F}$ taking non-negative values. Suppose we are interested in the weighted empirical process
\[
\frac{\nu_n(f)}{d_*^\beta(f)},
\]
where $\beta > 0$ is given. We have for all $a > 0$ and $t > 0$,
\[
\mathbb{P}\Bigl(\sup_{f \in \mathcal{F},\ d_*(f) > t} \frac{|\nu_n(f)|}{d_*^\beta(f)} > a\Bigr) \le \sum_{j=1}^\infty \mathbb{P}\Bigl(\sup_{f \in \mathcal{F},\ d_*(f) \le 2^j t} |\nu_n(f)| > a\, 2^{\beta(j-1)} t^\beta\Bigr).
\]
Each summand on the right-hand side can be studied using results for the non-weighted empirical process. Because the index set $\mathcal{F}$ is “peeled” into annuli $\{f \in \mathcal{F} : 2^{j-1} t < d_*(f) \le 2^j t\}$, this method for obtaining probability bounds is sometimes referred to as the peeling device (van de Geer (2000)). We apply
the peeling device in the proof of Lemma 6.4. It is also handy for obtaining the modulus of continuity of the empirical process indexed by functions satisfying an entropy bound (see Section 6.6).

6.5. The case of a Lipschitz transformation of a linear space

Let $\bar\gamma : \mathbb{R} \to \mathbb{R}$ be a Lipschitz function, i.e.,
\[
|\bar\gamma(z_1) - \bar\gamma(z_2)| \le |z_1 - z_2|, \qquad z_1, z_2 \in \mathbb{R}.
\]
Let $\mathcal{F}$ be a collection of functions on $\mathcal{X}$ and define
\[
\gamma_f = \bar\gamma \circ f, \qquad f \in \mathcal{F}.
\]
Recall the notation for the empirical process
\[
\nu_n(f) = \sqrt{n}(P_n - P)(\gamma_f), \qquad f \in \mathcal{F}.
\]

Regression and classification example. In regression and classification, we replace $X$ with values $x \in \mathcal{X}$ by the pair $(X, Y)$ with values $(x, y) \in \mathcal{X} \times \mathbb{R}$. In robust regression, the loss function is $\bar\gamma(y - f(x))$ with $\bar\gamma$ a given Lipschitz function. In binary classification, one has $y \in \{\pm 1\}$ and the SVM loss is $\bar\gamma(y f(x))$ where $\bar\gamma(z) = (1 - z)_+$ is the hinge loss, which is clearly also Lipschitz.

Suppose now that $\mathcal{F}$ is a bounded subset of a collection of linear functions
\[
\mathcal{F} = \Bigl\{f_\theta = \sum_{j=1}^n \theta_j \psi_j : |f_\theta| \le K_0/2\Bigr\},
\]
where $\{\psi_j\}_{j=1}^n$ is a given system of functions. The constant $K_0$ is assumed to be finite and satisfy $K_0 \ge 1$. We assume there are exactly $n$ functions in the system to facilitate the exposition. Denote the $\ell_1$-norm of a vector $\theta \in \mathbb{R}^n$ by
\[
I(\theta) = \sum_{j=1}^n |\theta_j|.
\]
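The objects just introduced (the hinge loss, the linear functions $f_\theta$ and the $\ell_1$-norm $I(\theta)$) can be written down in a few lines. This is only a sketch: the three-element system `psi`, the coefficient vectors and the data point are invented for illustration and play no role in the theory.

```python
def hinge(z):
    """Hinge loss (1 - z)_+ used for the SVM."""
    return max(0.0, 1.0 - z)

def l1_norm(theta):
    """I(theta) = sum_j |theta_j|."""
    return sum(abs(t) for t in theta)

def f_theta(theta, psi, x):
    """f_theta(x) = sum_j theta_j psi_j(x) for a list psi of callables."""
    return sum(t * p(x) for t, p in zip(theta, psi))

# Hypothetical system of functions and coefficient vectors.
psi = [lambda x: 1.0, lambda x: x, lambda x: x * x]
theta = [0.2, -0.5, 0.1]
theta_star = [0.0, -0.4, 0.0]

x, y = 0.5, -1                                  # one observation with label y in {-1, +1}
svm_loss = hinge(y * f_theta(theta, psi, x))    # f_theta(0.5) = -0.025, so the loss is 0.975
distance = l1_norm([a - b for a, b in zip(theta, theta_star)])  # I(theta - theta_star) = 0.4

# Numerical check of the Lipschitz property of the hinge on a grid.
grid = [k / 10.0 for k in range(-30, 31)]
lipschitz_ok = all(abs(hinge(u) - hinge(v)) <= abs(u - v) + 1e-12
                   for u in grid for v in grid)
```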
Normalization condition on the system $\{\psi_j\}$. Assume that for all $j = 1, \ldots, n$,
\[
\max_x |\psi_j(x)| \le \sqrt{\frac{n}{\log n}} \quad \text{and} \quad P|\psi_j|^2 \le 1.
\]
Fix an arbitrary $f_* = f_{\theta_*}$ in $\mathcal{F}$. Define for all $M > 0$,
\[
\mathcal{F}_M = \{f_\theta \in \mathcal{F} : I(\theta - \theta_*) \le M\},
\]
and
\[
Z_M = \sup_{f_\theta \in \mathcal{F}_M} |\nu_n(f_\theta) - \nu_n(f_*)|/\sqrt{n}.
\]
We first apply, in Lemma 6.1, the concentration inequality of Theorem 6.2 to $Z_M$, with $M$ fixed. The result is a probability inequality for the (one-sided) deviation
of $Z_M$ from its mean $\mathbb{E}Z_M$. The next task is to obtain an upper bound for $\mathbb{E}Z_M$. This is done in Lemmas 6.2 and 6.3. The combination of Lemmas 6.1–6.3 yields the probability inequality (6.1) (below Lemma 6.3) for $Z_M$ with $M$ fixed. Lemma 6.4 finally uses this result and the peeling device of Section 6.4 to derive a probability bound for the weighted empirical process.

We use the notation
\[
K_0 \vee M = \max(K_0, M), \qquad K_0 \wedge M = \min(K_0, M).
\]

Lemma 6.1. Assume the normalization condition on the system $\{\psi_j\}$. For all $M$ satisfying
\[
K_0\sqrt{\frac{\log n}{n}} \le M \le K_0\sqrt{\frac{n}{\log n}},
\]
we have
\[
\mathbb{P}\Bigl(Z_M \ge 2\mathbb{E}Z_M + 36K_0^2 M\sqrt{\frac{\log n}{n}}\Bigr) \le \exp[-(K_0 \vee M)^2 \log n].
\]

Proof. By the second part of the normalization condition on the system $\{\psi_j\}$, we know that
\[
P|f_\theta - f_*|^2 \le I^2(\theta - \theta_*).
\]
Moreover, $|f_\theta - f_*| \le K_0$ because both functions are bounded by $K_0/2$. Hence, in the concentration inequality of Theorem 6.2, we may take $\tau^2 \le (K_0 \wedge M)^2$. We moreover take there $t^2 = (K_0 \vee M)^2 \log n/n$, and $b = K_0$. Then
\[
\tau t\sqrt{8} + 69bt^2/2 \le 36K_0^2 M\sqrt{\frac{\log n}{n}}.
\]
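Lemma 6.2. We have for all $M > 0$,
\[
\mathbb{E}Z_M \le 4M\,\mathbb{E}\Bigl(\max_{1 \le j \le n} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \psi_j(x_i)\Bigr|\Bigr),
\]
where $\varepsilon_1, \ldots, \varepsilon_n$ is a Rademacher sequence (see Section 6.2 for a definition), independent of $X_1, \ldots, X_n$.

Proof. By the definition of $Z_M$,
\[
\mathbb{E}Z_M = \mathbb{E}\Bigl(\sup_{f_\theta \in \mathcal{F}_M} |(P_n - P)(\gamma_{f_\theta} - \gamma_{f_*})|\Bigr).
\]
Use the symmetrization inequality (Theorem 6.3) and then the contraction principle (Theorem 6.4). Then we get
\[
\mathbb{E}Z_M \le 2\,\mathbb{E}\Bigl(\sup_{f_\theta \in \mathcal{F}_M} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \bigl(\gamma_{f_\theta}(X_i) - \gamma_{f_*}(X_i)\bigr)\Bigr|\Bigr) \le 4\,\mathbb{E}\Bigl(\sup_{f_\theta \in \mathcal{F}_M} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \bigl(f_\theta(X_i) - f_*(X_i)\bigr)\Bigr|\Bigr).
\]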
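But obviously,
\[
\mathbb{E}\Bigl(\sup_{f_\theta \in \mathcal{F}_M} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \bigl(f_\theta(X_i) - f_*(X_i)\bigr)\Bigr|\Bigr) \le M\,\mathbb{E}\Bigl(\max_{1 \le j \le n} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \psi_j(x_i)\Bigr|\Bigr).
\]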
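The last inequality can be spelled out as follows. This is a routine expansion using only the linearity of $f_\theta$ in $\theta$ and the definition of $I(\cdot)$; it holds for every realization $x_1, \ldots, x_n$ of the observations and every choice of signs, and $\theta_{*,j}$ denotes the $j$-th coordinate of $\theta_*$ (notation introduced here for convenience).
\[
\Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \bigl(f_\theta(x_i) - f_*(x_i)\bigr)\Bigr|
= \Bigl|\sum_{j=1}^n (\theta_j - \theta_{*,j})\,\frac{1}{n}\sum_{i=1}^n \varepsilon_i \psi_j(x_i)\Bigr|
\le I(\theta - \theta_*) \max_{1 \le j \le n} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \psi_j(x_i)\Bigr|.
\]
Since $I(\theta - \theta_*) \le M$ for all $f_\theta \in \mathcal{F}_M$, taking the supremum over $\mathcal{F}_M$ and then expectations gives the bound.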
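From now on we assume $\log n \ge 1$.

Lemma 6.3. Assume the normalization condition on the system $\{\psi_j\}$. Let $\varepsilon_1, \ldots, \varepsilon_n$ be a Rademacher sequence, independent of $X_1, \ldots, X_n$. We have
\[
\mathbb{E}\Bigl(\max_{1 \le j \le n} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \psi_j(x_i)\Bigr|\Bigr) \le 24\sqrt{\frac{\log n}{n}}.
\]

Proof. Define
\[
Z_\psi = \max_{1 \le j \le n} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \psi_j(x_i)\Bigr| \Big/ \Bigl(3\sqrt{\frac{\log n}{n}}\Bigr).
\]
Using the same argument as in Lemma 5.2, we find for all $t \ge 1$,
\[
\mathbb{P}(Z_\psi \ge t) \le 2\exp\Bigl[-\frac{2}{7}\, t \log n\Bigr].
\]
Hence,
\[
\mathbb{E}(Z_\psi) = \int_0^\infty \mathbb{P}(Z_\psi \ge t)\,dt \le 1 + 2\int_1^\infty \exp\Bigl[-\frac{2}{7}\, t \log n\Bigr]dt = 1 + \frac{7}{\log n}\exp\Bigl[-\frac{2}{7}\log n\Bigr] \le 8.
\]
(Here, we used that $\log n \ge 1$.)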
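The integral in the proof of Lemma 6.3 is easy to check numerically. The sketch below compares the closed form $1 + (7/\log n)\exp[-(2/7)\log n]$ with a midpoint-rule approximation of the integral and evaluates it for a few sample sizes; the function names, the truncation point `upper` and the step count are arbitrary choices made for this illustration.

```python
import math

def tail_integral_closed_form(log_n):
    """1 + (7 / log n) * exp[-(2/7) log n], the bound on E(Z_psi)."""
    return 1.0 + 7.0 / log_n * math.exp(-2.0 * log_n / 7.0)

def tail_integral_numeric(log_n, upper=200.0, steps=200_000):
    """Midpoint-rule value of 1 + 2 * int_1^upper exp[-(2/7) t log n] dt."""
    h = (upper - 1.0) / steps
    rate = 2.0 * log_n / 7.0
    total = sum(math.exp(-rate * (1.0 + (k + 0.5) * h)) for k in range(steps))
    return 1.0 + 2.0 * h * total

# The closed form decreases in log n, so log n = 1 is the worst case.
worst_case = tail_integral_closed_form(1.0)
values = {n: tail_integral_closed_form(math.log(n)) for n in (3, 10, 100, 10_000)}
```

Multiplying the bound $\mathbb{E}(Z_\psi) \le 8$ by the normalizing factor $3\sqrt{\log n/n}$ gives the constant 24 in the statement of the lemma.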
The combination of Lemmas 6.1, 6.2 and 6.3 yields that under the normalization condition on the system $\{\psi_j\}$,
\[
\mathbb{P}\Bigl(Z_M \ge 228K_0^2 M\sqrt{\frac{\log n}{n}}\Bigr) \le \exp[-(K_0 \vee M)^2 \log n], \tag{6.1}
\]
for $K_0\sqrt{\log n/n} \le M \le K_0\sqrt{n/\log n}$. We now invoke the technique of Section 6.4 to study the weighted empirical process
\[
\frac{\nu_n(f_\theta) - \nu_n(f_*)}{I(\theta - \theta_*) + 456K_0^2\sqrt{\log n/n}}, \qquad f_\theta \in \mathcal{F}.
\]
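For orientation, the constants 228 and 456 can be traced as follows. This is a bookkeeping sketch that uses only Lemmas 6.1–6.3 and the assumption $K_0 \ge 1$:
\[
2\,\mathbb{E}Z_M + 36K_0^2 M\sqrt{\frac{\log n}{n}}
\le 2 \cdot 4M \cdot 24\sqrt{\frac{\log n}{n}} + 36K_0^2 M\sqrt{\frac{\log n}{n}}
= (192 + 36K_0^2)\, M\sqrt{\frac{\log n}{n}}
\le 228K_0^2 M\sqrt{\frac{\log n}{n}}.
\]
The weight $456K_0^2\sqrt{\log n/n}$ in the denominator is twice the factor $228K_0^2\sqrt{\log n/n}$ appearing in (6.1), which leaves room for the factor 2 lost between consecutive shells in the peeling argument.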
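Lemma 6.4. Define $\lambda_n = 456K_0^2\sqrt{\log n/n}$. Under the normalization condition on the system $\{\psi_j\}$ we have
\[
\mathbb{P}\Bigl(\sup_{f_\theta \in \mathcal{F}} \frac{|\nu_n(f_\theta) - \nu_n(f_*)|}{I(\theta - \theta_*) + \lambda_n} \ge \sqrt{n}\lambda_n\Bigr) \le 3\exp[-K_0^2 \log n/2].
\]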
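Proof. To simplify the notation, we assume that $f_* \equiv 0$. We split the class $\mathcal{F}$ into three subclasses, namely
\[
\mathcal{F}(\mathrm{I}) = \Bigl\{f_\theta : I(\theta) \le K_0\sqrt{\frac{\log n}{n}}\Bigr\},
\]
\[
\mathcal{F}(\mathrm{II}) = \Bigl\{f_\theta : K_0\sqrt{\frac{\log n}{n}} < I(\theta) \le K_0\Bigr\},
\]
and
\[
\mathcal{F}(\mathrm{III}) = \{f_\theta : I(\theta) > K_0\}.
\]
Then
\[
\mathbb{P}\Bigl(\sup_{f_\theta \in \mathcal{F}} \frac{|\nu_n(f_\theta)|}{I(\theta) + \lambda_n} \ge \sqrt{n}\lambda_n\Bigr)
\le \mathbb{P}\Bigl(\sup_{f_\theta \in \mathcal{F}(\mathrm{I})} \frac{|\nu_n(f_\theta)|}{I(\theta) + \lambda_n} \ge \sqrt{n}\lambda_n\Bigr)
+ \mathbb{P}\Bigl(\sup_{f_\theta \in \mathcal{F}(\mathrm{II})} \frac{|\nu_n(f_\theta)|}{I(\theta) + \lambda_n} \ge \sqrt{n}\lambda_n\Bigr)
+ \mathbb{P}\Bigl(\sup_{f_\theta \in \mathcal{F}(\mathrm{III})} \frac{|\nu_n(f_\theta)|}{I(\theta) + \lambda_n} \ge \sqrt{n}\lambda_n\Bigr)
= P_{\mathrm{I}} + P_{\mathrm{II}} + P_{\mathrm{III}}.
\]
Apply (6.1) to $P_{\mathrm{I}}$ to see that
\[
P_{\mathrm{I}} \le \mathbb{P}\Bigl(\sup_{f_\theta \in \mathcal{F}(\mathrm{I})} |\nu_n(f_\theta)| \ge \sqrt{n}\lambda_n^2\Bigr)
= \mathbb{P}\Bigl(Z_{K_0\sqrt{\log n/n}} \ge (456)^2 K_0^4\,\frac{\log n}{n}\Bigr)
\le \mathbb{P}\Bigl(Z_{K_0\sqrt{\log n/n}} \ge 228K_0^3\,\frac{\log n}{n}\Bigr)
\le \exp[-K_0^2 \log n].
\]
Now,
\[
\mathcal{F}(\mathrm{II}) \subset \bigcup_{j=0}^{j_0} \{f_\theta \in \mathcal{F} : M_{j+1} < I(\theta) \le M_j\},
\]
where $M_j = 2^{-j}K_0$, $j = 0, \ldots, j_0$, and where $2^{j_0} \le \sqrt{n}$. Thus
\[
P_{\mathrm{II}} \le \sum_{j=0}^{j_0} \mathbb{P}\bigl(Z_{M_j} \ge \lambda_n M_j/2\bigr) \le \sum_{j=0}^{j_0} \exp[-K_0^2 \log n] \le \exp[-K_0^2 \log n/2].
\]
Finally
\[
\mathcal{F}(\mathrm{III}) = \bigcup_{j=1}^{\infty} \{f_\theta \in \mathcal{F} : \bar{M}_{j-1} < I(\theta) \le \bar{M}_j\},
\]
where $\bar{M}_j = 2^j K_0$, $j = 0, 1, 2, \ldots$. So
\[
P_{\mathrm{III}} \le \sum_{j=1}^{\infty} \mathbb{P}\bigl(Z_{\bar{M}_j} \ge \lambda_n \bar{M}_j/2\bigr) \le \sum_{j=1}^{\infty} \exp[-\bar{M}_j^2 \log n] \le \exp[-K_0^2 \log n].
\]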
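The way the three pieces of the proof add up to $3\exp[-K_0^2\log n/2]$ can be checked numerically. The sketch below evaluates the bounds on $P_{\mathrm{I}}$, $P_{\mathrm{II}}$ and $P_{\mathrm{III}}$ exactly as they appear in the proof, taking $j_0$ as the largest integer with $2^{j_0} \le \sqrt{n}$ and truncating the infinite sum; the function names and the grid of $(n, K_0)$ values are assumptions made for this illustration.

```python
import math

def lemma_6_4_terms(n, K0):
    """Bounds on P_I, P_II, P_III as they appear in the proof of Lemma 6.4."""
    log_n = math.log(n)
    base = math.exp(-K0 ** 2 * log_n)
    # Largest j0 with 2^{j0} <= sqrt(n); P_II sums j0 + 1 identical terms.
    j0 = int(math.floor(0.5 * math.log2(n)))
    p_one = base
    p_two = (j0 + 1) * base
    # Outer shells: sum over j >= 1 of exp[-(2^j K0)^2 log n], truncated
    # because the terms underflow to zero after a few steps.
    p_three = sum(math.exp(-(2 ** j * K0) ** 2 * log_n) for j in range(1, 12))
    return p_one, p_two, p_three

def lemma_6_4_bound(n, K0):
    """Right-hand side 3 exp[-K0^2 log n / 2] of Lemma 6.4."""
    return 3.0 * math.exp(-K0 ** 2 * math.log(n) / 2.0)

checks = [(n, K0, sum(lemma_6_4_terms(n, K0)), lemma_6_4_bound(n, K0))
          for n in (3, 10, 100, 10_000) for K0 in (1.0, 2.0)]
```

The middle term is the only one that grows with the number of shells, and it stays below $\exp[-K_0^2\log n/2]$ because there are at most about $\log_2\sqrt{n}$ shells between $K_0\sqrt{\log n/n}$ and $K_0$.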
Let us conclude this section by putting the result of Lemma 6.4 in the context of Chapter 5. Following the notation of that chapter, let
\[
\Xi_n = \bigl\{-[\nu_n(\hat{f}_n) - \nu_n(f_*)] \le \sqrt{n}\,[\mathrm{pen}(\hat{f}_n - f_*) + \lambda_n^2]\bigr\},
\]
where $\hat{f}_n$ is any random function in $\mathcal{F}$, $\lambda_n = 456K_0^2\sqrt{\log n/n}$ and $\mathrm{pen}(f_\theta) = \lambda_n I(\theta)$. Then by Lemma 6.4, under the normalization condition on the system $\{\psi_j\}$, one has
\[
\mathbb{P}(\Xi_n) \ge 1 - 3\exp[-K_0^2 \log n/2].
\]
This, in combination with the result of Lemma 5.4, completes, for the case $m = n$, the proof of adaptivity of the $\ell_1$-penalized SVM loss estimator. (When $m$ is larger than $n$, say $m = n^D$, the result goes through with a suitably adjusted normalization condition on the system $\{\psi_j\}$, or alternatively, with suitably adjusted constants depending on $D$.) The result of Lemma 5.1 for the $\ell_1$-penalized robust regression estimator can be completed analogously. It suffices to replace, in the arguments of this section, the concentration inequality of Theorem 6.2 by the one of Theorem 6.1.

6.6. Modulus of continuity of the empirical process

Let $\mathcal{F}$ be a class of functions on $\mathcal{X}$.

Definition. Let $u > 0$. The $u$-entropy $H(u, \mathcal{F}, d)$ (for the metric $d$) is the logarithm of the minimum number of balls with radius $u$ necessary to cover $\mathcal{F}$.

We define
\[
\|f\|^2 = P|f|^2 = \sum_{i=1}^n \mathbb{E}f^2(X_i)/n.
\]
Let moreover
\[
|f|_\infty = \sup_{x \in \mathcal{X}} |f(x)|,
\]
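and we suppose that the entropy of $\mathcal{F}$ for the metric induced by $|\cdot|_\infty$ is finite. We denote this entropy by $H_\infty(\cdot, \mathcal{F})$. Consider again a transformation $\{\gamma_f = \bar\gamma \circ f : f \in \mathcal{F}\}$, where $\bar\gamma$ is Lipschitz.

Theorem 6.5 (van de Geer (2000)). Let $\mathcal{F}$ be a class of functions with $|f - f_*|_\infty \le 1$ for all $f, f_* \in \mathcal{F}$. Suppose that for some constants $A$ and $s > 1/2$,
\[
H_\infty(u, \mathcal{F}, |\cdot|_\infty) \le A u^{-1/s}, \qquad u > 0.
\]
Then there exists a constant $c$, depending only on $A$ and $s$, such that for all $f_* \in \mathcal{F}$ and all $t \ge c$,
\[
\mathbb{P}\Bigl(\sup_{f \in \mathcal{F},\ \|f - f_*\| > n^{-s/(2s+1)}} \frac{|\nu_n(f) - \nu_n(f_*)|}{\|f - f_*\|^{1 - \frac{1}{2s}}} \ge t\Bigr) \le c\exp\Bigl[-\frac{t}{c}\Bigr].
\]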
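To get a feeling for the two exponents in Theorem 6.5, the sketch below tabulates the weight exponent $1 - 1/(2s)$ and the radius $n^{-s/(2s+1)}$ for a few values of $s$. The function names and the choice $n = 10{,}000$ are illustrative assumptions.

```python
def modulus_exponent(s):
    """Exponent 1 - 1/(2s) of ||f - f_*|| in the weight of Theorem 6.5."""
    return 1.0 - 1.0 / (2.0 * s)

def localization_radius(n, s):
    """Radius n^{-s/(2s+1)} above which the supremum in Theorem 6.5 is taken."""
    return n ** (-s / (2.0 * s + 1.0))

# (s, exponent, radius) for n = 10,000 and three smoothness levels.
table = [(s, modulus_exponent(s), localization_radius(10_000, s)) for s in (1, 2, 4)]
```

As $s$ grows, the exponent approaches 1 and the radius approaches $n^{-1/2}$, so smoother classes behave more and more like a finite-dimensional class.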
Example 6.1. Let $\mathcal{X} = [0, 1]$ and
\[
\mathcal{F} = \Bigl\{f : |f|_\infty \le 1,\ \int_0^1 |f^{(s)}(x)|^2\,dx \le 1\Bigr\}.
\]
Here $f^{(s)}$ is the $s$-th derivative of $f$. Then (Kolmogorov and Tikhomirov (1959)), the entropy condition of Theorem 6.5 holds.

Exercise 6.1. We first study the regression setup with fixed design. Suppose, for $i = 1, \ldots, n$, we have observations $(x_i, Y_i) \in \mathcal{X} \times \mathbb{R}$, where $x_i$ is a fixed covariable, and $Y_i$ a response variable. In this setup, we used throughout the notation
\[
\|f\|_n^2 = \frac{1}{n}\sum_{i=1}^n f^2(x_i),
\]
for a function $f : \mathcal{X} \to \mathbb{R}$. Note that for such a function, $\|f\| = \|f\|_n$. Consider the M-estimator over the class $\mathcal{F}$ defined in Example 6.1:
\[
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \bar\gamma(Y_i - f(x_i)).
\]
Assume the margin condition of Section 2.5 holds, with $d$ the metric corresponding to the $\|\cdot\|_n$-norm, and with margin parameter $\kappa \ge 2$. By applying Theorem 6.5 (with $\mathcal{F}$ there replaced by $\tilde{\mathcal{F}} = \{(x, y) \mapsto y - f(x) : f \in \mathcal{F}\}$), show that the bound as given in (2.4) for the average risk becomes
\[
\mathbb{E}R(\hat{f}_n) - R(f_0) \le \Bigl(\frac{1 + \delta}{1 - \delta}\Bigr)\Bigl\{c_0\, n^{-\frac{2\kappa s}{2(\kappa - 1)s + 1}} + R(f_*) - R(f_0)\Bigr\},
\]
as $\beta = 1 - 1/(2s)$. Here, $c_0$ is a constant depending on the constant $c$ of Theorem 6.5, and on $\delta$, $\kappa$ and $s$. Compare with the result of Exercise 5.2. This illustrates that, indeed, up to $\log n$-terms, the estimator with $\ell_1$-penalty of Section 5.1 adapts to the smoothness (= number of derivatives in this case) $s$.

Exercise 6.2. Generalize the situation of Exercise 6.1 to the case of random design, assuming the margin condition of Section 2.5 is met with $d$ the metric corresponding to the $\|\cdot\|$-norm.

6.7. Bibliographical remarks

Hoeffding’s inequality and Bernstein’s inequality can be found throughout the literature. Original references are Hoeffding (1963), Bernstein (1924) and Bennett (1962). Concentration inequalities have been derived by Talagrand (e.g., Talagrand (1995)) and further developed by Ledoux (e.g., Ledoux (1996)). Massart (2000a) studies the constants in these inequalities. Symmetrization is an important randomization technique in empirical process theory that goes back to Vapnik and Chervonenkis (1971). The method described in Section 6.4 for weighted empirical processes is from Alexander (1985), and is referred to as the peeling device in van de Geer (2000). A recent paper on weighted empirical processes is Giné and Koltchinskii (2004). The results in Section 6.5 are from Tarigan and van de Geer
(2005). In van de Geer (2000), moduli of continuity are derived that are more general than the one cited in Section 6.6. There, their implications for rates of convergence in regression and maximum likelihood estimation are also explained. Entropy results for subsets of Besov spaces, which are extensions of the space $\mathcal{F}$ studied in Example 6.1, can be found in Birgé and Massart (2000).
References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Proceedings 2nd International Symposium on Information Theory, P.N. Petrov and F. Csaki, Eds., Akademia Kiado, Budapest, 267–281.
Alexander, K.S. (1985). Rates of growth for weighted empirical processes. In: Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer 2 (L. LeCam and R.A. Olshen eds.) 475–493. University of California Press, Berkeley.
Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Prob. Theory Rel. Fields 113 301–413.
Bartlett, P.L., Jordan, M.I. and McAuliffe, J.D. (2003). Convexity, classification and risk bounds. Techn. Report 638, University of California at Berkeley.
Bennett, G. (1962). Probability inequalities for sums of independent random variables. Journ. Amer. Statist. Assoc. 57 33–45.
Bernstein, S. (1924). Sur une modification de l’inégalité de Tchebichef. Ann. Sci. Inst. Sav. Ukraine Sect. Math. I (Russian, French summary).
Birgé, L. and Massart, P. (2000). An adaptive compression algorithm in Besov spaces. Journ. Constr. Approx. 16 1–36.
Birman, M.Š. and Solomjak, M.Z. (1967). Piecewise polynomial approximation of functions in the classes $W_p^\alpha$. Math. USSR Sbornik 73 295–317.
Boser, B., Guyon, I. and Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. Fifth Annual Conf. on Comp. Learning Theory, Pittsburgh ACM 142–152.
Candès, E.J. and Donoho, D.L. (2004). New tight frames of curvelets and optimal representations of objects with piecewise $C^2$ singularities. Comm. Pure and Applied Math. LVII 219–266.
Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York, Berlin, Heidelberg.
Donoho, D.L. and Johnstone, I.M. (1994a). Minimax risk for $\ell_1$-losses over $\ell_p$-balls. Prob. Th. Related Fields 99, 277–303.
Donoho, D.L. and Johnstone, I.M. (1994b). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425–455.
Donoho, D.L. (1995). De-noising via soft-thresholding. IEEE Transactions in Information Theory 41, 613–627.
Donoho, D.L. and Johnstone, I.M. (1996). Neo-classical minimax problems, thresholding and adaptive function estimation. Bernoulli 2, 39–62.
Donoho, D.L. (1999). Wedgelets: nearly minimax estimation of edges. Ann. Statist. 27 859–897.
Donoho, D.L. (2004a). For most large underdetermined systems of equations, the minimal $\ell_1$-norm near-solution approximates the sparsest near-solution. Techn. Report, Stanford University.
Donoho, D.L. (2004b). For most large underdetermined systems of linear equations, the minimal $\ell_1$-norm solution is also the sparsest solution. Techn. Report, Stanford University.
Edmunds, E. and Triebel, H. (1992). Entropy numbers and approximation numbers in function spaces. II. Proceedings of the London Mathematical Society (3) 64, 153–169.
Giné, E. and Koltchinskii, V. (2004). Concentration inequalities and asymptotic results for ratio type empirical processes. Working paper.
Goldstein, H. (1980). Classical Mechanics. 2nd edition, Reading, MA, Addison-Wesley.
Green, P.J. and Silverman, B.W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London.
Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. (1998). Wavelets, Approximation and Statistical Applications. Lecture Notes in Statistics, vol. 129. Springer, New York, Berlin, Heidelberg.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer, New York.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journ. Amer. Statist. Assoc. 58 13–30.
Koenker, R. and Bassett Jr., G. (1978). Regression quantiles. Econometrica 46, 33–50.
Koenker, R., Ng, P.T. and Portnoy, S.L. (1992). Nonparametric estimation of conditional quantile functions. L1 Statistical Analysis and Related Methods, Ed. Y. Dodge, Elsevier, Amsterdam, 217–229.
Koenker, R., Ng, P.T. and Portnoy, S.L. (1994). Quantile smoothing splines. Biometrika 81, 673–680.
Kolmogorov, A.N. and Tikhomirov, V.M. (1959). $\varepsilon$-entropy and $\varepsilon$-capacity of sets in function spaces. Uspekhi Mat. Nauk 14 3–86. (English transl. in Americ. Math. Soc. Transl. (2) 17 (1961) 277–364).
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47 1902–1914.
Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1–50.
Koltchinskii, V. (2003). Local Rademacher complexities and oracle inequalities in risk minimization. Manuscript.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer Verlag, New York.
Ledoux, M. (1996). Talagrand deviation inequalities for product measures. ESAIM: Probab. Statist. 1 63–87. Available at: www.emath.fr/ps/
Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data mining knowledge and discovery 6 259–275.
Loubes, J.-M. and van de Geer, S. (2002). Adaptive estimation in regression, using soft thresholding type penalties. Statistica Neerlandica 56 453–478.
Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. To appear in Ann. Statist.
Mammen, E. and Tsybakov, A.B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829.
Massart, P. (2000a). About the constants in Talagrand’s concentration inequalities for empirical processes. Ann. Probab. 28 863–884.
Massart, P. (2000b). Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse 9, 245–303.
Pareto, V. (1897). Cours d’Économie Politique, Rouge, Lausanne et Paris.
Portnoy, S. (1997). Local asymptotics for quantile smoothing splines. Ann. Statist. 25, 414–434.
Portnoy, S. and Koenker, R. (1997). The Gaussian hare and the Laplacian tortoise: computability of squared error versus absolute-error estimators, with discussion. Stat. Science 12, 279–300.
Schölkopf, B. and Smola, A. (2002). Learning with Kernels, MIT Press, Cambridge.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Silverman, B.W. (1985). Some aspects of the smoothing spline approach to nonparametric regression curve fitting (with discussion). Journ. Royal Statist. Soc. B 47, 1–52.
Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l’I.H.E.S 81 73–205.
Tarigan, B. and van de Geer, S.A. (2005). Support vector machines with $\ell_1$-complexity regularization. Submitted.
Tibshirani, R. (1996). Regression analysis and selection via the LASSO. Journal Royal Statist. Soc. B 58, 267–288.
Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32, 135–166.
Tsybakov, A.B. and van de Geer, S.A. (2005). Square root penalty: adaptation to the margin in classification and in edge estimation. To appear in Ann. Statist. 33.
van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge University Press.
van de Geer, S. (2001). Least squares estimation with complexity penalties. Mathematical Methods of Statistics 10, 355–374.
van de Geer, S. (2002). M-estimation using penalties or sieves. J. Statist. Planning Inf. 108, 55–69.
van de Geer, S. (2003). Adaptive quantile regression. In: Recent Advances and Trends in Nonparametric Statistics (Eds. M.G. Akritas and D.N. Politis), Elsevier, 235–250.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes, with Applications to Statistics. Springer, New York.
Vapnik, V.N. and Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Th. Probab. Appl. 16, 264–280.
Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: Soc. for Ind. and Appl. Math.
Wand, M.P. and Jones, M.L. (1995). Kernel Smoothing. Chapman and Hall, London.

Sara van de Geer