The Econometrics Journal (2009), volume 12, pp. Si–Sv. doi: 10.1111/j.1368-423X.2009.00293.x
Tenth Anniversary Special Issue
EDITORIAL

New Year 2008 marked the tenth anniversary of The Econometrics Journal, which was established in 1998 by the Royal Economic Society with the intention of creating a high-quality refereed general journal for the publication of econometric research, with a standard of intellectual rigour and academic standing similar to that of the pre-existing top international field journals in econometrics. To celebrate this event, a Special Issue of the journal was commissioned by inviting contributions from a number of leading scholars in econometrics whose research interests range across all aspects of the discipline.

The eleven papers that appear in this special issue deal with a number of topics of current research interest. Given the breadth of the discipline and the coverage of the papers collected here, they cannot easily be gathered under any single heading. However, some papers do fall rather loosely into a number of overlapping categories, and the ordering of the papers in this special issue reflects these groupings and their intersections as far as possible.

Many economic data are generated by stochastic processes that can be modelled as occurring in continuous time, with the data treated as realizations of random functions, i.e. functional data. The particular focus of the paper by Federico Bugni, Peter Hall, Joel Horowitz and George Neumann is a scenario in which economic theory may be described by a finite-dimensional parametric stochastic process and thereby explicitly or implicitly specifies the probability distribution of the process sample paths. A test that the theory model generated the data may be constructed by comparing the empirical and theoretical sample path distributions, i.e. a test of a finite-dimensional parametric model against a non-parametric alternative.
This paper generalizes the Cramér-von Mises approach to distributions of random functions, as a particular example of functional data approaches to tests of specification in econometrics. It also develops parametric bootstrap methods that facilitate the use of techniques based on integration over function spaces. The functional data approach not only presents a novel way of conceptualizing specification testing problems but can potentially provide a basis for new test methods for continuous-time models in finance as well as for the equilibrium search model considered in this paper.

© The Author(s). Journal compilation © Royal Economic Society 2009. Published by Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

The next two papers consider particular aspects of quantile regression. The paper by Elise Coudin and Jean-Marie Dufour concerns finite-sample and asymptotically valid distribution-free tests and confidence sets for the parameters of a linear median regression. The problem of interest consists in obtaining conditions under which signs are i.i.d. and follow a known distribution even though the underlying random variates are neither independent nor satisfy other regularity conditions. The setting employed allows for heteroskedasticity and nonlinear dependence of unknown form in the regression disturbances, as well as for discretely distributed random variates. Tests based on residual signs constitute a system for finite-sample exact inference under very general assumptions. An advantage of the sign-based inference methods considered in this paper is that no parametric assumption is imposed on the distribution of the regression error. Moreover, they avoid estimation of the estimator's asymptotic variance matrix, which can be particularly problematic for standard procedures. The procedures considered in the paper remain asymptotically valid with weakly exogenous regressors and stationary regression disturbances. Standard heteroskedasticity and autocorrelation consistent (HAC) methods permit sign-based statistics to be transformed appropriately, thereby eliminating nuisance parameters asymptotically. Consequently, the test method retains asymptotic validity, although at the expense of exactness. Furthermore, it is unnecessary to evaluate the disturbance density at zero, a major difficulty associated with the asymptotic kernel-based methods used in least absolute deviations (LAD) techniques. The performance of the proposed procedures is illustrated through a set of simulation experiments.

The particular concern of the paper by Xiaohong Chen, Roger Koenker and Zhijie Xiao is the specification and robust estimation of quantile autoregressive models for nonlinear time series. They note that many current methods for quantile estimation and prediction rely heavily on unrealistic global distributional assumptions. The paper proposes local, quantile-specific copula-based time-series models motivated by parametric copula models, which retain some semiparametric flexibility and should thereby offer some robustness over classical global parametric approaches. Parametric copula models are used to generate nonlinear-in-parameters quantile autoregression models that, by construction, possess monotone conditional quantile functions over the entire support of the conditioning variables.
However, rather than impose this global structure explicitly, the implied conditional quantile function at a particular quantile is assumed to be correctly specified and is used as the basis for estimation and inference. This distinction between global parametric and local quantile-specific models facilitates an analysis of potential misspecification of the global structure. Consistency and asymptotic normality of the quantile estimator are obtained under mild sufficient conditions requiring only stationarity and ergodicity of the underlying copula-based Markov model, without any mixing conditions. The results are particularly relevant for estimation of, and inference about, extreme conditional quantiles (value-at-risk) for financial time-series data, which typically display strong temporal dependence and tail dependence as well as heavy-tailed marginals.

Peter Robinson details a simple model that can explain spatial dependence in circumstances where observations may have been purged of spatial correlation. The model displays similarities to the well-known stochastic volatility model of financial econometrics, adapted to the spatial context. Parameter estimation is based on quasi maximum likelihood using logarithms of squared observations. An asymptotic theory is described in which consistency and asymptotic normality of the model parameter estimators are established, and related asymptotically valid tests for spatial dependence are presented. The simple model may be straightforwardly extended to incorporate spatial correlation in the observables and to include explanatory variables.

The next two papers deal with certain inferential issues that arise in time-series econometric models. The paper by Xu Cheng and Peter Phillips extends earlier work by the second author for the univariate context to the multivariate setting. In particular, the paper addresses the issue of cointegrating rank choice using information criteria.
If cointegration and cointegrating rank selection are of primary concern, a complete model is unnecessary for statistical purposes, allowing a reduced rank regression with a single lag to form the basis for model choice, with no explicit account taken of the short memory component. Standard information criteria, including BIC and Hannan-Quinn, are shown to be weakly consistent in the choice of cointegrating rank provided the penalty coefficient C_n satisfies C_n → ∞ and C_n/n → 0 as n → ∞. The paper also provides the limit distribution of the Akaike information criterion (AIC), which, as in the standard setting, is inconsistent. A general limit theory for semiparametric reduced rank regressions under weakly dependent errors is presented. The finite-sample performance of the criteria is studied in simulation experiments.

In their article, Miguel Delgado, Javier Hidalgo and Carlos Velasco extend their earlier work for observable time-series processes to dynamic regression models. In this setting the null hypothesis of interest concerns the absence of serial correlation in the regression errors. The paper proposes goodness-of-fit tests in which regressors are permitted to be only weakly exogenous and arbitrarily correlated with past shocks. The tests employ a linear transformation of Bartlett's T_p-process of the regression residuals. The linear transformation approximates the martingale component of the process, thereby ensuring that it converges weakly to standard Brownian motion under the null hypothesis. A feasible transformation might be based on a non-parametric smoothed estimator of the relevant cross-spectrum; alternatively, smoothing in the feasible martingale transformation can be avoided by using the (inconsistent) cross-periodogram directly. Nevertheless, the tests have non-trivial power against local alternatives converging to the null at the parametric root-n rate. A notable aspect of the tests is that there is no need to specify the dynamic structure of the regressors, hence avoiding restrictions on the class of local alternatives that the tests are able to detect, in contrast with tests that employ smoothing techniques.
A Monte Carlo study illustrates the finite-sample performance of the tests.

Bertille Antoine and Eric Renault examine weak identification characterized by drifting population moment conditions. The focus of the paper is on nearly weak identification, where the limit rank deficiency obtains at a rate δ_T slower than the standard root-T. Consequently, generalised method of moments (GMM) estimators of all parameters remain consistent, but at a rate potentially slower than root-T. The standard GMM-based Lagrange multiplier (LM) test remains asymptotically chi-square, in contrast to the weakly identified context. A comparative study of the power of the standard LM test and its modified weak-identification version indicates that the latter statistic can be relatively deficient in power in a nearly weakly identified environment. Moreover, both tests are asymptotically equivalent for rates δ_T slower than T^{1/4}, which the authors classify as nearly strong identification. A reparameterization obtained via a rotation in the parameter space results in the first components being estimated at the standard root-T rate and the others at the slower rate T^{1/2}/δ_T. Standard GMM formulae for asymptotic variance matrices are applicable only in the nearly strongly identified set-up. A Monte Carlo study using the consumption-based capital asset pricing model concludes the paper.

The paper by Donald Andrews and Sukjin Han is concerned with the finite-sample and asymptotic properties of a number of sampling methods for constructing confidence interval (CI) endpoints for partially identified parameters in models defined by moment inequalities. The particular emphasis is on the bootstrap and the m out of n bootstrap applied directly to construct CI endpoints. In general, these bootstrap methods are valid neither in finite samples nor in a uniform asymptotic sense when applied directly to construct CI endpoints.
Both backward and forward forms of the bootstrap, together with m out of n bootstrap CIs, are considered. The failure of the bootstrap arises because of the non-differentiability of the statistics of interest as functions of the underlying sample moments. Although the results described in the paper are for parametric versions of the bootstrap, the asymptotic properties of their non-parametric counterparts follow directly through their asymptotic equivalence. Moreover, asymptotic results for subsampling are identical to those for the non-parametric i.i.d. m out of n bootstrap provided the subsample size obeys certain restrictions, indicating that the m out of n bootstrap results should also apply to subsampling methods. The finite-sample and asymptotic properties of these sampling methods for CI endpoints are obtained in two simple models; since the methods fail even in these, their invalidity applies generally. Other methods for constructing confidence sets, e.g. inverting acceptance regions based on an Anderson-Rubin-type test statistic, based on subsampling and the m out of n bootstrap, are asymptotically valid in a uniform sense. Moreover, these confidence sets may be combined with a recentred bootstrap applied as part of a moment selection method for constructing critical values.

The final three papers address issues that may be broadly considered to arise in the area of programme evaluation. First, Charles Manski and John Pepper revisit the topic of their 2000 paper published in Econometrica. That paper introduced a monotone instrumental variable (MIV) assumption, weakening the traditional IV assumption of mean independence, i.e. that mean response is constant across subpopulations of persons with different values of an observed covariate, and thereby replacing a moment equality with a weak inequality restriction. The paper employs an explicit response model to contrast the content of MIV and traditional IV assumptions and to illustrate why MIV assumptions might reasonably be adopted in studies of the returns to schooling and production. The identifying power of MIV assumptions when combined with the homogeneous linear response assumption maintained in many studies of treatment response is examined, to provide an indication of the implications of the latter assumption.
The estimation of MIV bounds is reconsidered. An analysis of the finite-sample bias of analogue estimators for MIV bounds, and of their tendency to be narrower than the true bounds, is presented. The paper gives some simulation-based evidence on this bias and on the performance of a bias-correction method.

Next, the classical sample selection model is analysed in the paper by Whitney Newey. Here the selection correction term is treated semiparametrically rather than parametrically as in the standard case. The functional form of the selection term is assumed to be unknown and dependent on an index known up to a finite-dimensional vector of parameters for which a root-n consistent estimator is available. Although the semiparametric efficiency lower bound is known for this form of conditional moment restriction, no efficient estimator has yet been proposed. Least squares estimation, after substituting the estimator for the unknown index parameter and approximating the unknown selection term via either power series or splines, provides a very simple and straightforward estimation method for the regression parameters and a potentially attractive alternative to fully non-parametric procedures. The paper provides root-n consistency and asymptotic normality results for the regression parameter estimator, together with a consistent estimator of the estimator's asymptotic variance matrix.

Finally, it has been known for some time that, for choice-based samples, matching and selection methods are valid if the probability of selection into treatment is consistently estimated. James Heckman and Petra Todd demonstrate the empirically important result that these procedures remain valid even when the propensity score is estimated from unweighted choice-based samples and is thus inconsistently estimated.
In conclusion, I hope that the papers collected in this tenth anniversary special issue go some way towards achieving the original objective of the Royal Economic Society for The Econometrics Journal. I would also like to take this opportunity to extend the gratitude of The Econometrics Journal to the contributors for their submissions. Special thanks are owed to the referees of the papers comprising the tenth anniversary special issue, listed below, without whose assistance it would not have been possible.

V. Chernozhukov, V. Corradi, E. Guerre, P. Guggenberger, Y. Hong, M. Jansson, P. M. D. C. Parente, A. Patton, J. Pinkse, R. J. Smith, A. M. R. Taylor
Richard J. Smith (Managing Editor) University of Cambridge
The Econometrics Journal (2009), volume 12, pp. S1–S18. doi: 10.1111/j.1368-423X.2008.00266.x
Goodness-of-fit tests for functional data

FEDERICO A. BUGNI†, PETER HALL‡, JOEL L. HOROWITZ§ AND GEORGE R. NEUMANN¶

†Department of Economics, Northwestern University, Evanston, IL 60208-2600, USA
E-mail: [email protected]
‡Department of Mathematics and Statistics, University of Melbourne, Melbourne, VIC 3010, Australia
E-mail: [email protected]
§Department of Economics, Northwestern University, Evanston, IL 60208-2600, USA
E-mail: [email protected]
¶Department of Economics, University of Iowa, Iowa City, IA 52242-1000, USA
E-mail: [email protected]

First version received: July 2008; final version accepted: August 2008
Summary  Economic data are frequently generated by stochastic processes that can be modelled as occurring in continuous time. That is, the data are treated as realizations of a random function (functional data). Sometimes an economic theory model specifies the process up to a finite-dimensional parameter. This paper develops a test of the null hypothesis that a given functional data set was generated by a specified parametric model of a continuous-time process. The alternative hypothesis is non-parametric. A random function is a form of infinite-dimensional random variable, and the test presented here is a generalization of the familiar Cramér-von Mises test to an infinite-dimensional random variable. The test is illustrated by using it to test the hypothesis that a sample of wage paths was generated by a certain equilibrium job search model. Simulation studies show that the test has good finite-sample performance.

Keywords: Bootstrap, Cramér-von Mises test, Equilibrium search model, Functional data analysis, Hypothesis testing.
1. INTRODUCTION

Economic data are frequently generated by stochastic processes that can be modelled as occurring in continuous time. The data may then be treated as realizations of random functions (functional data). Examples include wage paths and asset prices or returns. Sometimes economic theory provides a parametric model for the data. That is, economic theory may provide a stochastic process that is known up to a finite-dimensional parameter and may be the process that generated the data. For example, certain equilibrium job search models specify the wage process up to a finite-dimensional parameter, and certain diffusion models specify an asset's price or returns process up to a finite-dimensional parameter. In such cases, it is natural to test
the theory model against the data. More specifically, it is natural to test the hypothesis that, for some value of its parameter, the theory model is a correct specification of the data-generation process. This paper describes a method for carrying out such a test.

A theory model of a stochastic process explicitly or implicitly specifies the probability distribution of the random functions (or sample paths) that are realizations of the process. If the theory model depends on an unknown finite-dimensional parameter, which we assume to be the case here, the specification is up to the value of this parameter. Functional data can be used to form an empirical analogue of the probability distribution of the random functions (the empirical distribution of the data). Therefore, a test of the hypothesis that the theory model generated the data can be made by comparing the empirical and theoretical distributions of the sample paths. This amounts to testing a finite-dimensional parametric model of a probability distribution against a non-parametric alternative. When the random variable of interest is finite-dimensional, the Cramér-von Mises and Kolmogorov-Smirnov tests, among many others, can be used for this purpose, but these tests do not apply to random functions, which are infinite-dimensional random variables.¹

The test described in this paper generalizes the Cramér-von Mises test to distributions of random functions, or infinite-dimensional random variables, that depend on an unknown finite-dimensional parameter. Novel aspects of our contribution include the introduction of functional data approaches to specification testing in econometrics, and the development of parametric bootstrap methods that facilitate the use of techniques based on integration over function spaces.
The functional data view offers new ways of conceptualizing specification testing problems and can lead to new approaches for testing continuous-time models, such as models of financial data that are quite different from the equilibrium search model that motivates the present work.

More specifically, suppose that the distribution of a random function Y depends on an unknown, finite-dimensional parameter θ and that we have a random sample X = {X_1, ..., X_n} of n realizations of a random function X that may be distributed as Y for some value of θ. We develop a Cramér-von Mises type test of the null hypothesis, H0, that "the distribution of X is identical to that of Y for some unspecified value of θ." A mathematically concise interpretation of the phrase in quotation marks will be given in the first paragraph of Section 2.1. The paper presents the test statistic and explains how to compute it, derives the test statistic's asymptotic distribution under H0 and local alternative hypotheses, and presents a bootstrap procedure for computing the critical value of the test.

We illustrate the use of the test by applying it to an equilibrium job search model (Mortensen, 1990; Burdett and Mortensen, 1998; Bowlus et al., 2001; Christensen et al., 2005). This model aims to explain the frequencies and durations of spells of unemployment as well as the distribution of wages among employed individuals. In particular, the model provides an explanation for why seemingly identical individuals have different wages. One of the model's outputs is a random function, Y say, that gives an individual's wage as a function of time up to an unknown, vector-valued parameter. We also have data on wage paths of a random sample of individuals. Our test allows us to assess whether the equilibrium search model provides a correct description of the wage process.
If the distribution specified by the theory model did not depend on an unknown parameter, so that the distribution of Y were completely known, then the permutation test of Hall and Tajvidi (2002) would be an alternative to the test presented here. However, we have found through Monte Carlo experimentation that the finite-sample performance of the Hall-Tajvidi test is poor when the null hypothesis distribution depends on an unknown parameter. In particular, the test has low power, and the probability that it rejects a correct H0 greatly exceeds the nominal rejection probability. We present Monte Carlo evidence indicating that the differences between the true and nominal rejection probabilities of our Cramér-von Mises type test are small.

A variety of other tests can be considered. One possibility is the development of adaptive methods in which the weight function of our Cramér-von Mises test (that is, the measure μ in Section 2.1 of this paper) is chosen to optimize power against a specific class of alternatives. Cuesta-Albertos et al. (2006) and Cuesta-Albertos et al. (2007) have shown that a class of tests based on random projections is consistent against certain location-scale families. The equilibrium search model that we consider here is not of this type, however, and it is unknown whether the random-projection tests are consistent under conditions more general than those considered in the two foregoing papers. In Monte Carlo experiments using the designs of Cuesta-Albertos et al. (2007), we compared the power of our Cramér-von Mises test with that of the random-projections test. The results of the experiments are reported in Section 4.3. In every case the power of the Cramér-von Mises test is similar to or greater than the power of the random-projections test.

There is a large econometrics literature on specification testing but, in contrast to the test in this paper, it applies to data consisting of finite-dimensional vectors rather than functions. Some of the existing work addresses specification testing problems for dynamic processes and processes that are observed in continuous time. See, e.g. Cay and Hong (2003), Hong and Haito (2005), Guay and Guerre (2006) and Kim and Wang (2006).

¹ Durbin (1973a, b) and Pollard (1984) discuss the Kolmogorov-Smirnov and Cramér-von Mises tests of distributions that depend on an unknown parameter.
Many econometric specification testing problems arise in a context that is semi-parametric or non-parametric. Examples include Fan and Li (1996, 2000, 2002), Guerre and Lavergne (2002, 2005), Horowitz and Spokoiny (2001) and Miles and Mora (2003). There is also an extensive statistics literature on functional data analysis. Much of this literature is synthesized in the books by Ramsay and Silverman (2002, 2005). Principal components analysis in statistics plays a role in methods for computing our test statistic. Recent work in that context includes Boente and Fraiman (2002), He et al. (2003), Yao et al. (2005), Hall and Hosseini-Nasab (2006) and Jank and Shmueli (2006).

Virtually all problems in functional data analysis can be reformulated to permit treatment with finite-dimensional methods. In particular, the functional testing problem dealt with in this paper can be made finite-dimensional by testing only finitely many features of the parametric model instead of the entire stochastic process. However, this approach has several drawbacks. First, depending on the features that are chosen for testing, a finite-dimensional test may be inconsistent against important deviations of the data-generation process from the theory model. Choosing an appropriate low-dimensional approximation and its associated test can be quite difficult in practice. Even if the model is finite-dimensional, the test needs to be sensitive to relatively complex, high-dimensional departures from the null hypothesis, for example as represented by the shapes of random functions. The functional approach avoids this problem. Second, the accuracy of finite-dimensional asymptotic approximations tends to deteriorate as the dimension of the object being tested increases. Specifically, the difference between the true and nominal probabilities of rejecting a correct null hypothesis tends to increase as the dimension of the distribution being tested increases.
The functional approach avoids this problem by developing an asymptotic approximation that is specifically designed for infinite-dimensional data. Finally, the functional approach avoids having to explicitly model the correlation between function values at nearby points in their domain.
Section 2 of this paper describes the test statistic and methods for computing the statistic and its critical value. Section 3 presents the test’s theoretical properties. Section 4 presents the empirical application and Monte Carlo results. The proofs of theorems are given in the Appendix.
2. THE TEST PROCEDURE

This section describes the test statistic and its implementation. Section 2.1 presents the statistic. Sections 2.2 and 2.3, respectively, explain how to compute the test statistic and the critical value.

2.1. The test statistic

Assume that the random functions X and Y are defined on a bounded interval, I, which we take to be [0, 1]. Let L²[0, 1] denote the space of square-integrable functions on [0, 1] and let ‖·‖ denote the L² norm. We assume that X and Y are both in L²[0, 1] (so that ‖X‖, ‖Y‖ < ∞) with probability 1. Note that this condition accommodates unbounded random functions such as smooth Gaussian processes defined on compact intervals. It is not equivalent to requiring P(‖X‖ ≤ C, ‖Y‖ ≤ C) = 1 for some finite constant C > 0. Define the distribution functionals of X and Y, respectively, by

F_X(x) = P[X(t) ≤ x(t) for all t ∈ I]  and  F_Y(x|θ) = P[Y(t) ≤ x(t) for all t ∈ I],

where the non-stochastic function x(·) is the argument of the distribution functional and θ is the finite-dimensional parameter on which the distribution of Y depends. Assume that θ is contained in a parameter set Θ ⊂ R^p for some finite p > 0. The null hypothesis that we test is

H0 : F_X(x) = F_Y(x|θ) for some θ ∈ Θ and all x ∈ L²[0, 1].

The alternative hypothesis, H1, is that there is no θ ∈ Θ for which H0 holds. Basing the definition of H0 on the distribution functionals F_X and F_Y is natural because F_X = F_Y implies that the finite-dimensional distributions associated with F_X and F_Y coincide, implying that X and Y correspond to the same probability measure. We note that the sets {X(t) ≤ x(t) for all t ∈ I} and {Y(t) ≤ x(t) for all t ∈ I} are measurable. For example, X(t) ≤ x(t) for all t ∈ I is equivalent to sup_{t∈I}[X(t) − x(t)] ≤ 0, and the supremum is measurable whenever the functions X and x are measurable.

Let the data be a random sample of X: {X_i : i = 1, ..., n}. Because X is a function, each X_i is also a function, X_i(t), on the interval [0, 1].
For example, in the empirical application presented in Section 4, each X_i is the wage path of a randomly sampled individual. The empirical distribution functional of the data is defined as

F̂_X(x) = n⁻¹ Σ_{i=1}^n I[X_i(t) ≤ x(t) for all t ∈ I],
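On discretized sample paths, the empirical distribution functional reduces to counting how many observed paths lie entirely below the argument function. A minimal sketch under the assumption that each path is recorded on a common, equally spaced grid (the data and function names here are illustrative, not from the paper):

```python
import numpy as np

def empirical_df(paths, x):
    """Empirical distribution functional F_X-hat evaluated at x.

    paths : (n, T) array, each row a sample path X_i on a grid of [0, 1]
    x     : (T,) array, the argument function on the same grid
    Returns the fraction of paths with X_i(t) <= x(t) at every grid point.
    """
    below = np.all(paths <= x, axis=1)  # indicator I[X_i(t) <= x(t) for all t]
    return below.mean()

# toy example: three constant paths at heights 0.2, 0.5 and 0.9
grid = np.linspace(0.0, 1.0, 11)
paths = np.vstack([np.full_like(grid, h) for h in (0.2, 0.5, 0.9)])
x = np.full_like(grid, 0.6)
print(empirical_df(paths, x))  # 2 of 3 paths lie below x everywhere -> 2/3
```

The grid discretization approximates the "for all t ∈ I" condition; a finer grid gives a closer approximation when the paths are continuous.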
where I (·) is the indicator function. Let θˆ be an estimator of θ that is consistent under H 0 . Then ˆ H 0 is rejected if the “distance” between FˆX H 0 can be tested by comparing FˆX with FY (·|θ). ˆ is too large in some metric. and FY (·|θ) In practice, FY (·|θ ) may not be available in a convenient, analytic form. However, FY (x|θ ) can be estimated for any x and θ and with any desired level of accuracy if sample paths of Y can be generated by simulation. Specifically, let {Y 1 (t), . . . Ym (t)} be m sample paths that are generated by simulation from the Y process with a specified value of θ . Then FY (·|θ ) is estimated consistently by the empirical distribution functional of the simulated paths: FˆY (x|θ ) = m−1
m
I [Yi (t) ≤ x(t) for all t ∈ I].
i=1
The random sampling errors of FˆY can be made arbitrarily small by making m sufficiently large. ˆ is too large. Therefore, our test rejects H 0 if the distance between FˆX and FˆY (·|θ) If X and Y were finite-dimensional random variables, then the Cram´er-von Mises test would ˆ Thus, the test consist of using the L 2 metric to measure the distance between FˆX and FˆY (·|θ). statistic would be ˆ 2 dη(z), TCvM = [FˆX (z) − FY (z|θ)] where η is Lebesgue measure on the support of X and Y . A generalization to the case of random functions, which are infinite-dimensional random variables, can be obtained by replacing η with a probability measure on L 2 [0, 1]. Let μ be such a measure. The resulting test statistic is ˆ (2.1) T (X |θ) = [FˆX (x) − FˆY (x|θˆ )]2 dμ(x). ˆ is too large. Section 2.2 explains how T (X |θ) ˆ can be computed. The test rejects H 0 if T (X |θ) The measure μ is analogous to the weight function that can enter the finite-dimensional Cram´er-von Mises statistic and many other test statistics. As in finite-dimensional testing, μ or the weight function cannot be selected empirically as this would require knowing how the true data-generation process differs from the parametric model. Rather, one chooses a measure μ or a weight function that is tractable computationally and assigns relatively high probability to regions in the space of alternatives against which one wants good power. In Section 2.2, we propose using a measure based on a Gaussian process. This emphasizes deviations from the null hypothesis that are relatively stable. If, however, we were concerned with highly erratic deviations from the null hypothesis, we would choose a measure corresponding to an erratic stochastic process (e.g. a process with few finite moments). It is also possible to construct a Kolmogorov-Smirnov type test of H 0 . The test statistic is ˆ = sup |FˆX (x) − FˆY (x|θ)|. ˆ TKS (X |θ) x<∞
We prefer the Cramér–von Mises test for two reasons. First, in finite-dimensional settings the Cramér–von Mises test tends to be more powerful than the Kolmogorov–Smirnov test. For example, this occurs when FX and FY differ by a relatively small amount over a moderately large region, and it reflects the greater sensitivity of the L2 metric compared to the L∞ metric in such cases. This property can be expected to carry over to functional data. Second, in the functional data case, the Cramér–von Mises statistic is easier to compute.
S6
F. A. Bugni et al.
2.2. Computing the test statistic

This section presents a Monte Carlo procedure for computing T(X|θ̂) in applications. Observe that T(X|θ̂) is the average of [F̂X(·) − F̂Y(·|θ̂)]² relative to the probability measure μ. Thus, if the distribution represented by μ can be sampled, Monte Carlo integration can be used to compute T(X|θ̂). Specifically, suppose that {Z1(t), ..., ZJ(t)} are functions that are sampled randomly from the distribution μ. Then T(X|θ̂) can be estimated by

$$ T_J(X|\hat{\theta}) = J^{-1}\sum_{j=1}^{J} [\hat{F}_X(Z_j) - \hat{F}_Y(Z_j|\hat{\theta})]^2. \qquad (2.2) $$
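For concreteness, the empirical distribution functionals and the Monte Carlo approximation (2.2) can be sketched in code. This is a minimal sketch, assuming all sample paths are stored on a common time grid; the function names (`edf_functional`, `cvm_statistic`) are illustrative, not from the paper:

```python
import numpy as np

def edf_functional(paths, x):
    """Empirical distribution functional evaluated at the function x:
    the fraction of sample paths lying at or below x at every grid point.
    paths: (n, T) array of sample paths; x: (T,) array on the same grid."""
    return np.mean(np.all(paths <= x, axis=1))

def cvm_statistic(X_paths, Y_paths, Z_paths):
    """Monte Carlo approximation T_J in (2.2): the average, over functions
    Z_j drawn from the measure mu, of the squared difference between the
    empirical distribution functionals of the data and of the simulated model."""
    diffs = [edf_functional(X_paths, z) - edf_functional(Y_paths, z)
             for z in Z_paths]
    return np.mean(np.square(diffs))
```

The statistic is zero when the data paths and the simulated model paths induce identical empirical distribution functionals at every Zj.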
ˆ and T (X |θ) ˆ can be made arbitrarily small by making J The difference between TJ (X |θ) sufficiently large. To generate the random functions {Zj }, let {φ k : i = 1, 2, . . . } be a complete, orthonormal basis for L 2 [0, 1]. Then any function Z that is sampled from μ can be written in the form Z(t) =
∞
bk φk (t),
(2.3)
k=1
where the Fourier coefficients {bk} are random variables satisfying

$$ b_k = \int_0^1 Z(t)\phi_k(t)\,dt $$

and

$$ \sum_{k=1}^{\infty} b_k^2 < \infty \qquad (2.4) $$
with probability 1. Thus, sample paths Zj can be generated randomly by sampling the bk's randomly from some distribution such that (2.4) holds with probability 1. The distribution of the bk's implies the measure μ, so μ can be specified by specifying the distribution of the bk's and the basis functions {φk}. This is convenient, because it ensures that μ is a probability distribution on L2[0, 1]. In the remainder of this paper, we specify μ by setting bk = ρkNk, where the ρk's are non-stochastic constants satisfying Σ(k=1 to ∞) ρk² < ∞, the Nk's are random variables that are independently distributed as N(0, 1) and the φk's are sine functions. We truncate the infinite sum in (2.3), so the formula for generating Z's in practice is

$$ Z_M(t) = \sum_{k=1}^{M} b_k \phi_k(t), \qquad (2.5) $$
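A draw from the truncated series (2.5) can be sketched as follows, taking φk(t) = 2^{1/2} sin(kπt) and one admissible specification of the ρk's, namely ρk = k^(−d) with d > 1 (which satisfies Σρk² < ∞); the grid size and the value of d are illustrative choices:

```python
import numpy as np

def sample_Z(M, tgrid, d=1.5, rng=None):
    """Draw one realization of Z_M(t) = sum_{k=1}^M b_k phi_k(t) from (2.5),
    with b_k = rho_k N_k, rho_k = k^(-d) (so that sum rho_k^2 < infinity),
    N_k i.i.d. N(0,1), and phi_k(t) = sqrt(2) sin(k pi t)."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, M + 1)
    b = k ** (-float(d)) * rng.standard_normal(M)            # b_k = rho_k N_k
    phi = np.sqrt(2.0) * np.sin(np.outer(k, np.pi * tgrid))  # (M, T) basis values
    return b @ phi                                           # Z_M on the grid
```

Each call returns one random function evaluated on the supplied grid; repeated calls give the i.i.d. draws Z1, ..., ZJ used in (2.2).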
for some M < ∞. In Section 3.3, we show that the measure induced by ZM converges to μ in an appropriate sense as M → ∞. If FX ≠ FY(·|θ0), then the probability that our test detects the discrepancy converges to 1 as n → ∞, provided that FX(x) − FY(x|θ0) is non-zero on a set of functions x that has non-vanishing μ measure. See the discussion following Theorem 3.2. This condition is satisfied if μ corresponds to a non-degenerate Gaussian process on [0, 1]. This fact and considerations of numerical simplicity motivate our decision to let μ be the measure of a Gaussian process. Of course, there are other possibilities. For example, if the principal components of the processes
Goodness-of-fit tests for functional data
S7
X and Y have heavy-tailed distributions, then one might consider taking the variables Nk to be heavy-tailed as well, possibly Student-t distributed with a low number of degrees of freedom.

2.3. Bootstrap computation of the critical value

In Section 3.1, we show that the test statistic T(X|θ̂) has a non-standard asymptotic distribution that depends on unknown population parameters and, therefore, cannot be tabulated. Consequently, it is not convenient to obtain the critical value of the test from the analytic formula for the asymptotic distribution. Instead, we use the bootstrap. The bootstrap procedure is as follows.

1. Generate a bootstrap sample of random functions, X* ≡ {X1*, ..., Xn*}, by simulation from the population having the distribution of Y with θ set equal to θ̂. That is, X* is generated by simulating the Y process with θ = θ̂.
2. Estimate θ from the bootstrap sample, thereby obtaining the estimate θ̂*. Also, compute the bootstrap version of the test statistic, T(X*|θ̂*). This is done by replacing X with X* and θ̂ with θ̂* in (2.2).
3. Repeat steps 1–2 many times, and use the results to find the empirical distribution function of nT(X*|θ̂*). Estimate the α-level critical value of the test by the 1 − α quantile of the empirical distribution of nT(X*|θ̂*).
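The three steps above can be sketched as a generic loop. This is a schematic sketch only: the callables `simulate_Y`, `estimate_theta` and `test_stat` stand in for the model simulator, the estimator of θ and the statistic (2.2), and their interface is an assumption made here for illustration:

```python
import numpy as np

def bootstrap_critical_value(n, theta_hat, simulate_Y, estimate_theta,
                             test_stat, alpha=0.05, B=199):
    """Bootstrap critical value for the scaled statistic n*T(X|theta_hat).
    Step 1: simulate a bootstrap sample from the model at theta_hat.
    Step 2: re-estimate theta and compute the bootstrap statistic.
    Step 3: return the (1 - alpha) quantile of n*T(X*|theta*) over B draws."""
    stats = np.empty(B)
    for b in range(B):
        X_star = simulate_Y(theta_hat, n)      # step 1: bootstrap sample
        theta_star = estimate_theta(X_star)    # step 2: re-estimate theta
        stats[b] = n * test_stat(X_star, theta_star)
    return np.quantile(stats, 1.0 - alpha)     # step 3: 1 - alpha quantile
```

The test then rejects H0 at nominal level α if nT(X|θ̂) exceeds the returned value.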
In Section 3.2, we show that this bootstrap procedure consistently estimates the asymptotic critical value of T(X|θ̂) and yields the correct asymptotic probability of rejecting a correct H0.²
² It is also possible to carry out bootstrap iteration. The procedure for this is as follows. Conditional on the sample X and the bootstrap sample X*, draw a sample X** = {X1**, ..., Xn**} by sampling from the distribution of Y with θ = θ̂*. Estimate θ from this sample, thereby obtaining the estimate θ̂** and the test statistic T(X**|θ̂**). Repeat this process many times, and for 0 < β < 1 let tβ*(X*) denote the 1 − β quantile of the resulting empirical distribution of T(X**|θ̂**). Let q̂(β) = P*[T(X*|θ̂*) > tβ*(X*)], where P* denotes the probability measure induced by sampling the distribution of Y with θ = θ̂. For an α-level test, define β̂ to be the solution of q̂(β̂) = α. Reject H0 if T(X|θ̂) > tβ̂(X), where tβ̂(X) is the 1 − β̂ quantile of the bootstrap distribution of T(X*|θ̂*). Bootstrap iteration provides asymptotic refinements in many settings. We do not investigate refinements here. The Monte Carlo results reported in Section 4 indicate that the bootstrap procedure for obtaining the critical value works well without iteration.

3. THEORETICAL PROPERTIES

This section presents theorems that (1) give the asymptotic distribution of T(X|θ̂) under H0, fixed alternative hypotheses and local alternative hypotheses; (2) establish the validity of the bootstrap procedure of Section 2.3 for obtaining the critical value of T(X|θ̂); and (3) show that the truncated measure induced by ZM in (2.5) converges in an appropriate sense as M → ∞. The theory is developed for T(X|θ̂), not TJ(X|θ̂), because the difference between T(X|θ̂) and TJ(X|θ̂) can be made arbitrarily small by making J sufficiently large in (2.2).

3.1. The asymptotic distribution

This section presents the asymptotic distribution of T(X|θ̂) under H0, fixed alternative hypotheses and local alternative hypotheses. The asymptotic distribution is non-standard and
depends on unknown population parameters, so it is not useful for obtaining critical values for T(X|θ̂). The bootstrap procedure of Section 2.3 is used for that purpose. However, the asymptotic distributional results show that the test is consistent against fixed alternative hypotheses and that it has power against local alternatives whose distance from the null-hypothesis distribution, FY, is O(n^(−1/2)).

The asymptotic distribution of T(X|θ̂) depends on the asymptotic properties of θ̂. We assume that as n → ∞, θ̂ converges in probability to a unique, non-stochastic limit, θ0. We also assume that n^(1/2)(θ̂ − θ0) has the representation

$$ n^{1/2}(\hat{\theta} - \theta_0) = n^{-1/2}\sum_{i=1}^{n} \Psi(X_i) + o_p(1) \qquad (3.1) $$
as n → ∞, where Ψ is a p-vector valued function that is square-integrable with respect to μ and is such that EΨ(X) = 0 and cov[Ψ(X)] is non-singular. The estimator

$$ \hat{\theta} = \arg\min_{\theta \in \Theta} T(X|\theta) $$

has these properties under mild regularity conditions. Many other estimators also have these properties.

We use the following additional notation. Define the p-vector Ḟ(·|θ) = ∂F(·|θ)/∂θ. Let ζ be the Gaussian process on [0, 1] having the same covariance structure as the indicator process I(X ≤ x) ≡ I[X(t) ≤ x(t) for all t ∈ I]. That is, the covariance function is

$$ \psi(x_1, x_2) = \operatorname{cov}[\zeta(x_1), \zeta(x_2)] = F_X(x_1 \wedge x_2) - F_X(x_1)F_X(x_2), $$

where x1 ∧ x2 denotes the function that equals x1(t) ∧ x2(t) for each t ∈ [0, 1]. Let ξ be a p-variate normal random variable whose mean is 0, whose covariance matrix is V, and which satisfies E[ξζ(x)] = EΨ(X)[I(X ≤ x) − FX(x)]. We make the following assumptions.

ASSUMPTION 3.1. The functional data X ≡ {X1(·), ..., Xn(·)} are an independent random sample from the population whose distribution functional is FX.

ASSUMPTION 3.2. (i) θ0 is uniquely defined. (ii) n^(1/2)(θ̂ − θ0) has the asymptotic representation (3.1). Moreover, EΨ(X) = 0, V is finite and non-singular, and ∫ Ψ(x)′Ψ(x) dμ(x) < ∞.
ASSUMPTION 3.3. Ḟ(·|θ) exists for all θ in an open set O that contains θ0. Moreover,

$$ \sup_{\theta \in O} \int \dot{F}(x|\theta)'\dot{F}(x|\theta)\,d\mu(x) < \infty $$

and

$$ \lim_{\varepsilon \to 0} \sum_{i,j=1}^{p} \sup_{\|\theta - \theta_0\| \le \varepsilon} \int [\dot{F}_i(x|\theta) - \dot{F}_i(x|\theta_0)][\dot{F}_j(x|\theta) - \dot{F}_j(x|\theta_0)]\,d\mu(x) = 0, $$
where Ḟi denotes the ith component of Ḟ and ‖θ − θ0‖ is the Euclidean distance between θ and θ0.

ASSUMPTION 3.4. μ is the measure induced by the Gaussian process

$$ Z(t) = \sum_{k=1}^{\infty} \rho_k N_k \phi_k(t); \quad 0 \le t \le 1, $$
where 0 < |ρk| ≤ Ck^(−d) for all k and some constants C < ∞ and d > 1, the Nk's are independent standard normal random variables, and φk(t) = 2^(1/2) sin(kπt).

The independence requirement of Assumption 3.1 precludes applying our test to the path of prices or returns of a single financial asset. However, the test can be applied to a portfolio of assets whose prices or returns move independently after removal of any common trends.³ Assumption 3.4 ensures that, with probability 1, functions sampled from the population with distribution μ are bounded and in L2[0, 1]. Other basis functions and distributions of the Nk's could be used. For example, the basis could be cosine functions or sines and cosines together.

Our asymptotic distributional result treats the following three cases:

1. H0 is true. That is, θ0 ∈ O and FX(·) = FY(·|θ0).
2. H0 is false, and FX constitutes a sequence of local alternatives. That is,
   FX(·) = FY(·|θ0) + n^(−1/2) D(·)
   for some θ0 ∈ O, where D is a bounded functional on L2[0, 1].
3. FX is fixed and H0 is false. That is, there is no θ ∈ Θ such that FX(·) = FY(·|θ).

Observe that case 1 is identical to case 2 with D = 0. We now have the following theorem.
THEOREM 3.1. Let Assumptions 3.1–3.4 hold. In cases 1 and 2, nT(X|θ̂) →d V, where

$$ V = \int [\zeta(x) + D(x) + \dot{F}_Y(x|\theta_0)'\xi]^2\,d\mu(x) \qquad (3.2) $$

and D = 0 in case 1. In case 3,

$$ T(X|\hat{\theta}) \to_p \int [F_X(x) - F_Y(x|\theta_0)]^2\,d\mu(x). \qquad (3.3) $$
Result (3.3) implies that the test is consistent against fixed alternative hypotheses. Result (3.2) gives the distribution of the test statistic under the null hypothesis (D = 0) and contiguous alternatives (D ≠ 0). In particular, (3.2) implies that the test has non-trivial asymptotic power (that is, asymptotic power exceeding the probability of rejecting a correct null hypothesis) against alternatives whose distance from the null hypothesis is O(n^(−1/2)). From some points of view, cases 1 and 2 of Theorem 3.1 can be interpreted as extensions, to the setting of functional data, of work of Neuhaus (1971, 1976) and Behnen and Neuhaus (1975) on limit theory under contiguous alternatives.

³ If the prices or returns of a single asset are weakly dependent, then it may be possible to apply a version of our test to data consisting of blocks of prices or returns. However, the investigation of this extension is beyond the scope of this paper.
An alternative representation of V at (3.2) is V = Σ(j=1 to ∞) Wj², where

$$ W_j = \int [\zeta(x) + D(x) + \dot{F}_Y(x|\theta_0)'\xi]\,\psi_j(x)\,d\mu(x) $$

and ψ1, ψ2, ... is an orthonormal sequence of eigenfunctionals of the linear operator γ that takes a function ψ to

$$ \gamma(\psi) = \int F(x|\theta_0)\psi(x)\,d\mu(x). $$
This representation of V can be regarded as an extension of Neuhaus' (1976) result concerning the power of Cramér–von Mises tests under contiguous alternatives.

3.2. Consistency of the bootstrap

This section establishes the validity of the bootstrap procedure of Section 2.3 for estimating the critical value of T(X|θ̂). Let V0 be the random variable

$$ V_0 = \int [\zeta(x) + \dot{F}_Y(x|\theta_0)'\xi]^2\,d\mu(x). $$

It follows from Theorem 3.1 that nT(X|θ̂) is asymptotically distributed as V0 when H0 is true. Let sα denote the asymptotic α-level critical value of the test based on T(X|θ̂). Then P(V0 > sα) = α. Let P* denote the probability measure that is induced by bootstrap sampling. The bootstrap α-level critical value, sα*, is the solution to P*[nT(X*|θ̂*) > sα*] = α. The Cramér–von Mises test based on the bootstrap critical value rejects H0 at the nominal α level if nT(X|θ̂) > sα*. The true rejection probability is P[nT(X|θ̂) > sα*]. The following theorem shows that the true rejection probability approaches the nominal level α as n → ∞.

THEOREM 3.2. Let Assumptions 3.1–3.4 hold. Then sα* →p sα in each of the three cases of Theorem 3.1. Moreover, if H0 is true, then

$$ \lim_{n \to \infty} P[nT(X|\hat{\theta}) > s_\alpha^*] = \alpha. $$
It follows immediately from Theorems 3.1 and 3.2 that if FX is fixed and does not satisfy H0, then the probability of rejecting H0 approaches 1 as n → ∞ whenever FX(x) − FY(x|θ0) is non-zero on a set of non-vanishing μ measure. Thus, the Cramér–von Mises test based on the bootstrap critical value is consistent. Moreover, in case 2, where FX is the sequence of local alternatives FX(·) = FY(·|θ0) + n^(−1/2)D(·), the probability that the bootstrap-based test rejects H0 converges to P(V > sα). If we set D = cD0, where c is a constant and D0 is a functional that is non-zero on a set of positive μ measure, then

$$ \lim_{|c| \to \infty} \lim_{n \to \infty} P[nT(X|\hat{\theta}) > s_\alpha^*] = 1. $$

Thus, as the local alternative distributions in case 2 move further from H0, the Cramér–von Mises test with a bootstrap critical value detects them with probability approaching 1.
3.3. Convergence of the finite-dimensional approximation to μ

The statistic T(X|θ̂) is an integral with respect to the infinite-dimensional measure, μ, that is induced by the process Z defined in (2.3). Section 2.2 proposes approximating the integral by replacing μ with the finite-dimensional measure, μM, that is induced by the process ZM defined in (2.5). This section shows that integrals with respect to μM converge to the corresponding integrals with respect to μ as M → ∞.

Let L2(R∞) denote the set of all infinite sequences b = {b1, b2, ...} of real numbers such that Σ(i=1 to ∞) bi² < ∞. Let {φi : i = 1, 2, ...} be the basis functions that are used in (2.3) and (2.5). Let A be a μ-measurable subset of L2[0, 1]. For each function a ∈ A there is a sequence b ∈ L2(R∞) that is defined by

$$ b_i = \int_0^1 a(t)\phi_i(t)\,dt. \qquad (3.4) $$
Define the sets

$$ B = \{b : b_i \text{ is given by (3.4) for some } a \in A\}, $$

$$ B_M = \{(b_1, \ldots, b_M) : (b_1, b_2, \ldots) \in B \text{ for some } (b_{M+1}, b_{M+2}, \ldots)\} $$

and

$$ A_M = \left\{ \sum_{i=1}^{\infty} b_i \phi_i : (b_1, \ldots, b_M) \in B_M \text{ and } \sum_{i=1}^{\infty} b_i^2 < \infty \right\}. $$
Let Pb denote the probability measure on L2(R∞) that corresponds to μ. Then it suffices to show that

$$ \lim_{M \to \infty} P_b(A_M) = P_b(A) \qquad (3.5) $$

for any A in the Borel sigma field of subsets of L2[0, 1].

THEOREM 3.3. Equation (3.5) holds for any A in the Borel sigma field of subsets of L2[0, 1].
4. EMPIRICAL APPLICATION AND MONTE CARLO EXPERIMENTS

Section 4.1 presents an empirical application of our test. Section 4.2 presents the results of a Monte Carlo investigation of the finite-sample performance of the test using designs based on the empirical application.

4.1. Empirical application

This section reports an application of our Cramér–von Mises test to an equilibrium job search model. The model was proposed by Mortensen (1990) and Burdett and Mortensen (1998). Bowlus et al. (2001), Christensen et al. (2005) and Bontemps et al. (2000) have estimated the model empirically. The application described here consists of testing the hypothesis that data on the wage paths of a certain sample of individuals were generated by the model.
We begin with a brief description of the model. Job offers arise according to a Poisson process with arrival rate λ0 for unemployed individuals and λ1 for employed individuals. An individual's job is eliminated, initiating a period of involuntary unemployment, according to a Poisson process with rate δ. Each job offer is accompanied by a wage, W, that is sampled from a distribution, the distribution function of which is denoted by F. An employed individual accepts the offer if W exceeds his current wage. An unemployed individual accepts the offer if W exceeds his reservation wage, r. Firms (employers) choose their offers to maximize profits. There are J different types of firms that are distinguished by their productivities, Pj (j = 1, ..., J). A firm of type j makes the offer W that solves

$$ \max_{w}\; (P_j - w)\,\ell(w), $$

where ℓ(w) is the number of individuals the firm will employ if it offers wage w. Mortensen (1990) showed that this leads to the following distribution of offers:

$$ F(w) = \frac{1+\kappa_1}{\kappa_1}\left[1 - \frac{1+\kappa_1(1-\gamma_{j-1})}{1+\kappa_1}\left(\frac{P_j - w}{P_j - w_{H,j-1}}\right)^{1/2}\right]; \quad w_{L,j} < w \le w_{H,j}, $$

where κ0 = λ0/δ, κ1 = λ1/δ, wL,j is the lowest wage offered by a firm of type j, wH,j is the highest wage offered by a firm of type j, and γj is the fraction of firms having productivities less than Pj. The equation γj = F(wH,j) defines wH,j. The value of λ0 enters into the distribution of labour supply but not into the distribution of offers. The lowest and highest wages offered by firms of adjacent types are related by wL,j = wH,j−1. The productivities are determined by

$$ P_j = \frac{w_{H,j} - B_j w_{H,j-1}}{1 - B_j}, $$

where

$$ B_j = \left[\frac{1+\kappa_1(1-\gamma_j)}{1+\kappa_1(1-\gamma_{j-1})}\right]^2. $$
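The mapping from the estimated quantities (κ1, the γj's and the cut-off wages wH,j) to the productivities Pj can be sketched directly from the relations above; the sequence-based interface and the convention that index 0 holds γ0 = 0 and the lowest offered wage are illustrative assumptions, not from the paper:

```python
def productivities(kappa1, gamma, wH):
    """Recover P_1, ..., P_J from
    B_j = [(1 + kappa1*(1 - gamma_j)) / (1 + kappa1*(1 - gamma_{j-1}))]**2
    and P_j = (wH_j - B_j * wH_{j-1}) / (1 - B_j).
    gamma and wH are sequences indexed 0..J, with gamma[0] = 0 and wH[0]
    taken as the lowest offered wage."""
    P = []
    for j in range(1, len(wH)):
        B = ((1.0 + kappa1 * (1.0 - gamma[j])) /
             (1.0 + kappa1 * (1.0 - gamma[j - 1]))) ** 2
        P.append((wH[j] - B * wH[j - 1]) / (1.0 - B))
    return P
```

Each recovered Pj exceeds the highest wage wH,j offered by firms of type j, as the model requires for positive profits.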
The model's parameters are θ = (r, wH,1, ..., wH,J, λ0, λ1, δ) and are estimated by maximum likelihood using data that are described below. Bowlus et al. (2001) used J = 4 types of firms in their analysis, and we do the same here. An important feature of the model is that it specifies the entire lifetime wage process. A worker starts his work life in unemployment and moves to employment when an offer exceeding r arrives. Subsequently, the worker moves to a higher-paying job if a higher wage offer arrives, or he returns to unemployment if his job is destroyed. Thus, given values of θ and J, predicted distributions of wages, employment durations and unemployment durations can be obtained from the model by simulation.

Our data are taken from the National Longitudinal Survey of Youth. We use observations of the wages and employment histories of 374 white males over the 10-year period 1982–1991. During this period, these 374 individuals had 4082 spells of employment or unemployment. The data are the wage paths of these individuals. The data are functional because they are sample paths of the continuous-time wage process. Table 1 provides some informal evidence of the ability of the model to fit the data. The table shows features of the observed and predicted distributions of wages, durations of spells
Table 1. Predicted and observed characteristics of distributions of wages and durations.

Statistic                                         Predicted   Observed
Mean wage                                               479        488
Variance of wage                                     53,453     40,882
Mean wage change in job-to-job transitions              168         74
Mean duration of spells of unemployment                  17         17
Variance of duration of spells of unemployment          300      2,469
Mean duration of spells of employment                    59         64
Variance of duration of spells of employment          3,875      6,601

Note: Wages are in 1990 U.S. dollars per week. Durations are in weeks.
of employment and durations of spells of unemployment. The predictions were obtained by simulating the model with J = 4 and the maximum likelihood estimate of θ. Some but not all features of the data are well reflected by the model. The model does a good job of matching observed mean wages and mean durations of spells of employment and unemployment. However, the model predicts a mean wage increase in job-to-job transitions that is more than twice the observed increase. The model's predictions of the variances of the durations of spells of unemployment and employment are below the observed values by factors of 8 and 1.7, respectively. The results in Table 1 suggest that the model does not provide a satisfactory description of the data-generation process but do not reveal whether the differences between predictions and observations are larger than can be explained by random sampling error. Accordingly, we used the Cramér–von Mises test of Section 2.1 to test the hypothesis that the model contains the observed data-generation process. The 0.01-level critical value is 0.0012. The value of the test statistic is 0.0117. Thus, the test rejects the model at the 0.01 level, confirming the informal impression conveyed by Table 1. Bowlus et al. (2001) reported that the equilibrium search model does not fit certain features of the data that they used. Our results are not comparable to theirs, however. They modelled only the first job search and spell of employment following an individual's completion of secondary school, whereas we model 10 years of employment/unemployment history for each individual.

4.2. Monte Carlo experiments

This section reports the results of Monte Carlo experiments designed to address the level accuracy of our test. The designs of the experiments were based on the equilibrium search model of Section 4.1.
In particular, data were generated by simulation from the model using four different sets of parameter values: (1) the maximum likelihood estimates of the parameters using the data described in Section 4.1; (2) the same values except that λ0, λ1 and δ were increased by 20%; (3) the same values as in (1) except that λ0, λ1 and δ were decreased by 20%; and (4) the same values as in (1) except that wH,1, ..., wH,4 were replaced by quartiles of the empirical distribution of wages. For each design, 1000 samples of wage-employment-unemployment histories were generated by simulation for each of n = 374 individuals. The test was applied to each simulated data set. Since the data were generated under the model being tested, the long-run proportion of rejections equals the true level of the test. Simulations were carried out in Matlab using the pseudo-random
Table 2. Results of Monte Carlo simulations.

                                                 Empirical rejection probability at nominal level
Design                                                   0.10      0.05      0.01
Maximum likelihood parameter estimates                  0.108     0.056     0.015
Increase λ0, λ1 and δ by 20%                            0.102     0.051     0.009
Decrease λ0, λ1 and δ by 20%                            0.109     0.060     0.012
Replace wH,1, ..., wH,4 by quartiles of
  empirical distribution of wages                       0.110     0.066     0.018
number generator in that software package. The results are given in Table 2. They indicate that the differences between the true and nominal rejection probabilities under H0 are small.

4.3. Comparison with the random projection test

This section reports the results of Monte Carlo experiments in which we compare the power of our Cramér–von Mises test with that of the random projections test of Cuesta-Albertos et al. (2007). The designs of the experiments are taken from Cuesta-Albertos et al. (2007). Data are generated by the process

$$ Z(t) = [1 + s_2 t^2 + s_3 \sin(2\pi t)]\,W(t) + [a_1 t + a_2 t^2 + a_3 \sin(2\pi t)], $$

where W is a standard Brownian motion and the s's and a's are constants. Under the null hypothesis, s2 = s3 = a2 = a3 = 0. There are two null hypothesis models, one with a1 = 0 and one with a1 = 1. The values of the s's and a's corresponding to the null and alternative hypotheses are listed in columns 2–6 of Table 3. Models 1 and 3 are null hypotheses. Model 2 is the alternative hypothesis when the null hypothesis is model 1. Models 4–18 are the alternative hypotheses when the null hypothesis is model 3. Realizations of Z were generated in the manner described by Cuesta-Albertos et al. (2007). As in Cuesta-Albertos et al. (2007), the sample size in the experiments is 200. There were 1000 Monte Carlo replications per experiment. The nominal level of the tests is 0.05. The results are shown in columns 7–11 of Table 3, which give the empirical probabilities that our Cramér–von Mises test and the random projections test reject the null hypothesis. When the data are generated by models 1 and 3, the null hypothesis is correct, and the rejection probability is the empirical level of the test. For the other models, the null hypothesis is false, and the rejection probability is the power of the test. The powers of the random projections test using k = 3, 5, 10 and 40 one-dimensional projections are taken from Cuesta-Albertos et al. (2007).
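For reference, a realization from this design can be generated as follows; the grid of 100 points and the cumulative-increment construction of W are illustrative choices made here, not the exact scheme of Cuesta-Albertos et al. (2007):

```python
import numpy as np

def simulate_design(s2, s3, a1, a2, a3, T=100, rng=None):
    """One path of Z(t) = [1 + s2*t^2 + s3*sin(2*pi*t)]*W(t)
                        + [a1*t + a2*t^2 + a3*sin(2*pi*t)],
    with W a standard Brownian motion approximated by cumulative
    Gaussian increments on a grid of T points in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 1.0, T)
    dt = t[1] - t[0]
    increments = np.sqrt(dt) * rng.standard_normal(T - 1)
    W = np.concatenate(([0.0], np.cumsum(increments)))   # W(0) = 0
    scale = 1.0 + s2 * t**2 + s3 * np.sin(2.0 * np.pi * t)
    drift = a1 * t + a2 * t**2 + a3 * np.sin(2.0 * np.pi * t)
    return t, scale * W + drift
```

Setting all five coefficients to zero reproduces the first null model (Z = W), while a1 = 1 with the rest zero gives the second.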
The results show that the differences between the empirical and nominal levels of the tests are small. The results also show that the power of the Cramér–von Mises test is equal to or greater than that of the random projections test in all but one of the experiments. The exception is model 6, where the random projections test with 40 one-dimensional projections has an empirical power of 0.72 compared to a power of 0.68 for the Cramér–von Mises test. In summary, the power of the Cramér–von Mises test compares favourably with that of the random projections test.
Table 3. Results of power experiments. Entries are rejection probabilities of the Cramér–von Mises (CvM) test and the random projections (RP) test with k one-dimensional projections.

Model  s2  s3  a1  a2  a3    CvM     RP k=3  RP k=5  RP k=10  RP k=40
  1     0   0   0   0   0    0.054   0.056   0.054   0.050    0.057
  2     0   0   0   1   0    0.87    0.71    0.76    0.81     0.85
  3     0   0   1   0   0    0.053   0.059   0.053   0.055    0.049
  4     0   0   1   1   0    0.91    0.71    0.74    0.77     0.85
  5     1   0   0   0   0    0.21    0.15    0.15    0.17     0.14
  6     1   0   0   1   0    0.68    0.53    0.57    0.65     0.72
  7     1   0   1   0   0    0.26    0.15    0.15    0.15     0.12
  8     1   0   1   1   0    0.74    0.51    0.55    0.61     0.72
  9     0   0   0   0   1    1       0.98    1       1        1
 10     0   0   1   0   1    1       0.98    1       1        1
 11     1   0   0   0   1    1       0.94    1       1        1
 12     1   0   1   0   1    1       0.94    1       1        1
 13     0   1   0   0   0    1       0.97    1       1        1
 14     0   1   0   1   0    1       0.99    1       1        1
 15     0   1   1   0   0    1       0.96    0.99    1        1
 16     0   1   1   1   0    1       1       1       1        1
 17     0   1   0   0   1    1       1       1       1        1
 18     0   1   1   0   1    1       1       1       1        1
ACKNOWLEDGMENTS

Bugni and Horowitz were supported in part by NSF grant SES-0352675. Hall was supported by an Australian Research Council fellowship. Neumann was supported by a grant from the Robert Wood Johnson Foundation. We thank Richard Blundell for helpful comments.
REFERENCES

Behnen, K. and G. Neuhaus (1975). A central limit theorem under contiguous alternatives. Annals of Statistics 3, 1349–53.
Boente, G. and R. Fraiman (2000). Kernel-based functional principal components. Statistics and Probability Letters 48, 335–45.
Bontemps, C., J.-M. Robin and G. Van Den Berg (2000). Equilibrium search with continuous productivity dispersion: theory and non-parametric estimation. International Economic Review 41, 305–58.
Bowlus, A. J., N. M. Kiefer and G. R. Neumann (2001). Equilibrium search models and the transition from school to work. International Economic Review 42, 317–43.
Burdett, K. and D. T. Mortensen (1998). Wage differentials, employer size, and unemployment. International Economic Review 39, 257–73.
Cai, Z. and Y. Hong (2003). Nonparametric methods in continuous-time finance: a selective review. In M. G. Akritas and D. N. Politis (Eds.), Recent Advances and Trends in Nonparametric Statistics, 283–302. Amsterdam: Elsevier.
Christensen, B. J., R. Lentz, D. T. Mortensen, G. R. Neumann and A. Werwatz (2005). On-the-job search and the wage distribution. Journal of Labor Economics 23, 31–58.
Cuesta-Albertos, J. A., E. del Barrio, R. Fraiman and C. Matrán (2007). The random projection method in goodness of fit for functional data. Computational Statistics and Data Analysis 51, 4814–31.
Cuesta-Albertos, J. A., R. Fraiman and T. Ransford (2006). Random projections and goodness of fit tests in infinite dimensional spaces. Bulletin of the Brazilian Mathematical Society 37, 1–25.
Durbin, J. (1973a). Distribution Theory for Tests Based on the Sample Distribution Function. Philadelphia: SIAM.
Durbin, J. (1973b). Weak convergence of the sample distribution function when parameters are estimated. Annals of Statistics 1, 279–90.
Fan, Y. and Q. Li (1996). Consistent model specification tests: omitted variables and semiparametric functional forms. Econometrica 64, 865–90.
Fan, Y. and Q. Li (2000). Consistent model specification tests: kernel-based tests versus Bierens' ICM tests. Econometric Theory 16, 1016–41.
Fan, Y. and Q. Li (2002). A consistent model specification test based on kernel sum of squares residuals. Econometric Reviews 21, 337–52.
Guay, A. and E. Guerre (2006). A data-driven nonparametric specification test for dynamic regression models. Econometric Theory 22, 543–86.
Guerre, E. and P. Lavergne (2002). Optimal minimax rates for nonparametric specification testing in regression models. Econometric Theory 18, 1139–71.
Guerre, E. and P. Lavergne (2005). Data-driven rate-optimal specification testing in regression models. Annals of Statistics 33, 840–70.
Hall, P. and M. Hosseini-Nasab (2006). On properties of functional principal components analysis. Journal of the Royal Statistical Society, Series B 68, 109–26.
Hall, P. and N. Tajvidi (2002). Permutation tests for equality of distributions in high-dimensional settings. Biometrika 89, 359–97.
He, G. Z., H.-G. Müller and J.-L. Wang (2003). Functional canonical analysis for square integrable stochastic processes. Journal of Multivariate Analysis 85, 54–77.
Hong, Y. and H. Li (2005). Nonparametric specification testing for continuous-time models with applications to term structure of interest rates. Review of Financial Studies 18, 37–84.
Horowitz, J. L. and V. G. Spokoiny (2001). An adaptive, rate-optimal test of a parametric mean-regression model against a nonparametric alternative. Econometrica 69, 599–631.
Jank, W. and G. Shmueli (2006). Functional data analysis in electronic commerce research. Statistical Science 21, 155–66.
Kim, M. S. and S. Wang (2006). Sizes of two bootstrap-based nonparametric specification tests for the drift function in continuous time models. Computational Statistics and Data Analysis 50, 1793–806.
Miles, D. and J. Mora (2003). On the performance of nonparametric specification tests in regression models. Computational Statistics and Data Analysis 42, 477–90.
Mortensen, D. T. (1990). Equilibrium wage distributions: a synthesis. In J. Hartog, G. Ridder and J. Theeuwes (Eds.), Panel Data and Labor Market Studies, 279–96. Amsterdam: North-Holland.
Neuhaus, G. (1971). On weak convergence of stochastic processes with multidimensional time parameter. Annals of Mathematical Statistics 42, 1285–95.
Neuhaus, G. (1976). Asymptotic power properties of the Cramér–von Mises test under contiguous alternatives. Journal of Multivariate Analysis 6, 95–110.
Pollard, D. (1984). Convergence of Stochastic Processes. New York: Springer.
Ramsay, J. O. and B. W. Silverman (2002). Applied Functional Data Analysis: Methods and Case Studies. New York: Springer-Verlag.
Ramsay, J. O. and B. W. Silverman (2005). Functional Data Analysis (2nd ed.). New York: Springer-Verlag.
Yao, F., H.-G. Müller and J.-L. Wang (2005). Functional linear regression analysis for longitudinal data. Annals of Statistics 33, 2873–903.
APPENDIX: PROOFS OF THEOREMS

Proof of Theorem 3.1. We prove only case 2. Case 1 follows directly from case 2, and case 3 is relatively elementary. Write $\hat{F}_X - F_X = n^{-1/2} W_n$, where the stochastic process $W_n$ has the same covariance as $\zeta$. A Taylor series expansion gives
$$nT(X|\hat\theta) = \int \{W_n(x) + D(x) - n^{1/2}[F_Y(x|\hat\theta) - F_Y(x|\theta_0)]\}^2 \, d\mu(x)$$
$$= \int \{W_n(x) + D(x) - \dot{F}_Y(x|\theta_0)\, n^{1/2}(\hat\theta - \theta_0) - [\dot{F}_Y(x|\tilde\theta) - \dot{F}_Y(x|\theta_0)]\, n^{1/2}(\hat\theta - \theta_0)\}^2 \, d\mu(x),$$
where $\tilde\theta$ is between $\hat\theta$ and $\theta_0$. It follows from this result and Assumptions 3.2 and 3.3 that $nT(X|\hat\theta)$ has the same asymptotic distribution as
$$V_W \equiv \int \{W_n(x) + D(x) - \dot{F}_Y(x|\theta_0)\,\xi\}^2 \, d\mu(x) + o_p(1). \tag{A.1}$$
Now let $L^2(\mu)$ be the Hilbert space consisting of functionals on $L^2[0,1]$ that are square integrable with respect to $\mu$. The inner product of two functionals, $\omega_1$ and $\omega_2$, in this space is $\langle \omega_1, \omega_2 \rangle = \int \omega_1(x)\,\omega_2(x)\, d\mu(x)$. Because $L^2(\mu)$ is a Hilbert space, it has a complete, orthonormal basis, say $\{\varphi_k : k = 1, 2, \ldots\}$. Define
$$c_{nk} = \int_0^1 W_n(x)\,\varphi_k(x)\, d\mu(x), \qquad d_k = \int_0^1 \zeta(x)\,\varphi_k(x)\, d\mu(x).$$
For any finite integer $K > 0$, define the stochastic processes
$$W_{nK1}(x) = \sum_{k=1}^{K} c_{nk}\,\varphi_k(x), \qquad W_{nK2}(x) = \sum_{k=K+1}^{\infty} c_{nk}\,\varphi_k(x), \qquad \zeta_K(x) = \sum_{k=1}^{K} d_k\,\varphi_k(x).$$
Define the random variables
$$V_{K\zeta} = \int [\zeta_K(x) + D(x) + \dot{F}_Y(x|\theta_0)\,\xi]^2\, d\mu(x)$$
and
$$V_{nK1} = \int [W_{nK1}(x) + D(x) + \dot{F}_Y(x|\theta_0)\,\xi]^2\, d\mu(x).$$
Let $V$ be the random variable defined in (3.2). Now
$$V_W = (V_W - V_{nK1}) + V_{nK1}. \tag{A.2}$$
F. A. Bugni et al.
Expanding the integrand of $V_W$ and applying the Cauchy–Schwarz inequality yields
$$|V_W - V_{nK1}| \le 2\, V_{nK1}^{1/2} \left( \int W_{nK2}^2\, d\mu(x) \right)^{1/2} + \int W_{nK2}^2\, d\mu(x). \tag{A.3}$$
Standard methods for $K$-variate problems may be used to show that
$$V_{nK1} \to_d V_{K\zeta} \tag{A.4}$$
for each $K$ as $n \to \infty$. Moreover,
$$V_{K\zeta} - V \to_p 0 \tag{A.5}$$
as $K \to \infty$, and
$$\lim_{K\to\infty}\, \limsup_{n\to\infty}\, E \int W_{nK2}^2\, d\mu(x) = 0. \tag{A.6}$$
Combining (A.2)–(A.6) yields the result that
$$V_W \to_d V. \tag{A.7}$$
The theorem follows by combining (A.1) and (A.7).
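The truncation steps (A.2)–(A.6) rest on a basis expansion in $L^2(\mu)$: by Bessel's inequality and Parseval's identity, the energy captured by the first $K$ coefficients increases to the full integral, so the remainder process $W_{nK2}$ becomes negligible as $K$ grows. A minimal numerical sketch of this mechanism (illustrative only; it uses the orthonormal cosine basis on $[0,1]$, takes $\mu$ to be Lebesgue measure, and substitutes a fixed stand-in function for the random integrand):

```python
import numpy as np

# Orthonormal cosine basis on [0, 1] w.r.t. Lebesgue measure:
# phi_1(x) = 1, phi_k(x) = sqrt(2) * cos((k - 1) * pi * x) for k >= 2.
M = 10_000
x = (np.arange(M) + 0.5) / M          # midpoint grid for numerical integration

def phi(k, x):
    return np.ones_like(x) if k == 1 else np.sqrt(2.0) * np.cos((k - 1) * np.pi * x)

g = np.sin(3 * np.pi * x) + x ** 2    # stand-in for the integrand W_n + D - F'_Y xi
total = np.mean(g ** 2)               # int_0^1 g(x)^2 dx

coefs = np.array([np.mean(g * phi(k, x)) for k in range(1, 26)])
partial = np.cumsum(coefs ** 2)       # energy of the first K basis coefficients

# Bessel/Parseval: the partial sums increase and converge to the full integral,
# which is why the tail term W_{nK2} can be driven to zero by taking K large.
```

Here `partial[K-1]` never exceeds `total` and closes the gap rapidly, mirroring why the remainder in (A.3) vanishes in the limit (A.6).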
Proof of Theorem 3.2. Let $\zeta_\theta$ denote a Gaussian process having mean zero and the covariance structure of the indicator process $I(Y \le x) - F_Y(x|\theta)$. Define
$$V_\theta = \int [\zeta_\theta(x) + \dot{F}_Y(x|\theta)\,\xi]^2\, d\mu(x).$$
Write $P_\theta$ for the probability measure under the assumption that $X$ has the distribution of $Y$ with parameter $\theta$. Then arguments like those used to prove Theorem 3.1 show that if $\eta > 0$ is sufficiently small, then
$$\lim_{n\to\infty}\; \sup_{\theta:\, \|\theta - \theta_0\| \le \eta}\; \sup_{t:\, t > 0}\, \big| P_\theta[nT(X|\theta) \le t] - P(V_\theta \le t) \big| = 0. \tag{A.8}$$
Moreover,
$$\lim_{\eta\to 0}\; \sup_{\theta:\, \|\theta - \theta_0\| \le \eta}\; \sup_{t:\, t > 0}\, \big| P(V_\theta \le t) - P(V_{\theta_0} \le t) \big| = 0. \tag{A.9}$$
Let $t_{n\beta}(\theta)$ denote the $\beta$-level critical value of $nT(X|\theta)$ when the data have the distribution of $Y$, and let $t_\beta(\theta)$ denote the $\beta$-level critical value of $V_\theta$. Note that the distribution of $V_\theta$ is continuous and has support equal to the real line. Then it follows from (A.8)–(A.9) that for all sufficiently small $\delta > 0$,
$$\sup_{n \ge n_1}\; \sup_{\theta:\, \|\theta - \theta_0\| \le \eta}\; \sup_{\beta:\, |\beta - \alpha| \le \delta}\, |t_{n\beta}(\theta) - t_\beta(\theta_0)| \to 0 \tag{A.10}$$
as $n_1 \to \infty$ and $\delta \to 0$. Now $s^*_\alpha$ and $s_\alpha$ are identical to $t_{n\alpha}(\hat\theta)$ and $t_\alpha(\theta_0)$, respectively. Therefore, (A.10) with $\beta = \alpha$ and the fact that $\hat\theta \to_p \theta_0$ imply that $s^*_\alpha \to_p s_\alpha$.

Proof of Theorem 3.3. The proof consists of showing that the class of sets $A$ for which (3.5) holds is a sigma field and that it contains all balls.
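Theorem 3.2 is what justifies the parametric bootstrap critical value $s^*_\alpha$: critical values computed under the fitted parameter $\hat\theta$ converge to those under $\theta_0$. A stripped-down sketch of this construction in a scalar (rather than functional) setting — purely illustrative, assuming a $N(\theta, 1)$ parametric family, with the classical Cramér–von Mises order-statistic formula standing in for the functional statistic $nT(X|\hat\theta)$:

```python
import math
import random

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cvm(sample, theta):
    # Cramer-von Mises statistic via the order-statistic formula:
    # W^2 = 1/(12n) + sum_i [F(x_(i)) - (2i - 1)/(2n)]^2, with F = Phi(. - theta)
    n = len(sample)
    w = 1.0 / (12 * n)
    for i, xi in enumerate(sorted(sample), start=1):
        w += (norm_cdf(xi - theta) - (2 * i - 1) / (2 * n)) ** 2
    return w

random.seed(1)
n, theta0, alpha = 100, 0.0, 0.05
data = [random.gauss(theta0, 1.0) for _ in range(n)]
theta_hat = sum(data) / n                 # estimate the location parameter
stat = cvm(data, theta_hat)

# Parametric bootstrap: resample from the *fitted* model, re-estimating
# theta on each draw, to mimic the null distribution of the statistic.
B = 499
boot = sorted(
    cvm(bs, sum(bs) / n)
    for bs in ([random.gauss(theta_hat, 1.0) for _ in range(n)] for _ in range(B))
)
crit = boot[math.ceil((1 - alpha) * (B + 1)) - 1]   # bootstrap (1 - alpha) critical value
reject = stat > crit
```

`reject` implements the test at nominal level $\alpha$; the content of Theorem 3.2 is that `crit` (the analogue of $s^*_\alpha$) converges in probability to the critical value of the limiting distribution.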
The Econometrics Journal (2009), volume 12, pp. S19–S49. doi: 10.1111/j.1368-423X.2009.00285.x
Finite-sample distribution-free inference in linear median regressions under heteroscedasticity and non-linear dependence of unknown form

ELISE COUDIN† AND JEAN-MARIE DUFOUR‡,§,¶

† Centre de Recherche en Économie et Statistique, Institut National de la Statistique et des Études Économiques, 15 Boulevard Gabriel Péri, 92245 Malakoff Cedex, France
E-mail: [email protected]
‡ Department of Economics, McGill University, 855 Sherbrooke Street West, Montréal, Quebec H3A 2T7, Canada
§ Centre interuniversitaire de recherche en analyse des organisations, 2020 rue University, 25e étage, Montréal, Quebec H3A 2A5, Canada
¶ Centre interuniversitaire de recherche en économie quantitative, Université de Montréal, Quebec H3C 3J7, Canada
E-mail: [email protected]

First version received: August 2008; final version accepted: January 2009
Summary We construct finite-sample distribution-free tests and confidence sets for the parameters of a linear median regression, where no parametric assumption is imposed on the noise distribution. The set-up studied allows for non-normality, heteroscedasticity, non-linear serial dependence of unknown form as well as for discrete distributions. We consider a mediangale structure—the median-based analogue of a martingale difference—and show that the signs of mediangale sequences follow a nuisance-parameter-free distribution despite the presence of non-linear dependence and heterogeneity of unknown form. We point out that a simultaneous inference approach in conjunction with sign transformations yields statistics with the required pivotality features—in addition to the usual robustness properties. Monte Carlo tests and projection techniques are then exploited to produce finite-sample tests and confidence sets. Further, under weaker assumptions, which allow for weakly exogenous regressors and a wide class of linear dependence schemes in the errors, we show that the proposed procedures remain asymptotically valid. The regularity assumptions used are notably less restrictive than those required by procedures based on least absolute deviations (LAD). Simulation results illustrate the performance of the procedures. Finally, the proposed methods are applied to tests of the drift in the Standard and Poor's composite price index series (allowing for conditional heteroscedasticity of unknown form).

Keywords: Bootstrap, Discrete distribution, Distribution-free, GARCH, Heteroscedasticity, Median regression, Monte Carlo test, Non-normality, Projection methods, Quantile regression, Serial dependence, Signs, Sign test, Simultaneous inference, Stochastic volatility.
C The Author(s). Journal compilation C Royal Economic Society 2009. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
1. INTRODUCTION

Median regression (and related quantile regressions) provides an attractive bridge between parametric and non-parametric models. Distributional assumptions on the disturbance process are relaxed, but the functional form remains parametric. Associated estimators, such as the least absolute deviations (LAD) estimator, are more robust to outliers than usual least-squares (LS) methods and may be more efficient whenever the median is a better measure of location than the mean (Dodge, 1997). They are especially appropriate when unobserved heterogeneity is suspected in the data. The current expansion of such 'semiparametric' techniques reflects an intention to depart from restrictive parametric frameworks (see Powell, 1994). However, the related tests usually remain based on asymptotic normality approximations.

In this paper, we show that tests based on residual signs yield an entire system of finite-sample exact inference under very general assumptions. We study a linear median regression model where the (possibly dependent) disturbance process is assumed to have a null median, conditional on some exogenous explanatory variables and its own past. This set-up covers non-stochastic heteroscedasticity, standard conditional heteroscedasticity (such as ARCH, GARCH and stochastic volatility models) as well as other forms of non-linear dependence. We provide both finite-sample and asymptotic distributional theories. In the first set of results, we show that the level of the tests is provably equal to the nominal level for any sample size. Exact tests and confidence regions are valid under general assumptions and allow for heteroscedasticity and non-linear dependence of unknown forms, as well as for discrete distributions.
This is done, in particular, by combining Monte Carlo tests adapted to discrete statistics—using a randomized tie-breaking procedure (Dufour, 2006)—with projection techniques, which allow inference on general parameter transformations (Dufour, 1990). We also show that the tests proposed include locally optimal tests. However, for more general processes that may involve stationary ARMA disturbances, sign-based statistics are no longer pivotal: the serial dependence parameters constitute nuisance parameters. In a second set of results, we show that the proposed procedures remain asymptotically valid when the regressors are weakly exogenous and the disturbances are stationary ARMA. Transforming the sign-based statistics with standard heteroscedasticity and autocorrelation-corrected (HAC) methods allows one to eliminate the nuisance parameters asymptotically. We thus extend the validity of the Monte Carlo test method. In such cases, we lose exactness but retain asymptotic validity. The latter holds under much weaker assumptions on moments or the shape of the distribution (such as the existence of a density) than usual asymptotically justified inference (such as LAD-based techniques). Besides, one does not need to evaluate the disturbance density at zero, which constitutes one of the major difficulties of asymptotic kernel-based methods associated with LAD and other quantile estimators.

A basic motivation for the sign-based techniques studied in this paper comes from an impossibility result due to Lehmann and Stein (1949), who proved that inference procedures that are valid under heteroscedasticity of unknown form, when the number of observations is finite, must control the level of the tests conditional on the absolute values (see also Pratt and Gibbons, 1981). This result has two main consequences. First, sign-based methods constitute the only general way of producing provably valid inference for any given sample size.
Second, all other methods, including the usual HAC methods developed by White (1980), Newey and West (1987), Andrews (1991) and others, which are not based on signs, are not provably valid for any sample size. Although this provides a compelling argument for using sign-based
procedures, the latter have barely been exploited in econometrics; for a few exceptions which focus on simple time series models, see Dufour (1981), Campbell and Dufour (1991, 1995, 1997) and Wright (2000). In a regression context, the vast majority of the statistical literature is reviewed by Boldin et al. (1997). These authors also develop sign-based inference and estimation for linear models, both exact and asymptotic with i.i.d. errors. In the same vein, the recent paper by Chernozhukov et al. (2008) considers quantile regression models and derives finite sample inference using quantile indicators when the observations are independent. The problem of interest in the present paper consists in giving conditions under which signs will be i.i.d. according to a known distribution, even though the variables to which indicator functions are applied are not independent or do not satisfy other regularity conditions (such as following an absolutely continuous distribution). An important feature of our results consists in allowing for a dynamic structure in the error distribution, providing a considerable extension of earlier results on the distribution of signs in the presence of dependent observations. Moreover, errors with discrete distribution (or mixtures of discrete and continuous distributions) are allowed, as opposed to the usual continuity assumption. This is made possible by the combination of a ternary sign operator—rather than binary—and Monte Carlo test techniques involving randomized tie-breaking. Sign-based inference methods constitute an alternative to inference derived from the asymptotic distribution of LAD estimators and their extensions (see Koenker and Bassett, 1978, Powell, 1984, Weiss, 1991, Fitzenberger, 1997b, Horowitz, 1998, Zhao, 2001, etc.). An important problem in the LAD literature consists in providing good estimates of the asymptotic covariance matrix, on which inference relies. 
Powell (1984) suggested kernel estimation, but the most widespread method of estimation is the bootstrap (Buchinsky, 1995; Fitzenberger, 1997b; Hahn, 1997). 1 Kernel techniques are sensitive to the choice of kernel function and bandwidth parameter, and the estimation of the LAD asymptotic covariance matrix needs a reliable estimator of the error term density at zero. This may be tricky especially when disturbances are heteroscedastic or simply do not possess a density with respect to the Lebesgue measure (discrete distributions). Besides, whenever the normal distribution is not a good finite-sample approximation, inference based on covariance matrix estimation may be problematic. From a finite-sample point of view, asymptotically justified methods can be arbitrarily unreliable. Test sizes can be far from their nominal levels. One can find examples of such distortions for time series in Dufour (1981) and Campbell and Dufour (1995, 1997) and for L 1 -estimation in Dielman and Pfaffenberger (1988a,b), De Angelis et al. (1993) and Buchinsky (1995). Inference based on signs constitutes an alternative that does not suffer from these shortcomings. 2 The paper is organized as follows. In Section 2, we present the model and the notations. Section 3 contains results on exact inference. In Section 4, we derive confidence intervals at any given confidence level and illustrate the method on a numerical example. Section 5 is dedicated to the asymptotic validity of the finite-sample inference method. In Section 6, we give simulation results from comparisons with usual techniques. Section 7 presents an illustrative application: testing the presence of a drift in the Standard and Poor’s composite price index series. Section 8 concludes. The Appendix contains the proofs.
1 See Buchinsky (1995, 1998) for a review and Fitzenberger (1997b) for a comparison between these methods.
2 Other notable areas of investigation in the L1-literature concern: (1) censored quantile regressions (Powell, 1984, 1986, Fitzenberger, 1997a, Buchinsky and Hahn, 1998); (2) endogeneity (Amemiya, 1982, Powell, 1983, Hong and Tamer, 2003); (3) misspecification (Jung, 1996, Kim and White, 2002, Komunjer, 2005).
2. FRAMEWORK

We consider a stochastic process $\{(y_t, x_t') : \Omega \to \mathbb{R}^{p+1} : t = 1, 2, \ldots\}$ defined on a probability space $(\Omega, \mathcal{F}, P)$, such that $y_t$ and $x_t$ satisfy a linear model of the form
$$y_t = x_t'\beta + u_t, \qquad t = 1, \ldots, n, \tag{2.1}$$
where $y_t$ is a dependent variable, $x_t = (x_{t1}, \ldots, x_{tp})'$ is a $p$-vector of explanatory variables, and $u_t$ is an error process. The $x_t$'s may be random or fixed. In the sequel, $y = (y_1, \ldots, y_n)' \in \mathbb{R}^n$ will denote the dependent vector, $X = [x_1, \ldots, x_n]'$ the $n \times p$ matrix of explanatory variables, and $u = (u_1, \ldots, u_n)' \in \mathbb{R}^n$ the disturbance vector. Moreover, $F_t(\cdot\,|x_1, \ldots, x_n)$ represents the distribution function of $u_t$ conditional on $X$.

Inference on this model will be made possible through assumptions on the conditional medians of the errors. To do this, it will be convenient to consider adapted sequences of the form
$$S(v, \mathcal{F}) = \{v_t, \mathcal{F}_t : t = 1, 2, \ldots\}, \tag{2.2}$$
where $v_t$ is any measurable function of $W_t = (y_t, x_t')'$, $\mathcal{F}_t$ is a $\sigma$-field in $\Omega$, $\mathcal{F}_s \subseteq \mathcal{F}_t$ for $s < t$, $\sigma(W_1, \ldots, W_t) \subset \mathcal{F}_t$, and $\sigma(W_1, \ldots, W_t)$ is the $\sigma$-algebra spanned by $W_1, \ldots, W_t$. We shall depart from the usual assumption that $E(u_t | \mathcal{F}_{t-1}) = 0$, $\forall t \ge 1$, i.e. that $u = \{u_t : t = 1, 2, \ldots\}$ in the adapted sequence $S(u, \mathcal{F}) = \{u_t, \mathcal{F}_t : t = 1, 2, \ldots\}$ is a martingale difference with respect to $\mathcal{F}_t = \sigma(W_1, \ldots, W_t)$, $t = 1, 2, \ldots$.

In a framework that allows for heteroscedasticity of unknown form, it is known from Bahadur and Savage (1956) that inference on the mean of i.i.d. observations of a random variable, without any further assumption on the form of the distribution, is impossible: such a test has no power. This problem of non-testability can be viewed as a form of non-identification in a wide sense. Unless relatively strong distributional assumptions are made, moments are not empirically meaningful. Thus, if one wants to relax the distributional assumptions, one must choose another measure of central tendency, such as the median. The median is especially appropriate if the distribution of the disturbance process does not possess moments. Thus, in the median regression framework, it appears that the martingale difference assumption should be replaced by an analogue in terms of the median. Such a mediangale may be defined conditional on the design matrix $X$ or unconditionally. Here, we focus on the conditional form.

DEFINITION 2.1. (Weak conditional mediangale). Let $\mathcal{F}_t = \sigma(u_1, \ldots, u_t, X)$, for $t \ge 1$. Then $u$ in the adapted sequence $S(u, \mathcal{F})$ is a weak mediangale conditional on $X$ with respect to $\{\mathcal{F}_t : t = 1, 2, \ldots\}$ iff $P[u_1 < 0\,|X] = P[u_1 > 0\,|X]$ and
$$P[u_t < 0\,|u_1, \ldots, u_{t-1}, X] = P[u_t > 0\,|u_1, \ldots, u_{t-1}, X], \qquad \text{for } t > 1.$$

The above definition allows $u_t$ to have a discrete distribution with a non-zero probability mass at zero. A more restrictive version, called the strict conditional mediangale, imposes a zero probability mass at zero.
Then, $P[u_1 < 0\,|X] = P[u_1 > 0\,|X] = 0.5$ and $P[u_t < 0\,|u_1, \ldots, u_{t-1}, X] = P[u_t > 0\,|u_1, \ldots, u_{t-1}, X] = 0.5$, for $t > 1$. With no mass at zero and no matrix $X$, this concept coincides with the mediangale notion defined in Linton and Whang (2007), together with other quantilegales. 3

3 Linton and Whang (2007) define $u$ to be a mediangale if $E(\psi_{1/2}(u_t)|\mathcal{F}_{t-1}) = 0$, $\forall t$, where $\mathcal{F}_{t-1} = \sigma(u_{t-1}, u_{t-2}, \ldots)$ and $\psi_{1/2}(x) = \frac{1}{2} - 1_{(-\infty,0)}(x)$. This definition is adapted to continuous distributions but does not work well with discrete distributions. If $u_t$ has a mass at zero, the condition given by Definition 2.1 can hold even if $E(\psi_{1/2}(u_t)|\mathcal{F}_{t-1}) \ne 0$.

Stating that $u$ is a weak mediangale with respect to $\mathcal{F}$ is equivalent to assuming that its sign process $s(u) = \{s(u_t) : t = 1, 2, \ldots\}$, where $s(a) = 1_{[0,+\infty)}(a) - 1_{(-\infty,0]}(a)$, $\forall a \in \mathbb{R}$, is a martingale difference with respect to the same sequence of sub-$\sigma$-algebras $\mathcal{F}$. The martingale difference assumption on the raw process $u$ is replaced by a quasi-similar hypothesis on a robust transform $s(u)$ of that process. However, the weak conditional mediangale concept differs from a martingale difference on the signs, because it requires conditioning upon the whole process $X$. We shall see later that asymptotic inference may be available under a classical martingale difference on signs or, more generally, mixing conditions on $\{s(u_t), \sigma(W_1, \ldots, W_t) : t = 1, 2, \ldots\}$.

It is relatively easy to deal with a weak mediangale by a simple transformation of the sign operator. Consider $P[u_t = 0\,|X, u_1, \ldots, u_{t-1}] = p_t(X, u_1, \ldots, u_{t-1}) > 0$, where the $p_t(\cdot)$ are unknown and may vary between observations. A way out consists in modifying the sign function $s(x)$ as
$$\tilde{s}(x, V) = s(x) + [1 - s(x)^2]\, s(V - 0.5),$$
where $V \sim U(0, 1)$. If $V_t$ is independent of $u_t$ then, irrespective of the distribution of $u_t$,
$$P[\tilde{s}(u_t, V_t) = +1] = P[\tilde{s}(u_t, V_t) = -1] = \tfrac{1}{2}.$$

To simplify the presentation, we shall focus on the strict mediangale concept. Therefore, our model will rely on the following assumption.

ASSUMPTION 2.1. (Strict conditional mediangale). The components of $u = (u_1, \ldots, u_n)'$ satisfy a strict mediangale conditional on $X$.

One remark concerns exogeneity. As long as the $x_t$'s are strongly exogenous, the conditional mediangale concept is equivalent to a martingale difference on signs with respect to $\mathcal{F}_t = \sigma(W_1, \ldots, W_t)$, $t = 1, 2, \ldots$.

PROPOSITION 2.1. (Mediangale exogeneity). Suppose $\{x_t : t = 1, 2, \ldots\}$ is a strongly exogenous process for $\beta$, $P[u_1 > 0] = P[u_1 < 0] = 0.5$ and $P[u_t > 0\,|u_1, \ldots, u_{t-1}, x_1, \ldots, x_t] = P[u_t < 0\,|u_1, \ldots, u_{t-1}, x_1, \ldots, x_t] = 0.5$. Then $\{u_t : t = 1, 2, \ldots\}$ is a strict mediangale conditional on $X$.

Model (2.1) with Assumption 2.1 allows for very general forms of the disturbance distribution, including asymmetric, heteroscedastic or dependent ones, as long as the conditional medians are 0. Neither density nor moment existence is required. Indeed, what the mediangale concept requires is a form of independence in the signs of the residuals. This extends results in Dufour (1981), Campbell and Dufour (1991, 1995, 1997) and Dufour et al. (1998). For example, Assumption 2.1 is satisfied if $u_t = \sigma_t(x_1, \ldots, x_n)\,\varepsilon_t$, $t = 1, \ldots, n$, where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. conditional on $X$, which is relevant for cross-sectional data. Many dependence schemes are also covered, especially any model of the form $u_1 = \sigma_1(x_1, \ldots, x_n)\,\varepsilon_1$, $u_t = \sigma_t(x_1, \ldots, x_n, u_1, \ldots, u_{t-1})\,\varepsilon_t$, $t = 2, \ldots, n$, where $\varepsilon_1, \ldots, \varepsilon_n$ are independent with median 0, and $\sigma_1(x_1, \ldots, x_n)$ and $\sigma_t(x_1, \ldots, x_n, u_1, \ldots, u_{t-1})$, $t = 2, \ldots, n$, are non-zero with probability one.

In a time series context, this includes models presenting robustness properties to endogenous disturbance variance (or volatility) specification, such as ARCH, GARCH or stochastic volatility
models with non-Gaussian noises. Further, the mediangale property is more general: in contrast with an ARCH specification, it does not explicitly specify the functional form of the variance. Note again that the disturbance process does not have to be second-order stationary. Asymptotic normality of the LAD estimator, which is presented in its most general form in Fitzenberger (1997b), holds under some mixing conditions on $\{s(u_t), \sigma(W_1, \ldots, W_t) : t = 1, 2, \ldots\}$ and an orthogonality condition between $s(u_t)$ and $x_t$. Besides, it requires additional assumptions on moments. 4 With such a choice, testing is necessarily based on approximations (asymptotic or bootstrap). Here, we focus on valid finite-sample inference without any further assumption on the form of the distributions. This non-parametric set-up extends those used in Dufour (1981) and Campbell and Dufour (1991, 1995, 1997).

Assumption 2.1 can easily be extended to allow for another quantile $q$ by setting $P[u_t < 0\,|\mathcal{F}_{t-1}] = q$, $\forall t$, which would lead to $P[u_t < 0\,|u_1, \ldots, u_{t-1}, x_1, \ldots, x_t] = q$ in Proposition 2.1. However, with error heterogeneity or dependence of unknown form, such an assumption can plausibly hold only for a single quantile, so little generality is lost by focusing on the median case. Further, contrary to other quantiles, the median may have an economic meaning when it coincides with the expectation, e.g. if the error density is symmetric. It can then be used to state expectation-based economic conditions, such as a no-arbitrage condition on a market.

A classical result in non-parametric statistics consists in using this Bernoulli distribution to build exact tests and confidence intervals on quantiles (for i.i.d. observations); see Thompson (1936), Scheffé and Tukey (1945) and the review of David (1981, ch. 2). For recent econometric exploitation of a quantile version of this result, which holds if the observations are X-conditionally independent, see Chernozhukov et al. (2008). Proposition 2.1 above provides general conditions under which such a result holds for non-i.i.d. observations. Finally, the set-up presented here extends those approaches to the time series context, where some kinds of Markovian serial dependence are permitted, as well as discrete distributions.
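Proposition 2.1 and the randomized sign operator together imply that even for ARCH-type errors with a point mass at zero, the (randomized) signs behave as i.i.d. symmetric Bernoulli variables. A small simulation check of this claim (the ARCH(1) parameters and the mixture distribution are invented purely for the illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# ARCH(1)-type scale: sigma_t^2 = 0.2 + 0.7 * u_{t-1}^2 (parameters invented),
# innovations with a 20% point mass at zero, so u_t is dependent AND discrete at 0
eps = rng.standard_normal(n) * (rng.random(n) > 0.2)
u = np.empty(n)
u[0] = np.sqrt(0.2) * eps[0]
for t in range(1, n):
    u[t] = np.sqrt(0.2 + 0.7 * u[t - 1] ** 2) * eps[t]

def sign(a):
    # ternary sign operator: s(0) = 0
    return (a > 0).astype(int) - (a < 0).astype(int)

# randomized signs: s~(u, V) = s(u) + [1 - s(u)^2] * s(V - 0.5), V ~ U(0, 1)
s0 = sign(u)
st = s0 + (1 - s0 ** 2) * sign(rng.random(n) - 0.5)

p_plus = (st == 1).mean()          # ~ 1/2 despite dependence and the mass at zero
acov1 = (st[1:] * st[:-1]).mean()  # ~ 0: randomized signs behave like i.i.d. +-1
```

Despite strong volatility dependence in `u`, the randomized signs show a balanced distribution and no lag-one dependence, which is exactly the pivotality that the finite-sample tests of Section 3 exploit.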
4 Fitzenberger (1997b) shows that LAD and quantile estimators are consistent and asymptotically normal when $E[x_t s_\theta(u_t)] = 0$, $\forall t$, where $(u_t, x_t')$ has a density and finite second moments.

3. EXACT FINITE-SAMPLE SIGN-BASED INFERENCE

In finite samples, first-order asymptotic approximations can be misleading. Test sizes of asymptotically justified t- or χ²-statistics can be quite far from their nominal levels. One can find examples of such distortions in the dynamic literature (see, for example, Dufour, 1981, Mankiw and Shapiro, 1986, Campbell and Dufour, 1995, 1997) and in work on inference based on L1 estimators (see Dielman and Pfaffenberger, 1988a,b; Buchinsky, 1995; De Angelis et al., 1993). This remark usually motivates the use of bootstrap procedures. In a sense, bootstrapping (once bias corrected) is a way to improve the approximation by introducing artificial observations. However, the bootstrap still relies on approximations, and in general there is no guarantee that the level condition is satisfied in finite samples. The unreliability of asymptotic methods motivates us to turn to a fully finite-sample approach. Sign-based procedures provide a way to build distribution-free statistics even in finite samples. Sign-based statistics have been used in the statistical literature to derive non-parametric sign tests. In this section, we present the general sign pivotality result and apply it in the median regression context to derive sign-based test statistics that are pivots and provide power against alternatives
of interest. This will enable us to build Monte Carlo tests relying on their exact distribution. Therefore, the level of those tests is exactly controlled for any sample size. We study first the test problem, then build confidence sets. Finally, estimators can be derived. 5 Hence, results on the valid finite-sample test problem will be adapted to obtain valid confidence intervals and estimators.

5 For the estimation theory, the reader is referred to Coudin and Dufour (2006).

3.1. Distribution-free pivotal functions and non-parametric tests

When the disturbance process is a conditional mediangale, the joint distribution of the signs of the disturbances is completely determined. If there is no positive mass at zero, the signs are i.i.d. and take the values 1 and −1 with equal probability 1/2. The case with a mass at zero can be covered using the transformed sign operator presented in the previous section. These results are stated more precisely in the following propositions.

PROPOSITION 3.1. (Sign distribution). Under model (2.1), suppose the errors $(u_1, \ldots, u_n)$ satisfy a strict mediangale conditional on $X = [x_1, \ldots, x_n]'$. Then the variables $s(u_1), \ldots, s(u_n)$ are i.i.d. conditional on $X$ according to the distribution
$$P[s(u_t) = 1\,|x_1, \ldots, x_n] = P[s(u_t) = -1\,|x_1, \ldots, x_n] = \tfrac{1}{2}, \qquad t = 1, \ldots, n. \tag{3.1}$$

More generally, this result holds for any permutation of $t = 1, \ldots, n$: if there is a permutation $\pi$ such that the mediangale property holds for the permuted sequence, then the signs are i.i.d.

From Proposition 3.1, it follows that the residual sign vector
$$s(y - X\beta) = [s(y_1 - x_1'\beta), \ldots, s(y_n - x_n'\beta)]'$$
has a nuisance-parameter-free distribution (conditional on $X$), i.e. it is a 'pivotal function'. Its distribution is easy to simulate from a combination of $n$ independent uniform Bernoulli variables. Furthermore, any statistic of the form $T = T(s(y - X\beta), X)$ is pivotal, conditional on $X$. Once the form of $T$ is specified, the distribution of the statistic $T$ is totally determined and can also be simulated.

Using Proposition 3.1, it is possible to construct tests for which the size is fully controlled in finite samples. Consider testing $H_0(\beta_0) : \beta = \beta_0$ against $H_1(\beta_0) : \beta \ne \beta_0$. Under $H_0(\beta_0)$, $s(y_t - x_t'\beta_0) = s(u_t)$, $t = 1, \ldots, n$. Thus, conditional on $X$,
$$T\big(s(y - X\beta_0), X\big) \sim T(S_n, X), \tag{3.2}$$
where $S_n = (s_1, \ldots, s_n)'$ and $s_1, \ldots, s_n \overset{\text{i.i.d.}}{\sim} \mathcal{B}(1/2)$. A test with level $\alpha$ rejects $H_0(\beta_0)$ when
$$T\big(s(y - X\beta_0), X\big) > c_T(X, \alpha), \tag{3.3}$$
where $c_T(X, \alpha)$ is the $(1 - \alpha)$-quantile of the distribution of $T(S_n, X)$. This result is generalized for distributions with a positive mass at zero in the following proposition.

PROPOSITION 3.2. (Randomized sign distribution). Suppose (2.1) holds with the assumption that $u_1, \ldots, u_n$ belong to a weak mediangale conditional on $X$. Let $V_1, \ldots, V_n$ be i.i.d. $U(0, 1)$-distributed random variables, independent of $u_1, \ldots, u_n$ and $X$. Then the variables $\tilde{s}_t = \tilde{s}(u_t, V_t)$ are i.i.d. conditional on $X$ with the distribution $P[\tilde{s}_t = 1\,|X] = P[\tilde{s}_t = -1\,|X] = \tfrac{1}{2}$, $t = 1, \ldots, n$.

All the procedures described in the paper can be applied by replacing $s$ with $\tilde{s}$. When the error distributions possess a mass at zero, the test statistic $T(\tilde{s}(y - X\beta_0), X)$ has to be used instead of $T(s(y - X\beta_0), X)$.

3.2. Regression sign-based statistics

We consider test statistics of the following form:
(3.4)
where Ω_n(s(y − Xβ_0), X) is a p × p weight matrix that depends on the constrained signs s(y − Xβ_0) under H_0(β_0). The weight matrix Ω_n(s(y − Xβ_0), X) provides a standardization that can be useful for power considerations, as well as to account for dependence schemes that cannot be eliminated by the sign transformation. Further, Ω_n(s(y − Xβ_0), X) would normally be selected to be positive definite, although this is not essential for the pivotality of the test statistic under the null hypothesis.^6 Statistics of the form D_S(β_0, Ω_n) include as special cases those studied by Koenker and Bassett (1982) and Boldin et al. (1997). Namely, on taking Ω_n = I_p and Ω_n = (X′X)^{−1}, we get

SB(β_0) = s(y − Xβ_0)′XX′s(y − Xβ_0) = ||X′s(y − Xβ_0)||²,   (3.5)

SF(β_0) = s(y − Xβ_0)′P(X)s(y − Xβ_0) = ||X′s(y − Xβ_0)||²_M,   (3.6)

where P(X) = X(X′X)^{−1}X′ and M = (X′X)^{−1}. In Boldin et al. (1997), it is shown that SB(β_0) and SF(β_0) can be associated with locally most powerful tests in the case of i.i.d. disturbances, under some regularity conditions on the distribution function (especially f(0) > 0).^7 Their proof extends easily to disturbances that satisfy the mediangale property and for which the conditional density at zero is the same across observations: f_t(0|X) = f(0|X), t = 1, …, n. SF(β_0) can be interpreted as a sign analogue of the Fisher statistic: SF(β_0) is a monotonic transformation of the Fisher statistic for testing γ = 0 in the regression of s(y − Xβ_0) on X, i.e. s(y − Xβ_0) = Xγ + v. This remark also holds for a general sign-based statistic of the form (3.4), when s(y − Xβ_0) is regressed on Ω_n^{−1/2}X. Wald, Lagrange multiplier (LM) and likelihood ratio (LR) asymptotic tests for M-estimators, such as the LAD estimator, in L_1-regression are developed by Koenker and Bassett (1982). They

^6 Under more restrictive assumptions, statistics that exploit other robust functions of y − Xβ_0 (such as ranks, signed ranks, and signs and ranks) can lead to more powerful tests. However, the fact that we allow for both heteroscedasticity and non-linear serial dependence of unknown forms appears to break the required pivotality result, and makes the use of such statistics quite difficult, if not impossible, in the context of our set-up. For a discussion of such alternative statistics (applicable under stronger assumptions), see Hallin and Puri (1991, 1992), Hallin et al. (2006, 2008), Hallin and Werker (2003) and the references therein.

^7 The power function of the locally most powerful sign-based test has the fastest increase when departing from β_0. In the multiparameter case, the scalar measure required to evaluate that speed is the curvature of the power function. Restricting attention to unbiased tests, Boldin et al. (1997) introduced different locally most powerful tests corresponding to different definitions of curvature. SB(β_0) maximizes the mean curvature, which is proportional to the trace of the shape operator; see Dubrovin et al. (1984, ch. 2, pp. 76–86) or Gray (1998, ch. 21, pp. 373–80) for a discussion of various curvature notions.
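As a purely illustrative sketch (not the authors' code), the statistics in (3.5) and (3.6) are easy to compute in the single-regressor case p = 1, where (X′X)^{−1} reduces to 1/Σ_t x_t². The sign convention s(0) = 1 used below is one possible choice; the paper's treatment of ties may differ.

```python
def sign(z):
    """s(z) = 1 if z >= 0, else -1 (one convention; tie handling may differ)."""
    return 1.0 if z >= 0 else -1.0

def sb_statistic(y, x, beta0):
    """SB(beta0) = (sum_t x_t * s(y_t - x_t*beta0))^2, the p = 1 case of (3.5)."""
    total = sum(xt * sign(yt - xt * beta0) for yt, xt in zip(y, x))
    return total ** 2

def sf_statistic(y, x, beta0):
    """SF(beta0) = SB(beta0) / sum_t x_t^2, the (X'X)^{-1}-weighted
    p = 1 case of (3.6)."""
    sxx = sum(xt ** 2 for xt in x)
    return sb_statistic(y, x, beta0) / sxx
```

For example, with y = (1, −2, 3), x = (1, 1, 1) and β_0 = 0, the signs are (+1, −1, +1), so SB = 1 and SF = 1/3.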
Finite-sample distribution-free inference in linear median regressions
S27
assume i.i.d. errors and a fixed design matrix. In that set-up, the LM statistic for testing H_0(β_0) : β = β_0 turns out to be the SF(β_0) statistic. The same authors also remarked that this type of statistic is asymptotically nuisance-parameter-free, contrary to LR- and Wald-type statistics. The Boldin et al. (1997) local optimality interpretation can be extended to heteroscedastic disturbances. In such a case, the locally optimal test statistic associated with the mean curvature, i.e. the test with the highest power near the null hypothesis according to a trace argument, takes the following form.

PROPOSITION 3.3. In model (2.1), suppose the mediangale Assumption 2.1 holds, and the disturbances are heteroscedastic with conditional densities f_t(·|X), t = 1, 2, …, which are continuously differentiable around zero and such that f_t(0|X) > 0. Then, the locally optimal sign-based statistic associated with the mean curvature is

SB̃(β_0) = s(y − Xβ_0)′X̃X̃′s(y − Xβ_0),   (3.7)

where X̃ = diag(f_1(0|X), …, f_n(0|X))X. When the f_t(0|X)'s are unknown, the optimal statistic is not feasible, and the optimal weights must be replaced by approximations, such as weights derived from the normal distribution.

Sign-based statistics of the form (3.4) can also be interpreted as GMM statistics exploiting the property that {s_t ⊗ x_t, F_t} is a martingale difference sequence.^8 However, these are quite unusual GMM statistics: the parameter of interest is not defined by moment conditions in explicit form. It is implicitly defined as the solution of a set of robust estimating equations (involving constrained signs):

Σ_{t=1}^{n} s(y_t − x_t′β) ⊗ x_t = 0.
For i.i.d. disturbances, Godambe (2001) showed that these estimating functions are optimal among all linear unbiased (for the median) estimating functions Σ_{t=1}^{n} a_t(β)s(y_t − x_t′β). For independent heteroscedastic disturbances, the set of optimal estimating equations is Σ_{t=1}^{n} s(y_t − x_t′β) ⊗ x̃_t = 0. In those cases, X (resp. X̃) can be viewed as the optimal instruments for the linear model.

We now turn to linearly dependent processes. We propose to use a weighting matrix directly derived from the asymptotic covariance matrix of (1/√n) s(y − Xβ_0) ⊗ X, which we denote J_n(s(y − Xβ_0), X). We consider Ω_n(s(y − Xβ_0), X) = (1/n) Ĵ_n(s(y − Xβ_0), X)^{−1}, where Ĵ_n(s(y − Xβ_0), X) stands for a consistent estimate of J_n(s(y − Xβ_0), X), which can be obtained using kernel estimators; see, for example, Parzen (1957), Newey and West (1987), Andrews (1991) and White (2001). This leads to

D_S(β_0, (1/n)Ĵ_n^{−1}) = (1/n) s(y − Xβ_0)′XĴ_n^{−1}X′s(y − Xβ_0).   (3.8)

J_n(s(y − Xβ_0), X) accounts for dependence among signs and explanatory variables. Hence, by using an estimate of its inverse as the weighting matrix, we perform a HAC correction. Note that the correction depends on β_0.

^8 Concerning power performance again, Chernozhukov et al. (2008) also show that the class of GMM sign-based statistics contains a locally asymptotically uniformly most powerful invariant test.
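For the scalar case p = 1, a Bartlett-kernel estimate of J_n and the resulting HAC-corrected statistic (3.8) can be sketched as follows. This is an illustrative implementation in the spirit of Newey and West (1987), not the authors' code; the bandwidth is left to the user (Andrews, 1991, discusses automatic selection).

```python
def bartlett_jn(z, bandwidth):
    """Bartlett-kernel estimate of J_n = var[(1/sqrt(n)) * sum_t z_t] for the
    scalar series z_t = s(y_t - x_t*beta0) * x_t, computed about zero since
    E[z_t] = 0 under H0:
        J^ = gamma(0) + 2 * sum_{k=1}^{H} (1 - k/(H+1)) * gamma(k)."""
    n = len(z)
    def gamma(k):  # sample autocovariance at lag k (about zero)
        return sum(z[t] * z[t - k] for t in range(k, n)) / n
    j = gamma(0)
    for k in range(1, bandwidth + 1):
        j += 2.0 * (1.0 - k / (bandwidth + 1.0)) * gamma(k)
    return j

def ds_hac(z, bandwidth):
    """HAC-corrected sign statistic (3.8) for p = 1:
    D_S = (1/n) * (sum_t z_t)^2 / J^_n."""
    return (sum(z) ** 2 / len(z)) / bartlett_jn(z, bandwidth)
```

For z = (1, −1, 1, −1), the lag-0 estimate is Ĵ = 1, while with bandwidth 1 the negative first autocovariance reduces it to Ĵ = 0.25, which is the sense in which the kernel weighting absorbs serial dependence among the signs.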
S28
E. Coudin and J.-M. Dufour
In all cases, H_0(β_0) is rejected when the statistic evaluated at β = β_0 is large: D_S(β_0, Ω_n) > c_n(X, α), where c_n(X, α) is a critical value that depends on the level α. Since we are dealing with pivotal functions, the critical values can be evaluated to any degree of precision by simulation. This is the strategy followed by Chernozhukov et al. (2008), which exploits the same finite-sample property of (θ-)signs in a quantile regression context with conditionally independent observations. However, as the distribution is discrete, a test based on c_n(X, α) may not exactly reach the nominal level. A more elegant solution consists in using the technique of Monte Carlo tests with a randomized tie-breaking procedure, which does not suffer from this shortcoming. Further, we will show later that the Monte Carlo procedure also enables one to build tests with asymptotically controlled level for general processes when Assumption 2.1 fails to hold.

3.3. Monte Carlo tests

Monte Carlo tests can be viewed as a finite-sample version of the bootstrap. They were introduced by Dwass (1957) (see also Barnard, 1963) and can be adapted to any pivotal statistic whose distribution can be simulated. For a general review and for extensions to the case where a nuisance parameter is present, the reader is referred to Dufour (2006). For discrete distributions, the method must be adapted to deal with ties; here, we use a randomized tie-breaking procedure for evaluating empirical survival functions (see Dufour, 2006).

Let us consider a statistic T whose conditional distribution given X is discrete and free of nuisance parameters, and a test which rejects the null hypothesis when T ≥ c(α). Let T^(0) be the observed value of T, and T^(1), …, T^(N) be N independent replicates of T. Each replicate T^(j) is associated with a uniform random variable W^(j) ~ U(0, 1) to produce the pairs (T^(j), W^(j)). The vector (W^(0), …, W^(N)) is independent of (T^(0), …, T^(N)). The pairs (T^(i), W^(i)) are ordered according to

(T^(i), W^(i)) ≥ (T^(j), W^(j)) ⇔ {T^(i) > T^(j) or (T^(i) = T^(j) and W^(i) ≥ W^(j))}.

This leads to the following p-value function:

p̃_N(x) = (N G̃_N(x) + 1)/(N + 1),

where the empirical survival function is

G̃_N(x) = 1 − (1/N) Σ_{i=1}^{N} s_+(x − T^(i)) + (1/N) Σ_{i=1}^{N} δ(T^(i) − x) s_+(W^(i) − W^(0)),

with s_+(x) = 1_{[0,∞)}(x) and δ(x) = 1_{{0}}(x). Then

P[p̃_N(T^(0)) ≤ α] = I[α(N + 1)]/(N + 1), for 0 ≤ α ≤ 1,

where I[z] denotes the largest integer less than or equal to z.
The randomized tie-breaking allows one to exactly control the level of the procedure. This may also increase the power of the test.
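The randomized tie-breaking procedure can be sketched in a few lines of Python. This is an illustration of the p-value formula of Dufour (2006), not the authors' code; the seed argument is an implementation convenience.

```python
import random

def mc_pvalue(t_obs, t_reps, seed=0):
    """Monte Carlo p-value with randomized tie-breaking.
    t_obs  : observed statistic T^(0)
    t_reps : list of N replicates T^(1), ..., T^(N)
    A tie T^(i) = T^(0) counts against T^(0) only when W^(i) >= W^(0),
    with the W's independent U(0,1) draws."""
    rng = random.Random(seed)
    n_reps = len(t_reps)
    w_obs = rng.random()                  # W^(0)
    beat = 0
    for t in t_reps:
        w = rng.random()                  # W^(i)
        if t > t_obs or (t == t_obs and w >= w_obs):
            beat += 1
    g_tilde = beat / n_reps               # empirical survival G~_N(T^(0))
    return (n_reps * g_tilde + 1) / (n_reps + 1)   # p~_N(T^(0))
```

With no ties the randomization is inert: for instance, if T^(0) exceeds all N = 4 replicates, the p-value is 1/(N + 1) = 0.2 regardless of the uniform draws.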
4. REGRESSION SIGN-BASED CONFIDENCE SETS

In this section, we discuss how to use Monte Carlo sign-based joint tests in order to build confidence sets for β with known level. This can be done as follows. For each value β_0 ∈ R^p, perform the Monte Carlo sign test for H_0(β_0) and obtain the associated simulated p-value. The confidence set C_{1−α}(β) that contains every β_0 with p-value higher than α has, by construction, level 1 − α (see Dufour, 2006). From this simultaneous confidence set for β, it is possible,
by projection techniques, to derive confidence intervals for the individual components. More generally, we can obtain conservative confidence sets for any transformation g(β), where g may be any kind of real function, including a non-linear one. Obviously, covering a continuous grid of R^p is not feasible; we instead rely on global optimization search algorithms.

4.1. Confidence sets and conservative confidence intervals

Projection techniques yield finite-sample valid confidence intervals and confidence sets for general functions of the parameter β. For examples of their use in different settings and for further discussion, the reader is referred to Dufour (1990, 1997), Abdelkhalek and Dufour (1998), Dufour and Kiviet (1998), Dufour and Jasiak (2001) and Dufour and Taamouti (2005). The basic idea is the following. Suppose a simultaneous confidence set with level 1 − α for β, C_{1−α}(β), is available. Since β ∈ C_{1−α}(β) ⟹ g(β) ∈ g(C_{1−α}(β)), we have

P[β ∈ C_{1−α}(β)] ≥ 1 − α ⟹ P[g(β) ∈ g(C_{1−α}(β))] ≥ 1 − α.

Thus, g(C_{1−α}(β)) is a conservative confidence set for g(β). If g(β) is scalar, the interval (in the extended real numbers)

I_g[C_{1−α}(β)] = [inf_{β∈C_{1−α}(β)} g(β), sup_{β∈C_{1−α}(β)} g(β)]

has level 1 − α:

P[inf_{β∈C_{1−α}(β)} g(β) ≤ g(β) ≤ sup_{β∈C_{1−α}(β)} g(β)] ≥ 1 − α.
Hence, to obtain valid conservative confidence intervals for an individual component β_k in model (2.1) under the mediangale Assumption 2.1, it suffices to solve the following numerical optimization problems, where s.c. stands for 'subject to the constraint':

min_{β∈R^p} β_k  s.c. p̃_N(D_S(β)) ≥ α,   max_{β∈R^p} β_k  s.c. p̃_N(D_S(β)) ≥ α,

where p̃_N is computed using N replicates D_S^(j) of the statistic D_S under the null hypothesis. In practice, we use simulated annealing as the optimization algorithm (see Goffe et al., 1994; Press et al., 1996).^9 In the case of multiple tests, projection techniques allow one to perform tests on an arbitrary number of hypotheses without ever losing control of the overall level: the probability of rejecting at least one true null hypothesis does not exceed the specified level α.

4.2. Numerical illustration

This part reports a numerical illustration. We generate the following normal mixture process for n = 50:

y_t = β_0 + β_1 x_t + u_t, t = 1, …, n,  u_t ~ i.i.d. N(0, 1) with probability 0.95 and N(0, 100²) with probability 0.05.

We conduct an exact inference procedure with N = 999 replicates. The true process is generated with β_0 = β_1 = 0. We perform tests of H_0(β*) : β = β* on a grid for β* = (β*_0, β*_1)′ and retain the associated simulated p-values. As β is a two-dimensional vector, we can provide a graphical illustration: to each value of the vector β corresponds its simulated p-value. The confidence
^9 See Chernozhukov et al. (2008) for the use of other MCMC algorithms.
[Figure 1. Confidence regions provided by SF-based inference: nested 75%, 90%, 95% and 98% regions in the (β_0, β_1) plane.]

Table 1. Confidence intervals.

             OLS              White            SF
β_0  95% CI  [−4.57, 0.82]   [−4.47, 0.72]   [−0.54, 0.23]
     98% CI  [−5.10, 1.35]   [−4.98, 1.23]   [−0.64, 0.26]
β_1  95% CI  [−2.50, 3.22]   [−1.34, 2.06]   [−0.42, 0.59]
     98% CI  [−3.07, 3.78]   [−1.67, 2.39]   [−0.57, 0.64]
region with level 1 − α contains all the values of β with p-values greater than α. Confidence intervals are obtained by projecting the simultaneous confidence region on the axes of β_0 and β_1; see Figure 1 and Table 1. The confidence regions so obtained increase with the level and nest the regions with smaller levels. They are highly non-elliptical and may thus lead to results different from those of asymptotic inference. As for confidence intervals, the sign-based ones appear far more robust than the OLS and White confidence intervals and are less sensitive to outliers.
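For the scalar location case, the grid-based construction of this section can be sketched as follows. This is illustrative Python, with a plain grid search standing in for the simulated annealing used in the paper; the sign convention s(0) = 1 and the seed are arbitrary choices.

```python
import random

def sign_ci(y, grid, alpha=0.05, n_reps=999, seed=42):
    """Projection confidence interval for a median (p = 1 location model):
    keep every beta0 on `grid` whose Monte Carlo sign-test p-value exceeds
    alpha, then project onto an interval."""
    rng = random.Random(seed)
    n = len(y)
    sign = lambda z: 1 if z >= 0 else -1
    # exact null replicates of SB: squared sums of n independent +/-1 signs
    reps = [sum(rng.choice((-1, 1)) for _ in range(n)) ** 2
            for _ in range(n_reps)]
    w0 = rng.random()                                  # W^(0) for tie-breaking
    kept = []
    for b in grid:
        sb = sum(sign(yt - b) for yt in y) ** 2        # SB(beta0), x_t = 1
        beat = sum(1 for t in reps
                   if t > sb or (t == sb and rng.random() >= w0))
        if (beat + 1) / (n_reps + 1) > alpha:          # p-value > alpha: keep
            kept.append(b)
    return (min(kept), max(kept)) if kept else None
```

For a sample roughly centred at zero, the resulting interval contains zero; grid values far in the tails are rejected because all residual signs then agree.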
5. ASYMPTOTIC THEORY

This section is dedicated to asymptotic results. We point out that the mediangale Assumption 2.1 excludes some common processes for which usual asymptotic inference can still be conducted. We relax Assumption 2.1 to allow a random X that may not be independent of u, and we show that the finite-sample sign-based inference remains asymptotically valid: for a fixed number of replicates, as the number of observations goes to infinity, the level of a test tends to the
nominal level. Besides, we stress the ability of our methods to cover heavy-tailed distributions, including disturbances with infinite variance.

5.1. Asymptotic distributions of test statistics

In this part, we derive the asymptotic distributions of the sign-based statistics. We show that the HAC-corrected version of the sign-based statistic D_S(β_0, (1/n)Ĵ_n^{−1}) in (3.8) yields an asymptotically pivotal function. The set of assumptions we impose to stabilize the asymptotic behaviour will be needed for further asymptotic results. We consider the linear model (2.1) with the following assumptions.

ASSUMPTION 5.1 (Mixing). {(x_t′, u_t)′ : t = 1, 2, …} is α-mixing of size −r/(r − 2), r > 2.^10

ASSUMPTION 5.2 (Moment condition). E[s(u_t)x_t] = 0, t = 1, …, n, ∀n ∈ N.

ASSUMPTION 5.3 (Boundedness). x_t = (x_1t, …, x_pt)′ and E|x_ht|^r < Δ < ∞, h = 1, …, p, t = 1, …, n, ∀n ∈ N.

ASSUMPTION 5.4 (Non-singularity). J_n = var[(1/√n) Σ_{t=1}^{n} s(u_t)x_t] is uniformly positive definite.

ASSUMPTION 5.5 (Consistent estimator of J_n). Ω_n(β_0) is symmetric positive definite uniformly over n and Ω_n − (1/n)J_n^{−1} →_p 0.

We can now give the following result on the asymptotic distribution of D_S(β_0, Ω_n) under H_0(β_0).

THEOREM 5.1 (Asymptotic distribution of sign-based statistics). In model (2.1), with Assumptions 5.1–5.5, we have, under H_0(β_0), D_S(β_0, Ω_n) → χ²(p).

In particular, when the mediangale condition holds, J_n reduces to E(X′X/n), and (X′X/n)^{−1} is a consistent estimator of J_n^{−1}. This yields the following corollary.

COROLLARY 5.1. In model (2.1), suppose the mediangale Assumption 2.1 and the boundedness Assumption 5.3 hold. If X′X/n is positive definite uniformly over n and converges in probability to a positive definite matrix, then, under H_0(β_0), SF(β_0) → χ²(p).

5.2. Asymptotic validity of Monte Carlo tests

We first state some general results on the asymptotic validity of Monte Carlo-based inference methods.
Then, we apply these results to sign-based inference methods.

5.2.1. Generalities. Let us consider a parametric or semi-parametric model {M_β, β ∈ Θ}. Let S_n(β_0) be a test statistic for H_0(β_0), and let c_n be the rate of convergence. Under H_0(β_0), the distribution function of c_n S_n(β_0) is denoted F_n(x). We suppose that F_n(x) converges almost everywhere to a distribution function F(x); G(x) and G_n(x) are the corresponding survival functions. In Theorem 5.2, we show that if a sequence of conditional survival functions G̃_n(x|X_n(ω)) given X_n(ω) satisfies G̃_n(x|X_n(ω)) → G(x) with probability one, where G does not
^10 See White (2001) for a definition of α-mixing.
depend on the realization X_n(ω), then G̃_n(x|X_n(ω)) can be used as an approximation of G_n(x): it can be seen as a pseudo survival function of c_n S_n(β_0).

THEOREM 5.2 (Generic asymptotic validity). Let S_n(β_0) be a test statistic for testing H_0(β_0) : β = β_0 against H_1(β_0) : β ≠ β_0 in model (2.1). Suppose that, under H_0(β_0),

P[c_n S_n(β_0) ≥ x | X_n] = G_n(x|X_n) = 1 − F_n(x|X_n) → G(x) a.e. as n → ∞,

where {c_n} is a sequence of positive constants, and suppose that G̃_n(x|X_n(ω)) is a sequence of survival functions such that G̃_n(x|X_n(ω)) → G(x) with probability one as n → ∞. Then

lim_{n→∞} P[G̃_n(c_n S_n(β_0), X_n(ω)) ≤ α] ≤ α.   (5.1)
This theorem can also be stated in a Monte Carlo version. Following Dufour (2006), we use empirical survival functions and empirical p-values adapted to discrete statistics in a randomized way, but the replicates are not drawn from the same distribution as the observed statistic. However, the two distribution functions, F_n and F̃_n respectively, converge to the same limit F. Let U(N + 1) = (U^(0), U^(1), …, U^(N)) be a vector of N + 1 i.i.d. real variables drawn from a U(0, 1) distribution, let S_n^(0) be the observed statistic, and let S_n(N) = (S_n^(1), …, S_n^(N)) be a vector of N independent replicates drawn from F̃_n. Then the randomized pseudo empirical survival function under H_0(β_0) is

G̃_n^(N)(x, n, S_n^(0), S_n(N), U(N + 1)) = 1 − (1/N) Σ_{j=1}^{N} s_+(x − c_n S_n^(j)) + (1/N) Σ_{j=1}^{N} δ(c_n S_n^(j) − x) s_+(U^(j) − U^(0)).

G̃_n^(N)(x, n, S_n^(0), S_n(N), U(N + 1)) is, in a sense, an approximation of G̃_n(x); it depends on both the number of replicates, N, and the number of observations, n. The randomized pseudo empirical p-value function is defined as

p̃_n^(N)(x) = (N G̃_n^(N)(x) + 1)/(N + 1).   (5.2)
We can now state the Monte Carlo version of Theorem 5.2.

THEOREM 5.3 (Monte Carlo test asymptotic validity). Let S_n(β_0) be a test statistic for testing H_0(β_0) : β = β_0 against H_1(β_0) : β ≠ β_0 in model (2.1), and let S_n^(0) be its observed value. Suppose that, under H_0(β_0),

P[c_n S_n(β_0) ≥ x | X_n] = G_n(x|X_n) = 1 − F_n(x|X_n) → G(x) a.e. as n → ∞,

where {c_n} is a sequence of positive constants. Let S̃_n be a random variable with conditional survival function G̃_n(x|X_n), such that

P[c_n S̃_n ≥ x | X_n] = G̃_n(x|X_n) = 1 − F̃_n(x|X_n) → G(x) a.e. as n → ∞,
and let (S_n^(1), …, S_n^(N)) be a vector of N independent replicates of S̃_n, where (N + 1)α is an integer. Then the randomized version of the Monte Carlo test with level α is asymptotically valid, i.e. lim_{n→∞} P[p̃_n^(N)(β_0) ≤ α] ≤ α.
These results can be applied to the sign-based inference method. However, Theorems 5.2 and 5.3 are much more general. They do not rely exclusively on asymptotic normality (the limiting distribution may differ from a Gaussian one), and the rate of convergence may differ from √n.

5.2.2. Asymptotic validity of sign-based inference. In model (2.1), suppose that Assumptions 5.1–5.5 hold and consider the testing problem H_0(β_0) : β = β_0 against H_1(β_0) : β ≠ β_0. Let D_S(β, Ĵ_n^{−1}) be the test statistic defined in (3.8), and observe SF^(0) = D_S(β_0, Ĵ_n^{−1}). Draw N independent replicates of the sign vector, each with n independent components, from a B(1, 0.5) distribution. Compute (SF^(1), SF^(2), …, SF^(N)), the N pseudo replicates of D_S(β_0, (X′X)^{−1}) under H_0(β_0). We call them 'pseudo' replicates because they are drawn as if the observations were independent. Draw N + 1 independent replicates (W^(0), …, W^(N)) from a U(0, 1) distribution and form the pairs (SF^(j), W^(j)). Compute p̃_n^(N)(β_0) using (5.2). From Theorem 5.3, the confidence region {β ∈ R^p | p̃_n^(N)(β) ≥ α} is asymptotically conservative with level at least 1 − α, and H_0(β_0) is rejected when p̃_n^(N)(β_0) ≤ α.

Contrary to usual asymptotic tests, this method requires neither the existence of moments nor a density for the process {u_t : t = 1, 2, …}. Usual Wald-type inference is based on the asymptotic behaviour of estimators and is consequently more restrictive: more moment existence restrictions are needed; see Weiss (1991) and Fitzenberger (1997b). Besides, the asymptotic variance of the LAD estimator involves the conditional density at zero of the disturbance process {u_t : t = 1, 2, …} as an unknown nuisance parameter. The approximation and estimation of asymptotic covariance matrices constitute a major issue in asymptotic inference, usually requiring kernel methods. We get around those problems by adopting the finite-sample sign-based procedure.
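The procedure just described can be sketched as follows (illustrative Python for p = 1, not the authors' code; the statistic here is the (X′X)^{−1}-weighted SF rather than the full HAC-corrected version, and the pseudo replicates use independent B(1, 0.5) signs exactly as in the text):

```python
import random

def sign_mc_test(y, x, beta0, n_reps=999, alpha=0.05, seed=7):
    """Sign-based Monte Carlo test of H0: beta = beta0 (p = 1 sketch).
    Pseudo replicates are drawn as if observations were independent:
    each replicate is a vector of n independent +/-1 signs."""
    rng = random.Random(seed)
    sxx = sum(xt * xt for xt in x)
    sf = lambda s: sum(xt * st for xt, st in zip(x, s)) ** 2 / sxx
    s_obs = [1 if yt - xt * beta0 >= 0 else -1 for yt, xt in zip(y, x)]
    sf_obs = sf(s_obs)                                    # SF^(0)
    reps = [sf([rng.choice((-1, 1)) for _ in y]) for _ in range(n_reps)]
    w0 = rng.random()                                     # W^(0)
    beat = sum(1 for t in reps
               if t > sf_obs or (t == sf_obs and rng.random() >= w0))
    pval = (beat + 1) / (n_reps + 1)                      # p~_n^(N)(beta0)
    return pval, pval <= alpha                            # reject when p <= alpha
```

For a sample whose residual signs balance exactly under H_0, SF^(0) = 0, so the p-value is large and the null hypothesis is retained.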
6. SIMULATION STUDY

In this section, we study the performance of sign-based methods compared with usual asymptotic tests based on the OLS or LAD estimators, with different approximations of their asymptotic covariance matrices. We consider the sign-based statistics D_S(β, (X′X)^{−1}) and D_S(β, Ĵ_n^{−1}), the latter when a correction is needed for linear serial dependence. We consider a set of general DGPs that illustrate different classical problems one may encounter in practice; they are presented in Table 2. First, we investigate the performance of tests, then of confidence sets. We use the following linear regression model:

y_t = x_t′β_0 + u_t, t = 1, …, n,   (6.1)

where x_t = (1, x_2,t, x_3,t)′ and β_0 are 3 × 1 vectors, and n denotes the sample size. For the first six cases, {u_t : t = 1, 2, …} is i.i.d. or depends on the explanatory variables and its own past values in a multiplicative heteroscedastic way: u_t = h(x_t, u_{t−1}, …, u_1)ε_t, t = 1, …, n. In those cases, the error term constitutes a strict conditional mediangale given X (see Assumption 2.1), and the levels of sign-based tests and confidence sets are perfectly controlled. Case C1 presents i.i.d. normal observations without conditional heteroscedasticity. Case C2 involves outliers in the error term, which can be seen as an example of measurement error in the observed
Table 2. Simulated models.

C1 (Normal HOM): (x_2,t, x_3,t, u_t)′ ~ i.i.d. N(0, I_3), t = 1, …, n.

C2 (Outlier): (x_2,t, x_3,t)′ ~ i.i.d. N(0, I_2); u_t ~ i.i.d. N(0, 1) with probability 0.95 and N(0, 1000²) with probability 0.05; x_t, u_t independent, t = 1, …, n.

C3 (Stat. GARCH(1,1)): (x_2,t, x_3,t)′ ~ i.i.d. N(0, I_2); u_t = σ_t ε_t with σ_t² = 0.666 u_{t−1}² + 0.333 σ_{t−1}², where ε_t ~ i.i.d. N(0, 1); x_t, ε_t independent, t = 1, …, n.

C4 (Stoc. Volatility): (x_2,t, x_3,t)′ ~ i.i.d. N(0, I_2); u_t = exp(w_t/2) ε_t with w_t = 0.5 w_{t−1} + v_t, where ε_t ~ i.i.d. N(0, 1), v_t ~ i.i.d. χ²(3); x_t, u_t independent, t = 1, …, n.

C5 (Deb. design matrix + HET. dist.): x_2,t ~ i.i.d. N(0, 1), x_3,t ~ i.i.d. χ²(1); u_t = x_3,t ε_t, ε_t ~ i.i.d. N(0, 1); x_t, ε_t independent, t = 1, …, n.

C6 (Cauchy disturbances): (x_2,t, x_3,t)′ ~ i.i.d. N(0, I_2); u_t ~ i.i.d. C; x_t, u_t independent, t = 1, …, n.

C7 (AR(1)-HET, ρ_u = 0.5, ρ_x = 0.5): x_j,t = ρ_x x_j,t−1 + ν_t^j, j = 2, 3; u_t = min{3, max[0.21, |x_2,t|]} × ũ_t, ũ_t = ρ_u ũ_{t−1} + ν_t^u, where (ν_t^2, ν_t^3, ν_t^u)′ ~ i.i.d. N(0, I_3), t = 2, …, n, and ν_1^2, ν_1^3, ν_1^u are chosen to ensure stationarity.

C8 (Exp. Var.): (x_2,t, x_3,t, ε_t)′ ~ i.i.d. N(0, I_3); u_t = exp(0.2t) ε_t.
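For instance, case C3 of Table 2 can be simulated with a short recursion. This is a sketch: the paper does not spell out startup values, so u_0 = 0 and σ_0² = 1 below are arbitrary choices.

```python
import random

def simulate_c3(n, seed=0):
    """Simulate case C3 of Table 2: (x2, x3) ~ N(0, I2) and GARCH(1,1)
    errors u_t = sigma_t * eps_t with
    sigma_t^2 = 0.666 * u_{t-1}^2 + 0.333 * sigma_{t-1}^2.
    Startup values u_0 = 0, sigma_0^2 = 1 are assumptions."""
    rng = random.Random(seed)
    x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x3 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    u, sigma2_prev, u_prev = [], 1.0, 0.0
    for _ in range(n):
        sigma2 = 0.666 * u_prev ** 2 + 0.333 * sigma2_prev
        ut = (sigma2 ** 0.5) * rng.gauss(0.0, 1.0)      # u_t = sigma_t * eps_t
        u.append(ut)
        u_prev, sigma2_prev = ut, sigma2
    return x2, x3, u
```

Since ε_t is symmetric about zero and independent of the past, the conditional median of u_t is zero, so the simulated errors satisfy the mediangale property required for exact sign-based inference.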
y. Cases C3 and C4 involve other non-linear dependence schemes, with stationary GARCH and stochastic volatility disturbances. Case C5 combines a very unbalanced design matrix (where the LAD estimator performs poorly) with highly conditionally heteroscedastic disturbances. Case C6 is an example of heavy-tailed (Cauchy) errors. Next, we study the behaviour of the sign-based inference (involving a HAC correction) when inference is only asymptotically valid. Case C7 illustrates the behaviour of sign-based inference when the error term involves linear dependence at a mild level (see the discussion paper for results at other levels of linear dependence, and Fitzenberger, 1997b, for a study of LAD block bootstrap performance on such DGPs). In that case, x_t and u_t are such that E(u_t x_t) = 0 and E[s(u_t)x_t] = 0 for all t. Finally, case C8 involves disturbances that are not second-order stationary (exponential variance) but for which the mediangale assumption holds. As noted previously, sign-based inference does not require stationarity assumptions, in contrast with tests derived from CLTs. In each case, the design matrix is simulated once; hence, the results are conditional. More simulation results on other types of DGPs can be found in the discussion paper (Coudin and Dufour, 2007).
Table 3. Linear regression under mediangale errors: empirical sizes of conditional tests for H_0 : β = (1, 2, 3)′; y_t = x_t′β + u_t, t = 1, …, 50.

Columns: sign-based tests (SIGN: SF, SHAC), LAD-based tests (OS, DMB, MBB, BT), OLS-based tests (IID, WH, BT) and LR. Rows: stationary models with mediangale errors (C1: HOM, ρ = ρ_x = 0; C2: Outlier; C3: St. GARCH(1,1); C4: Stoch. Volat.; C5: Deb. + Het.; C6: Cauchy), non-stationary models with mediangale errors (C8: Exp. Var.) and stationary models with serial dependence (C7: HET, ρ = ρ_x = 0.5**). [Individual entries omitted.]

Notes: * Sizes using asymptotic critical values based on χ²(3). ** Automatic bandwidth parameters are restricted to be < 10 to avoid invertibility problems.
6.1. Size

We first study level distortions. We consider the testing problem H_0 : β = (1, 2, 3)′ against H_1 : β ≠ (1, 2, 3)′. We compare exact and asymptotic tests based on SF = D_S(β, (X′X)^{−1}) and SHAC = D_S(β, Ĵ_n^{−1}), where Ĵ_n is estimated with a Bartlett kernel, with various asymptotic tests. Wald- and LR-type tests are considered. We consider Wald tests based on the OLS estimate with three different covariance estimators: the usual one under homoscedasticity and independence (IID), the White correction for heteroscedasticity (WH) and the Bartlett kernel covariance estimator with automatic bandwidth parameter (BT; Andrews, 1991). Concerning the LAD estimator, we study Wald-type tests based on several covariance estimators: the order statistic estimator (OS),^11 the Bartlett kernel covariance estimator with automatic bandwidth parameter (BT; Powell, 1984, Buchinsky, 1995), the design matrix bootstrap centring around the sample estimate (DMB; Buchinsky, 1998) and the moving block bootstrap centring around the sample estimate (MBB; Fitzenberger, 1997b).^12

^11 This assumes i.i.d. residuals; an estimate of the residual density at zero is obtained from a confidence interval constructed for the (n/2)th residual (Buchinsky, 1998). ^12 The block size is 5.
Finally, we consider the likelihood ratio statistic (LR), assuming i.i.d. disturbances, with an OS estimate of the error density (Koenker and Bassett, 1982). When the errors are i.i.d. and X is fixed, the LM statistic for testing the joint hypothesis H_0(β_0) turns out to be the SF sign-based statistic. Consequently, the three usual forms (Wald, LR and LM) of asymptotic tests are compared in our set-up. In Table 3, we report the simulated sizes of conditional tests with nominal level α = 5%, given X. N = 2999 replicates are used for both the bootstrap and the Monte Carlo sign-based methods. All bootstrapped samples are of size n = 50. We simulate M = 5000 random samples to evaluate the sizes of these tests. For both sign-based statistics, we also report the asymptotic level whenever the processes are stationary.

When the mediangale Assumption 2.1 holds, the sizes of tests derived from the sign-based finite-sample methods are exactly controlled, whereas asymptotic tests may greatly overreject or underreject the null hypothesis. This remark especially holds for cases involving strong heteroscedasticity (cases C3, C5). The asymptotic versions of the sign-based tests suffer from the same underrejection as the other asymptotic tests, suggesting that for small samples (n = 50) the distribution of the test statistic is really far from its asymptotic limit. Hence, the sign-based method, which deals directly with this distribution, has a clear advantage over asymptotic methods. When the disturbance process is highly heteroscedastic (case C5), the kernel estimation of the LAD asymptotic covariance matrix is no longer reliable. In the last row, we illustrate the behaviour of the tests when the error term involves linear serial dependence. The Monte Carlo SHAC sign-based test does not control the level exactly, but it is still asymptotically valid and yields the best results; we underscore its advantages compared with other asymptotically justified methods. Whereas the Wald and LR tests overreject the null hypothesis, the Monte Carlo SHAC test controls the level better than its asymptotic version, avoiding underrejection. There are important differences between using critical values from the asymptotic distribution of the SHAC statistic and critical values derived from the distribution of the SHAC statistic for a fixed number of independent signs. Besides, we underscore the dramatic overrejections of asymptotic Wald tests based on HAC estimation of the asymptotic covariance matrix when the data set involves a small number of observations. These results suggest that, when the data suffer from both a small number of observations and linear dependence, the first problem to solve is the finite-sample distortion, which is not what is usually done.
6.2. Power

We now illustrate the power of these tests. We are particularly interested in comparing sign-based inference with kernel and bootstrap methods. We consider the same simultaneous hypothesis H_0 as before. The true process is obtained by fixing β_1 and β_3 at the tested values, i.e. β_1 = 1 and β_3 = 3, and letting β_2 vary. The simulated power is given by a graph with β_2 on the abscissa. The power functions presented here (Figures 2 and 3) are locally adjusted for the level, which allows comparisons between methods. However, we should keep in mind that only the sign-based methods yield exact levels without adjustment; other methods may overreject the null hypothesis, and hence fail to control the level of the test, or underreject it, and hence lose power. Sign-based inference has power comparable with LAD methods in cases C1 and C2, and slightly lower power in case C6 (Cauchy disturbances), with the advantage that the level is exactly controlled, which makes a great difference in small samples. In heteroscedastic or heterogeneous cases (C4, C5 and, above all, C3 and C8), sign-based inference dominates other
[Figure 2. Power functions (level corrected). Panels: (a) C1: normal; (b) C2: outliers; (c) C3: stationary GARCH; (d) C4: stochastic volatility; (e) C5: DEB+HET; (f) C6: Cauchy. Each panel plots the probability of rejecting H_0 against the true value of β_2.]
[Figure 3. Power functions (level corrected). Panels: (a) C7: AR(1)-HET, ρ_x = ρ_u = 0.5; (b) C8: exponential variance. Each panel plots the probability of rejecting H_0 against the true value of β_2.]
methods: levels are exactly controlled and its power functions exceed the others, even when the latter are size-corrected with locally adjusted levels. In the presence of linear serial dependence, the Monte Carlo test based on D_S(β, Ĵ_n^{−1}), which remains asymptotically valid, leads to good power performance under mild autocorrelation, along with better size control (C7). Only for very high autocorrelation (close to a unit root process) is sign-based inference unsuitable; see the discussion paper (Coudin and Dufour, 2007).

6.3. Confidence intervals

As the sign-based confidence regions are, by construction, of level at least 1 − α whenever inference is exact, a natural performance indicator for confidence intervals is their width. We therefore compare the width of the confidence intervals obtained by projecting the sign-based simultaneous confidence regions with that of intervals based on t-statistics for the LAD estimator. We use M = 1000 simulations and report the average width of the confidence interval for each β_k, together with coverage probabilities, in Table 4. We only consider stationary examples. The confidence intervals obtained by projection are wider than the asymptotic ones, because they are conservative by construction. However, it is not clear that valid confidence intervals without this feature can be built at all.
7. ILLUSTRATIVE APPLICATION: STANDARD AND POOR'S DRIFT

We test for the presence of a drift in the Standard and Poor's composite price index (SP), 1928–87. That process is known to involve a large amount of heteroscedasticity and has been used by Gallant et al. (1997) and Dufour and Valéry (2008) to fit a stochastic volatility model. Here, we are interested in robust testing without modelling the volatility in the disturbance process. The data set consists of a series of 16,127 daily observations of SP_t, converted to price movements, y_t = 100[log(SP_t) − log(SP_{t−1})], and adjusted for systematic calendar effects. We consider a
Table 4. Width of confidence intervals (for stationary cases). [Table flattened in extraction; it reports, for the model y_t = x_t′β + u_t, t = 1, . . . , T, with T = 50 and (β1, β2, β3) = (1, 2, 3), the average spread (with standard deviation) and coverage level of confidence intervals for β1, β2, β3, obtained by projection-based sign methods (SF and SHAC) and by LAD t-statistics with DMB, MBB and BT covariance estimates, under designs satisfying the mediangale condition (C1: HOM, ρu = ρx = 0; C2; C3: outlier; C4: GARCH(1,1); stochastic volatility; C6: Cauchy) and a design that does not (C7: ρu = ρx = 0.5, HET); individual numeric cells are not reliably recoverable.]

Finite-sample distribution-free inference in linear median regressions
model with a constant and a drift:

y_t = a + bt + u_t,  t = 1, . . . , 16,127,  (7.1)
where we allow for the possibility that {u_t : t = 1, . . . , 16,127} exhibits stochastic volatility or any other kind of nonlinear heteroscedasticity of unknown form. White and Breusch–Pagan tests for heteroscedasticity both reject homoscedasticity at the 1% level.13 We derive confidence intervals for the two parameters with the Monte Carlo sign-based method, and we compare them with those obtained by Wald techniques applied to the LAD and OLS estimates. We then perform a similar experiment on two subperiods, the whole year 1929 (291 observations) and the last 90 trading days of 1929, which roughly correspond to its four last months (90 observations), to investigate the behaviour of the different methods in small samples. Because of the financial crisis, one may expect the data to involve heavy heteroscedasticity during this period. Recall that the Wall Street Crash occurred between October 24 (Black Thursday) and October 29 (Black Tuesday). Hence, the second subsample covers September and October with the crash period, and November and December with the very beginning of the Great Depression. Heteroscedasticity tests reject homoscedasticity for both subsamples.14 In Table 5, we report 95% confidence intervals for a and b obtained by various methods: the finite-sample sign-based method (for SF and SHAC, which involves a HAC correction); LAD and OLS with different estimates of their asymptotic covariance matrices (order statistic, bootstrap, kernel, . . .). If the mediangale Assumption 2.1 holds, the coverage probabilities of the sign-based confidence intervals are controlled. First, results for the drift are very similar across methods: the absence of a drift cannot be rejected at the 5% level. But results concerning the constant differ greatly between methods and time periods. In the whole sample, the conclusions of Wald tests based on the LAD estimator differ depending on the choice of covariance matrix estimate.
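The projection-based confidence intervals used here can be obtained by inverting the joint sign test over a grid of parameter values. The following Python sketch illustrates the mechanics on simulated heteroskedastic data (a hypothetical stand-in; the actual S&P series is not reproduced here, and the grid bounds are arbitrary). Note that the reference distribution of SF under i.i.d. Bernoulli(1/2) signs does not depend on β0, so it can be simulated once.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical heteroskedastic, median-zero "returns" with no drift (a = b = 0)
T = 300
t = np.arange(1, T + 1, dtype=float)
y = (1 + 0.5 * np.abs(np.sin(t / 40.0))) * rng.standard_t(df=3, size=T)
X = np.column_stack([np.ones(T), t])

def sf_statistic(signs, X):
    proj = X @ np.linalg.solve(X.T @ X, X.T @ signs)
    return float(signs @ proj)

# null reference distribution of SF, simulated once (beta0-free)
ref = np.sort([sf_statistic(rng.choice([-1.0, 1.0], size=T), X)
               for _ in range(499)])

def p_value(beta0):
    stat = sf_statistic(np.sign(y - X @ beta0), X)
    return (1 + np.sum(ref >= stat)) / (len(ref) + 1)

# projection-based 95% confidence interval for the drift b: keep the b-values
# of all grid points (a, b) that the joint 5%-level sign test does not reject
grid = [(a, b) for a in np.linspace(-1.0, 1.0, 9)
               for b in np.linspace(-0.02, 0.02, 21)]
kept_b = [b for a, b in grid if p_value(np.array([a, b])) > 0.05]
ci_b = (min(kept_b), max(kept_b))
```

In practice a finer (or adaptive) grid would be used; the projection makes the interval conservative, consistent with the discussion in Section 6.3.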
Concerning the test of a positive constant, the Wald tests with bootstrap or with an estimate derived as if observations were i.i.d. (OS covariance matrix), an assumption that is totally illusory in that sample, reject, whereas the Wald test with a kernel covariance estimate (as do the sign-based tests) cannot reject the nullity of a. This may leave the practitioner perplexed: which is the correct test? In all the samples considered, Wald tests based on OLS appear to be unreliable. Either the confidence intervals are huge (see the OLS results on both subperiods) or some bias is suspected (see the OLS results on the whole period). Take the constant parameter: on the one hand, sign-based and LAD confidence intervals are shifted to the right; on the other hand, OLS confidence intervals seem to be biased towards zero. This may be due to the presence of some influential observations. Moreover, the OLS estimate for the whole sample is negative. In settings with arbitrary heteroscedasticity, LS methods should be avoided. Let us examine the third column of Table 5. The tightest confidence intervals for the constant parameter are obtained with sign-based tests based on the SHAC statistic, whereas the LAD (and OLS) ones are larger. Note, besides, the gain obtained by using SHAC instead of SF in that set-up, which suggests the presence of autocorrelation in the disturbance process. In such circumstances, finite-sample sign-based tests remain asymptotically valid, as do Wald methods. However, they are also corrected for the sample size and yield very different results. Finally, sign-based tests
13 Whole sample: White: 499 (p-value = 0.000); BP: 2781 (p-value = 0.000).
14 1929: White: 24.2 (p-value = 0.000); BP: 126 (p-value = 0.000); Sept–Oct–Nov–Dec 1929: White: 11.08 (p-value = 0.004); BP: 1.76 (p-value = 0.18).
Table 5. S&P price index: 95% confidence intervals. [Table flattened in extraction; for the constant parameter (a) and the drift parameter (b), it reports confidence intervals (and point estimates) over the whole sample (16,120 obs.) and two subsamples, 1929 (291 obs.) and the last 90 days of 1929 (90 obs.), for the following methods: sign-based (SF and SHAC statistics); LAD with OS, DMB, MBB (b = 3) and kernel (B_n = 10) covariance matrix estimates; and OLS with i.i.d., DMB and MBB (b = 3) covariance matrix estimates. Drift entries are reported on the scales ×10^{−5}, ×10^{−2} and ×10^{−1} for the three samples respectively; individual cells are not reliably recoverable.]
seem well adapted to small-sample settings. Consequently, they are also particularly suited to regional data sets, which have, by nature, a fixed sample size.15
8. CONCLUSION

In this paper, we have proposed an entire system of inference for the β parameter of a linear median regression that relies on distribution-free sign-based statistics. We show that

15 For an illustration on cross-regional β-convergence between the levels of per capita output in the U.S., see the discussion paper.
the procedure yields exact tests in finite samples for mediangale processes and remains asymptotically valid for more general processes, including stationary ARMA disturbances. Simulation studies indicate that the proposed tests and confidence sets are more reliable than usual methods (LS, LAD), even when using the bootstrap. Despite the programming complexity of sign-based methods, we advocate their use when arbitrary heteroscedasticity is suspected in the data and the number of available observations is small. Finally, we have presented a practical example: we test the presence of a drift on the S&P price index, for the whole period 1928–87 and for shorter subsamples.
ACKNOWLEDGEMENTS

The authors thank Marine Carrasco, Marc Hallin, Frédéric Jouneau, Thierry Magnac, Bill McCausland, Benoit Perron, Alain Trognon, the two anonymous referees and the editor Richard Smith for useful comments and constructive discussions. Earlier versions of this paper were presented at the 2003 Meeting of the Statistical Society of Canada (Halifax), the 2005 Econometric Society World Congress (London), CREST (Paris), the 2005 Conference in honour of Jean-Jacques Laffont (Toulouse), the 2005 Workshop on 'New Trouble for Standard Regression Analysis' (Universität Regensburg, Germany) and ECARES (Brussels). This work was supported by the William Dow Chair in Political Economy (McGill University), the Canada Research Chair Program (Chair in Econometrics, Université de Montréal), the Bank of Canada (Research Fellowship), a Guggenheim Fellowship, a Konrad-Adenauer Fellowship (Alexander von Humboldt Foundation, Germany), the Institut de finance mathématique de Montréal (IFM2), the Canadian Network of Centres of Excellence [program on Mathematics of Information Technology and Complex Systems (MITACS)], the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, and the Fonds de recherche sur la société et la culture (Québec).
REFERENCES

Abdelkhalek, T. and J.-M. Dufour (1998). Statistical inference for computable general equilibrium models, with application to a model of the Moroccan economy. Review of Economics and Statistics 80, 520–34.
Amemiya, T. (1982). Two-stage least absolute deviations estimator. Econometrica 50, 689–711.
Andrews, D. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–58.
Bahadur, R. and L. Savage (1956). The nonexistence of certain statistical procedures in nonparametric problems. Annals of Mathematical Statistics 27, 1115–22.
Barnard, G. A. (1963). Comment on 'The spectral analysis of point processes' by M. S. Bartlett. Journal of the Royal Statistical Society, Series B 25, 294.
Boldin, M. V., G. I. Simonova and Y. N. Tyurin (1997). Sign-Based Methods in Linear Statistical Models, Translations of Mathematical Monographs, Volume 162. Maryland: American Mathematical Society.
Buchinsky, M. (1995). Estimating the asymptotic covariance matrix for quantile regression models. Journal of Econometrics 68, 303–38.
Buchinsky, M. (1998). Recent advances in quantile regression models: a practical guideline for empirical research. Journal of Human Resources 33, 88–126.
Buchinsky, M. and J. Hahn (1998). An alternative estimator for the censored quantile regression model. Econometrica 66, 653–71.
Campbell, B. and J.-M. Dufour (1991). Over-rejections in rational expectations models: a nonparametric approach to the Mankiw–Shapiro problem. Economics Letters 35, 285–90.
Campbell, B. and J.-M. Dufour (1995). Exact nonparametric orthogonality and random walk tests. Review of Economics and Statistics 77, 1–16.
Campbell, B. and J.-M. Dufour (1997). Exact nonparametric tests of orthogonality and random walk in the presence of a drift parameter. International Economic Review 38, 151–73.
Chernozhukov, V., C. Hansen and M. Jansson (2008). Finite sample inference for quantile regression models. Forthcoming in Journal of Econometrics.
Chow, Y. S. and H. Teicher (1988). Probability Theory. Independence, Interchangeability, Martingales (2nd ed.). New York: Springer-Verlag.
Coudin, E. and J.-M. Dufour (2006). Generalized confidence distributions and robust sign-based estimators in median regressions under heteroskedasticity and nonlinear dependence of unknown form. Working paper, CIRANO-CIREQ, McGill University.
Coudin, E. and J.-M. Dufour (2007). Finite and large sample distribution-free inference in linear median regressions under heteroskedasticity and nonlinear dependence of unknown form. Working paper, CIREQ, McGill University and Document de Travail No. 2007-38, CREST-INSEE.
David, H. A. (1981). Order Statistics (2nd ed.). New York: John Wiley.
De Angelis, D., P. Hall and G. A. Young (1993). Analytical and bootstrap approximations to estimator distributions in L1 regression. Journal of the American Statistical Association 88, 1310–16.
Dielman, T. and R. Pfaffenberger (1988a). Bootstrapping in least absolute value regression: an application to hypothesis testing. Communications in Statistics—Simulation and Computation 17, 843–56.
Dielman, T. and R. Pfaffenberger (1988b). Least absolute value regression: necessary sample sizes to use normal theory inference procedures. Decision Sciences 19, 734–43.
Dodge, Y. (Ed.) (1997). L1-Statistical Procedures and Related Topics, Lecture Notes—Monograph Series, Volume 31. Hayward, CA: Institute of Mathematical Statistics.
Dubrovin, B., A. Fomenko and S. Novikov (1984). Modern Geometry—Methods and Applications. New York: Springer-Verlag.
Dufour, J.-M. (1981). Rank tests for serial dependence. Journal of Time Series Analysis 2, 117–28.
Dufour, J.-M. (1990). Exact tests and confidence sets in linear regressions with autocorrelated errors. Econometrica 58, 475–94.
Dufour, J.-M. (1997). Some impossibility theorems in econometrics, with applications to structural and dynamic models. Econometrica 65, 1365–89.
Dufour, J.-M. (2006). Monte Carlo tests with nuisance parameters: a general approach to finite-sample inference and nonstandard asymptotics in econometrics. Journal of Econometrics 133, 443–77.
Dufour, J.-M., M. Hallin and I. Mizera (1998). Generalized runs tests for heteroskedastic time series. Journal of Nonparametric Statistics 9, 39–86.
Dufour, J.-M. and J. Jasiak (2001). Finite sample limited information inference methods for structural equations and models with generated regressors. International Economic Review 42, 815–43.
Dufour, J.-M. and J. Kiviet (1998). Exact inference methods for first-order autoregressive distributed lag models. Econometrica 66, 79–104.
Dufour, J.-M. and M. Taamouti (2005). Projection-based statistical inference in linear structural models with possibly weak instruments. Econometrica 73, 1351–65.
Dufour, J.-M. and P. Valéry (2008). Exact and asymptotic tests for possibly non-regular hypotheses on stochastic volatility models. Forthcoming in Journal of Econometrics.
Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. Annals of Mathematical Statistics 28, 181–87.
Fitzenberger, B. (1997a). A guide to censored quantile regressions. In G. S. Maddala and C. R. Rao (Eds.), Handbook of Statistics, Volume 15, 405–37. Amsterdam: North-Holland.
Fitzenberger, B. (1997b). The moving blocks bootstrap and robust inference for linear least squares and quantile regressions. Journal of Econometrics 82, 235–87.
Gallant, A. R., D. Hsieh and G. Tauchen (1997). Estimation of stochastic volatility models with diagnostics. Journal of Econometrics 81, 159–92.
Godambe, V. (2001). Estimation of median: quasi-likelihood and optimum estimating functions. Discussion Paper 2001-04, Department of Statistics and Actuarial Sciences, University of Waterloo.
Goffe, W. L., G. D. Ferrier and J. Rogers (1994). Global optimization of statistical functions with simulated annealing. Journal of Econometrics 60, 65–100.
Gray, A. (1998). Modern Differential Geometry of Curves and Surfaces with Mathematica (2nd ed.). Boca Raton, FL: CRC Press.
Hahn, J. (1997). Bayesian bootstrap of the quantile regression estimator: a large sample study. International Economic Review 38, 795–808.
Hallin, M. and M. L. Puri (1991). Time series analysis via rank-order theory: signed-rank tests for ARMA models. Journal of Multivariate Analysis 39, 175–237.
Hallin, M. and M. L. Puri (1992). Rank tests for time series analysis: a survey. In D. Brillinger, P. Caines, J. Geweke, E. Parzen, M. Rosenblatt and M. S. Taqqu (Eds.), New Directions in Time Series Analysis (Part I), The IMA Volumes in Mathematics and its Applications, Volume 45, 111–53. New York: Springer-Verlag.
Hallin, M., C. Vermandele and B. Werker (2006). Linear and nonserial sign-and-rank statistics: asymptotic representation and asymptotic normality. The Annals of Statistics 34, 254–89.
Hallin, M., C. Vermandele and B. Werker (2008). Semiparametrically efficient inference based on signs and ranks for median-restricted models. Journal of the Royal Statistical Society, Series B 70, 389–412.
Hallin, M. and B. Werker (2003). Semiparametric efficiency, distribution-freeness, and invariance. Bernoulli 9, 137–65.
Hong, H. and E. Tamer (2003). Inference in censored models with endogenous regressors. Econometrica 71, 905–32.
Horowitz, J. L. (1998). Bootstrap methods for median regression models. Econometrica 66, 1327–51.
Jung, S. (1996). Quasi-likelihood for median regression models. Journal of the American Statistical Association 91, 251–57.
Kim, T. and H. White (2002). Estimation, inference, and specification testing for possibly misspecified quantile regression. Discussion paper 2002-09, Department of Economics, UC San Diego.
Koenker, R. and G. Bassett, Jr. (1978). Regression quantiles. Econometrica 46, 33–50.
Koenker, R. and G. Bassett, Jr. (1982). Tests of linear hypotheses and L1 estimation. Econometrica 50, 1577–84.
Komunjer, I. (2005). Quasi-maximum likelihood estimation for conditional quantiles. Journal of Econometrics 128, 137–64.
Lehmann, E. L. and C. Stein (1949). On the theory of some non-parametric hypotheses. Annals of Mathematical Statistics 20, 28–45.
Linton, O. and Y.-J. Whang (2007). The quantilogram: with an application to evaluating directional predictability. Journal of Econometrics 141, 250–82.
Mankiw, G. and M. Shapiro (1986). Do we reject too often? Small sample properties of tests of rational expectations models. Economics Letters 20, 139–45.
Newey, W. and K. West (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–8.
Parzen, E. (1957). On consistent estimates of the spectrum of a stationary time series. Annals of Mathematical Statistics 28, 329–48.
Powell, J. L. (1983). The asymptotic normality of two-stage least absolute deviations estimators. Econometrica 51, 1569–76.
Powell, J. L. (1984). Least absolute deviations estimation for the censored regression model. Journal of Econometrics 25, 303–25.
Powell, J. L. (1986). Censored regression quantiles. Journal of Econometrics 32, 143–55.
Powell, J. L. (1994). Estimation of semiparametric models. In R. F. Engle and D. L. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2443–521. Amsterdam: North-Holland.
Pratt, J. W. and J. D. Gibbons (1981). Concepts of Nonparametric Theory. New York: Springer-Verlag.
Press, W. H., S. A. Teukolsky, W. T. Vetterling and B. P. Flannery (1996). Numerical Recipes in Fortran 90 (2nd ed.). Cambridge: Cambridge University Press.
Scheffé, H. and J. W. Tukey (1945). Non-parametric estimation, I: validation of order statistics. Annals of Mathematical Statistics 16, 187–92.
Thompson, W. R. (1936). On confidence ranges for the median and other expectation distributions for populations of unknown distribution form. Annals of Mathematical Statistics 7, 122–8.
Weiss, A. (1991). Estimating nonlinear dynamic models using least absolute error estimation. Econometric Theory 7, 46–68.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817–38.
White, H. (2001). Asymptotic Theory for Econometricians (revised edition). New York: Academic Press.
Wright, J. H. (2000). Alternative variance-ratio tests using ranks and signs. Journal of Business and Economic Statistics 18, 1–9.
Zhao, Q. (2001). Asymptotically efficient median regression in the presence of heteroskedasticity of unknown form. Econometric Theory 17, 765–84.
APPENDIX A: PROOFS

Proof of Proposition 2.1: We use the fact that, as {X_t : t = 1, 2, . . .} is strongly exogenous, {u_t : t = 1, 2, . . .} does not Granger cause {X_t : t = 1, 2, . . .}. It follows directly that l(s_t | u_{t−1}, . . . , u_1, x_t, . . . , x_1) = l(s_t | u_{t−1}, . . . , u_1, x_n, . . . , x_1), where l stands for the density of s_t = s(u_t).

Proof of Proposition 3.1: Consider the vector [s(u_1), s(u_2), . . . , s(u_n)]′ ≡ (s_1, s_2, . . . , s_n)′. From Assumption 2.1, we derive the two following equalities:

P[u_t > 0 | X] = E(P[u_t > 0 | u_{t−1}, . . . , u_1, X]) = 1/2,
P[u_t > 0 | s_{t−1}, . . . , s_1, X] = P[u_t > 0 | u_{t−1}, . . . , u_1, X] = 1/2, ∀t ≥ 2.

Further, the joint density of (s_1, s_2, . . . , s_n)′ can be written

l(s_1, s_2, . . . , s_n | X) = ∏_{t=1}^{n} l(s_t | s_{t−1}, . . . , s_1, X)
= ∏_{t=1}^{n} P[u_t > 0 | u_{t−1}, . . . , u_1, X]^{(1+s_t)/2} {1 − P[u_t > 0 | u_{t−1}, . . . , u_1, X]}^{(1−s_t)/2}
= ∏_{t=1}^{n} (1/2)^{(1+s_t)/2} [1 − (1/2)]^{(1−s_t)/2} = (1/2)^n.

Hence, conditional on X, s_1, s_2, . . . , s_n are i.i.d. ∼ B(1/2).
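The conclusion of Proposition 3.1 is easy to check numerically: under a heteroskedastic, heavy-tailed mediangale, every one of the 2^n sign vectors should occur with probability (1/2)^n. The following Python sketch (our illustration, with an arbitrary simulated design) estimates these probabilities by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)

n, reps = 5, 100_000
# heteroskedastic, heavy-tailed, median-zero disturbances: u_t = sigma_t * e_t
# with e_t symmetric about 0 -- the mediangale only constrains the median
scales = np.exp(rng.normal(size=(reps, n)))
U = scales * rng.standard_cauchy(size=(reps, n))

# encode each sign vector as an integer in {0, ..., 2^n - 1}
bits = (np.sign(U) > 0).astype(int)
codes = bits @ (2 ** np.arange(n))
freqs = np.bincount(codes, minlength=2 ** n) / reps   # should all be near 1/32
```

Neither the scale process nor the tail behaviour of the disturbances affects the (conditional) joint distribution of the signs, which is exactly what makes the sign statistics pivotal.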
Proof of Proposition 3.2: Consider model (2.1) with {u_t : t = 1, 2, . . .} satisfying a weak mediangale conditional on X. Let us show that s̃(u_1), s̃(u_2), . . . , s̃(u_n) can play the same role in Proposition 3.1 as s(u_1), s(u_2), . . . , s(u_n) under Assumption 2.1. The randomized signs are defined by s̃(u_t, V_t) = s(u_t) + [1 − s(u_t)²] s(V_t − 0.5), hence

P[s̃(u_t, V_t) = 1 | u_{t−1}, . . . , u_1, X] = P{s(u_t) + [1 − s(u_t)²] s(V_t − 0.5) = 1 | u_{t−1}, . . . , u_1, X}.

As (V_1, . . . , V_n) is independent of (u_1, . . . , u_n) and V_t ∼ U(0, 1), it follows that

P[s̃(u_t, V_t) = 1 | u_{t−1}, . . . , u_1, X] = P[u_t > 0 | u_{t−1}, . . . , u_1, X] + (1/2) P[u_t = 0 | u_{t−1}, . . . , u_1, X].  (A.1)

The weak conditional mediangale assumption given X entails

P[u_t > 0 | u_{t−1}, . . . , u_1, X] = P[u_t < 0 | u_{t−1}, . . . , u_1, X] = (1 − p_t)/2,  (A.2)

where p_t = P[u_t = 0 | u_{t−1}, . . . , u_1, X]. Substituting (A.2) into (A.1) yields

P[s̃(u_t, V_t) = 1 | u_{t−1}, . . . , u_1, X] = (1 − p_t)/2 + p_t/2 = 1/2.  (A.3)

In a similar way,

P[s̃(u_t, V_t) = −1 | u_{t−1}, . . . , u_1, X] = 1/2.  (A.4)

The rest is similar to the proof of Proposition 3.1.
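The randomization device of Proposition 3.2 is mechanical enough to illustrate directly. In the Python sketch below (our illustration; the three-point error distribution is an arbitrary choice), the disturbances have an atom at zero, yet the randomized signs s̃(u_t, V_t) are ±1 with probability 1/2 each.

```python
import numpy as np

rng = np.random.default_rng(3)

def randomized_sign(u, v):
    """s~(u, V) = s(u) + [1 - s(u)^2] s(V - 0.5): zero residuals receive a
    random +/-1 sign from an independent V ~ U(0, 1)."""
    s = np.sign(u)
    return s + (1 - s ** 2) * np.sign(v - 0.5)

# discrete disturbances with an atom at zero: P[u < 0] = P[u > 0] = (1 - p0)/2
reps = 200_000
u = rng.choice([-1.0, 0.0, 1.0], size=reps, p=[0.3, 0.4, 0.3])
v = rng.uniform(size=reps)
s_tilde = randomized_sign(u, v)

frac_plus = float(np.mean(s_tilde == 1.0))   # close to 1/2 despite the atom
```

This is what allows the finite-sample sign-based procedures to cover discretely distributed disturbances.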
Proof of Proposition 3.3: Let us first consider the case of a single explanatory variable (p = 1), which contains the basic idea of the proof; the case with p > 1 is just an adaptation of the same ideas to multidimensional notions. Under model (2.1) with the mediangale Assumption 2.1, the locally optimal sign-based test (conditional on X) of H_0(β): β = 0 against H_1(β): β ≠ 0 is well defined. Among tests with level α, the power function of the locally optimal sign-based test has the highest slope around zero. The power function of a sign-based test conditional on X can be written P_β[s(y) ∈ W_α | X], where W_α is the critical region with level α. Hence, we should include in W_α the sign vectors for which (d/dβ) P_β[S(y) = s | X]|_{β=0} is as large as possible. An easy way to determine that derivative is to identify the terms of a Taylor expansion around zero. Under Assumption 2.1, we have

P_β[S(y) = s | X] = ∏_{i=1}^{n} [P_β(y_i > 0 | X)]^{(1+s_i)/2} [P_β(y_i < 0 | X)]^{(1−s_i)/2}  (A.5)
= ∏_{i=1}^{n} [1 − F_i(−x_i β | X)]^{(1+s_i)/2} [F_i(−x_i β | X)]^{(1−s_i)/2}.  (A.6)

Assuming that continuous densities at zero exist, a first-order Taylor expansion entails

P_β[S(y) = s | X] = (1/2^n) ∏_{i=1}^{n} [1 + 2 f_i(0 | X) x_i s_i β + o(β)]  (A.7)
= (1/2^n) [1 + 2 ∑_{i=1}^{n} f_i(0 | X) x_i s_i β + o(β)].  (A.8)

All other terms of the product decomposition are negligible or equivalent to o(β). That allows us to identify the derivative at β = 0:

(d/dβ) P_{β=0}[S(y) = s | X] = 2^{−n+1} ∑_{i=1}^{n} f_i(0 | X) x_i s_i.  (A.9)

Therefore, the required test has the form

W_α = {s = (s_1, . . . , s_n)′ : ∑_{i=1}^{n} f_i(0 | X) x_i s_i > c_α},  (A.10)

or equivalently, W_α = {s : s(y)′ X̃ X̃′ s(y) > c̃_α}, where c_α and c̃_α are determined by the significance level. When the disturbances have a common conditional density at zero, f(0 | X), we recover the results of Boldin et al. (1997): the locally optimal sign-based test is given by W_α = {s : s(y)′ X X′ s(y) > c_α}, and the statistic does not depend on the conditional density evaluated at zero.

When p > 1, we need an extension of the notion of slope around zero to a multidimensional parameter. Boldin et al. (1997) propose to restrict attention to the class of locally unbiased tests with given level α and to consider the maximal mean curvature. Thus, a locally unbiased sign-based test satisfies (d P_β(W_α)/dβ)|_{β=0} = 0, and, provided f_i(0) ≠ 0, ∀i, the behaviour of the power function around zero is characterized by the quadratic term of its Taylor expansion:

(1/2) β′ [d² P_β(W_α)/dβ dβ′]|_{β=0} β = 2^{−n+1} ∑_{1≤i≠j≤n} [f_i(0 | X) s_i β′ x_i][f_j(0 | X) s_j x_j′ β].  (A.11)

The locally most powerful sign-based test in the sense of the mean curvature maximizes the mean curvature, which is, by definition, proportional to the trace of [d² P_β(W_α)/dβ dβ′]|_{β=0}; see Boldin et al. (1997, p. 41), Dubrovin et al. (1984, ch. 2, pp. 76–86) or Gray (1998, ch. 21, pp. 373–80). Taking the trace in expression (A.11), we find (after some computations) that it is proportional to

∑_{1≤i≠j≤n} f_i(0 | X) f_j(0 | X) s_i s_j ∑_{k=1}^{p} x_ik x_jk.  (A.12)

By adding the quantity ∑_{k=1}^{p} ∑_{i=1}^{n} f_i(0 | X)² x_ik², which does not depend on s, to (A.12), we find

∑_{k=1}^{p} [∑_{i=1}^{n} x_ik f_i(0 | X) s_i]² = s(y)′ X̃ X̃′ s(y).  (A.13)
Hence, the locally optimal sign-based test, in the sense developed by Boldin et al. (1997) for heteroscedastic signs, is W_α = {s : s(y)′ X̃ X̃′ s(y) > c_α}. Another quadratic test statistic, convenient for large-sample evaluation, is obtained by standardizing by X̃′ X̃: W̃_α = {s : s(y)′ X̃ (X̃′ X̃)^{−1} X̃′ s(y) > c̃_α}.

Proof of Theorem 5.1: This proof follows the usual steps of an asymptotic normality result for mixing processes (see White, 2001). Consider model (2.1). In the following, s_t stands for s(u_t). Under Assumption 5.4, V_n^{−1/2} exists for any n. Set Z_nt = λ′ V_n^{−1/2} x_t s(u_t), for some λ ∈ R^p such that λ′λ = 1. The mixing property 5.1 of (x_t, u_t) is transmitted to Z_nt; see White (2001, Theorem 3.49). Hence, λ′ V_n^{−1/2} s(u_t) ⊗ x_t is α-mixing of size −r/(r − 2), r > 2. Assumptions 5.2 and 5.3 imply

E[λ′ V_n^{−1/2} x_t s(u_t)] = 0, t = 1, . . . , n, ∀n ∈ N,  (A.14)

E|λ′ V_n^{−1/2} x_t s(u_t)|^r < Δ < ∞, t = 1, . . . , n, ∀n ∈ N.  (A.15)
Note also that

Var[n^{−1/2} ∑_{t=1}^{n} Z_nt] = Var[n^{−1/2} ∑_{t=1}^{n} λ′ V_n^{−1/2} s(u_t) ⊗ x_t] = λ′ V_n^{−1/2} V_n V_n^{−1/2} λ = 1.  (A.16)

The mixing property of Z_nt and equations (A.14)–(A.16) allow one to apply a central limit theorem (see White, 2001, Theorem 5.20) that yields

n^{−1/2} ∑_{t=1}^{n} λ′ V_n^{−1/2} s(u_t) ⊗ x_t → N(0, 1).  (A.17)

Since λ is arbitrary with λ′λ = 1, the Cramér–Wold device entails

V_n^{−1/2} n^{−1/2} ∑_{t=1}^{n} s(u_t) ⊗ x_t → N(0, I_p).  (A.18)

Finally, Assumption 5.5 states that Ω_n is a consistent estimate of V_n^{−1}. Hence,

n^{−1/2} Ω_n^{1/2} ∑_{t=1}^{n} s(u_t) ⊗ x_t → N(0, I_p),  (A.19)

and n^{−1} s(y − Xβ_0)′ X Ω_n X′ s(y − Xβ_0) → χ²(p).

Proof of Corollary 5.1: Let F_t = σ(y_0, . . . , y_t, x_0, . . . , x_t). When the mediangale Assumption 2.1 holds, {s(u_t) ⊗ x_t, F_t : t = 1, . . . , n} is a martingale difference sequence with respect to F_t. Hence, V_n = Var[n^{−1/2} s ⊗ X] = n^{−1} ∑_{t=1}^{n} E(x_t s_t s_t x_t′) = n^{−1} ∑_{t=1}^{n} E(x_t x_t′) = n^{−1} E(X′X), and X′X/n is a consistent estimate of E(X′X/n). Theorem 5.1 yields SF(β_0) → χ²(p).

In order to prove Theorem 5.2, we will use the following lemma on the uniform convergence of distribution functions (see Chow and Teicher, 1988, sec. 8.2, p. 265).

LEMMA 8.1. Let (F_n)_{n∈N} and F be right-continuous distribution functions. Suppose that F_n(x) → F(x) as n → ∞, ∀x ∈ R. Then, sup_{−∞<x<+∞} |F_n(x) − F(x)| → 0 as n → ∞.
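The χ²(p) limit in Corollary 5.1 is easy to check by simulation: under the mediangale the signs are i.i.d. ±1, so SF = s′X(X′X)^{−1}X′s can be simulated directly for a fixed design. A quick Python sanity check (our illustration, with an arbitrary simulated design) compares the simulated mean and variance with the χ²(p) values p and 2p; the mean is in fact exactly tr[X(X′X)^{−1}X′] = p in finite samples.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p, reps = 400, 3, 4000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix of the fixed design

# under the mediangale, the signs are i.i.d. +/-1 conditional on X
S = rng.choice([-1.0, 1.0], size=(reps, n))
sf = np.einsum("ri,ri->r", S @ H, S)         # SF = s' X (X'X)^{-1} X' s, per row

mean_sf, var_sf = float(sf.mean()), float(sf.var())
# chi^2(p) limit: mean p = 3, variance 2p = 6
```

The variance is slightly below 2p in finite samples (it equals 2(p − Σ_i H_ii²) for Rademacher signs), converging to 2p as n grows.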
Proof of Theorem 5.2: G(−∞) = G̃_n(−∞) = 0, G(+∞) = G̃_n(+∞) = 1, and G̃_n(x | X_n(ω)) → G(x) a.e. By Lemma 8.1, (G̃_n)_{n∈N} converges uniformly to G. The same holds for G_n. Moreover, G̃_n can be rewritten as

G̃_n(c_n S_n(β_0) | X_n) = [G̃_n(c_n S_n(β_0) | X_n(ω)) − G(c_n S_n(β_0))] + [G(c_n S_n(β_0)) − G_n(c_n S_n(β_0) | X_n(ω))] + G_n(c_n S_n(β_0) | X_n),

hence

G_n(c_n S_n(β_0) | X_n) = G̃_n(c_n S_n(β_0) | X_n) + o_p(1).  (A.20)

As c_n S_n(β_0) is a discrete positive random variable and G_n, its survival function, is also discrete, it directly follows from the properties of survival functions that for each α ∈ Im(G_n(R+)), i.e. for each point of the image set, we have

P[G_n(c_n S_n(β_0)) ≤ α] = α.  (A.21)

Consider now the case where α ∈ (0, 1)\Im(G_n(R+)): α must lie between the two values of a jump of the function G_n. Since G_n is bounded and decreasing, there exist α_1, α_2 ∈ Im(G_n(R+)) such that α_1 < α < α_2 and

P[G_n(c_n S_n(β_0)) ≤ α_1] ≤ P[G_n(c_n S_n(β_0)) ≤ α] ≤ P[G_n(c_n S_n(β_0)) ≤ α_2].

More precisely, the first inequality is an equality. Indeed,

P[G_n(c_n S_n(β_0)) ≤ α] = P[{G_n(c_n S_n(β_0)) ≤ α_1} ∪ {α_1 < G_n(c_n S_n(β_0)) ≤ α}] = P[G_n(c_n S_n(β_0)) ≤ α_1] + 0,

as {α_1 < G_n(c_n S_n(β_0)) ≤ α} is a zero-probability event. Applying (A.21) to α_1,

P[G_n(c_n S_n(β_0)) ≤ α] = P[G_n(c_n S_n(β_0)) ≤ α_1] = α_1 ≤ α.  (A.22)

Hence, for α ∈ (0, 1), we have P[G_n(c_n S_n(β_0)) ≤ α] ≤ α. The latter combined with equation (A.20) allows us to conclude that

P[G̃_n(c_n S_n(β_0)) ≤ α] = P[G_n(c_n S_n(β_0)) ≤ α] + o_p(1) ≤ α + o_p(1).
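The key inequality P[G_n(S) ≤ α] ≤ α for a discrete survival function can be verified exactly by enumeration for a small example. The Python sketch below (our illustration) uses the toy statistic S = (Σ_t s_t)²/n with n = 4, whose null distribution over the 2^n equiprobable sign vectors is known exactly, and checks conservativeness of the p-value G_n(S) over a grid of levels.

```python
import numpy as np
from itertools import product

n = 4
# exact null distribution of a discrete sign statistic: S = (sum of signs)^2 / n,
# each of the 2^n sign vectors having probability (1/2)^n
outcomes = np.array([sum(s) ** 2 / n for s in product([-1, 1], repeat=n)])

def G(x):
    """Survival function G_n(x) = P[S >= x] under the null."""
    return float(np.mean(outcomes >= x))

# the discrete p-value G_n(S) is conservative: P[G_n(S) <= alpha] <= alpha
p_vals = np.array([G(v) for v in outcomes])
conservative = all(np.mean(p_vals <= a) <= a + 1e-12
                   for a in np.linspace(0.01, 0.99, 99))
```

Equality P[G_n(S) ≤ α] = α holds exactly at the points of the image set of G_n (here 0.125, 0.625 and 1), as stated in (A.21).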
(1) (N) Proof of Theorem 5.3: Let S (0) n be the observed statistic and S n (N ) = (S n , . . . , S n ), a vector of N independent replicates drawn from F˜n (x). Usually, validity of Monte Carlo testing is based on the fact the (N) vector (cn S(0) n , . . . , cn Sn ) is exchangeable. Indeed, in that case, the distribution of ranks is fully specified (N) and yields the validity of empirical p-value (see Dufour, 2006). In our case, it is clear that (cn S(0) n , . . . , cn Sn ) is not exchangeable, so that Monte Carlo validity cannot be directly applied. Nevertheless, asymptotic (N) exchangeability still holds, which will enable us to conclude. To obtain that the vector (cn S(0) n , . . . , cn Sn ) is asymptotically exchangeable, we show that for any permutation π : [1, N ] → [1, N ],
lim_{n→∞} | P[S_n^{(0)} ≥ t_0, …, S_n^{(N)} ≥ t_N] − P[S_n^{(π(0))} ≥ t_0, …, S_n^{(π(N))} ≥ t_N] | = 0.
First, let us rewrite

P[S_n^{(0)} ≥ t_0, …, S_n^{(N)} ≥ t_N] = E_{X_n} P[S_n^{(0)} ≥ t_0, …, S_n^{(N)} ≥ t_N | X_n = x_n].

The conditional independence of the sign vectors (replicated and observed) entails:

P[S_n^{(0)} ≥ t_0, …, S_n^{(N)} ≥ t_N | X_n = x_n] = ∏_{i=0}^{N} P[S_n^{(i)} ≥ t_i | X_n = x_n]
  = G_n(t_0 | X_n = x_n) ∏_{i=1}^{N} G̃_n(t_i | X_n = x_n).
As each survival function converges with probability one to G(x), we finally obtain

P[S_n^{(0)} ≥ t_0, S_n^{(1)} ≥ t_1, …, S_n^{(N)} ≥ t_N | X_n = x_n] → ∏_{i=0}^{N} G(t_i) with probability one.
Moreover, it is straightforward to see that for any permutation π : {0, 1, …, N} → {0, 1, …, N}, we have, as n → ∞,

P[S_n^{(π(0))} ≥ t_0, S_n^{(π(1))} ≥ t_1, …, S_n^{(π(N))} ≥ t_N | X_n = x_n] → ∏_{i=0}^{N} G(t_i) with probability one.
G(t) is not a function of the realization X(ω), so that

lim_{n→∞} | P[S_n^{(0)} ≥ t_0, …, S_n^{(N)} ≥ t_N] − P[S_n^{(π(0))} ≥ t_0, …, S_n^{(π(N))} ≥ t_N] | = 0.
Hence, we can apply an asymptotic version of Proposition 2.2.2 in Dufour (2006), which validates Monte Carlo testing for general, possibly non-continuous statistics. The proof of this asymptotic version follows exactly the same steps as the proofs of Lemma 2.2.1 and Proposition 2.2.2 of Dufour (2006): we just have to replace the exact distributions of the randomized ranks, the empirical survival functions and the empirical p-values by their asymptotic counterparts, and this is sufficient to conclude. Suppose that N, the number of replicates, is such that α(N + 1) is an integer. Then lim_{n→∞} P[p̃_n^N(c_n S_n^{(0)}) ≤ α] ≤ α.
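The Monte Carlo testing scheme with an empirical p-value and randomized rank tie-breaking (in the spirit of Dufour, 2006) can be sketched as follows. The test statistic here is a hypothetical standard-normal draw, used only to illustrate that, under exchangeability of the observed statistic and its replicates, the rejection frequency at level α does not exceed α:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_p_value(s0, replicates, rng):
    """Empirical Monte Carlo p-value with uniform tie-breaking of ranks."""
    N = len(replicates)
    u = rng.uniform(size=N + 1)          # auxiliary uniforms to randomize tied ranks
    stats_all = np.append(replicates, s0)  # s0 stored in the last slot
    # Count replicates ranked at or above the observed statistic (ties randomized).
    ge = (stats_all > s0) | ((stats_all == s0) & (u > u[-1]))
    return (ge.sum() + 1) / (N + 1)

# Under the null, s0 and the N replicates are exchangeable draws, so with
# alpha*(N+1) an integer the rejection rate should be close to (and not above) alpha.
N, alpha, n_sim = 19, 0.05, 2000
rejections = sum(
    mc_p_value(rng.normal(), rng.normal(size=N), rng) <= alpha
    for _ in range(n_sim)
)
print(rejections / n_sim)  # rejection rate, near alpha = 0.05
```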
The Econometrics Journal (2009), volume 12, pp. S50–S67. doi: 10.1111/j.1368-423X.2008.00274.x
Copula-based nonlinear quantile autoregression

XIAOHONG CHEN†, ROGER KOENKER‡ AND ZHIJIE XIAO§

†Cowles Foundation for Research in Economics, Yale University, Box 208281, New Haven, CT 06520, USA
E-mail: [email protected]

‡Department of Economics, University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA
E-mail: [email protected]

§Department of Economics, Boston College, Chestnut Hill, MA 02467, USA and Tsinghua University, Beijing, 100084, China
E-mail: [email protected]

First version received: September 2008; final version accepted: November 2008
Summary  Parametric copulas are shown to be attractive devices for specifying quantile autoregressive models for nonlinear time series. Estimation of local, quantile-specific copula-based time series models offers some salient advantages over classical global parametric approaches. Consistency and asymptotic normality of the proposed quantile estimators are established under mild conditions, allowing for global misspecification of parametric copulas and marginals, and without assuming any mixing rate condition. These results lead to a general framework for inference and model specification testing of extreme conditional value-at-risk for financial time series data.

Keywords: Copula, Ergodic nonlinear Markov models, Quantile autoregression.
1. INTRODUCTION

Estimation of models for conditional quantiles constitutes an essential ingredient in modern risk assessment. And yet such quantile estimation and prediction often rely heavily on unrealistic global distributional assumptions. In this paper, we consider new estimation methods for conditional quantile functions that are motivated by parametric copula models but retain some semi-parametric flexibility; they should therefore deliver more robust and more accurate estimates, while also being well suited to the evaluation of misspecification. We employ parametric copula models to generate nonlinear-in-parameters quantile autoregression (QAR) models. Such models have several advantages over the linear QAR models previously considered in Koenker and Xiao (2006) since, by construction, the copula-based nonlinear QAR models are globally plausible, with monotone conditional quantile functions over the entire support of the conditioning variables. Rather than imposing this global structure, however, we choose instead to estimate the implied conditional quantile
function independently, thereby facilitating an analysis of potential misspecification of the global structure. Copula-based Markov models provide a rich source of potential nonlinear dynamics describing temporal dependence (and tail dependence). They also permit us to carefully distinguish the temporal dependence from the specification of the marginal (stationary) distribution of the response. Stationarity of the processes considered implies that only one marginal distribution is required for the specification, in addition to the choice of a copula. See, e.g. Chen and Fan (2006), Ibragimov (2006), Patton (2008) and the references therein for more detailed discussions of copula-based Markov models. Choice of the parametric specification of the copula, C, and the marginal distribution, F, is a challenging problem. In this paper, we restrict our attention to settings in which the choices of C and F may be globally misspecified and yet yield a correctly specified conditional quantile function at a particular quantile. This is obviously a weaker condition than the direct assertion that we have correctly specified C and F themselves, since each of the conditional quantile functions we consider is permitted to have its own vector of quantile-specific parameters. Indeed, this distinction between global parametric models and local, quantile-specific ones is essential throughout the quantile regression literature, and it facilitates inference for misspecification that arises from discrepancies in the quantile-specific estimates of the model parameters (see Koenker, 2005). Moreover, we are able to derive the consistency and asymptotic normality of our quantile estimator under mild sufficient conditions.
In particular, we only assume that the underlying copula-based Markov model is stationary ergodic, without requiring any mixing conditions, and our moment restrictions are only those necessary for the validity of a central limit theorem (even for independent and identically distributed data). Our results are relevant for estimation and inference about extreme conditional quantiles (or value-at-risk) for financial time series data, as such data typically display strong temporal dependence and tail dependence as well as heavy-tailed marginals. Chen and Fan (2006) and Bouyé and Salmon (2008) have also suggested methods for estimating copula-based conditional quantile models. Both papers assume correct specification of the parametric copula dependence function C(·; α) (without specifying the marginal distribution F). Chen and Fan (2006) first estimate the marginal F by a rescaled empirical marginal CDF, and then estimate the copula parameter α via maximum likelihood. Conditional quantile functions are then obtained by plugging in the estimated copula parameter and the empirical marginal CDF. This approach obviously relies heavily on the correct specification of the parametric copula function. Bouyé and Salmon (2008) propose to estimate several distinct, nonlinear quantile regression models implied by their copula specification; this is essentially the approach adopted here. Bouyé and Salmon (2008) refer to Chen and Fan (2006) for conditions and justifications of the asymptotic properties of their estimator. While Chen and Fan (2006) derive the asymptotic properties of their two-step estimator under the assumptions that the parametric copula is correctly specified and the time series is beta-mixing with a fast enough decay rate, we obtain the asymptotic properties of the copula-based quantile estimator allowing for a misspecified parametric copula and without any mixing condition. The plan of the paper is as follows. We introduce the copula-based QAR model in Section 2.
Assumptions and asymptotic properties of the proposed estimator are developed in Section 3. Section 4 briefly describes statistical inference and Section 5 concludes. For simplicity of illustration, and without loss of generality, we focus our analysis on first-order QAR processes.
2. COPULA-BASED QUANTILE AUTOREGRESSION MODELS

2.1. First-order strictly stationary Markov models

To motivate copula-based quantile autoregression models, we start with a first-order strictly stationary Markov process, {Y_t}_{t=1}^n, whose probabilistic properties are determined by the true joint distribution of Y_{t−1} and Y_t, say G*(y_{t−1}, y_t). Suppose that G*(y_{t−1}, y_t) has continuous marginal distribution function F*(·); then, by Sklar's theorem, there exists a unique copula function C*(·, ·) such that

G*(y_{t−1}, y_t) ≡ C*(F*(y_{t−1}), F*(y_t)),

where the copula function C*(·, ·) is a bivariate probability distribution function with uniform marginals. Differentiating C*(u, v) with respect to u and evaluating at u = F*(x), v = F*(y), we obtain the conditional distribution of Y_t given Y_{t−1} = x:

Pr[Y_t < y | Y_{t−1} = x] = ∂C*(u, v)/∂u |_{u=F*(x), v=F*(y)} ≡ C*_1(F*(x), F*(y)).

For any τ ∈ (0, 1), solving τ = Pr[Y_t < y | Y_{t−1} = x] ≡ C*_1(F*(x), F*(y)) for y (in terms of τ), we obtain the τ-th conditional quantile function of Y_t given Y_{t−1} = x:

Q_{Y_t}(τ | x) = F*^{−1}(C*_1^{−1}(τ; F*(x))),

where F*^{−1}(·) signifies the inverse of F*(·) and C*_1^{−1}(·; u) is the partial inverse of C*_1(u, v) with respect to v = F*(y_t). Denote h*(x) ≡ C*_1^{−1}(τ; F*(x)), so we may rewrite the τ-th conditional quantile function of Y_t given Y_{t−1} = x as¹
Q_{Y_t}(τ | x) = F*^{−1}(h*(x)) ≡ H*(x).

In this paper, we will work with the class of copula-based, first-order, strictly stationary Markov models. We allow for most commonly used parametric copula functions, excluding the Fréchet–Hoeffding upper and lower bounds.

ASSUMPTION DGP. {Y_t : t = 1, …, n} is a sample from a stationary first-order Markov process generated from (F*(·), C*(·, ·)), where F*(·) is the true invariant distribution and is absolutely continuous with respect to Lebesgue measure on the real line; the copula C*(·, ·) for (Y_{t−1}, Y_t) is absolutely continuous with respect to Lebesgue measure on [0, 1]², and is neither the Fréchet–Hoeffding upper nor lower bound: min{F*(Y_{t−1}), F*(Y_t)} or max{F*(Y_{t−1}) + F*(Y_t) − 1, 0}.

Denote f*(·) and c*(·, ·) as the density functions corresponding to the marginal distribution F*(·) and the copula function C*(·, ·), respectively. Assumption DGP is equivalent to assuming that {Y_t : t = 1, …, n} is a sample from a first-order stationary Markov process generated from

¹ As we can see from the definition, both h* and H* depend on τ. We suppress τ from h* and H* for notational simplicity.
(f*(·), g*(· | ·)), where g*(· | y_{t−1}) ≡ f*(·) c*(F*(y_{t−1}), F*(·)) is the true conditional density function of Y_t given Y_{t−1} = y_{t−1}.

2.1.1. The autoregressive transformation model.  As demonstrated in Chen and Fan (2006), all copula-based first-order Markov models can be expressed in terms of an autoregressive transformation model. Let U_t = F*(Y_t); then, under Assumption DGP, {U_t} is a strictly stationary first-order Markov process with the joint distribution of U_t and U_{t−1} given by the copula C*(·, ·). Let Λ_1(·) be any increasing transformation; then there exist measurable functions Λ_2 and σ such that

Λ_1(F*(Y_t)) = Λ_2(F*(Y_{t−1})) + σ(F*(Y_{t−1})) ε_t,

or, equivalently,

U_t = F*(Y_t) = Λ_1^{−1}(Λ_2(U_{t−1}) + σ(U_{t−1}) ε_t),

where the conditional density of ε_t given U_{t−1} = F*(Y_{t−1}) = u_{t−1} is

f_{ε | F*(Y_{t−1}) = u_{t−1}}(ε) = c*(u_{t−1}, Λ_1^{−1}(Λ_2(u_{t−1}) + σ(u_{t−1}) ε)) / D(u_{t−1})
  = c*(F*(Y_{t−1}), Λ_1^{−1}(Λ_2(F*(Y_{t−1})) + σ(F*(Y_{t−1})) ε)) / D(F*(Y_{t−1})),

where D(u) = dΛ_1^{−1}(Λ_2(u) + σ(u) ε)/dε, and Λ_2 satisfies the condition

Λ_2(u_{t−1}) = E[Λ_1(U_t) | U_{t−1} = u_{t−1}] = ∫_0^1 Λ_1(u) c*(u_{t−1}, u) du.

In the special case Λ_1(u) = u, we obtain U_t = Λ_2(U_{t−1}) + σ(U_{t−1}) ε_t, i.e.

F*(Y_t) = Λ_2(F*(Y_{t−1})) + σ(F*(Y_{t−1})) ε_t,

with

Λ_2(u_{t−1}) = E[U_t | U_{t−1} = u_{t−1}] = ∫_0^1 u c*(u_{t−1}, u) du = 1 − ∫_0^1 C*_1(u_{t−1}, u) du.
2.2. Copula-based parametric quantile autoregression models

In practice, neither the true copula function C*(·, ·) nor the true marginal distribution function F*(·) of {Y_t} is known. If we model both parametrically, by C(·, ·; α) and F(y; β), depending on unknown parameters α and β, then the τ-th conditional quantile function of Y_t, Q_{Y_t}(τ | x), becomes a function of the unknown parameters α and β, i.e.

Q_{Y_t}(τ | x) = F^{−1}(C_1^{−1}(τ; F(x, β), α), β).

Denoting θ = (α′, β′)′ and h(x, α, β) ≡ C_1^{−1}(τ; F(x, β), α), we will write

Q_{Y_t}(τ | x) = F^{−1}(h(x, α, β), β) ≡ H(x; θ).    (2.1)
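The composition in (2.1) can be evaluated numerically for any parametric pair (C, F): invert v ↦ C_1(u, v; α) by one-dimensional root finding, then apply F^{−1}. A minimal sketch, using the Gaussian copula with a standard normal marginal as an illustrative choice (for this choice the closed form of the conditional quantile is known, so the root-finding result can be checked against it):

```python
import numpy as np
from scipy import stats, optimize

def C1_gaussian(u, v, a):
    """Conditional copula C_1(u, v; a) = P[V <= v | U = u] for the Gaussian copula."""
    num = stats.norm.ppf(v) - a * stats.norm.ppf(u)
    return stats.norm.cdf(num / np.sqrt(1 - a**2))

def cond_quantile(tau, x, a, F, F_inv):
    """Q_{Y_t}(tau | x) = F^{-1}(C_1^{-1}(tau; F(x), a)) via root finding, as in (2.1)."""
    u = F(x)
    v = optimize.brentq(lambda v: C1_gaussian(u, v, a) - tau, 1e-12, 1 - 1e-12)
    return F_inv(v)

# With a standard normal marginal the model is a Gaussian AR(1), so the
# closed form a*x + sqrt(1-a^2)*Phi^{-1}(tau) must agree with the numerical inverse.
a, tau, x = 0.6, 0.9, 1.2
q = cond_quantile(tau, x, a, stats.norm.cdf, stats.norm.ppf)
closed = a * x + np.sqrt(1 - a**2) * stats.norm.ppf(tau)
assert abs(q - closed) < 1e-8
```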
This copula formulation of the conditional quantile functions provides a rich source of potential nonlinear dynamics. By varying the choice of the copula specification we can induce a wide variety of nonlinear QAR(1) dependence, and the choice of the marginal distribution F enables us to consider a wide range of possible tail behaviour as well. Copula-based models have been widely used in finance, especially in estimating conditional quantiles as required for Value-at-Risk (VaR) assessment, motivated by possible nonlinearity in financial time series dynamics. However, in many financial time series applications, the nature of the temporal dependence may vary over the quantiles of the conditional distribution. We would like to stress that although the conditional quantile function specification in the above representation assumes the parameters to be identical across quantiles, our estimation methods do not impose this restriction. Thus, we permit the estimated parameters to vary with τ, and this provides an important diagnostic feature of the methodology. The proposed QAR model is based on (2.1), but we permit different parameter values over τ and write the vector of unknown parameters as θ(τ) = (α(τ)′, β(τ)′)′. With h(x, α, β) ≡ C_1^{−1}(τ; F(x, β), α), we obtain the following nonlinear QAR model:

Q_{Y_t}(τ | Y_{t−1}) = F^{−1}(h(Y_{t−1}, α(τ), β(τ)), β(τ)) ≡ H(Y_{t−1}, θ(τ)).    (2.2)
This nonlinear form of the QAR model can capture a wide range of systematic influences of the conditioning variables on the conditional distribution of the response. Koenker and Xiao (2006) considered linear-in-parameters QAR processes in studying similar specifications. Maintaining a linear specification in the QAR model, however, requires rather strong regularity assumptions on the domain of the associated random variables to ensure quantile monotonicity. Relaxing those assumptions implies that the conditional quantile functions are no longer linear. From this point of view, copula-based models provide an important path toward extending linear QAR models to nonlinear quantile autoregression specifications. The above analysis may be easily extended to k-th order nonlinear QAR models, but we will resist the temptation to tax the readers' patience with the notation required to accomplish this.

2.3. Examples

EXAMPLE 2.1. Gaussian Copula. Let Φ_α(·, ·) be the distribution function of a bivariate normal distribution with zero means, unit variances and correlation coefficient α, and let Φ be the CDF of a univariate standard normal. The bivariate Gaussian copula is given by

C(u, v; α) = Φ_α(Φ^{−1}(u), Φ^{−1}(v))
  = ∫_{−∞}^{Φ^{−1}(u)} ∫_{−∞}^{Φ^{−1}(v)} (1 / (2π√(1 − α²))) exp( −(s² − 2αst + t²) / (2(1 − α²)) ) ds dt.

Let {Y_t} be a stationary Markov process of order 1 generated from a Gaussian copula C*(u, v) = Φ_α(Φ^{−1}(u), Φ^{−1}(v)) and a marginal distribution F*(·). Denote U_t = F*(Y_t); then the joint distribution of U_t and U_{t−1} is C(u_{t−1}, u_t; α) = Φ_α(Φ^{−1}(u_{t−1}), Φ^{−1}(u_t)). Differentiating C(u_{t−1}, u_t; α) with respect to u_{t−1}, we obtain the conditional distribution of U_t given U_{t−1}:

C_1(u_{t−1}, u_t; α) = Φ( (Φ^{−1}(u_t) − α Φ^{−1}(u_{t−1})) / √(1 − α²) ).
For any τ ∈ (0, 1), solving

τ = C_1(u_{t−1}, u_t; α) = Φ( (Φ^{−1}(u_t) − α Φ^{−1}(u_{t−1})) / √(1 − α²) )

for u_t, we obtain the τ-th conditional quantile function of U_t given U_{t−1} = u_{t−1}:

Q_{U_t}(τ | u_{t−1}) = Φ( α Φ^{−1}(u_{t−1}) + √(1 − α²) Φ^{−1}(τ) )
  = Φ( α Φ^{−1}(F*(y_{t−1})) + √(1 − α²) Φ^{−1}(τ) ) = h*(τ; F*(y_{t−1}), α).

Let Z_t = Φ^{−1}(U_t) = Φ^{−1}(F*(Y_t)). Then {Z_t} is a Gaussian AR(1) process that can be represented by

Z_t = α Z_{t−1} + ε_t,

where ε_t ∼ N(0, 1 − α²) and is independent of Z_{t−1}. We obtain the τ-th conditional quantile function of Z_t given Z_{t−1}:

Q_{Z_t}(τ | Z_{t−1}) = b(τ) + α Z_{t−1}, with b(τ) = √(1 − α²) Φ^{−1}(τ),

a formulation that is the familiar linear AR(1) specification, which induces the simplest linear QAR model.

EXAMPLE 2.2. Student-t Copula. Let t_{ν,ρ}(·, ·) be the distribution function of a bivariate Student-t distribution with zero means, correlation coefficient ρ and degrees of freedom ν, and let t_ν(·) be the CDF of a univariate Student-t distribution with mean zero and degrees of freedom ν. The bivariate t-copula is given by, with α = (ν, ρ),

C(u, v; α) = t_{ν,ρ}( t_ν^{−1}(u), t_ν^{−1}(v) )
  = ∫_{−∞}^{t_ν^{−1}(u)} ∫_{−∞}^{t_ν^{−1}(v)} (1 / (2π√(1 − ρ²))) ( 1 + (s² − 2ρst + t²) / (ν(1 − ρ²)) )^{−(ν+2)/2} ds dt.

Let {Y_t} be a stationary Markov process of order 1 generated from a standard bivariate t_ν-copula function C*(u, v) = t_{ν,ρ}(t_ν^{−1}(u), t_ν^{−1}(v)) and a marginal distribution function F*(·). Let U_t = F*(Y_t); then the τ-th conditional quantile function of U_t given U_{t−1} is given by

Q_{U_t}(τ | F_{t−1}) = t_ν( ρ t_ν^{−1}(F*(Y_{t−1})) + σ(F*(Y_{t−1})) t_{ν+1}^{−1}(τ) ) = h*(τ; F*(Y_{t−1}), ρ, ν),

where

σ(F*(Y_{t−1})) = √( ((ν + [t_ν^{−1}(F*(Y_{t−1}))]²) / (ν + 1)) (1 − ρ²) ).
Moreover, the transformed process {Z_t = t_ν^{−1}(U_t) = t_ν^{−1}(F*(Y_t))} is a Student-t process that can be represented by

Z_t = ρ Z_{t−1} + σ(Z_{t−1}) e_t,

where e_t ∼ t_{ν+1} and is independent of Y_{t−1}, and

σ(Z_{t−1}) = √( ((ν + Z_{t−1}²) / (ν + 1)) (1 − ρ²) )
is a known function of Z_{t−1} = t_ν^{−1}(F*(Y_{t−1})). (If and only if the true marginal distribution F* is also t_ν, then Z_t = t_ν^{−1}(F*(Y_t)) = Y_t.) The τ-th conditional quantile function of Z_t given Z_{t−1} is then given by

Q_{Z_t}(τ | F_{t−1}) = ρ Z_{t−1} + σ(Z_{t−1}) t_{ν+1}^{−1}(τ).

Let θ(τ) = (ρ, α(τ), β(τ)), where

α(τ) = ν(1 − ρ²)[t_{ν+1}^{−1}(τ)]² / (1 + ν),   β(τ) = (1 − ρ²)[t_{ν+1}^{−1}(τ)]² / (1 + ν);

we can rewrite the conditional quantile function as

Q_{Z_t}(τ | F_{t−1}) = ρ Z_{t−1} + √( α(τ) + β(τ) Z_{t−1}² ).
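The rewriting of the Student-t conditional quantile as ρZ_{t−1} + √(α(τ) + β(τ)Z²_{t−1}) is a purely algebraic identity (for upper quantiles, where t_{ν+1}^{−1}(τ) ≥ 0) that can be verified numerically; the parameter values below are arbitrary illustrations:

```python
import numpy as np
from scipy import stats

rho, nu, tau = 0.4, 5.0, 0.95
z = np.linspace(-3, 3, 7)  # a grid of Z_{t-1} values

# Direct form: rho*z + sigma(z) * t_{nu+1}^{-1}(tau), with sigma as in the text.
sigma = np.sqrt((nu + z**2) / (nu + 1) * (1 - rho**2))
direct = rho * z + sigma * stats.t.ppf(tau, nu + 1)

# Rewritten form with quantile-specific parameters alpha(tau), beta(tau).
q = stats.t.ppf(tau, nu + 1)
alpha_tau = nu * (1 - rho**2) * q**2 / (1 + nu)
beta_tau = (1 - rho**2) * q**2 / (1 + nu)
rewritten = rho * z + np.sqrt(alpha_tau + beta_tau * z**2)

assert np.allclose(direct, rewritten)
```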
This example can be generalized to any first-order Markov model that is generated from an elliptical copula and an elliptical marginal distribution of the same form.² The conditional mean is linear and the conditional variance is homoskedastic if and only if the copula is Gaussian with a Gaussian marginal. The Gaussian copula does not exhibit tail dependence, while the Student-t copula and other elliptical copulas have symmetric tail dependence. For many financial applications, copulas that possess asymmetric tail dependence properties are more appropriate.

EXAMPLE 2.3. Joe–Clayton Copula. The Joe–Clayton copula is given by

C(u, v; α) = 1 − { 1 − [ (1 − ū^k)^{−γ} + (1 − v̄^k)^{−γ} − 1 ]^{−1/γ} }^{1/k},

where ū = 1 − u, v̄ = 1 − v, α = (k, γ) and k ≥ 1, γ > 0. It is known that the lower tail dependence parameter for this family is λ_L = 2^{−1/γ} and the upper tail dependence parameter is λ_U = 2 − 2^{1/k}. When k = 1, the Joe–Clayton copula reduces to the Clayton copula:

C(u, v; α) = [ u^{−α} + v^{−α} − 1 ]^{−1/α}, where α = γ > 0.

When γ → 0, the Joe–Clayton copula approaches the Joe copula, whose concordance ordering and upper tail dependence increase as k increases. For other properties of the Joe–Clayton copula, see Joe (1997) and Patton (2006). When coupled with heavy-tailed marginal distributions such as the Student's t distribution, this family of copulas can generate time series with clusters of extreme values and hence provide alternative models for economic and financial time series that exhibit such clusters. For the Joe–Clayton copula, one can easily verify that

C_1(u_{t−1}, u_t; α) = ū_{t−1}^{k−1} (1 − ū_{t−1}^k)^{−(γ+1)}
  × [ (1 − ū_{t−1}^k)^{−γ} + (1 − ū_t^k)^{−γ} − 1 ]^{−(γ^{−1}+1)}
  × { 1 − [ (1 − ū_{t−1}^k)^{−γ} + (1 − ū_t^k)^{−γ} − 1 ]^{−1/γ} }^{(1/k)−1}.

For any τ ∈ (0, 1), solving τ = C_1(u_{t−1}, u_t; α) for u_t, we obtain the τ-th conditional quantile function of U_t given u_{t−1} based on the Clayton copula:

Q_{U_t}(τ | u_{t−1}) = [ (τ^{−α/(1+α)} − 1) u_{t−1}^{−α} + 1 ]^{−1/α}.

² An elliptical copula is a copula generated from an elliptically symmetric bivariate distribution.
Note that this expression and the similar expressions in the foregoing examples provide a convenient mechanism with which to simulate observations from the respective models. See Bouyé and Salmon (2008) for additional examples of copula-based conditional quantile functions.
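The simulation mechanism just mentioned amounts to drawing τ_t ∼ Uniform(0, 1) and applying the conditional quantile transform at each step. A sketch for the Clayton case (parameter values are illustrative; a Y_t series with any marginal F is then obtained as Y_t = F^{−1}(U_t)):

```python
import numpy as np

def simulate_clayton_qar(n, alpha, u0=0.5, seed=1):
    """Simulate {U_t} from the Clayton-copula Markov chain by drawing
    tau_t ~ U(0,1) and applying the conditional quantile transform
    Q_{U_t}(tau | u_{t-1}) = [(tau^{-alpha/(1+alpha)} - 1) u_{t-1}^{-alpha} + 1]^{-1/alpha}."""
    rng = np.random.default_rng(seed)
    u = np.empty(n)
    prev = u0
    for t in range(n):
        tau = rng.uniform()
        prev = ((tau ** (-alpha / (1 + alpha)) - 1) * prev ** (-alpha) + 1) ** (-1 / alpha)
        u[t] = prev
    return u

# The invariant marginal of the chain is uniform on (0,1), so the sample mean
# should settle near 0.5 despite the lower-tail dependence of the Clayton copula.
u = simulate_clayton_qar(20000, alpha=2.0)
print(u.mean())
```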
3. ASYMPTOTIC PROPERTIES

In this section, we study estimation of the copula-based QAR model (2.2). The vector of parameters θ(τ), and thus the conditional quantile of Y_t, can be estimated by the following nonlinear quantile autoregression:

min_{θ ∈ Θ} Σ_t ρ_τ(Y_t − H(Y_{t−1}, θ)),    (3.1)

where ρ_τ(u) ≡ u(τ − I(u < 0)) is the usual check function (Koenker and Bassett, 1978). We denote the solution as θ̂(τ) ≡ arg min_{θ ∈ Θ} Σ_t ρ_τ(Y_t − H(Y_{t−1}, θ)). Then the τ-th conditional quantile of Y_t given Y_{t−1} = x can be estimated by

Q̂_{Y_t}(τ | Y_{t−1} = x) = H(x, θ̂(τ)) ≡ F^{−1}( C_1^{−1}(τ, F(x, β̂(τ)), α̂(τ)), β̂(τ) ).

3.1. Consistency

To facilitate our analysis, we define
C_1(u, v; α) ≡ ∂C(u, v; α)/∂u;   c(u, v; α) ≡ ∂²C(u, v; α)/∂u∂v.

Denote C_1^{−1}(τ, u; α) as the inverse function of C_1(u, v; α) with respect to the argument v, and H(x, θ) ≡ F^{−1}(C_1^{−1}(τ, F(x; β), α); β). We first introduce some simple regularity conditions to ensure consistency of our QAR estimator θ̂(τ).

ASSUMPTION 3.1. The parameter space Θ is a compact subset of ℝ^k.
ASSUMPTION 3.2. (i) F(·; β) and F^{−1}(·; β) (the inverse function of F(·; β)) are continuous with respect to all their arguments; (ii) the copula function C(u, v; α) is second-order differentiable with respect to u and v, and has copula density c(u, v; α); and (iii) C_1^{−1}(τ, u; α) (the inverse function of C_1(u, v; α) with respect to v) is continuous in α and u.

ASSUMPTION 3.3. (i) The true τ-th conditional quantile of Y_t given Y_{t−1}, Q_{Y_t}(τ | Y_{t−1}), takes the form H(Y_{t−1}, θ(τ)) ≡ F^{−1}(C_1^{−1}(τ, F(Y_{t−1}; β(τ)), α(τ)); β(τ)) for a θ(τ) = (α(τ)′, β(τ)′)′ ∈ Θ for almost all Y_{t−1}; (ii) the true unknown conditional density of Y_t given Y_{t−1}, g*(· | Y_{t−1}), is bounded and continuous, and there exist ε_1 > 0, p > 0 such that Pr[g*(Q_{Y_t}(τ | Y_{t−1})) ≥ ε_1] ≥ p.

ASSUMPTION 3.4. For any ε > 0, there exists a δ > 0 such that, for any ‖θ − θ(τ)‖ > ε,

Pr[ |H(Y_{t−1}, θ) − Q_{Y_t}(τ | Y_{t−1})| > δ | g*(Q_{Y_t}(τ | Y_{t−1})) ≥ ε_1 ] > 0.
ASSUMPTION 3.5. (i) E(sup_{θ∈Θ} |H(Y_{t−1}, θ)|) < ∞; (ii) {Y_t} is stationary, ergodic and satisfies Assumption DGP.

Assumptions 3.1–3.4 and 3.5(i) are mild regularity conditions that are typically imposed even for parametric nonlinear quantile regression of Y_t given x_t with i.i.d. data {(Y_t, x_t)}_{t=1}^n. Thus they are natural conditions for our nonlinear Markov model (with x_t = Y_{t−1}). Assumption 3.5(ii) is a very mild condition on the temporal dependence of {Y_t}. Although we do not assume correct specification of the parametric functional forms of the copula C(·; α) and the marginal distribution F(·; β), we assume that the parametric functional form of the conditional quantile H(Y_{t−1}, θ(τ)) is correct at the τ-th quantile (Assumption 3.3(i)). Hence, we do not need any beta-mixing decay rate condition on {Y_t} of the sort assumed in Chen and Fan (2006). See Beare (2008) for temporal dependence properties of copula-based strictly stationary Markov processes.

THEOREM 3.1. (Consistency) For any fixed τ ∈ (0, 1), under Assumptions 3.1–3.5, we have θ̂(τ) = θ(τ) + o_p(1).

3.2. Normality

We introduce the following additional notation:

Ḣ_θ(x, θ) ≡ ∂H(x; θ)/∂θ,   Ḧ_θθ(x, θ) ≡ ∂²H(x; θ)/∂θ∂θ′.

Given the consistency of θ̂(τ), we only need to impose the following additional conditions in a shrinking neighbourhood of θ(τ). Denote Θ_0 = A_0 × B_0 = {θ = (α′, β′)′ ∈ Θ : ‖θ − θ(τ)‖ = o(1)}. We assume:

ASSUMPTION 3.6. (i) Ḣ_θ(Y_{t−1}, θ) and Ḧ_θθ(Y_{t−1}, θ) are well defined and measurable for all θ ∈ Θ_0 and for almost all Y_{t−1}; (ii) E[sup_{θ∈Θ_0} |Ḣ_θ(Y_{t−1}, θ)|²] < ∞; (iii) E(sup_{θ∈Θ_0} |Ḧ_θθ(Y_{t−1}, θ)|) < ∞; and (iv) V(τ) and Ω(τ) are finite and non-singular, where

V(τ) ≡ E[ g*(Q_{Y_t}(τ | Y_{t−1})) Ḣ_θ(Y_{t−1}, θ(τ)) Ḣ_θ(Y_{t−1}, θ(τ))′ ],
Ω(τ) ≡ E[ Ḣ_θ(Y_{t−1}, θ(τ)) Ḣ_θ(Y_{t−1}, θ(τ))′ ].    (3.2)

We impose Assumptions 3.6(i) and (iii) for simplicity.
We could replace Assumptions 3.6(i) and (iii) by assuming only that Ḣ_θ(Y_{t−1}, θ) exists for θ ∈ Θ_0 and satisfies some milder regularity conditions, such as those imposed in Huber (1967) and Pollard (1985) for i.i.d. data and in Hansen et al. (1995) for stationary ergodic data, without requiring the existence of Ḧ_θθ(Y_{t−1}, θ) satisfying Assumption 3.6(iii). Comparing our Assumptions 3.1–3.6 to the regularity conditions imposed in earlier papers on parametric nonlinear quantile time series models (e.g. Weiss, 1991, White, 1994, Engle and Manganelli, 2004, and the references therein), we need neither mixing nor near-epoch-dependence-of-mixing-processes conditions (see our Assumption 3.5(ii)), and our moment requirements are also much weaker than the existing ones (see our Assumptions 3.5(i) and 3.6(ii)–(iii)). Both these relaxations are important for financial applications, which typically exhibit persistent temporal dependence and heavy-tailed marginals. Denote f(·; β) as the parametric density of F(·; β), and

h(x, α, β) = C_1^{−1}(τ; u, α)|_{u=F(x,β)} = C_1^{−1}(τ; F(x, β), α)
with C_{1u}^{−1}(τ; u, α) = ∂C_1^{−1}(τ; u, α)/∂u, ḣ_α(x, α, β) = ∂h(x, α, β)/∂α and Ḟ_β(x, β) = ∂F(x, β)/∂β. Then V(τ) and Ω(τ) defined in (3.2) can be expressed as follows:

V(τ) = [ V_αα(τ)  V_αβ(τ) ; V_βα(τ)  V_ββ(τ) ],   Ω(τ) = [ Ω_αα(τ)  Ω_αβ(τ) ; Ω_βα(τ)  Ω_ββ(τ) ],    (3.3)

where

V_αα(τ) = E[ (g*(Q_{Y_t}(τ | Y_{t−1})) / {f(Q_{Y_t}(τ | Y_{t−1}))}²) ḣ_α(Y_{t−1}; θ(τ)) ḣ_α(Y_{t−1}; θ(τ))′ ];

V_αβ(τ) = E[ (g*(Q_{Y_t}(τ | Y_{t−1})) / f(Q_{Y_t}(τ | Y_{t−1}))) ḣ_α(Y_{t−1}; θ(τ)) ∂F^{−1}(h(Y_{t−1}; θ(τ)), β(τ))/∂β′ ]
  + E[ (g*(Q_{Y_t}(τ | Y_{t−1})) / {f(Q_{Y_t}(τ | Y_{t−1}))}²) ḣ_α(Y_{t−1}; θ(τ)) C_{1u}^{−1}(τ; F(Y_{t−1}, β(τ)), α(τ)) Ḟ_β(Y_{t−1}, β(τ))′ ];

V_βα(τ) = V_αβ(τ)′;

V_ββ(τ) = E[ g*(Q_{Y_t}(τ | Y_{t−1})) (∂F^{−1}(h(Y_{t−1}; θ(τ)), β(τ))/∂β) (∂F^{−1}(h(Y_{t−1}; θ(τ)), β(τ))/∂β′) ]
  + 2E[ (g*(Q_{Y_t}(τ | Y_{t−1})) / f(Q_{Y_t}(τ | Y_{t−1}))) (∂F^{−1}(h(Y_{t−1}; θ(τ)), β(τ))/∂β) C_{1u}^{−1}(τ; F(Y_{t−1}, β(τ)), α(τ)) Ḟ_β(Y_{t−1}, β(τ))′ ]
  + E[ (g*(Q_{Y_t}(τ | Y_{t−1})) / {f(Q_{Y_t}(τ | Y_{t−1}))}²) [C_{1u}^{−1}(τ; F(Y_{t−1}, β(τ)), α(τ))]² Ḟ_β(Y_{t−1}, β(τ)) Ḟ_β(Y_{t−1}, β(τ))′ ];

Ω_αα(τ) = E[ (1 / {f(Q_{Y_t}(τ | Y_{t−1}))}²) ḣ_α(Y_{t−1}; θ(τ)) ḣ_α(Y_{t−1}; θ(τ))′ ];

Ω_αβ(τ) = E[ (1 / f(Q_{Y_t}(τ | Y_{t−1}))) ḣ_α(Y_{t−1}; θ(τ)) ∂F^{−1}(h(Y_{t−1}; θ(τ)), β(τ))/∂β′ ]
  + E[ (1 / {f(Q_{Y_t}(τ | Y_{t−1}))}²) ḣ_α(Y_{t−1}; θ(τ)) C_{1u}^{−1}(τ; F(Y_{t−1}, β(τ)), α(τ)) Ḟ_β(Y_{t−1}, β(τ))′ ];

Ω_βα(τ) = Ω_αβ(τ)′;

Ω_ββ(τ) = E[ (∂F^{−1}(h(Y_{t−1}; θ(τ)), β(τ))/∂β) (∂F^{−1}(h(Y_{t−1}; θ(τ)), β(τ))/∂β′) ]
  + 2E[ (1 / f(Q_{Y_t}(τ | Y_{t−1}))) (∂F^{−1}(h(Y_{t−1}; θ(τ)), β(τ))/∂β) C_{1u}^{−1}(τ; F(Y_{t−1}, β(τ)), α(τ)) Ḟ_β(Y_{t−1}, β(τ))′ ]
  + E[ (1 / {f(Q_{Y_t}(τ | Y_{t−1}))}²) [C_{1u}^{−1}(τ; F(Y_{t−1}, β(τ)), α(τ))]² Ḟ_β(Y_{t−1}, β(τ)) Ḟ_β(Y_{t−1}, β(τ))′ ].
THEOREM 3.2. For any fixed τ ∈ (0, 1), under Assumptions 3.1–3.6 and θ(τ) ∈ int(Θ), we have

√n (θ̂(τ) − θ(τ)) ⇒ N( 0, τ(1 − τ) V(τ)^{−1} Ω(τ) V(τ)^{−1} ),

where V(τ) and Ω(τ) are given in (3.2) (or, equivalently, in (3.3)).
REMARK 3.1. When the marginal distribution function of Y is completely known, F(y, β) = F(y), V(τ) and Ω(τ) reduce to the following simplified forms:

V(τ) = E[ (g*(Q_{Y_t}(τ | Y_{t−1})) / [f(Q_{Y_t}(τ | Y_{t−1}))]²) ḣ_α(Y_{t−1}; α(τ)) ḣ_α(Y_{t−1}; α(τ))′ ],
Ω(τ) = E[ (1 / [f(Q_{Y_t}(τ | Y_{t−1}))]²) ḣ_α(Y_{t−1}; α(τ)) ḣ_α(Y_{t−1}; α(τ))′ ].

REMARK 3.2. When both the copula function C*(u, v) = C(u, v; α) and the marginal distribution F*(y) = F(y; β) are correctly specified, the parameters θ(τ) define an explicit one-dimensional manifold in Θ, as illustrated in the examples of Section 2.3. To the extent that the estimated θ̂(τ) departs from this curve we can infer various forms of misspecification. See, for example, Koenker and Xiao (2002).
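The minimization (3.1) can be carried out with any derivative-free optimizer, since the check function is not differentiable. A minimal sketch on the Gaussian-copula/Gaussian-marginal case, where H(x, θ) is linear and the true quantile-specific parameters are known (the linear H and the simulated AR(1) are illustrative choices, not a general recipe):

```python
import numpy as np
from scipy import optimize

def check(u, tau):
    """Koenker-Bassett check function rho_tau(u) = u*(tau - 1{u<0})."""
    return u * (tau - (u < 0))

def qar_objective(theta, y, tau, H):
    """Objective (3.1): sum_t rho_tau(Y_t - H(Y_{t-1}, theta))."""
    return check(y[1:] - H(y[:-1], theta), tau).sum()

# Illustrative model: Gaussian AR(1) with coefficient a, so the true tau-th
# conditional quantile is a*x + sqrt(1-a^2)*Phi^{-1}(tau) and H is linear.
H = lambda x, theta: theta[0] * x + theta[1]
rng = np.random.default_rng(2)
a, n = 0.5, 5000
y = np.empty(n); y[0] = 0.0
for t in range(1, n):
    y[t] = a * y[t - 1] + np.sqrt(1 - a**2) * rng.normal()

tau = 0.75
res = optimize.minimize(qar_objective, x0=np.array([0.1, 0.1]),
                        args=(y, tau, H), method="Nelder-Mead")
# Expect theta close to (a, sqrt(1-a^2)*Phi^{-1}(0.75)) = (0.5, ~0.584).
print(res.x)
```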
4. INFERENCE

The asymptotic normality of the QAR estimate also facilitates inference. In order to standardize the QAR estimator and remove nuisance parameters from the limiting distribution, we need to estimate the asymptotic covariance matrix; in particular, we need to estimate Ω(τ) and V(τ). Let

Q̂_{Y_t}(τ | Y_{t−1}) ≡ H(Y_{t−1}, θ̂(τ)),

and let f̂ = f(·, β̂(τ)) be the plug-in estimate of the parametric marginal density function. Then Ω(τ) can be estimated by

Ω̂_n(τ) = [ Ω̂_{n,αα}(τ)  Ω̂_{n,αβ}(τ) ; Ω̂_{n,βα}(τ)  Ω̂_{n,ββ}(τ) ],

with

Ω̂_{n,αα}(τ) = (1/n) Σ_{t=1}^n (1 / {f̂(Q̂_{Y_t}(τ | Y_{t−1}))}²) ḣ_α(Y_{t−1}; θ̂(τ)) ḣ_α(Y_{t−1}; θ̂(τ))′;

Ω̂_{n,αβ}(τ) = (1/n) Σ_{t=1}^n (1 / f̂(Q̂_{Y_t}(τ | Y_{t−1}))) ḣ_α(Y_{t−1}; θ̂(τ)) ∂F^{−1}(h(Y_{t−1}; θ̂(τ)), β̂(τ))/∂β′
  + (1/n) Σ_{t=1}^n (ḣ_α(Y_{t−1}; θ̂(τ)) / {f̂(Q̂_{Y_t}(τ | Y_{t−1}))}²) C_{1u}^{−1}(τ; F(Y_{t−1}, β̂(τ)), α̂(τ)) Ḟ_β(Y_{t−1}, β̂(τ))′;

Ω̂_{n,βα}(τ) = Ω̂_{n,αβ}(τ)′;

Ω̂_{n,ββ}(τ) = (1/n) Σ_{t=1}^n (∂F^{−1}(h(Y_{t−1}; θ̂(τ)), β̂(τ))/∂β) (∂F^{−1}(h(Y_{t−1}; θ̂(τ)), β̂(τ))/∂β′)
  + (2/n) Σ_{t=1}^n (1 / f̂(Q̂_{Y_t}(τ | Y_{t−1}))) (∂F^{−1}(h(Y_{t−1}; θ̂(τ)), β̂(τ))/∂β) C_{1u}^{−1}(τ; F(Y_{t−1}, β̂(τ)), α̂(τ)) Ḟ_β(Y_{t−1}, β̂(τ))′
  + (1/n) Σ_{t=1}^n (1 / {f̂(Q̂_{Y_t}(τ | Y_{t−1}))}²) [C_{1u}^{−1}(τ; F(Y_{t−1}, β̂(τ)), α̂(τ))]² Ḟ_β(Y_{t−1}, β̂(τ)) Ḟ_β(Y_{t−1}, β̂(τ))′.
Next, the true (unknown) conditional density of Y_t given Y_{t−1}, g*(Q_{Y_t}(τ | Y_{t−1})), can be estimated by the difference quotients

ĝ(Q̂_{Y_t}(τ | Y_{t−1})) = (τ_i − τ_{i−1}) / (Q̂_{Y_t}(τ_i | Y_{t−1}) − Q̂_{Y_t}(τ_{i−1} | Y_{t−1})),

for some appropriately chosen sequence of {τ_i}'s. Then the matrix V(τ) can be estimated by

V̂_n(τ) = [ V̂_{n,αα}(τ)  V̂_{n,αβ}(τ) ; V̂_{n,βα}(τ)  V̂_{n,ββ}(τ) ],

with

V̂_{n,αα}(τ) = (1/n) Σ_{t=1}^n (ĝ(Q̂_{Y_t}(τ | Y_{t−1})) / {f̂(Q̂_{Y_t}(τ | Y_{t−1}))}²) ḣ_α(Y_{t−1}; θ̂(τ)) ḣ_α(Y_{t−1}; θ̂(τ))′;

V̂_{n,αβ}(τ) = (1/n) Σ_{t=1}^n (ĝ(Q̂_{Y_t}(τ | Y_{t−1})) / f̂(Q̂_{Y_t}(τ | Y_{t−1}))) ḣ_α(Y_{t−1}; θ̂(τ)) ∂F^{−1}(h(Y_{t−1}; θ̂(τ)), β̂(τ))/∂β′
  + (1/n) Σ_{t=1}^n (ĝ(Q̂_{Y_t}(τ | Y_{t−1})) / {f̂(Q̂_{Y_t}(τ | Y_{t−1}))}²) ḣ_α(Y_{t−1}; θ̂(τ)) C_{1u}^{−1}(τ; F(Y_{t−1}, β̂(τ)), α̂(τ)) Ḟ_β(Y_{t−1}, β̂(τ))′;

V̂_{n,βα}(τ) = V̂_{n,αβ}(τ)′;

V̂_{n,ββ}(τ) = (1/n) Σ_{t=1}^n ĝ(Q̂_{Y_t}(τ | Y_{t−1})) (∂F^{−1}(h(Y_{t−1}; θ̂(τ)), β̂(τ))/∂β) (∂F^{−1}(h(Y_{t−1}; θ̂(τ)), β̂(τ))/∂β′)
  + (2/n) Σ_{t=1}^n (ĝ(Q̂_{Y_t}(τ | Y_{t−1})) / f̂(Q̂_{Y_t}(τ | Y_{t−1}))) (∂F^{−1}(h(Y_{t−1}; θ̂(τ)), β̂(τ))/∂β) C_{1u}^{−1}(τ; F(Y_{t−1}, β̂(τ)), α̂(τ)) Ḟ_β(Y_{t−1}, β̂(τ))′
  + (1/n) Σ_{t=1}^n (ĝ(Q̂_{Y_t}(τ | Y_{t−1})) / {f̂(Q̂_{Y_t}(τ | Y_{t−1}))}²) [C_{1u}^{−1}(τ; F(Y_{t−1}, β̂(τ)), α̂(τ))]² Ḟ_β(Y_{t−1}, β̂(τ)) Ḟ_β(Y_{t−1}, β̂(τ))′.
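The difference-quotient density estimate can be illustrated on a model where Q(τ | x) is known exactly, so the quotient can be compared to the true density. The Gaussian AR(1) below is an illustrative choice: there Q(τ | x) = a·x + s·Φ^{−1}(τ), so g*(Q(τ | x)) = φ(Φ^{−1}(τ))/s:

```python
import numpy as np
from scipy import stats

# Difference-quotient estimate of the conditional density g*(Q(tau|x)):
#   ghat = (tau_i - tau_{i-1}) / (Qhat(tau_i|x) - Qhat(tau_{i-1}|x)).
a, s, x = 0.5, 0.8, 1.0
Q = lambda tau: a * x + s * stats.norm.ppf(tau)  # exact conditional quantile

tau_grid = np.array([0.70, 0.75, 0.80])
ghat = np.diff(tau_grid) / np.diff(Q(tau_grid))  # quotients on the two subintervals
exact = stats.norm.pdf(stats.norm.ppf(0.75)) / s  # true density at the 0.75 quantile

# Since the density is decreasing here, the two quotients bracket the exact value.
print(ghat, exact)
```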
Wald-type tests can then be constructed immediately based on the standardized QAR estimators using Ω̂_n(τ) and V̂_n(τ). The copula-based QAR models and the related quantile regression estimation also provide important information about specification. Specification of, say, the copula function may be investigated based on parameter constancy over quantiles, along the lines of Koenker and Xiao (2006). In addition, specification of conditional quantile models can be studied based on the quantile autoregression residuals. For example, if we want to test a hypothesis of the general form

H_0 : R(θ(τ)) = 0,

where R(θ) is a q-dimensional vector of smooth functions of θ, with derivatives to the second order, the asymptotic normality derived in the previous section facilitates the construction of a Wald statistic. Letting

Ṙ(θ(τ)) = [ ∂R_1(θ)/∂θ, …, ∂R_q(θ)/∂θ ]|_{θ=θ(τ)}

denote the p × q matrix of derivatives of R(θ), we can construct the following regression Wald statistic:

W_{n,τ} ≡ n R(θ̂(τ))′ [ τ(1 − τ) Ṙ(θ̂(τ))′ V̂_n(τ)^{−1} Ω̂_n(τ) V̂_n(τ)^{−1} Ṙ(θ̂(τ)) ]^{−1} R(θ̂(τ)).
X. Chen, R. Koenker and Z. Xiao
Under the hypothesis and our regularity conditions, we have $W_{n,\tau} \Rightarrow \chi_q^2$, where $\chi_q^2$ has a central chi-square distribution with $q$ degrees of freedom.
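To fix ideas, $W_{n,\tau}$ is just a quadratic form in $R(\hat\theta(\tau))$ once estimates of $V(\tau)$ and $\Omega(\tau)$ are in hand. The sketch below is ours, not the authors' code; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def wald_statistic(theta_hat, R, R_dot, V_hat, Omega_hat, tau, n):
    """Regression Wald statistic W_{n,tau} for H0: R(theta(tau)) = 0.

    theta_hat        : estimated p-vector theta_hat(tau)
    R                : function returning the q-vector of restrictions
    R_dot            : p x q matrix of derivatives of R at theta_hat
    V_hat, Omega_hat : estimates of V(tau) and Omega(tau)
    """
    r = np.atleast_1d(R(theta_hat))
    Vinv = np.linalg.inv(V_hat)
    # Asymptotic covariance of sqrt(n)(theta_hat - theta): tau(1-tau) V^{-1} Omega V^{-1}
    cov = tau * (1.0 - tau) * Vinv @ Omega_hat @ Vinv
    middle = R_dot.T @ cov @ R_dot                  # q x q
    return float(n * r @ np.linalg.solve(middle, r))

# Toy illustration (all numbers hypothetical): test H0: theta_2(tau) = 0 with p = 2, q = 1.
theta_hat = np.array([0.5, 0.2])
W_stat = wald_statistic(theta_hat, lambda th: th[1:], np.array([[0.0], [1.0]]),
                        np.eye(2), np.eye(2), tau=0.5, n=100)
print(W_stat)  # compare with the chi-square(1) critical value 3.841 at the 5% level
```

Under $H_0$ the statistic is referred to the $\chi_q^2$ distribution, as in the convergence result above.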
5. CONCLUSION

There are many competing approaches to broadening the scope of nonlinear time series modelling. We have argued that parametric copulas offer an attractive framework for specifying nonlinear quantile autoregression models. In contrast to fully parametric methods like maximum likelihood that impose a global parametric structure, estimation of distinct copula-based QAR models retains considerable semi-parametric flexibility by permitting local, quantile-specific parameters. There are many possible directions for future development. Inference and specification diagnostics are clearly a priority. Extensions to methods based on nonparametric estimation of the invariant distribution are possible. Finally, semi-parametric modelling of the copula itself as a sieve appears to be a feasible strategy for expanding the menu of currently available parametric copulas.
ACKNOWLEDGMENTS

We thank Richard Smith and a referee for helpful comments on an earlier version of this paper. Chen and Koenker gratefully acknowledge financial support from National Science Foundation grants SES-0631613 and SES-0544673, respectively.
REFERENCES

Beare, B. (2008). Copulas and temporal dependence. Working paper, U. C. San Diego.
Bouyé, E. and M. Salmon (2008). Dynamic copula quantile regressions and tail area dynamic dependence in forex markets. Working paper, Financial Econometrics Research Centre, Warwick Business School, UK.
Chen, X. and Y. Fan (2006). Estimation of copula-based semiparametric time series models. Journal of Econometrics 130, 307–35.
Engle, R. and S. Manganelli (2004). CAViaR: conditional autoregressive value at risk by regression quantiles. Journal of Business and Economic Statistics 22, 367–81.
Hansen, L. P., J. Heaton and E. Luttmer (1995). Econometric evaluation of asset pricing models. The Review of Financial Studies 8, 237–74.
Hayashi, F. (2000). Econometrics. Princeton: Princeton University Press.
Huber, P. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In L. Le Cam and J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume I, 221–33. Berkeley: University of California Press.
Ibragimov, R. (2006). Copulas-based characterizations and higher-order Markov processes. Working paper, Harvard University.
Joe, H. (1997). Multivariate Models and Dependence Concepts. London: Chapman & Hall/CRC.
Koenker, R. (2005). Quantile Regression. Econometric Society Monographs 38. New York: Cambridge University Press.
Koenker, R. and G. Bassett (1978). Regression quantiles. Econometrica 46, 33–49.
Copula-based nonlinear quantile autoregression
Koenker, R. and Z. Xiao (2002). Inference on the quantile regression process. Econometrica 70, 1583–612.
Koenker, R. and Z. Xiao (2006). Quantile autoregression. Journal of the American Statistical Association 101, 980–90.
Newey, W. K. and D. F. McFadden (1994). Large sample estimation and hypothesis testing. In R. F. Engle and D. F. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2113–247. Amsterdam: North-Holland.
Patton, A. (2006). Modelling asymmetric exchange rate dependence. International Economic Review 47, 527–56.
Patton, A. (2009). Copula-based models for financial time series. Forthcoming in T. G. Andersen, R. A. Davis, J.-P. Kreiss and T. Mikosch (Eds.), Handbook of Financial Time Series. New York: Springer-Verlag.
Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory 1, 295–313.
Weiss, A. (1991). Estimating nonlinear dynamic models using least absolute errors estimation. Econometric Theory 7, 46–68.
White, H. (1994). Estimation, Inference and Specification Analysis. Econometric Society Monographs no. 22. New York: Cambridge University Press.
APPENDIX: MATHEMATICAL PROOFS

Proof of Theorem 3.1. We denote $Y_{t-1}$ as $x_t$. Then $\hat\theta(\tau) = \arg\min_{\theta\in\Theta} \sum_t \rho_\tau(Y_t - H(x_t,\theta))$, where $\rho_\tau(u) \equiv u(\tau - I(u<0))$. Define
$$\varepsilon_t \equiv Y_t - Q_{Y_t}(\tau|x_t) \equiv Y_t - H(x_t,\theta(\tau)).$$
Then $Q_{\varepsilon_t}(\tau|x_t) = 0$ and
$$Y_t = H(x_t,\theta(\tau)) + \varepsilon_t, \qquad \Pr[\varepsilon_t \le 0 \mid x_t] = \tau.$$
Denote $\bar H(x_t,\theta) \equiv H(x_t,\theta) - H(x_t,\theta(\tau))$ and $q_\tau(Y_t,x_t,\theta) \equiv \rho_\tau(\varepsilon_t - \bar H(x_t,\theta)) - \rho_\tau(\varepsilon_t)$, and
$$Q_n(\theta) \equiv \frac{1}{n}\sum_{t=1}^n q_\tau(Y_t,x_t,\theta).$$
Then it is easy to see that $\hat\theta(\tau) = \arg\min_{\theta\in\Theta} Q_n(\theta)$ and $\theta(\tau) = \arg\min_{\theta\in\Theta} E[Q_n(\theta)]$.

We apply Theorem 2.1 of Newey and McFadden (1994) to establish consistency. The compactness of $\Theta$ (Assumption 3.1) and the continuity of $E[Q_n(\theta)]$ with respect to $\theta\in\Theta$ (Assumptions 3.2 and 3.3) are directly assumed. It remains to verify uniform convergence ($\sup_{\theta\in\Theta} |Q_n(\theta) - E[Q_n(\theta)]| = o_p(1)$), and that $\theta(\tau)$ is the unique minimizer of $E[Q_n(\theta)]$. Notice that under Assumptions 3.2 and 3.3, $q_\tau(Y_t,x_t,\theta)$ is continuous in $\theta\in\Theta$ and measurable in $(Y_t,x_t)$. Since
$$\sup_{\theta\in\Theta} |q_\tau(Y_t,x_t,\theta)| = \sup_{\theta\in\Theta} \big|\rho_\tau(\varepsilon_t - \bar H(x_t,\theta)) - \rho_\tau(\varepsilon_t)\big| \le \sup_{\theta\in\Theta} |H(x_t,\theta) - H(x_t,\theta(\tau))|,$$
we have $E(\sup_{\theta\in\Theta} |q_\tau(Y_t,x_t,\theta)|) < \infty$ under Assumption 3.5(i). These, the compactness of $\Theta$ (Assumption 3.1) and the stationary ergodicity of $\{Y_t\}$ (Assumption 3.5(ii)) together imply that all the conditions of Proposition 7.1 of Hayashi (2000) hold. Thus, by applying the uniform law of large numbers
for stationary ergodic processes (see, e.g. Proposition 7.1 of Hayashi, 2000), we obtain the uniform convergence: $\sup_{\theta\in\Theta} |Q_n(\theta) - E[Q_n(\theta)]| = o_p(1)$.

Next we verify that $E[Q_n(\theta)]$ is uniquely minimized at $\theta(\tau)$. Recall that the true but unknown conditional density and distribution function of $Y_t$ given $x_t$ are $g^*(\cdot\,|\,x_t)$ and $G^*(\cdot\,|\,x_t)$ respectively. Using the identity
$$\rho_\tau(u-v) - \rho_\tau(u) = -v\psi_\tau(u) + (u-v)\{I(0>u>v) - I(0<u<v)\} = -v\psi_\tau(u) + \int_0^v \{I(u\le s) - I(u<0)\}\,ds, \tag{A.1}$$
where $\psi_\tau(u) \equiv \tau - I(u<0)$ and, by definition, $E[\psi_\tau(\varepsilon_t)\,|\,x_t] = 0$, we have, with the simplified notation $\bar H_t = \bar H(x_t,\theta)$,
$$q_\tau(Y_t,x_t,\theta) = \rho_\tau(\varepsilon_t - \bar H_t) - \rho_\tau(\varepsilon_t) = -\bar H_t\psi_\tau(\varepsilon_t) + \int_0^{\bar H_t} \{I(\varepsilon_t\le s) - I(\varepsilon_t<0)\}\,ds.$$
Thus $E[Q_n(\theta)] = E\{E[q_\tau(Y_t,x_t,\theta)\,|\,x_t]\}$ and
$$E[q_\tau(Y_t,x_t,\theta)\,|\,x_t] = E\left[\int_0^{\bar H_t} \{I(\varepsilon_t\le s) - I(\varepsilon_t<0)\}\,ds \,\Big|\, x_t\right]$$
$$= 1\{\bar H_t>0\}\,E\left[\int_0^{\bar H_t} I(0\le\varepsilon_t\le s)\,ds \,\Big|\, x_t\right] + 1\{\bar H_t<0\}\,E\left[\int_{\bar H_t}^0 I(s\le\varepsilon_t\le 0)\,ds \,\Big|\, x_t\right].$$
Notice that under Assumption 3.3,
$$1\{\bar H_t>0\}\,E\left[\int_0^{\bar H_t} I(0\le\varepsilon_t\le s)\,ds \,\Big|\, x_t\right] = 1\{\bar H_t>0\} \int_0^{\bar H_t}\left(\int_{Q_{Y_t}(\tau|x_t)}^{s+Q_{Y_t}(\tau|x_t)} g^*(y|x_t)\,dy\right)ds$$
$$\ge 1\{\bar H_t>0\}\,1\{g^*(Q_{Y_t}(\tau|x_t))\ge\epsilon_1\} \int_0^{\bar H_t}\left(\int_{Q_{Y_t}(\tau|x_t)}^{s+Q_{Y_t}(\tau|x_t)} g^*(y|x_t)\,dy\right)ds \ge \frac{\epsilon_1}{2}\,1\{\bar H_t>0\}\,1\{g^*(Q_{Y_t}(\tau|x_t))\ge\epsilon_1\}\,\bar H_t^2,$$
and a similar result can be obtained for the case $\bar H_t < 0$. Thus,
$$E[Q_n(\theta)] \ge \frac{\epsilon_1}{2}\, E\big[1\{g^*(Q_{Y_t}(\tau|x_t))\ge\epsilon_1\}\,\bar H_t^2\big],$$
which, under Assumption 3.4, is strictly positive. Thus for any $\epsilon>0$, $E[Q_n(\theta)]$ is bounded away from zero, uniformly in $\theta$ for $\|\theta-\theta(\tau)\| \ge \epsilon$. $\square$

Proof of Theorem 3.2. We obtain the asymptotic normality using Pollard's (1985) approach. In particular, we apply Pollard's (1985) Theorem 2, except that we replace his i.i.d. assumption by our stationary ergodic data Assumption 3.5(ii) (note that we could also apply Theorem 7.1 of Newey and McFadden, 1994). Recall that $\hat\theta(\tau) = \arg\min_{\theta\in\Theta} \frac{1}{n}\sum_t \rho_\tau(Y_t - H(x_t,\theta))$ and, by our Theorem 3.1, $\hat\theta(\tau)\in\Theta_0$ with probability approaching one. Note that $\psi_\tau(u) \equiv \tau - I(u<0)$ is the right-hand derivative of $\rho_\tau(u) \equiv u(\tau - I(u<0))$ ($\rho_\tau(u)$ is everywhere differentiable with respect to $u$ except at $u=0$). Under Assumption 3.6(i),
the derivative of $\rho_\tau(Y_t - H(x_t,\theta))$ with respect to $\theta\in\Theta_0$ exists (except at the point $Y_t = H(x_t,\theta)$), and is given by
$$\varphi_{t\tau}(\theta) \equiv [\tau - I(Y_t < H(x_t,\theta))]\,\dot H_\theta(x_t,\theta).$$
By the mean value theorem,
$$\rho_\tau(Y_t - H(x_t,\theta)) = \rho_\tau(Y_t - H(x_t,\theta(\tau))) + (\theta-\theta(\tau))'\varphi_{t\tau}(\theta(\tau)) + \|\theta-\theta(\tau)\|\, r_t(\theta)$$
with
$$r_t(\theta) \equiv \frac{(\theta-\theta(\tau))'[\varphi_{t\tau}(\bar\theta) - \varphi_{t\tau}(\theta(\tau))]}{\|\theta-\theta(\tau)\|},$$
where $\bar\theta\in\Theta_0$ lies between $\theta$ and $\theta(\tau)$. Likewise,
$$E[\rho_\tau(Y_t - H(x_t,\theta))] = E[\rho_\tau(Y_t - H(x_t,\theta(\tau)))] + (\theta-\theta(\tau))'E[\varphi_{t\tau}(\theta(\tau))] + \|\theta-\theta(\tau)\|\,E[r_t(\theta)].$$
Since $E[\tau - I(Y_t < H(x_t,\theta(\tau)))\,|\,x_t] = 0$ under Assumption 3.3, we have, under Assumptions 3.3, 3.5 and 3.6(i)(iv), that $E[\rho_\tau(Y_t - H(x_t,\theta))]$ has a second-order (i.e. $E[\varphi_{t\tau}(\theta)]$ has a first-order) derivative at $\theta(\tau)$ that is nonsingular, and is given by
$$-V(\tau) \equiv -E\big[g^*(H(x_t,\theta(\tau)))\,\dot H_\theta(x_t,\theta(\tau))\,\dot H_\theta(x_t,\theta(\tau))'\big].$$
Thus condition (i) of Pollard's (1985) Theorem 2 is satisfied. Condition (ii) of Pollard's (1985) Theorem 2 is directly assumed ($\theta(\tau)\in\mathrm{int}(\Theta)$), and his condition (iii) holds due to our Theorem 3.1 ($\|\hat\theta(\tau)-\theta(\tau)\| = o_P(1)$). We shall replace his condition (iv) by a CLT for stationary ergodic martingale difference data. Since
$$E[\varphi_{t\tau}(\theta(\tau))\,|\,x_t] = E\big[\tau - I(Y_t < H(x_t,\theta(\tau)))\,\big|\,x_t\big]\,\dot H_\theta(x_t,\theta(\tau)) = 0,$$
$$\mathrm{Var}[\varphi_{t\tau}(\theta(\tau))\,|\,x_t] = \tau(1-\tau)\,\dot H_\theta(x_t,\theta(\tau))\,\dot H_\theta(x_t,\theta(\tau))'.$$
Under Assumptions 3.3, 3.5(ii) and 3.6(iv), we can apply the CLT for strictly stationary ergodic martingale difference sequences (see, e.g. Hayashi, 2000, p. 106) and obtain
$$\frac{1}{\sqrt n}\sum_{t=1}^n \varphi_{t\tau}(\theta(\tau)) \Rightarrow N(0, \tau(1-\tau)\Omega(\tau))$$
with $\Omega(\tau) \equiv E[\dot H_\theta(x_t,\theta(\tau))\,\dot H_\theta(x_t,\theta(\tau))']$. Thus it remains to verify that condition (v) (stochastic differentiability) of Pollard's (1985) Theorem 2 holds:
$$\sup_{\theta\in U_n} \frac{\big|\frac{1}{\sqrt n}\sum_t (r_t(\theta) - E[r_t(\theta)])\big|}{1 + \sqrt n\,\|\theta-\theta(\tau)\|} \to 0 \text{ in probability}$$
for each sequence of balls $\{U_n\}$ that shrinks to $\theta(\tau)$ as $n\to\infty$. Since
$$r_t(\theta) \equiv \frac{(\theta-\theta(\tau))'[\varphi_{t\tau}(\theta) - \varphi_{t\tau}(\theta(\tau))]}{\|\theta-\theta(\tau)\|},$$
Pollard's (1985) condition (v) holds provided that
$$\sup_{\theta\in U_n} \frac{\big\|\frac{1}{n}\sum_t \{[\varphi_{t\tau}(\theta) - \varphi_{t\tau}(\theta(\tau))] - E[\varphi_{t\tau}(\theta) - \varphi_{t\tau}(\theta(\tau))]\}\big\|}{\|\theta-\theta(\tau)\|} \to 0 \text{ in probability}$$
for each sequence of balls $\{U_n\}$ that shrinks to $\theta(\tau)$ as $n\to\infty$.
Recall that $\varphi_{t\tau}(\theta) \equiv [\tau - I(Y_t < H(x_t,\theta))]\,\dot H_\theta(x_t,\theta)$; we have
$$\varphi_{t\tau}(\theta) - \varphi_{t\tau}(\theta(\tau)) = \dot H_\theta(x_t,\theta)\,[I(Y_t < H(x_t,\theta(\tau))) - I(Y_t < H(x_t,\theta))]$$
$$\qquad + \big[\dot H_\theta(x_t,\theta) - \dot H_\theta(x_t,\theta(\tau))\big]\,[\tau - I(Y_t < H(x_t,\theta(\tau)))] \equiv R_{1t}(\theta) + R_{2t}(\theta).$$
Under Assumption 3.6(i)(iii) we have, for all $U_n \subseteq \Theta_0$,
$$E\left[\sup_{\theta\in U_n} \frac{\|R_{2t}(\theta)\|}{\|\theta-\theta(\tau)\|}\right] \le E\left[\sup_{\theta\in\Theta_0} \big\|\ddot H_{\theta\theta}(x_t,\theta)\big\|\right] < \infty.$$
By Assumption 3.3,
$$E[R_{2t}(\theta)] = E\Big[\big(\dot H_\theta(x_t,\theta) - \dot H_\theta(x_t,\theta(\tau))\big)\,E\{\tau - I(Y_t < H(x_t,\theta(\tau)))\,|\,x_t\}\Big] = 0.$$
Thus, under Assumptions 3.5(ii) and 3.6(i)(iii), by the uniform law of large numbers for stationary ergodic processes, since $U_n \subseteq \Theta_0 \subset \Theta$ we obtain
$$\sup_{\theta\in U_n} \frac{\big\|\frac{1}{n}\sum_t \{R_{2t}(\theta) - E[R_{2t}(\theta)]\}\big\|}{\|\theta-\theta(\tau)\|} = o_P(1)$$
for each sequence of balls $\{U_n\}$ that shrinks to $\theta(\tau)$ as $n\to\infty$. Consequently, Pollard's (1985) condition (v) holds provided that
$$\sup_{\theta\in U_n} \frac{\big\|\frac{1}{n}\sum_t \{R_{1t}(\theta) - E[R_{1t}(\theta)]\}\big\|}{\|\theta-\theta(\tau)\|} = o_P(1) \tag{A.2}$$
for each sequence of balls $\{U_n\}$ that shrinks to $\theta(\tau)$ as $n\to\infty$.

For any positive sequence of decreasing numbers $\{\epsilon_n\}$, denote $U_n \equiv \{\theta\in\Theta_0 : \theta\ne\theta(\tau),\; \|\theta-\theta(\tau)\| < \epsilon_n\}$. Then, under Assumption 3.6(i)(ii), we have
$$E\left[\sup_{\theta\in U_n} \frac{\|R_{1t}(\theta)\|}{\|\theta-\theta(\tau)\|}\right] \le E\left[\sup_{\theta\in\Theta_0} \big\|\dot H_\theta(x_t,\theta)\big\| \times E\left(\sup_{\theta\in U_n} \frac{|I(Y_t < H(x_t,\theta(\tau))) - I(Y_t < H(x_t,\theta))|}{\|\theta-\theta(\tau)\|} \,\Big|\, x_t\right)\right].$$
For all $\theta\in\Theta_0$, under Assumption 3.6(i)(iii), we have
$$H(x_t,\theta) = H(x_t,\theta(\tau)) + \dot H_\theta(x_t,\theta(\tau))'(\theta-\theta(\tau)) + \frac{(\theta-\theta(\tau))'\,\ddot H_{\theta\theta}(x_t,\bar\theta)\,(\theta-\theta(\tau))}{2}$$
with $E(\sup_{\theta\in\Theta_0} \|\ddot H_{\theta\theta}(x_t,\theta)\|) < \infty$. Therefore, under Assumptions 3.3 and 3.6(i)(iii), conditioning on $x_t$, there exists a small $\delta(x_t) > 0$ such that for all $\theta\in\Theta_0$ with $\|\theta-\theta(\tau)\| < \delta(x_t)$, we have that $Y_t - H(x_t,\theta(\tau))$ and $Y_t - H(x_t,\theta)$ are of the same sign. Hence, under Assumptions 3.3 and 3.6(i)(ii), conditioning
on $x_t$ and for any $\epsilon_n \le \delta(x_t)$ with $\epsilon_n \downarrow 0$, we have
$$E\left(\sup_{\theta\in U_n} \frac{|I(Y_t < H(x_t,\theta(\tau))) - I(Y_t < H(x_t,\theta))|}{\|\theta-\theta(\tau)\|} \,\Big|\, x_t\right)$$
$$\le E\left(\sup_{\theta\in U_n:\,\|\theta-\theta(\tau)\|<\delta(x_t)} \frac{I(Y_t < H(x_t,\theta)) - I(Y_t < H(x_t,\theta(\tau)))}{\|\theta-\theta(\tau)\|}\,1\{\bar H_t > 0\} \,\Big|\, x_t\right)$$
$$\quad + E\left(\sup_{\theta\in U_n:\,\|\theta-\theta(\tau)\|<\delta(x_t)} \frac{I(Y_t < H(x_t,\theta(\tau))) - I(Y_t < H(x_t,\theta))}{\|\theta-\theta(\tau)\|}\,1\{\bar H_t < 0\} \,\Big|\, x_t\right)$$
$$\le \mathrm{const.}\; g^*(H(x_t,\theta(\tau))) \times \sup_{\theta\in\Theta_0} \big\|\dot H_\theta(x_t,\theta)\big\|;$$
hence for $\epsilon_n \downarrow 0$,
$$E\left[\sup_{\theta\in U_n} \frac{\|R_{1t}(\theta)\|}{\|\theta-\theta(\tau)\|}\right] \le \mathrm{const.}\;E\left[\Big(\sup_{\theta\in\Theta_0}\big\|\dot H_\theta(x_t,\theta)\big\|\Big)^2 \times g^*(H(x_t,\theta(\tau)))\right] < \infty.$$
This and the uniform law of large numbers for stationary ergodic processes now imply that (A.2) holds. Therefore Pollard's (1985) Theorem 2 is applicable, and we obtain the desired normality result:
$$\sqrt n\,(\hat\theta(\tau) - \theta(\tau)) \Rightarrow N\big(0,\; V(\tau)^{-1}\,\tau(1-\tau)\,\Omega(\tau)\,V(\tau)^{-1}\big). \qquad\square$$
The Econometrics Journal (2009), volume 12, pp. S68–S82. doi: 10.1111/j.1368-423X.2008.00264.x

Large-sample inference on spatial dependence

P. M. ROBINSON

Department of Economics, London School of Economics, Houghton Street, London WC2A 2AE, UK
E-mail: [email protected]

First version received: July 2008; final version accepted: September 2008
Summary We consider cross-sectional data that exhibit no spatial correlation, but are feared to be spatially dependent. We demonstrate that a spatial version of the stochastic volatility model of financial econometrics, entailing a form of spatial autoregression, can explain such behaviour. The parameters are estimated by pseudo-Gaussian maximum likelihood based on log-transformed squares, and consistency and asymptotic normality are established. Asymptotically valid tests for spatial independence are developed. Keywords: Asymptotic theory, Independence testing, Parameter estimation, Spatial dependence.
1. INTRODUCTION The possibility of cross-sectional dependence haunts much analysis of econometric data. Rules of statistical inference based on cross-sectional or panel data frequently assume independence of observables or, more likely, of unobservable disturbances. These rules are typically invalidated if there is actually dependence. On the other hand, the modelling of cross-sectional dependence is hugely complicated by the usual lack of any natural ordering over the cross-section. This is in contrast to time series data, where dependence between variables at different times is frequently modelled as a function of their time difference, as is appropriate under stationarity. In the standard setting of equal spacing across time, elegant statistical procedures result, due to the ability to exploit the Toeplitz structure of the covariance matrix. When there is unequal spacing, matters are considerably complicated, but nevertheless there is still a natural ordering, and the ability to regard the observations as arising from sampling from, say, a continuous time process, and so it is still clear how one might proceed, for example under Gaussianity where it suffices to consider the mean and covariance structure. The absence of any such natural ordering poses more of a dilemma. One might consider pairwise covariances or correlations, but without data replication these cannot be consistently estimated in the absence of a suitable structure. When there is a spatial context, however, progress may be possible. Lattice data provide the simplest extension of time series, entailing equal spacing across two or more dimensions. Though there is a lack of a single obvious ordering, and difficulties due to end effects and in data simulation, there are natural ways of extending stationary time series models, and the corresponding rules of statistical inference.
However, lattice observations arise infrequently in econometrics. With spatio-temporal data, there may well still be regular spacing across time, but observations in geographical space are more likely to be at irregular intervals, in both dimensions, for example when these are identified with capital cities of countries. Matters are further complicated if observations are to be interpreted as aggregates across administrative regions of irregular shapes. And in many situations, geographical distances may not be the most relevant measures. More generally, pairwise 'economic distances' can be postulated, possibly varying with reversal of direction. Much of the methodology of spatial econometrics has pursued this setting, focussing on models of 'spatial autoregressive' type, which depend on the availability of such measures of distance between each pair of observations. In the spatial econometrics literature, dependence has usually been taken to be synonymous with correlation (an exception being Brett and Pinkse, 1997). On the other hand, other areas of econometric research stress the distinction between these concepts. In particular, financial time series that exhibit lack of serial correlation frequently contain evidence of dependence, for example in serial correlation of second moments, and considerable activity has been devoted to modelling such phenomena. In the present paper, we propose a model that combines features of a stochastic volatility model of financial econometrics with the spatial autoregressive model, derive asymptotic statistical theory for estimates of its parameters, and present related tests for the lack of dependence, justifying their asymptotic validity. The reference to stochastic volatility modelling is not necessarily intended to imply particular relevance to financial data, and the econometric and statistical literature on non-linearity and testing for dependence covers other possible applications also.
Generally, in non-Gaussian settings, dependence and correlation have different meanings and there may be interest in non-linear modelling and independence testing, our particular approach being specialised but parsimonious. One could also think of our model as applying not to raw data but to uncorrelated though not necessarily independent innovations, possible spatial correlation and explanatory variables having been previously taken care of in a conventional fashion. The following section describes the model and illustrates its ability to describe both dependence and the lack of correlation. Section 3 describes the parameter estimates. Section 4 establishes their consistency. Section 5 establishes their asymptotic normality. Section 6 examines tests that might be used to test the hypothesis of spatial independence. Some concluding remarks are offered in Section 7. Proofs are left to an appendix.
2. A MODEL FOR SPATIAL DEPENDENCE

Introduce first sequences $\eta_i$, $\varepsilon_i$, $i = 1, 2, \ldots$, of zero-mean independent and identically distributed (i.i.d.) random variables, having finite variances $\sigma_\eta^2$ and $\sigma_\varepsilon^2$, respectively, and such that $\eta_i$ and $\varepsilon_j$ are independent for all $i, j$. Next, define
$$\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)', \tag{2.1}$$
and an $n \times n$ weight matrix $W$, having zero diagonal elements, and for a scalar $\rho$ define
$$S(\rho) = I - \rho W, \tag{2.2}$$
$I$ being the $n \times n$ identity matrix. For some $\rho_0 \in (-1, 1)$, put $S_0 = S(\rho_0)$, and define the $n \times 1$ vector
$$\zeta = (\zeta_1, \ldots, \zeta_n)' \tag{2.3}$$
by the spatial autoregressive model (see Cliff and Ord, 1973)
$$S_0 \zeta = \varepsilon. \tag{2.4}$$
Though the $\zeta_i$ are unobservable, as are the $\eta_i$, we observe
$$x_i = \eta_i e^{\alpha_0 + \zeta_i}, \quad i = 1, \ldots, n, \tag{2.5}$$
where $\alpha_0$ is a scalar constant. This is a model analogous to the stochastic volatility model of financial econometrics of Taylor (1986), extensively developed and applied since. We will develop asymptotic theory of parameter estimates as $n \to \infty$, in which case all non-diagonal elements of $W$ (as well as its dimension) can vary as $n$ increases, especially as some normalization restriction is typically placed on $W$ (see Assumption 4.2). In this case, the $\zeta_i$, and correspondingly the $x_i$, form triangular arrays. However, as is common, for notational convenience we suppress reference to this.

When $\rho_0 = 0$, the $x_i$ are clearly independent. For $\rho_0 \ne 0$ spatial independence is lost, though there is still no spatial correlation, as we now demonstrate. For the purpose of the immediately following argument assume also that the $\varepsilon_i$ have a moment generating function; thus so also do the $\zeta_i$. We have
$$E(x_i) = 0, \tag{2.6}$$
$$E\big(x_i^2\big) = \sigma_\eta^2 e^{2\alpha_0} E\big(e^{2\zeta_i}\big) < \infty \tag{2.7}$$
and, for $i \ne j$,
$$E(x_i x_j) = 0, \tag{2.8}$$
so that the $x_i$ are uncorrelated. Thus, they exhibit no spatial correlation. However, consider now
$$x_i^2 = \eta_i^2 e^{2\alpha_0 + 2\zeta_i}. \tag{2.9}$$
Then, for $i \ne j$, using also (2.7),
$$\mathrm{Cov}\big(x_i^2, x_j^2\big) = \sigma_\eta^4 e^{4\alpha_0}\big\{E\big(e^{2(\zeta_i+\zeta_j)}\big) - E\big(e^{2\zeta_i}\big)E\big(e^{2\zeta_j}\big)\big\}. \tag{2.10}$$
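The uncorrelated-but-dependent behaviour in (2.6)–(2.10) is easy to reproduce by simulation. The sketch below is ours and purely illustrative: the circulant nearest-neighbour $W$, $\rho_0 = 0.5$ and Gaussian $\eta_i$, $\varepsilon_i$ are assumptions, and dependence is exhibited through the log-squares (the transformation used in Section 3) rather than the squares themselves, whose heavy tails make sample correlations noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho0, alpha0, reps = 20, 0.5, 0.0, 40000

# Circulant nearest-neighbour weight matrix with row sums normalized to 1 (an assumption)
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5

T0 = np.linalg.inv(np.eye(n) - rho0 * W)   # T0 = S(rho0)^{-1}, so zeta = T0 eps solves (2.4)

eps = rng.standard_normal((reps, n))       # one row per Monte Carlo replication
eta = rng.standard_normal((reps, n))
zeta = eps @ T0.T
x = eta * np.exp(alpha0 + zeta)            # observations (2.5)

r_levels = np.corrcoef(x[:, 0], x[:, 1])[0, 1]                        # adjacent sites, levels
r_logsq = np.corrcoef(np.log(x[:, 0]**2), np.log(x[:, 1]**2))[0, 1]   # adjacent sites, log-squares
print(r_levels, r_logsq)
```

Across replications the levels are essentially uncorrelated, while the log-squares at adjacent sites show clearly positive correlation, exactly the pattern (2.8) and (2.10) predict.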
For $\rho_0 \ne 0$, the $\zeta_i$ are not independent, so the expression in braces is non-zero, and thus the $x_i^2$ do exhibit spatial correlation. Of course, other non-linear functions of $x_i$, such as $|x_i|^\lambda$ for any $\lambda > 0$, will also do so, but for simplicity we focus on squares. Parametric functional expressions are available on making distributional assumptions on the $\varepsilon_i$. For example, suppose they are Gaussian. From (2.4), we can write
$$\zeta_i = t_i'\varepsilon, \tag{2.11}$$
where $t_i'$ is the $i$th row of
$$T_0 = S_0^{-1}. \tag{2.12}$$
Thus,
$$E\big(e^{2\zeta_i}\big) = e^{2\|t_i\|^2\sigma_\varepsilon^2} \tag{2.13}$$
and for $i \ne j$
$$E\big(e^{2(\zeta_i+\zeta_j)}\big) = E\big(e^{2(t_i+t_j)'\varepsilon}\big) = e^{2\|t_i+t_j\|^2\sigma_\varepsilon^2}, \tag{2.14}$$
where for any real matrix $A$, $\|A\|$ denotes the square root of the largest eigenvalue of $A'A$. Thus,
$$\mathrm{Cov}\big(x_i^2, x_j^2\big) = \sigma_\eta^4\, e^{4\alpha_0 + 2\|t_i\|^2\sigma_\varepsilon^2 + 2\|t_j\|^2\sigma_\varepsilon^2}\big(e^{4 t_i' t_j \sigma_\varepsilon^2} - 1\big), \quad i \ne j. \tag{2.15}$$
Note that when $\rho_0 = 0$ the elements of $t_i$ are all zero except for the $i$th, so indeed (2.15) then reduces to zero, but otherwise it is generally non-zero.
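Since $\|t_i + t_j\|^2 = \|t_i\|^2 + 2 t_i' t_j + \|t_j\|^2$, expression (2.15) can be checked numerically against the covariance computed directly from (2.13)–(2.14). The circulant, row-normalized $W$ below is our illustrative choice.

```python
import numpy as np

n, rho0, sigma_eps2, alpha0, sigma_eta2 = 12, 0.4, 1.0, 0.0, 1.0

# Circulant nearest-neighbour W with row sums 1 (an assumption for illustration)
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5

T0 = np.linalg.inv(np.eye(n) - rho0 * W)
i, j = 0, 1
ti, tj = T0[i], T0[j]                      # t_i', t_j': rows of T0

# Direct computation from (2.13)-(2.14): Cov = E e^{2(z_i+z_j)} - E e^{2 z_i} E e^{2 z_j}
direct = sigma_eta2**2 * np.exp(4 * alpha0) * (
    np.exp(2 * np.dot(ti + tj, ti + tj) * sigma_eps2)
    - np.exp(2 * np.dot(ti, ti) * sigma_eps2) * np.exp(2 * np.dot(tj, tj) * sigma_eps2))

# Closed form (2.15)
closed = sigma_eta2**2 * np.exp(
    4 * alpha0 + 2 * np.dot(ti, ti) * sigma_eps2 + 2 * np.dot(tj, tj) * sigma_eps2
) * (np.exp(4 * np.dot(ti, tj) * sigma_eps2) - 1.0)

print(direct, closed)  # equal, by expanding ||t_i + t_j||^2
```

The two numbers agree to machine precision, and are strictly positive here because $t_i' t_j > 0$ for adjacent sites under this $W$.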
3. PSEUDO-MAXIMUM LIKELIHOOD ESTIMATION

Though $W$ is chosen by the practitioner, the parameters $\alpha_0$, $\rho_0$, $\sigma_\eta^2$ and $\sigma_\varepsilon^2$ are generally unknown. Given further distributional assumptions, they can be estimated by maximum likelihood, but this is a computationally onerous procedure, and asymptotic statistical properties are difficult to derive. Instead, we consider a Gaussian pseudo-likelihood procedure based on logs. Denote
$$y_i = \log x_i^2, \qquad \beta_0 = E\big(\log \eta_i^2\big), \qquad \nu_i = \log \eta_i^2 - \beta_0. \tag{3.1}$$
We deduce from (2.9)
$$y_i = 2\alpha_0 + \beta_0 + \nu_i + 2\zeta_i. \tag{3.2}$$
Define also
$$\mu_0 = 2\alpha_0 + \beta_0, \tag{3.3}$$
$$\xi_i = \nu_i + 2\zeta_i, \tag{3.4}$$
and write
$$y_i = \mu_0 + \xi_i, \tag{3.5}$$
or in vector form
$$y = \mu_0 \ell + \xi, \tag{3.6}$$
where
$$\xi = (\xi_1, \ldots, \xi_n)', \qquad y = (y_1, \ldots, y_n)' \tag{3.7}$$
and $\ell$ is the $n \times 1$ vector of 1's. We could rewrite (3.6) as
$$S(\rho_0)\, y = \mu_0 S(\rho_0)\ell + S(\rho_0)\nu + 2\varepsilon, \tag{3.8}$$
where $\nu = (\nu_1, \ldots, \nu_n)'$, so that the $y_i$ follow a kind of constrained spatial autoregressive moving average. Note that if the row sums of $W$ are normalized, the intercept term in (3.8) becomes $\mu_0(1-\rho_0)\ell$.
The $y_i$ have mean $\mu_0$, and $y$ has covariance matrix
$$\sigma_0^2 I + 4\sigma_\varepsilon^2 T_0 T_0', \tag{3.9}$$
where $\sigma_0^2 = \mathrm{Var}(\nu_i)$. Though $\xi$ in general is non-Gaussian, we will apply Gaussian estimation procedures to (3.6). This means that parameters must be identifiable from first and second moments of the $y_i$. We can only identify $\mu_0$ from $E(y_i)$. Also, (3.9) reduces to $(\sigma_0^2 + 4\sigma_\varepsilon^2)I$ when $\rho_0 = 0$, whence we cannot identify both $\sigma_0^2$ and $\sigma_\varepsilon^2$. Though our work is motivated by the possible presence of spatial dependence, there is interest also in testing for spatial independence, i.e. $\rho_0 = 0$, so we restrict to a parsimonious version of the model, in which we constrain $\sigma_\varepsilon^2 = 1$. Since $T_0$ depends only on the parameter $\rho_0$, we are left with three unknown parameters, summarized in the vector
$$\theta_0 = \big(\mu_0, \sigma_0^2, \rho_0\big)'. \tag{3.10}$$
Let $\mu$, $\sigma^2$, $\rho$ be any admissible values of $\mu_0$, $\sigma_0^2$, $\rho_0$, and define
$$\theta = (\mu, \sigma^2, \rho)', \tag{3.11}$$
$$T(\rho) = S(\rho)^{-1}, \tag{3.12}$$
$$\Sigma(\sigma^2, \rho) = \sigma^2 I + 4\, T(\rho)T(\rho)'. \tag{3.13}$$
The Gaussian pseudo-maximum likelihood estimate (PMLE) of $\theta_0$ is defined as
$$\hat\theta = \arg\min_{\theta\in\Theta} Q(\theta), \tag{3.14}$$
where
$$Q(\theta) = \frac{1}{n}\log\det\Sigma(\sigma^2,\rho) + \frac{1}{n}(y-\mu\ell)'\,\Sigma(\sigma^2,\rho)^{-1}(y-\mu\ell), \tag{3.15}$$
and $\Theta$ is a compact subset of $\mathbb{R}\times(0,\infty)\times(-1,1)$, in particular
$$\Theta = \Theta_\mu \times \Theta_{\sigma^2} \times \Theta_\rho, \tag{3.16}$$
where
$$\Theta_\mu = [c_1, c_2], \qquad \Theta_{\sigma^2} = [c_3, c_4], \qquad \Theta_\rho = [c_5, c_6], \tag{3.17}$$
with $-\infty < c_1 < c_2 < \infty$, $0 < c_3 < c_4 < \infty$, $-1 < c_5 < c_6 < 1$.
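For concreteness, the objective (3.15) is straightforward to code. The sketch below is ours, taking $\Sigma(\sigma^2,\rho) = \sigma^2 I + 4\,T(\rho)T(\rho)'$, the factor 4 reflecting the $2\varepsilon$ in (3.8) under the constraint $\sigma_\varepsilon^2 = 1$; at $\rho = 0$ the closed form $Q = \log(\sigma^2+4) + \|y-\mu\ell\|^2/\{n(\sigma^2+4)\}$ provides a check.

```python
import numpy as np

def Sigma(sigma2, rho, W):
    """Model covariance (3.13): sigma^2 I + 4 T(rho) T(rho)'."""
    n = W.shape[0]
    T = np.linalg.inv(np.eye(n) - rho * W)   # T(rho) = S(rho)^{-1}
    return sigma2 * np.eye(n) + 4.0 * T @ T.T

def Q(theta, y, W):
    """Gaussian pseudo-log-likelihood objective (3.15)."""
    mu, sigma2, rho = theta
    n = len(y)
    Sig = Sigma(sigma2, rho, W)
    _, logdet = np.linalg.slogdet(Sig)       # stable log-determinant
    resid = y - mu                           # y - mu * ell
    return logdet / n + resid @ np.linalg.solve(Sig, resid) / n

# Check against the closed form at rho = 0, where Sigma = (sigma^2 + 4) I:
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
W = np.zeros((5, 5))                         # irrelevant at rho = 0, since T(0) = I
q0 = Q((1.0, 2.0, 0.0), y, W)
print(q0 - (np.log(6.0) + 0.5))              # difference should be ~0
```

The PMLE $\hat\theta$ of (3.14) then minimizes $Q$ over the compact box (3.16)–(3.17) with any bounded optimizer.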
4. CONSISTENCY OF ESTIMATES

We introduce the following assumptions.

ASSUMPTION 4.1. The $\varepsilon_i$, $\eta_i$ are i.i.d. with zero means, $\varepsilon_i$ is independent of $\eta_j$ for all $i, j$, and $E(\varepsilon_i^4) < \infty$, $E(\nu_i^4) < \infty$.

The identity-of-distribution aspect, and indeed the independence, can be somewhat relaxed, but we opt for simplicity.

ASSUMPTION 4.2. For all $n$, $\|W\| \le 1$.
The spatial autoregression literature imposes various conditions on $W$. One is that the row sums of $W$ are normalized to 1, which implies that 1 is an eigenvalue of $WW'$ (and of $W$ if $W$ is symmetric), so that Assumption 4.2 requires that there be no other eigenvalue that is larger in absolute value. When all the elements of $W$ are non-negative, 1 is then also the maximum row sum norm of $W$ (see Horn and Johnson, 1988, p. 295). In a sense, Assumption 4.2 is costless because some normalization is necessary in order to identify $\rho_0$, and indeed $\|W\| = 1$ not only achieves this but is also natural from a stability perspective because then
$$\|T(\rho)\| \le \sum_{j=0}^\infty |\rho|^j\,\|W\|^j \le \sum_{j=0}^\infty |\rho|^j = (1-|\rho|)^{-1}, \tag{4.1}$$
which is finite for all $\rho\in(-1,1)$. It is possible to impose more general conditions on $W$, such as ones on $T(\rho)$ that are uniform in $\rho$ (see e.g. Lee, 2004), but we prefer in this respect to separate requirements on $W$ from other aspects.

Define $H = \Sigma(\sigma^2,\rho)^{-\frac12}\,\Sigma(\sigma_0^2,\rho_0)\,\Sigma(\sigma^2,\rho)^{-\frac12}$ (where we employ the positive definite square root), and then
$$r(\sigma^2,\rho) = \frac{1}{n}\mathrm{tr}\big\{\Sigma(\sigma^2,\rho)^{-1}\Sigma\big(\sigma_0^2,\rho_0\big)\big\} - \frac{1}{n}\log\det\big\{\Sigma(\sigma^2,\rho)^{-1}\Sigma\big(\sigma_0^2,\rho_0\big)\big\} - 1$$
$$= \frac{1}{n}\mathrm{tr}\{H\} - \frac{1}{n}\log\det\{H\} - 1 = \frac{1}{n}\sum_{j=1}^n (\lambda_j - \log\lambda_j - 1), \tag{4.2}$$
where the $\lambda_j$ are the eigenvalues of $H$.

ASSUMPTION 4.3. For any $\delta > 0$,
$$\lim_{n\to\infty}\; \inf_{\{\|(\sigma^2,\rho)-(\sigma_0^2,\rho_0)\| > \delta\}\cap\{\Theta_{\sigma^2}\times\Theta_\rho\}} r(\sigma^2,\rho) > 0. \tag{4.3}$$

Because $H$ is positive definite the $\lambda_j$, $j = 1, \ldots, n$, are positive, and for all $j$ the $j$th summand in (4.2) is non-negative, and positive when $\lambda_j \ne 1$. Of course, $\lambda_j = 1$ for all $j$ only when $(\sigma^2,\rho) = (\sigma_0^2,\rho_0)$, so that Assumption 4.3 is an identifiability condition. It seems difficult in general to reduce it to something more comprehensible (see also the identifiability assumption used by Lee (2004) in his asymptotic theory for the Gaussian PMLE of spatial autoregression).

ASSUMPTION 4.4. $\theta_0 \in \Theta$.

THEOREM 4.1. Let Assumptions 4.1–4.4 hold. Then
$$\hat\theta \to_p \theta_0, \quad \text{as } n\to\infty. \tag{4.4}$$
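The geometric-series bound (4.1) underlying Assumption 4.2 is easy to verify numerically. The circulant $W$ below is our illustrative choice (symmetric, row sums 1, so $\|W\| = 1$); for this $W$ the bound is in fact attained.

```python
import numpy as np

n = 30
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5   # row sums normalized to 1

# ||W|| = 1: largest singular value of this symmetric W is 1
assert abs(np.linalg.norm(W, 2) - 1.0) < 1e-10

for rho in (-0.9, -0.5, 0.0, 0.3, 0.8):
    T = np.linalg.inv(np.eye(n) - rho * W)        # T(rho) = S(rho)^{-1}
    bound = 1.0 / (1.0 - abs(rho))                # (1 - |rho|)^{-1}, as in (4.1)
    assert np.linalg.norm(T, 2) <= bound + 1e-10
    print(rho, np.linalg.norm(T, 2), bound)
```

Since $W$ here has eigenvalues filling $[-1, 1]$, $\|T(\rho)\| = (1-|\rho|)^{-1}$ exactly, confirming that (4.1) is sharp under $\|W\| = 1$.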
5. ASYMPTOTIC NORMALITY OF ESTIMATES

Using the consistency just established, and additional conditions, we proceed to establish asymptotic normality of the Gaussian PMLE.
Define the $3\times 3$ symmetric matrices $A$ and $B$ as follows. Write $\Sigma_0 = \Sigma(\sigma_0^2,\rho_0)$. The $(i,j)$th element of $A$ is $a_{ij}$, where $a_{12} = a_{13} = 0$ and
$$a_{11} = 2\lim_{n\to\infty} \frac{1}{n}\,\ell'\Sigma_0^{-1}\ell, \tag{5.1}$$
$$a_{22} = \lim_{n\to\infty} \frac{1}{n}\,\mathrm{tr}\big\{\Sigma_0^{-2}\big\}, \tag{5.2}$$
$$a_{23} = 8\lim_{n\to\infty} \frac{1}{n}\,\mathrm{tr}\big\{\Sigma_0^{-2}\, T_0 W T_0 T_0'\big\}, \tag{5.3}$$
$$a_{33} = 32\lim_{n\to\infty} \frac{1}{n}\,\mathrm{tr}\big\{\Sigma_0^{-1}\, T_0 W T_0 T_0'\,\Sigma_0^{-1}\big(T_0 W T_0 T_0' + T_0 T_0' W' T_0'\big)\big\}, \tag{5.4}$$
and the $(i,j)$th element of $B$ is $b_{ij}$, where $b_{11} = 2a_{11}$ and
$$b_{12} = 2\lim_{n\to\infty} \frac{1}{n}\,E\big[\ell'\Sigma_0^{-1}\xi\;\mathrm{tr}\big\{(\xi\xi' - \Sigma_0)\Sigma_0^{-2}\big\}\big], \tag{5.5}$$
$$b_{13} = 16\lim_{n\to\infty} \frac{1}{n}\,E\big[\ell'\Sigma_0^{-1}\xi\;\mathrm{tr}\big\{(\xi\xi' - \Sigma_0)\Sigma_0^{-1}\, T_0 W T_0 T_0'\,\Sigma_0^{-1}\big\}\big], \tag{5.6}$$
$$b_{22} = \lim_{n\to\infty} \frac{1}{n}\,E\big[\mathrm{tr}^2\big\{(\xi\xi' - \Sigma_0)\Sigma_0^{-2}\big\}\big], \tag{5.7}$$
$$b_{23} = 8\lim_{n\to\infty} \frac{1}{n}\,E\big[\mathrm{tr}\big\{(\xi\xi' - \Sigma_0)\Sigma_0^{-2}\big\}\;\mathrm{tr}\big\{(\xi\xi' - \Sigma_0)\Sigma_0^{-1}\, T_0 W T_0 T_0'\,\Sigma_0^{-1}\big\}\big], \tag{5.8}$$
$$b_{33} = 64\lim_{n\to\infty} \frac{1}{n}\,E\big[\mathrm{tr}^2\big\{(\xi\xi' - \Sigma_0)\Sigma_0^{-1}\, T_0 W T_0 T_0'\,\Sigma_0^{-1}\big\}\big], \tag{5.9}$$
where we assume the following.

ASSUMPTION 5.1. The matrices $A$ and $B$ exist, and are finite and non-singular.

We also impose standard additional conditions for a central limit theorem.

ASSUMPTION 5.2. $\theta_0$ is an interior point of $\Theta$.

ASSUMPTION 5.3. For some $\delta > 0$,
$$E\big(|\varepsilon_i|^{4+\delta} + |\nu_i|^{4+\delta}\big) < \infty. \tag{5.10}$$

THEOREM 5.1. Let Assumptions 4.1–4.3 and 5.1–5.3 hold. Then as $n\to\infty$,
$$n^{\frac12}\big(\hat\theta - \theta_0\big) \to_d N\big(0,\; A^{-1}BA^{-1}\big).$$
6. TESTING FOR SPATIAL INDEPENDENCE

Theorem 5.1 can be applied to set confidence regions for $\theta_0$ or its individual elements, but it is also a basis for testing hypotheses. One of leading interest is
$$H_0: \rho_0 = 0, \tag{6.1}$$
which in the setting of our model is equivalent to independence of the $x_i$. We present first a result which is largely, but not strictly, a corollary of Theorem 5.1. In this connection, we introduce the following assumption.

ASSUMPTION 6.1.
$$\mathrm{tr}\{W(W+W')\} \to \infty, \quad \text{as } n\to\infty. \tag{6.2}$$

Notice that Assumption 4.3 would require, under $H_0$, that $\mathrm{tr}\{W(W+W')\}$ increase at rate $n$. We could indeed have relaxed conditions for Theorem 5.1 to permit a slower rate, which would have been reflected in the convergence rate of $\hat\rho$ (see also Lee, 2004). Assumption 4.2 implies that $\mathrm{tr}\{W(W+W')\} = O(n)$, and thus that no faster rate would be possible. For notational convenience, define also
$$\sigma_\xi^2 = \sigma_0^2 + 4, \tag{6.3}$$
which is the variance of $\xi_i$ under $H_0$.

THEOREM 6.1. Let Assumptions 4.1, 4.2, 5.2 and 6.1 hold. Then under $H_0$,
$$n^{\frac12}(\hat\mu - \mu_0)/\sigma_\xi, \qquad n^{\frac12}\big(\hat\sigma^2 - \sigma_0^2\big)\big/\big\{E\big(\xi_1^4\big) - \sigma_\xi^4\big\}^{\frac12}, \qquad 4\,\mathrm{tr}^{\frac12}\{W(W+W')\}\,\hat\rho/\sigma_\xi^2 \tag{6.4}$$
converge in distribution as $n\to\infty$ to independent standard normal variates.

Theorem 6.1 motivates the test statistic
$$s_1 = \frac{4\,\mathrm{tr}^{\frac12}\{W(W+W')\}\,\hat\rho}{\hat\sigma^2 + 4}. \tag{6.5}$$
A simpler one is
$$s_2 = \frac{\tilde\xi'\, W\, \tilde\xi}{\tilde\sigma_\xi^2\,\mathrm{tr}^{\frac12}\{W(W+W')\}}, \tag{6.6}$$
where
$$\tilde\xi = \big(\tilde\xi_1, \ldots, \tilde\xi_n\big)', \qquad \tilde\sigma_\xi^2 = \frac{1}{n}\tilde\xi'\tilde\xi, \qquad \tilde\xi_i = y_i - \tilde\mu, \qquad \tilde\mu = n^{-1}\sum_{i=1}^n y_i. \tag{6.7}$$
Of course, (6.6) is merely a standard test statistic for lack of spatial correlation, but applied to the $\tilde\xi_i$ (see Moran, 1950, Pinkse, 1999, and also Robinson, 2009, for a more general class).

THEOREM 6.2. Let Assumptions 4.1, 4.2, 5.2 and 6.1 hold. Then under $H_0$, $s_1$ and $s_2$ both converge in distribution as $n\to\infty$ to standard normal variates.

Both $s_1$ and $s_2$ can be used in one- or two-sided tests based on standard normal critical regions. A Pitman argument indicates that tests that reject for large positive (negative) values of
$s_1$/$s_2$ have power against local, at rate $n^{-\frac12}$, positive (negative) alternatives to $H_0$. We can think of $s_1^2$ and $s_2^2$ as pseudo-Wald and pseudo-score statistics, respectively. A pseudo-log-likelihood-ratio test can also be developed, but for brevity, and as there is no one-sided version of it, we omit the details.
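The statistic $s_2$ requires nothing beyond the demeaned $y_i$ and $W$. The sketch below (ours, with a circulant illustrative $W$) computes it and checks its approximate standard normality under $H_0$ by Monte Carlo; a small negative finite-sample mean arises from the demeaning in (6.7).

```python
import numpy as np

def s2_statistic(y, W):
    """Moran-type statistic s2 of (6.6), applied to the demeaned y_i."""
    n = len(y)
    xi = y - y.mean()                       # xi_tilde_i = y_i - mu_tilde
    sigma2_xi = xi @ xi / n                 # sigma_tilde_xi^2, (6.7)
    return (xi @ W @ xi) / (sigma2_xi * np.sqrt(np.trace(W @ (W + W.T))))

# Null behaviour: with i.i.d. y (rho_0 = 0), s2 should be roughly N(0, 1).
rng = np.random.default_rng(1)
n, reps = 50, 2000
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5

stats = np.array([s2_statistic(rng.standard_normal(n), W) for _ in range(reps)])
print(stats.mean(), stats.std())
```

Across replications the simulated null distribution has standard deviation close to 1, consistent with Theorem 6.2; one- or two-sided tests then use standard normal critical values.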
7. CONCLUDING REMARKS

We have established consistency and asymptotic normality of parameter estimates of a simple model that can explain spatial dependence in observations $x_i$ in the absence of spatial correlation in the $x_i$, and have also presented related asymptotically justified tests for spatial dependence. One straightforward extension of the model would allow spatial correlation of observables, and perhaps also include explanatory variables, where we would test spatially uncorrelated inputs for spatial independence. It would also be worth examining both higher-order asymptotic properties and finite-sample properties of our various statistics. Higher-order asymptotics should be possible at least under Gaussian assumptions on $\xi_i$ and $\nu_i$ but presents a substantial additional challenge. There seems to be no higher-order asymptotic theory yet for even most basic statistics based on spatial weight matrices, and in the general statistical literature there is relatively little work covering implicitly defined estimates. Some finite-sample theory would be possible for $s_2$ under $H_0$ and Gaussianity of $\xi_i$ and $\nu_i$ because it is then merely a ratio of quadratic forms of independent Gaussian variates. On a more mundane level, Monte Carlo simulations can also provide information about finite-sample properties, but given the limited nature of proposals for modelling and inference of spatial dependence without spatial correlation, and for testing for spatial dependence as distinct from spatial correlation, it would be appropriate first to develop some further models, estimates and tests, with which ours can be compared. Other parametric models, including spatial moving averages, and spatial autoregressive moving averages, can be considered for our $\xi$, and along with the tests for independence suggested by such models there is considerable scope for developing non-parametric tests for independence in addition to those of Brett and Pinkse (1997).
Bearing in mind the range of non-parametric independence tests available for time series data, there are clearly many possibilities in spatial settings.
ACKNOWLEDGMENT

This research was supported by ESRC Grant RES-062-23-0036.
REFERENCES

Brett, C. and J. Pinkse (1997). These taxes are all over the map! A test for spatial independence of municipal tax rates in British Columbia. International Regional Science Review 20, 131–51.
Cliff, A. D. and J. K. Ord (1973). Spatial Autocorrelation. London: Pion.
Horn, R. A. and C. R. Johnson (1988). Matrix Analysis. Cambridge: Cambridge University Press.
Lee, L.-F. (2004). Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72, 1899–925.
Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika 37, 17–23.
Large-sample inference on spatial dependence
Pinkse, J. (1999). Asymptotic Properties of Moran and Related Tests and Testing for Spatial Correlation in Probit Models. Working paper, University of British Columbia.
Robinson, P. M. (2009). Correlation testing in time series, spatial and cross-sectional data. Forthcoming in Journal of Econometrics.
Taylor, S. (1986). Modelling Financial Time Series. New York: John Wiley.
APPENDIX: PROOFS OF THEOREMS

Proof of Theorem 4.1: Throughout, Σ(σ², ρ) = σ²Iₙ + 4T(ρ)T(ρ)' denotes the covariance matrix of ξ and l the n × 1 vector of ones. We have

Q(θ) = (1/n) log det Σ(σ², ρ) + (1/n)(μ − μ₀)² l'Σ(σ², ρ)⁻¹l − (2/n)(μ − μ₀) l'Σ(σ², ρ)⁻¹ξ + (1/n) ξ'Σ(σ², ρ)⁻¹ξ.   (A.1)

Then write

Q(θ) − Q(θ₀) = u(θ) + v(θ),   (A.2)

where

u(θ) = (2/n)(μ₀ − μ) l'Σ(σ², ρ)⁻¹ξ + (1/n) tr[{Σ(σ², ρ)⁻¹ − Σ(σ₀², ρ₀)⁻¹}{ξξ' − Σ(σ₀², ρ₀)}],   (A.3)

v(θ) = (1/n)(μ − μ₀)² l'Σ(σ², ρ)⁻¹l + w(σ², ρ).   (A.4)

By Assumption 4.4 and a standard kind of argument for consistency of implicitly defined extremum estimates, it thus suffices to show that

sup_{θ∈Θ} |u(θ)| →p 0, as n → ∞,   (A.5)

and, for all δ_θ > 0,

lim inf_{n→∞} inf_{{θ: ‖θ−θ₀‖>δ_θ}∩Θ} v(θ) > 0.   (A.6)

To prove (A.5), we first consider the contribution to u(θ) from the first term in (A.3). This is uniformly op(1) if

(1/n) l'Σ(σ², ρ)⁻¹ξ →p 0, uniformly in Θ.   (A.7)

We first show pointwise convergence, for any θ ∈ Θ. The left-hand side of (A.7) has mean zero and variance

n⁻² l'Σ(σ², ρ)⁻¹ Σ(σ₀², ρ₀) Σ(σ², ρ)⁻¹ l ≤ n⁻¹ ‖Σ(σ₀², ρ₀)‖/σ⁴ ≤ Cn⁻¹{σ₀² + ‖T(ρ₀)‖²}/σ⁴.   (A.8)

But from (4.1)

‖T(ρ₀)‖ ≤ C,   (A.9)
P. M. Robinson
where C denotes throughout a generic finite constant. Thus pointwise convergence is established. The uniform convergence follows from an equicontinuity argument, as follows. Consider a neighbourhood N of any (σ∗², ρ∗) such that N ⊂ Θ_{σ²} × Θ_ρ. We have

sup_{(σ²,ρ)∈N} |n⁻¹ l'{Σ(σ², ρ)⁻¹ − Σ(σ∗², ρ∗)⁻¹}ξ| ≤ (ξ'ξ/n)¹ᐟ² sup_{(σ²,ρ)∈N} [n⁻¹ l'{Σ(σ², ρ)⁻¹ − Σ(σ∗², ρ∗)⁻¹}² l]¹ᐟ².   (A.10)

Now E(ξ'ξ)/n = tr{Σ(σ₀², ρ₀)}/n ≤ C, whereas the expression in braces is bounded by

‖Σ(σ², ρ)⁻¹‖² ‖Σ(σ∗², ρ∗)⁻¹‖² ‖Σ(σ², ρ) − Σ(σ∗², ρ∗)‖².   (A.11)

The first two factors are bounded uniformly on N, while the last one is

‖(σ² − σ∗²)I + 4(TT' − T∗T∗')‖²,   (A.12)

where T∗ = T(ρ∗). Now, with S = S(ρ), S∗ = S(ρ∗),

TT' − T∗T∗' = T∗T∗'(S∗'S∗ − S'S)TT',   (A.13)

where

S∗'S∗ − S'S = (I − ρ∗W)'(I − ρ∗W) − (I − ρW)'(I − ρW) = (ρ − ρ∗)(W + W') + (ρ∗² − ρ²)W'W.   (A.14)

Then from (4.1), (A.12) is bounded by

C(σ² − σ∗²)² + C(ρ − ρ∗)².   (A.15)

This can be made arbitrarily small uniformly on N by choosing N small enough. Since any open cover of Θ_{σ²} × Θ_ρ has a finite subcover, the proof of (A.7) is completed. The second term in u(θ) can be dealt with in a similar way, using the fourth moment conditions in Assumption 4.1. We omit the details. Now looking at v(θ), in view of Assumption 4.3 it suffices to show that, for any δ_μ > 0,

lim inf_{n→∞} inf_{|μ−μ₀|>δ_μ, σ²∈Θ_{σ²}, ρ∈Θ_ρ} (μ − μ₀)² n⁻¹ l'Σ(σ², ρ)⁻¹l > 0.   (A.16)

But

n⁻¹ l'Σ(σ², ρ)⁻¹l ≥ {n⁻¹ l'Σ(σ², ρ)l}⁻¹ = {σ² + 4 l'TT'l/n}⁻¹ ≥ {σ² + 4‖T‖²}⁻¹ ≥ {σ² + 4(1 − |ρ|)⁻²}⁻¹ ≥ {c₄ + 4(1 − max(c₅, c₆))⁻²}⁻¹ > 0,   (A.17)

so (A.16) is established, to complete the proof.
Proof of Theorem 5.1: By the mean value theorem,

0 = n¹ᐟ² ∂Q(θ̂)/∂θ = n¹ᐟ² ∂Q(θ₀)/∂θ + Ã n¹ᐟ² (θ̂ − θ₀),   (A.18)

where Ã is formed by evaluating each row of ∂²Q(θ)/∂θ∂θ' at (possibly different) θ̃⁽ⁱ⁾, i = 1, 2, 3, such that ‖θ̃⁽ⁱ⁾ − θ₀‖ ≤ ‖θ̂ − θ₀‖. To evaluate the derivatives, for notational convenience write Σ(σ², ρ) and Q(θ) as, respectively, Σ and Q. Note first that

∂Σ/∂σ² = I,  ∂²Σ/∂(σ²)² = 0,  ∂²Σ/∂ρ∂σ² = 0.   (A.19)

Noting also that

∂S/∂ρ = −W,  ∂T/∂ρ = TWT,   (A.20)

we have

∂Σ/∂ρ = 4T(WT + T'W')T',   (A.21)

∂²Σ/∂ρ² = 8T(WTT'W' + WTWT + T'W'T'W')T'.   (A.22)
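The ρ-derivatives above follow from the usual rule for differentiating a matrix inverse. A short derivation (assuming, as the factor 4 in (A.21) and the definition R = (σ₀Iₙ, 2T₀) used in the martingale-approximation step below suggest, that Σ(σ², ρ) = σ²I + 4T(ρ)T(ρ)' with T = S⁻¹ and S(ρ) = I − ρW):

```latex
\frac{\partial T}{\partial\rho}
  = -S^{-1}\,\frac{\partial S}{\partial\rho}\,S^{-1}
  = -T(-W)T = TWT,
\qquad
\frac{\partial\Sigma}{\partial\rho}
  = 4\left(\frac{\partial T}{\partial\rho}\,T' + T\,\frac{\partial T'}{\partial\rho}\right)
  = 4\,T\,(WT + T'W')\,T'.
```

Differentiating (A.21) once more in ρ and collecting terms gives (A.22).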
Then

∂Q/∂μ = −(2/n) l'Σ⁻¹(y − μl),   (A.23)

∂Q/∂σ² = (1/n) tr{Σ⁻¹} − (1/n)(y − μl)'Σ⁻²(y − μl),   (A.24)

∂Q/∂ρ = (8/n) tr{Σ⁻¹TWTT'}   (A.25)

  − (8/n)(y − μl)'Σ⁻¹TWTT'Σ⁻¹(y − μl),   (A.26)

∂²Q/∂μ² = (2/n) l'Σ⁻¹l,   (A.27)

∂²Q/∂μ∂σ² = (2/n) l'Σ⁻²(y − μl),   (A.28)

∂²Q/∂μ∂ρ = (8/n) l'Σ⁻¹T(WT + T'W')T'Σ⁻¹(y − μl),   (A.29)

∂²Q/∂(σ²)² = −(1/n) tr{Σ⁻²} + (2/n)(y − μl)'Σ⁻³(y − μl),   (A.30)
∂²Q/∂σ²∂ρ = −(8/n) tr{Σ⁻²TWTT'} + (16/n)(y − μl)'Σ⁻¹TWTT'Σ⁻²(y − μl),   (A.31)

∂²Q/∂ρ² = −(1/n) tr[Σ⁻¹{∂²Σ/∂ρ² − (∂Σ/∂ρ)Σ⁻¹(∂Σ/∂ρ)}Σ⁻¹{(y − μl)(y − μl)' − Σ}]
  + (1/n) tr{Σ⁻¹(∂Σ/∂ρ)Σ⁻¹(∂Σ/∂ρ)Σ⁻¹(y − μl)(y − μl)'},   (A.32)

where we omit the complicated expression for the latter in terms of W and T. From (A.23)–(A.25),

∂Q₀/∂μ = −(2/n) l'Σ₀⁻¹ξ,   (A.33)
∂Q₀/∂σ² = −(1/n) tr{(ξξ' − Σ₀)Σ₀⁻²},   (A.34)

∂Q₀/∂ρ = −(8/n) tr{(ξξ' − Σ₀)Σ₀⁻¹T₀WT₀T₀'Σ₀⁻¹}.   (A.35)

Then, we deduce that

lim_{n→∞} nE[(∂Q₀/∂θ)(∂Q₀/∂θ)'] = B.   (A.36)

Also evaluating the second derivative at θ₀ we deduce via (A.33)–(A.35)

∂²Q₀/∂θ∂θ' →p A, as n → ∞,   (A.37)

by establishing convergence in probability to zero of zero-mean quantities, via techniques as in the proof of Theorem 4.1. Then, using Theorem 4.1 it is straightforward to show also that

Ã − ∂²Q₀/∂θ∂θ' →p 0,   (A.38)

and hence, from (A.37), Ã →p A. It remains to show that

n¹ᐟ² ∂Q₀/∂θ →d N(0, B).   (A.39)

This follows if all suitably normalized linear combinations of the left-hand side of (A.39) are asymptotically standard normal. To achieve this, a linear combination is written as a sum of martingale differences, and a martingale central limit theorem is applied. Many of the details are standard and straightforward, and the aspect that most warrants attention pertains to (A.35), in view of its dependence on W, so we simply consider the asymptotic normality of n¹ᐟ² ∂Q₀/∂ρ. First denote P = Σ₀⁻¹T₀WT₀T₀'Σ₀⁻¹, so

n¹ᐟ² ∂Q₀/∂ρ = −(8/n¹ᐟ²) tr{(ξξ' − Σ₀)P}.   (A.40)
Then writing τ = (ν'/σ₀, ε')' and R = (σ₀Iₙ, 2T₀), we have ξ = Rτ, and thence (A.40) becomes

−(8/n¹ᐟ²) tr{(ττ' − I₂ₙ)M} = −(8/n¹ᐟ²) [ Σ_{i=1}^{2n} (τᵢ² − 1)mᵢᵢ + Σ_{i=1}^{2n} Σ_{j<i} τᵢτⱼ(mᵢⱼ + mⱼᵢ) ],   (A.41)

where τᵢ is the ith element of τ and mᵢⱼ is the (i, j)th element of M = R'PR. In view of Assumption 4.1, a martingale central limit theorem thus holds if Lyapunov-type conditions (A.42)–(A.43) on moments of order 2(1 + δ/2) of the martingale differences are satisfied. Using the Burkholder and von Bahr/Esseen inequalities, the expectation concerned is bounded by a constant times

n⁻¹ᐟ² maxᵢ |mᵢᵢ| + n⁻¹ maxᵢ Σⱼ (mᵢⱼ² + mⱼᵢ²) → 0.   (A.44)

Now

Σ_{j=1}^{2n} mᵢⱼ² = rᵢ'PRR'P'rᵢ,   (A.45)

where rᵢ is the ith column of R. Thus

maxᵢ Σ_{j=1}^{2n} mᵢⱼ² ≤ maxᵢ ‖rᵢ‖² ‖P‖² ‖R‖².   (A.46)

But

maxᵢ ‖rᵢ‖ ≤ ‖R‖ ≤ σ₀ + 2‖T₀‖ ≤ C,   (A.47)

‖P‖ ≤ C‖W‖ ≤ C,   (A.48)
whence (A.42) follows.

Proof of Theorem 6.1: It is straightforward to deduce under H₀ that

a₁₁ = 2/σ_ξ²,   (A.49)

a₂₂ = 1/σ_ξ⁴,   (A.50)

a₂₃ = (8/σ_ξ⁴) lim_{n→∞} n⁻¹ tr(W) = 0,   (A.51)

a₃₃ = (32/σ_ξ⁴) lim_{n→∞} n⁻¹ tr{W(W + W')},   (A.52)

b₁₁ = 4/σ_ξ²,   (A.53)

b₁₂ = lim_{n→∞} n⁻¹ Σ_{i=1}^n E(ξᵢ³)/σ_ξ⁴ = E(ξᵢ³)/σ_ξ⁴,   (A.54)

b₁₃ = −8 lim_{n→∞} n⁻¹ E{ (1/σ_ξ²) Σ_{i=1}^n ξᵢ · (1/σ_ξ²) Σ_{j=1}^n Σ_{k=1}^n ξⱼξₖ wⱼₖ } = 0,   (A.55)

b₂₂ = lim_{n→∞} n⁻¹ E{ (1/σ_ξ²) Σ_{i=1}^n (ξᵢ² − σ_ξ²) }² = {E(ξᵢ⁴) − σ_ξ⁴}/σ_ξ⁴,   (A.56)

b₂₃ = −8 lim_{n→∞} n⁻¹ E{ (1/σ_ξ²) Σ_{i=1}^n (ξᵢ² − σ_ξ²) · (1/σ_ξ²) Σ_{j=1}^n Σ_{k=1}^n ξⱼξₖ wⱼₖ } = 0,   (A.57)

b₃₃ = (64/σ_ξ⁸) lim_{n→∞} n⁻¹ E{ Σ_{j=1}^n Σ_{k=1}^n ξⱼξₖ wⱼₖ }² = (64/σ_ξ⁴) lim_{n→∞} n⁻¹ tr{W(W + W')},   (A.58)

where wⱼₖ is the (j, k)th element of W. Thus A and B are diagonal matrices. Then, the theorem is proved when n⁻¹ tr{W(W + W')} converges to a positive limit. But, if in fact tr{W(W + W')} = o(n), while Assumption 6.1 holds, then a modified proof leads to the statement of the present theorem. For brevity we omit the details.

Proof of Theorem 6.2: The result for s₁ follows directly from Theorems 4.1 and 6.1. The proof for s₂ proceeds by noting that ξ̃ᵢ = ξᵢ + (μ₀ − μ̃), and then using standard arguments with a simplified version of the martingale central limit arguments in the proof of Theorem 5.1.
The Econometrics Journal (2009), volume 12, pp. S83–S104. doi: 10.1111/j.1368-423X.2008.00270.x

Semiparametric cointegrating rank selection

XU CHENG† AND PETER C. B. PHILLIPS†,‡

†Department of Economics, Yale University, 28 Hillhouse Avenue, New Haven, CT 06511, USA
E-mail: [email protected]
‡University of York, University of Auckland and Singapore Management University
E-mail: [email protected]

First version received: January 2008; final version accepted: September 2008
Summary: Some convenient limit properties of usual information criteria are given for cointegrating rank selection. Allowing for a non-parametric short memory component and using a reduced rank regression with only a single lag, standard information criteria are shown to be weakly consistent in the choice of cointegrating rank provided the penalty coefficient Cn → ∞ and Cn/n → 0 as n → ∞. The limit distribution of the AIC criterion, which is inconsistent, is also obtained. The analysis provides a general limit theory for semiparametric reduced rank regression under weakly dependent errors. The method does not require the specification of a full model, is convenient for practical implementation in empirical work, and is sympathetic with semiparametric estimation approaches to co-integration analysis. Some simulation results on the finite sample performance of the criteria are reported.

Keywords: Cointegrating rank, Consistency, Information criteria, Model selection, Non-parametric, Short memory, Unit roots.
1. INTRODUCTION

Information criteria are now widely used in parametric settings for econometric model choice. The methods have been especially well studied in stationary systems. Models that allow for nonstationarity are particularly relevant in econometric work and have been considered by several authors, including Tsay (1984), Pötscher (1989), Wei (1992), Phillips and Ploberger (1996), Phillips (1996) and Nielsen (2006), among others. Model choice methods are heavily used in empirical work and in forecasting exercises, the most common applications involving choice of lag length in (vector) autoregression and variable choice in regression. The methods have also been suggested and used in the context of cointegrating rank choice, where they are known to be consistent under certain conditions, at least in parametric models (Chao and Phillips, 1999). This application is natural because cointegrating rank is an order parameter for which model selection methods are particularly well suited since there are only a finite number of possible choices. Furthermore, rank order may be combined with lag length and intercept and trend degree parameters to provide a wide compass of choice in parametric models that is convenient in practical work, as discussed in Phillips (1996).
When the focus is on co-integration and cointegrating rank selection, it is not necessary to build a complete model for statistical purposes. Indeed, many of the approaches that have been developed for econometric estimation and inference in such contexts are semiparametric in character, so that the model user can be agnostic regarding the short memory features of the data and concentrate on long-run behaviour. In such settings, it will often be desirable to perform the evaluation of cointegrating rank (or choice of the number of unit roots in a system) in a semiparametric context allowing for a general short memory component in the time series. The present paper has this goal and looks specifically at the issue of cointegrating rank choice by information criteria. In the case of a univariate series, this choice reduces to distinguishing unit root time series from stationary series. In such a context, it is known that information criteria provide consistent model choice in a semiparametric framework (Phillips, 2008). The contribution of this paper is to extend that work to the multivariate setting in the context of a semiparametric reduced rank regression of the form

ΔXt = αβ'Xt−1 + ut,  t ∈ {1, . . . , n},   (1.1)

where Xt is an m-vector time series, α and β are m × r₀ full rank matrices and ut is a weakly dependent stationary time series with zero mean and continuous spectral density matrix fu(λ). The series Xt is initialized at t = 0 by some (possibly random) quantity X₀ = Op(1), although other initialization assumptions may be considered, as in Phillips (2008). A secondary contribution of the paper that emerges from the analysis is to provide a limit theory for semiparametric reduced rank regressions of the form (1.1) under weakly dependent errors. This limit theory is useful in studying cases where reduced rank regressions are misspecified, possibly through the choice of inappropriate lag lengths in the vector autoregression or incorrect settings of the cointegrating rank. Under (1.1), the time series Xt is co-integrated with co-integration matrix β of rank r₀, so there are r₀ cointegrating relations in the true model. Of course, r₀ is generally unknown and our goal is to treat (1.1) semiparametrically with regard to ut and to estimate r₀ directly in (1.1) by information criteria. The procedure we consider is quite simple. Model (1.1) is estimated by conventional reduced rank regression (RRR) for all values of r = 0, 1, . . . , m just as if ut were a martingale difference, and r is chosen to optimize the corresponding information criteria as if (1.1) were a correctly specified parametric framework up to the order parameter r. Thus, no explicit account is taken of the weak dependence structure of ut in the process. Let Σ̂(r) be the residual covariance matrix from the RRR. The criterion used to evaluate cointegrating rank takes the simple form

IC(r) = log |Σ̂(r)| + Cn n⁻¹(2mr − r²),   (1.2)

with coefficient Cn = log n, 2 log log n, or 2 corresponding to the BIC (Akaike, 1977, Rissanen, 1978, and Schwarz, 1978), Hannan and Quinn (1979) and Akaike (1974) penalties, respectively. Sample information-based versions of the coefficient Cn may also be employed, such as those in Wei's (1992) FIC criterion and Phillips and Ploberger's (1996) PIC criterion. The BIC version of (1.2) was given in Phillips and McFarland (1997) and used to determine cointegrating rank in an exchange rate application. In (1.2) the degrees of freedom term 2mr − r² is calculated to account for the 2mr elements of the matrices α and β that have to be estimated, adjusted for the r² restrictions that are needed to ensure structural identification of β in reduced rank regression. The effects of other normalization schemes on the information criterion are discussed in the Appendix.
For each r = 0, 1, . . . , m, we estimate the m × r matrices α and β by reduced rank regression, denoted by α̂ and β̂, and, for use in (1.2), we form the corresponding residual variance matrices

Σ̂(r) = n⁻¹ Σ_{t=1}^n (ΔXt − α̂β̂'Xt−1)(ΔXt − α̂β̂'Xt−1)',  r = 1, . . . , m,

with Σ̂(0) = n⁻¹ Σ_{t=1}^n ΔXtΔXt'. Model evaluation based on IC(r) then leads to the cointegrating rank selection criterion

r̂ = arg min_{0≤r≤m} IC(r).

As shown below, the information criterion IC(r) is weakly consistent for selecting the cointegrating rank r₀ provided that the penalty term in (1.2) satisfies the weak requirements that Cn → ∞ and Cn/n → 0 as n → ∞. No minimum expansion rate for Cn such as log log n is required and no more complex parametric model needs to be estimated. The approach is therefore quite straightforward for practical implementation. The organization of the paper is as follows. Some preliminaries on estimation and notation are covered in Section 2. The main asymptotic results are given in Section 3. Section 4 briefly reports some simulation findings. Section 5 concludes and discusses some extensions. Proofs and other technical material are in the Appendix.
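The weakness of the Cn requirements can be illustrated numerically. The following small sketch (the helper name `ic_penalty` is ours, not the paper's) tabulates the penalty term Cn n⁻¹(2mr − r²) of (1.2) for the three coefficients discussed above:

```python
import numpy as np

def ic_penalty(r, m, n, Cn):
    """Penalty term C_n * (2mr - r^2) / n from criterion (1.2)."""
    return Cn * (2 * m * r - r ** 2) / n

# the three coefficients discussed in the text: BIC, Hannan-Quinn, AIC
n, m = 100, 2
coeffs = {"BIC": np.log(n), "HQ": 2 * np.log(np.log(n)), "AIC": 2.0}
for name, Cn in coeffs.items():
    print(name, [round(ic_penalty(r, m, n, Cn), 4) for r in range(m + 1)])
```

Only the BIC and HQ choices have Cn → ∞, which is what the consistency result requires; the AIC coefficient Cn = 2 is fixed, which is the source of its inconsistency discussed in Section 3.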
2. PRELIMINARIES

Reduced rank regression estimates of α and β in (1.1) are obtained ignoring any weak dependence error structure in ut. To analyze the asymptotic properties of the rank order estimates and the information criterion IC(r) under a general error structure, we start by investigating the asymptotic properties of the various regression components. Using conventional RRR notation, define

S₀₀ = n⁻¹ Σ_{t=1}^n ΔXtΔXt',  S₁₁ = n⁻¹ Σ_{t=1}^n Xt−1Xt−1',  S₀₁ = n⁻¹ Σ_{t=1}^n ΔXtXt−1',  and  S₁₀ = n⁻¹ Σ_{t=1}^n Xt−1ΔXt'.   (2.1)

For some given r and β, the estimate of α is obtained by regression as

α̂(β) = S₀₁β(β'S₁₁β)⁻¹.   (2.2)

Again, given r, the corresponding RRR estimate of β in (1.1) is an m × r matrix satisfying

β̂ = arg min_β |S₀₀ − S₀₁β(β'S₁₁β)⁻¹β'S₁₀|,   (2.3)

subject to the normalization

β̂'S₁₁β̂ = Ir.   (2.4)

The estimate β̂ is found in the usual way by first solving the determinantal equation

|λS₁₁ − S₁₀S₀₀⁻¹S₀₁| = 0   (2.5)
for the ordered eigenvalues 1 > λ̂₁ > · · · > λ̂m > 0 and corresponding eigenvectors V̂ = [v̂₁, . . . , v̂m], which are normalized by V̂'S₁₁V̂ = Im. Estimates of β and α are then obtained as

β̂ = [v̂₁, . . . , v̂r],  and  α̂ = α̂(β̂) = S₀₁β̂,   (2.6)

with β̂ formed from the eigenvectors of V̂ corresponding to the r largest roots of (2.5). The residuals from the RRR and the corresponding moment matrix of residuals that appear in the information criterion are

ût = ΔXt − α̂β̂'Xt−1,   (2.7)

Σ̂(r) = n⁻¹ Σ_{t=1}^n ûtût' = S₀₀ − S₀₁β̂β̂'S₁₀.   (2.8)

Using (2.8) we have (e.g. Theorem 6.1 of Johansen, 1995)

|Σ̂(r)| = |S₀₀| Π_{i=1}^r (1 − λ̂i),   (2.9)

where λ̂i, 1 ≤ i ≤ r, are the r largest solutions to (2.5). The criterion (1.2) is then well determined for any given value of r.
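The estimation steps (2.1)–(2.9) and the criterion (1.2) translate directly into code. The following is a minimal sketch and not the authors' implementation: the function name `select_rank` and the default BIC coefficient Cn = log n are our choices. It computes the eigenvalues of (2.5), evaluates log|Σ̂(r)| via (2.9), and minimizes IC(r):

```python
import numpy as np

def select_rank(X, Cn=None):
    """Choose cointegrating rank by minimizing IC(r) of (1.2).

    X  : (n+1) x m array of levels X_0, ..., X_n.
    Cn : penalty coefficient; defaults to log(n), the BIC choice.
    """
    dX = np.diff(X, axis=0)   # Delta X_t, t = 1, ..., n
    X1 = X[:-1]               # lagged levels X_{t-1}
    n, m = dX.shape
    if Cn is None:
        Cn = np.log(n)
    # sample moment matrices of (2.1)
    S00 = dX.T @ dX / n
    S11 = X1.T @ X1 / n
    S01 = dX.T @ X1 / n
    # ordered eigenvalues of the determinantal equation (2.5)
    lam = np.linalg.eigvals(
        np.linalg.solve(S11, S01.T @ np.linalg.solve(S00, S01)))
    lam = np.sort(lam.real)[::-1]
    # log|Sigma_hat(r)| = log|S00| + sum_{i<=r} log(1 - lam_i), by (2.9)
    logdet = np.linalg.slogdet(S00)[1]
    ic = [logdet + np.log(1 - lam[:r]).sum() + Cn * (2 * m * r - r ** 2) / n
          for r in range(m + 1)]
    return int(np.argmin(ic))
```

For instance, data generated from (1.1) with m = 2 and a single cointegrating relation should yield r̂ = 1 with high probability for moderate n, while a pure random walk should yield r̂ = 0.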
3. ASYMPTOTIC RESULTS

The following assumptions make specific the semiparametric and co-integration components of (1.1). Assumption LP is a standard linear process condition of the type that is convenient in developing partial sum limit theory. The condition can be relaxed to allow for martingale difference innovations and to allow for some mild heterogeneity in the innovations without disturbing the limit theory in a material way (see Phillips and Solo, 1992). Assumption RR gives conditions that are standard in the study of reduced rank regressions with some unit roots (Johansen, 1988, 1995, and Phillips, 1995).

ASSUMPTION LP. Let D(L) = Σ_{j=0}^∞ DjL^j, with D₀ = I and full rank D(1), and let ut have Wold representation

ut = D(L)εt = Σ_{j=0}^∞ Dj εt−j,  with  Σ_{j=0}^∞ j¹ᐟ² ||Dj|| < ∞,   (3.1)

for some matrix norm ||·|| and where εt is i.i.d. (0, Σε) with Σε > 0. We use the notation Σab(h) = E(at b't+h) and Λab = Σ_{h=1}^∞ Σab(h) for autocovariance matrices and one-sided long-run autocovariances, and set Ω = Σ_{h=−∞}^∞ Σuu(h) = D(1)ΣεD(1)' > 0 and Σε = E{εtεt'}.

ASSUMPTION RR. (a) The determinantal equation |Im − (Im + αβ')L| = 0 has roots on or outside the unit circle, i.e. |L| ≥ 1. (b) Set Π = Im + αβ', where α and β are m × r₀ matrices of full column rank r₀, 0 ≤ r₀ ≤ m. (If r₀ = 0 then Π = Im; if r₀ = m then β has full rank m and β'Xt, and hence Xt, are (asymptotically) stationary.) (c) The matrix R = Ir₀ + β'α has eigenvalues within the unit circle.
Assumption RR(c) ensures that the matrix β'α has full rank. Let α⊥ and β⊥ be orthogonal complements to α and β, so that [α, α⊥] and [β, β⊥] are non-singular and β⊥'β⊥ = Im−r₀. Then non-singularity of β'α implies the non-singularity of α⊥'β⊥. Under RR we have the Wold representation of β'Xt:

vt := β'Xt = Σ_{i=0}^∞ R^i β'ut−i = R(L)β'ut = R(L)β'D(L)εt,   (3.2)

and some further manipulations yield the following useful partial sum representation

Xt = C Σ_{s=1}^t us + α(β'α)⁻¹R(L)β'ut + CX₀,   (3.3)

where C = β⊥(α⊥'β⊥)⁻¹α⊥'. Expression (3.3) reduces to the Granger representation when ut is a martingale difference (e.g. Johansen, 1995). Under LP a functional law for partial sums of ut holds, so that n⁻¹ᐟ² Σ_{s=1}^{[n·]} us ⇒ Bu(·) as n → ∞, where Bu is vector Brownian motion with variance matrix Ω. In view of (3.2) and the fact that R(1) = Σ_{i=0}^∞ R^i = (I − R)⁻¹ = −(β'α)⁻¹, we further have

n⁻¹ᐟ² Σ_{s=1}^{[n·]} vs = n⁻¹ᐟ² Σ_{s=1}^{[n·]} β'Xs ⇒ −(β'α)⁻¹β'Bu(·),  as n → ∞.   (3.4)
These limit laws involve the same Brownian motion Bu and determine the asymptotic forms of the various sample moment matrices involved in the reduced rank regression estimation of (1.1). Define

Var[ (ΔXt', (β'Xt−1)')' ] = [ Σ₀₀  Σ₀β ; Σβ₀  Σββ ].   (3.5)

Explicit expressions for the submatrices in this expression may be worked out in terms of the autocovariance sequences of ut and vt and the parameters of (1.1). These expressions are given in (A.2)–(A.4) in the proof of Lemma A1 in the Appendix. The following result provides some asymptotic limits that are useful in deriving the asymptotic properties of Σ̂(r) in the criterion function (1.2).

LEMMA 3.1. Under Assumptions LP and RR,

S₀₀ →p Σ₀₀,  β'S₁₁β →p Σββ,  β'S₁₀ →p Σβ₀,

n⁻¹ β⊥'S₁₁β⊥ ⇒ (α⊥'β⊥)⁻¹ α⊥' {∫₀¹ BuBu'} α⊥ (β⊥'α⊥)⁻¹,

β⊥'(S₁₀ − S₁₁βα') ⇒ (α⊥'β⊥)⁻¹ α⊥' {∫₀¹ Bu dBu'} + Λwu,

β⊥'S₁₁β ⇒ −(α⊥'β⊥)⁻¹ α⊥' {∫₀¹ Bu dBu'} β(α'β)⁻¹ + Λwv,

β⊥'S₁₀ ⇒ (α⊥'β⊥)⁻¹ α⊥' {∫₀¹ Bu dBu'} α⊥(β⊥'α⊥)⁻¹β⊥' + Λwu + Λwv α',

where

Λwu = Σ_{h=1}^∞ E{(β⊥'ΔXt) u't+h},  Λwv = Σ_{h=0}^∞ E{(β⊥'ΔXt)(β'Xt+h)'},

and wt = β⊥'ΔXt = β⊥'ut + β⊥'α vt−1.

REMARK 3.1. (a) When ut is weakly dependent, it is apparent that the asymptotic limits of β⊥'(S₁₀ − S₁₁βα'), β⊥'S₁₀ and β⊥'S₁₁β involve bias terms that depend on various one-sided long-run covariance matrices associated with the stationary components ut, vt and wt = β⊥'ΔXt. Explicit values of these one-sided long-run covariance matrices are given in (A.7) and (A.8) in the Appendix. (b) When ut is a martingale difference sequence, Λwu = 0 and Λwv = β⊥'E(ut vt'). Simpler results, such as

β⊥'(S₁₀ − S₁₁βα') ⇒ (α⊥'β⊥)⁻¹ α⊥' ∫₀¹ Bu dBu',   (3.6)

then hold for the limits in Lemma 3.1, and these correspond to earlier results given for example in Theorem 10.3 of Johansen (1995). From (1.1), ΔXt = ut + αβ'Xt−1 = ut + αvt−1, so that (cf. (A.3)–(A.4))

Σ₀β = αΣββ + E(ut v't−1)  and  Σ₀₀ = αΣβ₀ + E(ut v't−1)α' + E(ut ut').   (3.7)

Define

α̃ = Σ₀β Σββ⁻¹ = α + E(ut v't−1)Σββ⁻¹,   (3.8)
and let α̃⊥ be an m × (m − r₀) orthogonal complement to α̃ such that [α̃, α̃⊥] is non-singular.

LEMMA 3.2. Under Assumptions LP and RR, when the true co-integration rank is r₀, the r₀ largest solutions to (2.5), denoted by λ̂i with 1 ≤ i ≤ r₀, converge to the roots of

|λΣββ − Σβ₀ Σ₀₀⁻¹ Σ₀β| = 0.   (3.9)

The remaining m − r₀ roots, denoted by λ̂i with r₀ + 1 ≤ i ≤ m, decrease to zero at the rate n⁻¹, and {nλ̂i : i = r₀ + 1, . . . , m} converge weakly to the roots of

|ρ ∫₀¹ GuGu' − {(∫₀¹ Gu dGu') β⊥'α̃⊥ + Δα̃⊥} (α̃⊥'Σ₀₀α̃⊥)⁻¹ {α̃⊥'β⊥ (∫₀¹ dGu Gu') + α̃⊥'Δ'}| = 0,   (3.10)

where Gu(r) = (α⊥'β⊥)⁻¹α⊥'Bu(r) is m − r₀ dimensional Brownian motion with variance matrix (α⊥'β⊥)⁻¹α⊥'Ωα⊥(β⊥'α⊥)⁻¹ and Δ = Λwu + Λwv α̃'.
is weakly dependent, the limit distribution determined by (3.10) is more complex than in the standard case. In particular, the determinantal equation (3.10) involves the composite one-sided long-run covariance matrix . 1 α = α, α⊥ = α⊥ , wu = (b) When u t is a martingale difference sequence, we find that 0, = wv α , 00 = αββ α + and −1 −1 α⊥ 00 α⊥
α⊥ = α⊥ α⊥ α⊥ α⊥ .
α⊥ Then α ⊥ β ⊥ G u (r) = α ⊥ B u (r) is Brownian motion with covariance matrix α ⊥ α ⊥ , α ⊥ = 0, and the determinantal equation (3.10) reduces to 1 1 1 −1 ρ Gu Gu − Gu dGu β⊥ α⊥ α⊥ α⊥ α⊥ β⊥ dGu Gu = 0, 0
0
which is equivalent to ρ
0
1
Vu Vu −
0
1 0
Vu dVu
1 0
dVu Vu = 0,
where V u (r) is m − r 0 dimensional standard Brownian motion, thereby corresponding to the standard limit theory of a parametric reduced rank regression (Johansen, 1995). T HEOREM 3.1. (a) Under Assumptions LP and RR, the criterion IC(r) is weakly consistent for selecting the rank of co-integration provided C n → ∞ at a slower rate than n. (b) The asymptotic distribution of the AIC criterion (IC(r) with coefficient C n = 2) is given by lim P (ˆrAIC = r0 ) ⎫⎤ ⎧ ⎡ r ⎬ ⎨ m =P⎣ ∩ ξi < 2(r − r0 )(2m − r − r0 ) ⎦ , ⎭ r=r0 +1 ⎩ n→∞
i=r0 +1
lim P (ˆrAIC = r|r > r0 ) r m =P ∩ ξi < 2(r − r)(2m − r − r) ∩ n→∞
r =r+1
r−1
∩
r =r0
i=r+1
r
, ξi > 2 r − r 2m − r − r
i=r +1
and lim P (ˆrAIC = r|r < r0 ) = 0,
n→∞
where ξr0 +1 , . . . , ξm are the ordered roots of the limiting determinantal equation (3.10). R EMARK 3.3. (a) BIC, HQ and other information criteria with C n → ∞ and C n /n → 0 are all consistent for the selection of cointegrating rank without having to specify a full parametric model. C The Author(s). Journal compilation C Royal Economic Society 2009.
The same is true for the criterion IC∗(r), where only the cointegrating space is estimated and structural identification conditions on the cointegrating vector are not imposed or used in rank selection.

(b) AIC is inconsistent, asymptotically never underestimates cointegrating rank, and favours more liberally parametrized systems. This outcome is analogous to the well-known overestimation tendency of AIC in lag length selection in autoregression. Of course, in the present case, maximum rank is bounded above by the order of the system. Thus, the advantages to overestimation in lag length selection that arise when the autoregressive order is infinite might not be anticipated here. However, when cointegrating rank is high (and close to full dimensional), AIC typically performs exceedingly well (as simulations reported below attest), largely because the upper bound in rank restricts the tendency to overestimate.

(c) When m = 1, r₀ = 0 corresponds to the unit root case and r₀ = 1 to the stationary case. Thus, one specialization of the above result is to unit root testing. In this case, the criteria consistently discriminate between unit root and stationary series provided Cn → ∞ and Cn/n → 0, as shown in Phillips (2008). In this case, the limit distribution of AIC is much simpler and involves only the explicit limiting root ξ₁ = (∫₀¹ Bu dBu + λ)² / {(∫₀¹ Bu²) Σ₀₀}, where λ = Σ_{h=1}^∞ E(ut ut+h).

(d) While Theorem 3.1 relates directly to model (1.1), it is easily shown to apply in cases where the model has intercepts and drift. Thus, the result provides a convenient basis for consistent co-integration rank selection in most empirical contexts.
4. SIMULATIONS

Simulations were conducted to evaluate the finite sample performance of the criteria under various generating mechanisms for the short memory component ut, different settings for the true cointegrating rank, and for various choices of the penalty coefficient Cn. Some illustrative findings for cases of dimension m = 2 and 4 are reported here. The data generating process follows (1.1). When m = 2, the design is as follows. For r₀ = 0 we have αβ' = 0. For r₀ = 1 the reduced rank coefficient structure is set so that

αβ' = R1 = (1, 0.5)'(−1, 1).
For r₀ = 2, two different designs (A and B) were simulated, one with smaller and the other with larger stationary roots, as follows:

A: αβ' = R2 = [−0.5, 0.1; 0.2, −0.4],  with stationary roots λi[I + β'α] = {0.7, 0.4};

B: αβ' = R3 = [−0.5, 0.1; 0.2, −0.15],  with stationary roots λi[I + β'α] = {0.9, 0.45}.
When the dimension m = 4, the matrix αβ' was constructed to have a block diagonal form reflecting the true cointegrating rank. We call the four-dimensional set-up design C in what
follows. For r₀ = 0 we have αβ' = 0. Let

R4 = (2, 0.5)'(−1, 1)  and  R5 = [−0.7, 0.1; 0.2, −0.6].
with stationary root λi [I + β α] = −0.5.
For r 0 = 2, α β = diag{R5, 0, 0},
with stationary roots λi [I + β α] = {0.2, 0.5}.
For r 0 = 3, α β = diag{R2, R4 },
with stationary roots λi [I + β α] = {0.4, 0.7, −0.5}.
For r 0 = 4, α β = diag{R2, R3 },
with stationary roots λi [I + β α] = {0.4, 0.7, 0.45, 0.9}.
Simulations were conducted with AR(1), MA(1) and ARMA(1,1) errors, corresponding to the models ut = Aut−1 + εt , ut = εt + Bεt−1
and ut = Aut−1 + εt + Bεt−1 ,
(4.1)
respectively, with coefficient matrices A = ψI m , B = φI m , where |ψ| < 1, |φ| < 1, and with innovations ε t = i.i.d. N (0, ε ), where ε = diag{1 + θ, 1 − θ } > 0 when m = 2 and ε = diag{1 + θ1 , 1 − θ1 , 1 + θ2 , 1 − θ2 } when m = 4. The parameters for these models were set to ψ = φ = 0.4, θ = 0.25, θ 1 = 0.25 and θ 2 = 0.4. The performance of the criteria AIC, BIC, HQ and log(HQ) was investigated for sample sizes n = 100 in design A and n = 100, 400 in design B and n = 100, 250 and 400 in design C. 1 All cases including 50 additional observations to eliminate start-up effects from the initializations X 0 = 0 and ε 0 = 0. The results are based on 20,000 replications and are summarized in Fig. 1, which shows the results for design A, in Table 1, which shows the results for design B, and in Table 2, which shows the results for design C, with correct selections in bold type. The results displayed are for the model with AR(1) errors. Similar results were obtained for the other error generating schemes in (4.1). As is evident in Fig. 1, the BIC criterion generally performs very well when n ≥ 100. For design B, where the stationary roots of the system are closer to unity, BIC has a tendency to underestimate rank when n = 100 and r 0 = 2, thereby choosing more parsimoniously parameterized systems in this case, just as it does in lag length selection in autoregressions. But BIC performs well when n = 400, as seen in Table 1. The tendency of AIC to overestimate rank is also clear in Fig. 1, but this tendency is noticeably attenuated when the true rank is 1 and is naturally delimited when the true rank is 2 because of the upper bound in rank choice. For design B, AIC performs better than BIC when 1
log(HQ) has penalty coefficient C n = log (2 log log n).
C The Author(s). Journal compilation C Royal Economic Society 2009.
S92
Xu Cheng and Peter C. B. Phillips (a) AIC
(b) BIC 100
100 Prob
Prob
80
80
60
60
40
40 20
20
0
0 0
1
rˆ
2
0
0
2
1
1
2
rˆ
r0
(c) HQ
1
0
2
r0
(d) log(HQ) 100 Prob
0
100 Prob
80
80
60
60
40
40
20
20
0 1
rˆ
2
0
1
r0
2
0 1
rˆ
0 2
0
1
2
r0
Figure 1. Cointegrating rank selection in design A when u t is AR(1) and n = 100.
the cointegrating rank is 2 and the system is stationary, as does HQ, for which the penalty satisfies C_n < 2 when n = 100, 400. Criteria with weaker penalties, such as log(HQ) with C_n = log(2 log log n), also do better in this case, although in other cases they perform much less satisfactorily than AIC and HQ, showing a strong tendency to overestimate cointegrating rank. Design C is a more extensive set-up for the higher dimensional system with m = 4. As shown in Table 2, BIC generally performs well when n = 250 and the pattern follows that of the two-dimensional set-up. When r_0 = 4 and the system is stationary, BIC tends to underestimate cointegrating rank because one stationary root is close to unity. BIC also performs better when some stationary roots are negative than it does when all stationary roots are positive. We found, for example, that when the cointegrating rank is 3 with the three positive stationary roots {0.4, 0.5, 0.7}, BIC can perform poorly even for n = 250. However, if the distribution of the stationary roots is more balanced (for example, if one of the roots is negative, as in {−0.5, 0.4, 0.7} in design C), performance improves significantly. Based on overall performance, it seems that BIC can be recommended for practical work in choosing cointegrating rank and it gives generally very sharp results when n ≥ 250. The main weakness of BIC is its tendency to choose more parsimonious models (i.e. models with more unit roots) in the following conditions: (i) when the system is stationary and has a root near unity; (ii) when there are many positive stationary roots; and (iii) when the sample size is small and the system dimension is large.²
Wang and Bessler (2005) reported some related simulation work under the assumption that it is known that the time series are already transformed into a form where the observed variables are
2 Simulations (not reported here) showed that the tendency for BIC to select models with more unit roots is exacerbated when the criterion IC∗(r), which has a stronger penalty, is used.
Semiparametric cointegrating rank selection
Table 1. Cointegrating rank selection in design B when u_t follows an AR(1) process.
                      n = 100                   n = 400
               r=0     r=1     r=2       r=0     r=1     r=2
r0 = 0
  AIC         0.48    0.40    0.12      0.53    0.36    0.11
  BIC         0.88    0.11    0.01      0.95    0.05    0.00
  HQ          0.35    0.47    0.18      0.47    0.40    0.13
  log(HQ)     0.03    0.44    0.53      0.07    0.49    0.44
r0 = 1
  AIC         0.00    0.78    0.22      0.00    0.76    0.24
  BIC         0.00    0.94    0.06      0.00    0.97    0.03
  HQ          0.00    0.71    0.29      0.00    0.74    0.26
  log(HQ)     0.00    0.40    0.60      0.00    0.46    0.54
r0 = 2
  AIC         0.00    0.25    0.75      0.00    0.00    1.00
  BIC         0.05    0.74    0.21      0.00    0.02    0.98
  HQ          0.00    0.14    0.86      0.00    0.00    1.00
  log(HQ)     0.00    0.02    0.98      0.00    0.00    1.00
either stationary or integrated. In the present context, this is equivalent to setting αβ′ to a diagonal matrix with elements of either zero or unity. The problem of cointegrating rank selection in this simpler framework is equivalent to direct unit root testing on each variable. We may therefore use the selection method of Phillips (2008) to estimate the cointegrating rank by conducting a unit root test on each time series and simply counting the number of unit roots obtained. Simulations (not reported here) indicate that this procedure works well. However, since the transformation that takes the model into a canonical form where the observed variables are either stationary or integrated is seldom known, this procedure is generally not practical for estimating cointegrating rank.
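The unit-root-counting idea can be sketched as follows. The code compares, series by series, an imposed unit-root AR(1) against a fitted AR(1) using a BIC comparison; this is a simplified stand-in for the selection method of Phillips (2008) invoked above, whose exact criterion differs in detail.

```python
import numpy as np

def count_unit_roots(series_list, Cn=None):
    """Count unit roots by comparing, for each univariate series, the BIC
    of a driftless unit-root AR(1) (rho = 1 imposed, no fitted parameter)
    with the BIC of an AR(1) with rho estimated by least squares.  This is
    a simplified stand-in for the selection method of Phillips (2008)
    invoked in the text; the exact criterion there differs in detail."""
    unit_roots = 0
    for x in series_list:
        x = np.asarray(x, float)
        dx, xl = np.diff(x), x[:-1]
        n = dx.size
        Cn_ = np.log(n) if Cn is None else Cn
        rho = 1.0 + (xl @ dx) / (xl @ xl)           # OLS AR(1) coefficient
        s2_ur = np.mean(dx ** 2)                    # residual variance, rho = 1
        s2_ar = np.mean((dx - (rho - 1.0) * xl) ** 2)
        bic_ur = n * np.log(s2_ur)                  # no fitted coefficient
        bic_ar = n * np.log(s2_ar) + Cn_            # one fitted coefficient
        unit_roots += int(bic_ur <= bic_ar)
    return unit_roots

rng = np.random.default_rng(7)
series = [rng.standard_normal(600),                 # stationary
          np.cumsum(rng.standard_normal(600))]      # integrated
print("estimated number of unit roots:", count_unit_roots(series))
```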
5. CONCLUSION
Model selection for cointegrating rank treats rank as an order parameter and provides the convenience of consistent estimation of this parameter under weak conditions on the expansion rate of the penalty coefficient. The approach is easy to implement in practice and is sympathetic with other semiparametric approaches to estimation and inference in cointegrating systems where the focus is on long-run behaviour. Information criteria such as (1.2) provide a useful
Table 2. Cointegrating rank selection in design C when u_t follows an AR(1) process (n = 250).
               r=0     r=1     r=2     r=3     r=4
r0 = 0
  AIC         0.13    0.40    0.31    0.12    0.04
  BIC         0.94    0.06    0.00    0.00    0.00
  HQ          0.58    0.33    0.08    0.01    0.00
  log(HQ)     0.00    0.09    0.31    0.40    0.20
r0 = 1
  AIC         0.00    0.34    0.43    0.18    0.05
  BIC         0.00    0.96    0.04    0.00    0.00
  HQ          0.00    0.75    0.21    0.03    0.00
  log(HQ)     0.00    0.05    0.30    0.45    0.20
r0 = 2
  AIC         0.00    0.00    0.54    0.36    0.10
  BIC         0.00    0.02    0.93    0.05    0.00
  HQ          0.00    0.00    0.80    0.17    0.03
  log(HQ)     0.00    0.00    0.23    0.50    0.26
r0 = 3
  AIC         0.00    0.00    0.00    0.82    0.18
  BIC         0.00    0.00    0.10    0.88    0.02
  HQ          0.00    0.00    0.00    0.93    0.07
  log(HQ)     0.00    0.00    0.00    0.66    0.34
r0 = 4
  AIC         0.00    0.00    0.00    0.00    1.00
  BIC         0.00    0.00    0.04    0.16    0.80
  HQ          0.00    0.00    0.00    0.02    0.98
  log(HQ)     0.00    0.00    0.00    0.00    1.00
diagnostic check on system cointegrating rank; the criteria proceed as if there is no prior information on cointegrating rank. If prior information were available and could be formulated as prior probabilities on the models of different rank, then this could, of course, be incorporated into a Bayes factor. In subsequent work, Cheng and Phillips (2008) show that consistent cointegrating rank selection by information criteria continues to hold in models where there is unconditional heterogeneity in the error variance of unknown form, including breaks in the variance or smooth transition functions in the variance over time. Such permanent changes in variance are known to invalidate both unit root tests and likelihood ratio tests for cointegrating rank because of their effects on the limit distribution theory under the null (see Cavaliere, 2004, Beare, 2007, and Cavaliere and Taylor, 2007). Since consistency of the information criteria is unaffected by the presence of this form of variance-induced non-stationarity, the approach offers an additional degree of robustness in cointegrating rank determination that is useful in empirical applications. Cheng and Phillips (2008) give an empirical application of this theory to exchange rate dynamics. Some applications of the methods outlined here are possible in other models. First, rather than work with reduced rank regression formulations within a vector autoregressive framework, it is possible to use reduced rank formulations in regressions of the time series on a fixed (or expanding) number of deterministic basis functions such as time polynomials or sinusoidal polynomials (Phillips, 2005). In a similar way to the present analysis, it can be shown that information criteria such as BIC and HQ will be consistent for cointegrating rank in such coordinate systems.
The coefficient matrix in such systems turns out to have a random limit, corresponding to the matrix of random variables that appear in the Karhunen–Loève representation (Phillips, 1998), but has a rank that is the same as the dimension of the cointegrating space, which enables consistent rank estimation by information criteria. A second application is to dynamic factor panel models with a fixed number of stochastically trending unobserved factors, as in Bai and Ng (2004). Again, these models have reduced rank structure (this time with non-random coefficients) and the number of factors may be consistently estimated using model selection criteria of the same type as those considered here, but in the presence of an increasing number of incidental loading coefficients. In such cases, the BIC penalty, as derived from the asymptotic behaviour of the Bayes factor, has a different form from usual and typically involves both cross section and time series sample sizes. Some extensions of the present methods to these models will be reported in later work.
ACKNOWLEDGMENTS Our thanks go to the Editor and a referee for helpful comments on the original version. Cheng acknowledges support from an Anderson Fellowship. Phillips acknowledges support from the NSF under Grant No. SES 06-47086.
REFERENCES
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (Eds.), Second International Symposium on Information Theory, 267–81. Budapest: Akademiai Kiado.
Akaike, H. (1977). On entropy maximization principle. In P. R. Krishnaiah (Ed.), Applications of Statistics, 27–41. Amsterdam: North-Holland.
Bai, J. and S. Ng (2004). A PANIC attack on unit roots and cointegration. Econometrica 72, 1127–77.
Beare, B. (2007). Robustifying unit root tests to permanent changes in innovation variance. Working paper, Yale University.
Cavaliere, G. (2004). Unit root tests under time-varying variance shifts. Econometric Reviews 23, 259–92.
Cavaliere, G. and A. M. R. Taylor (2007). Testing for unit roots in time series models with non-stationary volatility. Journal of Econometrics 140, 919–47.
Chao, J. and P. C. B. Phillips (1999). Model selection in partially non-stationary vector autoregressive processes with reduced rank structure. Journal of Econometrics 91, 227–71.
Cheng, X. and P. C. B. Phillips (2008). Cointegrating rank selection in models with time-varying variance. Working paper, Yale University.
Hannan, E. J. and B. G. Quinn (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society, Series B 41, 190–5.
Johansen, S. (1988). Statistical analysis of cointegration vectors. Journal of Economic Dynamics and Control 12, 231–54.
Johansen, S. (1995). Likelihood-Based Inference in Cointegrated Vector Autoregressive Models. Oxford: Oxford University Press.
Kapetanios, G. (2004). The asymptotic distribution of the cointegration rank estimator under the Akaike information criterion. Econometric Theory 20, 735–42.
Nielsen, B. (2006). Order determination in general vector autoregressions. IMS Lecture Notes–Monograph Series 52, 93–112.
Phillips, P. C. B. (1991). Optimal inference in cointegrated systems. Econometrica 59, 283–306.
Phillips, P. C. B. (1995). Fully modified least squares and vector autoregression. Econometrica 63, 1023–78.
Phillips, P. C. B. (1996). Econometric model determination. Econometrica 64, 763–812.
Phillips, P. C. B. (1998). New tools for understanding spurious regressions. Econometrica 66, 1299–326.
Phillips, P. C. B. (2005). Challenges of trending time series econometrics. Mathematics and Computers in Simulation 68, 401–16.
Phillips, P. C. B. (2008). Unit root model selection. Journal of the Japan Statistical Society 38, 65–74.
Phillips, P. C. B. and J. McFarland (1997). Forward exchange market unbiasedness: the case of the Australian dollar since 1984. Journal of International Money and Finance 16, 885–907.
Phillips, P. C. B. and W. Ploberger (1996). An asymptotic theory of Bayesian inference for time series. Econometrica 64, 381–413.
Phillips, P. C. B. and V. Solo (1992). Asymptotics for linear processes. Annals of Statistics 20, 971–1001.
Pötscher, B. M. (1989). Model selection under nonstationarity: autoregressive models and stochastic linear regression models. Annals of Statistics 17, 1257–74.
Rissanen, J. (1978). Modeling by shortest data description. Automatica 14, 465–71.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–4.
Tsay, R. S. (1984). Order selection in nonstationary autoregressive models. Annals of Statistics 12, 1425–33.
Wang, Z. and D. A. Bessler (2005). A Monte Carlo study on the selection of cointegration rank using information criteria. Econometric Theory 21, 593–620.
Wei, C. Z. (1992). On predictive least squares principles. Annals of Statistics 20, 1–42.
APPENDIX
A.1. Normalization restrictions and degrees of freedom
In triangular system specifications (Phillips, 1991) the cointegrating matrix β in (1.1) takes the form β′ = [I_r, −B] for some unrestricted r × (m − r) matrix B, which involves r² restrictions and leads to degrees of freedom 2mr − r² in A. Under normalization restrictions of the form (2.4) on β that are conventionally employed in empirical reduced rank regression modelling, the degrees of freedom term would be 2mr − r(r + 1)/2, leading to the alternate criterion

IC*(r) = log |Σ̂(r)| + C_n n^{-1} (2mr − r(r + 1)/2).

In this case the outer product form of the coefficient matrix in (1.1) implies that A = αβ′ = αCC′β′ for an arbitrary orthogonal matrix C, so that α and β are not uniquely identified even though the likelihood is well defined. In such cases, only the cointegrating rank and the cointegrating space are identified and consistently estimable. Correspondingly, under this normalization there are more degrees of freedom in the system. However, the usual justification for the BIC criterion (Schwarz, 1978, and Phillips and Ploberger, 1996) involves finding an asymptotic approximation to the Bayesian data density (and hence the posterior probability of the model), which is obtained by Laplace approximation methods using a Taylor series expansion of the log likelihood around a consistent parameter estimate. In the reduced rank regression case, r² restrictions on β are required to identify the structural parameters, as in the above formulation β′ = [I_r, −B]. If only normalization restrictions such as β′β = I_r are imposed, then we can write

A = αβ′ = αCC′(I_r + BB′)^{-1/2} [I_r, −B],

with β′ = (I_r + BB′)^{-1/2} [I_r, −B] and where C is an arbitrary orthogonal matrix. In this case, C is unidentified and if C has a uniform prior distribution on the orthogonal group O(r), independent of the prior on (α, B), then C may be integrated out of the Bayesian data density or marginal likelihood.
The data density then has the same form as it does in the case where

A = α(I_r + BB′)^{-1/2} [I_r, −B] = ᾱ [I_r, −B],

with ᾱ = α(I_r + BB′)^{-1/2}, where ᾱ and B are identified. In this event, the model selection criterion is the same as that given in (1.2).
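As a quick arithmetic check on the two penalty terms above, the following sketch tabulates both degrees-of-freedom counts for an m = 4 system; the orthonormal normalization always yields at least as many free parameters, in line with the extra degrees of freedom noted in the discussion.

```python
# Degrees-of-freedom counts entering the two criteria above, for system
# dimension m and candidate rank r.
def df_triangular(m, r):
    """2mr - r^2: triangular normalization beta' = [I_r, -B]."""
    return 2 * m * r - r ** 2

def df_orthonormal(m, r):
    """2mr - r(r+1)/2: orthonormal normalization beta'beta = I_r."""
    return 2 * m * r - r * (r + 1) // 2

for r in range(5):
    print(r, df_triangular(4, r), df_orthonormal(4, r))
```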
A.2. Proofs
LEMMA A1. Under (1.1) and Assumption LP,

Σ_00^{-1} − Σ_00^{-1} Σ_0β (Σ_β0 Σ_00^{-1} Σ_0β)^{-1} Σ_β0 Σ_00^{-1} = Σ_00^{-1/2} c_⊥ (c_⊥′ c_⊥)^{-1} c_⊥′ Σ_00^{-1/2},

where c = Σ_00^{-1/2} Σ_0β and c_⊥ is an orthogonal complement to c. Defining α = Σ_0β Σ_ββ^{-1} = Σ_00^{1/2} c Σ_ββ^{-1} and ᾱ_⊥ = Σ_00^{-1/2} c_⊥, we have the alternate form

Σ_00^{-1/2} c_⊥ (c_⊥′ c_⊥)^{-1} c_⊥′ Σ_00^{-1/2} = ᾱ_⊥ (ᾱ_⊥′ Σ_00 ᾱ_⊥)^{-1} ᾱ_⊥′.   (A.1)

When Σ_0β = α Σ_ββ, (A.1) reduces to

Σ_00^{-1/2} c_⊥ (c_⊥′ c_⊥)^{-1} c_⊥′ Σ_00^{-1/2} = α_⊥ (α_⊥′ Σ_00 α_⊥)^{-1} α_⊥′.

Proof: Since [c, c_⊥] is non-singular we have

I = c (c′c)^{-1} c′ + c_⊥ (c_⊥′ c_⊥)^{-1} c_⊥′,

and then

Σ_00^{-1} − Σ_00^{-1} Σ_0β (Σ_β0 Σ_00^{-1} Σ_0β)^{-1} Σ_β0 Σ_00^{-1}
= Σ_00^{-1/2} {I − c (c′c)^{-1} c′} Σ_00^{-1/2}
= Σ_00^{-1/2} c_⊥ (c_⊥′ c_⊥)^{-1} c_⊥′ Σ_00^{-1/2},

as required.
Observe that when Σ_0β = α Σ_ββ, we have c = Σ_00^{-1/2} Σ_0β = Σ_00^{-1/2} α Σ_ββ and we may choose c_⊥ = Σ_00^{1/2} α_⊥, where α_⊥ is an orthogonal complement to α. In that case we have

Σ_00^{-1/2} c_⊥ (c_⊥′ c_⊥)^{-1} c_⊥′ Σ_00^{-1/2} = α_⊥ (α_⊥′ Σ_00 α_⊥)^{-1} α_⊥′,

as stated. This corresponds with the result in Johansen (1995, lemma 10.1), where u_t is a martingale difference. In the present semiparametric case, ΔX_t = u_t + αβ′X_{t−1} = u_t + αv_{t−1} and the covariance Γ_vu(1) = E(v_{t−1} u_t′) is generally non-zero, so that

Σ_ββ = E(v_t v_t′) = Γ_vv(0),   (A.2)
Σ_β0 = E[v_{t−1}(u_t + αv_{t−1})′] = Γ_vu(1) + Γ_vv(0)α′ = Γ_vu(1) + Σ_ββ α′,   (A.3)
Σ_00 = α Σ_ββ α′ + α Γ_vu(1) + Γ_uv(−1) α′ + Γ_uu(0).   (A.4)

Note that α = Σ_0β Σ_ββ^{-1} = Σ_00^{1/2} c Σ_ββ^{-1} and we may choose ᾱ_⊥ = Σ_00^{-1/2} c_⊥. In this notation, we may write in the general case

Σ_00^{-1/2} c_⊥ (c_⊥′ c_⊥)^{-1} c_⊥′ Σ_00^{-1/2} = ᾱ_⊥ (ᾱ_⊥′ Σ_00 ᾱ_⊥)^{-1} ᾱ_⊥′,   (A.5)

as given in (A.1).
Proof of Lemma 3.1: Since both ΔX_t = u_t + αβ′X_{t−1} and v_t = β′X_t are stationary and satisfy Assumption LP, the law of large numbers gives

S_00 = n^{-1} Σ_{t=1}^{n} ΔX_t ΔX_t′ →_p Σ_00 = Γ_uu(0) + α Γ_vv(0) α′ + α Γ_vu(1) + Γ_uv(−1) α′,

β′ S_11 β = n^{-1} Σ_{t=1}^{n} β′X_{t−1} X_{t−1}′ β →_p Σ_ββ = Γ_vv(0), and

β′ S_10 = n^{-1} Σ_{t=1}^{n} β′X_{t−1} ΔX_t′ →_p Σ_β0 = Γ_vu(1) + Γ_vv(0) α′.

In view of (3.3) we have

β_⊥′ X_t = β_⊥′ C Σ_{s=1}^{t} u_s + β_⊥′ α (β′α)^{-1} R(L) β′ u_t + β_⊥′ C X_0
= (α_⊥′ β_⊥)^{-1} α_⊥′ ( Σ_{s=1}^{t} u_s + X_0 ) + β_⊥′ α (β′α)^{-1} R(L) β′ u_t,

so that the standardized process n^{-1/2} β_⊥′ X_{[n·]} ⇒ (α_⊥′β_⊥)^{-1} α_⊥′ B_u(·), and from (3.4) we have

n^{-1/2} Σ_{s=1}^{[n·]} β′ X_s ⇒ −(β′α)^{-1} β′ B_u(·).   (A.6)
It follows by conventional weak convergence methods that

n^{-1} β_⊥′ S_11 β_⊥ ⇒ (α_⊥′β_⊥)^{-1} α_⊥′ ( ∫_0^1 B_u B_u′ ) α_⊥ (β_⊥′α_⊥)^{-1},

β_⊥′ (S_10 − S_11 β α′) = β_⊥′ n^{-1} Σ_{t=1}^{n} X_{t−1} (ΔX_t − αβ′X_{t−1})′
= n^{-1} Σ_{t=1}^{n} (β_⊥′ X_{t−1}) u_t′ ⇒ (α_⊥′β_⊥)^{-1} α_⊥′ ∫_0^1 B_u dB_u′ + Λ¹_wu,

β_⊥′ S_11 β = n^{-1} Σ_{t=1}^{n} (β_⊥′ X_{t−1}) (β′X_{t−1})′ ⇒ −(α_⊥′β_⊥)^{-1} α_⊥′ ( ∫_0^1 B_u dB_u′ ) β (α′β)^{-1} + Λ_wv,

where

Λ¹_wu = Σ_{h=1}^{∞} E[β_⊥′ ΔX_t u_{t+h}′] and Λ_wv = Σ_{h=0}^{∞} E[β_⊥′ ΔX_t (β′X_{t+h})′]

are one-sided long-run covariance matrices involving w_t = β_⊥′ΔX_t, u_t and v_t. Note that w_t := β_⊥′ΔX_t = β_⊥′u_t + β_⊥′αv_{t−1}, so we may deduce the explicit forms

Λ¹_wu = Σ_{h=1}^{∞} E[β_⊥′ΔX_t u_{t+h}′] = β_⊥′ Σ_{h=1}^{∞} E(u_t u_{t+h}′) + β_⊥′α Σ_{h=1}^{∞} E(v_{t−1} u_{t+h}′)
= β_⊥′ Λ_uu + β_⊥′α [Λ_vu − Γ_vu(1)],   (A.7)

and

Λ_wv = Σ_{h=0}^{∞} E[β_⊥′ΔX_t (β′X_{t+h})′] = β_⊥′ Σ_{h=0}^{∞} E(u_t v_{t+h}′) + β_⊥′α Σ_{h=0}^{∞} E(v_{t−1} v_{t+h}′)
= β_⊥′ (Λ_uv + Γ_uv(0)) + β_⊥′α Λ_vv.   (A.8)

Finally, using (A.6) and standard limit theory again, we obtain

β_⊥′ S_10 = n^{-1} Σ_{t=1}^{n} (β_⊥′ X_{t−1}) (u_t + α v_{t−1})′
⇒ (α_⊥′β_⊥)^{-1} α_⊥′ ∫_0^1 B_u dB_u′ − (α_⊥′β_⊥)^{-1} α_⊥′ ( ∫_0^1 B_u dB_u′ ) β (α′β)^{-1} α′
+ Σ_{h=1}^{∞} E[β_⊥′ΔX_t u_{t+h}′] + Σ_{h=0}^{∞} E[β_⊥′ΔX_t v_{t+h}′] α′
= (α_⊥′β_⊥)^{-1} α_⊥′ ( ∫_0^1 B_u dB_u′ ) {I − β(α′β)^{-1} α′} + Λ¹_wu + Λ_wv α′
= (α_⊥′β_⊥)^{-1} α_⊥′ ( ∫_0^1 B_u dB_u′ ) α_⊥ (β_⊥′α_⊥)^{-1} β_⊥′ + Λ¹_wu + Λ_wv α′,

since β(α′β)^{-1}α′ + α_⊥(β_⊥′α_⊥)^{-1}β_⊥′ = I (e.g. Johansen, 1995, p. 39).
Proof of Lemma 3.2: Let S(λ) = λS_11 − S_10 S_00^{-1} S_01, so that the determinantal equation (2.5) is |S(λ)| = 0. Defining P_n = [β, n^{-1/2}β_⊥] and using Lemma 3.1, we have

P_n′ S(λ) P_n
= [ λβ′S_11β          λn^{-1/2}β′S_11β_⊥     ]   [ β′S_10S_00^{-1}S_01β          n^{-1/2}β′S_10S_00^{-1}S_01β_⊥     ]
  [ λn^{-1/2}β_⊥′S_11β  λn^{-1}β_⊥′S_11β_⊥    ] − [ n^{-1/2}β_⊥′S_10S_00^{-1}S_01β  n^{-1}β_⊥′S_10S_00^{-1}S_01β_⊥    ]
⇒ [ λΣ_ββ − Σ_β0Σ_00^{-1}Σ_0β    0 ]
  [ 0    λ(α_⊥′β_⊥)^{-1}α_⊥′( ∫_0^1 B_u B_u′ )α_⊥(β_⊥′α_⊥)^{-1} ],

so that

|P_n′ S(λ) P_n| ⇒ |λΣ_ββ − Σ_β0Σ_00^{-1}Σ_0β| · |λ(α_⊥′β_⊥)^{-1}α_⊥′( ∫_0^1 B_u B_u′ )α_⊥(β_⊥′α_⊥)^{-1}|.   (A.9)

The determinantal equation

|λΣ_ββ − Σ_β0Σ_00^{-1}Σ_0β| · |λ(α_⊥′β_⊥)^{-1}α_⊥′( ∫_0^1 B_u B_u′ )α_⊥(β_⊥′α_⊥)^{-1}| = 0

has m − r_0 zero roots and r_0 positive roots given by the solutions of

|λΣ_ββ − Σ_β0Σ_00^{-1}Σ_0β| = 0.   (A.10)

Thus, the r_0 largest roots of (2.5) converge to the roots of (A.10) and the remainder converge to zero. Defining P = [β, β_⊥], we have

|P′ S(λ) P| = | β′S(λ)β    β′S(λ)β_⊥ ; β_⊥′S(λ)β    β_⊥′S(λ)β_⊥ |
= |β′S(λ)β| · |β_⊥′{S(λ) − S(λ)β[β′S(λ)β]^{-1}β′S(λ)}β_⊥|.   (A.11)

As in Johansen (1995, theorem 11.1), we let n → ∞ and λ → 0 such that ρ = nλ = O_p(1). Using Lemma 3.1, we have

β′S(λ)β = ρn^{-1}β′S_11β − β′S_10S_00^{-1}S_01β = −Σ_β0Σ_00^{-1}Σ_0β + o_p(1),
β_⊥′S(λ)β_⊥ = ρn^{-1}β_⊥′S_11β_⊥ − β_⊥′S_10S_00^{-1}S_01β_⊥, and
β_⊥′S(λ)β = ρn^{-1}β_⊥′S_11β − β_⊥′S_10S_00^{-1}S_01β = −β_⊥′S_10S_00^{-1}S_01β + o_p(1).   (A.12)

Define

N_n = S_00^{-1} − S_00^{-1}S_01β (β′S_10S_00^{-1}S_01β)^{-1} β′S_10S_00^{-1}.

Using Lemmas 3.1 and A1, we have

N_n = Σ_00^{-1} − Σ_00^{-1}Σ_0β(Σ_β0Σ_00^{-1}Σ_0β)^{-1}Σ_β0Σ_00^{-1} + o_p(1)
= ᾱ_⊥(ᾱ_⊥′Σ_00ᾱ_⊥)^{-1}ᾱ_⊥′ + o_p(1).   (A.13)

By (A.12) and (A.13), the second factor in (A.11) becomes

β_⊥′{S(λ) − S(λ)β[β′S(λ)β]^{-1}β′S(λ)}β_⊥ = ρn^{-1}β_⊥′S_11β_⊥ − β_⊥′S_10 N_n S_01β_⊥ + o_p(1)
= ρn^{-1}β_⊥′S_11β_⊥ − β_⊥′S_10 ᾱ_⊥(ᾱ_⊥′Σ_00ᾱ_⊥)^{-1}ᾱ_⊥′ S_01β_⊥ + o_p(1).   (A.14)

By Lemma 3.1, we have

β_⊥′{S(λ) − S(λ)β[β′S(λ)β]^{-1}β′S(λ)}β_⊥
∼ ρ(α_⊥′β_⊥)^{-1}α_⊥′( ∫_0^1 B_u B_u′ )α_⊥(β_⊥′α_⊥)^{-1}
− [ (α_⊥′β_⊥)^{-1}α_⊥′( ∫_0^1 B_u dB_u′ )α_⊥(β_⊥′α_⊥)^{-1}β_⊥′ + Λ ] ᾱ_⊥(ᾱ_⊥′Σ_00ᾱ_⊥)^{-1}ᾱ_⊥′ [ β_⊥(α_⊥′β_⊥)^{-1}α_⊥′( ∫_0^1 dB_u B_u′ )α_⊥(β_⊥′α_⊥)^{-1} + Λ′ ]
= ρ ∫_0^1 G_u G_u′ − [ ( ∫_0^1 G_u dG_u′ )β_⊥′ + Λ ] ᾱ_⊥(ᾱ_⊥′Σ_00ᾱ_⊥)^{-1}ᾱ_⊥′ [ β_⊥( ∫_0^1 dG_u G_u′ ) + Λ′ ],

where Λ = Λ¹_wu + Λ_wv α′ and G_u(r) = (α_⊥′β_⊥)^{-1}α_⊥′B_u(r) is Brownian motion with variance matrix (α_⊥′β_⊥)^{-1}α_⊥′ Ω α_⊥(β_⊥′α_⊥)^{-1}, Ω being the variance matrix of B_u. Equations (A.11), (A.14) and Lemma 3.1 reveal that the m − r_0 smallest solutions of (2.5), normalized by n, converge to those of the equation

| ρ ∫_0^1 G_u G_u′ − [ ( ∫_0^1 G_u dG_u′ )β_⊥′ + Λ ] ᾱ_⊥(ᾱ_⊥′Σ_00ᾱ_⊥)^{-1}ᾱ_⊥′ [ β_⊥( ∫_0^1 dG_u G_u′ ) + Λ′ ] | = 0,   (A.15)

as stated.

Proof of Theorem 3.1:
Part (a): Let IC_{r0}(r) denote the information criterion defined in (1.2) when the true cointegrating rank is r_0. Cointegrating rank is estimated by minimizing IC_{r0}(r) for 0 ≤ r ≤ m. To check the consistency of this estimator, we need to compare IC_{r0}(r) with IC_{r0}(r_0) for any r ≠ r_0. When r > r_0, using (1.2) and (2.9), we have

IC_{r0}(r) − IC_{r0}(r_0) = Σ_{i=r_0+1}^{r} log(1 − λ̂_i) + C_n n^{-1} [(2mr − r²) − (2mr_0 − r_0²)]
= Σ_{i=r_0+1}^{r} log(1 − λ̂_i) + C_n n^{-1} (r − r_0)(2m − r − r_0).   (A.16)

In order to consistently select r_0 with probability 1 as n → ∞ we need

Σ_{i=r_0+1}^{r} log(1 − λ̂_i) + C_n n^{-1} (r − r_0)(2m − r − r_0) > 0,   (A.17)

with probability 1 as n → ∞ for any r_0 < r ≤ m. From (3.10), we know that λ̂_i is O_p(n^{-1}) for all i = r_0 + 1, . . . , r. Expanding log(1 − λ̂_i), we have

Σ_{i=r_0+1}^{r} log(1 − λ̂_i) = −Σ_{i=r_0+1}^{r} λ̂_i + o_p(n^{-1}) = O_p(n^{-1}).   (A.18)

Using (A.18) and Lemma 3.2, we then have

n ( Σ_{i=r_0+1}^{r} log(1 − λ̂_i) + C_n n^{-1} (r − r_0)(2m − r − r_0) ) = −Σ_{i=r_0+1}^{r} nλ̂_i + C_n (r − r_0)(2m − r − r_0) + o_p(1),   (A.19)

where nλ̂_i for i = r_0 + 1, . . . , r are O_p(1). As such, as long as C_n → ∞ as n → ∞, the second term on the right-hand side of (A.19) dominates, which leads to (A.17) as n → ∞. Hence, if the penalty coefficient C_n → ∞, cointegrating rank r > r_0 will never be selected. So, too few unit roots will never be selected in the system in such cases. Thus, the criteria BIC and HQ will never select excessive cointegrating rank as n → ∞. On the other hand, the AIC penalty is fixed at C_n = 2 for all n, so we may expect AIC to select models with excessive cointegrating rank with positive probability as n → ∞. This corresponds to a more liberally parametrized system.
When r < r_0,

IC_{r0}(r) − IC_{r0}(r_0) = −Σ_{i=r+1}^{r_0} log(1 − λ̂_i) + C_n n^{-1} [(2mr − r²) − (2mr_0 − r_0²)]
= −Σ_{i=r+1}^{r_0} log(1 − λ̂_i) + C_n n^{-1} (r − r_0)(2m − r − r_0).   (A.20)

In order to consistently select r_0 with probability 1 as n → ∞, we need

−Σ_{i=r+1}^{r_0} log(1 − λ̂_i) + C_n n^{-1} (r − r_0)(2m − r − r_0) > 0, as n → ∞.   (A.21)

From Lemma 3.2, we know that 0 < λ̂_i < 1 for i = r + 1, . . . , r_0. So the first term on the right-hand side of (A.20) is a positive number that is bounded away from 0, and the second term on the right-hand side of (A.20) is a negative number of order O(C_n n^{-1}). In order for (A.21) to hold as n → ∞, we therefore require only that C_n/n = o(1), i.e. the penalty coefficient must pass to infinity more slowly than n. For each of the criteria AIC, BIC and HQ, C_n/n → 0. Hence, these three information criteria all select models with insufficient cointegrating rank (or excess unit roots) with probability zero asymptotically.
Combining the conditions on C_n for r > r_0 and r < r_0, it follows that the information criterion will lead to consistent estimation of the cointegrating rank provided the penalty coefficient satisfies C_n → ∞ and C_n/n → 0 as n → ∞.
Part (b): Under AIC, C_n = 2. The limiting probability that AIC(r_0) ≤ AIC(r) for any r < r_0 is given by

lim_{n→∞} P{AIC(r_0) ≤ AIC(r)}
= lim_{n→∞} P{ −Σ_{i=r+1}^{r_0} log(1 − λ̂_i) + 2n^{-1} (r − r_0)(2m − r − r_0) > 0 }
= lim_{n→∞} P{ Σ_{i=r+1}^{r_0} log(1 − λ̂_i) < 2n^{-1} (r − r_0)(2m − r − r_0) } = 1,   (A.22)

because 0 < λ_i < 1 for i = r + 1, . . . , r_0 are the r_0 − r smallest solutions to (3.9) and then Σ_{i=r+1}^{r_0} log(1 − λ_i) < 0, giving (A.22). Hence, when r_0 is the true rank, AIC will not select any r < r_0 as n → ∞, i.e.

lim_{n→∞} P(r̂_AIC = r | r < r_0) = 0.   (A.23)

Let ξ_{r_0+1} > · · · > ξ_m be the ordered roots of the limiting determinantal equation (3.10). When r′ > r ≥ r_0, AIC(r) < AIC(r′) iff

Σ_{i=r+1}^{r′} log(1 − λ̂_i) + C_n n^{-1} (r′ − r)(2m − r′ − r) > 0,

so that the limiting probability that r will be chosen over r′ is

lim_{n→∞} P(AIC(r) < AIC(r′)) = lim_{n→∞} P( −Σ_{i=r+1}^{r′} nλ̂_i + 2(r′ − r)(2m − r′ − r) > 0 )
= P( Σ_{i=r+1}^{r′} ξ_i < 2(r′ − r)(2m − r′ − r) ).   (A.24)

Accordingly, the probability that AIC will select rank r is the probability that r is chosen over every other r′ ≥ r_0. This probability is

lim_{n→∞} P(r̂_AIC = r | r > r_0)
= P( ∩_{r′=r+1}^{m} { Σ_{i=r+1}^{r′} ξ_i < 2(r′ − r)(2m − r′ − r) } ∩ ∩_{r′=r_0}^{r−1} { Σ_{i=r′+1}^{r} ξ_i > 2(r − r′)(2m − r − r′) } ),   (A.25)

where the first part is the limiting probability that r is chosen over all r′ > r and the other part is the probability that r is chosen over all r_0 ≤ r′ < r. Any rank less than r_0 is not taken into account here because those ranks are always dominated in the limit by r_0, from (A.23). The probability that the cointegrating rank r_0 is consistently estimated by AIC as n → ∞ is

lim_{n→∞} P(r̂_AIC = r_0) = P( ∩_{r=r_0+1}^{m} { Σ_{i=r_0+1}^{r} ξ_i < 2(r − r_0)(2m − r − r_0) } ).   (A.26)

This is a special case of (A.25) with r = r_0.
The unit root case. When the system order is m = 1, the procedure provides a mechanism for unit root testing. If r_0 = 0, i.e. the model has a unit root, we have by (A.26)

lim_{n→∞} P(r̂_AIC = 1 | r_0 = 0) = P{ξ_1 > 2} = 1 − P{ξ_1 < 2} and
lim_{n→∞} P(r̂_AIC = 0 | r_0 = 0) = P{ξ_1 < 2},   (A.27)

where ξ_1 is the solution to (3.10) when m = 1 and r_0 = 0. In this case, we see that

ξ_1 = ( ∫_0^1 G_u dG_u + λ )² / ( Σ_00 ∫_0^1 G_u² ) = ( ∫_0^1 B_u dB_u + λ )² / ( Σ_00 ∫_0^1 B_u² ),

since G_u = B_u, α_⊥ = 1, β_⊥ = 1 and Λ = λ in this case. If r_0 = 1, so that the model is stationary, we have

lim_{n→∞} P(r̂_AIC = 0 | r_0 = 1) = 0 and lim_{n→∞} P(r̂_AIC = 1 | r_0 = 1) = 1,   (A.28)

using (A.22). These results for the scalar case m = 1 are consistent with those in Phillips (2008).
The Econometrics Journal (2009), volume 12, pp. S105–S134. doi: 10.1111/j.1368-423X.2009.00280.x

Distribution-free specification tests for dynamic linear models

MIGUEL A. DELGADO†, JAVIER HIDALGO‡ AND CARLOS VELASCO†
† Universidad Carlos III, 28903 Madrid, Spain
E-mail: [email protected], [email protected]
‡ London School of Economics, Houghton Street, London WC2A 2AE, UK
E-mail: [email protected]

First version received: July 2008; final version accepted: December 2008

Summary This article proposes goodness-of-fit tests for dynamic regression models in which the regressors are allowed to be only weakly exogenous and arbitrarily correlated with past shocks. The null hypothesis is stated in terms of the lack of serial correlation of the errors of the model. The tests are based on a linear transformation of a Bartlett T_p-process of the residuals. This transformation approximates the martingale component of the process, so that it converges weakly to the standard Brownian motion under the null hypothesis. One feature of our set-up is that we do not need to specify the dynamic structure of the regressors. Because of this, the transformation employs a semi-parametric correction that does not restrict the class of local alternatives that our tests can detect, in contrast with other work using smoothing techniques. A Monte Carlo study illustrates the finite sample performance of the tests.

Keywords: Dynamic models, Empirical processes, Exogeneity, Goodness-of-fit, Local alternatives, Martingale decomposition.

1. INTRODUCTION
Delgado et al. (2005) (DHV henceforth) proposed asymptotically distribution-free tests for the correct parametric specification of the autocorrelation structure of a time series process. The tests were based on a parametric transformation of Bartlett's (1954) T_p-process, which amounts to extracting its martingale component, so that asymptotically the transformed process converges to a standard Brownian motion. The tests were applied to observable data, so there was no need to compute the residuals of a model, and the martingale transformation depended only on a set of unknown parameters under the null hypothesis. The aim of this paper is to extend the DHV procedure to test the specification of dynamic regression models. Here, we use the empirical spectral process of the residuals of the model because, in the presence of general explanatory variables, regression models do not completely specify the dynamics of the dependent variable, unlike the linear models studied by DHV. The transformation of the corresponding T_p-process depends, in addition to the unknown parameters, on the non-parametric cross-spectrum between the regressors and the regression error term, which is non-constant and different from zero when the regressors are only assumed to be weakly exogenous. A feasible transformation might be computed via a non-parametric smoothed estimator of this cross-spectrum. However, we show that we can avoid the smoothing in the feasible martingale
transformation by using the cross-periodogram directly, even though it is an inconsistent estimate of the cross-spectrum. In spite of this non-parametric aspect of our model, our tests have non-trivial power against local alternatives converging to the null at the parametric rate n^{−1/2}. The remainder of the paper is organized as follows. Section 2 introduces the model and describes the testing problem. Section 3 presents the transformation used to obtain asymptotically distribution-free tests, whereas Section 4 discusses the power of our tests. Section 5 describes a Monte Carlo experiment that sheds some light on the finite sample performance of our test and on how it compares with portmanteau tests based on non-parametric smoothing, as well as with directional and smooth tests. Finally, the proofs are collected in the Appendix.
2. DYNAMIC MODELS

This section discusses methods for testing the correct specification of dynamic regression models
$$X_t = \mu_0 + \alpha_{01} X_{t-1} + \cdots + \alpha_{0p} X_{t-p} + \beta_0' Z_t + \varepsilon_t, \qquad (2.1)$$
where $Z_t$ is a $q$-dimensional vector of deterministic and/or (weakly) exogenous variables and where the parameter vector $\theta_0 = (\mu_0, \alpha_0', \beta_0')'$ is identified as the solution of the $p+q+1$ moment conditions
$$E\bigl[W_t\,(X_t - \theta' W_t)\bigr] = 0, \qquad (2.2)$$
where $W_t = (1, X_{t-1}, \ldots, X_{t-p}, Z_t')'$ and $E(W_t W_t')$ is a positive definite matrix. The models considered in (2.1), also known as ARX models, are an important extension of those examined in DHV. Notice that some components of $Z_t$ may be lagged values, for example $Z_{kt} = Z_{j,t-\ell}$ for some $\ell \geq 1$. In the context of model (2.1), a natural assumption is that
$$E\bigl[\varepsilon_t \mid \mathcal{F}\{\varepsilon_s, Z_{s+1}, s < t\}\bigr] = 0, \qquad (2.3)$$
where $\mathcal{F}\{\varepsilon_s, Z_{s+1}, s < t\}$ is the $\sigma$-algebra generated by $\{\varepsilon_s, Z_{s+1}, s < t\}$. Equation (2.3) implies that $E[Z_t \varepsilon_s] = 0$ for all $s \geq t$, although it allows for feedback from $\varepsilon_t$ to $Z_{t+j}$, $j > 0$. The latter implies that it is possible that the cross-autocovariance of $Z_t$ and $\varepsilon_t$ satisfies $\gamma_{Z\varepsilon}(j) = E[Z_{t+j}\varepsilon_t] \neq 0$ for some $j > 0$. Denoting henceforth the cross-spectral density function between the sequences $\{U_t\}_{t\in\mathbb{Z}}$ and $\{V_t\}_{t\in\mathbb{Z}}$ by $f_{UV}$, one consequence of the latter is that the cross-spectral density function between $\{Z_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$, $f_{Z\varepsilon}$, defined by
$$\gamma_{Z\varepsilon}(j) = \int_{-\pi}^{\pi} f_{Z\varepsilon}(\lambda)\, e^{ij\lambda}\, d\lambda, \qquad j = 0, \pm 1, \pm 2, \ldots,$$
is not a null function. That is, the sequence $\{Z_t\}_{t\in\mathbb{Z}}$ is only predetermined in (2.1).

The null hypothesis of interest is that the errors $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ in (2.1) are not autocorrelated; in other words, that the regression model (2.1) captures the linear dynamic structure of $\{X_t\}_{t\in\mathbb{Z}}$. More specifically, for given $\theta$, define the residuals $\{\varepsilon_t(\theta)\}_{t\in\mathbb{Z}}$ by
$$\varepsilon_t(\theta) := X_t - \theta' W_t, \qquad (2.4)$$
and their autocovariance structure by $\gamma_\varepsilon(j;\theta) := E(\varepsilon_t(\theta)\varepsilon_{t+j}(\theta))$. Then, our null hypothesis of interest is
$$H_0: \gamma_\varepsilon(j;\theta_0) = 0 \quad \text{for all } |j| \geq 1 \text{ and some } \theta_0 \in \Theta \subset \mathbb{R}^{p+q+1}.$$
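As a quick numerical illustration of the null hypothesis (residual autocovariances $\gamma_\varepsilon(j;\theta_0) = 0$ for all $|j| \geq 1$), the sketch below simulates a correctly specified ARX(1,1), estimates $\theta$ by least squares and checks that the residual autocovariances at positive lags are negligible. The parameter values and variable names are our illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
alpha, beta = 0.5, 1.0           # illustrative true values
z = rng.standard_normal(n)       # exogenous regressor
eps = rng.standard_normal(n)     # white-noise errors: H0 holds
x = np.zeros(n)
for t in range(1, n):
    x[t] = alpha * x[t - 1] + beta * z[t] + eps[t]

# least-squares fit of X_t on W_t = (X_{t-1}, Z_t)'
W = np.column_stack([x[:-1], z[1:]])
theta_hat = np.linalg.lstsq(W, x[1:], rcond=None)[0]
res = x[1:] - W @ theta_hat

# sample autocovariances of the residuals at lags 1..5: all near zero under H0
gamma = [np.mean(res[j:] * res[:-j]) for j in range(1, 6)]
```

Under a misspecified model (e.g. omitting a relevant lag), the same autocovariances would stay bounded away from zero, which is what the tests below are designed to detect.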
C The Author(s). Journal compilation C Royal Economic Society 2009.
Goodness-of-fit tests for dynamic models
We are interested in omnibus tests, where the alternative hypothesis is the negation of the null. The compact set $\Theta := A \times \mathbb{R}^{q+1}$ is chosen such that for all $\alpha \in A$, all the roots of the polynomial
$$\alpha(z) := 1 - \alpha_1 z - \cdots - \alpha_p z^p \qquad (2.5)$$
are outside the unit disk. Notice that the least-squares estimator of the parameters may be inconsistent if $H_0$ does not hold, even if the true value of $\alpha$ is zero.

REMARK 2.1. It is worth mentioning that we could allow for so-called ARMAX models, i.e. (2.1) with $\varepsilon_t = \rho_1 \varepsilon_{t-1} + \cdots + \rho_\ell \varepsilon_{t-\ell} + \eta_t$. In this latter scenario, our null hypothesis would be that $\{\eta_t\}_{t\in\mathbb{Z}}$ is a white-noise sequence. However, we shall consider (2.1) because of its generality and mathematical simplicity in terms of arguments and notation. Also, the extension of (2.1) to non-linear specifications is fairly straightforward and will not be pursued in this paper.

As in DHV, we can write the null hypothesis $H_0$ in the frequency domain. Indeed, let $f_\varepsilon(\lambda;\theta)$ denote the spectral density function of $\{\varepsilon_t(\theta)\}_{t\in\mathbb{Z}}$ in (2.4), that is
$$\gamma_\varepsilon(j;\theta) = \int_{-\pi}^{\pi} f_\varepsilon(\lambda;\theta)\exp(ij\lambda)\, d\lambda, \qquad j = 0, \pm 1, \ldots,$$
and denote its spectral distribution function by $F_\varepsilon(\lambda;\theta)$, i.e.
$$F_\varepsilon(\lambda;\theta) := 2\int_0^\lambda f_\varepsilon(\omega;\theta)\, d\omega.$$
Under $H_0$, these are respectively the spectral density and distribution functions of $\{\varepsilon_t(\theta_0)\}_{t\in\mathbb{Z}} = \{\varepsilon_t\}_{t\in\mathbb{Z}}$. Then, we can equivalently write the null hypothesis $H_0$ as
$$H_0: \frac{F_\varepsilon(\lambda;\theta_0)}{F_\varepsilon(\pi;\theta_0)} = \frac{\lambda}{\pi} \quad \text{for all } \lambda \in [0,\pi] \text{ and some } \theta_0 \in \Theta, \qquad (2.6)$$
the alternative hypothesis $H_1$ being the negation of $H_0$. Thus, the null hypothesis $H_0$ in (2.6) states that there exists a parameter value $\theta_0 \in \Theta$ such that the sequence $\{\varepsilon_t(\theta_0)\}_{t\in\mathbb{Z}}$ has a constant spectral density function, i.e. the errors are uncorrelated.

A natural estimator of $F_\varepsilon(\lambda;\theta)$ is
$$\hat F_n(\lambda;\theta) := \frac{2\pi}{\tilde n} \sum_{j=1}^{[\tilde n \lambda/\pi]} I_{\varepsilon\varepsilon}(\lambda_j;\theta), \qquad (2.7)$$
where $\lambda_j := 2\pi j/n$, for $j = 1, \ldots, \tilde n$, $\tilde n := [n/2]$, $[\cdot]$ denoting the integer part, and
$$I_{\varepsilon\varepsilon}(\lambda;\theta) := \frac{1}{2\pi n}\left| \sum_{t=1}^n \varepsilon_t(\theta)\, e^{it\lambda} \right|^2$$
is the periodogram of the sequence $\{\varepsilon_t(\theta)\}_{t=1}^n$ defined in (2.4). In what follows, for a generic function $g(\cdot;\theta)$, we shall suppress any reference to $\theta$ when the function is evaluated at the true value $\theta_0$; that is, $g(\cdot;\theta_0) =: g(\cdot)$. Observe that the estimator $\hat F_n(\lambda;\theta)$ is location invariant, due to the omission of $j = 0$ in (2.7). Thus, there is no need to centre the residuals or to estimate the mean $\mu$ in (2.1). See Remark 2.2 below for a more explicit explanation and some implications.
M. A. Delgado, J. Hidalgo and C. Velasco
If the true value of $\theta$, $\theta_0$, were known, or equivalently if we could observe the sequence $\{\varepsilon_t\}_{t=1}^n$, then following Bartlett (1954) we might perform a goodness-of-fit test using the $T_p$-process
$$\hat T_n(\omega;\theta) := \tilde n^{1/2}\left( \frac{\hat F_n(\pi\omega;\theta)}{\hat F_n(\pi;\theta)} - \omega \right), \qquad \omega \in [0,1], \qquad (2.8)$$
evaluated at $\theta = \theta_0$. Recall that in this case we denote $\hat T_n(\omega;\theta_0)$ by $\hat T_n(\omega)$. Before we present the properties of $\hat T_n(\omega)$, let us introduce the following regularity assumption.

ASSUMPTION 2.1. $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is a zero mean sequence of random variables such that $E(\varepsilon_t\varepsilon_s) = \sigma_\varepsilon^2 I(t = s)$, $E[\varepsilon_t^k \mid \mathcal{F}_{t-1}] = \kappa_k$, $k = 1, \ldots, 3$, and $E|\varepsilon_t|^k = \mu_k$, $k = 3, \ldots, 8$, with $\mu_8 < \infty$, where $\mathcal{F}_{t-1}$ is the $\sigma$-algebra of events generated by $\{\varepsilon_s, Z_{s+1}, s < t\}$. Herewith, we denote the indicator function by $I(\cdot)$.

Assumption 2.1 is similar to that given in Dahlhaus (1985), who only assumed constant conditional moments up to the third order. This implies that the fourth-order spectral density function of the process $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is not necessarily constant (cf. lemma 2 in DHV). Now, denoting by $B(\omega)$ the standard Brownian bridge on $[0,1]$, we have the following proposition.

PROPOSITION 2.1. Under Assumption 2.1, we have that $\hat T_n(\cdot) \Rightarrow B(\cdot)$ in the Skorohod metric space $D[0,1]$.
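Proposition 2.1 can be illustrated numerically: for observed white-noise errors, the sup of Bartlett's $T_p$-process stays within the typical range of a Brownian bridge, while strong autocorrelation pushes it far outside. The sketch below (our illustration, not the authors' code) builds $\hat T_n(\omega)$ from the cumulative periodogram; note that omitting the zero frequency makes it invariant to a mean shift.

```python
import numpy as np

def Tp_process(e):
    """Bartlett's Tp-process ntil^{1/2} (F_hat(pi w)/F_hat(pi) - w) on the
    grid w_j = j/ntil, with F_hat the cumulative periodogram (j = 0 omitted)."""
    n = len(e)
    I = np.abs(np.fft.fft(e)) ** 2 / (2 * np.pi * n)  # periodogram at 2*pi*j/n
    ntil = n // 2
    csum = np.cumsum(I[1:ntil + 1])
    w = np.arange(1, ntil + 1) / ntil
    return np.sqrt(ntil) * (csum / csum[-1] - w)

rng = np.random.default_rng(3)
white = rng.standard_normal(600)              # H0: flat spectrum
ar = np.zeros(600)
shocks = rng.standard_normal(600)
for t in range(1, 600):
    ar[t] = 0.8 * ar[t - 1] + shocks[t]       # strongly autocorrelated

ks_white = np.abs(Tp_process(white)).max()    # ~ sup of a Brownian bridge
ks_ar = np.abs(Tp_process(ar)).max()          # diverges with n
# mean-shift invariance: frequency zero is dropped from the cumulative sum
shift = np.abs(Tp_process(white + 10.0) - Tp_process(white)).max()
```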
The statistic given in (2.8) is not feasible, as it depends on the unknown vector of parameters $\theta_0$. To be able to compute (2.8), and so the test, we shall replace $\theta_0$ by, for example, the least-squares estimator, denoted $\hat\theta_n$.

ASSUMPTION 2.2. Under $H_0$, it holds that $\hat\theta_n - \theta_0 = O_p(n^{-1/2})$.

Sufficient conditions for Assumption 2.2 are the stationarity of $\{Z_t\}_{t\in\mathbb{Z}}$, condition (2.3) and $\gamma_{Z\varepsilon}(0) = 0$. Notice that, in contrast to DHV, Assumption 2.2 does not require a linear expansion of $\hat\theta_n$, only its rate of convergence. This is due to the explicit solution of the least-squares estimator. Also, we shall not give explicit conditions under which the sequence $\{Z_t\varepsilon_t\}_{t\in\mathbb{Z}}$, and so $\hat\theta_n$, satisfies the central limit theorem.

REMARK 2.2. It is worth noticing that the least-squares estimator of $(\alpha', \beta')'$ is given by the minimization of $\hat F_n(\pi;\theta)$. That is,
$$(\hat\alpha_n', \hat\beta_n')' = \arg\min_{(\alpha',\beta')'} \sum_{j=1}^{\tilde n} \bigl| w_X(\lambda_j) - \alpha' w_{X^-}(\lambda_j) - \beta' w_Z(\lambda_j) \bigr|^2 = \arg\min_{(\alpha',\beta')'} \hat F_n(\pi;\theta), \qquad (2.9)$$
where $w_X(\lambda_j)$, $w_{X^-}(\lambda_j)$ and $w_Z(\lambda_j)$ are respectively the discrete Fourier transforms of $\{X_t\}_{t=1}^n$, $\{(X_{t-1}, \ldots, X_{t-p})'\}_{t=1}^n$ and $\{Z_t\}_{t=1}^n$. So, observing that we do not employ the frequency $\lambda_j = 0$ to compute $\hat F_n(\pi;\hat\theta_n)$, $\hat F_n(\pi;\hat\theta_n)$ is independent of the intercept estimator $\hat\mu_n$. The latter implies that the computation of $\hat T_n(\omega;\hat\theta_n)$ is independent of the intercept $\mu$. For this reason, and to simplify notation, in what follows we shall assume that there is no intercept in (2.1) and accordingly
that $W_t = (X_{t-1}, \ldots, X_{t-p}, Z_t')'$ and $\theta = (\alpha', \beta')'$. Moreover, when we have trend regressors, such as polynomial trends, apart from a different rate of convergence of $\hat\theta_n$, the distribution of $\hat T_n(\omega;\hat\theta_n)$ is asymptotically independent of the estimation of the trend component of the regression model. Hence, in what follows we can consider the model
$$X_t = \alpha_{01} X_{t-1} + \cdots + \alpha_{0p} X_{t-p} + \beta_0' Z_t + \varepsilon_t \qquad (2.10)$$
without loss of generality. Also, notice that if we employed tapers, $\hat T_n(\omega;\hat\theta_n)$ would be invariant to the trend as well as to the intercept.

Now, once we have an estimator of the unknown parameters $\theta_0$, we can obtain the residuals as $\hat\varepsilon_t := \varepsilon_t(\hat\theta_n) = X_t - \hat\theta_n' W_t$, and with $I_{\hat\varepsilon\hat\varepsilon}(\lambda_j) := I_{\varepsilon\varepsilon}(\lambda_j;\hat\theta_n)$, we set
$$\hat F_n(\omega;\hat\theta_n) := \frac{2\pi}{\tilde n} \sum_{j=1}^{[\tilde n\omega]} I_{\hat\varepsilon\hat\varepsilon}(\lambda_j).$$
So, the feasible $T_p$-process is defined as in (2.8) but with $\hat\theta_n$ replacing $\theta$. That is,
$$\hat T_n(\omega;\hat\theta_n) = \tilde n^{1/2}\left( \frac{\hat F_n(\pi\omega;\hat\theta_n)}{\hat F_n(\pi;\hat\theta_n)} - \omega \right). \qquad (2.11)$$
Before we describe the asymptotic properties of $\hat T_n(\omega;\hat\theta_n)$, we introduce the following regularity assumption.

ASSUMPTION 2.3. (i) The cross-spectrum $f_{Z\varepsilon}(\lambda)$ is differentiable at all $\lambda \in [-\pi,\pi]$. (ii) The spectral density matrix $f_{ZZ}(\lambda)$ is continuous for all $\lambda \in [-\pi,\pi]$. (iii) The higher order (cross) spectral densities up to eighth order of $\{Z_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ are bounded.

Assumption 2.3(i) could be replaced by some Lipschitz condition, but that might complicate some of the technical arguments. Nevertheless, the assumption as it stands is very mild, and it is satisfied for most models employed with real data. Next, because all the roots of the polynomial $\alpha(z)$ in (2.5) are outside the unit disk, the stationary solution of $X_t$ is given by $\alpha_0(L)^{-1}(\varepsilon_t + \beta_0' Z_t)$, where $\alpha_0(z)$ is defined in (2.5) with $\alpha = \alpha_0$. Thus, it follows that
$$f_{X^-,\varepsilon}(\lambda) = \frac{L_p(e^{i\lambda})}{\alpha_0(e^{i\lambda})}\left( \frac{\sigma_\varepsilon^2}{2\pi} + \beta_0' f_{Z\varepsilon}(\lambda) \right), \qquad (2.12)$$
with $L_p(z) = (z, \ldots, z^p)'$, so that Assumption 2.3(i) implies that $f_{W\varepsilon}(\lambda)$ is differentiable everywhere in $\lambda \in [-\pi,\pi]$. One implication of (2.12) and Assumption 2.1 is that $\Phi(1) = 0$, where
$$\Phi(\omega) := \int_0^\omega \phi(v)\, dv, \qquad \omega \in [0,1],$$
and $\phi(\omega) = 4\pi \operatorname{Re} f_{W\varepsilon}(\pi\omega) = 4\pi \operatorname{Re}\bigl(f_{X^-,\varepsilon}(\pi\omega)', f_{Z\varepsilon}(\pi\omega)'\bigr)'$, by orthogonality between $\{W_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ and the evenness (oddness) of the real (imaginary) part of $f_{W\varepsilon}(\lambda)$. However, it is important to emphasize that we are not assuming that $f_{W\varepsilon}(\lambda) = 0$ for all $\lambda$. In fact, this is not the case, because $E[Z_t\varepsilon_s]$ can be different from zero for some $t > s$. This is one of the main features of our specification in (2.1)/(2.10).
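Remark 2.2's identity, namely that the least-squares slopes minimize $\hat F_n(\pi;\theta)$ while the omission of $\lambda_j = 0$ removes the intercept, can be checked numerically. The following sketch (illustrative parameter values and variable names of our own) evaluates $\hat F_n(\pi;\theta)$ from residual periodograms and verifies that the OLS-with-intercept slopes minimize it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 302
z = rng.standard_normal(n)
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 2.0 + 0.5 * x[t - 1] + 1.0 * z[t] + eps[t]   # model with intercept
X, Xlag, Z = x[1:], x[:-1], z[1:]
m = len(X)

def F_pi(alpha, beta):
    """F_hat(pi; theta) = (2 pi/ntil) sum_{j=1}^{ntil} I_ee(lambda_j; theta);
    frequency j = 0 is omitted, so any intercept drops out."""
    res = X - alpha * Xlag - beta * Z
    I = np.abs(np.fft.fft(res)) ** 2 / (2 * np.pi * m)
    ntil = m // 2
    return (2 * np.pi / ntil) * I[1:ntil + 1].sum()

# OLS with an intercept: the slope estimates minimize F_pi
coef = np.linalg.lstsq(np.column_stack([np.ones(m), Xlag, Z]), X, rcond=None)[0]
a_hat, b_hat = coef[1], coef[2]
```

By Parseval's identity, the sum of the residual periodogram over the non-zero frequencies equals the demeaned residual sum of squares, which is why excluding $\lambda_j = 0$ makes the criterion blind to the intercept.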
On the other hand, Assumptions 2.3(i)–(ii) imply that $f_{WW}(\lambda)$ is bounded for all $\lambda \in [-\pi,\pi]$, because
$$f_{WW}(\lambda) = \begin{pmatrix} f_{X^-,X^-}(\lambda) & f_{X^-,Z}(\lambda) \\ f_{Z,X^-}(\lambda) & f_{ZZ}(\lambda) \end{pmatrix},$$
with
$$f_{X^-,X^-}(\lambda) = \frac{L_p(e^{i\lambda})}{\alpha_0(e^{i\lambda})}\left( \frac{\sigma_\varepsilon^2}{2\pi} + 2\beta_0' \operatorname{Re} f_{Z\varepsilon}(\lambda) + \beta_0' f_{ZZ}(\lambda)\beta_0 \right) \frac{L_p(e^{-i\lambda})'}{\alpha_0(e^{-i\lambda})}, \qquad
f_{X^-,Z}(\lambda) = \frac{L_p(e^{i\lambda})}{\alpha_0(e^{i\lambda})}\bigl( f_{\varepsilon Z}(\lambda) + \beta_0' f_{ZZ}(\lambda) \bigr).$$
Finally, Assumption 2.3(iii) implies eight finite moments for $Z_t$ and $X_t$, as assumed for $\varepsilon_t$ in Assumption 2.1. However, the requirement of bounded higher order spectra of $\{W_t\}_{t\in\mathbb{Z}}$ can be relaxed, as in DHV, at the expense of much lengthier arguments.

In what follows, for two vector sequences $\{V_t\}_{t=1}^n$ and $\{U_t\}_{t=1}^n$, we denote their cross-periodogram by
$$I_{VU}(\lambda) := \frac{1}{2\pi n}\left( \sum_{t=1}^n V_t e^{it\lambda} \right)\left( \sum_{t=1}^n U_t e^{-it\lambda} \right)'.$$

PROPOSITION 2.2. Under Assumptions 2.1–2.3 and $H_0$, we have that
$$\hat T_n(\omega;\hat\theta_n) = \hat T_n(\omega) - \frac{4\pi}{\sigma_\varepsilon^2 \tilde n} \sum_{j=1}^{[\omega\tilde n]} \operatorname{Re} I_{\varepsilon W}(\lambda_j)\, \tilde n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr) + o_p(1) = \hat T_n(\omega) - \Phi(\omega)'\, \tilde n^{1/2}\bigl(\hat\theta_n - \theta_0\bigr)/\sigma_\varepsilon^2 + o_p(1), \qquad (2.13)$$
where the $o_p(1)$ is uniform in $\omega \in [0,1]$.

REMARK 2.3. The second equality in (2.13) follows because, under weak regularity conditions, Brillinger (1981) implies that
$$\sup_\omega \left| \frac{2\pi}{\tilde n} \sum_{j=1}^{[\omega\tilde n]} \bigl( I_{VU}(\lambda_j) - f_{VU}(\lambda_j) \bigr) \right| = o_p(1).$$
REMARK 2.4. Proceeding as in the proof of Theorem 2 of DHV, Propositions 2.1 and 2.2 imply that the asymptotic distribution of $\hat T_n(\omega;\hat\theta_n)$ depends, in general, on $\hat\theta_n$ and so on the model, as in other goodness-of-fit tests with estimated parameters. However, since the aim of the paper is to describe distribution-free (pivotal) tests, we will not explicitly examine the asymptotic distribution of $\hat T_n(\omega;\hat\theta_n)$.

REMARK 2.5. (Strong) exogeneity and predetermined regressors. When the regressors $Z_t$ are (strongly) exogenous, we have that $f_{Z\varepsilon}(\lambda) = 0$ for all $\lambda \in [0,\pi]$, and hence $\phi(\omega) = (\phi_1(\omega)', 0_q')'$, where $\phi_1(\omega) := 4\pi \operatorname{Re} f_{X^-\varepsilon}(\pi\omega)$. The latter, together with (2.13), implies that
$$\hat T_n(\omega;\hat\theta_n) = \hat T_n(\omega) - \int_0^\omega \phi_1(v)'\, dv\; \tilde n^{1/2}(\hat\alpha_n - \alpha_0)/\sigma_\varepsilon^2 + o_p(1).$$
That is, similarly to the case where the regressors are deterministic, the estimation of $\beta$ in (2.1) has no influence on the asymptotic distribution of $\hat T_n(\omega;\hat\theta_n)$; only the least-squares estimator of $\alpha_0$ does. Moreover, in this case the function $\Phi(\omega)$ is known up to a set of parameters which can be consistently estimated by Assumption 2.2. But this case was already covered by DHV, and hence it is not of interest in this paper. On the other hand, it is worth mentioning that the null hypothesis that one particular component of $Z_t$ is (strongly) exogenous can be tested using the methods put forward in the paper.

From Proposition 2.2 and Remark 2.4, it is obvious that tests based on continuous functionals of $\hat T_n(\omega;\hat\theta_n)$ are not pivotal, as their asymptotic distribution depends on the model specified under the null hypothesis $H_0$ and on the unknown function $\phi(\cdot)$. The latter function depends not only on $\theta_0$ but also on the joint dynamic properties of $\{Z_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ described by $f_{Z\varepsilon}$, which is unknown to the practitioner. The next section introduces a linear transformation of $\hat T_n(\omega;\hat\theta_n)$ which converges weakly, under $H_0$, to the standard Brownian motion, denoted $B^0(\cdot)$, whose critical values are readily available.
3. DISTRIBUTION-FREE TESTS

We are looking for a linear transformation, say $\bar L$, such that $\bar L \hat T_n(\cdot;\hat\theta_n)$ converges weakly to the standard Brownian motion $B^0$ under $H_0$. This transformation must remove the effect of $\Phi(\omega)'\tilde n^{1/2}(\hat\theta_n - \theta_0)$ in the asymptotic linear expansion of $\hat T_n(\omega;\hat\theta_n)$; see Proposition 2.2. As pointed out in Remarks 2.2 and 2.4, we shall only consider the interesting case where the regressors $Z_t$ are only predetermined, but not strictly exogenous, so that the cross-spectral density $f_{Z\varepsilon}(\lambda)$ is not constant.

Abbreviating, for a generic function $h(\cdot)$, $h(\lambda_j)$ by $h_j$, and denoting $m_j = 2\pi I_{\varepsilon\varepsilon,j} - \sigma_\varepsilon^2$, we observe that, applying Proposition 2.2, we can write $\hat T_n(\omega;\hat\theta_n)$, up to terms of order $o_p(1)$, as
$$\frac{\tilde n^{-1/2}}{\hat F_n(\pi)} \sum_{j=1}^{[\omega\tilde n]} m_j - \frac{\omega\,\tilde n^{-1/2}}{\hat F_n(\pi)} \sum_{j=1}^{\tilde n} m_j - \frac{\Phi(\omega)'\,\tilde n^{1/2}}{\hat F_n(\pi)} \left( \sum_{j=1}^{\tilde n} I_{WW,j} \right)^{-1} \sum_{j=1}^{\tilde n} \operatorname{Re} I_{W\varepsilon,j}, \qquad (3.1)$$
which is similar to the corresponding expression given in DHV, but with our generic definition of $\phi(\omega)$. However, unlike in DHV, expression (3.1) cannot be directly identified as a CUSUM of least-squares residuals. Nevertheless, a similar martingale transformation based on a forward projection on the function $g(u) := (1, \phi(u)')'$ will remove the terms in (3.1) depending on $\int_0^\omega g(u)\, du$, i.e. $\Phi(\omega)$ and $\omega$. The latter are the non-martingale components in the tied-down empirical process with estimated parameters $\hat T_n(\omega;\hat\theta_n)$.

So, following arguments similar to those in DHV, we propose as our transformation $\bar L$,
$$\bar L \hat T_n(\omega;\hat\theta_n) := \hat T_n(\omega;\hat\theta_n) - \frac{\tilde n^{-1/2}}{\hat F_n(\pi;\hat\theta_n)} \sum_{j=1}^{[\omega\bar n]} g_j' \left( \sum_{k=j+1}^{\tilde n} g_k g_k' \right)^{-1} \sum_{k=j+1}^{\tilde n} g_k \hat m_k, \qquad (3.2)$$
where $\hat m_k := 2\pi I_{\hat\varepsilon\hat\varepsilon,k} - \hat F_n(\pi;\hat\theta_n)$ and $\bar n = \tilde n - p - q - 1$. The limiting continuous version of $\bar L$ is defined, for a generic function $\xi: [0,1] \to \mathbb{R}$, as
$$L^0\xi(\omega) := \xi(\omega) - \int_0^\omega g(v)'\, \Gamma^{-1}(v) \int_v^1 g(u)\, \xi(du)\, dv,$$
with $\Gamma(v) := \int_v^1 g(u) g(u)'\, du$. Before we examine the properties of $\bar L \hat T_n(\omega;\hat\theta_n)$ in (3.2), we need to introduce the following assumption.

ASSUMPTION 3.1. The matrix $\tilde n^{-1} \sum_{k=\bar n+1}^{\tilde n} g_k g_k'$ is non-singular.

THEOREM 3.1. Assume Assumptions 2.1–2.3 and 3.1. Then, under $H_0$,
$$\bar L \hat T_n(\omega;\hat\theta_n) \Rightarrow B^0 \quad \text{in the Skorohod metric space } D[0,1].$$

The transformation $\bar L$ is infeasible, as it depends on the unknown function $g(u)$. To construct a feasible version of $\bar L$, we need to replace $g(u)$ by some estimate. Recall that, from (2.12), $\phi(\omega) = 4\pi\operatorname{Re}\bigl(f_{X^-,\varepsilon}(\pi\omega)', f_{Z\varepsilon}(\pi\omega)'\bigr)'$, and because $f_{Z\varepsilon}$ is an unknown function, $\phi$ is a non-parametric function. The latter is one of the main differences with DHV's paper. Because of that, we shall propose two feasible transformations. The first one employs the standard average periodogram estimator of the (scaled real part of the) cross-spectrum between $\{W_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$, i.e.
$$\hat\phi_{m,j} := \hat\phi_m(j/\tilde n) = \frac{4\pi}{\bar K_m} \sum_{\ell=-m;\,\ell\neq 0}^{m} K_\ell \operatorname{Re} I_{W\hat\varepsilon, j+\ell}, \qquad (3.3)$$
where $K_\ell = K(\ell/m)$ and $\bar K_m = \sum_{\ell=-m;\,\ell\neq 0}^{m} K_\ell$. The second approach replaces $f_{W\varepsilon}$ by the cross-periodogram. The latter is a much more delicate matter, as the periodogram is not a consistent estimator of $f_{W\varepsilon}$, only an unbiased one, unlike the former approach or that in DHV, where the function $\phi(\cdot)$ was known up to a finite set of parameters.

ASSUMPTION 3.2. (i) $K(x)$ is a non-negative continuous symmetric function on $[-1,1]$. (ii) $m^{-2} n^{1+\delta} + m n^{-1} \to 0$, for some $\delta > 0$.

Non-parametric adjustment in related contexts has also been examined in Stute et al. (1998) and Stute and Zhu (2002). The estimator $\hat\phi_{m,j}$ is of the leave-one-out type, as it does not use the frequency $\lambda_j$ in its computation. This is done to guarantee the orthogonality in finite samples of $\hat\phi_{m,j}$ with respect to $I_{\hat\varepsilon\hat\varepsilon,j}$ for all $m$, using the well-known result of the approximate orthogonality between the discrete Fourier transforms of vector time series at different Fourier frequencies. We need to strengthen Assumption 2.3.

ASSUMPTION 2.3'. Assumption 2.3 holds and $f_{W\varepsilon}(\lambda)$ has two bounded derivatives.

Thus, in practice, we can take the discrete sample counterpart of $\bar L \hat T_n(\omega;\hat\theta_n)$,
$$\bar L_n \hat T_n(\omega;\hat\theta_n) := \hat T_n(\omega;\hat\theta_n) - \frac{\tilde n^{-3/2}}{\hat F_n(\pi;\hat\theta_n)} \sum_{j=1}^{[\bar n\omega]} \hat g_m\!\left(\frac{j}{\tilde n}\right)' \hat\Gamma_m^{-1}\!\left(\frac{j}{\tilde n}\right) \sum_{k=1+j}^{\tilde n} \hat g_m\!\left(\frac{k}{\tilde n}\right) \hat m_k, \qquad (3.4)$$
where $\hat g_m(\omega) := (1, \hat\phi_m(\omega)')'$ and $\hat\Gamma_m(\omega) := \tilde n^{-1} \sum_{j=1+[\tilde n\omega]}^{\tilde n} \hat g_{m,j} \hat g_{m,j}'$.
THEOREM 3.2. Assume Assumptions 2.1–2.2, 2.3' and 3.1–3.2. Then, under $H_0$,
$$\bar L_n \hat T_n(\omega;\hat\theta_n) \Rightarrow B^0 \quad \text{in the Skorohod metric space } D[0,1].$$
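A sketch of the leave-one-out smoothed estimator in (3.3), using the Tukey–Hanning kernel employed later in Section 5 (written here as $K(x) = \tfrac12(1+\cos\pi x)$ on $[-1,1]$; the sample sizes, bandwidth and variable names are our illustrative choices):

```python
import numpy as np

def K(x):
    """Tukey-Hanning kernel on [-1, 1]."""
    return 0.5 * (1.0 + np.cos(np.pi * x))

def phi_hat(w, e, j, m):
    """Leave-one-out average-periodogram estimate of 4*pi*Re f_{we}(lambda_j):
    kernel-weighted average of the real part of the cross-periodogram over
    frequencies lambda_{j+l}, l = -m..m, l != 0."""
    n = len(e)
    t = np.arange(1, n + 1)
    num = den = 0.0
    for l in range(-m, m + 1):
        if l == 0:
            continue                         # lambda_j itself is left out
        lam = 2.0 * np.pi * (j + l) / n
        dw = np.sum(w * np.exp(1j * t * lam))
        de = np.sum(e * np.exp(-1j * t * lam))
        cross = (dw * de).real / (2.0 * np.pi * n)   # Re I_{we}(lambda_{j+l})
        num += K(l / m) * cross
        den += K(l / m)
    return 4.0 * np.pi * num / den

rng = np.random.default_rng(4)
n = 1024
e = rng.standard_normal(n)
w_indep = rng.standard_normal(n)           # independent of e: Re f_{we} = 0
phi_zero = phi_hat(w_indep, e, 40, 25)     # should be near 0
phi_self = phi_hat(e, e, 40, 25)           # f_ee = 1/(2*pi), so 4*pi*f = 2
```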
Note that the proof of this result does not show that $\sup_{\omega\in[0,1]} |\bar L_n \hat T_n(\omega;\hat\theta_n) - \bar L \hat T_n(\omega)| = o_p(1)$, as was necessary in DHV's proofs.

We now describe the unsmoothed version of the feasible transformation. Here the aim is to use the cross-periodogram instead of $g_k$ or a consistent estimate of it. We propose to employ the transformation
$$\check L_n \hat T_n(\omega;\hat\theta_n) = \hat T_n(\omega;\hat\theta_n) - \frac{\tilde n^{-1/2}}{\hat F_n(\pi;\hat\theta_n)} \sum_{j=1}^{[\bar n\omega]} \hat g_{j+1}' \left( \sum_{k=j+1}^{\tilde n} \hat g_{k+2} \hat g_{k+2}' \right)^{-1} \sum_{k=j+1}^{\tilde n} \hat g_{k+2} \hat m_{k+1}, \qquad (3.5)$$
where $\hat g_j = I_{W\hat\varepsilon,j}$, $j = 1, \ldots, \tilde n$. The reason to employ, for example, $\sum_{k=j+1}^{\tilde n} \hat g_{k+2} \hat m_{k+1}$ instead of $\sum_{k=j+1}^{\tilde n} \hat g_k \hat m_k$ as in (3.4) is that, contrary to the latter, there is a leverage effect from $\hat g_{j+1}$ on $\sum_{k=j+1}^{\tilde n} \hat g_k \hat m_k$ which does not vanish sufficiently fast, as it does in the smoothed version or in the case examined in DHV. At the same time, we guarantee that $\hat g_{k+2} \hat m_{k+1}$ is approximately centred, because $\hat g$ and $\hat m$ have different indices. Then, we have our next result.

THEOREM 3.3. Assume Assumptions 2.1–2.2, 2.3' and 3.1–3.2. Then, under $H_0$, the unsmoothed transformation given in (3.5) satisfies
$$\check L_n \hat T_n(\omega;\hat\theta_n) \Rightarrow B^0 \quad \text{in the Skorohod metric space } D[0,1].$$
Theorems 3.2 and 3.3 justify asymptotically admissible tests based on continuous functionals of $\check L_n \hat T_n(\omega;\hat\theta_n)$, as stated in the following corollary.

COROLLARY 3.1. For any continuous functional $\varphi: D[0,1] \to \mathbb{R}^+$, under $H_0$ and the same conditions as in Theorem 3.3,
$$\varphi\bigl(\check L_n \hat T_n(\omega;\hat\theta_n)\bigr) \stackrel{d}{\to} \varphi(B^0).$$
Note that the non-parametric estimation does not affect the first-order asymptotics of the tests, which have the same limiting behaviour as if $g$ were known or parametrically modelled. However, the need to invert the $(p+q+1) \times (p+q+1)$ matrix $\hat\Gamma_m(\omega)$ on a discrete grid $\omega = j/\tilde n$ implies that this is only possible at $j = 1, \ldots, \bar n$, due to the loss of degrees of freedom incurred by estimating the parameters of the regression model (2.1). The distribution of $\varphi(B^0)$ can be tabulated by Monte Carlo. For the main goodness-of-fit proposals, Kolmogorov–Smirnov and Cramér–von Mises, $\varphi(B^0)$ is already tabulated, for instance in Shorack and Wellner (1986, pp. 34 and 748).
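The Monte Carlo tabulation mentioned above can be sketched directly: simulate standard Brownian-motion paths and take empirical quantiles of the chosen functional. The discretization and replication counts below are arbitrary illustrative choices, and the simulated critical values are only rough approximations to the tabulated ones.

```python
import numpy as np

rng = np.random.default_rng(7)
steps, reps = 256, 4000
# Standard Brownian motion on [0,1]: cumulative sums of N(0, 1/steps) increments
B = np.cumsum(rng.standard_normal((reps, steps)) / np.sqrt(steps), axis=1)

cvm = (B ** 2).mean(axis=1)        # Cramer-von Mises functional: integral of B0(w)^2
ks = np.abs(B).max(axis=1)         # Kolmogorov-Smirnov functional: sup |B0(w)|
cvm_crit = np.quantile(cvm, 0.95)  # simulated 5% critical values
ks_crit = np.quantile(ks, 0.95)
```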
4. LOCAL ALTERNATIVES AND CONSISTENCY

We consider two types of local alternatives: first a parametric one, and secondly a more general non-parametric type of alternative, which may suggest or establish the origin of the possible misspecification of the model given in (2.1).
4.1. Parametric alternatives

To study the power of our tests, let us consider local alternatives of the type
$$H_{an}: \alpha_{0,p+1} = \frac{c}{\tilde n^{1/2}} \quad \text{for some } c \neq 0. \qquad (4.1)$$
Similar results are available for other forms of misspecification, including errors in the modelling of the relationship between the sequences $\{Z_t\}_{t\in\mathbb{Z}}$ and $\{X_t\}_{t\in\mathbb{Z}}$.

THEOREM 4.1. Under the same conditions as in Theorem 3.3 and under $H_{an}$,
$$\bar L_n \hat T_n \Rightarrow B^0 + cL^0\Xi \quad \text{in the Skorohod metric space } D[0,1], \qquad (4.2)$$
where $\Xi(\omega) := \sigma_\varepsilon^{-2} \int_0^\omega \phi_{p+1}(u)\, du$, with
$$\phi_{p+1}(v) := 4\pi\operatorname{Re} f_{\varepsilon X_{-p-1}}(\pi v) = \operatorname{Re}\left[ \frac{\exp(i(p+1)\pi v)}{\alpha_0(e^{i\pi v})}\bigl( 2\sigma_\varepsilon^2 + 4\pi\beta_0' \operatorname{Re} f_{Z\varepsilon}(\pi v) \bigr) \right].$$
REMARK 4.1. Under the set of assumptions in the previous section, the proposed test has non-trivial power provided $Z_t$ cannot explain all the information contained in $X_{t-p-1}$ at all frequencies, i.e. there is a set of positive Lebesgue measure on which the spectral density matrix of $(Z_t', X_{t-p-1})'$ has full rank. This implies that, on a set of positive Lebesgue measure, the cross-spectral density $f_{X_{-p-1}\varepsilon}(\lambda)$ is not a linear combination of the rows of $f_{Z\varepsilon}(\lambda)$, which guarantees that $L^0\Xi$ is not identically zero. Therefore, for a suitable continuous functional $\varphi: D[0,1] \to \mathbb{R}^+$, such as the Cramér–von Mises or the Kolmogorov–Smirnov one, the limiting power exceeds the nominal level, and the test will detect local departures from the null of the type $H_{an}$ given in (4.1).

4.2. Non-parametric alternatives

We now consider the case where $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ does not have a flat spectrum up to an $n^{-1/2}$ factor. Notice that $H_{an}$ implies that the spectral density function of $\{\varepsilon_t(\theta_0)\}_{t\in\mathbb{Z}}$, where $\theta$ does not include $\alpha_{p+1}$, is
$$f(\lambda;\theta_0) = \frac{\sigma_\varepsilon^2}{2\pi} + \frac{2c}{\tilde n^{1/2}}\operatorname{Re} f_{\varepsilon X_{-p-1}}(\lambda) + \frac{c^2}{\tilde n} f_{X_{-p-1}}(\lambda) = \frac{\sigma^2}{2\pi}\left( 1 + c\,\frac{\phi_{p+1}(\lambda/\pi)}{\sigma^2}\,\tilde n^{-1/2} + O(c^2\tilde n^{-1}) \right).$$
So, we could consider non-parametric alternatives of the type
$$H_{an}: f(\lambda;\theta_0) = \frac{\sigma^2}{2\pi}\bigl( 1 + l(\lambda)\,\tilde n^{-1/2} \bigr) \quad \text{for some } \theta_0 \in \Theta,$$
where the function $l(\cdot)$ is not in the space spanned by $\phi(\cdot/\pi)$. The latter implies that the correlation structure of $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ can be explained neither by lagged values of $X_t$ nor by any of the components of the variables $Z_t$. It is worth noticing that the test has maximum power against alternatives for which $l(\cdot)$ belongs to the orthogonal complement of the space spanned by $g$. Then Theorem 4.1 holds for $H_{an}$ with $\Xi(\omega) := \int_0^\omega l(\pi u)\, du$ and $c = 1$.
The test is consistent in the direction of general fixed non-parametric or parametric alternatives in (4.1), such as $\alpha_{p+1} = c$, $c \neq 0$. Though a precise justification under suitable regularity conditions is possible, this is beyond the scope of this paper, and we only provide a sketch of the main arguments. Assuming certain regularity conditions (such as that $\alpha(L)$ has all its roots outside the unit circle), Assumption 2.1 could be replaced by a linear process specification, and Assumption 2.2 is satisfied under the alternative hypothesis $H_1$, where now $\theta_0$ denotes the pseudo-true value, defined by $\theta_0 := \arg\min_{\theta\in\Theta} F(\pi;\theta)$, which is such that the pseudo-innovations $\{\varepsilon_t(\theta_0)\}$ are autocorrelated under $H_1$. Denote by $f_\varepsilon(\lambda) := f_\varepsilon(\lambda;\theta_0)$ the (non-constant) spectral density of $\{\varepsilon_t\}_{t\in\mathbb{Z}}$. Indeed, proceeding as in DHV or Dahlhaus and Wefelmeyer (1996), we shall have that, for each $\omega \in [0,1]$,
$$\hat T_n(\omega;\hat\theta_n) = \hat T_n(\omega) + \Phi(\omega)'\,\frac{\tilde n^{1/2}(\hat\theta_n - \theta_0)}{\hat F_n(\pi)} + o_p(1).$$
Now,
$$\hat T_n(\omega) = \frac{1}{\hat F_n(\pi)}\,\frac{1}{\tilde n^{1/2}} \sum_{j=1}^{[\tilde n\omega]} \left( \frac{2\pi I_{\varepsilon\varepsilon,j}}{f_{\varepsilon,j}} - 1 \right) f_{\varepsilon,j} + \tilde n^{1/2}\left( \frac{1}{\hat F_n(\pi)}\,\frac{2\pi}{\tilde n} \sum_{j=1}^{[\tilde n\omega]} f_{\varepsilon,j} - \omega \right),$$
where, under suitable regularity conditions, the first term on the right-hand side is $O_p(1)$, whereas the expression inside the parentheses of the second term converges to a constant for each $\omega$. Thus, $|\hat T_n(\omega)|$ and $|\bar L_n \hat T_n(\omega)|$ diverge to infinity at the rate $n^{1/2}$. From here, the consistency of the test follows by standard arguments.

Following the discussion in DHV, we can use Theorem 4.1 to derive optimal tests for $H_0$ against the direction $l$ given in $H_{an}$. These test statistics are based on $\bar L_n \hat T_n(\lambda)$ and thus are also asymptotically distribution-free under $H_0$.
5. MONTE CARLO EXPERIMENT

This section presents a small simulation exercise to shed some light on the small sample behaviour of our tests. To that end, we have considered the ARX(1,1) model
$$X_t = \alpha_1 X_{t-1} + \beta_1 Z_{1t} + \varepsilon_t, \qquad t = 1, \ldots, n, \qquad (5.1)$$
where
$$Z_{1t} = a Z_{1(t-1)} + u_t, \qquad u_t = (1 - b^2)^{1/2} v_t + b\,\varepsilon_{t-1},$$
and $\{v_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ are mutually independent i.i.d. $N(0,1)$ variates. We have employed three sample sizes, $n = 100, 200, 400$, and the following values of the parameters:
$$\beta_1 \in \{0.2, 0.5, 1.0\}, \qquad \alpha_1 \in \{0.2, 0.5, 0.8\}, \qquad b \in \{0, 0.4, 0.8\},$$
whereas $a = 0.5$ for all combinations and sample sizes. The autoregressive parameters $\alpha_1$ and $a$ partially control the dependence structure of $\{X_t\}_{t\in\mathbb{Z}}$ and $\{Z_t\}_{t\in\mathbb{Z}}$. On the other hand, $b$, together with the regression coefficient $\beta_1$, measures the 'endogeneity' of $\{Z_t\}_{t\in\mathbb{Z}}$ in (5.1) (so that $Z_t$ is strongly exogenous if $b = 0$).
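In this design, $b$ creates feedback from $\varepsilon_{t-1}$ into $Z_{1t}$, so $Z_{1t}$ is predetermined but not strongly exogenous: $E[Z_{1t}\varepsilon_t] = 0$ while $E[Z_{1,t+1}\varepsilon_t] = b$. A quick simulation check of this property (a sketch with one illustrative parameter combination; the large $n$ is chosen only to make the moments visible):

```python
import numpy as np

rng = np.random.default_rng(11)
n, a, b, alpha1, beta1 = 20000, 0.5, 0.8, 0.5, 1.0
v = rng.standard_normal(n)
eps = rng.standard_normal(n)
u = np.sqrt(1.0 - b ** 2) * v
u[1:] += b * eps[:-1]                 # feedback from eps_{t-1} into u_t
z = np.zeros(n)
x = np.zeros(n)
for t in range(1, n):
    z[t] = a * z[t - 1] + u[t]
    x[t] = alpha1 * x[t - 1] + beta1 * z[t] + eps[t]

contemp = np.mean(z * eps)            # E[Z_t eps_t] = 0: Z is predetermined
feedback = np.mean(z[1:] * eps[:-1])  # E[Z_{t+1} eps_t] = b (here 0.8)
```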
We first estimate the parameters $\alpha_1$, $\beta_1$ and $\sigma_\varepsilon^2$ in (5.1) by (2.9), and for a given feasible transformation $L_n^m$ of $\hat T_n$ we compute the Cramér–von Mises statistic
$$C_n^m := \frac{1}{\tilde n - 3} \sum_{j=1}^{\tilde n - 3} \left( L_n^m \hat T_n\!\left(\frac{j}{\tilde n}\right) \right)^2,$$
where $m$ indicates the type of approximation of $\phi$ employed. We have considered three alternatives for the martingale transformation. The first one uses a non-consistent estimator of $\phi$, via the transformation $\check L_n$, and is denoted $C_n^0$ in Tables 1–5. For the cases where we estimate $\phi$ consistently, we use the Tukey–Hanning kernel in (3.3),
$$K_m(x) = \frac{1}{2}\left( 1 + \cos\frac{\pi x}{m} \right),$$
with bandwidth parameters $m = [0.25 n^{0.9}]$ and $[0.30 n^{0.9}]$. To be able to make comparisons, we provide results for the popular Ljung and Box (1978) Portmanteau test
$$Q_p := n(n+2) \sum_{j=1}^{p} \frac{\hat\rho_{\hat\varepsilon}(j)^2}{n - j},$$
where
$$\hat\rho_{\hat\varepsilon}(j) := \left( \sum_{t=1}^n \hat\varepsilon_t^2 \right)^{-1} \sum_{t=j+1}^{n} \hat\varepsilon_t \hat\varepsilon_{t-j}, \qquad j \geq 1,$$
are the sample autocorrelations of the residuals $\{\hat\varepsilon_t\}_{t=1}^n$, for two choices of $p$. For $n = 100, 200$ we chose $p = 10, 15$, whereas for $n = 400$, $p = 15, 20$. These choices are close to $n^{1/2}$, which seems a reasonable compromise in terms of size and power. As in Hong (1996), we employ a standardized version of $Q_p$, which we compare against standard normal critical values.

For power comparisons we consider two local alternatives. The first one is based on the ARX(2,1) model
$$X_t = \alpha_1 X_{t-1} + 0.5\,\frac{5}{n^{1/2}}\, X_{t-2} + \beta_1 Z_{1t} + \varepsilon_t,$$
whereas the second local alternative is the ARMAX(1,1,1) model
$$X_t = \alpha_1 X_{t-1} + \beta_1 Z_{1t} + 0.5\,\frac{5}{n^{1/2}}\,\varepsilon_{t-1} + \varepsilon_t.$$
We report the percentage of rejections in 100,000 Monte Carlo replications. The empirical sizes of tests based on $C_n^0$ improve with the sample size, but they also appear to depend on the model under consideration. More specifically, the percentage of rejections under $H_0$ increases with $\alpha_1$, $b$ and $\beta_1$ for all sample sizes. On the other hand, the sizes of $C_n^m$ are more stable, although there is some dependence on the value of $b$, perhaps due to some additional dependence on $m$. $Q_p$ provides better sizes for the smaller values of $n$ but similar ones for the larger values. Here the choice of $p$ seems to be quite important, with the number of rejections increasing with $p$ and also with $\alpha_1$ and $\beta_1$, although it decreases with $b$.

For the power analysis we only report the simulations with $n = 200$, the picture for other sample sizes being similar, although for $n = 100$ the results show some instability, perhaps due to the oversize of the tests for some parameter combinations. For AR(2) alternatives,
[Table 1. Size of 5% tests, $n = 100$: percentage of rejections of $C_n^0$, $C_n^{15}$, $C_n^{18}$, $Q_{10}$ and $Q_{15}$ for $\alpha_1 \in \{0.2, 0.5, 0.8\}$, $\beta_1 \in \{0.2, 0.5, 1.0\}$ and $b \in \{0, 0.4, 0.8\}$.]
[Table 2. Size of 5% tests, $n = 200$: percentage of rejections of $C_n^0$, $C_n^{29}$, $C_n^{35}$, $Q_{10}$ and $Q_{15}$ for the same parameter combinations as Table 1.]
[Table 3. Size of 5% tests, $n = 400$: percentage of rejections of $C_n^0$, $C_n^{54}$, $C_n^{65}$, $Q_{15}$ and $Q_{20}$ for the same parameter combinations as Table 1.]
[Table 4. Power of 5% tests against the AR(2) local alternative, $n = 200$: percentage of rejections of $C_n^0$, $C_n^{29}$, $C_n^{35}$, $Q_{10}$ and $Q_{15}$.]
[Table 5. Power of 5% tests against the MA(1) local alternative, $n = 200$: percentage of rejections of $C_n^0$, $C_n^{29}$, $C_n^{35}$, $Q_{10}$ and $Q_{15}$.]
C_n^0 shows the highest power for models with high α1 and b; otherwise C_n^m dominates, with power decreasing in m. Tests based on C_n^m are in general dominated by C_n^0, except for the least persistent models (those with the lowest α1 and β1), for which φ is rather flat and can be well estimated by kernel estimates with some oversmoothing, as with the choices of m we employ. In general, power increases with α1 and β1 for small b, but the reverse occurs for large values of b. For the MA(1) alternative, C_n^0 dominates in almost every case, in some situations noticeably outperforming Q_p, while C_n^m displays much inferior results for all m.
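Entries like those in Tables 4 and 5 are empirical rejection frequencies over Monte Carlo replications. The sketch below reproduces the mechanics only for the familiar Ljung–Box statistic Q_p (the paper's C_n statistics involve Cramér–von Mises-type functionals and are not implemented here); the AR(2) design, burn-in length and replication count are illustrative assumptions, not the paper's exact simulation design.

```python
import numpy as np

def ljung_box(y, p):
    """Ljung-Box Q_p statistic for the null that y is white noise."""
    n = len(y)
    y = y - y.mean()
    denom = float(np.sum(y * y))
    q = sum((float(np.sum(y[k:] * y[:-k])) / denom) ** 2 / (n - k)
            for k in range(1, p + 1))
    return n * (n + 2) * q

def power_ar2(a1, a2, n=200, p=10, reps=500, crit=18.307, seed=0):
    """Empirical rejection frequency of the 5% Q_p test under AR(2) data.

    crit = 18.307 is the 5% critical value of a chi-square with 10 df.
    """
    rng = np.random.default_rng(seed)
    rej = 0
    for _ in range(reps):
        e = rng.standard_normal(n + 50)      # 50 burn-in observations
        y = np.zeros(n + 50)
        for t in range(2, n + 50):
            y[t] = a1 * y[t - 1] + a2 * y[t - 2] + e[t]
        rej += ljung_box(y[50:], p) > crit
    return rej / reps

print(power_ar2(0.5, 0.2))   # strong AR(2) dependence: rejection rate near 1
print(power_ar2(0.0, 0.0))   # white noise: close to the nominal 5% size
```

Averaging the indicator of rejection over replications is exactly how each percentage in such a table is produced; only the test statistic differs.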
ACKNOWLEDGMENTS

The research of the first and third authors is funded by the Spanish 'Plan Nacional de I+D+I', reference number SEJ2007-62908.
REFERENCES

Bartlett, M. S. (1954). Problèmes de l'analyse spectrale des séries temporelles stationnaires. Publications de l'Institut de Statistique de l'Université de Paris III, 119–34.
Billingsley, P. (1968). Convergence of Probability Measures. New York: John Wiley.
Brillinger, D. R. (1981). Time Series, Data Analysis and Theory. San Francisco: Holden-Day.
Dahlhaus, R. (1985). On the asymptotic distribution of Bartlett's U_p-statistic. Journal of Time Series Analysis 6, 213–27.
Dahlhaus, R. and W. Wefelmeyer (1996). Asymptotically optimal estimation in misspecified time series models. Annals of Statistics 24, 952–74.
Delgado, M. A., J. Hidalgo and C. Velasco (2005). Distribution free goodness-of-fit tests for linear processes. Annals of Statistics 33, 2568–609.
Hong, Y. (1996). Consistent testing for serial correlation of unknown form. Econometrica 64, 837–64.
Ljung, G. M. and G. E. P. Box (1978). On a measure of lack of fit in time series models. Biometrika 65, 297–303.
Shorack, G. R. and J. A. Wellner (1986). Empirical Processes with Applications to Statistics. New York: John Wiley.
Stute, W., S. Thies and L.-X. Zhu (1998). Model checks for regression: an innovation process approach. Annals of Statistics 26, 1916–34.
Stute, W. and L.-X. Zhu (2002). Model checks for generalized linear models. Scandinavian Journal of Statistics 29, 535–45.
APPENDIX: PROOFS

We first state two general lemmas.

LEMMA A.1. Let Assumptions 2.1–2.3 hold, and set ĝ_j = ĝ_m(j/ñ). Then, under H_0, as n → ∞,

    sup_j ||ĝ_j − g_j|| = o_p(1),   sup_j ||Γ̂_j − Γ_j|| = o_p(1).   (A.1)

Proof: We only prove the first part of (A.1), since the proof that sup_j ||Γ̂_j − Γ_j|| = o_p(1) follows by identical steps. Because g_j = (1, φ_j)', we ignore the first element. By the triangle inequality, the left-hand side of (A.1) is bounded by

    sup_j ||φ̂_j − φ̃_j|| + sup_j ||Eφ̃_j − φ_j|| + sup_j ||φ̃_j − Eφ̃_j||,   (A.2)

where, using the errors ε_t,

    φ̃_j := (4π/K̄_m) Σ_{ℓ=−m; ℓ≠0}^{m} K_ℓ Re I_{Wε,j+ℓ}.

To simplify the arguments, we shall take herewith K(u) = 1, so that φ̂_j = (4π/2m) Σ_{ℓ=−m; ℓ≠0}^{m} Re I_{Wε̂,j+ℓ}. Now

    sup_j ||φ̂_j − φ̃_j|| ≤ ||θ̂_n − θ_0|| sup_j (1/m) Σ_{ℓ=−m}^{m} ||Re I_{WW,j+ℓ}||
        ≤ m^{−1} n^{1/2} · n^{1/2}||θ̂_n − θ_0|| · n^{−1} Σ_{ℓ=1}^{ñ} ||Re I_{WW,ℓ}|| = O_p(m^{−1} n^{1/2}) = o_p(1),

because n^{1/2}||θ̂_n − θ_0|| = O_p(1) by Assumption 2.2 and n^{−1} Σ_{ℓ=1}^{ñ} ||Re I_{WW,ℓ}|| = O_p(1) by Assumption 2.3(i). The second term in (A.2) is O(m² n^{−2} + n^{−1} log n) = o(1) because of Assumption 2.3(i), whereas

    E||φ̃_j − Eφ̃_j||⁴ = (1/16m⁴) Σ_{a=−m}^{m} Σ_{b=−m}^{m} Σ_{c=−m}^{m} Σ_{d=−m}^{m} E[h_{j+a} h_{j+b} h_{j+c} h_{j+d}],

where we have treated h_j := Re I_{Wε,j} − E Re I_{Wε,j} as a scalar to simplify notation. Now

    E[h_{j+a} h_{j+b} h_{j+c} h_{j+d}] = E[h_{j+a} h_{j+b}] E[h_{j+c} h_{j+d}] + E[h_{j+a} h_{j+c}] E[h_{j+b} h_{j+d}]
        + E[h_{j+a} h_{j+d}] E[h_{j+b} h_{j+c}] + cum[h_{j+a}, h_{j+b}, h_{j+c}, h_{j+d}].

But, for all a, b, E[h_{j+a} h_{j+b}] = O(n^{−1} log³ n + I(a = b)), whereas, distinguishing the contributions from higher-order and second-order cumulants (see Brillinger, 1981, p. 20 and Theorem 2.6.1),

    cum[h_{j+a}, h_{j+b}, h_{j+c}, h_{j+d}] = O(n^{−2} log⁶ n + δ²_{a,b,c,d} n^{−1} log³ n + δ³_{a,b,c,d} n^{−1} log² n),

where δ_{a,b,c,d} indicates a restriction among the indices a, b, c, d, so that, averaging over a, b, c, d, this contribution is O(m^{−1} n^{−1} log² n + m^{−3}). Thus,

    E||φ̃_j − Eφ̃_j||⁴ = O(n^{−2} log⁶ n + m^{−2}).

From here, we can conclude easily that sup_j ||φ̃_j − Eφ̃_j|| = o_p(1), using that

    Pr( sup_j ||φ̃_j − Eφ̃_j|| > c ) ≤ Σ_{j=1}^{ñ} Pr( ||φ̃_j − Eφ̃_j|| > c ) ≤ c^{−4} Σ_{j=1}^{ñ} E||φ̃_j − Eφ̃_j||⁴
and that m^{−2} n = o(1). □

LEMMA A.2. Under the assumptions of Theorem 3.2,

    sup_{ω∈(0,π)} || (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} (φ̂_j − φ_j) m_j || = o_p(1).   (A.3)

Proof: To simplify the arguments we will assume that K(u) = I(|u| ≤ 1). Because Eφ̃_j − 4π Re f_{Wε,j} is O(n^{−2} m²) uniformly in j, it is easy to show that

    sup_{ω∈(0,π)} || (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} (Eφ̃_j − 4π Re f_{Wε,j}) m_j || = o_p(1),

assuming finite second derivatives of f_{Wε,j} in Assumption 2.3, and that

    sup_{ω∈(0,π)} || (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} (φ̂_j − φ̃_j) m_j || = o_p(1),

using Assumptions 2.2 and 2.3 as in Lemma A.1. The lemma now follows by Propositions A.1 and A.2. □

PROPOSITION A.1. Under the assumptions of Theorem 3.2, for all ω ∈ [0, π],

    (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} (φ̃_j − Eφ̃_j) m_j = o_p(1).   (A.4)
Proof: Writing φ̃_j − Eφ̃_j = (1/2m) Σ_{ℓ=−m; ℓ≠0}^{m} h_{j+ℓ}, by Abel summation by parts we obtain that the left-hand side of (A.4) is

    (1/(2ñ^{1/2}m)) Σ_{j=1}^{[ñω]} (h_{j−m} − h_{j+1+m} − h_j + h_{j+1})(m_{j−m} + m_j)   (A.5)

    + (1/(2ñ^{1/2}m)) Σ_{j=1}^{[ñω]} (h_{j−m} − h_{j+1+m} − h_j + h_{j+1}) Σ_{ℓ=1; ℓ≠j−m}^{j−1} m_ℓ.   (A.6)

Equation (A.5) is o_p(1) because the Cauchy–Schwarz inequality implies that

    ( E|h_{j−m} − h_{j+1+m} − h_j + h_{j+1}| |m_{j−m} + m_j| )² ≤ E|h_{j−m} − h_{j+1+m} − h_j + h_{j+1}|² E|m_{j−m} + m_j|² < D,

where, in what follows, D denotes a finite and positive constant. It is worth mentioning that this is the best rate we can obtain under our general assumptions, because the lack of (strong) exogeneity implies that E(h_j m_j) ≠ 0. Next, we examine (A.6). We exploit the fact that h_• and m_• do not have subindices in common there, so that, although the expectation is not zero unless the fourth cumulant is, it is at most O(n^{−1} log³ n). The expectation of (A.6) is therefore O(m^{−1} n^{−1/2} log³ n), because

    E[ (h_{j−m} − h_{j+1+m} − h_j + h_{j+1}) Σ_{ℓ=1; ℓ≠j−m}^{j−1} m_ℓ ] = O(n^{−1} log³ n).
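The summation-by-parts (Abel) step used above can be written out generically; the sequences a_j, b_j and the partial sums A_j below are our own schematic notation, not symbols from the paper:

```latex
A_j := \sum_{k=1}^{j} a_k, \qquad
\sum_{j=1}^{N} a_j\, b_j \;=\; A_N\, b_N \;-\; \sum_{j=1}^{N-1} A_j\,\bigl(b_{j+1}-b_j\bigr).
```

Trading a sum of products for a sum of first differences in this way is what produces the four-term combination h_{j−m} − h_{j+1+m} − h_j + h_{j+1} appearing in (A.5)–(A.6).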
Note that under Gaussianity the expectation would have been exactly zero. Next, we examine the second moment of (A.6). By the Cauchy–Schwarz inequality, it suffices to examine the second moment of each of the following four terms:

    (1/(2ñ^{1/2}m)) Σ_{j=1}^{[ñω]} h_{j−m} Σ_{ℓ=1; ℓ≠j−m}^{j−1} m_ℓ,   −(1/(2ñ^{1/2}m)) Σ_{j=1}^{[ñω]} h_{j+1+m} Σ_{ℓ=1; ℓ≠j−m}^{j−1} m_ℓ,

    −(1/(2ñ^{1/2}m)) Σ_{j=1}^{[ñω]} h_j Σ_{ℓ=1; ℓ≠j−m}^{j−1} m_ℓ,   (1/(2ñ^{1/2}m)) Σ_{j=1}^{[ñω]} h_{j+1} Σ_{ℓ=1; ℓ≠j−m}^{j−1} m_ℓ.   (A.7)

We will study the contribution due to the first term; the other three terms are handled similarly. The second moment of the first term of (A.7) is proportional to

    (1/(ñm²)) Σ_{j1=1}^{[ñω]} Σ_{j2=1}^{j1} Σ_{ℓ1=1; ℓ1≠j1−m}^{j1−1} Σ_{ℓ2=1; ℓ2≠j2−m}^{j2−1} E( h_{j1−m} h_{j2−m} m_{ℓ1} m_{ℓ2} )

    = (1/(ñm²)) Σ Σ Σ Σ E(h_{j1−m} h_{j2−m}) E(m_{ℓ1} m_{ℓ2})
    + (1/(ñm²)) Σ Σ Σ Σ E(h_{j1−m} m_{ℓ1}) E(h_{j2−m} m_{ℓ2})
    + (1/(ñm²)) Σ Σ Σ Σ E(h_{j2−m} m_{ℓ1}) E(h_{j1−m} m_{ℓ2})
    + (1/(ñm²)) Σ Σ Σ Σ cum(h_{j1−m}, h_{j2−m}, m_{ℓ1}, m_{ℓ2}),   (A.8)

with the same index ranges in each term. Because E(m_{ℓ1} m_{ℓ2}) = O(n^{−1}) + I(ℓ1 = ℓ2) and E(h_{j1−m} h_{j2−m}) = O(n^{−1}) + I(j1 = j2), the first term on the right-hand side of (A.8) is

    O(n/m²) + (1/(nm²)) Σ_{j=1}^{[ñω]} Σ_{ℓ=1; ℓ≠j−m}^{j−1} D = O(n/m²).

Similarly, the second and third terms on the right-hand side of (A.8) are O(m^{−2} n). Finally, consider the fourth term on the right-hand side of (A.8). First, observe that

    cum( h_{j1−m}, h_{j2−m}, m_{ℓ1}, m_{ℓ2} )
    = cum( w_{z,j1−m} w*_{ε,j1−m}, w_{z,j2−m} w*_{ε,j2−m}, w_{ε,ℓ1} w*_{ε,ℓ1}, w_{ε,ℓ2} w*_{ε,ℓ2} )
    = Σ_υ Π_{r=1}^{q} cum( w_{a,s1} w_{b,s2} ; (s1, s2) ∈ υ_r ),

with s1, s2 = j1 − m, j2 − m, ℓ1, ℓ2, where a and b stand for Z and ε, and where the summation in υ is over all indecomposable partitions υ = υ_1 ∪ … ∪ υ_q, q = 1, …, 4, of the table

    w_{z,j1−m}   w*_{ε,j1−m}
    w_{z,j2−m}   w*_{ε,j2−m}
    w_{ε,ℓ1}     w*_{ε,ℓ1}
    w_{ε,ℓ2}     w*_{ε,ℓ2}

(see Brillinger, 1981, p. 20 and Theorem 2.6.1). So, a typical component of the fourth term on the right-hand side of (A.8) is

    (1/(ñm²)) Σ_{j1=1}^{[ñω]} Σ_{j2=1}^{j1} Σ_{ℓ1=1; ℓ1≠j1−m}^{j1−1} Σ_{ℓ2=1; ℓ2≠j2−m}^{j2−1} Π_r cum( w_{a,s1} w_{b,s2} ; (s1, s2) ∈ υ_r ) = O(n^{−1} m^{−1} log³ n).

So, we conclude that the second moment of (A.6) converges to zero, and (A.4) holds true by the Markov inequality. □

PROPOSITION A.2. Under the assumptions of Theorem 3.2, the process

    X_n(ω) = (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} (φ̃_j − Eφ̃_j) m_j,   ω ∈ [0, 1],

is tight.

Proof: Proceeding as in Proposition A.1, X_n(ω) can be written as

    X_{n1}(ω) + X_{n2}(ω) := (1/(2ñ^{1/2}m)) Σ_{j=1}^{[ñω]} (h_{j−m} − h_{j+1+m} − h_j + h_{j+1})(m_{j−m} + m_j)
    + (1/(2ñ^{1/2}m)) Σ_{j=1}^{[ñω]} (h_{j−m} − h_{j+1+m} − h_j + h_{j+1}) Σ_{ℓ=1; ℓ≠j−m}^{j−1} m_ℓ.
Following Billingsley (1968, Theorem 15.6), a sufficient condition for the tightness of X_n(ω) is

    E|X_{nj}(ω2) − X_{nj}(ω1)|^τ ≤ D (ω2 − ω1)^{1+δ},   j = 1, 2,   (A.9)

where τ, δ > 0 and ω2 > ω1, and where, without loss of generality, we can assume that ñ^{−1} ≤ ω2 − ω1. We begin with X_{n1}(ω). By definition,

    X_{n1}(ω2) − X_{n1}(ω1) = (1/(2ñ^{1/2}m)) Σ_{j=[ñω1]+1}^{[ñω2]} (h_{j−m} − h_{j+1+m} − h_j + h_{j+1})(m_{j−m} + m_j).

So, by the triangle inequality, and proceeding as in the estimation of the second moment of (A.5), E|X_{n1}(ω2) − X_{n1}(ω1)| is bounded by

    D ([ñω2] − [ñω1]) / (m ñ^{1/2}) ≤ D (ω2 − ω1) n^{1/2}/m ≤ D (ω2 − ω1)^{1+δ},

because, by Assumption 3.2(ii), m^{−1} n^{1/2} = o(n^{−2δ}) for some δ > 0. To complete the proof, we need to show (A.9) for X_{n2}(ω). We will only examine the contribution of the first term of (A.7) to the left-hand side of (A.9), that is,

    E | (1/(2ñ^{1/2}m)) Σ_{j=[ñω1]+1}^{[ñω2]} h_{j−m} Σ_{ℓ=1; ℓ≠j−m}^{j−1} m_ℓ |^τ.

Choosing τ = 2, the last displayed expression is bounded by the analogue of (A.8) with the outer sums now running over j1, j2 = [ñω1] + 1, …, [ñω2]: the three terms built from products of second moments, plus the cumulant term

    (1/(ñm²)) Σ_{j1=[ñω1]+1}^{[ñω2]} Σ_{j2=[ñω1]+1}^{j1} Σ_{ℓ1=1; ℓ1≠j1−m}^{j1−1} Σ_{ℓ2=1; ℓ2≠j2−m}^{j2−1} cum( h_{j1−m}, h_{j2−m}, m_{ℓ1}, m_{ℓ2} ).

However, the last expression is bounded by D (ω2 − ω1)^{1+δ} because, proceeding as in the proof of (A.8), these terms are bounded by

    D ([ñω2]² − [ñω1]²) n log³ n / (n³ m²) ≤ D (ω2 − ω1) log³ n / m² ≤ D (ω2 − ω1)^{1+2δ},

as ñ^{−1} ≤ ω2 − ω1. This completes the proof. □
LEMMA A.3. Let ξ(u): [0, 1] → R^{p+q+1} be continuous. Under Assumption 2.1, in D[0, 1],

    { ∫_0^ω ξ(u)' T̂_n(du) : ω ∈ [0, 1] }  converges in distribution to  { ∫_0^ω ξ(u)' dB(u) : ω ∈ [0, 1] }.

Proof: The proof is much simpler than that of Lemma 2 in DHV, so it is omitted. □

Proof of Proposition 2.1: The proof proceeds as that of Lemma 7 in DHV, and so it is omitted. □
Proof of Proposition 2.2: First, it can be shown that F̂_n(π) →_p 1 by Assumption 2.1 under H_0. So, we can write

    F̂_n(ωπ; θ̂_n) = F̂_n(ωπ) − (θ̂_n − θ_0)' (4π/ñ) Σ_{j=1}^{[ωñ]} Re I_{Wε,j} + (θ̂_n − θ_0)' (4π/ñ) Σ_{j=1}^{[ωñ]} Re I_{WW,j} (θ̂_n − θ_0),   (A.10)

where

    sup_{ω∈[0,1]} || (4π/ñ) Σ_{j=1}^{[ωñ]} Re I_{WW,j} || ≤ (4π/ñ) Σ_{j=1}^{ñ} || Re I_{WW,j} || = O_p(1)

because of Assumption 2.3(ii). Then F̂_n(π; θ̂_n) →_p 1 by Assumption 2.2 and because, as we now show,

    A_n(ω) := (4π/ñ) Σ_{j=1}^{[ωñ]} Re I_{Wε,j} = Φ(ω) + o_p(1)   (A.11)

uniformly in ω, with Φ(1) = 0. By Assumption 2.3(i) we obtain that E A_n(ω) = Φ(ω) + o(1) uniformly in ω, and by Assumption 2.3(iii), A_n(ω) − Φ(ω) = o_p(1) for each ω. It then remains to check the tightness of Ā_n(ω) := A_n(ω) − E[A_n(ω)]. Following Billingsley (1968, Theorem 15.6), a sufficient condition is that, for some δ > 0 and 0 ≤ ω1 < ω2 ≤ 1,

    E|Ā_n(ω2) − Ā_n(ω1)|² = E | (2/ñ) Σ_{j=1+[ω1ñ]}^{[ω2ñ]} h_j |² ≤ D (ω2 − ω1)^{1+δ}.   (A.12)

Without loss of generality, we consider only ñ^{−1} ≤ (ω2 − ω1). Then, using Assumption 2.1 and Assumption 2.3(i)–(iii), the left-hand side of (A.12) is bounded by

    E|Ā_n(ω2) − Ā_n(ω1)|² ≤ D ñ^{−1} ∫_{ω1}^{ω2} dλ + D ñ^{−1} log³ n ( ∫_{ω1}^{ω2} dλ )² ≤ D (ω2 − ω1)².

Now (A.11) follows from (A.10) and Assumption 2.2, while (2.13) follows from (A.11) and Assumption 2.2. □

Proof of Theorem 3.1: Using the arguments in the proofs of Theorems 3 and 1 in DHV, we only need to consider convergence on intervals [0, ω0], for any ω0 < 1. Since sup_{ω∈[0,ω0]} L̄G(ω) = 0 is trivially satisfied, the theorem is a consequence of

    sup_{ω∈[0,ω0]} | L̄( T̂_n(ω; θ̂_n) − T̂_n(ω) ) | = o_p(1),
(A.13)

and

    L̄ T̂_n(ω) ⇒_d B⁰ in the space D[0, ω0].   (A.14)

By definition, L̄( T̂_n(ω; θ̂_n) − T̂_n(ω) ) is

    T̂_n(ω; θ̂_n) − T̂_n(ω) − ∫_0^ω g(u)' Γ(u)^{−1} ∫_u^1 g(v) ( T̂_n(dv; θ̂_n) − T̂_n(dv) ) du.   (A.15)

By Proposition 2.2, the first two terms in (A.15) are equal to −ñ^{1/2} Φ(ω)'(θ̂_n − θ_0) + o_p(1) uniformly in ω, whereas the third term is

    ñ^{1/2} ∫_0^ω g(u)' Γ(u)^{−1} ∫_u^1 g(v) g(v)' dv du (θ̂_n − θ_0) + o_p(1) = ñ^{1/2} Φ(ω)'(θ̂_n − θ_0) + o_p(1),

which shows (A.13). To complete the proof we need to show (A.14). Convergence of the finite-dimensional distributions follows as in Proposition 2.1 or Lemma A.3, so it suffices to prove tightness. Since T̂_n(ω) is tight, we only need to show the tightness condition for

    P_n(r) := ∫_0^r H(u)' Λ_n(u) du,

where H(u) := g(u)' Γ(u)^{−1} and Λ_n(u) := ñ^{−1/2} Σ_{j=1+[ñu]}^{ñ} g_j m_j. Because, by Lemma A.3, sup_{u∈[0,ω0]} ||Λ_n(u)|| = O_p(1) and E||Λ_n(u)||² < D,

    E|P_n(r) − P_n(s)|² = ∫_s^r ∫_s^r H(u1)' H(u2) (1/ñ) Σ_{j=1+[ñu1]}^{ñ} Σ_{k=1+[ñu2]}^{ñ} g_j g_k' E(m_j m_k) du1 du2
        ≤ D ∫_s^r ∫_s^r ||H(u1)|| ||H(u2)|| du1 du2 = D |L(r) − L(s)|²,

where L(·) = ∫_0^· ||H(u)|| du is a monotonic, continuous and non-decreasing function. □
Proof of Theorem 3.2: Setting φ̂_j = φ̂_m(j/ñ) and using Proposition 2.2, L̄_n T̂_n(ω; θ̂_n) is, up to terms o_p(1) uniformly in ω,

    (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} { m_j − ĝ_j' Γ̂_j^{−1} (1/ñ) Σ_{ℓ=j+1}^{ñ} ĝ_ℓ m_ℓ }   (A.16)

    + { (1/ñ) Σ_{j=1}^{[ñω]} φ_j' − (1/ñ) Σ_{j=1}^{[ñω]} ĝ_j' Γ̂_j^{−1} (1/ñ) Σ_{ℓ=j+1}^{ñ} ĝ_ℓ φ_ℓ' } ñ^{1/2} (θ̂_n − θ_0).   (A.17)

Since Γ_j is assumed non-singular for all j = 1, …, ñ, using Lemma A.1 and Assumption 2.2 we obtain that (A.17) is o_p(1), which is what is required to conclude that the asymptotic behaviour of L̄_n T̂_n(ω) is given by that of (A.16). We now show the weak convergence of (A.16) with φ̂ replaced by φ; in Lemma A.2 we show that the difference is negligible.

First, the expectation is clearly zero because E(I_{εε,j} − 1) = 0, j = 1, …, ñ. Next, we study the covariance structure. Let ω1 ≤ ω2. Our aim is to show that

    E( a_n(ω1) a_n(ω2) ) → ω1 as n → ∞,   (A.18)

where

    a_n(ω) = (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} m_j − (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} g_j' Γ_j^{−1} (1/ñ) Σ_{ℓ=j+1}^{ñ} E(g_ℓ) m_ℓ := a_{n1}(ω) − a_{n2}(ω),

since (A.18) implies that, if a_n(ω) converges to a Gaussian process, this must be the standard Brownian motion. Because E( a_{n1}(ω1) a_{n1}(ω2) ) = ω1, (A.18) holds true if

    E( a_{n2}(ω1) a_{n2}(ω2) ) = E( a_{n1}(ω1) a_{n2}(ω2) ) + E( a_{n1}(ω2) a_{n2}(ω1) ).   (A.19)

First, it is easy to check that the right-hand side of (A.19) is

    (1/ñ²) Σ_{j1=1}^{[ñω1]} Σ_{j2<j1; j2≤[ñω2]} g_{j2}' Γ_{j2}^{−1} E(g_{j1}) + (1/ñ²) Σ_{j1=1}^{[ñω2]} Σ_{j2<j1; j2≤[ñω1]} g_{j2}' Γ_{j2}^{−1} E(g_{j1}).

Next, we examine the left-hand side of (A.19), which is

    (1/ñ) Σ_{j1=1}^{[ñω1]} Σ_{j2=1}^{[ñω2]} g_{j1}' Γ_{j1}^{−1} { (1/ñ²) Σ_{ℓ1=j1+1}^{ñ} Σ_{ℓ2=j2+1}^{ñ} E(g_{ℓ1}) E(m_{ℓ1} m_{ℓ2}) E(g_{ℓ2})' } Γ_{j2}^{−1} g_{j2};

using E(m_{ℓ1} m_{ℓ2}) = I(ℓ1 = ℓ2) together with the definition of Γ_j, this reduces to the same expression, showing (A.19). Since the fidis of (A.16) converge to those of a Brownian motion, we only need to examine the tightness of a_{n2}(ω), as it is already known that a_{n1}(ω) is tight. But a_{n2}(ω2) − a_{n2}(ω1) is

    (1/ñ^{1/2}) Σ_{j=[ñω1]+1}^{[ñω2]} g_j' Γ_j^{−1} (1/ñ) Σ_{ℓ=[ñω2]+1}^{ñ} E(g_ℓ) m_ℓ
    + (1/ñ^{1/2}) Σ_{j=[ñω1]+1}^{[ñω2]} g_j' Γ_j^{−1} (1/ñ) Σ_{ℓ=j+1}^{[ñω2]} E(g_ℓ) m_ℓ,

from where it is easy to show that a_{n2}(ω) is tight. Observe that, for instance, the first term has again the structure (ζ(ω2) − ζ(ω1)) Z, where Z is a random variable with at least finite second moments. □

Proof of Theorem 3.3: We first analyse an unfeasible version Ľ of the transformation Ľ_n, assuming that we observe θ_0 and hence replacing ĝ_j by g_j = Re I_{Wε,j} and m̂_j by m_j, j = 1, …, ñ:

    Ľ T̂_n(ω; θ̂_n) = ( ñ^{−1/2} / F̂_n(π) ) Σ_{j=1}^{[ñω]} { m_j − g_{j+1}' ( Σ_{k=j+1}^{ñ} g_{k+2} g_{k+2}' )^{−1} Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} },   (A.20)
and show that, under the same conditions of the theorem,

    Ľ T̂_n(ω; θ̂_n) ⇒_d B⁰ in the Skorohod metric space D[0, ω0],

for any ω0 < 1. Then the proof of Theorem 3.3 is standard after we notice that ε̂_t = ε_t + (θ̂_n − θ_0)' W_t, that Assumption 2.2 implies θ̂_n − θ_0 = O_p(n^{−1/2}), and the arguments in the proofs of Theorems 3 and 4 in DHV. We shall abbreviate g_{n,k} by g_k to simplify the notation. Now, because F̂_n(π, θ̂_n) − σ_ε² = o_p(1), and recalling that we can assume σ_ε² = 1 without loss of generality, we obtain that

    Ľ T̂_n(ω) = T̂_n(ω) − (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} g_{j+1}' ( Σ_{k=j+1}^{ñ} g_{k+2} g_{k+2}' )^{−1} Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} + o_p(1).   (A.21)

So, except for the o_p(1) term, the right-hand side of Ľ T̂_n(ω) is

    (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} { m_j − g_{j+1}' ( Σ_{k=j+1}^{ñ} g_{k+2} g_{k+2}' )^{−1} Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} }.   (A.22)

Now, we can replace G_{j,n} = (1/ñ) Σ_{k=j+1}^{ñ} g_{k+2} g_{k+2}' by G_j = (1/ñ) Σ_{k=j+1}^{ñ} E(g_{k+2} g_{k+2}'). Indeed,

    (1/ñ^{3/2}) Σ_{j=1}^{[ñω]} g_{j+1}' ( G_{j,n}^{−1} − G_j^{−1} ) Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1}
    = (1/ñ^{3/2}) Σ_{j=1}^{[ñω]} g_{j+1}' G_{j,n}^{−1} ( G_j − G_{j,n} ) G_j^{−1} Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1}.   (A.23)

However, Brillinger's (1981) Theorem 7.6.3 (see also the proof of Lemma A.1) implies that, uniformly in j,

    || G_{j,n} − G_j || = o_p(n^{−1/4});   || Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} || = O_p(n^{3/4}),

so that the right-hand side of (A.23) is o_p(1), and hence the asymptotic distribution of (A.22) is given by that of

    (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} { m_j − g_{j+1}' G_j^{−1} (1/ñ) Σ_{p=j+1}^{ñ} g_{p+2} m_{p+1} }
    = (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} { m_j − E(g_{j+1})' G_j^{−1} (1/ñ) Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} } + o_p(1),   (A.24)

as we now show. Writing g̃_j = g_j − E(g_j), the difference between the left-hand side and the first term on the right-hand side of (A.24) is

    (1/ñ^{3/2}) Σ_{j=1}^{[ñω]} g̃_{j+1}' G_j^{−1} Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1}.

Next, the second moment of the last displayed expression is

    (1/ñ³) Σ_{1=j≤ℓ}^{[ñω]} E{ g̃_{j+1}' G_j^{−1} ( Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} ) g̃_{ℓ+1}' G_ℓ^{−1} ( Σ_{q=ℓ+1}^{ñ} g_{q+2} m_{q+1} ) }.   (A.25)

Now, because ||G_j^{−1}|| < D, the expectation term in (A.25) is governed by

    E( g̃_{j+1} g̃_{ℓ+1}' ) E( Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} Σ_{q=ℓ+1}^{ñ} g_{q+2} m_{q+1} )
    + E( g̃_{j+1} Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} ) E( g̃_{ℓ+1} Σ_{q=ℓ+1}^{ñ} g_{q+2} m_{q+1} )
    + E( g̃_{j+1} Σ_{q=ℓ+1}^{ñ} g_{q+2} m_{q+1} ) E( g̃_{ℓ+1} Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} )
    + Σ_{k=j+1}^{ñ} Σ_{q=ℓ+1}^{ñ} cum( g̃_{j+1}, g_{k+2} m_{k+1}, g̃_{ℓ+1}, g_{q+2} m_{q+1} ).

Now, because, for example, Cov(g̃_{j+1}, g̃_{k+1}) = I(j = k) + O(n^{−1}) and Cov(g̃_{j+1}, m_{k+1}) = I(j = k) + O(n^{−1}), and by Brillinger (1981, p. 20 and Theorem 4.3.2), the last displayed expression is O(1), and hence (A.25) is O(n^{−1}). So, we conclude that

    Ľ T̂_n(ω) = (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} { m_j − G̊_j (1/ñ) Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} } + o_p(1),   (A.26)

where G̊_j = E(g_{j+1})' G_j^{−1}. So, it suffices to examine the asymptotic behaviour of

    (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} { m_j − G̊_j (1/ñ) Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} },   (A.27)

and, more specifically, to show that (a) |E Ľ T̂_n(ω)| = o(1), (b) Cov( Ľ T̂_n(ω1), Ľ T̂_n(ω2) ) = (ω1 ∧ ω2) π^{−1} + o(1), and (c) the process Ľ T̂_n(ω) is tight.

We begin with part (a). Because E m_j = 0 and ||G̊_j|| < D, we have that

    |E Ľ T̂_n(ω)| ≤ (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} (1/ñ) Σ_{k=j+1}^{ñ} || E( g_{k+2} m_{k+1} ) || = O(n^{−1/2}),

because E( g_{k+2} m_{k+1} ) = Cov( I_{ε,k+1}, g_{k+2} ) = O(n^{−1}). Now, we examine part (b). To that end it suffices to show that

(i)  E( (1/ñ^{1/2}) Σ_{j=1}^{[ñω1]} m_j · (1/ñ^{1/2}) Σ_{j=1}^{[ñω2]} m_j ) + o(1) = (ω1 ∧ ω2)/π + o(1);   (A.28)

(ii) that the contribution of the other three terms in Cov( Ľ T̂_n(ω1), Ľ T̂_n(ω2) ) is o(1). That (A.28) holds true is standard; see for instance DHV's Lemma 7. Now, regarding part (ii), it suffices to see that

    −(1/ñ²) Σ_{j≤ℓ}^{[ñω]} E{ m_j G̊_ℓ Σ_{k=ℓ+1}^{ñ} g_{k+2} m_{k+1} } − (1/ñ²) Σ_{j≤ℓ}^{[ñω]} E{ m_ℓ G̊_j Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} }
    + (1/ñ³) Σ_{j≤ℓ}^{[ñω]} E{ G̊_j Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1} ( G̊_ℓ Σ_{q=ℓ+1}^{ñ} g_{q+2} m_{q+1} ) }   (A.29)

is o(1). Observe that this is the term we obtain when ω1 = ω2 = ω. That (A.29) is o(1) follows because the first term of (A.29) is proportional to

    (1/ñ²) Σ_{j≤ℓ}^{[ñω]} Σ_{k=ℓ+1}^{ñ} E{ m_j g_{k+2} m_{k+1} }
    = (1/ñ²) Σ_{j≤ℓ}^{[ñω]} Σ_{k=ℓ+1}^{ñ} { E{m_j} E{g_{k+2} m_{k+1}} + E{m_j g_{k+2}} E{m_{k+1}} + E{m_j m_{k+1}} E{g_{k+2}} },

which is zero because E{m_j m_{k+1}} = E{m_j} = 0. Next, the second term of (A.29) is

    −(1/ñ²) Σ_{j≤ℓ}^{[ñω]} G̊_j E{g_{ℓ+1}},

because E{m_j} = 0 and E{m_ℓ m_{k+1}} = I(ℓ = k + 1). And, finally, the third term of (A.29) is ñ^{−2} Σ_{j≤ℓ}^{[ñω]} G̊_j E{g_{ℓ+1}} + O(n^{−1}). Indeed, proceeding as before, using Brillinger (1981, p. 20 and Theorem 4.3.2) as in the proof of Proposition A.1, noting for instance that E( m_{k+1} m_{q+1} ) = I(k = q), and then using the definition of G̊_j in (A.26), the third term of (A.29) is

    (1/ñ³) Σ_{j≤ℓ}^{[ñω]} G̊_j ( Σ_{k=ℓ+1}^{ñ} E( g_{k+2} g_{k+2}' ) ) G̊_ℓ' + O(n^{−1}) = (1/ñ²) Σ_{j≤ℓ}^{[ñω]} G̊_j E{g_{ℓ+1}} + o(1).

So, we conclude part (b): Cov( Ľ T̂_n(ω1), Ľ T̂_n(ω2) ) = (ω1 ∧ ω2) π^{−1} + o(1). To complete the proof we need to show part (c). From the definition of Ľ T̂_n(ω) in (A.27), it suffices to examine the tightness of

    (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} G̊_j (1/ñ) Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1},

as (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} m_j is known to be tight (see, for instance, DHV). Now, because by Assumption 2.3, G̊_j − G̊_{j+1} = O(n^{−1}), it suffices to examine the tightness of

    (1/ñ^{1/2}) Σ_{j=1}^{[ñω]} (1/ñ) Σ_{k=j+1}^{ñ} g_{k+2} m_{k+1}
    = (1 − ω) (1/ñ^{1/2}) Σ_{ℓ=1+[ñω]}^{ñ} g_{ℓ+2} m_{ℓ+1} + (1/ñ^{1/2}) Σ_{ℓ=1}^{[ñω]} (1 − ℓ/ñ) g_{ℓ+2} m_{ℓ+1}.

We shall examine the second term on the right-hand side, the first being similarly handled. By standard arguments (see Billingsley, 1968), we only need to show that

    E | (1/ñ^{1/2}) Σ_{ℓ=1+[ñω1]}^{[ñω2]} g_{ℓ+2} m_{ℓ+1} |⁴ ≤ D (ω2 − ω1)^{1+δ}

for some δ > 0. Now, the left-hand side of the last displayed expression is

    (1/ñ²) Σ_{ℓ1,ℓ2,ℓ3,ℓ4=1+[ñω1]}^{[ñω2]} E( g_{ℓ1+2} m_{ℓ1+1} g_{ℓ2+2} m_{ℓ2+1} g_{ℓ3+2} m_{ℓ3+1} g_{ℓ4+2} m_{ℓ4+1} )
    = 3 (1/ñ²) ( Σ_{ℓ1,ℓ2=1+[ñω1]}^{[ñω2]} E( g_{ℓ1+2} m_{ℓ1+1} g_{ℓ2+2} m_{ℓ2+1} ) )²
    + (1/ñ²) Σ_{ℓ1,ℓ2,ℓ3,ℓ4=1+[ñω1]}^{[ñω2]} cum( g_{ℓ1+2} m_{ℓ1+1}, g_{ℓ2+2} m_{ℓ2+1}, g_{ℓ3+2} m_{ℓ3+1}, g_{ℓ4+2} m_{ℓ4+1} ).

Now, proceeding as we did in part (b), the right-hand side of the last displayed expression is bounded by D (ω2 − ω1)², after noticing that we can always take ω1 and ω2 such that ñ^{−1} ≤ (ω2 − ω1). This completes the proof of the theorem. □
Proof of Theorem 4.1: From the definition of F̂_n(ω, θ̂_n) in (2.7), under H_an, and proceeding as in Proposition 2.2, we have that

    F̂_n(ω, θ̂_n) = F̂_n(ω) − (4π/ñ^{3/2}) Σ_{j=1}^{[ωñ]} Re I_{εW,j} ñ^{1/2}(θ̂_n − θ_0) − c (4π/ñ^{3/2}) Σ_{j=1}^{[ωñ]} Re I_{εX(−p−1),j} + o_p(1)
        = F̂_n(ω) − ñ^{−1/2} { Φ(ω)'(θ̂_n − θ_0) + c σ² Ψ(ω) } + o_p(1),

uniformly in ω ∈ [0, 1]. From here, (4.2) follows by repeating the same steps as in Theorems 3.1 and 3.2, but noting the additional term given by Ψ(ω) := ∫_0^ω l(πu) du in the general case. So, under H_an, the limit B⁰ + L⁰ is a non-centred Gaussian process, the 'non-centrality function' being given by L⁰. Now, the test will have non-trivial power under H_an if L⁰(ω) ≠ 0 on a set with Lebesgue measure greater than zero. From the definitions of L⁰ and Ψ, and using that ∫_0^1 φ(v) dv = 0, it is easily seen that

    L⁰(ω) = ∫_0^ω { l(πu) − g(u)' Γ(u)^{−1} ∫_u^1 g(v) l(πv) dv } du.

However, the expression in braces is just the residual from the least-squares projection of l(πu) on g(u) = (1, φ(u)')', which is obviously different from zero unless l(πu) lies in the space spanned by g(u). But the latter is ruled out, which concludes the proof. □
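The concluding step invokes the standard property of least-squares projection residuals. In its simplest, non-recursive form (our notation; the operator in the proof uses a time-varying weighting and integration over [u, 1] rather than [0, 1]):

```latex
r(u) \;=\; l(\pi u) \;-\; g(u)'\Bigl(\int_0^1 g(v)\,g(v)'\,dv\Bigr)^{-1}\!\int_0^1 g(v)\,l(\pi v)\,dv,
\qquad
\int_0^1 g(u)\,r(u)\,du \;=\; 0,
```

and r ≡ 0 if and only if l(π·) lies in the linear span of the components of g = (1, φ')'. That degenerate case is exactly what is ruled out above, so the non-centrality function is non-zero on a set of positive measure and the test has non-trivial local power.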
The Econometrics Journal (2009), volume 12, pp. S135–S171. doi: 10.1111/j.1368-423X.2009.00279.x

Efficient GMM with nearly-weak instruments

BERTILLE ANTOINE† AND ERIC RENAULT‡

†Department of Economics, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada. E-mail: [email protected]
‡Department of Economics, University of North Carolina at Chapel Hill, CB 3305, Chapel Hill, NC 27599, USA. E-mail: [email protected]

First version received: July 2008; final version accepted: December 2008
Summary  This paper is in the line of the recent literature on weak instruments, which, following the seminal approach of Stock and Wright, captures weak identification by drifting population moment conditions. In contrast with most of the existing literature, we do not specify a priori which parameters are strongly or weakly identified. We rather consider that weakness should be related specifically to instruments, or more generally to the moment conditions. In addition, we focus here on the case, dubbed nearly-weak identification, where the drifting DGP introduces a limit rank deficiency reached at a rate slower than root-T. This framework ensures the consistency of Generalized Method of Moments (GMM) estimators of all parameters, but at a rate possibly slower than usual. It also validates the GMM-LM test with standard formulas. We then propose a comparative study of the power of the LM test and its modified version, the K-test proposed by Kleibergen. Finally, after a well-suited rotation in the parameter space, we identify and estimate directions in which root-T convergence is maintained. These results are all directly relevant for practical applications, without requiring knowledge or estimation of the slower rate of convergence.

Keywords: GMM, Instrumental variables, Weak identification.
1. INTRODUCTION

In this paper, we revisit the Generalized Method of Moments (GMM) of Hansen (1982) when classical identification assumptions are only barely satisfied. Following Hansen (1982), we imagine an economic model with structural parameter of interest θ ∈ Θ ⊆ R^p. The econometrician's information about the true unknown value θ^0 of θ comes through moment conditions. For some stationary ergodic process (Y_t), φ(Y_t, θ) is a K-dimensional function, integrable for all θ ∈ Θ, and the underlying economic model states that these moment conditions are satisfied at the true unknown value of the parameter:

    E[φ(Y_t, θ^0)] = 0.   (1.1)

Moment conditions (1.1) strongly globally identify θ^0 if they do not admit any other solution:

    E[φ(Y_t, θ)] = 0, θ ∈ Θ  ⇔  θ = θ^0.   (1.2)
S136
B. Antoine and E. Renault
Hansen (1982) maintains (1.2) to prove consistency of a GMM estimator, defined as:

    θ̂_T = arg min_{θ∈Θ} [ φ̄_T(θ)' Ω_T φ̄_T(θ) ].   (1.3)

Here φ̄_T(θ) is the sample mean of φ(Y_t, θ), and (Ω_T) is a sequence of random non-negative definite matrices with a positive definite probability limit. To address the asymptotic distribution of a GMM estimator, Hansen (1982) extends the above definition and considers any sequence (θ̂_T) such that:

    Plim[ T^{1/2} A_T φ̄_T(θ̂_T) ] = 0.   (1.4)
(A_T) is a sequence of (p, K) random matrices converging in probability to a constant full-row rank matrix A^0. The strong global identification condition (1.2) may be replaced by a local one: moment conditions (1.1) strongly locally identify θ^0 if θ^0 belongs to the interior of Θ, φ(Y_t, θ) is continuously differentiable there, and E[∂φ(Y_t, θ^0)/∂θ'] has full-column rank. Under this strong local identification condition, the GMM estimator θ̂_T defined by (1.4) is consistent and asymptotically normal. Moreover, the GMM estimator defined by (1.3) is a special case of (1.4) (through the first-order conditions) when strong local identification holds. Both the global and the local strong identification conditions have been questioned in the literature during the last ten years. Stock and Wright (2000) relax strong global identification by considering a drifting Data Generating Process (DGP) with:

    E[φ(Y_t, θ)] = m_{1T}(θ)/T^{1/2} + m_2(θ_1) for some given subvector θ_1 of θ.   (1.5)
Then, only θ_1 is possibly identified, since, for the other components of θ, the relevant moment information vanishes at rate square-root T, the speed at which information is accumulated with a larger sample size. This case has been referred to as (global) weak identification. Kleibergen (2005) focuses on the GMM score-type test of a null hypothesis H_0: θ = θ^0. For such a problem, only local identification is relevant, and Kleibergen (2005) refers to as (local) weak identification the case where:

    E[∂φ(Y_t, θ^0)/∂θ'] = C/T^{1/2}, with a full-column rank matrix C.   (1.6)

We revisit the issue of GMM estimators and score-type tests in the nearly-weak identification case, first introduced by Hahn and Kuersteiner (2002): through a drifting-DGP approach, information still vanishes as the sample size T increases, but at a rate δ_T slower than T^{1/2}. In the context of Stock and Wright (2000), global nearly-weak identification would mean:

    E[φ(Y_t, θ)] = m_{1T}(θ)/δ_T + m_2(θ_1),   (1.7)

while in the context of Kleibergen (2005), local nearly-weak identification would mean:

    E[∂φ(Y_t, θ^0)/∂θ'] = C/δ_T.   (1.8)
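The distinction between δ_T = T^{1/2} (weak) and slower δ_T (nearly-weak) in (1.5)–(1.8) is easy to visualize in a simulation. The sketch below is not from the paper: it uses a toy just-identified linear IV model y_t = θ^0 x_t + u_t with a single instrument whose first-stage coefficient shrinks like T^{−a}, so a < 1/2 mimics nearly-weak identification (the IV estimator remains consistent, at the slow rate T^{1/2−a}) and a = 1/2 mimics genuinely weak identification. All design choices (DGP, exponents, sample sizes) are illustrative assumptions.

```python
import numpy as np

def iv_estimate(T, a, theta0=1.0, rng=None):
    """One draw of the just-identified IV estimator; first-stage strength T**-a."""
    z = rng.standard_normal(T)
    v = rng.standard_normal(T)
    u = 0.8 * v + 0.6 * rng.standard_normal(T)   # corr(u, v) > 0: OLS is biased
    x = z / T**a + v                             # instrument strength drifts to zero
    y = theta0 * x + u
    return float(np.sum(z * y) / np.sum(z * x))

def mean_abs_error(T, a, reps=500, seed=0):
    """Mean absolute estimation error over Monte Carlo replications."""
    rng = np.random.default_rng(seed)
    draws = np.array([iv_estimate(T, a, rng=rng) for _ in range(reps)])
    return float(np.mean(np.abs(draws - 1.0)))

for a in (0.3, 0.5):
    # a = 0.3 (nearly-weak): the error shrinks as T grows, at rate T**-0.2
    # a = 0.5 (weak): the error does not shrink and the estimator is erratic
    print(a, mean_abs_error(1_000, a), mean_abs_error(100_000, a))
```

The contrast in how the two error sequences behave as T grows is the practical content of the dichotomy: consistency at a slower-than-root-T rate versus no consistency at all.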
The possible case of nearly-weak identification has been quite overlooked in the literature, even though it makes sense to study the variety of asymptotic behaviours that arise when δ_T is associated with any rate between O(1) and O(T^{1/2}); weak identification is only the limit case where identification is completely lost. So far, only Hahn and Kuersteiner (2002) in a linear context, and Caner (2007) in a non-linear one, have considered nearly-weak identification. Our contribution as
concerns nearly-weak identification is to imagine that, in realistic circumstances, nearly-weak identification may occur for some moments while strong identification is still guaranteed for others. This new point of view paves the way for new results, as follows. In terms of GMM score-type testing, the partition between locally strongly- and locally nearly-weakly identifying moment conditions determines the different rates of convergence associated with the specific directions in the parameter space against which the test has power.^1 As a result, the GMM score test has power even in quite weak directions, where the weakness degree δ_T may be arbitrarily close to T^{1/2}. We show that, by contrast, Kleibergen's modified score test is more likely to waste some power in such directions. This is the price to pay for robustness to weak identification (δ_T = T^{1/2}) when, as shown by Kleibergen (2005), the standard GMM score test does not work. We show that the GMM score test and Kleibergen's modified score test are actually asymptotically equivalent under relevant sequences of local alternatives, but only in cases of moderate weakness of identification: we refer to nearly-strong identification when δ_T goes to infinity slower than T^{1/4}. This equivalence is tightly related to a stronger equivalence result between the standard two-step (efficient) GMM and the continuously updated GMM of Hansen et al. (1996) for efficient estimation in all directions. Such a result can only be embraced after extending the pioneering setting introduced by Stock and Wright (2000): we now consider that some moment conditions are globally identifying, while some others are weakly identifying. In other words, the vector φ(Y_t, θ) is partitioned into two subvectors, φ(Y_t, θ)' = [φ_1(Y_t, θ)' ⋮ φ_2(Y_t, θ)'], such that: E[φ_1(Y_t, θ)] = ρ_1(θ) and
E [φ2 (Yt , θ )] = ρ2 (θ )/δT
(1.9)
with the global nearly-weak identification condition: ρ(θ ) = 0 ⇔ θ = θ 0
where
. ρ(θ ) = [ρ1 (θ ) .. ρ2 (θ )] .
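Purely as an illustration (ours, not part of the paper), the partitioned moment conditions in (1.9) can be mimicked in a toy Gaussian design in which the first moment block carries an $O(1)$ signal while the second block's signal is scaled down by $\delta_T$; all names (`sample_moments`, `delta_T`, the scalar parametrization) are hypothetical choices of this sketch.

```python
import numpy as np

def sample_moments(theta, theta0, T, delta_T, rng):
    """Toy version of (1.9): E[phi_1] = rho1(theta), E[phi_2] = rho2(theta)/delta_T,
    with rho1(theta) = rho2(theta) = theta - theta0 and Gaussian noise."""
    rho1 = theta - theta0                      # strong block: O(1) signal
    rho2_scaled = (theta - theta0) / delta_T   # nearly-weak block: signal shrunk by delta_T
    phi = rng.standard_normal((T, 2)) + np.array([rho1, rho2_scaled])
    return phi.mean(axis=0)                    # sample counterpart phi_bar_T(theta)

rng = np.random.default_rng(0)
T = 10_000
delta_T = T ** 0.2                             # nearly-strong regime: 1 << delta_T << T**0.25
m_alt = sample_moments(2.0, 1.0, T, delta_T, rng)
# The strong block flags the violation theta != theta0 at full strength,
# while the nearly-weak block only carries a 1/delta_T fraction of it.
print(m_alt)
```

Under this design the second component of `m_alt` is centred at $1/\delta_T$ rather than at 1, which is the sense in which the second group of moments is "less informative".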
Identification is nearly-weak because $\delta_T$ goes to infinity, but we rather call it nearly-strong when $\delta_T$ is associated with a rate slower than $T^{1/4}$. By contrast with Stock and Wright (2000), we have no prior knowledge of the subset of parameters that are weakly identified. Intuitively, the first set of moment conditions (respectively the second one) identifies strong (respectively weak) directions in the parameter space. Through a convenient rotation in the parameter space, in the spirit of Phillips (1989), we define a reparametrization such that the first components of the new parameter are estimated at the standard rate $\sqrt{T}$, while the others are estimated only at the slower rate $\lambda_T = T^{1/2}/\delta_T$. Asymptotic covariances come with standard GMM-like formulas, but only under nearly-strong identification, i.e. when the rate $\lambda_T$ is faster than $T^{1/4}$. Interpreting this latter condition is germane to Andrews' (1994) study of MINPIN estimators.² In our case, the nuisance parameter is not infinite dimensional; however, due to nearly-weak identification, it is associated with a rate of convergence slower than the standard parametric $\sqrt{T}$. As in Andrews (1994), the slow rate of convergence needs to be faster than $T^{1/4}$ to avoid contamination of the well-identified estimated directions by the nearly-weak ones. We also show that the nearly-strong identification condition is exactly what is needed to ensure that all directions are equivalently estimated by efficient two-step GMM and continuously updated GMM. This explains the aforementioned partial equivalence between the GMM score and Kleibergen's modified score tests. More generally, our unified setting for mixed strong/nearly-strong identification coherently incorporates both global and local points of view. Ultimately, evidence of weak identification should not always lead to renouncing meaningful estimation and testing, as the alleged weak identification may only be nearly-weak, or even nearly-strong. Overlooking these cases could mean wasting relevant information. Moreover, possible weakness should be assigned to specific instruments (or to specific moment conditions) as in (1.9), rather than to specific parameters as in (1.7). It is the econometrician's duty to determine the different directions in the parameter space where she has more or less accurate information. We illustrate this new point of view with a Monte Carlo study of the well-known example of the consumption-based CAPM, already extensively studied in the literature: see Stock and Wright (2000) among others.

The paper is organized as follows. In Section 2, we discuss GMM-based tests of the simple null hypothesis $H_0: \theta = \theta_0$. We compare the asymptotic behaviour of the standard GMM score test (Newey and West, 1987) and of Kleibergen's modified score test. When complete weak identification is precluded, both tests work. Our framework allows us to display relevant sequences of local alternatives with heterogeneous rates of convergence depending on the direction of departure in the parameter space. By contrast with Kleibergen (2005), different degrees of nearly-weak identification are considered simultaneously: this opens the door to non-equivalence, even asymptotically, between the standard and modified score tests. In Section 3, the consistency and rate of convergence of any GMM estimator are analysed in a nearly-weak identification setting.

¹ As far as size properties are concerned, we know from the results in Andrews and Guggenberger (2007) that nearly-weak identification does not offer additional insights: only genuine weak identification matters when investigating the size properties of testing procedures.
² These estimators are defined as MINimizing a criterion function that might depend on a Preliminary Infinite dimensional Nuisance parameter estimator.
The special case of nearly-strong identification allows us in Section 4 to discuss efficient estimation with various rates of convergence in various directions, and to check equivalence between two-step efficient GMM and continuously updated GMM in all directions. These last results bridge the gap between estimation and score tests as discussed in Section 2. The practical relevance of our new asymptotic theory is checked in Section 5 in a consumption-based intertemporal asset pricing model. It validates our point of view of nearly-strong identification with different rates of convergence in different directions for realistic simulated parameter configurations. Section 6 concludes. Proofs are gathered in the Appendix; Table 1 summarizes the different concepts of identification.
2. GMM SCORE-TYPE TESTING

We want to test the null hypothesis $H_0: \theta = \theta_0$. Our information about the parameter $\theta$ comes from the following moment conditions,
$$E[\phi(Y_t,\theta^0)] = 0, \eqno(2.1)$$
always assumed to be fulfilled at least by the true unknown value $\theta^0$ of the parameter. Observed time series $(Y_t)_{1\le t\le T}$ of a stationary ergodic process are available, and such that the sample counterparts of the moment conditions satisfy a Central Limit Theorem (CLT) at the true value:

ASSUMPTION 2.1. (CLT at the true value $\theta^0$). With $\bar\phi_T(\theta^0) = \frac{1}{T}\sum_{t=1}^T \phi(Y_t,\theta^0)$:
(i) $\sqrt{T}\,\bar\phi_T(\theta^0)$ is asymptotically normally distributed with zero mean and covariance matrix $S^0$.
Table 1. Global, partial and local identification.

Degrees of identification, indexed by the rate of $\delta_T$: Strong ($\delta_T = O(1)$); Nearly-strong ($1 \ll \delta_T \ll T^{1/4}$); Nearly-weak ($1 \ll \delta_T \ll T^{1/2}$); Weak ($\delta_T = T^{1/2}$).

Global identification:
- on moment conditions (AR): $E[\phi_1(Y_t,\theta)] = \rho_1(\theta)$ and $E[\phi_2(Y_t,\theta)] = \rho_2(\theta)/\delta_T$, with $\phi_1$ a known subset of $\phi$ and $\rho_1(\theta) = 0 \Leftrightarrow \theta = \theta^0$; the strong benchmark is $E[\phi(Y_t,\theta)] = 0 \Leftrightarrow \theta = \theta^0$.
- on parameters (SW): $E[\phi(Y_t,\theta)] = m_1(\theta_1) + m_{2T}(\theta)/\delta_T$, with $\theta_1$ a known subset of $\theta$, $m_1(\theta_1) = 0 \Leftrightarrow \theta_1 = \theta_1^0$ and $m_2(\theta_1^0,\theta_2) = 0 \Leftrightarrow \theta_2 = \theta_2^0$.

Partial identification: the general case with $\phi(Y_t,\theta) = A(Y_t)\theta$ and rank conditions on $E[A(Y_t)]$; rank deficiencies nest the cases $\delta_T = \infty$ (both for (SW) and for (AR)).

Local identification:
- on moment conditions (AR): $\operatorname{Plim}\big[\partial\bar\phi_{1T}(\theta^0)/\partial\theta' \;\vdots\; \delta_T\,\partial\bar\phi_{2T}(\theta^0)/\partial\theta'\big]' = C$, with $C$ of full column rank.
- on parameters (SW): $E[\partial\phi(Y_t,\theta^0)/\partial\theta'] = \partial m_1(\theta_1^0)/\partial\theta' + (1/\delta_T)\,\partial m_2(\theta^0)/\partial\theta'$, with $\partial m_1(\theta_1^0)/\partial\theta_1'$ and $\partial m_2(\theta^0)/\partial\theta_2'$ of full column rank.
(ii) An HAC estimator $S_T(\theta)$ of the long-run covariance matrix $S^0$ is available and such that $S^0 = \operatorname{Plim}[S_T(\theta^0)]$.
We focus here on a case where local nearly-weak identification of some directions in the parameter space may occur simultaneously with strong identification of other directions. More precisely, we assume:

ASSUMPTION 2.2. (Nearly-weak local identification).
(i) $\theta^0$ belongs to the interior of $\Theta$, and $\phi(Y_t,\theta)$ is continuously differentiable on $\Theta$.
(ii) There exists a $(K,p)$ matrix $C$ with full column rank such that
$$\operatorname{Plim}\Big[\frac{\partial\bar\phi_{1T}(\theta^0)}{\partial\theta'}\Big] = C_1 \quad\text{and}\quad \operatorname{Plim}\Big[\frac{\sqrt{T}}{\lambda_T}\,\frac{\partial\bar\phi_{2T}(\theta^0)}{\partial\theta'}\Big] = C_2,$$
where $\bar\phi_T(\theta) = [\bar\phi_{1T}'(\theta) \;\vdots\; \bar\phi_{2T}'(\theta)]'$, $C = [C_1' \;\vdots\; C_2']'$, $\lambda_T \to \infty$ and $\lambda_T/\sqrt{T} \to 0$.
The following toy example illustrates our focus of interest.

EXAMPLE 2.1. (Toy example). Consider the moment conditions $E[Y_{1t}] = g(\theta^0)$ and $E[Z_t \otimes (Y_{2t} - X_{2t}'\theta^0)] = 0$, where the general functions $\phi_1(\cdot)$ and $\phi_2(\cdot)$ are defined as
$$\phi_1(Y_t,\theta) = Y_{1t} - g(\theta) \quad\text{and}\quad \phi_2(Y_t,\theta) = -Z_t \otimes (Y_{2t} - X_{2t}'\theta).$$
The instruments $Z_t$ introduced in $\phi_2$ are only nearly-weak instruments, since
$$E[Z_t \otimes X_{2t}'] = \frac{C_2}{\delta_T} \quad\text{with}\quad \delta_T = \frac{\sqrt{T}}{\lambda_T} \to \infty \quad\text{and}\quad \frac{\delta_T}{\sqrt{T}} = \frac{1}{\lambda_T} \to 0.$$
The associated Jacobian matrices are then
$$\operatorname{Plim}\Big[\frac{\partial\bar\phi_{1T}(\theta^0)}{\partial\theta'}\Big] = -\frac{\partial g(\theta^0)}{\partial\theta'} = C_1$$
and
$$\operatorname{Plim}\Big[\frac{\sqrt{T}}{\lambda_T}\,\frac{\partial\bar\phi_{2T}(\theta^0)}{\partial\theta'}\Big] = \operatorname{Plim}\Big[\frac{\sqrt{T}}{\lambda_T}\,\frac{1}{T}\sum_{t=1}^T Z_t \otimes X_{2t}'\Big] = \lim_T \frac{\sqrt{T}}{\lambda_T}\,E[Z_t \otimes X_{2t}'] = C_2,$$
and we assume that the stacked matrix
$$\Big[\frac{\partial g(\theta^0)}{\partial\theta'} \;\vdots\; E[Z_t \otimes X_{2t}']'\Big]'$$
has full column rank.

GMM score-type testing asks whether the test value $\theta_0$ comes close to fulfilling the first-order conditions of the (efficient) two-step GMM minimization, $\min_{\theta\in\Theta}[\bar\phi_T'(\theta) S_T^{-1}(\theta_0)\bar\phi_T(\theta)]$, that is, whether the score vector is close to zero. The score vector at the test value $\theta_0$ is defined as
$$V_T(\theta_0) = \frac{\partial\bar\phi_T'(\theta_0)}{\partial\theta}\,S_T^{-1}(\theta_0)\,\bar\phi_T(\theta_0).$$
The GMM score test statistic (Newey and West, 1987) is then a suitable norm of $V_T(\theta_0)$:
$$\xi_T^{NW} = T\,V_T'(\theta_0)\Big[\frac{\partial\bar\phi_T'(\theta_0)}{\partial\theta}\,S_T^{-1}(\theta_0)\,\frac{\partial\bar\phi_T(\theta_0)}{\partial\theta'}\Big]^{-1} V_T(\theta_0).$$
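As a minimal numerical sketch (ours, not from the paper), the statistic $\xi_T^{NW}$ can be assembled from a matrix of moment contributions and a Jacobian estimate; the i.i.d. sample covariance below stands in for the HAC estimator $S_T$, and the design (`phi`, `dphi`) is entirely hypothetical.

```python
import numpy as np

def gmm_score_stat(phi, dphi):
    """Newey-West GMM score statistic xi_T^NW at the test value theta_0.
    phi  : (T, K) moment contributions phi(Y_t, theta_0)
    dphi : (K, p) Jacobian estimate d phi_bar_T(theta_0) / d theta'
    """
    T = phi.shape[0]
    phi_bar = phi.mean(axis=0)             # phi_bar_T(theta_0)
    S = np.cov(phi, rowvar=False)          # i.i.d. stand-in for the HAC estimator S_T
    S_inv = np.linalg.inv(S)
    V = dphi.T @ S_inv @ phi_bar           # score vector V_T(theta_0)
    M = dphi.T @ S_inv @ dphi              # (p, p) weighting of the quadratic form
    return T * V @ np.linalg.solve(M, V)   # asymptotically chi-square(p) under H0

rng = np.random.default_rng(1)
T, K, p = 5000, 3, 2
phi = rng.standard_normal((T, K))          # H0 holds: moments have mean zero
dphi = rng.standard_normal((K, p))         # arbitrary full-rank Jacobian for the sketch
xi = gmm_score_stat(phi, dphi)
print(xi)
```

Under the null, repeated draws of `xi` would be approximately chi-square with $p$ degrees of freedom.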
Kleibergen (2005) considers instead the first-order conditions of the continuously updated GMM minimization, $\min_{\theta\in\Theta}[\bar\phi_T'(\theta) S_T^{-1}(\theta)\bar\phi_T(\theta)]$. The corresponding score vector can be computed either by direct differentiation (see equations (15) and (16) and the appendix in Kleibergen, 2005), or with even simpler computations within a Euclidean Empirical Likelihood approach (see Antoine et al., 2007). The score vector computed at the test value $\theta_0$ is
$$\tilde V_T(\theta_0) = \frac{\partial\tilde\phi_T'(\theta_0)}{\partial\theta}\,S_T^{-1}(\theta_0)\,\bar\phi_T(\theta_0),$$
where each column $\partial\tilde\phi_T^{(j)}(\theta_0)/\partial\theta$ of $\partial\tilde\phi_T(\theta_0)/\partial\theta$ is the residual of the long-run affine regression of $\partial\bar\phi_T^{(j)}(\theta_0)/\partial\theta$ on $\bar\phi_T(\theta_0)$:³
$$\frac{\partial\tilde\phi_T^{(j)}(\theta_0)}{\partial\theta} = \frac{\partial\bar\phi_T^{(j)}(\theta_0)}{\partial\theta} - \operatorname{Cov}_{as}\Big(\sqrt{T}\,\frac{\partial\bar\phi_T^{(j)}(\theta_0)}{\partial\theta},\, \sqrt{T}\,\bar\phi_T(\theta_0)\Big)\Big[\operatorname{Var}_{as}\big(\sqrt{T}\,\bar\phi_T(\theta_0)\big)\Big]^{-1}\bar\phi_T(\theta_0),$$
where $\operatorname{Var}_{as}(\sqrt{T}\,\bar\phi_T(\theta_0)) = S^0$ is the long-run covariance matrix of the moment conditions $\phi(Y_t,\theta_0)$, and $\operatorname{Cov}_{as}(\sqrt{T}\,\partial\bar\phi_T^{(j)}(\theta_0)/\partial\theta,\, \sqrt{T}\,\bar\phi_T(\theta_0))$ is the long-run covariance between $\partial\phi^{(j)}(Y_t,\theta_0)/\partial\theta$ and $\phi(Y_t,\theta_0)$. This long-run covariance is assumed to be well defined.⁴

ASSUMPTION 2.3. (Long-run covariance).
$$\operatorname{Cov}_{as}\Big(\sqrt{T}\,\frac{\partial\bar\phi_T^{(j)}(\theta^0)}{\partial\theta},\, \sqrt{T}\,\bar\phi_T(\theta^0)\Big) \equiv \lim_T T\,\operatorname{Cov}\Big(\frac{\partial\bar\phi_T^{(j)}(\theta^0)}{\partial\theta},\, \bar\phi_T(\theta^0)\Big)$$
is a well-defined $(p,K)$ matrix.

Kleibergen (2005) maintains Assumption 2.3 and, in addition, assumes that it corresponds to the asymptotic covariance matrix of the (assumed) joint asymptotic normal distribution of $\sqrt{T}\,[\partial\bar\phi_T^{(j)}(\theta^0)/\partial\theta]$ and $\sqrt{T}\,\bar\phi_T(\theta^0)$. In our nearly-weak identification setting, what we really need, albeit almost equivalent for all practical purposes, is only the following regularity condition:

ASSUMPTION 2.4. (Well-behaved Jacobian matrix for the strong subset of moment conditions).
$$\sqrt{T}\,\Big[\frac{\partial\bar\phi_{1T}(\theta^0)}{\partial\theta'} - C_1\Big] = O_P(1).$$

From the above discussion, replacing $[\partial\bar\phi_T^{(j)}(\theta^0)/\partial\theta]$ by $[\partial\tilde\phi_T^{(j)}(\theta^0)/\partial\theta]$ amounts to removing the finite-sample correlation between $[\partial\bar\phi_T^{(j)}(\theta^0)/\partial\theta]$ and $\bar\phi_T(\theta^0)$. As extensively discussed in Antoine et al. (2007), while this correlation may be responsible for the finite-sample bias of standard two-step GMM, the well-documented improved bias performance of continuously updated GMM is precisely due to this correction. When considering genuine weak instruments ($\lambda_T = 1$ in Assumption 2.2), this correlation persists asymptotically, and

³ For any vector $\psi \in \mathbb{R}^K$, we distinguish between the following notations. (i) $\psi = [\psi_1'\ \psi_2']'$ refers to the partition introduced in Assumption 2.2: $\psi_1$ (respectively $\psi_2$) is a subvector of $\psi$ with the same dimension as the strong (respectively nearly-weak) group of moment conditions. (ii) $\psi = [\psi^{(j)}]_{1\le j\le K}$ refers to the (single) components of the vector $\psi$: each $\psi^{(j)}$ is a real number. The latter (cumbersome) notation is not used often.
⁴ These notations are precisely defined in Assumption 2.3.
Kleibergen (2005) introduces a modified version of the Newey and West (1987) score test statistic:
$$\xi_T^{K} = T\,\tilde V_T'(\theta_0)\Big[\frac{\partial\tilde\phi_T'(\theta_0)}{\partial\theta}\,S_T^{-1}(\theta_0)\,\frac{\partial\tilde\phi_T(\theta_0)}{\partial\theta'}\Big]^{-1}\tilde V_T(\theta_0).$$
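As an illustrative sketch of the modification (our own toy implementation, not the authors' code), each column of the Jacobian estimate is orthogonalized against the moments before forming the score; sample covariances again stand in for the long-run quantities $\operatorname{Var}_{as}$ and $\operatorname{Cov}_{as}$, and all inputs are simulated.

```python
import numpy as np

def kleibergen_score(phi, dphi_t):
    """Kleibergen-style modified score: each Jacobian column is replaced by the
    residual of its (sample) regression on the moments, then the score is formed.
    phi    : (T, K) moment contributions at theta_0
    dphi_t : (T, K, p) per-observation Jacobian contributions
    """
    T, K = phi.shape
    p = dphi_t.shape[2]
    phi_bar = phi.mean(axis=0)
    S = np.cov(phi, rowvar=False)              # stand-in for Var_as(sqrt(T) phi_bar)
    S_inv = np.linalg.inv(S)
    D_tilde = np.empty((K, p))
    for j in range(p):
        col = dphi_t[:, :, j]                  # (T, K) contributions to column j
        full = np.cov(np.hstack([col, phi]), rowvar=False)
        C = full[:K, K:]                       # sample Cov(column j, moments), (K, K)
        D_tilde[:, j] = col.mean(axis=0) - C @ S_inv @ phi_bar
    return D_tilde.T @ S_inv @ phi_bar         # modified score vector V~_T(theta_0)

rng = np.random.default_rng(2)
T, K, p = 2000, 3, 2
phi = rng.standard_normal((T, K))
dphi_t = rng.standard_normal((T, K, p))
V = kleibergen_score(phi, dphi_t)
print(V.shape)
```

The design choice mirrors the text: the correction subtracts exactly the (finite-sample) correlation between the Jacobian estimate and the moments.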
By contrast, we show that with nearly-weak instruments the correlation is immaterial, and the standard GMM score test statistic $\xi_T^{NW}$ works. It is actually asymptotically equivalent to Kleibergen's modified score test statistic under the null:

PROPOSITION 2.1. (Equivalence under the null). Under Assumptions 2.1–2.4, under the null $H_0: \theta = \theta_0$, we have $\operatorname{Plim}[\xi_T^{NW} - \xi_T^{K}] = 0$, and both $\xi_T^{NW}$ and $\xi_T^{K}$ converge in distribution towards a chi-square with $p$ degrees of freedom.

The main contribution of this paper is to characterize the heterogeneity of the informational content of the moment conditions along different directions in the parameter space. While a proper assessment of this heterogeneity is crucial for efficient estimation (see Section 4), it also matters when considering the power of tests under sequences of local alternatives.

EXAMPLE 2.2. (Toy example, continued). Consider a sequence of local alternatives defined by a given deterministic sequence $(\gamma_T)_{T\ge 0}$ in $\mathbb{R}^p$, going to zero as $T$ goes to infinity, and such that the true unknown value is drifting: $\theta_T = \theta_0 + \gamma_T$. For large $T$, $g(\theta_T) \sim g(\theta_0) + [\partial g(\theta_0)/\partial\theta']\gamma_T$. Therefore, the strongly identified moment restrictions $E[Y_{1t} - g(\theta_T)] = 0$ are informative with respect to the violation of the null ($\theta_T \ne \theta_0$) if and only if $[\partial g(\theta_0)/\partial\theta']\gamma_T \ne 0$. As a consequence, we expect GMM-based tests of $H_0: \theta = \theta_0$ to have power against sequences of local alternatives converging at the standard rate $\sqrt{T}$, $\theta_T = \theta_0 + \gamma/\sqrt{T}$, if and only if $[\partial g(\theta_0)/\partial\theta']\gamma \ne 0$, that is, when $\gamma$ does not belong to the null space of $C_1$ (the null space of $[\partial g(\theta_0)/\partial\theta']$). By contrast, if $C_1\gamma = 0$, violations of the null can only be built from the nearly-weakly identifying conditions
$$E[Z_t \otimes Y_t] = \frac{C_2\,\theta_T}{\delta_T}.$$
We will show that the relevant sequences of local alternatives for characterizing non-trivial power are then necessarily such that $\theta_T = \theta_0 + \delta_T\gamma/\sqrt{T} = \theta_0 + \gamma/\lambda_T$. In other words, the degree of weakness $\delta_T$ of the moment conditions downplays the standard rate $\gamma/\sqrt{T}$ of the sequences of local alternatives against which the tests have non-trivial local power. The intuition is quite clear: under such a sequence of local alternatives,
$$E[Z_t \otimes Y_t] = \frac{C_2\,\theta_0}{\delta_T} + \frac{C_2\,\gamma}{\sqrt{T}}$$
differs from its value under the null by the standard scale $1/\sqrt{T}$.
We need to reinforce Assumptions 2.1(ii) and 2.2(ii) for the study of sequences of local alternatives:

ASSUMPTION 2.5. (Reinforced assumptions for the study of local power). For any sequence $\theta_T$ in the interior of $\Theta$ such that $\operatorname{Plim}[\theta_T - \theta^0] = 0$:
(i) An HAC estimator $S_T(\theta)$ of the long-run covariance matrix $S^0$ is available and such that $S^0 = \operatorname{Plim}[S_T(\theta_T)]$.
(ii) $\operatorname{Plim}\Big[\dfrac{\partial\bar\phi_{1T}(\theta_T)}{\partial\theta'}\Big] = C_1$ and $\operatorname{Plim}\Big[\dfrac{\sqrt{T}}{\lambda_T}\,\dfrac{\partial\bar\phi_{2T}(\theta_T)}{\partial\theta'}\Big] = C_2$.
(iii) $\phi_1(Y_t,\theta)$ is twice continuously differentiable and, for all components $\phi_{1T}^{(j)}$ (see also footnote 3), $\operatorname{Plim}[\partial^2\bar\phi_{1T}^{(j)}(\theta_T)/\partial\theta\,\partial\theta']$ is a well-defined matrix.

PROPOSITION 2.2. (Local power of GMM score tests). Under Assumptions 2.1–2.5:
(i) With a drifting true unknown value $\theta_T = \theta_0 + \gamma/\sqrt{T}$, for some $\gamma\in\mathbb{R}^p$, we have $\operatorname{Plim}[\xi_T^{NW} - \xi_T^{K}] = 0$, and both $\xi_T^{NW}$ and $\xi_T^{K}$ converge in distribution towards a non-central chi-square with $p$ degrees of freedom and non-centrality parameter
$$\mu = (\gamma' C_1' \;\vdots\; 0)\,[S^0]^{-1}\begin{pmatrix} C_1\gamma \\ 0 \end{pmatrix}.$$
(ii) When $\lambda_T^2/\sqrt{T} \to \infty$, with a drifting true unknown value $\theta_T = \theta_0 + \gamma/\lambda_T$, for some $\gamma\in\mathbb{R}^p$ such that $C_1\gamma = 0$, we have $\operatorname{Plim}[\xi_T^{NW} - \xi_T^{K}] = 0$, and both $\xi_T^{NW}$ and $\xi_T^{K}$ converge in distribution towards a non-central chi-square with $p$ degrees of freedom and non-centrality parameter
$$\mu = (0 \;\vdots\; \gamma' C_2')\,[S^0]^{-1}\begin{pmatrix} 0 \\ C_2\gamma \end{pmatrix}.$$
Two additional conclusions implicitly follow from Proposition 2.2. First, if $C_1\gamma \ne 0$, the two GMM score tests behave more or less as usual against sequences of local alternatives in the direction $\gamma$. They are asymptotically equivalent, and both are consistent against sequences converging more slowly than $\sqrt{T}$. They both follow asymptotically a non-central chi-square against sequences with exactly the rate $\sqrt{T}$. However, the non-centrality parameter (and hence the power of the test) does not really depend on the size of the departure $\gamma$ from the null, but only on the size of its orthogonal projection onto the orthogonal complement of the null space of $C_1$; that is, onto the space spanned by the Jacobian matrix corresponding to the strong moment conditions.

Second, if $C_1\gamma = 0$, the two GMM score tests have no power against sequences of local alternatives $\theta_T = \theta_0 + \gamma/\sqrt{T}$. They may have power against sequences $\theta_T = \theta_0 + \gamma/\lambda_T$ (or slower); their behaviour is pretty much the standard one, but only in the nearly-strong case where $\lambda_T$ goes to infinity faster than $T^{1/4}$. The case $C_1\gamma = 0$ corresponds to the study of Guggenberger and Smith (2005) (see theorem 3, p. 680). In the setting of Stock and Wright (2000) that they adopt, with $\theta = (\alpha'\ \beta')'$, $\alpha$ weakly identified and $\beta$ strongly identified, $C_1\gamma = 0$ means that $\beta$ is fixed at its value $\beta^0$ under the null. However, since this approach does not disentangle the partition of moments from the partition of parameters, it amounts, for us, to considering that $\phi = \phi_2$. This simplifies the study of local power (see our Proposition 2.3 below) and makes the condition of nearly-strong identification immaterial.

We now explain why non-standard asymptotic behaviour of both score tests may arise when we consider sequences of local alternatives in the weak directions ($\theta_T = \theta_0 + \gamma/\lambda_T$ with $C_1\gamma = 0$) while the nearly-weak identification problem is so severe that even $\lambda_T^2/\sqrt{T}$ goes to zero. Recall that the genuine weak identification usually considered in the literature ($\lambda_T = 1$) is a limit case, since we always maintain the nearly-weak identification condition $\lambda_T \to \infty$. Under such a sequence of local alternatives, while, by Assumption 2.1, $\sqrt{T}\,\bar\phi_T(\theta_T)$ is asymptotically
normal with zero mean, the key to getting a common non-central chi-square for the asymptotic distribution of a score test statistic is to ensure that $\sqrt{T}\,\bar\phi_T(\theta_0)$ is asymptotically normal with non-zero mean if and only if $\gamma$ is not zero. This result should follow from the Taylor approximation
$$\sqrt{T}\,\bar\phi_T(\theta_T) \approx \sqrt{T}\,\bar\phi_T(\theta_0) + \frac{\partial\bar\phi_T(\theta_T^*)}{\partial\theta'}\,\sqrt{T}(\theta_T - \theta_0) = \sqrt{T}\,\bar\phi_T(\theta_0) + \frac{\sqrt{T}}{\lambda_T}\,\frac{\partial\bar\phi_T(\theta_T^*)}{\partial\theta'}\,\gamma \approx \sqrt{T}\,\bar\phi_T(\theta_0) + \begin{pmatrix} 0 \\ C_2\gamma \end{pmatrix}.$$
Assumption 2.5 justifies this approximation insofar as we can show that $C_1\gamma = 0$ implies
$$\operatorname{Plim}\Big[\frac{\sqrt{T}}{\lambda_T}\,\frac{\partial\bar\phi_{1T}(\theta_T^*)}{\partial\theta'}\,\gamma\Big] = 0. \eqno(2.2)$$
This question is irrelevant if, as in Kleibergen (2005), two different degrees of identification are never considered simultaneously.⁵ In other words, we can easily state:

PROPOSITION 2.3. (Special case with only one degree of weakness). Consider the special case where the same degree of weakness is assumed for all the moment conditions at hand ($\phi(\cdot) = \phi_2(\cdot)$). Under Assumptions 2.1–2.5, with a drifting true unknown value $\theta_T = \theta_0 + \gamma/\lambda_T$, for some $\gamma\in\mathbb{R}^p$, we have $\operatorname{Plim}[\xi_T^{NW} - \xi_T^{K}] = 0$, and both $\xi_T^{NW}$ and $\xi_T^{K}$ converge in distribution towards a non-central chi-square with $p$ degrees of freedom and non-centrality parameter $\mu = \gamma' C'[S^0]^{-1}C\gamma$.

Needless to say, a result similar to Proposition 2.3 holds when $\phi(\cdot) = \phi_1(\cdot)$ (standard strong identification). Hence, the interesting case is precisely the mixture of strong and nearly-weak identification, with non-empty subsets of components $\phi_1$ and $\phi_2$. In this case, (2.2) should follow from
$$C_1\gamma = 0 \;\Longrightarrow\; \operatorname{Plim}\Big[\frac{\sqrt{T}}{\lambda_T}\,\frac{\partial\bar\phi_{1T}(\theta_T)}{\partial\theta'}\,\gamma\Big] = 0. \eqno(2.3)$$
Note that (2.3) is a direct consequence of Assumption 2.4 applied with the drifting true value $\theta_T$ in place of $\theta^0$. Of course, conditions (2.2) and (2.3) are identical in the special case where the moment conditions $\phi_1(\cdot)$ are linear. In other words, we can also easily state:

PROPOSITION 2.4. (Special case with linear strong moment conditions). Consider the special case where $\phi_1(Y_t,\theta)$ is linear with respect to $\theta$. Then the conclusions of Proposition 2.3 hold.

By contrast, in the general case of non-linear moment restrictions, we would like to be able to deduce (2.2) from (2.3) through a Taylor argument for each component $\phi_1^{(j)}$:
$$\frac{\sqrt{T}}{\lambda_T}\,\frac{\partial\bar\phi_{1T}^{(j)}(\theta_T^*)}{\partial\theta} \approx \frac{\sqrt{T}}{\lambda_T}\,\frac{\partial\bar\phi_{1T}^{(j)}(\theta_T)}{\partial\theta} + \frac{\sqrt{T}}{\lambda_T}\,\frac{\partial^2\bar\phi_{1T}^{(j)}(\theta_T^{**})}{\partial\theta\,\partial\theta'}(\theta_T^* - \theta_T). \eqno(2.4)$$

⁵ The fact that, with only one rate of convergence, nearly-weak identification does not modify the standard equivalence between tests (that all have trivial asymptotic power) has already been noticed by Smith (2007): see footnote 3, p. 244.
The problem is then that, since we only know that $(\theta_T^* - \theta_T) = O_P(1/\lambda_T)$, we can neglect the second term on the RHS of (2.4) only when $\sqrt{T}/\lambda_T^2$ goes to zero, i.e. under nearly-strong identification. Otherwise, there is no guarantee that $\sqrt{T}\,\bar\phi_T(\theta_0)$ is asymptotically normal under a sequence of 'weak' local alternatives $\theta_T = \theta_0 + \gamma/\lambda_T$, even when $C_1\gamma = 0$. This explains why we no longer get a characterization of local power through a standard non-central chi-square. Fortunately, this problem may not invalidate the consistency of the standard GMM score test against sufficiently slow sequences of weak alternatives. Ideally, one would like to prove that:

(A hypothetical claim of consistency) The GMM score test is consistent against any sequence of local alternatives $\theta_T = \theta_0 + \gamma/\delta_T$, for all $\gamma\in\mathbb{R}^p$, when $\delta_T/\lambda_T \to 0$: $\operatorname{Plim}[\xi_T^{NW}] = +\infty$.

This claim is trivially deduced from the former results if either the direction $\gamma$ is strong ($C_1\gamma \ne 0$) or identification is nearly strong ($\sqrt{T}/\lambda_T^2$ goes to zero). The novelty would be to maintain consistency even in the case of nearly-weak identification arbitrarily close to weak identification ($\lambda_T$ going to infinity arbitrarily slowly), insofar as the sequence of alternatives converges even more slowly than the sequence $\lambda_T$. We develop in the Appendix an argument to show that consistency is likely, but not guaranteed. In other words, the information may be quite weak and yet efficiently used for testing. This result is consistent with our estimation result in Section 3 below: we show that, in any case, with nearly-weak global identification, a GMM estimator converges at least at rate $\lambda_T$. This consistency property is no longer likely for Kleibergen's modified score test, because its modification may waste this fragile part of the information. This is the price to pay for a correct asymptotic size even in the limit case of no identification. Kleibergen (2005) overlooked this problem since, as pointed out by Proposition 2.3, it may occur only when two different rates of identification are considered simultaneously. Fragile identification may be wasted by Kleibergen's modification precisely because it comes together with another, stronger, piece of information. To see this, the key is the aforementioned lack of logical implication from (2.3) to (2.2). As a result, the modified score statistic and the original one may have quite different asymptotic behaviours since, as recalled above,
$$\sqrt{T}\,\frac{\partial\tilde\phi_T^{(j)}(\theta_0)}{\partial\theta} = \sqrt{T}\,\frac{\partial\bar\phi_T^{(j)}(\theta_0)}{\partial\theta} - \operatorname{Cov}_{as}\Big(\sqrt{T}\,\frac{\partial\bar\phi_T^{(j)}(\theta_0)}{\partial\theta},\, \sqrt{T}\,\bar\phi_T(\theta_0)\Big)\Big[\operatorname{Var}_{as}\big(\sqrt{T}\,\bar\phi_T(\theta_0)\big)\Big]^{-1}\sqrt{T}\,\bar\phi_T(\theta_0). \eqno(2.5)$$
It is quite evident from (2.5) that, when $\sqrt{T}\,\bar\phi_T(\theta_0)$ is not $O_P(1)$, the modified score statistic may have an arbitrarily nasty asymptotic behaviour.
3. CONSISTENT ESTIMATION WITH NEARLY-WEAK INSTRUMENTS

3.1. General framework

In this section, we provide consistent estimation of the true (unknown) parameter $\theta^0$. Standard GMM estimation defines its estimator $\hat\theta_T$ as follows:

DEFINITION 3.1. Let $\Omega_T$ be a sequence of symmetric positive definite random matrices of size $K$ which converges in probability towards a positive definite matrix $\Omega$. A GMM estimator $\hat\theta_T$ of
$\theta^0$ is then defined as
$$\hat\theta_T = \arg\min_{\theta\in\Theta} Q_T(\theta) \quad\text{where}\quad Q_T(\theta) \equiv \bar\phi_T'(\theta)\,\Omega_T\,\bar\phi_T(\theta), \eqno(3.1)$$
with $\bar\phi_T(\theta) = \frac{1}{T}\sum_{t=1}^T \phi(Y_t,\theta)$ the empirical mean of the moment restrictions.

We consider here $k_1$ standard moment restrictions such that⁶
$$\sqrt{T}\,[\bar\phi_{1T}(\theta) - \rho_1(\theta)] = O_P(1) \eqno(3.2)$$
and $k_2\ (= K - k_1)$ weaker moment restrictions such that
$$\sqrt{T}\,\Big[\bar\phi_{2T}(\theta) - \frac{\lambda_T}{\sqrt{T}}\,\rho_2(\theta)\Big] = O_P(1) \quad\text{where}\quad \lambda_T = o(\sqrt{T}) \quad\text{and}\quad \lambda_T \to \infty. \eqno(3.3)$$
$\lambda_T$ measures the degree of weakness of the second group of moment restrictions: its corresponding component $\rho_2(\cdot)$ is squeezed towards zero, and $\operatorname{Plim}[\bar\phi_{2T}(\theta)] = 0$ for all $\theta\in\Theta$. Therefore, the probability limit of $\bar\phi_T(\theta)$ does not allow one to discriminate between $\theta^0$ and any other $\theta\in\Theta$. In such a context, identification is a combined property of the functions $\phi(Y_t,\cdot)$ and $\rho(\cdot)$ and of the asymptotic behaviour of $\lambda_T$. Assumption 3.1 below reinforces the standard CLT stated in Assumption 2.1 for moment conditions evaluated under the null by maintaining a functional CLT on the whole parameter set $\Theta$. In this respect, we follow Stock and Wright (2000).⁷

ASSUMPTION 3.1. (Identification).
(i) $\rho(\cdot)$ is a continuous function from a compact parameter space $\Theta \subset \mathbb{R}^p$ into $\mathbb{R}^K$:
$$\rho(\theta) = \begin{pmatrix} \rho_1(\theta) \\ \rho_2(\theta) \end{pmatrix} = 0 \;\Longleftrightarrow\; \theta = \theta^0.$$
(ii) The empirical process $(\Psi_T(\theta))_{\theta\in\Theta}$ obeys a functional CLT:
$$\Psi_T(\theta) \equiv \sqrt{T}\begin{pmatrix} \bar\phi_{1T}(\theta) - \rho_1(\theta) \\ \bar\phi_{2T}(\theta) - \frac{\lambda_T}{\sqrt{T}}\,\rho_2(\theta) \end{pmatrix} \Rightarrow \Psi(\theta),$$
where $\Psi(\theta)$ is a Gaussian stochastic process on $\Theta$ with mean zero.
(iii) $\lambda_T$ is a deterministic sequence of positive real numbers with $\lim_{T\to\infty}\lambda_T = \infty$ and $\lim_{T\to\infty}\lambda_T/\sqrt{T} = 0$.
Our framework differs from the seminal paper by Stock and Wright (2000) in two ways. First, we consider nearly-weak identification, as introduced in a linear setting by Hahn and Kuersteiner (2002), rather than weak identification; in this sense, we are closer to Caner (2007). Second, we do not assume a priori knowledge of a partition $\theta = (\alpha' \;\vdots\; \beta')'$, where $\alpha$ is strongly identified and $\beta$ (nearly-)weakly identified. Our framework conveys the idea that identification is a matter of the moment conditions: nearly-weak identification is produced by the rates of convergence of the moment conditions. More precisely, Assumption 3.1 implies that, for the first set of moment conditions, we have (as for standard GMM)
$$\rho_1(\theta) = \operatorname{Plim}[\bar\phi_{1T}(\theta)],$$
whereas, for the second set of moment conditions, we only have

⁶ Functions $\rho_1$ and $\rho_2$ are introduced in equation (1.9). See also Assumption 3.1 below for a formal definition.
⁷ As stressed by Stock and Wright (2000), the uniformity in $\theta$ provided by the functional CLT is crucial in the case of non-linear, non-separable moment conditions, that is, when the occurrences of $\theta$ and of the observations in the moment conditions are not additively separable. By contrast, Hahn and Kuersteiner (2002) (linear case) and Lee (2004) (separable case) do not need to resort to a functional CLT.
$$\rho_2(\theta) = \operatorname{Plim}\Big[\frac{\sqrt{T}}{\lambda_T}\,\bar\phi_{2T}(\theta)\Big].$$
The above identification assumption allows for a consistent GMM estimator, even in the case of nearly-weak identification.

THEOREM 3.1. (Consistency of $\hat\theta_T$). Under Assumption 3.1, any GMM estimator $\hat\theta_T$ as in (3.1) is weakly consistent.

Under an additional (local) identification assumption, we get consistency at the slowest available rate $\lambda_T$.

THEOREM 3.2. (Rate of convergence). Under Assumptions 2.2 and 2.4–3.1, we have $\hat\theta_T - \theta^0 = O_P(1/\lambda_T)$.

In Section 4, we introduce a convenient rotation in the parameter space that allows us to identify some strongly identified directions at rate $\sqrt{T}$.

3.2. Single-equation linear IV model

We already pointed out a major difference between our framework and the existing literature: possible weakness is assigned to specific instruments (or moment conditions), and not to the structural parameters of the model. The following single-equation linear IV model, with two structural parameters and two orthogonal instruments (and no exogenous variables, for convenience), sheds some light on the consequences of this distinction:
$$y = Y\theta + u, \qquad Y = [X_1\ X_2]\,\Pi + [V_1\ V_2], \eqno(3.4)$$
where $y$ and $u$ are $(T,1)$, $Y$, $[X_1\ X_2]$ and $[V_1\ V_2]$ are $(T,2)$, $\theta$ is $(2,1)$ and $\Pi$ is $(2,2)$.
As is commonly done in the literature, the matrix of coefficients $\Pi$ is artificially linked to the sample size $T$ in order to introduce (nearly-)weak identification issues. Depending on the focus of interest, several matrices $\Pi = \Pi_T$ may be considered.

(i) Staiger and Stock (1997) consider a framework with the same genuine weak identification pattern for all the parameters: $\Pi_T^{SS} = C/\sqrt{T}$.
(ii) Stock and Wright (2000) reinterpret this framework so as to consider strong and weak identification patterns simultaneously. This distinction is made at the parameter level, and the structural parameter $\theta$ is (a priori) partitioned as $\theta = [\theta_1 \;\vdots\; \theta_2]'$, with $\theta_1$ strongly identified and $\theta_2$ weakly identified. Consequently, they introduce
$$\Pi_T^{SW} = \begin{pmatrix} c_{11} & c_{12}/\delta_T \\ c_{21} & c_{22}/\delta_T \end{pmatrix} \quad\text{with}\quad \delta_T = \sqrt{T}.$$
(iii) We consider strong and nearly-weak identification patterns simultaneously. This distinction is made at the moment condition level (or the instrument level), and we suppose the available moment conditions $E[\phi(\cdot)]$ to be (a priori) partitioned as $\phi = [\phi_1' \;\vdots\; \phi_2']'$, with $\phi_1$ strongly identifying and $\phi_2$ nearly-weakly identifying. Consequently, we introduce
$$\Pi_T^{AR} = \begin{pmatrix} c_{11} & c_{12} \\ c_{21}/\delta_T & c_{22}/\delta_T \end{pmatrix} \quad\text{with}\quad \delta_T \to \infty,\ \delta_T = o(\sqrt{T}).$$
Besides considering different degrees of weakness, modelling with $\Pi_T^{SW}$ or with $\Pi_T^{AR}$ has an important implication.⁸ $\Pi_T^{AR}$ modifies the explanatory power of the second instrument $X_2$ only. As a result, one strong moment condition (associated with $X_1$) and one less informative one (associated with $X_2$) naturally emerge. Intuitively, the standard restriction should identify one standard direction in the parameter space; however, this direction is still unknown and does not necessarily correspond to one of the structural parameters. On the other hand, modelling with $\Pi_T^{SW}$ amounts to treating $\theta_2$ as weakly identified, and alters the explanatory power of both instruments. In the linear model (3.4), the two moment conditions read $E[(y_t - Y_t'\theta^0)X_t] = 0$. When $\Pi$ is replaced respectively by $\Pi_T^{SW}$ and by $\Pi_T^{AR}$, these moments can be rewritten as⁹
$$\rho_{1s}^{SW}(\theta_1) + \rho_{1w}^{SW}(\theta_2)/\delta_T = 0 \quad\text{and}\quad \rho_{2s}^{SW}(\theta_1) + \rho_{2w}^{SW}(\theta_2)/\delta_T = 0, \eqno(3.5)$$
$$\rho_1^{AR}(\theta_1,\theta_2) = 0 \quad\text{and}\quad \rho_2^{AR}(\theta_1,\theta_2)/\delta_T = 0. \eqno(3.6)$$
Finally, modelling weak identification with $\Pi_T^{SW}$ or with $\Pi_T^{AR}$ does not change the concentration parameter $\mu$ (or matrix), which is the well-accepted measure of the strength of the instruments in the literature. In the linear model (3.4) it is well defined as
$$\mu = \Sigma_V^{-1/2}\,\Pi'X'X\,\Pi\,\Sigma_V^{-1/2} \quad\text{with}\quad \operatorname{Var}(V) \equiv \Sigma_V.$$
The determinant of this concentration matrix, when $\Pi$ is replaced by $\Pi_T^{SW}$ or by $\Pi_T^{AR}$, writes
$$\det[\mu^{SW}] = \det[\mu^{AR}] = \frac{1}{\delta_T^2}\,\det[X'X]\,\det[\Sigma_V^{-1}]\,\det[C]^2. \eqno(3.7)$$
Therefore, the same features of partial identification can be captured with both approaches. While with standard weak asymptotics $\delta_T^2 = T$ and the concentration matrix has a finite limit (see also Andrews and Stock, 2007), nearly-weak asymptotics allow an infinite limit for the determinant of the concentration matrix, but at a rate smaller than $\det[X'X] = O(T)$. In this respect, there is no difference between $\Pi_T^{SW}$ and $\Pi_T^{AR}$: only the rate of convergence to zero of a column (for $\Pi_T^{SW}$) or of a row (for $\Pi_T^{AR}$) of the matrix $\Pi$ matters.

⁸ This difference is relatively obvious and is not extensively discussed here. This also justifies why the same parameter $\delta_T$ may refer to different rates of convergence depending on the context.
⁹ See the Appendix for the exact definitions of $\rho_{1s}^{SW}(\cdot)$, $\rho_{1w}^{SW}(\cdot)$, $\rho_{2s}^{SW}(\cdot)$, $\rho_{2w}^{SW}(\cdot)$, $\rho_1^{AR}(\cdot)$ and $\rho_2^{AR}(\cdot)$.
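The determinant identity (3.7) follows from $\det[\Pi_T^{SW}] = \det[\Pi_T^{AR}] = \det[C]/\delta_T$ and can be checked numerically; the sketch below uses arbitrary made-up values for $X$, $C$ and $\Sigma_V$, and a Cholesky-based choice of $\Sigma_V^{-1/2}$.

```python
import numpy as np
from numpy.linalg import det, inv, cholesky

rng = np.random.default_rng(4)
T, delta_T = 500, 5.0
X = rng.standard_normal((T, 2))                 # two orthogonal-ish instruments
C = np.array([[1.0, 0.5], [0.7, 1.2]])
Sigma_V = np.array([[1.0, 0.3], [0.3, 1.0]])
Sv_inv_half = inv(cholesky(Sigma_V))            # one choice of Sigma_V^{-1/2}

for Pi in (np.column_stack([C[:, 0], C[:, 1] / delta_T]),   # SW-style (delta_T = sqrt(T) in the text)
           np.vstack([C[0, :], C[1, :] / delta_T])):        # AR-style
    mu = Sv_inv_half @ Pi.T @ X.T @ X @ Pi @ Sv_inv_half.T
    lhs = det(mu)
    rhs = det(X.T @ X) * det(inv(Sigma_V)) * det(C) ** 2 / delta_T ** 2
    print(np.isclose(lhs, rhs))                 # same determinant for both designs
```

Both iterations agree with (3.7), illustrating why the two modelling choices are indistinguishable through the concentration matrix's determinant alone.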
Partial identification (Phillips, 1989) refers to matrices $\Pi$ that may not be of full rank.¹⁰ Generalization to asymptotic rank condition failures (at rate $\delta_T$) comes at the price of having to specify which row or column asymptotically goes to zero. At least, our approach with $\rho^{AR}$ remains true to the partial identification approach by working with "estimable functions" of the structural parameters, i.e. functions that can be identified and $\sqrt{T}$-consistently estimated. In the former example, this is clearly the case for $g(\theta) = \pi_{11}\theta_1 + \pi_{12}\theta_2$.¹¹ By contrast, the approach $\rho^{SW}$ directly implies a partition of the structural parameters between $\theta_1$, which is strongly identified, and $\theta_2$, which is not. In the literature, $T/\delta_T^2$ plays the role of the effective number of observations available for (nearly-)weak identification and implies a slow rate of convergence $T^{1/2}/\delta_T$ for functions of the parameters that are not strongly identified, unlike $g(\theta)$, which is endowed with a $\sqrt{T}$-consistent estimator. By contrast with Choi and Phillips (1992), we even get a normal asymptotic distribution for this estimator, while they only obtained a mixture of normals: by assuming that $\delta_T$ goes to infinity more slowly than $\sqrt{T}$, we keep the possibility of consistent estimation for all structural parameters. However, when considering (in the next section) more general non-linear settings of nearly-weak identification, we will see that the price to pay for standard $\sqrt{T}$ asymptotic normality of strongly identified functions of the parameters may be even higher: for instance, we will need to resort to nearly-strong identification, i.e. to assume that the analogue of $\delta_T$ goes to infinity more slowly than $T^{1/4}$ (see also Section 2).
4. ASYMPTOTIC DISTRIBUTION THEORY

4.1. Asymptotic distribution theory of 2S-GMM

First, we introduce a convenient rotation in the parameter space to disentangle the rates of convergence. Special linear combinations of θ can actually be estimated at the standard rate √T, while others are still estimated at the slower rate λ_T. This is formalized by a CLT which allows the practitioner to apply usual GMM formulas without knowing a priori the identification pattern. We consider the following situation:
(i) Only k₁ equations (defined by ρ₁) have sample counterparts converging at rate √T. Unfortunately, we have (in general) a reduced rank problem, since [∂ρ₁(θ⁰)/∂θ′] is not of full column rank: its rank s₁ is strictly smaller than p. Intuitively, the first set of equations can only identify s₁ directions in the p-dimensional parameter space.
(ii) The other k₂ equations (defined by ρ₂) should be used to identify the remaining s₂ (s₁ + s₂ = p) directions.¹² However, this identification comes at the slower rate λ_T.
The parameter space will be separated into two subspaces, each of them characterized as the range of a full column rank matrix: respectively the (p, s₁)-matrix R₁ and the (p, s₂)-matrix R₂. Since R₂ characterizes the set of slow directions, it is naturally defined via the null space of [∂ρ₁(θ⁰)/∂θ′],
¹⁰ The case considered and studied by Phillips (1989) and Choi and Phillips (1992) is even more general, since they also address identification issues for coefficients of exogenous variables in the structural equation.
¹¹ It is shown in the Appendix that ρ^AR_1(θ) = E(X²_{1t})[g(θ⁰) − g(θ)].
¹² By assumption, our set of moment conditions identifies the entire vector of parameters θ.
B. Antoine and E. Renault
i.e. the directions not identified in a standard way:
[∂ρ₁(θ⁰)/∂θ′] R₂ = 0.   (4.1)
Then, the remaining s₁ directions are defined as follows: R = [R₁ R₂], where R is a non-singular (p, p)-matrix with rank p that can be used as a matrix of a change of basis in R^p. The new parameter is defined as η = R⁻¹θ, that is,
θ = [R₁ R₂] (η₁ ; η₂), with η₁ of size s₁ and η₂ of size s₂.
In the next subsection, we show that this reparametrization defines two subsets of directions, each associated with a specific rate of convergence. In general, there is no hope of getting standard (non-degenerate) asymptotic normality for some components of the estimator θ̂_T of θ⁰: after a standard expansion of the first-order conditions, θ̂_T now appears as asymptotically equivalent to some linear transformations of φ̄_T(θ) which are likely to mix up the two rates. Hence, all components of θ̂_T might be contaminated by the slow rate of convergence. The advantage of the reparametrization is precisely to separate these two rates. In Section 4.3, we carefully compare our theory with Stock and Wright (2000) (in the linear case), and provide conditions under which some components of θ̂_T converge (by chance) at the standard rate. These correspond to what is assumed a priori by Stock and Wright (2000) when they separate the structural parameters into one standard-converging group and one slower-converging one.
The reparametrization may not be feasible in practice, since the matrix R depends on the true unknown value of the parameter θ⁰. However, we can still deduce a feasible inference strategy: technical details can be found in the companion paper by Antoine and Renault (2008). Albeit with a mixture of different rates, the Jacobian matrix of moment conditions has a consistent sample counterpart (see Lemma A.1 in the proof of Proposition 2.1):
√T [∂φ̄_T(θ⁰)/∂θ′] R Λ_T⁻¹ →P J, with J ≡ [∂ρ(θ⁰)/∂θ′] R and Λ_T = [√T Id_{s₁}, 0 ; 0, λ_T Id_{s₂}].   (4.2)
Λ_T is the (p, p) block-diagonal scaling matrix, where Id_r denotes the identity matrix of size r; J is the (K, p) block-diagonal matrix with its two blocks respectively defined as the (k_i, s_i) matrices [∂ρ_i(θ⁰)/∂θ′] R_i for i = 1, 2. Note that the coexistence of two rates of convergence (λ_T and √T) implies zero northeast and southwest blocks for J. Moreover, to derive the asymptotic distribution of the GMM estimator θ̂_T, the convergence result (4.2) needs to hold even when θ⁰ is replaced by some preliminary consistent estimator θ*_T. Hence, Taylor expansions must be robust to a λ_T-consistent estimator, the only rate guaranteed by Theorem 3.2. This situation is rather similar to the one met in Section 2 when introducing nearly-strong identification. Hence, if moments are non-linear, we need (similarly to Andrews, 1995, for non-parametric estimates) to assume that our nearly-weakly identified directions are estimated at a rate faster than T^{1/4}.¹³,¹⁴ In addition, we want, as usual, uniform convergence of sample Hessian matrices. This leads us to maintain the following assumption:
ASSUMPTION 4.1. (Taylor expansions).
¹³ The link between Andrews (1994, 1995) and our setting is further discussed in Antoine and Renault (2008).
¹⁴ As already noted, this nearly-strong condition is irrelevant when the same degree of weakness is assumed for all moment conditions (φ = φ₂, or φ₁ is linear with respect to θ).
(i) φ₁(Y_t, θ) is linear with respect to θ, or lim_{T→∞}[λ²_T/√T] = ∞.
(ii) φ̄_T(θ) is twice continuously differentiable on the interior of Θ and is such that:
∂²φ̄_{1T,k}(θ)/∂θ∂θ′ →P H_{1,k}(θ) for all 1 ≤ k ≤ k₁, and (√T/λ_T) ∂²φ̄_{2T,k}(θ)/∂θ∂θ′ →P H_{2,k}(θ) for all 1 ≤ k ≤ k₂,
uniformly in θ in some neighbourhood of θ⁰, for some (p, p) matricial functions H_{i,k}(θ), i = 1, 2 and 1 ≤ k ≤ k_i.
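The rotation R = [R₁ R₂] defined via the null space of [∂ρ₁(θ⁰)/∂θ′] can be illustrated numerically. The sketch below is not from the paper: it is a minimal construction, assuming the Jacobian of the strong moment block is known, that builds R₂ as an orthonormal basis of the null space and R₁ as a basis of the row space via an SVD.

```python
import numpy as np

def reparametrization(D1):
    """Build R = [R1 R2] from the (k1, p) Jacobian D1 = d rho_1 / d theta'.

    R1 spans the row space of D1 (fast, strongly-identified directions);
    R2 spans the null space of D1 (slow directions), so D1 @ R2 = 0.
    """
    U, s, Vt = np.linalg.svd(D1)
    tol = max(D1.shape) * np.finfo(float).eps * (s[0] if s.size else 0.0)
    s1 = int(np.sum(s > tol))          # numerical rank of D1
    R1 = Vt[:s1].T                     # (p, s1) basis of the row space
    R2 = Vt[s1:].T                     # (p, s2) basis of the null space
    return np.hstack([R1, R2])

# Example: k1 = 1 strong moment, p = 2 parameters (illustrative numbers)
D1 = np.array([[2.0, 1.0]])
R = reparametrization(D1)
# The slow directions are annihilated by the strong Jacobian
assert np.allclose(D1 @ R[:, 1:], 0.0)
```

Because the basis comes from an SVD, R is orthogonal here; the paper only requires R to be non-singular, so any other completion of R₂ into a basis would do as well.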
Up to unusual rates of convergence, we get a standard asymptotic normality result for the new parameter η = R⁻¹θ: ¹⁵
THEOREM 4.1. (Asymptotic normality).
(i) Under Assumptions 2.1, 2.2 and 2.4–4.1, the GMM estimator θ̂_T defined by (3.1) is such that:
Λ_T R⁻¹(θ̂_T − θ⁰) →d N(0, [J′ΩJ]⁻¹ J′Ω S(θ⁰) ΩJ [J′ΩJ]⁻¹),
where Ω denotes the probability limit of the weighting matrix.
(ii) Under Assumptions 2.1, 2.2 and 2.4–4.1, the asymptotic variance displayed in (i) is minimal when the GMM estimator θ̂_T is defined with a weighting matrix Ω_T being a consistent estimator of Ω = [S(θ⁰)]⁻¹:
Λ_T R⁻¹(θ̂_T − θ⁰) →d N(0, [J′[S(θ⁰)]⁻¹J]⁻¹).
Since θ̂_T = (R₁η̂_{1,T} + R₂η̂_{2,T}), a linear combination a′θ̂_T of the estimated parameters of interest is endowed with the √T rate of convergence, as η̂_{1,T}, if and only if a′R₂ = 0, or when a belongs to the orthogonal space of the range of R₂. By equation (4.1), it also means that a is spanned by the columns of the matrix [∂ρ₁(θ⁰)/∂θ′]′. Hence, a′θ is strongly identified if and only if it is identified by the first set of moment conditions ρ₁(θ) = 0.
As far as inference about θ is concerned, several practical implications of Theorem 4.1 are worth mentioning. Up to the unknown matrix R and the unknown rate of convergence λ_T, a consistent estimator of the asymptotic covariance matrix [J′[S(θ⁰)]⁻¹J]⁻¹ is ¹⁶
Λ_T R⁻¹ [ (∂φ̄_T(θ̂_T)/∂θ′)′ S_T⁻¹ (∂φ̄_T(θ̂_T)/∂θ′) ]⁻¹ [R′]⁻¹ Λ_T,   (4.3)
where S_T is a standard consistent estimator of the long-term covariance matrix.¹⁷ From Theorem 4.1, for large T, [Λ_T R⁻¹(θ̂_T − θ⁰)] behaves like a Gaussian random variable with mean zero and variance (4.3). One may be tempted to deduce that √T(θ̂_T − θ⁰) behaves like a Gaussian random variable with mean 0 and variance
[ (∂φ̄_T(θ̂_T)/∂θ′)′ S_T⁻¹ (∂φ̄_T(θ̂_T)/∂θ′) ]⁻¹.   (4.4)
And this would give the feeling that we are back to the standard GMM formulas of Hansen (1982). As far as practical purposes are concerned, this intuition is correct: in particular, estimation of R
¹⁵ Note that efficiency in Theorem 4.1(ii) is implicitly considered for the given set of moment restrictions φ̄_T(.).
¹⁶ This directly follows from Lemma A.5 in the Appendix.
¹⁷ Recall that a consistent estimator S_T of the long-term covariance matrix S(θ⁰) can be built in the standard way from a preliminary inefficient GMM estimator of θ.
is not necessary to perform inference.¹⁸ However, from a theoretical point of view, this is a bit misleading. First, since in general all components of θ̂_T converge at the slow rate, √T(θ̂_T − θ⁰) has no limit distribution. In other words, considering the asymptotic variance (4.4) amounts to taking the inverse of an asymptotically singular matrix. Second, for the same reason, (4.4) is not an estimator of the standard population matrix
[ (∂ρ′(θ⁰)/∂θ) [S(θ⁰)]⁻¹ (∂ρ(θ⁰)/∂θ′) ]⁻¹.   (4.5)
To conclude, while inference about θ is technically more involved than one might believe at first sight, it is actually very similar to standard GMM from a purely practical point of view. In other words, if a practitioner is not aware of the specific framework with moment conditions associated with several rates of convergence (coming, say, from the use of instruments of different qualities), then she can still provide reliable inference by using standard GMM formulas. In this respect, we generalize Kleibergen’s (2005) result that inference can be performed without a priori knowledge of the identification setting. However, we are more general than Kleibergen (2005), since we allow moment conditions to display simultaneously different identification patterns.¹⁹ Finally, the score test defined in Section 2 is complemented by the classical overidentification test:
THEOREM 4.2. (J-test). Under Assumptions 2.1, 2.2, 2.4–4.1, with Ω_T a consistent estimator of [S(θ⁰)]⁻¹, T Q_T(θ̂_T) →d χ²(K − p).

4.2. About the (non)-equivalence of 2S-GMM and CU-GMM

We now show that the nearly-strong identification condition is exactly what is needed to ensure that both strong and weak directions are equivalently estimated by efficient two-step GMM and by continuously updated GMM. This explains the aforementioned case of equivalence between the GMM score test and Kleibergen’s modified score test. Hansen et al. (1996) define the continuously updated GMM estimator θ̂_T^CU as:
DEFINITION 4.1.
Let S_T(θ) and φ̄_T(θ) be defined as in Assumption 2.1. The continuously updated GMM estimator θ̂_T^CU of θ⁰ is then defined as:
θ̂_T^CU = arg min_{θ∈Θ} Q_T^CU(θ), where Q_T^CU(θ) ≡ φ̄′_T(θ) S_T⁻¹(θ) φ̄_T(θ).   (4.6)
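The contrast between the two-step objective, whose weighting matrix is frozen at a preliminary estimate, and the continuously updated objective (4.6), which re-evaluates S_T(θ) at every candidate θ, can be made concrete. The sketch below is purely illustrative: the linear IV moment function, the data-generating process and all names are hypothetical, not taken from the paper.

```python
import numpy as np

def moments(theta, y, x, z):
    # Hypothetical linear IV moments: phi_t(theta) = z_t * (y_t - x_t' theta)
    return z * (y - x @ theta)[:, None]

def S_hat(theta, y, x, z):
    # i.i.d.-style estimate of the variance of the moments
    phi = moments(theta, y, x, z)
    phi = phi - phi.mean(axis=0)
    return phi.T @ phi / len(y)

def Q_cu(theta, y, x, z):
    # Continuously updated objective: S_T is re-evaluated at every theta
    phi_bar = moments(theta, y, x, z).mean(axis=0)
    return phi_bar @ np.linalg.solve(S_hat(theta, y, x, z), phi_bar)

def Q_2s(theta, theta_prelim, y, x, z):
    # Two-step objective: weighting matrix frozen at a preliminary estimate
    phi_bar = moments(theta, y, x, z).mean(axis=0)
    return phi_bar @ np.linalg.solve(S_hat(theta_prelim, y, x, z), phi_bar)

rng = np.random.default_rng(0)
T = 500
z = rng.normal(size=(T, 3))                           # K = 3 instruments
x = (z @ np.array([1.0, 0.5, 0.2]))[:, None] + rng.normal(size=(T, 1))
y = x @ np.array([0.7]) + rng.normal(size=T)          # true theta = 0.7

grid = np.linspace(0.0, 1.5, 301)
# first step: identity weighting
q1 = [moments(np.array([t]), y, x, z).mean(axis=0) @
      moments(np.array([t]), y, x, z).mean(axis=0) for t in grid]
theta_prelim = np.array([grid[int(np.argmin(q1))]])
theta_2s = grid[int(np.argmin([Q_2s(np.array([t]), theta_prelim, y, x, z)
                               for t in grid]))]
theta_cu = grid[int(np.argmin([Q_cu(np.array([t]), y, x, z) for t in grid]))]
```

With strong instruments, as here, the two minimizers are close; the point of Proposition 4.1 below is that with nearly-weak moments this equivalence requires the nearly-strong identification condition.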
PROPOSITION 4.1. (Equivalence between CU-GMM and efficient GMM). Both strong and weak directions are equivalently estimated by efficient two-step GMM and continuously updated GMM if the nearly-strong identification condition is ensured. That is,
Λ_T R⁻¹(θ̂_T − θ̂_T^CU) = o_P(1) when λ²_T/√T → ∞.
¹⁸ Antoine and Renault (2008) provide a feasible asymptotic distribution. It simply involves plugging in a consistent estimator of R.
¹⁹ For notational simplicity, we only consider here one speed of nearly-weak identification λ_T. The more general framework with an arbitrary number of speeds is considered by Antoine and Renault (2008).
In the special case where the same degree of weakness is assumed for all moment conditions (see Proposition 2.3), CU-GMM and efficient GMM are always equivalent insofar as λ_T → ∞.
Several comments are in order. First, since non-degenerate asymptotic normality is obtained for Λ_T R⁻¹(θ̂_T − θ⁰) (and not for √T(θ̂_T − θ⁰)), the relevant (non-trivial) equivalence result between two-step efficient GMM and continuously updated GMM relates to the suitably rescaled and rotated difference Λ_T R⁻¹(θ̂_T − θ̂_T^CU). As already mentioned, a naive reading of the asymptotic theory may spuriously lead one to believe that standard formulas are maintained for the asymptotic distribution of estimators of θ⁰.
Second, the case with nearly-weak (and not nearly-strong) identification (λ²_T/√T → 0) breaks down the standard theory of efficient GMM, and the proof shows that there is no reason to believe that continuously updated GMM may be an answer. Two-step GMM and continuously updated GMM, albeit no longer equivalent, are both perturbed by higher-order terms with ambiguous effects on asymptotic distributions. The intuition given by higher-order asymptotics in standard identification settings cannot be extended to the case of nearly-weak identification. While the latter approach shows that continuously updated GMM is, in general, more efficient at higher order than the standard two-step one (see Newey and Smith, 2004, and Antoine et al., 2007), there is no clear ranking of asymptotic performances under weak identification.
Third, it is important to keep in mind that all these difficulties are due to the fact that we consider realistic circumstances where two different degrees of identification are simultaneously involved. Standard results (equivalence, or rankings between different approaches) carry over when only one rate of convergence is considered.

4.3. Single-equation linear IV model (continued)

First, we define the reparametrization in the linear model of Section 3.4.
The derivative of the standard moment restriction is
J₁ = ∂ρ₁(θ⁰)/∂θ′ = [−E(Y₁ₜX₁ₜ) ⋮ −E(Y₂ₜX₁ₜ)] = −E(X²₁ₜ) [π₁₁ ⋮ π₁₂].
Hence, the null space of J₁ is spanned by the vector [−π₁₂ ⋮ π₁₁]′ and its orthogonal by [π₁₁ ⋮ π₁₂]′. The legitimate matrix of change of basis R in the parameter space R² and the associated new parameter η can be defined as:
R = (1/Δ) [π₁₁, −π₁₂ ; π₁₂, π₁₁], with Δ = π²₁₁ + π²₁₂,
η = R⁻¹θ, that is, η₁ = π₁₁θ₁ + π₁₂θ₂ and η₂ = −π₁₂θ₁ + π₁₁θ₂.
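The algebra of this rotation is easy to verify numerically. The sketch below uses illustrative values of (π₁₁, π₁₂) and θ (not from the paper) and checks that η = R⁻¹θ indeed delivers η₁ = π₁₁θ₁ + π₁₂θ₂ and η₂ = −π₁₂θ₁ + π₁₁θ₂.

```python
import numpy as np

# Illustrative reduced-form coefficients (hypothetical values)
pi11, pi12 = 0.8, 0.6
Delta = pi11**2 + pi12**2
R = np.array([[pi11, -pi12],
              [pi12,  pi11]]) / Delta
theta = np.array([1.5, -0.5])              # (theta_1, theta_2)
eta = np.linalg.solve(R, theta)            # eta = R^{-1} theta
# eta_1 = pi11*theta_1 + pi12*theta_2 (strong direction)
# eta_2 = -pi12*theta_1 + pi11*theta_2 (slow direction)
assert np.allclose(eta, [pi11 * theta[0] + pi12 * theta[1],
                         -pi12 * theta[0] + pi11 * theta[1]])
```

The first column of R (the strong direction) is pinned down by the orthogonal of the null space of J₁; the 1/Δ normalization just makes R⁻¹ take the simple form [π₁₁, π₁₂ ; −π₁₂, π₁₁].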
Only the strong direction is completely determined, since the direction orthogonal to the null space of J₁ is uniquely determined and defines the first column of R. Any other direction could define a valid second column of R and a linear combination of θ₁ and θ₂ estimated at the slower rate.
Strictly speaking, the linear re-interpretation of Staiger and Stock (1997) by Stock and Wright (2000) is not nested in our setting, because each of their moment conditions contains a strong part (that only depends on a subvector of parameters) and a weak one. This setting (through the definition of the matrix Π^SW_T) is conveniently built so as to know a priori which subset of
the parameters is strongly identified. Now, if we pretend that we did not realize that the set of strongly-identified parameters was known and still perform the change of variables, we get: J₁ = [−π₁₁ ⋮ 0], with null space spanned by [0 ⋮ 1]′. The change of basis is defined as:
R = [a, 0 ; b, 1] with a ≠ 0 ⇒ η = R⁻¹θ = (1/a) [θ₁ ; aθ₂ − bθ₁].
As expected, we identify the strongly-identified direction as being parallel to θ₁. In other words, even the Stock and Wright approach can be accommodated within our general framework.
5. MONTE CARLO STUDY

We now report some Monte Carlo evidence about the intertemporally separable consumption capital asset pricing model (CCAPM) with constant relative risk-aversion (CRRA) preferences. Artificial data are generated to mimic the dynamic properties of the historical data.

5.1. Moment conditions

The Euler equations lead to the following moment conditions:
E[h_{t+1}(θ) | I_t] = 0 with h_t(θ) = δ r_t c_t^{−γ} − 1.
Our parameter of interest is then θ = [δ γ]′, with δ the discount factor and γ the preference parameter; (r_t, c_t) denote respectively a vector of asset returns and the consumption growth at time t. To estimate this model, our K instruments Z_t ∈ I_t include the constant as well as some lagged variables. We then rewrite the above moment conditions as ²⁰
E⁰[φ_{t,T}(θ)] = E⁰[h_{t+1}(θ) ⊗ Z_{t,T}].
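The Euler-equation residual and its instrumented moment conditions translate directly into code. The sketch below is a minimal illustration with simulated placeholder data (the returns, consumption series and instrument choice are hypothetical, not the paper's calibrated Markov-chain data).

```python
import numpy as np

def euler_residual(theta, r, c):
    """h_t(theta) = delta * r_t * c_t^{-gamma} - 1 (one column per asset)."""
    delta, gamma = theta
    return delta * r * c[:, None] ** (-gamma) - 1.0

def moment_conditions(theta, r, c, Z):
    """Averaged moments: phi_bar(theta) = mean_t [ h_{t+1}(theta) (x) Z_t ]."""
    h = euler_residual(theta, r[1:], c[1:])              # h_{t+1}
    phi = (h[:, :, None] * Z[:-1, None, :]).reshape(len(h), -1)
    return phi.mean(axis=0)

# Hypothetical data: two asset returns and consumption growth
rng = np.random.default_rng(1)
T = 200
c = np.exp(rng.normal(0.02, 0.01, T))                    # consumption growth
r = 1.04 + rng.normal(0.0, 0.05, (T, 2))                 # asset returns
Z = np.column_stack([np.ones(T),
                     r[:, 0] - r[:, 0].mean(),           # centred lagged return
                     c - c.mean()])                      # centred lagged growth
phi_bar = moment_conditions(np.array([0.97, 1.3]), r, c, Z)
```

With two assets and K = 3 instruments this yields 6 averaged moments; a GMM estimator would minimize a quadratic form in `phi_bar` over (δ, γ).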
5.2. Data generation

Our Monte Carlo design follows Tauchen (1986), Kocherlakota (1990), Hansen et al. (1996) and, more recently, Stock and Wright (2000). More precisely, the artificial data are generated by the method discussed in Tauchen and Hussey (1991). This method fits a 16-state Markov chain to the law of motion of the consumption and dividend growths, so as to approximate a previously calibrated Gaussian VAR(1) model (see Kocherlakota, 1990). The CCAPM-CRRA model is then used to price the stocks and the risk-free bond in each time period, yielding a time series of asset returns. Since the data are generated from a general equilibrium model, even the econometrician does not know whether (δ, γ) are (nearly)-weakly identified or not. In a similar study, Stock and Wright (2000) impose a different treatment for the parameters δ and γ: typically, δ is taken as strongly identified whereas γ is not. We do not make such an assumption. Through our convenient reparametrization, we identify some directions of the parameter space that are strongly identified and some others that are not.
²⁰ To stress the potential weakness of the instruments, we add the subscript T to the instruments.
5.3. Strong and weak moment conditions

We consider here three instruments: the constant, the centred lagged asset return and the centred lagged consumption growth. To apply our nearly-weak GMM estimation, we first need to separate the instruments (and the associated moment conditions) according to their strength. Typically, a moment restriction E[φ_t(θ)] is (nearly)-weak when E[φ_t(θ)] is close to zero for all θ. This means that the restriction does not permit (even partial) identification of θ. Hence, we evaluate each moment restriction over a grid of parameter values. If the moment is uniformly close to 0, then we conclude that it is weak. This study can always be performed and is not specifically related to the Monte Carlo setting; the Monte Carlo setting is simply convenient to get rid of the simulation noise by averaging over the many simulated samples. Figure 1 has been built with a sample size of 100 and 2500 Monte Carlo replications: top figures for set 1 with (a) constant instrument, (b) lagged asset return and (c) lagged consumption rate; bottom figures for set 2.²¹ This study clearly reveals two groups of moment restrictions: (i) with the constant instrument, the associated restriction varies quite substantially with the parameter θ; (ii) with the lagged instruments, both associated restrictions remain fairly small when θ varies over the grid. The set of instruments, and accordingly of moment conditions, is then separated as follows:
φ_{t,T}(θ) = [ (δ r_t c_t^{−γ} − 1) ; (δ r_t c_t^{−γ} − 1) ⊗ (r_{t−1} − r̄ ; c_{t−1} − c̄) ],
φ̄_T(θ) = (1/T) Σ_{t=1}^T φ_{t,T}(θ), with √T E[φ̄_T(θ)] = [√T, 0_{1,2} ; 0_{2,1}, λ_T Id₂] [ρ₁(θ) ; ρ₂(θ)].
As emphasized earlier, our Monte Carlo study simulates a general equilibrium model. So, even the econometrician does not know in advance which moment conditions are weak and the level of this weakness. Hence, √T and λ_T must be chosen so as to fulfil the following conditions: λ_T = o(√T) and √T = o(λ²_T).
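The grid-based diagnostic just described can be sketched generically: evaluate each averaged moment over a grid of (δ, γ) values and flag as (nearly-)weak any moment whose maximal absolute deviation from zero stays small. The moment function and threshold below are hypothetical stand-ins, not the paper's calibration.

```python
import numpy as np

def weakness_diagnostic(phi_bar, delta_grid, gamma_grid):
    """Evaluate each averaged moment over a parameter grid; a moment whose
    maximal absolute value stays near zero over the whole grid is flagged
    as (nearly-)weak."""
    vals = np.array([[phi_bar(np.array([d, g])) for g in gamma_grid]
                     for d in delta_grid])            # (n_d, n_g, K)
    return np.abs(vals).max(axis=(0, 1))              # per-moment max deviation

# Hypothetical averaged moments: the first reacts strongly to theta,
# the second barely moves (mimicking a weak instrument).
def phi_bar(theta):
    d, g = theta
    return np.array([d * 1.04 - 1.0, 1e-3 * (g - 1.0)])

max_dev = weakness_diagnostic(phi_bar,
                              np.linspace(0.5, 1.4, 10),
                              np.linspace(0.0, 20.0, 10))
strong = max_dev > 0.05                               # crude threshold
```

On this toy example the first moment is flagged as strong and the second as weak, mirroring the constant versus lagged-instrument split found in Figure 1.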
In their theoretical considerations (Section 4.3), Stock and Wright (2000) also treat the covariances of the moment conditions differently. The strength of the constant instrument is actually used to provide some intuition for their identification assumptions (δ strongly identified and γ weakly identified). However, we think that if γ is weakly identified, then it affects the covariance between r_t and c_t^{−γ}, and hence the identification of δ should be altered too. This actually matches some asymptotic results of Stock and Wright (2000), where the weak parameter affects the strong one by preventing it from converging to a standard Gaussian random variable.

5.4. Reparametrization

To identify the standard directions in the parameter space, we first define the matrix of the change of basis. Recall that it is defined through the null space of the following matrix:
J₁ = ∂ρ₁(θ⁰)/∂θ′.
²¹ Note that the conclusions are not affected when larger sample sizes are considered.
[Figure 1. CCAPM: moment restrictions as a function of the parameter values θ. Panels (over δ and γ): constant instrument, lagged return, lagged consumption growth; top row for set 1, bottom row for set 2.]
Straightforward calculations lead to:
∂φ_{1,t}(θ)/∂θ′ = [∂φ_{1,t}(θ)/∂δ ⋮ ∂φ_{1,t}(θ)/∂γ] = [r_t c_t^{−γ} ⋮ −δ r_t ln(c_t) c_t^{−γ}],
and J₁ is then defined as follows:
J₁ = ∂ρ₁(θ⁰)/∂θ′ = [E(r_t c_t^{−γ⁰}) ⋮ −E(δ⁰ r_t ln(c_t) c_t^{−γ⁰})].
The null space of J₁ is spanned by the vector [−J₁₂ ⋮ J₁₁]′ and its orthogonal by [J₁₁ ⋮ J₁₂]′. A convenient change of basis is then defined by the matrix:
R = (1/Δ) [J₁₁, −J₁₂ ; J₁₂, J₁₁], where Δ = J²₁₁ + J²₁₂,
and the new set of parameters is then obtained as:
η = R⁻¹θ ⇔ (η₁, η₂) = (J₁₁δ + J₁₂γ, −J₁₂δ + J₁₁γ).
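The two entries of J₁ are expectations of derivatives of the Euler residual and can be estimated by sample averages, after which R follows the same rotation formula as in the linear IV model. The sketch below uses simulated placeholder series (not the paper's Markov-chain data) purely to show the mechanics.

```python
import numpy as np

# Sample analogues of the two entries of J1 (illustrative data, not the paper's):
# J11 = E[r_t c_t^{-gamma0}],  J12 = -E[delta0 r_t ln(c_t) c_t^{-gamma0}]
rng = np.random.default_rng(2)
delta0, gamma0 = 0.97, 1.3
c = np.exp(rng.normal(0.02, 0.01, 5000))      # consumption growth
r = 1.04 + rng.normal(0.0, 0.05, 5000)        # asset return
J11 = np.mean(r * c ** (-gamma0))
J12 = -np.mean(delta0 * r * np.log(c) * c ** (-gamma0))

Delta = J11**2 + J12**2
R = np.array([[J11, -J12],
              [J12,  J11]]) / Delta
eta = np.linalg.solve(R, np.array([delta0, gamma0]))
# eta[0] = J11*delta + J12*gamma is the strongly-identified direction
```

Because c_t is close to one and ln(c_t) is small, J₁₂ is small relative to J₁₁ here, which foreshadows why the strong direction η₁ ends up loading almost entirely on δ in the results below.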
The standard direction η₁ is completely determined: that is, the relative weights on δ and γ are known. As a convention, we normalize all vectors to unity and we also ensure that the subspaces defined respectively by the columns of R₂ and of R₁ are orthogonal.

5.5. Asymptotic result

The adapted asymptotic convergence result is:
[√T(η̂_{1T} − η⁰₁) ; λ_T(η̂_{2T} − η⁰₂)] →d N(0, (J′S(θ⁰)⁻¹J)⁻¹), with J = [ (∂ρ₁(θ⁰)/∂θ′)R₁, 0 ; 0, (∂ρ₂(θ⁰)/∂θ′)R₂ ].
The approximation of J can easily be deduced from what has been done above.

5.6. Results

Monte Carlo results are provided for three instruments (constant, lagged asset return and lagged consumption growth) and two sets of parameter values: set 1 (or model M1a, as in Stock and Wright, 2000), where θ⁰ = [0.97 1.3]; set 2 (or model M1b), where θ⁰ = [1.139 13.7]. Model M1b has previously been found to produce non-normal estimator distributions.
First, as previously emphasized, the matrix of reparametrization is not known (even in our Monte Carlo setting) and is actually data dependent. We therefore investigate the variability of the true new parameter η⁰. We found that, even with a small sample size (T = 100), the (estimated) true new parameter is very stable and does not depend much on the realization of the sample.
[Figure 2. CCAPM: ratio of the variances as a function of the sample size. Top panels for set 1, bottom panels for set 2.]
For our two models, we find the following true new parameters:
Set 1: η⁰ = [0.9449 1.3183]; R⁻¹ = [0.999, −0.019 ; 0.019, 0.999];
Set 2: η⁰ = [1.0356 13.7082]; R⁻¹ = [0.999, −0.007 ; 0.007, 0.999].
Note that η₁ = 0.999δ − 0.019γ (for set 1) and η₁ = 0.999δ − 0.007γ (for set 2): in other words, we confirm Stock and Wright’s (2000) intuition that the parameter η₁, which is estimated at the standard rate √T, is almost equal to δ. However, by contrast with Stock and Wright (2000), this point of view was not a prior belief but an empirical conclusion.
Our findings are: (i) all the estimators are consistent; (ii) the variances of the estimators (for both η̂_T and θ̂_T) decrease to 0 with the sample size; and (iii) according to our asymptotic results, in the case of nearly-weak identification, the asymptotic variance of the new parameter η̂_{1T} should decrease faster with the sample size than that of η̂_{2T}. Figure 2 investigates this feature by plotting the evolution with the sample size of the ratio of the Monte Carlo variance of η̂_{2T} to the Monte Carlo variance of η̂_{1T}: top panels for set 1, left-hand panels for Var(η̂_{2T})/Var(η̂_{1T}) and right-hand panels for Var(θ̂_{1T})/Var(θ̂_{2T}); bottom panels for set 2. The ratio Var(η̂_{2T})/Var(η̂_{1T}) increases with T, especially for the second set of parameter values. This supports previous findings in the literature that the first set of parameter values leads to a less severe weak identification problem.
6. CONCLUSION

In a GMM context, this paper proposes a general framework to account for potentially weak instruments. In contrast with the existing literature, the weakness is directly related to the moment conditions (through the instruments) and not to the parameters. More precisely, we consider two groups of moment conditions: the standard one, associated with the standard rate of convergence √T, and the nearly-weak one, associated with the slower rate λ_T. In this framework, the standard GMM-score-type test proposed by Newey and West (1987) is valid, and we do not need to resort to the correction proposed by Kleibergen (2005). Our comparative power study reveals that the above correction does have asymptotic consequences, especially with heterogeneous identification patterns. Hence, we recommend caution, especially when instruments of heterogeneous quality are used. Our proposed framework is not much more involved in terms of specifying the identification issues, and it also helps identify the directions against which the tests have power.
Moreover, this framework ensures that GMM estimators of all parameters are consistent, but at rates possibly slower than usual. In addition, we identify and estimate efficiently (with non-degenerate asymptotic normality) the relevant directions, respectively strongly or nearly-weakly identified, in the parameter space. This asymptotic distributional theory is practically relevant, since it allows inference without knowledge, or estimation, of the slow rate of identification. It allows in particular the application of the general Wald testing theory developed in a companion paper, Antoine and Renault (2008) (see also Lee, 2004). Moreover, we show that, both for estimation and testing, continuously updated GMM is not always an answer to identification issues.
For notational and expositional simplicity, we focus here on two groups of moment conditions only. The extension to several degrees of weakness (think of a practitioner using several instruments of different informational qualities) is quite natural. Antoine and Renault (2008) specifically consider multiple groups of moment conditions associated with specific rates of convergence. However, they do not explicitly consider applications to identification issues, but rather applications in kernel, unit-root, extreme-value or continuous-time environments.
ACKNOWLEDGMENTS We would like to thank O. Boldea, A. Inoue, P. Lavergne, L. Magee, W. Newey, R. Smith, and seminar participants at University of British Columbia, University of Indiana at Bloomington, Tilburg University and Yale University for helpful discussions.
REFERENCES

Andrews, D. W. K. (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 62, 43–72.
Andrews, D. W. K. (1995). Nonparametric kernel estimation for semiparametric econometric models. Econometric Theory 11, 560–96.
Andrews, D. W. K. and P. Guggenberger (2007). Asymptotic size and a problem with subsampling and with the m out of n bootstrap. Cowles Foundation Discussion Paper 1605, Yale University.
Andrews, D. W. K. and J. H. Stock (2007). Inference with weak instruments. In R. Blundell, W. K. Newey and T. Persson (Eds.), Advances in Economics and Econometrics, Volume III, Econometric Society Monograph 43, 122–73. Cambridge: Cambridge University Press.
Antoine, B., H. Bonnal and E. Renault (2007). On the efficient use of the informational content of estimating equations: implied probabilities and Euclidean empirical likelihood. Journal of Econometrics 138, 461–87.
Antoine, B. and E. Renault (2008). Efficient minimum distance estimation with multiple rates of convergence. Working paper, Simon Fraser University.
Caner, M. (2007). Testing, estimation and higher order expansions in GMM with nearly-weak instruments. Working paper, North Carolina State University.
Choi, I. and P. C. B. Phillips (1992). Asymptotic and finite sample distribution theory for IV estimators and tests in partially identified structural equations. Journal of Econometrics 51, 113–50.
Guggenberger, P. and R. J. Smith (2005). Generalized empirical likelihood estimators and tests under partial, weak, and strong identification. Econometric Theory 21, 667–709.
Hahn, J. and G. Kuersteiner (2002). Discontinuities of weak instruments limiting distributions. Economics Letters 75, 325–31.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–54.
Hansen, L. P., J. Heaton and A. Yaron (1996). Finite sample properties of some alternative GMM estimators. Journal of Business and Economic Statistics 14, 262–80.
Kleibergen, F. (2005). Testing parameters in GMM without assuming that they are identified. Econometrica 73, 1103–23.
Kocherlakota, N. (1990). On tests of representative consumer asset pricing models. Journal of Monetary Economics 26, 285–304.
Lee, L. (2004). Pooling estimates with different rates of convergence—a minimum χ² approach with an emphasis on a social interaction model. Working paper, Ohio State University.
Newey, W. K. and R. J. Smith (2004). Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72, 219–55.
Newey, W. K. and K. D. West (1987). Hypothesis testing with efficient method of moments estimation. International Economic Review 28, 777–87.
Phillips, P. C. B. (1989). Partially identified econometric models. Econometric Theory 5, 181–240.
Smith, R. J. (2007). Weak instruments and empirical likelihood: a discussion of the papers by D. W. K. Andrews and J. H. Stock and Y. Kitamura. In R. Blundell, W. K. Newey and T. Persson (Eds.), Advances in Economics and Econometrics, Volume III, Econometric Society Monograph 43, 238–60. Cambridge: Cambridge University Press.
Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65, 557–86.
Stock, J. H. and J. H. Wright (2000). GMM with weak identification. Econometrica 68, 1055–96.
Tauchen, G. (1986). Statistical properties of generalized method-of-moments estimators of structural parameters obtained from financial market data. Journal of Business and Economic Statistics 4, 397–425.
Tauchen, G. and R. Hussey (1991). Quadrature-based methods for obtaining approximate solutions to nonlinear asset pricing models. Econometrica 59, 371–96.
APPENDIX

LEMMA A.1. Under the assumptions of Proposition 2.1, there exists a (K, p)-matrix J of rank p,
J = Plim √T [∂φ̄_T(θ₀)/∂θ′] R Λ_T⁻¹,
where Λ_T is the (p, p) diagonal matrix whose first s₁ (resp. last s₂) diagonal coefficients are √T (resp. λ_T).
Proof:
√T [∂φ̄_T(θ₀)/∂θ′] R Λ_T⁻¹ = [ (∂φ̄_{1T}(θ₀)/∂θ′) R₁, (√T/λ_T)(∂φ̄_{1T}(θ₀)/∂θ′) R₂ ; (∂φ̄_{2T}(θ₀)/∂θ′) R₁, (√T/λ_T)(∂φ̄_{2T}(θ₀)/∂θ′) R₂ ].
From Assumption 2.2, we have:
Plim [(∂φ̄_{1T}(θ₀)/∂θ′) R₁] = C₁R₁, Plim [(√T/λ_T)(∂φ̄_{2T}(θ₀)/∂θ′) R₂] = C₂R₂,
and Plim [(∂φ̄_{2T}(θ₀)/∂θ′) R₁] = Plim [(λ_T/√T) C₂R₁] = 0.
From Assumption 2.4, we have:
(√T/λ_T)(∂φ̄_{1T}(θ₀)/∂θ′) R₂ = (√T/λ_T) C₁R₂ + (1/λ_T) U_T R₂, where U_T = O_P(1).
Since C₁R₂ = 0 (by definition of R₂) and λ_T → ∞, we get: Plim [(√T/λ_T)(∂φ̄_{1T}(θ₀)/∂θ′) R₂] = 0.
We can then define the matrix J as: J = [C₁R₁, 0 ; 0, C₂R₂].
Note that J is of rank p since C 1 R 1 (respectively C 2 R 2 ) is of rank s 1 (respectively s 2 = p − s 1 ). The proof of Lemma A.1 is now completed. We need another preliminary result: L EMMA A.2. Under the assumptions of Proposition 2.1, √ −1 ˜ −1 ¯ T −1 T R VT (θ0 ) = J ST (θ0 ) T φT (θ0 ) + oP (1) = T T R VT (θ0 ) + oP (1). √ Proof: From Assumption 2.1, ST−1 (θ0 ) T φ¯ T (θ0 ) = OP (1); combined with Lemma A.1 we get: T −1 T R VT (θ0 ) =
√
T −1 T R
√ √ ∂ φ¯ T (θ0 ) −1 ST (θ0 ) T φ¯ T (θ0 ) = J ST−1 (θ0 ) T φ¯ T (θ0 ) + oP (1). ∂θ
˜ We now show that: T −1 T R [VT (θ0 ) − VT (θ0 )] = oP (1). We have: ¯ √ √ ∂ φT (θ0 ) ∂ φ˜ T (θ0 ) −1 ˜ T −1 T − × ST−1 (θ0 ) T φ¯ T (θ0 ). R T R VT (θ0 ) − VT (θ0 ) = T ∂θ ∂θ √ Since ST−1 (θ0 ) T φ¯ T (θ0 ) = OP (1), we only need to show that: ¯ √ ∂ φT (θ0 ) ∂ φ˜ T (θ0 ) R−1 T − T = oP (1). ∂θ ∂θ
The jth row of the above matrix writes:
(j ) (j ) √ ∂ φ¯ T (θ0 ) ∂ φ˜ T (θ0 ) T − R−1 T ∂θ ∂θ
√ ∂ φ¯ T(j ) (θ0 ) √ √ √ ¯ = − Covas T , T φT (θ0 ) (Varas [ T φ¯ T (θ0 )])−1 T φ¯ T (θ0 )R−1 T , ∂θ which is o P (1) since
√
T φ¯ T (θ0 ) = OP (1) and R−1 T = o P (1). This concludes the proof of Lemma A.2.
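The diagonal rescaling in Lemma A.1 can be checked on a toy example. The sketch below (all matrices, rates and dimensions are hypothetical choices, not taken from the paper) builds a sample Jacobian whose first row settles at rate $\sqrt{T}$ and whose second row settles at rate $\lambda_T$, and verifies that $\sqrt{T}\,\partial\bar\phi_T(\theta^0)/\partial\theta'\,R\,\Lambda_T^{-1}$ is close to the block-diagonal limit $J$:

```python
import numpy as np

# Hypothetical dimensions: one moment per group (k1 = k2 = 1), p = 2 parameters,
# s1 = s2 = 1 identified direction per rate.
T, lam = 1e8, 1e3                       # lam -> infinity with lam = o(sqrt(T))
C1 = np.array([[1.0, 0.0]])             # strong-group Jacobian limit; null(C1) = span(e2)
C2 = np.array([[0.0, 2.0]])             # weak-group Jacobian limit
R = np.eye(2)                           # R1 = e1, R2 = e2 (so C1 R2 = 0)

# Sample Jacobian in the spirit of Assumptions 2.2/2.4:
# d phibar_1/d theta' = C1 + O_P(1/sqrt(T)),
# d phibar_2/d theta' = (lam/sqrt(T)) C2 + O_P(1/sqrt(T))
rng = np.random.default_rng(0)
D_T = np.vstack([C1 + rng.normal(size=(1, 2)) / np.sqrt(T),
                 (lam / np.sqrt(T)) * C2 + rng.normal(size=(1, 2)) / np.sqrt(T)])

Lambda_T = np.diag([np.sqrt(T), lam])   # first s1 entries sqrt(T), last s2 entries lam
J_T = np.sqrt(T) * D_T @ R @ np.linalg.inv(Lambda_T)
J = np.array([[1.0, 0.0], [0.0, 2.0]])  # block-diagonal limit diag(C1 R1, C2 R2)
print(np.round(J_T, 2))                 # close to J for large T
```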
Proof of Proposition 2.1 (Equivalence under the null): Consider the notation: $\operatorname{rk}[C_1]=s_1\le p$; $R_1$ is a $(p,s_1)$-matrix such that $\operatorname{rk}[R_1]=s_1$ and $\operatorname{col}[R_1]=\operatorname{col}[C_1]$; $R_2$ is a $(p,s_2)$-matrix such that $\operatorname{rk}[R_2]=s_2$ and $\operatorname{col}[R_2]$ is the null space of $C_1$.²² By Assumption 2.2, $p=s_1+s_2$, and $R=[R_1\,\vdots\,R_2]$ is a non-singular $(p,p)$-matrix. We do not exclude the special case $s_1=p$, $s_2=0$ and $R=R_1$.

From Lemma A.2 and Assumption 2.1, $T\,\Lambda_T^{-1}R'V_T(\theta^0)$ is asymptotically normal with zero mean and covariance matrix $J'[S^0]^{-1}J$. Hence,
\[ T^2\,V_T'(\theta^0)\,R\,\Lambda_T^{-1}\big[J'[S^0]^{-1}J\big]^{-1}\Lambda_T^{-1}R'\,V_T(\theta^0) \]
converges in distribution towards a chi-square with $p$ degrees of freedom. From Lemma A.1 and Assumption 2.1, this is also the case for:
\[ T^2\,V_T'(\theta^0)\,R\,\Lambda_T^{-1}\left[\Lambda_T^{-1}R'\,\frac{\partial\bar\phi_T(\theta^0)'}{\partial\theta}\,T\,S_T^{-1}(\theta^0)\,\frac{\partial\bar\phi_T(\theta^0)}{\partial\theta'}\,R\,\Lambda_T^{-1}\right]^{-1}\Lambda_T^{-1}R'\,V_T(\theta^0). \]
After simplifications (the non-singular matrix $R\,\Lambda_T^{-1}$ cancels out), we find that:
\[ \xi_T^{NW} = T\,V_T'(\theta^0)\left[\frac{\partial\bar\phi_T(\theta^0)'}{\partial\theta}\,S_T^{-1}(\theta^0)\,\frac{\partial\bar\phi_T(\theta^0)}{\partial\theta'}\right]^{-1}V_T(\theta^0) \xrightarrow{d} \chi^2(p), \]
where $\chi^2(p)$ denotes the chi-square distribution with $p$ degrees of freedom. When $\partial\bar\phi_T(\theta^0)/\partial\theta'$ is replaced by $\partial\tilde\phi_T(\theta^0)/\partial\theta'$, the same argument leads to $\xi_T^K\xrightarrow{d}\chi^2(p)$ and $\xi_T^K-\xi_T^{NW}\xrightarrow{P}0$.

²² For any $(m\times n)$-matrix $M$, $\operatorname{rk}[M]$ denotes its rank and $\operatorname{col}[M]$ represents the subspace generated by the column vectors of $M$.
LEMMA A.3. Under the assumptions of Proposition 2.2, $\sqrt{T}\,\bar\phi_T(\theta^0)\xrightarrow{d}N(m,S^0)$ with $m=-\binom{C_1\gamma}{0}$ in case (i), and $m=-\binom{0}{C_2\gamma}$ in case (ii).

Proof: A mean-value expansion gives:
\[ \sqrt{T}\,\bar\phi_T(\theta_T) = \sqrt{T}\,\bar\phi_T(\theta^0) + \sqrt{T}\,\frac{\partial\bar\phi_T(\theta_T^*)}{\partial\theta'}(\theta_T-\theta^0), \quad\text{with } \theta_T^* = \theta^0+\beta_T\gamma,\ \beta_T\in[0,\alpha_T]. \]
From Assumption 2.1, we have $\sqrt{T}\,\bar\phi_T(\theta_T)\xrightarrow{d}N(0,S^0)$. Therefore, we only need to show:
\[ \operatorname{Plim}\left[\sqrt{T}\,\frac{\partial\bar\phi_T(\theta_T^*)}{\partial\theta'}(\theta_T-\theta^0)\right] = -m. \]
More precisely, in case (i) ($\alpha_T=1/\sqrt{T}$), we need to show:
\[ \operatorname{Plim}\frac{\partial\bar\phi_T(\theta_T^*)}{\partial\theta'}\gamma = \binom{C_1\gamma}{0}. \]
From Assumption 2.5, since $\operatorname{Plim}[\theta_T^*-\theta_T]=0$, we have:
\[ \operatorname{Plim}\frac{\partial\bar\phi_{1T}(\theta_T^*)}{\partial\theta'} = C_1 \quad\text{and}\quad \operatorname{Plim}\frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_{2T}(\theta_T^*)}{\partial\theta'} = C_2 \;\Rightarrow\; \operatorname{Plim}\frac{\partial\bar\phi_{2T}(\theta_T^*)}{\partial\theta'} = 0. \]
And in case (ii) ($\alpha_T=1/\lambda_T$), we need to show:
\[ \operatorname{Plim}\frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_T(\theta_T^*)}{\partial\theta'}\gamma = \binom{0}{C_2\gamma}. \]
From Assumption 2.5:
\[ \operatorname{Plim}\frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_{2T}(\theta_T^*)}{\partial\theta'}\gamma = C_2\gamma, \]
while for all components $\bar\phi_{1T}^{(j)}$, we have:
\[ \frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_{1T}^{(j)}(\theta_T^*)}{\partial\theta'}\gamma = \frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_{1T}^{(j)}(\theta_T)}{\partial\theta'}\gamma + \frac{\sqrt{T}}{\lambda_T}\,H_1^{(j)}(\theta_T^*-\theta_T) + o_P(1) = o_P(1), \]
with $C_1\gamma=0$, since $\sqrt{T}\,\partial\bar\phi_{1T}^{(j)}(\theta_T)/\partial\theta' = \sqrt{T}\,C_1^{(j)}+O_P(1)$, $\theta_T^*-\theta_T=O_P(1/\lambda_T)$, and $\lambda_T^2/\sqrt{T}\to\infty$.
Proof of Proposition 2.2 (Local power of GMM score tests): We first derive the asymptotic distribution of $\xi_T^{NW}$ in cases (i) and (ii), studying the asymptotic distribution of $\sqrt{T}\,\bar\phi_T(\theta^0)$ when the true value is $\theta_T=\theta^0+\alpha_T\gamma$, with $\alpha_T=1/\sqrt{T}$ in case (i) and $\alpha_T=1/\lambda_T$ in case (ii); this distribution is given by Lemma A.3 above. From Assumption 2.5 and Lemma A.1, the rescaled score can be written:
\[ T\,\Lambda_T^{-1}R'V_T(\theta^0) = \left[\sqrt{T}\,\frac{\partial\bar\phi_T(\theta^0)}{\partial\theta'}R\,\Lambda_T^{-1}\right]'S_T^{-1}(\theta^0)\sqrt{T}\,\bar\phi_T(\theta^0) = J'S_T^{-1}(\theta^0)\sqrt{T}\,\bar\phi_T(\theta^0)+o_P(1), \]
since $S_T^{-1}(\theta^0)\sqrt{T}\,\bar\phi_T(\theta^0)=O_P(1)$. Thus, from Lemma A.3, we get:
\[ T\,\Lambda_T^{-1}R'V_T(\theta^0) \xrightarrow{d} N\big(J'[S^0]^{-1}m,\ J'[S^0]^{-1}J\big). \]
Therefore,
\[ T^2\,V_T'(\theta^0)\,R\,\Lambda_T^{-1}\big[J'[S^0]^{-1}J\big]^{-1}\Lambda_T^{-1}R'\,V_T(\theta^0) \]
converges in distribution towards a non-central chi-square with $p$ degrees of freedom and non-centrality parameter:
\[ \mu = m'[S^0]^{-1}J\big(J'[S^0]^{-1}J\big)^{-1}J'[S^0]^{-1}m. \]
From Assumption 2.5 and Lemma A.1, this also holds for:
\[ \xi_T^{NW} = T^2\,V_T'(\theta^0)\,R\,\Lambda_T^{-1}\left[\Lambda_T^{-1}R'\,\frac{\partial\bar\phi_T(\theta^0)'}{\partial\theta}\,T\,S_T^{-1}(\theta^0)\,\frac{\partial\bar\phi_T(\theta^0)}{\partial\theta'}\,R\,\Lambda_T^{-1}\right]^{-1}\Lambda_T^{-1}R'\,V_T(\theta^0). \tag{A.1} \]

We now only need to check the formula for the non-centrality parameter $\mu$. It is convenient to introduce the following notations:
\[ S^0 = S^{1/2}S^{1/2\prime}, \qquad S^{-1/2} = \big(S^{1/2}\big)^{-1}, \qquad m_S = S^{-1/2}m, \qquad X = S^{-1/2}J. \]
Then $\mu = m_S'X[X'X]^{-1}X'm_S$. We get the announced result if we can check that $\mu=m_S'm_S$, that is, $P_Xm_S=m_S$, where $P_X=X[X'X]^{-1}X'$ is the orthogonal projection matrix on the column space of $X$. Therefore, we want to show that $m_S$ belongs to the column space of $X$, or equivalently that $m$ belongs to the column space of $J$. From the proof of Lemma A.1, the column space of $J$ is the set of vectors of $R^K$ that can be written
\[ \begin{pmatrix} C_1R_1\,a \\ C_2R_2\,b \end{pmatrix} \quad\text{for some } a\in R^{s_1} \text{ and } b\in R^{s_2}. \]
In case (i), take $b=0$ and choose $a$ as follows. First, $C_1\gamma=C_1\gamma^*$ where $\gamma^*$ is the orthogonal projection of $\gamma$ onto the orthogonal complement of the null space of $C_1$, that is $\operatorname{col}[R_1]$. Then there exists some $a$ such that $\gamma^*=R_1a$, so that $(C_1\gamma,0)'=(C_1R_1a,0)'$ belongs to the column space of $J$; since this space is linear, $m=-(C_1\gamma,0)'$ belongs to it as well. In case (ii), take $a=0$: since $C_1\gamma=0$, $\gamma$ belongs to the null space of $C_1$, which is spanned by the columns of $R_2$, so $\gamma=R_2b$ for some $b$ and $m=-(0,C_2\gamma)'$ belongs to the column space of $J$.

We now show that $\operatorname{Plim}[\xi_T^{NW}-\xi_T^K]=0$ in cases (i) and (ii). We show that $T\,\Lambda_T^{-1}R'[V_T(\theta^0)-\tilde V_T(\theta^0)]=o_P(1)$. We have:
\[ T\,\Lambda_T^{-1}R'\big[V_T(\theta^0)-\tilde V_T(\theta^0)\big] = \left[\sqrt{T}\left(\frac{\partial\bar\phi_T(\theta^0)}{\partial\theta'}-\frac{\partial\tilde\phi_T(\theta^0)}{\partial\theta'}\right)R\,\Lambda_T^{-1}\right]'S_T^{-1}(\theta^0)\sqrt{T}\,\bar\phi_T(\theta^0). \]
Since $S_T^{-1}(\theta^0)\sqrt{T}\,\bar\phi_T(\theta^0)=O_P(1)$ from Lemma A.3, we only need to show that:
\[ \sqrt{T}\left(\frac{\partial\bar\phi_T(\theta^0)}{\partial\theta'}-\frac{\partial\tilde\phi_T(\theta^0)}{\partial\theta'}\right)R\,\Lambda_T^{-1} = o_P(1). \]
The $j$th row of the above matrix writes:
\[ \sqrt{T}\left(\frac{\partial\bar\phi_T^{(j)}(\theta^0)}{\partial\theta'}-\frac{\partial\tilde\phi_T^{(j)}(\theta^0)}{\partial\theta'}\right)R\,\Lambda_T^{-1} = -\left[\operatorname{Cov}_{as}\!\left(\sqrt{T}\frac{\partial\bar\phi_T^{(j)}(\theta^0)}{\partial\theta},\sqrt{T}\bar\phi_T(\theta^0)\right)\big(\operatorname{Var}_{as}[\sqrt{T}\bar\phi_T(\theta^0)]\big)^{-1}\sqrt{T}\,\bar\phi_T(\theta^0)\right]'R\,\Lambda_T^{-1}, \]
which is $o_P(1)$ since $\sqrt{T}\,\bar\phi_T(\theta^0)=O_P(1)$, $R\,\Lambda_T^{-1}=o_P(1)$, and, from Assumption 2.3,
\[ B^{(j)} \equiv \operatorname{Cov}_{as}\!\left(\sqrt{T}\frac{\partial\bar\phi_T^{(j)}(\theta^0)}{\partial\theta},\sqrt{T}\bar\phi_T(\theta^0)\right)\big(\operatorname{Var}_{as}(\sqrt{T}\bar\phi_T(\theta^0))\big)^{-1} = O_P(1). \]
And we conclude: $\operatorname{Plim}[\xi_T^K-\xi_T^{NW}]=0$.
Proof of Proposition 2.3 (Special case with only one degree of weakness): When $\phi=\phi_2$, the proof of Lemma A.3 no longer requires the condition $\lambda_T^2/\sqrt{T}\to\infty$, and the result of Lemma A.3 remains valid in any case. The proof of Proposition 2.2 can then be rewritten based on this lemma to conclude that $\operatorname{Plim}[\xi_T^{NW}-\xi_T^K]=0$, and both score test statistics converge towards a non-central chi-square as announced.

About the consistency of the GMM score test (see Section 2): As in the proof of Proposition 2.2 (see equation (A.1)), $\xi_T^{NW}$ can be rewritten as:
\[ \xi_T^{NW} = T^2\,V_T'(\theta^0)\,R\,\Lambda_T^{-1}\left[\Lambda_T^{-1}R'\,\frac{\partial\bar\phi_T(\theta^0)'}{\partial\theta}\,T\,S_T^{-1}(\theta^0)\,\frac{\partial\bar\phi_T(\theta^0)}{\partial\theta'}\,R\,\Lambda_T^{-1}\right]^{-1}\Lambda_T^{-1}R'\,V_T(\theta^0), \]
and we also have:
\[ T\,\Lambda_T^{-1}R'V_T(\theta^0) = J'S_T^{-1}(\theta^0)\sqrt{T}\,\bar\phi_T(\theta_T) - J'S_T^{-1}(\theta^0)\sqrt{T}\,\frac{\partial\bar\phi_T(\theta^0)}{\partial\theta'}(\theta_T-\theta^0) + o_P(1). \tag{A.2} \]
The first term on the RHS of (A.2) is asymptotically normal with zero mean and variance $J'[S^0]^{-1}J$. The second term on the RHS of (A.2) can be written $(-\lambda_T/\delta_T)\,\pi_T$ with:
\[ \pi_T = J'S_T^{-1}(\theta^0)\,\frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_T(\theta^0)}{\partial\theta'}\,\gamma \equiv \pi_{1T}+\pi_{2T}, \]
where the additive decomposition comes from the decomposition of $\partial\bar\phi_T(\theta^0)/\partial\theta'\,\gamma$ according to the two subsets of moment conditions $\phi_1$ and $\phi_2$. By Assumption 2.5: $\pi_{2T}=\{J'[S^0]^{-1}\}_2\,C_2\gamma+o_P(1)$, where, for any matrix $M$ with $K$ columns, $\{M\}_2$ is obtained by keeping only the second subset of columns of $M$ (which corresponds to the second subset of moment conditions). Hence:
\[ T\,\Lambda_T^{-1}R'V_T(\theta^0) = J'S_T^{-1}(\theta^0)\sqrt{T}\,\bar\phi_T(\theta_T) - \frac{\lambda_T}{\delta_T}\pi_{1T} - \frac{\lambda_T}{\delta_T}\Big(\{J'[S^0]^{-1}\}_2\,C_2\gamma + o_P(1)\Big) = J'S_T^{-1}(\theta^0)\sqrt{T}\,\bar\phi_T(\theta_T) + Z_T, \]
where $Z_T$ is likely to go to infinity when $\lambda_T/\delta_T$ goes to infinity and $C_2\gamma\neq 0$. When $Z_T$ goes to infinity, since $J'S_T^{-1}(\theta^0)\sqrt{T}\,\bar\phi_T(\theta_T)$ is $O_P(1)$, $\xi_T^{NW}$, which is a positive definite quadratic form in $T\,\Lambda_T^{-1}R'V_T(\theta^0)$, goes to infinity and the test is consistent. The only case where this consistency is lost is when $\pi_{1T}+\{J'[S^0]^{-1}\}_2\,C_2\gamma = o_P(\delta_T/\lambda_T)$.

The consistency of the minimum distance estimator $\hat\theta_T$ is a direct implication of the identification Assumption 3.1 jointly with the following lemma:

LEMMA A.4. $\|\rho(\hat\theta_T)\| = O_P(1/\lambda_T)$.

Proof: From (3.1), the objective function is written as follows:
\[ Q_T(\theta) = \left[\frac{\Psi_T(\theta)}{\sqrt{T}}+\Delta_T\,\rho(\theta)\right]'\Omega_T\left[\frac{\Psi_T(\theta)}{\sqrt{T}}+\Delta_T\,\rho(\theta)\right], \qquad\text{where } \Delta_T = \begin{pmatrix} Id_{k_1} & 0 \\ 0 & \dfrac{\lambda_T}{\sqrt{T}}\,Id_{k_2} \end{pmatrix}. \]
Since $\hat\theta_T$ is the minimizer of $Q_T(\cdot)$, we have in particular $Q_T(\hat\theta_T)\le Q_T(\theta^0)$, that is, using $\rho(\theta^0)=0$:
\[ \left[\frac{\Psi_T(\hat\theta_T)}{\sqrt{T}}+\Delta_T\,\rho(\hat\theta_T)\right]'\Omega_T\left[\frac{\Psi_T(\hat\theta_T)}{\sqrt{T}}+\Delta_T\,\rho(\hat\theta_T)\right] \le \frac{\Psi_T'(\theta^0)\,\Omega_T\,\Psi_T(\theta^0)}{T}. \]
Denoting $d_T = \Psi_T'(\hat\theta_T)\,\Omega_T\,\Psi_T(\hat\theta_T)-\Psi_T'(\theta^0)\,\Omega_T\,\Psi_T(\theta^0)$, we get:
\[ \big[\sqrt{T}\Delta_T\rho(\hat\theta_T)\big]'\Omega_T\big[\sqrt{T}\Delta_T\rho(\hat\theta_T)\big] + 2\big[\sqrt{T}\Delta_T\rho(\hat\theta_T)\big]'\Omega_T\,\Psi_T(\hat\theta_T) + d_T \le 0. \]
Let $\mu_T$ be the smallest eigenvalue of $\Omega_T$. The former inequality implies:
\[ \mu_T\big\|\sqrt{T}\Delta_T\rho(\hat\theta_T)\big\|^2 - 2\big\|\sqrt{T}\Delta_T\rho(\hat\theta_T)\big\|\cdot\big\|\Omega_T\Psi_T(\hat\theta_T)\big\| + d_T \le 0. \]
In other words, $x_T=\|\sqrt{T}\Delta_T\rho(\hat\theta_T)\|$ solves the inequality:
\[ x_T^2 - \frac{2\|\Omega_T\Psi_T(\hat\theta_T)\|}{\mu_T}\,x_T + \frac{d_T}{\mu_T} \le 0 \;\Rightarrow\; \frac{\|\Omega_T\Psi_T(\hat\theta_T)\|}{\mu_T}-\sqrt{D_T} \le x_T \le \frac{\|\Omega_T\Psi_T(\hat\theta_T)\|}{\mu_T}+\sqrt{D_T}, \]
with $D_T=\|\Omega_T\Psi_T(\hat\theta_T)\|^2/\mu_T^2 - d_T/\mu_T$. Since $x_T\ge\lambda_T\|\rho(\hat\theta_T)\|$, we want to show that $x_T=O_P(1)$, that is,
\[ \frac{\|\Omega_T\Psi_T(\hat\theta_T)\|}{\mu_T}=O_P(1) \quad\text{and}\quad D_T=O_P(1) \;\Leftrightarrow\; \frac{\|\Omega_T\Psi_T(\hat\theta_T)\|}{\mu_T}=O_P(1) \;\text{and}\; \frac{d_T}{\mu_T}=O_P(1). \]
Note that since $\det(\Omega_T)\xrightarrow{P}\det(\Omega)>0$, no subsequence of $\Omega_T$ can converge in probability towards zero and thus we can assume (for $T$ sufficiently large) that $\mu_T$ remains bounded away from zero with asymptotic probability one. Therefore, we just have to show that:
\[ \|\Omega_T\Psi_T(\hat\theta_T)\| = O_P(1) \quad\text{and}\quad d_T = O_P(1). \]
Since $\operatorname{tr}(\Omega_T)\xrightarrow{P}\operatorname{tr}(\Omega)$ (where $\operatorname{tr}[M]$ denotes the trace of any square matrix $M$), the sequence $\operatorname{tr}(\Omega_T)$ is upper bounded in probability, and so are all the eigenvalues of $\Omega_T$. Therefore, the required boundedness in probability just results from Assumption 3.1(ii), which ensures that $\sup_{\theta\in\Theta}\|\Psi_T(\theta)\| = O_P(1)$.
Proof of Theorem 3.1 (Consistency of $\hat\theta_T$): We deduce the weak consistency of $\hat\theta_T$ by a contradiction argument. If $\hat\theta_T$ is not consistent, there exists some positive $\epsilon$ such that $P[\|\hat\theta_T-\theta^0\|>\epsilon]$ does not converge to zero. Then we can define a subsequence $(\hat\theta_{T_n})_{n\in\mathbb N}$ such that, for some positive $\eta$, $P[\|\hat\theta_{T_n}-\theta^0\|>\epsilon]\ge\eta$ for all $n\in\mathbb N$. Let us denote $\alpha=\inf_{\|\theta-\theta^0\|>\epsilon}\|\rho(\theta)\|>0$ by Assumption 3.1(i). Then for all $n\in\mathbb N$: $P[\|\rho(\hat\theta_{T_n})\|\ge\alpha]\ge\eta>0$. In view of the identification Assumption 3.1(iii), this last inequality contradicts Lemma A.4. This completes the proof of consistency.

Proof of Theorem 3.2 (Rate of convergence): From Lemma A.4, $\|\rho(\hat\theta_T)\|=\|\rho(\hat\theta_T)-\rho(\theta^0)\|=O_P(1/\lambda_T)$, and by application of the Mean-Value Theorem, for some $\tilde\theta_T$ between $\hat\theta_T$ and $\theta^0$ component by component, we get:
\[ \left\|\frac{\partial\rho(\tilde\theta_T)}{\partial\theta'}\big(\hat\theta_T-\theta^0\big)\right\| = O_P\!\left(\frac{1}{\lambda_T}\right). \]
Note that, by a common abuse of notation, we omit to stress that $\tilde\theta_T$ actually depends on the component of $\rho(\cdot)$. The key point is that, since $\rho(\cdot)$ is continuously differentiable and $\tilde\theta_T$, like $\hat\theta_T$, converges in probability towards $\theta^0$, we have:
\[ \frac{\partial\rho(\tilde\theta_T)}{\partial\theta'} \xrightarrow{P} \frac{\partial\rho(\theta^0)}{\partial\theta'} \;\Rightarrow\; \frac{\partial\rho(\theta^0)}{\partial\theta'}\big(\hat\theta_T-\theta^0\big) = z_T, \quad\text{with } z_T=O_P(1/\lambda_T). \]
Since $\partial\rho(\theta^0)/\partial\theta'$ is full column rank, we deduce that
\[ \hat\theta_T-\theta^0 = \left[\frac{\partial\rho'(\theta^0)}{\partial\theta}\frac{\partial\rho(\theta^0)}{\partial\theta'}\right]^{-1}\frac{\partial\rho'(\theta^0)}{\partial\theta}\,z_T \]
also fulfils $\|\hat\theta_T-\theta^0\|=O_P(1/\lambda_T)$.
Proof of Equations (3.5) and (3.6) (Moments in the linear IV model): In the linear model (3.4),
\[ y_t = X_{1t}\pi_{11}\theta_1^0 + X_{2t}\pi_{21}\theta_1^0 + X_{1t}\pi_{12}\theta_2^0 + X_{2t}\pi_{22}\theta_2^0 + V_{1t}\theta_1^0 + V_{2t}\theta_2^0 + u_t, \]
and the two moment conditions write as follows (assuming orthogonal instruments):
\[ E\big[(y_t-Y_t'\theta)X_t\big] = \begin{pmatrix} E[X_{1t}^2]\,\pi_{11}\big(\theta_1^0-\theta_1\big) + E[X_{1t}^2]\,\pi_{12}\big(\theta_2^0-\theta_2\big) \\[1ex] E[X_{2t}^2]\,\pi_{21}\big(\theta_1^0-\theta_1\big) + E[X_{2t}^2]\,\pi_{22}\big(\theta_2^0-\theta_2\big) \end{pmatrix}. \]
When $\Pi$ is replaced by $\Pi_T^{SW}$, the moment conditions are:
\[ \begin{pmatrix} \rho_{1s}^{SW}(\theta_1) + \rho_{1w}^{SW}(\theta_2)/\delta_T \\ \rho_{2s}^{SW}(\theta_1) + \rho_{2w}^{SW}(\theta_2)/\delta_T \end{pmatrix} \equiv \begin{pmatrix} E[X_{1t}^2]\big(\theta_1^0-\theta_1\big)c_{11} + E[X_{1t}^2]\big(\theta_2^0-\theta_2\big)c_{12}/\delta_T \\[1ex] E[X_{2t}^2]\big(\theta_1^0-\theta_1\big)c_{21} + E[X_{2t}^2]\big(\theta_2^0-\theta_2\big)c_{22}/\delta_T \end{pmatrix}, \]
with
\[ \rho_{1s}^{SW}(\theta_1) = E[X_{1t}^2]\big(\theta_1^0-\theta_1\big)c_{11}, \qquad \rho_{1w}^{SW}(\theta_2) = E[X_{1t}^2]\big(\theta_2^0-\theta_2\big)c_{12}, \]
\[ \rho_{2s}^{SW}(\theta_1) = E[X_{2t}^2]\big(\theta_1^0-\theta_1\big)c_{21}, \qquad \rho_{2w}^{SW}(\theta_2) = E[X_{2t}^2]\big(\theta_2^0-\theta_2\big)c_{22}. \]
When $\Pi$ is replaced by $\Pi_T^{AR}$, the moment conditions are:
\[ \begin{pmatrix} \rho_1^{AR}(\theta_1,\theta_2) \\ \rho_2^{AR}(\theta_1,\theta_2)/\delta_T \end{pmatrix} \equiv \begin{pmatrix} E[X_{1t}^2]\big(\theta_1^0-\theta_1\big)c_{11} + E[X_{1t}^2]\big(\theta_2^0-\theta_2\big)c_{12} \\[1ex] E[X_{2t}^2]\big(\theta_1^0-\theta_1\big)c_{21}/\delta_T + E[X_{2t}^2]\big(\theta_2^0-\theta_2\big)c_{22}/\delta_T \end{pmatrix}, \]
with
\[ \rho_1^{AR}(\theta_1,\theta_2) = E[X_{1t}^2]\big(\theta_1^0-\theta_1\big)c_{11} + E[X_{1t}^2]\big(\theta_2^0-\theta_2\big)c_{12}, \]
\[ \rho_2^{AR}(\theta_1,\theta_2) = E[X_{2t}^2]\big(\theta_1^0-\theta_1\big)c_{21} + E[X_{2t}^2]\big(\theta_2^0-\theta_2\big)c_{22}. \]
Proof of Equation (3.7) (Determinant of the concentration matrix): By definition, $\mu = \Sigma_V^{-1/2}\,\Pi'X'X\,\Pi\,\Sigma_V^{-1/2}$. Standard calculation rules for determinants yield:
\[ \det[\mu] = \det\!\big[\Sigma_V^{-1/2}\,\Pi'X'X\,\Pi\,\Sigma_V^{-1/2}\big] = \det\!\big[\Sigma_V^{-1}\big]\,\det[X'X]\,\big(\det[\Pi]\big)^2, \]
with $\det[\Pi]=\pi_{11}\pi_{22}-\pi_{12}\pi_{21}$. When $\Pi$ is replaced respectively by $\Pi_T^{SW}$ and $\Pi_T^{AR}$, we have:
\[ \det\!\big[\Pi_T^{SW}\big] = \frac{\det[C]}{\delta_T} \qquad\text{and}\qquad \det\!\big[\Pi_T^{AR}\big] = \frac{\det[C]}{\delta_T}. \]
Hence: $\det[\mu^{SW}] = \det[\mu^{AR}] = d/\delta_T^2$ with $d = \det[\Sigma_V^{-1}]\det[X'X](\det[C])^2$.
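The determinant identity above is purely algebraic and easy to verify numerically. A minimal sketch (the instrument matrix, covariance and reduced-form coefficients below are arbitrary hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))                   # hypothetical instrument matrix
Sigma_V = np.array([[1.0, 0.3], [0.3, 1.0]])   # hypothetical error covariance
Pi = np.array([[0.8, 0.5], [0.2, 0.9]])        # hypothetical reduced-form coefficients

# Concentration matrix mu = Sigma_V^{-1/2} Pi' X'X Pi Sigma_V^{-1/2}
Sv_inv_half = np.linalg.inv(np.linalg.cholesky(Sigma_V))
mu = Sv_inv_half @ Pi.T @ X.T @ X @ Pi @ Sv_inv_half.T

lhs = np.linalg.det(mu)
rhs = np.linalg.det(np.linalg.inv(Sigma_V)) * np.linalg.det(X.T @ X) * np.linalg.det(Pi) ** 2
print(abs(lhs - rhs) / abs(rhs))               # numerically zero

# Scaling the second column of Pi by 1/delta divides det[Pi] by delta,
# hence divides det[mu] by delta**2, which is the d/delta_T^2 pattern:
delta = 10.0
Pi_w = Pi @ np.diag([1.0, 1.0 / delta])
mu_w = Sv_inv_half @ Pi_w.T @ X.T @ X @ Pi_w @ Sv_inv_half.T
print(np.linalg.det(mu_w) * delta ** 2 / lhs)  # close to 1
```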
First, we need a preliminary result which naturally extends Lemma A.1 when the true value is replaced by some preliminary consistent estimator $\theta_T^*$.

LEMMA A.5. Under Assumptions 3.1–4.1, if $\theta_T^*$ is such that $\|\theta_T^*-\theta^0\|=O_P(1/\lambda_T)$, then
\[ \sqrt{T}\,\frac{\partial\bar\phi_T(\theta_T^*)}{\partial\theta'}\,R\,\Lambda_T^{-1} \xrightarrow{P} J \quad\text{when } T\to\infty. \]

Proof: First, note that
\[ \sqrt{T}\,\frac{\partial\bar\phi_T(\theta_T^*)}{\partial\theta'}\,R\,\Lambda_T^{-1} = \begin{pmatrix} \dfrac{\partial\bar\phi_{1T}(\theta_T^*)}{\partial\theta'}R_1 & \dfrac{\sqrt{T}}{\lambda_T}\dfrac{\partial\bar\phi_{1T}(\theta_T^*)}{\partial\theta'}R_2 \\[2ex] \dfrac{\partial\bar\phi_{2T}(\theta_T^*)}{\partial\theta'}R_1 & \dfrac{\sqrt{T}}{\lambda_T}\dfrac{\partial\bar\phi_{2T}(\theta_T^*)}{\partial\theta'}R_2 \end{pmatrix}. \]
To get the result, we have to show the following:
\[ \text{(i)}\ \frac{\partial\bar\phi_{1T}(\theta_T^*)}{\partial\theta'} \xrightarrow{P} \frac{\partial\rho_1(\theta^0)}{\partial\theta'}; \qquad \text{(ii)}\ \frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_{2T}(\theta_T^*)}{\partial\theta'} \xrightarrow{P} \frac{\partial\rho_2(\theta^0)}{\partial\theta'}; \]
\[ \text{(iii)}\ \frac{\partial\bar\phi_{2T}(\theta_T^*)}{\partial\theta'}R_1 \xrightarrow{P} 0; \qquad \text{(iv)}\ \frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_{1T}(\theta_T^*)}{\partial\theta'}R_2 \xrightarrow{P} 0. \]

(i) From Assumption 2.2(ii): $\partial\bar\phi_{1T}(\theta^0)/\partial\theta' - \partial\rho_1(\theta^0)/\partial\theta' = o_P(1)$. The Mean-Value Theorem applies to the $k$th component of $\bar\phi_{1T}$ for $1\le k\le k_1$. For some $\tilde\theta_T$ between $\theta^0$ and $\theta_T^*$:
\[ \frac{\partial\bar\phi_{1T,k}(\theta_T^*)}{\partial\theta'} - \frac{\partial\bar\phi_{1T,k}(\theta^0)}{\partial\theta'} = \big(\theta_T^*-\theta^0\big)'\frac{\partial^2\bar\phi_{1T,k}(\tilde\theta_T)}{\partial\theta\,\partial\theta'} = o_P(1), \]
where the last equality follows from Assumption 4.1(ii) and the assumption on $\theta_T^*$.

(ii) From Assumption 2.2(ii):
\[ \sqrt{T}\left[\frac{\partial\bar\phi_{2T}(\theta^0)}{\partial\theta'} - \frac{\lambda_T}{\sqrt{T}}\frac{\partial\rho_2(\theta^0)}{\partial\theta'}\right] = O_P(1) \;\Rightarrow\; \frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_{2T}(\theta^0)}{\partial\theta'} - \frac{\partial\rho_2(\theta^0)}{\partial\theta'} = o_P(1), \]
because $\lambda_T\to\infty$. The Mean-Value Theorem applies to the $k$th component of $\bar\phi_{2T}$ for $1\le k\le k_2$. For some $\tilde\theta_T$ between $\theta^0$ and $\theta_T^*$, we have:
\[ \frac{\sqrt{T}}{\lambda_T}\left[\frac{\partial\bar\phi_{2T,k}(\theta_T^*)}{\partial\theta'} - \frac{\partial\bar\phi_{2T,k}(\theta^0)}{\partial\theta'}\right] = \frac{\sqrt{T}}{\lambda_T}\big(\theta_T^*-\theta^0\big)'\frac{\partial^2\bar\phi_{2T,k}(\tilde\theta_T)}{\partial\theta\,\partial\theta'} = o_P(1), \]
where the last equality follows from Assumption 4.1(ii) and the assumption on $\theta_T^*$.

(iii) $\dfrac{\partial\bar\phi_{2T}(\theta_T^*)}{\partial\theta'} = \dfrac{\lambda_T}{\sqrt{T}}\times\dfrac{\sqrt{T}}{\lambda_T}\dfrac{\partial\bar\phi_{2T}(\theta_T^*)}{\partial\theta'} = o_P(1)$ because of (ii) and $\lambda_T=o(\sqrt{T})$.

(iv) Recall the Mean-Value Theorem from (i). For $1\le k\le k_1$ and $\tilde\theta_T$ between $\theta^0$ and $\theta_T^*$:
\[ \frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_{1T,k}(\theta_T^*)}{\partial\theta'} = \frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_{1T,k}(\theta^0)}{\partial\theta'} + \frac{1}{\lambda_T}\big[\lambda_T\big(\theta_T^*-\theta^0\big)\big]'\frac{\sqrt{T}}{\lambda_T}\frac{\partial^2\bar\phi_{1T,k}(\tilde\theta_T)}{\partial\theta\,\partial\theta'}. \]
The second member of the RHS is $o_P(1)$ because of Assumptions 3.1(iii), 4.1(i) and 4.1(ii) and the assumption on $\theta_T^*$. Now we just need to show that the first member of the RHS, post-multiplied by $R_2$, is $o_P(1)$. Recall from Assumption 2.4 that
\[ \sqrt{T}\left[\frac{\partial\bar\phi_{1T}(\theta^0)}{\partial\theta'} - \frac{\partial\rho_1(\theta^0)}{\partial\theta'}\right] = O_P(1) \;\Rightarrow\; \frac{\sqrt{T}}{\lambda_T}\left[\frac{\partial\bar\phi_{1T}(\theta^0)}{\partial\theta'} - \frac{\partial\rho_1(\theta^0)}{\partial\theta'}\right]R_2 = O_P\!\left(\frac{1}{\lambda_T}\right). \]
By definition, $R_2$ is such that $\dfrac{\partial\rho_1(\theta^0)}{\partial\theta'}R_2=0$. Hence we get
\[ \frac{\sqrt{T}}{\lambda_T}\frac{\partial\bar\phi_{1T,k}(\theta^0)}{\partial\theta'}R_2 = O_P\!\left(\frac{1}{\lambda_T}\right) = o_P(1). \]
Proof of Theorem 4.1 (Asymptotic normality): From the optimization problem (3.1), the first-order conditions for $\hat\theta_T$ are written as:
\[ \frac{\partial\bar\phi_T'(\hat\theta_T)}{\partial\theta}\,\Omega_T\,\bar\phi_T(\hat\theta_T) = 0, \]
with $\Omega_T$ the weighting matrix of (3.1). A mean-value expansion yields:
\[ \frac{\partial\bar\phi_T'(\hat\theta_T)}{\partial\theta}\,\Omega_T\,\bar\phi_T(\theta^0) + \frac{\partial\bar\phi_T'(\hat\theta_T)}{\partial\theta}\,\Omega_T\,\frac{\partial\bar\phi_T(\tilde\theta_T)}{\partial\theta'}\big(\hat\theta_T-\theta^0\big) = 0, \]
where $\tilde\theta_T$ is between $\hat\theta_T$ and $\theta^0$. Pre-multiplying the above equation by the non-singular matrix $T\,\Lambda_T^{-1}R'$ yields an equivalent set of equations:
\[ \hat J_T'\,\Omega_T\,\sqrt{T}\,\bar\phi_T(\theta^0) + \hat J_T'\,\Omega_T\,\tilde J_T\times\Lambda_T R^{-1}\big(\hat\theta_T-\theta^0\big) = 0, \]
after defining:
\[ \hat J_T = \sqrt{T}\,\frac{\partial\bar\phi_T(\hat\theta_T)}{\partial\theta'}\,R\,\Lambda_T^{-1} \qquad\text{and}\qquad \tilde J_T = \sqrt{T}\,\frac{\partial\bar\phi_T(\tilde\theta_T)}{\partial\theta'}\,R\,\Lambda_T^{-1}. \]
From Theorem 3.2 and Lemma A.5, we can deduce that $\operatorname{Plim}\tilde J_T = J$ and $\operatorname{Plim}\hat J_T = J$. Hence $\hat J_T'\,\Omega_T\,\tilde J_T\xrightarrow{P}J'\Omega J$, non-singular by assumption. Recall now that, by Assumption 3.1(ii), $\Psi_T(\theta^0)=\sqrt{T}\,\bar\phi_T(\theta^0)$ converges to a normal distribution with mean 0. We then get the announced result.

Proof of Theorem 4.2 (J-test): A Taylor expansion of the moment conditions gives:
\[ \sqrt{T}\,\bar\phi_T(\hat\theta_T) = \sqrt{T}\,\bar\phi_T(\theta^0) + \sqrt{T}\,\frac{\partial\bar\phi_T(\hat\theta_T)}{\partial\theta'}\big(\hat\theta_T-\theta^0\big)+o_P(1) = \sqrt{T}\,\bar\phi_T(\theta^0) + \hat J_T\,\Lambda_T R^{-1}\big(\hat\theta_T-\theta^0\big)+o_P(1), \]
with $\hat J_T=\sqrt{T}\,\partial\bar\phi_T(\hat\theta_T)/\partial\theta'\,R\,\Lambda_T^{-1}$. A Taylor expansion of the FOC gives:
\[ \Lambda_T R^{-1}\big(\hat\theta_T-\theta^0\big) = -\big[\hat J_T'\,S_T^{-1}\,\hat J_T\big]^{-1}\hat J_T'\,S_T^{-1}\,\sqrt{T}\,\bar\phi_T(\theta^0)+o_P(1), \]
with $S_T$ a consistent estimator of the asymptotic covariance matrix of the process $\Psi(\theta)$. Combining the two above results leads to:
\[ \sqrt{T}\,\bar\phi_T(\hat\theta_T) = \sqrt{T}\,\bar\phi_T(\theta^0) - \hat J_T\big[\hat J_T'\,S_T^{-1}\,\hat J_T\big]^{-1}\hat J_T'\,S_T^{-1}\,\sqrt{T}\,\bar\phi_T(\theta^0)+o_P(1). \]
Use the previous result to rewrite the criterion function:
\begin{align*} T\,Q_T(\hat\theta_T) &= \big[\sqrt{T}\,\bar\phi_T(\hat\theta_T)\big]'S_T^{-1}\,\sqrt{T}\,\bar\phi_T(\hat\theta_T) \\ &= \big[\sqrt{T}\,\bar\phi_T(\theta^0)\big]'S_T^{-1}\,\sqrt{T}\,\bar\phi_T(\theta^0) - \big[\sqrt{T}\,\bar\phi_T(\theta^0)\big]'S_T^{-1}\hat J_T\big[\hat J_T'S_T^{-1}\hat J_T\big]^{-1}\hat J_T'S_T^{-1}\,\sqrt{T}\,\bar\phi_T(\theta^0)+o_P(1) \\ &= \big[\sqrt{T}\,\bar\phi_T(\theta^0)\big]'S_T^{-1/2\prime}\,[I-M]\,S_T^{-1/2}\,\sqrt{T}\,\bar\phi_T(\theta^0)+o_P(1), \end{align*}
where $S_T^{-1/2}$ is such that $S_T^{-1}=S_T^{-1/2\prime}S_T^{-1/2}$ and $M=S_T^{-1/2}\hat J_T\big[\hat J_T'S_T^{-1}\hat J_T\big]^{-1}\hat J_T'S_T^{-1/2\prime}$ is an orthogonal projection matrix, hence idempotent, of rank $p$; $[I-M]$ is then idempotent of rank $(K-p)$. The expected result follows.
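The rank claim at the end of the proof is easy to illustrate numerically: with $S_T^{-1}=S_T^{-1/2\prime}S_T^{-1/2}$, the matrix $M$ is an orthogonal projection of rank $p$, so $I-M$ has rank $K-p$, the degrees of freedom of the J-test. A minimal sketch with arbitrary (hypothetical) dimensions and matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
K, p = 5, 2                              # K moment conditions, p parameters (hypothetical)
J_hat = rng.normal(size=(K, p))          # stand-in for the rescaled Jacobian, full column rank
A = rng.normal(size=(K, K))
S = A @ A.T + K * np.eye(K)              # a positive definite "asymptotic variance"

# Factor S^{-1} = S^{-1/2}' S^{-1/2} via the Cholesky of S: S = L L', S^{-1/2} = L^{-1}
S_inv_half = np.linalg.inv(np.linalg.cholesky(S))
Xm = S_inv_half @ J_hat
M = Xm @ np.linalg.inv(Xm.T @ Xm) @ Xm.T  # orthogonal projection onto col(S^{-1/2} J_hat)

print(np.allclose(M @ M, M))              # idempotent: True
print(round(np.trace(np.eye(K) - M)))     # rank of I - M equals K - p = 3
```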
Proof of Proposition 4.1 (Equivalence between CU-GMM and efficient GMM): The first-order conditions of the CU-GMM optimization problem can be written as follows (see Antoine et al., 2007):
\[ \left[\sqrt{T}\,\frac{\partial\bar\phi_T(\hat\theta_T^{CU})}{\partial\theta'} - P\!\left(\sqrt{T}\,\frac{\partial\bar\phi_T(\hat\theta_T^{CU})}{\partial\theta'}\right)\right]'S_T^{-1}(\hat\theta_T^{CU})\,\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU}) = 0, \]
where $P$ denotes the projection onto the moment conditions. Recall that, for each $j$:
\[ P\!\left(\sqrt{T}\,\frac{\partial\bar\phi_T^{(j)}(\hat\theta_T^{CU})}{\partial\theta}\right) = \operatorname{Cov}\!\left(\frac{\partial\bar\phi_T^{(j)}(\hat\theta_T^{CU})}{\partial\theta},\bar\phi_T(\hat\theta_T^{CU})\right)S_T^{-1}(\hat\theta_T^{CU})\,\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU}). \]
With a slight abuse of notation, we define conveniently the matrix of size $(p,K^2)$ built by stacking horizontally the $K$ matrices of size $(p,K)$, $\operatorname{Cov}(\partial\bar\phi_T^{(j)}(\hat\theta_T^{CU})/\partial\theta,\bar\phi_T(\hat\theta_T^{CU}))$, as
\[ \operatorname{Cov}\!\left(\frac{\partial\bar\phi_T(\hat\theta_T^{CU})}{\partial\theta},\bar\phi_T(\hat\theta_T^{CU})\right) \equiv \left[\operatorname{Cov}\!\left(\frac{\partial\bar\phi_T^{(1)}(\hat\theta_T^{CU})}{\partial\theta},\bar\phi_T(\hat\theta_T^{CU})\right)\ \cdots\ \operatorname{Cov}\!\left(\frac{\partial\bar\phi_T^{(K)}(\hat\theta_T^{CU})}{\partial\theta},\bar\phi_T(\hat\theta_T^{CU})\right)\right]. \]
Then, we can write:
\[ P\!\left(\sqrt{T}\,\frac{\partial\bar\phi_T(\hat\theta_T^{CU})}{\partial\theta'}\right)' = \operatorname{Cov}\!\left(\frac{\partial\bar\phi_T(\hat\theta_T^{CU})}{\partial\theta},\bar\phi_T(\hat\theta_T^{CU})\right)\Big[Id_K\otimes\big(S_T^{-1}(\hat\theta_T^{CU})\,\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU})\big)\Big]. \]
Pre-multiply the above FOC equations by the invertible $(p,p)$-matrix $\Lambda_T^{-1}R'$ to get:
\[ \Lambda_T^{-1}R'\,\sqrt{T}\,\frac{\partial\bar\phi_T(\hat\theta_T^{CU})'}{\partial\theta}\,S_T^{-1}(\hat\theta_T^{CU})\,\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU}) - \Lambda_T^{-1}R'\operatorname{Cov}\!\left(\frac{\partial\bar\phi_T(\hat\theta_T^{CU})}{\partial\theta},\bar\phi_T(\hat\theta_T^{CU})\right)\Big[Id_K\otimes\big(S_T^{-1}(\hat\theta_T^{CU})\,\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU})\big)\Big]S_T^{-1}(\hat\theta_T^{CU})\,\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU}) = 0. \tag{A.3} \]

(i) Special case with the same weakness ($\lambda_T$) for all moment conditions: $\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU}) = \lambda_T\,\rho(\hat\theta_T^{CU}) + \Psi_T(\hat\theta_T^{CU})$, where $\Psi_T(\theta)\Rightarrow\Psi(\theta)$, a Gaussian stochastic process on $\Theta$. Since here $\Lambda_T=\lambda_T\,Id_p$, we can consider a Gaussian random variable $U$ to rewrite the second term of the LHS of equation (A.3) as follows:
\[ \Lambda_T^{-1}R'\operatorname{Cov}\!\left(\frac{\partial\bar\phi_T(\hat\theta_T^{CU})}{\partial\theta},\bar\phi_T(\hat\theta_T^{CU})\right)\Big[Id_K\otimes\big(S_T^{-1}(\hat\theta_T^{CU})\,\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU})\big)\Big]S_T^{-1}(\hat\theta_T^{CU})\,\lambda_T\left[\rho(\hat\theta_T^{CU})-\frac{1}{\lambda_T}U\right] = o_P(1), \]
because we have $\big(Id_K\otimes[S_T^{-1}(\hat\theta_T^{CU})\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU})]\big)=O_P(1)$ and $[\rho(\hat\theta_T^{CU})-U/\lambda_T]=o_P(1)$. And we conclude that CU-GMM is equivalent to efficient GMM.

(ii) Consider now two groups of moment conditions with rates $\sqrt{T}$ and $\lambda_T$:
\[ \sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU}) = \begin{pmatrix}\sqrt{T}\,Id_{k_1} & 0 \\ 0 & \lambda_T\,Id_{k_2}\end{pmatrix}\rho(\hat\theta_T^{CU}) + \Psi_T(\hat\theta_T^{CU}). \]
The second term of the LHS of equation (A.3) can be separated into two pieces: the first one involves a Gaussian random variable $V$ and the second one involves $\rho(\cdot)$. First:
\[ \Lambda_T^{-1}\underbrace{R'\operatorname{Cov}\!\left(\frac{\partial\bar\phi_T(\hat\theta_T^{CU})}{\partial\theta},\bar\phi_T(\hat\theta_T^{CU})\right)\Big[Id_K\otimes\big(S_T^{-1}(\hat\theta_T^{CU})\,\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU})\big)\Big]S_T^{-1}(\hat\theta_T^{CU})}_{O_P(1)}\,V = o_P(1). \]
Second, recall that we have:
\[ \begin{pmatrix}\sqrt{T}\,Id_{k_1} & 0 \\ 0 & \lambda_T\,Id_{k_2}\end{pmatrix}\rho(\hat\theta_T^{CU}) = \begin{pmatrix}\sqrt{T}\,\rho_1(\hat\theta_T^{CU}) \\ \lambda_T\,\rho_2(\hat\theta_T^{CU})\end{pmatrix}, \]
and define the matrix
\[ A \equiv R'\operatorname{Cov}\!\left(\frac{\partial\bar\phi_T(\hat\theta_T^{CU})}{\partial\theta},\bar\phi_T(\hat\theta_T^{CU})\right)\Big[Id_K\otimes\big(S_T^{-1}(\hat\theta_T^{CU})\,\sqrt{T}\,\bar\phi_T(\hat\theta_T^{CU})\big)\Big]S_T^{-1}(\hat\theta_T^{CU}) = \begin{pmatrix}A_{11} & A_{12}\\ A_{21} & A_{22}\end{pmatrix}, \]
where $A_{ij}=O_P(1)$ for any $1\le i,j\le 2$, with respective sizes: $(s_1,k_1)$ for $A_{11}$; $(s_1,k_2)$ for $A_{12}$; $(p-s_1,k_1)$ for $A_{21}$; and $(p-s_1,k_2)$ for $A_{22}$. Then the second piece writes:
\[ \Lambda_T^{-1}A\begin{pmatrix}\sqrt{T}\,\rho_1(\hat\theta_T^{CU}) \\ \lambda_T\,\rho_2(\hat\theta_T^{CU})\end{pmatrix} = \begin{pmatrix} A_{11}\,\rho_1(\hat\theta_T^{CU}) + \dfrac{\lambda_T}{\sqrt{T}}A_{12}\,\rho_2(\hat\theta_T^{CU}) \\[2ex] \dfrac{\sqrt{T}}{\lambda_T}A_{21}\,\rho_1(\hat\theta_T^{CU}) + A_{22}\,\rho_2(\hat\theta_T^{CU}) \end{pmatrix}. \]
And we have (using $\|\rho(\hat\theta_T^{CU})\|=O_P(1/\lambda_T)$):
\[ A_{11}\,\rho_1(\hat\theta_T^{CU}) = O_P(1)\times O_P\!\left(\frac{1}{\lambda_T}\right) = o_P(1), \qquad \frac{\lambda_T}{\sqrt{T}}A_{12}\,\rho_2(\hat\theta_T^{CU}) = \frac{\lambda_T}{\sqrt{T}}\times O_P(1)\times O_P\!\left(\frac{1}{\lambda_T}\right) = o_P(1), \]
\[ A_{22}\,\rho_2(\hat\theta_T^{CU}) = O_P(1)\times O_P\!\left(\frac{1}{\lambda_T}\right) = o_P(1), \qquad \frac{\sqrt{T}}{\lambda_T}A_{21}\,\rho_1(\hat\theta_T^{CU}) = \frac{\sqrt{T}}{\lambda_T}\times O_P(1)\times O_P\!\left(\frac{1}{\lambda_T}\right) = O_P\!\left(\frac{\sqrt{T}}{\lambda_T^2}\right). \]
Hence, to get the equivalence between CU-GMM and efficient GMM, we need the nearly-strong identification condition, that is, $\lambda_T^2/\sqrt{T}\to\infty$.
The Econometrics Journal (2009), volume 12, pp. S172–S199. doi: 10.1111/j.1368-423X.2008.00265.x

Invalidity of the bootstrap and the m out of n bootstrap for confidence interval endpoints defined by moment inequalities

DONALD W. K. ANDREWS† AND SUKJIN HAN‡

†Cowles Foundation for Research in Economics, Yale University, New Haven, CT 06520, USA
E-mail: [email protected]

‡Department of Economics, Yale University, New Haven, CT 06520, USA
E-mail: [email protected]
First version received: July 2008; final version accepted: September 2008
Summary This paper analyses the finite-sample and asymptotic properties of several bootstrap and m out of n bootstrap methods for constructing confidence interval (CI) endpoints in models defined by moment inequalities. In particular, we consider using these methods directly to construct CI endpoints. By considering two very simple models, the paper shows that neither the bootstrap nor the m out of n bootstrap is valid in finite samples or in a uniform asymptotic sense in general when applied directly to construct CI endpoints. In contrast, other results in the literature show that other ways of applying the bootstrap, m out of n bootstrap, and subsampling do lead to uniformly asymptotically valid confidence sets in moment inequality models. Thus, the uniform asymptotic validity of resampling methods in moment inequality models depends on the way in which the resampling methods are employed. Keywords: Bootstrap, Coverage probability, m out of n bootstrap, Moment inequality model, Partial identification, Subsampling.
1. INTRODUCTION

This paper considers confidence intervals (CIs) for partially identified parameters defined by moment inequalities. The paper investigates the properties of the bootstrap and the m out of n bootstrap applied directly to CI endpoints. (Here, m is the bootstrap sample size and n is the original sample size.) By ‘applied directly to CI endpoints’, we mean that one takes the CI upper endpoint to be the upper bound of the estimated set based on the original sample plus a (recentred and rescaled) sample quantile of the upper bounds of the estimated sets from a collection of bootstrap or m out of n bootstrap samples and analogously for the CI lower endpoint. We note that the m out of n bootstrap has been suggested in the literature as an alternative to the bootstrap in cases in which the bootstrap does not work properly. ‘Backward’ and ‘forward’ bootstrap and m out of n bootstrap CIs are considered. (These are defined below.) Both finite-sample and asymptotic coverage probabilities (ACPs) and sizes of the CIs are obtained. In fact, one of the novelties of the paper is the determination of exact
finite-sample coverage probabilities and sizes for some bootstrap and m out of n bootstrap procedures. The results show that neither the bootstrap nor the m out of n bootstrap is asymptotically valid in a uniform sense in general when applied directly to CI endpoints. These results are obtained by considering the parametric bootstrap in two particular models. These two models each have normally distributed observations, a scalar parameter θ and two moment inequalities. The two models are selected to exhibit the two common features of moment inequality models that cause difficulties for inference. The first model exhibits a redundant, but not irrelevant, moment inequality. The model has two moment inequalities, only one of which is binding in the population, but either of which may be binding in the sample due to random fluctuations. The second model exhibits the possibility of random ‘reversals’ of moment inequalities. The model has two moment inequalities that bound a parameter from below and above such that the identified set is a proper interval. But the length of the identified set is sufficiently short relative to the variability in the moment inequalities that there is a non-negligible probability that the estimated set is a singleton. The estimated set is a singleton when the lower bound from one moment inequality is larger than the upper bound from another moment inequality, which is referred to as a ‘reversal’ of the moment inequalities. Redundant but not irrelevant moment inequalities and reversals of moment inequalities are common features of moment inequality models (e.g. see Andrews et al., 2004, and Pakes et al., 2004). This paper shows that the finite-sample coverage probabilities and ACPs and sizes of the bootstrap and m out of n bootstrap CIs can be far from their nominal levels in these two models.¹ For example, in the first model, the nominal .95 ‘backward’ bootstrap CI has finite-sample confidence size equal (up to simulation error) to .00 for all n.
Similarly, the nominal .95 m out of n ‘backward’ bootstrap has finite-sample confidence size equal to .00 when m/n = .01, .05 or .10 for all n. It has asymptotic size .00 when m/n → 0 as n → ∞. In the second model, the nominal .95 ‘forward’ bootstrap CI has finite-sample confidence size equal (up to simulation error) to .51 for all n. Similarly, the nominal .95 m out of n ‘forward’ bootstrap has finite-sample confidence size equal to .51 when m/n = .01, .05 or .10 for all n. It has asymptotic size .50 when m/n → 0 as n → ∞. The failure of the bootstrap in these models is due to the non-differentiability of the statistics of interest as a function of the underlying sample moments (see Shao, 1994 for further discussion). The failure of the m out of n bootstrap is due to the discontinuity of the asymptotic distribution of the statistics of interest as a function of the parameters (see Andrews and Guggenberger, 2009a, for further discussion). Obviously, if a method fails to deliver desirable finite-sample and asymptotic properties in the two models considered in the paper, it cannot do so in general. Hence, the results of this paper show that the bootstrap and m out of n bootstrap applied directly to CI endpoints cannot be relied upon to give valid inference in general. As stated above, the results given here are for the parametric bootstrap and the m out of n parametric bootstrap. The asymptotic properties of the nonparametric i.i.d. bootstrap and nonparametric i.i.d. m out of n bootstrap are the same as those of the parametric bootstrap, although we do not show this explicitly in the paper. Furthermore, asymptotic results for subsampling are the same as for the nonparametric i.i.d. m out of n bootstrap provided the subsample size b equals m and m²/n → 0 as n → ∞ (see Politis et al., 1999, p. 48).
Hence, the asymptotic results of this paper for the m out of n bootstrap should also apply to subsampling methods. Such results for subsampling could be established directly using the methods in Andrews and Guggenberger (2009a). For brevity, we do not do so here.

We emphasize that there are different ways of applying the bootstrap, m out of n bootstrap, and subsampling to moment inequality models. This paper addresses one way, viz., by applying such methods to CI endpoints directly. The paper shows that this does not yield asymptotically valid CIs in a uniform sense in general. On the other hand, if one follows the approach in Chernozhukov et al. (2007) and one constructs confidence sets by inverting tests based on an Anderson–Rubin type test statistic, then subsampling and the m out of n bootstrap yield confidence sets that are asymptotically valid in a uniform sense for test statistics in a suitable class (see Andrews and Guggenberger, 2009c, and also see Romano and Shaikh, 2008). Furthermore, confidence sets constructed by inverting tests based on an Anderson–Rubin test statistic can be coupled with a recentred bootstrap that is applied as part of a generalized moment selection (GMS) method for constructing critical values. Such bootstrap-based confidence sets are asymptotically valid in a uniform sense (see Andrews and Soares, 2007). In particular, the subsampling and GMS methods just described provide asymptotically valid inference in a uniform sense in the two models considered in this paper. Bugni (2007a,b) and Canay (2007) consider closely related bootstrap-based confidence sets that can be shown to be asymptotically valid in a uniform sense. The method of constructing confidence sets based on a finite number of moment inequalities and/or equalities that we recommend is given in Andrews and Jia (2008). It employs the bootstrap.

We now discuss related literature. Bugni (2007a) and Canay (2007) also provide results regarding the inconsistency of certain bootstrap methods in moment inequality models.

¹ By definition, the ‘size’ of a CI is the infimum of the coverage probabilities of the CI over all distributions in the model. A CI has ‘level’ 1 − α if its size is greater than or equal to 1 − α.
The bootstrap method that Canay (2007) considers differs from the ones considered here because he considers the bootstrap distribution of a test statistic. The bootstrap method and model that Bugni (2007a) considers are quite similar to the bootstrap method considered in this paper and model I. The present paper and Bugni (2007a) were written independently. The bootstrap results for models I and II given here were done in November 2005. There is a large literature on bootstrap inconsistency due to non-regularity of a model (see Efron, 1979, Bickel and Freedman, 1981, Beran, 1982, 1997, Babu, 1984, Beran and Srivastava, 1985, Athreya, 1987, Romano, 1988, Basawa et al., 1991, Putter and van Zwet, 1996, Bretagnolle, 1983, Deheuvels et al., 1993, Dümbgen, 1993, Sriram, 1993, Athreya and Fukuchi, 1994, Datta, 1995, Bickel et al., 1997, and Andrews, 2000). When the bootstrap is not consistent, it is common in the literature to suggest using the m out of n bootstrap or subsampling as an alternative (see Bretagnolle, 1983, Swanepoel, 1986, Athreya, 1987, Beran and Srivastava, 1987, Shao and Wu, 1989, Wu, 1990, Eaton and Tyler, 1991, Politis and Romano, 1994, Shao, 1994, 1996, Beran, 1997, Bickel et al., 1997, Andrews, 1999, 2000, Politis et al., 1999, Romano and Wolf, 2001, Guggenberger and Wolf, 2004, Lehmann and Romano, 2005, Romano and Shaikh, 2005, 2008, and Chernozhukov et al., 2007). Potential problems with the m out of n bootstrap and subsampling are discussed in Dümbgen (1993), Beran (1997), Andrews (2000), Samworth (2003), Andrews and Guggenberger (2005, 2009a, b, d), and Mikusheva (2007). The remainder of the paper is organized as follows. Section 2 introduces the general moment inequality model. Section 3 defines ‘forward’ and ‘backward’ bootstrap CIs based on bootstrapping CI endpoints. Section 4 does likewise for the m out of n bootstrap.
Sections 5 and 6 treat two specific moment inequality models that are based on linear, normally distributed moment inequalities. These two sections provide finite-sample and asymptotic coverage probability (ACP) and size results for bootstrap and m out of n bootstrap procedures in the two models considered.

© The Author(s). Journal compilation © Royal Economic Society 2009.
Invalidity of the bootstrap
2. MOMENT INEQUALITY MODEL

The sample is $\{X_i : i \le n\}$. The random variables $\{X_i : i \ge 1\}$ are assumed to be i.i.d. We have some moment functions $m(X_i, \theta) \in \mathbb{R}^k$ that depend on a parameter $\theta \in \Theta \subset \mathbb{R}^p$. The true value of the parameter is $\theta_0 \in \Theta$ and the true distribution of $X_i$ is $F_0$. The population moments satisfy

$$E_{F_0} m(X_i, \theta_0) \ge 0 \quad (2.1)$$
element by element. We are interested in a real-valued smooth function $g(\theta) \in \mathbb{R}$ of $\theta$. Define the identified set of $\theta$ values that satisfy the population moment inequalities to be

$$\Theta_0 = \{\theta \in \Theta : E_{F_0} m(X_i, \theta) \ge 0\}. \quad (2.2)$$
The corresponding identified set of $g(\theta)$ values is $[g_{L0}, g_{U0}]$, where

$$g_{L0} = \inf\{g(\theta) : \theta \in \Theta_0\} \quad \text{and} \quad g_{U0} = \sup\{g(\theta) : \theta \in \Theta_0\}. \quad (2.3)$$
The object is to determine a random interval $[\hat g_{Ln}, \hat g_{Un}]$ that contains either the true value $g(\theta_0)$ with probability $1-\alpha$ asymptotically or the identified interval $[g_{L0}, g_{U0}]$ with probability $1-\alpha$ asymptotically, for some $\alpha \in (0,1)$. We specify a plausible bootstrap procedure for doing this and show that it does not have the correct ACP in very simple normal models with linear moment functions and a scalar parameter $\theta$. We show that using an m out of n version of the bootstrap does not solve the problem. Define

$$m_n(X_i, \theta) = n^{-1} \sum_{i=1}^{n} m(X_i, \theta), \qquad Q_n(\theta) = d\bigl(\sqrt{n}\,[m_n(X_i, \theta)]_-\bigr), \qquad [x]_- = \min\{x, 0\},$$
$$\hat\Theta_n = \{\theta \in \Theta : Q_n(\theta) = \inf_{\theta' \in \Theta} Q_n(\theta')\},$$
$$\hat g_{Ln} = \inf\{g(\theta) : \theta \in \hat\Theta_n\} \quad \text{and} \quad \hat g_{Un} = \sup\{g(\theta) : \theta \in \hat\Theta_n\}, \quad (2.4)$$

where $d(\cdot) : \mathbb{R}^k \to \mathbb{R}$ is a non-negative distance function such as $d(x) = x'x$ or $d(x) = \sum_{j=1}^{k} |x_j|$.² The quantities $\hat\Theta_n$, $\hat g_{Ln}$ and $\hat g_{Un}$ are estimators of $\Theta_0$, $g_{L0}$ and $g_{U0}$, respectively.³
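As a concrete numerical illustration of (2.4) (ours, not part of the paper), the set estimator and the endpoint estimators can be computed by grid search when $\theta$ is scalar. The moment function and simulated data below are illustrative assumptions, chosen to mimic the two-inequality linear model analysed later.

```python
import numpy as np

def set_estimators(X, m, g, theta_grid, tol=1e-8):
    """Grid-search version of (2.4): Theta_hat_n collects the (near-)minimizers
    of Q_n(theta) = d(sqrt(n) [m_n(theta)]_-) with d(x) = x'x, and the endpoint
    estimators are the min and max of g over Theta_hat_n."""
    n = len(X)

    def Q(theta):
        mbar = m(X, theta).mean(axis=0)            # sample moments m_n(theta)
        neg = np.minimum(np.sqrt(n) * mbar, 0.0)   # [x]_- = min{x, 0}, element by element
        return float(neg @ neg)                    # d(x) = x'x

    q = np.array([Q(t) for t in theta_grid])
    theta_hat = theta_grid[q <= q.min() + tol]     # set of minimizers
    gvals = g(theta_hat)
    return theta_hat, gvals.min(), gvals.max()     # Theta_hat_n, g_Ln_hat, g_Un_hat

# Illustrative example with m(X_i, theta) = (X_1i - theta, X_2i - theta)',
# so g_Un_hat should equal min{X1bar, X2bar} up to the grid step (0.005 here).
rng = np.random.default_rng(0)
X = rng.normal([0.0, 0.5], 1.0, size=(200, 2))
grid = np.linspace(-3.0, 3.0, 1201)
Theta_hat, gL, gU = set_estimators(X, lambda X, t: X - t, lambda t: t, grid)
```

Because the identified set in this example is unbounded below, the reported lower endpoint is simply the lower edge of the search grid.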
3. BOOTSTRAP FOR CI ENDPOINTS

We now define a heuristic procedure for constructing the random interval $[\hat g_{Ln}, \hat g_{Un}]$ based on the bootstrap. (Based on the results given below, we do not recommend that this procedure be used

² One also could consider data-dependent weight functions $d(\cdot)$ of $[m_n(X_i, \theta)]_-$, but for simplicity we do not do so here.
³ If $E_{\theta_0} m(X_i, \theta)$ is not well behaved, then it is possible for these estimators to be inconsistent (e.g. see Manski and Tamer, 2002). But this is not the issue that is of concern here. We consider cases in which $E_{\theta_0} m(X_i, \theta)$ is sufficiently well behaved that $\hat\Theta_n$, $\hat g_{Ln}$ and $\hat g_{Un}$ are consistent.
Donald W. K. Andrews and Sukjin Han
in practice.) This procedure uses the $1-\alpha/2$ quantile of a bootstrap distribution in the upper endpoint and the $\alpha/2$ quantile in the lower endpoint of a nominal $1-\alpha$ two-sided bootstrap CI. For reasons described below, this can be viewed as a 'backward' bootstrap CI. We also consider a 'forward' bootstrap CI in which the $\alpha/2$ quantile appears in the upper endpoint (with a minus sign) and the $1-\alpha/2$ quantile appears in the lower endpoint (with a minus sign) of the bootstrap CI. In regular models, both the 'backward' and 'forward' bootstrap CIs give first-order asymptotically correct CIs. But in the non-standard scenarios considered here, neither is asymptotically correct in general and they have quite different properties, as shown below.

The 'backward' bootstrap procedure is as follows. (i) Generate $B$ independent bootstrap samples $\{X^*_{ir} : i \le n\}$ for $r = 1, \ldots, B$ using some method of bootstrapping, as discussed further below. (ii) Compute $\hat g^*_{Lnr}$ and $\hat g^*_{Unr}$ using the definitions of $\hat g_{Ln}$ and $\hat g_{Un}$ in (2.4) with $\{X^*_{ir} : i \le n\}$ in place of $\{X_i : i \le n\}$ for $r = 1, \ldots, B$. (iii) Compute the $\alpha/2$ sample quantile of $\{\hat g^*_{Lnr} : r = 1, \ldots, B\}$, call it $c^{**}_{LnB}(\alpha/2)$. (iv) Compute the $1-\alpha/2$ sample quantile of $\{\hat g^*_{Unr} : r = 1, \ldots, B\}$, call it $c^{**}_{UnB}(1-\alpha/2)$. (v) Take the random interval $[\hat g_{Ln}, \hat g_{Un}]$ to be

$$[\hat g_{Ln}, \hat g_{Un}] = \bigl[c^{**}_{LnB}(\alpha/2),\; c^{**}_{UnB}(1-\alpha/2)\bigr]. \quad (3.1)$$

The intuitive idea behind this interval is that the bootstrap quantities $\hat g^*_{Lnr}$ and $\hat g^*_{Unr}$ for $r = 1, \ldots, B$ behave like $B$ i.i.d. realizations of $\hat g_{Ln}$ and $\hat g_{Un}$. Hence, the interval from the $\alpha/2$ sample quantile of $\hat g^*_{Lnr}$ to the $1-\alpha/2$ quantile of $\hat g^*_{Unr}$ should include $[g_{L0}, g_{U0}]$ with probability $1-\alpha$. This intuition is not completely correct because it ignores the issues of (i) proper centering of the bootstrap quantities and (ii) the non-differentiability in the mapping between the sample moments and the estimators $\hat g_{Ln}$ and $\hat g_{Un}$. In 'regular' cases the first issue does not cause problems and the second issue does not arise. In the present case with moment inequalities, these issues cause problems.

In practice, one often is interested in a two-sided CI such as the one in (3.1). However, for simplicity, we focus on a one-sided interval

$$(-\infty, \hat g_{Un}] = \bigl(-\infty,\; c^{**}_{UnB}(1-\alpha)\bigr], \quad (3.2)$$

where $\hat g_{Un}$ and $c^{**}_{UnB}(1-\alpha)$ are defined as above.

The bootstrap that is employed to generate the bootstrap samples can be the usual i.i.d. nonparametric bootstrap (in which $\{X^*_{ir} : i \le n\}$ are i.i.d. draws from the empirical distribution of $\{X_i : i \le n\}$), a parametric bootstrap (if the distribution of the data is specified up to an unknown parameter) or an 'asymptotic normal' bootstrap [in which $\{m_n(X^*_{ir}, \theta) : \theta \in \Theta\}$ for $r = 1, \ldots, B$ are i.i.d. draws from a $k$-variate Gaussian process with mean $m_n(X_i, \theta)$ and covariance function $C_n(\theta_1, \theta_2) = n^{-1} \sum_{i=1}^{n} (m(X_i, \theta_1) - m_n(X_i, \theta_1))(m(X_i, \theta_2) - m_n(X_i, \theta_2))'$].

It is standard in the bootstrap literature to analyse the properties of the bootstrap when $B = \infty$ because the simulation error due to the use of $B$ bootstrap repetitions can be made arbitrarily small by taking $B$ large. We do this here. When $B = \infty$, $c^{**}_{UnB}(1-\alpha)$ is the population $1-\alpha$ quantile of the distribution of $\hat g^*_{Unr}$ given the original sample $\{X_i : i \le n\}$. For notational simplicity, when $B = \infty$, we let $c^{**}_{Un}(1-\alpha)$ denote $c^{**}_{UnB}(1-\alpha)$.

Let $c^{*}_{UnB}(\alpha)$ denote the $\alpha$ quantile of $n^{1/2}(\hat g^*_{Unr} - \hat g_{Un})$ conditional on the original sample $\{X_i : i \le n\}$. Notice that

$$c^{**}_{UnB}(\alpha) = \hat g_{Un} + n^{-1/2} c^{*}_{UnB}(\alpha). \quad (3.3)$$

In consequence, the interval in (3.2) can be written equivalently as

$$(-\infty, \hat g_{Un}] = \bigl(-\infty,\; \hat g_{Un} + n^{-1/2} c^{*}_{UnB}(1-\alpha)\bigr]. \quad (3.4)$$
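To make the 'backward' construction in (3.2) and the 'forward' construction in (3.8) below concrete, the following sketch (ours, not the paper's) computes both one-sided upper endpoints by the i.i.d. nonparametric bootstrap. As an illustrative estimator it uses $\hat g_{Un} = \min\{\bar X_{1n}, \bar X_{2n}\}$, the Model I estimator analysed in Section 5; the data-generating values are assumptions for the example only.

```python
import numpy as np

def upper_endpoints(X, B=2000, alpha=0.05, seed=0):
    """One-sided nominal 1-alpha upper endpoints: 'backward' per (3.2)
    (the 1-alpha quantile of the bootstrap estimates themselves) and
    'forward' per (3.8) (g_hat minus the alpha quantile of the recentred,
    rescaled bootstrap distribution)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    g_hat = X.mean(axis=0).min()                   # g_Un_hat = min{X1bar, X2bar}
    idx = rng.integers(0, n, size=(B, n))          # (i) i.i.d. nonparametric bootstrap
    g_star = X[idx].mean(axis=1).min(axis=1)       # (ii) g*_Unr, r = 1..B
    backward = np.quantile(g_star, 1 - alpha)      # c**_UnB(1-alpha), eq. (3.2)
    c_star = np.quantile(np.sqrt(n) * (g_star - g_hat), alpha)
    forward = g_hat - c_star / np.sqrt(n)          # eq. (3.8)
    return backward, forward

rng = np.random.default_rng(1)
X = rng.normal([0.0, 0.2], 1.0, size=(100, 2))
b, f = upper_endpoints(X)
```

By (3.3), the 'backward' endpoint equals $\hat g_{Un} + n^{-1/2} c^{*}_{UnB}(1-\alpha)$, so the two constructions differ only in which tail quantile of the centred bootstrap distribution is used.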
This way of writing the CI makes it clear that the CI is based on an estimator of the upper bound plus a bootstrap adjustment that takes sampling error into account. For reasons given below, the CI of (3.2) and (3.4) will be referred to as a 'backward' bootstrap CI.

Suppose one is interested in a CI for the true value $g(\theta_0)$. Let $AS_{tv}$ denote the 'asymptotic size of the CI for the true value'. By definition and simple algebra,

$$AS_{tv} = \liminf_{n\to\infty} \inf_{(\theta_0,\gamma_0)\in\Theta\times\Gamma} P_{\theta_0,\gamma_0}\bigl(g(\theta_0) \le \hat g_{Un} + n^{-1/2} c^{*}_{Un}(1-\alpha)\bigr)$$
$$= \liminf_{n\to\infty} \inf_{(\theta_0,\gamma_0)\in\Theta\times\Gamma} P_{\theta_0,\gamma_0}\bigl(n^{1/2}(\hat g_{Un} - g(\theta_0)) \ge -c^{*}_{Un}(1-\alpha)\bigr), \quad (3.5)$$

where $c^{*}_{Un}(1-\alpha)$ denotes $c^{*}_{UnB}(1-\alpha)$ when $B = \infty$ and $\gamma_0$ is a nuisance parameter with parameter space $\Gamma$. The nuisance parameter $\gamma_0$ arises because $\theta_0$ does not completely determine the distribution of the sample $\{X_i : i \le n\}$ in the moment inequality context. In later sections, when we consider simple examples, the nuisance parameter $\gamma_0$ is specified explicitly.

Taking the infimum over $(\theta_0, \gamma_0) \in \Theta\times\Gamma$ in (3.5) is standard in the definition of the size of a CI. In particular, by definition, without the $\liminf_{n\to\infty}$ the expression in (3.5) for $AS_{tv}$ is the confidence size of the random interval. Taking the infimum over $(\theta_0, \gamma_0) \in \Theta\times\Gamma$ guarantees that no matter what is the (unknown) true value of the parameter of interest $\theta_0$ or the (unknown) nuisance parameter $\gamma_0$, the ACP is at least $AS_{tv}$.

Next, suppose one is interested in a CI for the identified interval $(-\infty, g_{U0}]$. For this case, let $AS_{int}$ be defined analogously to $AS_{tv}$, but with $g_{U0}$ in place of the true value $g(\theta_0)$. Then,

$$AS_{int} = \liminf_{n\to\infty} \inf_{(\theta_0,\gamma_0)\in\Theta\times\Gamma} P_{\theta_0,\gamma_0}\bigl(n^{1/2}(\hat g_{Un} - g_{U0}) \ge -c^{*}_{Un}(1-\alpha)\bigr). \quad (3.6)$$

One would like the bootstrap intervals to be such that $AS_{tv} = 1-\alpha$ and $AS_{int} = 1-\alpha$ or, at least,

$$AS_{tv} \ge 1-\alpha \quad \text{and} \quad AS_{int} \ge 1-\alpha. \quad (3.7)$$

If the CI satisfies $AS_{tv} > 1-\alpha$ (or $AS_{int} > 1-\alpha$), then it has asymptotic level $1-\alpha$, but may be longer than desirable. We show later that the bootstrap interval $(-\infty, \hat g_{Un}]$ in (3.2) (or equivalently in (3.4)) does not necessarily satisfy $AS_{tv} \ge 1-\alpha$ or $AS_{int} \ge 1-\alpha$ even if the infimum over $(\theta_0, \gamma_0) \in \Theta\times\Gamma$ in (3.5) or (3.6) is deleted.

Given the definition of $AS_{tv}$ in (3.5), the bootstrap interval in (3.4) has the desired (first-order) asymptotic property in (3.7) if the difference between the $\alpha$ quantile of $n^{1/2}(\hat g_{Un} - g(\theta_0))$ and $-1$ times the $1-\alpha$ quantile of $n^{1/2}(\hat g^*_{Unr} - \hat g_{Un})$ (given the sample $\{X_i : i \le n\}$) converges in probability to zero uniformly over $(\theta_0, \gamma_0) \in \Theta\times\Gamma$. This seems 'backwards' because, in scenarios in which the bootstrap works properly, the distribution of the normalized bootstrap estimator $n^{1/2}(\hat g^*_{Unr} - \hat g_{Un})$ is close to that of the normalized estimator $n^{1/2}(\hat g_{Un} - g(\theta_0))$ when $n$ is large. Hence, in such cases it makes sense to have $c^{*}_{Un}(\alpha)$ appear in place of $-c^{*}_{Un}(1-\alpha)$ in the right-hand-side expression for $AS_{tv}$ in (3.5). Indeed, Hall (1992, pp. 12, 36) refers to a bootstrap interval of the type in (3.4) as the 'other percentile' or 'backward percentile' bootstrap CI. If $n^{1/2}(\hat g_{Un} - g(\theta_0))$ and $n^{1/2}(\hat g^*_{Unr} - \hat g_{Un})$ are both asymptotically normal, then the 'backward percentile' bootstrap CI typically has the desired (first-order) asymptotic properties because $c^{*}_{Un}(\alpha)$ and $-c^{*}_{Un}(1-\alpha)$ both converge in probability to $z_\alpha$ using the symmetry of the asymptotic normal distribution. In the present case, however, neither $n^{1/2}(\hat g_{Un} - g(\theta_0))$ nor $n^{1/2}(\hat g^*_{Unr} - \hat g_{Un})$ is asymptotically normal.
An alternative 'forward' bootstrap CI is given by taking $(-\infty, \hat g_{Un}]$ to be

$$(-\infty, \hat g_{Un}] = \bigl(-\infty,\; \hat g_{Un} - n^{-1/2} c^{*}_{UnB}(\alpha)\bigr]. \quad (3.8)$$

This 'forward' bootstrap CI has coverage probabilities for covering $g(\theta_0)$ and $(-\infty, g_{U0}]$ given by

$$AS^{for}_{tv} = \liminf_{n\to\infty} \inf_{(\theta_0,\gamma_0)\in\Theta\times\Gamma} P_{\theta_0,\gamma_0}\bigl(n^{1/2}(\hat g_{Un} - g(\theta_0)) \ge c^{*}_{Un}(\alpha)\bigr) \quad \text{and}$$
$$AS^{for}_{int} = \liminf_{n\to\infty} \inf_{(\theta_0,\gamma_0)\in\Theta\times\Gamma} P_{\theta_0,\gamma_0}\bigl(n^{1/2}(\hat g_{Un} - g_{U0}) \ge c^{*}_{Un}(\alpha)\bigr), \quad (3.9)$$

respectively. We show below that the 'forward' bootstrap interval $(-\infty, \hat g_{Un}]$ in (3.8) does not necessarily satisfy $AS^{for}_{tv} \ge 1-\alpha$ or $AS^{for}_{int} \ge 1-\alpha$ even if the infimum over $(\theta_0, \gamma_0) \in \Theta\times\Gamma$ in the first or second line of (3.9) is deleted. Hence, neither the 'backward' nor the 'forward' bootstrap CI has the desired asymptotic properties in general.

Notice that the coverage probabilities of the bootstrap CIs given in (3.5), (3.6) and (3.9) depend on the distributions of $n^{1/2}(\hat g_{Un} - g(\theta_0))$, $n^{1/2}(\hat g_{Un} - g_{U0})$ and $n^{1/2}(\hat g^*_{Unr} - \hat g_{Un})$. Hence, in subsequent sections, we determine what these normalized distributions are.
4. m OUT OF n BOOTSTRAP FOR CI ENDPOINTS

We now consider the m out of n bootstrap for CI endpoints. (Based on the results given below, we do not recommend that this procedure be used in practice either.) As in the previous section, we simplify the arguments by focusing on a one-sided CI. The m out of n 'backward' bootstrap procedure is defined as in (3.4) but with the bootstrap sample size $n$ replaced by $m$ ($\le n$):

$$(-\infty, \hat g_{Un}] = \bigl(-\infty,\; \hat g_{Un} + n^{-1/2} c^{*}_{UmB}(1-\alpha)\bigr], \quad (4.1)$$

where $c^{*}_{UmB}(\alpha)$ denotes the $\alpha$ quantile of the bootstrap distribution based on samples of size $m$, conditional on the original sample.⁴ The bootstrap that is employed can be any of those discussed in the previous section.

For a CI for the true value $g(\theta_0)$, $AS_{tv}$ is defined as in (3.5) but with $c^{*}_{Um}(1-\alpha)$ in place of $c^{*}_{Un}(1-\alpha)$. Analogously, for a CI for the identified interval $(-\infty, g_{U0}]$, $AS_{int}$ is defined as in (3.6) but with $c^{*}_{Um}(1-\alpha)$ in place of $c^{*}_{Un}(1-\alpha)$, where $c^{*}_{Um}(1-\alpha)$ denotes $c^{*}_{UmB}(1-\alpha)$ when $B = \infty$. As above, we would like $AS_{tv}$ and $AS_{int}$ to satisfy (3.7).

We show below that the m out of n bootstrap interval $(-\infty, \hat g_{Un}]$ in (4.1) does not necessarily satisfy $AS_{tv} \ge 1-\alpha$ or $AS_{int} \ge 1-\alpha$ for any value of $m/n$, including $m/n = 0$ (which gives the asymptotic size when $m/n \to 0$ as $n \to \infty$, as is usually assumed for the m out of n bootstrap). In fact, this is true even if the infima in (3.5) and (3.6) are deleted.

⁴ The critical value $c^{*}_{UmB}(1-\alpha)$ depends on $n$ as well as $m$ because the bootstrap distribution depends on the original sample, which is of size $n$. However, for notational simplicity, we suppress the dependence on $n$.
As discussed above, the m out of n bootstrap of (4.1) is 'backward' in a certain sense. The m out of n 'forward' bootstrap CI is defined by

$$(-\infty, \hat g_{Un}] = \bigl(-\infty,\; \hat g_{Un} - n^{-1/2} c^{*}_{UmB}(\alpha)\bigr]. \quad (4.2)$$

This m out of n 'forward' bootstrap CI has coverage probabilities for covering $g(\theta_0)$ and $(-\infty, g_{U0}]$ given by the first and second expressions in (3.9), respectively, with $c^{*}_{Um}(\alpha)$ in place of $c^{*}_{Un}(\alpha)$. We show in the next sections that the m out of n 'forward' bootstrap interval $(-\infty, \hat g_{Un}]$ in (4.2) does not necessarily satisfy $AS_{tv} \ge 1-\alpha$ or $AS_{int} \ge 1-\alpha$ for any value of $m/n$, including $m/n = 0$.
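A sketch (ours, not the paper's) of the m out of n 'forward' construction (4.2), again using the illustrative Model I estimator $\hat g_{Un} = \min\{\bar X_{1n}, \bar X_{2n}\}$. The recentred bootstrap quantile is computed on the $m^{1/2}$ scale while the endpoint is rescaled by $n^{-1/2}$, as in (4.2); the sample sizes and parameter values are assumptions for the example.

```python
import numpy as np

def m_out_of_n_forward_upper(X, m, alpha=0.05, B=2000, seed=0):
    """m out of n 'forward' bootstrap upper endpoint, eq. (4.2):
    resamples of size m, alpha-quantile of m^{1/2}(g*_Um - g_Un_hat),
    endpoint rescaled by n^{-1/2}."""
    rng = np.random.default_rng(seed)
    n = len(X)
    g_hat = X.mean(axis=0).min()                   # g_Un_hat
    idx = rng.integers(0, n, size=(B, m))          # bootstrap samples of size m
    g_star = X[idx].mean(axis=1).min(axis=1)
    c_star = np.quantile(np.sqrt(m) * (g_star - g_hat), alpha)  # c*_UmB(alpha)
    return g_hat - c_star / np.sqrt(n)             # (4.2)

rng = np.random.default_rng(2)
X = rng.normal([0.0, 0.2], 1.0, size=(200, 2))
u_full = m_out_of_n_forward_upper(X, m=200)        # m = n: ordinary bootstrap
u_sub = m_out_of_n_forward_upper(X, m=20)          # m much smaller than n
```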
5. LINEAR INEQUALITIES I

5.1. Model and Estimators

Next, we consider a special case of the moment inequality model discussed above. The model considered is one in which a moment inequality is potentially redundant but not irrelevant. We choose this particular model for the reasons given in the Introduction and for its analytic tractability: we can derive the finite-sample distribution of the bootstrap and m out of n bootstrap statistics that are considered. If neither the bootstrap nor the m out of n bootstrap works in this simple model, then they are not procedures that work in general. Let

$$X_i = \begin{pmatrix} X_{1i} \\ X_{2i} \end{pmatrix} \sim N(\mu, \Sigma), \quad \text{where } \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \text{ and } \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \text{ for some } \rho \in [-1, 1]. \quad (5.1)$$

For simplicity, we assume that $\rho$ is known. The moment functions, sample moments and population moment inequalities are

$$m(X_i, \theta) = \begin{pmatrix} X_{1i} - \theta \\ X_{2i} - \theta \end{pmatrix}, \quad m_n(X_i, \theta) = \begin{pmatrix} \bar X_{1n} - \theta \\ \bar X_{2n} - \theta \end{pmatrix} \quad \text{and}$$
$$E_{\theta_0} m(X_i, \theta_0) = \begin{pmatrix} \mu_1 - \theta_0 \\ \mu_2 - \theta_0 \end{pmatrix} \ge 0, \quad \text{where } \bar X_{sn} = n^{-1} \sum_{i=1}^{n} X_{si} \text{ for } s = 1, 2. \quad (5.2)$$

In consequence, $\theta_0 \le \min\{\mu_1, \mu_2\}$ and the identified set $\Theta_0$ equals $(-\infty, \min\{\mu_1, \mu_2\}]$. The function $g(\theta)$ of interest is just the identity function $g(\theta) = \theta$. Hence, $(g_{L0}, g_{U0}] = (-\infty, \min\{\mu_1, \mu_2\}]$. In the present case, $\hat\Theta_n$, $\hat g_{Ln}$ and $\hat g_{Un}$ can be determined analytically. It is easy to see that for the distance functions $d(x) = x'x$ and $d(x) = |x_1| + |x_2|$ (and many other distance functions symmetric in $x_1$ and $x_2$), we have

$$\hat\Theta_n = \bigl(-\infty, \min\{\bar X_{1n}, \bar X_{2n}\}\bigr], \quad \hat g_{Ln} = -\infty \quad \text{and} \quad \hat g_{Un} = \min\{\bar X_{1n}, \bar X_{2n}\}. \quad (5.3)$$

Without loss of generality, we assume that $\mu_1 \le \mu_2$ (although this is not known to the investigator using the moment inequalities). Hence, $\min\{\mu_1, \mu_2\} = \mu_1$.

Notice that $\hat g_{Un}$ is a non-differentiable function of the sample mean vector $(\bar X_{1n}, \bar X_{2n})$. In consequence, the asymptotic distribution of $\hat g_{Un}$ turns out to be a discontinuous function of the parameters $(\mu_1, \mu_2)$. In particular, the asymptotic distribution is different between the cases
where $\mu_1 < \mu_2$ and $\mu_1 = \mu_2$. Furthermore, the asymptotic distribution is different again if the true difference $\mu_2 - \mu_1 = h_D/n^{1/2}$ for some positive constant $h_D$. Because of this, the bootstrap does not perform as desired. For $s \in \mathbb{R}$, define

$$U(s) = \min\{Z_1, Z_2 + s\}, \quad \text{where } Z = \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim N(0, \Sigma). \quad (5.4)$$

Combining (5.3) and (5.4) gives

$$n^{1/2}(\hat g_{Un} - g_{U0}) = \min\bigl\{n^{1/2}(\bar X_{1n} - \mu_1),\; n^{1/2}(\bar X_{2n} - \mu_1)\bigr\} =_d U\bigl(n^{1/2}(\mu_2 - \mu_1)\bigr) \quad \text{and}$$
$$n^{1/2}(\hat g_{Un} - \theta_0) =_d U\bigl(n^{1/2}(\mu_2 - \mu_1)\bigr) + n^{1/2}(\mu_1 - \theta_0), \quad (5.5)$$

where '$=_d$' denotes equality in distribution.

5.2. Bootstrap and m Out of n Bootstrap

We now introduce the m out of n bootstrap for the linear moment inequality model of (5.1)–(5.2). The (standard) bootstrap is obtained as a special case by taking $m = n$. We consider the parametric bootstrap for which a bootstrap sample $\{X^*_i : i \le m\}$ consists of i.i.d. $N(\bar X_n, \Sigma)$ random variables. For specificity, we take

$$X^*_i = Z^{**}_i + \bar X_n, \quad \text{where } Z^{**}_i \sim N(0, \Sigma) \text{ for } i \le m \quad (5.6)$$

and $\{Z^{**}_i : i \le m\}$ are i.i.d. and independent of $\{X_i : i \le n\}$. In the present model, the parametric bootstrap is the same as the 'asymptotic normal' bootstrap referred to above. The issues that arise below with the parametric bootstrap are the same as those that arise with the nonparametric bootstrap. The parametric bootstrap, however, has the advantage of making these issues as clear as possible. We write $\bar X^*_{sm} = \bar Z^{**}_{sm} + \bar X_{sn}$ for $s = 1, 2$, where $\bar Z^{**}_{sm} = m^{-1} \sum_{i=1}^{m} Z^{**}_{si}$ and $Z^{**}_i = (Z^{**}_{1i}, Z^{**}_{2i})'$.

Using (5.3) and (5.6), the bootstrap estimator $\hat g^*_{Um}$ is defined by

$$\hat g^*_{Um} = \min\bigl\{\bar Z^{**}_{1m} + \bar X_{1n},\; \bar Z^{**}_{2m} + \bar X_{2n}\bigr\}. \quad (5.7)$$

For $s \in \mathbb{R}$, define

$$U^*(m/n, s) = \min\bigl\{Z^*_1 + (m/n)^{1/2} Z_1,\; Z^*_2 + (m/n)^{1/2}(Z_2 + s)\bigr\}, \quad \text{where } Z^* = \begin{pmatrix} Z^*_1 \\ Z^*_2 \end{pmatrix} \sim N(0, \Sigma), \quad (5.8)$$

$Z$ is as defined in (5.4), and $Z^*$ and $Z$ are independent. Using (5.7) and (5.8), we have

$$m^{1/2}\bigl(\hat g^*_{Um} - g_{U0}\bigr) = \min\bigl\{m^{1/2}\bar Z^{**}_{1m} + (m/n)^{1/2}\, n^{1/2}(\bar X_{1n} - \mu_1),$$
$$\qquad\qquad m^{1/2}\bar Z^{**}_{2m} + (m/n)^{1/2}\bigl[n^{1/2}(\bar X_{2n} - \mu_2) + n^{1/2}(\mu_2 - \mu_1)\bigr]\bigr\}$$
$$=_d U^*\bigl(m/n,\; n^{1/2}(\mu_2 - \mu_1)\bigr). \quad (5.9)$$
Combining (5.5) and (5.9) gives

$$m^{1/2}\bigl(\hat g^*_{Um} - \hat g_{Un}\bigr) = m^{1/2}\bigl(\hat g^*_{Um} - g_{U0}\bigr) - m^{1/2}\bigl(\hat g_{Un} - g_{U0}\bigr)$$
$$=_d U^*\bigl(m/n,\; n^{1/2}(\mu_2 - \mu_1)\bigr) - (m/n)^{1/2}\, U\bigl(n^{1/2}(\mu_2 - \mu_1)\bigr). \quad (5.10)$$
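The limit random variables in (5.4) and (5.8) are straightforward to simulate. The sketch below (ours) checks numerically that $U(s) = \min\{Z_1, Z_2 + s\}$ has a negative mean and is not standard normal when $s$ is small, while for large $s$ the second branch almost never binds and $U(s)$ behaves like $Z_1 \sim N(0, 1)$; this discontinuity is the source of the bootstrap failure described above. The value $\rho = -0.5$ and the simulation sizes are assumptions for the example.

```python
import numpy as np

# Simulate U(s) = min{Z1, Z2 + s} from (5.4) for Z bivariate normal with
# unit variances and correlation rho.
rng = np.random.default_rng(0)
rho, R = -0.5, 200_000
cov = np.array([[1.0, rho], [rho, 1.0]])
Z = rng.multivariate_normal([0.0, 0.0], cov, size=R)

def U(s):
    return np.minimum(Z[:, 0], Z[:, 1] + s)

# For s = 0, E[min{Z1, Z2}] = -sqrt((1 - rho)/pi) < 0, so U(0) is not
# standard normal; for large s, U(s) reduces to Z1 with mean zero.
mean_small = U(0.0).mean()     # roughly -0.69 for rho = -0.5
mean_large = U(10.0).mean()    # roughly 0
```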
5.3. Coverage Probabilities and Size

We now use the results of the previous subsection to provide expressions for the coverage probabilities of the m out of n 'backward' and 'forward' bootstrap CIs. As above, results for the standard bootstrap are obtained by taking $m = n$. For notational convenience, let

$$h_1 = n^{1/2}(\mu_1 - \theta_0), \quad h_D = n^{1/2}(\mu_2 - \mu_1) \quad \text{and} \quad h = (h_1, h_D)'. \quad (5.11)$$

Note that $h_1, h_D \ge 0$ and $h_D$ denotes the scaled difference between $\mu_1$ and $\mu_2$. Let $c^*_U(m/n, s, \alpha)$ be the conditional $\alpha$ quantile of $U^*(m/n, s) - (m/n)^{1/2} U(s)$ given $Z$. Using (4.1), (5.5) and (5.10), the probability that the m out of n 'backward' bootstrap CI covers the true value $\theta_0$, denoted by $CP_{tv}(m/n, h)$, is

$$CP_{tv}(m/n, h) = P_{\theta_0,\mu}\bigl(\theta_0 \le \hat g_{Un} + n^{-1/2} c^*_{Um}(1-\alpha)\bigr)$$
$$= P_{\theta_0,\mu}\bigl(n^{1/2}(\hat g_{Un} - \theta_0) \ge -c^*_{Um}(1-\alpha)\bigr)$$
$$= P\bigl(U(h_D) + h_1 \ge -c^*_U(m/n, h_D, 1-\alpha)\bigr), \quad (5.12)$$

where $c^*_{Um}(1-\alpha)$ is the $1-\alpha$ quantile of $m^{1/2}(\hat g^*_{Um} - \hat g_{Un})$ conditional on $\{X_i : i \le n\}$. Note that $CP_{tv}(m/n, h)$ only depends on $h \in \mathbb{R}^2_+$, $\rho$ (the correlation coefficient between $X_{1i}$ and $X_{2i}$) and $m/n$.

The finite-sample coverage probability $CP_{tv}(m/n, h)$ is exactly the same as the ACP that arises when (i) $m/n$ is fixed for all $n$, (ii) $n^{1/2}(\mu_1 - \theta_0) \to h_1$ and (iii) $n^{1/2}(\mu_2 - \mu_1) \to h_D$ as $n \to \infty$ for some fixed $h_1, h_D \in [0, \infty]$. Hence, the results given here are both exact and asymptotic.

If the true value $\theta_0$ is on the edge of the identified interval (i.e., $\theta_0 = g_{U0}$ and $h_1 = 0$) and the difference between $\mu_1$ and $\mu_2$ is 'arbitrarily large' (i.e., $h_D = \infty$), then $U(h_D) = Z_1$, $U^*(m/n, h_D) = Z^*_1 + (m/n)^{1/2} Z_1$, $U^*(m/n, h_D) - (m/n)^{1/2} U(h_D) = Z^*_1$, $c^*_U(m/n, h_D, 1-\alpha) = z_{1-\alpha}$ and $CP_{tv}(m/n, h) = P(Z_1 \ge -z_{1-\alpha}) = 1-\alpha$, as desired. However, if the difference between $\mu_1$ and $\mu_2$ is not 'arbitrarily large', then this desired result does not hold, as shown in the next subsection.

The finite-sample size of an m out of n 'backward' bootstrap CI for the true value $\theta_0$ is

$$\mathrm{Size}_{tv}(m/n) = \inf_{h \in \mathbb{R}^2_+} CP_{tv}(m/n, h)$$
$$= \inf_{h_1, h_D \in \mathbb{R}_+} P\bigl(U(h_D) + h_1 \ge -c^*_U(m/n, h_D, 1-\alpha)\bigr)$$
$$= \inf_{h_D \in \mathbb{R}_+} P\bigl(U(h_D) \ge -c^*_U(m/n, h_D, 1-\alpha)\bigr), \quad (5.13)$$

which depends on $m$ and $n$ only through $m/n$. Provided $r = \lim_{n\to\infty} m/n$ $(\in [0, 1])$ exists, the asymptotic size $AS_{tv}$ of the CI is given by (5.13) with $r$ in place of $m/n$. Hence, the size results given here also are both exact and asymptotic. Note that for the bootstrap we have $r = 1$, and for the usual choices of $m$ for the m out of n bootstrap we have $r = 0$.
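The coverage probability (5.12) can be approximated by a nested Monte Carlo, sketched below (ours, with far fewer repetitions than the 40,000 used for the paper's tables): outer draws of $Z$ give $U(h_D)$, and inner draws of $Z^*$ give the conditional quantile $c^*_U(m/n, h_D, 1-\alpha)$ given $Z$. The repetition counts are assumptions chosen for speed.

```python
import numpy as np

def cp_tv(r, h1, hD, rho, alpha=0.05, R=1000, B=1000, seed=0):
    """Monte Carlo approximation of CP_tv(m/n, h) in (5.12) with r = m/n:
    cover iff U(hD) + h1 >= -c*_U(r, hD, 1 - alpha), where the critical
    value is the conditional 1-alpha quantile of U*(r, hD) - r^{1/2} U(hD)
    given Z, per (5.8) and (5.10)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    Zout = rng.multivariate_normal([0.0, 0.0], cov, size=R)     # outer draws of Z
    cover = 0
    for z1, z2 in Zout:
        U = min(z1, z2 + hD)
        Zs = rng.multivariate_normal([0.0, 0.0], cov, size=B)   # inner draws of Z*
        Us = np.minimum(Zs[:, 0] + np.sqrt(r) * z1,
                        Zs[:, 1] + np.sqrt(r) * (z2 + hD))      # U*(r, hD)
        c = np.quantile(Us - np.sqrt(r) * U, 1 - alpha)         # c*_U(r, hD, 1-alpha)
        cover += (U + h1 >= -c)
    return cover / R

# Standard 'backward' bootstrap (r = 1) at h1 = hD = 0 and rho = -0.99;
# the corresponding entry of Table 1(a) is far below the nominal .95.
cp = cp_tv(1.0, 0.0, 0.0, -0.99)
```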
For the 'forward' bootstrap CI, the coverage probability and size are the same as in (5.12) and (5.13), respectively, but with $c^*_{Um}(1-\alpha)$ and $-c^*_U(m/n, h_D, 1-\alpha)$ replaced by $-c^*_{Um}(\alpha)$ and $c^*_U(m/n, h_D, \alpha)$, respectively.

Next, suppose one wants a CI for the identified interval $(-\infty, g_{U0}]$, rather than the true value. Then, using (5.5) and (5.10), the probability that the CI covers the identified interval $(-\infty, g_{U0}]$, denoted by $CP_{int}(m/n, h)$, is

$$CP_{int}(m/n, h) = P_{\theta_0,\mu}\bigl(g_{U0} \le \hat g_{Un} + n^{-1/2} c^*_{Um}(1-\alpha)\bigr) = P\bigl(U(h_D) \ge -c^*_U(m/n, h_D, 1-\alpha)\bigr). \quad (5.14)$$

If the difference between $\mu_1$ and $\mu_2$ is 'arbitrarily large' (i.e., $h_D = \infty$), then $U(h_D) = Z_1$, $U^*(m/n, h_D) = Z^*_1 + (m/n)^{1/2} Z_1$, $U^*(m/n, h_D) - (m/n)^{1/2} U(h_D) = Z^*_1$, $c^*_U(m/n, h_D, 1-\alpha) = z_{1-\alpha}$ and $CP_{int}(m/n, h) = P(Z_1 \ge -z_{1-\alpha}) = 1-\alpha$, as desired. If the difference between $\mu_1$ and $\mu_2$ is not 'arbitrarily large', then the desired result for $CP_{int}(m/n, h)$ does not hold. The size of the m out of n 'backward' bootstrap CI for the identified interval is given by

$$\mathrm{Size}_{int}(m/n) = \inf_{h_D \in \mathbb{R}_+} P\bigl(U(h_D) \ge -c^*_U(m/n, h_D, 1-\alpha)\bigr), \quad (5.15)$$

which is the same as that for the corresponding CI for the true value. For the 'forward' bootstrap CI for the identified interval $(-\infty, g_{U0}]$, the coverage probability and size are the same as in (5.14) and (5.15), respectively, but with $c^*_{Um}(1-\alpha)$ and $-c^*_U(m/n, h_D, 1-\alpha)$ replaced by $-c^*_{Um}(\alpha)$ and $c^*_U(m/n, h_D, \alpha)$, respectively.

5.4. Coverage Probability Simulations for Bootstrap CIs

Table 1 provides values of $CP_{tv}(1, h)$ and $CP_{int}(1, h)$ based on the formulae in (5.12) and (5.14), respectively, computed by simulation for the 'backward' and 'forward' bootstrap CIs (for which $m = n$). Table 1 provides results for $1-\alpha = .95$ and for a variety of values of $h \in \mathbb{R}^2_+$ and $\rho \in [-1, 1]$. In particular, we consider the cases where $h_1 = 0$, $h_D = 0, .125, .25, .5, 1.0, 2.0$ and $\rho = -1.0, -.99, -.95, -.5, 0, .5, .95, .99, 1.0$. Given that $h_1 = 0$, we have $CP_{tv}(1, h) = CP_{int}(1, h)$. Forty thousand bootstrap repetitions are used here (and in all of the tables) to compute the bootstrap critical value for each simulation repetition. Forty thousand simulation repetitions are used here (and in all of the tables) to compute each coverage probability.

Table 1(a) shows that the coverage probabilities for the 'backward' bootstrap are much less than the nominal level .95 when $\rho \le .5$ and $h_D \le .5$. For a given value of $\rho$, the exact (and asymptotic) confidence size of the bootstrap CI is less than or equal to the minimum value in each column. For example, for $\rho = -1.0$, the confidence size is .00 rather than .95. When $\rho = -.99$, the confidence size is .13 rather than .95. Clearly, the 'backward' bootstrap fails dramatically to deliver a CI with asymptotic size equal to its nominal size.

Table 1(b) shows that the coverage probabilities for the 'forward' bootstrap are less than the nominal level when $\rho \le .5$ and $h_D \le .5$. But the differences are much smaller than with the 'backward' bootstrap.
The results of the table indicate that the finite-sample (and asymptotic) confidence size (over all $\rho \in [-1.0, 1.0]$) of the 'forward' bootstrap is .90 or less, rather than .95.

The pointwise ACPs of the 'backward' and 'forward' bootstrap CIs when $\theta_0 = \mu_1 = \mu_2$ are given by the values in the first rows of Tables 1(a) and 1(b), respectively (which correspond to $h_D = 0$). Table 1 shows that the pointwise ACPs of the bootstrap CIs are less than the nominal
Table 1. Linear moment inequalities I: bootstrap coverage probabilities of nominal 95% CIs when $h_1 = 0$.

(a) 'Backward' bootstrap CI. $m/n = 1$

              ρ = −1.00  −0.99  −0.95  −0.50   0.00   0.50   0.95   0.99   1.00
  h_D = 0.00      .00    .13    .36    .70    .80    .87    .93    .94    .95
        .125      .03    .36    .50    .74    .82    .88    .94    .95    .95
        .250      .76    .63    .63    .77    .84    .89    .94    .95    .95
        .500      .90    .88    .82    .83    .87    .91    .95    .95    .95
       1.00       .94    .94    .93    .90    .91    .93    .95    .95    .95
       2.00       .95    .95    .95    .94    .94    .95    .95    .95    .95

(b) 'Forward' bootstrap CI. $m/n = 1$

              ρ = −1.00  −0.99  −0.95  −0.50   0.00   0.50   0.95   0.99   1.00
  h_D = 0.00      .90    .90    .90    .90    .91    .92    .94    .95    .95
        .125      .91    .91    .91    .91    .92    .93    .95    .95    .95
        .250      .92    .92    .92    .92    .93    .94    .95    .95    .95
        .500      .94    .94    .94    .93    .94    .94    .95    .95    .95
       1.00       .95    .95    .95    .95    .95    .95    .95    .95    .95
       2.00       .95    .95    .95    .95    .95    .95    .95    .95    .95
.95 coverage probability for many values of $\rho$. In some cases, they are much below .95. Hence, these bootstrap CIs do not yield correct pointwise ACP.

To conclude, Table 1 illustrates that neither the 'backward' nor the 'forward' bootstrap yields CIs with finite-sample or asymptotic confidence size equal to the nominal level. In particular, the bootstrap CIs are not asymptotically valid in a pointwise or uniform sense.

5.5. Coverage Probability Simulations for m out of n Bootstrap CIs

Table 2 provides values of $CP_{tv}(m/n, h)$ and $CP_{int}(m/n, h)$ computed by simulation for the m out of n 'backward' and 'forward' bootstrap CIs for $m/n = 0, .01, .05, .1, .5$ and for the same confidence level and parameters as in Table 1. Given that $h_1 = 0$, we have $CP_{tv}(m/n, h) = CP_{int}(m/n, h)$.

For each value of $m/n$, Table 2(a) (i.e., Table 2(i)(a) through 2(v)(a)) shows that the coverage probabilities for the m out of n 'backward' bootstrap are much lower than the nominal level .95 when $\rho \le .5$ for all values of $h_D$. For a given value of $\rho$, the exact (and asymptotic) confidence size of the m out of n bootstrap CI is less than or equal to the minimum value in each column and
Table 2. Linear moment inequalities I: m out of n bootstrap coverage probabilities of nominal 95% CIs when $h_1 = 0$.

(i) $m/n = 0$

(a) 'Backward' bootstrap CI. $m/n = 0$

              ρ = −1.00  −0.99  −0.95  −0.50   0.00   0.50   0.95   0.99   1.00
  h_D = 0.00      .00    .01    .05    .37    .60    .77    .92    .94    .95
        .125      .00    .03    .08    .40    .63    .79    .93    .94    .95
        .250      .05    .06    .11    .43    .65    .81    .93    .94    .95
        .500      .14    .15    .19    .49    .69    .83    .93    .94    .95
       1.00       .30    .31    .34    .58    .74    .86    .94    .94    .95
       2.00       .45    .45    .47    .65    .77    .86    .94    .94    .95

(b) 'Forward' bootstrap CI. $m/n = 0$

              ρ = −1.00  −0.99  −0.95  −0.50   0.00   0.50   0.95   0.99   1.00
  h_D = 0.00      .95    .95    .95    .95    .95    .95    .95    .95    .95
        .125      .96    .96    .96    .96    .96    .96    .96    .96    .95
        .250      .96    .96    .96    .96    .96    .96    .96    .96    .95
        .500      .97    .97    .97    .97    .97    .97    .96    .96    .95
       1.00       .97    .97    .97    .97    .98    .97    .96    .96    .95
       2.00       .98    .98    .98    .97    .98    .97    .96    .96    .95
each table. Hence, when $\rho = -1.0$ and for all values of $m/n$, the confidence size is .00 rather than .95. Clearly, the m out of n 'backward' bootstrap fails dramatically to deliver a CI with asymptotic size equal to its nominal size.

For all $m/n \in [.01, .5]$, for certain values of $h_D$ and for all $\rho \le .5$, Table 2(b) (i.e., Table 2(i)(b) through 2(v)(b)) shows that the m out of n 'forward' bootstrap has coverage probability that is less than the nominal level .95. But the differences are much smaller than with the m out of n 'backward' bootstrap. More specifically, the finite-sample confidence size of the nominal .95 m out of n 'forward' bootstrap is less than or equal to .93 for $m/n = .01$, .91 for $m/n = .05$ and .10, and .90 for $m/n = .5$. Table 2(i)(b) shows that the m out of n 'forward' bootstrap has correct asymptotic size when $m/n \to 0$ as $n \to \infty$ (i.e., $AS_{tv} = AS_{int} = .95$).

The pointwise ACPs of the m out of n 'backward' and 'forward' bootstrap CIs when $\theta_0 = \mu_1 = \mu_2$ are given by the values in the first rows of Tables 2(i)(a) and 2(i)(b), respectively (which correspond to $h_D = 0$ and $m/n = 0$). Table 2(i)(a) shows that the pointwise ACPs of the m out of n 'backward' bootstrap CIs are less than the nominal .95 coverage probability for many values of $\rho$. This is because of the asymmetry of the distribution of $m^{1/2}(\hat g^*_{Um} - \hat g_{Un})$ ($= U^*(0, 0)$ here) in (5.10). In some cases, the ACPs are much below .95. Hence, the m out of n 'backward' bootstrap
Table 2. Continued.

(ii) $m/n = 0.01$

(a) 'Backward' bootstrap CI. $m/n = 0.01$

              ρ = −1.00  −0.99  −0.95  −0.50   0.00   0.50   0.95   0.99   1.00
  h_D = 0.00      .00    .01    .06    .40    .63    .79    .92    .94    .95
        .125      .00    .03    .09    .43    .66    .81    .93    .94    .95
        .250      .06    .07    .12    .47    .68    .83    .93    .94    .95
        .500      .16    .17    .21    .53    .72    .85    .94    .95    .95
       1.00       .33    .34    .37    .62    .77    .87    .94    .95    .95
       2.00       .50    .50    .52    .70    .81    .89    .94    .95    .95

(b) 'Forward' bootstrap CI. $m/n = 0.01$

              ρ = −1.00  −0.99  −0.95  −0.50   0.00   0.50   0.95   0.99   1.00
  h_D = 0.00      .93    .93    .93    .93    .94    .94    .95    .95    .95
        .125      .94    .94    .94    .94    .95    .95    .95    .95    .95
        .250      .95    .95    .95    .95    .95    .95    .96    .95    .95
        .500      .96    .96    .95    .96    .96    .96    .96    .95    .95
       1.00       .96    .96    .96    .96    .97    .97    .96    .95    .95
       2.00       .96    .96    .96    .96    .97    .96    .95    .95    .95
CIs do not yield correct pointwise ACP. Table 2(i)(b) shows that the pointwise ACPs of the m out of n 'forward' bootstrap are greater than or equal to .95, as desired.

To conclude, Table 2 illustrates that in Model I the m out of n 'backward' bootstrap yields CIs with finite-sample and asymptotic confidence size substantially less than their nominal level. In particular, the m out of n 'backward' bootstrap CI is not asymptotically valid in a pointwise or uniform sense. On the other hand, the m out of n 'forward' bootstrap is asymptotically valid in pointwise and uniform senses when $\lim_{n\to\infty} m/n = 0$. In finite samples, the m out of n 'forward' bootstrap yields CIs with confidence sizes that are somewhat lower than their nominal level.
6. LINEAR INEQUALITIES II

6.1. Model and Estimators

In this section, we consider a model with $X_i$ defined as in (5.1), but with different moment inequalities. The main purpose of this section is to see the quantitative difference between the finite-sample/asymptotic sizes and the nominal sizes of the bootstrap CIs in a model scenario in which 'reversals' of moment inequalities may occur. In particular, we are interested in whether
Table 2. Continued.

(iii) $m/n = 0.05$

(a) 'Backward' bootstrap CI. $m/n = 0.05$

              ρ = −1.00  −0.99  −0.95  −0.50   0.00   0.50   0.95   0.99   1.00
  h_D = 0.00      .00    .01    .06    .43    .66    .80    .92    .94    .95
        .125      .00    .03    .10    .47    .69    .82    .93    .94    .95
        .250      .07    .08    .15    .51    .71    .84    .94    .95    .95
        .500      .19    .19    .25    .57    .75    .86    .94    .95    .95
       1.00       .38    .39    .43    .67    .80    .89    .95    .95    .95
       2.00       .57    .57    .60    .76    .85    .91    .95    .95    .95

(b) 'Forward' bootstrap CI. $m/n = 0.05$

              ρ = −1.00  −0.99  −0.95  −0.50   0.00   0.50   0.95   0.99   1.00
  h_D = 0.00      .91    .91    .91    .92    .93    .94    .95    .95    .95
        .125      .92    .92    .92    .93    .94    .94    .95    .95    .95
        .250      .93    .93    .93    .94    .94    .95    .96    .95    .95
        .500      .94    .94    .94    .95    .95    .96    .96    .95    .95
       1.00       .95    .95    .95    .95    .96    .96    .95    .95    .95
       2.00       .95    .95    .95    .95    .96    .96    .95    .95    .95
the m out of n 'forward' bootstrap yields CIs whose confidence size is close to the nominal size (because these CIs work fairly well in Model I). The moment functions, sample moments and population moment inequalities are

$$m(X_i, \theta) = \begin{pmatrix} \theta - X_{1i} \\ X_{2i} - \theta \end{pmatrix}, \quad m_n(X_i, \theta) = \begin{pmatrix} \theta - \bar X_{1n} \\ \bar X_{2n} - \theta \end{pmatrix} \quad \text{and}$$
$$E_{\theta_0} m(X_i, \theta_0) = \begin{pmatrix} \theta_0 - \mu_1 \\ \mu_2 - \theta_0 \end{pmatrix} \ge 0, \quad \text{where } \bar X_{sn} = n^{-1} \sum_{i=1}^{n} X_{si} \text{ for } s = 1, 2. \quad (6.1)$$

In consequence, $\mu_1 \le \theta_0 \le \mu_2$ and the identified set $\Theta_0$ equals $[\mu_1, \mu_2]$. Note that the model considered here differs from the 'interval-censored variable' model considered in Imbens and Manski (2004) because the latter assumes that $X_{1i} \le X_{2i}$ almost surely. In contrast, the model defined by (5.1) and (6.1) allows for sample 'reversals' of the moment conditions, which lead to no solution to the sample moment inequalities $m_n(X_i, \theta) \ge 0$ even though the population inequalities $E_{\theta_0} m(X_i, \theta_0) \ge 0$ hold. This is a common feature of more complicated moment inequality models. In the model considered here, a 'reversal' occurs whenever $\bar X_{1n} \ge \bar X_{2n}$.
S187
Invalidity of the bootstrap

Table 2. Continued. (iv) m/n = 0.1.
(a) ‘Backward’ bootstrap CI. m/n = 0.1

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00
 −1.00     .00    .00    .07    .21    .43    .63
 −0.99     .01    .04    .09    .21    .43    .64
 −0.95     .08    .11    .16    .28    .47    .66
 −0.50     .47    .51    .54    .61    .71    .80
  0.00     .68    .71    .73    .77    .83    .88
  0.50     .81    .83    .85    .87    .90    .92
  0.95     .92    .93    .94    .94    .95    .95
  0.99     .94    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95
(b) ‘Forward’ bootstrap CI. m/n = 0.1

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00
 −1.00     .91    .92    .93    .94    .95    .95
 −0.99     .91    .92    .93    .94    .95    .95
 −0.95     .91    .92    .93    .94    .95    .95
 −0.50     .91    .92    .93    .94    .95    .95
  0.00     .92    .93    .94    .95    .95    .95
  0.50     .93    .94    .95    .95    .96    .95
  0.95     .95    .95    .95    .95    .95    .95
  0.99     .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95
The function g(θ) of interest is just the identity function g(θ) = θ. Hence, [gL0, gU0] = [μ1, μ2]. In the present case, $\hat\Theta_n$, $\hat g_{Ln}$ and $\hat g_{Un}$ can be determined analytically. It is easy to see that if $\bar X_{1n} \leq \bar X_{2n}$, then
$$\hat\Theta_n = [\bar X_{1n}, \bar X_{2n}], \quad \hat g_{Ln} = \bar X_{1n} \quad \text{and} \quad \hat g_{Un} = \bar X_{2n}. \tag{6.2}$$
On the other hand, provided d(x) is symmetric in its k = 2 components and nondecreasing in each component (as is true if d(x) = x′x or d(x) = |x1| + |x2|), it is easy to see that if $\bar X_{1n} \geq \bar X_{2n}$, then
$$\hat\Theta_n = \{(\bar X_{1n} + \bar X_{2n})/2\}, \quad \hat g_{Ln} = (\bar X_{1n} + \bar X_{2n})/2 \quad \text{and} \quad \hat g_{Un} = (\bar X_{1n} + \bar X_{2n})/2. \tag{6.3}$$
Notice that $\hat g_{Ln}$ and $\hat g_{Un}$ are non-differentiable functions of the sample mean vector $(\bar X_{1n}, \bar X_{2n})$. In consequence, the asymptotic distributions of $\hat g_{Ln}$ and $\hat g_{Un}$ turn out to be discontinuous functions of the parameters (μ1, μ2). In particular, the asymptotic distributions are different between the cases where μ1 < μ2 and μ1 = μ2. Furthermore, the asymptotic distribution is different again if the true difference μ2 − μ1 = hD/n^{1/2} for some positive constant hD. Because of this, it is shown below that the bootstrap and the m out of n bootstrap do not perform as desired.
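The case split in (6.2)–(6.3) reduces to a few lines of code. This is our own illustrative sketch (the function names are not from the paper):

```python
def endpoint_estimators(x1bar, x2bar):
    """Endpoint estimators per (6.2)-(6.3): the sample interval when
    x1bar <= x2bar; both endpoints collapse to the midpoint on a 'reversal'."""
    if x1bar <= x2bar:
        return x1bar, x2bar           # (6.2): gLn = X1bar, gUn = X2bar
    mid = (x1bar + x2bar) / 2.0       # (6.3): the midpoint
    return mid, mid

def upper_endpoint(x1bar, x2bar):
    """One-line equivalent for the upper endpoint, as in (6.4) below."""
    return max(x2bar, (x1bar + x2bar) / 2.0)
```

The kink at x1bar = x2bar is exactly the non-differentiability that produces the discontinuous asymptotic distributions noted above.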
Table 2. Continued. (v) m/n = 0.5.
(a) ‘Backward’ bootstrap CI. m/n = 0.5

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00
 −1.00     .00    .00    .17    .45    .77    .93
 −0.99     .02    .09    .20    .46    .77    .93
 −0.95     .15    .23    .32    .52    .77    .92
 −0.50     .61    .65    .69    .75    .84    .92
  0.00     .76    .78    .80    .84    .89    .93
  0.50     .85    .87    .88    .90    .93    .95
  0.95     .93    .94    .94    .95    .95    .95
  0.99     .94    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95
(b) ‘Forward’ bootstrap CI. m/n = 0.5

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00
 −1.00     .90    .91    .92    .94    .95    .95
 −0.99     .90    .91    .92    .94    .95    .95
 −0.95     .90    .91    .92    .94    .95    .95
 −0.50     .90    .91    .92    .93    .95    .95
  0.00     .91    .92    .93    .94    .95    .95
  0.50     .92    .93    .94    .95    .95    .95
  0.95     .95    .95    .95    .95    .95    .95
  0.99     .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95
Because we only consider one-sided CIs here, we focus on $\hat g_{Un}$ from now on. Combining (6.2) and (6.3) gives
$$\hat g_{Un} = \max\{\bar X_{2n}, (\bar X_{1n} + \bar X_{2n})/2\}. \tag{6.4}$$
For s ∈ R, define
$$U(s) = \max\{Z_2, (Z_1 + Z_2 - s)/2\}, \quad \text{where } Z = \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim N(0, \Sigma). \tag{6.5}$$
Combining (6.4) and (6.5) gives
$$n^{1/2}(\hat g_{Un} - g_{U0}) = \max\bigl\{n^{1/2}(\bar X_{2n} - \mu_2),\; n^{1/2}(\bar X_{1n} + \bar X_{2n} - 2\mu_2)/2\bigr\} =_d U(n^{1/2}(\mu_2 - \mu_1)). \tag{6.6}$$
In turn, (6.6) yields
$$n^{1/2}(\hat g_{Un} - \theta_0) =_d U(n^{1/2}(\mu_2 - \mu_1)) + n^{1/2}(\mu_2 - \theta_0). \tag{6.7}$$
6.2. Bootstrap and m Out of n Bootstrap

We now consider the m out of n bootstrap for the linear moment inequality model of (5.1) and (6.1). As above, the (standard) bootstrap is obtained by taking m = n. The bootstrap sample {X*_i : i ≤ m} is defined exactly as in (5.6). Using (5.6) and (6.4), the bootstrap estimator $\hat g^*_{Um}$ satisfies
$$\hat g^*_{Um} = \max\bigl\{\bar Z^{**}_{2m} + \bar X_{2n},\; \bigl(\bar Z^{**}_{1m} + \bar Z^{**}_{2m} + \bar X_{1n} + \bar X_{2n}\bigr)/2\bigr\}. \tag{6.8}$$
For s ∈ R, define
$$U^*(m/n, s) = \max\bigl\{Z^*_2 + (m/n)^{1/2} Z_2,\; \bigl(Z^*_1 + Z^*_2 + (m/n)^{1/2}(Z_1 + Z_2 - s)\bigr)/2\bigr\}, \quad \text{where } Z^* = \begin{pmatrix} Z^*_1 \\ Z^*_2 \end{pmatrix} \sim N(0, \Sigma), \tag{6.9}$$
Z is as defined in (6.5), and Z* and Z are independent. Using (6.8) and (6.9), we have
$$m^{1/2}(\hat g^*_{Um} - g_{U0}) = \max\bigl\{m^{1/2}\bar Z^{**}_{2m} + (m/n)^{1/2} n^{1/2}(\bar X_{2n} - \mu_2),\; \bigl(m^{1/2}\bar Z^{**}_{1m} + m^{1/2}\bar Z^{**}_{2m} + (m/n)^{1/2} n^{1/2}(\bar X_{1n} + \bar X_{2n} - 2\mu_2)\bigr)/2\bigr\} =_d U^*(m/n, n^{1/2}(\mu_2 - \mu_1)). \tag{6.10}$$
Combining (6.6) and (6.10) gives
$$m^{1/2}(\hat g^*_{Um} - \hat g_{Un}) = m^{1/2}(\hat g^*_{Um} - g_{U0}) - m^{1/2}(\hat g_{Un} - g_{U0}) =_d U^*\bigl(m/n, n^{1/2}(\mu_2 - \mu_1)\bigr) - (m/n)^{1/2} U\bigl(n^{1/2}(\mu_2 - \mu_1)\bigr). \tag{6.11}$$
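The conditional quantile used in the next subsection can be approximated by simulating (6.9)–(6.11) for a fixed draw of Z. The sketch below is our own illustrative implementation (function name and defaults are ours); it checks the m/n = 0, ρ = −1 case, in which the bootstrap limit is max{Z*2, 0} and any quantile with α ≤ 1/2 is exactly zero:

```python
import math
import random

def cU_star(mn_ratio, s, alpha, z1, z2, rho, reps=4000, seed=3):
    """Approximate conditional alpha-quantile of U*(m/n, s) - (m/n)^{1/2} U(s)
    given Z = (z1, z2), following (6.9)-(6.11)."""
    rng = random.Random(seed)
    r = math.sqrt(mn_ratio)
    U = max(z2, (z1 + z2 - s) / 2.0)
    draws = []
    for _ in range(reps):
        zs1 = rng.gauss(0.0, 1.0)
        zs2 = rho * zs1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u_star = max(zs2 + r * z2, (zs1 + zs2 + r * (z1 + z2 - s)) / 2.0)
        draws.append(u_star - r * U)
    draws.sort()
    return draws[int(alpha * reps)]

# With m/n = 0 and rho = -1, U*(0, s) = max{Z2*, 0}, whose .05 quantile is 0.
c = cU_star(mn_ratio=0.0, s=2.0, alpha=0.05, z1=0.3, z2=-0.3, rho=-1.0)
```

Note that when m/n = 0 the quantile does not depend on the conditioning draw Z at all, which is why the m out of n bootstrap critical value cannot adapt to hD in this limit.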
6.3. Coverage Probabilities and Size

Next, we use the results of the previous subsection to provide expressions for the coverage probabilities of the m out of n bootstrap CIs considered above. For notational convenience, let
$$h_1 = n^{1/2}(\theta_0 - \mu_1), \quad h_2 = n^{1/2}(\mu_2 - \theta_0), \quad h = (h_1, h_2) \quad \text{and} \quad h_D = h_1 + h_2 = n^{1/2}(\mu_2 - \mu_1). \tag{6.12}$$
Note that h1, h2, hD ≥ 0 and hD denotes the scaled difference between μ1 and μ2. Let c*U(m/n, s, α) be the conditional α quantile of U*(m/n, s) − (m/n)^{1/2} U(s) given Z. Using (6.7) and (6.11), the coverage probability of the m out of n ‘backward’ bootstrap CI for the true value, denoted by CPtv(m/n, h), is
$$CP_{tv}(m/n, h) = P_{\theta_0,\mu}\bigl(g(\theta_0) \leq \hat g_{Un} + n^{-1/2} c^*_{Um}(1-\alpha)\bigr) = P_{\theta_0,\mu}\bigl(n^{1/2}(\hat g_{Un} - g(\theta_0)) \geq -c^*_{Um}(1-\alpha)\bigr) = P\bigl(U(h_D) + h_2 \geq -c^*_U(m/n, h_D, 1-\alpha)\bigr), \tag{6.13}$$
where c*Um(1 − α) denotes the 1 − α quantile of $m^{1/2}(\hat g^*_{Um} - \hat g_{Un})$ conditional on {Xi : i ≤ n}. This finite-sample probability is exactly the same as the asymptotic probability that arises when m/n is fixed for all n, n^{1/2}(θ0 − μ1) → h1 and n^{1/2}(μ2 − θ0) → h2 as n → ∞ for some fixed h1, h2 ∈ [0, ∞]. Hence, the results given here are both exact finite-sample and asymptotic. The probability in (6.13) depends only on hD, h2 ∈ R+, ρ and m/n. If θ0 is on the right edge of the identified interval (i.e. θ0 = gU0 and h2 = 0) and the interval is ‘arbitrarily wide’ (i.e. hD = ∞), then U(hD) = Z2, U*(m/n, hD) = Z*2 + (m/n)^{1/2} Z2, U*(m/n, hD) − (m/n)^{1/2} U(hD) = Z*2, c*U(m/n, hD, 1 − α) = z_{1−α} and CPtv(m/n, h) = P(Z2 ≥ −z_{1−α}) = 1 − α. However, if the identified interval is not ‘arbitrarily wide’, then this desired result does not hold. The finite-sample size of an m out of n ‘backward’ bootstrap CI for the true value θ0 is
$$\mathrm{Size}_{tv}(m/n) = \inf_{h \in R^2_+} CP_{tv}(m/n, h) = \inf_{h_2, h_D \in R_+} P\bigl(U(h_D) + h_2 \geq -c^*_U(m/n, h_D, 1-\alpha)\bigr) = \inf_{h_D \in R_+} P\bigl(U(h_D) \geq -c^*_U(m/n, h_D, 1-\alpha)\bigr). \tag{6.14}$$
As above, provided r = lim_{n→∞} m/n (∈ [0, 1]) exists, the asymptotic size AStv of the CI is given by (6.14) with r in place of m/n. Hence, the size results given here also are both exact and asymptotic. For the ‘forward’ bootstrap CI, the coverage probability and size are the same as in (6.13) and (6.14), respectively, but with c*Um(1 − α) and −c*U(m/n, hD, 1 − α) replaced by −c*Um(α) and c*U(m/n, hD, α), respectively.

Using (6.6) and (6.11), the probability that the m out of n ‘backward’ bootstrap CI covers the identified interval (−∞, gU0], denoted by CPint(m/n, h), is
$$CP_{int}(m/n, h) = P_{\theta_0,\mu}\bigl(g_{U0} \leq \hat g_{Un} + n^{-1/2} c^*_{Um}(1-\alpha)\bigr) = P\bigl(U(h_D) \geq -c^*_U(m/n, h_D, 1-\alpha)\bigr). \tag{6.15}$$
If the identified interval is ‘arbitrarily wide’ (i.e. hD = ∞), then U(hD) = Z2, U*(m/n, hD) = Z*2 + (m/n)^{1/2} Z2, U*(m/n, hD) − (m/n)^{1/2} U(hD) = Z*2, c*U(m/n, hD, 1 − α) = z_{1−α} and CPint(m/n, h) = P(Z2 ≥ −z_{1−α}) = 1 − α, as desired. If the identified interval is not ‘arbitrarily wide’, then the desired result for CPint(m/n, h) does not hold. The size of the m out of n ‘backward’ bootstrap for the identified interval is given by the same expression as in (5.15) (but with U(hD) and c*U(m/n, hD, 1 − α) defined as in this section). For the ‘forward’ bootstrap CI for the identified interval (−∞, gU0], the coverage probability and size are the same as in (6.15) and (5.15), respectively, but with c*Um(1 − α) and −c*U(m/n, hD, 1 − α) replaced by −c*Um(α) and c*U(m/n, hD, α), respectively.

6.4. Coverage Probability Simulations of Bootstrap CIs

Table 3 provides values of CPtv(1, h) and CPint(1, h) for ‘backward’ and ‘forward’ bootstrap CIs for the moment inequality model of (5.1) and (6.1). These results are analogous to those in Table 1, but apply to the second linear inequality model rather than the first. The parameters considered are h2 = 0 and h1 = hD = 0, .125, .25, .5, 1.0, 2.0, 4.0, 6.0, 8.0.
Given that h2 = 0, we have CPtv(1, h) = CPint(1, h).
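The coverage formula (6.15) can be evaluated by nested simulation: an outer loop draws Z (the limit of the sample moments) and an inner loop draws Z* to obtain the conditional quantile. The sketch below is our own illustrative implementation of this logic, not the authors' code; it checks one well-behaved cell, ρ = 0 with a wide identified interval, where coverage should be near .95.

```python
import math
import random

ALPHA = 0.05

def U_of(s, z1, z2):
    """U(s) = max{Z2, (Z1 + Z2 - s)/2} from (6.5)."""
    return max(z2, (z1 + z2 - s) / 2.0)

def coverage(mn_ratio, hD, rho, outer=300, inner=1000, seed=5):
    """Simulated CP_int(m/n, h) = P(U(hD) >= -c*_U(m/n, hD, 1 - alpha))."""
    rng = random.Random(seed)
    r = math.sqrt(mn_ratio)
    covered = 0
    for _ in range(outer):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        U = U_of(hD, z1, z2)
        # conditional (1 - alpha) quantile of U*(m/n, hD) - (m/n)^{1/2} U(hD) given Z
        draws = []
        for _ in range(inner):
            zs1 = rng.gauss(0.0, 1.0)
            zs2 = rho * zs1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
            u_star = max(zs2 + r * z2, (zs1 + zs2 + r * (z1 + z2 - hD)) / 2.0)
            draws.append(u_star - r * U)
        draws.sort()
        c_hi = draws[int((1.0 - ALPHA) * inner)]
        covered += U >= -c_hi
    return covered / outer

cp = coverage(mn_ratio=1.0, hD=8.0, rho=0.0)
```

Replacing −c_hi by the conditional α quantile gives the ‘forward’ version; scanning ρ and hD over a grid should reproduce the qualitative pattern of Tables 3 and 4, up to Monte Carlo error.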
Table 3. Linear moment inequalities II: bootstrap coverage probabilities of nominal 95% CIs when h2 = 0.
(a) ‘Backward’ bootstrap CI. m/n = 1

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00    .95    .95    .95    .95    .95    .95    .95    .95
 −0.99    1.00    .99    .98    .95    .95    .95    .95    .95    .95
 −0.95    1.00    .99    .99    .98    .95    .95    .95    .95    .95
 −0.50     .99    .99    .99    .98    .97    .96    .95    .95    .95
  0.00     .98    .98    .98    .98    .97    .96    .95    .95    .95
  0.50     .98    .97    .97    .97    .96    .95    .95    .95    .95
  0.95     .96    .96    .95    .95    .95    .95    .95    .95    .95
  0.99     .95    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
(b) ‘Forward’ bootstrap CI. m/n = 1

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00    .51    .53    .55    .60    .69    .84    .93    .95
 −0.99     .96    .86    .71    .57    .61    .71    .85    .93    .95
 −0.95     .97    .93    .89    .77    .65    .72    .86    .93    .95
 −0.50     .96    .96    .95    .93    .89    .85    .91    .94    .95
  0.00     .96    .96    .95    .94    .92    .91    .94    .95    .95
  0.50     .96    .96    .95    .95    .94    .94    .95    .95    .95
  0.95     .96    .95    .95    .95    .95    .95    .95    .95    .95
  0.99     .95    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
Table 3(a) indicates that the ‘backward’ bootstrap has exact and asymptotic size equal to the nominal level .95, as desired. (That is, the coverage probabilities are .95 or greater, with equality for some parameter values.) The coverage probability exceeds .95 in a variety of cases, however, so the CI is not asymptotically similar. In consequence, the CI may be longer than necessary. This does not occur in model scenarios in which the bootstrap performs properly. Table 3(b) shows that the ‘forward’ bootstrap has asymptotic size substantially less than its nominal level when ρ ≤ −.5.⁵ For example, when ρ is fixed at −1.0, the exact and asymptotic

⁵ Tables 3(b) and 4(i)(b)–4(v)(b) show a discontinuity in the coverage probability at (hD, ρ) = (0.0, −1.0). To see why this discontinuity occurs, consider the case where m/n = 0. In this case, (i) U(hD) = max{Z2, (Z1 + Z2 − hD)/2} = max{Z2, −hD/2}, where the last equality holds because Z1 = −Z2 when ρ = −1, (ii) U*(m/n, hD) = U*(0, hD) = max{Z*2, (Z*1 + Z*2)/2} = max{Z*2, 0}, (iii) the α quantile of U*(0, hD) is c*U(0, hD, α) = 0.0 for α ≤ 1/2, (iv)
Table 4. Linear moment inequalities II: m out of n bootstrap coverage probabilities of nominal 95% CIs when h2 = 0.
(i) m/n = 0. (a) ‘Backward’ bootstrap CI. m/n = 0

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.99    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.95    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.50    1.00   1.00   1.00   1.00   1.00    .98    .95    .95    .95
  0.00     .99    .99    .99    .99    .98    .96    .95    .95    .95
  0.50     .98    .98    .98    .98    .97    .96    .96    .96    .96
  0.95     .96    .96    .96    .96    .96    .96    .96    .96    .96
  0.99     .96    .96    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
(b) ‘Forward’ bootstrap CI. m/n = 0

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00    .50    .50    .50    .50    .50    .50    .50    .50
 −0.99     .95    .83    .67    .54    .54    .54    .54    .54    .54
 −0.95     .95    .90    .85    .72    .59    .58    .58    .58    .58
 −0.50     .95    .94    .93    .90    .85    .77    .75    .75    .75
  0.00     .95    .95    .94    .92    .90    .86    .85    .85    .85
  0.50     .95    .95    .94    .93    .92    .91    .90    .90    .90
  0.95     .95    .95    .95    .95    .95    .95    .95    .95    .95
  0.99     .95    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
size is less than or equal to .51. When ρ = −.99, the exact and asymptotic size is less than or equal to .57. This demonstrates that the ‘forward’ bootstrap CI can behave quite poorly depending upon the moment inequalities and the parameter values considered.

6.5. Coverage Probability Simulations of m out of n Bootstrap CIs

Table 4 provides values of CPtv(m/n, h) and CPint(m/n, h) for m out of n ‘backward’ and ‘forward’ bootstrap CIs for the moment inequality model of (5.1) and (6.1). These results are

⁵ (cont.) CPtv(m/n, h) = CPtv(0, (hD, 0)) = P(U(hD) ≥ c*U(0, hD, α)) = P(max{Z2, −hD/2} ≥ 0) and (v) P(max{Z2, −hD/2} ≥ 0) = 1 when hD = 0 and P(max{Z2, −hD/2} ≥ 0) = 1/2 when hD > 0.
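Step (v) of footnote 5 — that P(max{Z2, −hD/2} ≥ 0) equals 1 at hD = 0 but only 1/2 for any hD > 0 — is easy to verify numerically. The stdlib-only check below is ours, not the authors' code:

```python
import random

# P(max{Z2, -hD/2} >= 0) is 1 at hD = 0 but 1/2 for any hD > 0,
# which is the discontinuity visible in the hD = 0 column of the
# 'forward' panels of Tables 3 and 4 at rho = -1.
rng = random.Random(7)
z2 = [rng.gauss(0.0, 1.0) for _ in range(20000)]

def p_cover(hD):
    return sum(max(z, -hD / 2.0) >= 0.0 for z in z2) / len(z2)
```

The jump from 1 to 1/2 as hD moves off zero is the discrete, discontinuous behaviour that no choice of m/n can repair.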
Table 4. Continued. (ii) m/n = 0.01.
(a) ‘Backward’ bootstrap CI. m/n = 0.01

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.99    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.95    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.50    1.00   1.00   1.00   1.00   1.00    .98    .95    .95    .95
  0.00     .99    .99    .99    .99    .98    .96    .95    .95    .95
  0.50     .98    .98    .98    .97    .97    .96    .95    .95    .95
  0.95     .96    .96    .96    .95    .95    .95    .95    .95    .95
  0.99     .96    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
(b) ‘Forward’ bootstrap CI. m/n = 0.01

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00    .50    .51    .51    .52    .54    .57    .61    .64
 −0.99     .95    .84    .68    .55    .55    .57    .60    .63    .67
 −0.95     .95    .91    .86    .73    .60    .61    .64    .67    .70
 −0.50     .95    .94    .93    .91    .86    .79    .79    .81    .83
  0.00     .95    .95    .94    .93    .90    .87    .87    .88    .89
  0.50     .95    .95    .94    .94    .92    .91    .92    .93    .93
  0.95     .95    .95    .95    .95    .95    .95    .95    .95    .95
  0.99     .95    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
analogous to those in Table 2 and use the same values of m/n as in Table 2, but apply to the second linear inequality model rather than the first. The nominal confidence level .95 and parameter values are the same as in Table 3. Given that h2 = 0, we have CPtv(m/n, h) = CPint(m/n, h).

Table 4(a) indicates that the m out of n ‘backward’ bootstrap has exact and asymptotic size equal to the nominal level .95, as desired. (That is, the coverage probabilities are .95 or greater, with equality for some parameter values.) The coverage probabilities for the m out of n ‘backward’ bootstrap are not very sensitive to the value of m/n. The coverage probabilities exceed .95 in a variety of cases, however, so the CI is not asymptotically similar. In consequence, the CI may be longer than necessary.

Table 4(b) shows that the m out of n ‘forward’ bootstrap has asymptotic size substantially less than its nominal level when ρ ≤ −.5. For example, when ρ = −1.0 and m/n = 0, the exact
Table 4. Continued. (iii) m/n = 0.05.
(a) ‘Backward’ bootstrap CI. m/n = 0.05

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.99    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.95    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.50    1.00   1.00   1.00   1.00   1.00    .98    .95    .95    .95
  0.00     .99    .99    .99    .99    .98    .96    .95    .95    .95
  0.50     .98    .98    .98    .97    .97    .96    .95    .95    .95
  0.95     .96    .96    .96    .95    .95    .95    .95    .95    .95
  0.99     .96    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
(b) ‘Forward’ bootstrap CI. m/n = 0.05

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00    .51    .51    .52    .54    .57    .65    .71    .77
 −0.99     .95    .84    .68    .55    .56    .60    .67    .73    .79
 −0.95     .96    .92    .87    .74    .61    .63    .70    .75    .80
 −0.50     .95    .94    .93    .91    .86    .80    .82    .85    .88
  0.00     .95    .95    .94    .93    .90    .88    .89    .91    .93
  0.50     .95    .95    .95    .94    .93    .92    .93    .94    .95
  0.95     .95    .95    .95    .95    .95    .95    .95    .95    .95
  0.99     .95    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
and asymptotic size is less than or equal to .50. When ρ = −.99 and m/n = 0, it is less than or equal to .58. This demonstrates that the m out of n ‘forward’ bootstrap CI can behave quite poorly depending upon the moment inequalities and the parameter values considered.

6.6. Comparison of the m Out of n ‘Backward’ and ‘Forward’ Bootstraps in Models I and II

Table 2(i) shows that in Model I the m out of n ‘backward’ bootstrap is not asymptotically correct when m/n → 0 as n → ∞, whereas the m out of n ‘forward’ bootstrap is, and Table 4(i) shows that the opposite is true in Model II. This can be explained as follows.

The ‘backward’ bootstrap is ‘backward’ (see (3.5)). In consequence, it does not have asymptotically correct coverage probability even when the asymptotic distributions of the test
Table 4. Continued. (iv) m/n = 0.1.
(a) ‘Backward’ bootstrap CI. m/n = 0.1

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.99    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.95    1.00   1.00   1.00   1.00   1.00   1.00    .95    .95    .95
 −0.50    1.00   1.00   1.00   1.00    .99    .97    .95    .95    .95
  0.00     .99    .99    .99    .98    .98    .96    .95    .95    .95
  0.50     .98    .98    .98    .97    .97    .96    .95    .95    .95
  0.95     .96    .96    .96    .95    .95    .95    .95    .95    .95
  0.99     .96    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
(b) ‘Forward’ bootstrap CI. m/n = 0.1

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00    .51    .51    .52    .55    .60    .69    .77    .83
 −0.99     .96    .85    .69    .56    .57    .62    .70    .78    .84
 −0.95     .96    .92    .87    .75    .62    .65    .73    .80    .85
 −0.50     .96    .95    .94    .91    .87    .81    .84    .88    .91
  0.00     .96    .95    .94    .93    .91    .88    .90    .93    .94
  0.50     .96    .95    .95    .94    .93    .92    .94    .95    .95
  0.95     .95    .95    .95    .95    .95    .95    .95    .95    .95
  0.99     .95    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
statistic and the bootstrap test statistic are the same whenever the asymptotic distribution is skewed such that minus its 1 − α quantile exceeds its α quantile. This occurs in Model I in the leading case in which m/n = 0, h1 = hD = 0 and ρ = −1. In this case, in Model I, U(0) = min{Z1, −Z1}, U(0) is negative with probability one and U*(0, 0) has the same distribution as U(0). The density of U(0) is twice the density of a standard normal on R− and is zero on R+. The m out of n ‘backward’ bootstrap asymptotic critical value is the 1 − α quantile, c(1 − α), of U(0), and the probability of coverage of this CI is P(U(0) ≥ −c(1 − α)) (see (5.12)). When α < 1/2, c(1 − α) < 0 and so U(0) ≥ −c(1 − α) with probability zero. Hence, the m out of n ‘backward’ bootstrap is not asymptotically correct (or even close to it). On the other hand, the m out of n ‘forward’ bootstrap is asymptotically correct in Model I because its critical value equals c(α) asymptotically and its ACP is P(U(0) ≥ c(α)) = 1 − α (see the paragraph following (5.13)).
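The Model I failure mode can be checked directly: U(0) = min{Z1, −Z1} = −|Z1| is negative with probability one, so the ‘backward’ critical value c(1 − α) is negative and the limiting coverage P(U(0) ≥ −c(1 − α)) is zero. A short stdlib-only check (our own illustration, not the authors' code):

```python
import random

rng = random.Random(11)
# draws of U(0) = min{Z1, -Z1} = -|Z1|: the Model I limit at rho = -1, hD = 0
u = sorted(min(z, -z) for z in (rng.gauss(0.0, 1.0) for _ in range(20000)))
c95 = u[int(0.95 * len(u))]       # the .95 quantile of U(0): a negative number
n_covered = sum(x >= -c95 for x in u)
```

The ‘forward’ version instead uses c(α), and P(U(0) ≥ c(α)) = 1 − α holds by construction, which is why only the ‘forward’ CI is asymptotically correct in Model I.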
Table 4. Continued. (v) m/n = 0.5.
(a) ‘Backward’ bootstrap CI. m/n = 0.5

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00    .99    .99    .99    .98    .97    .95    .95    .95
 −0.99    1.00   1.00   1.00    .99    .98    .97    .95    .95    .95
 −0.95    1.00   1.00   1.00    .99    .98    .97    .95    .95    .95
 −0.50     .99    .99    .99    .99    .98    .96    .95    .95    .95
  0.00     .99    .99    .98    .98    .97    .96    .95    .95    .95
  0.50     .98    .98    .97    .97    .96    .95    .95    .95    .95
  0.95     .96    .96    .96    .95    .95    .95    .95    .95    .95
  0.99     .96    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
(b) ‘Forward’ bootstrap CI. m/n = 0.5

 ρ \ hD   0.00   .125   .250   .500   1.00   2.00   4.00   6.00   8.00
 −1.00    1.00    .51    .52    .54    .58    .66    .80    .89    .95
 −0.99     .96    .86    .70    .57    .60    .68    .81    .89    .95
 −0.95     .96    .93    .88    .75    .64    .70    .82    .90    .94
 −0.50     .96    .95    .94    .92    .88    .83    .89    .93    .94
  0.00     .96    .96    .95    .94    .92    .90    .93    .95    .95
  0.50     .96    .96    .95    .95    .94    .94    .95    .95    .95
  0.95     .96    .95    .95    .95    .95    .95    .95    .95    .95
  0.99     .95    .95    .95    .95    .95    .95    .95    .95    .95
  1.00     .95    .95    .95    .95    .95    .95    .95    .95    .95
In Model II, however, the m out of n ‘backward’ bootstrap has correct ACP. In this model, when m/n = 0, h2 = hD = 0 and ρ = −1, we have U(0) = max{Z2, 0}, U(0) is non-negative with probability one and U*(0, 0) has the same distribution as U(0). In this case, minus the 1 − α quantile of U(0) does not exceed its α quantile. Specifically, the random variable U(0) equals 0 with probability 1/2 and has a standard normal density on R+. Hence, c(α) = 0 and c(1 − α) is the 1 − α quantile of the standard normal (provided α < 1/2). In consequence, by the formula given above for the coverage probability of the m out of n ‘backward’ bootstrap CI, this CI has correct ACP.

In Model II, the m out of n ‘forward’ bootstrap CI also has correct ACP if hD = 0. But, for any hD > 0, this CI has very poor ACP (equal to 1/2) because the test statistic has asymptotic distribution given by U(hD) = max{Z2, −hD/2}, the asymptotic bootstrap distribution is given
by U (0) = max{Z 2 , 0}, the asymptotic critical value is the α quantile of U (0), which equals 0 (provided α < 1/2) and the ACP is P (U (h D ) ≥ 0) = 1/2. The reason for the failure of the m out of n ‘forward’ bootstrap CI in this case is a combination of the difference in the asymptotic distributions of the test statistic and the bootstrap test statistic plus the discrete component of these distributions. Both of these features arise because of the discontinuity in the pointwise asymptotic distribution of the test statistic as a function of the underlying parameters.
ACKNOWLEDGMENTS Andrews gratefully acknowledges the research support of the National Science Foundation via grant numbers SES-0417911 and SES-0751517. The authors thank a referee and the editor for helpful comments.
REFERENCES

Andrews, D. W. K. (1999). Estimation when a parameter is on a boundary. Econometrica 67, 1341–83.
Andrews, D. W. K. (2000). Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space. Econometrica 68, 399–405.
Andrews, D. W. K. and P. Jia (2008). Inference for parameters defined by moment inequalities: a recommended moment selection procedure. Cowles Foundation Discussion Paper No. 1676, Yale University.
Andrews, D. W. K., S. Berry, and P. Jia (2004). Confidence regions for parameters in discrete games with multiple equilibria, with an application to discount chain store location. Working paper, Cowles Foundation, Yale University.
Andrews, D. W. K. and P. Guggenberger (2005). Applications of subsampling, hybrid, and size-correction methods. Cowles Foundation Discussion Paper No. 1608, Yale University.
Andrews, D. W. K. and P. Guggenberger (2009a). Asymptotic size and a problem with subsampling and with the m out of n bootstrap. Forthcoming in Econometric Theory.
Andrews, D. W. K. and P. Guggenberger (2009b). Hybrid and size-corrected subsampling methods. Forthcoming in Econometrica.
Andrews, D. W. K. and P. Guggenberger (2009c). Validity of subsampling and ‘plug-in asymptotic’ inference for parameters defined by moment inequalities. Forthcoming in Econometric Theory.
Andrews, D. W. K. and P. Guggenberger (2009d). Incorrect asymptotic size of subsampling procedures based on post-consistent model selection estimators. Forthcoming in Journal of Econometrics.
Andrews, D. W. K. and G. Soares (2007). Inference for parameters defined by moment inequalities using generalized moment selection. Cowles Foundation Discussion Paper No. 1631, Yale University.
Athreya, K. B. (1987). Bootstrap of the mean in the infinite variance case. Annals of Statistics 15, 724–31.
Athreya, K. B. and J. Fukuchi (1994). Bootstrapping extremes of i.i.d. random variables. In J. Galambos, J. Lechner and E. Simiu (Eds.), Proceedings of the Conference on Extreme Values and Applications, Volume III, Special Publication 866, 23–29. Gaithersburg, MD: NIST.
Babu, J. (1984). Bootstrapping statistics with linear combinations of chi-squares as weak limit. Sankhya 46, 86–93.
Basawa, I. V., A. K. Mallik, W. P. McCormick, J. H. Reeves, and R. L. Taylor (1991). Bootstrapping unstable first-order autoregressive processes. Annals of Statistics 19, 1098–101.
Beran, R. (1982). Estimated sampling distributions: the bootstrap and competitors. Annals of Statistics 10, 212–25.
Beran, R. (1997). Diagnosing bootstrap success. Annals of the Institute of Statistical Mathematics 49, 1–24.
Beran, R. and M. S. Srivastava (1985). Bootstrap tests and confidence regions for functions of a covariance matrix. Annals of Statistics 13, 95–115.
Bickel, P. J. and D. Freedman (1981). Some asymptotic theory for the bootstrap. Annals of Statistics 9, 1196–217.
Bickel, P. J., F. Götze, and W. R. van Zwet (1997). Resampling fewer than n observations: gains, losses, and remedies for losses. Statistica Sinica 7, 1–31.
Bretagnolle, J. (1983). Lois limites du bootstrap de certaines fonctionnelles. Annales de l'Institut Henri Poincaré, Sec. B 19, 281–96.
Bugni, F. (2007a). Bootstrap inference in partially identified models. Working paper, Department of Economics, Northwestern University.
Bugni, F. (2007b). Bootstrap inference in partially identified models: pointwise construction. Working paper, Department of Economics, Northwestern University.
Canay, I. A. (2007). EL inference for partially identified models: large deviations optimality and bootstrap validity. Working paper, Department of Economics, University of Wisconsin.
Chernozhukov, V., H. Hong, and E. Tamer (2007). Estimation and confidence regions for parameter sets in econometric models. Econometrica 75, 1243–84.
Datta, S. (1995). On a modified bootstrap for certain asymptotically non-normal statistics. Statistics and Probability Letters 24, 91–8.
Deheuvels, P., D. Mason, and G. Shorack (1993). Some results on the influence of extremes on the bootstrap. Annales de l'Institut Henri Poincaré 29, 83–103.
Dümbgen, L. (1993). On nondifferentiable functions and the bootstrap. Probability Theory and Related Fields 95, 125–40.
Eaton, M. L. and D. E. Tyler (1991). On Wielandt's inequality and its applications to the asymptotic distribution of the eigenvalues of a random symmetric matrix. Annals of Statistics 19, 260–71.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics 7, 1–26.
Guggenberger, P. and M. Wolf (2004). Subsampling tests of parameter hypotheses and overidentifying restrictions with possible failure of identification. Working paper, Department of Economics, UCLA.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. New York: Springer.
Imbens, G. and C. F. Manski (2004). Confidence intervals for partially identified parameters. Econometrica 72, 1845–57.
Lehmann, E. L. and J. P. Romano (2005). Testing Statistical Hypotheses (3rd ed.). New York: Springer.
Manski, C. F. and E. Tamer (2002). Inference on regressions with interval data on a regressor or outcome. Econometrica 70, 519–46.
Mikusheva, A. (2007). Uniform inference in autoregressive models. Econometrica 75, 1411–52.
Pakes, A., J. Porter, K. Ho, and J. Ishii (2004). Applications of moment inequalities. Working paper, Department of Economics, Harvard University.
Politis, D. N. and J. P. Romano (1994). Large sample confidence regions based on subsamples under minimal assumptions. Annals of Statistics 22, 2031–50.
Politis, D. N., J. P. Romano, and M. Wolf (1999). Subsampling. New York: Springer.
Putter, H. and W. R. van Zwet (1996). Resampling: consistency of substitution estimators. Annals of Statistics 24, 2297–318.
Romano, J. (1988). Bootstrapping the mode. Annals of the Institute of Statistical Mathematics 40, 565–86.
Romano, J. P. and A. M. Shaikh (2005). Inference for the identified set in partially identified econometric models. Working paper, Department of Economics, University of Chicago.
Romano, J. P. and A. M. Shaikh (2008). Inference for identifiable parameters in partially identified econometric models. Journal of Statistical Planning and Inference (Special Issue in Honor of T. W. Anderson) 138, 2786–802.
Romano, J. P. and M. Wolf (2001). Subsampling intervals in autoregressive models with linear time trend. Econometrica 69, 1283–314.
Samworth, R. (2003). A note on methods of restoring consistency to the bootstrap. Biometrika 90, 985–90.
Shao, J. (1994). Bootstrap sample size in nonregular cases. Proceedings of the American Mathematical Society 112, 1251–62.
Shao, J. (1996). Bootstrap model selection. Journal of the American Statistical Association 91, 655–65.
Shao, J. and C. J. F. Wu (1989). A general theory for jackknife variance estimation. Annals of Statistics 15, 1563–79.
Sriram, T. N. (1993). Invalidity of the bootstrap for critical branching processes with immigration. Annals of Statistics 22, 1013–23.
Swanepoel, J. W. H. (1986). A note on proving that the (modified) bootstrap works. Communications in Statistics—Theory and Methods 15, 3193–203.
Wu, C. F. J. (1990). On the asymptotic properties of the jackknife histogram. Annals of Statistics 18, 1438–52.
The Econometrics Journal (2009), volume 12, pp. S200–S216. doi: 10.1111/j.1368-423X.2008.00262.x

More on monotone instrumental variables

CHARLES F. MANSKI† AND JOHN V. PEPPER‡

†Department of Economics and Institute for Policy Research, Northwestern University, Evanston, IL, USA
E-mail: [email protected]

‡Department of Economics, University of Virginia, Charlottesville, VA, USA
E-mail: [email protected]

First version received: May 2008; final version accepted: August 2008
Summary: Econometric analyses of treatment response often use instrumental variable (IV) assumptions to identify treatment effects. The traditional IV assumption holds that mean response is constant across the sub-populations of persons with different values of an observed covariate. Manski and Pepper (2000) introduced monotone instrumental variable assumptions, which replace equalities with weak inequalities. This paper presents further analysis of the monotone instrumental variable (MIV) idea. We use an explicit response model to enhance the understanding of the content of MIV and traditional IV assumptions. We study the identifying power of MIV assumptions when combined with the homogeneous linear response assumption maintained in many studies of treatment response. We also consider estimation of MIV bounds, with particular attention to finite-sample bias.

Keywords: Bias correction, Instrumental variables, Nonparametric bounds, Partial identification
1. INTRODUCTION

Econometric analyses of treatment response often use instrumental variable (IV) assumptions to identify treatment effects. The traditional IV assumption holds that mean response is constant across the sub-populations of persons with different values of an observed covariate. The credibility of this assumption is often a matter of considerable disagreement, as evidenced by frequent debates about whether some covariate is a ‘valid instrument’. There is therefore good reason to consider weaker but more credible assumptions. To this end, Manski and Pepper (2000) introduced monotone instrumental variable (MIV) assumptions. An MIV assumption weakens the traditional IV assumption by replacing its equality of mean response across sub-populations with a weak inequality. We studied the identifying power of MIV assumptions when imposed alone and when combined with the assumption of monotone treatment response (MTR). We reported an empirical application using MIV and MTR assumptions to bound the wage returns to schooling.

This paper presents further analysis of the MIV idea. We draw in part on previously unpublished research that appeared in our original working paper on MIVs (Manski and Pepper, 1998), but that was not included in our 2000 Econometrica article. We also discuss statistical issues associated with estimation of MIV bounds.
As prelude, Section 2 sets up basic concepts and notation, summarizes the main analytical findings of Manski and Pepper (2000) and describes subsequent findings. Section 3 uses an explicit response model to enhance the understanding of the content of MIV and traditional IV assumptions. The key is to integrate the concepts of treatments and covariates in the analysis of treatment response. We use the integrated framework to suggest MIV assumptions that might credibly be applied in analyses of the returns to schooling and other studies of production. Section 3 is a revised version of Manski and Pepper (1998, Section 2). Section 4 studies the identifying power of MIV assumptions when combined with the homogeneous linear response (HLR) assumption maintained in many studies of treatment response. We think that the HLR assumption is rarely credible; hence, we do not endorse its use in practice. However, widespread application of the assumption makes it important that researchers understand its implications. It has been common to combine the HLR assumption with a traditional IV assumption to achieve point-identification of treatment effects. We show that combining the HLR assumption with an MIV assumption yields a bound on treatment effects. Section 4 is a revised and extended version of Manski and Pepper (1998, Section 4). Section 5 considers estimation of the bounds reported in Sections 2 and 4. An important statistical concern, noted but not analysed in Manski and Pepper (2000), is that analogue estimates of the bounds have a finite-sample bias that makes them tend to be narrower than the true bounds. We explain this bias, give Monte Carlo evidence on its magnitude and describe the bias-correction procedure of Kreider and Pepper (2007). We also show how the so-called weak-instruments problem that arises in analogue estimation under HLR and IV assumptions manifests itself when HLR and MIV assumptions are combined.
2. BACKGROUND

2.1. Concepts and notation

We use the same formal setup as Manski and Pepper (2000). There is a probability space (J, Ω, P) of individuals. Each member j of population J has observable covariates xj ∈ X and a response function yj(·): T → Y mapping the mutually exclusive and exhaustive treatments t ∈ T into real-valued outcomes yj(t) ∈ Y. The outcome space Y has greatest lower bound K0 ≡ inf Y and least upper bound K1 ≡ sup Y. Person j has a realized treatment zj ∈ T and a realized outcome yj ≡ yj(zj), both of which are observable. The latent outcomes yj(t), t ≠ zj, are not observable. An empirical researcher learns the distribution P(x, z, y) of covariates, realized treatments and realized outcomes by observing a random sample of the population. The researcher's problem is to combine this empirical evidence with assumptions in order to learn about the distribution P[y(·)] of response functions, or perhaps the conditional distributions P[y(·)|x]. With this background, we may define an MIV assumption. Let x = (w, v) and X = W × V. Each value of (w, v) defines an observable sub-population of persons. The familiar mean-independence form of IV assumption is that, for each t ∈ T and each w ∈ W, the mean value of y(t) is the same in all of the sub-populations (w, v = u), u ∈ V.

IV ASSUMPTION. Covariate v is an instrumental variable in the sense of mean-independence if, for each t ∈ T, each w ∈ W and all (u1, u2) ∈ V × V,

  E[y(t)|w, v = u2] = E[y(t)|w, v = u1].   (2.1)
C. F. Manski and J. V. Pepper
MIV assumptions replace the equality in (2.1) by an inequality, yielding a mean-monotonicity condition.

MIV ASSUMPTION. Let V be an ordered set. Covariate v is a monotone IV in the sense of mean-monotonicity if, for each t ∈ T, each w ∈ W and all (u1, u2) ∈ V × V such that u2 ≥ u1,

  E[y(t)|w, v = u2] ≥ E[y(t)|w, v = u1].   (2.2)
Certainly, the most commonly applied IV assumption is exogenous treatment selection (ETS). Here, the IV v is the realized treatment z. Hence, assumption (2.1) becomes the following.

ETS ASSUMPTION. For each t ∈ T, each w ∈ W and all (u1, u2) ∈ T × T,

  E[y(t)|w, z = u2] = E[y(t)|w, z = u1].   (2.3)
Weakening equation (2.3) to an inequality yields the special MIV assumption that we call monotone treatment selection (MTS).

MTS ASSUMPTION. Let T be an ordered set. For each t ∈ T, each w ∈ W and all (u1, u2) ∈ T × T,

  u2 ≥ u1 ⇒ E[y(t)|w, z = u2] ≥ E[y(t)|w, z = u1].   (2.4)
The MTS assumption should not be confused with the MTR assumption of Manski (1997). This is the following.

MTR ASSUMPTION. Let T be an ordered set. For each j ∈ J,

  t2 ≥ t1 ⇒ yj(t2) ≥ yj(t1).   (2.5)
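The four assumptions above are conditions on conditional means, and their sample analogues are easy to check. The sketch below (a hypothetical design in which y(t) is fully observed, so the conditions are directly verifiable) encodes the IV condition (2.1) and the MIV condition (2.2) as tests on a vector of conditional means:

```python
import numpy as np

def cond_means(y, g, values):
    """Mean of y within each group g == u, in the order given by `values`."""
    return np.array([y[g == u].mean() for u in values])

def satisfies_iv(means, tol=1e-8):
    """IV condition (2.1): E[y(t)|v = u] equal across all values u."""
    return np.ptp(means) <= tol

def satisfies_miv(means, tol=1e-8):
    """MIV condition (2.2): E[y(t)|v = u] weakly increasing in u."""
    return bool(np.all(np.diff(means) >= -tol))

# Hypothetical population: ordered covariate v, mean response rising in v.
v = np.repeat([1, 2, 3], 100)
y_t = 0.5 * v + 1.0              # conditional means 1.5, 2.0, 2.5

m = cond_means(y_t, v, [1, 2, 3])
print(satisfies_iv(m), satisfies_miv(m))   # prints: False True
```

Under this design the means rise with v, so the weaker MIV inequality holds while the IV equality fails.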
2.2. Findings

This section summarizes the main analytical findings of Manski and Pepper (2000). To simplify the exposition, we henceforth leave implicit the conditioning on w maintained in the definitions of MIVs.

2.2.1. MIV assumptions. Consider an MIV assumption alone, not combined with other assumptions. Proposition 2.1 gives sharp bounds on the conditional mean responses E[y(t)|v = u], u ∈ V, and the marginal mean E[y(t)]. These bounds are informative if the outcome space Y has finite range [K0, K1]. The MIV bounds have particularly simple forms in the case of monotone treatment selection.

PROPOSITION 2.1. Let the MIV assumption (2.2) hold. Then for each u ∈ V,

  sup_{u1 ≤ u} {E(y|v = u1, z = t) · P(z = t|v = u1) + K0 · P(z ≠ t|v = u1)}
    ≤ E[y(t)|v = u]
    ≤ inf_{u2 ≥ u} {E(y|v = u2, z = t) · P(z = t|v = u2) + K1 · P(z ≠ t|v = u2)}.   (2.6)
It follows that

  Σ_{u∈V} P(v = u) sup_{u1 ≤ u} {E(y|v = u1, z = t) · P(z = t|v = u1) + K0 · P(z ≠ t|v = u1)}
    ≤ E[y(t)]
    ≤ Σ_{u∈V} P(v = u) inf_{u2 ≥ u} {E(y|v = u2, z = t) · P(z = t|v = u2) + K1 · P(z ≠ t|v = u2)}.   (2.7)

Let the MTS assumption (2.4) hold. Then bound (2.6) reduces to

  u < t ⇒ K0 ≤ E[y(t)|z = u] ≤ E(y|z = t),
  u = t ⇒ E[y(t)|z = u] = E(y|z = t),
  u > t ⇒ E(y|z = t) ≤ E[y(t)|z = u] ≤ K1.   (2.8)
It follows that

  K0 · P(z < t) + E(y|z = t) · P(z ≥ t) ≤ E[y(t)] ≤ K1 · P(z > t) + E(y|z = t) · P(z ≤ t).   (2.9)

In the absence of other information, these bounds are sharp.

The basic finding in Proposition 2.1 is inequality (2.6), from which the other findings are derived. The lower bound in (2.6) is the supremum of the no-assumptions lower bounds on E[y(t)|v = u1] over all u1 ≤ u. The upper bound is the infimum of the no-assumptions upper bounds on E[y(t)|v = u2] over u2 ≥ u. Hence, the MIV bound on E[y(t)|v = u] is a subset of the no-assumptions bound on E[y(t)|v = u]. The MIV assumption has no identifying power if the no-assumptions lower and upper bounds on E[y(t)|v = u] weakly increase with u. Otherwise, it has identifying power in the formal sense that at least some MIV bounds are proper subsets of the corresponding no-assumptions bounds. Bound (2.6) is a superset of the Manski (1990) IV bound on E[y(t)|v = u], which is the intersection of the no-assumptions bounds on E[y(t)|v = u1] over all u1 ∈ V. The MIV and IV bounds coincide if the no-assumptions bounds on E[y(t)|v = u] weakly decrease with u. In this case, the MIV and IV assumptions have the same identifying power.

2.2.2. MIV and MTR assumptions. The MIV and MTR assumptions make distinct contributions to identification. When imposed together, the two assumptions can have substantial identifying power. Combining the MTR and MTS assumptions yields a particularly interesting finding. Whereas the MIV-MTR bounds are informative only when Y has finite range, the MTS-MTR bounds are informative even if Y is unbounded. Proposition 2.2 gives these results.

PROPOSITION 2.2. Let the MIV and MTR assumptions (2.2) and (2.5) hold. Then for each u ∈ V,

  sup_{u1 ≤ u} {E(y|v = u1, t ≥ z) · P(t ≥ z|v = u1) + K0 · P(t < z|v = u1)}
    ≤ E[y(t)|v = u]
    ≤ inf_{u2 ≥ u} {E(y|v = u2, t ≤ z) · P(t ≤ z|v = u2) + K1 · P(t > z|v = u2)}.   (2.10)
It follows that

  Σ_{u∈V} P(v = u) sup_{u1 ≤ u} {E(y|v = u1, t ≥ z) · P(t ≥ z|v = u1) + K0 · P(t < z|v = u1)}
    ≤ E[y(t)]
    ≤ Σ_{u∈V} P(v = u) inf_{u2 ≥ u} {E(y|v = u2, t ≤ z) · P(t ≤ z|v = u2) + K1 · P(t > z|v = u2)}.   (2.11)

Let the MTS and MTR assumptions (2.4) and (2.5) hold. Then, bound (2.10) reduces to

  u < t ⇒ E(y|z = u) ≤ E[y(t)|z = u] ≤ E(y|z = t),
  u = t ⇒ E[y(t)|z = u] = E(y|z = t),
  u > t ⇒ E(y|z = t) ≤ E[y(t)|z = u] ≤ E(y|z = u).   (2.12)
It follows that

  Σ_{u<t} E(y|z = u) · P(z = u) + E(y|z = t) · P(z ≥ t)
    ≤ E[y(t)]
    ≤ Σ_{u>t} E(y|z = u) · P(z = u) + E(y|z = t) · P(z ≤ t).   (2.13)

In the absence of other information, these bounds are sharp.
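The MTS bounds (2.9) and the MTS-MTR bounds (2.13) involve only conditional means of observables and so have direct sample analogues. A minimal sketch, using a hypothetical data-generating process with Y = [0, 1] and T = {0, 1, 2}:

```python
import numpy as np

rng = np.random.default_rng(0)
K0, K1 = 0.0, 1.0                        # bounds of the outcome space Y
z = rng.integers(0, 3, size=5000)        # realized treatments in T = {0, 1, 2}
y = np.clip(0.2 * z + rng.normal(0.4, 0.1, size=5000), K0, K1)

def mts_bound(t):
    """Sample analogue of (2.9): MTS bound on E[y(t)]."""
    Ey_t = y[z == t].mean()
    lo = K0 * (z < t).mean() + Ey_t * (z >= t).mean()
    hi = K1 * (z > t).mean() + Ey_t * (z <= t).mean()
    return lo, hi

def mts_mtr_bound(t):
    """Sample analogue of (2.13): MTS + MTR bound on E[y(t)]."""
    Ey_t = y[z == t].mean()
    lo = sum(y[z == u].mean() * (z == u).mean() for u in range(3) if u < t) \
         + Ey_t * (z >= t).mean()
    hi = sum(y[z == u].mean() * (z == u).mean() for u in range(3) if u > t) \
         + Ey_t * (z <= t).mean()
    return lo, hi

lo1, hi1 = mts_bound(1)
lo2, hi2 = mts_mtr_bound(1)
print(lo1 <= lo2 <= hi2 <= hi1)   # MTS + MTR bound nests inside the MTS bound
```

Adding MTR replaces the worst-case constants K0 and K1 by conditional means, so the combined bound is (weakly) narrower than the MTS bound alone.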
2.3. Subsequent findings

The analytical findings described above have been applied in several empirical studies of treatment response. As mentioned earlier, we reported MIV bounds on the wage returns to schooling. Subsequently, González (2005) has studied the wage returns to English proficiency. Gerfin and Schellhorn (2006) have studied the effect of deductibles in health care insurance on doctor visits. Kreider and Hill (2008) have studied the effect of health insurance coverage on health care utilization. Manski and Pepper (2000) proposed MIVs specifically to weaken the traditional mean-independence IV assumption in the analysis of treatment response. However, our broad idea was to enhance the credibility of empirical research by replacing traditional distributional equality assumptions with weak inequalities. Other manifestations of this idea have developed subsequently. The findings of our 2000 article apply immediately to inference on the population mean of an outcome from a random sample with missing outcome data. Let T = {0, 1}, with t = 1 indicating that the outcome is observable and t = 0 that it is unobservable. Let yj(1) be the outcome of interest. Let zj = 1 if yj(1) is observable and zj = 0 otherwise. With these definitions, Proposition 2.1 gives the sharp bound on E[y(1)] when a covariate v satisfies MIV assumption (2.2). This application of the MIV idea is developed in Manski (2003, Section 2.5), with the notational simplification that y(1) there is called y. Blundell et al. (2007) apply the missing-outcome version of the MIV assumption to inference on wages. The outcome of interest is wage, which is observable when a person works but not otherwise. Blundell et al. perform inference when a traditional statistical-independence form
of IV assumption is replaced by a weak stochastic dominance assumption. To accomplish this, they apply Proposition 2.1 to the distribution function P[y(1) ≥ r] = E{1[y(1) ≥ r]}, all r ∈ R¹. This work partially addresses the Manski and Pepper (2000) call for research on MIVs that replaces the statistical-independence assumption of classical randomized experiments with weak stochastic dominance. Blundell et al. also contribute new analytical findings on identification of quantiles and inter-quantile differences. MIV assumptions may also be applied when a researcher faces a data problem other than missing outcomes. Kreider and Pepper (2007, 2008) consider inference on a conditional mean when the data problem is partial misreporting of either the conditioning variable or the outcome variable. They maintain assumption (2.2) for a given covariate v and obtain a version of Proposition 2.1 in which the lower (upper) bound in inequality (2.6) is replaced by the supremum (infimum) of several v-specific lower (upper) bounds that hold in the misreporting context. Yet another related line of work begins with traditional parametric econometric models in which the parameters of interest solve a set of moment equations, and replaces these equations with a set of weak moment inequalities. Although replacement of parametric moment equations with inequalities is very much within the broad MIV theme, the approaches used to study partial identification of parametric models are quite different from, and more complex than, the transparent non-parametric analysis of Propositions 2.1 and 2.2 (see Manski and Tamer, 2002, Chernozhukov, Hong, and Tamer, 2007, and Rosen, 2008).
3. WHAT ARE IV AND MIV ASSUMPTIONS?

The concepts introduced in Section 2 suffice to define IV and MIV assumptions and to analyse their identifying power. Embedding these concepts within a broader framework, however, helps to clarify the meaning of these assumptions. Section 3.1 sets out this broader framework. Section 3.2 uses it to develop a lemma suggesting MIV assumptions that might credibly be imposed when analysing the returns to schooling. Section 3.3 discusses other applications of the lemma to production analysis.

3.1. Treatments and covariates

The discussion of Section 2 suggests a sharp distinction between treatments and covariates. Treatments have been presented as quantities that may be manipulated autonomously, inducing variation in response. We have been careful to use separate symbols for the conjectural treatments t ∈ T and for the actual treatment zj realized by person j. Covariates have been presented only as realized quantities associated with the members of the population, with no mention of their manipulability. We have used vj to denote a covariate value associated with person j. We have given no notation for conjectural values of covariates. A symmetric perspective on treatments and covariates emerges if, as in Manski (1997, Section 2.4), we enlarge the set of treatments from T to the Cartesian product set T × S and introduce a generalized response function yj*(·, ·): T × S → Y mapping elements of T × S into outcomes. Now, for each (t, s) ∈ T × S, yj*(t, s) is the outcome that person j would experience if she were to receive the conjectural treatment pair (t, s). The treatment pair realized by person j is (zj, ζj), and her realized outcome is yj = yj*(zj, ζj), where ζj ∈ S is the realized value of s. The response function yj(·): T → Y introduced as a primitive in Section 2 is now a derived
sub-response function, obtained by evaluating yj*(·, ·) with its second argument set at the realized treatment value ζj. That is,

  yj(·) ≡ yj*(·, ζj).   (3.1)
In this broadened framework, covariate ζj is the realized value of treatment s. With this as background, observe that the familiar statement 'variable v does not affect response' has two distinct formal interpretations. One interpretation is that v is an IV, as defined in (2.1). The other is that outcomes are constant under conjectural variations in v. Under the latter interpretation, S = V and

  yj*(t, u) = yj*(t, vj) = yj(t)   (3.2)
for all j ∈ J and (t, u) ∈ T × V. Assumption (3.2) has not yet been named. We call it a constant treatment response (CTR) assumption. Similarly, the familiar statement 'response is monotone in v' has two interpretations. One is that v is an MIV, as defined in (2.2). The other is that outcomes vary monotonically under conjectural variations in v. Under the latter interpretation, S = V and

  u2 ≥ u1 ⇒ yj*(t, u2) ≥ yj*(t, u1)   (3.3)
for all j ∈ J and t ∈ T. This is an MTR assumption. Distinguishing appropriately between IV/MIV assumptions and CTR/MTR assumptions is critical to the informed analysis of treatment response. We cannot know how often empirical researchers, thinking loosely that 'variable v does not affect response', have imposed an IV assumption but really had a CTR assumption in mind. Introducing MIV assumptions here, we want to squelch from the start any confusion between MIV and MTR assumptions.

3.2. Researcher-measured ability and the returns to schooling

Labour economists studying the returns to schooling commonly suppose that each individual j has a wage function yj(t), giving the wage that j would receive were she to obtain t years of schooling. Observing realized covariates, schooling and wages in the population, labour economists often seek to learn the mean wage E[y(t)] that would occur if a randomly drawn member of the population were to receive t years of schooling. Researchers often use personal, family and environmental attributes as IVs for years of schooling. Yet, the validity of whatever IV assumption may be imposed seems inevitably to be questioned. Some formal analysis using the concepts of Section 3.1 suggests that school grades, test scores and other measures of ability or achievement that are commonly observed by researchers are plausible MIVs for inference on the returns to schooling. For succinctness, we refer below to measures of ability rather than to measures of ability or achievement.

LEMMA 3.1. Let person j's earning function be

  yj*(t, s) = g(t, s, εj).   (3.4)
Here, t ∈ T is years of schooling and s ∈ S is an ordered measure of ability that is observable by employers. The quantity εj is person j's realization of another variable that takes values in a space E and that is observable by employers. Assume that, for each (t, ε) ∈ T × E, the
sub-response function g(t, ·, ε) is weakly increasing on S. Thus, wage increases with employer-measured ability. Let v ∈ V be an ordered measure of ability that is observable by a researcher. Let ζ be the realized value of employer-measured ability. Assume that P(ζ|v) is weakly increasing in v in the sense that P(ζ|v = u2) weakly stochastically dominates P(ζ|v = u1) when u2 ≥ u1. Assume that ε is statistically independent of (ζ, v). Then v is an MIV.

The lemma does not require that employers or researchers measure ability accurately. Nor does it require that employers and researchers observe the same measure of ability. It only requires that (i) wage increases with employer-measured ability and (ii) researcher-measured ability be a weakly positive predictor of employer-measured ability, in the sense of stochastic dominance. These are understandable and plausible assumptions. Indeed, we think the assumption that P(ζ|v) weakly increases in v in the sense of stochastic dominance appropriately formalizes what many empirical researchers have in mind when they state that some observed variable v is a 'proxy' for an unobserved variable ζ. Although researcher-measured ability provides a credible MIV for inference on the returns to schooling, we found in our own application that this MIV has little identifying power. Analysing data from the National Longitudinal Survey of Youth, Manski and Pepper (1998) reported no-assumptions bounds on the returns to schooling as well as bounds that use a respondent's score on the Armed Forces Qualifying Test (AFQT) as an MIV. We found that the no-assumptions bounds computed for different AFQT scores are nearly monotone increasing with the score. As a consequence, the MIV bounds were only slightly narrower than the no-assumptions bounds.

3.3. Other applications to production analysis

For concreteness, we used the returns to schooling to motivate Lemma 3.1.
We observe here that the lemma has other applications to production analysis. In the abstract, let equation (3.4) give the production function for agent j, which might be a firm producing a commodity, a person investing in human capital or another entity. Let t and s be two conjectural factors of production whose realized values (zj, ζj) are either chosen by the agent or predetermined. Let εj be a production shock of some kind. For example, in agricultural production, g(t, s, ε) might be crop output per acre, which varies with seed input t, planting effort s and weather quality ε following planting. Suppose that, for each agent j, a researcher observes the realized output yj and input zj. The researcher does not observe input ζj or the shock εj, but he does observe a 'proxy' vj for ζj. For example, in the agricultural setting, the researcher might observe crop output yj, seed input zj and labour hours vj allocated to planting. However, he might not observe cultivation effort and weather quality. Suppose the researcher finds it credible to assume that P(ζ|v) weakly increases in v in the sense of stochastic dominance and that ε is statistically independent of (ζ, v). For example, these assumptions are credible in the agricultural setting described above. Then, Lemma 3.1 shows that v is an MIV.
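The mechanism behind Lemma 3.1 can be checked by simulation. The sketch below uses a hypothetical response function g (linear, hence weakly increasing in its second argument s) with ζ stochastically increasing in v and ε independent of (ζ, v), and verifies that E[y*(t, ζ)|v = u] is weakly increasing in u, i.e. that v behaves as an MIV:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical response function g(t, s, eps), increasing in s.
def g(t, s, eps):
    return 1.0 + 0.08 * t + 0.5 * s + eps

v = rng.integers(1, 4, size=n)              # researcher-measured proxy in {1, 2, 3}
zeta = v + rng.normal(0.0, 1.0, size=n)     # P(zeta|v) rises stochastically with v
eps = rng.normal(0.0, 1.0, size=n)          # shock, independent of (zeta, v)

t = 12                                       # one fixed conjectural treatment value
cond = [g(t, zeta[v == u], eps[v == u]).mean() for u in (1, 2, 3)]
print(cond[0] < cond[1] < cond[2])           # prints: True
```

Under these assumptions the conditional means of y*(t, ζ) rise with v, which is exactly the MIV condition (2.2).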
4. COMBINING MIV AND HLR ASSUMPTIONS

Classical econometric analysis of treatment response, as codified in the literature on linear simultaneous equations (Hood and Koopmans, 1953), supposes that the outcome space Y is the
entire real line and combines an IV assumption of form (2.1) with the homogeneous-linear-response assumption

  yj(t) = βt + δj.   (4.1)
Here, treatments are real-valued, δj is an unobserved covariate and β is a slope parameter that takes the same value for all members of the population. The central finding is that assumptions (2.1) and (4.1) point-identify the response parameter β, provided that z is not mean independent of v. For example, numerous studies of the returns to schooling report estimates of β, interpreted as the common return that all members of the population experience per year of schooling. Empirical researchers have long used the HLR assumption even though it is not grounded in economic theory or other substantive reasoning. The literature has not provided compelling, or even suggestive, arguments in support of the hypothesis that response varies linearly with treatment and that all persons have the same response parameter. Manski (1997) and Manski and Pepper (2000) argue that much of the research that has used the HLR assumption could more plausibly use the MTR assumption. Consumer theory suggests that, ceteris paribus, the demand for a product weakly decreases as a function of the product's price. The theory of production suggests that, ceteris paribus, the output of a product weakly increases as a function of each input into the production process. Human capital theory suggests that, ceteris paribus, the wage that a worker earns weakly increases as a function of years of schooling. In these and other settings, MTR assumptions have a reasonably firm foundation. The above notwithstanding, it is important that researchers who want to maintain the HLR assumption should understand its implications when combined with other assumptions. This section studies identification when an MIV assumption of form (2.2) is combined with assumption (4.1).

4.1. The classical result

As background, we first give a proof of the classical result that originally appeared in Manski (1995, p. 152), with further exposition in Manski (2007, Chapter 8).
This proof is much simpler than those offered in traditional treatments of linear simultaneous equations. Moreover, it extends easily when we replace the IV assumption with an MIV assumption in the next section. Let u1 ∈ V and u2 ∈ V be any two points on the support of the distribution of covariate v. Given (4.1), assumption (2.1) states that

  E(δ|v = u2) = E(δ|v = u1).   (4.2)

Given (4.1), δj = yj − βzj for each person j. Hence,

  E(y − βz|v = u2) = E(y − βz|v = u1).   (4.3)

Solving (4.3) for β yields

  β = [E(y|v = u2) − E(y|v = u1)] / [E(z|v = u2) − E(z|v = u1)],   (4.4)
provided that the denominator is non-zero. The requirement that the denominator be non-zero is called the rank condition in the classical literature.
Each quantity on the right-hand side of (4.4) is point-identified. Hence, assumptions (2.1), (4.1) and the rank condition point-identify β. If V contains multiple (u1, u2) pairs that satisfy the rank condition, then there exist correspondingly multiple versions of equation (4.4). The parameter β must equal the right-hand side of all such equations. If the right-hand sides of all versions of (4.4) have the same value, β is said to be over-identified. If versions of (4.4) differ in value, either the IV or the HLR assumption is incorrect. Equation (4.4) takes a particularly simple form in the case of an ETS assumption. Let v = z. Then E(z|v = u2) − E(z|v = u1) = u2 − u1. Hence, the rank condition holds and (4.4) reduces to

  β = [E(y|z = u2) − E(y|z = u1)] / (u2 − u1).   (4.5)
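Equation (4.4) is a ratio of differences in cell means and can be computed directly from sample analogues. A minimal sketch under a simulated HLR model (hypothetical design with a binary instrument, not data from any study cited here):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta = 0.6                                    # homogeneous linear response slope

v = rng.integers(0, 2, size=n)                # binary instrument
delta = rng.normal(0.0, 1.0, size=n)          # unobserved covariate, mean-indep. of v
z = 1.0 * v + rng.normal(0.0, 1.0, size=n)    # treatment shifted by v (rank condition)
y = beta * z + delta                          # HLR model (4.1)

def ratio_estimate(y, z, v, u1, u2):
    """Sample analogue of (4.4): ratio of mean differences across instrument values."""
    num = y[v == u2].mean() - y[v == u1].mean()
    den = z[v == u2].mean() - z[v == u1].mean()
    return num / den

print(round(ratio_estimate(y, z, v, 0, 1), 1))   # ≈ 0.6
```

With a valid IV and the rank condition satisfied, the ratio recovers the common slope β.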
4.2. Weakening an IV to an MIV

Now replace IV assumption (2.1) with MIV assumption (2.2). Let V be an ordered set and let u2 > u1 be any two points on the support of v. Given (4.1), assumption (2.2) states that

  E(δ|v = u2) ≥ E(δ|v = u1).   (4.6)
Recall that δj = yj − βzj for each person j. Hence,

  E(y − βz|v = u2) ≥ E(y − βz|v = u1).   (4.7)
Solving for β yields the inequality

  β ≤ [E(y|v = u2) − E(y|v = u1)] / [E(z|v = u2) − E(z|v = u1)]  if E(z|v = u2) − E(z|v = u1) > 0,   (4.8a)

  β ≥ [E(y|v = u2) − E(y|v = u1)] / [E(z|v = u2) − E(z|v = u1)]  if E(z|v = u2) − E(z|v = u1) < 0.   (4.8b)
This proves

PROPOSITION 4.1. Let MIV assumption (2.2) and HLR assumption (4.1) hold. Then, β lies in the intersection of the inequalities (4.8) over (u1, u2) ∈ V × V such that u2 > u1. In the absence of other information, this bound is sharp.

Proposition 4.1 yields an informative bound on β if and only if z is not mean independent of v. Thus, the rank condition here is the same as when an IV assumption is combined with the HLR assumption. It may turn out that no value of β satisfies all of the inequalities in (4.8). If so, either assumption HLR or MIV is incorrect. In general, the bound in Proposition 4.1 does not point-identify β. However, the sign of β may be identified. Inspection of (4.8) shows that sgn(β) is identified as negative if there exists a u2 > u1 such that E(y|v = u2) − E(y|v = u1) < 0 and E(z|v = u2) − E(z|v = u1) > 0. Sgn(β) is identified as positive if there exists a u2 > u1 such that E(y|v = u2) − E(y|v = u1) < 0 and E(z|v = u2) − E(z|v = u1) < 0. Sgn(β) is not identified if E(y|v = u2) − E(y|v = u1) ≥ 0 for all u2 > u1. Inequalities (4.8) take a particularly simple form in the case of an MTS assumption. Let v = z. Then E(z|v = u2) − E(z|v = u1) = u2 − u1 > 0, so only (4.8a) is applicable. These
upper bounds on β reduce to

  β ≤ inf_{(u2, u1): u2 > u1} [E(y|z = u2) − E(y|z = u1)] / (u2 − u1).   (4.9)
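The bound (4.9) is the infimum of slope ratios over ordered pairs of realized treatments, again computable from cell means. A sketch under a simulated design (hypothetical, with MTS holding strictly, so the upper bound exceeds the true β):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n = 200_000
beta = 0.5

z = rng.integers(0, 4, size=n).astype(float)    # v = z: the MTS setting
delta = 0.2 * z + rng.normal(0.0, 1.0, size=n)  # E(delta|z = u) increasing in u (MTS)
y = beta * z + delta                            # HLR model (4.1)

def mts_hlr_upper_bound(y, z, support):
    """Sample analogue of (4.9): infimum of slope ratios over u2 > u1."""
    means = {u: y[z == u].mean() for u in support}
    return min((means[u2] - means[u1]) / (u2 - u1)
               for u1, u2 in combinations(sorted(support), 2))

ub = mts_hlr_upper_bound(y, z, [0.0, 1.0, 2.0, 3.0])
print(beta <= ub, round(ub, 1))   # bound ≈ 0.7 contains the true slope 0.5
```

Because selection is positively monotone here, every slope ratio overstates β, and (4.9) delivers an upper bound rather than a point.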
Thus, the MTS and HLR assumptions together imply an upper bound on β but no lower bound.

4.3. Restricted outcome spaces

The above analysis has supposed that the outcome space Y is the entire real line. Empirical researchers often apply the HLR assumption in settings where Y is a restricted part of the real line, perhaps a discrete set or a bounded interval. In such cases, the structure of Y implies constraints on β that must hold even in the absence of an IV or MIV assumption. These constraints, which typically are ignored in empirical studies, generally have no substantive interpretation and often have the nonsensical implication that β must equal zero. These difficulties are regularly overlooked in practice, so we call attention to them here, in the hope that empirical researchers will henceforth be more judicious in their applications of the HLR assumption. The source of the constraints is the fact that the outcomes generated by the HLR assumption must logically be elements of Y. That is, β must be such that βt + δj ∈ Y for all t ∈ T and j ∈ J. Recall that δj = yj − βzj for each person j. Hence, β must satisfy the constraints

  β(t − zj) + yj ∈ Y, ∀ t ∈ T and j ∈ J.   (4.10)
These constraints are unnatural when the outcome space is discrete. If Y and T are both discrete, at most a discrete set of β values can satisfy (4.10). If Y is discrete and T is a continuum, the only parameter value that satisfies (4.10) is β = 0. Hence, the HLR assumption is not sensible when the outcome space is discrete. The constraints imply a set of inequalities on β when the outcome space is a bounded interval. Let Y = [K0, K1]. Then (4.10) is the set of inequalities

  K0 ≤ β(t − zj) + yj ≤ K1, ∀ t ∈ T and j ∈ J.   (4.11)
Manipulation of (4.11) yields these inequalities on β:

  (K0 − yj)/(t1 − zj) ≤ β ≤ (K1 − yj)/(t1 − zj), ∀ j ∈ J,   (4.12a)

  (K1 − yj)/(t0 − zj) ≤ β ≤ (K0 − yj)/(t0 − zj), ∀ j ∈ J,   (4.12b)
where t0 ≡ inf T and t1 ≡ sup T. These inequalities generally lack substantive interpretation. Indeed, the only value satisfying (4.12a) is β = 0 if the population contains a member j for which (yj = K0, zj < t1) and a member k for which (yk = K1, zk < t1). Similarly, the only value satisfying (4.12b) is β = 0 if the population contains a member j for which (yj = K0, zj > t0) and a member k for which (yk = K1, zk > t0). Thus, the HLR assumption generally is not sensible when Y is a bounded interval.
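The force of constraints (4.11) is easy to see numerically. In the sketch below (hypothetical data with Y = [0, 1] and T = [0, 10]; it suffices to check t at the endpoints t0 and t1), the set of β values satisfying (4.11) collapses to a tiny interval around zero:

```python
import numpy as np

rng = np.random.default_rng(4)
K0, K1 = 0.0, 1.0                      # bounded outcome space Y = [0, 1]
t0, t1 = 0.0, 10.0                     # treatment set T = [0, 10]

z = rng.uniform(t0, t1, size=1000)     # hypothetical realized treatments
y = rng.uniform(K0, K1, size=1000)     # hypothetical realized outcomes in Y

# Constraint (4.11): K0 <= beta*(t - z_j) + y_j <= K1 for all t in T and all j.
lo, hi = -np.inf, np.inf
for t in (t0, t1):
    d = t - z                          # t - z_j; its sign flips the inequality
    lo = max(lo, np.max(np.where(d > 0, (K0 - y) / d, (K1 - y) / d)))
    hi = min(hi, np.min(np.where(d > 0, (K1 - y) / d, (K0 - y) / d)))

print(lo <= 0.0 <= hi, hi - lo < 0.05)   # a sliver of slopes around beta = 0
```

With 1000 observations spread over Y and T, observations with outcomes near the boundary of Y squeeze the feasible slopes toward zero, illustrating the text's point.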
5. ESTIMATION OF THE BOUNDS

Although identification usually is the dominant inferential problem in the analysis of treatment response, finite-sample statistical inference can be a serious concern as well. It is easy to show that analogue estimates of the bounds in Propositions 2.1 through 4.1 are consistent. Each bound is a continuous function of point-identified conditional means. When V is finite, these conditional means are consistently estimated by the corresponding sample averages. When V is a continuum, non-parametric regression methods may be used. The bounds under the MTS assumption in (2.9) and (2.13) have simple explicit forms, and the sampling distributions of analogue estimates are correspondingly simple. However, the sup and inf operations in inequalities (2.6) and (2.10) significantly complicate the bounds under other MIV assumptions, rendering it difficult to analyse the sampling behaviour of analogue estimates. Moreover, the methods for forming asymptotically valid confidence sets for partially identified parameters developed by Horowitz and Manski (2000), Imbens and Manski (2004), Chernozhukov et al. (2007), Beresteanu and Molinari (2008), Rosen (2008) and others appear not to apply. An important statistical concern, noted but not analysed in Manski and Pepper (2000), is that analogue estimates of bounds (2.6) and (2.10) have a finite-sample bias that makes the estimates tend to be narrower than the true bounds. Section 5.1 explains this bias, gives Monte Carlo evidence on its magnitude and describes a heuristically motivated bias-correction method. Section 5.2 considers analogue estimation of the bound of Proposition 4.1. Although the present discussion focuses on the estimation of MIV bounds, with obvious modifications it applies as well to estimation of the IV bounds of Manski (1990).

5.1. Estimation of the bounds of Propositions 2.1 and 2.2

The finite-sample bias of analogue estimates is most transparent when V is a finite set.
We restrict attention to this case and focus on the lower bound in (2.6). Analysis of the upper bound in (2.6) and the bounds in (2.10) is analogous. To begin, observe that the lower bound in (2.6) may be rewritten as

  max_{u1 ≤ u} E{y · 1[z = t] + K0 · 1[z ≠ t] | v = u1}.   (5.1)
Let a random sample of size N be drawn. Let N(u) be the sub-sample size with (v = u). Consider inference conditional on the vector N(V) ≡ [N(u), u ∈ V] of sub-sample sizes. The analogue estimate of (5.1) is

  max_{u1 ≤ u} E_{N(V)}{y · 1[z = t] + K0 · 1[z ≠ t] | v = u1},   (5.2)

where E_{N(V)} denotes the empirical mean. Suppose that all components of N(V) are positive, so the estimate exists. Each term in (5.2) is an unbiased estimate of the corresponding term in (5.1). It follows by Jensen's inequality that

  E*[ max_{u1 ≤ u} E_{N(V)}{y · 1[z = t] + K0 · 1[z ≠ t] | v = u1} ]
    ≥ max_{u1 ≤ u} E{y · 1[z = t] + K0 · 1[z ≠ t] | v = u1},   (5.3)
C. F. Manski and J. V. Pepper

Table 1. Bias of the analogue estimate of the MIV lower bound.

                M = 4                          M = 8
N        σ² = 1   σ² = 4   σ² = 25     σ² = 1   σ² = 4   σ² = 25
100       0.09     0.15     0.19        0.31     0.43     0.53
500       0.01     0.03     0.04        0.08     0.12     0.15
1000      0.01     0.01     0.02        0.04     0.07     0.09
where E^* denotes the expected value of the estimate across repeated samples of sizes N(V). The inequality in (5.3) is strict in the ordinary case where the distributions P{y · 1[z = t] + K_0 · 1[z ≠ t] | v = u_1}, u_1 ≤ u, are non-degenerate and have overlapping supports. The above shows that the analogue estimate of the lower bound is biased upwards for each vector N(V) such that the estimate exists. Similar analysis shows that the estimate of the upper bound is biased downwards. Thus, the mean estimate of the bound always is a subset of the true bound and ordinarily is a proper subset.

5.1.1. Monte Carlo evidence. To obtain a sense of the magnitude of the bias, we have performed a Monte Carlo experiment. Ceteris paribus, the bias is most serious when the no-assumption bound is the same for all values of v, implying that the MIV assumption has no identifying power. We consider such a setting. In particular,

(a) v has a multinomial distribution with M equal-probability mass points {1/M, 2/M, …, 1};
(b) T = {0, 1}, z_j = 1[v_j + ε_j > 0], and ε is distributed N(0, 1);
(c) y_j = min{max[−1.96, η_j], 1.96}, and η is distributed N(0, σ²); thus y is censored normal;
(d) the random variables η, v and ε are mutually statistically independent.

In this setting,

E\{y \cdot 1[z = t] + K_0 \cdot 1[z \ne t] \mid v = u_1\} = -1.96 \cdot P(z \ne t \mid v = u_1).  (5.4)

Let t = 1 and u = 1. Then (5.1) reduces to

-1.96 \cdot \min_{u_1 \le 1} P(z = 0 \mid v = u_1) = -1.96 \cdot P(z = 0 \mid v = 1) \cong -0.31  (5.5)
for all values of M and σ².

Fix values of N, M and σ². To measure the bias of the analogue estimate, we draw 1000 random samples of size N from the distribution of (η, v, ε) and compute the analogue estimate of the MIV lower bound for each sample. The bias is then measured as the difference between the average of the 1000 estimates and the true lower bound −0.31. Table 1 displays the bias for N ∈ {100, 500, 1000}, M ∈ {4, 8} and σ² ∈ {1, 4, 25}.

Qualitatively, the upward bias increases with M and σ² and decreases with N. These findings are sensible. The difference between the left- and right-hand sides of (5.3) increases with the dispersion of the estimates. Dispersion increases with σ² and M, while it decreases with N. Observe that the mean number of observations of y per value of v is N/M. Hence, for each value of v, the sample size for estimation of E{y · 1[z = t] + K_0 · 1[z ≠ t] | v} tends to increase with N and decrease with M.

Quantitatively, the bias is enormous when (N = 100, M = 8) for all values of σ². Indeed, the mean estimate of the bound is an empty interval when σ² ∈ {4, 25}. The bias is negligible when (N = 1000, M = 4) for all values of σ². Small to moderate biases occur at other values of (N, M, σ²).

More on monotone instrumental variables

Table 2. Bias of the Kreider–Pepper estimate of the MIV lower bound.

                M = 4                          M = 8
N        σ² = 1   σ² = 4   σ² = 25     σ² = 1   σ² = 4   σ² = 25
100       0.01     0.03     0.05        0.12     0.16     0.21
500      −0.01    −0.01    −0.01        0.02     0.03     0.04
1000     −0.01    −0.00    −0.01        0.00     0.01     0.02

5.1.2. Bias-corrected estimates. To counter the bias of the analogue estimate, it is natural to seek bias-corrected methods. Kreider and Pepper (2007) proposed a bootstrap bias-corrected estimator and applied it to their misreporting problem. The idea is to estimate the bias using the bootstrap distribution and then adjust the analogue estimate accordingly. For a random sample of size N, let T_N be the analogue estimate of the lower bound in (5.2) and let E_b(T_N) be the mean of this estimate under the bootstrap distribution. The bias is then estimated as E_b(T_N) − T_N, and the proposed bias-corrected estimator is 2T_N − E_b(T_N). Analysing a partial identification problem that is substantively different but has a similar mathematical structure, Haile and Tamer (2003) used a smoothing function to reduce the variability of analogue estimators across different values of an index. While both correction methods seem reasonable and tractable, neither yet has a firm theoretical foundation.

Evidence on the efficacy of these corrections in finite samples can be obtained from Monte Carlo experiments. We have assessed the Kreider and Pepper (2007) estimator using the simulation design described earlier. To do so, we conducted further simulations to assess the bootstrap distribution. For each of the 1000 random samples of size N, we drew 1000 random pseudo-samples of size N and used these pseudo-samples to compute the mean E_b(T_N) of the bootstrap distribution.
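To make the procedure concrete, the following is a minimal Python sketch of the Monte Carlo design of Section 5.1.1 and of the bootstrap bias correction 2T_N − E_b(T_N). The function names are our own, and we use 200 bootstrap draws rather than the paper's 1000; this is an illustration of the mechanics, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_sample(N, M, sigma2, rng):
    # Design of Section 5.1.1: v uniform on {1/M, ..., 1}, z = 1[v + eps > 0],
    # y a normal variate censored to [-1.96, 1.96].
    support = np.arange(1, M + 1) / M
    v = rng.choice(support, size=N)
    z = (v + rng.standard_normal(N) > 0).astype(int)
    y = np.clip(np.sqrt(sigma2) * rng.standard_normal(N), -1.96, 1.96)
    return y, z, v, support

def miv_lower_bound(y, z, v, support, K0=-1.96, t=1):
    # Analogue estimate (5.2) with u = max of the support: the max over u1 of
    # the cell mean of y*1[z = t] + K0*1[z != t] given v = u1.
    terms = []
    for u1 in support:
        sel = v == u1
        if not sel.any():
            return None  # estimate does not exist for this sample
        terms.append(np.mean(np.where(z[sel] == t, y[sel], K0)))
    return max(terms)

N, M, sigma2, B = 100, 4, 25, 200
y, z, v, support = draw_sample(N, M, sigma2, rng)
T_N = miv_lower_bound(y, z, v, support)

# Kreider-Pepper bootstrap bias correction: T_bc = 2*T_N - E_b(T_N).
boot = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)
    b = miv_lower_bound(y[idx], z[idx], v[idx], support)
    if b is not None:
        boot.append(b)
T_bc = 2 * T_N - np.mean(boot)
print(T_N, T_bc)  # the true lower bound in this design is about -0.31
```

Repeating the last step over many samples and averaging reproduces the bias comparison reported in Tables 1 and 2.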
The bias of the Kreider–Pepper estimator was then measured as the difference between the average of the 1000 estimates and the true lower bound −0.31. Table 2 displays the bias of the proposed estimator. Relative to the analogue estimator in (5.2), substantial reductions in bias are realized in cases where N = 100 or M = 8. For example, when (N = 100, M = 4, σ² = 25), the bias falls from 0.19 to 0.05. Overall, the biases are negligible (less than or equal to 0.05) except in the extreme case (N = 100, M = 8). The biases are moderate in this case, but considerably smaller than the analogous biases of the analogue estimator.

5.2. Estimation of the bound of Proposition 4.1

This final section considers estimation of the bound (4.8) obtained by combining an MIV assumption with the HLR assumption. We first examine the special case in which the MIV is
the realized treatment. Here, inequality (4.9) gave the resulting sharp upper bound on β. The argument of Section 5.1 shows that the analogue estimate of this bound is downward biased. As earlier, consider inference conditional on N(V). The analogue estimate of (4.9) is

\min_{(u_2, u_1):\, u_2 > u_1} \frac{E_{N(V)}(y \mid z = u_2) - E_{N(V)}(y \mid z = u_1)}{u_2 - u_1}.  (5.6)
Let all components of N(V) be positive, so the estimate exists. Each term in (5.6) is an unbiased estimate of the corresponding term in (4.9). It follows by Jensen's inequality that

E^*\left[\min_{(u_2, u_1):\, u_2 > u_1} \frac{E_{N(V)}(y \mid z = u_2) - E_{N(V)}(y \mid z = u_1)}{u_2 - u_1}\right] \le \min_{(u_2, u_1):\, u_2 > u_1} \frac{E(y \mid z = u_2) - E(y \mid z = u_1)}{u_2 - u_1}.  (5.7)

Thus, the estimate of the upper bound is biased downward for each vector N(V) such that the estimate exists; hence, it is biased downward conditional on existence.

Now consider the general form of Proposition 4.1, where β lies in the intersection of the inequalities (4.8). The analogue estimate of the upper bound is

\min_{(u_2, u_1) \in W_U} \frac{E_{N(V)}(y \mid v = u_2) - E_{N(V)}(y \mid v = u_1)}{E_{N(V)}(z \mid v = u_2) - E_{N(V)}(z \mid v = u_1)},  (5.8a)

where W_U ≡ {(u_2, u_1): u_2 > u_1 and E_{N(V)}(z|v = u_2) − E_{N(V)}(z|v = u_1) > 0}. Similarly, the estimate of the lower bound is

\max_{(u_2, u_1) \in W_L} \frac{E_{N(V)}(y \mid v = u_2) - E_{N(V)}(y \mid v = u_1)}{E_{N(V)}(z \mid v = u_2) - E_{N(V)}(z \mid v = u_1)},  (5.8b)

where W_L ≡ {(u_2, u_1): u_2 > u_1 and E_{N(V)}(z|v = u_2) − E_{N(V)}(z|v = u_1) < 0}.

The structure of this estimate is complex. In particular, the estimate is highly sensitive to small variations in E_{N(V)}(z|v = u_2) − E_{N(V)}(z|v = u_1) when this quantity is near zero, with a discontinuity at zero. Realizations of E_{N(V)}(z|v = u_2) − E_{N(V)}(z|v = u_1) that are near zero tend to occur frequently if the population mean difference E(z|v = u_2) − E(z|v = u_1) is near zero and/or the dispersion of its estimate is large. Hence, estimate (5.8) has subtle sampling behaviour in such cases. This is the MIV manifestation of the so-called weak instruments problem that has drawn much attention in the literature on analogue estimation under the HLR and IV assumptions (see, e.g., Nelson and Startz, 1990, Bound, Jaeger and Baker, 1995, and Staiger and Stock, 1997). Observe that this weak instruments problem does not occur in analogue estimation of the bounds of Propositions 2.1 and 2.2. In those cases, the estimate always varies continuously as a function of multiple sample averages. Even when an MIV has no identifying power at all, the bounds of Propositions 2.1 and 2.2 exist and their analogue estimates are consistent.
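The mechanics of (5.8a) and (5.8b) can be sketched for finite V. In the code below, the function name and the `tol` guard against near-zero denominators are our own additions; `tol = 0` reproduces the strict inequalities defining W_U and W_L, while a small positive `tol` is one crude way to screen out the weak-MIV pairs discussed above.

```python
import numpy as np

def hlr_miv_bounds(y, z, v, tol=0.0):
    """Analogue estimates (5.8a)-(5.8b) of the bounds on beta under the HLR
    assumption with MIV v: extrema over MIV pairs of the ratio of mean
    differences in y to mean differences in z."""
    support = np.unique(v)                       # sorted, so later entries are larger
    ey = {u: y[v == u].mean() for u in support}
    ez = {u: z[v == u].mean() for u in support}
    upper, lower = [], []
    for i, u1 in enumerate(support):
        for u2 in support[i + 1:]:               # pairs with u2 > u1
            dz = ez[u2] - ez[u1]
            dy = ey[u2] - ey[u1]
            if dz > tol:                         # pair belongs to W_U
                upper.append(dy / dz)
            elif dz < -tol:                      # pair belongs to W_L
                lower.append(dy / dz)
    ub = min(upper) if upper else np.inf
    lb = max(lower) if lower else -np.inf
    return lb, ub

# Deterministic illustration: y = 2*z and z = v on {0, 1, 2}, so every pair
# ratio equals 2; the estimated upper bound is 2 and W_L is empty.
v = np.repeat([0.0, 1.0, 2.0], 10)
z = v.copy()
y = 2.0 * z
lb, ub = hlr_miv_bounds(y, z, v)
```

With noisy data the ratios for pairs with small dz can be wild, which is exactly the sensitivity described in the text.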
ACKNOWLEDGEMENTS

This paper was prepared for the tenth anniversary issue of The Econometrics Journal. Our research on monotone instrumental variables (MIVs) was first circulated in 1998, the year that the journal began publication. We are grateful for this opportunity to report further findings on MIVs and, in doing so, to mark the tenth anniversary of both the journal and the subject. We have benefited from the comments of a referee. Manski's research was supported in part by NSF Grant SES-0549544.
REFERENCES

Beresteanu, A. and F. Molinari (2008). Asymptotic properties for a class of partially identified models. Econometrica 76, 763–814.
Blundell, R., A. Gosling, H. Ichimura and C. Meghir (2007). Changes in the distribution of male and female wages accounting for employment composition using bounds. Econometrica 75, 323–63.
Bound, J., D. Jaeger and R. Baker (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association 90, 443–50.
Chernozhukov, V., H. Hong and E. Tamer (2007). Estimation and confidence regions for parameter sets in econometric models. Econometrica 75, 1243–84.
Gerfin, M. and M. Schellhorn (2006). Nonparametric bounds on the effect of deductibles in health care insurance on doctor visits—Swiss evidence. Health Economics 15, 1011–20.
González, L. (2005). Nonparametric bounds on the returns to language skills. Journal of Applied Econometrics 20, 771–95.
Haile, P. and E. Tamer (2003). Inference with an incomplete model of English auctions. Journal of Political Economy 111, 1–51.
Hood, W. and T. Koopmans (Eds.) (1953). Studies in Econometric Method. New York: Wiley.
Horowitz, J. and C. Manski (2000). Nonparametric analysis of randomized experiments with missing covariate and outcome data. Journal of the American Statistical Association 95, 77–84.
Kreider, B. and S. Hill (2008). Partially identifying treatment effects with an application to covering the uninsured. Forthcoming in Journal of Human Resources.
Kreider, B. and J. Pepper (2007). Disability and employment: reevaluating the evidence in light of reporting errors. Journal of the American Statistical Association 102, 432–41.
Kreider, B. and J. Pepper (2008). Inferring disability status from corrupt data. Journal of Applied Econometrics 23, 329–49.
Manski, C. (1990). Nonparametric bounds on treatment effects. American Economic Review Papers and Proceedings 80, 319–23.
Manski, C. (1995). Identification Problems in the Social Sciences. Cambridge, MA: Harvard University Press.
Manski, C. (1997). Monotone treatment response. Econometrica 65, 1311–34.
Manski, C. (2003). Partial Identification of Probability Distributions. New York: Springer-Verlag.
Manski, C. (2007). Identification for Prediction and Decision. Cambridge, MA: Harvard University Press.
Manski, C. and J. Pepper (1998). Monotone instrumental variables: with an application to the returns to schooling. Technical Working Paper t0224, National Bureau of Economic Research.
Manski, C. and J. Pepper (2000). Monotone instrumental variables: with an application to the returns to schooling. Econometrica 68, 997–1010.
Manski, C. and E. Tamer (2002). Inference on regressions with interval data on a regressor or outcome. Econometrica 70, 519–46.
Nelson, C. and R. Startz (1990). The distribution of the instrumental variable estimator and its t-ratio when the instrument is a poor one. Journal of Business 63, S125–40.
Rosen, A. (2008). Confidence sets for partially identified parameters that satisfy a finite number of moment inequalities. Journal of Econometrics 146, 107–17.
Staiger, D. and J. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65, 557–86.
APPENDIX

Proof of Lemma 3.1: Let u_2 ≥ u_1. We need to show that E[y(t)|v = u_2] ≥ E[y(t)|v = u_1]. The assumptions imply that

E[y(t) \mid v = u_2] - E[y(t) \mid v = u_1]
= E[y^*(t, \zeta) \mid v = u_2] - E[y^*(t, \zeta) \mid v = u_1]
= E[g(t, \zeta, \varepsilon) \mid v = u_2] - E[g(t, \zeta, \varepsilon) \mid v = u_1]
= \int g(t, \zeta, \varepsilon)\, dP(\zeta, \varepsilon \mid v = u_2) - \int g(t, \zeta, \varepsilon)\, dP(\zeta, \varepsilon \mid v = u_1)
= \int \left[\int g(t, \zeta, \varepsilon)\, dP(\zeta \mid v = u_2) - \int g(t, \zeta, \varepsilon)\, dP(\zeta \mid v = u_1)\right] dP(\varepsilon)
\ge 0.

The first and second equalities apply (3.1) and (3.4). The third equality writes the expectations explicitly as integrals. The fourth equality applies the assumption that ε is statistically independent of (ζ, v). The final inequality applies the assumptions that g(t, ·, ε) is monotone and that P(ζ|v = u_2) weakly dominates P(ζ|v = u_1).
The Econometrics Journal (2009), volume 12, pp. S217–S229. doi: 10.1111/j.1368-423X.2008.00263.x

Two-step series estimation of sample selection models

WHITNEY K. NEWEY
Department of Economics, E52-262D, Massachusetts Institute of Technology, 50 Memorial Drive, Cambridge, MA 02142, USA
E-mail: [email protected]

First version received: July 2008; final version accepted: September 2008

Summary: Sample selection models are important for correcting the effects of non-random sampling. This paper is about semiparametric estimation using a series approximation to the correction term. Regression spline and power series approximations are considered. Asymptotic normality and consistency of an asymptotic variance estimator are shown.

Keywords: Sample selection models, Semiparametric estimation, Series estimation, Two-step estimation.
1. INTRODUCTION

Sample selection models provide an approach to correcting for non-random sampling that is important in econometrics. Pioneering work in this area includes Gronau (1973) and Heckman (1974). This paper is about two-step estimation of these models without restricting the functional form of the selection correction. The estimators are particularly simple, using polynomial or spline approximations to correct for selection. A consistent estimator of the asymptotic variance is given and asymptotic normality is shown.

Some of the estimators considered here are similar to two-step least-squares estimators with flexible correction terms previously proposed by Lee (1982) and Heckman and Robb (1987). This paper adds to the menu of flexible correction terms by including regression splines and types of power series that have not been considered before. Also, the theory here allows the functional form of the correction to be entirely unknown, with the number of approximating functions growing with the sample size to achieve √n-consistency and asymptotic normality. These are semiparametric estimation results for sample selection models.

There are several prior papers on this subject. Gallant and Nychka (1987) gave a consistent sieve maximum likelihood estimator for a model with disturbances independent of regressors. Cosslett (1991) proposed a consistent two-step series estimator of that model in which the first step is the non-parametric maximum likelihood estimator of the selection equation. Powell (2001) developed a density-weighted kernel estimator for a conditional mean model and showed √n-consistency and asymptotic normality. This paper is most closely related to Powell (2001), in proposing a two-step estimator with a semiparametric first step and deriving distribution theory. The series estimators analysed here have the virtue of being extremely easy to implement. Practical experience with these estimators reported in Newey, Powell and Walker (1990) also suggests they can be efficient relative to density-weighted estimators.
The model we consider imposes a conditional moment restriction on the second stage. Newey and Powell (1993) gave the semiparametric efficiency bound for this model, though no efficient estimator has yet been presented. Under the stronger restriction of independence of disturbances and instruments, Lee (1994) gave an efficient estimator. The model and estimators of Ahn and Powell (1993) allowed for a non-parametric first stage, and Das et al. (2003) allow for the equation of interest to be non-parametric as well. As usual, imposing a correct parametric form can lead to large efficiency gains, especially with high-dimensional models. Thus, the semiparametric estimators given here can be useful alternatives to fully non-parametric estimators.

Section 2 of this paper presents the model and discusses identification. The estimators are described in Section 3, and Section 4 gives the asymptotic theory.
2. THE MODEL AND IDENTIFICATION

To describe the model, let y denote a scalar dependent variable of interest, x a vector of regressors that can affect y and w a vector of first-stage regressors that include x. Also, let d ∈ {0, 1} be the indicator of selection and v(w, α) a known function that determines the selection probability. The model considered here is

y = x'\beta_0 + \xi;  y observed only if d = 1,
E[\xi \mid w, d = 1] = E[\xi \mid v(w, \alpha_0), d = 1],
\mathrm{Prob}(d = 1 \mid w) = \pi(v(w, \alpha_0)),  (2.1)

where π(v) is an unknown function. Thus, we assume that conditional on selection the mean of the disturbance depends only on the index v = v(w, α_0). This restriction is implied by other familiar conditions, such as independence of disturbances and regressors (see Powell, 1994). A basic implication of this model is that

E[y \mid w, d = 1] = x'\beta_0 + h_0(v),  h_0(v) = E[\xi \mid w, d = 1].  (2.2)
The function h_0(v) is a familiar selection correction. For example, if d = 1(v + ξ̃ ≥ 0), (ξ, ξ̃) is independent of w, ξ̃ has a standard normal distribution and E[ξ|ξ̃] is linear in ξ̃, then h_0(v) = φ(v)/Φ(v), where Φ(v) and φ(v) are the standard normal CDF and PDF, respectively. This is the correction term considered by Heckman (1976). In this paper, we allow h_0(v) to have an unknown functional form.

Equation (2.2) is an additive semiparametric regression like that considered by Robinson (1988), except that the variable v = v(w, α_0) depends on unknown parameters. Making use of this information is important for identification: ignoring the structure implied by equation (2.1) and regarding h_0 as an unknown function of w would mean that β_0 is not identified in this model. The identification condition for this paper is stated, for u_i = d_i(x_i − E[x_i|v_i, d_i = 1]), as follows.

ASSUMPTION 2.1. M = E[u_i u_i'] is non-singular; i.e. for any λ ≠ 0 there is no measurable function f(v) such that x'λ = f(v) when d = 1.

This condition was imposed by Cosslett (1991) and is the selection-model version of Robinson's (1988) identification condition for additive semiparametric regression. As shown by Chamberlain (1986), this condition is not necessary for identification, but it is necessary for existence of a
(regular) √n-consistent estimator. It is important to note that this condition does not allow for a constant term in x, because a constant is not separately identified from h_0(v). More primitive conditions for Assumption 2.1 are available in some cases. A simple sufficient condition is that Var(x) is non-singular and the conditional distribution of v given x has an absolutely continuous component whose conditional density is positive on the entire real line for almost all x. An obvious necessary condition is that v not be a linear combination of x, requiring that something in v be excluded from x. Such an exclusion restriction is implied by many economic models, where d is a choice variable and v includes a price variable for d = 0.

Identification of β_0 from equation (2.2) also requires identification of α_0. Here, no specific assumptions will be imposed, in order to allow flexibility in the choice of an estimator of α_0. Of course, consistency of α̂ will imply identification of α_0, but different consistent estimators α̂ may correspond to different identifying assumptions. For brevity, a menu of different assumptions is not discussed here.
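The Gaussian benchmark for h_0 can be checked numerically. Under the textbook model above, E[ξ | ξ̃ ≥ −v] is proportional to the inverse Mills ratio φ(v)/Φ(v); with ξ = ρξ̃ + noise, the proportionality constant is ρ. The following sketch (our own construction, not from the paper) verifies this by simulation:

```python
import numpy as np
from math import erf, exp, pi, sqrt

rng = np.random.default_rng(1)

def inv_mills(v):
    # phi(v)/Phi(v): the Heckman correction term h0(v) up to scale.
    phi = exp(-0.5 * v * v) / sqrt(2.0 * pi)
    Phi = 0.5 * (1.0 + erf(v / sqrt(2.0)))
    return phi / Phi

# Check E[xi | xitilde >= -v] = rho * phi(v)/Phi(v) by Monte Carlo.
rho, v, n = 0.5, 0.3, 1_000_000
xitilde = rng.standard_normal(n)
xi = rho * xitilde + sqrt(1.0 - rho**2) * rng.standard_normal(n)
selected = xitilde >= -v          # selection event d = 1(v + xitilde >= 0)
mc = xi[selected].mean()          # simulated conditional mean
theory = rho * inv_mills(v)       # Gaussian selection correction
print(mc, theory)
```

The two printed numbers agree to Monte Carlo accuracy, illustrating why the inverse Mills ratio is the natural leading term when the Gaussian model is close to correct.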
3. ESTIMATION

The type of estimator we consider is a two-step estimator, where the first step is a semiparametric estimator α̂ of the selection parameters α_0 and the second step is least-squares regression on x and approximating functions of v̂ = v(w, α̂) in the selected data. These estimators are analogous to Heckman's (1976) two-step procedure for the Gaussian-disturbances case. The difference is that α is estimated by a distribution-free method rather than by probit, and a non-parametric approximation to h(v) is used in the second-step regression rather than the inverse Mills ratio.

There are many distribution-free estimators available for the first step for various models of selection, including those of Manski (1975) when there is a conditional median restriction, Cosslett (1983) when the selection disturbance is independent of regressors, and Ruud (1986). We will assume that the first-step estimator α̂ is √n-consistent, like the estimators of Powell, Stock and Stoker (1989), Ichimura (1993) and Cavanagh and Sherman (1998). The asymptotic variance of β̂ will be an increasing function of the asymptotic variance of α̂, so whatever selection model is considered, an efficient first-step estimator would be good. In this model, where Pr(d = 1|w) = Pr(d = 1|v), the efficient estimator of Klein and Spady (1993) would give the most efficient two-step estimator.

The second step consists of a linear regression of y on x and functions of v̂ that can approximate h_0(v). To describe the estimator, let τ(v, η) denote some strictly monotonic transformation of v, depending on parameters η. This transformation is useful for adjusting the location and scale of v, as discussed below. Let p^K(τ) = (p_{1K}(τ), …, p_{KK}(τ))' be a vector of functions with the property that for large K a linear combination of p^K(τ) can approximate an unknown function of τ. Suppose that the data are z_i = (d_i, w_i, d_i y_i), i = 1, …, n, assumed throughout to be i.i.d. Let η̂ denote an estimator of η, v̂_i = v(w_i, α̂), τ̂_i = τ(v̂_i, η̂) and p̂_i = p^K(τ̂_i), where the K superscript on p̂_i is suppressed for notational convenience. For X = [d_1 x_1, …, d_n x_n]', Y = (d_1 y_1, …, d_n y_n)', P̂ = [d_1 p̂_1, …, d_n p̂_n]' and Q̂ = P̂(P̂'P̂)^{-1}P̂', the estimator is

\hat\beta = \hat M^{-1} X'(I - \hat Q)Y/n,  \hat M = X'(I - \hat Q)X/n,  (3.1)

where the inverses will exist in large samples under conditions discussed below. The estimator β̂ is the coefficient on x_i in the regression of y_i on x_i and p̂_i in the selected data.
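The second step in (3.1) is just partialling out the series terms. A minimal numerical sketch, under illustrative assumptions of our own (the first-step index v is treated as known, τ = v, a power-series basis, and a hypothetical correction h_0(v) = sin(v)):

```python
import numpy as np

rng = np.random.default_rng(2)

def two_step_series(y, x, d, tau, K):
    """Second step of (3.1): regress d_i*y_i on d_i*x_i and the power series
    p_k(tau_i) = tau_i**(k-1), k = 1..K, in the selected data; return the
    coefficient vector on x (beta_hat) via partialling out."""
    sel = d.astype(bool)
    P = np.vander(tau[sel], K, increasing=True)   # [1, tau, ..., tau**(K-1)]
    X, Y = x[sel], y[sel]
    # Residuals from projecting X and Y on the series terms, i.e. (I - Q_hat).
    coefX, *_ = np.linalg.lstsq(P, X, rcond=None)
    coefY, *_ = np.linalg.lstsq(P, Y, rcond=None)
    u = X - P @ coefX
    e = Y - P @ coefY
    beta, *_ = np.linalg.lstsq(u, e, rcond=None)
    return beta

# Illustrative design: y = x*beta0 + h0(v) + noise, observed only if d = 1.
n, beta0 = 20_000, 1.5
w = rng.standard_normal(n)                  # excluded first-stage variable
x = rng.standard_normal((n, 1))             # no constant in x (Assumption 2.1)
v = x[:, 0] + w                             # first-step index, taken as known here
d = (v + rng.standard_normal(n) > 0).astype(float)
y = x[:, 0] * beta0 + np.sin(v) + 0.1 * rng.standard_normal(n)
beta_hat = two_step_series(y, x, d, v, K=6)
print(beta_hat)
```

The estimate is close to beta0 = 1.5 even though sin(v) is never specified to the estimator; only its series approximation is used.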
This estimator depends on the choice of approximating functions and transformation. Here, we consider two kinds of approximating functions: power series and splines. For power series, the approximating functions are given by

p_{kK}(\tau) = \tau^{k-1}.  (3.2)

Depending on the transformation τ(v, η), this power series can lead to several different types of sample selection corrections. Three examples are a power series in the index v̂, in the inverse Mills ratio φ(·)/Φ(·), or in the normal CDF Φ(·). When a non-linear transformation of v is used (e.g. for a power series in Φ), it may be appropriate to undo a location and scale normalization imposed on most semiparametric estimators of v(w, α). To this end, let η̂ = (η̂_1, η̂_2)' be the coefficients from probit estimation with regressors (1, v̂_i), where we do not impose normality (but will require that η̂ be a √n-consistent estimator of some population parameter). The transformed observations for the three examples are then

\hat\tau_i = \hat v_i,  (3.3)

\hat\tau_i = \phi(\hat\eta_1 + \hat\eta_2 \hat v_i)/\Phi(\hat\eta_1 + \hat\eta_2 \hat v_i),  (3.4)

\hat\tau_i = \Phi(\hat\eta_1 + \hat\eta_2 \hat v_i).  (3.5)
The power series in equation (3.3) has the index v̂_i itself as its leading term. The one from equation (3.4) has leading term given by the inverse Mills ratio, so that the first term is the Heckman (1976) correction. It also has approximating functions that preserve a shape property of h_0(v) that holds when d = 1(v + ξ̃ ≥ 0) and ξ̃ is independent of w, namely that h_0(v) goes to zero as v gets large. The last example corresponds to a power series in the selection probability for Gaussian ξ̃.

Replacing the power series by corresponding polynomials that are orthogonal with respect to some weight function may help avoid multicollinearity. For example, for τ̂_u ≡ max_{i ≤ n, d_i = 1} τ(v̂_i, η̂) and τ̂_l ≡ min_{i ≤ n, d_i = 1} τ(v̂_i, η̂), one could replace τ^{k−1} by a polynomial of order k that is orthogonal for the uniform weight on [−1, 1], evaluated at τ̂_i = [2τ(v̂_i, η̂) − τ̂_u − τ̂_l]/(τ̂_u − τ̂_l). Of course, β̂ is not affected by such a replacement, since it is just a non-singular linear transformation of the power series.

An alternative approximation that is better in several respects than power series is splines, which are piecewise polynomials. Splines are less sensitive to outliers and to singularities in the function being approximated. Also, as discussed below, asymptotic normality holds under weaker conditions for splines than for power series. For theoretical convenience, attention is limited to splines with evenly spaced knots on [−1, 1]. For b_+ ≡ 1(b > 0) · b, a spline of degree m in τ with L evenly spaced knots on [−1, 1] can be based on

p_{kK}(\tau) = \tau^{k-1},  1 \le k \le m + 1,
p_{kK}(\tau) = \{[\tau + 1 - 2(k - m - 1)/(L + 1)]_+\}^m,  m + 2 \le k \le m + 1 + L \equiv K.  (3.6)

An alternative, equivalent series that is less subject to multicollinearity problems is B-splines (see, e.g., Powell, 1981). Fixed, evenly spaced knots are restrictive and are motivated by theoretical convenience. Allowing the knots to be estimated might improve the approximation, but would make computation more difficult and require substantial modification to the theory of Section 4, which relies on linear-in-parameters approximations.
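The truncated-power basis in (3.6) is straightforward to construct. A small sketch (function name ours) that builds the K = m + 1 + L columns:

```python
import numpy as np

def spline_basis(tau, m, L):
    """Truncated power spline of degree m with L evenly spaced interior knots
    on [-1, 1], as in equation (3.6); returns an array with K = m + 1 + L columns."""
    tau = np.asarray(tau, dtype=float)
    cols = [tau**k for k in range(m + 1)]            # 1, tau, ..., tau**m
    for k in range(m + 2, m + 2 + L):                # the L knot terms
        knot = -1.0 + 2.0 * (k - m - 1) / (L + 1)    # evenly spaced in (-1, 1)
        cols.append(np.clip(tau - knot, 0.0, None) ** m)
    return np.column_stack(cols)

# Example: quadratic spline (m = 2) with L = 3 knots at -0.5, 0, 0.5.
B = spline_basis(np.linspace(-1.0, 1.0, 5), m=2, L=3)
```

In practice one would swap this for a B-spline basis when K is large, as the text suggests; the fitted regression is identical because the two bases span the same space.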
For inference, it is important to have a consistent estimator of the asymptotic variance of β̂. This can be formed by treating the approximation as if it were exact and using formulae for parametric two-step estimators such as those of Newey (1984). The estimator will depend on a consistent estimator V̂_α̂ of the asymptotic variance of √n(α̂ − α_0). Let β̂ and γ̂ be the estimates from the regression of d_i y_i on d_i x_i and d_i p̂_i, let ε̂_i = d_i(y_i − x_i'β̂ − p̂_i'γ̂) be the corresponding residual, and let ĥ(v) = p^K(τ(v, η̂))'γ̂ be the estimate of h(v) obtained from this regression. Define û = (I − Q̂)X to be the matrix of residuals from the regression of d_i x_i on d_i p̂_i, so that X'(I − Q̂)X = û'û, and let

\hat V = \hat M^{-1}\left[\sum_{i=1}^n \hat u_i \hat u_i' \hat\varepsilon_i^2/n + \hat H \hat V_{\hat\alpha} \hat H'\right]\hat M^{-1},
\hat H = \sum_{i=1}^n \hat u_i\, [\partial \hat h(\hat v_i)/\partial v]\, [\partial v(w_i, \hat\alpha)/\partial\alpha']/n.  (3.7)

This estimator is the sum of two terms: the first is the White (1980) specification-robust variance estimator for the second-step regression, and the second accounts for the first-stage estimation of the parameters of the selection equation. This estimator will be consistent for the asymptotic variance of √n(β̂ − β_0) under the conditions of Section 4. Note the normalization by the total sample size n rather than by the number of observations in the selected sample. For example, a 95 per cent asymptotic confidence interval for β_j is

[\hat\beta_j - 1.96\,\hat V_{jj}^{1/2}/\sqrt{n},\; \hat\beta_j + 1.96\,\hat V_{jj}^{1/2}/\sqrt{n}].
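Assembling (3.7) from its ingredients is mechanical. The sketch below (our own helper; the inputs û, ε̂, ∂ĥ/∂v, ∂v/∂α and V̂_α̂ are assumed to be precomputed from the two steps) shows the two-term structure:

```python
import numpy as np

def two_step_variance(u, eps, dh_dv, dv_dalpha, V_alpha):
    """Variance estimator (3.7):
    V = Minv @ (sum_i u_i u_i' eps_i**2 / n + H V_alpha H') @ Minv,
    H = sum_i u_i (dh(v_i)/dv) (dv(w_i, alpha)/dalpha)' / n,
    normalized by the full sample size n (rows with d_i = 0 should be zero).
    u: (n, p) residualized regressors; eps: (n,) residuals;
    dh_dv: (n,); dv_dalpha: (n, q); V_alpha: (q, q)."""
    n = u.shape[0]
    Minv = np.linalg.inv(u.T @ u / n)
    white = (u * eps[:, None]**2).T @ u / n          # sum u_i u_i' eps_i**2 / n
    H = (u * dh_dv[:, None]).T @ dv_dalpha / n       # first-stage correction term
    return Minv @ (white + H @ V_alpha @ H.T) @ Minv

# Shape check with synthetic inputs (p = 2 regressors, q = 3 first-step parameters).
rng = np.random.default_rng(4)
n, p, q = 500, 2, 3
u = rng.standard_normal((n, p))
eps = rng.standard_normal(n)
dh_dv = rng.standard_normal(n)
dv_dalpha = rng.standard_normal((n, q))
A = rng.standard_normal((q, q))
V_alpha = A @ A.T                                    # any positive semi-definite matrix
V = two_step_variance(u, eps, dh_dv, dv_dalpha, V_alpha)
# a 95% interval for beta_j would then be beta_j +/- 1.96 * (V[j, j] / n) ** 0.5
```

Dropping the H V_alpha H' term gives the usual heteroskedasticity-robust variance that ignores first-stage estimation, which understates uncertainty.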
4. ASYMPTOTIC NORMALITY

Some regularity conditions will be used to show consistency and asymptotic normality. The first condition concerns the first-stage estimator.

ASSUMPTION 4.1. There exists ψ(w, d) such that, for ψ_i = ψ(w_i, d_i), √n(α̂ − α_0) = Σ_{i=1}^n ψ_i/√n + o_p(1), E[ψ_i] = 0 and E[ψ_i ψ_i'] exists and is non-singular. Also, V̂_α̂ →_p V_α̂ = E[ψ_i ψ_i'].

This condition requires that α̂ be asymptotically equivalent to a sample average that depends only on w and d. It is satisfied by many semiparametric estimators of binary choice models, such as that of Klein and Spady (1993), but not by estimators that are not √n-consistent, such as that of Manski (1975). The next condition imposes moment conditions on the second stage.

ASSUMPTION 4.2. For some δ > 0, E[d‖x‖^{2+δ}] < ∞, Var(x|v, d = 1) is bounded and, for ε ≡ d(y − x'β_0 − h_0(v)), E[ε²|v, d = 1] is bounded.

The bounded conditional variance assumptions are standard in the literature and are not very restrictive here, because v will also be bounded, due to the condition below that τ has compact support. To control the bias of the estimator it is essential to impose some smoothness conditions on functions of v.
ASSUMPTION 4.3. h_0(v) and E[x|v, d = 1] are continuously differentiable in v, of orders s ≥ 1 and t ≥ 1, respectively.

We also require that the transformation τ satisfy some properties.

ASSUMPTION 4.4. There is η_0 with √n(η̂ − η_0) = O_p(1); the distribution of τ(v(w, α_0), η_0) has an absolutely continuous component with PDF bounded away from zero on its support, which is compact. Also, the first and second partial derivatives of v(w_i, α) and τ(v, η) with respect to α, v and η are bounded for α and η in neighbourhoods of α_0 and η_0, respectively.

The first condition of this assumption means that the density of τ_i is bounded away from zero, which is useful for series estimation but restrictive. For example, if v = x_1 + x_2, where x_1 and x_2 are continuously distributed and independent, then the density of v, being a convolution of the densities of x_1 and x_2, is everywhere continuous and hence cannot be bounded away from zero. It would be useful to weaken this condition, but doing so would be difficult and is beyond the scope of this paper. The next assumption imposes growth rate conditions on the number of approximating terms.

ASSUMPTION 4.5. K = K_n such that √n K^{−s−t+1} → 0 and (a) p^K(τ) is a power series, s ≥ 5 and K^7/n → 0; or (b) p^K(τ) is a spline with m ≥ t − 1, s ≥ 3 and K^4/n → 0.

Here, splines require the minimum smoothness conditions and the least stringent growth rate for the number of terms, with h_0(v) only required to be three times continuously differentiable. It is also of note that this assumption does not require undersmoothing. The presence of t in the rate conditions means that smoothness in E[x|v, d = 1] can compensate for lack of smoothness in h_0(v), so that the bias of ĥ(v) does not have to go to zero faster than the variance. This absence of an undersmoothing requirement is a feature of series estimators that has been noted previously in Donald and Newey (1994) and Newey (1994).
Asymptotic normality of the two-step least-squares estimator and consistency of the estimator of its asymptotic covariance matrix follow from the previous conditions. Let u_i = d_i{x_i − E[x_i|v_i, d_i = 1]}, Ω = E[ε_i² u_i u_i'] and H = E[u_i {∂h_0(v_i)/∂v_i} ∂v(w_i, α_0)/∂α'].

THEOREM 4.1. If Assumptions 2.1 and 4.1–4.5 are satisfied and Ω is non-singular, then for V = M^{-1}(Ω + H V_α̂ H')M^{-1},

\sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} N(0, V),  \hat V \xrightarrow{p} V.

This result gives √n-consistency and asymptotic normality of the series estimators considered in this paper, which are useful for large-sample inference. In comparison with Powell (2001), the basic regression equation (2.1) is slightly more general in allowing for a non-linear first stage, while Powell (2001) allows identification from instrumental variables. The identification condition of Assumption 2.1 is the same as in previous work with exogenous x. The other conditions are stronger in some respects and weaker in others than those of Powell (2001). Here, we require fewer moments to exist. Also, for splines we require fewer derivatives of h_0(v) to exist and do not require derivatives of the density of v to exist, but we do impose that v is bounded.

It would also be useful to have a way of choosing the number K of approximating functions in practice. A K that minimizes a goodness-of-fit criterion for the selection correction, such as
cross-validation on the equation of interest, should satisfy the rate conditions of Assumption 4.5, when E[x|v, d = 1] has sufficiently many derivatives. In Newey et al. (1990), such a criterion was used and gave reasonable results. However, the results of Donald and Newey (1994) and Linton (1995) for the partially linear model suggest that it may be optimal for estimation of β to undersmooth, meaning K should be larger than the minimum of a goodness-of-fit criterion. Such results are beyond the scope of this paper, but remain an important topic for future research.
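As an illustration of how the two-step estimator and a goodness-of-fit choice of K might look in practice, here is a small simulation. It is a sketch under assumptions of our own, not code from the paper: the data-generating process, the tanh transform standing in for τ, and the split-sample cross-validation rule are all illustrative choices, and the √n-consistent first step is mimicked rather than re-estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
w = rng.normal(size=(n, 2))
x = w[:, 0]                                  # outcome regressor
alpha0, beta0 = np.array([1.0, 1.0]), 1.5
v = w @ alpha0                               # first-step index v(w, alpha0)
eta = rng.normal(size=n)
d = v + eta > 0                              # selection indicator
eps = 0.5 * eta + 0.5 * rng.normal(size=n)   # selection on unobservables
y = beta0 * x + np.sin(v) + eps              # outcome, used only when d = 1

# Step 1: a root-n-consistent alpha-hat (mimicked here for brevity)
alpha_hat = alpha0 + rng.normal(size=2) / np.sqrt(n)
tau = np.tanh(w @ alpha_hat)                 # bounded transform of the index

def design(idx, K):
    # x plus the power series p^K(tau); the series part absorbs h0 and
    # the selection-correction term E[eps | v, d = 1]
    return np.column_stack([x[idx], np.vander(tau[idx], K, increasing=True)])

sel = np.flatnonzero(d)
fold1, fold2 = sel[::2], sel[1::2]           # split-sample cross-validation

def cv_mse(K):
    c, *_ = np.linalg.lstsq(design(fold1, K), y[fold1], rcond=None)
    return np.mean((y[fold2] - design(fold2, K) @ c) ** 2)

K_star = min(range(2, 9), key=cv_mse)
coef, *_ = np.linalg.lstsq(design(sel, K_star), y[sel], rcond=None)
print(K_star, coef[0])                       # coef[0] is the estimate of beta0
```

With these illustrative choices, coef[0] should land close to β₀ = 1.5. As the surrounding text notes, results such as Donald and Newey (1994) and Linton (1995) suggest that for estimating β a K somewhat larger than the cross-validation minimum may be preferable.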
REFERENCES

Ahn, H. and J. L. Powell (1993). Semiparametric estimation of censored selection models with a nonparametric selection mechanism. Journal of Econometrics 58, 3–29.
Cavanagh, C. and R. P. Sherman (1998). Rank estimators for monotonic index models. Journal of Econometrics 84, 351–81.
Chamberlain, G. (1986). Asymptotic efficiency in semiparametric models with censoring. Journal of Econometrics 32, 189–218.
Cosslett, S. R. (1983). Distribution-free maximum likelihood estimator of the binary choice model. Econometrica 51, 765–82.
Cosslett, S. R. (1991). Semiparametric estimation of a regression model with sample selectivity. In W. A. Barnett, J. L. Powell and G. Tauchen (Eds.), Nonparametric and Semiparametric Methods in Econometrics and Statistics, 175–97. Cambridge: Cambridge University Press.
Das, M., W. K. Newey and F. Vella (2003). Nonparametric estimation of sample selection models. Review of Economic Studies 70, 33–58.
Donald, S. G. and W. K. Newey (1994). Series estimation of semilinear models. Journal of Multivariate Analysis 50, 30–40.
Gallant, A. R. and D. W. Nychka (1987). Semi-nonparametric maximum likelihood estimation. Econometrica 55, 363–90.
Gronau, R. (1973). The effects of children on the housewife's value of time. Journal of Political Economy 81, S168–S199.
Heckman, J. J. (1974). Shadow prices, market wages, and labor supply. Econometrica 42, 679–93.
Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator of such models. Annals of Economic and Social Measurement 5, 475–92.
Heckman, J. J. and R. Robb (1987). Alternative methods for evaluating the impact of interventions. In J. Heckman and B. Singer (Eds.), Longitudinal Analysis of Labor Market Data, 156–245. Cambridge, UK: Cambridge University Press.
Ichimura, H. (1993). Estimation of single index models. Journal of Econometrics 58, 71–120.
Klein, R. W. and R. S. Spady (1993). An efficient semiparametric estimator for discrete choice models. Econometrica 61, 387–421.
Lee, L. F. (1982). Some approaches to the correction of selectivity bias. Review of Economic Studies 49, 355–72.
Lee, L. F. (1994). Semiparametric two stage estimation of sample selection models subject to Tobit selection rules. Journal of Econometrics 61, 305–44.
Linton, O. (1995). Second order approximation in a partially linear regression model. Econometrica 63, 1079–1112.
Manski, C. (1975). Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics 3, 205–28.
Newey, W. K. (1984). A method of moments interpretation of sequential estimators. Economics Letters 14, 201–06.
Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica 62, 1349–82.
Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics 79, 147–68.
Newey, W. K. and J. L. Powell (1993). Efficiency bounds for semiparametric selection models. Journal of Econometrics 58, 169–84.
Newey, W. K., J. L. Powell and J. R. Walker (1990). Semiparametric estimation of selection models: Some empirical results. American Economic Review Papers and Proceedings 80, 324–28.
Powell, M. J. D. (1981). Approximation Theory and Methods. Cambridge: Cambridge University Press.
Powell, J. L. (1994). Estimation of semiparametric models. In R. F. Engle and D. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2443–521. New York: North-Holland.
Powell, J. L. (2001). Semiparametric estimation of censored selection models. In C. Hsiao, K. Morimune and J. Powell (Eds.), Nonlinear Statistical Modeling, 165–96. Cambridge: Cambridge University Press.
Powell, J. L., J. H. Stock and T. M. Stoker (1989). Semiparametric estimation of index coefficients. Econometrica 57, 1403–30.
Robinson, P. (1988). Root-n-consistent semiparametric regression. Econometrica 56, 931–54.
Ruud, P. A. (1986). Consistent estimation of limited dependent variable models despite misspecification of distribution. Journal of Econometrics 32, 157–87.
White, H. (1980). Using least squares to approximate unknown regression functions. International Economic Review 21, 149–70.
APPENDIX: PROOFS

Throughout the appendix, C will denote a positive constant that can be different in different uses, T denotes the triangle inequality and w.p.a.1 means 'with probability approaching one'. Also, we will use repeatedly the result that if E[Yₙ|Xₙ] →p 0 for a sequence of positive random variables Yₙ and conditioning variables Xₙ, then Yₙ →p 0.

Proof of Theorem 4.1: To begin the proof, note that by ∂v(w, α)/∂α bounded, √n-consistency of α̂ and ∂τ(v, η)/∂v bounded, maxᵢ |τ̂ᵢ − τᵢ| = Op(1/√n). Also, by the density of τᵢ bounded away from zero, both minᵢ τᵢ and maxᵢ τᵢ will be √n-consistent for the boundary points of the support of τᵢ, and hence so will minᵢ τ̂ᵢ and maxᵢ τ̂ᵢ. Therefore, by a location and scale transformation for power series, which will not change the regression, it can be assumed that |τ̂ᵢ| ≤ 1 and maxᵢ |τ̂ᵢ − τᵢ| = Op(1/√n). Now, it follows from Assumption 4.5, as in Newey (1997), that for ‖A‖ = tr(A′A)^(1/2) for any matrix A, there is a non-singular linear transformation p̃^K(τ) of p^K(τ) such that, for ζ_s(K) = CK^((1+2s)/2) for splines and ζ_s(K) = CK^(1+2s) for power series,

E[dᵢ p̃^K(τᵢ)p̃^K(τᵢ)′] = I,  sup_{|τ|≤1} ‖d^s p̃^K(τ)/dτ^s‖ ≤ ζ_s(K),
ζ₁(K)K^(1/2)/√n → 0,  ζ₁(K)K^(−s+1) → 0.   (A.1)
Since a non-singular transformation does not change β̂, it will be convenient to just let p̃^K = p^K. Then, as in Newey (1997), ‖P′P/n − I‖ = Op(ζ₀(K)K^(1/2)/√n) →p 0. It follows as in Newey (1997) that λ_min(P′P/n) ≥ C w.p.a.1, where λ_min(·) and λ_max(·) denote the smallest and the largest eigenvalues of a symmetric matrix. Therefore, there is a constant C such that I ≤ C(P′P/n)⁻¹ w.p.a.1. Also, by the mean value theorem,

‖P̂ − P‖/√n ≤ ζ₁(K) maxᵢ |τ̂ᵢ − τᵢ| = Op(ζ₁(K)/√n) →p 0.
Then, we have

‖(P̂ − P)′P/n‖² ≤ n⁻¹C tr{(P̂ − P)′P(P′P)⁻¹P′(P̂ − P)} ≤ C‖P̂ − P‖²/n = Op(ζ₁(K)²/n) →p 0,

so that ‖P̂′P̂/n − P′P/n‖ ≤ ‖P̂ − P‖²/n + 2‖P′(P̂ − P)‖/n →p 0, and by T,

‖P̂′P̂/n − I‖ →p 0.

Then λ_min(P̂′P̂/n) ≥ C w.p.a.1. Next, let Â = P̂(P̂′P̂)⁻¹ and Q̂ = P̂(P̂′P̂)⁻¹P̂′ and note that λ_max((P̂′P̂)⁻¹) = n⁻¹[λ_min(P̂′P̂/n)]⁻¹ = Op(n⁻¹). Therefore,

ÂÂ′ = P̂(P̂′P̂)⁻²P̂′ ≤ Op(n⁻¹)Q̂ ≤ Op(1/n)Iₙ,   (A.2)

where the second inequality follows by Q̂ idempotent. Next, since τ(v, η₀) is one-to-one, conditioning on v is equivalent to conditioning on τ, so that, for example, h₀(v) can be regarded as a function of τ. Let μᵢ = dᵢE[xᵢ|τᵢ, dᵢ = 1], μ = [μ₁, …, μₙ]′ and μ̂ = Q̂X, so that ‖μ̂ − μ‖²/n = tr(X′Q̂X − 2X′Q̂μ + μ′μ)/n. By equation (A.2),

‖Â′X‖² = tr(X′ÂÂ′X) ≤ Op(n⁻¹)tr(X′X) = Op(1)tr(X′X/n) = Op(1).

It follows similarly that ‖A′X‖ = Op(1) for A = P(P′P)⁻¹. Also, ‖X′(P̂ − P)‖/n ≤ ‖X‖‖P̂ − P‖/n = Op(ζ₁(K)/√n) →p 0, so that for Q = P(P′P)⁻¹P′,

‖X′Q̂X/n − X′QX/n‖ ≤ ‖X′(P̂ − P)Â′X‖/n + ‖X′A(P̂′P̂ − P′P)Â′X‖/n + ‖X′A(P̂ − P)′X‖/n
  ≤ ‖X′(P̂ − P)/n‖(‖Â′X‖ + ‖A′X‖) + ‖X′A‖‖(P̂′P̂ − P′P)/n‖‖Â′X‖ →p 0.   (A.3)
It follows similarly that ‖X′Q̂μ/n − X′Qμ/n‖ →p 0. Therefore, since uᵢ = dᵢxᵢ − μᵢ for uᵢ from Assumption 2.1,

‖μ̂ − μ‖²/n = tr(X′QX − 2X′Qμ + μ′μ)/n + op(1) = tr(u′Qu + μ′(I − Q)μ)/n + op(1).

For T = (τ₁, …, τₙ)′ and D = (d₁, …, dₙ)′, by independence of the observations, E[uᵢ|T, D] = E[uᵢ|τᵢ, dᵢ] = 0. Therefore, for i ≠ j,

E[uᵢu′ⱼ|T, D] = E[uᵢu′ⱼ|τᵢ, τⱼ, dᵢ, dⱼ] = E[uᵢE[u′ⱼ|uᵢ, τᵢ, τⱼ, dᵢ, dⱼ]|τᵢ, τⱼ, dᵢ, dⱼ] = E[uᵢE[u′ⱼ|τⱼ, dⱼ]|τᵢ, τⱼ, dᵢ, dⱼ] = 0.

Also, by Assumption 4.2, E[uᵢu′ᵢ|T, D] = E[uᵢu′ᵢ|τᵢ, dᵢ] ≤ CI. Therefore, with probability one,

E[uu′|T, D] ≤ CI.   (A.4)

It follows that E[tr(u′Qu)/n|T, D] ≤ C tr(Q)/n = CK/n → 0, so that tr(u′Qu)/n →p 0. Also, by Assumption 4.3 and standard approximation theory results for power series and splines (e.g. see Newey,
1997 for references), and by (I − Q)P = 0 and I − Q idempotent, there exists π_K such that

E[tr(μ′(I − Q)μ)]/n = E[tr((μ − Pπ_K)′(I − Q)(μ − Pπ_K))]/n ≤ E[tr((μ − Pπ_K)′(μ − Pπ_K))]/n = E[dᵢ{μᵢ − π′_K p^K(τᵢ)}′{μᵢ − π′_K p^K(τᵢ)}] → 0.

Combining these results gives, for ûᵢ = dᵢxᵢ − μ̂ᵢ,

‖û − u‖²/n = ‖μ̂ − μ‖²/n →p 0.   (A.5)

This implies that ‖M̂ − u′u/n‖ →p 0, while u′u/n →p M follows by the law of large numbers. T then gives M̂ →p M.

Next, let εᵢ = dᵢ[yᵢ − x′ᵢβ₀ − h₀(τᵢ)], ε = (ε₁, …, εₙ)′ and W = [w₁, …, wₙ]′. It follows similarly to equation (A.4) that E[εε′|W, D] ≤ CI. Then, since Q̂ and Q are functions of W and D,

E[‖X′(Q̂ − Q)ε/√n‖² | W, D] = tr(X′(Q̂ − Q)E[εε′|W, D](Q̂ − Q)X)/n ≤ C tr(X′(Q̂ − Q)(Q̂ − Q)X)/n.

It follows similarly to equation (A.3) that X′(Q̂ − Q)Q̂X/n →p 0 and X′(Q̂ − Q)QX/n →p 0, so that ‖X′(Q̂ − Q)ε/√n‖ →p 0, and hence X′(I − Q̂)ε/√n = X′(I − Q)ε/√n + op(1). It follows as in Donald and Newey (1994) that X′(I − Q)ε/√n = u′ε/√n + op(1). Then by T,

X′(I − Q̂)ε/√n = u′ε/√n + op(1).   (A.6)
Next, for both power series and splines it follows as in Newey (1997) that there are γ_K and π_K such that, for h_K(τ) = p^K(τ)′γ_K, μ_K(τ) = π′_K p^K(τ) and μ(τ) = E[xᵢ|τᵢ = τ, dᵢ = 1],

sup_{|τ|≤1} |h₀(τ) − h_K(τ)| ≤ CK^(−s+1),  sup_{|τ|≤1} |∂h₀(τ)/∂τ − ∂h_K(τ)/∂τ| ≤ CK^(−s+1),
sup_{|τ|≤1} ‖μ(τ) − μ_K(τ)‖ ≤ CK^(−t).   (A.7)
Let h̃ᵢ = h₀(τ̂ᵢ), hᵢ = h₀(τᵢ), h̃_Ki = h_K(τ̂ᵢ), h_Ki = h_K(τᵢ), μ̃ᵢ = μ(τ̂ᵢ), μ̃_Ki = μ_K(τ̂ᵢ), μ_Ki = μ_K(τᵢ), and let expressions without the i subscript denote corresponding matrices over all observations multiplied by selection indicators, e.g. μ̃_K = [d₁μ̃_K1, …, dₙμ̃_Kn]′. Then

X′(I − Q̂)h/√n = X′(I − Q̂)(h − h̃)/√n + (X − μ̃_K)′(I − Q̂)(h̃ − h̃_K)/√n.

Let θ = (α′, η′)′, τ(w, θ) = τ(v(w, α), η) and h_θi = ∂h₀(τ(wᵢ, θ₀))/∂θ. Since ∂τ(w, θ₀)/∂η depends only on v and E[uᵢa(vᵢ)] = 0 for any function a(vᵢ) with finite mean square, E[uᵢh′_θi] = [E[uᵢ{∂h₀(vᵢ)/∂v}∂v(wᵢ, α₀)/∂α′], 0] = [H, 0]. It follows similarly to M̂ →p M that, for h_θ = [d₁h′_θ1, …, dₙh′_θn]′, we have X′(I − Q̂)h_θ/n →p E[uᵢh′_θi]. Then, by a second-order expansion and √n-consistency of θ̂,

X′(I − Q̂)(h̃ − h)/√n = −[X′(I − Q̂)h_θ/n]√n(θ̂ − θ₀) + op(1) = −E[uᵢh′_θi]√n(θ̂ − θ₀) + op(1) = −H√n(α̂ − α₀) + op(1).   (A.8)

Also, by equation (A.7) and I − Q̂ idempotent,

‖(μ̃ − μ̃_K)′(I − Q̂)(h̃ − h̃_K)/√n‖ ≤ ‖μ̃ − μ̃_K‖‖h̃ − h̃_K‖/√n = Op(√n K^(−s−t+1)) →p 0.

Also, by ‖μ̃ − μ‖ ≤ √n sup_{|τ|≤1} ‖∂μ(τ)/∂τ‖ maxᵢ |τ̂ᵢ − τᵢ| = Op(1), we have

‖(μ − μ̃)′(I − Q̂)(h̃ − h̃_K)/√n‖ ≤ ‖μ̃ − μ‖‖h̃ − h̃_K‖/√n = Op(K^(−s+1)) →p 0.
Furthermore, by an expansion of h₀(τ̂ᵢ) − h_K(τ̂ᵢ) around τᵢ, we have

‖h̃ − h̃_K − h + h_K‖ ≤ √n sup_{|τ|≤1} |∂h₀(τ)/∂τ − ∂h_K(τ)/∂τ| maxᵢ≤ₙ |τ̂ᵢ − τᵢ| = Op(K^(−s+1)) →p 0,

so that by ‖u‖/√n = Op(1),

‖u′(I − Q̂)(h̃ − h̃_K − h + h_K)/√n‖ ≤ (‖u‖/√n)‖h̃ − h̃_K − h + h_K‖ →p 0.

It then follows by T that

X′(I − Q̂)h̃/√n = (X − μ̃_K)′(I − Q̂)(h̃ − h̃_K)/√n = u′(I − Q̂)(h̃ − h̃_K)/√n + op(1) = u′(I − Q̂)(h − h_K)/√n + op(1).   (A.9)

Next, let Δ_K = h − h_K. Note that ‖Â′Δ_K‖² = Δ′_KÂÂ′Δ_K = Op(n⁻¹)Δ′_KΔ_K = Op(K^(−2s+2)). Then

‖u′(P̂ − P)Â′Δ_K/√n‖ ≤ ‖u‖‖P̂ − P‖‖Â′Δ_K‖/√n = Op(ζ₁(K)K^(−s+1)) →p 0.

Also note that E[‖u′A‖² | T, D] ≤ CK/n, so that

‖u′A(P′P − P̂′P̂)Â′Δ_K/√n‖ ≤ ‖u′A‖‖P′P − P̂′P̂‖‖Â′Δ_K‖/√n = Op(K/n)op(n)Op(K^(−s+1)) →p 0.

We also have

‖u′A(P̂ − P)′Δ_K/√n‖ ≤ ‖u′A‖‖(P̂ − P)′Δ_K‖/√n = Op(K/n)Op(ζ₁(K)K^(−s+1)/√n) →p 0.

Then by T,

‖u′(Q̂ − Q)Δ_K/√n‖ →p 0.

It follows from this and equation (A.9) that

X′(I − Q̂)h̃/√n = u′(I − Q)Δ_K/√n + op(1).

Also,

E[‖u′(I − Q)Δ_K‖²/n | T, D] = Δ′_K(I − Q)E[uu′|T, D](I − Q)Δ_K/n ≤ CΔ′_K(I − Q)Δ_K/n ≤ CΔ′_KΔ_K/n → 0,

so that u′(I − Q)Δ_K/√n →p 0. T then gives

X′(I − Q̂)h̃/√n →p 0.   (A.10)

Combining equations (A.6), (A.8) and (A.10), we obtain

X′(I − Q̂)(ε + h)/√n = X′(I − Q̂)(ε + h − h̃ + h̃)/√n = u′ε/√n + H√n(α̂ − α₀) + op(1) = Σᵢ₌₁ⁿ (uᵢεᵢ + Hψᵢ)/√n + op(1).
Also note that E[uᵢεᵢψ′ᵢ] = E[uᵢE[εᵢ|wᵢ, dᵢ]ψ′ᵢ] = 0. The Lindeberg–Lévy central limit theorem then gives Σᵢ₌₁ⁿ (uᵢεᵢ + Hψᵢ)/√n →d N(0, Ω + HV(α̂)H′). Next, note that by yᵢ = x′ᵢβ₀ + hᵢ + εᵢ,

√n(β̂ − β₀) = M̂⁻¹X′(I − Q̂)(ε + h)/√n.

The first conclusion then follows from the continuous mapping theorem and the Slutsky theorem in the usual way.

To show the second conclusion, note that ĥ(τ) = p^K(τ)′γ̂, γ̂ = Â′(Y − Xβ̂). By equation (A.2),

‖Â′X(β̂ − β₀)‖ ≤ Op(1)‖X(β̂ − β₀)‖/√n = Op(1/√n),
‖Â′(h − h̃)‖ ≤ Op(1)‖h − h̃‖/√n = Op(1/√n),
‖Â′(h̃ − P̂γ_K)‖ ≤ Op(1)‖h̃ − P̂γ_K‖/√n = Op(K^(−s+1)).

Similar to previous results, E[ε′Q̂ε | D, W] ≤ CK, so that ‖Â′ε‖² = ε′ÂÂ′ε = Op(1)ε′Q̂ε/n = Op(K/n). Then by T,

‖γ̂ − γ_K‖ ≤ ‖Â′X(β̂ − β₀)‖ + ‖Â′ε‖ + ‖Â′(h − h̃)‖ + ‖Â′(h̃ − P̂γ_K)‖ = Op((K/n)^(1/2)) + Op(K^(−s+1)).

Thus, for s = 0 or 1,

sup_{|τ|≤1} |∂^s ĥ(τ)/∂τ^s − ∂^s h₀(τ)/∂τ^s| ≤ sup_{|τ|≤1} ‖∂^s p^K(τ)/∂τ^s‖ ‖γ̂ − γ_K‖ + sup_{|τ|≤1} |∂^s [p^K(τ)′γ_K]/∂τ^s − ∂^s h₀(τ)/∂τ^s|
  ≤ ζ_s(K)‖γ̂ − γ_K‖ + O(K^(−s+1)) = Op(ζ_s(K)[(K/n)^(1/2) + K^(−s+1)]) →p 0.   (A.11)
Therefore, maxᵢ≤ₙ |∂ĥ(τ̂ᵢ)/∂τ − ∂h₀(τ̂ᵢ)/∂τ| →p 0. Also, since h₀(τ) is at least twice differentiable with bounded derivatives, maxᵢ≤ₙ |∂h₀(τ̂ᵢ)/∂τ − ∂h₀(τᵢ)/∂τ| →p 0, so by T,

maxᵢ≤ₙ |∂ĥ(τ̂ᵢ)/∂τ − ∂h₀(τᵢ)/∂τ| →p 0.

Now note that ∂ĥ(v̂ᵢ)/∂v = [∂ĥ(τ̂ᵢ)/∂τ]∂τ(v̂ᵢ, η̂)/∂v. By Assumptions 4.1 and 4.4, supᵢ≤ₙ |∂τ(v̂ᵢ, η̂)/∂v − ∂τ(vᵢ, η₀)/∂v| →p 0, so by boundedness of ∂τ(v̂ᵢ, η̂)/∂v, we have

maxᵢ≤ₙ |∂ĥ(v̂ᵢ)/∂v − ∂h₀(vᵢ)/∂v| →p 0.

Then, by boundedness of ∂v(wᵢ, α̂)/∂α, for H̃ = n⁻¹Σᵢ₌₁ⁿ ûᵢ[∂v(wᵢ, α̂)/∂α′]∂h₀(vᵢ)/∂v,

‖Ĥ − H̃‖ ≤ tr(û′û/n)^(1/2) [Σᵢ₌₁ⁿ ‖∂v(wᵢ, α̂)/∂α‖²/n]^(1/2) maxᵢ≤ₙ |∂ĥ(v̂ᵢ)/∂v − ∂h₀(vᵢ)/∂v| →p 0.
Further, by equation (A.5) and Assumption 4.1, for H̄ = n⁻¹Σᵢ₌₁ⁿ uᵢ[∂v(wᵢ, α₀)/∂α′]∂h₀(vᵢ)/∂v, we have ‖H̃ − H̄‖ →p 0. Then, since H̄ →p H by the law of large numbers, Ĥ →p H follows by T.

Now, let Δ̂ᵢ = x′ᵢ(β̂ − β₀) + ĥᵢ − hᵢ. By equation (A.11),

maxᵢ≤ₙ |ĥᵢ − hᵢ| ≤ maxᵢ≤ₙ |ĥ(τ̂ᵢ) − h₀(τ̂ᵢ)| + maxᵢ≤ₙ |h₀(τ̂ᵢ) − h₀(τᵢ)| →p 0.

Also,

maxᵢ≤ₙ |x′ᵢ(β̂ − β₀)| ≤ maxᵢ≤ₙ ‖xᵢ‖ ‖β̂ − β₀‖ ≤ n^(1/(2+δ)) [Σᵢ₌₁ⁿ ‖xᵢ‖^(2+δ)/n]^(1/(2+δ)) Op(1/√n) = n^(1/(2+δ)) Op(1) Op(1/√n) →p 0.
Then by T, maxᵢ≤ₙ |Δ̂ᵢ| →p 0. Furthermore, by Assumption 4.2, E[|εᵢ| | W, D] ≤ C, so that

E[Σᵢ₌₁ⁿ ‖ûᵢ‖²|εᵢ|/n | W, D] = Σᵢ₌₁ⁿ ‖ûᵢ‖²E[|εᵢ| | W, D]/n ≤ CΣᵢ₌₁ⁿ ‖ûᵢ‖²/n = Op(1),

and hence Σᵢ₌₁ⁿ ‖ûᵢ‖²|εᵢ|/n = Op(1). Therefore,

‖Σᵢ₌₁ⁿ ûᵢû′ᵢε̂ᵢ²/n − Σᵢ₌₁ⁿ ûᵢû′ᵢεᵢ²/n‖ ≤ Σᵢ₌₁ⁿ ‖ûᵢ‖²|(εᵢ − Δ̂ᵢ)² − εᵢ²|/n
  ≤ [Σᵢ₌₁ⁿ ‖ûᵢ‖²/n] maxᵢ≤ₙ Δ̂ᵢ² + 2[Σᵢ₌₁ⁿ ‖ûᵢ‖²|εᵢ|/n] maxᵢ≤ₙ |Δ̂ᵢ| →p 0.   (A.12)

Also note that, by equation (A.5),

E[Σᵢ₌₁ⁿ ‖εᵢûᵢ − εᵢuᵢ‖²/n | W, D] = E[Σᵢ₌₁ⁿ εᵢ²‖μ̂ᵢ − μᵢ‖²/n | W, D] ≤ Σᵢ₌₁ⁿ E[εᵢ²|W, D]‖μ̂ᵢ − μᵢ‖²/n ≤ CΣᵢ₌₁ⁿ ‖μ̂ᵢ − μᵢ‖²/n →p 0.

Therefore Σᵢ₌₁ⁿ ‖εᵢûᵢ − εᵢuᵢ‖²/n →p 0. Then by T, ‖Σᵢ₌₁ⁿ ûᵢû′ᵢε̂ᵢ²/n − Σᵢ₌₁ⁿ uᵢu′ᵢεᵢ²/n‖ →p 0. Then by the law of large numbers, Σᵢ₌₁ⁿ uᵢu′ᵢεᵢ²/n →p E[uᵢu′ᵢεᵢ²] = Ω, so by T, Σᵢ₌₁ⁿ ûᵢû′ᵢε̂ᵢ²/n →p Ω. The second conclusion then follows by consistency of V̂_α̂ and the Slutsky theorem.
C The Author(s). Journal compilation C Royal Economic Society 2009.
The Econometrics Journal (2009), volume 12, pp. S230–S234. doi: 10.1111/j.1368-423X.2008.00269.x
A note on adapting propensity score matching and selection models to choice based samples

JAMES J. HECKMAN†,‡ AND PETRA E. TODD§

†University of Chicago, Economics Department, 1126 E 59th Street, Chicago, IL 60637, USA
‡University College Dublin, Cowles Foundation, Yale University and American Bar Foundation
E-mail: [email protected]
§Department of Economics, University of Pennsylvania, 3718 Locust Walk, McNeil 160, Philadelphia, PA 19104, USA and NBER
E-mail: [email protected]
First version received: August 2008; final version accepted: October 2008
Summary  The probability of selection into treatment plays an important role in matching and selection models. However, this probability often cannot be consistently estimated, because of choice-based sampling designs with unknown sampling weights. This note establishes that selection and matching procedures can nevertheless be implemented using propensity scores fit on choice-based samples with misspecified weights, because the odds ratio of the propensity score fit on the choice-based sample is monotonically related to the odds ratio of the true propensity score. Keywords: Choice-based sampling, Matching models, Propensity scores, Selection models.
1. INTRODUCTION

The probability of selection into a treatment, also called the propensity score, plays a central role in classical selection models and in matching models (see, e.g. Heckman, 1980, Rosenbaum and Rubin, 1983, Hirano et al., 2003, Heckman and Navarro, 2004, Heckman and Vytlacil, 2007). 1 Heckman and Robb (1986, reprinted 2000), Heckman and Navarro (2004) and Heckman and Vytlacil (2007) show how the propensity score is used differently in matching and selection models. They also show that, given the propensity score, both matching and selection models are robust to choice-based sampling, which occurs when treatment group members are over- or under-represented relative to their frequency in the population. Choice-based sampling designs are frequently chosen in evaluation studies to reduce the costs of data collection and to obtain more observations on treated individuals. Given a consistent estimate of the propensity score, matching and classical selection methods are robust to choice-based sampling, because both are defined conditional on treatment and comparison group status.

1 It also plays a key role in instrumental variables models (see Heckman et al., 2006). Heckman and Vytlacil (2007) discuss the different roles played by the propensity score in matching, IV and selection models.
This note extends the analysis of Heckman and Robb (1985), Heckman and Robb (1986, reprinted 2000) to consider the case where population weights are unknown so that the propensity score cannot be consistently estimated. In evaluation settings, the population weights are often unknown or cannot easily be estimated. 2 For example, for the National Supported Work training program studied in LaLonde (1986), Dehejia and Wahba (1999, 2002) and in Smith and Todd (2005), the population consists of all persons eligible for the program, which was targeted at drug addicts, ex-convicts, and welfare recipients. Few data sets have the information necessary to determine whether a person is eligible for the program, so it would be difficult to estimate the population weights needed to consistently estimate propensity scores. In this note, we establish that matching and selection procedures can still be applied when the propensity score is estimated on unweighted choice based samples. The idea is simple. To implement both matching and classical selection models, only a monotonic transformation of the propensity score is required. In choice based samples, the odds ratio of the propensity score estimated using misspecified weights is monotonically related to the odds ratio of the true propensity scores. Thus, selection and matching procedures can identify population treatment effects using misspecified estimates of propensity scores fit on choice-based samples.
2. DISCUSSION OF THE PROPOSITION

Let D = 1 if a person is a treatment group member; D = 0 if the person is a member of the comparison group. X = x is a realization of X. In the population generated from random sampling, the joint density is g(d, x) = [Pr(D = 1|x)]^d [Pr(D = 0|x)]^(1−d) g(x) for D = d, d ∈ {0, 1}, where g is the density of the data. By Bayes's theorem, letting Pr(D = 1) = P, we have

g(x|D = 1)P = g(x)Pr(D = 1|x)   (2.1a)

and

g(x|D = 0)(1 − P) = g(x)Pr(D = 0|x).   (2.1b)

Take the ratio of (2.1a) to (2.1b):

[g(x|D = 1)/g(x|D = 0)][P/(1 − P)] = Pr(D = 1|x)/Pr(D = 0|x).   (2.2)
Assume 0 < P r(D = 1|x) < 1. From knowledge of the densities of the data in the two samples, g(x|D = 1) and g(x|D = 0), one can form a scalar multiple of the ratio of the propensity score without knowing P. The odds ratio is a monotonic function of the propensity score that does not require knowledge of the true sample weights. In a choice-based sample, both the numerator and denominator of the first term in (2.2) can be consistently estimated. This monotonic function can replace P(x) in implementing both matching and nonparametric selection models.
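Equation (2.2) is easy to verify numerically in a discrete example. The snippet below is our illustration of the identity (the cell probabilities are made up), not part of the paper's argument.

```python
import numpy as np

g_x = np.array([0.6, 0.4])            # g(x): distribution of a binary x
p1_x = np.array([0.3, 0.7])           # Pr(D = 1 | x), the propensity score
P = float(np.sum(g_x * p1_x))         # Pr(D = 1)

# Bayes's theorem, as in (2.1a) and (2.1b)
g_x_d1 = g_x * p1_x / P               # g(x | D = 1)
g_x_d0 = g_x * (1 - p1_x) / (1 - P)   # g(x | D = 0)

# Equation (2.2): density ratio times P/(1 - P) equals the true odds
lhs = (g_x_d1 / g_x_d0) * (P / (1 - P))
rhs = p1_x / (1 - p1_x)
assert np.allclose(lhs, rhs)
```

The same algebra goes through for continuous x cell by cell, which is why knowledge of the two conditional densities identifies the odds of the propensity score up to the unknown scalar P/(1 − P).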
2 The methods of Manski and Lerman (1977) and Manski (1986) for adjusting for choice-based sampling in estimating the discrete choice probabilities cannot be applied when the weights are unknown and cannot be identified from the data.
However, estimating g(x|D = d) is demanding of the data when X is of high dimension. Instead of estimating these densities, we can substitute for the left-hand side of (2.2) the odds ratio of the estimated conditional probabilities obtained using the choice-based sample with the wrong weights (i.e. ignoring the fact that the data are a choice based sample). The odds ratio of the estimated probabilities is a scalar multiple of the true odds ratio. It can therefore be used instead of Pr(D = 1|X) to match or construct nonparametric control functions in selection bias models.

In the choice-based sample, let P̃r(D = 1|x) be the conditional probability that D = 1 and P* be the unconditional probability of sampling D = 1, where in general P* ≠ P, the true population proportion. The joint density of the data from the sampled population is

[g(x|D = 1)P*]^d [g(x|D = 0)(1 − P*)]^(1−d).

Using (2.1a) and (2.1b) to solve for g(x|D = 1) and g(x|D = 0), one may write the data density as

[Pr(D = 1|x)g(x)(P*/P)]^d [Pr(D = 0|x)g(x)((1 − P*)/(1 − P))]^(1−d),

so

P̃r(D = 1|x) = Pr(D = 1|x)g(x)(P*/P) / [g(x|D = 1)P* + g(x|D = 0)(1 − P*)]   (2.3a)

and

P̃r(D = 0|x) = Pr(D = 0|x)g(x)((1 − P*)/(1 − P)) / [g(x|D = 1)P* + g(x|D = 0)(1 − P*)].   (2.3b)
Under random sampling, the right-hand sides of (2.3a) and (2.3b) are the limits to which the choice-based probabilities converge. Taking the ratio of (2.3a) to (2.3b), assuming the latter is not zero, one obtains

P̃r(D = 1|x)/P̃r(D = 0|x) = [Pr(D = 1|x)/Pr(D = 0|x)][P*/(1 − P*)][(1 − P)/P].   (2.4)

Thus, one can estimate the ratio of the propensity score up to scale (the scale is the product of the final two terms on the right-hand side of (2.4)). Instead of estimating matching or semiparametric selection models using Pr(D = 1|x) (as in, for example, Heckman, 1980, Heckman and Robb, 1986, Heckman and Hotz, 1989, Ahn and Powell, 1993, Heckman et al., 1998, Powell, 2001), one can instead use the odds ratio of the estimate P̃r(D = 1|x), which is monotonically related to the true Pr(D = 1|x). In the case of a logit P(x), P(x) = exp(xβ)/(1 + exp(xβ)), the log of this ratio becomes

ln[P̃r(D = 1|x)/P̃r(D = 0|x)] = xβ̃,

where the slope coefficients are the true values and the intercept is β̃₀ = β₀ + ln(P*/(1 − P*)) + ln((1 − P)/P), where β₀ is the true value. 3

3 See Manski and McFadden (1981, p. 26).
In implementing nearest-neighbor matching estimators, matching on the log odds ratio gives identical estimates to matching on the (unknown) Pr(D = 1|x), because the odds ratio preserves the ranking of the neighbors. In application of either matching or classical selection bias correction methods, one must account for the usual problems of using estimated log odds ratios instead of true values. 4
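The intercept-shift result for the logit case can be checked by simulation. The snippet below is our illustration, with made-up coefficients and a hand-rolled Newton solver standing in for whatever logit routine a practitioner would use: it fits the misspecified (unweighted) logit on a choice-based sample and compares the fitted intercept with β₀ + ln(P*/(1 − P*)) + ln((1 − P)/P).

```python
import numpy as np

rng = np.random.default_rng(1)
b0, b1 = -1.0, 2.0                          # true logit coefficients (made up)

# Population with Pr(D = 1 | x) = exp(b0 + b1 x)/(1 + exp(b0 + b1 x))
N = 200_000
x = rng.normal(size=N)
d = rng.random(N) < 1 / (1 + np.exp(-(b0 + b1 * x)))
P = d.mean()                                # population Pr(D = 1)

# Choice-based sample: oversample treated units so that P* = 0.5
P_star, m = 0.5, 20_000
xs = np.concatenate([rng.choice(x[d], size=m // 2),
                     rng.choice(x[~d], size=m - m // 2)])
ds = np.concatenate([np.ones(m // 2), np.zeros(m - m // 2)])

# Unweighted logit by Newton's method, ignoring the sampling design
X = np.column_stack([np.ones(m), xs])
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))
    W = mu * (1 - mu)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (ds - mu))

shift = np.log(P_star / (1 - P_star)) + np.log((1 - P) / P)
print(beta[1], b1)              # slope is recovered
print(beta[0], b0 + shift)      # intercept is shifted as in the text
```

Because the misspecification shows up only in the intercept, the fitted log odds xβ̃ order observations exactly as the true propensity score does, which is the monotonicity the matching and selection procedures require.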
ACKNOWLEDGMENTS This research was supported by NSF SBR 93-21-048 and NSF 97-09-873 and NICHD 40-4043000-85-261.
REFERENCES

Ahn, H. and J. Powell (1993). Semiparametric estimation of censored selection models with a nonparametric selection mechanism. Journal of Econometrics 58, 3–29.
Dehejia, R. and S. Wahba (1999). Causal effects in nonexperimental studies: reevaluating the evaluation of training programs. Journal of the American Statistical Association 94, 1053–62.
Dehejia, R. and S. Wahba (2002). Propensity score matching methods for nonexperimental causal studies. Review of Economics and Statistics 84, 151–61.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–31.
Heckman, J. J. (1980). Addendum to sample selection bias as a specification error. In E. Stromsdorfer and G. Farkas (Eds.), Evaluation Studies Review Annual, Volume 5, 69–74. Beverly Hills: Sage Publications.
Heckman, J. J. and V. J. Hotz (1989). Choosing among alternative nonexperimental methods for estimating the impact of social programs: the case of manpower training. Journal of the American Statistical Association 84, 862–74. (Rejoinder also published in Vol. 84, No. 408.)
Heckman, J. J., H. Ichimura, J. Smith and P. E. Todd (1998a). Characterizing selection bias using experimental data. Econometrica 66, 1017–98.
Heckman, J. J., H. Ichimura and P. E. Todd (1998b). Matching as an econometric evaluation estimator. Review of Economic Studies 65, 261–94.
Heckman, J. J. and S. Navarro (2004). Using matching, instrumental variables, and control functions to estimate economic choice models. Review of Economics and Statistics 86, 30–57.
Heckman, J. J. and R. Robb (1985). Alternative methods for evaluating the impact of interventions: an overview. Journal of Econometrics 30, 239–67.
Heckman, J. J. and R. Robb (1986). Alternative methods for solving the problem of selection bias in evaluating the impact of treatments on outcomes. In H. Wainer (Ed.), Drawing Inferences from Self-Selected Samples, 63–107. New York: Springer-Verlag. (Reprinted in 2000, Mahwah, NJ: Lawrence Erlbaum Associates.)
Heckman, J. J., S. Urzua and E. J. Vytlacil (2006). Understanding instrumental variables in models with essential heterogeneity. Review of Economics and Statistics 88, 389–432.
Heckman, J. J. and E. J. Vytlacil (2007). Econometric evaluation of social programs, part II: Using the marginal treatment effect to organize alternative economic estimators to evaluate social programs and

4 For discussion related to using estimated propensity scores, see Hahn (1998), Heckman et al. (1998a, 1998b), Hirano et al. (2003).
to forecast their effects in new environments. In J. Heckman and E. Leamer (Eds.), Handbook of Econometrics, Volume 6B, 4875–5144. Amsterdam: Elsevier.
Hirano, K., G. W. Imbens and G. Ridder (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 1161–89.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review 76, 604–20.
Manski, C. F. (1986). Semiparametric analysis of binary response from response-based samples. Journal of Econometrics 31, 31–40.
Manski, C. F. and S. R. Lerman (1977). The estimation of choice probabilities from choice based samples. Econometrica 45, 1977–88.
Manski, C. F. and D. McFadden (1981). Statistical analysis of discrete probability models. In C. F. Manski and D. McFadden (Eds.), Structural Analysis of Discrete Data with Econometric Applications, 2–49. Cambridge, MA: MIT Press.
Powell, J. L. (2001). Semiparametric estimation of censored selection models. In C. Hsiao, K. Morimune, and J. L. Powell (Eds.), Nonlinear Statistical Modeling: Proceedings of the Thirteenth International Symposium in Economic Theory and Econometrics: Essays in Honor of Takeshi Amemiya, 165–96. New York: Cambridge University Press.
Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.
Smith, J. A. and P. E. Todd (2005). Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics 125, 305–53.
C The Author(s). Journal compilation C Royal Economic Society 2009.