Journal of Econometrics 163 (2011) 127–143
Asymptotic distributions of impulse response functions in short panel vector autoregressions
Bolong Cao a, Yixiao Sun b,∗
a Department of Economics, Ohio University, Athens, OH 45701, United States
b University of California, San Diego, Department of Economics, 9500 Gilman Drive, La Jolla, CA 92093-0508, United States
Article info
Article history: Received 20 February 2009; received in revised form 27 December 2010; accepted 14 March 2011; available online 1 April 2011
JEL classification: C33, C53
Keywords: Asymptotic distribution; Bootstrap; Nonorthogonalized impulse response function; Orthogonalized impulse response function; Panel data; Vector autoregressions
Abstract
This paper establishes the asymptotic distributions of the impulse response functions in panel vector autoregressions with a fixed time dimension. It also proves the asymptotic validity of a bootstrap approximation to their sampling distributions. The autoregressive parameters are estimated using the GMM estimators based on the first differenced equations and the error variance is estimated using an extended analysis-of-variance type estimator. Contrary to the time series setting, we find that the GMM estimator of the autoregressive coefficients is not asymptotically independent of the error variance estimator. The asymptotic dependence calls for variance correction for the orthogonalized impulse response functions. Simulation results show that the variance correction improves the coverage accuracy of both the asymptotic confidence band and the studentized bootstrap confidence band for the orthogonalized impulse response functions. © 2011 Elsevier B.V. All rights reserved.
1. Introduction
In this paper, we consider panel vector autoregressions (VARs) where the cross-sectional dimension (N) is large and the time series dimension (T) is short (typically less than 10). Panel VARs with a short T have been investigated, for example, by Holtz-Eakin et al. (1988) and Binder et al. (2005). While these papers focus on the estimation of the slope coefficients, our focus here is on the estimation of the impulse response functions (IRFs) and their confidence bands. Following the traditional panel data literature, we assume that the slope coefficients are the same across different cross-sectional units and that there is no cross-sectional dependence after controlling for the fixed time effects. These two assumptions allow us to make good long-horizon forecasts, especially when the forecasting horizon is comparable to the time series length. This argument is consistent with the view of Binder et al. (2005), who use short panel VARs to infer the long-run properties of the underlying time series. For time series data, VAR models are typically estimated using the equation-by-equation OLS as it is asymptotically equivalent
∗ Corresponding author. Tel.: +1 858 534 4692.
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.03.004
to the full system-of-equations estimator. For panel data VARs, the OLS estimator is inconsistent for a fixed T as N → ∞. In this case, the VAR models are typically estimated using the Anderson and Hsiao (1981, 1982, hereafter AH) estimator or the Arellano and Bond (1991, hereafter AB) estimator. These estimators can be applied to each equation in the VAR system or to the full system of equations. Holtz-Eakin et al. (1988) and Arellano (2003, p. 120) point out that it may be possible to improve the efficiency by estimating the system of equations jointly. We show that, under the model specification given below, the equation-by-equation AH or AB estimator is asymptotically equivalent to the corresponding system-of-equations estimator.
Impulse response analysis in the time series setting has been examined by Baillie (1987) and Lütkepohl (1989, 1990), among others. However, there are two important differences between the time series case and the short panel case considered in this paper. First, for time series VARs, the OLS estimator of the slope coefficients is asymptotically independent of the error variance estimator, while for short panel VARs the AH or AB estimator of the slope coefficients depends on the error variance estimator even in the limit as N → ∞ for a fixed T. Since the regressors are only sequentially exogenous, the demeaned regressors in the short panel VARs are correlated with the demeaned regression error. This nonzero correlation leads to the asymptotic dependence
between the slope coefficient estimator and the error variance estimator. Second, for time series VARs, the error variance estimator based on the estimated OLS residual is asymptotically equivalent to that based on the true error term. For short panel VARs, the error variance estimator has different asymptotic distributions, depending on whether the error term is known or is based on estimated slope coefficients and fixed effects. In other words, the estimation uncertainty of the slope coefficients and the fixed effects affects the asymptotic distribution of the error variance estimator. These two differences imply that the usual asymptotic results for orthogonalized impulse responses are not applicable to short panel VARs.
One of the main contributions of the paper is to derive the asymptotic distributions of the orthogonalized IRFs for short panel VARs. The asymptotic distributions are obtained under the asymptotic specification that N → ∞ with T fixed. Based on our asymptotic result, confidence bands for the IRFs can be easily constructed. Although impulse response analyses using short panels have been employed in empirical applications, to the best of our knowledge, no study has reported confidence bands for orthogonalized IRFs that account for the estimation uncertainty of the error variance matrix. As a result, the reported confidence bands are often narrower than they should be. This may lead to findings of statistical significance that does not actually exist.
A further contribution of the paper is to establish the asymptotic validity of bootstrap confidence bands. Our simulation results show that bootstrap confidence bands usually provide more accurate coverage than the asymptotic analytical bands. In addition, the percentile-t bootstrap band that takes the dependence between the autoregressive coefficient estimator and the error variance estimator into account performs better than those that do not.
The rest of the paper is organized as follows.
Section 2 describes the vector autoregression model for panel data and presents the standard GMM estimator of the slope coefficients and the analysis-of-variance-type estimator of the error variance matrix. This section also establishes the joint asymptotic distribution of the slope coefficient estimator and the error variance estimator. Using these asymptotic results, we derive in Section 3 the asymptotic distributions of the orthogonalized and non-orthogonalized impulse response functions. We also prove the asymptotic validity of various bootstrap confidence bands. Section 4 provides some simulation evidence. The final section concludes. Proofs and a technical lemma are collected in the Appendix.
Throughout the paper, vec denotes the column stacking operator and vech is the corresponding operator that stacks only the elements on and below the main diagonal. As usual, the Kronecker product is denoted by $\otimes$, the commutation matrix $K_{m,n}$ is defined such that, for any $(m \times n)$ matrix $G$, $K_{m,n}\operatorname{vec}(G) = \operatorname{vec}(G')$, and the $m^2 \times m(m+1)/2$ duplication matrix $D_m$ is defined such that $D_m \operatorname{vech}(F) = \operatorname{vec}(F)$ for a symmetric $(m \times m)$ matrix $F$. Furthermore, $D_m^+ = (D_m' D_m)^{-1} D_m'$ and $L_m$ is the $m(m+1)/2 \times m^2$ elimination matrix defined such that, for any $(m \times m)$ matrix $F$, $\operatorname{vech}(F) = L_m \operatorname{vec}(F)$. For a matrix $A$, $\|A\|$ is the Euclidean norm of $A$. "⇒" denotes weak convergence and "≡" denotes distributional equivalence.

2. The model and GMM estimation
2.1. The model
We consider an m-dimensional panel VAR(p) process:
\[
y_{i,t} = \mu + A_1 y_{i,t-1} + \cdots + A_p y_{i,t-p} + \mu_i + u_{i,t} \tag{1}
\]
for $t = 0, \dots, T$ and $i = 1, 2, \dots, N$, where $y_{i,t} = (y_{1,it}, \dots, y_{m,it})'$, $A_j$ are $(m \times m)$ coefficient matrices, $\mu_i$ is an $m \times 1$ vector of individual fixed effects, $\mu$ is an $m \times 1$ vector of intercepts, and $u_{i,t}$ is the error term. To simplify the discussion, we focus on balanced panel data sets. For each individual $i$, the time series starts at period 0 and ends at period $T$. Without loss of generality, we assume that the initial values $y_{i,-1}, \dots, y_{i,-p}$ are observed. We make the following assumption.
Assumption 1. $u_{i,t}$ is independently and identically distributed across $i$ and $t$ with
\[
E(u_{i,t} \mid y_{i,t-1}, \dots, y_{i,-p}) = 0 \quad \text{for } 0 \le t \le T
\]
and
\[
E(u_{i,t} u_{j,s}' \mid y_{i,t-1}, \dots, y_{i,-p}) =
\begin{cases}
\Sigma, & i = j \text{ and } t = s, \\
0, & \text{otherwise},
\end{cases}
\quad \text{for } 0 \le t \le s \le T \tag{2}
\]
where $\Sigma$ is a positive definite matrix. The model is the same as that considered by Binder et al. (2005). We do not parameterize the initial conditions for the VAR model, as the asymptotic properties of the GMM estimators used in this paper do not rely on any parametric specification of $y_{i,-1}, \dots, y_{i,-p}$. This is an advantage of the GMM estimators as compared to the quasi maximum likelihood (QML) estimator in Binder et al. (2005). In their fixed effects specification, they assume that the initial observations $y_{i,-1}, \dots, y_{i,-p}$ are generated according to
\[
y_{i,t} = (I - A_1 - \cdots - A_p)^{-1} (\mu + \mu_i) + \xi_{i,t} \quad \text{for } -p \le t \le -1
\]
where the initial deviations $\xi_{i,t}$ are i.i.d. across $i$ and $t$ with zero mean and constant variance matrix (see their assumption G3). This homogeneity assumption may help improve the asymptotic efficiency of the QML estimator but will lead to inconsistency when it is violated. Here we maintain minimal assumptions without parameterizing the initial conditions and focus only on the GMM type estimators. The standard GMM estimator considered below is widely used in empirical applications; see, for example, Love and Zicchino (2002) and Gilchrist et al. (2005).
We can include the time fixed effects in the model so that
\[
y_{i,t} = \mu + A_1 y_{i,t-1} + \cdots + A_p y_{i,t-p} + \mu_i + \lambda_t + u_{i,t}. \tag{3}
\]
We have done so in a previous version of this paper. In this case, we can remove $\lambda_t$ by subtracting the cross-sectional average. All of our results remain valid for the above model. To simplify the notation, we focus on the VAR model in (1).

2.2. Panel GMM estimator and its asymptotic distribution
It is well known that, due to the correlation between the fixed effect $\mu_i$ and the regressors, the OLS estimator of $A_j$ based on Eq. (1) is inconsistent when $T$ is small. To remove the fixed individual effect, we take the first difference of Eq. (1), leading to
\[
\Delta y_{i,t} = A_1 \Delta y_{i,t-1} + \cdots + A_p \Delta y_{i,t-p} + \Delta u_{i,t}, \quad t = 1, \dots, T.
\]
The OLS estimator based on the first differenced equation is still inconsistent because $\Delta u_{i,t}$ is correlated with $\Delta y_{i,t-1}$. The standard GMM estimators of AH and AB employ instruments that are orthogonal to $\Delta u_{i,t}$. Additional nonlinear moment conditions implied by the homoscedasticity assumption in (2) are considered in Ahn and Schmidt (1995) and Binder et al. (2005). A previous version of this paper provides analogous results for the Ahn and Schmidt estimator. In what follows, we will mainly focus on the AB estimator since it is easy to implement
as the underlying moment conditions are linear in parameters. It has also been a standard practice to employ the AB estimator in empirical studies. In addition, for dynamic panel data models, the latest version of STATA contains only GMM estimators with linear moment conditions, the leading case of which is the AB estimator. Furthermore, in his seminal monograph, Hsiao (2003) discusses only the AB estimator for panel vector autoregressive models. Our results can be extended straightforwardly to the AH estimator.
The moment conditions for the AB estimator are
\[
E\left(\Delta u_{i,t}\, y_{i,t-1-\ell}'\right) = 0 \quad \text{for } \ell = 1, 2, \dots, t + p - 1; \; t = 1, \dots, T. \tag{4}
\]
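To write the equations in the vector form, we let
\[
A = (A_1, A_2, \dots, A_p)',
\]
where $(A_1, A_2, \dots, A_p)$ is $m \times mp$, and define the first-differenced variables:
\[
\underset{T \times m}{\Delta y_i} =
\begin{pmatrix} \Delta y_{i,1}' \\ \Delta y_{i,2}' \\ \cdots \\ \Delta y_{i,T}' \end{pmatrix},
\quad
\underset{T \times m}{\Delta u_i} =
\begin{pmatrix} \Delta u_{i,1}' \\ \Delta u_{i,2}' \\ \cdots \\ \Delta u_{i,T}' \end{pmatrix},
\quad
\underset{mp \times 1}{\Delta X_{i,t}} =
\begin{pmatrix} \Delta y_{i,t-1} \\ \Delta y_{i,t-2} \\ \vdots \\ \Delta y_{i,t-p} \end{pmatrix},
\quad
\underset{T \times mp}{\Delta X_i} =
\begin{pmatrix} \Delta X_{i,1}' \\ \Delta X_{i,2}' \\ \cdots \\ \Delta X_{i,T}' \end{pmatrix}.
\]
We define the level variables $y_i$, $u_i$ and $X_i$ similarly except that they have $(T + 1)$ rows. Then
\[
\operatorname{vec}(\Delta y_i) = (I_m \otimes \Delta X_i) \operatorname{vec}(A) + \operatorname{vec}(\Delta u_i). \tag{5}
\]
To construct the instrument matrix, we let
\[
Z_i =
\begin{pmatrix}
(y_{i,-p}', \dots, y_{i,-1}') & 0 & \cdots & 0 \\
0 & (y_{i,-p}', \dots, y_{i,0}') & \cdots & 0 \\
\cdots & \cdots & \cdots & \cdots \\
0 & 0 & \cdots & (y_{i,-p}', \dots, y_{i,T-2}')
\end{pmatrix}
:=
\begin{pmatrix} Z_{i,1}' \\ Z_{i,2}' \\ \cdots \\ Z_{i,T}' \end{pmatrix}, \tag{6}
\]
which is a $T \times m[pT + (T-1)T/2]$ matrix. Then the moment conditions in (4) can be written as $E\left[(I_m \otimes Z_i)' \operatorname{vec}(\Delta u_i)\right] = 0$. The GMM estimator of $\alpha = \operatorname{vec}(A)$ is now given by
\[
\hat\alpha_{GMM} = \operatorname{vec}(\hat A) := \operatorname{vec}\big((\hat A_1, \hat A_2, \dots, \hat A_p)'\big)
= \left[(I_m \otimes S_{ZX})' W_N (I_m \otimes S_{ZX})\right]^{-1} (I_m \otimes S_{ZX})' W_N \operatorname{vec}(S_{ZY}) \tag{7}
\]
where
\[
S_{ZX} = \frac{1}{N} \sum_{i=1}^{N} Z_i' \Delta X_i, \quad S_{ZY} = \frac{1}{N} \sum_{i=1}^{N} Z_i' \Delta y_i
\]
and $W_N$ is a weighting matrix that converges to $W$, a positive definite matrix, as $N \to \infty$.
To estimate the orthogonalized impulse response function, we need to estimate the covariance matrix $\Sigma$. If the error term $u_{i,t}$ in (1) is observable, then an analysis-of-variance type estimator of $\Sigma$ is given by
\[
\tilde\Sigma = \frac{1}{N(T+1)} \sum_{i=1}^{N} \sum_{t=0}^{T} u_{i,t} u_{i,t}'.
\]
Under the assumption that $u_{i,t}$ is normal, it can be shown that $\tilde\Sigma$ is the best quadratic unbiased estimator. It is also asymptotically equivalent to the QML estimator when a normal likelihood function is used. Since the error term is not observable, however, we have to replace it by some estimate. Given the estimate $\hat\alpha$, it is natural to estimate $u_{i,t}$ by
\[
\hat u_{i,t} = y_{i,t} - \bar y_{i,\cdot} - \hat A' \left(X_{i,t} - \bar X_{i,\cdot}\right) \quad \text{for } t = 0, \dots, T.
\]
Here and hereafter, a dot in the subscript indicates the average over that subscript. The resulting estimator of $\Sigma$ is then given by
\[
\hat\Sigma_{GMM} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=0}^{T} \hat u_{i,t} \hat u_{i,t}'. \tag{8}
\]
The subscript on $\hat\Sigma_{GMM}$ indicates that the estimator is based on the GMM estimator of $A$.
We now consider the large $N$ asymptotics for a fixed $T$. Under some regularity conditions, we have
\[
\operatorname*{P\,lim}_{N \to \infty} S_{ZX} = S_{ZX}^{\infty} \tag{9}
\]
for some constant matrix $S_{ZX}^{\infty}$ and
\[
\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \operatorname{vec}\left(Z_i' \Delta u_i\right) \Rightarrow N\left(0, \Sigma \otimes S_{ZZ}^{\infty}\right), \tag{10}
\]
where
\[
S_{ZZ}^{\infty} = \operatorname*{P\,lim}_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} Z_i' G Z_i,
\]
and $G$ is the $T \times T$ symmetric tridiagonal matrix with the main diagonal elements equal to 2 and the sub-diagonal elements equal to $-1$. Combining (9) and (10), we get
\[
\sqrt{N}\left(\hat\alpha_{GMM} - \alpha\right) \Rightarrow N(0, \Omega_{\alpha\alpha}) \tag{11}
\]
for some variance matrix $\Omega_{\alpha\alpha}$. To minimize the asymptotic variance of the GMM estimator, we choose the weighting matrix $W_N$ such that its limit is $W = \left(\Sigma \otimes S_{ZZ}^{\infty}\right)^{-1}$ (see Hansen, 1982). With the optimal weighting matrix, we have
\[
\Omega_{\alpha\alpha} = \left[\left(I_m \otimes S_{ZX}^{\infty}\right)' \left(\Sigma \otimes S_{ZZ}^{\infty}\right)^{-1} \left(I_m \otimes S_{ZX}^{\infty}\right)\right]^{-1} := \Sigma \otimes Q^{-1}, \tag{12}
\]
where
\[
Q = \left(S_{ZX}^{\infty}\right)' \left(S_{ZZ}^{\infty}\right)^{-1} S_{ZX}^{\infty}.
\]
The above asymptotic variance can also be achieved by letting
\[
W_N = I_m \otimes \left(\frac{1}{N} \sum_{i=1}^{N} Z_i' G Z_i\right)^{-1}, \tag{13}
\]
in which case $W = \operatorname*{P\,lim}_{N \to \infty} W_N = I_m \otimes \left(S_{ZZ}^{\infty}\right)^{-1}$. To see this, note that for this choice of the weighting matrix, we have
\[
\left(I_m \otimes S_{ZX}^{\infty}\right)' W \left(I_m \otimes S_{ZX}^{\infty}\right) = I_m \otimes Q,
\]
and, after scaling by $\sqrt{N}$,
\[
\operatorname{var}\left[\sqrt{N}\,(I_m \otimes S_{ZX})' W_N \operatorname{vec}(S_{ZY})\right] \to \Sigma \otimes Q.
\]
Therefore
\[
\Omega_{\alpha\alpha} = (I_m \otimes Q)^{-1} (\Sigma \otimes Q) (I_m \otimes Q)^{-1} = \Sigma \otimes Q^{-1},
\]
which is identical to the asymptotic variance given in (12).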
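The Kronecker algebra behind this equivalence can be checked numerically. The following is a minimal sketch, not part of the paper: the dimensions, the randomly drawn stand-ins for $\Sigma$, $S_{ZZ}^{\infty}$ and $S_{ZX}^{\infty}$, and the helper names `random_spd` and `sandwich_avar` are all assumptions made for illustration. It evaluates the usual GMM sandwich variance under the optimal weighting matrix, under the weighting matrix in (13), and under an arbitrary identity weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, L = 2, 3, 5  # illustrative sizes: m equations, k = mp regressors, L instruments

def random_spd(n):
    """Draw a random symmetric positive definite matrix (hypothetical helper)."""
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

Sigma = random_spd(m)            # stand-in for the error variance
S_zz = random_spd(L)             # stand-in for the limit of (1/N) sum Z_i' G Z_i
S_zx = rng.normal(size=(L, k))   # stand-in for the limit of S_ZX
Q = S_zx.T @ np.linalg.solve(S_zz, S_zx)

def sandwich_avar(W):
    """GMM sandwich variance for weighting matrix W and moment variance Sigma (x) S_zz."""
    D = np.kron(np.eye(m), S_zx)          # Jacobian of the moments
    V = np.kron(Sigma, S_zz)              # asymptotic variance of the moments, as in (10)
    bread = np.linalg.inv(D.T @ W @ D)
    return bread @ D.T @ W @ V @ W @ D @ bread

W_opt = np.linalg.inv(np.kron(Sigma, S_zz))      # optimal weighting matrix
W_eq = np.kron(np.eye(m), np.linalg.inv(S_zz))   # weighting matrix of the form (13)
target = np.kron(Sigma, np.linalg.inv(Q))        # Sigma (x) Q^{-1}, as in (12)

avar_opt = sandwich_avar(W_opt)
avar_eq = sandwich_avar(W_eq)
avar_id = sandwich_avar(np.eye(m * L))           # an arbitrary, generally inefficient choice
```

Under these assumptions both `avar_opt` and `avar_eq` coincide with `target`, while the identity weighting gives a variance that exceeds `target` in the positive semidefinite sense.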
With the weighting matrix given in (13), the GMM estimator $\hat\alpha$ reduces to
\[
\hat\alpha_{GMM} = \operatorname{vec}\left\{
\left[\left(\sum_{i=1}^{N} \Delta X_i' Z_i\right) \left(\sum_{i=1}^{N} Z_i' G Z_i\right)^{-1} \left(\sum_{i=1}^{N} Z_i' \Delta X_i\right)\right]^{-1}
\times \left(\sum_{i=1}^{N} \Delta X_i' Z_i\right) \left(\sum_{i=1}^{N} Z_i' G Z_i\right)^{-1} \sum_{i=1}^{N} \left(Z_i' \Delta y_i\right)
\right\}. \tag{14}
\]
This is the equation-by-equation GMM estimator. Therefore, we have shown that the equation-by-equation GMM estimator is asymptotically as efficient as the system GMM estimator. This result is analogous to the asymptotic efficiency of the equation-by-equation OLS in an ordinary VAR system. Holtz-Eakin et al. (1988) and Arellano (2003) both point out the possibility of improving the efficiency by jointly estimating all equations in the VAR system. Our result shows that, under the assumption of conditional homoskedasticity given in (2), there is no efficiency gain from joint estimation.
We now focus on the equation-by-equation GMM estimator given in (14) and the associated variance estimator defined in (8). To establish their joint limiting distribution, we maintain Assumption 2.
Assumption 2. The following hold for some $\delta > 0$:
(i) $E\|u_{i,t}\|^{4+2\delta} < \infty$,
(ii) $\max_i E\|X_{i,0}\|^{4+2\delta} < \infty$, $\max_i E\|\mu_i\|^{4+2\delta} < \infty$,
(iii) $S_{ZZ}^{\infty}$ and $S_{ZX}^{\infty}$ have full rank,
(iv) $E(u_{i,t} u_{i,t}' \otimes u_{i,t} \mid y_{i,t-1}, \dots, y_{i,-p}) = 0$,
where $X_{i,0} = (y_{i,-1}', y_{i,-2}', \dots, y_{i,-p}')'$.
Some comments on Assumption 2 are in order. Assumption 2(i) is a standard moment condition on $u_{i,t}$. Assumption 2(ii) assumes that the fixed effect $\mu_i$ and the initial values $y_{i,-1}, \dots, y_{i,-p}$ have uniformly bounded $4 + 2\delta$ moments. Together with Assumption 2(i), Assumption 2(ii) ensures that individual contributions to cross-sectional averages do not play a dominating role so that the LLN and CLT hold. The moment conditions are sufficient but not necessary for our results. We point out in passing that while we do not parameterize the initial conditions, we still need to control their cross-sectional heterogeneity by assuming that $\max_i E\|X_{i,0}\|^{4+2\delta} < \infty$.
To avoid the weak instrument problem, we impose Assumption 2(iii). If the initial observations are generated from the stationary distribution of the process, the full rank assumption rules out unit roots in the system; see, for example, Binder et al. (2005). It should be pointed out that the presence of a unit root does not necessarily lead to the weak instrument problem, as the fixed individual effects combined with unrestricted initialization can ensure that $S_{ZX}^{\infty}$ is of full rank. When the initial values do not follow the stationary distribution of the VAR process, both $\Delta X_i$ and $Z_i$ are affected by the fixed effect $\mu_i$. As a result, $Z_i$ can help predict $\Delta X_i$ not only because of the presence of time series dynamics but also because of the presence of the fixed effects.
We maintain the technical condition in Assumption 2(iv) in order to simplify the asymptotic variance. Under this assumption, the infeasible estimator
\[
\hat\Sigma_0 = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=0}^{T} \left(u_{i,t} - \bar u_{i,\cdot}\right) \left(u_{i,t} - \bar u_{i,\cdot}\right)' \tag{15}
\]
is asymptotically independent of $\sqrt{N}(\hat A - A)$. Otherwise, there will be extra terms in the asymptotic variance that reflect the skewness of $u_{i,t}$. This assumption is also needed to ensure the asymptotic independence of the variance estimator and the slope estimator in time series VARs. A sufficient condition for Assumption 2(iv) to hold is that $u_{i,t}$ follows an elliptical distribution, which includes normal distributions as special cases.
Let
\[
X_{i,t} = \left(y_{i,t-1}', y_{i,t-2}', \dots, y_{i,t-p}'\right)',
\]
and
\[
B = -\operatorname*{P\,lim}_{N \to \infty} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=0}^{T} \left(u_{i,t} - \bar u_{i,\cdot}\right) \left(X_{i,t} - \bar X_{i,\cdot}\right)'. \tag{16}
\]
The following theorem establishes the asymptotic distributions of $\hat\alpha_{GMM}$ and $\hat\Sigma_{GMM}$ when $N \to \infty$ for a fixed $T$.
Theorem 1. Let Assumptions 1 and 2 hold. Then
\[
\begin{pmatrix}
\sqrt{N}\left(\hat\alpha_{GMM} - \alpha\right) \\
\sqrt{N} \operatorname{vech}(\hat\Sigma_{GMM} - \Sigma)
\end{pmatrix}
\Rightarrow N(0, \Omega), \quad
\Omega = \begin{pmatrix} \Omega_{\alpha\alpha} & \Omega_{\alpha\sigma}' \\ \Omega_{\alpha\sigma} & \Omega_{\sigma\sigma} \end{pmatrix} \tag{17}
\]
where
\[
\Omega_{\alpha\alpha} = \Sigma \otimes Q^{-1}, \tag{18}
\]
\[
\Omega_{\alpha\sigma} = -D_m^+ (I_m \otimes B) \left(\Sigma \otimes Q^{-1}\right) - D_m^+ K_{m,m} (I_m \otimes B) \left(\Sigma \otimes Q^{-1}\right), \tag{19}
\]
\[
\begin{aligned}
\Omega_{\sigma\sigma} ={}& D_m^+ \left[\frac{1}{T+1} \Lambda_{m^2} + \frac{1}{T(T+1)} (\Sigma \otimes \Sigma) \left(I_{m^2} + K_{m,m}\right)\right] \left(D_m^+\right)' \\
&+ D_m^+ \left(\Sigma \otimes B Q^{-1} B'\right) \left(D_m^+\right)'
+ D_m^+ \left(B Q^{-1} B' \otimes \Sigma\right) \left(D_m^+\right)' \\
&+ D_m^+ \left(\Sigma \otimes B Q^{-1} B'\right) K_{m,m}' \left(D_m^+\right)'
+ D_m^+ K_{m,m} \left(\Sigma \otimes B Q^{-1} B'\right) \left(D_m^+\right)',
\end{aligned} \tag{20}
\]
and $\Lambda_{m^2} = \operatorname{var}(\operatorname{vec}(u_{i,t} u_{i,t}'))$ is an $m^2 \times m^2$ matrix.
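The estimators whose joint limit Theorem 1 describes, namely the equation-by-equation GMM estimator in (14) and the variance estimator in (8), can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the authors' code: it takes p = 1, Gaussian errors, standard normal fixed effects and initial values, and the function names `simulate_panel_var1` and `ab_gmm_var1` are hypothetical.

```python
import numpy as np

def simulate_panel_var1(A1, Sigma, N, T, rng):
    """Simulate y_{i,t} = A1 y_{i,t-1} + mu_i + u_{i,t} for periods t = -1, 0, ..., T."""
    m = A1.shape[0]
    chol = np.linalg.cholesky(Sigma)
    mu_i = rng.normal(size=(N, m))              # individual fixed effects (assumed N(0, I))
    Y = np.empty((N, T + 2, m))                 # Y[:, s] holds period t = s - 1
    Y[:, 0] = rng.normal(size=(N, m))           # unrestricted initial values y_{i,-1}
    for s in range(1, T + 2):
        u = rng.normal(size=(N, m)) @ chol.T    # Gaussian errors with variance Sigma
        Y[:, s] = Y[:, s - 1] @ A1.T + mu_i + u
    return Y

def ab_gmm_var1(Y):
    """Equation-by-equation GMM as in (14) and the variance estimator as in (8), for p = 1."""
    N, S, m = Y.shape
    T = S - 2
    dY = np.diff(Y, axis=1)                     # dY[:, s] is the first difference at period t = s
    dy, dX = dY[:, 1:], dY[:, :-1]              # Delta y_{i,t} and Delta y_{i,t-1}, t = 1..T
    G = 2 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
    L = m * T * (T + 1) // 2                    # number of instrument columns when p = 1
    Z = np.zeros((N, T, L))
    col = 0
    for t in range(1, T + 1):                   # row t uses levels y_{i,-1}, ..., y_{i,t-2}
        width = m * t
        Z[:, t - 1, col:col + width] = Y[:, :t].reshape(N, width)
        col += width
    S_zx = np.einsum('itl,itk->lk', Z, dX)
    S_zy = np.einsum('itl,itk->lk', Z, dy)
    S_zz = np.einsum('itl,itk->lk', Z, G @ Z)
    W = np.linalg.pinv(S_zz)
    A_hat = np.linalg.solve(S_zx.T @ W @ S_zx, S_zx.T @ W @ S_zy)  # estimates A = A1'
    y_lev, X_lev = Y[:, 1:], Y[:, :-1]          # levels for t = 0..T
    yd = y_lev - y_lev.mean(axis=1, keepdims=True)
    Xd = X_lev - X_lev.mean(axis=1, keepdims=True)
    resid = yd - Xd @ A_hat                     # demeaned residuals
    Sigma_hat = np.einsum('itj,itk->jk', resid, resid) / (N * T)
    return A_hat, Sigma_hat

rng = np.random.default_rng(123)
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Y = simulate_panel_var1(A1, Sigma, N=4000, T=5, rng=rng)
A_hat, Sigma_hat = ab_gmm_var1(Y)
```

With a large cross-section, `A_hat.T` should be close to `A1` and `Sigma_hat` close to `Sigma`; the divisor `N * T` rather than `N * (T + 1)` reflects the degree of freedom lost to demeaning.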
Remark 1. Theorem 1 shows that $\sqrt{N}(\hat\alpha_{GMM} - \alpha)$ is not asymptotically independent of $\sqrt{N} \operatorname{vech}(\hat\Sigma_{GMM} - \Sigma)$. This is in sharp contrast with the time series case. For a time series VAR model with Gaussian innovations, the MLEs of $\alpha$ and $\Sigma$ are asymptotically independent; see Hamilton (1994, Proposition 11.2). To construct valid confidence bands for the orthogonalized IRFs from short panel VARs, we have to take the asymptotic dependence between $\sqrt{N}(\hat\alpha_{GMM} - \alpha)$ and $\sqrt{N} \operatorname{vech}(\hat\Sigma_{GMM} - \Sigma)$ into account.
Remark 2. Eq. (41) in the proof states that
\[
\sqrt{N} \operatorname{vech}(\hat\Sigma_0 - \Sigma) \xrightarrow{d} N\left(0,\; D_m^+ \left[(T+1)^{-1} \Lambda_{m^2} + T^{-1} (T+1)^{-1} (\Sigma \otimes \Sigma) \left(I_{m^2} + K_{m,m}\right)\right] \left(D_m^+\right)'\right).
\]
So the asymptotic variance of $\sqrt{N} \operatorname{vech}(\hat\Sigma_0 - \Sigma)$ contains two terms. The second term, $T^{-1} (T+1)^{-1} D_m^+ (\Sigma \otimes \Sigma)(I_{m^2} + K_{m,m}) (D_m^+)'$, reflects the estimation uncertainty of the fixed effects. When $T \to \infty$, the second term is of smaller order than the first term, $(T+1)^{-1} D_m^+ \Lambda_{m^2} (D_m^+)'$, and disappears asymptotically. However, when $T$ is assumed to be fixed, both terms contribute to the asymptotic variance. This is different from the time series asymptotics. The difference highlights the risk of naively extending time series results to short panels.
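The relative size of the two terms can be made concrete in a small numerical sketch. It assumes Gaussian errors, for which $\Lambda_{m^2} = (\Sigma \otimes \Sigma)(I_{m^2} + K_{m,m})$; the helper names `commutation_matrix`, `duplication_matrix` and `sigma0_avar_terms`, as well as the example values of $\Sigma$ and $T$, are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

def commutation_matrix(m, n):
    """K such that K vec(G) = vec(G') for any m x n matrix G (column-stacking vec)."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0
    return K

def duplication_matrix(m):
    """D such that D vech(F) = vec(F) for any symmetric m x m matrix F."""
    D = np.zeros((m * m, m * (m + 1) // 2))
    col = 0
    for j in range(m):
        for i in range(j, m):       # vech stacks the elements on and below the diagonal
            D[j * m + i, col] = 1.0
            D[i * m + j, col] = 1.0
            col += 1
    return D

def sigma0_avar_terms(Sigma, T):
    """Return the two terms of the asymptotic variance in Remark 2 under Gaussian errors."""
    m = Sigma.shape[0]
    K = commutation_matrix(m, m)
    D_plus = np.linalg.pinv(duplication_matrix(m))          # (D'D)^{-1} D'
    Lam = np.kron(Sigma, Sigma) @ (np.eye(m * m) + K)       # Gaussian Lambda
    noise_term = D_plus @ (Lam / (T + 1)) @ D_plus.T        # first term
    fe_term = D_plus @ (Lam / (T * (T + 1))) @ D_plus.T     # fixed effects term
    return noise_term, fe_term

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
T = 5
noise_term, fe_term = sigma0_avar_terms(Sigma, T)
total = noise_term + fe_term
```

Under the Gaussian assumption the fixed effects term is exactly 1/T times the first term, so it is far from negligible when T is around 5, and the two terms sum to $T^{-1} D_m^+ \Lambda_{m^2} (D_m^+)'$. For m = 1 this total reduces to $2\sigma^4/T$.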
Remark 3. It is precisely because $B \ne 0$ that the fixed effects estimator, or the least squares dummy variable (LSDV) estimator, is asymptotically biased. If the asymptotic bias of the fixed effects estimator is nonnegligible, then it is likely that the asymptotic dependence between $\hat\alpha_{GMM}$ and $\hat\Sigma_{GMM}$ is also nonnegligible. This paper complements the papers by Anderson and Hsiao (1982) and Arellano and Bond (1991) in that they investigate the consequence of demeaning for short panels on the slope estimation while we examine the consequence on the variance estimation.
Remark 4. In general, the variance matrix $\Lambda_{m^2}$ depends on the fourth-order multivariate cumulants of $u_{i,t}$. The relation between product-moments and multivariate cumulants is rather technical; see Bilodeau and Brenner (1999, Appendix B). However, if we make some distributional assumptions on $u_{i,t}$, $\Lambda_{m^2}$ may be simplified. For example, if we assume that $u_{i,t}$ follows an elliptical distribution, then
\[
\Lambda_{m^2} = \frac{1}{3} \kappa\, (\Sigma \otimes \Sigma) \left(I_{m^2} + K_{m,m}\right) + \left(\frac{\kappa}{3} - 1\right) \operatorname{vec}(\Sigma) \left[\operatorname{vec}(\Sigma)\right]'
\]
where $\kappa$ is the kurtosis of any element of the standardized error $\tilde u_{i,t} = \Sigma^{-1/2} u_{i,t}$. That is, $\kappa = E([\tilde u_{i,t}^{(j)}]^4) / \{E([\tilde u_{i,t}^{(j)}]^2)\}^2$. For a proof of this result, see Bilodeau and Brenner (1999, example 13.6). As a special case, when $u_{i,t} \sim \text{i.i.d. } N(0, \Sigma)$, we have $\kappa = 3$ and
\[
\Lambda_{m^2} = (\Sigma \otimes \Sigma) \left(I_{m^2} + K_{m,m}\right).
\]
In this section, we first define the IRFs for reduced-form VARs and structural VARs. We then consider the large sample approximation and bootstrap approximation to the sampling distribution of the IRFs.
Since the impulse response function does not depend on the index i and fixed effects in the system, we omit the subscript i and consider the reduced-form VAR model: for t = 0, . . . , T .
(21)
The impulse response matrix is defined to be
∂ yt +j . ∂ u′t
The (k, ℓ)-th element of Φj describes the response of k-th element of yt +j to one unit impulse in ℓ-th element of yt with all other variables dated t or earlier held constant. The plot of the (k, ℓ)th element of Φj as a function of j is called the non-orthogonalized impulse–response function. To compute Φj , we let
...
,
yt −p+1 mp×1
ut
0 0 Ut = . . .. 0 mp×1
A1 Im 0 F =
A2 0 Im
0
0
...
F j Ut −j + F t +1 Y−1 ,
j =0
so ∂ Yt +j /∂ Ut′ = F j . By definition, Φj is the first m × m block of F j . Differentiating both sides of (21) yields:
∂ yt +j−1 ∂ y t +j −p ∂ yt +j = A1 + · · · + Ap . ∂ u′t ∂ u′t ∂ u′t That is, Φj satisfies the recursive relationship: Φj =
p −
.. .
··· ··· ··· .. .
Ap−1 0 0
···
Im
mp×mp
Aℓ Φj−ℓ ,
j = 1, 2, . . .
(22)
ℓ=1
with Φj = 0 for j < 0 and Φ0 = Im . To estimate the non-orthogonalized impulse response, we plug ∑p ˆ ˆ ˆj = the estimate Aˆ into (22) and get Φ ℓ=1 Aℓ Φj−ℓ with the
ˆ j = 0 for j < 0 and Φ ˆ 0 = Im . initialization Φ In empirical applications, it is a standard practice to report the orthogonalized impulse response function. Let Pr Pr′ = Σ be the Cholesky decomposition of Σ , where Pr is a lower triangular matrix with positive diagonal elements. Then Pr−1 yt = Pr−1 A1 yt −1 + · · · + Pr−1 Ap yt −p + urt for t = 0, . . . , T
(23) −1
∂ y t +j ∂ y t + j ∂ ut = = Φj Pr . ∂ u′t ∂ (urt )′ ∂ (urt )′
In general, the model in (23) does not have a structural interpretation. One exception is the recursive structural VAR model defined by
Ayt = As1 yt −1 + · · · + Asp yt −p + ust
for t = 0, . . . , T
(24)
where A is a lower triangular matrix, and ust
3.1. IRFs for reduced-form and structural VARs
yt yt −1 y t −2 Yt =
t −
Θjr =
3. Asymptotic approximation to the distribution of IRFs
Yt =
where = Pr ut has mean zero and variance Im . The orthogonalized impulse response matrix is defined to be
We will use this formula in our simulation study.
Φj =
Then
urt
yt = A1 yt −1 + · · · + Ap yt −p + ut
131
.. .
Ap 0 0 ,
.. .
0
is a vector white noise process with variance matrix Im . To obtain the recursive structure, we may have to re-order the variables in yt according to some economic theory. The structural model coincides with the reducedform model in (23) if we let Pr−1 = A, Pr−1 Ai = Asi and urt = ust . In this case, the IRFs based on the Cholesky decomposition have structural interpretations. In the absence of a recursive structure, we decompose Σ into Σ = P s (P s )′ where (P s )−1 satisfies the same identification restrictions imposed on the matrix A. This decomposition is different from the Cholesky decomposition, as it is dictated by the identification restrictions. To identify A, it is necessary to impose m(m − 1)/2 restrictions. Following Lütkepohl (2005, p. 360), we consider the identification restrictions: SA vec(A) = sA where SA is a m(m − 1)/2 × m2 selection matrix and sA is a suitable m(m − 1)/2 × 1 fixed vector. Under these conditions, we solve for P s such that Σ = P s (P s )′ and SA vec[(P s )−1 ] = sA . The reduced-form Cholesky decomposition and the structural decomposition can be represented in a unified framework. Given the covariance matrix Σ , we want to find P such that
$$\Sigma = PP' \quad \text{and} \quad S\,\mathrm{vec}(P^{-1}) = s \tag{25}$$

for some selection matrix $S$ and constant vector $s$. For the structural VAR, $(S, s)$ is based on economic theory. For the reduced-form Cholesky decomposition, $S$ is the matrix that selects the upper triangular elements of $P^{-1}$ and $s$ is a vector of zeros. Given the matrix $P$, the orthogonalized IRFs are $\Theta_j = \Phi_j P$.

The orthogonalized IRF can be estimated by plugging the estimates $\hat\Phi_j$ and $\hat\Sigma$ into its definition. More specifically, let $\hat P$ be the plug-in estimator of $P$ such that $\hat\Sigma = \hat P \hat P'$ and $S\,\mathrm{vec}(\hat P^{-1}) = s$. Then the orthogonalized IRF can be estimated by $\hat\Theta_j = \hat\Phi_j \hat P$.
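As an illustration of the plug-in step, the following is a minimal sketch for the reduced-form Cholesky case only. It assumes the recursion $\Phi_j = \sum_{\ell=1}^{p} A_\ell \Phi_{j-\ell}$ with $\Phi_0 = I_m$ and $\Phi_j = 0$ for $j < 0$, and takes $P$ to be the lower triangular Cholesky factor of $\Sigma$. The function name `irf` and its arguments are hypothetical and are not taken from the paper.

```python
import numpy as np

def irf(A_list, Sigma, horizon):
    """Sketch: non-orthogonalized (Phi_j) and orthogonalized (Theta_j) IRFs.

    A_list  : list of p (m x m) autoregressive matrices [A_1, ..., A_p]
    Sigma   : (m x m) positive definite error variance matrix
    horizon : largest j to compute
    """
    m = Sigma.shape[0]
    p = len(A_list)
    # Lower triangular P with P @ P.T == Sigma (reduced-form Cholesky case).
    P = np.linalg.cholesky(Sigma)
    Phi = [np.eye(m)]  # Phi_0 = I_m
    for j in range(1, horizon + 1):
        acc = np.zeros((m, m))
        # Terms with j - l < 0 are zero by the initialization, so stop at min(j, p).
        for l in range(1, min(j, p) + 1):
            acc += A_list[l - 1] @ Phi[j - l]
        Phi.append(acc)
    Theta = [Phi_j @ P for Phi_j in Phi]  # Theta_j = Phi_j P
    return Phi, Theta
```

In an application the inputs would be the estimates $\hat A_1, \ldots, \hat A_p$ and $\hat\Sigma$. A structural decomposition with general restrictions $(S, s)$ would need a different solver for $P$ and is not covered by this sketch.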
B. Cao, Y. Sun / Journal of Econometrics 163 (2011) 127–143
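3.2. Large sample approximation

The limiting distribution of $\hat\Phi_j$ can be derived using the delta method. More specifically, taking transposes of (22) and differentiating the resulting equation with respect to $\alpha_q$, the $q$-th element of $\alpha$, yields:

$$\frac{\partial \Phi_j'}{\partial \alpha_q} = \sum_{\ell=1}^{p} \frac{\partial \Phi_{j-\ell}'}{\partial \alpha_q} A_\ell' + \sum_{\ell=1}^{p} \Phi_{j-\ell}' \frac{\partial A_\ell'}{\partial \alpha_q} = \sum_{\ell=1}^{p} \frac{\partial \Phi_{j-\ell}'}{\partial \alpha_q} A_\ell' + [\Phi_{j-1}', \Phi_{j-2}', \ldots, \Phi_{j-p}'] \frac{\partial A}{\partial \alpha_q}.$$

Consequently,

$$\frac{\partial\, \mathrm{vec}(\Phi_j')}{\partial \alpha'} = \sum_{\ell=1}^{p} (A_\ell \otimes I_m) \frac{\partial\, \mathrm{vec}(\Phi_{j-\ell}')}{\partial \alpha'} + I_m \otimes [\Phi_{j-1}', \Phi_{j-2}', \ldots, \Phi_{j-p}'].$$

Let $G_0 = \partial\, \mathrm{vec}(\Phi_0') / \partial \alpha' = 0$ and let $G_j = \partial\, \mathrm{vec}(\Phi_j') / \partial \alpha'$, an $m^2 \times pm^2$ matrix; then

$$G_j = \sum_{\ell=1}^{p} (A_\ell \otimes I_m) G_{j-\ell} + \left(I_m \otimes [\Phi_{j-1}', \Phi_{j-2}', \ldots, \Phi_{j-p}']\right), \quad j = 1, 2, \ldots$$

with $G_j = 0$ for $j < 0$. A closed-form solution for $G_j$ is

$$G_j = \sum_{\ell=0}^{j-1} \Phi_\ell \otimes [\Phi_{j-\ell-1}', \Phi_{j-\ell-2}', \ldots, \Phi_{j-\ell-p}'] = \sum_{\ell=0}^{j-1} \Phi_\ell \otimes J (F')^{j-\ell-1},$$

where $J = [I_m, 0_m, \ldots, 0_m]$ is an $m \times mp$ matrix. A consistent estimator of $G_j$ can be obtained by plugging $\hat A$ and $\hat\Phi_j$ into the above equation. The asymptotic distribution of the non-orthogonalized impulse response function is

$$\sqrt{N}\, \mathrm{vec}(\hat\Phi_j' - \Phi_j') \xrightarrow{d} N(0, \Omega_{\Phi j}) \quad \text{for } \Omega_{\Phi j} = G_j \Omega_{\alpha\alpha} G_j'. \tag{26}$$

We can estimate $\Omega_{\Phi j}$ by

$$\hat\Omega_{\Phi j} = \hat G_j \hat\Omega_{\alpha\alpha} \hat G_j' \tag{27}$$

where

$$\hat G_j = \sum_{\ell=0}^{j-1} \hat\Phi_\ell \otimes [\hat\Phi_{j-\ell-1}', \hat\Phi_{j-\ell-2}', \ldots, \hat\Phi_{j-\ell-p}'], \qquad \hat\Omega_{\alpha\alpha} = \hat\Sigma_{GMM} \otimes \hat Q^{-1},$$

and

$$\hat Q = \left(\frac{1}{N} \sum_{i=1}^{N} \Delta X_i' Z_i\right) \left(\frac{1}{N} \sum_{i=1}^{N} Z_i' G Z_i\right)^{-1} \left(\frac{1}{N} \sum_{i=1}^{N} Z_i' \Delta X_i\right).$$

To derive the limiting distribution of $\hat\Theta_j$, we use the delta method again. We compute $C_{\alpha j} = \partial\, \mathrm{vec}(\Theta_j') / \partial \alpha'$ and $C_{\sigma j} = \partial\, \mathrm{vec}(\Theta_j') / \partial [\mathrm{vech}(\Sigma)]'$ as follows:

$$C_{\alpha j} = \frac{\partial\, \mathrm{vec}(P' \Phi_j')}{\partial \alpha'} = (I_m \otimes P') \frac{\partial\, \mathrm{vec}(\Phi_j')}{\partial \alpha'} = (I_m \otimes P') G_j,$$

$$C_{\sigma j} = \frac{\partial\, \mathrm{vec}(P' \Phi_j')}{\partial [\mathrm{vech}(\Sigma)]'} = (\Phi_j \otimes I_m) \frac{\partial\, \mathrm{vec}(P')}{\partial [\mathrm{vech}(\Sigma)]'} = (\Phi_j \otimes I_m) K_{m,m} \frac{\partial\, \mathrm{vec}(P)}{\partial [\mathrm{vech}(\Sigma)]'}.$$

To obtain a closed form expression for $\partial\, \mathrm{vec}(P) / \partial [\mathrm{vech}(\Sigma)]'$, we differentiate both sides of the equations in (25) and get

$$\begin{bmatrix} 2 D_m^+ (P \otimes I_m) \\ S[(P^{-1})' \otimes P^{-1}] \end{bmatrix} \mathrm{vec}(dP) = \begin{bmatrix} \mathrm{vech}(d\Sigma) \\ 0 \end{bmatrix}.$$

Let $O_m$ be the first $m(m+1)/2$ columns of the inverse of the matrix on the left-hand side, that is,

$$\begin{bmatrix} 2 D_m^+ (P \otimes I_m) \\ S[(P^{-1})' \otimes P^{-1}] \end{bmatrix}^{-1} = [O_m, \Delta]$$

for some $m^2 \times m(m+1)/2$ matrix $O_m$ and $m^2 \times m(m-1)/2$ matrix $\Delta$. Then $\mathrm{vec}(dP) = O_m \mathrm{vech}(d\Sigma)$, and so

$$C_{\sigma j} = (\Phi_j \otimes I_m) K_{m,m} O_m.$$

The following theorem gives the asymptotic distribution of the orthogonalized IRF $\hat\Theta_j$.

Theorem 2. Let Assumptions 1 and 2 hold. Assume that the matrix

$$\begin{bmatrix} 2 D_m^+ (P \otimes I_m) \\ S[(P^{-1})' \otimes P^{-1}] \end{bmatrix}$$

is of full rank. Then

$$\sqrt{N}\, \mathrm{vec}[(\hat\Theta_j - \Theta_j)'] \xrightarrow{d} N(0, \Omega_{\Theta j}), \tag{28}$$

where

$$\Omega_{\Theta j} = [C_{\alpha j}, C_{\sigma j}] \begin{bmatrix} \Omega_{\alpha\alpha} & \Omega_{\alpha\sigma} \\ \Omega_{\sigma\alpha} & \Omega_{\sigma\sigma} \end{bmatrix} \begin{bmatrix} C_{\alpha j}' \\ C_{\sigma j}' \end{bmatrix}, \tag{29}$$

and

$$C_{\alpha j} = (I_m \otimes P') G_j, \qquad C_{\sigma j} = (\Phi_j \otimes I_m) K_{m,m} O_m.$$

In the time series setting, the matrix $\Omega_{\alpha\sigma} = 0$ (e.g. Proposition 1 in Lütkepohl (1990)). As a result, the cross product terms $C_{\sigma j} \Omega_{\sigma\alpha} C_{\alpha j}'$, $C_{\alpha j} \Omega_{\alpha\sigma} C_{\sigma j}'$ are not present in the asymptotic variance of the orthogonalized IRF. In contrast, for short panel VARs, $\Omega_{\alpha\sigma} \neq 0$, and the cross product terms cannot be ignored. In addition, compared to the time series cases, the asymptotic variance $\Omega_{\sigma\sigma}$ contains a few extra terms, reflecting the estimation uncertainty of the slope coefficients and the fixed effects. So for short panel VARs, it is important to include the cross product terms $C_{\sigma j} \Omega_{\sigma\alpha} C_{\alpha j}'$, $C_{\alpha j} \Omega_{\alpha\sigma} C_{\sigma j}'$ and extra terms in $\Omega_{\sigma\sigma}$ in computing the asymptotic variance, especially when $T$ is small.

The matrix $O_m$ can be simplified if $P$ is restricted to be a lower triangular matrix. In this case

$$\frac{\partial\, \mathrm{vec}(P)}{\partial [\mathrm{vech}(\Sigma)]'} = L_m' \frac{\partial\, \mathrm{vech}(P)}{\partial [\mathrm{vech}(\Sigma)]'}.$$

But it follows from $\Sigma = PP'$ that $\mathrm{vec}(d\Sigma) = (I_{m^2} + K_{m,m})(P \otimes I_m) \mathrm{vec}(dP)$ and so

$$\frac{\partial\, \mathrm{vech}(P)}{\partial [\mathrm{vech}(\Sigma)]'} = \left[L_m (I_{m^2} + K_{m,m})(P \otimes I_m) L_m'\right]^{-1} = \left[2 D_m^+ (P \otimes I_m) L_m'\right]^{-1}.$$

As a result

$$O_m = L_m' \left[2 D_m^+ (P \otimes I_m) L_m'\right]^{-1}.$$

The structural VAR model in (24) is referred to as the A-model by Lütkepohl (2005, p. 358). Another class of structural VAR models, the so-called B-model, is defined to be

$$y_t = A_1^s y_{t-1} + \cdots + A_p^s y_{t-p} + B u_t^s \quad \text{for } t = 0, \ldots, T$$

where $u_t^s$ is normalized to have mean zero and variance $I_m$. To identify the structural parameter $B$, we impose $m(m-1)/2$ restrictions of the form $S_B \mathrm{vec}(B) = s_B$. In this case, the map between the reduced form variance $\Sigma$ and the structural matrix $B$ is defined by $\Sigma = BB'$ and $S_B \mathrm{vec}(B) = s_B$. In other words, we solve for $P$ such that $\Sigma = PP'$ and $S_B \mathrm{vec}(P) = s_B$. Theorem 2 continues to hold but $O_m$ is now the first $m(m+1)/2$ columns of the matrix

$$\begin{bmatrix} 2 D_m^+ (P \otimes I_m) \\ S_B \end{bmatrix}^{-1}.$$

To consistently estimate the asymptotic variance $\Omega_{\Theta j}$, we plug consistent estimates of $C_{\alpha j}$, $C_{\sigma j}$, $\Omega_{\alpha\alpha}$, $\Omega_{\alpha\sigma}$ and $\Omega_{\sigma\sigma}$ into (29), leading to

$$\hat\Omega_{\Theta j} = \hat C_{\alpha j} \hat\Omega_{\alpha\alpha} \hat C_{\alpha j}' + \hat C_{\sigma j} \hat\Omega_{\sigma\sigma} \hat C_{\sigma j}' + \hat C_{\sigma j} \hat\Omega_{\sigma\alpha} \hat C_{\alpha j}' + \hat C_{\alpha j} \hat\Omega_{\alpha\sigma} \hat C_{\sigma j}', \tag{30}$$

where

$$\hat C_{\alpha j} = (I_m \otimes \hat P') \hat G_j \quad \text{and} \quad \hat C_{\sigma j} = (\hat\Phi_j \otimes I_m) K_{m,m} \hat O_m.$$

Here the $\hat\Omega$'s are defined in (18)–(20) with $B$, $Q$, $\Sigma$ replaced by $\hat B$, $\hat Q$, $\hat\Sigma_{GMM}$, where

$$\hat B = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=0}^{T} \hat u_{i,t} \left(X_{i,t} - \bar X_{i,\cdot}\right)'.$$

$\hat O_m$ is the first $m(m+1)/2$ columns of

$$\begin{bmatrix} 2 D_m^+ (\hat P \otimes I_m) \\ S_A[(\hat P^{-1})' \otimes \hat P^{-1}] \end{bmatrix}^{-1} \quad \text{or} \quad \begin{bmatrix} 2 D_m^+ (\hat P \otimes I_m) \\ S_B \end{bmatrix}^{-1},$$

respectively for the A-model and B-model. For recursive structural VARs, we can take

$$\hat O_m = L_m' \left[2 D_m^+ (\hat P \otimes I_m) L_m'\right]^{-1}.$$

3.3. Bootstrap approximation

The large sample approximation in the previous subsection is based on the delta method. In finite samples, the approximation may not capture the finite sample distribution very well. In this subsection, we consider the bootstrap approximation to the IRFs. We use the nonparametric i.i.d. bootstrap along the cross sectional dimension. Let $y_i = (y_{i,-p}, \ldots, y_{i,T})'$, $i = 1, 2, \ldots, N$ denote the original sample and $P$ denote the true distribution that generates the sample $\{y_1, \ldots, y_N\}$. The bootstrap sample is denoted by $\{y_i^* = (y_{i,-p}^*, \ldots, y_{i,T}^*)', i = 1, 2, \ldots, N\}$ where $y_i^*$ follows the bootstrap distribution $P^*$, a discrete distribution that places probability mass $1/N$ at each point in the sample $\{y_1, \ldots, y_N\}$. By definition, the bootstrap sample satisfies

$$\Delta y_{i,t}^* = \hat A_1 \Delta y_{i,t-1}^* + \cdots + \hat A_p \Delta y_{i,t-p}^* + \Delta u_{i,t}^*, \quad t = 1, \ldots, T$$

where $\{u_i^* = (u_{i,-p}^*, \ldots, u_{i,T}^*)', i = 1, 2, \ldots, N\}$ are simple random draws from the estimated errors $\{\hat u_i = (\hat u_{i,-p}, \ldots, \hat u_{i,T})', i = 1, 2, \ldots, N\}$. The GMM estimator of $\alpha$ based on the bootstrap sample is

$$\hat\alpha_{GMM}^* = \mathrm{vec}((\hat A_1^*, \ldots, \hat A_p^*)') = \mathrm{vec}\left\{ \left[ \sum_{i=1}^{N} \Delta X_i^{*\prime} Z_i^* \left( \sum_{i=1}^{N} Z_i^{*\prime} G Z_i^* \right)^{-1} \sum_{i=1}^{N} Z_i^{*\prime} \Delta X_i^* \right]^{-1} \times \sum_{i=1}^{N} \Delta X_i^{*\prime} Z_i^* \left( \sum_{i=1}^{N} Z_i^{*\prime} G Z_i^* \right)^{-1} \sum_{i=1}^{N} Z_i^{*\prime} \Delta y_i^* \right\},$$

where the variables $X_i^*$, $Z_i^*$ and $y_i^*$ are defined in the same way as $X_i$, $Z_i$ and $y_i$ except that they are based on the bootstrap sample. Similarly, the bootstrap estimator of $\Sigma$ is

$$\hat\Sigma_{GMM}^* = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=0}^{T} \hat u_{i,t}^* \hat u_{i,t}^{*\prime},$$

where

$$\hat u_{i,t}^* = y_{i,t}^* - \bar y_{i,\cdot}^* - (\hat A^*)' \left(X_{i,t}^* - \bar X_{i,\cdot}^*\right), \quad \text{for } t = 0, \ldots, T.$$

Given $\hat\alpha_{GMM}^*$ and $\hat\Sigma_{GMM}^*$, we can compute the IRFs $\hat\Phi_j^*$ and the orthogonalized IRFs $\hat\Theta_j^*$ using the same procedure as that for $\hat\Phi_j$ and $\hat\Theta_j$. In addition, we can compute $\hat\Omega_{\Phi j}^*$ and $\hat\Omega_{\Theta j}^*$ in exactly the same way as $\hat\Omega_{\Phi j}$ and $\hat\Omega_{\Theta j}$ defined in (27) and (30) but use the bootstrap sample.

Theorem 3. Let Assumption 1, Assumption 2(iii) and (iv) hold. In addition, assume that $E(\|u_{i,t}\|^{16+8\delta})$, $\max_i E\|X_{i,0}\|^{16+8\delta}$, $\max_i E\|\mu_i\|^{16+8\delta}$ are finite for some $\delta > 0$. Then for any conformable vector $c$, the following hold uniformly over $x \in \mathbb{R}$:

(a) $$P\left(c' \sqrt{N}\, \mathrm{vec}[(\hat\Phi_j^* - \hat\Phi_j)'] < x\right) = P\left(c' \sqrt{N}\, \mathrm{vec}[(\hat\Phi_j - \Phi_j)'] < x\right) + o(1),$$

(b) $$P\left(c' \sqrt{N}\, \mathrm{vec}[(\hat\Theta_j^* - \hat\Theta_j)'] < x\right) = P\left(c' \sqrt{N}\, \mathrm{vec}[(\hat\Theta_j - \Theta_j)'] < x\right) + o(1),$$

(c) $$P\left(\frac{c' \sqrt{N}\, \mathrm{vec}[(\hat\Phi_j^* - \hat\Phi_j)']}{\sqrt{c' \hat\Omega_{\Phi j}^* c}} < x\right) = P\left(\frac{c' \sqrt{N}\, \mathrm{vec}[(\hat\Phi_j - \Phi_j)']}{\sqrt{c' \hat\Omega_{\Phi j} c}} < x\right) + o(1),$$

(d) $$P\left(\frac{c' \sqrt{N}\, \mathrm{vec}[(\hat\Theta_j^* - \hat\Theta_j)']}{\sqrt{c' \hat\Omega_{\Theta j}^* c}} < x\right) = P\left(\frac{c' \sqrt{N}\, \mathrm{vec}[(\hat\Theta_j - \Theta_j)']}{\sqrt{c' \hat\Omega_{\Theta j} c}} < x\right) + o(1).$$

To evaluate the probabilities in Theorem 3, we first condition on the original sample under which the average statistics of interest converge almost surely. The moment conditions in Theorem 3 ensure that the sample we condition on occurs with probability one. Consequently, the conditional convergence results can then be converted into unconditional results. The moment conditions are likely to be stronger than necessary but they facilitate the proof.

To prove Theorem 3, we first show that $[\sqrt{N}(\hat\alpha_{GMM}^* - \hat\alpha_{GMM}), \sqrt{N}\, \mathrm{vech}(\hat\Sigma_{GMM}^* - \hat\Sigma_{GMM})]$ has the same joint limiting distribution as $[\sqrt{N}(\hat\alpha_{GMM} - \alpha), \sqrt{N}\, \mathrm{vech}(\hat\Sigma_{GMM} - \Sigma)]$ in Lemma 1 given in the Appendix. We then invoke a delta-type method for bootstrap approximation. For the standard delta method, it is sufficient to assume that the function of interest is continuous. Here we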
require the function to be continuously differentiable. The delta method for bootstrap approximation is likely to be of independent interest.

A direct implication of Theorem 3 is that the bootstrap percentile confidence band and bootstrap percentile-t confidence band are asymptotically valid to the first order. Higher order refinement of the bootstrap approximation may require stronger moment conditions and some adjustment of the bootstrap GMM estimator when the model is overidentified (cf. Horowitz, 1997). This is beyond the scope of the present paper.

4. Simulation evidence

In this section, we provide some simulation evidence on the accuracy of the asymptotic approximation and the bootstrap approximation to the sampling variability of the orthogonalized impulse response functions. We consider the panel VAR model with two variables. The data generating process is

$$y_{i,t} = \mu + A y_{i,t-1} + \mu_i + P e_{i,t}$$

for $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$ where $\mu = (0, 0)'$, $e_{i,t} \sim$ i.i.d. $N(0, I_2)$ and $\mu_i \sim$ i.i.d. $N(0, I_2)$. For each given $T$, we set the initial value of the process $y_{i,t}$ to be zero and generate a 2-dimensional time series of length $T + T_d$. We drop the first $T_d$ observations to obtain the simulated sample.

We specify matrices $A$ and $P$ as follows. Let $R$ and $S$ be $2 \times 2$ random matrices whose elements are i.i.d. from the uniform $[0, 1]$ distribution. Let $b_1$ and $b_2$ be the eigenvectors of $R'R$. We set $A$ to be

$$A = \lambda_1 b_1 b_1' + \lambda_2 b_2 b_2'$$

for $\lambda_1$ and $\lambda_2 \in (0.10, 0.35, 0.60, 0.85)$, $\lambda_1 \neq \lambda_2$, and set $P$ to be the lower triangular matrix such that $P'P = SS'$. Since the performance of the AB estimator may be sensitive to the specifications of $A$ and $P$ (e.g. Bun and Kiviet, 2006), we have experimented with different random matrices $R$ and $S$. We have also experimented with

$$P = \begin{bmatrix} 1 & 0 \\ \rho & 1 \end{bmatrix}$$

for different values of $\rho$. The qualitative information from our simulation results remains more or less the same.

We consider different $N$ and $T$ combinations, i.e. $N = 100, 200$ and $T = 5, 10, 20$. We set $T_d$ to be 1, 5, 10, 50. For each $(N, T)$ combination, we estimate the model using the AB estimator. To avoid the weak instrument problem, we do not use the lagged dependent variables dated too early as instruments. Instead, we set the maximum number of lags of the dependent variable that can be used as instruments to be 3. Our simulation results change only slightly when we set the maximum number of lags to be 1, 2 and 4. The number of simulation replications is 5000.

4.1. The asymptotic and bootstrap confidence bands

Given the estimated autoregressive and variance parameters, we construct the orthogonalized impulse response functions and the corresponding 95% confidence band based on the asymptotic distribution in (28). As a comparison, we also construct the 95% confidence band when $\Omega_{\alpha\sigma}$ and $\Omega_{\sigma\sigma}$ are set to be

$$\Omega_{\alpha\sigma} = 0, \qquad \Omega_{\sigma\sigma} = (T + 1)^{-1} D_m^+ (\Sigma \otimes \Sigma)\left(I_{m^2} + K_{m,m}\right) D_m^{+\prime}. \tag{31}$$

In this case, the asymptotic dependence between $\sqrt{N}(\hat\alpha_{GMM} - \alpha)$ and $\sqrt{N}\, \mathrm{vech}(\hat\Sigma_{GMM} - \Sigma)$ and the additional randomness of $\sqrt{N}\, \mathrm{vech}(\hat\Sigma_{GMM} - \Sigma)$ are ignored. Both confidence bands are of the form

$$\left[\hat\Theta_j(k, \ell) - 1.96 \times \sqrt{\hat\Omega_{\Theta j}(k, \ell)/N},\ \ \hat\Theta_j(k, \ell) + 1.96 \times \sqrt{\hat\Omega_{\Theta j}(k, \ell)/N}\right].$$

For convenience, we call the confidence band with the naive asymptotic variance given in (31) the naive CLT confidence band and the one based on (26) the variance-corrected CLT confidence band.

In the simulation experiment, we also consider the finite sample performances of bootstrap confidence bands. Hall (1992) discusses three types of bootstrap confidence bands. See Lütkepohl (2005) for VARs. The first is Hall's percentile confidence band. In our case, the 95% bootstrap confidence band is

$$\left[\hat\Theta_j(k, \ell) - CV_U^{\hat\Theta_j^*(k,\ell)},\ \ \hat\Theta_j(k, \ell) - CV_L^{\hat\Theta_j^*(k,\ell)}\right], \tag{32}$$

where $CV_U^{\hat\Theta_j^*(k,\ell)}$ and $CV_L^{\hat\Theta_j^*(k,\ell)}$ are the 97.5% and 2.5% quantiles of $\hat\Theta_j^*(k, \ell) - \hat\Theta_j(k, \ell)$, respectively.

The other two types of bootstrap bands are based on the t-statistic. For the original sample, the t-statistic is

$$t = \frac{\sqrt{N}\left(\hat\Theta_j(k, \ell) - \Theta_j(k, \ell)\right)}{\sqrt{\hat\Omega_{\Theta j}(k, \ell)}} \tag{33}$$

while for the bootstrap sample the t-statistic is

$$t^* = \frac{\sqrt{N}\left(\hat\Theta_j^*(k, \ell) - \hat\Theta_j(k, \ell)\right)}{\sqrt{\hat\Omega_{\Theta j}^*(k, \ell)}}. \tag{34}$$

Then the equal-tailed percentile-t confidence band is

$$\left[\hat\Theta_j(k, \ell) - CV_U^{t^*} \times \sqrt{\hat\Omega_{\Theta j}(k, \ell)/N},\ \ \hat\Theta_j(k, \ell) - CV_L^{t^*} \times \sqrt{\hat\Omega_{\Theta j}(k, \ell)/N}\right], \tag{35}$$

where $CV_U^{t^*}$ and $CV_L^{t^*}$ are the 97.5% and 2.5% quantiles of $t^*$ respectively. If instead of $t^*$, the quantiles are calculated based on $|t^*|$, then we have the symmetric percentile-t confidence band. Let $CV^{|t^*|}$ be the 95% quantile of the bootstrap distribution of $|t^*|$; the symmetric bootstrap confidence band is:

$$\left[\hat\Theta_j(k, \ell) - CV^{|t^*|} \times \sqrt{\hat\Omega_{\Theta j}(k, \ell)/N},\ \ \hat\Theta_j(k, \ell) + CV^{|t^*|} \times \sqrt{\hat\Omega_{\Theta j}(k, \ell)/N}\right]. \tag{36}$$

General bootstrap theory suggests that the symmetric percentile-t confidence band has the most accurate coverage probability among the three bootstrap confidence bands considered here. See Hall (1992). An extensive simulation study in a previous version of this paper supports this qualitative observation. So we focus on the symmetric percentile-t confidence band hereafter. Depending on how $\hat\Omega_{\Theta j}(k, \ell)$ and its bootstrap version are computed, we obtain two different symmetric percentile-t bootstrap confidence bands: the naive bootstrap confidence band and the variance-corrected bootstrap confidence band.
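The construction of the symmetric percentile-t band can be sketched as follows for a single IRF element. This is only an illustration under stated assumptions: it treats $\hat\Omega_{\Theta j}(k, \ell)$ as an estimated asymptotic variance, so that the standard error is its square root divided by $\sqrt{N}$, and it takes the bootstrap draws as given inputs. The function name and argument names are hypothetical.

```python
import numpy as np

def symmetric_percentile_t_band(theta_hat, omega_hat, theta_star, omega_star, n, level=0.95):
    """Sketch: symmetric percentile-t band for one IRF element.

    theta_hat  : point estimate from the original sample
    omega_hat  : estimated asymptotic variance from the original sample
    theta_star : array of bootstrap estimates
    omega_star : array of bootstrap variance estimates (same length)
    n          : cross-sectional sample size N
    """
    theta_star = np.asarray(theta_star, dtype=float)
    omega_star = np.asarray(omega_star, dtype=float)
    # Bootstrap t-statistics, centered at the original-sample estimate.
    t_star = np.sqrt(n) * (theta_star - theta_hat) / np.sqrt(omega_star)
    # Critical value: quantile of |t*| at the nominal level.
    cv = np.quantile(np.abs(t_star), level)
    half_width = cv * np.sqrt(omega_hat / n)
    return theta_hat - half_width, theta_hat + half_width
```

Passing naive or variance-corrected variance estimates for `omega_hat` and `omega_star` would give the naive or the variance-corrected bootstrap band, respectively.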
4.2. Simulation results

Fig. 1 graphs the empirical coverage of different confidence bands against the forecasting horizons when N = 100, T = 5, Td = 1. The confidence bands considered are: the naive CLT band, the variance-corrected CLT band, the naive symmetric percentile-t bootstrap band, and the variance-corrected symmetric percentile-t bootstrap band. Here we focus on the case (λ1 , λ2 ) = (0.1, 0.6), which is representative of other cases. We will refer to this parameterization as our base case hereafter. The empirical coverage of the CLT confidence bands is the coverage rate based on 5000 rounds of Monte Carlo simulations. For the bootstrap confidence bands, we compute their empirical coverage based on 999 bootstrap replications in each of 5000 simulation replications.

Several patterns emerge from the figure. First, the empirical coverage of the variance-corrected CLT confidence band is closer to the nominal coverage probability than that of the naive CLT confidence band. For some scenarios, the variance-corrected CLT confidence band dominates the naive CLT confidence band by a large margin. Across all the subplots in the figure, we find that the empirical coverage of the naive CLT band is considerably lower than the nominal coverage. A direct implication is that the naive asymptotic variance under-estimates the sampling variability of the impulse response. As a result, inferences based on the naive asymptotic variance may lead to finding a statistically significant relationship that does not actually exist. Second, similar to the findings for the CLT bands, the advantage in coverage for the variance-corrected bootstrap band over the naive bootstrap band is visible, although the margin of improvement is smaller than in the CLT case. Third, the bootstrap confidence band has a more accurate coverage than the corresponding CLT confidence band.
The larger coverage error of the CLT bands may reflect the limitation of the delta method in capturing the finite sample distribution for the IRFs. On the other hand, the coverage of the bootstrap band is very close to the nominal level for all forecasting horizons. This superior performance suggests that the bootstrap approximation may provide a higher-order refinement to the first order normal approximation. This is an interesting theoretical question for future research.

Fig. 2 shows the median widths of the confidence bands reported in Fig. 1. It is clear that the variance-corrected confidence bands, for both the asymptotic and bootstrap ones, are wider than the corresponding naive bands. From this figure, we can see that the widths of the confidence bands ranked from high to low are: variance-corrected bootstrap band, naive bootstrap band, variance-corrected CLT band, and naive CLT band. This is generally true for all the other cases that we consider.

The next figure, Fig. 3, shows the average and median of the relative biases of the IRFs. The biases are measured as percentages of the true IRFs. We can see that the biases of the one-period-ahead IRF can be either positive or negative, ranging from 8% to almost 50% of the true IRFs. For longer horizon IRFs, the mean biases fall within the 10% range.

We do not report the figures for other parameter configurations but summarize the main results here. We have considered a DGP with higher persistence. In our high persistence case, the eigenvalues λ1 and λ2 of A are set to be 0.1 and 0.85. Since λ2 is closer to 1, the process is more persistent. The rest of the parameters, N , T and Td , remain the same as in the base case. Compared to the base case, the coverage of the bootstrap band is in a similar range but the coverage of the CLT bands improves for this case. A contributing factor to this improvement is that the median bias for this high persistence case is much lower than in the base case.
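For concreteness, the two summary measures used in Figs. 1 and 2, empirical coverage and median band width, can be computed from simulated bands along the following lines. This is a minimal sketch with hypothetical names; it assumes the lower and upper band limits for one IRF element at one horizon have already been collected across Monte Carlo replications.

```python
import numpy as np

def coverage_and_median_width(lower, upper, truth):
    """Sketch: empirical coverage rate and median width of simulated bands.

    lower, upper : arrays of band limits, one entry per Monte Carlo replication
    truth        : true value of the IRF element in the data generating process
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # A replication covers the truth when the band contains it.
    covered = (lower <= truth) & (truth <= upper)
    return covered.mean(), np.median(upper - lower)
```

Comparing the returned coverage rate with the nominal 95% level gives the coverage error discussed in the text.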
When the process becomes more persistent, the signal-to-noise ratios, as measured by $\mathrm{var}(y_{m,it})/\mathrm{var}(u_{m,it})$, become higher but the instruments become weaker. These two offsetting forces have opposite impacts on the coverage accuracy. When Td is small and the variance of the fixed effects var(µi ) is relatively large, the instruments remain relatively strong. The effect of higher signal-to-noise ratios dominates that of weaker instruments, leading to improved coverage accuracy. Simulations show that for large Td and small var(µi ), the coverage accuracy may deteriorate as the process becomes more persistent.

The next case we consider contains more observations with N = 200, the only deviation from the base case. The coverage probability for both the CLT bands and the bootstrap bands increases under larger N. In particular, the coverage of the bootstrap bands closely tracks the 95% nominal coverage probability for all four IRFs and for all forecasting horizons. As one would expect, the mean and median of the relative biases of the IRFs remain the same as in the base case, which confirms that the biases come from the time series dimension rather than the cross-sectional dimension.

We also examine a case in which the time series are close to being stationary. The parameter configuration is the same as the base case except that Td is now equal to 50. The margin of improvement from using the variance-corrected confidence bands shrinks a little compared with the base case but remains positive and visible in the omitted figure.

The basic qualitative observations are the same for other (N , T ) combinations and initialization schemes and for 90% confidence bands. In an overall sense, the bootstrap bands have smaller coverage errors than the corresponding CLT bands, and variance correction is effective in reducing the coverage errors.

5. Conclusion

The paper establishes the asymptotic distribution of the orthogonalized impulse response function for short panel VARs. Due to the correlation between the demeaned regressors and the demeaned error term, the estimator of the autoregressive coefficients and that of the error variance are not independent, even in large samples with a fixed time series dimension.
The dependence calls for correction of the asymptotic variance of the orthogonalized impulse response function. In this paper, we have developed the corrected asymptotic formula for both reduced form VAR and structural VAR for short panels. We also have proved the asymptotic validity of the bootstrapped confidence bands in this context. Our simulation analysis shows that the proposed variance correction leads to confidence bands that have smaller coverage errors. In practical applications, we recommend using the corrected variance to studentize the t-statistic and employing the bootstrap approximation to construct the confidence bands.

Acknowledgements

We thank Cheng Hsiao, the coeditor, an associate editor and two anonymous referees for helpful comments that led to considerable improvement of the paper. Sun gratefully acknowledges partial research support from NSF under Grant No. SES-0752443.

Appendix

Proof of Theorem 1. To establish the asymptotic distribution of $\hat\alpha_{GMM}$, we only need to verify the conditions for the LLN in (9), (11), and the CLT in (10). Under the cross-sectional independence, a sufficient condition for the LLN in (9) is

$$\max_i E\left\|Z_i' \Delta X_i\right\|^2 < \infty.$$
Fig. 1. The Empirical Coverage of Different 95% Confidence Bands of the Orthogonalized IRFs for the Base Case with N = 100, T = 5, and Td = 1. Panels: (a) Response of Var1 to One SD Shock in Var1; (b) Response of Var1 to One SD Shock in Var2; (c) Response of Var2 to One SD Shock in Var1; (d) Response of Var2 to One SD Shock in Var2.

Fig. 2. The Median Length of Different 95% Confidence Bands of the Orthogonalized IRFs for the Base Case with N = 100, T = 5, and Td = 1. Panels: (a) Response of Var1 to One SD Shock in Var1; (b) Response of Var1 to One SD Shock in Var2; (c) Response of Var2 to One SD Shock in Var1; (d) Response of Var2 to One SD Shock in Var2.
But for a generic constant $C$, which may be different for different occurrences,

$$\max_i E\left\|Z_i' \Delta X_i\right\|^2 \le C \max_i \max_{t=1,\ldots,T} \max_{s=1,\ldots,t} E\left\|\Delta X_{i,t}\, y_{i,t-s}'\right\|^2 \le C \max_i \max_{t=1,\ldots,T} \max_{s=1,\ldots,t} \max_{\tau=1,\ldots,p} E\left\|\Delta y_{i,t-\tau}\, y_{i,t-s}'\right\|^2 \le C \max_i \max_{t=0,\ldots,T} E\left\|y_{i,t}\right\|^4 \le C.$$

The last inequality holds by Assumption 2(i) and (ii). Similarly, we can show that the LLN in (11) holds under Assumption 2(i) and (ii).

To verify the Lyapunov condition for the CLT, we use the Cramér–Wold theorem. Let $Z_{ui} = K' \mathrm{vec}(Z_i' \Delta u_i)$ for any fixed vector $K$ and $\sigma_i^2 = \mathrm{var}(Z_{ui})$. Under Assumption 2(i) and (ii), we have

$$\max_i E|Z_{ui}|^{2+\delta} = \max_i E\left|K' \mathrm{vec}(Z_i' \Delta u_i)\right|^{2+\delta} \le C \max_i E\left\|Z_i' \Delta u_i\right\|^{2+\delta} \le C \max_i \left(E\|Z_i\|^{4+2\delta}\right)^{1/2} \left(E\|\Delta u_i\|^{4+2\delta}\right)^{1/2} \le C.$$

So $\sum_{i=1}^{N} E(|Z_{ui}|^{2+\delta}) = O(N)$. Under Assumption 2(iii), we have
Fig. 3. The Mean and Median of the Relative Biases of the Orthogonalized IRFs for the Base Case with N = 100, T = 5, and Td = 1. Each panel plots the median biases and the mean biases. Panels: (a) Response of Var1 to One SD Shock in Var1; (b) Response of Var1 to One SD Shock in Var2; (c) Response of Var2 to One SD Shock in Var1; (d) Response of Var2 to One SD Shock in Var2.
N −
√
σ
i=1
√
for some constant C and large enough N, as p lim N −1 is of full rank. This implies that
∑
σi2
N i =1
−1−δ/2
∑N
′ i=1 Zi GZi
lim
N →∞
N −
σi2
N −
NT i=1 t =0
i=1
T N 1 −−
NT i=1 t =0
yi,t − y¯ i,·
′
′
= B + op (1),
√
√
NI2 = −B N Aˆ − A + op (1),
√
′
(38) B′ + op (1).
Xi,t − X¯ i,·
]′
,
√ √ √ ˆ GMM − Σ = N (Σ ˆ 0 − Σ ) − B N Aˆ − A N Σ ′ √ − N Aˆ − A B′ + op (1). √ To derive the limiting distribution of sider each of the three terms. First,
where N T ′ 1 −− ui,t − u¯ i,· ui,t − u¯ i,· , NT i=1 t =0 N − T ′ 1 −
I1 = Aˆ − A
Xi,t − X¯ i,·
NT i=1 t =0
Xi,t − X¯ i,·
√
N
T
′ 1 − −
I3 = − Aˆ − A
NT i=1 t =0
Xi,t − X¯ i,·
′
= √
N T − −
NT i=1
Aˆ − A ,
N (T + 1) i=1 t =0
−√
ui,t − u¯ i,·
N − T −
1
ˆ GMM − Σ ), we conN (Σ
ui,t − u¯ i,·
′
−Σ
t =0
1
ui,t u′i,t − Σ
N T − −
NT (T + 1) i=1 s,t =0,s̸=t
′
ui,t − u¯ i,· .
1
ˆ 0 − Σ) = √ N (Σ
N T ′ 1 −− I2 = − ui,t − u¯ i,· Xi,t − X¯ i,· Aˆ − A , NT i=1 t =0
NI3 + op (1).
Combining (37) with (38) yields
yi,t − y¯ i,· − A′ Xi,t − X¯ i,· − Aˆ − A
Xi,t − X¯ i,·
NI3 = − N Aˆ − A
[
ui,t − u¯ i,·
ˆ 0 + I1 + I2 + I3 =Σ
ˆ0 = Σ
√
NI2 +
where the op (1) term follows from Assumption 2(i) and (ii). Therefore,
√
[
] ′ ¯ ˆ ¯ − A Xi,t − Xi,· − A − A Xi,t − Xi,· ×
√
(37)
N T 1 −−
E (|Zui |2+δ ) = 0.
That is, the Lyapunov condition holds. ˆ GMM and It remains to establish the asymptotic distribution of Σ ˆ its relationship with αˆ GMM . Writing ΣGMM in terms of the unobserved error term, we have
′
ˆ 0 − Σ) + N (Σ
To evaluate I2 and I3 , we note that
−1−δ/2
i =1
ˆ GMM = Σ
√
ˆ GMM − Σ ) = N (Σ
NI1 = op (1). As a
= o N −1−δ/2 .
As a result,
√
In view of Aˆ − A = Op (1/ N ), we have result,
N 1 − ′ = NK E Σ ⊗ Zi GZi K ≥ C N N i=1
′
2 i
Using the Lyapunov CLT, we have
ui,t u′i,s .
(39)
√
˜ − Σ ) = vech Nvech(Σ
√
N (T + 1) i=1 t =0
1
⇒ N 0,
N − T −
1
+ ′
D+ m Λm2 Dm
T +1
−Σ
ui,t u′i,t
for some matrix M. Note that E
,
T −
√
NT (T + 1) i=1 t =0 s=0,s̸=t
1
⇒ N 0,
T (T + 1)
=E
=E
D+ m (Σ ⊗ Σ ) Im2 + Km,m
D+ m
′
′
T −
vec ui,t u′i,t
T −
vec Zi′,t −2 ui,t
′
t =1
T − T −
ui,t ⊗ ui,t
u′i,s ⊗ Zis−2
T − T −
ui,t u′i,s ⊗ ui,t Zis−2
t =0 s=1
. =E
As a result,
T − T −
ui,t u′i,s ⊗ ui,t Zis−2
t =1 s=1
√ ˆ0 −Σ Nvech Σ [ 1 Λm2 ⇒ N 0, D+ m T +1 +
t =0 s=1
ui,t u′i,s
vec Zi′,t −2 ui,t
t =0
N − T T − −
1
T − t =1
=E
where Λm2 = It is easy to see that the Lyapunov condition holds under Assumption 2(i). Similarly,
vech
t =0
(40)
var[vec(ui,t u′i,t )].
vec ui,t u′i,t − Σ
1 T (T + 1)
=E
T − −
ui,t u′i,s ⊗ ui,t Zis−2 + E
T −
(Σ ⊗ Σ ) Im2 + Km,m
]
+ ′
t =1
s=1 t ̸=s
=E
ui,t u′i,t ⊗ ui,t Zi,t −2
T −
ui,t u′i,t ⊗ ui,t Zi,t −2
t =1
Dm
(41)
=E
T −
ui,t u′i,t ⊗ ui,t ⊗ Zi,t −2 = 0
t =1
where we have used the asymptotic independence between the two terms in (39).
by Assumption 2(iv). Similarly,
Next, using the properties of the commutation matrix: Km,m Σ ⊗ BQ −1 B′ = BQ −1 B′ ⊗ Σ Km,m and Km,m Km′ ,m = Im2 , we can
E
show that
√
[ √
′ ]
and vech
N Aˆ − A
′
B
⇒ N 0, Dm BQ +
B ⊗Σ
−1 ′
+ ′
Dm
.
[ ′ ] √ ˆ ˆ cov vech B N A − A , vech N A − A B′ √
′ −1 ′ = D+ B Km′ ,m D+ m Σ ⊗ BQ m .
√
√
′
B′ ⇒ N (0, VAB ),
′
−1 ′ ′ + D+ B ⊗ Σ D+ m BQ m + ′
N Aˆ − A
where −1 ′ VAB = D+ B ) D+ m (Σ ⊗ BQ m
−1 ′ + D+ B )Km′ ,m Dm m (Σ ⊗ BQ ′ −1 ′ + D+ B D+ m Km,m Σ ⊗ BQ m .
√
ˆ 0 − Σ ) and N (Aˆ − A) are We proceed to prove that N (Σ asymptotically independent under Assumption 2(iv). We write √
Nvec(Aˆ − A) N 1 −
= M√
N i=1
vec
T − t =1
Zi′,t −2
ui,t − ui,t −1
+ op (1)
′
= 0.
t =1
ˆ GMM − Σ Nvech Σ
→ N (0, Ωσ σ ),
pletes the proof of the theorem. (42)
√
vec Zi′,t −2 ui,t −1
(43)
√ √ Ωα,σ = −cov vech B N Aˆ − A , vec N Aˆ − A [ √ ′ ] √ ′ ˆ ˆ − cov vech N A − A B , vec N A−A −1 = −D+ m (Im ⊗ B) Σ ⊗ Q −1 − D+ . (44) m Km,m (Im ⊗ B) Σ ⊗ Q √ Combining (43), (44) and N αˆ GMM − α ⇒ N (0, Ωαα ) com-
Therefore, B N Aˆ − A +
T −
where Ωσ σ is defined in the theorem. Finally, we examine the √ asymptotic covariance between √ √ ˆ GMM − Σ )] and vec[ N (Aˆ − A)]. Since N (Σ ˆ 0 − Σ ) is vech[ N (Σ √ asymptotically independent of N (Aˆ − A), the asymptotic covariance is given by
In addition,
t =0
√
vec ui,t u′i,t − Σ
Hence the first term in (39) is asymptotically independent of √ N (Aˆ − A). It is easy to see that the asymptotic independence also holds for the second term in (39). As a result,
′ −1 ′ ⇒ N 0 , D+ B D+ , m Σ ⊗ BQ m
vech B N Aˆ − A
T −
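Lemma 1. Let the assumptions in Theorem 3 hold, then for any conformable vector $c$:

$$P\left(c' \begin{bmatrix} \sqrt{N}\left(\hat\alpha_{GMM}^* - \hat\alpha_{GMM}\right) \\ \sqrt{N}\, \mathrm{vech}(\hat\Sigma_{GMM}^* - \hat\Sigma_{GMM}) \end{bmatrix} < x\right) = P\left(c' \begin{bmatrix} \sqrt{N}\left(\hat\alpha_{GMM} - \alpha\right) \\ \sqrt{N}\, \mathrm{vech}(\hat\Sigma_{GMM} - \Sigma) \end{bmatrix} < x\right) + o(1)$$

uniformly over $x \in \mathbb{R}$ as $N \to \infty$ for a fixed $T$.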
Proof of Lemma 1. The result is on the joint convergence of $\sqrt{N}(\hat\alpha_{GMM}^* - \hat\alpha_{GMM})$ and $\sqrt{N}(\hat\Sigma_{GMM}^* - \hat\Sigma_{GMM})$. We prove only the convergence of the marginal distributions, as the joint convergence follows easily from the same argument we present here. The proof consists of two parts.

Part (a): Proof of $\sqrt{N}(\hat\alpha_{GMM}^* - \hat\alpha_{GMM}) \Rightarrow N(0, \Omega_{\alpha\alpha})$.
By construction, we have ∗ αˆ GMM − αˆ GMM −1 [− N N − ∗ ′ ∗ ∗ ′ ∗ Zi GZi = vec 1Xi Zi i=1
i =1
×
N − ∗ ′
1Xi
∗
Zi
] −1
N −
×
1Xi
∗
Zi
i =1
i=1
∗ ′
−1
N − ∗ ′
GZi∗
Zi
N − ∗ ′
Zi
i=1
where the last line follows because
1u∗i .
(45)
4 E Z¯ui = E |Zui |2+δ − E |Zui |2+δ
Define the set of samples Eα as
(i) N
Oa.s. (N ). Similarly, we can show that ∞
i =1
(ii) N −1
N −
∞ Zi′ GZi → SZZ ,
(iii) N
N −
′
N −
vec Zi′ GZi {vec Zi′ GZi }′ → 0,
i=1
(v)
[− N
|Zui |2
i=1
]−1−δ/2 − N
E∗
|Zui |2+δ → 0
i =1
(47)
N ′ 1 − ∞ 1Xi′ Zi → SXZ 1Xi∗ Zi∗ = .
N i =1
Combining this with (47) yields
(vi) Aˆ → A .
N 1 −
∞ ′
∞ In the above definition, SXZ = SZX and Zui = K ′ vec Zi′ 1ui for any fixed vector K . Under the moment conditions in the theorem, we have
i=1
where E ∗ denotes the expectation with respect to the bootstrap distribution P∗ , and op∗ (1) denotes a sequence of random variables that converges to zero in probability under P∗ . By definition,
i=1
(iv) N −2
′ ′ 1Xi∗ Zi∗ − E ∗ 1Xi∗ Zi∗ = op∗ (1)
N i=1
vec (1Xi ) Zi {vec (1Xi )′ Zi }′ → 0,
|Zui |2+δ = −1−δ/ 2 N 2 = i=1 |Zui |
N 1 −
i=1
−2
∑N
∑
Oa.s. (N −1−δ/2 ). Consequently, condition (v) indeed holds almost surely. We can conclude that P (Eα ) = 1. ′ Conditional on the sample in Eα , 1Xi∗ Zi∗ is a triangular array of rowwise i.i.d. random variables and a law of large numbers gives
1Xi Zi → SXZ , ′
4
using the moment conditions in the theorem. So
Eα = [y1 (ω), . . . , yN (ω)] : N −
= O E |Zui |8+4δ = O E ‖Zi ‖16+8δ E ‖ui ‖16+8δ = O(1)
i=1
−1
∞ N 1 − − P Z¯ > ε N i=1 ui N =1 4 N ∑ ¯ Zui ∞ E − i=1 ≤ N 4 ε4 N =1 N [ N ]2 ∑ 4 ∑ 2 ¯ ¯ E Z E Z ∞ ∞ ui ui − − i=1 i=1 = + 4 4 4 4 N ε N ε N =1 N =1 N [ N ]2 ∑ 4 ∑ 2 ¯ ¯ E Z E Z ∞ ∞ ui ui − − i=1 i =1 ≤ + <∞ N 4 ε4 N 4 ε4 N =1 N =1
∞ N − −1 − ′ ∞ P N 1Xi Zi − SXZ > ε N =1 i=1 N ∑ 4 ∞ ′ E 1Xi Zi − SXZ ∞ − i=1 ≤ N 4 ε4 N =1 ∞ − 1 =O < ∞. N 2 ε4 N =1
′ ∞ 1Xi∗ Zi∗ = SXZ + o∗p (1).
N i=1
(48)
By the same argument, we have, conditional on Eα , N 1 − ∗ ′ ∞ Zi GZi∗ = SZZ + op∗ (1). N i=1
(49)
Therefore,
[
N 1 −
1Xi
N i=1
× (46)
∗ ′
∗
Zi
N 1 − ∗ ′ Zi GZi∗ N i=1
N 1 − ∗ ′ Zi 1Xi∗ N i =1
]−1
−1
= Q + o∗p (1).
Next
∑N
It follows from the Borel–Cantelli lemma that N −1 i=1 1Xi′ Zi → ∞ SXZ almost surely. So condition (i) in the definition of Eα holds almost surely. Similarly, we can show conditions (ii)–(iv) and (vi) hold almost surely. To show (v) holds almost surely, that condition we let Z¯ui = |Zui |2+δ − E |Zui |2+δ and note that
N 1 −
N i =1
∗ ′
1Xi
∗
Zi
N 1 − ∗ ′ Zi GZi∗ N i =1
N 1 − ∗ ′ Zi 1u∗i := I4 + I5 N i=1
×√
−1
(50)
where
As a result,
I4 =
N
1 − N i =1
′ 1Xi∗ Zi∗
N
1 − ∗ ′ Zi GZi∗ N i =1
−1
[ vec
N 1 − ∗ ∗ ′ E Zi 1u∗i , N i=1
I5 =
N 1 −
N i =1
∗ ′
1Xi
∗
Zi
N 1 − ∗ ′ Zi GZi∗ N i =1
−1
I4 =
N i=1
×
−1
N 1 − ′ Zi GZi + o∗p (1) N i =1
N 1 − ′ Zi 1uˆ i √ N i =1
√ ∗ N αˆ GMM − αˆ GMM ⇒ N 0, Im ⊗ Q −1 (Σ ⊗ Q ) Im ⊗ Q −1 ≡ N 0, Σ ⊗ Q −1 conditional on Eα . Equivalently, for any conformable vector cα ,
where the second equality holds because
N i=1
]
Together with (50), this leads to
1Xi Zi + op (1) ∗
= o∗p (1)
N 1 −
−1
≡ N (0, Σ ⊗ Q ).
′
Zi
N 1 − ∗ ′ Zi GZi∗ N i=1
∞ −1 ∞ ′ ∞ × Σ ⊗ SZZ Im ⊗ SZZ SXZ
For I4 , we use (48) and (49) to obtain: N 1 −
1Xi
N i=1
∗
∞ −1 ∞ ⇒ N 0, Im ⊗ SXZ SZZ
N 1 − ∗ ′ ′ . Zi 1u∗i − E ∗ Zi∗ 1u∗i ×√ N i=1
∗ ′
N 1 − ∗ ′ Zi 1u∗i ×√ N i =1
×√
N 1 −
1Xi′ Zi
N 1 − ′ Zi GZi N i =1
−1
N 1 − ′ Zi 1uˆ i = 0, N i=1
P
∗
√
Ncα′
αˆ GMM − αˆ GMM < x|Eα → Φ ∗
√
by the definition of the GMM estimator αˆ GMM . For I5 , we have, using (48) and (49) again:
x cα′ Σ ⊗ Q −1 cα
.
Since P(Eα ) = 1, the above conditional result implies the following unconditional almost sure convergence result:
√
∗ Ncα′ αˆ GMM − αˆ GMM < x
∞ −1 I5 = SXZ + op (1) SZZ + o∗p (1) N 1 − ∗ ′ ′ ×√ Zi 1u∗i − E ∗ Zi∗ 1u∗i .
P∗
By the Lyapunov CLT for triangular arrays, we have
for all x ∈ R. This and the dominated convergence theorem imply
∞
∗
→Φ
N i =1
N 1 − ∗ ′ ′ Zi 1u∗i − E ∗ Zi∗ 1u∗i vec √
N i=1
′ 1 − = √ Im ⊗ Zi∗ vec(1u∗i ) N i =1
− E∗
∗ ′
Im ⊗ Z i
a.s. as N → ∞
x
cα′
Σ ⊗ Q −1 cα
for all x ∈ R,
or equivalently
∗ Define Zui = K ′ vec
∗ ′ Zi
1u∗i for any conformable constant
vector K . Then the Lyapunov condition is N
1
−
]1+δ/2 2 ∗ E Z ∗
2+δ
∗ E ∗ Zui
→ 0 for some δ > 0.
i=1
$\sqrt{N}(\hat\alpha^*_{GMM} - \hat\alpha_{GMM}) \to N(0, \Sigma \otimes Q^{-1})$. Thus, the unconditional asymptotic distribution of the bootstrap statistic $\sqrt{N}(\hat\alpha^*_{GMM} - \hat\alpha_{GMM})$ is $N(0, \Sigma \otimes Q^{-1})$, which is the same as that of the original-sample statistic
√
Part (b): Proof of
ui
i=1
√ N αˆ GMM − α .
∗ ˆ GMM ˆ GMM ) ⇒ N (0, Ωσ σ ). N (Σ −Σ
Note that
That is
uˆ ∗i,t = y∗i,t − y¯ ∗i,· − (Aˆ ∗ )′ Xi∗,t − X¯ i∗,·
N −
1
[N ∑
cα′ Σ ⊗ Q −1 cα
∗ Ncα′ αˆ GMM − αˆ GMM < x
→Φ
vec(1u∗i )
∞ ⇒ N (0, Σ ⊗ SZZ ).
[N ∑
√ P
x
√ ∗ = EP ∗ Ncα′ αˆ GMM − αˆ GMM < x
N
|Zui |2
]1+δ/2
|Zui |2+δ → 0 for some δ > 0.
i=1
i=1
This condition holds conditional on the sample in Eα . Therefore
∞ −1
I5 ⇒ N 0, SXZ SZZ
∞
∞ ′
SXZ
≡ N (0, Q ).
= y∗i,t − y¯ ∗i,· − Aˆ Xi∗,t − X¯ i∗,· − (Aˆ ∗ − Aˆ )′ Xi∗,t − X¯ i∗,· = u∗i,t − u¯ ∗i,· − (Aˆ ∗ − Aˆ )′ Xi∗,t − X¯ i∗,· , we have ∗ ˆ GMM ˆ 0∗ + I1∗ + I2∗ + I3∗ , Σ =Σ
where
Conditional on E , N T ′ 1 −− ∗ ui,t − u¯ ∗i,· u∗i,t − u¯ ∗i,· , NT i=1 t =0
ˆ 0∗ = Σ
N − T ′ 1 −
I1∗ = Aˆ ∗ − Aˆ I2∗ = −
NT i=1 t =0
Xi∗,t − X¯ i∗,·
N T ′ 1 −− ∗ ui,t − u¯ ∗i,· Xi∗,t − X¯ i∗,· NT i=1 t =0
Xi∗,t − X¯ i∗,·
′
Aˆ ∗ − Aˆ ,
N T ′ ∗ 1 −− ∗ Aˆ − Aˆ , ui,t − u¯ ∗i,· Xi∗,t − X¯ i∗,· NT i=1 t =0 N − T ′ 1 −
I3∗ = − Aˆ ∗ − Aˆ
NT i=1 t =0
Xi∗,t − X¯ i∗,·
′
(ii) (NT )
√
√
√
′
NI3∗ = − N Aˆ ∗ − Aˆ
(52) B′ + o∗p (1) .
∗ ˆ GMM ˆ GMM ) N (Σ −Σ
′
′
Xi,t − X¯ i,·
√
∞ → SXX ,
ui,t − u¯ i,·
Xi,t − X¯ i,·
√
ˆ 0∗ − Σ ˆ GMM ) − B N Aˆ ∗ − Aˆ N (Σ
= −
√
′
N Aˆ ∗ − Aˆ
→ B,
i =1 t =0
ui,t − u¯ i,·
ui,t − u¯ i,·
′
→ Σ,
B′ + o∗p (1).
√
N − T −
1
ˆ 0∗ − Σ ˆ GMM ) = √ N (Σ
i=1 t =0
−1 (iv) VN → D+ Λm2 + [T (T + 1)]−1 m T
− E∗
u∗i,t − u¯ ∗i,·
]
+ ′ × (Σ ⊗ Σ ) Im2 + Km,m Dm , (v) N −1
vech
ui,t − u¯ i,·
u∗i,t − u¯ ∗i,·
NT i=1 t =0
[
N − T −
Now
N − T −
(iii) (NT )
√
Xi,t − X¯ i,·
−1
+ o∗p (1)
NI2∗ = −B N Aˆ ∗ − Aˆ + o∗p (1),
i=1 t =0
−1
′
Combining (51) with (52) yields
N − T −
N − T −
N T ′ 1 −− ui,t − u¯ i,· Xi,t − X¯ i,· + o∗p (1) = B + o∗p (1). NT i=1 t =0
√
[y1 (ω), . . . , yN (ω)] :
(i) (NT )−1
uˆ i,t Xi,t − X¯ i,·
Therefore,
As in the proof of part (a), we first derive the asymptotic distribution conditional on an event that occurs with probability one. Define E = Eα ∩ Eσ where
Eσ =
NT i=1 t =0
=
u∗i,t − u¯ ∗i,· .
N T 1 −−
=
ui,t − u¯ i,·
u∗i,t − u¯ ∗i,·
u∗i,t − u¯ ∗i,·
′
′
+ o∗p (1).
Conditional on E , the first term is a normalized sum of i.i.d. random variables with mean zero. By the Lyapunov CLT for triangular arrays, we have
′
i=1 t =0
vech
× vec(Zi 1ui ) → 0
√ ˆ 0∗ − Σ ˆ GMM → N (0, V ) N Σ
where and
VN =
N 1 1 −
T 2 N i =1
×
vech
vech
T −
ui,t − u¯ i,·
′
ui,t − u¯ i,·
1
V = lim
N →∞
t =0
′
T −
ui,t − u¯ i,·
′
ui,t − u¯ i,·
v ar
N →∞
−
×
N −
T 2 N 2 j =1 N − i=1
vech
T 2 N i =1
− lim
uj,t − u¯ i,·
′
uj,t − u¯ i,·
N →∞
t =0
vech
T −
ui,t − u¯ i,·
′
ui,t − u¯ i,·
×
ˆ 0∗ − Σ ˆ GMM ) + N (Σ
−
′ u¯ ∗i,·
vech
uˆ i,t uˆ ′i,t
vech
T −
vech
T −
′ uˆ i,t uˆ ′i,t
t =0
T −
uˆ i,t uˆ ′i,t
t =0
′ uˆ j,t uˆ ′j,t
uˆ i,t = ui,t − u¯ i,· − Xi,t − X¯ i,· (Aˆ − A),
√
NI2∗ +
√
NI3∗ + o∗p (1).
so conditional on E , we have
∗
√
u∗i,t
and var∗ is the variance operator under P∗ . By definition,
√
=
−
u¯ ∗i,·
t =0
In view of part (a), we have $I_1^* = o_p^*(1/\sqrt{N})$. So
ˆ GMM − Σ ˆ GMM ) N (Σ
u∗i,t
t =0 N − N −
1
.
t =0
Under the moment conditions in the theorem, a strong law of large numbers implies that P(Eσ) = 1.
√
vech
T −
T 2 N 2 i=1 j=1
′
T −
vech
t =0
t =0
1
T −
N 1 1 −
= lim
T2
∗
(51)
vech
T − t =0
uˆ i,t uˆ ′i,t
= vech
T − t =0
ui,t − u¯ i,·
′
ui,t − u¯ i,·
+ o(1).
In this result, the randomness with respect to both the bootstrap sample and the original sample is taken into account. Part (c): Proof of Uniformity over x ∈ R. It follows from parts (a) and (b) that
Consequently, V = lim VN N →∞
= D+ m
[
1 T +1
Λm2 +
1
√ ∗ N αˆ GMM − αˆ GMM < x ∗ ˆ GMM ˆ GMM ) Nvech(Σ −Σ √ ˆ GMM − αGMM ′ √ N α =P c < x + o(1) ˆ GMM − ΣGMM ) Nvech(Σ
T (T + 1)
× (Σ ⊗ Σ ) Im2 + Km,m
]
P c′ √
′
D+ m .
We have therefore shown that, conditional on E ,
√
ˆ 0∗ − Σ ˆ GMM N Σ
vech
⇒ N 0, Dm
+
[
1 T +1
for any given x ∈ R. By Polya’s theorem (see DasGupta, 2008, p. 3), the above pointwise result holds uniformly over x ∈ R.
Λm2 +
Proof of Theorem 3. Part (a) For any conformable vector c, we can write
1 T (T + 1)
× (Σ ⊗ Σ ) Im2 + Km,m
]
+ ′
Dm
ˆ j∗ )′ ] = Fj (Aˆ ∗ ) c ′ vec[(Φ
.
where Fj is a continuously differentiable function. By the Delta method, we have
Next, using part (a), we can show that
√
√
B N Aˆ ∗ − Aˆ +
′
N Aˆ ∗ − Aˆ
√
ˆ j∗ − Φ ˆ j )′ ] < x} P{c ′ vec[ N (Φ
B′ ⇒ N (0, VAB ),
√ = P{Fj(1) (Aˆ )vec[ N (Aˆ ∗ − Aˆ )] < x} + o(1)
where $V_{AB}$ is defined in (42). Finally, we prove that $\sqrt{N}(\hat\Sigma_0^* - \hat\Sigma_{GMM})$ and $\sqrt{N}(\hat A^* - \hat A)$ are asymptotically independent conditional on $E$. That is, we need to show
N T 1 −−
√
N i=1 t =0
u∗i,t − u¯ ∗i,·
vech
u∗i,t − u¯ ∗i,·
N 1 −
N i=1
vec
∗ ′
∗ ′
1u∗i − E ∗
Zi
Zi
1u∗i
T
−
cov∗
vech
=
N i =1
.
(1)
which holds by the consistency of Aˆ and the continuity of Fj yields
−
u∗i,t − u¯ ∗i,·
u∗i,t − u¯ ∗i,·
′
, vec
∗ ′ Zi
1u∗i
ˆ j∗ − Φ ˆ j )′ ] < x} P{c ′ vec[ N (Φ
√ ˆ j − Φj )′ ] < x} + o(1). = P{c ′ vec[ N (Φ
N T 1 −−
=
N i =1 t =0
−
vec(ˆui uˆ i ) ′
×
N i=1
N i=1
ˆ Θ j = ΩΘ j + op (1), Ω ˆ Φ∗ j = ΩΦ j + op (1), and Ω ˆ Φ∗ j = ΩΦ j + op (1). Ω
vec(Zi 1uˆ i )
′
The proof is straightforward and details are omitted here.
(ui,t − u¯ i,· ) ui,t − u¯ i,·
′
vec(Zi 1ui ) + o(1)
→ 0. As a result, conditional on E , we have
√
∗ ˆ GMM ˆ GMM Nvech Σ −Σ
⇒ N (0, Ωσ σ ).
(53)
Using the same argument as in part (a), we can show that this conditional convergence result implies unconditional convergence. That is, for any conformable vector cσ ,
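$$P\left(\sqrt{N}\, c_\sigma' \,\mathrm{vech}\left(\hat\Sigma^*_{GMM} - \hat\Sigma_{GMM}\right) < x\right) \to \Phi\left(\frac{x}{\sqrt{c_\sigma' \Omega_{\sigma\sigma} c_\sigma}}\right)$$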
References
vec(Zi 1ui )
N T 1 −−
N 1 −
N 1 −
vech (ui,t − u¯ i,· ) ui,t − u¯ i,·
N i=1 t =0
Invoking Polya's theorem gives the desired result. Part (b). The same argument as for part (a) applies. Details are omitted. Parts (c), (d). It is sufficient to show that $\hat\Omega_{\Phi j} = \Omega_{\Phi j} + o_p(1)$,
vech(ˆui uˆ i )vec(Zi 1uˆ i )
N i=1
(·),
√
′
N 1 −
√ = P{Fj(1) (A)vec[ N (Aˆ − A)] < x} + o(1)
∂ Fj (A) ∂ Fj (A) = + op (1), ∂ vec(A)′ A=Aˆ ∂ vec(A)′
t =0 N 1 −
√
ˆ j − Φj )′ ] < x} P{c ′ vec[ N (Φ
where Fj (A) = ∂ Fj (A)/∂ vec(A)′ . Combining the above two equations with
Their covariance is
and
(1)
′
is asymptotically independent of
√
ˆ j )′ ] = Fj (Aˆ ) and c ′ vec[(Φ
.
Ahn, S.C., Schmidt, P., 1995. Efficient estimation of models for dynamic panel data. Journal of Econometrics 68 (1), 5–27.
Anderson, T.W., Hsiao, C., 1981. Estimation of dynamic models with error components. Journal of the American Statistical Association 76, 598–606.
Anderson, T.W., Hsiao, C., 1982. Formulation and estimation of dynamic models using panel data. Journal of Econometrics 18, 47–82.
Arellano, M., 2003. Panel Data Econometrics. Oxford University Press, USA.
Arellano, M., Bond, S.R., 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58, 277–297.
Baillie, R.T., 1987. Inference in dynamic models containing ‘surprise’ variables. Journal of Econometrics 35, 101–117.
Bilodeau, M., Brenner, D., 1999. Theory of Multivariate Statistics. Springer.
Binder, M., Hsiao, C., Pesaran, M.H., 2005. Estimation and inference in short panel vector autoregressions with unit roots and cointegration. Econometric Theory 21, 795–837.
Bun, M.J.G., Kiviet, J.F., 2006. The effects of dynamic feedbacks on LS and MM estimator accuracy in panel data models. Journal of Econometrics 132 (2), 409–444.
DasGupta, A., 2008. Asymptotic Theory of Statistics and Probability. Springer.
Gilchrist, S., Himmelberg, C.P., Huberman, G., 2005. Do stock price bubbles influence corporate investment? Journal of Monetary Economics 52 (4), 805–827.
Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer-Verlag New York, Inc.
Hamilton, J., 1994. Time Series Analysis. Princeton University Press.
Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Holtz-Eakin, D., Newey, W.K., Rosen, H.S., 1988. Estimating vector autoregressions with panel data. Econometrica 56, 1371–1395.
Horowitz, J.L., 1997. Bootstrap methods in econometrics: theory and numerical performance. In: Kreps, D.M., Wallis, K.F. (Eds.), Advances in Economics and Econometrics: Theory and Applications, Vol. III. In: Econometric Society Monographs, vol. 28. Cambridge University Press.
Hsiao, C., 2003. Analysis of Panel Data, 2nd ed. In: Econometric Society Monographs, Cambridge University Press.
Love, I., Zicchino, L., 2002. Financial development and dynamic investment behavior: evidence from panel vector autoregression. The Quarterly Review of Economics and Finance 46, 190–210.
Lütkepohl, H., 1989. A note on the asymptotic distribution of impulse response functions of estimated VAR models with orthogonal residuals. Journal of Econometrics 42, 371–376.
Lütkepohl, H., 1990. Asymptotic distributions of impulse response functions and forecast error variance decompositions of vector autoregressive models. Review of Economics and Statistics 72, 116–125.
Lütkepohl, H., 2005. New Introduction to Multiple Time Series Analysis. Springer-Verlag, Berlin, Heidelberg, New York.
Journal of Econometrics 163 (2011) 144–162
Bias corrections for two-step fixed effects panel data estimators

Iván Fernández-Val a, Francis Vella b,∗

a Boston University, United States
b Georgetown University, United States

Article info

Article history: Received 31 October 2007; Received in revised form 2 March 2011; Accepted 15 March 2011; Available online 24 March 2011

JEL classification: C23; J31; J51

Keywords: Panel data; Two-step estimation; Endogenous regressors; Fixed effects; Sample selection bias; Union premium

Abstract

This paper introduces large-T bias-corrected estimators for nonlinear panel data models with both time invariant and time varying heterogeneity. These models include systems of equations with limited dependent variables and unobserved individual effects, and sample selection models with unobserved individual effects. Our two-step approach first estimates the reduced form by fixed effects procedures to obtain estimates of the time varying heterogeneity underlying the endogeneity/selection bias. We then estimate the primary equation by fixed effects including an appropriately constructed control variable from the reduced form estimates as an additional explanatory variable. The fixed effects approach in this second step captures the time invariant heterogeneity while the control variable accounts for the time varying heterogeneity. Since either or both steps might employ nonlinear fixed effects procedures it is necessary to bias adjust the estimates due to the incidental parameters problem. This problem is exacerbated by the two-step nature of the procedure. As these two-step approaches are not covered in the existing literature, we derive the appropriate correction, thereby extending the use of large-T bias adjustments to an important class of models. Simulation evidence indicates our approach works well in finite samples and an empirical example illustrates the applicability of our estimator. © 2011 Elsevier B.V. All rights reserved.
∗ Corresponding author. Tel.: +1 202 687 5573.
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.03.002

1. Introduction

The incidental parameters problem arises in the estimation of nonlinear panel models that include unrestricted individual specific effects to control for unobserved time invariant heterogeneity (Neyman and Scott, 1948; Heckman, 1981; Lancaster, 2000; Greene, 2004a). Recent papers, surveyed in Arellano and Hahn (2005) and including Hahn and Kuersteiner (2002, forthcoming), Lancaster (2002), Woutersen (2002), Hahn and Newey (2004), Carro (2006), and Fernández-Val (2009), provide a range of solutions, so-called large-T bias corrections, to reduce the incidental parameters problem in long panels. These papers derive the analytical expression of the bias (up to a certain order of T), which can be employed to adjust the biased fixed effects estimators. Numerical evidence suggests these adjustments eliminate or significantly reduce the bias even in short panels.

While the above papers collectively cover a large class of models, they do not handle endogeneity resulting from unobserved heterogeneity that contains a time varying component. This kind of heterogeneity, which includes time varying endogenous regressors and sample selection, is frequently encountered in empirical investigations. Accordingly we derive large-T bias corrections for panel data models with multiple sources of endogeneity. In particular, we consider a class of models with both time varying and time invariant endogeneity that can be accounted for by including individual effects and a parametric control variable. Specific examples include models with censored endogenous regressors and individual effects, sample selection models with individual effects, and limited dependent variable models with endogenous explanatory variables and individual effects. More generally, our approach covers nonlinear panel data models with predetermined and endogenous regressors, when the endogeneity can be controlled for via individual effects and a parametric control variable.

We provide a computationally simple two-step estimation procedure. We first estimate the reduced form of the time varying heterogeneity underlying the endogeneity/selection bias by fixed effects. We then estimate the primary equation by fixed effects adding an appropriately constructed control variable. Since either or both steps might employ nonlinear fixed effects procedures and the control variable might be a nonlinear function of the individual effects of the reduced form, the incidental parameters problem arises. As the existing bias corrections fail to account for the additional source of incidental parameters bias arising from the fixed effects estimation of the control variable, our main contribution is to extend the large-T bias corrections to systems of equations estimated via two-step fixed effects procedures.
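The two-step logic described above can be sketched in its simplest form. The following is only an illustration under strong assumptions, not the estimator developed in this paper: both equations are taken to be linear, so that within-group demeaning removes the individual effects exactly and no bias correction is required. The function names (`within`, `two_step_fe`, `simulate`) and the simulation design are hypothetical.

```python
import random

def within(values, T):
    """Subtract each individual's time mean (fixed effects transformation)."""
    out = []
    for start in range(0, len(values), T):
        block = values[start:start + T]
        mean = sum(block) / T
        out.extend(v - mean for v in block)
    return out

def two_step_fe(d, y, x, T):
    """Two-step FE control-variable estimator for a linear toy model.

    Step 1: within regression of d on x; its residual is the control variable.
    Step 2: within regression of y on d and the estimated control variable.
    """
    dw, yw, xw = within(d, T), within(y, T), within(x, T)
    theta1 = sum(a * b for a, b in zip(xw, dw)) / sum(a * a for a in xw)
    lam = [a - theta1 * b for a, b in zip(dw, xw)]  # estimated control variable
    # Solve the 2x2 normal equations for (delta, zeta) by Cramer's rule.
    s_dd = sum(a * a for a in dw)
    s_dl = sum(a * b for a, b in zip(dw, lam))
    s_ll = sum(a * a for a in lam)
    s_dy = sum(a * b for a, b in zip(dw, yw))
    s_ly = sum(a * b for a, b in zip(lam, yw))
    det = s_dd * s_ll - s_dl ** 2
    delta = (s_ll * s_dy - s_dl * s_ly) / det
    zeta = (s_dd * s_ly - s_dl * s_dy) / det
    return theta1, delta, zeta

def simulate(n, T, theta1=1.0, delta=0.5, zeta=0.8, seed=0):
    """Hypothetical design: x is excluded from the primary equation."""
    rng = random.Random(seed)
    d, y, x = [], [], []
    for _ in range(n):
        a1, a2 = rng.gauss(0, 1), rng.gauss(0, 1)  # individual effects
        for _ in range(T):
            xit = a1 + rng.gauss(0, 1)  # regressor correlated with the effect
            e1 = rng.gauss(0, 1)
            e2 = rng.gauss(0, 1)
            dit = theta1 * xit + a1 + e1
            yit = delta * dit + zeta * e1 + a2 + e2  # e1 makes d endogenous
            x.append(xit)
            d.append(dit)
            y.append(yit)
    return d, y, x

d, y, x = simulate(500, 8)
theta1_hat, delta_hat, zeta_hat = two_step_fe(d, y, x, 8)
print(theta1_hat, delta_hat, zeta_hat)
```

In this linear design the estimates should be close to the values used in the simulation. Once either step is nonlinear, the individual effects can no longer be differenced out, which is the case the bias corrections in this paper address.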
I. Fernández-Val, F. Vella / Journal of Econometrics 163 (2011) 144–162
Below we discuss some papers which have analyzed some of the models we consider here. We differ from these existing studies in our treatment of the time invariant heterogeneity, the assumptions about the properties of the explanatory variables, or the asymptotic framework. Most notable is our treatment of the unobserved individual effects as fixed effects (FE), potentially correlated with the explanatory variables, whereas previous approaches to nonlinear systems of equations in panel data generally assume they are random effects (RE) distributed independently of the explanatory variables. RE estimation by-passes the incidental parameters problem by integrating out the individual effects. This approach, however, has three important shortcomings. First, the independence assumption is not compelling in many applications. In microeconomic studies, for instance, individual effects might capture variations in preferences or technology, and the explanatory variables are often choice variables determined on the basis of this individual heterogeneity. Second, the RE estimators generally require an additional round of integration and this can complicate computation. Finally, the RE procedures require parametric assumptions for the time invariant individual heterogeneity. A second important feature of our approach is our ability to accommodate weakly exogenous (predetermined) explanators. This extension substantially expands the range of models we can consider by allowing for dynamic feedback effects between the outcomes and the explanatory variables. These effects are not possible under the strict exogeneity assumption which is commonly maintained in the nonlinear panel literature, see, e.g., Rasch (1960), Chamberlain (1980), Manski (1987), Honoré (1992), and Hahn and Newey (2004). 
Wooldridge (2001), Honoré and Lewbel (2002), Arellano and Carrasco (2003), and Hahn and Kuersteiner (forthcoming) developed one-step estimators for panel data models with predetermined regressors and individual effects. These estimators, however, are not suitable for the models with multiple sources of heterogeneity that we consider. Note that our procedures are based on large-T asymptotic approximations that are more suitable for moderate or large panels, whereas some of the previous studies remain valid for short panels since they are derived under fixed-T sequences. However, in numerical examples, we find that our bias correction performs well even with only 6 or 8 time periods.

The following section briefly describes some econometric models covered by our approach. Section 3 reviews some existing treatments of bias corrections in non-linear panel data models and extends these corrections to two-step estimators. Section 4 gives the appropriate asymptotic theory. Section 5 provides simulation evidence and Section 6 presents an empirical example. Section 7 adds some concluding remarks. The Appendix contains the proofs of the main results.

2. Panel models with multiple endogeneity

The leading class of econometric models we consider has the following triangular two-equation structure in the observed variables:

$$d_{it} = f_1(x_{1it}, \alpha_{1i}; \theta_1) + \varepsilon_{1it}, \quad \text{(Reduced form equation)}$$
$$y_{it} = f_2(d_{it}, x_{2it}, \lambda_{it}, \alpha_{2i}; \theta_2) + \varepsilon_{2it}, \quad \text{(Primary equation)} \tag{1}$$

for $(i = 1, \ldots, n;\ t = 1, \ldots, T)$, where $f_1(\cdot)$ and $f_2(\cdot)$ are known functions up to the finite dimensional parameters $\theta_1$ and $\theta_2$. The endogenous variable of primary interest is $y_{it}$, and $d_{it}$ is an endogenous explanatory variable or selection indicator. The predetermined explanatory variables are denoted by $x_{1it}$ and $x_{2it}$; $\alpha_{1i}$ and $\alpha_{2i}$ are unobserved individual effects; $\lambda_{it}$ is a control variable underlying the endogeneity/selection of $d_{it}$ in the primary equation; and the disturbances are denoted by $\varepsilon_{1it}$ and $\varepsilon_{2it}$. An exclusion restriction in $x_{2it}$ relative to $x_{1it}$ ensures that identification does not rely exclusively on nonlinearities of the parametric functions (Cameron and Trivedi, 2005, p. 565). Lags of the observed dependent variables $d_{it}$ and $y_{it}$ may appear in each equation and would be included in $x_{1it}$ and/or $x_{2it}$.¹

The control variable is assumed to be a known function of the parameters and variables of the reduced form equation, $\lambda_{it} := \lambda(d_{it}, x_{1it}, \alpha_{1i}; \theta_1)$. The form of this function depends on the type of endogeneity/selection and also the nature of the dependent variable in the reduced form. It is usually derived from parametric assumptions about the underlying structural disturbances, typically joint normality, although weaker assumptions can suffice.² Below we provide specific examples of econometric models that generate this triangular representation.

Many models in panel data are defined by sequential moment conditions, which correspond to the following restrictions on the disturbances of the system (1):

Assumption 1 (Sequential Moment Conditions). The idiosyncratic disturbances $\varepsilon_{1it}$ and $\varepsilon_{2it}$ satisfy the sequential moment conditions

$$E[\varepsilon_{1it} \mid x_i(t), \alpha_{1i}] = 0 \quad \text{and} \quad E[\varepsilon_{2it} \mid d_i(t), x_i(t), \lambda_i(t), \alpha_{2i}] = 0,$$

for $i = 1, \ldots, n$; $t = 1, \ldots, T$, where $x_i(t) = [x_{1i}(t)', x_{2i}(t)']'$ and $r_i(t) = [r_{i1}, \ldots, r_{it}]'$ for $r \in \{d, \lambda, x_1, x_2\}$.

Note that our model is of the FE type because we do not impose any restriction on the joint distribution of $\alpha_{1i}$ and $\alpha_{2i}$ given $x_i(t)$. Assumption 1 indicates that the endogeneity in the primary equation can arise either through the omission of the time invariant unobserved individual effects or through the omission of the time varying control variable. The sequential moment conditions imply that the model is dynamically complete conditional on the individual effects (Wooldridge, 2002, p. 300), and that the explanatory variables are predetermined relative to the disturbances. This is an important departure from the usual strict exogeneity assumption in these models and permits richer dynamic feedbacks from the dependent variable to the explanators. A leading case of predetermined regressors are lags of the dependent variables.

To obtain estimates of the parameters of the model we first estimate the reduced form equation from which we construct the appropriate control variable. We then account for the endogeneity in the primary equation by eliminating the first form, due to the $\alpha_2$'s, through the inclusion of individual fixed effects, and the second, due to the $\lambda$'s, through the inclusion of the estimated control variable. This approach is computationally more attractive than Full or Partial Maximum Likelihood estimation of the system (1). The computational demands of these models are especially severe due to the presence of fixed effects in both the reduced form and primary equations. Moreover, system estimators, although more efficient, are generally less robust to parametric assumptions than two-step procedures (Wooldridge, 2002, p. 566). The incidental parameters problem may arise in both steps and is further complicated by the inclusion in the second stage of the control variable, which depends on the individual effects of the reduced form equation.

¹ Our framework does not allow for the inclusion of lags of unobserved or latent dependent variables. The inclusion of these variables raises identification issues that are beyond the scope of this paper. See Kyriazidou (2001), Hu (2002) and Gayle and Viauroux (2007) for dynamic sample selection and censored panel models that include lags of latent dependent variables.
² There is an extensive literature on the use of control variables to address endogeneity and selection issues in parametric econometric models. In this paper we only derive the control variable for some specific examples, and assume its existence and refer to the literature in which it has been developed in general. See, e.g., Dhrymes (1970), Heckman (1976, 1979), Smith and Blundell (1986), Rivers and Vuong (1988), Blundell and Smith (1989, 1994), and Vella (1993).
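As a concrete instance of a control variable of the form λ(d, x1, α1; θ1), the sketch below evaluates the generalized residual (inverse Mills ratio) that Example 1 uses for a probit reduced form, λ = (d − Φ)φ/[Φ(1 − Φ)]. The helper names are hypothetical, and the probit index x1′θ1 + α1 is assumed to have been computed elsewhere (in practice from first-stage estimates).

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def generalized_residual(d, index):
    """Control variable for a probit reduced form.

    d     : selection indicator (0 or 1)
    index : probit index x1'theta1 + alpha1
    Returns (d - Phi) * phi / (Phi * (1 - Phi)), which equals phi / Phi
    when d = 1 and -phi / (1 - Phi) when d = 0.
    Note: for very large |index| the denominator underflows; a production
    implementation would need a numerically safer formula.
    """
    Phi, phi = norm_cdf(index), norm_pdf(index)
    return (d - Phi) * phi / (Phi * (1.0 - Phi))

print(generalized_residual(1, 0.0))
print(generalized_residual(0, 0.0))
```

By construction the generalized residual has mean zero given the index, since Φ·(φ/Φ) − (1 − Φ)·φ/(1 − Φ) = 0.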
An important special case of this model is related to the sample selection procedure of Heckman (1979). Ridder (1990), Verbeek and Nijman (1992) and Vella and Verbeek (1999) extend this procedure to panels under the assumption that the error components are RE. Wooldridge (1995) introduces a correlated RE estimator under alternative assumptions on the individual effects. Our approach avoids the distributional assumptions for the individual effects employed in the fully parametric approaches, but requires large T. Kyriazidou (1997, 2001) develops semiparametric fixed-T FE estimators for sample selection models that do not impose distributional assumptions on the idiosyncratic component of the error term. In addition to the asymptotic framework, we differ from Kyriazidou (1997, 2001) in that we impose some parametric assumptions on the idiosyncratic error term and do not allow for the inclusion of lags of the dependent variable in the primary equation as regressors, but we impose fewer data restrictions. In particular, we allow the explanatory variables to be predetermined, whereas the Kyriazidou approach imposes strict exogeneity on all the regressors other than the lag dependent variable. Gayle and Viauroux (2007) propose a semi-parametric estimator for sample selection models with predetermined explanatory variables, including the lag of the dependent variable. Their estimator does not require parametric assumptions about the error term in the selection equation, but imposes restrictions on the individual effects of this equation. Panel data selection models under alternative selection rules, considered by Vella and Verbeek (1999) with RE, can also be accommodated. Our approach encompasses models with censored endogenous regressors such as those considered by Heckman (1978) and Vella (1993) in the cross sectional context and by Vella and Verbeek (1999) in panels.

Example 1 (Sample Selection). Consider the following panel sample selection model with predetermined regressors and individual effects:

$$d_{it} = 1\{x_{1it}'\theta_1 + \alpha_{1i} + u_{1it} \ge 0\},$$
$$y_{it} = d_{it} \times (x_{2it}'\pi_2 + \alpha_{2i} + u_{2it}).$$

If the error terms $(u_{1it}, u_{2it})$ are jointly normal conditional on $[x_i(t)', \alpha_{1i}, \alpha_{2i}]'$ with zero mean, $E[u_{1it}^2 \mid x_i(t), \alpha_{1i}, \alpha_{2i}] = 1$ and $E[u_{1it} u_{2it} \mid x_i(t), \alpha_{1i}, \alpha_{2i}] = \zeta_2$, we can express the previous system of equations in the triangular form (1) with

$$d_{it} = \Phi(x_{1it}'\theta_1 + \alpha_{1i}) + \varepsilon_{1it} = f_1(x_{1it}, \alpha_{1i}; \theta_1) + \varepsilon_{1it},$$
$$y_{it} = d_{it} \times (x_{2it}'\pi_2 + \zeta_2 \lambda_{it} + \alpha_{2i}) + \varepsilon_{2it} = f_2(d_{it}, x_{2it}, \lambda_{it}, \alpha_{2i}; \theta_2) + \varepsilon_{2it},$$

where $\Phi(\cdot)$ is the CDF of the standard normal. Here $(\varepsilon_{1it}, \varepsilon_{2it})'$ satisfies Assumption 1 with $\theta_2 := (\pi_2', \zeta_2)'$. The control variable is the inverse Mills ratio or generalized residual of the reduced form equation $\lambda_{it} = (d_{it} - \Phi_{1it})\phi_{1it}/[\Phi_{1it}(1 - \Phi_{1it})]$, where $\phi_{1it}$ and $\Phi_{1it}$ denote the PDF and CDF of the standard normal evaluated at $x_{1it}'\theta_1 + \alpha_{1i}$.

The primary equation in the above models is estimated by least squares. A second important class of models follows the conditional MLE procedure of Smith and Blundell (1986) and Rivers and Vuong (1988), which has been extended to panels by Vella and Verbeek (1999) under the RE assumption. We further extend this class by assuming FE and allowing for dynamic feedbacks.

Example 2 (Endogenous Tobit). Consider the following censored regression model with predetermined explanatory variables and individual effects:

$$d_{it} = x_{1it}'\theta_1 + \alpha_{1i} + \varepsilon_{1it} = f_1(x_{1it}, \alpha_{1i}; \theta_1) + \varepsilon_{1it},$$
$$y_{it} = \max\{\delta_2 d_{it} + x_{2it}'\pi_2 + \alpha_{2i} + u_{2it},\ 0\}.$$

If the error terms $(\varepsilon_{1it}, u_{2it})$ are jointly normal conditional on $[x_i(t)', \alpha_{1i}, \alpha_{2i}]'$ with zero mean, $E[\varepsilon_{1it}^2 \mid x_i(t), \alpha_{1i}, \alpha_{2i}] = \sigma_1^2$, $E[u_{2it}^2 \mid x_i(t), \alpha_{1i}, \alpha_{2i}] = \sigma_2^2$, and $E[\varepsilon_{1it} u_{2it} \mid x_i(t), \alpha_{1i}, \alpha_{2i}] = \sigma_1^2 \zeta_2$, we can express the primary equation for $y_{it}$ as

$$y_{it} = \Phi_{2it} \times (\delta_2 d_{it} + x_{2it}'\pi_2 + \zeta_2 \lambda_{it} + \alpha_{2i} + \tilde\sigma_2 \phi_{2it}/\Phi_{2it}) + \varepsilon_{2it} = f_2(d_{it}, x_{2it}, \lambda_{it}, \alpha_{2i}; \theta_2) + \varepsilon_{2it},$$

where $\tilde\sigma_2^2 = \sigma_2^2 - \sigma_1^2 \zeta_2^2$, and $\Phi_{2it}$ and $\phi_{2it}$ denote the CDF and PDF of the standard normal evaluated at $(\delta_2 d_{it} + x_{2it}'\pi_2 + \zeta_2 \lambda_{it} + \alpha_{2i})/\tilde\sigma_2$. Here $(\varepsilon_{1it}, \varepsilon_{2it})'$ satisfies Assumption 1 with $\theta_2 := (\delta_2, \pi_2', \zeta_2, \tilde\sigma_2)'$. The control variable is the reduced form error term $\lambda_{it} = d_{it} - x_{1it}'\theta_1 - \alpha_{1i}$.

3. Bias corrections for two-step FE estimators

3.1. One-step analytical bias correction

The parameters of the reduced form equation in (1) are identified by the population problem
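$$\left(\theta_{10}, \{\alpha_{1i0}\}_{i=1}^{n}\right) = \arg\max_{\theta_1, \{\alpha_{1i}\}_{i=1}^{n}} E\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} g_1(w_{it}; \theta_1, \alpha_{1i})\right], \tag{2}$$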
where g1 is some suitable criterion function (e.g., least squares, likelihood, or GMM criterion function), wit = (xit , dit , yit ) (t = 1, . . . , T ; i = 1, . . . , n) are the data observations including the covariates and endogenous variables, and we assume that the problem has a unique solution. We obtain FE estimators for the parameters by solving the corresponding sample problem
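$$\hat\alpha_{1i}(\theta_1) = \arg\max_{\alpha_{1i}} \frac{1}{T}\sum_{t=1}^{T} g_1(w_{it}; \theta_1, \alpha_{1i}),$$
$$\hat\theta_1 = \arg\max_{\theta_1} \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} g_1(w_{it}; \theta_1, \hat\alpha_{1i}(\theta_1)),$$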
where we first concentrate out the individual effects and then solve for θ1 . Neyman and Scott (1948) show that nonlinear FE estimators can be severely biased in short panels due to the incidental parameters problem. This bias arises because the unobserved individual effects are replaced by sample estimates, αˆ 1i (θ1 ). Since in nonlinear models estimation of parameters cannot be separated from individual effects, the estimation error of the individual effects contaminates the parameter estimates. To see this, note that from the usual M-estimation properties, for n → ∞ with T fixed,
$$\hat\theta_1 \to_P \theta_{1T} = \arg\max_{\theta_1} \lim_{n\to\infty} \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} g_1(w_{it}; \theta_1, \hat\alpha_{1i}(\theta_1)).$$
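The probability limit $\theta_{1T} \neq \theta_{10}$ generally since $\hat\alpha_{1i}(\theta_{10}) \neq \alpha_{1i0}$; but $\theta_{1T} \to \theta_{10}$ as $T \to \infty$, since $\hat\alpha_{1i}(\theta_{10}) \to \alpha_{1i0}$. For smooth moment conditions, $\theta_{1T} = \theta_{10} + B_1/T + o(1/T)$ for some $B_1$. Then, by asymptotic normality of M-estimators, $\sqrt{nT}(\hat\theta_1 - \theta_{1T}) \to_d N(0, \Sigma_1)$ as $n \to \infty$, and therefore

$$\sqrt{nT}(\hat\theta_1 - \theta_{10}) = \sqrt{nT}(\hat\theta_1 - \theta_{1T}) + \sqrt{nT}\, B_1/T + o\left(\sqrt{n/T}\right).$$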
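The expansion θ1T = θ10 + B1/T + o(1/T) can be made concrete with the classical Neyman and Scott (1948) example, which is used here purely as an illustration and is not one of the models treated in this paper. With y_it distributed N(α_i, σ²), the FE maximum likelihood estimator of σ² has expectation σ²(T − 1)/T = σ² − σ²/T, so B1 = −σ². Subtracting the plug-in estimate of B1/T leaves an expectation of σ²(1 − 1/T²), so the first-order bias is removed. The function names below are hypothetical.

```python
def fe_variance_mle(panel):
    """FE maximum likelihood estimate of the common variance.

    panel: list of n lists, each holding the T observations of one individual.
    Each individual mean plays the role of an estimated fixed effect.
    """
    n, T = len(panel), len(panel[0])
    ssr = 0.0
    for y in panel:
        ybar = sum(y) / T
        ssr += sum((v - ybar) ** 2 for v in y)
    return ssr / (n * T)

def bias_corrected(sigma2_hat, T):
    """Plug-in large-T correction: subtract the estimate of B1/T, B1 = -sigma^2."""
    return sigma2_hat + sigma2_hat / T

panel = [[1.0, 3.0], [2.0, 6.0]]   # n = 2 individuals, T = 2 periods
s2 = fe_variance_mle(panel)        # (1 + 1 + 4 + 4) / 4 = 2.5
print(s2, bias_corrected(s2, 2))   # 2.5 3.75
```

In this special case an exact correction, multiplying by T/(T − 1), is also available; the plug-in version is shown because it mirrors the generic form of the large-T corrections, which remove only the leading 1/T term.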
If T grows at the same rate as n with the sample size, the FE estimator, while consistent, has a limiting distribution which is not centered at the true parameter value. This large-T version of the incidental parameters problem invalidates inference based on the standard asymptotic distribution of the FE estimator. Hahn and Kuersteiner (forthcoming) characterize the expression of the bias, B1 , using a stochastic expansion of the fixed effects estimator on the order of T for general nonlinear models. We
I. Fernández-Val, F. Vella / Journal of Econometrics 163 (2011) 144–162
briefly reproduce this expansion and the resulting expression for the bias to introduce notation used below. Let

\[
u_{1it}(\theta_1,\alpha_1) := \partial g_1(w_{it};\theta_1,\alpha_1)/\partial\theta_1, \qquad
v_{1it}(\theta_1,\alpha_1) := \partial g_1(w_{it};\theta_1,\alpha_1)/\partial\alpha_1, \tag{3}
\]

be the score functions that determine the estimating equations for $\theta_1$ and $\alpha_{1i}$, and let additional subscripts denote partial derivatives, e.g., $u_{1it\theta}(\theta_1,\alpha_1) := \partial u_{1it}(\theta_1,\alpha_1)/\partial\theta_1'$. The arguments are omitted when the expressions are evaluated at the true parameter value, i.e., $v_{1it\theta} = v_{1it\theta}(\theta_{10},\alpha_{1i0})$. The (leading term of the large-T expansion of the) asymptotic bias is

\[
B_1 = -J_1^{-1} b_1, \tag{4}
\]

where $J_1 = E_n[J_{1i}]$ is the probability limit of the Jacobian of the estimating equation for $\theta_1$,

\[
J_{1i} = E_T[u_{1it\theta}] - E_T[u_{1it\alpha}]\, E_T[v_{1it\theta}] / E_T[v_{1it\alpha}],
\]

with $E_n[m_i] := \operatorname{plim}_{n\to\infty} \sum_{i=1}^{n} m_i/n$ and $E_T[h_t] := \operatorname{plim}_{T\to\infty} \sum_{t=1}^{T} h_t/T$ for any random sequences $\{m_i\}_{i=1}^{n}$ and $\{h_t\}_{t=1}^{T}$; and $b_1 := E_n[b_{1i}]$ is the bias of the estimating equation for $\theta_1$,

\[
b_{1i} = \bar E_T[u_{1it\alpha}\psi_{1is}] + E_T[u_{1it\alpha}]\,\beta_{1i} + E_T[u_{1it\alpha\alpha}]\,\sigma_{1i}^2/2, \tag{5}
\]

with $\bar E_T[h_{it}k_{is}] := \sum_{j=-\infty}^{\infty} E_T[h_{it}k_{i,t-j}]$, the spectral expectation, for any sequences $\{h_{it}\}_{t=1}^{T}$ and $\{k_{is}\}_{s=1}^{T}$.3 In Eq. (5), $\sigma_{1i}^2$ and $\beta_{1i}$ are the asymptotic variance and bias components of a higher-order expansion for the estimator of the individual effects at the true parameter value. That is, as $T \to \infty$,

\[
\hat\alpha_{1i}(\theta_{10}) = \alpha_{1i0} + \psi_{1i}/\sqrt{T} + \beta_{1i}/T + o_P(1/T), \tag{6}
\]

where the influence function $\psi_{1i} = \sum_{t=1}^{T}\psi_{1it}/\sqrt{T} \to_d N(0,\sigma_{1i}^2)$, $\psi_{1it} = -E_T[v_{1it\alpha}]^{-1} v_{1it}$, the asymptotic variance is $\sigma_{1i}^2 = \bar E_T[\psi_{1it}\psi_{1is}]$, and the higher-order bias is $\beta_{1i} = -E_T[v_{1it\alpha}]^{-1}\{\bar E_T[v_{1it\alpha}\psi_{1is}] + E_T[v_{1it\alpha\alpha}]\sigma_{1i}^2/2\}$.

As in Hahn and Newey (2004), the bias of the estimating equation in (5) has three terms, all arising from the randomness of $\hat\alpha_{1i}(\theta_{10})$. The first term comes from the correlation of $u_{1it}$ with $\psi_{1is}$, and is present because $\alpha_{1i0}$ is estimated from the same observations as $\theta_{10}$. This term is zero if $u_{1it}$ is linear in $\alpha_{1i0}$ and the regressor $x_{it}$ is strictly exogenous. The second and third terms come from the higher-order bias and variance of $\hat\alpha_{1i}(\theta_{10})$ and are due to nonlinearities. The expression for $b_1$ is different from the corresponding expression in Hahn and Kuersteiner (forthcoming) because they use the score function $U_{it}(\theta_1,\alpha_1) = u_{1it}(\theta_1,\alpha_1) - v_{1it}(\theta_1,\alpha_1)E_T[u_{1it\alpha}]/E_T[v_{1it\alpha}]$ to construct the estimating equation for $\theta_1$ instead of $u_{1it}(\theta_1,\alpha_1)$. This simplifies some of the components of the bias because $U_{it}$ and $v_{1it}$ are information-orthogonal. We do not adopt this orthogonal formulation because we find the resulting bias expressions in terms of $U_{it}$ more difficult to interpret.

3 The notation $E_n$ and $E_T$ for the probability limits as $n \to \infty$ and $T \to \infty$ is an abuse of notation, because the limits do not depend on n and T. We use it to emphasize with respect to which dimension the limits are taken.

3.2. Two-step analytical bias correction

We now extend the large-T bias correction methods to two-step FE estimators for the parameters of the primary equation in (1). The parameters of interest are identified by the population problem

\[
\left(\theta_{20}, \{\alpha_{2i0}\}_{i=1}^{n}\right) = \arg\max_{\theta_2,\{\alpha_{2i}\}_{i=1}^{n}} E\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} g_2(w_{it},\lambda_{it};\theta_2,\alpha_{2i})\right],
\]

where $g_2$ is the appropriate objective function and we assume a unique solution. The control variable has a known parametric form which generally depends on the unknown parameters and individual effects of the reduced form equation. We estimate this equation in a first stage from a (possibly) nonlinear panel model with parameters identified by an optimization problem such as (2). We obtain FE estimates of the second stage parameters by solving the optimization problem in the sample

\[
\hat\theta_2 = \arg\max_{\theta_2} \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} g_2(w_{it},\tilde\lambda_{it};\theta_2,\hat\alpha_{2i}(\theta_2)), \qquad
\hat\alpha_{2i}(\theta_2) = \arg\max_{\alpha_{2i}} \frac{1}{T}\sum_{t=1}^{T} g_2(w_{it},\tilde\lambda_{it};\theta_2,\alpha_{2i}),
\]

where $\tilde\lambda_{it} = \lambda(w_{it};\tilde\theta_1,\hat\alpha_{1i}(\tilde\theta_1))$ and $\tilde\theta_1$ is an estimator of $\theta_{10}$. To simplify the exposition and focus the analysis on the new sources of incidental parameters bias, we assume that $\tilde\theta_1 = \theta_{10}$ in the following discussion. In general it suffices that $\tilde\theta_1 = \theta_{10} + O_P(1/\sqrt{nT})$, which holds if $\tilde\theta_1$ is a large-T bias-corrected estimator of $\theta_{10}$.

We show that the resulting FE estimators are biased if the control variable is nonlinear in the first stage individual effects or the second stage is nonlinear. An additional round of bias correction is therefore needed in the second step. The existing correction methods for one-step procedures are generally not suitable for this second stage correction. The control variable $\tilde\lambda_{it}$ is a FE estimate that depends on the first-stage individual effects. This dependence introduces an additional source of incidental parameters bias. The issue here is similar to the two-step variance estimation (Newey, 1984). Thus, as $n \to \infty$, the limit of the FE estimator is

\[
\hat\theta_2 \to_P \theta_{2T} = \arg\max_{\theta_2} \lim_{n\to\infty} \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} g_2(w_{it},\tilde\lambda_{it};\theta_2,\hat\alpha_{2i}(\theta_2)).
\]

Here $\theta_{2T} \neq \theta_{20}$, not only because $\hat\alpha_{2i}(\theta_{20}) \neq \alpha_{2i0}$, but also because $\tilde\lambda_{it} \neq \lambda_{it}$. To see the second inequality, note that $\lambda_{it}(w_{it};\theta_{10},\hat\alpha_{1i}(\theta_{10})) \neq \lambda_{it}(w_{it};\theta_{10},\alpha_{1i0})$ since $\hat\alpha_{1i}(\theta_{10}) \neq \alpha_{1i0}$. Moreover, the additional source of bias is not related to the estimation of $\theta_{10}$, since we are evaluating this parameter at its true value. As a result of the previous analysis, a bias expression similar to $B_1$ would not be valid because it would only account for the bias due to the estimation of the $\alpha_{2i0}$'s.

We derive the general expression for the bias of two-step FE estimators using stochastic expansions that explicitly account for the randomness introduced by the estimation of the control variable. As for the one-step estimators, let

\[
u_{2it}(\lambda,\theta_2,\alpha_2) := \partial g_2(w_{it},\lambda;\theta_2,\alpha_2)/\partial\theta_2, \qquad
v_{2it}(\lambda,\theta_2,\alpha_2) := \partial g_2(w_{it},\lambda;\theta_2,\alpha_2)/\partial\alpha_2, \tag{7}
\]

be the score functions that determine the estimating equations for $\theta_2$ and $\alpha_{2i}$, and let additional subscripts denote partial derivatives, e.g., $u_{2it\lambda}(\lambda,\theta_2,\alpha_2) := \partial u_{2it}(\lambda,\theta_2,\alpha_2)/\partial\lambda$. For the control variable, let $\lambda_{it}(\theta_1,\alpha_1) := \lambda(w_{it};\theta_1,\alpha_1)$ and let additional subscripts denote partial derivatives, e.g., $\lambda_{it\alpha}(\theta_1,\alpha_1) = \partial\lambda_{it}(\theta_1,\alpha_1)/\partial\alpha_1$. We continue omitting the arguments when the expressions are evaluated at the true parameter values. The estimating equation for the two-step FE estimator is

\[
0 = \sum_{i=1}^{n}\sum_{t=1}^{T} u_{2it}(\tilde\lambda_{it},\hat\theta_2,\hat\alpha_{2i}(\hat\theta_2)),
\]

where $\hat\alpha_{2i}(\theta_2)$ is the solution to

\[
0 = \sum_{t=1}^{T} v_{2it}(\tilde\lambda_{it},\theta_2,\hat\alpha_{2i}(\theta_2)).
\]
Proposition 1 (Bias Expression). Under Conditions 1–4 given in Section 4, the bias of the two-step FE estimator has the asymptotic expansion, as $n, T \to \infty$,

\[
T\,E[\hat\theta_2 - \theta_{20}] = -J_2^{-1} E_n[b_{2i}] =: B_2. \tag{8}
\]

Here, $J_2 = E_n\left[E_T[u_{2it\theta}] - E_T[u_{2it\alpha}]\,E_T[v_{2it\theta}]/E_T[v_{2it\alpha}]\right]$ is the limit of the Jacobian of the estimating equation for $\theta_2$; and $b_{2i}$, the bias of the estimating equation for $\theta_2$, takes the form

\[
\begin{aligned}
b_{2i} ={}& \bar E_T[u_{2it\alpha}\psi_{2is}] + E_T[u_{2it\alpha}]\,\beta_{2i} + E_T[u_{2it\alpha\alpha}]\,\sigma_{2i}^2/2 \\
&+ E_T[u_{2it\lambda\alpha}\lambda_{it\alpha}]\,\sigma_{12i} + \bar E_T[u_{2it\lambda}\lambda_{it\alpha}\psi_{1is}] \\
&+ E_T\left[u_{2it\lambda}\left(\lambda_{it\alpha}\beta_{1i} + \lambda_{it\alpha\alpha}\sigma_{1i}^2/2\right)\right] + E_T[u_{2it\lambda\lambda}\lambda_{it\alpha}^2]\,\sigma_{1i}^2/2.
\end{aligned}
\]

The terms $\sigma_{2i}^2$ and $\beta_{2i}$ are the asymptotic variance and bias components of a higher-order expansion for the estimator of the individual effects of the primary equation evaluated at the true parameter value. That is, as $T \to \infty$,

\[
\hat\alpha_{2i}(\theta_{20}) = \alpha_{2i0} + \psi_{2i}/\sqrt{T} + \beta_{2i}/T + o_P(1/T), \tag{9}
\]

where the influence function $\psi_{2i} = \sum_{t=1}^{T}\psi_{2it}/\sqrt{T} \to_d N(0,\sigma_{2i}^2)$,

\[
\psi_{2it} = -E_T[v_{2it\alpha}]^{-1}\left\{ v_{2it} + E_T[v_{2it\lambda}\lambda_{it\alpha}]\,\varphi_{1it} + E_T[v_{2it\lambda}\lambda_{it\theta}] \sum_{i=1}^{n}\phi_{1it}/\sqrt{n} \right\},
\]

with $\varphi_{1it} = \psi_{1it} - E_T[v_{1it\alpha}]^{-1}E_T[v_{1it\theta}]\,\phi_{1it}$, $\phi_{1it} = -J_1^{-1}U_{1it}$, and $U_{1it} = u_{1it} + E_T[u_{1it\alpha}]\psi_{1it}$ (here $\varphi_{1it}$ and $\phi_{1it}$ stand for the symbols printed as φ1it and ϕ1it); the higher-order bias is

\[
\begin{aligned}
\beta_{2i} = -E_T[v_{2it\alpha}]^{-1}\Big\{ & \bar E_T[v_{2it\alpha}\psi_{2is}] + E_T[v_{2it\alpha\alpha}]\,\sigma_{2i}^2/2 + E_T[v_{2it\lambda\alpha}\lambda_{it\alpha}]\,\sigma_{12i} + \bar E_T[v_{2it\lambda}\lambda_{it\alpha}\psi_{1is}] \\
& + E_T\left[v_{2it\lambda}\left(\lambda_{it\alpha}\beta_{1i} + \lambda_{it\alpha\alpha}\sigma_{1i}^2/2\right)\right] + E_T[v_{2it\lambda\lambda}\lambda_{it\alpha}^2]\,\sigma_{1i}^2/2 \Big\},
\end{aligned}
\]

and the asymptotic variance is $\sigma_{2i}^2 = \bar E_T[\psi_{2it}\psi_{2is}]$. The term $\sigma_{12i}$ is the asymptotic covariance between the estimators of the individual effects of the reduced form and primary equations

\[
\sigma_{12i} = \bar E_T[\psi_{1it}\psi_{2is}]. \tag{10}
\]

Proposition 1 shows that the bias of the two-step estimating equation, $b_{2i}$, in addition to the three components of a one-step estimating equation in Eq. (5) (first three terms), has four additional components arising from the FE estimation of the control variable and the nonlinearity and/or dynamics of the second stage. Recall that the first three terms arise from the randomness of $\hat\alpha_{2i}(\theta_{20})$ if the primary equation is nonlinear in these individual effects or there are dynamic feedbacks. The new terms arise through the correlation between the estimators of the individual effects in the first and second stages, $\sigma_{12i}$ (fourth term), since both stages use the same observations to estimate these effects; the asymptotic bias of the FE estimator of the control variable $\tilde\lambda_{it}$ due to dynamic feedbacks (fifth term) or nonlinearities (sixth term), coming from the nonlinearity of this variable in the first stage individual effects; and the nonlinearity of the second stage in the control variable (last term). The expression for the higher-order bias of the estimator of the individual effects, $\beta_{2i}$, also has more components than the corresponding expression in the one-step case. The additional components come from the same sources as for the bias of the estimating equation.

In dynamically complete models based on sequential conditional moment conditions that satisfy Assumption 1, the influence functions $\psi_{1it}$ and $\psi_{2it}$ are martingale differences relative to the explanatory variables and individual effects, and are therefore uncorrelated with arbitrary bounded functions of past values of the explanatory variables. Accordingly the terms involving the spectral expectation $\bar E_T$ can be replaced by $\tilde E_T$, where $\tilde E_T[f_{kit}\psi_{lis}] = \sum_{j=0}^{\infty} E_T[f_{kit}\psi_{li,t-j}]$ and $\tilde E_T[\psi_{kit}\psi_{lis}] = E_T[\psi_{kit}\psi_{lit}]$ for any sequence $\{f_{kt}\}_{t=1}^{T}$ and $k, l \in \{1, 2\}$. Moreover, if in addition the explanatory variables are strictly exogenous, the influence functions are also uncorrelated with arbitrary bounded functions of future values of these variables and $\tilde E_T$ can be replaced by $E_T$.

Example 1 (Sample Selection, Cont.). The objective functions are the log-likelihood of the probit model

\[
\begin{aligned}
g_1(w_{it};\theta_1,\alpha_{1i}) &= d_{it}\log f_1(x_{1it},\alpha_{1i};\theta_1) + (1-d_{it})\log\{1 - f_1(x_{1it},\alpha_{1i};\theta_1)\} \\
&= d_{it}\log\Phi(x_{1it}'\theta_1 + \alpha_{1i}) + (1-d_{it})\log\{1 - \Phi(x_{1it}'\theta_1 + \alpha_{1i})\},
\end{aligned}
\]

and the least squares criterion

\[
g_2(w_{it},\lambda_{it};\theta_2,\alpha_{2i}) = -\{y_{it} - f_2(d_{it},x_{2it},\lambda_{it},\alpha_{2i};\theta_2)\}^2/2 = -d_{it}(y_{it} - x_{2it}'\pi_2 - \zeta_2\lambda_{it} - \alpha_{2i})^2/2.
\]

The estimating equations for the second stage are $u_{2it} = [(u_{2it}^{\pi})', u_{2it}^{\zeta}]'$, where $u_{2it}^{\pi} = x_{2it}v_{2it}$ is the score function for $\pi_2$, and $u_{2it}^{\zeta} = \lambda_{it}v_{2it}$ is the score function for $\zeta_2$, with $v_{2it} = d_{it}(y_{it} - x_{2it}'\pi_2 - \zeta_2\lambda_{it} - \alpha_{2i})$. Let $\bar r_{it}$ denote the variable $r_{it}$ in deviations with respect to its individual mean in the selected population, i.e., $\bar r_{it} = r_{it} - E_T[d_{it}r_{it}]/E_T[d_{it}]$. The bias of the estimating equations is $b_{2i} = [(b_{2i}^{\pi})', b_{2i}^{\zeta}]'$, where

\[
b_{2i}^{\pi} = -E_T[d_{it}]^{-1}\tilde E_T[d_{it}\bar x_{2it}v_{2is}] - \zeta_2\tilde E_T[(d_{it}\bar x_{2it}\bar\lambda_{it\alpha})\psi_{2is}] - \zeta_2 E_T[d_{it}\bar x_{2it}(\bar\lambda_{it\alpha}\beta_{1i} + \bar\lambda_{it\alpha\alpha}\sigma_{1i}^2/2)],
\]

for the estimating equation of $\pi_2$, and

\[
b_{2i}^{\zeta} = -E_T[d_{it}]^{-1}\tilde E_T[d_{it}\bar\lambda_{it}v_{2is}] - \zeta_2\tilde E_T[(d_{it}\bar\lambda_{it}\bar\lambda_{it\alpha})\psi_{2is}] - \zeta_2 E_T[d_{it}\bar\lambda_{it}(\bar\lambda_{it\alpha}\beta_{1i} + \bar\lambda_{it\alpha\alpha}\sigma_{1i}^2/2)] - \zeta_2 E_T[d_{it}\bar\lambda_{it}^2]\sigma_{1i}^2,
\]

for the estimating equation of $\zeta_2$. The first term of the biases for these two estimating equations comes from the dynamic feedback in the primary equation. The terms due to the nonlinearity with respect to $\alpha_{2i}$ and the interaction with $\lambda_{it}$ drop out since the equations are linear in $\alpha_{2i}$. The next two terms come from the nonlinearity of the control variable in the individual effects of the reduced form equation. For the estimating equation of $\zeta_2$ the final term comes from the nonlinearity of this equation with respect to $\lambda_{it}$.

Example 2 (Endogenous Tobit, Cont.). The objective functions are the least squares criterion

\[
g_1(w_{it};\theta_1,\alpha_{1i}) = -\{d_{it} - f_1(x_{1it},\alpha_{1i};\theta_1)\}^2/2 = -(d_{it} - x_{1it}'\theta_1 - \alpha_{1i})^2/2,
\]

and the log-likelihood of the Tobit model

\[
g_2(w_{it},\lambda_{it};\theta_2,\alpha_{2i}) = s_{it}\log\phi(y_{it} - z_{2it}'\theta_2 - \alpha_{2i}) + (1-s_{it})\log(1 - \Phi(z_{2it}'\theta_2 + \alpha_{2i})),
\]

where $s_{it} = 1\{y_{it} > 0\}$, $z_{2it} = (d_{it}, x_{2it}', \lambda_{it})'$, $\theta_2 = (\delta_2, \pi_2', \zeta_2)'$, and we assume that the variance of $\varepsilon_{2it}$ is known to be one for simplicity.4 The estimating equation for the second stage is $u_{2it} = z_{2it}v_{2it}$, with

\[
v_{2it} = s_{it}(y_{it} - z_{2it}'\theta_2 - \alpha_{2i}) - (1-s_{it})\,\phi(z_{2it}'\theta_2 + \alpha_{2i})/\Phi(z_{2it}'\theta_2 + \alpha_{2i}).
\]

The bias of this estimating equation takes the form

\[
b_{2i} = E_T[u_{2it\alpha}]\beta_{2i} + \tilde E_T[u_{2it\alpha}(\psi_{2is} - \zeta_2\psi_{1is})] + E_T[u_{2it\alpha\alpha}](\sigma_{2i}^2 + \zeta_2^2\sigma_{1i}^2 - 2\zeta_2\sigma_{12i}).
\]

This expression corresponds to the bias of a one-step estimator with individual effect $\alpha_{2i} - \zeta_2\alpha_{1i}$ where the $\alpha_{1i}$'s are estimated from the reduced form equation. This result follows from the linearity of the control variable in $\alpha_{1i}$. To see this, note that the primary equation can be written as

\[
y_{it} = \max\{\delta_2 d_{it} + x_{2it}'\pi_2 + \zeta_2(d_{it} - x_{1it}'\theta_1) + (\alpha_{2i} - \zeta_2\alpha_{1i}) + \varepsilon_{2it},\, 0\}.
\]

We can use the expression for $B_2$ in Proposition 1 to construct analytical (closed form) bias-corrected estimators for the second stage parameters and other functions of parameters and individual effects such as marginal effects. A bias-corrected estimator for model parameters can be formed as
\[
\tilde\theta_2 = \hat\theta_2 - \hat B_2/T,
\]

where $\hat B_2$ is an estimator of $B_2$ constructed using sample analogs of the components of $J_2$ and $b_2$. Moreover, since $\hat B_2$ generally depends on $\theta_{20}$, we have that $\hat B_2 = \hat B_2(\hat\theta_2)$. Iterated bias corrections can be constructed similarly to those for one-step estimators by solving $\tilde\theta_2^{\infty} = \hat\theta_2 - \hat B_2(\tilde\theta_2^{\infty})/T$. An alternative bias-corrected estimator can also be obtained by correcting the estimating equation of $\theta_2$ by an estimate of $b_2$.

In deriving the expression of the bias in Eq. (8) we assume that the control variable is constructed using bias-corrected estimates of the first stage parameter. We can also obtain similar bias expressions for two-step estimators that use control variables constructed from uncorrected first stage estimates. Our approach has the advantages that it yields bias-corrected estimates of both the reduced form and primary equations, and that the bias expressions in the second stage involve fewer terms. Alternative panel jackknife corrections, which do not require explicit characterization of the analytical expression of the bias, can be constructed by appropriately adapting the approaches in Hahn and Newey (2004) and Dhaene and Jochmans (2010).

4. Asymptotic theory

To guarantee the validity of the higher-order expansions used to derive the expression of the bias in Proposition 1 and to establish the validity of the bias corrections in large samples, we impose the following conditions:

Condition 1 (Sampling). (i) $n, T \to \infty$ such that $n/T \to \kappa$, where $0 < \kappa < \infty$. (ii) For each i, $w_i := \{w_{it}\}_{t=1,2,\ldots}$ is a stationary mixing sequence. Let $\mathcal{A}_t^i = \sigma(w_{it}, w_{i,t-1}, \ldots)$, $\mathcal{D}_t^i = \sigma(w_{it}, w_{i,t+1}, \ldots)$, and $a_i(m) = \sup_t \sup_{A\in\mathcal{A}_t^i,\, D\in\mathcal{D}_{t+m}^i} |P(A\cap D) - P(A)P(D)|$. Then, $\sup_i |a_i(m)| \le C a^m$ for some a such that $0 < a < 1$ and some $C > 0$. (iii) $\{w_i\}_{i=1,2,\ldots}$ are independent across i.
Condition 2 (First Stage). Let $g_{1it}(\gamma_1) := g_1(w_{it};\gamma_1)$ be the first stage objective function indexed by the parameter $\gamma_1 = (\theta_1,\alpha_1) \in \Gamma_1$. (i) For each i, $\gamma_{1i0} := (\theta_{10},\alpha_{1i0}) \in \operatorname{int}\Gamma_1$, and the parameter space $\Gamma_1$ is a convex, compact subset of $\mathbb{R}^{p_1}$. (ii) For each $\eta > 0$ and n, $\inf_i \left[ g_{1i}(\gamma_{1i0}) - \sup_{\{\gamma_1\in\Gamma_1 : |\gamma_1-\gamma_{1i0}|>\eta\}} g_{1i}(\gamma_1) \right] > 0$, where $g_{1i}(\gamma_1) := E_T[g_{1it}(\gamma_1)]$. (iii) Let $\nu = (\nu_1,\ldots,\nu_{p_1})$ be a vector of non-negative integers, $|\nu| = \sum_{j=1}^{p_1}\nu_j$, and $\nabla^{\nu}g_{1it}(\gamma_1) = \partial^{|\nu|}g_{1it}(\gamma_1)/(\partial\gamma_{11}^{\nu_1}\cdots\partial\gamma_{1p_1}^{\nu_{p_1}})$; then, there exists a function $M_1(w_{it})$, such that for all $\gamma_1,\gamma_1' \in \Gamma_1$ and $|\nu| \le 5$, $|\nabla^{\nu}g_{1it}(\gamma_1) - \nabla^{\nu}g_{1it}(\gamma_1')| \le M_1(w_{it})|\gamma_1 - \gamma_1'|$; $\sup_{\gamma_1\in\Gamma_1}|\nabla^{\nu}g_{1it}(\gamma_1)| \le M_1(w_{it})$ for $|\nu| \le 5$; and $\sup_i E_T[|M_1(w_{it})|^{10q_1+12+\upsilon}] < \infty$ for some integer $q_1 \ge p_1/2 + 2$ and some $\upsilon > 0$. (iv) $J_1$ is negative definite and finite. (v) The covariance matrix $\Omega_1$, defined in Lemma 1, is positive definite and finite. (vi) For each n, $0 < \inf_i\sigma_{1i}^2$ and $\sup_i\sigma_{1i}^2 < \infty$. (vii) The bandwidth $m_1$ for the estimation of spectral expectations is such that $m_1 \to \infty$ and $m_1T^{-1} \to 0$, as $n, T \to \infty$.
These conditions, taken from Hahn and Kuersteiner (forthcoming), guarantee the validity of the first stage bias correction for general one-step FE M-estimators. The stationarity assumption is restrictive as it rules out, for example, time dummies and other deterministic trend components as explanatory variables. How to extend the large-T bias corrections to allow for non-stationary variables is an open question beyond the scope of this paper. Let
\[
\hat v_{1it}(\theta_1) := v_{1it}(\theta_1,\hat\alpha_{1i}(\theta_1)), \qquad
\hat u_{1it}(\theta_1) := u_{1it}(\theta_1,\hat\alpha_{1i}(\theta_1)),
\]

be fixed effects estimators of the score functions in (3), recalling that additional subscripts denote partial derivatives, e.g., $\hat v_{1it\alpha}(\theta_1) := v_{1it\alpha}(\theta_1,\hat\alpha_{1i}(\theta_1))$. Let

\[
\hat\psi_{1it}(\theta_1) := -\hat v_{1it}(\theta_1)/\hat E_T[\hat v_{1it\alpha}(\theta_1)], \qquad
\hat\sigma_{1i}^2(\theta_1) := \hat{\bar E}_{T,m_1}[\hat\psi_{1it}(\theta_1)\hat\psi_{1is}(\theta_1)],
\]

where $\hat E_T[f_{it}] := \sum_{t=1}^{T} f_{it}/T$ and $\hat{\bar E}_{T,m_1}[f_{it}g_{is}] := \sum_{j=-m_1}^{m_1}\sum_{t=\max(1,j)}^{\min(T,T+j)} f_{it}g_{i,t-j}/(T-j)$, for any functions $f_{it} = f(w_{it})$ and $g_{it} = g(w_{it})$. The parameter $m_1$ is a bandwidth parameter that needs to be chosen such that $m_1 \to \infty$ and $m_1/T \to 0$ as $T \to \infty$ (Hahn and Kuersteiner, forthcoming), and $\hat\psi_{1it}(\theta_1)$ and $\hat\sigma_{1i}^2(\theta_1)$ are estimators of the influence function and asymptotic variance of $\hat\alpha_{1i}(\theta_1)$, see Eq. (6). Let

\[
\begin{aligned}
\hat\beta_{1i}(\theta_1) &:= -\hat E_T[\hat v_{1it\alpha}(\theta_1)]^{-1}\left\{ \hat{\bar E}_{T,m_1}[\hat v_{1it\alpha}(\theta_1)\hat\psi_{1is}(\theta_1)] + \hat E_T[\hat v_{1it\alpha\alpha}(\theta_1)]\,\hat\sigma_{1i}^2(\theta_1)/2 \right\}, \\
\hat J_{1i}(\theta_1) &:= \hat E_T[\hat u_{1it\theta}(\theta_1)] - \hat E_T[\hat u_{1it\alpha}(\theta_1)]\,\hat E_T[\hat v_{1it\theta}(\theta_1)]/\hat E_T[\hat v_{1it\alpha}(\theta_1)], \\
\hat b_{1i}(\theta_1) &:= \hat{\bar E}_{T,m_1}[\hat u_{1it\alpha}(\theta_1)\hat\psi_{1is}(\theta_1)] + \hat E_T[\hat u_{1it\alpha}(\theta_1)]\,\hat\beta_{1i}(\theta_1) + \hat E_T[\hat u_{1it\alpha\alpha}(\theta_1)]\,\hat\sigma_{1i}^2(\theta_1)/2.
\end{aligned}
\]

Here, $\hat\beta_{1i}(\theta_1)$ is an estimator of the higher-order asymptotic bias of $\hat\alpha_{1i}(\theta_1)$ from a stochastic expansion as T grows, see Eq. (6); whereas $\hat J_{1i}(\theta_1)$ and $\hat b_{1i}(\theta_1)$ are estimators of the Jacobian and the asymptotic bias of the estimating equation of $\theta_1$ for individual i, see Eqs. (4) and (5). A bias-corrected version of the first-step FE estimator can be formed as

\[
\tilde\theta_1 = \hat\theta_1 - \hat B_1(\hat\theta_1)/T,
\]

where $\hat B_1(\theta_1) = -\hat E_n[\hat J_{1i}(\theta_1)]^{-1}\hat E_n[\hat b_{1i}(\theta_1)]$ is an estimator of the bias of $\hat\theta_1$, where $\hat E_n[f_i] := \sum_{i=1}^{n} f_i/n$ for any function $f_i = \hat E_T[f_{it}]$; and $\hat\theta_1$ is the FE estimator of $\theta_{10}$.

4 In censored models there is no simple analytical form for the relationship between the conditional expectation $f_2$ in the triangular representation (1) and the likelihood objective function $g_2$.
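The sample analogs above can be illustrated for a single individual. The following is a minimal sketch, assuming the scores have already been evaluated and stored as plain lists over t; the function names are hypothetical and not part of the paper. One further assumption: each lag-j cross product is averaged over the pairs (t, t − j) that actually lie in the sample, so the divisor is T − |j|.

```python
from typing import List, Sequence


def sample_mean(f: Sequence[float]) -> float:
    """Time average over t = 1..T for one individual."""
    return sum(f) / len(f)


def spectral_expectation(f: Sequence[float], g: Sequence[float], m: int) -> float:
    """Truncated estimate of sum_j E_T[f_t g_{t-j}] over lags |j| <= m.

    Assumption of this sketch: for each lag j the cross products are
    averaged over the T - |j| pairs (t, t - j) that lie inside the sample.
    """
    T = len(f)
    if len(g) != T:
        raise ValueError("f and g must have the same length")
    if not 0 <= m < T:
        raise ValueError("bandwidth m must satisfy 0 <= m < T")
    total = 0.0
    for j in range(-m, m + 1):
        lo, hi = max(0, j), min(T, T + j)  # 0-based t with 0 <= t - j < T
        total += sum(f[t] * g[t - j] for t in range(lo, hi)) / (T - abs(j))
    return total


def influence_function(v: Sequence[float], v_alpha: Sequence[float]) -> List[float]:
    """psi_t = -v_t / mean(v_alpha), the sample analog of the influence function."""
    denom = sample_mean(v_alpha)
    return [-x / denom for x in v]


def variance_of_effect(v: Sequence[float], v_alpha: Sequence[float], m: int) -> float:
    """sigma^2 estimate: spectral expectation of psi_t * psi_s with bandwidth m."""
    psi = influence_function(v, v_alpha)
    return spectral_expectation(psi, psi, m)
```

With m = 0 the spectral expectation reduces to the time average of the contemporaneous products, which is the case relevant under strict exogeneity.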
Lemma 1 (First Stage). Assume that Conditions 1 and 2 hold. Then, for any $\eta > 0$,

\[
\sqrt{nT}\,(\tilde\theta_1 - \theta_{10}) \to_d N(0, J_1^{-1}\Omega_1J_1^{-1}) \quad\text{and}\quad \Pr\left\{\max_{1\le i\le n}|\tilde\alpha_{1i} - \alpha_{1i0}| > \eta\right\} = o(T^{-1}),
\]

where $\Omega_1 := E_n\left[\bar E_T[U_{1it}U_{1is}']\right]$, $U_{1it} := u_{1it} + E_T[u_{1it\alpha}]\psi_{1it}$, and $\tilde\alpha_{1i} := \hat\alpha_{1i}(\tilde\theta_1)$.

Condition 3 (Control Variable). Let $\lambda_{it}(\gamma_1) = \lambda(w_{it};\gamma_1)$ be the control variable indexed by the parameter $\gamma_1 = (\theta_1,\alpha_1) \in \Gamma_1$. Let $\nu = (\nu_1,\ldots,\nu_{p_1})$ be a vector of non-negative integers, $|\nu| = \sum_{j=1}^{p_1}\nu_j$, $\nabla^{\nu}\lambda_{it}(\gamma_1) = \partial^{|\nu|}\lambda_{it}(\gamma_1)/(\partial\gamma_{11}^{\nu_1}\cdots\partial\gamma_{1p_1}^{\nu_{p_1}})$, and $N_{\varepsilon}(\gamma_{1i0})$ be an ε-neighborhood of $\gamma_{1i0}$ for some $\varepsilon > 0$. There exists a function $M_{\lambda}(w_{it})$, such that $|\nabla^{\nu}\lambda_{it}(\gamma_1) - \nabla^{\nu}\lambda_{it}(\gamma_1')| \le M_{\lambda}(w_{it})|\gamma_1 - \gamma_1'|$ for all $\gamma_1,\gamma_1' \in N_{\varepsilon}(\gamma_{1i0})$ and $|\nu| \le 4$; $\sup_{\gamma_1\in N_{\varepsilon}(\gamma_{1i0})}|\nabla^{\nu}\lambda_{it}(\gamma_1)| \le M_{\lambda}(w_{it})$ for $|\nu| \le 4$; and $\sup_i E_T[|M_{\lambda}(w_{it})|^{20q_{\lambda}+24+\upsilon}] < \infty$ for some integer $q_{\lambda} \ge p_1/2 + 2$ and some $\upsilon > 0$.

This condition guarantees the existence of higher-order expansions for fixed effects averages of smooth functions of the estimated control variable in a neighborhood of its true value, and the uniform convergence of the remainder terms in these expansions. In most applications the control variables are the influence function or generalized residuals from the reduced form equation, $\psi_{1it}$, and this condition follows from Condition 2, see examples below.

Condition 4 (Second Stage). Let $g_{2it}(\lambda,\gamma_2) := g_2(w_{it},\lambda;\gamma_2)$ be the second stage objective function indexed by the value of the control variable $\lambda \in \mathbb{R}$ and the parameter $\gamma_2 = (\theta_2,\alpha_2) \in \Gamma_2$. (i) For each i, $\gamma_{2i0} := (\theta_{20},\alpha_{2i0}) \in \operatorname{int}\Gamma_2$, and the parameter space $\Gamma_2$ is a convex, compact subset of $\mathbb{R}^{p_2}$. (ii) For each $\eta > 0$ and n, $\inf_i \left[ g_{2i}(\gamma_{2i0}) - \sup_{\{\gamma_2\in\Gamma_2 : |\gamma_2-\gamma_{2i0}|>\eta\}} g_{2i}(\gamma_2) \right] > 0$, where $g_{2i}(\gamma_2) := E_T[g_{2it}(\lambda_{it},\gamma_2)]$. (iii) Let $\nu = (\nu_{\lambda},\nu_1,\ldots,\nu_{p_2})$ be a vector of non-negative integers, $|\nu| = \nu_{\lambda} + \sum_{j=1}^{p_2}\nu_j$, $\nabla^{\nu}g_{2it}(\lambda,\gamma_2) = \partial^{|\nu|}g_{2it}(\lambda,\gamma_2)/(\partial\lambda^{\nu_{\lambda}}\partial\gamma_{21}^{\nu_1}\cdots\partial\gamma_{2p_2}^{\nu_{p_2}})$, and $N_{\varepsilon}(\lambda_{it})$ be an ε-neighborhood of $\lambda_{it}$ for some $\varepsilon > 0$; then, there exists a function $M_2(w_{it})$, such that for all $\lambda,\lambda' \in N_{\varepsilon}(\lambda_{it})$, $\gamma_2,\gamma_2' \in \Gamma_2$, and $|\nu| \le 5$, $|\nabla^{\nu}g_{2it}(\lambda,\gamma_2) - \nabla^{\nu}g_{2it}(\lambda',\gamma_2')| \le M_2(w_{it})(|\lambda-\lambda'| + |\gamma_2-\gamma_2'|)$; for $|\nu| \le 5$, $\sup_{\lambda\in N_{\varepsilon}(\lambda_{it}),\,\gamma_2\in\Gamma_2}|\nabla^{\nu}g_{2it}(\lambda;\gamma_2)| \le M_2(w_{it})$; and $\sup_i E_T[|M_2(w_{it})|^{20q_2+24+\upsilon}] < \infty$ for some integer $q_2 \ge p_2/2 + 2$ and some $\upsilon > 0$. (iv) $J_2$ is negative definite and finite. (v) The covariance matrix $\Omega_2$, defined in Theorem 1, is positive definite and finite. (vi) For each n, $0 < \inf_i\sigma_{2i}^2$ and $\sup_i\sigma_{2i}^2 < \infty$. (vii) The bandwidth $m_2$ for the estimation of spectral expectations is such that $m_2 \to \infty$ and $m_2T^{-1} \to 0$, as $n, T \to \infty$.

These conditions are sufficient to establish the validity of the second stage bias correction, extending the conditions in Hahn and Kuersteiner (forthcoming) to two-step dynamic FE M-estimators. Condition 4 guarantees parameter identification based on time series variation, but it does not explicitly impose exclusion restrictions in the reduced form and primary equations. Parameter identification can be achieved, in principle, by non-linearities in the reduced form and primary equations, or by non-linearities in the control variable. To avoid such an identification scheme an exclusion restriction should be imposed in the primary equation for each source of time varying endogeneity.

Example 1 (Sample Selection, Cont.). Sufficient conditions for Condition 2 for the static and dynamic probit models are given in Examples 1 and 2 of Hahn and Kuersteiner (forthcoming). Let $z_{1it}$ denote the elements of $x_{1it}$ excluding the lags of $d_{it}$ and including a constant term. These conditions require non-singularity of the matrix $E[z_{1it}z_{1it}']$ for each i, a stationarity and mixing condition on the processes $\{z_{1it}\}_{t=1}^{\infty}$, and a sufficient number of bounded moments for the elements of $z_{1it}$. Condition 3 for the control variable $\lambda_{it} = (d_{it} - \Phi_{1it})\phi_{1it}/[\Phi_{1it}(1 - \Phi_{1it})]$ follows from the same conditions. Condition 4 for the second stage follows by analogous conditions applied to $\bar z_{2it} = z_{2it} - E_T[d_{it}z_{2it}]/E_T[d_{it}]$, where $z_{2it}$ denote the elements of $x_{2it}$ excluding the lags of $y_{it}$, and standard conditions on the coefficients of the lags of $y_{it}$ to guarantee stationarity.
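The control variable of Example 1 is the probit generalized residual, which equals the inverse Mills ratio φ/Φ when d = 1 and −φ/(1 − Φ) when d = 0. The following is a minimal sketch of its evaluation at a given index value x′θ1 + α1; the function names are hypothetical, and the normal pdf and cdf are built from the standard library.

```python
import math


def norm_pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x: float) -> float:
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def probit_control_variable(d: int, index: float) -> float:
    """Generalized residual (d - Phi) * phi / [Phi * (1 - Phi)] at the given index.

    Note: for very large |index| the product Phi * (1 - Phi) underflows,
    so this direct formula is only meant for moderate index values.
    """
    Phi = norm_cdf(index)
    phi = norm_pdf(index)
    return (d - Phi) * phi / (Phi * (1.0 - Phi))
```

In a feasible version the index would be evaluated at the first stage estimates, so that the estimation error in the individual effect is carried into the control variable.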
Example 2 (Endogenous Tobit, Cont.). Let $z_{1it}$ denote the elements of $x_{1it}$ excluding the lags of $d_{it}$ and including a constant term. Since the first stage is linear and the control variable is the error term of the reduced form equation, non-singularity of the matrix $E[z_{1it}z_{1it}']$ for each i, a stationarity and mixing condition on the processes $\{z_{1it}\}_{t=1}^{\infty}$, and $10q_1 + 12 + \upsilon$ bounded moments for the elements of $x_{1it}$ and $d_{it}$ are sufficient for Conditions 2 and 3. Example 3 in Hahn and Kuersteiner (forthcoming) gives conditions for the validity of one-step bias corrections in Tobit models with lagged dependent variables and fixed effects that are sufficient for Condition 4. These conditions are similar to those given in the previous example for the probit model, see Hahn and Kuersteiner (forthcoming) for details.

Theorem 1 (Second Stage). Under Conditions 1–4, we have:

\[
\sqrt{nT}\,(\hat\theta_2 - \theta_{20} - B_2/T) \to_d N(0, J_2^{-1}\Omega_2J_2^{-1}),
\]

where $\Omega_2 := E_n\left[\bar E_T[U_{2it}U_{2is}']\right]$ and $U_{2it} = u_{2it} + E_T[u_{2it\alpha}]\psi_{2it} + E_T[u_{2it\lambda}\lambda_{it\alpha}]\varphi_{1it} + E_T[u_{2it\lambda}\lambda_{it\theta}]\phi_{1it}$ (here $\varphi_{1it}$ and $\phi_{1it}$ stand for the symbols printed as φ1it and ϕ1it). The rest of the terms are defined in Proposition 1.
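To connect Theorem 1 with the noncentered limit of the uncorrected estimator, the statement can be rearranged using Condition 1(i). The step below is a short worked illustration of that rearrangement rather than an additional result:

```latex
\[
\sqrt{nT}\,(\hat\theta_2 - \theta_{20})
  = \sqrt{nT}\,(\hat\theta_2 - \theta_{20} - B_2/T) + \sqrt{n/T}\,B_2
  \to_d N\!\left(\sqrt{\kappa}\,B_2,\; J_2^{-1}\Omega_2 J_2^{-1}\right),
\qquad \text{since } \sqrt{nT}/T = \sqrt{n/T} \to \sqrt{\kappa}.
\]
```

The asymptotic distribution of the uncorrected estimator is therefore centered at $\sqrt{\kappa}\,B_2$ rather than at zero, and subtracting an estimate of $B_2/T$ recenters it provided that estimate is consistent for $B_2$.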
We need some additional notation to describe the bias corrections in the second stage. Let
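\[
\tilde\lambda_{it} := \lambda_{it}(\tilde\theta_1,\hat\alpha_{1i}(\tilde\theta_1)), \qquad
\hat u_{2it}(\theta_2) := u_{2it}(\tilde\lambda_{it},\theta_2,\hat\alpha_{2i}(\theta_2)), \qquad
\hat v_{2it}(\theta_2) := v_{2it}(\tilde\lambda_{it},\theta_2,\hat\alpha_{2i}(\theta_2)),
\]

and let additional subscripts denote partial derivatives. Let

\[
\hat\psi_{2it}(\theta_2) := -\hat E_T[\hat v_{2it\alpha}(\theta_2)]^{-1}\left\{ \hat v_{2it}(\theta_2) + \hat E_T[\hat v_{2it\lambda}(\theta_2)\tilde\lambda_{it\alpha}]\,\hat\varphi_{1it}(\tilde\theta_1) + \hat E_T[\hat v_{2it\lambda}(\theta_2)\tilde\lambda_{it\theta}] \sum_{i=1}^{n}\tilde\phi_{1it}/\sqrt{n} \right\},
\]

where

\[
\begin{aligned}
\hat\varphi_{1it}(\theta_1) &= \hat\psi_{1it}(\theta_1) - \hat E_T[\hat v_{1it\alpha}(\tilde\theta_1)]^{-1}\hat E_T[\hat v_{1it\theta}(\tilde\theta_1)]\,\tilde\phi_{1it}, \\
\tilde\phi_{1it} &= -\hat E_n[\hat J_{1i}(\tilde\theta_1)]^{-1}\left\{ \hat u_{1it}(\tilde\theta_1) + \hat E_T[\hat u_{1it\alpha}(\tilde\theta_1)]\,\hat\psi_{1it}(\tilde\theta_1) \right\}, \\
\hat\sigma_{2i}^2(\theta_2) &= \hat{\bar E}_{T,m_2}[\hat\psi_{2it}(\theta_2)\hat\psi_{2is}(\theta_2)], \\
\hat\sigma_{12i}(\theta_2) &= \hat{\bar E}_{T,m_2}[\hat\psi_{2it}(\theta_2)\hat\psi_{1is}(\tilde\theta_1)].
\end{aligned}
\]

Here, $\hat\psi_{2it}(\theta_2)$ and $\hat\sigma_{2i}^2(\theta_2)$ are estimators of the influence function and asymptotic variance of $\hat\alpha_{2i}(\theta_2)$ as T grows, see Eq. (9), and $\hat\sigma_{12i}(\theta_2)$ is an estimator of the asymptotic covariance between $\hat\alpha_{1i}(\tilde\theta_1)$ and $\hat\alpha_{2i}(\theta_2)$ as T grows, see Eq. (10). Let

\[
\begin{aligned}
\hat\beta_{2i}(\theta_2) := -\hat E_T[\hat v_{2it\alpha}(\theta_2)]^{-1}\Big\{ & \hat{\bar E}_{T,m_2}[\hat v_{2it\alpha}(\theta_2)\hat\psi_{2is}(\theta_2)] + \hat E_T[\hat v_{2it\alpha\alpha}(\theta_2)]\,\hat\sigma_{2i}^2(\theta_2)/2 \\
& + \hat{\bar E}_{T,m_2}[\hat v_{2it\lambda}(\theta_2)\tilde\lambda_{it\alpha}\hat\psi_{1is}(\tilde\theta_1)] + \hat E_T\left[\hat v_{2it\lambda}(\theta_2)\left(\tilde\lambda_{it\alpha}\hat\beta_{1i}(\tilde\theta_1) + \tilde\lambda_{it\alpha\alpha}\hat\sigma_{1i}^2(\tilde\theta_1)/2\right)\right] \\
& + \hat E_T[\hat v_{2it\lambda\alpha}(\theta_2)\tilde\lambda_{it\alpha}]\,\hat\sigma_{12i}(\theta_2) + \hat E_T[\hat v_{2it\lambda\lambda}(\theta_2)\tilde\lambda_{it\alpha}^2]\,\hat\sigma_{1i}^2(\tilde\theta_1)/2 \Big\}, \\
\hat J_{2i}(\theta_2) :={}& \hat E_T[\hat u_{2it\theta}(\theta_2)] - \hat E_T[\hat u_{2it\alpha}(\theta_2)]\,\hat E_T[\hat v_{2it\theta}(\theta_2)]/\hat E_T[\hat v_{2it\alpha}(\theta_2)], \\
\hat b_{2i}(\theta_2) :={}& \hat{\bar E}_{T,m_2}[\hat u_{2it\alpha}(\theta_2)\hat\psi_{2is}(\theta_2)] + \hat E_T[\hat u_{2it\alpha}(\theta_2)]\,\hat\beta_{2i}(\theta_2) + \hat E_T[\hat u_{2it\alpha\alpha}(\theta_2)]\,\hat\sigma_{2i}^2(\theta_2)/2 \\
& + \hat{\bar E}_{T,m_2}[\hat u_{2it\lambda}(\theta_2)\tilde\lambda_{it\alpha}\hat\psi_{1is}(\tilde\theta_1)] + \hat E_T\left[\hat u_{2it\lambda}(\theta_2)\left(\tilde\lambda_{it\alpha}\hat\beta_{1i}(\tilde\theta_1) + \tilde\lambda_{it\alpha\alpha}\hat\sigma_{1i}^2(\tilde\theta_1)/2\right)\right] \\
& + \hat E_T[\hat u_{2it\lambda\alpha}(\theta_2)\tilde\lambda_{it\alpha}]\,\hat\sigma_{12i}(\theta_2) + \hat E_T[\hat u_{2it\lambda\lambda}(\theta_2)\tilde\lambda_{it\alpha}^2]\,\hat\sigma_{1i}^2(\tilde\theta_1)/2.
\end{aligned}
\]

Here, $\hat\beta_{2i}(\theta_2)$ is an estimator of the higher-order asymptotic bias of $\hat\alpha_{2i}(\theta_2)$ from a stochastic expansion as T grows, which accounts for the estimation error of the control variable, see Eq. (9); whereas $\hat J_{2i}(\theta_2)$ and $\hat b_{2i}(\theta_2)$ are estimators of the Jacobian and asymptotic bias of the estimating equation of $\theta_2$ for individual i as in Proposition 1. A bias-corrected version of the two-step FE estimator can be formed as

\[
\tilde\theta_2 = \hat\theta_2 - \hat B_2(\hat\theta_2)/T,
\]

where $\hat B_2(\theta_2) = -\hat E_n[\hat J_{2i}(\theta_2)]^{-1}\hat E_n[\hat b_{2i}(\theta_2)]$ is an estimator of the bias of $\hat\theta_2$, and $\hat\theta_2$ is the two-step FE estimator of $\theta_{20}$ that uses $\tilde\lambda_{it}$ as the control variable. As for one-step estimators, iterated bias corrections can also be formed by solving $\tilde\theta_2^{\infty} = \hat\theta_2 - \hat B_2(\tilde\theta_2^{\infty})/T$, and score-corrected estimators can be obtained by solving the modified estimating equation: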
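\[
\hat E_n\left[ \hat E_T[\hat u_{2it}(\tilde\theta_2^{sc})] - \hat b_{2i}(\tilde\theta_2^{sc})/T \right] = 0.
\]

Theorem 2 (Bias Correction). Assume that Conditions 1–4 hold. Then,

\[
\sqrt{nT}\,(\tilde\theta_2 - \theta_{20}) \to_d N(0, J_2^{-1}\Omega_2J_2^{-1}).
\]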
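The one-step and iterated corrections differ only in where the bias estimate is evaluated. The following is a minimal sketch for a scalar parameter, assuming the bias estimator is available as a callable `bias_fn` (a hypothetical stand-in for the estimated bias as a function of the parameter); the fixed-point iteration is one simple way of solving the iterated equation and is not claimed to be the authors' procedure.

```python
from typing import Callable


def bias_corrected(theta_hat: float, bias_fn: Callable[[float], float], T: int) -> float:
    """One-step correction: theta_hat - B(theta_hat) / T."""
    return theta_hat - bias_fn(theta_hat) / T


def iterated_bias_corrected(
    theta_hat: float,
    bias_fn: Callable[[float], float],
    T: int,
    tol: float = 1e-10,
    max_iter: int = 100,
) -> float:
    """Solve theta = theta_hat - B(theta) / T by fixed-point iteration."""
    theta = theta_hat
    for _ in range(max_iter):
        new_theta = theta_hat - bias_fn(theta) / T
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    raise RuntimeError("fixed-point iteration did not converge")
```

For example, with the purely illustrative bias function B(θ) = 0.5θ, T = 10 and an uncorrected estimate of 1.0, the one-step correction gives 0.95 and the iterated correction solves θ = 1 − 0.05θ, i.e. θ = 1/1.05.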
The results in Theorems 1 and 2 do not follow from the theory in Hahn and Kuersteiner (forthcoming) as it does not apply when some of the explanatory variables are fixed effects estimates. One of the contributions of this paper is to transform the model into a form that allows the application of the Hahn and Kuersteiner (forthcoming) results. More explicitly, by performing the bias correction in two stages, we can apply their results directly to the first stage estimators, and indirectly to the second stage estimators after using sequential expansions of the estimators with respect to the individual effects and the control variables. These sequential expansions are convenient to avoid dealing with high order expansions on multidimensional parameters.

Remark 1 (Variance Estimation). The sandwich form of the asymptotic covariance matrix indicates that, as in the cross sectional case (Rivers and Vuong, 1988), the two-step FE estimator is less efficient than a system estimator. A consistent estimator for this matrix can be obtained using cross sectional sample averages of $\hat J_{2i}(\tilde\theta_2)$ and $\hat\Omega_{2i}(\tilde\theta_2)$, where:

\[
\hat\Omega_{2i}(\theta_2) = \hat{\bar E}_{T,m_2}[\hat U_{2it}(\theta_2)\hat U_{2is}(\theta_2)'],
\]

with $\hat U_{2it}(\theta_2) = \hat u_{2it}(\theta_2) + \hat E_T[\hat u_{2it\alpha}(\theta_2)]\hat\psi_{2it}(\theta_2) + \hat E_T[\hat u_{2it\lambda}(\theta_2)\tilde\lambda_{it\alpha}]\tilde\varphi_{1it} + \hat E_T[\hat u_{2it\lambda}(\theta_2)\tilde\lambda_{it\theta}]\tilde\phi_{1it}$ and $\tilde\varphi_{1it} = \hat\varphi_{1it}(\tilde\theta_1)$ (here $\varphi$ and $\phi$ stand for the symbols printed as φ and ϕ).

Remark 2 (Sequential Moment Conditions). For dynamically complete models based on sequential moment restrictions that satisfy Assumption 1, we can replace $\hat{\bar E}_{T,m_\ell}$ by $\hat{\tilde E}_{T,m_\ell}$, where $\hat{\tilde E}_{T,m_\ell}[f_{it}\psi_{\ell is}] := \sum_{j=0}^{m_\ell}\sum_{t=j+1}^{T} f_{it}\hat\psi_{\ell i,t-j}/(T-j)$ and $\hat{\tilde E}_{T,m_\ell}[\psi_{\ell it}\psi_{\ell is}] := \hat E_T[\hat\psi_{\ell it}^2]$, $\ell \in \{1, 2\}$. If in addition all the explanatory variables are strictly exogenous, we can set $m_1 = m_2 = 0$.

5. Monte Carlo experiments

This section reports evidence on the finite sample behavior of two-step FE estimators in Examples 1 and 2. We examine the finite sample properties of uncorrected and bias-corrected estimators in terms of bias and inference accuracy of their asymptotic distributions. The results are based on 1000 replications, and the designs correspond to a static panel sample selection model with probit selection rule and strictly exogenous regressors, and a dynamic Tobit model with an endogenous explanatory variable.

5.1. Sample selection

The model design is a static version of Example 1 with

\[
d_{it} = 1\{x_{1it}'\theta_1 + \alpha_{1i} + \varepsilon_{1it} > 0\}, \qquad
y_{it} = d_{it}(x_{2it}'\pi_2 + \alpha_{2i} + u_{2it}), \qquad (i = 1,\ldots,n;\ t = 1,\ldots,T)
\]

where $x_{1it} = (z_{1it}, z_{2it})'$; $\theta_1 = (1, 1)'$; $x_{2it} = z_{1it}$; $\beta_2 = 1$; $z_{jit} = -0.5 + 0.5z_{ji,t-1} + \upsilon_{jit}$ for $j \in \{1, 2\}$, where $\upsilon_{1it}$ and $\upsilon_{2it}$ are independent N(0, 3/8) variables, and $z_{1i0}$ and $z_{2i0}$ are independent N(−1, 0.5) variables; $\alpha_{1i} = \alpha_{2i} = 2 + \sum_{t=1}^{T}(z_{1it} + 1)/\sqrt{T} + \xi_i/\sqrt{2}$, with $\xi_i$ an independent N(0, 1) variable; $\varepsilon_{1it}$ and $u_{2it}$ are jointly distributed as a standard bivariate normal with correlation 0.6, which corresponds to the coefficient of the control variable in the second stage, $\zeta_2 = 0.6$. All data are generated i.i.d. across individuals. This design implies that $\Pr\{d_{it} = 1\} \approx 0.6$, so that approximately 60% of the sample is used to estimate $\theta_2 = (\pi_2', \zeta_2)'$ in the second step. We generate panel data sets with n = 100 individuals and three different numbers of time periods T: 6, 8 and 12.

Throughout the tables, SD is the standard deviation of the estimator; p; # denotes a rejection frequency with # specifying the nominal value; SE/SD is the ratio of the average standard error to standard deviation; and MAE denotes median absolute error.5 BC1 and BC2 correspond to the two-step version of the analytical bias-corrected estimators of Hahn and Newey (2004) based on the maximum likelihood setting and general estimating equations, respectively. BC3 is the two-step version of the bias-corrected estimator proposed in Fernández-Val (2009), which replaces observed quantities by expected quantities in the expression of the bias. Due to the binary nature of the dependent variable in the selection equation, the observations which have the same value for the dependent variable for each period are automatically removed when estimating the first step. In the second step we retain the observations for which the dependent variable in the selection equation is always one, and we assign a value of zero for their correction terms, that is, their ML estimates.

Table 1 gives the results for the estimators of the probit parameters of the first stage, θ1.
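The regressor and error processes of this design can be sketched as follows. This is a minimal illustration under stated assumptions: the second argument of N(·, ·) is read as a variance (which makes the initial distribution N(−1, 0.5) coincide with the stationary distribution of the autoregression, since −0.5/(1 − 0.5) = −1 and (3/8)/(1 − 0.25) = 0.5), the individual effects are passed in by the caller rather than generated, and the function names are hypothetical.

```python
import math
import random
from typing import List, Tuple


def simulate_regressor(T: int, rng: random.Random) -> List[float]:
    """Path z_1..z_T of z_t = -0.5 + 0.5 z_{t-1} + v_t with z_0 ~ N(-1, 0.5).

    Assumption: 0.5 and 3/8 are variances, so standard deviations are passed
    to rng.gauss.
    """
    z_prev = rng.gauss(-1.0, math.sqrt(0.5))
    path = []
    for _ in range(T):
        z_prev = -0.5 + 0.5 * z_prev + rng.gauss(0.0, math.sqrt(3.0 / 8.0))
        path.append(z_prev)
    return path


def simulate_individual(
    T: int,
    alpha1: float,
    alpha2: float,
    rng: random.Random,
    theta1: Tuple[float, float] = (1.0, 1.0),
    pi2: float = 1.0,
    rho: float = 0.6,
) -> List[Tuple[int, float, float, float]]:
    """Rows (d, y, z1, z2) for one individual of the selection design."""
    z1 = simulate_regressor(T, rng)
    z2 = simulate_regressor(T, rng)
    rows = []
    for t in range(T):
        e1 = rng.gauss(0.0, 1.0)
        # u2 is standard normal with correlation rho with e1
        u2 = rho * e1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        d = 1 if theta1[0] * z1[t] + theta1[1] * z2[t] + alpha1 + e1 > 0 else 0
        y = d * (pi2 * z1[t] + alpha2 + u2)  # outcome observed only if d = 1
        rows.append((d, y, z1[t], z2[t]))
    return rows
```

A panel is obtained by calling `simulate_individual` once per individual with that individual's effects, using a seeded `random.Random` instance for reproducibility.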
The first stage results are qualitatively similar to previous numerical studies (Hahn and Newey, 2004; Fernández-Val, 2009).6 The uncorrected FE estimator, MLE in the table, is severely biased, and the large-T bias corrections remove most of the incidental parameters bias even for panels with T = 6. This is especially true with the BC3 refinement. Table 2 presents the finite sample properties for the FE estimators of the coefficient of the explanatory variable in the
5 We use median absolute error instead of root mean squared error as an overall measure of goodness of fit because it is less sensitive to outliers. 6 These studies, however, use a different design and include only one regressor. Our design has two regressors in order to add an exclusion restriction to the selection equation.
Table 1
Heckman selection model: probit first stage, n = 100.

Estimator   Mean   Median   SD      p; 0.05   p; 0.10   SE/SD   MAE     Nobs

A. T = 6
A.1. Coefficient of z1 (true = 1)
H-MLE       1.34   1.32     0.255   0.40      0.51      0.785   0.322   368
H-BC1       1.09   1.08     0.187   0.06      0.13      0.969   0.132   368
H-BC2       1.17   1.15     0.222   0.14      0.23      0.845   0.178   368
H-BC3       1.01   1.01     0.166   0.04      0.09      1.035   0.113   368
A.2. Coefficient of z2 (true = 1)
H-MLE       1.34   1.32     0.255   0.36      0.48      0.787   0.315   368
H-BC1       1.08   1.08     0.186   0.07      0.12      0.978   0.125   368
H-BC2       1.16   1.14     0.221   0.15      0.21      0.849   0.165   368
H-BC3       1.01   1.00     0.167   0.04      0.09      1.036   0.105   368

B. T = 8
B.1. Coefficient of z1 (true = 1)
H-MLE       1.23   1.22     0.178   0.34      0.44      0.856   0.221   550
H-BC1       1.07   1.07     0.148   0.07      0.13      0.970   0.109   550
H-BC2       1.09   1.08     0.155   0.08      0.17      0.931   0.113   550
H-BC3       1.01   1.01     0.135   0.04      0.09      1.016   0.095   550
B.2. Coefficient of z2 (true = 1)
H-MLE       1.23   1.21     0.176   0.32      0.42      0.868   0.213   550
H-BC1       1.07   1.06     0.149   0.08      0.14      0.965   0.106   550
H-BC2       1.08   1.07     0.153   0.09      0.15      0.946   0.107   550
H-BC3       1.01   1.00     0.134   0.04      0.09      1.026   0.089   550

C. T = 12
C.1. Coefficient of z1 (true = 1)
H-MLE       1.15   1.15     0.126   0.28      0.40      0.874   0.151   935
H-BC1       1.05   1.05     0.113   0.08      0.14      0.935   0.083   935
H-BC2       1.05   1.04     0.113   0.08      0.12      0.938   0.080   935
H-BC3       1.02   1.02     0.107   0.06      0.11      0.968   0.071   935
C.2. Coefficient of z2 (true = 1)
H-MLE       1.15   1.14     0.121   0.28      0.37      0.912   0.142   935
H-BC1       1.06   1.05     0.109   0.08      0.15      0.981   0.076   935
H-BC2       1.05   1.04     0.108   0.07      0.13      0.983   0.074   935
H-BC3       1.02   1.01     0.102   0.04      0.10      1.012   0.069   935
Notes: 1000 replications. MLE denotes an uncorrected probit FE estimator; BC1 denotes a Hahn and Newey (2004) bias-corrected estimator based on Bartlett equalities; BC2 denotes a Hahn and Newey (2004) bias-corrected estimator based on general estimating equations; BC3 denotes a Fernández-Val (2009) bias corrected estimator. Table 2 Heckman selection model—OLS second stage, coefficient of exogenous regressor (true = 1), n = 100. Estimator
Mean
Median
SD
p; 0.05
p; 0.10
SE/SD
MAE
Nobs
1.15 1.01 1.07 1.04 1.05 1.04
1.15 1.01 1.07 1.04 1.05 1.05
0.104 0.122 0.115 0.125 0.121 0.122
0.39 0.08 0.11 0.09 0.09 0.09
0.49 0.14 0.18 0.15 0.16 0.16
0.876 0.893 0.975 0.949 0.949 0.954
0.147 0.080 0.090 0.091 0.088 0.088
374 374 374 374 374 374
1.14 1.00 1.04 1.02 1.02 1.02
1.14 1.00 1.05 1.02 1.03 1.03
0.083 0.097 0.093 0.100 0.098 0.099
0.46 0.07 0.08 0.07 0.07 0.07
0.59 0.13 0.14 0.12 0.13 0.13
0.932 0.950 1.012 0.977 0.981 0.982
0.144 0.061 0.072 0.067 0.067 0.067
485 485 485 485 485 485
1.15 1.00 1.03 1.01 1.01 1.01
1.15 1.00 1.03 1.01 1.01 1.01
0.066 0.078 0.078 0.082 0.081 0.081
0.66 0.05 0.08 0.06 0.06 0.06
0.76 0.11 0.15 0.12 0.12 0.12
0.946 0.953 0.976 0.945 0.947 0.947
0.150 0.056 0.057 0.057 0.057 0.057
700 700 700 700 700 700
A. T = 6 OLS H-1 H-MLE H-BC1 H-BC2 H-BC3 B. T = 8 OLS H-1 H-MLE H-BC1 H-BC2 H-BC3 C. T = 12 OLS H-1 H-MLE H-BC1 H-BC2 H-BC3
Notes:1000 replications. H-1 denotes unfeasible estimator that uses (unobserved) true control variable; MLE is the feasible version of H-1 that uses an estimated control variable; BC1 denotes a Hahn and Newey (2004) bias-corrected estimator based on Bartlett equalities; BC2 denotes a Hahn and Newey (2004) bias-corrected estimator based on general estimating equations; BC3 denotes a Fernández-Val (2009) bias-corrected estimator. Standard errors account for heteroskedasticity and generated regressors, when relevant.
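The columns reported in Tables 1 and 2 can be computed from the Monte Carlo draws along the following lines. This is only a sketch under stated assumptions: the function name `mc_summary` is hypothetical, estimates are reported as ratios to the true value, the "p; 0.05" and "p; 0.10" columns are read as rejection frequencies of two-sided t-tests of the true value at the 5% and 10% levels, SE/SD is read as the average standard error divided by the Monte Carlo standard deviation, and MAE is the median absolute error mentioned in footnote 5.

```python
import numpy as np

def mc_summary(estimates, std_errors, truth=1.0, z05=1.959964, z10=1.644854):
    """Summarize Monte Carlo draws of one estimator (hypothetical helper).

    estimates, std_errors: one entry per replication.
    All location and dispersion measures are expressed relative to `truth`.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    ratio = est / truth                      # estimator relative to the truth
    tstat = np.abs(est - truth) / se         # t-statistic for H0: parameter = truth
    sd = ratio.std(ddof=1)                   # Monte Carlo standard deviation
    return {
        "Mean": ratio.mean(),
        "Median": np.median(ratio),
        "SD": sd,
        "p; 0.05": np.mean(tstat > z05),     # assumed: empirical size at 5%
        "p; 0.10": np.mean(tstat > z10),     # assumed: empirical size at 10%
        "SE/SD": np.mean(se / abs(truth)) / sd,
        "MAE": np.median(np.abs(ratio - 1.0)),  # median absolute error
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    draws = 1.0 + 0.1 * rng.standard_normal(1000)   # artificial draws, not the paper's
    print(mc_summary(draws, np.full(1000, 0.1)))
```

An SE/SD ratio below one indicates that the standard errors understate the dispersion of the estimator, which is the reading given in the text for the oversized tests.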
Table 3
Heckman selection model: OLS second stage, coefficient of control variable (true = 1), n = 100.

Estimator   Mean   Median   SD      p; 0.05   p; 0.10   SE/SD   MAE     Nobs

A. T = 6
H-1         0.99   0.98     0.460   0.10      0.17      0.839   0.317   374
H-MLE       0.70   0.71     0.431   0.21      0.29      0.823   0.368   374
H-BC1       0.95   0.94     0.593   0.17      0.24      0.705   0.382   374
H-BC2       0.81   0.81     0.497   0.16      0.25      0.794   0.346   374
H-BC3       0.90   0.90     0.548   0.13      0.21      0.796   0.372   374

B. T = 8
H-1         1.01   0.99     0.371   0.10      0.16      0.854   0.239   485
H-MLE       0.79   0.79     0.354   0.16      0.23      0.847   0.282   485
H-BC1       0.96   0.96     0.433   0.12      0.19      0.775   0.286   485
H-BC2       0.89   0.89     0.400   0.12      0.17      0.828   0.280   485
H-BC3       0.95   0.95     0.424   0.11      0.17      0.825   0.283   485

C. T = 12
H-1         0.99   1.00     0.278   0.08      0.14      0.902   0.183   700
H-MLE       0.85   0.86     0.269   0.15      0.21      0.894   0.198   700
H-BC1       0.96   0.97     0.305   0.11      0.17      0.851   0.202   700
H-BC2       0.94   0.95     0.297   0.11      0.17      0.876   0.196   700
H-BC3       0.96   0.97     0.305   0.10      0.16      0.874   0.202   700

Notes: 1000 replications. H-1 denotes an unfeasible estimator that uses (unobserved) true control variable; MLE is the feasible version of H-1 that uses an estimated control variable; BC1 denotes a Hahn and Newey (2004) bias-corrected estimator based on Bartlett equalities; BC2 denotes a Hahn and Newey (2004) bias-corrected estimator based on general estimating equations; BC3 denotes a Fernández-Val (2009) bias-corrected estimator. Standard errors account for heteroskedasticity and generated regressors, when relevant.
second stage, π2. OLS denotes a least squares estimator in the observed sample that ignores sample selection and is therefore inconsistent. H-1 denotes the unfeasible OLS estimator that controls for selection by using the true (unobserved) inverse Mills ratio, whereas H-MLE is the feasible version of H-1 that uses an estimate of the inverse Mills ratio (evaluated at uncorrected estimates of the probit parameters). H-BC1, H-BC2, and H-BC3, in addition to using estimates of the control variable evaluated at bias-corrected estimates of the probit parameters, perform another round of bias correction in the second stage. Uncorrected FE estimators have small biases, about 7%, 4% and 3% for 6, 8 and 12 time periods, respectively, which are reduced by the bias corrections. Rejection frequencies are higher than their nominal levels due to underestimation of dispersion.7 The corrected estimators have similar MAE to the uncorrected estimators because the corrections in this case increase dispersion. Table 3 reports the ratio of estimators to the truth for the coefficient of the control variable in the second stage, ζ2. This is an important parameter, as a significance test for this coefficient has been proposed to assess whether there is endogenous sample selection. The results here show important biases towards zero in uncorrected FE estimators. The bias corrections remove most of the bias and bring the rejection frequencies closer to their nominal values, although the tests are still oversized due to the underestimation of the dispersion. Some intuition for these numerical results can be obtained through a simple example. Specifically, suppose that αi = α for all i, that is, the individual effects are the same for all individuals. Fernández-Val (2009) finds that in this case the biases for all the parameters in the first stage probit are scalar multiples of the true value of the parameters, and the limit probit index is also proportional to the true index.
Since the inverse Mills ratio is either close to zero or close to linear in the selected sample, the estimated control variable is approximately proportional to the true inverse Mills ratio. This is consistent with the small bias found for π2 and the significant bias for ζ2.
7 The expressions used to compute the standard errors are robust to heteroskedasticity and account for estimated regressors using the method in Lee et al. (1980). In results not reported, we find that the finite sample adjustments of MacKinnon and White (1985) to the heteroskedasticity corrections give rise to conservative standard errors.
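The claim that the inverse Mills ratio is either close to zero or close to linear can be checked numerically. The sketch below assumes the usual selected-sample form λ(x) = φ(x)/Φ(x), where x is the probit index; the helper names are hypothetical and only the Python standard library is used. For large positive x the ratio is near zero, and for large negative x it is close to −x.

```python
import math

def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def inverse_mills(x):
    """Inverse Mills ratio phi(x) / Phi(x) for a selected observation with index x."""
    return norm_pdf(x) / norm_cdf(x)

if __name__ == "__main__":
    # Compare the ratio with its two approximations: 0 (right tail) and -x (left tail).
    for x in (-5.0, -3.0, 0.0, 3.0):
        print(f"index={x:+.1f}  inverse Mills ratio={inverse_mills(x):.4f}  -index={-x:+.1f}")
```

In both regimes, rescaling the index by a constant rescales the ratio by roughly the same constant, which is the sense in which a proportional probit index yields an approximately proportional control variable.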
5.2. Endogenous Tobit
The model design is a dynamic version of Example 2 with
\[ d_{it} = x_{1it}'\theta_1 + \alpha_{1i} + \varepsilon_{1it}, \]
\[ y_{i0} = \max\{d_{i0}\delta_2 + \alpha_{2i} + \varepsilon_{2i0},\, 0\}, \]
\[ y_{it} = \max\{d_{it}\delta_2 + x_{2it}'\pi_2 + \alpha_{2i} + u_{2it},\, 0\}, \]
(i = 1, . . . , n; t = 1, . . . , T − 1), where x1it = zit; θ1 = 1; δ2 = 1; x2it = yi,t−1; π2 = 0.5; zit is a N(0, 1) variable; $\alpha_{1i} = \alpha_{2i} = \sum_{t=0}^{T-1} z_{it}/\sqrt{T}$; and ε1it and u2it are jointly distributed as a bivariate normal with correlation 0.6, which corresponds to the coefficient of the control variable in the primary equation ζ2 = 0.6, and common variances $\sigma^2 = 1 + (T-1)\zeta_2^2/T$.8 All data are generated i.i.d. across individuals. We generate panel data sets with n = 100 individuals and three different total time periods T: 6, 8 and 12. For the trimming parameter that determines the number of lags used in the estimation of biases and variances, we choose bandwidth parameters m1 = m2 = 1 following Hahn and Kuersteiner (forthcoming). Table 4 presents the finite sample results for the coefficient of the endogenous continuous explanatory variable dit. TOBIT denotes the FE Tobit estimator that does not account for the endogeneity of dit and is inconsistent. CMLE-1 denotes the infeasible estimator that uses the true (unobserved) control variable in the second stage, whereas CMLE is the feasible version of CMLE-1 that replaces the true λ1it's by the OLS FE residuals of the reduced form for dit. The bias corrections BC1, BC2, and BC3 are the same as in the previous example. Overall, all the FE estimators that control for endogeneity have small finite sample biases, although the inference procedures are oversized due to underestimation of the dispersion.9 The results here agree with the Honoré (1993) and Greene (2004b) numerical findings of small biases in Tobit FE estimators of the slope parameters. Table 5 reports the results for the coefficient of the lagged dependent variable. Here, as in the uncensored linear case,
8 We choose the value of σ2 such that the variance of the error term in the estimating equation that includes the estimated control variable is approximately equal to 1.
9 The expression for the standard errors accounts for the estimation of the regressors.
Table 4
Tobit CMLE, coefficient of endogenous regressor (true = 1), n = 100.

Estimator   Mean   Median   SD      p; 0.05   p; 0.10   SE/SD   MAE     Nobs

T = 6
TOBIT       1.31   1.31     0.058   1.00      1.00      0.863   0.314   415
CMLE-1      0.98   0.98     0.074   0.11      0.17      0.868   0.050   415
CMLE        0.98   0.99     0.081   0.11      0.19      0.817   0.056   415
BC1         1.02   1.02     0.094   0.18      0.24      0.727   0.063   415
BC2         1.00   1.00     0.084   0.13      0.19      0.798   0.056   415
BC3         0.98   0.98     0.085   0.14      0.20      0.783   0.058   415

T = 8
TOBIT       1.33   1.33     0.048   1.00      1.00      0.882   0.328   614
CMLE-1      0.99   0.99     0.060   0.08      0.14      0.909   0.041   614
CMLE        0.99   0.99     0.066   0.10      0.17      0.841   0.043   614
BC1         1.01   1.01     0.072   0.13      0.19      0.796   0.047   614
BC2         1.00   1.00     0.068   0.11      0.16      0.832   0.044   614
BC3         0.99   0.99     0.067   0.11      0.17      0.833   0.044   614

T = 12
TOBIT       1.34   1.34     0.035   1.00      1.00      0.953   0.336   1016
CMLE-1      1.00   0.99     0.046   0.06      0.12      0.945   0.032   1016
CMLE        1.00   0.99     0.050   0.09      0.14      0.894   0.034   1016
BC1         1.00   1.00     0.052   0.10      0.16      0.860   0.035   1016
BC2         1.00   1.00     0.051   0.09      0.14      0.888   0.034   1016
BC3         1.00   1.00     0.050   0.09      0.15      0.886   0.033   1016

Notes: 1000 replications. TOBIT denotes a Tobit FE maximum likelihood estimator that does not account for endogeneity; CMLE-1 denotes an unfeasible estimator that uses (unobserved) true control variable; CMLE is the feasible version of CMLE-1 that uses an estimated control variable; BC1 denotes a Hahn and Kuersteiner (forthcoming) bias-corrected estimator based on Bartlett equalities; BC2 denotes a Hahn and Kuersteiner (forthcoming) bias-corrected estimator based on general estimating equations; BC3 denotes a Fernández-Val (2009) bias-corrected estimator. Standard errors account for generated regressors, when relevant.
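The data generating process behind the endogenous Tobit experiments of Section 5.2 can be sketched as follows. This is an illustrative simulation only, not the authors' code: the function name and argument names are hypothetical, the initial-period error ε2i0 is assumed to be drawn from the same bivariate normal as u2it, and periods are indexed from 0 to T − 1 so that period 0 supplies the initial condition.

```python
import numpy as np

def simulate_endogenous_tobit(n=100, T=8, theta1=1.0, delta2=1.0,
                              pi2=0.5, zeta2=0.6, seed=0):
    """Draw one panel from the dynamic endogenous Tobit design (sketch).

    Returns (d, y, z), each of shape (n, T); column 0 is the initial period.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, T))                  # z_it ~ N(0, 1)
    alpha = z.sum(axis=1) / np.sqrt(T)               # alpha_1i = alpha_2i
    sigma2 = 1.0 + (T - 1) * zeta2 ** 2 / T          # common error variance
    cov = sigma2 * np.array([[1.0, zeta2], [zeta2, 1.0]])   # correlation 0.6
    errs = rng.multivariate_normal(np.zeros(2), cov, size=(n, T))
    eps1, u2 = errs[..., 0], errs[..., 1]

    d = theta1 * z + alpha[:, None] + eps1           # endogenous regressor
    y = np.empty((n, T))
    # Initial condition: no lagged dependent variable in period 0.
    y[:, 0] = np.maximum(delta2 * d[:, 0] + alpha + u2[:, 0], 0.0)
    for t in range(1, T):
        latent = delta2 * d[:, t] + pi2 * y[:, t - 1] + alpha + u2[:, t]
        y[:, t] = np.maximum(latent, 0.0)            # censoring at zero
    return d, y, z

if __name__ == "__main__":
    d, y, z = simulate_endogenous_tobit(n=100, T=6, seed=1)
    print("share of uncensored observations:", (y > 0).mean())
```

Counting the uncensored observations in a simulated panel gives a rough analogue of the Nobs column, although the exact counts depend on the draws.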
Table 5
Tobit CMLE, coefficient of lagged dependent variable (true = 1), n = 100.

Estimator   Mean   Median   SD      p; 0.05   p; 0.10   SE/SD   MAE     Nobs

T = 6
TOBIT       0.81   0.81     0.077   0.77      0.84      0.874   0.189   415
CMLE-1      0.84   0.84     0.073   0.71      0.78      0.859   0.162   415
CMLE        0.84   0.84     0.073   0.71      0.78      0.859   0.162   415
BC1         0.98   0.99     0.084   0.14      0.22      0.773   0.057   415
BC2         0.92   0.92     0.074   0.31      0.41      0.854   0.086   415
BC3         0.95   0.94     0.077   0.21      0.30      0.817   0.065   415

T = 8
TOBIT       0.86   0.86     0.063   0.71      0.79      0.875   0.143   614
CMLE-1      0.88   0.88     0.058   0.64      0.74      0.889   0.121   614
CMLE        0.88   0.88     0.058   0.64      0.75      0.890   0.122   614
BC1         0.99   0.99     0.062   0.11      0.17      0.846   0.044   614
BC2         0.94   0.94     0.057   0.24      0.34      0.904   0.062   614
BC3         0.96   0.96     0.058   0.15      0.23      0.886   0.049   614

T = 12
TOBIT       0.91   0.91     0.047   0.58      0.68      0.931   0.093   1016
CMLE-1      0.92   0.92     0.044   0.50      0.63      0.928   0.079   1016
CMLE        0.92   0.92     0.044   0.50      0.63      0.928   0.079   1016
BC1         0.99   0.99     0.044   0.07      0.14      0.924   0.032   1016
BC2         0.96   0.96     0.043   0.17      0.25      0.938   0.040   1016
BC3         0.97   0.97     0.043   0.12      0.19      0.933   0.035   1016

Notes: 1000 replications. TOBIT denotes a Tobit FE maximum likelihood estimator that does not account for endogeneity; CMLE-1 denotes an unfeasible estimator that uses (unobserved) true control variable; CMLE is the feasible version of CMLE-1 that uses an estimated control variable; BC1 denotes a Hahn and Kuersteiner (forthcoming) bias-corrected estimator based on Bartlett equalities; BC2 denotes a Hahn and Kuersteiner (forthcoming) bias-corrected estimator based on general estimating equations; BC3 denotes a Fernández-Val (2009) bias-corrected estimator. Standard errors account for generated regressors, when relevant.
uncorrected FE estimators are biased downward even when we use the true control variable. The bias corrections remove an important part of this bias, and have rejection probabilities closer to their nominal value, although the test is still oversized. Table 6 shows the results for the coefficient of the control variable ζ2. This coefficient captures the correlation between the error terms of the control equation and primary equation. Here we find small biases for both uncorrected and bias-corrected estimators, with the endogeneity tests having a bigger size than their nominal level due to the underestimation of the dispersion of the parameters.

6. Empirical illustration: estimating the impact of union status on wages
To illustrate our approach we estimate a two-equation model which describes the manner in which union status affects wages where the union status decision, which is endogenous to wages, is
Table 6
Tobit CMLE, coefficient of control variable (true = 1), n = 100.

Estimator   Mean   Median   SD      p; 0.05   p; 0.10   SE/SD   MAE     Nobs

T = 6
CMLE-1      0.98   0.98     0.158   0.09      0.16      0.882   0.106   415
CMLE        0.98   0.99     0.168   0.10      0.16      0.846   0.113   415
BC1         1.02   1.02     0.193   0.14      0.21      0.760   0.127   415
BC2         1.00   1.00     0.174   0.10      0.17      0.829   0.115   415
BC3         1.00   1.00     0.173   0.10      0.18      0.833   0.117   415

T = 8
CMLE-1      1.00   1.00     0.131   0.09      0.14      0.892   0.086   614
CMLE        1.00   1.00     0.140   0.11      0.15      0.849   0.092   614
BC1         1.04   1.03     0.152   0.12      0.20      0.800   0.099   614
BC2         1.01   1.01     0.144   0.11      0.17      0.833   0.094   614
BC3         1.01   1.01     0.142   0.09      0.15      0.845   0.093   614

T = 12
CMLE-1      1.00   1.00     0.102   0.07      0.12      0.912   0.068   1016
CMLE        1.00   1.00     0.107   0.08      0.14      0.884   0.072   1016
BC1         1.03   1.03     0.112   0.10      0.18      0.854   0.080   1016
BC2         1.00   1.01     0.108   0.08      0.14      0.881   0.074   1016
BC3         1.00   1.00     0.108   0.08      0.14      0.884   0.074   1016

Notes: 1000 replications. CMLE-1 denotes an unfeasible estimator that uses (unobserved) true control variable; CMLE is the feasible version of CMLE-1 that uses an estimated control variable; BC1 denotes a Hahn and Kuersteiner (forthcoming) bias-corrected estimator based on Bartlett equalities; BC2 denotes a Hahn and Kuersteiner (forthcoming) bias-corrected estimator based on general estimating equations; BC3 denotes a Fernández-Val (2009) bias-corrected estimator. Standard errors account for generated regressors, when relevant.
treated as a dynamic binary choice outcome. The model is similar to that considered in Vella and Verbeek (1998), hereafter VV, noting that there the individual components are treated as random effects. In particular, we estimate the following equations
\[ \mathit{union}_{it} = 1\{\delta_1 \times \mathit{union}_{i,t-1} + x_{it}'\pi_1 + \alpha_{1i} + \varepsilon_{1it} > 0\}, \]
\[ \mathit{wage}_{it} = \delta_2 \times \mathit{union}_{it} + x_{it}'\pi_2 + \alpha_{2i} + u_{2it}, \]
where union is a binary variable denoting that the individual is a member of a union and wage is the log of the individual's hourly wage rate. The vector xit includes completed years of schooling, log of potential experience (age − schooling − 6), and married, rural area, health disability, region, industry, time, and occupation dummies. The model is interesting in the context of the methods presented here as the binary union decision equation has a FE and a lagged dependent variable. Also, the wage equation has a binary endogenous regressor where the endogeneity is the result of potentially time varying heterogeneity. This specification is similar to VV. The sample, selected from the National Longitudinal Survey (Youth Sample), consists of full-time young working males followed over the period 1980–1988. We exclude individuals who fail to provide sufficient information for each year, are in the active forces in any year, have negative potential experience in at least one year, whose schooling decreases in any year or increases by more than two years between two interviews, or who report too high (more than $500 per hour) or too low (less than $1 per hour) wages. The final sample includes 545 men. The first period is used as the initial condition for the lagged union variable.10 Table 7 reports descriptive statistics for the sample used. Union membership is based on a question reflecting whether or not the individual had his wage set in a collective bargaining agreement. Roughly 26% of the sample are union members. Union and non-union workers have similar observed characteristics, though union workers are slightly less educated, more likely to be married, more likely to live in the northern central region, and less likely to live in the South. Across industries, there are relatively more union workers in transportation, manufacturing and public administration, and fewer in trade and business. Union membership reduces wage dispersion and has high persistence. Note that all variables, except for the Black and Hispanic dummies, display time variation over the period considered. The unconditional union premium is around 23%. Table 8 presents the estimates for the dynamic probit model of union membership. The left panel of this table excludes the occupational dummies while the right panel includes them. We make this distinction to remain comparable to VV. In each panel, the first column reports pooled probit estimates that do not account for individual time invariant heterogeneity, the second column shows the unadjusted FE probit estimates, while the third column presents the corresponding bias-corrected estimates. The fourth and fifth columns give the average marginal effects for each of the FE models.11 We include time dummies in the specification to remain comparable to VV, even though they are not covered by our regularity conditions. In results not reported, however, we find that excluding the time dummies does not have any significant effect on the estimates. First, note that for the parameter of primary interest in this table, the coefficient of the lagged dependent variable, the pooled probit estimator that does not account for heterogeneity leads to a significant overstatement of the importance of the state dependence. This result can be seen by comparing ratios of coefficients, for example, with respect to the coefficient of log experience, since the pooled and FE estimators use different normalizations. Comparing with the VV estimates of 0.61 and 0.63 for the left and right panels, respectively, our unadjusted FE probit estimates are smaller with values of 0.35 and 0.32. More interesting, however, are the adjusted results. The bias adjusted estimates of the lagged union variable coefficients are approximately 0.73 for the specification excluding the occupational dummies, and 0.70 for that with the occupation dummies included. These estimates are more similar to those reported by VV. The effects of the bias corrections in the FE estimators are easier to interpret by looking at the estimates of the average marginal effects. While the unadjusted estimates already reveal a substantial degree of state dependence with average
10 Although we do not use data identical to VV, the time period and the summary statistics of the data sets are very similar.
11 The bias corrected estimates reported correspond to the BC3 method. The other methods give similar results.
Table 7
Descriptive statistics, 1981–1988.

                                                          Full sample                  Union            Non-union
Variable  Definition                              Mean   Std. dev.  Within (%)   Mean   Std. dev.   Mean   Std. dev.

SCHOOL    Years of schooling                      12.33  1.69       5            12.21  1.15        12.37  1.84
LEXPER    Log(1 + EXPER)                          1.78   0.60       47           1.90   0.49        1.74   0.63
UNION     Wage set by collective bargaining       0.26   0.44       41           1.00   0.00        0.00   0.00
UNION1    Lag of UNION                            0.27   0.44       43           0.73   0.44        0.10   0.30
MARRIED   Married                                 0.45   0.50       41           0.52   0.50        0.42   0.49
BLACK     Black                                   0.11   0.32       0            0.15   0.36        0.10   0.30
HISP      Hispanic                                0.15   0.36       0            0.14   0.35        0.15   0.36
HEALTH    Has health disability                   0.02   0.15       76           0.02   0.15        0.02   0.15
RURAL     Lives in rural area                     0.20   0.40       22           0.19   0.39        0.20   0.40
NE        Lives in North East                     0.21   0.40       3            0.20   0.40        0.21   0.41
NC        Lives in Northern Central               0.29   0.46       3            0.36   0.48        0.27   0.45
S         Lives in South                          0.31   0.46       4            0.26   0.44        0.32   0.47
W         Lives in West                           0.19   0.39       3            0.19   0.39        0.19   0.40
WAGE      Log of real hourly wage                 1.74   0.46       41           1.91   0.41        1.68   0.46

Industry dummies
AG        Agricultural                            0.03   0.17       49           0.01   0.11        0.03   0.18
MIN       Mining                                  0.02   0.13       44           0.03   0.16        0.01   0.12
CON       Construction                            0.10   0.30       41           0.09   0.29        0.11   0.31
TRAD      Trade                                   0.25   0.43       47           0.17   0.38        0.28   0.45
TRA       Transportation                          0.07   0.26       45           0.13   0.34        0.05   0.22
FIN       Finance                                 0.03   0.16       38           0.01   0.08        0.03   0.18
BUS       Business and repair service             0.08   0.27       59           0.04   0.20        0.09   0.28
PER       Personal service                        0.01   0.12       77           0.01   0.09        0.02   0.13
ENT       Entertainment                           0.01   0.12       61           0.00   0.05        0.02   0.14
MAN       Manufacturing                           0.32   0.47       38           0.39   0.49        0.29   0.45
PRO       Professional and related service        0.05   0.21       56           0.04   0.20        0.05   0.21
PUB       Public administration                   0.04   0.18       49           0.08   0.26        0.02   0.14

Occupational dummies
OCC1      Professional, technical and kindred     0.07   0.25       53           0.03   0.17        0.08   0.27
OCC2      Managers, officials and proprietors     0.09   0.28       61           0.03   0.17        0.11   0.31
OCC3      Sales workers                           0.01   0.07       63           0.00   0.03        0.01   0.08
OCC4      Clerical and kindred                    0.10   0.30       63           0.10   0.30        0.10   0.30
OCC5      Craftsmen, foremen and kindred          0.23   0.42       50           0.23   0.42        0.24   0.42
OCC6      Operatives and kindred                  0.23   0.42       54           0.32   0.47        0.21   0.40
OCC7      Laborers and farmers                    0.11   0.31       69           0.16   0.36        0.09   0.29
OCC8      Farm laborers and foreman               0.01   0.11       55           0.00   0.04        0.02   0.13
OCC9      Service workers                         0.11   0.31       50           0.13   0.33        0.10   0.30

Number of observations (545 × 8)                  4360                           1141               3219

Source: NLSY men.

Table 8
Fixed effects probit estimates of union membership (1981–1988).

Without occupation
               Index coefficient                                 Average marginal effect
               POOLED [1]      FE [2]          BC [3]            FE [4]          BC [5]
UNION1         1.80 (0.07)     0.35 (0.07)     0.73 (0.07)       0.05 (0.01)     0.10 (0.01)
SCHOOL         −0.01 (0.02)    0.09 (0.11)     0.07 (0.11)       0.01 (0.01)     0.01 (0.01)
LEXPER         0.16 (0.07)     0.95 (0.25)     0.64 (0.23)       0.12 (0.03)     0.09 (0.03)
RURAL          −0.09 (0.07)    0.15 (0.20)     0.13 (0.18)       0.02 (0.03)     0.02 (0.02)
MARRIED        0.14 (0.06)     0.26 (0.11)     0.21 (0.11)       0.03 (0.01)     0.03 (0.01)
HEALTH         −0.18 (0.16)    −0.31 (0.28)    −0.28 (0.30)      −0.04 (0.04)    −0.04 (0.04)
BLACK          0.24 (0.08)
HISP           0.02 (0.09)
Observations   4360            2064            2064              4360            4360

With occupation
               Index coefficient                                 Average marginal effect
               POOLED [6]      FE [7]          BC [8]            FE [9]          BC [10]
UNION1         1.76 (0.07)     0.32 (0.07)     0.70 (0.07)       0.04 (0.01)     0.09 (0.01)
SCHOOL         0.04 (0.02)     0.19 (0.11)     0.16 (0.11)       0.02 (0.01)     0.02 (0.01)
LEXPER         0.19 (0.07)     0.98 (0.26)     0.65 (0.24)       0.12 (0.03)     0.09 (0.03)
RURAL          −0.12 (0.07)    0.16 (0.20)     0.14 (0.18)       0.02 (0.02)     0.02 (0.02)
MARRIED        0.15 (0.06)     0.26 (0.11)     0.21 (0.11)       0.03 (0.01)     0.03 (0.01)
HEALTH         −0.19 (0.17)    −0.32 (0.29)    −0.29 (0.31)      −0.04 (0.04)    −0.04 (0.04)
BLACK          0.20 (0.08)
HISP           0.01 (0.09)
Observations   4360            2064            2064              4360            4360

Notes: Standard errors in parentheses. All regressions include industry, region, and time dummies. Standard errors in columns [1] and [6] are clustered at the individual level.
Table 9
Wage regressions with union effects (1981–1988).

Without occupation
            P-OLS [1]       P-Heckit [2]    FE [3]          FE-Heck [4]     BC-Heck [5]
UNION       0.16 (0.02)     0.24 (0.04)     0.10 (0.02)     0.30 (0.06)     0.40 (0.05)
SCHOOL      0.08 (0.01)     0.08 (0.01)     0.09 (0.01)     0.09 (0.01)     0.09 (0.02)
LEXPER      0.21 (0.02)     0.20 (0.02)     0.29 (0.03)     0.27 (0.03)     0.26 (0.04)
RURAL       −0.16 (0.03)    −0.16 (0.03)    −0.01 (0.02)    −0.01 (0.03)    −0.02 (0.03)
MARRIED     0.11 (0.02)     0.11 (0.02)     0.03 (0.01)     0.02 (0.02)     0.02 (0.02)
HEALTH      0.02 (0.04)     0.03 (0.04)     0.03 (0.03)     0.04 (0.03)     0.04 (0.04)
BLACK       −0.19 (0.04)    −0.20 (0.04)
HISP        −0.08 (0.04)    −0.08 (0.04)
λ                           −0.07 (0.02)                    −0.12 (0.03)    −0.18 (0.03)
R-squared   0.36            0.36            0.71            0.71            0.71
Obs.        4360            4360            4360            4360            4360

With occupation
            P-OLS [6]       P-Heckit [7]    FE [8]          FE-Heck [9]     BC-Heck [10]
UNION       0.18 (0.02)     0.28 (0.04)     0.11 (0.02)     0.32 (0.05)     0.42 (0.05)
SCHOOL      0.07 (0.01)     0.07 (0.01)     0.08 (0.01)     0.07 (0.02)     0.07 (0.02)
LEXPER      0.20 (0.02)     0.19 (0.02)     0.28 (0.03)     0.25 (0.03)     0.24 (0.04)
RURAL       −0.15 (0.03)    −0.15 (0.03)    −0.01 (0.02)    −0.01 (0.03)    −0.02 (0.03)
MARRIED     0.10 (0.02)     0.10 (0.02)     0.03 (0.01)     0.02 (0.02)     0.02 (0.02)
HEALTH      0.03 (0.03)     0.04 (0.03)     0.04 (0.03)     0.04 (0.04)     0.05 (0.05)
BLACK       −0.17 (0.04)    −0.18 (0.04)
HISP        −0.07 (0.03)    −0.07 (0.03)
λ                           −0.08 (0.02)                    −0.13 (0.03)    −0.19 (0.03)
R-squared   0.38            0.38            0.71            0.71            0.71
Obs.        4360            4360            4360            4360            4360

Notes: Standard errors in parentheses. All regressions include industry, region, and time dummies. Standard errors in columns [1], [2], [6] and [7] are clustered at the individual level. Standard errors in columns [3], [4], [5], [8], [9], and [10] are robust to heteroskedasticity. Standard errors in columns [4], [5], [9], and [10] account for generated regressors (the standard errors in columns [2] and [7] do not account for the estimation of the Mills ratio).
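Footnote 12 states that the control variable λ in the Heckman-type wage regressions is the generalized residual of the probit model. A minimal sketch of that object is given below, assuming the textbook form of the probit generalized residual, d φ(x)/Φ(x) − (1 − d) φ(x)/(1 − Φ(x)), where d is the binary outcome and x the fitted index. The function names are hypothetical, and the fixed effects estimation and bias corrections that would produce the index are not reproduced here.

```python
import math
import numpy as np

# Vectorized complementary error function from the standard library.
_erfc = np.vectorize(math.erfc, otypes=[float])

def norm_pdf(x):
    """Standard normal density, elementwise."""
    return np.exp(-0.5 * x ** 2) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard normal CDF, elementwise, via erfc for tail accuracy."""
    return 0.5 * _erfc(-x / math.sqrt(2.0))

def generalized_residual(d, index):
    """Probit generalized residual for outcomes d in {0, 1} and fitted index values.

    Equals phi/Phi for d = 1 and -phi/(1 - Phi) for d = 0.
    """
    d = np.asarray(d, dtype=float)
    index = np.asarray(index, dtype=float)
    pdf = norm_pdf(index)
    # 1 - Phi(x) is computed as Phi(-x) to avoid cancellation for large x.
    return d * pdf / norm_cdf(index) - (1.0 - d) * pdf / norm_cdf(-index)

if __name__ == "__main__":
    union = np.array([1, 0, 1, 0])                 # artificial outcomes
    index = np.array([0.8, 0.8, -1.2, -1.2])       # artificial fitted index values
    print(generalized_residual(union, index))
```

In a two-step procedure of this kind, the resulting vector would be added as a regressor in the wage equation, with its coefficient playing the role of the λ row in Table 9.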
marginal effects of 4–5 percentage points, recalling that the mean of the union membership variable is only 26%, the adjusted estimates of state dependence are approximately 100% higher, with estimates of 9–10 percentage points. An inspection of the other marginal effects indicates there is little difference between the adjusted and unadjusted estimates. Table 9 presents the estimates of the wage equation. We consider a range of estimators depending on whether or not they account for possible time varying endogeneity and/or individual time invariant heterogeneity. We again provide estimates that exclude and include occupational dummies in the left and right panels, respectively. We start from a pooled OLS estimator that does not account for any source of heterogeneity (P-OLS). Then, we control for possible time varying endogeneity using a pooled Heckman-type estimator (P-Heckit). Next, we introduce a FE estimator that controls for individual heterogeneity but not for time varying endogeneity (FE). Finally, we consider a FE Heckman-type estimator that controls for both time varying and time invariant heterogeneity (FE-Heck), together with a bias-corrected version of this estimator that reduces the incidental parameters problem of FE-Heck (BC-Heck).12 The corrected estimator employs the bias-corrected estimates of Table 8 to construct the control variable, and performs an additional bias correction of the estimates of the wage equation to fix the bias problem due to the nonlinearity of the control variable in the estimates of the first stage individual effects. Pooled OLS produces estimates of the union effect of 16% and 18%. As in previous studies, these estimates increase when possible non-random selection into unions is taken into account, and decrease when individual heterogeneity is controlled for using longitudinal estimators.
More interestingly, the effect of endogenous time varying selection is more important for estimators that account also for individual heterogeneity. Thus, the union effect rises from 15%–18% to 24%–28% for pooled estimators, whereas it jumps from 10%–11% to 30%–32% for FE estimators. The difference is even more acute when we correct for the bias problem of the Heckman FE estimator. Thus, the corrected estimates give a union effect of about 40%–42%. These results are also in line with VV estimates, which find a union effect of about 39%. For the other coefficients we only observe significant differences between corrected and uncorrected estimates for the coefficient of the control variable. Overall the evidence leads to a number of conclusions. First, of the parameters of interest in this empirical investigation it appears that the ones most subject to bias are that for the lagged dependent variable in the union membership equation, and those for the union variable and time varying endogeneity correction in the wage equation. Second, the results here confirm the finding in VV that the increase in the union effect which results from OLS estimation is due to time varying heterogeneity rather than time invariant heterogeneity. Finally, the empirical evidence indicates that there is significant interaction between the individual heterogeneity in the wage equation and the selection mechanism of workers into unions.
12 Note that the control variable in this case is the generalized residual of the probit model.

7. Summary and conclusion
This paper introduces bias-corrected estimators for nonlinear panel models with both time invariant and time variant heterogeneity. These estimators have a closed analytical form and are easy to implement. A major attraction of our approach is that it does not require any assumption on the parametric form of the distribution of the unobserved individual heterogeneity. Our estimation strategy is very flexible and can accommodate other models of interest with minor adjustments. For example, the estimation method for the dynamic Tobit model with endogenous regressors can be extended to the case where the lag of the latent dependent variable, instead of the lag of the censored dependent variable, is included as an explanatory variable. This model would extend Hu (2002) by allowing for endogenous and predetermined explanatory variables.

Acknowledgements
This paper resulted from conversations with Whitney Newey. We acknowledge his valuable input.
We are also grateful to Josh Angrist, Yulia Bubnova, Kazim Kazimov, two anonymous referees, an Associate Editor, the Editor Cheng Hsiao, and the seminar participants at Boston University, MIT, University Carlos III of Madrid, and the 2007 Winter Econometric Society Meetings for suggestions and comments.
The second inequality follows by the Cauchy–Schwartz inequality and Condition 3, where Mλ is defined. The result follows by the moment properties of Mh and Mλ , and Lemma 1. Lemma 4. Assume that Conditions 1–4 hold. Then, for any η > 0,
Appendix
This Appendix contains the proofs of the results in the main text. It is divided into four sections. The first section contains auxiliary results that are used in the proofs, the second section contains the proof of Proposition 1, the third section contains the proof of Theorem 1, and the fourth section contains the proof of ¯ P and o¯ P denote uniform Theorem 2. Throughout the Appendix O orders in probability. For example, for a sequence of random ¯ P (1) means max1≤i≤n ξi = variables {ξi , i = 1, . . . , n}, ξi = O OP (1), and ξi = o¯ P (1) means max1≤i≤n ξi = oP (1). For a matrix A = (aij ), i = 1, . . . , m, j = 1, . . . , n, |A| denotes Euclidean norm, that is |A|2 = trace[AA′ ]. HK refers to Hahn and Kuersteiner (forthcoming).
Pr
max sup gˆ2i (γ2 ) − g2i (γ2 ) > η
1≤i≤n γ2 ∈Γ2
= o(T −1 ),
˜ it , γ2 )] and g2i (γ2 ) = ET [g2it (λit , γ2 )]. where gˆ2i (γ2 ) = Eˆ T [g2it (λ Proof. By the triangle inequality, note that max sup gˆ2i (γ2 ) − g2i (γ2 ) ≤ max sup gˆ2i (γ2 ) − g˜2i (γ2 )
1≤i≤n γ2 ∈Γ2
1≤i≤n γ2 ∈Γ2
+ max sup g˜2i (γ2 ) − g2i (γ2 ) , 1≤i≤n γ2 ∈Γ2
˜ it , γ2 )]. Then, the conclusion follows by where g˜2i (γ2 ) = ET [g2it (λ ˜ it is a function Lemma 1 of HK applied to the first term, noting that λ of the data, and Lemma 3 applied to the second term.
A.1. Auxiliary results The following results establish higher-order expansions for bias corrected estimators of first stage parameters, uniform consistency for the uncorrected estimators of second stage parameters, and uniform convergence of sample averages of estimated smooth functions dependent on the control variable and second stage parameters. These results are needed to obtain a higher-order expansion of the uncorrected second stage estimator and to establish the validity of the bias correction. Lemma 2. Assume that Conditions 1 and 2 hold. Then,
√ = ϕ1 + R1 / nT , √ ∑n ∑T −1 where ϕ1 = P (1), ϕ1it = −J1 U1it , i=1 t =1 ϕ1it / nT = O√ U1it = u1it + ET [u1it α ] ψ1it , and R1 = oP ( nT ). Let α˜ 1i := αˆ 1i (θ˜1 ), √
nT θ˜1 − θ10
√ α˜ 1i = α1i0 + φ1i / T + β1i /T + R1i /T 3/2 ,
√
−1 where φ1i = ψ1i − ET [v√ ET [v1it θ ] ϕ1 / n = o¯ P (T 1/10 ), β1i = 1it α ] 2/10 o¯ P (T ), and R1i = o¯ P ( T ).
Proof. See HK.
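Lemma 3. Assume that Conditions 1–3 hold. Let $h_{it}(\lambda,\gamma_2) = h(w_{it},\lambda;\gamma_2)$ be a function indexed by the value of the control variable $\lambda$ and the parameter $\gamma_2$. Assume that there exists a function $M_h(w_{it})$ such that $|h_{it}(\lambda,\gamma_2) - h_{it}(\lambda',\gamma_2)| \le M_h(w_{it})\,|\lambda - \lambda'|$ for all $\lambda,\lambda' \in N_\varepsilon(\lambda_{it})$, an $\varepsilon$-neighborhood of $\lambda_{it}$ for some $\varepsilon > 0$, and $\gamma_2\in\Gamma_2$; and $\sup_i E_T[|M_h(w_{it})|^2] < \infty$. Then, for any $\eta > 0$,
\[
\Pr\Big\{ \max_{1\le i\le n}\ \sup_{\gamma_2\in\Gamma_2} \big|\bar h_i(\gamma_2) - h_i(\gamma_2)\big| > \eta \Big\} = o(T^{-1}),
\]
where $\bar h_i(\gamma_2) = E_T[h(w_{it},\bar\lambda_{it};\gamma_2)]$, $h_i(\gamma_2) = E_T[h(w_{it},\lambda_{it};\gamma_2)]$, $\bar\lambda_{it} := \lambda_{it}(\bar\gamma_{1i})$, and $\bar\gamma_{1i}$ lies between $\gamma_{1i0}$ and $\tilde\gamma_{1i} = (\tilde\theta_1', \tilde\alpha_{1i})'$.

Proof. By Lemma 1 and Condition 3, $\bar\lambda_{it}\in N_\varepsilon(\lambda_{it})$ with probability $1 - o(T^{-1})$, and
\[
\begin{aligned}
\max_{1\le i\le n}\ \sup_{\gamma_2\in\Gamma_2} \big|\bar h_i(\gamma_2) - h_i(\gamma_2)\big|
&\le \max_{1\le i\le n} E_T\big[|M_h(w_{it})|\,|\bar\lambda_{it} - \lambda_{it}|\big] \\
&\le \sup_{1\le i\le n} E_T\big[|M_h(w_{it})|^2\big]^{1/2} \times \sup_{1\le i\le n} E_T\big[|M_\lambda(w_{it})|^2\big]^{1/2} \times \max_{1\le i\le n} |\bar\gamma_{1i} - \gamma_{1i0}|.
\end{aligned}
\]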
Proposition 2. Assume that Conditions 1–4 hold. Then, for any $\eta > 0$,
\[
\Pr\big\{ |\hat\theta_2 - \theta_{20}| \ge \eta \big\} = o(T^{-1}).
\]
Proof. This result follows by the same argument as in the proof of Theorem 3 in HK, replacing Lemma 1 in HK by Lemma 4.

Proposition 3. Assume that Conditions 1–4 hold. Then, for any $\eta > 0$,
\[
\Pr\Big\{ \max_{1\le i\le n} |\hat\alpha_{2i}(\bar\theta_2) - \alpha_{2i0}| \ge \eta \Big\} = o(T^{-1}),
\]
where $\bar\theta_2$ lies between $\theta_{20}$ and $\hat\theta_2$.

Proof. This result follows by using the same argument as in the proof of Theorem 4 in HK, replacing Lemma 4 and Proposition 2 by Lemma 1 and Theorem 3 in HK.

Lemma 5. Assume that Conditions 1–4 hold. Let $\nu_1 = (\nu_{11},\ldots,\nu_{1p_1})$ and $\nu_2 = (\nu_{21},\ldots,\nu_{2p_2})$ be two vectors of non-negative integers, $|\nu_i| = \sum_{j=1}^{p_i}\nu_{ij}$ for $i\in\{1,2\}$, and
\[
h_{it}(\gamma_1,\gamma_2) = \frac{\partial^{|\nu_1|+|\nu_2|} g_{2it}(\lambda(w_{it};\gamma_1),\gamma_2)}{\partial\lambda^{|\nu_1|}\,\partial\gamma_{21}^{\nu_{21}}\cdots\partial\gamma_{2p_2}^{\nu_{2p_2}}} \times \frac{\partial^{|\nu_1|}\lambda_{it}(\gamma_1)}{\partial\gamma_{11}^{\nu_{11}}\cdots\partial\gamma_{1p_1}^{\nu_{1p_1}}}
\]
for some $\nu_1$ and $\nu_2$ with $|\nu_1|\le 3$ and $|\nu_2|\le 4$. Let $\bar\gamma_{1i}$ be between $\gamma_{1i0}$ and $\tilde\gamma_{1i}$, and let $\bar\gamma_{2i}$ be between $\gamma_{2i0}$ and $\hat\gamma_{2i}$. Then, for some $\eta > 0$ and $0 < \upsilon < (100q+120)^{-1}$ with $q = \max\{q_1, q_\lambda, q_2\}$,
\[
\Pr\Big\{ \max_{1\le i\le n} \sqrt{T}\,\big|\hat E_T[h_{it}(\bar\gamma_{1i},\bar\gamma_{2i})] - E_T[h_{it}(\bar\gamma_{1i},\bar\gamma_{2i})]\big| > \eta\,T^{1/10-\upsilon} \Big\} = o(T^{-1}).
\]
Also, let $a = \max\{a_1,a_2\}$, where $\sqrt{T}(\tilde\gamma_{1i} - \gamma_{1i0}) = \bar o_P(T^{a_1})$ and $\sqrt{T}(\hat\gamma_{2i} - \gamma_{2i0}) = \bar o_P(T^{a_2})$. Then,
\[
\Pr\Big\{ \max_{1\le i\le n} \sqrt{T}\,\big|\hat E_T[h_{it}(\bar\gamma_{1i},\bar\gamma_{2i})] - E_T[h_{it}(\gamma_{1i0},\gamma_{2i0})]\big| > \eta\,T^{a} \Big\} = o(T^{-1}).
\]
Proof. The result follows by a similar argument to the proof of Lemma 11 in Fernández-Val (2005), using the chain rule and Cauchy–Schwarz in the same fashion as in Lemma 3 to link changes in the control variable to changes in the first stage parameter.
I. Fernández-Val, F. Vella / Journal of Econometrics 163 (2011) 144–162
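B.1. Proof of Proposition 1

B.1.1. Lemmas

The following lemmas characterize higher-order stochastic expansions for estimators of the second stage individual effects $\hat\alpha_{2i0} := \hat\alpha_{2i}(\theta_{20})$ and the estimating equation of the model parameter $\hat E_T[\hat u_{2it}(\theta_{20},\hat\alpha_{2i0})]$ evaluated at the true parameter value $\theta_{20}$, and derive the properties of the components of these expansions. We use these expansions to characterize the bias and obtain a higher-order stochastic expansion for the second stage estimator $\hat\theta_2$.

Lemma 6. Assume that Conditions 1–4 hold. Then,
\[
\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) = \psi_{2i} + R_{21i}/\sqrt{T},
\]
where, for $\psi_{it}^{\lambda} := \lambda_{it\alpha}\phi_{1i} + \lambda_{it\theta}\varphi_1/\sqrt{n}$,
\[
-E_T[v_{2it\alpha}]\,\psi_{2i} = \sqrt{T}\,\hat E_T[v_{2it}] + E_T[v_{2it\lambda}\psi_{it}^{\lambda}].
\]
Moreover, $\psi_{2i} = \bar o_P(T^{1/10})$ and $R_{21i} = \bar o_P(T^{2/10})$.

Proof. Let $\tilde v_{2it}(\theta_2,\alpha_2) := v_{2it}(\tilde\lambda_{it},\theta_2,\alpha_2)$, additional subscripts denote partial derivatives, and arguments are omitted when the expressions are evaluated at the true parameter values. By a first order Taylor expansion of the FOC for $\hat\alpha_{2i0}$ around $\alpha_{2i0}$, we have
\[
0 = \hat E_T[\hat v_{2it}(\theta_{20})] = \hat E_T[\tilde v_{2it}] + \hat E_T[\tilde v_{2it\alpha}(\theta_{20},\bar\alpha_{2i0})]\,(\hat\alpha_{2i0} - \alpha_{2i0}),
\]
where $\bar\alpha_{2i0}$ lies between $\alpha_{2i0}$ and $\hat\alpha_{2i0}$. Next, by Lemma 5 and Proposition 3
\[
\begin{aligned}
\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0})
&= -E_T[v_{2it\alpha}]^{-1}\sqrt{T}\,\hat E_T[\tilde v_{2it}]
 - \underbrace{E_T[v_{2it\alpha}]^{-1}}_{=\bar O_P(1)}\,
   \underbrace{\big(\hat E_T[\tilde v_{2it\alpha}(\theta_{20},\bar\alpha_{2i0})] - E_T[v_{2it\alpha}]\big)}_{=\bar o_P(1)}\,
   \sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) \\
&= \bar o_P(T^{1/10+\upsilon}) + \bar o_P\big(\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0})\big).
\end{aligned}
\tag{11}
\]
The expression for $\psi_{2i}$ follows from a first order expansion of $\sqrt{T}\,\hat E_T[v_{2it}(\lambda(w_{it};\tilde\gamma_{1i});\gamma_{2i0})]$ around $\gamma_{1i0}$. Thus,
\[
\sqrt{T}\,\hat E_T[\tilde v_{2it}] = \psi_i^{v} + R_{1i}^{v}/\sqrt{T},
\]
where
\[
\psi_i^{v} = \sqrt{T}\,\hat E_T[v_{2it}] + E_T[v_{2it\lambda}\lambda_{it\alpha}]\,\phi_{1i} + E_T[v_{2it\lambda}\lambda_{it\theta}]\,\varphi_1/\sqrt{n} = \bar o_P(T^{1/10}),
\]
and, for $\bar\gamma_{1i}$ between $\tilde\gamma_{1i}$ and $\gamma_{1i0}$, $\bar\lambda_{it} = \lambda_{it}(\bar\gamma_{1i})$, and $\bar v_{2it} = v_{2it}(\bar\lambda_{it},\gamma_{2i0})$,
\[
\begin{aligned}
R_{1i}^{v} ={}& E_T[v_{2it\lambda}\lambda_{it\alpha}]\,(\beta_{1i} + R_{1i}/\sqrt{T})
 + \sqrt{T}\big(\hat E_T[\bar v_{2it\lambda}\bar\lambda_{it\alpha}] - E_T[v_{2it\lambda}\lambda_{it\alpha}]\big)\sqrt{T}(\tilde\alpha_{1i} - \alpha_{1i0}) \\
&+ E_T[v_{2it\lambda}\lambda_{it\theta}]\,R_1/n
 + \sqrt{T}\big(\hat E_T[\bar v_{2it\lambda}\bar\lambda_{it\theta}] - E_T[v_{2it\lambda}\lambda_{it\theta}]\big)\sqrt{T}(\tilde\theta_1 - \theta_{10})
 = \bar o_P(T^{2/10}),
\end{aligned}
\]
by Lemmas 2 and 5. Finally, for the remainder term, we have
\[
R_{21i} = -E_T[v_{2it\alpha}]^{-1}R_{1i}^{v}
 - \underbrace{E_T[v_{2it\alpha}]^{-1}}_{=\bar O_P(1)}\,
   \underbrace{\sqrt{T}\big(\hat E_T[\tilde v_{2it\alpha}(\theta_{20},\bar\alpha_{2i0})] - E_T[v_{2it\alpha}]\big)}_{=\bar o_P(T^{1/10})}\,
   \sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}),
\]
and the uniform order of probability follows by Lemmas 2 and 5 and Eq. (11).

Lemma 7. Assume that Conditions 1–4 hold. Then,
\[
\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) = \psi_{2i} + Q_{2i}/\sqrt{T} + R_{22i}/T,
\]
where, for $\beta_{it}^{\lambda} = \lambda_{it\alpha}\beta_{1i} + \lambda_{it\alpha\alpha}\psi_{1i}^{2}/2$,
\[
\begin{aligned}
-E_T[v_{2it\alpha}]\,Q_{2i} ={}& \sqrt{T}\big(\hat E_T[v_{2it\lambda}\lambda_{it\alpha}] - E_T[v_{2it\lambda}\lambda_{it\alpha}]\big)\phi_{1i} + E_T[v_{2it\lambda}\beta_{it}^{\lambda}] + E_T[v_{2it\lambda\lambda}]\,\phi_{1i}^{2}/2 \\
&+ \sqrt{T}\big(\hat E_T[v_{2it\alpha}] - E_T[v_{2it\alpha}]\big)\psi_{2i} + E_T[v_{2it\alpha\lambda}\psi_{it}^{\lambda}]\,\psi_{2i} + E_T[v_{2it\alpha\alpha}]\,\psi_{2i}^{2}/2.
\end{aligned}
\]
Moreover, $Q_{2i} = \bar o_P(T^{2/10})$ and $R_{22i} = \bar o_P(\sqrt{T})$.

Proof. By a second order Taylor expansion of the FOC for $\hat\alpha_{2i0}$ around $\alpha_{2i0}$
\[
0 = \hat E_T[\hat v_{2it}(\theta_{20})] = \hat E_T[\tilde v_{2it}] + \hat E_T[\tilde v_{2it\alpha}]\,(\hat\alpha_{2i0} - \alpha_{2i0}) + \hat E_T[\tilde v_{2it\alpha\alpha}(\theta_{20},\bar\alpha_{2i0})]\,(\hat\alpha_{2i0} - \alpha_{2i0})^{2}/2,
\]
where $\bar\alpha_{2i0}$ lies between $\alpha_{2i0}$ and $\hat\alpha_{2i0}$. Variables with a tilde are defined in the same fashion as in the proof of Lemma 6. Next, solving for $\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0})$
\[
\begin{aligned}
-E_T[v_{2it\alpha}]\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0})
={}& \sqrt{T}\,\hat E_T[\tilde v_{2it}] + \sqrt{T}\big(\hat E_T[\tilde v_{2it\alpha}] - E_T[v_{2it\alpha}]\big)(\hat\alpha_{2i0} - \alpha_{2i0}) \\
&+ \sqrt{T}\,E_T[v_{2it\alpha\alpha}]\,(\hat\alpha_{2i0} - \alpha_{2i0})^{2}/2 \\
&+ \sqrt{T}\big(\hat E_T[\tilde v_{2it\alpha\alpha}(\theta_{20},\bar\alpha_{2i0})] - E_T[v_{2it\alpha\alpha}]\big)(\hat\alpha_{2i0} - \alpha_{2i0})^{2}/2.
\end{aligned}
\tag{12}
\]
A second order Taylor expansion of the first term of Eq. (12) around $\gamma_{1i0}$ gives
\[
\sqrt{T}\,\hat E_T[\tilde v_{2it}] = \psi_i^{v} + Q_i^{v}/\sqrt{T} + R_{2i}^{v}/T,
\]
where $\psi_i^{v}$ is defined in the proof of Lemma 6,
\[
Q_i^{v} = \sqrt{T}\big(\hat E_T[v_{2it\lambda}\lambda_{it\alpha}] - E_T[v_{2it\lambda}\lambda_{it\alpha}]\big)\phi_{1i} + E_T[v_{2it\lambda}\beta_{it}^{\lambda}] + E_T[v_{2it\lambda\lambda}]\,\phi_{1i}^{2}/2 = \bar o_P(T^{2/10}),
\]
by Lemmas 2 and 5, and
\[
\begin{aligned}
R_{2i}^{v} ={}& \sqrt{T}\big(\hat E_T[v_{2it\lambda}\lambda_{it\alpha}] - E_T[v_{2it\lambda}\lambda_{it\alpha}]\big)(\beta_{1i} + R_{1i}/\sqrt{T})
 + E_T[v_{2it\lambda}\lambda_{it\alpha}]\,R_{1i} + E_T[v_{2it\lambda}\lambda_{it\theta}]\,R_1\sqrt{T}/n \\
&+ \sqrt{T}\big(\hat E_T[\bar v_{2it\lambda}\bar\lambda_{it\theta}] - E_T[v_{2it\lambda}\lambda_{it\alpha}]\big)\sqrt{T}(\tilde\theta_1 - \theta_{10}) \\
&+ E_T[v_{2it\lambda}\lambda_{it\alpha\alpha} + v_{2it\lambda\lambda}\lambda_{it\alpha}^{2}]\,(\beta_{1i} + R_{1i}/\sqrt{T}) \times \big(\sqrt{T}(\tilde\alpha_{1i} - \alpha_{1i0}) + \phi_{1i}\big) \\
&+ \sqrt{T}\big(E_T[\bar v_{2it\lambda}\bar\lambda_{it\alpha\alpha} + \bar v_{2it\lambda\lambda}\bar\lambda_{it\alpha}^{2}] - E_T[v_{2it\lambda}\lambda_{it\alpha\alpha} + v_{2it\lambda\lambda}\lambda_{it\alpha}^{2}]\big)\sqrt{T}(\tilde\alpha_{1i} - \alpha_{1i0})^{2},
\end{aligned}
\]
where functions with a bar are evaluated at $\bar\lambda_{it} = \lambda_{it}(\bar\gamma_{1i})$ or $\bar\gamma_{1i}$, and $\bar\gamma_{1i}$ is between $\tilde\gamma_{1i}$ and $\gamma_{1i0}$. To write some of the components of the expression of the remainder term, we use the fact that if $z = x + y$, then $z^{2} = x^{2} + y(z + x)$. Also, $R_{2i}^{v} = \bar o_P(\sqrt{T})$ by Lemmas 2 and 5. For the second term of Eq. (12), a first order expansion around $\gamma_{1i0}$ and Lemma 6 give
\[
\sqrt{T}\big(\hat E_T[\tilde v_{2it\alpha}] - E_T[v_{2it\alpha}]\big)(\hat\alpha_{2i0} - \alpha_{2i0}) = Q_i^{v\alpha}/\sqrt{T} + R_i^{v\alpha}/T,
\]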
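where
\[
Q_i^{v\alpha} = \sqrt{T}\big(\hat E_T[v_{2it\alpha}] - E_T[v_{2it\alpha}]\big)\psi_{2i} + E_T[v_{2it\alpha\lambda}\psi_{it}^{\lambda}]\,\psi_{2i} = \bar o_P(T^{2/10}),
\]
and
\[
\begin{aligned}
R_i^{v\alpha} ={}& \sqrt{T}\big(\hat E_T[v_{2it\alpha}] - E_T[v_{2it\alpha}]\big)R_{21i}
 + \big(E_T[v_{2it\alpha\lambda}\lambda_{it\alpha}]\,(\beta_{1i} + R_{1i}/\sqrt{T}) + E_T[v_{2it\alpha\lambda}\lambda_{it\theta}]\,R_1/n\big)\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) \\
&+ \sqrt{T}\big(\hat E_T[\bar v_{2it\alpha\lambda}\bar\lambda_{it\alpha}] - E_T[v_{2it\alpha\lambda}\lambda_{it\alpha}]\big) \times \sqrt{T}(\tilde\alpha_{1i} - \alpha_{1i0})\,\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) \\
&+ \sqrt{T}\big(\hat E_T[\bar v_{2it\alpha\lambda}\bar\lambda_{it\theta}] - E_T[v_{2it\alpha\lambda}\lambda_{it\theta}]\big) \times \sqrt{T}(\tilde\theta_1 - \theta_{10})\,\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) = \bar o_P(\sqrt{T}),
\end{aligned}
\]
by Lemmas 2, 5 and 6. Functions with a bar are evaluated at $\bar\lambda_{it} = \lambda_{it}(\bar\gamma_{1i})$ or $\bar\gamma_{1i}$, and $\bar\gamma_{1i}$ is between $\tilde\gamma_{1i}$ and $\gamma_{1i0}$. For the third term of Eq. (12), Lemma 6 gives
\[
\sqrt{T}\,E_T[v_{2it\alpha\alpha}]\,(\hat\alpha_{2i0} - \alpha_{2i0})^{2}/2 = Q_i^{v\alpha\alpha}/\sqrt{T} + R_i^{v\alpha\alpha}/T,
\]
where
\[
Q_i^{v\alpha\alpha} = E_T[v_{2it\alpha\alpha}]\,\psi_{2i}^{2}/2 = \bar o_P(T^{2/10}),
\]
and
\[
R_i^{v\alpha\alpha} = E_T[v_{2it\alpha\alpha}]\,R_{21i}\big(\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) + \psi_{2i}\big)/2 = \bar o_P(\sqrt{T}),
\]
by Lemmas 5 and 6. Finally, the fourth term of Eq. (12) is of the order $\bar o_P(1/\sqrt{T})$ by Lemmas 5 and 6. The expressions for $\psi_{2i}$ and $Q_{2i}$ are obtained by setting $\psi_{2i} = \psi_i^{v}$ and $Q_{2i} = Q_i^{v} + Q_i^{v\alpha} + Q_i^{v\alpha\alpha}$.

Lemma 8. Assume that Conditions 1–4 hold. Then, as $n, T\to\infty$,
\[
\sqrt{n}\,\hat E_n[\psi_{2i}] \to_d N(0, E_n[\sigma_{2i}^{2}]) \quad\text{and}\quad \hat E_n[Q_{2i}] \to_P E_n[\beta_{2i}],
\]
where
\[
\begin{aligned}
\sigma_{2i}^{2} &= \bar E_T[\psi_{2it}\psi_{2is}], \qquad \sigma_{12i} = \bar E_T[\psi_{1it}\psi_{2is}], \\
-E_T[v_{2it\alpha}]\,\beta_{2i} &= \bar E_T[v_{2it\alpha}\psi_{2is}] + E_T[v_{2it\alpha\alpha}]\,\sigma_{2i}^{2}/2 + \bar E_T[v_{2it\lambda}\lambda_{it\alpha}\psi_{1is}] + E_T[v_{2it\lambda\alpha}\lambda_{it\alpha}]\,\sigma_{12i} \\
&\quad + E_T[v_{2it\lambda}\lambda_{it\alpha}]\,\beta_{1i} + E_T[v_{2it\lambda}\lambda_{it\alpha\alpha}]\,\sigma_{1i}^{2}/2 + E_T[v_{2it\lambda\lambda}\lambda_{it\alpha}^{2}]\,\sigma_{1i}^{2}/2.
\end{aligned}
\]
Proof. The result for the influence functions $\psi_{2i}$'s follows by Lemma 6, and Lemma 3 in HK. The result for the $Q_{2i}$'s can be shown using a similar argument as in the derivation of the limiting behavior of $\theta^{\epsilon\epsilon}(0)$ in the proof of Theorem 1 in HK. In particular, uniform convergence of $Q_{2i}$, $i = 1,\ldots,n$, can be established using Corollary A.2 of Hall and Heyde (1980), and Lemma 3 in HK.

Lemma 9. Assume that Conditions 1–4 hold. Then,
\[
\sqrt{T}\,\hat E_T[\hat u_{2it}(\theta_{20})] = \psi_{2i}^{u} + Q_{2i}^{u}/\sqrt{T} + R_{22i}^{u}/T,
\]
where
\[
\begin{aligned}
\psi_{2i}^{u} ={}& \sqrt{T}\,\hat E_T[u_{2it}] + E_T[u_{2it\alpha}]\,\psi_{2i} + E_T[u_{2it\lambda}\psi_{it}^{\lambda}], \\
Q_{2i}^{u} ={}& \sqrt{T}\big(\hat E_T[u_{2it\lambda}\lambda_{it\alpha}] - E_T[u_{2it\lambda}\lambda_{it\alpha}]\big)\phi_{1i} + E_T[u_{2it\lambda}\beta_{it}^{\lambda}] + E_T[u_{2it\lambda\lambda}]\,\phi_{1i}^{2}/2 \\
&+ E_T[u_{2it\alpha}]\,Q_{2i} + \sqrt{T}\big(\hat E_T[u_{2it\alpha}] - E_T[u_{2it\alpha}]\big)\psi_{2i} + E_T[u_{2it\alpha\lambda}\psi_{it}^{\lambda}]\,\psi_{2i} + E_T[u_{2it\alpha\alpha}]\,\psi_{2i}^{2}/2.
\end{aligned}
\]
Also, $\psi_{2i}^{u} = \bar o_P(T^{1/10})$, $Q_{2i}^{u} = \bar o_P(T^{2/10})$, and $R_{22i}^{u} = \bar o_P(\sqrt{T})$.

Proof. Let $\tilde u_{2it}(\theta_2,\alpha_2) := u_{2it}(\tilde\lambda_{it},\theta_2,\alpha_2)$, additional subscripts denote partial derivatives, and arguments are omitted when the expressions are evaluated at the true parameter value. A second order Taylor expansion of $\hat E_T[\hat u_{2it}(\theta_{20})]$ around $\alpha_{2i0}$ gives
\[
\begin{aligned}
\sqrt{T}\,\hat E_T[\hat u_{2it}(\theta_{20})]
={}& \sqrt{T}\,\hat E_T[\tilde u_{2it}] + E_T[u_{2it\alpha}]\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0})
 + \sqrt{T}\big(\hat E_T[\tilde u_{2it\alpha}] - E_T[u_{2it\alpha}]\big)(\hat\alpha_{2i0} - \alpha_{2i0}) \\
&+ E_T[u_{2it\alpha\alpha}]\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0})^{2}/2
 + \sqrt{T}\big(\hat E_T[\tilde u_{2it\alpha\alpha}(\theta_{20},\bar\alpha_{2i0})] - E_T[u_{2it\alpha\alpha}]\big)(\hat\alpha_{2i0} - \alpha_{2i0})^{2}/2,
\end{aligned}
\tag{13}
\]
where $\bar\alpha_{2i0}$ lies between $\alpha_{2i0}$ and $\hat\alpha_{2i0}$. A second order Taylor expansion of the first term of Eq. (13) around $\gamma_{1i0}$ gives
\[
\sqrt{T}\,\hat E_T[\tilde u_{2it}] = \psi_i^{(1)} + Q_i^{(1)}/\sqrt{T} + R_i^{(1)}/T,
\]
where
\[
\begin{aligned}
\psi_i^{(1)} ={}& \sqrt{T}\,\hat E_T[u_{2it}] + E_T[u_{2it\lambda}\psi_{it}^{\lambda}] = \bar o_P(T^{1/10}), \\
Q_i^{(1)} ={}& \sqrt{T}\big(\hat E_T[u_{2it\lambda}\lambda_{it\alpha}] - E_T[u_{2it\lambda}\lambda_{it\alpha}]\big)\phi_{1i} + E_T[u_{2it\lambda}\beta_{it}^{\lambda}] + E_T[u_{2it\lambda\lambda}]\,\phi_{1i}^{2}/2 = \bar o_P(T^{2/10}),
\end{aligned}
\]
and
\[
\begin{aligned}
R_i^{(1)} ={}& \sqrt{T}\big(\hat E_T[u_{2it\lambda}\lambda_{it\alpha}] - E_T[u_{2it\lambda}\lambda_{it\alpha}]\big) \times (\beta_{1i} + R_{1i}/\sqrt{T})
 + E_T[u_{2it\lambda}\lambda_{it\alpha}]\,R_{1i} + E_T[u_{2it\lambda}\lambda_{it\theta}]\,R_1\sqrt{T}/n \\
&+ \sqrt{T}\big(\hat E_T[\bar u_{2it\lambda}\bar\lambda_{it\theta}] - E_T[v_{2it\lambda}\lambda_{it\alpha}]\big)\sqrt{T}(\tilde\theta_1 - \theta_{10}) \\
&+ E_T[u_{2it\lambda}\lambda_{it\alpha\alpha} + u_{2it\lambda\lambda}\lambda_{it\alpha}^{2}] \times (\beta_{1i} + R_{1i}/\sqrt{T})\big(\sqrt{T}(\tilde\alpha_{1i} - \alpha_{1i0}) + \phi_{1i}\big) \\
&+ \sqrt{T}\big(E_T[\bar u_{2it\lambda}\bar\lambda_{it\alpha\alpha} + \bar u_{2it\lambda\lambda}\bar\lambda_{it\alpha}^{2}] - E_T[u_{2it\lambda}\lambda_{it\alpha\alpha} + u_{2it\lambda\lambda}\lambda_{it\alpha}^{2}]\big)\sqrt{T}(\tilde\alpha_{1i} - \alpha_{1i0})^{2},
\end{aligned}
\]
where functions with a bar are evaluated at $\bar\lambda_{it} = \lambda_{it}(\bar\gamma_{1i})$ or $\bar\gamma_{1i}$, and $\bar\gamma_{1i}$ is between $\tilde\gamma_{1i}$ and $\gamma_{1i0}$. Here to write some of the components of the expression of the remainder term, we use the fact that if $z = x + y$, then $z^{2} = x^{2} + y(z + x)$. Also, $R_i^{(1)} = \bar o_P(\sqrt{T})$ by Lemmas 2 and 5. For the second term of Eq. (13), by Lemma 7 we have
\[
E_T[u_{2it\alpha}]\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) = \psi_i^{(2)} + Q_i^{(2)}/\sqrt{T} + R_i^{(2)}/T,
\]
where $\psi_i^{(2)} = E_T[u_{2it\alpha}]\psi_{2i} = \bar o_P(T^{1/10})$, $Q_i^{(2)} = E_T[u_{2it\alpha}]Q_{2i} = \bar o_P(T^{2/10})$, and $R_i^{(2)} = E_T[u_{2it\alpha}]R_{22i} = \bar o_P(\sqrt{T})$. For the third term of Eq. (13), a first order expansion around $\gamma_{1i0}$ and Lemma 6 give
\[
\sqrt{T}\big(\hat E_T[\tilde u_{2it\alpha}] - E_T[u_{2it\alpha}]\big)(\hat\alpha_{2i0} - \alpha_{2i0}) = Q_i^{(3)}/\sqrt{T} + R_i^{(3)}/T,
\]
where
\[
Q_i^{(3)} = \sqrt{T}\big(\hat E_T[u_{2it\alpha}] - E_T[u_{2it\alpha}]\big)\psi_{2i} + E_T[u_{2it\alpha\lambda}\psi_{it}^{\lambda}]\,\psi_{2i} = \bar o_P(T^{2/10}),
\]
and
\[
R_i^{(3)} = \sqrt{T}\big(\hat E_T[u_{2it\alpha}] - E_T[u_{2it\alpha}]\big)R_{21i} + \big(E_T[u_{2it\alpha\lambda}\lambda_{it\alpha}]\,(\beta_{1i} + R_{1i}/\sqrt{T}) + E_T[u_{2it\alpha\lambda}\lambda_{it\theta}]\,R_1/n\big)
\]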
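\[
\begin{aligned}
&\times \sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0})
 + \sqrt{T}\big(\hat E_T[\bar u_{2it\alpha\lambda}\bar\lambda_{it\alpha}] - E_T[u_{2it\alpha\lambda}\lambda_{it\alpha}]\big)\sqrt{T}(\tilde\alpha_{1i} - \alpha_{1i0}) \times \sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) \\
&+ \sqrt{T}\big(\hat E_T[\bar u_{2it\alpha\lambda}\bar\lambda_{it\theta}] - E_T[u_{2it\alpha\lambda}\lambda_{it\theta}]\big) \times \sqrt{T}(\tilde\theta_1 - \theta_{10})\,\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) = \bar o_P(\sqrt{T}),
\end{aligned}
\]
by Lemmas 2, 5 and 6. Functions with a bar are evaluated at $\bar\lambda_{it} = \lambda_{it}(\bar\gamma_{1i})$ or $\bar\gamma_{1i}$, and $\bar\gamma_{1i}$ is between $\tilde\gamma_{1i}$ and $\gamma_{1i0}$. For the fourth term of Eq. (13), Lemma 6 gives
\[
\sqrt{T}\,E_T[u_{2it\alpha\alpha}]\,(\hat\alpha_{2i0} - \alpha_{2i0})^{2}/2 = Q_i^{(4)}/\sqrt{T} + R_i^{(4)}/T,
\]
where
\[
Q_i^{(4)} = E_T[u_{2it\alpha\alpha}]\,\psi_{2i}^{2}/2 = \bar o_P(T^{2/10}),
\]
and
\[
R_i^{(4)} = E_T[u_{2it\alpha\alpha}]\,R_{21i}\big(\sqrt{T}(\hat\alpha_{2i0} - \alpha_{2i0}) + \psi_{2i}\big)/2 = \bar o_P(\sqrt{T}),
\]
by Lemmas 5 and 6. Finally, the fifth term of Eq. (13) is of the order $\bar o_P(1/\sqrt{T})$ by Lemmas 5 and 6. The expressions for $\psi_{2i}^{u}$ and $Q_{2i}^{u}$ are obtained by setting $\psi_{2i}^{u} = \psi_i^{(1)} + \psi_i^{(2)}$ and $Q_{2i}^{u} = Q_i^{(1)} + Q_i^{(2)} + Q_i^{(3)} + Q_i^{(4)}$.

Lemma 10. Assume that Conditions 1–4 hold. Then, as $n, T\to\infty$
\[
\sqrt{n}\,\hat E_n[\psi_{2i}^{u}] \to_d N(0, E_n[\Omega_{2i}]) \quad\text{and}\quad \hat E_n[Q_{2i}^{u}] \to_P E_n[b_{2i}],
\]
where $\Omega_{2i} = \bar E_T[U_{2it}U_{2is}']$ with
\[
U_{2it} = u_{2it} + E_T[u_{2it\alpha}]\,\psi_{2it} + E_T[u_{2it\lambda}\lambda_{it\alpha}]\,\psi_{1it} + E_T\big[u_{2it\lambda}\big(\lambda_{it\theta} - E_T[v_{1it\alpha}]^{-1}E_T[v_{1it\theta}]\,\lambda_{it\alpha}\big)\big]\varphi_{1it},
\]
and
\[
\begin{aligned}
b_{2i} = E_T[Q_{2i}^{u}] ={}& \bar E_T[u_{2it\alpha}\psi_{2is}] + E_T[u_{2it\alpha}]\,\beta_{2i} + \sigma_{2i}^{2}E_T[u_{2it\alpha\alpha}]/2 + E_T[u_{2it\lambda\alpha}\lambda_{it\alpha}]\,\sigma_{12i} \\
&+ \bar E_T[u_{2it\lambda}\lambda_{it\alpha}\psi_{1is}] + E_T\big[u_{2it\lambda}\big(\lambda_{it\alpha}\beta_{1i} + \lambda_{it\alpha\alpha}\sigma_{1i}^{2}/2\big)\big] + E_T[u_{2it\lambda\lambda}\lambda_{it\alpha}^{2}]\,\sigma_{1i}^{2}/2.
\end{aligned}
\]
Proof. The result for the influence functions $\psi_{2i}^{u}$'s follows by Lemma 3 in HK. The result for the $Q_{2i}^{u}$'s can be shown using a similar argument as in the derivation of the limiting behavior of $\theta^{\epsilon\epsilon}(0)$ in the proof of Theorem 1 in HK. In particular, uniform convergence of $Q_{21i}^{u}$, $i = 1,\ldots,n$, can be established using Corollary A.2 of Hall and Heyde (1980), and Lemma 3 in HK.

B.1.2. Proof of Proposition 1

Proof. From a Taylor Expansion of the FOC for $\hat\theta_2$ around $\theta_{20}$, we have
\[
0 = T\,\hat E_n\big[\hat E_T[\hat u_{2it}(\hat\theta_2)]\big] = T\,\hat E_n\big[\hat E_T[\hat u_{2it}(\theta_{20})]\big] + \hat E_n\big[\hat E_T[d\hat u_{2it}(\bar\theta_2)/d\theta_2']\big]\,T(\hat\theta_2 - \theta_{20}),
\tag{14}
\]
where $\bar\theta_2$ lies between $\hat\theta_2$ and $\theta_{20}$.

Part I: Limit of $\hat J_{2i}(\bar\theta_2) := \hat E_T[d\hat u_{2it}(\bar\theta_2)/d\theta_2']$. Note that
\[
\hat J_{2i}(\bar\theta_2) = \hat E_T[\tilde u_{2it\theta}(\bar\theta_2,\hat\alpha_{2i}(\bar\theta_2))] + \hat E_T[\tilde u_{2it\alpha}(\bar\theta_2,\hat\alpha_{2i}(\bar\theta_2))]\,\partial\hat\alpha_{2i}(\bar\theta_2)/\partial\theta_2,
\tag{15}
\]
where variables with tilde are defined in the same fashion as in the proof of Lemmas 6 and 9. Then, differentiation of the FOC for $\hat\alpha_{2i}(\theta)$, $\hat E_T[\tilde v_{2it}(\theta_2,\hat\alpha_{2i}(\theta_2))] = 0$, with respect to $\theta_2$ gives
\[
\hat E_T[\tilde v_{2it\theta}(\bar\theta_2,\hat\alpha_{2i}(\bar\theta_2))] + \hat E_T[\tilde v_{2it\alpha}(\bar\theta_2,\hat\alpha_{2i}(\bar\theta_2))]\,\partial\hat\alpha_{2i}(\bar\theta_2)/\partial\theta_2 = 0.
\]
By repeated application of Lemma 5 and Propositions 2 and 3, we can write
\[
\begin{aligned}
\partial\hat\alpha_{2i}(\bar\theta_2)/\partial\theta_2 &= -E_T[v_{2it\alpha}]^{-1}E_T[v_{2it\theta}] + \bar o_P(1), \\
\hat E_T[\tilde u_{2it\theta}(\bar\theta_2,\hat\alpha_{2i}(\bar\theta_2))] &= E_T[u_{2it\theta}] + \bar o_P(1),
\end{aligned}
\]
and
\[
\hat E_T[\tilde u_{2it\alpha}(\bar\theta_2,\hat\alpha_{2i}(\bar\theta_2))] = E_T[u_{2it\alpha}] + \bar o_P(1).
\]
Finally, replacing the terms in Eq. (15) by their limits and using the CMT, we have
\[
\hat J_{2i}(\bar\theta_2) = J_{2i} + \bar o_P(1),
\]
where $J_{2i} = E_T[u_{2it\theta}] - E_T[v_{2it\alpha}]^{-1}E_T[u_{2it\alpha}]\,E_T[v_{2it\theta}]$ and $\hat E_n[J_{2i}] \to_P E_n[J_{2i}] = J_2$ by the LLN.

Part II: Asymptotic Expansion for $T E[\hat\theta_2 - \theta_{20}]$. By Lemma 9
\[
T\,\hat E_n\big[\hat E_T[\hat u_{2it}(\theta_{20})]\big] = \sqrt{T}\,\hat E_n[\psi_{2i}^{u}] + \hat E_n[Q_{2i}^{u}] + \hat E_n[R_{22i}^{u}]/\sqrt{T}.
\]
Then, replacing in Eq. (14), as $n, T\to\infty$,
\[
0 = \sqrt{T}\,E_n[\psi_{2i}^{u}] + E_n[Q_{2i}^{u}] + J_2\,T(\hat\theta_2 - \theta_{20}) + o_P(1),
\]
by part I and Proposition 2. Taking expectations, by Lemmas 9 and 10 and Condition 1
\[
0 = E_n[b_{2i}] + J_2\,T E[\hat\theta_2 - \theta_{20}] + o(1).
\]
By Condition 4 the matrix $J_2$ is non-singular. Therefore, $T E[\hat\theta_2 - \theta_{20}] = -J_2^{-1}E_n[b_{2i}]$, where the expression for $b_{2i}$ is given in Lemmas 8 and 10.

C.1. Proof of Theorem 1

C.1.1. Lemmas

The following result establishes the asymptotic normality of the estimating equation for $\theta_2$.

Lemma 11. Assume that Conditions 1–4 hold. Then, for $\kappa = \lim_{n,T\to\infty} n/T$,
\[
\sqrt{nT}\,\hat E_n\big[\hat E_T[\hat u_{2it}(\theta_{20})]\big] \to_d N\big(\sqrt{\kappa}\,E_n[b_{2i}],\ E_n[\Omega_{2i}]\big),
\]
where $b_{2i}$ and $\Omega_{2i}$ are defined in Lemma 10.

Proof. From Lemma 9, we have
\[
\sqrt{nT}\,\hat E_n\big[\hat E_T[\hat u_{2it}(\theta_{20})]\big]
= \underbrace{\sqrt{n}\,\hat E_n[\psi_{2i}^{u}]}_{=O_P(1)} + \sqrt{n/T}\,\underbrace{\hat E_n[Q_{21i}^{u}]}_{=O_P(1)} + \underbrace{\sqrt{n}\,\hat E_n[R_{2i}^{u}]/T}_{=o_P(1)}.
\]
Then, the result follows by Lemma 10 and the Slutsky Theorem.

C.1.2. Proof of Theorem 1

Proof. From a Taylor Expansion of the FOC for $\hat\theta_2$ around $\theta_{20}$, we have
\[
0 = T\,\hat E_n\big[\hat E_T[\hat u_{2it}(\hat\theta_2)]\big] = T\,\hat E_n\big[\hat E_T[\hat u_{2it}(\theta_{20})]\big] + \hat E_n\big[\hat E_T[d\hat u_{2it}(\bar\theta_2)/d\theta_2']\big]\,T(\hat\theta_2 - \theta_{20}),
\]
where $\bar\theta_2$ lies between $\hat\theta_2$ and $\theta_{20}$ and $\hat u_{2it}(\theta_2) = u_2(w_{it},\tilde\lambda_{it};\theta_2,\hat\alpha_{2i}(\theta_2))$. By part I of the proof of Proposition 1, as $n, T\to\infty$
\[
\hat E_n\big[\hat E_T[d\hat u_{2it}(\bar\theta_2)/d\theta_2']\big] \to_P J_2.
\]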
By Condition 4 the matrix $J_2$ is non-singular. Then, by Lemma 11
\[
\sqrt{nT}\,(\hat\theta_2 - \theta_{20} - B_2/T) = -J_2^{-1}\sqrt{nT}\,\big(\hat E_n\big[\hat E_T[\hat u_{2it}(\theta_{20})]\big] - E_n[b_{2i}]/T\big) + o_P(1) \to_d N\big(0,\ J_2^{-1}E_n[\Omega_{2i}]J_2^{-1}\big).
\]

D.1. Proof of Theorem 2

Proof. First, note that
\[
\sqrt{nT}\,(\tilde\theta_2 - \theta_{20}) = \sqrt{nT}\,(\hat\theta_2 - \theta_{20} - B_2/T) - \sqrt{n/T}\,\big(\hat B_2(\hat\theta_2) - B_2\big).
\]
By Theorem 1 and Condition 1 we only need to show that $\hat B_2(\hat\theta_2) - B_2 = o_P(1)$. Recall that $B_2$ is a smooth function of expectations of derivatives of the objective function evaluated at the true parameter values, i.e., expressions of the form $E_T[h_{it}(\gamma_{1i0},\gamma_{2i0})] = E_T[h(w_{it},\lambda_{it}(\gamma_{1i0});\gamma_{2i0})]$. $\hat B_2(\hat\theta_2)$, the fixed effects estimator of $B_2$, replaces expected values by sample analogs, and the true values of the parameters and control variables by fixed effects estimates, i.e., $\hat B_2(\hat\theta_2)$ has components of the form $\hat E_T[h_{it}(\tilde\gamma_{1i},\hat\gamma_{2i})]$. Propositions 2 and 3, and Lemma 5 establish the uniform consistency of the components of the estimator of $B_2$. The result for the entire expression then follows by the continuous mapping theorem and a LLN, where consistency of the truncated estimators of the spectral variances and covariances follows by Lemma 6 in HK.

References

Arellano, M., Carrasco, R., 2003. Discrete choice panel data models with predetermined variables. Journal of Econometrics 115 (1), 125–157. Arellano, M., Hahn, J., 2005. Understanding bias in nonlinear panel models: some recent developments. Mimeo, CEMFI. Blundell, R.W., Smith, R.J., 1989. Estimation in a class of simultaneous equation limited dependent variable models. Review of Economic Studies 56, 37–58. Blundell, R.W., Smith, R.J., 1994. Coherency and estimation in simultaneous models with censored or qualitative dependent variables. Journal of Econometrics 64, 355–373. Cameron, A.C., Trivedi, P.K., 2005. Microeconometrics: Methods and Applications. Cambridge University Press, New York. Carro, J.M., 2006. Estimating dynamic panel data discrete choice models with fixed effects. Journal of Econometrics. doi:10.1016/j.jeconom.2006.07.023. Chamberlain, G., 1980. Analysis of covariance with qualitative data. Review of Economic Studies XLVII, 225–238. Dhaene, G., Jochmans, K., 2010. Split-panel estimation of fixed-effect models. Mimeo. K.U. Leuven. Dhrymes, P.J., 1970.
Econometrics: Statistical Foundations and Applications. Springer, Berlin. Fernández-Val, I., 2005. Bias correction in panel data models with individual specific parameters. Working Paper, MIT Department of Economics. Fernández-Val, I., 2009. Estimation of structural parameters and marginal effects in binary choice panel data models with fixed effects. Journal of Econometrics 150, 71–85. Gayle, G.L., Viauroux, C., 2007. Root-N consistent semi-parametric estimators of a dynamic panel sample selection model. Journal of Econometrics 141, 179–212. Greene, W.H., 2004a. The behavior of the fixed effects estimator in nonlinear models. Econometric Journal 7, 98–119. Greene, W.H., 2004b. Fixed effects and the incidental parameters problem in the tobit model. Econometric Reviews 23, 125–148. Hahn, J., Kuersteiner, G., 2002. Asymptotically unbiased inference for a dynamic panel model with fixed effects when both n and T are large. Econometrica 70, 1639–1657.
Hahn, J., Kuersteiner, G., Bias reduction for dynamic nonlinear panel models with fixed effects. Econometric Theory (forthcoming). Hahn, J., Newey, W., 2004. Jackknife and analytical bias reduction for nonlinear panel models. Econometrica 72, 1295–1319. Hall, P., Heyde, C., 1980. Martingale Limit Theory and its Applications. Academic Press. Heckman, J.J., 1976. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5, 475–492. Heckman, J.J., 1978. Dummy endogenous variables in a simultaneous equation system. Econometrica 46, 931–959. Heckman, J.J., 1979. Sample selection bias as a specification error. Econometrica 47, 153–161. Heckman, J.J., 1981. The incidental parameters problem and the problem of initial conditions in estimating a discrete time-discrete data stochastic process. In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Panel Data with Econometric Applications. pp. 179–195. Honoré, B.E., 1992. Trimmed lad and least squares estimation of truncated and censored regression models with fixed effects. Econometrica 60 (3), 533–565. Honoré, B.E., 1993. Orthogonality conditions for Tobit models with fixed effects and lagged dependent variables. Journal of Econometrics 59, 35–61. Honoré, B.E., Lewbel, A., 2002. Semiparametric binary choice panel data models without strictly exogenous regressors. Econometrica 70, 2053–2063. Hu, L., 2002. Estimation of a censored dynamic panel data model. Econometrica 70, 2499–2517. Kyriazidou, E., 1997. Estimation of a panel data sample selection model. Econometrica 65, 1335–1364. Kyriazidou, E., 2001. Estimation of dynamic panel data sample selection models. Review of Economic Studies 68, 543–572. Lancaster, T., 2000. The incidental parameters problem since 1948. Journal of Econometrics 95, 391–413. Lancaster, T., 2002. Orthogonal parameters and panel data. 
Review of Economic Studies 69, 647–666. Lee, L.-F., Maddala, G.S., Trost, R.P., 1980. Asymptotic covariance matrices of twostage probit and two-stage Tobit methods for simultaneous equations models with selectivity. Econometrica 48 (2), 491–503. MacKinnon, J.G., White, H., 1985. Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics 29 (3), 305–325. Manski, C., 1987. Semiparametric analysis of random effects linear models from binary panel data. Econometrica 55, 357–362. Newey, W.K., 1984. A method of moments interpretation of sequential estimators. Economics Letters 14 (2–3), 201–206. Neyman, J., Scott, E.L., 1948. Consistent estimates based on partially consistent observations. Econometrica 16, 1–32. Rasch, G., 1960. Probabilistic models for some intelligence and attainment tests. Denmark Paedogiska, Copenhagen. Ridder, G., 1990. Attrition in multi-wave panel data. In: Hartog, J., Ridder, G., Theeuwes, J. (Eds.), Panel Data and Labor Market Studies. Elsevier, North Holland. Rivers, D., Vuong, Q., 1988. Limited information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics 39, 347–366. Smith, R., Blundell, R., 1986. Exogeneity test for a simultaneous equation Tobit model with an application to labor supply. Econometrica 54, 679–685. Vella, F., 1993. A simple estimator for simultaneous models with censored endogenous regressors. International Economic Review 34, 441–457. Vella, F., Verbeek, M., 1998. Whose wages do unions raise? A dynamic model of unionism and wage rate determination for young men. Journal of Applied Econometrics 13, 163–183. Vella, F., Verbeek, M., 1999. Two-step estimation of panel data models with censored endogenous regressors and selection bias. Journal of Econometrics 90, 239–263. Verbeek, M., Nijman, T., 1992. Testing for selectivity bias in panel data models. International Economic Review 33 (3), 681–703. Wooldridge, J.M., 1995. 
Selection corrections for panel data models under conditional mean independence assumptions. Journal of Econometrics 68, 115–132. Wooldridge, J., 2001. A framework for estimating dynamic, unobserved effects panel data models with possible feedback to future explanatory variables. Economics Letters 68, 245–250. Wooldridge, J.M., 2002. Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge. Woutersen, T.M., 2002. Robustness against incidental parameters. Unpublished Manuscript, University of Western Ontario.
Journal of Econometrics 163 (2011) 163–171
Nonparametric identification of a binary random factor in cross section data

Yingying Dong, Arthur Lewbel ∗

California State University Fullerton and Boston College, United States

Article info

Article history: Received 22 January 2010. Received in revised form 13 March 2011. Accepted 16 March 2011. Available online 23 March 2011.

JEL classification: C25, C21

Keywords: Mixture model; Random effects; Binary; Unobserved factor; Unobserved regressor; Nonparametric identification; Deconvolution; Treatment

Abstract

Suppose V and U are two independent mean zero random variables, where V has an asymmetric distribution with two mass points and U has some zero odd moments (having a symmetric distribution suffices). We show that the distributions of V and U are nonparametrically identified just from observing the sum V + U, and provide a pointwise rate root n estimator. This can permit point identification of average treatment effects when the econometrician does not observe who was treated. We extend our results to include covariates X, showing that we can nonparametrically identify and estimate cross section regression models of the form Y = g(X, D∗) + U, where D∗ is an unobserved binary regressor. © 2011 Elsevier B.V. All rights reserved.
∗ Corresponding address: Department of Economics, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA, 02467, United States. Tel.: +1 617 552 3678; fax: +1 617 552 2308. E-mail address: [email protected] (A. Lewbel). URL: http://www2.bc.edu/~lewbel/ (A. Lewbel).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.03.003

1. Introduction

We propose a method of nonparametrically identifying and estimating cross section regression models that contain an unobserved binary regressor or treatment, or equivalently an unobserved random effect that can take on two values. For example, suppose an experiment (natural or otherwise) with random or exogenous assignment to treatment was performed on some population, but we only have survey data collected in the region where the experiment occurred, and this survey does not report which (or even how many) individuals were treated. Then, given our assumptions, we can point identify the average treatment effect in this population and the probability of treatment, despite not observing who was treated. No instruments or proxies for the unobserved binary regressor or treatment need to be observed. Identification is obtained by assuming that the unobserved exogenously assigned treatment or binary regressor effect is a location shift of the observed outcome, and that the regression or conditional outcome errors have zero
low order odd moments (a sufficient condition for which is symmetrically distributed errors). These identifying assumptions provide moment conditions that can be used to construct either an ordinary generalized method of moments (GMM) estimator or, in the presence of covariates, a nonparametric local GMM estimator for the model. The zero low order odd moments used for identification here can arise in a number of contexts. Normal errors are of course symmetric and so have all odd moments equal to zero, and normality arises in many models such as those involving central limit theorems, e.g., Gibrat's law for wage or income distributions. Differences of independently, identically distributed errors, or more generally of exchangeable errors such as those following ARMA processes, are also symmetrically distributed (see, e.g., Proposition 1 of Honoré, 1992). So, e.g., two period panel models with fixed effects and ARMA errors will generally have errors that are symmetric after time differencing. Our results could therefore be applied in a two period panel where individuals can have an unobserved mean shift at any time (corresponding to the unobserved binary regressor), fixed effects (which are differenced away) and exchangeable remaining errors (which yield symmetric errors after differencing). Below we give other more specific examples of models with the required odd moments being zero.

Ignoring covariates for the moment, suppose Y = h + V + U, where V and U are independent mean zero random variables and
Y. Dong, A. Lewbel / Journal of Econometrics 163 (2011) 163–171
h is a constant. The random V equals either b0 or b1 with unknown probabilities p and 1 − p respectively, where p does not equal a half, i.e., V is asymmetrically distributed. U is assumed to have its first few odd moments equal to zero. We observe a sample of observations of the random variable Y , and so can identify the marginal distribution of Y , but we do not observe h, V , or U. We first show that the constant h and the distributions of V and U are nonparametrically identified just from observing Y . The only regularity assumption required is that some higher moments of Y exist. More precisely, the first three odd moments of U must be zero (and so also exist for Y ) for local identification, while having the first five odd moments of U equal to zero suffices for global identification. We also provide estimators for the distributions of V and U. We show that the constant h, the probability mass function of V , moments of the distribution of U, and points of the distribution function of U can all be estimated using GMM. Unlike common deconvolution estimators that can converge at slow rates, we estimate the distributions of V and U, and the density of U (if it is continuous) at the same rates of convergence as if V and U were separately observed, instead of just observing their sum. We do not assume that the supports of V or U are known, so estimation of the distribution of V means identifying and estimating both of its support points b0 and b1 , as well as the probabilities p and 1 − p, respectively, of V equaling b0 or b1 . One can write V as V = b1 D∗ + b0 (1 − D∗ ), where D∗ is an unobserved binary indicator. For example, if D∗ is the unobserved indicator of exogenously assigned treatment, then b1 − b0 is the average treatment effect, p is the probability of treatment, and U describes the remaining heterogeneity of outcomes Y (here treatment is assumed to only cause a shift in outcome means). 
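A small simulation can make the role of the zero odd moments concrete. Because V and U are independent, E[V] = 0, and E[U] = E[U³] = 0, the third central moment of Y equals E[V³], so the asymmetry of V is visible in Y alone. The sketch below checks this numerically; the parameter values and the helper names `two_point_moment` and `simulate_y` are illustrative assumptions, and this is not the authors' GMM estimator.

```python
import numpy as np

def two_point_moment(b0, p, k):
    """k-th moment of a mean-zero two-point V: b0 w.p. p, b1 w.p. 1 - p."""
    # Hypothetical helper: b1 is pinned down by the mean-zero restriction
    # p * b0 + (1 - p) * b1 = 0.
    b1 = -p * b0 / (1.0 - p)
    return p * b0**k + (1.0 - p) * b1**k

def simulate_y(n, h, b0, p, sigma_u, rng):
    """Draw Y = h + V + U with two-point V and symmetric (normal) U."""
    b1 = -p * b0 / (1.0 - p)
    v = np.where(rng.random(n) < p, b0, b1)
    u = rng.normal(0.0, sigma_u, size=n)  # normal U: all odd moments are zero
    return h + v + u

rng = np.random.default_rng(0)
# Assumed illustrative values: h = 2, b0 = 1, p = 0.8 (so b1 = -4).
y = simulate_y(400_000, h=2.0, b0=1.0, p=0.8, sigma_u=1.0, rng=rng)

# Sample third central moment of Y, to be compared with the analytic E[V^3].
third_central = np.mean((y - y.mean()) ** 3)
ev3 = two_point_moment(1.0, 0.8, 3)
print(third_central, ev3)
```

With these assumed values E[V³] = 0.8·1 + 0.2·(−64) = −12, and the sample third central moment of Y should be close to that number up to sampling error. If p were one half, V would be symmetric and this moment would carry no information, which is one way to see why the asymmetry of V matters.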
We also show how these results can be extended to allow for covariates. If h depends on a vector of covariates X while V and U are independent of X , then we obtain the random effects regression model Y = h(X )+ V + U, which is popular for panel data, but which we identify and estimate just from cross section data. More generally, we allow both h and the distributions of V and U to depend in unknown ways on X . This is equivalent to nonparametric identification and estimation of a regression model containing an unobserved binary regressor. The regression model is Y = g (X , D∗ ) + U, where g is an unknown function, D∗ is an unobserved binary regressor (or unobserved indicator of treatment) that equals zero with unknown probability p(X ) and one with probability 1 − p(X ), while U is a random error with an unknown conditional distribution FU (U | X ) having its first few odd moments equal to zero (conditional symmetry conditioning on X suffices). The unobserved random variables U and D∗ are assumed to be conditionally independent, conditioning upon X . By defining h(x) = E (Y | X = x) = E [g (X , D∗ ) | X = x], V = g (X , D∗ ) − h(X ) and U = Y − h(X )− V , this regression model can then be rewritten as Y = h(X ) + V + U, where h(x) is a nonparametric regression function of Y on X , and the two support points of V conditional on X = x are then bd (x) = g (x, d) − h(x) for d = 0, 1. Kitamura (unpublished manuscript) provides some nonparametric identification results for this model by placing constraints on how the distributions can depend upon X , while we place no such restrictions on the distribution of X and instead restrict the shape of the distribution of U. The assumptions we impose on U in Y = g (X , D∗ ) + U are common assumptions in regression models, e.g., they allow the error U to be heteroskedastic with respect to X , and they hold, e.g., if U given X is normal (though normality is not required). 
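The rewriting of the regression model as Y = h(X) + V + U can be checked directly from the definitions above. The short derivation below uses only the stated assumption that D∗ equals zero with probability p(x) and one with probability 1 − p(x), conditional on X = x:

```latex
\begin{align*}
h(x) &= p(x)\,g(x,0) + \bigl(1-p(x)\bigr)\,g(x,1),\\
b_0(x) &= g(x,0) - h(x) = -\bigl(1-p(x)\bigr)\,\bigl[g(x,1) - g(x,0)\bigr],\\
b_1(x) &= g(x,1) - h(x) = p(x)\,\bigl[g(x,1) - g(x,0)\bigr],\\
p(x)\,b_0(x) + \bigl(1-p(x)\bigr)\,b_1(x) &= 0,
\qquad b_1(x) - b_0(x) = g(x,1) - g(x,0).
\end{align*}
```

So V has conditional mean zero given X, and the gap between its two support points equals g(x, 1) − g(x, 0).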
Also, regression model errors U are sometimes interpreted as measurement error in Y , and measurement errors are often assumed to be symmetric. If D∗ is an unobserved treatment indicator, then g (X , 1) − g (X , 0) is the conditional average treatment effect, which may
be averaged over X to obtain an unconditional average treatment effect. Symmetry of errors is not usually assumed for treatment models, but suppose we have panel data (two periods of observations) and all treatments occur in one of the two periods. Then, as noted above, the required symmetry of U errors would result automatically from time differencing the data, given the standard panel model assumption of individual specific fixed effects plus independently, identically distributed (or more generally ARMA or other exchangable) errors. Another possible application of these extensions is a stochastic frontier model, where Y is the log of a firm’s output, X are factors of production, and D∗ indicates whether the firm operates efficiently at the frontier, or inefficiently. Existing stochastic frontier models obtain identification either by assuming parametric functional forms for both the distributions of V and U, or by using panel data and assuming that each firm’s individual efficiency level is a fixed effect that is constant over time. See, e.g., Kumbhakar et al. (2007) and Simar and Wilson (2007). In contrast, our assumptions and associated estimators could be used to estimate a nonparametric stochastic frontier model using cross section data, given the restriction that unobserved efficiency is indexed by a binary D∗ . Note that virtually all stochastic frontier models based on cross section data assume U given X is symmetrically distributed. Dong (forthcoming) empirically estimates a model where Y = h(X ) + V + U, based on symmetry of U and using moments similar to the exponentials we suggest in an extension section. Our results formally prove identification of Dong’s model, and our estimator is more general in that it allows V and the distribution of U to depend in arbitrary ways on X . 
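The earlier remark that time differencing produces symmetric errors can also be illustrated numerically. The sketch below draws two independent, identically distributed but strongly skewed error terms and compares the third central moment of one of them with that of their difference. The exponential distribution and the helper name `third_central_moment` are assumptions chosen for illustration only.

```python
import numpy as np

def third_central_moment(x):
    """Sample third central moment of a 1-D array."""
    return np.mean((x - x.mean()) ** 3)

rng = np.random.default_rng(1)
n = 400_000

# Two i.i.d. skewed errors, e.g. period-1 and period-2 errors in a panel.
e1 = rng.exponential(1.0, size=n)
e2 = rng.exponential(1.0, size=n)

# For Exp(1) the population third central moment is 2.
skew_level = third_central_moment(e1)
# The difference of i.i.d. draws is symmetric about zero, so the
# population third central moment of e2 - e1 is 0.
skew_diff = third_central_moment(e2 - e1)
print(skew_level, skew_diff)
```

The level errors are visibly skewed while the differenced errors are not, which is the property the two period panel application relies on.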
Hu and Lewbel (2008) also nonparametrically identify some features of a model containing an unobserved binary regressor, using either a type of instrumental variable or an assumption of conditional independence of low order moments. Models that allocate individuals into various types, as D∗ does, are common in the statistics and marketing literatures. Examples include cluster analysis, latent class analysis, and mixture models (see, e.g., Clogg, 1995 and Hagenaars and McCutcheon, 2002). Our model resembles a finite (two distribution) mixture model, but differs crucially in that, for identification, finite mixture models usually require the distributions being mixed to be parametrically specified, while in our model U is nonparametric. While general mixture models are more flexible than ours in allowing for more than two groups and permitting the U distribution to vary across groups, ours is more flexible in letting U be nonparametric, essentially allowing for an infinite number of parameters versus finitely parameterized mixtures. Some mixture models can be nonparametrically identified by observing draws of vectors of data, where the number of elements of the observed vectors is larger than the number of distributions being mixed. Examples include Hall and Zhou (2003) and Kasahara and Shimotsu (2009). In contrast, we obtain identification with a scalar Y . As noted above, Kitamura (unpublished manuscript) also obtains nonparametric identification with a scalar Y , but does so by requiring observation of a covariate that affects the component distributions with some restrictions. Another closely related mixture model result is Bordes et al. (2006), who impose strictly stronger conditions than we do, including that U is symmetric and continuously distributed. Also related is the literature on mismeasured binary regressors, where identification generally requires instruments. An exception is Chen et al. (2008). 
Like our Theorem 1, they exploit error symmetry for identification, but unlike this paper they assume that the binary regressor is observed, though with some measurement (classification) error, instead of being completely unobserved. A more closely related result is Heckman and Robb (1985), who like us use zero low order odd moments to identify a binary effect,
Y. Dong, A. Lewbel / Journal of Econometrics 163 (2011) 163–171
though theirs is a restricted effect that is strictly nested in our results. Error symmetry has also been used to obtain identification in a variety of other econometric contexts, e.g., Powell (1986). There are a few common ways of identifying the distributions of random variables given just their sum. One method of identification assumes that the exact distribution of one of the two errors is known a priori (e.g., from a validation sample as is common in the statistics literature on measurement error; see, e.g., Carroll et al., 2006) and uses deconvolution to obtain the distribution of the other one. For example, if U were normal, one would need to know a priori its mean and variance to estimate the distribution of V . A second standard way to obtain identification is to parameterize both the distributions of V and U, as in most of the latent class literature or in the stochastic frontier literature (see, e.g., Kumbhakar and Lovell, 2000), where a typical parameterization is to have V be log normal and U be normal. Panel data models often have errors of the form V + U that are identified either by imposing specific error structures or by assuming one of the errors is fixed over time (see, e.g., Baltagi, 2008 for a survey of random effects and fixed effects panel data models). Past nonparametric stochastic frontier models have similarly required panel data for identification, as described above. In contrast to all these identification methods, in our model both U and V have unknown distributions, and no panel data are required. The next section contains our main identification result. We then provide moment conditions for estimating the model, including the distribution of V (its support points and the associated probability mass function), using ordinary GMM. Next we give estimators for the distribution and density function of U. 
We provide a Monte Carlo analysis showing that our estimator performs reasonably well even compared to infeasible maximum likelihood estimation. This is followed by some extensions showing how our identification and estimation methods can be augmented to provide additional moments for estimation, and to allow for covariates. Proofs are given in the Appendix.
2. Identification

In this section, we provide our general identification results. Later we extend these results to include covariates X.

Assumption A1. Let Y = h + V + U. Assume the distribution of V is mean zero, asymmetric, and has exactly two points of support. Assume E(U^d | V) = E(U^d) exists for all positive integers d ≤ 9, and E(U^(2d−1)) = 0 for all positive integers d ≤ 5.

Theorem 1. Let Assumption A1 hold, and assume the distribution of Y is identified. Then the constant h and the distributions of V and U are identified.

Let b0 and b1 denote the two support points of the distribution of V, where without loss of generality b0 < b1, and let p be the probability that V = b0, so 1 − p is the probability that V = b1. Identification of the distribution of V by Theorem 1 means identification of b0, b1, and p. If Y is the outcome of a treatment and D∗ denotes an unobserved treatment indicator, then we can define V = b1 D∗ + b0 (1 − D∗), and we will have identification of the probability of treatment p and identification of the average treatment effect b1 − b0 (which by mean independence of V and U will also equal the average treatment effect on the treated). Assumption A1 says that the first nine moments of U conditional on V are the same as the moments that would arise if U were distributed symmetrically and independent of V. The only regularity condition required for identification of the distribution of V in Theorem 1 is existence of E(Y^9). The proof of Theorem 1 also shows identification up to a finite set, and hence local identification of the distribution of V, assuming just E(U) = E(U^3) = E(U^5) = 0 and E(U^d | V) = E(U^d) for positive integers d ≤ 5, thereby only needing existence of E(Y^5). Higher moments going up to the ninth moment are required only to distinguish amongst the elements in the finite identified set and thereby provide global identification.

Note that Theorem 1 assumes asymmetry of V (since otherwise it would be indistinguishable from U) and more than one point of support (since otherwise it would be indistinguishable from h), which is equivalent to requiring that p not be exactly equal to zero, one, or one half. This suggests that the identification and associated estimation will be weak if the actual p is very close to any of these values. In practice, it would be easy to tell if this problem exists, because if it did then the observed Y would itself be close to symmetrically distributed. Applying a formal test of data symmetry such as Ahmed and Li (1997) to the Y data is equivalent in our model to testing if p equals zero, one, or one half. More simply, one might look for substantial asymmetry in a histogram or kernel density estimate of Y.

We next consider estimation of h, b0, b1, and p, and then later show how the rest of the model, i.e., the distribution function of U, can be estimated.

3. Estimation

Our estimator will take the form of the standard generalized method of moments (GMM, as in Hansen, 1982), since given data Y1, . . . , Yn, we will below construct a set of moments of the form E[G(Y, θ)] = 0, where G is a set of known functions and the vector θ consists of the parameters of interest h, b0, p, as well as u2, u4, and u6, where ud = E(U^d). The parameters u2, u4, and u6 are nuisance parameters for estimating the V distribution, but in some applications they may be of interest as summary measures of the distribution of U. Let vd = E(V^d). Then v1 = E(V) = b0 p + b1 (1 − p) = 0, so

$$ b_1 = \frac{b_0 p}{p - 1} \tag{1} $$

and, therefore,

$$ v_d = E(V^d) = b_0^d\, p + \left( \frac{b_0 p}{p - 1} \right)^d (1 - p). \tag{2} $$
Now expand the expression E[(Y − h)^d − (V + U)^d] = 0 for integers d, noting by Assumption A1 that the first five odd moments of U are zero. The results are

$$ E(Y - h) = 0 \tag{3} $$
$$ E\big((Y - h)^2 - (v_2 + u_2)\big) = 0 \tag{4} $$
$$ E\big((Y - h)^3 - v_3\big) = 0 \tag{5} $$
$$ E\big((Y - h)^4 - (v_4 + 6 v_2 u_2 + u_4)\big) = 0 \tag{6} $$
$$ E\big((Y - h)^5 - (v_5 + 10 v_3 u_2)\big) = 0 \tag{7} $$
$$ E\big((Y - h)^6 - (v_6 + 15 v_4 u_2 + 15 v_2 u_4 + u_6)\big) = 0 \tag{8} $$
$$ E\big((Y - h)^7 - (v_7 + 21 v_5 u_2 + 35 v_3 u_4)\big) = 0 \tag{9} $$
$$ E\big((Y - h)^9 - (v_9 + 36 v_7 u_2 + 126 v_5 u_4 + 84 v_3 u_6)\big) = 0. \tag{10} $$
Substituting Eq. (2) into Eqs. (3)–(10) gives eight moments we can write as E [G(Y , θ )] = 0 in the six unknown parameters θ = (h, b0 , p, u2 , u4 , u6 ), which we use for estimation via GMM. We use these moments because the proof of Theorem 1 shows that these particular eight equations are exactly those required to point identify the parameters defining the distribution of V . As shown in the proof, more equations than unknowns are required for global identification because of the nonlinearity of these equations, and in particular the presence of multiple roots. Given an estimate of θ , the estimate of b1 is then obtained by Eq. (1).
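As an illustration only (this is not the authors' code), the eight moment functions obtained by substituting Eq. (2) into Eqs. (3)–(10) could be coded as in the following sketch. The function names `v_moment`, `moment_functions`, and `gmm_objective` are hypothetical, and the identity default for the weighting matrix is an assumption made here for a first GMM step.

```python
import numpy as np

def v_moment(d, b0, p):
    """d-th moment of the two-point V, with b1 = b0*p/(p-1) from Eq. (1)."""
    b1 = b0 * p / (p - 1.0)
    return b0**d * p + b1**d * (1.0 - p)

def moment_functions(y, theta):
    """Return the n x 8 array G(Y_i, theta) implied by Eqs. (3)-(10)."""
    h, b0, p, u2, u4, u6 = theta
    e = np.asarray(y, dtype=float) - h
    v = {d: v_moment(d, b0, p) for d in range(2, 10)}
    # Model-implied value of E[(Y - h)^d] for d = 1,...,7 and d = 9.
    implied = {
        1: 0.0,
        2: v[2] + u2,
        3: v[3],
        4: v[4] + 6 * v[2] * u2 + u4,
        5: v[5] + 10 * v[3] * u2,
        6: v[6] + 15 * v[4] * u2 + 15 * v[2] * u4 + u6,
        7: v[7] + 21 * v[5] * u2 + 35 * v[3] * u4,
        9: v[9] + 36 * v[7] * u2 + 126 * v[5] * u4 + 84 * v[3] * u6,
    }
    return np.column_stack([e**d - m for d, m in implied.items()])

def gmm_objective(theta, y, weight=None):
    """Quadratic form in the sample moment means (identity weight by default)."""
    g_bar = moment_functions(y, theta).mean(axis=0)
    w = np.eye(g_bar.size) if weight is None else weight
    return float(g_bar @ w @ g_bar)
```

The objective could then be passed to any numerical minimizer over θ = (h, b0, p, u2, u4, u6); the choice of optimizer and starting values is left open here.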
Based on Theorem 1, for this estimator we assume that Y1 , . . . , Yn are identically distributed (or more precisely, have identical first nine moments); however, the Y observations do not need to be independent, since GMM estimation theory permits some serial dependence in the data. Standard GMM limiting distribution theory applied to our moments provides root n consistent, asymptotically normal estimates of θ and hence of h and of the distribution of V (i.e., the support points b0 and b1 and the probability p, where b1 is obtained by b1 = b0 p/( p − 1) from Eq. (1)). Without loss of generality we have imposed b0 < b1 (if this is violated then the definitions of these two parameters can be switched to make the inequality hold), and this along with E (V ) = 0 implies that b0 is negative and b1 is positive, which may be imposed on estimation. One could also impose that p lie between zero and one, and that u2 , u4 , and u6 be positive. One might anticipate poor empirical results, and great sensitivity to outliers or extreme observations of Y , given the use of such high order moments for estimation. For example, Altonji and Segal (1996) document bias in GMM estimates associated with just second order moments. However, we found that these problems rarely arose in our Monte Carlo simulations (in applications one would want to carefully scale Y to avoid the effects of computer rounding errors based on inverting matrix entries of varying orders of magnitude). We believe the reason the estimator performs reasonably well is that only lower order moments are required for local identification, up to a small finite set of values. The higher order moments (specifically those above the fifth) are only needed for global identification to distinguish between these few possible multiple solutions of the low order polynomials. 
For example, if by the low order moments b1 was identified up to a value in the neighborhood of either 1 or of 3, then even poorly estimated higher moments could succeed in distinguishing between these two neighborhoods, by having a sample GMM moment mean be substantially closer to zero in the neighborhood of one value rather than the other. In an extension section we describe how additional moments could be constructed for estimation based on symmetry of U. These alternative moments might be employed in applications where the polynomial based moments are found to be problematic. Another possibility would be to Winsorize extreme observations of the Y data prior to estimation to robustify the higher moment estimates.
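The two practical safeguards mentioned above, scaling Y and Winsorizing extreme observations, might be sketched as follows. This is a hedged illustration: the helper names `winsorize` and `standardize` are hypothetical, and the 1% and 99% cutoffs are arbitrary assumptions rather than values taken from the paper.

```python
import numpy as np

def winsorize(y, lower=0.01, upper=0.99):
    """Clip y at its empirical lower and upper quantiles (illustrative cutoffs)."""
    y = np.asarray(y, dtype=float)
    lo, hi = np.quantile(y, [lower, upper])
    return np.clip(y, lo, hi)

def standardize(y):
    """Center and scale y so that its high powers stay on comparable scales."""
    y = np.asarray(y, dtype=float)
    mu, sd = y.mean(), y.std()
    return (y - mu) / sd, mu, sd
```

If estimation is run on the standardized data, estimates would need to be mapped back to the original scale afterwards: location parameters shift by `mu`, support points scale by `sd`, and the d-th moment of U scales by `sd**d`, while p is unchanged.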
4. The distribution of U

For any random variable Z, let FZ denote the marginal cumulative distribution function of Z. Define ε = V + U. Define FKU(u) by

$$ F_{KU}(u) = \sum_{k=0}^{K-1} \frac{1}{p} \left( -\frac{1-p}{p} \right)^{k} F_\varepsilon\big(u + b_0 + (b_0 - b_1) k\big) \quad \text{if } p > 1/2, $$

otherwise

$$ F_{KU}(u) = \sum_{k=0}^{K-1} \frac{1}{1-p} \left( -\frac{p}{1-p} \right)^{k} F_\varepsilon\big(u + b_1 + (b_1 - b_0) k\big) \quad \text{if } p < 1/2. $$

The proof of Theorem 1 shows that FU(u) = FKU(u) + RK, where

$$ 0 \le R_K \le \min\left\{ \left( \frac{1-p}{p} \right)^{K}, \left( \frac{p}{1-p} \right)^{K} \right\}, $$

so the remainder term RK → 0 as K → ∞ (since p = 1/2 is ruled out). This suggests that FU could be estimated by FKU(u) after replacing Fε with the empirical distribution of Y − ĥ, replacing b0, b1, and p with their estimates, and letting K → ∞ as N → ∞.

However, under the assumption that U is symmetrically distributed, the following theorem provides a more convenient way to estimate the distribution function of U. Define

$$ \Psi(u) = \frac{[F_\varepsilon(-u + b_0) - 1]\, p + F_\varepsilon(u + b_1)(1 - p)}{1 - 2p}. \tag{11} $$

Theorem 2. Let Assumption A1 hold. Assume U is symmetrically distributed. Then

$$ F_U(u) = \frac{\Psi(u) - \Psi(-u) + 1}{2}. \tag{12} $$

Theorem 2 provides a direct expression for the distribution of U in terms of b0, b1, p and the distribution of ε, all of which are previously identified. This can be used to construct an estimator for FU(u) as follows. Let I(·) denote the indicator function that equals one if · is true and zero otherwise, and let θ be a vector containing h, b0, b1, and p. Define the function ω(Y, u, θ) by

$$ \omega(Y, u, \theta) = \frac{[I(Y \le h - u + b_0) - 1]\, p + I(Y \le h + u + b_1)(1 - p)}{1 - 2p}. \tag{13} $$

Then using Y = h + ε it follows immediately from Eq. (11) that

$$ \Psi(u) = E\big(\omega(Y, u, \theta)\big). \tag{14} $$

An estimator for FU(u) can now be constructed by replacing the parameters in Eq. (14) with estimates, replacing the expectation with a sample average, and plugging the result into Eq. (12). The resulting estimator is

$$ \hat F_U(u) = \frac{1}{n} \sum_{i=1}^{n} \frac{\omega(Y_i, u, \hat\theta) - \omega(Y_i, -u, \hat\theta) + 1}{2}. \tag{15} $$

Alternatively, FU(u) for a finite number of values of u, say u1, . . . , uJ, can be estimated as follows. Recall that E[G(Y, θ)] = 0 was used to estimate the parameters h, b0, b1, p by GMM. For notational convenience, let ηj = FU(uj) for each uj. Then by Eqs. (12) and (14)

$$ E\left[ \eta_j - \frac{\omega(Y, u_j, \theta) - \omega(Y, -u_j, \theta) + 1}{2} \right] = 0. \tag{16} $$

Adding Eq. (16) for j = 1, . . . , J to the set of functions defining G, including η1, . . . , ηJ in the vector θ, and then applying GMM to this augmented set of moment conditions E[G(Y, θ)] = 0 simultaneously yields root n consistent, asymptotically normal estimates of h, b0, b1, p and ηj = FU(uj) for j = 1, . . . , J. An advantage of this approach versus Eq. (15) is that GMM limiting distribution theory then provides standard error estimates for each FU(uj). While p is the unconditional probability that V = b0, given FU it is straightforward to estimate conditional probabilities as well. In particular,
$$ \Pr(V = b_0 \mid Y \le y) = \frac{\Pr(V = b_0,\, Y \le y)}{\Pr(Y \le y)} = \frac{p\, F_U(y - h - b_0)}{F_y(y)}, $$

where the second equality uses Pr(V = b0, Y ≤ y) = p FU(y − h − b0), which holds when U is independent of V. This could be estimated as p̂ F̂U(y − ĥ − b̂0)/F̂y(y), where F̂y is the empirical distribution of Y. Let fZ denote the probability density function of any continuously distributed random variable Z. So far no assumption has been made about whether U is continuous or discrete. However, if U is
continuous, then ε and Y are also continuous, and then taking the derivative of Eqs. (11) and (12) with respect to u gives

$$ \psi(u) = \frac{-f_\varepsilon(-u + b_0)\, p + f_\varepsilon(u + b_1)(1 - p)}{1 - 2p}, \qquad f_U(u) = \frac{\psi(u) + \psi(-u)}{2}, \tag{17} $$

which suggests the estimators

$$ \hat\psi(u) = \frac{-\hat f_\varepsilon(-u + \hat b_0)\, \hat p + \hat f_\varepsilon(u + \hat b_1)(1 - \hat p)}{1 - 2\hat p}, \tag{18} $$

$$ \hat f_U(u) = \frac{\hat\psi(u) + \hat\psi(-u)}{2}, \tag{19} $$

where f̂ε(ε) is a kernel density or other estimator of fε(ε), constructed using data ε̂i = Yi − ĥ for i = 1, . . . , n. Since densities converge at slower than rate root n, the limiting distribution of this estimator will generally be the same as if h, b0, b1, and p were evaluated at their true values (e.g., this holds if f̂ε is differentiable, by a mean value expansion of ψ̂ around the true values of h, b0, b1, and p). The above f̂U(u) is just the weighted sum of two density estimators, each one dimensional, and so will converge at the same rate as a one dimensional density estimator. For example, this will be the pointwise rate n^(−2/5) using a kernel density estimator of fε under standard assumptions as in Silverman (1986) (independent observations, fε twice differentiable, evaluated at points not on the boundary of the support of ε, bandwidth proportional to n^(−1/5), and a second order kernel function) with p bounded away from 1/2. It is possible for f̂U(u) to be negative in finite samples, so if desired one could replace negative values of f̂U(u) with zero.

A potential numerical problem is that Eq. (18) may require evaluating f̂ε at a value that is outside the range of observed values of ε̂i. Since both ψ̂(u) and ψ̂(−u) are consistent estimators of fU(u) (though generally less precise than Eq. (19) because they individually ignore the symmetry constraint), one could use either ψ̂(u) or ψ̂(−u) instead of their average to estimate fU(u) whenever ψ̂(−u) or ψ̂(u), respectively, requires evaluating f̂ε at a point outside the range of observed values of ε̂i.

This construction also suggests a specification test for the model. Since symmetry of U implies that ψ(u) = ψ(−u), one could base a test on whether

$$ \int_0^L [\hat\psi(u) - \hat\psi(-u)]^2\, w(u)\, du = 0, $$

where w(u) is a weighting function that integrates to one, and L is in the range of values for which neither ψ̂(u) nor ψ̂(−u) requires evaluating f̂ε at a point outside the range of observed values of ε̂i. The limiting distribution theory for this type of test statistic (a degenerate U statistic under the null) based on functions of kernel densities is standard, and in this case would closely resemble Ahmed and Li (1997).

5. Monte Carlo analysis

Our Monte Carlo design takes h = 0, U standard normal, and −b0 p = b1 (1 − p) = 1. We consider three different values of p between one half and one, specifically 0.6, 0.8, and 0.95. By symmetry of this design, we should obtain the same results in terms of accuracy if we took p equal to 0.4, 0.2, and 0.05, respectively. The nuisance parameters, derived from the distribution of U, are then u2 = 1, u4 = 3, and u6 = 15. Our sample size is n = 1000, and for each design we perform 1000 Monte Carlo replications. Each draw of Y in each replication is constructed by drawing an observation of V and one of U from their above described distributions and then summing the two.

Table 1. p = 0.6.

Parameter         b1       b0       p       u2      u4      u6
True              2.500   −1.667   0.600   1.000   3.000   15.000
Median            2.491   −1.657   0.601   0.983   2.786   12.309
Mean              2.374   −1.578   0.586   1.234   4.846   34.142
Stddev            0.506    0.347   0.099   0.925   7.946   90.287
Root MSE          0.522    0.358   0.100   0.954   8.154   92.250
Median ABS ERR    0.068    0.057   0.013   0.049   0.380    3.927
Mean ABS ERR      0.201    0.146   0.034   0.302   2.399   25.013
25% quantile      2.422   −1.712   0.588   0.942   2.518    9.973
75% quantile      2.561   −1.599   0.614   1.034   3.106   15.210

Table 2. p = 0.8.

Parameter         b1       b0       p       u2      u4      u6
True              5.000   −1.250   0.800   1.000   3.000   15.000
Median            4.955   −1.242   0.799   1.007   2.858   13.222
Mean              4.601   −1.267   0.756   1.018   3.228   17.151
Stddev            1.235    0.402   0.153   0.368   2.657   26.392
Root MSE          1.298    0.403   0.159   0.368   2.665   26.466
Median ABS ERR    0.094    0.065   0.009   0.065   0.337    3.106
Mean ABS ERR      0.459    0.175   0.054   0.164   0.884    7.905
25% quantile      4.871   −1.305   0.789   0.946   2.548   10.549
75% quantile      5.036   −1.178   0.808   1.078   3.169   15.951

Table 3. p = 0.95.

Parameter         b1       b0       p       u2      u4      u6
True             20.000   −1.053   0.950   1.000   3.000   15.000
Median           19.948   −1.051   0.950   0.988   2.890   14.997
Mean             19.852   −1.051   0.947   0.987   2.852   14.501
Stddev            1.404    0.161   0.043   0.122   0.529   16.208
Root MSE          1.411    0.161   0.043   0.123   0.549   16.208
Median ABS ERR    0.142    0.103   0.005   0.083   0.315    0.010
Mean ABS ERR      0.265    0.123   0.008   0.096   0.397    4.348
25% quantile     19.816   −1.151   0.945   0.903   2.527   14.946
75% quantile     20.083   −0.943   0.955   1.064   3.121   15.002

In each simulated data set we first estimated h as the sample mean of Y, then performed standard two step GMM (using the identity matrix as the weighting matrix in the first step) with the
moments Eqs. (4)–(10). We imposed the inequality constraints on estimation that p lie between zero and one and that b1, u2, u4, and u6 are positive. Rarely, this GMM either failed to converge after many iterations, or iterated towards one of these boundary points. When this happened, we applied two step GMM to just the low order (sufficient for local identification) moments (4), (5) and (7), and then used the results as starting values for two step GMM using all the moments (4)–(10). We could have alternatively performed a more time consuming grid search, but this procedure led to estimates that converged to interior points in all but a handful of simulations. Specifically, in fewer than one half of one percent of replications this procedure either failed to converge or produced estimates of p that approached the boundaries of zero or one. We drop these few failed replications from our reported results. The results are reported in Tables 1–3. The parameters of the distribution of V (b0, b1, and p) are estimated with reasonable accuracy, having relatively small root mean squared errors and interquartile ranges. These parameters are very close to median unbiased, but have mean bias of a few percent, with p always mean biased downwards. This is likely because estimates of p are more or less equally likely to be above or below the true value (yielding very small median bias), but when they are below the true value they can be much further from the truth than when they are too high, e.g., when p = 0.8 an estimate that is too high can be biased by at most 0.2, while the downward bias can be as large as −0.8. Some fraction of
these replications may be centering around incorrect roots of the polynomial moments, which can be quite distant from the correct roots. The high order nuisance parameters u4 and u6 are sometimes estimated very poorly. In particular, for p = 0.6 the median bias of u6 is almost −20% while the mean bias is over five times larger and has the opposite sign of the median bias. The fact that the high order moment nuisance parameters are generally much more poorly estimated than the parameters of interest supports our claim that low order moments are providing most of the parameter estimation precision, while higher order moments mainly serve to distinguish among discretely separated local alternatives. We performed limited experiments with alternative sample sizes, which are not reported to save space. Precision increases with sample size pretty much as one would expect. More substantial is the frequency with which numerical problems were encountered, e.g., at n = 500 we encountered convergence or boundary problems in 1.4% of replications while these problems were almost nonexistent at n = 5000.
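For readers who wish to reproduce the data generating process of this design, one draw of a simulated sample could be generated as in the sketch below. The function name `simulate_design`, the seed handling, and the use of NumPy's default generator are assumptions made for illustration; only the design itself (h = 0, U standard normal, −b0 p = b1 (1 − p) = 1) comes from the text.

```python
import numpy as np

def simulate_design(p, n=1000, seed=0):
    """Draw Y = V + U with h = 0, U ~ N(0, 1) and -b0*p = b1*(1 - p) = 1."""
    rng = np.random.default_rng(seed)
    b0 = -1.0 / p           # support point taken with probability p
    b1 = 1.0 / (1.0 - p)    # support point taken with probability 1 - p
    v = np.where(rng.random(n) < p, b0, b1)
    u = rng.standard_normal(n)
    return v + u, b0, b1
```

With p = 0.6, 0.8, and 0.95 this gives (b0, b1) of approximately (−1.667, 2.5), (−1.25, 5), and (−1.053, 20), matching the true values listed in Tables 1–3.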
6. Extension 1: additional moments

Here we provide additional moments that might be used for estimating the parameters h, b0, b1 and p.

Proposition 1. Let Y = h + V + U. Assume the distribution of V is mean zero, asymmetric, and has exactly two points of support. Assume U is symmetrically distributed around zero and is independent of V. Assume E[exp(TU)] exists for some positive constant T. Then for any positive τ ≤ T there exists a constant ατ such that the following two equations hold with r = p/(1 − p):

$$ E\big[\exp(\tau (Y - h)) - (r \exp(\tau b_0) + \exp(-\tau b_0 r))\, \alpha_\tau\big] = 0 \tag{20} $$
$$ E\big[\exp(-\tau (Y - h)) - (r \exp(-\tau b_0) + \exp(\tau b_0 r))\, \alpha_\tau\big] = 0. \tag{21} $$

The second exponential in each equation is exp(τ b1) and exp(−τ b1), respectively, written using b1 = −b0 r, which follows from Eq. (1).
Given a set of L positive values for τ , i.e., constants τ1 , . . . , τL , each of which is less than T , Eqs. (20) and (21) provide 2L moment conditions satisfied by the set of L + 3 parameters ατ1 , . . . , ατL , h, p, and b0 . Although the order condition for identification is therefore satisfied with L ≥ 3, we do not have a proof analogous to Theorem 1 showing that the parameters are actually globally identified based on any number of these moments. Also, Proposition 1 is based on means of exponents, and so requires Y to have a thinner tailed distribution than estimation based on the polynomial Eqs. (3)–(10). Still, if global identification holds with these parameters, then they could be used by themselves for estimation, otherwise they could be combined with the polynomial moments to possibly increase estimation efficiency. One could also construct moments of complex exponentials based on the characteristic function of Y − h instead of those based on the moment generating function as in Proposition 1, which avoids the requirement for thin tailed distributions. However, such moments could sometimes vanish and thereby be uninformative, as when U is uniform. Proposition 1 actually provides a continuum of moments, so rather than just choose a finite number of values for τ , it would also be possible to efficiently combine all the moments given by an interval of values of τ using, e.g., Carrasco and Florens (2000).
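A hedged sketch of the 2L exponential moment functions follows. It writes the second exponential through b1 = b0 p/(p − 1) from Eq. (1), and the function name `exp_moment_functions` and its argument layout are hypothetical. Under independence of U and V, one normalization consistent with Proposition 1 is ατ = (1 − p) E[exp(τU)]; that normalization is an assumption used here only to check the code.

```python
import numpy as np

def exp_moment_functions(y, h, b0, p, taus, alphas):
    """n x 2L array of exponential moment functions, one pair per tau."""
    e = np.asarray(y, dtype=float) - h
    r = p / (1.0 - p)
    b1 = b0 * p / (p - 1.0)  # Eq. (1); equals -b0 * r
    cols = []
    for tau, alpha in zip(taus, alphas):
        # Moment based on exp(tau * (Y - h)).
        cols.append(np.exp(tau * e)
                    - (r * np.exp(tau * b0) + np.exp(tau * b1)) * alpha)
        # Mirror-image moment based on exp(-tau * (Y - h)).
        cols.append(np.exp(-tau * e)
                    - (r * np.exp(-tau * b0) + np.exp(-tau * b1)) * alpha)
    return np.column_stack(cols)
```

These columns could be stacked next to the polynomial moment functions when forming a GMM objective, with each ατ treated as an additional parameter.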
7. Extension 2: h depends on covariates We now extend our results by permitting h to depend on covariates X . Estimators associated with this extension will take the form of standard two step estimators with a uniformly consistent first step. Corollary 1. Assume the conditional distribution of Y given X is identified and its mean exists. Let Y = h(X ) + V + U. Let Assumption A1 hold. Assume V and U are independent of X . Then the function h(X ) and distributions of U and V are identified. Corollary 1 extends Theorem 1 by allowing the conditional mean of Y to nonparametrically depend on X . Given the assumptions of Corollary 1, it follows immediately that Eqs. (3)–(10) hold replacing h with h(X ), and if U is symmetrically distributed and independent of V and X then Eqs. (20) and (21) also hold replacing h with h(X ). This suggests a couple of ways of extending the GMM estimators of the previous section. One method is to first estimate h(X ) by a uniformly consistent nonparametric mean regression of Y on X (e.g., a kernel regression over a compact set of X values on the interior of its support), then replace Y −h in Eqs. (3)–(10) and/or Eqs. (20) and (21) with ε = Y − h(X ), and apply ordinary GMM to the resulting moment conditions (using as data εi = Yi − h(Xi ) for i = 1, . . . , n) to estimate the parameters b0 , b1 , p, u2 , u4 , and u6 . Consistency of this estimator follows immediately from the uniform consistency of h and ordinary consistency of GMM. This estimator is easy to implement because it only depends on ordinary nonparametric regression and ordinary GMM. Root n limiting distribution theory may be immediately obtained by applying generic two step estimation theorems as in Newey and McFadden (1994). After replacing h with h(Xi ), Eq. (15) can be used to estimate the distribution of U, or alternatively Eq. (16) for j = 1, . . . , J, replacing h with h(X ), can be included in the set of functions defining G in the estimator described above. 
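The first step of the two step procedure described above can be illustrated with a simple Nadaraya–Watson regression. The sketch assumes a scalar X, a Gaussian kernel, and a user-supplied bandwidth; none of these choices is prescribed by the paper, and the function names are hypothetical. It also ignores the trimming to a compact interior set of X values mentioned in the text.

```python
import numpy as np

def nw_regression(x, y, bandwidth):
    """Nadaraya-Watson estimate of E[Y | X = x_i] at each sample point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * z**2)          # Gaussian kernel weights
    return (w @ y) / w.sum(axis=1)

def first_step_residuals(x, y, bandwidth):
    """Residuals Y_i - h_hat(X_i) to be used as data in the second step GMM."""
    return np.asarray(y, dtype=float) - nw_regression(x, y, bandwidth)
```

The residuals returned by `first_step_residuals` would then play the role of Y − h in the moment conditions, with h dropped from the parameter vector.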
Since ε has the same properties here as before, given uniform consistency of h(X ), the estimator (19) will still consistently estimate the density of U if it is continuous, using as data εi = Yi − h(Xi ) for i = 1, . . . , n to estimate the density function fε . 8. Extension 3: nonparametric regression with an unobserved binary regressor This section extends previous results to a more general nonparametric regression model of the form Y = g (X , D∗ ) + U. Specifically, we have the following corollary. Corollary 2. Assume the joint distribution of Y , X is identified and that g (X , D∗ ) = E (Y | X , D∗ ) exists, where D∗ is an unobserved variable with support {0, 1}. Assume that the distribution of g (X , D∗ ) conditional upon X is asymmetric for all X on its support. Define p(X ) = E (1 − D∗ | X ) and define U = Y − g (X , D∗ ). Assume E (U d | X , D∗ ) = E (U d | X ) exists for all integers d ≤ 9 and E (U 2d−1 | X ) = 0 for all positive integers d ≤ 5. Then the functions g (X , D∗ ), p(X ), and the distribution of U are identified. Corollary 2 permits all of the parameters of the model to vary nonparametrically with X . It provides identification of the regression model Y = g (X , D∗ ) + U, allowing the unobserved model error U to be heteroskedastic (and have nonconstant higher moments as well), though the variance and other low order even moments of U can only depend on X and not on the unobserved regressor D∗ . As noted in the introduction and in the proof of this corollary, Y = g (X , D∗ ) + U is equivalent to Y = h(X ) + V + U. However, unlike Corollary 1, now V and U have distributions that can depend on X . As with Theorem 1, symmetry of U (now
conditional on X ) suffices to make the required low order odd moments of U be zero. Given the assumptions of Corollary 2, Eqs. (3)–(10), and given symmetry of U, Eqs. (20) and (21), will all hold after replacing the parameters h, b0 , b1 , p, uj , and τℓ , and with functions h(X ), b0 (X ), b1 (X ), p(X ), uj (X ), and τℓ (X ) and replacing the unconditional expectations in these equations with conditional expectations, conditioning on X = x. If desired, we can further replace b0 (X ) and b1 (X ) with g (x, 0) − h(x) and g (x, 1) − h(x), respectively, to directly obtain estimates of the function g (X , D∗ ) instead of b0 (X ) and b1 (X ). Let q(x) be the vector of all of the above listed unknown functions. Then these conditional expectations can be written as
global identification, just as the proof of Theorem 1 requires S = 9 even though in that theorem H = 1. Still, as long as U has sufficiently thin tails, E (Y s ) can exist for arbitrarily high integers s, thereby providing far more identifying equations than unknowns. The above analysis is only suggestive. We do not have a proof of global identification with more than two points of support, though local identification up to a finite set should hold, given that the moments are polynomials, which must have a finite number of roots. Assuming that a given model where V takes on more than two values is identified, moment conditions for estimation analogous to those we provided earlier are available. For example, as in the proof of Proposition 1, it follows from symmetry of U that
E [G(q(x), Y ) | X = x] = 0
E [exp(τ (Y − h))] = E [exp(τ V )]ατ
(22)
for a vector of known functions G. Eq. (22) is in the form of conditional GMM which could be estimated using Ai and Chen (2003), replacing all of the unknown functions q(x) with sieves (related estimators are Carrasco and Florens, 2000 and Newey and Powell, 2003). However, given independent, identically distributed draws of X , Y , the local GMM estimator of Lewbel (2007) may be easier to use because it exploits the special structure we have here, where all the functions q(x) to be estimated depend on the same variables that the moments are conditioned upon, that is, X = x. We summarize here how this local GMM estimator would be implemented. See the online supplement Appendix B to this paper or Lewbel (2007) for details regarding the associated limiting distribution theory. 1. For any value of x, construct data Zi = K ((x − Xi )/b) for i = 1, . . . , n, where K is an ordinary kernel function (e.g., the standard normal density function) and b is a bandwidth parameter. As is common practice when using kernel functions, it is a good idea to first standardize the data by scaling each continuous element of X by its sample standard deviation. 2. Obtain θ by applying standard two step GMM based on the moment conditions E (G(θ , Y )Z ) = 0 for G from Eq. (22). 3. For the given value of x, let q(x) = θ. 4. Repeat these steps using every value of x for which one wishes to estimate the vector of functions q(x). For example, one may repeat these steps for a fine grid of x points on the support of X , or repeat these steps for x equal to each data point Xi to just estimate the functions q(x) at the observed data points.
with ατ = α−τ for any τ for which these expectations exist, and therefore by choosing constants τ1 , . . . , τL , GMM estimation could be based on the 2L moments
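$$\sum_{k=0}^{H} E\big[\exp(\tau_\ell (Y - h)) - \exp(\tau_\ell b_k)\,\alpha_{\tau_\ell}\big]\, p_k = 0$$
$$\sum_{k=0}^{H} E\big[\exp(-\tau_\ell (Y - h)) - \exp(-\tau_\ell b_k)\,\alpha_{\tau_\ell}\big]\, p_k = 0$$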
for ℓ = 1, ..., L. The number of parameters $b_k$, $p_k$ and $\alpha_{\tau_\ell}$ to be estimated would be 2H + L, so taking L > 2H provides more moments than unknowns.
10. Conclusions
Note that this local GMM estimator can be used when X contains both continuous and discretely distributed elements. If all elements of X are discrete, then the estimator simplifies back to Hansen’s (1982) original GMM.
We have proved global point identification and provided estimators for the models Y = h + V + U or Y = h(X ) + V + U, and more generally for Y = g (X , D∗ ) + U. In these models, D∗ or V are unobserved regressors with two points of support, and the unobserved U is drawn from an unknown distribution having some odd central moments equal to zero, as would be the case if U is symmetrically distributed. No instruments, measures, or proxies for D∗ or V are observed. A small Monte Carlo analysis shows that our estimator works reasonably well with a moderate sample size, despite involving high order data moments. To further illustrate the estimator, in an online supplemental Appendix B to this paper we provide a small empirical application involving distribution of income across countries. Interesting work for the future could include derivation of semiparametric efficiency bounds for the model, and obtaining conditions for global identification when V can take on more than two values.
9. Discrete V with more than two support points
Appendix A. Proofs
A simple counting argument suggests that it may be possible to extend this paper's identification and associated estimators to applications where V is discrete with more than two points of support, as follows. Suppose V takes on the values $b_0, b_1, \ldots, b_H$ with probabilities $p_0, p_1, \ldots, p_H$. Let $u_j = E(U^j)$ for integers j as before. Then for any positive odd integer S, the moments $E(Y^s)$ for s = 1, ..., S equal known functions of the 2H + (S + 1)/2 parameters $b_1, b_2, \ldots, b_H, p_1, p_2, \ldots, p_H, u_2, u_4, \ldots, u_{S-1}, h$. Note that $p_0$ and $b_0$ can be expressed as functions of the other parameters by probabilities summing to one and V having mean zero, and we assume $u_s$ for odd values of s ≤ S are zero. Therefore, with any odd S ≥ 4H + 1, $E(Y^s)$ for s = 1, ..., S provides at least as many moment equations as unknowns, which could be used to estimate these parameters by GMM and will generally suffice for local identification. These moments include polynomials with up to S − 1 roots, so having S much larger than 4H + 1 may be necessary for
Proof of Theorem 1. To save space, a great deal of tedious but straightforward algebra is omitted. These details are available in an online supplemental Appendix B. First identify $h$ by $h = E(Y)$, since $V$ and $U$ are mean zero. Then the distribution of $\varepsilon$ defined by $\varepsilon = Y - h$ is identified, and $\varepsilon = U + V$. Define $e_d = E(\varepsilon^d)$, $u_d = E(U^d)$, and $v_d = E(V^d)$. Now evaluate $e_d$ for integers $d \le 9$. These $e_d$ exist by assumption, and are identified because the distribution of $\varepsilon$ is identified. Using independence of $V$ and $U$, $v_1 = 0$, and $u_d = 0$ for odd values of $d$ up to nine, evaluate $e_d = E[(U + V)^d]$ to obtain $e_2 = v_2 + u_2$, $e_3 = v_3$, $e_4 = v_4 + 6v_2u_2 + u_4$, so $u_4 = e_4 - v_4 - 6v_2e_2 + 6v_2^2$, and $e_5 = v_5 + 10v_3u_2 = v_5 + 10v_3(e_2 - v_2)$. Define $s = e_5 - 10e_3e_2$, and note that $s$ depends only on identified objects and so is identified. Then $s = v_5 - 10e_3v_2$. Similarly, $e_6 = v_6 + 15v_4u_2 + 15v_2u_4 + u_6$, which can be solved for $u_6$, $e_7 = v_7 + 21v_5u_2 + 35v_3u_4$, and $e_9 = v_9 + 36v_7u_2 + 126v_5u_4 + 84v_3u_6$. Substituting out the earlier expressions for $u_2$, $u_4$, and $u_6$ in the $e_7$ and $e_9$ equations gives results that can be written as $q = v_7 - 35e_3v_4 - 21sv_2$ and $w = v_9 - 36qv_2 - 126sv_4 - 84e_3v_6$, where $q$ and $w$ are identified by
$$q = e_7 - 21se_2 - 35e_3e_4 = e_7 - 21e_5e_2 + e_3\left(210e_2^2 - 35e_4\right)$$
and
$$w = e_9 - 36qe_2 - 126se_4 - 84e_3e_6 = e_9 - 36e_7e_2 + e_5\left(756e_2^2 - 126e_4\right) + e_3\left(2520e_2e_4 - 84e_6 - 7560e_2^3\right).$$
Summarizing, $w$, $s$, $q$, and $e_3$ are all identified, and $e_3 = v_3$, $s = v_5 - 10e_3v_2$, $q = v_7 - 35e_3v_4 - 21sv_2$, and $w = v_9 - 84e_3v_6 - 126sv_4 - 36qv_2$.
Now $V$ only takes on two values, so let $V$ equal $b_0$ with probability $p_0$ and $b_1$ with probability $p_1$. Let $r = p_0/p_1$. Using $p_1 = 1 - p_0$ and $E(V) = b_0p_0 + b_1p_1 = 0$ we have
$$p_0 = r/(1 + r), \qquad p_1 = 1/(1 + r), \qquad b_1 = -b_0 r,$$
and for any integer $d$
$$v_d = b_0^d p_0 + b_1^d p_1 = b_0^d\left[p_0 + (-r)^d p_1\right] = b_0^d\left[r + (-r)^d\right]/(1 + r).$$
Substituting this $v_d$ into the expressions for $e_3$, $s$, $q$, and $w$ reduces to
$$e_3 = b_0^3\, r(1 - r),$$
$$s = b_0^5\, r(1 - r)\left(r^2 - 10r + 1\right),$$
$$q = b_0^7\, r(1 - r)\left(r^4 - 56r^3 + 246r^2 - 56r + 1\right),$$
$$w = b_0^9\, r(1 - r)\left(r^6 - 246r^5 + 3487r^4 - 10452r^3 + 3487r^2 - 246r + 1\right).$$
These are four equations in the two unknowns $b_0$ and $r$. We require all four equations for point identification, because these are polynomials in $r$ and so have multiple roots. However, from just the $e_3$ and $s$ equations and $b_0 \neq 0$ we have the identified polynomial in $r$
$$e_3^5\left(r^2 - 10r + 1\right)^3 - s^3 r^2 (1 - r)^2 = 0,$$
which has at most six roots. Associated with each possible root $r$ is a corresponding identified distribution for $V$ and $U$ as described at the end of this proof. This shows set identification of the model up to a finite set, and hence local identification, using just the first five moments of $Y$.
To show global identification, we will first show that the four equations for $e_3$, $s$, $q$, and $w$ imply that $r^2 - \gamma r + 1 = 0$, where $\gamma$ is finite and identified. First we have $e_3 = v_3 \neq 0$ and $r \neq 1$ by asymmetry of $V$. Also $r \neq 0$ and $b_0 \neq 0$, because otherwise $V$ would have only one point of support instead of two. Applying these results to the $s$ equation shows that if $s$ (which is identified) is zero then $r^2 - 10r + 1 = 0$, and so in that case $\gamma$ is identified. So now consider the case where $s \neq 0$. Define $R = qe_3/s^2$, which is identified because its components are identified. Then
$$R = \frac{r^4 - 56r^3 + 246r^2 - 56r + 1}{\left(r^2 - 10r + 1\right)^2},$$
so
$$0 = (1 - R)r^4 + (-56 + 20R)r^3 + (246 - 102R)r^2 + (-56 + 20R)r + (1 - R).$$
If $R = 1$, then (using $r \neq 0$) this polynomial reduces to the quadratic $0 = r^2 - 4r + 1$, so in this case $\gamma = 4$ is identified. Now consider the case where $R \neq 1$. Define $Q = s^3/e_3^5$ and $S = w/e_3^3$. Both $Q$ and $S$ exist because $e_3 \neq 0$, and they are identified because their components are identified. Then
$$Q = \frac{\left(r^2 - 10r + 1\right)^3}{\left(r(1 - r)\right)^2},$$
so
$$0 = r^6 - 30r^5 + (303 - Q)r^4 + (2Q - 1060)r^3 + (303 - Q)r^2 - 30r + 1.$$
Also
$$S = \frac{w}{e_3^3} = \frac{b_0^9\, r(1 - r)\left(r^6 - 246r^5 + 3487r^4 - 10452r^3 + 3487r^2 - 246r + 1\right)}{\left(b_0^3\, r(1 - r)\right)^3},$$
so
$$0 = r^6 - 246r^5 + (3487 - S)r^4 + (2S - 10452)r^3 + (3487 - S)r^2 - 246r + 1.$$
Subtracting the polynomial with $Q$ from the polynomial with $S$ (and dividing by $-r$) gives
$$0 = 216r^4 + (S - Q - 3184)r^3 + (9392 + 2Q - 2S)r^2 + (S - Q - 3184)r + 216.$$
Multiply this by $(1 - R)$, multiply the polynomial based on $R$ by 216, subtract one from the other and divide by $r$ (which is nonzero) to obtain an expression that simplifies to
$$0 = Nr^2 - \left(2(1 - R)(6320 + S - Q) + 31104\right)r + N,$$
where $N = (1 - R)(1136 + S - Q) + 7776$, which after substituting in for $R$, $S$, and $Q$ becomes
$$N = \frac{15552\, r(r + 1)^4}{\left(r^2 - 10r + 1\right)^2 (1 - r)^2}.$$
The denominator of this expression for $N$ is not equal to zero, because that would imply $s = 0$, and we are currently specifically considering the case where $s \neq 0$ (having already analyzed the case where $s = 0$). Also $N \neq 0$ because $r \neq 0$ and $r \neq -1$. We therefore have $0 = r^2 - \gamma r + 1$, where $\gamma = \left(2(1 - R)(6320 + S - Q) + 31104\right)/N$, which is identified because all of its components are identified.
We have now shown that $0 = r^2 - \gamma r + 1$, where $\gamma$ is identified. This equation says that $\gamma = r + r^{-1} = [p_0/(1 - p_0)] + [(1 - p_0)/p_0]$. Whatever value $p_0$ takes on between zero and one makes this expression for $\gamma$ greater than or equal to two. The equation $0 = r^2 - \gamma r + 1$ has solutions
$$r = \tfrac{1}{2}\gamma + \tfrac{1}{2}\sqrt{\gamma^2 - 4} \qquad \text{and} \qquad r = \frac{1}{\tfrac{1}{2}\gamma + \tfrac{1}{2}\sqrt{\gamma^2 - 4}}$$
with $\gamma^2 \ge 4$, so one of these solutions must be the true value of $r$. Given $r$, we can then solve for $b_0$ by $b_0 = e_3^{1/3}/\left(r(1 - r)\right)^{1/3}$. Recall that $r = p_0/p_1$. If we exchanged $b_0$ with $b_1$ and exchanged $p_0$ with $p_1$ everywhere, all of the above equations would still hold. It follows that one of the above two values of $r$ must equal $p_0/p_1$, and the other equals $p_1/p_0$. The former when substituted into $e_3/\left(r(1 - r)\right)$ will yield $b_0^3$ and the latter must yield $b_1^3$. Without loss of generality imposing the constraint $b_0 < 0 < b_1$ shows that the correct solution for $r$ will be the one that satisfies $e_3/\left(r(1 - r)\right) < 0$, and so $r$ and $b_0$ are identified. The remainder of the distribution of $V$ is then given by $p_0 = r/(1 + r)$, $p_1 = 1/(1 + r)$, and $b_1 = -b_0 r$.
Finally, we show identification of the distribution of $U$. For any random variable $Z$, let $F_Z$ denote the marginal cumulative distribution function of $Z$. By the probability mass function of the $V$ distribution, $F_\varepsilon(\varepsilon) = (1 - p)F_U(\varepsilon - b_1) + pF_U(\varepsilon - b_0)$. Letting $\varepsilon = u + (b_0 - b_1)k + b_0$ and rearranging gives
$$F_U\left(u + (b_0 - b_1)k\right) = \frac{1}{p}F_\varepsilon\left(u + b_0 + (b_0 - b_1)k\right) - \frac{1 - p}{p}F_U\left(u + (b_0 - b_1)(k + 1)\right),$$
so for positive integers $K$,
$$F_U(u) = R_K + \sum_{k=0}^{K-1}\left(-\frac{1 - p}{p}\right)^k \frac{1}{p}F_\varepsilon\left(u + b_0 + (b_0 - b_1)k\right),$$
where the remainder term $R_K = (-1)^K r^{-K}F_U\left(u + (b_0 - b_1)K\right)$ satisfies $|R_K| \le r^{-K}$. If $r > 1$ then $R_K \to 0$ as $K \to \infty$, so $F_U(u)$ is identified by
$$F_U(u) = \sum_{k=0}^{\infty}\left(-\frac{1 - p}{p}\right)^k \frac{1}{p}F_\varepsilon\left(u + b_0 + (b_0 - b_1)k\right) \qquad (23)$$
since all the terms on the right of this expression are identified, given that the distributions of $\varepsilon$ and of $V$ are identified. If $r < 1$, then exchange the roles of $b_0$ and $b_1$ (e.g., start by letting $\varepsilon = u + (b_1 - b_0)k + b_1$), which will correspondingly exchange $p$ and $1 - p$, to obtain
$$F_U(u) = \sum_{k=0}^{\infty}\left(-\frac{p}{1 - p}\right)^k \frac{1}{1 - p}F_\varepsilon\left(u + b_1 + (b_1 - b_0)k\right),$$
where now the remainder term is $R_K = (-1)^K r^K F_U\left(u + (b_1 - b_0)K\right)$, with $|R_K| \le r^K \to 0$ as $K \to \infty$ since now $r < 1$. The case of $r = 1$ is ruled out, since that is equivalent to $p = 1/2$.
Proof of Proposition 1. $Y = h + V + U$ and independence of $U$ and $V$ imply that $E[\exp(\tau(Y - h))] = E[\exp(\tau V)]\,E[\exp(\tau U)]$. Now $E[\exp(\tau V)] = p\exp(\tau b_0) + (1 - p)\exp(\tau b_1)$. Define $\alpha_\tau = (1 - p)E\left(e^{\tau U}\right)$. By symmetry of $U$, $\alpha_\tau = \alpha_{-\tau}$. These equations with $r = p/(1 - p)$ and $b_1 = b_0p/(p - 1)$ give Eqs. (20) and (21).
Proof of Theorem 2. By the probability mass function of the $V$ distribution, $F_\varepsilon(\varepsilon) = (1 - p)F_U(\varepsilon - b_1) + pF_U(\varepsilon - b_0)$. Evaluating this expression at $\varepsilon = u + b_1$ gives
$$F_\varepsilon(u + b_1) = (1 - p)F_U(u) + pF_U(u + b_1 - b_0) \qquad (24)$$
and evaluating at $\varepsilon = -u + b_0$ gives $F_\varepsilon(-u + b_0) = (1 - p)F_U(-u - b_1 + b_0) + pF_U(-u)$. Apply symmetry of $U$, which implies $F_U(u) = 1 - F_U(-u)$, to this last equation to obtain
$$F_\varepsilon(-u + b_0) = (1 - p)\left[1 - F_U(u + b_1 - b_0)\right] + p\left[1 - F_U(u)\right]. \qquad (25)$$
Eqs. (24) and (25) are two equations in the two unknowns $F_U(u + b_1 - b_0)$ and $F_U(u)$. Solving for $F_U(u)$ gives $F_U(u) = \Psi(u)$ with $\Psi(u)$ given by Eq. (11). It follows from symmetry of $U$ that $F_U(u)$ must also equal $1 - \Psi(-u)$, which gives Eq. (12).
Proof of Corollary 1. First identify $h(x)$ by $h(x) = E(Y \mid X = x)$, since $E(Y - h(X) \mid X = x) = E(V + U \mid X = x) = E(V + U) = 0$. Next define $\varepsilon = Y - h(X)$, and then the rest of the proof is identical to the proof of Theorem 1.
Proof of Corollary 2. Define $h(x) = E(Y \mid X = x)$ and $\varepsilon = Y - h(X)$. Then $h(x)$ and the distribution of $\varepsilon$ conditional upon $X$ are identified and $E(\varepsilon \mid X) = 0$. Define $V = g(X, D^*) - h(X)$ and let $b_d(X) = g(X, d) - h(X)$ for $d = 0, 1$. Then $\varepsilon = V + U$, where $V$ (given $X$) has the distribution with support equal to the two values $b_0(X)$ and $b_1(X)$ with probabilities $p(X)$ and $1 - p(X)$, respectively. Also $U$ and $\varepsilon$ have mean zero given $X$, so $E(V \mid X) = 0$. Applying Theorem 1 separately for each value $x$ on the support of $X$ shows that $b_0(x)$, $b_1(x)$, $p(x)$, and the conditional distribution of $U$ given $X = x$ are identified for each such $x$, and it follows that the function $g(x, d)$ is identified by $g(x, d) = b_d(x) + h(x)$.
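The constructive steps in the proof of Theorem 1 can be traced numerically. The sketch below is an illustration under stated assumptions rather than the authors' estimator: it works with exact population moments instead of sample moments, assumes $U \sim N(0, \sigma^2)$ so that the even moments of $U$ are known in closed form, and uses the hypothetical helper names `two_point_moments` and `identify`. It builds $e_2, \ldots, e_9$ from a chosen $(b_0, p_0, \sigma)$, then recovers $r$, $b_0$, $p_0$ and $b_1$ through $s$, $q$, $w$, $R$, $Q$, $S$, $N$ and $\gamma$ exactly as in the proof.

```python
import math

def two_point_moments(b0, p0, sigma, dmax=9):
    """Population moments e_d = E[(U + V)^d] for d = 0..dmax.

    V equals b0 w.p. p0 and b1 w.p. 1 - p0 with E(V) = 0;
    U ~ N(0, sigma^2) is an assumption made only for this example.
    """
    p1 = 1.0 - p0
    b1 = -b0 * p0 / p1
    v = [b0 ** d * p0 + b1 ** d * p1 for d in range(dmax + 1)]
    u = [0.0] * (dmax + 1)
    for d in range(0, dmax + 1, 2):
        # Even normal moments: sigma^d * (d - 1)!!; odd moments stay zero.
        u[d] = sigma ** d * math.prod(range(1, d, 2))
    return [sum(math.comb(d, k) * v[d - k] * u[k] for k in range(d + 1))
            for d in range(dmax + 1)]

def identify(e, tol=1e-12):
    """Recover (r, b0, p0, b1) from e_2..e_9 following the proof's steps."""
    e2, e3, e4, e5, e6, e7, e9 = e[2], e[3], e[4], e[5], e[6], e[7], e[9]
    s = e5 - 10.0 * e3 * e2
    q = e7 - 21.0 * s * e2 - 35.0 * e3 * e4
    w = e9 - 36.0 * q * e2 - 126.0 * s * e4 - 84.0 * e3 * e6
    if abs(s) < tol:                       # case s = 0: r^2 - 10 r + 1 = 0
        gamma = 10.0
    else:
        R = q * e3 / s ** 2
        if abs(R - 1.0) < tol:             # case R = 1: r^2 - 4 r + 1 = 0
            gamma = 4.0
        else:
            Q = s ** 3 / e3 ** 5
            S = w / e3 ** 3
            N = (1.0 - R) * (1136.0 + S - Q) + 7776.0
            gamma = (2.0 * (1.0 - R) * (6320.0 + S - Q) + 31104.0) / N
    root = 0.5 * gamma + 0.5 * math.sqrt(max(gamma ** 2 - 4.0, 0.0))
    for r in (root, 1.0 / root):
        b0_cubed = e3 / (r * (1.0 - r))
        if b0_cubed < 0.0:                 # normalization b0 < 0 < b1
            b0 = -((-b0_cubed) ** (1.0 / 3.0))
            return r, b0, r / (1.0 + r), -b0 * r
    raise ValueError("no root satisfies b0 < 0; is V asymmetric?")

# Example values (assumed): b0 = -1, p0 = 0.7, sigma = 0.5, so r = 7/3.
e = two_point_moments(b0=-1.0, p0=0.7, sigma=0.5)
r, b0, p0, b1 = identify(e)
```

With sample moments in place of the exact $e_d$, the same arithmetic gives a plug-in estimate, although high-order sample moments are noisy, which is one reason the paper estimates by GMM instead.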
Appendix B. Supplementary data
Supplementary material related to this article can be found online at doi:10.1016/j.jeconom.2011.03.003.
References
Ahmed, I.A., Li, Q., 1997. Testing symmetry of an unknown density by kernel method. Nonparametric Statistics 7, 279–293.
Ai, C., Chen, X., 2003. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71, 1795–1844.
Altonji, J., Segal, L., 1996. Small-sample bias in GMM estimation of covariance structures. Journal of Business & Economic Statistics 14, 353–366.
Baltagi, B.H., 2008. Econometric Analysis of Panel Data, 4th ed. Wiley.
Bordes, L., Mottelet, S., Vandekerkhove, P., 2006. Semiparametric estimation of a two-component mixture model. Annals of Statistics 34, 1204–1232.
Carrasco, M., Florens, J.P., 2000. Generalization of GMM to a continuum of moment conditions. Econometric Theory 16, 797–834.
Carroll, R.J., Ruppert, D., Stefanski, L.A., Crainiceanu, C.M., 2006. Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. Chapman & Hall/CRC.
Chen, X., Hu, Y., Lewbel, A., 2008. Nonparametric identification of regression models containing a misclassified dichotomous regressor without instruments. Economics Letters 100, 381–384.
Clogg, C.C., 1995. Latent class models. In: Arminger, G., Clogg, C.C., Sobel, M.E. (Eds.), Handbook of Statistical Modeling for the Social and Behavioral Sciences. Plenum, New York, pp. 311–359 (Chapter 6).
Dong, Y., 2011. Semiparametric binary random effects models: estimating two types of drinking behavior. Economics Letters (forthcoming).
Hagenaars, J.A., McCutcheon, A.L., 2002. Applied Latent Class Analysis Models. Cambridge University Press, Cambridge.
Hall, P., Zhou, X.-H., 2003. Nonparametric estimation of component distributions in a multivariate mixture. Annals of Statistics 31, 201–224.
Hansen, L., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Heckman, J.J., Robb, R., 1985. Alternative methods for evaluating the impact of interventions. In: Heckman, James J., Singer, B. (Eds.), Longitudinal Analysis of Labor Market Data. Cambridge University Press, New York, pp. 156–245.
Honore, B., 1992. Trimmed LAD and least squares estimation of truncated and censored regression models with fixed effects. Econometrica 60, 533–565.
Hu, Y., Lewbel, A., 2008. Identifying the returns to lying when the truth is unobserved. Boston College Working Paper.
Kasahara, H., Shimotsu, K., 2009. Nonparametric identification of finite mixture models of dynamic discrete choices. Econometrica 77, 135–175.
Kitamura, Y., 2004. Nonparametric identifiability of finite mixtures. Unpublished Manuscript, Yale University.
Kumbhakar, S.C., Lovell, C.A.K., 2000. Stochastic Frontier Analysis. Cambridge University Press.
Kumbhakar, S.C., Park, B.U., Simar, L., Tsionas, E.G., 2007. Nonparametric stochastic frontiers: a local maximum likelihood approach. Journal of Econometrics 137, 1–27.
Lewbel, A., 2007. A local generalized method of moments estimator. Economics Letters 94, 124–128.
Newey, W.K., McFadden, D., 1994. Large sample estimation and hypothesis testing. In: Engle, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. IV. Elsevier, Amsterdam, pp. 2111–2245.
Newey, W.K., Powell, J.L., 2003. Instrumental variable estimation of nonparametric models. Econometrica 71, 1565–1578.
Powell, J.L., 1986. Symmetrically trimmed least squares estimation of Tobit models. Econometrica 54, 1435–1460.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Simar, L., Wilson, P.W., 2007. Statistical inference in nonparametric frontier models: recent developments and perspectives. In: Fried, H., Lovell, C.A.K., Schmidt, S.S. (Eds.), The Measurement of Productive Efficiency, 2nd ed. Oxford University Press, Oxford (Chapter 4).
Journal of Econometrics 163 (2011) 172–185
Inference and prediction in a multiple-structural-break model
John Geweke a,∗, Yu Jiang b,1
a CenSoC (Centre for the Study of Choice), University of Technology Sydney, 645 Harris St Ultimo NSW 2007, PO Box 123, Broadway NSW 2007, Australia
b Department of Finance and Insurance, School of Business, Nanjing University, 22 Hankou Road, Nanjing, Jiangsu 210093, PR China
Article info
Article history: Received 19 November 2009; received in revised form 24 November 2010; accepted 22 March 2011; available online 15 April 2011.
Abstract
This paper develops a new Bayesian approach to structural break modeling. The focuses of the approach are the modeling of in-sample structural breaks and forecasting time series allowing out-of-sample breaks. The model has several desirable features. First, the number of regimes is not fixed but is treated as a random variable. Second, the model adopts a hierarchical prior for regime coefficients, which allows for the coefficients of one regime to contain information about coefficients of other regimes. Third, the regime coefficients can be integrated analytically in the posterior density; as a consequence the posterior simulator is fast and reliable. An application to US real GDP quarterly growth rates links groups of regimes to specific historical periods and provides forecasts of future growth rates.
© 2011 Elsevier B.V. All rights reserved.
∗ Corresponding author. Tel.: +61 2 9514 9797. E-mail addresses: [email protected] (J. Geweke), [email protected] (Y. Jiang).
1 Tel.: +86 25 8362 1368; fax: +86 25 8362 1266.
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.03.005
1. Introduction
In recent years nonlinear models of macroeconomic and financial time series have proliferated. Many of these models have been created by introducing stochastic parameter variation in simple linear models. In the extreme case, parameters change each time period; such models for macroeconomic time series include those of Cooley and Prescott (1976) and Min and Zellner (1993). In intermediate cases, parameters change occasionally and the change process itself is fully described. Such models are said to be characterized by multiple structural breaks and each period between breaks is known as a regime. The model studied here is of this type and we retain the nomenclature ‘‘structural break’’ and ‘‘regime’’. Given this generally accepted terminology it perhaps bears emphasis that a single probability law still describes the evolution of the time series, and structural break models may be either stationary or nonstationary, as is also the case with simple linear models. Koop and Potter (2001) explicitly interpret the structural break model as an intermediate case between classical linear models and continuously time varying parameter models. Stock and Watson (1996) find evidence of structural instability in US post-war macroeconomic time series. Quite a few studies have introduced structural break models and provided successful applications to economic time series, including Garcia and Perron (1996), Stock and Watson (1996), Clements and Hendry (1998, 1999), Kim and Nelson (1999), McConnell and Perez (2000), Wang and Zivot (2000), Cogley and Sargent (2001), Pastor and Stambaugh (2001), Ang and Bekaert (2002), Pesaran and Timmermann (2002), and Pesaran et al. (2006).
The non-Bayesian literature has seen several lines of attack on structural break models. A large literature concentrates on testing hypotheses about the existence of structural breaks (Ghysels and Hall, 1990; Hansen, 1992; Andrews, 1993; Lumsdaine and Papell, 1997; Ghysels et al., 1998; Andrews, 2003; Elliott and Muller, 2006), using asymptotic distribution theory that fixes the number of regimes as sample size increases. A somewhat more limited literature uses the same setup but also addresses the estimation of change points and parameters; most notably Bai and Perron (1998, 2003) use the least squares principle in a linear model to this end. Neither of these literatures makes assumptions about the process of transitions between regimes that are specific enough to address the problem of prediction. A third approach in the non-Bayesian tradition explicitly posits such transitions. Leading examples include threshold autoregressions (Tong and Lim, 1980), Markov switching models (Hamilton, 1989), and smooth transition autoregressive models (Terasvirta and Anderson, 1992). These three models are also amenable to Bayesian treatments (Geweke and Terui, 1993; Kim and Nelson, 1999; Giordani et al., 2007).
Bayesian approaches to structural break models begin with a complete specification of the break process and the evolution of the time series within each regime, including a prior distribution for all parameters. Most of these studies use latent variables to indicate the location of structural breaks and some form of hierarchical distribution for the distribution of the time series across regimes. In this context it is generally advantageous to access the posterior distribution by means of Markov chain Monte Carlo (MCMC) simulation: conditional on the latent variables it is easy to generate parameters for each regime, while conditional on the parameters it is straightforward to generate the latent variables. Work typical of this approach includes McCulloch and Tsay (1993), Chib (1998), Wang and Zivot (2000), Pesaran et al. (2006), Koop and Potter (2007), and Giordani and Kohn (2008). This study introduces a new model of structural breaks, together with Bayesian methods of inference, that has several attractive features.
• If structural breaks exist in the past then they will occur in the future. In this study the number of regimes is fixed neither within the sample nor over any prediction horizon.
• Parameters are exchangeable across regimes, as in Pesaran et al. (2006) and Koop and Potter (2007). Each regime provides more information about the common distribution, thereby updating information about future regimes.
• Regime parameters can be marginalized analytically in the posterior distribution, leaving only the small handful of the parameters of the common distribution of regimes to be handled by a posterior simulator.
• A sophisticated yet reliable Metropolis proposal distribution for regime breaks alleviates the need to fix the number of breaks. The resulting MCMC algorithm is reliable, mixes well and has very low computational cost.
This is the first approach to combine all of these features. Section 2 provides details of our structural break model, including the modeling of in-sample structural breaks and forecasting in the presence of out-of-sample breaks. Section 3 applies it to the US real gross domestic product (GDP) quarterly growth rates and shows that our structural break model compares favorably with a Markov switching model. Section 4 concludes and discusses future research directions.
2. The model
The multiple-structural-break model is characterized by a succession of regimes. Within each regime, growth rates are drawn independently from a common Gaussian distribution. Breaks occur randomly according to a Bernoulli process, the mean and precision of each regime being drawn independently from a common distribution of parameter vectors. The prior distributions of the parameter of the Bernoulli process for structural breaks and of the population of regime parameter vectors are important in the model, for they provide the definition of ‘‘structural break’’ that is brought to the data. In any structural break model this definition will be important for the same reason.
In a formal Bayesian approach the definition is explicit. This section establishes notation and provides the detailed structure of the model; describes the posterior simulator, with technical detail relegated to an Appendix; and outlines the algorithm for prediction with the model.
2.1. The model structure
The multiple-structural-break model describes a time series $y_t$, typically the growth rate of a macroeconomic aggregate or the return on a set of financial assets. $T$ denotes the sample size. The objectives of the modeling exercise are taken to be interpretation of the sample, or prediction of future values of the series, or both. The process of regime change is indicated by the Bernoulli latent variable $s_t$, $P(s_t = 1) = \pi$. If $s_t = 0$ then observations $t$ and $t + 1$ are in the same regime, while if $s_t = 1$ then a new regime begins in period $t + 1$. Time begins with $t = 1$ and therefore regime $j_t = 1 + \sum_{\tau=1}^{t-1} s_\tau$ pertains at time $t$. Define $J = j_T$, the number of regimes in the sample; $J - 1$ is the number of breakpoints in the sample.
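The mapping from break indicators to regime labels is a running sum, which a short Python sketch makes concrete. The function name `regime_indices` and the example indicator sequence are illustrative assumptions, not part of the paper.

```python
from itertools import accumulate

def regime_indices(s):
    """Map break indicators [s_1, ..., s_{T-1}] to regime labels [j_1, ..., j_T].

    j_t = 1 + sum_{tau=1}^{t-1} s_tau, so j_1 = 1 and the label rises by one
    in the period after each s_t = 1.
    """
    return [1] + [1 + c for c in accumulate(s)]

# Hypothetical example with T = 7: breaks after periods 3 and 5.
s = [0, 0, 1, 0, 1, 0]
j = regime_indices(s)   # regime label for each of the T periods
J = j[-1]               # number of regimes; J - 1 is the number of breakpoints
```

Here the first regime covers periods 1 to 3, the second covers periods 4 and 5, and the third covers periods 6 and 7.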
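In regime $j$, the observable time series is
$$y_t \overset{\text{i.i.d.}}{\sim} N\left(\mu_j, h_j^{-1}\right). \qquad (1)$$
The sequence of parameter vectors $(\mu_j, h_j)$ is independent with a common normal–gamma distribution
$$s^{*2} h_j \sim \chi^2(\nu^*), \qquad \mu_j \mid h_j \sim N\left(\mu^*, h_j^{-1} \cdot h_{\mu^*}^{-1}\right). \qquad (2)$$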
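Equations (1) and (2), combined with the Bernoulli break process, can be simulated directly. The following Python sketch is an illustration under stated assumptions: the function name `simulate`, its argument names, and the parameter values in the usage lines are hypothetical and are not taken from the paper. It uses the fact that a $\chi^2(\nu)$ variate is a gamma variate with shape $\nu/2$ and scale 2.

```python
import random

def simulate(T, pi, s2_star, nu_star, mu_star, h_mu_star, rng):
    """Draw y_1..y_T and regime labels j_1..j_T from the break model."""

    def draw_regime():
        # s*^2 h_j ~ chi^2(nu*); chi^2(nu) is Gamma(shape = nu/2, scale = 2).
        h = rng.gammavariate(nu_star / 2.0, 2.0) / s2_star
        # mu_j | h_j ~ N(mu*, 1 / (h_j * h_mu*)); gauss takes a std. deviation.
        mu = rng.gauss(mu_star, (h * h_mu_star) ** -0.5)
        return mu, h

    mu, h = draw_regime()
    y, regimes, j = [], [], 1
    for _ in range(T):
        y.append(rng.gauss(mu, h ** -0.5))   # y_t ~ N(mu_j, 1 / h_j), Eq. (1)
        regimes.append(j)
        if rng.random() < pi:                # s_t = 1: new regime starts at t + 1
            mu, h = draw_regime()
            j += 1
    return y, regimes

# Hypothetical parameter values, chosen only to exercise the function.
rng = random.Random(1)
y, regimes = simulate(T=200, pi=0.05, s2_star=1.0, nu_star=5.0,
                      mu_star=0.8, h_mu_star=2.0, rng=rng)
```

The indicator drawn in the final period would start a regime at $T + 1$; it does not affect the returned sample.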
The conjugate normal–gamma distribution is important to the efficiency of the computational algorithm described in Section 2.3. The model is driven by the five parameters π , s∗2 , ν ∗ , µ∗ and hµ∗ , as may be seen by noting that given these five parameters one could simulate yt . The intermediate constructs st , hj and µj may be regarded as latent variables although, from a Bayesian perspective, there is no functional distinction between parameters and latent variables. While the time series is i.i.d. within each regime (1), the persistence of regimes for several periods implies that {yt } can easily exhibit autocorrelation, persistence in volatility and cyclical behavior of the kind seen in economic time series. Section 3.1 provides specific evidence on this point. The model assigns independent prior distributions to the five parameters. Each prior distribution is characterized by hyperparameters, indicated by underbars, to which the investigator assigns values. The values assigned are discussed in Section 3 in the context of the application of the model to US GDP growth rates, and they are interpreted there by means of their implications for features of the observable growth rates yt . The functional forms of the prior distributions are as follows:
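$$\pi \sim \text{Beta}\left(\underline{\gamma}_1, \underline{\gamma}_2\right) \qquad (3)$$
$$\mu^* \sim N\left(\underline{\mu}^*, \underline{h}_{\mu^*}^{-1}\right) \qquad (4)$$
$$\underline{s}\, h_{\mu^*} \sim \chi^2\left(\underline{\nu}\right) \qquad (5)$$
$$\underline{a}\, s^{*2} \sim \chi^2\left(\underline{b}\right) \qquad (6)$$
$$\nu^* \sim \exp\left(\underline{\lambda}\right) \qquad (7)$$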
where in the last expression $E(\nu^*) = \underline{\lambda}^{-1}$. In this model the regime parameters $(\mu_j, h_j)$ may be interpreted as latent variables. Equivalently they may be regarded as parameters with (4)–(7) and (2) then constituting a hierarchical prior distribution for $(\mu_j, h_j)$. This organization has several advantages. First, it expresses the belief that there is a degree of similarity in $(\mu_j, h_j)$ across regimes (2) while permitting a separate statement of belief about what the population of regimes might be ((4)–(7)). Second, given (2), regime coefficients can be integrated analytically in the posterior distribution (see the details in Appendix A.3). This property significantly simplifies the simulation of the posterior distribution and thus reduces the computational cost. Third, the structure of the prior implies that as observations in one regime arise they provide information about future regimes by means of updating the posterior distribution of $\mu^*$, $h_{\mu^*}$, $s^{*2}$, and $\nu^*$.
2.2. Prediction
Suppose that a simulation sample from the posterior distribution of parameters and latent variables conditional on $y_1, \ldots, y_T$ is available (Section 2.3 and the Appendix detail the posterior simulator). Then sampling from the posterior predictive distribution $p(y_{T+1}, \ldots, y_{T+h} \mid y_1, \ldots, y_T)$ is straightforward. Let $m = 1, \ldots, M$ index the posterior simulation sample. For each simulation $m$ denote the latent breakpoint indicators $s_t^{(m)}$ $(t = 1, \ldots, T - 1)$. The total number of regimes in draw $m$ is $J^{(m)} = 1 + \sum_{t=1}^{T-1} s_t^{(m)}$. Simulate the model forward conditional on the parameters and latent variables in the $m$th simulation from the posterior distribution as follows:
1. $s_{T+r}^{(m)} \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}\left(\pi^{(m)}\right)$ $(r = 0, \ldots, h - 1)$.
2. Define $j_{T+r}^{(m)} = 1 + \sum_{t=1}^{T+r-1} s_t^{(m)}$ $(r = 1, \ldots, h)$.
3. If $j_{T+h}^{(m)} > j_T^{(m)}$ simulate $\left(h_j^{(m)}, \mu_j^{(m)}\right)$ for $j = j_T^{(m)} + 1, \ldots, j_{T+h}^{(m)}$: $s^{*2(m)} h_j^{(m)} \sim \chi^2\left(\nu^{*(m)}\right)$, $\mu_j^{(m)} \mid h_j^{(m)} \sim N\left(\mu^{*(m)}, h_j^{(m)-1} \cdot h_{\mu^*}^{(m)-1}\right)$, where all draws are independent.
4. Simulate $y_{T+r}^{(m)} \sim N\left[\mu_{j_{T+r}}^{(m)}, \left(h_{j_{T+r}}^{(m)}\right)^{-1}\right]$ $(r = 1, \ldots, h)$.
The collection of draws $\left\{y_{T+r}^{(m)}, r = 1, \ldots, h\right\}_{m=1}^{M}$ removes the conditioning on latent variables and parameters. Some characteristics of the predictive distribution and the sample simulated from it are worth noting.
1. Because the collection comes from the predictive distribution it does not condition on any stipulation about the number of regimes, either in the sample period $t = 1, \ldots, T$ or in the forecasting period $t = T + 1, \ldots, T + h$. This distinguishes the procedure from other models and methods that do not draw directly from the predictive distribution (Pesaran et al., 2006).
2. Like any sample from the predictive distribution this one enables the investigator to conduct inference about any function of the time series over a finite horizon. Examples include characteristics of business cycles, for macroeconomic time series, and pricing of options, for asset return time series.
3. While the number of latent variables in the model exceeds $T$, the simulation algorithm for the predictive distribution uses only the five parameters and the mean $\mu_{j_T}$ and precision $h_{j_T}$ pertaining to the last period. Thus if the posterior simulation sample is to be used exclusively for prediction then only the simulations of these seven unobservables need be recorded during posterior simulation.
2.3. The posterior simulator
The structure of the model suggests several approaches to posterior Markov chain Monte Carlo (MCMC) simulation but some of these chains mix better than others. For example MCMC simulation can be fully blocked in the five parameters, the means and precisions of the regimes, and the regime change indicator. This approach leads to conditionally conjugate posterior distributions for all blocks except the regime change indicator; but this indicator, conditioning on regime-specific means and precisions, only rarely moves a boundary between disparate regimes, leading to poor mixing for the entire chain. The MCMC algorithm used in this study achieves satisfactory mixing by implementing a marginal Gibbs sampler incorporating appropriate Metropolis proposal distributions. The Gibbs sampler is based on the marginalization of the posterior distribution with respect to the regime-specific parameters $(\mu_j, h_j)$, henceforth the ‘‘marginal posterior distribution’’. The marginalization is straightforward because conditional on the parameters $s^{*2}$, $\nu^*$, $\mu^*$ and $h_{\mu^*}$, (1)–(2) amount to the textbook i.i.d. normal model with a fully conjugate prior distribution. Appendix A.3 derives the marginal posterior distribution.
None of the resulting distributions for each of $s^{*2}$, $\nu^*$, $\mu^*$ and $h_{\mu^*}$, conditional on the data $\{y_t\}$ and the regimes defined by $\{s_t\}$, are conventional. The conditional probability density functions of $\nu^*$ and $s^{*2}$ in the marginal posterior density are always unimodal. Those of $\mu^*$ and $h_{\mu^*}$ are not known to be unimodal for all possible configurations of the conditioning parameters and data, but work to date with this model has uncovered no cases of multimodality. The proposal distribution for each of these four parameters is Gaussian, centered at the mode of the conditional distribution and with precision set to minus the second derivative of the log conditional probability density function at the mode. Appendix A.4 provides full details.
The conditional posterior distribution of $s_1, \ldots, s_{T-1}$ in the marginal posterior distribution, derived at the end of Appendix A.3, is also nonstandard. The parameters of this distribution are the model parameters $\pi$, $s^{*2}$, $\nu^*$, $\mu^*$ and $h_{\mu^*}$, and the sample mean and precision of $y_t$ within each regime defined by $s_1, \ldots, s_{T-1}$. Therefore it is straightforward to evaluate the conditional p.d.f. of $s_1, \ldots, s_{T-1}$ in the marginal posterior distribution using a Metropolis proposal distribution. The proposal is a mixture of three distributions.
1. Add a break with probability qadd . (a) With equal probability select t ∗ from among the periods t for which st = 0. (b) Set st ∗ = 1. 2. Delete a break with probability qdel . (a) With equal probability select t ∗ from among the periods t for which st = 1. (b) Set st ∗ = 0. 3. Move a break with probability qmove . (a) With equal probability select t ∗ from among the periods t for which st = 1. (b) Set st ∗ = 0. (c) If 2 ≤ t ∗ ≤ T − 2 set τ ∗ = t ∗ − 1 or t ∗ + 1 with equal probability; if t ∗ = 1 then τ ∗ = 2 and if t ∗ = T − 1 then τ ∗ = T − 2. (d) Set sτ ∗ = 1. (If sτ ∗ = 1 in the previous iteration and this proposal is accepted then a break is removed.) The three probabilities qadd , qdel and qmove (which sum to 1) are design parameters of the algorithm. The mixing properties of the posterior simulator are not very sensitive to these choices. Computing time is roughly proportional to the product of MCMC iterations M, sample size T , and the posterior mean of the number of regime changes J. For the application in the next section with M = 106 , T = 245 and E (J | yo ) = 20, the computing time was about 15 min. Appendices A.5 and A.6 provide further details on these points. 3. Application: US real GDP quarterly growth rates The multiple-structural-break model in this study can be used to interpret and predict many time series. The focus in this section is on US real gross domestic product (GDP) quarterly growth rates, a common subject of study in this literature (Kim and Nelson, 1999; McConnell and Perez, 2000; Kim et al., 2004; Koop and Potter, 2007; Maheu and Gordon, 2008). Denote by GDPt seasonally adjusted US quarterly real GDP in billions of chained 2000 dollars, obtained from the website of the Bureau of Economic Analysis for 1947Q1 through 2008Q3. The model then pertains to the growth series yt = log(GDPt /GDPt −1 ). The sample contains 246 observations of yt from 1947Q2 through 2008Q3. Fig. 
1 displays the sample path of US quarterly real GDP growth rates.
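The add, delete and move proposal for the break indicators described in Section 2.3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `propose_breaks`, the list representation of s, and the choice to return the state unchanged when no eligible period exists are hypothetical and not taken from the paper. The Metropolis acceptance step, which requires the marginal posterior density of Appendix A.3, is not shown.

```python
import random


def propose_breaks(s, q_add, q_del, q_move, rng):
    """Draw one proposal for the break indicators.

    s[i] holds s_t for t = i + 1, with t = 1, ..., T - 1.
    Returns (proposal, move_type); the input list is not modified.
    """
    assert abs(q_add + q_del + q_move - 1.0) < 1e-12
    n = len(s)            # n = T - 1 break indicators
    new = list(s)
    u = rng.random()
    if u < q_add:
        # Add: pick uniformly among periods with s_t = 0 and set s_t = 1.
        zeros = [i for i, v in enumerate(s) if v == 0]
        if zeros:         # assumption: no-op if every period already has a break
            new[rng.choice(zeros)] = 1
        return new, "add"
    ones = [i for i, v in enumerate(s) if v == 1]
    if u < q_add + q_del:
        # Delete: pick uniformly among periods with s_t = 1 and set s_t = 0.
        if ones:          # assumption: no-op if there is no break to delete
            new[rng.choice(ones)] = 0
        return new, "delete"
    # Move: remove a break at t* and place one at a neighbouring period.
    if ones and n >= 2:
        i = rng.choice(ones)
        new[i] = 0
        if i == 0:                  # t* = 1      -> tau* = 2
            j = 1
        elif i == n - 1:            # t* = T - 1  -> tau* = T - 2
            j = n - 2
        else:                       # otherwise t* - 1 or t* + 1, equally likely
            j = i + rng.choice((-1, 1))
        new[j] = 1                  # if s_tau* was already 1, a break is lost
    return new, "move"
```

A call such as `propose_breaks([0, 1, 0, 0, 1, 0], 0.3, 0.3, 0.4, random.Random(0))` returns a candidate vector that would then be accepted or rejected by the Metropolis rule.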
J. Geweke, Y. Jiang / Journal of Econometrics 163 (2011) 172–185
Fig. 1. Sample path of US real GDP quarterly growth rates.

Table 1
Prior hyperparameters and corresponding prior moments.

Parameter   Prior hyperparameters            Prior mean   Prior SD
µ∗          µ∗ = 0.00817, hµ∗ = 59960.35     0.0082       0.0041
hµ∗         s = 0.82, ν = 8                  9.82         4.91
ν∗          λ = 0.25                         4            4
s∗2         a = 20987.9, b = 8               0.00038      0.00019
π           γ1 = 3.57, γ2 = 38.10            21/245       10.5/245

Table 2
Prior predictive analysis for GDP quarterly growth rates.

      Function of interest                        hj(yo)    Pr[hj(y) ≤ hj(yo)]
h1    Coefficient of skewness                     −0.031    0.477
h2    Coefficient of excess kurtosis              1.333     0.347
h3    corr(yt, yt−1)                              0.329     0.943
h4    corr(|yt|, |yt−1|)                          0.261     0.595
h5    corr(|zt|, |zt−1|); zt = (yt − ȳT)^2        0.196     0.593
h6    Number of business cycle peaks              9         0.341
h7    Number of business cycle troughs            8         0.304
h8    Number of quarters in contraction           24        0.273
3.1. Prior specification

Specification of the prior distribution for the multiple-structural-break model described in Section 2.1 amounts to choice of the nine hyperparameters, µ∗, hµ∗, s, ν, a, b, λ, γ1, γ2. The baseline hyperparameter specification is based loosely on sample information.

1. Equate the unconditional mean of yt with the sample mean. Thus E(µ∗) = 0.00817.
2. Motivated by E(hj) = ν∗/s∗2, equate E(ν∗)/E(s∗2) with the sample precision of yt. Thus

   E(ν∗)/E(s∗2) = 10493.96.   (8)

3. Set E(ν∗) = 4. (If ν∗ = 4 then from (2) hj/hi ∼ F(4, 4) (j ≠ i) and the probability that the ratio hj/hi is less than 1/4 or greater than 4 is about 0.1.) Then from (8) E(s∗2) = 0.00038.
4. Identify hµ∗ with the ratio of four times the sample variance of yt to the variance in annual growth rates, where the annual growth rate is the average of the quarterly growth rates in each year. Thus E(hµ∗) = 9.82.
5. Equate E(π) with the ratio of the number of NBER-dated expansions and contractions over the sample period (21) to the total number of breaks between quarters in the sample (T − 1 = 245).
6. Take the prior standard deviation of each of the parameters µ∗, hµ∗, s∗2 and π to be half of its mean.
This line of reasoning leads to the choices of hyperparameters shown in Table 1. To understand more fully the implications of the model and the choices of the hyperparameters in particular, we undertook a prior predictive analysis. In this analysis M = 1000 values of the five parameters are drawn from the prior distribution, each draw followed by the latent variables s, the regime parameters µj, hj, and then a sample of T = 246 observables y^(m) = (y1^(m), . . . , yT^(m))′ (m = 1, . . . , M). We computed eight functions of interest, hj(y^(m)) (j = 1, . . . , 8), and compared them with the observed values hj(yo). For each function hj, Table 2 provides hj(yo) together with the prior c.d.f. of hj evaluated at hj(yo), approximated by

\[ \frac{1}{M}\sum_{m=1}^{M} I_{(-\infty,\,h_j(y^o)]}\bigl(h_j(y^{(m)})\bigr). \]

The first two functions of interest measure the ability of the model to reproduce the shape of the empirical distribution of the GDP growth rates yt. The observed sample skewness coefficient h1 is small and well within the support of the prior predictive distribution. The observed excess kurtosis h2 is moderately positive, and is easily accounted for through the mixing of normal distributions across different regimes.

The remaining functions of interest measure the ability of the model to reproduce the observed dynamics in the GDP growth rate. The sample first autocorrelation h3 is near the upper end of the prior predictive distribution, yet within a centered 90% prior predictive credible interval. The prior predictive analysis studies persistence in volatility in two ways: through the first-order autocorrelation in absolute growth rates, h4, and through the first-order autocorrelation in the squared departures of growth rates from their sample mean ȳT, h5. In each case the observed value is about the 60th percentile of the prior predictive distribution. Notice that while growth rates are i.i.d. within each regime, regimes can persist for varying lengths of time, thereby generating dynamics with persistence.

The last three functions of interest focus specifically on dynamics related to business cycles. They are all based on an algorithm for dating business cycles that mimics NBER dating.2 If y1 < 0 (y1 > 0) then period 1 is part of a contraction (expansion) that begins in period 1. Then for periods t = 2, . . . , T − 1, a business cycle trough occurs and an expansion begins at t if period t − 1 is part of a contraction and yt−1 < 0, yt > 0 and yt+1 > 0; and a business cycle peak occurs and a contraction begins at t if period t − 1 is part of an expansion and yt−1 > 0, yt < 0 and yt+1 < 0. For the T = 246 observed GDP growth rates this produces 9 business cycle peaks, 8 troughs, and 24 quarters in contractions.
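The dating rule and the prior c.d.f. approximation described above can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, and two details that the text leaves open are filled in by assumption (a first observation exactly equal to zero is treated as part of an expansion, and the final period inherits the state of the period before it).

```python
def date_cycles(y):
    """Count peaks, troughs and contraction quarters from growth rates y.

    y[0] is period 1. Assumptions: y[0] == 0 counts as expansion, and
    period T keeps the state of period T - 1.
    """
    T = len(y)
    in_contraction = [False] * T
    in_contraction[0] = y[0] < 0
    peaks = troughs = 0
    for t in range(1, T):
        state = in_contraction[t - 1]
        if t < T - 1:  # turning points need y[t + 1], so stop at period T - 1
            if state and y[t - 1] < 0 and y[t] > 0 and y[t + 1] > 0:
                state = False      # trough: an expansion begins at t
                troughs += 1
            elif (not state) and y[t - 1] > 0 and y[t] < 0 and y[t + 1] < 0:
                state = True       # peak: a contraction begins at t
                peaks += 1
        in_contraction[t] = state
    return peaks, troughs, sum(in_contraction)


def prior_cdf(h_obs, h_draws):
    """Fraction of prior predictive draws with h(y^(m)) <= h(y^o)."""
    return sum(1 for h in h_draws if h <= h_obs) / len(h_draws)


# Toy series: two positive quarters, two negative, two positive.
print(date_cycles([1.0, 1.0, -1.0, -1.0, 1.0, 1.0]))  # (1, 1, 2)
print(prior_cdf(2.0, [1.0, 2.0, 3.0, 4.0]))            # 0.5
```

In a prior predictive exercise, `date_cycles` would be applied to each simulated series y^(m) and the resulting counts passed to `prior_cdf` together with the observed counts.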
These values are all between the 25th and 35th percentiles of the prior predictive distribution. Thus the number and observed asymmetry of business cycles are well within the model's capacity to reproduce.

Fig. 2 provides another perspective on the prior predictive analysis. Each panel presents a scatterplot of 1000 independent draws of a different pair of functions from the prior. The intersection of the horizontal and vertical lines designates the observed value of the pair in each case. The prior distribution of the sample coefficients of skewness and excess kurtosis in panel (a) indicates that the model permits both platykurtic and leptokurtic distributions, including some that are highly leptokurtic, as well as very skewed distributions. The observed combination is well within the support of the prior predictive distribution. Panel (b) provides the joint prior distribution of the first-order autocorrelation coefficient for growth rates and that for the absolute value of growth rates. In both cases the prior distribution supports both positive and negative sample autocorrelation coefficients, but the prior probability of positive autocorrelation exceeds that of negative autocorrelation. Combinations of higher sample autocorrelation of growth rates and lower sample autocorrelation of absolute growth rates compared with those actually observed are unlikely in the prior predictive distribution.

2 NBER dating is more complex, subjective and based on additional time series. To conduct prior predictive analysis it is necessary to have a business cycle dating convention based solely on GDP growth rates that can be coded. This does not affect the validity or usefulness of the prior predictive analysis that compares the prior distribution of hj(y) with the observed values hj(yo) (j = 6, 7, 8).

Fig. 2. Prior predictive distribution of characteristics of GDP quarterly growth rates.

Table 3
Prior and posterior moments; numerical accuracy for GDP quarterly growth rates.

                  Prior                  Posterior
Parameter         Mean       SD          Mean        SD         NSE            RNE
µ∗                0.0082     0.0041      0.00822     0.00104    1.16 × 10^−5   0.81
hµ∗               9.82       4.91        8.634       4.577      0.088          0.27
ν∗                4          4           4.502       1.628      0.017          0.89
s∗2               0.00038    0.00019     0.000233    0.00010    1.45 × 10^−6   0.52
π                 0.086      0.043       0.070       0.035      0.0010         0.12
# of regimes J    22         11.36       17.50       8.94       0.29           0.10

3.2. Posterior distribution of parameters
All posterior analysis is based on 1,005,000 iterations of the simulator, the first 5000 being discarded and the remaining sample being thinned to every 100th value, leaving a posterior sample of 10,000 values for analysis. Appendices A.5 and A.6 provide some detail on mixing properties of the Markov chain and computation time. Table 3 provides the prior and posterior means and standard deviations of each of the five parameters and of the number of regimes J in the sample, and the numerical standard error (NSE) and relative numerical efficiency (RNE) of the simulation approximations to the posterior means. Except for strong positive correlation between π and J and mild positive correlation between s∗2 and ν∗, there is little dependence among the five parameters and J in the posterior distribution. Fig. 3 shows the corresponding posterior densities.

Analysis of the sensitivity of the posterior distribution to changes in the prior distribution was carried out by varying the hyperparameters of the prior distribution, taking as baseline the values indicated in Table 1. The posterior distribution of the number of regimes J proved to be more sensitive to this variation than the posterior distributions of any of the parameters were.

The prior distribution of µ∗ = E(µj) was varied from baseline by multiplying the baseline value E(µ∗) = 0.0082 by 0.1, 0.5, 2 and 10, while maintaining the same prior standard deviation. As indicated in Table 4, upper left panel, the posterior mean of J is insensitive to all changes but the last one, for which the posterior mean of J drops from 17.50 to 2.12 regimes. Our interpretation of this finding is that the prior distribution forces the posterior distribution of mean growth rates in regimes µj to be unrealistically high. Compared with the discrepancy between this high value and actual behavior, variations from one regime to another are negligible, and in this context variation across regimes is undetectable.
The parameters hµ∗, s∗2, and ν∗ are intrinsically positive. Their prior means were also varied from baseline by multiplying their values by 0.1, 0.5, 2 and 10, while continuing to set the prior standard deviation equal to half of the prior mean. The remaining three panels of Table 4 report the results of this analysis. The elasticity of the posterior mean of J with respect to the prior means of s∗2 and ν∗ is less than 0.2 in absolute value. The posterior mean of the number of regimes J is inversely related to the prior mean of hµ∗. Our interpretation of this result is that as the prior specifies more variance in regime means µj, the number of regimes detected increases.
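The NSE and RNE columns of Table 3 can be related to each other with a short calculation. The sketch below assumes the usual definition of relative numerical efficiency, RNE = posterior variance / (N × NSE²), where N is the number of retained draws; how the NSE itself was estimated is not stated in this section, so that part is not reproduced. The function name `rne` is hypothetical. Because the table entries are rounded, the recomputed values agree with the reported RNE only approximately.

```python
def rne(posterior_sd, nse, n_draws):
    """Relative numerical efficiency under the assumed definition
    RNE = posterior variance / (n_draws * NSE^2)."""
    return posterior_sd ** 2 / (n_draws * nse ** 2)


# Retained draws: 1,005,000 iterations, first 5000 discarded, every 100th kept.
n_draws = len(range(5000, 1005000, 100))

# (posterior SD, NSE) pairs copied from Table 3.
table3 = {
    "mu*":  (0.00104, 1.16e-5),
    "hmu*": (4.577, 0.088),
    "pi":   (0.035, 0.0010),
    "J":    (8.94, 0.29),
}
for name, (sd, nse) in table3.items():
    print(name, round(rne(sd, nse, n_draws), 3))
```

An RNE near 1 means the thinned draws are about as informative for the posterior mean as independent draws would be; the low values for π and J reflect the stronger serial dependence in those components of the chain.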
Fig. 3. Posterior densities of the five parameters and the number of regimes in the sample. (Panels: mu*, hmu*, s*^2, nu*, pi, # of regimes.)
Table 4
Prior means of parameters and corresponding posterior moments of J.

            µ∗            Number of regimes       hµ∗           Number of regimes
Scalar      Prior mean    Mean      SD            Prior mean    Mean      SD
1/10        0.00082       17.69     9.69          0.982         33.26     13.61
1/2         0.0041        17.23     9.07          4.908         18.20     9.52
Baseline    0.0082        17.50     8.94          9.815         17.50     8.94
2           0.0163        16.50     8.82          19.630        14.86     7.82
10          0.082         2.12      0.38          98.152        10.17     5.76

            s∗2           Number of regimes       ν∗            Number of regimes
Scalar      Prior mean    Mean      SD            Prior mean    Mean      SD
1/10        0.000038      16.18     7.98          0.4           14.44     8.80
1/2         0.00019       18.92     11.23         2             18.43     8.34
Baseline    0.00038       17.50     8.94          4             17.50     8.94
2           0.00076       17.27     9.34          8             16.36     7.97
10          0.0038        13.62     7.66          40            15.81     7.66

The parameter π ∈ (0, 1). Its prior mean was varied from baseline by multiplying its value by 0.1, 0.5, 2 and 10, while maintaining the same value γ1 + γ2 = 41.67 as was used in the baseline prior distribution. Table 5 shows the corresponding prior means and standard deviations of π and the posterior means and standard deviations of J. The posterior distribution of J is very sensitive to the prior distribution of π. Halving (doubling) the prior mean of π leads to a halving (doubling) of the posterior mean of J. The sensitivity of the posterior mean of J to more extreme changes in E(π) is not as great but is still quite large.

This analysis of the sensitivity of the posterior distribution to changes in the prior distribution strongly suggests that the prior distribution of π is the primary channel through which the definition of the regime is embedded in the model. We do not view this as a weakness. Any model of multiple breakpoints must communicate such a definition. The definition can be more or less
implicit or explicit. In the model formulation here the definition is explicit in the prior distribution of π. In the application the definition is chosen so that the typical length of a regime coincides with the typical length of expansions and contractions.

Table 5
Prior means and standard deviations of π and corresponding posterior moments of J.

            π                          Number of regimes
Scalar      Prior mean    Prior SD     Mean       SD
1/10        2.1/245       0.014        5.16       4.06
1/2         10.5/245      0.031        9.57       5.84
Baseline    21/245        0.043        17.50      8.94
2           42/245        0.058        35.68      12.55
10          210/245       0.054        164.33     16.93

3.3. Structural breaks

Fig. 4 provides the posterior distribution of the number of regimes in the sample period 1947Q2–2008Q3. The posterior mean is 17.5, the posterior standard deviation is 8.94, the posterior mode is 12, and [6, 34] is a 95% posterior credible interval. The multiple-break model in Koop and Potter (2007) applied to US real GDP growth rates for 1947Q2 through 2005Q4 yields a posterior mean of 45 and posterior standard deviation of 12.7 for the number of regimes. By design, the posterior distribution in our model is more closely aligned with the number of regimes defined by US business cycles.

Fig. 5 provides the posterior probability of break occurrence (lower solid line) at each quarter, alongside the sample of real GDP quarterly growth rates (upper shaded line). Table 6 lists the breaks identified with posterior probability greater than 0.2. Referring to the conventional US business cycle dates listed in Table 7, 1960Q4, 1982Q4, and 1991Q1 are breakpoints between contractions and expansions. 1966Q1, 1984Q2, and 2000Q2 are well within periods of expansion. However, from Fig. 5 all three breaks correspond to distinct decreases in the growth rate of GDP. In particular the break of 2000Q2 corresponds to the burst of the I.T. bubble.

Fig. 4. Posterior distribution of the number of regimes for real GDP quarterly growth rates.

Fig. 5. Posterior probability of break occurrence for real GDP quarterly growth rates.

Table 6
Breaks with high posterior probability.

Break     Posterior probability
1960Q4    0.281
1961Q1    0.242
1966Q1    0.231
1982Q4    0.257
1984Q2    0.344
1991Q1    0.207
2000Q2    0.205

Table 7
US business cycle expansions and contractions. Source: http://www.nber.org/cycles.html.

No.   Starting date   Ending date   Phase         Midpoint
1     1947Q2          1948Q4        Expansion     1948Q1
2     1948Q4          1949Q3        Contraction   1949Q2
3     1949Q3          1953Q2        Expansion     1951Q3
4     1953Q2          1954Q2        Contraction   1953Q4
5     1954Q2          1957Q3        Expansion     1955Q4
6     1957Q3          1958Q2        Contraction   1957Q4
7     1958Q2          1960Q2        Expansion     1959Q2
8     1960Q2          1961Q1        Contraction   1960Q3
9     1961Q1          1969Q4        Expansion     1965Q3
10    1969Q4          1970Q4        Contraction   1970Q2
11    1970Q4          1973Q4        Expansion     1972Q2
12    1973Q4          1975Q1        Contraction   1974Q3
13    1975Q1          1980Q1        Expansion     1977Q3
14    1980Q1          1980Q3        Contraction   1980Q2
15    1980Q3          1981Q3        Expansion     1980Q4
16    1981Q3          1982Q4        Contraction   1982Q1
17    1982Q4          1990Q3        Expansion     1986Q3
18    1990Q3          1991Q1        Contraction   1990Q4
19    1991Q1          2001Q1        Expansion     1996Q1
20    2001Q1          2001Q4        Contraction   2001Q3
21    2001Q4          2007Q4        Expansion     2004Q4
22    2007Q4          2008Q3        Contraction   2008Q2

Event probabilities are often reported in this way, but the presentation has inherent limitations because it provides only marginal probabilities for each time period. For example, consider a hypothetical breakpoint indicator vector s (4 × 1) with marginal posterior distribution P(s1 = 1) = 0, P(s2 = 1) = 0.5, P(s3 = 1) = 0.5, P(s4 = 1) = 0. Consider also the events s^(1) = (0, 0, 0, 0)′, s^(2) = (0, 1, 0, 0)′, s^(3) = (0, 0, 1, 0)′, s^(4) = (0, 1, 1, 0)′. The marginal distribution is consistent with P(s^(1)) = P(s^(4)) = 1/2, and it is also consistent with P(s^(2)) = P(s^(3)) = 1/2. More loosely, a succession of low probabilities may be consistent with a high probability of exactly one breakpoint combined with uncertainty about the date, but it may also be consistent with substantial uncertainty about the number of breakpoints.

An alternative summary of the posterior distribution of s provides more information about the multiple-breakpoint model in the context of cyclical expansions and contractions. This summary locates the midpoint of each standard (NBER) expansion and contraction, indicated in the last column of Table 7. From the posterior simulations of s it is then straightforward to compute the MCMC approximation of the probability that any pair of midpoints belongs to the same regime. The results of these calculations are portrayed in Fig. 6. In each panel, each axis of the graph indicates the expansion and contraction numbers in the first column of Table 7. Each square indicates the posterior probability that the corresponding pair of midpoints were in the same regime.
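The same-regime probability for a pair of periods, and the limitation of marginal break probabilities illustrated by the hypothetical 4 × 1 indicator vector, can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and the two small lists stand in for posterior draws of s (each listed vector is treated as an equally weighted draw).

```python
def same_regime_prob(draws, a, b):
    """MCMC approximation of the probability that periods a and b
    (1-indexed) share a regime: no break among s_a, ..., s_{b-1}."""
    lo, hi = min(a, b), max(a, b)
    hits = sum(1 for s in draws if not any(s[lo - 1:hi - 1]))
    return hits / len(draws)


def marginal_break_probs(draws):
    """Marginal probability of a break at each position."""
    n = len(draws[0])
    return [sum(s[i] for s in draws) / len(draws) for i in range(n)]


# Two joint distributions from the hypothetical example in the text.
dist_a = [(0, 0, 0, 0), (0, 1, 1, 0)]   # either no break or two breaks
dist_b = [(0, 1, 0, 0), (0, 0, 1, 0)]   # exactly one break, date uncertain

print(marginal_break_probs(dist_a))     # [0.0, 0.5, 0.5, 0.0]
print(marginal_break_probs(dist_b))     # [0.0, 0.5, 0.5, 0.0]
print(same_regime_prob(dist_a, 2, 4))   # 0.5
print(same_regime_prob(dist_b, 2, 4))   # 0.0
```

The marginal probabilities are identical for the two distributions, but the probability that periods 2 and 4 lie in the same regime differs, which is the information that the pairwise summary adds.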
Fig. 6. The posterior probability of midpoints of business cycles being in the same regime. Panels: (a) Low prior mean, high prior s.d.; (b) High prior mean, high prior s.d.; (c) Baseline prior mean and s.d.; (d) Low prior mean, low prior s.d.; (e) High prior mean, low prior s.d.

Table 8
Prior and posterior means of π for the five panels of Fig. 6.

         Prior for π                        Posterior for π
Panel    Mean                Stand. dev.    Mean      Stand. dev.
(a)      0.042 = 10.5/245    0.086          0.026     0.035
(b)      0.171 = 42/245      0.086          0.130     0.062
(c)      0.086 = 21/245      0.043          0.070     0.035
(d)      0.042 = 10.5/245    0.022          0.038     0.018
(e)      0.171 = 42/245      0.022          0.170     0.021

The center panel (c) indicates the posterior probabilities that midpoints are in the same regime, implied by the baseline prior distribution for the model (Table 3). The other panels provide the posterior probabilities that midpoints are in the same regime that result when the prior distribution for π is changed while the prior distributions of the other parameters remain the same as indicated in Table 3. Table 8 shows the corresponding prior and posterior means and standard deviations of π.

For the baseline case (panel (c)) this analysis indicates four ‘‘super-regimes’’, the defining characteristic being that for any pair of expansion or contraction midpoints in different super-regimes, the posterior probability that these midpoints are in the same regime is very close to zero. The four super-regimes so defined are 1947Q2–1961Q1 (midpoints 1–8), the single expansion 1961Q1–1969Q4 (midpoint 9), 1969Q4–1982Q4 (midpoints 10–16) and 1982Q4–2008Q3 (midpoints 17–22). The first and third super-regime are distinguished by their contrasts in fiscal discipline and are separated by the strong and sustained expansion of the 1960’s. The end of the third super-regime corresponds to the conclusion of the Volcker monetary reforms. Notably, the contraction at the end of the Carter administration and the very sharp contraction early in the Reagan administration are (along with the intervening weak expansion) placed in the same regime with high posterior probability. While this
interpretation varies in the expected direction with changes in the prior distribution, it is fairly robust with respect to these changes. A lower prior mean for π (panels (a) and (d)) implies fewer regime changes and a higher posterior probability that any pair of expansions and contractions is in the same regime. Nevertheless, there is still evidence of the pattern of four super-regimes. Contrasting the results in panels (a) and (d), a higher prior standard deviation for π leads to a lower posterior mean for π and a correspondingly higher probability of combining super-regimes, reflecting a posterior distribution of π that is shifted leftward in (a) relative to (d). The most prominent feature in (a) is the combination of expansions and contractions through the end of the Volcker reforms with a well-defined break at that point. A higher prior mean for π (panels (b) and (e)) implies a higher posterior mean for π, more regimes, and a higher posterior probability that any pair of expansions and contractions is in different regimes. The most extreme case, panel (e), produces the highest posterior mean for π and no evidence for super-regimes. With a higher prior standard deviation (panel (b)) the super-regimes begin to re-emerge. Just as was the case for the lower prior mean, relatively little information in the prior distribution in the form of a higher prior standard deviation shifts the posterior distribution of π to the left as indicated in Table 8.

3.4. Forecasting

It is straightforward to apply the procedures developed here to update predictive distributions in real time. When the sample is updated with a new observation yT, the posterior simulator described in Section 2.3 is executed. For each iteration m, draws from the predictive density for (yT+1, . . . , yT+h) are made as described in Section 2.2, using the simulated parameter
values s∗2(m), ν∗(m), µ∗(m), hµ∗(m), π(m), µjT(m) and hjT(m). As noted in Section 2.2, the number of regimes in the predictive density over the specified prediction horizon is a random variable, and accordingly it varies from one simulation to the next.

We illustrate this process by applying the model using successive samples ending in quarters 2007Q4 through 2009Q2, and constructing a random sample from future growth rates through 2009Q4 at the end of each quarter. We focus on the predictive probabilities of negative growth rates in future quarters. Table 9 provides the results of this exercise. Actual growth rates yt are shown in bold in the on-diagonal entries. GDP grew at a near-normal rate in 2008Q2, fell slightly in 2008Q3, dropped significantly in 2008Q4 and 2009Q1, and fell slightly in 2009Q2. The mild decline in GDP in 2008Q3 produced almost no change in the probability of future negative growth rates. Following the first steep decline in 2008Q4 these probabilities more than doubled (from below 0.15 to above 0.30), and following the second steep decline in 2009Q1 these probabilities increased further to around 0.4. Probabilities of future negative growth remained about the same following the decline in GDP in 2009Q2.

Table 10 provides the corresponding posterior probabilities of breakpoints in the model. Recall that a break occurs between quarters t and t + 1 if st = 1. The bold-face entries indicate probabilities of breakpoints within the sample: for example, Pr(s2007Q3 = 1 | y1, . . . , yT) = 0.056 for T = 2007Q4 indicates that at the end of 2007Q4, the model ascribes posterior probability 0.056 to a break between 2007Q3 and 2007Q4. The italic entries indicate probabilities of breakpoints in the future.

Recall that conditional on the model parameters the probability of any future breakpoint is π. Thus recent and current growth rates yt affect the probability of future breaks only to the extent that they affect the posterior distribution of π. Therefore there are no major changes in the predictive distributions of future breakpoints. This is evident in the probabilities (italic entries) in Table 10, as well as in other aspects of the distribution. For example at the end of 2008Q3 the predictive distribution indicated no breaks over the next 12 quarters with probability 0.455; 1 break, 0.329; 2 breaks, 0.155; 3 breaks, 0.053; and 4 or more breaks, 0.008.

By contrast, current and recent observed growth rates provide evidence on whether a break has occurred. Table 10 shows that probabilities of recent breakpoints remained low through the end of 2008Q3. The first steep decline in growth, 2008Q4, sharply increased the probability of a recent breakpoint, with the mode being a break between 2008Q2 and 2008Q3, the transition from positive to negative growth. Following the second steep decline in growth, 2009Q1, the model attributes similar substantial probability to a break between 2008Q3 and 2008Q4 as well. At the end of 2009Q1 and 2009Q2 the regime mean growth rate µJ is negative, and because breakpoint probabilities going forward remain small, probabilities of future negative growth remain high (Table 9).

Table 9
Actual growth rates (bold) and probabilities of negative growth in future quarters (italic). In this rendering the actual growth rates are the on-diagonal entries marked with *.

                  End of sample T:
Prediction for:   2007Q4   2008Q1    2008Q2    2008Q3     2008Q4     2009Q1     2009Q2
2008Q1            0.115    0.0022*
2008Q2            0.117    0.122     0.0070*
2008Q3            0.122    0.130     0.113     −0.0013*
2008Q4            0.129    0.126     0.117     0.137      −0.0164*
2009Q1            0.124    0.130     0.120     0.148      0.322      −0.0159*
2009Q2            0.131    0.130     0.120     0.145      0.325      0.422      −0.0019*
2009Q3            0.134    0.140     0.122     0.137      0.315      0.397      0.421
2009Q4            0.138    0.133     0.125     0.140      0.300      0.385      0.409
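The predictive distribution of the number of future breaks discussed in Section 3.4 follows from the fact that, conditional on π, each future break indicator is an independent Bernoulli(π) draw. A minimal Python sketch of the resulting calculation is given below. It assumes that a horizon of h quarters involves h break indicators and that posterior draws of π are available as a list; the function name and the example draws are hypothetical and are not the posterior sample behind Tables 9 and 10.

```python
from math import comb


def break_count_pmf(pi_draws, h):
    """Predictive p.m.f. of the number of breaks among h future indicators.

    Conditional on pi the count is Binomial(h, pi); averaging the binomial
    probabilities over posterior draws of pi integrates out pi.
    """
    m = len(pi_draws)
    return [
        sum(comb(h, k) * p ** k * (1.0 - p) ** (h - k) for p in pi_draws) / m
        for k in range(h + 1)
    ]


# Hypothetical draws of pi, for illustration only.
example_draws = [0.03, 0.05, 0.07, 0.09, 0.11]
pmf = break_count_pmf(example_draws, 12)
print([round(p, 3) for p in pmf[:4]], round(sum(pmf[4:]), 3))
```

Because (1 − π)^h is convex in π, averaging over the posterior spread of π gives a larger probability of no breaks than plugging in the posterior mean of π alone.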
3.5. A competing model

Our structural break model imposes a degree of similarity in behavior in yt across regimes by means of the hierarchical prior distribution described in Section 2.1. In this setup no two regimes are exactly alike. One could construct a similar model in which regimes can recur, but this would entail substantial changes in the posterior simulator best left to future research. Instead we consider a simple and familiar alternative, a two-regime Markov switching model in which st ∈ {1, 2}, P(st = i | st−1 = j) = pij (i, j = 1, 2), and within each regime

\[ y_t \overset{\text{i.i.d.}}{\sim} N\bigl(\mu_j, h_j^{-1}\bigr) \]

as in our model (1). (This is a very simple version of the influential model of Hamilton, 1989.) To facilitate a formal comparison with our model we complete the specification of this Markov switching model with independent prior distributions of the form

\[ \mu_j \sim N(\mu, h^{-1}), \qquad s^2 h_j \sim \chi^2(\nu) \qquad (j = 1, 2), \]
\[ p_{12} \sim \mathrm{Beta}(\gamma_1, \gamma_2), \qquad p_{21} \sim \mathrm{Beta}(\gamma_1, \gamma_2). \]

To make the prior close to the one used in our structural break model, we maintain the same values γ1 = 3.57 and γ2 = 38.1 as in the Beta prior distribution for π. For the other hyperparameters we use values producing prior distributions with the same median and interquartile range as in the structural break model: µ = 0.008167, h = 19035, s2 = 0.0000813, ν = 1.0775. The corresponding prior means and standard deviations are shown in columns 2 and 3 of Table 11.

The prior distributions are symmetric with respect to the two states, which are therefore unidentified. This identification issue has no bearing on prediction or on the execution of the MCMC posterior simulator (Geweke, 2007). Interpretation of posterior moments, on the other hand, requires that states 1 and 2 be identified. Identification of states by precision, h1 > h2, can be imposed directly on the MCMC simulator output, and this leads to the posterior means and standard deviations in columns 4 and 5 of Table 11. The difference in the posterior means of the precisions is large, both substantively and relative to their posterior standard deviations. Identification of states by mean, µ1 < µ2, leads to the posterior moments in the last two columns of Table 11. The posterior means of µ1 and µ2 are similar, both substantively and relative to posterior standard deviations. Thus the distinction between states is cast more naturally in terms of different precisions than in terms of different means.

Our formal comparison of the two models is based on the ratio of predictive likelihoods

\[ \frac{p(y_{s+1}, \ldots, y_T \mid y_1, \ldots, y_s, A_1)}{p(y_{s+1}, \ldots, y_T \mid y_1, \ldots, y_s, A_2)}, \tag{9} \]

where A1 denotes our structural break model and A2 denotes the Markov switching model. Notice that if s = 0 in (9) then this amounts to the Bayes factor in favor of model A1 versus model A2. (We chose not to compute a full Bayes factor because of the
well-known sensitivity of Bayes factors to prior distributions.) For s = 50, the log predictive likelihood of the structural break model is 669.76, whereas that of the Markov switching model is 663.89, implying the value exp(669.76 − 663.89) ≈ 354 for (9), strongly favoring the structural break model.

These results contrast markedly with those for US GDP in Hamilton (1989), where the states have clear interpretations as expansions and contractions. The evidence all suggests that the difference is due to the great moderation that began in the mid-1980’s (and ended, perhaps, with the global financial crisis) and barely appeared in Hamilton’s sample. This interpretation is supported by the posterior moments in Table 11, and by inspection of smoothed state probabilities (not presented here). It is also supported by the fact that a considerably more informative prior distribution is required before the posterior moments of the two-state Markov switching model begin to have a business cycle interpretation. For example, if we revise the prior distribution in Table 11 by changing the prior mean of µ1 to −0.008167 and reducing its prior standard deviation by a factor of 3 to 0.002416, then the posterior means of µ1 and µ2 are 0.00312 and 0.00837, respectively. But, in this case, the log predictive likelihood of the Markov switching model drops to 647.249. Thus the evidence favoring a ‘‘moderation’’ rather than a ‘‘business cycle’’ interpretation of this model is strong. Consistent with the specification of our structural break model, states appear to be idiosyncratic rather than recurring. This greater flexibility leads to improved predictions.

Table 10
Probabilities of regime breaks in the sample (bold) and in the future (italic). In this rendering the in-sample entries are marked with *.

                                 End of sample T:
Probability                      2007Q4   2008Q1   2008Q2   2008Q3   2008Q4   2009Q1   2009Q2
Pr(s2007Q2 = 1|y1, . . . , yT)   0.065*   0.049*   0.038*   0.040*   0.086*   0.078*   0.093*
Pr(s2007Q3 = 1|y1, . . . , yT)   0.056*   0.062*   0.044*   0.055*   0.143*   0.136*   0.145*
Pr(s2007Q4 = 1|y1, . . . , yT)   0.070    0.037*   0.043*   0.042*   0.114*   0.108*   0.111*
Pr(s2008Q1 = 1|y1, . . . , yT)   0.071    0.070    0.038*   0.055*   0.153*   0.145*   0.144*
Pr(s2008Q2 = 1|y1, . . . , yT)   0.069    0.068    0.069    0.054*   0.338*   0.338*   0.332*
Pr(s2008Q3 = 1|y1, . . . , yT)   0.071    0.069    0.063    0.074    0.185*   0.321*   0.290*
Pr(s2008Q4 = 1|y1, . . . , yT)   0.074    0.068    0.069    0.071    0.084    0.021*   0.030*
Pr(s2009Q1 = 1|y1, . . . , yT)   0.069    0.069    0.064    0.069    0.082    0.089    0.060*
Pr(s2009Q2 = 1|y1, . . . , yT)   0.068    0.068    0.069    0.068    0.077    0.084    0.086
Pr(s2009Q3 = 1|y1, . . . , yT)   0.068    0.066    0.069    0.071    0.081    0.081    0.091

Table 11
Prior and posterior moments of parameters in the Markov switching model.

                                      Posterior moments
            Prior moments             Precision labeling        Mean labeling
Parameter   mean        sd            mean        sd            mean        sd
µ1          0.00817     0.00725       0.00759     0.000587      0.00739     0.00059
µ2          0.00817     0.00725       0.00861     0.00108       0.00876     0.00091
h1          13253.38    18056.48      41673.97    8100.22       34324.14    16528.09
h2          13253.38    18056.48      6742.09     891.78        14788.40    15691.56
p12         0.0857      0.0428        0.0499      0.0237        0.0490      0.0229
p21         0.0857      0.0428        0.0446      0.0210        0.0454      0.0213

4. Conclusions and future research

This study developed a new Bayesian structural break model with several desirable features. The number of regimes is not fixed and is treated as a random variable in our model. Regime parameters are drawn from a common distribution, implying that each regime updates the econometrician’s information about future regimes.
The number of regimes is fixed neither in the sample nor over any specified prediction horizon; a single distribution provides all information about the occurrence of structural breaks. Most important, the structure of the model leads to a tractable distribution of regime parameters with a handful of prior hyperparameters. Posterior simulation is reliable
and efficient because (1) the structure of the model leads to an analytically tractable posterior distribution of regime parameters with a handful of hyperparameters and (2) a single Metropolis proposal permits introduction, deletion and movement of breakpoints. The application to US real GDP quarterly growth buttresses the interpretation that breakpoints are related to business cycles and provides some new insights into the changing features of these business cycles over the past 60 years.

A natural extension of this work would introduce a linear regression model in place of the simple normal distribution within each regime. Leading candidates for covariates are lagged values of the time series itself. Another natural extension is to several time series jointly, with the multivariate normal distribution replacing the univariate normal. Since the number of parameters in each regime grows as the square of the number of time series, this would entail close consideration of the hierarchical structure of the distribution of variance matrices across regimes. The combination of desirable features introduced in this study would carry over directly to these extensions.

Appendix A

A.1. Latent variables and observables

Define the vector of break indicators $s = (s_1, \ldots, s_{T-1})'$. Since $s$ is a Bernoulli process with parameter $\pi$,

$$p(s \mid \pi) = \pi^{J-1}(1-\pi)^{T-J}, \quad \text{where } J = 1 + \sum_{\tau=1}^{T-1} s_\tau. \tag{10}$$

Conditional on $j_t = 1 + \sum_{\tau=1}^{t-1} s_\tau$,

$$y_t \sim N(\mu_{j_t}, h_{j_t}^{-1}). \tag{11}$$

The conditional probability density function of $y$ is

$$p(y \mid \mu_{1:J}, h_{1:J}, s) \propto \prod_{j=1}^{J} h_j^{n_j/2} \exp\left[-\sum_{j=1}^{J} h_j (y_j - \iota_{n_j}\mu_j)'(y_j - \iota_{n_j}\mu_j)/2\right], \tag{12}$$

where $\mu_{1:J} = (\mu_1, \ldots, \mu_J)'$, $h_{1:J} = (h_1, \ldots, h_J)'$, $y_j$ is the vector containing the $n_j$ observations in regime $j$, and $\iota_n$ denotes an $n \times 1$ vector of units.
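The data-generating process in (10)–(12) can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code (which is written in C): the function names are hypothetical, and the regime parameters are drawn under the assumption that $s^{*2} h_j \sim \chi^2(\nu^*)$ and $\mu_j \mid h_j \sim N(\mu^*, (h_{\mu^*} h_j)^{-1})$, which is how we read the hierarchical prior of the paper.

```python
import numpy as np

def simulate_break_model(T, pi, s2_star, nu_star, mu_star, h_mu_star, seed=0):
    """Simulate y_1..y_T from the break model of Eqs. (10)-(11) (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    s = (rng.random(T - 1) < pi).astype(int)       # break indicators s_1..s_{T-1}
    j = 1 + np.concatenate(([0], np.cumsum(s)))    # regime index j_t = 1 + sum_{tau<t} s_tau
    J = int(j[-1])                                 # number of regimes, J = 1 + sum(s)
    # Assumed regime prior: s*^2 h_j ~ chi2(nu*), mu_j | h_j ~ N(mu*, 1/(h_mu* h_j)).
    h = rng.chisquare(nu_star, size=J) / s2_star
    mu = rng.normal(mu_star, 1.0 / np.sqrt(h_mu_star * h))
    y = rng.normal(mu[j - 1], 1.0 / np.sqrt(h[j - 1]))   # Eq. (11)
    return y, s, j, mu, h

def log_break_prior(s, pi):
    """Log of Eq. (10): (J-1) log(pi) + (T-J) log(1-pi)."""
    J = 1 + s.sum()
    T = s.size + 1
    return (J - 1) * np.log(pi) + (T - J) * np.log1p(-pi)

def log_likelihood_kernel(y, j, mu, h):
    """Log of the kernel in Eq. (12), written as a sum over observations."""
    resid = y - mu[j - 1]
    return 0.5 * np.sum(np.log(h[j - 1])) - 0.5 * np.sum(h[j - 1] * resid ** 2)
```

Summing over observations rather than over regimes gives the same value as (12), because each regime $j$ contributes $n_j$ identical terms $\tfrac12 \ln h_j$.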
J. Geweke, Y. Jiang / Journal of Econometrics 163 (2011) 172–185
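A.2. The prior distribution

From (2) the conditional prior density kernels of $\mu_j$ and $h_j$ are

$$p(h_j \mid s^{*2}, \nu^*) \propto [2^{\nu^*/2}\,\Gamma(\nu^*/2)]^{-1} (s^{*2})^{\nu^*/2}\, h_j^{(\nu^*-2)/2} \exp(-s^{*2} h_j/2), \tag{13}$$

$$p(\mu_j \mid h_j, \mu^*, h_{\mu^*}) \propto h_j^{1/2} h_{\mu^*}^{1/2} \exp[-h_j h_{\mu^*} (\mu_j - \mu^*)^2/2]. \tag{14}$$

The probability density functions corresponding to (4)–(7) are

$$p(\mu^* \mid \underline{\mu}^*, \underline{h}_{\mu^*}) \propto \exp[-\underline{h}_{\mu^*} (\mu^* - \underline{\mu}^*)^2/2], \tag{15}$$

$$p(h_{\mu^*} \mid \underline{s}, \underline{\nu}) \propto h_{\mu^*}^{(\underline{\nu}-2)/2} \exp(-\underline{s}\, h_{\mu^*}/2), \tag{16}$$

$$p(s^{*2} \mid a, b) \propto (s^{*2})^{(b-2)/2} \exp(-a s^{*2}/2), \tag{17}$$

$$p(\nu^* \mid \lambda) \propto \exp(-\lambda \nu^*). \tag{18}$$

Corresponding to (3),

$$p(\pi \mid \gamma_1, \gamma_2) \propto \pi^{\gamma_1 - 1}(1-\pi)^{\gamma_2 - 1}. \tag{19}$$

A.3. The posterior distribution

The posterior density kernel of all parameters and latent variables is the product of the conditional p.d.f. of latent variables given in (10), the conditional p.d.f. of observables in (12), and the prior density kernels of parameters given in (13)–(19). From this product, for $j = 1, \ldots, J$,

$$p(\mu_j, h_j \mid s^{*2}, \nu^*, \mu^*, h_{\mu^*}, \pi, s, y) \propto h_j^{(n_j+\nu^*+1-2)/2} \exp\left\{-h_j\left[\overline{h}_{\mu_j}(\mu_j - \overline{\mu}_j)^2 + s^{*2} + s_j^2 + \frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}(\mu^* - \bar{y}_j)^2\right]/2\right\}, \tag{20}$$

where

$$\overline{h}_{\mu_j} = h_{\mu^*} + n_j, \quad \overline{\mu}_j = (h_{\mu^*}\mu^* + n_j \bar{y}_j)/(h_{\mu^*} + n_j), \quad \bar{y}_j = n_j^{-1}\iota_{n_j}' y_j, \quad s_j^2 = (y_j - \iota_{n_j}\bar{y}_j)'(y_j - \iota_{n_j}\bar{y}_j).$$

Hence the random vectors $(h_j, \mu_j)$ are conditionally independent in the posterior distribution and each has a normal–gamma distribution

$$\overline{s}_j^2 h_j \mid (s^{*2}, \nu^*, \mu^*, h_{\mu^*}, \pi, s, y) \sim \chi^2(\overline{\nu}_j)$$

and

$$\mu_j \mid (h_j, s^{*2}, \nu^*, \mu^*, h_{\mu^*}, \pi, s, y) \sim N(\overline{\mu}_j, h_j^{-1}\,\overline{h}_{\mu_j}^{-1}),$$

where

$$\overline{s}_j^2 = s^{*2} + s_j^2 + \frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}(\mu^* - \bar{y}_j)^2, \quad \overline{\nu}_j = n_j + \nu^* \quad (j = 1, \ldots, J).$$

Taking advantage of the known constant of integration for the normal–gamma distribution, integrate the posterior density to obtain the marginal posterior density

$$p(s^{*2}, \nu^*, \mu^*, h_{\mu^*}, \pi, s \mid y) \propto \prod_{j=1}^{J} \frac{\Gamma\left(\frac{n_j+\nu^*}{2}\right)}{\Gamma\left(\frac{\nu^*}{2}\right)} \prod_{j=1}^{J}\left(\frac{h_{\mu^*}}{h_{\mu^*}+n_j}\right)^{1/2} \cdot (s^{*2})^{J\nu^*/2}$$
$$\times \prod_{j=1}^{J}\left[s^{*2} + s_j^2 + \frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}(\mu^* - \bar{y}_j)^2\right]^{-(n_j+\nu^*)/2}$$
$$\times \exp[-\underline{h}_{\mu^*}(\mu^* - \underline{\mu}^*)^2/2]\; h_{\mu^*}^{(\underline{\nu}-2)/2} \exp(-\underline{s}\,h_{\mu^*}/2)\, \exp(-\lambda\nu^*)$$
$$\times (s^{*2})^{(b-2)/2} \exp(-a s^{*2}/2)\; \pi^{\gamma_1+J-2}(1-\pi)^{\gamma_2+T-J-1}. \tag{21}$$

From (21),

$$p(s^{*2} \mid \nu^*, \mu^*, h_{\mu^*}, \pi, s, y) \propto k(s^{*2} \mid \nu^*, \mu^*, h_{\mu^*}, \pi, s, y)$$
$$= (s^{*2})^{(b-2)/2} \exp(-a s^{*2}/2) \prod_{j=1}^{J}\left[s^{*2} + s_j^2 + \frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}(\mu^* - \bar{y}_j)^2\right]^{-(n_j+\nu^*)/2} (s^{*2})^{\nu^*/2}, \tag{22}$$

$$p(\nu^* \mid s^{*2}, \mu^*, h_{\mu^*}, \pi, s, y) \propto k(\nu^* \mid s^{*2}, \mu^*, h_{\mu^*}, \pi, s, y)$$
$$= \exp(-\lambda\nu^*) \prod_{j=1}^{J} \frac{\Gamma\left(\frac{n_j+\nu^*}{2}\right)}{\Gamma\left(\frac{\nu^*}{2}\right)} \left[\frac{s^{*2} + s_j^2 + \frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}(\mu^* - \bar{y}_j)^2}{s^{*2}}\right]^{-\nu^*/2}, \tag{23}$$

$$p(\mu^* \mid s^{*2}, \nu^*, h_{\mu^*}, \pi, s, y) \propto k(\mu^* \mid s^{*2}, \nu^*, h_{\mu^*}, \pi, s, y)$$
$$= \exp[-\underline{h}_{\mu^*}(\mu^* - \underline{\mu}^*)^2/2] \prod_{j=1}^{J}\left[s^{*2} + s_j^2 + \frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}(\mu^* - \bar{y}_j)^2\right]^{-(n_j+\nu^*)/2}, \tag{24}$$

$$p(h_{\mu^*} \mid s^{*2}, \nu^*, \mu^*, \pi, s, y) \propto k(h_{\mu^*} \mid s^{*2}, \nu^*, \mu^*, \pi, s, y)$$
$$= h_{\mu^*}^{(\underline{\nu}-2)/2} \exp(-\underline{s}\,h_{\mu^*}/2) \prod_{j=1}^{J}\left(\frac{h_{\mu^*}}{h_{\mu^*}+n_j}\right)^{1/2}\left[s^{*2} + s_j^2 + \frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}(\mu^* - \bar{y}_j)^2\right]^{-(n_j+\nu^*)/2}, \tag{25}$$

$$p(\pi \mid s^{*2}, \nu^*, \mu^*, h_{\mu^*}, s, y) \propto k(\pi \mid s^{*2}, \nu^*, \mu^*, h_{\mu^*}, s, y) = \pi^{\gamma_1+J-1-1}(1-\pi)^{\gamma_2+T-J-1}, \tag{26}$$

and

$$p(s \mid s^{*2}, \nu^*, \mu^*, h_{\mu^*}, \pi, y) \propto k(s \mid s^{*2}, \nu^*, \mu^*, h_{\mu^*}, \pi, y)$$
$$= \pi^{J-1}(1-\pi)^{T-J} (s^{*2})^{J\nu^*/2} \prod_{j=1}^{J} \frac{\Gamma\left(\frac{n_j+\nu^*}{2}\right)}{\Gamma\left(\frac{\nu^*}{2}\right)} \prod_{j=1}^{J}\left(\frac{h_{\mu^*}}{\overline{h}_{\mu_j}}\right)^{1/2}$$
$$\times \prod_{j=1}^{J}\left[s^{*2} + s_j^2 + n_j(\bar{y}_j - \overline{\mu}_j)^2 + h_{\mu^*}(\mu^* - \overline{\mu}_j)^2\right]^{-(n_j+\nu^*)/2}. \tag{27}$$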
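The normal–gamma result following (20) translates directly into a sampling step for the regime parameters. The sketch below is illustrative only, assuming NumPy and hypothetical function names; it computes $\overline{\mu}_j$, $\overline{h}_{\mu_j}$, $\overline{s}_j^2$ and $\overline{\nu}_j$ for one regime and then draws $h_j$ and $\mu_j$ from their conditional posterior distributions.

```python
import numpy as np

def regime_posterior_params(y_j, s2_star, nu_star, mu_star, h_mu_star):
    """Posterior hyperparameters of the normal-gamma distribution for one regime."""
    n = y_j.size
    ybar = y_j.mean()
    s2 = np.sum((y_j - ybar) ** 2)                  # s_j^2
    h_bar = h_mu_star + n                           # \bar h_{mu_j}
    mu_bar = (h_mu_star * mu_star + n * ybar) / h_bar
    s2_bar = s2_star + s2 + n * h_mu_star / h_bar * (mu_star - ybar) ** 2
    nu_bar = n + nu_star
    return mu_bar, h_bar, s2_bar, nu_bar

def draw_regime(y_j, s2_star, nu_star, mu_star, h_mu_star, rng):
    """Draw (mu_j, h_j): s2_bar * h_j ~ chi2(nu_bar), mu_j | h_j ~ N(mu_bar, 1/(h_j h_bar))."""
    mu_bar, h_bar, s2_bar, nu_bar = regime_posterior_params(
        y_j, s2_star, nu_star, mu_star, h_mu_star)
    h = rng.chisquare(nu_bar) / s2_bar
    mu = rng.normal(mu_bar, 1.0 / np.sqrt(h * h_bar))
    return mu, h
```

Because the regimes are conditionally independent, a full draw of $(\mu_{1:J}, h_{1:J})$ would simply call `draw_regime` once per regime.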
A.4. The posterior simulator

The posterior simulator for our model is, globally, a Gibbs sampler with six blocks. Appendix A.3 provides the conditional posterior density kernels.

1. Scale parameter $s^{*2}$ for the within-regime precisions. The conditional posterior density kernel given in (22) is not of any known form but well suited to a Metropolis within Gibbs step. To construct a source distribution from which candidate $\tilde{s}^{*2}$ is drawn, we first find $\hat{s}^{*2}$ that maximizes the density kernel of $s^{*2}$. This can be done by solving the equation

$$\frac{\partial}{\partial s^{*2}} \ln k(s^{*2} \mid \nu^*, \mu^*, h_{\mu^*}, \pi, s, y) = \sum_{j=1}^{J}\left[-\frac{n_j+\nu^*}{2}\,\frac{1}{s^{*2}+Q_j} + \frac{\nu^*}{2}\,\frac{1}{s^{*2}}\right] + \frac{b-2}{2}\,\frac{1}{s^{*2}} - \frac{a}{2} = 0, \tag{28}$$

where

$$Q_j = s_j^2 + \frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}(\mu^* - \bar{y}_j)^2.$$

Eq. (28) has a unique solution if $b \ge 2$. This fact can be seen as follows. (1) As $s^{*2}$ goes from 0 to $\infty$ the left hand side of Eq. (28) varies from $\infty$ to $-a/2$, and this shows the existence of solutions. (2) The derivative of the left side either is negative or has only one zero, which means that either the left side is decreasing or it first decreases to its absolute minimum and then increases while approaching the limit $-a/2$. This shows the uniqueness of the solution. Roots are found by successive bisection.³ Now let

$$I_{s^{*2}} = -\left.\frac{\partial^2}{\partial (s^{*2})^2} \ln k(s^{*2} \mid \nu^*, \mu^*, h_{\mu^*}, \pi, s, y)\right|_{s^{*2}=\hat{s}^{*2}} = -\sum_{j=1}^{J}\left[\frac{n_j+\nu^*}{2}\,\frac{1}{(\hat{s}^{*2}+Q_j)^2} - \frac{\nu^*}{2}\,\frac{1}{(\hat{s}^{*2})^2}\right] + \frac{b-2}{2}\,\frac{1}{(\hat{s}^{*2})^2}.$$

Then, draw the candidate

$$\tilde{s}^{*2} \sim N(\hat{s}^{*2}, I_{s^{*2}}^{-1}).$$

We accept the draw with probability

$$\min\left\{\frac{k(\tilde{s}^{*2} \mid \nu^*, \mu^*, h_{\mu^*}, \pi, s, y)/\exp[-I_{s^{*2}}(\tilde{s}^{*2} - \hat{s}^{*2})^2/2]}{k(s^{*2} \mid \nu^*, \mu^*, h_{\mu^*}, \pi, s, y)/\exp[-I_{s^{*2}}(s^{*2} - \hat{s}^{*2})^2/2]},\ 1\right\}.$$

2. Shape parameter $\nu^*$ for the within-regime precisions. A similar Metropolis within Gibbs step is applied to this block since the conditional posterior density kernel of $\nu^*$ is nonstandard. First, using the successive bisection method find $\hat{\nu}^*$ that maximizes the kernel (23) by solving

$$\frac{\partial \ln k(\nu^* \mid s^{*2}, \mu^*, h_{\mu^*}, \pi, s, y)}{\partial \nu^*} = \sum_{j=1}^{J}\left[\frac{1}{2}\Psi\left(\frac{n_j+\nu^*}{2}\right) - \frac{1}{2}\Psi\left(\frac{\nu^*}{2}\right) - \frac{1}{2}\ln \eta_j\right] - \lambda = 0, \tag{29}$$

where

$$\eta_j = \frac{s^{*2} + s_j^2 + \frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}(\mu^* - \bar{y}_j)^2}{s^{*2}},$$

and $\Psi(z) = \frac{d \ln \Gamma(z)}{dz}$ is the digamma function. The uniqueness of the solution of Eq. (29) can be seen by the fact that the logarithm of the posterior density kernel of $\nu^*$ is globally concave since its second derivative is always negative. Let

$$I_{\nu^*} = -\left.\frac{\partial^2}{\partial (\nu^*)^2} \ln k(\nu^* \mid s^{*2}, \mu^*, h_{\mu^*}, \pi, s, y)\right|_{\nu^*=\hat{\nu}^*} = -\frac{1}{4}\sum_{j=1}^{J}\left[\Psi'\left(\frac{n_j+\hat{\nu}^*}{2}\right) - \Psi'\left(\frac{\hat{\nu}^*}{2}\right)\right].$$

Draw the candidate

$$\tilde{\nu}^* \sim N(\hat{\nu}^*, I_{\nu^*}^{-1}),$$

and accept it with probability

$$\min\left\{\frac{k(\tilde{\nu}^* \mid s^{*2}, \mu^*, h_{\mu^*}, \pi, s, y)/\exp[-I_{\nu^*}(\tilde{\nu}^* - \hat{\nu}^*)^2/2]}{k(\nu^* \mid s^{*2}, \mu^*, h_{\mu^*}, \pi, s, y)/\exp[-I_{\nu^*}(\nu^* - \hat{\nu}^*)^2/2]},\ 1\right\}.$$

3. Location parameter $\mu^*$ for the within-regime locations. The Metropolis within Gibbs step can be described as follows. First, find $\hat{\mu}^*$ that maximizes the conditional posterior density of $\mu^*$ by solving⁴

$$\frac{\partial \ln k(\mu^* \mid s^{*2}, \nu^*, h_{\mu^*}, \pi, s, y)}{\partial \mu^*} = \sum_{j=1}^{J} \frac{-(n_j+\nu^*)(\mu^* - \bar{y}_j)}{(\mu^* - \bar{y}_j)^2 + (s^{*2}+s_j^2)\big/\frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}} - \underline{h}_{\mu^*}(\mu^* - \underline{\mu}^*) = 0. \tag{30}$$

Second, evaluate

$$I_{\mu^*} = -\left.\frac{\partial^2}{\partial (\mu^*)^2} \ln k(\mu^* \mid s^{*2}, \nu^*, h_{\mu^*}, \pi, s, y)\right|_{\mu^*=\hat{\mu}^*} = \sum_{j=1}^{J} (n_j+\nu^*)\, \frac{(\hat{\mu}^* - \bar{y}_j)^2 + (s^{*2}+s_j^2)\big/\frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j} - 2(\hat{\mu}^* - \bar{y}_j)^2}{\left[(\hat{\mu}^* - \bar{y}_j)^2 + (s^{*2}+s_j^2)\big/\frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}\right]^2} + \underline{h}_{\mu^*}.$$

Draw the candidate

$$\tilde{\mu}^* \sim N(\hat{\mu}^*, I_{\mu^*}^{-1}),$$

and accept it with probability

$$\min\left\{\frac{k(\tilde{\mu}^* \mid s^{*2}, \nu^*, h_{\mu^*}, \pi, s, y)/\exp[-I_{\mu^*}(\tilde{\mu}^* - \hat{\mu}^*)^2/2]}{k(\mu^* \mid s^{*2}, \nu^*, h_{\mu^*}, \pi, s, y)/\exp[-I_{\mu^*}(\mu^* - \hat{\mu}^*)^2/2]},\ 1\right\}.$$

³ The procedure can be described as follows: (a) Find zeros of all components in Eqs. (28)–(31). Notice that there are $J + 1$ components ($J$ in the summation and one outside) in Eqs. (28), (30) and (31), and $J$ components in Eq. (29). In (28) the zeros are $\{\nu^* Q_j / n_j\}_{j=1}^{J}$ and $(b-2)/a$. In (30) the zeros are $\{\bar{y}_j\}_{j=1}^{J}$ and $\underline{\mu}^*$. In (31) the zeros are $\left\{\dfrac{n_j(s^{*2}+s_j^2)}{n_j(n_j+\nu^*-1)(\mu^*-\bar{y}_j)^2 - (s^{*2}+s_j^2)}\right\}_{j=1}^{J}$ and $(\underline{\nu}-2)/\underline{s}$. There are no analytical forms for zeros in (29), but they can be found numerically using numerical methods such as the Newton–Raphson method. (b) Find the maximum and minimum of the zeros found in step (a) and set them to be the right and left end points of the interval used in the bisection method, respectively. (c) Apply the bisection method using the interval found in step (b) to get solutions.

⁴ The uniqueness of the solution of Eqs. (30) and (31) is not easy to see. This is an open question left to future research.
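Blocks 1–3 share one recipe: locate the mode of the log kernel by bisection, measure the curvature there, propose from the resulting normal distribution, and accept or reject. The sketch below illustrates this recipe for the $s^{*2}$ block only. It is a hedged illustration with hypothetical function names, not the authors' implementation; it assumes $b > 2$ and takes the vector of $Q_j$ values as given.

```python
import numpy as np

def log_kernel_s2(x, Q, n, nu_star, a, b):
    """Log of the kernel (22) as a function of x = s*^2."""
    if x <= 0:
        return -np.inf
    J = Q.size
    return (((b - 2) / 2 + J * nu_star / 2) * np.log(x) - a * x / 2
            - np.sum((n + nu_star) / 2 * np.log(x + Q)))

def score_s2(x, Q, n, nu_star, a, b):
    """Left-hand side of Eq. (28)."""
    return (np.sum(-(n + nu_star) / (2 * (x + Q)) + nu_star / (2 * x))
            + (b - 2) / (2 * x) - a / 2)

def info_s2(x, Q, n, nu_star, a, b):
    """Minus the second derivative of the log kernel at x."""
    return (-np.sum((n + nu_star) / (2 * (x + Q) ** 2) - nu_star / (2 * x ** 2))
            + (b - 2) / (2 * x ** 2))

def bisect_root(f, lo, hi, tol=1e-12, max_iter=200):
    """Successive bisection for a root of f bracketed by [lo, hi]."""
    f_lo = f(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, abs(mid)):
            break
    return 0.5 * (lo + hi)

def find_mode_s2(Q, n, nu_star, a, b):
    """Bracket the root with the smallest and largest component zeros of (28)."""
    zeros = np.append(nu_star * Q / n, (b - 2) / a)
    lo, hi = max(zeros.min(), 1e-12), zeros.max()
    return bisect_root(lambda x: score_s2(x, Q, n, nu_star, a, b), lo, hi)

def mh_step_s2(current, Q, n, nu_star, a, b, rng):
    """One Metropolis-within-Gibbs update of s*^2 with a normal source density."""
    mode = find_mode_s2(Q, n, nu_star, a, b)
    info = info_s2(mode, Q, n, nu_star, a, b)
    if info <= 0:
        raise ValueError("curvature at the mode is not positive")
    cand = rng.normal(mode, 1.0 / np.sqrt(info))

    def log_weight(x):
        # log kernel minus log of the normal source kernel
        return log_kernel_s2(x, Q, n, nu_star, a, b) + 0.5 * info * (x - mode) ** 2

    log_ratio = log_weight(cand) - log_weight(current)
    return cand if np.log(rng.random()) < log_ratio else current
```

A negative candidate has kernel value zero and is therefore always rejected, which keeps the chain on the support of $s^{*2}$ without truncating the proposal.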
4. Scale parameter $h_{\mu^*}$ for the within-regime locations. The Metropolis within Gibbs step can be described as follows. First, find $\hat{h}_{\mu^*}$ that maximizes the conditional posterior density of $h_{\mu^*}$ by solving

$$\frac{\partial}{\partial h_{\mu^*}} \ln k(h_{\mu^*} \mid s^{*2}, \nu^*, \mu^*, \pi, s, y) = \sum_{j=1}^{J}\left\{\frac{1}{2}\left[\frac{1}{h_{\mu^*}} - \frac{1}{h_{\mu^*}+n_j}\right] - \frac{n_j+\nu^*}{2}\,\frac{1}{\tau_j}\left[\frac{n_j(\mu^* - \bar{y}_j)}{h_{\mu^*}+n_j}\right]^2\right\} + \frac{\underline{\nu}-2}{2}\,\frac{1}{h_{\mu^*}} - \frac{\underline{s}}{2} = 0, \tag{31}$$

where

$$\tau_j = s^{*2} + s_j^2 + \frac{n_j h_{\mu^*}}{h_{\mu^*}+n_j}(\mu^* - \bar{y}_j)^2.$$

Second, evaluate

$$I_{h_{\mu^*}} = -\left.\frac{\partial^2}{\partial h_{\mu^*}^2} \ln k(h_{\mu^*} \mid s^{*2}, \nu^*, \mu^*, \pi, s, y)\right|_{h_{\mu^*}=\hat{h}_{\mu^*}}$$
$$= \sum_{j=1}^{J}\left\{\frac{1}{2}\left[\frac{1}{\hat{h}_{\mu^*}^2} - \frac{1}{(\hat{h}_{\mu^*}+n_j)^2}\right] - \frac{n_j+\nu^*}{2}\left[\frac{1}{\tau_j^2}\left(\frac{n_j(\mu^* - \bar{y}_j)}{\hat{h}_{\mu^*}+n_j}\right)^4 + \frac{2 n_j^2(\mu^* - \bar{y}_j)^2}{\tau_j(\hat{h}_{\mu^*}+n_j)^3}\right]\right\} + \frac{\underline{\nu}-2}{2}\,\frac{1}{\hat{h}_{\mu^*}^2},$$

with $\tau_j$ evaluated at $h_{\mu^*} = \hat{h}_{\mu^*}$. Third, draw the candidate

$$\tilde{h}_{\mu^*} \sim N(\hat{h}_{\mu^*}, I_{h_{\mu^*}}^{-1}),$$

and accept it with probability

$$\min\left\{\frac{k(\tilde{h}_{\mu^*} \mid s^{*2}, \nu^*, \mu^*, \pi, s, y)/\exp[-I_{h_{\mu^*}}(\tilde{h}_{\mu^*} - \hat{h}_{\mu^*})^2/2]}{k(h_{\mu^*} \mid s^{*2}, \nu^*, \mu^*, \pi, s, y)/\exp[-I_{h_{\mu^*}}(h_{\mu^*} - \hat{h}_{\mu^*})^2/2]},\ 1\right\}.$$

5. Probability of breaks $\pi$. From (26) the conditional posterior distribution of $\pi$ is Beta$(\gamma_1 + J - 1, \gamma_2 + T - J)$.

6. Structural break indicators $s$. The proposal distribution $q(\tilde{s} \mid s)$ is described in Section 2.3. The standard arithmetic for the rejection probability applies.

A.5. Mixing properties of the algorithm

The posterior simulator produces an ergodic sequence of parameter draws. Theorems 4.5.5 and 4.5.6 of Geweke (2005) directly establish the convergence of the Metropolis–Hastings algorithms used in this section. The ergodicity of the Gibbs sampler can be proved by using Corollary 4.5.1 and Theorem 4.5.4 of Geweke (2005). The mixing properties are insensitive to the design parameters $q_{add}$, $q_{del}$, and $q_{move}$ of the proposal distribution for the simulation of $s$ in the posterior distribution. Repeating the posterior simulation exercise from Section 3.2 with different settings of these design parameters has almost no effect on the relative numerical efficiency (RNE) of the draws for the number of regimes $J$, as shown in Table 12.

Table 12
Analysis of sensitivity to design parameters.

Design parameters
q_add    q_del    q_move    RNE
0.1      0.5      0.4       0.15
0.2      0.3      0.5       0.14
0.3      0.3      0.4       0.16
0.4      0.4      0.2       0.18
0.5      0.1      0.4       0.15
0.2      0.5      0.3       0.15
0.5      0.2      0.3       0.14
0.5      0.4      0.1       0.18

A.6. Software
The posterior conditional distributions and the Metropolis–Hastings algorithm used in the Gibbs sampling algorithm, and the computer code that implements the Gibbs sampling algorithm, were verified to be correct using the joint distribution test described in Geweke (2004) and Geweke (2005, Section 8.1.2). All code is written in compiled C running under Windows Vista, and executed using a 2.93 GHz Intel(R) Core(TM)2 Duo CPU with 2 GB of memory. For the results reported in Section 3, we executed M = 1,005,000 iterations of the Markov chain with a sample size of approximately T = 250. This requires about fifteen minutes. More generally, the computing time is roughly proportional to the product of the MCMC iterations M, sample size T, and number of regime changes J.

References

Andrews, D.W.K., 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61 (4), 821–856.
Andrews, D.W.K., 2003. End-of-sample instability tests. Econometrica 71 (6), 1661–1694.
Ang, A., Bekaert, G., 2002. Regime switches in interest rates. Journal of Business and Economic Statistics 20, 163–182.
Bai, J., Perron, P., 1998. Estimating and testing linear models with multiple structural changes. Econometrica 66, 47–78.
Bai, J., Perron, P., 2003. Computation and analysis of multiple structural change models. Journal of Applied Econometrics 18, 1–22.
Chib, S., 1998. Estimation and comparison of multiple change-point models. Journal of Econometrics 86, 221–241.
Clements, M., Hendry, D., 1998. Forecasting Economic Time Series. Cambridge University Press, Cambridge.
Clements, M., Hendry, D., 1999. Forecasting Non-Stationary Economic Time Series. The MIT Press, Cambridge.
Cogley, T., Sargent, T., 2001. Evolving post-world war II US Inflation dynamics. NBER Macroeconomics Annual 16, 331–373.
Cooley, T.F., Prescott, E.C., 1976. Estimation in the presence of stochastic parameter variation. Econometrica 44, 167–184.
Elliott, G., Muller, U., 2006. Optimally testing general breaking processes in linear time series models. Review of Economic Studies 73, 907–940.
Garcia, R., Perron, P., 1996. An analysis of the real interest rate under regime shifts. Review of Economics and Statistics 78, 111–125.
Geweke, J., 2004. Getting it right: joint distribution tests of posterior simulators. Journal of the American Statistical Association 99, 799–804.
Geweke, J., 2005. Contemporary Bayesian Econometrics and Statistics. John Wiley, Hoboken, NJ.
Geweke, J., 2007. Interpretation and inference in mixture models: simple MCMC works. Computational Statistics & Data Analysis 51 (7), 3529–3550.
Geweke, J., Terui, N., 1993. Bayesian threshold autoregressive models for nonlinear time series. Journal of Time Series Analysis 14, 441–455.
Ghysels, E., Guay, A., Hall, A., 1998. Predictive tests for structural change with unknown breakpoint. Journal of Econometrics 82 (2), 209–233.
Ghysels, E., Hall, A., 1990. A test for structural stability of Euler conditions parameters estimated via the generalized method of moments estimator. International Economic Review 31 (2), 355–364.
Giordani, P., Kohn, R., 2008. Efficient Bayesian inference for multiple change-point and mixture innovation models. Journal of Business and Economic Statistics 26, 66–77.
Giordani, P., Kohn, R., van Dijk, D., 2007. A unified approach to nonlinearity, structural change, and outliers. Journal of Econometrics 137 (1), 112–133.
Hamilton, J.D., 1989. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57 (2), 357–384.
Hansen, B.E., 1992. Tests for parameter instability in regressions with I(1) processes. Journal of Business and Economic Statistics 10 (3), 321–335.
Kim, C.-J., Nelson, C.R., 1999. Has the US economy become more stable? A Bayesian approach based on a Markov-switching model of the business cycle. The Review of Economics and Statistics 81 (4), 608–616.
Kim, C.-J., Nelson, C.R., Piger, J., 2004. The less-volatile US economy: a Bayesian investigation of timing, breadth, and potential explanations. Journal of Business and Economic Statistics 22 (1), 80–93.
Koop, G., Potter, S., 2001. Are apparent findings of nonlinearity due to structural instability in economic time series? Econometrics Journal 4, 37–55.
Koop, G., Potter, S., 2007. Estimation and forecasting in models with multiple breaks. Review of Economic Studies 74, 763–789.
Lumsdaine, R.L., Papell, D.H., 1997. Multiple trend breaks and the unit-root hypothesis. The Review of Economics and Statistics 79 (2), 212–218.
Maheu, J.M., Gordon, S., 2008. Learning, forecasting and structural breaks. Journal of Applied Econometrics 23 (5), 553–583.
McConnell, M., Perez, G., 2000. Output fluctuations in the United States: what has changed since the early 1980s? American Economic Review 90, 1464–1476.
McCulloch, R.E., Tsay, R.S., 1993. Bayesian inference and prediction for mean and variance shifts in autoregressive time series. Journal of the American Statistical Association 88 (423), 968–978.
Min, C.-k., Zellner, A., 1993. Bayesian and non-Bayesian methods for combining models and forecasts with applications to forecasting international growth rates. Journal of Econometrics 56 (1–2), 89–118.
Pastor, L., Stambaugh, R., 2001. The equity premium and structural breaks. Journal of Finance 56, 1207–1239.
Pesaran, M.H., Pettenuzzo, D., Timmermann, A., 2006. Forecasting time series subject to multiple structural breaks. Review of Economic Studies 73, 1057–1084.
Pesaran, M.H., Timmermann, A., 2002. Market timing and return prediction under model instability. Journal of Empirical Finance 9, 495–510.
Stock, J., Watson, M., 1996. Evidence on structural instability in macroeconomic time series relations. Journal of Business and Economic Statistics 14, 11–30.
Terasvirta, T., Anderson, H.M., 1992. Characterizing nonlinearities in business cycles using smooth transition autoregressive models. Journal of Applied Econometrics 7 (S), 119–136.
Tong, H., Lim, K.S., 1980. Threshold autoregression, limit cycles and cyclical data. Journal of the Royal Statistical Society. Series B (Methodological) 42 (3), 245–292.
Wang, J., Zivot, E., 2000. A Bayesian time series model of multiple structural changes in level, trend, and variance. Journal of Business and Economic Statistics 18 (3), 374–386.
Journal of Econometrics 163 (2011) 186–199
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
An I(d) model with trend and cycles

Karim M. Abadir a,∗, Walter Distaso a, Liudas Giraitis b

a Imperial College Business School, Imperial College London, London SW7 2AZ, UK
b Department of Economics, Queen Mary, University of London, London E1 4NS, UK

Article info

Article history: Received 12 March 2007. Received in revised form 14 March 2011. Accepted 29 March 2011. Available online 6 April 2011.

JEL classification: C22

Keywords: Fractional integration; Trend; Cycle; Nonlinear process; Whittle objective function

Abstract

This paper deals with models allowing for trending processes and a cyclical component, with error processes that are possibly nonstationary, nonlinear, and non-Gaussian. Asymptotic confidence intervals for the trend, cyclical component, and memory parameters are obtained. The confidence intervals are applicable for a wide class of processes, exhibit good coverage accuracy, and are easy to implement. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

To start, consider the basic model

$$X_t = \beta_1 + \beta_2 t + u_t, \quad t = 1, 2, \ldots, n, \tag{1.1}$$
where E (ut ) = 0. When {ut } is a long memory stationary process, with memory parameter d ∈ (−1/2, 1/2), the problem of estimating such models has been studied extensively in the literature. Yajima (1988, 1991) derived conditions for consistency and asymptotic normality of Least Squares (LS) estimators of the parameters of a regression model with nonstochastic regressors, when the errors {ut } have long memory. Dahlhaus (1995) suggested an efficient weighted least squares estimator for β1 and β2 and investigated its asymptotic properties in the case of a polynomial regression with stationary errors. Nonlinear regression models with long memory errors have been investigated by Ivanov and Leonenko (2004, 2008). The estimation of a trend when {ut } has d ∈ [0, 3/2) was discussed by Deo and Hurvich (1998), but they did not estimate d and they required {ut } to have a linear structure with restrictive asymptotic weights.
∗ Corresponding author. Tel.: +44 20 75941819. E-mail addresses: [email protected] (K.M. Abadir), [email protected] (W. Distaso), [email protected] (L. Giraitis).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.03.006
There is a large literature on the estimation of d in the case of long memory. Fewer papers have so far considered an extended range for d to include regions of nonstationarity. Assuming that {ut} is observed, to estimate d, Velasco (1999, 2003) used data differencing and data tapering, and he noted that this inflates the estimator's variance. Robinson (2005) suggested an adaptive semiparametric estimation method for the case of a polynomial regression with fractionally-integrated errors, employing in his Monte Carlo study a tapered estimate of d. An alternative approach was developed by Shimotsu and Phillips (2005), who introduced an exact local Whittle estimation method based on fractional differencing of {ut}, which is valid when a nonstationary process {ut} is generated by a linear process. Abadir et al. (2007) extended the classical Whittle estimator to the Fully-Extended Local Whittle (FELW) estimator that is valid for a wider range of d values, allowing for nonstationary {ut} but not for deterministic components.

The present paper focuses on the estimation of the linear regression model (1.1) and its extended version

$$X_t = \beta_1 + \beta_2 t + \beta_3 \sin(\omega t) + \beta_4 \cos(\omega t) + u_t, \quad t = 1, 2, \ldots, n, \tag{1.2}$$

which allows for stationary and nonstationary errors {ut} and a cyclical component with ω ∈ (0, π). We assume that ω is known in (1.2), and so we treat separately the boundary case ω = 0 as model (1.1), effectively covering ω ∈ [0, π) in the paper but not the
K.M. Abadir et al. / Journal of Econometrics 163 (2011) 186–199
unrealistic case of ω = π that leads to Xt = β1 + β2 t + β4 (−1)^t + ut. We do not propose a method of estimation for ω; see Nandi and Kundu (2003) and references therein for the estimation of ω in the context of a short memory linear process and no linear trend. Estimating ω is beyond the scope of this paper, though (as we will show) our procedure allows for time-varying ω and/or multiple cyclical components with different frequencies ω•. For expository purposes, we refrain from writing these features into the model given in this introduction.

In this paper, we estimate β := (β1, β2, β3, β4)′ by LS and generalize (for the presence of trend and cycle) the Fully-Extended Local Whittle (FELW) estimator of d given in Abadir et al. (2007). We also provide a simpler alternative form for the FELW estimator. We show that our estimators are consistent and we obtain their rates of convergence and limiting distributions, as well as confidence intervals based on them. The asymptotic properties of our LS estimators of β1, β2 turn out to be unaffected by (and robust to) the unknown cyclical component.

The papers listed earlier require the assumptions of linearity or Gaussianity of the error process. However, our estimation procedure allows for a wide range of permissible values of the memory parameter d and for possibly nonstationary, nonlinear, and nonnormal processes {ut}. By virtue of {ut} being modelled semiparametrically, the procedure also allows for seasonality and other effects to be present in {ut} at nonzero spectral frequencies.

In Section 2, we investigate the LS estimators of β1, β2, β3, β4, while Section 3 is concerned with the parameters of the process {ut}. Section 4 contains the results of simulation experiments on the performance of the estimators suggested earlier.
It is technically straightforward to extend our results to higher-order polynomials and to values of d outside the interval −1/2 < d < 3/2 to which we restrict our attention in this paper. We do not report such extensions in order to simplify the exposition and because most economic series will not require more than a linear trend or d outside −1/2 < d < 3/2. The proofs of the main results are given in the Appendix.

We use $\xrightarrow{p}$ and $\xrightarrow{d}$ to denote convergence in probability and in distribution, respectively. We write i for the imaginary unit, $1_A$ for the indicator of a set A, ⌊ν⌋ for the integer part of ν, C for a generic constant but $c_\bullet$ for specific constants. The lag operator is denoted by L, such that $L u_t = u_{t-1}$, and the backward difference operator by ∇ := 1 − L. We define a ∧ b := min{a, b} and a ∨ b := max{a, b}.

Definition 1.1. Let $d = k + d_\xi$, where k = 0, 1, 2, . . . and $d_\xi \in (-1/2, 1/2)$. We say that $\{u_t\}$ is an I(d) process (denoted by $u_t \sim I(d)$) if

$$\nabla^k u_t = \xi_t, \quad t = 1, 2, \ldots,$$

where the generating process $\{\xi_t\}$ is a second order stationary sequence with spectral density

$$f_\xi(\lambda) = b_0 |\lambda|^{-2d_\xi} + o(|\lambda|^{-2d_\xi}), \quad \text{as } \lambda \to 0, \tag{1.3}$$

where $b_0 > 0$. Notice that there are two parameters of interest in this definition, $b_0$ and $d_\xi$.

2. Estimation of β

We will use Ordinary LS (OLS) estimation of β, because of its ease of application, its consistency, and its asymptotic normality. Feasible Generalized LS (GLS) applied to (1.1) would require us to specify the autocovariance structure explicitly, which is not usually known, so OLS is more in line with the semiparametric approach of our paper. Even so, assuming the autocovariance structure is
known and is correctly specified, it has been shown that the loss of efficiency will not be substantial in this context. For example, Table 1 of Yajima (1988) implies that the maximal loss of asymptotic efficiency by OLS compared to the BLUE is 11% when estimating β1 and β2 , and 2% when estimating the mean of the differenced data (hence β2 of the original data). These will correspond to our cases d ∈ (−1/2, 1/2) and d ∈ (1/2, 3/2), respectively, as will be seen later. These efficiency bounds apply to GLS as well, since it is a linear estimator, thus limiting the efficiency loss of OLS relative to GLS. Below it will be shown that the rates of convergence of the OLS estimators depend on the order of integration d of ut , and their limits depend on the long run variance s2ξ of {ξt } which needs to be estimated. Property (1.3) of the spectral density fξ implies that
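$$s_\xi^2 = \lim_{n\to\infty} E\left(n^{-1/2-d_\xi} \sum_{t=1}^{n} \xi_t\right)^2 = \lim_{n\to\infty} n^{-1-2d_\xi} \int_{-\pi}^{\pi} \left(\frac{\sin(n\lambda/2)}{\sin(\lambda/2)}\right)^2 f_\xi(\lambda)\, d\lambda = p(d_\xi)\, b_0, \tag{2.1}$$

where $b_0$ is defined in (1.3) and

$$p(d) := \int_{-\infty}^{\infty} \left(\frac{\sin(\lambda/2)}{\lambda/2}\right)^2 |\lambda|^{-2d}\, d\lambda = \begin{cases} \dfrac{2\,\Gamma(1-2d)\sin(\pi d)}{d(1+2d)}, & \text{if } d \ne 0, \\[2mm] 2\pi, & \text{if } d = 0. \end{cases}$$

To derive the asymptotic distribution of estimators of (β1, β2), we introduce the following condition on the generating process $\{\xi_t\}$ of Definition 1.1.

Assumption FDD. The finite-dimensional distributions of the process

$$Y_n(r) := n^{-1/2-d_\xi} \sum_{t=1}^{\lfloor nr \rfloor + 1} \xi_t, \quad 0 \le r \le 1, \tag{2.2}$$

converge to those of the Gaussian process $Y_\infty(r)$, that is,

$$Y_n(r) \xrightarrow{d} Y_\infty(r), \quad \text{as } n \to \infty.$$

Assumption FDD together with the asymptotic behaviour (1.3) of the spectral density $f_\xi$ imply that

$$Y_\infty(r) = s_\xi\, J_{1/2+d_\xi}(r), \quad 0 \le r \le 1,$$

where $J_{1/2+d_\xi}(r)$ is a fractional Brownian motion. By definition, $J_{1/2+d_\xi}(r)$ is a Gaussian process with zero mean and covariance function

$$R_d(r, s) := E\left[J_{1/2+d_\xi}(r)\, J_{1/2+d_\xi}(s)\right] = \frac{1}{2}\left(r^{1+2d_\xi} + s^{1+2d_\xi} - |r-s|^{1+2d_\xi}\right), \quad 0 \le r, s \le 1. \tag{2.3}$$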
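The quantities in (2.1) and (2.3) are easy to evaluate numerically. The sketch below is an illustration under our own naming (`p`, `fbm_cov` and `simulate_limit_process` are hypothetical helpers, not part of the paper): it codes the closed form of $p(d)$ with its continuous extension at $d = 0$, the covariance function $R_d$, and a Cholesky-based draw of the limit process $s_\xi J_{1/2+d_\xi}$ on a grid of strictly positive points.

```python
import math
import numpy as np

def p(d):
    """Closed form of p(d) below Eq. (2.1), for d in (-1/2, 1/2)."""
    if abs(d) < 1e-12:
        return 2 * math.pi
    return 2 * math.gamma(1 - 2 * d) * math.sin(math.pi * d) / (d * (1 + 2 * d))

def fbm_cov(r, s, d_xi):
    """Covariance function (2.3) of fractional Brownian motion J_{1/2+d_xi}."""
    e = 1 + 2 * d_xi
    return 0.5 * (r ** e + s ** e - np.abs(r - s) ** e)

def simulate_limit_process(grid, s_xi, d_xi, rng):
    """Draw s_xi * J_{1/2+d_xi} on a grid of points in (0, 1] via Cholesky."""
    R = fbm_cov(grid[:, None], grid[None, :], d_xi)
    # Small jitter on the diagonal is an assumption for numerical stability only.
    L = np.linalg.cholesky(R + 1e-12 * np.eye(grid.size))
    return s_xi * (L @ rng.standard_normal(grid.size))
```

With `d_xi = 0` the covariance reduces to `min(r, s)`, the standard Brownian motion case, which gives a quick sanity check of the implementation.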
2.1. Model (1.1)

In order to estimate the slope parameter β2 and the location parameter β1 of model (1.1), we use the OLS estimators

$$\hat{\beta}_2 = \frac{\sum_{t=1}^{n} (X_t - \bar{X})(t - \bar{t})}{\sum_{t=1}^{n} (t - \bar{t})^2} \tag{2.4}$$

and

$$\hat{\beta}_1 = \bar{X} - \hat{\beta}_2\, \bar{t}, \tag{2.5}$$

where $\bar{X} = n^{-1}\sum_{t=1}^{n} X_t$ and $\bar{t} = n^{-1}\sum_{t=1}^{n} t = (n+1)/2$ are the sample means of the variables. We shall show that the estimator $(\hat{\beta}_1, \hat{\beta}_2)$ is valid for both models, (1.1) and (1.2), and the cyclical component of model (1.2) does not affect the asymptotic properties of this estimator. The regressors of model (1.1) are asymptotically orthogonal to the additional cyclical ones in model (1.2), when we normalize them all to make the matrix of sum-of-squares asymptotically nonsingular. In standard regression analysis, where {ut} has short memory, the asymptotic distribution of the OLS estimators of (β1, β2) is unaffected by the choice of models. Here, we show inter alia that this still holds for these distributions for the values of d that we consider. Define the following functions of d:
$$\sigma_{2,d}^2 := \begin{cases} (12)^2\left(\dfrac{1}{2d+3} - \dfrac{1}{4}\right), & \text{if } d \in [-1/2, 1/2], \\[2mm] (12)^2\, \dfrac{2d-1}{8d(2d+1)(2d+3)}, & \text{if } d \in (1/2, 3/2], \end{cases} \tag{2.6}$$

and

$$\sigma_{1,d}^2 := 1 + 36\left(\frac{1}{2d+3} - \frac{1}{4}\right), \qquad \sigma_{12,d} := \frac{18(2d-1)}{3+2d}. \tag{2.7}$$

Set

$$\tau_{\hat\beta_2} := n^{3/2-d}(\hat\beta_2 - \beta_2), \qquad \tau_{\hat\beta_1} := n^{1/2-d}(\hat\beta_1 - \beta_1).$$

Theorem 2.1. Assume that $X_t$ follows model (1.1), and $\{u_t\}$ is an I(d) process with d ∈ (−1/2, 3/2) and d ≠ 1/2.

(i) Then, as n → ∞,

$$E(\tau_{\hat\beta_2}^2) \to s_\xi^2 \sigma_{2,d}^2, \quad \text{if } d \in (-1/2, 3/2), \tag{2.8}$$

$$E(\tau_{\hat\beta_1}^2) \to s_\xi^2 \sigma_{1,d}^2, \quad \text{if } d \in (-1/2, 1/2). \tag{2.9}$$

(ii) If the generating process $\{\xi_t\}$ satisfies Assumption FDD, then, as n → ∞,

$$\tau_{\hat\beta_2} \xrightarrow{d} N(0, s_\xi^2 \sigma_{2,d}^2), \quad \text{if } d \in (-1/2, 3/2), \tag{2.10}$$

$$(\tau_{\hat\beta_1}, \tau_{\hat\beta_2}) \xrightarrow{d} N(0, \Sigma_2), \quad \text{if } d \in (-1/2, 1/2), \tag{2.11}$$

where $N(0, \Sigma_2)$ is a zero mean Gaussian vector with the covariance matrix

$$\Sigma_2 := s_\xi^2 \begin{pmatrix} \sigma_{1,d}^2 & \sigma_{12,d} \\ \sigma_{12,d} & \sigma_{2,d}^2 \end{pmatrix}.$$

(iii) Results (i) and (ii) remain valid if $X_t$ follows the model (1.2) with unknown ω.

(iv) Results (i) and (ii) hold if $X_t = \beta_1 + \beta_2 t + z_t + u_t$ with any deterministic trend $z_t$ such that $\sum_{t=1}^{n} z_t = o(n^{1/2+d})$, as n → ∞.

The results of Theorem 2.1 are derived under unrestrictive Assumption FDD that requires only a finite second moment of $\{\xi_t\}$. They do not assume that the generating process $\{\xi_t\}$ has higher moments, or a linear structure with regularly decaying weights $a_t \sim c t^{-\gamma}$ as in Deo and Hurvich (1998). Notice that the mean of $X_t$ cannot be estimated consistently if d ≥ 1/2, a general feature of integrated models, which is why the results on β1 are limited to the case d < 1/2. Constructing confidence intervals for β2 below will use the estimate $\sigma_{2,\hat d}^2$ of $\sigma_{2,d}^2$, where $\hat d$ is a consistent estimate of d. Although we exclude the value d = 1/2 by assumption, $\hat d$ may take value $\hat d = 1/2$ with a negligible probability. For this reason, definition (2.6) of $\sigma_{2,d}^2$ includes the boundary value d = 1/2.

Remark 2.1. Results (i)–(ii) of Theorem 2.1 remain valid for a process $X_t = \beta_1 + \beta_2 t + z_t + u_t$ where

$$z_t = \sum_{j=1}^{p} \left(\beta_{3,j} \sin(\omega_j t) + \beta_{4,j} \cos(\omega_j t)\right)$$

with unknown parameters $\beta_{3,j}$, $\beta_{4,j}$, $\omega_j$, assuming that $0 < \omega_j < \pi$, j = 1, . . . , p. They also remain valid when the $\omega_j$'s vary deterministically over time, so long as they satisfy the condition in (iv). Such changes include a finite number of breaks as a special case.

2.2. Model (1.2)

To estimate the parameter vector β = (β1, β2, β3, β4)′ of model (1.2), we shall use the OLS estimator

$$\tilde\beta := (\tilde\beta_1, \tilde\beta_2, \tilde\beta_3, \tilde\beta_4)' = S^{-1} t, \tag{2.12}$$

where $S = (s_{ij})_{i,j=1,\ldots,4}$ and $t = (t_{n,j})_{j=1,\ldots,4}$, with elements

$$s_{ij} := \sum_{t=1}^{n} z_{ti} z_{tj}, \qquad t_{n,j} := \sum_{t=1}^{n} u_t z_{tj} \tag{2.13}$$

such that

$$z_{t1} := 1, \quad z_{t2} := t, \quad z_{t3} := \sin(\omega t), \quad z_{t4} := \cos(\omega t).$$

Theorems 2.1 and 2.2 show that, in spite of the cyclical component, the earlier LS estimator $(\hat\beta_1, \hat\beta_2)$ given in (2.4)–(2.5) can be applied to model (1.2) without loss of efficiency and that, under the unrestrictive Assumption FDD, it has the same asymptotic distribution as the LS estimator $(\tilde\beta_1, \tilde\beta_2)$ of the larger model. Deriving the asymptotic normality of the LS estimators of the parameters β3, β4 requires an additional assumption, that the generating process $\{\xi_t\}$ is a linear process

$$\xi_t = \sum_{j=0}^{\infty} a_j \varepsilon_{t-j}, \tag{2.14}$$

where $\{a_j\}$ are real nonrandom weights, $\sum_{j=0}^{\infty} a_j^2 < \infty$, and $\{\varepsilon_j\}$ are i.i.d. variates with zero mean and unit variance. This assumption is needed because $\tilde\beta_3$ and $\tilde\beta_4$ involve trigonometric sums that cannot be transformed into integrals of the process $\{Y_n(r)\}$, whence the need to impose assumptions directly on $\{\xi_t\}$ rather than on the convergence of its partial sum process $\{Y_n(r)\}$. In the case here, asymptotic normality is obtained by assuming that $\{\xi_t\}$ is a linear process. Define

$$\tau_{\tilde\beta_2} := n^{3/2-d}(\tilde\beta_2 - \beta_2), \qquad \tau_{\tilde\beta_1} := n^{1/2-d}(\tilde\beta_1 - \beta_1),$$

$$\tau_{\tilde\beta_k} := \begin{cases} \dfrac{n^{1/2}}{\sqrt{A_k}}(\tilde\beta_k - \beta_k), & \text{if } d < 1, \\[2mm] \dfrac{n^{3/2-d}}{\sqrt{A_k}}(\tilde\beta_k - \beta_k), & \text{if } d \ge 1, \end{cases}$$

for k = 3, 4, where $A_k$ is such that $\tau_{\tilde\beta_k}$ has a limiting variance of 1. It will be derived in Theorem 2.2(i) as

$$A_k = \begin{cases} 4\pi f_\xi(\omega), & \text{if } -\dfrac{1}{2} < d < \dfrac{1}{2}, \\[2mm] \dfrac{\pi f_\xi(\omega)}{\sin^2(\omega/2)}, & \text{if } \dfrac{1}{2} < d < 1, \\[2mm] n^{2(1-d)}\dfrac{\pi f_\xi(\omega)}{\sin^2(\omega/2)} + s_\xi^2\left(\theta_{1k}^2\psi_{11} + \theta_{2k}^2\psi_{22} + 2\theta_{1k}\theta_{2k}\psi_{12} + 4\gamma_k\left(\dfrac{\theta_{1k}}{2} + v\theta_{2k}\right) + 4\gamma_k^2\right), & \text{if } 1 \le d < \dfrac{3}{2}, \end{cases} \tag{2.15}$$
K.M. Abadir et al. / Journal of Econometrics 163 (2011) 186–199
189
and, for k = 3, 4,
with
ψ11 := ψ22 :=
1 1 + 2d
,
1
ψ12 :=
8d(1 + 2d) 1
1−
2(1 + 2d)
6d − 1
(3 + 2d)d
θ1k := −8s1k + 12n−1 s2k , 1
v :=
4
γ4 :=
+
1 1 + 2d
−
1 4d
sin(ω(n + 1/2)) 2 sin(ω/2)
,
,
(2.16)
d ( βk − βk ) − → N(0, 1),
if d ∈ (−1/2, 1)
k
n3/2−d
d ( βk − βk ) − → N(0, 1), A+
θ2k := 12s1k − 24n−1 s2k , cos(ω(n + 1/2)) γ3 := − , 2 sin(ω/2)
,
n
A+
if d ∈ [1, 3/2)
k
with Ak the corresponding estimate of Ak and A+ k := max{0, Ak }. Using the above results, we can write an asymptotic confidence interval (CI) of size 1 − γ for β1 and β2 as
.
Observe that, for 1 ≤ d < 3/2, Ak is bounded but oscillates with n. Although for d > 1 in (2.15) the term n2(1−d) π fξ (ω)/ sin2 (ω/2) is asymptotically negligible, we found that preserving it improves the coverage probabilities in finite samples.
[ ] σ1,d sξ σ1,d sξ β1 − c , β1 + c , γ γ n1/2−d n1/2−d ] [ σ2,d sξ σ2,d sξ c , β + c β2 − 2 γ γ
Theorem 2.2. Assume that Xt follows model (1.2). Suppose that {ut } is an I(d) process with d ∈ (−1/2, 3/2) and d ̸= 1/2, such that the spectral density fξ is continuous at ω. In the case of d ∈ (1/2, 3/2), assume that E u20 < ∞.
valid for |d| < 1/2 and −1/2 < d < 3/2, respectively, where cγ denotes here the quantile of the standard normal distribution satisfying Pr(|N(0, 1)| > cγ ) = γ . Similarly for β3 and β4 , the corresponding intervals in the cases −1/2 < d < 1 and 1 ≤ d < 3/2 are
2 (i) Then, E(τ ) and E(τβ2 ) satisfy (2.9) and (2.8) with β1 , β2 β1 2 replaced by β1 , β2 . For k = 3, 4, as n → ∞, 2 E(τ ) ∼ 1 for d ∈ (−1/2, 3/2). β
n3/2−d
βk −
k
(ii) If the generating process {ξt } satisfies Assumption FDD, then τ β2 and (τ β1 , τ β2 ) satisfy (2.10) and (2.11), respectively. (iii) If the generating process {ξt } is a linear process (2.14), then for d ∈ (−1/2, 1/2), d
(τβ1 , τβ2 , τβ3 , τβ4 ) − → N(0, 64 ),
(2.17)
where N(0, 64 ) is a zero mean Gaussian vector with covariance matrix s2ξ σ12,d s2 σ ξ 12,d 64 := 0 0
s2ξ σ12,d s2ξ σ22,d 0 0
0 0 1 0
0 0 . 0 1
For d ∈ (1/2, 3/2),
d τβ2 − → N 0, s2ξ σ22,d , d
τβk − → N(0, 1),
(2.18)
k = 3, 4.
(2.19)
In particular, for 1/2 < d < 1,
d (τβ2 , τβ3 , τβ4 ) − → N 0, diag(s2ξ σ22,d , 1, 1) .
(2.20)
To make this theorem operational, we need to estimate the unknown parameters d and sξ . Some actual estimators will be proposed in the next section, but in the meantime we have the following result which is valid for any estimators satisfying a weak condition for consistency. Corollary 2.1. Suppose that the assumptions of Theorem 2.2(iii) are satisfied, and that we have estimators d, s2ξ , and fξ (ω) with the property
d = d + op (1/ log n),
s2ξ = s2ξ + op (1),
fξ (ω) = fξ (ω) + op (1). Then, as n → ∞, n3/2−d
sξ σ2,d
d ( β2 − β2 ) − → N(0, 1),
if d ∈ (−1/2, 3/2),
(2.21)
βk −
A+ k
n
n3/2−d
cγ , βk +
A+ k
n3/2−d
cγ , βk +
A+ k
n
cγ ,
A+ k
n3/2−d
cγ ,
k = 3, 4;
see Section 4 concerning the positivity of estimates of Ak . If we have
$\hat A_k^+ = 0$, then the Corollary's statistics based on $\hat A_k^+$ are infinite, but this happens with probability 0 asymptotically. Also, we interpret the CIs to become the whole real line in this case. The formulae imply that the length of the CIs for $\beta_1$ and $\beta_2$ increases when the parameter $d$ approaches the bounds 1/2 and 3/2. The length of the CIs for $\beta_3, \beta_4$ does not depend on $d$ when $d < 1$ and increases when $d$ approaches 3/2.

Theorem 2.2 shows that as long as $\{u_t\}$ is an I(d) process, the LS estimators $\tilde\beta_1, \tilde\beta_2, \tilde\beta_3, \tilde\beta_4$ are consistent and the rate of convergence is known. As in the previous subsection, which dealt with the simpler model (1.1), consistent estimation of the intercept $\beta_1$ is not possible when $d > 1/2$. However, and perhaps surprisingly, we can still consistently estimate $\beta_3$ and $\beta_4$ of the bounded cyclical component when $d > 1/2$. The asymptotic normality of the estimators of $\beta_1$ and $\beta_2$ holds under the weak Assumption FDD, whereas asymptotic normality of the estimators of $\beta_3, \beta_4$ follows under the assumption that the generating process $\{\xi_t\}$ is linear. Only a finite second moment of $u_t$, $\xi_t$ is required.

Remark 2.2. Assumption FDD with a Gaussian limit is satisfied for a wide class of generating processes $\{\xi_t\}$. In the case of a linear process, convergence of the finite-dimensional distributions of $Y_n(r)$ to those of a Gaussian process $s_\xi J_{1/2+d_\xi}$ has been known for some time; e.g. see Ibragimov and Linnik (1977, Theorem 18.6.5), and Davydov (1970). In the case of EGARCH processes, it was shown by Surgailis and Viano (2002). Taqqu (1979) and Giraitis and Surgailis (1985) have shown that Assumption FDD is satisfied for a wide class of nonlinear transformations $\xi_t = h(\zeta_t)$ of stationary Gaussian I(d) sequences $\{\zeta_t\}$. Although FDD is satisfied for a large class of nonlinear processes, it does not cover nonlinear processes whose partial sum process has a non-Gaussian limit $Y_\infty$. The latter case is specific and of limited interest in applications since, although it allows consistent OLS estimation of $(\beta_1, \beta_2)$, one does not have a simple data-based procedure for the computation of its critical values.
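The confidence interval for $\beta_2$ described above can be sketched numerically. Definition (2.6) of $\sigma^2_{2,d}$ is not reproduced in this part of the text, so the sketch below rebuilds the variances from the relations $\mathrm{var}(4V_1 - 6V_2) = \sigma^2_{1,d}$ and $\mathrm{var}(-6V_1 + 12V_2) = \sigma^2_{2,d}$ stated in the proof of Theorem 2.1, with $\mathrm{cov}(V_k, V_j) = \psi_{kj}$ read from (A.13) and (2.16) without the factor $s_\xi^2$. That reading, and the function names `psi`, `sigma2` and `ci_beta2`, are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def psi(d):
    """(psi11, psi12, psi22) without the s_xi^2 factor (assumed reading of (A.13), (2.16))."""
    if -0.5 < d < 0.5:
        return 1.0, 0.5, 1.0 / (3.0 + 2.0 * d)
    if 0.5 < d < 1.5:
        p11 = 1.0 / (1.0 + 2.0 * d)
        p12 = (6.0 * d - 1.0) / (8.0 * d * (1.0 + 2.0 * d))
        p22 = (1.0 - 1.0 / ((3.0 + 2.0 * d) * d)) / (2.0 * (1.0 + 2.0 * d))
        return p11, p12, p22
    raise ValueError("d must lie in (-1/2, 3/2) and differ from 1/2")

def sigma2(d):
    """Variances and covariance of (4*V1 - 6*V2, -6*V1 + 12*V2)."""
    p11, p12, p22 = psi(d)
    s1 = 16.0 * p11 - 48.0 * p12 + 36.0 * p22
    s2 = 36.0 * p11 - 144.0 * p12 + 144.0 * p22
    s12 = -24.0 * p11 + 84.0 * p12 - 72.0 * p22
    return s1, s2, s12

def ci_beta2(beta2_est, n, d_est, s_xi_est, level=0.90):
    """Asymptotic CI: beta2_est -/+ c * sigma_{2,d} * s_xi / n^(3/2 - d)."""
    c = NormalDist().inv_cdf(0.5 + level / 2.0)  # Pr(|N(0,1)| > c) = 1 - level
    half = c * math.sqrt(sigma2(d_est)[1]) * s_xi_est / n ** (1.5 - d_est)
    return beta2_est - half, beta2_est + half

lo, hi = ci_beta2(0.0, 500, 0.0, 1.0)
```

At $d = 0$ the sketch returns $\sigma^2_{1,0} = 4$, $\sigma^2_{2,0} = 12$ and $\sigma_{12,0} = -6$, the familiar constants for a linear trend regression with uncorrelated errors, which gives a sanity check on the reconstruction.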
3. Estimation of d and $s_\xi^2$

In this section we discuss estimators for the unknown parameter $d$ of the I(d) process $\{u_t\}$ and the long run variance $s_\xi^2$ of $\{\xi_t\}$. We start with a model with no deterministic components, then generalize it to models (1.1) and (1.2).

3.1. Model with no deterministic components

Assume in this subsection that we observe the process $\{u_t\}$, which follows

$$u_t \sim I(d_0) \quad \text{with } d_0 \in (a, b) \subset (-1/2, \infty), \qquad d_0 \ne 1/2, 3/2, 5/2, \ldots. \tag{3.1}$$

Then, $d_0$ can be estimated using the FELW estimator developed in Abadir et al. (2007), which extends to nonstationarity the classical local Whittle estimator (see Robinson, 1995). In this subsection, we write the estimator in a form that is equivalent to the original FELW estimator, but in a way that is simpler to use. The estimators are identical due to the algebraic relation of the periodograms presented in Lemma 4.4 of Abadir et al. (2007). Denote by

$$I_{n,u}(\lambda_j) := |w_u(\lambda_j)|^2, \qquad w_u(\lambda_j) := (2\pi n)^{-1/2}\sum_{t=1}^{n} e^{it\lambda_j}u_t$$

the periodogram and discrete Fourier transform of $\{u_t\}$, where $\lambda_j = 2\pi j/n$, $j = 1, \ldots, n$ denote the Fourier frequencies. For $k = 0, 1, 2, \ldots$, define $I_{n,u}(\lambda_j, d) := |1 - e^{i\lambda_j}|^{-2k} I_{n,\nabla^k u}(\lambda_j)$ for all $d \in (k - 1/2, k + 1/2]$. The points $d_0 = k + 1/2$ are excluded from (3.1) because they lead to the spectral density of $\nabla^k u_t$ not being integrable. The FELW estimator $\hat d \equiv \hat d_u$ of $d_0$ is defined as

$$\hat d := \operatorname{argmin}_{d \in [a,b]} U_n(d), \tag{3.2}$$

where

$$U_n(d) := \log\Big(\frac{1}{m}\sum_{j=1}^{m} j^{2d}\,|1 - e^{i\lambda_j}|^{-2k}\,I_{n,\nabla^k u}(\lambda_j)\Big) - \frac{2d}{m}\sum_{j=1}^{m}\log j, \tag{3.3}$$

for $d \in (k - 1/2, k + 1/2]$, $k = 0, 1, 2, \ldots$, and the bandwidth parameter $m$ is such that $m \to \infty$, $m = o(n)$ as $n \to \infty$. We can think of $k$ as determined by any given $d$: for any $d$ chosen to evaluate the objective function $U_n(d)$, we have a corresponding $k$ such that $d \in (k - 1/2, k + 1/2]$. For example,

$$U_n(d) := \begin{cases} \log\Big(\dfrac{1}{m}\sum_{j=1}^{m} j^{2d}\,I_{n,u}(\lambda_j)\Big) - \dfrac{2d}{m}\sum_{j=1}^{m}\log j, & \text{if } d \in (-1/2, 1/2], \\[3mm] \log\Big(\dfrac{1}{m}\sum_{j=1}^{m} j^{2d}\,|1 - e^{i\lambda_j}|^{-2}\,I_{n,\nabla u}(\lambda_j)\Big) - \dfrac{2d}{m}\sum_{j=1}^{m}\log j, & \text{if } d \in (1/2, 3/2]. \end{cases}$$

Remark 3.1. In applying the estimation method, the sample data must be enumerated so that the difference $\nabla^k u_1$ can be computed. For example, if $[a, b] = [-1/2, 3/2]$ then the data should be enumerated as $u_0, \ldots, u_n$, whereas if $[a, b] = [-1/2, 5/2]$ then it should be $u_{-1}, u_0, \ldots, u_n$.

In the case of $d_0 \in (-1/2, 3/2)$, we shall estimate the scale parameter $b_0$ of (1.3) by

$$\beta_{m,u}(\hat d) := \frac{1}{m}\sum_{j=1}^{m}\lambda_j^{2\hat d}\,I_{n,u}(\lambda_j, \hat d)$$

and the long run variance $s_\xi^2$ by

$$\hat s_\xi^2 \equiv s^2_{m,u}(\hat d) = p(\hat d - k)\,\beta_{m,u}(\hat d), \qquad \text{for } \hat d \in (k - 1/2, k + 1/2],\ k = 0, 1. \tag{3.4}$$

In Dalla et al. (2006) for $|d| < 1/2$ and Abadir et al. (2007) for general $d$, it was shown that under weak (ergodicity type) assumptions on the generating process $\{\xi_t\}$, the FELW estimator is consistent at a rate faster than $1/\log n$,

$$\hat d - d_0 = o_p(1/\log n). \tag{3.5}$$

This is sufficient to write $n^{\hat d} = n^{d}(1 + o_p(1))$. Consistency of $\hat d$ did not require the assumption that $\{\xi_t\}$ be a Gaussian or a linear process. Under the same conditions, in Theorem 2.1 of Abadir et al. (2007), consistency of the estimator of $b_0$ was shown,

$$\beta_{m,u}(\hat d) \xrightarrow{p} b_0,$$

which together with (3.5) implies consistency of the estimator of the long run variance, $s^2_{m,u}(\hat d) \xrightarrow{p} s_\xi^2$. Asymptotic normality of the FELW estimator in Abadir et al. (2007) was derived for the case of a linear generating process $\{\xi_t\}$.
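The objective function (3.3) and its minimisation over $[-1/2, 3/2]$ can be sketched as follows. This is an illustrative implementation, not the authors' code: it assumes the data are stored as $u_0, u_1, \ldots, u_n$ as in Remark 3.1, evaluates the discrete Fourier transform directly, and replaces the exact minimisation in (3.2) by a grid search. The names `periodogram`, `felw_objective` and `felw_estimate` are hypothetical.

```python
import numpy as np

def periodogram(x, m):
    """Periodogram |w(lambda_j)|^2 at lambda_j = 2*pi*j/n, j = 1..m."""
    n = len(x)
    t = np.arange(1, n + 1)
    lam = 2.0 * np.pi * np.arange(1, m + 1) / n
    w = (np.exp(1j * np.outer(lam, t)) @ x) / np.sqrt(2.0 * np.pi * n)
    return lam, np.abs(w) ** 2

def felw_objective(d, u, m):
    """U_n(d) of (3.3) for d in (-1/2, 3/2]; u holds u_0, u_1, ..., u_n."""
    k = 0 if d <= 0.5 else 1                 # d lies in (k - 1/2, k + 1/2]
    x = u[1:] if k == 0 else np.diff(u)      # u_t or its first difference, t = 1..n
    lam, per = periodogram(x, m)
    per = np.abs(1.0 - np.exp(1j * lam)) ** (-2 * k) * per
    j = np.arange(1, m + 1, dtype=float)
    return np.log(np.mean(j ** (2.0 * d) * per)) - 2.0 * d * np.mean(np.log(j))

def felw_estimate(u, m, grid=None):
    """Grid-search minimiser of U_n(d); the grid is an illustrative choice."""
    if grid is None:
        grid = np.linspace(-0.49, 1.49, 199)
    values = [felw_objective(d, u, m) for d in grid]
    return float(grid[int(np.argmin(values))])

rng = np.random.default_rng(0)
n = 500
m = int(n ** 0.65)
noise = rng.standard_normal(n + 1)
d_noise = felw_estimate(noise, m)             # white noise, true d = 0
d_walk = felw_estimate(np.cumsum(noise), m)   # random walk, true d = 1
```

The two simulated series are only a plausibility check: the estimate should fall near 0 for white noise and near 1 for a random walk, with the switch between the two branches of $U_n(d)$ handled by the choice of $k$.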
3.2. The two models with deterministic components

Assume now that $\{u_t\}$ is not observed. In model (1.1), to estimate $d$ we use the detrended observations

$$\hat u_t = X_t - \hat\beta_2 t = u_t + \beta_1 + (\beta_2 - \hat\beta_2)t, \qquad t = 0, 1, \ldots, n, \tag{3.6}$$

where $\hat\beta_2$ is the OLS estimator (2.4). In regression (1.2), we use instead

$$\hat u_t = X_t - \tilde\beta_2 t - \tilde\beta_3\sin(\omega t) - \tilde\beta_4\cos(\omega t) = u_t + \beta_1 + (\beta_2 - \tilde\beta_2)t + (\beta_3 - \tilde\beta_3)\sin(\omega t) + (\beta_4 - \tilde\beta_4)\cos(\omega t), \qquad t = 0, 1, \ldots, n, \tag{3.7}$$

where $(\tilde\beta_2, \tilde\beta_3, \tilde\beta_4)$ is the OLS estimator (2.12). The FELW estimator is computed using the series $\{\hat u_t\}$. Henceforth, we write $\hat d$ for $\hat d_{\hat u}$ only. Recall that for $d_0 > -1/2$, the FELW estimator is exactly (not just asymptotically) invariant with respect to the mean because of the property $\sum_{t=1}^{n} e^{i\lambda_j t} g_t = 0$ when $g_t$ is constant; and for $d_0 > 1/2$, it is so with respect to a linear trend because of the inherent differencing in the estimation procedure; see Abadir et al. (2007). In regression models (1.1) and (1.2), we assume that $d \in (-1/2, 3/2)$ and hence the minimization in (3.2) that yields $\hat d$ is carried out over $[-1/2, 3/2]$. We show below that the estimator of $d$ based on $\{\hat u_t\}$ has the same asymptotic properties as in the case when $\{u_t\}$ is observed. We shall need the following assumption.

Assumption L. $\{\xi_t\}$ is a linear process (2.14) such that: the i.i.d. noise $\{\varepsilon_t\}$ in (2.14) has finite fourth moment, the spectral density

$$f_\xi(\lambda) = |\lambda|^{-2d_\xi}\big(b_0 + \lambda^2 b_1 + o(\lambda^2)\big), \qquad \text{as } \lambda \to 0, \tag{3.8}$$

for some $d_\xi \in (-1/2, 1/2)$, $b_0 > 0$, and the transfer function $\alpha(\lambda) := \sum_{j=0}^{\infty} e^{ij\lambda} a_j$ is such that

$$\frac{d\alpha(\lambda)}{d\lambda} = O(|\alpha(\lambda)|/\lambda), \qquad \text{as } \lambda \to 0+.$$

The i.i.d. assumption imposed on $\{\varepsilon_t\}$ in (2.14) is only needed for expository purposes and can be relaxed to martingale difference.
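The detrending step (3.6) and the exact mean invariance of the periodogram at the Fourier frequencies can be illustrated with a short sketch. It assumes observations indexed $t = 1, \ldots, n$ (the text uses $t = 0, \ldots, n$) and a simulated white-noise error; the helper names `dft` and `detrended_residuals` are hypothetical.

```python
import numpy as np

def dft(x, m):
    """w(lambda_j) = (2*pi*n)^(-1/2) * sum_t exp(i*t*lambda_j) x_t, j = 1..m."""
    n = len(x)
    t = np.arange(1, n + 1)
    lam = 2.0 * np.pi * np.arange(1, m + 1) / n
    return (np.exp(1j * np.outer(lam, t)) @ x) / np.sqrt(2.0 * np.pi * n)

def detrended_residuals(x):
    """Residuals X_t - b2 * t as in (3.6); the intercept is deliberately left in."""
    n = len(x)
    t = np.arange(1, n + 1, dtype=float)
    Z = np.column_stack([np.ones(n), t])
    beta, *_ = np.linalg.lstsq(Z, x, rcond=None)
    return x - beta[1] * t

rng = np.random.default_rng(1)
n, m = 256, 30
t = np.arange(1, n + 1)
x = 3.0 + 0.02 * t + rng.standard_normal(n)
u_res = detrended_residuals(x)

# Adding a constant leaves the periodogram at Fourier frequencies unchanged,
# because sum_t exp(i*lambda_j*t) = 0 for j = 1..n-1.
per = np.abs(dft(u_res, m)) ** 2
per_shifted = np.abs(dft(u_res + 10.0, m)) ** 2
```

This is why the intercept $\beta_1$ does not need to be removed from the residuals before computing the FELW estimator.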
K.M. Abadir et al. / Journal of Econometrics 163 (2011) 186–199 Table 1 Coverage probabilities of 90% confidence intervals for the LS estimator.
d d d d d d d d d d
= −0.4 = −0.2 =0 = 0.2 = 0.4 = 0.6 = 0.8 =1 = 1.2 = 1.4
ρ = −0.5 β1 β2
ρ=0 β1
0.97 0.91 0.85 0.85 0.81
0.95 0.86 0.83 0.84 0.80
0.96 0.91 0.85 0.86 0.85 0.85 0.86 0.86 0.85 0.81
Table 3 Root MSEs of LS estimators.
ρ = 0.5 β1
β2 0.96 0.87 0.86 0.85 0.82 0.83 0.82 0.83 0.85 0.80
β2
0.88 0.88 0.88 0.87 0.82
d d d d d d d d d d
0.89 0.87 0.86 0.87 0.87 0.89 0.90 0.90 0.90 0.79
(i) Then, if m = o(n4/5 ),
(3.9)
and, if n4/5 /m = o(1), then
p
(n/m) ( d − d0 ) − → 2
b1 , 9b0 (2π )2
if d0 ∈ (−1/2, 1/2),
b1
β1
β2
0.018138 0.036291 0.089922 0.244466 0.901005
6.70865e−5 0.000130 0.000309 0.000761 0.002061 0.005568 0.016482 0.049259 0.169340 0.754884
0.018173 0.036311 0.089924 0.244473 0.901016
6.72490e−5 0.000130 0.000309 0.000761 0.002061 0.005568 0.016482 0.049259 0.169340 0.754873
In this section, the finite sample performance of the asymptotic results given earlier is assessed through simulations. We let the errors $\{\xi_t\}$ be generated from a Gaussian fractional ARIMA$(1, d_\xi, 0)$ noise with unit standard deviation, where the autoregressive parameter ρ is equal to −0.5, 0, 0.5. Throughout the simulation exercise, the number of replications is 10,000 and the sample size is $n = 500$. The memory parameter $d$ is estimated using the Local Whittle bandwidth parameter $m = \lfloor n^{0.65}\rfloor = 56$. We have also tried two other bandwidths, which are not reported for space considerations: $m = \lfloor n^{0.7}\rfloor = 77$, whose results are almost indistinguishable from those for $\lfloor n^{0.65}\rfloor$, and $m = \lfloor n^{0.8}\rfloor = 144$, whose results are dominated by those for $\lfloor n^{0.65}\rfloor$. For more details on our choice of bandwidth, see Abadir et al. (2007), Dalla et al. (2006), and Shimotsu and Phillips (2005). Under some additional restrictions, the optimal $m$ can be chosen using the data-driven methods of Henry and Robinson (1996) or Andrews and Sun (2004).

We start with the case when the data generating process follows model (1.1). Using asymptotic CIs, we report Coverage Probabilities (CPs) in Table 1. The CPs are satisfactory, with a few exceptions: low values of ρ and $d$, and high values of ρ.

We then move to the estimation of the regression parameters $\beta_1, \beta_2, \beta_3, \beta_4$. Corollary 2.1 established consistency and a central limit theorem for the LS estimators of $\beta_2, \beta_3, \beta_4$ when $d \in (-1/2, 3/2)$ and of $\beta_1$ when $d \in (-1/2, 1/2)$. Using asymptotic CIs, we report CPs in Table 2. CPs for $\beta_1$ and $\beta_2$ are very similar to those reported in Table 1. For $\beta_3$ and $\beta_4$, we observe a deterioration of CPs at the opposite extremes of the table: low ρ and $d$, and high ρ and $d$. Notice that in the Monte Carlo exercise, apart from ω, all the remaining parameters needed to construct CIs are estimated.
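The simulation design described above can be sketched as follows. The bandwidth values follow directly from the text; the generator is only one possible way to produce ARFIMA$(1, d_\xi, 0)$ noise. It assumes a truncated MA(∞) expansion of $(1 - L)^{-d_\xi}$, a burn-in period, and rescaling by the sample standard deviation, none of which are stated in the paper. All function names are hypothetical.

```python
import math
import numpy as np

def bandwidth(n, exponent):
    """m = floor(n ** exponent)."""
    return math.floor(n ** exponent)

def frac_weights(d, size):
    """Coefficients of (1 - L)^(-d): w_0 = 1, w_j = w_{j-1} * (j - 1 + d) / j."""
    w = np.empty(size)
    w[0] = 1.0
    for j in range(1, size):
        w[j] = w[j - 1] * (j - 1 + d) / j
    return w

def simulate_arfima(n, d_xi, rho, rng, burn=2000):
    """ARFIMA(1, d_xi, 0) noise via AR(1) filtering and a truncated fractional filter."""
    eps = rng.standard_normal(n + burn)
    ar = np.empty(n + burn)
    prev = 0.0
    for i, e in enumerate(eps):          # AR(1): y_t = rho * y_{t-1} + eps_t
        prev = rho * prev + e
        ar[i] = prev
    w = frac_weights(d_xi, n + burn)
    xi = np.convolve(ar, w)[: n + burn][burn:]
    return xi / xi.std()                 # rescaling is an illustrative assumption

def simulate_I_d(n, d, rho, rng):
    """I(d) series: xi itself for |d| < 1/2, its partial sums for 1/2 < d < 3/2."""
    if d < 0.5:
        return simulate_arfima(n, d, rho, rng)
    return np.cumsum(simulate_arfima(n, d - 1.0, rho, rng))

rng = np.random.default_rng(2)
u = simulate_I_d(500, 0.8, 0.5, rng)
m_values = [bandwidth(500, a) for a in (0.65, 0.7, 0.8)]
```

The three bandwidths evaluate to 56, 77 and 144 for $n = 500$, matching the values quoted in the text.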
Theorem 3.1. Assume that ut ∼ I(d0 ), where d0 ∈ (−1/2, 3/2), d0 ̸= 1/2. Suppose that the generating process {ξt } is a linear process (2.14) satisfying Assumption L.
d
β2
4. Simulation results
respectively, applied to the residuals ut = Xt − Xt .
m( d − d0 ) − → N(0, 1/4),
= −0.4 = −0.2 =0 = 0.2 = 0.4 = 0.6 = 0.8 =1 = 1.2 = 1.4
β1
Notes: This table reports root mean squared errors of the LS estimators of deterministic components. In particular, the data generating process is given by Xt = β1 + β2 t + β3 cos(ωt ) + β4 sin(ωt ) + ut , ω = π/2, ut ∼ I(d). β1 , β2 are the LS estimators when the model is correctly specified, while β1 , β2 denote the LS estimator for the misspecified model.
Notes: This table reports coverage probabilities of the 90% confidence intervals constructed using the asymptotic theory developed in Section 2. In particular, the data generating process is given by Xt = β1 + β2 t + ut , ut ∼ I(d). β1 , β2 are estimated using Ordinary Least Squares. The long memory parameter d and the long run variance s2ξ are estimated using the FELW method and formula (3.4),
√
1
+ , 2 108(2π )2 9b0 (2π ) if d0 ∈ (1/2, 3/2).
(3.10)
If $m$ is proportional to $n^{4/5}$, we get a CLT but the limiting mean is not zero.

(ii) The results of (i) remain valid if $\hat d = \hat d_{\hat u}$ is computed using: (a) $\hat u_t$ of (3.6) obtained from model (1.1), (b) $\hat u_t$ of (3.7) obtained from model (1.2).

The estimation of $\beta_1$ and $\beta_2$ in Corollary 2.1 requires only $o_p(1/\log n)$ consistency of an estimator $\hat d$ of $d_0$, and does not assume $\{\xi_t\}$ to be a linear process. However, if one is interested in obtaining the asymptotic normality of $\hat d$, then this is derived in Theorem 3.1 under the assumption that $\{\xi_t\}$ is a linear process.

Table 2
Coverage probabilities of 90% confidence intervals for the LS estimator.

            ρ = −0.5                   ρ = 0                      ρ = 0.5
d           β1    β2    β3    β4       β1    β2    β3    β4       β1    β2    β3    β4
d = −0.4    0.95  0.95  0.84  0.84     0.95  0.95  0.86  0.88     0.88  0.88  0.92  0.91
d = −0.2    0.89  0.91  0.84  0.85     0.86  0.86  0.87  0.88     0.86  0.86  0.93  0.92
d = 0       0.84  0.87  0.86  0.87     0.86  0.87  0.91  0.90     0.86  0.84  0.94  0.94
d = 0.2     0.84  0.84  0.88  0.87     0.82  0.84  0.91  0.92     0.89  0.86  0.93  0.93
d = 0.4     0.79  0.86  0.91  0.90     0.79  0.83  0.93  0.93     0.81  0.87  0.94  0.94
d = 0.6     -     0.86  0.85  0.86     -     0.86  0.88  0.87     -     0.86  0.91  0.91
d = 0.8     -     0.87  0.84  0.83     -     0.85  0.87  0.87     -     0.89  0.91  0.89
d = 1       -     0.87  0.85  0.88     -     0.85  0.89  0.88     -     0.90  0.92  0.92
d = 1.2     -     0.86  0.90  0.90     -     0.85  0.91  0.92     -     0.90  0.92  0.92
d = 1.4     -     0.82  0.93  0.93     -     0.82  0.92  0.92     -     0.80  0.82  0.82

Notes: This table reports coverage probabilities of the 90% confidence intervals constructed using the asymptotic theory developed in Section 2. In particular, the data generating process is given by $X_t = \beta_1 + \beta_2 t + \beta_3\cos(\omega t) + \beta_4\sin(\omega t) + u_t$, $\omega = \pi/2$, $u_t \sim I(d)$. $\beta_1, \beta_2, \beta_3, \beta_4$ are estimated using Ordinary Least Squares. The long memory parameter $d$ and the long run variance $s_\xi^2$ are estimated using the FELW method and formula (3.4), respectively, applied to the residuals $\hat u_t = X_t - \hat X_t$. The estimator of the spectral density $f_\xi(\omega)$ is given by formula (4.1). Entries for $\beta_1$ are reported only for $d < 1/2$.
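The limit result (3.9) of Theorem 3.1(i), $\sqrt{m}(\hat d - d_0) \to N(0, 1/4)$, implies an asymptotic confidence interval for $d_0$ of half-width $c_\gamma/(2\sqrt{m})$. The following is a minimal sketch of that interval, valid only under the conditions of the theorem (linear $\{\xi_t\}$ and $m = o(n^{4/5})$); the function name `d_confidence_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def d_confidence_interval(d_est, m, level=0.90):
    """Interval d_est -/+ c / (2 * sqrt(m)) based on sqrt(m) * (d_est - d0) -> N(0, 1/4)."""
    c = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = c * 0.5 / math.sqrt(m)
    return d_est - half, d_est + half

# With the bandwidth m = 56 used in the simulations:
lo, hi = d_confidence_interval(0.2, 56)
asymptotic_sd = 0.5 / math.sqrt(56)
```

For $m = 56$ the asymptotic standard deviation is about 0.067, which can be set against the root MSEs of the estimator of $d$ reported in Table 4.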
Table 4
Bias and root MSEs of the estimator $\hat d_{\hat u}$ for different values of ρ.

            ρ = −0.5                 ρ = 0                    ρ = 0.5
d           Bias        RMSE         Bias        RMSE         Bias       RMSE
d = −0.4    −0.004406   0.082613     0.005228    0.080757     0.101875   0.129679
d = −0.2    −0.026378   0.085891     −0.010943   0.080904     0.087739   0.119000
d = 0       −0.036340   0.087974     −0.019628   0.080196     0.082967   0.116947
d = 0.2     −0.041281   0.090194     −0.028645   0.089280     0.080431   0.113747
d = 0.4     −0.043690   0.090695     −0.028162   0.090983     0.084435   0.120408
d = 0.6     −0.030618   0.091283     −0.016453   0.089691     0.091507   0.125162
d = 0.8     −0.027204   0.089273     −0.012923   0.085367     0.096704   0.125939
d = 1.0     −0.027851   0.080885     −0.011396   0.077266     0.088296   0.120020
d = 1.2     −0.023571   0.082448     −0.009434   0.081665     0.090578   0.119118
d = 1.4     −0.022104   0.082663     −0.005929   0.077420     0.096853   0.123975

Notes: This table reports bias and root MSEs of the FELW estimator of $d$. In particular, the data generating process is given by $X_t = \beta_1 + \beta_2 t + \beta_3\cos(\omega t) + \beta_4\sin(\omega t) + u_t$, $\omega = \pi/2$, $u_t \sim I(d)$. The long memory parameter $d$ is estimated using the FELW method applied to the residuals $\hat u_t = X_t - \hat X_t$.
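A consistent estimator of the spectral density at ω is given as follows. Define

$$w_u(\lambda_j; d) := \begin{cases} (2\pi n)^{-1/2}\sum_{t=1}^{n} e^{it\lambda_j}u_t, & -1/2 < d < 1/2; \\[2mm] (2\pi n)^{-1/2}\sum_{t=1}^{n} e^{it\lambda_j}(u_t - u_{t-1}), & 1/2 < d < 3/2. \end{cases}$$

Then,

$$\hat f_\xi(\omega) = \frac{2}{n}\sum_{j=1}^{\lfloor n/2\rfloor - 1} b_n(\lambda_j)\,\big|w_{\hat u}(\lambda_j - \omega; \hat d)\big|^2, \tag{4.1}$$

where

$$b_n(\lambda_j) = q^{-1}\Big(\frac{\sin(q\lambda_j/2)}{\sin(\lambda_j/2)}\Big)^2, \qquad q = n^{1/3}.$$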
For $-1/2 < d < 1$, the sign $A_k > 0$ is obvious by (2.15). Since $A_k$ may be close to zero for $1 \le d < 3/2$, we conducted a numerical check with $n = 1, \ldots, 1000$ and a dense grid of $d \in [1, 1.5)$ and $\omega \in (0, \pi)$: the minimum of the function

$$\theta_{1k}^2\psi_{11} + \theta_{2k}^2\psi_{22} + 2\theta_{1k}\theta_{2k}\psi_{12} + 4\gamma_k\Big(\frac{\theta_{1k}}{2} + v\,\theta_{2k}\Big) + 4\gamma_k^2$$

(appearing in $A_k$) is positive. It is also possible to use $\hat A_k^+ := \max\{0, \hat A_k\}$, although simulations show that this is not necessary. It is worth mentioning that the CPs improve and become very close to the nominal ones when infeasible CIs (ones based on the "true" memory parameter and spectral density) are calculated. A full set of tables is not reported for space reasons and is available upon request.

Table 3 provides a check for Remark 2.1. Comparing the mean squared errors (MSEs) of both misspecified and correctly specified LS estimators of $\beta_1$ and $\beta_2$, it is immediate to see that there is no loss of precision caused by misspecification here. Therefore, if the purpose is only the consistent estimation of location and trend parameters, then one can ignore modelling cyclical components and just estimate a simple mean plus trend model. Of course, this is not true in the case of testing hypotheses on $\beta_1$ and $\beta_2$, since in the latter case one has to estimate $s_\xi^2$, which requires a correct specification of the model. The specification

$$X_t = \beta_1 + \beta_2 t + z_t + u_t, \qquad z_t = \sum_{j=1}^{3}\big(\beta_{3,j}\sin(\omega_j t) + \beta_{4,j}\cos(\omega_j t)\big),$$

with $\beta_{3,1} = \beta_{4,1} = 1$, $\beta_{3,2} = \beta_{4,2} = 1/2$, $\beta_{3,3} = \beta_{4,3} = 3/2$, $\omega_1 = \pi/2$, $\omega_2 = \pi/4$, $\omega_3 = 3\pi/2$, was also tried and gave very similar results.

Table 4 reports the results of estimation of the fractional integration parameter $d$ when the series are detrended and the extended local Whittle estimator $\hat d_{\hat u}$ of the previous section is used. We check the performance of the estimator for a range of values of $d \in (-1/2, 3/2)$. The table confirms some previous findings of the literature on Whittle estimation of the memory parameter. In particular, we observe a slight negative bias for ρ = −0.5 and ρ = 0 and a higher, positive bias when ρ = 0.5. The MSEs of $\hat d_{\hat u}$ are generally slightly higher than those of $\hat d_X$ in Abadir et al. (2007), reflecting the contribution of the estimation of deterministic components. These are general features that hold for the whole range considered for $d$, except at the lower boundary of our interval. For $d = -0.4$ and ρ = −0.5, ρ = 0 the estimator of $d$ is less biased and, in the case where ρ = 0, it is (slightly) positively biased.

Acknowledgements

We are grateful for the comments of seminar participants at the Bank of Italy, Bilbao, Econometric Society World Congress (London), GREQAM, Liverpool, LSE, Oxford, Queen Mary, Tilburg, Tinbergen Institute, Vilnius Academy of Science, York. We also thank the Editor, Associate Editor, and referees for their constructive comments. This research is supported by the ESRC grants R000239538, RES000230176, RES062230311, and RES062230790.

Appendix. Proofs

Preliminary facts. Assume that $u_t \sim I(d)$, where $-1/2 < d < 3/2$ and $d \ne 1/2$. Property (1.3) together with the definition of $Y_n(r)$ and (2.1) implies that, for any $0 < r \le 1$, as $n \to \infty$,

$$E\,Y_n^2(r) \to s_\xi^2\,r^{1+2d_\xi} = E\,Y_\infty^2(r), \qquad \max_{0 \le r \le 1} E\,Y_n^2(r) \le C. \tag{A.1}$$

Also, for any $0 < r \le s \le 1$, one has the equality

$$E\,Y_n^2(s) = E\big(([Y_n(s) - Y_n(r)] + Y_n(r))^2\big) = E\big((Y_n(s) - Y_n(r))^2\big) + 2E\big(Y_n(s)Y_n(r)\big) - E\big(Y_n^2(r)\big),$$

where $E((Y_n(s) - Y_n(r))^2) \sim E((Y_n(s - r))^2)$, which together with (A.1) yields the equality

$$E\big(Y_n(r)Y_n(s)\big) \to s_\xi^2\,R_d(r, s) = E\big(Y_\infty(r)Y_\infty(s)\big), \tag{A.2}$$

where $R_d(r, s) = E\big(J_{1/2+d_\xi}(r)\,J_{1/2+d_\xi}(s)\big)$ is given by (2.3).
12n−2
(−8s13 + 12s23 n−1 )n−1 (12s13 − 24s23 n−1 )n−2
··· ···
···
−6n−1
4
· · · S − 1 = n− 1 ··· ···
2
−1 O(n ) (−8s14 + 12s24 n−1 )n−1 (12s14 − 24s24 n−1 )n−2 + n−1 ··· ··· 0 2 ···
O(n−2 ) O(n−4 )
O(n−2 ) O(n−3 ) O(n−2 )
··· ···
O(n−2 ) O(n−3 ) O(n−1 ) O(n−2 )
···
Box I.
For 0 < ω < π , we compute the elements sij of the symmetric matrix S of (2.13) as s11 =
n −
1 = n;
s12 =
n −
t =1
s13 =
n −
sin (ωt ) =
cos(ω/2) − cos(ω(n + 1/2)) 2 sin(ω/2)
t =1 n
s14 =
−
cos (ωt ) =
sin(ω(n + 1/2))
t =1
s22 =
n −
t = n2 /2 + n/2;
t =1
2 sin(ω/2)
−
1 2
= O(1);
tβ2 = −6Zn,1;u + 12Zn,2;u , tβ3 = s13 (−8Zn,1;u + 12Zn,2;u ) + (s23 /n)(12Zn,1;u − 24Zn,2;u ), tβ4 = s14 (−8Zn,1;u + 12Zn,2;u ) + (s24 /n)(12Zn,1;u − 24Zn,2;u ).
4 sin2 (ω/2)
+
sin(ωn) 4 sin2 (ω/2)
Hence,
= O(n);
t cos (ωt )
s33 =
cos(ωn) − cos(ω(n + 1))
n −
4 sin2 (ω/2) sin2 (ωt ) =
t =1
−
− n −
= O(n);
2 t =1
sin (ωt ) cos (ωt ) = (1/2)
n −
t =1
n −
(1 − sin2 (ωt )) = n/2 + O(1);
(A.5)
t =1
k = 1, . . . , 4,
where
tn,1;u∗ tn,2;u∗
= =
;
(A.6)
(A.7)
s12 s22
−1
tn,1;u∗ tn,2;u∗
,
(A.8)
tn,1;u tn,2;u
tn,1;u tn,2;u
s13 + s14 s23 + s24
+
O(1) , O(n)
+
(A.9)
where s13 , s14 , s23 , s24 are elements of S satisfying |s13 | + |s14 | = O(1), |s23 | + |s24 | = O(n), as it was shown above in the derivation of the elements of S. Since
n1/2+d m1 m2 n3/2+d m = (1/2)∨(d−1/2) . n 3 ( 1 / 2 )∨( d − 1 / 2 ) m4 n
=
tβ1 + O(n−1∧(1+d) )Rn,1 tβ2 + O(n−1∧(1+d) )Rn,2
where u∗t = uj + β3 sin(t ω) + β4 cos(t ω). Then,
by means of sin(∑ x) = (eix − e−ix )/ (2i) and cos(x) = (eix + e−ix )/2. n Denote tn,p;u = t =1 ztp ut , p = 1, . . . , 4 and set 1 Zn,k ≡ Zn,k;u := m− k tn,k;u ,
β1 β1 s = + 11 β2 s21 β2
sin (2ωt ) = O(1);
t =1
cos2 (ωt ) =
for j = 3, 4, where E|Rn,j |2 < ∞ for j = 1, . . . , 4 because E|Zn,j;u |2 < ∞ in view of Lemma A.1. In the case of the LS estimator ( β1 , β2 ), we have
(1 − cos (2ωt )) = n/2 + O(1);
t =1
s44 =
2 sin2 (ω/2)
n 1−
n
s34 =
sin2 (ωn/2)
τβ1 τβ2
τβj = 2Zn,3;u + n−1+d tβ j + n−1 Rn,j , if − 1/2 < d < 1 = 2Zn,3;u + tβj + n−1 Rn,j , if 1 ≤ d < 3/2,
t =1
= n
where
t sin (ωt )
sin(ωn) − sin(ω(n + 1))
n −
s24 =
n−1/2+d [tβ1 + O(n−1∧(1+d) )Zn∗ ] −3/2+d n [tβ2 + O(n−1∧(1+d) )Zn∗ ] , = − − 3 / 2 + d n [tβ3 ] + n 1/2∧(3/2−d) [2Zn,3;u + O(n−1 )Zn∗ ] n−3/2+d [tβ4 ] + n−1/2∧(3/2−d) [2Zn,4;u + O(n−1 )Zn∗ ] tβ1 = 4Zn,1;u − 6Zn,2;u ,
t =1
= n
(A.4)
t =1
s23 =
n1/2+d Zn,1;u tn,1;u n3/2+d Zn,2;u t S −1 n,2;u = S −1 n1/2∨(d−1/2) Zn,3;u tn,3;u tn,4;u n1/2∨(d−1/2) Zn,4;u
= O(1); (A.3)
t 2 = n3 /3 + n2 /2 + n/6;
n −
obtain a symmetric matrix S −1 given in Box I. Then,
s11 s21
s12 s22
−1
s12 s22
−1
= n− 1
4 + O(n−2 ) −6n−1 + O(n−2 )
−6n−1 + O(n−3 ) , 12n−2 + O(n−4 )
then
Set Zn∗ = |Zn,1 | + · · · + |Zn,4 |. By the well-known property of LS estimators,
tn,1;u ′ −1 tn,2;u β ≡ (β1 , β2 , β3 , β4 ) = β + S . tn,3;u tn,4;u Inverting the symmetric matrix S, defined by (2.13), and bearing in mind the formulae for sij above which are valid for 0 < ω < π , we
s11 s21
tn,1 tn,2
=
s11 s21
s12 s22
−1
n1/2+d Zn,1;u n3/2+d Zn,2;u
n−1/2+d [4Zn,1;u − 6Zn,2;u + O(Zn′ /n)] , n [−6Zn,1;u + 12Zn,2;u + O(Zn′ /n)]
=
−3/2+d
where Zn′ = |Zn,1 | + |Zn,2 |. Hence, −1/2+d β1 β1 n [tβ1 + O(Zn′ /n)] = + −3/2+d , β2 n [tβ2 + O(Zn′ /n)] β2 τβ1 t + n−1 rn,1 = β1 , τβ2 tβ2 + n−1 rn,2
(A.10)
(A.11)
where E|rn,j | < ∞ for j = 1, 2 since E|Zn,j;u |2 < ∞ in view of Lemma A.1. Hence, the LS estimators ( β1 , β2 ) and ( β1 , β2 ) are equivalent:
τβ1 τβ2
τβ1 + Op (n−1∨(1+d) ) . = τβ2 + Op (n−1∨(1+d) )
Case of 1/2 < d < 3/2. Then, d = 1 + dξ , and the definition ∑t of the I(d) model implies ut = j=1 ξj + u0 , where by assumption E u20 < ∞. Using notation (2.2), write
(A.12)
s2ξ /2 2 sξ (3 + 2d)−1 0 0
s2ξ
sξ /2 9 := 0 2
0
0 0
0 0 0
= n3/2+dξ
0
(A.13)
tn,2;u =
n −
= n3/2+d
= n3/2+d
cov(Zn,k;u , Zn,j;u )
k,j=1,...,4
→ 9.
(A.15)
Proof of Theorem 2.1. Since the estimator β2 , (2.4) is invariant with respect to a shift of Xt , we can set u0 = 0 for 1/2 < d < 3/2. (i) Since the estimators β1 and β2 can be written as linear combinations of variables Zn,1 and Zn,2 , (A.10) together with Lemma A.1 implies the limits (2.8)–(2.9). (ii)–(iii). In view of expansion (A.10) and (A.15),
τβ1 τβ2
=
4Zn,1 − 6Zn,2 −6Zn,1 + 12Zn,2
+
op (1) . op (1)
Denote by (V1 , V2 ) a zero mean Gaussian vector with the covariances cov(Vk , Vj ) = ψkj ,
k, j = 1, 2.
d
(Zn,1;u , Zn,2;u ) − → (V1 , V2 ).
(A.16)
By the Cramer–Wold device, (A.16) holds if and only if for any real numbers a, b, d
aZn,1;u + bZn,2;u − → aV1 + bV2 .
(A.17)
d
(τβ1 , τβ2 ) − → (4V1 − 6V2 , −6V1 + 12V2 ). Since var(4V1 − 6V2 ) = σ12,d , var(−6V1 + 12V2 ) = σ22,d , and
cov(4V1 − 6V2 , −6V1 + 12V2 ) = σ12,d , where σ12,d , σ12,d , and σ22,d are given by (2.7)–(2.6), this completes proof of (2.10)–(2.11). It remains to prove (A.17). Case of |d| < 1/2. Then, uj = ξj . Using the notation Yn (r ), (2.2), and summation by parts, we can write,
ξk = n1/2+d Yn (1) − ξn+1 ,
n −
t ξt = −
t =1
n−1 − t − t =1 k=1
[ ∫ = n3/2+d −
1
ξk + n
n − k=1
]
Yn (r )dr + Yn (1) +
n −
Whence Zn,1;u = Yn (1) + op (1), 1
∫
Yn (r )dr + Yn (1) + op (1). 0
tSt ,1;ξ + u0
t =1 1
∫
n
Yn (r )dr + u0 O(n2 )
1
[∫
t
t =1
(⌊rn⌋ + 1)
0
n −
]
rYn (r )dr + Op (n−1 ) + u0 O(n2 ),
(A.20)
0
= op (n3/2+d ), since
2 in view of (A.1), observing that u0 O(n ) 2 E u0 < ∞ and d > 1/2. In this case, 1
∫
1
∫
Yn (r )dr + op (1),
Z n ,1 =
rYn (r )dr + op (1).
Zn,2 =
0
0
Note that {Yn (r ), r ∈ [0, 1]}, n ≥ 0 is a sequence of real valued measurable processes with the paths from the space L2 [0, 1], i.e. 1 2 Y (r )dr < ∞, and for any reals a, b, and c, 0 n 1
∫
Yn (r )dr + b
Fn := a
1
∫
rYn (r )dr + cYn (1) 0
is continuous real valued bounded functional from L2 [0, 1] to R. Relations (A.1) and (A.2) and Assumption FDD imply that the process Yn (r ) satisfies the conditions of the weak convergence criterion in the space L2 [0, 1] by Cremers and Kadelka (1986), which implies the convergence to the Gaussian limit d
1
∫
Y∞ (r )dr + b
Fn − →a 0
∫
1
rY∞ (r )dr + cY∞ (1)
0
Z n ,2 = −
n −1 −
Zk,1 + nZn,1 +
k=1
n −1 −
o(k1/2+d ) + no(n1/2+d )
k=1
= o(n3/2+d ). Denote β10 , β20 the OLS estimators corresponding to case zt = 0. Then (A.8)–(A.9) imply that βi = βi0 + Ri , i = 1, 2 where
|R2 | ≤ C (n−2 |Zn,1 | + n−3 |Zn,2 |) = o(n−3/2+d ).
ξk
0
Zn,2;u = −
n −
|R1 | ≤ C (n−1 |Zn,1 | + n−2 |Zn,2 |) = o(n−1/2+d ),
k=1
tn,2;u =
(A.19)
which proves (A.17). ∑n ∑n (iv) Set Zn,1 := t =1 zt , Zn,2 := t =1 tzt . In (iv) it is assumed that Zn,1 = o(n1/2+d ). Summation by parts yields
Convergence (A.16) implies that
n −
Yn (r )dr + nu0 ,
0
We shall show that
tn,1;u =
tut =
t =1
if −1/2 < d < 1/2, and as (A.14) in Box II if 1/2 < d < 1. Lemma A.1 will show that, under general assumptions,
∫
St ,1;ξ + nu0
t =1 1
0
π fξ (ω)
π fξ (ω)
n −
ut =
t =1
Define the 4 × 4 matrix 9d ≡ 9 = (ψkj )k,j=1,...,4 as
n −
tn,1;u =
ξk − nξn+1 . (A.18)
Since results (i) and (ii) of Theorem 2.1 are valid for βi0 , i = 1, 2, they remain true for βi , i = 1, 2, because the terms R1 and R2 are negligible.
k=1
Lemma A.1. Assume that ut ∼ I(d), with −1/2 < d < 3/2, d ̸= 1/2, and let n → ∞. If −1/2 < d < 1, then cov(Zn,k;u , Zn,j;u ) → s2ξ ψkj ,
k, j = 1, . . . , 4,
(A.21)
$$\Psi := \begin{pmatrix} \dfrac{s_\xi^2}{1+2d} & \dfrac{s_\xi^2(6d-1)}{8d(1+2d)} & 0 & 0 \\[3mm] \dfrac{s_\xi^2(6d-1)}{8d(1+2d)} & \dfrac{s_\xi^2}{2(1+2d)}\Big(1 - \dfrac{1}{(3+2d)d}\Big) & 0 & 0 \\[3mm] 0 & 0 & \dfrac{\pi f_\xi(\omega)}{4\sin^2(\omega/2)} & 0 \\[3mm] 0 & 0 & 0 & \dfrac{\pi f_\xi(\omega)}{4\sin^2(\omega/2)} \end{pmatrix} \tag{A.14}$$

Box II.
where ψkj are entries of the matrix 9, defined by (A.13)–(A.14). If 1 ≤ d < 3/2, then cov(Zn,k;u , Zn,j;u ) → s2ξ ψkj , var(Zn,k;u ) =
k, j = 1, 2,
n2(1−d) π fξ (ω) 4 sin2 (ω/2)
(A.22)
(1 + o(1)) + sξ γ , 2
cov(Zn,1;u , Zn,k;u ) = sξ γk /2 + o(1),
Rn,k,j = (mk mj )−1
= (mk mj )−1
(A.24)
k = 3, 4,
(A.25)
1 −1 Rn,k,j := E Zn,k;u Zn,j;u = m− k mj E tn,k;u tn,j;u
1 −1 = m− k mj E
n −
n −
ztk ut
t =1
1
r Yn (r )dr p
.
s=1
E
1
∫
0
Yn (v)v dv l
0
0
E
Yn2
Rn,1,1 = n−1−2d E(tn,1;ξ )2 → s2ξ ; Yn (r )dr
2
1
→ s2ξ Rd (1, 1) − 2 Rd (1, r )dr 0 ∫ 1∫ 1 + Rd (r , v)drdv 0
Rn,1,2
0
= s2ξ (1 − 1 + (3 + 2d)−1 ) = s2ξ (3 + 2d)−1 ; = E n−1/2−d tn,1 n−3/2−d tn,2 = E Yn (1) − n−1/2−d ξn+1 ∫ 1 × − Yn (r )dr + Yn (1) + o(1) ∫ 01 → s2ξ − Rd (1, r )dr + Rd (1, 1) 0
= s2ξ (−1/2 + 1) = s2ξ /2.
(A.28)
n
−
Dn,3 (x) =
eitx sin(t ω)
t =1
= (Dn (x + ω)eiω − Dn (x − ω)e−iω )/(2i), n − Dn,4 (x) = eitx cos(t ω)
(A.29)
= (Dn (x + ω)eiω + Dn (x − ω)e−iω )/2,
For any ϵ > 0, 1
∫ 0
∫
sin(x/2)
|Re[Dn,3 (x)Dn,4 (−x)]| ≤ C |Dn (x − ω)Dn (x + ω)|.
Writing tn,1;ξ , tn,2;ξ as in (A.18), (A.26) implies
sin(nx/2)
|Dn,k (x)|2 = (|Dn (x + ω)|2 + |Dn (x − ω)|2 )/4 + O(|Dn (x + ω)||Dn (x − ω)|),
(1) → sξ Rd (1, 1) = sξ .
Rn,2,2 = E(n−3/2−d tn2 )2 = E Yn (1) −
eitx = eix(n+1)/2
which implies
2
n −
Dn,1 (x) = Dn (x)
(A.26)
0
2
(A.27)
t =1
∫ 1∫ 1 → s2ξ r p Rd (r , v)v l drdv; 0 0 ∫ 1 ∫ 1 E Yn (1) r p Yn (r )dr → s2ξ r p Rd (r , 1)dr ;
Dn,k (x)Dn,j (−x)fξ (x)dx.
the Dirichlet kernel. Then, since sin(x) = (eix − e−ix )/ (2i) and cos(x) = (eix + e−ix )/2,
Case of |d| < 1/2. Then, uj = ξj , d = dξ , and tn,k;u ≡ tn,k;ξ = ∑n t =1 ztk ξt . Properties (A.1)–(A.2) of Yn (r ) and the dominated convergence theorem imply that, for p, l = 0, 1, as n → ∞,
∫
π
∫
t =1
zsj us
ztk zsj E (ξt ξs )
Denote by Dn (x) :=
t =1
−π
where ψkj are entries of matrix (A.14), and γk , v are defined in (2.16). Proof of Lemma A.1. Denote for k, j = 1, . . . , 4,
n −
∑n
t ,s=1
k = 3, 4, (A.23)
k = 3, 4,
2
cov(Zn,2;u , Zn,k;u ) = s2ξ vγk + o(1),
2 k
To obtain the remaining Rn,k,j with k, j = 3, 4, set Dn,k (x) := eitx ztk , k = 1, . . . , 4 and write
+ o(1)
|Dn (x)| ≤ C ,
ϵ ≤ |x| ≤ π ;
|Dn (x)| ≤ C
n 1 + n|x|
,
|x| ≤ 2π − ϵ.
(A.30)
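The closed form (A.28) of the Dirichlet kernel and a bound of the type (A.30) can be checked numerically. The sketch below compares the direct sum $\sum_{t=1}^{n} e^{itx}$ with $e^{ix(n+1)/2}\sin(nx/2)/\sin(x/2)$, and evaluates the ratio $|D_n(x)|(1 + n|x|)/n$ on $0 < x \le \pi$. The constant $2\pi$ used in the comment is an elementary bound on that restricted range, chosen here for illustration; the paper leaves $C$ unspecified. Function names are hypothetical.

```python
import numpy as np

def dirichlet_direct(n, x):
    """D_n(x) = sum_{t=1}^{n} exp(i*t*x), computed term by term."""
    t = np.arange(1, n + 1)
    return np.sum(np.exp(1j * t * x))

def dirichlet_closed(n, x):
    """Closed form exp(i*x*(n+1)/2) * sin(n*x/2) / sin(x/2), for sin(x/2) != 0."""
    return np.exp(1j * x * (n + 1) / 2.0) * np.sin(n * x / 2.0) / np.sin(x / 2.0)

def bound_ratio(n, x):
    """|D_n(x)| * (1 + n*|x|) / n, which stays below 2*pi for 0 < |x| <= pi."""
    return np.abs(dirichlet_closed(n, x)) * (1.0 + n * np.abs(x)) / n

n = 50
xs = np.linspace(0.001, np.pi, 500)
max_gap = max(abs(dirichlet_direct(n, x) - dirichlet_closed(n, x)) for x in xs)
max_ratio = max(bound_ratio(n, x) for x in xs)
```

The bound follows from $|D_n(x)| \le \min\{n, 1/|\sin(x/2)|\}$ together with $|\sin(x/2)| \ge |x|/\pi$ on $|x| \le \pi$.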
If fξ is continuous at ω, then a standard argument, using (1.3) and (A.30), implies that, as n → ∞, n−1−2dξ
∫
π −π
n
−1
∫
π
|Dn (x)|2 fξ (x)dx → s2ξ ,
|Dn (x ± ω)|2 fξ (x)dx → 2π fξ (ω),
−π
∫
π
−π ∫ π −π
|Dn (x + ω)| |Dn (x − ω)|fξ (x)dx ≤ C log n, |Dn (x)| |Dn (x ± ω)|fξ (x)dx ≤ C (log n + n2dξ ∨0 ).
Therefore,
Similarly, we obtain the convergence
∫
Rn,k,k = n−1
−1 −1/2−dξ
π
mk n
|Dn,k (x)|2 fξ (x)dx
∫−π π
∼E
(|Dn (x + ω)|2 + |Dn (x − ω)|2 )/4 −π + O(|Dn (x + ω)| |Dn (x − ω)|) fξ (x)dx → π fξ (ω), k = 3, 4, (A.31) ∫ π |Dn (x + ω)| |Dn (x − ω)|fξ (x)dx → 0, |Rn,3,4 | ≤ Cn−1 = n− 1
n −
π
∫
|Dn (x)| (|Dn (x + ω)|
−π
+ |Dn (x − ω)|)fξ (x)dx → 0,
k = 3, 4.
(A.32)
It remains to obtain Rn,2,k , k = 3, 4. By (A.18), write tn,2;u as tn,2;ξ = −
n −1 −
t
= s2ξ
0
1 4
1
+
n −
tn,k;u =
=
1 + 2d
max |Dj (x)| ≤ C max j/(1 + j|x|) ≤ n/(1 + n|x|).
n −
1≤j≤n
Hence, by (A.27), (A.28) and (1.3),
n −
1≤j≤n
1≤j≤n
|Dj (x)| (|Dn (x + ω)|
+ |Dn (x − ω)|)fξ (x)dx ∫ π n ≤C (|Dn (x + ω)| −π 1 + n|x|
ztk ut =
t =1
n n − −
≤ Cn
n −1 − E Sj,1;ξ tn,k;ξ + n E tn,1:ξ tn,k;ξ
(n log n +
n2dξ ∨0 n
n − t =1
0
) → 0,
t k−1 ut
n −
s j − 1 us
s=1
r k−1 Rdξ (r , v)v j−1 drdv,
which yields
ψ22
s2ξ
6d − 1 ψ12 = s2ξ ; 8d(1 + 2d) 1 1 = s2ξ 1− . 2(1 + 2d) d(3 + 2d)
ψ11 =
1 + 2d
;
ztk
t =1
n −
ztk .
(A.35)
t =1
cos(ω(j − 1/2)) − cos(ω(n + 1/2)) 2 sin(ω/2) sin(ω(n + 1/2)) − sin(ω(j − 1/2)) 2 sin(ω/2)
; .
2 sin(ω/2)
+ |u0 |O(1), =
(A.36)
sin(ω(n + 1/2))tn,1;ξ − tn,3;ξ cos(ω/2) + tn,4;ξ sin(ω/2) 2 sin(ω/2) (A.37)
Using (A.31)–(A.32), we obtain E argument shows that E tn2,4;u = n
→ ψkj
π fξ (ω) 4 sin2 (ω/2)
tn2,3;u
given in Box III. A similar
+ n2d−1 s2ξ γ42 + o(n),
which proves that for k = 3, 4, Rn,k,k →
1 0
n −
j=1
ξj + u0
ztk
ξ j + u0
+ |u0 |O(1).
Case of 1/2 < d < 3/2. Recall that d = 1 + dξ and E u0 = O(1). Consider first Rn,k,j , k, j = 1, 2. Then by (A.19), (A.20) and (A.26), for k, j = 1, 2, with m1 = n1/2+d and m2 = n3/2+d ,
∫
t −
tn,4;ξ cos(ω/2) + tn,3;ξ sin(ω/2) − cos(ω(n + 1/2))tn,1;ξ
2
Rn,k,j = (mk mj )−1 E
(A.34)
tn,4;u
which completes the proof of (A.21) with Ψ as in (A.13), corresponding to |d| < 1/2.
= s2ξ
ztk
t =j
cos(ωt ) =
(A.33)
j=1
1
n −
t =1
sin(ωt ) =
=
Therefore, for k = 3, 4,
−2−dξ
if k = 4,
tn,3;u
+ |Dn (x − ω)|)fξ (x)dx ≤ C (log n + n2dξ ∨0 ).
|Rn,2,k | ≤ n−2−dξ
= s2ξ v,
Since cos(ω(j − 1/2)) = cos(ωj) cos(ω/2) + sin(ωj) sin(ω/2) and sin(ω(j − 1/2)) = sin(ωj) cos(ω/2) − cos(ωj) sin(ω/2), then
−π
∫
4d
t =j
1
t =j
max E Sj,1;ξ tn,k;ξ π
−
if k = 3;
For k = 3, 4, $z_{t3} = \sin(\omega t)$ and $z_{t4} = \cos(\omega t)$. Using (A.3)–(A.4), it follows that $\left|\sum_{t=1}^{n} z_{tk}\right| \le C$ and
By (A.30),
≤ max
(un − u0 )
ut
r k−1 Rdξ (r , 1)dr = s2ξ /2,
j =1
∫
which will be used below. Next we obtain $R_{n,k,j}$, k, j = 3, 4. The definition of the I(d) model implies $u_t = \sum_{j=1}^{t} \xi_j + u_0$. Then,
j =1
k−1
1
∫
→ s2ξ
Sj,1;ξ + ntn,1:ξ .
1≤j≤n
t =1
−π
|Rn,1,k | ≤ Cn−1−d
E tn,k;u tn,1;ξ
=
π fξ (ω) 4 sin2 (ω/2)
,
n2(1−d) π fξ (ω) 4 sin2 (ω/2)
if (1/2 < d < 1)
(1 + o(1)) + s2ξ γk2 ,
if (1 ≤ d < 3/2),
which proves (A.22) and (A.23). Next, for 1/2 < d < 1,
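$$|R_{n,3,4}| = n^{-1}\left|E\, t_{n,3;u} t_{n,4;u}\right| \le C n^{-1}\left(\left|E\, t_{n,4;\xi}^2 - E\, t_{n,3;\xi}^2\right| + \left|E\, t_{n,3;\xi} t_{n,4;\xi}\right| + \left|E\, t_{n,1;\xi} t_{n,3;\xi}\right| + \left|E\, t_{n,1;\xi} t_{n,4;\xi}\right|\right) \le C n^{-1}\left(o(n) + O(\log n) + o(n)\right) = o(1).$$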
E tn2,3;u =
+ o(n)
4 sin (ω/2) 2
nπ fξ (ω) + cos (ω(n + 1/2))sξ n
2 1+2dξ
2
=
E tn2,4;ξ cos2 (ω/2) + E tn2,3;ξ sin2 (ω/2) + cos2 (ω(n + 1/2))E tn2,1;ξ
4 sin2 (ω/2)
=n
π fξ (ω) 4 sin2 (ω/2)
+ o(n)
+ n2d−1 s2ξ γ32 + o(n) Box III.
Finally, we show that Rn,k,j = o(1) for k = 1, 2, j = 3, 4, when 1/2 < d < 1. By (A.36)–(A.37),
E tn,k;u tn,j;u ≤ C E tn,k;u tn,1,ξ + E tn,k;u tn,3;ξ + E tn,k;u tn,4;ξ + E u0 tn,k;u .
which proves (A.38) and completes the proof of Lemma A.1. Proof of Theorem 2.2. 1. Proof of (i)–(ii). By (A.6),
$$\tau_{\beta_1} = 4Z_{n,1;u} - 6Z_{n,2;u} + O(n^{-1\wedge(1+d)})R_{n,1} = 4Z_{n,1;u} - 6Z_{n,2;u} + o_p(1),$$
$$\tau_{\beta_2} = -6Z_{n,1;u} + 12Z_{n,2;u} + O(n^{-1\wedge(1+d)})R_{n,2} = -6Z_{n,1;u} + 12Z_{n,2;u} + o_p(1),$$
By (A.34),
E tn,k;u tn,1,ξ ≤ Cmk nd−1/2 . Observe that for j = 1, . . . , 4, |E(u20 tn,l;u )| ≤ (E(u20 )E(tn2,l;u ))1/2 ≤ Cml , since E(u20 ) < ∞ and E(tn2,l;u ) ≤ Cm2l , as it was shown above. Using (A.19)–(A.20) and (A.33), we obtain for j = 3, 4, n
− E tn,k;u tn,j;ξ ≤ C t k−1 E St ,k;ξ tn,j;ξ
where E|R2n,j | ≤ C , j = 1, 2. Therefore the same argument used in the proof of Theorem 2.1 implies statements (i) and (ii) of Theorem 2.2 about β1 and β2 . By (A.7), for k = 3, 4 and −1/2 < d < 3/2,
0∧(−1+d) Ak τ (θ1k Zn,1;u βk = n
+ θ2k Zn,2;u ) + 2Zn,k;u + n−1 Rn,j ,
t =1
+ O(nk )E |u0 St ,j;ξ | n − k−1 2dξ ∨0 k+1/2 ≤C n (log n + n )+n t =1
= O(n ( k
n2dξ ∨0
+ n1/2 )),
since m3 = m4 = n1/2 . Since for j = 3, 4, m1 mj = n1+d , m2 mj = n2+d , then
|Rn,k,j | ≤ C (mk mj )−1 (mk nd−1/2 + nk+2dξ ∨0 + nk+1/2 + mk ) = o(1), which completes proof of (A.21). To prove (A.24) and (A.25) in the case of d ≥ 1, recall that m3 = m4 = nd−1/2 , m1 = n1/2+d , and m2 = n3/2+d . Then, by (A.36)–(A.37), Zn,3;u = Zn,4;u =
1
Zn,4;ξ
2nd−1
cos(ω/2) sin(ω/2)
1 2nd−1
Zn,4;ξ − Zn,3;ξ
+ Zn,3;ξ
cos(ω/2)
sin(ω/2)
+ γ3 Zn,1;ξ + |u0 |n−d+1/2 ; + γ4 Zn,1;ξ + |u0 |n−d+1/2 .
E Zn,1;u Zn,1;ξ → sξ /2, 2
E Zn,2;u Zn,1;ξ → sξ v.
(i) If −1/2 < d < 1, then for any real numbers aj , j = 1, . . . , 4, tn :=
4 −
d
ak Zn,k;u − →
k=1
4 −
ak Vk =: Z ,
j = 1, 2, k = 3, 4,
(A.38)
E Zn,j;u Zn,k;ξ
We shall show that tn can be written as
n − t j−1 E St ,j;u tn,k;ξ + nj E u0 tn,k;ξ t =1
n − 2 2 1/2 1 j (log n + n2dξ ∨0 ) + m− j n E u0 E Zn,k;ξ t =1
→ 0,
(A.41)
Proof. (i) For −1/2 ≤ d < 1, by Lemma A.1, the matrix Ψ is the limit covariance matrix of the vector t. Therefore $\sigma_n^2 := \mathrm{var}(t_n) \to \sigma^2 := \mathrm{var}(Z)$ as n → ∞. It remains to show
$$\sigma_n^{-1} t_n \xrightarrow{d} N(0, 1).$$
≤ Cn−2d
(A.40)
where $(V_1, \ldots, V_4) \sim N(0, \Psi)$ with Ψ given by (A.13). (ii) If 1 ≤ d < 3/2, then
which together with estimates above implies (A.24) and (A.25). By (A.19), (A.20) and (A.33),
1 −1 ≤ m− j mk
n → ∞,
k=1
2
We show that E Zn,j;u Zn,k;ξ → 0,
Lemma A.2. Assume that the conditions of Theorem 2.2(iii) are satisfied.
d
(A.39)
where θ1k , θ2k are defined in (2.16), and E|R2n,k | ≤ C . Hence, for k = 3, 4, (i) follows from (A.39), applying Lemma A.1. 2. Proof of (iii). Assume that −1/2 < d < 1. Since τ β1 , τ β2 , τ β3 , and τ β4 are linear combinations of the variables Zn,j;u , j = 1, . . . , 4, convergence (2.17) and (2.20) follow from (A.40) of Lemma A.2 by the same argument as that used in the proof of Theorem 2.1. If 1 ≤ d < 3/2, then (2.18) and (2.19) follow from (A.41) and Lemma A.1.
(var(tn ))−1/2 tn − → N(0, 1).
In (A.34) it was shown that
d
tn =
n −
cn,j εj + op (1)
(A.42)
(A.43)
j=−∞
with some weights cn,j such that cn∗ := max |cn,j | = o(1). −∞<j≤n
(A.44)
First we show that (A.43) and (A.44) imply (A.42). Assume first that $E|\varepsilon_t|^k < \infty$ for all k ≥ 1. To prove (A.42), it suffices to show that, for any k ≥ 3, the kth order cumulant $\kappa_k$ of $\sigma_n^{-1} t_n$ satisfies
κk → 0,
n → ∞.
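Since $\{\varepsilon_t\}$ is a sequence of i.i.d. variables, then
$$\kappa_k = \kappa_k(\varepsilon_0)\,\sigma_n^{-k} \sum_{j=-\infty}^{n} c_{n,j}^{k},$$
where $\kappa_k(\varepsilon_0)$ is the kth order cumulant of the variable $\varepsilon_0$. Since $\sigma_n^2 \to \sigma^2 > 0$, then by (A.44), for k ≥ 3,
$$|\kappa_k| \le |\kappa_k(\varepsilon_0)|\,(c_n^*)^{k-2}\,\sigma_n^{-k} \sum_{j=-\infty}^{n} c_{n,j}^2 \le |\kappa_k(\varepsilon_0)|\,(c_n^*/\sigma_n)^{k-2} \to 0,$$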
to prove (A.42). Assume now that not all moments of εj are finite. Fix K > 0, set εj− = 1|εj |≤K εj , εj+ = 1|εj |>K εj , and write tn = Sn + Sn , +
−
Sn :=
n −
cn,j εj , −
n −
−
Sn :=
j=−∞
which implies
|cn(2,j) |
|cn(k,j) |
cn,j εj .
d
Zn,k;u =
cn2,j ∼ var(ε1+ )σ 2 → 0,
K →∞
with weights
+ op (1),
k = 1, . . . , 4
1 Zn,k;u = m− k
satisfying (A.45)
Zn,k;u = Zn,k;ξ = mk
n − t =1
n −
1/2 → 0,
2
| as |
n −
ztk ξt =
n → ∞,
(k)
cn,j εj + op (1),
j=−∞
n −
∗ ztk at − j ,
∗ ztk =
n −
zsk .
s=t
(k)
It remains to show that cn,j satisfies (A.45). For k = 1, 2, we have zj1 = 1 and zjk = j, and summation by parts gives
|cn,k (k)| ≤ mk
−1
n −1 − s =1
(k) cn,j εj
n s − − at − j at −j + |znk | |zsk | t =j∨1 t =j∨1
1 k dξ ∨0 ≤ Cm− ) = o(1). k n (log n + n
j=−∞
∗ For k = 3, 4, (A.3)–(A.4) imply that |ztk | ≤ C , and (A.45) follows using the same argument as in the case −1/2 < d < 1/2. (ii) The proof of (A.41) follows using the same argument as (A.40) in the case of 1/2 < d < 1. Differently from case (i), for case (ii) covering 1 ≤ d < 3/2, var(tn ) remains bounded but it may oscillate √ with n in view of Lemma A.1, which requires normalization by var(tn ).
where ztk at −j .
t =j∨1
First, we show that
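$$\max_{1 \le k \le n} \left|\sum_{t=1}^{k} a_t\right| \le C(\log n + n^{d_\xi \vee 0}).$$
Since $a_t$ is square summable, it can be written in the form $a_t = \int_{-\pi}^{\pi} e^{itx}\,\hat a(x)\,dx$, where $\hat a(x)$ is a square integrable function. In addition, the spectral density of $\xi_t$ can be written as $f_\xi(x) = 2\pi|\hat a(x)|^2$. Hence, for 1 ≤ k ≤ n, by (A.30) and (1.3),
$$\left|\sum_{t=1}^{k} a_t\right| = \left|\int_{-\pi}^{\pi} \sum_{t=1}^{k} e^{itx}\,\hat a(x)\,dx\right| \le \int_{-\pi}^{\pi} |D_k(x)|\,|\hat a(x)|\,dx \le C\int_{-\pi}^{\pi} \frac{k}{1 + k|x|}\,\sqrt{f_\xi(x)}\,dx \le C(\log k + k^{d_\xi \vee 0}) \le C(\log n + n^{d_\xi \vee 0}).$$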
n −
∗ ztk ξt + op (1) =
t =j∨1
1. Assume that −1/2 < d < 1/2. Then ut = ξt , where {ξt } is a linear process (2.14), and −1
log n +
∞ −
t =1
(k)
(k)
n −
≤ Cn
1 cn,j = m− k
max |cn,j | = o(1).
(k)
| at − j |
t =j∨1:|t −j|>log n
where
−∞<j≤n
1 cn,j = m− k
t =j∨1:|t −j|≤log n
−1/2
n −
|at −j | + n−1/2
to complete proof of (A.45). 2. Assume that 1/2 < d < 1. By (A.35), we can write
j=−∞
(k) cn,j
n −
s>log n
which in view of a standard argument implies (A.42). To prove (A.43) and (A.44) it suffices to show that Zn,k;u , k = 1, . . . , 4 can be written as (k) cn,j εj
n n − − ztk at −j ≤ n−1/2 |at −j | t =j∨1 t =j∨1
≤ n−1/2
j=−∞
n −
=n
−1/2
+
As we showed above, for any fixed K > 0, Sn − → ZK , where ZK ∼ N(0, σ 2 var(εj− )), and var(εj− ) → var(εj ) = 1 as K → ∞. On the other hand, as n → ∞, n −
=n
n − tat −j ≤ Cn−1/2−dξ (log n + ndξ ∨0 ) = o(1). t =j∨1
−3/2−dξ
Finally, for k = 3, 4, bearing in mind that zt3 = sin(ωt ), zt4 = cos(ωt ),
j=−∞
−
var(Sn+ ) = var(ε1+ )
=n
n − at −j ≤ Cn−1/2−dξ (log n + ndξ ∨0 ) = o(1). t =j∨1
−1/2−dξ
n n−1 − k n − − − tat −j ≤ at − j + n at −j ≤ Cn(log n + ndξ ∨0 ), t =j∨1 k=1 t =j∨1 t =j∨1
cnk,j ,
j=−∞
−
|cn(1,j) |
Using summation by parts and (A.46), we obtain that
n
κk = κk (ε0 )
Therefore,
Proof of Theorem 3.1. (i) Asymptotic results (3.9)–(3.10) follow from Theorem 2.2 and Corollary 2.1 of Abadir et al. (2007). (ii) (a) Write
ut = gn,t + β1 + ut ,
where ht = gn,t if −1/2 < d0 < 1/2, ht = gn,t − gn,t −1 if 1/2 < d0 < 3/2. In Theorem 2.5 and Corollary 2.1 of Abadir et al. (2007), it was shown that (3.9)–(3.10) are valid if ut ∼ I(d0 ), d0 ∈ (−1/2, 3/2) and d0 ̸= 1/2, ut satisfies Assumption L, and gn,t is such that n −
(A.46)
(A.47)
t =1
|hn,t − hn,t +1 | = Op (n−1/2+dξ ),
(1/2 < d0 < 3/2).
(A.48)
To prove (3.9)–(3.10) in case (ii) (a), it suffices to note that ut can be written as (A.47) with $g_{n,t} = (\hat\beta_2 - \beta_2)t$ and, by Theorem 2.1, $\hat\beta_2 - \beta_2 = O_p(n^{-3/2+d_0})$, which implies the validity of (A.48). (ii) (b) Then, (A.47) holds with gn,t =
3 −
gn,t ;j ,
gn,t ;1 = (β2 − β2 )t ,
(A.49)
j =1
gn,t ;2 = (β3 − β3 ) sin(ωt ), −1/2
Write wh (λj ) := (2π n)
τn,2 = Op (1)m−1 n−2(2−d0 )
t =1 p
e
it λj
m − 2d (j/m)−γ λj ξ j =1
τn∗,2
= Op (1)n−2+4dξ = op (1), m − 2d d = Op (1)m−1 (| log(j/m)| + 1)(λj ξ n−2+d0 + λj ξ n−4+2d0 ) j =1
= Op (1)
mdξ n−1 m−1
ht .
m − (| log(j/m)| + 1)(j/m)dξ j =1
To prove the consistency d − → d0 , it suffices to check (see the proof of Theorem 2.5(i) in Abadir et al. (2007), p. 1371 and Lemma 4.1) that for any 0 < γ < 1 m−1
If 1 < d < 3/2, then d = 1 + dξ where 0 ≤ dξ < 1/2, and
gn,t ;3 = (β4 − β4 ) cos(ωt ).
∑n
m − 2d (j/m)−γ λj ξ |wh (λ)|2 = op (1).
= Op (1)mdξ n−1 = op (m−1/2 ), which implies (A.53) for j Theorem 3.1.
=
2 and completes proof of
(A.50)
j =1
References
To verify (A.50) for (A.49), it suffices to check that
τn,j := m−1
m − 2d (j/m)−γ λj ξ |whj (λ)|2 = op (1),
j = 1, 2, 3.
j =1
(A.51) Asymptotics (3.9)–(3.10) follow (see (4.24)–(4.25) of the proof of Theorem 2.5 (iii) in Abadir et al., 2007, p. 1373), if the trend gn,t satisfies (A.50) and m−1
m − d 2d (j/m)−γ | log(j/m) + 1|(λj ξ |wh (λ)| + λj ξ |wh (λ)|2 ) j =1
= op (m−1/2 ).
(A.52)
To prove (A.52), it suffices to show that
τn∗,j := m−1
m −
d
2d
| log(j/m) + 1|(λj ξ |whj (λ)| + λj ξ |whj (λ)|2 )
j =1
−1/2
= op ( m
),
j = 1, 2, 3.
(A.53)
For j = 1, (A.51) and (A.53) were shown in Abadir et al. (2007, Lemmas 4.1 and (4.25)). It remains to prove (A.51) and (A.53) for j = 2; for j = 3 the proof follows using the same argument. By Theorem 2.1, $\hat\beta_3 - \beta_3 = O_p(n^{-(1/2 \wedge (3/2 - d_0))})$, j = 1, 2. This implies that
|wh1 (λj )| ≤ Cn−1/2 | β3 − β3 | |Dn,3 (λj )| = Op (n−(1∧(2−d0 )) ) and observing that (A.29) and (A.30) yields
|Dn,3 (λj )| ≤ |Dn (λj + ω)| + |Dn (λj − ω)| ≤ Cn[(1 + n|λj + ω|)−1 + (1 + n|λj − ω|)−1 ] ≤ C because λj ≤ λm ≤ ω/2 as n → ∞. Hence, if −1/2 < d ≤ 1, then
τn,2 = Op (1)m−1 n−2
m − 2d (j/m)−γ λj ξ j =1
= Op (1)n
−2+2|dξ |
m −1
m −
(j/m)−γ
j =1
τn∗,2
= Op (1)n−2+2|dξ | = op (1), m − d 2d = Op (1)m−1 (| log(j/m)| + 1)(λj ξ n−1 + λj ξ n−2 ) j=1
= Op (n|dξ |−1 ) = op (m−1/2 ).
Abadir, K.M., Distaso, W., Giraitis, L., 2007. Nonstationarity-extended local Whittle estimation. Journal of Econometrics 141, 1353–1384.
Andrews, D.W.K., Sun, Y., 2004. Adaptive local polynomial Whittle estimation of long-range dependence. Econometrica 72, 569–614.
Cremers, H., Kadelka, D., 1986. On weak convergence of integral functionals of stochastic processes with applications to processes taking paths in $L_p^E$. Stochastic Processes and their Applications 21, 305–317.
Dahlhaus, R., 1995. Efficient location and regression estimation for long range dependent regression models. Annals of Statistics 23, 1029–1047.
Dalla, V., Giraitis, L., Hidalgo, J., 2006. Consistent estimation of the memory parameter for nonlinear time series. Journal of Time Series Analysis 27, 211–251.
Davydov, Y.A., 1970. The invariance principle for stationary processes. Theory of Probability and its Applications 15, 487–498.
Deo, R.S., Hurvich, C.M., 1998. Linear trend with fractionally integrated errors. Journal of Time Series Analysis 19, 379–397.
Giraitis, L., Surgailis, D., 1985. CLT and other limit theorems for functionals of Gaussian process. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 70, 191–212.
Henry, M., Robinson, P.M., 1996. Bandwidth choice in Gaussian semiparametric estimation of long range dependence. In: Robinson, P.M., Rosenblatt, M. (Eds.), Athens Conference on Applied Probability and Time Series, vol. 2. Springer-Verlag, New York.
Ibragimov, I.A., Linnik, Yu.V., 1977. Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, Groningen.
Ivanov, A.V., Leonenko, N.N., 2004. Asymptotic theory of nonlinear regression with long-range dependence. Mathematical Methods of Statistics 13, 153–178.
Ivanov, A.V., Leonenko, N.N., 2008. Semiparametric analysis of long-range dependence in nonlinear regression. Journal of Statistical Planning and Inference 138, 1733–1753.
Nandi, S., Kundu, D., 2003. Estimating the fundamental frequency of a periodic function. Statistical Methods and Applications 12, 341–360.
Robinson, P.M., 1995. Gaussian semiparametric estimation of long range dependence. Annals of Statistics 23, 1630–1661.
Robinson, P.M., 2005. Efficiency improvements in inference on stationary and nonstationary fractional time series. Annals of Statistics 33, 1800–1842.
Shimotsu, K., Phillips, P.C.B., 2005. Exact local Whittle estimation of fractional integration. Annals of Statistics 33, 1890–1933.
Surgailis, D., Viano, M.C., 2002. Long memory properties and covariance structure of the EGARCH model. ESAIM: Probability and Statistics 6, 311–329.
Taqqu, M.S., 1979. Convergence of integrated processes of arbitrary Hermite rank. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 50, 53–83.
Velasco, C., 1999. Gaussian semiparametric estimation of nonstationary time series. Journal of Time Series Analysis 20, 87–127.
Velasco, C., 2003. Nonparametric frequency domain analysis of nonstationary multivariate time series. Journal of Statistical Planning and Inference 116, 209–247.
Yajima, Y., 1988. On estimation of a regression model with long-memory stationary errors. Annals of Statistics 16, 791–807.
Yajima, Y., 1991. Asymptotic properties of the LSE in a regression model with long-memory stationary errors. Annals of Statistics 19, 158–177.
Journal of Econometrics 163 (2011) 200–214
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
A class of simple distribution-free rank-based unit root tests
Marc Hallin a,b,c,∗,1, Ramon van den Akker c, Bas J.M. Werker c,d
a ECARES, Université Libre de Bruxelles, Belgium
b ORFE, Princeton University, United States
c Econometrics group, CentER, Tilburg University, Netherlands
d Finance group, CentER, Tilburg University, Netherlands
Article info
Article history: Received 25 September 2009; Received in revised form 20 January 2011; Accepted 31 March 2011; Available online 15 April 2011
JEL classification: C12; C22
Keywords: Unit root; Dickey–Fuller test; Local asymptotic normality; Rank test

Abstract
We propose a class of distribution-free rank-based tests for the null hypothesis of a unit root. This class is indexed by the choice of a reference density g, which need not coincide with the unknown actual innovation density f. The validity of these tests, in terms of exact finite-sample size, is guaranteed, irrespective of the actual underlying density, by distribution-freeness. Those tests are locally and asymptotically optimal under a particular asymptotic scheme, for which we provide a complete analysis of asymptotic relative efficiencies. Rather than stressing asymptotic optimality, however, we emphasize finite-sample performances, which also depend, quite heavily, on initial values. It appears that our rank-based tests significantly outperform the traditional Dickey–Fuller tests, as well as the more recent procedures proposed by Elliott et al. (1996), Ng and Perron (2001), and Elliott and Müller (2006), for a broad range of initial values and for heavy-tailed innovation densities. Thus, they provide a useful complement to existing techniques. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

1.1. Autoregressive unit root models

The econometric and statistical literature dealing with near unit root asymptotics in time series models is overabundant. The presence or absence of unit roots in econometric models does indeed have crucial economic policy implications. Even a short review of the literature is impossible here, and we refer the reader to, for instance, Haldrup and Jansson (2006) for a recent survey. Unit root problems generally lead to non-standard asymptotics. The study of least-squares estimators in zero-mean unit root autoregressive processes started with White (1958), but gained attention more widely after the publication of Dickey and Fuller (1979); unit root testing problems were first studied in detail in Dickey and Fuller (1981). In this paper, we restrict ourselves to the simplest possible case of a univariate AR(1) unit root model with i.i.d. innovations.
Extensions to multivariate settings, cointegration, panel data, more elaborate trends involving covariates, and heteroskedastic innovations fall within the general ideas of the present paper but their technical implications are not pursued here. Examples of such extensions are those of Phillips (1987), Chan and Wei (1988), Phillips and Perron (1988), Perron (1988), West (1988), Johansen (1991), Phillips (1991), Levin et al. (2002), Im et al. (2003), and Elliott and Jansson (2003), to name but a few.

Within that very simple context, we are interested in the construction of “efficient” tests of the null hypothesis of a unit root. Whether or not theoretical asymptotic optimality results or simulations are considered, assessing the “efficiency” of such tests requires embedding the null hypothesis of a unit root into a broader model of AR(1) dependence. The literature (see, for instance, the monographs by Hamilton (1994) or Enders (2004)) traditionally considers two of them, under which the observation $(Y_1, \ldots, Y_n)$ is generated either from
– Model (a) (a very simple model of the ARMAX type²)
∗ Corresponding address: ECARES CP 114/04, Avenue F.D. Roosevelt 50, 1050 Brussels, Belgium. Tel.: +32 2 650 46 03; fax: +32 2 650 40 12.
E-mail addresses: [email protected] (M. Hallin), [email protected] (R. van den Akker), [email protected] (B.J.M. Werker).
1 Royal Academy of Belgium.
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.03.007
$$Y_t = \rho Y_{t-1} + \mu + \varepsilon_t, \tag{1}$$
2 Hamilton (1994) and Enders (2004) actually consider a slightly more general equation, of the form Yt = ρ Yt −1 + µ + γ t + εt ; see Remark 2.3.
M. Hallin et al. / Journal of Econometrics 163 (2011) 200–214
or from
– Model (b) (the so-called components model)
$$(Y_t - m) = \rho(Y_{t-1} - m) + \varepsilon_t. \tag{2}$$
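Recursions (1) and (2) are easy to compare by simulation. The sketch below is illustrative only: the function names, the Gaussian shocks, and the parameter values are assumptions made for the example, not part of the paper. It feeds the same shocks and the same initial value to both recursions and checks that, for ρ < 1 and µ = (1 − ρ)m, the two paths agree, while for ρ = 1 the parameter m drops out of (2).

```python
import random


def simulate_model_a(rho, mu, y0, shocks):
    """Recursion (1): Y_t = rho * Y_{t-1} + mu + eps_t."""
    path, y = [], y0
    for eps in shocks:
        y = rho * y + mu + eps
        path.append(y)
    return path


def simulate_model_b(rho, m, y0, shocks):
    """Recursion (2): (Y_t - m) = rho * (Y_{t-1} - m) + eps_t."""
    path, y = [], y0
    for eps in shocks:
        y = m + rho * (y - m) + eps
        path.append(y)
    return path


rng = random.Random(0)
shocks = [rng.gauss(0.0, 1.0) for _ in range(500)]  # illustrative Gaussian innovations
rho, m, y0 = 0.9, 3.0, 1.5

# Stationary case: the two parameterizations give the same path when mu = (1 - rho) * m.
path_a = simulate_model_a(rho, (1 - rho) * m, y0, shocks)
path_b = simulate_model_b(rho, m, y0, shocks)
max_diff = max(abs(a - b) for a, b in zip(path_a, path_b))

# Unit root case: Model (a) is a random walk with drift mu, Model (b) a driftless one.
rw_a = simulate_model_a(1.0, 0.2, y0, shocks)
rw_b = simulate_model_b(1.0, m, y0, shocks)

print(max_diff, rw_a[-1], rw_b[-1])
```

Any discrepancy between `path_a` and `path_b` is pure floating-point rounding, since the two recursions are algebraically identical under that choice of µ.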
In both cases, it is generally assumed that $\{\varepsilon_t, t \in \mathbb{N}\}$ is an i.i.d. innovation process, with mean zero and variance $\sigma_\varepsilon^2$ and a distribution function F admitting a density f. As for the initial value $Y_0$, it is often assumed to be equal to zero in Model (a), or to the stationary mean m in Model (b). It is safer, however, to leave the distribution $P_{Y_0}$ of $Y_0$ unspecified, provided that $Y_0$ and the $\varepsilon_t$'s are mutually independent, and that $P_{Y_0}$ does not depend on the parameters ρ, µ or m; $Y_0$ then is ancillary, and inference on ρ is naturally conducted conditionally on $Y_0$.

Intuitively, Model (a) describes an autoregressive scheme in which the random shocks are i.i.d. with constant mean µ, whereas in Model (b) the i.i.d. shocks have mean zero, while the observations have (constant) mean m. For ρ < 1, those two models, under two parameterizations, actually strictly coincide: indeed, (1) and (2), for µ = (1 − ρ)m, describe the same autoregressive data-generating process. As for ρ = 1, Model (a) takes the form
$$H_0: Y_t - Y_{t-1} = \mu + \varepsilon_t, \quad \mu \in \mathbb{R} \text{ unspecified} \tag{3}$$
yielding the (first-order as well as second-order nonstationary) random walk
$$Y_t = Y_0 + \mu t + u_t, \quad u_t := \sum_{s=1}^{t} \varepsilon_s \tag{4}$$
with conditional drift $E[Y_t | Y_0] = Y_0 + \mu t$ and conditional variance $\mathrm{Var}(Y_t | Y_0) = t\sigma_\varepsilon^2$. That null hypothesis $H_0$ strictly contains the null hypothesis
$$H_0^{(b)}: Y_t - Y_{t-1} = \varepsilon_t, \quad m \in \mathbb{R} \text{ unspecified} \tag{5}$$
(m under $H_0^{(b)}$ is not identified) induced, under ρ = 1, by Model (b), which characterizes the second-order nonstationary but first-order stationary random walk
$$Y_t = Y_0 + u_t, \quad u_t := \sum_{s=1}^{t} \varepsilon_s \tag{6}$$
with constant conditional mean $E[Y_t | Y_0] = Y_0$ and variance $\mathrm{Var}(Y_t | Y_0) = t\sigma_\varepsilon^2$.

From the point of view of local asymptotic experiments, however, Models (a) and (b) differ dramatically. While Model (a), as we shall see, defines local experiments that are nicely (albeit with non-standard $n^{3/2}$ consistency rates) LAN (locally asymptotically normal) at the null hypothesis $H_0$ of a unit root,³ Model (b) at $H_0^{(b)}$ yields a considerably more tricky asymptotic structure, of the LABF (locally asymptotically Brownian functional) type, for which no uniform optimality results exist; see Elliott et al. (1996), Rothenberg and Stock (1997), Thompson (2004), and Jansson (2008). We refer the reader to Gushchin (1996), Ploberger (2004, 2008), and Jansson and Moreira (2008) for recent developments on experiments of the LABF and the (more general) LAQ (locally asymptotically quadratic) type.

For any fixed n, thus, the differences between Models (a) and (b) are extremely tenuous: for ρ < 1, they strictly coincide, whereas, for ρ = 1, Model (a) is more general, since $H_0$ includes $H_0^{(b)}$ as a special case. It follows that the choice between (1) and (2) is not really a choice between two models, but a choice between two kinds of asymptotic scenarios: the debate is about (a)-asymptotics versus (b)-asymptotics rather than Model (a) versus Model (b). This is a debate that we do not enter into here. Asymptotics in this
3 With degenerate Fisher information at µ = 0, though.
paper are just a mathematical device, which is used to suggest “sensible” testing procedures for the finite-sample problem at hand. Rather than parametric or semiparametric efficiency, or ARE values, which presuppose a specific asymptotic scheme, the ultimate benchmarks for the procedures that we are describing here are their finite-sample performances under the alternative, where Models (a) and (b) strictly coincide, so no particular choice needs to be made.

1.2. Outline of the paper

The remainder of the paper accordingly is organized in two main parts: Section 2, which is devoted to asymptotics, and Section 3, dealing with finite-sample performances. Much attention has been given, in the recent literature, to (b)-asymptotics. The analysis that we are developing in Section 2 is based on (a)-asymptotics,⁴ which, apparently, have not been considered so far in this context, and suggest a class of very simple tests, for which moreover rank-based, and hence finite-sample distribution-free, versions exist. Being distribution-free, those tests are valid, for finite sample size n, irrespective of the innovation density f (no moment restrictions⁵), and irrespective of the model ((a) or (b)). Section 2.4 provides a full analysis of the limiting properties of those tests: asymptotic null distributions and, under (a)-asymptotics, local powers and asymptotic relative efficiencies (AREs).⁶

Section 3 is devoted to a numerical investigation of the finite-sample performance of the tests described in Section 2, an investigation that does not require any choice between Model (a) and Model (b), as the two models describe the same data-generating processes under the alternative. That finite-sample analysis brings into the picture an important additional feature of the problem: the influence of the initial observation $Y_0$. Müller and Elliott (2003) show that the deviation of $Y_0$ from the stationary mean has a dramatic influence on the finite-sample performance of all unit root tests.
In empirical applications it is generally impossible to tell whether that deviation is small or large. Elliott and Müller (2006) provide a discussion of this; in Section 3.2, we are following their suggestion of evaluating empirical performances as a function of $Y_0 - m$ by adopting their simulation design. The results show that our rank tests significantly outperform all their competitors (the traditional Dickey–Fuller procedures, as well as the tests by Elliott et al., 1996, Ng and Perron, 2001, and Elliott and Müller, 2006) whenever the deviation $Y_0 - m$ of the initial value $Y_0$ from the stationary mean is “large”, and whenever the innovation distribution is heavy tailed. Section 4 concludes, while proofs are gathered in an Appendix.

1.3. Rank tests

Before turning to asymptotics, let us provide some details about the rank-based tests that we are proposing. Our test statistics are based on the ranks $R_t$ of the increments $\Delta Y_t := Y_t - Y_{t-1}$. Let g be a given density (the so-called reference density), not necessarily the actual underlying one f. We assume throughout that g belongs to the class $\mathcal{F}$ of probability densities h that are absolutely continuous
4 We once more emphasize that asymptotics here are just an agnostic mathematical device, the consequences of which are to be evaluated (Section 3) on the basis of finite-sample performances.
5 In the absence of first-order moments, m and µ can be reinterpreted as medians rather than means.
6 Whereas validity holds irrespective of the underlying density, our derivation of local powers relies on some regularity, such as finite second-order moments.
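with a.e. continuous derivative $h'$ and finite Fisher information for location $I_h := \int (h'/h)^2\, dH \in (0, \infty)$, and for which
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left( \frac{h'}{h}\left( H^{-1}\left( \frac{i}{n+1} \right) \right) \right)^2 = I_h \tag{7}$$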
(as usual, F, G, H denote the distribution functions associated with f, g, h). This assumption concerns the reference density g: once again, we stress that, as far as validity is concerned, we do not make any assumptions on the actual density f, since our tests are strictly distribution-free. If, however, asymptotic optimality, under density f and (a)-asymptotics, is to be considered, then we also need to impose $f \in \mathcal{F}$. Motivated by the asymptotic analysis of Section 2, our test statistics take the form
$$T_g^{(n)} := \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \left( \frac{t}{n+1} - \frac{1}{2} \right) \varphi_g\left( \frac{R_t}{n+1} \right), \tag{8}$$
with $\varphi_g(u) := -g'(G^{-1}(u))/g(G^{-1}(u))$, u ∈ (0, 1). Under the null hypothesis $H_0$, and hence also under the null hypothesis $H_0^{(b)}$, the vector of ranks $(R_1, \ldots, R_n)$, and therefore the test statistics $T_g^{(n)}$, are distribution-free with respect to µ and f. In particular, this implies that exact critical values for $T_g^{(n)}$-based tests can be easily computed or simulated for finite n, despite the unspecified f and µ.

The form of the test statistic (8) actually follows from optimality considerations under (a)-asymptotics and µ ≠ 0. In Section 2, we derive its local power and compare it to the efficiency bound obtained from the LAN property (derived in Section 2.3). That local power does depend on both the reference density g and the actual underlying density f. We show that a correctly specified reference density g = f leads to a test that achieves the efficiency bound and thus is parametrically efficient at f. As a result, while our tests are valid irrespective of the reference and underlying densities, they are locally and asymptotically efficient in Model (a) (with µ ≠ 0) in the case of a correctly specified g.

This situation thus is tantamount to quasi-maximum or pseudo-maximum likelihood estimation, where choosing a (Gaussian) reference density leads to an estimator that (often) remains consistent even when the reference density is misspecified, while attaining the parametric efficiency bound if the actual underlying density is Gaussian. In general, the limiting variance of such estimators, however, depends on both the true and the (Gaussian) reference density. Our tests have a comparable property, with the important difference that we may use any density g as a reference density, while quasi-likelihood or pseudo-likelihood procedures are generally restricted to a Gaussian g (when using another reference density, the estimators, in general, do not remain consistent under misspecified innovation density f ≠ g). Moreover, for our tests, the reference density can even be pre-estimated in order to achieve (parametric) efficiency uniformly over a broad class of densities f, without any sacrifice at the level of validity (see Section 2.6).
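The statistic (8) and the simulation of exact null quantiles can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, ties among increments are assumed absent, and only two reference densities are coded, Gaussian ($\varphi_g(u) = \Phi^{-1}(u)$) and logistic ($\varphi_g(u) = 2u - 1$). Null quantiles are simulated by drawing uniformly random permutations of the ranks, which is what distribution-freeness under $H_0$ licenses; which tail is used for rejection is left to the user.

```python
import math
import random
from statistics import NormalDist


def ranks(values):
    """Ranks 1..n of the values (no ties assumed, as for continuous data)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def score_gaussian(u):
    """phi_g for a Gaussian reference density: Phi^{-1}(u)."""
    return NormalDist().inv_cdf(u)


def score_logistic(u):
    """phi_g for a logistic reference density: 2u - 1."""
    return 2.0 * u - 1.0


def rank_statistic(y, score):
    """Statistic (8) computed from observations y = [Y_0, Y_1, ..., Y_n]."""
    increments = [y[t] - y[t - 1] for t in range(1, len(y))]
    n = len(increments)
    r = ranks(increments)
    total = sum((t / (n + 1) - 0.5) * score(r[t - 1] / (n + 1)) for t in range(1, n + 1))
    return total / math.sqrt(n)


def simulated_null_quantile(n, score, prob, reps=2000, seed=0):
    """Quantile of (8) under H0, simulated from uniformly random rank permutations."""
    rng = random.Random(seed)
    weights = [t / (n + 1) - 0.5 for t in range(1, n + 1)]
    scores = [score(i / (n + 1)) for i in range(1, n + 1)]
    perm = list(range(n))
    draws = []
    for _ in range(reps):
        rng.shuffle(perm)
        draws.append(sum(w * scores[p] for w, p in zip(weights, perm)) / math.sqrt(n))
    draws.sort()
    return draws[int(prob * reps)]


# The ranks of the increments do not change when a drift mu is added to every increment.
rng = random.Random(1)
eps = [rng.gauss(0.0, 1.0) for _ in range(100)]
y_no_drift, y_with_drift = [0.0], [0.0]
for e in eps:
    y_no_drift.append(y_no_drift[-1] + e)
    y_with_drift.append(y_with_drift[-1] + 5.0 + e)
t_no_drift = rank_statistic(y_no_drift, score_gaussian)
t_with_drift = rank_statistic(y_with_drift, score_gaussian)

q05 = simulated_null_quantile(50, score_logistic, 0.05)
print(t_no_drift, t_with_drift, q05)
```

Because the simulated quantiles depend only on n and on the score function, they can be tabulated once and reused for any data set of the same length.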
Now, if (b)-asymptotics are to be preferred, the tests based on $T_g^{(n)}$, as already mentioned, remain valid; but their asymptotic optimality properties are lost. However, their fixed-alternative performances are unchanged; see Section 3.

Distribution-freeness is another attractive property of our tests. The need for exact and distribution-free inference in econometrics has often been emphasized; see, for instance, Dufour (1981, 1997) or Coudin and Dufour (2009). Despite that recognized need, distribution-free procedures remain extremely rare in the context of time series econometrics. Campbell and Dufour (1995), Campbell and Dufour (1997), and Luger (2003) consider testing orthogonality restrictions using sign-based and rank-based tests instead of regression-based approaches. These methods are valid under zero-median or symmetry assumptions and, using extensive simulation, are shown to beat regression-based tests. Hasan and Koenker (1997) extend these results using regression rank scores in order to deal with the nuisance parameter problem. Their focus of interest again is the zero-mean unit root model. Hasan (2001) further allows for infinite variances; no formal optimality analysis is given. Thompson (2004b) reconsiders these tests in order to improve their power, especially under fat-tailed error distributions. Finally, we mention Breitung and Gouriéroux (1997) who consider the hypothesis that some transformation of the process exhibits a unit root. They propose a test based on the ranks of the observed time series (not those of residuals).

2. Asymptotic theory

2.1. Rank tests: exact versus approximate scores

It turns out that deriving results on the asymptotic size and (under (a)-asymptotics) local power of our test is easier when the test statistic (8) is slightly adjusted, replacing $\varphi_g$ by
$$\tilde\varphi_g(u) := E_G\left[ \varphi_g(G(\varepsilon_t)) \mid R_t = \lfloor u(n+1) \rfloor \right], \quad u \in (0, 1). \tag{9}$$
Note that $\tilde\varphi_g$, in contrast to $\varphi_g$, depends on the number n of observations. Clearly, the statistic based on $\varphi_g$ is simpler to compute, although the function $\tilde\varphi_g$ is easily simulated using distribution-freeness of the ranks. Whereas (8), in the literature on rank-based inference, is known as the approximate score version of $T_g^{(n)}$, using $\tilde\varphi_g$ in $T_g^{(n)}$ yields the so-called exact score version. This exact score version is more convenient for proofs as its expectation is identically zero irrespective of the true underlying density f:
$$E\,\tilde\varphi_g(R_t/(n+1)) = E_G\,\varphi_g(G(\varepsilon_t)) = 0.$$
Incidentally, note that the average of the weighting constants $(t/(n+1) - 1/2)$ in (8) equals zero as well. When n is large and conditionally on the rank of $\varepsilon_t$ being $R_t = i$, $G(\varepsilon_t)$ is approximately equal to $i/(n+1)$. This intuitively explains why the $\varphi_g$-based and $\tilde\varphi_g$-based versions of $T_g^{(n)}$ behave similarly, which we formalize in the following result.

Lemma 2.1. If the reference density g belongs to $\mathcal{F}$, we have, as n → ∞, under the null hypothesis $H_0$ of a unit root,

\[ T_g^{(n)} = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \left( \frac{t}{n+1} - \frac{1}{2} \right) \tilde\varphi_g\!\left( \frac{R_t}{n+1} \right) + o_P(1). \tag{10} \]
Proof. This is a well-known result on the asymptotic equivalence of the approximate and exact score versions of (linear) rank statistics, which is proved in various places; see, for instance, Theorem 13.5 in Van der Vaart (2000).

Remark 2.1. A consequence of the local asymptotic normality result proved in Proposition 2.1 is mutual contiguity of the probability measures at the unit root (ρ = 1) and those near the unit root ($\rho_n = 1 - O(n^{-3/2})$). The asymptotic equivalence (10), therefore, is preserved under contiguous sequences. Consequently, in expressions like (10), we do not have to worry whether the $o_P$'s are taken at the unit root or near the unit root. This consequence of contiguity will be used throughout the paper without further mention.

Condition (7) on $\varphi_g$ is satisfied for all standard reference densities g: Gaussian, logistic, double exponential, Student (including Cauchy), etc. Under this condition, the asymptotic equivalence in (10) implies that all results concerning asymptotic size, power (under contiguous alternatives), and efficiency carry over from one statistic to the other: whether exact or approximate scores are considered has no impact on asymptotic results.
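The relation between exact and approximate scores can be illustrated numerically. The sketch below is not taken from the paper: it assumes that, under the reference distribution G, $G(\varepsilon_t)$ conditional on $R_t = i$ is distributed as the i-th order statistic of n independent uniform variables, and it estimates the exact score $E[\varphi_g(U_{(i)})]$ by Monte Carlo. The function names (`approximate_scores`, `exact_scores`, `phi_vdw`, `phi_wilcoxon`) are hypothetical and use only the Python standard library.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def phi_vdw(u):
    """van der Waerden score function: standard normal quantile."""
    return _STD_NORMAL.inv_cdf(u)


def phi_wilcoxon(u):
    """Wilcoxon score function for the unit-variance logistic density."""
    return math.pi / math.sqrt(3.0) * (2.0 * u - 1.0)


def approximate_scores(phi, n):
    """Approximate scores phi(i / (n + 1)), i = 1, ..., n."""
    return [phi(i / (n + 1)) for i in range(1, n + 1)]


def exact_scores(phi, n, reps=20000, seed=0):
    """Monte Carlo estimate of the exact scores E[phi(U_(i))].

    U_(1) <= ... <= U_(n) are the order statistics of n i.i.d. uniforms;
    this representation is an assumption of the sketch (see lead-in).
    """
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(reps):
        # Guard against an exact 0.0 draw, for which inv_cdf is undefined.
        u = sorted(max(rng.random(), 1e-15) for _ in range(n))
        for i, ui in enumerate(u):
            totals[i] += phi(ui)
    return [s / reps for s in totals]
```

For the Wilcoxon scores the two versions coincide, since the score function is linear and $E[U_{(i)}] = i/(n+1)$; for the van der Waerden scores the exact scores are slightly more spread out than the approximate ones in small samples, a difference that vanishes as n grows, in line with Lemma 2.1.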
M. Hallin et al. / Journal of Econometrics 163 (2011) 200–214
2.2. Rank tests: asymptotic size

In view of distribution-freeness, one easily constructs, via simulations, tests based on $T_g^{(n)}$ with exact finite-sample sizes, irrespective of µ and f. Asymptotic critical values can be obtained from a normal distribution with variance $I_g/12$, as shown by the following result (see the Appendix for a proof).

Theorem 2.1. Let $(\varepsilon_1, \ldots, \varepsilon_n)$ be i.i.d. from a continuous distribution with density f and denote by $R_t$ the rank of $\Delta Y_t$ among $\Delta Y_1, \ldots, \Delta Y_n$. Let the reference density g belong to $\mathcal{F}$. Then, as n → ∞ and under $H_0$,

\[ \sqrt{12/I_g}\; T_g^{(n)} \Rightarrow N(0, 1). \tag{11} \]

Note that $\sqrt{12/I_g}\, T_g^{(n)}$ is scale-free. If σ is a scale parameter associated with g (not necessarily a standard error, though), writing $g_\sigma$ for g and $g_1$ for the corresponding standardized density such that $g_\sigma(x) = \sigma^{-1} g_1(x/\sigma)$, we have indeed $\sqrt{12/I_{g_\sigma}}\, T_{g_\sigma}^{(n)} = \sqrt{12/I_{g_1}}\, T_{g_1}^{(n)}$. We insist, once again, that no assumptions are made on f which, in particular, does not need to have finite moments or belong to $\mathcal{F}$. Moreover, Theorem 2.1 is as valid for Model (a) as for Model (b) as a consequence of its distribution-freeness also with respect to µ. For instance, Theorem 2.1 still applies under heavy-tailed innovations such as Cauchy or Lévy ones, while the Dickey–Fuller statistic may break down. This fact will be confirmed in Section 3 by finite-sample simulations. Unlike their size, however, the power of our tests depends both on the chosen reference density g and the actual underlying density f (actually, on their standardized versions, $g_1$ and $f_1$); for $f \in \mathcal{F}$, explicit values are provided in Theorem 2.2.

2.3. The limit experiments and efficient inference

As mentioned in the introduction, the limiting experiments, under (a)-asymptotics, crucially depend on the value of µ, leading to (a)-asymptotics for µ ≠ 0 and to (b)-asymptotics for µ = 0. In the latter case, the limit experiment (for the model with single parameter ρ) is locally asymptotically Brownian functional (LABF), with rate of convergence n, as shown by Jeganathan (1995), and departures of the order of $n^{-3/2}$ from the unit root hypothesis cannot be detected. This LABF result is exploited in Jansson (2008) to derive power envelopes for unit root tests. As shown in the next result, the situation is quite different, and much simpler, under (a)-asymptotics, at the even more favorable rate $n^{-3/2}$.

Proposition 2.1. Consider Model (a) with innovation density $f \in \mathcal{F}$, and denote by $P^{(n)}_{(\mu,\rho);f}$ the joint distribution of $(Y_1, \ldots, Y_n)$ under (1).

(i) The family $\{P^{(n)}_{(\mu,\rho);f} \mid \mu \in \mathbb{R}, \rho \in [-1, 1]\}$ is locally asymptotically normal (LAN) at any (µ, ρ = 1), for local alternatives of the form $(\mu_n = \mu + h_1 n^{-1/2}, \rho_n = 1 + h_2 n^{-3/2})$, with central sequence

\[ \begin{bmatrix} \Delta_\mu^{(n)} \\ \Delta_\rho^{(n)} \end{bmatrix} := \begin{bmatrix} n^{-1/2} \sum_{t=1}^{n} \dfrac{-f'}{f}(\Delta Y_t) \\[2ex] \mu n^{-1/2} \sum_{t=1}^{n} \dfrac{t}{n+1}\, \dfrac{-f'}{f}(\Delta Y_t) \end{bmatrix} \tag{12} \]

and Fisher information

\[ I_f \begin{bmatrix} 1 & \mu/2 \\ \mu/2 & \mu^2/3 \end{bmatrix}. \tag{13} \]

More precisely, $\Delta Y_t = \mu + \varepsilon_t$ under $P^{(n)}_{(\mu,1);f}$, and, as n → ∞,

\[ \log \frac{dP^{(n)}_{(\mu_n,\rho_n);f}}{dP^{(n)}_{(\mu,1);f}} (Y_0, Y_1, \ldots, Y_n) = h_1 n^{-1/2} \sum_{t=1}^{n} \frac{-f'}{f}(\varepsilon_t) + h_2 \mu n^{-1/2} \sum_{t=1}^{n} \frac{t}{n+1}\, \frac{-f'}{f}(\varepsilon_t) - \frac{I_f}{2} \left( h_1^2 + \mu h_1 h_2 + \frac{\mu^2}{3} h_2^2 \right) + o_P(1) \]

and

\[ \begin{bmatrix} \Delta_\mu^{(n)} \\ \Delta_\rho^{(n)} \end{bmatrix} \Rightarrow N\!\left( 0,\; I_f \begin{bmatrix} 1 & \mu/2 \\ \mu/2 & \mu^2/3 \end{bmatrix} \right). \]

For µ = 0, however, this LAN result, with information matrix $I_f \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$, is a degenerate one.

(ii) If f has finite variance, the subfamily $\{P^{(n)}_{(\mu,\rho);f} \mid \mu = 0, \rho \in [-1, 1]\}$ is locally asymptotically Brownian functional (LABF) for local alternatives of the form $\rho_n = 1 + h_2 n^{-1}$.

Proof. See the Appendix.

Remark 2.2. The LAN result of Proposition 2.1 does not require $h_2 \le 0$: all claims in this paper can easily be rephrased in the context of testing $H_0$: ρ = 1 against $H_1$: ρ > 1 and $H_0$: ρ = 1 against $H_1$: ρ ≠ 1.

Remark 2.3. If one considers the model $Y_t = \rho Y_{t-1} + \mu + \gamma t + \varepsilon_t$, i.e. a model including a linear time trend, the LAN result still holds true when γ ≠ 0, but with consistency rate (for ρ) $n^{5/2}$ instead of $n^{3/2}$.

Remark 2.4. The fact that the Fisher information for ρ in (13) vanishes for µ → 0 confirms that ρ indeed cannot be estimated at rate $n^{3/2}$ whenever µ = 0.

Remark 2.5. An initial value $Y_0$ with a distribution depending on ρ, such as $Y_0 \sim N(\mu/(1-\rho), \sigma_f^2/(1-\rho^2))$, can degrade the LAN result. In such situations, our LAN result still holds conditionally on $Y_0$. In this way one ignores the statistical information possibly contained in $Y_0$, and restricts attention to the differenced observations $\Delta Y_1, \ldots, \Delta Y_n$.

Local asymptotic normality, via the Hájek and Le Cam asymptotic theory of statistical experiments (see, e.g., Chapters 7 and 9 of Van der Vaart, 2000), completely characterizes the local and asymptotic features of the statistical experiment under study. Not only does it induce the asymptotic optimality bounds for statistical inference, but it also indicates how central-sequence-based procedures achieve those bounds. Accordingly, it follows from Proposition 2.1 that a locally and asymptotically optimal test for $H_0$: ρ = 1, under (a)-asymptotics, if the innovation density f is known, and considering µ ≠ 0 a nuisance parameter, should be based on (any monotone transformation of)

\[ I_f^{-1} \left( \Delta_\rho^{(n)} - \frac{\mu}{2} \Delta_\mu^{(n)} \right) = \frac{\mu}{I_f}\, n^{-1/2} \sum_{t=1}^{n} \left( \frac{t}{n+1} - \frac{1}{2} \right) \frac{-f'}{f}(\Delta Y_t) \tag{14} \]

(see, for instance, Section 11.9 of Le Cam, 1986). Clearly, the magnitude of the constant factor $\mu/I_f$ can be ignored in the construction of
that test. Since the sign of µ is unspecified, both one-sided and two-sided versions are meaningful. In the remainder of this section, we focus on the empirically more relevant case of µ > 0; asymptotic theory then leads to rejection (as the alternative is ρ < 1) for small values of the test statistic. In Section 3, however, we evaluate finite-sample performance for µ = 0, and consider two-sided tests. Statistics of the form

\[ S_g^{(n)} := n^{-1/2} \sum_{t=1}^{n} \left( \frac{t}{n+1} - \frac{1}{2} \right) \frac{-g'}{g}(\Delta Y_t) \tag{15} \]

are thus interesting candidates as test statistics for our problem, and reach parametric efficiency if f = g. Unfortunately, $S_g^{(n)}$ is not distribution-free.

The situation is totally different if we turn to $T_g^{(n)}$. Under f = g, indeed, it follows from (15), (23) and Lemma A.1 that $T_g^{(n)} = S_g^{(n)} + o_P(1)$ under $H_0$ and f = g. If the actual density coincides with g, $T_g^{(n)}$ thus shares all the nice optimality features of $S_g^{(n)}$. The essential difference is that, being distribution-free, its finite-sample null distribution is the same under f ≠ g as under f = g: $T_g^{(n)}$ thus does not require f to be specified, and naturally qualifies as a solution for our testing problem, while achieving efficiency at the chosen reference density g.

2.4. Local powers

The asymptotic power of our rank-based test statistics $T_g^{(n)}$ against local (under (a)-asymptotics) unit root alternatives follows directly from Le Cam's third lemma, provided that f and g both satisfy the assumptions of Proposition 2.1.

Theorem 2.2. Consider the model (1) with innovation density $f \in \mathcal{F}$ and $Y_0 \sim L$. Let the reference density g also be in $\mathcal{F}$. Then, under $P^{(n)}_{(\mu,\rho_n);f}$, where $\rho_n = 1 + h n^{-3/2}$,

\[ T_g^{(n)} \Rightarrow N(h \mu I_{fg}/12,\; I_g/12) \quad \text{as } n \to \infty, \tag{16} \]

with

\[ I_{fg} := \int_{u=0}^{1} \varphi_g(u)\, \varphi_f(u)\, du. \tag{17} \]

Proof. See the Appendix.
Whenever µ ≠ 0, our test has power against alternatives that are at distance $n^{-3/2}$ from the unit root. This is, of course, much more precise than the usual $n^{-1/2}$ rate. It is more precise, too, than the $n^{-1}$ rate that can be attained when µ = 0; see Proposition 2.1. In that case, however, no test can have local power against alternatives at rate $n^{-3/2}$.

It is interesting to compare (still under (a)-asymptotics) the power of our test statistic to that of the classical Dickey–Fuller test. For this comparison we choose the asymptotically optimal Dickey–Fuller test for Model (a), that is, based on the least-squares estimate $\hat\rho_n^{DF}$ of ρ in (1). The asymptotic properties of this classical Dickey–Fuller statistic are well-known and we have the following corollary to Theorem 2.2.

Corollary 2.1. Let f and g belong to $\mathcal{F}$; assume that µ > 0 and that f moreover has finite variance $\sigma_f^2$. The asymptotic relative efficiency (ARE), for the unit root hypothesis $H_0$: ρ = 1, of the one-sided rank test based on $T_g^{(n)}$ with respect to the Dickey–Fuller test is, under density f,

\[ \mathrm{ARE}_f(T_g^{(n)} \mid DF) = \frac{|I_{fg}|^3 \sigma_f^3}{I_g^{3/2}}. \tag{18} \]

Proof. See the Appendix.

Table 1
Asymptotic relative efficiencies $\mathrm{ARE}_f(T_g^{(n)} \mid DF)$ of our rank-based test based on $T_g^{(n)}$ in (8) with respect to the Dickey–Fuller test, for various choices (Gaussian, logistic, double exponential) of the reference density g, and several values (Gaussian, logistic, double exponential, Cauchy, and $t_3$) of the actual density f.

Reference density g             Actual density f
                                Gaussian   Logistic   DExp     t3       Cauchy
Gaussian (van der Waerden)      1.00       1.07       1.44     2.10     ∞
Logistic (Wilcoxon)             0.93       1.15       1.84     2.62     ∞
Double exponential (Laplace)    0.51       0.75       2.83     2.06     ∞
Remark 2.6. The $\mathrm{ARE}_f$ in (18) is defined as the limit, as n → ∞, of the ratio $n_{DF}/n$, where $n_{DF}$ is the number of observations needed in the Dickey–Fuller test to achieve the same performance (in terms of power) as our rank-based test using n observations. Our test and the Dickey–Fuller test both have local power at rate $n^{3/2}$. This explains the exponent 3 in (18).

Remark 2.7. Despite the notation, $\mathrm{ARE}_f$ in (18) is a scale-free quantity. It is easy to see, indeed, that, writing $f_1$ for the standardized version of f (that is, $f(z) = \sigma_f^{-1} f_1(z/\sigma_f)$), $I_{fg}^3 \sigma_f^3 = I_{f_1 g}^3$. Similarly, if $g_1$ and $g_2$ are such that, for some c > 0, $g_2(z) = c^{-1} g_1(z/c)$, then $I_{fg_2}/I_{g_2}^{1/2} = I_{fg_1}/I_{g_1}^{1/2}$.

Table 1 provides, for various reference densities and various values of f, some numerical values of (18). Under infinite innovation variance, those values are infinite, since the Dickey–Fuller test is no longer valid.$^7$ Inspection of Table 1 reveals that, under finite innovation variance for f, very sizeable efficiency gains also are possible, even when using a Gaussian reference density g (van der Waerden tests).

2.5. Choosing a reference density g

Our test depends on a reference density g to be chosen by the investigator. This raises the obvious question of how to choose this reference density.

Recall that our rank-based statistic $T_g^{(n)}$ is homogeneous in the scale of the reference distribution: rescaling a given reference density g(·) to $g_c(\cdot) = c^{-1} g(\cdot/c)$, c > 0, has no impact on the test, and one does not have to worry about choosing an appropriate scale for g. Similarly, we have shown in Remark 2.7 that the asymptotic relative efficiency of our test with respect to the Dickey–Fuller test does not depend on the scale of the reference density g, nor on that of the actual density f. The form of the reference density g, if not its scale, however, does influence the local power of our test via the ratio $|I_{fg}|/I_g^{1/2}$ in (17).
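As a rough numerical check of (18), the finite entries of Table 1 for the Gaussian, logistic and double-exponential columns can be recomputed from the score functions. The sketch below is an illustration under stated assumptions, not the authors' code: it takes all three densities standardized to unit variance (so $\sigma_f = 1$), uses the closed-form score functions $\varphi(u) = (-f'/f)(F^{-1}(u))$ of those densities, and approximates the integrals in (17) by a midpoint rule. The names `SCORES`, `score_grid` and `are_table` are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# Score functions phi(u) = (-f'/f)(F^{-1}(u)) for unit-variance densities.
SCORES = {
    "Gaussian": lambda u: _STD_NORMAL.inv_cdf(u),
    "Logistic": lambda u: math.pi / math.sqrt(3.0) * (2.0 * u - 1.0),
    "DExp": lambda u: math.sqrt(2.0) * math.copysign(1.0, u - 0.5),
}


def score_grid(name, grid=100_000):
    """Evaluate a score function at the midpoints of a regular grid on (0, 1)."""
    phi = SCORES[name]
    return [phi((k + 0.5) / grid) for k in range(grid)]


def are_table(names=("Gaussian", "Logistic", "DExp"), grid=100_000):
    """Return {(reference g, actual f): ARE} following (18) with sigma_f = 1."""
    values = {name: score_grid(name, grid) for name in names}

    def information(a, b):
        # Midpoint-rule approximation of the integral of phi_a * phi_b over (0, 1).
        return sum(x * y for x, y in zip(values[a], values[b])) / grid

    return {
        (g, f): abs(information(f, g)) ** 3 / information(g, g) ** 1.5
        for g in names
        for f in names
    }
```

Under these assumptions, `are_table()[("Gaussian", "DExp")]` should be close to the value 1.44 reported in Table 1, and the diagonal entries equal $I_f^{3/2}$ for the corresponding unit-variance density.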
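An obvious first choice is a Gaussian reference density $g(x) \propto \exp(-x^2/2)$, leading to the so-called normal or van der Waerden scores. In this case,

\[ T_{vdW}^{(n)} = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \left( \frac{t}{n+1} - \frac{1}{2} \right) \Phi^{-1}\!\left( \frac{R_t}{n+1} \right), \]

where Φ denotes the standard normal distribution function (note that $I_g = 1$), and (18) reduces to

\[ \mathrm{ARE}_f(vdW \mid DF) = \left( \int_{u=0}^{1} \frac{-f'}{f}\!\left( F^{-1}(u) \right) \Phi^{-1}(u)\, du \right)^{3} \sigma_f^3. \tag{19} \]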
7 Recall that for symmetric α -stable innovation distributions the Dickey–Fuller test statistic has a limiting distribution of the Lévy type with critical values depending on the tail index α ; see Rachev et al. (1998), Ahn et al. (2003), and Callegari et al. (2003).
A celebrated result by Chernoff and Savage (1958) shows that the latter quantity is always larger than 1, except under a Gaussian f, where it takes the value 1. Consequently, a Gaussian reference density constitutes a safe choice, as it always leads to an improvement over the Dickey–Fuller test. The magnitude of the improvement is all the more sizeable in our situation, due to the faster rate of convergence $n^{3/2}$; see the first row in Table 1. For instance, true underlying Student $t_3$ distributed innovations lead to more than 100% efficiency gain, while fatter-than-$t_3$-tailed distributions lead to even larger (infinite in the case of infinite innovation variance) gains.

Two other popular choices for the reference density are the double-exponential distribution (Laplace or sign test scores), with density $g_L(x) = \exp(-\sqrt{2}|x|)/\sqrt{2}$ (for which $\sigma_{g_L}^2 = 1$ and $I_{g_L} = 2$), and the logistic distribution (Wilcoxon scores) $g_W(x) = \pi \exp(-\pi x/\sqrt{3}) / (\sqrt{3}\,(1 + \exp(-\pi x/\sqrt{3}))^2)$ (for which $\sigma_{g_W}^2 = 1$ and $I_{g_W} = \pi^2/9$). They lead to the Laplace and Wilcoxon test statistics

\[ T_L^{(n)} = \sqrt{\frac{2}{n}} \sum_{t=1}^{n} \left( \frac{t}{n+1} - \frac{1}{2} \right) \mathrm{sgn}\!\left( \frac{R_t}{n+1} - \frac{1}{2} \right), \]

\[ T_W^{(n)} = \frac{\pi}{\sqrt{3n}} \sum_{t=1}^{n} \left( \frac{t}{n+1} - \frac{1}{2} \right) \frac{1 - \frac{n+1-R_t}{R_t}}{1 + \frac{n+1-R_t}{R_t}}, \]
respectively. It is worth emphasizing, again, that we nowhere impose that the innovations need to have finite variances, nor even finite first-order moments: our tests remain valid under completely unspecified innovation density f and completely unspecified shift µ (which may be zero). As explained before, the Dickey–Fuller test is no longer valid in the semiparametric model with unspecified f.

Remark 2.8. In view of Theorem 2.2, for given f, maximum power is achieved when the reference density g matches the actual one f (up to a possible scale transformation). In that case, our rank-based statistic asymptotically coincides with the parametrically optimal (under (a)-asymptotics) test statistic (14), and the $T_g^{(n)}$-based test achieves parametric efficiency in Model (a) with innovation density f. This implies that Model (a) (with innovation density f) actually is adaptive: the "cost" of not knowing the innovation density in addition to not knowing µ is asymptotically nil when performing inference as regards ρ. Model (b) does not exhibit such attractive limiting local structure.

2.6. Pre-estimating the reference density g

As the power of the test depends on the chosen reference density, and is maximal if the reference density coincides with the actual density f up to a scale transformation, one may want to pre-estimate the reference density to use. An important additional advantage of our test is that this can be done without any changes in the asymptotic analysis. To be more precise, consider an estimated reference density $\hat g_n$ with values in $\mathcal{F}$ that depends on the order statistics of the increments $\Delta Y_t$, as is, for example, the case for traditional kernel density estimators. Recall that the order statistics are stochastically independent of the ranks $R_t$ of the innovations. Therefore, we can easily study the behavior of $T_{\hat g_n}^{(n)}$ conditionally on the order statistics, that is, as if $\hat g_n \in \mathcal{F}$ were a given reference density. In particular, if (conditionally on the order statistics) exact α-critical points are computed for the estimated-score version of (8), the conditional size, and hence also the unconditional one, is exactly α too. The resulting tests moreover have Neyman α-structure with respect to the order statistics, and hence are similar and unbiased. An analogous reasoning can be applied to show that the power properties of our test with estimated reference density are as if the reference density were correctly specified. In order to make sure that $I_{\hat g_n}$ converges to $I_g$, a construction as in Proposition 7.8.1 in Bickel et al. (1993) can be considered.

Summing up, the tests based on $T_{\hat g_n}^{(n)}$ remain conditionally distribution-free; they are parametrically efficient (under (a)-asymptotics), uniformly over the family of all µ ≠ 0 and all f such that, under f, $T_{\hat g_n}^{(n)} - T_f^{(n)} = o_P(1)$, without losing finite-sample validity over that broader class of all µ and f.

3. Finite-sample performance

As mentioned in the introduction, the ultimate benchmark for any statistical procedure is its finite-sample performance. This is all the more true in the present context, where two distinct and plausible asymptotic schemes are coexisting, roughly in the same statistical model. This section is totally agnostic in that respect, and does not make any choice between (a)-asymptotics and (b)-asymptotics. Nevertheless, the description of the simulated data-generating process requires a parameterization, and, without any loss of generality, the (ρ, m) parameterization (2) is adopted throughout. Section 3.1 deals with the finite-sample behavior of our tests under $H_0$ and, hence, a fortiori, also under $H_0^{(b)}$. Section 3.2 discusses their behavior under alternatives (where Models (a) and (b) coincide).

3.1. Finite-sample sizes

It follows from Theorem 2.1 that the rank-based test statistic $T_g^{(n)}$ is asymptotically $N(0, I_g/12)$ under the null hypothesis. This section studies the finite-sample null distribution of $T_g^{(n)}$. Recall once more that our rank-based test statistics are distribution-free under the null hypothesis. This means that the finite-sample distribution of $T_g^{(n)}$ only depends on the number n of observations and the choice of the reference density g. Such distributions can easily be tabulated.

To illustrate the convergence to a $N(0, I_g/12)$ distribution under the null hypothesis, Fig. 1 presents a scaled histogram of simulated values of $T_{vdW}^{(n)}$ along with its limiting Gaussian density for n = 25, 50, 100. From the figure, we conclude that the convergence to the limiting distribution is quite fast. This is common for rank-based statistics. Moreover, in view of distribution-freeness, this convergence is uniform over the family of possible underlying innovation densities f, irrespective of µ. Note that the limiting distribution seems to be overestimating tail probabilities, and hence produces conservative critical values. This is confirmed by Table 2, where simulated quantiles are presented for various sample sizes n and various reference densities g, along with (in the rows labeled "n = ∞") the asymptotic ones. As the distributions are symmetric with respect to the origin, only right-tail quantiles are presented. Although the convergence is fast, we thus recommend using simulated critical values rather than the asymptotic ones.
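Simulated finite-sample critical values of the kind recommended in Section 3.1 can be obtained by exploiting distribution-freeness: under the null hypothesis the vector of ranks is uniformly distributed over all permutations of 1, ..., n, so it suffices to shuffle the ranks and recompute the statistic. The following sketch, which uses approximate scores and hypothetical function names (`scores`, `rank_statistic`, `simulated_quantile`), is one possible way to do this and is not the authors' Matlab code.

```python
import math
import random
from statistics import NormalDist


def scores(n, kind="vdW"):
    """Approximate scores phi_g(i / (n + 1)), i = 1, ..., n, for three reference densities."""
    u = [i / (n + 1) for i in range(1, n + 1)]
    if kind == "vdW":
        inv = NormalDist().inv_cdf
        return [inv(x) for x in u]
    if kind == "Wilcoxon":
        c = math.pi / math.sqrt(3.0)
        return [c * (2.0 * x - 1.0) for x in u]
    if kind == "Laplace":
        # sqrt(2) * sign(u - 1/2); the middle rank gets score 0 when n is odd.
        return [math.sqrt(2.0) * ((x > 0.5) - (x < 0.5)) for x in u]
    raise ValueError(f"unknown reference density: {kind}")


def rank_statistic(ranks, score_values):
    """n^{-1/2} * sum_t (t/(n+1) - 1/2) * score(R_t), with ranks given in time order."""
    n = len(ranks)
    total = sum(
        (t / (n + 1) - 0.5) * score_values[r - 1]
        for t, r in enumerate(ranks, start=1)
    )
    return total / math.sqrt(n)


def simulated_quantile(n, alpha, kind="vdW", reps=20000, seed=1):
    """Monte Carlo (1 - alpha)-quantile of the null distribution of the statistic."""
    rng = random.Random(seed)
    score_values = scores(n, kind)
    ranks = list(range(1, n + 1))
    draws = []
    for _ in range(reps):
        rng.shuffle(ranks)  # a uniformly random permutation of the ranks
        draws.append(rank_statistic(ranks, score_values))
    draws.sort()
    return draws[min(reps - 1, math.ceil((1 - alpha) * reps) - 1)]
```

For instance, `simulated_quantile(50, 0.05, "vdW")` should land near the corresponding entry of Table 2, up to Monte Carlo error; more replications tighten the approximation.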
3.2. Finite-sample powers

As mentioned in the introduction, the initial value $Y_0$ or, more precisely, its deviation $Y_0 - m$ from the stationary mean (a quantity
Table 2
Simulated (1 − α)-quantiles (based on 50,000 replications) for the van der Waerden, Wilcoxon, and Laplace rank-based test statistics, various values of n and α, under $H_0$ and, hence, a fortiori, also under $H_0^{(b)}$. The rows labeled "n = ∞" contain the critical values calculated from the limiting Gaussian distribution.

Reference density g             n         α = 0.5%   α = 2.5%   α = 5%
Gaussian (van der Waerden)      n = 25    0.62       0.49       0.41
                                n = 50    0.68       0.52       0.44
                                n = 100   0.70       0.54       0.45
                                n = 250   0.73       0.55       0.46
                                n = ∞     0.74       0.57       0.47
Logistic (Wilcoxon)             n = 25    0.71       0.56       0.47
                                n = 50    0.75       0.57       0.48
                                n = 100   0.76       0.58       0.49
                                n = 250   0.77       0.59       0.49
                                n = ∞     0.78       0.59       0.50
Double exponential (Laplace)    n = 25    0.99       0.76       0.65
                                n = 50    1.02       0.78       0.66
                                n = 100   1.04       0.79       0.67
                                n = 250   1.04       0.80       0.67
                                n = ∞     1.05       0.80       0.67
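The "n = ∞" rows of Table 2 follow directly from the limiting $N(0, I_g/12)$ distribution of Theorem 2.1. As a small illustration (the helper name `asymptotic_critical_value` and the dictionary `FISHER_INFORMATION` are hypothetical, and the information values are those quoted in the text for the unit-variance reference densities), they can be recomputed as follows.

```python
import math
from statistics import NormalDist

# Fisher information I_g of the reference densities, as stated in the text.
FISHER_INFORMATION = {
    "van der Waerden": 1.0,
    "Wilcoxon": math.pi ** 2 / 9.0,
    "Laplace": 2.0,
}


def asymptotic_critical_value(alpha, information):
    """(1 - alpha)-quantile of the N(0, I_g / 12) limiting null distribution."""
    return NormalDist().inv_cdf(1.0 - alpha) * math.sqrt(information / 12.0)
```

For example, `asymptotic_critical_value(0.05, FISHER_INFORMATION["van der Waerden"])` is approximately 0.47, the asymptotic 5% entry for the Gaussian reference density.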
Table 3
Rejection frequencies (25,000 replications), nominal level 5%, n = 50, underlying N(0, 1) density, starting values $Y_0 = m + a\sigma_\varepsilon/\sqrt{1-\rho^2}$, a = 0, 1, ..., 6.

Tests               a = 0    a = 1    a = 2    a = 3    a = 4    a = 5    a = 6

ρ = 1
Dickey–Fuller       0.081    0.081    0.081    0.081    0.081    0.081    0.081
ERS-PT              0.048    0.048    0.048    0.048    0.048    0.048    0.048
NP-MZα^GLS          0.058    0.058    0.058    0.058    0.058    0.058    0.058
NP-MSB^GLS          0.053    0.053    0.053    0.053    0.053    0.053    0.053
NP-MZt^GLS          0.052    0.052    0.052    0.052    0.052    0.052    0.052
EM-Q̂^µ(10, 1)       0.025    0.025    0.025    0.025    0.025    0.025    0.025
EM-Q̂^µ(10, 3.8)     0.026    0.026    0.026    0.026    0.026    0.026    0.026
van der Waerden     0.050    0.050    0.050    0.050    0.050    0.050    0.050
Laplace             0.050    0.050    0.050    0.050    0.050    0.050    0.050
Wilcoxon            0.050    0.050    0.050    0.050    0.050    0.050    0.050

ρ = 0.99
Dickey–Fuller       0.081    0.080    0.081    0.080    0.080    0.081    0.080
ERS-PT              0.060    0.054    0.038    0.020    0.009    0.003    0.001
NP-MZα^GLS          0.072    0.066    0.044    0.025    0.010    0.004    0.001
NP-MSB^GLS          0.065    0.060    0.042    0.024    0.011    0.004    0.002
NP-MZt^GLS          0.067    0.060    0.040    0.022    0.009    0.003    0.001
EM-Q̂^µ(10, 1)       0.031    0.029    0.021    0.011    0.006    0.002    0.001
EM-Q̂^µ(10, 3.8)     0.029    0.032    0.026    0.018    0.013    0.007    0.003
van der Waerden     0.047    0.047    0.048    0.049    0.052    0.055    0.058
Laplace             0.049    0.050    0.050    0.050    0.053    0.055    0.058
Wilcoxon            0.047    0.048    0.049    0.051    0.052    0.055    0.060

ρ = 0.975
Dickey–Fuller       0.084    0.085    0.084    0.082    0.079    0.075    0.073
ERS-PT              0.084    0.064    0.026    0.006    0.001    0.000    0.000
NP-MZα^GLS          0.100*   0.076    0.032    0.008    0.001    0.000    0.000
NP-MSB^GLS          0.088    0.069    0.031    0.008    0.002    0.000    0.000
NP-MZt^GLS          0.092    0.069    0.028    0.007    0.001    0.000    0.000
EM-Q̂^µ(10, 1)       0.044    0.035    0.015    0.004    0.001    0.000    0.000
EM-Q̂^µ(10, 3.8)     0.036    0.036    0.029    0.019    0.010    0.004    0.001
van der Waerden     0.037    0.039    0.047    0.059    0.079    0.106*   0.139*
Laplace             0.042    0.044    0.049    0.059    0.073    0.091    0.111
Wilcoxon            0.038    0.041    0.048    0.061    0.080    0.105*   0.139*

ρ = 0.95
Dickey–Fuller       0.094    0.094    0.091    0.091    0.088    0.086    0.083
ERS-PT              0.141    0.080    0.015    0.001    0.000    0.000    0.000
NP-MZα^GLS          0.162*   0.097*   0.022    0.002    0.000    0.000    0.000
NP-MSB^GLS          0.139    0.087    0.021    0.003    0.000    0.000    0.000
NP-MZt^GLS          0.151    0.089    0.019    0.001    0.000    0.000    0.000
EM-Q̂^µ(10, 1)       0.074    0.048    0.011    0.001    0.000    0.000    0.000
EM-Q̂^µ(10, 3.8)     0.047    0.047    0.044    0.034    0.021    0.010    0.004
van der Waerden     0.019    0.025    0.045    0.080    0.139*   0.221*   0.326*
Laplace             0.028    0.035    0.049    0.076    0.113    0.166    0.230
Wilcoxon            0.020    0.027    0.047    0.083    0.139*   0.220*   0.321*
Fig. 1. Simulated (50,000 replications) finite-sample (n = 25, 50, 100) distributions of the van der Waerden test statistic $T_{vdW}^{(n)}$ (standard normal reference density), compared to its limiting distribution under the null hypothesis.
Table 4
Rejection frequencies (25,000 replications), nominal level 5%, n = 50, underlying double exponential density, starting values $Y_0 = m + a\sigma_\varepsilon/\sqrt{1-\rho^2}$, a = 0, 1, ..., 6.

Tests               a = 0    a = 1    a = 2    a = 3    a = 4    a = 5    a = 6

ρ = 1
Dickey–Fuller       0.078    0.078    0.078    0.078    0.078    0.078    0.078
ERS-PT              0.045    0.045    0.045    0.045    0.045    0.045    0.045
NP-MZα^GLS          0.056    0.056    0.056    0.056    0.056    0.056    0.056
NP-MSB^GLS          0.052    0.052    0.052    0.052    0.052    0.052    0.052
NP-MZt^GLS          0.051    0.051    0.051    0.051    0.051    0.051    0.051
EM-Q̂^µ(10, 1)       0.025    0.025    0.025    0.025    0.025    0.025    0.025
EM-Q̂^µ(10, 3.8)     0.027    0.027    0.027    0.027    0.027    0.027    0.027
van der Waerden     0.050    0.050    0.050    0.050    0.050    0.050    0.050
Laplace             0.050    0.050    0.050    0.050    0.050    0.050    0.050
Wilcoxon            0.050    0.050    0.050    0.050    0.050    0.050    0.050

ρ = 0.99
Dickey–Fuller       0.079    0.080    0.079    0.078    0.077    0.077    0.077
ERS-PT              0.058    0.051    0.035    0.019    0.009    0.003    0.001
NP-MZα^GLS          0.070    0.062    0.045    0.025    0.012    0.004    0.001
NP-MSB^GLS          0.063    0.057    0.042    0.025    0.013    0.004    0.001
NP-MZt^GLS          0.063    0.056    0.040    0.023    0.011    0.004    0.001
EM-Q̂^µ(10, 1)       0.030    0.027    0.018    0.011    0.005    0.002    0.001
EM-Q̂^µ(10, 3.8)     0.031    0.028    0.023    0.018    0.012    0.007    0.003
van der Waerden     0.048    0.049    0.050    0.052    0.055    0.060    0.065
Laplace             0.049    0.050    0.052    0.054    0.059    0.064    0.070
Wilcoxon            0.048    0.049    0.050    0.053    0.058    0.064    0.069

ρ = 0.975
Dickey–Fuller       0.083    0.082    0.081    0.078    0.076    0.072    0.070
ERS-PT              0.081    0.058    0.025    0.007    0.001    0.000    0.000
NP-MZα^GLS          0.097*   0.074    0.033    0.009    0.001    0.000    0.000
NP-MSB^GLS          0.084    0.068    0.032    0.009    0.001    0.000    0.000
NP-MZt^GLS          0.088    0.068    0.030    0.008    0.001    0.000    0.000
EM-Q̂^µ(10, 1)       0.043    0.031    0.015    0.004    0.001    0.000    0.000
EM-Q̂^µ(10, 3.8)     0.037    0.033    0.027    0.018    0.009    0.004    0.002
van der Waerden     0.037    0.042    0.053    0.071    0.098    0.132    0.176
Laplace             0.044    0.049    0.062    0.082    0.111*   0.151*   0.197*
Wilcoxon            0.039    0.044    0.055    0.076    0.105    0.145    0.195*

ρ = 0.95
Dickey–Fuller       0.090    0.091    0.088    0.086    0.084    0.081    0.080
ERS-PT              0.134    0.076    0.016    0.001    0.000    0.000    0.000
NP-MZα^GLS          0.156*   0.097*   0.023    0.002    0.000    0.000    0.000
NP-MSB^GLS          0.135    0.086    0.022    0.002    0.000    0.000    0.000
NP-MZt^GLS          0.144    0.088    0.020    0.002    0.000    0.000    0.000
EM-Q̂^µ(10, 1)       0.071    0.042    0.011    0.001    0.000    0.000    0.000
EM-Q̂^µ(10, 3.8)     0.046    0.045    0.043    0.032    0.019    0.009    0.005
van der Waerden     0.019    0.027    0.054    0.105    0.186    0.294    0.420
Laplace             0.032    0.044    0.076    0.131*   0.213*   0.311    0.421
Wilcoxon            0.021    0.029    0.061    0.120    0.211*   0.328*   0.460*
which, in practice, is not known), heavily influences the power of all unit root tests. In this section, we are following Elliott and Müller (2006), and explore powers for various values of $Y_0$, of the form $Y_0 = m + a\sigma_\varepsilon/\sqrt{1-\rho^2}$ with a = 0, 1, ..., 6 (ρ < 1) measuring the amplitude of the deviation of $Y_0$ from the stationary mean in terms of the stationary standard deviation.$^8$
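The design of this power study can be mimicked in a few lines. The sketch below is an illustration only, under several assumptions that are not stated in this section: the data are generated as $Y_t = \rho Y_{t-1} + \varepsilon_t$ (taken here as the form of parameterization (2) with m = 0), the innovations are standard Gaussian with $\sigma_\varepsilon = 1$, and the two-sided 5% van der Waerden test rejects when $|T_{vdW}^{(n)}|$ exceeds 0.52, the rounded 2.5% entry of Table 2 for n = 50. The function names `vdw_statistic` and `rejection_frequency` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def vdw_statistic(increments):
    """van der Waerden rank statistic computed from the increments Delta Y_1, ..., Delta Y_n."""
    n = len(increments)
    order = sorted(range(n), key=lambda t: increments[t])
    ranks = [0] * n
    for rank, t in enumerate(order, start=1):
        ranks[t] = rank
    inv = NormalDist().inv_cdf
    total = sum(
        ((t + 1) / (n + 1) - 0.5) * inv(ranks[t] / (n + 1)) for t in range(n)
    )
    return total / math.sqrt(n)


def rejection_frequency(rho, a, n=50, crit=0.52, reps=2000, seed=0):
    """Share of simulated samples in which the two-sided test rejects.

    Assumed design: Y_t = rho * Y_{t-1} + eps_t, eps_t ~ N(0, 1), m = 0,
    Y_0 = a / sqrt(1 - rho^2) when rho < 1 and Y_0 = 0 when rho = 1.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        y_prev = a / math.sqrt(1.0 - rho * rho) if rho < 1.0 else 0.0
        increments = []
        for _ in range(n):
            y = rho * y_prev + rng.gauss(0.0, 1.0)
            increments.append(y - y_prev)
            y_prev = y
        if abs(vdw_statistic(increments)) > crit:
            rejections += 1
    return rejections / reps
```

Under these assumptions, `rejection_frequency(1.0, 0)` should stay close to the nominal 5% level, while `rejection_frequency(0.95, 6)` should be markedly larger than `rejection_frequency(0.95, 0)`, the qualitative pattern discussed in this section; exact agreement with the published tables is not to be expected from so few replications.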
8 For ρ = 1, that deviation is not well-defined; all test statistics, however, only depend on the observations via $\Delta Y_1, \ldots, \Delta Y_n$ which, under the null hypothesis,
Table 5
Rejection frequencies (25,000 replications), nominal level 5%, n = 50, underlying Cauchy density, starting values $Y_0 = m + a\sigma_\varepsilon/\sqrt{1-\rho^2}$, a = 0, 1, ..., 6.

Tests               a = 0    a = 1    a = 2    a = 3    a = 4    a = 5    a = 6

ρ = 1
Dickey–Fuller       0.077    0.077    0.077    0.077    0.077    0.077    0.077
ERS-PT              0.025    0.025    0.025    0.025    0.025    0.025    0.025
NP-MZα^GLS          0.035    0.035    0.035    0.035    0.035    0.035    0.035
NP-MSB^GLS          0.048    0.048    0.048    0.048    0.048    0.048    0.048
NP-MZt^GLS          0.029    0.029    0.029    0.029    0.029    0.029    0.029
EM-Q̂^µ(10, 1)       0.014    0.014    0.014    0.014    0.014    0.014    0.014
EM-Q̂^µ(10, 3.8)     0.049    0.049    0.049    0.049    0.049    0.049    0.049
van der Waerden     0.050    0.050    0.050    0.050    0.050    0.050    0.050
Laplace             0.050    0.050    0.050    0.050    0.050    0.050    0.050
Wilcoxon            0.050    0.050    0.050    0.050    0.050    0.050    0.050

ρ = 0.99
Dickey–Fuller       0.078    0.078    0.078    0.078    0.078    0.078    0.078
ERS-PT              0.033    0.031    0.029    0.025    0.022    0.019    0.015
NP-MZα^GLS          0.042    0.042    0.038    0.033    0.027    0.023    0.019
NP-MSB^GLS          0.056    0.055    0.051    0.045    0.039    0.033    0.029
NP-MZt^GLS          0.036    0.036    0.033    0.028    0.024    0.020    0.016
EM-Q̂^µ(10, 1)       0.017    0.018    0.015    0.013    0.011    0.009    0.007
EM-Q̂^µ(10, 3.8)     0.047    0.046    0.042    0.039    0.036    0.033    0.030
van der Waerden     0.149    0.150    0.151    0.155    0.159    0.164    0.170
Laplace             0.178*   0.178*   0.181*   0.185*   0.189*   0.198*   0.209*
Wilcoxon            0.167    0.167    0.170    0.174    0.180    0.187    0.196

ρ = 0.975
Dickey–Fuller       0.080    0.080    0.079    0.079    0.078    0.077    0.076
ERS-PT              0.046    0.042    0.034    0.027    0.020    0.016    0.013
NP-MZα^GLS          0.056    0.054    0.043    0.034    0.025    0.020    0.015
NP-MSB^GLS          0.070    0.066    0.056    0.045    0.035    0.030    0.023
NP-MZt^GLS          0.049    0.048    0.038    0.030    0.022    0.018    0.013
EM-Q̂^µ(10, 1)       0.023    0.021    0.018    0.013    0.010    0.007    0.005
EM-Q̂^µ(10, 3.8)     0.028    0.028    0.028    0.028    0.027    0.024    0.023
van der Waerden     0.201    0.204    0.215    0.236    0.263    0.299    0.338
Laplace             0.254*   0.259*   0.275*   0.299*   0.333*   0.376*   0.422*
Wilcoxon            0.233    0.238    0.253    0.277    0.308    0.348    0.394

ρ = 0.95
Dickey–Fuller       0.084    0.084    0.082    0.080    0.080    0.079    0.078
ERS-PT              0.074    0.064    0.049    0.035    0.024    0.019    0.014
NP-MZα^GLS          0.090    0.080    0.058    0.040    0.029    0.021    0.016
NP-MSB^GLS          0.100    0.091    0.070    0.051    0.039    0.031    0.025
NP-MZt^GLS          0.082    0.072    0.052    0.036    0.025    0.019    0.014
EM-Q̂^µ(10, 1)       0.038    0.032    0.023    0.016    0.011    0.007    0.005
EM-Q̂^µ(10, 3.8)     0.026    0.026    0.028    0.030    0.030    0.028    0.026
van der Waerden     0.225    0.232    0.261    0.305    0.366    0.434    0.507
Laplace             0.301*   0.311*   0.343*   0.394*   0.455*   0.521*   0.584*
Wilcoxon            0.270    0.281    0.311    0.363    0.430    0.502    0.577
Tables 3–10 provide rejection frequencies, over 25,000 replications of the data-generating process, and sample sizes n = 50 and n = 100, of three of the rank-based tests (van der Waerden, Wilcoxon, and Laplace, associated with Gaussian, logistic and double-exponential reference densities g, respectively) considered in this paper, along with those of the traditional Dickey–Fuller procedure, the $P_T$ test (c = −7) ERS-PT from Elliott et al. (1996), the $M^{GLS}$ tests NP-$MZ_\alpha^{GLS}$, NP-$MZ_t^{GLS}$, and NP-$MSB^{GLS}$ (with p = 0 and $\bar{c} = -7.0$) from Ng and Perron (2001), and the $\hat{Q}^\mu$ tests EM-$\hat{Q}^\mu(10, 1)$ and EM-$\hat{Q}^\mu(10, 3.8)$ from Elliott and Müller (2006). Throughout, the nominal level is α = 5%, with simulated critical values (taken from Table 2) for the rank-based tests and asymptotic critical values for the other ones. As all tests are invariant with respect to m (under the null hypothesis as well as under the alternative), we only consider m = 0. For each combination of an
coincide with ε1 , . . . , εn , so, without any loss of generality, we put Y0 = 0, in simulations under the null hypothesis.
innovation density f (four densities: Gaussian, double exponential, Cauchy, and skew-normal) and a ρ value (four values: 1, 0.99, 0.975, and 0.95), seven starting values following Elliott and Müller (2006) ($Y_0 = a\sigma_\varepsilon/\sqrt{1-\rho^2}$ for a = 0, 1, ..., 6) have been considered.$^9$ All simulations were carried out in Matlab 7.10; codes are available upon request. In each table, rejection frequencies significantly larger than 10% (at probability level 5%, that is, larger than or equal to 0.097) are printed in bold face; among them, the winners in each column (still at level 5%) are starred.

Before commenting on the results, some further details about the implementation of the Dickey–Fuller method are in order. The Dickey–Fuller tests actually are the (standard) t-tests for testing the hypothesis ρ = 1. Accordingly, different versions exist, depending on the regression equation to be considered. These versions are presented in, for example, Hamilton (1994, Table
9 For the Cauchy density, we use $\sigma_\varepsilon = 3$ in the definition of $Y_0$.
Table 6
Rejection frequencies (25,000 replications), nominal level 5%, n = 50, underlying skew-normal (shape parameter −10, mean 0, variance 1) density, starting values $Y_0 = m + a\sigma_\varepsilon/\sqrt{1-\rho^2}$, a = 0, 1, ..., 6.

Tests               a = 0    a = 1    a = 2    a = 3    a = 4    a = 5    a = 6

ρ = 1
Dickey–Fuller       0.076    0.076    0.076    0.076    0.076    0.076    0.076
ERS-PT              0.047    0.047    0.047    0.047    0.047    0.047    0.047
NP-MZα^GLS          0.056    0.056    0.056    0.056    0.056    0.056    0.056
NP-MSB^GLS          0.052    0.052    0.052    0.052    0.052    0.052    0.052
NP-MZt^GLS          0.051    0.051    0.051    0.051    0.051    0.051    0.051
EM-Q̂^µ(10, 1)       0.024    0.024    0.024    0.024    0.024    0.024    0.024
EM-Q̂^µ(10, 3.8)     0.025    0.025    0.025    0.025    0.025    0.025    0.025
van der Waerden     0.050    0.050    0.050    0.050    0.050    0.050    0.050
Laplace             0.050    0.050    0.050    0.050    0.050    0.050    0.050
Wilcoxon            0.050    0.050    0.050    0.050    0.050    0.050    0.050

ρ = 0.99
Dickey–Fuller       0.075    0.077    0.077    0.078    0.078    0.078    0.077
ERS-PT              0.060    0.052    0.035    0.016    0.007    0.002    0.001
NP-MZα^GLS          0.070    0.062    0.042    0.026    0.011    0.003    0.001
NP-MSB^GLS          0.064    0.058    0.042    0.026    0.011    0.003    0.001
NP-MZt^GLS          0.063    0.056    0.039    0.022    0.009    0.002    0.000
EM-Q̂^µ(10, 1)       0.030    0.025    0.019    0.009    0.004    0.001    0.000
EM-Q̂^µ(10, 3.8)     0.029    0.029    0.027    0.020    0.013    0.007    0.004
van der Waerden     0.049    0.049    0.049    0.051    0.054    0.058    0.064
Laplace             0.050    0.050    0.050    0.052    0.053    0.055    0.058
Wilcoxon            0.049    0.047    0.049    0.050    0.055    0.059    0.063

ρ = 0.975
Dickey–Fuller       0.079    0.081    0.081    0.081    0.082    0.082    0.078
ERS-PT              0.083    0.060    0.021    0.004    0.001    0.000    0.000
NP-MZα^GLS          0.097    0.072    0.032    0.006    0.001    0.000    0.000
NP-MSB^GLS          0.084    0.067    0.032    0.007    0.001    0.000    0.000
NP-MZt^GLS          0.089    0.066    0.028    0.005    0.000    0.000    0.000
EM-Q̂^µ(10, 1)       0.042    0.030    0.012    0.002    0.000    0.000    0.000
EM-Q̂^µ(10, 3.8)     0.034    0.034    0.032    0.021    0.012    0.005    0.001
van der Waerden     0.038    0.038    0.047    0.064    0.089    0.125*   0.167*
Laplace             0.043    0.045    0.049    0.057    0.070    0.085    0.106
Wilcoxon            0.038    0.039    0.048    0.064    0.087    0.116    0.154

ρ = 0.95
Dickey–Fuller       0.087    0.090    0.093    0.095    0.095    0.092    0.093
ERS-PT              0.136    0.074    0.011    0.001    0.000    0.000    0.000
NP-MZα^GLS          0.157*   0.095    0.019    0.001    0.000    0.000    0.000
NP-MSB^GLS          0.135    0.088    0.020    0.001    0.000    0.000    0.000
NP-MZt^GLS          0.146    0.087    0.016    0.000    0.000    0.000    0.000
EM-Q̂^µ(10, 1)       0.070    0.041    0.008    0.000    0.000    0.000    0.000
EM-Q̂^µ(10, 3.8)     0.043    0.046    0.048    0.040    0.025    0.013    0.006
van der Waerden     0.019    0.021    0.043    0.087    0.164*   0.274*   0.406*
Laplace             0.031    0.033    0.046    0.074    0.111    0.163    0.231
Wilcoxon            0.021    0.024    0.044    0.087    0.156    0.256    0.379
17.1). Two Dickey–Fuller tests are suited for both equations (1) and (2). One possibility is to regress Yt on a constant term and Yt−1 (as in (24)). In Hamilton (1994, Table 17.1), the behavior of this test is summarized in Cases 2 and 3; denote by DF1 the resulting Dickey–Fuller statistic. Another possibility is to regress Yt also on a linear time trend; in Hamilton (1994, Table 17.1) this is called Case 4; denote by DF the resulting Dickey–Fuller statistic. It is well documented, however, that DF1 yields non-similar tests; see, for example, Bhargava (1986), Hylleberg and Mizon (1989), or Dios-Palomares and Roldan (2006). Therefore, we use DF instead.

Turning to Tables 3–10, the figures speak for themselves:

(a) (validity) Irrespective of series lengths, starting values and underlying densities, the Dickey–Fuller method significantly over-rejects. The ERS-PT test and NP-M^GLS tests are close to the nominal level, except for the Cauchy case, under which they are severely biased. The EM test Q̂^µ(10, 1) is uniformly and severely biased, as is the EM test Q̂^µ(10, 3.8) which, however, behaves much better under Cauchy densities. The rank tests, as expected, perfectly match the nominal level.

(b) (short series lengths) Although of practical econometric relevance, n = 50 in this context is a very short series length, for which only the ERS-PT and NP-M^GLS tests have some power at ρ = 0.95 and small Y0 − m values. Rank-based tests, however, have power under large values of Y0 − m, and spectacularly outperform all their competitors under Cauchy densities.

(c) (heavy-tailed densities) All "classical" techniques, and, particularly, the Dickey–Fuller method, fail miserably under Cauchy densities, while all rank-based ones do extremely well. This is all the more remarkable as the scores (van der Waerden, Wilcoxon, Laplace) considered here are not adapted to a heavy-tailed context, and Cauchy scores (see Hallin et al., in press) are likely to perform even better.

(d) (impact of the starting value) Roughly, the deviation of Y0 from the stationary mean m has a negative impact on the
Table 7. Rejection frequencies (25,000 replications), nominal level 5%, n = 100, underlying N(0, 1) density, starting values Y0 = m + aσε/√(1 − ρ²), a = 0, 1, ..., 6.
Tests (columns): DF = Dickey–Fuller; PT = ERS-PT; MZα = NP-MZα^GLS; MSB = NP-MSB^GLS; MZt = NP-MZt^GLS; Q1 = EM-Q̂^µ(10, 1); Q3.8 = EM-Q̂^µ(10, 3.8); vdW = van der Waerden; Lap = Laplace; Wil = Wilcoxon.

ρ      a  DF     PT     MZα    MSB    MZt    Q1     Q3.8   vdW    Lap    Wil
1      0  0.063  0.047  0.058  0.052  0.054  0.037  0.036  0.050  0.050  0.050
1      1  0.063  0.047  0.058  0.052  0.054  0.037  0.036  0.050  0.050  0.050
1      2  0.063  0.047  0.058  0.052  0.054  0.037  0.036  0.050  0.050  0.050
1      3  0.063  0.047  0.058  0.052  0.054  0.037  0.036  0.050  0.050  0.050
1      4  0.063  0.047  0.058  0.052  0.054  0.037  0.036  0.050  0.050  0.050
1      5  0.063  0.047  0.058  0.052  0.054  0.037  0.036  0.050  0.050  0.050
1      6  0.063  0.047  0.058  0.052  0.054  0.037  0.036  0.050  0.050  0.050
0.99   0  0.064  0.078  0.091  0.081  0.086  0.056  0.052  0.039  0.044  0.040
0.99   1  0.065  0.064  0.073  0.068  0.068  0.046  0.045  0.042  0.046  0.042
0.99   2  0.064  0.032  0.039  0.036  0.035  0.026  0.033  0.047  0.050  0.047
0.99   3  0.063  0.010  0.013  0.013  0.011  0.009  0.018  0.057  0.056  0.055
0.99   4  0.063  0.002  0.003  0.003  0.002  0.002  0.009  0.071  0.065  0.069
0.99   5  0.061  0.000  0.000  0.000  0.000  0.000  0.004  0.088  0.076  0.085
0.99   6  0.060  0.000  0.000  0.000  0.000  0.000  0.001  0.110∗ 0.090  0.104∗
0.975  0  0.075  0.143  0.165∗ 0.146  0.156  0.103  0.073  0.018  0.030  0.019
0.975  1  0.075  0.085  0.100∗ 0.090  0.092  0.067  0.066  0.023  0.032  0.024
0.975  2  0.075  0.016  0.023  0.022  0.021  0.019  0.052  0.044  0.047  0.043
0.975  3  0.073  0.001  0.001  0.001  0.001  0.002  0.032  0.083  0.075  0.079
0.975  4  0.070  0.000  0.000  0.000  0.000  0.000  0.016  0.142∗ 0.113  0.138∗
0.975  5  0.069  0.000  0.000  0.000  0.000  0.000  0.007  0.228∗ 0.163  0.214
0.975  6  0.068  0.000  0.000  0.000  0.000  0.000  0.002  0.330∗ 0.228  0.314
0.95   0  0.104  0.319  0.355∗ 0.311  0.339  0.223  0.123  0.002  0.012  0.003
0.95   1  0.104  0.132  0.162∗ 0.145  0.151  0.131  0.125  0.006  0.019  0.007
0.95   2  0.108  0.007  0.014  0.013  0.012  0.023  0.122∗ 0.022  0.036  0.022
0.95   3  0.113∗ 0.000  0.000  0.000  0.000  0.001  0.110∗ 0.067  0.071  0.065
0.95   4  0.126  0.000  0.000  0.000  0.000  0.000  0.090  0.162∗ 0.126  0.154
0.95   5  0.142  0.000  0.000  0.000  0.000  0.000  0.068  0.311∗ 0.208  0.291
0.95   6  0.164  0.000  0.000  0.000  0.000  0.000  0.048  0.501∗ 0.311  0.466
power of ERS-PT, NP-M^GLS, EM-Q̂^µ(10, 1) and EM-Q̂^µ(10, 3.8) tests, and a positive impact on the rank-based ones; Tables 7, 8 and 10, for ρ = 0.95, are quite typical in that respect. The two families of procedures thus nicely complement each other (the deviation Y0 − m, of course, is unknown in practice).

4. Conclusions

The rank-based tests that we are proposing for the unit root hypothesis offer all the usual advantages of rank-based tests: distribution-freeness, exact finite-sample sizes, and robustness. Moreover, they are flexible and efficient, in the sense that a reference density g can be chosen such that semiparametric efficiency is achieved under density g. That reference density g can even be estimated, without affecting the validity of the test. In addition, choosing a Gaussian reference density guarantees that our tests (of the van der Waerden type) are (under (a)-asymptotics) uniformly locally more powerful than Dickey–Fuller tests. For finite samples, our simulation study shows that rank-based tests outperform the traditionally used Dickey–Fuller test, as well as several more recent competitors, for a broad range of initial values. Efficiency gains are particularly large when the underlying innovation density has fat tails. Our rank-based procedures thus nicely complement existing techniques.

The present paper focuses on the simplest setting possible. In particular, we assume the underlying innovations of the process to be i.i.d. This is needed in order to define optimality of testing procedures. However, extensions to models that allow for, e.g., parametric forms of heteroskedasticity are easily imagined.

Acknowledgments

The authors gratefully acknowledge the constructive remarks by two anonymous referees, an Associate Editor and an Editor,
Table 8. Rejection frequencies (25,000 replications), nominal level 5%, n = 100, underlying double exponential density, starting values Y0 = m + aσε/√(1 − ρ²), a = 0, 1, ..., 6.
Tests (columns): DF = Dickey–Fuller; PT = ERS-PT; MZα = NP-MZα^GLS; MSB = NP-MSB^GLS; MZt = NP-MZt^GLS; Q1 = EM-Q̂^µ(10, 1); Q3.8 = EM-Q̂^µ(10, 3.8); vdW = van der Waerden; Lap = Laplace; Wil = Wilcoxon.

ρ      a  DF     PT     MZα    MSB    MZt    Q1     Q3.8   vdW    Lap    Wil
1      0  0.063  0.050  0.058  0.053  0.053  0.036  0.037  0.050  0.050  0.050
1      1  0.063  0.050  0.058  0.053  0.053  0.036  0.037  0.050  0.050  0.050
1      2  0.063  0.050  0.058  0.053  0.053  0.036  0.037  0.050  0.050  0.050
1      3  0.063  0.050  0.058  0.053  0.053  0.036  0.037  0.050  0.050  0.050
1      4  0.063  0.050  0.058  0.053  0.053  0.036  0.037  0.050  0.050  0.050
1      5  0.063  0.050  0.058  0.053  0.053  0.036  0.037  0.050  0.050  0.050
1      6  0.063  0.050  0.058  0.053  0.053  0.036  0.037  0.050  0.050  0.050
0.99   0  0.066  0.080  0.092  0.083  0.085  0.058  0.051  0.040  0.045  0.041
0.99   1  0.066  0.060  0.070  0.065  0.064  0.046  0.047  0.043  0.049  0.044
0.99   2  0.065  0.029  0.035  0.033  0.032  0.023  0.034  0.051  0.058  0.051
0.99   3  0.063  0.009  0.012  0.013  0.011  0.007  0.020  0.062  0.074  0.066
0.99   4  0.063  0.002  0.003  0.003  0.003  0.002  0.009  0.080  0.095  0.084
0.99   5  0.062  0.000  0.000  0.000  0.000  0.000  0.003  0.101  0.123∗ 0.108
0.99   6  0.062  0.000  0.000  0.000  0.000  0.000  0.001  0.128  0.158∗ 0.140
0.975  0  0.075  0.145  0.163∗ 0.144  0.154  0.105  0.070  0.018  0.030  0.019
0.975  1  0.075  0.081  0.094  0.086  0.087  0.065  0.066  0.024  0.043  0.028
0.975  2  0.075  0.015  0.021  0.020  0.018  0.016  0.052  0.050  0.080  0.058
0.975  3  0.072  0.001  0.002  0.002  0.002  0.002  0.033  0.101  0.147∗ 0.114
0.975  4  0.071  0.000  0.000  0.000  0.000  0.000  0.016  0.182  0.237∗ 0.208
0.975  5  0.069  0.000  0.000  0.000  0.000  0.000  0.007  0.292  0.354∗ 0.331
0.975  6  0.068  0.000  0.000  0.000  0.000  0.000  0.003  0.426  0.483∗ 0.472
0.95   0  0.102  0.317  0.348∗ 0.305  0.332  0.217  0.120  0.003  0.018  0.004
0.95   1  0.102  0.127  0.153∗ 0.138  0.144  0.124  0.123  0.007  0.033  0.010
0.95   2  0.106  0.008  0.013  0.013  0.012  0.020  0.121∗ 0.029  0.082  0.040
0.95   3  0.113  0.000  0.000  0.000  0.000  0.001  0.107  0.093  0.170∗ 0.119
0.95   4  0.126  0.000  0.000  0.000  0.000  0.000  0.088  0.222  0.299∗ 0.270
0.95   5  0.144  0.000  0.000  0.000  0.000  0.000  0.068  0.419  0.448  0.474∗
0.95   6  0.168  0.000  0.000  0.000  0.000  0.000  0.048  0.627  0.599  0.677∗
which resulted in a major reorganization of the paper. They also thank Feike Drost, Alastair Hall, Ralph Koijen, Pierre Perron, and Per Mykland for useful comments. The research by Marc Hallin was supported by the Sonderforschungsbereich "Statistical modeling of nonlinear dynamic processes" (SFB 823) from the Deutsche Forschungsgemeinschaft, and by a Discovery Grant from the Australian Research Council.

Appendix. Proofs

For ease of reference, we first provide a lemma on the joint convergence of a partial sum process and its rank-based version. Although based on existing results in the literature, this lemma as such does not seem to have been provided. The bottom line is that, where the partial sum process converges to a Brownian motion, its rank-based version converges to the Brownian bridge generated by that Brownian motion.
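Lemma A.1. Let $(U_1, \ldots, U_n)$ be i.i.d. standard uniformly distributed random variables and denote by $R_t$ the rank of $U_t$. Let $\varphi : [0, 1] \to \mathbb{R}$ be a measurable function satisfying $\int_0^1 \varphi(v)\,dv = 0$ and $\int_0^1 \varphi(v)^2\,dv < \infty$. Define the partial sum processes $W_\varphi^{(n)}$ and $\widetilde{W}_\varphi^{(n)}$, both on $[0, 1]$, by
\[ W_\varphi^{(n)}(u) = \frac{1}{\sqrt{n}} \sum_{t=1}^{\lfloor un \rfloor} \varphi(U_t) \]
and
\[ \widetilde{W}_\varphi^{(n)}(u) = \frac{1}{\sqrt{n}} \sum_{t=1}^{\lfloor un \rfloor} E\{\varphi(U_t) \mid R_t\}. \tag{20} \]
Then, we have
\[ \begin{bmatrix} W_\varphi^{(n)} \\ \widetilde{W}_\varphi^{(n)} \end{bmatrix} \Rightarrow \begin{bmatrix} W \\ \widetilde{W} \end{bmatrix}, \tag{21} \]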
Table 9. Rejection frequencies (25,000 replications), nominal level 5%, n = 100, underlying Cauchy density, starting values Y0 = m + aσε/√(1 − ρ²), a = 0, 1, ..., 6.
Tests (columns): DF = Dickey–Fuller; PT = ERS-PT; MZα = NP-MZα^GLS; MSB = NP-MSB^GLS; MZt = NP-MZt^GLS; Q1 = EM-Q̂^µ(10, 1); Q3.8 = EM-Q̂^µ(10, 3.8); vdW = van der Waerden; Lap = Laplace; Wil = Wilcoxon.

ρ      a  DF     PT     MZα    MSB    MZt    Q1     Q3.8   vdW    Lap    Wil
1      0  0.067  0.026  0.033  0.048  0.028  0.019  0.053  0.050  0.050  0.050
1      1  0.067  0.026  0.033  0.048  0.028  0.019  0.053  0.050  0.050  0.050
1      2  0.067  0.026  0.033  0.048  0.028  0.019  0.053  0.050  0.050  0.050
1      3  0.067  0.026  0.033  0.048  0.028  0.019  0.053  0.050  0.050  0.050
1      4  0.067  0.026  0.033  0.048  0.028  0.019  0.053  0.050  0.050  0.050
1      5  0.067  0.026  0.033  0.048  0.028  0.019  0.053  0.050  0.050  0.050
1      6  0.067  0.026  0.033  0.048  0.028  0.019  0.053  0.050  0.050  0.050
0.99   0  0.069  0.041  0.051  0.066  0.044  0.030  0.045  0.277  0.339∗ 0.314
0.99   1  0.070  0.041  0.051  0.066  0.045  0.029  0.045  0.280  0.341∗ 0.319
0.99   2  0.070  0.038  0.047  0.061  0.041  0.026  0.042  0.286  0.351∗ 0.326
0.99   3  0.070  0.035  0.041  0.053  0.036  0.023  0.039  0.296  0.366∗ 0.339
0.99   4  0.070  0.030  0.034  0.046  0.029  0.020  0.037  0.313  0.387∗ 0.357
0.99   5  0.071  0.027  0.028  0.040  0.025  0.017  0.035  0.334  0.412∗ 0.381
0.99   6  0.071  0.023  0.025  0.035  0.022  0.014  0.033  0.359  0.442∗ 0.410
0.975  0  0.074  0.078  0.092  0.104  0.083  0.055  0.042  0.353  0.437∗ 0.404
0.975  1  0.074  0.074  0.086  0.098  0.078  0.050  0.041  0.360  0.443∗ 0.413
0.975  2  0.074  0.062  0.072  0.083  0.064  0.040  0.040  0.381  0.466∗ 0.438
0.975  3  0.074  0.051  0.057  0.068  0.050  0.033  0.039  0.414  0.502∗ 0.475
0.975  4  0.074  0.041  0.046  0.055  0.040  0.025  0.038  0.462  0.549∗ 0.525
0.975  5  0.073  0.032  0.036  0.046  0.032  0.020  0.036  0.517  0.597∗ 0.580
0.975  6  0.073  0.026  0.029  0.038  0.026  0.015  0.034  0.576  0.647  0.638
0.95   0  0.082  0.197  0.218  0.208  0.207  0.123  0.066  0.382  0.460∗ 0.440
0.95   1  0.082  0.177  0.196  0.189  0.184  0.112  0.065  0.389  0.470∗ 0.448
0.95   2  0.080  0.141  0.155  0.151  0.146  0.087  0.067  0.413  0.494∗ 0.478
0.95   3  0.081  0.110  0.117  0.117  0.109  0.065  0.068  0.453  0.528∗ 0.517
0.95   4  0.083  0.086  0.092  0.091  0.087  0.050  0.069  0.505  0.566∗ 0.565∗
0.95   5  0.084  0.069  0.074  0.075  0.070  0.039  0.069  0.564  0.608  0.618∗
0.95   6  0.085  0.057  0.060  0.061  0.056  0.030  0.068  0.620  0.646  0.664∗
where $W$ denotes a zero-drift Brownian motion with variance $\int_0^1 \varphi(v)^2\,dv$ per unit of time and $\widetilde{W}$ its associated Brownian bridge: $\widetilde{W}(u) = W(u) - uW(1)$, $u \in [0, 1]$. The convergence in (21) is on $D^2[0, 1]$ equipped with the uniform topology.
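As a purely illustrative numerical check of Lemma A.1 (not part of the original derivation), the following Python sketch simulates both partial sum processes for the van der Waerden score $\varphi = \Phi^{-1}$ and measures how far the rank-based path lies from the bridge $W(u) - uW(1)$ built from the ordinary path. It assumes the approximate scores $\varphi(R_t/(n+1))$ in place of the exact conditional expectations $E\{\varphi(U_t) \mid R_t\}$; the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def partial_sum_paths(n, seed=0):
    """Return (W, W_rank): the partial sum path of phi(U_t) and its rank-based
    analogue, both evaluated on the grid u = k/n, k = 0..n."""
    rng = random.Random(seed)
    phi = NormalDist().inv_cdf  # van der Waerden score function (assumption)
    # Uniform draws, kept strictly inside (0, 1) so that inv_cdf is defined.
    u = [max(rng.random(), 1e-12) for _ in range(n)]
    # Ranks R_t in 1..n (ties have probability zero).
    order = sorted(range(n), key=lambda t: u[t])
    rank = [0] * n
    for r, t in enumerate(order, start=1):
        rank[t] = r
    scale = 1.0 / math.sqrt(n)
    w, w_rank = [0.0], [0.0]
    for t in range(n):
        w.append(w[-1] + scale * phi(u[t]))
        # Approximate score phi(R_t/(n+1)) stands in for E{phi(U_t) | R_t}.
        w_rank.append(w_rank[-1] + scale * phi(rank[t] / (n + 1)))
    return w, w_rank


def max_bridge_gap(n, seed=0):
    """Sup over the grid of |W_rank(u) - (W(u) - u W(1))|."""
    w, w_rank = partial_sum_paths(n, seed)
    return max(abs(w_rank[k] - (w[k] - (k / n) * w[n])) for k in range(n + 1))


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(max_bridge_gap(n, seed=1), 4))
```

Under these assumptions the rank-based path is tied down at $u = 1$, as a bridge should be, and the gap is expected to shrink as $n$ grows, in line with the $o_P(1)$ remainder of the Hájek representation used in the proof.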
Proof. It is well-known that weak convergence in $D^2[0, 1]$ under the uniform topology follows from establishing convergence of marginals and asymptotic tightness; see, for example, Van der Vaart and Wellner (1993, Theorem 1.5.4). Convergence of marginals for the partial sum process $W_\varphi^{(n)}$ is easily obtained from the central limit theorem. This implies also (joint) convergence of the marginals of its rank-based version $\widetilde{W}_\varphi^{(n)}$ using what is sometimes known as Hájek's asymptotic representation theorem:
\[ \widetilde{W}_\varphi^{(n)}(u) = W_\varphi^{(n)}(u) - u W_\varphi^{(n)}(1) + o_P(1); \tag{22} \]
see Van der Vaart (2000, Theorem 13.5). In the notation of Van der Vaart (2000), we have $i = t$, $N = n$, $c_{Ni} = I\{t \le un\}$, and $a_{Ni} = E\{\varphi(U_t) \mid R_t = i\}$. From $\int_0^1 \varphi(v)\,dv = 0$ we find $\bar{a}_N = 0$. Moreover, we have $\bar{c}_N = \lfloor un \rfloor / n \to u$. Since marginal tightness implies joint tightness, the proof is concluded once we show that $\widetilde{W}_\varphi^{(n)}$ is tight in $D[0, 1]$ under the uniform topology. This follows from Shorack and Wellner (1986). Take $c_{ni} = E\{\varphi(U_t) \mid R_t = i\}$ and note that $\bar{c}_n = n^{-1} \sum_{i=1}^{n} c_{ni} = 0$, and $n^{-1} \sum_{i=1}^{n} c_{ni}^2 \le \int_0^1 \varphi^2(u)\,du$. From this it easily follows that the conditions to Shorack and Wellner (1986, Theorem 3.1) are satisfied: $\max_{i=1,\ldots,n} c_{ni}^2 / c_n^{T} c_n \to 0$ and $\bar{c}_n^{2} / c_{nn}^{2} \to 0$.

Proof of Theorem 2.1. First recall that $g \in \mathcal{F}$ implies $\int_{u=0}^{1} \varphi_g(u)\,du = 0$ and $\int_{u=0}^{1} \varphi_g^2(u)\,du = I_g$, with $\varphi_g := -g'/g$. Moreover, under $H_0$, we have $\Delta Y_t = \mu + \varepsilon_t$, so the rank of $\Delta Y_t$ amongst $\Delta Y_1, \ldots, \Delta Y_n$ is the same as that of $\varepsilon_t$ amongst $\varepsilon_1, \ldots, \varepsilon_n$. Now, using $\widetilde{W}_{\varphi_g}^{(n)}$ as defined in Lemma A.1 and (10) with $U_t = F(\varepsilon_t)$, we
Table 10. Rejection frequencies (25,000 replications), nominal level 5%, n = 100, underlying skew-normal (shape parameter −10, mean 0, variance 1) density, starting values Y0 = m + aσε/√(1 − ρ²), a = 0, 1, ..., 6.
Tests (columns): DF = Dickey–Fuller; PT = ERS-PT; MZα = NP-MZα^GLS; MSB = NP-MSB^GLS; MZt = NP-MZt^GLS; Q1 = EM-Q̂^µ(10, 1); Q3.8 = EM-Q̂^µ(10, 3.8); vdW = van der Waerden; Lap = Laplace; Wil = Wilcoxon.

ρ      a  DF     PT     MZα    MSB    MZt    Q1     Q3.8   vdW    Lap    Wil
1      0  0.063  0.051  0.060  0.053  0.056  0.037  0.037  0.050  0.050  0.050
1      1  0.063  0.051  0.060  0.053  0.056  0.037  0.037  0.050  0.050  0.050
1      2  0.063  0.051  0.060  0.053  0.056  0.037  0.037  0.050  0.050  0.050
1      3  0.063  0.051  0.060  0.053  0.056  0.037  0.037  0.050  0.050  0.050
1      4  0.063  0.051  0.060  0.053  0.056  0.037  0.037  0.050  0.050  0.050
1      5  0.063  0.051  0.060  0.053  0.056  0.037  0.037  0.050  0.050  0.050
1      6  0.063  0.051  0.060  0.053  0.056  0.037  0.037  0.050  0.050  0.050
0.99   0  0.065  0.079  0.092  0.082  0.087  0.058  0.050  0.042  0.044  0.043
0.99   1  0.065  0.062  0.071  0.066  0.066  0.046  0.049  0.042  0.044  0.043
0.99   2  0.064  0.028  0.034  0.033  0.031  0.023  0.036  0.048  0.047  0.049
0.99   3  0.062  0.009  0.010  0.010  0.008  0.007  0.022  0.060  0.053  0.058
0.99   4  0.062  0.001  0.002  0.002  0.002  0.001  0.010  0.077  0.061  0.073
0.99   5  0.062  0.000  0.000  0.000  0.000  0.000  0.004  0.100∗ 0.071  0.092∗
0.99   6  0.061  0.000  0.000  0.000  0.000  0.000  0.001  0.131∗ 0.082  0.117
0.975  0  0.075  0.145  0.166∗ 0.145  0.154  0.104  0.069  0.018  0.030  0.020
0.975  1  0.074  0.080  0.094  0.086  0.087  0.065  0.070  0.021  0.032  0.024
0.975  2  0.074  0.014  0.017  0.017  0.015  0.015  0.056  0.044  0.045  0.046
0.975  3  0.074  0.000  0.001  0.001  0.001  0.001  0.038  0.096∗ 0.069  0.089
0.975  4  0.074  0.000  0.000  0.000  0.000  0.000  0.020  0.177∗ 0.104  0.158
0.975  5  0.073  0.000  0.000  0.000  0.000  0.000  0.008  0.294∗ 0.155  0.260
0.975  6  0.071  0.000  0.000  0.000  0.000  0.000  0.003  0.443∗ 0.219  0.389
0.95   0  0.105  0.324  0.350∗ 0.307  0.333  0.222  0.122  0.002  0.013  0.003
0.95   1  0.105  0.125  0.154∗ 0.141  0.143  0.130  0.124  0.004  0.019  0.006
0.95   2  0.109  0.004  0.008  0.009  0.007  0.016  0.126∗ 0.020  0.036  0.023
0.95   3  0.118∗ 0.000  0.000  0.000  0.000  0.000  0.115∗ 0.077  0.071  0.078
0.95   4  0.129  0.000  0.000  0.000  0.000  0.000  0.096  0.202∗ 0.130  0.191
0.95   5  0.146  0.000  0.000  0.000  0.000  0.000  0.075  0.408∗ 0.218  0.373
0.95   6  0.167  0.000  0.000  0.000  0.000  0.000  0.055  0.635∗ 0.336  0.584
obtain the asymptotic representation
\[ T_g^{(n)} = \int_{u=0}^{1} \left(u - \tfrac{1}{2}\right) d\widetilde{W}_{\varphi_g}^{(n)}(u) + o_P(1). \tag{23} \]
Lemma A.1 and the continuous mapping theorem thus imply that $T_g^{(n)}$ is asymptotically distributed as
\[ \int_{u=0}^{1} \left(u - \tfrac{1}{2}\right) d\widetilde{W}(u) \sim N\!\left(0,\; I_g \int_{u=0}^{1} \left(u - \tfrac{1}{2}\right)^2 du\right) = N\!\left(0,\; \frac{I_g}{12}\right). \]

Proof of Proposition 2.1. Case (ii) has been established in Jeganathan (1995, Section 7). For Case (i), the proof is analogous to that in Drost et al. (1997) for a pure location model. The rates of convergence obviously have to be adapted, as well as the form of the Fisher information matrix. Also, $\mu$ in our model (1) is a pure location parameter and its Fisher information, therefore, is $I_f$. The Fisher information for $\rho$ is given by the limit of $n^{-3} \sum_{t=1}^{n} Y_{t-1}^2 \varepsilon_t^2$, analogously to the standard regression framework. Note that, under the null hypothesis ($\mu \neq 0$, $\rho = 1$), the drift $\mu t$ in $Y_t$ dominates the stochastic part, as
\[ \frac{1}{n^3} \sum_{t=1}^{n} (Y_t - \mu t)^2 = \frac{1}{n^3} \sum_{t=1}^{n} \left( \sum_{s=1}^{t} \varepsilon_s \right)^2 \le \frac{1}{n} \sum_{t=1}^{n} \left( \frac{1}{t} \sum_{s=1}^{t} \varepsilon_s \right)^2 \to 0 \quad \text{a.s.}, \]
where the last convergence follows from a Cesàro mean argument and the strong law of large numbers. Consequently, we have
\[ \lim_{n \to \infty} n^{-3} \sum_{t=1}^{n} Y_{t-1}^2 = \lim_{n \to \infty} n^{-3} \sum_{t=1}^{n} (\mu t)^2 = \mu^2/3 \quad \text{(a.s.)}, \]
which in turn leads to the Fisher information $\mu^2 I_f/3$ for $\rho$.
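The limit $n^{-3} \sum_{t=1}^{n} Y_{t-1}^2 \to \mu^2/3$ can be checked numerically. The sketch below is only an illustration under assumptions not fixed by the proof: standard Gaussian innovations, $Y_0 = 0$, and a hypothetical function name.

```python
import random


def scaled_sum_of_squares(n, mu, seed=0):
    """Return n^{-3} * sum_{t=1}^{n} Y_{t-1}^2 for a random walk with drift,
    Y_t = Y_{t-1} + mu + eps_t, with Y_0 = 0 and eps_t ~ N(0, 1) (assumptions)."""
    rng = random.Random(seed)
    y_prev = 0.0   # Y_{t-1}, starting from Y_0 = 0
    total = 0.0
    for _ in range(n):
        total += y_prev * y_prev          # accumulate Y_{t-1}^2
        y_prev += mu + rng.gauss(0.0, 1.0)  # move on to Y_t
    return total / n ** 3


if __name__ == "__main__":
    mu = 0.5
    for n in (1_000, 10_000, 100_000):
        print(n, scaled_sum_of_squares(n, mu, seed=3), "target:", mu ** 2 / 3)
```

The stochastic part contributes terms of smaller order than $n^3$, which is why the simulated value should settle near $\mu^2/3$ as $n$ grows.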
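Proof of Theorem 2.2. The Hájek asymptotic representation result (23), combined with Lemma A.1, implies
\[ T_g^{(n)} = \int_{u=0}^{1} \left(u - \tfrac{1}{2}\right) dW_{\varphi_g}^{(n)}(u) + o_P(1), \]
as
\[ \int_{u=0}^{1} \left(u - \tfrac{1}{2}\right) d\left(W_{\varphi_g}^{(n)} - \widetilde{W}_{\varphi_g}^{(n)}\right)(u) \Rightarrow \int_{u=0}^{1} \left(u - \tfrac{1}{2}\right) d\left(W - \widetilde{W}\right)(u) = 0. \]
Also, it follows from Proposition 2.1 that
\[ \log \frac{dP^{(n)}_{(\mu,\rho_n);f}}{dP^{(n)}_{(\mu,1);f}} = h\mu \int_{u=0}^{1} u\, dW_{\varphi_f}^{(n)}(u) - h^2 \mu^2 I_f/6 + o_P(1). \]
As a result, the statistic $T_g^{(n)}$ and the log likelihood ratio are asymptotically jointly normal, with limiting covariance
\[ h\mu I_{fg} \int_{u=0}^{1} u\left(u - \tfrac{1}{2}\right) du = h\mu I_{fg}/12. \]
Le Cam's third lemma (see, e.g., Van der Vaart (2000, Section 6.7)) now readily implies (16).

Proof of Corollary 2.1. The asymptotic distribution of the Dickey–Fuller test statistic is well-studied. For instance, it follows from Chapter 17 in Hamilton (1994) that, letting $\bar{Y}_n := n^{-1} \sum_{t=1}^{n} Y_{t-1}$,
\[ n^{3/2}\left(\hat{\rho}_n^{DF} - 1\right) = \frac{n^{-3/2} \sum_{t=1}^{n} \left(Y_{t-1} - \bar{Y}_n\right) \Delta Y_t}{n^{-3} \sum_{t=1}^{n} \left(Y_{t-1} - \bar{Y}_n\right)^2} = n^{-1/2}\, \frac{12}{\mu} \sum_{t=1}^{n} \left( \frac{t}{n+1} - \frac{1}{2} \right) \varepsilon_t + o_P(1). \tag{24} \]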
The null limiting distribution of $n^{3/2}(\hat{\rho}_n^{DF} - 1)$ thus is $N(0, 12\sigma_f^2/\mu^2)$. As in Theorem 2.2, it follows from Le Cam's third lemma that its limiting distribution under the near (under (a)-asymptotics) unit root alternatives $H_1^{(n)} : \rho_n = 1 + hn^{-3/2}$ is $N(h, 12\sigma_f^2/\mu^2)$, using the fact that $E_f\left[(-f'/f)(\varepsilon_t)\,\varepsilon_t\right] = 1$. Incidentally, this shows that the least-squares estimator is (also) regular in this situation.

References

Ahn, S.A., Fotopoulos, S.B., He, L., 2003. Unit root tests with infinite variance errors. Econometric Reviews 20, 461–483.
Bhargava, A., 1986. On the theory of testing for unit roots in observed time series. The Review of Economic Studies 53, 369–384.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y., Wellner, J.A., 1993. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.
Breitung, J., Gouriéroux, C., 1997. Rank tests for unit roots. Journal of Econometrics 81, 7–27.
Callegari, F., Cappuccio, N., Lubian, D., 2003. Asymptotic inference in time series regressions with a unit root and infinite variance errors. Journal of Statistical Planning and Inference 116, 277–303.
Campbell, B., Dufour, J.M., 1995. Exact nonparametric orthogonality and random walk tests. The Review of Economics and Statistics 77, 1–16.
Campbell, B., Dufour, J.M., 1997. Exact nonparametric tests of orthogonality and random walk in the presence of a drift parameter. International Economic Review 38, 151–173.
Chan, N.H., Wei, C.Z., 1988. Limiting distributions of least squares estimates of unstable autoregressive processes. Annals of Statistics 16, 367–401.
Chernoff, H., Savage, I.R., 1958. Asymptotic normality and efficiency of certain nonparametric test statistics. The Annals of Mathematical Statistics 29, 972–994.
Coudin, E., Dufour, J.M., 2009. Finite-sample distribution-free inference in linear median regression under heteroskedasticity and nonlinear dependence of unknown form. Econometrics Journal 12, 19–49.
Dickey, D.A., Fuller, W.A., 1979. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association 74, 427–431.
Dickey, D.A., Fuller, W.A., 1981. Likelihood ratio statistics for autoregressive time series with a unit root. Econometrica 49, 1057–1072.
Dios-Palomares, R., Roldan, J.A., 2006. A strategy for testing the unit root in AR(1) model with intercept: a Monte Carlo experiment. Journal of Statistical Planning and Inference 136, 2685–2705.
Drost, F.C., Klaassen, C.A.J., Werker, B.J.M., 1997. Adaptive estimation in time-series models. The Annals of Statistics 25, 786–817.
Dufour, J.M., 1981. Rank tests for serial dependence. Journal of Time Series Analysis 2, 117–128.
Dufour, J.M., 1997. Some impossibility theorems in econometrics with applications to structural and dynamic models. Econometrica 65, 1365–1389.
Elliott, G., Jansson, M., 2003. Testing for unit roots with stationary covariates. Journal of Econometrics 115, 75–89.
Elliott, G., Müller, U.K., 2006. Minimizing the impact of the initial condition on testing for unit roots. Journal of Econometrics 135, 285–310.
Elliott, G., Rothenberg, T.J., Stock, J.H., 1996. Efficient tests for an autoregressive unit root. Econometrica 64, 813–836.
Enders, W., 2004. Applied Econometric Time Series, 2nd ed. John Wiley & Sons.
Gushchin, A.A., 1996. Asymptotic optimality of parameter estimators under the LAQ condition. Theory of Probability and its Applications 40, 261–272.
Haldrup, N., Jansson, M., 2006. Improving size and power in unit root testing. In: Mills, T.C., Patterson, K. (Eds.), Palgrave Handbook of Econometrics. In: Econometric Theory, vol. 1. Palgrave Macmillan, New York, pp. 252–277.
Hallin, M., Swan, Y., Verdebout, T., Veredas, D., 2010. Rank-based testing in linear models with stable errors. Journal of Nonparametric Statistics (in press).
Hamilton, J.D., 1994. Time Series Analysis, 1st ed. Princeton University Press.
Hasan, M.N., 2001. Rank tests of unit root hypothesis with infinite variance errors. Journal of Econometrics 104, 49–65.
Hasan, M.N., Koenker, R.W., 1997. Robust rank tests of the unit root hypothesis. Econometrica 65, 133–161.
Hylleberg, S., Mizon, G., 1989. A note on the distribution of the least squares estimator of a random walk with drift. Economics Letters 29, 225–230.
Im, K.S., Pesaran, M.H., Shin, Y., 2003. Testing for unit roots in heterogeneous panels. Journal of Econometrics 115, 53–74.
Jansson, M., 2008. Semiparametric power envelopes for tests of the unit root hypothesis. Econometrica 76, 1103–1142.
Jansson, M., Moreira, M.J., 2008. Optimal inference in regression models with nearly integrated regressors. Econometrica 74, 681–714.
Jeganathan, P., 1995. Some aspects of asymptotic theory with applications to time series models. Econometric Theory 11, 818–887.
Johansen, S., 1991. Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econometrica 59, 1551–1580.
Le Cam, L.M., 1986. Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York.
Levin, A., Lin, C.-F., Chu, C.-S.J., 2002. Unit root tests in panel data: asymptotic and finite-sample properties. Journal of Econometrics 108, 1–24.
Luger, R., 2003. Exact non-parametric tests for a random walk with unknown drift under conditional heteroscedasticity. Journal of Econometrics 115, 259–276.
Müller, U.K., Elliott, G., 2003. Tests for unit roots and the initial condition. Econometrica 71, 1269–1286.
Ng, S., Perron, P., 2001. Lag length selection and the construction of unit root tests with good size and power. Econometrica 69, 1519–1554.
Perron, P., 1988. Trends and random walks in macroeconomic time series. Journal of Economic Dynamics and Control 12, 297–332.
Phillips, P.C.B., 1987. Time series regression with a unit root. Econometrica 55, 277–301.
Phillips, P.C.B., 1991. Optimal inference in cointegrated systems. Econometrica 59, 283–306.
Phillips, P.C.B., Perron, P., 1988. Testing for a unit root in time series regression. Biometrika 75, 335–346.
Ploberger, W., 2004. Complete class of tests when the likelihood is locally asymptotically quadratic. Journal of Econometrics 118, 67–94.
Ploberger, W., 2008. Admissible and nonadmissible tests in unit-root-like situations. Econometric Theory 24, 15–42.
Rachev, S.T., Mittnik, S., Kim, J.-R., 1998. Time series with unit roots and infinite-variance disturbances. Applied Mathematics Letters 11, 69–74.
Rothenberg, T.J., Stock, J.H., 1997. Inference in a nearly integrated autoregressive model with nonnormal innovations. Journal of Econometrics 80, 268–286.
Shorack, G., Wellner, J.A., 1986. Empirical Processes with Applications to Statistics. Wiley, New York.
Thompson, S.B., 2004. Optimal versus robust inference in nearly integrated non-Gaussian models. Econometric Theory 20, 23–55.
Thompson, S.B., 2004b. Robust tests of the unit root hypothesis should not be modified. Econometric Theory 20, 360–381.
Van der Vaart, A., 2000. Asymptotic Statistics, 1st ed. Cambridge University Press.
Van der Vaart, A., Wellner, J., 1993. Weak Convergence and Empirical Processes, 2nd ed. Springer-Verlag.
West, K.D., 1988. Asymptotic normality, when regressors have a unit root. Econometrica 56, 1397–1417.
White, J.S., 1958. The limiting distribution of the serial correlation coefficient in the explosive case. The Annals of Mathematical Statistics 29, 1188–1197.
Journal of Econometrics 163 (2011) 215–230
Likelihood-based scoring rules for comparing density forecasts in tails

Cees Diks a, Valentyn Panchenko b, Dick van Dijk c,∗

a Center for Nonlinear Dynamics in Economics and Finance, Department of Quantitative Economics, University of Amsterdam, Roetersstraat 11, NL-1018 WB Amsterdam, The Netherlands
b School of Economics, Faculty of Business, University of New South Wales, Sydney, NSW 2052, Australia
c Econometric Institute, Erasmus University Rotterdam, P.O. Box 1738, NL-3000 DR Rotterdam, The Netherlands

Article info

Article history: Received 9 April 2009; received in revised form 10 April 2011; accepted 18 April 2011; available online 24 April 2011.
JEL classification: C12, C22, C52, C53
Keywords: Density forecast evaluation; Scoring rules; Weighted likelihood ratio scores; Conditional likelihood; Censored likelihood; Risk management

Abstract

We propose new scoring rules based on conditional and censored likelihood for assessing the predictive accuracy of competing density forecasts over a specific region of interest, such as the left tail in financial risk management. These scoring rules can be interpreted in terms of Kullback–Leibler divergence between weighted versions of the density forecast and the true density. Existing scoring rules based on weighted likelihood favor density forecasts with more probability mass in the given region, rendering predictive accuracy tests biased toward such densities. Using our novel likelihood-based scoring rules avoids this problem. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

The interest in density forecasts is rapidly expanding in both macroeconomics and finance. Undoubtedly, this is due to the increased awareness that point forecasts are not very informative unless some indication of their uncertainty is provided; see Granger and Pesaran (2000) and Garratt et al. (2003) for discussions of this issue. Density forecasts, representing a full predictive distribution of the random variable in question, provide the most complete measure of this uncertainty. Prominent macroeconomic applications are density forecasts of output growth and inflation obtained from a variety of sources, including statistical time series models (Clements and Smith, 2000), professional forecasters (Diebold et al., 1999), and central banks and other institutions producing so-called 'fan charts' for these variables (Clements, 2004; Mitchell and Hall, 2005). In finance, density forecasts play a pivotal role in risk management as they form the basis for risk
∗ Corresponding author. Tel.: +31 10 4081263; fax: +31 10 4089162. E-mail addresses: [email protected] (C. Diks), [email protected] (V. Panchenko), [email protected] (D. van Dijk). doi:10.1016/j.jeconom.2011.04.001
measures such as Value-at-Risk (VaR) and Expected Shortfall (ES); see Dowd (2005) and McNeil et al. (2005) for general overviews and Guidolin and Timmermann (2006) for a recent empirical application.1 The increasing popularity of density forecasts has naturally led to the development of statistical tools for evaluating their accuracy. The techniques that have been proposed for this purpose can be classified into two groups. First, several approaches have been put forward for testing the quality of an individual density forecast, relative to the data generating process. Following the seminal contribution of Diebold et al. (1998), the most prominent tests in this group are based on the probability integral transform (PIT) of
1 In addition, density forecasts are starting to be used in other financial decision problems, such as derivative pricing (Campbell and Diebold, 2005; Taylor and Buizza, 2006; Härdle and Hlávka, 2009) and asset allocation (Guidolin and Timmermann, 2007). It is also becoming more common to use density forecasts to assess the adequacy of predictive regression models for asset returns, including stocks (Perez-Quiros and Timmermann, 2001), interest rates (Hong et al., 2004; Egorov et al., 2006) and exchange rates (Sarno and Valente, 2005; Rapach and Wohar, 2006), as well as measures of financial market volatility (Bollerslev et al., 2009; Corradi et al., 2009).
C. Diks et al. / Journal of Econometrics 163 (2011) 215–230
Rosenblatt (1952).2 We refer to Clements (2005) and Corradi and Swanson (2006c) for in-depth surveys on specification tests for univariate density forecasts. The second group of evaluation tests aims to compare two or more competing density forecasts. This problem of relative predictive accuracy has been considered by Sarno and Valente (2004); Mitchell and Hall (2005); Corradi and Swanson (2005, 2006b); Amisano and Giacomini (2007) and Bao et al. (2004, 2007). All statistics in this group compare the relative distance between the competing density forecasts and the true (but unobserved) density, albeit in different ways. Sarno and Valente (2004) consider the integrated squared difference between the density forecast and the true density as distance measure, while Corradi and Swanson (2005, 2006b) employ the mean squared error between the cumulative distribution function (CDF) of the density forecast and the true CDF. The other studies in this group develop tests of equal predictive ability based on a comparison of the Kullback–Leibler Information Criterion (KLIC). Amisano and Giacomini (2007) provide an interesting interpretation of the KLIC-based comparison in terms of scoring rules, which are loss functions depending on the density forecast and the actually observed data. In particular, it is shown that the difference between the logarithmic scoring rule for two competing density forecasts corresponds exactly to their relative KLIC values. In many applications of density forecasts, we are mostly interested in a particular region of the density. Financial risk management is an example in case, where the main concern is obtaining an accurate description of the left tail of the distribution of asset returns. Bao et al. (2004) and Amisano and Giacomini (2007) suggest likelihood ratio (LR) tests based on weighting the KLIC-type logarithmic scoring rule for the purpose of evaluating and comparing density forecasts over a particular region. 
However, as mentioned by Corradi and Swanson (2006c), the accuracy of density forecasts in a specific region cannot be measured in a straightforward manner using the KLIC. The problem that occurs is that by construction the weighted logarithmic scoring rule favors density forecasts with more probability mass in the region of interest, rendering the resulting tests of equal predictive ability biased toward such density forecasts. In this paper we demonstrate that two density forecasts can still be compared on a specific region of interest by means of likelihood-based scoring rules in a natural way. Our proposed solution is to replace the full likelihood by the conditional likelihood, given that the actual observation lies in the region of interest, or by the censored likelihood, with censoring of the observations outside the region of interest. We show analytically that these new scoring rules can be interpreted in terms of Kullback–Leibler divergences between weighted versions of the density forecast and the actual density. This implies that the conditional likelihood and censored likelihood scoring rules favor density forecasts that approximate the true conditional density as closely as possible in the region of interest. Consequently, tests of equal predictive accuracy of density forecasts based on these scoring rules do not suffer from spurious rejections against densities with more probability mass in that region. This is confirmed by extensive Monte Carlo simulations, in which we assess the finite sample properties of the predictive ability tests for the different scoring rules. Here we also find that the censored likelihood scoring rule, which uses more of the relevant information present, performs better in most, but not all, cases considered.
2 Alternative test statistics based on the PIT are developed in Berkowitz (2001); Bai (2003); Bai and Ng (2005); Hong and Li (2005); Li and Tkacz (2006), and Corradi and Swanson (2006a), mainly to counter the problems caused by parameter estimation uncertainty and the assumption of correct dynamic specification under the null hypothesis.
We wish to emphasize that the framework developed here differs from the evaluation of conditional quantile forecasts as considered in Giacomini and Komunjer (2005). That approach focuses on the predictive accuracy of point forecasts for a specific quantile of interest, such as the VaR at a certain level, whereas the conditional and censored likelihood scoring rules intend to cover a broader region of the density. We do not claim that our methodology is a substitute for the quantile forecast evaluation test (or any other predictive accuracy test), but suggest that they may be used in a complementary way. Gneiting and Ranjan (2008) independently also address the tendency of the weighted LR test of Amisano and Giacomini (2007) to favor density forecasts with more probability mass in the region of interest, but from a quantile forecast evaluation perspective. They point out that this tendency is a consequence of the scoring rule not being proper (Winkler and Murphy, 1968; Gneiting and Raftery, 2007), meaning that an incorrect density forecast may receive a higher average score than the true conditional density. Exactly this gives rise to the problem of spuriously favoring densities with more probability mass in the region of interest. As an alternative Gneiting and Ranjan (2008) propose weighted quantile scoring rules. Our aim in this paper is different in that we specifically want to find alternative scoring rules that generalize the unweighted likelihood scoring rule. The two main reasons for pursuing this are, first, that likelihood-based score differences are invariant under transformations of the outcome space and, second, that they lead to LR statistics, which are known to have optimal power properties, as emphasized by Berkowitz (2001) in the context of density forecast evaluation. The remainder of this paper is organized as follows. 
In Section 2, we discuss conventional scoring rules based on the KLIC divergence for evaluating density forecasts and illustrate the problem with the resulting LR tests in case the logarithmic scores are weighted to focus on a particular region of interest. In Section 3 we put forward our alternative scoring rules based on conditional and censored likelihood and show analytically that the new scoring rules are proper. We assess the finite sample properties of tests of equal predictive accuracy of density forecasts based on the different scoring rules by means of extensive Monte Carlo simulation experiments in Section 4. We provide an empirical application concerning density forecasts for daily S&P 500 returns in Section 5, demonstrating the practical usefulness of the new scores. We summarize and conclude in Section 6.

2. Scoring rules for evaluating density forecasts

We consider a stochastic process $\{Z_t : \Omega \to \mathbb{R}^{k+1}\}_{t=1}^{T}$, defined on a complete probability space $(\Omega, \mathcal{F}, P)$, and identify $Z_t$ with $(Y_t, X_t')'$, where $Y_t : \Omega \to \mathbb{R}$ is the real valued random variable of interest and $X_t : \Omega \to \mathbb{R}^k$ is a vector of observable predictor variables. The information set at time $t$ is defined as $\mathcal{F}_t = \sigma(Z_1', \ldots, Z_t')'$. We consider the case where two competing forecast methods are available, each producing one-step-ahead density forecasts, i.e. predictive densities of $Y_{t+1}$, based on $\mathcal{F}_t$. The competing density forecasts of $Y_{t+1}$ are denoted by the probability density functions (pdfs) $\hat f_t(y)$ and $\hat g_t(y)$, respectively. As in Amisano and Giacomini (2007), by ‘forecast method’ we mean the set of choices that the forecaster makes at the time of the prediction, including the variables $X_t$, the econometric model (if any), and the estimation method. The only requirement that we impose on the forecast methods is that the density forecasts depend only on a finite number $m$ of most recent observations $Z_{t-m+1}, \ldots, Z_t$.
Forecast methods of this type arise naturally, for instance, when density forecasts are obtained from time series regression models, for which parameters are estimated with a moving window of m observations. The reason for focusing on forecast methods
rather than on forecast models is that this allows for treating parameter estimation uncertainty as an integral part of the density forecasts. Requiring the use of a finite (rolling) window of $m$ past observations for parameter estimation then considerably simplifies the asymptotic theory of tests of equal predictive accuracy, as demonstrated by Giacomini and White (2006). It also turns out to be convenient as it enables comparison of density forecasts based on both nested and non-nested models, in contrast to other approaches such as West (1996).

Our interest lies in comparing the relative performance of the one-step-ahead density forecasts $\hat f_t(y)$ and $\hat g_t(y)$. One of the approaches that has been put forward for this purpose is based on scoring rules, which are commonly used in probability forecast evaluation; see Diebold and Lopez (1996). In the current context, a scoring rule is a loss function $S^*(\hat f_t; y_{t+1})$ depending on the density forecast and the actually observed value $y_{t+1}$, such that a density forecast that is ‘better’ receives a higher score. Of course, what is considered to be a better forecast among two competing incorrect forecasts depends on the measure used to quantify divergences between distributions. However, as argued by Diebold et al. (1998) and Granger and Pesaran (2000), any rational user would prefer the true conditional density $p_t$ of $Y_{t+1}$ over an incorrect density forecast. This suggests that it is natural to focus, if possible, on scoring rules that are such that incorrect density forecasts $\hat f_t$ do not receive a higher average score than the true conditional density, that is,

\[
E_t\big(S(\hat f_t; Y_{t+1})\big) \le E_t\big(S(p_t; Y_{t+1})\big), \quad \text{for all } t.
\]
Following Gneiting and Raftery (2007), a scoring rule satisfying this condition will be called proper. It is useful to note that the correct density $p_t$ includes true parameters (if any). In practice, density forecasts typically involve estimated parameters. This implies that even if the density forecast $\hat f_t$ is based on a correctly specified model, but the model includes estimated parameters, the average score $E_t\big(S(\hat f_t; Y_{t+1})\big)$ may not achieve the upper bound $E_t\big(S(p_t; Y_{t+1})\big)$ due to nonvanishing estimation uncertainty. This reflects the fact that a density forecast based on a misspecified model with limited estimation uncertainty may be preferred over a density forecast based on the correct model specification having larger estimation uncertainty. Section 4.3 illustrates this issue with a Monte Carlo simulation. The above may seem to suggest that the notion of properness of scoring rules is of limited relevance in practice. Nevertheless, it does appear to be a desirable characteristic. Proper scoring rules are such that density forecasts receive a higher score when they approximate the true conditional density more closely, for example in the Kullback–Leibler sense as with the logarithmic score (2) discussed below. We have to keep in mind though that in the presence of nonvanishing estimation uncertainty, as accounted for in the adopted framework of Giacomini and White (2006), this may be a density forecast based on a misspecified model.

Given a scoring rule of one’s choice, there are various ways to construct tests of equal predictive ability. Giacomini and White (2006) distinguish tests for unconditional predictive ability and conditional predictive ability. In the present paper, for clarity of exposition, we focus on tests for unconditional predictive ability.3 Assume that two competing density forecasts $\hat f_t$ and $\hat g_t$ and corresponding realizations of the variable $Y_{t+1}$ are available for
3 The above inequality in terms of conditional expectations implies the same inequality in terms of unconditional expectations, that is, $E\big(E_t(S(\hat f_t; Y_{t+1}))\big) \le E\big(E_t(S(p_t; Y_{t+1}))\big) \Rightarrow E\big(S(\hat f_t; Y_{t+1})\big) \le E\big(S(p_t; Y_{t+1})\big)$.
$t = m, m+1, \ldots, T-1$. We may then compare $\hat f_t$ and $\hat g_t$ based on their average scores, by testing formally whether their difference is statistically significant. Defining the score difference

\[
d^*_{t+1} = S^*(\hat f_t; y_{t+1}) - S^*(\hat g_t; y_{t+1}),
\]

for a given scoring rule $S^*$, the null hypothesis of equal scores is given by

\[
H_0 : E(d^*_{t+1}) = 0, \quad \text{for all } t = m, m+1, \ldots, T-1.
\]

Let $\bar d^*_{m,n}$ denote the sample average of the score differences, that is, $\bar d^*_{m,n} = n^{-1} \sum_{t=m}^{T-1} d^*_{t+1}$ with $n = T - m$. In order to test $H_0$ against the alternative $H_a : E(\bar d^*_{m,n}) \ne 0$ (or $< 0$ or $> 0$), we may use a Diebold and Mariano (1995) type statistic

\[
t_{m,n} = \frac{\bar d^*_{m,n}}{\sqrt{\hat\sigma^2_{m,n}/n}}, \tag{1}
\]

where $\hat\sigma^2_{m,n}$ is a heteroskedasticity and autocorrelation-consistent (HAC) variance estimator of $\sigma^2_{m,n} = \mathrm{Var}\big(\sqrt{n}\,\bar d^*_{m,n}\big)$, which satisfies $\hat\sigma^2_{m,n} - \sigma^2_{m,n} \xrightarrow{P} 0$. The following theorem characterizes the asymptotic distribution of the test statistic under the null hypothesis.
Theorem 1. The statistic $t_{m,n}$ in (1) is asymptotically (as $n \to \infty$ with $m$ fixed) standard normally distributed under the null hypothesis if: (i) $\{Z_t\}$ is $\phi$-mixing of size $-q/(2q-2)$ with $q \ge 2$, or $\alpha$-mixing of size $-q/(q-2)$ with $q > 2$; (ii) $E|d^*_{t+1}|^{2q} < \infty$ for all $t$; and (iii) $\sigma^2_{m,n} = \mathrm{Var}\big(\sqrt{n}\,\bar d^*_{m,n}\big) > 0$ for all $n$ sufficiently large.
Proof. This is Theorem 4 of Giacomini and White (2006), where a proof can also be found.

The proof of this theorem as given by Giacomini and White (2006) is based on the central limit theorems for dependent heterogeneous processes given in Wooldridge and White (1988). The conditions in Theorem 1 are rather weak in that they allow for nonstationarity and heterogeneity. However, note that conditions (i) and (ii) jointly imply the existence of at least the fourth moment of $d^*_{t+1}$ for all $t$. Theorem 1.3 of Merlevède and Peligrad (2000) shows that asymptotic normality can also be achieved under weaker distributional assumptions (existence of the second moment plus a condition relating the behavior of the tail of the distribution of $|d^*_{t+1}|$ to the mixing rate). However, strict stationarity is assumed by Merlevède and Peligrad (2000). The conditions required for asymptotic normality of normalized partial sums of dependent heterogeneous random variables have been further explored by De Jong (1997).

2.1. The logarithmic scoring rule and the Kullback–Leibler information criterion

Mitchell and Hall (2005), Amisano and Giacomini (2007), and Bao et al. (2004, 2007) focus on the logarithmic scoring rule

\[
S^l(\hat f_t; y_{t+1}) = \log \hat f_t(y_{t+1}), \tag{2}
\]
assigning a high score to a density forecast if the observation $y_{t+1}$ falls within a region with high predictive density $\hat f_t$, and a low score if it falls within a region with low predictive density. Based on the $n$ observations available for evaluation, $y_{m+1}, \ldots, y_T$, the density forecasts $\hat f_t$ and $\hat g_t$ can be ranked according to their average scores $n^{-1}\sum_{t=m}^{T-1} \log \hat f_t(y_{t+1})$ and $n^{-1}\sum_{t=m}^{T-1} \log \hat g_t(y_{t+1})$. The density forecast yielding the highest average score would obviously be the preferred one. The sample average of the log score
differences $d^l_{t+1} = \log \hat f_t(y_{t+1}) - \log \hat g_t(y_{t+1})$ may be used to test whether the predictive accuracy is significantly different, using the test statistic defined in (1). Note that this coincides with the log-likelihood ratio of the two competing density forecasts. Intuitively, the logarithmic scoring rule is closely related to information theoretic goodness-of-fit measures such as the Kullback–Leibler Information Criterion (KLIC), which for the density forecast $\hat f_t$ is defined as

\[
\mathrm{KLIC}(\hat f_t) = E_t\big(\log p_t(Y_{t+1}) - \log \hat f_t(Y_{t+1})\big)
= \int_{-\infty}^{\infty} p_t(y_{t+1}) \log\frac{p_t(y_{t+1})}{\hat f_t(y_{t+1})}\, dy_{t+1}, \tag{3}
\]
where $p_t$ denotes the true conditional density. Obviously, a higher expected value (with respect to the true density $p_t$) of the logarithmic score in (2) is equivalent to a lower value of the KLIC in (3). Under the constraint $\int_{-\infty}^{\infty} \hat f_t(y)\,dy = 1$, the expectation of $\log \hat f_t(Y_{t+1})$ with respect to the true density $p_t$ is maximized by taking $\hat f_t = p_t$. This follows from the fact that for any density $\hat f_t$ different from $p_t$,

\[
E_t\left(\log\frac{\hat f_t(Y_{t+1})}{p_t(Y_{t+1})}\right)
\le E_t\left(\frac{\hat f_t(Y_{t+1})}{p_t(Y_{t+1})}\right) - 1
= \int_{-\infty}^{\infty} p_t(y)\,\frac{\hat f_t(y)}{p_t(y)}\,dy - 1 = 0,
\]
where the inequality follows from applying log x ≤ x − 1 to fˆt /pt . It thus follows that the quality of a normalized density forecast fˆt can be measured properly by the log-likelihood score S l (fˆt ; yt +1 ) and, equivalently, by the KLIC in (3). An advantage of the KLIC is that it has an absolute lower bound equal to zero, which is achieved if and only if the density forecast fˆt is identical to the true distribution pt . As such, its value provides a measure of the divergence between the candidate density fˆt and pt . However, since pt is unknown, the KLIC cannot be evaluated directly (but we return to this point below). We can nevertheless use the KLIC to measure the relative accuracy of two competing densities, as discussed in Mitchell and Hall (2005) and Bao et al. (2004, 2007). Taking the difference KLIC(ˆgt ) − KLIC(fˆt ) the term Et (log pt (Yt +1 )) drops out, solving the problem that the true density pt is unknown. This in fact renders the logarithmic score difference dlt +1 = log fˆt (yt +1 ) − log gˆt (yt +1 ). Summarizing the above, the null hypothesis of equal average logarithmic scores for the density forecasts fˆt and gˆt actually corresponds with the null hypothesis of equal KLICs. Given that the KLIC measures the divergence of the density forecasts from the true density, the use of the logarithmic scoring rule boils down to assessing which of the competing densities comes closest to the true distribution. Bao et al. (2004, 2007) discuss an extension to compare multiple density forecasts based on their KLIC values, where the null hypothesis is that none of the available density forecasts is more accurate than a given benchmark, in the spirit of the reality check of White (2000). Mitchell and Hall (2005) and Hall and Mitchell (2007) also use the relative KLIC values as a basis for combining density forecasts. It is useful to note that both Mitchell and Hall (2005) and Bao et al. 
(2004, 2007) employ the KLIC for testing the null hypothesis of an individual density forecast being correct, that is, $H_0 : \mathrm{KLIC}(\hat f_t) = 0$. The problem that the true density $p_t$ in (3) is unknown then is circumvented by using the result established by Berkowitz (2001) that the KLIC of the density forecast $\hat f_t$ relative to $p_t$ is equal to the KLIC of the density of the inverse normal transform of the PIT of $\hat f_t$ relative to the standard normal density. Defining $z_{\hat f,t+1} = \Phi^{-1}(\hat F_t(y_{t+1}))$ with $\hat F_t(y_{t+1}) = \int_{-\infty}^{y_{t+1}} \hat f_t(y)\,dy$ and $\Phi$ the standard normal distribution function, it holds true that

\[
\log p_t(y_{t+1}) - \log \hat f_t(y_{t+1}) = \log q_t(z_{\hat f,t+1}) - \log \phi(z_{\hat f,t+1}),
\]

where $q_t$ is the true conditional density of $z_{\hat f,t+1}$ and $\phi$ is the standard normal density. This result states that the logarithmic scores are invariant to the inverse normal transform of $y_{t+1}$, which is essentially a consequence of the general invariance of likelihood ratios under smooth coordinate transformations. Of course, in practice the density $q_t$ is not known either, but it may be estimated using a flexible density function. The resulting KLIC estimate then allows testing for departures of $q_t$ from the standard normal.

2.2. Weighted logarithmic scoring rules

In empirical applications of density forecasting it frequently occurs that a particular region of the density is of most interest. For example, in risk management applications such as VaR and ES estimation, an accurate description of the left tail of the distribution of asset returns obviously is of crucial importance. In that context, it seems natural to focus on the performance of density forecasts in the region of interest and pay less attention to (or even ignore) the remaining part of the distribution. Within the framework of scoring rules, an obvious way to pursue this is to construct a weighted scoring rule, using a weight function $w_t(y)$ to emphasize the region of interest (see Franses and van Dijk (2003) for a similar idea in the context of testing equal predictive accuracy of point forecasts). Along this line Amisano and Giacomini (2007) propose the weighted logarithmic (wl) scoring rule

\[
S^{wl}(\hat f_t; y_{t+1}) = w_t(y_{t+1}) \log \hat f_t(y_{t+1}) \tag{4}
\]
to assess the quality of density forecast $\hat f_t$ on a certain region defined by the properties of $w_t(y_{t+1})$. The weighted average scores $n^{-1}\sum_{t=m}^{T-1} w_t(y_{t+1}) \log \hat f_t(y_{t+1})$ and $n^{-1}\sum_{t=m}^{T-1} w_t(y_{t+1}) \log \hat g_t(y_{t+1})$ can be used for ranking two competing forecasts, while the weighted score difference

\[
d^{wl}_{t+1} = S^{wl}(\hat f_t; y_{t+1}) - S^{wl}(\hat g_t; y_{t+1})
= w_t(y_{t+1})\big(\log \hat f_t(y_{t+1}) - \log \hat g_t(y_{t+1})\big), \tag{5}
\]

forms the basis for testing the null hypothesis of equal weighted scores, $H_0 : E(d^{wl}_{t+1}) = 0$, for all $t = m, m+1, \ldots, T-1$, by means of a Diebold–Mariano type statistic of the form (1).

For the sake of argument it is instructive to consider the case of a ‘threshold’ weight function $w_t(y) = I(y \le r)$, with a fixed threshold $r$, where $I(A) = 1$ if the event $A$ occurs and zero otherwise. This is a simple example of a weight function we might consider for evaluation of the left tail in risk management applications. In this case, however, the weighted logarithmic score results in predictive ability tests that are biased toward densities with more probability mass in the left tail. This can be seen by considering the situation where $\hat g_t(y) > \hat f_t(y)$ for all $y$ smaller than some given value $y^*$, say. Using $w_t(y) = I(y \le r)$ for some $r < y^*$ in (4) implies that the weighted score difference $d^{wl}_{t+1}$ in (5) is never positive, and strictly negative for observations below the threshold value $r$, such that $E(d^{wl}_{t+1})$ is negative. Obviously, this can have far-reaching consequences when comparing density forecasts with different tail behavior. In particular, it may happen that a (relatively) fat-tailed distribution $\hat g_t$ is favored over a thin-tailed distribution $\hat f_t$, even if the latter is the true distribution from which the data are drawn, as the following example illustrates.
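Before turning to the example, the sign argument above can be checked by direct numerical integration. The sketch below is an illustration only, not part of the original analysis: it assumes that the true density is standard normal, takes the unit-variance Student-t(5) density as the competing forecast, and uses hypothetical helper names (`norm_logpdf`, `t_logpdf`, `expected_log_score_diff`) together with a simple composite Simpson rule on a truncated range.

```python
import math

def norm_logpdf(y):
    # log density of the standard normal forecast f
    return -0.5 * y * y - 0.5 * math.log(2.0 * math.pi)

def t_logpdf(y, nu=5.0):
    # log density of the Student-t forecast g, rescaled to unit variance (nu > 2)
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log((nu - 2) * math.pi))
    return log_c - 0.5 * (nu + 1) * math.log1p(y * y / (nu - 2))

def expected_log_score_diff(lower, upper, steps=20000):
    """Composite Simpson approximation of
    E_p[ I(lower <= Y <= upper) * (log f(Y) - log g(Y)) ] with p = f = N(0, 1)."""
    h = (upper - lower) / steps          # steps must be even for Simpson's rule
    total = 0.0
    for i in range(steps + 1):
        y = lower + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * math.exp(norm_logpdf(y)) * (norm_logpdf(y) - t_logpdf(y))
    return total * h / 3.0

# Unweighted log score: the expected difference equals KLIC(g) and is positive.
full = expected_log_score_diff(-12.0, 12.0)
# Threshold weight I(y <= r) with r = -2.5 (range truncated at -12).
tail = expected_log_score_diff(-12.0, -2.5)
print(full, tail)
```

Under these assumptions the unweighted expected difference favors the true normal density, whereas the threshold-weighted difference is negative and therefore favors the fat-tailed alternative.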
Example 1. Suppose we wish to compare the accuracy of two density forecasts for $Y_{t+1}$, one being the standard normal distribution with pdf

\[
\hat f_t(y) = (2\pi)^{-1/2} \exp(-y^2/2),
\]

and the other being the Student-t distribution with $\nu$ degrees of freedom, standardized to unit variance, with pdf

\[
\hat g_t(y) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{(\nu-2)\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{y^2}{\nu-2}\right)^{-(\nu+1)/2}, \quad \text{with } \nu > 2.
\]

Fig. 1 shows these density functions for the case $\nu = 5$, as well as the relative log-likelihood score $\log \hat f_t(y_{t+1}) - \log \hat g_t(y_{t+1})$. The relative score function is negative in the left tail $(-\infty, y^*)$, with $y^* \approx -2.5$. Now consider the situation that we have a sample $y_{m+1}, \ldots, y_T$ of $n$ observations from an unknown density on $(-\infty, \infty)$ for which $\hat f_t(y)$ and $\hat g_t(y)$ are competing candidates, and we use a threshold weight function $w_t(y) = I(y \le r)$, with fixed threshold $r$, to concentrate on the left tail. It follows from the lower panel of Fig. 1 that if the threshold $r < y^*$, the average weighted log-likelihood score difference $\bar d^{wl}_{m,n}$ can never be positive and will be strictly negative whenever there are observations in the tail. Evidently, the test of equal predictive accuracy will then favor the fat-tailed Student-t density $\hat g_t(y)$, even if the true density is the standard normal $\hat f_t(y)$.

Fig. 1. Probability density functions of the standard normal distribution $\hat f_t(y_{t+1})$ and standardized Student-t(5) distribution $\hat g_t(y_{t+1})$ (upper panel) and corresponding relative log-likelihood scores $\log \hat f_t(y_{t+1}) - \log \hat g_t(y_{t+1})$ (lower panel).

2.3. Weighted probability scores

The issue we are signaling has been reported independently by Gneiting and Ranjan (2008). As they point out, the wl score does not satisfy the properness property, in the sense that there can be incorrect density forecasts $\hat f_t$ that receive a higher average score than the actual conditional density $p_t$. As a consequence, the associated test of equal predictive accuracy could even suggest that the incorrect density forecast is significantly better than the true density. As discussed before, it seems reasonable to focus on proper scoring rules to avoid such inconsistencies. However, there are many different proper scoring rules one might use, raising the question which rules are suitable candidates in practice. Our main reason to focus on KLIC-based scoring rules is the close connection with likelihood ratio tests, which are known to perform well in many statistical settings. As mentioned before, the test for equal predictive ability based on the logarithmic scoring rule is nothing but a likelihood ratio test.

Before introducing our proper likelihood-based scoring rules, we briefly summarize the scoring rules proposed by Gneiting and Ranjan (2008), which may also be used for comparing density forecasts in specific regions of interest. Their starting point is the continuous ranked probability score (CRPS), which for the density forecast $\hat f_t$ is defined as

\[
\mathrm{CRPS}(\hat f_t, y_{t+1}) = \int_{-\infty}^{\infty} \mathrm{PS}\big(\hat F_t(r), I(y_{t+1} \le r)\big)\, dr, \tag{6}
\]

where

\[
\mathrm{PS}\big(\hat F_t(r), I(y_{t+1} \le r)\big) = \big(I(y_{t+1} \le r) - \hat F_t(r)\big)^2
\]

is the Brier probability score for the probability forecast $\hat F_t(r) = \int_{-\infty}^{r} \hat f_t(y)\,dy$ of the event $Y_{t+1} \le r$. Equivalently, the CRPS may be written in terms of $\alpha$-quantile forecasts $\hat q_{t,\alpha} = \hat F_t^{-1}(\alpha)$, as

\[
\mathrm{CRPS}(\hat f_t, y_{t+1}) = \int_0^1 \mathrm{QS}_\alpha(\hat q_{t,\alpha}, y_{t+1})\, d\alpha, \tag{7}
\]
where $\mathrm{QS}_\alpha(\hat q_{t,\alpha}, y_{t+1}) = 2\big(\alpha - I(y_{t+1} < \hat q_{t,\alpha})\big)(y_{t+1} - \hat q_{t,\alpha})$ is the quantile score (also known as the ‘tick’ or ‘check’ score) function; see also Giacomini and Komunjer (2005). As suggested by Gneiting and Ranjan (2008), the CRPS in (7) may be generalized to emphasize certain regions of interest in the evaluation of density forecasts. Specifically, a weighted quantile scoring rule (wqs) may be defined as

\[
S^{wqs}(\hat f_t; y_{t+1}) = -\int_0^1 v(\alpha)\, \mathrm{QS}_\alpha(\hat q_{t,\alpha}, y_{t+1})\, d\alpha,
\]
where $v(\alpha)$ is a nonnegative weight function on the unit interval and the minus sign on the right-hand side is inserted such that density forecasts with higher scores are preferred. Similarly, a weighted probability score (wps) is obtained from (6) as

\[
S^{wps}(\hat f_t; y_{t+1}) = -\int_{-\infty}^{\infty} w_t(r)\, \mathrm{PS}\big(\hat F_t(r), I(y_{t+1} \le r)\big)\, dr, \tag{8}
\]
for some weight function $w_t$. Note that the same wps scoring rule was proposed by Corradi and Swanson (2006b) for evaluating density forecasts in case a specific region of the density is of interest rather than its whole support. In the Monte Carlo simulations in Section 4, we include Diebold–Mariano type tests based on $S^{wps}(\hat f_t; y_{t+1})$ for comparison purposes.

3. Scoring rules based on conditional and censored likelihood

KLIC-based scoring rules for evaluating and comparing density forecasts in a specific region of interest $A_t \subset \mathbb{R}$ can be obtained in a relatively straightforward manner. Specifically, it is natural to replace the full likelihood in (2) either by the conditional likelihood, given that the observation lies in the region of interest, or by the censored likelihood. The conditional likelihood (cl) score function, given a region of interest $A_t$, is given by

\[
S^{cl}(\hat f_t; y_{t+1}) = I(y_{t+1} \in A_t) \log\left(\frac{\hat f_t(y_{t+1})}{\int_{A_t} \hat f_t(s)\,ds}\right). \tag{9}
\]

The main argument for using this scoring rule would be to evaluate density forecasts based only on their behavior in the region of interest $A_t$. The division by $\int_{A_t} \hat f_t(s)\,ds$ serves the purpose of normalizing the density on the region of interest, such that competing density forecasts can be compared in terms of their relative KLIC values, as discussed before.
However, due to this normalization, the cl scoring rule does not take into account the accuracy of the density forecast for the total probability of the region of interest. For example, in case $A_t$ is the left tail $y_{t+1} \le r$, the conditional likelihood ignores whether the tail probability implied by $\hat f_t$ matches with the frequency at which tail observations actually occur. As a result, the scoring rule in (9) attaches comparable scores to density forecasts that have similar tail shapes but may have completely different tail probabilities. This tail probability is obviously relevant for risk management purposes, in particular for VaR evaluation, and therefore it would be useful to include it in the density forecast evaluation. This can be achieved by using the censored likelihood (csl) score function, given by

\[
S^{csl}(\hat f_t; y_{t+1}) = I(y_{t+1} \in A_t) \log \hat f_t(y_{t+1}) + I(y_{t+1} \in A_t^c) \log\left(\int_{A_t^c} \hat f_t(s)\,ds\right), \tag{10}
\]
where $A_t^c$ is the complement of $A_t$. This scoring rule uses the likelihood associated with having an observation outside the region of interest, but apart from that ignores the shape of $\hat f_t$ outside $A_t$. In that sense this scoring rule is similar to the log-likelihood used in the Tobit model for random variables that cannot be observed above a certain threshold value (see Tobin, 1958).

The conditional and censored likelihood scoring rules as discussed above focus on a sharply defined region of interest $A_t$. It is possible to adapt these score functions in order to emphasize certain parts of the outcome space more generally, by going back to the original idea of using a weight function $w_t(y)$ as in (4). For this purpose, note that by setting $w_t(y) = I(y \in A_t)$ the scoring rules in (9) and (10) can be rewritten as

\[
S^{cl}(\hat f_t; y_{t+1}) = w_t(y_{t+1}) \log\left(\frac{\hat f_t(y_{t+1})}{\int w_t(s) \hat f_t(s)\,ds}\right), \tag{11}
\]

and

\[
S^{csl}(\hat f_t; y_{t+1}) = w_t(y_{t+1}) \log \hat f_t(y_{t+1}) + \big(1 - w_t(y_{t+1})\big) \log\left(1 - \int w_t(s) \hat f_t(s)\,ds\right). \tag{12}
\]

At this point, we make the following assumptions.

Assumption 1. The density forecasts $\hat f_t$ and $\hat g_t$ satisfy $\mathrm{KLIC}(\hat f_t) < \infty$ and $\mathrm{KLIC}(\hat g_t) < \infty$, where $\mathrm{KLIC}(h_t) = \int p_t(y) \log\big(p_t(y)/h_t(y)\big)\,dy$ is the Kullback–Leibler divergence between the density forecast $h_t$ and the true conditional density $p_t$.

Assumption 2. The weight function $w_t(y)$ is such that (a) it is determined by the information available at time $t$, and hence a function of $\mathcal{F}_t$, (b) $0 \le w_t(y) \le 1$, and (c) $\int w_t(y) p_t(y)\,dy > 0$.

Assumption 1 ensures that the expected score differences for the competing density forecasts are finite. Assumption 2(c) is needed to avoid cases where $w_t(y)$ takes strictly positive values only outside the support of the data. The following lemma shows that the generalized cl and csl scoring rules in (11) and (12) are proper, and hence cannot lead to spurious rejections against wrong alternatives just because these have more probability mass in the region(s) of interest.

Lemma 1. Under Assumptions 1 and 2, the generalized conditional likelihood scoring rule given in (11) and the generalized censored likelihood scoring rule given in (12) are proper.
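The content of Lemma 1 can be made concrete with a small numerical check. The sketch below is an illustration under stated assumptions rather than a verification of the lemma: the true density is taken to be standard normal, the candidate forecasts are the family N(0, s²) on a grid of scales s, the weight function is the threshold $I(y \le r)$ with the arbitrary choice r = −1.5, and the function names are hypothetical. It computes the expected wl, cl and csl scores by Simpson integration and reports the scale at which each is maximized.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_pdf(y, s):
    # log density of the N(0, s^2) forecast
    return -math.log(s) - 0.5 * math.log(2.0 * math.pi) - 0.5 * (y / s) ** 2

def expected_scores(s, r, lower=-10.0, steps=4000):
    """Expected wl, cl and csl scores of the N(0, s^2) forecast when
    Y ~ N(0, 1) and the weight function is w(y) = I(y <= r)."""
    h = (r - lower) / steps              # steps must be even for Simpson's rule
    total = 0.0
    for i in range(steps + 1):
        y = lower + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * math.exp(log_pdf(y, 1.0)) * log_pdf(y, s)
    wl = total * h / 3.0                 # E[ w(Y) log f(Y) ], as in (4)
    p_tail = norm_cdf(r)                 # true probability of the region
    f_tail = norm_cdf(r / s)             # forecast probability of the region
    cl = wl - p_tail * math.log(f_tail)                  # expectation of (11)
    csl = wl + (1.0 - p_tail) * math.log(1.0 - f_tail)   # expectation of (12)
    return wl, cl, csl

r = -1.5
grid = [k / 10 for k in range(5, 31)]    # candidate scales 0.5, 0.6, ..., 3.0
scores = {s: expected_scores(s, r) for s in grid}
best = {name: max(grid, key=lambda s: scores[s][i])
        for i, name in enumerate(("wl", "cl", "csl"))}
print(best)
```

In this setup the cl and csl expected scores peak at the true scale s = 1, in line with properness, while the weighted logarithmic score peaks at a scale well above 1, that is, at a forecast with too much mass in the tail.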
Fig. 2. Empirical CDFs of mean relative scores $\bar d^*_{m,n}$ for the weighted logarithmic (wl) scoring rule in (4), the conditional likelihood (cl) in (11), and the censored likelihood (csl) in (12) for series of $n = 2000$ independent observations from a standard normal distribution. The scoring rules are based on the threshold weight function $w_t(y) = I(y \le r)$ with $r = -2.5$. The relative score is defined as the score for the (correct) standard normal density minus the score for the standardized Student-t(5) density. The graph is based on 10,000 replications.
The proof of this lemma is given in the Appendix. The proof clarifies that the scoring rules in (11) and (12) can be interpreted in terms of Kullback–Leibler divergences between weighted versions of the density forecast and the actual density. We may test the null hypothesis of equal performance of two density forecasts $\hat f_t(y_{t+1})$ and $\hat g_t(y_{t+1})$ based on the conditional likelihood score (11) or the censored likelihood score (12) in the same manner as before. That is, given a sample of density forecasts and corresponding realizations for $n$ time periods $t = m, m+1, \ldots, T-1$, we may form the relative scores $d^{cl}_{t+1} = S^{cl}(\hat f_t; y_{t+1}) - S^{cl}(\hat g_t; y_{t+1})$ and $d^{csl}_{t+1} = S^{csl}(\hat f_t; y_{t+1}) - S^{csl}(\hat g_t; y_{t+1})$ and use these for computing Diebold–Mariano type test statistics as given in (1).
Example 1 (Continued). We revisit the example from the previous section in order to illustrate the properties of the various scoring rules and the associated tests for comparing the accuracy of competing density forecasts. We generate 10,000 series of $n = 2000$ independent observations $y_{t+1}$ from a standard normal distribution. For each sequence we compute the weighted logarithmic scores in (4), the conditional likelihood scores in (11), and the censored likelihood scores in (12). We use the threshold weight function $w_t(y) = I(y \le r)$, with the threshold fixed at $r = -2.5$. The scores are computed for the (correct) standard normal density $\hat f_t$ and for the standardized Student-t density $\hat g_t$ with five degrees of freedom. Fig. 2 shows the empirical CDF of the mean relative scores $\bar d^*_{m,n}$, where $*$ is wl, cl or csl. The average wl scores take almost exclusively negative values, which means that, on average, they attach a lower score to the correct normal distribution than to the Student-t distribution, cf. Fig. 1, indicating a bias in the corresponding test statistic toward the incorrect, fat-tailed distribution. The cl and csl scoring rules both correctly favor the true normal density. The censored likelihood rule appears to be better at detecting the inadequacy of the Student-t distribution, in that its relative scores stochastically dominate those based on the conditional likelihood.

To illustrate the behavior of the scoring rules obtained under smooth weight functions we consider the logistic weight function
$$ w_t(y) = \frac{1}{1 + \exp(a(y - r))} \quad \text{with } a > 0. \tag{13} $$
This sigmoidal function changes monotonically from 1 to 0 as $y$ increases, while $w_t(r) = 1/2$ and the slope parameter $a$ determines the speed of the transition. In the limit as $a \to \infty$, the threshold weight function $I(y \le r)$ is recovered. We fix the center at $r = -2.5$ and vary the slope parameter $a$ among the values 3, 4, 6, and 10. For $a = 10$, the logistic weight function is already very close to
C. Diks et al. / Journal of Econometrics 163 (2011) 215–230
Fig. 3. Empirical CDFs of mean relative scores $\bar d^*_{m,n}$ for the generalized conditional likelihood (cl) and censored likelihood (csl) scoring rules for series of $n = 2000$ independent observations from a standard normal distribution. The scoring rules are based on the logistic weight function $w_t(y)$ defined in (13) for various values of the slope parameter $a$. The relative score is defined as the score for the (correct) standard normal density minus the score for the standardized Student-t(5) density. The graph is based on 10,000 replications. [Four panels, for $a = 3$, $a = 4$, $a = 6$ and $a = 10$; horizontal axis: score; vertical axis: empirical CDF; curves: cl and csl.]
the threshold weight function $I(y \le r)$, such that for larger values of $a$ the score distributions essentially do not change anymore. The integrals $\int w_t(y) \hat f_t(y)\, dy$ and $\int w_t(y) \hat g_t(y)\, dy$ are determined numerically by averaging $w_t(Y_{t+1})$ over a large number ($10^6$) of simulated random variables $Y_{t+1}$ with density $\hat f_t$ and $\hat g_t$, respectively.

Fig. 3 shows the empirical CDFs of the mean relative scores $\bar d^*_{m,n}$ obtained with the conditional likelihood and censored likelihood scoring rules for the different values of $a$. It can be observed that the difference between the scores increases as $a$ becomes larger. In fact, for the smoothest weight function considered ($a = 3$) the two score distributions are very similar. The cl and csl score distributions become more alike for smaller values of $a$ because, as $a \to 0$, $w_t(y)$ in (13) converges to a constant equal to $1/2$ for all values of $y$, so that $w_t(y) - (1 - w_t(y)) \to 0$, and moreover $\int w_t(y) \hat f_t(y)\, dy = \int w_t(y) \hat g_t(y)\, dy \to 1/2$. Consequently, both scoring rules converge to the unconditional likelihood (up to a constant factor 2) and the relative scores $d^{cl}_{t+1}$ and $d^{csl}_{t+1}$ have the limit $\frac{1}{2} (\log \hat f_t(y_{t+1}) - \log \hat g_t(y_{t+1}))$.

We close this section with some remarks on the weight function
wt (y) defining the region that is emphasized in the density forecast evaluation. The conditional and censored likelihood scoring rules may be applied with arbitrary weight functions, subject to the conditions stated in Lemma 1. The appropriate choice of wt (y) obviously depends on the interests of the forecast user. The threshold (or logistic) weight function considered in the example above seems a natural choice in risk management applications, as the left tail behavior of the density forecast is of most concern there. In other applications however the focus may be on different regions. For example, for monetary policymakers aiming to keep inflation within a certain range, the central part of the density may be of most interest, suggesting a weight function such as wt (y) = I(rl ≤ y ≤ ru ), for certain lower and upper bounds rl and ru .
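The limiting behavior of the logistic weight function in (13) discussed in Example 1 can be checked numerically. The sketch below evaluates the integral of $w_t(y)$ times a standard normal density for several slope values. It substitutes deterministic trapezoidal quadrature for the Monte Carlo averaging used in the paper, and the function names, grid, and integration limits are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def logistic_weight(y, a, r=-2.5):
    """Logistic weight function of Eq. (13), arranged to avoid overflow."""
    z = a * (y - r)
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def logistic_mass(pdf, a, r=-2.5, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal approximation of the integral of w_t(s) * pdf(s)."""
    h = (hi - lo) / steps
    f = lambda s: logistic_weight(s, a, r) * pdf(s)
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


normal = NormalDist()
for a in (0.001, 3, 4, 6, 10, 1000):
    print(a, logistic_mass(normal.pdf, a))
print("threshold limit:", normal.cdf(-2.5))
```

For very small $a$ the integral approaches $1/2$, and for very large $a$ it approaches the probability $P(Y \le r)$ implied by the threshold weight function, in line with the two limits described in the text.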
The preceding implies that essentially it is also up to the forecast user to set the parameter(s) in the weight function, such as the threshold $r$ in $w_t(y) = I(y \le r)$. For example, when $Y_t$ represents the return on a given portfolio, $r$ may be set equal to a certain quantile of the return distribution such that it corresponds with a target VaR level. In practice, $r$ will then have to be estimated from historical data and might be set equal to the particular quantile of the $m$ observations in the moving window that is used for constructing the density forecast at time $t$. This makes the weight function dynamic, i.e. $w_t(y) = I(y \le r_t)$, while it also involves estimation uncertainty, namely in the threshold $r_t$. As shown by Lemma 1, as long as the weight function $w_t$ is conditionally (given $\mathcal{F}_t$) independent of $Y_{t+1}$, the properness property of the conditional and censored likelihood scoring rules is not affected. However, nonvanishing estimation uncertainty in the threshold may affect the power of the test of equal predictive accuracy. In Section 4.3, we verify this numerically with Monte Carlo simulations.

4. Monte Carlo simulations

In this section we examine the implications of using the weighted logarithmic scoring rule in (4), the conditional likelihood score in (11), the censored likelihood score in (12), and the weighted probability score in (8) for constructing a test of equal predictive ability of two competing density forecasts in finite samples. Specifically, we consider the size and power properties of the Diebold–Mariano type statistic as given in (1) for testing the null hypothesis that the two competing density forecasts have equal expected scores, or

$$ H_0: E(d^*_{t+1}) = 0, \quad \text{for } t = m, m+1, \ldots, T-1, $$

under scoring rule $*$, where $*$ is either wl, wps, cl or csl. As before $m$ denotes the length of the rolling window used for constructing the density forecast and $n = T - m$ denotes the number of forecasts. Throughout we use a HAC estimator for the asymptotic variance of the average relative score $\bar d^*_{m,n}$, that is $\hat\sigma^2_{m,n} = \hat\gamma_0 + 2 \sum_{k=1}^{K-1} a_k \hat\gamma_k$,
where $\hat\gamma_k$ denotes the lag-$k$ sample covariance of the sequence $\{d^*_{t+1}\}_{t=m}^{T-1}$ and $a_k$ are the Bartlett weights $a_k = 1 - k/K$ with $K = \lfloor n^{1/4} \rfloor$. We focus on one-sided rejection rates to highlight the fact that some of the scoring rules may favor a wrong density forecast over a correct one.

Concerning the implementation of the wps rule in (8), it is useful to note that it is in fact not essential for the properness of this score function to use an integral. As mentioned by Gneiting and Ranjan (2008), a weighted sum over a finite number of $y$-values also renders a suitable scoring rule. With this in mind, we do not attempt to obtain an accurate numerical approximation to the integral in (8), which is computationally very demanding, but simply use a discretized version with a discretization step of the $y$-variable of 0.1.

Initially, we examine the size and power properties of the test of equal predictive ability in an environment that does not involve parameter estimation uncertainty in Sections 4.1 and 4.2, to demonstrate the pitfalls when using the wl scoring rule and the benefits of the cl and csl alternatives most clearly. The role of estimation uncertainty, both in the density forecasts and in the weighting function, is addressed explicitly in Section 4.3.
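The HAC variance estimator with Bartlett weights and the resulting test statistic can be sketched as follows. Two points are assumptions rather than statements from the text above: the sample autocovariances use a $1/n$ normalization, and the statistic is taken to be the mean score difference divided by $\sqrt{\hat\sigma^2_{m,n}/n}$, which is the usual form of a Diebold–Mariano type statistic. The function names are hypothetical.

```python
import math


def hac_variance(d):
    """HAC variance estimate with Bartlett weights a_k = 1 - k/K, K = floor(n^(1/4))."""
    n = len(d)
    mean = sum(d) / n
    dev = [x - mean for x in d]

    def gamma(k):
        # Lag-k sample autocovariance; the 1/n normalization is an assumption.
        return sum(dev[t] * dev[t - k] for t in range(k, n)) / n

    # Largest integer K with K**4 <= n, computed without floating-point roots.
    K = 1
    while (K + 1) ** 4 <= n:
        K += 1

    variance = gamma(0)
    for k in range(1, K):
        variance += 2.0 * (1.0 - k / K) * gamma(k)
    return variance


def dm_statistic(d):
    """Mean score difference scaled by its estimated standard error (assumed form)."""
    n = len(d)
    return (sum(d) / n) / math.sqrt(hac_variance(d) / n)


# Usage with an artificial sequence of 16 score differences (so K = 2).
d = [1.0, 2.0, 3.0, 4.0] * 4
print(hac_variance(d), dm_statistic(d))
```

A one-sided test would compare the statistic with a standard normal critical value, rejecting in favor of the first forecast for large positive values and in favor of the second for large negative values.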
Fig. 4. One-sided rejection rates of the Diebold–Mariano type test statistic of equal predictive accuracy defined in (1) when using the weighted logarithmic (w l), the conditional likelihood (cl), and the censored likelihood (csl) scoring rules, under the weight function wt (y) = I(−r ≤ y ≤ r ) for sample size n = 500, based on 10,000 replications. The DGP is i.i.d. standard normal. The test compares the predictive accuracy of N (−0.2, 1) and N (0.2, 1) distributions. The graph shows rejection rates against the alternative that the N (0.2, 1) distribution has better predictive ability.
4.1. Size

In order to assess the size properties of the tests a case is required with two competing predictive densities that are both 'equally (in)correct'. However, whether or not the null hypothesis of equal predictive ability holds depends on the weight function $w_t(y)$ that is used in the scoring rules. This complicates the simulation design, given the fact that we would like to examine how the behavior of the tests depends on the specific settings of the weight function. For the threshold weight function $w_t(y) = I(y \le r)$ it appears to be impossible to construct an example with two different density forecasts having identical predictive ability regardless of the value of $r$. We therefore evaluate the size of the tests when focusing on the central part of the distribution by means of the weight function $w_t(y) = I(-r \le y \le r)$. As mentioned before, in some cases this region of the distribution may be of primary interest, for instance to monetary policymakers aiming to keep inflation between certain lower and upper bounds. The data generating process (DGP) is taken to be i.i.d. standard normal, while the two competing density forecasts are normal distributions with different means equal to $-0.2$ and $0.2$ and identical variance equal to 1. In this case, independent of the value of $r$ the competing density forecasts have equal predictive accuracy, as the scoring rules considered here are invariant under a simultaneous reflection about zero of all densities of interest (the true conditional density as well as the density forecasts). In addition, it turns out that for this combination of DGP and predictive densities, the relative scores $d^*_{t+1}$ for the wl, cl and csl rules based on $w_t(y) = I(-r \le y \le r)$ are identical; observations outside the interval $[-r, r]$ do not support evidence in favor of either density forecast, which is reflected in equal scores for the two forecasts, under any of the scoring rules considered.

Fig. 4 displays one-sided rejection rates at nominal significance levels of 1%, 5% and 10% of the null hypothesis against the alternative that the $N(0.2, 1)$ distribution has better predictive ability as a function of the threshold value $r$, based on 10,000 replications for sample size $n = 500$. The rejection rates of the tests are quite close to the nominal significance levels for all values of $r$. Unreported results for different values of $n$ show that this holds even for sample sizes as small as $n = 100$ observations. Hence, the size properties of the predictive ability test appear to be satisfactory.

4.2. Power
We evaluate the power of the test based on the various scoring rules by performing simulation experiments where one of the competing density forecasts is correct, i.e. corresponds exactly with the underlying DGP. In that case the true density always is the best possible one, regardless of the region for which the densities are evaluated, that is, regardless of the weight function used in the scoring rules. Given that our main focus in this paper has been on comparing density forecasts in the left tail, in these experiments we first return to the threshold weight function $w_t(y) = I(y \le r)$. In order to make the rejection frequencies of the null obtained for different values of $r$ more comparable, we make the sample size $n$ dependent on the threshold value in such a way that the expected number of observations in the region of interest, denoted by $c$, is constant across the various values of $r$. This is achieved by setting $n = c / P(Y < r)$. Given that in typical risk management applications there may be only a few tail observations, we consider relatively small values of $c$.

Figs. 5 and 6 show the observed rejection rates for $c = 5$ and $c = 40$, respectively, based on 10,000 replications, for data drawn from the standard normal distribution (left column) or the standardized Student-t(5) distribution (right column). In both cases, the null hypothesis being tested is equal predictive accuracy of the standard normal and standardized Student-t(5) density forecasts. The top (bottom) panels in these figures show rejection rates at nominal significance level 5% against superior predictive ability of the standard normal (standardized Student-t(5)) distribution, as a function of the threshold parameter $r$. Hence, the top left and bottom right panels report true power (rejections in favor of the correct density), while the top right and bottom left panels report spurious power (rejections in favor of the incorrect density). Several interesting conclusions emerge from these graphs.
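As a small illustration of the rule $n = c / P(Y < r)$, the sketch below computes the implied sample sizes for a standard normal DGP. Rounding up to the next integer is an assumption, since the text does not state how non-integer values are handled, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(c, r, cdf=NormalDist().cdf):
    """Smallest integer n with n * P(Y <= r) >= c (rounding up is an assumption)."""
    return math.ceil(c / cdf(r))


for r in (-2.5, -2.0, -1.0, 0.0):
    print(r, sample_size(5, r), sample_size(40, r))
```

Far in the left tail the required sample size grows quickly: with $r = -2.5$ the tail probability is about 0.6%, so roughly 800 observations are needed for $c = 5$ expected tail observations.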
First, the power of the w l scoring rule depends strongly on the threshold parameter r. For the normal DGP, for example, the test has excellent power for values of r between −2 and 0, but for more negative threshold values the rejection rates against the correct alternative drop to zero. In fact, for threshold values less than −2, we observe substantial spurious power in the form of rejection against the incorrect alternative of superior predictive ability of the Student-t density. Comparing Figs. 5 and 6 shows that this is
Fig. 5. One-sided rejection rates (at nominal significance level 5%) of the Diebold–Mariano type test statistic of equal predictive accuracy defined in (1) when using the weighted logarithmic (w l), the conditional likelihood (cl), and the censored likelihood (csl) scoring rules, under the threshold weight function wt (y) = I(y ≤ r ) for c = 5 expected observations in the region of interest, based on 10,000 replications. For the graphs in the left and right columns, the DGP is i.i.d. standard normal and i.i.d. standardized Student-t (5), respectively. The test compares the predictive accuracy of the standard normal and the standardized Student-t (5) distributions. The graphs in the top (bottom) panels show rejection rates against superior predictive ability of the standard normal (standardized Student-t (5)) distribution, as a function of the threshold parameter r.
not a small sample problem. In fact, the spurious power for the wl rule increases as the sample size becomes larger. This behavior of the test based on the wl scoring rule for large negative values of $r$ can be understood from the bottom graph of Fig. 1, showing that the logarithmic score is higher for the Student-t density than for the normal density for all values of $y$ below $-2.5$, approximately. To understand the non-monotonic nature of these power curves more fully, we use numerical integration to obtain the expected relative score $E[d^{wl}_{t+1}]$ for various values of the threshold $r$ for i.i.d. standard normal data. The results are shown in Fig. 7. It can be observed that the mean changes sign several times, in exact accordance with the patterns in the top panels of Figs. 5 and 6. Whenever the mean score difference (computed as the score of the standard normal minus the score of the standardized Student-t(5) density) is positive the associated test has high power, while it has high spurious power for negative mean scores. The wl scoring rule thus cannot be relied upon for discriminating between competing density forecasts. For example, a rejection of the null hypothesis in favor of superior predictive accuracy of the Student-t density for $r \approx -2.5$ could be due to the considerable 'true' power of the test, as shown in the bottom right graph in Fig. 6. However, it may equally well be the result of the spurious power problem shown in the bottom left graph.

Second, the top right and bottom left panels of Fig. 5 suggest that the wps, cl and csl scores also display some spurious power for certain regions of threshold values. However, in stark contrast to the weighted logarithmic scoring rule, this appears to be due to the extremely small sample size, as it quickly disappears as $c$ increases. Already for $c = 40$ the rejection rates for these scoring rules against the incorrect alternative remain below the nominal significance level of 5%; see Fig. 6.
This clearly demonstrates
the advantage of using a proper scoring rule for comparing the predictive accuracy of density forecasts. Third, for small values of the threshold r the power for the csl scoring rule is higher than that of the cl rule, for the standard normal (top left panel) as well as for the standardized Student-t (5) distributions (bottom right panel), especially for c = 5 (see Fig. 5). Obviously, the additional information concerning the coverage probability of the left tail region helps to distinguish between the competing density forecasts, in particular when the number of observations in the region of interest is extremely small. Fourth, for c = 5, the power of the different tests behaves similarly for large values of r. This should be expected on theoretical grounds for the w l, cl and csl scoring rules, since they become identical in the limit as r → ∞. This is not the case for the wps scoring rule though, so its similar power for large r might be coincidental. In fact, for c = 40 it is visible that the wps rule has slightly deviating power from the other rules for large r; it is somewhat smaller for the normal DGP (top left panel of Fig. 6) while it appears to be somewhat larger for the Student-t (5) DGP (lower right panel of Fig. 6). Next, we perform the same simulation experiments but with the weight function wt (y) = I(−r ≤ y ≤ r ) to study the power properties of the tests when they are used to compare density forecasts on the central part of the distribution. Fig. 8 shows rejection rates obtained for an i.i.d. standard normal DGP, when we test the null of equal predictive ability of the N (0, 1) and standardized Student-t (5) distributions against the alternative that either of these density forecasts has better predictive ability, for c = 200 (the number of observations in the region of interest needed to obtain a reasonable power strongly depends on the
Fig. 6. One-sided rejection rates (at nominal significance level 5%) of the Diebold–Mariano type test statistic of equal predictive accuracy defined in (1) when using the weighted logarithmic (w l), the conditional likelihood (cl), and the censored likelihood (csl) scoring rules, under the threshold weight function wt (y) = I(y ≤ r ) for c = 40 expected observations in the region of interest, based on 10,000 replications. For the graphs in the left and right columns, the DGP is i.i.d. standard normal and i.i.d. standardized Student-t (5), respectively. Further details are identical to those given in Fig. 5.
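The expected relative wl score under a standard normal DGP, which the text obtains by numerical integration, can be approximated with a few lines of code. With the threshold weight function it equals the integral of $\phi(y) (\log \phi(y) - \log g(y))$ over $y \le r$, where $\phi$ is the standard normal density and $g$ the standardized Student-t(5) density. The quadrature rule, the lower integration limit, and the function names below are illustrative assumptions.

```python
import math


def normal_logpdf(y):
    """Log density of the standard normal distribution."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * y * y


def std_t_logpdf(y, nu=5.0):
    """Log density of the Student-t(nu) rescaled to unit variance (nu > 2)."""
    log_c = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(math.pi * (nu - 2.0)))
    return log_c - 0.5 * (nu + 1.0) * math.log1p(y * y / (nu - 2.0))


def expected_wl(r, lo=-12.0, steps=20000):
    """Trapezoidal approximation of E[d^wl] for the weight I(y <= r)."""
    h = (r - lo) / steps
    f = lambda y: math.exp(normal_logpdf(y)) * (normal_logpdf(y) - std_t_logpdf(y))
    total = 0.5 * (f(lo) + f(r))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


for r in (-4.0, -3.0, -2.5, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0):
    print(r, expected_wl(r))
```

A negative value at a given threshold means that the wl rule favors the incorrect Student-t forecast on average, which is the source of the spurious power discussed in the text. For very large $r$ the quantity approaches the Kullback–Leibler divergence between the two densities and is therefore positive.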
Fig. 7. Mean relative wl score $E[d^{wl}_{t+1}]$ with threshold weight function $w_t(y) = I(y \le r)$ for the standard normal versus the standardized Student-t(5) density as a function of the threshold value $r$, for the standard normal DGP.

relative differences between densities). The format is the same as in Figs. 5 and 6, with the left (right) column showing results when the DGP is the standard normal (standardized Student-t(5)) distribution. The top (bottom) panels in these figures show rejection rates at nominal significance level 5% against superior predictive ability of the standard normal (standardized Student-t(5)) distribution, as a function of the threshold parameter $r$. It can be clearly observed that the wl rule displays spurious power, and that in the full information case (i.e. large values of $r$) the likelihood-based rules provide more powerful tests than the wps rule.

4.3. Estimation uncertainty and time-varying weight functions
In the remaining simulation experiments, we examine the effects of parameter estimation uncertainty. We start with a simulation addressing the effect of nonvanishing estimation uncertainty on the tests of equal predictive accuracy. In particular, we demonstrate that a forecast method using an incorrect model specification but with limited estimation uncertainty may produce a better density forecast than a forecast method based on the correct model specification but having larger estimation uncertainty. For brevity, we focus only on the (unweighted) logarithmic scoring rule (2). The results generalize to the other scoring rules considered in the paper. The data generating process is the following AR(2) specification: $y_t = 0.8 y_{t-1} + 0.05 y_{t-2} + \varepsilon_t$, $\varepsilon_t \sim$ i.i.d. $N(0, 1)$. We compare the predictive accuracy of the AR(2) specification, which is correct up to two estimated parameters, against a more parsimonious, but incorrect AR(1) specification with one parameter to be estimated. The parameters are estimated by MLE. Recall that we work under a rolling forecast scheme, where the size of the estimation window $m$ is fixed, so that the estimation uncertainty does not vanish asymptotically. Table 1 shows one-sided rejection rates of the test of equal predictive accuracy for different rolling estimation window sizes $m$ against the alternatives that the average log score is higher for the AR(2) model relative to the AR(1) model and vice versa. For small estimation windows, $m = 100$ and $250$, the estimation uncertainty is relatively important and the test often indicates that the incorrectly specified, but more parsimonious AR(1) model produces better density forecasts. For intermediate values $m = 500$ and $1000$ the test generally does not reject the null of equal predictive accuracy. For very large estimation windows with $m = 2500$ and $5000$, the estimation error is small enough for the test to
Fig. 8. One-sided rejection rates (at nominal significance level 5%) of the Diebold–Mariano type test statistic of equal predictive accuracy defined in (1) when using the weighted logarithmic (wl), the conditional likelihood (cl), and the censored likelihood (csl) scoring rules, under the weight function $w_t(y) = I(-r \le y \le r)$ for $c = 200$ expected observations in the region of interest, based on 10,000 replications. The DGP is i.i.d. standard normal. The graphs on the left and right show rejection rates against better predictive ability of the standard normal distribution compared to the standardized Student-t(5) distribution and vice versa, respectively.

Table 1. Tests of equal predictive accuracy under parameter estimation uncertainty.

m                          | 100   | 250   | 500   | 1000  | 2500  | 5000
$H_a: E(d^l_{t+1}) > 0$    | 0.000 | 0.000 | 0.024 | 0.134 | 0.339 | 0.463
$H_a: E(d^l_{t+1}) < 0$    | 0.982 | 0.239 | 0.026 | 0.004 | 0.001 | 0.000

Note: The table presents one-sided rejection rates (at nominal significance level 5%) of the null hypothesis of equal predictive accuracy against the indicated alternative by the Diebold–Mariano type test statistic defined in (1) when using the logarithmic scoring rule (2), based on the sample average of the score difference $d^l_{t+1} = \log \hat f_t(y_{t+1}) - \log \hat g_t(y_{t+1})$. The DGP is an AR(2) process: $y_t = 0.8 y_{t-1} + 0.05 y_{t-2} + \varepsilon_t$, with $\varepsilon_t \sim$ i.i.d. $N(0, 1)$. The competing density forecasts $\hat f_t$ and $\hat g_t$ are based on AR(2) and AR(1) specifications, respectively. The residuals in both specifications are assumed to be normally distributed and the coefficients are estimated using a moving window of $m$ observations. The estimation window size is varied from $m = 100$ to $5000$. The number of out-of-sample evaluations is $n = 5000$ and the number of replications is 10,000.
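A single replication of the experiment behind Table 1 might be sketched as follows. Several choices here are assumptions made to keep the example short: the AR models are fitted without an intercept by conditional least squares (which coincides with conditional Gaussian MLE), the innovation variance is the mean squared residual, and the window and sample sizes are much smaller than in the table. All function names are hypothetical.

```python
import math
import random


def simulate_ar2(T, rng, burn=200):
    """Draw T observations from y_t = 0.8 y_{t-1} + 0.05 y_{t-2} + eps_t."""
    y = [0.0, 0.0]
    for _ in range(T + burn):
        y.append(0.8 * y[-1] + 0.05 * y[-2] + rng.gauss(0.0, 1.0))
    return y[-T:]


def ar_forecast(window, p):
    """Fit a zero-mean Gaussian AR(p), p in {1, 2}, by least squares.

    Returns the mean and variance of the one-step-ahead normal forecast.
    """
    rows = [(window[t], window[t - 1], window[t - 2]) for t in range(2, len(window))]
    s11 = sum(a * a for _, a, _ in rows)
    s1y = sum(y * a for y, a, _ in rows)
    if p == 1:
        phi = (s1y / s11, 0.0)
    else:
        s22 = sum(b * b for _, _, b in rows)
        s12 = sum(a * b for _, a, b in rows)
        s2y = sum(y * b for y, _, b in rows)
        det = s11 * s22 - s12 * s12
        phi = ((s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det)
    ssr = sum((y - phi[0] * a - phi[1] * b) ** 2 for y, a, b in rows)
    mean = phi[0] * window[-1] + phi[1] * window[-2]
    return mean, ssr / len(rows)


def log_score(y, mean, var):
    """Log of the normal density with the given mean and variance at y."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y - mean) ** 2 / var


def score_differences(y, m):
    """Log score of the AR(2) forecast minus that of the AR(1) forecast."""
    d = []
    for t in range(m, len(y)):
        window = y[t - m:t]
        d.append(log_score(y[t], *ar_forecast(window, 2))
                 - log_score(y[t], *ar_forecast(window, 1)))
    return d


rng = random.Random(0)
m, n = 100, 500
d = score_differences(simulate_ar2(m + n, rng), m)
print(sum(d) / n)
```

Repeating this over many replications and feeding each sequence of score differences into a Diebold–Mariano type statistic would give rejection rates comparable in spirit to those in Table 1, although not numerically identical given the simplifications above.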
favor the correctly specified AR(2) model. We can summarize that with small estimation samples, the density forecasts from the AR(1) model approximate the true density forecast more closely, on average, and this is rightfully detected by the log score and the associated test.

Next, in addition to parameter estimation uncertainty in the density forecasts, we investigate the effect of using a weight function that is time-varying and depends on estimated parameters. In particular, we use a threshold weight function $w_t(y) = I(y \le \hat r_t^\alpha)$, where the threshold $\hat r_t^\alpha$ is given by the empirical $\alpha$-quantile obtained from a finite window of past observations. As shown by Lemma 1, the cl and csl scoring rules in (11) and in (12) remain proper in this case and the properties of the associated tests of equal predictive accuracy should not be affected. We focus on a DGP which is more relevant for finance applications. The DGP is taken to be a GARCH(1, 1) process, specified as $y_t = \sqrt{h_t}\, \eta_t$, with $h_t = 0.01 + 0.1 y_{t-1}^2 + 0.8 h_{t-1}$ and $\{\eta_t\}$ an i.i.d. standard normal sequence. We evaluate the performance of the available scoring rules in identifying the correctly specified GARCH density forecast when compared with an alternative density forecast, which differs only in the specification of the distribution of the standardized innovations $\eta_t$. Specifically, the alternative specification assumes a standardized Student-t(5) distribution for $\eta_t$. The model for the conditional volatility is correctly specified in both forecast methods, up to the unknown parameters. The GARCH parameters are estimated by MLE using a rolling window of $m = 2000$ observations and the threshold, $\hat r_t^\alpha$, is set equal to the empirical $\alpha$-quantile of $y_{t-m+1}, y_{t-m+2}, \ldots, y_t$. Similarly to the previous experiments the number of observations for which density forecasts are constructed varies depending on the number of expected observations falling within the region of interest, i.e. $n = c / \alpha$. We report results for $c = 40$. Given the same number of parameters in the model specifications underlying the competing forecasts and the relatively large rolling window size, $m = 2000$, we may expect that the density forecasts based on the correct specification with standard normal innovations are closer to the true conditional density than the forecasts using the standardized Student-t(5) innovations.

Fig. 9 shows one-sided rejection rates of the null hypothesis of equal predictive abilities against better predictive ability of the forecast based on the standard normal innovations compared to the forecast based on standardized Student-t(5) innovations and vice versa for values of $\alpha$ ranging between 0.01 and 0.5. The right panel shows that the cl and csl scoring rules do not display spurious power, while the wl rule has rejection rates substantially above the nominal level of 5%, in particular for small threshold values.
This confirms that the cl and csl scoring rules can be used in combination with time-varying weight functions without introducing spurious rejections
Fig. 9. One-sided rejection rates (at nominal significance level 5%) of the Diebold–Mariano type test statistic of equal predictive accuracy defined in (1) when using the weighted logarithmic (w l), the weighted probability wps, the conditional likelihood (cl), and the censored likelihood (csl) scoring rules for c = 40 expected observations in the region of interest, based on 1000 replications. The data follow an AR(5)-GARCH(1,1) process with the standard normal innovations. The competing model is based on the standardized Student-t (5) innovations. The graphs on the left and right show rejection rates against better predictive ability of the forecast based on the standard normal innovations compared to the forecast based on standardized Student-t (5) innovations and vice versa, respectively.
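The data-generating step and the rolling empirical-quantile threshold used in this experiment can be sketched as below. The GARCH coefficients are those stated in the text; the burn-in length, the starting value for $h_t$, and the quantile convention (the order statistic of rank $\lceil \alpha m \rceil$) are assumptions, and the function names are hypothetical. Estimation of the GARCH parameters is not included.

```python
import math
import random


def simulate_garch(T, rng, omega=0.01, alpha=0.1, beta=0.8, burn=500):
    """Simulate y_t = sqrt(h_t) * eta_t with h_t = omega + alpha*y_{t-1}^2 + beta*h_{t-1}."""
    h = omega / (1.0 - alpha - beta)  # unconditional variance as starting value (assumption)
    y = []
    for _ in range(T + burn):
        y_t = math.sqrt(h) * rng.gauss(0.0, 1.0)
        y.append(y_t)
        h = omega + alpha * y_t * y_t + beta * h
    return y[burn:]


def empirical_quantile(window, q):
    """Order statistic of rank ceil(q * m); one of several possible conventions."""
    ordered = sorted(window)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def rolling_thresholds(y, m, q):
    """Thresholds r_t computed from y[t-m+1..t], used when scoring y[t+1]."""
    return [empirical_quantile(y[t - m + 1:t + 1], q) for t in range(m - 1, len(y) - 1)]


rng = random.Random(1)
m, q, c = 2000, 0.05, 40
n = round(c / q)
y = simulate_garch(m + n, rng)
r = rolling_thresholds(y, m, q)
hits = sum(y[m + i] <= r[i] for i in range(n))
print(n, hits)
```

Because each threshold depends only on observations up to time $t$, it is known when the forecast for $t+1$ is made, which is the condition under which the cl and csl rules remain proper.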
of the null hypothesis. The cl and csl scoring rules display comparable power for quantiles $\alpha > 0.10$, approximately. For lower quantiles, thus focusing more on the left tail, the additional information used by the csl rule again leads to improved power compared to the cl rule. While the wps scoring rule has power comparable to the csl rule for the lowest quantiles, its (relative) performance rapidly deteriorates as $\alpha$ becomes larger.

5. Empirical illustration

We examine the empirical relevance of the proposed scoring rules in the context of the evaluation of density forecasts for daily stock index returns. We consider S&P 500 log-returns $y_t = \ln(P_t / P_{t-1})$, where $P_t$ is the closing price on day $t$, adjusted for dividends and stock splits. The sample period runs from January 1, 1980 until March 14, 2008, giving a total of 7115 observations (source: Datastream). For illustrative purposes we define two forecast methods based on GARCH models in such a way that a priori one of the methods is expected to be superior to the other. Examining a large variety of GARCH models for forecasting daily US stock index returns, Bao et al. (2007) conclude that the accuracy of density forecasts depends more on the choice of the distribution of the standardized innovations than on the volatility specification. Therefore, we differentiate our forecast methods in terms of the innovation distribution, while keeping identical specifications for the conditional mean and the conditional variance. We consider an AR(5) model for the conditional mean return together with a GARCH(1, 1) model for the conditional variance, that is

$$ y_t = \mu_t + \varepsilon_t = \mu_t + \sqrt{h_t}\, \eta_t, $$

where the conditional mean $\mu_t$ and the conditional variance $h_t$ are given by

$$ \mu_t = \rho_0 + \sum_{j=1}^{5} \rho_j y_{t-j}, $$

$$ h_t = \omega + \alpha \varepsilon_{t-1}^2 + \beta h_{t-1}, $$

and the standardized innovations $\eta_t$ are i.i.d. with mean zero and variance one. Following Bollerslev (1987), a common finding in empirical applications of GARCH models has been that a normal distribution for $\eta_t$ is not sufficient to fully account for the kurtosis observed in stock returns. We therefore concentrate on leptokurtic distributions for the standardized innovations. Specifically, for one forecast method the distribution of $\eta_t$ is specified as a (standardized) Student-t distribution with $\nu$ degrees
of freedom, while for the other forecast method we use the (standardized) Laplace distribution. Note that for the Student-t distribution the number of degrees of freedom $\nu$ is a parameter that is to be estimated. The degrees of freedom directly determines the value of the excess kurtosis of the standardized innovations, which is equal to $6/(\nu - 4)$ (assuming $\nu > 4$). Due to its flexibility, the Student-t distribution has been widely used in GARCH modeling (see e.g. Bollerslev (1987), Baillie and Bollerslev (1989)). The standardized Laplace distribution provides a more parsimonious alternative with no additional parameters to be estimated and has been applied in the context of conditional volatility modeling by Granger and Ding (1995) and Mittnik et al. (1998). The Laplace distribution has excess kurtosis of 3, which exceeds the excess kurtosis of the Student-t($\nu$) distribution for $\nu > 6$. Because of the greater flexibility in modeling kurtosis, we may expect that the forecast method with Student-t innovations gives superior density forecasts relative to the Laplace innovations. This is indeed indicated by results in Bao et al. (2007), who evaluate these density forecasts 'unconditionally', that is, not focusing on a particular region of the distribution.

Our evaluation of the two forecast methods is based on their one-step-ahead density forecasts for daily returns, using a rolling window scheme for parameter estimation. The length of the estimation window is set to $m = 2000$ observations, so that the number of out-of-sample observations is equal to $n = 5115$. For comparing the density forecasts' accuracy we use the Diebold–Mariano type test based on the weighted logarithmic scoring rule in (4), the weighted probability scores in (8), the conditional likelihood in (11), and the censored likelihood in (12). We concentrate on the left tail of the distribution by using the threshold weight function $w_t(y) = I(y \le \hat r_t^\alpha)$ for the wl, wps, cl and csl scoring rules.
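The two innovation distributions and the kurtosis comparison made above can be written down compactly. The sketch below assumes the usual unit-variance parameterizations: a Student-t density rescaled by $\sqrt{(\nu - 2)/\nu}$ and a Laplace density with scale $1/\sqrt{2}$. The function names are hypothetical.

```python
import math


def std_t_pdf(x, nu):
    """Student-t(nu) density rescaled to unit variance (requires nu > 2)."""
    log_c = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(math.pi * (nu - 2.0)))
    return math.exp(log_c - 0.5 * (nu + 1.0) * math.log1p(x * x / (nu - 2.0)))


def std_laplace_pdf(x):
    """Laplace density with scale 1/sqrt(2), which has unit variance."""
    return math.exp(-math.sqrt(2.0) * abs(x)) / math.sqrt(2.0)


def t_excess_kurtosis(nu):
    """Excess kurtosis 6 / (nu - 4) of the Student-t, defined for nu > 4."""
    return 6.0 / (nu - 4.0)


LAPLACE_EXCESS_KURTOSIS = 3.0

for nu in (5, 6, 8, 12):
    k = t_excess_kurtosis(nu)
    print(nu, k, k < LAPLACE_EXCESS_KURTOSIS)
```

The loop shows the crossover at $\nu = 6$: for fewer degrees of freedom the Student-t has heavier tails than the Laplace in the kurtosis sense, and for more degrees of freedom it has lighter tails.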
The time-varying threshold $\hat r_t^{\alpha}$ is set equal to the empirical α-quantile of the return observations in the relevant estimation window, where we consider α = 0.10, 0.05 and 0.01. The score difference $d^*_{t+1}$ is computed by subtracting the score of the GARCH-Laplace density forecast from the score of the GARCH-t density forecast, such that positive values of $d^*_{t+1}$ indicate better predictive ability of the forecast method based on Student-t innovations. Table 2 shows the average score differences $\bar d^*_{m,n}$ with the accompanying tests of equal predictive accuracy as in (1), where we use a HAC estimator for the asymptotic variance $\hat\sigma^2_{m,n}$ to account for serial dependence in the $d^*_{t+1}$ series. The results clearly demonstrate that different conclusions follow from the different scoring rules. For thresholds based on α = 0.05 and 0.01 the wl scoring rule suggests superior predictive ability of the forecast
C. Diks et al. / Journal of Econometrics 163 (2011) 215–230
Table 2
Average score differences and tests of equal predictive accuracy.

                  α = 0.10                      α = 0.05                      α = 0.01
Scoring rule      $\bar d^*$       Test stat.   $\bar d^*$       Test stat.   $\bar d^*$       Test stat.
Threshold weight function
wl                −1.69 × 10^−4    −0.14        −5.12 × 10^−3    −4.74        −3.21 × 10^−3    −3.75
wps                4.29 × 10^−7     0.69         7.75 × 10^−7     1.56         8.68 × 10^−7     4.28
cl                 1.47 × 10^−3     1.48         1.58 × 10^−3     2.32         7.78 × 10^−4     1.81
csl                2.21 × 10^−3     1.89         1.63 × 10^−3     1.53         1.16 × 10^−3     1.35

Note: The table presents the average score difference $\bar d^*$ and the corresponding test statistics for the weighted logarithmic (wl) scoring rule in (4), the weighted probability score (wps) in (8), the conditional likelihood (cl) in (11), and the censored likelihood (csl) in (12). All scoring rules are based on the indicator weight function $w_t(y) = I(y \le \hat r_t^{\alpha})$, where $\hat r_t^{\alpha}$ is the αth quantile of the empirical (in-sample) CDF, where α = 0.1, 0.05 or 0.01. The score difference $d^*_{t+1}$ is computed for density forecasts obtained from an AR(5)-GARCH(1, 1) model with (standardized) Student-t(ν) innovations relative to the same model but with Laplace innovations, for daily S&P500 returns over the evaluation period December 2, 1987–March 14, 2008.

Table 3
VaR and ES characteristics.
                             α = 0.10              α = 0.05              α = 0.01
                             t(ν)      Laplace     t(ν)      Laplace     t(ν)      Laplace
Average VaR                  −0.0110   −0.0112     −0.0149   −0.0162     −0.0243   −0.0279
Coverage (y_t ≤ VaR_t)        0.1056    0.1001      0.0530    0.0405      0.0104    0.0055
CUC (p-value)                 0.1876    0.9814      0.3324    0.0012      0.7961    0.0004
IND (p-value)                 0.1082    0.2315      0.0465    0.3658      0.5809    0.5788
CCC (p-value)                 0.1156    0.4887      0.0861    0.0036      0.8304    0.0015
Average ES                   −0.0168   −0.0185     −0.0209   −0.0235     −0.0312   −0.0351
McNeil–Frey (test stat.)     −0.7538    3.1164     −0.8504    0.3639     −1.1899   −2.3174
McNeil–Frey (p-value)         0.4510    0.0018      0.3951    0.7159      0.2341    0.0205
Note: The average VaRs reported are the observed average 1%, 5% and 10% quantiles of the density forecasts based on the GARCH model with t(ν) and Laplace innovations, respectively. The coverages correspond with the observed fraction of returns below the respective VaRs, which ideally would coincide with the nominal rate α. The rows labeled CUC, IND and CCC provide p-values for Christoffersen’s (1998) tests for correct unconditional coverage, independence of VaR violations, and correct conditional coverage, respectively. The average ES values are the ESs (equal to the conditional mean return, given a realization below the predicted VaR) based on the different density forecasts. The bottom two rows report McNeil–Frey test statistics and corresponding p-values for evaluating the expected shortfall estimates $\mathrm{ES}_{\hat f,t}(\alpha)$.
method based on Laplace innovations, while for α = 0.1, it fails to reject the null of equal predictive ability. By contrast, the cl scoring rule suggests that the performance of the GARCH-t density forecasts is superior for all three values of α. The csl scoring rule points toward the same conclusion as the cl rule, although the evidence for better predictive ability of the forecast based on the GARCH-t specification is somewhat weaker. The wps rule also indicates the superior performance of the GARCH-t especially for α = 0.01, but evidence is weak when we consider less extreme quantiles α = 0.05 and 0.1.

In the remainder of this section we seek to understand the reasons for these conflicting results, and explore the consequences of selecting either forecast method for risk management purposes. In addition, this allows us to obtain circumstantial evidence that shows which of the two competing forecast methods is most appropriate.

For most estimation windows, the degrees of freedom parameter in the Student-t distribution is estimated to be (slightly) larger than 6, such that the Laplace distribution implies fatter tails than the Student-t distribution. Hence, it may very well be that the wl scoring rule indicates superior predictive ability of the Laplace distribution simply because this density has more probability mass in the region of interest, that is, the problem that motivated our analysis in the first place may be relevant here. To see this from a slightly different perspective, we compute one-day 90%, 95% and 99% VaR and ES estimates as implied by the two forecast methods. The 100 × (1 − α)% VaR is determined as the αth quantile of the density forecast $\hat f_t$, that is, through $P_{\hat f,t}\left(Y_{t+1} \le \mathrm{VaR}_{\hat f,t}(\alpha)\right) = \alpha$. The ES is defined as the conditional mean return given that $Y_{t+1} \le \mathrm{VaR}_{\hat f,t}(\alpha)$, that is, $\mathrm{ES}_{\hat f,t}(\alpha) = E_{\hat f,t}\left(Y_{t+1} \mid Y_{t+1} \le \mathrm{VaR}_{\hat f,t}(\alpha)\right)$. Fig. 10 shows the VaR estimates against the realized returns. We observe that typically the VaR estimates based on the Laplace innovations are more extreme, confirming that it has fatter tails than the Student-t innovations. The same conclusion follows from the sample averages of the VaR and ES estimates, as shown in Table 3.

[Figure 10: vertical axis, returns; horizontal axis, time (1988–2008).]
Fig. 10. Daily S&P 500 log-returns (black) for the period December 2, 1987–March 14, 2008 and out-of-sample 95% and 99% VaR forecasts derived from the AR(5)-GARCH(1, 1) specification using Student-t innovations (light gray) and Laplace innovations (dark gray).
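To make the VaR and ES definitions concrete, the following sketch evaluates them in closed form for a density forecast with standardized Laplace innovations. The function name and the arguments `mean` and `vol` (conditional mean and conditional standard deviation) are hypothetical; the sketch does not reproduce the rolling-window estimation used in the paper.

```python
import math


def laplace_var_es(alpha, mean=0.0, vol=1.0):
    """VaR (alpha-quantile) and ES for a return mean + vol * eta,
    where eta follows a standardized (unit-variance) Laplace distribution.

    Valid for the left tail only (0 < alpha < 0.5).
    """
    if not 0.0 < alpha < 0.5:
        raise ValueError("alpha must lie in (0, 0.5) for a left-tail quantile")
    b = 1.0 / math.sqrt(2.0)         # Laplace scale giving unit variance
    q = b * math.log(2.0 * alpha)    # solves 0.5 * exp(q / b) = alpha
    es = q - b                       # the left tail is exponential with scale b
    return mean + vol * q, mean + vol * es


# Standardized case (zero mean, unit volatility) at the 5% level.
var_5, es_5 = laplace_var_es(0.05)
```

Because the left tail of the Laplace density is exponential, the ES lies exactly one scale parameter below the VaR in standardized units, which is why no numerical integration is needed here.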
The VaR and ES estimates also enable us to assess which of the two innovation distributions is the most appropriate in a different way. For that purpose, we first of all compute the frequency of 90%, 95% and 99% VaR violations, which should be close to 0.1, 0.05 and 0.01, respectively, if the innovation distribution is correctly specified. We compute the likelihood ratio (LR) test of correct unconditional coverage (CUC) suggested by Christoffersen (1998) to determine whether the empirical violation frequencies differ significantly from these nominal levels. Additionally, we use Christoffersen’s (1998) LR tests of independence of VaR violations (IND) and for correct conditional coverage (CCC). Define the indicator variables $I_{\hat f,t+1}\left(y_{t+1} \le \mathrm{VaR}_{\hat f,t}(\alpha)\right)$ for α = 0.1, 0.05 and 0.01, which take the value 1 if the condition in brackets is satisfied and 0 otherwise. Independence of the VaR exceedances is tested against a first-order Markov alternative, that is, the null hypothesis is given by $H_0: E(I_{\hat f,t+1} \mid I_{\hat f,t}) = E(I_{\hat f,t+1})$. In words, we test whether the probability of observing a VaR violation on day t + 1 is affected by observing a VaR violation on day t or not. The CCC test simultaneously examines the null hypotheses of correct unconditional coverage and of independence, with the CCC test statistic simply being the sum of the CUC and IND LR statistics.

For evaluating the adequacy of the ES estimates we employ the test suggested by McNeil and Frey (2000). For every return $y_{t+1}$ that falls below the $\mathrm{VaR}_{\hat f,t}(\alpha)$ estimate, define the standardized ‘residual’ $e_{t+1} = (y_{t+1} - \mathrm{ES}_{\hat f,t}(\alpha))/h_{t+1}$, where $h_{t+1}$ is the conditional volatility forecast obtained from the corresponding GARCH model. If the ES predictions are correct, the expected value of $e_{t+1}$ is equal to zero, which can be assessed by means of a two-sided t-test with a HAC variance estimator.

The results reported in Table 3 show that the empirical VaR exceedance probabilities are very close to the nominal levels for the Student-t innovation distribution. For the Laplace distribution, they are considerably lower for α = 0.05 and α = 0.01. This is confirmed by the CUC test, which for these quantiles convincingly rejects the null of correct unconditional coverage for the Laplace distribution but not for the Student-t distribution. The null hypothesis of independence is not rejected in any of the cases at the 5% significance level. Finally, the McNeil and Frey (2000) test does not reject the adequacy of the 95% ES estimates for either of the two distributions, but it does for the 90% and 99% ES estimates based on the Laplace innovation distribution.
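The coverage tests described above can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the function and helper names are hypothetical, the unconditional coverage statistic uses all observations, the independence statistic uses the observed first-order transitions, and the CCC statistic is formed as their sum, as described in the text. P-values use the chi-squared distribution with 1 (CUC, IND) and 2 (CCC) degrees of freedom.

```python
import math


def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def _bernoulli_loglik(n0, n1, p):
    """Log-likelihood of n0 zeros and n1 ones with success probability p."""
    return _xlogy(n0, 1.0 - p) + _xlogy(n1, p)


def _chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable for df = 1 or 2."""
    x = max(x, 0.0)
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 is supported in this sketch")


def christoffersen_tests(hits, alpha):
    """LR statistics and p-values for CUC, IND and CCC.

    hits  : sequence of 0/1 VaR violation indicators (at least two entries)
    alpha : nominal violation probability, strictly between 0 and 1
    """
    hits = [int(bool(h)) for h in hits]
    n = len(hits)
    n1 = sum(hits)
    n0 = n - n1

    # Correct unconditional coverage: nominal alpha versus empirical frequency.
    lr_uc = -2.0 * (_bernoulli_loglik(n0, n1, alpha)
                    - _bernoulli_loglik(n0, n1, n1 / n))

    # Independence against a first-order Markov alternative.
    counts = [[0, 0], [0, 0]]
    for prev, curr in zip(hits[:-1], hits[1:]):
        counts[prev][curr] += 1
    (n00, n01), (n10, n11) = counts
    pi01 = n01 / (n00 + n01) if (n00 + n01) else 0.0
    pi11 = n11 / (n10 + n11) if (n10 + n11) else 0.0
    pi = (n01 + n11) / (n - 1)
    loglik_markov = (_bernoulli_loglik(n00, n01, pi01)
                     + _bernoulli_loglik(n10, n11, pi11))
    loglik_indep = _bernoulli_loglik(n00 + n10, n01 + n11, pi)
    lr_ind = -2.0 * (loglik_indep - loglik_markov)

    # Conditional coverage: sum of the two statistics, as stated in the text.
    lr_cc = lr_uc + lr_ind
    return {
        "CUC": (lr_uc, _chi2_sf(lr_uc, 1)),
        "IND": (lr_ind, _chi2_sf(lr_ind, 1)),
        "CCC": (lr_cc, _chi2_sf(lr_cc, 2)),
    }


# Artificial violation sequences, used only to exercise the function.
evenly_spaced = ([0] * 9 + [1]) * 10   # 10% violations, never consecutive
clustered = [1] * 10 + [0] * 90        # 10% violations, all in one cluster
even_result = christoffersen_tests(evenly_spaced, 0.1)
cluster_result = christoffersen_tests(clustered, 0.1)
```

Both artificial sequences have exactly the nominal violation frequency, so the CUC statistic cannot tell them apart; only the IND (and hence CCC) statistic responds to the clustering.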
In sum, the VaR and ES estimates suggest that the Student-t distribution is more appropriate than the Laplace distribution, confirming the density forecast evaluation results obtained with the conditional and censored likelihood scoring rules. In terms of risk management, using the GARCH-Laplace forecast method would lead to larger estimates of risk than the GARCH-t forecast method. This, in turn, could result in suboptimal asset allocation and ‘over-hedging’.
6. Conclusions

In this paper we have developed new scoring rules based on conditional and censored likelihood for evaluating the predictive ability of competing density forecasts. It was shown that these scoring rules are useful when the main interest lies in comparing the density forecasts’ accuracy for a specific region, such as the left tail in financial risk management applications. Directly weighting the (KLIC-based) logarithmic scoring rule is not suitable for this purpose. By construction this tends to favor density forecasts with more probability mass in the region of interest, rendering the tests of equal predictive accuracy biased toward such densities. Our novel scoring rules do not suffer from this problem.

We argued that likelihood-based scoring rules can be extended for comparing density forecasts on a specific region of interest by using the conditional likelihood, given that the actual observation lies in the region of interest, or the censored likelihood, with censoring of the observations outside the region of interest. Furthermore, we showed that the conditional and censored likelihood scoring rules can be extended in order to emphasize certain parts of the outcome space more generally by using smooth weight functions. Both scoring rules can be interpreted in terms of Kullback–Leibler divergences between weighted versions of the density forecast and the true conditional density.

Monte Carlo simulations demonstrated that the conventional scoring rules may indeed give rise to spurious rejections due to the possible bias in favor of an incorrect density forecast. This phenomenon is virtually non-existent for the new scoring rules, and where present, diminishes quickly upon increasing the sample size. When comparing the scoring rules based on conditional likelihood and censored likelihood it was found that the latter often leads to more powerful tests. This is due to the fact that more information is used by the censored likelihood scores. Additionally, the censored likelihood scoring rule outperforms the weighted probability score function of Gneiting and Ranjan (2008).

In an empirical application to S&P 500 daily returns we investigated the use of the various scoring rules for density forecast comparison in the context of financial risk management. It was shown that the weighted logarithmic scoring rule and the newly proposed scoring rules can lead to the selection of different density forecasts. The density forecasts preferred by the conditional and censored likelihood scoring rules appear to be more appropriate as they result in more accurate estimates of VaR and ES.

Acknowledgements

We would like to thank the associate editor and two anonymous referees, Michael Clements, Frank Diebold, and Wolfgang Härdle, seminar participants at Humboldt University Berlin, Monash University, Queensland University of Technology, University of Amsterdam, University of New South Wales, University of Pennsylvania, and the Reserve Bank of Australia, as well as participants at the 16th Society for Nonlinear Dynamics and Econometrics Conference (San Francisco, April 3–4, 2008) and the New Zealand Econometric Study Group Meeting in honor of Peter C.B. Phillips (Auckland, March 7–9, 2008) for providing useful comments and suggestions. Valentyn Panchenko acknowledges the support under Australian Research Council’s Discovery Projects funding scheme (project number DP0986718). Usual caveats apply.

Appendix

This Appendix provides a proof of Lemma 1.

Generalized conditional likelihood score. It is to be shown that $E_t\left(d^{cl}_{t+1}(p_t, \hat f_t)\right) \ge 0$, where $d^{cl}_{t+1}(p_t, \hat f_t) = S^{cl}(p_t; Y_{t+1}) - S^{cl}(\hat f_t; Y_{t+1})$. Define $P_t \equiv \int w_t(s) p_t(s)\,ds$ and $\hat F_t \equiv \int w_t(s) \hat f_t(s)\,ds$. The time-t conditional expected score difference for the density forecasts $p_t$ and $\hat f_t$ is

\[
\begin{aligned}
E_t\left(d^{cl}_{t+1}(p_t, \hat f_t)\right)
&= \int p_t(y)\, w_t(y) \log\frac{p_t(y)}{P_t}\,dy - \int p_t(y)\, w_t(y) \log\frac{\hat f_t(y)}{\hat F_t}\,dy \\
&= P_t \int \frac{w_t(y) p_t(y)}{P_t} \log\frac{w_t(y) p_t(y)/P_t}{w_t(y) \hat f_t(y)/\hat F_t}\,dy \\
&= P_t \cdot K\!\left(\frac{w_t(y) p_t(y)}{P_t}, \frac{w_t(y) \hat f_t(y)}{\hat F_t}\right) \ge 0,
\end{aligned}
\]

where $K(\cdot, \cdot)$ represents the Kullback–Leibler divergence between the pdfs in its arguments, which is finite as a consequence of Assumption 1. Assumption 1 implies $\mathrm{support}(\hat f_t) = \mathrm{support}(\hat g_t) = \mathrm{support}(p_t)$. This, together with Assumption 2(c), guarantees that $w_t(y) p_t(y)/P_t$ and $w_t(y) \hat f_t(y)/\hat F_t$ can be interpreted as pdfs, while Assumption 2(a) ensures that $w_t(y)$ can be treated as a given function of y in the calculation of the expectation, which is conditional on $\mathcal{F}_t$.
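As a hedged illustration of the result for the conditional likelihood score, consider the threshold weight function $w_t(y) = I(y \le r)$, with $r$ a generic threshold (a symbol introduced here only for illustration). The normalizing constants then reduce to tail probabilities, the weighted densities become densities truncated to $(-\infty, r]$, and the expected score difference is a tail probability times the Kullback–Leibler divergence between the two truncated densities:

\[
\begin{aligned}
P_t &= \int_{-\infty}^{r} p_t(s)\,ds, \qquad \hat F_t = \int_{-\infty}^{r} \hat f_t(s)\,ds, \\
E_t\left(d^{cl}_{t+1}(p_t, \hat f_t)\right)
&= P_t \int_{-\infty}^{r} \frac{p_t(y)}{P_t} \log\frac{p_t(y)/P_t}{\hat f_t(y)/\hat F_t}\,dy \ge 0 .
\end{aligned}
\]

In this special case the expected difference is zero when $\hat f_t(y)/\hat F_t = p_t(y)/P_t$ for almost all $y \le r$, that is, when the forecast has the correct shape on the region of interest.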
Generalized censored likelihood score. If $d^{csl}_{t+1}(p_t, \hat f_t) = S^{csl}(p_t; Y_{t+1}) - S^{csl}(\hat f_t; Y_{t+1})$, then

\[
\begin{aligned}
E_t\left(d^{csl}_{t+1}(p_t, \hat f_t)\right)
&= \int p_t(y) \log\left[(p_t(y))^{w_t(y)} (1 - P_t)^{1 - w_t(y)}\right] dy
 - \int p_t(y) \log\left[(\hat f_t(y))^{w_t(y)} (1 - \hat F_t)^{1 - w_t(y)}\right] dy \\
&= \int p_t(y) \log\left[\left(\frac{p_t(y)}{\hat f_t(y)}\right)^{w_t(y)} \left(\frac{1 - P_t}{1 - \hat F_t}\right)^{1 - w_t(y)}\right] dy \\
&= \int p_t(y) \log\left[\left(\frac{p_t(y) \hat F_t}{P_t \hat f_t(y)}\right)^{w_t(y)} \left(\frac{P_t}{\hat F_t}\right)^{w_t(y)} \left(\frac{1 - P_t}{1 - \hat F_t}\right)^{1 - w_t(y)}\right] dy \\
&= \int p_t(y) \left[ w_t(y) \log\frac{p_t(y) \hat F_t}{P_t \hat f_t(y)} + w_t(y) \log\frac{P_t}{\hat F_t} + (1 - w_t(y)) \log\frac{1 - P_t}{1 - \hat F_t} \right] dy \\
&= P_t \int \frac{w_t(y) p_t(y)}{P_t} \log\frac{w_t(y) p_t(y)/P_t}{w_t(y) \hat f_t(y)/\hat F_t}\,dy
 + P_t \log\frac{P_t}{\hat F_t} + (1 - P_t) \log\frac{1 - P_t}{1 - \hat F_t} \\
&= P_t \cdot K\!\left(\frac{w_t(y) p_t(y)}{P_t}, \frac{w_t(y) \hat f_t(y)}{\hat F_t}\right)
 + K\!\left(\mathrm{Bin}(1, P_t), \mathrm{Bin}(1, \hat F_t)\right) \ge 0,
\end{aligned}
\]

where $K\left(\mathrm{Bin}(1, P_t), \mathrm{Bin}(1, \hat F_t)\right)$ is the Kullback–Leibler divergence between two Bernoulli distributions with success probabilities $P_t$ and $\hat F_t$, respectively. Assumption 2(b), which requires $w_t(y)$ to be scaled between 0 and 1 for the csl rule, is essential for this interpretation because it implies that $P_t$ and $\hat F_t$ can be interpreted as probabilities. Again, Assumptions 1 and 2(c) guarantee that $w_t(y) p_t(y)/P_t$ and $w_t(y) \hat f_t(y)/\hat F_t$ can be interpreted as pdfs, while Assumption 2(a) ensures that $w_t(y)$ can be treated as a given function of y in the calculation of the expectation, which is conditional on $\mathcal{F}_t$.

References

Amisano, G., Giacomini, R., 2007. Comparing density forecasts via weighted likelihood ratio tests. Journal of Business and Economic Statistics 25, 177–190.
Bai, J., 2003. Testing parametric conditional distributions of dynamic models. Review of Economics and Statistics 85, 531–549.
Bai, J., Ng, S., 2005. Tests for skewness, kurtosis, and normality of time series data. Journal of Business and Economic Statistics 23, 49–60.
Baillie, R.T., Bollerslev, T., 1989. The message in daily exchange rates: a conditional-variance tale. Journal of Business and Economic Statistics 7, 297–305.
Bao, Y., Lee, T.-H., Saltoğlu, B., 2004. A test for density forecast comparison with applications to risk management. Working Paper 04-08, UC Riverside.
Bao, Y., Lee, T.-H., Saltoğlu, B., 2007. Comparing density forecast models. Journal of Forecasting 26, 203–225.
Berkowitz, J., 2001. Testing density forecasts, with applications to risk management. Journal of Business and Economic Statistics 19, 465–474.
Bollerslev, T., 1987. A conditionally heteroskedastic time series model for speculative prices and rates of return. Review of Economics and Statistics 69, 542–547.
Bollerslev, T., Kretschmer, U., Pigorsch, C., Tauchen, G., 2009. A discrete-time model for daily S&P500 returns and realized variations: jumps and leverage effects. Journal of Econometrics 150, 151–166.
Campbell, S.D., Diebold, F.X., 2005. Weather forecasting for weather derivatives. Journal of the American Statistical Association 100, 6–16.
Christoffersen, P.F., 1998. Evaluating interval forecasts. International Economic Review 39, 841–862.
Clements, M.P., 2004. Evaluating the Bank of England density forecasts of inflation. Economic Journal 114, 844–866.
Clements, M.P., 2005. Evaluating Econometric Forecasts of Economic and Financial Variables. Palgrave-Macmillan, New York.
Clements, M.P., Smith, J., 2000. Evaluating the forecast densities of linear and nonlinear models: applications to output growth and inflation. Journal of Forecasting 19, 255–276.
Corradi, V., Distaso, W., Swanson, N.R., 2009. Predictive density estimators for daily volatility based on the use of realized measures. Journal of Econometrics 150, 119–138.
Corradi, V., Swanson, N.R., 2006a. Bootstrap conditional distribution tests in the presence of dynamic misspecification. Journal of Econometrics 133, 779–806.
Corradi, V., Swanson, N.R., 2006c. Predictive density evaluation. In: Elliott, G., Granger, C.W.J., Timmermann, A. (Eds.), Handbook of Economic Forecasting, vol. 1. Elsevier, Amsterdam, pp. 197–284.
Corradi, V., Swanson, N.R., 2005. A test for comparing multiple misspecified conditional interval models. Econometric Theory 21, 991–1016.
Corradi, V., Swanson, N.R., 2006b. Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics 135, 187–228.
De Jong, R.M., 1997. Central limit theorems for dependent heterogeneous random variables. Econometric Theory 13, 353–367.
Diebold, F.X., Gunther, T.A., Tay, A.S., 1998. Evaluating density forecasts with applications to financial risk management. International Economic Review 39, 863–883.
Diebold, F.X., Lopez, J.A., 1996. Forecast evaluation and combination. In: Maddala, G.S., Rao, C.R. (Eds.), Handbook of Statistics, vol. 14. North-Holland, Amsterdam, pp. 241–268.
Diebold, F.X., Mariano, R.S., 1995. Comparing predictive accuracy. Journal of Business and Economic Statistics 13, 253–263.
Diebold, F.X., Tay, A.S., Wallis, K.F., 1999. Evaluating density forecasts of inflation: the survey of professional forecasters. In: Engle, R.F., White, H. (Eds.), Cointegration, Causality, and Forecasting: A Festschrift in Honor of C.W.J. Granger. Oxford University Press, Oxford, pp. 76–90.
Dowd, K., 2005. Measuring Market Risk, second ed. John Wiley & Sons, Chichester.
Egorov, A.V., Hong, Y., Li, H., 2006. Validating forecasts of the joint probability density of bond yields: can affine models beat random walk? Journal of Econometrics 135, 255–284.
Franses, P.H., van Dijk, D., 2003. Selecting a nonlinear time series model using weighted tests of equal forecast accuracy. Oxford Bulletin of Economics and Statistics 65, 727–744.
Garratt, A., Lee, K., Pesaran, M.H., Shin, Y., 2003. Forecast uncertainties in macroeconomic modelling: an application to the UK economy. Journal of the American Statistical Association 98, 829–838.
Giacomini, R., Komunjer, I., 2005. Evaluation and combination of conditional quantile forecasts. Journal of Business and Economic Statistics 23, 416–431.
Giacomini, R., White, H., 2006. Tests of conditional predictive ability. Econometrica 74, 1545–1578.
Gneiting, T., Raftery, A.E., 2007. Strictly proper scoring rules, prediction and estimation. Journal of the American Statistical Association 102, 359–378.
Gneiting, T., Ranjan, R., 2008. Comparing density forecasts using threshold and quantile weighted scoring rules. Technical Report 533. University of Washington.
Granger, C.W.J., Ding, Z., 1995. Some properties of absolute return, an alternative measure of risk. Annales d’Economie et de Statistique 40, 67–91.
Granger, C.W.J., Pesaran, M.H., 2000. Economic and statistical measures of forecast accuracy. Journal of Forecasting 19, 537–560.
Guidolin, M., Timmermann, A., 2006. Term structure of risk under alternative econometric specifications. Journal of Econometrics 131, 285–308.
Guidolin, M., Timmermann, A., 2007. Asset allocation under multivariate regime switching. Journal of Economic Dynamics and Control 31, 3503–3544.
Hall, S.G., Mitchell, J., 2007. Combining density forecasts. International Journal of Forecasting 23, 1–13.
Härdle, W., Hlávka, Z., 2009. Dynamics of state price densities. Journal of Econometrics 150, 1–15.
Hong, Y., Li, H., 2005. Nonparametric specification testing for continuous-time models with applications to term structure of interest rates. Review of Financial Studies 18, 37–84.
Hong, Y., Li, H., Zhao, F., 2004. Out-of-sample performance of discrete-time spot interest rate models. Journal of Business and Economic Statistics 22, 457–473.
Li, F., Tkacz, G., 2006. A consistent bootstrap test for conditional density functions with time-series data. Journal of Econometrics 133, 863–886.
McNeil, A.J., Frey, R., 2000. Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach. Journal of Empirical Finance 7, 271–300.
McNeil, A.J., Frey, R., Embrechts, P., 2005. Quantitative Risk Management: Concepts, Techniques, and Tools. Princeton University Press, Princeton.
Merlevède, F., Peligrad, M., 2000. The functional central limit theorem under the strong mixing condition. Annals of Probability 28, 1336–1352.
Mitchell, J., Hall, S.G., 2005. Evaluating, comparing and combining density forecasts using the KLIC with an application to the Bank of England and NIESR ‘fan’ charts of inflation. Oxford Bulletin of Economics and Statistics 67, 995–1033.
Mittnik, S., Paolella, M.S., Rachev, S.T., 1998. Unconditional and conditional distributional models for the Nikkei index. Asia-Pacific Financial Markets 5, 99–128.
Perez-Quiros, G., Timmermann, A., 2001. Business cycle asymmetries in stock returns: Evidence from higher order moments and conditional densities. Journal of Econometrics 103, 259–306.
Rapach, D.E., Wohar, M.E., 2006. The out-of-sample forecasting performance of nonlinear models of real exchange rate behavior. International Journal of Forecasting 22, 341–361.
Rosenblatt, M., 1952. Remarks on a multivariate transformation. Annals of Mathematical Statistics 23, 470–472.
Sarno, L., Valente, G., 2005. Empirical exchange rate models and currency risk: some evidence from density forecasts. Journal of International Money and Finance 24, 363–385.
Sarno, L., Valente, G., 2004. Comparing the accuracy of density forecasts from competing models. Journal of Forecasting 23, 541–557.
Taylor, J.W., Buizza, R., 2006. Density forecasting for weather derivative pricing. International Journal of Forecasting 22, 29–42.
Tobin, J., 1958. Estimation of relationships for limited dependent variables. Econometrica 26, 24–36.
West, K.D., 1996. Asymptotic inference about predictive ability. Econometrica 64, 1067–1084.
White, H., 2000. A reality check for data snooping. Econometrica 68, 1097–1126.
Winkler, R.L., Murphy, A.H., 1968. Good probability assessors. Journal of Applied Meteorology 7, 751–758.
Wooldridge, J.M., White, H., 1988. Some invariance principles and central limit theorems for dependent heterogeneous processes. Econometric Theory 4, 210–230.