Journal of Econometrics 161 (2011) 101–109
Modeling data revisions: Measurement error and dynamics of ‘‘true’’ values

Jan P.A.M. Jacobs a,b, Simon van Norden c,∗

a University of Groningen, The Netherlands
b CAMA and CIRANO, Canada
c HEC Montréal, CIRANO and CIREQ, Canada

Article history: Received 8 September 2007; Received in revised form 1 July 2009; Accepted 22 April 2010; Available online 28 December 2010.
JEL classification: C22; C53; C82.
Keywords: Real-time analysis; Data revisions.

Abstract: Policy makers must base their decisions on preliminary and partially revised data of varying reliability. Realistic modeling of data revisions is required to guide decision makers in their assessment of current and future conditions. This paper provides a new framework with which to model data revisions. Recent empirical work suggests that measurement errors typically have much more complex dynamics than existing models of data revisions allow. This paper describes a state-space model that allows for richer dynamics in these measurement errors, including the noise, news and spillover effects documented in this literature. We also show how to relax the common assumption that ‘‘true’’ values are observed after a few revisions. The result is a unified and flexible framework that allows for more realistic data revision properties, and allows the use of standard methods for optimal real-time estimation of trends and cycles. We illustrate the application of this framework with real-time data on US real output growth.

© 2011 Elsevier B.V. All rights reserved.
1. Introduction

Data revisions have haunted economists for decades.1 Such revisions complicate forecasts and estimates of current economic conditions, since the most recent data are usually the least reliable (e.g. Koenig et al., 2003). Optimal forecasts and indicators require a model of the data revision process (see Croushore (2006) for a survey). Some authors cast the data revision process in state-space form, which then allows the use of standard filtering techniques for forecasting, estimation, inference, smoothing, estimation of missing data, etc.

∗ Corresponding address: HEC Montréal, 3000 Chemin de la Côte Sainte Catherine, Montreal QC, H3T 2A7, Canada. Tel.: +1 514 340 6781. E-mail address: [email protected] (S. van Norden).
1 McKenzie (2006) notes eight reasons for revisions of official statistics: 1. Incorporation of source data with more complete or otherwise better reporting (e.g. including late respondents) in subsequent estimates. 2. Correction of errors in source data (e.g. from editing) and computations (e.g. revised imputation). 3. Replacement of first estimates derived from incomplete samples (e.g. subsamples), judgmental or statistical techniques when firmer data become available. 4. Incorporation of source data that more closely match the concepts and/or benchmarking to conceptually more accurate but less frequent statistics. 5. Incorporation of updated seasonal factors. 6. Updating of the base period of constant price estimates. 7. Changes in statistical methodology (such as the introduction of chain-linked volume estimates), concepts, definitions, and classifications. 8. Revisions to national accounts statistics arising from the confrontation of data in supply and use tables.

Part of our motivation for this paper is the work of Aruoba (2008) and Siklos (2008), who note that data revisions typically show more complex behavior than standard models allow. This paper describes a more general state-space model that allows for more flexibility in the dynamics of revisions. Measurement errors in our model may have any combination of several key characteristics:
• They may behave as ‘‘noise’’, so that the measurement errors of consecutive vintages are mutually uncorrelated.
• They may have a ‘‘news’’ component, in which the measurement errors of consecutive vintages behave like a set of rational forecast errors.
• They may have a ‘‘spillover’’ component, in which measurement errors within a given data vintage are serially correlated.
• They may have important periodic components. For example, seasonal adjustment factors are often revised once a year. This may be four quarters after the initial release for some observations, but five or six quarters for others.
• Measurement errors may be important even years after initial estimates are published.

In addition, our formulation of the state-space model is novel in that it defines the measured vector as a set of estimates for a given point in time rather than a set of estimates from the same vintage. We find this leads to a more parsimonious state-space representation and a cleaner distinction between various aspects of measurement error. It also allows us to augment the model of published data with forecasts in a straightforward way. The resulting model provides us with a unified framework to estimate trends or cycles with data subject to revision. We think such a framework is indispensable for the proper formulation and conduct of monetary and fiscal policy.

The paper begins with an extensive survey of the literature on state-space models of data revision in Section 2. We relate this literature to a more empirical literature which typically characterizes measurement errors as ‘‘news’’ or ‘‘noise’’. We then describe our state-space model in Section 3, detailing its ability to capture different important types of measurement error dynamics as well as different assumed dynamics for the ‘‘true’’ underlying series. In Section 4 we illustrate our methodology with the US real GNP/GDP series made available by the Federal Reserve Bank of Philadelphia. This paper concludes in Section 5 with an evaluation of the model's ability to improve the accuracy with which we can model measurement errors.

1.1. Notation

Throughout the paper we use the notation that is standard in this literature: superscripts refer to vintages, while subscripts refer to time periods. For example, y_1^t is the estimate available at time t of the value of variable y at time 1. Real-time data are typically displayed in the form of the revision triangle (see Fig. 1). We move to later vintages as we move across columns from left to right, and to later points in time as we move down the rows. Note that the frequency of vintages need not correspond to the unit of observation; for example, statistical agencies often publish monthly vintages of quarterly observations. Although the figure shows first available estimates as published without a lag, this assumption is typically relaxed, so that the typical entry in the last diagonal may be y_j^{j+1} or y_j^{j+l} rather than y_j^j. We also note there is no important distinction in this framework between ‘‘published data’’ and ‘‘forecasts’’. For example, if we choose to incorporate forecasts up to an h-period horizon, our lower diagonal will have typical element y_t^{t+h}.

Fig. 1. The revision triangle.

2. Literature review

Current interest in the analysis of real-time data dates from the publication of Croushore and Stark's ‘‘A Real-Time Data Set for Macroeconomists’’ (Croushore and Stark, 1999, 2001). Real-time data sets are increasingly available for the United States (from the Federal Reserve Banks of St. Louis and Philadelphia), the euro zone (EABCN Real Time Database, RTDB), and several other countries (the OECD real-time database). The literature on real-time data analysis is also expanding rapidly.2

2 Dean Croushore's real-time data bibliography (http://facultystaff.richmond.edu/∼dcrousho/docs/realtime_lit.pdf) has been very helpful in preparing this review.
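To fix ideas, the revision triangle of Fig. 1 maps naturally onto a two-dimensional array whose rows are observation periods and whose columns are vintages. The sketch below is our own illustration of the indexing in Section 1.1 (the data and variable names are hypothetical, not from the paper); the diagonals of the array correspond to the first-release series y_t^{t+1} and to fixed-maturity series such as y_t^{t+4}.

```python
import numpy as np
import pandas as pd

# Hypothetical revision triangle: entry (t, v) holds y_t^v, the vintage-v
# estimate of period t.  NaNs above the diagonal reflect the one-period
# publication lag (period t first appears in vintage t+1).
periods = pd.period_range("2000Q1", periods=8, freq="Q")
vintages = pd.period_range("2000Q2", periods=8, freq="Q")
rng = np.random.default_rng(0)

tri = pd.DataFrame(np.nan, index=periods, columns=vintages)
for i in range(len(periods)):
    for j in range(len(vintages)):
        if j >= i:                      # vintage j reports period i only once published
            tri.iloc[i, j] = rng.normal()

# First-release series y_t^{t+1}: the main diagonal of the triangle.
first_release = pd.Series(np.diag(tri), index=periods)

# Fixed-maturity series y_t^{t+4} (the estimate after three revisions).
fourth_estimate = pd.Series(
    [tri.iloc[i, i + 3] if i + 3 < len(vintages) else np.nan
     for i in range(len(periods))], index=periods)
```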
However, acknowledgement of data revisions dates back at least to the first issue of the first volume of the Review of Economic Statistics (Persons, 1919), while Kuznets (1948) also explicitly mentioned data revisions in his seminal paper on national income statistics. Zellner (1958, page 58) begins by noting that ‘‘Many . . . economic policy decisions are formulated or constructed on the basis of preliminary or provisional estimates of . . . our national accounts.’’ Zellner (1958) and Morgenstern (1963) analyzed the relationship between preliminary and revised data. Cole (1969) and Denton and Kuiper (1965) investigated the sensitivity of parameter estimates to data revisions. Stekler (1967), Denton and Kuiper (1965), and Cole (1969) examined the loss in forecast accuracy when using preliminary data instead of ‘‘true’’ data.

The intervening decades have seen continuing innovations in the study of data revisions. This review focuses on the modeling of measurement error, in particular on state-space models. We also review the empirical literature on the properties of measurement errors, and briefly document recent research on modeling data revisions.

2.1. State-space models

State-space models provide a convenient framework for modeling measurement errors in economic data.3 This was recognized early by Howrey (1978, 1984), although he did not mention the method by name. Conrad and Corrado (1979) were the first to apply the Kalman filter to data revisions, soon followed by Harvey et al. (1983).

3 For general introductions to state-space modeling see the textbooks of Harvey (1989) and Hamilton (1994, Chapter 13).

For a typical state-space model, we can gather the l most recent estimates of vintage t for a variable y as y^t = [y_{t−1}^t, . . . , y_{t−l+1}^t, y_{t−l}^t]′, where we have assumed a one-period publication lag. These estimates may be related to the corresponding ‘‘true’’ values ỹ^t = [ỹ_{t−1}, . . . , ỹ_{t−l+1}, ỹ_{t−l}]′ by defining measurement errors u^t ≡ y^t − ỹ^t = [u_{t−1}^t, . . . , u_{t−l+1}^t, 0]′, where we assume that the ‘‘true’’ values are obtained after l releases. For white noise measurement errors, the state vector α is defined as α = ỹ^t and the linear state-space model may be written as

    y^t = Z ỹ^t + ε_t    (measurement equation)    (1)
    ỹ^{t+1} = T ỹ^t + η_t    (transition equation)    (2)

with the usual assumptions on the error processes: ε_t ∼ i.i.d. N(0, H); η_t ∼ i.i.d. N(0, Q); and E(ε_t η_τ′) = 0 for all t and τ. The measurement equation models the relationship between the vector of observed variables and the state vector. The transition equation captures the dynamics of the state vector, usually a simple autoregressive process. At a given time τ, y^τ contains y_t^τ = ỹ_t for t ≤ τ − l, and optimal estimates of the true values for periods τ − l + 1 to τ can be obtained with the Kalman filter. This framework also allows us to form optimal forecasts of future true values given the available data, to estimate the standard errors associated with recent observations, and to calculate the relative weights that should be attached to recent data when forecasting or estimating true values. All of these activities have important practical applications.

Variants of this state-space framework have been used to deal with special cases. For example, serial correlation in measurement errors (cf. Howrey, 1978) can easily be captured by including the measurement error u_t in the state vector and adapting the transition equation, as shown in Harvey (1989, Section 6.4.4). Similar models are applied by Trivellato and Rettore (1986),
Bordignon and Trivellato (1989), Patterson (1994, 1995a,b,c), Mariano and Tanizaki (1995) and Busetti (2006). An alternative approach postulates that vintages are cointegrated and driven by a single common stochastic trend or ‘‘factor’’, while measurement errors and revisions are strictly stationary. Gallo and Marcellino (1999) and a series of papers by Patterson (2000, 2002a,b, 2003) and Patterson and Heravi (1991a,b) adopt this approach, while Siklos (2008) casts doubt on the underlying cointegration assumptions near benchmark revisions.

2.2. Measurement errors: ‘‘news’’ or ‘‘noise’’?

While it is acknowledged that measurement errors can be correlated across time, their behavior across vintages has been the subject of much research. Boschen and Grossman (1982), Mankiw et al. (1984), Mankiw and Shapiro (1986), Maravall and Pierce (1986), Mork (1987, 1990), Patterson and Heravi (1992), Croushore and Stark (2001, 2003), Faust et al. (2005), Swanson and van Dijk (2006), and Aruoba (2008) debate whether data revisions are best modeled as ‘‘news’’ or ‘‘noise’’. The two polar views are:

(i) Published data contain noise (ζ_t^{t+i}): Measurement errors are said to be noise when the errors ζ_t^{t+i} are orthogonal to the true values ỹ_t, but correlated with the available vintage, so that

    y_t^{t+i} = ỹ_t + ζ_t^{t+i},    cov(ỹ_t, ζ_t^{t+i}) = 0.    (3)

Noise implies that revisions (y_t^{t+i+1} − y_t^{t+i}) are generally forecastable.

(ii) Published data contain news (ν_t^{t+i}): Measurement errors are described as ‘‘news’’ if and only if they match the properties associated with rational forecast errors. This requires that revisions (y_t^{t+j+1} − y_t^{t+j}) are unpredictable given the information set at time t + j. Since this information set contains all previously published vintages, this implies that

    ỹ_t = y_t^{t+i} + ν_t^{t+i},    cov(y_t^{t+j}, ν_t^{t+i}) = 0 ∀ j ≤ i.    (4)

Unlike the previous case, the covariance restriction now relates the measurement error to previously published vintages rather than to true values. This condition also implies that the variability of the measurement errors for a given point in time t cannot increase as we move to more recent vintages. de Jong (1987) presents necessary and sufficient conditions for revisions to be rational. Sargent (1989) models a statistical agency that collects economic data as the sum of a vector of ‘‘true’’ variables and a vector of measurement errors, and then uses optimal filtering methods to construct and report least-squares estimates of the ‘‘true’’ variables. Kapetanios and Yates (2010) start from the Sargent (1989) world to model the publication process of the statistical agency.

The Mincer and Zarnowitz (1969) test of the ‘‘noise’’ specification regresses the measurement error u_t^{t+i} = y_t^{t+i} − ỹ_t on a constant and the final release:

    u_t^{t+i} = α_1 + β_1 ỹ_t + ζ_t^{t+i}.    (5)

The null hypothesis that measurement errors are independent of true values (α_1 = 0, β_1 = 0) may be tested with a Wald test. The analogous test of the ‘‘news’’ model regresses the measurement error y_t^{t+i} − ỹ_t on a constant and the ith release:

    y_t^{t+i} − ỹ_t = α_2 + β_2 y_t^{t+i} + ν_t^{t+i}.    (6)

The similar null hypothesis (α_2 = 0, β_2 = 0) now tests whether data revisions are predictable. The two null hypotheses are mutually exclusive, but they are not collectively exhaustive; i.e., we may be able to reject both hypotheses, particularly when the constant in both test equations differs from zero (see Aruoba, 2008).
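The two regressions are straightforward to run once a sufficiently mature vintage is treated as a stand-in for ỹ_t. The sketch below is our own illustration (the function and variable names are hypothetical; using a late vintage as ‘‘truth’’ is a common, if imperfect, expedient in this literature):

```python
import numpy as np
import statsmodels.api as sm

def noise_news_tests(estimate, truth, maxlags=4):
    """Wald tests of the noise null (5) and the news null (6).

    estimate: array of i-th releases y_t^{t+i}; truth: array standing in
    for the true values (e.g. a much later vintage), aligned by period t.
    """
    error = estimate - truth                    # u_t^{t+i} = y_t^{t+i} - y~_t
    # Noise test (5): regress the error on a constant and the final value;
    # under the noise null, alpha_1 = beta_1 = 0.
    noise = sm.OLS(error, sm.add_constant(truth)).fit(
        cov_type="HAC", cov_kwds={"maxlags": maxlags})
    # News test (6): regress the error on a constant and the i-th release;
    # under the news null, alpha_2 = beta_2 = 0.
    news = sm.OLS(error, sm.add_constant(estimate)).fit(
        cov_type="HAC", cov_kwds={"maxlags": maxlags})
    R = np.eye(2)                               # joint restriction: both coefficients zero
    return noise.wald_test(R), news.wald_test(R)
```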
2.3. Modeling of data revisions

Most of the relevant literature on data revisions can be grouped around three different themes.

Data description. These studies focus on detailing the often complex nature of data revisions. Much of the above literature on the news versus noise debate falls into this category, as do Siklos (2008) and Garratt and Vahey (2006).

Optimal forecasting and inference. These papers focus on deriving the optimal forecasts or smoothed estimates of ‘‘true’’ values for a postulated data revision process. Examples of such work include the Harvey et al. (1983), Sargent (1989) and Kapetanios and Yates (2010) papers cited above.

Cycle-trend decompositions. These papers focus on the accuracy of estimated trends or cycles near the end of the sample. While some (such as Laubach (2001), Orphanides and van Norden (2002), Rünstler (2002), or van Norden (2005)) take account of data revision, they ignore its structure and analyze each data vintage independently.

The links between these three strands of the literature are weak, but becoming stronger. For example, de Antonio Liedo and Carstensen (2006) improve the links between the first two strands by allowing for both simple news and noise effects in data revision, while Cunningham et al. (forthcoming) also allow for serial correlation in measurement errors. Kishor and Koenig (2005) allow for a general VAR model of data vintages. Similarly, Garratt et al. (2008) better integrate the second two strands by showing that simple revision forecasts can improve HP-filtered trends and cycles.

In the next section we present a state-space representation that further integrates all three strands of this literature. In particular, we relax the assumption imposed above that ‘‘true’’ or final values become available after l releases; we accommodate richer and more realistic dynamics for the measurement errors; we use conventional state-space methods to construct optimal forecasts and estimates of the underlying true values; and we show how to optimally estimate trends or cycles from multiple data vintages.

3. Our state-space model

3.1. Structure

Following Durbin and Koopman (2001) we write a generic (time-invariant) state-space model as

    y_t = Z α_t + ε_t    (7)
    α_{t+1} = T α_t + R η_t    (8)

where y_t is l × 1, α_t is m × 1, ε_t is l × 1 and η_t is r × 1; ε_t ∼ N(0, H) and η_t ∼ N(0, I_r). Both error terms are i.i.d. and orthogonal to one another.4 For convenience we omit constants from the model in this exposition.

4 For more detailed assumptions, see Durbin and Koopman (2001, Sections 3.1 and 4.1).

In our framework, the data y_t is an l × 1 vector of l different vintage estimates y_t^{t+i}, i = 1, . . . , l, for a particular observation t, so y_t ≡ [y_t^{t+1}, y_t^{t+2}, . . . , y_t^{t+l}]′ simply stacks the first through the lth estimate of y_t. The superscript denotes the period at which the vintage becomes available.5

5 The assumption that the first estimate becomes available with a one-period lag is arbitrary and innocuous; the assumption that the first estimate is available at t − j (i.e. a j-period forecast) would make no difference to our analysis.

We will denote the unobserved ‘‘true’’
value of this variable ỹ_t, and its measurement error u_t ≡ y_t − ι_l ỹ_t, where ι_l is an l × 1 vector of ones. Note that this differs from the conventional state-space modeling framework discussed above, which is specified in terms of y^t instead of y_t. In addition, unlike other models in this literature, we do not require the assumption that the last available estimate has no measurement error (i.e. that y_t^{t+l} = ỹ_t). Instead, we simply assume that ỹ_t belongs to a desired class of dynamic models (such as an ARMA(p, q) process, or a particular structural time series model that can be expressed in state-space form).

For tractability, we partition the state vector α_t into four components

    α_t = [ỹ_t, φ_t′, ν_t′, ζ_t′]′,    (9)

of length 1, b, l and l respectively, where φ_t corresponds to the dynamics of the true values, ν_t to news, and ζ_t to noise, as will become clear below. It will be convenient to similarly partition

    Z = [Z_1, Z_2, Z_3, Z_4],    (10)

where Z_1 = ι_l (an l × 1 vector of 1s), Z_2 = 0_{l×b} (an l × b matrix of zeros), and Z_3 = Z_4 = I_l (both l × l identity matrices). We also impose H = 0, so that there is no error term associated with the measurement equation. The measurement equation (7) then simplifies to

    y_t = Z α_t = ỹ_t + ν_t + ζ_t = ‘‘Truth’’ + ‘‘News’’ + ‘‘Noise’’.

Consistent with the above, we partition the matrix T as

    T = [ T_{11}  T_{12}  0    0
          T_{21}  T_{22}  0    0
          0       0       T_3  0
          0       0       0    T_4 ],    (11)

where T_{11} is a scalar, and {T_{12}, T_{21}, T_{22}, T_3, T_4} are 1 × b, b × 1, b × b, l × l and l × l; 0 is a conformably defined matrix of zeros. We similarly partition R into a (1 + b + 2l) × r matrix

    R = [ R_1  R_3                0
          R_2  0                  0
          0    −U_1 · diag(R_3)   0
          0    0                  R_4 ],    (12)

where U_1 is an l × l matrix with zeros below the main diagonal and ones everywhere else, R_3 = [σ_{ν1}, σ_{ν2}, . . . , σ_{νl}], where σ_{νi} is the standard error of the incremental news error associated with the ith estimate y_t^{t+i}, diag(R_3) is an l × l matrix with the elements of R_3 on its main diagonal, and R_4 is an l × l matrix. Finally, we partition the error term associated with the transition equation as η_t = [η_{et}′, η_{νt}′, η_{ζt}′]′, where η_{et} refers to errors associated with the true values, and η_{νt} and η_{ζt} are the errors for news and noise, respectively.

In the next subsection, we explore the specification of the measurement error, and thereafter we consider the specification of the dynamics of ỹ_t.

3.2. Measurement error

In this subsection, we show how each of three special cases of measurement errors (pure noise, pure news and simple spillovers) may be captured in our state-space framework. We then describe how all three types can be combined in a general model.

3.2.1. Pure noise

In the pure noise case we may drop ν_t from the state vector and simplify Eqs. (7) and (8) to obtain

    α_t = [ỹ_t, φ_t′, ζ_t′]′,
    T = [ T_{11}  T_{12}  0
          T_{21}  T_{22}  0
          0       0       T_4 ],
    R = [ R_1  0
          R_2  0
          0    R_4 ],

and Z = [Z_1, Z_2, Z_4]. In the simplest case of ‘‘pure’’ noise, measurement errors are independent of the measurement errors in neighboring vintages, so E(u_t u_t′) is a diagonal matrix. We can impose this property by setting

    R_4 ≡ [ σ_{ζ1}  0       · · ·  0
            0       σ_{ζ2}  · · ·  0
            ...     ...     ...    ...
            0       0       · · ·  σ_{ζl} ].

The assumption that estimates become more precise over time could be imposed by restricting σ_{ζl} < σ_{ζ,l−1} < · · · < σ_{ζ2} < σ_{ζ1}.

3.2.2. Pure news

In the pure news case we may drop ζ_t from the state vector and simplify Eqs. (7) and (8) to obtain

    α_t = [ỹ_t, φ_t′, ν_t′]′,
    T = [ T_{11}  T_{12}  0
          T_{21}  T_{22}  0
          0       0       0 ],

and Z = [Z_1, Z_2, Z_3]. Imposing the ‘‘news’’ properties then requires

    R = [ R_1  R_3
          R_2  0
          0    −U_1 · diag(R_3) ].

This means that (8) becomes

    ỹ_{t+1} = T_{11} · ỹ_t + T_{12} · φ_t + R_1 η_{et} + R_3 η_{2t}

and

    ν_{t+1} = −U_1 · diag(R_3) · η_{2t} ≡ − [ σ_{ν1}  σ_{ν2}  · · ·  σ_{νl}
                                              0       σ_{ν2}  · · ·  σ_{νl}
                                              ...     ...     ...    ...
                                              0       · · ·   0     σ_{νl} ] · η_{2t}.

In this way, adding news shocks ν_t to the true values ỹ_t removes some of the information (shocks) driving ỹ_t. As subsequent vintages peel away some of these news shocks, our estimates of ỹ_t improve.

3.2.3. Spillovers

Information which arrives at t + j and causes the statistical agency to revise an estimate y_t^{t+j} may also cause them to revise nearby estimates, such as y_{t+1}^{t+j}, y_{t−1}^{t+j}, etc.6 Such a relationship between measurement errors in different economic time periods (spillovers) is independent of the characterization of measurement error as news or noise. Accordingly, the presence or absence of spillovers has no implications for the form of R_3 or R_4, but instead is captured via the specification of T_3 or T_4. In the simplest case of noise and spillovers, we can capture the effect of measurement errors for time t affecting estimates at t − 1 by setting T_4 = ρ I_l, where ρ is the correlation coefficient between u_t and u_{t−1}. Propagation over longer time periods can be accommodated by stacking successive values of ζ_t into the state vector and (if desired) specifying richer intertemporal dynamics.

6 For example, data from annual tax returns become available only long after the end of the tax year, but may lead to the conclusion that income in each month or quarter of the previous year should be revised upwards or downwards.

3.2.4. General models of measurement errors

Since their effects enter via different system matrices, we may freely combine any desired pattern of spillover effects with either
pure noise or pure news models of measurement errors. We may also combine both news and noise effects in the same model. In addition, spillovers may differ for news and noise. Still more general dynamics can be obtained by allowing R_4 to be an unrestricted upper-triangular matrix, so that noise need not be i.i.d. across vintages.

3.3. Dynamics of true values

The dynamics of the ‘‘true’’ value ỹ_t are jointly determined by the first two blocks of the state vector, [ỹ_t, φ_t′]′, the upper block of the transition matrix,

    [ T_{11}  T_{12}
      T_{21}  T_{22} ],

and the associated blocks of the error weighting matrix, R_1 and R_2. This framework is sufficiently general to accommodate a variety of popular dynamic models for ỹ_t; we provide two examples below. We first examine the case where ỹ_t follows an ARMA(p, q) process. Thereafter, we describe the case where ỹ_t is assumed to follow the structural time series model of Harvey and Jaeger (1993).

3.3.1. ARMA dynamics

Letting φ_t = [ỹ_{t−1}, . . . , ỹ_{t−p+1}, e_t, . . . , e_{t−q+1}]′ we can capture the ARMA(p, q) dynamics of ỹ_t by specifying

    [R_1′, R_2′]′ = [σ_e, 0_{1×(p−1)}, σ_e, 0_{1×(q−1)}]′

and

    [ T_{11}  T_{12} ]   [ ρ′            θ′                   ]
    [ T_{21}  T_{22} ] = [ I_{p−1}       0_{(p−1)×(q+1)}      ]
                         [ 0_{1×p}       0_{1×q}              ]
                         [ 0_{(q−1)×p}   I_{q−1} | 0_{(q−1)×1} ]
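In code, these blocks are mechanical to assemble. The following sketch is our own illustration (the function name and layout are not from the paper); ρ and θ denote the vectors of autoregressive and moving average coefficients defined in the text immediately below, and the state is ordered as [ỹ_t, ỹ_{t−1}, . . . , ỹ_{t−p+1}, e_t, . . . , e_{t−q+1}]′:

```python
import numpy as np

def arma_true_value_blocks(rho, theta, sigma_e):
    """Assemble [[T11, T12], [T21, T22]] and [R1' R2']' for ARMA(p, q)
    dynamics of the true values (assumes p >= 1 and q >= 1)."""
    p, q = len(rho), len(theta)
    n = p + q
    T = np.zeros((n, n))
    T[0, :p] = rho                          # y~_{t+1} loads on lagged true values ...
    T[0, p:] = theta                        # ... and on lagged innovations
    T[1:p, :p - 1] = np.eye(p - 1)          # shift the y~ lags down one slot
    T[p + 1:, p:p + q - 1] = np.eye(q - 1)  # shift the e lags down one slot
    R = np.zeros((n, 1))
    R[0, 0] = sigma_e                       # the new shock enters y~_{t+1} ...
    R[p, 0] = sigma_e                       # ... and is stored as e_{t+1}
    return T, R

# e.g. T, R = arma_true_value_blocks([0.4, 0.1], [0.3], 1.0)  # ARMA(2,1)
```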
where ρ is the p × 1 vector containing the autoregressive coefficients and θ is the q × 1 vector containing the moving average coefficients.

3.3.2. A structural time series model

To obtain a trend level and drift rate subject to stochastic shocks, as well as a stochastic cycle with period 2π/λ, we can use the following specification:

• φ_t = [τ_t, μ_t, c_t, c_t*]′, where τ_t is the level of the trend component of ỹ_t at time t, μ_t is its growth rate, and c_t and c_t* are the stochastic cycle components of ỹ_t at time t, with standard deviations σ_τ, σ_μ and σ_c, respectively;
• T_{11} = 0;
• T_{21} = [0 0 0 0]′;
• T_{22} = [ 1  1  0       0
             0  1  0       0
             0  0  cos λ   sin λ
             0  0  −sin λ  cos λ ], where λ is the frequency of the cyclical component;7
• T_{12} = [1, 1, cos λ, sin λ];
• [ R_1 ]   [ σ_τ  0    σ_c  0
  [ R_2 ] =   σ_τ  0    0    0
              0    σ_μ  0    0
              0    0    σ_c  0
              0    0    0    σ_c ].

The definitions of T_{12} and R_1 simply follow from the trend-cycle decomposition ỹ_t ≡ τ_t + c_t.

7 We could also prefix the trigonometric constants by a constant ρ subject to |ρ| < 1 to produce dampened cycles.

3.4. Identification and estimation

It is well-known that in the absence of restrictions on the parameter matrices of the general state-space model, the
parameters are unidentified: more than one set of values for the parameters can give rise to an identical value of the likelihood function, and the data give us no guide for choosing among these (Hamilton, 1994, Section 13.4). Local identification may be verified by translating the state-space representation into a vector ARMA process and checking the conditions for identification given in Hannan (1971, 1976). Another approach is to work directly with the state-space representation; see e.g. Burmeister et al. (1986) or Otter (1986). We adopt the latter approach in a companion paper: Jacobs and van Norden (2007) provide sufficient conditions under which the parameters of our model are identified. We summarize some of them here. We begin by assuming that the parameters of the dynamic model for the true values ỹ are identified. A second condition is that the dynamics of the measurement errors and true values are not too similar. When the dynamics implied by T_φ, T_ζ and T_ν are sufficiently distinct, this can be critical in distinguishing noise and news from movements in true values. As a result, identification is generally easier in our general model than in a restricted model which imposes the same dynamics on all three components. Even in the case where these components are identical (for example, when the dynamics of news and noise are identical), we can distinguish news and noise if we have sufficient restrictions on the persistence of noise shocks across subsequent vintages. Therefore, although our model allows for rich dynamics, the richness of the data generally allows us to identify quite complex behaviors.

While the parameters of our general model may be identified, this need not imply that conventional state-space methods are the only (or even the best) way to estimate them. As de Antonio Liedo and Carstensen (2006) note, the presence of multiple vintages implies that many revisions are directly observed, and their behavior may be consistently estimated by GLS. This suggests the use of an EM algorithm or a concentrated likelihood function to simplify estimation of the full model. Even in situations where the model's parameters may be estimated without recourse to its state-space form (such as when we assume that true values are observed after a small number of revisions), the state-space form of the model may still be useful for the construction of optimal forecasts and smoothed estimates, and for determining the optimal weights to be placed on data of varying vintages.

4. Illustration

This section demonstrates the feasibility of our state-space framework by presenting estimates for five simple models of US real output estimates. We begin by describing our data source and reviewing the characteristics of the data revision process. We then describe, present and compare five alternative state-space models for these data: pure noise, pure news, noise plus spillovers, news plus spillovers, and a mixture of news, noise and spillovers.

4.1. Data and data properties

To illustrate the uses of our model of data revisions, we use the real output series of the Federal Reserve Bank of Philadelphia, which consists of quarterly vintages from 1965Q4 up to and including 2006Q2 and provides quarterly information from 1947Q1 onwards. The vintages have a publication lag of one quarter, i.e. our last vintage (2006Q2) spans the period 1947Q1 to 2006Q1.8

8 The vintages 1992Q1–1992Q4 do not have observations for 1947Q1–1958Q4, vintages 1996Q2–1997Q1 provide no information for 1947Q1–1959Q2, and vintages 1999Q4 and 2000Q1 begin only in 1959Q1.
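The one-quarter publication lag determines which column of the revision triangle holds each estimate. A minimal sketch of this alignment (our own illustration with hypothetical names; parsing the actual Philadelphia Fed files is not shown):

```python
import numpy as np
import pandas as pd

def ith_estimate(levels, i):
    """Return the series of i-th available estimates y_t^{t+i}.

    levels: revision-triangle DataFrame whose rows are quarters and whose
    columns are quarterly vintages, with a one-quarter publication lag
    (so the first estimate of row t sits on the main diagonal).
    """
    values = []
    for t in range(len(levels.index)):
        col = t + i - 1                 # vintage t+i is i-1 columns right of the diagonal
        values.append(levels.iloc[t, col] if col < levels.shape[1] else np.nan)
    return pd.Series(values, index=levels.index, name=f"estimate_{i}")

# e.g. first and fourth available estimates of each quarter:
# y1 = ith_estimate(levels, 1); y4 = ith_estimate(levels, 4)
```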
Real output estimates are typically revised once per quarter for the first few quarters after they are released, to incorporate the arrival of more and better information. Annual revisions (in the summer) commonly affect up to the last three years of observations. In addition, comprehensive or benchmark revisions, carried out every few years, typically incorporate conceptual, methodological or benchmark changes and therefore can change every published estimate from the first to the last observation. Benchmark revisions in the real output series occurred in 1976Q1, 1981Q1, 1986Q1, 1992Q1, 1996Q1, 1997Q2, 1999Q4, 2000Q2 and 2004Q1. To mitigate any level effects of benchmark revisions we follow the common practice of analyzing the first differences of the natural logarithms of the raw data. To better describe the properties of the data, we confine ourselves to periods for which at least 12 different vintages are available. This prevents estimation of the model for the most recent time periods, but still leaves 153 time-series observations for analysis.9

9 This restriction could be relaxed using filtering methods for missing observations (see Harvey, 1989, Section 3.4.7).

To illustrate the complex character of measurement errors in this commonly studied series, Fig. 2 shows correlations between different vintage revisions (y_t^{t+j} − y_t^{t+j−1}) on the one hand, and current and final estimates (y_t^{t+j}, y_t^{2006Q2}) on the other. We observe many non-zero correlation coefficients, suggesting that the news and noise null hypotheses are rejected at times. Interestingly, different revisions appear to display different properties. For the first revision, neither correlation coefficient is close to zero, suggesting that both the pure news and pure noise models would be rejected. For the second revision, however, the correlation with the current vintage is approximately zero, suggesting that the news model would be a good fit. The opposite is true for the third revision, suggesting that it may be better characterized as noise. The figure also shows that the dynamics continue to vary for many quarters after the initial release of the data. These results are only meant to be suggestive; we leave a more formal consideration of the characteristics of data revisions to the model results, presented below.

Fig. 2. News and noise in US GDP growth?

4.2. Estimation results

We assume a simple AR(2) process for the dynamics of ‘‘true’’ output growth. The estimates presented below use only the first four vintage estimates in our y vector. This captures the most important revisions shown in the above figure (including the annual revision each summer), but also implies that important measurement error may be present in the ‘‘final’’ vintage. Our most general model with news, noise and spillovers may then be written as

    [y_t^{t+1}, y_t^{t+2}, y_t^{t+3}, y_t^{t+4}]′ = [ι_4  0_{4×1}  I_4  I_4] · [ỹ_t, ỹ_{t−1}, ν_t′, ζ_t′]′

and

    [ ỹ_t     ]   [ ρ_1      ρ_2      0_{1×4}  0_{1×4} ] [ ỹ_{t−1} ]
    [ ỹ_{t−1} ] = [ 1        0        0_{1×4}  0_{1×4} ] [ ỹ_{t−2} ] + R η_t,
    [ ν_t     ]   [ 0_{4×1}  0_{4×1}  T_3      0_{4×4} ] [ ν_{t−1} ]
    [ ζ_t     ]   [ 0_{4×1}  0_{4×1}  0_{4×4}  T_4     ] [ ζ_{t−1} ]

where

    R = [ σ_e  σ_{ν1}   σ_{ν2}   σ_{ν3}   σ_{ν4}   0       0       0       0
          0    0        0        0        0        0       0       0       0
          0    −σ_{ν1}  −σ_{ν2}  −σ_{ν3}  −σ_{ν4}  0       0       0       0
          0    0        −σ_{ν2}  −σ_{ν3}  −σ_{ν4}  0       0       0       0
          0    0        0        −σ_{ν3}  −σ_{ν4}  0       0       0       0
          0    0        0        0        −σ_{ν4}  0       0       0       0
          0    0        0        0        0        σ_{ζ1}  0       0       0
          0    0        0        0        0        0       σ_{ζ2}  0       0
          0    0        0        0        0        0       0       σ_{ζ3}  0
          0    0        0        0        0        0       0       0       σ_{ζ4} ]

and η_t = [η_{et}, η_{ν1,t}, η_{ν2,t}, η_{ν3,t}, η_{ν4,t}, η_{ζ1,t}, η_{ζ2,t}, η_{ζ3,t}, η_{ζ4,t}]′, and where T_3 and T_4 are diagonal 4 × 4 matrices. The model has 19 free parameters (ρ_1, ρ_2, σ_e, σ_{ν1}, σ_{ν2}, σ_{ν3}, σ_{ν4}, σ_{ζ1}, σ_{ζ2}, σ_{ζ3}, σ_{ζ4} and the eight diagonal elements of T_3 and T_4) to be estimated on 153 observations of four data vintages.10

10 We investigated the properties of the maximum likelihood (ML) estimators reported below by simulating data from the above general model and comparing the resulting ML parameter estimates to the true values used to create the data. Results were encouraging; ML estimates were close to true values and the robust standard errors used below seemed to be a reasonable guide to their precision, particularly for the elements of the T matrix.

We also considered four simpler models, each of which maintained the assumption of an AR(2) process for the true values of real output growth.11 Two of these assumed the absence of news shocks and take the form

11 Estimates assuming an AR(1) process produced very similar results.

    [y_t^{t+1}, y_t^{t+2}, y_t^{t+3}, y_t^{t+4}]′ = [ι_4  0_{4×1}  I_4] · [ỹ_t, ỹ_{t−1}, ζ_t′]′

and

    [ ỹ_t     ]   [ ρ_1      ρ_2      0_{1×4} ] [ ỹ_{t−1} ]   [ σ_e      0_{1×4}                               ]
    [ ỹ_{t−1} ] = [ 1        0        0_{1×4} ] [ ỹ_{t−2} ] + [ 0        0_{1×4}                               ] [η_{et}, η_{ζ1,t}, . . . , η_{ζ4,t}]′.
    [ ζ_t     ]   [ 0_{4×1}  0_{4×1}  T_4     ] [ ζ_{t−1} ]   [ 0_{4×1}  diag(σ_{ζ1}, σ_{ζ2}, σ_{ζ3}, σ_{ζ4}) ]

The other two assumed the absence of noise shocks and take the form

    [y_t^{t+1}, y_t^{t+2}, y_t^{t+3}, y_t^{t+4}]′ = [ι_4  0_{4×1}  I_4] · [ỹ_t, ỹ_{t−1}, ν_t′]′
Table 1
Estimation results: AR(2) model, four vintages. Entries are parameter estimates with standard errors in parentheses.

Parameter                 Pure noise       Pure news        Noise + spill.   News + spill.    News + noise + spill.
ρ_1 (AR(1))               0.410 (0.082)    0.217 (0.166)    0.410 (0.082)    0.198 (0.118)    0.177 (0.146)
ρ_2 (AR(2))               0.043 (0.085)    0.133 (0.088)    0.043 (0.082)    0.116 (0.053)    0.147 (0.071)
T_{3,11} (Spill 1, news)  –                –                –                −0.008 (0.044)   −0.055 (0.064)
T_{3,22} (Spill 2, news)  –                –                –                −0.030 (0.046)   −0.079 (0.065)
T_{3,33} (Spill 3, news)  –                –                –                −0.037 (0.048)   −0.088 (0.068)
T_{3,44} (Spill 4, news)  –                –                –                −0.025 (0.047)   −0.077 (0.071)
T_{4,11} (Spill 1, noise) –                –                −0.130 (0.087)   –                –
T_{4,22} (Spill 2, noise) –                –                0.035 (0.107)    –                0.850 (0.169)
T_{4,33} (Spill 3, noise) –                –                −0.084 (0.164)   –                –
T_{4,44} (Spill 4, noise) –                –                −0.008 (0.042)   –                0.305 (0.311)
σ_e × 10 (AR shock)       9.074 (0.523)    5.666 (1.112)    9.075 (0.522)    4.893 (1.185)    4.761 (1.086)
σ_{ν1} × 10 (News 1)      –                2.302 (0.132)    –                2.199 (0.134)    2.141 (0.130)
σ_{ν2} × 10 (News 2)      –                1.258 (0.072)    –                1.241 (0.073)    1.195 (0.141)
σ_{ν3} × 10 (News 3)      –                1.612 (0.092)    –                1.575 (0.094)    1.430 (0.330)
σ_{ν4} × 10 (News 4)      –                24.777 (14.445)  –                30.087 (15.240)  27.012 (14.118)
σ_{ζ1} × 10 (Noise 1)     2.387 (0.152)    –                2.364 (0.151)    –                0.000 (–)
σ_{ζ2} × 10 (Noise 2)     1.028 (0.097)    –                1.030 (0.096)    –                0.183 (0.150)
σ_{ζ3} × 10 (Noise 3)     0.726 (0.112)    –                0.720 (0.112)    –                0.000 (–)
σ_{ζ4} × 10 (Noise 4)     1.569 (0.105)    –                1.570 (0.105)    –                0.633 (0.455)
Log likelihood (llf)      −48.533          −16.887          −47.087          −9.365           −7.597
AIC                       111.067          47.774           116.175          40.730           53.194
BIC                       132.280          68.987           149.510          74.065           110.772

Notes: Standard errors are based on QML estimates. The Akaike Information Criterion (AIC) is calculated as −2 · llf + 2 · k, where k is the number of parameters. The Bayes Information Criterion (BIC) is calculated as −2 · llf + ln(T) · k, where T is the number of observations.
with the corresponding transition equation

    [ ỹ_t     ]   [ ρ_1      ρ_2      0_{1×4} ] [ ỹ_{t−1} ]
    [ ỹ_{t−1} ] = [ 1        0        0_{1×4} ] [ ỹ_{t−2} ] + R η_t,
    [ ν_t     ]   [ 0_{4×1}  0_{4×1}  T_3     ] [ ν_{t−1} ]

where

    R = [ σ_e  σ_{ν1}   σ_{ν2}   σ_{ν3}   σ_{ν4}
          0    0        0        0        0
          0    −σ_{ν1}  −σ_{ν2}  −σ_{ν3}  −σ_{ν4}
          0    0        −σ_{ν2}  −σ_{ν3}  −σ_{ν4}
          0    0        0        −σ_{ν3}  −σ_{ν4}
          0    0        0        0        −σ_{ν4} ]

and η_t = [η_{et}, η_{ν1,t}, η_{ν2,t}, η_{ν3,t}, η_{ν4,t}]′.
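Before turning to the estimates, it may help to see that such models drop directly into standard state-space software. The following is a minimal sketch of the simplest specification above (the pure noise model, without spillovers) written for statsmodels; it is our own illustrative code under simplifying assumptions (untransformed parameters, no standardization), not the authors' implementation:

```python
import numpy as np
from statsmodels.tsa.statespace.mlemodel import MLEModel

class PureNoiseVintages(MLEModel):
    """Four vintage estimates of y_t share an AR(2) 'true' value plus
    vintage-specific i.i.d. noise.  endog: (T x 4) array of vintages."""

    def __init__(self, endog):
        # States: [y~_t, y~_{t-1}, zeta_1..zeta_4]; shocks: [eta_e, eta_zeta_1..4].
        super().__init__(endog, k_states=6, k_posdef=5,
                         initialization="stationary")
        Z = np.zeros((4, 6))
        Z[:, 0] = 1.0                  # each vintage loads on the true value ...
        Z[:, 2:] = np.eye(4)           # ... plus its own noise state (H = 0)
        self["design"] = Z
        self["transition", 1, 0] = 1.0 # bookkeeping: carry y~_t into y~_{t-1}
        self["state_cov"] = np.eye(5)  # standardized shocks; scales live in R

    @property
    def param_names(self):
        return ["rho1", "rho2", "sigma_e",
                "sigma_zeta1", "sigma_zeta2", "sigma_zeta3", "sigma_zeta4"]

    @property
    def start_params(self):
        return np.array([0.3, 0.1, 1.0, 0.5, 0.4, 0.3, 0.2])

    def update(self, params, **kwargs):
        params = super().update(params, **kwargs)
        self["transition", 0, 0] = params[0]   # rho1
        self["transition", 0, 1] = params[1]   # rho2
        R = np.zeros((6, 5))
        R[0, 0] = params[2]                    # AR shock feeds the true value
        R[2:, 1:] = np.diag(params[3:])        # one noise shock per vintage
        self["selection"] = R

# vintage_data: hypothetical (T x 4) array built as in Section 4.1.
# res = PureNoiseVintages(vintage_data).fit()   # ML via the Kalman filter
# print(res.summary())
```

The news and spillover variants amount to different transition and selection blocks, exactly as in the displays above.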
For both pairs of models, we estimated both a version with the first-order spillover effects described above (where T_3 or T_4 are diagonal) and a version without spillover effects (where T_3 = T_4 = 0_{4×4}).

Table 1 lists the parameter estimates for all five of the above models: pure noise, pure news, noise plus spillovers, news plus spillovers, and the general or ‘‘news plus noise plus spillovers’’ model. Prior to estimation, all vintages were approximately standardized using the mean and standard deviation of the final (i.e. 4th) vintage.12 All parameters were estimated by maximum likelihood (ML) and their standard errors are robust estimates based on the cross-product of the score matrix.

12 This provided vintages that each had means very close to zero and standard deviations very close to one, while ensuring that the change from one vintage to the next was entirely due to a revision in the official estimates and not a change in factors used to standardize the data.

The third and fourth columns list the parameter estimates and standard errors for the pure noise model. The AR parameters ρ_1 and ρ_2 sum to approximately 0.45. The AR shock has a standard deviation (σ_e) slightly less than one. The estimated standard deviations of the noise shocks are large relative to their robust standard errors and they decrease until that of the final measurement error σ_{ζ4}. The degree of measurement error remaining after the first year of revisions is slightly more than half of that of the first revision. The largest of the noise shocks has a standard deviation roughly one-quarter the size of the AR shock. All these characteristics seem reasonable given the dynamic properties of the data.

The results of the pure news model listed in the fifth and sixth columns of Table 1 show an AR(2) process with slightly less persistence than before; the sum of the AR parameters is now only 0.35. However, the AR shock is now about one-third smaller than in the previous model. The standard deviations of the pure news shocks are of the same magnitude as those of the pure noise shocks, with the exception of the final measurement error, which is now much larger but very imprecisely estimated.

Columns seven through ten show the results of adding spillover effects to the above pure news and pure noise models. At first glance, spillover effects appear to be unimportant. Individual estimates of the spillover parameters all appear small relative to their standard errors, and estimates of the other model parameters are little changed from their values in the pure noise and news models. However, this neglects the strong correlation between estimated spillover parameters, which jointly appear to be important for each model. For example, a Likelihood Ratio (LR) test of the null hypothesis of no spillovers should have a χ²(4) distribution under the null, but the statistic has a value of 2.89 for the noise model and 15.04 for the news model.13

13 Of course, likelihood ratio statistics also have interpretations in terms of posterior odds. If λ is the likelihood ratio statistic, then e^{λ/2} is the factor by which we increase the prior odds of the spillover model to arrive at the posterior odds. This is roughly 1.2 × 10^96 for the noise models and something in excess of 1 × 10^500 for the news models.

The final two columns of the table show the parameter estimates of the most general model we estimated: the mixture of news, noise and spillovers. The estimates of the AR parameters and all the parameters of the news shocks are very similar to those of the news plus spillovers model presented in the two previous columns. The noise shocks, however, are very different. Two of
the four noise shocks are estimated to have standard deviations of zero, while the remaining two seem much smaller than before.14 However, the estimated spillover effects of the two remaining noise shocks are by far the largest of any model.

14 When the data revision shocks are equal to zero, the corresponding spillover parameters are not identified. As is the case with the standard deviations of other shocks, zero is on the boundary of the parameter space, so conventional standard errors are not valid.

While the most general model provides an increase in the value of the log likelihood function relative to any of the other models, the use of likelihood ratio tests for model comparisons is not justified in this case, as several parameters (the standard deviations of the news or noise shocks) are constrained to lie on the boundary of the parameter space (i.e. = 0) under the null hypothesis. The Akaike Information Criterion (AIC) prefers the simpler News + Spillovers model as the best characterization of the data, while the Bayes Information Criterion (BIC) prefers the Pure News model. The BIC values parsimony more highly than the AIC and has more desirable asymptotic properties.15,16 Most of the models estimated above are not particularly parsimonious; for example, our above discussion of the estimated spillover parameters suggests that a more parsimonious representation may still provide a reasonable description of the data.17

15 The BIC is consistent and has the same asymptotic properties independent of the choice of prior. Moreover, it is guaranteed to be maximal at the model with the highest posterior odds.
16 We also examined the Hannan–Quinn Information Criterion (HQIC), which has a variable penalty for lack of parsimony indexed by c. The HQIC is consistent for c > 2; it selects the News + Spillovers models for c < 2.36 and otherwise selects the Pure News model.
17 Additional results (available by request from the authors) confirmed that more parsimonious parameterizations of the spillover effects in the News + Spillovers models were preferred by the BIC to the Pure News model.

These results illustrate the feasibility of our state-space framework for modeling measurement errors. While they provide relatively little evidence of noise in GDP revisions, they show the importance of the spillover effects of news across time-series observations. These results are consistent with the view that the US Bureau of Economic Analysis has efficiently incorporated available information into its preliminary estimates and revisions of real GDP.

5. Conclusion

This paper presented a state-space framework which allows for general dynamics of ‘‘true’’ values and three types of measurement errors (news, noise and spillovers), as well as published estimates which never converge to ‘‘true’’ values. Analysis of the revisions of US real GDP presented here confirms the previous findings that their dynamics are complex. Our novel formulation of the state-space model enables us to flexibly manipulate and estimate these complex dynamics. Our results above confirm that these methods are feasible even using modest, realistic numbers of observations and vintages. They also demonstrate the relative ease with which we can separate the news and noise components in measurement errors.

We think this state-space framework for modeling data revisions has great potential. Applications include estimating and constructing confidence intervals for productivity trends and cycles for the formulation and conduct of monetary and fiscal policy. Future research will also address multivariate processes.

Acknowledgements

This paper was written during visits of the first author to HEC Montréal and CIRANO, of the second author to the research school
SOM of the University of Groningen, and of both authors to KOF Zurich. The hospitality and support of these institutions, as well as that of CREF and CIREQ, is gratefully acknowledged. The second author would also like to thank the INE program of the Canadian SSHRC for financial support. We would like to thank participants at the 2006 (EC)2 Conference, Rotterdam, December 2006; the 2007 International Symposium on Forecasting, New York, June 2007; the International Conference on Measurement Error, Econometrics and Practice, Birmingham, July 2007; the CGBCR Conference, Manchester, July 2007; the 2007 Joint Statistical Meetings, Salt Lake City, Utah, July 2007; the 2008 AEA Meetings, New Orleans, LA, January 2008; ESEM2008, Milan, August 2008; the 5th Eurostat Colloquium on Modern Tools for Business Cycle Analysis, Luxembourg, September–October 2008; and several seminars and workshops for their helpful comments. The present version has benefited from the helpful comments of the late Arnold Zellner as well as this journal's associate editor and referees.

References

Aruoba, S. Borağan, 2008. Data revisions are not well behaved. Journal of Money, Credit and Banking 40, 319–340.
Bordignon, S., Trivellato, U., 1989. The optimal use of provisional data in forecasting with dynamic models. Journal of Business & Economic Statistics 7, 275–286.
Boschen, J.F., Grossman, H.I., 1982. Tests of equilibrium macroeconomics using contemporaneous monetary data. Journal of Monetary Economics 10, 309–333.
Burmeister, E., Wall, K.D., Hamilton, J.D., 1986. Estimation of unobserved expected monthly inflation using Kalman filtering. Journal of Business & Economic Statistics 4, 147–160.
Busetti, F., 2006. Preliminary data and econometric forecasting: an application with the Bank of Italy Quarterly Model. Journal of Forecasting 25, 1–23.
Cole, R., 1969. Data errors and forecasting accuracy. In: Mincer, J. (Ed.), Economic Forecasts and Expectations: Analyses of Forecasting Behavior and Performance. National Bureau of Economic Research, New York, pp. 47–82 (Chapter 2).
Conrad, W., Corrado, C., 1979. Application of the Kalman filter to revisions in monthly retail sales estimates. Journal of Economic Dynamics and Control 1, 177–198.
Croushore, D., 2006. Forecasting with real-time macroeconomic data. In: Elliott, G., Granger, C.W.J., Timmermann, A. (Eds.), Handbook of Economic Forecasting. North-Holland, Amsterdam.
Croushore, D., Stark, T., 1999. A real-time data set for macroeconomists. Working Paper No. 99-4. Federal Reserve Bank of Philadelphia.
Croushore, D., Stark, T., 2001. A real-time data set for macroeconomists. Journal of Econometrics 105, 111–130.
Croushore, D., Stark, T., 2003. A real-time data set for macroeconomists: does the data vintage matter? Review of Economics and Statistics 85, 605–617.
Cunningham, A., Eklund, J., Jeffery, C., Kapetanios, G., Labhard, V., 2009. A state space approach to extracting the signal from uncertain data. Journal of Business & Economic Statistics (forthcoming).
de Antonio Liedo, D., Carstensen, K., 2006. A model for real-time data assessment and forecasting. Manuscript.
de Jong, P., 1987. Rational economic data revisions. Journal of Business & Economic Statistics 5, 539–548.
Denton, F.T., Kuiper, J., 1965. The effect of measurement errors on parameter estimates and forecasts: a case study based on the Canadian preliminary National Accounts. Review of Economics and Statistics 47, 198–206.
Durbin, J., Koopman, S.J., 2001. Time Series Analysis by State Space Methods. Oxford University Press, Oxford.
Faust, J., Rogers, J.H., Wright, J.H., 2005. News and noise in G-7 GDP announcements. Journal of Money, Credit, and Banking 37, 403–419.
Gallo, G.M., Marcellino, M., 1999. Ex post and ex ante analysis of provisional data. Journal of Forecasting 18, 421–433.
Garratt, A., Lee, K., Mise, E., Shields, K., 2008. Real time representations of the output gap. Review of Economics and Statistics 90, 792–804.
Garratt, A., Vahey, S.P., 2006. UK real-time macro data characteristics. The Economic Journal 116, F119–F135.
Hamilton, J.D., 1994. Time Series Analysis. Princeton University Press, Princeton, NJ.
Hannan, E.J., 1971. The identification problem for multiple equation systems with moving average errors. Econometrica 39, 751–765.
Hannan, E.J., 1976. The identification and parameterization of ARMAX and state space forms. Econometrica 44, 713–723.
Harvey, A.C., 1989. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge.
Harvey, A.C., Jaeger, A., 1993. Detrending, stylized facts and the business cycle. Journal of Applied Econometrics 8, 231–247.
Harvey, A.C., McKenzie, C.R., Blake, D.P.C., Desai, M.J., 1983. Irregular data revisions. In: Zellner, A. (Ed.), Applied Time Series Analysis of Economic Data. Economic Research Report ER-5. US Department of Commerce, pp. 329–339.
Howrey, E.P., 1978. The use of preliminary data in econometric forecasting. Review of Economics and Statistics 60, 193–200.
Howrey, E.P., 1984. Data revision, reconstruction and prediction: an application to inventory investment. Review of Economics and Statistics 66, 386–393.
Jacobs, J.P.A.M., van Norden, S., 2007. Appendix to modeling data revisions: measurement error and dynamics of true values. Mimeo. University of Groningen.
Kapetanios, G., Yates, T., 2010. Estimating time-variation in measurement error from data revisions: an application to backcasting and forecasting in dynamic models. Journal of Applied Econometrics 25, 869–893.
Kishor, N.K., Koenig, E.F., 2005. VAR estimation and forecasting when data are subject to revision. Research Department Working Paper 0501. Federal Reserve Bank of Dallas, Dallas, Texas.
Koenig, E.F., Dolmas, S., Piger, J., 2003. The use and abuse of ‘real-time’ data in economic forecasting. Review of Economics and Statistics 85, 618–628.
Kuznets, S., 1948. National income: a new version. The Review of Economics and Statistics 30, 151–179.
Laubach, T., 2001. Measuring the NAIRU: evidence from seven economies. Review of Economics and Statistics 83, 218–231.
Mankiw, N.G., Runkle, D.E., Shapiro, M.D., 1984. Are preliminary announcements of the money stock rational forecasts? Journal of Monetary Economics 14, 15–27.
Mankiw, N.G., Shapiro, M.D., 1986. News or noise: an analysis of GNP revisions. Survey of Current Business 66, 20–25.
Maravall, A., Pierce, D.A., 1986. The transmission of data noise into policy noise in US monetary control. Econometrica 54, 961–980.
Mariano, R.S., Tanizaki, H., 1995. Prediction of final data with use of preliminary and/or revised data. Journal of Forecasting 14, 351–380.
McKenzie, R., 2006. Undertaking revisions and real-time data analysis using the OECD main economic indicators original release data and revisions database. Technical Report STD/DOC(2006)2. Organisation for Economic Co-operation and Development.
Mincer, J., Zarnowitz, V., 1969. The evaluation of economic forecasts. In: Mincer, J. (Ed.), Economic Forecasts and Expectations: Analyses of Forecasting Behavior and Performance. National Bureau of Economic Research, New York, pp. 3–46 (Chapter 1).
Morgenstern, O., 1963. On the Accuracy of Economic Observations, 2nd ed. Princeton University Press, Princeton, NJ (first edition: 1950).
Mork, K.A., 1987. Ain't behavin': forecast errors and measurement errors in early GNP estimates. Journal of Business & Economic Statistics 5, 165–175.
Mork, K.A., 1990. Forecastable money-growth revisions: a closer look at the data. Canadian Journal of Economics 23, 593–616.
Orphanides, A., van Norden, S., 2002. The unreliability of output gap estimates in real time. Review of Economics and Statistics 84, 569–583.
Otter, P.W., 1986. Dynamic structural systems under indirect observation: identifiability and estimation aspects from a system theoretic perspective. Psychometrika 51, 415–428.
Patterson, K.D., 1994. A state space model for reducing the uncertainty associated with preliminary vintages of data with an application to aggregate consumption. Economics Letters 46, 215–222.
Patterson, K.D., 1995a. An integrated model of data measurement and data generation processes with an application to consumers' expenditure. The Economic Journal 105, 54–76.
Patterson, K.D., 1995b. A state-space approach to forecasting the final vintage of revised data with an application to the index of Industrial Production. Journal of Forecasting 14, 337–350.
Patterson, K.D., 1995c. Forecasting the final vintage of real personal disposable income: a state space approach. International Journal of Forecasting 11, 395–405.
Patterson, K.D., 2000. Which vintage of data to use when there are multiple vintages of data? Cointegration, weak exogeneity and common factors. Economics Letters 69, 115–121.
Patterson, K.D., 2002a. Modelling the data measurement process for the index of production. Journal of the Royal Statistical Society, Series A 165, 279–296.
Patterson, K.D., 2002b. The data measurement process for UK GNP: stochastic trends, long memory, and unit roots. Journal of Forecasting 21, 245–264.
Patterson, K.D., 2003. Exploiting information in vintages of time-series data. International Journal of Forecasting 19, 177–197.
Patterson, K.D., Heravi, S.M., 1991a. Are different vintages of data on the components of GDP co-integrated? Economics Letters 35, 409–413.
Patterson, K.D., Heravi, S.M., 1991b. Data revisions and the expenditure components of GDP. Economic Journal 101, 887–901.
Patterson, K.D., Heravi, S.M., 1992. Efficient forecasts or measurement errors? Some evidence for revisions to the United Kingdom growth rates. Manchester School of Economic and Social Studies 60, 249–263.
Persons, W.M., 1919. Indices of business conditions. The Review of Economic Statistics 1, 6–107.
Rünstler, G., 2002. The information content of real-time output gap estimates: an application to the euro area. Technical Report No. 182. European Central Bank.
Sargent, T.J., 1989. Two models of measurements and the investment accelerator. The Journal of Political Economy 97, 251–287.
Siklos, P.L., 2008. What can we learn from comprehensive data revisions for forecasting inflation? Some US evidence. In: Rapach, D., Wohar, M.E. (Eds.), Forecasting in the Presence of Structural Breaks and Model Uncertainty. Elsevier, Amsterdam.
Stekler, H.O., 1967. Data revisions and economic forecasting. Journal of the American Statistical Association 62, 470–483.
Swanson, N.R., van Dijk, D., 2006. Are statistical reporting agencies getting it right? Data rationality and business cycle asymmetry. Journal of Business & Economic Statistics 24, 24–42.
Trivellato, U., Rettore, E., 1986. Preliminary data errors and their impact on the forecast error of simultaneous-equations models. Journal of Business & Economic Statistics 4, 445–453.
van Norden, S., 2005. Are we there yet? Looking for evidence of a New Economy. Manuscript. HEC Montréal.
Zellner, A., 1958. A statistical analysis of provisional estimates of Gross National Product and its components, of selected National Income components, and of personal saving. Journal of the American Statistical Association 53, 54–65.
Journal of Econometrics 161 (2011) 110–121
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Empirical likelihood block bootstrapping Jason Allen a,∗ , Allan W. Gregory b , Katsumi Shimotsu b,c a
Financial Stability Department, Bank of Canada, Canada
b
Department of Economics, Queen’s University, Canada
c
Department of Economics, Hitotsubashi University, Japan
article
info
Article history: Received 24 March 2008 Received in revised form 22 March 2010 Accepted 21 October 2010 Available online 5 November 2010 JEL classification: C14 C22 Keywords: Generalized methods of moments Empirical likelihood Block bootstrap
abstract Monte Carlo evidence has made it clear that asymptotic tests based on generalized method of moments (GMM) estimation have disappointing size. The problem is exacerbated when the moment conditions are serially correlated. Several block bootstrap techniques have been proposed to correct the problem, including Hall and Horowitz (1996) and Inoue and Shintani (2006). We propose an empirical likelihood block bootstrap procedure to improve inference where models are characterized by nonlinear moment conditions that are serially correlated of possibly infinite order. Combining the ideas of Kitamura (1997) and Brown and Newey (2002), the parameters of a model are initially estimated by GMM which are then used to compute the empirical likelihood probability weights of the blocks of moment conditions. The probability weights serve as the multinomial distribution used in resampling. The first-order asymptotic validity of the proposed procedure is proven, and a series of Monte Carlo experiments show it may improve test sizes over conventional block bootstrapping. © 2010 Elsevier B.V. All rights reserved.
1. Introduction Generalized method of moments (GMM, Hansen (1982)) has been an essential tool for econometricians, partly because of its straightforward application and fairly weak restrictions on the data generating process. GMM estimation is widely used in applied economics to estimate and test asset pricing models (Hansen and Singleton, 1982; Kocherlakota, 1990; Altonji and Segal, 1996), business cycle models (Christiano and Haan, 1996), models that use longitudinal data (Arellano and Bond, 1991; Ahn and Schmidt, 1995), as well as stochastic dynamic general equilibrium models (Ruge-Murcia, 2007). Despite the widespread use of GMM, there is ample evidence that the finite sample properties for inference have been disappointing (e.g. the 1996 special issue of JBES); t-tests on parameters and Hansen’s test of over-identifying restrictions ( J-test, or Sargan test) for model specification perform poorly and tend to be biased away from the null hypothesis. The situation is especially severe for dependent data (see Clark, 1996). Consequently, inferences based on asymptotic critical values can often be very misleading. From an applied perspective, this means that theoretical models may be more frequently rejected than necessary due to poor inference rather than poor modeling.
∗
Corresponding author. Tel.: +1 613 782 8712. E-mail address:
[email protected] (J. Allen).
0304-4076/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2010.10.003
Various attempts have been made to address finite sample size problems while allowing for dependence in the data. Berkowitz and Kilian (2000), Ruiz and Pascual (2002) and Härdle et al. (2003) review some of the techniques developed for bootstrapping timeseries models, including financial time-series. Lahiri (2003) is an excellent monograph on resampling methods for dependent data. Hall and Horowitz (1996) apply the block bootstrap approach to GMM and establish the asymptotic refinements of their procedure when the moment conditions are uncorrelated after finitely many lags. Andrews (2002) provides similar results for the k-step bootstrap procedure first proposed by Davidson and Mackinnon (1999). Limited Monte Carlo results indicate the block-bootstrap has some success at improving inference in GMM. More recent papers by Zvingelis (2003) and Inoue and Shintani (2006) attempt refinements to Hall and Horowitz (1996) and Andrews (2002). The main requirement of these earlier papers is that the data is serially uncorrelated after a finite number of lags. In contrast, Inoue and Shintani (2006) prove that the block bootstrap provides asymptotic refinements for the GMM estimator of linear models when the moment conditions are serially correlated of possibly infinite order. Zvingelis (2003) derives the optimal block length for coverage probabilities of normalized and Studentized statistics. A complementary line of research has examined empirical likelihood (EL) estimators, or their generalization (GEL). Rather than try to improve the finite properties of the GMM estimator directly, researchers such as Kitamura (1997), Kitamura and Stutzer (1997),
J. Allen et al. / Journal of Econometrics 161 (2011) 110–121
Smith (1997), and Imbens et al. (1998) have proposed and/or tested new statistics, ones based on GEL-estimators.1 A GEL estimator minimizes the distance between the empirical density and a synthetic density subject to the restriction that all the moment conditions are satisfied. GEL estimators have the same first-order asymptotic properties as GMM but have smaller bias than GMM in finite samples. Furthermore, these biases do not increase in the number of over-identifying restrictions in the case of GEL. Newey and Smith (2004) provide theoretical evidence of the higher-order efficiency of GEL estimators. Gregory et al. (2002) have shown, however, that these alternatives to GMM do not solve the overrejection problem in finite samples. Brown and Newey (2002) introduce the empirical likelihood bootstrap technique for i.i.d. data. Rather than resampling from the empirical distribution function, the empirical likelihood bootstrap resamples from a multinomial distribution function, where the probability weights are computed by empirical likelihood. Brown and Newey (2002) show that an empirical likelihood bootstrap provides an asymptotically efficient estimator of the distribution of t ratios and over-identification test-statistics. The author’s Monte Carlo design features a dynamic panel model with persistence and i.i.d. error structure. The results suggest that the empirical likelihood bootstrap is more accurate than the asymptotic approximation, and not dissimilar to the Hall and Horowitz (1996) bootstrap. In this paper, the approach of Brown and Newey (2002) is extended to the case of dependent data, using the empirical likelihood (Owen, 1990). A number of researchers have implemented this approach with some success in linear time-series models (Ramalho, 2006) as well as dynamic panel data models (Gonzalez, 2007). With serially correlated data the idea is that parameters of a model are initially estimated by GMM and then used to compute the empirical likelihood probability weights of the blocks of moment conditions, which serve as the multinomial distribution for resampling. In this paper the first-order asymptotic validity of the proposed empirical likelihood block bootstrap is proven using the results in Gonçalves and White (2004) and the approach of Mason and Newton (1992), who analyze the consistency of generalized bootstrap (weighted bootstrap) procedures. Our consistency results may be viewed as an extension of Mason and Newton (1992) to block bootstrapping. We report on the finite-sample properties of t-ratios and over-identification test-statistics. A series of Monte Carlo experiments show that the empirical likelihood block bootstrap can reduce size distortions considerably and improve test sizes over first-order asymptotic theory and frequently outperforms conventional block bootstrapping approaches.2 Furthermore, the empirical likelihood block bootstrap does not require solving the difficult saddle point problem associated with GEL estimators. This is because estimation of the probability weights can be conducted by plugging in first-stage GMM estimates. Difficulties with solving the saddle point problem is a common argument amongst applied researchers for not switching from GMM to EL, even though the latter is higher-order efficient. In related work, Hall and Horowitz (1996) analyze an application of the block bootstrap to GMM. Hall and Horowitz (1996) assume that the moment conditions are uncorrelated after finitely many lags, and derive the higher-order improvements of the block
1 See Kitamura (2007) for a review of recent research on empirical likelihood methods. 2 In addition to bootstrapping using empirical likelihood estimated weights it would seem natural to consider sub-sampling using the same weights. Subsampling (Politis and Romano, 1994; Politis et al., 1999; Hong and Scaillet, 2006) is an alternative to bootstrapping where each block is treated as its own series and test-statistics are calculated for each sub-series. This is left as future work.
111
bootstrap. The key insight of Hall and Horowitz (1996) is that, when the number of moment conditions exceeds the number of parameters, one needs to re-center the moment conditions because there is in general no parameter value such that the resampled moment conditions will be exactly equal to zero in expectation. One difference between our paper and Hall and Horowitz (1996) is that we do not assume that the moment conditions are uncorrelated after finitely many lags. Further, in the empirical likelihood block bootstrap one does not need to re-center the moment conditions by virtue of the EL weights. However, we only derive the consistency of our proposed procedure, and do not derive its higher-order properties. The paper is organized as follows. Section 2 provides an overview of GMM and EL. Section 3 presents a discussion of how resampling methods might improve inference in GMM. Section 4 presents the asymptotic results. Section 5 presents the Monte Carlo design for both linear and nonlinear models. Section 6 concludes. The technical assumption and proofs are collected at the end of the paper in the mathematical appendix. 2. Overview of GMM and GEL In this section we present an overview of GMM and EL to establish notation and framework. 2.1. GMM Let Xt ∈ Rk , t = 1, . . . , n, be a set of observations from a stochastic sequence. Suppose for some true parameter value θ0 (p × 1) the following moment conditions (m equations) hold and p ≤ m < n: E [g (Xt , θ0 )] = 0,
(1)
where g : R × Θ → R . The GMM estimator is defined as: k
m
θˆ = arg min Qn (θ ), ′ n n − − −1 −1 Qn (θ ) = n g ( Xt , θ ) W n n g (Xt , θ ) , t =1
(2)
t =1
where the weighting matrix Wn →p W . Hansen (1982) shows that the GMM estimator θˆ is consistent and asymptotically normally distributed subject to some regularity conditions. The elements of {g (Xt , θ )} and {∇ g (x, θ )} are assumed to be near epoch dependent (NED) on the α -mixing sequence {Vt } of size −1 uniformly on p (Θ , ρ) where ρ is any convenient norm ∑n on R . Define Σ = limn→∞ var(n−1/2 t =1 g (Xt , θ0 )). The standard kernel estimate of Σ is: Sn (θ ) =
n − h k
h=−n
m
Γˆ (h, θ ),
(3)
∑n where k(·) is a kernel and Γˆ (h, θ ) = n−1 t =h+1 g (Xt , θ )g (Xt +h , ∑ θ )′ for h ≥ 0 and n−1 nt =−1h g (Xt , θ )g (Xt −h , θ )′ for h < 0. It is known that Sn (θ˜ ) →p Σ if θ˜ →p θ0 under weak conditions on the kernel and bandwidth; see de Jong and Davidson (2000). The optimal weighting matrix is given by Sn (θ˜ )−1 with θ˜ →p θ0 . When the optimal weighting matrix is used, the asymptotic covariance matrix of θˆ is (G′ Σ −1 G)−1 , where G = limn→∞ E (n−1 ∑n ′ t =1 ∇ g (Xt , θ0 )) with ∇ g (x, θ ) = ∂ g (x, θ )/∂θ . In terms of testing for model misspecification, the most popular test is Hansen’s J-test for over-identifying restrictions:
Jn = Kn (θˆn )′ Kn (θˆn ) →d χm−r , where
(4)
112
J. Allen et al. / Journal of Econometrics 161 (2011) 110–121
Kn (θ) = Sn−1/2 n−1/2
n
−
g (Xt , θ ),
t =1
and Sn is a consistent estimate of Σ . Let θr denote the rth element of θ , and let θ0r denote the rth element of θ0 . The t-statistic for testing the null hypothesis H0 : θr = θ0r is:
√
n(θˆnr − θ0r )
Tnr =
σˆ nr
→d N (0, 1),
(5)
2 where θˆnr is the rth element of θˆn , and σˆ nr is a consistent estimate
of the asymptotic variance of θˆnr .
Empirical Likelihood (EL) estimation has some history in the statistical literature but has only recently been explored by econometricians. One attractive feature is that while its first-order asymptotic properties are the same as GMM, there is an improvement for EL at the second-order (see Qin and Lawless, 1994 and Newey and Smith, 2004). For time-series models see Anatolyev (2005). This suggests that there might be some gain for EL over GMM in finite sample performance. At present, limited Monte Carlo evidence (see Gregory et al., 2002) has provided mixed results. The idea of EL is to use likelihood methods for model estimation and inference without having to choose a specific parametric family or probability densities. The parameters are estimated by minimizing the distance between the empirical density and a density that identically satisfies all of the moment conditions. The main advantages over GMM are that it is invariant to linear transformations of the moment functions and does not require the calculation of the optimal weighting matrix for asymptotic efficiency (although smoothing or blocking of the moment condition is necessary for dependent data). The main disadvantage is that it is computationally more demanding than GMM in that a saddle point problem needs to be solved. The Generalized Empirical Likelihood Estimator solves the following Lagrangian: max L =
n 1−
n t =1
h(·) − µ
n − t =1
πt − 1 − γ ′
n −
h1 (δ ′ g (xt , θ )) h1 (δ ′ g (xt , θ ))
,
i.i.d.(X˜ 1 , . . . , X˜ n−ℓ+1 ). The GMM estimator is
∼
t =1
∗∗ ˆ where g (Xt , θ ) = g (Xt , θ ) − n t =1 g (Xt , θn ) and Wn is a ∗∗ weighting matrix. That is, given a weighting matrix Wn , the GMM estimator that minimizes the quadratic form of the demeaned ∗∗ block-resampled moment conditions is θMBB . Hall and Horowitz (1996) implement the non-overlapping block bootstrap (NBB, Carlstein (1986)). This approach is also considered (in addition to the MBB). Let b be the number of blocks and ℓ the block length, and assume bℓ = n. We resample b blocks with replacement from {X˜ i : i = 1, . . . , b} where X˜ i = (X(i−1)ℓ+1 , . . . , X(i−1)ℓ+ℓ ). The NBB resample is {Xt∗ }nt=1 . The NBB version of the GMM problem is identical to the MBB version, except for the way one resamples the data. We consider both the MBB and NBB approaches because there is little known about the superiority of either method in finite samples.3 As shown in Gonçalves and White (2004) (hereafter GW04), because the resampled b blocks are (conditionally) i.i.d., the bootstrap version of the long-run auto-covariance matrix estimate takes the form (cf. Eq. (3.1) of GW04):
Sn (θ ∗∗
∗∗
∗
) = ℓb
∗
−1
b −
−1
ℓ
−1
i=1
(6)
ℓ
−1
ℓ −
∑n
g
∗
(X(∗i−1)ℓ+t , θ ∗∗ )
t =1
ℓ −
′ g
∗
(X(∗i−1)ℓ+t , θ ∗∗ )
,
(8)
t =1
∗∗ ∗∗ where θ ∗∗ denotes either θMBB or θNBB . The optimal weighting
t =1
h1 (v) = ∂ h(v)/∂v.
t =1
∗
× πt g (xt , θ ).
Solving for πt gives
πt = ∑
where X˜ i∗ therefore:
∗∗ ∗∗ θMBB = arg min QMBB ,n (θ ), ′ n n − − ∗∗ −1 QMBB g ∗ (Xt∗ , θ ) Wn∗∗ n−1 g ∗ (Xt∗ , θ ) , ,n (θ ) = n
2.2. Empirical likelihood
by resampling the estimation data. If the estimation data is serially correlated, then blocks of data are resampled and the blocks are treated as the i.i.d. sample. We implement two forms of the block bootstrap. The first approach implements the overlapping bootstrap (MBB, Künsch (1989)). Let b be the number of blocks and ℓ the block length, such that n = bℓ. The ith overlapping block is X˜ i = {Xi , . . . , Xi+ℓ−1 }, i = 1, . . . , n − ℓ + 1. The MBB resample is {Xt∗ }nt=1 = {X˜ 1∗ , . . . , X˜ b∗ },
(7)
In the case of EL, h(·) = log(πt ). The presence of serially correlated observations necessitates a modification of Eq. (6). Kitamura and Stutzer (1997) address the data dependency problem by smoothing the moment conditions. Anatolyev (2005) provides conditions on the amount of smoothing necessary for the bias of the GEL estimator to be less than the GMM estimator. Kitamura (1997) and Bravo (2005) address serial correlation in the moment conditions by using averages across blocks of data. 3. Improving inference: resampling methods This section presents an overview of block bootstrap methods typically used to improve inference in models estimated by GMM and follows up with a detailed proposal of a new method based on empirical likelihood.
matrix is given by (Sn∗∗ (θ˜ ∗∗ ))−1 , where θ˜ ∗∗ is the first-stage ∗∗ MBB/NBB estimator. The bootstrap version of the J-statistic, JMBB ,n ∗∗ ∗∗ ˜ ∗∗ −1/2 and JNBB ,n , is defined analogously to Jn but using (Sn (θ ))
and n−1/2 t =1 g ∗ (Xt∗ , θ ). Note that in Hall and Horowitz (1996), the recentering of the sample moment condition is necessary in order to establish the asymptotic refinements of the bootstrap. This is because in general there is no θ such that E ∗ g (x, θ ) = 0 when there are more moments than parameters and the resampling schemes must impose the null hypothesis. Recentering is not necessary for establishing the first-order validity of the bootstrap version of θˆn (see Hahn, 1996), but is necessary for the first-order validity of the bootstrap J-test. Operationally one needs to choose a block size when implementing the block-bootstrap. Härdle et al. (2003) point out that the optimal block length depends on the objective of bootstrapping. That is, the block length depends on whether or not one is interested in bootstrapping one-sided or two-sided tests or whether one is concerned with estimating a distribution function. Among
∑n
3.1. The block bootstrap The bootstrap amounts to treating the estimation data as if they were the population and generating bootstrap observations
3 It is only known that the MBB is more efficient than the NBB in estimating the variance (Lahiri, 1999).
J. Allen et al. / Journal of Econometrics 161 (2011) 110–121
others, Zvingelis (2003) solves for optimal block lengths given different scenarios. Practically, the optimal block lengths for each different hypothesis test are unlikely to be implemented since practitioners are interested in a variety of problems across various hypotheses. Experimentation is done with fixed block lengths as well as data-dependent methods. Following the literature we recommend using a data-dependent approach for selecting a block length. We set the block length equal to the data-driven lag length for the Bartlett kernel using the method proposed by Newey and West (1994). This is motivated by the asymptotic equivalence of the bootstrap variance to a Bartlett kernel variance estimator (see Bühlmann and Künsch, 1999, Eq. (2.5)). Gonçalves and White (2004) use the automatic bandwidth selection procedure proposed by Andrews (1991) in their simulation study for similar reasons. There may be some gain in using a more advanced algorithm than the one we currently employ but given its simplicity and availability in pre-packaged GMM software, we believe that most practitioners are likely to continue using a Newey–West type lag-selection procedure.4 A number of approaches that are particular to block bootstrapping, but under different conditions than our model, have been suggested. Berkowitz and Kilian (2000) propose a two-step parametric approach for linear models and Politis and White (2004) propose an automatic block-length selection procedure based on spectral estimation (Politis, 1995), and which is appropriate for the circular and stationary bootstrap (Politis and Romano, 1994). 3.2. Empirical likelihood bootstrap
113
On the other hand, it is not clear whether an Edgeworthexpansion based analysis can demonstrate a higher-order improvement of the EL block bootstrap over the first-order asymptotics. Inoue and Shintani (2006) demonstrate that a higher-order analysis of the block bootstrap is handicapped by the bias of the HAC covariance matrix estimator, unless one uses a kernel whose characteristic exponent is greater than two. This excludes standard kernels such as the Bartlett, Parzen, and quadratic spectrature kernel. Another attractive feature of using the empirical likelihood bootstrap rather than the standard bootstrap is that re-centering is not required, as is the case in Hall and Horowitz (1996). The EL weights provide a probability measure under which the moment conditions hold exactly. 3.2.1. EMB First consider the overlapping bootstrap. Let N = n − ℓ + 1 be the total number of overlapping blocks. Define the ith overlapping block of the sample moment as (o stands for ‘‘overlapping’’): Tio (θ ) = ℓ−1
ℓ −
g (Xi+t −1 , θ ),
i = 1, . . . , N ,
t =1
and the Lagrangian as: L=
N −
log(π ) + µ 1 − o i
i =1
N −
π
o i
− Nγ ′
i=1
N −
πio Tio (θ ).
i=1
It is known that the solution for the probability weights are given by: 1
1
In this section we develop the empirical likelihood (EL) approach to bootstrapping time-series models. Two cases are considered: (i) the overlapping empirical likelihood block bootstrap (EMB), and (ii) the non-overlapping empirical likelihood block bootstrap (ENB). The procedure for implementing the empirical block bootstrap is straightforward and outlined in Section 7. An advantage of the EL block bootstrap over the standard block bootstrap is that EL weighted observations estimate the distribution function of the data more efficiently than non-weighted observations. We think this provides the EL block bootstrap with an improvement in test level accuracy over the standard block bootstrap, although a rigorous proof by an Edgeworth expansion is beyond the scope of the paper. When Xt is i.i.d., Theorem 1 of Brown and Newey (2002) shows that the empirical distribution function of the EL-weighted Xt ’s is a more efficient estimator of the population distribution function of Xt than the ordinary empirical distribution function of the Xt ’s. Brown and Newey (2002) combine it with an Edgeworth expansion to show that the EL bootstrap improves test level accuracy over an i.i.d. bootstrap for some cases, for example in a one-sided test of the null hypothesis of E [g (Xt , θ0 )] = 0. In our case, we attach the EL weights to the blocks, instead of individual observations. By analogy to Brown and Newey (2002), using the EL weights would provide a more efficient estimate of the distribution function of the blocks. Therefore, the EL block bootstrap would estimate the distribution of the sample moments more efficiently than the standard block bootstrap. Our simulation results suggest that efficient estimation of the distribution of the blocks by the EL block bootstrap contributes to improvements in test level accuracy, at least in some cases.
πio =
4 Note that in the case of covariance matrix estimation there is also the issue of smoothing, and therefore the choice of the appropriate kernel. The block samples in our approach, however, are (conditionally) i.i.d., therefore the choice of kernel does not arise.
∗ −1 SMBB ,n (θ ) = ℓb
1 + γ o (θ )′ Tio (θ )
N
,
where N −
γ o (θ ) = arg max
λ∈Λn (θ )
log(1 + γ ′ Tio (θ )).
(9)
i =1
Solving out the Lagrange multipliers and the coefficients simultaneously requires solving a difficult saddle point problem outlined in Kitamura (1997). Instead, one can use the GMM estimate of θ to compute πio and attach these weights to the bootstrapped (blocks
of) samples. Given the GMM estimate θˆ , compute γ o (θˆ ), which is a much smaller dimensional problem. Then solve for the empirical probability weights:
πˆ = o i
1
1 1 + γ o (θˆ )′ Tio (θˆ )
N
,
which satisfy the moment condition version of θˆ is defined as:
(10)
∑N
i=1
πˆ io Tio (θˆ ) = 0. The EMB
∗ ∗ θMBB = arg min QMBB ,n (θ ), ′ b b − − ∗ −1 ∗ −1 QMBB Tio∗ (θ ) WMBB Tio∗ (θ ) , ,n (θ ) = b ,n b i =1
i=1
where WMBB,n is a weighting matrix and { (θ )}bi=1 are b i.i.d. samples (with replacement) from the distribution with Pr(Tio∗ (θ ) = Tio∗
∗
Tko (θ )) = πˆ ko for k = 1, . . . , N. Note that E ∗ Tio∗ (θˆ ) =
πˆ io Tio (θˆ ) = 0.
∑N
i=1
The long-run auto-covariance matrix estimator for EMB takes the form: b −
Tio∗ (θ )Tio∗ (θ )′ ,
(11)
i=1
and the second-stage (optimal) weighting matrix is given by
114
J. Allen et al. / Journal of Econometrics 161 (2011) 110–121
∗ ˜ ∗ −1 ˜∗ SMBB ,n (θMBB ) , where θMBB is the first-stage EMB estimator. The overlapping block Wald tests are based on the long-run auto∗ covariance matrix SMBB ,n (θ ). The EMB version of the J-statistic, ∗ ∗ −1/2 ˜∗ JMBB ,n , is defined analogously to Jn but using (SMBB,n (θMBB ))
and n1/2 b−1
∑b
i =1
Tio∗ (θ ).
3.2.2. ENB The ENB uses b non-overlapping blocks rather than overlapping blocks. The ith non-overlapping block is defined as: Ti (θ) = ℓ−1
ℓ −
g (X(i−1)ℓ+t , θ ),
i = 1, . . . , b,
and the Lagrange multiplier and empirical probability weights are given by: b −
ˆ λ∈Λn (θ) i=1
πˆ i =
1 b
log(1 + γ ′ Ti (θˆ )),
1 1 + γ (θˆ )′ Ti (θˆ )
(12)
.
The ENB estimator is defined as: ∗ ∗ θNBB = arg min QNBB ,n (θ ), ′ b b − − ∗ −1 ∗ ∗ −1 ∗ QNBB,n (θ ) = b Ti (θ ) WNBB,n b Ti (θ ) , i =1
b −
i=1
Ti∗ (θ )Ti∗ (θ )′ ,
√
√
∗ for any ε > 0, Pr{supx∈Rp |P ∗ [ n(θMBB − θˆ ) ≤ x]− P [ n(θˆ −θ0 ) ≤
√
√
∗∗ x]| > ε} → 0 and Pr{supx∈Rp |P ∗ [ n(θMBB − θˆ ) ≤ x] − P [ n(θˆ − θ0 ) ≤ x]| > ε} → 0.
√ ∗ →P ∗ ,P W , then for any ε > 0, Pr{supx∈Rp |P ∗ [ n(θNBB − θˆ ) ≤ x] − √ √ ∗∗ ∗ ˆ P [ n(θ − θ0 ) ≤ x]| > ε} → 0 and Pr{supx∈Rp |P [ n(θNBB − θˆ ) ≤ √ x] − P [ n(θˆ − θ0 ) ≤ x]| > ε} → 0.
Theorem 3. Let Assumptions A and B in the mathematical appendix hold. Assume Sn →P Σ . If ℓ → ∞ and ℓ = o(n1/2−1/r ), then the bootstrap-based inference using the Wald statistic is consistent. ∗ ∗ ∗∗ ∗∗ 2 Further, Jn →d χm2 −p , and JMBB ,n , JNBB,n , JMBB,n , JNBB,n →d∗ χm−p prob-P. 5. Monte Carlo experiments
b ∗ ∗ where WNBB ,n is a weighting matrix and {Ti (θ )}i=1 are b i.i.d. samples (with replacement) from the distribution with Pr(Ti∗ (θ ) = Tk (θ)) = πˆ k for k = 1, . . . , b. The long-run auto-covariance matrix estimator for ENB is:
∗ −1 SNBB ,n (θ) = ℓb
Theorem 1. Let Assumptions A and B in the mathematical appendix ∗ hold. If ℓ → ∞, ℓ = o(n1/2−1/r ), and Wn∗∗ , WMBB ,n →P ∗ ,P W , then
Theorem 2. Let Assumptions A and B in the mathematical appendix ∗ hold. If ℓ → ∞, ℓ = o(n(r −2)/2(r −1) ), and Wn∗∗ , WNBB ,n
t =1
ˆ = arg max γ (θ)
prob-P ∗ , prob-P (or Tn∗ →P ∗ ,P 0) if for any ε > 0 and any δ > 0, limn→∞ P [P ∗ [|Tn∗ | > ε] > δ] = 0. Also following GW04 we use the notation xn →d∗ x prob-P when weak convergence under P ∗ occurs in a set with probability converging to one.
(13)
i=1
∗ ˜ ∗ −1 and the optimal weighting matrix is given by SNBB ,n (θNBB ) , where ∗ θ˜NBB is the first-stage ENB estimator. The non-overlapping block
Wald tests are based on the long-run auto-covariance matrix, ∗ ∗ SNBB ,n (θ). The ENB version of the J-statistic, JNBB,n , is defined ∗ analogously to JMBB,n . It may also be possible to attach EL weights to the blocks and draw i.i.d. bootstrap observations. For example, in EMB, draw b i.i.d. samples from {πˆ jo Tjo (θ ) : j = 1, . . . , N }. This variant of the EL block bootstrap will have the same first-order asymptotic property, but it is not clear whether this variant will have the same higher-order property. While Theorems 2.1 and 2.2 of Hall and Mammen (1994) provide sufficient conditions for higher-order equivalence of weighted bootstraps in the i.i.d. case,5 applying these theorems to the EL bootstrap requires more detailed bounds on the EL weights than those in this paper. 4. Consistency of the bootstrap-based inference The following lemmas establish the consistency of the bootstrap-based inference. The proofs are based on the results in Gonçalves and White (2004), hereafter referred to as GW04, and Mason and Newton (1992). As in GW04, let P denote the probability measure that governs the behavior of the original time-series and let P ∗ be the probability measure induced by bootstrapping. For a bootstrap statistic Tn∗ we write Tn∗ → 0
5 Barbe and Bertail (1995) analyze the asymptotics of the generalized bootstrap of a large class of statistics including Fréchet differentiable functionals.
In this section, a comparison of the finite sample performance differences of the standard block bootstrapping approaches to the empirical likelihood block bootstrap approaches is undertaken in a number of Monte Carlo experiments. The Monte Carlo design includes both linear and nonlinear models. For each experiment we report actual and nominal size at the 1%, 5% and 10% level for the t-test and J-test. Parameter settings are deliberately chosen to illustrate the most challenging size problems. There are sample sizes: 100, 250 and 1000. Each experiment has 2000 replications and 499 bootstrap samples. This number of bootstrap samples does not lead to appreciable distortions in size for any of the experiments. 5.1. Case I: linear models 5.1.1. Symmetric errors Consider the same linear process as Inoue and Shintani (2006): yt = θ1 + θ2 xt + ut
for t = 1, . . . , n,
(14)
where (θ1 , θ2 ) = (0, 0), ut = ρ ut −1 + ε1t and xt = ρ xt −1 + ε2t . The error structure, ε = (ε1 , ε2 ) are uncorrelated i.i.d. normal processes with mean 0 and variance 1. The approach is instrumental variable estimation of θ1 and θ2 with instruments zt = (ι xt xt −1 xt −2 ). There are two over-identifying restrictions. The null hypothesis being tested is: Ho : θ2 = 0. The statistics based on the GMM estimator are studentized using a Bartlett kernel applied to pre-whitened series (see Andrews and Monahan, 1992). The bootstrap sample is not smoothed since the b blocks are i.i.d. Both the non-overlapping block bootstrap and the overlapping block bootstrap are considered in the experiment. Results are reported in Table 1. The amount of dependence in the moment conditions is relatively high, ρ = 0.9. The block length is set equal to the lag window in the HAC estimator, which is chosen using a data-dependent method (Newey and West, 1994). One immediate observation is that the asymptotic test-statistics severely over-reject the true null hypothesis. For example, with 100 observations the actual level for a 10% t-test is 42.25%. The actual level of the J-test is closer to the nominal level, although there is still over-rejection. The block bootstrap, with block size averaging from 1.96 for 100 observations to 4.48
J. Allen et al. / Journal of Econometrics 161 (2011) 110–121 Table 1 Linear model—symmetric errors.
Table 2 Linear model—GARCH(1, 1) errors. Replications = 2000; bootstraps = 499; auto-selection block length yt = θ1 + θ2 xt + σt ut ; ut ∼ N (0, σt ), σt2 = 0.0001 + 0.6σt2−1 + 0.3ε1t −1 ; xt = 0.75xt −1 + ε2t , where ε1t ∼ N (0, 1); zt = (ι, xt , xt −1 , xt −2 ) (θ1 , θ2 ) = (0, 0); ε1t ∼ N (0, 1)
Replications = 2000; bootstraps = 499; auto-selection block length yt = θ1 + θ2 xt + ut ; ut = 0.9ut −1 + ε1t ; xt = 0.9xt −1 + ε2t ; zt = (ι, xt , xt −1 , xt −2 ) (θ1 , θ2 ) = (0, 0); [ε1t , ε2t ] ∼ N (0, I2 ) t-test
Sargan test
t-test
10
05
01
10
05
01
100 Asymptotic SNB SMB ENB EMB
0.4225 0.2725 0.3760 0.2265 0.2290
0.3420 0.2070 0.2885 0.1830 0.2260
0.2335 0.1085 0.1640 0.1150 0.1120
0.1360 0.1505 0.1330 0.0675 0.0775
0.0735 0.0945 0.0755 0.0460 0.0560
0.0245 0.0320 0.0255 0.0220 0.0250
250 Asymptotic SNB SMB ENB EMB
0.3485 0.2090 0.3255 0.1385 0.1500
0.2755 0.1460 0.2390 0.0990 0.1250
0.1625 0.0720 0.1320 0.0455 0.0500
0.1225 0.1320 0.1315 0.0815 0.1140
0.0745 0.0840 0.0790 0.054? 0.0830
1000 Asymptotic SNB SMB ENB EMB
0.2735 0.1675 0.2550 0.0995 0.0960
0.1945 0.1140 0.1815 0.0605 0.0590
0.0955 0.0425 0.0830 0.0230 0.0020
0.0925 0.0930 0.0970 0.0875 0.1045
0.0460 0.0505 0.0450 0.0480 0.0560
05
01
10
05
01
100 Asymptotic SNB SMB ENB EMB
0.1420 0.0820 0.0920 0.0875 0.1405
0.0840 0.0340 0.0480 0.0405 0.0870
0.0280 0.0060 0.0060 0.0006 0.0200
0.070? 0.0530 0.0590 0.0730 0.1100
0.0240 0.0180 0.0160 0.0300 0.0600
0.0040 0.0050 0.0050 0.0040 0.0100
0.0235 0.0310 0.0260 0.0260 0.0480
250 Asymptotic SNB SMB ENB EMB
0.1150 0.0630 0.0830 0.0995 0.1130
0.0580 0.0300 0.0370 0.0410 0.0570
0.0150 0.0060 0.0080 0.0065 0.0210
0.0840 0.0820 0.0760 0.0845 0.1510
0.0270 0.0230 0.0260 0.0330 0.0950
0.0040 0.0030 0.0040 0.0035 0.0140
0.0075 0.0090 0.0070 0.0145 0.0180
1000 Asymptotic SNB SMB ENB EMB
0.1050 0.0700 0.0910 0.0995 0.1000
0.0560 0.0340 0.0470 0.0490 0.0560
0.0150 0.0070 0.0110 0.0090 0.0090
0.0880 0.0840 0.0860 0.0885 0.0900
0.0390 0.0420 0.0410 0.0370 0.0490
0.0060 0.0050 0.0060 0.0090 0.0100
5.1.2. Heteroscedastic errors The subsequent DGP is the same as in the previous section with the addition of conditional heteroscedasticity, modeled as a GARCH(1, 1). The DGP is: for t = 1, . . . , n,
Sargan test
10
for 1000 observations, reduces the amount of over-rejection of the t-test substantially. The greatest improvements for the t-test are with the standard bootstrap. For the J-test the empirical likelihood bootstrap produces actual sizes much closer to the nominal size than the alternatives. Interestingly, the overlapping bootstrap has a size worse than the non-overlapping block bootstrap for the t-test.
yt = θ1 + θ2 xt + σt ut
115
(15)
where (θ1 , θ2 ) = (0, 0), xt = 0.75xt −1 + ε1t , and ut ∼ N (0, σt ). 2 σt2 = 0.0001 + 0.6σt2−1 + 0.3ε2t −1 and ε ∼ N (0, I ). The unconditional variance is 1. The instrument set is zt = [ι, xt , xt −1 , xt −2 ]. Results with 2000 replications and 499 bootstrap samples are presented in Table 2. There are three sample sizes: 100, 250 and 1000. The actual size of the asymptotic tests are close to the nominal size for sample size 250 and greater. The moving block bootstrap tests have a good size and the empirical likelihood bootstrap performs the best out of the bootstrap procedures. Using the standard block bootstrap actually leads to more severe underrejection of the true null hypothesis than the asymptotic tests. 5.2. Case II: nonlinear models Two experiments are considered. First the chi-squared experiment from Imbens et al. (1998). Second, the asset pricing DGP outlined in Hall and Horowitz (1996) and used by Gregory et al. (2002). Imbens et al. (1998) also consider this DGP. In addition this section looks at the empirical likelihood bootstrap in a framework with dependent data. It is the case of nonlinear models where the asymptotic t-test and J-test tend to severely over-reject. 5.2.1. Asymmetric errors First consider a model with chi-squared moments. Imbens et al. (1998) provide evidence that average moment tests like the J-test
Note: The mean block length is 1.96 when T = 100, 2.84 when T = 250, and 4.48 when T = 1000.
can substantially over-reject a true null hypothesis under a DGP with chi-squared moments. The authors find that tests based on the exponential tilting parameter perform substantially better. The moment vector is: g (Xt , θ1 ) = (Xt − θ1 , Xt2 − θ12 − 2θ1 )′ . The parameter θ1 is estimated using the two moments. Results for 2000 replications and 499 bootstrap samples are presented in Table 3. There is severe over-rejection of the true null hypothesis when using the asymptotic distribution. The bootstrap procedures correct for this over-rejection; the empirical likelihood bootstrap performs very well for the t-tests. For small sample sizes the standard and empirical likelihood bootstrap both outperform the asymptotic approximation but there is still is an over-rejection. 5.2.2. Asset pricing example Finally consider an asset pricing model with the following moment conditions.6 : E [exp(µ − θ (x + z ) + 3z ) − 1] = 0, Ez [exp(µ − θ (x + z ) + 3z ) − 1] = 0. It is assumed that
log xt = ρ log xt −1 +
(1 − ρ 2 )εxt , zt = ρ zt −1 + (1 − ρ 2 )εzt , where εxt and εzt are independent normal with mean 0 and variance 0.16. In the experiment ρ = 0.6. Results for 2000 replications and 499 bootstrap samples are presented in Table 4. Again, the asymptotic tests severely overreject the true null hypothesis. The bootstrap procedures produce tests with reasonable size, especially for the t-tests. As was the case in the model with asymmetric errors, the empirical likelihood bootstrap performs best. 6. Conclusion This paper extends the ideas put forth by Brown and Newey (2002) to bootstrap test-statistics based on empirical likelihood.
6 Derivation of the example can be found in Gregory et al. (2002).
116
J. Allen et al. / Journal of Econometrics 161 (2011) 110–121
Table 3 Nonlinear model—chi-square moment conditions. Replications = 2000; bootstraps = 499; auto-selection block length g (Xt , θ1 ) = (Xt − θ1 , Xt2 − θ12 − 2θ1 )′ t-test
Sargan test
10
05
01
10
05
01
100 Asymptotic SNB SMB ENB EMB
0.1845 0.1535 0.1800 0.1175 0.1100
0.1250 0.1000 0.0875 0.0575 0.0620
0.0625 0.0380 0.0070 0.0100 0.0090
0.2655 0.1895 0.1825 0.2135 0.2000
0.2065 0.1505 0.1465 0.1470 0.1550
0.1195 0.0870 0.0780 0.0750 0.0700
250 Asymptotic SNB SMB ENB EMB
0.1245 0.1095 0.1240 0.1050 0.1040
0.0700 0.0585 0.0710 0.0560 0.0610
0.0250 0.0200 0.0175 0.0120 0.0110
0.1990 0.1615 0.1520 0.1720 0.1780
0.1560 0.1290 0.1225 0.1200 0.1280
0.0840 0.0790 0.0695 0.0420 0.0380
1000 Asymptotic SNB SMB ENB EMB
0.0975 0.0985 0.0795 0.0965 0.0940
0.0515 0.0620 0.0395 0.0480 0.0400
0.0100 0.0205 0.0075 0.0095 0.0060
0.1325 0.1335 0.1180 0.1120 0.1340
0.0835 0.0985 0.0870 0.0695 0.0700
0.0400 0.0580 0.0430 0.0240 0.0400
tions hold. As highlighted by Brown and Newey (2002), the empirical likelihood bootstrap is the same as the conventional bootstrap, except that it is based on a more efficient distribution estimator. Two possible avenues for future research include combining subsampling methods with empirical likelihood probability weights and establishing higher order improvements for the ENB and EMB. 7. Implementing the block bootstrap The procedure for implementing the GMM overlapping (MBB) and empirical likelihood (EMB) bootstrap procedures are outlined below. The procedure is similar for the non-overlapping bootstrap. 1. Given the random sample (X1 , . . . , Xn ), calculate θˆ using 2stage GMM. 2. For EMB calculate πˆ io using Eq. (10).
3a. For EMB, sample with replacement from {Tjo (θˆ ) : j = 1, . . . , N } with probability {πˆ jo : j = 1, . . . , N }. 3b. For MBB, sample uniformly with replacement to get {X ∗ }nt=1 =
Replications = 2000; bootstraps = 499; auto-selection block length
(X˜ 1 , . . . , X˜ b ). ∗ 4a. For EMB, calculate the J-statistic (JMBB ,n ) and the t-statistic ∗ (Tnr ). ∗∗ 4b. For MBB, calculate the J-statistic (JMBB ,n ) and the t-statistic ∗∗ (Tnr ). 5. Repeat steps 3–4 B times, where B is the number of bootstraps. ∗ ∗∗ 6. Let qˆ πα be a (1 − α) percentile of the distribution of Tnr or Tnr . π ∗ 7. Let qα be a (1 − α) percentile of the distribution of JMBB,n or ∗∗ JMBB ,n .
g = (exp(µ − θ(x + z ) + 3z ) − 1, z [exp(µ − θ (x + z ) + 3z ) − 1]), log xt = ρ log xt −1 + (1 − ρ 2 )εxt , zt = ρ zt −1 + (1 − ρ 2 )εzt , where εxt and εzt are independent normal with mean 0 and variance 0.16. In the experiment ρ = 0.6
8. Mathematical appendix
Note: The mean block length is 1.29 when T = 100, 1.99 when T = 250, and 3.33 when T = 1000.
Table 4 Nonlinear model—asset pricing model.
t-test
8. The bootstrap confidence interval for θ0r is θˆnr ± qˆ πα n−1/2 σˆ nr . π 9. For the bootstrap J-test, the test rejects if Jn ≥ qα .
Sargan test
10
05
01
10
05
01
100 Asymptotic SNB SMB ENB EMB
0.4010 0.1550 0.1540 0.1300 0.1360
0.3235 0.0985 0.1015 0.0780 0.0825
0.2195 0.0400 0.0435 0.0245 0.0260
0.3080 0.1880 0.1930 0.1250 0.1880
0.2350 0.1260 0.1300 0.0700 0.0810
0.1460 0.0385 0.0420 0.0150 0.0200
250 Asymptotic SNB SMB ENB EMB
0.3005 0.1270 0.1285 0.1200 0.1290
0.2275 0.0755 0.0780 0.0620 0.0600
0.1240 0.0290 0.0290 0.0140 0.0210
0.2470 0.1435 0.1430 0.1210 0.1245
0.1850 0.1005 0.0985 0.0670 0.0650
0.0995 0.0510 0.0535 0.0180 0.0270
1000 Asymptotic SNB SMB ENB EMB
0.2205 0.1440 0.1420 0.1180 0.1160
0.1440 0.0825 0.0820 0.0600 0.0560
0.0545 0.0280 0.0250 0.0220 0.0160
0.1975 0.1005 0.1040 0.1300 0.1090
0.1335 0.0715 0.0660 0.0695 0.0700
0.0685 0.0220 0.0220 0.0210 0.0150
Note: The mean block length is 1.51 when T = 100, 2.62 when T = 250, and 4.96 when T = 1000.
Where Brown and Newey (2002) consider bootstrapping in an i.i.d. context, this paper provides a proof of the first-order asymptotic validity of empirical likelihood block bootstrapping in the context of dependent data. Given the test-statistics considered, the size distortions of those tests based on the asymptotic distribution are severe, especially in the case of nonlinear moment conditions and substantial serial correlation. The empirical likelihood bootstrap largely corrects for these size distortions and produces promising results. This is especially true when the regression errors are non-spherical. The significance of using the empirical likelihood estimator is that it satisfies the moment conditions identically while supplying a probability measure under which these condi-
Assumptions A and B are a simplified version of Assumptions A and B in Gonçalves and White (2004), tailored to our GMM estimation framework. ‖x‖p denotes the Lp norm (E |Xnt |p )1/p . For a (m × k) matrix x, let |x| denote the 1-norm of x, so |x| = ∑m ∑k i=1 j=1 |xij |. Assumption A. A.1 Let (Ω , F , P ) be a complete probability space. The observed data are a realization of a stochastic process {Xt : Ω → Rk , k ∈ N}, with Xt (ω) = Wt (. . . , ∏ Vt −1 (ω), Vt (ω), Vt +1 (ω), . . .), Vt : v l Ω → Rv , v ∈ N, and Wt : ∞ τ =−∞ R → R is such that Xt is measurable for all t. A.2 The functions g : Rk × Θ → Rm are such that g (·, θ ) is measurable for each θ ∈ Θ , a compact subset of Rp , p ∈ N, and g (Xt , ·) : Θ → Rm is continuous on Θ a.s.-P , t = 1, 2, . . . . A.3 (i) θ0 is identifiably unique with respect to Eg (Xt , θ )′ WEg (Xt , θ ) and (ii) θ0 is interior to Θ . A.4 (i) {g (Xt , θ )} is Lipschitz continuous on Θ , i.e. |g (Xt , θ ) − g (Xt , θ o )| ≤ Lt |θ − θ o | a.s.-P , ∀θ , θ o ∈ Θ , where supt E (Lt ) = O(1). (ii) {∇ g (Xt , θ )} is Lipschitz continuous on Θ . A.5 For some r > 2: (i) {g (Xt , θ )} is r-dominated on Θ uniformly in t, i.e. there exists Dt : Rlt → R such that |g (Xt , θ)| ≤ Dt for all θ in Θ and Dt is measurable such that ‖Dt ‖r ≤ ∆ < ∞ for all t. (ii) {∇ g (Xt , θ )} is r-dominated on Θ uniformly in t. A.6 {Vt } is an α -mixing sequence of size −2r /(r − 2), with r > 2. A.7 The elements of (i) {g (Xt , θ )} are NED on {Vt } of size −1 uniformly on (Θ , ρ), where ρ is any convenient norm on Rp , and (ii) {∇ g (Xt , θ )} are NED on {Vt } of size −1 uniformly on (Θ , ρ). ∑n A.8 Σ ≡ limn→∞ var(n−1/2 t =1 g (Xt , θ0 )) is positive definite, ∑ n and G ≡ limn→∞ E (n−1 t =1 ∇ g (Xt , θ0 )) is of full rank.
J. Allen et al. / Journal of Econometrics 161 (2011) 110–121
Assumption B. B.1 {g (Xt , θ )} is 3r-dominated on Θ uniformly in t , r > 2. B.2 For some small δ > 0 and some r > 2, the elements of {g (Xt , θ )} are L2+δ -NED on {Vt } of size −(2(r − 1))/(r − 2) uniformly on (Θ , ρ); {Vt } is an α -mixing sequence of size −((2 + δ)r )/(r − 2). The following two lemmas are required to prove Theorems 1–3. Lemma 1. Suppose Assumption A in the mathematical appendix holds. Then θˆ − θ0 →P 0. If also ℓ → ∞ and ℓ = o(n), then ∗∗ θMBB − θˆ →P ∗ ,P 0. If also Assumption B in the Appendix holds and ∗ ℓ = o(n1/2−1/r ), then θMBB − θˆ →P ∗ ,P 0.
Lemma 2. Suppose Assumption A in the mathematical appendix ∗∗ holds, ℓ → ∞, and ℓ = o(n). Then θNBB − θˆ →P ∗ ,P 0. If also
∗ ℓ = o(n(r −2)/2(r −1) ), then θNBB − θˆ →P ∗ ,P 0. Note that ℓ must satisfy 1/2 ℓ = o(n ) because (r − 2)/2(r − 1) < 1/2.
If we compare conditions on ℓ, the condition with the NBB is slightly weaker because (r − 2)/2(r − 1) = 1/2 − 1/2(r − 1) and 2(r − 1) > r. 8.1. Proof of Lemma 1 The proof follows the proof of Theorem 2.1 of GW04, with two differences: (i) the objective function is a GMM objective function, and (ii) in the case of EMB, the bootstrapped objective function depends on the probability weight πˆ io . θˆ − θ0 →P 0 follows from applying Lemma A.2 of GW04 to the GMM objective function, because conditions (a1)–(a3) in Lemma A.2 ∗∗ of GW04 are satisfied by Assumption A. The consistency of θMBB is proved by applying Lemma A.2 of GW04. Their conditions (b1)–(b2) A.2. Define Q˜ n (θ ) = ∑ are satisfied by Assumption ∑ (n−1 nt=1 g (Xt∗ , θ ))′ Wn∗ (n−1 nt=1 g (Xt∗ , θ )), then their condition ∗∗ ˜ (b3) holds because supθ |QMBB ,n (θ ) − Qn (θ )| →P ∗ ,P 0 from a ˜ standard argument and supθ |Qn (θ ) − Qn (θ )| →P ∗ ,P 0 by lemmas A.4 and A.5 of GW04. ∗ We prove the consistency of θMBB by approximating the EMB sample moment condition with an uncentered MBB mo∑b ment condition, namely, by showing supθ |b−1 i=1 Tio∗ (θ ) − ∑ n n−1 t =1 g (Xt∗ , θ )| →P ∗ ,P 0 for suitably chosen Tio∗ (θ )’s and Xt∗ ’s. ∗ Then the consistency of θMBB follows from the proof of the consis∗∗ tency of θMBB . We will use the following result, which we prove later:
N πˆ io = 1 + δni ,
117
≤ max1≤i≤N µ(Di ) i=1 supθ |Tio (θ )| = oP (1). Therefore, supθ ∑ ∑ |b−1 bi=1 Tio∗ (θ ) − n−1 nt=1 g (Xt∗ , θ )| = oP ∗ ,P (1), and the ∗ consistency of θMBB follows. It remains to show (16). First we show γ o (θˆ ) = OP (ℓn−1/2 ). ∑N
max |δni | = oP (1).
(16)
1≤i≤N
Partition the interval [0, 1] into A1 , . . . , AN , where Ai = [πˆ 0o + · · · + πˆ io−1 , πˆ 0o + · · · + πˆ io ] with πˆ 0o = 0. Partition the interval [0, 1] into N sets, B1 , . . . , BN , where the Bi ’s are chosen such that µ(Bi ) = 1/N and max1≤i≤N µ(Di ) = o(N −1 ), where µ denotes the Lebesgue measure on [0, 1], and Di = (Ai − Bi ) ∪ (Bi − Ai ), i.e., the symmetric difference between Ai and Bi . Such a construction of B1 , . . . , BN is possible by virtue of (16). One way to construct {Tko∗ (θ )}bk=1 and {X˜ k∗ }bk=1 is to draw i.i.d. uniform [0, 1] random variables U1 , . . . , Ub and set Tko∗ (θ ) = Tio (θ) if Uk ∈ Ai and set X˜ k∗ = X˜ i if Uk ∈ Bi . Then we may
∑b ∑b ∑N write b−1 i=1 Tio∗ (θ ) = b−1 k=1 i=1 1{Uk ∈ Ai }Tio (θ ) and
∑ ∑ ∑ ∑ |b−1 bi=1 Tio∗ (θ ) − n−1 nt=1 g (Xt∗ , θ )| = b−1 bk=1 Ni=1 1{Uk ∈ Di }|Tio (θ )|. Taking the bootstrap expectation of its supremum over ∑ ∑ ∑ θ gives E ∗ supθ b−1 bk=1 Ni=1 1{Uk ∈ Di }|Tio (θ )| ≤ E ∗ b−1 bk=1 ∑N ∑ N o o ∗ i=1 1{Uk ∈ Di } supθ |Ti (θ )| = E i=1 1{U1 ∈ Di } supθ |Ti (θ )|
In view of the argument in pp. 100–101 of Owen (1990) (see also Kitamura, 1997), γ o (θˆ ) = OP (ℓn−1/2 ) holds if (a) ℓN −1 ∑N o ∑N o −1 −1/2 ˆ o ˆ ′ ˆ ), and i=1 Ti (θ )Ti (θ ) →P Σ , (b) ℓN i=1 Ti (θ ) = OP (ℓn
(c) max1≤i≤N |Tio (θˆ )| = oP (n1/2 ℓ−1 ). For (a), using a mean value expansion and Assumption A.5 gives
N N − −1 − o ˆ o ˆ ′ −1 o o ′ Ti (θ )Ti (θ ) − ℓN Ti (θ0 )Ti (θ0 ) ℓN i =1 i=1 ≤ |θˆ − θ0 |2ℓN −1
N − i=1
sup |∇ Tio (θ )||Tio (θ )| θ
= OP (n−1/2 ℓ) = oP (1). ∑ ¯ ∗n = n−1 nt=1 g (Xt∗ , θ0 ), then we have (cf. Lahiri, 2003, Define G ∑ √ ∗ N ¯ n ) + ℓT¯n T¯n′ , where p. 48) ℓN −1 i=1 Tio (θ0 )Tio (θ0 )′ = var∗ ( nG ∑ √ ¯Tn = N −1 Ni=1 Tio (θ0 ). var∗ ( nG¯ ∗n ) − Σ →P 0 from Corollary 2.1 of Gonçalves and White (2002) (hereafter GW02). T¯n is equal to X¯ γ ,n defined in p. 1371 of GW02 if we replace their Xt with g (Xt , θ0 ). GW02 p. 1381 shows X¯ γ ,n = oP (ℓ−1 ), and hence ∑ ℓT¯n2 = oP (1). Therefore, ℓN −1 Ni=1 Tio (θ0 )Tio (θ0 )′ →P Σ , and (a) follows. (b) follows from expanding Tio (θˆ ) around θ0 and using ∑N ∑n N −1 i=1 Tio (θ0 ) = n−1 t =1 g (Xt , θ0 ) + Op (n−1 ℓ) (see Lemma A.1 of Fitzenberger, 1997), and applying the central limit theorem. (c) holds because max1≤i≤N |Tio (θˆ )| = Oa.s. (N 1/r ) from Lemma 3.2 of Künsch (1989) and ℓ = o(n1/2−1/r ). Therefore, we have
γ o (θˆ ) = OP (ℓn−1/2 ),
max |γ o (θˆ )′ Tio (θˆ )| = oP (1).
(17)
1≤i≤N
(16) follows from expanding N πio = (1 + γ o (θˆ )′ Tio (θˆ ))−1 around
γ o (θˆ )′ Tio (θˆ ) = 0.
8.2. Proof of Lemma 2 ∗∗ In view of the proof of Lemma 1, the consistency of θNBB holds because condition (b3) of Lemma A.2 of GW04 holds because supθ |Q˜ n (θ ) − Qn (θ )| →P ∗ ,P 0 by Lemmas 3 and 4. ∗ ∗ In view of the proof of the consistency of θMBB in Lemma 1, θNBB is consistent if
γ (θˆ ) = OP (ℓn−1/2 ),
max |γ (θˆ )′ Ti (θˆ )| = oP (1).
(18)
1≤i≤b
Eq. (18) holds if (a) ℓb−1
∑b
i=1
Ti (θˆ )Ti (θˆ )′ →P Σ , (b) ℓb−1
∑b
i=1
Ti (θˆ ) = OP (ℓn−1/2 ), and (c) max1≤i≤b |Ti (θˆ )| = oP (n1/2 ℓ−1 ). (a) follows from expanding Ti (θˆ ) around θ0 and using Corollary 2. (b) follows from expanding Ti (θˆ ) around θ0 and applying the central limit theorem. (c) follows because max1≤i≤b |Ti (θˆ )| = Oa.s. (b1/r ) and ℓ = o(n(r −2)/2(r −1) ). 8.3. Proof of Theorem 1 Define H = (G′ WG)−1 G′ W Σ WG(G′ WG)−1 , then the stated √ result follows from Polya’s theorem if we show n(θˆ − √ ∗ √ ∗∗ ˆ θ0 ) →d N (0, H ), n(θMBB − θ ) →d∗ N (0, H ) prob-P, and n(θMBB
− θˆ ) →d∗ N (0, H ) prob-P. The limiting distribution of
√
n(θˆ − θ0 ) follows from a standard argument. ∗ ∗∗ The proof of the asymptotic normality of θMBB and θMBB uses Theorem 2.1 of Mason and Newton (1992), who prove
118
J. Allen et al. / Journal of Econometrics 161 (2011) 110–121
the consistency of generalized bootstrap (weighted bootstrap) procedures. We first derive the asymptotics of the EMB estimator. The for∑the EMB estimator is ∑bfirst order∗ condition b o∗ ∗ ∗ −1 0 = b−1 i=1 ∇ Tio∗ (θMBB )′ WMBB ,n b i=1 Ti (θMBB ). Expanding b−1
∑b
∗ Tio∗ (θMBB ) around θˆ and approximating b−1
∑b
i=1 ∑ ∇ Tio∗ (θ) by n−1 bi=1 ∇ g (Xt∗ , θ ) as in the proof of Lemma 1 gives ∑b o∗ ˆ ∗ ∗ 1/2 −1 ˜ −1 ˜ ′ ∗ n1/2 (θMBB − θˆ ) = −(G˜ ′n WMBB b ,n Gn ) Gn WMBB,n n i=1 Ti (θ ), ˜ where Gn is a generic notation for G + oP ∗ ,P (1). We proceed to ∑b rewrite n1/2 b−1 i=1 Tio∗ (θˆ ) so that we can apply the results in Mason and Newton (1992). For i = 1, . . . , N, let wNi be the number of times Tio (θ ) appears in a bootstrap sample {Tko∗ (θ )}bk=1 . Conditional on X1 , . . . , Xn , an N-vector wN = (wN1 , . . . , wNN )′ follows a multinomial distribution such that wN ∼ Mult(b; πˆ 1o , . . . , πˆ No ). ∑N Using wNi in conjunction with i=1 πˆ io Tio (θˆ ) = 0 and bℓ = n, we ∑N ∑ b may rewrite n1/2 b−1 i=1 Tio∗ (θˆ ) = N −1/2 i=1 (N /b)1/2 (wNi − bπˆ io )ℓ1/2 Tio (θˆ ). ∗ Therefore, the asymptotic normality of θMBB follows if we show
N −1/2
i=1
N − (N /b)1/2 (wNi − bπˆ io )ℓ1/2 Tio (θˆ ) →d∗ N (0, Σ ) i =1
prob-P .
(19)
We apply Theorem 2.1 of Mason and Newton (1992) to the left hand side of (19) with two minor changes. First, the weights in Theorem 2.1 of Mason and Newton (1992) do not depend on the data, whereas our $w_N$ depends on the data through $\hat{\pi}_i^o$. As Mason and Newton (1992) discuss on p. 1618, their Theorem 2.1 holds if the weights are exchangeable given the data. Second, in Mason and Newton (1992), condition (2.4) and result (2.7) hold $P$-almost surely. We can weaken both to hold in $P$-probability because $x_n \to x$ in probability if and only if every subsequence of $\{x_n\}$ has a further subsequence that converges almost surely to $x$ (see, for example, Theorem 6.2 on p. 46 of Durrett (2005)). For simplicity, we assume $T_i^o(\theta)$ to be a scalar without loss of generality. Note that our $\{N, \ell^{1/2} T_i^o(\hat{\theta}), (N/b)^{1/2}(w_{Ni} - b\hat{\pi}_i^o)\}$ corresponds to $\{k_n, X_{n,k}, Y_{n,k}\}$ in Mason and Newton (1992). From Theorem 2.1 of Mason and Newton (1992), (19) follows if we show (recall $\sum_{i=1}^{N}(w_{Ni} - b\hat{\pi}_i^o) = 0$ by construction)

$$N^{-1}\sum_{i=1}^{N}\bigl(\ell^{1/2}T_i^o(\hat{\theta}) - \ell^{1/2}\bar{T}^o(\hat{\theta})\bigr)^2 \to_P \Sigma, \qquad N^{-1}\sum_{i=1}^{N}\bigl((N/b)^{1/2}(w_{Ni} - b\hat{\pi}_i^o)\bigr)^2 \to_{P^*,P} 1, \tag{20}$$

where $\bar{T}^o(\hat{\theta}) = N^{-1}\sum_{i=1}^{N} T_i^o(\hat{\theta})$, and, for all $\tau > 0$,

$$\text{(a)}\ \max_{1\le i\le N} U_{Ni}^2 \to_P 0, \qquad \text{(b)}\ \max_{1\le i\le N} V_{Ni}^2 \to_{P^*,P} 0, \tag{21}$$

$$D_N(\tau) = \sum_{i=1}^{N}\sum_{j=1}^{N} U_{Ni}^2 V_{Nj}^2\, 1\{N U_{Ni}^2 V_{Nj}^2 > \tau\} \to_{P^*,P} 0, \tag{22}$$

where

$$U_{Ni} = \frac{\ell^{1/2}T_i^o(\hat{\theta}) - \ell^{1/2}\bar{T}^o(\hat{\theta})}{\bigl(\sum_{i=1}^{N}(\ell^{1/2}T_i^o(\hat{\theta}) - \ell^{1/2}\bar{T}^o(\hat{\theta}))^2\bigr)^{1/2}}, \qquad V_{Ni} = \frac{(N/b)^{1/2}(w_{Ni} - b\hat{\pi}_i^o)}{\bigl(\sum_{i=1}^{N}((N/b)^{1/2}(w_{Ni} - b\hat{\pi}_i^o))^2\bigr)^{1/2}}.$$

We proceed to check (20)–(22). The first part of (20) holds because (a) and (b) in the proof of Lemma 1 show $\ell N^{-1}\sum_{i=1}^{N} T_i^o(\hat{\theta})^2 \to_P \Sigma$ and $\bar{T}^o(\hat{\theta}) = O_P(N^{-1/2})$. The second part of (20) follows from applying Lemma 5 to the left hand side with $r = 2$ because $w_N$ satisfies the assumptions in Lemma 5. Part (a) of (21) follows from the first part of (20), $\bar{T}^o(\hat{\theta}) = O_P(N^{-1/2})$, and $\max_{1\le i\le N}|T_i^o(\hat{\theta})| = o_P(N^{1/2}\ell^{-1})$, which is shown in (c) in the proof of Lemma 1. Part (b) of (21) follows from Theorem 1 of Hoeffding (1951) in conjunction with the second part of (20) and Lemma 5 with $r = 4$. Finally, (22) can be shown by an argument similar to Corollary 2.2 of Mason and Newton (1992). For any $\varepsilon \in (0,1)$, from (b) of (21) we have, for sufficiently large $N$, with prob-$P^*$, prob-$P$ greater than $1-\varepsilon$,

$$D_N(\tau) \le \sum_{i=1}^{N}\sum_{j=1}^{N} U_{Ni}^2 V_{Nj}^2\, 1\{N U_{Ni}^2 > \tau/\varepsilon\} = \sum_{i=1}^{N} U_{Ni}^2\, 1\{N U_{Ni}^2 > \tau/\varepsilon\}.$$

From the first part of (20) and the order of $\bar{T}^o(\hat{\theta})$, this is bounded by $\Sigma^{-1}N^{-1}\sum_{i=1}^{N}\ell T_i^o(\theta_0)^2\, 1\{N U_{Ni}^2 > \tau/\varepsilon\} + o_P(1)$. Consequently, choosing $\varepsilon$ sufficiently small gives $D_N(\tau) \to_{P^*,P} 0$ from $E\,\ell|T_i^o(\theta_0)|^2 = O(1)$ (see Lemmas A.1 and A.2 of GW02) and the dominated convergence theorem.

For the standard bootstrap estimator, expanding the first order condition and applying a routine argument gives $n^{1/2}(\theta_{MBB}^{**} - \hat{\theta}) = -(\tilde{G}_n' W_n^{**}\tilde{G}_n)^{-1}\tilde{G}_n' W_n^{**} n^{-1/2}\sum_{t=1}^{n} g^*(X_t^*, \hat{\theta})$. For $i = 1,\dots,N$, let $w_{Ni}^*$ be the number of times $\tilde{X}_i$ appears in a bootstrap sample $\{\tilde{X}_k^*\}_{k=1}^{b}$. Conditional on $X_1,\dots,X_n$, the $N$-vector $w_N^* = (w_{N1}^*,\dots,w_{NN}^*)'$ follows $\mathrm{Mult}(b; 1/N,\dots,1/N)$. Using $N^{-1}\sum_{i=1}^{N} T_i^o(\hat{\theta}) = n^{-1}\sum_{t=1}^{n} g(X_t,\hat{\theta}) + O_P(n^{-1}\ell)$ (see Lemma A.1 of Fitzenberger (1997)), we may write $n^{-1/2}\sum_{t=1}^{n} g^*(X_t^*,\hat{\theta}) = N^{-1/2}\sum_{i=1}^{N}(N/b)^{1/2}(w_{Ni}^* - b/N)\ell^{1/2}T_i^o(\hat{\theta}) + o_P(1)$. Since $w_N^*$ satisfies the assumptions in Lemma 5, repeating the proof for the EMB estimator with $w_N$ replaced by $w_N^*$ gives $N^{-1/2}\sum_{i=1}^{N}(N/b)^{1/2}(w_{Ni}^* - b/N)\ell^{1/2}T_i^o(\hat{\theta}) \to_{d^*} N(0,\Sigma)$ prob-$P$, and the stated result follows.

8.4. Proof of Theorem 2

The proof closely follows the proof of Theorem 1. Because we sample from $b$ blocks, instead of $N$, we use Corollary 1 in place of Lemma 5. We first derive the asymptotics of the ENB estimator. Expanding the first order condition for the ENB estimator gives $n^{1/2}(\theta_{NBB}^* - \hat{\theta}) = -(\tilde{G}_n' W_{NBB,n}^*\tilde{G}_n)^{-1}\tilde{G}_n' W_{NBB,n}^*\, n^{1/2} b^{-1}\sum_{i=1}^{b} T_i^*(\hat{\theta})$, where $\tilde{G}_n$ is a generic notation for $G + o_{P^*,P}(1)$. The required result follows if we show $n^{1/2} b^{-1}\sum_{i=1}^{b} T_i^*(\hat{\theta}) \to_{d^*} N(0,\Sigma)$ prob-$P$. For $i = 1,\dots,b$, let $w_{bi}$ be the number of times $T_i(\theta)$ appears in a bootstrap sample $\{T_k^*(\theta)\}_{k=1}^{b}$. Conditional on $X_1,\dots,X_n$, the $b$-vector $w_b = (w_{b1},\dots,w_{bb})'$ follows $\mathrm{Mult}(b; \hat{\pi}_1,\dots,\hat{\pi}_b)$. Using $\sum_{i=1}^{b}\hat{\pi}_i T_i(\hat{\theta}) = 0$ and $b\ell = n$, we may rewrite $n^{1/2} b^{-1}\sum_{i=1}^{b} T_i^*(\hat{\theta}) = b^{-1/2}\sum_{i=1}^{b}(w_{bi} - b\hat{\pi}_i)\ell^{1/2}T_i(\hat{\theta})$. From Theorem 2.1 of Mason and Newton (1992), $b^{-1/2}\sum_{i=1}^{b}(w_{bi} - b\hat{\pi}_i)\ell^{1/2}T_i(\hat{\theta}) \to_{d^*} N(0,\Sigma)$ prob-$P$ follows if we show

$$b^{-1}\sum_{i=1}^{b}\bigl(\ell^{1/2}T_i(\hat{\theta}) - \ell^{1/2}\bar{T}(\hat{\theta})\bigr)^2 \to_P \Sigma, \qquad b^{-1}\sum_{i=1}^{b}(w_{bi} - b\hat{\pi}_i)^2 \to_{P^*,P} 1, \tag{23}$$

where $\bar{T}(\hat{\theta}) = b^{-1}\sum_{i=1}^{b} T_i(\hat{\theta})$, and, for all $\tau > 0$,

$$\text{(a)}\ \max_{1\le i\le b} U_{bi}^2 \to_P 0, \qquad \text{(b)}\ \max_{1\le i\le b} V_{bi}^2 \to_{P^*,P} 0, \tag{24}$$

$$D_b(\tau) = \sum_{i=1}^{b}\sum_{j=1}^{b} U_{bi}^2 V_{bj}^2\, 1\{b U_{bi}^2 V_{bj}^2 > \tau\} \to_{P^*,P} 0, \tag{25}$$

where $U_{bi} = (\ell^{1/2}T_i(\hat{\theta}) - \ell^{1/2}\bar{T}(\hat{\theta}))/\bigl(\sum_{i=1}^{b}(\ell^{1/2}T_i(\hat{\theta}) - \ell^{1/2}\bar{T}(\hat{\theta}))^2\bigr)^{1/2}$ and $V_{bi} = (w_{bi} - b\hat{\pi}_i)/\bigl(\sum_{i=1}^{b}(w_{bi} - b\hat{\pi}_i)^2\bigr)^{1/2}$.

We proceed to check (23)–(25). The first part of (23) holds because (a) and (b) in the proof of Lemma 2 show $\ell b^{-1}\sum_{i=1}^{b} T_i(\hat{\theta})^2 \to_P \Sigma$ and $\bar{T}(\hat{\theta}) = O_P(n^{-1/2})$. The second part of (23) follows from applying Corollary 1 with $r = 2$. Part (a) of (24) follows from the first part of (23), $\bar{T}(\hat{\theta}) = O_P(n^{-1/2})$, and $\max_{1\le i\le b}|T_i(\hat{\theta})| = o_P(n^{1/2}\ell^{-1})$, which is shown in (c) in the proof of Lemma 2. Part (b) of (24) follows from Theorem 1 of Hoeffding (1951) in conjunction with the second part of (23) and Corollary 1 with $r = 4$. Finally, (25) is shown by repeating the argument of the proof of (22), since $U_{bi}^2 = b^{-1}\ell T_i(\theta_0)^2(\Sigma^{-1} + o_{P^*,P}(1))$. This completes the derivation of the asymptotics of $\theta_{NBB}^*$. The proof for the standard bootstrap estimator $\theta_{NBB}^{**}$ is very similar and omitted.

8.5. Proof of Theorem 3

The validity of the bootstrap Wald test is proven if we show $S_n^{**}(\theta^*), S_{MBB,n}^*(\theta^*), S_{NBB,n}^*(\theta^*) \to_{P^*,P} \Sigma$ for any root-$n$ consistent $\theta^*$. Using an argument similar to the consistency proof of $\theta_{MBB}^*$, we can show $S_{MBB,n}^*(\theta^*)$ is asymptotically equivalent in distribution to an $S_n^{**}(\theta^*)$ constructed from a standard MBB sample. $S_n^{**}(\theta^*) \to_{P^*,P} \Sigma$ then follows from result (iii) in the proof of Theorem 3.1 of GW04. Similarly, $S_{NBB,n}^*(\theta^*)$ is asymptotically equivalent in distribution to an $S_n^{**}(\theta^*)$ constructed from a standard NBB sample, which converges to $\Sigma$ by Corollary 2. $J_n \to_d \chi^2_{m-p}$ if $W_n \to_P \Sigma^{-1}$ and $n^{-1/2}\sum_{t=1}^{n} g(X_t,\theta_0) \to_d N(0,\Sigma)$, which follows from Assumptions A and B and a standard argument. $J_{MBB,n}^* \to_{d^*} \chi^2_{m-p}$ prob-$P$ because $S_{MBB,n}^*(\tilde{\theta}_{MBB}^*) \to_{P^*,P} \Sigma$ and $n^{1/2}b^{-1}\sum_{i=1}^{b} T_i^{o*}(\hat{\theta}) \to_{d^*} N(0,\Sigma)$ prob-$P$. $J_{MBB,n}^{**} \to_{d^*} \chi^2_{m-p}$ prob-$P$ follows because $S_n^{**}(\theta_{MBB}^{**}) \to_{P^*,P} \Sigma$ and we have shown in the proof of Theorem 1 that $n^{-1/2}\sum_{t=1}^{n} g^*(X_t^*,\hat{\theta}) \to_{d^*} N(0,\Sigma)$ prob-$P$. The convergence of $J_{NBB,n}^*$ and $J_{NBB,n}^{**}$ is proven by a similar argument.
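The weight normalization conditions in (20) and (23) are easy to probe by simulation. The sketch below is our own illustration, not part of the original proofs; variable names such as `pi_hat` are ours. It draws multinomial bootstrap weights with probabilities close to uniform, as the EL-implied probabilities are under the null, and checks that $b^{-1}\sum_i (w_{bi} - b\hat{\pi}_i)^2$ concentrates near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

b = 5000                                   # number of blocks
# probabilities close to uniform (an assumption of this illustration)
pi_hat = (1.0 + 0.1 * rng.standard_normal(b) / np.sqrt(b)) / b
pi_hat = np.clip(pi_hat, 1e-12, None)
pi_hat /= pi_hat.sum()

w_b = rng.multinomial(b, pi_hat)           # w_b ~ Mult(b; pi_1, ..., pi_b)

# second condition in (23): b^{-1} sum_i (w_bi - b*pi_i)^2 -> 1
stat = np.mean((w_b - b * pi_hat) ** 2)
print(stat)                                # close to 1 for large b
```

Since $E(w_{bi} - bp_i)^2 = bp_i(1 - p_i) \approx 1$ when $p_i \approx 1/b$, the printed statistic is close to 1, as the proof requires.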
9. Auxiliary results

Lemma 3 (NBB Uniform WLLN). Let $\{q_{nt}^*(\cdot,\omega,\theta)\}$ be an NBB resample of $\{q_{nt}(\omega,\theta)\}$ and assume: (a) for each $\theta \in \Theta \subset \mathbb{R}^p$, $\Theta$ a compact set, $n^{-1}\sum_{t=1}^{n}(q_{nt}^*(\cdot,\omega,\theta) - q_{nt}(\omega,\theta)) \to 0$, prob-$P_{n,\omega}^*$, prob-$P$; and (b) for all $\theta \in \Theta$, $|q_{nt}(\cdot,\theta) - q_{nt}(\cdot,\theta_0)| \le L_{nt}|\theta - \theta_0|$ a.s.-$P$, where $\sup_n n^{-1}\sum_{t=1}^{n} E(L_{nt}) = O(1)$. Then, if $\ell = o(n)$, for any $\delta > 0$ and $\xi > 0$,

$$\lim_{n\to\infty} P\left[\, P_{n,\omega}^*\left(\sup_{\theta\in\Theta}\left| n^{-1}\sum_{t=1}^{n}\bigl(q_{nt}^*(\cdot,\omega,\theta) - q_{nt}(\omega,\theta)\bigr)\right| > \delta\right) > \xi\,\right] = 0.$$

Proof. The proof closely follows that of Lemma 8 of Hall and Horowitz (1996).

Lemma 4 (NBB Pointwise WLLN). For some $r > 2$, let $\{q_{nt}: \Omega\times\Theta \to \mathbb{R}^m, m\in\mathbb{N}\}$ be such that for all $n, t$, there exists $D_{nt}:\Omega\to\mathbb{R}$ with $|q_{nt}(\cdot,\theta)| \le D_{nt}$ for all $\theta\in\Theta$ and $\|D_{nt}\|_r \le \Delta < \infty$. For each $\theta\in\Theta$ let $\{q_{nt}^*(\cdot,\omega,\theta)\}$ be an NBB resample of $\{q_{nt}(\omega,\theta)\}$. If $\ell = o(n)$, then for any $\delta > 0$, $\xi > 0$ and for each $\theta\in\Theta$,

$$\lim_{n\to\infty} P\left[\, P_{n,\omega}^*\left(\left| n^{-1}\sum_{t=1}^{n}\bigl(q_{nt}^*(\cdot,\omega,\theta) - q_{nt}(\omega,\theta)\bigr)\right| > \delta\right) > \xi\,\right] = 0.$$

Proof. Fix $\theta\in\Theta$; we suppress $\theta$ and $\omega$ henceforth. Since $q_{nt}^*$ is an NBB resample, $E^* q_{nt}^* = n^{-1}\sum_{t=1}^{n} q_{nt} = \bar{q}_n$ and hence $\sum_{t=1}^{n}(q_{nt}^* - q_{nt}) = \sum_{t=1}^{n}(q_{nt}^* - E^* q_{nt}^*)$. From the arguments in the proof of Lemma A.5 of GW04, the stated result follows if $\|\mathrm{var}^*(n^{-1/2}\sum_{t=1}^{n} q_{nt}^*)\|_{r/2} = O(\ell)$ for some $r > 2$. Define $U_{ni} = \ell^{-1}\sum_{t=1}^{\ell} q_{n,(i-1)\ell+t}$, the average of the $i$th block. Since the blocks are independently sampled, we have (cf. Lahiri, 2003, p. 48)

$$\mathrm{var}^*\left(n^{-1/2}\sum_{t=1}^{n} q_{nt}^*\right) = b^{-1}\ell\sum_{i=1}^{b}(U_{ni} - \bar{q}_n)(U_{ni} - \bar{q}_n)' = b^{-1}\ell^{-1}\sum_{i=1}^{b}\sum_{t=1}^{\ell}(q_{n,(i-1)\ell+t} - \bar{q}_n)\sum_{s=1}^{\ell}(q_{n,(i-1)\ell+s} - \bar{q}_n)' = R_n(0) + b^{-1}\sum_{i=1}^{b}\sum_{\tau=1}^{\ell-1}\bigl(R_{ni}(\tau) + R_{ni}'(\tau)\bigr),$$

where $R_n(0) = n^{-1}\sum_{t=1}^{n}(q_{nt} - \bar{q}_n)(q_{nt} - \bar{q}_n)'$ and $R_{ni}(\tau) = \ell^{-1}\sum_{t=1}^{\ell-\tau}(q_{n,(i-1)\ell+t} - \bar{q}_n)(q_{n,(i-1)\ell+t+\tau} - \bar{q}_n)'$, $\tau = 1,\dots,\ell-1$. Applying the Minkowski and Cauchy–Schwarz inequalities gives $\|R_n(\tau)\|_{r/2} = O(1)$, $\tau = 0,\dots,\ell-1$, and $\|\mathrm{var}^*(n^{-1/2}\sum_{t=1}^{n} q_{nt}^*)\|_{r/2} = O(\ell)$ follows.
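The key identity used in this proof, that the NBB conditional variance of the normalized resample mean equals $b^{-1}\ell\sum_i (U_{ni} - \bar{q}_n)^2$ in the scalar case, can be verified numerically. The following sketch is our own illustration (names are ours) comparing the closed form with a Monte Carlo over resamples.

```python
import numpy as np

rng = np.random.default_rng(3)

b, ell = 50, 20
n = b * ell
q = rng.standard_normal(n)
U = q.reshape(b, ell).mean(axis=1)         # block averages U_{ni}
qbar = q.mean()                            # equals the mean of U as well

# closed form: b^{-1} * ell * sum_i (U_i - qbar)^2
analytical = ell * np.mean((U - qbar) ** 2)

# Monte Carlo: an NBB resample draws b blocks i.i.d. with replacement
draws = np.array([
    np.sqrt(1 / n) * ell * U[rng.integers(0, b, b)].sum()
    for _ in range(20000)
])
print(analytical, draws.var())             # the two values agree closely
```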
Lemma 5. Suppose $w_N = (w_{N1},\dots,w_{NN})'$ follows a multinomial distribution $w_N \sim \mathrm{Mult}(b; p_1,\dots,p_N)$. Assume further $\max_{1\le i\le N}|Np_i - 1| \to 0$ and $N/b^2 \to 0$ as $N\to\infty$. Then, for $r = 2, 4$, as $N\to\infty$,

$$N^{-1}\sum_{i=1}^{N}\bigl|(bp_i)^{-1/2}(w_{Ni} - bp_i)\bigr|^r \to_P \lim_{N\to\infty} N^{-1}\sum_{i=1}^{N} E\bigl|(bp_i)^{-1/2}(Z(bp_i) - bp_i)\bigr|^r,$$

where $Z(c)$ is a Poisson random variable with mean $c$. The limit on the right hand side exists because $EZ(c) = c$, $E(Z(c)-c)^2 = c$, and $E(Z(c)-c)^4 = 3c^2 + c$.

Corollary 1. Suppose $w_b = (w_{b1},\dots,w_{bb})'$ follows $w_b \sim \mathrm{Mult}(b; p_1,\dots,p_b)$. Assume further $\max_{1\le i\le b}|bp_i - 1| \to 0$ as $b\to\infty$. Then, for $r = 2, 4$, as $b\to\infty$, $b^{-1}\sum_{i=1}^{b}|w_{bi} - bp_i|^r \to_P \lim_{b\to\infty} b^{-1}\sum_{i=1}^{b} E|Z(bp_i) - bp_i|^r$, where $Z(c)$ is a Poisson random variable with mean $c$.

Proof. The proof closely follows that of Lemma 4.1 of Mason and Newton (1992). Their $\{n, j, M_{n,j}\}$ correspond to our $\{N, i, w_{Ni}\}$. We need to adjust the proof of Mason and Newton (1992) because we assume $w_N$ follows a multinomial distribution $(b; p_1,\dots,p_N)$, whereas Mason and Newton (1992) assume $nM_n$ follows a multinomial distribution $(n; 1/n,\dots,1/n)$. Let $U_1, U_2,\dots$ be a sequence of i.i.d. $U[0,1]$ random variables. Similar to Mason and Newton (1992), define $G_b(t) = \sum_{k=1}^{b} 1\{U_k \le t\}$ and $G_b^*(t) = \sum_{k=1}^{N(b)} 1\{U_k \le t\}$, where $N(t)$ is a Poisson process independent of the $U_k$'s. We can then write $w_{Ni} = G_b(p_1+\dots+p_i) - G_b(p_1+\dots+p_{i-1})$ with $p_0 = 0$ for $1\le i\le N$. Further, analogously to $M_{n,j}^*$ in (4.3) of Mason and Newton (1992), define $w_{Ni}^* = G_b^*(p_1+\dots+p_i) - G_b^*(p_1+\dots+p_{i-1})$; then the elements of $w_N^* = (w_{N1}^*,\dots,w_{NN}^*)'$ are independent Poisson$(bp_i)$ random variables. Consequently, it follows from the weak law of large numbers, $\max_{1\le i\le N}|Np_i - 1|\to 0$, and $N^{-1}\sum_{i=1}^{N}|(bp_i)^{-1/2}(bp_i - b/N)|^r \to 0$ that, as in (4.4) of Mason and Newton (1992),

$$N^{-1}\sum_{i=1}^{N}\bigl|(bp_i)^{-1/2}(w_{Ni}^* - \bar{w}_N^*)\bigr|^r \to_P \lim_{N\to\infty} N^{-1}\sum_{i=1}^{N} E\bigl|(bp_i)^{-1/2}(Z(bp_i) - bp_i)\bigr|^r,$$

where $\bar{w}_N^* = N^{-1}\sum_{i=1}^{N} w_{Ni}^*$. The stated result follows from replacing $S_n$ and $T_n$ in Mason and Newton (1992) with $S_N = N^{-1}\sum_{i=1}^{N}|(bp_i)^{-1/2}(w_{Ni} - \bar{w}_N^* - w_{Ni}^* + bp_i)|^r$ and $T_N = E\bigl(N^{-1}\sum_{i=1}^{N}|(bp_i)^{-1/2}(w_{Ni} - \bar{w}_N^* - w_{Ni}^* + bp_i)|^r \,\big|\, N(b)\bigr)$ and repeating their argument in conjunction with $N^{-1}\sum_{i=1}^{N}|(bp_i)^{-1/2}(bp_i - b/N)|^r \to 0$ and $N/b^2 \to 0$.
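The Poisson approximation in Lemma 5 can be illustrated by simulation. The sketch below is our own (not from the paper); it takes $p_i = 1/N$, so $\max_i|Np_i - 1| = 0$ trivially. For $r = 2$ the Poisson limit is 1, and for $r = 4$ it is $N^{-1}\sum_i (3 + (bp_i)^{-1})$.

```python
import numpy as np

rng = np.random.default_rng(1)

N, b = 2000, 1000                          # the lemma requires N / b**2 -> 0
p = np.full(N, 1.0 / N)                    # uniform probabilities
w = rng.multinomial(b, p)                  # w ~ Mult(b; p_1, ..., p_N)

for r in (2, 4):
    lhs = np.mean(np.abs((w - b * p) / np.sqrt(b * p)) ** r)
    # Poisson moments: E(Z-c)^2 = c, E(Z-c)^4 = 3c^2 + c, with c = b p_i
    c = b * p
    rhs = np.mean({2: c, 4: 3 * c**2 + c}[r] / c ** (r / 2))
    print(r, lhs, rhs)                     # empirical vs. Poisson moment
```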
The proof of Corollary 1 follows from repeating the above argument with $N$ replaced by $b$.

Lemma 6 (Consistency of NBB Conditional Variance). Assume $\{X_t\}$ satisfies $EX_t = 0$ for all $t$ and $\|X_t\|_{3r} \le \Delta < \infty$ for some $r > 2$ and all $t = 1, 2, \dots$. Assume $\{X_t\}$ is $L_2$-NED on $\{V_t\}$ of size $-2(r-1)/(r-2)$, and $\{V_t\}$ is an $\alpha$-mixing sequence of size $-2r/(r-2)$. Let $\{X_t^*\}$ be an NBB resample of $\{X_t\}$. Define $\bar{X}_n = n^{-1}\sum_{t=1}^{n} X_t$, $\bar{X}_n^* = n^{-1}\sum_{t=1}^{n} X_t^*$, $\Sigma_n = \mathrm{var}(\sqrt{n}\,\bar{X}_n)$, and $\hat{\Sigma}_n = \mathrm{var}^*(\sqrt{n}\,\bar{X}_n^*)$. Then, if $\ell\to\infty$ and $\ell = o(n^{1/2})$, $\Sigma_n - \hat{\Sigma}_n \to_P 0$.

Corollary 2. Assume $X_t$ satisfies the assumptions of Lemma 6. Define $U_i = \ell^{-1}\sum_{t=1}^{\ell} X_{(i-1)\ell+t}$, the average of the $i$th non-overlapping block. Then, if $\ell\to\infty$ and $\ell = o(n^{1/2})$, $b^{-1}\ell\sum_{i=1}^{b} U_i U_i' - \Sigma_n \to_P 0$.

Proof. For simplicity, we assume $X_t$ to be a scalar; the extension to vector-valued $X_t$ is straightforward, see GW02. Define $U_i = \ell^{-1}\sum_{t=1}^{\ell} X_{(i-1)\ell+t}$, the average of the $i$th block. Since the blocks are independently sampled, we have

$$\hat{\Sigma}_n = b^{-1}\ell\sum_{i=1}^{b} U_i^2 - \ell\bar{X}_n^2 = b^{-1}\ell^{-1}\sum_{i=1}^{b}\left(\sum_{t=1}^{\ell} X_{(i-1)\ell+t}\right)\left(\sum_{s=1}^{\ell} X_{(i-1)\ell+s}\right) - \ell\bar{X}_n^2 \tag{26}$$

$$= b^{-1}\sum_{i=1}^{b}\hat{R}_i(0) + 2b^{-1}\sum_{i=1}^{b}\sum_{\tau=1}^{\ell-1}\hat{R}_i(\tau) - \ell\bar{X}_n^2, \tag{27}$$

where $\hat{R}_i(\tau) = \ell^{-1}\sum_{t=1}^{\ell-\tau} X_{(i-1)\ell+t}X_{(i-1)\ell+t+\tau}$, $\tau = 0,\dots,\ell-1$.

First we show $E(\hat{\Sigma}_n) - \Sigma_n = o(1)$. For the third term on the right of (27), $E|\ell\bar{X}_n^2| = o(1)$ holds because it follows from Lemmas A.1 and A.2 of GW02 that $E(\bar{X}_n^2) = n^{-2}E\bigl(\sum_{t=1}^{n} X_t\bigr)^2 \le n^{-2}E\max_{1\le j\le n}\bigl|\sum_{t=1}^{j} X_t\bigr|^2 \le Cn^{-2}\sum_{t=1}^{n} c_t^2 = O(n^{-1})$, where the $c_t$ are (uniformly bounded) mixingale constants of $X_t$. Define $R_i(\tau) = \ell^{-1}\sum_{t=1}^{\ell-\tau} E(X_{(i-1)\ell+t}X_{(i-1)\ell+t+\tau})$ and $R_{ij} = \ell^{-1}\sum_{t=1}^{\ell}\sum_{s=1}^{\ell} E(X_{(i-1)\ell+t}X_{(j-1)\ell+s})$, so that $E(\hat{R}_i(\tau)) = R_i(\tau)$; then $\Sigma_n = b^{-1}\sum_{i=1}^{b} R_i(0) + 2b^{-1}\sum_{i=1}^{b}\sum_{\tau=1}^{\ell-1} R_i(\tau) + b^{-1}\sum_{i=1}^{b}\sum_{j\ne i} R_{ij}$, and $E(\hat{\Sigma}_n) - \Sigma_n = -b^{-1}\sum_{i=1}^{b}\sum_{j\ne i} R_{ij} + o(1)$. From Gallant and White (1988, pp. 109–110), $E(X_tX_{t+\tau})$ is bounded by $|EX_tX_{t+\tau}| \le \Delta(5\alpha_{[\tau/4]}^{1/2-1/r} + 2v_{[\tau/4]}) \le C\tau^{-1-\xi}$ for some $\xi\in(0,1)$, where $v_m$ is the NED coefficient. Therefore, for $|i-j| = k \ge 2$, we have $|R_{ij}| \le C\ell^{-1}\sum_{t=1}^{\ell}\sum_{s=1}^{\ell}((k-1)\ell)^{-1-\xi} = O((k-1)^{-1-\xi}\ell^{-\xi})$, and $|R_{i,i+1}| \le C\ell^{-1}\sum_{t=1}^{\ell}\sum_{s=1}^{\ell}|\ell+s-t|^{-1-\xi} \le C\ell^{-1}\sum_{h=-\ell+1}^{\ell-1}(\ell-|h|)|\ell+h|^{-1-\xi} = O(\ell^{-\xi})$, where the last equality follows from evaluating the sums with $h > 0$ and $h < 0$ separately. It follows that $b^{-1}\sum_{i=1}^{b}\sum_{j\ne i} R_{ij} = O\bigl(\ell^{-\xi} + b^{-1}\sum_{k=2}^{b-1}(b-k)(k-1)^{-1-\xi}\ell^{-\xi}\bigr) = O(\ell^{-\xi})$, and we establish $E(\hat{\Sigma}_n) - \Sigma_n = o(1)$.

It remains to show $\mathrm{var}(\hat{\Sigma}_n) = o(1)$. It suffices to show that the variance of

$$b^{-1}\sum_{i=1}^{b}\bigl(\hat{R}_i(0) - R_i(0)\bigr) + 2b^{-1}\sum_{i=1}^{b}\sum_{\tau=1}^{\ell-1}\bigl(\hat{R}_i(\tau) - R_i(\tau)\bigr) \tag{28}$$

is $o(1)$. Following the derivation in GW02 leading to their equation (A.4), we obtain

$$\mathrm{var}(\hat{R}_i(\tau)) \le \ell^{-2}\sum_{t=1}^{\ell-\tau}\mathrm{var}(X_{(i-1)\ell+t}X_{(i-1)\ell+t+\tau}) + 2\ell^{-2}\sum_{t=1}^{\ell-\tau}\sum_{s=t+1}^{\ell-\tau}\bigl|\mathrm{cov}(X_{(i-1)\ell+t}X_{(i-1)\ell+t+\tau},\, X_{(i-1)\ell+s}X_{(i-1)\ell+s+\tau})\bigr| \le C\ell^{-1}\Bigl(\Delta + \sum_{k=1}^{\infty}\alpha_{[k/4]}^{1/2-1/r} + \sum_{k=1}^{\infty}v_{[k/4]}^{(r-2)/2(r-1)} + \sum_{k=1}^{\infty}v_{[k/4]} + \tau\alpha_{[\tau/4]}^{1-2/r} + \tau v_{[\tau/4]}^2 + 2\tau\alpha_{[\tau/4]}^{1/2-1/r}v_{[\tau/4]}\Bigr) = O(\ell^{-1}).$$

Observe that, when $|i-j| \ge 7$, from Lemma 6.7(a) of Gallant and White (1988) we have, for some $\xi\in(0,1)$, $\mathrm{cov}(\hat{R}_i(\tau), \hat{R}_j(\tau)) \le \ell^{-2}\sum_{t=1}^{\ell-\tau}\sum_{s=1}^{\ell-\tau}|\mathrm{cov}(X_{(i-1)\ell+t}X_{(i-1)\ell+t+\tau},\, X_{(j-1)\ell+s}X_{(j-1)\ell+s+\tau})| \le \ell^{-2}\sum_{t=1}^{\ell-\tau}\sum_{s=1}^{\ell-\tau}\bigl(\alpha_{[(|i-j|-6)\ell/4]}^{1/2-1/r} + v_{[(|i-j|-6)\ell/4]}^{(r-2)/2(r-1)}\bigr) = O\bigl([(|i-j|-6)\ell/4]^{-1-\xi}\bigr) \le C(\ell|i-j|)^{-1-\xi}$. Define $B_r = \{1\le i\le b : i = 7k + r, k\in\mathbb{N}\}$ for $r = 1,\dots,7$, so that all $i\in B_r$ are at least 7 apart from each other. Rewrite (28) as $\sum_{r=1}^{7} b^{-1}\sum_{i\in B_r}(\hat{R}_i(0) - R_i(0)) + 2\sum_{\tau=1}^{\ell-1}\sum_{r=1}^{7} b^{-1}\sum_{i\in B_r}(\hat{R}_i(\tau) - R_i(\tau))$. Then, for $\tau = 0,\dots,\ell-1$, we have $\mathrm{var}\bigl(b^{-1}\sum_{i\in B_r}(\hat{R}_i(\tau) - R_i(\tau))\bigr) = b^{-2}\sum_{i\in B_r}\sum_{j\in B_r}\mathrm{cov}(\hat{R}_i(\tau),\hat{R}_j(\tau)) = O\bigl(b^{-1}\ell^{-1} + \ell^{-1-\xi}b^{-2}\sum_{i=1}^{b}\sum_{j\ne i}|i-j|^{-1-\xi}\bigr) = O\bigl(b^{-1}\ell^{-1} + \ell^{-1-\xi}b^{-2}\sum_{h=1}^{b-1}(b-h)h^{-1-\xi}\bigr) = O(b^{-1}\ell^{-1})$. Therefore, the variance of (28) is $O(\ell b^{-1}) = O(\ell^2 n^{-1}) = o(1)$, giving the stated result. Corollary 2 follows because $b^{-1}\ell\sum_{i=1}^{b} U_iU_i' = \hat{\Sigma}_n + o_P(1)$ from (26).
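As a numerical illustration of Lemma 6 and Corollary 2 (ours, not from the paper), the sketch below computes $\hat{\Sigma}_n$ via (26) from non-overlapping block means of a simulated AR(1) process, whose long-run variance $\sigma^2/(1-\rho)^2$ is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(2)

# AR(1): the long-run variance of sqrt(n) * mean is sigma^2 / (1 - rho)^2
rho, sigma = 0.5, 1.0
n = 200_000
eps = sigma * rng.standard_normal(n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):
    x[t] = rho * x[t - 1] + eps[t]

ell = int(n ** 0.4)                        # block length with ell = o(n^{1/2})
b = n // ell
x = x[: b * ell]
U = x.reshape(b, ell).mean(axis=1)         # non-overlapping block means U_i

# NBB conditional variance, equation (26): b^{-1} ell sum U_i^2 - ell xbar^2
sigma_hat = ell * np.mean(U**2) - ell * x.mean() ** 2
print(sigma_hat, sigma**2 / (1 - rho) ** 2)   # approximately equal
```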
Acknowledgements

We thank Don Andrews, Geoffrey Dunbar, Atsushi Inoue, Gregor Smith, Silvia Gonçalves, Sharon Kozicki, Thanasis Stengos, and Tim Vogelsang for helpful comments and insightful discussion. We also thank seminar participants at the Bank of Canada, Indiana University, Queen's University, the Econometric Society World Congress in London (2005), the Canadian Econometric Study Group in Vancouver (2005), and the Far East Meetings of the Econometric Society in Taiwan (2007). We also thank the referee for insightful comments. We acknowledge the Social Sciences and Humanities Research Council of Canada for support of this research. The views in this paper do not necessarily reflect those of the Bank of Canada. All errors are our own.

References

Ahn, S., Schmidt, P., 1995. Efficient estimation of models for dynamic panel data. Journal of Econometrics 68, 5–27.
Altonji, J., Segal, L., 1996. Small-sample bias in GMM estimation of covariance structures. Journal of Business & Economic Statistics 14, 353–366.
Anatolyev, S., 2005. GMM, GEL, serial correlation, and asymptotic bias. Econometrica 73, 983–1002.
Andrews, D., 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59.
Andrews, D., 2002. Higher-order improvements of a computationally attractive k-step bootstrap for extremum estimators. Econometrica 70, 119–162.
Andrews, D., Monahan, J., 1992. An improved heteroskedasticity and autocorrelation consistent covariance matrix estimator. Econometrica 60, 953–966.
Arellano, M., Bond, S., 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58, 277–297.
Barbe, P., Bertail, P., 1995. The Weighted Bootstrap. Springer-Verlag, Berlin.
Berkowitz, J., Kilian, L., 2000. Recent developments in bootstrapping time series. Econometric Reviews 19, 1–48.
Bravo, F., 2005. Blockwise empirical entropy tests for time series regressions. Journal of Time Series Analysis 26, 185–210.
Brown, B., Newey, W., 2002. Generalized method of moments, efficient bootstrapping, and improved inference. Journal of Business & Economic Statistics 20, 507–517.
Bühlmann, P., Künsch, H., 1999. Block length selection in the bootstrap for time series. Computational Statistics & Data Analysis 31, 295–310.
Carlstein, E., 1986. The use of subseries methods for estimating the variance of a general statistic from a stationary time series. The Annals of Statistics 14, 1171–1179.
Christiano, L., Haan, W., 1996. Small-sample properties of GMM for business-cycle data. Journal of Business & Economic Statistics 14, 309–327.
Clark, T., 1996. Small-sample properties of estimators of nonlinear models of covariance structure. Journal of Business & Economic Statistics 14, 367–373.
Davidson, R., MacKinnon, J.G., 1999. Bootstrap testing in nonlinear models. International Economic Review 40, 487–508.
de Jong, R., Davidson, J., 2000. Consistency of kernel estimators of heteroscedastic and autocorrelated covariance matrices. Econometrica 68, 407–423.
Durrett, R., 2005. Probability: Theory and Examples, third ed. Duxbury Press.
Fitzenberger, B., 1997. The moving blocks bootstrap and robust inference for linear least squares and quantile regressions. Journal of Econometrics 82, 235–287.
Gallant, A., White, H., 1988. A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Blackwell.
Gonçalves, S., White, H., 2002. The bootstrap of the mean for dependent heterogeneous arrays. Econometric Theory 18, 1367–1384.
Gonçalves, S., White, H., 2004. Maximum likelihood and the bootstrap for nonlinear dynamic models. Journal of Econometrics 119, 199–219.
Gonzalez, A., 2007. Empirical Likelihood Estimation in Dynamic Panel Models. Mimeo.
Gregory, A., Lamarche, J., Smith, G.W., 2002. Information-theoretic estimation of preference parameters: macroeconomic applications and simulation evidence. Journal of Econometrics 107, 213–233.
Hahn, J., 1996. A note on bootstrapping generalized method of moments estimators. Econometric Theory 12, 187–197.
Hall, P., Horowitz, J., 1996. Bootstrap critical values for tests based on generalized-method-of-moments estimators. Econometrica 64, 891–916.
Hall, P., Mammen, E., 1994. On general resampling algorithms and their performance in distribution estimation. Annals of Statistics 22, 2011–2030.
Hansen, L., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Hansen, L., Singleton, K.J., 1982. Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica 50, 1269–1286.
Härdle, W., Horowitz, J., Kreiss, J., 2003. Bootstrapping methods for time series. International Statistical Review 71, 435–459.
Hoeffding, W., 1951. A combinatorial central limit theorem. Annals of Mathematical Statistics 22, 558–566.
Hong, H., Scaillet, O., 2006. A fast subsampling method for nonlinear dynamic models. Journal of Econometrics 133.
Imbens, G., Spady, R., Johnson, P., 1998. Information theoretic approaches to inference in moment condition models. Econometrica 66, 333–357.
Inoue, A., Shintani, M., 2006. Bootstrapping GMM estimators for time series. Journal of Econometrics 133, 531–555.
Kitamura, Y., 1997. Empirical likelihood methods with weakly dependent processes. The Annals of Statistics 25, 2084–2102.
Kitamura, Y., 2007. Empirical likelihood methods in econometrics: theory and practice. In: Advances in Economics and Econometrics: Theory and Applications, Ninth World Congress, vol. III. Cambridge University Press, Cambridge, pp. 174–237.
Kitamura, Y., Stutzer, M., 1997. An information-theoretic alternative to generalized method of moments estimation. Econometrica 65, 861–874.
Kocherlakota, N., 1990. On tests of representative consumer asset pricing models. Journal of Monetary Economics 25, 43–48.
Künsch, H., 1989. The jackknife and the bootstrap for general stationary observations. The Annals of Statistics 17, 1217–1261.
Lahiri, S., 1999. Theoretical comparisons of block bootstrap methods. Annals of Statistics 27, 384–404.
Lahiri, S., 2003. Resampling Methods for Dependent Data. Springer.
Mason, D., Newton, M., 1992. A rank statistics approach to the consistency of a general bootstrap. Annals of Statistics 20, 1611–1624.
Newey, W., Smith, R., 2004. Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72, 219–256.
Newey, W., West, K., 1994. Automatic lag selection in covariance matrix estimation. Review of Economic Studies 61, 631–654.
Owen, A., 1990. Empirical likelihood ratio confidence regions. Annals of Statistics 18, 90–120.
Politis, D., Romano, J., 1994. Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics 22.
Politis, D., Romano, J., 1995. Bias-corrected nonparametric spectral estimation. Journal of Time Series Analysis 16.
Politis, D., Romano, J., Wolf, M., 1999. Subsampling. Springer, New York.
Politis, D., White, H., 2004. Automatic block-length selection for the dependent bootstrap. Econometric Reviews 23, 53–70.
Qin, J., Lawless, J., 1994. Empirical likelihood and general estimating equations. The Annals of Statistics 22, 300–325.
Ramalho, J., 2006. Bootstrap bias-adjusted GMM estimators. Economics Letters 92, 149–155.
Ruge-Murcia, F., 2007. Methods to estimate dynamic stochastic general equilibrium models. Journal of Economic Dynamics and Control 31, 2599–2636.
Ruiz, E., Pascual, L., 2002. Bootstrapping financial time series. Journal of Economic Surveys 16, 271–300.
Smith, R., 1997. Alternative semi-parametric likelihood approaches to generalized method of moments estimation. The Economic Journal 107, 503–519.
Zvingelis, J., 2003. On bootstrap coverage probability with dependent data. In: Computer-Aided Econometrics. Marcel Dekker, New York, pp. 69–90.
Journal of Econometrics 161 (2011) 122–128

Tighter bounds in triangular systems

Sung Jae Jun, Joris Pinkse, Haiqing Xu

Center for Auctions, Procurements and Competition Policy, Department of Economics, The Pennsylvania State University, United States

Article history: Received 1 February 2010; received in revised form 7 October 2010; accepted 15 November 2010; available online 6 December 2010.

JEL classification: C14; C30; C31.

Abstract: We study a nonparametric triangular system with (potentially discrete) endogenous regressors and nonseparable errors. Like in other work in this area, the parameter of interest is the structural function evaluated at particular values. We impose a global exclusion and exogeneity condition, in contrast to Chesher (2005), but develop a rank condition which is weaker than Chesher's. The alternative rank condition can be satisfied for binary endogenous regressors, and it often leads to an identified interval tighter than Chesher (2005)'s minimum length interval. We illustrate the potential of the new rank condition using the Angrist and Krueger (1991) data. © 2010 Elsevier B.V. All rights reserved.

Keywords: Nonparametric triangular systems; Control variables; Weak monotonicity; Partial identification; Instrumental variables; Rank conditions
1. Introduction

The primary objective of our paper is to obtain identification results – that are stronger than those currently available in the literature under alternative conditions – for the nonparametric triangular model

$$y = g(x, u), \qquad x = h(z, v), \tag{1}$$

where $y \in S_y \subset \mathbb{R}$, $x \in S_x \subset \mathbb{R}^d$, $z \in S_z \subset \mathbb{R}^{d_z}$ are observables, $g, h$ are unknown functions, and $u \in U = (0,1]$, $v \in V \subseteq U^d$ are errors. We refer to $x$ as endogenous regressors and $z$ as instruments and use bold face symbols to denote random variables and regular face symbols for the (nonrandom) values that the corresponding random variable can take. Like in Chesher (2005), the regressors need not be continuous and the objective is identification of the object

$$\psi^* = \psi(x^*, \tau^*, v^*) = g\bigl(x^*, Q_{u|v}(\tau^*|v^*)\bigr) \tag{2}$$

for given values of $(\tau^*, x^*, v^*) \in U \times S_x \times V$, where $Q_{u|v}(\tau|v) = \inf\{u : P[u \le u | v = v] \ge \tau\}$. If, for the sake of intuition, one attaches the labels 'earnings' to $y$, 'education' to $x$, 'demographics' to $z$, 'talent' to $v$, and '(market) success' to $u$, then $\psi(x^*, 0.5, 0.5)$ can be interpreted as the (counterfactual) earnings of someone with median success and median talent if she were given education $x^*$.¹ Identification of marginal effects such as $\psi(x^*, 0.5, 0.5) - \psi(x^{**}, 0.5, 0.5)$ naturally follows from the identification of $\psi(x^*, 0.5, 0.5)$ and $\psi(x^{**}, 0.5, 0.5)$.

A model similar to (1) was studied in Chesher (2003) and Chesher (2005). Chesher (2003) used a strict monotonicity assumption, excluding discrete-valued $x$, to identify the partial derivatives of $g$ with respect to $x$. Ma and Koenker (2006) and Jun (2009) proposed a parametric and a semiparametric estimator of Chesher's 2003 model, respectively. Chesher (2005) is more closely related to our paper in the sense that $x$ is allowed to be discrete and that the object of interest is also $\psi^*$. The object of estimation in Newey et al. (1999) and Pinkse (2000) is $g(x^*, E(u|v = v^*))$, which is similar to $\psi^*$, but in those papers the errors are assumed additively separable in both equations in (1).

¹ We use the term 'success' to emphasize the potential dependence between $u$ and $v$.
In Chesher (2003) regressors are assumed to be continuous, in which case point identification of the partial derivatives of $g$ can be achieved by using strict monotonicity conditions on the second argument of $g$ and $h$. However, when $x$ is discrete, as in the example of years of schooling, strict monotonicity cannot hold. In Chesher (2005) (identified) bounds are obtained for $\psi^*$ under weak monotonicity, a dependence condition on $u$ and $v$, and 'local exclusion' and 'local exogeneity' conditions on the instrument $z$.² We present our results under 'global' rather than local conditions, i.e. we impose a global exclusion restriction ($z$ does not enter $g$) and assume that $z$ is independent of $(u, v)$. Global conditions are stronger than local ones, but we note that those conditions are not testable and that global conditions are more common in multi-equation models.³ Further, our global conditions allow us to replace the rank condition in Chesher (2005) ($R$) with an alternative, weaker, rank condition ($R^*$) which allows for the construction of bounds on $\psi^*$ tighter than those obtained in Chesher (2005).⁴ Moreover, in the case of binary regressors $R$ is never satisfied, but the condition $R^*$ developed in this paper usually holds and in some cases leads to point identification of $\psi^*$; the example that we provide exploits continuous variation in $z$. A more precise and detailed discussion follows in the next section. Section 3 contains an empirical example illustrating the difference between $R$ and $R^*$.

Results similar to those developed in this paper can in principle be established under local conditions, also. However, obtaining much tighter bounds under conditions that are meaningfully different from the global ones results in conditions that are exceedingly difficult to interpret; see Jun et al. (2009), which is available on our website. Chesher (2005) establishes that his bounds are tight in a point identification example; our paper does not provide insights as to whether or not Chesher's bounds under his conditions are tight more generally.

Alternatively, one can conduct the analysis conditional on a subset $S_z^*$ of instrument values. Our results go through without modification provided that all conditions and results are interpreted conditionally on $z \in S_z^*$. Since conditioning on $S_z^*$ amounts to throwing away information, doing so generally yields wider bounds than if global exclusion/exogeneity can be assumed to hold without such conditioning. But it is weaker than global exclusion/exogeneity and it does allow for instruments to enter the $g$-function directly, albeit subject to the strong condition that the $g$-function value is the same for all $z \in S_z^*$.

The methodology developed in this paper can be applied in other settings. For instance, Jun et al. (2010) provide an extension of the identification method proposed in this paper, which is applied to the model of Vytlacil and Yildiz (2007); the Vytlacil–Yildiz results for binary endogenous regressors are extended to cover discrete endogenous regressors that can take more than two values, and their support restrictions are relaxed in the case of binary regressors.

The proof of our theorem relies on an inversion of the conditional distribution function $\Pi(y|x^*, v^*) = P[y \le y | x = x^*, v = v^*]$. Therefore, our methodology can be used to derive bounds on other functionals of $\Pi$, such as the mean $\delta(x^*, v^*) = E[y | x = x^*, v = v^*]$, which is in fact the quantity of interest in Manski and Tamer (2002). There are certain similarities between Manski and Tamer (2002) and Chesher (2005), and indeed our paper. There are, however, several differences besides the difference in object of interest (mean versus quantile) noted earlier. First, the primary objective in Manski and Tamer (2002) is estimation, whereas in Chesher (2005) and here it is identification. Second, in Manski and Tamer (2002) upper and lower bounds $(v_0^{MT}, v_1^{MT})$ on $v$ are assumed to be available, while Chesher (2005) provides conditions (involving instrumental variables) under which such bounds are available and can be used. Finally, if the Manski and Tamer (2002) bounds were used in the quantile context, the bounds obtained after inversion of $\Pi$ would be the ones obtained by Chesher (2005), not the ones provided in this paper; see Appendix C.2 for details.

Although we only provide identification results in this paper, the identification approach here can be implemented in practice. We are currently developing an estimator for $\psi^*$ in a separate paper. This estimator assumes the existence of continuous instruments, which we do not assume for our identification result in the present paper. Developing an estimator which takes full advantage of the weakest set of identification results contained in this paper could be challenging.

Our paper is organized as follows. Section 2 contains the main results established in this paper. In Section 3 we illustrate our proposal using the Angrist and Krueger (1991) data set.

2. Main results

2.1. Assumptions

Consider again the model in (1). The objective remains to find identifiable bounds on $\psi^*$ defined in (2) for given values of $\tau^*, x^*, v^*$. We make the following assumptions.

Assumption A. $u, v_1, \dots, v_d$ have (marginal) $U(0,1]$-distributions.

Assumption B. $g$ is nondecreasing in $u$ for all values of $x$, and $h(z, v) = [h_1(z, v_1), \dots, h_d(z, v_d)]^T$, where $h_j$ is nondecreasing and left-continuous in $v_j$ for all values of $z$, for $j = 1, \dots, d$.

Assumption C. $u, v$ are independent of $z$.

Assumption D. $u$ is positive regression dependent on $v$, i.e. $Q_{u|v}(\tau|v)$ is nondecreasing in $v$ for all values of $\tau$.

Assumption E. $Z(x^*, v^*) = \{z \in S_z : h(z, v^*) = x^*\}$ is nonempty.

Given that $g, h$ are unknown, the distributional conditions in Assumption A plus the weak monotonicity and left-continuity conditions in Assumption B by themselves amount to normalizations; see Appendix C.1. The assumption that only one error enters each $h_j$-equation is general, since no dependence conditions are imposed between the $v_j$'s. Assumptions A and B do become restrictive, however, when paired with the positive regression-dependence condition in Assumption D. Assumption C is restrictive, as was discussed in the introduction.

In the context of (1) and imposing Assumption C, the only addition in Assumptions A, B and D over what is assumed in Chesher (2005) is that the direction of monotonicity of $Q_{u|v}(\tau|v)$ is specified. This is innocuous, because the same analysis can be repeated under the assumption of the other direction of monotonicity, after which one can compare the resulting bounds with the bounds based on Assumption D.

Assumption E requires that the type of individual for which bounds are desired exists. If there are no demographic characteristics $z$ that yield an education level $x^*$ for someone of talent $v^*$, then our procedure does not yield meaningful bounds for the earnings of someone with education level $x^*$ and talent $v^*$ for any level of success $\tau$. Assumption E could be restrictive if one conditions on a subset $S_z^*$ of demographic profiles, as discussed in the introduction.

² Starting from $y = \tilde{g}(x, z, u)$, Chesher (2005) assumed that there exist $z_1, z_2 \in S_z$ such that for some $r^*$, $Q_{u|v,z}(\tau^*|v^*, z_1) = Q_{u|v,z}(\tau^*|v^*, z_2) = r^*$ (local exogeneity) and $\tilde{g}(x^*, z_1, r^*) = \tilde{g}(x^*, z_2, r^*)$ (local exclusion).
³ This is true for traditional linear simultaneous equations models, as well as for the bulk of the modern literature, e.g. Imbens and Newey (2009).
⁴ Cases exist in which the bounds are the same.
2.2. Basics
We start by stating a lemma which shows that if $v$ were observable then $\psi^*$ would be directly estimable from the data; $v$ then plays the role of a control variable. The assumptions made above are presumed to hold for all lemmas.

Lemma 1. For all $\tau \in U$, $\psi(x^*, \tau, v^*) = Q_{y|x,v}(\tau|x^*, v^*)$.

Proof. See Appendix A.

Lemma 1 implies that $\psi^*$ can alternatively be interpreted as the $\tau^*$-quantile of the earnings distribution of individuals with education $x^*$ and talent $v^*$. Note that

$$h(z,v) = \begin{bmatrix} h_1(z, v_1)\\ \vdots\\ h_d(z, v_d)\end{bmatrix} = \begin{bmatrix} h_1(z, Q_{v_1}(v_1))\\ \vdots\\ h_d(z, Q_{v_d}(v_d))\end{bmatrix} = \begin{bmatrix} h_1(z, Q_{v_1|z}(v_1|z))\\ \vdots\\ h_d(z, Q_{v_d|z}(v_d|z))\end{bmatrix} = \begin{bmatrix} Q_{x_1|z}(v_1|z)\\ \vdots\\ Q_{x_d|z}(v_d|z)\end{bmatrix}, \tag{3}$$

where the second to fourth equalities follow from Assumptions A–C, respectively. Hence, if the conditional distribution of $x_j$ given $z = z$ is continuous for all $z$, then Lemma 1 implies that $h_j$ is invertible in its second argument and that $v_j = F_{x_j|z}(x_j|z)$, where $F_{x_j|z}$ is the conditional distribution function of $x_j$ given $z$. Therefore, the $v_j$'s that correspond to continuous $x_j$'s can be recovered from the data. For this reason we only discuss the case in which the elements of $x$ are all discrete from here on. Let

$$V_j(x_j, z) = \bigl(P[x_j < x_j | z = z],\, P[x_j \le x_j | z = z]\bigr] \tag{4}$$

for $j = 1, 2, \dots, d$. Then, for any $v \in V(x, z) \equiv V_1(x_1, z) \times \dots \times V_d(x_d, z)$ we have

$$h(z, v) = [Q_{x_1|z}(v_1|z), \dots, Q_{x_d|z}(v_d|z)]^T = x, \tag{5}$$

where the last equality follows from the definition of $V_j(x_j, z)$ in (4). Therefore, $V(x, z)$ is the set of talent levels for which individuals with demographics $z$ achieve education level $x$,

$$V(x, z) = \{v \in U^d : h(z, v) = x\}. \tag{6}$$

Please note that since $V(x, z)$ depends only on (a conditional distribution function of) observables, it is identified for all $(x, z) \in S_x \times S_z$.

2.3. The basic rank condition

Let

$$G^+(x, v) = \{V \in U^d : \exists z \in S_z : V = V(x, z) \ge v\}, \qquad G^-(x, v) = \{V \in U^d : \exists z \in S_z : V = V(x, z) \le v\}, \tag{7}$$

where $V \ge v$ ($V \le v$) means that no vectors in $V$ have elements that are strictly less (greater) than the corresponding element of $v$. Intuitively, for $V(x, z)$ to belong to $G^+(x, v)$, the demographics $z$ must be so unfavorable as to ensure that anyone with demographics $z$ but talent less than $v$ would not be able to achieve education $x$. We now turn to the first of the two rank conditions mentioned in the introduction, namely $R$.

Condition 1 ($R$). Neither $G^+(x^*, v^*)$ nor $G^-(x^*, v^*)$ is empty.

$R$ is due to Chesher (2005) as, under local conditions, is Lemma 3 below. $R$ requires the instrument to be strong enough to ensure that

$$P[x_j \le x_j^* | z = z] \le v_j^* \le P[x_j < x_j^* | z = \tilde{z}], \qquad j = 1, \dots, d, \tag{8}$$

for some $z, \tilde{z} \in S_z$. If $R$ is satisfied, then the result of Lemma 3 below follows almost immediately, using Lemma 2 along the way. Let $Q_{u|v}(\tau|V)$ denote the $\tau$ quantile of the conditional distribution of $u$ given that $v \in V$.

Lemma 2. For all $\tau \in U$ and all $(x, z) \in S_x \times S_z$, if $V(x, z) \ne \emptyset$, then $Q_{y|x,z}(\tau|x, z) = g\{x, Q_{u|v}(\tau|V(x, z))\}$.

Proof. See Appendix A.

Lemma 3. Under $R$ (Condition 1),

$$\sup_{\{z \in S_z : V(x^*,z) \le v^*\}} Q_{y|x,z}(\tau^*|x^*, z) \le \psi^* \le \inf_{\{z \in S_z : V(x^*,z) \ge v^*\}} Q_{y|x,z}(\tau^*|x^*, z), \tag{9}$$

or equivalently

$$\sup_{V \in G^-(x^*,v^*)} g\bigl(x^*, Q_{u|v}(\tau^*|V)\bigr) \le \psi^* \le \inf_{V \in G^+(x^*,v^*)} g\bigl(x^*, Q_{u|v}(\tau^*|V)\bigr). \tag{10}$$

Proof. See Appendix A.

Lemma 3 is a sensible result, which has the following intuition. Find a demographic profile $z$ such that individuals must have talent no less (greater) than $v^*$ to achieve education $x^*$. Individuals with demographics $z$ and education $x^*$ then have a success distribution no less (more) favorable than those of individuals with talent equal to $v^*$, by Assumptions C and D and the same level of education. Hence, by Assumption B, $\psi^*$ must be no less (greater) than the $\tau^*$ quantile of the earnings distribution of individuals with demographics $z$ and education $x^*$. Out of all such profiles $z$, select the one resulting in the tightest upper (lower) bound.

Two problems with $R$ (Condition 1) are that (i) it may not hold and (ii) the classes $G^+(x^*, v^*)$, $G^-(x^*, v^*)$, even when nonempty, may not be large. In fact, $R$ cannot be satisfied if $x^*$ is a scalar and equals the highest or lowest value possible. For instance, if $x$ is binary (college-educated or not), then $V(0, z) = (0, P[x = 0|z = z]]$ and $V(1, z) = (P[x = 0|z = z], 1]$ for all $z$; $V(1, z)$ has upper limit equal to 1 since there is no nontrivial upper bound to the talent of individuals with a college education. For vector-valued $x^*$, the problem is still more severe. Further, note that each value of $z$ generates at most one element in either $G^+(x^*, v^*)$ or $G^-(x^*, v^*)$. This fact, together with the global exogeneity of $z$, suggests that $G^+(x^*, v^*)$ and $G^-(x^*, v^*)$ may be too small; a new rank condition is needed. A more detailed discussion follows in the next subsection.

2.4. The new rank condition

We now develop our new, weaker, rank condition $R^*$. It is based on the idea that the collection $\{V(x^*, z) : z \in S_z\}$ can in fact generate larger classes of sets that are useful for bounding $\psi^*$ than $G^-(x^*, v^*)$ and $G^+(x^*, v^*)$. To be more specific, consider the example of binary $x$ (college-educated or not) again. In this example $V(1, z_1) - V(1, z_2)$ is the set of talent levels for which individuals would attend college with demographics $z_1$ but not with demographics $z_2$. If $V(1, z_1) - V(1, z_2) \le v^*$, then the success distribution of college-educated individuals whose talent is in the range $V(1, z_1) - V(1, z_2)$ is no more favorable than those of college-educated individuals with talent $v^*$. This can be the case even when neither $V(1, z_1) \le v^*$ nor $V(1, z_2) \le v^*$.
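To make the objects in (4)–(7) concrete, the following sketch is a hypothetical numerical example of ours; the probability matrix `P` is made up and is not from the paper. It constructs the intervals $V(x, z)$ from a conditional distribution and checks Condition 1. In this example $G^+$ is empty, so $R$ fails, anticipating the need for $R^*$.

```python
import numpy as np

# Hypothetical conditional probabilities P[x = j | z], one row per value of z
P = np.array([[0.30, 0.40, 0.30],
              [0.20, 0.40, 0.40],
              [0.10, 0.35, 0.55],
              [0.05, 0.30, 0.65]])

def V(x, z):
    """Interval V(x, z) = (P[x < x | z], P[x <= x | z]] from (4)."""
    cdf = np.cumsum(P[z])
    lo = cdf[x - 1] if x > 0 else 0.0
    return (lo, cdf[x])

x_star, v_star = 1, 0.5
# V >= v* iff the lower endpoint is at least v*; V <= v* iff the upper is
G_plus  = [V(x_star, z) for z in range(len(P)) if V(x_star, z)[0] >= v_star]
G_minus = [V(x_star, z) for z in range(len(P)) if V(x_star, z)[1] <= v_star]
print(G_minus, G_plus)   # R (Condition 1) holds iff both lists are nonempty
```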
Fig. 1. Example of how sets are combined.
The situation is illustrated in Fig. 1, in which $x$ is binary, $S_z = \{0, 1, 2, 3\}$ and $K(z) = P[x = 0|z = z]$. In the graphed example, demographic profiles 0 and 1 ensure that all those with a college education must have talent no less than $v^*$, but there is no demographic value that makes individuals with talent no better than $v^*$ attend college. So $R$ is not satisfied; $G^+(1, v^*) = \{V(1,1), V(1,0)\}$ and $G^-(1, v^*) = \emptyset$. Therefore, the method used in Lemma 3 provides an upper bound for $\psi^*$, but it does not provide a meaningful lower bound.

As will be shown below, there is information available for the construction of an upper bound that is not contained in $G^+(1, v^*)$. Indeed, we can also construct an upper bound by looking at the group of individuals who would attend college with $z = 1$ but not with $z = 0$. The bound provided by $V(1,1) - V(1,0)$ may well be tighter than the bound provided by either $V(1,1)$ or $V(1,0)$. Likewise, $V(1,3) - V(1,2)$, the group of talent levels that would result in a college education with $z = 3$ but not with $z = 2$, can be used to construct a lower bound. A more complicated example, involving vector-valued $x$, can be found at the end of this subsection.

Lemma 4 is our starting point. Let $\varphi^*(V) = \varphi(\tau^*, V) = g(x^*, Q_{u|v}(\tau^*|V))$. Our main result, Theorem 1 below, is based on the fact that

$$\varphi^*(V) \ge \psi^* \tag{11}$$

whenever $V \ge v^*$; this is a direct implication of Assumption D. Lemma 4 shows how sets can be combined. Let $\mu(V) = P[v \in V]$ and let $K = \{V \subset U^d : V \ne \emptyset,\ \mu(V)$ and $\varphi(\tau, V)$ are identified for all $\tau \in U\}$.

Lemma 4. For any $V_1, V_2 \in K$: (i) if $V_1 \subset V_2$ and $\mu(V_2 - V_1) > 0$, then $V_2 - V_1 \in K$; (ii) if $V_1 \cap V_2 = \emptyset$ and $\mu(V_1 \cup V_2) > 0$, then $V_1 \cup V_2 \in K$.

Proof. The proof is in Appendix A.

Lemma 4 can be applied to the $V(x,z)$-sets because $\varphi(\tau, V(x,z))$ is identified for all $\tau$ by Lemma 2 and because $\mu(V(x,z)) = P[x = x|z = z]$ is identified.⁵ If one applies either operation described in Lemma 4 to $V_1 = V(x, z_1)$ and $V_2 = V(x, z_2)$, then the resulting set $V_3$ belongs to $K$, and the procedure can be iterated. Doing so ultimately leads to a Dynkin system or $\lambda$-system (Billingsley, 1995, p. 41) of measurable sets. Let $V(x) = \{V : V \ne \emptyset, \exists z \in S_z : V(x, z) = V\}$. In Definition 1 below one can take $D_0 = A = V(x^*)$, $D_1$ to be the collection of sets that contains all sets in $A$ plus all sets that arise when one applies Lemma 4 to all combinations of elements in $A$, $D_2$ to be the collection of all sets in $D_1$ plus all sets that arise when one applies Lemma 4 to all combinations of elements in $D_1$, and so forth. Ultimately, one ends up with $D = D_\infty$.

Definition 1. Let $A$ be a collection of measurable subsets of $U^d$. Then $D = D(A)$ is the collection $D_\infty$ in the following iterative scheme. Let $D_0 = A$. Then for all $t \ge 0$, $D_{t+1}$ consists of all sets $A^*$ such that at least one of the following three conditions is satisfied: (i) $A^* \in D_t$; (ii) $\exists A_1, A_2 \in D_t : A_1 \subset A_2$, $\mu(A_2 - A_1) > 0$, $A^* = A_2 - A_1$; (iii) $\exists A_1, A_2 \in D_t : A_1 \cap A_2 = \emptyset$, $\mu(A_1 \cup A_2) > 0$, $A^* = A_1 \cup A_2$.

We will use $D(x)$ in lieu of $D(V(x))$ to emphasize its dependence on $x$. We are now in a position to state our rank condition. Let

$$J^-(x, v) = \{V \in D(x) : V \le v\}, \qquad J^+(x, v) = \{V \in D(x) : V \ge v\}. \tag{12}$$

Condition 2 ($R^*$). Neither $J^-(x^*, v^*)$ nor $J^+(x^*, v^*)$ is empty.

If one compares $R^*$ to $R$ (Condition 2 to Condition 1), $R^*$ is weaker, since $J^-, J^+$ contain all elements of $G^-, G^+$, respectively. Only in rare circumstances are $R$ and $R^*$ the same.

⁵ Since $V(x, z)$ depends only on the conditional distribution of $x$ given $z$, $V(x, z)$ itself is also identified.
Theorem 1. Under Assumptions A–E, if $R^*$ is satisfied, then

$$\sup_{V \in J^-(x^*,v^*)} g\bigl(x^*, Q_{u|v}(\tau^*|V)\bigr) \le \psi^* \le \inf_{V \in J^+(x^*,v^*)} g\bigl(x^*, Q_{u|v}(\tau^*|V)\bigr), \tag{13}$$

where the bounds are identified.

Proof. The proof is in Appendix B.
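The iterative scheme of Definition 1 and the classes in (12) can be implemented directly when $x$ is binary, since every $V(1,z)$ is an interval $(K(z), 1]$. The sketch below is our own illustration with made-up values of $K(z)$ mirroring Fig. 1; it shows $R^*$ holding (both $J^-$ and $J^+$ nonempty) even though no single $V(1,z)$ lies below $v^*$.

```python
from itertools import combinations

def closure(intervals):
    """Definition 1 specialized to half-open intervals (a, b]: close the
    family under differences of nested intervals sharing an endpoint and
    unions of adjacent (hence disjoint) intervals."""
    D = set(intervals)
    while True:
        new = set(D)
        for I1, I2 in combinations(D, 2):
            for (lo1, hi1), (lo2, hi2) in ((I1, I2), (I2, I1)):
                if lo1 == lo2 and hi1 < hi2:   # (lo,hi2] - (lo,hi1]
                    new.add((hi1, hi2))
                if hi1 == hi2 and lo1 < lo2:   # (lo1,hi] - (lo2,hi]
                    new.add((lo1, lo2))
                if hi1 == lo2:                 # adjacent: take the union
                    new.add((lo1, hi2))
        if new == D:
            return D
        D = new

K = {0: 0.55, 1: 0.60, 2: 0.45, 3: 0.30}   # hypothetical K(z) = P[x=0|z]
v_star = 0.5
D = closure({(K[z], 1.0) for z in K})      # D(1) built from V(1,z) = (K(z),1]
J_minus = sorted(V for V in D if V[1] <= v_star)   # V <= v*
J_plus  = sorted(V for V in D if V[0] >= v_star)   # V >= v*
print(J_minus, J_plus)   # both nonempty: R* holds even though G^- is empty
```

Here $J^-$ contains $(0.30, 0.45] = V(1,3) - V(1,2)$, exactly the lower-bound set described in the Fig. 1 discussion.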
It is instructive to compare the bounds in (13) to those resulting from $R$ in (10). Because $J^-, J^+$ are larger classes than $G^-, G^+$, the bounds in (13) are generally tighter than those in (10), and hence also than those in (9). An example of the difference between the bounds in (10) and (13) arises when we consider Fig. 1; note for instance that $G^-(1, v^*) = \emptyset$, whereas $J^-(1, v^*) = \{V(1,3) - V(1,2)\}$.

The difference between the bounds arising from $R$ and $R^*$ becomes extreme when the instrument has continuous variation. Consider again Fig. 1, but with $S_z = \mathbb{R}$ and for the special case where $h(z, v) = I(v > H(z))$ for all $v, z$, where $H$ is a continuous distribution function. $R$ is as before not satisfied, so it produces no bounds. Now consider $R^*$. Note that $V(0, z) = (0, H(z)]$ and $V(1, z) = (H(z), 1]$. Take $z^* = H^{-1}(v^*)$. If $Q_{u|v}(\tau^*|v)$ is continuous at $v = v^*$, then

$$\lim_{t\to\infty} Q_{u|v}\bigl(\tau^* \,\big|\, V(1, z^* - 1/t) - V(1, z^*)\bigr) = \lim_{t\to\infty} Q_{u|v}\bigl(\tau^* \,\big|\, V(1, z^*) - V(1, z^* + 1/t)\bigr) = Q_{u|v}(\tau^*|v^*),$$

and using $R^*$, $\psi^*$ is then point-identified for $x^* = 1$ (and similarly for $x^* = 0$), since $V(1, z^* - 1/t) - V(1, z^*) \in J^-(1, v^*)$ and $V(1, z^*) - V(1, z^* + 1/t) \in J^+(1, v^*)$ for all $t \ge 1$.

The bounds in Theorem 1 are sharp under the stated conditions.⁶ To see this, note that the inequalities $\varphi^*(V_0) \le \psi^* \le \varphi^*(V_1)$ for all $V_0, V_1 \subset U$ such that $V_0 \le v^* \le V_1$ cannot be improved on, and that $\varphi^*(V)$ is not identified unless it is expressed as a mapping from one of $P, P^2, \dots, P^\infty$ to $\mathbb{R}$, where $P = \{p : \mathbb{R} \to [0,1] : p(y) = P[y \le y | x = x^*, z = z]$ for some $z \in S_z\}$. Because of the limitations of local identification conditions, Chesher's (2005) analysis is restricted to using $P$ only, but Assumption C enables us to use all of $P, P^2, \dots, P^\infty$.

As mentioned earlier, we conclude with an example for vector-valued $x$. In our example $x$ contains two binary variables: a college education dummy and a field of specialization dummy (marketable or not); see Figs. 2 and 3. For $x^* = (1,1)$, the $V(x^*, z)$-sets are rectangles in $U^2$ including $(1,1)$, and $V(x^*,1) - V(x^*,0)$ is simply the difference between two such rectangles. $V(x^*,0)$, $V(x^*,1)$, $V(x^*,1) - V(x^*,0)$ can all be used to construct upper bounds. But to obtain a lower bound at all, one must use at least four $V(x^*, z)$-rectangles, as indicated in the second graph in Fig. 2. The rectangle below $v^*$ is the collection of $v = (v_1, v_2)$-points for which $v_1$ is a level of (college) talent sufficient to obtain a college degree with demographics $z = 4$ or $z = 5$ but not with $z = 2$ or $z = 3$, and for which $v_2$ is a level of (marketable field) talent sufficient to have a marketable field of specialization with demographics $z = 3$ or $z = 5$ but not with $z = 2$ or $z = 4$. As Fig. 3 illustrates, finding upper and lower bounds for $x^* = (1,0)$ is easier than finding a lower bound for $x^* = (1,1)$.

Fig. 2. How to obtain upper and lower bounds when $x^* = (1, 1)$.
Fig. 3. How to obtain upper and lower bounds when $x^* = (1, 0)$.

⁶ That is, the bounds cannot be improved without further restrictions.

3. Revisiting Angrist and Krueger (1991)

We now illustrate the difference between $R$ and $R^*$ by using the Angrist and Krueger (1991) data. Angrist and Krueger (1991) estimated a wage equation with years of schooling as an endogenous regressor. They used quarter of birth dummies as instruments. Chesher (2005) concluded that the Angrist and Krueger (1991) instruments do not satisfy $R$ for any value of years of schooling and for any level of talent. In this section we determine whether $R^*$ is satisfied.

Before we proceed, we comment on the plausibility of global exclusion, exogeneity and rank conditions. Exclusion restrictions and invariance restrictions of the distribution of latent variables given instruments are at best partially testable,⁷ so they are usually justified by economic reasoning. For instance, the wage equation in Angrist and Krueger (1991) does not include the birth quarter variables, the exogeneity of which the authors justified by arguing the independence of ability and birth quarters. Rank conditions, on the other hand, are restrictions on the joint distribution of observables. So they are generally testable once exogeneity of instruments is assumed. Potential failure of rank conditions has received more attention than failure of exclusion restrictions, for instance in the weak instrument literature.

To simplify our discussion we define $x$ to be limited to the values $\{0, 1, 2\}$: no more than 6 years, 7–12 years, and more than 12 years of education, respectively. The instrument $z$ equals the quarter of birth (1–4). Table 1 summarizes the effect of the instruments, where we pretend that the estimated probabilities equal the true probabilities.

⁷ For example, by means of a test of overidentifying restrictions.
Table 1
The effect of the birth quarter instrument on the education variable x. Entries for x = 0, 1, 2 are estimated cumulative probabilities P[x ≤ x | z = z]; nz is the number of observations.

x↓ / z→    1         2         3         4
0          0.0317    0.0315    0.0280    0.0270
1          0.6119    0.6038    0.5977    0.5946
2          1.0000    1.0000    1.0000    1.0000
nz         81,671    80,138    86,856    80,844
Because $P[x < x|z = z] < P[x \le x|z = \tilde{z}]$ for all $x, z, \tilde{z}$, $R$ (or equivalently (8)) is not satisfied, irrespective of the value of $x^*$ and $v^*$. The birth quarter instruments are hence too weak for $R$ to be satisfied. $R^*$ is different, however. For instance, consider $x^* = 2$ and $v^* = 0.6$. Since $V(2,3) \subset V(2,4)$ and $V(2,4) - V(2,3) = (0.5946, 0.5977] \le v^* \le V(2,1) = (0.6119, 1.0000]$, $R^*$ is satisfied for $x^* = 2$ and $v^* = 0.6$.

The above example is flawed in two respects. First, we used estimated rather than true probabilities. Note, however, that the sample size is large and that this is just an example to illustrate the possibility of using $R^*$ when $R$ is not satisfied. Second, in the above example $z$ can only take a small number of different values. With more variation in the instrument, the difference in identifying potential of $R$ and $R^*$ increases exponentially.
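The check of $R^*$ in this example reduces to a few interval comparisons. The sketch below is ours; it reproduces the computation from the Table 1 probabilities.

```python
# Cumulative probabilities P[x <= j | z = z] taken from Table 1
cdf = {1: (0.0317, 0.6119, 1.0000), 2: (0.0315, 0.6038, 1.0000),
       3: (0.0280, 0.5977, 1.0000), 4: (0.0270, 0.5946, 1.0000)}

def V(x, z):
    """V(x, z) = (P[x < x | z = z], P[x <= x | z = z]], as in (4)."""
    lo = cdf[z][x - 1] if x > 0 else 0.0
    return (lo, cdf[z][x])

x_star, v_star = 2, 0.6
upper_set = V(x_star, 1)                        # (0.6119, 1.0000] >= v*
# V(2, 3) is nested in V(2, 4); their difference is identified by Lemma 4
lower_set = (V(x_star, 4)[0], V(x_star, 3)[0])  # (0.5946, 0.5977] <= v*
assert lower_set[1] <= v_star <= upper_set[0]   # R* holds at (2, 0.6)
print(lower_set, upper_set)
```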
Acknowledgements

This paper is based on research that was supported by NSF grant SES-0922127. We thank seminar participants at the University of Pittsburgh, the University of British Columbia, the University of Western Ontario, the 2009 North American Econometric Society Summer Meeting, the 2009 Canadian Econometrics Study Group meeting, the conference on quantile regression at University College London/CEMMAP, Andrew Chesher, Rosa Matzkin, Jeff Racine, Peter Robinson, Yuanyuan Wan, the associate editor, and the anonymous referees for helpful comments and suggestions. We thank the Human Capital Foundation for their support of CAPCP. Joris Pinkse is an extramural fellow at CentER, Tilburg University.

Appendix A. Proofs of lemmas

Proof of Lemma 1. By Assumption B, $Q_{y|x,v}(\tau|x^*, v^*) = g(x^*, Q_{u|x,v}(\tau|x^*, v^*))$. Recall that by Assumption E, $Z(x^*, v^*)$ is nonempty. Thus, $\{x = x^*, v = v^*\} \Leftrightarrow \{z \in Z(x^*, v^*), v = v^*\}$, such that $Q_{u|x,v}(\tau|x^*, v^*) = Q_{u|v,z}(\tau|v^*, Z(x^*, v^*))$, which equals $Q_{u|v}(\tau|v^*)$ by Assumption C.

Proof of Lemma 2. By Assumption B, $Q_{y|x,z}(\tau|x, z) = g(x, Q_{u|x,z}(\tau|x, z))$. Since $\{x = x, z = z\} \Leftrightarrow \{v \in V(x, z), z = z\}$ and by Assumption C, $Q_{u|x,z}(\tau|x, z) = Q_{u|v,z}(\tau|V(x, z), z) = Q_{u|v}(\tau|V(x, z))$.

Proof of Lemma 3. We establish the upper bound; the argument for the lower bound is virtually identical. Let $z$ be such that $V(x^*, z) \ge v^*$. Then by Lemma 2,

$$Q_{y|x,z}(\tau^*|x^*, z) = g\{x^*, Q_{u|v}(\tau^*|V(x^*, z))\} \ge g\{x^*, Q_{u|v}(\tau^*|v^*)\} = \psi^*,$$

where the weak inequality follows from Assumption B and the fact that for any $u \in U$,

$$P[u \le u | v \in V(x^*, z)]\, P[v \in V(x^*, z)] = \int_{V(x^*,z)} P[u \le u | v = v]\, dv \le \int_{V(x^*,z)} P[u \le u | v = v^*]\, dv = P[u \le u | v = v^*]\, P[v \in V(x^*, z)],$$

which implies that $Q_{u|v}(\tau^*|V(x^*, z)) \ge Q_{u|v}(\tau^*|v^*)$.

Proof of Lemma 4. We show (i); (ii) follows similarly. Note that for any $y$, by the conditions on $V_1, V_2$,

$$P[g(x^*, u) \le y \,|\, v \in V_2 - V_1] = \frac{P[g(x^*, u) \le y | v \in V_2]\,\mu(V_2) - P[g(x^*, u) \le y | v \in V_1]\,\mu(V_1)}{\mu(V_2 - V_1)}. \tag{14}$$

Now $P[g(x^*, u) \le y | v \in V_j]$ is identified for $j = 1, 2$ and all $y$ because $\varphi(\tau, V_j)$ is identified for $j = 1, 2$ and all $\tau \in U$. Further, since $\mu(V_1)$, $\mu(V_2)$ are identified by assumption, so is $\mu(V_2 - V_1) = \mu(V_2) - \mu(V_1)$. So the left hand side in (14) is identified for all $y$. Invert the conditional distribution function to obtain the conditional quantile.

Appendix B. Proof of theorem

Proof of Theorem 1. Recall that $V(x^*) \subset K$ by the discussion following Lemma 4. Therefore, $D(x^*) \subset K$ by Lemma 4 and by construction of $D(x^*)$. Combining this with (11) (and its converse when $V \le v^*$) concludes the proof.

Appendix C. Miscellaneous

C.1. Normalization

Consider an arbitrary function $g^*(x, u^*)$ with arbitrarily distributed $u^*$, where $g^*$ is weakly increasing in its second argument. Letting $\omega$ be the quantile function of $u^*$, there exists a uniform random variable $u$ such that $u^* = \omega(u)$. Therefore, $g(x, u) = g^*(x, \omega(u))$ is still weakly increasing in $u$. Since $h$ is weakly increasing in its second argument and $v$ is uniform, we have $h(z, v) = Q_{x|z}(v|z)$. Left-continuity holds because $P[x \le x | z = z]$ is a CADLAG function of $x$ and $Q_{x|z}(v|z) = \inf\{x : P[x \le x | z = z] \ge v\}$.

C.2. Manski and Tamer

Below is a somewhat more detailed discussion of the relationship between Manski and Tamer (2002) and the present paper. Manski and Tamer (2002) assume that $v$ is scalar-valued and that upper and lower bounds $(v_0^{MT}, v_1^{MT})$ on $v$ are observed. They further assume monotonicity of $\delta(x^*, v)$ in $v$ and that $E[y | x, v, v_0^{MT}, v_1^{MT}] = E[y | x, v]$ a.s.

Replacing $y, v_0^{MT}, v_1^{MT}$ with $1(y \le y)$, $P[x < x^*|z]$, $P[x \le x^*|z]$, respectively,⁸ facilitates a comparison of Manski and Tamer (2002) with Chesher (2005) and the current paper. The availability of instruments in the triangular model provides us with more structure to be exploited. To simplify the discussion, suppose that both $P[x < x^*|z = z]$ and $P[x \le x^*|z = z]$ are invertible in $z$, such that $(v_0^{MT}, v_1^{MT}) = (P[x < x^*|z = z], P[x \le x^*|z = z])$ if and only if $z = z$. Then, equation (A3) in the proof of proposition 1 in Manski and Tamer (2002) would become

$$P\bigl[y \le y \,\big|\, x = x^*, v = P[x < x^*|z = z]\bigr] \le P[y \le y | x = x^*, z = z] \le P\bigl[y \le y \,\big|\, x = x^*, v = P[x \le x^*|z = z]\bigr]. \tag{15}$$

Note here that the monotonicity assumption of Manski and Tamer (2002)⁹ in combination with the inversion of the distribution functions in (15) leads to the quantile version of proposition 1 of Manski and Tamer (2002), which coincides with the bounds in (9) that our main theorem, Theorem 1, improves upon.

⁸ $1$ is the indicator function.
⁹ That is, $E[1\{y \le y\} | x = x^*, v = v]$ is weakly decreasing in $v$.
References

Angrist, Joshua D., Krueger, Alan B., 1991. Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics 106, 979–1014.
Billingsley, Patrick, 1995. Probability and Measure, third ed. Wiley, New York.
Chesher, Andrew, 2003. Identification in nonseparable models. Econometrica 71 (5), 1405–1441.
Chesher, Andrew, 2005. Nonparametric identification under discrete variation. Econometrica 73, 1525–1550.
Imbens, G.W., Newey, W.K., 2009. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica 77 (5), 1481–1512.
Jun, Sung Jae, 2009. Local structural quantile effects in a model with a nonseparable control variable. Journal of Econometrics 151, 82–97.
Jun, Sung Jae, Pinkse, Joris, Xu, Haiqing, 2009. Tighter bounds in triangular models: supplementary material. The Pennsylvania State University.
Jun, Sung Jae, Pinkse, Joris, Xu, Haiqing, 2010. Discrete endogenous variables in weakly separable models. Penn State.
Ma, Lingjie, Koenker, Roger, 2006. Quantile regression methods for recursive structural equation models. Journal of Econometrics 134, 471–506.
Manski, C.F., Tamer, E., 2002. Inference on regressions with interval data on a regressor or outcome. Econometrica 70 (2), 519–546.
Newey, Whitney K., Powell, James L., Vella, Francis, 1999. Nonparametric estimation of triangular simultaneous equations models. Econometrica 67 (3), 565–603.
Pinkse, Joris, 2000. Nonparametric two-step regression estimation when regressors and error are dependent. Canadian Journal of Statistics 28 (2), 289–300.
Vytlacil, Edward, Yildiz, Neşe, 2007. Dummy endogenous variables in weakly separable models. Econometrica 75 (1), 757–779.
Journal of Econometrics 161 (2011) 129–146

Instrumental variable methods for recovering continuous linear functionals

Andres Santos

Department of Economics, University of California, San Diego, 9500 Gilman Drive MS0508, La Jolla, CA 92093-0508, United States

Article history: Received 9 December 2008; received in revised form 27 July 2010; accepted 23 November 2010; available online 5 December 2010.

JEL classification: C13; C14; C21.

Abstract: This paper develops methods for estimating continuous linear functionals in nonparametric instrumental variable problems. Examples of such functionals include consumer surplus and weighted average derivatives. The estimation procedure is robust to a setting where the underlying model is not identified but the linear functional is. In order to attain such robustness, it is necessary to employ a partially identified nuisance parameter. We address this problem by consistently estimating a unique element of the identified set for nuisance parameters, which we then use to construct a $\sqrt{n}$ asymptotically normal estimator for the desired linear functional. Published by Elsevier B.V.

Keywords: Instrumental variables; Partial identification
1. Introduction

Numerous estimation problems in microeconometrics have encountered the challenge of endogenous regressors. The underlying structural relations often imply that the estimated models do not fit the classical regression framework but are instead of the form:

$$Y = m_0(X) + \epsilon, \tag{1}$$

where $E[\epsilon|X] \ne 0$ but $E[\epsilon|Z] = 0$ for some instrument $Z$. The parametric analysis of $m_0$ through instrumental variables (IV) is well understood. Unfortunately, the extension of such procedures to a more robust nonparametric framework has encountered a number of difficulties. As discussed in Newey and Powell (2003), the nonparametric identification of $m_0$ requires the availability of an instrument satisfying far more stringent conditions than the usual covariance restrictions of the linear model.

A lack of identification of $m_0$, however, does not preclude interesting characteristics of the model from being identified. Severini and Tripathi (2006, 2010), for example, argue that certain linear functionals of $m_0$ will be identified even when $m_0$ is not. In this paper, we develop methods for the $\sqrt{n}$ estimation of such functionals without requiring $m_0$ to be identified. For $\mathcal{X}$ the support of $X$ and $\nu : \mathcal{X} \to \mathbb{R}$ a known function, we study functionals of the form:

$$\langle \nu, m \rangle \equiv \int_{\mathcal{X}} \nu(x) m(x)\, dx, \tag{2}$$

which include as special cases consumer surplus and weighted average derivatives. Severini and Tripathi (2010) show, under mild assumptions, that a necessary condition for $\langle \nu, m_0 \rangle$ to be identified and estimable at a $\sqrt{n}$ rate is the existence of a function $\theta_0$ of the instrument $Z$ such that

$$E[\theta_0(Z)|x] = \frac{\nu(x)}{f_X(x)}, \tag{3}$$

where $f_X$ is the density of $X$. Otherwise, Severini and Tripathi (2010) establish, the semiparametric efficiency bound for $\langle \nu, m_0 \rangle$ is infinite. It is important to emphasize that condition (3) must hold for $\langle \nu, m_0 \rangle$ to be $\sqrt{n}$ estimable, regardless of whether $m_0$ is identified or not. A $\sqrt{n}$-asymptotically normal estimator of $\langle \nu, m_0 \rangle$ must therefore (i) assume either explicitly or implicitly that (3) holds or (ii) further restrict the model so the semiparametric efficiency bound is finite. For this reason, we base our estimator on the necessary condition for $\sqrt{n}$ estimability in (3) and avoid unnecessary assumptions on the identification of $m_0$. As a result, our estimator will be robust to a possible lack of identification of the underlying model $m_0$.

The identification of $\langle \nu, m_0 \rangle$ is guaranteed when (3) holds, since for any function $m : \mathcal{X} \to \mathbb{R}$ that agrees with the exogeneity assumption on the instrument we must have:

$$\langle \nu, m \rangle = E[m(X)\theta_0(Z)] + E[(Y - m(X))\theta_0(Z)] = E[Y\theta_0(Z)]. \tag{4}$$

Moreover, Eq. (4) suggests that a natural estimator for $\langle \nu, m_0 \rangle$ is given by the sample analogue

$$\widehat{\langle \nu, m_0 \rangle} \equiv \frac{1}{n}\sum_{i=1}^{n} Y_i \hat{\theta}_0(Z_i), \tag{5}$$
130
A. Santos / Journal of Econometrics 161 (2011) 129–146
where θˆ0 is a consistent estimator of θ0 . Unfortunately, in most applications the nuisance parameter θ0 will not be identified. Nonparametric identification requires the existence of a unique solution to the integral equation in (3). Equivalently, identification necessitates that there be no nonzero function ψ such that E [ψ(Z )|x] = 0. Intuitively, for θ0 to be identified the regressor X must be able to detect all forms of variation in the instrument Z . This requirement is problematic, as in most instances instruments posses a variation that is unrelated to the endogenous regressor. Hence, a general estimation procedure should be robust to the lack of identification of both m0 and θ0 . The lack of identification of θ0 presents important technical √ challenges, but it does not hinder the identification or n estimability of ⟨ν, m0 ⟩. Any solution to (3) provides a valid nuisance parameter for recovering ⟨ν, m0 ⟩ through (4). We therefore do not assume that θ0 is identified and instead obtain an estimator for a unique element of the set of solutions to (3). This set of solutions constitutes an identified set which we denote by:
ν(x) , Θ0 ≡ θ ∈ Θ : E [θ (Z )|x] = fX (x)
(6)
where Θ is a nonparametric set of functions. We obtain a consistent estimator for a unique element of Θ0 in two steps. First, results in Chernozhukov et al. (2007) are generalized to ˆ 0 for Θ0 . arbitrary metric spaces to obtain a consistent estimator Θ In addition, extending results in Ai and Chen (2003), it is possible to
ˆ 0 converges to Θ0 at a op n− 14 rate with respect establish that Θ
to an appropriate Hausdorff pseudo-norm. As a second step, we recover a unique element θ0 ∈ Θ0 by carefully choosing a unique ˆ 0 . This procedure is analogous to a classical Melement θˆ0 ∈ Θ estimation problem where the domain Θ0 is unknown but instead ˆ 0 . These results provide a general useful technique estimated by Θ for recovering nonparametric nuisance parameters when they are not identified and may be of independent interest. This paper is highly complementary to previous work in Severini and Tripathi (2006, 2010). The authors are the first to explore conditions for the identification of ⟨ν, m0 ⟩ when m0 is not identified and to derive efficiency bounds for its estimation. They, however, provide no estimation procedures. Ai and Chen (2003, 2007) and Darolles et al. (2003) derive asymptotically normal estimators for ⟨ν, m0 ⟩ that assume m0 is identified. Within the larger nonparametric IV literature, Newey and Powell (2003) and Hall and Horowitz (2005) propose consistent estimators for m0 , while Horowitz (2007) and Gagliardini and Scaillet (2008) derive the asymptotic distribution for estimators of m0 . Santos (2007) proposes test statistics for inference when the model is partially identified. Ai and Chen (2003) and Blundell et al. (2007) examine the properties of a semiparametric specification. In related work, Newey et al. (1999), Chesher (2003, 2005, 2007), Imbens and Newey (2009) and Schennach et al. (2007) explore estimation and identification in triangular systems, while Chalak and White (2006) and White and Chalak (2006) study the identification of causal effects. This paper is also related to the vast partial identification literature that explores the limits of inference without identification. See Manski (1990, 2003) and references within. The remainder of the paper is organized as follows. Section 2 provides examples of interesting choices for ν while Section 3 develops a consistent estimator for a unique element of the identified set Θ0 . In Section 4 we employ such an estimator of the nuisance parameter to obtain an asymptotically normal estimator of ⟨ν, m0 ⟩. The small sample performance of the estimator is analyzed in a Monte Carlo study in Section 5. Section 6 briefly concludes. All proofs are contained in the Appendix.
2. Examples The canonical example of a function ν for which the functional ⟨ν, m0 ⟩ is of interest, is that of consumer surplus, as the following example illustrates: Example 2.1. Let q denote quantity and m0 : R+ → R be an inverse demand function. Suppose we are interested in estimating approximate consumer surplus at a market clearing price and quantity (pc , qc ). Clearly, if we let ν(q) = 1{0 ≤ q ≤ qc }, we can immediately obtain qc
∫
m0 (q)dq − pc qc = ⟨ν, m0 ⟩ − pc qc
(7)
0
and hence if (pc , qc ) are observable, then ⟨ν, m0 ⟩ is the object of interest. This simple example can readily be extended to allow for an inverse demand function that depends on both qq and covariates X . If X is discrete valued, then we may estimate 0 c m0 (q, x0 )dq at some fixed point x0 . Otherwise, if X is continuously distributed and → R is some specified weight function, then we can recover w :X qc m0 (q, x)w(x)dqdx. X 0 Proper selection of the function ν also allows us to estimate weighted average derivatives of m0 by examining functionals of the form ⟨ν, m0 ⟩. (s)
Example 2.2. Let m0 : [0, 1] → R and m0 (x) denote the sth derivative of m0 evaluated at x. Suppose the object of interest is the weighted average derivative given by 1
∫
(s)
m0 (x)w(x)dx
(8)
0
for some specified weight function w . If w is s times differentiable and in addition satisfies w (k) (0) = w (k) (1) = 0 for all 0 ≤ k ≤ (s − 1), then integration by parts yields: 1
∫
(s)
m0 (x)w(x)dx = (−1)(s)
1
∫
m0 (x)w (s) (x)dx.
(9)
0
0
Hence, the weighted average derivative in (8) may be expressed in the form ⟨ν, m0 ⟩ by letting ν(x) = (−1)(s) w (s) (x). As a final example, we show functionals of the form ⟨ν, m0 ⟩ may also be employed to recover the parametric component of a semiparametric specification. Example 2.3. Let X = (X1 , X2 ) with X1 ∈ R, X2 ∈ R and X ≡ [0, 1]2 . Suppose we are interested in the semiparametric specification m0 (x1 , x2 ) = x1 β0 + r0 (x2 ) for some constant β0 and integrable function r0 (x2 ). By selecting ν(x1 , x2 ) = 12 x1 − 12 , we then obtain by direct calculation: 1
∫
1
∫
⟨ν, m0 ⟩ =
12 x1 −
0
0
1 2
(x1 β0 + r0 (x2 ))dx1 dx2 = β0 . (10)
It is important to note that, unlike in Examples 2.1 and 2.3, the results in Severini and Tripathi (2010) do not apply as the parameter space has been restricted to be semiparametric.√In particular, condition (3) is not a necessary condition for the n estimability of β0 . 3. Nuisance parameter
The principal challenge in constructing ⟨ν, m0 ⟩ consists in producing a consistent estimator θˆ0 for a unique element θ0 of the identified set Θ0 . Intuitively, if we could observe Θ0 , then we
A. Santos / Journal of Econometrics 161 (2011) 129–146
would simply select a unique element θ0 from it and employ it to estimate ⟨ν, m0 ⟩. The set Θ0 , however, is itself identified and can therefore be estimated. Hence, a natural estimation strategy is:
ˆ 0 for the set Θ0 . (S.1) Construct a consistent estimator Θ ˆ 0 in such a way as to ensure it converges to a (S.2) Select θˆ0 ∈ Θ unique element θ0 ∈ Θ0 . This section develops the theory necessary to show that such an approach is valid. In order to achieve (S.1), we generalize results in Chernozhukov et al. (2007) to arbitrary metric spaces and construct a consistent estimator for Θ0 as well as obtain its rate of convergence. To carry out (S.2), we adapt the theory of extremum estimators to problems where the parameter space is unknown but consistently estimated. In this way we are able to show that if M : Θ → R is a population criterion function attaining a unique minimum on Θ0 , then the minimizer of a sample analogue ˆ 0 satisfies (S.2). Mn : Θ → R over the estimated set Θ 3.1. Criterion functions The identified set can be characterized as the set of minimizers to a criterion function, as in Chernozhukov et al. (2007) and Romano and Shaikh (2010). In particular, for Q (θ) ≡ E [(E [ν(X ) − θ (Z )fX (X )|X ])2 ],
(11)
the set Θ0 , as defined in (6), agrees with the set of zeros of the function Q : Θ → R:
Θ0 = {θ ∈ Θ : Q (θ ) = 0}.
(12)
We require Θ to be a smooth set of functions, which both ensures consistency and the uniform behavior of the empirical process on the parameter space. In particular, we assume Θ is bounded in the Sobolev norm ‖ · ‖∞,α . To define ‖ · ‖∞,α , let Z ∈ Rdz and λ be a dz ∑dz dimensional vector of nonnegative integers. Denote |λ| = i =1 λ i λ
λd
and let Dλ θ (z ) = ∂ |λ| θ (z )/∂ z1 1 . . . ∂ zdz z . For α ∈ R, α the greatest integer smaller than α and Z the support of Z , the norm ‖ · ‖∞,α is then given by:
‖θ‖∞,α
|Dλ θ (z ) − Dλ θ (z ′ )| ≡ max sup |D θ (z )| + max sup . (13) |λ|≤α z ∈Z |λ|=α z ̸=z ′ ‖z − z ′ ‖α−α λ
A function θ with ‖θ ‖∞,α < ∞ has partial derivatives up to order α uniformly bounded, and partial derivatives of order α Lipschitz of order α − α . If Θ is bounded in ‖ · ‖∞,α , then these properties hold uniformly in θ ∈ Θ . Estimation of Θ0 is equivalent to estimation of the zeros of the criterion function Q . In parametric models, Chernozhukov et al. (2007) show the latter can be accomplished by employing the approximate minimizers of a sample analogue Qn . Adapting their results to a nonparametric framework, we let Θn be a sieve for Θ and define the estimator:
ˆ 0 ≡ {θ ∈ Θn : Qn (θ ) ≤ bn /an } Θ
(14)
for bn /an ↘ 0 at an appropriate rate. The requirement bn /an ↘ 0 ˆ 0 only includes those θn ∈ Θn that implies that in the limit Θ approximate elements of θ0 ∈ Θ0 well. On the other hand, if ˆ 0 includes all such bn /an ↘ 0 slowly enough, then we can ensure Θ θn . The requirements on the rate at which bn /an ↘ 0 are of course different than in the parametric case. The construction of an appropriate sample analogue Qn requires the use of nonparametric estimators for both conditional expectations and the density of X . We estimate conditional expectations using a standard series approach. Assume X ∈ Rdx and let {pj (·)}∞ j=1 be a sequence of known approximating functions. We denote the vector of the first kn terms in the basis by pkn (x) =
131
(p1 (x), . . . , pkn (x)) and let P = (pkn (X1 ), . . . , pkn (Xn ))′ . For an i.i.d. sample {Wi }ni=1 from a random variable W ∈ R, the nonparametric estimator Eˆ [W |x] for E [W |x] is then given by the linear regression of the vector (W1 , . . . , Wn ) on the matrix P; that is: ′
Eˆ [W |x] ≡ pkn (x)(P ′ P )−1
n −
pkn (Xi )Wi .
(15)
i=1
This series estimator is studied in Newey (1997), Huang (1998, 2003) and Ai and Chen (2003). For the nonparametric estimator of fX , we follow Hall and Horowitz (2005) and employ product generalized kernel estimators to allow for compact support of X . In particular, if X ∈ Rdx and X (k) denotes the kth coordinate of X , then the estimator of fX (Xi ) is given by: fˆX (Xi ) ≡
1
dx −∏
(n − 1)hdx
j̸=i k=1
Kh
(k)
Xi
− Xj(k) h
(k)
, Xi
.
(16)
The dependence of Kh (u, t ) on t accommodates the use of boundary kernels for points t near the boundary of the support. For certain parameter values, it will also be necessary to resort to higher order kernels in order to attain the appropriate rates of convergence. The kernel Kh is of order α if for all t on its domain and all h > 0 the following holds:
∫
Kh (u, t )du = 1
∫
R
uj K h ( u, t ) = 0
if 1 ≤ j ≤ α.
(17)
R
The specific assumptions on the kernel Kh and the bandwidth h are stated in Section 3.3. The criterion Qn is the plug-in version of Q using the above estimators. Specifically, we let Qn (θ ) ≡
n 1−
n i=1
ˆ 2 (Xi , θ ) m
(18)
ˆ (Xi , θ ) ≡ Eˆ [ν(X ) − θ (Z )fˆX (X )|Xi ]. m We show in the next section that for an appropriate choice of bn /an , ˆ 0 for the the criterion function Qn produces a consistent estimator Θ identified set Θ0 . 3.2. Set consistency Under regularity conditions to be introduced, it is possible to ˆ 0 for Θ0 under a variety of norms. establish the consistency of Θ We focus on the family of Hausdorff norms, which is defined by: dH (Θ1 , Θ2 , ‖ · ‖) ≡ max{h(Θ1 , Θ2 ), h(Θ2 , Θ1 )} h(Θ1 , Θ2 ) ≡ sup
(19)
inf ‖θ1 − θ2 ‖.
θ1 ∈Θ1 θ2 ∈Θ2
ˆ 0 provides a consistent estimator for Θ0 under the Hence, Θ Hausdorff norm if both the maximal approximation error of ˆ 0 by Θ0 and of Θ0 by Θ ˆ 0 converges to zero in probability. Θ Unlike the parametric case, however, using different norms for the projections in (19) implies significantly different Hausdorff norms and rates of convergence. As in most two stage estimation problems, the nuisance 1
parameter must be estimated at a rate op n− 4
under a suitable
√
pseudo-norm in order for the second stage to be n consistent. Ai and Chen (2003) show we can focus on the pseudo-norm ‖ · ‖w , which in the present context is given by:
‖θ ‖w ≡
E [(E [θ (Z )|X ])2 fX2 (X )].
(20)
132
A. Santos / Journal of Econometrics 161 (2011) 129–146
Interestingly, the identified set Θ0 is an equivalence class under the pseudo-norm ‖·‖w , since for any θ1 , θ2 ∈ Θ0 we have ‖θ1 −θ2 ‖2w = E [(ν(X ) − ν(X ))2 ] = 0. Consequently, the result:
1
ˆ 0 , Θ 0 , ‖ · ‖ w ) = op n− 4 dH ( Θ
(21)
1
suffices to show that ‖θˆ0 − θ0 ‖w = op n− 4
for any θ0 ∈ Θ0 and
ˆ 0 . This follows, as θˆ0 ∈ Θ ˆ 0 , Θ0 , ‖ · ‖w ). ‖θˆ0 − θ0 ‖w = inf ‖θˆ0 − θ ‖w ≤ dH (Θ
(22)
θ∈Θ0
Exploiting (22), we obtain a rate of convergence for ‖θˆ0 − θ0 ‖w by ˆ 0 , Θ0 , ‖ · ‖ w ) . deriving one for dH (Θ Convergence under the metric dH (·, ·, ‖ · ‖w ), however, is not sufficient for consistently estimating a unique element θ0 ∈ Θ0 . ˆ 0 under a norm that For this purpose, we require consistency of Θ is able to differentiate between elements in Θ0 (unlike the pseudometric ‖ · ‖w ). For ‖ · ‖∞ the usual supremum norm ‖θ‖∞ ≡ supz ∈Z |θ (z )|, we therefore additionally show that:
ˆ 0 , Θ0 , ‖ · ‖∞ ) = op (1). dH ( Θ
(23)
ˆ0 We now state assumptions that are sufficient to show that Θ is consistent under the norm dH (·, ·, ‖ · ‖∞ ) and obtaining a rate of convergence under the weaker pseudo-metric dH (·, ·, ‖ · ‖w ). Assumption 3.1. (i) {Yi , Xi , Zi } is an i.i.d. sample from (1), with E [Y 2 ] < ∞ and E [ϵ|Z ] = 0; (ii) X ∈ Rdx has support [0, 1]dx and density fX bounded away from zero with ‖f ‖∞,r < ∞ for r > 3d2x 1 ; (iii) Z ∈ Rdz has compact support Z.
requirement on the bandwidth h in Assumption 3.2(iv) is feasible as a result of r > 3d2x . Assumption 3.3(i)–(iii) are standard in the use of series estimators for conditional mean functions; see Huang (1998, 2003) and Remarks 3.2 and 3.3 for primitive conditions that ensure these rate requirements hold. In Assumption 3.4(i), we impose that the parameter space Θ be bounded in the norm ‖·‖∞,m for some m > dz /2. This assumption has strong implications on the estimation of Θ0 ; see Remark 3.4. Finally, Assumption 3.4(ii) characterizes the bias introduced by employing a sieve Θn for Θ . Remark 3.1. The kernel Kh may be set to be a regular order r kernel for points t away from {0, 1} and a boundary kernel otherwise. For example, let the bounded kernels K : R → R and L : R → R be supported on [−1, 1] and [0, 1] respectively and satisfy 1 1 −1 K (u)du = 0 L(u)du = 1 and 1
∫
1
∫
uj K (u)du = 0 −1
uj L(u)du = 0
(24)
0
for all 1 ≤ j ≤ r. The generalized kernel Kh (u, t ) given by K (u) for all h ≤ t ≤ 1 − h, by L(u) for all h > 1 − t and by L(−u) for all t < h then satisfies Assumption 3.2(ii)–(iii) and (v) with K0 (u, t ) = K (u) for all t ∈ (0, 1). Remark 3.2. Assumptions 3.3(ii) and 3.4(ii) impose restrictions on approximation errors. These are easily controlled if the functions being approximated are sufficiently smooth. For example, for: γ
ΛB ([0, 1]dx ) = {ψ : X → R, ‖ψ‖∞,γ ≤ B}
(25)
and {pj (·)}nj=1 polynomials or tensor product univariate splines, the ′
Assumption 3.2. (i) The generalized kernel Kh is of order r > 3d2x ; (ii) Kh (u, t ) is uniformly bounded on h > 0, (u, t ) ∈ [0, 1]2 ; (iii) For some compact interval K , Kh (·, t ) is compactly supported on [(t − 1)/ h, t /h] ∩ K for all t ∈ [0, 1]; (iv) nh3dx ↗ ∞ 1
and hr = o n− 2 ; (v) For some K0 (u, t ) with
K0 (u, t )du =
1, limh→0 Kh (u, t + hu) = K0 (u, t ) pointwise in all (u, t ) with t ∈ (0, 1). ′
Assumption 3.3. (i) The eigenvalues of E [pkn (X )pkn (X )] are bounded above and away from zero; (ii) For every θ ∈ Θ there − dγ
is a πn (θ ) with supθ,x |fX (x)E [θ (Z )|x] − πn (θ )pkn (x)| = O kn
− dγ
and π˜ n (θ ) with supθ,x |E [θ (Z )|x] − π˜ n (θ )pkn (x)| = O kn
x
x
; (iii)
2 kn = o(n), for ξjn = sup|λ|=j,x ‖Dλ pkn (x)‖. ξ0n
Assumption 3.4. (i) For some m > dz /2 we have supΘ ‖θ‖∞,m < ∞, Θ0 ̸= ∅, Θn ⊆ Θ and both {Θn } and Θ closed; (ii) For every θ ∈ Θ there is Πn θ ∈ Θn with supΘ ‖θ − Πn θ‖∞ = O(δn ). Assumption 3.1 states the requirements on the distribution of (Y , X , Z ). It is worth pointing out that the only requirement imposed on m0 is that E [m20 (X )] < ∞. The regressor and instrument (X , Z ) are assumed to have compact support. For notational convenience, we set the support of X to be [0, 1]dx , which may require transforming original variables through a bounded strictly monotonic function. Assumptions 3.1(ii) and 3.2(i)–(v) imply fˆX is pointwise consistent at an appropriate rate; see Remark 3.1 for examples of valid kernels Kh . Notice that the
1 Here, ‖f ‖ ∞,r is applied to a function of x, and is meant to be defined as in (13) with X in place of Z.
kn approximation error by functions of the form p π under ‖·‖∞ of ψ
− dγ
is of the order O kn
x
γ
uniformly on ψ ∈ ΛB ([0, 1]dx ). Hence,
Assumption 3.3(ii) is satisfied if both supΘ ‖E [θ (Z )|·]‖∞,γ and supΘ ‖fX (·)E [θ (Z )|·]‖∞,γ are finite; see Chen (2007) for further discussion. Remark 3.3. Verifying Assumption 3.3(iii) requires us to understand the relationship between ξ0n and kn , which may be sieve specific. When {pj (·)}∞ j=1 are polynomials, for example, we have
ξ0n . kdnx , while if {pj (·)}∞ j=1 are tensor product univariate splines, dx
then ξ0n . kn2 ; see Newey (1997) and Huang (1998) for additional details. Remark 3.4. Let the operator T (θ ) ≡ E [θ (Z )|·] have domain Θ , and notice that the equation T (θ )(x) =
ν(x) f X ( x)
(26)
implicitly defines Θ0 . When T is bijective, T −1 exists but is not continuous unless Θ is restricted, for example, to a compact set. This is the role of Assumptions 3.1(iii) and 3.4(i) which together imply Θ is compact under ‖ · ‖∞ . Alternatively, it is possible to circumvent the discontinuity of T −1 employing a regularization approach as in Darolles et al. (2003) or Chen and Pouzo (2008). The compactness restriction is a special case of Chen and Pouzo (2008) with a possibly suboptimal choice of Lagrange multiplier. While we have assumed compactness of Θ , it would also be interesting to extend the aforementioned regularization approaches to the present setting where T is not bijective. Estimation of Θ0 through such a method would face important challenges as Θ0 is likely to ˆ 0 , Θ0 , ‖ · ‖ ∞ ) be unbounded if Θ is not restricted and hence dH (Θ is potentially infinite.
A. Santos / Journal of Econometrics 161 (2011) 129–146
ˆ0 Assumptions 3.1–3.4 allow us to show the consistency of Θ and establish its rate of convergence. −1 − 2γ + kn dx + (nhdx )−1 + h2r + δn2 and bn → ∞ with bn = o(an ). If in addition Assumptions 3.1(i)–(ii), 3.2(i)–(iv), 3.3(i)–(iii) and 3.4(i)–(ii) hold, then ˆ 0 , Θ0 , ‖ · ‖∞ ) = op (1) and furthermore that it follows that dH (Θ √ ˆ 0 , Θ0 , ‖ · ‖w ) = Op ( bn /an ). dH (Θ
Theorem 3.1. Let an = O
kn n
ˆ 0 to Theorem 3.1 establishes that the rate of convergence of Θ Θ0 under the weak norm is balanced by three natural terms: (i) The − 2dγ estimation error from the conditional expectation kn /n + kn x , (ii) The estimation error from the density ((nhdx )−1 + h2r ) and (iii) The approximation error from using a sieve for the parameter space (δn ). Therefore, it is possible to establish (21) by imposing conditions on these bandwidths. Theorem 3.1 further establishes ˆ 0 to Θ0 as in (23), which enables us to the consistency of Θ consistently estimate a unique element from it. We illustrate how to verify Assumptions 3.1–3.4 by applying Theorem 3.1 to Example 2.2. Proposition 3.1. In Example 2.2, suppose (X , Z ) ∈ [0, 1]2 and further assume that: (i) fX is bounded away from zero, fX′ is Lipschitz and Assumption 3.1(i) holds. (ii) Kh is as in Remark 3.1 with order r ≥ 2 and h ≍ n−rh for 1/4 < rh < 1/3. d 1 d fZ |X (z |x1 ) − dx fZ |X (z |x2 ) ≤ G(z )|x1 − x2 | with 0 G(z )dz < (iii) dx ∞ ∞ ∀x1 , x2 ∈ [0, 1]; {pj (·)}j=1 are splines of order two with kn ≍ nrx for 1/6 < rx < 1/3 and Assumption 3.3(i) holds. (iv) Θ = {θ : [0, 1] → R, ‖θ ‖∞ ≤ B and |θ (z1 ) − θ (z2 )| ≤ B|z1 − z2 | ∀z1 , z2 ∈ [0, 1]} for some B > 0 and Θ0 ̸= ∅. Further let {rq (·)}∞ q=1 be splines of order two and define the sieve
∑q Θn = θ ∈ Θ : θ (z ) = qn=1 βq rq (z ) for qn ≍ nrz with rz > 1/3.
If an ≍ nα with 2/3 < α < max{1 − rx , 4rx , 1 − rh , 2rz } and bn ≍ log n, it then follows that:
ˆ 0 , Θ0 , ‖ · ‖∞ ) = op (1) dH (Θ 1 ˆ 0 , Θ 0 , ‖ · ‖ w ) = op n− 4 . dH (Θ
(27)
3.3. Element consistency
ˆ 0 for Θ0 , the second challenge consists Given the estimator Θ ˆ 0 in such a way as to ensure it in selecting an element θˆ0 ∈ Θ converges to a unique element θ0 ∈ Θ0 . For this purpose, we adapt the theory of extremum estimators to problems where the parameter space is unknown but consistently estimated. Suppose M : Θ → R is a population criterion function attaining a unique minimum on Θ0 and Mn : Θ → R is its finite sample analogue. Intuitively, if θ0 is the unique minimizer of M on Θ0 , then the ˆ 0 should minimizer of Mn over the estimated parameter space Θ provide a consistent estimator for θ0 . In this manner, we can ensure ˆ 0 converges to a unique element of the identified set θˆ0 ∈ Θ by selecting it to be the solution to an appropriate minimization problem. Theorem 3.2 formalizes this argument. Theorem 3.2. Let (i) Θ0 ⊆ Θ be closed with Θ compact and M : ˆ 0 ⊆ Θ satisfy Θ → R have a unique minimum on Θ0 at θ0 , (ii) Θ ˆ 0 , Θ0 , ‖ · ‖) = op (1), (iii) Mn : Θ → R and M : Θ → R dH (Θ
133
be continuous, (iv) supθ∈Θ |Mn (θ ) − M (θ )| = op (1). If θˆ
∈
arg minθ∈Θˆ 0 Mn (θ ), then ‖θˆ − θ0 ‖ = op (1).2
Remark 3.5. Even though θ0 is a minimum of M : Θ → R on Θ0 , it is often not the minimum on the parameter space Θ . In particular, since Θ0 may have no interior relative to Θ , θ0 may lie in the boundary of Θ0 and not even be a local minimum of M on Θ . The ˆ 0 converges to Θ0 under the Hausdorff norm, requirement that Θ however, is sufficiently strong to overcome this difficulty. The principal purpose of the criterion function M is to help us construct a consistent estimator for a unique element θ0 ∈ Θ0 . Any function M that attains a unique minimum on Θ0 is a suitable choice for this goal.3 Fortunately, when the parameter space Θ is both compact and convex, the identified set Θ0 is itself compact and convex. As a result, it is straightforward to guarantee that M has a unique minimizer by choosing it to be strictly convex on Θ or Θ0 . In this manner, Theorems 3.1 and 3.2 can be used to construct ˆ 0 such that ‖θˆ0 − θ0 ‖∞ = op (1) for some an estimator θˆ0 ∈ Θ unique θ0 ∈ Θ0 . Furthermore, since Θ0 is an equivalence class under the pseudo-metric ‖ · ‖w , Theorem 3.1 also yields a rate of convergence for ‖θˆ0 − θ0 ‖w by arguing as in (22). Corollary 3.1. Let Assumptions 3.1(i)–(ii), 3.2(i)–(iv), 3.3(i)–(iii) and 3.4(i)–(ii) hold, Θ be convex, M : Θ → R strictly convex, Mn : Θ → R continuous and supθ∈Θ |Mn (θ ) − M (θ )| = op (1). Then, for θ0 the minimizer of M : Θ → R on Θ0 and θˆ0 a minimizer ˆ 0: of Mn : Θ → R on Θ
‖θˆ0 − θ0 ‖∞ = op (1)
‖θˆ0 − θ0 ‖w = Op ( bn /an ).
(28)
Corollary 3.1 implies it is possible to construct an estimator
θˆ0 such that for some unique element θ0 ∈ Θ0 , ‖θˆ0 − θ0 ‖∞ = 1 op (1) and in addition ‖θˆ0 − θ0 ‖w = op n− 4 . Both results are instrumental for showing the second stage estimator is both √ n consistent and asymptotically normal. We illustrate the implications of Corollary 3.1 by applying it to Example 2.2. Proposition 3.2. Let all assumptions of Proposition 3.1 hold, and further define: Mn (θ ) ≡
n 1−
n i=1
θ 2 (Zi )
M (θ ) ≡ E [θ 2 (Z )].
(29)
Then for θ0 the minimizer of M on Θ0 and θˆ0 a minimizer of Mn on ˆ 0 , it follows that: Θ
‖θˆ0 − θ0 ‖∞ = op (1)
1 ‖θˆ0 − θ0 ‖w = op n− 4 .
(30)
Remark 3.6. Without further restrictions √ on the model, a lack of a solution to (3) implies ⟨ν, m0 ⟩ is not n-estimable, and possibly not identified either. In such instances we may still define:
νP (x) ≡ E [θP (Z )|x]fX (x) θP ∈ arg min E [(ν(X ) − E [θ (Z )|X ]fX (X ))2 ],
(31)
θ∈Θ
and hence the function νP is the best approximation in mean squared error that can be obtained from functions of the
2 Continuity, closeness and compactness in (i)–(iii) are with respect to the metric p
ˆ 0 , Θ0 , ‖ · ‖) → 0. under which dH (Θ 3 Notice this rules out setting M = Q because Q (θ) = 0 for all θ ∈ Θ . 0
134
A. Santos / Journal of Econometrics 161 (2011) 129–146
form E [θ (Z )|·]fX (·). Unlike the parameter ⟨ν, m0 ⟩, however, the approximation ⟨νP , m0 ⟩ is guaranteed to be both identified and √ n-estimable. Letting θˆ denote the exact minimizer of Qn and νˆ P (x) = Eˆ [θˆ (Z )|x]fˆX (x), we may define alternative criterion functions to be given by: Q˜ n (θ) ≡
n 1−
n i=1
(Eˆ [νˆ P (X ) − θ (Z )fˆX (X )|Xi ])2
(32)
Q˜ (θ) ≡ E [(E [νP (X ) − θ (Z )fX (X )|X ])2 ]. The results in this paper may be extended to show that a procedure ˆ employing Q˜ n instead of Qn can deliver an estimator θP satisfying 1 ‖θˆP −θP ‖w = op n− 4 and ‖θˆP −θP ‖∞ = op (1) which can in turn √
be employed to construct a for ⟨νP , m0 ⟩.4
n asymptotically normal estimator
3.4. Computational aspects
θˆ0 ∈ arg min Mn (θ ) s.t. an Qn (θ ) ≤ bn . θ∈Θn
(33)
The constraint function Qn is quadratic in θ , while the objective function Mn can be chosen to be convex leading to a tractable optimization problem. The only potential difficulty therefore lies in imposing the constraint that θ ∈ Θn . A computationally simple choice for Θn are linear sieves. Let {rq (·)}∞ q=1 be a set of basis functions and qn be the number of terms included when the sample size is n. Further denoting the vector r qn (z ) = (r1 (z ), . . . , rqn (z )), and letting β ∈ Rqn , a linear sieve is of the form: ′
Θn = {θ ∈ Θ : θ (z ) = r qn (z )β}. (34) For this choice of Θn , defining Θ = {θ : ‖θ ‖∞,m ≤ B}, as in Proposition 3.1, may not be advisable as the constraint ′ ‖r qn β‖∞,m ≤ B can be highly nonlinear in β . In a similar problem, Newey and Powell (2003) instead suggest employing the norm:
‖θ ‖
− ∫
≡
|λ|≤m0
[Dλ θ (z )]2 dz
(35)
Z
and letting Θ denote the closure of the set {θ : ‖θ ‖2,m0 ≤ B} under ‖ · ‖∞,m . If Z is sufficiently regular and m0 > m + dz /2 then such a choice of Θ will be bounded in ‖ · ‖∞,m as required ′ by Assumption 3.4(i). Moreover, the constraint r qn β ∈ Θn is then quadratic in β , since for
Λqn ≡
− ∫ |λ|≤m0
The results of Section 3 can be used to obtain an estimator
ˆ θˆ0 for a unique element θ0 of the identified 1 set such that ‖θ0 − −4 ˆ θ0 ‖∞ = op (1) and ‖θ0 − θ0 ‖w = op n . In this section we √ show how such an estimator can be employed to construct a n asymptotically normal estimator for ⟨ν, m0 ⟩. We first need to introduce additional notation. Let V be the closure of the linear span of Θ under ‖ · ‖w . The vector space V is a Hilbert Space with inner product given by:
⟨θ1 , θ2 ⟩w ≡ E E [θ1 (Z )|X ]E [θ2 (Z )|X ]fX2 (X ) .
[Dλ r qn (z )Dλ r qn (z )]dz ′
(36)
Z
The linear functional θ → E [Y θ (Z )] is continuous under ‖ · ‖w and hence, by the Riesz Representation theorem, there exists a v˜ ∈ V such that for all θ ∈ V we have: (39)
It is important to note that the Riesz representer is unique only up to equivalence classes in ‖ · ‖w . To be precise, we therefore denote this equivalence class of functions by:
V ≡ {v ∈ V¯ : ⟨v, θ⟩w = E [Y θ (Z )] for all θ ∈ V¯ }.
Assumption 4.1. (i) V ∩ Θ ̸= ∅ and ‖E [˜v (Z )|·]ν(·)fX (·)‖∞,r < ∞; 1
o( n
−1
),
k3n
− 2dγ
= o(n), ξ × kn 2 0n
x
= o(1) and δn = o(n
n 1−
n i =1
(37)
minimizing a convex function of choice subject to two quadratic constraints, which can be computationally easily solved.
=
).
Theorem 4.1. Let θ0 ∈ Θ0 and assume θˆ0 ∈ Θn is such that 1 ‖θˆ0 −θ0 ‖∞ = op (1) and in addition ‖θˆ0 −θ0 ‖w = op n− 4 . Further
rn (θˆ0 ) ≡
′ and θˆ = r qn βˆ . The optimization problem in (37) consists of
− 13
x
Assumption 4.1(i) imposes that there be at least one element of V also in the parameter space Θ . Since V ⊂ V¯ , the assumption requires that there be a v˜ not only in the closure of the linear span of Θ , but in Θ itself. The norm ‖ · ‖∞,r in Assumption 4.1(i) corresponds to the one in Assumptions 3.1(ii), 3.2(i) and 3.2(iv). Finally, Assumption 4.1(ii) strengthens the rate requirements on the bandwidths and approximation errors. For illustrative purposes, we note that all the rate requirements of Assumption 4.1(ii) are satisfied in Example 2.2 by the conditions imposed in Proposition 3.1. As a first step towards establishing normality, we obtain the following asymptotic expansion:
′ βˆ ∈ arg min Mn (r qn β)
s.t. (a) an Qn (r qn β) ≤ bn , (b) β ′ Λqn β ≤ B2
− 3dγ
(ii) These rate requirements hold: ξ0n × kn = o(n 2 ), kn
let v˜ ∈ V ∩ Θ and define the remainder term:
β
(40)
Since E [˜v1 (Z )|x] = E [˜v2 (Z )|x] for all v˜ 1 , v˜ 2 ∈ V , we will generically write E [˜v (Z )|x] without referring to a specific v˜ ∈ V . We introduce the following assumption which will suffice for establishing asymptotic normality.
we have r β ∈ Θn if and only if β ′ Λqn β ≤ B2 . For these parameter specifications, (33) becomes:
q′n
′
(38)
⟨˜v , θ⟩w = E [Y θ (Z )].
While in the development of the theory we have presented θˆ0 as a two stage estimator, in practice its computation can be done in one simple step. In particular, θˆ0 minimizes the objective function ˆ 0 , if and only if Mn over the estimate set Θ
2 2,m0
4. Asymptotic normality
ˆ (Xi , θˆ0 ). Eˆ [Πn v˜ (Z )fˆX (X )|Xi ]m
(41)
If Assumptions 3.1(i)–(iii), 3.2(i)–(v), 3.3(i)–(iii), 3.4(i)–(ii) and 4.1(i)–(ii) hold, then:
√ n{⟨ν, m0 ⟩ − ⟨ν, m0 ⟩} n 1 −
= √ 4 The exact minimizer θˆ of Q may not be employed to construct ⟨ν P , m0 ⟩ as it n1 will satisfy ‖θˆ − θP ‖w = op n− 4 but not necessarily ‖θˆ − θP ‖∞ = oP (1).
n i =1
{Yi θ0 (Zi ) − E [˜v (Z )|Xi ]fX2 (Xi )θ0 (Zi )}
√ −
nrn (θˆ0 ) + op (1).
(42)
A. Santos / Journal of Econometrics 161 (2011) 129–146
Direct calculation reveals that the remainder term rn (θˆ0 ) in (41) is related to Qn (θˆ0 ) through:
ˆ = 2rn (θ)
d dτ
135
Proposition 4.1. Let all the assumptions of Proposition 3.1 and Assumption 4.1(i) hold. Then,
√
Qn (θˆ0 + τ Πn v˜ )|τ =0 .
(43)
n|ˆrn (θˆ0 ) − rn (θˆ0 )| = op (1).
sup
(48)
ˆ0 θˆ0 ∈Θ
Hence, 2rn (θˆ0 ) is the pathwise derivative of Qn from θˆ0 in the direction of Πn v˜ . If θˆ0 is the exact minimizer of Qn , then rn (θˆ0 ) = 0. However, without identification the exact minimizer fails to converge to a unique element of Θ0 and the conditions of Theorem 4.1 are not satisfied. We instead employ the results of ˆ 0 implies θˆ0 is a Section 3 to construct θˆ0 , in which case θˆ0 ∈ Θ ‘‘near minimizer’’ of Qn but not necessarily the exact minimizer. As √ a result, the term nrn (θˆ0 ) may not be asymptotically negligible, a problem we address in the next subsection.
4.2. Main Result Theorem 4.1√ and Lemma 4.1 establish that it is possible to construct a n asymptotically normal estimator for ⟨ν, m0 ⟩ employing a nuisance parameter θˆ0 constructed as outlined in Section 3. We collect the implications of our analysis and present the main result of the paper:
ˆ 0 be as in (14) with an = O Theorem 4.2. Let Θ
4.1. Remainder correction The only unobservable element of the remainder rn (θˆ0 ) is Πn v˜ , the projection of v˜ onto the sieve Θn . We construct an extremum estimator for Πn v˜ by defining the criterion function: C (θ) ≡ E [(E [θ (Z )|X ])2 fX2 (X )] − 2E [Y θ(Z )].
(44)
Since E [Y θ (Z )] = ⟨˜v , θ⟩w , it follows that C (θ) = ‖θ − v˜ ‖2w −‖˜v ‖2w , and hence v˜ ∈ Θ is the unique minimizer of θ → C (θ ) (up to the equivalence class in ‖ · ‖w ). We denote the implied estimator by:
vˆ ∈ arg min Cn (θ ) θ∈Θn
Cn (θ) ≡
n 1−
n i =1
(Eˆ [θ (Z )fˆX (X )|Xi ]) − 2
n 2−
n i=1
(45)
Yi θ(Zi ).
It is important to note that in order to accurately approximate the remainder term rn (θˆ0 ), we do not require consistency for Πn v˜ under ‖ · ‖∞ , but only under the weak pseudo-metric ‖ · ‖w . This is because Πn v˜ only enters rn (θˆ0 ) through the term Eˆ [Πn v˜ (Z )fˆX (X )|Xi ], which is asymptotically equivalent to E [˜v (Z )fX (X )|Xi ]. In turn, E [˜v (Z )fX (X )|Xi ] takes the same values for all v˜ ∈ V , which is an equivalence class under ‖ · ‖w . Given the estimator vˆ ∈ Θn , we define our approximation to the remainder rn (θˆ0 ) to be:
−1
kn n
− 2dγ
+ kn
x
+
2 3
(nhdx )−1 + h2r +δn2 and bn → ∞ with n × bn = o(an ). Suppose Θ is convex, M : Θ → R is strictly convex on Θ0 and Mn : Θ → R is continuous and satisfies supθ∈Θ |Mn (θ ) − M (θ )| = op (1). Further define the parameters:
θˆ0 ∈ arg min Mn (θ ) ˆ0 θ ∈Θ
θ0 = arg min M (θ ). θ∈Θ0
(49)
If Assumptions 3.1(i)–(iii), 3.2(i)–(v), 3.3(i)–(iii), 3.4(i)–(ii) and 4.1(i)–(ii) hold, then it follows that: n 1 −
√
n i =1
L
{Yi θˆ0 (Zi ) + rˆn (θˆ0 ) − ⟨ν, m0 ⟩} −→ N (0, σ 2 )
(50)
where rˆn (θˆ0 ) is as in (46) and σ 2 = E [θ02 (Z )(Y − E [˜v (Z )|X ]fX2 (X ))2 ]. In order to conduct inference on ⟨ν, m0 ⟩, we still require a consistent estimator for the asymptotic variance σ 2 . Such an estimator is easily computed given our estimators θˆ0 , fˆX and vˆ by:
σˆ 2 ≡
n 1−
n i=1
θˆ02 (Zi )(Yi − Eˆ [ˆv (Z )fˆX2 (X )|Xi ])2 .
(51)
(46)
Lemma 4.2 establishes σˆ 2 is indeed consistent for the asymptotic variance of Theorem 4.2.
The following lemma characterizes the rate of convergence of ˆ 0. rˆn (θˆ0 ) to rn (θˆ0 ) uniformly on Θ
Lemma 4.2. If Assumptions 3.1(i)–(iii), 3.2(i)–(v), 3.3(i)–(iii), 3.4(i)–(ii) and 4.1(i)–(ii) hold, it then follows that the estimator σˆ 2 is consistent for σ 2 .
rˆn (θˆ0 ) =
n 1−
n i =1
ˆ (Xi , θˆ0 ). Eˆ [ˆv (Z )fˆX (X )|Xi ]m
Lemma 4.1. If Assumptions 3.1(i)–(iii), 3.2(i)–(v), 3.3(i)–(iii), 3.4(i)–(ii) and 4.1(i)–(ii) hold, it then follows that the approximate remainder rˆn (θˆ0 ) satisfies: sup |ˆrn (θˆ0 ) − rn (θˆ0 )| ˆ0 θˆ0 ∈Θ
= Op
kn n
14
γ
− 2d
+ kn
x
1
+ (nhdx )− 4
√ bn
×√
an
.
(47)
√
√ Lemma 4.1 implies that if bn / an ↘ 0 is sufficiently fast, √ then n|ˆrn (θˆ0 ) − rn (θˆ0 )| = op (1). The rate conditions imposed in √ √ Assumption 4.1(ii) ensure that such a choice of bn / an is indeed ˆ 0, feasible. Furthermore, since Lemma 4.1 holds uniformly in Θ the parameter choices are valid for all possible criterion functions Mn . To conclude, we note that the conditions of Proposition 3.1 are sufficient for establishing the remainder term rˆn (θˆ0 ) may be estimated sufficiently fast in the context of Example 2.2.
Theorem 4.2 and Lemma 4.2 enable us to construct pivotal statistics and in this way conduct hypotheses tests on the parameter of interest ⟨ν, m0 ⟩. We conclude by showing this can readily be accomplished in Example 2.2. Proposition 4.2. Let all the assumptions of Proposition 4.1 hold and in addition define Mn as in Proposition 3.2 and let θˆ0 be the minimizer ˆ 0 . It then follows that: of Mn on Θ 1
√
n − L {Yi θˆ0 (Zi ) + rˆn (θˆ0 ) − ⟨ν, m0 ⟩} −→ N (0, 1).
nσˆ i=1
(52)
Remark 4.1. Under exogeneity, θ0 (z ) = ν(x)/fX (x) and similarly E [˜v (Z )|x]fX2 (x) = m0 (x). It is then interesting to note that the asymptotic variance from Theorem 4.2 reduces to
] ν 2 (X ) 2 E (Y − m0 (X )) , fX2 (X ) [
(53)
136
A. Santos / Journal of Econometrics 161 (2011) 129–146
which coincides with the asymptotic variance of Newey and McFadden (1994). Since the latter is semiparametrically efficient, so is our estimator in the case of exogeneity.5
0.75(1 − u2 ) for all |u| ≤ 1 and L(u) = 6(u − u2 )(6 − 10u) for all u ∈ [0, 1] which Müller (1991) shows corresponds to the generalized Epanechnikov kernel. Finally, we also define the auxiliary criterion functions Mn and M to be given by:
5. Monte Carlo In order to illustrate the implementation of the outlined procedure and examine its finite sample performance, we conduct a small-scale Monte Carlo study. To facilitate exposition, the Monte Carlo was designed so that the integral equation in (3) has a closed form solution. We assume X , Z ∈ [0, 1]2 , and that they are distributed according to the density: fXZ (x, z ) = 3|x − z | for (x, z ) ∈ [0, 1]2 .
(54)
We complete the model by assuming that Y and ϵ are generated according to the relationship: Y = 2 − X2 + ϵ
ϵ=−
U
fXZ (X |Z )
12
−1
(55)
where U is uniformly distributed on [0, 1] independent of (X , Z ) and fX |Z (x|z ) is the conditional density of X given Z .6 By construction, E [ϵ|Z ] = 0 and E [ϵ|x] = (1 − fX (x))/24fX (x). For the linear functional, we study an approximation to the integral of m0 between two points xL < xU satisfying 0 < xL < xU < 1. Let Φ denote the c.d.f. of the standard normal and define Ft (x) ≡ Φ
x − xL
−Φ
t
x − xU t
,
(56)
as well as the constants At ≡ − 12 (Ft′ (1)+ Ft′ (0)) and Bt ≡ 21 (Ft′ (1)− Ft (1) − Ft (0)) for some t > 0. The functional of interest is then ⟨νt , m0 ⟩, where νt is pointwise given by:
νt (x) ≡ Ft (x) + At x + Bt .
(57)
It can be verified through direct calculation and the dominated convergence theorem that: lim ⟨νt , m0 ⟩ = t ↓0
xU
∫
m0 (x)dx
(58)
xL
1
whenever 0 m20 (x)dx < ∞. Unlike the choice ν(x) = 1{xL ≤ x ≤ xU } for which no solution to (3) exists, however, Polyanin and Manzhirov (1998) establish that for any t > 0 the function
θ0 (z ) ≡
1 t2
φ
′
z − xL
t
−φ
′
z − xU t
(59)
d satisfies E [θ0 (Z )|x] = νt (x)/fX (x) (where φ(u) = du Φ (u)). For the simulations we set xL = 0.25, xU = 0.75 and t = 0.01. Following the discussion in Section 3.4, we define the parameter space Θ to be:
Θ ≡ cl({θ : ‖θ‖2,2 ≤ B})
(60)
where the closure is under ‖ · ‖∞,1 and ‖ · ‖2,2 is as defined in (35). We employ a linear sieve Θn as in (34) with {rq (·)}nq=1 splines of order 4 with equally spaced knots in the interior and full multiplicity at the endpoints. Similarly, for estimating the conditional expectation we let {pj (·)}∞ j=1 be splines of order 2 with the same knot arrangement. The kernel Kh for estimating the density of X is constructed as in Remark 3.1 setting K (u) =
5 We thank an anonymous referee for pointing this out. 1 6 U has significantly fat tails. The scaling by 1/12 ensures that ϵ f (x|z )−1 X |Z
retains variability while preventing most its large realizations from driving the results.
n 1−
n i =1
θ 2 (Zi )
M (θ ) ≡ E [θ 2 (Z )].
(61)
Arguing as in Proposition 3.1, it is possible to show all rate requirements are satisfied by letting an /bn , h, kn and qn satisfy the assumptions of said proposition. The theory, however, offers little guidance as to how to select the level of the different bandwidths. Since the minimizer of M on Θ0 is not the global minimizer of M, it seems prudent to select an /bn so that the constraint (a) in (37) binds. For θˆu the minimizer of Qn on Θn , we implement the following ad-hoc selection of an /bn : an
1
Mn (θ ) ≡
bn
≡ γ × Qn (θˆu ) + (1 − γ ) × Qn (0)
(62)
for γ ∈ (0, 1). Since θ = 0 is the minimizer of Mn on Θn , setting γ > 0 ensures that constraint (a) in (37) is binding. In turn, ˆ 0 contains elements in addition to θˆu . selecting γ > 0 implies Θ Tables 1 and 2 report the bias and standard deviation respectively ˆ of the estimator ⟨ν t , m0 ⟩ + rˆn (θ0 ) for different choices of γ (as in (62)), B (as in (60)) and h. The bandwidths kn and qn were both set equal to six. Other choices for these parameters yielded qualitative similar results. The numbers reported in Tables 1 and 2 are based on one thousand replications. Higher choices of bandwidth h decrease the standard deviation of the estimator and, somewhat surprisingly, do not translate into a higher bias for the ranges of h selected. Conversely, selecting large values of B significantly decreases the bias of the estimator without an important increment in the standard deviation. For the range specified, the estimator does not appear to be very sensitive to the choice of γ , with both its bias and standard deviation remaining largely invariant to the level of γ across specifications. Overall, we find the performance of the estimator to be encouraging. 6. Conclusion The results in this paper allow for the estimation of continuous linear functionals in an additive separable model with endogenous regressors. Our proposed estimator relies on an assumption that √ under weak conditions is necessary for ⟨ν, m0 ⟩ to be n estimable. This procedure does not require m0 nor the nuisance parameter to be identified. The techniques developed in this paper may be of use in other estimation problems with partially identified nonparametric nuisance parameters. Acknowledgements I would like to thank Peter Robinson and three anonymous referees for comments and suggestions that greatly improved the paper. I also benefited from helpful comments by Graham Elliott, Aprajit Mahajan, Azeem Shaikh, Hal White and Frank Wolak as well as seminar participants at UC Berkeley, UC Davis, UC Riverside, Ohio State University, the Cowles Conference on Operator Methods and Inverse Problems and the 2008 Econometric Society Winter Meetings. Appendix A. Notation and definitions The following is a table of the notation and definitions that will be used throughout the appendix, including many that go beyond
A. Santos / Journal of Econometrics 161 (2011) 129–146
137
Table 1
ˆ Bias of ⟨ν t , m0 ⟩ + rˆn (θ0 ). n = 500, kn = 6, B = 103
γ = 0.05 γ = 0.15 γ = 0.25
n = 500, kn = 6, B = 104
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
−0.054 −0.057 −0.059
−0.049 −0.052 −0.053
−0.051 −0.053 −0.054
−0.027 −0.032 −0.034
−0.022 −0.025 −0.027
−0.023 −0.027 −0.028
−0.015 −0.016 −0.016
−0.003 −0.005 −0.006
−0.005 −0.007 −0.008
n = 1000, kn = 6, B = 103
γ = 0.05 γ = 0.15 γ = 0.25
n = 1000, kn = 6, B = 104
n = 1000, kn = 6, B = 105
h = 0.01
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
−0.027 −0.029 −0.029
−0.024 −0.025 −0.026
−0.028 −0.029 −0.029
−0.015 −0.018 −0.019
−0.011 −0.012 −0.013
−0.014 −0.016 −0.016
−0.004 −0.006 −0.006
−0.002 −0.003 −0.003
−0.006 −0.007 −0.007
n = 2000, kn = 6, B = 103
γ = 0.05 γ = 0.15 γ = 0.25
n = 500, kn = 6, B = 105
h = 0.01
n = 2000, kn = 6, B = 104
n = 2000, kn = 6, B = 105
h = 0.01
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
−0.012 −0.012 −0.013
−0.011 −0.012 −0.012
−0.016 −0.017 −0.017
−0.005 −0.006 −0.006
−0.005 −0.005 −0.006
−0.010 −0.010 −0.011
−0.000 −0.000 −0.000
−0.000 −0.001 −0.001
−0.006 −0.006 −0.006
Table 2
ˆ Standard Deviation of ⟨ν t , m0 ⟩ + rˆn (θ0 ). n = 500, kn = 6, B = 103
γ = 0.05 γ = 0.15 γ = 0.25
n = 500, kn = 6, B = 104
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
0.093 0.092 0.093
0.066 0.066 0.067
0.060 0.061 0.061
0.101 0.095 0.095
0.057 0.056 0.057
0.048 0.048 0.049
0.171 0.137 0.131
0.078 0.065 0.065
0.063 0.054 0.053
n = 1000, kn = 6, B = 103
γ = 0.05 γ = 0.15 γ = 0.25
n = 1000, kn = 6, B = 104
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
0.051 0.051 0.051
0.037 0.038 0.038
0.033 0.034 0.034
0.050 0.050 0.050
0.030 0.031 0.032
0.025 0.026 0.027
0.072 0.065 0.065
0.038 0.036 0.035
0.027 0.026 0.026
n = 2000, kn = 6, B = 104
n = 2000, kn = 6, B = 105
h = 0.01
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
h = 0.01
h = 0.05
h = 0.1
0.027 0.027 0.028
0.021 0.022 0.022
0.020 0.020 0.020
0.026 0.026 0.026
0.017 0.017 0.017
0.015 0.016 0.016
0.035 0.034 0.034
0.021 0.020 0.020
0.018 0.017 0.017
the ones already introduced in the main text: a.b
‖ · ‖∞,m
a ≤ Mb for some constant M which is universal in the context of the proof The norm ‖g ‖∞,m = max|λ|≤m supw |Dλ g (w)|
+ N (ϵ, F , ‖ · ‖) N[ ] (ϵ, F , ‖ · ‖)
ξjn
n = 1000, kn = 6, B = 105
h = 0.01
n = 2000, kn = 6, B = 103
γ = 0.05 γ = 0.15 γ = 0.25
n = 500, kn = 6, B = 105
h = 0.01
|Dλ g (w)−Dλ g (w ′ )| maxλ=m supw,w′ ‖w−w′ ‖m−m
The covering numbers of size ϵ for F under the norm ‖ · ‖ The bracketing numbers of size ϵ for F under the norm ‖ · ‖ The sequence given by ξjn = supx,|λ|=j ‖Dλ pkn (x)‖
Proof. Define Θpn to be the pointwise projection of Θ0 onto Θn under ‖ · ‖. We then obtain: sup Qn (θ ) ≤ sup C1 Q (θ ) + Op (c2n ) Θpn
Θpn
1 ≤ sup inf C1 C2 ‖θ − θ0 ‖κ1 + Op (c2n ) = Op (a− n )
Θpn Θ0
with probability approaching one by (ii), (iii), (iv) and an κ O(max[c1n1 , c2n ]−1 ). Next observe that we also have:
(63)
=
ˆ 0 ) = sup inf ‖θ0 − θˆ0 ‖ h(Θ0 , Θ ˆ0 Θ0 Θ
≤ sup inf {‖θ0 − θpn ‖ + ‖θpn − θˆ0 ‖} ˆ 0 ,Θpn Θ0 Θ
Appendix B. Proofs of results Theorem B.1. Assume (i) Q (θ ) ≥ 0 and Θ0 = {θ ∈ Θ : Q (θ) = 0} with Θ compact in ‖ · ‖, (ii) Θn ⊆ Θ are closed and supΘ infΘn ‖θ − θn ‖ = O(c1n ), (iii) Uniformly on Θn , Qn (θ ) ≤ C1 Q (θ ) + Op (c2n ) and Q (θ ) ≤ C1 Qn (θ ) + Op (c2n ) with probability approaching one, (iv) Q (θ ) ≤ C2 infΘ0 ‖θ − θ0 ‖κ1 for some κ1 > 0. κ Then for an = O(max[c1n1 , c2n ]−1 ) and bn → ∞ with bn = o(an ), the
ˆ 0 = {θn ∈ Θn : Qn (θ ) ≤ bn /an } satisfies dH (Θ ˆ 0 , Θ0 , ‖ · ‖) = set Θ op (1). If in addition (v) Q (θ ) ≥ infΘ0 C2 ‖θ − θ0 ‖κ2 for some κ2 > 0, 1 ˆ 0 , Θ0 , ‖ · ‖) = Op (max{(bn /an ) max[κ2 ,1] , c1n }). then dH (Θ
≤ sup inf ‖θ0 − θpn ‖ + sup inf ‖θpn − θˆ0 ‖ Θ0 Θpn
ˆ0 Θpn Θ
= sup inf ‖θpn − θˆ0 ‖ + O(c1n ). ˆ0 Θpn Θ
(64)
Therefore, for M > 0 sufficiently large so that the O(c1n ) term in (64) is less Mc1n /2 we obtain that:
ˆ 0 ) < Mc1n ) P (h(Θ0 , Θ ≥ P (sup inf ‖θpn − θˆ0 ‖ < Mc1n /2) ˆ0 Θpn Θ
ˆ 0 ). ≥ P (Θpn ⊆ Θ
(65)
138
A. Santos / Journal of Econometrics 161 (2011) 129–146
ˆ 0 , Θpn ⊆ Θ ˆ 0 if and only if Qn (θpn ) ≤ bn /an for all By definition of Θ θpn ∈ Θpn . Hence, for n large enough,
Var(fˆX (Xi )|Xi )
≤ ˆ 0 ) < Mc1n ) ≥ P (an sup Qn (θ ) ≤ bn ) → 1 P (h(Θ0 , Θ
(66)
Θpn
=
since by (63) an supΘpn Qn (θ ) = Op (1) and bn → ∞. We therefore
ˆ 0 ) = Op (c1n ). obtain that h(Θ0 , Θ
ˆ 0 , Θ0 ). For ϵn > 0, the continuity of Q (θ ) Next, we examine h(Θ and compactness of Θ imply that: δn ≡
inf
{θ∈Θ : inf ‖θ−θ0 ‖≥ϵn }
Q (θ ) > 0.
(67)
Θ0
dx ∏
∫
1 nh2dx
[0,1]dx k=1 dx ∏
∫
1 nhdx
Γ (Xi ) k=1
Kh2
(k)
Xi
− x(k) h
(k)
, Xi
fX (x)dx
(k)
Kh2 (u(k) , Xi )fX (Xi − hu)du
= O((nh ) ), dx −1
(70)
and hence the first claim of the lemma follows from (69) and (70). The second claim of the lemma is in turn a direct consequence of the first claim of the lemma holding almost surely in Xi . ′
ˆ 0 = ∅, then h(Θ ˆ 0 , Θ0 ) = 0. Therefore, it follows from (67), the If Θ ˆ definition of Θ0 and (iii) that: ˆ 0 , Θ0 ) > ϵn ) P (h(Θ ≤ P (∃θ ∈ Θn : Qn (θ ) ≤ bn /an and Q (θ ) > δn ) bn ≤ P δn < C1 + Op (c2n ) + o(1).
(68)
an
ˆ 0 , Θ0 ) If ϵn is constant, then conditions (i)–(iv) and (68) imply h(Θ ˆ = op (1) which together with h(Θ0 , Θ0 ) = Op (c1n ) establishes con1
sistency. If in addition (v) holds, then set ϵn = (2C1 bn /an C2 ) max{κ2 ,1} κ which implies δn ≥ C2 ϵn 2 ≥ 2C1 bn /an for n large, and (68) im-
ˆ 0 , Θ0 ) = Op ((bn /an ) plies h(Θ
1 max[κ2 ,1]
). The claimed rate of conver-
ˆ 0 ) = Op (c1n ) and the definition of gence then follows from h(Θ0 , Θ ˆ 0 , Θ0 , ‖ · ‖). dH ( Θ
Lemma B.1. If Assumptions 3.1(i)–(ii), 3.2(i)–(iii) hold, then almost surely (i) E [fˆX (Xi )|Xi ] = fX (Xi ) + O(hr ) and Var(fˆX (Xi )|Xi ) = O((nhdx )−1 ); (ii) E [(fˆX (Xi ) − fX (Xi ))2 ] = O(h2r + (nhdx )−1 ). Proof. The proof proceeds by standard calculations. Let u = (Xi − x)/h and u(k) its kth component. Defining the set Γ (Xi ) = {u ∈ (k) (k) Rdx : (Xi − 1)/h ≤ u(k) ≤ Xi /h}, we then obtain through a change of variables that: 1 E [fˆX (Xi )|Xi ] = d h
∫ =
x
dx ∏
∫ [0,1]dx
Γ (Xi ) k=1
Kh ( u
= f X ( X i ) + hr
Kh
(k)
Xi
(k)
− x(k) h
k =1
dx
∏
(k)
, Xi
, Xi )fX (Xi − hu)du
Γ (Xi ) |λ|=r
− 2γ (X )θ (Z )|Xi ] − E [fX (X )θ (Z )|Xi ])2 = Op knn + kn dx + (nhdx )−1 + h2r ; and (ii) supΘn E [(E [ν(X ) − fX (X )θ (Z )|X ] − E¯ [ν(X ) − fX (X ) − 2dγ
θ (Z )|X ])2 ] = Op (kn
‖g ‖∞ ‖g ‖L2
).
∑ kn aj pj (x) j=1 = sup 1 x 2 kn ∑ 2 aj
j =1
≤ sup ‖pkn (x)‖.
(71)
x
Hence, for An = supGn ‖g ‖∞ /‖g ‖L2 Assumption 3.3(iii) implies A2n kn /n → 0 and by Lemma 2.3(i) in Huang (2003): 1 2
E [g 2 (X )] ≤
Θn
n 1−
n i =1
g 2 (Xi ) ≤ 2E [g 2 (X )]
∀g ∈ Gn
(72)
n 1−
n i=1
≤ sup
(Dλ fX (X¯ i (u))
Θn
dx ∏ (k) λ (k) λk (k) − D fX (Xi )) (u ) Kh (u , Xi ) du
(Eˆ [fˆX (X )θ (Z )|Xi ] − E [fX (X )θ (Z )|Xi ])2
n 2−
n i=1
+ sup Θn
k=1
= fX (Xi ) + O(hr ),
x
Proof. As noted in Newey (1997), under Assumption 3.3(i) we may ′ assume without loss of generality that E [pkn (X )pkn (X )] = I. Let Gn denote the linear span of pkn , and observe that for any g (x) = ∑kn j=1 aj pj (x):
sup
(k)
−
∑ 3.2(i)–(iii), 3.3(i)–(iii) and 3.4(i) hold, then (i) supΘn n−1 i (Eˆ [fˆX
with probability tending to one. In order to establish the first claim of the lemma, we use the inequality:
fX (x)dx
∫
dx k ¯ Lemma B.2. ∑For w : [0, 1] × Z → R, let E [w(X , Z )|x] = p n (x)(P ′ P )−1 i pkn (Xi )E [w(X , Z )|Xi ]. If Assumptions 3.1(i)–(iii),
(Eˆ [fX (X )θ (Z )|Xi ] − E [fX (X )θ (Z )|Xi ])2
n 2−
n i=1
(Eˆ [θ (Z )(fX (X ) − fˆX (X ))|Xi ])2 .
Let ϵi (θ ) (69)
where the third equality holds for some X¯ i (u) a convex combination of Xi and Xi − hu by a pointwise Taylor expansion, Kh being ∏ (k) of order r and the support of k Kh (·, Xi ) ⊂ Γ (Xi ) by Assumption 3.2(iii). That the final result in (69) holds almost surely in Xi is the result of ‖X¯ i (u) − Xi ‖ ≤ h‖u‖ for all u, ‖f ‖r ,∞ < ∞, the support of Kh (·, Xi ) being contained in a compact interval K by Assumption 3.2(iii) and Kh (u, t ) being uniformly bounded. By similar arguments and exploiting that Kh and fX are uniformly bounded and Kh (·, Xi ) is compactly supported
= fX (Xi )θ (Zi ) − E [fX (X )θ (Z )|Xi ] and ϵ(θ ) (ϵ1 (θ ), . . . , ϵn (θ ))′ . It then follows that: sup Θn
(73)
=
n 1−
n i=1
(Eˆ [fX (X )θ (Z )|Xi ] − E¯ [fX (X )θ (Z )|Xi ])2 ′
≤ 2 sup E [pkn (X )(P ′ P )−1 P ′ ϵ(θ )ϵ ′ (θ )P (P ′ P )−1 pkn (X )] Θn
≤ 2 sup ‖fX θ‖2∞ × trace {(P ′ P )−1 P ′ P (P ′ P )−1 } Θn
(74)
where the first inequality holds by (72) with probability tending to ′ one, and the second by E [pkn (X )pkn (X )] = I, the i.i.d. assumption,
A. Santos / Journal of Econometrics 161 (2011) 129–146
E [ϵi2 (θ )] ≤ ‖fX θ ‖2∞ and the law of conditional expectations. Let γn be the largest eigenvalue of n(P ′ P )−1 and note trace {(P ′ P )−1 n} ≤ kn γn . It is shown in Theorem 1 in Newey (1997) that γn = Op (1) and hence, supΘ ‖fX θ ‖2∞ < ∞ by Assumptions 3.1(ii) and 3.4(i), and result (74) then imply: n 1−
sup
n i =1
Θn
(Eˆ [fX (X )θ (Z )|Xi ] − E¯ [fX (X )θ (Z )|Xi ])2 = Op
kn
n
. (75)
− 2dγ
n 1−
n i =1
Θ
(E¯ [fX (X )θ (Z )|Xi ] − E [fX (X )θ (Z )|Xi ])2 = Op (kn
x
). (76)
Finally, using the property of least squares projections and Assumption 3.4(i) implying supΘ ‖θ‖2∞ < ∞ yields,
n i =1
Θn
(Eˆ [θ (Z )(fˆX (X ) − fX (X ))|Xi ])2 2
≤ sup ‖θ‖∞ × Θ
= Op ((nh )
dx −1
+ h2r )
(77)
by Markov’s inequality and Lemma B.1(ii). The first claim then follows from (73) and (75)–(77). For the second claim of the lemma, let ϵ¯i (θ ) = E [ν(X ) − fX (X )θ (Z )|Xi ] − (πn (θ0 ) − πn (θ ))pkn (Xi ) and define ϵ¯ (θ ) = (¯ϵ1 (θ), . . . , ϵ¯n (θ ))′ . Exploiting Assumption 3.3(ii) and E [pkn (X ) ′ pkn (X )] = I we then obtain that: sup E [(E [ν(X ) − fX (X )θ (Z )|X ] − E¯ [ν(X ) − fX (X )θ (Z )|X ])2 ] Θn
− 2dγ
′
≤ sup 2E [(pkn (X )(P ′ P )−1 P ′ ϵ¯ (θ ))2 ] + O(kn Θn
− 2dγ
Θn
x
x
)
− 2dγ
Θn
with probability approaching one. The first claim the follows by (79) and Lemma B.2(ii). Similarly, for the second claim of the corollary, exploit Lemma B.2(ii) and (72) to obtain that sup E [(E [ν(X ) − fX (X )θ (Z )|X ])2 ] − 2dγ
≤ sup 2E [(E¯ [ν(X ) − fX (X )θ (Z )|X ])2 ] + Op (kn ≤ sup Θn
x
− 2dγ
) = Op (kn
x
)
(78)
− 2dγ
and 3.4(i) hold, cn = knn + kn x + (nhdx )−1 + h2r . Then, uniformly in Θn with probability tending to one (i) Qn (θ ) ≤ 16Q (θ ) + Op (cn ) and (ii) Q (θ ) ≤ 16Qn (θ ) + Op (cn ). Proof. First apply (75) and (77) together with E¯ [ν(X )|Xi ] Eˆ [ν(X )|Xi ] and (72) to obtain
x
)
(80)
with probability tending to one. The result then follows by combining (80) with (75) and (77).
Q (θ ) = E [(E [ν(X ) − fX (X )θ (Z )|X ])2 ]
= E [fX2 (X )(E [θ0 (Z ) − θ (Z )|X ])2 ] ≤ E [fX2 (X )] × inf ‖θ0 − θ ‖2∞ Θ0
(81)
implying condition (iv) holds with C2 = E [fX2 (X )] and κ1 = 2. The first claim then follows by Theorem B.1. For the second claim of the theorem, note that fX bounded implies ‖ · ‖w . ‖ · ‖∞ . Hence, conditions (i)–(iii) are verified by the above discussion for the case ‖ · ‖ = ‖ · ‖∞ . Since Q (θ ) = ‖θ −θ0 ‖2w by (81), conditions (iv)–(v) are immediately verified with κ1 = κ2 = 2 and C1 = C2 = 1. Hence, Theorem B.1 yields (82)
=
Since bn /an /δn → ∞ the second claim of the theorem then follows. Proof of Proposition 3.1. We verify Assumptions 3.1–3.4 and apply Theorem 3.1 to establish the result. Since fX is continuous, it is also bounded on [0, 1] while fX′ being Lipschitz implies ‖f ‖∞,2 < ∞. Hence condition (i) and (X , Z ) ∈ [0, 1]2 implies Assumption 3.1 holds. Kh is of order r ≥ 2 and since ‖f ‖∞,α < ∞ for all α ≤ 2, Assumption 3.2(i)–(iv) are implied by the parameter choices in condition (ii). Furthermore, supΘ ‖θ ‖∞ < ∞ and the dominated convergence theorem allowing exchanging of differentiation and integration imply:
d E [θ (Z )|x1 ] − d E [θ (Z )|x2 ] dx dx ∫ d d . fZ |X (z |x1 ) − fZ |X (z |x2 ) dz dx
n
n i =1
(E¯ [ν(X ) − fX (X )θ (Z )|Xi ])2 + Op (kn
√
Corollary B.1. Let Assumptions 3.1(i)–(iii), 3.2(i)–(iii), 3.3(i)–(iii)
(Eˆ [ν(X ) − fˆX (X )θ (Z )|Xi ])2
n i =1
)
− 2dγ
n 4−
where the final inequality is a result of γn , the largest eigenvalue of n(P ′ P )−1 , converging in probability to one. The equality in (78) in turn follows from Assumption 3.3(ii) and P (P ′ P )−1 P ′ being idempotent.
1−
x
Θn
ˆ 0 , Θ0 , ‖ · ‖w ) = Op (max{ bn /an , δn }). dH ( Θ
) ≤ n− 1 γ n
× sup ϵ¯ ′ (θ )P (P ′ P )−1 P ′ ϵ¯ (θ ) + O(kn
(79)
turn, condition (iii) holds with c2n = knn + kn x +(nhdx )−1 + h2r and C1 = 16 by Corollary B.1, while the law of conditional expectations yields
(fˆX (Xi ) − fX (Xi ))2
≤ sup 2‖(P ′ P )−1 P ′ ϵ¯ (θ )‖2 + O(kn
+ (nhdx )−1 + h2r
n
− 2dγ
n 1−
n i=1
kn
Proof of Theorem 3.1. We first show consistency under dH (·, ·, ‖· ‖∞ ) by verifying the conditions of Theorem B.1. Conditions (i) and (ii) hold for ‖ · ‖ = ‖ · ‖∞ and c1n = δn by Assumption 3.4(i)–(ii). In
n 1−
sup
+ Op
139
Θn
In addition, it follows from Assumptions 3.1(i)–(ii), 3.3(i)–(ii) and Lemma A.1(B) in Ai and Chen (2003) that: sup
∫ .
dx
1
G(z )dz × |x1 − x2 |.
(83)
0
≤ sup Θn
+ Op
n 4−
n i =1 kn n
(E¯ [ν(X ) − fX (X )θ (Z )|Xi ])2 + (nhdx )−1 + h2r
≤ sup 8E [(E¯ [ν(X ) − fX (X )θ (Z )|X ])2 ] Θn
It follows that supΘ ‖E [θ (Z )|·]‖∞,2 < ∞ and supΘ ‖fX (·)E [θ (Z )|·]‖∞,2 < ∞. The approximation error by splines is then 2 2 O(k− . kn , the parameter n ); see Chen (2007). Since ξ0n values in condition (iii) verify Assumption 3.3. By construction, 1 supΘ ‖θ ‖∞,1 ≤ B and Assumption 3.4 is verified with δn ≍ q− n ; √ α see Chen (2007). Since an ≍ n with α > n, and bn ≍ log n, the claim of the proposition immediately follows by Theorem 3.1.
140
A. Santos / Journal of Econometrics 161 (2011) 129–146
Proof of Theorem 3.2. By Example 3.1 in Stinchcombe and White (1992), the estimator θ̂ is measurable and hence a well defined random variable. If the model is identified, so that Θ0 = {θ0}, then by (ii) we have

‖θ̂ − θ0‖ ≤ dH(Θ̂0, Θ0, ‖·‖) = op(1).  (84)

Therefore, without loss of generality we assume Θ0 is not a singleton. Let Nϵ(θ0) be an ϵ open neighborhood of θ0 so that Θ0 ∩ Nϵᶜ(θ0) ≠ ∅. Since Θ0 ∩ Nϵᶜ(θ0) ⊆ Θ is closed, it is also compact by virtue of Θ being compact. Therefore, the continuity of M : Θ → R and θ0 being the unique minimum on Θ0 imply that

δ ≡ min_{Θ0 ∩ Nϵᶜ(θ0)} M(θ) − M(θ0) > 0.  (85)

Since (i) and (ii) imply M : Θ → R is uniformly continuous in Θ, there exists a ζ such that ‖θ1 − θ2‖ < ζ implies |M(θ1) − M(θ2)| < δ/4. Letting Θ0^ζ denote the closed ζ blowup of Θ0, we then obtain:

min_{Θ0^ζ ∩ Nϵᶜ(θ0)} M(θ) − M(θ0) > (3/4)δ.  (86)

Therefore, since dH(Θ̂0, Θ0, ‖·‖) < ζ implies Θ̂0 ⊂ Θ0^ζ, it follows from (86) that:

P(‖θ̂ − θ0‖ < ϵ) ≥ P(M(θ̂) − M(θ0) ≤ 3δ/4; dH(Θ̂0, Θ0, ‖·‖) < ζ).  (87)

Let θp ∈ arg inf_{Θ̂0} ‖θ0 − θ‖. If dH(Θ̂0, Θ0, ‖·‖) < ζ, then ‖θ0 − θp‖ < ζ and hence M(θp) ≤ M(θ0) + δ/4. By definition of θ̂, Mn(θ̂) ≤ Mn(θp), which implies M(θ̂) − M(θp) ≤ |M(θ̂) − Mn(θ̂)| + |M(θp) − Mn(θp)|. Thus, M(θp) ≤ M(θ0) + δ/4 and |M(θ̂) − Mn(θ̂)| + |M(θp) − Mn(θp)| ≤ δ/4 imply that M(θ̂) − M(θ0) ≤ δ/2. Therefore, using (87), we get that:

P(‖θ̂ − θ0‖ < ϵ) ≥ P(|M(θ̂) − Mn(θ̂)| + |M(θp) − Mn(θp)| ≤ δ/4; dH(Θ̂0, Θ0, ‖·‖) < ζ).  (88)

Therefore, (88), (iii), (iv) and θ̂, θp ∈ Θ̂0 ⊆ Θ establish the theorem.

Proof of Corollary 3.1. Since Θ is compact under ‖·‖∞ and Θ0 is closed, it is also compact. Hence, the convexity of Θ0 and strict convexity of M : Θ → R imply that a unique minimum is attained. The first claim of the corollary then follows by Theorems 3.1 and 3.2. The second claim is implied by Theorem 3.1.

Proof of Proposition 3.2. Note Θ = {θ : ‖θ‖∞,1 ≤ B} is convex, while M : Θ → R is strictly convex on Θ. Next, notice that since the θ ∈ Θ are uniformly bounded, we obtain for some K > 0:

N[](ϵ, Θ², ‖·‖∞) ≤ N[](ϵ/K, Θ, ‖·‖∞) < ∞,  (89)

where the final inequality holds by Theorem 2.7.1 in van der Vaart and Wellner (1996). Therefore, Θ² is a Glivenko–Cantelli class, and by Theorem 2.4.1 in van der Vaart and Wellner (1996) we can conclude:

sup_Θ |(1/n) Σ_{i=1}^n θ²(Zi) − E[θ²(Z)]| = op(1).  (90)

Hence, we have verified the conditions of Corollary 3.1 and the result follows by Proposition 3.1.

Lemma B.3. Let Xn be the σ-field generated by {Xi} and {Win} be a triangular array of random variables that is measurable with respect to Xn. If n^{−1} Σ_i Win² = Op(1), ‖θ̂ − θ0‖∞ = op(1) and in addition Assumptions 3.1(i), (iii), and 3.4(i) hold, then it follows that:

(1/√n) Σ_{i=1}^n Win{θ̂(Zi) − θ0(Zi) − E[θ̂(Z) − θ0(Z)|Xi]} = op(1).  (91)

Proof. Define the class Fn^{ςn} = {θ − θ0 − E[θ(Z) − θ0(Z)|·] : θ ∈ Θ and ‖θ − θ0‖∞ ≤ ςn}, where ςn ↘ 0 is such that ‖θ̂ − θ0‖∞ = op(ςn). Notice that since {Win} is measurable with respect to Xn, E[f(Xi, Zi)Win] = 0 for any f ∈ Fn^{ςn}, and that Zi, Zj are conditionally independent given Xn for all i ≠ j. Applying Markov's inequality for conditional expectations, ‖θ̂ − θ0‖∞ = op(ςn) and Lemma 2.3.6 in van der Vaart and Wellner (1996) then yields:

P( |(1/√n) Σ_{i=1}^n Win{θ̂(Zi) − θ0(Zi) − E[θ̂(Z) − θ0(Z)|Xi]}| > η | Xn )
  ≤ (2/η) E[ sup_{Fn^{ςn}} |(1/√n) Σ_{i=1}^n Win f(Xi, Zi)ϵi| | Xn ] + o(1),  (92)

where {ϵi} are i.i.d. Rademacher random variables independent of {Xi, Zi}_{i=1}^n. Next, define the semimetric on Fn^{ςn}:

‖f‖n = √Dn × ‖f‖∞,  (93)

where Dn = n^{−1} Σ_i Win². Let Eϵ[·] denote expectation over {ϵi}, use Corollary 2.2.8 in van der Vaart and Wellner (1996), notice that the diameter of Fn^{ςn} under ‖·‖n is less than or equal to 2ςn√Dn, and exploit (93) to conclude:

Eϵ[ sup_{Fn^{ςn}} |(1/√n) Σ_{i=1}^n Win f(Xi, Zi)ϵi| ] ≲ ∫_0^∞ √(log N(ϵ, Fn^{ςn}, ‖·‖n)) dϵ ≤ ∫_0^{2ςn√Dn} √(log N(ϵ/√Dn, Fn^{ςn}, ‖·‖∞)) dϵ.  (94)

Recall that for any class F and norm ‖·‖, N(ϵ, F, ‖·‖) ≤ N[](2ϵ, F, ‖·‖). Since in addition N[](ϵ, Fn^{ςn}, ‖·‖∞) ≤ N[](ϵ/2, Θ, ‖·‖∞), (94) and Theorem 2.7.1 in van der Vaart and Wellner (1996) in turn imply that:

Eϵ[ sup_{Fn^{ςn}} |(1/√n) Σ_{i=1}^n Win f(Xi, Zi)ϵi| ] ≲ ∫_0^{2ςn√Dn} (√Dn/ϵ)^{dz/2m} dϵ ≲ √Dn × (2ςn)^{1 − dz/2m}.  (95)

Thus, since ςn ↘ 0, m > dz/2 and Dn = Op(1) by hypothesis, the desired claim follows from (92) and (95).

Lemma B.4. Let Xn be the σ-field generated by {Xi} and {Win}, {W̃in} be triangular arrays of random variables that are measurable with respect to Xn such that n^{−1/2} Σ_i (Win − W̃in)² = Op(1). If θ̂ ∈ Θn satisfies ‖θ̂ − θ0‖∞ = op(1), ‖θ̂ − θ0‖w = op(n^{−1/4}) and Assumptions 3.1(i)–(iii), 3.2(i)–(iv), and 3.4(i) hold, then:

(1/√n) Σ_{i=1}^n Win{ν(Xi) − θ̂(Zi)f̂X(Xi)} = (1/√n) Σ_{i=1}^n W̃in{ν(Xi) − θ̂(Zi)f̂X(Xi)} + op(1).  (96)
Proof. First observe that since the θ ∈ Θn are uniformly bounded by Assumption 3.4(i), the Cauchy–Schwarz inequality yields:

|(1/n) Σ_i (Win − W̃in)(f̂X(Xi) − fX(Xi))θ̂(Zi)| ≲ [(1/n) Σ_i (f̂X(Xi) − fX(Xi))²]^{1/2} × [(1/n) Σ_i (Win − W̃in)²]^{1/2} = op(n^{−1/2}),  (97)

where the final result is implied by (77), Assumption 3.2(iv) and n^{−1/2} Σ_i (Win − W̃in)² = Op(1) by hypothesis. Since fX is bounded, n^{−1/2} Σ_i (Win − W̃in)² fX²(Xi) = Op(1) as well, and hence by Lemma B.3 we can conclude that:

(1/√n) Σ_i (Win − W̃in)fX(Xi)θ̂(Zi) = (1/√n) Σ_i (Win − W̃in)fX(Xi){E[θ̂(Z) − θ0(Z)|Xi] + θ0(Zi)} + op(1).  (98)

Further observe, however, that since fX is bounded, we obtain by the Cauchy–Schwarz inequality that:

|(1/n) Σ_i (Win − W̃in)fX(Xi)E[θ̂(Z) − θ0(Z)|Xi]| ≲ [(1/n) Σ_i (Win − W̃in)²]^{1/2} × [(1/n) Σ_i (E[θ̂(Z) − θ0(Z)|Xi])²]^{1/2} = op(n^{−1/2}),  (99)

where the final result follows by Markov's inequality and ‖θ̂ − θ0‖w = op(n^{−1/4}). Hence, from (97)–(99) we conclude:

(1/√n) Σ_i (Win − W̃in)(ν(Xi) − θ̂(Zi)f̂X(Xi)) = (1/√n) Σ_i (Win − W̃in)(ν(Xi) − θ0(Zi)fX(Xi)) + op(1).  (100)

Further, since E[θ0(Z)fX(X)|Xi] = ν(Xi), and {Win}, {W̃in} are both measurable with respect to Xn:

E[ ((1/√n) Σ_i (Win − W̃in)(ν(Xi) − θ0(Zi)fX(Xi)))² | Xn ] = (1/n) Σ_i (Win − W̃in)² E[(ν(X) − θ0(Z)fX(X))²|Xi] = op(1),  (101)

since ν, fX and θ0 are all bounded by hypothesis. The conclusion of the lemma then follows from (100), (101) and Markov's inequality for conditional expectations.

Lemma B.5. Let F satisfy ‖f‖∞ ≤ M and E[f(Z, X)|X] = 0 for all f ∈ F, and ∫_0^∞ √(log N(ϵ, F, ‖·‖∞)) dϵ < ∞. If Ē[θ(Z)|x] = p^{kn}(x)′(P′P)^{−1} Σ_i p^{kn}(Xi)E[θ(Z)|Xi] and Assumptions 3.1(i)–(iii), 3.3(i)–(iii) and 4.1(ii) hold:

sup_{F,Θn} |(1/√n) Σ_{i=1}^n {Ê[θ(Z)|Xi] − Ē[θ(Z)|Xi]} f(Zi, Xi)| = op(1).  (102)

Proof. Define ϵi(θ) = θ(Zi) − E[θ(Z)|Xi], ϵ(θ) = (ϵ1(θ), ..., ϵn(θ))′, β(θ) = (P′P)^{−1}P′ϵ(θ) and notice that:

sup_{F,Θn} |(1/√n) Σ_i {Ê[θ(Z)|Xi] − Ē[θ(Z)|Xi]} f(Zi, Xi)| = sup_{F,Θn} |(1/√n) Σ_i p^{kn}(Xi)′β(θ) f(Zi, Xi)|.  (103)

Since the smallest eigenvalue of E[p^{kn}(X)p^{kn}(X)′] is bounded away from zero, (72) and (75) imply:

sup_{Θn} ‖β(θ)‖² = sup_{Θn} ϵ′(θ)P(P′P)^{−2}P′ϵ(θ) ≲ sup_{Θn} E[ϵ′(θ)P(P′P)^{−1} p^{kn}(X)p^{kn}(X)′(P′P)^{−1}P′ϵ(θ)] ≲ sup_{Θn} (1/n) Σ_i (Ê[θ(Z)|Xi] − Ē[θ(Z)|Xi])² = Op(kn/n),  (104)

so that sup_{Θn} ‖β(θ)‖ ≤ ςn with probability tending to one whenever ςn ↘ 0 satisfies ςn√(n/kn) → ∞. For such ςn, let Gn^{ςn} = {p^{kn}′β : ‖β‖ ≤ ςn}. By results (103) and (104), together with Markov's inequality, we are then able to conclude that:

P( sup_{F,Θn} |(1/√n) Σ_i {Ê[θ(Z)|Xi] − Ē[θ(Z)|Xi]} f(Zi, Xi)| > η )
  ≤ (1/η) E[ sup_{F,Gn^{ςn}} |(1/√n) Σ_i g(Xi) f(Xi, Zi)| ] + o(1)
  ≲ M ςn ξ0n ∫_0^1 √(1 + log N(ϵ ςn ξ0n M, Gn^{ςn} × F, ‖·‖∞)) dϵ,  (105)

where the second inequality follows by Theorem 2.14.1 in van der Vaart and Wellner (1996) and noting that ‖gf‖∞ ≤ M ςn ξ0n uniformly in g ∈ Gn^{ςn} and f ∈ F. Note that every gj ∈ Gn^{ςn} is of the form gj = p^{kn}′βj; apply the triangle and Cauchy–Schwarz inequalities to obtain the bound:

‖g1 f1 − g2 f2‖∞ ≤ ‖g1 − g2‖∞ M + ‖g1‖∞ ‖f1 − f2‖∞ ≤ ξ0n ‖β1 − β2‖ M + ξ0n ςn ‖f1 − f2‖∞.  (106)

Let Bn^{ςn} be a sphere of radius ςn in R^{kn}. We are then able to conclude from (106) that:

N(ϵ, Gn^{ςn} × F, ‖·‖∞) ≤ N(ϵ/(M ξ0n), Bn^{ςn}, ‖·‖) × N(ϵ/(ξ0n ςn), F, ‖·‖∞).  (107)

Since N(ϵ, Bn^{ςn}, ‖·‖) ≤ (2ςn/ϵ)^{kn}, we can combine (107) with (105) and use finiteness of the resulting integral to obtain:

E[ sup_{F,Gn^{ςn}} |(1/√n) Σ_i g(Xi) f(Xi, Zi)| ] ≲ ςn ξ0n ∫_0^1 √(1 + kn log(2/ϵ) + log N(ϵM, F, ‖·‖∞)) dϵ ≲ √kn ςn ξ0n.  (108)

To conclude, notice that ξ0n√kn/√n → 0 implies we may choose ςn so that √kn ξ0n ςn → 0, and the claim of the lemma follows from (108) and Markov's inequality.
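The arguments above repeatedly manipulate the least-squares projection matrix P(P′P)^{−1}P′ and fitted values of the form Ē[w(X, Z)|x] = p^{kn}(x)′(P′P)^{−1} Σ_i p^{kn}(Xi)wi. As a numerical point of reference only, here is a minimal sketch with an illustrative polynomial sieve — the sieve choice, dimensions and variable names are our assumptions, not the paper's — including the idempotency used in (79) and (118):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k_n = 200, 4                       # sample size and sieve dimension (illustrative)
X = rng.uniform(size=n)
W = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(n)   # targets to project

P = np.vander(X, k_n, increasing=True)        # rows p^{kn}(X_i): polynomial sieve
PtP_inv = np.linalg.inv(P.T @ P)
M = P @ PtP_inv @ P.T                         # projection matrix P(P'P)^{-1}P'

assert np.allclose(M @ M, M)                  # idempotency exploited in the proofs

def E_bar(x):
    """Series least-squares fit evaluated at a point x."""
    p_x = np.vander(np.atleast_1d(x), k_n, increasing=True)
    return (p_x @ PtP_inv @ P.T @ W).squeeze()

print(E_bar(0.25), np.sin(2 * np.pi * 0.25))  # fitted vs. true conditional mean
```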
Lemma B.6. Let ṽ ∈ V ∩ Θ, u = ±ṽ and un = ±Πn u. If Assumptions 3.1(i)–(iii), 3.2(i)–(iv), 3.3(i)–(iii), 3.4(i)–(ii), 4.1(i)–(ii) hold, and θ̂ ∈ Θn satisfies ‖θ̂ − θ0‖w = op(n^{−1/4}) and ‖θ̂ − θ0‖∞ = op(1), then:

(i) n^{−1} Σ_i E[u(Z)fX(X)|Xi] m̂(Xi, θ̂) = n^{−1} Σ_i E[u(Z)fX(X)|Xi](ν(Xi) − θ̂(Zi)f̂X(Xi)) + op(n^{−1/2});
(ii) n^{−1} Σ_i E[u(Z)fX(X)|Xi](ν(Xi) − θ̂(Zi)f̂X(Xi)) = n^{−1} Σ_i Ê[un(Z)|Xi] f̂X(Xi)(ν(Xi) − θ̂(Zi)f̂X(Xi)) + op(n^{−1/2});
(iii) ⟨u, θ0 − θ̂⟩w = n^{−1} Σ_i E[u(Z)fX(X)|Xi](m̂(Xi, θ̂) − m̂(Xi, θ0)) + op(n^{−1/2}).
Proof. For any function w : [0, 1]^{dx} × Z → R, recall Ē[w(X, Z)|x] = p^{kn}(x)′(P′P)^{−1} Σ_i p^{kn}(Xi)E[w(X, Z)|Xi]. In order to establish the first claim of the lemma, notice that by rearranging terms it follows that:

(1/√n) Σ_i E[u(Z)fX(X)|Xi]( m̂(Xi, θ̂) − (ν(Xi) − θ̂(Zi)f̂X(Xi)) ) = (1/√n) Σ_i (E[u(Z)fX(X)|Xi] − Ē[u(Z)fX(X)|Xi])(ν(Xi) − θ̂(Zi)f̂X(Xi)).  (109)

Hence, (76), Assumption 4.1(ii) and Lemma B.4 establish the desired result. In order to establish the second claim of the lemma, we first aim to establish that:

(1/√n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi]) f̂X(Xi)(ν(Xi) − θ̂(Zi)f̂X(Xi)) = op(1).  (110)

For this purpose, first notice that Cauchy–Schwarz, ν bounded and results (75) and (77) establish that:

|(1/n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi])(f̂X(Xi) − fX(Xi))ν(Xi)| ≲ [(1/n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi])²]^{1/2} × [(1/n) Σ_i (f̂X(Xi) − fX(Xi))²]^{1/2} = op(n^{−1/2}),  (111)

where for the final result we have exploited Assumptions 3.2(iv) and 4.1(ii). Similarly, we obtain that:

|(1/n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi])(f̂X²(Xi) − fX²(Xi))θ̂(Zi)| ≲ [(1/n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi])²]^{1/2} × [(1/n) Σ_i {(f̂X(Xi) − fX(Xi))² + (f̂X(Xi) − fX(Xi))⁴}]^{1/2} = op(n^{−1/2}).  (112)

Through calculations as in Lemma B.1, it is further possible to establish that, almost surely in Xi:

E[(f̂X(Xi) − fX(Xi))⁴ | Xi] = O((nh^{dx})^{−2} + h^{4r}).  (113)

Hence, by (111)–(113) together with (75) and Assumptions 3.2(iv) and 4.1(ii), we obtain:

(1/√n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi]) f̂X(Xi)(ν(Xi) − θ̂(Zi)f̂X(Xi)) = (1/√n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi]) fX(Xi)(ν(Xi) − θ̂(Zi)fX(Xi)) + op(1).  (114)

In turn, since ∫_0^∞ √(log N(ϵ, Θ, ‖·‖∞)) dϵ < ∞ by Theorem 2.7.1 in van der Vaart and Wellner (1996) and the θ ∈ Θ are uniformly bounded, we are able to conclude by Lemma B.5 that:

(1/√n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi]) fX²(Xi){θ̂(Zi) − θ0(Zi) − E[θ̂(Z) − θ0(Z)|Xi]} = op(1).  (115)

Similarly, also notice that a second application of Lemma B.5 additionally implies that:

(1/√n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi]) fX(Xi)(ν(Xi) − θ0(Zi)fX(Xi)) = op(1).  (116)

Next observe that the Cauchy–Schwarz inequality, (75) and Assumption 4.1(ii) together with ‖θ̂ − θ0‖w = op(n^{−1/4}) and Markov's inequality allow us to obtain:

|(1/n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi]) fX²(Xi) E[θ̂(Z) − θ0(Z)|Xi]| ≤ [(1/n) Σ_i (Ê[un(Z)|Xi] − Ē[un(Z)|Xi])²]^{1/2} × [(1/n) Σ_i fX²(Xi)(E[θ̂(Z) − θ0(Z)|Xi])²]^{1/2} = op(n^{−1/2}).  (117)

Claim (110) then follows from (114), (115), (116) and (117). Let εi(θ) = E[θ(Z)|Xi] − π̃n(θ)′p^{kn}(Xi) and ε(θ) = (ε1(θ), ..., εn(θ))′. Exploiting Assumption 3.3(ii), that the largest eigenvalue of n(P′P)^{−1} is bounded in probability and that the matrix P(P′P)^{−1}P′ is idempotent, we are then able to establish:

sup_{Θ,[0,1]^{dx}} (Ē[θ(Z)|x] − E[θ(Z)|x])² ≤ sup_{Θ,[0,1]^{dx}} 2(p^{kn}(x)′(P′P)^{−1}P′ε(θ))² + O(kn^{−2γ/dx}) = Op(ξ0n² × kn^{−2γ/dx}).  (118)

Since sup_{Θ,[0,1]^{dx}} |E[θ(Z)|x]| ≤ sup_Θ ‖θ‖∞ < ∞, (118) and Assumption 4.1(ii) imply sup_Θ ‖Ē[θ(Z)|·]‖∞ = Op(1). Therefore, result (77) and Assumption 3.2(iv) imply the following result:

(1/n) Σ_i (Ē[un(Z)|Xi])²(f̂X(Xi) − fX(Xi))² ≤ sup_{Θ,[0,1]^{dx}} ‖Ē[θ(Z)|x]‖²∞ × (1/n) Σ_i (f̂X(Xi) − fX(Xi))² = op(n^{−1/2}).  (119)

Furthermore, also notice that Assumptions 4.1(i)–(ii), 3.4(ii) and fX bounded imply that:

(1/√n) Σ_i (E[u(Z) − un(Z)|Xi] fX(Xi))² ≲ √n ‖u − un‖²∞ = o(1).  (120)

Exploiting (119), (76), (120) together with Assumption 4.1(ii) and three successive applications of Lemma B.4 yields:

(1/√n) Σ_i Ē[un(Z)|Xi] f̂X(Xi)(ν(Xi) − θ̂(Zi)f̂X(Xi))
 = (1/√n) Σ_i Ē[un(Z)|Xi] fX(Xi)(ν(Xi) − θ̂(Zi)f̂X(Xi)) + op(1)
 = (1/√n) Σ_i E[un(Z)|Xi] fX(Xi)(ν(Xi) − θ̂(Zi)f̂X(Xi)) + op(1)
 = (1/√n) Σ_i E[u(Z)|Xi] fX(Xi)(ν(Xi) − θ̂(Zi)f̂X(Xi)) + op(1),  (121)

which together with (110) implies the second claim of the lemma. In order to establish the third claim of the lemma, we rearrange as in (109), and exploit Lemma B.3 together with E[f̂X²(Xi)] < ∞ and sup_{[0,1]^{dx}} |Ē[u(Z)fX(X)|x]| = Op(1) so as to derive that:

(1/√n) Σ_i E[u(Z)fX(X)|Xi]( m̂(Xi, θ̂) − m̂(Xi, θ0) )
 = (1/√n) Σ_i Ē[u(Z)fX(X)|Xi] f̂X(Xi)(θ0(Zi) − θ̂(Zi))
 = (1/√n) Σ_i Ē[u(Z)fX(X)|Xi] f̂X(Xi) E[θ0(Z) − θ̂(Z)|Xi] + op(1).  (122)

Since fX is bounded from above and below, ‖θ̂ − θ0‖²w ≍ E[(E[θ̂(Z) − θ0(Z)|X])²]. Hence, ‖θ̂ − θ0‖w = op(n^{−1/4}), sup_{[0,1]^{dx}} |Ē[u(Z)fX(X)|x]| = Op(1), the Cauchy–Schwarz inequality, (77) and (nh^{dx})^{−1} + h^{2r} = o(n^{−1/2}) imply:

|(1/√n) Σ_i Ē[u(Z)fX(X)|Xi](f̂X(Xi) − fX(Xi)) E[θ0(Z) − θ̂(Z)|Xi]|
 ≤ [(1/√n) Σ_i (f̂X(Xi) − fX(Xi))²]^{1/2} × [(1/√n) Σ_i (E[θ̂(Z) − θ0(Z)|Xi])²]^{1/2} × Op(1) = op(1).  (123)

Similarly, exploiting (76), Assumption 4.1(ii), fX being bounded and the Cauchy–Schwarz inequality, we obtain:

|(1/√n) Σ_i (Ē[u(Z)fX(X)|Xi] − E[u(Z)fX(X)|Xi]) fX(Xi) E[θ0(Z) − θ̂(Z)|Xi]|
 ≤ [(1/√n) Σ_i (Ē[u(Z)fX(X)|Xi] − E[u(Z)fX(X)|Xi])²]^{1/2} × [(1/√n) Σ_i (E[θ̂(Z) − θ0(Z)|Xi])²]^{1/2} = op(1).  (124)

Therefore, by combining the results from (122)–(124) we are able to conclude that:

(1/√n) Σ_i E[u(Z)fX(X)|Xi]( m̂(Xi, θ̂) − m̂(Xi, θ0) ) = (1/√n) Σ_i E[u(Z)fX(X)|Xi] E[(θ0(Z) − θ̂(Z))fX(X)|Xi] + op(1).  (125)

Next define the class F = {E[u(Z)fX²(X)|·] E[θ0(Z) − θ(Z)|·] : θ ∈ Θ}. Note that by Assumptions 3.1(ii), 3.4(i) and 4.1(i), E[u(Z)fX²(X)|x] is bounded. Therefore, it follows that:

sup_{x∈[0,1]^{dx}} |E[u(Z)fX²(X)|x] E[θ0(Z) − θ1(Z)|x] − E[u(Z)fX²(X)|x] E[θ0(Z) − θ2(Z)|x]| ≲ ‖θ1 − θ2‖∞.  (126)

Hence, from (126), it follows that N[](ϵ, F, ‖·‖∞) ≤ N[](ϵ/M, Θ, ‖·‖∞) for some M > 0. Theorems 2.7.1 and 2.5.6 in van der Vaart and Wellner (1996) together with m > dz/2 then imply F is a Donsker class. Thus, since ‖θ̂ − θ0‖∞ = op(1) implies sup_{[0,1]^{dx}} |E[u(Z)fX²(X)|x] E[θ0(Z) − θ̂(Z)|x]| = op(1), we obtain that:

(1/√n) Σ_i { E[u(Z)fX(X)|Xi] E[(θ0(Z) − θ̂(Z))fX(X)|Xi] − ⟨u, θ0 − θ̂⟩w } = op(1).  (127)
The third claim of the lemma therefore follows from (125) and (127).

Lemma B.7. Suppose Assumptions 3.1(i)–(iii), 3.2(i)–(v) and 4.1(i) hold. It then follows that:

(1/√n) Σ_{i=1}^n E[ṽ(Z)fX(X)|Xi] θ0(Zi)(f̂X(Xi) − fX(Xi)) = (1/√n) Σ_{i=1}^n ( E[ṽ(Z)fX(X)|Xi]ν(Xi) − E[ṽ(Z)fX(X)ν(X)] ) + op(1).  (128)
Proof. For notational simplicity, let g(x, z) = E[ṽ(Z)fX(X)|x]θ0(z), Wi = (Xi, Zi) and define the kernel:

Hn(Wi, Wj) = g(Xi, Zi){ K̃h((Xi − Xj)/h, Xi) − E[f̂X(Xi)|Xi] h^{dx} } + g(Xj, Zj){ K̃h((Xj − Xi)/h, Xj) − E[f̂X(Xj)|Xj] h^{dx} },

where K̃h((Xi − Xj)/h, Xi) = ∏_{k=1}^{dx} Kh((Xi^{(k)} − Xj^{(k)})/h, Xi^{(k)}), so that E[Hn(Wi, Wj)] = 0 by construction.  (129)

Since g(x, z) is bounded, Lemma B.1(i) then implies that:

|E[f̂X(Xi)|Xi] − fX(Xi)| = O(h^r),  (130)

and hence (1/√n) Σ_i |g(Xi, Zi)(E[f̂X(Xi)|Xi] − fX(Xi))| = O(√n h^r). Therefore, since h^r n^{1/2} = o(1) by Assumption 3.2(iv), we can conclude:

(1/√n) Σ_i g(Xi, Zi)(f̂X(Xi) − fX(Xi)) = (1/√n) Σ_i g(Xi, Zi)(f̂X(Xi) − E[f̂X(Xi)|Xi]) + o(1) = (2/(n^{3/2} h^{dx})) Σ_i Σ_{j>i} Hn(Wi, Wj) + o(1).  (131)

Next, expand the square, notice that all cross terms have expectation zero and employ Jensen's inequality to obtain:

E[ ( (2/(n^{3/2} h^{dx})) Σ_i Σ_{j>i} { Hn(Wi, Wj) − E[Hn(Wi, Wj)|Wi] − E[Hn(Wi, Wj)|Wj] + E[Hn(Wi, Wj)] } )² ]
 ≤ (4/(n³ h^{2dx})) Σ_i Σ_{j>i} E[Hn²(Wi, Wj)] ≤ (2(n − 1)/(n² h^{2dx})) E[Hn²(Wi, Wj)] = o(1),  (132)

where for the final result we have used E[Hn²(Wi, Wj)] = O(h^{dx}), which follows from (70). Hence, since nh^{dx} → ∞, Chebychev's inequality and (132) together with E[Hn(Wi, Wj)] = 0 from (129) yield:

(2/(n^{3/2} h^{dx})) Σ_i Σ_{j>i} Hn(Wi, Wj) = (1/(√n h^{dx})) Σ_{i=1}^n E[Hn(Wi, Wj)|Wi] + op(1)
 = (1/√n) Σ_{i=1}^n { E[g(Xj, Zj)K̃h((Xj − Xi)/h, Xj)|Xi] h^{−dx} − E[g(Xj, Zj)f̂X(Xj)] } + op(1),  (133)

where the final equality follows by (130) and direct calculation. Moreover, observe that by (130) we also have:

E[g(Xj, Zj)f̂X(Xj)] = E[g(Xj, Zj)fX(Xj)] + O(h^r).  (134)

In addition, letting Γ(Xi) = {u : −Xi/h ≤ u ≤ (1 − Xi)/h} and doing the change of variables u = (Xj − Xi)/h:

lim_{h→0} E[g(Xj, Zj)K̃h((Xj − Xi)/h, Xj)|Xi] h^{−dx} = lim_{h→0} ∫_{Γ(Xi)} E[g(Xj, Zj)|Xi + hu] K̃h(u, Xi + hu) fX(Xi + hu) du = E[g(Xi, Zi)|Xi] fX(Xi),  (135)

where the final equality is established by noting the integrand is uniformly bounded by Assumptions 3.1(ii), 3.2(ii) and 4.1(i), appealing to the dominated convergence theorem and using Assumption 3.2(v). Therefore:

(1/(√n h^{dx})) Σ_{i=1}^n E[Hn(Wi, Wj)|Wi] = (1/√n) Σ_{i=1}^n { E[g(X, Z)|Xi] fX(Xi) − E[g(X, Z)fX(X)] } + op(1)  (136)

by results (133)–(135). The claim of the lemma then follows from (131), (133), (136), Markov's inequality and g(x, z) = E[ṽ(Z)fX(X)|x]θ0(z), which implies E[g(X, Z)|x]fX(x) = E[ṽ(Z)|x]fX(x)ν(x).

Proof of Theorem 4.1. Define the class F = {g(y, z) = yθ(z) : θ ∈ Θ}. By Theorems 2.7.11 and 2.7.1 in van der Vaart and Wellner (1996), it then follows that:

∫_0^∞ √(log N[](ϵ, F, ‖·‖_{L2})) dϵ ≤ ∫_0^∞ √(log N[](ϵ/√(E[Y²]), Θ, ‖·‖∞)) dϵ < ∞.  (137)

Therefore, Theorem 2.5.6 in van der Vaart and Wellner (1996) establishes that F is a Donsker class. Thus, since by assumption ‖θ̂0 − θ0‖∞ = op(1) and in addition E[Yθ0(Z)] = ⟨ν, m0⟩, we can further conclude:

√n{⟨ν, m0⟩^ − ⟨ν, m0⟩} = (1/√n) Σ_{i=1}^n {Yi θ0(Zi) − E[Yθ0(Z)]} + √n E[Y(θ̂0(Z) − θ0(Z))] + op(1).  (138)

In addition, Lemmas B.6(i) and B.7 together with E[ṽ(Z)fX(X)ν(X)] = E[Yθ0(Z)] imply the equality:

(1/√n) Σ_i E[ṽ(Z)fX(X)|Xi] m̂(Xi, θ0) = (1/√n) Σ_i E[ṽ(Z)fX(X)|Xi](ν(Xi) − θ0(Zi)f̂X(Xi)) + op(1)
 = (1/√n) Σ_i { E[Yθ0(Z)] − E[ṽ(Z)fX(X)|Xi] θ0(Zi) fX(Xi) } + op(1).  (139)
Therefore, since E[Y(θ̂0(Z) − θ0(Z))] = ⟨ṽ, θ̂0 − θ0⟩w, combining (138), (139) and Lemma B.6(iii) implies:

√n{⟨ν, m0⟩^ − ⟨ν, m0⟩} = (1/√n) Σ_i {Yi θ0(Zi) − E[ṽ(Z)fX(X)|Xi] fX(Xi) θ0(Zi)} − (1/√n) Σ_i E[ṽ(Z)fX(X)|Xi] m̂(Xi, θ̂0) + op(1).  (140)

Furthermore, by the Cauchy–Schwarz inequality, the definition of Qn(θ) and fX bounded, it follows that:

|(1/√n) Σ_i (E[ṽ(Z)fX(X)|Xi] − E[Πnṽ(Z)fX(X)|Xi]) m̂(Xi, θ̂0)| ≲ n^{1/4} ‖ṽ − Πnṽ‖∞ × n^{1/4} [Qn(θ̂0)]^{1/2} = op(1),  (141)

where the final result follows by Assumptions 4.1(i)–(ii), 3.2(ii), Corollary B.1 and ‖θ̂0 − θ0‖w = op(n^{−1/4}). Similarly,

|(1/√n) Σ_i (E[Πnṽ(Z)fX(X)|Xi] − Ê[Πnṽ(Z)f̂X(X)|Xi]) m̂(Xi, θ̂0)| ≲ n^{1/4} [Qn(θ̂0)]^{1/2} × [(1/√n) Σ_i (E[Πnṽ(Z)fX(X)|Xi] − Ê[Πnṽ(Z)f̂X(X)|Xi])²]^{1/2} = op(1),  (142)

where the final result is implied by Lemma B.2(i). The theorem then follows by (140)–(142).

Proof of Lemma 4.1. As a first step, we obtain a rate of convergence for ‖v̂ − Πnṽ‖w. Towards this end, notice that since C(θ) − C(ṽ) = ‖θ − ṽ‖²w for all θ ∈ Θ, Assumption 3.4(ii) implies that:

‖v̂ − Πnṽ‖²w ≤ 2‖v̂ − ṽ‖²w + O(δn²) = 2{C(v̂) − C(Πnṽ)} + O(δn²).  (143)

Therefore, since by definition Cn(v̂) ≤ Cn(Πnṽ), we can in turn employ (143) to conclude that:

‖v̂ − Πnṽ‖²w ≤ 2{(C(v̂) − Cn(v̂)) − (C(Πnṽ) − Cn(Πnṽ))} + O(δn²) ≤ 4 sup_{θ∈Θn} |C(θ) − Cn(θ)| + O(δn²).  (144)

Define F = {f(y, z) = yθ(z) : θ ∈ Θ} and recall that by (137), the class F is Donsker. Therefore, we obtain:

sup_{θ∈Θn} |C(θ) − Cn(θ)|
 ≲ sup_{θ∈Θn} | E[(E[θ(Z)|X])² fX²(X)] − (1/n) Σ_i (Ê[θ(Z)f̂X(X)|Xi])² | + sup_{θ∈Θn} | (1/n) Σ_i Yi θ(Zi) − E[Yθ(Z)] |
 = sup_{θ∈Θn} | E[(E[θ(Z)|X])² fX²(X)] − (1/n) Σ_i (Ê[θ(Z)f̂X(X)|Xi])² | + Op(n^{−1/2}).  (145)

Next, define the class G = {g(x) = (E[θ(Z)|x])² fX²(x) : θ ∈ Θ} and notice fX and θ ∈ Θ uniformly bounded imply:

|(E[θ1(Z)|x])² fX²(x) − (E[θ2(Z)|x])² fX²(x)| ≤ fX²(x) |E[θ1(Z) − θ2(Z)|x]| |E[θ1(Z) + θ2(Z)|x]| ≲ ‖θ1 − θ2‖∞  (146)

for any θ1, θ2 ∈ Θ. Hence, G is Lipschitz in Θ, and by Theorems 2.7.11, 2.7.1 and 2.5.6 in van der Vaart and Wellner (1996), the class G is Donsker as well. We therefore conclude that:

sup_{θ∈Θ} | (1/n) Σ_i {(E[θ(Z)|Xi])² fX²(Xi) − E[(E[θ(Z)|X])² fX²(X)]} | = Op(n^{−1/2}).  (147)

Moreover, by Cauchy–Schwarz, (E[θ(Z)fX(X)|Xi])² uniformly bounded in (x, θ) ∈ [0,1]^{dx} × Θ and Lemma B.2(i):

sup_{Θn} | (1/n) Σ_i {(Ê[θ(Z)f̂X(X)|Xi])² − (E[θ(Z)fX(X)|Xi])²} | = Op( √kn/√n + kn^{−γ/dx} + (nh^{dx})^{−1/2} + h^r ).  (148)

Therefore, combining (144), (145), (147), (148) and absorbing higher order terms, we obtain:

‖v̂ − Πnṽ‖w = Op( { √kn/√n + kn^{−γ/dx} + (nh^{dx})^{−1/2} }^{1/2} ).  (149)

Finally, the Cauchy–Schwarz inequality, Lemma B.2(i), result (149), Markov's inequality and the definition of Θ̂0 imply:

sup_{Θ̂0} |rn(θ̂0) − r̂n(θ̂0)| ≤ sup_{Θ̂0} [Qn(θ̂0)]^{1/2} × [(1/n) Σ_i (Ê[(Πnṽ(Z) − v̂(Z)) f̂X(X)|Xi])²]^{1/2}
 = Op( { (kn/n)^{1/4} + kn^{−γ/(2dx)} + (nh^{dx})^{−1/4} } × √(bn/an) ),  (150)

which establishes the claim of the lemma.

Proof of Proposition 4.1. It follows by direct calculation and noting that n^{2/3} an/bn → 0.

Proof of Theorem 4.2. By Corollary 3.1, it follows that ‖θ̂0 − θ0‖∞ = op(1) and ‖θ̂0 − θ0‖w = op(n^{−1/4}). In addition, Lemma 4.1 and Assumption 4.1(ii) imply √n |r̂n(θ̂0) − rn(θ̂0)| = op(1). Therefore, by Theorem 4.1,

√n{ (1/n) Σ_i Yi θ̂0(Zi) + rn(θ̂0) − ⟨ν, m0⟩ } = (1/√n) Σ_i {Yi θ0(Zi) − E[ṽ(Z)|Xi] fX²(Xi) θ0(Zi)} + op(1).  (151)

The claim of the theorem then follows by the central limit theorem.

Proof of Lemma 4.2. First notice that the properties of least squares projections, v̂ bounded and fX bounded imply:

(1/n) Σ_i (Ê[v̂(Z)f̂X²(X)|Xi] − E[v̂(Z)fX²(X)|Xi])² ≲ (2/n) Σ_i (f̂X²(Xi) − fX²(Xi))² + (2/n) Σ_i (Ê[v̂(Z)fX²(X)|Xi] − E[v̂(Z)fX²(X)|Xi])² = op(1),  (152)

where the final result follows by v̂ ∈ Θn bounded, fX bounded as well, (75), (76), (112) and (113). Therefore, since θ̂0 is bounded, ‖θ̂0 − θ0‖∞ = op(1), ‖v̂ − ṽ‖w = op(1) and Markov's inequality yield:

(1/n) Σ_i (Ê[v̂(Z)f̂X²(X)|Xi] θ̂0(Zi) − E[ṽ(Z)fX²(X)|Xi] θ0(Zi))²
 ≲ ‖θ̂0 − θ0‖²∞ × (4/n) Σ_i (E[v̂(Z)fX²(X)|Xi])² + (4/n) Σ_i θ0²(Zi)(E[v̂(Z) − ṽ(Z)|Xi])² fX²(Xi) + op(1) = op(1).  (153)

In turn, Cauchy–Schwarz, ‖θ̂0 − θ0‖∞ = op(1), E[Y²] < ∞ and Markov's inequality together with (153) imply:

(1/n) Σ_i θ̂0²(Zi)(Yi − Ê[v̂(Z)f̂X²(X)|Xi])² = (1/n) Σ_i θ0²(Zi)(Yi − E[ṽ(Z)fX²(X)|Xi])² + op(1).  (154)
The claim of the lemma then follows by the law of large numbers.

Proof of Proposition 4.2. Follows directly from Theorem 4.2 and Lemma 4.2.

References

Ai, C., Chen, X., 2003. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71, 1795–1844.
Ai, C., Chen, X., 2007. Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables. Journal of Econometrics 141, 5–43.
Blundell, R., Chen, X., Kristensen, D., 2007. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica 75, 1613–1669.
Chalak, K., White, H., 2006. An extended class of instrumental variables for the estimation of causal effects. University of California at San Diego.
Chen, X., 2007. Large sample sieve estimation of semi-nonparametric models. In: Heckman, J.J., Leamer, E.E. (Eds.), Handbook of Econometrics, vol. 6B. Elsevier, North-Holland, pp. 5549–5632.
Chen, X., Pouzo, D., 2008. Estimation of nonparametric conditional moment models with possibly nonsmooth moments. Yale University.
Chernozhukov, V., Hong, H., Tamer, E., 2007. Estimation and confidence regions for parameter sets in econometric models. Econometrica 75 (5), 1243–1284.
Chesher, A., 2003. Identification in nonseparable models. Econometrica 71, 1405–1441.
Chesher, A., 2005. Nonparametric identification under discrete variation. Econometrica 73, 1525–1550.
Chesher, A., 2007. Instrumental variables. Journal of Econometrics 139, 15–34.
Darolles, S., Florens, J., Renault, E., 2003. Nonparametric instrumental regression. GREMAQ.
Gagliardini, P., Scaillet, O., 2008. Tikhonov regularization for nonparametric instrumental variable estimators. University of Lugano.
Hall, P., Horowitz, J., 2005. Nonparametric methods for inference in the presence of instrumental variables. Annals of Statistics 33, 2904–2929.
Horowitz, J., 2007. Asymptotic normality of a nonparametric instrumental variables estimator. International Economic Review 48, 1329–1349.
Huang, J.Z., 1998. Projection estimation in multiple regression with applications to functional ANOVA models. Annals of Statistics 26, 242–272.
Huang, J.Z., 2003. Local asymptotics for polynomial spline regression. Annals of Statistics 31, 1600–1635.
Imbens, G.W., Newey, W.K., 2009. Identification and estimation of triangular simultaneous equation models without additivity. Econometrica 77, 1481–1512.
Manski, C.F., 1990. Nonparametric bounds on treatment effects. American Economic Review Papers and Proceedings 80, 319–323.
Manski, C.F., 2003. Partial Identification of Probability Distributions. Springer-Verlag, New York.
Müller, H., 1991. Smooth optimum kernel estimators near endpoints. Biometrika 78, 52–64.
Newey, W.K., 1997. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics 79, 147–168.
Newey, W.K., McFadden, D.L., 1994. Large sample estimation and hypothesis testing. In: Engle, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. IV. Elsevier, North-Holland, pp. 2113–2245.
Newey, W.K., Powell, J., 2003. Instrumental variables estimation of nonparametric models. Econometrica 71, 1565–1578.
Newey, W.K., Powell, J., Vella, F., 1999. Nonparametric estimation of triangular simultaneous equation models. Econometrica 67, 565–603.
Polyanin, A.D., Manzhirov, A.V., 1998. Handbook of Integral Equations. CRC Press, Boca Raton.
Romano, J.P., Shaikh, A.M., 2010. Inference for the identified set in partially identified econometric models. Econometrica 78, 169–211.
Santos, A., 2007. Inference in nonparametric instrumental variables with partial identification. University of California at San Diego.
Schennach, S., Chalak, K., White, H., 2007. Estimating average marginal effects in nonseparable structural systems. University of California at San Diego.
Severini, T.A., Tripathi, G., 2006. Some identification issues in nonparametric linear models with endogenous regressors. Econometric Theory 22, 258–278.
Severini, T.A., Tripathi, G., 2010. Efficiency bounds for estimating linear functionals of nonparametric regression models with endogenous regressors. University of Connecticut.
Stinchcombe, M.B., White, H., 1992. Some measurability results for extrema of random functions over random sets. The Review of Economic Studies 59, 495–512.
van der Vaart, A.W., Wellner, J.A., 1996. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.
White, H., Chalak, K., 2006. Identifying effects of endogenous causes in nonseparable systems using covariates. University of California at San Diego.
Journal of Econometrics 161 (2011) 147–165
Robustness and inference in nonparametric partial frontier modeling

Abdelaati Daouia (a,∗), Irène Gijbels (b)
(a) Toulouse School of Economics (GREMAQ), University of Toulouse, France
(b) Department of Mathematics and Leuven Statistics Research Center (LStat), Katholieke Universiteit Leuven, Belgium
Article info
Article history: Received 25 December 2008; received in revised form 23 September 2010; accepted 6 December 2010; available online 21 December 2010.
JEL classification: C13; C14; D20.
Keywords: Asymptotics; Breakdown values; Econometric frontiers; Outlier detection.
Abstract
A major aim in recent nonparametric frontier modeling is to estimate a partial frontier well inside the sample of production units but near the optimal boundary. Two concepts of partial boundaries of the production set have been proposed: an expected maximum output frontier of order m = 1, 2, ... and a conditional quantile-type frontier of order α ∈]0, 1]. In this paper, we answer the important question of how the two families are linked. For each m, we specify the order α for which both partial production frontiers can be compared. We show that even one perturbation in the data is sufficient for breakdown of the nonparametric order-m frontiers, whereas the global robustness of the order-α frontiers attains a higher breakdown value. Nevertheless, once the α frontiers break down, they become less resistant to outliers than the order-m frontiers. Moreover, the m frontiers have the advantage of being statistically more efficient. Based on these findings, we suggest a methodology for identifying outlying data points. We establish some asymptotic results, filling important gaps in the literature. The theoretical findings are illustrated via simulations and real data. © 2010 Elsevier B.V. All rights reserved.
1. Introduction

In the economics, statistics, management science and related literature, a major aim is to estimate the upper boundary of a sample {(Xi, Yi), i = 1, ..., n} of independent copies of a random production unit (X, Y) with support defined by {(x, y) ∈ R₊^{p+1} | 0 ≤ y ≤ ϕ(x)}. Econometric considerations lead to the natural assumption that the frontier function ϕ is monotone nondecreasing. Let (Ω, A, P) be the probability space on which the vector of inputs X ∈ R₊ᵖ and the single output Y are defined. Then following Cazals et al. (2002), the optimal value ϕ(x) can be characterized as the right endpoint of the conditional distribution function F(y|x) = P(Y ≤ y|X ≤ x) = F(x, y)/FX(x), with F(·,·) and FX(·) being respectively the joint and marginal distribution functions of (X, Y) and X. The conventional estimate for ϕ is the Free Disposal Hull (FDH) estimator, i.e. the lowest nondecreasing step surface covering all sample points, that is

ϕ̂n(x) := sup{y ≥ 0 | F̂n(y|x) < 1} = max_{i:Xi≤x} Yi,

where F̂n(y|x) = F̂(x, y)/F̂X,n(x), with F̂(x, y) = (1/n) Σ_{i=1}^n 1(Xi ≤ x, Yi ≤ y) and F̂X,n(x) = F̂(x, ∞). When the frontier
function ϕ is also assumed to be concave, a popular estimator is the Data Envelopment Analysis (DEA) estimator, which is the lowest concave surface covering the FDH frontier. Both FDH and DEA estimators derive from the pioneering work of Farrell (1957). The DEA frontier has been popularized by Charnes et al. (1978), while the FDH has been proposed by Deprins et al. (1984). See Simar and Wilson (2008) for a survey on inference techniques using FDH and DEA estimators. By construction, these envelopment estimators are very sensitive to extremes and/or outliers in the output direction. This dramatic lack of robustness results in poor estimation of the corresponding economic efficiencies; the efficiency score of a firm is estimated via the distance between the attained produced output and the optimal production level given by the frontier function. Of course, in production activity, outlying outputs Yi are highly desirable. But in the absence of information on whether the observations are measured accurately, it is prudent to seek frontier estimators which are not determined by very few extreme observations. The underlying idea of the two existing methods in the econometric literature is to estimate a partial frontier well inside the cloud of data points but near the upper boundary. The first concept of a partial boundary of the joint support of (X , Y ) has been introduced by Cazals et al. (2002). Given an integer m ≥ 1, they define a notion of expected maximum output
function of order m as

ξm(x) := ∫_0^{ϕ(x)} (1 − [F(y|x)]^m) dy.

This partial frontier function converges to the full frontier function ϕ(x) as m → ∞. It is estimated by

ξ̂m,n(x) := ∫_0^{ϕ̂n(x)} (1 − [F̂n(y|x)]^m) dy.

To summarize the properties of this estimator, for fixed sample size n we have lim_{m→∞} ↑ ξ̂m,n(x) = ϕ̂n(x), and for fixed order m we have √n(ξ̂m,n(x) − ξm(x)) → N(0, σ²(x, m)) as n → ∞, where the asymptotic variance σ²(x, m) is given in (4.2). The second concept of a partial frontier function, suggested by Aragon et al. (2005), is defined as the αth quantile function qα(x) := inf{y ≥ 0 | F(y|x) ≥ α}, with α ∈]0, 1]. This order-α frontier function converges to ϕ(x) as α ↑ 1, and is estimated by q̂α,n(x) := inf{y ≥ 0 | F̂n(y|x) ≥ α}. For n fixed, this estimator satisfies lim_{α→1} ↑ q̂α,n(x) = ϕ̂n(x), and for fixed order α we have √n(q̂α,n(x) − qα(x)) → N(0, σ²(α, x)) as n → ∞, where σ²(α, x) is given in (4.3), provided that F(·|x) is differentiable at qα(x) with derivative f(qα(x)|x) > 0.

Unlike the usual (FDH, DEA) methods, both alternatives {q̂α,n(x)} and {ξ̂m,n(x)} are qualitatively robust and bias-robust, as shown in Daouia and Ruiz-Gazen (2006). But the order-α quantile frontiers can be more robust to extremes than the order-m frontiers when estimating the true full frontier ϕ (i.e. when α ↑ 1 and m ↑ ∞), since the influence function is no longer bounded for order-m frontiers as m tends to infinity, while it remains bounded for the conditional quantile frontiers as the order α tends to one. This advantage is proved only under the condition that the conditional density function f(·|x) is not null and continuous on its support. No attention was devoted, however, to the difference between the reliability of the two sequences of estimators {q̂α,n(x)} and {ξ̂m,n(x)} in the general setting. Moreover, the influence function only offers a local quantification of robustness by measuring the sensitivity of estimators to infinitesimal perturbations, but it is well known that estimators can be infinitesimally robust and yet still highly sensitive to small, finite perturbations. To measure the global robustness of an estimator, the richest quantitative information is provided by the finite sample breakdown point, as shown by Donoho and Huber (1983). It measures the smallest fraction of contamination of an initial sample that can cause an estimator to take values arbitrarily far from its value at the initial sample.
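To fix ideas, here is a minimal numerical sketch of the three estimators just introduced — ϕ̂n(x), ξ̂m,n(x) and q̂α,n(x) — for one input and one output; the data-generating process and all names are illustrative assumptions of ours, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(size=n)
Y = np.sqrt(X) * rng.uniform(size=n)          # true frontier phi(x) = sqrt(x)

def fdh(x):
    """FDH estimator: max of Y_i over units with X_i <= x."""
    return Y[X <= x].max()

def xi_m(x, m):
    """Order-m frontier: integral of 1 - F_n(y|x)^m over [0, fdh(x)]."""
    Yx = np.sort(Y[X <= x])
    Nx = Yx.size
    # F_n(.|x) is a step function: equal to i/Nx on [Y_(i), Y_(i+1))
    spacings = np.diff(np.concatenate(([0.0], Yx)))
    levels = np.arange(Nx) / Nx               # value of F_n just below each Y_(i)
    return np.sum((1.0 - levels**m) * spacings)

def q_alpha(x, alpha):
    """Order-alpha frontier: empirical conditional quantile."""
    Yx = np.sort(Y[X <= x])
    j = int(np.ceil(alpha * Yx.size))         # smallest i with i/Nx >= alpha
    return Yx[j - 1]

x0 = 0.5
print(fdh(x0), xi_m(x0, m=25), q_alpha(x0, alpha=0.95))
```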
the true full frontier ϕ ). We however establish in Proposition 2.2 that the two concepts of partial boundaries are closely linked in the sense that for each order m ≥ 1, there exists a well-specified order α = α(m) = (1/2)1/m such that the theoretical order-m and order-α frontiers are respectively the mean and median of the same distribution, namely that of the maximum of m independent random variables drawn from the law of Y given X not exceeding some level of inputs. This result also confirms the advantage of qˆ α,n over ξˆm,n in terms of finite sample breakdown point and grosserror sensitivity, but such a robust proposal may sacrifice statistical efficiency. We show in Section 3 how these results can be exploited to detect outlying data points in the output direction. In the frontier modeling context, descriptive methods for identifying outliers have been proposed by Wilson (1993, 1995). Although very useful, these methods are very computer intensive as the sample size increases and are based on some tuning parameters whose choice is not justified. See further discussions in Sections 3 and 6. Section 4 contributes to important gaps in the asymptotic theory for the estimators ξˆm,n (x) and qˆ α,n (x). We establish pointwise
√
and functional asymptotic representations for √n(ξˆm,n (x)−ξm (x)) and improve its order of convergence to O( log log n). Similar √ asymptotic properties for n(ˆqα,n (x) − qα (x)) can be found in Daouia (2005) and Daouia et al. (2008). However, unlike ξm (x), the computation of the asymptotic confidence interval of qα (x) requires estimation of the quantile density function f (qα (x)|x), often resulting in estimates of unsatisfactory accuracy for finite samples. To avoid this problem, we derive an alternative asymptotic confidence interval for qα (x) not requiring knowledge of f (qα (x)|x). Finally, we show under general conditions that the asymptotic normality of both ξˆm,n (x) and qˆ α,n (x) is still valid when m = mn → ∞ and α = αn → 1 as n → ∞. Section 5 illustrates the theoretical findings through simulations and real data and Section 6 concludes. 2. Robustness The most successful notion of global robustness of an estimator T at a sample (Z )n = (Z1 , . . . , Zn ) is provided by the finite sample breakdown point of Donoho and Huber (1983):
RB(T, (Z)ⁿ) = min{ k/n : k = 1, ..., n, and sup_{(Z)ₖⁿ} |T{(Z)ₖⁿ} − T{(Z)ⁿ}| = ∞ },

where (Z)ₖⁿ denotes the sample contaminated by replacing k points of (Z)ⁿ with arbitrary values. Given m ≥ 1, α ∈]0, 1] and x ∈ R₊ᵖ, the partial boundaries ξ̂m,n(x) and q̂α,n(x) are representable as functionals of the joint empirical distribution function F̂(·,·), or equivalently, of the data set (X, Y)ⁿ = {(Xi, Yi), i = 1, ..., n}:

ξm(x) = S^{m,x}(F),  ξ̂m,n(x) = S^{m,x}(F̂) = S^{m,x}((X, Y)ⁿ),
qα(x) = T^{α,x}(F),  q̂α,n(x) = T^{α,x}(F̂) = T^{α,x}((X, Y)ⁿ),

where the operators S^{m,x} and T^{α,x} associate to a distribution function G(·,·) on R₊ᵖ × R₊ such that G(x, ∞) > 0 the real values

S^{m,x}(G) = ∫_0^∞ [ 1 − (G(x, y)/G(x, ∞))^m ] dy  and  T^{α,x}(G) = inf{ y ≥ 0 | G(x, y)/G(x, ∞) ≥ α },

with the integrand being identically zero for y ≥ inf{y | G(x, y)/G(x, ∞) = 1}.
As expected, we can easily show that even one outlying observation is sufficient for breakdown of the FDH frontier (see Lemma A.1, Appendix), and consequently for breakdown of the DEA frontier. But the surprising result is that the partial order-m boundary breaks down for the same fraction, 1/n, of contamination as the envelopment FDH and DEA estimators, for any order m.

Theorem 2.1. Let x ∈ R₊ᵖ such that F̂X,n(x) > 0. Then, for any order m ≥ 1,

RB(ξ̂m,n(x), (X, Y)ⁿ) = 1/n.

Hence the asymptotic breakdown value is 0 for any order-m partial frontier. In contrast, by an appropriate choice of the order α as a function of n and F̂X,n(x), we can derive a partial quantile-based frontier q̂α,n(x) capable of withstanding arbitrary perturbations of a significant proportion of the data points without disastrous results.

Theorem 2.2. Let x ∈ R₊ᵖ such that F̂X,n(x) > 0. Then, for any order α ∈]0, 1],

RB(q̂α,n(x), (X, Y)ⁿ) = (n(1 − α)F̂X,n(x) + 1)/n if αnF̂X,n(x) = 1, 2, 3, ...; and (nF̂X,n(x) − [αnF̂X,n(x)])/n otherwise,

where [αnF̂X,n(x)] denotes the integer part of αnF̂X,n(x).

Remark 2.1. The asymptotic breakdown value for the sequence {q̂α,n(x)}_{n≥1} is then (1 − α)FX(x). When the order α is fixed, this theorem reflects how the corresponding partial frontier q̂α,n(x) suffers from the left-border effect when the vector x ∈ R₊ᵖ of inputs-usage is too small. Likewise, increasing the dimension p of input factors x decreases F̂X,n(x), and hence RB(q̂α,n(x), (X, Y)ⁿ) goes down. On the other hand, once we know that q̂α,n(x) = T^{α,x}((X, Y)ⁿ) does not break down for the fraction (k∗ − 1)/n of contamination, with k∗/n = RB(q̂α,n(x), (X, Y)ⁿ), it is of interest to know how large the bias |T^{α,x}((X, Y)ⁿ_{k∗−1}) − T^{α,x}((X, Y)ⁿ)| can be. For this purpose we compute the upper bound of this bias; here we only focus on contaminated samples (X, Y)ⁿ_{k∗−1} := (X, Y)^{n,y}_{k∗−1} in the direction of Y, obtained by replacing k∗ − 1 points (Xi, Yi) with outlying extreme values (Xi, Yi∗).

Proposition 2.1. Let x ∈ R₊ᵖ such that F̂X,n(x) > 0. Then, for any order α ∈]0, 1],

0 ≤ T^{α,x}((X, Y)^{n,y}_{k∗−1}) − T^{α,x}((X, Y)ⁿ) ≤ ϕ̂n(x) − q̂α,n(x)

for any contaminated sample (X, Y)^{n,y}_{k∗−1}.

Note that when the order α goes to 1, i.e. when estimating the full frontier function ϕ(x) itself, the maximal bias tends to zero since lim_{α↑1} q̂α,n(x) = ϕ̂n(x). However, when α is fixed, the maximal bias ϕ̂n(x) − q̂α,n(x) may become too large as x increases. So, to estimate ϕ(x) by q̂α,n(x), the order α should be chosen appropriately as a function of both x and n.

We next answer the important question of how the two families of order-α and order-m boundaries are linked. We show that these concepts of partial frontiers are closely linked in the sense that {qα(x), α ∈]0, 1]} defines a ''robustified'' variant of the family {ξm(x), m ≥ 1}, while the latter defines an ''efficient'' variant of the former. We provide an explicit and exact expression of α as a function of m that allows to select which frontier q̂α,n can be analyzed and compared with ξ̂m,n. Indeed, it is easy to see that ξm(x) = E[max(Y_x^1, ..., Y_x^m)] for any sequence (Y_x^1, ..., Y_x^m) of m independent random variables drawn from the conditional distribution of Y given X ≤ x. Since the median is known to be more robust than the mean (see e.g. Hampel, 1968), we can ''robustify'' the expected value of the maximum, ξm(x), by simply replacing the expectation with the median to obtain a median-type expected maximum output frontier.

Proposition 2.2. Consider the robust variant of ξm(x) defined as

ξ̃m(x) = Median[max(Y_x^1, ..., Y_x^m)].

Then for any order m ≥ 1, there exists an order α(m) = (1/2)^{1/m} such that ξ̃m(x) = q_{α(m)}(x).

Hence for each expected-maximum output m frontier, there exists a quantile-type frontier of a well-specified order α = α(m) such that their pointwise values ξm(x) and qα(x) are respectively the theoretical mean and median of the same distribution, namely that of the random variable max(Y_x^1, ..., Y_x^m). When this distribution is symmetric, ξ̂m,n(x) and q̂_{α(m),n}(x) estimate exactly the same quantity.

Remark 2.2. It is difficult to imagine the family {q̂α,n(x), α ∈]0, 1]} being preferred in all contexts: of course q̂α,n(x) is preferred over ξ̂m,n(x) in terms of finite sample breakdown point and gross-error sensitivity, but such a robust proposal may sacrifice in terms of statistical efficiency (measured e.g. by means of the estimation variance). Moreover, once {q̂α,n(x)} breaks down, it becomes less resistant to extreme values than {ξ̂m,n(x)}. Indeed, putting Nx = nF̂X,n(x) and taking Y_1^x, ..., Y_{Nx}^x to be the Yi's such that Xi ≤ x, we first have

q̂α,n(x) = Y_{(αNx)}^x if αNx = 1, 2, 3, ...; and Y_{([αNx]+1)}^x otherwise,  (2.1)

where Y_{(i)}^x denotes the ith order statistic of the points Y_1^x, ..., Y_{Nx}^x. Likewise, we have

ϕ̂n(x) − ξ̂m,n(x) = Σ_{i=1}^{Nx−1} (i/Nx)^m {Y_{(i+1)}^x − Y_{(i)}^x}.  (2.2)

This difference being a sum of weighted spacings, ξ̂m,n(x) is more resistant to FDH points in the sense that it converges slowly to ϕ̂n(x) as m increases, whereas q̂_{α(m),n}(x), as an order statistic, converges rapidly to ϕ̂n(x) once it breaks down. It is easy to see that

q̂_{α(m),n}(x) = Σ_{i=1}^{Nx} Y_{(i)}^x 1{ ((i − 1)/Nx)^m < 1/2 ≤ (i/Nx)^m },  (2.3)

and so q̂_{α(m),n}(x) coincides with ϕ̂n(x) for all m > log(2)/log(Nx/(Nx − 1)), which is not the case for

ξ̂m,n(x) = Σ_{i=1}^{Nx} Y_{(i)}^x { (i/Nx)^m − ((i − 1)/Nx)^m }.  (2.4)

These sensitivity and resistance characteristics of ξ̂m,n(x) and q̂_{α(m),n}(x), as well as their statistical efficiency, are illustrated in Section 5.1 with simulated and real data sets. To conclude, the α(m) frontier ξ̃m,n is sometimes preferred over the m frontier ξ̂m,n and sometimes not, according to the values of m. So a sensible practice is not to restrict the frontier analysis to one procedure, but to check whether both concepts of partial boundaries point toward similar conclusions. See the practical guidelines in Section 5.4.
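The closed forms (2.1)–(2.4) reduce both partial frontiers to (weighted) order statistics of the conditional sample, with the two families matched through α(m) = (1/2)^{1/m} of Proposition 2.2. A minimal sketch follows, under the same illustrative setup as before; the data and names are ours rather than the paper's:

```python
import numpy as np

def partial_frontiers(X, Y, x, m):
    """Order-m frontier (2.4) and the matched order-alpha(m) frontier (2.1)."""
    Yx = np.sort(Y[X <= x])
    Nx = Yx.size
    i = np.arange(1, Nx + 1)
    weights = (i / Nx)**m - ((i - 1) / Nx)**m   # (2.4): weights sum to one
    xi_m = np.sum(Yx * weights)
    alpha = 0.5 ** (1.0 / m)                    # Proposition 2.2: alpha(m)
    j = int(np.ceil(alpha * Nx))                # (2.1), both integer/non-integer cases
    return xi_m, Yx[j - 1]

rng = np.random.default_rng(3)
X = rng.uniform(size=300)
Y = np.sqrt(X) * rng.uniform(size=300)
for m in (1, 10, 50):                           # both frontiers rise toward the FDH
    print(m, partial_frontiers(X, Y, 0.5, m))
```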
3. Detection of anomalous data

The word ''anomalous'' is used here for detecting isolated data points in the direction of Y. From now on, we write ξ̃m,n := q̂_{α(m),n}.

Local distance. Let (xa, ya) be an isolated outlier, that is, (xa, ya) = (xa, ϕ̂n(xa)) is an FDH observation clearly outlying the cloud of data points. We know that both partial boundaries ξ̂m,n(xa) and ξ̃m,n(xa) ↗ ϕ̂n(xa) as m → ∞. We distinguish between two different behaviors of ξ̂m,n(xa) and ξ̃m,n(xa) as the order m increases:

i. While ξ̂m,n(xa) breaks down (i.e. ξ̂m,n(xa) becomes attracted by the outlying value ya = ϕ̂n(xa)) for any order m ≥ 1 in view of Theorem 2.1, the quantile-type value ξ̃m,n(xa), being determined solely by the frequency α(m), remains unaffected even when m increases (quantiles are known to be robust in this sense). In this situation, the distance between the robust value ξ̃m,n(xa) and the influenceable value ξ̂m,n(xa) shall increase rapidly as m increases;
ii. However, when m achieves a sufficiently large threshold ma, the partial boundary ξ̃m,n(xa) also breaks down in view of Theorem 2.2 and converges rapidly, as an order statistic (see (2.1) and (2.3)), to the outlying maximum value ϕ̂n(xa). Even more strongly, it is easy to see that ξ̃m,n coincides overall with the FDH frontier ϕ̂n for any m ≥ log(1/2)/log((n − 1)/n). In contrast, ξ̂m,n(xa), being a linear combination of order statistics (see (2.2) and (2.4)), converges more slowly to the largest order statistic ϕ̂n(xa). Hence, despite its sensitivity to the magnitude of the outlying value ϕ̂n(xa) for any m ≥ 1, ξ̂m,n(xa) becomes more resistant than ξ̃m,n(xa) as m exceeds ma. Thus, the distance between ξ̃m,n(xa) and ξ̂m,n(xa) shall decrease slowly as m > ma increases.
To summarize, if ϕˆ n (xa ) is really outlying, the curve m → |ξ˜m,n (xa ) − ξˆm,n (xa )| shall have roughly a ‘‘Λ’’ structure, that is, a sharp positive slope (indicating that ξˆm,n (xa ) breaks down while ξ˜m,n (xa ) remains still unaffected as m increases) followed by a smooth decreasing slope (indicating that ξ˜m,n (xa ) becomes nonrobust for m large enough whereas ξˆm,n (xa ) is more resistant). Here the ‘‘Λ’’ effect appears at ma − 1 such that the value of |ξ˜m,n (xa ) − ξˆm,n (xa )| at m = (ma − 1) is sufficiently large compared with its initial value at m = 1. However, if ϕˆ n (xa ) is only extreme (not really isolated), the graph of m → |ξ˜m,n (xa )− ξˆm,n (xa )| will have a slight ‘‘∧’’ curvature, that is, a non-decreasing slope followed by a non-increasing slope such that the maximal value of the distance |ξ˜m,n (xa ) − ξˆm,n (xa )| is very close to its initial value at m = 1. So, in general, if the graph of the distance function m → |ξ˜m,n (xi ) − ξˆm,n (xi )| shows clearly a sharp ‘‘Λ’’ curvature for a given observed value xi , this indicates a potential outlier in the data set. The suspicious outlying point can be then easily recovered: it corresponds to the FDH point (xk , yk ) for which yk = ϕˆ n (xi ). This is the basic idea of our procedure. Global distance. Consider now the maximal ‘‘distance’’ between the partial boundaries ξ˜m,n and ξˆm,n , defined as d(m) =
max1≤i≤n |ξ˜m,n (xi ) − ξˆm,n (xi )|. Assume that (xa , ya ) is the unique outlier in the sample. If this point is far enough from the cloud of data points, then the local distance |ξ˜m,n (xa ) − ξˆm,n (xa )| coincides for all m ≥ 1 with the global distance d(m). In this case, as described above, the shape of the entire curve m → d(m) (and not only a part of this graph) should be a sharp ‘‘Λ’’. If, instead, the sample contains two isolated outliers (xa , ya ) and (xb , yb ) with xa < xb , it is easy to see from Theorem 2.2 that
ξ˜m,n (xa ) breaks down before ξ˜m,n (xb ). Let ma and mb be respectively the values of m at which ξ˜m,n (xa ) and ξ˜m,n (xb ) break down. Then ma < mb . On the other hand, due to the conditioning on X ≤ x, both ξˆm,n and ξ˜m,n are more resistant to outliers at xb than at xa (left-border effect). It follows that: i. for m < ma , both ξ˜m,n (xa ) and ξ˜m,n (xb ) are unaffected by the two outliers, while ξˆm,n (xa ) is more attracted by these outliers
than ξˆm,n (xb ) due to the left-border effect. This implies that |ξ˜m,n (xa ) − ξˆm,n (xa )| ≥ |ξ˜m,n (xb ) − ξˆm,n (xb )| as m increases,
whence d(m) = |ξ˜m,n (xa ) − ξˆm,n (xa )| as m ↑ ma . Therefore the graph of d(m) should have a sharp positive slope as m ↑ ma ; ii. once m exceeds ma , the local distance |ξ˜m,n (xa ) − ξˆm,n (xa )| decreases smoothly to zero (breakdown of ξ˜m,n (xa )), while
|ξ˜m,n (xb ) − ξˆm,n (xb )| still increases rapidly as m ↑ mb . Let ma,b be the value of m at which |ξ˜m,n (xb ) − ξˆm,n (xb )| exceeds |ξ˜m,n (xa ) − ξˆm,n (xa )|. Then ma ≤ ma,b < mb . If ma = ma,b , then d(m) = |ξ˜m,n (xb ) − ξˆm,n (xb )| for m ≥ ma . Whence d(m) increases rapidly as m ↑ mb and decreases smoothly as m ≥ mb . In contrast, if ma < ma,b < mb , then d(m) = |ξ˜m,n (xa ) − ξˆm,n (xa )| decreases smoothly for m ∈ [ma , ma,b ) whereas d(m) = |ξ˜m,n (xb ) − ξˆm,n (xb )| increases rapidly for m ∈ [ma,b , mb ) and decreases smoothly for m ≥ mb .
In summary, in presence of two outliers far from the cloud of data points, the shape of the entire graph m → d(m) should be either one sharp ‘‘Λ’’ or two successive ‘‘Λ’’ effects showing an ‘‘M’’ structure. It should be also clear that if (xa , ya ) is only a suspicious extreme (not really isolated), then the strong ‘‘Λ’’ effect corresponding to the outlier (xb , yb ) could be preceded by a slight ‘‘∧’’ oscillation due to the presence of the extreme observation (xa , ya ). In general, in presence of k outliers, the graph m → d(m) shows at least one sharp ‘‘Λ’’ effect and at most k ‘‘Λ’’ effects. However, in absence of outliers, the graph shows only slight ‘‘∧’’ oscillations as m increases and shall have a decreasing trend. To avoid any ambiguity of appreciation between sharp ‘‘Λ’’ effects and slight ‘‘∧’’ oscillations, we also make use of the concave envelopment of m → d(m) (i.e. the lowest concave curve enveloping the graph). The methodology. For a given order m, let x(m) denote the observed input xj for which d(m) = |ξ˜m,n (xj ) − ξˆm,n (xj )|. Then the basic tool will be a picture plotting the graph of d(m) and its concave envelopment for increasing equidistant values of m. Remember that d(m) ↘ 0 as m → ∞. So, if the graph of d(m) ends with an increasing slope, it should be redone by adding larger values of m until it ends with a decreasing slope. Note also that, if the graph of d(m) is plotted by using (2J + 1) or (2J + 2) values of m (with J = 1, 2, . . .), then it has at most J sharp ‘‘Λ’’ effects or slight ‘‘∧’’ oscillations. The different possible behaviors of the graph of d(m) and its concave envelopment can be summarized as follows: (a) If the shape of the entire graph of m → d(m) is a sharp ‘‘Λ’’, then the order m∗ at which the graph is maximal should indicate that the FDH point (xk , yk ), with yk = ϕˆ n (x(m∗ )), is an isolated outlier. The concave envelopment curve should have also a sharp ‘‘Λ’’ effect. Likewise, if the entire graph of d(m) shows a sharp ‘‘Λ’’ effect followed by a second one, that is, a structure ‘‘M’’, then each local maximum m∗ allows to detect an outlier. In this case, the concave envelopment curve should have roughly a structure ‘‘∩’’. (b) If the graph of d(m) begins with a sharp positive slope as m increases, it could have a global structure of at most J successive ‘‘Λ’’ effects. The local maxima corresponding to these sharp effects will allow to detect isolated outliers. Here also, the concave envelopment curve should have a structure ‘‘Λ’’ or ‘‘∩’’.
(c) If, in contrast, the graph of d(m) begins with a smooth positive slope followed by a decreasing trend showing a global maximum value very close to the initial value d(1), this indicates the presence of only suspicious extreme observations (not really isolated). In this case, the concave envelopment curve should not have a clear structure ''Λ'' or ''∩''. (d) If, on the contrary, the graph of d(m) decreases overall, this indicates clearly the absence of both outliers and suspicious extremes. Here also, a structure ''Λ'' or ''∩'' of the concave envelopment curve should not appear. (e) If, instead, the graph of d(m) begins with a decreasing slope followed by an increasing one, then we distinguish between two situations: either (e1) the (short) decreasing slope is too smooth compared with the (longer) increasing one, showing roughly a curvature ''√'' for the first values of m, or (e2) the decreasing deviation is, at least, as important as the increasing one. In situation (e1), the concave envelopment curve should have a structure ''Λ'' or ''∩'' whose maxima allow to detect isolated outliers as described in (a). In situation (e2), the concave envelopment curve should behave as the graph of d(m) in (c) or (d), leading thus to the same conclusions. In conclusion, the above description tells us that a ''Λ'' or ''∩'' structure of the concave envelopment curve is necessary and sufficient for detecting outliers. It is also important to note that looking only at the graph of d(m) may result in some confusion between the desirable ''Λ'' effects (isolated outliers) and possibly contestable ''∧'' oscillations (suspicious extremes). To overcome such a subjectivity of appreciation, it is best to overlay in the same picture the graph of d(m) and its concave envelopment. Then only sharp ''Λ'' deviations of the graph of d(m) whose maximal points belong to the concave envelopment curve should be retained to identify potential outliers. This smoothing strategy however allows to detect only a few outliers per picture. An outlier can ''mask'' other outliers situated near the first one and which are less isolated. To avoid such a masking effect, pointed out earlier by Wilson (1993, 1995), the analysis should be redone without the identified outliers until the concave envelopment curve shows no more ''Λ'' or ''∩'' effects. Then, a careful analysis is to plot again the last graph of d(m) and its concave envelopment by using a refined sequence of ''small'' equidistant values of m in order to detect potential masked outliers at the left border of the sample. Indeed, when the increasing values of m are large, our procedure cannot detect outliers having too small values of xi since, in this case, ξ̃m,n(xi) =
ξˆm,n (xi ) = ϕˆ n (xi ). All this results in the following simple practical algorithm (illustrated in Section 5.3):
[1] Plot the graph of d(m) and its concave envelopment for m = 1, [n/10], [2n/10], ..., [9n/10], n.
[2] If the concave envelopment curve shows a ''Λ'' or ''∩'' effect, then the order m∗ at which this curve attains its maximum indicates that the FDH point (xk, yk), with yk = ϕ̂n(x(m∗)), is a potential outlier. This suspicious point can be really identified as an isolated outlier only if the maximal value d(m∗) is clearly distant above the initial value d(1). To avoid the masking effect, proceed again to Step [1] without the identified outliers.
[3] If the concave envelopment curve shows neither a ''Λ'' nor a ''∩'' effect, let m1 > 1 be the first value of m in the chosen sequence in Step [1] at which the graph of d(m) shows a decreasing deviation. Then,
[3a] if m1/10 ≤ 1, there are no isolated outliers in the sample of interest;
[3b] if m1/10 > 1, proceed to [1] by using m = 1, [m1/10], [2m1/10], ..., [9m1/10], m1.
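A compact sketch of Steps [1]–[3] follows, with the concave envelopment computed as the least concave majorant (upper convex hull) of the points (m, d(m)); the grid construction, helper names and the planted outlier are our illustrative assumptions, and a full implementation would add the re-analysis loop of Step [2]:

```python
import numpy as np

def d_and_argmax(X, Y, m):
    """Global distance d(m) = max_i |tilde_xi_{m,n}(x_i) - hat_xi_{m,n}(x_i)|."""
    best, x_best = -np.inf, None
    for x in X:
        Yx = np.sort(Y[X <= x])
        Nx = Yx.size
        i = np.arange(1, Nx + 1)
        xi = np.sum(Yx * ((i / Nx)**m - ((i - 1) / Nx)**m))     # (2.4)
        q = Yx[int(np.ceil(0.5**(1.0 / m) * Nx)) - 1]           # (2.1) at alpha(m)
        if abs(q - xi) > best:
            best, x_best = abs(q - xi), x
    return best, x_best

def concave_majorant(ms, ds):
    """Least concave majorant of the points (m, d(m)): their upper convex hull."""
    hull = []
    for p in sorted(zip(ms, ds)):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # drop hull[-1] if it lies on or below the chord from hull[-2] to p
            if (y2 - y1) * (p[0] - x2) <= (p[1] - y2) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

rng = np.random.default_rng(4)
X = rng.uniform(size=200)
Y = np.sqrt(X) * rng.uniform(size=200)
Y[0] = 5.0                                        # plant one isolated outlier
n = X.size
grid = [1] + [max(1, (j * n) // 10) for j in range(1, 10)] + [n]   # Step [1]
vals = [d_and_argmax(X, Y, m) for m in grid]
env = concave_majorant(grid, [d for d, _ in vals])
for m, (d, x) in zip(grid, vals):
    print(m, round(d, 3))     # a sharp 'Lambda' peak on env flags the outlier
```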
Multivariate extensions. Let us now extend the ideas to the full multivariate setup where a set of inputs X ∈ R^p_+ is used to produce a set of outputs Y ∈ R^q_+. Let Ψ denote the joint support of the random vector (X, Y), which we assume to be free disposal, i.e., (x, y) ∈ Ψ implies (x′, y′) ∈ Ψ as soon as x′ ≥ x and y′ ≤ y (the inequalities here have to be understood componentwise). Let Y^(j) (resp. y^(j)) denote the jth component of Y (resp. of y). Since a natural ordering of Euclidean spaces of dimension greater than one does not exist, we overcome the difficulty by utilizing the conditional distribution of the dimensionless transformation Y_y := min_{j=1,...,q} Y^(j)/y^(j) given X ≤ x, instead of the multivariate distribution of Y ∈ R^q_+ conditioned by X ≤ x. The distribution function of this univariate transformation is given by

P(Y_y ≤ λ | X ≤ x) = 1 − P(Y > λy | X ≤ x) = 1 − S_{Y|X}(λy|x) for all λ ≥ 0,

where S_{Y|X}(·|x) denotes the conditional survival function of Y given X ≤ x. Its endpoint

λ(x, y) := sup{λ ≥ 0 | S_{Y|X}(λy|x) > 0}

coincides with the conventional Farrell efficiency score, sup{λ ≥ 0 | (x, λy) ∈ Ψ}, for the unit (x, y) ∈ Ψ, and the set Y^∂(x) := {λ(x, y)y | y : (x, y) ∈ Ψ} represents the set of maximal outputs a unit operating at the level x can produce. The point y^∂(x) := λ(x, y)y is the radial projection of (x, y) on the support frontier Y^∂ := {(x, λ(x, y)y) | (x, y) ∈ Ψ} in the output-orientation (orthogonal to the vector x). In the particular case of q = 1, we have the equalities λ(x, y) ≡ ϕ(x)/y and Y^∂(x) ≡ {ϕ(x)}. In parallel with the concepts of partial frontier functions qα(x) and ξm(x) related to the conditional distribution of Y given X ≤ x in the case of one output, we define the quantile function of order α and the expected maximum output function of order m for the dimensionless distribution of Y_y given X ≤ x, respectively, as

Qα(x, y) := inf{λ ≥ 0 | 1 − S_{Y|X}(λy|x) ≥ α},
Xm(x, y) := ∫_0^{λ(x,y)} {1 − [1 − S_{Y|X}(λy|x)]^m} dλ.
As a matter of fact, Xm(x, y) coincides with the order-m output efficiency score for the unit (x, y), introduced by Cazals et al. (2002), while Qα(x, y) coincides with the αth quantile output efficiency score favored by Daouia and Simar (2007). The sets Y^∂_α := {(x, Qα(x, y)y) | (x, y) ∈ Ψ} and Y^∂_m := {(x, Xm(x, y)y) | (x, y) ∈ Ψ} represent, respectively, the efficient order-α and order-m partial surfaces in the output direction. In the particular case of one output, Qα(x, y) = qα(x)/y and Xm(x, y) = ξm(x)/y. In this case, the sets Y^∂_α and Y^∂_m coincide with the graphs of the frontier functions qα(·) and ξm(·), respectively. See, e.g., Daraio and Simar (2007) for a detailed description of both partial efficiency measures and for their economic meaning.

The sample estimators Q̂α,n(x, y) and X̂m,n(x, y) of Qα(x, y) and Xm(x, y), respectively, are obtained by replacing S_{Y|X}(λy|x) with its empirical version

Ŝ_{Y|X}(λy|x) = ∑_{i=1}^n 1(Xi ≤ x, Yi > λy) / ∑_{i=1}^n 1(Xi ≤ x).

They can easily be computed in the same way as the quantities q̂α,n(x) and ξ̂m,n(x), respectively, by simply replacing in (2.1) and (2.4) the Yi's such that Xi ≤ x with the dimensionless observations Y_i^y such that Xi ≤ x. Moreover, it is not hard to show that all robustness and sensitivity properties established in the univariate case for the classes {qα(x), q̂α,n(x)} and {ξm(x), ξ̂m,n(x)} hold true for the transformations {Qα(x, y), Q̂α,n(x, y)} and {Xm(x, y), X̂m,n(x, y)}. In particular, the practical algorithm described above in the three Steps [1–3] for detecting potential outliers remains valid in the full multivariate case up to two natural adaptations:

i. the maximal distance d(m) between the curves of q̂α(m),n(·) and ξ̂m,n(·) in the case of q = 1 extends naturally to the distance

d(m) = max_{1≤i≤n} ‖Q̂α(m),n(Xi, Yi)Yi − X̂m,n(Xi, Yi)Yi‖

between the empirical partial surfaces Ŷ^∂_{α(m),n} = {(Xi, Q̂α(m),n(Xi, Yi)Yi) | i = 1, . . . , n} and Ŷ^∂_{m,n} = {(Xi, X̂m,n(Xi, Yi)Yi) | i = 1, . . . , n} in the general case of q ≥ 1, where ‖·‖ denotes the Euclidean norm on R^q;

ii. the outlying FDH point (Xk, Yk) to be identified in Step [2], with Yk = ϕ̂n(x(m∗)) in the case of one output, is determined in the case of multi-outputs by

Yk = λ̂n(x(m∗), y(m∗)) y(m∗),

where λ̂n(x, y) = sup{λ ≥ 0 | Ŝ_{Y|X}(λy|x) > 0} = max_{i|Xi≤x} min_{j=1,...,q} Y_i^(j)/y^(j) is the FDH estimator of λ(x, y), and where (x(m∗), y(m∗)) is the observation (Xj, Yj) for which d(m∗) = ‖Q̂α(m),n(Xj, Yj)Yj − X̂m,n(Xj, Yj)Yj‖.

Section 5.3 illustrates the procedure with simulated and real data. A sketch of the corresponding computation is given below.
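As an illustration only, here is a small Python sketch of these multivariate ingredients, under our own naming conventions (X is an n×p input array, Y an n×q output array; componentwise input domination and the dimensionless transformation are as defined above):

```python
import numpy as np

def output_scores(x, y, X, Y, m, alpha):
    # dimensionless observations Z_i = min_j Y_i^(j)/y^(j) over the units
    # with X_i <= x componentwise; their empirical cdf is 1 - S_hat(.|x)
    mask = (X <= x).all(axis=1)
    Z = np.sort((Y[mask] / y).min(axis=1))
    N = Z.size
    lam = Z[-1]                                   # FDH score lambda_hat_n(x, y)
    i = np.arange(1, N)
    Xm = lam - np.sum((i / N) ** m * np.diff(Z))  # X_hat_{m,n}(x, y)
    j = int(np.ceil(alpha * N))
    Qa = Z[min(max(j, 1), N) - 1]                 # Q_hat_{alpha,n}(x, y)
    return Qa, Xm

def d_of_m_multi(X, Y, m):
    # d(m) = max_i ||(Q_hat - X_hat) at (X_i, Y_i) times Y_i||,
    # with alpha = alpha(m) = (1/2)**(1/m)
    alpha = 0.5 ** (1.0 / m)
    best = 0.0
    for xi, yi in zip(X, Y):
        Qa, Xm = output_scores(xi, yi, X, Y, m, alpha)
        best = max(best, np.linalg.norm((Qa - Xm) * yi))
    return best
```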
4. Asymptotic properties

We first derive the following pointwise and uniform asymptotic representations for ξ̂m,n(x).

Proposition 4.1. (i) For all m ≥ 1 and any x ∈ R^p_+ such that FX(x) > 0, we have

√n (ξ̂m,n(x) − ξm(x)) = √n Φm,n(x) + op(1) as n → ∞, (4.1)

where

Φm,n(x) = (m F̂X,n(x)/FX(x)) ∫_0^{ϕ(x)} F^{m−1}(y|x) [F(y|x) − F̂n(y|x)] dy.

(ii) Suppose the upper boundary of the support of Y is finite. Then, for all m ≥ 1 and any X ⊂ R^p_+ such that inf_{x∈X} FX(x) > 0, (4.1) holds uniformly in x ∈ X, i.e. {√n(ξ̂m,n(x) − ξm(x)); x ∈ X} = {√n Φm,n(x); x ∈ X} + op(1).

As an immediate consequence of Proposition 4.1(i), √n(ξ̂m,n(x) − ξm(x)) is asymptotically normal with mean 0 and variance

σ²(x, m) = E[ (m/FX(x)) 1(X ≤ x) ∫_0^{ϕ(x)} F^{m−1}(y|x) [F(y|x) − 1(Y ≤ y)] dy ]²
= (2m²/FX(x)) ∫_0^{ϕ(x)} ∫_0^{ϕ(x)} F^m(y|x) F^{m−1}(u|x) [1 − F(u|x)] 1(y ≤ u) dy du. (4.2)

Even more strongly, it follows from Proposition 4.1(ii) (see also its proof) that the process {√n(ξ̂m,n(x) − ξm(x)); x ∈ X} converges in distribution in the space L∞(X) of bounded functions on X to the centered Gaussian process {Gm(x); x ∈ X} as n → ∞, where

Gm(x) = (m/FX(x)) ∫_0^{ϕ(x)} F^{m−1}(y|x) [F(x, ∞)F(y|x) − F(x, y)] dy,

with F being a (p + 1)-dimensional F-Brownian bridge. Similar results can be found in Cazals et al. (2002, Theorem 3.1 and Appendix B). Their techniques of proof rely on the differentiability of the operator S^{m,x} in the Fréchet sense with respect to the sup-norm. In statistical applications, however, Fréchet differentiability may not hold, whereas Hadamard differentiability does, the latter being a less restrictive concept of differentiability than the former. The results in Proposition 4.1 are derived by applying the functional delta method in conjunction with the (less restrictive) Hadamard differentiability.

Note that the functional convergence of the process {√n(ξ̂m,n(x) − ξm(x)); x ∈ X} in Proposition 4.1(ii) provides the consistency and asymptotic distribution of parametric approximations of the order-m frontiers, as shown in Florens and Simar (2005). Their elegant approach captures the shape of the cloud of points near its boundary by combining parametric and nonparametric techniques.

Next we show that √n(ξ̂m,n(x) − ξm(x)) also obeys a law of the iterated logarithm, which refines the almost sure order of convergence to O(√(log log n)) and even gives the proportionality constant.

Theorem 4.1. For all m ≥ 1 and any x ∈ R^p_+ such that FX(x) > 0, we have almost surely, for either choice of sign,

lim sup_{n→∞} ± √n (ξ̂m,n(x) − ξm(x)) / (2 log log n)^{1/2} = σ(x, m).

By the asymptotic normality we have lim_{n→∞} P{ξm(x) ∈ [ξ̂m,n(x) ± 2σ(x, m)/√n]} = 2Φ(2) − 1 ≈ 95%, where Φ denotes the standard normal distribution function. An intriguing implication of the law of the iterated logarithm (see, e.g., Serfling, 1980) is that ξm(x) is sure to fall outside the asymptotic confidence interval [ξ̂m,n(x) ± 2σ(x, m)/√n] infinitely often, but this is of little practical consequence. Monte Carlo experiments are provided in Section 5.2 to illustrate the performance of the asymptotic confidence interval Qn := [ξ̂m,n(x) ± z σ̂(x, m)/√n], which satisfies lim_{n→∞} P[ξm(x) ∈ Qn] = 2Φ(z) − 1 for any z > 0, where σ̂²(x, m) is a strongly consistent estimator of σ²(x, m):

σ̂²(x, m) = (m²/F̂X,n(x)) ∫_0^{ϕ̂n(x)} ∫_0^{ϕ̂n(x)} [F̂(y|x)F̂(u|x)]^{m−1} {F̂(y ∧ u|x) − F̂(y|x)F̂(u|x)} dy du.

Note that results similar to Proposition 4.1 and Theorem 4.1 have been proved for √n(q̂α,n(x) − qα(x)) in Daouia (2005) and Daouia et al. (2008). Note also that, as pointed out in Section 1, we have √n(q̂α,n(x) − qα(x)) →d N(0, σ²(α, x)) as n → ∞, where

σ²(α, x) = α(1 − α)/[f²(qα(x)|x) FX(x)]. (4.3)

Then the interval In = [q̂α,n(x) ± z σ(α, x)/√n] satisfies lim_{n→∞} P[qα(x) ∈ In] = 2Φ(z) − 1 for any z > 0. Putting z = Φ^{−1}(1 − a/2), the (1 − a/2)th quantile of Φ, we obtain (1 − a) as the confidence coefficient. However, the computation of the asymptotic confidence interval In requires estimation of the conditional quantile density f(qα(x)|x), which often results in estimates of unsatisfactory accuracy for finite samples. In the following theorem, we derive an alternative confidence interval for qα(x) which is asymptotically equivalent to In but does not need f(qα(x)|x) to be known or estimated.

Theorem 4.2. Let 0 < α1 < α2 < 1 and assume that F(·|x) is continuously differentiable on the interval [a, b] := [qα1(x) − ε, qα2(x) + ε] for some ε > 0, with strictly positive derivative f(·|x). For any α ∈ ]α1, α2[ and any z > 0, let Cn = ]q̂αn1,n(x), q̂αn2,n(x)[, where αn1 = α − z[α(1 − α)/nF̂X,n(x)]^{1/2} and αn2 = α + z[α(1 − α)/nF̂X,n(x)]^{1/2}. Then

lim_{n→∞} P[qα(x) ∈ Cn] = 2Φ(z) − 1 and √n |length(Cn) − length(In)| →p 0 as n → ∞.
In case the true partial frontiers qα(·) and ξm(·) coincide, one can compare the performances of the asymptotic confidence intervals Cn and Qn. See Section 5.2.

It should be clear that the estimation of the partial frontiers qα(·) and ξm(·), instead of the full frontier ϕ(·) itself, is mainly motivated by the construction of robust frontier estimators which are well inside the sample {(Xi, Yi), i = 1, . . . , n} but near its upper boundary. It is then natural to investigate whether the asymptotic normality of the estimators q̂α,n(x) and ξ̂m,n(x) is still valid when α = αn → 1 and m = mn → ∞ as n → ∞. First, note that

lim_{α↑1} q̂α,n(x) = lim_{m↑∞} ξ̂m,n(x) = ϕ̂n(x).

Note also that the necessary and sufficient condition under which the FDH estimator ϕ̂n(x) converges to a non-degenerate distribution is given by

1 − F(y|x) = ℓx({ϕ(x) − y}^{−1}) {ϕ(x) − y}^{ρx} as y ↑ ϕ(x) (4.4)

(Daouia et al. (2010), Theorem 2.1), where ρx > 0 is a constant and ℓx is a slowly varying function, i.e., lim_{t↑∞} ℓx(tz)/ℓx(t) = 1 for all z > 0. In the particular case where ℓx({ϕ(x) − y}^{−1}) = ℓ(x) is a strictly positive function in x, it is shown in Daouia et al. (2010, Corollary 2.1) that

{nℓ(x)}^{1/ρx} (ϕ(x) − ϕ̂n(x)) →d Weibull(1, ρx) as n → ∞.

For the estimator q̂αn,n(x) to keep the same limit Weibull distribution as ϕ̂n(x), it suffices to choose αn → 1 rapidly, so that n^{1+1/ρx}(1 − αn) → 0 (see Daouia et al. (2010), Theorem 2.2). This result has also been proved by Aragon et al. (2005) in the restrictive case where the joint density of (X, Y) has a sudden jump at the frontier, which corresponds to ρx = p + 1 in (4.4). Likewise, in this restrictive setting, Cazals et al. (2002) recover the same asymptotic Weibull distribution of ϕ̂n(x) for the estimator ξ̂mn,n(x) provided that mn = O(n log n). Instead of the Weibull extreme-value distribution, we provide in the next proposition sufficient conditions under which q̂αn,n(x) and ξ̂mn,n(x) are rather asymptotically normal.

Proposition 4.2. (i) Suppose (4.4) holds with ℓx({ϕ(x) − y}^{−1}) = ℓ(x) > 0 and F(·|x) is differentiable in a left neighborhood of ϕ(x) with a strictly positive derivative f(·|x). If n(1 − αn) → ∞ as n → ∞, then √n {σ(αn, x)}^{−1} (q̂αn,n(x) − qαn(x)) →d N(0, 1).

(ii) If mn → ∞ and mn(mn − 1)/σ(x, mn) = O(√(n/log log n)) as n → ∞, then √n {σ(x, mn)}^{−1} (ξ̂mn,n(x) − ξmn(x)) →d N(0, 1).

Thus, the convergence in distribution of both (√n/σ(α, x)) (q̂α,n(x) − qα(x)) and (√n/σ(x, m)) (ξ̂m,n(x) − ξm(x)) to N(0, 1), valid for fixed orders α and m, still holds when the partial frontiers qα(x) and ξm(x) approach the true full frontier ϕ(x).

Fig. 1. The true frontiers ξm and ξ̃m for several values of m (Cobb–Douglas model).

Table 1
The values n × RB(ξ̃m,n(x)) with n = 105.

x      m = 1   m = 5   m = 10   m = 15   m = 20   m = 25
0.1    5       2       1        1        1        1
0.3    15      4       2        2        1        1
0.5    26      7       4        3        2        2
0.7    37      10      5        4        3        2
0.9    49      13      7        5        4        3
5. Numerical illustration

We present simulation studies to illustrate the robustness and statistical efficiency of the empirical partial boundaries ξ̂m,n and ξ̃m,n := q̂α(m),n, and to compare the asymptotic confidence intervals Cn and Qn. We also provide illustrations with a real data set.

5.1. Comparing ξ̂m,n and ξ̃m,n

Simulated example. Consider the Cobb–Douglas model Y = X^{1/2} exp(−U), where X is uniform on [0, 1] and U is exponential with mean 1/3. This model was studied by Gijbels et al. (1999), among others. Here ϕ(x) = x^{1/2} and F(y|x) = 3x^{−1}y² − 2x^{−3/2}y³ for 0 < x ≤ 1 and 0 ≤ y ≤ ϕ(x). As can be seen from Fig. 1, in this example the theoretical partial frontiers ξm (solid lines) and ξ̃m = qα(m) (dotted lines) are very close. We also represent in Fig. 1 a simulated sample of size 100 (green points) and we add five outliers (blue points) to this sample. For the resulting sample (X, Y)^n of size n = 105, we compute the finite sample breakdown points RB(ξ̃m,n(x)) := RB(q̂α(m),n(x), (X, Y)^n) for several values of m and x, and provide in Table 1 the values n × RB(ξ̃m,n(x)).

Since the data set contains five outlying points in the output direction, the estimator ξ̃m,n(x) can break down whenever RB(ξ̃m,n(x)) ≤ 5/n. This is clearly seen from Fig. 2, where the frontiers ξ̃m,n and ξ̂m,n are plotted in the absence of outliers (bottom: n = 100, 200, 300) and in the presence of the 5 outliers (top: n = 105, 205, 305). Moreover, as pointed out in Remark 2.2, once ξ̃m,n(x) breaks down, it becomes less resistant to the influential outliers than ξ̂m,n(x) as m increases. This is exactly what happens for ξ̃25,105 and ξ̃25,205 at x = 0.3, where these order-α(25) frontiers are clearly more influenced than ξ̂25,105 and ξ̂25,205 (here m = 25). In contrast, before breaking down at the point x = 0.3, we see that ξ̃10,105 and ξ̃10,205 (here m = 10) are rather more robust than ξ̂10,105 and ξ̂10,205, respectively.

On the other hand, for too small values of x (e.g. x = 0.1), we see that both ξ̂m,n(x) and ξ̃m,n(x) coincide with the non-robust FDH estimator or, at least, are drastically attracted by ϕ̂n(x). As pointed out in Remark 2.1, this left-border defect is due to the conditioning on X ≤ x in the construction of these two estimators. However, when the number nF̂X,n(x) of observations (Xi, Yi) with Xi ≤ x increases, we see clearly that both ξ̂m,n and ξ̃m,n become more robust to the outlying points.
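For reproducibility, a minimal sketch of this data-generating process (the function name and seed are ours; the five appended outliers are the points identified in Section 5.3, all lying above the frontier ϕ(x) = √x):

```python
import numpy as np

def simulate_cobb_douglas(n=100, seed=0):
    # Y = X**0.5 * exp(-U), with X ~ U(0,1) and U ~ Exp(mean 1/3)
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, n)
    U = rng.exponential(scale=1.0 / 3.0, size=n)
    Y = np.sqrt(X) * np.exp(-U)
    # the five output outliers used in Section 5.3
    out = np.array([[0.1, 0.5], [0.3, 0.65], [0.3, 0.7], [0.5, 0.9], [0.7, 1.0]])
    return np.concatenate([X, out[:, 0]]), np.concatenate([Y, out[:, 1]])
```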
Fig. 2. In each picture, ξ̃10,n and ξ̃25,n (respectively: ξ̂10,n and ξ̂25,n) in solid and dotted blue (respectively: red) lines. From left to right and from top to bottom: n = 105, 205, 305, 100, 200, 300. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 2
1000 Monte–Carlo simulations, n = 1000 (l-h.s.) with 5 outliers added (r-h.s.). [The table reports the MSE and Bias of ξ̂m,n(x) and ξ̃m,n(x), for m = 10, 15, 20 and x = 0.15, 0.35, 0.55, 0.75, 0.95; the individual entries are not reliably recoverable from the source extraction.]
We also simulated 1000 samples of size n = 1000 to analyze the bias and the mean squared error (MSE) of ξ̂m,n and ξ̃m,n as estimators of ξm ≃ ξ̃m. According to the numerical results reported in Table 2 (l-h.s.), we can see that ξ̂m,n is slightly more efficient than ξ̃m,n in terms of MSE, whereas the latter estimator is better than the former in terms of bias. When the data are contaminated by adding five outliers (indicated by ''∗'' in Fig. 1), we see in Table 2 (r-h.s.) the improvement of ξ̃m,n over ξ̂m,n in terms of MSE. Moreover, ξ̃m,n still outperforms ξ̂m,n in terms of bias. Therefore, we can say that ξ̃m,n is globally more robust to the outlying points than ξ̂m,n in this particular example. This can be explained by the fact that, even when the α(m) frontier ξ̃m,n breaks down at a value x, it is influenced only locally on a right neighborhood of x, whereas the m frontier ξ̂m,n remains attracted overall between the 5 outliers, as illustrated in Fig. 2. Remember, in comparing the α(m)- and m-frontiers, that in the absence of outliers ξ̃m,n is almost everywhere larger than or equal to ξ̂m,n, which is no longer the case when adding the five outliers.

A real data set. To further illustrate the sensitivity and resistance properties of the empirical partial frontiers ξ̂m,n and ξ̃m,n, we use the real data example of Cazals et al. (2002) and Aragon et al. (2005) on the frontier analysis of 9521 French post offices observed in 1994, with X as the quantity of labor and Y as the volume of delivered mail. In this illustration, we only consider the n = 4000 observed post offices with the smallest levels xi. We compared ξ̂m,n and ξ̃m,n for different orders m ∈ {100, 200, 1000, 4000}. The cloud of points and the resulting estimates are provided in Fig. 3 (top).
Fig. 3. Top (full sample): ξˆm,n and ξ˜m,n for m = 100, 200, 1000, 4000. Bottom: as above without the anomalous data indicated by circles.
For m large enough (m ∈ {100, 200}), the quantile-based frontier ξ̃m,n is clearly more resistant to the extreme points than the expected maximal output frontier ξ̂m,n. But for m too large (i.e. m ∈ {1000, 4000}), both partial boundaries ξ̃m,n and ξ̂m,n are drastically influenced by the few ostensible FDH points. Nevertheless, while ξ̃4000,n coincides overall with the FDH frontier, ξ̂4000,n has the advantage of still being resistant to this envelopment frontier. These results are expected from our theory.

5.2. Comparing Cn and Qn

In the Cobb–Douglas model described above, as pointed out in Daouia and Ruiz-Gazen (2006), ξm(·) coincides with qα(·) if and only if

α = (1/2) (1 − cos[3 arccos(1/2 − Bm) − 4π]), with Bm = ∑_{j=0}^{m} C(m, j) 3^j (−2)^{m−j} / (3m − j + 1),

where C(m, j) denotes the binomial coefficient. For example, we obtain α = 0.8557 for m = 5 and α = 0.9242 for m = 10. In this case, the partial frontier ξm ≡ qα can be estimated by ξ̂m,n as well as q̂α,n, and one can compare the confidence intervals Qn and Cn. The true partial frontier and its 95% confidence intervals are displayed in Fig. 4, with (m = 5, α = 0.8557) on the l-h.s. and (m = 10, α = 0.9242) on the r-h.s. Here we consider two simulated samples of size n = 100 (top) and n = 1000 (bottom). By construction, the upper bound q̂αn2,n(x) of Cn does not exist (see the upper blue solid lines) for small input-usage x and high values of α, which result in orders αn2 > 1. This is the major drawback of the confidence interval Cn. On the other hand, even if the confidence bands of Qn (in red lines) are overall well defined, they do not contain ξm(x) for small levels x and high orders m. Apart from these left-border defects, we observe that Cn and Qn have very similar lower bounds, but Q100 performs globally better than C100 in terms of upper bounds. This is the price to be paid in order to avoid the estimation of the conditional quantile density f(qα(x)|x) involved in the asymptotic variance σ²(α, x) of q̂α,n(x). For n = 1000, the two confidence intervals provide more similar results. Note also that Qn is computationally prohibitive when the sample size is of the order of several thousands. On the contrary, Cn is very easy and very fast to implement.

Table 3 provides the average lengths and the achieved coverages of the 95% asymptotic confidence intervals Cn and Qn computed over 1000 random replications, for sample sizes n = 100 and n = 1000. The table only displays results for values of x where the upper bound of Cn exists, e.g., for x ranging over {0.4, 0.5, . . . , 1} when m = 5 and for x ∈ {0.65, 0.7, . . . , 0.95} when m = 10. For n = 100 both Cn and Qn provide reasonably good confidence intervals, but Qn performs clearly better than Cn in terms of average lengths. In contrast, Cn outperforms Qn in terms of achieved coverages. For n = 1000 the confidence interval Qn performs as well as Cn in terms of achieved coverages, while it still outperforms Cn in terms of average lengths. We repeated this exercise for other values of m and other MC trials and reached the same conclusion: no winner in all contexts. On the other hand, when comparing the MSE and bias of ξ̂m,n and q̂α,n as estimators of ξm(x) = qα(x), we find here also that ξ̂m,n is more efficient than q̂α,n in terms of MSE and that q̂α,n performs better in terms of bias. We do not reproduce the tables in order to save space.
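The coincidence order above is straightforward to evaluate numerically; a small sketch (ours) reproducing the two quoted values:

```python
from math import comb, acos, cos, pi

def coincidence_alpha(m):
    # order alpha at which q_alpha(.) = xi_m(.) in this Cobb-Douglas model
    B = sum(comb(m, j) * 3**j * (-2)**(m - j) / (3 * m - j + 1)
            for j in range(m + 1))
    return 0.5 * (1.0 - cos(3.0 * acos(0.5 - B) - 4.0 * pi))

# coincidence_alpha(5) and coincidence_alpha(10) are approximately
# 0.8557 and 0.9242, the values quoted in the text
```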
Fig. 4. 95% confidence intervals Qn (red) and Cn (blue) of ξm = qα (simulated example). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 3
Average lengths (av l) and coverages (cov) of the 95% confidence intervals Cn and Qn, sample sizes n = 100 and n = 1000. [Columns covCn, covQn, av lCn and av lQn, for m = 5 (α = 0.8557, x = 0.4, . . . , 1) and m = 10 (α = 0.9242, x = 0.65, . . . , 0.95); the individual entries are not reliably recoverable from the source extraction.]

Fig. 5. Simulated example. The graph of d(m) in solid blue line and its concave envelopment in dotted red line. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5.3. Detection of anomalous data

A univariate simulated example. We consider the cloud of n = 105 points represented in Fig. 1. The procedure based on the analysis of the curve of m → d(m) and its concave envelopment will detect only the five points ''*'' as isolated outliers. The first picture in Fig. 5 (l-h.s.) gives these two curves for m = 1, 10, 20, . . . , 90, 100: here the graph of d(m), in solid blue line, clearly shows two sharp ''Λ'' effects, and the concave envelopment curve, in dotted red line, has roughly a ''∩'' structure. The ''Λ'' effects attain their maximal values at m∗ = 10 and m∗ = 40. We also see that each maximal point (m∗, d(m∗)) belongs to the concave envelopment curve, which indicates that the FDH points (xk, yk), for which yk = ϕ̂n(x(m∗)), can really be identified as isolated points in the direction of Y. A simple computation code (using Matlab) allows us to detect two outlying FDH points: the result is (xk, yk) = (0.1, 0.5) for m∗ = 10 and (xk, yk) = (0.5, 0.9) for m∗ = 40. To avoid the masking effect, we redid the same work on the same data set without the two detected outliers. The second picture of Fig. 5 (from left to right) provides the resulting curve of d(m) and its concave envelopment: here also, looking at the sharp ''Λ'' effects of the graph of d(m), which appear at m∗ = 10 and m∗ = 50, we identify respectively the additional outlying points (0.3, 0.7) and (0.7, 1). When these two outliers are also deleted from the sample, we obtain the curves in the third picture of Fig. 5: clearly the graph of d(m) begins with a too smooth decreasing slope followed by a sharp ''Λ'' effect which attains its maximum at m∗ = 20, and ends with a slight ''∧'' oscillation. Here, the shape of the entire concave envelopment curve shows an indisputable sharp ''Λ'' effect which
attains its maximum at m∗ = 20, indicating that the FDH point (x(20), ϕˆ n (x(20))) ≡ (0.3, 0.65) is an outlier. The slight ‘‘∧’’ oscillation of the graph of d(m), attaining its maximum at m = 50, indicates only the presence of a suspicious extreme observation (not really isolated) since its maximal value d(50) is very close to d(1).
Now, when the five outliers (0.1, 0.5), (0.5, 0.9), (0.3, 0.7), (0.7, 1) and (0.3, 0.65) are deleted from the sample, we get the curves in the last picture of Fig. 5 (r-h.s.): the concave envelopment curve shows neither a sharp ‘‘Λ’’ effect nor a ‘‘∩’’ structure, which indicates the absence of really isolated outliers. Here, the graph of d(m) begins with a decreasing deviation followed by a slight ‘‘∧’’
oscillation whose maximal value d(50) is very close to the initial value d(1), and so cannot be used for detecting outliers. Therefore only the five points indicated by ''*'' in Fig. 1 are detected by our semi-automatic procedure, which is quite remarkable, although ''no optimal procedure nor miracle procedure can be defined to detect outliers'', as stated by Simar (2003).

Application to postal data. We test our procedure on the French post offices data set, which contains several outlying points in the output-orientation. Proceeding to Step [1] and then Step [2] of our algorithm for m = 1, [n/10], [2n/10], . . . , [9n/10], n (here n = 4000), we obtain successively the pictures in Fig. 7, from left to right and from top to bottom. Except for the last picture, the concave envelopment curves in dotted lines show roughly ''Λ'' or ''∩'' effects, allowing us to identify at most three outliers per picture, indicated in Fig. 6 by the number of the corresponding picture (from #1 to #14). Looking at the last picture, #15, we see a too smooth increasing slope (an approximately horizontal deviation) of the concave curve followed by a sharp decreasing slope, which makes the ''Λ'' effect clearly more contestable than the one appearing in the preceding picture. So we cannot proceed to Step [2]. Instead, we proceed to Step [3b], i.e., the last picture is redone by using a sequence of smaller values of m in order to detect potential masked outliers at the left-border of the sample. Looking again at picture #15, the first value of m at which the graph of d(m) (solid blue line) shows a decreasing deviation is m1 = 400. The resulting new graph of d(m) and its concave envelopment, for m = 1, [m1/10], [2m1/10], . . . , [9m1/10], m1, is in the first picture of Fig. 8: here we only see a slight ''∧'' oscillation of the concave curve, whose maximal value is close to the initial value d(1). The first value of m at which the graph of d(m) shows a decreasing deviation being m1 = 80, the last picture is redone by using the refined sequence m = 1, 8, 16, . . . , 80. This gives the second picture of Fig. 8, which allows us to identify only one potential outlier, indicated by #17 in Fig. 6. When this point is deleted from the sample, we obtain the third picture of Fig. 8, which shows no more ''Λ'' or ''∩'' effects of the concave envelopment curve, and so there are no more outlying post offices. In summary, our semi-automatic procedure detects 22 potential outliers.

Fig. 6. Potential outlying post offices detected by the semi-automatic procedure.

Fig. 7. The resulting pictures for m = 1, 400, 800, . . . , 4000 (French post offices).
Fig. 8. As above with m = 1, 40, . . . , 400 (l-h.s.) and m = 1, 8, . . . , 80 (middle and r-h.s.).

Fig. 9. The resulting pictures for m = 1, 10, 20, . . . , 100 (multivariate simulated data).
Some of these points (e.g. #1, #2, #3) are clearly outlying due to measurement errors, but other isolated observations (e.g. #4, #9, #10, #17) might contain useful information on the process under analysis and so deserve to be carefully examined.

A multivariate simulated example. Here a multi-input and multi-output (p = q = 2) data set is simulated as in Park et al. (2000). In this setup, the function describing the efficient frontier is given by

y^(2) = 1.0845 (x^(1))^{0.3} (x^(2))^{0.4} − y^(1),

where y^(j) (resp. x^(j)) stands for the jth component of y (resp. of x), for j = 1, 2. We draw X_i^(j) independent uniforms on (1, 2) and Ỹ_i^(j) independent uniforms on (0.2, 5). Then the generated random rays in the output space are characterized by the slopes K_i = Ỹ_i^(2)/Ỹ_i^(1). The generated random points on the frontier are defined by

Y_{i,eff}^(1) = 1.0845 (X_i^(1))^{0.3} (X_i^(2))^{0.4} / (K_i + 1),
Y_{i,eff}^(2) = 1.0845 (X_i^(1))^{0.3} (X_i^(2))^{0.4} − Y_{i,eff}^(1).

The efficiencies are generated by exp(−U_i), where the U_i are drawn from an exponential with mean 1/3. Finally, we define Y_i = Y_{i,eff} · exp(−U_i). We simulate 100 observations according to this scenario and we add five outliers #1, . . . , #5, as in Daouia and Simar (2007), respectively at the following values of X: (1.25, 1.5), (1.25, 1.75), (1.5, 1.5), (1.75, 1.25) and (1.5, 1.25); the corresponding values for the slopes in the Y space are (0.25, 0.75, 1, 3, 5). A sketch of this data-generating process is given below.
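A minimal Python sketch of the generation scheme just described (function name and seed are ours; the five outliers would be appended separately at the values of X and slopes listed above):

```python
import numpy as np

def simulate_park(n=100, seed=0):
    # frontier y2 = 1.0845 * x1**0.3 * x2**0.4 - y1; inputs U(1,2)^2,
    # output rays with slopes K built from U(0.2,5)^2, efficiencies exp(-U)
    rng = np.random.default_rng(seed)
    X = rng.uniform(1.0, 2.0, (n, 2))
    Yt = rng.uniform(0.2, 5.0, (n, 2))
    K = Yt[:, 1] / Yt[:, 0]                       # slopes of the output rays
    g = 1.0845 * X[:, 0] ** 0.3 * X[:, 1] ** 0.4  # frontier level per unit
    Y_eff = np.column_stack([g / (K + 1.0), g - g / (K + 1.0)])
    eff = np.exp(-rng.exponential(scale=1.0 / 3.0, size=n))
    return X, Y_eff * eff[:, None]
```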
Our working algorithm results in the successive graphs of d(m) and their concave envelopment curves displayed in Fig. 9, with m = 1, 10, 20, . . . , 100. In the first picture (from left to right and from top to bottom), the graph of d(m) (solid blue line) and its concave envelopment curve (dashed red line) have a sharp ''∩'' structure which attains its maximal values at m∗ = 10 and m∗ = 30, and allows us to identify only the outlying observation #5. A similar ''∩'' structure is obtained in the second picture after removing the first detected outlier: here also the maximal values attained at m∗ = 10 and m∗ = 30 allow us to identify the same outlier #1. When this additional outlier is deleted from the sample, the new graphs in the third picture show a sharp ''Λ'' effect which appears at m∗ = 10 and results in the identification of the outlier #2. The fourth picture provides the resulting graphs after removing this outlier: looking here at the ''∩'' structure which attains its maximal values at m∗ ∈ {20, 30, 60}, we identify the same outlying point #3. The new graphs obtained without this outlier are shown in the fifth picture, where a sharp ''Λ'' effect of the concave envelopment curve, appearing at m∗ = 20, allows us to identify the outlier #4. For the same data set without this outlier, we get in the last picture a decreasing concave envelopment curve, indicating the absence of any suspicious observation among the remaining 100 simulated points. Thus, only the five introduced outliers are detected by our procedure. We repeated the same exercise with other simulated data sets, with the same kind of results.

Application to PFT (Program Follow Through) data. We examine here the popular data set reported by Charnes et al. (1981) on an experimental education program administered in 70 US schools, with p = 5 inputs and q = 3 outputs. The observations #59 and #44 are detected by the procedure of Wilson (1993) as potential outliers. The results obtained by Simar (2003) confirm this and point out two additional suspicious observations, #54 and #1, that deserve at least careful attention. Our methodology confirms that only the units #59 and #44 are really isolated from the sample in the output-orientation. Moreover, it turns out that the unit #52 is more suspicious than the extreme observations #54 and #1. Proceeding to Step [1] and then to Step [2] of our algorithm for m = 1, 7, 14, 21, . . . , 70, we find successively the eight pictures displayed in Fig. 10.
Fig. 10. The resulting pictures for m = 1, 7, 14, 21, . . . , 70 (PFT data).
Fig. 11. (l-h.s.) The difference d(m∗ ) − d(1) for the suspicious observations. (r-h.s.) Evolution of the % of sample points outside the partial frontiers ξˆm,3978 and ξ˜m,3978 .
In the first picture, the concave envelopment curve (dashed line) indicates a clear ''Λ'' structure, and the graph of d(m) (solid line) shows two sharp ''Λ'' effects which attain their maximal values at m∗ = 49 [with d(m∗) − d(1) = 32.9483] and at m∗ = 21 [with d(m∗) − d(1) = 17.2833]. Both local maximum points (m∗, d(m∗)) belong to the concave envelopment curve, but they allow us to identify the same unit #59 as a potential outlier. When this suspicious unit is deleted from the sample, we obtain in the second picture a ''Λ'' (or roughly a ''∩'') structure of the graph of d(m) and its concave envelopment: here also the two maximal values attained at m∗ = 35 [with d(m∗) − d(1) = 13.7869] and at m∗ = 42 [with d(m∗) − d(1) = 13.7506] result in one detected unit, #44. When this extreme unit is deleted from the sample, we get the third picture, which allows us to identify the unit #52 at m∗ = 21 [with d(m∗) − d(1) = 5.0187] and the unit #54 at m∗ = 42 [with d(m∗) − d(1) = 3.3166]. When deleting these two additional suspicious points from the data set, we obtain in the fourth picture a ''Λ'' structure of the graph of d(m) and its concave envelopment, which attains its maximum at m∗ = 42 [with d(m∗) − d(1) = 3.4307] and allows us to identify the unit #1 as a potential outlier. Likewise, our semi-automatic procedure detects,
- in picture 5, the unit #21 at m∗ = 21 [with d(m∗) − d(1) = 1.3047];
- in picture 6, the unit #10 at m∗ = 35 [with d(m∗) − d(1) = 0.9163] and the unit #27 at m∗ = 14 [with d(m∗) − d(1) = 0.5941];
- in picture 7, the unit #12 at m∗ = 21 [with d(m∗) − d(1) = 0.9554] and the unit #50 at m∗ = 28 [with d(m∗) − d(1) = 0.6958];
- in picture 8, the unit #20 at m∗ = 14 [with d(m∗) − d(1) = 0.5382] and the unit #16 at m∗ = 28 [with d(m∗) − d(1) = 0.5486].

As a matter of fact, due to the high 8-dimensional space combined with a small sample (n = 70), we can identify more extreme points as potential outliers. However, we recall that before deleting any suspicious observation from the sample, our methodology requires first checking whether the suspicious point is really isolated in the output-orientation, by comparing the maximal value d(m∗) of the corresponding ''Λ'' effect with the initial value d(1). As can be seen from Fig. 11 (l-h.s.), which represents the difference d(m∗) − d(1) for each suspicious point, only the two units #59 and #44 can really be identified as potential outliers since, for each one of them, d(m∗) is clearly distant above d(1). The three suspicious units #52, #54 and #1 cannot be viewed as isolated outliers, but they are certainly extreme/influential observations. In contrast, the remaining units (#21, #10, #27, #12, #50, #20, #16, . . .) are not even suspicious, since the difference d(m∗) − d(1) is clearly negligible for all of them.
Fig. 12. 95% confidence intervals Qn (left) and Cn (right). Here n = 3978 (without anomalous data). From top to bottom: m = 100 and m = 250 (French post offices).
5.4. Practical guidelines

In view of Proposition 2.2 and Theorems 2.1–2.2, we know that a significant difference between the expected-maximum estimate ξ̂m,n and the median-maximum estimate ξ̃m,n indicates the presence of influential extreme observations above the order-m frontier that could be outlying. This suggests the following two steps in order to perform the frontier estimation.

Step 1. Apply the semi-automatic prescription (as illustrated above) in order to detect any potential outliers. Then consider the sample without the identified anomalous data points. For the median- and mean-maximum estimators ξ̃m,n and ξ̂m,n to provide similar conclusions, an intuitive idea is to seek the order m for which the percentage of sample points above each partial frontier is approximately the same. This leads to Step 2.

Step 2. Overlay in the same picture the evolution, with respect to m, of the percentage of observations outside each partial frontier. Remember that the sample still contains extreme points (not really outlying) that influence ξ̃m,n more or less than ξ̂m,n depending on the value of m. Therefore the two decreasing percentage curves will ''cross'': ξ̃m,n is less sensitive (and so envelopes fewer points) than ξ̂m,n to the magnitude of extreme outputs even when m increases, but once m attains a sufficiently large threshold, ξ̂m,n becomes more resistant (and so envelopes fewer points) than ξ̃m,n. The value of m at which the two percentage curves cross corresponds to the most similar large order-m and order-α(m) frontiers that capture the shape of the cloud of points near its optimal boundary. The extreme observations left outside the resulting similar frontiers ξ̃m,n and ξ̂m,n might be useful to emulate: the managers of any decision-making unit (DMU) operating at (x, y) and situated below these partial frontiers could study the relevant peers (Xi, Yi) above ξ̂m,n or ξ̃m,n among those dominating (x, y) (i.e. with Xi ≤ x and Yi ≥ y), in order to learn how to reduce inputs and/or increase outputs. Refined ''relevant practices'' that might be useful to emulate could be identified as follows: the partial frontiers ξ̃m,n and ξ̂m,n being less sensitive to the choice of the order m as m → ∞, the decrease of the percentage of points outside each frontier becomes approximately stable as m → ∞. In particular, the first value of m from which the two percentage curves are approximately horizontal/stable corresponds to frontiers ξ̂m,n and ξ̃m,n that are sensitive to the magnitude of the most extreme observations, whose outputs are highly desirable, but that are at the same time resistant to these extremes in the sense that they do not envelope them. Such extreme practices could be emulated by the managers of dominated DMUs to improve their own operations. A sketch of this diagnostic is given below.

Application to postal data. In order to capture in a robust way the shape and curvature of the sample boundary, we compared in Fig. 3 (top) both partial boundaries ξ̂m,n and ξ̃m,n for a sequence of large values of m ∈ {100, 200, 1000, 4000}. We observe a distance between the two frontiers, for each order m, due to the presence of outliers above the order-m frontiers. Following our practical guidelines, a sensible practice is to remove, in a first step, the 22 identified potential outliers from the sample. Fig. 3 (bottom) shows how ξ̂m,n and ξ̃m,n become very close for the resulting sample of size n = 3978. Then, in a second step, we overlay in the same picture the evolution, with respect to m, of the percentage of sample points outside each partial frontier. As can be seen from Fig. 11 (r-h.s.), the two decreasing percentage curves cross at m ≈ 100 and become approximately stable from m ≈ 250. The partial frontiers ξ̂m,n and ξ̃m,n for m ∈ {100, 250} are graphed in Fig. 12 together with their 95% confidence intervals Qn and Cn, respectively. ξ̂100,n and ξ̃100,n are the largest order-m and order-α(m) frontiers which provide the most similar estimates. The extreme post offices left outside these frontiers, whose outputs are highly desirable, might be useful to emulate. The partial frontiers ξ̂250,n and ξ̃250,n also provide a refined identification of relevant post offices to be emulated, and satisfactory estimates of the shape of the sample boundary.
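A hedged Python sketch of this Step-2 diagnostic (function name ours, reusing xi_m and q_alpha from the univariate sketch earlier in the paper):

```python
import numpy as np

def pct_outside(X, Y, ms):
    # percentage of sample points strictly above each partial frontier,
    # as a function of m, with alpha(m) = (1/2)**(1/m)
    p_xi, p_q = [], []
    for m in ms:
        a = 0.5 ** (1.0 / m)
        p_xi.append(100.0 * np.mean([y > xi_m(x, X, Y, m) for x, y in zip(X, Y)]))
        p_q.append(100.0 * np.mean([y > q_alpha(x, X, Y, a) for x, y in zip(X, Y)]))
    return np.array(p_xi), np.array(p_q)

# plotting p_xi and p_q against ms reveals the crossing point (m around 100
# for the post-office data) and the order from which both curves flatten
# (m around 250), the two working orders suggested by Step 2
```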
6. Conclusions

We show that the two classes of partial frontiers, {qα} and {ξm}, are closely related when α = α(m) = (1/2)^{1/m}, in the same sense as the mean and median of the same distribution are. This answers in particular the important question of how to choose the order α as a function of m for a possible comparison between order-α and order-m frontiers. Neither of the two classes can be claimed to be preferable in all contexts. A sensible practice is to check whether both partial frontier analyses point toward similar conclusions. Obtaining different results from ξ̂m,n and q̂α(m),n, for sufficiently large values of m, indicates the presence of suspicious extreme data points that could be outlying or perturbed by noise. Before performing any frontier estimation, a useful empirical strategy is to first detect and remove the anomalous data, and then to determine, in a second step, the order m at which the percentage of sample points outside the order-m and order-α(m) frontiers is approximately the same. This value of m corresponds to the largest frontiers ξ̂m,n and q̂α(m),n having the most similar behaviors. These extreme partial frontiers provide satisfactory estimates of the shape of the sample boundary and identify relevant peers that might be useful to emulate.

The theoretical comparison between the reliability of {q̂α,n} and {ξ̂m,n} is exploited to derive an appealing identification methodology, very easy and fast to implement and providing very good results. The use of partial frontiers for detecting influential observations is not new. A basic tool can be found in Simar (2003) and Daraio and Simar (2007), consisting of a picture showing the evolution of the ''proportion'' of sample points outside either the order-m or order-α frontier as a function of the order and of another tuning parameter. Our prescription is rather based on the evolution of the ''maximal distance'' between the related order-m and order-α(m) frontiers as a function of m. Adapting this tool to the input-orientation is straightforward. Our robustness study also provides a theoretical justification for the descriptive technique of Simar (2003) and Daraio and Simar (2007). We derive, among others, an asymptotic confidence interval Cn for qα(x) that does not require estimating the conditional quantile density function. We provide sufficient conditions ensuring the asymptotic normality of both ξ̂m,n(x) and q̂α,n(x) for the limiting cases m = mn → ∞ and α = αn → 1 as n → ∞. Instead of the assumption involving the asymptotic variance σ²(x, mn) (see Proposition 4.2(ii)), a main challenge is to obtain the asymptotic normality of ξ̂mn,n(x) under the more conventional condition (4.4). This problem is worth investigating in the future. When estimating the same partial frontier qα(x) = ξm(x), the empirical study reveals interesting findings regarding the performances of the estimators ξ̂m,n and q̂α,n and the performances of the confidence intervals Qn and Cn.

Acknowledgements

The authors thank the editor, associate editor and reviewers for their valuable comments, which led to a considerable improvement of the manuscript. This research was supported by the Research Fund KULeuven (GOA/07/04-project) and by the IAP research network P6/03, Federal Science Policy, Belgium. Support from the French ''Agence Nationale pour la Recherche'' (under grant ANR-08-BLAN-0106-01/EPI project) is also acknowledged.

Appendix A. Lemmas and proofs

A.1. Robustness

Lemma A.1. Let x ∈ R^p_+ be such that F̂X,n(x) > 0. Then RB(ϕ̂n(x), (X, Y)^n) = 1/n.
Proof. Since ϕ̂n(x) = T^{1,x}((X, Y)^n) := max_{i|Xi≤x} Yi, there exists j ∈ {1, . . . , n} such that Xj ≤ x and Yj = ϕ̂n(x). Let Y∗ be any arbitrary point such that Y∗ > Yj. Then, if we replace the FDH point (Xj, Yj) in the sample ((X1, Y1), . . . , (Xj, Yj), . . . , (Xn, Yn)) by (Xj, Y∗), we get the contaminated FDH estimator T^{1,x}((X1, Y1), . . . , (Xj, Y∗), . . . , (Xn, Yn)) = Y∗. Hence

sup_{(Z)^n_1} |T^{1,x}{(Z)^n_1} − T^{1,x}{(Z)^n}| ≥ |T^{1,x}{(X1, Y1), . . . , (Xj, Y∗), . . . , (Xn, Yn)} − T^{1,x}{(Z)^n}|

for all Y∗ > Yj. Therefore a breakdown occurs as Y∗ → ∞. □
Proof of Theorem 2.1. Let Nx = nF̂X,n(x) be the number of observations (Xi, Yi) with Xi ≤ x, and let Y^x_1, . . . , Y^x_{Nx} be the Yi's such that Xi ≤ x. For i = 1, . . . , Nx, denote by Y^x_(i) the ith order statistic of the points Y^x_1, . . . , Y^x_{Nx}. We have ϕ̂n(x) = Y^x_(Nx) and so

ξ̂m,n(x) = S^{m,x}((X, Y)^n) := Y^x_(Nx) − ∫_0^{Y^x_(Nx)} [F̂n(y|x)]^m dy.

If Nx = 1, ξ̂m,n(x) = ϕ̂n(x) and so RB(ξ̂m,n(x), (X, Y)^n) = 1/n by Lemma A.1. Otherwise,

S^{m,x}((X, Y)^n) = Y^x_(Nx) − ∑_{i=1}^{Nx−1} [i/Nx]^m (Y^x_(i+1) − Y^x_(i)).

Consider the same contaminated sample (X, Y)^n_1 = ((X1, Y1), . . . , (Xj, Y∗), . . . , (Xn, Yn)) used in the proof of Lemma A.1, obtained by replacing the FDH observation (Xj, Yj) by (Xj, Y∗), where Y∗ is an arbitrary point such that Y∗ > Y^x_(Nx). Then, if Nx = 2, we have S^{m,x}((X, Y)^n_1) = (1 − (1/2)^m)Y∗ + (1/2)^m Y^x_(1), and thus a breakdown occurs as Y∗ → ∞. Likewise, if Nx > 2, we have S^{m,x}((X, Y)^n_1) = (1 − [(Nx − 1)/Nx]^m)Y∗ + [(Nx − 1)/Nx]^m Y^x_(Nx−1) − ∑_{i=1}^{Nx−2} [i/Nx]^m (Y^x_(i+1) − Y^x_(i)), and thus a breakdown occurs as Y∗ → ∞. □
Proof of Theorem 2.2. The quantile q̂α,n(x) = T^{α,x}((X, Y)^n) of the sample (X, Y)^n is given by (2.1). Denote by N∗ the set of all positive integers. In what follows, the index j is such that T^{α,x}((X, Y)^n) = Y^x_(j), i.e., j = αNx if αNx ∈ N∗ and j = [αNx] + 1 otherwise.

(i) First let us show that k = Nx − j + 1 points are sufficient for breakdown of q̂α,n(x): if we replace, among the observations (Xi, Yi) with Xi ≤ x, the k largest outputs Y^x_(j), . . . , Y^x_(Nx) by an arbitrary point Y∗ > Y^x_(Nx), without replacing their corresponding inputs Xi, then the Xi's of the obtained contaminated sample (X, Y)^n_k such that Xi ≤ x are the same as those of the initial sample (X, Y)^n, and their corresponding ordered Yi's are Y^x_(1) ≤ · · · ≤ Y^x_(j−1) ≤ Y∗ ≤ · · · ≤ Y∗, where Y∗ occurs k times. Hence the αth quantile of (X, Y)^n_k, defined as the jth order statistic, is T^{α,x}((X, Y)^n_k) = Y∗. Therefore a breakdown occurs as Y∗ → ∞.

(ii) Let us now show that k − 1 = Nx − j points are not sufficient for breakdown of q̂α,n(x): let (X, Y)^n_{k−1} = ((X∗_1, Y∗_1), . . . , (X∗_n, Y∗_n)) be a sample contaminated by replacing k − 1 points of (X, Y)^n with arbitrary values in R^p_+ × R+. Let ℓx be the number of replaced points among the observations (Xi, Yi) with Xi ≤ x. It is clear that max{0, (k − 1) − (n − Nx)} ≤ ℓx ≤ k − 1. Let N∗_x be the number of points (X∗_i, Y∗_i) such that X∗_i ≤ x. Then it is easy to see that

N∗_x ≤ Nx + (k − 1) − ℓx. (A.1)
Let Y^{∗x}_1, . . . , Y^{∗x}_{N∗_x} be the points Y∗_i such that X∗_i ≤ x and, for i = 1, . . . , N∗_x, denote by Y^{∗x}_(i) the ith order statistic, so that Y^{∗x}_(1) ≤ · · · ≤ Y^{∗x}_(N∗_x). Then

T^{α,x}((X, Y)^n_{k−1}) = Y^{∗x}_(αN∗_x) if αN∗_x ∈ N∗, and Y^{∗x}_([αN∗_x]+1) otherwise.

Because k − 1 = Nx − j and k − 1 ≥ ℓx, we have Nx − ℓx ≥ j. Since αNx ≤ j, we obtain Nx(1 − α) ≥ ℓx and so (2Nx − αNx − ℓx)α ≤ Nx − ℓx. Using αNx ≤ j, we get (Nx + (k − 1) − ℓx)α = (2Nx − j − ℓx)α ≤ Nx − ℓx. It follows from (A.1) that

αN∗_x ≤ Nx − ℓx. (A.2)

It is then clear that Y^{∗x}_(αN∗_x) ≤ Y^{∗x}_(Nx−ℓx) if αN∗_x ∈ N∗; otherwise, it follows from (A.2) that [αN∗_x] + 1 ≤ Nx − ℓx, whence Y^{∗x}_([αN∗_x]+1) ≤ Y^{∗x}_(Nx−ℓx). Thus

0 ≤ T^{α,x}((X, Y)^n_{k−1}) ≤ Y^{∗x}_(Nx−ℓx). (A.3)

Since we only replace ℓx points among the Nx observations (Xi, Yi) with inputs Xi ≤ x, the remaining Nx − ℓx non-replaced observations (Xi, Yi) have outputs Y^x_i ≤ Y^x_(Nx). Since these non-contaminated Nx − ℓx outputs Y^x_i are contained in the set {Y^{∗x}_1, . . . , Y^{∗x}_{N∗_x}}, we have Y^{∗x}_(Nx−ℓx) ≤ Y^x_(Nx). Therefore T^{α,x}((X, Y)^n_{k−1}) ≤ Y^x_(Nx) in view of (A.3). Thus |T^{α,x}((X, Y)^n_{k−1}) − T^{α,x}((X, Y)^n)| ≤ ϕ̂n(x) for any (X, Y)^n_{k−1}. □

Proof of Proposition 2.1. Let (X, Y)^{n,y}_{k∗−1} = ((X1, Y∗_1), . . . , (Xn, Y∗_n)) be an arbitrary contaminated sample. Using the notations of the proof of Theorem 2.2, we have here N∗_x = Nx; Y^{∗x}_(1) ≤ · · · ≤ Y^{∗x}_(Nx−ℓx) are the Nx − ℓx non-contaminated Y^x_i's and Y^{∗x}_(Nx−ℓx+1) ≤ · · · ≤ Y^{∗x}_(Nx) are the resulting ℓx outliers in the direction of Y. Since the points Y^{∗x}_(1) ≤ · · · ≤ Y^{∗x}_(j) belong to the set of non-contaminated Y^x_i's, the point Y^{∗x}_(j) is larger than or equal to j points among these non-contaminated Y^x_i's. Therefore Y^x_(j) ≤ Y^{∗x}_(j). On the other hand, we have T^{α,x}((X, Y)^n) = Y^x_(j) and T^{α,x}((X, Y)^{n,y}_{k∗−1}) = Y^{∗x}_(j), since αN∗_x = αNx. Thus T^{α,x}((X, Y)^n) ≤ T^{α,x}((X, Y)^{n,y}_{k∗−1}). The second inequality, T^{α,x}((X, Y)^{n,y}_{k∗−1}) ≤ Y^x_(Nx) = ϕ̂n(x), is established in the proof of Theorem 2.2. □

Proof of Proposition 2.2. The result is immediate since, by definition of the median, we have ξ̃m(x) = inf{y ≥ 0 | F^m(y|x) ≥ 1/2} = q_{(1/2)^{1/m}}(x). □

Appendix B. Asymptotics
Fix m ≥ 1 and x ∈ R^p_+ such that FX(x) > 0. Define the domain Dx to be the set of distribution functions G(·, ·) on R^p_+ × R+ such that

G(x, ∞) > 0 and G^{−1}(1|x) ≤ ϕ(x), (B.1)

where G^{−1}(1|x) := inf{y ≥ 0 | G(y|x) = 1} stands for the upper boundary of the support of the conditional distribution function G(·|x) = G(x, ·)/G(x, ∞). For any G ∈ Dx define

φ^{m,x}(G) = ∫_0^∞ [1 − G^m(y|x)] dy,

where the integrand is identically zero for y ≥ G^{−1}(1|x). It follows from (B.1) that φ^{m,x}(G) = ∫_0^{ϕ(x)} [1 − G^m(y|x)] dy for all G ∈ Dx. In particular, we have φ^{m,x}(F) = ξm(x) and φ^{m,x}(F̂) = ∫_0^{ϕ̂n(x)} (1 − [F̂n(y|x)]^m) dy = ξ̂m,n(x) a.s. = ∫_0^{ϕ(x)} (1 − [F̂n(y|x)]^m) dy, since ϕ̂n(x) ≤ ϕ(x) with probability 1. The following lemma will be useful for the proof of Proposition 4.1(i).

Lemma B.1. The map φ^{m,x} : Dx ⊂ L∞(R̄^{p+1}) −→ [0, ϕ(x)] is Hadamard-differentiable at F with derivative

(φ^{m,x})′_F : h ∈ L∞(R̄^{p+1}) −→ (φ^{m,x})′_F(h) = (m/FX(x)) ∫_0^{ϕ(x)} F^{m−1}(y|x) [h(x, ∞)F(y|x) − h(x, y)] dy.

Proof. Let h ∈ L∞(R̄^{p+1}) and ht → h uniformly in L∞(R̄^{p+1}), where F + tht ∈ Dx for all small t > 0. Write ξ^t_m(x) := φ^{m,x}(F + tht). Following the definition of Hadamard differentiability (see van der Vaart (1998), p. 296), we shall show that (ξ^t_m(x) − ξm(x))/t converges to (φ^{m,x})′_F(h) as t ↓ 0. We have
ξmt (x) − ξm (x) [ ]m ∫ ϕ(x) F (x, y) + tht (x, y) [F (y|x)]m − = dy. FX (x) + tht (x, ∞) 0 By Taylor’s formula, for any y ∈ [0, ϕ(x)], there exists a point ζt ,x (y) interior to the interval joining F (y|x) and (F (x, y) + tht (x, y))/(FX (x) + tht (x, ∞)) such that
[F (y|x)]m −
[
F (x, y) + tht (x, y)
]m
FX (x) + tht (x, ∞)
= mt ζtm,x−1 (y)
ht (x, ∞)F (y|x) − ht (x, y)
FX (x) + tht (x, ∞)
.
Whence
ξmt (x) − ξm (x) t
=
ϕ(x)
∫
m FX (x) + tht (x, ∞)
0
ζtm,x−1 (y)
× [ht (x, ∞)F (y|x) − ht (x, y)]dy.
(B.2)
It follows from the definition of ζt ,x (y) and the uniform conver¯ p+1 ) that ζtm,x−1 (y)[ht (x, ∞)F (y|x) − ht (x, y)] gence ht → h in L∞ (R converges to F m−1 (y|x)[h(x, ∞)F (y|x) − h(x, y)] uniformly in y as t ↓ 0. Therefore, we obtain limt ↓0 (ξmt (x) − ξm (x))/t = m,x
( φ )′F (h).
Proof of Proposition 4.1(i). It is well known that the empirical
√
p+1
process n(Fˆ − F ) converges in distribution in L∞ (R ) to F, a p + 1 dimensional F -Brownian bridge (see van der Vaart and Wellner, 1996, p. 82). F is a Gaussian process with zero mean and covariance function E (F(t1 )F(t2 )) = F (t1 ∧ t2 ) − p+1
F (t1 )F (t2 ), for all t1 , t2 ∈ R . Then, by applying the functional delta method (see e.g. van der Vaart, 1998, Theorem 20.8, p. 297) in conjunction with Lemma B.1, we obtain
√
m,x
m,x
n( φ (Fˆ )− φ
m,x √ (F )) = ( φ )′F ( n(Fˆ − F )) + op (1). √ Let us now consider n(ξˆm,n (x) − ξm (x)) as a process indexed by x ∈ X, an arbitrarily fixed set such that infx∈X FX (x) > 0. Here m ≥ 1 is still fixed. Define the domain DX to be the set of p+1 distribution functions G on R+ such that G ∈ Dx for all x ∈ X. Let ν be the finite upper boundary of the support of Y and define, m,x
m
for any G ∈ DX , the map φ (G) : x → φ m
(G) as a map m
[1 − Gm (y|x)]dy for all G ∈ Dx . 0 m,x m,x ϕˆ (x) In particular, we have φ (F ) = ξm (x) and φ (Fˆ ) = 0 n (1 − a.s. ϕ(x) [Fˆn (y|x)]m )dy = ξˆm,n (x) = 0 (1 − [Fˆn (y|x)]m )dy since ϕˆ n (x) ≤ ϕ(x) with probability 1. The following lemma will be useful for the
X −→ [0, ν]. Finally, define the functional φ: G →φ (G) as a
proof of Proposition 4.1(i).
for the proof of Proposition 4.1(ii).
m
m,x
¯ p+1 ) → L∞ (X). We have φ (Fˆ ) := { φ (Fˆ ); x ∈ map DX ⊂ L∞ (R a.s.
ϕ(x)
X} = {ξˆm,n (x); x ∈ X} = { 0 (1 − [Fˆn (y|x)]m )dy; x ∈ X} since P [ϕˆ n (x) ≤ ϕ(x), ∀x ∈ X] = 1. The following lemma will be useful
A. Daouia, I. Gijbels / Journal of Econometrics 161 (2011) 147–165 m
Lemma B.2. φ is Hadamard-differentiable at F ∈ DX with derivative m
m ,x
(φ)F (h) : x ∈ X → ( φ )F (h), for any h ∈ L (R¯ ′
′
∞
p+1
m,x
F + tht is contained in DX for all small t. Abbreviate φ (F + tht ) to ξmt (x). By the uniform convergence of ht and the definition of ζt ,x (y), we have infx∈X |FX (x) + tht (x, ∞)| → infx∈X FX (x) and supx∈X,y∈R¯ |ζtm,x−1 (y) − F m−1 (y|x)| → 0 as t ↓ 0. By using supx∈X,y∈R¯ |ζt ,x (y)| ≤ 1 and supx∈X ϕ(x) ≤ ν , it can be easily seen m,x
that supx∈X |(ξmt (x) − ξm (x))/t − ( φ )′F (h)| → 0 as t ↓ 0, which ends the proof. Proof of Proposition 4.1(ii). By applying the functional delta method in conjunction with Lemma B.2, it is immediate that m
m
m
Furthermore, the linear operator (φ)′F (·) is defined and continuous ¯ p+1 ) since on the whole space L∞ (R 2mν
m,x
m
x∈X
inf FX (x)
a.s.
The following lemma will be needed to prove Theorem 4.2. Lemma B.3. Assume that the condition of Theorem 4.2 hold. For any
√ α ∈]α1 , α2 [ and any c ∈ R, let αn = α + c / nFˆX ,n (x) + o(1/ n). Then a.s.
qˆ αn ,n (x) −→ qα (x) and
¯ p+1 ). Therefore for any h ∈ L∞ (R
n(ˆqαn ,n (x) − qˆ α,n (x)) −→ c / FX (x)f (qα (x)|x) as n → ∞. p
Proof. Following Serfling (1980, p. 6), an equivalent condition for a.s.
the convergence Zn −→ Z to hold is limn→∞ P(supm≥n |Zm − Z | > ε) = 0 for every ε > 0, where Z1 , Z2 , . . . and Z are random variables on (Ω , A, P). Let ε > 0. By the smoothness of F (·|x) at qα (x) we have F (qα (x) − ε|x) < α < F (qα (x) + ε|x). Since
αn −→ α , we then have by applying the equivalent condition for the almost sure convergence
m
m
m
n(φ (Fˆ )− φ (F )) = (φ)′F
√ ( n(Fˆ − F )) + op (1) by Theorem 20.8 in van der Vaart (1998, p. 297).
√ Proof of Theorem 4.1. Write Rm,n (x) := n(ξˆm,n (x) − ξm (x)) − √ nΦm,n (x). By Taylor’s formula, for any y ∈ [0, ϕ(x)], there exists a point ηx,n (y) interior to the interval joining F (y|x) and Fˆn (y|x) such that [Fˆn (y|x)]m − F m (y|x) = mF m−1 (y|x)[Fˆn (y|x) − F (y|x)] + (m/2)(m − 1)[ηx,n (y)]m−2 [Fˆn (y|x) − F (y|x)]2 . By using the fact that a.s. ϕ(x) ξˆm,n (x) − ξm (x) = 0 (F m (y|x) − [Fˆn (y|x)]m )dy, we get ∫ ϕ(x) (ξˆm,n (x) − ξm (x)) − m F m−1 (y|x)[F (y|x) − Fˆn (y|x)]dy
[
a.s.
= −(m/2)(m − 1)
ϕ(x)
[ηx,n (y)]m−2 [Fˆn (y|x)
0
− F (y|x)] dy. 2
(B.3)
On the other hand, we have by the law of the iterated logarithm (LIL) for empirical processes sup |FˆX ,n (x) − FX (x)| = O
log log n n
x
sup |Fˆ (x, y) − F (x, y)| = O (x,y)
1/2
log log n
F (qα (x) + ε|x) − α 2
[
2
a.s.
On the other hand, since Fˆn (qα (x) ± ε|x) −→ F (qα (x) ± ε|x), we have F (qα (x) + ε|x) − α
] < Fˆm (qα (x) + ε|x), ∀m ≥ n → 1, 2 [ ] α − F (qα (x) − ε|x) ˆ P Fm (qα (x) − ε|x) < α − , ∀m ≥ n → 1 [
P α+
2
as n → ∞. It follows P[αm < Fˆm (qα (x) + ε|x), ∀m ≥ n] → 1 and P[Fˆm (qα (x) − ε|x) < αm , ∀m ≥ n] → 1 as n → ∞. Whence P[Fˆm (qα (x) − ε|x) < αm < Fˆm (qα (x) + ε|x), ∀m ≥ n] → 1 as n → ∞. Thus, by applying the fundamental property that the event {Fˆm (y|x) ≥ αm } is equivalent to {y ≥ qˆ αm ,m (x)}, we get P[qα (x) − ε < qˆ αm ,m (x) ≤ qα (x) + ε, ∀m ≥ n] → 1 as n → ∞. Therefore P[|ˆqαm ,m (x) − qα (x)| ≤ ε, ∀m ≥ n] → 1, which is a.s.
, (B.4)
1/2
n
with probability 1. It follows that supy |Fˆn (y|x) − F (y|x)|
√
=
O((log log n/n)1/2 ) with probability 1, whence supy { n[Fˆn (y|x) −
equivalent to qˆ αn ,n (x) −→ qα (x). Let us now turn to the second result. Since qˆ α,n (x) and a.s.
a.s.
qˆ αn ,n (x) −→ qα (x) and αn −→ α , the interval [a, b] contains both qˆ α,n (x) and qˆ αn ,n (x) and the interval [α1 , α2 ] contains αn , for n sufficiently large, with probability 1. Hence we have almost surely, for n large enough,
√
n(ˆqαn ,n (x) − qˆ α,n (x)) =
a.s.
F (y|x)] } −→ 0 as n → ∞. Finally, since 0 ≤ ηx,n (y) ≤ 1 for all 2
y, we arrive at
√
a.s.
n{(ξˆm,n (x) − ξm (x)) − m
ϕ(x)
0 a.s.
F m−1 (y|x)[F (y|x) −
Fˆn (y|x)]dy} −→ 0. This gives Rm,n (x) −→ 0 since FˆX ,n (x)/FX (x) a.s. −→ 1. By applying again the classical LIL (see e.g. Serfling, 1980, Theorem A, p. 35), we obtain for either choice of sign
√ lim sup ± n→∞
nΦm,n (x)
(2 log log n)1/2
]
, ∀m ≥ n → 1, ] α − F (qα (x) − ε|x) < αm , ∀m ≥ n → 1 as n → ∞. P α− P αm < α +
0
∫
√
a.s.
‖h‖L∞ (R¯ p+1 )
x∈X
√
F m−1 (y|x)[1(Xi ≤ x)F (y|x)
0
with probability 1. Moreover Rm,n (x)/(2 log log n)1/2 −→ 0 as n → ∞. Thus, by combining these results, we get the desired LIL.
m
linear transformation Gm = (φ)′F (F) of the Gaussian process F.
163
− 1(Xi ≤ x, Yi ≤ y)]dy = σ (x, m)
n(φ (Fˆ )− φ (F )) converges in distribution in L∞ (X) to the
‖(φ)′F (h)‖L∞ (X) = sup |( φ )′F (h)| ≤
−∫ i =1
Proof. It suffices to make the proof of Lemma B.1 uniform in x ∈
√
×
).
¯ p+1 ), where X. We use the same notation: let ht → h in L∞ (R
ϕ(x)
n
(m/FX (x)) = lim sup ± (2n log log n)1/2 n→∞
√
n{F (ˆqαn ,n (x)|x) − F (ˆqα,n (x)|x)}/f (qδn (x)|x)
where min{F (ˆqαn ,n (x)|x), F (ˆqα,n (x)|x)} < δn < max{F (ˆqαn ,n (x)|x), ∞ F (ˆqα,n (x)|x)}. Define the random function gn : L√ ([α1 , α2 ]) → R by gn (z ) = z (αn ) − z (α). Putting zn (·) = n{F (ˆq·,n (x)|x) − F (q· (x)|x)}, we obtain with probability 1, for all n large enough,
√
n(ˆqαn ,n (x) − qˆ α,n (x))
= [gn (zn ) −
√
n(α − αn )]/f (qδn (x)|x).
(B.5)
164
A. Daouia, I. Gijbels / Journal of Econometrics 161 (2011) 147–165
Let us show that zn converges in distribution in L∞ ([α1 , α2 ]) to a process z with continuous paths at α : let D1 be the set of all restrictions of distribution functions on R to [a, b], and for any G ∈ D1 , let G−1 : ]0, 1[−→ R denotes the generalized inverse map α → G−1 (α) := inf{y|G(y) ≥ α}. Then by Lemma 3.3 in Daouia (2005), the inverse map φ1 : G → G−1 as a map D1 ⊂ D([a, b]) −→ L∞ ([α1 , α2 ]) is Hadamard differentiable at F (·|x) tangentially to C ([a, b]) with derivative φ1′ ,F (·|x) : h −→
−h(F
(·|x))/f (F (·|x)|x). We also have √ zn = n{F (φ1 (Fˆn (·|x))|x) − F (φ1 (F (·|x))|x)} √ = n{φ2 ◦ φ1 (Fˆn (·|x)) − φ2 ◦ φ1 (F (·|x))}, −1
−1
√
(B.6)
−1
[F (F −1 (β|x) + tHt (β)|x) − F (F −1 (β|x)|x)]/t −→ H (β)f (F −1 (β|x)|x) as t → 0 uniformly in β ∈ [α1 , α2 ]. Then φ2 is Hadamard differentiable at φ1 (F (·|x)) with derivative φ2′ ,φ (F (·|x)) : H −→ H × f (q· (x)|x) 1 = −h(q· (x)). Hence by the chain rule (see van der Vaart, 1998, Theorem 20.9, p. 298), we have φ2 ◦φ1 : D1 −→ L∞ ([α1 , α2 ]) is Hadamard differentiable at F (·|x) tangentially to C ([a, b]) with derivative (φ2 ◦ φ1 )′F (·|x) = φ2′ ,φ (F (·|x)) ◦ φ1′ ,F (·|x) . With this result 1 and the representation (B.6) of zn , we can apply immediately the functional delta method (van der Vaart, 1998, Theorem 20.8, p. 297) in conjunction with Theorem 3.1 in Daouia (2005) to obtain the convergence in distribution of zn in L∞ ([α1 , α2 ]) to
z = (φ2 ◦ φ1 )′F (·|x) (W ◦ F (·|x)/ FX (x)) = −W / FX (x) where W (·) denotes the standard Brownian bridge. Moreover the d
process z has continuous paths. Since gn (zn ) −→ 0 whenever d
zn −→ z in L∞ ([α1 , α2 ]) for a process z with continuous paths at α (see van der Vaart, 1998, Proof of Proof of Lemma 21.7, p. 308), we conclude that gn (zn ) in (B.5) converges in distribution to 0. On the other hand, by the smoothness of F (·|x), we have
√ a.s. a.s. δn −→ α and f (qδn (x)|x) −→ f (qα (x)|x). Finally, since n(α − √ √ p a.s. αn√ ) −→ −c / FX (x), we get n(ˆqαn ,n (x) − qˆ α,n (x)) −→ c / FX (x)f (qα (x)|x). √ Proof of Theorem 4.2. Write n{ˆqαn1 ,n (x) − (ˆqα,n (x) − z σ (α, x) √ √ / n)} = n(ˆqαn1 ,n (x) − qˆ α,n (x)) + z σ (α, x). It follows from Lemma B.3 that
√
√
y] = P[ˆqαn ,n (x) ≤ qαn (x) + σn y] = P[Fˆ (qαn (x) + σn y|x) ≥ α] = P[An ≥ an ], where nFX (x) {αn − F (qαn (x) + σn y|x)}, αn (1 − αn ) √ nFX (x) {Fˆ (qαn (x) + σn y|x) An = √ αn (1 − αn ) − F (qαn (x) + σn y|x)} FX (x) = FˆX (x) 1/2 F (qαn (x) + σn y|x)[1 − F (qαn (x) + σn y|x)] × αn (1 − αn ) n − Wn,i × √ nσ (Wn,i ) i =1 an = √
where φ2 : G −→ F (·|x) ◦ G . Let us show that φ2 as a map φ1 (D1 ) ⊂ L∞ ([α1 , α2 ]) −→ L∞ ([α1 , α2 ]) is Hadamard differentiable at φ1 (F (·|x)) = F −1 (·|x) = q· (x) tangentially to φ1′ ,F (·|x) (C ([a, b])). Let H = φ1′ ,F (·|x) (h) with h ∈ C ([a, b]) and take an arbitrary converging path Ht → H in L∞ ([α1 , α2 ]) such that F −1 (·|x) + tHt ∈ φ1 (D1 ) for all small t > 0. By the smoothness of F (·|x), it can be easily seen that −1
√
Proof of Proposition 4.2. (i) Let σn = σ (αn , x)/ n = √ √ αn (1 − αn )/f (qαn (x)|x) nFX (x). We shall prove for any real y ∈ R that P[σn−1 (ˆqαn ,n (x) − qαn (x)) ≤ y] → Φ (y) as n → ∞. Let n be large enough so that qˆ αn ,n (x) belongs to the left neighborhood of ϕ(x) on which F (·|x) is differentiable with a strictly positive derivative f (·|x). We have P[σn−1 (ˆqαn ,n (x) − qαn (x)) ≤
p
n{ˆqαn1 ,n (x) − (ˆqα,n (x) − z σ (α, x)/ n)} −→ 0 as n → ∞.
(B.7)
√ √ p Likewise n{ˆqαn2 ,n (x) − (ˆqα,n (x) + z σ (α, x)/ n)} −→ 0 as n → √ √ p ∞. Hence n{(ˆqαn2 ,n (x) − qˆ αn1 ,n (x)) − 2z σ (α, x)/ n} −→ 0 as n → ∞. On the other hand, we have
with Wn,i = 1(Xi ≤ x, Yi ≤ qαn (x) + σn y) − F (qαn (x) + σn y|x)1(Xi ≤ x) and σ 2 (Wn,i ) = FX (x)F (qαn (x) + σn y|x)[1 − d
F (qαn (x) + σn y|x)]. We first need to prove that An → N (0, 1) and second we shall show that an → −y as n → ∞. It is −αn 1/ρx easy to see from (4.4) that qαn (x) = ϕ(x) − ( 1ℓ( ) for n x)
large enough. Likewise since f (y|x) = ρx ℓ(x){ϕ(x) − y}ρx −1 as y ↑ ϕ(x), we get f (qαn (x)|x) = ρx ℓ(x){ϕ(x) − qαn (x)}ρx −1 = x −1)/ρx for n large enough. Then σ /(ϕ(x) − ρx ℓ(x)1/ρx (1 − αn )(ρ√ n √ qαn (x)) = αn /ρx n(1 − αn )FX (x) → 0 since n(1 − αn ) → ∞. It follows that [1 − F (qαn (x) + σn y|x)]/[1 − F (qαn (x)|x)] = 1 − σn y/(ϕ(x) − qαn (x)) → 1. Therefore F (qαn (x) + σn y|x)[1 − F (qαn (x) + σn y|x)] ∼ αn (1 − αn ) as n → ∞. We also have a.s.
ε
∫
z 2 dFn,1 (z ) ≤ |z |≥ε
+ P(qα (x) ≥ qˆ αn2 ,n (x))}. √
By using (B.7), we obtain P(qα (x) ≤ qˆ αn1 ,n (x)) = P{ n(ˆqα,n (x) − qα (x)) + op (1) ≥ z σ (α, x)}. By the asymptotic normality, we have limn→∞ P(qα (x) ≤ qˆ αn1 ,n (x)) = 1 − Φ (z ). Likewise limn→∞ P(qα (x) ≥ qˆ αn2 ,n (x)) = 1 − Φ (z ). Therefore limn→∞ P[qα (x) ∈ Cn ] = 2Φ (z ) − 1.
∫
|z |3 1(|z | ≥ ε)dFn,1 (z ) R
3 Wn,i Wn,i 1 √ = E √ nσ ( W ) ≥ ε nσ (Wn,1 ) n ,1 √ √ P[|Wn,1 | ≥ ε nσ (Wn,1 )] ≤ ≤ 1/nε 2 { nσ (Wn,1 )}3 √ 3 { nσ (Wn,1 )} by Chebyshev’s inequality. Since σ 2 (Wn,1 ) ∼ αn (1 − αn )FX (x) and n(1 − αn ) → ∞, we get n |z |≥ε z 2 dFn,1 (z ) → 0 and so Wn,i nσ (Wn,i )
d
d
→ N (0, 1). Whence An → N (0, 1). Therefore the monotone function Sn (·) = P[An ≥ ·] converges pointwise to 1 − Φ (·) which is continuous. By Dini’s Theorem, Sn also converges uniformly to 1 − Φ . Finally it suffices to show that an → −y to conclude that P[An ≥ an ] → Φ (y). √ √ First we have an = −yσn nFX (x)f (δn |x)/ αn (1 − αn ) = −yf (δn |x)/f (qαn (x)|x) for a real δn lying between qαn (x) and q (x)−δ f (δ |x) qαn (x) + σn y. Second, since f (q n(x)|x) = {1 + ϕ(αxn)−q (nx) }ρx −1 for α α ∑n
i=1
P[qα (x) ∈ Cn ] = 1 − {P(qα (x) ≤ qˆ αn1 ,n (x))
d
FX (x)/FˆX (x) → 1. Hence to check that An → N (0, 1), it is enough to show according Loève’s criterion (1963, p. 295) that limn→∞ n |z |≥ε z 2 dFn,1 (z ) = 0 for all ε > 0, where Fn,1 is the common distribution function of the random variables √ Wn,i / nσ (Wn,1 ). We have
√
n
n
A. Daouia, I. Gijbels / Journal of Econometrics 161 (2011) 147–165 qαn (x)−δn ϕ(x)−qαn (x)
|y|σn | ≤ ϕ(x)− → 0, we get qαn (x) f (δn |x)/f (qαn (x)|x) → 1 and an → −y.
all n large enough, and |
(ii) We know by the proof of Theorem 4.1 (see Eq. (B.3)) that
√
n(ξˆm,n (x) − ξm (x))
√ √ = (FX (x)/FˆX ,n (x)) nΦm,n (x) − n(m/2)(m − 1) ∫ ϕ(x) [ηx,n (y)]m−2 [Fˆn (y|x) − F (y|x)]2 dy ×
a.s.
0 a.s.
and that supy |Fˆn (y|x) − F (y|x)| = O((log log n/n)1/2 ) in view of Eq. (B.4). For y ∈]0, ϕ(x)[ we have 0 < ηx,n (y) < 1 and a.s.
[ηx,n (y)]m(n)−2 → 0 when n → ∞, so using the dominated ϕ(x) a.s. convergence theorem we get 0 [ηx,n (y)]m−2 dy → 0. Since √ nm(m − 1)/σ (x, m) = O(n/ log log n), we obtain √ ∫ ϕ(x) n [ηx,n (y)]m−2 [Fˆn (y|x) − F (y|x)]2 dy (m/2)(m − 1) σ (x, m) 0 √ a.s. n ≤ m(m − 1)O(log log n/n) σ (x , m ) ∫ ϕ(x) a.s. × [ηx,n (y)]m−2 dy −→ 0. 0
On the other hand,
√
nΦm,n (x)
σ (x, m)
=
n − i=1
√
Zn,i nσ (Zn,i )
ϕ(x)
where Zn,i = (m/FX (x))1(Xi ≤ x) 0 F m−1 (y|x)[F (y|x) − 2 1(Yi ≤ y)]dy and its variance σ (Zn,i ) = √σ 2 (x, m). We 3 2 3/2 have nE[|Zn√ ≤ mϕ(x)/FX (x) nσ (Zn,1 ) → ,1 | ]/{nσ (Zn,1 )} 0 since m/ nσ (x, m) → 0. Hence Lyapunov’s Theorem gives
√
d
nσ −1 (x, m)Φm,n (x) → N (0, 1). Therefore d
− ξm (x)) → N (0, 1).
√
nσ −1 (x, m)(ξˆm,n (x)
References Aragon, Y., Daouia, A., Thomas-Agnan, C., 2005. Nonparametric frontier estimation: a conditional quantile-based approach. Econometric Theory 21, 358–389. Cazals, C., Florens, J.P., Simar, L., 2002. Nonparametric frontier estimation: a robust approach. Journal of Econometrics 106, 1–25.
165
Charnes, A., Cooper, W.W., Rhodes, E., 1981. Evaluating program and managerial efficiency: an application of data envelopment analysis to program follow through. Management Science 27, 668–697. Daouia, A., 2005. Asymptotic representation theory for nonstandard conditional quantiles. Journal of Nonparametric Statistics 17 (2), 253–268. Daouia, A., Florens, J.-P., Simar, L., 2008. Functional convergence of quantile-type frontiers with application to parametric approximations. Journal of Statistical Planning and Inference 138, 708–725. Daouia, A., Florens, J.-P., Simar, L., 2010. Frontier estimation and extreme value theory. Bernoulli 16 (4), 1039–1063. Daouia, A., Ruiz-Gazen, A., 2006. Robust nonparametric frontier estimators: influence function and qualitative robustness. Statistica Sinica 16 (4), 1233–1253. Daouia, A., Simar, L., 2007. Nonparametric efficiency analysis: a multivariate conditional quantile approach. Journal of Econometrics 140, 375–400. Daraio, C., Simar, L., 2007. Advanced robust and nonparametric methods in efficiency analysis. In: Methodology and Applications. Springer, New-York. Deprins, D., Simar, L., Tulkens, H., 1984. Measuring labor inefficiency in post offices. In: Marchand, M., Pestieau, P., Tulkens, H. (Eds.), The Performance of Public Enterprises: Concepts and Measurements. North-Holland, Amsterdam, pp. 243–267. Donoho, D.L., Huber, P.J., 1983. The notion of breakdown point. In: Bickel, P.J., Doksum, K.A., Hodges Jr., J.L. (Eds.), A Festschrift for Erich L. Lehmann. Wadsworth, Belmont, CA, pp. 157–184. Farrell, M.J., 1957. The measurement of productive efficiency. Journal of the Royal Statistical Society, Series A 120, 253–281. Florens, J.P., Simar, L., 2005. Parametric approximations of nonparametric frontier. Journal of Econometrics 124 (1), 91–116. Gijbels, I., Mammen, E., Park, B.U., Simar, L., 1999. On estimation of monotone and concave frontier functions. Journal of the American Statistical Association 94 (445), 220–228. Hampel, F.R., Contributions to the Theory of Robust Estimation. Ph.D. Thesis. University of California, Berkeley, 1968. Loève, M., 1963. Probability Theory, third ed. Princeton, Van Nostrand. Park, B.L., Simar,, Weiner, Ch., 2000. The FDH estimator for productivity efficiency scores: asymptotic properties. Econometric Theory 16, 855–877. Serfling, R.J., 1980. Approximation Theorems of Mathematical Statistics. In: Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons Inc., New York. Simar, L., 2003. Detecting outliers in frontiers models: a simple approach. Journal of Productivity Analysis 20, 391–424. Simar, L., Wilson, P.W., 2008. Statistical inference in nonparametric frontier models: recent developments and perspectives. In: Fried, H., Lovell, C.A.K., Schmidt, S. (Eds.), The Measurement of Productive Efficiency, second ed. Oxford University Press. van der Vaart, A.W., 1998. Asymptotic Statistics. In: Cambridge Series in Statistical and Probabilistic Mathematics, vol. 3. Cambridge University Press, Cambridge. van der Vaart, A.W., Wellner, J.A., 1996. Weak Convergence and Empirical Processes. With Applications to Statistics. In: Springer Series in Statistics, Springer-Verlag, New York. Wilson, P.W., 1993. Detecting outliers in deterministic nonparametric frontier models with multiple outputs. Journal of Business and Economic Statistics 11, 319–323. Wilson, P.W., 1995. Detecting influential observations in data envelopment analysis. Journal of Productivity Analysis 6, 27–45.
Journal of Econometrics 161 (2011) 166–181
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Nonparametric function estimation subject to monotonicity, convexity and other shape constraints Thomas S. Shively a,∗ , Stephen G. Walker b , Paul Damien a a
University of Texas at Austin, United States
b
University of Kent, United Kingdom
article
info
Article history: Received 2 March 2009 Received in revised form 3 June 2010 Accepted 6 December 2010 Available online 21 December 2010 JEL classification: C11—Bayesian analysis C14—Semiparametric and nonparametric methods
abstract This paper uses free-knot and fixed-knot regression splines in a Bayesian context to develop methods for the nonparametric estimation of functions subject to shape constraints in models with logconcave likelihood functions. The shape constraints we consider include monotonicity, convexity and functions with a single minimum. A computationally efficient MCMC sampling algorithm is developed that converges faster than previous methods for non-Gaussian models. Simulation results indicate the monotonically constrained function estimates have good small sample properties relative to (i) unconstrained function estimates, and (ii) function estimates obtained from other constrained estimation methods when such methods exist. Also, asymptotic results show the methodology provides consistent estimates for a large class of smooth functions. Two detailed illustrations exemplify the ideas. © 2010 Elsevier B.V. All rights reserved.
Keywords: Fixed-knot splines Free-knot splines Log-concave likelihood functions MCMC sampling algorithm Small sample properties
1. Introduction This paper uses regression splines in a Bayesian context to develop methods for the nonparametric estimation of functions subject to shape constraints in models with log-concave likelihood functions. The shape constraints we consider include monotonicity, convexity and functions with a single minimum. The class of models with log-concave likelihood functions contains many of those used regularly in the economics, finance and marketing literature including proportional hazard function models, generalized mixed models, and non-homogeneous Poisson processes, among others. We consider shape-constrained function estimation using both fixed-knot and free-knot spline models. These models were originally proposed by Smith and Kohn (1996) and Denison et al. (1998), respectively, for unconstrained nonparametric function estimation. The advantage of using free-knot models is that the data are allowed to specify the number and location of the knots.
∗ Corresponding address: Department of Information, Risk, and Operations Management, Mail Code B6500, University of Texas, Austin, TX 78712, United States. Tel.: +1 512 471 1753; fax: +1 512 471 0587. E-mail address:
[email protected] (T.S. Shively). 0304-4076/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2010.12.001
However, the disadvantage is the increased complexity of the MCMC algorithm required to implement the method because the knots and spline coefficients in a free-knot model must be generated jointly. The natural way to impose shape constraints in both fixed-knot and free-knot regression spline models is to impose restrictions on the spline coefficients. In a Bayesian context the constraints are imposed through the prior distributions on the coefficients. Shively et al. (2009) use this idea to develop a Bayesian method for monotone function estimation using fixed-knot splines in Gaussian regression models. The current paper departs from Shively et al. (2009) in the following key respects: (1) shape constraints via prior distributions on the spline coefficients are developed for free-knot spline models; (2) the methodology development applies to the family of models that have logconcave likelihood functions, not just Gaussian regressions; and (3) additional shape constraints that include convexity and functions restricted to a single minimum are developed for both free-knot and fixed-knot models. We also develop a new MCMC slice sampler, requiring only one auxiliary variable, to do a full Bayesian analysis in the context of models with log-concave likelihood functions. The sampler is computationally efficient and numerically stable, and works well for large data sets. Given the general nature of the sampler, it is
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
noted that this new MCMC method holds promise for families of models other than the ones in this paper. A simulation analysis in Section 3 shows that the algorithm converges faster than other MCMC algorithms. Nonparametric function estimation subject to shape constraints has been studied extensively. Early work in the estimation of monotone functions includes Wright and Wegman (1980) and Friedman and Tibshirani (1984). More recently, Neelon and Dunson (2004) and Shively et al. (2009) developed monotone estimation methods in the context of Gaussian models, Manski and Tamer (2002) and Banerjee et al. (2009) discussed the problem in binary models, and Dunson (2005) and Schipper et al. (2007) considered the problem in Poisson and generalized mixed models, respectively. Research in the nonparametric estimation of convex and concave functions includes Mammen (1991), Groeneboom et al. (2001) and Yatchew and Härdle (2006), among others. In terms of applications, convex function estimation is used extensively in derivative asset pricing models (see, for example, Yatchew and Härdle (2006) and the references therein, Broadie et al. (2000a,b) and Aït-Sahalia and Duarte (2003)). Functions constrained to have a single minimum are used in the energy economics literature. For example, Pardo et al. (2002) and Ihara et al. (2008) use these types of constrained functions to model the relationship between temperature and electricity demand, although the functions in these papers are parametric in nature. To describe the general model used in this paper, let yi , i = 1, . . . , n, be a set of observable data, xi an r × 1 vector of regressor variables, f1 (x1i ), . . . , fr (xri ) a set of unknown functions, some of which are shape-constrained, and φi a set of unknown parameters. The density function for yi is π (yi |f1 (x1i ), . . . , fr (xri ), φi ). Using this notation, we develop computationally efficient nonparametric shape-constrained function estimation techniques for models in which π is log-concave in f1 (x1i ), . . . , fr (xri ) and φi . The method can be easily generalized to models for which π is a unimodal likelihood function with only a slight increase in the computational requirements. Incorporating an appropriate shape constraint assumption into a model often results in considerably better function estimates than can be obtained using unconstrained function estimation techniques. This is illustrated in the simulation results in Section 5 for the estimation of a monotone function in the context of Cox’s (1972) nonparametric hazard model. We also show through simulation that our estimator performs well in finite sample sizes for Poisson models relative to Dunson’s (2005) and Schipper et al. ’s (2007) nonparametric monotone function estimation methods. In addition, the asymptotic results in Section 6 and Appendix D show that our estimator provides consistent estimates. The paper is organized as follows. Section 2 develops the nonparametric function estimation methods subject to different shape constraints. Section 3 outlines the MCMC sampling algorithms for the fixed-knot and free-knot models in the context of monotonicity constraints and log-concave likelihood functions; corresponding appendices provide details for the algorithms. Section 4 discusses the implementation of the function estimation methodology for specific models while Section 5 gives simulation results to show the small sample properties of our estimator relative to previously proposed estimators. 
Section 6 discusses the estimator’s asymptotic properties and shows that it is a consistent estimator of a monotone function in a Poisson model. Extensions of these properties to other classes of models are treated in an Appendix. Section 7 contains two examples. The first illustrates monotone function estimation in the context of a discrete-time nonparametric proportional hazard model. The second applies the methodology to the estimation of a function constrained to have a single minimum in a Gaussian model with autocorrelated errors.
167
2. General model The complete Bayesian model specifications for fixed-knot and free-knot models under monotonicity, convexity and singleminimum restrictions are detailed in this section. Sections 2.1 and 2.2 outline the general model, including the likelihood function and the fixed-knot and free-knot spline models. The prior distributions on the spline coefficients that constrain functions to be monotonic, convex or have a single minimum are discussed in Sections 2.3–2.5. The methodology outlined in Sections 2.1–2.5 and implemented using the MCMC sampling algorithms in Section 3 is discussed in the context of univariate function estimation only. However, the methodology and sampling schemes also apply to additive models as illustrated in the two examples in Section 7. 2.1. Likelihood function Let f0 (x) represent the unknown function of interest, φi represent a vector of parameters, and π (yi |f0 (xi ), φi ) represent the density function for yi conditional on f0 (xi ) and φi . The assumption we make regarding the density function π is that it is log-concave in f0 (xi ) and φi . For example, for a Poisson-gamma model with counts yi and frailty parameter φi :
π (yi |f0 (xi ), φi ) ∝ exp{−φi f0 (xi )}(φi f0 (xi ))yi (1) where the φi are independent and identically distributed gamma random variables. Without loss of generality, we will always assume 0 ≤ x1 ≤ · · · ≤ xn ≤ 1. Similarly, for a Weibull proportional hazard function model with hazard function λi (t ) = f0 (xi )λ0 (t ) where λ0 (t ) = ηt η−1 , observational data yi = (ti , δi ) where δi = 1 indicates that ti is an uncensored observation and δi = 0 indicates ti is censored at c (so ti = c if δi = 0), and φi = η for all i, the density function is:
π (ti , δi |f0 (xi ), η) =
∏
η−1
(ηti
)
i:δi =1
× exp
n −
η
δi f0 (xi ) − ti exp{f0 (xi )} .
(2)
i=1
In this case, π is log-concave in f0 (xi ) and η. The interpretation of f0 (xi ) and φi is model dependent. In the Poisson-gamma model, E (Yi ) = φi f0 (xi ) while in the proportional hazard model f0 (xi ) is the proportionality constant for the ith hazard function. In many models, f0 (xi ) is required to be positive. If this is the case, the restriction will be imposed through the prior on f0 (xi ). The proof in Section 6 and Appendix D that shows our function estimator is consistent holds for the class of models where π is logconcave in f0 (xi ) and φi . However, in terms of the MCMC sampling algorithm we can weaken the assumption to one requiring only that the likelihood function be unimodal. The weaker assumption results in a slight increase in the computational requirements but the MCMC sampling algorithm still converges quickly. Letting y = (y1 , . . . , yn )′ and assuming the yi ’s are independent, the density function for y, and therefore the likelihood function for f0 (x1 ), . . . , f0 (xn ), φ1 , . . . , φn , can be expressed as
π (y|f0 (x1 ), . . . , f0 (xn ), φ1 , . . . , φn ) =
n ∏
π (yi |f0 (xi ), φi ).
(3)
i =1
In practice, we use finitely parametrized fixed-knot and free-knot regression spline models to approximate f0 (x). If f˜ (x) represents the fixed-knot or free-knot spline function then the resulting approximating model is
π (y|f˜ (x1 ), . . . , f˜ (xn ), φ) =
n ∏
π (yi |f˜ (xi ), φi )
i=1
where φ = (φ1 , . . . , φn )′ . We use E (f˜ (x)|y) as the estimate of f0 (x).
168
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
2.2. Fixed-knot and free-knot regression spline models
2.3. Prior distributions on the regression spline coefficients to impose monotonicity
This section provides a brief description of the fixed-knot and free-knot spline models along with the associated notation that will be used in the remainder of the paper. Prior distributions for all parameter values other than the regression spline coefficients are also given. The functions are constrained to take on specific shapes by putting constraints on the spline coefficients through their prior distributions. These priors are discussed in Sections 2.3–2.5. 2.2.1. Fixed-knot regression spline model The fixed-knot quadratic regression spline approximation to the function f0 (x) is f (m) (x) = α + β1 x + β2 x2 + β3 (x − η1 )2+ + · · ·
+ βm+2 (x − ηm )2+
(4)
where η1 , . . . , ηm are m fixed ‘‘knots’’ placed along the domain of the independent variable x such that 0 < η1 < · · · < ηm < 1 and (z )+ = max(0, z ). Variable selection is used to determine which knots remain in the model. Smith and Kohn (1996) used this model for unconstrained nonparametric function estimation. Quadratic regression splines are used instead of the more typical cubic regression splines because they impose a degree of smoothness on the function but the constraints required to ensure various shape constraints are more tractable than for cubic splines. For notational purposes, define Jj such that Jj = 0 if βj = 0 and Jj = 1 if βj ̸= 0. The Jj values are assumed to be a priori independent with π(Jj = 0) = p for j = 1, . . . , m + 2. The values used for p are discussed in Section 5. Alternative priors for Jj can also be used. For example, a prior that assigns equal probability to each number of knots could be used (see Cripps et al. (2005)). However, simulation results (available on request) indicate the prior π (Jj = 0) = p works best in practice. The prior for α is N (0, 1010 ). 2.2.2. Free-knot regression spline model The free-knot linear regression spline approximation to the function f0 (x) is f (K ) (x) = α + β0 x + β1 (x − ξ1 )+ + · · · + βK (x − ξK )+
(5)
where ξ1 , . . . , ξK are K knots on the domain of the independent variable x such that 0 < ξ1 < · · · < ξK < 1. We use ξj to represent the free knots to distinguish them from the fixed knots represented by ηj in the previous model. The general framework for the free-knot spline model is similar to the one originally proposed by Denison et al. (1998) for unconstrained function estimation. We use linear regression splines in the free-knot model because they impose a degree of smoothing similar to that imposed by quadratic regression splines with fixed knots. The free-knot model discussed below can be modified in a straightforward way to allow for piecewise constant or quadratic regression splines. For notational purposes, let ξ˜K = (ξ1 , . . . , ξK )′ and β˜ K = (β0 , . . . , βK )′ . We use uninformative priors on K and α . In particular, pr (K = k) = 1/kmax , k = 1, . . . , kmax where kmax is specified by the user, while the prior distribution for α is N (0, 1010 ). For estimating monotone and convex functions the knots are assumed to be uniformly distributed on (0, 1] so the prior for ξ˜K |K is the density function of the order statistics for K independent random variables drawn from a U (0, 1) distribution. It is also possible to use more informative prior distributions for K and ξ˜K |K if information is available about the number and location of the knots. For estimating a function with a single minimum, we put a prior distribution on the knot associated with the minimum of the function. The remaining knots are assumed to be uniformly distributed conditional on the location of this knot. This is discussed further in Section 2.5.
The prior distributions on the βj -coefficients used in the fixedknot spline model to impose monotonicity are discussed in Shively et al. (2009). They showed how to implement the resulting constrained estimation methodology in the context of a Gaussian regression model. We show in Section 3.1 how to implement the methodology in any model with a log-concave likelihood function. In the free-knot linear regression spline model with a given K , the constraints on the βj -coefficients to impose monotonicity are
β0 ≥ 0, β0 + β1 ≥ 0, . . . ,
j=0 βj ≥ 0. Note that the constraints change as knots are added and deleted (i.e. as K changes). This is handled in the MCMC sampling algorithm discussed in Section 3.2 and Appendix C that we use to implement the methodology. In the free-knot model, unlike the fixed-knot model, the knots and regression coefficients must be generated jointly for each j with the monotonicity constraints updated appropriately. In general, the linear restrictions on the elements of β˜ K required to ensure the function is non-decreasing can be written as γ˜K = LK β˜ K , where γ˜K = (γ0 , . . . , γK )′ and LK is a (K + 1) × (K + 1) matrix with the ijth element = 1 if i ≥ j and = 0 otherwise, and each element of γ˜K must be greater than or equal to zero. The portion of the γ˜K parameter space that guarantees a nondecreasing function is the multi-dimensional generalization of the first quadrant including the hyperplanes that border this space. The prior distribution we use for γ˜K |K is similar to the one used in Shively et al. (2009) for a fixed-knot model. More specifically, the prior is a mixture distribution of a normal distribution N (0, τ 2 I ) constrained to the multi-dimensional generalization of the first quadrant, where I is the identity matrix with appropriate dimension and τ 2 is typically set to the number of observations, and probability distributions on the boundaries of this space. Putting distributions on the boundaries allows us to obtain good estimates when the function being estimated has significant flat portions. Neelon and Dunson (2004) use a mixture prior in their monotone function estimation methodology for the same reason. Setting τ 2 to the number of observations makes the prior similar to one used by Smith et al. (1998) in an unconstrained function estimation problem.
∑K
2.4. Prior distributions on the regression spline coefficients to impose convexity This section provides a description of the fixed-knot regression spline prior on f0 (x) used to impose convexity. A similar prior can be used in the context of a free-knot model. Consider the fixed-knot model for f (m) (x) in (4). f (m) (x) is a convex function if the second derivative is non-negative for all x where d2 f (m) (x) dx2
= 2β2 + 2β3 I (x > η1 ) + · · · + 2βm+2 I (x > ηm )
(6)
and I (x > η) = 1 if x > η and = 0 otherwise. Constraints are imposed on the βj -coefficients to ensure that the resulting function is convex. The constraints depend on J2 , . . . , Jm+1 and change as variables enter and leave the model. For example, if Jj = 1 for all j then the constraints are β2 ≥ 0, β2 + β3 ≥ 0, . . . , j=2 βj ≥ 0 (β1 is unconstrained). Let J = (J2 , . . . , Jm+2 ), βJ consist of the elements β2 , . . . , βm+2 corresponding to those elements of J that ∑m+2 ∑m+2 are equal to one, and LJ be an j=2 Jj × j=2 Jj matrix where the ijth element is 1 if i ≥ j and 0 otherwise. The linear restrictions on the elements of βJ required to ensure the function is convex can be written as γJ = LJ βJ , where each element of γJ must be greater than or equal to zero. The prior on γJ is similar to the one discussed in Shively et al. (2009) used to impose monotonicity.
∑m+2
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
2.5. Prior distribution to ensure the function has a single minimum This section discusses the free-knot regression spline prior used to impose the constraint that realizations from the prior and posterior function spaces have a single minimum. A prior distribution is also placed on the location of the minimum. Other than the constraint of a single minimum the function is estimated nonparametrically. The method in this section generalizes previous methods used in the energy economics literature where the location of the minimum is specified a priori or is estimated using a least squares-type approach (see Pardo et al. (2002) and Ihara et al. (2008)). Previous methods also typically assumed the parametric form of the function on either side of the minimum is known. The shape constraint in this section is different from the monotonicity and convexity constraints discussed in Sections 2.3 and 2.4 because monotonicity and convexity are imposed by restricting the first and second derivatives, respectively, to be positive. The function constraint discussed below allows these derivatives to change signs but ensures that the first derivative changes sign exactly once. A similar prior can be used with fixed-knot splines to impose the constraint of a single minimum. The disadvantage of using the fixed-knot model is that it is difficult to put a prior on the location of the minimum. In many examples, there will be important information available about this value that should be incorporated into the model. The free-knot spline function we use is (1)
(1)
(1)
(1)
f (K ) (xi ) = α + β0 xi + β1 (xi − ξ1 )+ + · · · + βK1 (xi − ξK1 )+
+ βmin (xi − ξmin )+ + β1(2) (xi − ξ1(2) )+ + · · · + βK(22) (xi − ξK(22 ) )+ (1)
(2)
(1)
(2)
where ξ1 < · · · < ξK1 < ξmin < ξ1 < · · · < ξK2 , and ξmin is the location of the function minimum. The goal is to develop priors on the coefficients so the spline function is monotone decreasing for x < ξmin and monotone increasing for x > ξmin . The number of knots less than ξmin is K1 with the associated (1) (1) (1) knots and spline coefficients denoted ξ˜K1 = (ξ1 , . . . , ξK1 )′ and
β˜ K(11)
(β1(1) , . . . , βK(11) )′ , respectively. Similarly, the number of knots greater than ξmin is K2 with the associated knot and (2) (2) coefficient vectors given by ξ˜K2 and β˜ K2 . A prior distribution is placed on ξmin . Given ξmin and K1 , the (1) prior for ξ˜K1 |ξmin , K1 is the density function of the order statistics for K1 independent random variables drawn from a U (0, 1) (2) distribution. A similar prior is used for ξ˜K2 |ξmin , K2 . Uninformative prior distributions are placed on K1 and K2 . In particular, pr (Kj = k) = 1/kj,max , j = 1, 2 and k = 1, . . . , kj,max where k1,max and =
k2,max are specified by the user.
(1)
The spline coefficients β0 and β˜ K1 are constrained through their (K )
prior distribution so f (xi ) is monotone decreasing on the interval (0, ξmin ). Similarly, the coefficients βmin and β˜ K(22) are constrained so the function is monotone increasing on the interval (ξmin , 1). As with the strictly monotone free-knot spline prior discussed in (1) (2) Section 2.3, we reparametrize to γ˜K1 and γ˜K2 and use mixture (1)
(2)
priors for γ˜K1 |K1 and γ˜K2 |K2 with the γ -coefficients constrained to the appropriate regions to ensure decreasing and increasing functions on (0, ξmin ) and (ξmin , 1). Finally, the knot ξmin must remain in the model since it represents the required ‘‘turning point’’ of the function. The MCMC algorithm given in Section 3.2 for strictly monotone function estimation in the context of free-knot splines can be (1) (1) (2) (2) modified to generate the values K1 , ξ˜K1 , γ˜K1 and K2 , ξ˜K2 , γ˜K2 . The
169
primary differences are in the conditional distribution of the knots (1) (2) (1) (2) and the constraints on γ˜K1 and γ˜K2 , or equivalently, β˜ K1 and β˜ K2 , to ensure the function has the appropriate shape. The prior we use guarantees that each function realization from the prior and posterior function spaces has a single minimum. In principle, it is possible the posterior mean of the function E (f (K ) (x)|y) will have more than one minimum because the space of functions with a single minimum is not a convex set. However, simulation results (available from the authors on request) indicate this seldom occurs in practice when the data generating function has a single minimum. 3. MCMC sampling algorithm This section outlines a new MCMC sampling algorithm with good convergence properties for estimating shape-constrained functions in models with log-concave likelihood functions. In particular, Sections 3.1 and 3.2 show how to construct MCMC algorithms for fixed-knot and free-knot models, respectively, with monotonicity imposed. Given the prior distributions in Sections 2.4 and 2.5, the algorithms can be modified to handle functions constrained to be convex or have a single minimum. The key idea in both the fixed-knot and free-knot algorithms is to incorporate a single latent variable to make them computationally tractable for models with log-concave likelihood functions. We show in Section 3.3 that the algorithm for the fixed-knot model has good convergence properties relative to existing methods when existing methods can be used. The algorithms will also work with a slight modification for likelihood functions with a single mode but that are not log-concave. The main difference between the algorithms in Sections 3.1 and 3.2 is that the knots, spline coefficients and variable selection indicators must be generated jointly in the free-knot algorithm while in the fixed-knot algorithm only the spline coefficients and variable selection indicators need to be generated. Both algorithms handle the changing size of the spline coefficient vector as well as the changing constraints as knots enter and leave the model. The prior on the spline coefficients for the fixed-knot model is the same as the one given in Shively et al. (2009) in the context of Gaussian models. However, the MCMC algorithm in Section 3.1 to implement the methodology in non-Gaussian models is considerably different than the algorithm given in their paper. 3.1. MCMC sampling algorithm for the fixed-knot monotone spline prior For the fixed-knot model in (4), define J = (J1 , . . . , Jm+2 ), βJ to consist of the elements of β = (β1 , . . . , βm+2 ) corresponding to those elements of J that are nonzero, LJ to be a lower triangular matrix as defined in Shively et al. (2009), and γJ = LJ βJ . Then the prior distribution on γJ |J given in Shively et al. to impose monotonicity (see their paper for details) is a mixture distribution of a normal distribution N (0, τ 2 I ) constrained to the multidimensional generalization of the first quadrant and probability distributions on the boundaries of this space. The model for f (m) in (4) can be written in matrix notation as f (m) = ια + XJ βJ where XJ consists of the columns of X = (x, x2 , . . . , (x − ηm )2 ) corresponding to the nonzero elements of J. To make the model analytically tractable for use in an MCMC algorithm, we reparametrize to give f (m) = ια + WJ γJ where 1 WJ = XJ L− J . The corresponding likelihood function is given by
π (y|α, J , γJ , φ) =
n ∏ i=1
π (yi |α, J , γJ , φ).
(7)
170
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
We show how to generate (Jj , γj )|y, j = 1, . . . , m + 2, and therefore the f (m) (xi )|y values used to estimate E (f (m) (xi )|y). The likelihood function in (7) can be rewritten as
π(y|α, J , γJ , φ) = exp{−s(y, α, J , γJ , φ)} (8) ∑n where s(y, α, J , γJ , φ) = − i=1 log[π (yi |α, J , γJ , φi )]. The key to the MCMC algorithm is to introduce the scalar latent variable v such that
π (Jj = 1, γj | · · ·) ∝ I (amin < γj < amax )π (γj |Jj = 1)π (Jj = 1).
π(α, J , γJ , φ, θ , v|y)
(11)
∝ e−v I (v > s(y, α, J , γJ , φ))π (α, J , γJ , φ|θ )π (θ ). For notational purposes, let J(−j) = J without the jth element and γ (−j) = γ without the jth element. Using this notation, the MCMC sampling algorithm described below is used to carry out function estimation. For a discussion of Bayesian inference using MCMC methods see Gelfand and Smith (1990) and Tierney (1994). (0) Start with some initial values v [0] , α [0] , J [0] , γ [0] , φ [0] and θ [0] . (1) Generate v conditional on α, J , γJ , φ, θ , y; (2) Generate (Jj , γj ) conditional on v, J(−j) , γ(−j) , α, φ, θ , y; j = 1, . . . , m + 2; (Jj , γj ) will be generated as a block; (3) Generate α conditional on v, J , γJ , φ, θ , y; (4) Generate φ conditional on v, α, J , γJ , θ , y; (5) Generate θ conditional on v, α, J , γJ , φ, y. Let α [l] , γ [l] and J [l] be the iterates of α, γ and J in the sampling period. If wJi represents the ith row of WJ , then an estimate of the posterior mean of the ith element of f (m) , and therefore an estimate ∑L [ l] + wJ [l] ,i γJ[[ll]] ]. of f0 (xi ) is 1L l=1 [α We will focus on generating v and (Jj , γj ) in steps 1 and 2. Generating α and φ in steps 3 and 4 is done similarly to generating γj when Jj = 1 in step 2, and generating θ in step 5 is model specific.
1. Generate v : Generate v ∗ from an Exp(1) distribution and compute v = v ∗ + s(y, α, J , γJ , φ). 2. Generate (Jj , γj ); j = 1, . . . , m + 2. The actual method for generating (Jj , γj ) uses rejection sampling. To motivate the importance of using rejection sampling as well as to outline the structure of the general method we first consider an exact method that does not require rejection sampling. The disadvantage of the exact method is that it is too computationally intensive to implement in practice and is numerically unstable. We show in Appendix A how to modify the exact method to use rejection sampling so that the resulting algorithm is efficient and stable. We note that the rejection sampling step typically has a high acceptance rate in the models we have worked with. The reason for this is discussed in the Appendix. In the exact method (as well as the rejection sampling method), (Jj , γj ) is generated as a block by generating Jj first, and then γj . To generate Jj and γj , we have
π(Jj , γj | · · ·) ∝ I [v > s(y, α, J , γJ , φ)]π (γj |Jj )π (Jj )
(9)
where ‘‘· · ·’’ represents (y, α, J(−j) , γ(−j) , φ, θ , v). For Jj = 0, this yields
π(Jj = 0| · · ·) ∝ I (v > s(y, α, Jj = 0, J(−j) , γ(−j) , φ))π (Jj = 0). (10) Note that π (Jj = 0| · · ·) = 0 if v < s(y, α, Jj = 0, J(−j) , γ(−j) , φ). To find π (Jj = 1| · · ·), we integrate γj out of the density function in (9) with Jj set to one. To accomplish this, let s˜(γj ) = s(γj ; y, α, Jj = 1, J(−j) , γ(−j) , φ) − v so s˜(γj ) is a convex function (because the likelihood function is assumed to be log-concave). Then
π(Jj = 1, γj | · · ·) ∝ I [˜s(γj ) < 0]π (γj |Jj = 1)π (Jj = 1)
where π (γj |Jj = 1) is a mixture distribution consisting of a point mass at zero and a N (0, τ 2 ) distribution constrained to (0, ∞). The mixture distribution for γj |Jj = 1 is a result of the mixture prior on γJ |J. If s˜(γj ) is greater than zero for all γj ≥ 0, then π (Jj = 1| · · ·) = 0. Otherwise, let a∗min and amax represent the roots of this function. Noting that the monotonicity restriction is γj ≥ 0, let amin = max{0, a∗min }. Then
Given the values amin and amax , π (Jj = 1| · · ·) can be computed. Unfortunately, a∗min and amax can only be computed numerically and obtaining them to a sufficient degree of accuracy is often computationally intensive with numerical problems arising if the roots are not sufficiently accurate. For this reason, we find bounds on the values amin and amax , denoted bmin and bmax , such that bmin ≤ amin and amax < bmax , and then do the appropriate sampling using a rejection sampling algorithm. The rejection sampling algorithm is discussed in detail in Appendix A. However, the basic idea is that the approximating density function in rejection sampling is obtained using bmin and bmax in place of amin and amax in the true density function given in (11). The true and approximating densities are then the same up to a constant except on the intervals (bmin , amin ) and (amax , bmax ). For reasons discussed in Appendix A, if the candidate draw for Jj is 0, it is always accepted. Also, if the candidate draw for Jj is 1 and γj is in the interval (amin , amax ) it is always accepted (because the true and approximating densities are the same up to a constant). If the candidate draw for γj is in (bmin , amin ) or (amax , bmax ), it is rejected. However, using the method of finding bounds outlined in Appendix B typically gives bounds very close to amin and amax . This means the rejection region will be small and the overall acceptance rate will be high. In fact, given the monotonicity constraint, the lower bound is often exact because it is often the case that bmin = amin = 0. In this situation, only a bound on amax is required. 3.2. MCMC sampling algorithm for the free-knot spline prior To develop the MCMC algorithm for the free-knot model we use a reparametrized version of the model in (5) that is probabilistically equivalent. The reparametrized model is f (m) (x) = α + β0 x + β1 J1 (x − ξ1 )+ + · · · + βm Jm (x − ξm )+
(12)
where m = kmax and J = (J1 , . . . , Jm ) with Jj = 0 or 1. Also, let ξJ and βJ consist of the elements of ξj and βj , respectively, corresponding to those elements of J that are equal to one, let XJ consist of the vector x and the regressor variables in (12) corresponding to those elements of J that are equal to one, and let LJ be a (K + 1)×(K + 1) matrix ∑m with ijth element = 1 if i ≥ j and = 0 otherwise, where K = j=1 Jj . Note that the vector ξJ has length K , where K is the same value as defined in Section 2.2.2. The model in (12) can be written in matrix notation as f (m) = ια + XJ βJ , or 1 equivalently as f (m) = ια + WJ γJ where γJ = LJ βJ and WJ = XJ L− J . To make the reparametrized model probabilistically equivalent to the model in (5), the priors on J , ξJ |J and γJ |J must be specified appropriately. The prior for J is pr [J = (j1 , . . . , jm )] =
1
(m + 1)
m k
where j1 , . . . , jm are 0 or 1 and k = i=1 ji . Using this prior for J assigns equal probability to each number of knots. The priors for ξJ |J and γJ |J are the same as the priors for ξK |K and γK |K given in Section 2.2.2.
∑m
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
171
Table 1 Summary of the autocorrelation function values for different models, sampling methods and function values. Func. value
Method
ACF lag 10
20
50
100
200
0.084 0.879 0.109 0.936 0.067 0.937
0.060 0.777 0.084 0.880 0.053 0.881
0.040 0.621 0.059 0.787 0.037 0.782
0.011 0.005 0.025 0.013 0.005 0.039 0.006 0.002 0.016
0.004 0.002 0.014 0.005 0.001 0.023 0.001 0.001 0.011
0.001 0.000 0.005 0.001 0.000 0.015 0.000 0.001 0.005
0.042 0.029 0.082 0.065 0.044 0.034
0.028 0.019 0.054 0.047 0.029 0.026
0.016 0.013 0.033 0.029 0.021 0.013
Panel A: Poisson model f (0.25) f (0.50) f (0.75)
Single latent variable DWW Single latent variable DWW Single latent variable DWW
0.209 0.974 0.201 0.987 0.113 0.987
Single latent variable-Probit Single latent variable-Logit Albert and Chib Single latent variable-Probit Single latent variable-Logit Albert and Chib Single latent variable-Probit Single latent variable-Logit Albert and Chib-Probit
0.070 0.029 0.200 0.070 0.029 0.263 0.020 0.008 0.171
Single latent variable SSW Single latent variable SSW Single latent variable SSW
0.174 0.152 0.215 0.181 0.096 0.065
0.141 0.949 0.153 0.974 0.089 0.974
Panel B: Probit/logit model f (0.25) f (0.50) f (0.75)
0.033 0.013 0.077 0.034 0.019 0.108 0.014 0.006 0.020
Panel C: Gaussian model f (0.25) f (0.50) f (0.75)
0.096 0.075 0.131 0.101 0.066 0.051
The reported autocorrelation coefficients are averages across 50 runs of a simulation.
The likelihood function can now be written similarly to the likelihood function in (7) and (8) in Section 3.1 with the knots ξJ included in the list of parameter values. As in Section 3.1, we introduce the latent variable v such that
π(α, J , ξJ , γJ , φ, θ , v|y) ∝ e−v I (v > s(y, α, J , ξJ , γJ , φ)) × π (α, J , ξJ , γJ , φ|θ )π (θ ). An MCMC algorithm can now be constructed similar to the one in Section 3.1 except that step (2) is different. Step (2) becomes: (2) Generate (Jj , ξj , γj ) conditional on v, J(−j) , ξ(−j) , γ(−j) , α, φ, θ, y; j = 1, . . . , m; (Jj , ξj , γj ) will be generated as a block. To generate (Jj , ξj , γj ) as a block requires a Metropolis–Hastings step. The acceptance rates are typically over 70% and often over 90% which implies the approximating distribution is a good one. (Jj , ξj , γj )| · · ·, where ‘‘· · ·’’ represents v, J(−j) , ξ(−j) , γ(−j) , α, φ, θ , y, is generated from an approximating distribution by generating Jj , then ξj |Jj and finally γj |ξj , Jj from conditional distributions that are good approximations to the true conditional distributions. The approximating conditional distributions are discussed in Appendix C. 3.3. Convergence rates and CPU times This section reports simulation results to compare the convergence rates of the MCMC algorithm discussed in Section 3.1 for the fixed-knot spline model with the convergence rates of existing MCMC methods designed for three specific models: Poisson, probit/logit (i.e. binary data) and Gaussian. For each data set and each method, we compute the autocorrelation function, efficiency factors and CPU time per 1000 effective observations to measure convergence rates and compare these measures across methods. For the Poisson model, the method of Section 3.1 is compared to Damien et al.’s (1999) method (DWW) that uses 2n latent variables with the simulation results showing the single latent variable method converges much faster. For the probit/logit model the method is compared to Albert and Chib’s (1993) probit model method that uses n latent variables. The results show that the
efficiency factor for the single latent variable method used in conjunction with a logit model is approximately four times the efficiency factor for Albert and Chib’s method and requires less than half the CPU time per 1000 effective observations. For the Gaussian model the method is compared to Shively et al.’s (2009) method (SSW) that does not require any latent variables. The comparison with the SSW method provides a useful measure of the impact of incorporating a single latent variable into the MCMC algorithm. As the simulation results show (see below), the algorithm converges only slightly slower than the SSW method. This suggests there is only a small impact on convergence rates and CPU times of incorporating the latent variable into an MCMC algorithm in the way we propose. However, the method provides an algorithm for a wide variety of models that can be difficult to handle using existing methods. Also, once the shell program is written, it can be easily modified to handle different models by changing only the s(γ ) and s′ (γ ) functions. For each model, 50 runs of a simulation are done. A sample size of n = 400 observations and x-values equally spaced on the interval (0, 1] are used for each run. The warm-up and sampling periods are both 50,000 for all the MCMC methods. The model used to generate the Poisson data is given in (1) with φi = 1 for all i; the probit model pr [Yi = 1|f0 (xi )] = Φ [f0 (xi )] where Φ is the standard normal cumulative distribution function is used to generate the binary data; and the Gaussian model is yi = f0 (xi ) + εi , εi i.i.d. N (0, 1). For each model, f0 (x) = 0.1 + 2x2 (note that for the Poisson model f0 (x) is the mean function and must be positive for all x). The autocorrelation coefficients are computed for the function values f (0.25), f (0.50) and f (0.75) using iterates from the sampling period. The averages of the autocorrelation coefficients for lags 10, 20, 50, 100 and 200 across the 50 runs of the simulation are reported in Table 1 for the different methods and models. For each function value, the efficiency factor is defined as Efficiency factor =
1 1+2
∞ ∑ h=1
(13)
ρh
172
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
Table 2 Summary of the efficiency factors for different models, sampling methods and function values. Method
Efficiency factors f (0.25)
f (0.50)
f (0.75)
0.040 –
0.039 –
0.090 –
0.150 0.242 0.065
0.134 0.230 0.049
0.361 0.578 0.088
0.059 0.073
0.047 0.060
0.081 0.105
Panel A: Poisson model Single latent variable DWW Panel B: Probit/Logit model Single latent variable-probit Single latent variable-logit Albert and Chib Panel C: Gaussian model Single latent variable SSW
The efficiency factors are averages across 50 runs of a simulation. Table 3 Summary of the CPU times for different models, sampling methods and function values. Method
CPU time Per 1000 actual observations
Per 1000 effective observations
Panel A: Poisson model Single latent variable DWW
1.283 1.676
32.917 —
Panel B of Tables 1–3 report the averages of the autocorrelations, efficiency factors and CPU times for the Albert and Chib (1993) probit method and the latent variable method applied to probit and logit models (the 50 simulated data sets used in Panel B are generated from a probit model and the functions are then estimated using the three methods). The results show that the single latent variable method for a logit model converges considerably faster and requires less CPU time per 1000 effective observations.1 Panel C of Tables 1–3 report the averages of the autocorrelations, efficiency factors and CPU times for the SSW and single latent variable methods applied to a Gaussian model. As discussed above, the single latent variable method converges only slightly slower than the SSW method and the average CPU time per 1000 effective observations is only slightly greater. 4. Application to specific models This section discusses the application of the shape-constrained function estimation methodology developed in Sections 2 and 3 to generalized additive and hazard function models, and shows specifically what the s(γ ) function is in each case. The method applies to a wide class of models such as generalized mixed and extreme value models, and non-homogeneous Poisson processes. The s(γ ) function is determined similarly in each of these cases. 4.1. Generalized additive models
Panel B: Probit/Logit model Single latent variable-Probit Single latent variable-Logit Albert and Chib
10.152
75.658
2.851
12.393
1.337
27.373
0.963 1.015
19.924 16.932
Panel C: Gaussian model Single latent variable SSW
The CPU times are averages across 50 runs of a simulation.
(see Gamerman and Lopes, 2006, page 126) where ρh is the autocorrelation coefficient for the iterates from the sampling period at lag h for the function value. The sum of the autocorrelation coefficients in (13) is truncated at lag 200. The averages of the efficiency factors across the 50 runs of the simulation for the function values f (0.25), f (0.50) and f (0.75) are reported in Table 2. Following Gamerman and Lopes (2006), the effective sample size is defined as (Efficiency factor) ×nsampling where nsampling is the actual number of iterates in the MCMC sampling scheme. The effective sample size can be interpreted as the size of a sample of independent iterates that will give the same MCMC sampling variance as the nsampling correlated iterates give. Since the efficiency factors vary slightly across function values, we compute the effective sample sizes for the function value f (0.50). Table 3 reports the CPU times per 1000 effective observations for the function value f (0.50) for the different methods and models. The CPU times per 1000 actual observations are also reported in Table 3. The CPU times are averages across the 50 runs of the simulation. All runs were done on a Dell Precision 490 workstation. Panel A of Table 1 reports the averages of the autocorrelation coefficients when the DWW and single latent variable methods are applied to a Poisson model and shows the considerably faster convergence rates for the latent variable method. Panel A of Table 2 reports the efficiency factors for the latent variable method applied to the Poisson model. The efficiency factors are not reported for the DWW method because the ACF converges so slowly. Panel A of Table 3 reports the CPU times per 1000 actual and effective observations.
Dunson (2005) develops a methodology for monotone function estimation in the context of a Poisson-gamma model while Schipper et al. (2007) generalize his method to the class of generalized additive models. Generalized additive models have been applied in a wide variety of fields, including economics (Kim and Marschke, 2005) and transportation (Kweon and Kockleman, 2005), among others. The density function for the dependent variable in a generalized additive model can be written π (yi |θi , φ) = exp({[yi θi − b(θi )]/ai (φ)} + c (yi , φ)). Let µi = E (Yi ) = g −1 (f0 (xi )) where g is a link function and f0 is an unknown function. We model f0 with the fixed-knot regression spline given in (4) (a similar representation applies for the free-knot spline in (5)). Using the γ parametrization, the likelihood function is
π (y1 , . . . , yn |α, J , γ , φ) ] n [ − yi (α + wJi γJ ) − b(α + wJi γJ ) = exp + c (yi , φ) ai (φ) i =1 = exp{−s(y, α, J , γJ , φ)}. For the common types of generalized linear models (e.g. Gaussian, binomial, Poisson, negative binomial, etc.), s(y, α, J , γJ , φ) is a convex function in α , the elements of γJ , and φ . Note that if there is only a single x-variable in the mean function then the form of the link function does not matter. If the mean must be positive (as in the Poisson model) this restriction can be imposed through the
1 We note that the form of the link function in a nonparametric regression with a single x-variable does not impact the estimated probabilities because the flexibility of the function f0 compensates for the different link functions. If there are multiple x-variables then the form of the link function will have an impact. If it is known that binary data are generated from a probit model and there are multiple x-variables, then Albert and Chib’s (1993) method is the appropriate one to use. Even though it converges more slowly, each iteration is considerably faster than the single latent variable method as applied to a probit model. The MCMC algorithm using a single latent variable is computationally intensive for the probit model because it requires computing the standard normal cdf in the s(γ ) function. The MCMC algorithm for the logit model does not require such a calculation and is consequently much faster.
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
prior on the regression spline coefficients. This is discussed in more detail in Section 5. The methodology can be easily generalized to allow for noncanonical link functions (which we consider in Section 5 in the context of a Poisson model). It can also be generalized to handle multivariate generalized additive models (e.g. a multinomial model) and generalized additive mixed models such as the Poisson-gamma model discussed in Section 2. 4.2. Hazard function models
methods can be applied to the class of generalized additive mixed models, including Poisson-gamma and Poisson dose models (see Schipper et al.). However, to focus on the quality of the function estimates we compare the methods in the context of a Poisson model. Dunson’s (2005) and Schipper et al. ’s methods are similar, with Schipper et al. ’s method applying to a wider class of models. Also, Schipper et al. ’s method has better small properties so we only report the simulation results for their method. Using the Poisson model,
π (y|x) = exp{−µ0 (x)}
Hazard function models are used extensively in economics (see, for example, Abbring and van den Berg (2003)), marketing (Mitra and Golder, 2002), and especially in finance (Duffie et al. (2007) and Bharath and Shumway (2008)). As Bharath and Shumway state, ‘‘Hazard models have recently been applied by a number of authors and probably represent the state of the art in default forecasting with reduced-form models’’. Many of the relationships in these types of economic models can be assumed to be monotonic or convex based on subject matter theory. For example, in Duffie et al. (2007), the estimated default intensities are expected to be monotonically decreasing in the distance to default. In this section we consider Cox’s nonparametric proportional hazard function model. Let ti , i = 1, . . . , n, represent the time of ‘‘death’’ for the ith subject. Since the ordering of the subjects is arbitrary, we will assume they are ordered in the order of their deaths, i.e. subject 1 dies first at time t1 , subject 2 dies second at time t2 , etc. Note that ‘‘subjects’’ and ‘‘deaths’’ have a variety of interpretations. In Duffie et al. (2007), the subjects are corporations and deaths are defaults. The hazard function for subject i is hi (t ) = exp{f0 [xi (t )]}h0 (t )
(14)
where h0 (t ) is an unknown hazard function, xi (t ) is the value of the covariate for subject i at time t, and f0 is a monotone or convex function with unknown functional form. Note that the covariate xi (t ) is allowed to vary across both i and t. We model f0 with the fixed-knot spline in (4). Then, using the γ -parametrization for the model, the partial likelihood given t1 , . . . , tn corresponding to the likelihood function in (8) can be written
π(t1 , . . . , tn |α, J , γJ ) n − − = exp exp{α + wJk γJ } (α + wJi γJ ) − log i =1 k∈ Risk set
173
[µ0 (x)]y , y!
the simulation experiment sets n = 400 and considers the following four mean functions: (a) (b) (c) (d)
µ0 (x) = 2 (flat function); µ0 (x) = 0.1 + 3x (linear function); µ0 (x) = exp{1.386x3 } (exponential function); µ0 (x) = 1 + 3F (x), where F (·) is the distribution function for a N (0.5, (0.1)2 ) random variable.
These functions are chosen to represent a range of possible functions that might occur in practice. We note that the functions include ones with significant flat portions as well as ones that ‘‘change direction sharply’’. All the functions except the flat function have a range of three. The n = 400 x-values are equally spaced on (0, 1]. For the fixed-knot spline estimator, m = 9 equally spaced knots are used and pj = π (Jj = 0) = 0.8 while for the freeknot spline model kmax = 9 is used. Also, for both models we set f0 (x) = µ0 (x) in (1) (with φi = 1 for all i) and constrain f0 (x) to be positive using a prior on α that is constrained to (0, ∞), i.e. we estimate µ0 (x) directly rather than log[µ0 (x)] as is often done. The MCMC sampling scheme was run for a warm-up period of 50,000 iterations and a sampling period of 200,000 iterations. Convergence occurred well before this many iterations. If µ ˆ 0 (xi ) is the estimate of µ0 (xi ), we use the root-mean-squareerror
n 1 − RMSE = (µ0 (xi ) − µ ˆ 0 (xi ))2 n i =1
= exp{−s(t , α, J , γJ )} where t = (t1 , . . . , tn ) and the Risk set at time ti includes the subjects still alive the instant before time ti . s(t , α, J , γJ ) is a convex function in α and the elements of γJ .
to quantify the accuracy of µ ˆ 0 (xi ). The simulation results in Table 4 indicate that the free-knot and fixed-knot spline estimators both do better than the Schipper et al. (2007) estimator for each function considered. One of the advantages of fixed-knot and free-knot spline models that is reflected in the results is that they are adept at estimating functions that have a high degree of local variability yet still do well for globally smooth functions. The fixed-knot spline does better than the free-knot spline for two of the four functions. For the fixed-knot spline, using m = 19 knots gave similar results. All results are based on 50 simulation runs.
5. Simulation results
5.2. Cox nonparametric hazard function models
The estimation methodology developed in Sections 2 and 3 is very general and applies to a wide class of models. However, for conciseness, the simulations in this section focus on monotone function estimation in two specific models: (1) Poisson models because nonparametric monotone function estimators have been studied extensively by Dunson (2005) and Schipper et al. (2007) for variations on this model; and (2) Cox’s nonparametric hazard model because this model is used frequently in the economics, finance and marketing literature.
This section compares the small sample properties of the fixed-knot regression spline estimator with and without the monotonicity assumption imposed in a Cox hazard function model. The results show the substantial increase in the quality of the function estimates in situations where it is appropriate to impose monotonicity. We are unaware of other nonparametric monotone function estimation methods for the Cox hazard function model so there are no other estimation methodologies available for comparison purposes as there are for Poisson models. The simulation experiment sets n = 200 and generates the order of the 200 ‘‘deaths’’ using the proportional hazard model in (14). The x-values vary across subjects and are constant through time. They are generated from a uniform distribution on the interval (0, 1). The following four functions for f0 (x) are considered (with corresponding proportionality function ρ0 (x) = exp{f0 (x)}):
at time ti
5.1. Poisson model Here we compare the small sample properties of the free-knot and fixed-knot monotone regression spline estimators and the estimator proposed by Schipper et al. (2007). We note that all three
174
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
Table 4 Summary of root-mean square errors for the Poisson model. Mean function
Flat Linear Exponential Normal dist. function
Fixed-knot spline
Free-knot spline
Average RMSE
Average RMSE
0.064 0.106 0.136 0.167
0.066 0.089 0.153 0.161
Schipper et al. method Percentage increase (%) 3.1
−16.0 12.5
−3.6
Average RMSE
Percentage increase (%)
0.086 0.124 0.174 0.209
34.4 17.0 27.9 25.1
All results are based on 50 simulation runs. The functions are defined above. Percentage increases/decreases are from the fixed-knot spline method. Table 5 Summary of root-mean square errors for the Cox proportional hazard model. Mean function
Flat Linear Exponential Normal dist. function
Monotone regression spline
Unconstrained regression spline
Average RMSE
Average RMSE
Percentage increase (%)
0.039 0.110 0.102 0.137
0.080 0.142 0.115 0.161
105.1 29.1 12.7 17.5
All results are based on 50 simulation runs. The functions are defined above. Percentage increases are from the monotone regression spline method.
(a) f0 (x) = 0 so ρ0 (x) = 1 (flat function); (b) f0 (x) = log(1 + x) so ρ0 (x) = 1 + x (linear function); (c) f0 (x) = 0.693x3 so ρ0 (x) = exp{0.693x3 } (exponential function); (d) f0 (x) = log(1 + F (x)) so ρ0 (x) = 1 + F (x), where F (·) is the distribution function for a N (0.5, (0.1)2 ) random variable. These functions are chosen to represent a variety of functional forms that are likely to occur in practice. All the proportionality functions except the flat function have a range of one to make comparisons across functions easier. For both the monotone and unconstrained regression spline estimator, m = 19 equally spaced knots are used and pj = π (Jj = 0) is set to 0.8 for each j. The MCMC sampling schemes are run using warm-up periods of 5000 iterations and sampling periods of 40,000 iterations. If fˆ0 (xi ) is the estimate of f0 (xi ), we use the root-mean-squareerror
n 1 − (f0 (xi ) − fˆ0 (xi ))2 RMSE = n i =1
to quantify the accuracy of fˆ0 (x). Table 5 gives the RMSE for the monotone and unconstrained estimators. The results indicate the value of imposing monotonicity to obtain better function estimates. 6. Consistency property This section considers Bayesian consistency for monotone function estimation in the context of the Poisson model used in Section 5.1. We then show how to generalize the argument in Appendix D to prove consistency for the entire class of generalized additive models discussed in Section 4.1 and the proportional hazard function model discussed in Section 4.2. For each model, we develop conditions on the class of prior distributions for f , nondecreasing on [0, 1], so that f converges to some true underlying f0 as sample sizes tend to infinity, and show that our prior is a member of this class. Similar proofs can also be used to show the convex function estimators obtained using the prior in Section 2.4 provides consistent estimates of the true underlying function f0 in generalized additive and hazard function models. Consider the Poisson model whereby yi given xi is Poisson with mean f (xi ) and f is non-decreasing on [0, 1], the space of which is denoted by Ω . The aim here is to find conditions on the prior Π (df ) so that
Π (Aε |(x1 , y1 ), . . . , (xn , yn )) → 0 a.s.
for all ε > 0, where Aε = {f : d(f , f0 ) > ε}, d(f , f0 ) =
∫
dH (p(·|f (x)), p(·|f0 (x)))G0 (dx),
dH is a version of the Hellinger distance, dH (p1 , p2 ) = 1 − so that 1/2
dH (p(·|f (x)), p(·|f0 (x))) = 1 − exp⌊−0.5{f 1/2 (x) − f0
√
p1 p2 ,
(x)}2 ⌋,
where f0 is the true function, p(·|f (x)) denotes the Poisson distribution with mean f (x) and G0 is the distribution of the (xi ). The posterior mass assigned to a set A will be written as Πn (A) = JnA /In where JnA
∫ ∏ n p(yi |f (xi )) = Π (df ) p A i=1 (yi |f0 (xi ))
and In =
∫ ∏ n p(yi |f (xi )) Π (df ). p Ω i=1 (yi |f0 (xi ))
The aim is to show that JnAε < e−nd a.s. for all large n, for some d > 0, and that In > e−nc a.s. for all large n for any c > 0. Consistency then follows because we can take c < d. From the expression for JnA we see that Jn+1 A
=
JnA
mnA (yn+1 |xn+1 ) p(yn+1 |f0 (xn+1 ))
,
where mnA (yn+1 |xn+1 ) =
∫
p(yn+1 |f (xn+1 ))ΠA (df |(x1 , y1 ), . . . , (xn , yn ))
A
and ΠA is Π restricted and normalized to the set A. Then
E
∫ F n = 1 − dH (mnA (·|x), p(·|f0 (x)))G0 (dx)
Jn+1 A JnA
where F n = σ ((x1 , y1 ), . . . , (xn , yn )). Now let A be a subset of Ω such that for all f1 and f2 in A it is that d(f1 , f2 ) < δ . We can fill up the space of Ω − Acε with such sets, say {Aj }∞ j=1 . This follows since d(f1 , f2 ) ≤ 0.5 sup |f1 (x) − f2 (x)| x
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
and the space of continuous real valued functions on [0, 1] is separable with respect to the uniform metric. Then, for all j, and for some fj ∈ Aj , using the triangular inequality,
∫
dH (mnAj (·|x), p(·|f0 (x)))G0 (dx)
∫ ≥
E (φk2+ ) < +∞.
that is, for some a > 0 and b > 0, dH (mnAj (·|x), pj (·|f0 (x)))G0 (dx)
a −4−b . E (γ[2k+ /3]l ) ∝ k
7. Examples of shape-constrained function estimation
dH (mnAj (·|x), p(·|f0 (x)))G0 (dx) ≥ ε − δ = ε/2
once we have taken δ = ε/2. Hence, JnAj ≤ (1 − ε/2)n
Π (Aj ).
Therefore, in order to achieve
−
k1+
−4− ; E (γ[2k+ /3]l ) ∝ k
and so
∞ − k=1
dH (p(·|fj (x)), p(·|f0 (x)))G0 (dx)
∫
E
where ξ < ∞ does not depend on k. This holds, taking δ˜ k ∝ k−1− , when
Hence, it is sufficient to take
−
∫
175
JnAj ≤ e−nd
a.s. for all large n
The first example applies the monotone function estimation methodology to a discrete-time nonparametric proportional hazard model in the context of unemployment data. The second applies the methodology to the estimation of a function constrained to have a single minimum in a Gaussian model for electricity consumption with autocorrelated errors. 7.1. Monotone function estimation in a discrete-time proportional hazard model
j
we need
−
Π (Aj ) < +∞.
j
Now
Πn (Aε ) =
−
Πn (Aj ) ≤
− − Πn (Aj ) = In−0.5 JnAj j
j
j
and therefore the required consistency result now follows, since for all suitable f0 with a Kullback–Leibler support condition of Π , it is that In > e−nc a.s. for all large n, for any c > 0. See, for example, Walker (2004). ∑ Hence, we need to find conditions on Π so that j Π (Aj ) < +∞. Equivalently, to establish the restriction on the prior distribution of the parameters {γk }∞ where each γk is a 3-vector k∑ =1 , of positive r.v., in order to ensure Π (Aj ) < +∞. The prior j for γkl , l = 1, 2, 3, will be denoted by πkl and all are independent. We can obtain a set A, i.e. f1 and f2 in A, by having the associated parameters γ1 and γ2 so that
|γ1kl − γ2kl | < δk for all k and l where k δk < δ ∗ for some δ ∗ related to δ . We will use φ1 = γ11 , φ2 = γ12 , . . . , φ4 = γ21 and so on, and πk denotes the prior for φk . Now define, for δ˜ k = δ[k/3] ,
∑
Bmk = (mδ˜ k , (m + 1)δ˜ k ), for m = 0, 1, . . . , so we are looking for ∞ −
lim
N →∞
r 1 =0
∞ ∏ N − πk (Brk k ) < +∞,
···
rN =0 k=1
that is ∞ ∏
1+
k=1
∞ −
πk (Brk ) < +∞.
r =1
Using
πk (Brk ) < pr (φk2+ > r 2+ δ˜k2+ ) < E (φk2+ )/(r δ˜k )2+ , where 2+ etc. means 2 + a for any a > 0, (1) holds if ∞ ∏ k=1
{1 + ξ δ˜k−1− E (φk2+ )} < +∞
This section uses a discrete-time hazard function model to analyze Spanish male unemployment data. The data consist of 1279 unemployed workers who started receiving unemployment insurance (UI) benefits in February 1987. These data were first analyzed by Jenkins and Garcia-Serrano (2004). An excellent indepth discussion of the data and the importance of the analysis are given in their paper. One of the goals of their analysis and the one we consider here is to model the monthly re-employment hazard rate, hi (t ), i.e. to model the probability that UI recipient i will get a job in month t given that he is unemployed at the end of month t − 1. The explanatory variables included in the hazard model are: (1) x1it : Time-to-exhaustion of UI benefits for recipient i in month t. The hazard rate is expected to be a non-increasing function of the time-to-exhaustion of benefits because the incentive to get a job increases as the UI benefits run out. This is a time-varying covariate because time-to-exhaustion decreases as t increases. (2) x2it : Income replacement rate for recipient i in month t. The amount of UI benefits paid to a specific recipient is a percentage of his most recent salary. The replacement rate varies across recipients according to a well-defined rule. The replacement rates may also vary across the unemployment period for a specific recipient with x2it taking on one value for the first six months, a possibly lesser value for the second six months, and a lesser value still for the final 12 months. The hazard rate is expected to be a non-increasing function of the replacement rate because the incentive to get a job increases as the income replacement rate decreases. The income replacement rate is a time-varying covariate. (3) x3i : Age of recipient i in the month he begins receiving UI benefits. It is unclear from a subject matter perspective what the relationship between the recipient’s age and hazard rate will be. For this reason, the function associated with age is unconstrained (i.e. it is estimated nonparametrically but without any monotonicity constraints imposed). (4) D1i : Dummy variable representing recipient i’s family status (=1 if he has a family). (5) D2i , D3i , D4i , D5i : Dummy variables representing the region of Spain recipient i resides in (Center, North-East, South, and Islands, respectively). North is the baseline region left out of the model. (6) D6i : Dummy variable representing whether recipient i’s last job before starting UI benefits was temporary or permanent (=1 if he had a temporary job).
176
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
7.2. Function estimation in a model for electricity consumption with the constraint that the function has a single minimum The deregulation of the electricity market in many countries has generated substantial interest in determining the relationship between electricity demand and weather variables, especially temperature (see for example, Pardo et al. (2002) and Psiloglou et al. (2009)). It is well-known that the relationship between demand and temperature has a single minimum although the location of the minimum is different in different regions. Pardo et al. specified that the minimum occurs at 18 °C in their analysis of Spanish electricity data while Psiloglou et al. determined that the minimum occurs at 20 °C in Athens and 16 °C in London. This section contains an analysis of electricity consumption and weather data from the New South Wales, Australia electricity market for the period January 1, 2004 to December 31, 2004. These data were originally analyzed in Panagiotelis and Smith (2008) in a different context. The data are available every half hour. We analyze the 7 am, 3 pm and 7 pm observations as representative of the demand over the course of the day. 7 am represents morning demand before most people leave for work and temperatures tend to be low, 3 pm represents mid-afternoon when temperatures tend to be highest, and 7 pm represents early evening when most people are home from work. The following model was fit to electricity demand (separate models are fit to the three sets of observations):
Fig. 1a. Function estimate for time-to-exhaustion.
Demandt = α0 + α1 Timet + f (Tempt ) +
11 −
ωj Mjt
j=1
Fig. 1b. Function estimate for income replacement rate.
The maximum length of time a recipient is eligible for UI benefits is 24 months, although many workers have a shorter eligibility period. The observation for a specific recipient is censored if his eligibility is exhausted or if he stops receiving UI benefits for any reason other than getting a job (for example, death, permanent disability, emigration, etc.). The hazard model used in this section is similar to the one used by Jenkins and Garcia-Serrano (2004): hi (t ) 1 − hi (t )
=
h0 (t ) 1 − h0 (t )
× exp f1 (x1it ) + f2 (x2it ) + f3 (x3i ) +
6 −
λk Dki
k=1
where h0 (t ) is the baseline hazard function. f1 , f2 and f3 are unknown functions estimated nonparametrically with f1 and f2 restricted to be non-increasing functions. Estimates of the functions f1 and f2 are plotted in Fig. 1. The estimated values of λ1 , . . . , λ6 with their standard errors in parentheses are 0.09(0.09), −0.08(0.12), −0.23(0.15), −0.36(0.13), −0.13 (0.23), 0.51(0.13), respectively. The estimate of the function f1 increases as time-to-exhaustion decreases, and it increases at an increasing rate. This confirms the intuition that the probability a UI recipient gets a job will increase significantly as the benefits run out. The estimate of the function f2 is decreasing as the income replacement rate increases. This also supports the intuition discussed above that the probability a UI recipient gets a job will be lower the higher the income replacement rate is. The function f3 is not shown to conserve space but it shows very little change over the range of age values in the data. The dummy variable with the largest coefficient is D6i . The ˆ 6 = 0.51 indicates a UI recipient whose last coefficient estimate λ position was a temporary position is considerably more likely to get a job in any given month than a recipient whose last position was a permanent position.
+
6 − j =1
λj Djt +
5 −
δj PHjt + εt
(15)
j =1
where Demandt and Tempt are the electricity demand and temperature on day t. The function f (Temp) is estimated nonparametrically and constrained to have a single minimum. Timet is a trend variable that takes on the value t on day t. There are strong seasonal and day of the week effects so eleven monthly dummy variables, Mjt , j = 1, . . . , 11, and six dummy variables for days of the week, Djt , j = 1, . . . , 6, were included. Dummy variables were also included for the public holidays of New Year’s Day, Good Friday, Easter Sunday, Christmas and Boxing Day. The model in (15) is similar to the one estimated by Pardo et al. except they modeled f (Tempt ) as f (Tempt ) = β1 HDDt + β2 CDDt where HDDt = max(Tempref − Tempt , 0) and CDDt = max(Tempt − Tempref , 0) and Tempref is a reference temperature that is selected to adequately separate the cold and heat branches of the demand/temperature relationship. When the model in (15) is estimated there is a significant problem with autocorrelation in the residuals. For example, for the 7 pm data the first four autocorrelation coefficients are 0.50, 0.33, 0.20 and 0.11 and are all more than three standard errors from zero. Pardo et al. find similar autocorrelation in their model for electricity consumption. To account for this we assume a first-order autocorrelation process for the εt . The autocorrelation is incorporated into the constrained function estimation methodology using a technique similar to Smith et al. (1998). A uniform U (0, 0.95) prior is placed on the autocorrelation coefficient. When the model is re-estimated the first six autocorrelation coefficients are all within two standard errors of zero as are the weekly autocorrelation coefficients at lags 7, 14 and 21. The estimates of the functions f (Tempt ) obtained for the model modified to account for autocorrelation are given in Fig. 2a (7 am
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
Fig. 2a. Electricity demand at 7 am.
Fig. 2b. Electricity demand at 3 pm and 7 pm.
function) and Fig. 2b (the solid line is the 3 pm function and dashed line is the 7 pm function). The 7 am function is plotted separately because the y-axis scale is different due to morning electricity demand being considerably lower. The minimum of the 7 am function occurs at 16 °C while the minimum of the 3 and 7 pm functions both occur between 19 and 20 °C. Also, for lower temperatures at both 7 am and 7 pm the demand function begins to flatten out. Such a flattening out could not be modeled by a convex function. 8. Conclusion This paper developed methodology for estimating shape constrained functions nonparametrically in models with logconcave likelihood functions. Both free-knot and fixed-knot regression spline models were used and the shape constraints included monotonicity, convexity and functions with a single minimum. The free-knot and fixed-knot estimation methods for monotonically constrained functions in a Poisson model were shown to outperform existing methods by up to 34%. In addition, the function estimates obtained using the fixed-knot model for monotonically constrained functions were shown to be consistent in the context of a Poisson model as well as in generalized additive and hazard function models. We also developed a new MCMC slice sampler requiring only a single auxiliary variable that allows a full Bayesian analysis to be done in models with log-concave likelihood functions. The sampler was shown to converge faster than existing methods. It was developed in the context of nonparametric function estimation but given its general nature it holds promise for the implementation of other estimation techniques in different types of models. A limitation of the current paper is that it is not possible to apply the methodology outlined in Section 2 directly to the estimation of shape constrained multivariate functions. This is
177
an important problem in economics because imposing shape constraints such as monotonicity and concavity on the indirect utility function are fundamental to the estimation of demand and cost functions. This is discussed extensively in Varian’s (1984) Microeconomics book as well as many papers in the economics literature; see Barnett and Serletis (2008) for an excellent survey of constrained function estimation methods in consumer demand modeling. A system of demand equations can be derived from the indirect utility function if regularity conditions are imposed that include monotonicity and quasi-convexity. Flexible methods for modeling such functions include second order local approximation methods (see, for example, the translog flexible functional form in Christensen et al. (1975) and the generalized Leontief functional form in Diewert (1974)), the globally regular flexible functional forms introduced by Barnett (1983) and Cooper and McLaren (1996), and the semi-parametric approaches of Gallant (1981), Gallant and Golub (1984), and Barnett and Jonas (1983). Future research includes the possibility of generalizing the spline estimation methodology in Sections 2 and 3 to handle multivariate functions with monotonicity and quasi-convexity imposed to allow for its use in estimating cost and demand functions. A significant issue for generalizing the methodology to constrained multivariate functions is that there will be more linear constraints on the multivariate spline basis function coefficients than there are coefficients. This means the constraints cannot be imposed using the γ = Lβ formulation in Section 2 without extensive modification. Also, a new MCMC algorithm is necessary for the Gaussian case to traverse the multivariate function space efficiently and avoid the ‘‘curse of dimensionality’’. Given an MCMC algorithm to implement constrained multivariate function estimation in a Gaussian model, we can generalize it to any model with a log-concave likelihood function by using the technique introduced in Section 3 of incorporating a single auxiliary variable into the algorithm. Acknowledgements We would like to thank the Editor, Associate Editor and two reviewers for their helpful comments and suggestions. They improved the paper considerably. Tom Shively and Paul Damien’s work was partially supported by a Faculty Research grant from the McCombs School of Business at the University of Texas at Austin. Appendix A. Rejection sampling algorithm to generate (Jj , γj ) This Appendix describes the rejection sampling algorithm used to generate (Jj , γj ). First, following Chib and Greenberg’s (1995) description but using our notation, the rejection sampling algorithm can be summarized as follows: Let π (Jj , γj ) ∝ g (Jj , γj ). Also, let h(Jj , γj ) be a density that can be easily sampled and suppose there is a constant c such that g (Jj , γj ) ≤ ch(Jj , γj ) for all (Jj , γj ). Then to obtain a random variate from π , (1) Generate a candidate (Jj , γj ) from h and a value u ∼ Unif (0, 1); g (J ,γ )
(2) If u ≤ ch(Jj ,γj ) , then return (Jj , γj ); otherwise, go to (1) and j j repeat. This rejection sampling algorithm holds for the combined discrete/continuous distribution π (Jj , γj | · · ·) in our model as well as for purely continuous densities. The function g in the rejection sampling algorithm is set to g (Jj = 0) = I (v > s(y, α, Jj = 0, J(−j) , γ(−j) , φ))π (Jj = 0) and g (Jj = 1, γj ) = I (amin < γj < amax )π (γj |Jj = 1)π (Jj = 1).
178
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
The approximating density function h at Jj = 0 is defined as h(Jj = 0) = capp I (v > s(y, α, Jj = 0, J(−j) , γ(−j) , φ))π (Jj = 0) where capp is a constant that does not depend on Jj or γj . Note that h(Jj = 0) is the same as g (Jj = 0) except for the constant capp . To construct the approximating density function h at (Jj = 1, γj ), suppose we have: (1) a computationally efficient method to determine if the minimum value of s˜(γj ) is greater than zero for all γj > 0 (where s˜(γj ) is defined in Section 3.1); and (2) if s˜(γj ) < 0 for some γj > 0, a computationally efficient method for computing bounds bmin and bmax on the values amin and amax . Methods to accomplish (1) and (2) are discussed in Appendix B. []figA.1 []figA.2 Given these bounds, let Fig. A.1. Plot of s(gamma) vs. gamma with tangent line.
h(Jj = 1, γj ) = capp I (bmin < γj < bmax )π (γj |Jj = 1)π (Jj = 1). Note that h(Jj = 1, γj ) is the same as g (Jj = 1, γj ) up to a constant for amin < γj < amax . However, g (Jj = 1, γj ) = 0 for γj < amin and γj > amax whereas h(Jj = 1, γj ) has positive values for bmin < γj < amin and amax < γj < bmax . Therefore, if bmin and bmax are close to amin and amax , the exact and approximating functions g and h are very similar and the acceptance rate in a rejection sampling algorithm will be high. The only time the generated value will be rejected is if Jj = 1 and bmin ≤ γj < amin or amax < γj ≤ bmax . The values bmin and amin are often both zero so there is often no ‘‘lower end’’ rejection region. Also, bmax is typically close to amax so the upper rejection region tends to be small. To generate (Jj , γj ) from the approximating density h we will generate Jj first, and then γj . Jj is generated by analytically integrating γj out of the expression for h(Jj = 1, γj ) to give h(Jj = 1) and then using h(Jj = 0) and h(Jj = 1). If Jj = 0, then γj does not need to be generated. If Jj = 1 and bmin = 0, then γj |Jj = 1 is generated from the mixture distribution of a point mass at zero and the normal distribution N (0, τ 2 ) constrained to (0, bmax ). If bmin > 0, then γj |Jj = 1 is drawn from the normal distribution N (0, τ 2 ) constrained to (bmin , bmax ). To complete the rejection sampling algorithm, let c = 1/capp in the expression u ≤
g (Jj ,γj )
ch(Jj ,γj )
in step (2) of the algorithm. Then, if
Jj = 0 is drawn g (Jj = 0) ch(Jj = 0) I v > s(y, α, Jj = 0, J(−j) , γ(−j) , φ) π (Jj = 0)
=
1 capp
capp I v > s(y, α, Jj = 0, J(−j) , γ(−j) , φ) π (Jj = 0)
=1 and we always accept. If Jj = 1, we have ch(Jj = 1, γj )
=
I amin < γj < amax π (γj |Jj = 1)π (Jj = 1) 1 capp
=
1 0
capp I bmin < γj < bmax π (γj |Jj = 1)π (Jj = 1)
There are two cases to consider: (current )
Case 1: The current value of Jj (denoted Jj (current )
) equals 1. In this
(current )
case s˜(γj ) < 0, i.e. γj is a ‘‘feasible’’ value of γj so the function s˜(γj ) must have roots. (current )
Case 2: The current value Jj = 0. In this case, the minimum value of s˜(γj ) may be greater than zero. If this is true, then π (Jj = 1|J(−j) , γ(−j) , α, v, y) = 0 and Jj = 0 with probability one. If the minimum value of s˜(γj ) is less than zero, then roots exist and π (Jj = 1|J(−j) , γ(−j) , α, v, y) is positive. (current )
= 1. Case 1: Jj There are two possibilities: Case 1a: The lower root of s˜(γj ) is a∗min < 0 in which case amin = 0; Case 1b: The lower root of s˜(γj ) is a∗min > 0 in which case amin = a∗min > 0. It is straightforward to check whether case 1a or 1b holds by computing s˜(0). If s˜(0) < 0, then amin = 0 (this happens a large percentage of the time). If s˜(0) > 0, then zero provides an initial bound on amin . A bound on amax needs to be computed in either situation. Given a starting value bmax for the upper bound such that s˜(bmax ) > 0 and s˜′ (bmax ) > 0 (such a bmax is easy to find—see Appendix B.1), we compute the intersection point of the tangent line at bmax with the γ -axis to give an updated bmax (which is s˜(b ) bmax − ˜′ max )—see Fig. A.1. The updated bmax will tend to be close s (bmax )
g (Jj = 1, γj )
Appendix B. Bounds on the roots amin and amax
if amin ≤ γj ≤ amax otherwise.
Therefore, if amin ≤ γj ≤ amax we accept with probability one. Otherwise, we reject. The condition amin ≤ γj ≤ amax is easily checked by computing s˜(γj ). If s˜(γj ) > 0 and we reject, the generated γj value will be used to improve either the lower or upper bound so the computation is not ‘‘wasted’’—see Appendix B for details. Given that bmin and bmax are typically very tight bounds on amin and amax , we will almost always accept. We note that once the code for the new algorithm is available, it is a simple matter to modify it for a wide variety of models by changing the s(γ ) function appropriately.
to amax given the nature of the function s˜(γj ), but it is always the case that amax will be less than the updated bmax because s˜(γj ) is convex. Now generate (Jj , γj ) using h(Jj , γj ) as discussed in Appendix A. If Jj = 0, then it is accepted. If Jj = 1 so γj is also generated, then compute s˜(γj ). If s˜(γj ) < 0, then accept (Jj = 1, γj ). If s˜(γj ) > 0, then set bmax = γj and repeat the process. Note that the first (Jj , γj ) value generated is typically accepted and we almost always accept after two draws, especially if amin = 0. However, if the second draw is rejected, then the tangent method is used to continually update bmin and bmax until a draw is accepted. For example, if s˜′ (γj ) < 0, then the new lower bound bmin =
γj −
s˜(γj ) s˜′ (γj )
is computed.
(current )
Case 2: Jj = 0. There are three possibilities: Case 2a: The lower root of s˜(γj ) is a∗min < 0 in which case amin = 0;
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
179
Appendix C. Approximating conditional distributions in the MCMC algorithm for a free-knot spline model This Appendix provides the approximating conditional distributions πApprox (Jj | · · ·), πApprox (ξj |Jj = 1, . . .) and πApprox (γj |Jj = 1, ξj , . . .) used in step (2) of the MCMC algorithm in Section 3.2. We first note that π (Jj = 0| · · ·) is given in (10) because ξj and γj drop out of both the fixed-knot and free-knot models when Jj = 0. To provide an approximation to π (Jj = 1| · · ·), let s˜(ξj , γj ) = s(ξj , γj ; y, α, Jj = 1, J(−j) , ξ(−j) , γ(−j) , φ) − v. Then Fig. A.2. Plot of s(gamma) vs. gamma with tangent lines.
Case 2b: The lower root of s˜(γj ) is a∗min > 0 in which case amin = a∗min > 0; Case 2c: The minimum value of s˜(γj ) is greater than zero and no roots exist. Note that cases 2a and 2b are the same as cases 1a and 1b and can be handled the same. If case 2c occurs, then Jj = 0 with probability one. The key for case 2 is using an efficient method to determine if case 2c occurs. We begin by computing s˜(0). If s˜(0) < 0, then case 2a holds. If s˜(0) > 0, then compute s˜′ (0) and the intersection point s˜(0) (see of the tangent line at zero with the γ -axis, i.e. γˆ = − ˜′ s (0)
Fig. A.2). If s′ (γˆ ) > 0, then the minimum value of s˜(γ ) is positive and case 2c holds. If s˜′ (γˆ ) < 0, then compute s˜(bmax ) and s˜′ (bmax ) where bmax is the starting value of the upper bound as defined in Appendix B.1. If the intersection point of the tangent line at bmax with the γ -axis, γ ∗ = bmax − s˜˜′(bmax ) (see Fig. A.2), is less than zero or s˜′ (γ ∗ ) < 0, s (bmax )
then the minimum value of s˜(γ ) is positive and case 2c holds. If s˜′ (γˆ ) < 0 and s˜′ (γ ∗ ) > 0, then compute the intersection point of the two tangent lines, denoted γ⌣ (see Fig. A.2). If s˜( γ⌣) < 0, then roots exist. In this case, γˆ and γ ∗ are used as the bounds bmin and bmax (they will typically be excellent bounds) and then proceed as in case 1b. If s˜( γ⌣) > 0 (as in Fig. A.2), then compute s˜′ ( γ⌣). If s˜′ ( γ⌣) < 0, set γˆ = γ⌣, otherwise set γ ∗ = γ⌣, and repeat the process until either s˜( γ⌣) < 0 (in which case roots exist), the newly computed tangent line indicates the minimum value of s˜(γ ) is greater than zero, or s˜′ ( γ⌣) < 10−6 (with the latter two possibilities indicating no positive roots exist). It typically takes only a couple of iterations to determine if the minimum value of s˜(γ ) is greater than zero. B.1. Starting value for bmax To obtain a starting value for bmax we use an idea similar in spirit to Gilks and Wild’s (1995) method for finding starting values in their adaptive rejection sampling algorithm. After the roots have been found (or we find no roots exist) on a given iteration for a specified regression coefficient, we use a Gilks and Wildtype approximation to the likelihood function using the already computed values of s˜(γ ) and s˜′ (γ ). The approximated likelihood is normalized so it can be treated as a density function, and then the 95th percentile value is computed. This value is used as the starting value bmax in the next iteration. It is important to check that s˜′ (bmax ) is positive at the beginning of the next iteration. It typically is but in the few cases it is not, we continue to double bmax until the derivative is positive. The value bmin = 0 is always used as the starting value for the lower bound.
π (Jj = 1, ξj , γj | · · ·) ∝ I [˜s(ξj , γj ) < 0]π (γj |Jj = 1) × π (ξj |Jj = 1)π (Jj = 1) and γj and ξj must be integrated out to give π (Jj = 1| · · ·). To integrate out γj we follow Section 3.1 and let a∗min (ξj ) and amax (ξj ) represent the roots of s˜(ξj , γj ) for a given ξj and let amin (ξj ) = max[0, a∗min (ξj )]. Also, let bmin (ξj ) and bmax (ξj ) represent the bounds on amin (ξj ) and amax (ξj ) corresponding to bmin and bmax in Section 3.1. For a fixed ξj , these bounds can be computed similarly to bmin and bmax . γj can now be integrated out by integrating the mixture prior π (γj |Jj = 1) over the interval [bmin (ξj ), bmax (ξj )] to give the approximating density
∫
πApprox (Jj = 1, ξj | · · ·) ∝
bmax (ξj ) bmin (ξj )
π (γj |Jj = 1)dγj
× π (ξj |Jj = 1)π (Jj = 1). Calculating this density requires at most two standard normal cdf calculations (depending on whether bmin (ξj ) = 0 or >0). For a given ξj , πApprox (Jj = 1, ξj | · · ·) = 0 if s˜(ξj , γj ) > 0 for all γj . In the integral over ξj required to obtain πApprox (Jj = 1| · · ·), let ξL and ξU represent the bounding knot values for ξj . More specifically, let ξL = ξi where i is the largest index less than j with Ji = 1. If all Ji = 0 for i < j, then ξL = 0. Similarly, let ξU = ξi where i is the smallest index greater than j with Ji = 1. If all Ji = 0 for i > j, then ξU = 1. For example, if m = 5, j = 3 and J = (0, 1, −, 0, 1), then ξL = ξ2 and ξU = ξ5 . Then
πApprox (Jj = 1| · · ·) =
∫
ξU
ξL
πApprox (Jj = 1, ξj | · · ·)dξj .
This integral cannot be done analytically. However, the function
πApprox (Jj = 1, ξj | · · ·) is well-approximated by a step function computed at ξL and the xi values with ξL < xi < ξU . Using this step function approximation gives πApprox (Jj = 1| · · ·) up to a constant. A temporary value of Jj can now be generated using π (Jj = 0| · · ·) and πApprox (Jj = 1| · · ·). If Jj = 0, then ξj and γj do not need to be generated. If Jj = 1, then a temporary value of ξj is generated from the step function approximation
πApprox (ξj |Jj = 1, . . .) ∝ πApprox (Jj = 1, ξj | · · ·). Finally, γj |Jj = 1, ξj , . . . is generated from a constrained normal distribution with bounds bmin (ξj ) and bmax (ξj ). The true density function value at the value (Jj , ξj , γj ) is
π (Jj , ξj , γj | · · ·) ∝ e−v I (v > s(y, α, J , ξJ , γJ , φ)) × π (γj |Jj )π (ξj |Jj )π (Jj ). This is straightforward to compute to obtain the Metropolis–Hastings acceptance probability and therefore the generated value of (Jj , ξj , γj )| · · ·.
180
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181
Appendix D. Consistency proof for generalized additive and hazard function models This Appendix generalizes the argument in Section 6 for the Poisson model to show consistency for the class of generalized additive and hazard function models. The more general argument uses the existence of the maximum likelihood estimator (MLE) when working with log-concave densities and relies on the properties of the MLE. We consider the case when p(y|x) = exp{s(y, f (x), φ)} and s is concave in ξ = (φ, f ). Under this scenario it is that ξˆ exists and is unique (see Walther, 2002); where ξˆ maximizes n −
s(yi , f (xi ), φ).
i =1
Now, as before, define d(ξ , ξ0 ) =
∫
dH (p(·|ξ , x), p(·|ξ0 , x))G0 (dx),
and so we can obtain
Πn (Aε ) = Π (Aε |(x1 , y1 ), . . . , (xn , yn )) 1/2 ∫ 1/2 n n ∏ ∏ p(yi |ξˆ , xi ) p(yi |ξ , xi ) ≤ Π ( dξ ) p(yi |ξ0 , xi ) p(yi |ξ0 , xi ) Aε i=1 i=1 ∫ ∏ n p(yi |ξ , xi ) Π (dξ ). p (yi |ξ0 , xi ) i=1 We can now use a result in Walker and Hjort (2001), Section 3.3, to demonstrate that Πn (Aε ) → 0 a.s. for all ε > 0, with the classical consistency condition, that n− 1
n − {s(yi , ξˆ , xi ) − s(yi , ξ0 , xi )} → 0 a.s. i=1
This condition guarantees that
1/2 n ∏ p(yi |ξˆ , xi ) ≤ end p ( y |ξ , x ) i 0 i i =1 a.s. for all large n for any d > 0. A usual Kullback–Leibler support condition ensures that
∫ ∏ n p(yi |ξ , xi ) Π (dξ ) > e−nc p (yi |ξ0 , xi ) i =1 a.s. for all large n for any c > 0. Finally, taking expectations and using a Markov inequality combined with Borel–Cantelli it is easy to show that
1/2 ∫ ∏ n p(yi |ξ , xi ) Π (dξ ) < e−nδε p ( y |ξ , x ) i 0 i Aε i =1 a.s. for all large n for some δε > 0. Putting all these together we can see that Πn (Aε ) → 0 a.s. Now let Pn be the empirical distribution of the (xi , yi )ni=1 and P0 the true distribution of (x, y). Then the above classical consistency condition holds when
∫
|s(y, ξˆ , x) − s(y, ξ0 , x)|dP0 (x, y) → 0 a.s.
and
∫ sup g (y, x, ξ )d(Pn − P0 ) → 0 a.s. ξ where g (x, y, ξ ) = s(y, f (x), φ) − s(y, f0 (x), φ0 ). The former condition is straightforward under suitable continuity conditions for s. The latter result is a uniform law of large number criterion on which there is an abundance of literature. For early work see Pollard (1984) and Giné and Zinn (1984).
References Abbring, J.H., van den Berg, G.J., 2003. The nonparametric identification of treatment effects in duration models. Econometrica 71, 1491–1517. Albert, J., Chib, S., 1993. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669–679. Aït-Sahalia, Y., Duarte, J., 2003. Nonparametric option pricing under shape restrictions. Journal of Econometrics 116, 9–47. Banerjee, M., Mukherjee, D., Mishra, S., 2009. Semiparametric binary regression models under shape constraints with an application to Indian schooling data. Journal of Econometrics 149, 101–117. Barnett, W.A., 1983. New indices of money supply and the flexible Laurent demand systems. Journal of Business and Economic Statistics 1, 7–23. Reprinted in Barnett, W.A. J. Binner (Eds.), Functional Structure and Approximation in Econometrics, Elsevier, Amsterdam, 2004. Barnett, W.A., Jonas, A., 1983. The Müntz–Szatz demand system: an application of a globally well behaved series expansion. Economics Letters 11, 337–342. Reprinted in Barnett, W.A. J. Binner (Eds.). Functional Structure and Approximation in Econometrics, Elsevier, Amsterdam, 2004. Barnett, W.A., Serletis, A., 2008. Consumer preferences and demand systems. Journal of Econometrics 147, 210–224. Bharath, S.T., Shumway, T., 2008. Forecasting default with the Merton distance to default model. Review of Financial Studies 21, 1339–1369. Broadie, M., Detemple, J., Ghysels, E., Torrés, O., 2000a. Nonparametric estimation of American options’ exercise boundaries and call prices. Journal of Economic Dynamics and Control 24, 1829–1857. Broadie, M., Detemple, J., Ghysels, E., Torrés, O., 2000b. American options with stochastic dividends and volatility: a nonparametric investigation. Journal of Econometrics 94, 53–92. Chib, S., Greenberg, E., 1995. Understanding the Metropolis–Hastings algorithm. American Statistician 49, 327–335. Christensen, L.R., Jorgenson, D.W., Lau, L.J., 1975. Transcendental logarithmic utility functions. American Economic Review 70, 422–432. Cooper, R.J., McLaren, K.R., 1996. A system of demand equations satisfying effectively global regularity conditions. Review of Economics and Statistics 78, 359–364. Cox, D.R., 1972. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34, 187–220. Cripps, E., Carter, C., Kohn, R., 2005. Variable selection and covariance selection in multivariate regression models. In: Dey, D.K., Rao, C.R. (Eds.), Handbook of Statistics. In: Bayesian Thinking: Modeling and Computation, vol. 25. Elsevier, North-Holland, Amsterdam, pp. 519–552. Damien, P., Wakefield, J., Walker, S.G., 1999. Gibbs sampling for Bayesian nonconjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society, Series B 61, 331–344. Denison, D.G.T., Malick, B.K., Smith, A.F.M., 1998. Automatic Bayesian curve fitting. Journal of the Royal Statistical Society, Series B 60, 333–350. Diewert, W.E., 1974. Applications of duality theory. In: Intriligator, M., Kendrick, D. (Eds.), Frontiers in Quantitative Economics, vol. 2. North-Holland, Amsterdam. Duffie, D., Saita, L., Wang, K., 2007. Multi-period corporate default prediction with stochastic covariates. Journal of Financial Economics 83, 635–665. Dunson, D.B., 2005. Bayesian semiparametric isotonic regression for count data. Journal of the American Statistical Association 100, 618–627. Friedman, J., Tibshirani, R., 1984. The monotone smoothing of scatterplots. 
Technometrics 26, 243–250. Gallant, A.R., 1981. On the bias of flexible functional forms and an essentially unbiased form: the Fourier functional form. Journal of Econometrics 15, 211–245. Gallant, A.R., Golub, G.H., 1984. Imposing curvature restrictions on flexible functional forms. Journal of Econometrics 26, 295–321. Gamerman, D., Lopes, H.F., 2006. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, 2nd ed. Springer, New York. Gelfand, A.E., Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398–409. Gilks, W.R., Wild, P., 1995. Adaptive rejection sampling for Gibbs sampling. Applied Statistics 41, 337–348. Giné, E., Zinn, J., 1984. On the central limit theorem for empirical processes. Annals of Probability 12, 929–989. Groeneboom, P., Jongblood, G., Wellner, J.A., 2001. Estimation of a convex function: characterizations and asymptotic theory. Annals of Statistics 29, 1653–1698. Ihara, T., Genchi, Y., Sato, T., Yamaguchi, K., Endo, Y., 2008. City-block-scale sensitivity of electricity consumption to air temperature and air humidity in business districts of Tokyo, Japan. Energy 33, 1634–1645. Jenkins, S.P., Garcia-Serrano, C., 2004. The relationship between unemployment benefits and re-employment probabilities: evidence from Spain. Oxford Bulletin of Economics and Statistics 66, 239–260. Kim, J., Marschke, G., 2005. Labor mobility of scientists, technological diffusion, and the firm’s patenting decision. RAND Journal of Economics 36, 298–317. Kweon, Y., Kockleman, K.M., 2005. Safety effects of speed limit changes. Transportation Research Record 1908, 148–158. Mammen, E., 1991. Estimating a smooth monotone regression function. Annals of Statistics 19, 724–740. Manski, C.F., Tamer, E., 2002. Inference in regression with interval data on a regressor or outcome. Econometrica 70, 519–546. Mitra, D., Golder, P.N., 2002. Whose culture matters? Near-market knowledge and its impact on foreign market entry. Journal of Marketing Research 39, 350–365. Neelon, B., Dunson, D.B., 2004. Bayesian isotonic regression and trend analysis. Biometrics 60, 398–406.
T.S. Shively et al. / Journal of Econometrics 161 (2011) 166–181 Panagiotelis, A., Smith, M., 2008. Bayesian identification, selection and estimation of semiparametric functions in high-dimensional additive models. Journal of Econometrics 143, 291–316. Pardo, A., Meneu, V., Valor, E., 2002. Temperature and seasonality influences on Spanish electricity load. Energy Economics 24, 55–70. Pollard, D., 1984. Convergence of Stochastic Processes. Springer, New York. Psiloglou, B.E., Giannakopoulos, C., Majithia, S., Petrakis, M., 2009. Factors affecting electricity demand in Athens, Greece and London, UK: a comparative assessment. Energy 34, 1855–1863. Schipper, M., Taylor, J.M.G., Lin, X., 2007. Bayesian generalized monotonic functional mixed models for the effects of radiation dose histograms on normal tissue complications. Statistics in Medicine 26, 4643–4656. Shively, T.S., Sager, T.W., Walker, S.G., 2009. A Bayesian approach to nonparametric monotone function estimation. Journal of the Royal Statistical Society, Series B 71, 159–175. Smith, M., Kohn, R., 1996. Nonparametric regression using Bayesian variable selection. Journal of Econometrics 75, 317–343.
181
Smith, M., Wong, C., Kohn, R., 1998. Additive nonparametric regression with autocorrelated errors. Journal of the Royal Statistical Society, Series B 60, 311–331. Tierney, L., 1994. Markov chains for exploring posterior distributions. Annals of Statistics 22, 1701–1762. Varian, H., 1984. Microeconomic Analysis, 2nd ed. Norton, Company, New York. Walker, S.G., 2004. New approaches to Bayesian consistency. Annals of Statistics 32, 2028–2043. Walker, S.G., Hjort, N.L., 2001. On Bayesian consistency. Journal of the Royal Statistical Society, Series B 63, 811–821. Walther, G., 2002. Detecting the presence of mixing with multiscale maximum likelihood. Journal of the American Statistical Association 97, 508–514. Wright, I., Wegman, E., 1980. Isotonic, convex, and related splines. Annals of Statistics 8, 1023–1035. Yatchew, A., Härdle, W., 2006. Nonparametric state price density estimation using constrained least squares and the bootstrap. Journal of Econometrics 133, 579–599.
Journal of Econometrics 161 (2011) 182–202
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Large panels with common factors and spatial correlation✩ M. Hashem Pesaran a,b,∗ , Elisa Tosetti c a
Cambridge University, United Kingdom
b
University of Sourthern California, USA
c
Brunel University, United Kingdom
article
info
Article history: Received 15 August 2007 Received in revised form 9 June 2010 Accepted 11 December 2010 Available online 21 December 2010 JEL classification: C10 C31 C33 Keywords: Panels Common factors Spatial dependence Common correlated effects estimator
ABSTRACT

This paper considers methods for estimating the slope coefficients in large panel data models that are robust to the presence of various forms of error cross-section dependence. It introduces a general framework in which error cross-section dependence may arise because of unobserved common effects and/or error spill-over effects due to spatial or other forms of local dependencies. The paper first focuses on a panel regression model where the idiosyncratic errors are spatially dependent and possibly serially correlated, derives the asymptotic distributions of the mean group and pooled estimators under heterogeneous and homogeneous slope coefficients, and proposes non-parametric variance matrix estimators for both. It then considers the more general case of a panel data model with a multifactor error structure and spatial error correlations. Under this framework, the Common Correlated Effects (CCE) estimator, recently advanced by Pesaran (2006), continues to yield estimates of the slope coefficients that are consistent and asymptotically normal. Small sample properties of the estimators under various patterns of cross-section dependence, including spatial forms, are investigated by Monte Carlo experiments. Results show that the CCE approach works well in the presence of weak and/or strong cross-sectionally correlated errors.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction

Over the past few years there has been a growing literature, both empirical and theoretical, on the econometric analysis of panel data models with cross-sectionally dependent error processes. Such cross-correlations can arise for a variety of reasons, such as omitted common factors, spatial spill-overs, and interactions within socioeconomic networks. Conditioning on variables specific to the cross-section units alone does not deliver cross-section error independence, an assumption required by the standard literature on panel data models. In the presence of such dependence, conventional panel estimators such as fixed or random effects can result in misleading inference and even inconsistent estimation (Phillips and Sul, 2003). Further, conventional panel estimators may be inconsistent if the regressors are correlated with unobserved common factors that might be causing the error cross-section dependence (Andrews, 2005).
✩ We are grateful to the Editor (Cheng Hsiao), an Associate Editor and three anonymous referees, Badi Baltagi, Alexander Chudik and George Kapetanios for helpful comments and suggestions. Elisa Tosetti acknowledges financial support from ESRC (Ref. no. RES-061-25-0317). ∗ Corresponding address: Trinity College, Economics, Trinity Street, CB2 1TQ, Cambridge, England, United Kingdom. Tel.: +44 1223 335216. E-mail address:
[email protected] (M.H. Pesaran).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2010.12.003
Currently, there are two main strands in the literature for dealing with error cross-section dependence in panels where N is large relative to T, namely the residual multifactor and the spatial econometric approaches. The multifactor approach assumes that the cross-section dependence can be characterized by a finite number of unobserved common factors, possibly due to economy-wide shocks that affect all units, albeit with different intensities. Under this framework, the error term is a linear combination of a few common time-specific effects with heterogeneous factor loadings plus an idiosyncratic (individual-specific) error term. Estimation of a panel with such a multifactor residual structure can be addressed by using statistical techniques commonly adopted in factor analysis, such as maximum likelihood (Robertson and Symons, 2000, 2007) and principal components procedures (Coakley et al., 2002; Bai, 2009). Recently, Pesaran (2006) has suggested an estimation method, referred to as Common Correlated Effects (CCE), that consists of approximating the linear combinations of the unobserved factors by cross-section averages of the dependent and explanatory variables, and then running standard panel regressions augmented with these cross-section averages. An advantage of this approach is that it yields consistent estimates under a variety of situations, such as serial correlation in the errors, unit roots in the factors, and possible contemporaneous dependence of the observed regressors with the unobserved factors (Coakley et al., 2006; Kapetanios and Pesaran, 2007; Kapetanios et al., 2011).
The spatial approach assumes that the structure of the cross-section correlation is related to location and distance among units, defined according to a pre-specified metric. Proximity need not be measured in terms of physical space, but can be defined using other types of metrics, such as economic (Conley, 1999; Pesaran et al., 2004), policy, or social distance (Conley and Topa, 2002). Hence, cross-section correlation is represented by means of a spatial process, which explicitly relates each unit to its neighbours (Whittle, 1954). Estimation of panels with spatially correlated errors can be based on maximum likelihood (ML) techniques (Lee, 2004), or on the generalized method of moments (GMM) (Kelejian and Prucha, 1999; Lee, 2007; Kelejian and Prucha, 2009). Recently, non-parametric methods based on heteroskedasticity and autocorrelation consistent estimators applied to spatial models have also been proposed (Conley, 1999; Kelejian and Prucha, 2007; Bester et al., 2009).

In this paper we build on the existing literature and consider a general panel data model where error cross-section dependence is due to unobserved common factors and/or spatial dependence, while at the same time allowing for the errors to be serially correlated. We focus on estimation and inference procedures that are robust to the presence of various forms of cross-sectional and temporal dependence in the error processes. Robust methods are needed because the source and extent of error cross-section dependence is often unknown. The error cross-section dependence can take many different forms and its nature could differ at micro and macro levels. For instance, at a micro level, individual consumption behaviour can be influenced by economy-wide factors, such as changes in taxation and interest rates, and by local neighbourhood effects such as keeping up with the Joneses (Cowan et al., 2004). In macroeconomics, several studies have argued that business cycle fluctuations could be the result of strategic interactions as well as aggregate technological shocks (Cooper and Haltiwanger, 1996). Our econometric specification, by allowing for the presence of both sources of contemporaneous error correlation, is sufficiently general and includes the models proposed in the literature as special cases.

We focus on estimation of the slope coefficients under a number of different specifications. Initially, we concentrate on a panel data model without unobserved factors where the errors are spatially dependent and possibly serially correlated, and derive the asymptotic distribution of the mean group and pooled estimators, under alternative assumptions regarding the slope coefficients. In the presence of heterogeneous slopes, we show that the non-parametric approach advanced by Pesaran (2006) continues to be applicable and can be used to obtain standard errors that are robust to both spatial and serial error correlations. However, in the case of homogeneous slopes this procedure will not be applicable. In this case we propose a non-parametric variance matrix estimator that adapts the Newey and West (1987) heteroskedasticity and autocorrelation consistent (HAC) procedure to allow for spatial effects, along the lines recently advanced by Kelejian and Prucha (2007). We refer to this variance estimator as the spatial heteroskedasticity and autocorrelation consistent (SHAC) estimator.
We then consider the more general case where the error term in the panel data model is composed of a multifactor structure and a spatial process, and show that Pesaran's CCE approach continues to be valid and yields consistent estimates of the slope coefficients and their standard errors. We also show how to obtain consistent estimates of the errors in the panel, to be used in tests of cross-section independence and for further analysis of the underlying spatial processes. Using Monte Carlo techniques, we investigate the small sample performance of the estimators under various patterns of error cross-section dependence, with and without error serial correlation, under both cases of heterogeneous and homogeneous slopes. We examine the performance of the alternative estimators when the errors only display spatial dependence, when they are subject to unobserved common factors as well as spatial dependence, and in the case where the source of cross-section dependence changes over time.

Our results indicate that the mean group and pooled estimators with robust standard errors do work well under certain regularity conditions outlined in our theorems. However, under slope homogeneity or in the presence of unobserved common factors these estimators fail to provide the correct inference. The results also document the tendency of tests based on HAC type standard errors to over-reject the null hypothesis in small samples, even in the case of error cross-section dependence which is purely spatial. In contrast, our Monte Carlo experiments clearly show that the augmentation of panel regressions with cross-section averages, as formulated by the CCE procedure, eliminates the effects of all forms of spatial and temporal correlations, irrespective of whether these are due to spatial processes and/or unobserved common factors. The small sample properties of the CCE estimators do not seem to be affected by the heterogeneity assumptions on the slope coefficients, or by the presence of error serial correlation. It is this level of robustness of the CCE estimator which particularly commends it for use in empirical analysis.

The plan of the remainder of the paper is as follows: Section 2 sets out a panel regression model with unobserved common factors and general spatial and temporal error processes. Section 3 develops the asymptotic distribution of the mean group and pooled estimators in the presence of spatial error dependence and error serial correlation. Section 4 considers the more general case where the errors also contain unobserved common factors, and establishes the validity of the CCE estimators for this class of models. Consistent estimation of the residuals from such models is considered in Section 5, where the necessary identification conditions are stated. Section 6 describes the Monte Carlo experiments and reports the results. Section 7 ends with some concluding remarks.

Notation: $\lambda_1(A) \geq \lambda_2(A) \geq \cdots \geq \lambda_n(A)$ are the eigenvalues of a matrix $A \in M_{n \times n}$, where $M_{n \times n}$ is the space of real $n \times n$ matrices. $A^{-}$ denotes a generalized inverse of $A$. The column norm of $A \in M_{n \times n}$ is $\|A\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^{n} |a_{ij}|$. The row norm of $A$ is $\|A\|_\infty = \max_{1 \leq i \leq n} \sum_{j=1}^{n} |a_{ij}|$. The Euclidean norm of $A$ is $\|A\|_2 = [\mathrm{Tr}(AA')]^{1/2}$. $K$ is used for a fixed positive constant. $(N,T) \xrightarrow{j} \infty$ denotes $N$ and $T$ tending to infinity jointly but in no particular order.

2. Heterogeneous panels with unobserved common factors and spatial error correlation

We begin with a general specification where the dependent variable is a function of a set of individual-specific regressors, a linear combination of common observed and unobserved factors, and errors that are serially and spatially correlated. Let $y_{it}$ be the observation on the $i$th cross-section unit at time $t$ for $i = 1,2,\ldots,N$; $t = 1,2,\ldots,T$, and suppose that it is generated as
$$y_{it} = \alpha_i' d_t + \beta_i' x_{it} + \gamma_i' f_t + e_{it}, \qquad (1)$$
where $d_t = (d_{1t}, d_{2t}, \ldots, d_{nt})'$ is an $n \times 1$ vector of observed common effects, $x_{it}$ is a $k \times 1$ vector of observed individual-specific regressors on the $i$th cross-section unit at time $t$, $f_t = (f_{1t}, f_{2t}, \ldots, f_{mt})'$ is an $m$-dimensional vector of unobservable common factors, and $\gamma_i = (\gamma_{i1}, \gamma_{i2}, \ldots, \gamma_{im})'$ is the associated $m \times 1$ vector of factor loadings. The number of factors, $m$, is assumed to be fixed relative to $N$, and in particular $m < N$. The common factors, $f_t$, simultaneously affect all cross-section units, albeit with different degrees, as measured by $\gamma_i$. For instance, a rise in the interest rate may affect household consumption and firm investment decisions; oil price shocks may influence firm production costs; real shocks, such as a decline in aggregate demand and employment, could simultaneously slow the growth of a number of countries (see Andrews, 2005). Finally, the unit-specific or idiosyncratic errors, $e_{it}$, are assumed to be spatially and temporally correlated.

The most widely used spatial models are the Spatial Moving Average (SMA), the Spatial Autoregressive (SAR), and the Spatial Error Component (SEC) specifications. These models differ in the range of dependence implied by their covariance matrices, but under certain invertibility conditions they can all be written as special cases of
$$e_{\cdot t} = R_t \varepsilon_{\cdot t}, \quad \text{for } t = 1,2,\ldots,T, \qquad (2)$$
where $e_{\cdot t} = (e_{1t}, \ldots, e_{Nt})'$, $\varepsilon_{\cdot t} = (\varepsilon_{1t}, \ldots, \varepsilon_{Nt})'$, and $R_t$ is a given $N \times N$ matrix. We shall make use of the following assumptions.
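As a concrete illustration of (2), the minimal sketch below (in Python; the circulant weight matrix is purely illustrative and not taken from the paper) builds $R_t$ for the SAR case, $e_{\cdot t} = \delta S e_{\cdot t} + \varepsilon_{\cdot t}$, so that $R_t = (I_N - \delta S)^{-1}$, and for the SMA case, $e_{\cdot t} = \varepsilon_{\cdot t} + \delta S \varepsilon_{\cdot t}$, so that $R_t = I_N + \delta S$.

```python
# Sketch: SAR and SMA errors as special cases of e_t = R_t * eps_t (Eq. (2)).
# The weight matrix below is a toy circulant example, not the paper's.
import numpy as np

N = 4
W = np.zeros((N, N))
for i in range(N):                 # each unit's two ring neighbours, row-normalized
    W[i, (i - 1) % N] = 0.5
    W[i, (i + 1) % N] = 0.5

delta = 0.4                        # spatial coefficient (illustrative value)

R_sar = np.linalg.inv(np.eye(N) - delta * W)   # SAR: R = (I - delta*W)^{-1}
R_sma = np.eye(N) + delta * W                  # SMA: R = I + delta*W

eps = np.random.default_rng(0).standard_normal(N)
print("SAR errors:", R_sar @ eps)
print("SMA errors:", R_sma @ eps)
```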
Assumption 1. For each $i$, $\varepsilon_{it}$ follows the linear stationary process with absolutely summable autocovariances
$$\varepsilon_{it} = \sum_{s=0}^{\infty} a_{is}\, \epsilon_{i,t-s},$$
where $\epsilon_{it} \sim IID(0,1)$ with finite fourth-order cumulants.

Assumption 2. $R_t$ has bounded row and column norms for all $t$.

Assumption 3. The slope coefficients $\beta_i$ follow the random coefficient model
$$\beta_i = \beta + \upsilon_i, \quad \upsilon_i \sim IID(0, \Omega_\upsilon) \ \text{for } i = 1,2,\ldots,N,$$
where $\|\beta\|_2 < K$, $\|\Omega_\upsilon\|_2 < K$, $\Omega_\upsilon$ is a symmetric non-negative definite matrix, and the random deviations, $\upsilon_i$, are distributed independently of $\varepsilon_{jt}$, $x_{jt}$, and $d_t$, for all $i$, $j$ and $t$.

Assumption 4. $(d_t', x_{it}')'$ and $\varepsilon_{js}$ are independently distributed for all $t$, $s$, $i$ and $j$.

Note that under Assumption 1 we have
$$\mathrm{Var}(\varepsilon_{it}) = \sum_{s=0}^{\infty} a_{is}^2 = \sigma_i^2 \leq K < \infty,$$
and the covariance matrix of $\varepsilon_{i\cdot} = (\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iT})'$ has bounded row and column norms, for all $i$. Assumption 2 implies that the spatial error process, (2), carries weak cross-section dependence at all points in time, namely that its weighted averages converge to zero for all sets of weights satisfying certain regularity conditions (see also Lemma A.1 in the Appendix). Notions of weak and strong cross-section dependence are developed and discussed in Chudik et al. (forthcoming). We note that Assumption 2 holds for most widely used spatial models that are subject to a set of regularity conditions that are standard in the spatial econometrics literature. These regularity conditions ensure consistency and asymptotic normality of quasi-ML and GMM estimators of spatial parameters (see Kelejian and Prucha, 1999, Lee, 2004, and Mardia and Marshall, 1984 for details).

The model outlined in Eqs. (1)–(2) is quite general and renders a variety of linear panel data models as special cases. The coefficients $\alpha_i$ may be treated as fixed or random, possibly correlated with the other variables in the panel. The vector $d_t$ could contain deterministic terms such as an intercept or linear trends, or common observed variables such as oil prices. For example, in the case of panel data models with fixed effects we would set $n = 1$ and $d_{1t} = 1$, for $t = 1,2,\ldots,T$.

The focus of this paper is on estimating the slope coefficients $\beta_i$, their cross-section means, $\beta = E(\beta_i)$, and the unit-specific errors, $u_{it}$. We shall consider four cases of interest. Initially, we abstract from unobserved common factors, and concentrate exclusively on the effects of weak spatial error dependence. Accordingly, we impose $\gamma_i = 0$ in Eq. (1), and consider estimation of the slope coefficients and their cross-section means in a panel regression model where $e_{it}$ follows an invertible spatial process of type (2). We then investigate the properties of the proposed estimators under the special case of homogeneous slopes, namely when $\beta_i = \beta$ for $i = 1,2,\ldots,N$ (i.e., $\Omega_\upsilon = 0$ in Assumption 3). Next, we turn to the more general specification where $\gamma_i \neq 0$, and allow the unobserved common factors to be correlated with the individual-specific regressors, $x_{it}$. Initially, we deal with the case of heterogeneous slopes and then consider the special case of $\Omega_\upsilon = 0$. There are two further specifications that may be derived from (1)–(2). These are the cases of common factors and no spatial error correlation, with either heterogeneous or homogeneous slopes. However, these specifications have already been discussed in Pesaran (2006) and will not be considered here. Consistent estimation of the residuals in the general case is addressed in Section 5.

3. Estimating panels with spatial error correlation
The literature on spatial econometrics typically considers the problem of spatial dependence under strong assumptions of homogeneity and temporal independence. Only recently, a strand of literature in spatial econometrics has considered the incorporation of unobserved heterogeneity in spatial panel data models, where N is usually assumed to be large relative to T . Baltagi et al. (2003) and Kapoor et al. (2007) have focused on ML and GMM estimation of panels where the error term is the sum of an individual-specific component and a spatially correlated idiosyncratic error. Baltagi et al. (2009) generalized their earlier work by allowing for spatial correlations in both the individual means and the remainder error components, with possibly different spatial autoregressive parameters. Fingleton (2008) extended the Kapoor et al. (2007) contribution on GMM estimation of spatial random effects panels to the case where the idiosyncratic error term follows a spatial moving average process, while Egger et al. (2005) have focused on extensions to the case of unbalanced panels. Lee and Yu (2010) considered estimation of a spatial panel data model with individual-specific fixed effects, and proposed a ‘‘transformation approach’’ to eliminate the fixed effects and then apply quasi-ML to the transformed model. Yu et al. (2007, 2008) and Yu and Lee (2010) focused on the properties of the quasi-ML estimator in the case of dynamic, possibly nonstationary, panels with fixed effects and spatial error correlation, assuming both N and T large. It is worth noting that application of ML techniques requires the serial correlation processes of the error terms, if any, to be fully specified. In panels where N is relatively large this could be quite demanding, since different dynamic specifications might be appropriate across different cross-sectional units. The GMM method is less demanding but still requires moment conditions that correctly take account of specific spatial and serial correlation patterns of the errors. The use of quasi-ML and GMM becomes even more involved if the errors also depend on unobserved common factors. It is, therefore, of interest to develop estimation and inference procedures for panels that are reasonably robust to the presence of cross-section and temporal dependencies in the error processes. In this section we focus on two estimators that can be used for estimating the mean, β, of the slope coefficients in Eq. (1), when
errors are spatially correlated. The first, known as the mean group (MG) estimator of $\beta$, is given by (see Pesaran and Smith, 1995)
$$\hat{\beta}_{MG} = N^{-1} \sum_{i=1}^{N} \hat{\beta}_i, \qquad (3)$$
where
$$\hat{\beta}_i = \left(X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} X_{i\cdot}' M_D y_{i\cdot}, \qquad (4)$$
with $y_{i\cdot} = (y_{i1}, y_{i2}, \ldots, y_{iT})'$, $X_{i\cdot}' = (x_{i1}, x_{i2}, \ldots, x_{iT})$, $M_D = I_T - D(D'D)^{-1}D'$, and $D' = (d_1, d_2, \ldots, d_T)$. Alternatively, we can use the fixed effects, or pooled, estimator of $\beta$,
$$\hat{\beta}_P = \left(\sum_{i=1}^{N} X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} \sum_{i=1}^{N} X_{i\cdot}' M_D y_{i\cdot}. \qquad (5)$$
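For concreteness, the following minimal Python sketch computes (3)–(5); the three-dimensional array layout for the regressors and the function name are our own conventions, not the paper's.

```python
import numpy as np

def mg_and_pooled(y, X, D):
    """Mean group (3)-(4) and pooled (5) estimators.

    y: (N, T) array; X: (N, T, k) array; D: (T, n) observed common effects.
    Returns (beta_mg, beta_pooled, betas), where betas stacks the unit-level
    estimates beta_hat_i.
    """
    N, T, k = X.shape
    # M_D = I_T - D(D'D)^{-1}D' annihilates the observed common effects.
    M_D = np.eye(T) - D @ np.linalg.solve(D.T @ D, D.T)
    betas = np.zeros((N, k))
    A_sum = np.zeros((k, k))
    b_sum = np.zeros(k)
    for i in range(N):
        A = X[i].T @ M_D @ X[i]
        b = X[i].T @ M_D @ y[i]
        betas[i] = np.linalg.solve(A, b)         # beta_hat_i, Eq. (4)
        A_sum += A
        b_sum += b
    beta_mg = betas.mean(axis=0)                 # Eq. (3)
    beta_pooled = np.linalg.solve(A_sum, b_sum)  # Eq. (5)
    return beta_mg, beta_pooled, betas
```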
To derive the asymptotic distribution of the above estimators, we make the following additional assumption on the individual-specific regressors and observed common factors.

Assumption 5. We assume: (a) For each $i = 1,2,\ldots,N$, the $k \times k$ observation matrix $T^{-1} X_{i\cdot}' M_D X_{i\cdot}$ is non-singular for the sample size $T$ under consideration and tends to a finite non-singular matrix, $Q_i$, as $T \to \infty$. Also, the elements of the $k \times T$ matrix $W_{i\cdot}' = \left(T^{-1} X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} X_{i\cdot}' M_D$ are uniformly bounded, and $T^{-1} W_{i\cdot}' W_{i\cdot}$ has bounded elements as $T \to \infty$. (b) The $k \times k$ pooled observation matrix $(NT)^{-1} \sum_{i=1}^{N} X_{i\cdot}' M_D X_{i\cdot}$ is finite and non-singular for the sample sizes $N$ and $T$ under consideration and tends to a finite non-singular matrix, $Q$, as $(N,T) \xrightarrow{j} \infty$.

We have
$$\sqrt{N}\left(\hat{\beta}_{MG} - \beta\right) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \upsilon_i + \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left(T^{-1} X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} T^{-1} X_{i\cdot}' M_D e_{i\cdot}, \qquad (6)$$
$$\sqrt{N}\left(\hat{\beta}_P - \beta\right) = \left(N^{-1} \sum_{i=1}^{N} T^{-1} X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} T^{-1} X_{i\cdot}' M_D \left(X_{i\cdot} \upsilon_i + e_{i\cdot}\right), \qquad (7)$$
where $e_{i\cdot} = (e_{i1}, e_{i2}, \ldots, e_{iT})'$. The asymptotic distributions of (6) and (7) are summarized in the following theorems.

Theorem 1 (MG Estimator—Heterogeneous Slopes, Spatial Corr. and No Common Factors). Consider the panel data model (1) with errors $e_{it}$ following the spatial process given by (2). Suppose that Assumptions 1–4 and 5(a) hold and that $\gamma_i = 0$, for $i = 1,2,\ldots,N$. Then for the mean group estimator, $\hat{\beta}_{MG}$, given by (3), as $(N,T) \xrightarrow{j} \infty$ we have
$$\sqrt{N}\left(\hat{\beta}_{MG} - \beta\right) \xrightarrow{d} N(0, \Sigma_{MG}),$$
where $\Sigma_{MG} = \Omega_\upsilon$.

Theorem 2 (Pooled Estimator—Heterogeneous Slopes, Spatial Corr. and No Common Factors). Consider the panel data model (1)–(2). Suppose that Assumptions 1–4 and 5(b) hold and that $\gamma_i = 0$, for $i = 1,2,\ldots,N$. Then for the pooled estimator, $\hat{\beta}_P$, given by (5), as $(N,T) \xrightarrow{j} \infty$, we have
$$\sqrt{N}\left(\hat{\beta}_P - \beta\right) \xrightarrow{d} N(0, \Sigma_P), \quad \Sigma_P = Q^{-1} \Lambda Q^{-1},$$
with
$$Q = \lim_{N,T\to\infty} N^{-1} \sum_{i=1}^{N} \frac{X_{i\cdot}' M_D X_{i\cdot}}{T}, \qquad (8)$$
$$\Lambda = \lim_{N,T\to\infty} N^{-1} \sum_{i=1}^{N} \left(\frac{X_{i\cdot}' M_D X_{i\cdot}}{T}\right) \Omega_\upsilon \left(\frac{X_{i\cdot}' M_D X_{i\cdot}}{T}\right). \qquad (9)$$
The proofs are provided in the Appendix. We observe that, to obtain the asymptotic distribution of both estimators, we have premultiplied them by $\sqrt{N}$ rather than the usual $\sqrt{NT}$. This follows from the random coefficients hypothesis stated in Assumption 3, since the time-invariant variability of $\beta_i$ dominates the other sources of randomness in the model.

Robust estimators for $\Sigma_P$ and $\Sigma_{MG}$ can be obtained following the non-parametric approach employed in Pesaran (2006), which makes use of estimates of $\beta$ computed for different cross-sectional units. A consistent estimator of the asymptotic variance of the mean group estimator is given by
$$\widehat{\mathrm{Asy.Var}}\left(\hat{\beta}_{MG}\right) = \frac{1}{N(N-1)} \sum_{i=1}^{N} \left(\hat{\beta}_i - \hat{\beta}_{MG}\right)\left(\hat{\beta}_i - \hat{\beta}_{MG}\right)'. \qquad (10)$$
Similarly, a consistent non-parametric estimator of the asymptotic variance of the pooled estimator is
$$\widehat{\mathrm{Asy.Var}}\left(\hat{\beta}_P\right) = \frac{1}{N} Q_{NT}^{-1} \Lambda_{NT} Q_{NT}^{-1}, \qquad (11)$$
where
$$Q_{NT} = \frac{1}{N} \sum_{i=1}^{N} T^{-1} X_{i\cdot}' M_D X_{i\cdot}, \qquad (12)$$
$$\Lambda_{NT} = \frac{1}{N-1} \sum_{i=1}^{N} \left(\frac{X_{i\cdot}' M_D X_{i\cdot}}{T}\right) \left(\hat{\beta}_i - \hat{\beta}_{MG}\right)\left(\hat{\beta}_i - \hat{\beta}_{MG}\right)' \left(\frac{X_{i\cdot}' M_D X_{i\cdot}}{T}\right).$$

One advantage of the above non-parametric variance estimators is that their computation does not require a priori knowledge of the spatial arrangement of the cross-sectional units. As we shall see later in the paper, mis-specification of the spatial weights matrix may lead to substantial size distortions in tests based on the ML or quasi-ML estimators of $\beta_i$ (or $\beta$). Another advantage of the above approach over standard spatial techniques is that, while allowing for serially correlated errors, it does not require information on the time series processes underlying $\varepsilon_{it}$, so long as these processes are covariance stationary.

Under the special case of homogeneous slopes, with $\beta_i = \beta$ for all $i$, to obtain non-degenerate asymptotic distributions the MG and pooled estimators should now be multiplied by $\sqrt{NT}$, rather than by $\sqrt{N}$. In this case, we have
$$\sqrt{NT}\left(\hat{\beta}_{MG} - \beta\right) = \frac{1}{\sqrt{NT}} \sum_{i=1}^{N} \left(T^{-1} X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} X_{i\cdot}' M_D e_{i\cdot}, \qquad (13)$$
$$\sqrt{NT}\left(\hat{\beta}_P - \beta\right) = \left(N^{-1} \sum_{i=1}^{N} T^{-1} X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} \frac{1}{\sqrt{NT}} \sum_{i=1}^{N} X_{i\cdot}' M_D e_{i\cdot}. \qquad (14)$$
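The non-parametric variance estimators (10)–(12) are straightforward to compute from the unit-level estimates; the sketch below (Python, with the same hypothetical array conventions as in the earlier sketch) illustrates one way of doing so.

```python
import numpy as np

def robust_variances(betas, X, D):
    """Non-parametric variance estimators (10)-(12).

    betas: (N, k) unit-level estimates from Eq. (4); X: (N, T, k); D: (T, n).
    Returns (var_mg, var_pooled).
    """
    N, T, k = X.shape
    M_D = np.eye(T) - D @ np.linalg.solve(D.T @ D, D.T)
    beta_mg = betas.mean(axis=0)
    dev = betas - beta_mg
    # Eq. (10): sum_i dev_i dev_i' / (N(N-1))
    var_mg = dev.T @ dev / (N * (N - 1))
    Q_NT = np.zeros((k, k))
    Lam_NT = np.zeros((k, k))
    for i in range(N):
        Si = X[i].T @ M_D @ X[i] / T                      # X_i'M_D X_i / T
        Q_NT += Si / N                                    # Eq. (12)
        Lam_NT += Si @ np.outer(dev[i], dev[i]) @ Si / (N - 1)
    Qinv = np.linalg.inv(Q_NT)
    # Eq. (11): Q^{-1} Lambda Q^{-1} / N
    var_pooled = Qinv @ Lam_NT @ Qinv / N
    return var_mg, var_pooled
```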
Using results (A.15) and (A.16) in the Appendix, the asymptotic distributions of (13) and (14) can be easily derived. These are set out in the following theorems.

Theorem 3 (MG Estimator—Homogeneous Slopes, Spatial Corr. and No Common Factors). Consider the panel data model (1)–(2). Suppose that Assumptions 1–4 and 5(a) hold, that $\gamma_i = 0$ for $i = 1,2,\ldots,N$, and $\Omega_\upsilon = 0$. Then for the mean group estimator, $\hat{\beta}_{MG}$, given by (3), as $N$ and/or $T \to \infty$ we have
$$\sqrt{NT}\left(\hat{\beta}_{MG} - \beta\right) \xrightarrow{d} N(0, \Sigma_{MG}),$$
where
$$\Sigma_{MG} = \lim_{M\to\infty} \frac{1}{M} H' \Omega_{\varepsilon\varepsilon} H, \qquad (15)$$
with $M = NT$, $H' = (W_{\cdot 1}' R_1, W_{\cdot 2}' R_2, \ldots, W_{\cdot T}' R_T)$, $W_{\cdot t}' = (w_{1t}, w_{2t}, \ldots, w_{Nt})$, and $w_{it}$ is the $t$th column of $W_{i\cdot}' = \left(T^{-1} X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} X_{i\cdot}' M_D$.

Theorem 4 (Pooled Estimator—Homogeneous Slopes, Spatial Corr. and No Common Factors). Consider the panel data model (1)–(2). Suppose that Assumptions 1–4 and 5(b) hold, that $\gamma_i = 0$ for $i = 1,2,\ldots,N$, and $\Omega_\upsilon = 0$. Then for the pooled estimator, $\hat{\beta}_P$, given by (5), as $N$ and/or $T \to \infty$ we have
$$\sqrt{NT}\left(\hat{\beta}_P - \beta\right) \xrightarrow{d} N(0, \Sigma_P),$$
where
$$\Sigma_P = Q^{-1} \Psi_P Q^{-1}, \qquad (16)$$
with
$$Q = \lim_{M\to\infty} \frac{1}{M} \sum_{i=1}^{N} X_{i\cdot}' M_D X_{i\cdot}, \qquad \Psi_P = \lim_{M\to\infty} \frac{1}{M} P' \Omega_{\varepsilon\varepsilon} P,$$
$M = NT$, $P' = (\tilde{X}_{\cdot 1}' R_1, \tilde{X}_{\cdot 2}' R_2, \ldots, \tilde{X}_{\cdot T}' R_T)$, $\tilde{X}_{\cdot t} = (\tilde{x}_{1t}, \tilde{x}_{2t}, \ldots, \tilde{x}_{Nt})'$, and $\tilde{x}_{it}$ is the $t$th column of $\tilde{X}_{i\cdot}' = X_{i\cdot}' M_D$.

The asymptotic variances $\Sigma_{MG}$ and $\Sigma_P$ depend on the particular specifications of $R_t$, $t = 1,2,\ldots,T$, and on $\Omega_{\varepsilon\varepsilon}$. One important question is to determine whether the robust variance estimators introduced above can still be used in the case of homogeneous slopes. To investigate this issue, one possibility would be to check whether the individual estimators $\sqrt{T}(\hat{\beta}_i - \beta)$, for $i = 1,2,\ldots,N$ (see formula (4)), are asymptotically independent and normal across $i$. Under this condition, using results in Ibragimov and Müller (2010), it is possible to show that it is still valid to base the inference on $\hat{\beta}_{MG}$ and its robust variance estimator given by (10). These authors prove that the type I error of a t-test based on $\hat{\beta}_{MG}$ is not greater than the level of statistical significance chosen for this test, $\alpha$, under the condition that $\alpha < 0.083$ (see also Bakirov and Székely, 2006). Note that
$$\sqrt{T}\left(\hat{\beta}_i - \beta\right) = \left(T^{-1} X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} \frac{1}{\sqrt{T}} X_{i\cdot}' M_D e_{i\cdot},$$
and the covariance between $\sqrt{T}(\hat{\beta}_i - \beta)$ and $\sqrt{T}(\hat{\beta}_j - \beta)$, for $i \neq j$, is given by
$$T \cdot E\left[\left(\hat{\beta}_i - \beta\right)\left(\hat{\beta}_j - \beta\right)'\right] = \frac{1}{T} W_{i\cdot}' E\left(e_{i\cdot} e_{j\cdot}'\right) W_{j\cdot} = \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{T} w_{it} w_{js}' E\left(e_{it} e_{js}\right) = \frac{1}{T} \sum_{t,s=1}^{T} w_{it} w_{js}' E\left(r_{i\cdot,t}' \varepsilon_{\cdot t} \varepsilon_{\cdot s}' r_{j\cdot,s}\right) = \frac{1}{T} \sum_{t,s=1}^{T} w_{it} w_{js}' \left[\sum_{h=1}^{N} E\left(r_{ih,t} r_{jh,s} \varepsilon_{ht} \varepsilon_{hs}\right)\right].$$

The above covariance is zero for all $i \neq j$ only under certain conditions. For example, it is zero if the idiosyncratic errors are cross-sectionally independent (though possibly serially correlated), or if the elements $r_{ih,t}$ are random and satisfy $E(r_{ih,t} r_{jh,s}) = 0$, for all $i \neq j$, $h = 1,2,\ldots,N$ and $t,s = 1,2,\ldots,T$. We observe that the condition $E(r_{ih,t} r_{jh,s}) = 0$ holds if the entries in the $i$th row of $R_t$ are distributed independently of the entries in the $j$th row of $R_t$, at all time periods. However, these restrictions are unlikely to hold in general. To obtain robust estimates of the asymptotic variances in the general case, one possibility would be to consider a generalized version of the Newey–West procedure that allows for spatial effects. For purely spatial error processes, heteroskedasticity and spatial correlation consistent (HSC) estimators have been proposed by Conley (1999) (see also the early contribution by Driscoll and Kraay (1998), and the method suggested in Pinkse et al. (2002)). More recently, Kelejian and Prucha (2007) have proposed a new HSC estimator that approximates the true covariance matrix with a weighted average of cross-products of regression errors, where each element is weighted by a function of (possibly multiple) distances between cross-section units. Bester et al. (2009), using results taken from Ibragimov and Müller (2010), propose to divide the sample into groups so that group-level averages are approximately independent, and accordingly suggest an HSC estimator based on a discrete group-membership metric. However, the validity of this approach relies on the ability of the researcher to construct groups whose averages are approximately independent. In contrast, the Kelejian and Prucha (2007) approach stands out as a flexible and robust method, as it does not entail high level assumptions, allows for multiple distance measures, and is robust to some measurement errors in the specification of the distance matrix.

Note that (15) and (16) can be written as
$$\Sigma_{MG} = \frac{1}{NT} \sum_{i,j=1}^{N} \sum_{t,s=1}^{T} w_{it} w_{js}' E\left(e_{it} e_{js}\right), \qquad \Sigma_P = Q^{-1} \left[\frac{1}{NT} \sum_{i,j=1}^{N} \sum_{t,s=1}^{T} \tilde{x}_{it} \tilde{x}_{js}' E\left(e_{it} e_{js}\right)\right] Q^{-1}.$$
Following Kelejian and Prucha (2007), one assumes that there exists a ``meaningful'', time-invariant measure of distance between cross-sectional units, summarized in the $N \times N$ matrix $\Phi$ with elements $\phi_{ij} \geq 0$. Let $\hat{e}_{it} = y_{it} - \hat{\alpha}_i' d_t - \hat{\beta}_{MG}' x_{it}$ (see Section 5 for the estimation of $\alpha_i$). A Newey–West SHAC estimator of the variance of $\hat{\beta}_{MG}$ can be computed as
$$\widehat{\mathrm{Asy.Var}}\left(\hat{\beta}_{MG}\right) = \frac{1}{(NT)^2} \sum_{i,j=1}^{N} \sum_{t,s=1}^{T} K\left(\frac{\phi_{ij}}{\phi_N}, \frac{|t-s|}{p+1}\right) w_{it} w_{js}' \hat{e}_{it} \hat{e}_{js}, \qquad (17)$$
where $\phi_N > 0$ is an arbitrary scalar function of $N$, $p$ is the window size for the time series dimension, and $K(\cdot)$ is a kernel function that we set equal to
$$K\left(\frac{\phi_{ij}}{\phi_N}, \frac{|t-s|}{p+1}\right) = K_1\left(\frac{\phi_{ij}}{\phi_N}\right) K_2\left(\frac{|t-s|}{p+1}\right),$$
where $K_1(\cdot)$ and $K_2(\cdot)$ satisfy a set of regularity conditions (see, in particular, Assumption 7 in Kelejian and Prucha (2007)). Note that $K_1\left(\phi_{ij}/\phi_N\right) = 0$ for $\phi_{ij} > \phi_N$, $K_2\left(|t-s|/(p+1)\right) = 0$ for $|t-s| > p+1$, and that $K_1(0) = K_2(0) = 1$. Similarly, a Newey–West SHAC estimator of the variance of $\hat{\beta}_P$ is
$$\widehat{\mathrm{Asy.Var}}\left(\hat{\beta}_P\right) = Q_{NT}^{-1} \left[\frac{1}{(NT)^2} \sum_{i,j=1}^{N} \sum_{t,s=1}^{T} K\left(\frac{\phi_{ij}}{\phi_N}, \frac{|t-s|}{p+1}\right) \tilde{x}_{it} \tilde{x}_{js}' \hat{e}_{it} \hat{e}_{js}\right] Q_{NT}^{-1}, \qquad (18)$$
where $Q_{NT}$ is given by (12), and $\hat{e}_{it}$ is now given by $\hat{e}_{it} = y_{it} - \hat{\alpha}_i' d_t - \hat{\beta}_P' x_{it}$. The rest of the notation is as above. We observe that the above estimators require knowledge of the exact relative position of units across space, although, as argued in Kelejian and Prucha (2007), the estimator remains valid under certain mis-specifications of the distance metric.
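The following sketch illustrates a SHAC computation of (17). The Bartlett choices for $K_1$ and $K_2$ are one admissible kernel pair satisfying the truncation and normalization properties stated above, not necessarily the pair used in the paper; the function and argument names are hypothetical.

```python
import numpy as np

def shac_var_mg(ehat, X, D, dist, phi_N, p):
    """Newey-West SHAC estimator of Var(beta_mg), Eq. (17), with Bartlett
    kernels for K1 and K2.

    ehat: (N, T) residuals; X: (N, T, k); D: (T, n); dist: (N, N) distances;
    phi_N: distance cutoff; p: time-series window size.
    """
    N, T, k = X.shape
    M_D = np.eye(T) - D @ np.linalg.solve(D.T @ D, D.T)
    # w[i, :, t] is w_it, the t-th column of W_i' = (X_i'M_D X_i/T)^{-1} X_i'M_D.
    w = np.zeros((N, k, T))
    for i in range(N):
        XM = X[i].T @ M_D
        w[i] = np.linalg.solve(XM @ X[i] / T, XM)
    K1 = np.clip(1.0 - dist / phi_N, 0.0, None)        # zero for dist >= phi_N
    lags = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    K2 = np.clip(1.0 - lags / (p + 1.0), 0.0, None)    # zero beyond the window
    V = np.zeros((k, k))
    for i in range(N):
        for j in range(N):
            if K1[i, j] == 0.0:
                continue
            # sum over t,s of K2[t,s] * ehat_it * ehat_js * w_it w_js'
            V += K1[i, j] * ((w[i] * ehat[i]) @ K2 @ (w[j] * ehat[j]).T)
    return V / (N * T) ** 2
```

In the Monte Carlo experiments reported below, $\phi_N = N^{1/4}$ and $p = 2T^{1/2}$.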
4. Estimating panels with unobserved common factors and spatial error correlation

We now turn to the estimation of the slope coefficients in the context of panels with both common factors and spatial error dependence. We restrict our attention to the CCE approach since, as compared to other existing methods, it is simple to apply and has been shown to be robust to the choice of $m$ (the number of common factors), the temporal dynamics of the unobserved common factors, and the idiosyncratic error. The idea underlying this approach is that, as far as estimation of the slope coefficients is concerned, the unobservable common factors can be well approximated by the cross-section averages of the dependent variable, $\bar{y}_{\cdot t} = N^{-1}\sum_{i=1}^{N} y_{it}$, and of the individual-specific regressors, $\bar{x}_{\cdot t} = N^{-1}\sum_{i=1}^{N} x_{it}$. Hence, estimation can be carried out by least squares applied to auxiliary regressions where the observed regressors are augmented with these cross-section averages plus the observed common factors, $d_t$.

To model the correlation between the individual-specific regressors, $x_{it}$, and the common factors, it is supposed that
$$x_{it} = A_i' d_t + \Gamma_i' f_t + v_{it}, \qquad (19)$$
where $A_i$ and $\Gamma_i$ are $n \times k$ and $m \times k$ factor loading matrices with fixed components, and $v_{it}$ is the individual-specific component of $x_{it}$. Let $\bar{M}$ be defined by
$$\bar{M} = I_T - \bar{H}\left(\bar{H}'\bar{H}\right)^{-}\bar{H}', \qquad (20)$$
$\bar{H} = (D, \bar{Z})$, where $D$ and $\bar{Z}$ are, respectively, the matrices of observations on $d_t$ and $\bar{z}_{\cdot t} = (\bar{y}_{\cdot t}, \bar{x}_{\cdot t}')'$. We make the following assumptions on the common factors and their loadings, and on the individual, or unit-specific, errors:

Assumption 6. The $(n+m) \times 1$ vector $g_t = (d_t', f_t')'$ is a covariance stationary process, with absolutely summable autocovariances, distributed independently of $e_{is}$ and $v_{is}$ for all $i$, $t$, $s$.¹

Assumption 7. The unobserved factor loadings, $\gamma_i$ and $\Gamma_i$, are bounded, i.e. $\|\gamma_i\|_2 < K$ and $\|\Gamma_i\|_2 < K$, for all $i$. Further, it is assumed that the random deviations, $\upsilon_i$, of the slope coefficients are distributed independently of $\gamma_i$ and $\Gamma_i$.

¹ This assumption can be relaxed to allow for unit roots in the common factors, along the lines set out in Kapetanios et al. (2011).
Assumption 8. The individual-specific errors $e_{it}$ and $v_{js}$ are distributed independently for all $i$, $j$, $t$ and $s$, and for each $i$, $v_{it}$ follows a linear stationary process with absolutely summable autocovariances given by
$$v_{it} = \sum_{\ell=0}^{\infty} \Pi_{i\ell}\, \zeta_{i,t-\ell},$$
where, for each $i$, $\zeta_{it}$ is a $k \times 1$ vector of serially uncorrelated random variables with mean zero, variance matrix $I_k$, and finite fourth-order cumulants. For each $i$, the coefficient matrices $\Pi_{i\ell}$ satisfy the condition
$$\mathrm{Var}(v_{it}) = \sum_{\ell=0}^{\infty} \Pi_{i\ell} \Pi_{i\ell}' = \Sigma_{vi},$$
where $\Sigma_{vi}$ is a positive definite matrix, such that $\sup_i \|\Sigma_{vi}\|_2 < K$.

Assumption 9. Let $\tilde{\Gamma} = E(\gamma_i, \Gamma_i) = (\gamma, \Gamma)$. We assume that $\mathrm{Rank}(\tilde{\Gamma}) = m$.

Assumption 10. Consider the cross-section averages of the individual-specific variables, $z_{it} = (y_{it}, x_{it}')'$, defined by $\bar{z}_{\cdot t} = \frac{1}{N}\sum_{i=1}^{N} z_{it}$, and let $\bar{M}$ be defined by (20). Then the following conditions hold:
(a) The matrix $\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \Sigma_{vi}$ exists and is non-singular.
(b) There exist $T_0$ and $N_0$ such that, for all $T \geq T_0$ and $N \geq N_0$, the $k \times k$ matrices $T^{-1} X_{i\cdot}' \bar{M} X_{i\cdot}$ and $T^{-1} X_{i\cdot}' M_g X_{i\cdot}$, where $M_g = I_T - G(G'G)^{-}G'$, with $G = (D, F)$, exist and are non-singular for all $i$, and $\sup_i E\left\| T^{-1} X_{i\cdot}' M_g X_{i\cdot} \right\|_2 < K < \infty$.
estimators as (N , T ) − → ∞. Following Pesaran (2006), the mean group and pooled estimators for β in a panel with spatial correlation and common factors are given by (3) and (5), applied to a regression equation where the observed regressors are augmented with the cross-section averages of the dependent variable, y¯ .t , and of the regressors, x¯ .t . Specifically, the CCE mean group estimator is
βˆ CCEMG = N −1
N −
βˆ CCE,i ,
(21)
i=1
where
¯ i. )−1 X′i. My ¯ i. , βˆ CCE,i = (X′i. MX
(22)
and the CCE pooled estimator is
βˆ CCEP =
N −
−1 ¯ i. X′i. MX
i=1
N −
¯ i. , X′i. My
(23)
i =1
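Computationally, the CCE estimators only require replacing $M_D$ by the augmented projection $\bar{M}$ of (20); a minimal sketch follows (Python; array conventions and names as in the earlier sketches, with a pseudo-inverse as one way of implementing the generalized inverse $(\bar{H}'\bar{H})^{-}$).

```python
import numpy as np

def cce_estimators(y, X, D):
    """CCE mean group (21)-(22) and CCE pooled (23) estimators.

    y: (N, T); X: (N, T, k); D: (T, n). The projection uses
    H_bar = (D, z_bar) with z_bar_t = (y_bar_t, x_bar_t')', as in Eq. (20).
    """
    N, T, k = X.shape
    z_bar = np.column_stack([y.mean(axis=0), X.mean(axis=0)])  # (T, 1+k)
    H = np.column_stack([D, z_bar])
    # M_bar = I_T - H (H'H)^- H'; pinv handles possible rank deficiency.
    M = np.eye(T) - H @ np.linalg.pinv(H.T @ H) @ H.T
    betas = np.zeros((N, k))
    A_sum = np.zeros((k, k))
    b_sum = np.zeros(k)
    for i in range(N):
        A = X[i].T @ M @ X[i]
        b = X[i].T @ M @ y[i]
        betas[i] = np.linalg.solve(A, b)            # Eq. (22)
        A_sum += A
        b_sum += b
    beta_ccemg = betas.mean(axis=0)                 # Eq. (21)
    beta_ccep = np.linalg.solve(A_sum, b_sum)       # Eq. (23)
    return beta_ccemg, beta_ccep, betas
```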
The following theorems apply to the above estimators (the proofs are provided in the Appendix).

Theorem 5 (CCE MG Estimator—Heterog. Slopes, Spatial Corr. and Common Factors). Consider the panel data model given by Eqs. (1), (2) and (19). Suppose that Assumptions 1–3 and 6–10 hold. Then for the common correlated effects mean group estimator, $\hat{\beta}_{CCEMG}$, given by (21), as $(N,T) \xrightarrow{j} \infty$ we have
$$\sqrt{N}\left(\hat{\beta}_{CCEMG} - \beta\right) \xrightarrow{d} N(0, \Sigma_{CCEMG}), \qquad (24)$$
where $\Sigma_{CCEMG} = \Omega_\upsilon$.
Theorem 6 (CCE Pooled Estimator—Heterog. Slopes, Spatial Corr. and Common Factors). Consider the panel data model given by Eqs. (1), (2) and (19). Suppose that Assumptions 1–3 and 6–10 hold. Then for the common correlated effects pooled estimator, $\hat{\beta}_{CCEP}$, given by (23), as $(N,T) \xrightarrow{j} \infty$, we have
$$\sqrt{N}\left(\hat{\beta}_{CCEP} - \beta\right) \xrightarrow{d} N(0, \Sigma_{CCEP}),$$
where
$$\Sigma_{CCEP} = \Psi^{*-1} R^* \Psi^{*-1}, \qquad \Psi^* = \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} \Sigma_{vi}, \qquad R^* = \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} \Sigma_{vi}\, \Omega_\upsilon\, \Sigma_{vi}.$$

Consistent estimators for the asymptotic variances of $\hat{\beta}_{CCEMG}$ and $\hat{\beta}_{CCEP}$ are (see also Section 3 above)
$$\widehat{\mathrm{Asy.Var}}\left(\hat{\beta}_{CCEMG}\right) = \frac{1}{N(N-1)} \sum_{i=1}^{N} \left(\hat{\beta}_{CCE,i} - \hat{\beta}_{CCEMG}\right)\left(\hat{\beta}_{CCE,i} - \hat{\beta}_{CCEMG}\right)', \qquad (25)$$
$$\widehat{\mathrm{Asy.Var}}\left(\hat{\beta}_{CCEP}\right) = \frac{1}{N} \hat{\Psi}^{*-1} \hat{R}^* \hat{\Psi}^{*-1}, \qquad (26)$$
with
$$\hat{\Psi}^* = \frac{1}{N} \sum_{i=1}^{N} \frac{X_{i\cdot}' \bar{M} X_{i\cdot}}{T}, \qquad \hat{R}^* = \frac{1}{N-1} \sum_{i=1}^{N} \left(\frac{X_{i\cdot}' \bar{M} X_{i\cdot}}{T}\right) \left(\hat{\beta}_{CCE,i} - \hat{\beta}_{CCEMG}\right)\left(\hat{\beta}_{CCE,i} - \hat{\beta}_{CCEMG}\right)' \left(\frac{X_{i\cdot}' \bar{M} X_{i\cdot}}{T}\right).$$
As in the pure spatial case, if the slope coefficients $\beta_i$ are homogeneous the CCE estimators must be multiplied by $\sqrt{NT}$, rather than by $\sqrt{N}$, to obtain non-degenerate asymptotic distributions, namely
$$\sqrt{NT}\left(\hat{\beta}_{CCEMG} - \beta\right) = \frac{1}{\sqrt{NT}} \sum_{i=1}^{N} \left(T^{-1} X_{i\cdot}' \bar{M} X_{i\cdot}\right)^{-1} X_{i\cdot}' \bar{M} \left(F\gamma_i + e_{i\cdot}\right), \qquad (27)$$
$$\sqrt{NT}\left(\hat{\beta}_{CCEP} - \beta\right) = \left(N^{-1} \sum_{i=1}^{N} T^{-1} X_{i\cdot}' \bar{M} X_{i\cdot}\right)^{-1} \frac{1}{\sqrt{NT}} \sum_{i=1}^{N} X_{i\cdot}' \bar{M} \left(F\gamma_i + e_{i\cdot}\right). \qquad (28)$$

Using results (A.17)–(A.20) in the Appendix, it follows that $\hat{\beta}_{CCEMG}$ and $\hat{\beta}_{CCEP}$ continue to be consistent for $\beta$ as $N \to \infty$, for $T$ fixed or $T \to \infty$, although their asymptotic distributions will generally depend on nuisance parameters. Following similar lines of reasoning as in the pure spatial case, we now investigate whether $\hat{\beta}_{CCEMG}$, together with the non-parametric variance estimator (10), can still be used in the homogeneous slopes case. First note that, using results (A.12)–(A.14) in the Appendix, we have
$$\sqrt{T}\left(\hat{\beta}_{CCE,i} - \beta\right) = \hat{\Psi}_{iT}^{-1}\, \frac{X_{i\cdot}' \bar{M} \left(F\gamma_i + e_{i\cdot}\right)}{\sqrt{T}} = \frac{1}{\sqrt{T}} W_{i\cdot}' e_{i\cdot} + O_p\left(\frac{\sqrt{T}}{N}\right) + O_p\left(\frac{1}{\sqrt{N}}\right),$$
with $\hat{\Psi}_{iT} = T^{-1} X_{i\cdot}' \bar{M} X_{i\cdot}$, and where $W_{i\cdot}' = \left(T^{-1} X_{i\cdot}' M_g X_{i\cdot}\right)^{-1} X_{i\cdot}' M_g$. Further, for large $N$ and $T$ with $\sqrt{T}/N \to 0$,
$$T \cdot E\left[\left(\hat{\beta}_{CCE,i} - \beta\right)\left(\hat{\beta}_{CCE,j} - \beta\right)'\right] = T^{-1} E\left[\hat{\Psi}_{iT}^{-1}\, X_{i\cdot}' \bar{M}\left(F\gamma_i + e_{i\cdot}\right)\left(\gamma_j' F' + e_{j\cdot}'\right)\bar{M} X_{j\cdot}\, \hat{\Psi}_{jT}^{-1}\right] \approx T^{-1} W_{i\cdot}' E\left(e_{i\cdot} e_{j\cdot}'\right) W_{j\cdot}.$$

As in the pure spatial case, the above expression is asymptotically zero only under certain conditions, for example when the idiosyncratic errors are cross-sectionally independent, or if the entries of the matrix $R_t$, for $t = 1,2,\ldots,T$, are random and independently distributed across rows. Later in the paper, we will investigate the small sample properties of tests based on the robust variances (25) and (26) under both heterogeneous and homogeneous slopes.

Remark 2. The CCE approach continues to be applicable even if the rank condition outlined in Assumption 9 is not satisfied. Failure of the rank condition can occur if there is an unobserved factor for which the average of the loadings in the $y_{it}$ and $x_{it}$ equations tends to a zero vector (Pesaran and Tosetti, 2009). This could happen if, for example, such a factor carries weak cross-section dependence. Another possible reason for failure of Assumption 9 is if the number of unobservable factors, $m$, is larger than $k+1$, where $k$ is the number of regressors. In these cases, the common factors cannot be estimated from cross-section averages. However, it is possible to show that the cross-sectional means of the slope coefficients, $\beta_i$, can still be consistently estimated, under the additional assumption that the unobserved factor loadings, $\gamma_i$, in Eq. (1) are independently and identically distributed across $i$, and independently of $e_{jt}$, $v_{jt}$, and $g_t = (d_t', f_t')'$ for all $i$, $j$ and $t$. No assumptions (other than Assumption 7) are required on the loadings attached to the regressors, $x_{it}$. The proofs of consistency and asymptotic normality of the CCE estimator in the rank deficient case are straightforward extensions of the results provided in Pesaran (2006).

Remark 3. We observe that the CCE estimator does not entail any assumptions on the cumulative effect of the factors on the cross-section units. This is in contrast to the use of principal components, which requires the errors to display a strong factor structure, namely that $\sum_{i=1}^{N} \gamma_{i\ell}^2 > c_T\, \sigma^2$, for $\ell = 1,2,\ldots,m$, where $c_T$ is such that $\sqrt{N}\, c_T^{-1} = o\left(N^{-1/2}\right)$, and $\sigma^2$ is the variance of the idiosyncratic error. In the absence of this condition the principal components estimates of the factors would be inconsistent. See, for example, Onatski (2009), and Paul (2007).

Remark 4. Kapetanios et al. (2011) considered the case where the unobservable common factors follow unit root processes and could be cointegrated. They showed that the asymptotic distribution of the panel estimators in the case of I(1) factors is similar to that in the stationary case, reported in Theorems 5 and 6.

5. Residuals from CCE regression
We now consider the consistent estimation of regression errors uit = yit − α′i dt − β′i xit in model (1). Estimation of uit is needed for computing tests of error cross-section independence, or when the objects of interest are the coefficients of the spatial process, eit . Before continuing, without loss of generality, we specify some further assumptions on the observed and unobserved common factors. In particular:
Assumption 11. $E(f_t) = 0$, for $t = 1,\ldots,T$, and the $n \times 1$ vector of observed common factors, $d_t$, is distributed independently of $f_{t'}$, for all $t$ and $t'$, such that
$$\frac{D'F}{T} = O_p\left(\frac{1}{\sqrt{T}}\right). \qquad (29)$$

This is an identification condition that allows one to separate the effects of observed and unobserved common effects in $u_{it}$. Note that the cross-section averages, $\bar{z}_{\cdot t}$, contain information not only on the unobserved factors, $f_t$, but also on the observed factors, $d_t$. Given that the number, nature and source of the unobserved common factors are unknown, without Assumption 11 it would not be possible to separate the effects of these two sets of common variables. However, since $f_t$ is unobserved, this assumption can be easily accommodated by a suitable re-definition of $f_t$ and the associated factor loadings.

Consider the OLS estimates
$$\hat{\alpha}_i = \left(D'D\right)^{-1} D'\left(y_{i\cdot} - X_{i\cdot}\, \hat{\beta}_{CCE,i}\right), \qquad (30)$$
where $\hat{\beta}_{CCE,i}$ is given by (22). Under Assumptions 1–3 and 6–10, and given (A.12)–(A.14), we have²
$$\hat{\beta}_{CCE,i} - \beta_i = \left(T^{-1} X_{i\cdot}' \bar{M} X_{i\cdot}\right)^{-1} \frac{X_{i\cdot}' \bar{M}\left(F\gamma_i + e_{i\cdot}\right)}{T} = O_p\left(\frac{1}{N}\right) + O_p\left(\frac{1}{\sqrt{NT}}\right) + O_p\left(\frac{1}{\sqrt{T}}\right),$$
$$\hat{\alpha}_i - \alpha_i = \left(T^{-1} D'D\right)^{-1} \frac{D'X_{i\cdot}}{T}\left(\beta_i - \hat{\beta}_{CCE,i}\right) + \left(T^{-1} D'D\right)^{-1} \frac{D'F}{T}\,\gamma_i + \left(T^{-1} D'D\right)^{-1} \frac{D'e_{i\cdot}}{T},$$
and hence
$$\hat{\alpha}_i - \alpha_i = O_p\left(\frac{1}{N}\right) + O_p\left(\frac{1}{\sqrt{NT}}\right) + O_p\left(\frac{1}{\sqrt{T}}\right). \qquad (31)$$

Note that, unlike in the case of a simple panel data model with fixed effects and no unobserved common factors, consistency of $\hat{\alpha}_i$ requires both $N$ and $T$ going to infinity, due to the additional $O_p\left(N^{-1}\right)$ term in (31). This term arises since the unobserved common factors are approximated by cross-section averages. Now consider the residuals $\hat{u}_{it} = y_{it} - \hat{\beta}_{CCE,i}' x_{it} - \hat{\alpha}_i' d_t$. Given the consistency of $\hat{\beta}_{CCE,i}$ and $\hat{\alpha}_i$, it follows that
$$u_{it} = \hat{u}_{it} + O_p\left(\frac{1}{N}\right) + O_p\left(\frac{1}{\sqrt{NT}}\right) + O_p\left(\frac{1}{\sqrt{T}}\right).$$

Similarly, in the homogeneous case, adopting the CCEP estimator, under Assumptions 1–3 and 6–10, from (A.20) we have
$$\hat{\alpha}_i - \alpha_i = \left(T^{-1} D'D\right)^{-1} \frac{D'X_{i\cdot}}{T}\left(\beta - \hat{\beta}_{CCEP}\right) + \left(T^{-1} D'D\right)^{-1} \frac{D'F}{T}\,\gamma_i + \left(T^{-1} D'D\right)^{-1} \frac{D'e_{i\cdot}}{T} = O_p\left(\frac{1}{\sqrt{N}}\right) + O_p\left(\frac{1}{\sqrt{T}}\right) + O_p\left(\frac{1}{\sqrt{NT}}\right). \qquad (32)$$

Let $\hat{u}_{it} = y_{it} - \hat{\beta}_{CCEP}' x_{it} - \hat{\alpha}_i' d_t$. Given (A.20) and (32), we obtain
$$u_{it} = \hat{u}_{it} + O_p\left(\frac{1}{\sqrt{N}}\right) + O_p\left(\frac{1}{\sqrt{T}}\right) + O_p\left(\frac{1}{\sqrt{NT}}\right).$$

Principal components analysis can be applied to the above residuals, $\hat{u}_{it}$, to estimate the common factors, $f_t$, and their loadings, $\gamma_i$. Note that these residuals continue to be consistent, as $(N,T) \xrightarrow{j} \infty$, even when the loadings attached to the unobserved factors are set to zero, namely, when the data generating process is (1)–(2). In this case, the parameters of the spatial process can be recovered by applying standard spatial econometric techniques to $\hat{u}_{it}$.

² Note that under our assumptions $T^{-1}D'e_{i\cdot} = O_p\left(T^{-1/2}\right)$. Indeed, $E\left(T^{-1}D'e_{i\cdot}\right) = 0$, since under Assumption 6, $D$ and $e_{i\cdot}$ are independently distributed. Further, under Assumption 6 and from Lemma A.1 (see result (A.2)), the largest eigenvalue of $E\left(e_{i\cdot}e_{i\cdot}'\right)$ is bounded. It follows that $\mathrm{Var}\left(T^{-1}D'e_{i\cdot}\right) = E\left(T^{-2}D'e_{i\cdot}e_{i\cdot}'D\right) \leq K \cdot E\left(T^{-2}D'D\right) = O\left(T^{-1}\right)$.
for i = 1, 2, . . . , N and t = 1, 2, . . . , T . In the above equations, d1t and d2t are observed common factors, f1t , f2t , and f3t are unobserved common effects, and eit are idiosyncratic errors. We adopt the following data generating processes: d1t = 1,
d2t = ρd d2,t −1 + vdt ,
t = −49, . . . , 0, 1, . . . , T ,
ˆ i , it follows that consistency of βˆ CCE,i and α 1 1 1 uit = uˆ it + Op + Op √ + Op √ .
vdt ∼ IIDN (0, 1 − ρ ), ρd = 0.5, d2,−50 = 0, fℓt = ρfℓ fℓ,t −1 + vfℓt , ℓ = 1, 2, 3; t = −49, . . . , 0, 1, . . . , T ,
Similarly, in the homogenous case, adopting the CCEP estimator, under Assumptions 1–3 and 6–10, from (A.20) we have
and
N
αˆ i − αi = T
NT
−1
−1 ′
DD
D′ Xi.
−1 + T −1 D ′ D = Op
1
√
N
D′ F
T
+ Op
1
vfℓt ∼ IIDN (0, 1 − ρf2ℓ ),
T
T
2 d
β − βˆ CCEP
ϑijt ∼ N (0, 1 − ρϑij ), 2
−1 γ i + T −1 D ′ D
+ Op
√
T
1
√
NT
D′ ei. T
.
Var T −1 D′ ei. = E T −2 D′ ei. e′i. D ≤ K · E T −2 D′ D = O(T −1 ).
fℓ,−50 = 0,
t = −49, . . . , 0, 1, . . . , T ,
vij,−50 = 0,
ρϑij ∼ IIDU (0.05, 0.95) for j = 1, 2.
(32)
The first 50 observations are discarded. The factor loadings of the observed common effects do not change across replications and are generated as
αi ∼ IIDN (1, 1), i = 1, 2, . . . , N , (ai11 , ai21 , ai12 , ai22 ) ∼ IIDN (0.5τ4 , 0.5I4 ),
2 Note that under our assumptions T −1 D′ e = O T −1/2 . Indeed, E T −1 D′ e = i. p i. 0, since under Assumption 6, D and ei. are independently distributed. Further, under Assumption 6 and from Lemma A.1 (see result (A.2)), the largest eigenvalue of E ei. e′i. is bounded. It follows that
vijt = ρυij vij,t −1 + ϑijt ,
ρfℓ = 0.5,
where τ4 = (1, 1, 1, 1)′ and I4 is a 4 × 4 identity matrix. We consider two alternative sets of experiments, that involve different hypotheses on the data generating process for the loadings of the unobserved common factors, and the way the idiosyncratic errors eit are generated:
A. The factor loadings of the unobserved common effects are set to zero, $\gamma_{i11} = \gamma_{i13} = \gamma_{i21} = \gamma_{i23} = \gamma_{i1} = \gamma_{i2} = 0$, and the individual-specific errors, $e_{it}$, are generated according to the SAR process
$$e_{it} = \delta_t \sum_{j=1}^{N} s_{ij}\, e_{jt} + \varepsilon_{it}, \quad \text{for } i = 1,2,\ldots,N,\ t = 1,2,\ldots,T, \qquad (33)$$
$$\varepsilon_{it} \sim N\left(0, \sigma_i^2\right), \quad \sigma_i^2 \sim IIDU\left(0.5, 1.5\right), \quad \text{for } i = 1,2,\ldots,N, \qquad (34)$$
where $\delta_t$ is the time-varying spatial autoregressive coefficient, which we set to $\delta_t = \delta = 0.4$. $s_{ij}$, for $i,j = 1,2,\ldots,N$, are the elements of a spatial weights matrix $S$, assumed to be time-invariant. We follow Kelejian and Prucha (2007) and assume that units are located on a rectangular grid at locations $(r,s)$, for $r = 1,\ldots,m_1$; $s = 1,2,\ldots,m_2$, such that $N = m_1 m_2$.³ The distance $\phi_{ij}$ between units is given by the Euclidean distance, and $S$ is taken to be a rook-type matrix where two units are neighbours if their Euclidean distance is less than or equal to one. The weights matrix is normalized such that the weights in each row sum to one.

B. The parameters of the unobserved common effects in the $x_{it}$ and $y_{it}$ equations are generated as
$$\begin{pmatrix} \gamma_{i11} & 0 & \gamma_{i13} \\ \gamma_{i21} & 0 & \gamma_{i23} \end{pmatrix} \sim IID \begin{pmatrix} N(0.5, 0.5) & 0 & N(0, 0.5) \\ N(0, 0.5) & 0 & N(0.5, 0.5) \end{pmatrix},$$
and
$$\gamma_{i1} \sim IIDN(1, 0.2), \quad \gamma_{i2} \sim IIDN(1, 0.2), \quad \gamma_{i3} = 0,$$
and the individual-specific errors, $e_{it}$, are generated as in (33) and (34), with $\delta_t = \delta = 0.4$. This set of experiments aims at investigating the extent to which the CCE estimators capture the effects of local as well as global cross-section dependence.

For each case, we consider two alternative assumptions on the slope coefficients:
(i) The case of heterogeneous slopes, where $\beta_{ij} = \beta_j + \eta_{ij}$, with $\beta_j = 1$ and $\eta_{ij} \sim IIDN(0, 0.04)$, for $i = 1,2,\ldots,N$ and $j = 1,2$, varying across replications.
(ii) The case of homogeneous slopes, where $\beta_{ij} = 1$, for $i = 1,2,\ldots,N$ and $j = 1,2$.

³ For a given value of $N = m_1 m_2$, we set $m_1$ and $m_2$ such that these are integer numbers and $|m_1 - m_2|$ is minimized. In particular, for $N = 20$ we set $m_1 = 5$, $m_2 = 4$; for $N = 30$ we set $m_1 = 6$, $m_2 = 5$; for $N = 50$ we set $m_1 = 10$, $m_2 = 5$; and for $N = 100$ we set $m_1 = m_2 = 10$.
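A sketch of the Experiment A error generating process, namely the row-normalized rook weights on an $m_1 \times m_2$ grid and the SAR draws of (33)–(34), is given below (Python; the seed and function names are our own choices).

```python
import numpy as np

def rook_weights(m1, m2):
    """Row-normalized rook-type weights on an m1 x m2 grid: two units are
    neighbours when their Euclidean distance is at most one."""
    N = m1 * m2
    S = np.zeros((N, N))
    for r in range(m1):
        for s in range(m2):
            i = r * m2 + s
            for dr, ds in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, ss = r + dr, s + ds
                if 0 <= rr < m1 and 0 <= ss < m2:
                    S[i, rr * m2 + ss] = 1.0
    return S / S.sum(axis=1, keepdims=True)

def sar_errors(S, T, delta=0.4, rng=None):
    """Spatial errors from Eqs. (33)-(34): e_t = (I - delta*S)^{-1} eps_t,
    with eps_it ~ N(0, sigma_i^2) and sigma_i^2 ~ U(0.5, 1.5)."""
    if rng is None:
        rng = np.random.default_rng(0)
    N = S.shape[0]
    sigma2 = rng.uniform(0.5, 1.5, size=N)
    R = np.linalg.inv(np.eye(N) - delta * S)
    eps = rng.standard_normal((N, T)) * np.sqrt(sigma2)[:, None]
    return R @ eps                       # (N, T) spatially correlated errors

e = sar_errors(rook_weights(5, 4), T=100)   # N = 20, as in the experiments
```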
Experiment C: robustness checks. The aim of this set of experiments is to investigate the extent to which the use of robust standard errors is effective in dealing with serially correlated errors, high spatial error correlation, and time variations in the degree and source of error cross-section dependence:

1. Serially correlated errors. We allow the errors $\varepsilon_{it}$ in (33) to be serially correlated. In particular, $\varepsilon_{it}$ are generated as stationary AR(1) processes for half of the cross-section units, and as MA(1) processes for the remaining cross-section units:
$$\varepsilon_{it} = \rho_{i\varepsilon}\, \varepsilon_{i,t-1} + \sigma_i \left(1 - \rho_{i\varepsilon}^2\right)^{1/2} \zeta_{it}, \quad i = 1,\ldots,\lfloor N/2 \rfloor,$$
$$\varepsilon_{it} = \sigma_i \left(1 + \theta_{i\varepsilon}^2\right)^{-1/2} \left(\zeta_{it} + \theta_{i\varepsilon}\, \zeta_{i,t-1}\right), \quad i = \lfloor N/2 \rfloor + 1,\ldots,N,$$
where $\zeta_{it} \sim IIDN(0,1)$, $\sigma_i^2 \sim IIDU(0.5, 1.5)$, $\rho_{i\varepsilon} \sim IIDU(0.05, 0.95)$, and $\theta_{i\varepsilon} \sim IIDU(0, 0.8)$. For this sub-experiment we set $\delta_t = \delta = 0.4$, assume no unobserved common factors, and consider both cases of heterogeneous and homogeneous slopes.

2. High spatial error correlation. For this sub-experiment we set $\delta_t = \delta = 0.8$. Further, we assume that there are no unobserved common factors and that the slope coefficients are heterogeneous (i.e., as in case (i)).

3. Time-varying spatial correlation. The spatial autoregressive coefficients are generated as $\delta_t \sim IIDU(0, 0.8)$, for $t = 1,2,\ldots,T$, and are fixed across replications. For this sub-experiment we assume no unobserved common factors and heterogeneous slopes (i.e., as in case (i)).

4. Time-varying cross-section dependence. We allow the cross-section dependence to change from weak to strong and back to weak. Specifically, for $t = 1,2,\ldots,\lfloor T/3 \rfloor$ the parameters of the unobserved common effects in the $x_{it}$ and $y_{it}$ equations and the errors $e_{it}$ are generated as in Experiment A, with $\delta_t = \delta = 0.8$. For $t = \lfloor T/3 \rfloor + 1,\ldots,\lfloor 2T/3 \rfloor$ they are generated as in Experiment B, with $\delta_t = \delta = 0$ (i.e., the error processes include common factors only). For $t = \lfloor 2T/3 \rfloor + 1,\ldots,T$ they are again generated as in Experiment A, with $\delta_t = \delta = 0.8$. The aim of this set of experiments is to investigate the robustness of our estimators to possible time variations in the degree of cross-section dependence. For this sub-experiment we assume heterogeneous slopes (i.e., as in case (i)).

Each experiment was replicated 2000 times for the $(N,T)$ pairs with $N, T = 20, 30, 50, 100$. We report the small sample properties of a number of estimators of the slope coefficients. In particular, we computed the mean group estimator (3), both with the robust variance (10) and with the SHAC variance (17), and the pooled estimator (5), both with the robust variance (11) and with the SHAC variance (18). The SHAC variance estimators have been computed using both the true distance matrix $\Phi$ and a mis-specified version of the distance matrix obtained by incorrectly assuming that units are ordered on a line, rather than on a rectangular grid. Following Kelejian and Prucha (2007), we have set the parameter $\phi_N$ in (17) and (18) equal to $N^{1/4}$, and have fixed the window size for the time series part equal to $2T^{1/2}$. We also computed the ML estimator for a panel containing fixed effects, with spatially correlated and heteroskedastic errors. The likelihood function of this model for a given spatial matrix, $S$, is derived in a supplement which is available on request. Related derivations are also provided in Anselin (1988) and Lee (2004). We refer to this estimator as the ML-SAR estimator. The ML-SAR estimator is computed for two different spatial weights matrices: a correctly specified one, and a mis-specified version where units with Euclidean distance less than or equal to two are incorrectly taken as neighbours. This is done with the intent to check the effect of mis-specification of $S$ on the ML-SAR estimator. Finally, we report results for the CCE mean group estimator (21) with variance (25), and the CCE pooled estimator (23) with variance (26).

6.2. Monte Carlo results

Results for Experiment A are summarized in Tables A1 and A2, for Experiment B in Tables B1 and B2, and for Experiment C in Tables C1–C4. Each table provides estimates of bias, root mean squared errors (RMSE), size, and power. The nominal size is set to 5%, while the power of the various tests is computed under the alternative $H_1: \beta_1 = 0.95$. In what follows we focus on estimation of $\beta_1$; results for $\beta_2$ are very similar and are not reported.
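The size and power entries are simple rejection frequencies of the t-test across replications; a minimal sketch, assuming the replication-level estimates of $\beta_1$ and their standard errors have already been stored:

```python
import numpy as np
from scipy.stats import norm

def rejection_rate(beta_hats, ses, beta0, level=0.05):
    """Empirical rejection frequency of the two-sided t-test of H0: beta1 = beta0.

    beta_hats, ses: arrays of replication-level estimates of beta1 and their
    standard errors. Size is obtained at beta0 = 1 (the true value); power is
    obtained at beta0 = 0.95, as in the tables below.
    """
    t_stats = (beta_hats - beta0) / ses
    crit = norm.ppf(1 - level / 2)       # 1.96 for a 5% nominal size
    return np.mean(np.abs(t_stats) > crit)
```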
Table A1
Small sample properties of estimators for panels with spatially correlated errors (δ = 0.4) and no unobserved common factors, under slope heterogeneity.
[The table reports the bias (×100), RMSE (×100), size (×100) and power (×100) of the following estimators, for all combinations of N, T = 20, 30, 50, 100: Mean group, Pooled (with robust variances), SHAC mean group, SHAC pooled, SHAC mean group with D mis-specified (b), SHAC pooled with D mis-specified (b), ML-SAR, ML-SAR with S mis-specified (a), CCE mean group, and CCE pooled.]
Notes: Mean group, Pooled, CCE mean group and CCE pooled are (3), (5), (21) and (23), respectively. Robust variances of the Mean group and Pooled estimators are (10) and (11). SHAC variances of the Mean group and Pooled estimators are given by (17) and (18). Variances of the CCE mean group and CCE pooled estimators are given by (25) and (26).
a This estimator is computed under the incorrect assumption that cross-section units are neighbours if their Euclidean distance is less than or equal to 2, while in the true S matrix cross-section units are neighbours if their Euclidean distance is equal to 1.
b This estimator is computed under the incorrect assumption that cross-section units are ordered on a line, rather than on a rectangular grid.
Tables A1 and A2 summarize the results for the case where the errors are generated by a spatial autoregressive process without any common factors, under heterogeneous slopes (Table A1) and homogeneous slopes (Table A2). We first note that, for these experiments, the mean group and pooled procedures provide unbiased estimators of the mean of the slope coefficients, β. Accordingly, these estimators display very small biases and their RMSEs decline steadily with increases in N and/or T. Considering the empirical sizes of the tests, the ones based on the robust variance estimators (10) and (11) display rejection frequencies that are close to the nominal size under heterogeneous slopes, while they slightly over-reject the null under slope homogeneity, namely when βi = β for all i. Indeed, as noted in Section 3, the robust standard errors given by (10) and (11) are not applicable under slope homogeneity, and consistent estimation of the variance of the pooled and mean group estimators in general requires knowledge of the spatial arrangement of the cross-section units. In contrast, the tests based on SHAC variances severely over-reject the null hypothesis in all experiments with heterogeneous slopes, while they do have the correct size when the data are generated under βi = β and when N and T are sufficiently large. Indeed, when N and T are smaller than 50, the SHAC based tests are slightly over-sized. That the use of Newey–West robust standard errors leads to an over-rejection of the null hypothesis in small samples is well-known within the time series literature (see, for example, the Monte Carlo study reported in Smith and McAleer (1994)). Our results seem to indicate that adopting the Newey–West procedure jointly with the Kelejian and Prucha (2007) variance estimator in a panel data framework may also lead to over-rejection of the null hypothesis in small samples. We note that Kelejian and Prucha, in their Monte Carlo experiments, only report results when N is relatively large (they focus on a single cross-section, T = 1, with N = 400 and 1024). Also, they do not report the sizes of tests based on their proposed variance estimator (see also the Monte Carlo study reported in Fingleton and Le Gallo (2008)). To further investigate this issue, we
Table A2. Small sample properties of estimators for panels with spatially correlated errors (δ = 0.4) and no unobserved common factors, under slope homogeneity. [Bias (×100), RMSE (×100), size (×100) and power (×100) of the Mean Group, Pooled, CCE Mean Group, CCE Pooled and ML-SAR estimators, and of the associated tests (robust, SHAC and CCE variance based, including the mis-specified-S and mis-specified-D variants), for all combinations of N, T = 20, 30, 50, 100; individual entries are not reproduced here.]
Notes: see notes to Table A1. a This estimator is computed under the incorrect assumption that cross-section units are neighbours if their Euclidean distance is less than or equal to 2 while in the true S matrix cross-section units are neighbours if their Euclidean distance is equal to 1. b This estimator is computed under the incorrect assumption that cross-section units are ordered on a line, rather than on a rectangular grid.
To further investigate this issue, we have run some additional experiments using the Kelejian and Prucha (2007) Monte Carlo design, with T = 1 and N = 100, 200, 400, 1024. The results show that tests based on the non-parametric standard errors proposed by Kelejian and Prucha have empirical sizes close to the 5% nominal size if N ≥ 400, but tend to over-reject for smaller values of N.4 However, the results in Table A2 show that errors in the measurement of the distance between cross-section units do not seem to have much effect on the properties of SHAC estimators, which is in line with the theoretical results obtained by Kelejian and Prucha. A number of other interesting findings also emerge from the results reported in Table A2.
4 To save space, we have not reported these results. However, they are available upon request.
We can see that, under ideal conditions where the spatial process generating the error term and the spatial arrangement of units are both known, the ML estimator has the correct size for large T, and high power. Also, tests based on the CCE Mean Group and CCE Pooled estimators have empirical sizes that are very close to the nominal size, under both heterogeneous and homogeneous slopes. These findings suggest that augmenting the panel regressions with cross-section averages, even in the absence of common factors, can help deal with spatial error spillover effects. The attraction of the CCE type estimators in these contexts lies in the fact that they do not require a quantification of the exact relative positions of the units in space, which is required by the SHAC type estimators. But, not surprisingly, a comparison of the power of the CCE type tests with that of the tests based on the ML-SAR in Table A2 shows that not using information on the spatial ordering of cross-section units can result in some loss of power. In Experiments B (Tables B1 and B2), the combination of common factors and spatial correlation in the error term leads to large distortions in the pooled and mean group estimators.
Table B1. Small sample properties of estimators for panels with spatially correlated errors (δ = 0.4) and unobserved common factors, under slope heterogeneity. [Bias (×100), RMSE (×100), size (×100) and power (×100) of the Mean Group, Pooled, CCE Mean Group, CCE Pooled and ML-SAR estimators, and of the associated tests (robust, SHAC and CCE variance based, including the mis-specified-S and mis-specified-D variants), for all combinations of N, T = 20, 30, 50, 100; individual entries are not reproduced here.]
Notes: see notes to Table A1. a This estimator is computed under the incorrect assumption that cross-section units are neighbours if their Euclidean distance is less than or equal to 2 while in the true S matrix cross-section units are neighbours if their Euclidean distance is equal to 1. b This estimator is computed under the incorrect assumption that cross-section units are ordered on a line, rather than on a rectangular grid.
The bias and RMSE of the ML-SAR estimator are smaller than those of the pooled estimator, although they remain substantial even for large values of N and T. Further, tests based on the ML-SAR estimator substantially over-reject the null hypothesis. However, the combination of common factors and spatial correlation in the errors does not affect the empirical size of the CCE estimators, which remains close to the nominal size of 5%. Turning to the results in Tables C1–C5, we observe that serial correlation in the errors does not seem to affect the properties of the mean group and pooled estimators with robust variances (10) and (11), nor those of the CCE estimators, under either heterogeneous or homogeneous slopes (see Tables C1 and C2). Another point to note is that the over-rejection tendency of the SHAC estimator is much more pronounced in the presence of residual serial correlation (Table C2). Turning to the experiments with a high value of the spatial coefficient (δ = 0.8), we see from Table C3 that tests based on mean group and pooled estimators that use robust standard errors tend to over-reject, in some cases significantly. But it is interesting that the CCE estimators continue to have the correct size even with such high degrees of spatial dependence, although there is some evidence of a loss in power. One explanation for this result is that, when the degree of spatial correlation is high, an unobserved factor structure may better approximate the process generating cross-section dependence, so that the CCE type estimators, which allow for a factor error structure, might be more appropriate. Finally, the results reported in Table C5 suggest that CCE estimators are also robust to possible time variations in the degree of cross-section dependence. This important property of CCE type estimators is not necessarily shared by estimation methods that use principal components (see Bai, 2009), since time variation in the degree of cross-section dependence can yield inconsistent estimates of the principal components. We refer to Chudik et al. (forthcoming) for a comparison of the CCE method with the principal components approach in the estimation of panel regression models subject to common factors.
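Since the CCE mean group estimator is central to these comparisons, a minimal computational sketch may help fix ideas. The snippet below is our illustration, not the authors' code: the data-generating process, the variable names, and the plain unit-by-unit OLS are all assumptions. It augments each unit's regression with the cross-section averages of the dependent variable and the regressor, averages the unit-specific slopes, and computes a non-parametric variance of the familiar mean group form (the type referenced as (25) in the notes to Table A1).

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 50, 100

# Illustrative DGP (hypothetical): heterogeneous slopes plus one
# unobserved common factor that also drives the regressor.
beta_i = 1.0 + 0.2 * rng.standard_normal(N)   # beta_i = beta + v_i
f = rng.standard_normal(T)                    # unobserved factor
gamma = rng.standard_normal(N)                # factor loadings
x = rng.standard_normal((N, T)) + 0.5 * f
e = rng.standard_normal((N, T))
y = beta_i[:, None] * x + gamma[:, None] * f + e

# Cross-section averages, used to proxy the unobserved factor.
y_bar, x_bar = y.mean(axis=0), x.mean(axis=0)

b_hat = np.empty(N)
for i in range(N):
    # Unit-by-unit OLS of y_i on (1, x_i, y_bar, x_bar); the slope on
    # x_i is the CCE estimate of beta_i.
    Z = np.column_stack([np.ones(T), x[i], y_bar, x_bar])
    coef, *_ = np.linalg.lstsq(Z, y[i], rcond=None)
    b_hat[i] = coef[1]

b_ccemg = b_hat.mean()                        # CCE mean group estimate
var_ccemg = ((b_hat - b_ccemg) ** 2).sum() / (N * (N - 1))
print(b_ccemg, var_ccemg ** 0.5)
```

Stacking the units and running a single regression with the same augmented regressors would give the CCE pooled counterpart.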
Table B2. Small sample properties of estimators for panels with spatially correlated errors (δ = 0.4) and unobserved common factors, under slope homogeneity. [Bias (×100), RMSE (×100), size (×100) and power (×100) of the Mean Group, Pooled, CCE Mean Group, CCE Pooled and ML-SAR estimators, and of the associated tests (robust, SHAC and CCE variance based, including the mis-specified-S and mis-specified-D variants), for all combinations of N, T = 20, 30, 50, 100; individual entries are not reproduced here.]
Notes: see notes to Table A1. a This estimator is computed under the incorrect assumption that cross-section units are neighbours if their Euclidean distance is less than or equal to 2 while in the true S matrix cross-section units are neighbours if their Euclidean distance is equal to 1. b This estimator is computed under the incorrect assumption that cross-section units are ordered on a line, rather than on a rectangular grid.
7. Concluding remarks

The main aim of this paper has been to consider estimation of a panel regression model under a number of different specifications of cross-section error correlations, such as spatial and/or common factor models. We have derived the asymptotic distributions of the mean group and pooled estimators for a panel regression model where the source of error cross-section dependence is purely spatial, results from omitted unobserved factors, or both. In each case we have distinguished between panels in which the slopes are homogeneous across the cross-section units and panels in which they are not. Our main conclusion (based on theoretical and Monte Carlo results) is that the augmentation of panel regressions with cross-section averages, together with the non-parametric variance estimators associated with the CCE estimators, seems to be most effective in dealing with error cross-section dependence, irrespective of whether it arises from spatial spillovers or from the presence of unobserved common factors.
The CCE type estimators also seem to be robust to possible serial correlation in the errors and to time variations in the degree and the nature of cross-section error dependence. Our Monte Carlo results also document the tendency of tests based on SHAC type standard errors to over-reject the null hypothesis in small samples, even when the error cross-section dependence is purely spatial in nature.

Appendix. Proof of Theorems 1, 2, 5 and 6

The following two lemmas establish a few key results used in the proofs of Theorems 1–6.

Lemma A.1. Consider the process (2), where $\varepsilon_{\cdot t} = (\varepsilon_{1t}, \varepsilon_{2t}, \ldots, \varepsilon_{Nt})'$ satisfies Assumption 1 and $R_t$ satisfies Assumption 2. Then, for all $t$,
$$E(\bar e_{\cdot t}^2) = O(N^{-1}), \qquad \mathrm{Var}(\bar e_{\cdot t}^2) = O(N^{-2}), \tag{A.1}$$
where $\bar e_{\cdot t} = N^{-1}\sum_{i=1}^{N} e_{it}$, and
$$\lambda_1\!\left[E(e_{i\cdot}\, e_{j\cdot}')\right] = O(1), \quad \text{for all } i \text{ and } j, \tag{A.2}$$
where $e_{i\cdot} = (e_{i1}, e_{i2}, \ldots, e_{iT})'$.
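Lemma A.1 is straightforward to check by simulation. The sketch below is our own illustration under assumed settings (a first-order spatial autoregressive error on a line with δ = 0.4 and Gaussian innovations, rather than the paper's grid design): it verifies that N·E(ē²·t) stays roughly constant as N grows, consistent with the O(N⁻¹) rate in (A.1).

```python
import numpy as np

rng = np.random.default_rng(1)
delta, reps = 0.4, 2000

def spatial_ar_R(N, delta):
    # e = (I - delta*S)^(-1) eps, with S a row-normalized contiguity
    # matrix for units ordered on a line (our assumption).
    S = np.zeros((N, N))
    for i in range(N):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < N]
        S[i, nbrs] = 1.0 / len(nbrs)
    return np.linalg.inv(np.eye(N) - delta * S)

for N in (25, 50, 100, 200):
    R = spatial_ar_R(N, delta)
    eps = rng.standard_normal((reps, N))
    e_bar = (eps @ R.T).mean(axis=1)   # cross-section average per draw
    print(N, N * (e_bar ** 2).mean())  # roughly constant across N
```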
Table C1. Small sample properties of estimators for panels with spatially correlated errors (δ = 0.4), no unobserved common factors, under slope heterogeneity and serial correlation. [Bias (×100), RMSE (×100), size (×100) and power (×100) of the Mean Group, Pooled, CCE Mean Group, CCE Pooled and ML-SAR estimators, and of the associated tests (robust, SHAC and CCE variance based, including the mis-specified-S and mis-specified-D variants), for all combinations of N, T = 20, 30, 50, 100; individual entries are not reproduced here.]
Notes: see notes to Table A1. a This estimator is computed under the incorrect assumption that cross-section units are neighbours if their Euclidean distance is less than or equal to 2 while in the true S matrix cross-section units are neighbours if their Euclidean distance is equal to 1. b This estimator is computed under the incorrect assumption that cross-section units are ordered on a line, rather than on a rectangular grid.
Proof. First note that, under Assumption 1, $\varepsilon_{it}$ has mean zero, finite variance $0 < \sigma_i^2 \le \sigma_{\max}^2 < \infty$, and finite fourth-order moments, $E(\varepsilon_{it}^4) = \mu'_{i4} < K < \infty$. To prove the first result in (A.1), note that
$$E(\bar e_{\cdot t}^2) = \frac{1}{N^2}\,\tau' R_t \Sigma_\varepsilon R_t' \tau \le \frac{\sigma_{\max}^2}{N^2}\,\lambda_1(R_t R_t')\,\tau'\tau,$$
where $\tau$ is the $N$-dimensional vector of ones. But since $R_t$ has bounded row and column norms, $\lambda_1(R_t R_t')$ is bounded, and we have $E(\bar e_{\cdot t}^2) = O(N^{-1})$.

Let $M_t = N^{-2}\,\Sigma_\varepsilon^{1/2} R_t'\tau\tau' R_t\,\Sigma_\varepsilon^{1/2}$, with elements $m_{ij,t}$, for $i, j = 1, \ldots, N$. Denoting the $i$th column of $R_t$ by $r_{\cdot i,t}$, the diagonal elements of $M_t$ satisfy
$$m_{ii,t} = \frac{\sigma_i^2}{N^2}\, r_{\cdot i,t}'\,\tau\tau'\, r_{\cdot i,t} = \frac{\sigma_i^2}{N^2}\left(\sum_{j=1}^{N} r_{ji,t}\right)^{\!2}.$$
But by assumption $\sum_{j=1}^{N} r_{ji,t} = O(1)$, so $m_{ii,t} = O(N^{-2})$ for all $i$ and $t$. Using results on the moments of quadratic forms established in the literature (see Ullah, 2004), we have
$$\mathrm{Var}(\bar e_{\cdot t}^2) = E\!\left[\left(\frac{1}{N^2}\,\varepsilon_{\cdot t}' R_t'\tau\tau' R_t\,\varepsilon_{\cdot t}\right)^{\!2}\right] - \left[E\!\left(\frac{1}{N^2}\,\varepsilon_{\cdot t}' R_t'\tau\tau' R_t\,\varepsilon_{\cdot t}\right)\right]^{2} = 2\,\mathrm{Tr}\!\left[\left(\frac{1}{N^2}\, R_t'\tau\tau' R_t\,\Sigma_\varepsilon\right)^{\!2}\right] + \sum_{i=1}^{N}\left(\mu'_{i4} - 3\sigma_i^4\right) m_{ii,t}^2.$$
But $|\mu'_{i4} - 3\sigma_i^4| < \mu'_{i4} + 3\sigma_i^4 < K$, and therefore
$$\mathrm{Var}(\bar e_{\cdot t}^2) \le \frac{2}{N^4}\,\mathrm{Tr}\!\left(R_t'\tau\tau' R_t\,\Sigma_\varepsilon^2\, R_t'\tau\tau' R_t\right) + K\sum_{i=1}^{N} m_{ii,t}^2 \le \frac{2\sigma_{\max}^4}{N^4}\left(\tau' R_t R_t'\tau\right)^{2} + O(N^{-3}) \le \frac{2\sigma_{\max}^4}{N^4}\left[\lambda_1(R_t R_t')\right]^{2}\left(\tau'\tau\right)^{2} + O(N^{-3}) = O(N^{-2}),$$
which establishes the second result in (A.1).
Table C2. Small sample properties of estimators for panels with spatially correlated errors (δ = 0.4), no unobserved common factors, under slope homogeneity and serial correlation. [Bias (×100), RMSE (×100), size (×100) and power (×100) of the Mean Group, Pooled, CCE Mean Group, CCE Pooled and ML-SAR estimators, and of the associated tests (robust, SHAC and CCE variance based, including the mis-specified-S and mis-specified-D variants), for all combinations of N, T = 20, 30, 50, 100; individual entries are not reproduced here.]
Notes: see notes to Table A1. a This estimator is computed under the incorrect assumption that cross-section units are neighbours if their Euclidean distance is less than or equal to 2 while in the true S matrix cross-section units are neighbours if their Euclidean distance is equal to 1. b This estimator is computed under the incorrect assumption that cross-section units are ordered on a line, rather than on a rectangular grid.
To establish (A.2), consider the $(t, s)$th element of the $T \times T$ matrix $E(e_{i\cdot}\, e_{j\cdot}')$ and note that, since $E(\varepsilon_{ht}\varepsilon_{qs}) = 0$ for $h \ne q$,
$$E(e_{it}\, e_{js}) = E\!\left(r_{i\cdot,t}'\,\varepsilon_{\cdot t}\varepsilon_{\cdot s}'\, r_{j\cdot,s}\right) = \sum_{h=1}^{N}\sum_{q=1}^{N} r_{ih,t}\, r_{jq,s}\, E(\varepsilon_{ht}\varepsilon_{qs}) = \sum_{q=1}^{N} r_{iq,t}\, r_{jq,s}\, E(\varepsilon_{qt}\varepsilon_{qs}),$$
so that the largest eigenvalue of $E(e_{i\cdot}\, e_{j\cdot}')$ satisfies (using the result $\lambda_1(A) \le \|A\|_\infty$)
$$\lambda_1\!\left[E(e_{i\cdot}\, e_{j\cdot}')\right] \le \max_{1 \le s \le T}\sum_{t=1}^{T}\left|\sum_{q=1}^{N} r_{iq,t}\, r_{jq,s}\, E(\varepsilon_{qt}\varepsilon_{qs})\right| \le K\,\max_{1 \le s \le T}\sum_{q=1}^{N} |r_{jq,s}|\sum_{t=1}^{T}\left|E(\varepsilon_{qt}\varepsilon_{qs})\right| = O(1),$$
given that, by Assumption 1, $\sum_{t=1}^{T}|E(\varepsilon_{qt}\varepsilon_{qs})| = O(1)$, and, by Assumption 2, $\sum_{q=1}^{N}|r_{iq,t}| = O(1)$ and $\sum_{q=1}^{N}|r_{jq,s}| = O(1)$, for all $t$ and $s$.
Table C3. Small sample properties of estimators for panels with spatially correlated errors (δ = 0.8), no unobserved common factors, under slope heterogeneity. [Bias (×100), RMSE (×100), size (×100) and power (×100) of the Mean Group, Pooled, CCE Mean Group, CCE Pooled and ML-SAR estimators, and of the associated tests (robust, SHAC and CCE variance based, including the mis-specified-S and mis-specified-D variants), for all combinations of N, T = 20, 30, 50, 100; individual entries are not reproduced here.]
Notes: see notes to Table A1. a This estimator is computed under the incorrect assumption that cross-section units are neighbours if their Euclidean distance is less than or equal to 2 while in the true S matrix cross-section units are neighbours if their Euclidean distance is equal to 1. b This estimator is computed under the incorrect assumption that cross-section units are ordered on a line, rather than on a rectangular grid.
Therefore, for any process of the form (2) with $R_t$ having bounded row and column norms, $\bar e_{\cdot t}^2$ converges to zero in quadratic mean as $N \to \infty$, and the degree of cross-section dependence of $e_{i\cdot}$ will be bounded in $N$.

Lemma A.2. Consider the general process $e_{\cdot t} = R_t\,\varepsilon_{\cdot t}$. Then, under Assumptions 1–8, we have
$$\frac{\bar e'\bar e}{T} = O_p\!\left(\frac{1}{N}\right), \tag{A.3}$$
$$\frac{F'\bar e}{T} = O_p\!\left(\frac{1}{\sqrt{NT}}\right), \qquad \frac{D'\bar e}{T} = O_p\!\left(\frac{1}{\sqrt{NT}}\right), \tag{A.4}$$
$$\frac{V_{i\cdot}'\bar e}{T} = O_p\!\left(\frac{1}{\sqrt{NT}}\right), \qquad \frac{e_{i\cdot}'\bar e}{T} = O_p\!\left(\frac{1}{N}\right) + O_p\!\left(\frac{1}{\sqrt{NT}}\right), \tag{A.5}$$
where $\bar e = (\bar e_{\cdot 1}, \ldots, \bar e_{\cdot T})'$, with $\bar e_{\cdot t} = N^{-1}\sum_{i=1}^{N} e_{it}$, $D$ and $F$ are the $T \times n$ and $T \times m$ matrices of observed and unobserved common factors, and $V_{i\cdot} = (v_{i1}, \ldots, v_{iT})'$.

Proof. Note that $T^{-1}\bar e'\bar e = T^{-1}\sum_{t=1}^{T}\bar e_{\cdot t}^2$. From the Markov inequality we have, for every $\epsilon > 0$,
$$P\!\left(\left|\frac{\bar e'\bar e}{T}\right| \ge \epsilon\right) \le \frac{1}{\epsilon}\, E\!\left|\frac{\bar e'\bar e}{T}\right| = \frac{1}{\epsilon}\, T^{-1}\sum_{t=1}^{T} E(\bar e_{\cdot t}^2) = O\!\left(\frac{1}{N}\right), \tag{A.6}$$
which proves (A.3). As for (A.4), consider the $\ell$th row of $T^{-1}F'\bar e$ and note that it can be written as $T^{-1}\sum_{t=1}^{T} f_{\ell t}\,\bar e_{\cdot t}$, where $f_{\ell t}$ and $\bar e_{\cdot t}$ are distributed independently of each other. Then, given (A.1), $T^{-1}\sum_{t=1}^{T} f_{\ell t}\,\bar e_{\cdot t}$ has zero mean and variance
$$\mathrm{Var}\!\left(T^{-1}\sum_{t=1}^{T} f_{\ell t}\,\bar e_{\cdot t}\right) = T^{-2}\sum_{t=1}^{T}\sum_{t'=1}^{T} E(f_{\ell t}\, f_{\ell t'})\, E(\bar e_{\cdot t}\,\bar e_{\cdot t'}) \le O\!\left(\frac{1}{N}\right) T^{-2}\sum_{t=1}^{T}\sum_{t'=1}^{T}\left|E(f_{\ell t}\, f_{\ell t'})\right| = O\!\left(\frac{1}{NT}\right),$$
since $|E(\bar e_{\cdot t}\,\bar e_{\cdot t'})| = O(N^{-1})$ by (A.1) and the Cauchy–Schwarz inequality, and the autocovariances of $f_{\ell t}$ are absolutely summable.
Table C4. Small sample properties of estimators for panels with spatially correlated errors (δt ∼ IIDU(0, 0.8)), no unobserved common factors, under slope heterogeneity. [Bias (×100), RMSE (×100), size (×100) and power (×100) of the Mean Group, Pooled, CCE Mean Group, CCE Pooled and ML-SAR estimators, and of the associated tests (robust, SHAC and CCE variance based, including the mis-specified-S and mis-specified-D variants), for all combinations of N, T = 20, 30, 50, 100; individual entries are not reproduced here.]
Notes: see notes to Table A1. a This estimator is computed under the incorrect assumption that cross-section units are neighbours if their Euclidean distance is less than or equal to 2 while in the true S matrix cross-section units are neighbours if their Euclidean distance is equal to 1. b This estimator is computed under the incorrect assumption that cross-section units are ordered on a line, rather than on a rectangular grid.
This establishes the first result in (A.4). The second result in (A.4) and the first result in (A.5) follow similarly. As for the second result in (A.5), note that
$$T^{-1} e_{i\cdot}'\bar e = T^{-1}\sum_{t=1}^{T} e_{it}\,\bar e_{\cdot t} = \frac{1}{NT}\sum_{t=1}^{T}\sum_{j=1}^{N} e_{it}\, e_{jt} = \frac{1}{NT}\sum_{t=1}^{T}\sum_{h=1}^{N}\sum_{j=1}^{N}\sum_{q=1}^{N} r_{ih,t}\, r_{jq,t}\,\varepsilon_{ht}\,\varepsilon_{qt} = \frac{1}{NT}\sum_{t=1}^{T}\sum_{h=1}^{N}\sum_{q=1}^{N} r_{ih,t}\,\bar r_{\cdot q,t}\,\varepsilon_{ht}\,\varepsilon_{qt},$$
where $\bar r_{\cdot q,t} = \sum_{j=1}^{N} r_{jq,t} = O(1)$. Its mean is
$$E\!\left(T^{-1} e_{i\cdot}'\bar e\right) = \frac{1}{NT}\sum_{t=1}^{T}\sum_{q=1}^{N} r_{iq,t}\,\bar r_{\cdot q,t}\,\sigma_q^2 = O\!\left(\frac{1}{N}\right),$$
since $|\bar r_{\cdot q,t}\,\sigma_q^2| \le K$ and $(NT)^{-1}\sum_{t=1}^{T}\sum_{q=1}^{N}|r_{iq,t}| = O(N^{-1})$. Also, under Assumptions 1 and 2, we have
$$E\!\left[\left(T^{-1} e_{i\cdot}'\bar e\right)^{2}\right] = \frac{1}{N^2 T^2}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{h=1}^{N}\sum_{\ell=1}^{N}\sum_{q=1}^{N}\sum_{p=1}^{N} r_{ih,t}\, r_{i\ell,s}\,\bar r_{\cdot q,t}\,\bar r_{\cdot p,s}\, E\!\left(\varepsilon_{ht}\,\varepsilon_{\ell s}\,\varepsilon_{qt}\,\varepsilon_{ps}\right).$$
Table C5. Small sample properties of estimators for panels with cross-section dependence switching from spatial processes (δ = 0.4) to unobserved common factors, under slope heterogeneity. [Bias (×100), RMSE (×100), size (×100) and power (×100) of the Mean Group, Pooled, CCE Mean Group, CCE Pooled and ML-SAR estimators, and of the associated tests (robust, SHAC and CCE variance based, including the mis-specified-S and mis-specified-D variants), for all combinations of N, T = 20, 30, 50, 100; individual entries are not reproduced here.]
Notes: see notes to Table A1. a This estimator is computed under the incorrect assumption that cross-section units are neighbours if their Euclidean distance is less than or equal to 2 while in the true S matrix cross-section units are neighbours if their Euclidean distance is equal to 1. b This estimator is computed under the incorrect assumption that cross-section units are ordered on a line, rather than on a rectangular grid.
Since the $\varepsilon_{it}$ are independently distributed across $i$, the only non-vanishing terms in this sum are those in which the unit indices are pairwise matched, so that
$$E\!\left[\left(T^{-1} e_{i\cdot}'\bar e\right)^{2}\right] = \frac{1}{N^2 T^2}\sum_{t,s}\sum_{h,\ell} r_{ih,t}\, r_{i\ell,s}\,\bar r_{\cdot h,t}\,\bar r_{\cdot \ell,s}\, E\!\left(\varepsilon_{ht}^2\,\varepsilon_{\ell s}^2\right) + \frac{1}{N^2 T^2}\sum_{t,s}\sum_{h \ne \ell}\left(r_{ih,t}\, r_{ih,s}\,\bar r_{\cdot \ell,t}\,\bar r_{\cdot \ell,s} + r_{ih,t}\, r_{i\ell,s}\,\bar r_{\cdot \ell,t}\,\bar r_{\cdot h,s}\right) E\!\left(\varepsilon_{ht}\,\varepsilon_{hs}\,\varepsilon_{\ell s}\,\varepsilon_{\ell t}\right).$$
The first and the third of these terms are $O(N^{-2}) + O\!\left((NT)^{-1}\right)$, while, denoting $\bar r_{\cdot \ell}^{*} = \max_{1 \le t \le T}|\bar r_{\cdot \ell,t}| = O(1)$, the second term satisfies
$$\frac{1}{N^2 T^2}\left|\sum_{t,s}\sum_{h \ne \ell} r_{ih,t}\, r_{ih,s}\,\bar r_{\cdot \ell,t}\,\bar r_{\cdot \ell,s}\, E\!\left(\varepsilon_{ht}\,\varepsilon_{hs}\,\varepsilon_{\ell s}\,\varepsilon_{\ell t}\right)\right| \le \frac{K}{N^2 T^2}\sum_{\ell=1}^{N}\bar r_{\cdot \ell}^{*}\sum_{h=1}^{N}\sum_{s=1}^{T}|r_{ih,s}|\,|\bar r_{\cdot \ell,s}| \le \frac{K}{N^2 T}\sum_{\ell=1}^{N}\bar r_{\cdot \ell}^{*} = O\!\left(\frac{1}{NT}\right),$$
where the sum over $t$ has been bounded using the absolute summability, under Assumption 1, of $E(\varepsilon_{ht}\,\varepsilon_{hs}\,\varepsilon_{\ell s}\,\varepsilon_{\ell t})$ over $t$, and the final step follows since $N^{-2}\sum_{\ell=1}^{N}\bar r_{\cdot \ell}^{*} = O(N^{-1})$.
It follows that
$$T^{-1} e_{i\cdot}'\bar e = O_p\!\left(\frac{1}{N}\right) + O_p\!\left(\frac{1}{\sqrt{NT}}\right).$$

The above results can be used to prove further results that are helpful in deriving the asymptotic distribution of the CCE estimators. Rewrite Eqs. (1) and (19) more compactly as
$$z_{it} = \begin{pmatrix} y_{it} \\ x_{it} \end{pmatrix} = B_i'\, d_t + C_i'\, f_t + \xi_{it}, \tag{A.7}$$
where
$$B_i = \begin{pmatrix} \alpha_i & A_i \end{pmatrix} D_i, \qquad C_i = \begin{pmatrix} \gamma_i & \Gamma_i \end{pmatrix} D_i, \qquad D_i = \begin{pmatrix} 1 & 0 \\ \beta_i & I_k \end{pmatrix}, \qquad \xi_{it} = \begin{pmatrix} e_{it} + \beta_i'\, v_{it} \\ v_{it} \end{pmatrix}.$$
From Lemma A.2 it follows that (see also Lemmas 2 and 3 in Pesaran (2006))
$$\frac{\bar\xi'\bar\xi}{T} = O_p\!\left(\frac{1}{N}\right), \tag{A.8}$$
$$\frac{F'\bar\xi}{T} = O_p\!\left(\frac{1}{\sqrt{NT}}\right), \qquad \frac{D'\bar\xi}{T} = O_p\!\left(\frac{1}{\sqrt{NT}}\right), \tag{A.9}$$
$$\frac{V_{i\cdot}'\bar\xi}{T} = O_p\!\left(\frac{1}{\sqrt{NT}}\right), \qquad \frac{e_{i\cdot}'\bar\xi}{T} = O_p\!\left(\frac{1}{N}\right) + O_p\!\left(\frac{1}{\sqrt{NT}}\right), \tag{A.10}$$
$$\frac{X_{i\cdot}'\bar\xi}{T} = O_p\!\left(\frac{1}{N}\right) + O_p\!\left(\frac{1}{\sqrt{NT}}\right). \tag{A.11}$$
Under Assumption 9, the above results in turn yield
$$\frac{X_{i\cdot}'\bar M F}{T} = O_p\!\left(\frac{1}{N}\right) + O_p\!\left(\frac{1}{\sqrt{NT}}\right), \tag{A.12}$$
$$\frac{X_{i\cdot}'\bar M X_{i\cdot}}{T} = \frac{X_{i\cdot}' M_g X_{i\cdot}}{T} + O_p\!\left(\frac{1}{N}\right) + O_p\!\left(\frac{1}{\sqrt{NT}}\right), \tag{A.13}$$
$$\frac{X_{i\cdot}'\bar M e_{i\cdot}}{T} = \frac{X_{i\cdot}' M_g e_{i\cdot}}{T} + O_p\!\left(\frac{1}{N}\right) + O_p\!\left(\frac{1}{\sqrt{NT}}\right), \tag{A.14}$$
where $M_g = I_T - G(G'G)^{-1}G'$. Note that (A.12)–(A.14) are identical to relations (40), (43) and (44) in Pesaran (2006), and will be used to derive the asymptotic distributions of the CCE Pooled and CCE Mean Group estimators. In what follows we sketch the proofs of Theorems 1, 2, 5 and 6.

Proof of Theorem 1. Consider (6), and rewrite it as
$$\sqrt{N}\left(\hat\beta_{MG} - \beta\right) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\upsilon_i + \frac{1}{\sqrt{T}}\, h_{NT},$$
where
$$h_{NT} = \frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\left(T^{-1} X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} X_{i\cdot}' M_D\, e_{i\cdot} = \frac{1}{\sqrt{NT}}\sum_{i=1}^{N} W_{i\cdot}'\, e_{i\cdot} = \frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\sum_{t=1}^{T} w_{it}\, e_{it} = \frac{1}{\sqrt{NT}}\sum_{t=1}^{T} W_{\cdot t}'\, e_{\cdot t} = \frac{1}{\sqrt{NT}}\, H'\varepsilon,$$
with $W_{i\cdot}' = \left(T^{-1} X_{i\cdot}' M_D X_{i\cdot}\right)^{-1} X_{i\cdot}' M_D$, $w_{it}$ the $t$th column of $W_{i\cdot}'$, $W_{\cdot t}' = (w_{1t}, w_{2t}, \ldots, w_{Nt})$, and $H' = (W_{\cdot 1}' R_1, W_{\cdot 2}' R_2, \ldots, W_{\cdot T}' R_T)$. Under Assumption 1, the $NT \times 1$ vector $\varepsilon$ is a zero mean covariance stationary process, with covariance matrix $E(\varepsilon\varepsilon') = \Sigma_{\varepsilon\varepsilon}$. Since the elements of $H$ are uniformly bounded, standard results on double array central limit theorems for stationary processes (see, for example, Chung, 2001, Chapter 7) imply that $h_{NT}$ is asymptotically normally distributed if its variance exists and is positive semi-definite. Note that $\Sigma_{\varepsilon\varepsilon}$ is made of $T^2$ blocks of $N \times N$ diagonal matrices with elements
$$\gamma_i(s) = \gamma_i(-s) = \sum_{j=0}^{\infty} a_{ij}\, a_{i,j+|s|}, \qquad i = 1, 2, \ldots, N, \quad s = 0, 1, 2, \ldots,$$
that are absolutely summable, namely $\sum_{s=0}^{\infty}|\gamma_i(s)| < K$ for all $i$. It follows that $\Sigma_{\varepsilon\varepsilon}$ has bounded row and column norms, and the variance of $h_{NT}$ satisfies (see footnote 5 below)
$$\mathrm{Var}(h_{NT}) = \frac{1}{NT}\, H'\Sigma_{\varepsilon\varepsilon} H \le \frac{1}{NT}\,\lambda_1(\Sigma_{\varepsilon\varepsilon})\, H'H \le K_1\,\frac{1}{NT}\, H'H = K_1\,\frac{1}{NT}\sum_{t=1}^{T} W_{\cdot t}' R_t R_t' W_{\cdot t} \le K_1\,\frac{1}{NT}\sum_{t=1}^{T} W_{\cdot t}' W_{\cdot t}\,\lambda_1(R_t R_t') \le K_1 K_2\,\frac{1}{NT}\sum_{i=1}^{N} W_{i\cdot}' W_{i\cdot}, \tag{A.15}$$
which, under Assumption 5(a), tends to a non-singular matrix with finite elements. Hence, we have
$$\sqrt{N}\left(\hat\beta_{MG} - \beta\right) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\upsilon_i + O_p\!\left(\frac{1}{\sqrt{T}}\right),$$
which proves the theorem.

Proof of Theorem 2. Consider (7), and let
$$q_{NT} = \frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\tilde X_{i\cdot}'\, e_{i\cdot} = \frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\sum_{t=1}^{T}\tilde x_{it}\, e_{it} = \frac{1}{\sqrt{NT}}\sum_{t=1}^{T}\tilde X_{\cdot t}' R_t\,\varepsilon_{\cdot t} = \frac{1}{\sqrt{NT}}\, P'\varepsilon,$$
where $\tilde X_{i\cdot} = M_D X_{i\cdot}$, $\tilde x_{it}$ is the $t$th column of $\tilde X_{i\cdot}'$, $\tilde X_{\cdot t} = (\tilde x_{1t}, \tilde x_{2t}, \ldots, \tilde x_{Nt})'$, and $P' = (\tilde X_{\cdot 1}' R_1, \tilde X_{\cdot 2}' R_2, \ldots, \tilde X_{\cdot T}' R_T)$.
5 We make use of the following result. Let A be an n × n symmetric matrix, and B be an n × m matrix. Then B′ B λ1 (A) − B′ AB is a positive semi-definite matrix (see Bernstein, 2005, pp. 264 and 271).
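The inequality stated in footnote 5 admits a one-line verification; the LaTeX sketch below is our addition (the cited pages of Bernstein (2005) contain the underlying spectral facts).

```latex
\[
  x'\left[B'B\,\lambda_1(A) - B'AB\right]x
  \;=\; (Bx)'\left[\lambda_1(A)\,I_n - A\right](Bx) \;\ge\; 0
  \qquad \text{for all } x \in \mathbb{R}^m,
\]
since the symmetric matrix $\lambda_1(A)\,I_n - A$ has eigenvalues
$\lambda_1(A) - \lambda_i(A) \ge 0$ and is therefore positive
semi-definite; hence $B'B\,\lambda_1(A) - B'AB$ is positive
semi-definite.
```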
Following similar lines of reasoning as for (A.15), the variance of $q_{NT}$ satisfies
$$\mathrm{Var}(q_{NT}) = \frac{1}{NT}\, P'\Sigma_{\varepsilon\varepsilon} P \le K_1\,\frac{1}{NT}\, P'P = K_1\,\frac{1}{NT}\sum_{t=1}^{T}\tilde X_{\cdot t}' R_t R_t'\tilde X_{\cdot t} \le K_1 K_2\,\frac{1}{NT}\sum_{t=1}^{T}\tilde X_{\cdot t}'\tilde X_{\cdot t} = K_1 K_2\,\frac{1}{NT}\sum_{i=1}^{N} X_{i\cdot}' M_D X_{i\cdot}, \tag{A.16}$$
which, under Assumption 5(b), tends to a finite, positive definite matrix. It follows that
$$\sqrt{N}\left(\hat\beta_{P} - \beta\right) = \left(\frac{1}{N}\sum_{i=1}^{N}\frac{X_{i\cdot}' M_D X_{i\cdot}}{T}\right)^{-1}\left(\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{X_{i\cdot}' M_D X_{i\cdot}}{T}\,\upsilon_i + \frac{1}{\sqrt{T}}\, q_{NT}\right) = \left(\frac{1}{N}\sum_{i=1}^{N}\frac{X_{i\cdot}' M_D X_{i\cdot}}{T}\right)^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{X_{i\cdot}' M_D X_{i\cdot}}{T}\,\upsilon_i + O_p\!\left(\frac{1}{\sqrt{T}}\right),$$
which proves the theorem.

Proof of Theorem 5. Consider
$$\sqrt{N}\left(\hat\beta_{CCEMG} - \beta\right) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\upsilon_i + \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\hat\Psi_{iT}^{-1}\,\frac{X_{i\cdot}'\bar M F}{T}\,\gamma_i + \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\hat\Psi_{iT}^{-1}\,\frac{X_{i\cdot}'\bar M e_{i\cdot}}{T}, \tag{A.17}$$
where $\hat\Psi_{iT} = T^{-1} X_{i\cdot}'\bar M X_{i\cdot}$, whose inverse, by Assumption 10(b), exists for all $i$. First note that, using (A.12), and since, by Assumption 7, the factor loadings are bounded, it follows that (see also Pesaran, 2006, p. 983)
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\hat\Psi_{iT}^{-1}\,\frac{X_{i\cdot}'\bar M F}{T}\,\gamma_i = O_p\!\left(\frac{1}{\sqrt{N}}\right) + O_p\!\left(\frac{1}{\sqrt{T}}\right). \tag{A.18}$$
Further, using (A.13) and (A.14),
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\hat\Psi_{iT}^{-1}\,\frac{X_{i\cdot}'\bar M e_{i\cdot}}{T} = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left(\frac{V_{i\cdot}' M_g V_{i\cdot}}{T}\right)^{-1}\frac{V_{i\cdot}' M_g e_{i\cdot}}{T} + O_p\!\left(\frac{1}{\sqrt{N}}\right) + O_p\!\left(\frac{1}{\sqrt{T}}\right) = \frac{1}{\sqrt{T}}\, h_{NT} + O_p\!\left(\frac{1}{\sqrt{N}}\right) + O_p\!\left(\frac{1}{\sqrt{T}}\right), \tag{A.19}$$
where now $h_{NT} = (NT)^{-1/2}\sum_{i=1}^{N}\sum_{t=1}^{T} w_{it}\, e_{it} = (NT)^{-1/2}\sum_{t=1}^{T} W_{\cdot t}'\, e_{\cdot t} = (NT)^{-1/2}\, H'\varepsilon$, with $H' = (W_{\cdot 1}' R_1, \ldots, W_{\cdot T}' R_T)$, $W_{\cdot t}' = (w_{1t}, \ldots, w_{Nt})$, and $w_{it}$ the $t$th column of $W_{i\cdot}' = \left(T^{-1} V_{i\cdot}' M_g V_{i\cdot}\right)^{-1} V_{i\cdot}' M_g$. Using similar lines of reasoning as in the proof of Theorem 1, $h_{NT}$ has zero mean and its variance satisfies
$$\mathrm{Var}(h_{NT}) = \frac{1}{NT}\, E\!\left(H'\Sigma_{\varepsilon\varepsilon} H\right) \le K_1 K_2\,\frac{1}{NT}\sum_{i=1}^{N} E\!\left(W_{i\cdot}' W_{i\cdot}\right),$$
which, under Assumption 10(a), tends to a finite, positive definite matrix. Therefore, we have
$$\sqrt{N}\left(\hat\beta_{CCEMG} - \beta\right) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\upsilon_i + O_p\!\left(\frac{1}{\sqrt{N}}\right) + O_p\!\left(\frac{1}{\sqrt{T}}\right),$$
which proves the theorem.

Proof of Theorem 6. Consider
$$\sqrt{N}\left(\hat\beta_{CCEP} - \beta\right) = \left(\frac{1}{N}\sum_{i=1}^{N}\frac{X_{i\cdot}'\bar M X_{i\cdot}}{T}\right)^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{X_{i\cdot}'\bar M\left(X_{i\cdot}\upsilon_i + F\gamma_i + e_{i\cdot}\right)}{T}.$$
Using (A.12)–(A.14) we have
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{X_{i\cdot}'\bar M F}{T}\,\gamma_i = O_p\!\left(\frac{1}{\sqrt{N}}\right) + O_p\!\left(\frac{1}{\sqrt{T}}\right), \qquad \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{X_{i\cdot}'\bar M e_{i\cdot}}{T} = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{V_{i\cdot}' M_g e_{i\cdot}}{T} + O_p\!\left(\frac{1}{\sqrt{N}}\right) + O_p\!\left(\frac{1}{\sqrt{T}}\right).$$
Let
$$q_{NT} = \frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\tilde V_{i\cdot}'\, e_{i\cdot} = \frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\sum_{t=1}^{T}\tilde v_{it}\, e_{it} = \frac{1}{\sqrt{NT}}\sum_{t=1}^{T}\tilde V_{\cdot t}' R_t\,\varepsilon_{\cdot t} = \frac{1}{\sqrt{NT}}\, P'\varepsilon,$$
where $\tilde V_{i\cdot} = M_g V_{i\cdot}$, $\tilde V_{\cdot t} = (\tilde v_{1t}, \ldots, \tilde v_{Nt})'$, and $P' = (\tilde V_{\cdot 1}' R_1, \tilde V_{\cdot 2}' R_2, \ldots, \tilde V_{\cdot T}' R_T)$.
Following similar lines of reasoning as those developed in the proof of Theorem 2, $q_{NT}$ has mean zero and its variance satisfies
$$\mathrm{Var}(q_{NT}) = \frac{1}{NT}\, E\!\left(P'\Sigma_{\varepsilon\varepsilon} P\right) \le K_1 K_2\,\frac{1}{NT}\sum_{i=1}^{N} E\!\left(\tilde V_{i\cdot}'\tilde V_{i\cdot}\right) = K_1 K_2\,\frac{1}{N}\sum_{i=1}^{N}\Sigma_{vi},$$
which, by Assumption 10(a), tends to a finite, positive definite matrix. It follows that
$$\sqrt{N}\left(\hat\beta_{CCEP} - \beta\right) = \left(\frac{1}{N}\sum_{i=1}^{N}\frac{X_{i\cdot}' M_g X_{i\cdot}}{T}\right)^{-1}\left[\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{X_{i\cdot}' M_g X_{i\cdot}}{T}\,\upsilon_i + \frac{1}{\sqrt{T}}\, q_{NT}\right] + O_p\!\left(\frac{1}{\sqrt{N}}\right) + O_p\!\left(\frac{1}{\sqrt{T}}\right) = \left(\frac{1}{N}\sum_{i=1}^{N}\frac{X_{i\cdot}' M_g X_{i\cdot}}{T}\right)^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{X_{i\cdot}' M_g X_{i\cdot}}{T}\,\upsilon_i + O_p\!\left(\frac{1}{\sqrt{N}}\right) + O_p\!\left(\frac{1}{\sqrt{T}}\right), \tag{A.20}$$
which proves the theorem.
References

Andrews, D., 2005. Cross section regression with common shocks. Econometrica 73, 1551–1585.
Anselin, L., 1988. Spatial Econometrics: Methods and Models. Kluwer Academic Publishers, Dordrecht, The Netherlands.
Bai, J., 2009. Panel data models with interactive fixed effects. Econometrica 77, 1229–1279.
Bakirov, N.K., Székely, G.J., 2006. Student's t-test for Gaussian scale mixtures. Journal of Mathematical Sciences 139, 6497–6505.
Baltagi, B.H., Egger, P., Pfaffermayr, M., 2009. A generalized spatial panel data model with random effects. Center for Policy Research, Paper 53.
Baltagi, B., Song, S., Koh, W., 2003. Testing panel data regression models with spatial error correlation. Journal of Econometrics 117, 123–150.
Bernstein, D.S., 2005. Matrix Mathematics: Theory, Facts, and Formulas with Application to Linear Systems Theory. Princeton University Press.
Bester, C.A., Conley, T.G., Hansen, C.B., 2009. Inference with dependent data using cluster covariance estimators. Mimeo, University of Chicago.
Chudik, A., Pesaran, M.H., Tosetti, E., 2010. Weak and strong cross section dependence and estimation of large panels. The Econometrics Journal (forthcoming).
Chung, K.L., 2001. A Course in Probability Theory. Academic Press.
Coakley, J., Fuertes, A.M., Smith, R., 2002. A principal components approach to cross-section dependence in panels. Birkbeck College Discussion Paper 01/2002.
Coakley, J., Fuertes, A.M., Smith, R., 2006. Unobserved heterogeneity in panel time series. Computational Statistics and Data Analysis 50, 2361–2380.
Conley, T.G., 1999. GMM estimation with cross sectional dependence. Journal of Econometrics 92, 1–45.
Conley, T.G., Topa, G., 2002. Socio-economic distance and spatial patterns in unemployment. Journal of Applied Econometrics 17, 303–327.
Cooper, R., Haltiwanger, J., 1996. Evidence on macroeconomic complementarities. The Review of Economics and Statistics 78, 78–93.
Cowan, R., Cowan, W., Swann, G.M.P., 2004. Waves in consumption with interdependence among consumers. Canadian Journal of Economics 37, 149–177.
Driscoll, J.C., Kraay, A.C., 1998. Consistent covariance matrix estimation with spatially dependent panel data. The Review of Economics and Statistics 80, 549–560.
Egger, P., Pfaffermayr, M., Winner, H., 2005. An unbalanced spatial panel data approach to US state tax competition. Economics Letters 88, 329–335.
Fingleton, B., 2008. A generalized method of moments estimator for a spatial panel model with an endogenous spatial lag and spatial moving average errors. Spatial Economic Analysis 3, 27–44.
Fingleton, B., Le Gallo, J., 2008. Estimating spatial models with endogenous variables, a spatial lag and spatially dependent disturbances: finite sample properties. Papers in Regional Science 87, 319–339.
Ibragimov, R., Müller, U.K., 2010. t-statistic based correlation and heterogeneity robust inference. Journal of Business and Economic Statistics 28, 453–468.
Kapetanios, G., Pesaran, M.H., 2007. Alternative approaches to estimation and inference in large multifactor panels: small sample results with an application to modelling of asset returns. In: Phillips, G., Tzavalis, E. (Eds.), The Refinement of Econometric Estimation and Test Procedures: Finite Sample and Asymptotic Analysis. Cambridge University Press, Cambridge.
Kapetanios, G., Pesaran, M.H., Yamagata, T., 2011. Panels with nonstationary multifactor error structures. Journal of Econometrics 160, 326–348.
Kapoor, M., Kelejian, H.H., Prucha, I.R., 2007. Panel data models with spatially correlated error components. Journal of Econometrics 140, 97–130.
Kelejian, H.H., Prucha, I.R., 1999. A generalized moments estimator for the autoregressive parameter in a spatial model. International Economic Review 40, 509–533.
Kelejian, H.H., Prucha, I.R., 2007. HAC estimation in a spatial framework. Journal of Econometrics 140, 131–154.
Kelejian, H.H., Prucha, I.R., 2009. Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances. Journal of Econometrics 157, 53–67.
Lee, L.F., 2004. Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72, 1899–1925.
Lee, L.F., 2007. GMM and 2SLS estimation of mixed regressive, spatial autoregressive models. Journal of Econometrics 137, 489–514.
Lee, L.F., Yu, J., 2010. Estimation of spatial autoregressive panel data models with fixed effects. Journal of Econometrics 154, 165–185.
Mardia, K.V., Marshall, R.J., 1984. Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika 71, 135–146.
Newey, W.K., West, K.D., 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–708.
Onatski, A., 2009. Asymptotics of the principal components estimator of large factor models with weak factors. Mimeo, Columbia University.
Paul, D., 2007. Asymptotics of the leading sample eigenvalues for a spiked covariance model. Statistica Sinica 17, 1617–1642.
Pesaran, M.H., 2006. Estimation and inference in large heterogeneous panels with multifactor error structure. Econometrica 74, 967–1012.
Pesaran, M.H., Schuermann, T., Weiner, S., 2004. Modelling regional interdependencies using a global error-correcting macroeconometric model. Journal of Business and Economic Statistics 22, 129–162.
Pesaran, M.H., Smith, R.P., 1995. Estimating long-run relationships from dynamic heterogeneous panels. Journal of Econometrics 68, 79–113.
Pesaran, M.H., Tosetti, E., 2009. Large panels with common factors and spatial correlation. CESifo Working Paper Series No. 2103.
Phillips, P.C.B., Sul, D., 2003. Dynamic panel estimation and homogeneity testing under cross section dependence. The Econometrics Journal 6, 217–259.
Pinkse, J., Slade, M., Brett, C., 2002. Spatial price competition: a semiparametric approach. Econometrica 70, 1111–1153.
Robertson, D., Symons, J., 2000. Factor residuals in SUR regressions: estimating panels allowing for cross sectional correlation. Mimeo, University of Cambridge.
Robertson, D., Symons, J., 2007. Maximum likelihood factor analysis with rank-deficient sample covariance matrices. Journal of Multivariate Analysis 98, 813–828.
Smith, J., McAleer, M., 1994. Newey–West covariance matrix estimates for models with generated regressors. Applied Economics 26, 635–640.
Ullah, A., 2004. Finite Sample Econometrics. Oxford University Press, Oxford.
Whittle, P., 1954. On stationary processes in the plane. Biometrika 41, 434–449.
Yu, J., de Jong, R., Lee, L.F., 2007. Quasi-maximum likelihood estimators for spatial dynamic panel data with fixed effects when both n and T are large: a nonstationary case. Mimeo, Ohio State University.
Yu, J., de Jong, R., Lee, L.F., 2008. Quasi-maximum likelihood estimators for spatial dynamic panel data with fixed effects when both n and T are large. Journal of Econometrics 146, 118–137.
Yu, J., Lee, L.F., 2010. Estimation of unit root spatial dynamic panel data models. Econometric Theory 26, 1332–1362.
Journal of Econometrics 161 (2011) 203–207
Extending the regression-discontinuity approach to multiple assignment variables
John P. Papay a,∗, John B. Willett a, Richard J. Murnane a,b
a Harvard Graduate School of Education, Appian Way, Cambridge, MA 02138, United States
b National Bureau of Economic Research, United States
∗ Corresponding author. Tel.: +1 617 894 3674; fax: +1 617 496 0183. E-mail address: [email protected] (J.P. Papay).
Article history: Received 13 July 2009; Received in revised form 28 August 2010; Accepted 13 December 2010; Available online 29 December 2010.
JEL classification: C1; C14; C21; I21
Abstract: The recent scholarly attention to the regression-discontinuity design has focused exclusively on the application of a single assignment variable. In many settings, however, exogenously imposed cutoffs on several assignment variables define a set of different treatments. In this paper, we show how to generalize the standard regression-discontinuity approach to include multiple assignment variables simultaneously. We demonstrate that fitting this general, flexible regression-discontinuity model enables us to estimate several treatment effects of interest.
Keywords: Regression discontinuity; Semi-parametric estimation; Treatment effects
1. Extending the regression-discontinuity approach to multiple assignment variables

Introduced in the early 1960s, the regression-discontinuity design (RDD) has enjoyed a resurgence in popularity during the past decade and is recognized widely as one of the most robust approaches for making causal inferences from natural experiments. In a standard RDD, participants are assigned to the treatment or control group according to an exogenously imposed cutoff on a single predictor, called a ‘‘forcing’’, ‘‘running’’, or ‘‘assignment’’ variable. Researchers can then draw causal inferences by comparing fitted outcomes for individuals on the margin, those who are ‘‘local to’’ the cutoff determined by the assignment rule. As described by Cook (2008) and van der Klaauw (2008), researchers have applied this approach to estimate causal effects in a variety of disciplines. There is a large – and growing – literature on the RDD, including discussions of how to specify the underlying statistical models, under what conditions causal inferences can be drawn, how to handle situations in which participants are
assigned without perfect compliance (i.e., a ‘‘fuzzy’’ discontinuity), and approaches for assessing the sensitivity of results to alternative model specifications. These topics are discussed in detail in a set of papers published in a 2008 special issue of this journal. Almost all the previous work has focused on the application of a single forcing variable that defines an individual's assignment to treatment. In this paper, we generalize the standard RDD to include multiple forcing variables, modeling simultaneously discontinuities that arise when multiple criteria determine placement into several different treatment conditions. Such situations arise in many settings. We argue that researchers can better understand the complex relationships at play by incorporating the multiple discontinuities simultaneously into a single multi-dimensional regression-discontinuity model.

2. Examples of regression discontinuity with multiple assignment variables

There are many situations in which values on multiple forcing variables assign participants to a set of different treatment conditions. For example, such practices occur regularly in public education because students often take tests with clear cutoffs in several different subject areas. Many school districts reward teachers for improving student test scores in both mathematics
and English Language Arts (ELA), so teachers who are judged to be effective in one subject earn a bonus while those who raise test scores in two subjects earn a greater bonus. In other districts, students must pass externally defined standards on tests in several subjects to avoid summer school, to be promoted in grade, or to graduate from high school; failing any of these tests defines different treatments that students must undergo (e.g., summer school in mathematics or summer school in ELA). In public finance, state and local funding formulas can also have multiple criteria that determine the type or level of funding. For example, Leuven et al. (2007) examine the effects of a policy in the Netherlands in which schools receive extra personnel funding for having at least 70% minority students and extra computer funds for having at least 70% of students from any single minority group. Eligibility for different types of insurance or entitlement programs may also be driven by several criteria, such as family size or family income.

Examples of multiple assignment variables are also common in politics. Pettersson-Lidbom (2008) estimates the effect of party control on fiscal policies and economic outcomes by examining localities in which one party received just over 50% of the vote. We could extend this analysis easily to investigate the effects of party control in more than one branch of government. Modeling discontinuities in both executive and legislative elections would enable researchers to draw causal conclusions about the effect of having the same or different parties in the two branches. In the two-party system present in the United States, there would be four different treatment conditions: Democratic control of both branches, Republican control of both branches, and two cases with each party controlling one branch.

Thus, multiple assignment variables that assign individuals to a range of different treatment conditions abound in public policy settings. To date, however, analysts have reduced these questions to analyses of single discontinuities by ignoring important differences among treatment conditions1 or by assuming homogeneous treatment effects across levels of the other forcing variable (see, for example, analyses of high school exit examination policies by Martorell (2004), Reardon et al. (2010), Ou (2010) and Papay et al. (2010)).

It is important to differentiate the cases described above from the case in which multiple variables assign individuals to a single treatment. For example, some school districts assign students to a particular homogeneous summer school program if they fail to achieve benchmark scores on either the end-of-school-year mathematics or ELA test (see Matsudaira, 2008; Jacob and Lefgren, 2004). Thus, performance on two forcing variables – the two tests – determines assignment to one treatment-control contrast. The method described in this paper does not pertain to such situations. However, it would pertain if the content of a mandatory summer school program (i.e., the treatment) depended on which of the two tests a child failed.

3. Incorporating exogenous discontinuities on multiple forcing variables

In a basic regression-discontinuity set-up with standard notation, we observe a binary treatment indicator (Wi) and a forcing variable (Xi) with a related cutoff (c) that describes two treatment conditions (Wi = 0 and Wi = 1) as follows:2

Wi = 1{Xi ≥ c}. (1)
1 For example, by assuming that funding for computers is interchangeable with funding for personnel.
2 Throughout this paper, we focus on sharp regression discontinuities.
We observe individual outcomes (Yi) and seek to estimate the conditional mean of Y at the cut score for individuals in the treatment (Wi = 1) and control (Wi = 0) groups. In other words, our primary parameters of interest are

µl(c) = lim_{x→c−} E[Yi | Xi = x] and µr(c) = lim_{x→c+} E[Yi | Xi = x]. (2)

The causal effect of treatment (τ), for students at the cutoff, is the difference in these means:

τ = µr(c) − µl(c). (3)
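To make these estimands concrete, the following minimal sketch (ours, not the authors'; the function and variable names are hypothetical, and the simulated data are for illustration only) computes the sharp-RDD effect in (3) by fitting separate local linear regressions, with a rectangular kernel, within a bandwidth h on each side of the cutoff:

```python
import numpy as np

def sharp_rdd_effect(x, y, c, h):
    """Estimate tau = mu_r(c) - mu_l(c) from Eqs. (2)-(3) by fitting
    separate local linear (OLS, rectangular-kernel) regressions
    within a bandwidth h on each side of the cutoff c."""
    def fit_at_cutoff(mask):
        # With x centered at c, the intercept is the fitted value at the cutoff.
        Z = np.column_stack([np.ones(mask.sum()), x[mask] - c])
        beta, *_ = np.linalg.lstsq(Z, y[mask], rcond=None)
        return beta[0]
    mu_l = fit_at_cutoff((x >= c - h) & (x < c))    # left limit (control side)
    mu_r = fit_at_cutoff((x >= c) & (x <= c + h))   # right limit (treated side)
    return mu_r - mu_l

# Quick check on simulated data with a true effect of 2:
rng = np.random.default_rng(0)
x = rng.uniform(-4.0, 4.0, 2000)
y = 1.0 + 0.5 * x + 2.0 * (x >= 0) + rng.normal(0.0, 1.0, x.size)
print(sharp_rdd_effect(x, y, c=0.0, h=1.5))   # should be close to 2
```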
The same logic applies when exogenously assigned cutoffs exist on multiple forcing variables. With J forcing variables (such that Wji = 1{Xji ≥ cj} ∀ j = 1, ..., J), we define 2^J treatment conditions. For any contrast between these treatment conditions, our primary parameters of interest are again the right-hand and left-hand limits on either side of the relevant cutoff. As the dimensionality increases, the challenges of interpretation and implementation become more complicated. As a result, we focus on the case with two forcing variables. Having two forcing variables defines four different ‘‘treatment’’ conditions. Here, let X1 and X2 define the two assignment variables and c1 and c2 define the respective cutoffs. For each individual, we define W1i and W2i as follows:

W1i = 1{X1i ≥ c1} and W2i = 1{X2i ≥ c2}. (4)

Thus, individuals can fall into one of four possible conditions:
Condition A: W1i = 0 and W2i = 0;
Condition B: W1i = 1 and W2i = 0;
Condition C: W1i = 0 and W2i = 1; and
Condition D: W1i = 1 and W2i = 1.
These conditions define four separate regions in the space spanned by the forcing variables, (X1, X2). Again, our parameters of interest are the conditional mean outcomes at the cutoff for individuals in each treatment condition. For example, the causal effect of W1i = 1 instead of W1i = 0 for individuals with X2i = c2 would be the difference between:
µl(c) = lim_{x1→c1−} E[Yi | X1i = x1, X2i = c2] and µr(c) = lim_{x1→c1+} E[Yi | X1i = x1, X2i = c2]. (5)
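As a small illustration (our sketch, with hypothetical names), the indicators in (4) and the resulting condition labels can be computed directly:

```python
import numpy as np

def assign_conditions(x1, x2, c1, c2):
    """Compute the indicators in Eq. (4) and the condition labels A-D."""
    w1 = (x1 >= c1).astype(int)
    w2 = (x2 >= c2).astype(int)
    labels = np.array(["A", "B", "C", "D"])
    # index 0: W1=0,W2=0 (A); 1: W1=1,W2=0 (B); 2: W1=0,W2=1 (C); 3: both (D)
    return w1, w2, labels[w1 + 2 * w2]
```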
4. Estimating a regression-discontinuity model with two forcing variables

In any regression-discontinuity design, estimating these conditional means relies on assumptions about the relationship between Y and X near c. In the single-variable case, researchers often specify regression models with a variety of functional forms in a certain ‘‘window’’ around the cutoff. Recently, researchers have begun to relax these functional form assumptions by using ‘‘nonparametric’’ or ‘‘semi-parametric’’ approaches (e.g. Ludwig and Miller, 2007; Lee and Lemieux, 2010). As our parameters of interest are boundary objects and standard nonparametric smoothing strategies have poor boundary properties, we can estimate these limits with local linear regression (Fan, 1992; Hahn et al., 2001; Porter, 2003). Imbens and Lemieux (2008) formalize this nonparametric approach and delineate its implementation. They describe a two-step process in which analysts: (1) choose an ‘‘optimal’’ bandwidth around the cutoff (labeled h∗) that minimizes a clearly
defined cross-validation criterion3 by conducting a series of local linear regression analyses and (2) estimate the causal effect by conducting a local linear regression analysis using this optimal bandwidth. Given that causal inferences about the treatment effect focus on the cut score, this approach is identical in practice to fitting a single OLS model centered at the cut score, using observations within ±h∗ of the cut score.

With multiple forcing variables, we need to model the outcome as a function of both forcing variables at the boundaries of these regions in order to estimate the effect of different treatment conditions at the cut scores. We can again estimate the relevant conditional expectations (such as those from (5)) using any standard technique. For example, we could take a parametric approach and use multiple regression analysis with higher-order polynomials to estimate the relationships between X1, X2, and Y near the joint cutoff. Given the advantages of the nonparametric methods, however, we generalize Imbens and Lemieux's (2008) approach. As in the single-variable case, we can fit the requisite regression models in each region simultaneously, by specifying a single statistical model with 16 parameters: an intercept and slope parameters to accompany all 15 possible interactions among W1, W2, X1, and X2. We write the model with the two forcing variables (X1c and X2c) centered on their respective cut-points:

E[Yi] = β0 + β1 W1i + β2 W2i + β3 (W1i × W2i) + β4 X1ic + β5 X2ic + β6 (X1ic × X2ic) + β7 (X1ic × W1i) + β8 (X2ic × W2i) + β9 (X1ic × W2i) + β10 (X2ic × W1i) + β11 (X1ic × X2ic × W1i) + β12 (X1ic × X2ic × W2i) + β13 (X1ic × W1i × W2i) + β14 (X2ic × W1i × W2i) + β15 (X1ic × X2ic × W1i × W2i). (6)
This model defines four surfaces in three dimensions, with intercepts at (c1, c2). Each surface lies in one of the four regions defined by the intersection of the values of W1 and W2, as follows:

(a) E[Yi | W1 = 0, W2 = 0] = β0 + β4 X1ic + β5 X2ic + β6 (X1ic × X2ic)
(b) E[Yi | W1 = 1, W2 = 0] = (β0 + β1) + (β4 + β7) X1ic + (β5 + β10) X2ic + (β6 + β11)(X1ic × X2ic)
(c) E[Yi | W1 = 0, W2 = 1] = (β0 + β2) + (β4 + β9) X1ic + (β5 + β8) X2ic + (β6 + β12)(X1ic × X2ic)
(d) E[Yi | W1 = 1, W2 = 1] = (β0 + β1 + β2 + β3) + (β4 + β7 + β9 + β13) X1ic + (β5 + β8 + β10 + β14) X2ic + (β6 + β11 + β12 + β15)(X1ic × X2ic). (7)
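To show how (6) and (7) translate into an estimable regression, here is a sketch (ours, not the authors' code; all names are hypothetical) that builds the 16-column design matrix with columns ordered to match β0 through β15, so the region surfaces in (7) fall out of the fitted coefficients:

```python
import numpy as np

def rd_design(x1c, x2c, w1, w2):
    """Design matrix for Eq. (6): an intercept plus all 15 interactions
    of W1, W2, and the centered forcing variables X1c, X2c.
    Columns are ordered to match beta_0 ... beta_15."""
    cols = [np.ones_like(x1c), w1, w2, w1 * w2,        # beta0..beta3
            x1c, x2c, x1c * x2c,                       # beta4..beta6
            x1c * w1, x2c * w2, x1c * w2, x2c * w1,    # beta7..beta10
            x1c * x2c * w1, x1c * x2c * w2,            # beta11, beta12
            x1c * w1 * w2, x2c * w1 * w2,              # beta13, beta14
            x1c * x2c * w1 * w2]                       # beta15
    return np.column_stack(cols)

# Fitting by OLS on observations inside the joint bandwidth (see Section 4.1):
# keep = (np.abs(x1c) <= h1) & (np.abs(x2c) <= h2)
# beta, *_ = np.linalg.lstsq(rd_design(x1c[keep], x2c[keep], w1[keep], w2[keep]),
#                            y[keep], rcond=None)
```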
Fig. 1. Hypothetical population representation of the four surfaces (A, B, C, and D) from Eq. (6), defined by the two cut scores, X1 = 0 and X2 = 0.

In Fig. 1, we present a graphical representation of the four hypothetical surfaces. Note that although we depict these surfaces for many values of X1 and X2, our parameters of interest are only defined at the cut scores. In Fig. 2 (top panel), we show the relevant edges of these hypothetical surfaces in two dimensions by taking cross-sections of Fig. 1 when X1 = c1. Examining these surfaces at the cut scores, we can describe several treatment effects of interest for individuals who fall near the joint cut scores. We illustrate these effects for individuals with X1 = c1 (effects (a) and (b)) in the bottom panel of Fig. 2. Here, the vertical distances between the edges represent the estimated causal effects identified by this approach, as follows:

(a) Effect of W1 = 1 | W2 = 0, X1 = c1: β1 + β10 X2ic
(b) Effect of W1 = 1 | W2 = 1, X1 = c1: (β1 + β3) + (β10 + β14) X2ic
(c) Effect of W2 = 1 | W1 = 0, X2 = c2: β2 + β9 X1ic
(d) Effect of W2 = 1 | W1 = 1, X2 = c2: (β2 + β3) + (β9 + β13) X1ic. (8)

3 Imbens and Kalyanaraman (2009) have proposed a plug-in estimator for the optimal bandwidth. The properties of this estimator have not been proven for forcing variables with discrete support, such as the test scores often used in regression-discontinuity designs. As a result, we choose to generalize the cross-validation criterion recommended by Imbens and Lemieux (2008).
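Once the coefficients of (6) are estimated, the effects in (8) are simple linear combinations of their elements; a hypothetical helper (our sketch, continuing the names above) makes the bookkeeping explicit, with x1c and x2c measured as deviations from the cut scores:

```python
def effects_from_betas(beta, x1c=0.0, x2c=0.0):
    """Treatment effects in Eq. (8); x1c = x2c = 0 evaluates them at the
    joint cutoff, where (a) reduces to beta1 and (b) to beta1 + beta3."""
    return {
        "W1=1 | W2=0, X1=c1": beta[1] + beta[10] * x2c,
        "W1=1 | W2=1, X1=c1": (beta[1] + beta[3]) + (beta[10] + beta[14]) * x2c,
        "W2=1 | W1=0, X2=c2": beta[2] + beta[9] * x1c,
        "W2=1 | W1=1, X2=c2": (beta[2] + beta[3]) + (beta[9] + beta[13]) * x1c,
    }
```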
Fig. 2 illustrates the flexibility of this approach in representing several treatment effects of interest simultaneously. Note that by setting both X1 = c1 and X2 = c2, we can recover the effects of each of the treatment conditions for individuals at the joint cutoff. For example, in the bottom panel of Fig. 2, we can describe the effect of W1 = 1 for individuals at the cutoff on both X1 and X2. Here, the effect for individuals at the cutoff with W2 = 0 is β1, the vertical height of the line on the left, while the analogous effect for individuals with W2 = 1 is β1 + β3, the vertical height of the line on the right. Our approach enables us not only to describe these effects at the joint cutoff, but also to examine explicitly how the causal impact of assignment by one forcing variable differs by levels of the other. The set of models in (8) describes the more nuanced relationships at different levels of X1 and X2, at least local to the cut scores.

4.1. Implementation

To fit our model in a data sample, we must first choose appropriate joint bandwidths (labeled h∗1 and h∗2) to govern our smoothing for X1 and X2 simultaneously. As we have two assignment variables, our bandwidth is actually a two-dimensional area in the (X1, X2) plane bounded by (h∗1, h∗2). To estimate h∗1 and h∗2, we generalize the iterative cross-validation procedure described by Imbens and Lemieux (2008). For each observation, at each point on the (X1, X2) grid, we use local linear regression analysis4 – within an arbitrary bandwidth (h1, h2) – to estimate a fitted value of the outcome at that point:

µ̂(X1i, X2i, h1, h2) = γ̂0 + γ̂1 X1i + γ̂2 X2i + γ̂3 (X1i × X2i). (9)

4 There is an interesting debate in the literature concerning how to weight different data points in the bandwidth in fitting these local linear regressions. Some analysts recommend simply using OLS regressions (i.e., applying a rectangular kernel) (e.g. Imbens and Lemieux, 2008), while others argue that triangular or other, more complicated, kernel weightings may lead to more appropriate estimation (e.g. Ludwig and Miller, 2007). For simplicity and interpretability, we recommend using OLS regressions. In practice, sensitivity of the results to kernel choice will be reflected in sensitivity to bandwidth choice because bandwidth selection is simply a more extreme version of a choice of weights (see Lee and Lemieux, 2010).
Fig. 2. Cross-section of Fig. 1 at the cut scores, showing the predicted value of Y, when X1 = c1, at different values of X2 (top panel), and hypothetical plot showing the causal effect of W1 = 1 at different levels of X2 (bottom panel). Note that the orientation is opposite to the orientation of Fig. 1 (negative values of X2 are to the left here).
In each case, we limit the observations used to estimate µ̂(X1i, X2i, h1, h2) to the region in which the grid-point falls, and estimate µ̂(X1i, X2i, h1, h2) as if it were a boundary point in order to mirror the regression-discontinuity approach that estimates limits defined at the boundary points of the region. Thus, the sample used to estimate µ̂(X1i, X2i, h1, h2) differs in each of the four regions A, B, C, and D defined in Fig. 1. For instance, in region A (where X1i < c1 and X2i < c2), we estimate µ̂(X1i, X2i, h1, h2) at grid-point (X1, X2) using only observations in the area (X1i − h1 ≤ x1 < X1i) ∩ (X2i − h2 ≤ x2 < X2i), for every value of X1 and X2 in the region. By contrast, in region D (X1i ≥ c1 and X2i ≥ c2), we use observations in the area (X1i ≤ x1 ≤ X1i + h1) ∩ (X2i ≤ x2 ≤ X2i + h2). We then compare our fitted values to the observed values, across the entire sample, using the generalized Imbens and Lemieux cross-validation criterion:

CVY(h1, h2) = (1/N) Σ_{i=1}^{N} (Yi − µ̂(X1i, X2i, h1, h2))². (10)
We vary the joint bandwidth dimensions (h1 and h2) systematically and, for each bandwidth pair, we obtain a value of this cross-validation criterion. Our optimal joint bandwidth, h∗1 and h∗2, is the pair of bandwidths that minimizes the CV criterion. This procedure uses the entire sample to develop optimal bandwidths for an analysis in which the objects of interest are the parameters at the joint cut scores. We then fit our full model from Eq. (6), applying our joint bandwidth. Given that our parameters of interest are defined at the cut scores, we can focus our attention on the single local linear regression analysis that uses only observations with forcing variable values within one optimal bandwidth on either side of the relevant cut scores. As in the single-variable case, we can interpret this single model centered at (c1, c2) parametrically for observations local to the cut score and use the standard errors to conduct appropriate statistical tests.5

One key advantage of the approach that we have laid out is that it provides an externally defined criterion for bandwidth choice, rather than leaving the investigator to make subjective decisions about which bandwidths provide the most credible findings. However, in all cases, we recommend assessing whether findings are robust to bandwidth choice by systematically refitting the primary model using a range of plausible bandwidths. In doing so, we recommend focusing on the magnitudes of parameter estimates in addition to the results of hypothesis tests. As bandwidths grow smaller, estimates will necessarily become less precise, but any convincing substantive story should remain the same.6

5 It is worth noting that, within the bandwidths, we can get E[Y | W1 = 0, W2 = 0] = (1/N00) Σ_{W1=0, W2=0} Y, where N00 is the total number of observations within the bandwidths for which W1 = 0 and W2 = 0. Analogous results exist for the other expressions in Eq. (7) in each of the relevant regions. By comparison, we can get E[Y | W1 = 0] and E[Y | W1 = 1] through a standard, univariate regression-discontinuity approach.
6 Furthermore, credible regression-discontinuity designs will demonstrate that key assumptions (including the exogeneity of the assignment cutoff) are satisfied. See the 2008 special issue of the Journal of Econometrics and Lee and Lemieux (2010) for recommended tests.
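The following sketch (ours, not the authors' implementation; the leave-one-out treatment of each observation and the minimum cell size of four are our assumptions, and all names are hypothetical) implements the region-specific, boundary-style local linear fits of (9) and the criterion in (10) for a single bandwidth pair, which can then be minimized over a grid:

```python
import numpy as np

def cv_criterion(x1, x2, y, c1, c2, h1, h2):
    """Generalized Imbens-Lemieux cross-validation criterion, Eq. (10).
    Each observation is predicted from a local linear regression, Eq. (9),
    fitted on same-region neighbours lying on its own side of the cutoffs,
    so that every point is treated as a boundary point of its region."""
    below1, below2 = x1 < c1, x2 < c2
    sq_err = np.full(y.size, np.nan)
    for i in range(y.size):
        # One-sided windows that look away from the cutoffs, as in regions A-D.
        win1 = ((x1 >= x1[i] - h1) & (x1 < x1[i])) if below1[i] \
            else ((x1 > x1[i]) & (x1 <= x1[i] + h1))
        win2 = ((x2 >= x2[i] - h2) & (x2 < x2[i])) if below2[i] \
            else ((x2 > x2[i]) & (x2 <= x2[i] + h2))
        use = win1 & win2 & (below1 == below1[i]) & (below2 == below2[i])
        if use.sum() < 4:          # need at least 4 points for 4 parameters
            continue
        X = np.column_stack([np.ones(use.sum()), x1[use], x2[use],
                             x1[use] * x2[use]])
        g, *_ = np.linalg.lstsq(X, y[use], rcond=None)
        pred = g[0] + g[1] * x1[i] + g[2] * x2[i] + g[3] * x1[i] * x2[i]
        sq_err[i] = (y[i] - pred) ** 2
    return np.nanmean(sq_err)

# Grid search for the optimal joint bandwidth (h1*, h2*):
# grid = [(a, b) for a in np.linspace(0.5, 3.0, 6) for b in np.linspace(0.5, 3.0, 6)]
# h1_star, h2_star = min(grid, key=lambda h: cv_criterion(x1, x2, y, c1, c2, *h))
```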
5. Discussion

Natural experiments in which units of analysis are assigned to several different treatment conditions based on values of multiple forcing variables are quite common. Typically, researchers have analyzed these cases by creating a single composite forcing variable or by focusing on one of the assignment variables and examining separately the effects for individuals on either side of the second cutoff. While sensible, these approaches have important limitations. The first method implicitly blurs distinctions between different treatments by evaluating them as a single condition. The approach described here permits the articulation of several different treatment conditions and the examination of the effects of different combinations of treatments. The second approach implicitly provides estimates of average effects rather than distinguishing among individuals at different levels of a second assignment variable. Such simplified analyses can be misleading. Our hypothesized model represents a more nuanced way to answer many different questions of interest. Rather than estimate the average effect of a single treatment, we can explore how this effect differs by levels of a second assignment variable.

The approach we describe does have limitations. First, like any regression-discontinuity design, it has limited external validity: causal effects are only identified for observations in the immediate vicinity of the cut scores. Second, statistical power is a key issue. Estimating these effects precisely requires a substantial density of data points near the joint cut scores. Nonetheless, the approach that we describe in this paper has a number of advantages over more conventional methods for addressing causal questions in situations in which multiple forcing variables assign individuals exogenously to different treatments. In particular, with sufficient data, the method provides a more complete picture of the relationship between different combinations of treatments and the outcome of interest.
Acknowledgements

The authors thank Carrie Conaway, the Director of Planning, Research, and Evaluation of the Massachusetts Department of Elementary and Secondary Education, for providing the data for the project on which this paper was based. We thank John Geweke and two anonymous referees for helpful comments. The research reported here was supported by the Institute of Education Sciences, US Department of Education, through Grant R305E100013 to Harvard University. The opinions expressed are those of the authors and do not represent views of the Institute or the US Department of Education.

References

Cook, T.D., 2008. ‘‘Waiting for life to arrive’’: a history of the regression-discontinuity design in psychology, statistics and economics. Journal of Econometrics 142 (2), 636–654.
Fan, J., 1992. Design-adaptive nonparametric regression. Journal of the American Statistical Association 87 (420), 998–1004.
Hahn, J., Todd, P., Van der Klaauw, W., 2001. Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica 69 (1), 201–209.
Imbens, G., Kalyanaraman, K., 2009. Optimal bandwidth choice for the regression discontinuity estimator. Working Paper 14726. National Bureau of Economic Research.
Imbens, G., Lemieux, T., 2008. Regression discontinuity designs: a guide to practice. Journal of Econometrics 142 (2), 615–635.
Jacob, B.A., Lefgren, L., 2004. Remedial education and student achievement: a regression-discontinuity analysis. Review of Economics and Statistics 86 (1), 226–244.
Lee, D.S., Lemieux, T., 2010. Regression discontinuity designs in economics. Journal of Economic Literature 48 (2), 281–355.
Leuven, E., Lindahl, M., Oosterbeek, H., Webbink, D., 2007. The effect of extra funding for disadvantaged pupils on achievement. Review of Economics and Statistics 89 (4), 721–736.
Ludwig, J., Miller, D., 2007. Does Head Start improve children's life chances? Evidence from a regression discontinuity design. Quarterly Journal of Economics 122 (1), 159–208.
Martorell, F., 2004. Does failing a high school graduation exam matter? Unpublished working paper.
Matsudaira, J.D., 2008. Mandatory summer school and student achievement. Journal of Econometrics 142 (2), 829–850.
Ou, D., 2010. To leave or not to leave? A regression discontinuity analysis of the impact of failing the high school exit exam. Economics of Education Review 29 (2), 171–186.
Papay, J.P., Murnane, R.J., Willett, J.B., 2010. The consequences of high school exit examinations for low-performing urban students: evidence from Massachusetts. Educational Evaluation and Policy Analysis 32 (1), 5–23.
Pettersson-Lidbom, P., 2008. Do parties matter for economic outcomes? A regression-discontinuity approach. Journal of the European Economic Association 6 (5), 1037–1056.
Porter, J., 2003. Estimation in the regression discontinuity model. Unpublished working paper.
Reardon, S.F., Arshan, N., Atteberry, A., Kurlaender, M., 2010. High stakes, no effects: effects of failing the California high school exit exam. Educational Evaluation and Policy Analysis 32 (4), 498–520.
van der Klaauw, W., 2008. Regression-discontinuity analysis: a survey of recent developments in economics. Labour 22 (2), 219–245.
Journal of Econometrics 161 (2011) 208–227
Matching and semi-parametric IV estimation, a distance-based measure of migration, and the wages of young men✩
John C. Ham a,∗, Xianghong Li b, Patricia B. Reagan c
a University of Maryland, IZA and IRP (UW-Madison), United States
b York University, Canada
c Ohio State University, United States
Article history: Received 31 July 2009; Received in revised form 25 October 2010; Accepted 13 December 2010; Available online 21 December 2010.
JEL classification: J61; C14; C26
Keywords: US internal migration; Propensity score matching; LATE
Abstract: Our paper estimates the effect of US internal migration on wage growth for young men between their first and second job. Our analysis of migration extends previous research by: (i) exploiting the distance-based measures of migration in the National Longitudinal Surveys of Youth 1979 (NLSY79); (ii) allowing the effect of migration to differ by schooling level; (iii) using propensity score matching to estimate the average treatment effect on the treated (ATET) for movers; and (iv) using local average treatment effect (LATE) estimators with covariates to estimate the average treatment effect (ATE) and ATET for compliers. We believe the Conditional Independence Assumption (CIA) is reasonable for our matching estimators since the NLSY79 provides a relatively rich array of variables on which to match. Our matching methods are based on local linear, local cubic, and local linear ridge regressions. Local linear and local ridge regression matching produce relatively similar point estimates and standard errors, while local cubic regression matching badly over-fits the data and provides very noisy estimates. We use the bootstrap to calculate standard errors. Since the validity of the bootstrap has not been investigated for the matching estimators we use, and has been shown to be invalid for nearest neighbor matching estimators, we conduct a Monte Carlo study on the appropriateness of using the bootstrap to calculate standard errors for local linear regression matching. The data generating processes in our Monte Carlo study are relatively rich and calibrated to match our empirical models or to test the sensitivity of our results to the choice of parameter values. The estimated standard errors from the bootstrap are very close to those from the Monte Carlo experiments, which lends support to our using the bootstrap to calculate standard errors in our setting. From the matching estimators we find a significant positive effect of migration on the wage growth of college graduates, and a marginally significant negative effect for high school dropouts. We do not find any significant effects for other educational groups or for the overall sample. Our results are generally robust to changes in the model specification and changes in our distance-based measure of migration. We find that better data matters; if we use a measure of migration based on moving across county lines, we overstate the number of moves, while if we use a measure based on moving across state lines, we understate the number of moves. Further, using either the county or state measures leads to much less precise estimates. We also consider semi-parametric LATE estimators with covariates (Frölich, 2007), using two sets of instrumental variables. We precisely estimate the proportion of compliers in our data, but because we have a small number of compliers, we cannot obtain precise LATE estimates.
✩ An earlier version of this paper was presented under the title ‘‘Matching and Selection Estimates of the Effect of Migration on Wages for Young Men’’. Geert Ridder, Barbara Sianesi and Barry Smith made numerous extremely helpful comments on the paper. We would also like to thank Dwayne Benjamin, Stephen Cosslett, Songnian Chen, Xiaohong Chen, William Greene, Cheng Hsiao, Guido Imbens, John Kennan, Lung-fei Lee, Audrey Light, Aloysius Siow, Jeffrey Smith, Petra Todd, Insan Tunali, Bruce Weinberg, Tiemen Woutersen, James Walker and Jeff Yankow for very helpful comments and discussions. We are also grateful to seminar participants at Arizona, CEMFI, the Federal Reserve Bank of New York, McGill, University of Montreal, McMaster, Minnesota (Industrial Relations), Ohio State, Pompeu Fabra, Rutgers, Toronto, UC Berkeley, UC Davis, UC San Diego, USC, Vanderbilt, Western Ontario, Wisconsin and the Upjohn Institute, as well as at the Econometric Society and Society of Labor Economics meetings. Finally we thank three anonymous referees and a Co-Editor for very helpful comments which led to a much improved paper. Eileen Kopchik and Yong Yu provided outstanding programming help. This research was partially supported by the NSF and NICHD. We emphasize that we are responsible for all errors. This paper reflects the views of the authors and in no way represents the views of the NSF or NICHD.
∗ Corresponding author. E-mail address: [email protected] (J.C. Ham).
1. Introduction

Internal migration is an important economic phenomenon in the United States. Between 2002 and 2003, about 40.1 million Americans moved, and about 60% of these movers were 20–29 years old.1 Labor economists typically model migration as an investment in human capital, but evidence on whether moving increases wages is mixed. By using data on young men from the 1979–1996 waves of the National Longitudinal Surveys of Youth 1979 (NLSY79), we attempt in this paper to identify the average initial, or contemporaneous, wage gain from US internal migration for both those who move, and those who are at the margin of deciding to move and whose moving decision would react to an exogenous change in moving costs (referred to as compliers in the LATE literature).

We contribute to the migration literature in several ways. First, we allow migration effects to differ across educational groups and find that this distinction is important.2 Previous studies pool different educational groups to estimate average returns for all migrants. If returns on migration are positive for some education group(s), such as college graduates, and negative or zero for other groups, then the overall sample average may be statistically indistinguishable from zero even though individual components are nonzero. Second, we use a distance-based measure of migration in the NLSY79, instead of a measure based on moving across a state or county line. We define migration as having occurred if the respondent moved at least 50 miles, or changed Metropolitan Statistical Area (MSA) and moved at least 20 miles. Compared to a measure based on moving across a state or county line, the measures commonly used in the literature, the distance-based measure of migration corresponds more closely to the theoretical notion of changing local labor markets described by Hanushek (1973). We find that measuring migration by changing state underestimates migration by about 36%, and measuring migration by changing county overestimates migration by about 43%. Further, we obtain substantially less precise estimates using either of these alternative measures of migration.

We also address the fact that those who move are a nonrandomly selected sample in two ways. First, we use propensity score matching to estimate an ATET of migration, i.e. we investigate the effect of migration on those who move. Second, we adopt a semi-parametric instrumental variable approach to estimate local average treatment effects, which represent the effect of migration for the subpopulation that is at the margin of deciding to move (and who are compliers in the sense that they would react to an exogenous change in moving costs). One may prefer Frölich's (2007) non-parametric IV estimator since it only requires an exclusion restriction (and we have one that seems plausible in our application), while matching requires invoking a Conditional Independence Assumption (CIA). However, we argue in some detail below that the CIA is appropriate in our application since we use difference-in-difference matching and condition on a relatively rich set of variables from the NLSY79. We consider propensity score matching estimators based on local linear, local cubic, and local linear ridge regressions. We use the bootstrap to calculate standard errors.
Since the validity of the bootstrap has not been investigated for the matching estimators we use, and has been shown to be invalid for nearest neighbor matching estimators, we conduct a Monte Carlo study on the appropriateness of using the bootstrap to calculate standard errors for local linear regression matching. The data generating processes in our Monte Carlo study are relatively rich and calibrated to match our empirical model or to test the sensitivity of our results to the choice of parameter values. The estimated standard errors from the bootstrap are very close to those from the Monte Carlo experiments, which lends support to our using the bootstrap to calculate standard errors in our setting. The results suggest that the bootstrap does a very good job of calculating standard errors for local linear regression matching. We use the Andrews–Buchinsky algorithm (Andrews and Buchinsky, 2000, 2001) to choose the number of replications in the standard bootstrap, and in two cases below, the necessary number of replications is much larger than that usually used by applied researchers.

We obtain relatively precise estimates of the ATET for each educational group from matching estimators, and our results are not sensitive to whether we use local linear versus local linear ridge matching; local cubic matching badly over-fits the data resulting in large standard errors. Further, our results are robust to minor changes in the propensity score specification, bandwidth choice, trimming levels, and reasonable changes in our distance-based measure of migration. We find a statistically significant and positive ATET effect among college graduates of around 10%, and a marginally significant negative effect for high school dropouts of about −12%. We do not find a statistically significant migration effect for the overall sample or the other educational groups (high school graduates and those with some college). Since we estimate a contemporaneous effect of migration on wage growth, insignificant or negative contemporaneous effects do not necessarily imply that migration is an irrational decision from a human capital perspective. Instead, it may simply indicate that, at least for certain individuals, there is an assimilation process where a short-term wage loss might be dominated by a greater long-term wage gain later.3 Our estimates and standard errors are somewhat sensitive to switching to a measure of migration based on crossing county or state lines, and under either of these alternative measures our ATET estimates become smaller in absolute value and statistically insignificant for college graduates and high school dropouts. Finally, we use Frölich's (2007) semi-parametric instrumental variable estimator to estimate the treatment effects for the compliers, and this estimator also provides an estimate of the proportion of compliers in our data. We estimate relatively precisely the proportion of compliers in our data, but can estimate neither the ATE nor the ATET for compliers accurately since we have too few compliers in our data.

Our paper is organized as follows: we review the migration literature in Section 2. In Section 3 we present our econometric approaches. In Section 4 we describe our data, and our empirical results are presented in Section 5. Section 6 provides the evidence from our Monte Carlo study exploring the validity of using the ordinary bootstrap to estimate standard errors of the ATETs given our local linear matching estimators. Section 7 concludes the paper.

1 See http://www.census.gov/prod/2004pubs/p20-549.pdf.
2 To the best of our knowledge, Yankow (1999) is the only other author who allows migration effects to differ by education.
3 We also considered estimating the long-term effects of migration, but this turned out to be quite problematic to investigate. To estimate long-term effects of migration 5 or 10 years after the first move, we have to deal with the issues that some of the movers return, some of the stayers move on, and some of the movers go on to other locations. We could limit our comparisons to movers who stay in the new location and stayers who never move, but this will give us a much smaller sample and also raise significant concerns about sample selection. Finally, we believe that the selection on observables assumption, or the CIA, may be more plausible at the beginning of careers.

2. A brief review of the migration literature

The most common theoretical model of migration treats the decision to migrate as an investment in human capital: individuals migrate if the present value of real income in a destination minus
the cost of moving exceeds what could be earned at the place of origin (Sjaastad, 1962).4 For our purposes, the empirical studies based on this model can be classified into two broad areas: those examining the determinants of migration and those focused on the consequences of migration for wages and earnings.5 While the determinants of migration are not the focus of our paper, they play an important role in our specification of the propensity score in matching and in our non-parametric LATE estimators. Polachek and Horvath (1977) and Plane (1993) find that geographic mobility peaks during the early to mid-twenties and declines with age thereafter. These studies also find that the propensity to migrate increases with education. In addition, the migration decision is affected by moving costs and the non-wage benefits of different locations. Goss and Schoening (1984) provide some indirect evidence that households with fewer assets are less mobile, since they find that the probability of migration declines with the duration of unemployment. Further, Lansing and Mueller (1967) report that many moves are attributable to family-related issues, such as proximity to family members. Presumably being close to one's family is a non-wage advantage of a given location. Of course, in the human capital model of migration, expected wage gains, local demand shocks, and inter-regional differences in returns to skill play an important role in the migration decision. Shaw (1991), Borjas et al. (1992b), Dahl (2002), and Kennan and Walker (2003) use a Roy (1951) model of comparative advantage to study migration.

Although the human capital model of migration clearly predicts a higher present value of lifetime earnings for those who migrate, the literature on the consequences of migration reaches no consensus on the contemporaneous returns to migration. Estimates of the average contemporaneous returns can be negative, zero, or positive. Positive contemporaneous returns are found by Bartel (1979) for younger workers, Hunt and Kau (1985) for repeat migrants, and Gabriel and Schmitz (1995) and Yankow (2003) for less-educated workers. Negative contemporaneous returns are found by Polachek and Horvath (1977), Borjas et al. (1992a), and Tunali (2000).6 Studies that find statistically insignificant contemporaneous returns include Bartel (1979) for older workers, Hunt and Kau (1985) for one-time migrants, and Yankow (2003) for workers with more than a high school degree.

The sign and significance of the migration effect depend on the sample chosen and on how researchers address three critical questions. First, what definition of migration is used? Although all authors view migration as a change of labor market, most define migration as occurring if a geographic boundary is traversed. The majority of authors, including most of those cited above, focus on interstate migration. A few, such as Hunt and Kau (1985) and Gabriel and Schmitz (1995), define migration as a change of Metropolitan Statistical Area (MSA). Falaris (1987) defines it as a change of Census region, while some authors, such as Linneman and Graves (1983), study inter-county migration. We find below that migration counts are sensitive to the definition used, and that the estimated returns to migration and their standard errors can be quite sensitive to the definition of migration used.

4 See also McCall and McCall (1987). They develop a ‘‘multi-armed bandit’’ approach to the migration decision. Workers rank locations by their pecuniary and nonpecuniary attributes, and then sample locations sequentially until a suitable match is found. Search costs limit the number of markets sampled.
5 Greenwood (1997) provides an excellent review of the literature.
6 Alternatively, Tunali (2000) views migration as a lottery and finds that while a substantial portion of migrants experience wage reductions after moving, a minority realize very high returns. Individuals are willing to invest in an activity that has a high probability of yielding negative returns because of the potential for a very large payoff.
The second question affecting the estimated effect of migration concerns the choice of comparison group. Most authors use all workers who do not migrate as the comparison group. But it is well known that there is wage growth associated with voluntary job turnover (Topel and Ward, 1992). Since most migrants change jobs, the ‘‘return to migration’’ may confound returns to job changing with a return to geographic mobility. Bartel (1979) was the first to focus on the relationship between the types of job separation and migration. Others, such as Yankow (1999), condition on job changing but do not differentiate between types of job turnover (e.g. quits versus layoffs). Finally, Raphael and Riker (1999) consider only workers who were laid off.

The third important question is how researchers address the problem that migrants are likely to be a select sample in terms of observables and unobservables, because migration is a choice variable. Nakosteen and Zimmer (1980, 1982) were among the first to provide evidence of positive self-selection into migration, and Robinson and Tomes (1982) and Gabriel and Schmitz (1995) also find this. On the other hand, Hunt and Kau (1985) and Borjas et al. (1992a) find no evidence of self-selection. Note that all of these studies attempt to estimate an unconditional effect of moving or an average treatment effect, i.e. what is the effect of migration on wages for a randomly chosen individual. We see two problems with this. First, it may not be interesting to ask what the effect on wages of moving to a new location is for a randomly chosen individual. Many individuals will currently be in a location that gives them relatively high wages, and thus we could easily expect this treatment effect to be negative. Second, the choice of a new location is ambiguous here: is it the individual's best alternative or a randomly chosen location? In this paper we first use matching to look at a less ambitious, but arguably better-specified question: the effect on wage growth for those who move. Note that there is no ambiguity in our setting as to the choice of new location.7 However, we also use Frölich's non-parametric IV estimator to estimate the ATE and ATET for compliers.

3. Econometric approach

3.1. Propensity score matching estimators

We consider a group of young men who have quit their first job. They are assumed to face a choice between accepting another job locally or moving to another labor market and accepting a job there. Our goal when using matching is to estimate the effect of internal migration on between-job wage growth for those who quit their first job and move.8 We use difference-in-difference matching (Heckman et al., 1997, 1998a; Abadie, 2005). Following the notation in the evaluation literature, let Di = 1 if an individual moves and Di = 0 otherwise. The outcome variable is defined as the logarithm of the starting wage on the second job minus the logarithm of the ending wage on the first job. For each individual i, we define two potential outcomes by treatment status, Yi^0 = log(Wi0t) − log(Wi0t′) for Di = 0 and Yi^1 = log(Wi1t) − log(Wi0t′) for Di = 1, where t and t′ represent the beginning of the second job (post-treatment) and the end of the first job (pre-treatment), respectively. After dropping the i subscript for ease of exposition, our goal is to identify the migration effect on
the treated (movers):

ATET = E(Y^1 − Y^0 | D = 1) = E(Y^1 | D = 1) − E(Y^0 | D = 1). (3.1)

7 See Heckman et al. (1999) for a discussion of when estimating the effect of ‘‘treatment on the treated’’ may be more useful than estimating an average treatment effect. For reasons discussed in Section 3.1.1, we cannot estimate an average treatment effect of migration given our data.
8 See Section 4 for our reasons for focusing on migration occurring between the first two jobs.
We can estimate the first term on the right-hand side of (3.1) since we observe the between-job wage growth for the movers.9 However, we do not observe the between-job wage growth the movers would have received had they not moved, the (counterfactual) second term on the right-hand side of (3.1). Instead we use propensity score matching to estimate this counterfactual.10

3.1.1. Propensity score matching

For matching to be valid, certain assumptions must hold. The fundamental assumption underlying matching estimators is known as the Ignorable Treatment Assignment Assumption (see Rosenbaum and Rubin, 1983) or the Conditional Independence Assumption (CIA; see Lechner, 2000). Since we are considering the migration effect on the treated (movers), the CIA does not need to hold for Y^1; instead we only need it to hold for Y^0. In our case the difference-in-difference (DID) matching estimator (Heckman et al., 1997, 1998a; Abadie, 2005) requires

E(log(W0t) − log(W0t′) | X, D = 1) = E(log(W0t) − log(W0t′) | X, D = 0) (3.2)
where X is an appropriate set of observable variables unaffected by the treatment. This assumption is stated in terms of the before–after wage evolution instead of levels. It means that, conditional on X, the movers' potential wage growth Y^0 had they not moved would be the same as the stayers' potential wage growth.

The choice between using a DID matching estimator and a cross-section (CS) matching estimator (defining the level of wages on the second job as the outcome variable) involves a trade-off. On the one hand, the DID matching estimator relies on additive separability of the error terms,11 and this is not required by the CS matching estimator. On the other hand, the DID matching estimator only needs the CIA assumption to hold after unobserved time invariant (separable) components that affect both wages and migration (individual specific or local market specific) have been differenced out, while the CS matching estimator requires the CIA to hold without removing such time invariant (separable) components. Smith and Todd (2005) find that the DID matching estimator performs better than the CS matching estimator when participants and nonparticipants were drawn from different regional labor markets. Since participants and nonparticipants in our analysis are indeed drawn from different labor markets, we felt it was most appropriate to use the DID matching estimator.

For our identification strategy to be credible, there should be at least one variable that affects whether people move but is unrelated to potential wage growth if they stay; otherwise there is no way to explain with identical conditioning variables why only some people move while others do not. However, for matching (unlike IV estimators) it is not necessary to observe variables that affect the migration decision but not wage growth. One variable that affects D but not Y^0 could be Y^1. If one considers two people with the same potential wage gains Y^0, the one with the higher potential wage gain Y^1 is more likely to move. We will discuss this issue in greater detail in the next section. Note that since we are estimating the ATET for those who move, we do not need Y^1 to be independent of D after conditioning on X. The variable vector X contains pretreatment variables and time-invariant individual characteristics. To identify the ATET, we also need

Pr(D = 1|X) < 1. (3.3)

9 As discussed in Section 3.1.5, a common support constraint is imposed when estimating this term.
10 The seminal paper in the matching literature is Rosenbaum and Rubin (1983). A nonexhaustive list of recent contributions in the economics literature includes Abadie and Imbens (2006), Hahn (1998), Heckman et al. (1997, 1998a,b), and Hirano et al. (2003). See Imbens (2004) and Heckman et al. (1999) for a discussion of previous work in the statistics literature and empirical applications of matching. See Frölich (2004) and Zhao (2004) for Monte Carlo evaluations of various matching approaches.
11 See Heckman et al. (1997, equations 7a and 7b on p. 613).
Eq. (3.3) is a common support condition that requires a positive probability of observing nonparticipants at each level of X. Note that if Pr(D = 1|X = x0) = 1, then at X = x0 we will observe only movers and do not observe any stayers. As we will see below, the common support constraint is not a problem in our data, unlike some other applications, such as those drawn from the job training literature. Matching on all variables in X becomes impractical as the number of variables increases. Rosenbaum and Rubin (1983) show that, if Y^0 is independent of treatment status given X, then it is also independent of treatment status given p(X) = Pr(D = 1|X). Consequently, matching can be performed on a single index p(X) instead of on all of the variables in X.

We should note that it is also possible to estimate an unconditional effect of moving if we are willing to make the stronger assumptions restricting Y^1 and adding 0 < Pr(D = 1|X). However, we do not do this for two reasons. First, as we mentioned above, we believe the conditional effect for those who move is more interesting. Second, calculating an unconditional effect also involves using matching to estimate the wage growth that stayers would have experienced had they moved. In our case this would require using a relatively small number of movers to construct a counterfactual for a relatively large number of stayers. Frölich's (2004) Monte Carlo experiments indicate that matching estimators for such a counterfactual have high mean squared errors.

3.1.2. Plausibility of the conditional independence assumption and the choice of conditioning variables

The CIA in Eq. (3.2) is a rather strong condition, but note that we allow Y^1, and thus the migration wage gain ∆ = Y^1 − Y^0, to depend on D. Consider again a situation where two individuals have the same potential wage gain if they stay (Y^0). One of them has Y^1 > Y^0, and therefore moves.12 The other person has Y^1 < Y^0 and thus does not move. However, if Y^0 depends on Y^1, we would have to rule out the dependence of Y^1 on D. An example of a possible dependence of Y^0 on Y^1 is the case where an individual is able to use the wage offer from another location to bargain for a better local offer. Since the individuals in our sample had an average age of 26 when they quit their first job, we hope and believe that this influence is small and has little effect on our estimates. The advantage of allowing Y^1 to depend on D is that we do not need to include in X the factors that affect both Y^1 and D. Examples of such variables are the pull factors from the new destination (such as labor market conditions in the new location), which are unobserved for the stayers, and thus cannot be included in our conditioning variables. Of course, as noted above, we only require that potential wage gains for staying are independent of D after the temporally invariant components have been eliminated.

We select our conditioning variables to control for factors (or proxy for unobservables) expected to affect both the migration decision and the potential wage gain at the home location (Y^0). Our propensity score specification includes age, education, race, occupation, starting wage, ending wage, job tenure on the first job, living in an MSA,13 and home ownership.14 For example, we chose home ownership as a conditioning variable since we expect that it would affect moving costs and also would be correlated with the unobservables in wages. We would expect the wealth of the individual's parents to affect the discount rate and thus the migration decision.15 Whether this variable also affects wage growth is an open question.16 Thus we estimate the propensity score without parents' education in our baseline model, but include both parents' education in our expanded model. In the expanded model we also consider a push factor for migration, the county unemployment rate at the end of the first job, since it may affect the migration decision and local wages.

Assume that for a mover and a stayer, we have matched on all observed characteristics that affect both the migration decision and Y^0. It is reasonable to ask why some of them choose to move and the others do not, if we have controlled for all relevant factors. The answer is that there are variables that affect D but not Y^0. We discussed above why Y^1 is likely to be such a variable. Further, we expect moving cost variables to affect D but not Y^0; as a result they should not be included in the propensity score model. Thus our empirical model will match movers and stayers with similar estimated propensity scores who have similar Y^0, but bear different moving costs. As a proxy for moving costs, we consider the dummy variable in the NLSY79 data indicating if an individual was living at age 14 in the same county where he was born. We would expect that this variable is a proxy for the psychological cost of moving (because of current proximity to friends and families), but in general we would not expect this variable to affect wages.17 In Section 4 we will show that movers are much less likely to live in their county of birth at age 14.

12 Here we assume that the moving costs are the same for both individuals (for expositional simplicity).
13 Workers in cities earn 33% more than their nonurban counterparts. Glaeser and Mare (2001) show that a portion of the urban wage premium is a wage growth, not a wage level, effect. They conclude that cities speed the accumulation of human capital.
14 All variables take their first job values.
15 Migration is a human capital investment, where an individual trades off the initial cost of moving against the discounted future sum of increased earnings. It is widely believed in the human capital literature that individuals from wealthier families face lower discount rates because their parents will lend them money for such a human capital investment at a low or zero interest rate.
16 For example, the assumption that father's education does not affect wages is the identifying assumption in the seminal paper of Willis and Rosen (1979) on the decision to go to college. Given their assumption, father's education makes a good IV but should not be included in the propensity score. However, others may find their assumption too strong, since father's education may affect wages through the father's connections. In this case it should be included in the propensity score but would not be a valid IV.
17 Of course one could argue that living at age 14 in the same county where an individual was born could still affect wages, e.g. because of local network effects, although note that these effects will be captured in part by the job 1 wages. In a previous version of the paper, we experimented with including the same county variable in the propensity score, but this had very little effect on the matching estimates, suggesting that these local network effects on wages are small (after conditioning on job 1 wages).
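To fix ideas before turning to the specific estimators, here is a compact sketch (ours, not the authors' implementation) of how (3.1)-(3.3) combine into a propensity-score DID matching estimate of the ATET; the probit specification, the bandwidth h, the rectangular kernel, and the trimming rule are all placeholder assumptions, and the names are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

def did_matching_atet(X, D, dlogw, h=0.05):
    """Sketch of a DID matching estimator of the ATET in (3.1)-(3.2).
    dlogw is between-job wage growth, log(W_t) - log(W_t'); D = 1 for movers.
    Each mover's counterfactual wage growth is the intercept of a local
    linear regression of stayers' growth on the propensity score,
    centered at the mover's own score."""
    p = sm.Probit(D, sm.add_constant(X)).fit(disp=0).predict()
    p1, y1 = p[D == 1], dlogw[D == 1]
    p0, y0 = p[D == 0], dlogw[D == 0]
    keep = p1 <= p0.max()                   # common support, Eq. (3.3)
    effects = []
    for pi, yi in zip(p1[keep], y1[keep]):
        use = np.abs(p0 - pi) <= h          # rectangular-kernel window
        if use.sum() < 2:
            continue
        Z = np.column_stack([np.ones(use.sum()), p0[use] - pi])
        g, *_ = np.linalg.lstsq(Z, y0[use], rcond=None)
        effects.append(yi - g[0])           # Y1 minus matched E[Y0 | p(X)=pi]
    return float(np.mean(effects))
```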
variables, {xi , Yi1 }i=11 and
xj , Yj0
N0 j =1
, are observed for the two
N1
groups respectively. The propensity scores pˆ (xi ) i=1 are estimated N0 for the mover group and pˆ (xj ) j=1 are estimated for the stayer group. Applied economists have used many different matching estimators. For example, nearest neighbor matching uses only
13 Workers in cities earn 33% more than their nonurban counterparts. Glaeser and Mare (2001) show that a portion of the urban wage premium is a wage growth, not a wage level, effect. They conclude that cities speed the accumulation of human capital. 14 All variables take their first job values. 15 Migration is a human capital investment, in which an individual trades off the initial cost of moving against the discounted future sum of increased earnings. It is widely believed in the human capital literature that individuals from wealthier families face lower discount rates because their parents will lend them money for such a human capital investment at a low or zero interest rate. 16 For example, the assumption that father's education does not affect wages is the identifying assumption in the seminal paper of Willis and Rosen (1979) on the decision to go to college. Given their assumption, father's education makes a good IV but should not be included in the propensity score. However, others may find their assumption too strong, since father's education may affect wages through the father's connections. In this case it should be included in the propensity score but would not be a valid IV. 17 Of course one could argue that living at age 14 in the same county where an individual was born could still affect wages, e.g. because of local network effects, although note that these effects will be captured in part by the job 1 wages. In a previous version of the paper, we experimented with including the same county variable in the propensity score, but this had very little effect on the matching estimates, suggesting that these local network effects on wages are small (after conditioning on job 1 wages).
For example, nearest neighbor matching uses only the closest stayer (in terms of the estimated propensity score) to estimate $E(Y^0 | D = 1, p(X) = \hat p(x))$ for a mover. Using only one observation from the non-treated group is likely to be inefficient, and alternative matching estimators have been suggested. For example, Heckman et al. (1997, 1998b) propose local polynomial matching estimators, including kernel matching and local linear matching. Hirano et al. (2003) suggest a weighted estimator (with the inverse of a nonparametric estimate of the propensity score as weights). Frölich (2004) suggests using a local linear ridge regression matching estimator, and finds that it performs well in his Monte Carlo experiments. (He also finds that local linear regression matching performs well when the control-treated ratio is relatively high, as in our application.) We also use local cubic matching, since it offers the possibility of reducing the bias of the estimates – albeit at the cost of increasing the variance – and it has not been used previously in the matching literature.18

For each observation $i$ ($i = 1, \ldots, N_1$) in the mover group, local regression matching opens a window around $\hat p(x_i) = p_i$ and uses all observations in the stayer group with estimated propensity scores in that window to construct a weighted mean, $\hat m(p_i)$, to approximate the counterfactual $E(Y^0 | D = 1, p(X) = p_i)$. Within the window, the closer $\hat p(x_j)$ is to $p_i$, the greater the weight observation $j$ gets in estimating $\hat m(p_i)$. Local polynomial regression matching methods construct the counterfactuals by solving the following minimization problem for each mover $i$ ($i = 1, \ldots, N_1$):

$$\min_{\beta_0,\beta_1,\ldots,\beta_L}\;\sum_{j=1}^{N_0}\left[Y_j^0-\sum_{l=0}^{L}\beta_l\left(\hat p(x_j)-p_i\right)^{l}\right]^2 K\!\left(\frac{\hat p(x_j)-p_i}{h}\right). \tag{3.4}$$
In (3.4), $K(\cdot)$ is a kernel weighting function, $h$ is the bandwidth, and $L$ is the order of the polynomial chosen by the researcher. This minimization problem yields $\hat m(p_i) = \hat\beta_0$ as the counterfactual estimate for mover $i$. Given the large-sample results of Fan and Gijbels (1996, Section 3.3.2), we consider both local linear regression (LLR) matching, the case $L = 1$, and local cubic regression (LCR) matching, the case $L = 3$, but not kernel or local quadratic regression matching.19 However, in small samples local linear regression can lead to a very rugged curve in regions of sparse or clustered data (Seifert and Gasser, 1996). A local linear ridge regression matching estimator is likely to perform well in such situations, since it is a local linear estimator that imposes a penalty on large slopes of the local regression line. Frölich (2004) finds that matching based on local linear ridge regression very often outperforms other matching estimators, including nearest neighbor, kernel and local linear matching; moreover, it is robust across simulation designs. Thus we also use the following local linear ridge regression (LLRR) matching estimator: for each mover $i$ with $\hat p(x_i) = p_i$, the estimated counterfactual $E(Y^0 | D = 1, p(X) = p_i)$ from LLRR matching is

$$\hat m_R(p_i) = \frac{T_0}{S_0} + \frac{T_1\,(p_i - \bar p_i)}{S_2 + r\,h\,|p_i - \bar p_i|}, \tag{3.5}$$

where

$$S_a(p_i) = \sum_{j=1}^{N_0}\left(\hat p(x_j)-\bar p_i\right)^{a} K\!\left(\frac{\hat p(x_j)-p_i}{h}\right) \quad\text{for } a = 0, 2,$$

$$T_b(p_i) = \sum_{j=1}^{N_0} Y_j^0\left(\hat p(x_j)-\bar p_i\right)^{b} K\!\left(\frac{\hat p(x_j)-p_i}{h}\right) \quad\text{for } b = 0, 1,$$

and

$$\bar p_i = \frac{\displaystyle\sum_{j=1}^{N_0}\hat p(x_j)\,K\!\left(\frac{\hat p(x_j)-p_i}{h}\right)}{\displaystyle\sum_{j=1}^{N_0} K\!\left(\frac{\hat p(x_j)-p_i}{h}\right)}.$$
18 We thank Stephen Cosslett for suggesting that we consider this possibility. 19 They show that using an even order polynomial of degree k, as opposed to one of k + 1, increases bias in, but does not reduce the variance of, local regression estimators.
Again, $K(\cdot)$ is a kernel weighting function, $h$ is a bandwidth and $r$ is the ridge parameter.20 We follow the rule of thumb from Seifert and Gasser (2000) and use a Gaussian kernel with $r$ set equal to $\left(4\sqrt{2\pi}\int\phi^2(u)\,du\right)^{-1} \approx 0.35$.
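To fix ideas, the following is a minimal illustrative sketch of these estimators in Python. It is not the authors' code: the function and variable names (local_poly_counterfactual, llrr_counterfactual, p_stayers, y0_stayers), the bandwidth value and the synthetic data at the bottom are our own assumptions.

import numpy as np
import statsmodels.api as sm

def gaussian_kernel(u):
    # Standard normal density, used as the kernel weighting function K(.)
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def local_poly_counterfactual(p_i, p_stayers, y0_stayers, h, L=1):
    # Solve the weighted least squares problem in Eq. (3.4); the intercept
    # beta_0_hat is m_hat(p_i), the counterfactual for one mover.
    d = p_stayers - p_i
    w = gaussian_kernel(d / h)
    X = np.vander(d, N=L + 1, increasing=True)   # columns 1, d, ..., d^L
    sw = np.sqrt(w)                              # sqrt weights for WLS via lstsq
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y0_stayers * sw, rcond=None)
    return beta[0]

def llrr_counterfactual(p_i, p_stayers, y0_stayers, h, r=0.35):
    # Local linear ridge regression counterfactual, Eq. (3.5), with the
    # Seifert-Gasser rule-of-thumb ridge parameter r of about 0.35.
    k = gaussian_kernel((p_stayers - p_i) / h)
    p_bar = np.sum(p_stayers * k) / np.sum(k)
    S0, S2 = np.sum(k), np.sum((p_stayers - p_bar) ** 2 * k)
    T0, T1 = np.sum(y0_stayers * k), np.sum(y0_stayers * (p_stayers - p_bar) * k)
    return T0 / S0 + T1 * (p_i - p_bar) / (S2 + r * h * np.abs(p_i - p_bar))

# Synthetic illustration: probit propensity scores, then the ATET as the
# mean of (mover wage growth - matched counterfactual).
rng = np.random.default_rng(0)
n = 2078
x = rng.normal(size=n)
d = (0.5 * x + rng.normal(size=n) > 1.0).astype(int)    # synthetic movers
y = 0.1 + 0.05 * x + 0.02 * d + rng.normal(0, 0.3, n)   # synthetic wage growth
X_design = sm.add_constant(x)
p_hat = sm.Probit(d, X_design).fit(disp=0).predict(X_design)
cf = [local_poly_counterfactual(p, p_hat[d == 0], y[d == 0], h=0.05, L=1)
      for p in p_hat[d == 1]]
atet = np.mean(y[d == 1] - np.array(cf))

In this sketch, LLR matching corresponds to L = 1 and LCR matching to L = 3 in Eq. (3.4), while the ridge term in the denominator of Eq. (3.5) is what stabilizes the local slope over sparse or clustered regions.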
To implement all of these matching estimators, we must choose a bandwidth or smoothing parameter; this choice is often the most important decision a researcher makes in nonparametric regression. Basically, there are two types of bandwidths: global (fixed) bandwidths and local (variable) bandwidths. The global bandwidth approach uses the same window width at each point where we run a local regression, while the variable bandwidth approach changes the bandwidth according to the data density around the particular point. In other words, the variable bandwidth approach allows us to use a small bandwidth where the probability mass is dense and a larger bandwidth where the probability mass is sparse. As Fan and Gijbels (1992, p. 2013) put it, ''A different amount of smoothing is used at different data locations''. Given that local linear estimators have a well-known problem over regions of sparse or clustered data, for both LLR and LCR matching21 we use the local adaptive bandwidth proposed by Fan and Gijbels (1996).22 Here the size of the window $h(p_i)$ varies for each mover $i$ with a different propensity score $\hat p(x_i) = p_i$. Specifically, $h(p_i)$ is chosen to include the same number $k_n$ of stayers closest to $p_i$ to fit the local regression. The term $k_n$ is determined by the sample size $n$ and will grow (slowly) as the sample size grows. LLRR matching should be less sensitive to the distribution of the data, and when implementing this estimator we follow Frölich (2004) and use a global bandwidth chosen by leave-one-out cross-validation. Therefore, use of the local linear ridge regression estimator allows us to see the sensitivity of our estimates to changing estimation methods and to changing from a local to a global bandwidth.

3.1.4. Standard error estimation

Although previous studies have almost uniformly used the standard bootstrap to obtain standard errors for matching estimators, no formal justification for its use has been established for any matching estimator. Moreover, Abadie and Imbens (2008) show that the bootstrap is in general not valid for nearest neighbor matching, even when the estimator is root-N consistent and asymptotically normally distributed with zero asymptotic bias. This problem occurs because of the extreme non-smoothness of nearest neighbor matching. While one may argue plausibly that the standard bootstrap is more likely to be appropriate for our matching estimators, since they are certainly smoother than nearest neighbor matching, the fact remains that there is no formal result justifying the use of the standard bootstrap for our matching estimators. To investigate whether the standard bootstrap is appropriate for our application, we conducted a Monte Carlo study calibrated to match our application as closely as possible. We also carry out a sensitivity analysis in terms of changing important parameters. We find that the bootstrap does indeed work very well for LLR matching estimators.23 Since the Monte Carlo design is calibrated to our application, for clarity we discuss the Monte Carlo study in Section 6, after we present the empirical results in Section 5.

20 See Seifert and Gasser (1996, 2000).
21 Local cubic regression is likely to suffer from the same problem, probably more severely.
22 Ruppert et al. (1995) derive three optimal fixed (global) bandwidth selectors for local linear regression. We considered their preferred selector, the direct plug-in bandwidth selector (p. 1262). However, it produced matching estimates with large standard errors.
23 We thank an anonymous referee for suggesting a Monte Carlo approach to address this concern about the consistency of the standard bootstrap in our setting.
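As a rough sketch of the two bandwidth schemes described above (again under our own naming assumptions, and reusing llrr_counterfactual from the sketch above), one might write:

import numpy as np

def adaptive_bandwidth(p_i, p_stayers, k_n):
    # Local (variable) bandwidth in the spirit of Fan and Gijbels (1996):
    # widen the window around p_i until it contains the k_n nearest stayers.
    return np.sort(np.abs(p_stayers - p_i))[k_n - 1]

def loo_cv_bandwidth(p_stayers, y0_stayers, grid):
    # Global bandwidth by leave-one-out cross-validation: predict each
    # stayer's outcome from the remaining stayers and pick the h that
    # minimizes the squared prediction error.
    n = len(p_stayers)
    errors = []
    for h in grid:
        err = 0.0
        for j in range(n):
            mask = np.arange(n) != j
            pred = llrr_counterfactual(p_stayers[j], p_stayers[mask],
                                       y0_stayers[mask], h)
            err += (y0_stayers[j] - pred) ** 2
        errors.append(err)
    return grid[int(np.argmin(errors))]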
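A minimal sketch of the ordinary (pairs) bootstrap used for the standard errors in Section 3.1.4 may also help. Here estimate_atet is a hypothetical wrapper around the full pipeline (probit, trimming, matching), and the fixed B below stands in for the data-determined number of repetitions discussed next.

import numpy as np

def bootstrap_se(movers, stayers, estimate_atet, B=300, seed=0):
    # Resample movers and stayers with replacement, re-run the whole
    # estimation on each resample, and take the standard deviation of
    # the B bootstrap ATET estimates as the standard error.
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(B):
        bm = movers[rng.integers(0, len(movers), size=len(movers))]
        bs = stayers[rng.integers(0, len(stayers), size=len(stayers))]
        draws.append(estimate_atet(bm, bs))
    return float(np.std(draws, ddof=1))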
In implementing the ordinary bootstrap, an important decision is choosing the number of bootstrap repetitions, and here we follow the procedure developed in Andrews and Buchinsky (2000, 2001), who propose a three-step method for choosing the number of bootstrap repetitions. We describe the Andrews and Buchinsky procedure for calculating standard errors in Appendix B.

3.1.5. Common support constraint and balancing tests

As noted above, matching should only be used to estimate the sample ATET over the portion of the support of the covariates where each mover can find a reasonable number of stayers. To ensure that we are comparing comparables in terms of the chosen covariates, we add a common support constraint, following the procedure proposed by Heckman et al. (1997). Using their notation, we set the trimming level q = 5.24 To test the sensitivity of our matching estimators to the trimming level, we also consider q = 3 and q = 7 for our baseline model, and find that this does not affect our results. Further, we do not conduct any trimming when we use the LLRR matching estimator, and find that this does not have a qualitative effect on our results. One explanation for this lack of sensitivity is that our distributions of propensity scores for treatments and controls are much more similar than those in matching applications aimed at estimating the ATET for training programs.

Again following much of the literature, we choose the functional form of the variables in the index function to ensure, via formal tests, that the conditioning variables X are distributed identically across the treatment group and the matched sample. Specifically, for a given functional form of the index function, we test whether our empirical model for the propensity score balances the sample via two types of tests: paired t-tests and joint F tests.25 Paired t-tests examine whether the mean of each element of X for the treatment group is equal to that of the matched sample. However, paired t-tests are not able to detect differences between two distributions beyond the sample means. Since all matching methods require that the two distributions mimic each other at each quantile, instead of just exhibiting similar means, we also conduct a joint F test, as proposed originally by Rosenbaum and Rubin (1985). The treatment group and matched sample are broken down into quartiles according to the estimated propensity scores.26 At each quartile, we test whether the means of all elements of X are jointly different across the two groups. If a model fails to pass either the t-tests or the F tests, we add higher order terms or interaction terms until the variables are balanced across the two groups.

3.1.6. Finer balancing and allowing the treatment effect to differ by educational group

We find that education plays a major role in the migration decision and wage rate determination, thus suggesting that it may be inappropriate to match across education groups, e.g. to use a high school dropout who stays to estimate the counterfactual for a college graduate who moves. To avoid this problem, in calculating the ATET we only match stayers to mover i if they are in mover i's educational group.

24 Their procedure is designed to trim q%–2q% of movers. Because this is a data-driven approach, the exact trimming level depends on the data structure. Fewer movers will be eliminated as the modes and shapes of the mover and stayer distributions become more similar.
25 Our tests are based on results from nearest neighbor matching. Although Abadie and Imbens (2008) show that bootstrap cannot be used to estimate standard errors for nearest neighbor matching, in the balancing tests, we do not need to calculate standard errors for the nearest neighbor estimates. 26 The number of intervals used in joint F tests depends on sample size. We have 378 movers, so we can only afford to break them down into quartiles. If larger samples are available, finer intervals, such as deciles, should be used.
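To illustrate the two balancing tests of Section 3.1.5, here is a hedged sketch. The joint F test is implemented below as a two-sample Hotelling T^2 statistic within each propensity score quartile, which is one common implementation and may differ in detail from the exact Rosenbaum and Rubin (1985) test; all names are our own assumptions.

import numpy as np
from scipy import stats

def paired_t_stats(X_movers, X_matched):
    # One paired t-statistic per conditioning variable: each mover is
    # compared with the stayer matched to that mover.
    return np.array([stats.ttest_rel(X_movers[:, j], X_matched[:, j]).statistic
                     for j in range(X_movers.shape[1])])

def hotelling_F(X1, X2):
    # Two-sample Hotelling T^2 test of equal mean vectors, converted to
    # an F statistic with (p, n1 + n2 - p - 1) degrees of freedom.
    n1, n2, p = len(X1), len(X2), X1.shape[1]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    T2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(S, diff)
    return (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2

def quartile_F_stats(X_movers, p_movers, X_matched, p_matched):
    # Break both groups into quartiles of the estimated propensity score
    # and test joint mean equality within each quartile.
    cuts = np.quantile(p_movers, [0.25, 0.5, 0.75])
    qm, qs = np.digitize(p_movers, cuts), np.digitize(p_matched, cuts)
    return [hotelling_F(X_movers[qm == q], X_matched[qs == q])
            for q in range(4)]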
Rosenbaum and Rubin (1983) define such a procedure as finer balancing. Following Rosenbaum and Rubin (1985), we first estimate the propensity score using the entire sample,27 as we do not have enough data to estimate a different index function for each of the four educational groups (high school dropouts, high school graduates, those with some college, and college graduates). We then match movers with stayers in the same educational group based on the estimated propensity score to get an estimate of the ATET for the whole sample. We also estimate the ATET for each educational group, since the migration effect is likely to differ by level of schooling. For example, we would expect that it is much easier for college graduates to search for and find a higher wage job in a new location without first moving there than it is for high school dropouts. Letting S denote schooling class and s denote a particular schooling level, the ATET for a schooling level is

$$\text{ATET}_s = E(Y^1 - Y^0 \mid D = 1, S = s) = E(Y^1 \mid D = 1, S = s) - E(Y^0 \mid D = 1, S = s). \tag{3.6}$$
To obtain the first term in Eq. (3.6), we take the mean increase in wages for movers in schooling class s who satisfy the common support constraint. To obtain the second term, we use the propensity score matching estimators presented in Section 3.1.3 and only match stayers to mover i if they are in mover i's educational group. In our empirical work below we find that it is indeed important to allow the treatment effect to vary by educational group.

3.2. Semi-parametric IV estimation of local average treatment effects with covariates28

As discussed in Section 3.1.2, while moving cost variables are very important for achieving our common support assumption (Pr(D = 1|X) < 1), we do not include them in the propensity score specification and thus do not even need to observe them. However, moving cost variables play a more direct role in our instrumental variables analysis, which estimates local average treatment effects (LATE) of migration.29 Given that we do observe moving cost variables, we estimate the LATE for migration using Frölich's (2007) semi-parametric IV estimator. We consider two versions of the LATE estimator, where the instrumental variable(s) are defined as (i) living at age 14 in the same county where he was born and (ii) living in the same county at age 14 plus a dummy variable coded 1 if his father has a college degree.30 As discussed in Section 3.1.2, living in the same county is an obvious candidate for an instrumental variable. Father's education is a proxy for the ability to finance a move (especially because our respondents are relatively young). It will be a valid IV if, as Willis and Rosen (1979) assume, it does not affect wages directly and is not correlated with the unobservables affecting wages.31 In our case, the instruments are not randomly assigned. To make them proper instruments, it is important to condition on some covariates. Consider the case where we only use a binary instrument, where Z = 0 if an individual was still living in the same county at age 14 (bearing high moving costs) and Z = 1 otherwise.

27 Our educational dummy variables (equivalent to Rosenbaum and Rubin's sex dummy variable) are also included in the propensity score model.
28 We thank two anonymous referees for suggesting that we explore LATE estimators given that we have suitable instrumental variables in our data.
29 LATE was introduced by Imbens and Angrist (1994) and further developed by Angrist and Imbens (1995), Angrist et al. (1996), Imbens and Rubin (1997), Heckman and Vytlacil (2001) and Imbens (2001), among others.
30 We will discuss in Section 5.5 how we construct a single instrument based on both variables.
31 Our results below indicate that including this variable in the propensity score model does not change the estimates, while we would expect the estimates to change if father's education also affected the wage. This appears to lend support to the Willis and Rosen assumption in our case.
If we do not use conditioning variables in our analysis, there could be common factors that affect both the instrument and wages. For example, suppose African Americans were less likely to move when they were young (i.e. more likely to have Z = 0). Given that race generally affects wages, our instrument would be invalid if we did not condition on race. We denote our conditioning variables by X, and use a rich set of variables such as past wages, work history, education, race, and marital status.

Here we provide a heuristic description of the Frölich (2007) approach. For individual i let $D_i$ denote the endogenous migration decision as defined in Section 3.1 and $Z_i$ denote the binary instrument as defined above. The outcome variable of interest, $Y_i$, is defined in Section 3.1 as the between-job wage growth. Let $D_{i,z}$ denote the potential participation (migration) status of individual i if the level of the instrument were externally set to z. Note that $D_{i,Z_i} = D_i$ is the observed value of D for individual i. $D_{i,z}$ defines four different types of individuals, denoted by $T \in \{a, n, c, d\}$. Following the standard LATE literature, we call these four types always-takers (a), never-takers (n), compliers (c), and defiers (d). In our case, the always-takers are individuals who will move regardless of $Z_i$ ($D_{i,0} = 1$ and $D_{i,1} = 1$). The never-takers are those who will not move regardless of $Z_i$ ($D_{i,0} = 0$ and $D_{i,1} = 0$). The compliers are those who will move if $Z_i = 1$ and will not move if $Z_i = 0$ ($D_{i,1} = 1$ and $D_{i,0} = 0$). The defiers are those who move when $Z_i = 0$ and vice versa ($D_{i,0} = 1$ and $D_{i,1} = 0$). Let $Y_{i,z}^{d}$ represent the potential outcome for individual i when $D_i$ and $Z_i$ are fixed externally at d (d = 0, 1) and z (z = 0, 1) respectively. The potential outcomes of interest are $Y_{i,Z_i}^{d}$, where d is fixed externally without a change in Z. Again note that the observed outcome for individual i is $Y_{i,Z_i}^{D_i} \equiv Y_i$. Given the following assumptions, Frölich (2007) shows that his conditional LATE estimator is identified.

Assumption 1 (Exogenous Covariates). The conditioning variables X are exogenous in the sense that

$$X_{i,Z_i}^{D_i} = X_{i,z}^{d} \quad \forall d, z,$$

where $X_{i,z}^{d}$ is the potential value of X that would be observed for individual i if $D_i$ and $Z_i$ were set by an external intervention. This assumption will be violated if $X_i$ is affected by $Z_i$ or $D_i$. Among the conditioning variables we use for the matching estimators, we suspect the home ownership variable might violate this condition, because individuals with $Z_i = 0$ (bearing higher moving costs) might be more likely to buy a house at younger ages. Thus we exclude this variable in our LATE analysis.

Assumption 2 (No Defiers). $P(T = d) = 0$.32

This assumption says that changing Z from 0 to 1 will not induce any mover to stay, and changing Z from 1 to 0 will not induce any stayer to move.

Assumption 3 (Existence of Compliers). $P(T = c) > 0$.

To identify the LATE, there have to be individuals in the population whose probabilities of migration, given other conditions, are affected by Z.

Assumption 4 (Unconfounded Type). For all $x \in \text{Supp}(X)$

$$P(T_i = t \mid X_i = x, Z_i = 0) = P(T_i = t \mid X_i = x, Z_i = 1) \quad \text{for } t \in \{a, n, c\}.$$

32 An alternative to Assumption 2 is that the average treatment effect is the same for compliers and for defiers.
This assumption requires that at each level of X , the fractions of compliers, always-takers and never-takers are the same for both the Z = 1 and the Z = 0 groups. This condition will be violated if, for example, given the same X , the individuals with Z = 0 are more likely to be compliers than those with Z = 1. We would expect this assumption to be plausible in our case because we compare two groups of individuals who share many characteristics (included in X ) such as past wages, education, race and occupation.
Assumption 5 (Mean Exclusion Restriction). For all $x \in \text{Supp}(X)$

$$E\left(Y_{i,Z_i}^{0} \mid X_i = x, Z_i = 0, T_i = t\right) = E\left(Y_{i,Z_i}^{0} \mid X_i = x, Z_i = 1, T_i = t\right) \quad \text{for } t \in \{n, c\},$$

$$E\left(Y_{i,Z_i}^{1} \mid X_i = x, Z_i = 0, T_i = t\right) = E\left(Y_{i,Z_i}^{1} \mid X_i = x, Z_i = 1, T_i = t\right) \quad \text{for } t \in \{a, c\}.$$

This assumption rules out a direct effect of Z on Y.33 It carries two conceptually distinct assumptions: an exclusion restriction at the individual level and an unconfoundedness assumption at the population level. For example, for the potential outcome $Y_{i,Z_i}^{1}$ it requires at the individual level that the potential outcome not be affected by an exogenous change in $Z_i$. One scenario that violates this assumption occurs when, conditional on X, individuals with more experience moving (Z = 1) are more adaptable to new situations than those with less experience (Z = 0), since then the outcome would depend directly on Z. At the population level, it requires that the potential outcome $Y_{i,Z_i}^{1}$ is identically distributed in the subpopulations with Z = 1 and Z = 0, and thus it rules out selection effects that are related to the potential outcome. To guarantee unconfoundedness at the population level, it is important to include in X all variables that affect Z and the potential outcomes. The final assumption requires

Assumption 6 (Common Support). $\text{Supp}(X \mid Z = 0) = \text{Supp}(X \mid Z = 1)$.

This assumption requires that the support of X is identical in the Z = 1 and Z = 0 populations. We verify this assumption in Section 5.5.

Let $\gamma^{ATE} = E(Y^1 - Y^0 \mid T = c)$ denote the average treatment effect for the compliers. Given the above assumptions, Frölich's semi-parametric LATE estimator is given by

$$\hat\gamma^{ATE} = \frac{\sum\limits_{i:Z_i=1}\left(Y_i - \hat m_0(X_i)\right) - \sum\limits_{i:Z_i=0}\left(Y_i - \hat m_1(X_i)\right)}{\sum\limits_{i:Z_i=1}\left(D_i - \hat\mu_0(X_i)\right) - \sum\limits_{i:Z_i=0}\left(D_i - \hat\mu_1(X_i)\right)}, \tag{3.7}$$

where $\hat m_z(x)$ and $\hat\mu_z(x)$ are nonparametric regression estimators of $m_z(x) = E(Y \mid X = x, Z = z)$ and $\mu_z(x) = E(D \mid X = x, Z = z)$. Of course, when X consists of a rich set of conditioning variables, we can encounter a curse-of-dimensionality problem because we would have to carry out high-dimensional nonparametric regression. Frölich (2007) shows that propensity-score-based estimators can be constructed for the estimation of $\gamma$ that avoid high-dimensional nonparametric regression. Define $\pi(x) = P(Z = 1 \mid X = x)$ as the propensity score with respect to Z, and define two conditional mean functions $m_{\pi z}(\rho) = E(Y \mid \pi(X) = \rho, Z = z)$ and $\mu_{\pi z}(\rho) = E(D \mid \pi(X) = \rho, Z = z)$ for z = 0, 1. Now the conditional mean functions depend only on the one-dimensional propensity score, and the IV estimator of the LATE is

$$\hat\gamma_{\pi m}^{ATE} = \frac{\sum\limits_{i:Z_i=1}\left(Y_i - \hat m_{\pi 0}(\hat\pi_i)\right) - \sum\limits_{i:Z_i=0}\left(Y_i - \hat m_{\pi 1}(\hat\pi_i)\right)}{\sum\limits_{i:Z_i=1}\left(D_i - \hat\mu_{\pi 0}(\hat\pi_i)\right) - \sum\limits_{i:Z_i=0}\left(D_i - \hat\mu_{\pi 1}(\hat\pi_i)\right)}, \tag{3.8}$$

where $\hat\pi_i$ is a consistent estimator of $\pi_i = \pi(X_i)$. Similarly, Frölich shows that the average treatment effect for the treated compliers, $\gamma^{ATET} = E(Y^1 - Y^0 \mid T = c, D = 1)$, can be estimated using

$$\hat\gamma_{\pi m}^{ATET} = \frac{\sum\limits_{i:Z_i=1}\left(Y_i - \hat m_{\pi 0}(\hat\pi_i)\right)\hat\pi_i - \sum\limits_{i:Z_i=0}\left(Y_i - \hat m_{\pi 1}(\hat\pi_i)\right)\hat\pi_i}{\sum\limits_{i:Z_i=1}\left(D_i - \hat\mu_{\pi 0}(\hat\pi_i)\right)\hat\pi_i - \sum\limits_{i:Z_i=0}\left(D_i - \hat\mu_{\pi 1}(\hat\pi_i)\right)\hat\pi_i}. \tag{3.9}$$

We estimate the average treatment effect for the compliers, $\gamma^{ATE} = E(Y^1 - Y^0 \mid T = c)$, and the average treatment effect for the treated compliers, $\gamma^{ATET} = E(Y^1 - Y^0 \mid T = c, D = 1)$, using (3.8) and (3.9) respectively. Besides these two effects, we can also estimate the fractions of compliers, always-takers and never-takers, even though we cannot identify which individuals they are in our sample. The fractions of compliers, always-takers and never-takers are estimated respectively as

$$P(T = c) = \frac{1}{N}\left[\sum_{i:Z_i=1}\left(D_i - \hat\mu_{\pi 0}(\hat\pi_i)\right) + \sum_{i:Z_i=0}\left(\hat\mu_{\pi 1}(\hat\pi_i) - D_i\right)\right], \tag{3.10}$$

$$P(T = a) = \frac{1}{N}\left[\sum_{i:Z_i=1}\hat\mu_{\pi 0}(\hat\pi_i) + \sum_{i:Z_i=0} D_i\right], \quad\text{and} \tag{3.11}$$

$$P(T = n) = \frac{1}{N}\left[\sum_{i:Z_i=1}(1 - D_i) + \sum_{i:Z_i=0}\left(1 - \hat\mu_{\pi 1}(\hat\pi_i)\right)\right], \tag{3.12}$$

33 Note that for our matching estimator, the key identifying restriction only involves the potential outcome in the absence of treatment. Here we need to restrict potential outcomes with and without treatment.
where $N = N_0 + N_1$ (the number of stayers plus the number of movers) is the total number of observations in our sample. Finally, it is worth emphasizing that the average treatment effect for the treated compliers, $\gamma^{ATET} = E(Y^1 - Y^0 \mid T = c, D = 1)$, is a very different parameter from the average treatment effect on the treated, ATET $= E(Y^1 - Y^0 \mid D = 1)$, in (3.1) (as identified by propensity score matching models). The latter represents the average treatment effect over a well-identified subpopulation, i.e. individuals who participated in the treatment, while the former involves a subpopulation that cannot be identified, in the sense that we cannot distinguish compliers from non-compliers in our data.34 In our IV application, the compliers are the subpopulation that would react (in terms of moving) to an exogenous change in Z.

34 While we cannot identify individual compliers in the population, the distribution of X for the compliers is identified (Frölich, 2007, page 42). Thus one could examine how different the compliers are from the main population with respect to the observables.
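Before turning to the data, a compact sketch of how (3.8)-(3.12) might be computed may be useful. This is our own illustration, not Frölich's code: local_linear below is a simple kernel-weighted local linear smoother, and the bandwidth is a placeholder.

import numpy as np
import statsmodels.api as sm

def local_linear(targets, p_grid, v, h=0.05):
    # Local linear regression of v on p_grid, evaluated at each target point.
    out = np.empty(len(targets))
    for i, p in enumerate(targets):
        w = np.exp(-0.5 * ((p_grid - p) / h) ** 2)   # Gaussian kernel weights
        A = np.column_stack([np.ones_like(p_grid), p_grid - p])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(A * sw[:, None], v * sw, rcond=None)
        out[i] = beta[0]
    return out

def frolich_late(y, d, z, X_design, h=0.05):
    # Probit for pi(x) = P(Z = 1 | X = x), then Eqs. (3.8)-(3.12).
    pi_hat = sm.Probit(z, X_design).fit(disp=0).predict(X_design)
    z1, z0 = z == 1, z == 0
    m0 = local_linear(pi_hat[z1], pi_hat[z0], y[z0], h)    # m_pi0 at Z=1 points
    m1 = local_linear(pi_hat[z0], pi_hat[z1], y[z1], h)    # m_pi1 at Z=0 points
    mu0 = local_linear(pi_hat[z1], pi_hat[z0], d[z0], h)
    mu1 = local_linear(pi_hat[z0], pi_hat[z1], d[z1], h)

    gamma_ate = ((np.sum(y[z1] - m0) - np.sum(y[z0] - m1)) /
                 (np.sum(d[z1] - mu0) - np.sum(d[z0] - mu1)))           # (3.8)
    gamma_atet = ((np.sum((y[z1] - m0) * pi_hat[z1])
                   - np.sum((y[z0] - m1) * pi_hat[z0])) /
                  (np.sum((d[z1] - mu0) * pi_hat[z1])
                   - np.sum((d[z0] - mu1) * pi_hat[z0])))               # (3.9)

    N = len(y)
    p_c = (np.sum(d[z1] - mu0) + np.sum(mu1 - d[z0])) / N               # (3.10)
    p_a = (np.sum(mu0) + np.sum(d[z0])) / N                             # (3.11)
    p_n = (np.sum(1 - d[z1]) + np.sum(1 - mu1)) / N                     # (3.12)
    return gamma_ate, gamma_atet, p_c, p_a, p_n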
4. Data and summary statistics

Our primary data source is the 1979–1996 waves of the NLSY79. The survey began in 1979 with a sample of 12,686 men and women born between 1957 and 1964. Annual interviews were conducted from 1979 to 1994, with biennial interviews thereafter. The NLSY79 provides a comprehensive data set ideally suited to studying migration and job mobility. First, the longitudinal aspects of the data make it possible to track the same individuals over time as they move across jobs and labor markets. Furthermore, the NLSY79 data files include detailed longitudinal records of the employment history of each respondent. Second, the NLSY79 provides categorical information on the distance of a move when a change of address occurs.35 This, in turn, allows us to calculate a distance-based measure of migration and compare our results with more orthodox measures based on a change of county or a change of state. Our distance-based measure of migration corresponds more closely to the theoretical notion of changing local labor markets than do the alternative measures. We define migration as occurring if the respondent either (i) moved at least 50 miles or (ii) changed MSA and moved at least 20 miles.36 We also consider the sensitivity of our results to changes in the definition of migration that are feasible given the information available in the NLSY79. In addition, we compare the estimates based on our migration definition to those based on a county-to-county or a state-to-state definition. We note that a change-of-county definition of migration will misclassify as migrants individuals who move short distances across county lines but do not change labor markets. A change-of-state definition of migration will misclassify as stayers individuals who move hundreds of miles and change labor markets but remain in the same state.

We focus on young men at the outset of their work careers, and restrict the analysis to the wage gain from migration for voluntary transitions between their first and second jobs. We call the first and second jobs after leaving school ''job 1'' and ''job 2'' respectively. We focus on moves from job 1 to job 2 for two reasons. First, individuals do the majority of their moving and job changing at the outset of their work careers. Second, for individuals with similar backgrounds, wage profiles and work experience are more comparable at the beginning of their careers than later. Therefore matching has a better chance of succeeding at the beginning of their careers.

In order to construct a sample suitable for empirical analysis, we introduce several selection criteria. The sample is limited to young men, since we felt that the moving decisions of women can be more complicated and not comparable to those of men. Because our interest lies in post-schooling labor market activity, we follow individuals from the time they leave school. The longitudinal structure of the NLSY79 allows us to determine precisely when most workers make a permanent transition into the labor force. To avoid counting summer breaks or other inter-term vacations as leaving school, we define a schooling exit as the beginning of the first non-enrollment spell lasting at least 12 consecutive months. Accordingly, respondents are excluded from the sample if the date of schooling exit cannot be clearly ascertained from the data. For example, respondents who are continuously enrolled throughout the observation period or who have incomplete or inconsistent schooling information are excluded from the sample. Of the 6403 male respondents in the initial sample, 262 were deleted because an exact date of leaving school could not be determined.
Further, 576 individuals were deleted because they did not hold at least two civilian jobs. Another 116 were lost because they did not report hours and wages on at least two civilian jobs.

35 The researchers who produce the NLSY79 data observe the exact latitude and longitude of the respondent's residence at the time of each interview. To ensure confidentiality, the geocode data providing the latitude and longitude of a respondent's residence are not available for research purposes. Instead the NLSY79 provides to researchers interval information on distance moved.
36 Adjacent county centroids are typically about 25 miles apart, so a move of 50 miles roughly corresponds to a move to a location two counties away.
We eliminated 50 respondents because they reported being fired from their jobs; we imposed this restriction because we wanted to concentrate on voluntary job transitions. Nine respondents were deleted because they were not interviewed during the duration of at least two civilian jobs. We required that the respondent had held at least two jobs that each lasted at least 26 weeks; this resulted in the loss of 32 respondents.37 We excluded from the sample 120 observations with reported hourly wages of less than $1 or greater than $50, as most of these reports appeared to reflect extreme measurement error rather than true wages. Moreover, since we use a distance-based measure of migration, we required respondents to have valid residential location data at the times that they reported holding each job. This requirement resulted in the loss of 1479 respondents; thus the unavailability of location data is by far the largest single reason for sample deletion. We also deleted jobs that overlapped for more than 8 weeks; in this case we considered the respondent to be holding two jobs simultaneously and did not treat that as a job transition. For this reason we lost 132 respondents. We lost another 465 respondents who held two jobs satisfying all of the above criteria, but who reported an intervening job without location data; in this case we did not observe location data for the two consecutive jobs. Further, we lost 855 respondents who satisfied all of the above criteria but who experienced an interval of more than 13 weeks between two jobs. The between-job interval consisted of either a spell of unemployment, non-employment, or employment in part-time jobs (defined as those with average hours less than 25 or lasting less than 26 weeks). We excluded these individuals because we wanted to measure the return to migration conditional on a job change. Of course this raises the issue of whether we may be overstating the return to migration by excluding those who move and cannot find a job. We attempted to investigate this but could not determine from the data whether a migrant had a long spell of non-employment before moving or after moving.38 Finally, we lost 229 respondents because data on the variables used in the analysis were lacking. After imposing these selection criteria, we had a sample of 2078 men.

To recap, we explore individuals' migration conditional on their voluntarily quitting their first job. Our movers consist of young men who quit their first job and moved to a new location, while our stayers consist of those who quit their first job but did not move. In this study, migration was defined to have occurred if the respondent moved at least 50 miles or changed MSA and moved at least 20 miles.

Table 1 presents descriptive statistics. The first two columns contain variable names and definitions. The third, fourth and fifth columns provide means for the whole sample, the movers' sample, and the stayers' sample respectively. The last column presents the difference in means between the movers and stayers. Standard errors for the sample means are in parentheses. The first row of Table 1 shows that 18% of all voluntary job changes involved migration. Panel A of Table 1 summarizes individual characteristics as of the end of job 1. The men in the sample are, on average, 26 years old at the time of the job change, with the movers being slightly younger than the stayers.
37 We also required that the respondent hold at least two jobs with average hours of at least 25 per week, but this restriction did not eliminate any respondents.
38 The problem is that we can determine the dates of employment changes but not the dates of location changes, as we only observe location at the interview dates. Thus someone who is in a new location at the time of the interview and has been unemployed for 6 months could have been unemployed for 6 months in the previous location and have just moved, or he could have been unemployed for 6 months in the new location.
Table 1
Variable definitions and descriptive statistics.

Variable name       Variable definition                                                                     Whole sample    Movers          Stayers         Difference
Migrate             =1 if respondent moved at least 50 miles or changed MSA and moved at least 20 miles     0.18 (0.009)    –               –               –

Panel A: individual characteristics
Age                 Age in years                                                                            26.08 (0.096)   25.99 (0.185)   26.10 (0.110)   -0.106 (0.215)
Black               =1 if African American                                                                  0.23 (0.009)    0.15 (0.018)    0.25 (0.011)    -0.108 (0.021)
Hispanic            =1 if Hispanic                                                                          0.13 (0.007)    0.13 (0.017)    0.14 (0.008)    -0.009 (0.019)
White               =1 if non-Hispanic White                                                                0.63 (0.011)    0.73 (0.023)    0.61 (0.012)    0.12 (0.026)
Dropout             =1 if highest grade completed is less than 12                                           0.17 (0.008)    0.12 (0.017)    0.18 (0.009)    -0.06 (0.019)
High_school         =1 if highest grade completed is equal to 12                                            0.47 (0.011)    0.33 (0.024)    0.50 (0.012)    -0.166 (0.027)
Some_college        =1 if highest grade completed is greater than 12 and less than 16                       0.18 (0.008)    0.18 (0.020)    0.18 (0.009)    0.002 (0.021)
College             =1 if highest grade completed is greater than or equal to 16                            0.18 (0.008)    0.36 (0.025)    0.14 (0.008)    0.227 (0.026)
Married             =1 if married, spouse present                                                           0.40 (0.011)    0.44 (0.026)    0.39 (0.012)    0.057 (0.028)
Home_Owner          =1 if own home on job 1                                                                 0.20 (0.009)    0.14 (0.018)    0.21 (0.010)    -0.068 (0.020)
MSA                 =1 if reside in MSA at time of job 1                                                    0.86 (0.008)    0.84 (0.019)    0.87 (0.008)    -0.029 (0.020)

Panel B: job 1 variables
log(startwage1)     Logarithm of starting wage on job 1 ($1990)                                             2.02 (0.009)    2.13 (0.023)    1.99 (0.010)    0.141 (0.025)
log(endwage1)       Logarithm of ending wage on job 1 ($1990)                                               2.09 (0.010)    2.21 (0.024)    2.07 (0.010)    0.142 (0.026)
Tenure              Tenure on job 1 (years)                                                                 2.60 (0.052)    2.66 (0.121)    2.59 (0.058)    0.070 (0.134)
Professional1       =1 if professional/managerial occupation on job 1                                       0.22 (0.009)    0.39 (0.025)    0.18 (0.009)    0.210 (0.027)
Unemployment rate   County unemployment rate at the end of job 1                                            6.995 (0.069)   6.96 (0.165)    7.00 (0.076)    -0.005 (0.181)

Panel C: job 2 variable
log(startwage2)     Logarithm of starting wage on job 2 ($1990)                                             2.19 (0.010)    2.30 (0.026)    2.16 (0.011)    0.142 (0.029)

Panel D: family background
Mother_college      =1 if mother was a college graduate                                                     0.07 (0.006)    0.12 (0.017)    0.06 (0.006)    0.06 (0.018)
Father_college      =1 if father was a college graduate                                                     0.14 (0.008)    0.25 (0.022)    0.12 (0.008)    0.132 (0.024)
Same_county         =1 if respondent resides at age 14 in the same county as county of birth                0.57 (0.011)    0.49 (0.026)    0.59 (0.012)    -0.096 (0.028)

Notes: 1. Sample size equals 2078, and the sample consists of 378 movers and 1700 stayers. 2. Standard errors of the sample means are in parentheses.
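The mover/stayer comparison in Table 1 can be reproduced mechanically. The sketch below assumes a hypothetical pandas DataFrame df with one row per respondent, a 0/1 migrate column, and columns named after the Table 1 variables; the difference standard error uses the independent-samples formula, which is consistent with the standard errors reported in Table 1.

import numpy as np
import pandas as pd

def mover_stayer_table(df, variables):
    # Means and standard errors for the whole sample, movers and stayers,
    # plus the mover-stayer difference with an independent-samples SE.
    def mean_se(s):
        return s.mean(), s.std(ddof=1) / np.sqrt(len(s))
    rows = {}
    for v in variables:
        w, wse = mean_se(df[v])
        m, mse = mean_se(df.loc[df["migrate"] == 1, v])
        s, sse = mean_se(df.loc[df["migrate"] == 0, v])
        rows[v] = {"whole": w, "whole_se": wse, "movers": m, "movers_se": mse,
                   "stayers": s, "stayers_se": sse,
                   "diff": m - s, "diff_se": np.sqrt(mse ** 2 + sse ** 2)}
    return pd.DataFrame(rows).T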
The percentage of African Americans is about 10 points higher in the stayers' sample than in the movers' sample, the percentage of Hispanics is about the same in both samples, and the percentage of Whites is 12 points higher in the movers' sample than in the stayers' sample. On average, the movers have a higher education level, are more likely to be married, and, prior to migration, are less likely to own a house or live in an MSA. The NLSY79 also provides a relatively rich set of conditioning variables, including detailed information on each respondent's job history and on the characteristics of each job, a feature that makes matching an appealing strategy for estimating the migration effect. Panel B of Table 1 presents job 1 related variables. On job 1, movers on average have higher starting and ending wages, have slightly longer job tenure, and are more likely to have held a professional job. At the end of job 1, county unemployment rates are about the same for movers and stayers.
We use all the variables in Panels A and B as conditioning variables for matching, but exclude the home ownership variable in the LATE estimators because we suspect this variable might violate the exogeneity condition (Assumption 1 in Section 3.2).39 Panel C of Table 1 shows the means of the logarithm of the starting wage on job 2. Again movers have higher average starting wages on job 2. Note that our outcome variable is defined as the logarithm of the starting wage on job 2 minus the logarithm of the ending wage on job 1.

39 Of course, migration decisions will also depend on differences in costs and amenities between the current location and other potential locations. Unfortunately, we cannot include the cost of living in the analysis because available housing price series in the US either: (i) cover a small subset of metropolitan areas; (ii) reflect within-metropolitan-area inflation but not relative price variation between metropolitan areas; or (iii) are reported at an aggregate level, such as the state level.
Between the end of job 1 and the beginning of job 2, the mover and stayer samples on average experience roughly the same wage growth (around 9%–10%). Panel D of Table 1 presents family background variables. Two dummy variables indicate whether the father or mother has a college degree. Movers are about twice as likely as stayers to have a father or mother with a college degree. Finally, as expected, stayers are much more likely than movers to have lived in their birth county at age 14.

There are important differences between movers and stayers in the variables listed in Table 1, and on average movers have more favorable labor market characteristics than stayers. Thus there is reason to suspect, a priori, that selection will be a serious problem that must be addressed when estimating the effect of migration on real wage growth for those who move. One concern is that by eliminating a fairly large number of observations with missing data (and especially missing location measures), we may have introduced sample selection into our analysis. To investigate this, for our conditioning variables we compare mean values for our sample with mean values for the whole NLSY79 male sample (6403 individuals in total), in columns 1 and 2 of Appendix Table A.1 respectively. (The number of observations available for calculating the summary statistics for each variable is presented in the last column.) Among these variables, age, race, parents' education, and same county are from the baseline survey conducted in 1979, and the other variables are from the 1987 survey.40 (1987 is in the middle of our sample period and also is the year for which we have the most observations.) In general, our base sample is very similar to the overall sample in terms of the means of most conditioning variables. Compared to our base sample, the average person in the overall sample has slightly more education, is more likely to be a minority, is less likely to live in a metropolitan area, and is much less likely to reside at age 14 in the same county as the county of birth. The larger proportion of respondents in our sample living at age 14 in the birth county may reflect the fact that geocoded location measures are more likely to be available for respondents who stay in the same place. For our base sample we present the average hourly wages at the end of job 1 and at the beginning of job 2, while for the overall male sample we present the average hourly wage from the main job (CPS job) in 1987. All wages are in 1990 US dollars, and the average wages are very close between the two samples.41

5. Empirical results

5.1. Propensity score specification and balancing tests

Table 2 reports two propensity score models for the migration decision. Model I, our baseline model, contains the individual characteristics and job 1 variables presented in Table 1, except for the local unemployment rate at the end of job 1. Model II is an expanded version of Model I, where we have added to the propensity score two dummy variables indicating whether the father or mother has a college degree, as well as the local unemployment rate at the end of job 1 (before moving, for the movers).

40 The age variable has been adjusted to be the individual's age in 1987.
41 The two samples are fairly close in terms of major socioeconomic variables such as age, race, education, and wages.
Still, it is worth pointing out that while the NLSY79 is a random sample of the US population, our target population consists of individuals who have changed jobs. Thus any difference between these two samples does not necessarily indicate sample selection introduced by missing data or location measures; it may also be due to the difference between the general population and job changers.
Table 2
Propensity score probit estimates for migration.

                            Model I           Model II
Intercept                   -5.82** (1.30)    -5.97** (1.34)
Age/10.0                    4.13** (0.98)     4.15** (1.00)
Age**2/100.0                -0.83** (0.19)    -0.83** (0.19)
Hispanic                    0.03 (0.10)       0.06 (0.10)
Black                       -0.27** (0.09)    -0.25** (0.09)
Dropout                     -0.58** (0.13)    -0.53** (0.14)
High_school                 -0.60** (0.11)    -0.56** (0.11)
Some_college                -0.42** (0.11)    -0.39** (0.12)
Married                     0.16** (0.08)     0.17** (0.08)
MSA1                        -0.30** (0.10)    -0.30** (0.10)
Professional1               0.34** (0.09)     0.35** (0.09)
Home_Owner1                 -0.53** (0.11)    -0.55** (0.11)
log(startwage1)             0.18 (0.11)       0.19 (0.12)
Tenure                      0.02 (0.02)       0.02 (0.02)
log(endwage1)               0.07 (0.11)       0.05 (0.12)
Father_college              –                 0.21** (0.11)
Mother_college              –                 -0.04 (0.13)
County unemployment rate    –                 0.01 (0.01)
Chi-square statistic        8.72              2.30

Notes: 1. Values in parentheses are standard errors. 2. The Chi-square statistic for Model I is from the likelihood ratio test against the model without the three job 1 variables: starting wage, ending wage and tenure. The Chi-square statistic for Model II is from the likelihood ratio test against Model I. The critical value for both tests at the 5% significance level is 7.81.
* Significance at the 10% level.
** Significance at the 5% level.
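As an illustration of how the Table 2 probits and their likelihood ratio tests could be fit, here is a hedged sketch; the DataFrame df and the lower-case column names are our own assumptions.

import statsmodels.formula.api as smf

def fit_table2_models(df):
    # Model I: individual characteristics and job 1 variables.
    base = ("migrate ~ I(age/10) + I(age**2/100) + hispanic + black + dropout"
            " + high_school + some_college + married + msa1 + professional1"
            " + home_owner1 + log_startwage1 + tenure + log_endwage1")
    model_I = smf.probit(base, data=df).fit(disp=0)
    # Model II adds parents' education and the county unemployment rate.
    model_II = smf.probit(base + " + father_college + mother_college"
                          " + unemployment_rate", data=df).fit(disp=0)
    # LR statistic for Model II against Model I; with 3 restrictions the
    # 5% critical value is 7.81, as in the notes to Table 2.
    lr_II_vs_I = 2 * (model_II.llf - model_I.llf)
    return model_I, model_II, lr_II_vs_I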
By comparing the matching estimates of returns to migration in Models I and II, we test the robustness of our propensity score matching estimators to the inclusion of additional variables that might affect estimated returns to migration. The probit coefficients for the demographic variables have the expected signs in both models; all variables are individually significant unless otherwise noted. Consistent with most migration studies, Model I shows that, holding other variables constant, the probability of migration starts to decline at about age 25; Hispanics are more likely, and African Americans less likely, to move than non-Hispanic Whites, although the coefficient on the Hispanic dummy is statistically insignificant; individuals with less than a college degree are less likely to move than those with a college degree; married men are more likely to move than unmarried men, while individuals residing in an MSA when they quit their first job are less likely to migrate than those living in nonmetropolitan areas; and men in professional occupations on job 1 are more likely to migrate, while home ownership has a negative effect on migration. The three work history variables (starting wage, ending wage and job tenure on job 1) are not individually significant, but a likelihood ratio test shows that they are jointly significant.
Table 3
Balancing tests.

Panel A: paired t-tests
                      Model I                          Model II
                      Difference   Paired t statistic  Difference   Paired t statistic
Age                   0.018        0.480               0.298        1.174
Hispanic              0.015        0.585               0.009        0.351
Black                 0.003        0.112               0.018        0.662
Married               -0.023       -0.617              -0.018       -0.465
MSA1                  -0.009       -0.341              0.009        0.321
Professional1         -0.023       -0.853              -0.006       -0.184
Home_Owner1           -0.021       -0.777              -0.033       -1.239
log(startwage1)       -0.022       -0.717              0.008        0.242
Tenure                -0.006       -0.038              0.134        0.797
log(endwage1)         -0.020       -0.605              0.020        0.604
Father_college        –            –                   0.030        1.054
Mother_college        –            –                   -0.030       -1.271
Unemployment rate     –            –                   -0.369       -1.383

Panel B: F test statistics
                                 Model I            Model II
1st quartile                     0.84               0.78
2nd quartile                     0.65               1.07
3rd quartile                     0.81               1.13
4th quartile                     0.90               1.38
Critical value at the 5% level   F(10, 75) = 1.96   F(13, 69) = 1.86

Notes: 1. All tests are based on nearest neighbor matching with q = 5 trimming, and the differences in the variable means are taken between movers and the matched sample of stayers. 2. Each F test is based on the variables included in the respective model. 3. No test statistic is statistically significant at standard confidence levels.
On average, movers have higher hourly wages prior to migration than do stayers. The additional variables in Model II indicate that – holding other variables constant – individuals whose father has a college degree are more likely to move, while neither having a mother with a college degree nor the county unemployment rate has a significant effect (individually or jointly) on the probability of moving.42

We show the distributions of the estimated propensity scores for movers and stayers from Model I in Fig. 1: the solid line is the estimated density function for the movers and the dashed line is the estimated density function for the stayers. The empirical supports of the two distributions are very similar and the modes are quite close, although, as expected, the movers have a higher average probability of moving than the stayers.

Table 3 shows the balancing tests for both models; all the tests are conducted using the mover sample and the matched sample from nearest neighbor matching. Panel A shows the paired t-statistics for the differences in the variable means between movers and the matched sample of stayers. Panel B presents the joint F statistics for the difference in the means of all variables between the two groups at each quartile of the estimated propensity score. In no case is a test statistic near statistical significance at standard test levels.43

5.2. Propensity score matching estimates of the ATET of migration

42 A likelihood ratio test of the null hypothesis that the three additional variables in Model II all have zero coefficients does not reject this null.
43 As has been emphasized in the literature, balancing tests are used to choose the functional form for the propensity score model and do not shed any light on whether the CIA is valid in our application.
Fig. 1. Distributions of the estimated propensity scores. (Table 2 Model I).
Table 4 presents the matching estimates of the effect of migration for movers on wage growth from the baseline propensity score model (Model I) for LLR, LCR and LLRR matching. The results from LLR matching with the trimming level q = 5 and two alternative trimming levels, q = 3 and q = 7, are presented in Panels A, B, and C respectively.44 We calculate the minimum required number of bootstrap repetitions based on the three-step method of Andrews and Buchinsky (2000, 2001). The minimum numbers of repetitions are 241, 251 and 265 for LLR matching with q = 3, q = 5 and q = 7 trimming respectively.45

44 Recall that in the Heckman et al. (1997) algorithm, the actual amount of trimming depends on the similarity of the distributions of the estimated propensity score for movers and stayers. A trimming level of q will result in the deletion of between q% and 2q% of the movers.
Table 4
Matching estimates of the ATET of migration on log wage growth from Model I (baseline model).

Matching estimator                   Overall sample     Dropouts            High_school        Some_college       College grads

Panel A: LLR matching with trimming level q = 5, 25% bandwidth; (200 repetitions) [300 repetitions]
                                     -0.32%             -12.20%*            -4.79%             0.03%              10.42%**
                                     (2.49%) [2.46%]    (6.90%) [7.10%]     (3.87%) [3.96%]    (5.55%) [5.61%]    (5.83%) [5.35%]

Panel B: LLR matching with trimming level q = 3, 25% bandwidth (300 repetitions)
                                     0.06%              -12.20%*            -4.79%             -0.10%             10.56%**
                                     (2.40%)            (7.40%)             (3.70%)            (5.58%)            (5.34%)

Panel C: LLR matching with trimming level q = 7, 25% bandwidth (300 repetitions)
                                     -0.88%             -12.46%*            -4.73%             -0.75%             10.86%*
                                     (2.51%)            (7.36%)             (3.71%)            (5.61%)            (6.06%)

Panel D: LCR matching with trimming level q = 5, 25% bandwidth; (300 repetitions) {1100 repetitions}
                                     -1.64%             -12.51%             -4.90%             -4.48%             9.28%
                                     (9.93%) {8.74%}    (12.52%) {29.26%}   (3.62%) {6.28%}    (5.89%) {15.05%}   (5.62%) {6.22%}

Panel E: LLRR matching without trimming, bandwidth chosen by cross-validation (300 repetitions)
                                     1.90%              -12.60%             -4.40%             -2.20%             14.60%**
                                     (2.63%)            (7.77%)             (3.89%)            (5.41%)            (4.77%)

Notes: 1. The individual migration effect is estimated as each mover's wage growth minus its counterfactual estimated from the corresponding matching estimator. We then average the individual effects over the corresponding samples to obtain the estimates in this table. 2. The trimming procedure is designed to trim q%–2q% of movers. Because this is a data-driven approach, the exact trimming level depends on the data structure. 3. Standard errors calculated from the corresponding bootstrap repetitions are in parentheses. 4. To implement finer balancing matching with a 25% bandwidth, we first choose a variable bandwidth to give each mover a comparison group equal to 25% of the stayers. We then use only those in the group who are in the same educational category as the mover in question. Each mover gets far less than 25% of stayers in the local regression.
* Significance at the 10% level.
** Significance at the 5% level.
For q = 5 trimming, we present the standard errors from 200 and 300 repetitions in Panel A. The two sets of standard errors are relatively close because 200 repetitions is not far below the required minimum of 251 replications. For q = 3 and q = 7 trimming, we present only the standard errors calculated from 300 bootstrap repetitions, more than the number of replications required by the Andrews–Buchinsky procedure. For all three LLR matching estimators we choose a 25% bandwidth, which gives us a wide enough window when we disaggregate the data by educational class.46 Point estimates are quite close across the three trimming levels. The results in Column 1 of the top three panels show a quite small, and statistically insignificant, effect of migration for the overall sample. When we disaggregate by education, the effect of migration for high school dropouts is estimated to be about -12%, and this estimate is significant at the 10% level. College graduates who migrate experience approximately 10.5% greater wage growth, and the estimates are statistically significant at the 5% level for q = 3 and q = 5 trimming and at the 10% level for q = 7 trimming. There are no statistically significant migration effects on wage growth for job changers who have only a high school education or some college.

45 For each estimator, we calculate the minimum repetitions required for the overall sample. We then calculate the minimum repetitions required for each educational group separately. Finally, we take the maximum of the five numbers as our required number of repetitions. For example, if the required bootstrap repetitions for the overall sample, high school dropouts, high school graduates, individuals with some college, and college graduates are 200, 289, 256, 235 and 350 respectively, then we use 350 repetitions for that model.
46 To implement finer balancing matching, we first choose a variable bandwidth to give us a comparison group equal to 25% of the stayers. We then use only those in the group who are in the same educational category as the mover in question. Each mover gets far less than 25% of stayers in the local regression. We find that our results are not sensitive to a 1%–2% bandwidth change.
Panel D contains the results when we use LCR matching and q = 5 trimming. Interestingly, the Andrews–Buchinsky algorithm requires 1074 replications, which of course is higher than the number generally used in empirical work. We find it intriguing that the standard errors based on only 300 repetitions are much too small, indicating the importance of letting the data determine the number of repetitions. Further, the standard errors based on the appropriate number of replications are very large, indicating that we are over-fitting the data by using the LCR matching estimator.

Panel E presents estimates from the LLRR matching estimator. As discussed in Section 3.1.3, we follow Frölich (2004) and use a global bandwidth chosen by leave-one-out cross-validation. Because local linear ridge regression is more stable in regions of sparse or clustered data, we apply this estimator to the original data without trimming. Again standard errors are calculated from the bootstrap with 300 repetitions; here the Andrews–Buchinsky method requires only 198 repetitions. Both the point estimates and the standard errors are close to those from LLR matching, except that the estimated effect for college graduates is about 4 percentage points higher and the estimated effect for high school dropouts is no longer significant.

Some may be concerned that the estimated negative effect for high school dropouts is inconsistent with an economic model of migration, but we note the following. First, we estimate a contemporaneous effect of migration on wage growth. Insignificant or negative contemporaneous effects do not necessarily imply that migration is an irrational decision from the perspective of the human capital approach. As noted in Section 2, Borjas et al. (1992a) found that positive returns to migration often are not realized until five or six years after the original migration, and that the initial returns are negative. It is interesting to note that some previous studies (e.g. Polachek and Horvath, 1977; Tunali, 2000) found negative returns for the entire sample, while we find them only for high school dropouts.
Table 5
Matching estimates of the ATET of migration on log wage growth based on an expanded model and two alternative migration definitions.

LLR matching                         Overall         Dropouts        High_school     Some_college    College grads

Panel A: Model I (baseline model) with 25% bandwidth (300 repetitions)
                                     -0.32%          -12.20%*        -4.79%          0.03%           10.42%**
                                     (2.46%)         (7.10%)         (3.96%)         (5.61%)         (5.35%)

Panel B: Model II (expanded model) with 25% bandwidth (300 repetitions)
                                     -0.47%          -10.27%         -4.83%          -0.50%          9.75%
                                     (2.49%)         (7.22%)         (3.90%)         (5.73%)         (6.05%)

Panel C: longer distance (50 miles) migration definition for everyone, Model I with 25% bandwidth (300 repetitions)
                                     -0.33%          -11.31%         -3.74%          -0.05%          8.71%
                                     (2.77%)         (8.20%)         (4.25%)         (5.90%)         (5.93%)

Panel D: shorter distance (20 miles) migration definition for everyone, Model I with 25% bandwidth (300 repetitions)
                                     -0.90%          -10.44%*        -6.35%*         1.20%           11.55%*
                                     (2.41%)         (6.30%)         (3.63%)         (5.15%)         (5.94%)
Note: See notes to Table 4.

Table 6
Misclassification of movers and stayers when a move is defined as change-of-state or change-of-county (number of individuals).

Panel A: change-of-state measure.
  Long distance moves not counted when defining mover: 136
    20 miles < moving distance < 50 miles (changing MSA): 21
    50 miles < moving distance < 100 miles: 56
    Moving distance > 100 miles: 58
    Moving distance > 500 miles: 1
  Short distance moves counted when defining mover: 16
    Moving distance < 5 miles: 4
    5 miles < moving distance < 20 miles: 8
    20 miles < moving distance < 50 miles (not changing MSA): 4
Panel B: change-of-county measure.
  Long distance moves not counted when defining mover: 0
  Short distance moves counted when defining mover: 164
    Moving distance < 5 miles: 17
    5 miles < moving distance < 20 miles: 88
    20 miles < moving distance < 50 miles (not changing MSA): 59
only for high school dropouts. Migration may involve an assimilation process: a short-term wage loss need not, and probably does not, imply a drop in lifetime utility from moving if individuals catch up later. Second, unlike most migration studies, our study estimates a migration effect that has netted out the effect of job changing, and thus our results do not imply that any group experiences a negative return from job changing. Third, it is possible that return and repeat migration are driving the negative returns for high school dropouts, and we do observe more repeat and return migration for high school dropouts than for other education categories.47 To explore this possibility, we excluded those with repeat or return migration from the mover sample. This modification, however, did not change the negative migration effect for high school dropouts or the positive effect for college graduates.

5.3. Robustness of the matching estimates

Comparing Panels A and B of Table 5 (Panel A is repeated from Table 4 Panel A for comparison purposes) illustrates how the estimates of the ATET change when we add the county unemployment rate and two dummy variables indicating whether the father or mother has a college

47 Return migration within two years is 36% for dropouts and 24% for the overall sample. Repeat migration within two years is 36% for dropouts and 22% for the whole sample.
degree to the propensity score; here we again use LLR matching (with q = 5 trimming). The estimated effects are quite close, except that the return for high school dropouts changes from −12.2% to −10.27% when we use the additional variables. The standard errors are generally larger under the expanded model with the additional conditioning variables, and as a result the migration effects for high school dropouts and college graduates are no longer statistically significant.

Next we investigate how sensitive our matching estimates are to the (distance-based) threshold used to define migration. As noted above, to this point we have defined migration as having occurred if the respondent moved at least 50 miles, or changed MSA and moved at least 20 miles. In Panels C and D of Table 5 we use two alternative definitions of migration that require moving at least 50 miles or at least 20 miles respectively, regardless of whether the move involved changing MSA; again the results are based on LLR matching (with q = 5 trimming). When we use the 50 miles criterion, we find that the absolute values of the point estimates for high school dropouts and college graduates fall somewhat while their standard errors rise, and all estimates become insignificant, even at the 10% level. However, our inference is not affected as much by using a migration definition based on moving 20 miles for everyone, except that the negative migration effect for high school graduates becomes significant at the 10% level (see Panel D). We would argue that within the US a move between MSAs of 20 miles or greater involves changing labor markets, and thus our original definition of migration makes the most sense.

5.4. Using state-to-state or county-to-county definitions of migration

Our distance-based measure of migration allows us to use a definition of migration which we believe is superior to two alternative definitions commonly used in the literature: changing state of residence or changing county of residence. For example, MSAs often spread across several counties and even states. Table 6 presents the numbers of moves under the three definitions. In Panel A we investigate how a state-to-state criterion misclassifies migration given our definition. We see first that a state-to-state measure misses 136 moves that our definition catches; moreover, these 136 moves include 56 moves of between 50 and 100 miles and 59 moves of more than 100 miles. Further, the state-to-state measure counts 16 short-distance moves that we would not consider to represent migration under our definition. Next, in Panel B we examine the differences between our definition of migration and one based on moves across county borders. The county-to-county measure does
Table 7
Matching estimates of the ATET of migration on log wage growth based on another two alternative definitions of migration. LLR matching.

Panel A: base distance measure, Model I, base sample with 25% bandwidth (300 repetitions):
  Overall −0.32% (2.32%); Dropouts −12.20%* (7.25%); High_school −4.79% (3.65%); Some_college 0.03% (5.66%); College grads 10.42%** (5.06%)
Panel B: change-of-state measure, Model I, base sample with 25% bandwidth (300 repetitions):
  Overall 0.03% (2.91%); Dropouts −7.13% (8.49%); High_school −4.71% (5.41%); Some_college −1.67% (6.86%); College grads 8.59% (6.70%)
Panel C: change-of-county measure, Model I, base sample with 25% bandwidth (300 repetitions):
  Overall −1.98% (2.05%); Dropouts −7.35% (5.70%); High_school −6.23%** (2.98%); Some_college −0.48% (5.20%); College grads 7.27% (4.81%)
Panel D: change-of-county measure, Model I, enlarged sample with 25% bandwidth (300 repetitions):
  Overall −2.04% (1.99%); Dropouts −10.05%** (4.75%); High_school −3.53% (2.91%); Some_college −0.19% (4.57%); College grads 4.14% (4.31%)

Note: See notes to Table 4.
not miss any moves defined by our measure. However, the county-to-county measure would count an extra 164 moves as migration, with 105 of them less than 20 miles. (Such short distances do not seem to imply a change of local labor market.) Note that the differences in the number of moves between our definition and those based on state-to-state or county-to-county definitions are quite large, given that we have only 378 movers under our definition.

In Table 7 we compare the results from our baseline specification for LLR matching with q = 5 trimming (repeated from Table 4 Panel A for comparison purposes) to results based on a state-to-state measure (Panel B) and a county-to-county measure (Panel C). In Panel D, we also present results based on the same county-to-county measure (as used in Panel C) for a sample where we have added to our base sample individuals who do not have a moving distance measure but do have county measures for both job 1 and job 2. When we use the state-to-state measure, all of our treatment effects become quite insignificant. However, in Panel C, when we use a county-to-county measure on our base sample, the migration effect becomes insignificant for high school dropouts and college graduates, but the estimated negative effect for high school graduates becomes statistically significant. For our enlarged sample in Panel D, the migration effect is significant only for high school dropouts, for whom it is negative. Clearly a researcher's inference about these treatment effects would be affected by which definition of migration is used.

5.5. Empirical results from LATE estimators48

As discussed in Section 3.2, we use the same-county variable (a dummy variable coded 1 for a respondent living at age 14 in the same county where he was born) and a dummy variable coded 1 if the individual's father has a college degree as instruments in our LATE estimation. Appendix Table A.2 presents the estimated coefficients for two probit models that estimate the probability of migration using these instruments and other conditioning variables. In column 1 we include only the same-county variable along with the other conditioning variables, and the same-county variable is significant at the 1% level. We denote results based on this specification as Model A. In column 2 we also include a variable indicating whether the father went to college, and a likelihood ratio test indicates that the two instruments are

48 We use Markus Frölich's Gauss program for LATE estimators with covariates, which he kindly supplied to us.
jointly significant at the 1% level. We denote results based on this specification as Model B. Column 1 of Table 8 presents the estimated coefficients for the conditioning variables in a probit equation where the dependent variable is equal to one if the respondent was not living at age 14 in the same county where he was born (indicating lower moving costs), and equal to zero otherwise (indicating higher moving costs). In column 2 (Model B) we show the probit coefficients for the conditioning variables where we construct our dependent variable as equal to one if the respondent was not living at age 14 in the same county where he was born and his father has a college degree (indicating a better financial condition), and equal to zero if neither of these conditions holds. (We drop 847 observations where only one of the conditions holds.)49 In Model A, the African American dummy variable is highly significant. In Model B, the coefficients on the dummy variables for African Americans, high school dropouts, high school graduates and those married are highly significant, while the coefficients for the Hispanic dummy variable and those with some college are significant at the 10% level. The upshot is that it is important to include conditioning variables in our LATE estimators. Based on estimates from both Models A and B in Table 8, Fig. 2 presents the density plots of the estimated "instrument propensity scores", π(x) = P(Z = 1|X = x), for the Z = 1 and Z = 0 groups. In the same spirit as Fig. 1, Fig. 2 summarizes the similarity of the conditioning variables between the two groups, where here the two groups are defined by the instrument. The two densities exhibit substantial overlap under each model (although the degree of similarity is greater for Model A than for Model B), and we expect that Assumption 6 of Section 3.2 is satisfied for both models. As noted above, we consider two LATE parameters: (i) the average treatment effect (ATE) for all compliers and (ii) the average treatment effect on the treated (ATET) for the compliers who moved. Using Model A, where only the same-county variable is an instrument, we estimate the fractions of the sample who are compliers, always-takers and never-takers to be 4.3% (1.64%), 16.3% (1.08%) and 79.4% (1.30%) respectively, where standard errors are in parentheses.50 Further, we estimate the ATE and ATET (on wage growth) for compliers to be −7.6% (50.8%) and 11.8% (253.1%) respectively, with standard errors in parentheses.

49 We thank Markus Frölich for suggesting that we construct the instrument in this way.
50 Standard errors are calculated using the number of repetitions indicated by the Andrews–Buchinsky algorithm.
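These group shares follow the standard LATE accounting under monotonicity. A minimal R sketch, ignoring the conditioning covariates that the estimator we actually use handles (Frölich, 2007), with D the move dummy and Z the binary instrument:

  ## Shares of compliers, always-takers and never-takers under monotonicity.
  complier_shares <- function(D, Z) {
    p1 <- mean(D[Z == 1])          # P(D = 1 | Z = 1)
    p0 <- mean(D[Z == 0])          # P(D = 1 | Z = 0)
    c(compliers     = p1 - p0,     # induced to move by the instrument
      always_takers = p0,          # move regardless of the instrument
      never_takers  = 1 - p1)      # stay regardless of the instrument
  }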
Table 8
Propensity score probit estimates for instruments.

Variable: Model A; Model B.
Intercept: −0.262 (0.962); −1.230 (1.961).
Age/10.0: −0.126 (0.720); 0.009 (1.446).
Age**2/100.0: 0.020 (0.134); −0.014 (0.266).
Hispanic: 0.118 (0.085); −0.308* (0.173).
Black: −0.485** (0.073); −0.910** (0.156).
Dropout: 0.046 (0.114); −1.591** (0.254).
High_school: 0.002 (0.095); −1.077** (0.153).
Some_college: 0.149 (0.101); −0.278* (0.150).
Married: −0.063 (0.060); −0.257** (0.112).
MSA1: 0.067 (0.082); 0.268 (0.169).
Professional1: 0.017 (0.081); 0.058 (0.129).
log(startwage1): 0.158 (0.098); 0.221 (0.173).
Tenure: −0.004 (0.014); −0.023 (0.028).
log(endwage1): −0.007 (0.098); 0.228 (0.173).

Note: For Model A, the dependent variable is one if the respondent was not living at age 14 in the same county where he was born and is zero otherwise. For Model B, the dependent variable is equal to one if both of the following hold: (i) the respondent was not living at age 14 in the same county where he was born and (ii) the father has a college degree, and is equal to zero if neither condition holds. For this model, we drop observations where only one of the conditions holds.
With Model B, which uses both the same-county and the father's education variables as instruments, we estimate the fractions of compliers, always-takers and never-takers to be 10.1% (5.69%), 16.1% (1.27%) and 73.8% (5.76%) respectively. The estimated ATE and ATET for compliers are −22.0% (256.4%) and −1.7% (338.6%) respectively. We also investigate our enlarged sample (as discussed in Section 5.4, adding to our base sample individuals who do not have a moving distance measure but do have county measures for both job 1 and job 2). For our enlarged sample, moving is defined as a change of county between job 1 and job 2. Based on our enlarged sample and Model A, we estimate the fractions of the sample who are compliers, always-takers and never-takers to be 8.1% (1.74%), 19.9% (1.12%) and 71.9% (1.32%) respectively, where standard errors are in parentheses. We estimate the ATE and ATET for compliers to be −17.1% (19.79%) and −6.4% (20.74%) respectively, again with standard errors in parentheses. Thus in each model and sample we estimate the fractions of compliers, always-takers and never-takers relatively precisely, but there are far too few compliers to estimate the ATE and ATET with any precision.51

6. Monte Carlo evidence on the appropriateness of using the bootstrap to calculate standard errors in our application of local linear regression matching

In this section we describe the Monte Carlo experiments we conducted to examine the appropriateness of bootstrapped standard errors in our application of LLR matching. Our base case is calibrated to resemble, as closely as possible, our application.

51 Both LATE estimators require a large amount of data. Frölich and Lechner (2006) also find that LATE estimates are highly variable for individual labor markets. Fortunately, in their case they are able to aggregate individual markets, weighted by the estimated number of compliers, to obtain precise point estimates.
Fig. 2. Distributions of the estimated instrument propensity scores. (Table 8 Model A and Model B).
We also carry out a sensitivity analysis in which we change important parameters of the experiment. As discussed in Section 4, the NLSY79 provides a relatively rich set of variables. To keep the Monte Carlo experiment manageable, we generate two continuous variables, x1 and x2, to capture individual labor market characteristics, such as education and the previous wage, that affect both the outcome and the migration decision. We generate another continuous variable x3 that represents migration costs and thus affects the migration decision but does not affect wages. One can think of x3 as the latent variable determining the dummy variable in the NLSY79 data indicating whether an individual was living at age 14 in the same county where he was born. We generate (x1, x2, x3) as i.i.d. draws from a trivariate normal distribution N(0, Λ). Motivated by our economic setting, we specify that x1 and x2 are positively correlated. For simplicity, in our base case we assume that x1 and x2 are uncorrelated with x3; we relax this in Design A below and find that doing so has no effect on our results. We define the latent variable underlying the individual migration decisions as

M* = β0 + β1 x1 + β2 x2 + β3 x3 + u,   (6.1)
where each u is an i.i.d. draw from N(0, σ²), independent of (x1, x2, x3).52 If M* > 0, an individual chooses to move and we observe D = 1; otherwise an individual chooses to stay and we observe D = 0. Since in our data we find that people with better labor market characteristics are more likely to migrate, we set the coefficients on x1 and x2 to be positive. On the other hand, since individuals with higher moving costs are less likely to move, we set the coefficient on x3 to be negative. We have two potential outcome equations, one determining Y0, the outcome without treatment (i.e. log hourly wage if changing job locally), and the other determining Y1, the outcome with treatment (i.e. log hourly wage if taking a job in another city). Individual labor market characteristics x1 and x2 enter the two potential outcome equations through the variable

w = (4x1 + 2x2) / sd(4x1 + 2x2),

where sd(·) denotes the standard deviation.

52 We do not calibrate the mean and variance of (x1, x2, x3) since we can adjust the coefficients in Eq. (6.1).
Table 9
Comparing bootstrap and Monte Carlo standard errors of the ATET estimated by LLR matching.
Covariates: (x1, x2, x3) ∼ N(0, Λ). Migration decision: M* = β0 + β1 x1 + β2 x2 + β3 x3 + u, u ∼ N(0, σ²). Potential outcomes: Y0 = 0.8 + 3.7/(1 + exp(−0.71·w + δ)) + v0, Y1 = 0.8 + 3.7/(1 + exp(−1.0·w + 0.72)) + v1, w = (4x1 + 2x2)/sd(4x1 + 2x2), (v0, v1) ∼ N(0, Ω).

Base case:
  Covariates: Λ = [1.00 0.30 0.0; 0.30 1.00 0.0; 0.0 0.0 1.00]
  Migration decision: β0 = −1.4, β1 = 0.1, β2 = 0.2, β3 = −0.06, σ² = 2.25
  Potential outcome: δ = 0.75, Ω = [0.6 0.12; 0.12 0.6]
  True ATET 0.1053; average ATET estimate 0.1034; MC standard error 0.0462; bootstrap standard error 0.0450
Design A:
  Covariates: Λ = [1.00 0.30 −0.10; 0.30 1.00 −0.12; −0.10 −0.12 1.00]
  Migration decision and potential outcome: same as base case
  True ATET 0.1070; average ATET estimate 0.1047; MC standard error 0.0462; bootstrap standard error 0.0451
Design B:
  Covariates: same as base case
  Migration decision: β0 = −1.38, β1 = 0.08, β2 = 0.15, β3 = −0.12, σ² = 2.25
  Potential outcome: same as base case
  True ATET 0.0962; average ATET estimate 0.0941; MC standard error 0.0456; bootstrap standard error 0.0445
Design C:
  Covariates and migration decision: same as base case
  Potential outcome: δ = 0.65, Ω = [0.6 0.12; 0.12 0.6]
  True ATET 0.0267; average ATET estimate 0.0248; MC standard error 0.0460; bootstrap standard error 0.0449
Design D:
  Covariates and migration decision: same as base case
  Potential outcome: δ = 0.82, Ω = [0.6 0.12; 0.12 0.6]
  True ATET 0.1593; average ATET estimate 0.1574; MC standard error 0.0463; bootstrap standard error 0.0450
Design E:
  Covariates: same as base case
  Migration decision: β0 = −0.95, β1 = 0.1, β2 = 0.2, β3 = −0.06, σ² = 1.0
  Potential outcome: δ = 0.75, Ω = [0.4 0.08; 0.08 0.4]
  True ATET 0.1237; average ATET estimate 0.1221; MC standard error 0.0382; bootstrap standard error 0.0384
Design F:
  Covariates and migration decision: same as base case
  Potential outcome: δ = 0.75, Ω = [0.6 0.24; 0.24 0.6]
  True ATET 0.1056; average ATET estimate 0.1034; MC standard error 0.0460; bootstrap standard error 0.0452

Notes: 1. The true ATET for each design is calculated from one MC sample with 100,000 observations, assuming that both Y0 and Y1 are known for each observation. 2. The average ATET estimate and the MC standard errors are based on 1000 MC samples, each consisting of 2100 observations. 3. For each design, the first MC sample is taken to be the "data available to the researcher". One thousand bootstrap samples are then drawn and used to calculate the standard errors of the ATET.
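The bootstrap in note 3 is equally simple to express in R; est_atet() is a hypothetical wrapper around the propensity score and LLR matching steps, and mc_samples a hypothetical list holding the Monte Carlo samples:

  # data <- mc_samples[[1]]   # first MC sample = "data available to the researcher"
  # boot <- replicate(1000, est_atet(data[sample(nrow(data), replace = TRUE), ]))
  # sd(boot)                  # bootstrap standard error of the ATET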
We then specify the two potential outcome equations as:

Y0 = 0.8 + 3.7/(1 + exp(−0.71·w + δ)) + v0,   (6.2)

and

Y1 = 0.8 + 3.7/(1 + exp(−1.0·w + 0.72)) + v1,   (6.3)
where (v0, v1) are i.i.d. draws from N(0, Ω), independent of (x1, x2). We also specify (v0, v1) to be independent of the error term u in Eq. (6.1) so as to satisfy the Conditional Independence assumption of matching. We specify Ω for each experimental design and always assume that v0 and v1 are positively correlated, to reflect unobserved common factors that affect Y0 and Y1. We set the values of the constants in Eqs. (6.2) and (6.3) to fit the range of the log hourly wage in our data. The parameter δ in Eq. (6.2) allows us to calibrate different treatment effects across our experiments. Further, we specify that both x1 and x2 positively affect the two potential outcomes, and they enter both equations in a highly nonlinear form to account for the potentially nonlinear functional form in our application (thus motivating our use of LLR matching). As noted above, and consistent with our empirical model, x3 is assumed not to enter the outcome equations.
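To make the design concrete, the following R sketch generates one base-case Monte Carlo sample (Table 9, row 1); the seed and the use of MASS::mvrnorm are our own illustrative choices:

  library(MASS)                                  # mvrnorm for correlated normals
  set.seed(1)                                    # illustrative seed
  n <- 2100
  Lambda <- matrix(c(1,   0.3, 0,
                     0.3, 1,   0,
                     0,   0,   1), 3, 3)         # base-case covariate covariance
  x <- mvrnorm(n, mu = rep(0, 3), Sigma = Lambda)
  w <- (4 * x[, 1] + 2 * x[, 2]) / sd(4 * x[, 1] + 2 * x[, 2])
  # Migration decision (6.1), with sigma^2 = 2.25:
  Mstar <- -1.4 + 0.1 * x[, 1] + 0.2 * x[, 2] - 0.06 * x[, 3] + rnorm(n, sd = 1.5)
  D <- as.numeric(Mstar > 0)                     # roughly 18% of observations move
  # Potential outcomes (6.2)-(6.3), base-case delta = 0.75:
  Omega <- matrix(c(0.6, 0.12, 0.12, 0.6), 2, 2)
  v <- mvrnorm(n, mu = c(0, 0), Sigma = Omega)
  Y0 <- 0.8 + 3.7 / (1 + exp(-0.71 * w + 0.75)) + v[, 1]
  Y1 <- 0.8 + 3.7 / (1 + exp(-1.0 * w + 0.72)) + v[, 2]
  Y  <- ifelse(D == 1, Y1, Y0)                   # observed outcome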
Up to this point we have specified the data generating process for individual characteristics (x1, x2, x3), a dummy variable D indicating whether one migrates, and two potential outcomes Y0 and Y1. In estimation, we assume that we observe (x1, x2) but that we do not observe x3 directly; instead we observe a dummy variable z indicating high/low moving costs, comparable to the NLSY79 variable indicating that an individual is living in his birth county at age 14. To estimate the ATET, we use the LLR matching estimator discussed in Section 3 to construct the counterfactual E(Y0|D = 1) for each individual with D = 1. Given our design, the Conditional Independence Assumption of matching is satisfied, as the independence between u and (v0, v1) implies selection only on observables. The second assumption of matching in Eq. (3.3) is also satisfied, since we have a migration cost variable, x3, in the migration decision equation but not in the two outcome equations. When we specify the propensity score model, following the discussion in Section 3.1.2, we include only (x1, x2) and not the observed moving cost variable z. For each experimental design, we generate 1000 Monte Carlo samples of size 2100, which matches the sample size in our actual data.
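Continuing the sketch above, the estimation step can be mimicked as follows; the probit specification is the one just described (x1 and x2 only), while the fixed bandwidth is an illustrative stand-in for the paper's data-driven choice:

  ps <- glm(D ~ x[, 1] + x[, 2], family = binomial(link = "probit"))
  p  <- fitted(ps)                               # estimated propensity scores
  llr0 <- function(p0, bw = 0.05) {              # LLR fit of Y0 among stayers
    w0 <- dnorm((p[D == 0] - p0) / bw)           # Gaussian kernel weights
    X0 <- cbind(1, p[D == 0] - p0)
    drop(solve(t(X0) %*% (w0 * X0), t(X0) %*% (w0 * Y[D == 0])))[1]
  }
  # ATET: movers' outcomes minus the matched counterfactual E(Y0 | D = 1)
  atet <- mean(Y[D == 1] - sapply(p[D == 1], llr0))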
In addition, in each design the intercept in the migration decision process is adjusted to keep the proportion of observations with D = 1 close to 0.18, the fraction of individuals who move in our NLSY79 sample. In our base case, we choose parameter values that set the ATET approximately equal to 10% (or 0.1) and its standard error at about 5% (or 0.05); these values closely match the positive treatment effect and associated standard error that we found for college graduates. To test the sensitivity of the estimated standard errors to the parameter values we chose, we provide six alternative designs.

We present our Monte Carlo evidence in Table 9.53 Each row represents a different Monte Carlo design. Columns 1 and 2 present the assumed covariance matrix Λ for (x1, x2, x3) and the parameter values chosen for the migration decision process, respectively. Columns 3 and 4 contain the parameter values chosen for the potential outcome processes and the true ATET for each design, respectively. Concerning the latter, since we know the true data generating process, we know both Y0 and Y1 for each individual, so the true ATET in the fourth column is estimated by taking the sample average of Y1 − Y0 for individuals with D = 1. Since there are random components in both the migration and potential outcome processes, we estimate the true ATET by generating a sample of 100,000 individuals for each design. Column 5 presents the average of the estimated ATET over the 1000 Monte Carlo samples for each design; the results in this column provide information on the efficacy of the LLR matching estimator in recovering the true ATET. Column 6 shows the estimated standard errors of the ATET based on the 1000 Monte Carlo replications. Finally, column 7 presents the standard errors from the bootstrap; in each design, the first Monte Carlo sample is taken to be the "data available to the researcher", and one thousand bootstrap samples are then drawn and used to calculate the standard errors of the ATET.54 Thus our primary focus in this table lies in comparing the entries in columns 6 and 7 for a given design.

Row 1 of Table 9 shows the results for our base case. Note that the bootstrap standard error of 0.0450 for the base case is very close to the Monte Carlo standard error of 0.0462. Each alternative design from A to F deviates from the base case in one direction. In Design A we allow for non-zero correlation of x3 with x1 and x2; again the bootstrap standard error of 0.0451 is very similar to the Monte Carlo standard error of 0.0462. In Design B, we change the parameter values of the migration decision equation so that, compared to our base case, the migration cost variable x3 plays a more important role and the two job market characteristic variables, x1 and x2, play a less important role.55 Once more the bootstrap standard error of 0.0445 is very close to the Monte Carlo standard error of 0.0456. In Designs C and D we change the parameter δ to reduce and increase the ATET to 2.67% and 15.93% respectively. In both cases the bootstrap standard errors of 0.0449 and 0.0450 are very close to the respective Monte Carlo standard errors of 0.0460 and 0.0463. In Design E, we reduce the variances of the error terms for both the migration and potential outcome equations; to keep the correlation between v0 and v1 constant, we also reduce the covariance between v0 and v1. In Design F, we increase the correlation between v0 and v1 while holding everything else constant. The results for Designs E and F are in rows 6 and 7 respectively, and again the bootstrap standard errors are very close to the respective Monte Carlo standard errors.
Thus we conclude that our use of the bootstrap to obtain standard errors for the LLR matching estimates in our application is appropriate.

53 Our program for the Monte Carlo study, written in R, is available at http://dept.econ.yorku.ca/~xli.
54 As a robustness check, for each design we also randomly chose another Monte Carlo sample with which to calculate standard errors using the bootstrap, and our results did not change.
55 As noted above, to keep the proportion of individuals with D = 1 at 0.18, we also change the intercept.
Finally, although it is not our primary purpose in this section to examine the performance of the LLR matching estimator in terms of bias, we note that the average of the estimated ATET is quite close to the true ATET in each row. These additional results support the argument that LLR matching is appropriate for our empirical application, in which the outcomes are potentially highly nonlinear.

7. Conclusion

Our paper estimates the effect of US internal migration on wage growth for young men between their first and second job. Our analysis of migration differs from previous research in four important ways. First, we exploit the distance-based measures of migration in the NLSY79. Second, we let the effect of migration on wage growth differ by schooling level. Third, we use propensity score matching to address selection issues and to estimate the effect of migration on the wage growth of young men who move, and we argue that this treatment effect is at least as interesting for economists as the effect of migration on wages for a randomly chosen individual. Since matching is a "data hungry" approach, we are fortunate that the NLSY79 provides a rich array of variables on which to match, making our use of the Conditional Independence assumption reasonable. Specifically, we use variables on home ownership, the previous starting wage, ending wage and job tenure, as well as variables on family background and demographics. Finally, we use semi-parametric LATE estimators with conditioning variables to estimate the ATE and ATET for compliers.

We consider a number of matching estimators: local linear, local cubic, and local linear ridge regression. We find that local linear and local linear ridge regression matching produce relatively similar point estimates and standard errors, while our use of local cubic regression matching leads to overfitting the data and large standard errors. Since it is unknown whether the standard bootstrap is appropriate for the matching estimators we consider, we conduct a Monte Carlo study to investigate its validity in our application. The Monte Carlo evidence strongly supports our use of the standard bootstrap to calculate standard errors in our application. We use the Andrews–Buchinsky method to estimate the number of bootstrap replications when we use the standard bootstrap, and find that for the local cubic regression estimator and the LATE estimators (though not for the other matching estimators), the required number of repetitions is much larger than those commonly used in empirical research.

From our matching estimators, we find a significant positive effect of migration on the wage growth of college graduates, and a marginally significant negative effect for high school dropouts. We do not find any significant effects for other educational groups or for the overall sample. Our results are generally robust to changes in the model specification, changes in our distance-based measure of migration, the level of trimming, and changing from a local bandwidth to a global bandwidth. We find that better data matter: if we use a measure of migration based on moving across county lines, we overstate the number of moves, while if we use a measure based on moving across state lines, we understate the number of moves. Further, moving to either of these alternative migration definitions has an important effect on our estimates of the ATET for movers. We also consider LATE estimators of the ATE and ATET of moving for compliers.
Here we consider as instrumental variables (i) whether the individual was living in his birth county at age 14, and (ii) that same-county variable together with a variable indicating whether the respondent's father went to college. However, we have far too few compliers in our data to obtain precise LATE estimates of the ATE and ATET for compliers using either set of instruments.
Appendix A. Tables

See Tables A.1 and A.2.

Table A.1
Descriptive statistics: base sample versus NLSY79 male sample.

Variable (definition): Base sample (N = 2078); NLSY79 male (1987), N.
Age (age in years): 26.08 (0.096); 25.88 (0.029), N = 6403.
Black (=1 if African American): 0.23 (0.009); 0.25 (0.005), N = 6403.
Hispanic (=1 if Hispanic): 0.13 (0.007); 0.16 (0.005), N = 6403.
White (=1 if non-Hispanic White): 0.63 (0.011); 0.59 (0.006), N = 6403.
Dropout (=1 if highest grade completed is less than 12): 0.17 (0.008); 0.20 (0.006), N = 5093.
High_school (=1 if highest grade completed is equal to 12): 0.47 (0.011); 0.44 (0.007), N = 5093.
Some_college (=1 if highest grade completed is greater than 12 and less than 16): 0.18 (0.008); 0.20 (0.006), N = 5093.
College (=1 if highest grade completed is greater than or equal to 16): 0.18 (0.008); 0.16 (0.005), N = 5093.
Married (=1 if married, spouse present): 0.40 (0.011); 0.39 (0.007), N = 5116.
Home_Owner (=1 if own home on job 1): 0.20 (0.009); 0.22 (0.006), N = 5111.
MSA (=1 if reside in MSA at time of job 1): 0.86 (0.008); 0.78 (0.006), N = 4696.
Job 1 ending wage (in 1990 dollars): 8.98 (0.100); –.
Job 2 starting wage (in 1990 dollars): 10.02 (0.118); –.
Hourly wage (1987) (hourly wage of the main job, in 1990 dollars): –; 9.74 (0.084), N = 4373.
Professional1 (=1 if professional/managerial occupation on job 1): 0.22 (0.009); 0.21 (0.006), N = 4582.
Mother_college (=1 if mother was college graduate): 0.07 (0.006); 0.08 (0.003), N = 5934.
Father_college (=1 if father was college graduate): 0.14; 0.14, N = 5494.
Same_county (=1 if respondent resides at age 14 in the same county as county of birth): 0.57 (0.011); 0.40 (0.006), N = 6402.

Note: Standard errors of the sample mean are in parentheses.

Table A.2
Reduced form probit estimates for the migration decision and tests for significance of the excluded instruments.

Variable: Model A; Model B.
Intercept: −6.134** (1.296); −6.182** (1.298).
Age/10.0: 4.589** (0.976); 4.582** (0.977).
Age**2/100.0: −0.925** (0.183); −0.924** (0.183).
Hispanic: 0.059 (0.103); 0.077 (0.103).
Black: −0.201** (0.091); −0.180** (0.092).
Dropout: −0.508** (0.131); −0.571** (0.136).
High_school: −0.606** (0.105); −0.550** (0.110).
Some_college: −0.438** (0.112); −0.405** (0.114).
Married: 0.038 (0.072); 0.046 (0.072).
MSA1: −0.281** (0.096); −0.292** (0.096).
Professional1: 0.328** (0.089); 0.329** (0.089).
log(startwage1): 0.175 (0.114); 0.172 (0.114).
Tenure: 0.005 (0.018); 0.005 (0.018).
log(endwage1): 0.013 (0.114); 0.006 (0.114).
Same_county: −0.177** (0.068); −0.162** (0.069).
Father_college: –; 0.177* (0.098).
Chi-square statistic: –; 9.912.

Note: For Model B the null hypothesis of the likelihood ratio test is that the coefficients of the Same_county and Father_college variables are both equal to zero. The null hypothesis is rejected at the 1% level.
Appendix B. Three-step method for choosing the number of bootstrap repetitions

Andrews and Buchinsky (2000, 2001) propose a three-step method for choosing the number of bootstrap repetitions. We follow their procedure to set the proper number of bootstrap repetitions for calculating the standard error of each parameter we estimate. The following is a special case of Andrews and Buchinsky (2001). We first define the notation related to our problem, following Andrews and Buchinsky (2001). θ is a scalar parameter, and λ is an unknown quantity of interest. In our case θ is the ATET identified by the matching estimator or one of the two parameters identified by the LATE estimators (the ATE and the ATET for compliers), and λ is the standard error of θ̂. B is the number of repetitions, and pdb denotes the measure of accuracy, which is the percentage deviation of the bootstrap quantity of interest based on B bootstrap repetitions from the ideal bootstrap quantity for which B = ∞. The magnitude of B depends on both the accuracy required and the data. If we require the actual percentage deviation to be less than pdb with a specified probability 1 − τ, then the three-step method takes pdb and τ as given and provides a minimum number of repetitions B* that attains the desired level of accuracy. We use a conventional accuracy level, (pdb, τ) = (10, 0.05).

Step 1. Calculate the initial number of repetitions B1. The three-step method depends on a preliminary estimate ω1 of the asymptotic variance ω of B^{1/2}(λ̂_B − λ̂_∞)/λ̂_∞, where λ̂_B and λ̂_∞ are the bootstrap estimates from B and from an infinite number of repetitions respectively. Following Andrews and Buchinsky (2000, 2001), we set a starting value of ω1 = 0.5 in Eq. (B.1). (The three-step method is not too sensitive to this starting value because it uses a finite sample estimate of ω in the last step.) Given this starting value, we then calculate

B1 = int(10,000 · z²_{1−τ/2} · ω1 / pdb²),   (B.1)

where z_{1−τ/2} is the 1 − τ/2 quantile of the standard normal distribution. In our case B1 = 193.
Step 2. Use the bootstrap results θ̂1, θ̂2, ..., θ̂_{B1} to update ω1 to ω_B:

μ_B = (1/B1) Σ_{r=1}^{B1} θ̂_r,   (B.2)

γ_B = [1/(B1 − 1)] Σ_{r=1}^{B1} (θ̂_r − μ_B)⁴ / se⁴_B − 3,   (B.3)

where se_B is the standard deviation of θ̂1, θ̂2, ..., θ̂_{B1}. Then

ω_B = (2 + γ_B)/4.   (B.4)
Step 3. Calculate B2 from

B2 = int(10,000 · z²_{1−τ/2} · ω_B / pdb²).   (B.5)

Then the minimum number of repetitions is B* = max(B1, B2).
References

Abadie, A., 2005. Semi-parametric difference-in-differences estimators. Review of Economic Studies 72, 1–19.
Abadie, A., Imbens, G., 2006. Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235–267.
Abadie, A., Imbens, G., 2008. On the failure of the bootstrap for matching estimators. Econometrica 76, 1537–1557.
Andrews, D.W.K., Buchinsky, M., 2000. A three-step method for choosing the number of bootstrap repetitions. Econometrica 68, 23–51.
Andrews, D.W.K., Buchinsky, M., 2001. Evaluation of a three-step method for choosing the number of bootstrap repetitions. Journal of Econometrics 103, 345–386.
Angrist, J., Imbens, G., 1995. Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association 90, 431–442.
Angrist, J., Imbens, G., Rubin, D., 1996. Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91, 444–472 (with discussion).
Bartel, A.P., 1979. The migration decision: what role does job mobility play? American Economic Review 69, 775–786.
Borjas, G.J., Bronars, S.G., Trejo, S.J., 1992a. Assimilation and the earnings of young internal migrants. The Review of Economics and Statistics 74, 170–175.
Borjas, G.J., Bronars, S.G., Trejo, S.J., 1992b. Self-selection and internal migration in the United States. Journal of Urban Economics 32, 159–185.
Dahl, G.B., 2002. Mobility and the returns to education: testing a Roy model with multiple markets. Econometrica 70, 2367–2420.
Falaris, E.M., 1987. A nested logit migration model with selectivity. International Economic Review 28, 429–444.
Fan, J., Gijbels, I., 1992. Variable bandwidth and local linear regression smoothers. Annals of Statistics 20, 2008–2036.
Fan, J., Gijbels, I., 1996. Local Polynomial Modelling and its Applications. In: Monographs on Statistics and Applied Probability, vol. 66. Chapman & Hall, London.
Frölich, M., 2004. Finite-sample properties of propensity-score matching and weighting estimators. Review of Economics and Statistics 86, 77–90.
Frölich, M., 2007. Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics 139, 35–75.
Frölich, M., Lechner, M., 2006. Exploiting regional treatment intensity for the evaluation of labour market policies. IZA Working Paper No. 2144.
Gabriel, P.E., Schmitz, S., 1995. Favorable self-selection and the internal migration of young white males in the United States. Journal of Human Resources 30 (Summer), 460–471.
Glaeser, E.L., Mare, D.C., 2001. Cities and skills. Journal of Labor Economics 19, 316–342.
Goss, E.P., Schoening, N.C., 1984. Search time, unemployment and the migration decision. Journal of Human Resources 19, 570–579.
Greenwood, M.J., 1997. Internal migration in developed countries. In: Handbook of Population and Family Economics, vol. 1B. Elsevier, Amsterdam, New York, pp. 648–720.
Hahn, J., 1998. On the role of the propensity score in efficient semi-parametric estimation of average treatment effects. Econometrica 66, 315–331.
Hanushek, E.A., 1973. Regional differences in the structure of earnings. Review of Economics and Statistics 55, 204–213.
Heckman, J., Ichimura, H., Todd, P., 1997. Matching as an econometric evaluation estimator: evidence from evaluating a job training program. Review of Economic Studies 64, 605–654.
Heckman, J., Ichimura, H., Smith, J., Todd, P., 1998a. Characterizing selection bias using experimental data. Econometrica 66, 1017–1098.
Heckman, J., Ichimura, H., Todd, P., 1998b. Matching as an econometric evaluation estimator. Review of Economic Studies 65, 261–294.
Heckman, J., LaLonde, R., Smith, J., 1999. The economics and econometrics of active labor market programs. In: Ashenfelter, O., Card, D. (Eds.), Handbook of Labor Economics. Elsevier, Amsterdam, New York.
Heckman, J., Vytlacil, E., 2001. Local instrumental variables. In: Hsiao, C., Morimune, K., Powell, J. (Eds.), Nonlinear Statistical Inference: Essays in Honor of Takeshi Amemiya. Cambridge University Press, Cambridge.
Hirano, K., Imbens, G., Ridder, G., 2003. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 1161–1189.
Hunt, J.C., Kau, J.B., 1985. Migration and wage growth: a human capital approach. Southern Economic Journal 51, 697–710.
Imbens, G., 2001. Some remarks on instrumental variables. In: Lechner, M., Pfeiffer, F. (Eds.), Econometric Evaluation of Labour Market Policies. Physica/Springer, Heidelberg, pp. 17–42.
Imbens, G., 2004. Nonparametric estimation of average treatment effects under exogeneity: a review. Review of Economics and Statistics 86, 4–29.
Imbens, G., Angrist, J., 1994. Identification and estimation of local average treatment effects. Econometrica 62, 467–475.
Imbens, G., Rubin, D., 1997. Estimating outcome distributions for compliers in instrumental variables models. Review of Economic Studies 64, 555–574.
Kennan, J., Walker, J., 2003. The effect of expected income on individual migration decisions. Mimeo, University of Wisconsin.
Lansing, J.B., Mueller, E., 1967. The Geographic Mobility of Labor. Survey Research Center, Ann Arbor, MI.
Lechner, M., 2000. An evaluation of public-sector-sponsored continuous vocational training programs in East Germany. Journal of Human Resources 35, 347–375.
Linneman, P., Graves, P.E., 1983. Migration and job change: a multinomial logit approach. Journal of Urban Economics 14, 263–279.
McCall, B.P., McCall, J.J., 1987. A sequential study of migration and job search. Journal of Labor Economics 5, 452–476.
Nakosteen, R.A., Zimmer, M.A., 1980. Migration and income: the question of self-selection. Southern Economic Journal 46, 840–851.
Nakosteen, R.A., Zimmer, M.A., 1982. The effects on earnings of interregional and interindustry migration. Journal of Regional Science 22, 325–341.
Plane, D.A., 1993. Demographic influences on migration. Regional Studies 27, 375–383.
Polachek, S.W., Horvath, F.W., 1977. A life cycle approach to migration: analysis of the perspicacious peregrinator. In: Ehrenberg, R.G. (Ed.), Research in Labor Economics, pp. 103–149.
Raphael, S., Riker, D.A., 1999. Geographic mobility, race, and wage differentials. Journal of Urban Economics 45, 17–46.
Robinson, C., Tomes, N., 1982. Self-selection and interprovincial migration in Canada. Canadian Journal of Economics 15, 474–502.
Rosenbaum, P.R., Rubin, D.B., 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.
Rosenbaum, P.R., Rubin, D.B., 1985. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician 39, 33–38.
Roy, A.D., 1951. Some thoughts on the distribution of earnings. Oxford Economic Papers 3, 135–146.
Ruppert, D., Sheather, S.J., Wand, M.P., 1995. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association 90, 1257–1270.
Seifert, B., Gasser, T., 1996. Finite-sample variance of local polynomials: analysis and solutions. Journal of the American Statistical Association 91, 267–275.
Seifert, B., Gasser, T., 2000. Data adaptive ridging in local polynomial regression. Journal of Computational and Graphical Statistics 9, 338–360.
Shaw, K., 1991. The influence of human capital investment on migration and industry change. Journal of Regional Science 31, 397–416.
Sjaastad, L.A., 1962. The costs and returns of human migration. Journal of Political Economy 70, 80–93.
Smith, J., Todd, P., 2005. Does matching address LaLonde's critique of nonexperimental estimators? Journal of Econometrics 125, 355–364.
Topel, R.H., Ward, M.P., 1992. Job mobility and the careers of young men. Quarterly Journal of Economics 107, 439–479.
Tunali, I., 2000. Rationality of migration. International Economic Review 41, 893–920.
Willis, R., Rosen, S., 1979. Education and self-selection. Journal of Political Economy 87, S7–S36.
Yankow, J.J., 1999. The wage dynamics of internal migration. Eastern Economic Journal 25, 265–278.
Yankow, J.J., 2003. Migration, job change, and wage growth: a new perspective on the pecuniary return to geographic mobility. Journal of Regional Science 43, 483–516.
Zhao, Z., 2004. Using matching to estimate treatment effects: data requirements, matching metrics, and Monte Carlo evidence. Review of Economics and Statistics 86, 91–107.
Journal of Econometrics 161 (2011) 228–245
Bias in estimating multivariate and univariate diffusions

Xiaohu Wang (a), Peter C.B. Phillips (b,c,d,a), Jun Yu (a,*)

a School of Economics and Sim Kee Boon Institute for Financial Economics, Singapore Management University, 90 Stamford Road, Singapore 178903, Singapore
b Cowles Foundation for Research in Economics, Yale University, Box 208281, Yale Station, New Haven, CT 06520-8281, United States
c University of Auckland, New Zealand
d University of Southampton, United Kingdom
Article info

Article history: Received 22 March 2010; received in revised form 22 March 2010; accepted 15 December 2010; available online 21 December 2010.

JEL classification: C15; G12.

Keywords: Bias; Diffusion; Euler approximation; Trapezoidal approximation; Milstein approximation.
Abstract

Multivariate continuous time models are now widely used in economics and finance. Empirical applications typically rely on some process of discretization so that the system may be estimated with discrete data. This paper introduces a framework for discretizing linear multivariate continuous time systems that includes the commonly used Euler and trapezoidal approximations as special cases and leads to a general class of estimators for the mean reversion matrix. Asymptotic distributions and bias formulae are obtained for estimates of the mean reversion parameter. Explicit expressions are given for the discretization bias and its relationship to estimation bias in both multivariate and univariate settings. In the univariate context, we compare the performance of the two approximation methods relative to exact maximum likelihood (ML) in terms of bias and variance for the Vasicek process. The bias and the variance of the Euler method are found to be smaller than those of the trapezoidal method, which are in turn smaller than those of exact ML. Simulations suggest that when the mean reversion is slow, the approximation methods work better than ML, the bias formulae are accurate, and for scalar models the estimates obtained from the two approximate methods have smaller bias and variance than exact ML. For the square root process, the Euler method outperforms the Nowman method in terms of both bias and variance. Simulation evidence indicates that the Euler method has smaller bias and variance than exact ML, Nowman's method and the Milstein method.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

Continuous time models, which are specified in terms of stochastic differential equations, have found wide applications in economics and finance. Empirical interest in systems of this type has grown particularly rapidly in recent years with the availability of high frequency financial data. Correspondingly, growing attention has been given to the development of econometric methods of inference. In order to capture causal linkages among variables and allow for multiple determining factors, many continuous systems are specified in multivariate form. The literature is now wide-ranging. Bergstrom (1990) motivated the use of multivariate continuous time models in macroeconomics; Sundaresan (2000) provided a list of multivariate continuous time models, particularly multivariate diffusions, in finance; and Piazzesi (2009) discussed how to use multivariate continuous time models to address various macro-finance issues.
* Corresponding author. Tel.: +65 68280858; fax: +65 68280833.
E-mail addresses: [email protected] (X. Wang), [email protected] (P.C.B. Phillips), [email protected] (J. Yu).
doi:10.1016/j.jeconom.2010.12.006
Data in economics and finance are typically available at discrete points in time or over discrete time intervals, and many continuous systems are formulated as Markov processes. These two features imply that the log likelihood function can be expressed as the sum of the log transition probability densities (TPD). Consequently, the implementation of maximum likelihood (ML) requires evaluation of the TPD. But since the TPD is unavailable in closed form for many continuous systems, several methods have been proposed as approximations. The simplest approach is to approximate the model using some discrete time system. Both the Euler approximation and the trapezoidal rule have been suggested in the literature. Sargan (1974) and Bergstrom (1984) showed that the ML estimators (MLEs) based on these two approximations converge to the true MLE as the sampling interval h → 0, at least under a linear specification. This property also holds for more general diffusions (Florens-Zmirou, 1989). Of course, the quality of the approximation depends on the size of h. However, the advantage of the approximation approach is that it is computationally simple and often works well when h is small, for example at the daily frequency. More accurate approximations have been proposed in recent years. In-fill simulations and closed-form approximations are the two that have received the most attention. Studies of in-fill
simulations include Pedersen (1995) and Durham and Gallant (2002). For closed-form approximations, seminal contributions include Aït-Sahalia (1999, 2002, 2008), Aït-Sahalia and Kimmel (2007), and Aït-Sahalia and Yu (2006). These approximations have the advantage that they can control the size of the approximation errors even when h is not small. Aït-Sahalia (2008) provides evidence that the closed-form approximation is highly accurate and allows for fast repeated evaluations. Since the approximate TPD takes a complicated form in both of these approaches, no closed-form expression is available for the MLE, and numerical optimization is needed to obtain it. No matter which of the above methods is used, when the system variable is persistent the resulting estimator of the speed of mean reversion can suffer from severe bias in finite samples. This problem is well known for scalar diffusions (Phillips and Yu, 2005a,b, 2009a,b) but has also been reported in multivariate models (Phillips and Yu, 2005a; Tang and Chen, 2009). In the scalar case, Tang and Chen (2009) and Yu (2009) give explicit expressions to approximate the bias. To obtain these explicit expressions, the corresponding estimators must have a closed-form expression. That is why explicit bias results are presently available only for the scalar Vasicek model (Vasicek, 1977) and the Cox–Ingersoll–Ross (CIR, 1985) model. The present paper focuses on extending existing bias formulae to the multivariate continuous system case. We confine our attention in part to linear systems, so that explicit formulae are possible for approximating the estimation bias of the mean reversion matrix. It is known from previous work that the bias in the mean reversion parameter has some robustness to specification changes in the diffusion function (Tang and Chen, 2009), which gives this approach a wider relevance. Understanding the source of the mean reversion bias in linear systems can also be helpful in more general situations where there are nonlinearities. The paper develops a framework for studying estimation in multivariate continuous time models with discrete data. In particular, we show how the estimator based on the Euler approximation and the estimator based on the trapezoidal approximation can be obtained by taking Taylor expansions to the first and second orders. Moreover, the unified framework simplifies the derivation of the asymptotic bias order of the ordinary least squares estimator and the two stage least squares estimator of Bergstrom (1984). Asymptotic theory is provided under long time span asymptotics, and explicit formulae for the matrix bias approximations are obtained. The bias formulae are decomposed into the discretization bias and the estimation bias. Simulations reveal that the bias formulae work well in practice. The results are specialized to the scalar case, giving two approximate estimators of the mean reversion parameter which are shown to work well relative to the exact MLE when the mean reversion is slow. The results confirm that bias can be severe in multivariate continuous time models for parameter values that are empirically realistic, just as it is in scalar models. Specializing our formulae to the univariate case yields some useful alternative bias expressions. Simulations are reported that detail the performance of the bias formulae in both the multivariate and the univariate settings. The rest of the paper is organized as follows.
Section 2 introduces the model and the setup and reviews four existing estimation methods. Section 3 outlines our unified framework for estimation, establishes the asymptotic theory, and provides explicit expressions for approximating the bias in finite samples. Section 4 discusses the relationship between the new estimators and two existing estimators in the literature, and derives a new bias formula in the univariate setting. Section 5 compares the performance of the estimator based on the Euler scheme relative to that of the method proposed by Nowman (1997) in the context of the square root process and a diffusion process with a linear drift but a more general diffusion function. Simulations are reported in Section 6. Section 7 concludes, and the Appendix collects together proofs of the main results.
2. The model and existing methods

We consider an M-dimensional multivariate diffusion process of the form (cf. Phillips, 1972):

dX(t) = (A(θ)X(t) + B(θ))dt + ζ(dt),   X(0) = X0,   (2.1)

where X(t) = (X1(t), ..., XM(t))′ is an M-dimensional continuous time process, A(θ) and B(θ) are M × M and M × 1 matrices whose elements depend on unknown parameters θ = (θ1, ..., θK) that need to be estimated, and ζ(dt) (:= (ζ1(dt), ..., ζM(dt))′) is a vector random process with uncorrelated increments and covariance matrix Σdt. The particular model receiving most attention in finance arises when ζ(dt) is a vector of Brownian increments (denoted dW(t)) with covariance Σdt, viz.,

dX(t) = (A(θ)X(t) + B(θ))dt + dW(t),   X(0) = X0,   (2.2)
corresponding to a multivariate version of the Vasicek model (Vasicek, 1977). Although the process follows a continuous time stochastic differential equation system, observations are available only at discrete time points, say at n equally spaced points {th}, t = 0, ..., n, where h is the sampling interval and is taken to be fixed. In practice, h might be very small, corresponding to high-frequency data. In this paper, we use X(t) to represent a continuous time process and Xt to represent a discrete time process. When there is no confusion, we simply write Xth as Xt. Bergstrom (1990) provided arguments for why it is useful for macroeconomists and policy makers, such as central bankers, to formulate models in continuous time even when only discrete observations are available. In finance, early fundamental work by Black and Scholes (1973) and much of the ensuing literature, such as Duffie and Kan (1996), successfully demonstrated the usefulness of both scalar and multivariate diffusion models in the development of financial asset pricing theory. Phillips (1972) showed that the exact discrete time model corresponding to (2.1) is given by

Xt = exp{A(θ)h}Xt−1 − A⁻¹(θ)[exp{A(θ)h} − I]B(θ) + εt,   (2.3)
where εt = (ε1, ..., εM)′ is a martingale difference sequence (MDS) with respect to the natural filtration and

E(εt εt′) = ∫_0^h exp{A(θ)s} Σ exp{A(θ)′s} ds := G.
Letting F(θ) := exp{A(θ)h} and g(θ) := −A⁻¹(θ)[exp{A(θ)h} − I]B(θ), we have the system

Xt = F(θ)Xt−1 + g(θ) + εt,   (2.4)

which is a vector autoregression (VAR) model of order 1 with MDS(0, G) innovations. In general, identification of θ from the implied discrete model (2.3) generating the discrete observations {Xth} is not automatic. The necessary and sufficient condition for identifiability of θ in model (2.3) is that the correspondence between θ and [F(θ), g(θ)] be one-to-one, since (2.3) is effectively a reduced form for the discrete observations. Phillips (1973) studied the identifiability of (A(θ), Σ) in (2.3) in terms of the identifiability of the matrix A(θ) in the matrix exponential F = exp(A(θ)h) under possible restrictions implied by the structural functional dependence A = A(θ) in (2.1). In general, a one-to-one correspondence between A(θ) and F requires the structural matrix A(θ) to be restricted. This is because if A(θ) satisfies exp{A(θ)h} = F and some of its eigenvalues are complex, A(θ) is not uniquely identified: adding the imaginary numbers 2ikπ/h and −2ikπ/h to each pair of conjugate complex eigenvalues, for any integer k, leads to another matrix satisfying exp{Ah} = F. This phenomenon is well known as aliasing in the signal processing literature.
known as aliasing in the signal processing literature. When restrictions are placed on the structural matrix A(θ ) identification is possible. Phillips (1973) gave a rank condition for the case of linear homogeneous relations between the elements of a row of A. A special case is when A(θ ) is triangular. Hansen and Sargent (1983) extended this result by showing that the reduced form covariance structure G > 0 provides extra identifying information about A, reducing the number of potential aliases. To deal with the estimation of (2.1) using discrete data and indirectly (because it was not mentioned) the problem of identification, two approximate discrete time models were proposed in earlier studies. The first is based on the Euler approximation given by th
∫
(t −1)h
A(θ )X (r )dr ≈ A(θ)hXt −1 ,
th
(t −1)h
A(θ )X (r )dr ≈
1 2
and
∂ Σ (X (t ); ψ) dXp (t ) ∂ Xp
dΣ (X (t ); ψ) =
+
1 ∂ 2 Σ (X (t ); ψ)
∂ Xp ∂ Xq′
2
Σ (X (t ); ψ) ≃ Σ (X(i−1)h ; θ ) +
(2.10)
∂ Σ (X(i−1)h ; ψ) Σpq (X(i−1)h ; ψ) ∂ Xp
t
∫
(i−1)h
A(θ )h(Xt + Xt −1 ),
dWq (τ ).
Using these approximations in (2.9) we find
A(θ )h(Xt + Xt −1 ) + B(θ )h + νt . (2.6) 2 The discrete time models are then estimated by standard statistical methods, namely OLS for the Euler approximation and systems estimation methods such as two-stage or three-stage least squares for the trapezoidal approximation. As explained by Lo (1988) in the univariate context, such estimation strategies inevitably suffer from discretization bias. The size of the discretization bias depends on the sampling interval, h, and does not disappear even if n → ∞. The bigger is h, the larger is the discretization bias. Sargan (1974) showed that the asymptotic discretization bias of the two-stage and three-stage least squares estimators for the trapezoidal approximation is O(h2 ) as h → 0. Bergstrom (1984) showed that the asymptotic discretization bias of the OLS estimator for the Euler approximation is O(h). For the more general multivariate diffusion X (0) = X0 , (2.7)
ih
∫ Xih − X(i−1)h =
(i−1)h
1
dX (t ) = κ(µ − X (t ))dt + Σ (X (t ); ψ)dW (t ),
dXp (t )dXq (t ),
where µj (X (t ); θ ) is the jth element of the (linear) drift function κ(µ − X (t )), Σpq is the (p, q)th element of Σ and Xp is the pth element of X . These expressions lead to the approximations
×
which gives rise to the approximate nonrecursive discrete time model Xt − Xt −1 =
∂µ(X (t ); θ ) dXp (t ), ∂ Xp
and (2.5)
The second, proposed by Bergstrom (1966), is based on the trapezoidal approximation
∫
dµ(X (t ); θ ) =
µ(X (t ); θ ) ≃ µ(X(i−1)h ; θ ),
which leads to the approximate discrete time model Xt − Xt −1 = A(θ )hXt −1 + B(θ )h + ut .
By Itô’s lemma, the linearity of the drift function in (2.7), and using tensor summation notation for repeated indices (p, q), we obtain
κ(µ − X (t ))dt +
∫
ih
(i−1)h
≃ µ(X(i−1)h ; θ )h + Σ (X(i−1)h ; ψ)
Σ (X (t ); ψ)dW (t ) ih
∫
(i−1)h
dW (t )
∂ Σ (X(i−1)h ; ψ) Σpq (X(i−1)h ; ψ) ∂ Xp ∫ ih ∫ t × dWq (τ )dW (t ) . +
(i−1)h
(2.11)
(i−1)h
The multiple (vector) stochastic integral in (2.11) reduces as follows:
∫
ih
t
∫
(i−1)h
∫
(i−1)h
dWq (τ )dWp (t )
ih
= (i−1)h
Wq (t ) − Wq(i−1)h dWp (t )
1 2 Wqih − Wq(i−1)h − h = ∫2 ih Wq (t ) − Wq(i−1)h dWp (t )
p=q
where W is standard Brownian motion, two other approaches have been used to approximate the continuous time model (2.7). The first, proposed by Nowman (1997), approximates the diffusion function within each unit interval, [(i − 1)h, ih) by its left end point value leading to the approximate model
The approximate model under a Milstein second order discretization is then
dX (t ) = κ(µ − X (t ))dt + Σ (X(i−1)h ; ψ)dW (t )
Xih − X(i−1)h ≃ µ(X(i−1)h ; θ )h + Σ (X(i−1)h ; ψ) Wih − W(i−1)h
for t ∈ [(i − 1)h, ih).
(2.8)
Since (2.8) is a multivariate Vasicek model within each unit interval, there is a corresponding exact discrete model as in (2.3). This discrete time model, being an approximation to the exact discrete time model of (2.7), facilitates direct Gaussian estimation. To reduce the approximation error introduced by the Euler scheme, Milstein (1978) suggested taking the second order term in a stochastic Taylor series expansion when approximating the drift function and the diffusion function. Integrating (2.7) gives
∫
ih
(i−1)h
dX (t ) =
(i−1)h
∫
∂ Σ (X(i−1)h ; ψ) Σpq (X(i−1)h ; ψ) ∂ Xp ∫ ih ∫ t × dWq (τ )dWp (t ) .
κ(µ − X (t ))dt
(i−1)h
(i−1)h
Σ (X (t ); ψ)dW (t ).
(2.13)
(i−1)h
In view of the calculation (2.12), when the model is scalar the discrete approximation has the simple form (c.f., Phillips and Yu, 2009a,b)
[
]
1 ′ σ (X(i−1)h ; ψ)σ (X(i−1)h ; ψ) h 2
+ σ (X(i−1)h ; ψ) Wih − W(i−1)h
ih
+
+
Xih − X(i−1)h ≃ µ(X(i−1)h ; θ ) −
ih
∫
(i−1)h
(2.12)
p ̸= q.
(2.9)
+ σ ′ (X(i−1)h ; ψ)σ (X(i−1)h ; ψ)
1 2
Wih − W(i−1)h
2
.
(2.14)
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
1 2
Since
Wqih − Wq(i−1)h
2
− h has mean zero, the net contribu-
tion to the drift from the second order term is zero. In the multivariate Vasicek model, Σ (X (t ); ψ) = Σ , and the Milstein approximation (2.13) reduces to Xih − X(i−1)h ≃ µ(X(i−1)h ; θ )h + Σ (X(i−1)h ; ψ) Wih − W(i−1)h .
and the bias of the OLS estimator of F for the VAR(1) model with a known intercept is ∞
1 − ′k BIAS (Fˆ ) = − G {F tr(F k+1 ) + F ′2k+1 }D−1 n k=0
3 + O n− 2 .
Thus, for the multivariate Vasicek model, the Milstein and Euler schemes are equivalent. 3. Estimation methods, asymptotic theory and bias In this paper, following the approach of Phillips (1972), we estimate θ directly from the exact discrete time model (2.3). In particular, we first estimate F (θ ) and θ from (2.3), assuming throughout that A(θ ) and θ are identifiable and that all the eigenvalues in A(θ ) have negative real parts. The latter condition ensures that Xt is stationary and is therefore mean reverting. The exact discrete time model (2.3) in this case is a simple VAR(1) model which has been widely studied in the discrete time series literature. We first review some relevant results from this literature. Let Zt = [Xt′ , 1]′ . The OLS estimator of H = [F , g ] is
ˆ = [Fˆ , gˆ ] = H
n −
n− 1
Xt Zt′−1
· n− 1
t =1
n −
Fˆ =
n
−1
n −
· n
−1
t =1
n −
We now derive a simplified bias formulae in the two models which facilitates the calculation of the bias formulae in continuous time models. Lemma 3.3. Assume (A1)–(A3) hold, h is fixed and n → ∞. The bias of the least squares estimator for F in the VAR(1) is given by Bn = E (Fˆ ) − F = −
n
Zt −1 Zt′−1
.
b = G (I − C )
−1
+ C (I − C )
,
(3.2)
p
(a) Fˆ → F ; d
n{Vec(Fˆ ) − Vec(F )} → N (0, (Γ (0))−1 ⊗ G),
where Γ (0) = Var(Xt ) =
i =0
F i · G · F ′i and G = E (εt εt′ ).
Under different but related conditions, Yamamoto and Kunitomo (1984) and Nicholls and Pope (1988) derived explicit bias expressions for the OLS estimator Fˆ . The proof of the following lemma is given in Yamamoto and Kunitomo (1984, Theorem 1). (A1) Xt is a stationary VAR(1) process whose error term is iid (0, G) with G nonsingular; s0 (A2) For some s0 ≥ 16, E |εti | < ∞, for all i = 1, . . . , M;
∑n
t =1
Zt −1 Zt′−1
−1 2 is bounded, where the operator
1/2
‖Q ‖ = sup(β Q Q β) ′
β
(β β ≤ 1), ′
for any vector β ; Under (A1)–(A3) if n → ∞, the bias of OLS estimator of F in the VAR(1) model with an unknown intercept is ∞ − BIAS (Fˆ ) = −n−1 G {F ′k + F ′k tr(F k+1 ) + F ′2k+1 }D−1 k=0
+O n ∞ − i=0
,
−1
(3.6)
∑∞
(3.3)
i
′i
b = G C (I − C )
2 −1
+
−
λ(I − λC )
−1
Γ (0)−1 .
(3.7)
λ∈Spec(C )
Remark 3.1. The alternative bias formula (3.5) is exactly the same as that given by Nicholls and Pope (1988) for the Gaussian case, although here the expression is obtained without Gaussianity and in a simpler way. If the bias is calculated to a higher order, Bao and Ullah (2009) showed that skewness and excess kurtosis of the error distribution figure in the formulae. In a related contribution, Ullah et al. (2010) obtain the second order bias in the mean reversion parameter for a (scalar) continuous time Lévy process. We now develop estimators for A. To do so we use the matrix exponential expression ∞ − (Ah)i i =0
i!
= I + Ah + H
= I + Ah + O(h2 ) as h → 0.
(3.8)
Rearranging terms we get 1 1 (F − I ) − H = (F − I ) + O(h) as h → 0, h h h which suggest the following simple estimator of A Aˆ =
1
1 h
(Fˆ − I ),
(3.9)
(3.10)
where Fˆ is the OLS estimator of F . We now develop the asymptotic ˆ distribution for Aˆ and the bias in A. Theorem 3.1. Assume Xt follows Model (2.1) and that all characteristic roots of the coefficient matrix A have negative real parts. Let {Xth }nt=1 be the available data and suppose A is estimated by (3.10) with Fˆ defined by (3.1). When h is fixed, as n → +∞, we have
where D=
− 32
λ(I − λC )
A=
‖ · ‖ is defined by ′
−
′ where C = F , Γ (0) = Var(Xt ) = t =0 F · G · F , G = E (εt εt ), and Spec(C ) denotes the set of eigenvalues of C . When the model has a known intercept,
F = eAh =
Lemma 3.2 (Yamamoto and Kunitomo, 1984). Assume:
(A3) E n−1
+
′
Lemma 3.1. For the stationary VAR(1) model (2.4), if h is fixed and n → ∞, we have
∑∞
2 −1
× Γ (0)−1 ,
for which the standard theory first order limit theory (e.g., Fuller (1976, p. 340) and Hannan (1970, p. 329)) is well known.
(b)
(3.5)
(3.1)
t =1
√
3 + O n− 2 .
When the model has a unknown intercept,
−1 Xt −1 Xt′−1
b
λ∈Spec(C )
t =1
Xt Xt′−1
(3.4)
−1
If we have prior knowledge that B(θ ) = 0 and hence g = 0, the OLS estimator of F is:
231
F i GF ′i ,
p
Aˆ − A →
1 h
(F − I − Ah) =
1 h
H = O(h) as h → 0,
(3.11)
232
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
where H = F − I − Ah, and
√
[
h n Vec Aˆ −
1 h
]
=
d
(F − I ) → N (0, Γ (0)−1 ⊗ G),
where Γ (0) = Var(Xt ) =
∑∞
i=0
(3.12)
F i GF ′i , G = E (εt εt′ ).
Theorem 3.2. Assume that Xt follows Model (2.2) where W (t ) is a vector Brownian Motion with covariance matrix Σ and that all characteristic roots of the coefficient matrix A have negative real parts. Let {Xth }nt=1 be the available data and suppose A is estimated by (3.10) with Fˆ defined by (3.1). When h is fixed and n → ∞, the bias formula is: BIAS (Aˆ ) = E (Aˆ − A) =
1 h
−b
H+
T
+ o(T −1 ),
(3.13)
b = G[(I − C )−1 + C (I − C 2 )−1
+
λ(I − λC )−1 ]Γ (0)−1 ,
(3.14)
∑∞
−
b = G[C (I − C 2 )−1 +
λ(I − λC )−1 ]Γ (0)−1 .
(3.15)
λ∈Spec(C )
Remark 3.2. Expression (3.11) extends the result in Eq. (32) of Lo (1988) to the multivariate case. According to Theorem 3.2, the bias of the estimator (3.10) can be decomposed into two parts, the discretization bias and the estimation bias, which take the following forms: discretization bias =
H
=
h = O(h)
estimation bias =
−b T
h as h → 0,
h
(3.20)
Theorem 3.3. Assume that Xt follows Model (2.1) and that all characteristic roots of the coefficient matrix A have negative real parts. Let {Xth }nt=1 be the available data and A is estimated by (3.20) with Fˆ defined by (3.1). When h is fixed, n → +∞, we have Aˆ − A →
2 h
(F − I )(F + I )−1 − A = O(h2 ) as h → 0,
and
[
] 2 d −1 ˆ h n Vec A − (F − I )(F + I ) → N (0, Ψ ), h
Ψ = 16Υ [Γ (0)−1 ⊗ G]Υ ′ ,
+ o(T −1 ).
Υ = (F ′ + I )−1 ⊗ (F + I )−1 .
Theorem 3.4. Assume that Xt follows (2.2) where W (t ) is a vector Brownian motion with covariance matrix Σ and that all characteristic roots of the coefficient matrix A have negative real parts. Let {Xth }nt=1 be the available data and suppose A is estimated by (3.20) with Fˆ defined by (3.1). When h is fixed, n → ∞, and T = hn, the bias formula is: BIAS (Aˆ ) = −ν −
4 T
(I + F )−1 b(I + F )−1 (3.21)
h
(3.16) (3.17)
where ν = A − 2h (F − I )(F + I )−1 , ∆ = [IM ⊗ (I + F )−1 ] · Γ (0)−1 ⊗ G · [IM ⊗ (I + F )−1 ]′ , and L is a M × M matrix whose ijth element is given by Lij =
M 1− ′ eM (s−1)+i · ∆ · eM (j−1)+s , n s=1
(3.22)
with ei being a column vector of dimension M 2 whose ith element is 1 and other elements are 0. If B(θ ) is an unknown vector, then b = G[(I − C )−1 + C (I − C 2 )−1 +
−
λ(I − λC )−1 ]Γ (0)−1 .
λ∈Spec(C )
If B(θ ) is a known vector, then
i!
−
λ(I − λC )−1 ]Γ (0)−1 .
Ah
b = G[C (I − C 2 )−1 +
Ah
Remark 3.3. Theorem 3.4 shows that the bias of the estimator (3.20) can be decomposed into a discretization bias and an estimation bias as follows:
[ −A2 h2 −2A3 h3 = I + Ah + (eAh − I ) + + + ··· 2 3! 4! ] −(n − 2)An−1 hn−1 + + ··· n! 2 Ah 2
λ∈Spec(C )
[F − I ] + η [F − I ] + O(h ) as h → 0. 3
2
(F − I )(F + I )−1 − η(F + I )−1 h
(3.18)
2
(F − I )(F + I )−1 − A h = O(h2 ) as h → 0,
discretization bias = −ν =
Consequently, h
2
4
∞ − (Ah)i i =0
2
Aˆ =
− L(I + F )−1 + o(T −1 ),
Higher order approximations are possible. For example, we may take the matrix exponential series expansion to the second order to produce a more accurate estimate using
A =
(Fˆ − I )(Fˆ + I )−1 .
After neglecting terms smaller than O h , we get the alternative estimator
F − I − Ah
It is difficult to determine the signs of the discretization bias and the estimation bias in a general multivariate case. However, in the univariate case, the signs are opposite to each other as shown in Section 4.2.
= I + Ah +
(3.19)
where
i ′i ′ where Γ (0) = Var(Xt ) = i=0 F · G · F , G = E (εt εt ), and Spec(C ) is the set of eigenvalues of C . If B(θ ) is known, then
= I + Ah +
(F − I )(F + I )−1 + ν (F − I )(F + I )−1 + O(h2 ) as h → 0. 2
h
√
λ∈Spec(C )
F = eAh =
h 2
p
where H = F − I − Ah, and T = nh is the time span of the data. If B(θ) is unknown, then
−
=
2
(3.23)
4 4 estimation bias = − (I + F )−1 b(I + F )−1 − L(I + F )−1 T h + o(T −1 ). (3.24)
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
As before, it is difficult to determine the signs of the discretization bias and estimation bias in a general multivariate case. However, in the univariate case, the signs are opposite each other as reported in Section 4.2. Remark 3.4. The estimator (3.10) is based on a first order Taylor expansion whereas the estimator (3.20) is based on a second order expansion, so it is not surprising that (3.20) has a smaller discretization bias than (3.10). It is not as easy to compare the magnitudes of the two estimation biases. In the univariate case, however, we show in Section 4.2 that the estimator (3.20) has a larger estimation bias than the estimator (3.10). 4. Relations to existing results
The estimators given above include as special cases the two estimators obtained from the Euler approximation and the trapezoidal approximation. Consequently, both the asymptotic and the bias properties are applicable to these two approximation models and the simple framework above unifies some earlier theory on the estimation of approximate discrete time models. The Euler approximate discrete time model is of the form: Xt − Xt −1 = AhXt −1 + Bh + ut .
n
−1
n −
Xt Zt′−1
n
−1
n − t =1
(4.2)
If B is known a priori and assumed zero without loss of generality, then the OLS estimator of A is n− 1
[I + Ah] =
Xt Xt′−1
n− 1
t =1
n −
−1 Xt −1 Xt′−1
t =1
=: [Fˆ ],
(4.3)
where Zt −1 , Fˆ , gˆ are defined in the same way as before. Hence, Aˆ =
1
[Fˆ − I ].
(4.4)
h This is precisely the estimator given by (3.10) based on a first order expansion of the matrix exponential exp(Ah) in h. The trapezoidal approximate discrete time model is of the form Xt − Xt −1 =
1
Ah(Xt + Xt −1 ) + Bh + νt .
(4.5)
1
Ah(Xt + Xt −1 ) + νt . (4.6) 2 Note that (4.6) is a simultaneous equations model, as emphasized by Bergstrom (1966, 1984). We show that the two stage least squares estimator of A from (4.5) is equivalent to the estimator given by (3.20) based on a second order expansion of exp(Ah) in h. To save space, we focus on the approximate discrete time model with known B = 0. The result is easily extended to the case of unknown B. The two stage least squares estimator of Bergstrom (1984) takes the form
Aˆ =
n − 1 t =1
h
n −
Xt Xt′−1
(Xt − Xt −1 )Vt
′
(4.8)
n −
t =1
−1
n − 1 t =1
2
X t −1 .
Xt −1 Xt′−1
(4.9)
t =1
Theorem 4.1. The two stage least squares estimator suggested in Bergstrom (1984) has the following form Aˆ =
2 h
[Fˆ − I ][Fˆ + I ]−1 ,
(4.10)
4.2. Bias in univariate models The univariate diffusion model considered in this section is the OU process: dX (t ) = κ(µ − X (t ))dt + σ dW (t ),
−1 (Xt + Xt −1 )Vt
′
,
X (0) = 0,
(4.11)
where W (t ) is a standard scalar Brownian motion. The exact discrete time model corresponding to (4.11) is
−κ h
)+σ
1 − e−2κ h 2κ
ϵt ,
ˆ h, κˆ = − ln(φ)/
(4.12)
(4.13)
where
φˆ =
n−1 Σ Xt Xt −1 − n−2 Σ Xt Σ Xt −1 n−1 Σ Xt2 − n−2 (Σ Xt −1 )2
(4.7)
,
(4.14)
and κˆ exists provided φˆ > 0. Tang and Chen (2009) analyzed the asymptotic properties and derived the finite sample variance formula and the bias formula, respectively, Var(κ) ˆ =
1 − φ2
+ o(T −1 ),
Thφ 2
E (κ) ˆ −κ =
1
T
5 2
+e
κh
+
(4.15)
e2κ h 2
+ o(T −1 ).
(4.16)
When µ is known (assumed to be 0), the exact discrete model becomes
2 If B = 0, the approximate discrete model becomes Xt − Xt −1 =
Xt =
(Xt∗ + Xt −1 ),
where φ = e−κ h , ϵt ∼ iid N (0, 1) and h is the sampling interval. The ML estimator of κ (conditional on X0 ) is given by
−1 Zt −1 Zt′−1
=: [Fˆ , gˆ ].
n −
∗
2
Xt = φ Xt −1 + µ(1 − e
t =1
1
Vt =
(4.1)
The OLS estimator of A is given by
where
and is precisely the same estimator as that given by (3.20) based on a second order expansion of exp(Ah) in h.
4.1. The Euler and trapezoidal approximations
] := [I + Ah, Bh
233
X t = φ X t −1 + δ
1 − e−2κ h 2κ
ϵt ,
(4.17)
ˆ h, where φˆ = Σ Xt and the ML estimator of κ is κˆ = − ln(φ)/ Xt −1 /Σ Xt2−1 . In this case, Yu (2009) derived the following bias formula under stationary initial conditions E (κ) ˆ −κ =
1 2T
(3 + e2κ h ) −
2(1 − e−2nκ h ) Tn(1 − e−2κ h )
+ o(T −1 ).
(4.18)
When the initial condition is X (0) = 0, the bias formula becomes 1 (3 + e2κ h ) + o(T −1 ). (4.19) 2T Since the MLE is based on the exact discrete time model, there is no discretization bias in (4.12) and (4.17). The bias in κˆ is induced entirely by estimation and is always positive. We may link our results for multivariate systems to the univariate model. For example, κ = −A in (4.11) and the first order Taylor series expansion (i.e., the Euler method) gives the estimator E (κ) ˆ −κ =
234
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
κ1 =
1
ˆ [1 − φ].
(4.20)
h In this case the results obtained in Theorems 3.1 and 3.2 may be simplified as in the following two results. Theorem 4.2. Assuming κ > 0, when h is fixed, and n → ∞, we have p
κˆ 1 − κ → −
exp(−κ h) − 1 + κ h h
= O(h) as h → 0,
(4.21)
and
√
[
h n κˆ 1 −
1 − exp(−κ h) h
d
(4.22)
h
1 + 3 exp(−κ h)
+
T
+ o(T −1 ),
(4.23)
For the OU process with a known mean, BIAS (κˆ 1 ) = −
H h
2 exp(−κ h)
+
T
+ o(T
),
(4.24)
+ o(T −1 ) and + o(T −1 ) are the where T T estimation biases in the two models, respectively. In both models, the discretization bias has the following form: −H h
=−
exp(−κ h) − 1 + κ h h
.
(4.25)
1 − exp(−2κ h) Th
.
(4.26)
Remark 4.2. The estimation bias is always positive in both models. If κ h ∈ (0, 3] which is empirically realistic, the discretization bias may be written as
−H h
= −κ 2 h
∞ − (−κ h)i−2
i! − (−κ h)j−2 = −κ 2 h (j + 1 − κ h) (j + 1)! j=2,4,...
< 0.
(4.27)
Remark 4.3. For the unknown mean model, if T < h(1 + 3φ)/ (κ h + φ − 1), the estimation bias is larger than the discretization bias in magnitude because this condition is equivalent to
>
κ h + exp(−κ h) − 1 h
.
Further h(1 + 3φ)/(κ h + φ − 1) =
h(1 + 3(1 − κ h + O(h2 )))
κ 2 h2 − 61 κ 3 h3 + O(h4 ) −1 2 1 = 2 (4 − 3κ h + O(h2 )) 1 − κ h + O(h2 ) κ h 3 2 1 2 2 = 2 (4 − 3κ h + O(h )) 1 + κ h + O(h ) κ h 3 =
8
κ
2h
(1 + O(h)) .
√
[
h n κˆ 2 −
2(1 − exp(−κ h))
− κ = O(h2 ) as h → 0,
(4.29)
16(1 − exp(−κ h)) d → N 0, . (4.30) h(1 + exp(−κ h)) (1 + exp(−κ h))3 ]
For the OU process with an unknown mean, BIAS (κˆ 2 ) = ν +
8 T (1 + exp(−κ h))
+ o(T −1 ).
(4.31)
For the OU process with a known mean, 4 T (1 + exp(−κ h))
+ o(T −1 ),
(4.32)
where T (1+exp8 (−κ h)) + o(T −1 ) and T (1+exp4 (−κ h)) + o(T −1 ) are the two estimation biases. In both models, the discretization bias has the form
ν = −κ +
2(1 − exp(−κ h)) h(1 + exp(−κ h))
AsyVar(κˆ 2 ) =
This means that the discretization bias has sign opposite to that of the estimation bias.
T
h(1 + exp(−κ h))
= O(h2 ).
(4.33)
Remark 4.4. From (4.30) the asymptotic variance for κˆ 2 is
i =2
1 + 3 exp(−κ h)
2(1 − exp(−κ h))
p
κˆ 2 − κ →
BIAS (κˆ 2 ) = ν +
Remark 4.1. From (4.22) the asymptotic variance for κˆ 1 is AsyVar(κˆ 1 ) =
for which we have the following result.
and −1
2 exp(−κ h)
1+3 exp(−κ h)
(4.28)
Theorem 4.3. Assuming κ > 0, when h is fixed, and n → ∞, we have
For the OU process with an unknown mean, H
Similarly, the second order expansion (i.e. the trapezoidal method) gives the estimator
ˆ 2(1 − φ) 2 κˆ 2 = −Aˆ = − [Fˆ − I ][Fˆ + I ]−1 = , ˆ h h(1 + φ)
]
→ N (0, 1 − exp(−2κ h)).
BIAS (κˆ 1 ) = −
In empirically relevant cases, 8/(κ 2 h) is likely to take very large values, thereby requiring very large values of T before the estimation bias is smaller than the discretization bias. For example, if κ = 0.1 and h = 1/12, T > 9600 years are needed for the bias to be smaller. The corresponding result for the known mean case is 2hφ/(κ h +φ− 1) = 4/(κ 2 h) (1 + O(h)) and again large values of T are required to reduce the relative magnitude of the estimation bias.
1 2
16(1 − exp(−κ h)) Th(1 + exp(−κ h))3
.
(4.34)
Remark 4.5. The estimation bias is always positive in both models. If κ h ∈ (0, 2], the discretization bias may be written as
ν = −κ + = =
2(1 − exp(−κ h)) h(1 + exp(−κ h))
∞ − −κ (i − 2)(−κ h)i−1 1 + exp(−κ h) i=3 i! − (−κ h)j−1 −κ
1 + exp(−κ h) j=3,5,... (j + 1)!
× ((j − 2)(j + 1) − κ h(j − 1)) < 0.
(4.35)
Hence, the discretization bias has the opposite sign of the estimation bias. Remark 4.6. For the unknown mean model, if T < 8h/(κ h(1 +φ)−2(1−φ)), the estimation bias is larger than the discretization bias in magnitude because this condition is equivalent to 8 T (1 + exp(−κ h))
>κ−
2(1 − exp(−κ h)) h(1 + exp(−κ h))
.
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
Further 8h
κ h(1 + φ) − 2(1 − φ) 8h
=
κ h(2 − κ h + 21 κ 2 h2 + O(h3 )) − 2(κ h − 12 κ 2 h2 + 16 κ 3 h3 + O(h4 )) − 1 1 3 3 48 = 8h κ h + O(h4 ) = 3 2 (1 + O(h))−1 6 κ h =
48
κ 3 h2
(1 + O(h)) .
Again, in empirically relevant cases, 48/(κ 3 h2 ) is likely to take very large values thereby requiring very large values of T before the estimation bias is smaller than the discretization bias. For example, if κ = 0.1 and h = 1/12, T > 6912,000 years are needed for the bias to be smaller. Hence the estimation bias is inevitably much larger than the discretization bias in magnitude for all realistic sample spans T . Remark 4.7. It has been argued in the literature that ML should be used whenever it is available and the likelihood function should be accurately approximated when it is not available analytically; see Durham and Gallant (2002) and Aït-Sahalia (2002) for various techniques to accurately approximate the likelihood function. From the results in Theorems 4.2 and 4.3 we can show that the total bias of the MLE based on the exact discrete time model is bigger than that based on the Euler and the trapezoidal approximation. For example, for the estimator based on the trapezoidal approximation, considering ν = O(h2 ) as h → 0, when the model is the OU process with an unknown mean,
BIAS (κˆ ML ) − BIAS (κˆ 2 ) 5 + 2eκ h + e2κ h − = 2T
=
5 + 2eκ h + e2κ h 2T
−
+ v + o(T −1 ) T (1 + e−κ h ) 8
8 T (1 + e−κ h )
whereas the asymptotic variance of κˆ 1 and κˆ 2 is based on large n asymptotics and the two approximate estimators are inconsistent with fixed h. Nevertheless, Eqs. (4.22) and (4.30) seem to indicate that in finite (perhaps very large finite) samples, the inconsistent estimators may lead to smaller variances than the MLE, which will be verified by simulations. Remark 4.9. Comparing Theorems 4.2 and 4.3, it is easy to see the estimator (4.28) based on the trapezoidal approximation leads to a smaller discretization bias than the estimator (4.20) based on the Euler approximation. However, when κ h > 0 and hence φ = e−κ h ∈ (0, 1), the gain in the discretization error is earned at the expense of an increase in the estimation error. For the OU process with an unknown mean, estimation bias (κˆ 2 ) − estimation bias (κˆ 1 )
= =
8 T (1 + e−κ h )
−
1 + 3e−κ h T
+ o(T −1 )
(1 − φ)(7 + 3φ) + o(T −1 ) > 0. T (1 + φ)
(4.40)
Similarly, for the OU process with a known mean, estimation bias (κˆ 2 ) − estimation bias (κˆ 1 )
= =
4 T (1 + e−κ h )
−
2e−κ h T
+ o(T −1 )
(1 − φ)(4 + 2φ) + o(T −1 ) > 0. T (1 + φ)
(4.41)
Since the sign of the discretization bias is opposite to that of the estimation bias, and the trapezoidal rule makes the discretization bias closer to zero than the Euler approximation, we have the following result in both models.
BIAS (κˆ 2 ) − BIAS (κˆ 1 ) > 0.
− v + o(T −1 )
(1 − φ)2 (1 + 5φ) − v + o( T − 1 ) 2T φ 2 (1 + φ) > 0.
235
Remark 4.10. The estimator based on the Euler method leads not only to a smaller bias but also to a smaller variance than that based on the trapezoidal method when κ > 0. This is because
=
(4.36)
Using the same method, it is easy to prove the result still holds for the OU process with an known mean. Similarly, one may show that
BIAS (κˆ ML ) − BIAS (κˆ 1 ) > 0,
AsyVar(κˆ 2 ) − AsyVar(κˆ 1 ) =
=
16(1 − φ) Th(1 + φ)3
−
(1 − φ)2 (3 + φ)[4 + (1 + φ)2 ] > 0. Th(1 + φ)3
1 − φ2 Th (4.42)
in both models.
In consequence, the Euler method is preferred to the trapezoidal method and exact ML for estimating the mean reversion parameter in the univariate setting.
Remark 4.8. The two approximate estimators reduce the total bias over the exact ML and also the asymptotic variance when κ > 0. This is because
5. Bias in general univariate models
AsyVar(κˆ ML ) − AsyVar(κˆ 1 ) =
1 − φ2 Thφ 2
−
1 − φ2 Th
5.1. Univariate square root model
>0
(4.37)
and AsyVar(κˆ ML ) − AsyVar(κˆ 2 ) =
1 − φ2
The square root model, also known as the Cox et al. (1985) (CIR hereafter) model, is of the form dX (t ) = κ(µ − X (t ))dt + σ
16(1 − φ)
− (4.38) Thφ 2 Th(1 + φ)3 2 (1 − φ)3 φ + 6φ + 1 = > 0. Thφ 2 (1 + φ)3 (4.39)
In consequence, the two approximate methods are preferred to the exact ML for estimating the mean reversion parameter in the univariate setting. Of course, the two approximate methods do NOT improve the asymptotic efficiency of the MLE. This is because the asymptotic variance of the MLE is based on large T asymptotics
X (t )dW (t ).
(5.1)
If 2κµ/σ 2 > 1, Feller (1951) showed that the process is stationary, the transitional distribution of cXt given Xt −1 is non-central χν2 (λ) with the degree of freedom ν = 2κµσ −2 and the non-central component λ = cXt −1 e−κ h , where c = 4κσ −2 (1 − e−κ h )−1 . Since the non-central χ 2 -density function is an infinite series involving the central χ 2 densities, the explicit expression of the MLE for θ = (κ, µ, σ ) is not attainable. To obtain a closed-form expression for the estimator of θ , we follow Tang and Chen (2009) by using the estimator of Nowman. The Nowman discrete time representation of the square root model is
236
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
Xt = φ1 Xt −1 + (1 − φ1 )µ + σ
X t −1
1 − φ12 2κ
ϵt ,
Furthermore, the finite sample variance for κˆ Euler is (5.2)
where φ1 = e , ϵt ∼ iid N (0, 1) and h is the sampling interval. Hence, Nowman’s estimator of κ is −κ h
1
κˆ Nowman = − ln(φˆ 1 ),
(5.3)
h
n ∑
n ∑
Xt
t =1
Xt−−11 − n−1
t =1 n
∑
n− 2
n ∑
(5.12)
(5.13)
X t −1
n ∑
Var(κˆ Euler )
Xt Xt−−11
t =1
t =1
.
(5.4)
Xt−−11 − 1
t =1
For the stationary square root process, Tang and Chen (2009) derived explicit expressions to approximate E (φˆ 1 − φ1 ) and Var(φˆ 1 ). Using the following relations, E (κˆ Nowman − κ) = −
1
[
1
φ1
h
−
1
E (φˆ 1 − φ1 )
] 2 −3/2 ˆ E (φ1 − φ1 ) + O(n ) , 2
2φ1
(5.5)
Var(κˆ Nowman ) =
E (φˆ 1 − φ1 ) < 0, bias − 1h H in the Euler approximation. Consequently, the negative
1 h2
= φ12 + O(n−1 ) < 1. (5.14) Var(κˆ Nowman ) According to (5.14), the Euler scheme always gains over Nowman’s method in terms of variance. The smaller is φ1 , the larger the gain. Tang and Chen (2009) obtained a bias formula of E (φˆ 1 − φ1 ) for the Nowman estimator under the square root model. Unfortunately, the expression is too complex to be used to determine the sign of the bias analytically. However, the simulation results reported in the literature (Phillips and Yu, 2009a,b, for example) and in our own simulations reported in Section 6 suggest that E (κˆ Euler − κ) > 0. Since H > 0, (5.10) implies that and the estimation bias − 1h E (φˆ 1 −φ1 ) dominates the discretization
and
φ
2 1
[Var(φˆ 1 ) + O(n−2 )],
(5.6)
they further obtained the approximations to E (κˆ Nowman − κ) and Var(κˆ Nowman ). With √ a fixed h and n → ∞ they derived the asymptotic distribution of n(κˆ Nowman − κ). The fact that the mean of the asymptotic distribution is zero implies that the Nowman method causes no discretization bias for estimating κ . The estimator of κ based on the Euler approximation also has a closed form expression under the square root model. The Euler discrete time model is Xt = φ2 Xt −1 + (1 − φ2 )µ + σ
X t −1 hϵ t ,
(5.7)
where φ2 = (1 − κ h). Hence, the Euler scheme estimator of κ is 1
κˆ Euler = − (φˆ 2 − 1),
(5.8)
h
where
φˆ 2 =
Var(φˆ 1 ). h2 If κ > 0, φ1 = e−κ h < 1. When h is fixed, we have 1 1 Var(φˆ 1 ) + O(n−2 ) > 2 Var(φˆ 1 ) Var(κˆ Nowman ) = 2 2 h h φ1 leading to
n− 2
n
1
= Var(κˆ Euler ),
where
φˆ 1 =
Var(κˆ Euler ) =
discretization bias − 1h H reduces the total bias in the Euler method. Consequently, the bias in κˆ Nowman is larger than that in κˆ Euler because E (κˆ Nowman − κ)
=− ≥−
1
[
1
φ1
h
1 1 h φ1
E (φˆ 1 − φ1 ) −
1
]
E (φˆ 1 − φ1 )2 + O(n−3/2 ) 2
2φ1
E (φˆ 1 − φ1 ) 1
1
≥ − E (φˆ 1 − φ1 ) − H = E (κˆ Euler − κ).
(5.15) h h The Milstein scheme is another popular approximation approach. For the square root model, the discrete time model obtained by the Milstein scheme is given by Xt = Xt −1 + κ(µ − Xt −1 )h + σ
X t −1 hϵ t +
√
n ∑ −2
Xt
t =1
n
n ∑
Xt−−11 − n
t =1 n
∑ −2 t =1
X t −1
n ∑ −1
n ∑
.
(5.9)
Xt−−11 − 1
t =1
Yt = aϵt + bϵt2 = b
[
ϵt +
a 2 2b
−
a2
1 1 E (κˆ Euler − κ) = − E (φˆ 1 − φ1 ) − H , h h
(5.10)
]
4b2
.
(5.17)
2
a Since ϵt ∼ iid N (0, 1), Z = ϵt + 2b follows a noncentral χ 2 distribution with 1 degree of freedom and noncentrality parameter
Obviously φˆ 2 = φˆ 1 . Hence, κˆ Euler = − 1h (φˆ 1 − 1). Considering ∑ i φ1 = e−κ h = 1 − κ h + ∞ i=2 (−κ h) /i!, the finite sample bias for κˆ Euler can be expressed as
λ =
a2 . 4b2
Elerian (1998) showed that the density of Z may be expressed as f (z ) =
1 2
exp −
√ λ + z z −1/4 λz , I−1/2 2 λ
(5.18)
where
where ∞ −
1
1
h
h i=2
− H=−
σ 2 h ϵt2 − 1 . (5.16)
Let a = σ Xt −1 h, b = 14 σ 2 h, Yt = Xt − Xt −1 −κ(µ− Xt −1 )h + 14 σ 2 h, then Eq. (5.16) can be represented by
Xt Xt−−11
t =1
1 4
(−κ h)i /i! = O(h),
as h → 0,
(5.11)
which is the discretization bias caused by discretizing the drift √ function. Since the asymptotic mean of n(φˆ 1 − φ1 ) and hence √ the asymptotic mean of n(κˆ Euler − κ + 1h H ) is zero for a fixed h and n → ∞, the Euler discretization of the diffusion function introduces no discretization bias to κ under the square root model.
∞ 2−
(x/2)2i = i!Γ (j + 0.5)
1 {exp(x) + exp(−x)}. x i=0 2π x This expression may be used to compute the log-likelihood function of the approximate model (5.16). Unfortunately, the ML estimator does not have a closed form expression and it is therefore difficult to examine the relative performance of the bias and the variance using analytic methods. The performance of the Milstein scheme is therefore compared to other methods in simulations. I−1/2 (x) =
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
237
If κ > 0, κˆ Euler has a smaller finite sample variance than κˆ Nowman because
5.2. Diffusions with linear drift We consider the following general diffusion process with a linear drift dX (t ) = κ(µ − X (t ))dt + σ q(X (t ); ψ)dW (t ),
(5.19)
as a generalization to the Vasicek and the square root models, where σ q(X (t ); ψ) is a general diffusion function with parameters ψ , and θ = (κ, µ, σ , ψ) ∈ Rd is the unknown parameter vector. This model include the well known Constant Elasticity of Variance (CEV) model, such as the Chan et al. (1992, CKLS) model, as a special case. In this general case, the transitional density is not analytically available. The Nowman approximate discrete model is
Xt = φ1 Xt −1 + (1 − φ1 )µ + σ q(Xt −1 ; ψ)
1 − φ12 2κ
Var(κˆ Nowman ) =
(5.20)
φ
2 1
Var(φˆ 1 ) + O(n−2 ) ≥
1 h2
Var(φˆ 1 )
= Var(κˆ Euler ).
(5.29)
Under Assumptions 1–3, κˆ Euler has a smaller bias than κˆ Nowman because E (κˆ Nowman − κ)
=− ≥−
ϵt ,
1 h2
1
[
h
1
φ1
1 1 h φ1
E (φˆ 1 − φ1 ) −
1
E (φˆ 1 − φ1 )2 + O(n−3/2 ) 2
2φ1
E (φˆ 1 − φ1 )
1
1
h
h
≥ − E (φˆ 1 − φ1 ) − H = E (κˆ Euler − κ).
The Euler approximate discrete model is
]
(5.30)
√
Xt = φ2 Xt −1 + (1 − φ2 )µ + σ q(Xt −1 ; ψ) hϵt .
(5.21)
Theorem 5.1. For Model (5.19), the MLE of κ based on the Nowman approximation is 1
κˆ Nowman = − ln(φˆ 1 ),
(5.22)
h
where φˆ 1 is the ML estimator for φ1 in (5.20). The MLE of κ based on the Euler approximation is 1
κˆ Euler = − (φˆ 2 − 1),
(5.23)
h
where φˆ 2 is the ML estimator for φ2 in (5.21). Then we have
φˆ 2 = φˆ 1 .
Remark 5.1. The ML estimator of φ1 does not have a closed-form expression. Neither does the ML estimator of φ2 . So numerical calculations are needed for comparisons. However, according to Theorem 5.1, even without a closed-form solution, we can still establish the equivalence of φˆ 1 and φˆ 2 . After φˆ 1 and φˆ 2 are found numerically, one may find the estimators of κ by using the relations κˆ Nowman = − 1h ln(φˆ 1 ) and κˆ Euler = − 1h (φˆ 2 − 1). To compare the magnitude of the bias in κˆ Nowman to that of κˆ Euler , no general analytic result is available. However, under some mild conditions, comparison is possible. In particular, we make the following three assumptions. Assumption 1: φˆ 1 − φ1 ∼ Op (n−1/2 ); Assumption 2: E (φˆ 1 − φ1 ) < 0; Assumption 3: − 1h E (φˆ 1 − φ1 ) >
− 1h H, i.e., the estimation bias dominates the discretization bias in Euler approximation. Under Assumption 1, we get E (κˆ Nowman − κ) = −
− Var(κˆ Nowman ) =
1 h2 φ12
[
1
φ1
h 1
E (φˆ 1 − φ1 )
2φ1
[Var(φˆ 1 ) + O(n−2 )],
1 1 E (κˆ Euler − κ) = − E (φˆ 1 − φ1 ) − H , h h and Var(κˆ Euler ) = where H =
1 h2
∑∞
Var(φˆ 1 ),
i =2
]
E (φˆ 1 − φ1 )2 + O(n−3/2 ) , 2
(−κ h)i /i! = O(h2 ).
6.1. Linear models To examine the performance of the proposed bias formulae and to compare the two alternative approximation scheme in multivariate diffusions, we estimate κ = −A in the bivariate model with a known mean: dXt = AXt dt + Σ dWt ,
(5.25)
(5.26)
(5.27)
(5.28)
X0 = 0,
(6.1)
where Wt is the standard bivariate Brownian motion whose components are independent, and
κ Xt = , κ = −A = 11 κ21 σ11 0 and Σ = . 0 σ22
(5.24)
1
6. Simulation studies
X1t X2t
0
κ22
,
Since A is triangular, the parameters are all identified. While keeping other parameters fixed, we let κ22 take various values over the interval (0, 3], which covers empirically reasonable values of κ22 that apply for data on interest rates and volatilities. The mean reversion matrix is estimated with 10 years of monthly data. The experiment is replicated 10,000 times. Both the actual total bias and the actual standard deviation are computed across 10,000 replications. The actual total bias is split into two parts – discretization bias and estimation bias – as follows. The estimation bias is calculated as H /h and −v as in (3.13) and (3.21) for the two approximate methods. The estimation bias is calculated as: estimation bias = actual total bias − discretization bias Fig. 1 plots the biases of the estimate of each element in the mean reversion matrix κ , based on the Euler method, as a function of the true value of κ22 . Four biases are plotted, the actual total bias, the approximate total bias given by the formula in (3.13), the discretization bias H /h as in (3.13), and the estimation bias. Several features are apparent in the figure. First, the actual total bias in all cases is large, especially when the true value of κ22 is small. Second, except for κ12 whose discretization bias is zero, the sign of the discretization bias for the other parameters is opposite to that of the estimation bias. Not surprisingly, in these cases, the actual total bias of estimator (3.10) is smaller than the estimation bias. The discretization bias for κ12 is zero because it is assumed that the true value is zero. In the bivariate set-up, however, it is possible that the sign of the discretization bias for the other parameters is the same as that of the estimation bias (for example when κ12 = 5 and κ21 = −0.5). Third, the bias in all parameters
238
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
Fig. 1. The bias of the elements in Aˆ in Model (6.1) as a function of κ22 at the monthly frequency and T = 10. The estimates are obtained from the Euler method. The solid line is the actual total bias; the broken line is the approximate total bias according to the formula (3.13); the dashed line is the discretization bias H /h; the point line is the estimation bias. The true value for κ11 , κ12 , and κ21 is 0.7, 0, and 0.5, respectively.
Fig. 2. The bias of the elements in Aˆ in Model (6.1) as a function of κ22 at the monthly frequency and T = 10. The estimates are obtained from the trapezoidal method. The solid line is the actual total bias; the broken line is the approximate bias according to the formula (3.13); the dashed line is the discretization bias −v ; the point line is the estimation bias. The true value for κ11 , κ12 , and κ21 is 0.7, 0, and 0.5, respectively.
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
239
Fig. 3. The bias of the elements in Aˆ in Model (6.1) as a function of κ22 at the monthly frequency and T = 10. The estimates are obtained from the Euler and the trapezoidal methods, respectively. The solid line is the actual total bias for the Euler method; the broken line is the actual total bias for the trapezoidal method. The true value for κ11 , κ12 , and κ21 is 0.7, 0, and 0.5, respectively.
is sensitive to the true value of κ22 . Finally, the bias formula (3.13) generally works well in all cases. Fig. 2 plots the biases of the estimate of each element in the mean reversion matrix κ , based on the trapezoidal method, as a function of the true value of κ22 . Four biases are plotted, the actual total bias, the approximate total bias given by the formula in (3.21), the discretization bias −ν as in (3.21), and the estimation bias. In all cases, the discretization bias is closer to zero than that based on the Euler approximation. This suggests that the trapezoidal method indeed reduces the discretization bias. Moreover, the bias formula (3.21) generally works well in all cases. The performance of the two approximation methods is compared in Fig. 3, where the actual total bias of the estimators given by (3.10) and (3.20) is plotted. It seems that the bias of the estimator obtained from the trapezoidal approximation is larger than that from the Euler approximation for all parameters except κ12 . For κ12 , the performance of the two methods are very close with the Euler method being slightly worse when κ22 is large. Fig. 4 plots the actual standard deviations for the two approximate estimators, (3.10) and (3.20) as a function of κ22 . We notice that, for all the parameters, the standard deviation of the Euler method is smaller than that of the trapezoidal method. The percentage difference can be as high as 20%. We also design an experiment to check the performance of the alternative estimators in the univariate case. Data are simulated from the univariate OU process with a known mean dX (t ) = −κ X (t )dt + σ dW (t ),
X (0) = 0.
(6.2)
Fig. 5 reports the bias in κ obtained from the Euler method and the trapezoidal method in the OU process with a known mean. Three biases are plotted: the actual total bias, the estimation bias
and the discretization bias. Fig. 6 compares the bias in κ obtained from the exact ML methods with that of the two approximate methods. Several conclusions may be drawn from these two Figures. First, our bias formula provides a good approximation to the actual total bias. Second, for the two approximate estimators, (4.20) and (4.28), the sign of the discretization bias is opposite to that of the estimation bias. Third, while the trapezoidal method leads to a smaller discretization bias than the Euler method, it has a larger estimation bias. Finally, the actual total bias for the Euler method is smaller than that of the trapezoidal method and both methods lead to a smaller total bias than the exact ML estimator (4.13). Fig. 7 reports the standard deviations for estimators (4.13), (4.20) and (4.28). It is easy to find that the standard deviations of estimator (4.20) is the smallest among those of all estimators. The standard deviations of estimator (4.28) are almost the same with those from the exact ML estimator (4.13), but smaller when κ is bigger than 1. Considering the sample size is 120, we can roughly say that, focusing on bias and standard deviation, the estimator (4.20) from the Euler approximation is better than the other estimators in comparatively small sample sizes. 6.2. Square root model For the square root model, we designed an experiment to compare the performance of the various estimation methods, including the exact ML, the Euler scheme, the Nowman scheme and the Milstein scheme. In all cases we fix h = 1/12, T = 120, µ = 0.05, σ = 0.05, but vary the value of κ from 0.05 to 0.5. These settings correspond to 10 years of monthly data in the estimation of κ . The experiment is replicated 10,000 times.
240
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
Fig. 4. The standard deviation of the elements in Aˆ in Model (6.1) as a function of κ22 at the monthly frequency and T = 10. The estimates are obtained from the Euler and the trapezoidal methods, respectively. The solid line is the standard deviation for the Euler method; the broken line is the standard deviation for the trapezoidal method. The true value for κ11 , κ12 , and κ21 is 0.7, 0, and 0.5, respectively.
Fig. 5. The bias of the κ estimates in the univariate model as a function of κ at the monthly frequency and T = 10 for the two approximate methods. The left panel is for the Euler method and the right panel is for the trapezoidal method. The solid line is the actual total bias; the dashed line is the approximate total bias; the dotted line is the estimation bias; the broken line is the discretization bias.
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
241
Table 1 Exact and approximate ML estimation of κ from the square root model using 120 monthly observations. The experiment is replicated 10,000 times. Method
Exact
Euler
Nowman
Milstein
0.1156 0.2251 0.2531
0.1126 0.2205 0.2476
0.1152 0.2249 0.2526
0.1132 0.2206 0.2480
0.1392 0.2670 0.3011
0.1342 0.2590 0.2917
0.1387 0.2668 0.3007
0.1350 0.2592 0.2922
0.1615 0.3178 0.3565
0.1529 0.3070 0.3430
0.1610 0.3178 0.3562
0.1538 0.3068 0.3432
0.1869 0.4210 0.4607
0.1625 0.3999 0.4317
0.1862 0.4209 0.4603
0.1639 0.3993 0.4316
κ = 0.05 Bias Std err RMSE
κ = 0.1 Bias Std err RMSE
κ = 0.2 Bias Std err RMSE Fig. 6. The actual total bias of the κ estimates in the univariate model as a function of κ at the monthly frequency and T = 10 for the two approximate methods and the exact ML. The solid line is for the exact ML; the dashed line is for the Euler method; the broken line is for the trapezoidal method.
Fig. 7. The standard deviation of the κ estimates in the univariate model as a function of κ at the monthly frequency and T = 10. The solid line is for the exact ML; the broken line is for the Euler method; the dotted line is for the trapezoidal method.
Table 1 reports the bias, the standard error (Std err), and the root mean square error (RMSE) of κ for all estimation methods, obtained across 10,000 replications. Several conclusions emerge from the table. First, all estimation methods suffer from a serious bias problem. Second, the Euler scheme performs best both in terms of bias and variance. Third, the ratios of the standard error of κEuler and that of κNorman are 0.9958, 0.9917, 0.9835, 0.9592 when κ is 0.05, 0.1, 0.2, 0.5, respectively. The ratio decreases as κ increases, as predicted in (5.14). Finally, although the bias for the Milstein method is larger than that for the Euler method, the variances for these two methods are very close. 7. Conclusions This paper provides a framework for studying the implications of different discretization schemes in estimating the mean reversion parameter in both multivariate and univariate diffusion models with a linear drift function. The approach includes the Euler method and the trapezoidal method as special cases, an asymptotic theory is developed, and finite sample bias comparisons are conducted using analytic approximations. Bias is decomposed into
κ = 0.5 Bias Std err RMSE
a discretization bias and an estimation bias. It is shown that the discretization bias is of order O(h) for the Euler method and O(h2 ) for the trapezoidal method, respectively, whereas the estimation bias is of the order of O(T −1 ). Since in practical applications in finance it is very likely that h is much smaller than 1/T , estimation bias is likely to dominate discretization bias. Applying the multivariate theory to univariate models gives several new results. First, it is shown that in the Euler and trapezoidal methods, the sign of the discretization bias is opposite that of the estimation bias for practically realistic cases. Consequently, the bias in the two approximate method is smaller than the ML estimator based on the exact discrete time model. Second, although the trapezoidal method leads to a smaller discretization bias than the Euler method, the estimation bias is bigger. As a result, it is not clear if there is a gain in reducing the total bias by using a higher order approximation. When comparing the estimator based on the Euler method and the exact ML, we find that the asymptotic variance of the former estimator is smaller. As a result, there is clear evidence for preferring the estimator based on the Euler method to the exact ML in the univariate linear diffusion when the mean reversion is slow. Simulations suggest the bias continues to be large in finite samples. It is also confirmed that for empirically relevant cases, the magnitude of the discretization bias in the two approximate methods is much smaller than that of the estimation bias. The two approximate methods lead to a smaller variance than exact ML. Most importantly for practical work, there is strong evidence that the bias formulae work well and so they can be recommended for analytical bias correction with these models. For the univariate square root model, the Euler method is found to have smaller bias and smaller variance than the Nowman method. Discretizing the diffusion function both in the Euler method and the Nowman method causes no discretization bias on the mean reversion parameter. For the Euler method, we have derived an explicit expression for the discretization bias caused by discretizing the drift function. The simulation results suggest that the Euler method performs best in terms of both bias and variance. The analytic and expansion results given in the paper are obtained for stationary systems. Bias analysis for nonstationary and explosive cases require different methods. For diffusion models with constant diffusion functions, it may be possible to extend recent finite sample and asymptotic expansion results for the discrete time AR(1) model (Phillips, 2010) to a continuous time setting. Such an analysis would involve a substantial extension of the present work and deserves treatment in a separate study.
242
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
Acknowledgements
Aˆ − A =
Thanks go to the referee, an associate editor, the editor and the seminar participants at the SETA 2010 meeting, the QMBA 2010 meeting, the 2010 RMI Annual Conference, the 2010 Tsinghua Econometrics Summer Workshop for helpful comments on the original version. Phillips gratefully acknowledges support from the NSF under Grant No. SES 09-56687. Yu gratefully acknowledges support from the Singapore Ministry of Education AcRF Tier 2 fund under Grant No. T206B4301-RS. Appendix
=
h 2 h 2 h
2
(Fˆ − I )(Fˆ + I )−1 − (F − I )(F + I )−1 − ν h
(Fˆ + I − 2I )(Fˆ + I )
−1
[I − 2(Fˆ + I ) ] − [I − 2(F + I )−1 ] − ν h
= − [(Fˆ + I )
−1
h
4 h
h
2
−1
4
=
2
− (F − I )(F + I )−1 − ν
− (F + I )−1 ] − ν
(I + F )−1 (Fˆ − F )(I + Fˆ )−1 − ν.
(A.6) p
As h is fixed, according Lemma 3.1, as n → ∞, Fˆ → F , the first part of above equation goes to zero. And from formula (3.19),
′
Proof of Lemma 3.3. Let C = F and then ∞ −
=
2
p
F ′k = (I − F ′ )−1 = (1 − C ),
(A.1)
Aˆ − A → −ν =
2 h
(F − I )(F + I )−1 − A.
t =0
∞ −
F ′k tr(F k+1 ) =
k =0
∞ −
−
F ′k
k=0
−
Proof of Theorem 3.3b.
λk+1
λ∈spec (F ) ∞ −
[ ] 2 −1 ˆ ˆ Vec(A − A + ν) = Vec A − (F − I )(F + I ) h
λ λF k =0 λ∈spec (F ) ∞ − − k k = λ λC k=0 λ∈− spec (C ) = λ(I − λC )−1 ,
=
k ′k
= = (A.2)
λ∈spec (C )
where Spec(C ) denotes the set of eigenvalues of C . Thus, ∞ −
F ′2k+1 =
k =0
∞ −
Γ (0) = Var(xt ) =
∞ −
(A.3)
F · G · F = D, b n
+ O(n
− 23
).
(A.5)
Proof of Theorem 3.1. By Lemma 3.1, for fixed h, as n → ∞, p
Fˆ → F . Hence, h
d
n(Fˆ − F ) → N (0, Γ (0)−1 ⊗ G), and we get
1
p
[Fˆ − F ] + H → h
1 h
4 E [Aˆ ] − A = − E [(Fˆ + I )−1 − (F + I )−1 ] − ν h 4 4 = − E [(Fˆ + I )−1 ] + (F + I )−1 − ν. h h
= [(I + F )(I + (I + F )−1 (Fˆ − F ))]−1 = [I + (I + F )−1 (Fˆ − F )]−1 (I + F )−1 , and
[I + (I + F )−1 (Fˆ − F )]−1 =
i =0
= I − (I + F ) (Fˆ − F ) + [(I + F )−1 (Fˆ − F )]2 ∞ − + (−1)i [(I + F )−1 (Fˆ − F )]i .
d
Proof of Theorem 3.2. According to formulae (3.8), (3.9) and Lemma 3.3, 1 1 1 E (Aˆ − A) = E (Fˆ − F ) + H = E h
=−
h
b T
h
−b n
+ O(n
1
+ H + o(T −1 ). h
Proof of Theorem 3.3a. From formulae (3.19),
∞ − (−1)i [(I + F )−1 (Fˆ − F )]i
−1
n Vec[Fˆ − F ] → N (0, (Γ (0))−1 ⊗ G),
giving the second part.
Υ = (F ′ + I )−1 ⊗ (F + I )−1 .
(Fˆ + I )−1 = (I + F + Fˆ − F )−1
d
=
d
(F − I )(F + I )−1 ] → N (0, Ψ ),
For the first term, we note that H.
n{Vec(Fˆ ) − Vec(F )} → N (0, (Γ (0))−1 ⊗ G), √ √ 1 ˆ − (F − I )] nh Vec[Aˆ − (F − I )] = n Vec[Ah h
√
h
where
From Eq. (3.8), 1h H = 1h [F − I − Ah] = O(h) as h → 0, proving the first part. (b) According to Lemma 3.1, fixed h, as n → ∞,
√
2
Proof of Theorem 3.4. From the proof of Theorem 3.3, we have
Bn = BIAS (Fˆ ) = E (Fˆ ) − F = −
1
{(Fˆ ′ + I )−1 ⊗ (F + I )−1 }Vec(Fˆ − F ).
(A.4)
i=0
Aˆ − A =
h
Vec[(I + F )−1 (Fˆ − F )(I + Fˆ )−1 ]
Ψ = 16Υ [Γ (0)−1 ⊗ G]Υ ′ ,
′i
i
√
√
C 2k+1 = C (I − C 2 )−1 ,
h 4
Again when h is fixed, according to Lemma 3.1, as n → ∞,
h n Vec[Aˆ −
k=0
4
−3/2
1
) + H h
i =3
By Lemma 3.1, we have
√
d
n[Vec(Fˆ ) − Vec(F )] → N (0, Γ (0)−1 ⊗ G),
and so,
1
Fˆij − Fij = OP n− 2 Then,
.
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
[(I + F )−1 (Fˆ − F )]3 = Op n
− 32
3
and [(I + F )−1 (Fˆ − F )]i = op n− 2
of Vec(ˆg )Vec(ˆg )′ . Defining ei to be the column vector of dimension M 2 whose ith element is 1 and other elements are 0, we have
,
i ≥ 3,
E [ˆgis gˆsj ] = e′M (s−1)+i E [Vec(ˆg )Vec(ˆg )′ ]eM (j−1)+s 1 = e′M (s−1)+i · ∆ · eM (j−1)+s + o(n−1 ), n
[I + (I + F )−1 (Fˆ − F )]−1 = I − (I + F )−1 (Fˆ − F ) 3 + [(I + F )−1 (Fˆ − F )]2 + Op n− 2 , and
n s=1
M 1− ′ eM (s−1)+i · ∆ · eM (j−1)+s . n s=1
Lij = Then
ˆ ) = L + o(n−1 ). E {[(I + F )−1 (Fˆ − F )]2 } = E (W Again, using Lemma 3.3, the formula for the estimation bias is 4
E [Aˆ − A] =
n · Vec[(I + F )−1 (Fˆ − F )]
√ d = [IM ⊗ (I + F )−1 ] n Vec(Fˆ − F ) → N (0, ∆),
n · Vec(ˆg ) = ∆ + o(1) → Var[Vec(ˆg )] =
∆ n
h
E {(I + F )−1 (Fˆ − F )(I + F )−1 } 4
− E {[(I + F )−1 (Fˆ − F )2 ](I + F )−1 } h 1 −3 + O n 2 −ν h [ 3 ] b 4 = (I + F )−1 − + O n− 2 (I + F )−1
where ∆ = [IM ⊗ (I + F )−1 ] · Γ (0)−1 ⊗ G · [IM ⊗ (I + F )−1 ]′ . As a result,
+ o(n−1 ),
h
and
n
−
E [Vec(ˆg ) · Vec(ˆg )′ ] = Var[Vec(ˆg )] + E [Vec(ˆg )] · E [Vec(ˆg )]′
=
∆ n
e′M (s−1)+i · ∆ · eM (j−1)+s + o(n−1 ).
Next, define the matrix L with (i, j) element
√
√
−1
=
Now let gˆ = [(I + F )−1 (Fˆ − F )], so that n · Vec[ˆg ] =
E [ˆgis gˆsj ]
s=1 M
4 E [Aˆ − A] = − E {[I + (I + F )−1 (Fˆ − F )−1 ]}(I + F )−1 h 4 + (F + I )−1 + O(h2 ) h 4 = E {(I + F )−1 (Fˆ − F )(I + F )−1 } h 4 − E {[(I + F )−1 (Fˆ − F )]2 (I + F )−1 } h 1 −3 + O n 2 − ν. h
Var
M −
ˆ ij ] = E [W
√
243
4 h 4
1 1 3 · L · (I + F )−1 + o(n−1 ) + O n− 2 − ν h
= − (I + F )
+ E [Vec(ˆg )] · E [Vec(ˆg )]′ + o(n−1 ).
h
· b · (I + F )
−1
−1
T
−
4 h
· L · (I + F )−1
− ν + o(T −1 ).
From Lemma 3.3, Bn = E (Fˆ ) − F = −
b n
Proof of Theorem 4.1. Using (4.8) and (4.9) in (4.7), we have
3 + O n− 2 .
When the exact discrete model involves an unknown B(θ ) we have
n − 1 t =1
h
(Xt − Xt −1 )Vt′ =
b = G (I − C )
−1
+ C (I − C )
2 −1
−
+
λ(I − λC )
−1
Γ (0) , −1
n 1 −
+
2h t =1
λ∈Spec(C )
and when we have a prior knowledge that B(θ ) = 0 in (2.2), we have b = G[C (I − C 2 )−1 +
−
−
λ(I − λC )−1 ]Γ (0)−1 .
λ∈Spec(C )
n 1 −
=
2h t =1
1 2h
→ E [Vec(ˆg )Vec(ˆg )′ ] =
n
ˆ = [(I + F )−1 (Fˆ − F )]2 = gˆ gˆ and W ˆ ij = Here we assume W ∑M ˆis gˆsj . It is easy to find that gˆis is the (M (s − 1) + i)th element s=1 g of Vec(ˆg ), and gˆis gˆsj is the (M (s − 1) + i, M (j − 1) + s)th element
Xt Xt′−1
n −
+ o(n−1 ).
×
n −
2h
−1 Xt −1 Xt′−1
t =1
Xt −1 Xt′
n −
′
Xt −1 Xt
t =1
−1 Xt −1 Xt′−1
−
n −
′
X t −1 X t
t =1
−1 n − ′ ′ Xt −1 Xt −1 X t −1 X t −1
t =1
=
n −
Xt −1 Xt′−1
t =1
1
2h t =1
t =1
n −
t =1
n
∆
n 1 −
Xt −1 Xt′
t =1
×
−1 Xt −1 Xt′−1
t =1
n
−
+
n −
t =1
= [IM ⊗ (I + F )−1 ]E [Vec(Fˆ − F )] = [IM ⊗ (I + F )−1 ]Vec[E (Fˆ − F )] [ 3 ] b = [IM ⊗ (I + F )−1 ]Vec − + O n− 2 = O(n−1 )
2h t =1
Xt Xt′−1 −
−1 n n − − ′ ′ X t X t −1 Xt −1 Xt −1 −I
Then E [Vec(ˆg )] = E [(IM ⊗ (I + F )−1 )Vec(Fˆ − F )]
Xt Xt′−1
n 1 −
ˆF − I + Fˆ
t =1
n − t =1
′
Xt −1 Xt
n − t =1
−1 Xt −1 Xt′−1
244
X. Wang et al. / Journal of Econometrics 161 (2011) 228–245
−
n −
Xt −1 Xt′
n −
t =1
t =1
=
1 2h
n −
(Fˆ − I ) I +
and n 1 − e−2κ h − ∂ g (Xi−1 ; ψ)/∂ψj ∂ℓ(θ ) = 0 ⇒ 0 = σ2 ∂ψj 2κ g (Xi−1 ; ψ) i =1
t =1
Xt −1 Xt′
t =1
n −
×
−1 n − ′ ′ Xt −1 Xt −1 Xt −1 Xt −1 n −
−1 Xt −1 Xt′−1
−
t =1
Xt −1 Xt′−1
Taking Eq. (A.13) into (A.14), the first term and the third term cancel and we obtain
t =1
=
=
1 2h 1 2h
(Fˆ − I )
n −
Xt −1 Xt′−1
+
t =1
(Fˆ − I )
n −
n −
n − [Xi − φ1 Xi−1 − (1 − φ1 )µ](Xi−1 − µ)
′
Xt −1 Xt
t =1
(Fˆ ′ + I ).
Xt −1 Xt′−1
t =1
2
=
(A.7)
t =1
1 4
(Fˆ + I )
n −
−1 ′ ˆ (F + I ) .
(A.8)
t =1
2 h
(Fˆ − I )(Fˆ + I )−1 .
[(1 − e−2κ h )/2κ]−1/2 √ 2π σ g (Xi−1 ; ψ) [Xi − φ1 Xi−1 − (1 − φ1 )µ]2 × exp − 2 2 , 2σ g (Xi−1 ; ψ)(1 − e−2κ h )/2κ
(A.10)
n
2
−
ln[g (Xi−1 ; ψ)] −
i=1
n
ℓ(θ ) = − ln(σ 2 ) − 2
−
n −
ln[g (Xi−1 ; ψ)]
i =1
n − [Xi − φ2 Xi−1 − (1 − φ2 )µ]2 i=1
2σ 2 hg 2 (Xi−1 ; ψ)
.
(A.18)
φˆ 2 = φˆ 1 .
(A.19)
References
and the following log-likelihood function
ℓ(θ) = − ln(σ 2 ) −
Eqs. (A.12), (A.16) and (A.17) yield the ML estimators, φˆ 1 , µ ˆ and ψˆ and Eq. (A.13) gives the ML estimator, σˆ 2 . The Euler approximate discrete model yields the following loglikelihood function,
It is easy to obtain the first order conditions, three of which are identical to those in (A.12), (A.16) and (A.17). Hence,
f (Xi Xi−1 ) =
n
g (Xi−1 ; ψ)
i =1
n − [Xi − φ1 Xi−1 − (1 − φ1 )µ]2 ∂ g (Xi−1 ; ψ)/∂ψj . (A.17) g 2 (Xi−1 ; ψ) g (Xi−1 ; ψ) i=1
(A.9)
Proof of Theorem 5.1. The Nowman approximate discrete time model yields the following transition function
g 2 (Xi−1 ; ψ)
n i=1
−
Xt −1 Xt′−1
(A.16)
n n 1 − [Xi − φ1 Xi−1 − (1 − φ1 )µ]2 − ∂ g (Xi−1 ; ψ)/∂ψj
−1 (Xt + Xt −1 )Vt′
= 0.
Taking Eq. (A.13) into (A.15), we have 0 =
Using the above two formulae in (4.7), the two stage least squares estimator is Aˆ =
g 2 (Xi−1 ; ψ)
i=1
By the same method, it is easy to obtain
n − 1
n − [Xi − φ1 Xi−1 − (1 − φ1 )µ]2 ∂ g (Xi−1 ; ψ)/∂ψj . (A.15) g 2 (Xi−1 ; ψ) g (Xi−1 ; ψ) i =1
n 2
ln
n − [Xi − φ1 Xi−1 − (1 − φ1 )µ]2 − . 2σ 2 g 2 (Xi−1 ; ψ)(1 − e−2κ h )/2κ i=1
−2 κ h
1−e
2κ (A.11)
The first order conditions are n − ∂ℓ(θ) [Xi − φ1 Xi−1 − (1 − φ1 )µ] =0⇒ = 0, (A.12) ∂µ g 2 (Xi−1 ; ψ) i =1 ∂ℓ(θ) 1 − e−2κ h 2 = 0 ⇒ σ ∂σ 2 2κ n 1 − [Xi − φ1 Xi−1 − (1 − φ1 )µ]2 − = 0, (A.13) n i =1 g 2 (Xi−1 ; ψ) [ ] ∂ℓ(θ) n 2he−2κ h 1 = 0⇒0=− − ∂κ 2 1 − e−2κ h κ n − [Xi − φ1 Xi−1 − (1 − φ1 )µ](Xi−1 − µ) − he−κ h σ 2 g 2 (Xi−1 ; ψ)(1 − e−2κ h )/2κ i =1 n − [Xi − φ1 Xi−1 − (1 − φ1 )µ]2 − 2σ 2 g 2 (Xi−1 ; ψ) [i=1 ] −2κ h 2(1 − e ) − 4κ he−2κ h × (A.14) (1 − e−2κ h )2
References

Aït-Sahalia, Y., 1999. Transition densities for interest rate and other non-linear diffusions. Journal of Finance 54, 1361–1395.
Aït-Sahalia, Y., 2002. Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica 70, 223–262.
Aït-Sahalia, Y., 2008. Closed-form likelihood expansions for multivariate diffusions. Annals of Statistics 36, 906–937.
Aït-Sahalia, Y., Kimmel, R., 2007. Maximum likelihood estimation of stochastic volatility models. Journal of Financial Economics 83, 413–452.
Aït-Sahalia, Y., Yu, J., 2006. Saddlepoint approximations for continuous-time Markov processes. Journal of Econometrics 134, 507–551.
Bao, Y., Ullah, A., 2009. On skewness and kurtosis of econometric estimators. Econometrics Journal 12, 232–247.
Bergstrom, A.R., 1966. Nonrecursive models as discrete approximations to systems of stochastic differential equations. Econometrica 34, 173–182.
Bergstrom, A.R., 1984. Continuous time stochastic models and issues of aggregation over time. In: Handbook of Econometrics, pp. 1146–1211.
Bergstrom, A.R., 1990. Continuous Time Econometric Modelling. Oxford University Press, Oxford.
Black, F., Scholes, M., 1973. The pricing of options and corporate liabilities. Journal of Political Economy 81, 637–654.
Chan, K., Karolyi, G.A., Longstaff, F., Sanders, A., 1992. An empirical comparison of alternative models of short term interest rates. Journal of Finance 47, 1209–1227.
Cox, J., Ingersoll, J., Ross, S., 1985. A theory of the term structure of interest rates. Econometrica 53, 385–407.
Duffie, D., Kan, R., 1996. A yield-factor model of interest rates. Mathematical Finance 6, 379–406.
Durham, G., Gallant, A.R., 2002. Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. Journal of Business and Economic Statistics 20, 297–316.
Elerian, O., 1998. A note on the existence of a closed-form conditional transition density for the Milstein scheme. Economics discussion paper 1998-W18, Nuffield College, Oxford.
Feller, W., 1951. Two singular diffusion problems. Annals of Mathematics 54, 173–182.
Florens-Zmirou, D., 1989. Approximate discrete-time schemes for statistics of diffusion processes. Statistics 20, 547–557.
Fuller, W.A., 1976. Introduction to Statistical Time Series. Wiley, New York.
Hannan, E.J., 1970. Multiple Time Series. Wiley, New York.
Hansen, L.P., Sargent, T.J., 1983. The dimensionality of the aliasing problem in models with rational spectral densities. Econometrica 51, 377–388.
Lo, A.W., 1988. Maximum likelihood estimation of generalized Itô processes with discretely sampled data. Econometric Theory 4, 231–247.
Milstein, G.N., 1978. A method of second-order accuracy integration of stochastic differential equations. Theory of Probability and its Applications 23, 396–401.
Nicholls, D.F., Pope, A.L., 1988. Bias in the estimation of multivariate autoregressions. Australian Journal of Statistics 30, 296–309.
Nowman, K.B., 1997. Gaussian estimation of single-factor continuous time models of the term structure of interest rates. Journal of Finance 52, 1695–1703.
Pedersen, A., 1995. A new approach to maximum likelihood estimation for stochastic differential equations based on discrete observations. Scandinavian Journal of Statistics 22, 55–71.
Phillips, P.C.B., 1972. The structural estimation of stochastic differential equation systems. Econometrica 40, 1021–1041.
Phillips, P.C.B., 1973. The problem of identification in finite parameter continuous time models. Journal of Econometrics 1, 351–362.
Phillips, P.C.B., 2010. Folklore theorems, implicit maps, and new unit root limit theory. Working paper, Yale University.
Phillips, P.C.B., Yu, J., 2005a. Jackknifing bond option prices. Review of Financial Studies 18, 707–742.
Phillips, P.C.B., Yu, J., 2005b. Comment: a selective overview of nonparametric methods in financial econometrics. Statistical Science 20, 338–343.
Phillips, P.C.B., Yu, J., 2009a. Maximum likelihood and Gaussian estimation of continuous time models in finance. In: Handbook of Financial Time Series, pp. 707–742.
Phillips, P.C.B., Yu, J., 2009b. Simulation-based estimation of contingent-claims prices. Review of Financial Studies 22, 3669–3705.
Piazzesi, M., 2009. Affine term structure models. In: Aït-Sahalia, Y., Hansen, L. (Eds.), Handbook of Financial Econometrics. North-Holland.
Sargan, J.D., 1974. Some discrete approximations to continuous time stochastic models. Journal of the Royal Statistical Society, Series B 36, 74–90.
Sundaresan, S.M., 2000. Continuous-time models in finance: a review and an assessment. Journal of Finance 55, 1569–1622.
Tang, C.Y., Chen, S.X., 2009. Parameter estimation and bias correction for diffusion processes. Journal of Econometrics 149, 65–81.
Ullah, A., Wang, Y., Yu, J., 2010. Bias in the mean reversion estimator in continuous time Gaussian and Lévy processes. Working paper, Sim Kee Boon Institute for Financial Economics, Singapore Management University.
Vasicek, O., 1977. An equilibrium characterization of the term structure. Journal of Financial Economics 5, 177–186.
Yamamoto, T., Kunitomo, N., 1984. Asymptotic bias of the least squares estimator for multivariate autoregressive models. Annals of the Institute of Statistical Mathematics 36, 419–430.
Yu, J., 2009. Bias in the estimation of mean reversion parameter in continuous time models. Working paper, Sim Kee Boon Institute for Financial Economics, Singapore Management University.
Journal of Econometrics 161 (2011) 246–261
Testing for weak identification in possibly nonlinear models

Atsushi Inoue a,∗, Barbara Rossi b

a Department of Agricultural and Resource Economics, North Carolina State University, Raleigh, NC 27695-8109, United States
b Department of Economics, Duke University, Durham, NC 27708-0097, United States
Article history: Received 30 September 2008; Received in revised form 15 November 2010; Accepted 22 December 2010; Available online 4 January 2011.

JEL classification: C12

Keywords: GMM; Shrinkage; Weak identification

Abstract: In this paper we propose a chi-square test for identification. Our proposed test statistic is based on the distance between two shrinkage extremum estimators. The two estimators converge in probability to the same limit when identification is strong, and their asymptotic distributions are different when identification is weak. The proposed test is consistent not only for the alternative hypothesis of no identification but also for the alternative of weak identification, which is confirmed by our Monte Carlo results. We apply the proposed technique to test whether the structural parameters of a representative Taylor-rule monetary policy reaction function are identified. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

The validity of statistical inference in a growing number of macroeconomic models has been questioned in the recent literature. Many of these models are estimated using first order moment conditions and exploiting exogenous instruments, such as in the widely used Generalized Method of Moments (GMM) estimation procedure. As Nelson and Startz (1990a,b) discovered, however, inference is unreliable when the correlation between instruments and endogenous variables is ''weak'', a situation referred to as the ''weak identification'' (or ''weak instruments'') problem. See Canova and Sala (2009), Iskrev (unpublished manuscript) and Ruge-Murcia (2007) for empirical evidence in dynamic stochastic general equilibrium (DSGE) models, Mavroeidis (2010) for the monetary policy rule, Nason and Smith (2005) and Dufour et al. (2006) for the new Keynesian Phillips curve, and Yogo (2004) for consumption Euler equations, to name a few. While methods to construct confidence sets that are robust to weak identification have been recently developed, they can be too large to be informative; in addition, applied researchers are often interested in point estimates, in which case their main interest is in whether a model is identified or not. This paper proposes a new test for identification by testing the null hypothesis of strong identification against the alternative
hypothesis of weak (or no) identification. Our proposed test statistic is based on the distance between two bias-corrected shrinkage extremum estimators. Under the null hypothesis of strong identification, the two estimators converge in probability to the same limit and the proposed test statistic has an asymptotic chi-square distribution. Under the alternative hypothesis of weak identification, they converge weakly to different random variables. Our test overcomes two limitations existing in the literature. First, the proposed test is consistent not only for the alternative hypothesis of no identification but also for the alternative of weak identification, whereas existing tests mainly focus on the alternative hypothesis of strict non-identification. Second, our test has the advantage of being applicable to both linear and nonlinear models that may have a large number of parameters, whereas existing tests can only be applied to models with a limited number of parameters and mainly to linear models or non-linear models where the second derivative is independent of the parameter vector. In the existing literature on identification, identification is often defined in terms of the underlying probability distribution function (see Hsiao, 1983). In many econometric problems, however, true probability measures or likelihood functions are not available to the econometrician, and parameters are estimated by extremum estimators. In this case, we say that parameters are identified if there is a unique minimizer of the estimation objective function. This definition of identification has been extensively used in the econometric literature (see Amemiya, 1985, Gallant and White, 1988, and Newey and McFadden, 1994, for example). We follow this definition of identification in our paper, and refer to this
definition of identification as the ‘‘identification condition for extremum estimators’’.1 Identification restrictions traditionally take the form of exclusion restrictions (see Hsiao, 1983). In the linear simultaneous equation model, instruments are exogenous if they are excluded from the equation of interest. However, the validity of instruments also requires instruments to be relevant. When instruments are only weakly correlated with the endogenous variables, the TSLS estimator is biased towards the probability limit of the OLS estimator and standard inference performs poorly (Bound et al., 1995; Nelson and Startz, 1990a,b). To explain the Monte Carlo findings, Staiger and Stock (1997) and Stock and Wright (2000) propose an alternative asymptotic theory in which the correlation is modeled local to zero, and refer to it as ‘‘weak identification’’. Our paper is interested in this concept of identification, and focuses on the relevance condition while maintaining the assumption that the exogeneity conditions hold. In our paper, we test the null hypothesis that this correlation is nonzero and is not local to zero against the alternative that it is local to zero. A few other papers have considered tests in the presence of weak instruments. In particular, Stock and Yogo (2005) propose to test the null hypothesis that the correlation between endogenous variables and instruments is local to zero against the alternative that it is not local to zero. Hahn and Hausman (2002) test the null that this correlation is local to zero against the alternative that it is fixed and different from zero, as we do. Our paper is related to these tests, but differs in a crucial way. The advantage of our test relative to that in Stock and Yogo (2005) is that our test does not rely on the Hessian of the objective function whereas the latter test does. Since the Hessian depends on nuisance parameters in nonlinear models, it is unclear how to extend the methods by Stock and Yogo (2005) and Hahn and Hausman (2002) to nonlinear models. Our test can instead be applied to both linear and non-linear models. Wright (2002) proposes a test for the null hypothesis of strong identification by comparing the volume of Wald confidence sets and that of Stock and Wright’s (2000) S confidence set. The difference between the two volumes is bounded in probability when the parameters are strongly identified, and diverges to infinity when parameters are weakly identified (because Wald confidence sets are not robust to weak identification whereas the S set is). A potential drawback of this test is that it is not applicable when the number of parameters is more than two. The rank test of Wright (2003) tests the null hypothesis that the relevance condition does not hold against the alternative that it holds. Because his test does not allow for weak identification, the asymptotic null distribution depends on nuisance parameters that cannot be consistently estimated. In fact, our Monte Carlo experiment shows that the rank test of Wright (2003) can suffer from the size distortion when instruments are weak. There is also a relationship between the tests proposed in this paper and literatures on (i) tests of rank; (ii) reduced rank regression; (iii) tests of over-identification; (iv) tests of no identification; (v) tests of weak identification; and (vi) empirical applications of tests of weak identification. 
In Section 3.1 we review these literatures in detail and consider a simple linear IV model to illustrate the differences between the existing tests and our test. The advantages of our approach relative to the above mentioned literatures can be summarized as follows. We test the null of strong identification rather than no identification, so that
1 A referee suggested to use ‘‘Q -identification’’ to refer to this definition of identification. Although we like the suggestion of the referee, we believe it would be confusing since a large body of the literature uses this definition of identification. This is why we call it instead the ‘‘identification condition for extremum estimators’’.
there is no nuisance parameter under the null hypothesis in our setup. Our test allows us to: (i) avoid highly time-consuming searches over the set of all possible parameter configurations that satisfy the null hypothesis of weak identification (as our null hypothesis is strong identification); (ii) have a test with exact size; and (iii) obtain a test that is suitable for highly parameterized nonlinear models, and therefore is especially useful for researchers interested in addressing issues of identification in macroeconomic models.

The idea of shrinkage has been used in the recent literature on many and weak instruments. Carrasco (unpublished manuscript) considers regularization of two-stage least squares estimators in the presence of many instruments. Okui (forthcoming) uses shrinkage in linear simultaneous equations with many instruments and with many weak instruments. While they focus on the estimation problem in linear simultaneous equations, our focus is on testing for identification in possibly nonlinear models. Monte Carlo simulations confirm that our test has good size and power for reasonable sample sizes. To show the usefulness of the proposed technique, we present an empirical application to the analysis of identification of the parameters of a Taylor rule monetary policy reaction function. We find that the monetary policy parameters were identified in the pre-Volker period, but not in the Volker–Greenspan era.

The rest of the paper is organized as follows. Section 2 presents the assumptions and the theoretical results; Section 3 reviews the related literature and provides a local power analysis; Section 4 discusses the empirical implementation of the proposed test; Section 5 shows Monte Carlo results using both the Consumption Capital Asset Pricing Model (CCAPM) and the Taylor rule model; the final section provides an empirical application addressing the issue of whether the parameters in the US monetary policy reaction function are identified.

Lastly, we mention notational conventions that are used throughout the paper. Let ∇_x f(x), ∇_xx f(x) and ∇_xxx f(x) denote the gradient vector (∂/∂x)f(x), the Hessian matrix (∂²/∂x∂x′)f(x) and the matrix of third derivatives (∂/∂x′)vec(∇_xx f(x)), respectively. When x = [x_1′, x_2′]′, we will sometimes write f(x) as f(x_1, x_2), not f([x_1′, x_2′]′), to simplify the notation. ‖x‖ = (Σ_{i=1}^n x_i²)^{1/2} is the Euclidean norm of x when x is an (n × 1) vector, and ‖A‖ = max_{‖x‖=1} ‖Ax‖ is the matrix norm when A is an (m × n) matrix. Finally, I_k denotes the (k × k) identity matrix.
2. Assumptions and theorems

Consider an extremum estimator θ̂_T that maximizes some objective function Q_T(θ),

θ̂_T = arg max_{θ∈Θ} Q_T(θ),  (1)

where Θ ⊂ ℜ^k. (1) includes maximum likelihood, classical minimum distance estimators and generalized method of moments estimators, as discussed in Gallant and White (1988) and Newey and McFadden (1994). A shrinkage estimator coaxes the parameter estimate in some direction by imposing possibly incorrect restrictions,

θ̃_T = arg max_{θ∈Θ} [Q_T(θ) − (λ_T/2)‖θ − θ̄‖²],  (2)
where {λ_T} is a sequence of positive constants that converges to zero as T → ∞. A well known shrinkage estimator is a ridge regression estimator with θ̄ = 0_{k×1} (Hoerl and Kennard, 1970a,b). We are interested in testing the null hypothesis of strong identification, whose definition is as follows.

Definition (The Null Hypothesis). Under the null hypothesis, the parameters are strongly identified, that is: plim_{T→∞} Q_T(θ) is uniquely maximized at some θ_0 ∈ Θ, where Θ is compact in ℜ^k.
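For intuition, the following is a minimal numerical sketch of the estimators in (1) and (2); the toy objective Q_T, the simulated data, and the ridge-style target θ̄ = 0 are illustrative assumptions, not objects defined in the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T = 500
x = rng.normal(size=T)             # instrument
y = 1.5 * x + rng.normal(size=T)   # toy model with true theta0 = 1.5

def Q_T(theta):
    # toy extremum objective: negative squared sample moment of x*(y - theta*x)
    m = np.mean(x * (y - theta * x))
    return -m**2

theta_bar = 0.0                    # shrinkage target, an assumption of the sketch
lam_T = 1.0 / np.sqrt(T)           # lambda_T = kappa * T^{-1/2} with kappa = 1

# extremum estimator (1): maximize Q_T, i.e. minimize -Q_T
theta_hat = minimize(lambda th: -Q_T(th[0]), x0=[0.0]).x[0]

# shrinkage estimator (2): penalized objective
theta_tilde = minimize(
    lambda th: -(Q_T(th[0]) - 0.5 * lam_T * (th[0] - theta_bar) ** 2),
    x0=[0.0],
).x[0]

print(theta_hat, theta_tilde)      # theta_tilde is coaxed slightly towards theta_bar
```

With strong identification, as here, both estimates are close to the true value 1.5 and to each other, which is the behavior the test exploits.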
Suppose θ = [α′, β′]′, where α is possibly weakly identified and β is always strongly identified. Note that it is possible that there are no strongly identified parameters, and our analysis allows for that possibility. Note that empirical researchers do not need to know which parameters are possibly weakly identified and which are strongly identified in order to implement our method in practice. The distinction between α and β is made only for the theoretical derivations. Our objective is to test the null hypothesis that the parameter θ_0 = [α_0′, β_0′]′ is strongly identified against the alternative hypothesis that α_0 is only weakly identified in a sense that we will make precise shortly. We will impose the following set of assumptions:

Assumptions. (a) Θ = Θ_A × Θ_B is non-empty and compact in ℜ^k, where Θ_A ⊂ ℜ^{k_1} and Θ_B ⊂ ℜ^{k_2}, k_1 + k_2 = k.
(b) Q_T(θ) is twice continuously differentiable in θ.
(c) Under the null hypothesis H_0, there is a function Q(θ) such that (i) Q(θ) is twice continuously differentiable, is uniquely maximized at θ_0 = [α_0′, β_0′]′ ∈ int(Θ), and satisfies sup_{θ∈Θ}|Q_T(θ) − Q(θ)| = o_p(1); (ii) T^{1/2}[∇_θ Q_T(·) − ∇_θ Q(·)] ⇒ Z(·) holds on Θ, where ⇒ denotes weak convergence of random functions on Θ with respect to the sup norm and Z(·) is a zero-mean Gaussian process with covariance kernel Σ(θ_1, θ_2) = E(Z(θ_1)Z(θ_2)′) that is positive definite at θ_1 = θ_2 = θ_0; and (iii) ∇_θθ Q(θ_0) is non-singular and sup_{θ∈Θ}‖∇_θθ Q_T(θ) − ∇_θθ Q(θ)‖ = O_p(T^{−1/2}).
(d) Under the alternative H_1: (i) There are stochastic processes on Θ, Q_α(θ), Q_αβ(θ) and Q_β(β), such that sup_{θ∈Θ}‖Q_T(θ) − T^{−1}Q_α(θ) − T^{−1/2}Q_αβ(θ) − Q_β(β)‖ = O_p(T^{−1/2}), sup_{α∈Θ_A}‖Q_T(α, β_0) − T^{−1}Q_α(α, β_0) − Q_β(β_0)‖ = o_p(T^{−1}), and sup_{θ∈Θ}|Q_α(θ)| is bounded with probability one; (ii) There is a stochastic process G_α(θ) such that sup_{α∈Θ_A}‖T∇_α Q_T(α, β_0) − G_α(α, β_0)‖ = o_p(1); (iii) There are stochastic processes H_αα(θ), H_αβ(θ) and H_βα(θ) such that sup_{θ∈Θ}‖T∇_αα Q_T(θ) − H_αα(θ)‖ = o_p(1), sup_{θ∈Θ}‖T^{1/2}∇_αβ Q_T(θ) − H_αβ(θ)‖ = o_p(1), and sup_{θ∈Θ}‖T^{1/2}∇_βα Q_T(θ) − H_βα(θ)‖ = o_p(1); and (iv) Q_β(β) satisfies Assumption (c) with Q(θ), θ_0 ∈ int(Θ), ∇_θ Q_T(θ), ∇_θ Q(θ), Z(θ), Σ(θ_1, θ_2), ∇_θθ Q_T(θ) and ∇_θθ Q(θ) replaced by Q_β(β), β_0 ∈ int(Θ_B), ∇_β Q_T(θ), ∇_β Q(β), Z_β(θ), Σ_ββ(β_1, β_2), ∇_ββ Q_T(θ) and ∇_ββ Q_β(β), respectively, where Z_β is a k_2-dimensional zero-mean Gaussian process with covariance kernel Σ_ββ(β_1, β_2) ≡ E[Z_β(β_1)Z_β(β_2)′].
(e) λ_T = κT^{−1/2} for some κ ∈ (0, ∞).
(f) There is a unique α* ∈ Θ_A that maximizes

Q_α(α, β_0) + Z_β(α, β_0)′b*(α) + (1/2)b*′(α)∇_ββ Q_β(β_0)b*(α),  (3)

where

b*(α) = −[∇_ββ Q_β(β_0)]^{−1} Z_β(α, β_0).  (4)
Remarks. 1. Assumptions (b)–(d) are high-level assumptions. Our definition of weak identification in Assumption (d) follows those of Staiger and Stock (1997) and Stock and Wright (2000). α is weakly identified if the part of the objective function that depends on α vanishes (Assumption (d)(i)) and the Hessian of the objective function with respect to α converges to zero at certain rates (Assumption (d)(iii)). Assumption (d) is satisfied in Staiger and Stock's (1997) linear Instrumental Variable (IV) models, in which

Q_T(θ) = −(1/T)(y − Yθ)′X(X′X)^{−1}X′(y − Yθ),  (5)

where y and Y are T × 1 and T × k matrices of endogenous variables and X is a T × ℓ matrix of exogenous variables linked to the regressors via the relationship Y = XΠ_0 + V, with V being a T × k matrix of error terms. In their model, our null and alternative hypotheses simplify to

H_0: rank(Π_0) = k and H_1: Π_0 = Π_T = T^{−1/2}C,  (6)
where C is an ℓ × k matrix of constants.

2. Assumption (d) is also satisfied in the generalized IV model considered in Stock and Wright (2000), in which

Q_T(θ) = −[(1/T) Σ_{t=1}^T φ_t(θ)]′ Ŵ_T [(1/T) Σ_{s=1}^T φ_s(θ)],  (7)
Q_α(θ) = −m_1(θ)′W m_1(θ),  (8)
Q_αβ(θ) = −2m_1(θ)′W m_2(β),  (9)
Q_β(β) = −m_2(β)′W m_2(β),  (10)

where φ_t(θ) is the moment function evaluated at observation t, E[T^{−1} Σ_{t=1}^T φ_t(θ)] = m_1(θ)/√T + m_2(β) + o(1), m_1(θ) and m_2(β) are some functions, and Ŵ_T is a weighting matrix that converges to W. See also Guggenberger and Smith (2005) who consider generalized empirical likelihood estimators under assumptions similar to those of Stock and Wright (2000).

3. We can cast our high-level assumptions into the classical minimum distance (CMD) estimation framework. Suppose that Π_0 = g(θ_0), where Π_0 denotes a vector of reduced-form parameters, θ_0 denotes structural parameters and g(·) maps the structural parameters into the reduced-form parameters. For example, Π_0 is a vector of impulse responses, θ is a vector of structural parameters of a dynamic stochastic general equilibrium (DSGE) model and g(·) is the mapping implied by the DSGE model. The CMD estimator maximizes
Q_T(θ) = −(Π̂_T − g(θ))′ Ŵ_T (Π̂_T − g(θ)),

where Π̂_T is a consistent estimator of Π_0 and Ŵ_T is the weighting matrix. Assumptions (b) and (c) are satisfied under the standard assumptions, such as asymptotic normality of the estimator of the reduced-form parameters and smoothness of the function g(·). Assumption (d) is satisfied if g(θ) = g_β(β) + T^{−1/2}g_θ(θ) under the alternative hypothesis.

4. Under the alternative hypothesis, parameters can be all unidentified, i.e., α = θ, β = ∅, k_1 = k and k_2 = 0.

5. While our nonlinear framework is general, our assumptions rule out the use of heteroskedasticity autocorrelation consistent (HAC) covariance matrix estimators. Because the HAC covariance matrix estimator is a nonparametric estimator, it converges at a rate slower than T^{1/2}, and estimators with HAC covariance matrix estimators will violate Assumption (d). Dynamic models based on rational expectations typically imply that Euler residuals and one-period-ahead forecast errors are serially uncorrelated and do not require the use of HAC covariance matrix estimators.

6. The shrinkage parameter, λ_T, determines the harshness of the penalty term. Assumption (e) requires that λ_T converges to zero so that the two objective functions converge in probability
to the same limit. As a result, the two estimators converge in probability to the true parameter value under the null hypothesis. Assumption (e) also requires that λ_T does not converge to zero too fast, so that the two estimators behave differently under the alternative hypothesis.

7. Existence of a unique maximizer in Assumption (f) only simplifies the asymptotic distribution of the weakly identified parameter, α. The consistency of our proposed test does not necessarily require this assumption, which is made for convenience only. Stock and Wright (2000, p. 1062) impose an analogous assumption in their Theorem 1(ii).
In what follows, we will first derive the asymptotic properties of both the extremum estimator and the shrinkage estimator. Under the null hypothesis (strong identification), both estimators are consistent. However, under the alternative hypothesis (weak identification), the extremum estimator does not converge to any constant whereas the shrinkage estimator converges in probability to the value it is shrunk towards. This implies that one cannot construct a consistent test against weak identification using the extremum estimator and is the reason why we focus on shrinkage estimators in this paper.

Theorem 1 (Asymptotic Distributions of Extremum Estimators). Suppose that assumptions (a)–(f) hold.

(a) Under the null hypothesis,

T^{1/2}(θ̂_T − θ_0) →d N(0_{k×1}, [∇_θθ Q(θ_0)]^{−1} Σ(θ_0, θ_0) [∇_θθ Q(θ_0)]^{−1}),  (11)

T^{1/2}(θ̃_T − θ_0 − λ_T B_T(θ_0)) →d N(0_{k×1}, [∇_θθ Q(θ_0)]^{−1} Σ(θ_0, θ_0) [∇_θθ Q(θ_0)]^{−1}),  (12)

where B_T(θ_0) = [M_T(θ_0)]^{−1}(θ_0 − θ̄) and M_T(θ) = ∇_θθ Q(θ) − λ_T I_k.

(b) Under the alternative hypothesis,

[α̂_T′, T^{1/2}(β̂_T − β_0)′]′ ⇒ [α*′, b*′(α*)]′,  (13)

[ Tλ_T(α̃_T − ᾱ) ; T^{1/2}(β̃_T − β_0 − λ_T[∇_ββ Q_β(β_0)]^{−1}(β_0 − β̄)) ] ⇒ [ G_α(ᾱ, β_0) − H_αβ(ᾱ, β_0)(Z_β(ᾱ, β_0) − κ(β_0 − β̄)) ; −[∇_ββ Q_β(β_0)]^{−1} Z_β(ᾱ, β_0) ],  (14)

where α* and b*(α) are defined in (3) and (4), respectively, in Assumption (f), and θ̄ = [ᾱ′, β̄′]′ ∈ Θ.

Remarks. Eq. (11) in part (a) of Theorem 1 is a standard result for extremum estimators and is presented for reference. Eq. (12) shows that the shrinkage estimator has a higher-order bias term but has the same asymptotic distribution as the extremum estimator. This is because λ_T converges to zero at rate T^{−1/2}. Part (b) shows that the two estimators behave differently in the presence of weakly identified parameters. As Stock and Wright (2000) show for the GMM estimator, the extremum estimator is inconsistent and converges to a random variable. The shrinkage estimator converges in probability to θ̄ because the restriction imposed on the shrinkage estimator constrains the shrinkage estimator in the limit when the parameter is weakly identified.

Consider two extremum estimators,

θ̂_{1T} = arg max_{θ∈Θ} Q_{1T}(θ),  (15)
θ̂_{2T} = arg max_{θ∈Θ} Q_{2T}(θ),  (16)

and their shrinkage versions,

θ̃_{1T} = arg max_{θ∈Θ} [Q_{1T}(θ) − (λ_T/2)‖θ − θ̄‖²],  (17)
θ̃_{2T} = arg max_{θ∈Θ} [Q_{2T}(θ) − (λ_T/2)‖θ − θ̄‖²].  (18)

For example, θ̂_{1T} and θ̂_{2T} can be GMM estimators with identity and optimal weighting matrices. Define a test statistic by

R̂_T = d̂_T′ (D̂_T′ Σ̂_T D̂_T)^{−1} d̂_T,  (19)

where

d̂_T = T^{1/2}(θ̃_{2T} − θ̃_{1T} − λ_T B̂_{2T} + λ_T B̂_{1T}),  (20)

D̂_T′ = [ −[∇_θθ Q_{1T}(θ̃_{1T}) − λ_T I_k]^{−1}, [∇_θθ Q_{2T}(θ̃_{2T}) − λ_T I_k]^{−1} ],
Σ̂_T = [ Σ̂_{11,T}, Σ̂_{12,T} ; Σ̂_{21,T}, Σ̂_{22,T} ],  (21)

and B̂_{jT} = [∇_θθ Q_{jT}(θ̃_{j,T}) − λ_T I_k]^{−1}(θ̂_{jT} − θ̄) for j ∈ {1, 2}; Σ̂_T is a consistent estimator of the asymptotic covariance matrix of T^{1/2}[∇_θ Q_{1T}(θ_0)′ ∇_θ Q_{2T}(θ_0)′]′.

In order to ensure that the test statistic has a well-defined limiting distribution under the null hypothesis and that the test is consistent under the alternative, we make additional assumptions.

Assumptions. (g) α_1* ≠ α_2* with probability one, where α_1* and α_2* are defined in (3) for θ̂_{1T} and θ̂_{2T}, respectively.
(h) (i) Under the null hypothesis, Σ̂_T is a consistent estimator of Σ ≡ AVar(T^{1/2}[∇_θ Q_{1T}(θ_0)′ ∇_θ Q_{2T}(θ_0)′]′), and D′ΣD is non-singular, where

D = [ −[∇_θθ Q_1(θ_0)]^{−1} ; [∇_θθ Q_2(θ_0)]^{−1} ],  (22)
Σ = [ Σ_{11}, Σ_{12} ; Σ_{21}, Σ_{22} ].  (23)

(ii) Under the alternative hypothesis, there are random matrices Σ_{11}*, Σ_{12}*, Σ_{21}* and Σ_{22}* such that

[ T^{1/2}I_{k_1}, 0_{k_1×k_2} ; 0_{k_2×k_1}, I_{k_2} ] Σ̂_{ij,T} [ T^{1/2}I_{k_1}, 0_{k_1×k_2} ; 0_{k_2×k_1}, I_{k_2} ] ⇒ Σ_{ij}*  (24)

for i, j = 1, 2.

Remarks. 1. Assumption (g) requires that the two extremum estimators converge to different random variables when the parameters are weakly identified. Consider a linear simultaneous equation model with two endogenous variables, for example. Let N(µ, Σ) denote a normally distributed random vector with mean µ and covariance matrix Σ. Then the GMM estimator with the identity weighting matrix converges weakly to the random variable that maximizes a non-central χ² random function of α ∈ Θ_A:

N(E(z_i z_i′)C(α − α_0), E(ε_i − (α − α_0)η_i)² E(z_i z_i′))′ × N(E(z_i z_i′)C(α − α_0), E(ε_i − (α − α_0)η_i)² E(z_i z_i′)),  (25)

where z_i is a l × 1 vector of instruments, C is a l × 1 vector of Pitman drift parameters such that Π = T^{−1/2}C, ε_i is the disturbance term in the structural equation, and η_i is the disturbance term of the reduced form equation for the endogenous variable included on the right hand side of the
structural equation. The two-stage least squares estimator converges weakly to the random variable that maximizes another non-central χ² random function of α ∈ Θ_A:

N(E(z_i z_i′)^{−1/2} C(α − α_0), E(ε_i − (α − α_0)η_i)² I_l)′ × N(E(z_i z_i′)^{−1/2} C(α − α_0), E(ε_i − (α − α_0)η_i)² I_l).  (26)
Unless the instruments are orthonormal, i.e., E(z_i z_i′) = cI_l for some c > 0, α* and α** are different in general and Assumption (g) is satisfied when parameters are weakly identified. Corollary 4 of Stock and Wright (2000, p. 1067) also shows that different weighting matrices lead to different limits of GMM estimators.

2. Assumption (h)(i) requires that the two extremum estimators θ̂_{1T} and θ̂_{2T} have different asymptotic covariance matrices. For just identified linear regression models, OLS and GLS estimators have different asymptotic covariance matrices in general. In general, however, the assumption is not satisfied for just-identified moment restriction models. For over-identified models, this assumption is likely to be satisfied if the two estimators use different weighting matrices. For example:

  Weighting matrix 1        Weighting matrix 2                                Assumption (h)(i) is satisfied if:
  Identity matrix           The inverse of the cross product of instruments   Instruments are non-orthogonal
  Optimal weighting matrix  The inverse of the cross product of instruments   Conditional heteroskedasticity is present

Similar arguments apply to classical minimum distance estimators. When reduced form parameters, such as the parameters of state space models and impulse responses, are functions of structural parameters, the structural parameters can be estimated from reduced form estimates via minimum distance. Two suitable estimators can be obtained by choosing different minimum distance estimators.

3. In IV and GMM estimation, one can achieve Assumptions (g) and (h)(i) by adding a relevant instrument. For example, if θ̂_{2T} is an IV/GMM estimator based on Z_1, then θ̂_{1T} is an IV/GMM estimator based on Z_1 and Z_2, where Z_2 is a set of relevant instruments.2 In empirical macroeconomics, we generally have plenty of candidates for Z_2, such as lagged values of Z_1.

4. As an example of Σ̂_T, consider a linear IV model. Let

θ̂_{1,T} = (Y′XŴ_TX′Y)^{−1} Y′XŴ_TX′y,  (27)
θ̂_{2,T} = (Y′XX′Y)^{−1} Y′XX′y,  (28)

where Ŵ_T = (1/T) Σ_{i=1}^T (y_i − θ̂_{2,T}′X_i)² X_iX_i′, X_i′, y_i and Y_i′ are the i-th rows of X, y and Y, respectively, and the rest of the notation follows the notation in Remark 1 above on Assumptions (a)–(f). Then Σ̂_T is an estimate of the covariance matrix of

[ Y′XŴ_TX_i(y_i − θ̂_{1,T}′X_i) ; Y′XX_i(y_i − θ̂_{2,T}′X_i) ].  (29)

5. Another example of Σ̂_T is for the GMM estimator in the second remark on Assumptions (a)–(f). Let θ̂_{1,T} and θ̂_{2,T} be the GMM estimators with weighting matrices [(1/T) Σ_{t=1}^T φ_t(θ̂_{2,T})φ_t(θ̂_{2,T})′]^{−1} and I_k. Then Σ̂_T is an estimate of the covariance matrix of

[ (1/T) Σ_{s=1}^T D_θφ_s(θ̂_{1,T})′ Ŵ_T φ_t(θ̂_{1,T}) ; (1/T) Σ_{s=1}^T D_θφ_s(θ̂_{2,T})′ φ_t(θ̂_{2,T}) ],  (30)

where D_θφ_s(θ) = [∇_θφ_{s,1}(θ) ∇_θφ_{s,2}(θ) ··· ∇_θφ_{s,l}(θ)]′ is the Jacobian matrix of φ_s(θ) and l = dim(φ_s(θ)).

Our main result is the asymptotic distribution of R̂_T. We state it formally in the following theorem.

Theorem 2 (Asymptotic Properties of the Proposed Test Statistic). Suppose that Assumptions (a)–(f) hold for Q_{1T}(θ) and Q_{2T}(θ) with common θ_0, as well as Assumptions (g) and (h).

(a) If the null hypothesis H_0 is true,

R̂_T →d χ_k².  (31)

(b) If the alternative hypothesis H_1 is true and if M_1′Σ_{11}*M_1 − M_1′Σ_{12}*M_2 − M_2′Σ_{21}*M_1 + M_2′Σ_{22}*M_2 is non-singular,

(1/T) R̂_T ⇒ κ² [α_2* − α_1* ; 0_{k_2×1}]′ (M_1′Σ_{11}*M_1 − M_1′Σ_{12}*M_2 − M_2′Σ_{21}*M_1 + M_2′Σ_{22}*M_2)^{−1} [α_2* − α_1* ; 0_{k_2×1}],  (32)

where

M_1 = [ −I_{k_1}, 0_{k_1×k_2} ; (∇_ββ Q_{1,β}(β_0))^{−1}H_{1,βα}(ᾱ, β_0), (∇_ββ Q_{1,β}(β_0))^{−1} ],
M_2 = [ −I_{k_1}, 0_{k_1×k_2} ; (∇_ββ Q_{2,β}(β_0))^{−1}H_{2,βα}(ᾱ, β_0), (∇_ββ Q_{2,β}(β_0))^{−1} ].

If M_1′Σ_{11}*M_1 − M_1′Σ_{12}*M_2 − M_2′Σ_{21}*M_1 + M_2′Σ_{22}*M_2 is singular,

(1/T) R̂_T ⇒ ∞.  (33)
Remarks. 1. Theorem 2 shows that one can use central χ² critical values to test the null hypothesis of strong identification. This is because there are no nuisance parameters under the null hypothesis.

2. Theorem 2(b) shows that if we construct the test statistic using two non-shrinkage extremum estimators with κ = 0, the test will be inconsistent. Because the standard extremum estimator is inconsistent under weak identification, √T(θ̂_{1T} − θ̂_{2T}) diverges at rate T^{1/2}. Under the alternative hypothesis of weak identification, however, the asymptotic covariance estimator of √T(θ̂_1 − θ̂_2) diverges at rate T by Assumption (d). Therefore the test statistic based on the two extremum estimators with κ = 0 will be bounded in probability under the alternative hypothesis and thus the test will be inconsistent.3

3. Theorem 2(b) shows that the test rejects the null hypothesis with probability approaching one whether parameters are not identified at all or only weakly identified.

4. Theorem 2(b) implies that the power is increasing in κ. That is, the test is more powerful the larger λ_T is. There is a size–power trade-off, however. In general, the type I error of the test is bigger for larger values of λ_T, because there is some approximation error of order O_p(λ_T).4 We will discuss the choice of λ_T in the next section.
2 We thank Don Andrews for pointing this out.
3 In linear IV models, Hahn et al. (2011) similarly show that the conventional Hausman (1978) test is invalid when instruments are weak.
4 See Eq. (65). When multiplied by T^{1/2} there is error of order O_p(λ_T).
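To make the construction of the statistic concrete, the following sketch assembles R̂_T from the ingredients in (19)–(21). It is a hedged illustration rather than the authors' code: the Hessians H1 and H2 and the covariance estimate Sigma are supplied by the user, produced by whatever estimation routine is in use.

```python
import numpy as np
from scipy.stats import chi2

def r_test(theta_t1, theta_t2, theta_h1, theta_h2, H1, H2, Sigma, lam, T, theta_bar):
    """Assemble the R_T statistic of (19)-(21).

    theta_t1, theta_t2 : shrinkage estimates (k,)
    theta_h1, theta_h2 : unpenalized extremum estimates (k,)
    H1, H2             : Hessians of Q_1T, Q_2T at the shrinkage estimates (k, k)
    Sigma              : estimate of the 2k x 2k covariance in Assumption (h)
    lam                : shrinkage parameter lambda_T = kappa / sqrt(T)
    """
    k = len(theta_t1)
    M1 = np.linalg.inv(H1 - lam * np.eye(k))
    M2 = np.linalg.inv(H2 - lam * np.eye(k))
    B1 = M1 @ (theta_h1 - theta_bar)        # bias corrections B_jT
    B2 = M2 @ (theta_h2 - theta_bar)
    d = np.sqrt(T) * (theta_t2 - theta_t1 - lam * B2 + lam * B1)   # (20)
    D = np.vstack([-M1, M2])                # D_T' = [-M1, M2], so D_T is (2k, k)
    R = d @ np.linalg.inv(D.T @ Sigma @ D) @ d                     # (19)
    return R, chi2.sf(R, df=k)              # statistic and asymptotic p-value
```

Under the null, the returned statistic is compared with chi-square critical values with k degrees of freedom, as Theorem 2(a) states.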
3. Literature review and local power analysis
In this section we provide a discussion of how our paper is related to the literatures on tests of weak identification, tests of over-identification, tests of rank, the reduced rank regression, and the local alternative hypotheses of rank condition. The section also provides some intuition regarding our test in a simplified setup and shows that the proposed test has nontrivial asymptotic local power in a simple linear IV model.

3.1. Literature review

Identification is quite commonly defined as follows: The probability measures P_θ and P_θ′ are observationally equivalent if P_θ = P_θ′ (see Definition 2.2 of Hsiao, 1983, p. 226, for example). When there are no two probability measures that are observationally equivalent, we say that the true probability measure is identifiable and the population likelihood function achieves its maximum at a unique value (see Lemma 5.35 of Van der Vaart, 1998, p. 62). In many econometric problems, however, the true probability measures or likelihood functions are not available to the econometrician, and parameters are estimated by extremum estimators that maximize or minimize some estimation objective function, e.g. GMM. In this case, we say that parameters are identified if there is a unique minimizer of the objective function (see Amemiya, 1985, p. 106, Gallant and White, 1988, p. 19 and Newey and McFadden, 1994, p. 2121, for example). We focus on this identification condition for extremum estimators. The following simple example illustrates how various rank conditions for identification are related.

Example. Suppose that data are generated by

[ 1, −β_0 ; b_21, b_22 ] [ y_i ; Y_i ] = [ γ_11, γ_12 ; γ_21, γ_22 ] [ z_1i ; z_2i ] + [ ε_1i ; ε_2i ],  (34)

where [ε_1i ε_2i]′ ~ iid (0_{2×1}, Σ) is uncorrelated with Z_i ≡ [z_1i z_2i]′, Σ is a 2 × 2 positive definite matrix and E[Z_iZ_i′] is full rank and diagonal, i = 1, ..., N, where N is the total sample size. The econometrician estimates

y_i = β_0 Y_i + u_i,  (35)

where u_i = γ_11 z_1i + γ_12 z_2i + ε_1i = Z_i′Γ_1 + ε_1i and Γ_1 ≡ [γ_11, γ_12]′, by either OLS or 2SLS, where the instrumental variables z_1i and z_2i are excluded from the structural equation. Note that (34) implies the reduced form equations

[ y_i ; Y_i ] = (1/∆) [ γ_11b_22 + β_0γ_21, γ_12b_22 + β_0γ_22 ; γ_21 − γ_11b_21, γ_22 − b_21γ_12 ] [ z_1i ; z_2i ] + (1/∆) [ b_22ε_1i + β_0ε_2i ; ε_2i − b_21ε_1i ] = A z_i + w_i,  (36)

provided ∆ ≡ b_22 + β_0b_21 ≠ 0, where

A ≡ (1/∆) [ γ_11b_22 + β_0γ_21, γ_12b_22 + β_0γ_22 ; γ_21 − γ_11b_21, γ_22 − b_21γ_12 ].

Let w_i ≡ [ w_1i ; w_2i ] = (1/∆) [ b_22ε_1i + β_0ε_2i ; ε_2i − b_21ε_1i ] and Var(w_i) = Ω = [ Ω_11, Ω_12 ; Ω_12, Ω_22 ].

To simplify our discussion, we focus on the case in which the two instruments are excluded from the structural equation. Cases in which one of the exogenous variables is included in the structural equation can be analyzed in an analogous fashion. The 2SLS estimator minimizes the population objective function

Q(β) = E[(y_i − βY_i)Z_i′][E(Z_iZ_i′)]^{−1}E[Z_i(y_i − βY_i)].  (37)

The following two conditions ensure that the 2SLS estimator is consistent and asymptotically normal:

(A) the instruments are exogenous, that is, they are uncorrelated with the disturbance term in the structural equation; this requires that E(Z_iu_i) = 0:

Validity (exogeneity) condition: E(Z_iu_i) = 0 (i.e. γ_11 = γ_12 = 0);  (38)

and

(B) the instruments are relevant, rank[E(Z_iY_i)] = 1 (see Assumptions 3.3 and 3.4 of Hayashi, 2000, p. 198 and p. 200, respectively, for example). Note that E(Z_iY_i) = (1/∆)E(Z_iZ_i′)Π, where Π ≡ [ γ_21 − γ_11b_21 ; γ_22 − b_21γ_12 ], thus rank[E(Z_iY_i)] = rank(Π), since E(Z_iZ_i′) is full rank. Clearly, rank(Π) = 1 if either γ_21 − γ_11b_21 ≠ 0 or γ_22 − b_21γ_12 ≠ 0. Thus, the relevance condition is:

Relevance condition: rank(Π) = 1 or rank[E(Z_iY_i)] = 1, i.e. γ_21 − γ_11b_21 ≠ 0 or γ_22 − b_21γ_12 ≠ 0.  (39)

Note that, since the words identification and rank conditions have been used to mean different conditions in the literature, we refer to condition (A) as the validity condition and to condition (B) as the relevance condition. When the validity condition and the relevance condition are both satisfied, then the 2SLS population objective function (37) has a unique minimum at β_0 and the identification condition for extremum estimators is satisfied. If, in addition, the disturbance terms are Gaussian, these two conditions also imply identification of the true probability measure (which is the definition of identification discussed in the chapter by Hsiao, 1983). We have the following cases:

(a) If the validity conditions jointly hold for both instruments and the relevance condition holds, then

rank(A) = rank[ β_0γ_21, β_0γ_22 ; γ_21, γ_22 ] = 1.  (40)

(b) If the validity condition fails (γ_11 ≠ 0 or γ_12 ≠ 0) but the relevance condition is satisfied (γ_21 − γ_11b_21 ≠ 0 or γ_22 − b_21γ_12 ≠ 0), then

rank(A) = 2,  (41)

because |A| = (γ_11γ_22 − γ_12γ_21)(b_22 + β_0b_21) = (γ_11γ_22 − γ_12γ_21)∆ ≠ 0, provided that [ γ_11, γ_12 ; γ_21, γ_22 ] is full rank.

(c) If the validity conditions are satisfied (γ_11 = γ_12 = 0) but the relevance condition fails for both instruments (γ_21 − γ_11b_21 = γ_22 − γ_12b_21 = 0), then

rank(A) = 0.  (42)
Thus the rank of the (2 × 2) matrix A is 1 if and only if the validity and relevance conditions are both satisfied. Here below we discuss in detail the relationship between our paper and: (i) tests of rank; (ii) reduced rank regression; (iii) tests of over-identification; (iv) tests of no identification; and (v) tests of weak identification, paying special attention to the alternative hypotheses of the rank condition. (vi) Finally, we discuss why it is important to focus on tests for weak
identification as the alternative hypothesis instead of underidentification by reviewing many papers that recently have empirically encountered such a problem. (i) Our paper is related to the literature of tests of rank of a matrix— see the survey by Anderson (1984), Cragg and Donald (1996, 1997), Robin and Smith (2000), Gill and Lewbel (1992) and Kleibergen and Paap (2006) for recent contributions. (ii) Note that when rank(A) < 2, the reduced form of simple example (36) is a reduced rank regression. The technique of reduced rank regression was introduced by Anderson and Rubin (1949) and Anderson (1951). There exist several applications of tests of rank and reduced rank regressions. In a recent paper, Anderson and Kunitomo (unpublished manuscript) develop tests on coefficients in reduced rank regressions. In our example, their test simplifies to testing:
A [ 1 ; −β_0 ] = [ 0 ; 0 ] for the null parameter value β_0,  (43)

so that A has rank 1. Anderson and Kunitomo (unpublished manuscript, Section 4.2) show that their test is robust to weak instruments; however, they require E(Z_iY_i) = CN^{−δ}, where 0 < δ < 1/2, which is a slower rate than that in our paper. (iii) Anderson and Rubin (1949, 1950) develop tests of overidentifying restrictions, and Anderson and Kunitomo (1992, 1994) propose tests of block identifiability. These papers focus on testing whether the validity conditions (38) hold and the maintained hypothesis for these tests is that the relevance conditions (39) are satisfied; a rejection of tests implies that some of the validity conditions are not satisfied.5 In our example, they test the null hypothesis that

A [ 1 ; −β_0 ] = [ 0 ; 0 ], for some β_0,  (44)
which boils down to the null (40), against the alternative (41).6 See also Sargan (1958), Durbin (1959), Hausman (1978), Wu (1973), and Hansen (1982) for tests of over-identifying restrictions and the tests of Newey (1985) and Eichenbaum et al. (1985) for tests of a subset of such validity conditions. The maintained hypothesis for these tests is that the relevance conditions are satisfied. Even if the validity condition fails, if the relevance condition is not satisfied, these tests may not be consistent. (iv) Koopmans and Hood (1953) and Wright (2003) propose tests for no identification (42) against the alternative (40). These papers focus on testing the null hypothesis that the relevance conditions do not hold against the alternative that they hold and the maintained hypothesis for these tests is that the validity conditions are satisfied. The test of Wright (2003) tests the rank of the lower 1 × 2 sub-matrix of the above A matrix, Π , for example. In their survey, Stock et al. (2002) describe methodologies to detect whether instruments are relevant or not. Among such methods, there is the methodology by Cragg and Donald (1993). Cragg and Donald (1993) propose a rank test on Π to test the null that the instruments are not relevant against the alternative that they are relevant; their test, however, is not capable of determining whether the instruments are ‘‘sufficiently strong’’ so that standard inference is reliable. This is the main problem we are interested in. (v) Our paper focuses on the relevance condition (39) and maintains the assumption that the exogeneity condition (38)
5 In addition to tests for identifiability, Anderson and Kunitomo (1992, 1994) also discuss tests for pre-determinedness, that is, testing the null hypothesis that cov(u_t, w_2i) = 0 against the alternative that cov(u_t, w_2i) ≠ 0.
6 Note that the null parameter value is specified in (43) whereas the value of β_0 is unspecified in (44).
holds. In particular, we focus on the empirically relevant problem, corroborated by Monte Carlo evidence, that even if the relevance condition is technically satisfied in population, if the correlation between instruments and endogenous variables is ''weak'', the standard asymptotic approximation performs poorly (Bound et al., 1995; Nelson and Startz, 1990a,b). To explain the Monte Carlo findings, Staiger and Stock (1997) and Stock and Wright (2000) propose an alternative asymptotic theory in which the correlation is modeled local to zero:

E(Z_iY_i) = CN^{−1/2}  (45)
for some 2 × 1 vector C. Anderson and Kunitomo (unpublished manuscript) also develop tests on coefficients of one structural equation in a set of simultaneous equations that are robust to a particular case of weak instruments, where the rate is slower than the one we consider. In our paper, we test the null hypothesis that E(Z_iY_i) has rank 1 (so that (40) is satisfied under the maintained validity assumption) and is not local to zero against the alternative that it is local to zero, (45). In this example, γ_21 − γ_11b_21 = C_1N^{−1/2} and γ_22 − b_21γ_12 = C_2N^{−1/2} for some C_1 and C_2. Note that our alternative hypothesis includes the case of no identification considered in the tests in (iv) as a special case with C_1 = C_2 = 0. A few papers have considered tests for relevance in the presence of weak instruments: Stock and Yogo (2005) propose to test the null hypothesis (45) against the alternative (40). Hahn and Hausman (2002) test the null (41) against the alternative (45), as we do. Our paper is related to these tests, but differs in a fundamental way. The advantage of our test is that it does not rely on the Hessian of the objective function whereas the test proposed by Stock and Yogo (2005) does. Since the Hessian depends on nuisance parameters in nonlinear models, it is unclear how to extend these methods to nonlinear models. Our test can instead be applied to both linear and non-linear models. In addition, the asymptotic null distributions of the existing tests for testing the null (42) against the alternative (40) depend on nuisance parameters since C_1 and C_2 cannot be consistently estimated, and thus are not robust to weak identification. In fact, our Monte Carlo experiment shows that the rank test of Wright (2003) can suffer from size distortions when instruments are weak.

(vi) The empirical literature (in particular papers estimating first order conditions in macroeconomics and finance) offers several examples of studies concerned about the weak instrument problem. These studies usually test the strength of instruments by using first stage tests. For example, Stock and Yogo's (2005) first stage test for weak instruments has been used by Consolo and Favero (2009) to estimate monetary policy functions; Yogo (2004) to estimate the elasticity of inter-temporal substitution in the context of an Euler equation involving consumption growth and returns on wealth; Krause et al. (2008) for inflation dynamics; Shapiro (2008) to estimate the New Keynesian Phillips Curve using a new proxy for the real marginal cost term; and Fuhrer and Rudebusch (2004) for the Euler equation for output. The same methodology is also popular as a robustness check in cross-section studies that use IV estimation. For example, Crowe (2010) uses it to estimate the relationship between IT adoption and the quality of private sector forecasts; Faria and Montesinos (2009) to estimate the relation between growth, income level and freedom indexes; and Alcala and Ciccone (2004) for the link between trade and productivity. Other papers use a first-stage F-statistic check with a critical value of 10.
See, for example, Park and Kang (2008) on the relationship between education and health; Acemoglu and Johnson (2007) to estimate the effect of life expectancy on economic performance; Doyle (2007) to measure the effects of foster care on children outcomes; DeJuan and Seater (2007) to test the Permanent Income Hypothesis; Temple and Wossmann (2006) in cross country growth regressions; Wossmann and West (2006) for the effects of class-size in school systems; Ait-Sahalia et al. (2004) for estimates of the equity premium.
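The finite-sample phenomenon documented in this literature is easy to reproduce numerically. The sketch below simulates a just-identified IV design with a local-to-zero first stage of the form in (45); all design constants are arbitrary choices for illustration, not values taken from any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta0, C = 1000, 1.0, 2.0
Pi = C / np.sqrt(N)                  # local-to-zero first stage, as in (45)

z = rng.normal(size=N)               # single instrument for simplicity
eta = rng.normal(size=N)
eps = 0.8 * eta + rng.normal(size=N) # endogeneity: corr(eps, eta) > 0

Y = Pi * z + eta                     # first stage (weak)
y = beta0 * Y + eps                  # structural equation

beta_2sls = (z @ y) / (z @ Y)        # just-identified IV estimate
beta_ols = (Y @ y) / (Y @ Y)         # OLS, inconsistent under endogeneity
print(beta_2sls, beta_ols)           # 2SLS is dispersed and pulled towards OLS
```

Repeating the simulation shows the 2SLS estimator failing to concentrate around beta0, which is exactly the unreliability of standard inference that motivates testing for weak identification.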
3.2. Local power analysis

In this section we provide some intuition regarding our test in a simplified setup and show that the proposed test has nontrivial asymptotic local power in a simple linear IV model. First we will define more general null and alternative hypotheses, and then we find the probability limits and asymptotic distributions of the shrinkage estimator under these alternatives and local alternatives. The general null and alternative hypotheses allow us to derive analytic local power results. These results include Theorem 2 as a special case. Consider the IV model in Remark 1 in Section 2 with k = ℓ = 1:

y = Yθ_0 + U,
Y = X_1Π_{1,T} + X_2Π_{2,T} + V.

Let d ≡ lim_{T→∞} ln(Π_{j,T})/ln(T) for j = 1, 2. The null and alternative hypotheses considered in Section 2 can be written as H_0: d = 0 and H_1: d = −1/2, respectively. Now suppose that Π_{j,T} = c_jT^d with c_j ≠ 0 and define a new null hypothesis by

H_0: d ∈ (−1/4, 0]  (46)

and an alternative hypothesis by

H_1: d ∈ [−1/2, −1/4].  (47)

We assume that X_1 and X_2 are independent and satisfy c_1²/E(x_{1,i}²) ≠ c_2²/E(x_{2,i}²) and that E(u_i²|x_{1,i}, x_{2,i}) = σ², where x_{j,i} and u_i denote the i-th row of X_j and U, respectively, i = 1, 2, ..., T. Let θ̂_{j,T} and θ̃_{j,T} denote the 2SLS and shrinkage estimator based on X_j for j = 1, 2. That is, we consider two IV estimators that use different instruments: θ̂_{j,T} and θ̃_{j,T} denote the IV and shrinkage estimators which use instruments X_j and which maximize objective functions Q_{j,T}(θ) and Q_{j,T}(θ) − (λ_T/2)(θ − θ̄)², respectively.

First we will consider the probability limit of the shrinkage estimator under the null and alternative hypotheses. Let θ_0 denote the true parameter value. Because the objective function for the shrinkage estimator θ̃_{j,T} is

Q_{j,T}(θ) − (λ_T/2)(θ − θ̄)² =
  −c_j²E(x_{j,i}²)(θ − θ_0)²T^{2d} + o_p(T^{2d}),  if d ∈ (−1/4, 0],
  −c_j²E(x_{j,i}²)(θ − θ_0)²T^{−1/2} − (λ_T/2)(θ − θ̄)² + o_p(T^{−1/2}),  if d = −1/4,
  −(λ_T/2)(θ − θ̄)² + o_p(λ_T),  if d ∈ [−1/2, −1/4),

and using Assumption (e) it follows that

θ̃_{j,T} →p
  θ_0,  if d ∈ (−1/4, 0],
  [2c_j²E(x_{j,i}²)θ_0 + κθ̄] / [2c_j²E(x_{j,i}²) + κ],  if d = −1/4,
  θ̄,  if d ∈ [−1/2, −1/4).

Thus the shrinkage estimators remain consistent for θ_0 under the new null hypothesis (46) and converge in probability to θ̄ under the new alternative hypothesis (47). When d = −1/4, the shrinkage estimators converge to the weighted average of θ_0 and θ̄.

Next consider the asymptotic distribution of the proposed test statistic. For d ∈ (−1/4, 0] it can be shown that the shrinkage estimator is T^{1/2+d}-consistent and asymptotically normal:

T^{1/2+d}(θ̃_{j,T} − θ_0 − λ_T[∇_θθ Q_{j,T}(θ) − λ_T]^{−1}(θ̂_{j,T} − θ̄)) = T^{1/2+d}(θ̂_{j,T} − θ_0) + o_p(1) →d N(0, σ²/(c_j²E(x_{j,i}²)))  (48)

for j = 1, 2. We also have

[∇_θθ Q_{j,T}(θ) − λ_T]^{−1} =
  −T^{−2d}[c_j²E(x_{j,i}²)]^{−1} + o_p(T^{−2d}),  if d ∈ (−1/4, 0],
  −T^{1/2}[c_j²E(x_{j,i}²) + κ]^{−1} + o_p(T^{1/2}),  if d = −1/4,  (49)

and

Σ̂_T = [ Π̂_{1,T}²(1/T)Σ_{i=1}^T x_{1,i}²û_{1,i}², Π̂_{1,T}Π̂_{2,T}(1/T)Σ_{i=1}^T x_{1,i}x_{2,i}û_{1,i}û_{2,i} ; Π̂_{1,T}Π̂_{2,T}(1/T)Σ_{i=1}^T x_{1,i}x_{2,i}û_{1,i}û_{2,i}, Π̂_{2,T}²(1/T)Σ_{i=1}^T x_{2,i}²û_{2,i}² ] = σ²T^{2d} [ c_1²E(x_{1,i}²), 0 ; 0, c_2²E(x_{2,i}²) ] + o_p(T^{2d}),  (50)

where Π̂_{j,T} = (X_j′X_j)^{−1}X_j′Y and û_{j,i} = y_i − θ̂_{j,T}Y_i. Combining (48)–(50) we can show that

R̂_T →d χ²(1)  (51)

for d ∈ (−1/4, 0]. Thus, our test statistic has the same asymptotic null distribution under the more general null hypothesis (46). Similarly, it can be shown that, if d = −1/4,

R̂_T →d Kχ²(1)  (52)

where

K = [ 1/(c_1²E(x_{1,i}²)) + 1/(c_2²E(x_{2,i}²)) ] / [ c_1²E(x_{1,i}²)/(c_1²E(x_{1,i}²) + κ)² + c_2²E(x_{2,i}²)/(c_2²E(x_{2,i}²) + κ)² ] > 1,

and that, if d ∈ (−1/2, −1/4), R̂_T diverges at rate T^{−1−4d}.

Remarks. 1. Comparing (51) and (52), we interpret the case with d = −1/4 as a local alternative. Because the null distribution (51) is bounded above by the distribution under the local alternative (52), our test has nontrivial local power against the local alternative. The asymptotic local power of our test depends on two factors. First, the local power is increasing in the degree of shrinkage, κ. (52) also shows that the test would not have any local power if it is constructed from non-shrinkage extremum estimators. Second, the asymptotic local power approaches one as the strength of instruments, c_1 and c_2, approaches zero. It is interesting to note that the asymptotic local power does not depend on the choice of θ̄. This is because the shrinkage estimators are centered at their means.

2. The number d = −1/4 turns out to be special when it comes to identification. In a recent paper Antoine and Renault (forthcoming) show that the standard Wald test is valid when the quality of instruments is mixed. They show that the fastest rate at which Π_T converges to zero must be slower than d = −1/4 in our notation. Thus, their conditions for the validity of Wald tests and our null hypothesis (46) coincide. Because d = −1/4 is exactly on the boundary in their paper and in the above analysis, we interpret the case d = −1/4 as a local alternative.

3. The above consistency result includes Theorem 2(b) as a special case with d = −1/2 and shows that our test is consistent for more general fixed alternatives than the one considered in Section 2.
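A quick numerical check of the factor K in (52) is straightforward. In the sketch below the positive values standing in for c_j²E(x_{j,i}²) and for κ are arbitrary assumptions; every combination returns a value above one, consistent with the inequality stated above.

```python
# numerical check that K in (52) exceeds one
def K_factor(a1, a2, kappa):
    # a_j stands in for c_j^2 * E(x_{j,i}^2), assumed positive
    num = 1.0 / a1 + 1.0 / a2
    den = a1 / (a1 + kappa) ** 2 + a2 / (a2 + kappa) ** 2
    return num / den

for kappa in (1.0, 5.0, 10.0):
    print(kappa, K_factor(0.5, 2.0, kappa))  # all printed values exceed 1
```

The check works because a/(a + κ)² < 1/a for any a, κ > 0, so each denominator term is strictly smaller than the corresponding numerator term.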
4. Empirical implementation of our proposed test

The test that we propose is easy to implement even in high-dimensional models and has the advantage of having power against weak identification. However, in order to implement the test, one needs to choose the shrinkage parameter, λ_T, while ensuring that it satisfies Assumption (e): λ_T = κT^{−1/2}. Intuitively, the choice of κ involves a trade-off between the bias and the variance of the shrinkage estimator: the larger κ is, the more the shrinkage estimator is coaxed towards the pseudo parameter value θ̄. Thus the shrinkage estimator will be more biased and have a smaller variance for a larger value of κ. Given this bias-variance trade-off, we propose to choose κ from a finite sequence of positive numbers by a cross-validation procedure that minimizes the mean-squared error of the shrinkage estimator.7 Note that any λ_T = κ̂_T T^{−1/2} such that

• under the null hypothesis κ̂_T →p κ*, where κ* is a positive constant; and
• under the alternative hypothesis κ̂_T = O_p(1)

satisfies Assumption (e) and is asymptotically valid. Therefore, we expect that the proposed estimator κ̂_T (and effectively κ*) obtained via cross-validation will converge to the value of κ that minimizes the mean-squared error of the shrinkage estimator under the null hypothesis. Under the alternative hypothesis, κ̂_T is nonzero and is finite by construction. Our choice of κ that minimizes the mean-squared error of the shrinkage estimator may not be optimal for testing our hypothesis, but is asymptotically valid. We outline our cross-validation procedure below and investigate its small sample properties in the next section. It is left for future research to theoretically investigate the effect of cross-validation on the performance of the proposed test.8

Suppose that we estimate the parameters by GMM in which the moment functions are serially uncorrelated when they are evaluated at the true parameter values, as in the models considered in the following sections.

Step 0. Estimate θ by GMM:

θ̂_{1T} = arg max_{θ∈Θ} Q_{1T}(θ),
θ̂_{2T} = arg max_{θ∈Θ} Q_{2T}(θ),

where Q_{1T}(θ) = −m̄_T(θ)′W_{1T}m̄_T(θ), Q_{2T}(θ) = −m̄_T(θ)′W_{2T}m̄_T(θ), m̄_T(θ) = (1/T)Σ_{t=1}^T m(z_t, θ), and m(z_t, θ) is a moment function satisfying E[m(z_t, θ_0)] = 0 for some θ_0 ∈ Θ (for example, θ̂_{1T} and θ̂_{2T} can be GMM estimators with identity and optimal weighting matrices).

Step 1. Pick an arbitrary value of λ_T such that λ_T ∈ {λ_{1,T}, λ_{2,T}, ..., λ_{L,T}}, where λ_{j,T} = c_jT^{−1/2} for j = 1, 2, ..., L, c_j is a positive constant, and L is finite.

Step 2. Pick an arbitrary t ∈ {1, 2, ..., T}.

Step 3. Use all the sample observations except t to compute the shrinkage estimators

θ̃_{1T,t} = arg max_{θ∈Θ} [Q_{1T,t}(θ) − (λ_T/2)‖θ − θ̄‖²],
θ̃_{2T,t} = arg max_{θ∈Θ} [Q_{2T,t}(θ) − (λ_T/2)‖θ − θ̄‖²],

where Q_{1T,t}(θ) = −m̄_{T,t}(θ)′W_{1T}m̄_{T,t}(θ), Q_{2T,t}(θ) = −m̄_{T,t}(θ)′W_{2T}m̄_{T,t}(θ) and m̄_{T,t}(θ) = (1/(T − 1)) Σ_{s≠t} m(z_s, θ).

Step 4. Repeat Step 3 for t = 1, 2, ..., T and construct a criterion function based on a Mean Squared Error (MSE) estimate of these parameter estimates, such as9

trace(MSE(λ_T)) = trace( Σ_{s=1}^T [ (θ̃_{1T,s} − θ̂_{1T})(θ̃_{1T,s} − θ̂_{1T})′ + (θ̃_{2T,s} − θ̂_{2T})(θ̃_{2T,s} − θ̂_{2T})′ ] ).  (53)

Step 5. Repeat Steps 2–4 for all values of λ_T, thus obtaining an L × 1 vector of Mean Squared Error estimates: [trace(MSE(λ_{1,T})), trace(MSE(λ_{2,T})), ..., trace(MSE(λ_{L,T}))].

Step 6. Choose λ*_T such that λ*_T = arg min_{l=1,...,L} trace(MSE(λ_{l,T})).

Step 7. Re-estimate the shrinkage estimators evaluated at λ*_T:

θ̃_{1T} = arg max_{θ∈Θ} [Q_{1T}(θ) − (λ*_T/2)‖θ − θ̄‖²],
θ̃_{2T} = arg max_{θ∈Θ} [Q_{2T}(θ) − (λ*_T/2)‖θ − θ̄‖²],

and evaluate the test statistic

R̂_T = d̂_T′(D̂_T′ Σ̂_T D̂_T)^{−1} d̂_T,

where

d̂_T = T^{1/2}(θ̃_{2T} − θ̃_{1T} − λ*_T B̂_{2T} + λ*_T B̂_{1T}),
D̂_T′ = [ −[∇_θθ Q_{1T}(θ̃_{1T}) − λ*_T I_k]^{−1}, [∇_θθ Q_{2T}(θ̃_{2T}) − λ*_T I_k]^{−1} ],
Σ̂_T = [ Σ̂_{11,T}, Σ̂_{12,T} ; Σ̂_{21,T}, Σ̂_{22,T} ],

and B̂_{jT} = [∇_θθ Q_{jT}(θ̃_{j,T}) − λ*_T I_k]^{−1}(θ̂_{jT} − θ̄) for j ∈ {1, 2}; Σ̂_T is a consistent estimator of the asymptotic covariance matrix of T^{1/2}[∇_θ Q_{1T}(θ_0)′ ∇_θ Q_{2T}(θ_0)′]′.

Step 8. Reject the null hypothesis of strong identification in favor of weak or no identification at significance level α if R̂_T is bigger than the (1 − α)-th percentile of a χ_k² distribution.

7 See Stone (1974) for a theoretical analysis of cross-validation and Carrasco (unpublished manuscript, Section 4) for a recent application of cross-validation methods.
8 It is possible that choosing the bandwidth by cross-validation may improve the performance of the test. As an anonymous referee suggests, it could be possible that when the identification is strong, the variance of the estimator may be small even without shrinkage, so that the cross-validation would choose a small κ to reduce the bias. This property might make the size distortion small. On the other hand, when the identification is weak, the variance of the estimator may be large and the cross-validation might choose a large κ to reduce the variance at the cost of bias inflation. This might make the test more powerful. We are grateful to the anonymous referee for this conjecture.
9 Alternatively, one could consider the determinant (as opposed to the trace) as the criterion function. In the Monte Carlo section, we will investigate both.
Step 8. Reject the null hypothesis of strong identification in favor of weak or no identification at significance level α if Rˆ T is bigger than the (1 − α)-th percentile of a χk2 distribution. 5. Monte Carlo experiments We analyze the finite sample performance of our proposed test in two setups: the Consumption Capital Asset Pricing Model (CCAPM) and the Taylor rule monetary policy model. We will compare the performance of our test with that of Wright (2003) and discuss a cross-validation method to estimate λT .10 5.1. Consumption Capital Asset Pricing Models In this sub-section, we investigate the finite-sample performance of the proposed test using the Consumption Capital Asset Pricing Model used in Wright (2003). Consumption and dividend growth are assumed to follow a first-order Gaussian vector autoregression
log
Ct −1
log
Ct Dt
log
=µ+Φ
log
Dt −1
Ct −1
C t −2
[ ] uct + , udt
Dt −1
(54)
Dt −2
where Ct is consumption, Dt is dividend, µ is a 2 × 1 vector, Φ iid
is a 2 × 2 matrix of constants, [uct , udt ]′ ∼ N (0, Λ), and (54) is approximated by a 16-state Markov chain. Then asset prices are generated so that they satisfy the Euler equation
Et
δ R t +1
C t +1
−γ
Ct
− 1 = 0,
(55)
where δ is the discount factor, Rt is the gross stock return and γ is the coefficient of relative risk aversion. See Tauchen and Hussey (1991) for the quadrature method used to simulate data. Following Wright (2003), we let θ ∈ Θ , where θ = [δ, γ ]′ , and Θ = [0.7, 1.3] × [0, 30]. In our notation the objective function can be written as QT (θ) =
T 1−
T t =1 ′
× Zt WT
δ Rt +1 T 1−
T t =1
C t +1 Ct
Zt
−γ
δ Rt +1
−1
C t +1 Ct
−γ
−1 ,
255
the Jacobian matrix is 1. Models WI1 and WI2 are modifications of models NRF1 and NRF2 of Wright (2003) which is based on Kocherlakota (1990). In these models, Φ is the same as the one in Wright (2003) when the sample size is 90 for which the value of Φ is obtained in Kocherlakota (1990). As the sample size grows, Φ converges to the matrix of zeros, which means that the instruments become weak. We consider three sample sizes, T = 50, 100, 200, and set the number of Monte Carlo replications to 1000. We select λT via the cross validation method discussed in Section 3. We set the set of κ in Assumption (e) to κ ∈ {1, 5, 10} in this Monte Carlo experiment. Unlike simple parametric hypotheses, the distinction between our null and alternative hypotheses is murky in small samples. We report the median of the absolute value of the bias as well as the coverage probability of 95% confidence intervals based on t tests to assess the quality of the conventional asymptotic approximation. When identification is weak, the standard asymptotic approximation will perform poorly and we expect to see large biases and poor coverage probabilities. We compute rejection probabilities of both Wright’s (2003) test as well as our Rˆ T test at the 5% significance level. We expect Wright’s (2003) test to reject the null in model SI whereas our test is expected to reject the null in models PI1, PI2, WI1 and WI2. Table 2 shows the bias of the GMM estimators, coverage probabilities of 95% Wald confidence intervals and the rejection frequencies of Wright’s (2003) test and our test implemented with a nominal size equal to 5%. As expected, the GMM estimates are highly biased and the coverage probabilities are not accurate when the parameters are not identified or weakly identified. When the parameters are strongly identified (model SI), the rejection frequencies of Wright’s (2003) test increase as the sample size grows. Our proposed test is conservative in that the actual size is smaller than the nominal size. When the parameters are not identified (models PI1 and PI2), Wright’s (2003) test is also conservative. Our test is powerful in that it rejects the null with probability higher than 90% even for the sample size 50. Our test has power even when the parameters are weakly identified. When the parameters are weakly identified, Wright’s (2003) test rejects the null hypothesis of lack of identification 26.6%–46.6% of the time, which could mislead practitioners to believe that the model is strongly identified. While the size and power of our test does depend on the choice of κ , our test performs well when κ is chosen to minimize either the determinant or trace of MSE.
(56)
Zt = [1, Rt , Ct /Ct −1 ]′ , and WT is a weighting matrix. We use the identity matrix for θˆ1,T and θ˜1,T and the optimal weighting matrix
for θˆ2,T and θ˜2,T . We consider one model where the parameters are strongly identified, two models where the parameters are not (or only partially) identified, and two models where the parameters are weakly identified. See Table 1 for the parameter values in each of the five models. Model SI is a slight modification of experiment 1B of Tauchen (1986) and model FR of Wright (2003), in which correlation is introduced among the instruments to satisfy our assumptions (g) and (h). In model SI, the parameters are strongly identified. Models PI1 and PI2 are the same as models RF1 and RF2 of Wright (2003). In these models, the instruments Ct +1 /Ct and Dt +1 /Dt are independent of Ct +1 /Ct and Rt +1 , and the rank of
10 For computational reasons we will not consider the method proposed by Wright (2002). In this section we let θ = 0. Unreported Monte Carlo experiments show that the procedure is robust to different choices for θ .
5.2. The Taylor rule model We now consider the performance of a simple Taylor-rule model for monetary policy in a second series of Monte Carlo experiments. We focus on the same model that will be considered in the empirical application in the next section. The model is a simplified version of the monetary policy reaction function considered by Clarida et al. (2000, hereafter CGG), and it is based on the following moment conditions11 : Et
rt − rr ∗ − (β − 1)π ∗ + βπt +1 + γ yt +1
Xt = 0
(57)
where rt is the Fed Fund Rate, πt +1 is the inflation rate, yt +1 is the average output gap between time t and t + 1 and Xt is a vector of four instruments. We generate the instruments as: Xt ∼
11 The simplification consists of not considering the serial correlation in the Fed Fund Rate.
256
A. Inoue, B. Rossi / Journal of Econometrics 161 (2011) 246–261
Table 1 Parameter values in the models.
µ [ ]
Model
0 0
SI
[
0.0012 0.0017
0.0017 0.0146
[
0.0012 0.0017
0.0017 0.0146
0.0012 0.0017
0.0017 0.0146
]
0.0012 0.0017
0.0017 0.0146
]
[
0.018 0.013
[
0.018 0.013
[
[
0.021 0.004
[
0.021 0.004
90 12 −0.161 T 0.414 [ 90 12 −0.161 T 0.414
PI1 PI2 WI1 WI2
]
0 0
]
]
0 0
Λ [
0.01 0.005
0 0
0 0
]
[
]
δ
Φ [ ] −0.5 0.1 0.1 −0.5 [ ]
0.017 0.117
[
0.017 0.117
[
] ]
0.005 0.01
γ
] 0.97
1.3
0.97
1.3
1.139
13.7
0.97
1.3
1.139
13.7
]
]
Notes: µ, Φ and Λ are the intercept, matrix of slope coefficients and covariance matrix of the disturbance term, respectively, of the VAR(1) model of consumption and dividend growth. Table 2 Rejection frequencies of Wright’s (2003) test and the proposed test. Model
δ
T
FR
50 100 200 50 100 200 50 100 200 50 100 200 50 100 200
RF1
RF2
NRF1
NRF2
γ
Bias
Coverage
Bias
Coverage
0.007 0.005 0.003 0.034 0.034 0.032 0.138 0.138 0.137 0.012 0.008 0.005 1.029 1.139 0.724
0.926 0.936 0.946 0.986 0.992 1.00 0.415 0.415 0.410 0.992 0.994 0.984 0.126 0.165 0.913
0.138 0.100 0.071 1.895 1.906 1.850 12.200 12.156 12.189 0.615 0.600 0.532 14.585 13.318 3.196
0.893 0.917 0.936 0.996 0.999 0.999 0.244 0.249 0.249 0.681 0.670 0.686 0.762 0.746 0.992
Wright (2003)
Our test
κ=1
κ=5
κ = 10
det
trace
0.741 0.913 0.976 0.138 0.113 0.097 0.133 0.113 0.100 0.321 0.310 0.357 0.466 0.357 0.266
0.070 0.047 0.038 0.899 0.945 0.952 0.941 0.988 0.999 0.558 0.658 0.772 0.955 0.918 0.989
0.101 0.068 0.053 0.933 0.973 0.977 0.967 0.991 1.00 0.658 0.745 0.856 0.977 0.951 1.00
0.154 0.102 0.081 0.965 0.986 0.989 0.980 0.998 1.00 0.759 0.846 0.923 0.992 0.975 1.00
0.070 0.047 0.038 0.900 0.945 0.952 0.941 0.988 0.999 0.557 0.658 0.772 0.971 0.937 1.00
0.070 0.047 0.038 0.900 0.945 0.952 0.941 0.988 0.999 0.558 0.658 0.772 0.962 0.926 0.995
Notes: The table reports median absolute biases (labeled ‘‘bias’’), coverage probabilities of 95% confidence intervals (labeled ‘‘coverage’’), and empirical rejection probabilities of the tests (last six columns). ‘‘Our Test’’ denotes our proposed RT test, Eq. (19); it is either implemented with a cross-validation method for the choice of λ based on the trace, labeled ‘‘trace’’, or on the determinant, labeled ‘‘det’’. ‘‘κ = 1, 5, 10’’ refers to the proposed test implemented with a pre-determined choice of λ = κ T −1/2 . ‘‘Wright (2003)’’ is the test proposed by Wright (2003).
N4×1 (04×1 , ΩX ) and ΩX = SX SX′ where SX was set to 1.6167 0 0 0
SX =
−1.4234 1.9835 0 0
0.1957 −0.1077 0.1635 0
−0.2524 0.4627 −0.4427 0.6272
which had been randomly drawn. We generate the data as follows: rt = rr ∗ − (β − 1)π ∗ + βπt +1 + γ yt +1 + εt , where εt is N (0, 1), β = 2, γ = 3, rr ∗ is the sample average of the simulated values of rt − πt +1 (on average, it is equal to unity), and π ∗ is chosen such that rr ∗ ≡ rr ∗ − (β − 1)π ∗ = 1 (which means that on average the Central Bank aims at zero inflation). The vector of regressors consists of a constant as well as Yt = {πt +1 , yt +1 }′ , where the latter are generated by: Yt = Bxz Xt + uX ,t where Bxz ≡ ϑ [I2×2 02×2 ], and uX ,t ∼ N2×1 (02×1 , I2×2 ). We consider three cases: ϑ = 0 (no identification, labeled ‘‘NI’’), ϑ = T −1/2 (weak identification, labeled ‘‘WI’’), ϑ = 1 (strong identification, labeled ‘‘SI’’). We will compare the performance of our method with that proposed by Wright (2003). In applying Wright’s (2003) method, we excluded the derivative of the moment condition with respect to the constant.12 In the no identification case, we implemented
12 This is necessary because the test statistic is based on the demeaned gradient of the moment conditions, and if one of the derivatives of the moment conditions
Wright’s (2003) method by testing the null hypothesis that the rank is 3 against the alternative that the rank is full (equal to four). Our method was implemented with a cross-validation choice of λT = κ T 1/2 for values of κ within a grid from 0.1 to 100, as well as with a fixed choice for λT = κ T 1/2 , where κ = 1, 5, 10. For the cross validation, we consider both the trace, as in (53), as well as a determinant. Table 3 reports the results. The main findings of the previous sub-section do carry over to this case. In particular, we note that Wright’s (2003) test has a tendency to reject the null hypothesis of no identification when the parameters are weakly identified. Our test, implemented with the cross-validation choice for λT , performs really well in terms of both size and power in small samples. Wright’s (2003) method also performs well in terms of size. However, in the weak identification case, Wright’s (2003) test rejects the null hypothesis for lack of strong identification 20%–30% of the time, thus incorrectly concluding that the model is identified in 20%–30% of the cases. In the same situation, our test, instead, does reject the null hypothesis of strong identification 50%–60% of the time, thus showing quite good power properties. Finally, the cross-validation procedure significantly improves the size properties of our test in finite samples relative to the case in which λT is pre-determined. The only notable difference with the results in the previous sub-section is that the
is constant – which will happen if one of the instruments is a constant and one of the derivatives is constant – then the gradient will have a column of zeros.
A. Inoue, B. Rossi / Journal of Econometrics 161 (2011) 246–261 Table 3 Rejection frequencies of Wright’s (2003) test and the proposed test—monetary policy example. T
50
100
200
Model
Wright (2003)
Our test
κ=1
κ=5
κ = 10
det
trace
SI WI NI SI WI NI SI WI NI
1 0.19 0.04 1 0.26 0.04 1 0.41 0.07
0.06 0.53 0.69 0.06 0.58 0.74 0.05 0.65 0.80
0.19 0.81 0.91 0.14 0.86 0.94 0.07 0.90 0.95
0.35 0.90 0.96 0.24 0.93 0.97 0.13 0.95 0.98
0.06 0.99 1 0.06 0.99 1 0.05 0.99 0.99
0.06 0.53 0.69 0.06 0.58 0.74 0.05 0.65 0.80
Notes. The table reports empirical rejection rates of nominal 5% tests for different sample sizes and for the cases of strong identification (‘‘SI’’), weak identification (‘‘WI’’) and no identification (‘‘NI’’). ‘‘Our Test’’ denotes our proposed RT test, Eq. (19); it is either implemented with a cross-validation method for the choice of λ based on the trace, labeled ‘‘trace’’, or on the determinant, labeled ‘‘det’’. ‘‘κ = 1, 5, 10’’ refers to the proposed test implemented with a pre-determined choice of λ = κ T −1/2 . ‘‘Wright (2003)’’ is the test proposed by Wright (2003).
cross-validation implemented with the determinant (rather than the trace) sometimes improves the power of the test. 6. Is the US monetary policy rule identified? An analysis of identification of the US forward-looking Taylor rule The issue of whether the parameters of structural macroeconomic models are well identified has recently received a lot of attention. In their review, An and Schorfheide (2007) acknowledge that identification problems in DSGE models are an important issue. They note that it is difficult to directly detect identification problems in large DSGE models since the mapping from the vector of structural parameters to the reduced form parameters is highly non-linear and, typically, has to be evaluated numerically. Lack of identification, therefore, constitutes a challenge for researchers because it is unclear which features of the posterior distribution are generated by prior information on rather than by information from the sample via the likelihood. So far, the main diagnostic tool to judge the extent to which data provide information regarding the parameters of interest has been to compare the prior and the posterior estimates. The method we propose in this paper has the advantage of testing whether the model’s parameters suffer from weak identification prior to estimation. The lack of identification of the parameters of various DSGE models has been documented in several papers. Canova and Sala (2009) compare the informativeness of different estimators with respect to key structural parameters in selected DSGE models, whereas Iskrev (unpublished manuscript) considers the issue of parameter identification in the Smets and Wouters (2007) model. Ruge-Murcia (2007) instead examines the implications of weak identification on competing estimators of DSGE models. A distinctive feature of interest in many DSGE models is the monetary policy reaction function. We therefore focus on it for our analysis. Usually, the monetary policy reaction function is a Taylor rule—see Taylor (1993). CGG estimate the monetary policy reaction function by GMM based on the following moment conditions: Et [{rt − (1 − ρ1 − ρ2 )[rr ∗ − (β − 1)π ∗ + βπt +1 + γ yt +1 ]
− ρ1 rt −1 − ρ2 rt −2 }Xt ] = 0.
257
β is typically interpreted as the ‘‘inflation-aversion’’ parameter, whereas γ is interpreted as the ‘‘output-gap reaction’’ parameter. We follow CGG and use the same quarterly data spanning the period 1960:1–1996:4. In particular, we collect interest rate and inflation data from CITIBASE. The Fed Fund Rate is the average value in the first month of each quarter, expressed in annual rates (FYFF). The inflation rate is the annualized rate of change of the GDP deflater (GDPP) between two subsequent quarters. The output gap is from the Congressional Budget Office. In CGG, the structural parameters have a one-to-one relationship with the parameters in a standard linear GMM moment condition: Et [{rt − α1 − α2 πt +1 − α3 rt −1 − α4 rt −2 − α5 yt +1 } Xt ] = 0, (59) that is, E [gt (α)] = 0, where gt (α) ≡ rt − α ′ Zt Xt for Zt being the vector containing a constant, the one-step ahead inflation rate, the interest rate lagged one and two periods, and the one-step ahead output gap. The structural parameters estimates are recovered from the estimated GMM parameters via a nonlinear mapping procedure. To estimate the GMM parameters, let ∑T QT (α) = − 12 g T (α)′ W g T (α), where g T (α) = T −1 t =1 gt (α) =
∑T ∑T ∑T −1 ′ −1 ′ t =1 X t r t − T t =1 Xt Zt α, G = T t =1 ∂ gt (α)/∂α = ∑ T −1 ′ ′ −T t =1 Xt Zt , and ▽θ θ QT (θ , W ) = −G WG. T −1
The shrinkage GMM estimator satisfies:
α (W ) = arg max QT (α) − 0.5λ‖α‖2 α 1
= arg max − g T (α) W g T (α) − 0.5λT α
′
2
5 −
αs
2
.
(60)
s=1
From (60), the first order conditions give:
α (W ) = G WG + λT Ip
′
−1
′
GW
T 1−
T t =1
Xt r t
.
We will consider two shrinkage estimators: α1 = α (W ∗ ), where ∗ W is the inverse of the asymptotic variance of gt (α), and α2 = α (I ). In the implementation, we chose λT by using the cross validation method described in Section 3. Panel A in Table 4 shows the empirical results for the GMM parameters, α . Our results show that we do not reject the null hypothesis of identification in both the Volker–Greenspan period as well as in the Pre-Volker period. Panel B shows instead the results for the structural parameters, θ . The results for the latter are very different, and show that we cannot reject the null of strong identification in the Pre-Volker period but we do reject identification in the Volker–Greenspan era. Our results suggest that, while identification issues are not a concern for the GMM parameters, they are indeed a concern for the structural parameters in the monetary policy reaction function. In passing, note that Mavroeidis (2010) estimates the joint confidence sets for the inflation-aversion and output gap reaction parameters by using Stock and Wright’s (2000) identification–robust test.13 His objective is rather different from ours. While we want to test whether the parameters are weakly identified, he instead wants to estimate a confidence set that is robust to weak identification.
(58)
The set of instruments Xt includes 4 lags of inflation, output gap, the Fed Fund Rate, interest rate spread, money growth, and inflation in commodity prices. Let θ = {ρ1 , ρ2 , β, γ , π ∗ }. Note that π ∗ is not directly identifiable from (58); it is instead estimated ∗ ∗ − r as: (rr r ∗ )/ (1 − β), where rr ≡ rr ∗ − (β − 1)π ∗ and ∗ rr is the sample average of the real interest rate. The parameter
13 He finds that the confidence sets are much wider in the Volker–Greenspan’s subsample than in the Pre-Volker era, and that the confidence sets contain parameters included in both the determinate and the indeterminacy regions, which is consistent with our results. However, his analysis is computationally very demanding, and very difficult to implement in highly dimensional parameter spaces.
258
A. Inoue, B. Rossi / Journal of Econometrics 161 (2011) 246–261
where θ¯T is a point between θ0 and θ˜T . Because QT (θ ) is twice continuously differentiable by Assumption (b), its third derivatives
Table 4 Empirical results. Pre-Volker 1960:1–1979:2
Volker–Greenspan 1979:3–1996:4
Panel A. GMM parameters
ST (α) p-value
4.86 0.43
7.13 0.21
Panel B. Structural parameters
ST (θ) p-value
4.77 0.44
17.74 0.002
Notes: The table reports the value of our test statistic ST (θ) and its p-values for testing identification in both the GMM parameters (Panel A) and the Structural parameters (Panel B) in the two sub-samples of interest.
7. Conclusions This paper provides a new test for identification. The test has a limiting chi-square distribution under the null hypothesis of identification. Among the advantages of our test, we have: (i) the test is simple to implement; (ii) the test has power against weak identification; (iii) unlike most of the tests available in the literature, our test directly focuses on the null hypothesis of interest (identification) rather than the opposite (no identification). We document the good small sample size and power properties of our test via Monte Carlo simulations calibrated on both a Consumption Capital Asset Pricing Model and a Taylor rule monetary policy reaction function. Finally, we implement our test to analyze whether the structural parameters of the Taylor rule monetary policy reaction function are identified in the data. We show that identification is a concern mainly in the Volker–Greenspan era. In this paper we used the quadratic penalty term. Recently Caner (2009) developed GMM estimators with least absolute shrinkage and selection operator (LASSO) under strong identification. Extending our results to non-quadratic penalty terms, such as LASSO, is an interesting avenue of research but is beyond the scope of this paper. Acknowledgements We are grateful to Jonathan Wright for helpful conversations and providing us with his codes used in Wright (2003) and to the editor, associate editor, two anonymous referees, Craig Burnside, Mehmet Caner, Graham Elliott and participants of the first ERID conference for comments. We thank seminar participants at the University of Washington, Seattle, the Federal Reserve Bank of St. Louis, the UNC-NCSU econometrics workshop and the University of Tokyo for helpful comments. This research was supported by National Science Foundation grants SES-1022125 and SES1022159 and North Carolina Agricultural Research Service Project NC02265. Appendix. Proofs of the theorems Proof of Theorem 1. Part (a): Eq. (11) trivially follows from Theorem 3.1 of Newey and McFadden (1994, p. 2143), and Assumptions (a)–(c). Because λT = o(1), it follows from Theorem 2.1 of Newey and McFadden (1994, p. 2121), and Assumptions (a)–(c) that θ˜T = θ0 + op (1). The first-order condition for θ˜T is
∇θ QT (θ˜T ) − λT (θ˜T − θ¯ ) = 0k×1 .
(61)
∂ vec [∇θ θ QT (θ ) − λT Ik ]−1 ∂θ = −{[∇θ θ QT (θ ) − λT Ik ]−1 ⊗ [∇θ θ QT (θ ) − λT Ik ]−1 } ∂ × vec[∇θ θ QT (θ )] ∂θ
(63)
is finite in a shrinking neighborhood of θ0 with probability approaching one, and
[∇θ θ QT (θ¯T ) − λT Ik ]−1 = [∇θ θ QT (θ0 ) − λT Ik ]−1 + Op (‖θ˜T − θ0 ‖).
(64)
It follows from (62) and (64) that
θ˜T − θ0 − λT BT (θ0 ) = −[∇θ θ QT (θ0 ) − λT Ik ]−1 ∇θ QT (θ0 ) + Op (λT ‖θ˜T − θ0 ‖).
(65)
Therefore Eq. (12) follows from (65) and Assumptions (c.ii), (c.iii) and (f.i). Eq. (13) in Part (b): We will follow the proof of Theorem 1 of Stock and Wright (2000). First, we will show βˆ T = β0 + Op (T −1/2 ). Second, we will find a limiting representation for ∇ QT (α, β0 + bT −1/2 ). Third, we will prove Eq. (13). It follows from Assumption (d.i) that p
QT (θ ) → Qβ (β)
(66)
uniformly in θ . Because Qβ (β) is uniquely maximized at β0 by p
Assumption (d.iv), we can show that βˆ T → β0 by using the standard argument. Next we will show that βˆ T = β0 + Op (T −1/2 ). The first order condition for maximizing QT (θ ) with respect to β is
∇β QT (θˆT ) = 0k2 ×1 .
(67)
By applying the mean value theorem to (67) we obtain
∇β QT (θ0 ) + ∇βα QT (θ¯T )(αˆ T − α0 ) + ∇ββ QT (θ¯T )(βˆ T − β0 ) = 0k2 ×1 ,
(68) p
where θ¯T = [α¯ T′ , β¯ T′ ]′ is a point between θ0 and θˆT . Because βˆ T → β0 , Θ is compact by Assumption (a), which implies that αˆ T − α0 = Op (1), ∇βα QT (θ ) = Op (T −1/2 ) uniformly in θ by Assumption (d.iii), ∇ββ QT (θ )−∇ββ Qβ (β) = Op (T −1/2 ) uniformly in θ by Assumption (d.iv), ∇ββ Qβ (β) is bounded and non-singular by Assumptions (a), (b) and (c.iii) we have
βˆ T − β0 = Op (T −1/2 ).
(69)
Next we will find a limiting representation for ∇ QT (α, β0 + bT −1/2 ) as an empirical process in [a′ , b′ ]′ ∈ ΘA × Θ B where Θ B is a compact set in ℜk2 . We have 1
TQT (a, β0 + bT −1/2 ) = TQT (a, β0 ) + T 2 ∇β QT (a, β0 )b 1 ′ b ∇ββ QT (a, β0 )b + op (T −1 ) 2T ⇒ Qα (a, β0 ) + Zβ (α, β0 )′ b
+
By applying the mean value theorem to (61) we obtain
θ˜T − θ0 − λT [∇θθ QT (θ¯T ) − λT Ik ]−1 θ0 − θ¯ = −[∇θθ QT (θ¯T ) − λT Ik ]−1 ∇θ QT (θ0 ),
p
are bounded on the compact set Θ . Because θ˜T → θ0 and Q (θ0 ) is non-singular by Assumption (c.iii), [∇θ θ QT (θ0 ) − λT Ik ]−1 is nonsingular with probability approaching one. Thus,
1
(62)
+ b′ ∇ββ Qβ (a, β0 )b. 2
(70)
A. Inoue, B. Rossi / Journal of Econometrics 161 (2011) 246–261
Thus by Lemma 3.2.1 of Van der Vaart and Wellner (1996, p. 286), we conclude that [αˆ T′ , T 1/2 (βˆ T − β0 )′ ]′ ⇒ [α ∗′ , b∗′ (α ∗ )]′ where α ∗ maximizes (3) and b∗ (α) is given in (4). Eq. (14) in Part (b): First, we will show the consistency and convergence rates of α˜ T and β˜ T . Because supθ∈Θ |QT (θ )− Qβ (β)| = op (1), Qβ (β) is uniquely maximized at β0 , and λT = o(1) by
where [α¯ T′ β¯ T′ ]′ is a point between [α¯ ′ + a′ /(λT T ), β0′ + λT (β0 −
¯ ′ [∇ββ Qβ (β0 )]−1 + T − 12 b′ ]′ and [α¯ ′ , β0′ ]′ . Thus it follows from (78) β) and Assumptions (d) and (e) that
[
p
(71) 1/2
and Assumption (d.ii) that α˜ T − α¯ = Op (1/(λT T )) = Op (1). An application of the mean value theorem to the first order condition for β˜ T ,
¯ = 0k1 ×1 , ∇β QT (α˜ T , β˜ T ) − λT (β˜ T − β)
a ¯ + T − 21 b α¯ + , β0 + λT [∇ββ Qβ (β0 )]−1 (β0 − β) λT T 2 λT a + ‖β0 + λT [∇ββ Qβ (β0 )]−1 (β0 − β) ¯ + 2 λT T ] λT 2 − 12 ¯ ¯ 2 −QT (α, ¯ β0 ) + ‖β0 − β‖ + T b − β‖
T QT
Assumptions (d.iv) and (e), one can show that β˜ T → β0 by using a standard argument. It follows from the first order condition for α˜ T ,
∇α QT (α˜ T , β˜ T ) − λT (α˜ T − α) ¯ = 0k1 ×1 ,
259
2
¯ (κ[∇ββ Qβ (β0 )]−1 (β0 − β) ¯ + b) ⇒ [Zβ (α, ¯ β0 ) − κ(β0 − β)] ′
1
¯ + b)′ ∇ββ Qβ (β0 ) + (κ[∇ββ Qβ (β0 )]−1 (β0 − β)
(72)
2
around β0 yields
¯ + b). × (κ[∇ββ Qβ (β0 )]−1 (β0 − β)
¯ β˜ T − β0 − λT [∇ββ QT (α˜ T , β¯ T ) − λT Ik2 ]−1 (β0 − β)
(79)
(73)
Therefore Theorem 1(b) follows from Lemma 3.2.1 of Van der Vaart and Wellner (1996, p. 286), (76), (79) and Assumption (d.ii).
where β¯ T is a point between β˜ T and β0 . By Assumptions (d.i), (d.iv) and (e), (73) can be written as
Proof of Theorem 2. Part (a): It follows from Assumptions (c.ii), (c.iii) and (e), Theorem 1(a) and Eq. (65) that
= −[∇ββ QT (α˜ T , β¯ T ) − λT Ik2 ]−1 ∇β QT (α˜ T , β0 ),
¯ β˜ T − β0 − λT [∇ββ Qβ (β0 )]−1 (β0 − β) = −[∇ββ Qβ (β0 )] ∇β QT (α, ¯ β0 ) + Op (T −1
+ Op (T
− 12
1
− 12
T 2 [θ˜2T − θ˜1T − λT Bˆ 2T + λT Bˆ 1T ]
‖α˜ T − α‖)
).
1
= −T 2 [∇θ θ Q2T (θ0 ) − λT Ik ]−1 ∇θ Q2T (θ0 ) (74)
1
+ T 2 [∇θ θ Q1T (θ0 ) − λT Ik ]−1 ∇θ Q1T (θ0 ) 1 1 1 − λT T 2 (Bˆ 2T − BT ) + λT T 2 (Bˆ 1T − BT ) + Op (T − 2 )
It follows from (74) and Assumptions (a), (d.i), (d.iv) and (e) that
β˜ T − β0 = Op (T −1/2 ).
(75)
d
→ N (0k×1 , D′ Σ D).
It follows from (71), (75) and Assumptions (d.ii) and (d.iii) that
α˜ T − α¯ =
1
λT
∇α QT (α˜ T , β0 ) +
= Op
1
λT T
1
λT
(76)
where β¯ T is a point between β˜ T and β0 . Second we will consider a limiting representation for a λ ¯ + T − 12 b + T , β0 + λT [∇ββ Qβ (β0 )]−1 (β0 − β) QT α¯ + λT T 2 2 a −1 − 12 2 ¯ ¯ × + ‖β0 + λT [∇ββ Qβ (β0 )] (β0 − β) + T b − β‖ (77) λT T as an empirical process in [a′ , b′ ]′ ∈ Θ A × Θ B where Θ A × Θ B is a compact set in ℜk1 × ℜk2 . By using Taylor’s theorem, (77) can be written as
[ ]′ λT 0k1 ×1 2 ¯ QT (α, ¯ β0 ) + ‖β0 − β‖ + ∇θ QT (α, ¯ β0 ) − λT β0 − β¯ 2 a
λT T × −1 − 21 ¯ λT [∇ββ Qβ (β0 )] (β0 − β) + T b ′
and
1
−αj + α¯ + Op λT T λT Bˆ jT = λT . (82) −1 ¯ + Op λT [∇ββ Qj,β (β0 ) − λT Ik2 ] (β0 − β) 1/2 ∗
1
T − 2 dˆ T = θ˜2,T − θ˜1,T − λT Bˆ 2,T + λT Bˆ 1,T ⇒
2
and that
a
, λT T × ¯ + T − 21 b λT [∇ββ Qβ (β0 )]−1 (β0 − β)
[∇θθ QjT (θ ) − λT Ik ]−1 −1 T −1 Hj,αα (θ ) − λT Ik1 + op (T −1 ) T −1/2 Hj,αβ (θ ) + op (T −1/2 ) = −1/2 −1/2 −1/2 T Hj,βα (θ ) + op (T ) ∇ββ Qj,β (β) − λT Ik2 + Op (T ) 1 1 1 Op − λT Ik1 + Op λ2 T 1/2 λ T T , T = (81) 1 1 Op [∇ββ Qj,β (β) − λT Ik2 ]−1 + Op (T − 2 ) 1 / 2 λT T
It follows from Theorem 1(b) and Eqs. (81) and (82) that
λT T ¯ + T − 21 b λT [∇ββ Qβ (β0 )]−1 (β0 − β) × [∇θθ QT (α¯ T , β¯ T ) − λT Ik ] +
Part (b): First we will show a result which will be used in the subsequent proofs. Using Eqs. (6) and, (7) of Magnus and Neudecker (1999, p. 11), result 0.7.4 of Horn and Johnson (1985, p. 19), and Assumption (d), we obtain
T
a
1
p
ˆ T → D and Σ ˆ T → Σ by Assumptions (c.iii), (e) and (g) and Since D Theorem 1(a), the desired result follows from (80).
∇αβ QT (α˜ T , β¯ T )(β˜ T − β0 )
,
p
(80)
(78)
I2 ⊗
1
T − 2 Ik1 0k2 ×k1
0k1 ×k2 Ik2
ˆT D
[ ∗ ] α2 − α1∗ 0k2 ×1
(83)
260
A. Inoue, B. Rossi / Journal of Econometrics 161 (2011) 246–261
1 − I 0 k k1 ×k2 κ 1 − 1 − 1 − 1 (∇ββ Q1,β (β0 )) H1,βα (α, ¯ β0 ) (∇ββ Q1,β (β0 )) d → κ 1 0k1 ×k2 − Ik1 κ 1 (∇ββ Q2,β (β0 ))−1 H2,βα (α, ¯ β0 ) (∇ββ Q2,β (β0 ))−1 κ 1 [ ] −M1 Ik1 0k1 ×k2 ≡ I2 ⊗ κ . (84) 0k2 ×k1
Ik2
M2
By Assumption (h.ii),
[
I2 ⊗
1
T 2 Ik1 0k2 ×k1
[ ∗ Σ11 ⇒ ∗ Σ21
0k1 ×k2 Ik2
Σ12 ∗ . Σ22 ∗
]′
[ 1 ˆ T I2 ⊗ T 2 Ik1 Σ
0k2 ×k1
0k1 ×k2 Ik2
]
]
(85)
The result (32) follows from Eqs. (83)–(85) and Eq. (7) of Magnus and Neudecker (1999, p. 11). The result (33) follows because the ˆ ′T Σ ˆ T Dˆ T diverges to infinity. reciprocal of the eigenvalues of D References Acemoglu, D., Johnson, S., 2007. Disease and development: the effect of life expectancy on economic growth. Journal of Political Economy 115, 925–985. Ait-Sahalia, Y., Parker, J.A., Yogo, M., 2004. Luxury goods and the equity premium. Journal of Finance 59, 2959–3004. Alcala, F., Ciccone, A., 2004. Trade and productivity. Quarterly Journal of Economics 119, 613–646. Amemiya, T., 1985. Advanced Econometrics. Harvard University Press, Cambridge, MA. Anderson, T.W., 1984. An Introduction to Multivariate Statistical Analysis. Wiley, New York, NY. Anderson, T.W., 1951. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics 22, 327–351. Anderson, T.W., Kunitomo, N., 2009. Likelihood ratio tests in reduced rank regression with applications to econometric structural equations (unpublished manuscript). www.stat.stanford.edu/~ckirby/ted/papers/2009_LRTsRRR.pdf. Anderson, T.W., Kunitomo, N., 1992. Tests of overidentification and predeterminedness in simultaneous equation models. Journal of Econometrics 54, 49–79. Anderson, T.W., Kunitomo, N., 1994. Asymptotic robustness of tests of overidentification and predeterminedness. Journal of Econometrics 62, 383–414. Anderson, T.W., Rubin, H., 1949. Estimation of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics 20, 46–63. Anderson, T.W., Rubin, H., 1950. The asymptotic properties of estimates of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics 21, 570–582. An, S., Schorfheide, F., 2007. Bayesian analysis of DSGE models. Econometric Reviews 26, 113–172. Antoine, B., Renault, E., 2010. Efficient minimum distance estimation with multiple rates of convergence. Journal of Econometrics (forthcoming). Bound, J., Jaeger, D.A., Baker, R.M., 1995. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variables is weak. Journal of the American Statistical Association 90, 443–450. Caner, M., 2009. Lasso type GMM estimator. Econometric Theory 25, 1–23. Canova, F., Sala, L., 2009. Back to square one: identification issues in DSGE models. Journal of Monetary Economics 56, 431–449. Carrasco, M., 2008. A regularization approach to the many instruments problem. Université de Montréal (unpublished manuscript). Clarida, R., Galí, J., Gertler, M., 2000. Monetary policy rules and macroeconomic stability: evidence and some theory. Quarterly Journal of Economics 115, 147–180. Consolo, A., Favero, C.A., 2009. Monetary policy inertia: more a fiction than a fact? Journal of Monetary Economics 56, 900–906. Cragg, J.G., Donald, S.G., 1996. On the asymptotic properties of LDU-based tests of the rank of a matrix. Journal of the American Statistical Association 91, 1301–1309. Cragg, J.G., Donald, S.G., 1997. Inferring the rank of a matrix. Journal of Econometrics 76, 223–250. Cragg, J.G., Donald, S.G., 1993. Testing identifiability and specification in instrumental variables models. Econometric Theory 9, 222–240. Crowe, C., 2010. Testing the transparency benefits of inflation targeting: evidence from private sector forecasts. Journal of Monetary Economics 57, 226–232. DeJuan, J.P., Seater, J.J., 2007. 
Testing the cross-section implications of Friedman’s permanent income hypothesis. Journal of Monetary Economics 54, 820–849.
Doyle, J.J., 2007. Child protection and child outcomes: measuring the effects of foster care. American Economic Review 97, 1583–1610. Dufour, J.-M., Khalaf, L., Kichian, M., 2006. Inflation dynamics and the New Keynesian Phillips curve: an identification robust econometric analysis. Journal of Economic Dynamics and Control 30, 1707–1727. Durbin, J., 1959. Errors in variables. Review of the International Statistical Institute 22, 23–32. Eichenbaum, M.L., Hansen, L.P., Singleton, K.J., 1985. A time series analysis of representative agent models of consumption and leisure choice under uncertainty. Quarterly Journal of Economics 103, 51–78. Faria, H.J., Montesinos, H.M., 2009. Does economic freedom cause prosperity? An IV approach. Public Choice 141, 103–127. Fuhrer, J.C., Rudebusch, G.D., 2004. Estimating the Euler equation for output. Journal of Monetary Economics 51, 1133–1153. Gallant, R.A., White, H., 1988. A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell, Oxford, UK. Gill, L., Lewbel, A., 1992. Testing the rank and definiteness of estimated matrices with applications to factor, state-space and ARMA models. Journal of the American Statistical Association 87, 766–776. Guggenberger, P., Smith, R.J., 2005. Generalized empirical likelihood estimators and tests under partial, weak and strong identification. Econometric Theory 21, 667–709. Hahn, J., Ham, J., Moon, H.R., 2011. The Hausman test and weak instruments. Journal of Econometrics 160, 289–299. Hahn, J., Hausman, J.A., 2002. A new specification test for the validity of instrumental variables. Econometrica 70, 163–189. Hansen, L.P., 1982. Large sample properties of generalized methods of moments estimators. Econometrica 50, 1029–1054. Hausman, J.A., 1978. Specification tests in econometrics. Econometrica 46, 1251–1272. Hayashi, F., 2000. Econometrics. Princeton University Press, Princeton, NJ. Hoerl, A.E., Kennard, R.W., 1970a. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67. Hoerl, A.E., Kennard, R.W., 1970b. Ridge regression: applications to non-orthogonal problems. Technometrics 12, 69–82. Horn, R.A., Johnson, C.R., 1985. Matrix Analysis. Cambridge University Press, Cambridge, UK. Hsiao, C., 1983. Identification. In: Griliches, Z., Intriligator, M.D. (Eds.), Handbook of Econometrics, vol. 1. North Holland, Amsterdam, The Netherlands. Iskrev, N., 2007. How much do we learn from the estimation of DSGE models? A case study of identification issues in a New Keynesian business cycle model, University of Michigan (unpublished manuscript). Kleibergen, F., Paap, R., 2006. Generalized reduced rank tests using the singular value decomposition. Journal of Econometrics 133, 97–126. Kocherlakota, N., 1990. On tests of representative consumer asset pricing models. Journal of Monetary Economics 26, 285–304. Koopmans, T.C., Hood, W.C., 1953. The estimation of simultaneous linear economic relationships. In: Hood, W.C., Koopmans, T.C. (Eds.), Studies in Econometric Methods. Yale University Press, New Haven, CT (Chapter 6). Krause, M.U., Lopez-Salido, D., Lubik, T.A., 2008. Inflation dynamics with search frictions: a structural econometric analysis. Journal of Monetary Economics 55, 892–916. Magnus, J.R., Neudecker, H., 1999. Matrix Differential Calculus with Applications in Statistics and Econometrics, revised ed. John Wiley & Sons, Chichester, UK. Mavroeidis, S., 2010. Monetary policy rules and macroeconomic stability: some new evidence. 
American Economic Review 100, 491–503. Nason, J.M., Smith, G.W., 2005. Identifying the New Keynesian Phillips curve. Federal Reserve Board of Atlanta. Working Paper 2005-1. Nelson, C.R., Startz, R., 1990a. Some further results on the exact small sample properties of the instrumental variable estimator. Econometrica 58, 967–976. Nelson, C.R., Startz, R., 1990b. The distribution of the instrumental variable estimator and its t ratio when the instrument is a poor one. Journal of Business 63, S125–S140. Newey, W.K., 1985. Generalized method of moments specification testing. Journal of Econometrics 29, 229–256. Newey, W.K., McFadden, D.M., 1994. Large Sample Estimation and Hypothesis Testing. In: Engle, Robert F., McFadden, Daniel L. (Eds.), Handbook of Econometrics, vol. IV. Elsevier, Amsterdam, The Netherlands. Okui, R., 2007. Instrumental variables estimation in the presence of many moment conditions. Journal of Econometrics (forthcoming). Park, C., Kang, C., 2008. Does education induce healthy lifestyle? Journal of Health Economics 27, 1516–1531. Robin, J.-M., Smith, R., 2000. Tests of rank. Econometric Theory 16, 151–175. Ruge-Murcia, F., 2007. Methods to estimate dynamic stochastic general equilibrium models. Journal of Economic Dynamics and Control 31, 2599–2636. Sargan, J.D., 1958. The estimation of economic relationships using instrumental variables. Econometrica 26, 393–415. Shapiro, A.H., 2008. Estimating the New Keynesian Phillips curve: a vertical production chain approach. Journal of Money, Credit and Banking 40, 627–666. Smets, F., Wouters, R., 2007. Shocks and frictions in US business cycles: a Bayesian DSGE approach. American Economic Review 97, 586–606. Staiger, D., Stock, J.H., 1997. Instrumental variables regressions with weak instruments. Econometrica 65, 557–586. Stock, J.H., Wright, J.H., 2000. GMM with weak identification. Econometrica 68, 1055–1096. Stock, J.H., Wright, J.H., Yogo, M., 2002. A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics 20, 518–529.
A. Inoue, B. Rossi / Journal of Econometrics 161 (2011) 246–261 Stock, J.H., Yogo, M., 2005. Testing for weak instruments in linear IV regression. In: Andrews, Donald W.K., Stock, James H. (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge University Press, New York, NY, pp. 80–108. Stone, C.J., 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society 36, 111–147. Tauchen, G.E., 1986. Statistical properties of generalized method of moments estimators of structural parameters obtained from financial market data. Journal of Business and Economic Statistics 4, 397–425. Tauchen, G.E., Hussey, R., 1991. Quadrature based methods for obtaining approximate solutions to nonlinear asset pricing models. Econometrica 59, 371–396. Taylor, J.B., 1993. Discretion versus policy rules in practice. Carnegie-Rochester Conference Series on Public Policy 39, 195–214. Temple, J., Wossmann, L., 2006. Dualism and cross-country growth regressions. Journal of Economic Growth 11, 187–228.
261
Van der Vaart, A.W., 1998. Asymptotic Statistics. Cambridge University Press, Cambridge, UK. Van der Vaart, A.W., Wellner, J.A., 1996. Weak Convergence and Empirical Processes with Applications to Statistics. Springer-Verlag, New York, NY. Wossmann, L., West, M., 2006. Class-size effects in school systems around the world: evidence from between-grade variation in TIMSS. European Economic Review 50, 695–736. Wright, J.H., 2002. Testing the null identification in GMM. Board of Governors of the Federal Reserve System. International Finance Discussion Paper Number 732. Wright, J.H., 2003. Detecting lack of identification in GMM. Econometric Theory 19, 322–330. Wu, D.-M., 1973. Alternative tests of independence between stochastic regressors and disturbances. Econometrica 41, 733–750. Yogo, M., 2004. Estimating the elasticity of intertemporal substitution when instruments are weak. Review of Economics and Statistics 86, 797–810.
Journal of Econometrics 161 (2011) 262–283
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Subsampling high frequency data✩ Ilze Kalnina ∗ Département de sciences économiques, Université de Montréal, C.P. 6128, succ. Centre-Ville, Montréal, H3C 3J7, QC, Canada
article
info
Article history: Received 14 February 2010 Received in revised form 21 December 2010 Accepted 23 December 2010 Available online 28 December 2010 JEL classification: C12 C13 C14
abstract The main contribution of this paper is to propose a novel way of conducting inference for an important general class of estimators that includes many estimators of integrated volatility. A subsampling scheme is introduced that consistently estimates the asymptotic variance for an estimator, thereby facilitating inference and the construction of valid confidence intervals. The new method does not rely on the exact form of the asymptotic variance, which is useful when the latter is of complicated form. The method is applied to the volatility estimator of Aït-Sahalia et al. (2011) in the presence of autocorrelated and heteroscedastic market microstructure noise. © 2010 Elsevier B.V. All rights reserved.
Keywords: Subsampling Market microstructure noise High frequency data Realised volatility
1. Introduction Volatility estimation is a key component in the evaluation of financial risk. Financial econometrics continues to make progress in developing more robust and efficient estimators of volatility. But for some estimators, the asymptotic variance is hard to derive or may take a complicated form and be difficult to estimate. To tackle these problems, the current paper develops a method of inference that is automatic in the sense that it does not rely on the exact form of the asymptotic variance. In the traditional stationary time series framework, this task can be accomplished by traditional bootstrap and subsampling variance estimators. However, these are inconsistent with high frequency data, which is potentially contaminated with market microstructure noise, see Section 2.1. A new subsampling method is developed, which enables us to conduct inference for a general class of estimators that includes many estimators of integrated volatility. The question of inference on volatility estimates is important due to volatility being unobservable. For example, one might want to test whether
✩ I would like to thank Donald Andrews, Peter Bühlmann, Valentina Corradi, Kirill Evdokimov, Silvia Gonçalves, Yuichi Kitamura, Nour Meddahi, Per Mykland, Mark Podolskij, Mathieu Rosenbaum, Myung Hwan Seo, Kevin Sheppard, and Mathias Vetter for helpful comments. I am particularly grateful to Oliver Linton and Peter Phillips for their help and encouragement. This research was supported by the ESRC and the Leverhulme foundation. I am also grateful to Cowles Foundation for financial support. ∗ Tel.: +1 5143432400. E-mail address:
[email protected].
0304-4076/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2010.12.011
volatility is the same on two different days, or in two different time periods within the same day. The latter corresponds to testing for diurnal variation in the volatility. Also, a common way of testing for jumps in prices is to compare two different volatility estimates, which converge to the same quantity under the null hypothesis of no jumps, but are different asymptotically under the alternative hypothesis of jumps in prices. Then, a consistent inferential method is needed to determine whether the two volatility estimates are significantly different. To illustrate the robustness of the new method, this paper considers the example of the inference problem for the integrated variance estimator of Aït-Sahalia et al. (2011), in the presence of market microstructure noise. As several assumptions about the market microstructure noise are relaxed, the expression for the asymptotic variance becomes more complicated, and it becomes more challenging to estimate each component of the variance separately. On the other hand, the new subsampling method delivers consistent confidence intervals that are simple to calculate. According to the fundamental theorem of asset pricing (see Delbaen and Schachermayer, 1994), the price process should follow a semimartingale. In this model, integrated variance (sometimes called integrated volatility) is a natural measure of variability of the price path (see, e.g. Andersen et al., 2001). With moderate frequency data, say 5 or 15 min data, this can be estimated by the so-called realized variance (RV), a sum of squared returns (also referred to as realized volatility).1 The nonparametric 1 See Eq. (6) for the definition of realized variance.
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
nature of realized variance and the simplicity of its calculation have made it popular among practitioners. It has been used for asset allocation (Fleming et al., 2003), forecasting of Value at Risk (Giot and Laurent, 2004), evaluation of volatility forecasting models (Andersen and Bollerslev, 1998), and other purposes. The Chicago Board Options Exchange (CBOE) started trading S&P 500 ThreeMonth realized volatility options on October 21, 2008. Over the counter, these and other derivatives written on RV have been traded for several years. These financial products allow one to bet on the direction of the volatility, or to hedge against exposure to volatility. One way of pricing these derivatives is by using the theory of quadratic variation. Suppose the log-price Xt follows a Brownian semimartingale process, dXt = µt dt + σt dWt ,
(1)
where µ, σ , and W are the drift, volatility, and Brownian Motion processes, respectively. Our interest is in estimating volatility over some interval, say one day, which we normalize to be [0, 1]. The quantity of interest is captured by integrated variance, or quadratic variation over the interval, which is defined as 1
∫
σs2 ds.
IV = 0
Realized variance is a consistent estimator of integrated variance in infill asymptotics, i.e., when the approximation is made as the time distance between adjacent observations shrinks to zero. According to this approximation, therefore, the estimation error in RV should be smaller for even higher frequency data than 5 min. Ironically, this is not the case in practice. For the highest frequencies, the data is more and more clearly affected by the bid–ask spread and other market microstructure frictions, rendering the semimartingale model inapplicable and RV inconsistent. Zhou (1996) proposed to model high frequency data as a Brownian semimartingale with an additive measurement error. This model can reconcile the main stylized facts of prices both in moderate and high frequencies. Zhang et al. (2005) were the first to propose a consistent estimator of integrated variance in this model, in the presence of i.i.d. microstructure noise, which they named the Two Scales Realized Volatility (TSRV) estimator; it is also known as the Two Time Scale estimator in the literature.2 Consistent estimators in this framework were also proposed by BarndorffNielsen et al. (2008), Christensen et al. (2010, 2009) and Jacod et al. (2009). Aït-Sahalia et al. (2011) extend the TSRV estimator to the case of stationary autocorrelated microstructure noise, but do not propose an inference method. The problem with inference arises from the complicated structure of the asymptotic variance of the TSRV estimator. The method proposed in this paper can be used to conduct inference for the Two Time Scales estimator in the presence of not only autocorrelated, but also heteroscedastic measurement error. This allows the model to accommodate the stylized fact in the empirical market microstructure literature about the U-shape in observed returns and spreads.3 This new subsampling scheme is useful in practice when available estimators of the asymptotic variance are complicated and hence present difficulties in constructing confidence intervals. In such cases, a common procedure is to estimate the asymptotic variance as a sample variance of the bootstrap estimator. It turns 2 A note on terminology: many authors have called TSRV the subsampling estimator of IV . It is very different from, and should not be confused with, the subsampling method of Politis et al. (1999). 3 See Andersen and Bollerslev (1997), Gerety and Mulherin (1994), Harris (1986), Kleidon and Werner (1996), Lockwood and Linn (1990) and McInish and Wood (1992).
263
out that this procedure is inconsistent for noisy high frequency data (see Section 2.1). The subsampling method of Politis and Romano (1994) has been shown to be useful in many situations as a way of conducting inference under weak assumptions and without utilizing knowledge of limiting distributions. The basic intuition for constructing an estimator of the asymptotic variance is as follows. Imagine the standard setting of discrete time with long-span (also called increasing domain) asymptotics. Take some general estimator θn (think of i.i.d. Yi′ s, a parameter of interest θ = E (Y ), ∑ 1 Yi ). Suppose we know its asymptotic distribution and θn = n
τn ( θn − θ ) H⇒ N (0, V ) as n → ∞, where H⇒ denotes convergence in distribution, and τn is the rate of convergence when n observations are used. Suppose we would like to estimate V , in order to be able to construct confidence intervals for θn . This can be done with the help of many subsamples, for which the estimator θn has the same asymptotic distribution. In particular, suppose we construct K different subsamples of m = m(n) consecutive observations, starting at different values (whether they are overlapping or not is irrelevant here), where m = m(n) → ∞ as n → ∞ but m/n → 0. Denote by θn,m,l the estimator θn calculated using the lth block of m observations, with n being the total number of observations. Then, the asymptotic distribution of τm ( θn,m,l − θ ) is the same, i.e.,
τm θn,m,l − θ H⇒ N (0, V )
(2)
for each subsample l, l = 1, . . . , K . Hence, V can be estimated by the sample variance of τm θn,m,l (with centering around θn , a proxy for the true value θ ). This yields the following estimator of V K 2 1 − θn,m,l − θn , V = τm2 ×
K l =1
(3)
and we have p V −→ V , p
where −→ denotes convergence in probability.Notice that the estimator in (3) is like an average of squared τm θn,m,l − θ over all subsamples, except that θn plays the role of θ . The difference between θn and θ is negligible because θn converges faster to θ than θn,m,l does. It is shown that a direct application of the above method to the high frequency framework fails. This fact is illustrated for the RV example in model (1). That is, θn is taken to be realized variance and θ its probability limit, integrated variance. The intuition behind the failure is straightforward. The problem is that θn,m,l and θn do not converge to the same quantity and so (2) cannot be satisfied. The underlying reason is that the spot (or infinitesimal) volatility σt is changing over time. The estimator calculated on a small block cannot estimate the integrated variance θ , because θ contains information about spot volatility on the whole interval. Politis et al. (1997) show, in the long span asymptotic framework, that the traditional subsampling scheme is valid under weaker assumptions than stationarity. Instead of stationarity, they assume that the normalized θn,m,l is on average close to the limiting distribution of θn . This allows for, e.g., considerable local heteroscedasticity. However, in an infill asymptotic framework, changes of volatility and its moments over time are not local in nature. Lahiri (1996) illustrates the problems infill asymptotics creates by proving inconsistency of some commonly-used estimators under this asymptotic scheme. A novel subsampling scheme is proposed that can estimate the asymptotic variance of RV. Importantly, it can also be applied to the Two Time Scales estimator of Aït-Sahalia et al. (2011), in
264
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
the presence of autocorrelated measurement error with diurnal heteroscedasticity. There are no alternative inferential methods available in the literature for this case. Moreover, this subsampling scheme can, under some conditions, estimate the asymptotic variance of a general class of estimators, which includes many estimators of the integrated variance. The remainder of this paper is organized as follows. Section 2 describes the usual subsampling method of Politis and Romano (1994) and proposes a new subsampling method. It also introduces an alternative scheme that is robust to jumps in volatility. Section 3 shows how inference can be conducted for the Two Time Scales estimator in the presence of autocorrelated and heteroscedastic microstructure noise. Section 4 applies the subsampling method to a general class of estimators. Section 5 investigates the numerical properties of the proposed method in a set of simulation experiments. Section 6 applies the method to high frequency stock returns. Section 7 concludes. 2. Description of resampling schemes The aim of this section is to motivate and introduce a new subsampling scheme in a relatively simple framework. Since the proposed method does not change across models or estimators, the motivation and intuition is given for the example of realized volatility in the absence of any market microstructure noise. We first describe the setting for the realized volatility example. Suppose that log-price Xt is the following Brownian semimartingale process dXt = µt dt + σt dWt ,
(4)
where Wt is standard Brownian motion, the stochastic process µt is locally bounded, and σt is a càdlàg spot volatility process.4 Suppose that we have observations on X on the interval [0, T ], where T is fixed. Without loss of generality set T = 1. Assume observation times are equidistant, so that the distance between observations is 1/n. The asymptotic scheme is infill as n → ∞. Suppose the quantity of interest is integrated variance (also called integrated volatility), 1
∫
σs2 ds.
IV =
(5)
0
IV is a random variable depending on the realization of the volatility path {σt , t ∈ [0, 1]}. The usual estimator of IV is the realized variance (often called realized volatility) RVn =
n −
Xi/n − X(i−1)/n
2
.
(6)
i=1
This satisfies
√
n (RVn − IV ) H⇒ MN (0, V )
∫ V = 2IQ = 2
(7)
1
σs4 ds
0
where MN (0, V ) denotes a mixed normal distribution with random conditional variance V independent of the underlying normal distribution.5 The convergence (7) follows from BarndorffNielsen and Shephard (2002) and Jacod (2008), and is stable in law, see Aldous and Eagleson (1978). Stable convergence is slightly stronger than the usual convergence in distribution. Stable asymptotics are particularly convenient because it permits division 4 In other words, the sample paths of the volatility process are left continuous with right limits. 5 In other words, the limiting p.d.f. is of the form f (x) = φ (x)f (v)dv , where 0,v V
√
fV denotes the p.d.f. of V and φ0,v (x) = exp(−x2 /2v 2 )/ 2πv .
of both sides of (7) by the square root of any consistent estimator of V to obtain a standardized asymptotic distribution for conducting inference on RVn . In fact, for the realized variance example, inference can be conducted relatively easily. Barndorff-Nielsen and Shephard (2002) propose to estimate V as twice the realized quarticity, V = 2IQn , where realized quarticity is the sum of fourth powers of returns, properly scaled, IQn =
n 4 n − Xi/n − X(i−1)/n . 3 i=1
(8) p
The estimator V is consistent for V in the sense that V /V −→ 1. This result allows the construction of consistent confidence intervals for IV . For example, a two-sided level 1 − α interval is √ α = RVn ± zα/2 V 1/2 / n, where zα is the α quantile given by C from a standard normal distribution, and this has the property that α ] → 1−α . Mykland and Zhang (2009) have proposed an Pr[IV ∈ C alternative estimator of V that is more efficient than V under the sampling scheme (4) and can also be used to construct intervals based on the studentized limit theory. 2.1. Failure of the traditional resampling schemes Recently, Gonçalves and Meddahi (2009) have proposed a bootstrap algorithm for RV, in the setting of no noise. They use the i.i.d. and wild bootstrap applied to studentized RV. They show that resampling the studentized RV gives confidence intervals for RV with better properties than the 2IQn estimator of asymptotic variance. Their procedure relies on an estimator of the asymptotic variance, which is not always available. A more widely used bootstrap procedure would be to estimate asymptotic variance as the sample variance of the bootstrap statistic. This procedure is simple, but only consistent for the wild bootstrap with certain external random variables. Podolskij and Ziggel (2007) show that, to first order, all methods proposed by Gonçalves and Meddahi (2009) apply in exactly the same way to the Bipower Variation estimator. All the above bootstrap methods become inconsistent in the presence of any market microstructure noise. While the current section also keeps this simplifying assumption for expositional purpose, Section 3 shows robustness of the proposed subsampling estimator to the market microstructure noise, which enables its application to data at the highest frequencies. We now consider the popular method of Politis and Romano (1994). This subsampling scheme fails in our setting due to variability of the volatility over time. It is however instructive to consider, as subsequently proposed methods use a similar underlying idea. Let θn be the RV calculated on the full sample, and let θn,m,l be the RV calculated on the lth block of m observations,6
θn,m,l =
ml −
Xi/n − X(i−1)/n
2
,
i=m(l−1)
see Fig. 1. In the above, 0 < l ≤ K , where K is the number of subsamples, K = ⌊n/m⌋. Assumption 5.3.1 of Politis et al. (1999) is satisfied, i.e., the sampling distribution of τn ( θn − θ ) converges weakly. Therefore, 6 For simplicity, all subsampling schemes in this paper are presented with nonoverlapping subsamples. However, it is inconvenient to display non-overlapping subsamples in figures, so Figs. 1–3 show maximum overlap versions of the subsampling schemes.
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
265
Fig. 1. The subsampling scheme of Politis and Romano (1994).
in the setting of stationary and mixing processes, V should be approximated well by K 2 1 − VPR = m × θn,m,l − θn .
K l =1
However, in our setting, it is easy to see that VPR does not converge to V . Proposition 1. Let X satisfy (4) and θn be the realized variance defined in (6). Let m → ∞ and m/n → 0 as n → ∞. Then,
VPR − mθ 2 = op (m). The estimator on the full sample converges to the true value,
θn →p θ . On the other hand, the estimator on a subsample converges to zero. This is because each high frequency return is of order n−1/2 , so a sum of m squared returns is of order m/n → 0. Therefore, Proposition 1 obtains that VPR is asymptotically equal to mθ 2 . Notice that the value θ 2 is not related to V , which is the parameter of interest. Different orders of magnitude of θn,m,l and θn could be m accounted for by using n θn,m,l instead of θn,m,l as in K 2 1 − n ′ VPR =m× θn,m,l − θn .
K l =1
θ
p
− θn 9 0 so long as the spot θ estimates volatility changes over time. This is because m n n,m,l the spot variance σ 2 (·) at some point, instead of the integrated variance θ .7 Assuming no drift, no leverage, and sufficiently smooth n m n,m,l
volatility sample paths, one can show that the resulting estimator has a diverging bias, conditional on the volatility sample path, ′ E VPR − V = m IQ − IV 2 + o(1).
2.2. The new subsampling scheme We now introduce and explain the new subsampling scheme. The current subsection describes this scheme for the RV example, and Section 3 applies it to the Two Time Scales estimator. Section 4 applies this subsampling scheme to a more general class of estimators. In the subsampling scheme of Politis and Romano (1994), the problem was that the estimator on a subsample θn,m,l was centered at ‘‘the wrong quantity’’. In the formula K 2 1 − θn,m,l − θn , VPR = m ×
K l=1
the quantity θn plays the role of θ , but the problem is that the leading term in θn,m,l is integrated variance over a shrinking interval,
θl =
m
However, it still holds that
p
It then holds that θn,m,l − θn → 0, ∀l. This, however, is not sufficient for consistency of V . The problem is that for every n, any two subsample estimators would be highly correlated. The resulting V would be asymptotically unbiased, but inconsistent.
(9)
Therefore, the underlying reason for the failure of the subsampling method of Politis and Romano is the fact that the spot volatility changes over time. The latter effect is captured by the term IQ − IV 2 in Eq. (9) above, which is zero if and only if volatility is constant over the whole interval [0, 1]. An intuitive alternative would be to sample prices at some lower frequency instead of taking a sub-block of consecutive high frequency observations. In a way, sub-blocks are mimicking the long span asymptotic scheme, and the infill asymptotic scheme equivalent would be subsamples formed by lower frequency prices. Thus, for example, θn,m,1 would be RV calculated with 5 min returns starting with the first second, θn,m,2 would be RV calculated with 5 min returns starting with the second, and so on. 7 For estimation of the spot variance using realized variance on a shrinking interval, see Foster and Nelson (1996), Andreou and Ghysels (2002), Mikosch and Starica (2003) and Kristensen (2010).
lm/n
∫
(l−1)m/n
σu2 du.
(10)
Thus, θn,m,l either converges to zero or the spot volatility depending on whether it is scaled by n/m, but in any case it cannot estimate θ , the integrated volatility over the whole interval [0, 1]. Therefore, θn,m,l − θn does not converge to zero, causing VPR to explode. Consider an alternative approach. We aim to center estimators at θl (as defined by Eq. (10)), in order to extract the information about the variance of θn,m,l . The leading term of the variance of θn,m,l is lm/n
∫ Vl = 2
(l−1)m/n
σu4 du.
It is of course not equal to V , which we want to estimate, but we can use the fact that these add up to V over subsamples, 1
∫
σu4 du =
V =2 0
K −
Vl .
l =1
Given the additive structure of V , this approach can still give a consistent estimator of V , despite volatility changing over time. The only question left is, how to obtain an estimator of the centering factor θl . So consider using two subsamples, one with length J and one with length m, such that J is of smaller order than n m. Then, both m θn,m,l and nJ θn,J ,l estimate the spot variance, but they have different convergence rates. This in turn means one can be used to center the other. To simplify the presentation, we use the notation θllong and θlshort instead of θn,m,l and θn,J ,l .
266
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
Fig. 2. The new subsampling scheme.
Since the rate of convergence of nJ θ short is V becomes
√
J, the estimator of
2 K 1 − n short n long θl θl Vsub = J × − K l =1
J
(11)
m
long
θl are realized variances calcuwhere K = ⌊n/m⌋. θlshort and lated on the short subsample with J observations, and the long subsample with m observations. Fig. 2 provides a graphical illustra tion. The corresponding time intervals used are and
(l−1)m n
(l−1)m n
, (l−1n)m+J
, lm , so the expressions for estimators on subsamples n
become J
θlshort =
−
X (l−1)m+i − X (l−1)m+i−1 n
i =1
n
m
θllong =
−
X (l−1)m+i − X (l−1)m+i−1
i=1
n
2
n
2
. long
For an arbitrary volatility process, nJ −1 θlshort and nm−1 θl cannot be guaranteed to be close. For example, if the volatility process has a large jump on the interval covered by θllong , but not covered by
θlshort , then nJ −1 θlshort and nm−1 θllong can differ substantially. Therefore, some kind of smoothness condition on the volatility paths is needed. Importantly, we do not require differentiable sample paths. It can be shown that a sufficient condition is to assume that volatility itself evolves like a Brownian semimartingale. This is a common way of modeling volatility in practice. Assumption A1. The volatility process {σt , t ∈ [0, 1]} is a Brownian semimartingale of the form
t dσt = µ ˜ t dt + σ˜ t dW t is standard Brownian motion, the stochastic process where W µt is locally bounded and the stochastic process σt is càdlàg. Proposition 2. Suppose (A1) holds and X satisfies (4). Let θn be the realized variance defined in (6), m → ∞, J → ∞, m/n → 0, J /m → 0, and mJ 2 /n → 0 as n → ∞. Then, p Vsub −→ V .
The cost of not relying on the exact expression of V is that the proposed method is data intensive. J should be large enough for θJ to have reasonable finite sample properties. As can be seen from the conditions above, m should be even larger, and n, the total number of observations, should be much larger than J. Sections 3 and 4 show that Proposition 2 can be extended to more general settings than RV in a Brownian semimartingale model. This is because the subsampling method does not rely on the exact form of V , which it estimates. 2.3. An alternative subsampling scheme The new estimator introduced in the previous section, Vsub , has the disadvantage that it does not allow for jumps in the volatility. The current section presents an alternative subsampling scheme that allows for such jumps.8 This subsampling scheme is illustrated in Fig. 3. On every block of m observations, calculate the estimator θ n twice as follows. First, calculate it using all m observations, and denote it as θlfast . Then, n calculate the estimator θ using every Q th price observation in the block of m observations, and denote it as θlslow .9 Now, θlfast can be used to center the θlslow , because they both converge to (10), and because θlfast converges to (10) faster than θlslow does. The new estimator of V becomes K m 1 − n slow n fast 2 ′ Vsub = × θl − θl
Q
K l=1
m
n/m
=
n − Q l =1
θlslow − θlfast
m
2
8 We conjecture that this alternative subsampling scheme is also robust to jumps in the price process in those special cases when these jumps do not appear in the expression of V . One example of such a case is the multipower variation when the sum of all powers is smaller than one, see Barndorff-Nielsen et al. (2005). 9 The subsampling scheme is similar in structure to the one in Lahiri et al. (1999). They similarly use two grids for subsampling to predict stochastic cumulative distribution functions in a spatial framework. However, they assume that the underlying process is stationary and their asymptotic framework is mixed infill and increasing domain.
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
267
Fig. 3. An alternative subsampling scheme.
Cov 1Yi/n , 1Y(i−1)/n
where
θkfast =
m −
X i+m(k−1) − X i−1+m(k−1)
i=1
θkslow =
⌊− m/Q ⌋ i=1
n
= Cov 1Xi/n + ϵi/n − ϵ(i−1)/n , 1X(i−1)/n + ϵ(i−1)/n − ϵ(i−2)/n = −Var ϵ(i−1)/n . (14)
2
n
X iQ +m(k−1) − X (i−1)Q +m(k−1) n
n
2
.
The consistency result does not need Assumption A1 anymore. Proposition 3. Suppose X satisfies (4). Let m → ∞, Q → ∞, m/n → 0, and Q /m → 0 as n → ∞. Then, p ′ Vsub −→ V .
(12)
′ The new estimator Vsub uses sparse data and hence cannot capture autocorrelated market microstructure noise. Since the Two Scales estimator with autocorrelated noise is the focus of this ′ paper, Vsub is not used beyond the current section.
3. Inference for the two scales realized volatility estimator This section shows how the new subsampling scheme can be applied to the Two Time Scales estimator of integrated variance proposed by Aït-Sahalia et al. (2011). Although only this example is discussed in detail, this subsampling scheme could also be applied to other integrated variance estimators in the presence of market microstructure noise, such as Multiscale estimator of Zhang (2006), Realized Kernels of Barndorff-Nielsen et al. (2008), and the preaveraging estimator of Jacod et al. (2009). Stock price data at highest frequencies is well known to be affected by market microstructure noise. For example, trades are not executed in practice at the efficient price. Typically, they are executed either at the prevailing bid or ask price. Therefore, observed transaction prices alternate between bid and ask prices (the so-called bid–ask bounce), creating negative autocorrelation in observed returns, which is a stylized fact in high frequency data. This was the motivation for Zhou (1996) to introduce an additive market microstructure noise model where the observed log-price Y is a sum of a Brownian semimartingale component X and an i.i.d. noise ϵ , Yt = Xt + ϵt .
(13)
In this model, observed log-returns display negative first order autocovariance,
Another stylized fact is that realized variances calculated at the highest frequencies become very large. This is in contradiction to the Brownian semimartingale model, where RV has roughly the same expectation irrespective of the frequency at which it is calculated. Also, RV should converge to IV when higher and higher frequencies are used. This difficulty lies behind the underlying reason for the common practice not to calculate realized variance at higher frequencies than 5 or 15 min. The problem with this approach is that it implies discarding most of the available data. There are only 72 five minute returns in a day, and only 24 fifteen minute returns in a day, while the available high frequency data is usually measured in thousands. In order to be able to use all the available data, one has to work with a model that can accommodate the above stylized facts. Zhang et al. (2005) were the first to introduce a consistent estimator of integrated variance of the efficient price IV within the additive measurement error model of Zhou (1996), see Eq. (13) above. The noise ϵt is i.i.d., zero mean with variance Var (ϵ) = ω2 and Eϵ 4 < ∞, and independent from the latent log-price Xt . In this model, Zhang et al. (2005) propose the following consistent estimator for the integrated variance of Xt , nG θn = [Y , Y ](G1 ) − 1 [Y , Y ](1) , n
(15)
where, for any parameter b, [Y , Y ](b) = nb =
n −b 2 1 − Y(i+b)/n − Yi/n b i=1
n−b+1 b
.
Notice that [Y , Y ](1) coincides with the RV estimator, while [Y , Y ](G1 ) consists of lower frequency returns. In particular, [Y , Y ](G1 ) consists of returns calculated from prices that are G1 high frequency observations apart. Thus, time distance is n−1 between high frequency observations and G1 n−1 between lower frequency observations. In empirical applications, a common choice for G1 is such that the lower frequency returns are sampled at 5 min. Zhang et al. (2005) call the above estimator the Two Scales Realized
268
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
a
b
5.2
5
0.6
0.4
4.8 0.2 4.6 0
4.4
4.2
-0.2
4 -0.4 3.8 -0.6 3.6 -0.8
3.4 0
0.2
0.4
0.6
0.8
1
1
2
3
4
5
6
7
8
9
10 11 12 13 14 15
Fig. 4. Properties of returns of Microsoft (MSFT) stock. Returns are constructed from transaction prices over the whole year 2006. See Section 6 for data cleaning procedures. Panel (a) shows the estimated heteroscedasticity function ω (·), averaged over all days in 2006. Panel (b) shows the autocorrelogram of returns calculated in tick time.
Volatility (TSRV) estimator. They derive the following asymptotic distribution of the estimator, n1/6 θn − θ ⇒
√ VZ
V =c
∫
3
1
−2 4 σu4 du + 8c ω, 0 noise
∫
1
1
∫
signal
(16)
0
ω4 (u)du .
noise
In this model, the previous estimator of the noise part of V ceases to be consistent as
signal
i.e., it consists of a signal part, which is due to the efficient price, and a noise part. In the above, Z is a standard normal random vari able, independent from V , and c is the constant in G1 = cn2/3 . With i.i.d. noise, V can be estimated component by component. Var (ϵ) = ω2 can be estimated using the following estimator proposed by Bandi and Russell (2008), p 2 = RV → ω ω2 .
p 2 = RV → ω
2n
We saw in Section 2 that in a model without noise, integrated quarticity σu4 du can be estimated by realized quarticity defined in (8). This becomes more difficult in the presence of noise. However, Barndorff-Nielsen et al. (2008) have proposed an estimator for σu4 du, which is consistent in the presence of i.i.d. noise, see Section 5. This model is for i.i.d. noise, so the noise is assumed to be homoscedastic. A well known stylized fact in the empirical market microstructure literature is that intradaily spreads (difference between bid and ask price) and intradaily stock price volatility are described typically by a U-shape (see footnote 3 for some references). In other words, prices are more volatile in mornings and afternoons than at noon; spreads are also larger in mornings and afternoons. Fig. 4(a) presents an estimate of heteroscedasticity function ω2 (·) for transaction prices of Microsoft stock, averaged over all days in the year 2006. The diurnal variation is evident. Kalnina and Linton (2008) introduce diurnal heteroscedasticity in the microstructure noise in model (13). Suppose the efficient log-price X is the same as above in (13), but the noise displays unconditional heteroscedasticity. In particular, suppose the noise ϵt satisfies (17)
1
∫
ω2 (u)du, 0
so, by Jensen’s inequality, its square would be always strictly smaller than the target ω4 (u)du as long as there is any diurnal variation at all. Kalnina and Linton (2008) show that ω (·) can be estimated at any fixed point τ using kernel smoothing,
ω2 (τ ) =
2n
ϵt = ω(t )ut
4
σu4 du + 8c −2 3 0
V =c
where the asymptotic (conditional) variance takes the form 4
where ω(t ) is a nonstochastic differentiable function of time t, and ut is i.i.d. with E (ut ) = 0, and Var (ut ) = 1. As a result of this generalization, the asymptotic variance of θn changes to
n 1−
2 i =1
Kh (ti−1 − τ ) 1Yti−1
2
.
In the above, h is a bandwidth that tends to zero asymptotically and Kh (.) = K (./h)/h, where K (.) is a kernel function satisfying some regularity conditions. This suggests estimating the noise part of V by 8c −2
∫ 0
1
ω4 (u)du.
As we saw earlier in (14), the i.i.d. measurement error model is consistent with negative first order autocorrelations in the observed returns. However, returns can sometimes exhibit autocorrelation beyond the first lag in practice. For example, Fig. 4(b) graphs the autocorrelogram of the returns of Microsoft stock for the whole year 2006. We see that Microsoft stock returns display strong negative autocorrelation well beyond the first lag. While the model (13) does generate a negative first autocorrelation, it implies that any further autocorrelations have to be zero. Since increments of a Brownian semimartingale are uncorrelated in time, any such autocorrelation has to be due to noise ϵt .10 10 In a Brownian semimartingale model, the only source of autocorrelations of increments is drift, which is negligible for high frequencies.
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
Aït-Sahalia et al. (2011) generalize the i.i.d. measurement error model (13) in a different direction. They allow for autocorrelated stationary microstructure noise. In particular, they make the following assumption about the noise. Assumption A2. The noise ϵti is independent from the efficient log-price process Xt , and it is (when viewed as a process in index i) stationary and strong mixing with the mixing coefficients decaying exponentially. Also, for some κ > 0, Eϵ 4+κ < ∞. In model (13) with ϵt satisfying Assumption A2, Aït-Sahalia et al. (2011) propose the following consistent estimator for the integrated variance of Xt , nG θn = [Y , Y ](G1 ) − 1 [Y , Y ](G2 )
(18)
nG 2
where G1 and G2 satisfy the following assumption, Assumption A3. The G1 parameter of the Two Time Scales esti mator θn defined by (18) satisfies G1 = cn2/3 for some constant c. The G2 parameter is such that Cov(ϵ0 , ϵG2 /n ) = o(n−1/2 ), G2 → ∞, G2 /G1 → 0.11 The Two Time Scales estimator defined by (18) is more general than the one in (15), which is a special case when G2 = 1 and G1 → ∞ as n → ∞. Aït-Sahalia et al. (2011) show that the new Two Time Scales estimator θn has the same asymptotic properties except it has a more complicated asymptotic variance, V = c
∫
4 3
1
σu4 du
0
where (Gi )
[Y , Y ]l
J −G 1 −i
=
Gi i=1 i = 1, 2
J − Gi + 1
J Gi =
Gi
,
Y(l−1)m/n+(i+Gi )/n − Y(l−1)m/n+i/n
2
,
i = 1, 2.
One obtains θlshort by substituting J for m above. In Fig. 2, the version with maximum overlap is presented. In practice, it is much quicker to compute the no overlap version, for which Theorem 4 is formulated. While this does not alter the conclusion of Theorem 4, the maximum overlap version is slightly more efficient. In this case, Vsub is defined by (20) with K = n − m + 1. To the author’s knowledge, this is the only available method in the literature to construct confidence intervals for the Two Time Scales estimator when the noise is autocorrelated. Similarly, one can apply this method to the Multiscale estimator of AïtSahalia et al. (2011) when microstructure noise is autocorrelated. The advantage of using the Multiscale estimator is that it has the optimal rate of convergence n1/4 . However, the above model of Aït-Sahalia et al. (2011) rules out any diurnal heteroscedasticity of the noise. When both autocorrelation and heteroscedasticity are taken into account, we have Lemma 5. Suppose the observed price satisfies Yi/n = Xi/n + ϵi/n where the efficient log-price Xt follows a Brownian semimartingale process (4) and microstructure noise ϵi/n satisfies
ϵt = ω(t )ut
signal
+ 8c −2 Var (ϵ)2 + 16c −2 lim
n −
n→∞
Cov ϵ0 , ϵi/n
,
(19)
2/3
where c is the constant in G1 = cn . The literature does not provide any estimator of V or an alternative method for constructing confidence intervals for θn . Here we can estimate the asymptotic variance of the Two Time Scales estimator θn using the subsampling scheme. Theorem 4. Suppose model (13) holds, and ϵti satisfy Assumption A2. Let θn be the Two Time Scales estimator defined by (18), with parameters G1 and G2 that satisfy Assumption A3. Let V be defined by (19). Let J → ∞, m → ∞, J /m → 0, m/n → 0, G1 /J → 0, and Jmn−5/3 → 0. Then, p Vsub → V
where
K l =1
J
m
(20)
with K = ⌊n/m⌋. In the above, θlshort is simply θ n calculated on a smaller block of J observations inside the lth larger block of m observations, with exactly the same parameters G1 and G2 as θ n uses. See Fig. 2 for an illustration. In particular, J G1 J G2
2 K 1 − n short n long Vsub = Jn−2/3 × θl − θl
where ω (·) is a differentiable, nonstochastic function of time, ut satisfies Assumption A2 and Var (ut ) = 1. Then, θn defined in (18) is such that
θn − θ ⇒ n1/6
noise
2
i=1
(G ) θlshort = [Y , Y ]l 1 −
269
(G2 )
[Y , Y ]l
√
VZ
where V = c
4 3
1
∫
σu4 du + 8c −2 0
+ 16c −2
∫
1
ω4 (u)du
0 1
∫
ω4 (u)du lim 0
n→∞
n −
Cov ϵ0 , ϵi/n
2
.
i =1
In this case of autocorrelated and heteroscedastic noise, Theorem 4 easily generalizes and subsampling again delivers a consistent estimate of V . This is because both are special cases of the consistency result of the subsampling estimator in the general case, which is described in the next section. To estimate this more complicated V , exactly the same formula Vsub should be used as for the homoscedastic case. In this model, this is the only available method in the literature to construct confidence intervals for the Two Time Scales estimator. Importantly, this section illustrates the robustness of the subsampling estimator of V across different sets of assumptions. Moreover, it is also easy to implement. All that is necessary is to compute θn on several sub-blocks of observations. We conjecture that the subsampling estimator Vsub would be consistent for V under even more general assumptions than considered above, for example, in the case when autocorrelations of the noise are changing through time, or when the efficient returns have fat tails as in Meddahi and Mykland (2010). 4. Inference for a general estimator
11 The restriction on Cov(ϵ , ϵ 0 G2 /n ) should be considered in the light of the fact that Assumption A2 implies that there exists a constant φ such that, for all i,
Cov ϵi/n , ϵ(i+l)/n ≤ φ l Var (ϵ) .
This section shows how to use the new subsampling scheme (as described in Sections 2.2 and 3) to conduct inference for a general class of estimators of volatility measures. A set of assumptions
270
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
is introduced and explained, under which subsampling delivers a consistent estimate of the asymptotic variance of an estimator θn . As we shall see, there are two essential ingredients for the subsampling method to work. One is additivity over subsamples of the asymptotic variance of θn . The second is that the asymptotic distribution of θn calculated on a block of observations is similar, in a sense explained below, to the asymptotic distribution of θn calculated using all available data. We do not assume a specific process for X . It could be a pure diffusion or a diffusion contaminated with noise, as long as the regularity assumptions below are satisfied. All arguments in this section are made conditional on the volatility path {σu , u ∈ [0, 1]}. Suppose there is an estimator θn , for which the asymptotic distribution is known to be as follows
√ τn θn − θ ⇒ V Z .
(21)
In the above, τn is a known rate of convergence of θn . For example, τn = n1/2 for realized variance, τn = n1/6 for the Two Time Scales estimator. Z is a random variable that is known to satisfy E(Z ) = 0 and Var(Z ) = 1. A consistent estimator of V thus enables a researcher to construct consistent confidence intervals for θn . We recall the subsampling scheme introduced in Section 2.2. Divide the total number of returns into blocks of m consecutive returns. Thus, we obtain ⌊n/m⌋ subsamples. Denote by θllong the estimator θn calculated using all m returns of the lth block, l = 1, . . . , ⌊n/m⌋. Denote by θlshort the estimator θn calculated using only J returns of the lth block, where J < m. See Fig. 2 in Section 2.2 for a graphical illustration. θlshort and mn θllong converge to the In order to guarantee that nJ same quantity, despite being defined on different time intervals, we need to impose some smoothness on the volatility paths. In particular, we use the following assumption. Assumption A4. (21) holds, where θ and V are the following functions of the volatility path {σu , u ∈ [0, 1]},
θ=
1
∫
g1 (σ (u)) du
(l−1)m/n ∫ [(l−1)m+J]/n
Vlshort =
θ
∫
long l
=
long
=
Vl
[(l−1)m+J]/n
∫
θlshort =
(l−1)m/n lm/n
(l−1)m/n ∫ lm/n (l−1)m/n
g2 (σ (u)) du
ζl(n) =
where g1 , g2 ∈ C [0, 1] and σ is a Brownian semimartingale as in (4). For example, we obtain integrated variance IV with g1 (u) = σ 2 (u) and the asymptotic variance of realized variance with g2 (σ (u)) = 2σ 2 (u). The type of estimators that are likely to satisfy the assumptions of this section are those that are approximately additive over subsamples, i.e.,
l =1
J
θlshort + op (1)
(22)
or
θn =
⌊− n/m⌋
θllong + op (1).
(23)
l =1
All currently available estimators of integrated variance and related quantities satisfy this additivity property. We also impose the following assumption, which ensures that estimators on subsamples are mixing.
(n)
Assumption A5. For any fixed n, the returns process Ri/n i=1,...,n (n) with Ri/n = Xi/n − X(i−1)/n is strong mixing. Also, θn = φ R(1n/)n , R(2n/)n ,
. . . , R(1n) where φ : Rn −→ R.
g2 (σ (u)) du.
n J
2 short τn2 θl − θlshort − Vlshort . (n)
satisfies the following conditions
(n)
sup E ζl
as n → ∞,
→ 0.
l
(ii)
ζj(n) is Lp bounded for some p > 1.
We now discuss Assumption A6. Assumption A6(i) can be written equivalently as follows, as n → ∞,
sup E l
θn =
g1 (σ (u)) du,
Assumption A6. For every n, define θlshort and Vlshort by (24), and define a triangular array
1
m
(24) g2 (σ (u)) du
(i)
0
⌊− n/m⌋
g1 (σ (u)) du,
Finally, we make the following assumption,
The array ζj
1
V =
long
the respective quantities they estimate, and by Vlshort and Vl what can be thought of as their asymptotic variances. They can be defined as follows,
0
∫
This is a rather strong assumption. For example, when X follows a semimartingale (4), this assumption rules out leverage effects. Why could leverage effects be allowed for in Proposition 2? Proposition 2 assumed that X follows a semimartingale. Therefore, after a discretization approximation, a proof could be based on the powerful martingale methods as in, e.g., Jacod and Shiryaev (2003). Here, however, market microstructure noise is allowed for, which is not a semimartingale. Therefore, without imposing more structure on the estimator (such as approximate additivity in X and the noise as in the Two Scales estimator example), this technique cannot be used. θlshort do not As discussed in previous sections, θllong and estimate θ , since they use only information about the volatility path on a small time interval, whereas the volatility is changing long and θlshort throughout the interval [0, 1]. Let us denote by θl
Vlshort
−1
short 2 τn2 θl − θlshort → 1,
as long as is of order J /n. In other words, Assumption A6(i) requires that the square of the standardized statistic θlshort has asymptotic expectation one. On the full sample, we know from (21) n that 2the standardized θ is asymptotically a random variable Z with E Z = 1. Therefore, a sufficient condition for Assumption A6(i) Vlshort
to hold is that the asymptotic distribution of θlshort satisfies the same condition on a subsample. Roughly speaking, we need the estimator on a subsample, θlshort , to behave similarly to the estimator on a full sample, θn . Assumption A6(ii) is a stronger assumption, and it illustrates the main idea of the subsampling method. Recall the basic idea of subsampling as described in the introduction of the paper. Roughly speaking, in a stationary world, the way subsampling estimates V is by constructing many random variables with V as their asymptotic variance. In our nonstationary case, continuity
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
271
in time plays the role of stationarity as it ensures that the same feature in V is estimated by many subsamples. Assumption A6(ii) effectively imposes Vjshort to be of order J /n, i.e., that there is enough continuity in V with respect to time. Apart from this consideration, Assumption A6(ii) requires existence of moments. This is not an issue for a Brownian semimartingale model due to the local boundedness assumption on the drift and volatility functions, but becomes a constraint if X also contains other components. For example, consider a model where observations are sampled from a Brownian semimartingale with an additive noise ϵ . In this model, corresponding moments have to be assumed on ϵ for Assumption A6(ii) to hold. In the case of the Two Time Scales estimator discussed below, L4+ε boundedness of ϵ is needed, which is exactly what has been assumed by the authors of Two Time Scales estimator to derive its asymptotic distribution. We have the following result. Theorem 6. Assume (A4), (A5), and (A6). Let J → ∞, m → ∞, J /m → 0, m/n → 0, and Jmτn2 n−2 → 0. Then, p Vsub −→ V
where
2 ⌊n/m⌋ Jm − 2 n short n long Vsub = 2 τn θl θl . − n
l =1
J
m
The assumption Jmτn2 n−2 → 0 is determined by the smoothness of the volatility paths. Special cases show √ up in Proposition 2 and Theorem 4 with τn2 being replaced by n and n1/6 , respectively. All these results assume that volatility follows a Brownian semimartingale (Assumption A1). If one assumed more smoothness, this assumption could be weakened. Importantly, exactly the same formula is applied to all models and estimators, which satisfy the above assumptions. All that is necessary to calculate the estimator for V is to calculate the estimator θ n on several subsamples, as well as to know the convergence rate τn . In particular, Vsub simplifies to the formula for √ the realized variance in (11) with τn = n, and to the formula for the Two Time Scales estimator in (20) with τn = n1/6 . 5. Simulation study In this section numerical properties of the proposed estimator are studied for the example of the Two Time Scales estimator of Aït-Sahalia et al. (2011) in the case of i.i.d., autocorrelated, or heteroscedastic market microstructure noise. The observed log-price Yt is a sum of the efficient log-price Xt and noise ut . The paths of the efficient log-price are simulated from the Heston (1993) model: dXt = (α1 − vt /2) dt + σt dWt 1/2
dvt = α2 (α3 − vt ) dt + α4 vt
Simulation of the noise ut is described in Sections 5.1 and 5.2. The parameters of the Two Time Scales estimator and of the subsampling procedure are chosen as follows. We set G1 = 100, which in our data corresponds to 5 min lower frequency. This is a very popular choice in practice. We set G2 = 10 in all autocorrelated noise simulations, and G2 = 1 for heteroscedastic noise simulations. Two values of J are considered. The first is J = 2G1 = 200, and the second is J = 5G1 = 500. For m, three different values are considered, m = 4J , 10J, and 15J. The literature does not propose ways of estimating the asymptotic variance of the Two Time Scales when noise is autocorrelated or diurnal. However, in the case of i.i.d. noise, there is an alternative, and this can serve as a benchmark for the simulation results. In the case of i.i.d. noise, the expression for asymptotic variance V of the Two Time Scales estimator is 4 V =c t 3
t
∫
σu4 du + 8c −2 [Var(u)]2
0
and the alternative is to estimate each component of V separately. The easiest component to estimate is [Var(u)]2 . A popular estimator of Var(u) = ω2 is
2 = RV . ω
2n This has been proposed by, for example, Bandi andRussell (2006, t 2008). To estimate integrated quarticity IQ = t 0 σu4 du in the presence of noise is more difficult. A consistent estimator in the BNHLS , has been proposed by Barndorffpresence of i.i.d. noise, IQ Nielsen et al. (2008).12 Therefore, we can define the benchmark 12
dBt
where vt = σ , Wt and Bt are independent Brownian motions. The parameters of the efficient log-price process X are chosen to be the same as in Zhang et al. (2005). They are α1 = 0.05, α2 = 5, α3 = 0.04, and α4 = 0.5, and they correspond to one year being a unit of time. For the sake of consistency, we keep these time units for the rest of the section. We simulate n = 35, 000 observations over one week, i.e., five business days of 6.5 h each. This is motivated by the fact that GE stock has on average 35,000 observations per week in year 2006, see Section t 6. We aim to estimate weekly integrated variance or IV = 0 σs2 ds where t is one week or 1/50. The volatility path is fixed over simulations to facilitate comparisons. The volatility path used is plotted in Fig. 5. Varying the volatility path across simulations does not affect the theory nor the simulation results. 2 t
Fig. 5. Simulated volatility sample path.
BNHLS IQ δ, S = max
n ∗ 2 1 − 2 y2 − 2ω 2 , θn , δ −2 y2j,· − 2ω j−2,· n j=1
where y2j,· =
S −1 1 −
S s=0
Yδ (j+ s ) − Yδ (j−1+ s ) S S
2
,
j = 1, . . . , n
2 = exp log ω 2 − ω θn∗ /RV n = 1/ δ and where θn∗ is a consistent estimator of integrated variance IV . We take θn∗ to be the TSRV estimator θn . This estimator requires us to choose δ and S. We use the same choice as Barndorff-Nielsen et al. (2008) do, for real and simulated data. This choice 2 corrects the small sample bias in ω 2 . With is S = n1/2 and δ = n−1/2 . Estimator ω large number of observations, there is no difference between the two estimators in practice, but we keep the version of Barndorff-Nielsen et al. (2008) anyway.
272
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
estimator 2 4 2 , BNHLS + 8c −2 ω Vb = c IQ 3 which is consistent for V when noise is i.i.d. To simulate the market microstructure noise, we consider two cases, autocorrelated and heteroscedastic noise. Combination of the two is straightforward both in theory and practice and is therefore omitted. 5.1. Autocorrelated noise The market microstructure noise is simulated as an MA(1) process ui∆ = ϵi∆ + ρϵ(i−1)∆ ,
ϵ ∼ N 0,
ω 1 − ρ2 2
,
ω2 λ = 1/250 . σu2 du 0 Results are simulated for three different noise-to-signal ratios, λ = 0.0001, 0.001, and 0.01, motivated by Hansen and Lunde (2006). The range of estimated λ for the data of our empirical study is between 0.001 and 0.0019, see Section 6. Results are represented in terms of coverage probabilities of 95% two-sided, left-sided, and right-sided confidence intervals for IV . Results for noise-to-signal ratios λ = 0.0001, 0.001, and 0.01 are collected in Tables 2–4, respectively. We see that the subsampling estimator performs well in all scenarios. Vb performs well in the scenario it is designed for, which is the uncorrelated noise case. As the correlation increases, estimated values of Vb decrease, resulting in undercoverage. This effect is less pronounced for smaller noise cases. This is to be expected given that Vb is consistent for V when noise is zero. This simulation study effectively documents the well known fact that one should not calculate estimators that are not robust to autocorrelation with autocorrelated data. In practice, it can be partly remedied by using sparse data. However, this strategy does not help in general when noise is time-varying. Time-varying noise is an empirical fact that has not received much attention in the nonparametric volatility literature. 5.2. Heteroscedastic noise We now adopt the noise model of Kalnina and Linton (2008) as in Eq. (17) where noise displays time-varying heteroscedasticity. How to find the closest equivalent of the noise-to-signal ratio for this case? We know that 2n
p
→
t
∫
1 t
ω2 (u)du.
λ=
ω (u) = a
t 0 t 0
ω2 (u)du σu2 du
,
where t = 1/250 keeps the horizon of integration to be one day.
u t
−
1 2
2
,
u ∈ [0, t ]
(25)
where t = 1/50 and a is a constant chosen to deliver values of λ = 0.0001, 0.001, t or 0.01. Simple calculation shows it implies setting a = 12λ 0 σ 2 (u)du. Results are collected in Table 5. Perhaps surprisingly, both methods seem to work well. Is this result to be expected? In this case, the asymptotic variance of the Two Scales estimator is 4 V =c t 3
∫ 1 t 4 σu4 du + 8c −2 ω (u)du . t 0 0
∫
t
Vsignal
(26)
Vnoise
As discussed in Section 3, by Jensen’s inequality, Vnoise would be underestimated by Va . In particular, Vnoise is underestimated by a factor of 1 t
t 0
t 1 t
0
ω4 (u)du 2 , ω2 (u)du
which equals 1.8 when Eq. (25) is true. Estimation of the BNHLS behaves in the presence of first part depends on how IQ heteroscedastic noise, and it turns out it has a positive bias in all our simulations. Although Vb consists of two components that are both strongly biased, they tend to cancel out, and Vb is at most 20% away from the true value. We conclude this section with two remarks. First, a class of volatility estimators not used in this paper are the pre-averaging estimators recently proposed by Jacod et al. (2009). This method can estimate IV , IQ , and other volatility functionals in the presence of noise. Although not robust to autocorrelation in the noise, it is robust with respect to heteroscedasticity considered here. Second, the Two Time Scales estimator is in fact inconsistent in the presence of heteroscedasticity of the noise of this form. This has been shown by Kalnina and Linton (2008) who propose a modification, jittered TSRV,14 which restores consistency. Inconsistency arises due to a bias from the end effects. In our simulations, jittered TSRV reduces the bias of the TSRV estimator on average 4 times in the large noise case. However, the magnitude of this bias is too small to show very different results in terms of coverage probabilities. Therefore, we do not report the results for the jittered TSRV. 13 We consider the interval of a week. Thus, a very stylized model of diurnal heteroscedasticity would be 5 parabolas instead,
0
Hence, the most natural definition of noise-to-signal ratio is the ratio of the integrated variance of the noise to the integrated variance of the latent price, 1 t
2
so that Var(u) = ω2 . In the above, ∆ = t /n, t = 1/50 is one week, and n = 35, 000. Four different values of ρ are considered, ρ = 0, −0.3, −0.5, and ρ = −0.7. The size of the noise, ω2 , is an important parameter. Here we build on the careful empirical study of Hansen and Lunde (2006), who investigate 30 stocks of Dow Jones Industrial Average. As in Hansen and Lunde (2006), we define noise-to-signal ratio as the ratio of the variance of the noise to the daily integrated volatility. Thus, we introduce
RV
This is a lucky situation where conventional estimates of λ motivated by the misspecified homoscedastic framework estimate consistently the λ in the true more general heteroscedastic framework. Thus, exactly the same values of λ are appropriate for the simulation setup, λ = 0.0001, 0.001, and 0.01. This equivalence does not hold for higher moments of noise, as discussed in Section 3, having implications on the conventional estimates of the asymptotic variance of the Two Scales estimator. For the shape of heteroscedasticity, we take the simplest possible design motivated by ‘‘U-shape’’, a parabola. It is simple and easy to replicate, but is not meant to be realistic and can be improved in many directions.13 We set
ω2 (u) = a
5 −
1 u∈
i=1
[
(i − 1)t it , 5
5
]
5u t
−i+
1 2
2
,
u ∈ [0, t] .
Moreover, a reverse J-shape is typically a better approximation. Also, a data driven method would be more realistic, but that would decrease the transparency and replicability of the simulation setup. 14 This is not in any way related to jittering of Barndorff-Nielsen et al. (2008). They introduce a modification for Realized Kernels needed to enable estimation of confidence intervals for Realized Kernels in the presence of i.i.d. noise.
I. Kalnina / Journal of Econometrics 161 (2011) 262–283 Table 1 Summary statistics.
MMM MSFT IBM AIG GE INTC
Num. of obs.
ω
daily IV
λ
810,835 2,368,013 1,226,468 1,054,541 1,835,057 2,651,006
0.0004 0.0003 0.0003 0.0003 0.0002 0.0004
0.00096 0.00089 0.00076 0.00070 0.00057 0.00155
0.0019 0.0011 0.0013 0.0017 0.0010 0.0010
However, it is important to use the jittered version if noise appears to be heteroscedastic and if avoiding bias is important. Moreover, this correction is strictly positive and in practice almost completely solves the problem that TSRV can be negative.
6. Empirical analysis This section applies the proposed subsampling method to high frequency data from the NYSE TAQ database, and compares it to the benchmark estimator Vb , which is introduced in the previous section. The data consists of full record transaction price data of 6 stocks for year 2006. The six stocks are American International Group (AIG), General Electric (GE), International Business Machines (IBM), Intel (INTC), 3M (MMM), and Microsoft (MSFT). We first describe the data pre-processing steps. First we obtain raw data of these six stocks for the whole year 2006, time stamped between 9:30 a.m. till 4 p.m. The first column of Table 5 in the Appendix lists the number of observations in this raw data set for each stock. Following Aït-Sahalia et al. (2011), data from all exchanges is retained and zero returns are removed. This means deleting a large part of data (see the second column of Table 5), since these flat trading periods can be quite long. Griffin and Oomen (2008) show that, in the Realized Volatility case, this adjustment of data improves precision of estimation. Jumps are also removed,15 since the additive market microstructure noise model (13) does not allow for jumps (see the third column of Table 5). There is also an additional issue to consider, which Barndorff-Nielsen et al. (2009) denote as local trends or ‘‘gradual’’ jumps. These authors notice that the realized kernel, which is the estimator of integrated variance they propose, does not behave well in the presence of these ‘‘gradual’’ jumps. Barndorff-Nielsen et al. (2009) notice that these local trends are associated with high volumes traded, and conjecture that they are due to nontrivial liquidity effects. The authors replace them with one genuine jump, but conclude that they do not have an automatic way of detecting episodes of local trends. The subsampling method proposed in the current paper is also vulnerable to such price behavior. Our strategy to identify these gradual jumps is based on the fact that they should look like genuine jumps on a lower frequency. Therefore, we construct a time series of lower (five minute) frequency data, and set to zero those lower frequency returns that are larger than seven weekly standard deviations. Table 1 contains some summary statistics of the resulting data set. The first column contains the number of observations used for estimation for each stock. The second column reports a measure of 15 Jumps are identified as deviations of the log-returns that are larger than five standard deviations on a moving window of 500 observations. This is motivated by the thresholding technique of filtering out jumps, first proposed by Cecilia Mancini in a series of papers (e.g., Mancini, 2004), see also Aït-Sahalia and Jacod (2009), Eq. (21). Returns containing an identified jump are deleted.
273
the noise (a square root of RV /2n, calculated on skip-10-ticks data for the whole year 2006). For IV estimation, we calculate the Two Time Scales estimator for each day in 2006, then average across daily (G1 is the average number of transactions in days to obtain IV 5 min; G2 = 10). The fourth column reports RV /2n . λ= daily IV The returns of all these stocks display large negative autocorrelation similar to GE in Fig. 4(b). The asymptotic variance of the Two Time Scales estimator is estimated for each of the 52 weeks in year 2006. We conjecture that as long as the distance between observations is of order 1/n, the underlying theory can be extended to the non-equidistant observations case, at least when the observation times are nonstochastic. Therefore, the estimation is done in tick time, as suggested in Barndorff-Nielsen et al. (2008) and other authors. This also applies to summary statistics. The results are displayed in Fig. 6 in the Appendix, in terms of 95% confidence intervals for weekly integrated variance. The Two Time Scales estimate θn is in the center of both confidence intervals by construction. The subsampling confidence intervals for Two Time Scales are usually wider than confidence intervals of the benchmark method Vb . From our simulations, we conclude this might be due to negative bias of the Vb estimator in the presence of negatively autocorrelated returns. This is because all six stocks have strongly negatively correlated returns, and we know from Section 5 that Vb is downward biased in this case. On the other hand, the subsampling estimator is immune to autocorrelation. The figures also show a lot of variability in the estimates of V . This is mainly due to variability of the Two Time Scales estimates, with large estimates of V corresponding to large θn and vice versa. Thus, episodes of high volatility generally correspond to episodes of high volatility of volatility. Though not reported here, these also correspond to weeks with very large numbers of transactions and large volumes traded. 7. Conclusion This paper develops an automated method for estimating the asymptotic variance of an estimator in noisy high frequency data. The method applies to an important general class of estimators, which includes many estimators of integrated variance. The new method can substantially simplify the inference question for an estimator, which has an asymptotic variance that is hard to derive or takes a complicated form. An example of such a case is the integrated variance estimator of Aït-Sahalia et al. (2011), in the presence of autocorrelated heteroscedastic market microstructure noise. There is no alternative inferential method available in the literature in this case. A question that is yet to be addressed rigorously is a datadriven bandwidth choice. Several choices for the Two Time Scales estimator are suggested in the Monte Carlo section. A very promising extension that will be considered in a future paper is inference for a multivariate parameter. Subsampling naturally produces positive semi-definite estimated variance–covariance matrices, which can be very important for applications. For estimators like Realized Volatility, all the results extend readily to the multivariate case. The real challenge, however, arises due to the additional complications, which are not present in the univariate case. These concern the fact that different stocks do not trade at the same time or so-called asynchronous trading. 
Also, uncertainty about the observation times becomes much more important in the multivariate context.
274
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
Appendix A. Proofs
This holds because
Since {σt } , { σt } , {µt } and { µt } are locally bounded, it can be assumed, without loss of generality, that they are uniformly bounded by Cσ (see Barndorff-Nielsen et al. (2006), Section 3). We use C to denote a generic constant that is different from line to line.
E |σt +s − σt |q |Ft
A.1. Proof of Proposition 1 By Cauchy–Schwarz and Burkholder–Davis–Gundy inequality (Revuz and Yor, 2005, p. 160),
t
ml −
E θn,m,l =
E Xi/n − X(i−1)/n
2
≤ Cs
i=m(l−1) i/n
∫ ml −
≤ C
(i−1)/n
i=m(l−1)
σu4 du ≤ CCσ
m n
=
ml −
[ Cov
ml −
≤
ml −
[ E
E
2 ] 2 ′ Xi/n − X(i−1)/n , Xi /n − X(i′ −1)/n
i′ =m(l−1) i=m(l−1)
Xi/n − X(i−1)/n
2
i′ =m(l−1) i=m(l−1) ml −
≤
ml −
Xi/n − X(i−1)/n
Xi′ /n − X(i′ −1)/n
E |Xk.i |q F (k−1)m+i−1
×E ≤C
Xi′ /n − X(i′ −1)/n
ml −
E (i−1)/n
i′ =m(l−1) i=m(l−1)
∫ ×E
i′ /n
(i′ −1)/n
i/n
4 1/2
Introduce the following notation,
σ
4 u du
K 1 − DISCR Vsub 2σ k4−1 =
2 1/2
K k=1
DISCR
E V
αklong =
θn,m,l + m ×
γkDISCR
K 1 −
K l =1
θn,m,l
2
J i =1
2 σ m2 (k−1) W (k−1)m+i − W (k−1)m+i−1 n
n
n
m i=1
2 σ m2 (k−1) W (k−1)m+i − W (k−1)m+i−1 n
n
n
2 = αkshort − αklong .
2 K 1 − n short n long p Vsub = J × θl − θl →V K l =1
K l =1
J
m
1
∫
= m θn2 + op (m). The result now follows by consistency of θn for θ .
m n −
We want to show
K 2 m 2 m − = m θn2 − 2 θn + θn,m,l
σu4 du.
=2 0
First, by Riemann integrability of σ ,
A.2. Proof of Proposition 2 V Before proceeding to the main proof, we state two useful inequalities that hold when X and its volatility are Brownian semimartingales. First, for any q > 0
J − DISCR E γk |F k−1 K K k=1
2 θkshort − γk = θklong
K l =1
=
J n−
αkshort =
K 2 1 − VPR = m × θn,m,l − θn
E |σt +s − σt |q |Ft ≤ Csq/2 .
K
K
K J − E γk |F k−1 K K k=1
E V =
and
K
n
[(k−1)m+i−1]/n
for some constant C . Hence, m θn,m,l = Op n
K l =1
(28)
n
√ n σ m(k−1) 1W (k−1)m+i − 1X (k−1)m+i n n n ∫ [(k−1)m+i]/n √ = n µu du + σu − σ m(k−1) dWu .
≤ CCσ m2 n−2
= m θn2 − 2 θn m ×
1
Xk.i =
2 1/2 σu4 du
K 1 −
1∧q/2 ≤C
where
2 ]
4 ]1/2
∫
ml −
n
i′ =m(l−1) i=m(l−1)
[
q/2
where the Davis–Burkholder–Gundy inequality (Revuz and Yor, 2005, p. 160) is used to obtain the second transition. The second inequality is as follows, see Jacod (2007). For for all q > 1,
,
Var θn,m,l ml −
q ∫ s+t ∫ s+t u |Ft σ u dW = E µu du + t t q ∫ s+t ≤ E µu du |Ft t q ∫ s+t u |Ft + E σu dW t ∫ s+t 2 q/2 q ≤ Cs + C E σu du |Ft
(27)
DISCR p
1
∫
σu4 du.
→V = 2 0
To prove Proposition 2, proceed in three steps. Prove Vsub − p DISCR p DISCR E V → 0, then E V −E V → 0, and finally E V − p
V DISCR → 0.
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
The first step is to show K p J − Vsub − E V = γk − E γk |F k−1 → 0.
K k=1
k=1
K2
=
n
J 2 m2
q
K
n
q/2 ≤ Cq
1 n
for all q > 0, i = 1, . . . , m, and Cq some constant depending on q only. Hence,
J 2 mJ 2 J2 =C . E γk F k−1 ≤ C K K K n
The first step is thus proved, provided mJ 2 n−1 → 0. The second step is to show
DISCR
K k=1
k
k
m [ n − 2 2 ] 2 × σ m(k−1) 1W (k−1)m+i − 1X (k−1)m+i n n m i=1 n J [ 2 2 ] n− 2 − σ m(k−1) 1W (k−1)m+i − 1X (k−1)m+i n n J i =1 n ≤ Ek A2 Ek B2 .
−
+ Cn−3 n2
= Cn−1/2 J −1 + Cn−1
m i=1
n
n
σ m(k−1) 2
n
2 2 ]2 1W (k−1)m+i − 1X (k−1)m+i n
n
n
2 = Ek σ m(k−1) 1W (k−1)m+i − 1X (k−1)m+i n n n 2 × σ m(k−1) 1W (k−1)m+i + 1X (k−1)m+i n n n 4 ≤ Ek σ m(k−1) 1W (k−1)m+i − 1X (k−1)m+i n n n 4 × Ek σ m(k−1) 1W (k−1)m+i + 1X (k−1)m+i n n n 1
1
n3
n2
= Cn−5/2
n
n
≤ Cn . The first part is the square root of
long
Ek A2 = Ek αk
− αkshort + θklong − θkshort
2
[ m 2 2 ] 2 − 2 = Ek ci σ m + 1X (k−1)m+i (k−1) 1W (k−1)m+i n
i =1
n
≤ C.
n
2 2 1W (k−1)m+i − 1X (k−1)m+i n
[
2 Ek σ m (k−1)
n
m 2 2 n − σ m2 (k−1) 1W (k−1)m+i − 1X (k−1)m+i
J i=1
J
−3
Let ci = n/m − n/J for i = 1, . . . , J. ci = n/m for i = J + 1, . . . , m. The second part is the square root of
n2
|ci | |ci′ |
i=1 i′ =1
[ 2 2 ] Ek σ m2 (k−1) 1W (k−1)m+i − 1X (k−1)m+i n n n [ 2 2 ] × σ m2 (k−1) 1W (k−1)m+i′ − 1X (k−1)m+i′ n n n 2 2 ≤ Ek σ m2 (k−1) 1W (k−1)m+i − 1X (k−1)m+i n n n [ ] 2 2 2 × E σ m(k−1) 1W (k−1)m+i′ − 1X (k−1)m+i′ F (k−1)m+i n n n n ] [ 2 2 ≤ Cn−3/2 Ek σ m2 (k−1) 1W (k−1)m+i − 1X (k−1)m+i
We have
J n−
≤ Cn−5/2
n
m − m −
and, for i < i′ ,
K
E γkDISCR − γk F k−1 K long short αk + θklong − θkshort αk − = E long long αkshort − θkshort F k−1 × αk − θk − K long = Ek α − αkshort + θ long − θkshort
Ek B2 = Ek
ci2 + Cn−3
i=1
≤C
K p J − DISCR −E V = E γk − γk F k−1 → 0.
n
m −
because
n2
E V
≤ Cn−5/2
K
n
E X (k−1)m+i − X (k−1)m+i−1 F k−1
k=1
n
[ 2 2 ] × σ m2 (k−1) 1W (k−1)m+i′ − 1X (k−1)m+i′
for some constant C not depending on k, by repeated use of the Cauchy–Schwarz inequality and
K −
n
n
i=1 i′ =1
n
n
n
[ m − m 2 2 ] − 2 ′ = ci ci Ek σ m(k−1) 1W (k−1)m+i − 1X (k−1)m+i
We have
J2
n
n
i=1
J 2 J 2 n − J2 X (k−1)m+i − X (k−1)m+i−1 E γk F k−1 = 2 E K n n K K J i=1 m 2 4 n − − X (k−1)m+i − X (k−1)m+i−1 F k−1 m i =1
n
[ m 2 2 ]2 − 2 2 = ci Ek σ m(k−1) 1W (k−1)m+i − 1X (k−1)m+i
J 2 p E γk F k−1 → 0. K K
≤C
n
n
i =1
K
By Lenglart’s inequality (see e.g. Podolskij, 2006), it is sufficient to show that K −
275
[ m 2 2 ] 2 − 2 = Ek ci σ m − 1X (k−1)m+i (k−1) 1W (k−1)m+i
n
2
Combining both A and B terms, we obtain
E γkDISCR − γk F k−1 ≤ Cn−1/4 J −1/2 + Cn−1/2 , K
n
276
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
from which the second step
First, by Riemann integrability,
K J − DISCR DISCR −E V ≤ E γk − γk F k−1 E V
K k=1
p
V DISCR → V = 2
K
p DISCR p ′ Vsub − E V → 0, then E V − E V → 0, and finally DISCR DISCR p E V −V → 0.
follows, provided J 2 /n → 0, which is implied by mJ 2 n−1 → 0. Now we prove the third step.
K
J n −
W (k−1)m+i − W (k−1)m+i−1
J i=1
n
n
−
m i =1 2 4
= σ m(k−1)
J
n
The first step is to show
2
n
m
n −
2
W (k−1)m+i − W (k−1)m+i−1 n
2
n
2
− σ m4 (k−1)
m
n
σ m4 (k−1)
K k =1
n
DISCR = Vsub − Op
2 J
−
J
m
K J −
K k=1
By Lenglart’s inequality (see e.g. Podolskij, 2006), it is sufficient to show that
K
Notice that, by the Burkholder–Davis–Gundy inequality, Cauchy–Schwarz inequality, and uniform boundedness of σ , m − m − m − m − 4 Ek−1 θkfast ≤
K
K J −
K
k=1
k=1
K DISCR J − DISCR E V = E γk F k−1
=
K p − ′ Vsub −E V =K γk − E γk F k−1 → 0.
K p − E |K γk |2 F k−1 → 0.
.
Thus,
K k =1
(29)
To prove Proposition 3, use the following three steps. Prove
p
4 E γkDISCR F k−1 = σ m (k−1) E
σu4 du. 0
≤ CJn−1/4 J −1/2 + CJn−1/2 → 0
1
∫
σ m4 (k−1) n
2
m
×
.
4
Proposition 3 is proved for the special case Q = m. The general Q case follows by the same steps, but the notation is more involved. Denote K = ⌊n/m⌋ and ∆δ Xt = Xt − Xt −δ . Introduce the same notation as in Proposition 2. 2σ k4−1
n
n
4
K
K
K m− E γk F k−1 K n k=1
2 αkslow = σ m2 (k−1) ∆ mn W mk n
n
m
slow 2 γk = θk − θkfast
α
fast k
= σ m(k−1) 2
n
n
n
=C
n
1 K4
Ek−1
3 C θkfast ≤ 4,
Ek−1
K
3 C θkslow ≤ 3, K
4 1 θkfast − θkslow ≤C 4 K
and K − 1 E |K γk |2 F k−1 ≤ C = o(1).
K
K
k=1
σu4 du
The second step is to show
where
DISCR
V E
K
2 n − slow ′ Vsub = θk − θkfast
K p − V =K E γkDISCR − γk F k−1 → 0. −E k=1
m k=1
m k=1
n4
n
K
Ek−1 γk2 = Ek−1
n
0
=
m
4 C θkfast ≤ 4, K C fast 2 Ek−1 θk ≤ 2, K C slow 4 Ek−1 θk ≤ 4, K C slow 2 Ek−1 θk ≤ 2.
i=1
1
K n −
n
for some constant C , which does not depend on any of the above parameters. Hence, and by similarity,
From here,
K
∫
≤C
2 − ∆ 1 W i+m(k−1)
slow 2 γkDISCR = αk − αkfast . Also, denote E γk F k−1 by Enk−1 [γk ]. We want to show p ′ →V = 2 Vsub
n
[ 8 ] ∆ 1 X i′′′ +m(k−1)
Ek−1
Ek−1
K DISCR m − DISCR E V = Eγ k F k−1
E V =
n
[ 8 ] 4 × Ek−1 ∆ 1 X i′′ +m(k−1) ×
n k=1
[ 8 ] ∆ 1 X i+m(k−1)
[ 8 ] ∆ 1 X i′ +m(k−1)
Ek−1
4
n k=1
Ek−1
A.3. Proof of Proposition 3
K m−
4
i′′′ =1 i′′ =1 i′ =1 i=1
This proves consistency of the subsampling method for RV, provided mJ 2 n−1 → 0 and σ satisfies A1.
V DISCR =
It is sufficient to show
∆ mn X mk n
2
−
m − i =1
∆ 1 X i+m(k−1) n
n
2
2 .
K − K E γ DISCR − γk → 0. k
k=1
K
(30)
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
Write
277
inf
i−1+m(k−1) ≤u≤ i+m(nk−1) n
K
K
− − αkfast − αkslow + θkfast − θkslow E K E γkDISCR − γk = K
≤ ci,k ≤
k=1
k=1
fast slow × αk − θkfast − αk − θkslow ≡ A + B.
sup i−1+m(k−1) ≤u≤ i+m(nk−1) n
i+m(k−1) n
σ ⌊Ku⌋ − σu
2
K
2 1 σ ⌊Ku⌋ − σu du = ci,k .
i−1+m(k−1) n
K − fast fast A = K E αk − αkslow + θkfast − θkslow αk − θkfast
K
and
∫
As to the first term, we have
σ ⌊Ku⌋ − σu
2
n
K
Notice that
k=1
sup ci,k → 0
K − fast 2 1/2 E αk − αkslow + θkfast − θkslow
≤K
i ,k
by right-continuity and boundedness of σ . Then,
k=1
2 1/2 × E αkfast − θkfast ≤C
∫
A ≤ √ E n k=1 i=1
K − fast 2 1/2 E αk − θkfast k =1
m − −
K m C −−
K
=C
E
k =1
σ m(k−1) ∆ 1 W i+m(k−1) − ∆ 1 X i+m(k−1) n
n
i =1
n
n
× σ m(k−1) ∆ 1 W i+m(k−1) + ∆ 1 X i+m(k−1) n
n
n
n
[ K m − m − − 4
n k=1 i=1
n
E
DISCR
E V
[ 4
×
E
[ E
n
n
n
n
E
n
n
4 ]
n
4
×
n
n
n
=
2 K2
1 K2
n
DISCR
E V
=K
[
4 ] σ m(k−1) ∆ 1 W i+m(k−1) + ∆ 1 X i+m(k−1) n n n n n 2 ∫ i+m(k−1) 2 n C ≤ CE σ m(k−1) + σu du ≤ 2 , n
which follows by Burkholder–Davis–Gundy inequality. To proceed with term A, we use the arguments along the lines of the proof of Lemma 1 of Barndorff-Nielsen and Shephard (2002). For every i and k, there exists a constant ci,k s.t.
K − k=1
n
In the above, to obtain the second inequality, we used (30). To obtain the fourth inequality, we used
n
4
k=1 i=1
Eci2,k
1 n
→0
p
σ m4 (k−1) + op
=K
m
i−1+m(k−1) n
n
K − m −
n
i=1
n
.
Therefore,
n
=
E γkDISCR F k−1
K
K − 2
K
k=1
2 1/4 ∫ i+m(k−1) K m 2 n C −− = √ E σ ⌊Ku⌋ − σu du . i−1+m(k−1) K n k=1 i=1 n
E
=C
n
2 1/4 ∫ i+m(nk−1) 2 C −− ≤ √ E σ m(k−1) − σu du i−1+m(k−1) n n k=1 i=1 n K
]2 1/4
We have
]1/2
n
K
− V DISCR → 0.
n
[ K m 4 ] C −− 4 ≤ √ E σ m(k−1) ∆ 1 W i+m(k−1) − ∆ 1 X i+m(k−1) n k=1 i=1
E ci,k
du
2 1/4
E γkDISCR F k−1 K m 2 − 2 2 4 = σ m(k−1) E ∆ mn W mk − ∆ 1 W i+m(k−1)
n
n
n
n
4 ]
σ m(k−1) ∆ 1 W i′ +m(k−1) + ∆ 1 X i′ +m(k−1) n
4 ]
n
σ m(k−1) ∆ 1 W i′ +m(k−1) − ∆ 1 X i′ +m(k−1)
[ 4
n
n
σ m(k−1) ∆ 1 W i+m(k−1) + ∆ 1 X i+m(k−1)
n
×
n
n
n
4
σ m(k−1) ∆ 1 W i+m(k−1) − ∆ 1 X i+m(k−1)
i′ =1 i=1
k =1
1
σ ⌊Ku⌋ − σu
2
by Monotone Convergence Theorem. B → 0 is proved using exactly the same steps. This proves the second step. The final step is to show
1/2 2
n
i−1+m(k−1) n
[
= √
≤C
i+m(k−1) n
K m C −−
K − 2
K1 k=1
σ4 + op 2 m(k−1)
1 K2
n
σ m4 (k−1) + op (1) = V DISCR + op (1). n
The result follows immediately.
A.4. Proof of Theorem 4 It is convenient to decompose Vlshort into the signal and noise signal
parts, Vlshort = Vl signal
Vl
=
4 3
∫
+ Vlnoise where
[(l−1)m+J]/n
c (l−1)m/n
σu4 du n
− 2 J J Vlnoise = 8c −2 Var (ϵ)2 + 16 c −2 lim Cov ϵ0 , ϵi/n . n→∞ n n i =1 We first state the following lemma (see Appendix A.4.1 for proof).
278
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
In the first step, we show negligibility of the signal part, i.e.,
Lemma 7. Suppose the assumptions of Theorem 4 hold. Then,
K m−
K p 2 m − 1/3 short n θl − θlshort − Vlshort → 0. J l=1
J
K − m
J
l=1
K − m
J
l =1
Vlshort
=
=
J
2 J long short − Vlshort θl − θl
1/3
n
K m−
n1/3 θlshort − θlshort
2
− Vlshort
+ n1/3 θlshort − +n
θ
short l
−
J m J m
θllong
i=1
θ
long l
Sl2
2
n1/3
m J
−
J m
l =1
J θllong − θllong
2 =
m
=
J m
n1/3
+ R,
m
2 − long θl − θllong
→ 0 and
A.4.1. Proof Lemma 7 We now prove Lemma 7 stated in Appendix A.4. We have the following decomposition
J
=
J
n
1/3
K m−
J
−
J G1
+
n
1/3
(G1 )
[X , X ]l
(G2 )
J G2
2
(G2 )
[Y , Y ]l
−θ
short l
=
[ n
1/3
−
(G1 )
[X , X ]l
−θ
−
signal Vl
K m−
J
n1/3 [ϵ, ϵ ](l G1 ) −
l =1
+ R1 + R2 .
J G1 J G2
(G2 )
[ϵ, ϵ ]l
=
] =
l =1
n4/3
J n4/3
G1 ∧i
σu2 du
−
j
1−
2
G1
j =1
J −1 −
G1 ∧i 1 −
σ[4(l−1)m+i−1]/n
J −1 4 G1 −
3 n2 i=1
n2 j=1
σ[4(l−1)m+i−1]/n + op
2
j
1−
G1 J
+ op
J
n4/3
n4/3
J −1 ∫ −
[(l−1)m+i]/n [(l−1)m+i−1]/n
σu2 du
i−1 −
1X[(l−1)m+k]/n
k>r ≥0
The last equality follows from Zhang et al. (2005), p.1410 and the fact that conditions G1 = cn2/3 and J > G1 imply G1 /n < J /n4/3 . Therefore, Sl2
short l
J
n
+ R1 2
G1
+ + i−k i−r × 1X[(l−1)m+r]/n 1 − 1− G1 G1 J = op . 4/3
−
j
1−
and
Vlshort
(G1 )
Vlshort
+ op
2 1X[(l−1)m+i−j]/n
[(l−1)m+i−1]/n
i=1
− θlshort + [ϵ, ϵ ]l
2
[ϵ, ϵ ]l
K m−
J
−
J G1
G1
j =1
2
[(l−1)m+i]/n
J −1 ∫ −
(II ) = 8
l =1
J G2
=
[Y , Y ]l
l =1
=
(G1 )
σu2 du
i=1
l=1 K m−
1−
G ∧ i 1 −
j
2 × 1X[(l−1)m+i−j]/n =4
short 2 n1/3 θl − θlshort − Vlshort
[(l−1)m+i−1]/n
i=1
n
m −
[(l−1)m+i]/n
J −1 ∫ −
(I ) = 4
which follows from Eq. (27). This concludes the proof of Theorem 4.
K
j =1
= (I ) + (II ) + op
2 short J 2m J long E θl − θl ≤ C 3 , m
1 ∧i G−
where
V + op (1) = op (1).
The third term is op (1) by assumption Jmn
1X[(l−1)m+i]/n
× 1X[(l−1)m+i−j]/n
K
−5/3
2
J −1 −
i =1
l =1
J
=
=4
where R contains cross terms that are op (1) if the three main three terms are op (1). The first of these three terms is negligible by Lemma 7. The second term is also negligible by Lemma 7 by taking m instead of J, K
G1
j =1
i =1
2
(32)
J −1 1 ∧i − G− j 1X(l−1)m/n+i/n 1− 1X(l−1)m/n+(i−j)/n
Sl = 2
(31)
where
l =1
1/3
] − Vlsignal = op (1).
where ∆Xi/n = Xi/n − X(i−1)/n . R3 arises due to the end effects, see Zhang et al. (2005), p.1410., and it satisfies R3 = Op G1 n−1 . The second term in (32) satisfies
m
l =1
J
2
= [X , X ](l 1) + Sl + R3
[X , X ]l
K m−
− θlshort
l =1
(G1 )
Vlshort = op (1).
Therefore, to prove Theorem 4 it is sufficient to prove the negligibility of
Vsub −
(G1 )
n1/3 [X , X ]l
For this, we adapt the arguments of Zhang et al. (2005) to the subsample. We have
We conclude from Eq. (27) that V−
[
2
4 G1
∫
3 n 1 n1/3
[(l−1)m+J]/n
(l−1)m/n signal
Vl
+ op
σ
4 u du
J n4/3
+ op
J
n4/3
.
− Vlnoise
The final piece in (32) to deal with is to show n1/3
K 2 m − (1) [X , X ]l − θlshort = op (1), J l=1
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
which follows by following (a simpler version of) the steps of the proof of Proposition 2. Eq. (31) follows. Next, we turn to the noise part and prove K m−
J
(G1 )
n1/3 [ϵ, ϵ ]l
J G1
−
(G2 )
[ϵ, ϵ ]l
J G2
l =1
−
Vlnoise
p → 0.
(33)
Given that noise is a discrete time process, Proposition 1 of AïtSahalia et al. (2011) can be applied directly, with J instead of n (this is the number of observations used above) to obtain, for each l,
G1
(G1 )
[ϵ, ϵ ]l
√
J
J G1
−
(G2 )
⇒ N 0, 8Var (ϵ)2 + 16 lim
n→∞
K 1 − G21
(G1 )
[ϵ, ϵ ]l
K l =1 J
J G1
−
J G2
p
→ 8Var (ϵ)2 + 16 lim
n −
2 Cov ϵ0 , ϵi/n .
n→∞
i=1
(G2 )
2
Cov ϵ0 , ϵi/n
2
=
n
i=1
J
+
J
n1/3
[
(G1 )
2 [X , ϵ ]l
2 ]
[X , ϵ ]l
J G2
(Gi )
[X , ϵ ]l
2
|X
≤C
1 G2i
(Gi )
[X , X ]l
n−G 1 − 1 −2 −2 √ ϵi/n ϵ(i+G1 )/n + 2 √ ϵi/n ϵ(i+G2 )/n ,
n i =0 n i=0 see page 26 of Aït-Sahalia et al. (2011). Given that G1 /G2 → 0 and n
−ω
n l =1 J n/m m−n
n l =1 J
i
n
n
n/m short 2 m − n short p τn2 Vl θl →0 − θlshort −
n l =1 J
short 2 p τn2 θl − θlshort → V
−
=
≤C
due to differentiability of ω, the desired result follows.
n/m Jmτn2 −
n2
n2
Jmτn2 n2
+2
τ
2 n
×
m
n2 n m
J
n m
θ
m
long l
n
− θ J
short l
2
n θllong − θllong m
n
θllong − θlshort
J
Jmτ − n2 n
2
m
l =1
2 n/m n
m
n θlshort − θlshort
2 n θllong − θllong
K − l =1
n J
K Jmτn2 − n
−2
l =1
l =1
m
K Jmτn2 − n
G1 n
J
l =1
×
n −G 1
ω
n/m m−n
+
Most of the proof of the asymptotic distribution of the TSRV estimator of Aït-Sahalia et al. (2011) remains valid under the assumptions of Lemma 5. The noise component of the asymptotic distribution arises from the asymptotic distribution of
i + G1
n/m p m − (n) ζl − E ζl(n) → 0. n l=1
2 K Jmτn2 − n short n long θ − θ Vsub − G(n) = l l 2
.
A.5. Proof of Lemma 5
is strong mixing because R(n) is.
p
The final terms R′1 and R2 contain cross terms that are negligible by Cauchy–Schwarz inequality.
(n)
In a second step, we prove that G(n) − Vsub → 0.
(G )
(34)
and so (34) follows.
n −G 1 −1 = X(i+G1 )/n − Xi/n ϵ(i+G1 )/n − ϵi/n . G1 i=1
The first term in R1 is op (1) because [X , X ]l 2 = Op Jn−1 by substituting G2 for G1 in (31). The second and third terms are of op (1) by proof of Lemma 1 of Aït-Sahalia et al. (2011), which implies, for i = 1, 2, E
p
→ V.
By A4, we have
=
where, for i = 1, 2, (Gi )
ψi(n) =
=
2 K J G1 m − 1/3 (G ) + R′1 + 2 [X , ϵ ]l 2 n l =1
J
(n)
l =1
J
2
n/m n/m p 2 m − (n) m − n 2 short ζl = τn θl − θlshort − Vlshort → 0 n l=1 n l =1 J
J G2
K m−
J
n θlshort − θlshort
it is also strong mixing. Therefore, under A6, ψi is a uniformly integrable L1 -mixingale as defined in Andrews (1998), to which we can apply Theorem 2 of Andrews (1998) to obtain
n l=1
2 K J G1 m − 1/3 (G ) [X , X ]l 2 R1 = n l =1
n
(n) For any two subsamples l and l′ s.t. l ̸= l′ , ζl has no common
n/m m−
c 2 Vlnoise ,
which is equivalent to Eq. (33) given that K = n/m and G1 = cn2/3 . The final step to prove Lemma 7 is to show R1 + R2 = op (1). We have
J
n2 l = 1
τn2
ψi(n) = ζl(n) − E ζl(n) ,
[ϵ, ϵ ]l
n −
n/m mJ −
returns with ζl′ . Therefore, ζl Moreover, if we define
Since noise is mixing over subsamples, we can apply the law of large numbers to obtain
G(n) =
(n)
[ϵ, ϵ ]l
J G2
A.6. Proof of Theorem 6
Assume n is divisible by m by simplicity. As a first step, we prove
2
279
τ
2 n
l =1
n J
n θlshort − θlshort
J
n θllong − θlshort . J
(35)
280
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
Fig. 6. 95% Confidence Intervals (CI’s) for weekly IV , for each of 52 weeks in 2006, calculated using the subsampling method (CI’s with bars) or Vb (CI’s with lines). TSRV is the middle of all CI’s by construction.
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
281
Table 2 Coverage probabilities of 95% confidence interval of IV , λ = 0.0001. J = 200
J = 500
Va
m
800
2000
3000
2000
5000
7500
ρ=0
Two-sided Left-sided Right-sided
0.97 0.95 0.98
0.98 0.96 0.98
0.98 0.96 0.99
0.93 0.92 0.96
0.95 0.94 0.97
0.95 0.94 0.97
0.92 0.91 0.95
ρ = −0.3
Two-sided Left-sided Right-sided
0.97 0.95 0.98
0.98 0.97 0.98
0.98 0.97 0.98
0.93 0.92 0.96
0.96 0.94 0.97
0.96 0.94 0.97
0.92 0.91 0.96
ρ = −0.5
Two-sided Left-sided Right-sided
0.97 0.95 0.97
0.98 0.97 0.98
0.98 0.97 0.98
0.93 0.92 0.95
0.95 0.94 0.96
0.95 0.94 0.96
0.92 0.92 0.94
ρ = −0.7
Two-sided Left-sided Right-sided
0.97 0.95 0.98
0.98 0.97 0.98
0.98 0.97 0.98
0.94 0.92 0.96
0.96 0.94 0.97
0.96 0.94 0.97
0.91 0.9 0.96
Table 3 Coverage probabilities of 95% confidence interval of IV , λ = 0.001. J = 200
J = 500
Va
m
800
2000
3000
2000
5000
7500
ρ=0
Two-sided Left-sided Right-sided
0.97 0.95 0.98
0.98 0.96 0.99
0.98 0.96 0.99
0.93 0.92 0.96
0.95 0.93 0.97
0.95 0.94 0.97
0.92 0.91 0.95
ρ = −0.3
Two-sided Left-sided Right-sided
0.97 0.95 0.98
0.97 0.96 0.98
0.98 0.97 0.98
0.93 0.91 0.96
0.96 0.94 0.97
0.96 0.94 0.98
0.91 0.9 0.96
ρ = −0.5
Two-sided Left-sided Right-sided
0.97 0.95 0.98
0.97 0.96 0.98
0.98 0.97 0.98
0.94 0.91 0.96
0.96 0.94 0.97
0.96 0.94 0.98
0.9 0.89 0.95
ρ = −0.7
Two-sided Left-sided Right-sided
0.98 0.96 0.97
0.98 0.97 0.98
0.98 0.97 0.98
0.94 0.93 0.95
0.96 0.94 0.96
0.96 0.95 0.96
0.88 0.9 0.91
We have the following decomposition,
≤C
n J
n
θlshort − n J
(l−1)m/n [(l−1)m+J]/n
∫ n
≤
θllong
J
(l−1)m/n
+
∫
n
lm/n
m (l−1)m/n
g (u)du −
n
∫
lm/n
m (l−1)m/n
g (u)du
(g (u) − g ((l − 1) m/n)) du
(g (u) − g ((l − 1) m/n)) du
2
2
2
∫ [(l−1)m+J]/n n +2 (g (u) − g ((l − 1) m/n)) du J (l−1)m/n ∫ lm/n n × (g (u) − g ((l − 1) m/n)) du . m (l−1)m/n
These terms are small enough due to A4 and (27) as follows,
2 ∫ lm/n K Jmτ 2 − n n E 2 (f (u) − f ((l − 1) m/n)) du n l=1 m (l−1)m/n ∫ lm/n 2 K Jmτ 2 − n ≤ 2n E (f (u) − f ((l − 1) m/n)) du n
m (l−1)m/n
l =1
Jmτn2 − K
=
≤C
n2
E (f (sl ) − f ((l − 1) m/n))2
l =1
K Jmτ − 2 n
n2
l =1
n2
2
[(l−1)m+J]/n
∫ =
m
K Jmτn2 − m l =1
n
=C
Jmτn2 n2
→0
by assumption. In the above, the first equality follows by the mean value theorem, which applies by differentiability of t
∫
(l−1)m/n
(f (u) − f ((l − 1) m/n)) du
(37)
in time. Next, we show K Jmτn2 − n
n2
l =1
m
2 p n θllong − θllong → 0. m
By substituting m for J in G(n) =
n/m mJ −
n2 l = 1
τn2
n J
n θlshort − θlshort
2
J
p
→ V,
we obtain n/m m2 −
n2
l =1
τn2
n m
2 p n θllong − θllong → V , m
and so by multiplying the left hand side by J /m, (37) follows since J /m → 0. The remaining cross-terms in (35) are negligible by the above results and Cauchy–Schwarz inequality. This concludes the proof of Theorem 6. Appendix B. Tables and figures
E (σ (sl ) − σ ((l − 1) m/n))
(36)
2
See Tables 2–6.
282
I. Kalnina / Journal of Econometrics 161 (2011) 262–283
Table 4 Coverage probabilities of 95% confidence interval of IV X , λ = 0.01. J = 200
J = 500
Va
m
800
2000
3000
2000
5000
7500
ρ=0
Two-sided Left-sided Right-sided
0.97 0.95 0.98
0.98 0.96 0.98
0.98 0.97 0.98
0.94 0.92 0.97
0.96 0.94 0.98
0.96 0.94 0.98
0.92 0.9 0.96
ρ = −0.3
Two-sided Left-sided Right-sided
0.97 0.96 0.97
0.98 0.97 0.98
0.98 0.97 0.98
0.93 0.94 0.94
0.96 0.95 0.96
0.96 0.95 0.96
0.82 0.85 0.88
ρ = −0.5
Two-sided Left-sided Right-sided
0.98 0.95 0.97
0.98 0.96 0.98
0.98 0.96 0.98
0.94 0.93 0.96
0.96 0.94 0.97
0.96 0.94 0.97
0.7 0.8 0.84
ρ = −0.7
Two-sided Left-sided Right-sided
0.96 0.94 0.97
0.97 0.95 0.98
0.98 0.95 0.98
0.94 0.92 0.96
0.96 0.94 0.97
0.95 0.94 0.97
0.77 0.83 0.84
Table 5 Heteroscedastic noise. Coverage probabilities of 95% confidence interval of IV . J = 200
J = 500
Va
m
800
2000
3000
2000
5000
7500
λ = 0.0001
Two-sided Left-sided Right-sided
0.96 0.95 0.98
0.98 0.96 0.99
0.98 0.96 0.99
0.93 0.91 0.97
0.95 0.93 0.97
0.96 0.93 0.98
0.94 0.92 0.97
λ = 0.001
Two-sided Left-sided Right-sided
0.97 0.95 0.97
0.98 0.96 0.99
0.98 0.96 0.99
0.93 0.92 0.95
0.95 0.93 0.96
0.96 0.94 0.97
0.94 0.93 0.96
λ = 0.01
Two-sided Left-sided Right-sided
0.96 0.93 0.99
0.97 0.94 0.99
0.98 0.95 0.99
0.93 0.90 0.98
0.94 0.91 0.98
0.95 0.91 0.98
0.97 0.94 0.98
Table 6 Summary of data manipulations.
MMM MSFT IBM AIG GE INTC
Raw data
Step 1: flat trading
Step 2: jumps
Step 3: gradual jumps
1,797,107 18,738,034 2,786,649 2,807,065 7,288,596 21,155,095
983,705 (54.74%) 16,364,458 (87.33%) 1,556,475 (55.85%) 1,749,345 (62.32%) 5,449,832 (74.77%) 18,498,295 (87.44%)
2567 (0.14%) 5563 (0.03%) 3706 (0.13%) 3179 (0.11%) 3707 (0.05%) 5794 (0.03%)
5,963 (0.33%) 18,795 (0.10%) 7,525 (0.27%) 10,433 (0.37%) 12,991 (0.18%) 21,119 (0.10%)
References Aït-Sahalia, Y., Jacod, J., 2009. Testing for jumps in a discretely observed process. Annals of Statistics 37, 184–222. Aït-Sahalia, Y., Mykland, P., Zhang, L., 2011. Ultra high frequency volatility estimation with dependent microstructure noise. Journal of Econometrics 160, 190–203. Aldous, D.G., Eagleson, G.K., 1978. On mixing and stability of limit theorems. The Annals of Probability 6, 325–331. Andersen, T.G., Bollerslev, T., 1997. Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance 5, 115–158. Andersen, T.G., Bollerslev, T., 1998. Answering the skeptics: yes, standard volatility models do provide accurate forecasts. International Economic Review 39 (4), 885–905. Andersen, T.G., Bollerslev, T., Diebold, F.X., Labys, P., 2001. The distribution of exchange rate volatility. Journal of the American Statistical Association 96, 42–55. Correction published in 2003, volume 98, page 501. Andreou, E., Ghysels, E., 2002. Rolling-sample volatility estimators: some new theoretical, simulations and empirical results. Journal of Business and Economic Statistics 20, 363–375. Andrews, D.W.K., 1998. Laws of large numbers for dependent non-identically distributed random variables. Econometric Theory 4, 458–467. Bandi, F.M., Russell, J.R., 2006. Separating microstructure noise from volatility. Journal of Financial Economics 79, 655–692. Bandi, F.M., Russell, J.R., 2008. Microstructure noise, realized variance, and optimal sampling. Review of Economic Studies 75, 339–369. Barndorff-Nielsen, O.E., Shephard, N., Winkel, M., 2005. Limit theorems for multipower variation in the presence of jumps. Stochastic Processes and Applications 116, 796–806. Barndorff-Nielsen, O.E., Graversen, S.E., Jacod, J., Shephard, N., 2006. Limit theorems for bipower variation in financial econometrics. Econometric Theory 22, 677–719.
Barndorff-Nielsen, O.E., Hansen, P.R., Lunde, A., Shephard, N., 2008. Designing realised kernels to measure the ex-post variation of equity prices in the presence of noise. Econometrica 76 (6), 1481–1536. Barndorff-Nielsen, O.E., Hansen, P.R., Lunde, A., Shephard, N., 2009. Realised kernels in practice: trades and quotes. Econometrics Journal 12, C1–C32. Barndorff-Nielsen, O.E., Shephard, N., 2002. Econometric analysis of realised volatility and its use in estimating stochastic volatility models. Journal of the Royal Statistical Society. Series B 64, 253–280. Christensen, K., Oomen, R.C.A., Podolskij, M., 2010. Realised quantile-based estimation of the integrated variance. Journal of Econometrics 159, 74–98. Christensen, K., Podolskij, M., Vetter, M., 2009. Bias-correcting the realized rangebased variance in the presence of market microstructure noise. Finance and Stochastics 13, 239–268. Delbaen, F., Schachermayer, W., 1994. A general version of the fundamental theorem of asset pricing. Mathematische Annalen 300 (3), 463–520. Foster, D.P., Nelson, D.B., 1996. Continuous record asymptotics for rolling sample variance estimators. Econometrica 64, 139–174. Fleming, J., Kirby, C., Ostdiek, B., 2003. The economic value of volatility timing using ‘‘realized’’ volatility. Journal of Financial Economics 67, 473–509. Gerety, M.S., Mulherin, H., 1994. Price formation on stock exchanges: the evolution of trading within the day. Review of Financial Studies 6, 23–56. Giot, P., Laurent, S., 2004. Modelling daily value-at-risk using realized volatility and ARCH type models. Journal of Empirical Finance 11, 379–398. Gonçalves, S., Meddahi, N., 2009. Bootstrapping realized volatility. Econometrica 77, 283–306. Griffin, J., Oomen, R., 2008. Sampling returns for realized variance calculations: tick time or transaction time? Econometric Reviews 27, 230–253. Hansen, P.R., Lunde, A., 2006. Realized variance and market microstructure noise (with comments and rejoinder). Journal of Business and Economic Statistics 24, 127–218. Harris, L., 1986. A transaction data study of weekly and intradaily patterns in stock returns. Journal of Financial Economics 16 (1), 99–117.
I. Kalnina / Journal of Econometrics 161 (2011) 262–283 Heston, S., 1993. A closed-form solution for options with stochastic volatility with applications to bonds and currency options. Review of Financial Studies 6, 327–343. Jacod, J., 2007. Statistics and high frequency data. Lecture notes. Séminaire Européen de Statistique 2007. Jacod, J., 2008. Asymptotic properties of realized power variations and related functionals of semimartingales. Stochastic Processes and their Applications 118, 517–559. Jacod, J., Li, Y., Mykland, P.A., Podolskij, M., Vetter, M., 2009. Microstructure noise in the continuous case: the pre-averaging approach. Stochastic Processes and their Applications 119, 2249–2276. Jacod, J., Shiryaev, A.N., 2003. Limit Theorems for Stochastic Processes. Springer. Kalnina, I., Linton, O.B., 2008. Estimating quadratic variation consistently in the presence of correlated measurement error. Journal of Econometrics 147, 47–59. Kleidon, A., Werner, I., 1996. UK and US trading of British cross-listed stocks: an intraday analysis of market integration. Review of Financial Studies 9, 619–644. Kristensen, D., 2010. Nonparametric filtering of the realised spot volatility: a kernelbased approach. Econometric Theory 26, 60–93. Lahiri, S.N., 1996. On inconsistency of estimators based on spatial data under infill asymptotics. Sankya: The Indial Journal of Statistics, Series A 58, 403–417. Lahiri, S.N., Kaiser, M.S., Cressie, N., Hsu, N., 1999. Prediction of spatial cumulative distribution functions using subsampling. Journal of the American Statistical Association 94, 86–97. Lockwood, L.J., Linn, S.C., 1990. An examination of stock market return volatility during overnight and intraday periods, 1964–1989. Journal of Finance 45, 591–601. Mancini, C., 2004. Estimation of the characteristics of the jumps of a general Poissondiffusion model. Scandinavian Actuarial Journal 2004 (1), 42–52.
283
McInish, T.H., Wood, R.A., 1992. An analysis of intraday patterns in bid/ask spreads for NYSE stocks. Journal of Finance 47, 753–764. Meddahi, N., Mykland, P., 2010. Fat tails or many small jumps? The near-diffusion paradigm. Work in Progress. Mikosch, T., Starica, C., 2003. Stock market risk-return inference. An unconditional, non-parametric approach. Available at SSRN: http://ssrn.com/abstract=882820. Mykland, P., Zhang, L., 2009. Inference for continuous semimartingales observed at high frequency: a general approach. Econometrica 77, 1403–1445. Podolskij, M., 2006. New theory on estimation of integrated volatility with applications. Ph.D. Thesis. Bochum University. Podolskij, M., Ziggel, D., 2007. Boostrapping bipower variation. Technical Report. Ruhr-University of Bochum. Politis, D.N., Romano, J.P., 1994. Large sample confidence regions based on subsamples under minimal assumptions. Annals of Statistics 22, 2031–2050. Politis, D.N., Romano, J.P., Wolf, M., 1997. Subsampling for heteroscedastic time series. Journal of Econometrics 81, 281–317. Politis, D.N., Romano, J.P., Wolf, M., 1999. Subsampling. Springer-Verlag, New York. Revuz, D., Yor, M., 2005. Continuous Martingales and Brownian Motion. SpringerVerlag, New York. Zhang, L., 2006. Efficient estimation of stochastic volatility using noisy observations: a multi-scale approach. Bernoulli 12 (6), 1019–1043. Zhang, L., Mykland, P., Aït-Sahalia, Y., 2005. A tale of two time scales: determining integrated volatility with noisy high-frequency data. Journal of the American Statistical Association 100, 1394–1411. Zhou, B., 1996. High-frequency data and volatility in foreign-exchange rates. Journal of Business and Economic Statistics 14, 45–52.
Journal of Econometrics 161 (2011) 284–303
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Data-based ranking of realised volatility estimators Andrew J. Patton ∗ Department of Economics, Duke University, USA Oxford-Man Institute of Quantitative Finance, University of Oxford, UK
article
info
Article history: Received 31 March 2009 Received in revised form 16 December 2010 Accepted 23 December 2010 Available online 29 December 2010 JEL classification: C52 C22 C53
abstract This paper presents new methods for comparing the accuracy of estimators of the quadratic variation of a price process. I provide conditions under which the relative accuracy of competing estimators can be consistently estimated (as T → ∞), and show that forecast evaluation tests may be adapted to the problem of ranking these estimators. The proposed methods avoid making specific assumptions about microstructure noise, and facilitate comparisons of estimators that would be difficult using methods from the extant literature, such as those based on different sampling schemes. An application to high frequency IBM data between 1996 and 2007 illustrates the new methods. © 2010 Elsevier B.V. All rights reserved.
Keywords: Realized variance Volatility forecasting Forecast comparison
1. Introduction The past decade has seen an explosion in research on volatility measurement, as distinct from volatility forecasting.1 This research has focused on constructing non-parametric estimators of price variability over some horizon (for example, one day) using data sampled at a shorter horizons (for example, every 5 minutes or every 30 seconds). These ‘‘realised volatility’’ (RV) estimators or ‘‘realised measures’’ generally aim at measuring the quadratic variation or integrated variance of the log-price process of some asset or collection of assets. This profusion of research has lead to a need for some practical guidance on which RV estimator to select for a given empirical analysis. In addition to the particular estimator to use, the performance of RV estimators is generally affected by the frequency used to sample the price process (for example, every 5 minutes or every 30 seconds), see Zhou (1996) and Bandi and Russell (2008) for example, and may also be affected by the decision to sample in calendar time or in ‘‘tick time’’ (for example,
∗ Corresponding address: Department of Economics, Duke University, 213 Social Sciences Building, Box 90097, Durham, NC 27708-0097, USA. Tel.: +1 919 660 1849. E-mail address:
[email protected]. 1 See Andersen and Bollerslev (1998), Andersen et al. (2001a, 2003), BarndorffNielsen and Shephard (2002, 2004a), Aït-Sahalia et al. (2005), Zhang et al. (2005), Hansen and Lunde (2006a), Christensen and Podolskij (2007), and BarndorffNielsen et al. (2008) amongst many others. Andersen et al. (2006) and BarndorffNielsen and Shephard (2007) present recent surveys of this burgeoning field. 0304-4076/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2010.12.010
every r minutes or every s trades), and the decision to use prices from transactions or from quotes, see Bandi and Russell (2006b), Hansen and Lunde (2006a) and Oomen (2006). This paper provides new methods for comparing RV estimators, which complement the approaches currently in the literature (discussed further below). Denoting the latent quadratic variation of the process over some interval of time (for example, one day) as θt , estimators of this quantity as Xit , and a distance measure as L, the primary theoretical contribution of this paper is to provide methods to consistently estimate: E [1L (θt , Xt )] ≡ E [L (θt , Xit )] − E L θt , Xjt
.
(1)
The latent nature of θt makes estimating E [1L (θt , Xt )] more difficult than in standard forecasting applications, as we cannot employ the sample mean of the loss differences as an estimator. Further, the fact that the estimators Xit usually use data from the same period over which θt is measured makes this problem distinct from (and more difficult than) volatility forecasting applications. With an estimator of E [1L (θt , Xt )] in hand, it is possible to employ one of the many tests from the literature on forecast evaluation and comparison, such as Diebold and Mariano (1995) and West (1996) for pair-wise comparisons, White (2000), Hansen (2005), Hansen et al. (forthcoming) and Romano and Wolf (2005) for comparisons involving a large number of RV estimators, and Giacomini and White (2006) for conditional comparisons of RV estimators. These tests rely on standard large sample asymptotics (T → ∞) rather than continuous-record asymptotics (m → ∞), and thus can be used to compare the ‘‘finite m’’ performance
A.J. Patton / Journal of Econometrics 161 (2011) 284–303
of different estimators. I provide conditions under which these tests can be applied to the problem of ranking RV estimators. The proposed methods rely on the existence of a volatility proxy that is unbiased for the latent target variable, θt , and satisfies an uncorrelatedness condition, described in detail below. This proxy must be unbiased but it does not need to be very precise; a simple and widely-available proxy is the daily squared return, for example. Previous research on the selection of estimators of quadratic variation has predominantly focused on finding the sampling frequency that maximises the accuracy of a given estimator. Consider the simplest RV estimator: (m)
RVt
≡
m −
pτi − pτi−1
2
(2)
i=1
where pτi is the log-price at time τi , {τ0 , τ1 , . . . , τm } are the times at which the price of the asset is available during period t, and m is the number of intra-period observations used in computing the estimator. In the absence of market microstructure effects, the distribution theory for the simplest RV estimator would suggest sampling prices as often as possible, see Andersen et al. (2001a) for example, as the asymptotic variance of the estimator in this case declines uniformly as m → ∞. In practice, however, the presence of autocorrelation in very high frequency prices leads the standard RV estimator to become severely biased,2 and several papers have attempted to address this problem.3 While the methods of these papers differ, they have in common their use of continuous-record asymptotics in their derivations, the use of mean squared error (MSE) as the measure of accuracy, and, importantly, generally quite specific assumptions about the noise process.4 , 5 In contrast to the theoretical studies of the optimal sampling frequency cited above, the data-based methods proposed in this paper allow one to avoid taking a stand on some important properties of the price process. In particular, the proposed approach allows for microstructure noise that may be correlated with the efficient price process and/or heteroskedastic, cf. Hansen and Lunde (2006a), Kalnina and Linton (2008), and Bandi et al.
2 Early research in this area, see Zhou (1996) and Andersen et al. (2000), employed ‘‘volatility signature plots’’ to show graphically that at very high frequencies, features such as bid–ask bounce and stale prices can lead to large biases in simple RV estimators. More sophisticated estimators, such as the twoscale estimator of Zhang et al. (2006) and the realised kernel estimator of BarndorffNielsen et al. (2008) provide consistent estimates of quadratic variation, under some conditions, by taking these autocorrelations into account in the construction of the estimator. 3 Assuming i.i.d. noise and intra-daily homoskedasticity, Zhou (1996) derives the MSE-optimal sampling frequency (or, equivalently, optimal choice of m) for the RVAC1 estimator, which adjusts the standard RV estimator to account for autocovariances up to order 1; Aït-Sahalia et al. (2005) derive the MSE-optimal choice of m for the standard RV estimator under a variety of cases (i.i.d. noise, serially correlated noise, and noise correlated with the efficient price); Andersen et al. (2011b) derive the MSE-optimal choice of m for the RV ACq estimator, the realised kernel estimator of Barndorff-Nielsen et al. (2008) and the two-scale estimator of Zhang et al. (2005), under the assumption of i.i.d. noise; Hansen and Lunde (2006a) derive the MSE-optimal choice of m for RV ACq estimators assuming i.i.d. noise; Bandi and Russell (2006a, 2011) derive the optimal choice of the m for standard RV, and the optimal ratio of q/m for RV ACq estimators using m intra-daily observations, under the assumption of i.i.d. noise; Bandi et al. (2007) consider the optimal choice of m when the noise process is conditionally mean zero but potentially heteroskedastic; and Barndorff-Nielsen et al. (2008) examine the optimal sampling frequency and number of lags to use with a variety of realised kernel estimators, under the assumption of i.i.d. noise. 4 Gatheral and Oomen (2010) provide an alternative analysis of the problem of choosing an RV estimator via a detailed simulation study. 5 It should be noted that several of these papers derive the asymptotic distribution of their estimators, as m → ∞, under weaker assumptions on the noise than are required to derive optimal sampling frequencies.
285
(2007). Further, this approach avoids the need to estimate quantities such as the integrated quarticity and the variance of the noise process, which often enter formulas for the optimal sampling frequency, see Andersen et al. (2011b) and Bandi and Russell (2008) for example, and which can be difficult to estimate in practice. This approach does, however, require some assumptions about the time series properties of the variables under analysis (e.g., stationarity of certain functions of variables), which are not required in most of the existing literature, and so the proposed tests complement, rather than substitute, existing methods; they provide an alternate approach to addressing the same important problem. The data-based methods proposed in this paper also allow for comparisons of estimators of quadratic variation that would be difficult using existing theoretical methods in the literature. For example, theoretical comparisons of estimators using quote prices versus trade prices require assumptions about the behaviour of market participants: the arrival rate of trades, the placing and removing of limit and market orders, etc., and theoretical comparisons may be sensitive to these assumptions. Likewise, theoretical comparisons of tick-time and calendar-time sampling requires assumptions on the arrival rate of trades. Finally, the methods of this paper make it possible to compare estimators based on quite different assumptions about the price process, such as the ‘‘alternation’’ estimator of Large (2011) which is based on the assumption that the price process moves in steps of at most one tick, versus, for example, the multi-scale estimator of Zhang (2006), which is based on a quite different set of assumptions. The methods for comparing the accuracy of RV estimators proposed in this paper complement recent work comparing the accuracy of forecasts based on these estimators, see Andersen et al. (2003), Aït-Sahalia and Mancini (2008), and Ghysels and Sinko (2011), among others. If the forecasting model in which the estimator will be used is known by the econometrician, then rankings of RV estimators by their forecast performance are likely of primary interest. However, if the forecasting model is not known by the econometrician, or if the end-use of the RV estimator is unknown more generally (e.g., it may be used in pricing derivatives, risk management, portfolio decisions etc.) then a measure of its estimation accuracy may be of interest as a general gauge of its quality as a proxy for the true, latent, volatility. Of course, the methods proposed in this paper may be combined with measures of forecast accuracy to obtain overall rankings of RV estimators. The main empirical contribution of this paper comes from a study of the problem of estimating the daily quadratic variation of IBM equity prices, using high frequency data over the period January 1996 to June 2007. I consider simple realised variance estimators based on either quote or trade prices, sampled in either calendar time or in tick time, for many different sampling frequencies. Also studied are four more sophisticated estimators of QV. I find that Romano and Wolf (2005) tests clearly reject the squared daily return in favour of an RV estimator using higher frequency data, and corresponding tests also indicate that there are significant gains to moving beyond the rule-of-thumb of using 5-min calendar-time RV: estimators based on data sampled at between 15 s and 2 min are significantly more accurate than 5-min RV. 
I also find that some of the more sophisticated estimators of QV proposed in the literature significantly outperform 5-min RV, particularly in the latest sub-sample. In general, I find that using tick-time sampling leads to better estimators than using calendar-time sampling, particularly when trade arrivals are very irregularly-spaced. I also find that quote prices are significantly less accurate that trade prices in the early part of the sample, but this difference disappears in the most recent subsample.
286
A.J. Patton / Journal of Econometrics 161 (2011) 284–303
The remainder of the paper is structured as follows. Section 2 presents the main theoretical results of this paper, Section 3 presents a simulation study of the proposed new methods, and Section 4 presents an application using high frequency quote and trade data on IBM over the period January 1996–June 2007. Section 5 concludes, and all proofs are collected in the Appendix. 2. Data-based ranking of RV estimators 2.1. Notation and background The target variable, generally quadratic variation (QV) or integrated variance6 (IV), is denoted θt . I assume that θt is Ft measurable, where Ft is the information set generated by the complete path of the log-price process. For the remainder of the paper I assume that θt is a scalar; I discuss the extension to vector (or matrix) target variables in the conclusion. The estimators of θt are denoted Xit , i = 1, 2, . . . , k. Often these will be the same estimator applied to data sampled at different frequencies, for example 1-min returns versus 30-min returns, though they could also be RV estimators based on different functional forms, different sampling schemes, etc. In order to rank the competing estimators we need some measure of distance from the estimator, Xit , to the target variable, θt . Two popular (pseudo-)distance measures in the volatility literature are MSE and QLIKE: MSE QLIKE
L (θ , X ) = (θ − X )2 L (θ , X ) =
θ X
− log
estimators. Hansen and Lunde (2006b) and Patton (2011) show that rankings of volatility forecasts using a ‘‘robust’’ loss function and a conditionally unbiased volatility proxy are asymptotically equivalent to rankings using the true latent target variable—this is stated formally in part (a) of the proposition below. Part (b) shows that this result does not hold for rankings of volatility estimators, due to a critical change in the time at which they are observable. In a slight abuse of notation, the proposition below uses θt to denote conditional variance in part (a) and quadratic variation in part (b). Proposition 1. Let θt be the latent scalar quantity of interest, let Ft be the information set generated by the complete path of the log-price process until time t, and let F˜t ⊂ Ft be the information set available to the econometrician at time t. Let (X1t , X2t ) be two estimators of θt , and let θ˜t be the proxy for θt . (a) [Volatility ] If θt ∈ Ft −1 , (X1t , X2t ) ∈ F˜t −1 , θ˜t ∈ F˜t forecasting
and E θ˜t |Ft −1 = θt , and if L is a member of the class of distance measures in Eq. (5), then E [L (θt , X1t )] Q E [L (θt , X2t )]
⇔ E L θ˜t , X1t Q E L θ˜t , X2t . (b) [Volatility estimation ] If θt ∈ Ft , (X1t , X2t ) ∈ F˜t , θ˜t ∈ F˜t and E θ˜t |Ft −1 , θt = θt , and if L is a member of the class of distance measures in Eq. (5), then
(3)
θ X
− 1.
The definition of QLIKE above has been normalised to yield a distance of zero when θ = X . The methods below apply to rankings of RV estimators using the general class of ‘‘robust’’ pseudodistance measures proposed in Patton (2011), which nests MSE and QLIKE as special cases: L (θ, X ) = C˜ (X ) − C˜ (θ ) + C (X ) (θ − X )
(5)
with C being some function that is decreasing and twicedifferentiable function on the supports of both arguments of this function, and where C˜ is the anti-derivative of C . In this class each pseudo-distance measure L is completely determined by the choice of C . MSE and QLIKE are obtained (up to location and scale constants) when C (z ) = −z and C (z ) = 1/z respectively. For the remainder of the paper I will use the following notation to describe the (k − 1 vector of) differences in the distances from the target variable to a collection of RV estimators:
1L (·, Xt ) ≡ [L (·, X1t ) − L (·, X2t ) , . . . , L (·, X1t ) − L (·, Xkt )]′
E [L (θt , X1t )] Q E [L (θt , X2t )]
(4)
(6)
where Xt ≡ [X1t , . . . , Xkt ]′ . Throughout, variables denoted with a ‘∗ ’ below are the bootstrap samples of the original variables obtained from the stationary bootstrap, P is the original probability measure, and P ∗ is the probability measure induced by the bootstrap conditional on the original data. 2.2. Ranking volatility forecasts versus ranking RV estimators Ranking volatility forecasts, as opposed to estimators, has received a lot of attention in the econometrics literature, see Poon and Granger (2003) and Hansen and Lunde (2005) for two recent and comprehensive studies, and this is the natural starting point for considering the ranking of realised volatility
6 Broadly stated, the quadratic variation of a process coincides with its integrated variance if the process does not exhibit jumps, see Barndorff-Nielsen and Shephard (2007) for example.
< E L θ˜t , X1t
Q E L θ˜t , X2t
.
All proofs are presented in the Appendix. The reason the equivalence holds in part (a) but fails in part (b) is that estimation error in (X1t , X2t ) will generally be correlated with the error in θ˜t in the latter case. This means that the ranking of RV estimators needs to be treated differently to the ranking of volatility forecasts, and it is to this that we now turn. 2.3. Ranking RV estimators In this section we obtain methods to consistently estimate the difference in average accuracy of competing estimators of quadratic variation, E [1L (θt , Xt )], by exploiting some wellknown empirical properties of the behaviour of θt and by making use of a (function of a) proxy for θt , denoted θ˜t . This proxy may itself be a RV estimator, of course, and it may be a noisy estimate of the latent target variable, but it must be conditionally unbiased. Assumption P1. θ˜t = θt + νt , with E [νt |Ft −1 , θt ] = 0, a.s. For many assets the squared daily return can reasonably be assumed to be conditionally unbiased: the mean return is generally negligible at the daily frequency, and the impact of market microstructure effects is often also negligible in daily returns. It should be noted, however, that the presence of jumps in the data generating process will affect the inference obtained using the daily squared return as a proxy: in this case we can compare the estimators in terms of their ability to estimate quadratic variation, which is the integrated variance plus the sum of squared jumps in many cases, see Barndorff-Nielsen and Shephard (2007) for example, but not in terms of their ability to estimate the integrated variance alone. If an estimator of the integrated variance that is conditionally unbiased, for finite m, in the presence of jumps is available, however, then the methods presented below apply directly. Assumption P2. Yt = and
∑J
i=1
ωi = 1.
∑J
i=1
ωi θ˜t +i , where 1 ≤ J < ∞, ωi ≥ 0 ∀i
A.J. Patton / Journal of Econometrics 161 (2011) 284–303
In the propositions below I consider using a convex combination of leads of θ˜t , as in Assumption P2, the simplest special case of which is just a one-period lead (and so Yt = θ˜t +1 ). Using leads of the proxy is important for breaking the correlated measurement errors problem, which makes it possible to overcome the problems identified in Proposition 1. Yt is thus interpretable as an instrument for θ˜t . Our focus on differences in average accuracy makes this a non-linear instrumental variables problem, and like other such problems it is not sufficient to simply assume that Corr Yt , θ˜t ̸= 0; some more structure is required. I obtain results
in this application by considering two alternative approximations of the conditional mean of θt . Numerous papers on the conditional variance (see Bollerslev et al., 1994; Engle and Patton, 2001; Andersen et al., 2006, for example), or integrated variance (see Andersen et al., 2004, 2007) have reported that these quantities are very persistent, close to being (heteroskedastic) random walks. The popular RiskMetrics model, for example, is based on a unit root assumption for the conditional variance, and in recent work Hansen and Lunde (2010) find that the null of a unit root is rejected for almost none of the Dow Jones 30 stocks. Wright (1999), in contrast, provides thorough evidence against the presence of a unit root in daily conditional variance for several assets. Other authors have studied the persistence of volatility via long memory models, see Ding et al. (1993), Aït-Sahalia and Mancini (2008), Corsi (2009), and Maasoumi and McAleer (2008), for example. As an initial approximation to the observed persistence in volatility, consider the following assumption7 : Assumption T1. θt = θt −1 + ηt , with E [ηt |Ft −1 ] = 0, a.s. and θt > 0 a.s. The approximation in Assumption T1 is likely to be poor in applications where the price process is subject to jumps that contribute substantially to the total QV. Previous authors have found that the jump component of daily QV is much less persistent than the IV component, see Andersen et al. (2007, 2011a,b) for example, and in such cases the sum of these components (ie, the QV) may not be well approximated by Assumption T1. 2.3.1. Unconditional rankings of RV estimators This section presents results that allow the ranking of RV estimators based on unconditional average accuracy, according to some distance measure L. Importantly, the methods presented below allow for the comparison of multiple estimators simultaneously, via the tests of White (2000) and Romano and Wolf (2005) for example. Proposition 2. (a) Let Assumptions P1, P2 and T1 hold, and let the pseudo-distance measure L belong to the class in Eq. (5). Then E [1L (θt , Xt )] = E [1L (Yt , Xt )] for any vector of RV estimators, Xt , and any L such that these expectations exist. (b) If we further assume A1 and A2 in the Appendix, then:
√ T
T 1−
T t =1
1L (Yt , Xt ) − E [1L (θt , Xt )]
→d N (0, Ω1 ) , as T → ∞ where Ω1 is given in the proof.
287
T T ∗ ∗ 1 − ∗ 1 − sup P 1L Yt , Xt − 1L (Yt , Xt ) ≤ z T t =1 T t =1 z T 1 − −P 1L (Yt , Xt ) − E [1L (θt , Xt )] ≤ z T t =1 →p 0,
as T → ∞.
Part (a) of the above proposition shows that it is possible to obtain an unbiased estimate of the difference in the average distance from the latent target variable, θt , using a suitablychosen volatility proxy, under certain conditions. This opens the possibility to use existing methods from the forecast evaluation literature to help us choose between RV estimators.8 Parts (b) and (c) of the proposition uses the existing forecast evaluation literature to obtain moment and mixing conditions under which we obtain an asymptotic normal distribution for estimates of the differences in average distance. The conditions in part (b) are sufficient to justify the use of Diebold and Mariano (1995) and West (1996)-style tests for pair-wise comparisons of RV estimator accuracy. Part (c) justifies the use of the bootstrap ‘reality check’ test of White (2000), the ‘model confidence set’ of Hansen and Lunde (2010), the SPA test of Hansen (2005), and the stepwise multiple testing method of Romano and Wolf (2005), which are based on the stationary bootstrap of Politis and Romano (1994). The methods proposed above are complements rather than substitutes for existing methods: the assumptions required for the above result are mostly non-overlapping with the conditions usually required for existing comparison methods. For example, the above proposition does not require any assumptions about the underlying price process (subject to the moment and mixing conditions being satisfied), the microstructure noise process, the trade or quote arrival processes, or the arrivals of limit versus market orders. This means that tests based on the above proposition allow for comparisons of RV estimators that would be difficult using existing methods in the literature. However, unlike most existing tests, the above proposition relies on a long time series of data rather than a continuous sample of prices (i.e., T → ∞ rather than m → ∞), on mixing and moment conditions, and on the applicability of the random walk approximation for the target variable. In Section 3 below I show that these assumptions are reasonable in three realistic simulation designs. In the next proposition I substitute Assumption T1 with one which allows the latent target variable, θt , to follow a stationary AR(p) process. The work of Meddahi (2003) and Barndorff-Nielsen and Shephard (2002) shows that integrated variance follows an ARMA(p, q) model for a wide variety of stochastic volatility models for the instantaneous volatility, motivating this generalisation of the result based on a random walk approximation in Proposition 2. Whilst allowing for a general ARMA model is possible, I focus on the AR case both for the ease with which this case can be handled, and the fact that it has been found to perform approximately as well as the theoretically optimal ARMA model in realistic scenarios, see Andersen et al. (2004). Assumption T2. θt = a.s., θt > 0 a.s., φ1 invertible, and φ ≡ stationary.
∑ φ0 + pi=1 φi θt −i + ηt , with E [ηt |Ft −1 ] = 0 Ψ defined in Eq. (36) is ̸= 0, the matrix ′ φ1 , . . . , φp is such that θt is covariance
(c) If B1 in the Appendix also holds then the stationary bootstrap may also be employed, as:
When the order of the autoregression is greater than one, I also require Assumption R1, below. This assumption is plausible for most RV estimators in the literature, as they are generally based
7 Strict positivity of θ for this random walk process can be ensured if innovation t is a strictly positive random variable with variance proportional to θt −1 ,for example ηt = θt −1 (Zt − 1) where Zt ∼ i.i.d. log N −σZ2 /2, σZ2 . Many other specifications are possible.
8 Note that if the accuracy of RV estimators varies over time, then this approach allows us to make comparisons only about the average accuracy over the sample period. If variables thought to be correlated with the accuracy of a given estimator are known, then the conditional rankings in the next section may instead be used.
288
A.J. Patton / Journal of Econometrics 161 (2011) 284–303
on data from a single day, although Barndorff-Nielsen et al. (2004) and Owens and Steigerwald (2007) are two exceptions. Assumption R1. Xt is independent of νt −j for all j > 0. Proposition 3. Let Assumptions P1, P2 and T2 hold, let the pseudodistance measure L belong to the class in Eq.(5), and let R1 hold if p > 0′p Ip
′ 1. Further, define Q0 ≡ φ0 , 0′p , Q1 ≡ where 0p is a p × 1 vector of zeros. Then:
0 0
,P ≡
1 0p
−φ′
ωj g0(j) /g1(j)
j =1 J −
+
H0∗ : E [1L (θt , Xt ) |Gt −1 ] = 0 (j)
ωj 1 − 1/g1
E 1C (Xt ) θ˜t +j
J −
ωj
j =1
p −
(j)
(j)
gi /g1
E 1C (Xt ) θ˜t +1−i
1L (θt , Xt ) = α′ Zt −1 + et
i=2
for any vector of RV estimators, Xt , and any L such that these (j) expectations exist. The variable g0 is defined as the first element of
the vector I − P −1 Q1
j
I − P −1 Q1
−1
(j)
P −1 Q0 , and gi
is defined
j
as (1, i) element of the matrix P Q1 . (b) If we further assume A1 and A2 in the Appendix hold for the series Bt , defined in Eq. (35), then:
√ T
T 1−
T t =1
−1
ˆ 1L (Yt , Xt ) − βT − E [1L (θt , Xt )]
→ N (0, Ω2 ) , d
βˆ T ≡
1− T t =1
+
J −
1C (Xt )
J −
ωj 1 − 1/ˆg1(j)
+
j =1
ωj
p −
(j)
(j)
gˆi /ˆg1
i=2
1
T −j −
T − j t =1 1
versus Ha : α ̸= 0.
(9)
The following proposition provides conditions under which a feasible form of the above regression: (10)
E [1L (θt , Xt ) |Gt −1 ] = E [1L (Yt , Xt ) |Gt −1 ] a.s., t = 1, 2, . . .
1C (Xt ) θ˜t +j
T −
T + 1 − i t =i
1C (Xt ) θ˜t +1−i (j)
(j)
where gˆi , i = 0, 1, . . . , p; j = 1, 2, . . . , J are estimators of gi described in the proof. (c) If B1 in the Appendix also holds then the stationary bootstrap may also be employed, as:
T 1 − ∗ supP ∗ 1L Yt∗ , X∗t − βˆ T T t =1 z T 1− − 1L (Yt , Xt ) + βˆ T ≤ z T t =1 T 1 − ˆ −P 1L (Yt , Xt ) − βT − E [1L (θt , Xt )] ≤ z T t =1 →p 0,
H0 : α = 0
Proposition 4. (a) Let Assumptions P1, P2 and T1 hold, and let the pseudo-distance measure L belong to the class in Eq. (5). If Gt −1 ⊂ Ft , then
j =1
j =1 J −
ωj gˆ0(j) /ˆg1(j)
where Zt −1 ∈ Gt −1 is some q × 1 vector of variables thought to be useful for predicting future differences in estimator accuracy, and testing:
provides consistent estimates of the parameter α in the infeasible regression.
as T → ∞
T
(7)
(8)
1L (Yt , Xt ) = α˜ ′ Zt −1 + e˜ t
where
a.s. t = 1, 2, . . . .
For pair-wise comparisons of forecasts (or RV estimators, in our case), 1L (θt , Xt ) is a scalar and the above null is usually tested by looking at simple regressions of the form:
j =1
+
to distinguish between competing RV estimators than would otherwise be the case. 2.3.2. Conditional rankings of RV estimators In this section we extend the above results to consider expected differences in distance conditional on some information set, thus allowing the use of Giacomini and White (2006)-type tests of equal conditional RV estimator accuracy. The null hypothesis in a GWtype test is:
where J −
E 1C (Xt ) θ˜t +j . This estimation error will lead to reduced power
Ip
(a) E [1L (θt , Xt )] = E [1L (Yt , Xt )] − β
β = E [1C (Xt )]
a stationary AR(p) process. The cost of the added flexibility in allowing for a general AR(p) process for the target variable is the added estimation error induced by having to estimate the AR(p) parameters, and having to estimate additional terms of the form
as T → ∞.
Proposition 3 relaxes the assumption of a random walk, at the cost of introducing a bias term to the expected loss computed using the proxy. This bias term, however, can be consistently estimated under the assumption that the target variable follows
for any vector of RV estimators, Xt , and any L such that these expectations exist. (b) Assume 1L (θt , Xt ) is a scalar and denote the OLS estimator of α˜ in Eq. (10) as αˆ T . Then if we further assume A3 and A4 in the Appendix: −1/2
ˆT D
√ ˆ T − α →d N (0, I ) T α
where
ˆT ≡ M ˆ T−1 Ω ˆ T−1 , ˆ TM D ΩT ≡ V
T 1 −
√
T t =1
ˆT ≡ M
T 1−
T t =1
Zt −1 Z′t −1 ,
Zt −1 et
ˆ T some symmetric and positive semi-definite estimator and with Ω ˆ T − ΩT →p 0. such that Ω Part (a) of the above proposition shows that the corresponding part of Proposition 2 can be generalised to allow for a conditioning set Gt −1 ⊂ Ft without any additional assumptions. Part (b) shows that the OLS estimator of the feasible GW regression in Eq. (10) is centred on the true parameter in the infeasible regression in Eq. (8), thus enabling GW-type tests. The variance of the OLS estimator will generally be inflated relative to the variance of
A.J. Patton / Journal of Econometrics 161 (2011) 284–303
289
the infeasible regression, but nevertheless the variance can be estimated using standard methods. The above proposition can also be extended to allow the latent target variable, θt , to follow a stationary AR(p) process. The proposition below shows that the AR approximation can be accommodated by using an adjusted dependent variable in the GW-type regression. That is, the infeasible regression is again:
where
1L (θt , Xt ) = α′ Zt −p + et
(11)
and
(12)
1L (θt , Xt ) ≡ 1L (Yt , Xt ) + λˆ 0,T 1C (Xt ) p − + λˆ 1,T 1C (Xt ) θ˜t +1 + λˆ i,T 1C (Xt ) θ˜t +1−i
while the adjusted regression becomes:
1L (θt , Xt ) = α˜ ′ Zt −p + e˜ t .
Note that the variable Zt must be lagged by (at least) the order of the autoregression, so for an AR(p) the right-hand side of the GW-type regression would contain Zt −p . Under the random walk approximation the adjusted dependent variable is simply 1L (θt , Xt ) = 1L (Yt , Xt ), while under the AR(p) approximation it will contain terms related to the parameters of the AR(p) model. For example, specialising the proposition below to an AR(1) with J = 1 (so that Yt = θ˜t +1 ) we have:
+
φ1
1C (Xt ) θ˜t +1 .
(13)
(14)
where
φˆ 0,T 1L 1C (Xt ) (θt , Xt ) = 1L θ˜t +1 , Xt − φˆ 1,T +
1 − φˆ 1,T
φˆ 1,T
1C (Xt ) θ˜t +1
(15)
Assumption R1′ . Xt is conditionally independent of νt −j given Ft −j−1 , for all j > 0. Proposition 5. Let Assumptions P1, P2 and T2 hold, let the pseudodistance measure L belong to the class in Eq. (5), and let R1′ hold if (j) p > 1. Let Q0 , Q1 , P and gi be defined as in Proposition 3. Finally, assume that 1L (θt , Xt ) is a scalar, and define:
i =2
(j)
ωj
g1
j=1
i = 0, 2, 3, . . . , p
i =2
ˆ i,T , i = 0, 1, . . . , p are the values of λi based on estimated where λ (j)
values for φj and gi . Then:
(a) E 1L (θt , Xt )Zt −p = E 1L (θt , Xt ) Zt −p for any Zt −p ∈ Ft −p .
(b) Denote the OLS parameter estimate of α˜ in Eq. (14) as αˆ T . If we
as T → ∞.
(c) If B1 in the Appendix also holds then the stationary bootstrap may also be employed, as:
ˆ ∗T − αˆ T ≤ z − P αˆ T − α ≤ z →p 0, sup P ∗ α
z
as T → ∞. As in Proposition 3, the AR assumption introduces additional terms to be estimated in order to consistently estimate E 1L (θt , Xt ) · Zt −p . The above proposition shows that these terms are estimable, though the additional estimation error will of course reduce the power of this test. It is worth noting that Proposition 3 can be obtained as a special case of the above proposition by simply setting Zt −p equal to one. 3. Simulation study
in the AR(1) and J = 1 case. The dependent variable in the feasible adjusted regression depends on estimated AR(p) parameters, and so standard OLS inference cannot be used. The proposition below considers the more general AR(p) case, with a proxy that may depend on a convex combination of leads of θ˜t , and shows how to account for the fact that the adjustment term involves estimated parameters. A strengthening of Assumption R1 is needed for the test of conditional accuracy if the order of the autoregressive approximation is greater than one.
1L (θt , Xt ) ≡ 1L (Yt , Xt ) + λ0 1C (Xt ) + λ1 1C (Xt ) θ˜t +1 p − + λi 1C (Xt ) θ˜t +1−i
J −
, and φ1 J φi φi − ωj gi(j) − g1(j) , λi = − − φ1 φ1 j=1 φ1
−
√ ˆ T − α →d N (0, Ω3 ) T α
˜ = α, This adjusted dependent variable is constructed such that α and thus estimating Eq. (12) by OLS yields a consistent estimator of the unknown true parameter α. (Note that if φ0 = 0 and φ1 = 1, which corresponds to the random walk case, the adjustment term drops out and we obtain the same result as in Proposition 4.) Of course, the parameters of the AR(p) process must be estimated, leading to a feasible adjusted regression: ′ 1L α˜ Zt −p + e˜ t (θt , Xt ) =
1
further assume A1 and A2 in the Appendix hold for the series Dt , defined in Eq. (37), then:
φ 0 1L (θt , Xt ) = 1L θ˜t +1 , Xt − 1C (Xt ) φ1 1 − φ1
λ1 =
To examine the finite-sample performance of the results in the previous section, I present the results of a small simulation study. I use three different stochastic volatility models, each with the same parameters as in Gonçalves and Meddahi (2009). The first model is a GARCH diffusion: d log P Ď (t ) = 0.0314d(t ) + ν (t )
× −0.576dW1 (t ) + 1 − 0.5762 dW2 (t ) (16) dν 2 (t ) = 0.035 0.636 − ν 2 (t ) d(t ) + 0.144ν 2 (t )dW1 (t ). The second model is a log-normal diffusion, using the same process for the log-price as above, but a different process for the volatility: d log ν 2 (t ) = −0.0136 0.8382 + log ν 2 (t ) d(t )
+ 0.1148dW1 (t ).
(17)
The third volatility model is a two-factor diffusion, which takes the following form:
d log P Ď (t ) = 0.030d(t ) + ν(t ) −0.30dW1 (t ) − 0.30dW2 (t )
+ 1 − 2 × 0.302 dW3 (t ) ν 2 (t ) = exp −1.2 + 0.04ν12 (t ) + 1.5ν22 (t ) dν (t ) = −0.00137ν (t )dt + dW1 (t ) 2 1 2 2
2 1
dν (t ) = −1.386ν22 (t )dt + 1 + 0.25ν22 (t ) dW2 (t ).
(18)
290
A.J. Patton / Journal of Econometrics 161 (2011) 284–303
The two-factor diffusion is characterized by one highly persistent component and one less persistent component, which yields volatility dynamics that are quite distinct from the other two processes. We include all three processes in this study to gain a better understanding of the finite-sample properties of the proposed tests in a variety of empirical situations. In simulating from these processes I use a simple Euler discretization scheme, with the step size calibrated to one-tenth of one second (i.e., with 234,000 steps per simulated trade day, which is assumed to be 6.5 h in length). I consider sample sizes of T = 500 and T = 2500 trade days. To gain some insight into the impact of microstructure effects, I also consider a simple i.i.d. error term for the observed log-price:

log P(t_j) = log P†(t_j) + ξ(t_j),  ξ(t_j) ~ i.i.d. N(0, σξ²).   (19)
Following Aït-Sahalia et al. (2005) and Huang and Tauchen (2005), I set σξ² to be such that the proportion of the variance of the 5-min return (5/390 of a trade day) that is attributable to microstructure noise is 20%:

2σξ² / [ (5/390) V[rt] + 2σξ² ] = 0.20,   (20)

where rt is the open-to-close return on day t. The expression above is from Aït-Sahalia et al. (2005), while the proportion of 20% is around the middle value considered in the simulation study of Huang and Tauchen (2005). The processes to be simulated above exhibit a leverage effect and are contaminated with noise, and so existing results on the ARMA processes for QV implied by various continuous-time stochastic volatility models, see Barndorff-Nielsen and Shephard (2002) and Meddahi (2003), cannot be directly applied. This allows us to study how the proposed tests perform in realistic cases where both the random walk and AR(p) models are merely approximations to the true process for daily QV; neither is correctly specified.
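A minimal sketch of the simulation scheme just described, for the GARCH diffusion in Eq. (16) with the i.i.d. noise of Eq. (19), is given below; the function name, starting values and the variance guard are illustrative assumptions, not the paper's code.

```python
import numpy as np

def simulate_garch_diffusion_day(n_steps=234_000, sigma_xi=0.0, rng=None):
    rng = rng or np.random.default_rng()
    dt = 1.0 / n_steps                      # one trade day normalised to length 1
    logp, nu2 = 0.0, 0.636                  # start the variance at its mean level
    logp_path = np.empty(n_steps + 1)
    logp_path[0] = logp
    rho = -0.576                            # leverage via the common Brownian W1
    for i in range(n_steps):
        dw1, dw2 = rng.normal(0.0, np.sqrt(dt), 2)
        # Eq. (16): Euler steps for the log-price and the variance process
        logp += 0.0314 * dt + np.sqrt(nu2) * (rho * dw1 + np.sqrt(1 - rho**2) * dw2)
        nu2 += 0.035 * (0.636 - nu2) * dt + 0.144 * nu2 * dw1
        nu2 = max(nu2, 1e-12)               # guard against Euler overshoot below zero
        logp_path[i + 1] = logp
    # Eq. (19): observed prices are the efficient price plus i.i.d. noise
    return logp_path + rng.normal(0.0, sigma_xi, n_steps + 1)
```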
The finite-sample size and power properties of the proposed methods are investigated via the following experiment. For simplicity I focus on pair-wise comparisons of RV estimators, each implemented using 1000 draws from the stationary bootstrap of Politis and Romano (1994), thus making this a 'reality check'-type test from White (2000). I set each RV estimator equal to the true QV plus some noise:

X_it = QVt + ζ_it,  i = 1, 2   (21)
ζ_1t = ω ν_t^(30 min) + (1 − ω) σu U_1t   (22)
ζ_2t = ω ν_t^(30 min) + (1 − ω) σu U_2t + √(σζ2² − σζ1²) U_3t   (23)
[U_1t, U_2t, U_3t]′ ~ i.i.d. N(0, I)

where ν_t^(30 min) ≡ RV_t^(30 min) − IVt, and

ω = ρ σζ1 / σν   (24)
σu² = σν² σζ1² (1 − ρ²) / (σν − ρ σζ1)²   (25)

where V[ν_t^(30 min)] ≡ σν².
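A minimal sketch of this noise design follows; given target error variances and the correlation ρ with the 30-min RV measurement error, it backs out ω and σu² from Eqs. (24) and (25) and builds the two noisy estimators. All inputs and names are illustrative placeholders for the simulated quantities in the text.

```python
import numpy as np

def make_estimators(qv, nu, var_z1, var_z2, rho=0.5, rng=None):
    rng = rng or np.random.default_rng()
    sig_nu, sig_z1 = np.std(nu), np.sqrt(var_z1)
    omega = rho * sig_z1 / sig_nu                                             # Eq. (24)
    sigma_u2 = sig_nu**2 * var_z1 * (1 - rho**2) / (sig_nu - rho * sig_z1)**2  # Eq. (25)
    u = rng.standard_normal((3, len(qv)))
    zeta1 = omega * nu + (1 - omega) * np.sqrt(sigma_u2) * u[0]               # Eq. (22)
    zeta2 = (omega * nu + (1 - omega) * np.sqrt(sigma_u2) * u[1]
             + np.sqrt(max(var_z2 - var_z1, 0.0)) * u[2])                     # Eq. (23)
    return qv + zeta1, qv + zeta2                                             # Eq. (21)
```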
The above structure allows the measurement error on each of the RV estimators to be correlated with the proxy measurement error, consistent with what is faced in practice. As a benchmark, I use the measurement errors on RV^(30 min) to generate this correlation, and I set the correlation to ρ = Corr(ν_t^(30 min), ζ_1t) = 0.5 by choosing the parameters ω and σu² via Eqs. (24) and (25). These equations also allow me to vary the variance of the errors associated with the RV estimators, σζ1² and σζ2². In the study of the size of the tests I set σζ1²/V[QVt] = σζ2²/V[QVt] = 0.1 (so that the variable U_3t drops out of Eq. (23)), which is approximately equal to V[ν_t^(30 min)]/V[QVt] in this simulation. To study the power, I fix σζ1²/V[QVt] = 0.1 and let σζ2²/V[QVt] = 0.15, 0.2, 0.5, 1.

I consider seven unconditional comparison tests in total. The first is the infeasible test that would be conducted if the true QV were observable; the power of this test represents an upper bound on what one can expect from the feasible tests. I consider feasible tests under both the random walk approximation (using Proposition 2) and an AR(1) approximation (using Proposition 3). I also consider three different volatility proxies: daily squared returns, 30-min RV and the true QV. The latter case is included to examine the limiting case of a proxy with no error being put through these tests. The rejection frequencies under each scenario are presented in Table 1, using the QLIKE pseudo-distance measure from Eq. (4). The corresponding results for the MSE distance measure are similar and available upon request.

The first row of each panel of Table 1 corresponds to the case where the null hypothesis is satisfied, and thus we expect these figures to be close to 0.05, the nominal size of the tests. For both sample sizes and across all three diffusion models the finite-sample size is reasonable, with rejection frequencies close to 0.05. Most tests appear to be under-sized, meaning that they are conservative tests of the null; only in the case of a short time series combined with the AR approximation does over-rejection of the null occur. The results for the power of the tests are as expected: the power of the new tests is worse than would be obtained if the true QV were observable; power is greater when a longer time series of data is used; power is worse when a noisier proxy is used (true QV versus 30-min RV versus daily squared returns); and the power of the test based on the AR(1) approximation is worse than that based on the random walk approximation. The AR(1) approximation has little power when the volatility proxy is very noisy and T is small: in that case it appears that the estimation of the AR parameters overwhelms any information about the relative accuracy of the two RV estimators. The power curves for the GARCH and Log diffusions are similar, while the power of the test under the two-factor diffusion is generally lower, a finding consistent with other papers using this model, see Huang and Tauchen (2005).

Next I consider a simulation study of the Giacomini and White (2006)-style conditional comparisons of RV estimators. I use the following design:

X_1t = QVt + ζ_1t   (26)
X_2t = QVt − λ QV_{t−1} + ζ_2t   (27)
ζ_it = ω ν_t^(30 min) + (1 − ω) σu U_it,  i = 1, 2,  [U_1t, U_2t]′ ~ i.i.d. N(0, I).

As in the simulation for tests of unconditional accuracy, I choose ω and σu² such that σζ1²/V[QVt] = σζ2²/V[QVt] = 0.1 and Corr(ν_t^(30 min), ζ_1t) = Corr(ν_t^(30 min), ζ_2t) = 0.5. In the study of finite-sample size, I set λ = 0. To study power, I introduce a time-varying bias in the second RV estimator by letting λ = 0.1, 0.2, 0.4, 0.8, and then estimate regressions of the form:

L(θ̃_{t+1}, X_1t) − L(θ̃_{t+1}, X_2t) = α0 + α1 log( (1/10) Σ_{j=1}^{10} θ̃_{t−j} ) + et.   (28)
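The sketch below illustrates how the dependent variable and regressor in Eq. (28) can be built, assuming the standard QLIKE normalisation L(θ, X) = θ/X − log(θ/X) − 1 and pre-computed arrays for the proxy and the two estimators; the function names and inputs are illustrative assumptions.

```python
import numpy as np

def qlike(theta, x):
    # QLIKE pseudo-distance between a proxy value and an estimator value
    return theta / x - np.log(theta / x) - 1.0

def gw_design(proxy, x1, x2, window=10):
    """proxy has length T+1 (so proxy[t+1] is the one-period lead);
    x1, x2 have length T. Returns the loss differences and regressor of Eq. (28)."""
    T = len(x1)
    dloss, z = [], []
    for t in range(window, T):
        dloss.append(qlike(proxy[t + 1], x1[t]) - qlike(proxy[t + 1], x2[t]))
        z.append(np.log(np.mean(proxy[t - window:t])))  # log 10-day average proxy
    return np.asarray(dloss), np.asarray(z)

# alpha_hat = OLS of dloss on [1, z]; under the RW approximation use Newey-West
# standard errors, under the AR(1) approximation bootstrap the adjusted regression.
```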
Table 1
Finite-sample size and power of unconditional accuracy tests.

[Rejection frequencies by noise ratio γ = 0.10, 0.15, 0.20, 0.50, 1.00, in three panels (GARCH diffusion, Log diffusion, Two-factor diffusion), for the infeasible test using the true QV (QV*) and feasible tests using the 30-min RV and daily squared-return proxies, under the RW and AR approximations, for T = 500 and T = 2500.]

Notes: This table presents the rejection frequencies for tests of equal accuracy of two competing RV estimators, using the QLIKE pseudo-distance measure. The first two columns correspond to the ideal infeasible case when the true QV is observable. The remaining columns present results when the available volatility proxy has varying degrees of measurement error, under two approximations for the QV (a random walk (RW) and a first-order autoregression (AR)). The three panels correspond to three different specifications of the continuous-time diffusion generating the observed returns. All tests are conducted at the 0.05 level, based on 1000 draws from the stationary bootstrap, and each scenario is simulated 1000 times. The null hypothesis of equal average accuracy is satisfied in the first row of each panel, while in the other rows the second RV estimator has greater noise variance (γ ≡ σζ2²/V[QVt]) than the first (σζ1²/V[QVt] = 0.10).

Table 2
Finite-sample size and power of conditional accuracy tests, slope coefficient t-test.
[Rejection frequencies by bias parameter λ = 0.00, 0.10, 0.20, 0.40, 0.80, in three panels (GARCH diffusion, Log diffusion, Two-factor diffusion), for the true QV (QV*), 30-min RV and daily squared-return proxies, under the RW and AR approximations, for T = 500 and T = 2500.]
Notes: This table presents the rejection frequencies for tests on the slope coefficient in a regression for testing the equal conditional accuracy of two competing RV estimators, using the QLIKE pseudo-distance measure. The first two columns correspond to the ideal infeasible case when the true QV is observable. The remaining columns present results when the available volatility proxy has varying degrees of measurement error, under two approximations for the QV (a random walk (RW) and a first-order autoregression (AR)). The three panels correspond to three different specifications of the continuous time diffusion generating the observed returns. All tests are conducted at the 0.05 level, based on 1000 draws from the stationary bootstrap, and each scenario is simulated 1000 times. The null hypothesis of a zero slope coefficient is satisfied in the first row of each panel, while in the other rows the second RV estimator at time t has time-varying bias equal to −λ × QVt −1 , and thus the true slope coefficient is non-zero.
where θ̃t is the volatility proxy: daily squared returns, 30-min RV or the true QV. I use Propositions 4 and 5 to consider two tests based on the above regression: a test that the slope coefficient is zero (α1 = 0), and a joint test that both coefficients are zero (α0 = α1 = 0). Under the random walk approximation I can estimate these regressions by simple OLS, and I use Newey and West (1987) to obtain the covariance matrix of the estimated parameters. Under the AR(1) approximation I use 1000 draws from the stationary bootstrap. In the interests of space I present these simulation results only for the QLIKE distance measure, see Tables 2 and 3; results under the MSE distance measure are similar and available on request.

The first row of each panel in Tables 2 and 3 corresponds to the case where the null hypothesis is true. The tests using the random walk approximation are generally close to the nominal size of 0.05, while the tests using the AR(1) approximation appear to be somewhat under-sized, again implying a conservative test of the null. As expected, the power of the tests to detect violations
Table 3
Finite-sample size and power of conditional accuracy tests, joint test.

[Rejection frequencies by bias parameter λ = 0.00, 0.10, 0.20, 0.40, 0.80, in three panels (GARCH diffusion, Log diffusion, Two-factor diffusion), for the true QV (QV*), 30-min RV and daily squared-return proxies, under the RW and AR approximations, for T = 500 and T = 2500.]
Notes: This table presents the rejection frequencies for tests of equal conditional accuracy of two competing RV estimators, using the QLIKE pseudo-distance measure. The first two columns correspond to the ideal infeasible case when the true QV is observable. The remaining columns present results when the available volatility proxy has varying degrees of measurement error, under two approximations for the QV (a random walk (RW) and a first-order autoregression (AR)). The three panels correspond to three different specifications of the continuous time diffusion generating the observed returns. All tests are conducted at the 0.05 level, based on 1000 draws from the stationary bootstrap, and each scenario is simulated 1000 times. The null hypothesis of equal conditional accuracy is satisfied in the first row of each panel, while in the other rows the second RV estimator at time t has time-varying bias equal to −λ × QVt −1 .
Fig. 1. IBM volatility over the period January 1996–June 2007 (computed using realised volatility based on 5-min calendar-time trade prices), annualised using the formula σt = √(252 × RVt).
of the null is lower when a less accurate volatility proxy is employed, higher when a long time series of data is available, and higher using the random walk approximation than using the AR(1) approximation. The results across the three diffusion processes are similar, though again power under the two-factor diffusion is generally lower than under the GARCH or Log diffusion.

4. Estimating the volatility of IBM stock returns

In this section I apply the methods of Section 2 to the problem of estimating the quadratic variation of the open-to-close continuously-compounded return on IBM. I use data on NYSE trade and quote prices from the TAQ database over the period from January 1996 to June 2007, yielding a total of 2893 daily observations.9 This sample period covers several distinct periods:
the minimum tick size moved from one-eighth of a dollar to one-sixteenth of a dollar on June 24, 1997, and to pennies on January 29, 2001.10 Further, volatility for this stock (and for the market generally) was high over the early and middle parts of the sample, and very low, by historical standards, in the later years of the sample, see Fig. 1. These changes motivate the use of sub-samples in the empirical analyses below: I break the sample into three periods (1996–1999, 2000–2003 and 2004–2007) to determine whether these changes impact the ranking of the competing realised volatility estimators.

9 I use trade and quote prices from the NYSE only, between 9:45 am and 4:00 pm, with a g127 code of 0 or 40, a corr code of 0 or 1, positive size, and cond not equal to ‘‘O’’, ‘‘Z’’, ‘‘B’’, ‘‘T’’, ‘‘L’’, ‘‘G’’, ‘‘W’’, ‘‘J’’, or ‘‘K’’. Further, the data were cleaned for data problems, following guidelines in Barndorff-Nielsen et al. (2009): trade and quote prices of zero were dropped, as were quotes generating negative spreads or spreads of more than 50 times the median spread for that day. If more than one price was observed with the same time stamp then the median of these prices was used.
10 Source: New York Stock Exchange web site, http://www.nyse.com/about/history/timeline_chronology_index.html.
Fig. 2. ‘Volatility signature plots’ for IBM, over the period January 1996–June 2007, using 13 different sampling frequencies (from 1 second to 1 trade day), 2 different price series (trades and quotes) and 2 different sampling schemes (calendar-time and tick-time).
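Before turning to the estimators themselves, the sketch below illustrates the two sampling schemes compared in the next paragraph, assuming a pandas Series of intra-day prices indexed by timestamp; the inputs, resampling rule and function names are illustrative assumptions, not the paper's exact data handling.

```python
import numpy as np
import pandas as pd

def realised_variance(prices: pd.Series, rule: str = "5min") -> float:
    """Standard RV, as in Eq. (2), under calendar-time sampling."""
    sampled = prices.resample(rule).last().dropna()   # calendar-time sampling grid
    returns = np.log(sampled).diff().dropna()         # log returns at that frequency
    return float((returns ** 2).sum())                # sum of squared returns

def realised_variance_tick(prices: pd.Series, k: int) -> float:
    """Tick-time sampling keeps every k-th observation, with k chosen so the
    average time between sampled ticks matches the target frequency."""
    returns = np.log(prices.iloc[::k]).diff().dropna()
    return float((returns ** 2).sum())
```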
I consider standard realised variance, as presented in Eq. (2), using trade prices and mid-quote prices, and using calendar-time sampling and tick-time sampling, for thirteen different sampling frequencies: 1, 2, 5, 15 and 30 s; 1, 2, 5, 15 and 30 min; 1 and 2 h;11 and the open–close return. For tick-time sampling, the sampling frequencies here are average times between observations on each day, and the actual sampling frequency of course varies according to the arrival rate of observations. The combination of two price series (trades and mid-quotes), two sampling schemes (calendar time and tick time), and 13 sampling frequencies yields 52 possible RV estimators. However, calendar-time and tick-time sampling are equivalent for the two extreme sampling frequencies (1-s sampling and 1-day sampling), which brings the number of RV estimators to 48 in total. In Fig. 2 I present the volatility signature plot for these estimators for the full sample, and for three sub-samples. These plots generally take a common shape: RV computed on trade prices tends to be upward biased at very high sampling frequencies, while RV computed on quote prices tends to be downward biased at very high sampling frequencies, see Hansen and Lunde (2006a) for example. This pattern does not appear in the last sub-sample for this stock. In Figs. 3 and 4 I present the first empirical contribution of this paper. These figures present estimates of the average distance
11 I use 62.5 and 125 min sampling rather than 60 and 120 min sampling so that there are an integer number of such periods per trade day. I call these 1-h and 2-h sampling frequencies for simplicity.
between each of the 48 RV estimators and the latent quadratic variation of the IBM price process, relative to the corresponding distance using 5-min calendar-time RV on trade prices,12 using the QLIKE distance measure presented in Eq. (4).13 The first figure uses the random walk (RW) approximation for the dynamics in QV, the second uses a first-order AR approximation.14 I use a one-period lead of 5-min calendar-time RV on trade prices as the volatility proxy to compute the differences in average distances.15 I
12 The choice of RV estimator to use as the ‘‘benchmark’’ in these plots is purely a normalisation: it has no effect on the ranks of the different estimators. 13 Patton and Sheppard (2009a) present evidence that the QLIKE pseudo-distance has greater power than the MSE distance measure in a variety of volatility applications. The results of this section under MSE distance are available on request. 14 The point estimate of the AR coefficient for QV, obtained using the estimator in the proof of Proposition 3, is 0.891, suggesting a strongly persistent volatility process. I estimate the contribution of jumps to QV using the ratio (RV-BV)/RV, where RV is 5-min realised variance and BV is 5-min bipower variation, see Barndorff-Nielsen and Shephard (2004b), which is a jump-robust estimator of IV. This ratio averages 0.07 over this sample period, consistent with Huang and Tauchen (2005) and Tauchen and Zhou (2011), indicating that jumps contribute a small but non-zero amount to the QV of this stock. 15 Using the assumption that the squared open-to-close return is unbiased for the true quadratic variation, I tested whether 5-min calendar-time RV is also unbiased, and found no evidence against this assumption at the 0.05 level. Using the squared open-to-close return as the volatility proxy did not qualitatively change these results, though as expected the power of the tests was reduced.
Fig. 3. Differences in average distance, estimated using a random walk approximation, for the 48 competing RV estimators, relative to 5-min calendar-time RV on trade prices. A negative (positive) value indicates that the RV estimator is better (worse) than 5-min calendar-time RV on trade prices. The estimator with the lowest average distance is marked with a vertical line down to the x-axis.
present these results for the full sample and for three sub-samples (1996–1999, 2000–2003, 2004–2007). The conclusion from these pictures is that there are clear gains to using intra-daily data to compute RV, consistent with the voluminous literature to date: the estimated average distances to the true QV for estimators based on returns sampled at 30-min or lower frequencies are clearly greater than those using higher-frequency data (formal tests of this result are presented below). Using the RW approximation, the optimal sampling frequency is either 30 s or 1 min, and the best-performing estimator over the full sample is RV based on trade prices sampled in tick time at 1-min average intervals. The AR approximation gives the same result for the full sample and similar results in the sub-samples.

4.1. Comparing many RV estimators

To formally compare the 48 competing RV estimators, I use the stepwise multiple testing method of Romano and Wolf (2005). This method identifies the estimators that are significantly better, or significantly worse, than a given benchmark estimator, while controlling the family-wise error rate of the complete set of hypothesis tests. That is, for a given benchmark estimator, X_{t,0}, it tests:
H0^(s): E[L(θt, X_{t,0})] = E[L(θt, X_{t,s})],  for s = 1, 2, …, 47

versus

H1^(s): E[L(θt, X_{t,0})] > E[L(θt, X_{t,s})]  or  H2^(s): E[L(θt, X_{t,0})] < E[L(θt, X_{t,s})]
and identifies which individual null hypotheses, H0^(s), can be rejected. I use 1000 draws from the stationary bootstrap of Politis and Romano (1994), with an average block size of 20, for each test. I consider two choices of ‘‘benchmark’’ RV estimator: the squared open-to-close return, which is the most commonly-used volatility estimator in the absence of higher-frequency data, and an RV estimator based on 5-min calendar-time trade prices, which is based on a rule-of-thumb from early papers in the RV literature (see Andersen et al., 2001b and Barndorff-Nielsen and Shephard, 2002 for example), which suggests sampling ‘‘often but not too often’’, so as to avoid the adverse impact of microstructure effects.

Table 4 reveals that every estimator, except for the squared open-to-close quote-price return, is significantly better than the squared open-to-close trade-price return, at the 0.05 level. This is true in the full sample and in all three sub-samples, using both the RW approximation and the AR approximation. This is very strong support for using high-frequency data to estimate volatility. Table 5 provides some evidence that the 5-min RV estimator is significantly beaten by higher-frequency RV estimators. Under the RW approximation, the Romano–Wolf method indicates that RV estimators based on 15-s to 2-min sampling frequencies are significantly better than 5-min RV. Estimators with even higher sampling frequencies are not significantly different, while estimators based on 15-min or lower sampling are found to be
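A minimal sketch of index resampling for the Politis and Romano (1994) stationary bootstrap, the resampling scheme underlying these tests, is given below; the average block length of 20 matches the text, and everything else is an illustrative choice.

```python
import numpy as np

def stationary_bootstrap_indices(T: int, avg_block: float = 20.0, rng=None):
    rng = rng or np.random.default_rng()
    p = 1.0 / avg_block                     # block-restart probability
    idx = np.empty(T, dtype=int)
    idx[0] = rng.integers(T)
    for t in range(1, T):
        # with probability p start a new block at a random date, otherwise
        # continue the current block (wrapping around the end of the sample)
        idx[t] = rng.integers(T) if rng.random() < p else (idx[t - 1] + 1) % T
    return idx

# e.g. the bootstrap distribution of a mean loss difference d = loss0 - loss1:
# means = [d[stationary_bootstrap_indices(len(d))].mean() for _ in range(1000)]
```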
Fig. 4. Differences in average distance, estimated using an AR(1) approximation, for the 48 competing RV estimators, relative to 5-min calendar-time RV on trade prices. A negative (positive) value indicates that the RV estimator is better (worse) than 5-min calendar-time RV on trade prices. The estimator with the lowest average distance is marked with a vertical line down to the x-axis.

Table 4
Tests of equal RV accuracy, with squared open-to-close returns as the benchmark.

Sampling    RW approximation                        AR approximation
frequency   Trades             Quotes               Trades             Quotes
            Cal      Tick      Cal      Tick        Cal      Tick      Cal      Tick
1 s         ✓✓✓✓     –         ✓✓✓✓     –           ✓✓✓✓     –         ✓✓✓✓     –
2 s         ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
5 s         ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
15 s        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
30 s        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
1 min       ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
2 min       ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
5 min       ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
15 min      ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
30 min      ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
1 h         ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
2 h         ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓        ✓✓✓✓     ✓✓✓✓      ✓✓✓✓     ✓✓✓✓
1 day       ⋆        –         ∼∼∼∼     –           ⋆        –         ∼∼∼∼     –
Notes: This table presents the results of Romano and Wolf (2005) stepwise testing of the 48 realised volatility estimators considered in the paper (13 frequencies, 2 sampling schemes, 2 price series, less overlaps which are marked with ‘‘–’’). Two approximations for the dynamics of QV are considered: a random walk (RW) and a first-order autoregression (AR). In this table the benchmark RV estimator is the squared open-to-close trade price return, marked with an ⋆. Estimators that are significantly better than the benchmark, at the 0.05 level, are marked with ✓, estimators that are significantly worse than the benchmark are marked with ×, and estimators that are not significantly different are marked with ∼. The four characters in each element of the above table correspond to the results of the test for the full sample (1996–2007), first sub-sample (1996–1999), second sub-sample (2000–2003) and third sub-sample (2004–2007) respectively.
significantly worse. The results also indicate that trade prices are preferred to quote prices for most of this sample period. Only in the last sub-sample are quote prices at 15-s to 2-min sampling frequencies found to out-perform 5-min RV using trade prices. In the earlier sub-samples quote prices were almost always
worse than trade prices. This result will be explored further in the analysis below. Under the AR approximation very few RV estimators could be distinguished from the 5-min RV estimator using the Romano–Wolf method, suggesting that the gains from moving beyond 5-min sampling are hard to identify in the presence
Table 5
Tests of equal RV accuracy, with the 5-min RV as benchmark.

[Entries ✓ (significantly better), × (significantly worse) and ∼ (not significantly different) for each of the 13 sampling frequencies, 2 price series (trades, quotes) and 2 sampling schemes (calendar, tick), under the RW and AR approximations, with four characters per cell for the full sample and the three sub-samples; the benchmark (5-min calendar-time trade-price RV) is marked ⋆.]
Notes: This table presents the results of Romano and Wolf (2005) stepwise testing of the 48 realised volatility estimators considered in the paper (13 frequencies, 2 sampling schemes, 2 price series, less overlaps which are marked with ‘‘– ’’). Two approximations for the dynamics of QV are considered: a random walk (RW) and a first-order autoregression (AR). In this table the benchmark RV estimator is based on 5-min trade prices sampled in calendar time, marked with an ⋆. Estimators that are significantly better than the benchmark, at the 0.05 level, are marked with ✓, estimators that are significantly worse than the benchmark are marked with ×, and estimators that are not significantly different are marked with ∼. The four characters in each element of the above table correspond to the results of the test for the full sample (1996–2007), first sub-sample (1996–1999), second sub-sample (2000–2003) and third sub-sample (2004–2007) respectively.
of additional estimation error from the AR model, consistent with the simulation results in Section 3.

4.2. Comparing more sophisticated estimators of QV

As noted in the Introduction, the past decade has yielded great progress on the estimation of asset price volatility using high-frequency data. The realised volatility estimator in Eq. (2) was the first, and remains the simplest, such estimator. In this section I compare the performance of a selection of more sophisticated estimators of quadratic variation with simple RV estimates.16 The first two estimators are the two-scale estimator (TSRV) of Zhang et al. (2005) and the multi-scale estimator (MSRV) of Zhang (2006). These estimators use realised variances computed using more than one sampling frequency, which is shown, under certain conditions, to lead to consistency of the estimator in the presence of noise and to efficiency gains. For TSRV I use one tick as the highest frequency and use the optimal ‘‘sparse’’ sampling frequency presented in that paper. For MSRV I again set one tick as the highest frequency and use the formula from that paper for the frequencies of the other estimates and the weights used to combine these estimates. Next, I consider the ‘‘realised kernel’’ (RK) of Barndorff-Nielsen et al. (2008). Following their empirical application to General Electric stock returns, I use their ‘‘modified Tukey–Hanning2’’ kernel and 1-min tick-time sampling, and choose the bandwidth using the approach in Barndorff-Nielsen et al. (2009). Finally, I consider the ‘‘realised range-based variance’’ (RRV) of Christensen and Podolskij (2007) and Martens and van Dijk (2007). I use 5-min blocks, as in Christensen and Podolskij (2007), with 1-min prices within each block. I compare these estimators with RV based on calendar-time trade prices sampled at 1 s, 5 min and 1 day, which gives a total of seven estimators.17 I compare each of these estimators against an RV estimator based on 5-min calendar-time trade prices, using a bootstrap version of the Diebold and Mariano (1995) test. I consider both the RW approximation and the AR approximation, drawing on
16 These estimators were computed using Kevin Sheppard’s ‘‘Oxford Realized’’ toolbox for Matlab, http://realized.oxford-man.ox.ac.uk/data/code. 17 Patton and Sheppard (2009b) consider these, and some other, RV estimators in a study of optimal combinations of estimators of QV. A comprehensive study of the performance across several asset classes of the various estimators of QV is being pursued in Patton and Sheppard (2010).
Propositions 2 and 3 respectively. The results are shown in Table 6. Under the RW approximation all four of the more sophisticated estimators of QV out-perform simple RV5 min over the full sample, with the differences being significant for RK and RRV. Under the AR approximation the significance is reduced, and none of the more sophisticated estimators out-performs RV5 min in the full sample. It is noteworthy that in the last sub-sample (2004–2007) all four of the more sophisticated estimators significantly beat RV5 min, under both the RW and AR approximations, perhaps indicating that in later periods, when turnover and liquidity are higher, there are greater gains to using more sophisticated estimators of QV.

4.3. Conditional comparisons of RV estimators

To investigate the possible sources of the under- or out-performance of certain RV estimators, I next undertake Giacomini and White (2006)-style tests of conditional estimator accuracy. As discussed in Section 2.3.2, the null hypothesis of interest in a Giacomini–White (GW) test is that two competing RV estimators have equal average accuracy conditional on some information set G_{t−1}, that is:

H0*: E[L(θt, X_{t,0}) | G_{t−1}] − E[L(θt, X_{t,s}) | G_{t−1}] = 0  a.s.,  t = 1, 2, ….

One way to implement a test of this null is via a simple regression:

L(θt, X_{t,0}) − L(θt, X_{t,s}) = β0 + β1 Z_{t−1} + et   (29)

where Z_{t−1} ∈ G_{t−1}, and then test the necessary conditions:

H0: β0 = β1 = 0   (30)
versus
Ha: βi ≠ 0 for some i = 0, 1.
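A minimal sketch of this regression-based test under the RW approximation follows: regress the loss difference on a constant and a lagged conditioning variable, use a HAC (Newey–West) covariance, and jointly test both coefficients. The function name, inputs and lag choice are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

def gw_regression_test(dloss, z_lag, hac_lags=10):
    """dloss[t] = L(proxy, X_0t) - L(proxy, X_st); z_lag[t] is in G_{t-1}."""
    X = sm.add_constant(z_lag)
    res = sm.OLS(dloss, X).fit(cov_type="HAC", cov_kwds={"maxlags": hac_lags})
    wald = res.wald_test(np.eye(2))        # Wald test of H0: beta_0 = beta_1 = 0
    return res.params, res.tvalues, float(wald.pvalue)
```

Under the AR(1) approximation the dependent variable involves estimated AR parameters, so the p-values should instead come from the stationary bootstrap, as in Proposition 5.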
4.3.1. High-frequency versus low-frequency RV estimators

I first use the GW test to examine the states in which the gains from using high-frequency data are greatest. One obvious conditioning variable is recent volatility: distribution theory for standard RV estimators, see Andersen et al. (2003) and Barndorff-Nielsen and Shephard (2004a) for example, suggests that RV estimators are less accurate during periods of high volatility, and one might expect that the accuracy gains from using high-frequency data are greatest during volatile periods. Using the RW approximation, I estimate the following regression, and obtain the results below, with robust t-statistics presented in parentheses
Table 6
Comparing RV5 min with more sophisticated estimators.

             RW approximation                                AR approximation
Estimator    Avg ΔL    t-statistics on ΔL                    Avg ΔL    t-statistics on ΔL
                       96–07    96–99    00–03    04–07                 96–07    96–99    00–03    04–07
RV1 s        −0.01     −1.02    −2.74    −1.73     1.18      0.10       2.44     1.12     2.10    −0.65
RV5 min       0.00     ⋆        ⋆        ⋆        ⋆          0.00      ⋆        ⋆        ⋆        ⋆
RV1 day      29.66      9.77     5.81     5.32     8.93      23.03      6.08     4.50     3.97     8.62
TSRV         −0.00     −0.09     3.26    −0.95    −3.22      0.03       1.67     0.32     0.97    −2.39
MSRV         −0.00     −0.29     3.12    −0.90    −3.52      0.02       1.61    −0.03     0.87    −2.58
RKTH2        −0.01     −2.14     0.91    −1.86    −5.66      0.02       1.32    −1.27     0.61    −4.41
RRV          −0.02     −3.41     0.49    −2.75    −6.56      0.01       0.73    −1.66     0.41    −5.09
Notes: This table presents the results of comparisons of the accuracy of seven estimators of QV: realised variance (RV) sampled at 1-s, 5-min and 1-day, two-scales realised variance (TSRV), multi-scale realised variance (MSRV), realised kernel with Tukey–Hanning kernel (RKTH2 ), and realised range-based variance (RRV). Two approximations for the dynamics of QV are considered: a random walk (RW) and a first-order autoregression (AR). In this table the benchmark RV estimator is based on 5-min trade prices sampled in calendar time, marked with an ⋆. The full sample differences in average QLIKE accuracy, relative to RV5 min , are reported in the first column of each panel, with negative (positive) values indicating that the estimator is better (worse) than RV5 min . Diebold–Mariano t-statistics tests of the differences in accuracy are presented in the remaining columns, for the full sample (1996–2007), first sub-sample (1996–1999), second sub-sample (2000–2003) and third sub-sample (2004–2007).
below the parameter estimates18:

L(Yt, RVt^(daily)) − L(Yt, RVt^(5 min)) = 33.67 + et   (31)
                                          (8.42)

L(Yt, RVt^(daily)) − L(Yt, RVt^(5 min)) = 24.94 + 17.85 Z_{t−1} + et   (32)
                                          (11.10)  (2.55)

where Z_{t−1} = log( (1/10) Σ_{j=1}^{10} Y_{t−j} ).
The first of the above regression results shows that daily squared returns, RVt^(daily), are less accurate on average than RV based on 5-min sampling. The positive and significant coefficient on lagged volatility in the second regression is consistent with RV distribution theory, and indicates that the relative accuracy of daily squared returns deteriorates during high-volatility periods. The p-value from a test that both parameters in the second regression are zero is less than 0.001, indicating a strong rejection of the null of equal conditional accuracy. Using an AR approximation and the bootstrap methods presented in Proposition 5, very similar results are obtained19:
L(Yt, RVt^(daily)) − L(Yt, RVt^(5 min)) = 33.54 + et   (33)
                                          (6.69)

L(Yt, RVt^(daily)) − L(Yt, RVt^(5 min)) = 19.93 + 27.76 Z_{t−1} + et   (34)
                                          (5.75)   (3.45)
with bootstrap p-values from tests that the parameters in both models are zero less than 0.001 in both cases.

4.3.2. Tick-time versus calendar-time sampling

I next use the GW test of conditional accuracy to compare calendar-time sampling with tick-time sampling. Theoretical comparisons of tick-time and calendar-time sampling require assumptions on the arrival rate of trades, while the methods presented in this paper allow us to avoid making any specific assumptions about the trade arrival process. For example, in a parametric ‘‘pure jump’’ model of high-frequency asset prices,
18 Tests for zero autocorrelation in the regression residuals and squared regression residuals, up to the tenth lag, yield p-values of less than 0.01 in all cases, motivating the use of a block bootstrap to capture this serial dependence. The R2 of the second of these regressions is 0.005. 19 Tests for zero autocorrelation in the regression residuals and squared regression residuals, up to the tenth lag, yield p-values of less than 0.01 in all cases, again motivating the use of a block bootstrap to capture this serial dependence. The R2 of the second of these regressions is 0.011.
Oomen (2006) finds that tick-time sampling leads to more accurate RV estimators than calendar-time sampling when trades arrive at irregular intervals. In general, if the trade arrival rate is correlated with the level of volatility, consistent with the work of Easley and O'Hara (1992), Engle (2000) and Manganelli (2005), then using tick-time sampling serves to make the sampled high-frequency returns closer to homoskedastic, which theoretically should improve the accuracy of RV estimation, see Hansen and Lunde (2006a) and Oomen (2006). I use the log volatility of trade durations to measure how irregularly spaced trade observations are: this volatility will be zero if trades arrive at evenly-spaced intervals, and increases as trades arrive more irregularly. I estimate a regression of the difference in the accuracy of a calendar-time RV estimator and a tick-time estimator with the same average sampling frequency, on a constant and the lagged log volatility of trade durations, for each of the frequencies considered in the earlier sections,20 and present the results in Table 7. The first column of Table 7 reports a Diebold and Mariano (1995)-type test of the difference in unconditional average accuracy, across the sampling frequencies, using the RW approximation. This difference is positive and significant for the highest three frequencies (2, 5 and 15 s) and negative and significant for all but one of the other frequencies, indicating that tick-time sampling is better than calendar-time sampling (has smaller average distance from the true QV) for all but the very highest frequencies. Further, Table 7 reveals that for all but one frequency the slope coefficient is negative, and 6 out of 11 are significantly negative, indicating that the accuracy of tick-time RV is even better relative to calendar-time RV when trades arrive more irregularly. The results under the AR approximation are very similar to those under the RW approximation, though with slightly reduced significance.

4.3.3. Quote prices versus trade prices

Finally, I examine the difference in accuracy of RV estimators based on trade prices versus quote prices. Theoretical comparisons of RV estimators using quote prices versus trade prices require assumptions about the behaviour of market participants: the arrival rate of trades, the placing and removing of limit and market orders, etc., and theoretical comparisons may be sensitive to these assumptions. The data-based methods of this paper allow us to avoid such assumptions. As a simple measure of the potential informativeness of quotes versus trades, I consider using the ratio of the number of quotes
20 Calendar-time sampling and tick-time sampling are equivalent for the 1-s and 1-day frequencies, and so these are not reported.
Table 7
Tests of equal unconditional and conditional RV accuracy: tick-time versus calendar-time sampling.

[For each sampling frequency from 2 s to 2 h, under the RW and AR approximations: the unconditional average difference (with t-statistic), and the conditional regression constant and slope (with t-statistics) together with the joint p-value; estimates significant at the 0.05 level are marked with an asterisk.]

Notes: This table presents the estimated difference in average distance of tick-time and calendar-time RV estimators, L(Yt, RVt^(tick(h))) − L(Yt, RVt^(cal(h))), either unconditionally, or via a regression on a constant and a one-period lag of the log variance of intra-day trade durations, which is a measure of the irregularity of the arrivals of trade observations. A negative slope coefficient indicates that higher volatility of durations leads to an improvement in the accuracy of the tick-time RV estimator relative to a calendar-time RV estimator using the same (average) frequency. Trade prices are used for all RV estimators. The fourth and eighth columns present the p-values from a chi-squared test that both coefficients are equal to zero. Two approximations for the dynamics of QV are considered: a random walk (RW) and a first-order autoregression (AR). Inference under the RW approximation is based on Newey and West (1987) standard errors, while inference under the AR approximation is based on 1000 samples from the stationary bootstrap. All parameter estimates that are significantly different from zero at the 0.05 level are marked with an asterisk.
Table 8
Tests of equal unconditional and conditional RV accuracy: quote prices versus trade prices.

[For each sampling frequency from 1 s to 1 day, under the RW and AR approximations: the unconditional average difference (with t-statistic), and the conditional regression constant and slope (with t-statistics) together with the joint p-value; estimates significant at the 0.05 level are marked with an asterisk.]

Notes: This table presents the estimated difference in average distance of quote-price and trade-price RV estimators, L(Yt, RVt^(quote(h))) − L(Yt, RVt^(trade(h))), either unconditionally, or via a regression on a constant and a one-period lag of the ratio of the number of quote observations per day to the number of trade observations per day. A negative slope coefficient indicates that an increase in the number of quote observations relative to trade observations leads to an improvement in the accuracy of the quote-price RV estimator relative to a trade-price RV estimator with the same frequency. Calendar-time sampling is used for all estimators. The fourth and eighth columns present the p-values from a chi-squared test that both coefficients are equal to zero. Two approximations for the dynamics of QV are considered: a random walk (RW) and a first-order autoregression (AR). Inference under the RW approximation is based on Newey and West (1987) standard errors, while inference under the AR approximation is based on 1000 samples from the stationary bootstrap. All parameter estimates that are significantly different from zero at the 0.05 level are marked with an asterisk.
per day to the number of trades per day. I regress the difference in the accuracy of a quote-price RV and trade-price RV, with the same calendar-time sampling frequency, on a constant and the lagged ratio of the number of quotes to the number of trades. I do this for each of the frequencies considered in the earlier sections, and present the results in Table 8. The first column of Table 8 reveals that quote-price RV had larger average distance to the true
QV than trade-price RV for all but two sampling frequencies, and for 10 out of 13 this difference is significant at the 0.05 level. However, the results of the test of conditional estimator accuracy reveal that quote-price RV improves relative to trade-price RV as the number of quote observations increases relative to the number of trades: 11 out of 13 slope coefficients are negative, and 10 of these are statistically significant. Results are very similar under the
AR approximation, though with slightly reduced t-statistics. The ratio of quotes per day to trades per day for IBM has increased from around 0.5 in 1996 to around 2.5 in 2007, and may explain the sub-sample results in Table 5: as the relative number of quotes per day has increased, the relative accuracy of quote-price RV has also increased. In the early part of the sample, quote-price RV was significantly less accurate than trade-price RV; however, that difference vanishes in the last sub-sample, where quote and trade prices, at the same frequency, yield approximately equally accurate RV estimators.21
5. Conclusion

This paper considers the problem of ranking competing realised volatility (RV) estimators, motivated by the growing literature on nonparametric estimation of price variability using high-frequency data, see Andersen et al. (2006) and Barndorff-Nielsen and Shephard (2007) for recent surveys. I provide conditions under which the relative average accuracy of competing estimators for the latent target variable can be consistently estimated from available data, using ‘‘large T’’ asymptotics, and show that existing tests from the forecast evaluation literature, such as Diebold and Mariano (1995), West (1996), White (2000), Hansen et al. (forthcoming), Romano and Wolf (2005) and Giacomini and White (2006), may then be applied to the problem of ranking these estimators. The methods proposed in this paper eliminate the need for specific assumptions about the properties of the microstructure noise, and facilitate comparisons of RV estimators that would be difficult using methods from the extant literature.

I apply the proposed methods to high-frequency IBM stock price data between 1996 and 2007 in a detailed empirical study. I consider simple RV estimators based on either quote or trade prices, sampled in either calendar time or in tick time, for several different sampling frequencies. Romano and Wolf (2005) tests reject the squared daily return and the 5-min calendar-time RV in favour of an RV estimator using data sampled at between 15 s and 5 min. In general, I found that using tick-time sampling leads to more accurate RV estimation than using calendar-time sampling, particularly when trade arrivals are very irregularly spaced, and RV estimators based on quote prices are significantly less accurate than those based on trade prices in the early part of the sample, but this difference disappears in the most recent sub-sample of the data.

Acknowledgements

I thank the Editor, Ron Gallant, three anonymous referees, and Alan Bester, Tim Bollerslev, Peter Hansen, Nour Meddahi, Roel Oomen, Neil Shephard, Kevin Sheppard and seminar participants at Athens University of Economics & Business, Cambridge, Chicago, CREATES in Aarhus, Duke, Essex, Federal Reserve Bank of St. Louis, Imperial College London, Keele, Lugano, Queensland University of Technology, Tinbergen Institute Amsterdam, University College Dublin, the London-Oxford Financial Econometrics workshop, the SITE workshop in Stanford, and the workshop on Model Evaluation and Predictive Ability in Paris for helpful comments and suggestions. Runquan Chen and Ben Carlston provided excellent research assistance. Financial support from the Leverhulme Trust under grant F/0004/AF is gratefully acknowledged.

21 Using data from the first half of 2007, corresponding to the end of the last sub-sample in this paper, Barndorff-Nielsen et al. (2009) also find that estimators based on quote prices are very similar to those based on trade prices, when kernel-based estimators of the type in Barndorff-Nielsen et al. (2008) are used, or when standard RV estimators are used on slightly-lower frequency data (1–5-min sampling rather than 1-s sampling).

Appendix. Proofs

Additional assumptions used in parts of the proofs below: let At ≡ [ΔL(θt, Xt)′, ΔC(Xt)′(Yt − θt)]′, let ĀT denote the sample mean of At, and let A_{i,t} denote the ith element of At.

Assumption A1. E|A_{i,1}|^{6+ε} < ∞ for some ε > 0 and for all i.

Assumption A2. {At} is α-mixing of size −3(6 + ε)/ε.

Assumption A3. E[Z_{t−1} et] = 0 for all t.
Assumption A4(a). (Z′_{t−1}, ẽt) is α-mixing of size −(2 + ε)/ε for some ε > 0.

Assumption A4(b). E|Z_{t−1,i} ẽt|^{2+ε} < ∞ for i = 1, 2, …, q and all t.

Assumption A4(c). VT ≡ V[T^{−1/2} Σ_{t=1}^T Z_{t−1} ẽt] is uniformly positive definite.

Assumption A4(d). E|Z_{t−1,i}|^{2+ε+2δ} < ∞ for some δ > 0 and all i = 1, 2, …, q and all t.

Assumption A4(e). MT ≡ E[T^{−1} Σ_{t=1}^T Z_{t−1} Z′_{t−1}] is uniformly positive definite.

Assumption B1. If pT is the inverse of the average block length in Politis and Romano's (1994) stationary bootstrap, then pT → 0 and T × pT → ∞.

Proof of Proposition 1. The proof of part (a) is given in Hansen and Lunde (2006b). I repeat part of it here to show where that proof breaks down in part (b). Consider a second-order mean-value expansion of the pseudo-distance measure L(θ̃t, X_it) around (θt, X_it):

L(θ̃t, X_it) = L(θt, X_it) + [∂L(θt, X_it)/∂θ](θ̃t − θt) + ½ [∂²L(θ̈t, X_it)/∂θ²](θ̃t − θt)²
            = L(θt, X_it) + (C(X_it) − C(θt))(θ̃t − θt) − ½ C′(θ̈t)(θ̃t − θt)²
E (C (Xit ) − C (θt )) · θ˜t − θt |Ft −1
= (C (Xit ) − C (θt )) · E θ˜t |Ft −1 − θt = 0 by the unbiasedness of θ˜t for θt conditional on Ft −1 . Using the law of iterated expectations we obtain E (C (Xit ) − C (θt )) · θ˜t − θt
= 0, and thus E 1L θ˜t , Xt = E [1L (θt , Xt )]. (b) When Xit is a realised volatility estimator and θt is the integrated variance or quadratic variation we have θt ∈ Ft and
300
A.J. Patton / Journal of Econometrics 161 (2011) 284–303
Xit , θ˜t
∈ F˜t , which means we cannot employ the above reason ing directly. If we could assume that Corr C (Xit ) − C (θt ) , θ˜t − θt |Ft −1 = 0 ∀i, in addition to E θ˜t |Ft = θt , then we would have E (C (Xit ) − C (θt )) θ˜t − θt |Ft −1 = E [C (Xit ) − C (θt ) |Ft −1 ] E θ˜t − θt |Ft −1 = E [C (Xit ) − C (θt ) |Ft −1 ] E E θ˜t |Ft −1 , θt − θt |Ft −1 = 0. However it is not true that Corr C (Xit ) − C (θt ) , θ˜t − θt |Ft −1 = 0 for all empirically relevant combinations of RV estimators and volatility proxies. In fact, if Xit = θ˜t and L = MSE, a very natural case to consider, then C (z ) = −z and Corr C (Xit ) − C (θt ) , θ˜t −
= Corr θt − θ˜t , θ˜t − θt |Ft −1 = −1. In general, we should expect Corr C (Xit ) − C (θt ) , θ˜t − θt |Ft −1 ̸= 0. This is the correlation between the error in θ˜t and something similar to the θt |Ft −1
‘‘generalised forecast error’’, see Patton and Timmermann (2010) for example, of Xit . If the proxy, θ˜t , and the RV estimators, Xit , use the same or similar data then their errors will generally be correlated restriction will not hold, and thus and this zero correlation E (C (Xit ) − C (θt )) θ˜t − θt ̸= 0, which breaks the equivalence of the ranking obtained using θ˜t with that using θt .
1L (Yt , Xt ) − E [1L (θt , Xt )]
T t =1
T 1−
T t =1
T t =1
T 1−
T t =1
θt −p−1
1
0
redefine as PZt = Q0 + Q1 Zt −1 + Vt
T t =1
Et Zt +j = I − P −1 Q1
P −1 Q 0
−1
P −1 Q0 + P −1 Q1
j
−1 −1 P Q0 × Zt − I − P −1 Q1 j −1 −1 −1 j = I − P −1 Q1 I − P −1 Q1 P Q0 + P Q1 Zt
1C (Xt ) (Yt − θt )
and so (j)
E t θ t +j = g 0 +
p −
(j)
gi θt +1−i
i=1
′
At ≡ 1L (θt , Xt )′ , 1C (Xt )′ (Yt − θt )
(j)
since E [1C (Xt ) (Yt − θt )] = 0 from part (a). Under Assumptions A1 and A2, Theorem 3 of Politis and Romano (1994) provides:
√
¯ T − E [At ] →d N (0, VA ) T A
T 1−
T t =1
√
(j)
and gi is the (1, i) element of P obtain: E 1C (Xt ) Et θt +j
¯ T − E [At ] →d N (0, Ω1 ) T A
where Ω1 ≡ ι′ VA ι. It should be noted that Assumptions A1 and A2 can hold despite the random walk Assumption T1, if θt and Xit obey a some form of cointegration, linked to the distance measure
−1
j
I − P −1 Q1
−1
j
i=2
so E [1C (Xt ) θt ] =
1
(j)
g1
−
E 1C (Xt ) θ˜t +j −
(j) p − g i
(j)
i=2 g1
P −1 Q 0 ,
Q1 . Next I use this result to
= g0(j) E [1C (Xt )] + g1(j) E [1C (Xt ) θt ] p − (j) + gi E [1C (Xt ) θt +1−i ]
1L (Yt , Xt ) − E [1L (θt , Xt )]
where g0 is the first element of I − P −1 Q1
where VA is the long-run covariance matrix of At . Let ι denote a vector of ones, and note that
′
−1
and so
{1L (Yt , Xt ) − 1L (θt , Xt )}
where
=ι
···
0
E [Zt ] = I − P −1 Q1
≡ A¯ T − E [At ]
T
0
with
1L (θt , Xt ) − E [1L (θt , Xt )]
T 1−
+
0
Zt = P −1 Q0 + P −1 Q1 Zt −1 + P −1 Vt
1L (θt , Xt ) − E [1L (θt , Xt )]
T 1−
+
√
∑J ∑J (Yt − θt ) = j=1 ωj E 1C (Xt ) θ˜t +j − θt = j=1 ωj E 1C (Xt ) ∑ θt +j − θt = Jj=1 ωj E 1C (Xt ) Et θt +j −θt under P1 and P2. Allowing for J > 1 requires computing j (> 1)-step ahead forecasts from an AR(p) process, Et θt +j . This is simplified by using the companion form for the AR(p) process governing θt : 1 −φ1 · · · −φp θt 1 ··· 0 θt −1 0 . .. .. .. .. . . . . . . θt −p 0 0 ··· 1 θt −1 φ0 0 0 ··· 0 νt 0 1 0 · · · 0 θt −2 0 = .. + .. .. . . . .. .. + .. . . . . . .
so
T 1−
=
Proof of Proposition 3. (a) Using the second-order mean-value expansion of the loss function from the proof of Proposition 4, we obtain E [1L (Yt , Xt )] = E [1L (θt , Xt )] + β, where β ≡ E 1C (Xt )
Proof of Proposition 2. (a) See the proof of part (a) Proposition 4 and set Gt −1 to be the trivial information set. (b) Note that
=
employed. If MSE is employed, T1, A1 and A2 require that these variables obey standard linear cointegration, with cointegrating vector [1, −1]. For other distance measures a form of non-linear cointegration must hold. (c) Follows directly from Theorem 3 of Politis and Romano (1994), under the additional Assumption B1.
(j)
g0
(j)
g1
E [1C (Xt )]
E [1C (Xt ) θt +1−i ]
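As a concrete illustration of this step, the sketch below computes the coefficients g0^(j) and gi^(j) from the AR(p) parameters via the companion form above; the function name is illustrative, and stationarity of the AR(p) process is assumed so that (I − P^{−1}Q1) is invertible.

```python
import numpy as np

def g_coefficients(phi, j):
    """phi = [phi_0, phi_1, ..., phi_p]; returns g_0^(j) and (g_1^(j), ..., g_p^(j))."""
    phi0, phis = phi[0], np.asarray(phi[1:])
    p = len(phis)
    P = np.eye(p + 1)
    P[0, 1:] = -phis                               # first row: [1, -phi_1, ..., -phi_p]
    Q0 = np.zeros(p + 1); Q0[0] = phi0
    Q1 = np.zeros((p + 1, p + 1))
    Q1[np.arange(1, p + 1), np.arange(p)] = 1.0    # ones on the first sub-diagonal
    A = np.linalg.solve(P, Q1)                     # P^{-1} Q1
    b = np.linalg.solve(P, Q0)                     # P^{-1} Q0
    Aj = np.linalg.matrix_power(A, j)
    g0 = ((np.eye(p + 1) - Aj) @ np.linalg.solve(np.eye(p + 1) - A, b))[0]
    return g0, Aj[0, :p]                           # last column of A^j is zero
```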
which yields

E[ΔC(Xt)(θ̃_{t+j} − θt)] = (1 − 1/g1^(j)) E[ΔC(Xt) θ̃_{t+j}] + (g0^(j)/g1^(j)) E[ΔC(Xt)] + Σ_{i=2}^p (gi^(j)/g1^(j)) E[ΔC(Xt) θ̃_{t+1−i}],

since E[ΔC(Xt) θ_{t+1−i}] = E[ΔC(Xt) θ̃_{t+1−i}] for i ≥ 2 under Assumption R1. With this result we can now compute β:

β = Σ_{j=1}^J ωj E[ΔC(Xt)(Et[θ_{t+j}] − θt)]
  = Σ_{j=1}^J ωj (1 − 1/g1^(j)) E[ΔC(Xt) θ̃_{t+j}] + Σ_{j=1}^J ωj (g0^(j)/g1^(j)) E[ΔC(Xt)] + Σ_{j=1}^J ωj Σ_{i=2}^p (gi^(j)/g1^(j)) E[ΔC(Xt) θ̃_{t+1−i}].

Substituting in this expression for β, we thus have

E[ΔL(θt, Xt)] = E[ΔL(Yt, Xt)] − Σ_{j=1}^J ωj (1 − 1/g1^(j)) E[ΔC(Xt) θ̃_{t+j}] − Σ_{j=1}^J ωj (g0^(j)/g1^(j)) E[ΔC(Xt)] − Σ_{j=1}^J ωj Σ_{i=2}^p (gi^(j)/g1^(j)) E[ΔC(Xt) θ̃_{t+1−i}].

(b) This is proved by invoking a multivariate CLT for the sample mean of the loss differentials using the true volatility and all of the elements that enter into the estimated bias term, β̂T. This collection of elements is defined as:

Bt ≡ [ΔL(θt, Xt)′, ΔC(Xt)′, ΔC(Xt)′ θ̃_{t+1}, …, ΔC(Xt)′ θ̃_{t+J}, ΔC(Xt)′ θ̃_{t−1}, …, ΔC(Xt)′ θ̃_{t−p+1}, θ̃t, θ̃t θ̃_{t+1}, …, θ̃t θ̃_{t+2p}]′   (35)

and with Assumptions A1 and A2 applied to Bt we have √T(B̄T − E[Bt]) →d N(0, VB) using Theorem 3 of Politis and Romano (1994).

Note that the last 2p + 1 elements of B̄T are sufficient to obtain estimates of the mean and the first 2p autocovariances of θt, since E[θ̃t] = E[θt] by Assumption P1, and E[θ̃t θ̃_{t+j}] = E[(θt + νt)(θ_{t+j} + ν_{t+j})] = E[θt θ_{t+j}] by Assumptions P1 and T2. Let γj ≡ Cov[θt, θ_{t−j}]; then by the properties of an AR(p) process we have Ψφ = ψ, where

Ψ ≡ [ γp      γ_{p−1}  ⋯  γ1 ]
    [ γ_{p+1} γp       ⋯  γ2 ]   (36)
    [ ⋮                ⋱  ⋮  ]
    [ γ_{2p−1} γ_{2p−2} ⋯ γp ]

ψ = [γ_{p+1}, γ_{p+2}, …, γ_{2p}]′,  φ = [φ1, …, φp]′,

and by Assumption T2 we can obtain φ̂ = Ψ̂^{−1} ψ̂, where Ψ̂ and ψ̂ are the equivalents of Ψ and ψ using sample autocovariances rather than population autocovariances. From φ̂ we can obtain estimates of P, Q0, and Q1 and thus estimates of the parameters gi^(j), for i = 0, 1, …, p and j = 1, 2, …, J; from these we can compute B̄T and the estimated bias term β̂T. Given asymptotic normality of B̄T and the fact that (1/T) Σ_{t=1}^T ΔL(Yt, Xt) − β̂T is a smooth function of the elements of B̄T, we can then apply the delta method, see Lemma 2.5 of Hayashi (2000) for example, to obtain asymptotic normality of (1/T) Σ_{t=1}^T ΔL(Yt, Xt) − β̂T and obtain its covariance matrix. (c) Follows directly from Theorem 4 of Politis and Romano (1994), under the additional Assumption B1.

Proof of Proposition 4. (a) Consider again a second-order mean-value expansion of the pseudo-distance measure L(Yt, X_it) given in Eq. (5) around (θt, X_it):

L(Yt, X_it) = L(θt, X_it) + [∂L(θt, X_it)/∂θ](Yt − θt) + ½ [∂²L(θ̈t, X_it)/∂θ²](Yt − θt)²
            = L(θt, X_it) + (C(X_it) − C(θt))(Yt − θt) − ½ C′(θ̈t)(Yt − θt)²,

where θ̈t = λt θt + (1 − λt) Yt for some λt ∈ [0, 1], using the functional form of L in Eq. (5). Thus

ΔL(Yt, Xt) = ΔL(θt, Xt) + ΔC(Xt)(Yt − θt),

where

ΔC(Xt) ≡ [C(X_1t) − C(X_2t), …, C(X_1t) − C(X_kt)]′.

Next, note:

E[ΔC(Xt)(Yt − θt) | G_{t−1}] = E[ΔC(Xt) Σ_{i=1}^J ωi (θ̃_{t+i} − θt) | G_{t−1}]
  = E[ΔC(Xt) ( Σ_{i=1}^J ωi Σ_{j=1}^i η_{t+j} + Σ_{i=1}^J ωi ν_{t+i} ) | G_{t−1}]
  = E[ΔC(Xt) ( Σ_{i=1}^J ωi Σ_{j=1}^i E[η_{t+j} | Ft] + Σ_{i=1}^J ωi E[ν_{t+i} | Ft] ) | G_{t−1}] = 0

by the law of iterated expectations, since G_{t−1} ⊂ Ft. This thus yields E[ΔL(θt, Xt) | G_{t−1}] = E[ΔL(Yt, Xt) | G_{t−1}] as claimed.

(b) Using Exercise 5.21 of White (2001) for example, we have √T D̂T^{−1/2}(α̂T − α̃) →d N(0, I), where D̂T is given in the statement of the proposition. To show that α̃ = α, note that ΔL(Yt, Xt) = ΔL(θt, Xt) + ΔC(Xt)(Yt − θt) = α′Z_{t−1} + et + ΔC(Xt)(Yt − θt) ≡ α′Z_{t−1} + ẽt, with E[ẽt Z_{t−1}] = E[et Z_{t−1}] + E[ΔC(Xt)(Yt − θt) Z_{t−1}] = 0, since E[ΔC(Xt)(Yt − θt) Z_{t−1}] = E[E[ΔC(Xt)(Yt − θt) | G_{t−1}] Z_{t−1}] = 0 by part (a), and E[et Z_{t−1}] = 0 under A3. Thus α̃ = α as claimed.
Proof of Proposition 5. (a) We first obtain E 1L (Yt , Xt ) Zt −p using calculations previously presented in the proof of Proposition 3:
302
A.J. Patton / Journal of Econometrics 161 (2011) 284–303
E 1L (Yt , Xt ) Zt −p − E 1L (θt , Xt ) Zt −p J − = ωj E 1C (Xt ) θ˜t +j − θt Zt −p
This collection of elements is:
Dt ≡ 1L (θt , Xt ) Z′t −p , 1C (Xt ) Z′t −p , 1C (Xt ) θ˜t +1 Z′t −p , . . . ,
1C (Xt ) θ˜t +J Z′t −p , . . . , 1C (Xt ) θ˜t −1 Z′t −p , . . . ,
j =1
E 1C (Xt ) θ˜t +j − θt Zt −p p − (j) = g0(j) E 1C (Xt ) Zt −p + gi E 1C (Xt ) θ˜t +1−i Zt −p i =2
+ g1(j) − 1 E 1C (Xt ) θt Zt −p and E 1C (Xt ) θt Zt −p =
1
φ1
E 1C (Xt ) θ˜t +1 Zt −p
φ0 E 1C (Xt ) Zt −p φ1 p − φi − E 1C (Xt ) θ˜t +1−i Zt −p . φ1 i=2 −
Pulling these results together we obtain: E 1L (Yt , Xt ) Zt −p − E 1L (θt , Xt ) Zt −p
=
J −
=
ωj g0(j) E 1C (Xt ) Zt −p
j =1
+
p −
(j)
(j)
gi E 1C (Xt ) θ˜t +1−i Zt −p + g1 − 1
i=2
×
1
φ1
E 1C (Xt ) θ˜t +1 Zt −p −
φ0 E 1C (Xt ) Zt −p φ1
p − φi E 1C (Xt ) θ˜t +1−i Zt −p φ1 i=2 J − φ0 (j) (j) φ0 ωj g0 − g1 + = E 1C (Xt ) Zt −p φ1 φ1 j =1
−
+
′
(37)
and A1 and A2 applied to Dt we have √ with Assumptions ¯ T − E [Dt ] →d N (0, VD ) using Theorem 3 of Politis and T D Romano (1994). As in the proof of Proposition 3, the last 2p + 1 ¯ T are sufficient to obtain estimates of P , Q0 , and Q1 elements of D (j) and thus estimates of the parameters gi , for i = 0, 1, . . . , p and j = 1, 2, . . . , J. With these we obtain the estimated adjustment ˆ i,T , i = 0, 1, . . . , p. Given asymptotic normality of D¯ T and terms λ ˆ T is a smooth function of the elements of D¯ T , we can the fact that α then apply the delta method, see Lemma 2.5 (2000) for of Hayashi ˆ T − α˜ , and obtain its example, to show asymptotic normality of α ˜ = α we use the result from part covariance matrix. To show that α −1 ˜ ≡ E Zt −p Z′ E Zt −p 1L (a) which provides α (θt , Xt ) = t −p
′
−1
E Zt −p 1L (θt , Xt ) ≡ α.
E Zt −p Zt −p (c) Again follows directly from Theorem 4 of Politis and Romano (1994). References
ωj E 1C (Xt ) θ˜t +j − θt Zt −p
j =1 J −
1C (Xt ) θ˜t −p+1 Z′t −p , Z′t , θ˜t , θ˜t θ˜t +1 , . . . , θ˜t θ˜t′+2p ,
p − E 1C (Xt ) θ˜t +1−i Zt −p i=2
φi φi × + ωj gi − g1 φ1 φ1 j =1 (j) J − g1 1 + E 1C (Xt ) θ˜t +1 Zt −p ωj − φ1 φ1 j =1 J −
(j)
(j)
p − ≡ −λ0 E 1C (Xt ) Zt −p − λi E 1C (Xt ) θ˜t +1−i Zt −p i=2
− λ1 E 1C (Xt ) θ˜t +1 Zt −p . Thus as in the proposition, we obtain with 1L (θt ,Xt ) defined E 1L (θt , Xt )Zt −p = E 1L (θt , Xt ) Zt −p . (b) Similar to the proof of Proposition 3(b), this part is proved by invoking a multivariate CLT for the sample mean of the loss differentials using the true volatility and all of the elements that ˆ i,T , i = 0, 1, . . . , p. enter into the estimated adjustment terms, λ
Aït-Sahalia, Y., Mancini, L., 2008. Out of sample forecasts of quadratic variation. Journal of Econometrics 147, 17–33. Aït-Sahalia, Y., Mykland, P., Zhang, L., 2005. How often to sample a continuous-time process in the presence of market microstructure noise. Review of Financial Studies 18, 351–416. Andersen, T.G., Bollerslev, T., 1998. Answering the skeptics: yes, standard volatility models do provide accurate forecasts. International Economic Review 39, 885–905. Andersen, T.G., Bollerslev, T., Christoffersen, P.F., Diebold, F.X., 2006. Volatility and correlation forecasting. In: Elliott, G., Granger, C.W.J., Timmermann, A. (Eds.), Handbook of Economic Forecasting. North Holland Press, Amsterdam. Andersen, T.G., Bollerslev, T., Diebold, F.X., 2007. Roughing it up: including jump components in the measurement, modeling and forecasting of return volatility. Review of Economics and Statistics 89, 701–720. Andersen, T.G., Bollerslev, T., Diebold, F.X., Labys, P., 2001a. The distribution of realized exchange rate volatility. Journal of the American Statistical Association 96, 42–55. Andersen, T.G., Bollerslev, T., Diebold, F.X., Ebens, H., 2001b. The distribution of realized stock return volatility. Journal of Financial Economics 61, 43–76. Andersen, T.G., Bollerslev, T., Diebold, F.X., Labys, P., 2003. Modeling and forecasting realized volatility. Econometrica 71, 579–626. Andersen, T.G., Bollerslev, T., Diebold, F.X., Labys, P., 2000. Great realizations. Risk 13, 105–108. Andersen, T.G., Bollerslev, T., Huang, X., 2011a. A reduced form framework for modeling volatility of speculative prices based on realized variation measures. Journal of Econometrics 160, 176–189. Andersen, T.G., Bollerslev, T., Meddahi, N., 2011b. Realized volatility forecasting and market microstructure noise. Journal of Econometrics 160, 220–234. Andersen, T.G., Bollerslev, T., Meddahi, N., 2004. Analytic evaluation of volatility forecasts. International Economic Review 45, 1079–1110. Bandi, F.M., Russell, J.R., 2008. Microstructure noise, realized variance, and optimal sampling. Review of Economic Studies 75, 339–369. Bandi, F.M., Russell, J.R., 2006a. Separating microstructure noise from volatility. Journal of Financial Economics 79, 655–692. Bandi, F.M., Russell, J.R., 2006b. Comment on ‘‘realized variance and microstructure noise’’. Journal of Business and Economic Statistics 24, 167–173. Bandi, F.M., Russell, J.R., 2011. Market microstructure noise, integrated variance estimators, and the accuracy of asymptotic approximations. Journal of Econometrics 160, 145–159. Bandi, F.M., Russell, J.R., Yang, C., 2007. Realized volatility forecasting in the presence of time-varying noise. Working Paper. Graduate School of Business, University of Chicago. Barndorff-Nielsen, O.E., Hansen, P.R., Lunde, A., Shephard, N., 2008. Designing realised kernels to measure the ex-post variation of equity prices in the presence of noise. Econometrica 76, 1481–1536. Barndorff-Nielsen, O.E., Hansen, P.R., Lunde, A., Shephard, N., 2009. Realized kernels in practice. Econometrics Journal 12, 1–32. Barndorff-Nielsen, O.E., Nielsen, B., Shephard, N., Ysusi, C., 2004. Measuring and forecasting financial variability using realised variance. In: Harvey, A., Koopman, S.J., Shephard, N. (Eds.), State Space and Unobserved Components Models: Theory and Applications. Cambridge University Press. Barndorff-Nielsen, O.E., Shephard, N., 2002. Econometric analysis of realized volatility and its use in estimating stochastic volatility models. 
Journal of the Royal Statistical Society, Series B 64, 253–280.
A.J. Patton / Journal of Econometrics 161 (2011) 284–303 Barndorff-Nielsen, O.E., Shephard, N., 2004a. Econometric analysis of realized covariation: high frequency based covariance, regression and correlation in financial economics. Econometrica 72, 885–925. Barndorff-Nielsen, O.E., Shephard, N., 2004b. Power and bipower variation with stochastic volatility and jumps. Journal of Financial Econometrics 2, 1–48. Barndorff-Nielsen, O.E., Shephard, N., 2007. Variation, jumps, market frictions and high frequency data in financial econometrics. In: Blundell, R., Torsten, P., Newey, W.K. (Eds.), Advances in Economics and Econometrics. Theory and Applications, Ninth World Congress. In: Econometric Society Monographs, Cambridge University Press, pp. 328–372. Bollerslev, T., Engle, R.F., Nelson, D.B., 1994. ARCH models. In: Engle, R.F., McFadden, D. (Eds.), Handbook of Econometrics. North Holland Press, Amsterdam. Christensen, K., Podolskij, M., 2007. Realized range-based estimation of integrated variance. Journal of Econometrics 141, 323–349. Corsi, F., 2009. A simple approximate long-memory model of realized volatility. Journal of Financial Econometrics 1–23. Diebold, F.X., Mariano, R.S., 1995. Comparing predictive accuracy. Journal of Business and Economic Statistics 13, 253–263. Ding, Z., Granger, C.W.J., Engle, R.F., 1993. A long memory property of stock market returns and a new model. Journal of Empirical Finance 1, 83–106. Easley, D., O’Hara, M., 1992. Time and the process of security price adjustment. Journal of Finance 47, 577–605. Engle, R.F., 2000. The econometrics of ultra high frequency data. Econometrica 68, 1–22. Engle, R.F., Patton, A.J., 2001. What good is a volatility model? Quantitative Finance 1, 237–245. Gatheral, J., Oomen, R.C.A., 2010. Zero-intelligence realized variance estimation. Finance and Stochastics 14, 249–283. Ghysels, E., Sinko, A., 2011. Volatility forecasting and microstructure noise. Journal of Econometrics 160, 257–271. Giacomini, R., White, H., 2006. Tests of conditional predictive ability. Econometrica 74, 1545–1578. Gonçalves, S., Meddahi, N., 2009. Bootstrapping realized volatility. Econometrica 77, 283–306. Hansen, P.R., 2005. A test for superior predictive ability. Journal of Business and Economic Statistics 23, 365–380. Hansen, P.R., Lunde, A., 2005. A forecast comparison of volatility models: does anything beat a GARCH(1,1)? Journal of Applied Econometrics 20, 873–889. Hansen, P.R., Lunde, A., 2006a. Realized variance and market microstructure noise. Journal of Business and Economic Statistics 24, 127–161. Hansen, P.R., Lunde, A., 2006b. Consistent ranking of volatility models. Journal of Econometrics 131, 97–121. Hansen, P.R., Lunde, A., 2010. Estimating the persistence and the autocorrelation function of a time series that is measured with error. Working Paper. Hansen, P.R., Lunde, A., Nason, J.M., 2010. The model confidence set. Econometrica (forthcoming). Hayashi, F., 2000. Econometrics. Princeton University Press, New Jersey. Huang, X., Tauchen, G., 2005. The relative contribution of jumps to total price variance. Journal of Financial Econometrics 3, 456–499. Kalnina, I., Linton, O., 2008. Estimating quadratic variation consistently in the presence of correlated measurement error. Journal of Econometrics 147, 47–59.
303
Large, J., 2011. Estimating quadratic variation when quoted prices change by a constant increment. Journal of Econometrics 160, 2–11. Maasoumi, E., McAleer, M., 2008. Realized volatility and long memory: an overview. Econometric Reviews 27, 1–9. Manganelli, S., 2005. Duration, volume and volatility impact of trades. Journal of Financial Markets 8, 377–399. Martens, M., van Dijk, D., 2007. Measuring volatility with the realized range. Journal of Econometrics 138, 181–207. Meddahi, N., 2003. ARMA representation of integrated and realized variances. Econometrics Journal 6, 334–355. Newey, W.K., West, K.D., 1987. A simple, positive semidefinite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–708. Oomen, R.C.A., 2006. Properties of realized variance under alternative sampling schemes. Journal of Business and Economic Statistics 24, 219–237. Owens, J., Steigerwald, D., 2007. Noise reduced realized volatility: a Kalman filter approach. In: Terrell, D., Fomby, T., Hill, R.C. (Eds.), Econometric Analysis of Financial and Economic Time Series, Part A. Elsevier. Patton, A.J., 2011. Volatility forecast comparison using imperfect volatility proxies. Journal of Econometrics 160, 246–256. Patton, A.J., Sheppard, K., 2009a. Evaluating volatility and correlation forecasts. In: Andersen, T.G., Davis, R.A., Kreiss, J.-P., Mikosch, T. (Eds.), Handbook of Financial Time Series. Springer Verlag. Patton, A.J., Sheppard, K., 2009b. Optimal combinations of realised volatility estimators. International Journal of Forecasting 25, 218–238. Patton, A.J., Sheppard, K., 2010. Estimating volatility using high frequency data: an analysis across multiple markets. Work in Progress. Patton, A.J., Timmermann, A., 2010. Generalized forecast errors, a change of measure, and forecast optimality conditions. In: Bollerslev, T., Russell, J.R., Watson, M.W. (Eds.), Volatility and Time Series Econometrics: Essays in Honor of Robert F. Engle. Oxford University Press. Politis, D.N., Romano, J.P., 1994. The stationary bootstrap. Journal of the American Statistical Association 89, 1303–1313. Poon, S.-H., Granger, C.W.J., 2003. Forecasting volatility in financial markets. Journal of Economic Literature 41, 478–539. Romano, J.P., Wolf, M., 2005. Stepwise multiple testing as formalized data snooping. Econometrica 73, 1237–1282. Tauchen, G., Zhou, H., 2011. Realized jumps on financial markets and predicting credit spreads. Journal of Econometrics 160, 102–116. West, K.D., 1996. Asymptotic inference about predictive ability. Econometrica 64, 1067–1084. White, H., 2000. A reality check for data snooping. Econometrica 68, 1097–1126. White, H., 2001. Asymptotic Theory for Econometricians. Academic Press, USA. Wright, J.H., 1999. Testing for a unit root in the volatility of asset returns. Journal of Applied Econometrics 14, 309–318. Zhang, L., 2006. Efficient estimation of stochastic volatility using noisy observations: a multi-scale approach. Bernoulli 12, 1019–1043. Zhang, L., Mykland, P.A., Aït-Sahalia, Y., 2005. A tale of two time scales: determining integrated volatility with noisy high-frequency data. Journal of the American Statistical Association 100, 1394–1411. Zhou, B., 1996. High-frequency data and volatility in foreign-exchange rates. Journal of Business and Economic Statistics 14, 45–52.
Journal of Econometrics 161 (2011) 304–324
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Predictive density construction and accuracy testing with multiple possibly misspecified diffusion models✩ Valentina Corradi a,∗ , Norman R. Swanson b a
Department of Economics, University of Warwick, Coventry CV4 7AL, UK
b
Department of Economics, Rutgers University, 75 Hamilton Street, New Brunswick, NJ 08901, USA
article
info
Article history: Received 19 April 2009 Received in revised form 8 July 2010 Accepted 23 December 2010 Available online 29 December 2010 JEL classification: C22 C51
abstract This paper develops tests for comparing the accuracy of predictive densities derived from (possibly misspecified) diffusion models. In particular, we first outline a simple simulation-based framework for constructing predictive densities for one-factor and stochastic volatility models. We then construct tests that are in the spirit of Diebold and Mariano (1995) and White (2000). In order to establish the asymptotic properties of our tests, we also develop a recursive variant of the nonparametric simulated maximum likelihood estimator of Fermanian and Salanié (2004). In an empirical illustration, the predictive densities from several models of the one-month federal funds rates are compared. © 2011 Elsevier B.V. All rights reserved.
Keywords: Block bootstrap Diffusion processes Jumps Nonparametric simulated quasi maximum likelihood Parameter estimation error Recursive estimation Stochastic volatility
1. Introduction Correct specification of models describing dynamics of financial assets is crucial for everything from pricing bonds and derivative assets to designing appropriate hedging strategies. Hence, it is of little surprise that there has been considerable attention given to the issue of testing for the correct specification of diffusion models. In this paper, we do not construct specification tests in the usual
✩ Corradi gratefully acknowledges ESRC grant RES-000-23-0006 and RES-06223-0311, and Swanson acknowledges financial support from a Rutgers University Research Council grant. We would like to thank the co-editor, Ron Gallant, a referee, Federico Bandi, Marine Carrasco, Javier Hidalgo, Antonio Mele, Andrew Patton, Eric Renault, John Rust and the seminar participants at the annual joint CORE, ECARES, and KU Leuven econometrics workshop, the 2007 Summer Meeting of the Econometric Society, the Marseille conference in honor of Russell Davidson, as well as faculty at the following universities: University of Montreal, LSE, Michigan State University, New York University, University of Chicago GBS, and the University of Maryland, for their useful comments on earlier versions of this paper. Additionally, we would like to thank Lili Cai for excellent research assistance. ∗ Corresponding author. E-mail addresses:
[email protected] (V. Corradi),
[email protected] (N.R. Swanson).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2010.12.009
sense, but instead assume that all models are (possibly) misspecified and outline a simulation-based methodology for comparing the accuracy of predictive densities based on alternative models. To place this paper in the correct historical context, note that a first generation of specification testing papers, initiated by the work of Aït-Sahalia (1996), compares the marginal densities implied by hypothesized null models with nonparametric estimates thereof, for the case of one-factor models (see also Pritsker (1998) and Jiang (1998)). While one-factor models may in some cases provide a reasonable representation for shortterm interest rates, there is a somewhat widespread consensus that stock returns and term structures are better modeled using multifactor diffusions. To take this into account, Corradi and Swanson (2005a) outline a test for comparing the cumulative distribution (marginal or joint) implied by a hypothesized null model with the corresponding empirical distribution. Their test can be used in the context of multidimensional and/or multifactor models. Needless to say, tests based on the comparison of marginal distributions have no power against iid alternatives with the same marginal, while tests based on the comparison of joint distributions do not suffer from this problem. Nevertheless, correct specification of the joint distribution is not equivalent to
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
that of the conditional; and hence focus in the literature now centers on comparing conditional distributions. When considering conditional distributions, a key difficulty that arises stems from the fact that knowledge of the drift and variance terms of a diffusion process does not in turn imply knowledge of the transition density, in general. Indeed, if the functional form of the transition density were known, one could test the hypothesis of correct specification of a diffusion via the probability integral transform approach of Diebold et al. (1998); the cross-spectrum approach of Hong (2001), Hong et al. (2002) and Hong and Li (2005); the martingalization-type Kolmogorov test of Bai (2003); or via the normality transformation approaches of Bontemps and Meddahi (2005) and Duan (2003). Furthermore, for the case in which the transition density is unknown, tests could be constructed by comparing the kernel (conditional) density estimator of the actual and simulated data, as in Altissimo and Mele (2009) and Thompson (2008); by comparing the conditional distribution of the simulated and of the historical data, as in Bhardwaj et al. (2008); or by using the approaches of Aït-Sahalia (2002) and Aït-Sahalia et al. (2009), where closed form approximations of conditional densities under the null are compared with data-driven kernel density estimates. All of the papers cited above deal with testing for the correct specification of a given diffusion model. Nevertheless, and as alluded to above, we believe that all models are probably best viewed as approximations of reality and, thus, are likely to be misspecified. Therefore, we focus on choosing the ‘‘best’’ model from amongst (multiple) misspecified alternatives. Moreover, the ‘‘best’’ model is selected by constructing tests that compare both predictive densities and/or predictive conditional confidence intervals associated with alternative models. Our approach is to measure accuracy using a distributional generalization of mean square error, as defined in Corradi and Ď Swanson (2005b). Namely, let Fkτ (u|Xt , θk ) be the distribution of Xt +τ given Xt , evaluated at u, implied by diffusion model k, and let F0τ (u|Xt , θ0 ) be the distribution associated with the underlying and unknown ‘‘true’’ model. Now, choose model k over model Ď Ď 1, say, if E ((Fkτ (u|Xt , θk ) − F0τ (u|Xt , θ0 ))2 ) < E ((F1τ (u|Xt , θ1 ) τ 2 − F0 (u|Xt , θ0 )) ). Our tests can be viewed as distributional generalizations of both Diebold and Mariano (1995) and White Ď (2000). Note that if we knew Fkτ (u|Xt , θk ) in closed form, then we could proceed as in Corradi and Swanson (2006a,b). However, the functional form of the model implied conditional distribution is unknown in closed form, in general, and hence we rely on a simulation-based approach to facilitate testing. As is customary in the out-of-sample evaluation literature, the sample of T observations is split into two subsamples, such that T = R + P , where only the last P observations are used for predictive evaluation. We first simulate P − τ τ -step ahead paths, using XR , . . . , XR+P −τ as starting values. Then, a scaled difference between the conditional distribution, estimated with historical as well as simulated data, is used to construct our test statistic. One complication that arises in this setup is that for the case of stochastic volatility (SV) models, the initial value of the volatility process is unobserved. 
To overcome this problem, it suffices to simulate the process using different random initial values for the volatility process. Thereafter, one simply constructs the empirical distribution of the asset price process for any given initial value of the volatility process and takes an average over the latter. This integrates out the effect of the volatility initial value. The limiting distributions of the suggested statistics are shown to be (functionals of) Gaussian processes with covariance kernels that reflect the contribution of recursive parameter estimation error. In order to provide asymptotically (first-order) valid critical values, we introduce a new bootstrap procedure that mimics the contribution of parameter estimation error in a recursive setting. This is achieved by establishing consistency and asymptotic
305
normality of nonparametric simulated quasi maximum likelihood (NPSQML) estimators of (possibly misspecified) diffusion models, in a recursive setting, and by establishing the first-order validity of their bootstrap analogs. Of final note is that we test the same null hypothesis as Corradi and Swanson (2006a), and we estimate empirical conditional distributions using both historical and simulated data, as in Bhardwaj et al. (2008). However, there are many differences between those papers and this one. Five such differences are the following. First, we show the asymptotic equivalence of recursive NPSQMLE (Nonparametric Simulated Quasi Maximum Likelihood Estimators) and recursive QMLE. Second, we show the asymptotic equivalence of recursively estimated NPSQMLE and recursive QMLE for partially unobservable multidimensional diffusions (e.g. for stochastic volatility models). This extends in a non-trivial manner the NPSQMLE of Fermanian and Salanié (2004). Third, we establish the first order validity of bootstrap critical values for recursive NPSQMLE, in the case of both observable and partially unobservable diffusions. To the best of our knowledge, there are no available results on bootstrapping NPSQMLE. Fourth, we allow for jumps in the return process, and we recursively estimate the intensity and the parameters of the jump size density. Finally, we develop Diebold–Mariano type Reality Check tests for cases where (a) the CDF is not known in closed form, and (b) data are generated by partially unobservable jump diffusion processes. The rest of the paper is organized as follows. In Section 2, we define the setup. Section 3 outlines the testing procedure for choosing between m ⩾ 2 models and establishes the asymptotic properties thereof. In Section 4, we develop a recursive version of the NPSQML estimator of Fermanian and Salanié (2004) and outline conditions under which asymptotic equivalence between NPSQML and the corresponding recursive QMLE obtains. An empirical illustration is provided in Section 5, in which various models of the effective federal funds rate are compared. All proofs are collected in an Appendix. Hereafter, let P ∗ denote the probability law governing the resampled series, conditional on the (entire) sample, let E ∗ and Var∗ denote the mean and variance operators associated with P ∗ . Further, let o∗P (1) Pr −P denote a term converging to zero in P ∗ -probability, conditional on the sample except a subset of probability measure approaching zero. Finally, let O∗P (1) Pr −P denote a term which is bounded in P ∗ -probability, conditional on the sample, and for all samples except a subset with probability measure approaching zero. 2. Set-up First, consider m one factor jump diffusion models. Namely, for k = 1, . . . , m consider1 : X (t− ) =
t
∫
Ď bk (X (s− ), θk )ds
− λk t
∫
yφk (y)dy Y
0 t
∫ + 0
σk (X (s− ), θkĎ )dW (s) +
Jk,t −
yk,j ,
j =1
where Jk,t is a Poisson process with intensity parameter λk , λk finite, and the jump size, yk,j , is iid with marginal distribution given by φk . Both Jk,t and yk,j are assumed to be independent of the driving Brownian motion, W (t ). Also, note that Y yφk (y)dy denotes the mean jump size under model k, hereafter denoted by µy,k . The case of no jumps corresponds to Jk,t = 0 for all t, and λk = 0. Note that over a unit time interval, there are on average λk
1 Hereafter, X (t ) denotes the cadlag (right continuous with left limit) for t ∈ −
R+ , while Xt denotes the discrete skeleton for t = 1, 2, . . ..
306
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
jumps; so that over the time span [0, t ], there are on average λk t jumps. The dynamics of X (t− ) is then given by: Ď
ϑk,t ,N ,h
Ď
dX (t ) = (bk (X (t− ), θk ) − λk µy,k )dt + σk (X (t− ), θk )dW (t )
∫
yp(dy, dt ),
+
(1)
Y
where p(dy, dt ) is a random Poisson measure giving point mass at y if a jump occurs in the interval dt. Hereafter, let ϑk = (θk , λk , µy,k ). If model k is correctly specified, then bk (X (t− ), θkĎ ) = Ď
b0 (X (t− ), θ0 ), σk (X (t− ), θk ) = σ0 (X (t− ), θ0 ), λk = λ0 , and φk =
φ0 . Now, let
Ď Fk (u|Xt , ϑk )
τ
= P τ Ď (Xt +τ ≤ u|Xt ) (i.e., Fkτ (u|Xt , ϑkĎ ) ϑk
defines the conditional distribution of Xt +τ , given Xt , and evaluated at u, under the probability law generated by model k). Analogously, define F0τ (u|Xt , ϑ0 ) = Pϑτ 0 (Xt +τ ≤ u|Xt ) to be the ‘‘true’’ conditional distribution. We measure model accuracy in terms of a distributional analog of mean square error. In particular, model 1 is defined to be more accurate than model k if: Ď
Ď
E (((F1τ (u2 |Xt , ϑ1 ) − F1τ (u1 |Xt , ϑ1 )) − (F0τ (u2 |Xt , ϑ0 )
− F0τ (u1 |Xt , ϑ0 )))2 ) < E (((Fkτ (uτ2 |Xt , ϑkĎ ) − Fkτ (uτ1 |Xt , ϑkĎ )) τ
τ
ϑk,R,N ,h
Ď
P − τ , where ϑk,t ,N ,h is an estimator of ϑk computed using all observations up to time t, P + R = T , N is the number of simulation paths used in estimation, and h is the discretization interval. Hence, prediction errors should be constructed as follows. Simulate P − τ paths of length τ , using XR+1 , . . . , XR+P −τ as starting values and using the recursively estimated parameters, ϑk,t ,N ,h , t = R, . . . , R+ P − τ . Then, construct the empirical distribution of the series simulated under model k. Then, test statistics are constructed relying on the fact that, under some regularity conditions, as discussed in Bhardwaj et al. (2008): ϑk,t ,N ,h
pr
1{u1 ≤ Xk,t +τ ,i (Xt ) ≤ u2 } → F
Ď
ϑ Xk,kt +τ (Xt )
(u2 ) − F
Ď
ϑ Xk,kt +τ (Xt )
t = R, . . . , T − τ , where F
(u1 ), (2)
ϑ
Ď
ϑ Xk,kt +τ (Xt )
the marginal distribution of Xt +τ (Xt ) is the distribution of Xt +τ under model k, conditional on the values observed at time t. Thus, ϑ
Ď
ϑ Xk,kt +τ (Xt )
t ,N ,h (u) = Fkτ (u|Xt , ϑkĎ ). In the above expression, Xk,kt,+τ ,i (Xt )
is generated according to a Milstein scheme, where ϑk,t ,N ,h
ϑk,t ,N ,h
X(q+1)h − Xqh
ϑ
dX (t ) dV (t )
1
ϑk,t ,N ,h
, θk,t ,N ,h )h + σk (Xqh 2
=
Ď
b1,k (X (t ), θk ) Ď
b2,k (V (t ), θk )
+
j =1
yk,j 1{qh ≤ Uj ≤ (q + 1)h},
σ11,k (V (t ), θkĎ ) dW (t ) 1 0
σ12,k (V (t ), θkĎ ) dW2 (t ), σ22,k (V (t ), θkĎ )
(4)
where W1,t and W2,t are independent standard Brownian motions. Following a generalized Milstein scheme (see, for example, Eq. (3.3), pp. 346 in Kloeden and Platen 1999), for models k = 1, 2, . . . , m, and for θk,t ,N ,S ,h an estimator of θkĎ : θk,t ,N ,S ,h
X(q+1)h
θ
θ
k,t ,N ,S ,h = Xqhk,t ,N ,S ,h + b1,k (Xqh , θk,t ,N ,S ,h )h
θ + σ11,k (Vqhk,t ,N ,S ,h , θk,t ,N ,S ,h )ϵ1,(q+1)h
1
θ
+ σ22,k (Vqhk,t ,N ,S ,h , θi ) 2
, θk,t ,N ,h )
×
θ
∂σ12,k (Vqhk,t ,N ,S ,h , θk ) ∂V
ϑ
+
dt +
+ σ12,k (Vqhk,t ,N ,S ,h , θk )ϵ2,(q+1)h
× σk′ (Xqhk,t ,N ,h , θk,t ,N ,h )ϵ(2q+1)h − λk µy,k h Jk −
θ
= bk (Xqhk,t ,N ,h , θk,t ,N ,h )h 2
× σk (Xqh
1 ϑ ϑ + σk (Xqhk,t ,N ,h , θk,t ,N ,h )ϵ(q+1)h − σk′ (Xqhk,t ,N ,h , θk,t ,N ,h ) ϑk,t ,N ,h
ϑk,R+j,N ,h
that it enables the comparison of simulated values Xk,R+j+τ ,i (XR+j ) with actual values that are τ periods ahead (i.e., XR+j+τ ), for j = 1, . . . , P − τ + 1. In this manner, we are able to propose tests for simulation based on ex-ante predictive density comparison. Turning now to the case of SV models, whenever both intensity and jump size are non-state dependent, a jump component can be simulated and added to either the return and/or the volatility process in the same manner as above. Therefore, for the sake of simplicity, we consider SV models without jumps in the sequel. Extension to general multidimensional and multifactor models both with and without jumps follows directly. Finally, note that as we are considering the case of no jumps, parameters and estimators will be denoted by θ instead of ϑ . Consider model k, k = 1, . . . , m, defined as follows:
Ď
Ď ϑk
ϑk,T −τ ,N ,h
Now, proceed by constructing Xk,R+τ ,i (XR ), . . . , Xk,T ,i (XT −τ ), where i = 1, . . . , N. This yields an N × (P − τ + 1) matrix of simulated values. The key feature of this setup is
k (u) is the marginal distribution of Xt +τ (Xt ) implied
by k model (i.e., by the model used to simulate the series), conditional on the (simulation) starting value Xt . Furthermore,
F
ϑk,t ,N ,h
is that to generate Xk,t +τ ,i (Xt ), i = 1, . . . , N, for t = R, . . . , T − τ , we must use (for each t) the same set of randomly drawn errors as well as the same draws for numbers of jumps, jump times and jump sizes. Thus, only the starting value used to initialize the simulations changes. More precisely, the errors used in simulation are defined iid
2
This measure defines a norm and implies a standard goodness of fit measure (see, for example, Corradi and Swanson 2005b). Recalling that E (1{u1 ≤ Xt +τ ≤ u2 }|Xt ) = F0τ (u2 |Xt , ϑ0 ) − F0τ (u1 |Xt , ϑ0 ), it is straightforward to construct a sequence of P − τ τ -step ahead prediction errors under model k as 1{u1 ≤ Xt +τ ≤ u2 } − (Fkτ (u2 |Xt , ϑk,t ,N ,h ) − Fkτ (u1 |Xt , ϑk,t ,N ,h )), for t = R, . . . , R +
N i=1
Xt in Xk,t +τ ,i (Xt ) denotes that the starting value for the simulation is Xt . Note that the last term on the right-hand side (RHS) of (3) is nonzero whenever we have one (or more) jump realization(s) in the interval [(q − 1)h, qh]. Moreover, as neither the intensity nor the jump size is state dependent, the jump component can be simulated without any discretization error, as follows. Begin by making a draw from a Poisson distribution with intensity parameter λk τ , say Jk . This gives a realization for the number of jumps over the simulation time span. Then, draw Jk uniform random variables over [0, τ ], and sort them in ascending order so that U1 ≤ U2 ≤ · · · ≤ UJk . These provide realizations for the Jk jump times. Then, make Jk independent draws from φk , say yk,1 , . . . , yk,Jk . An important feature of this simulation procedure
to be ϵqh,i ∼ N (0, h), with Qh = τ , i = 1, . . . , N.
− (F0 (u2 |Xt , ϑ0 ) − F0 (u1 |Xt , ϑ0 ))) ).
N 1 −
iid
with ϵqh ∼ N (0, h), q = 1, . . . , Q ; and where σ ′ is the derivative of σ (·) with respect to its first argument. Additionally, the argument
θk,t ,N ,S ,h
(3)
ϵ22,(q+1)h θ
∂σ11,k (Vqhk,t ,N ,S ,h , θk )
+ σ22,k (Vqh , θk ) ∂V ∫ (q+1)h ∫ s × dW1,τ dW2,s qh
qh
(5)
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
θk,t ,N ,S ,h
V(q+1)h
θ
θ
k,t ,N ,S ,h , θk )h b2,k (Vqh = Vqhk,t ,N ,S ,h +
1
θ
θ
+ σ22,k (Vqhk,t ,N ,S ,h , θk )ϵ2,(q+1)h + σ22,k (Vqhk,t ,N ,S ,h , θk ) 2
θ
×
∂σ22 (Vqhk,t ,N ,S ,h , θk )
ϵ22,(q+1)h (6) ∂V ∼ N (0, 1), i = 1, 2, E (ϵ1,qh ϵ2,q′ h ) = 0 for all
where h−1/2 ϵi,qh q ̸= q′ , and
b1,k (V , θk,t ,N ,S ,h ) bk (V , θk,t ,N ,S ,h ) = b2,k (V , θk,t ,N ,S ,h ) 1 ∂σ12,k (V , θk,t ,N ,S ,h ) b ( V , θ ) − σ ( V , θ ) k, t , N , S , h 22,k k,t ,N ,S ,h 1,k 2 ∂V . = 1 ∂σ22,k (V , θk,t ,N ,S ,h ) b2,k (V , θk,t ,N ,S ,h ) − σ22,k (V , θk,t ,N ,S ,h ) 2 ∂V The last terms on the RHS of (5) involve stochastic integrals and cannot be explicitly computed. However, they can be approximated, up to an error of order o(h) by (see, for example, Eq. (3.7), pp. 347 in Kloeden and Platen 1999):
∫
(q+1)h
∫
s
dW1,τ
qh
dW2,s
≈h +
1 2
ξ1 ξ2 +
p h −1
2π r =1 r
bk (·, ·)′ and σk (·, ·)′ denote derivatives with respect to the first argument of the function. Assumption A2′ . For j, i = 1, 2, let bj,k (·, ·) and σi,j,k (·, ·) (as ∂σ (V ,θ ) defined in (4)) and σll′ ,k (V , θk ) kι∂ V k be twice continuously differentiable, Lipschitz in the first argument, with a Lipschitz constant independent of θk , and assume that these terms grow at most at a linear rate, uniformly in Θk , for l, l′ , j, ι = 1, 2 and k = 1, . . . , m. Assumption A3. For k = 1, . . . , m: (i) for any fixed h and ϑk ∈ Θk , ϑ Θk compact set in Rdk , Xqhk is geometrically ergodic and β -mixing; ϑ
(ii) Xk,kt +τ ,i is continuously differentiable in the interior of Θk , for ϑ
i = 1, . . . , N; and (iii) ∇ϑk Xk,kt +τ ,i is r-dominated in Θk , uniformly in i for r > 4. Assumption A4. For each model k = 1, . . . , m the parameters ϑk,t ,N ,h admit the following expansion: T −1 1 −
T −1 −
P t =R
P t =R
√
1 ( ϑk,t ,N ,h − ϑkĎ ) = AĎk √
ψk,t ,N ,h (ϑkĎ ) + op (1)
and as P , R, N → ∞ and h → 0,
qh
307
√ ρp (µ1,p ξ2 − µ2,p ξ1 )
T −1 1 −
√
P t =R
√ √ (ς1,r ( 2ξ2 + η2,r ) − ς2,r ( 2ξ1 + η1,r )),
ψk,t ,N ,h (ϑkĎ ) → N (0, VkĎ ), d
Ď
where Vk = limT ,R,N ,h−1 →∞ Var ( √1 P
∑ T −1 t =R
ψk,t ,N ,h (ϑkĎ )).
where for j = 1, 2, ξj , µj,p , ςj,r , ηj,r are iid N (0, 1) random vari∑ 1 ables, ρp = 12 − 2π1 2 pr=1 r12 , and p is such that as h → 0, p → ∞. In order to simulate paths for SV models, proceed as follows:
Assumption A4′ . For each model k = 1, . . . , m the parameters ϑk,t ,N ,S ,h admit the following expansion:
Step 1. Using the schemes in (5) and (6), simulate (P −τ + 1)× S × N paths of length τ , setting the initial values for the observable state variable equal to the initial value Xt , t = R, . . . , R + P − τ , and for each Xt , using the S different starting values for volatility
√
θk,t ,N ,S ,h
(i.e., Vj
, j = 1, . . . , S). Thus, there are S paths rather than one, θk,t ,N ,S ,h
, for each starting value of Xt . For any initial value Xt and Vj t = R + 1, . . . , R + P −τ and j = 1, . . . , S , generate N independent paths of length τ . Also, keep the simulated randomness (i.e.,
ϵ1,qh , ϵ2,qh ,
(q+1)h s ( qh dW1,τ )dW2,s ) constant across the different qh
starting values for the unobservable and observable state variables. θk,t ,N ,S ,h
θk,t ,N ,S ,h
Now, define Xk,t +τ ,i,j (Xt , Vj ) to be the τ -step ahead value for the return series simulated (under model k), at replication i, θk,t ,N ,S ,h
i = 1, . . . , N, using initial values Xt and Vj Step 2. Construct an estimator of F
Ď θ X k (X ) k,t +τ t
using
1 NS
θk,t ,N ,S ,h Vk,t ,j
∑S ∑N j =1
i=1
θk,t ,N ,S ,h
. (u2 ) − F
θk,t ,N ,S ,h
1{u1 ≤ Xk,t +τ ,i,j (Xt , Vk,t ,j
Ď θ X k (X ) k,t +τ t
(u1 )
) ≤ u2 }, where
denotes the value of volatility at time t and at simulation
j, simulated under model k, using parameters θk,t ,N ,S ,h . The asymptotic results in the sequel require the following assumptions. Assumption A1. (i) X (t ), t ∈ ℜ , is a strictly stationary, geometric ergodic β -mixing diffusion; and (ii) Y yp φk (y)dy < ∞ for some p > 2. +
Assumption A2. For k = 1, . . . , m, bk (·, θ Ď ) and σk (·, θ Ď ), as defined in (1), are twice continuously differentiable. Also, bk (·, ·), bk (·, ·)′ , σk (·, ·), and σk (·, ·)′ are Lipschitz in the first argument, with Lipschitz constant independent of θk , where
T −1 1 −
T −1 −
P t =R
P t =R
1 ( ϑk,t ,N ,S ,h − ϑkĎ ) = AĎk √
ψk,t ,N ,S ,h (ϑkĎ ) + op (1)
and as P , R, N , S → ∞ and h → 0, T −1 1 −
√
P t =R
ψk,t ,N ,S ,h (ϑkĎ ) → N (0, VkĎ ), d
Ď
where Vk = limT ,R,N ,S ,h−1 →∞ Var ( √1
P
∑T −1 t =R
ψk,t ,N ,S ,h (ϑkĎ )).
Assumption A1(i) requires the diffusion, X (t ), to be geometrically ergodic and β -mixing. In the case of no jumps, conditions for (geometric) β -mixing for (multivariate) diffusions that can be relatively easily verified are provided by Meyn and Tweedie (1993). Such conditions also suffice for the case of jump diffusions, when both the intensity parameters and the jump sizes are independent of the state of the system. Recently, Masuda (2007) has extended the conditions for β -mixing to the case of jump diffusions in which the intensity parameter is constant, but the size of the jumps is state dependent. Assumptions√A4 and A4 ′ require that the recursively estimated parameters are P-consistent and asymptotically normal, regardless of whether or not the underlying model is misspecified. As outĎ lined in detail in Section 4, a key point here is that E (ψk,t ,N ,h (θk )) Ď
and E (ψk,t ,N ,S ,h (θk )) are o(P −1/2 ), regardless of misspecification. We shall show that NPSQMLE satisfies this requirement. Needless to say, in some cases the transition density is known in closed form and can be used to obtain QML estimators. For example, if the drift and variance terms as well as the intensity of the jump process have affine structures, then there is no need to rely on simulation methods and parameters can be estimated via the use of the conditional empirical characteristic function (see, for example, Singleton 2001).
308
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
H0′ : max (EX ((F
3. Test statistics
k=2,...,m
3.1. One factor models
Ď
ϑ X1,1t +τ (Xt )
τ
( u2 ) − F
Ď
ϑ X1,1t +τ (Xt )
− F0 (u1 |Xt ))) − EX ((F 2
Ď ϑ X k (X ) k,t +τ t
τ
τ
(u1 )) − (F0 (u2 |Xt )
(u2 ) − F
Ď
ϑ Xk,kt +τ (Xt )
(u1 ))
T −τ N − − 1 ϑ1,t ,N ,h 1 Dk,P ,N (u1 , u2 ) = √ 1{u1 ≤ X1,t +τ ,i (Xt ) ≤ u2 } N i =1
2
N i=1
2 ≤ u2 } .
d
H0 , Dk,P ,N (u1 , u2 ) → N (0, Wk (u1 ,u2 )), where Wk (u1 , u2 ) is defined in the Appendix. (ii) Under HA , Pr √1 Dk,P ,N (u1 , u2 ) > ε → 1.
P
Note that Wk (u1 , u2 ) reflects the contribution of the recursive parameter estimation error. The intuitive argument underlying the proof to Theorem 1 is the following. Note that:
N i=1
ϑk,t ,N ,h
1{Xk,t +τ ,i (Xt ) ≤ u} =
+ E (f
ϑ
Ď
θ Xk,kt +τ ,i (Xt )
N 1 −
N i=1
Ď
ϑ
Ď
1{Xk,kt +τ ,i (Xt ) ≤ u} T 1 −
(u)∇ϑk Xk,kt +τ ,i (Xt )) √
P t =R
( ϑk,t ,N ,h − ϑ Ď )
+ oP (1) =F
Ď
ϑ X k (X ) k,t +τ t
(u) + E (f
Ď
ϑ Xk,kt +τ ,i (Xt )
ϑ
Ď
(u)∇ϑk Xk,kt +τ ,i (Xt ))
T 1 −
×√
P t =R
The statistic for testing these hypotheses is:
Corollary 1. Let Assumptions A1–A4 hold. Also, assume that models 1 and k are nonnested for at least one k = 2, . . . , m. If as P , R, N → ∞, h → 0, P /N → 0, h2 P → 0, and P /R → π , where 0 < π < ∞, then: d
max (Dk,P ,N (u1 , u2 ) − µk (u1 , u2 )) → max Zk (u1 , u2 ),
k=2,...,m
k=2,...,m
where, with an abuse of notation, µk (u1 , u2 ) = µ1 (u1 , u2 ) − µk (u1 , u2 ), and Ď
( ϑk,t ,N ,h − ϑ Ď ) + oP (1) + oN (1),
where oN (1) denotes terms approaching zero, as N → ∞. The statement follows by the same argument used in the case in which the closed form of the conditional distribution is known. Note that as N /P → ∞, we can neglect the contribution of simulation error in the asymptotic covariance matrix. Finally, it is easy to see that if P /R → π = 0, then the contribution of parameter estimation error vanishes. In some circumstances, one may be interested in comparing one (benchmark) model against multiple competing models. In this case, the null hypothesis is that no model can outperform the benchmark model. More specifically, setting model 1 as the benchmark, the hypotheses of interest are:
(u2 ) − F
Ď
ϑj Xj,t +τ (Xt )
(u1 ))
− (F0 (u2 |Xt ) − F0 (u1 |Xt )))2 ),
ϑk,t ,N ,h
Theorem 1. Let Assumptions A1–A4 hold. Also, assume that models 1 and k are nonnested. If as P , R, N → ∞, h → 0, P /N → 0, h2 P → 0, and P /R → π , where 0 < π < ∞, then: (i) Under
N 1 −
(u2 )
HA′ : negation of H0′ .
ϑj Xj,t +τ (Xt )
1{u1 ≤ Xk,t +τ ,i (Xt ) ≤ u2 }
− 1{u1 ≤ Xt +τ
Ď ϑ X k (X ) k,t +τ t
(u1 )) − (F0 (u2 |Xt ) − F0 (u1 |Xt )))2 ) ≤ 0
µj (u1 , u2 ) = E (((F
− 1{u1 ≤ Xt +τ ≤ u2 } N 1 −
Ď
ϑ Xk,kt +τ (Xt )
(u1 ))
k=2,...,m
Notice that the hypotheses are stated using a particular interval (i.e., (u1 , u2 ) ∈ U × U) so that the objective is evaluation of predictive densities for a given range of values. The test statistic is:
−
Ď
ϑ X1,1t +τ (Xt )
DMax k,P ,N (u1 , u2 ) = max Dk,P ,N (u1 , u2 ).
− (F0 (u2 |Xt ) − F0 (u1 |Xt ))) = 0 HA : negation of H0 .
P t =R
−F
τ
2
(u2 ) − F
− (F0 (u2 |Xt ) − F0 (u1 |Xt )))2 − EX ((F
First, consider comparing the predictive accuracy of two possibly misspecified diffusion models. The hypotheses of interest are: H0 : EX ((F
Ď
ϑ X1,1t +τ (Xt )
for j = 1, . . . , m, and where (Z1 (u1 , u2 ), . . . , Zm (u1 , u2 )) is an m-dimensional Gaussian random variable for which the associated covariance matrix has kk element given by Wk (u1 , u2 ), as in Theorem 1(i). Critical values for these tests can be obtained using a recursive version of the block bootstrap. When forming block bootstrap samples in the recursive case, observations at the beginning of the sample are used more frequently than observations at the end of the sample. This introduces a location bias to the usual block bootstrap, as under standard resampling with replacement, all blocks from the original sample have the same probability of being selected. Also, the bias term varies across samples and can be either positive or negative, depending on the specific sample. A first-order valid bootstrap procedure for non-simulation-based m- estimators constructed using a recursive estimation scheme is outlined in Corradi and Swanson (2007). Here we extend the results of Corradi and Swanson (2007) by establishing asymptotic results for cases in which simulation-based estimators are bootstrapped in a recursive setting. In order to carry out the bootstrap, begin by resampling b blocks of length l from the full sample, with lb = T . For any given τ , it is necessary to jointly resample Xt , Xt +1 , . . . , Xt +τ . More precisely, let Z t ,τ = (Xt , Xt +1 , . . . , Xt +τ ), t = 1, . . . , T − τ . Now, resample b overlapping blocks of length l from Z t ,τ . This yields Z t ,∗ = (Xt∗ , Xt∗+1 , . . . , Xt∗+τ ), t = 1, . . . , T − τ . Use these data to construct ϑk∗,t ,N ,h . Recall that N is the number of simulated series used to estimate the parameters. Note that as we assume N /P → ∞, simulation error vanishes and there is no need to resample the simulated series. We proceed by assuming the firstorder asymptotic validity of the bootstrap estimator, as outlined in the following assumption (in Section 4 we shall provide primitive conditions under which NPSQMLE and satisfies this assumption). Assumption A5. As P , R, N → ∞ and h → 0, for k = 1, . . . , m:
T ∗ 1 − ∗ P ω : sup PT √ (ϑk,t ,N ,h − ϑk,t ,N ,h ) ≤ v v∈ℜϱ P t =R T 1 − Ď −P √ ( ϑk,t ,N ,h − ϑk ) ≤ v > ε → 0. P t =R
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
It can be seen immediately that A5 ensures that √1
∑T
t =R
P ∑ ( ϑk∗,t ,N ,h − ϑk,t ,N ,h ) has the same limiting distribution as √1P Tt=R ( ϑk,t ,N ,h − ϑ Ď ), conditional on sample, and for all samples except k
a set with probability measure approaching zero. Given this assumption, the appropriate bootstrap statistic is:
T −τ 1 −
N
− ϑ1∗,t ,N ,h 1 D∗k,P ,N (u1 , u2 ) = √ 1{u1 ≤ X1,t +τ ,i (Xt∗ ) ≤ u2 } N i=1 P t =R 2 − 1{u1 ≤ 1 −
Xt∗+τ
T −
T j=1
≤ u2 }
N 1 −
N i=1
− 1{u1 ≤ Xj+τ
ϑ1,t ,N ,h
1{u1 ≤ X1,t +τ ,i (Xj ) ≤ u2 }
309
The above results suggest proceeding in the following manner. For any bootstrap replication, compute the bootstrap statistic (i.e. D∗k,P ,N (u1 , u2 ) or maxk=2,...,m D∗k,P ,N (u1 , u2 )). Perform B bootstrap replications (B large) and compute the percentiles of the empirical distribution of the B bootstrap statistics. Reject H0 , if Dk,P ,N (u1 , u2 ) is less than the (α/2)th-percentile or greater than the (1 − α/2)th-percentile of the bootstrap empirical distribution. This provides a test with asymptotic size α and unit asymptotic power. Furthermore, reject H0′ if maxk=2,...,m Dk,P ,N (u1 , u2 ) is greater than the (1 − α)th-percentile of the bootstrap empirical distribution. Whenever µ1 (u1 , u2 ) = µk (u1 , u2 ), for k = 2, . . . , m (i.e., when all competitors are as good as the benchmark), then the asymptotic size is α . However, whenever µk (u1 , u2 ) > µ1 (u1 , u2 ) for some k, the bootstrap critical values define upper bounds, and the inference drawn on them is conservative. 3.2. Stochastic volatility models
2 ≤ u2 }
The test statistic for comparing two models is: DVk,P ,S ,N (u1 , u2 )
−
N
1 − N i =1
1{u1 ≤
ϑk∗,t ,N ,h Xk,t +τ ,i (Xt∗ )
T −τ 1 −
≤ u2 }
= √
P t =R
S N 1 −−
NS j=1 i=1
2 − 1{u1 ≤ −
Xt∗+τ
T 1−
T j=1
N i=1
− 1{u1 ≤ Xj+τ
1{u1 ≤
ϑk,t ,N ,h Xk,t +τ ,i (Xj )
≤ u2 }
2 ≤ u2 } .
−
S N 1 −−
NS j=1 i=1
− 1{u1 ≤ Xt +τ
v∈ℜϱ
− P (Dk,P ,N (u1 , u2 ) − µk (u1 , u2 ) ≤ v)| > ε → 0.
Corollary 2. Let Assumptions A1–A5 hold. Also, assume that at least one model is nonnested with model 1. If as P , R, N → ∞, h → 0, P /N → 0, h2 P → 0, l → ∞, l/T 1/4 → 0, and P /R → π , where 0 < π < ∞, then:
P ω : sup PT∗ max D∗k,P ,N (u1 , u2 ) ≤ v k=2,...,m v∈ℜϱ − P max (Dk,P ,N (u1 , u2 ) − µk (u1 , u2 )) ≤ v > ε → 0.
k=2,...,m
) ≤ u2 }
2 ≤ u2 } ,
DVk∗,P ,S ,N (u1 , u2 )
T −τ 1 −
1 NS P t =R
= √
S − N −
θ1∗,t ,N ,S ,h
θ1∗,t ,N ,S ,h
1{u1 ≤ X1,t +τ ,i,j (Xt∗ , V1,j
j=1 i=1
2 ≤ u2 } − 1{u1 ≤
−
T 1−
T l =1
Xt∗+τ
≤ u2 }
S N 1 −−
NS j=1 i=1
−
S N 1 −−
NS j=1 i=1
θ1,t ,N ,S ,h
θ1,t ,N ,S ,h
1{u1 ≤ X1,t +τ ,i,j (Xl , V1,j
)
2 ≤ u2 }
≤ u2 } − 1{u1 ≤ Xl+τ
ω : sup |PT∗ (D∗k,P ,N (u1 , u2 ) ≤ v)
θk,t ,N ,S ,h
θk,t ,N ,S ,h
1{u1 ≤ Xk,t +τ ,i,j (Xt , Vk,j
and the bootstrap test statistic is:
Theorem 2. Let Assumptions A1–A5 hold. Also, assume that models 1 and k are nonnested. If as P , R, N → ∞, h → 0, P /N → 0, h2 P → 0, l → ∞, l/T 1/4 → 0, and P /R → π , where 0 < π < ∞, then:
) ≤ u2 }
− 1{u1 ≤ Xt +τ ≤ u2 }
Note that each bootstrap term is recentered around the (full) sample mean. This is necessary because the bootstrap statistic is constructed using the last P resampled observations, which in turn have been resampled from the full sample. In particular, this is necessary regardless of the ratio, P /R. Thus, even if P /R → 0, so that there is no need to mimic the parameter estimation error (and hence the above statistic can be constructed using ϑk,t ,N ,h instead of ϑk∗,t ,N ,h ), it remains the case that recentering of all bootstrap terms around the (full) sample mean is necessary.
P
θ1,t ,N ,S ,h
2
≤ u2 }
N 1 −
θ1,t ,N ,S ,h
1{u1 ≤ X1,t +τ ,i,j (Xt , V1,j
θk∗,t ,N ,S ,h
θk∗,t ,N ,S ,h
1{u1 ≤ Xk,t +τ ,i,j (Xt∗ , Vk,j
) ≤ u2 }
2 − 1{u1 ≤ 1 −
Xt∗+τ
T −
T l =1
≤ u2 }
S N 1 −−
NS j=1 i=1
≤ u2 } − 1{u1 ≤ Xl+τ
θk,t ,N ,S ,h
θk,t ,N ,S ,h
1{u1 ≤ Xk,t +τ ,i,j (Xl , Vk,j
2 ≤ u2 } .
)
)
310
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
Note that we do not need to resample the volatility process, although volatility is simulated under both θk,t ,N ,S ,h and θk∗,t ,N ,S ,h , for k = 1, . . . , m. Also, maxk=2,...,m DVk,P ,N (u1 , u2 ) and maxk=2,...,m DVk∗,P ,N (u1 , u2 ) are defined analogous to their one-factor counterparts. Assumption A5′ . As P , R, N , S → ∞ and h → 0, for k = 1, . . . , m:
T ∗ 1 − ∗ P ω : sup PT √ θk,t ,N ,S ,h − θk,t ,N ,S ,h ≤ v v∈ℜϱ P t =R T 1 − Ď −P √ θk,t ,N ,S ,h − θk ≤ v > ε → 0. P t =R
1/2
hT
d k (u1 , u2 )), π < ∞, then: (i) under H0 , DVk,P ,N ,S (u1 , u2 ) → N (0, W where Wk (u1 , u2 ) has the same format as Wk (u1 , u2 ) in the statement of Theorem 1(i). Also, d
max (DVk,P ,N ,S (u1 , u2 ) − µ(u1 , u2 )) → max Zk (u1 , u2 ),
where µ(u1 , u2 ) and Zk (u1 , u2 ) are defined as in the statement of Corollary 1; and (ii) under HA , for k = 2, . . . , m, Pr( √1 |DVk,P ,N ,S P
Theorem 4. Let Assumptions A1, A2′ , A3, and A4′ –A5′ hold. Also, assume that models 1 and k are nonnested. If as P , R, S , N → ∞, h → 0, P /N → 0, P /S → 0, h2 P → 0, l → ∞, l/T 1/4 → 0, and P /R → π , where 0 < π < ∞, then:
P
1/2
hT
ω : sup |PT∗ (DVk∗,P ,N ,S (u1 , u2 ) ≤ v)
T −1 − ( Ft ,hT (Xt +1 |Xt ) − F (Xt +1 |Xt ; ϑt ))2 w(Xt +1 , Xt ),
where Ft ,hT (Xt +1 |Xt ) is a recursive local polynomial estimator constructed using observations up to time t , and ϑt are recursively estimated parameters. As the statistic in (8) converges at a nonparametric rate (see their Theorem 3), the contribution of parameter estimation error is always negligible, regardless of the estimation scheme. On the other hand, the asymptotic bias terms may be affected by nonparametric recursive estimation. Turning to Bhardwaj et al. (2008), their statistic given in their Eqs. (10)–(11) can be written as VT = supu×v∈U ×V |VT (u, v)|, with VT (u, v) = √
v∈ℜϱ
− P (DVk,P ,N ,S (u1 , u2 ) − µk (u1 , u2 ) ≤ v)| > ε → 0,
×
T −1 −
ω : sup PT∗ max DVk∗,P ,N ,S (u1 , u2 ) ≤ v k=2,...,m v∈ℜϱ − P max (DVk,P ,N ,S (u1 , u2 ) − µ(u1 , u2 )) ≤ v > ε k=2,...,m
→ 0, where µk (u1 , u2 ) is defined as in the statement of Corollary 1.
3.3.1. Out of sample specification tests In this paper the focus is on the comparison of out of sample predictive accuracy of possible misspecified diffusion models, when the conditional distribution is not known in closed form. One may wonder whether the tools developed here can also allow for the construction of out of sample specification tests, based on recursively estimated parameters. As we briefly outline below, this is indeed possible. As mentioned in the introduction, specification tests for the conditional distribution of a diffusion, when its closed form is unknown, have also been recently suggested by Aït-Sahalia et al. (2009) and by Bhardwaj et al. (2008). The former test is based on the integrated mean square error of the difference between a local polynomial estimator of the conditional CDF and an exact
N 1 −
ϑT , N , h 1{Xs,t +1 (Xt )
≤ u} − 1{Xt +1 ≤ u} 1{Xt ≤ v},
ϑT ,N ,h
where Xs,t +1 is the process simulated using ϑT ,N ,h and with initial value Xt . It is immediate to see that an out of sample version of this test can be written as VP = supu×v∈U ×V |VP (u, v)|, with 1 VP (u, v) = √ P −1
×
T −1 − t =R
3.3. Further extensions
1 T −1
N s=1
t =1
and
(8)
t =R
P
(7)
where FT ,hT (Xt +1 |Xt ) is a local polynomial estimator of the conditional distribution constructed using the full sample, and evaluated at (Xt +1 , Xt ), hT is a bandwidth parameter, F (Xt +1 |Xt ; ϑT ) is an exact approximation of the CDF under the null, evaluated at ϑT , and w(Xt +1 , Xt ) is a weighting function. It is straightforward to construct an out of sample version of their test based on recursively estimated parameters. Namely, consider:
k=2,...,m
(u1 , u2 )| > ε) → 1.
T −1 − ϑT ))2 w(Xt +1 , Xt ), ( FT ,hT (Xt +1 |Xt ) − F (Xt +1 |Xt ; t =1
Theorem 3. Let Assumptions A1, A2′ , A3, and A4′ hold. Also, assume that models 1 and k are nonnested. If as P , R, S , N → ∞, h → 0, P /N → 0, P /S → 0, h2 P → 0, and P /R → π , where 0 <
k=2,...,m
approximation, based on a Hermite expansion, of the parametric CDF under the null, constructed with historical data. The latter test is based on the comparison of empirical CDFs constructed with historical and simulated data respectively. Both tests require estimated parameters, the former to construct the approximated CDF under the null, and the latter to simulate data. In both cases, parameters are estimated using the full sample, and tests are thus in-sample. Hereafter, to facilitate comparison, set τ = 1. Aït-Sahalia, Fan and Peng’s statistic, Eq. (4.4) in their paper, is based on
N 1 −
N s=1
ϑt ,N ,h 1{Xs,t +1 (Xt )
≤ u} − 1{Xt +1 ≤ u} 1{Xt ≤ v},
where ϑt ,N ,h are recursively estimated parameters. Thus, the limiting distribution of VP follows by combining the proof of Theorem 3 in Bhardwaj et al. (2008) with that of Theorem 1 in this paper. Validity of bootstrap critical values and extension to stochastic volatility models also follow using the tools developed in this paper. 3.3.2. Choice of intervals Thus far, our focus has centered on comparing models over a specific conditional interval. Needless to say, one may choose different models for different intervals. However, if one is interested in comparing predictive accuracy over multiple intervals, one can construct weighted versions of maxk=2,...,m Dk,P ,N (u1 , u2 ) and maxk=2,...,m DVk,P ,S ,N (u1 , u2 ). For notational simplicity, hereafter we limit our attention to the one-factor case, but extension
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
to stochastic volatility models follows by a similar argument. More precisely, let (−∞, u1 ], (u1 , u2 ], . . . , (uj−1 , ∞) be a partition of the support of the variable to be predicted, and define
−
J
Dk,P ,N = max
k=2,...,m
= ∞, and
∑J −1
w(uj , uj+1 ) = 1. Of
j =0
Dk,P ,N (u)φ(u)du, U
where φ(u) ≥ 0 and φ(u)du = 1. If we restrict U to be a compact set in R, say U = [u, u], then it is not difficult to show that Dk,P ,N (u) is stochastic equicontinuous on U, and under the conditions of
d
Theorem 1, it follows that U Dk,P ,N (u)φ(u)du → U Zk (u)φ(u)du, where Zk (u) is a Gaussian process with covariance kernel Wk (u), as defined in the Appendix. The asymptotic validity of the bootstrap critical values can be established as in Theorem 2, provided the bootstrap counterpart of Dk,P ,N (u), D∗k,P ,N (u), is also stochastic equicontinuous on U . Finally, one may prefer to minimize the integrated mean square error between the model and the true conditional density. In this case, the null hypothesis is:
∫ ∫
k=2,...,m
U
[(f1 (u|v; ϑ1Ď ) − f0 (u|v))2 − (fk (u|v; ϑkĎ ) V
− f0 (u|v))2 ]f0 (u, v)dudv ≤ 0, Ď
where fk (u|v; ϑk ) is the conditional density implied by model k, f0 (u|v) is the true conditional density. A test statistic for H0′ , Ď
Ď
requires estimators of f1 (u|v; ϑ1 ), fk (u|v; ϑk ), and f0 (u|v).
Given recursively estimated parameters ϑk,t ,N ,h , one should ϑk,t ,N ,h
thus generate a sample of length N, Xqh q = 1, . . . , Q where Qh = N, and sample the simulated data at the same frequency ϑt ,N ,h
as the historical data, in order to get a discrete sample, Xi , i = 1, . . . , N. As N /T → ∞, the initial value effect is negligible. Then, construct a kernel estimator of the conditional density, using both simulated and historical data,
ϑk,t ,N ,h fk,N (Xt +τ |Xt )
1 Nh2N
N −τ
=
∑
K
i=1
ϑk,t ,N ,h Xi+τ −Xt +τ hN
1 NhN
N −τ
∑
K
K
ft (Xt +τ |Xt ) =
K
1 ThT
t = R, . . . , T − τ
K
hT
t ∑ Xj −Xt K
j =1
hT
hN
−Xt
with w(Xt +τ , Xt ) denoting a proper trimming function. The study of the asymptotic properties of the statistic above is left to future research. 4. Recursive nonparametric simulated quasi maximum likelihood estimators In this section we develop a recursive version of the nonparametric simulated (quasi) maximum likelihood (NPSQML) estimator of Fermanian and Salanié (2004) and outline conditions under which asymptotic equivalence between the NPSQML estimator and the corresponding recursive QML estimator obtains, hence ensuring that A4 and A4′ hold. Analogous results are also established for the bootstrap counterpart of the recursive NPSQML estimators. A previous version of this paper contains results analogous to those reported in this section for the case of exactly identified simulated generalized methods of estimators of Duffie and Singleton (1993).2 4.1. One factor models The idea underlying the nonparametric simulated maximum likelihood estimator of Fermanian and Salanié (2004) is to replace the unknown conditional density with a kernel estimator, constructed using simulated data. Fermanian and Salanié (2004) focus on the case of exogenous conditioning variables, while Kristensen and Shin (2008) consider extensions to (fully observed) Markov models. In the sequel, we extend the estimator of Fermanian and Salanié (2004) and Kristensen and Shin (2008) to the recursive estimation case. In a subsequent section, we outline a bootstrap version of the estimator and establish the first-order validity thereof. Ď Hereafter, let fk (Xt |Xt −1 , ϑk ) be the conditional density implied by model k. If we knew fk in closed form, we could just estimate ϑtĎ,k recursively, using standard QML as3 : t 1− ϑt ,k = arg max ln fk (Xt |Xt −1 , ϑk ),
ϑk ∈Θk
t j =2
t = R + 1, . . . , R + P . Now, define:
ϑkĎ = arg max E (ln fk (Xt |Xt −1 , ϑk )).
(9)
ϑk ∈Θk
Following Kristensen and Shin (2008), generate T − 1 paths of length one for each simulation replication, using X1 , . . . , XT −1 as starting values and hence construct Xkϑ,t ,j (Xt −1 ), for t = 2, . . . , T −
hN
i=1
t ∑ Xj+τ −Xt +τ Xj −Xt j =1
ϑk,t ,N ,h
Xi
ϑk,t ,N ,h Xi −Xt
and 1 Th2T
ϑ1,t ,N ,h [( f1,N (Xt +τ |Xt ) − ft (Xt +τ |Xt ))2
course, is not independent of the bounds of the interval, and in fact it depends on the number of intervals considered and on their relative length. Moreover, J should be finite and not too large, otherwise one will be left with fewer observations than needed to construct reliable estimates. Intuitively, this is the price we pay for using statistics that converge at a parametric rate. Alternatively, if interest lies in approximation of the entire conditional distribution, one can consider the one-sided interval (−∞, u], and construct the following statistic,
H0′ : max
T −τ −
ϑk,t ,N ,h − ( fk,N (Xt +τ |Xt ) − ft (Xt +τ |Xt ))2 ]w(Xt +τ , Xt ),
j =0
J Dk,P ,N
Dk,P ,N =
1/2
t =R
Dk,P ,N (uj , uj+1 )w(uj , uj+1 ),
where u0 = −∞, uJ
∫
where hN and hT are bandwidth parameters. Then, we can test H0′ via a statistics based on: hT
J −1
311
hT
,
2 See http://econweb.rutgers.edu/nswanson/papers.htm. We conjecture that one could establish the asymptotic properties of recursive versions and bootstrap analogs for all other simulation-based estimators, such as indirect inference (Gourieroux et al., 1993; Dridi et al., 2007), an efficient method of moment (Gallant and Tauchen, 1996) and simulated GMM with a continuum of moment conditions (Carrasco et al., 2007). We leave this to future research. 3 Note that as model k is, in general, misspecified, ∑T −1 ln f (X |X , ϑ ) is a Ď
t =1
k
t
t −1
k
quasi-likelihood and ∇θk ln fk (Xt |Xt −1 , ϑk ) is not necessarily a martingale difference sequence.
312
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
1, j = 1, . . . , N. Note that we keep the N random draws fixed across different initial values. Then, define the following estimator of the conditional density:
fk,N ,h (Xt |Xt −1 , ϑk ) =
N 1 −
N ξN i=1
K
ϑ
Xt ,ik,h (Xt −1 ) − Xt
d
Further, define the recursive NPSQML estimator as follows: t 1− ln fk,N ,h (Xs |Xs−1 , ϑk ) ϑk,t ,N ,h = arg max
ϑk ∈Θk
t s =2
× τN ( fk,N ,h (Xs |Xs−1 , ϑk )),
t ≥ R,
where the trimming function τN ( fk,N ,h (Xt |Xt −1 , ϑk )) is a positive and increasing function, such that τN ( fk,N ,h (Xt , Xt −1 , ϑk )) = 0, if fk,N ,h (Xt , Xt −1 , ϑk ) < ξNδ , and τN ( fk,N ,h (Xt , Xt −1 , ϑk )) = 1, if fk,N ,h (Xt , Xt −1 , ϑk ) > 2ξNδ , for some δ > 0.4 The reason for the trimming parameter is that when the log density is close to zero, the derivative tends to infinity and so even very tiny simulation errors have a large impact on the likelihood. Our result in this subsection requires the following additional assumptions. ϑ
ϑ
Assumption A3′ . For k = 1, . . . , m: (i) Xi k (x) and Xi,hk (x) are geometrically ergodic and β -mixing, (ii) ϑ
ϑ
ϑ
∂ 2 Xi k (x) ∂ϑk ∂ x
∂ Xi,hk (x) , ∂ϑk
and
r-dominated on Θk and on X a > 1.
ϑ
ϑ
ϑ
∂ Xi k (x) ∂ Xi k (x) ∂ 2 Xi k (x) , , ∂ϑ ∂ϑ ′ , ∂ϑk ∂x k k ϑ
∂ Xi,hk (x) , ∂x T ,a
∂ 2 Xi,hk (x) ∂ϑk ∂ϑk′
ϑ
∂ 2 Xi,hk (x) ∂ϑk ∂ x
,
are
= {x : x ≤ T a } for r > 4 and Ď
Assumption A6. Let Nϑ Ď be a neighborhood of ϑk , E (supϑk ∈N
‖
∂ ln fk (Xt |Xt −1 ,ϑk ) r ‖) ∂ϑk
< ∞, E (supϑk ∈N
Ď
ϑk
ϑ
∂ X k (Xt −1 ) r E (supϑk ∈N Ď ‖ i,h∂ϑ ‖) k ϑk
‖
Ď
ϑk
k
ϑk
→ 0, (d) T 1/2 ξN−(δ+3) h| ln ξNδ | → 0. Then, for k = 1, . . . , ∑ p ϑk,t ,N ,h − m: (i) supt ≥R ( ϑk,t ,N ,h − ϑkĎ ) → 0 and (ii) √1P Tt=R ( ϑkĎ ) → N (0, 2Π AĎk VkĎ AĎk ), where AĎk = E (−∇θ2k ln fk (Xt |Xt −1 , ϑkĎ )), ∑∞ Ď ′ Ď Ď Vk = i=−∞ E (∇θk ln fk (X2 |X1 , ϑk )∇θk ln fk (X2+i |X1+i , ϑk ) ) and Π = 1 − π −1 ln(1 + π ).
.
ξN
2ξNδ ) → 0, (b) T 1/2 ξNs−δ | ln ξN | → 0, (c) T (1−a) ξN−4−2δ (ln ξN2 ) ln T a
∂ Xi (Xt −1 ) r ‖) ∂ϑk
< ∞,
< ∞, for k = 1, . . . , m and for r > 4.
As 0 < π < ∞, P grows at the same rate as T , for sake of simplicity, we have stated the rate conditions (a)–(d) in terms of T , instead of a combination of T and P. Note that if we simulate the process using the Euler scheme, instead of the Milstein scheme, the rate condition in (d) should be strengthened −(d+3) 1/2 to T 1/2 ξN h | ln ξNδ | → 0. From Theorem 5 it can be seen immediately that the NPSQML estimator satisfies Assumption A4 and is asymptotically equivalent to the unfeasible QML estimator, which is constructed by maximizing the likelihood of model k. An interesting alternative to nonparametric simulated maximum likelihood estimator has recently been suggested by Altissimo and Mele (2009). Their estimator is based on the minimization of a properly weighted distance between kernel conditional density estimators based on historical and simulated data. For fully observable systems, it is asymptotically equivalent to the maximum likelihood estimator. Under the rate conditions in Theorem 5, the contribution of simulation error is asymptotically negligible, and thus there is no need to resample the simulated observations. In particular, let Z t ,∗ = (Xt∗ , Xt∗+1 , . . . , Xt∗+τ ), t = 1, . . . , T − τ be as outlined in Section 3. For each simulation replication, generate T − 1 paths of length one, using as starting values X1∗ , . . . , XT∗−1 , and so obtaining ϑ
Xk,kt ,j (Xt∗−1 ), for t = 2, . . . , T − 1, j = 1, . . . , N. Further, let:
fk∗,N ,h (Xt∗ |Xt∗−1 , ϑk )
=
N 1 −
N ξN j=1
(i.e. E (ln fk (Xt |Xt −1 , ϑk )) <
for any ϑk ̸= (ii) ϑk,t ,N ,h and ϑkĎ are in the interior of Θk , (iii) fk (x|x−1 , ϑk ) is s + 1-continuously differentiable on the interior of Θk , fk (x|x−1 , ϑk ), ∇xs fk (x|x−1 , ϑk ), ∇xs ∇ϑk fk (x|x−1 , ϑk ) are bounded on R × R × Θ k , for s ≥ 2; (iii) the elements of ∇ϑk fk (Xt |Xt −1 , ϑk ), ∇ϑ2k fk (Xt |Xt −1 , ϑk ), ∇ϑk ln fk (Xt |Xt −1 , ϑk ) and ∇ϑ2k ln fk (Xt |Xt −1 , ϑk ) are r-dominated on Θk , with r > 4; and (iv) E (−∇ϑ2k ln fk (ϑk )) is positive definite, uniformly on Θk .
ϑkĎ );
Assumption A8. The kernel, K , is a symmetric, nonnegative, continuous function with bounded support [−∆, ∆], s-time differen tiable on the interior of its support and satisfies: K (u)du = 1, s−1 u K (u)(u)du = 0, s ≥ 2. Let K (j) be the j-th derivative of the kernel. Then, K (j) (−∆) = K (j) (∆) = 0, for j = 1, . . . , s, s ≥ 2. Theorem 5. Let Assumptions A1–A2, A3′ , and A6–A8 hold. Let T = R + P, P /R → π , where 0 < π < ∞ and let N = T a a > 1. r
1
As T , P , N → ∞, (a) T 2 | ln ξN | r −1 Pr(infϑ∈N
Ď
ϑk
f (Xj |Xj−1 , ϑ) ≤
4 As an example of a trimming function, Fermanian and Salanié (2004) suggest using:
τN (x) =
4(x − aN )3 a3N
for aN ≤ x ≤ 2aN .
−
3(x − aN )4 a4N
,
K
ϑ
Xt ,jk,h (Xt∗−1 ) − Xt∗
.
ξN
Now, for t = R, . . . , R + P − 1, define:
Ď
Assumption A7. For k = 1, . . . , m: (i) ϑk is uniquely identified Ď E (ln fk (Xt |Xt −1 , ϑk ))
ϑk∗,t ,N ,h = arg max
ϑk ∈Θk
t 1−
t l =2
ln fk,N ,h (Xl∗ |Xl∗−1 , ϑk )
× τN ( fk,N ,h (Xl∗ |Xl∗−1 , ϑk )) T − ∇ϑk fk,N ,h (Xl′ | Xl′ −1 , ϑk ) ′ 1 − ϑk T ′ fk,N ,h (Xl′ |Xl′ −1 , ϑk )
ϑk = ϑk,t ,N ,h
l =2
× τN ( fk,N ,h (Xl′ |Xl′ −1 , ϑk,t ,N ,h )) + τN ( fk,N ,h (Xl′ |Xl′ −1 , ϑk )) × ∇ϑk fk,N ,h (Xl′ |Xl′ −1 , ϑk )|ϑk =ϑk,t ,N ,h ′
× ln fk,N ,h (Xl′ |Xl′ −1 , ϑk,t ,N ,h ) , where τN′ (·) denotes the derivative of τN (·) with respect to its argument. Note that each term in the simulated likelihood is recentered around the (full) sample mean of the score, evaluated at ϑk,t ,N ,h . This ensures that the bootstrap score has mean zero, conditional on the sample. The recentering term requires computation of ∇θk fk,N ,h (Xl′ |Xl′ −1 , ϑk,t ,N ,h ), which is not known in closed form. Nevertheless, it can be computed numerically, by simply taking the numerical derivative of the simulated likelihood. Theorem 6. Let Assumptions A1–A2, A3′ , and A6–A8 hold. Let T = R + P, P /R → π , where 0 < π < ∞ and let N = T a a > 1. 1
r
As T , N , l → ∞, l/T 1/4 → 0, and (a) T 2 | ln ξN | r −1 Pr(infϑ∈N
Ď
ϑk
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
f (Xj |Xj−1 , ϑ) ≤ 2ξNδ ) → 0, (b) T 1/2 ξNs−δ | ln ξN | → 0, (c) T (1−a) −4−2δ
−(δ+3) T 1/2 N h
δ
ξN (ln ξ ) ln T → 0, (d) ξ | ln ξN | → 0. Then, for k = 1, . . . , m: T ∗ 1 − ∗ P ω : sup PT √ (ϑk,t ,N ,h − ϑk,t ,N ,h ) ≤ v v∈ℜϱ P t =R T 1 − Ď − P √ (ϑk,t ,N ,h − ϑk ) ≤ v > ε → 0, P t =R 2 N
a
where PT∗ denotes the probability law of the resampled series, conditional on the (entire) sample. Thus, √1
∑T
( ϑk∗,t ,N ,h − ϑk,t ,N ,h ) has the same limiting ∑T Ď t =R (ϑk,t ,N ,h − ϑk ), conditional on the sample,
t =R
P
distribution as √1
P
and for all samples except a set with probability measure approaching zero, and A5 is satisfied by the bootstrap NPSQML estimator. 4.2. Stochastic volatility models Since volatility is not observable, we cannot proceed as in the θ single factor case. Instead, let Vs k be generated according to (4), setting qh = s, q = 1, . . . , 1/h, and s = 1, . . . , S. For each model k = 1, . . . , m, and at each simulation replication, i = 1, . . . , N, generate S paths of length one, using Xt −1 as the starting value for the observable, and using S different starting values for the θ unobservable volatility (i.e., Vs k , s = 1, . . . , S). Thus, for any t = 1, . . . , T − 1, and for any set i, i = 1, . . . , N of random errors ϵ1,t +(q+1)h,i and ϵ2,t +(q+1)h,i , q = 1, . . . , 1/h, generate S different values for the observable at time t + 1, each of them corresponding to a different starting value for the unobservable. Note that it is important to use the same set of random errors ϵ1,t +(q+1)h,i and ϵ2,t +(q+1)h,i across different initial values for volatility. Using (5) θ
(i) supt ≥R ( θk,t ,N ,S ,h − θkĎ ) → 0 and (ii)
Now, for t = R, . . . , R + P − 1, define:
θk∗,t ,N ,S ,h = arg max
θk ∈Θk
t l=2
ln fk,N ,S ,h (Xl∗ |Xl∗−1 , θk )
θk,t ,N ,h
l =2
× τN ( fk,N ,S ,h (Xl′ |Xl′ −1 , θk,t ,N ,S ,h )) ′ + τN (fk,N ,Sh (Xl′ |Xl′ −1 , θk,t ,N ,S ,h )) θ
θ
Xt ,ki,h (Xt −1 , Vs k ) − Xt
ξN
i=1
× ∇ϑk fk,N ,S ,h (Xl′ |Xl′ −1 , θk,t ,N ,h )
,
× ln fk,N ,S ,h (Xl′ |Xl′ −1 , θk,t ,N ,h ) ,
and note that by averaging over the initial values for the unobservable volatility, its effect is integrated out. Finally, define: t 1− θk,t ,N ,S ,h = arg min ln fk,N ,S ,h (Xl |Xl−1 , θk )
θk ∈Θk
t 1−
× τN ( fk,N ,S ,h (Xl∗ |Xl∗−1 , θk )) T − ′ |Xl′ −1 , θk ) ∇ f ( X 1 θ k , N , S , h l k − θk′ T ′ fk,N ,h (Xl∗′ |Xl∗′ −1 , θk )
S s=1 N ξN K
( θk,t ,N ,S ,h −
S 1− 1 fk,N ,S ,h (Xt∗ |Xt∗−1 , θk ) = S s=1 N ξN θ θ N − Xt ,ki (Xt∗−1 , Vs′k−1 ) − Xt∗ × K . ξN i=1
1− 1
×
t =R
Note that in this case, Xt is no longer Markov (i.e., Xt and Vt are jointly Markovian, but Xt is not). Therefore, even in the case in which model k is the true data generating process, the joint likelihood cannot be expressed as the product of the conditional and marginal distributions. Thus, θk,t ,N ,S ,h is necessarily a QML Ď estimator. Furthermore, note that ∇θk ln f (Xt |Xt −1 , θk ) is no longer a martingale difference sequence; therefore, we need to use HAC robust covariance matrix estimators, regardless of whether k is the ‘‘correct’’ model or not. Note that for the bootstrap counterpart of θk,t ,N ,S ,h , since S /T → ∞ and N /T → ∞, the contribution of simulation error is asymptotically negligible. Hence, there is no need to resample the simulated observations or the simulated initial values for volatility. Define:
S
P
d
θ
N −
√1
∑T
θkĎ ) → N (0, 2Π AĎk VkĎ AĎk ), where AĎk = E (−∇θ2k ln fk (Xt |Xt −1 , θkĎ )), VkĎ ∑ Ď Ď ′ = ∞ i=−∞ E (∇θk ln fk (X2 |X1 , θk )∇θk ln fk (X2+i |X1+i , θk ) ), and Π = −1 1 − π ln(1 + π ).
and (6), generate Xt ,ki (Xt , Vs k ) for t = 2, . . . , T , i = 1, . . . , N and s = 1, . . . , S. Now, define:
fk,N ,S ,h (Xt |Xt −1 , θk ) =
313 p
where τN′ (·) denotes the derivative with respect to its argument. We have:
t l =2
× τN ( fk,N ,S ,h (Xl |Xl−1 , θk )),
t ≥ R.
Before establishing the asymptotic properties of θk,t ,N ,S ,h , we need another assumption. Assumption A9. Let Nθ Ď be a neighborhood of k
θkĎ ,
E (supθk ∈N
Theorem 8. Let Assumptions A1, A2′ –A3′ , and A6–A9 hold. Let T = R + P, P /R → π , where 0 < π < ∞, and let N = T a , a > 1. As T , N , S , l → ∞, l/T 1/4 → 0, 1
r
and (a) T 2 | ln ξN | r −1 Pr(infϑ∈N
Ď
ϑk
f (Xj |Xj−1 , ϑ) ≤ 2ξNδ ) → 0,
f (Xj |Xj−1 , ϑ) ≤ 2ξNδ ) → 0, (b) T 1/2 ξNs−δ
(b) T 1/2 ξNs−δ | ln ξN | → 0, (c) T (1−a) ξN−4−2δ (ln ξN2 ) ln T a → 0, (d) T 1/2 ξN−(δ+3) h| ln ξNδ | → 0, (e) T 1/2 S −1/2 ξN−2(1+δ) → 0. Then, for k = 1, . . . , m: T ∗ 1 − ∗ P ω : sup PT √ (θk,t ,N ,S ,h − θk,t ,N ,S ,h ) ≤ v v∈ℜϱ P t =R T 1 − Ď − P √ ( θk,t ,N ,S ,h − ϑ ) ≤ v > ε → 0, P t =R
| ln ξN | → 0, (c) T (1−a) ξN−4−2δ (ln ξN2 ) ln T a → 0, (d) T 1/2 ξN−(δ+3) h | ln ξNδ | → 0, (e) T 1/2 S −1/2 ξN−2(1+δ) → 0. Then for k = 1, . . . , m:
where PT∗ denotes the probability law of the resampled series, conditional on the (entire) sample.
θk
‖
θk
∂ Xi (Xt −1 ,Vj ) r ‖) ∂θk
< ∞, E (supθk ∈N Ď ‖
k = 1, . . . , m and for r > 4.
θk
θ θ ∂ Xi,kh (Xt −1 ,Vj k )
∂θk
Ď
θk
‖r ) < ∞, for
Theorem 7. Let Assumptions A1, A2′ –A3′ , and A6–A9 hold. Let T = R + P, P /R → π , where 0 < π < ∞, and let N = T a , a > 1. As T , N , S → ∞, 1
r
(a) T 2 | ln ξN | r −1 Pr(infϑ∈N
Ď
ϑk
314
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
Table 1 Predictive density model selection test results sample period January 6, 1989–December 31, 1998 (CIR model is the benchmark, bootstrap block length = 5).
τ
u1 , u2 1 2 3 4 5 6
12
X X X X X X X X X X X X X X
± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX
DVkMax ,P ,S ,N (u1 , u2 )
PDMSFECIR
PDMSFESV
PDMSFESVJ
5% CV
10% CV
15% CV
20% CV
2.82927∗ 1.31996 1.57134∗ 0.53925 0.80223∗ 1.19189∗ 1.23058∗ 0.48079∗ −0.00077 0.18502 1.52213∗ 0.58406∗ 0.56293∗ 0.41295∗
5.66205 1.58636 4.13194 0.85434 4.26257 1.82012 4.32896 1.02194 3.71976 1.09725 4.949 1.63659 4.58393 1.30048
3.62009 0.3691 2.62781 0.34105 3.87959 0.93572 3.82788 0.76792 3.72053 1.01962 3.83724 1.05253 4.37846 1.5585
2.83278 0.2664 2.56061 0.31509 3.46034 0.62823 3.09838 0.54115 3.97788 0.91223 3.42687 1.18955 4.021 0.88753
1.76793 1.78705 0.95374 0.88404 0.23338 0.48909 0.34424 0.32672 0.25028 0.2864 0.11366 0.16156 0.03752 0.02381
1.65848 1.64695 0.85015 0.8354 0.20535 0.40461 0.28591 0.28204 0.2032 0.2164 0.08187 0.12362 0.03085 0.01912
1.59048 1.57157 0.81374 0.7433 0.19317 0.36703 0.22947 0.22131 0.17763 0.19567 0.07064 0.11468 0.02742 0.01574
1.53149 1.5188 0.77364 0.67953 0.16539 0.30468 0.21701 0.20073 0.16541 0.14872 0.05948 0.10462 0.01931 0.01425
Notes: Numerical entries in the table are test statistics, predictive density type PDMSFEs (see Section 7 for further discussion), and associated bootstrap critical values, constructed using intervals given in the second column of the table, and for predictive horizons, τ = 1, 2, 3, 4, 5, 6, 12. Starred entries denote rejection of the null hypothesis that the CIR model yields predictive densities at least as accurate as the competitor SV and SVJ models. Weekly data are used in all estimations, and the sample period across which predictive densities are constructed is T /2, where T is the sample size. Predictive densities are constructed using simulations of length S = 10T . Empirical bootstrap distributions are constructed using 100 bootstrap replications, and critical values are reported for the 95th, 90th, 85th, and 80th percentiles of the bootstrap distribution. X and σX are the mean and variance of an initial sample of data used in the first in-sample estimation, prior to the construction of the first predictive density (i.e., using T /2 observations). Finally, the predictive density type ‘‘mean square forecast errors’’ (MSFEs) reported in the fourth through sixth columns of the table are defined above and reported entries are multiplied by P 1/2 , where P = T /2 is the ex-ante prediction period. Table 2 Predictive density model selection test results sample period January 6, 1989–December 31, 1998 (CIR model is the benchmark, bootstrap block length = 10).
τ
u1 , u2 1 2 3 4 5 6
12
X X X X X X X X X X X X X X
± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX
DVkMax ,P ,S ,N (u1 , u2 )
PDMSFECIR
PDMSFESV
PDMSFESVJ
5% CV
10% CV
15% CV
20% CV
2.82927∗ 1.31996 1.57134∗ 0.53925 0.80223∗ 1.19189∗ 1.23058∗ 0.48079∗ −0.00077 0.18502 1.52213∗ 0.58406∗ 0.56293∗ 0.41295∗
5.66205 1.58636 4.13194 0.85434 4.26257 1.82012 4.32896 1.02194 3.71976 1.09725 4.949 1.63659 4.58393 1.30048
3.62009 0.3691 2.62781 0.34105 3.87959 0.93572 3.82788 0.76792 3.72053 1.01962 3.83724 1.05253 4.37846 1.5585
2.83278 0.2664 2.56061 0.31509 3.46034 0.62823 3.09838 0.54115 3.97788 0.91223 3.42687 1.18955 4.021 0.88753
2.00777 2.04287 1.20729 1.18983 0.30797 0.72656 0.39022 0.52736 0.20617 0.36255 0.11792 0.1695 0.05866 0.03615
1.87189 1.94914 1.12574 1.12383 0.26336 0.61716 0.31387 0.45501 0.18285 0.29925 0.10103 0.14107 0.04347 0.03183
1.79275 1.92829 1.09287 1.02568 0.23572 0.5816 0.28829 0.41484 0.16524 0.2721 0.08588 0.12773 0.03611 0.02711
1.74894 1.82353 1.01652 0.93639 0.21822 0.5347 0.27063 0.37745 0.13619 0.22753 0.08082 0.09614 0.03507 0.02122
Notes: see Table 1.
5. Empirical illustration: choosing between CIR, SV, and SVJ models In this section, we choose between Cox–Ingersoll–Ross (CIR), stochastic volatility (SV) and stochastic volatility with jumps (SVJ) models by comparing the models’ predictive performance across two different sample periods. Our primary objective is to illustrate the implementation of our test statistics and our secondary objective is to assess whether the choice of model is impacted by the choice of sample period. There are many precedents in the empirical literature suggesting that evaluation of subsample robustness is an important issue when evaluating models. For example, see Bandi and Reno (2008), who compare their semiparametric estimates of a jump diffusion for S&P500 returns to a less general affine model estimated by Eraker et al. (2003). In their analysis, the alternative models are rather similar, but they use different sample periods and different variance filtering methods. In our example, we use the same estimation method for different models across different estimation periods. In particular, we consider two samples of weekly data, one from January 6, 1989–December 31, 1998 (526 observations) and one from January 8, 1999–April 30, 2008 (491 observations), chosen arbitrarily. The variable that we model is the effective (or market) federal funds rate (i.e., the interbank interest rate), measured at the close.
In our analysis, we use the three models implemented in Bhardwaj et al. (2008). Other than considering similar models, our empirical illustration is quite different from theirs. Namely, they report on in-sample Kolmogorov type consistent specification tests for individual models, while we report the model selection type test statistics and related forecast error measures discussed in this paper. More specifically, we jointly compare the out-of-sample predictive accuracy of various models using recursively estimated models and recursively constructed predictive densities. The three models that we examine are: CIR: dX (t ) = κ1 (α1 − X (t ))dt + γ1 γ1 > 0 and 2κ1 α1 ≥ γ12 ,
√
X (t )dW1 (t ), where κ1 > 0,
√
SV: dX (t ) = κ2 (α2√− X (t ))dt + V (t )dWr (t ), and dV (t ) = κ3 (α3 − V (t ))dt + γ2 V (t )dWv (t ), where Wr (t ) and Wv (t ) are independent Brownian motions, and where κ2 > 0, κ3 > 0, γ2 > 0, and 2κ3 α3 ≥ γ22 .
√
SVJ: dX (t ) = κ4 (α4 − X (t ))dt + V (t )dWrJ (t ) + Ju dqu − Jd dqd , √ and dV (t ) = κ5 (α5 − V (t ))dt + γ3 V (t )dWv J (t ), where WrJ (t ) and Wv J (t ) are independent Brownian motions, and where κ4 > 0, κ5 > 0, γ3 > 0, and 2κ5 α5 ≥ γ32 . Further qu and qd are Poisson processes with jump intensity λu and λd , and are independent of the Brownian motions W1 (t ) and W2 (t ). Jump sizes are iid and are controlled by jump magnitudes ζu , ζd > 0, which are drawn from
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
315
Table 3 Predictive density model selection test results sample period January 8, 1999–April 30, 2008 (CIR model is the benchmark, bootstrap block length = 5).
τ
u1 , u2 1
X X X X X X X X X X X X X X
2 3 4 5 6 12
± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX
DVkMax ,P ,S ,N (u1 , u2 )
PDMSFECIR
PDMSFESV
PDMSFESVJ
5% CV
10% CV
15% CV
20% CV
3.36528∗ 0.39113 1.8218∗ 0.59514 1.2709 0.97425 1.33461∗ 0.59446 1.55731∗ 0.62454∗ 1.07981 1.0877∗ 1.06647∗ 0.74472∗
3.93191 0.39172 2.32377 0.60979 1.86856 1.04645 1.86611 0.78217 1.92318 0.92698 1.5355 1.3928 1.72738 0.9282
0.56663 0.00059 0.50197 0.01464 0.59766 0.0722 0.5315 0.18771 0.36586 0.30244 0.45569 0.39654 0.66091 0.43853
2.35979 0.13535 2.04596 0.26331 2.29788 0.46272 2.50816 0.23341 2.3208 0.42899 2.23224 0.3051 2.59892 0.18348
2.4573 2.09495 1.82588 2.182 1.47533 1.98624 1.18714 1.44947 0.94807 1.12818 0.90627 1.11448 0.96992 0.93258
2.31001 1.99902 1.71781 2.09447 1.33248 1.77604 1.03895 1.31151 0.72157 0.91251 0.81358 0.88946 0.7709 0.73613
2.17511 1.93683 1.64691 1.99572 1.19701 1.71385 0.92443 1.23566 0.63611 0.81989 0.58599 0.69749 0.65347 0.59269
2.05169 1.84544 1.55461 1.93641 1.11857 1.63308 0.74572 1.18198 0.56305 0.69776 0.49386 0.57532 0.54271 0.4251
Notes: see Table 1. Table 4 Predictive density model selection test results sample period January 8, 1999–April 30, 2008 (CIR model is the benchmark, bootstrap block length = 10).
τ
u1 , u2 1
X X X X X X X X X X X X X X
2 3 4 5 6 12
DVkMax ,P ,S ,N (u1 , u2 )
± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX ± 0.5σX ± σX
∗
3.36528 0.39113 1.8218 0.59514 1.2709 0.97425 1.33461 0.59446 1.55731 0.62454 1.07981 1.0877 1.06647∗ 0.74472
PDMSFECIR
PDMSFESV
PDMSFESVJ
5% CV
10% CV
15% CV
20% CV
3.93191 0.39172 2.32377 0.60979 1.86856 1.04645 1.86611 0.78217 1.92318 0.92698 1.5355 1.3928 1.72738 0.9282
0.56663 0.00059 0.50197 0.01464 0.59766 0.0722 0.5315 0.18771 0.36586 0.30244 0.45569 0.39654 0.66091 0.43853
2.35979 0.13535 2.04596 0.26331 2.29788 0.46272 2.50816 0.23341 2.3208 0.42899 2.23224 0.3051 2.59892 0.18348
3.22922 2.49945 2.97083 2.82514 2.51858 2.98617 2.14655 2.72152 1.9112 2.57883 2.11693 2.37199 1.36719 1.77444
2.79456 2.30575 2.41921 2.67829 2.25422 2.8359 1.91697 2.56512 1.80572 2.30651 1.64939 2.08945 1.00359 0.98574
2.66332 2.18381 2.29894 2.64444 2.06351 2.75257 1.73401 2.49455 1.4376 2.14454 1.47409 1.83042 0.8389 0.75872
2.49582 2.15431 2.2163 2.55817 1.93476 2.59837 1.59074 2.37684 1.33975 1.96686 1.34432 1.71404 0.57706 0.54984
Notes: see Table 1.
exponential distributions, with densities: f (Ju ) = ζ1 exp(− ζu ) u u J and f (Jd ) = ζ1 exp(− ζd ). Here, λu is the probability of a jump d d up, Pr(dqu (t ) = 1) = λu , and jump up size is controlled by Ju ; while λd and Jd control jump down intensity and size. Note that the case of Poisson jumps with constant intensity and jump size with exponential density is covered by the assumptions stated in the previous sections. Note that the CIR model is neither nesting the SVJ and the SV models nor is nested in either of them. On the other hand, SV is clearly nested in SVJ. Max The tests that we construct are DMax k,P ,N (u1 , u2 ) and DVk,P ,S ,N (u1 , u2 ). In our tables, we also report the so-called ‘‘predictive density’’ mean square forecast error (PDMSFE) terms in these statistics, which are constructed using the following formulae: J
T −τ 1−
P t =R
S N 1 −−
NS j=1 i=1
θ1,t ,N ,S ,h
2 − 1{u1 ≤ Xt +τ ≤ u2 } and T −τ 1−
P t =R
N 1 −
N i=1
θ1,t ,N ,S ,h
1{u1 ≤ X1,t +τ ,i,j (Xt , V1,j
ϑ1,t ,N ,h
1{u1 ≤ X1,t +τ ,i (Xt ) ≤ u2 }
2 − 1{u1 ≤ Xt +τ ≤ u2 }
,
) ≤ u2 }
depending upon whether we are predicting using one factor or SV models. We define the CIR model to be our ‘‘benchmark’’, against which the other models are compared. Thus, both competitors are neither nested in nor nesting the benchmark. For the estimation of parameters as well as the construction of predictive densities, data were generated using the Milstein scheme discussed above, with h = 1/T , where T is the sample size. The jump component in our SVJ model was simulated without any error because of the constancy of the intensity parameter. The three models fall in the class of affine diffusions. Therefore, it is possible to compute parameter estimates using the conditional characteristic function (see Singleton (2001) for the CIR model, Jiang and Knight (2002) for the SV model, and Chacko and Viceira (2003) for the SVJ model). We leave analysis of the predictive accuracy of the models discussed herein under different estimation methods to future research. All parameters are estimated recursively, all empirical bootstrap distributions are constructed using 500 bootstrap replications, and critical values are reported for the 95th,90th,85th, and 80th percentiles of the relevant bootstrap empirical distributions. For the bootstrap, block lengths of 5 and 10 are reported on. Additionally, we set S = 1000, and for model SV and SVJ we set N = S. Tests were carried out based on the construction of τ -step ahead predictive densities and associated confidence intervals, for τ = {1, 2, 3, 4, 5, 6, 12}. We set (u1 , u2 ) equal to X ± 0.5σX , and X ± σX , where X and σX are the mean and variance of an initial sample of data. Test statistic values, PDMSFEs, and bootstrap critical values are reported for various u1 , u2 combinations, forecast horizons, and bootstrap block lengths in Tables 1–4. The first two tables report
316
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
(a) Evaluation point = 0.035
(b) Evaluation point = 0.055
(c) Evaluation point = 0.075 Fig. 1. Predictive densities for CIR, SV and SVJ models — 01:1989–12:1998.
results for the sample period January 6, 1989–December 31, 1998, while Tables 3 and 4 report results for the sample period January 8, 1999–April 30, 2008. Interestingly, a number of very clear-cut conclusions emerge. In particular, PDMSFEs are lower for the SVJ model in 12 of 14 cases in Table 1. Moreover, in the two cases where SVJ is not ‘‘PDMSFE-best’’, there is little to choose between the PDMSFEs of the different models. Perhaps not surprisingly, then, the null hypothesis that the CIR model yields predictive densities at least as accurate as the two competitor models is rejected in almost all cases, at a 95% level of confidence. (Starred entries in the tables denote rejection using CVs equal to the 95th percentile of the empirical bootstrap distributions.) Notice also that although bootstrap CVs increase in magnitude when a longer block length is used (see Table 2), the number of rejections of the null hypothesis remains the same, suggesting that our findings, thus far, are somewhat robust to bootstrap block length. Turning now to Table 3, note that it is now the SV model that yields the ‘‘PDMSFE-best’’ predictive densities in all but two cases. Moreover, in the two cases that SV does not ‘‘win’’, the SVJ model ‘‘wins’’, albeit with only marginally lower PDMSFEs. However, significant rejection of the null only occurs in 8 of 14 cases based on the more recent sample of data used in construction of the statistics reported in Tables 3 and 4, rather than 10 cases, as in Tables 1 and 2. Moreover, when the block length is increased from 5 to 10, the number of rejections of the null deceases almost to zero (see Table 4). Thus, while the point PDMSFE is lower in 12 of 14 cases, it is more difficult to discern a statistically significant difference between the SV and the CIR model when
using data from 1999–2008. Two points are worth mentioning in this regard. First, in Tables 3 and 4, the absolute magnitude of the SV PDMSFEs are actually substantively lower than those for the CIR model, when comparing CIR and SV models, just as they were when comparing CIR and SVJ models in Tables 1 and 2, suggesting that the reduction in rejections when increasing the block length in Table 4 may be due in part to size bias in the case of the longer block length. Second, and more important, regardless of the above findings, it is very clear that the selection of PDMSFE-best model is indeed dependent upon the sample period used to construct predictive densities. While the one factor model generally performs worse than the other two models, whether or not jumps improve model performance depends on the sample period being investigated. Thus, different sample periods do not result in the same model being chosen, which is not surprising, given that the extant empirical evidence concerning which model to use when examining interest rates is rather mixed.5 In Figs. 1 and 2, predictive densities are plotted for various evaluation points given a particular set of recursively estimated
5 Note that SV is nested by SVJ. However, it neither nests or is nested by CIR, and hence the ‘‘nestedness’’ assumptions made in the statements of the theorems and corollaries above are not violated. Additionally, note that one might be tempted to think that if there is a model that outperforms CIR, this should be SVJ, as SVJ nests SV. However, as we are performing true ex-ante prediction experiments using predictive densities, this is clearly not the case; more parsimonious models may perform better, particularly if they are ‘‘better approximations’’ of the true underlying DGP.
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
(a) Evaluation point = 0.020
317
(b) Evaluation point = 0.035
(c) Evaluation point = 0.050 Fig. 2. Predictive densities for CIR, SV and SVJ models — 01:1999–04:2008.
parameters (chosen to illustrate the variety of predictive densities that arise, in practical applications). Evaluation points are chosen to be equal to the mean of the data and various points around the mean. Fig. 1 reports densities for our first sample period and Fig. 2 for our second sample period. Notice that a model yielding a density centered around the evaluation point is preferred, assuming that it yields predictions with equal or less dispersion than its competitor model. Interestingly, in Fig. 1 it is quite apparent that the SVJ model is preferred, although none of the models is particularly well centered for evaluation points not equal to the mean of 0.055. In Fig. 2, where results are reported for the second sample period, the models are well-centered around the evaluation point, even for points that are relatively distant from the mean (see Fig. 1(a) and (c)). Moreover, in this particular set of plots, the SV model is clearly dominant, as it yields densities that are better centered and exhibit much less dispersion.
Proof of Theorem 1. (i) We begin by analyzing the term in the test statistic that is associated with model 1. Without loss of generality and for the sake of brevity, set u1 = −∞ and u2 = u. Consider: T −τ 1 −
√
P t =R
+√
N 1 −
N i =1
P t =R
2
Ď
−
ϑ 1{X1,1t +τ ,i (Xt ) T −τ 2 −
+√
×
≤ u})
N 1 −
N i=1
P t =R
ϑ
t ,N ,h (1{X1,1t,+τ ,i (Xt ) ≤ u}
N 1 −
N i=1
ϑ t ,N ,h (1{X1,1t,+τ ,i (Xt )
Ď
ϑ 1{X1,1t +τ ,i (Xt )
≤ u} − 1{Xt +τ ≤ u}
Ď
≤ u} −
ϑ 1{X1,1t +τ ,i (Xt )
≤ u})
= IP ,N ,h + IIP ,N ,h + IIIP ,N ,h .
(10)
Now, T −τ 1 − I P ,N ,h = √ ((1{Xt +τ ≤ u} − F0 (u|Xt )) + (F0 (u|Xt ) P t =R
Appendix
T −τ 1 −
N 1 −
N i =1
T −τ 1 −
= √
P t =R
1{X1,t +τ ,i (Xt ) ≤ u} − 1{Xt +τ ≤ u}
N 1 −
N i=1
≤ u} − 1{Xt +τ ≤ u}
(u|Xt )))2 + op (1),
∑N
i=1
N /P → ∞,
2
Ď
ϑ 1{X1,1t +τ ,i (Xt )
Ď
ϑ X1,1t +τ
as E ( N1
2
ϑ1,t ,N ,h
−F
ϑ
Ď
1{X1,1t +τ ,i (Xt ) ≤ u} − F
Ď
ϑ X1,1t +τ
(u|Xt )) = 0; and for
T −τ N Ď 1 − 1 − ϑ 1{X1,1t +τ ,i (Xt ) ≤ u} − F ϑ Ď (u|Xt ) = op (1). √ P t =R N i=1 X1,1t +τ
318
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
Letting µF1 = E (F
Ď
ϑ X1,1t +τ
(F
IIIP ,N ,h = √ P t =R
×
Ď
ϑ X1,1t +τ
(u|Xt ) − 1{Xt +τ ≤ u})
T −τ µF − = √1
ϑ
N i =1
P t =R
ϑ t ,N ,h (X1,1t,+τ ,i (Xt )
− F
Ď
ϑ X1,1t +τ
−
N 1 −
((u −
N 1 −
N i=1
ϑ
≤ u} Now, let µf ,θ Ď (u)′ = EX (f k k
Ď
T −τ 1 − ((F ϑ Ď (u|Xt ) − F0 (u|Xt ))2 Dk,P ,N (u) = √ P t =R X1,1t +τ
ϑ X1,1t +τ ,i (Xt ))} Ď
−
ϑ X1,1t +τ ,i (Xt )))|Xt )
T −τ N µF − 1 − + √1 (F
Ď
ϑ X1,1t +τ
P t =R N i =1
Ď ϑ1
− X1,t +τ ,i (Xt )))|Xt ) − F
− (F
≤ u} − F
((u −
Ď
ϑ X1,1t +τ
√
P t =R
N 1 −
N i=1
(u|Xt )
× (F
(u|Xt )) + op (1).
Ď
ϑ X1,1t +τ
ϑ
ϑ 1,t ,N ,h
T −τ 1 − Dk,P ,N (u) = √ ((F
Ď
ϑ X1,1t +τ
P t =R Ď
P t =R Ď
P t =R
1
T −1 1 −
− µFk µfk ,ϑ Ď (u)AĎk √
P t =R
k
ψ1,t ,N ,h (ϑ1Ď ) ψk,t ,N ,h (ϑkĎ ) + oP (1).
d
It then follows that Dk,P ,N (u) → N (0, Wk (u)), where Wk (u) = C (u) + V (u) + 2CV (u) + P11 (u)
Ď
ϑ1Ď ),
+ Pkk (u) − P1k (u) + P1 C (u) − Pk C (u) (12)
and where, recalling A4, C (u) =
∞ −
E (((F
j=0
− (F
Ď
ϑ Xk,k1+τ
V (u) =
∞ −
T −τ 2 −
(u|Xt ))) + µF1 √
P t =R
(u|X1 ) − F0τ (u|X1 ))2
(u|X1 ) − F0τ (u|X1 ))2 )((F
Ď
ϑ X1,1t +τ
N 1 −
N i =1
(u|Xt )
f1 ((u
−F
Ď
ϑ Xk,k1+τ
Ď
ϑ Xk,k1+j+τ
Ď
ϑ X1,11+j+τ
Ď
ϑ Xk,k1+j+τ
(u|X1+j ))))
(u|X1+j )
(u|X1+j ) − F0τ (u|X1+j ))2 ))
(u|X1 )))((F0τ (u|X1+j )
− 1{X1+j+τ ≤ u})(F −F
Ď
ϑ X1,11+j+τ
E (((F0τ (u|X1 ) − 1{X1+τ ≤ u})(F
j =0
((F0 (u|Xt ) − 1{Xt +τ ≤ u})(F
Ď
ϑ X1,11+τ
− F0τ (u|X1+j ))2 − (F
(u|Xt ) − F0 (u|Xt ))2
T −τ 1 −
ϑ Xk,kt +τ
(u|Xt )))
+ µF1 µf1 ,ϑ Ď (u)AĎ1 √
(11)
(u|Xt ) − F0 (u|Xt ))2 )
+√
−F
Ď
ϑ Xk,kt +τ
+ P1 V (u) − Pk V (u), ( ϑ1,t ,N ,h −
ϑ Xk,kt +τ
(u|Xt ) − F
T −1 1 −
where f1 (·|Xt ) denotes the conditional density under model 1. Finally, IIP ,N ,h is oP (1), given that it is of smaller order than the other two terms on the RHS of (10). By treating model k in the same manner as model 1, we have that,
− (F
Ď
ϑ X1,1t +τ
ϑ t ,N ,h (X1,1t,+τ ,i (Xt )
×
((F0 (u|Xt ) − 1{Xt +τ ≤ u})
P t =R
f1 ((u − (X1,t +τ ,i (Xt ) − X1,1t +τ ,i (Xt )))|Xt )
ϑ ,t ,N ,h ∇θ1 X1,1t +τ ,i (Xt )
(u|Xt ) − F0 (u|Xt ))2 )
+√
By arguments similar to those used in the proof of Proposition 1 in Corradi and Swanson (2005b), the first term of the last equality on the RHS of (11) is oP (1). Now, by taking a mean value expansion Ď around ϑ1 , it is easy to see that the second term of the last equality on the RHS of (11) can be written as: T −τ 1 −
Ď
ϑ Xk,kt +τ
T −τ 1 −
Ď
ϑ 1{X1,1t +τ ,i (Xt )
,t ,N ,h ′ (u|Xt )EN (∇θk Xk,tk+τ ,i (Xt )) ), where
EX denotes expectation with respect to the probability measure governing the data and EN denotes expectation with respect to the probability measure governing the simulated data. Thus, given Assumption A4:
1{X1,1t +τ ,i (Xt ) ≤ u
ϑ t ,N ,h (X1,1t,+τ ,i (Xt )
ϑ
Ď
ϑ X k 1,t +τ
Ď
−
× ( ϑk,t ,N ,h − ϑkĎ ) + oP (1).
− 1{X1,1t +τ ,i (Xt ) ≤ u}) + oP (1) T −τ µF − = √1
ϑ k,t ,N ,h
fk ((u − (Xk,t +τ ,i (Xt )
ϑ ϑ k,t ,N ,h Xk,kt +τ ,i (Xt )))|Xt )∇θk Xk,t +τ ,i (Xt )′
−
Ď
N i=1
P t =R
+ oP (1)
ϑ t ,N ,h (1{X1,1t,+τ ,i (Xt )
N i =1
N 1 −
Ď
N 1 −
P t =R
− µF k √
Ď
1
T −τ 2 −
N i=1 ϑ
Ď
1,t +τ ,i
1
N 1 − ϑ t ,N ,h (1{X1,1t,+τ ,i (Xt ) ≤ u}
− 1{X1,1t +τ ,i (Xt ) ≤ u})
−
ϑ
ϑ
,t ,N ,h 1 − (X1,1t +τ ,i (Xt ) − X1,t +τ ,i (Xt )))|Xt ) ϑ × ∇θ X 1,t ,N ,h (Xt )′ ( ϑ1,t ,N ,h − ϑ Ď )
T −τ 1 −
(u|Xt ) − 1{Xt +τ ≤ u}),
(u|X1+j )
Ď
ϑ X1,11+τ
(u|X1 )
(13)
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
CV (u) =
∞ −
E (((F
Ď
θ X1,11+τ
j =0
− (F
Ď
θ Xk,k1+τ
(u|X1 ) − F0 (u|X1 ))2 )((F0τ (u|X1+j ) Ď
θ X1,11+j+τ
−F
Ď
a.s., but they are different from F0 (u|Xt ) over a set of strictly positive probability, and so µF1 = µFj but they are both different from zero, so that the last two terms on RHS of (13) do not vanish. Concluding, if models j = 2, . . . , m/2 are nested with the benchmark, and correctly specified, then DMax k,P ,N (u1 , u2 ) = maxk=2,...,m Dk,P ,N (u1 , u2 ) = maxk=m/2+1,...,m Dk,P ,N (u1 , u2 ) + op (1). As the asymptotic covariance of (14) has to be positive semidefinite, it suffices that at least one competing model is not nested with the benchmark.
(u|X1 ) − F0 (u|X1 ))2
− 1{X1+j+τ ≤ u})(F θ Xk,k1+j+τ
(u|X1+j )
(u|X1+j ))))
P11 (u) = 4Π µ2F1 (u)µ′
Ď
f1 ,ϑ1
(u)(AĎ1 V1Ď AĎ1 )µf1 ,ϑ Ď
Proof of Theorem 2. As before, set u1 = −∞ and u2 = u. We begin by analyzing the term in the test statistic that is associated with model 1, which can be written as: 2 T −τ N − 1 − ϑ1,t ,N ,h 1 = √ 1{X1,t +τ ,i (Xt∗ ) ≤ u} − 1{Xt∗+τ ≤ u}
1
′ Ď
P1k (u) = 8Π µF1 (u)µf ,ϑ Ď (u) A1 1 1
×
∞ −
Ď
Ď′
E (ψ1,1 (θ1 )ψk,1+j (θk )′ )Ak µf ,ϑ Ď (u)µFk , k k
j =0
P1 C (u) =
Ď
∞ −
Ď 4Π µF1 (u)µf ,ϑ Ď (u)A1 1 1
− (F
(u|X1+j ) − F0 (u|X1+j )) )),
Ď ϑ Xk,k1+j+τ
−
j =0
(u|X1+j ) − F0 (u|X1+j ))2
Ď ϑ X1,11+j+τ
T 1−
T j=1
T −τ 1 −
2
N 1 −
N i=1
+√
ϑ1,t ,N ,h 1{X1,t +τ ,i (Xj )
N 1 −
Ď
2
T −τ 1 −
Ď
−F
Ď
θ Xk,k1+j+τ
(ii) Under HA , √1
(u|X1+j )
×
ϑ1∗,t ,N ,h 1{X1,t +τ ,i (Xt∗ )
−
N i=1
ϑ1,t ,N ,h
1{X1,t +τ ,i (Xt∗ ) ≤ u} − 1{Xt∗+τ ≤ u}
≤ u} −
ϑ1,t ,N ,h 1{X1,t +τ ,i (Xt∗ )
≤ u}
(u|X1+j )))),
t =R ((F
∑T −τ
P
Ď
1
N
N 1 −
N i=1
P t =R
j =0 θ X1,11+j+τ
+2√
E (ψ1,1 (θ1 )((F0τ (u|X1+j )
− 1{X1+j+τ ≤ u})(F
,t ,N ,h ∗ − 1{X1,1t +τ ,i (Xt ) ≤ u})
P1 V1 (u) = 4Π µF1 (u)µf ,θ Ď (u)′ A1 1 1
×
≤ u} − 1{Xj+τ ≤ u}
ϑ∗
ϑ
∞ −
2
,t ,N ,h ∗ (1{X1,1t +τ ,i (Xt ) ≤ u}
N i=1
P t =R
and
− F0 (u|Xt )) ) diverges at rate plus or minus infinity.
. (15)
(u|Xt ) − F0 (u|Xt ))2 − (F ϑ Ď (u|Xt ) Xk,kt +τ √
Ď ϑ X1,1t +τ
2
P. This drives the statistic to either
First, note that: 2 T −τ N 1 − ∗ 1 − ϑ1,t ,N ,h ∗ ∗ = √ E 1{X1,t +τ ,i (Xt ) ≤ u} − 1{Xt +τ ≤ u} N i=1
P t =R
T −τ
1 −
Proof of Corollary 1. For any given k, the limiting distribution of Dk,P ,N (u1 , u2 )−µk (u1 , u2 ) follows from inspection of Theorem 1(i). Also, by the Cramer–Wold device,
= √
T
1− T j=1
P t =R
N 1 −
N i=1
+ O(l/P 1/2 ) Pr −P .
− 1{Xj+τ ≤ u2 }
(Dm,P ,N (u1 , u2 ) − µm (u1 , u2 )))
(14)
converges to a (m − 1)-dimensional mean zero Gaussian random variable with covariance matrix that has kk element given by Wk (u1 , u2 ), as defined in the statement of Theorem 1(i). The statement in the corollary then follows as a straightforward consequence of the Cramer–Wold device and the continuous mapping theorem. Theorem 1 considers the case of two nonnested models. We then need to study what happens if a subset of models is instead nested with the benchmark. For notational simplicity, hereafter let u1 = −∞ and u2 = u. Suppose that models 2, . . . , m/2 are nested with model 1, while models m/2 + 1, . . . , m are nonested with 1, so that under the null, for j = 2, . . . , m/2, F ϑ Ď (u|Xt ) = F ϑ Ď (u|Xt ) a.s., and then the first two terms on j
Xj,t +τ
the RHS of (13) are almost sure zero. Further, if the nested models are also correctly specified, then F ϑ Ď (u|Xt ) = F ϑ Ď (u|Xt ) = X1,1t +τ
j
Xj,t +τ
F0 (u|Xt ) a.s. implying µF1 = µFj = 0, and so the last two terms on RHS of (13) are also zero. Hence, if model j and 1 are nested, and correctly specified, Dj,P ,N (u) = op (1) and µj (u) = 0. On the other hand, if the nested models are not correctly specified for the conditional distribution, then F ϑ Ď (u|Xt ) = F ϑ Ď (u|Xt ) X1,1t +τ
j
Xj,t +τ
ϑ1,t ,N ,h
1{X1,t +τ ,i (Xj ) ≤ u}
2
((D2,P ,N (u1 , u2 ) − µ2 (u1 , u2 )), . . . ,
X1,1t +τ
N i=1
P t =R
Ď E (ψ1,1 (θ1 )
× ((F
319
Also, by the same arguments as those used in the proof of Theorem 4 in Bhardwaj et al. (2008), 2 T −τ N − 1 − ϑ1,t ,N ,h ∗ 1 ∗ ∗ Var 1{X1,t +τ ,i (Xt ) ≤ u} − 1{Xt +τ ≤ u} √ N i=1
P t =R
= Var
T −τ 1 −
√
P t =R
− 1{Xt +τ
N 1 −
N i=1
ϑ
Ď
1{X1,1t +τ ,i (X ) ≤ u}
2 ≤ u} + O(l/P 1/2 ) Pr −P .
Thus, from Theorem 3.5 in Künsch (1989), it follows that the first term on the RHS of the last equality in (15) has the same limiting distribution as: T −τ 1 −
√
P t =R
N Ď − ϑ 1 1{X1,1t +τ ,i (Xt ) ≤ u} − 1{Xt +τ ≤ u}
− E
2
N i =1
N 1 −
N i=1
Ď
ϑ 1{X1,1t +τ ,i (Xt )
≤ u} − 1{Xt +τ
2 ≤ u} .
320
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
Now,
∑N
1 N
Ď ϑ1
i =1
1{X1,t +τ ,i (Xt ) ≤ u} − F
Ď
ϑ X1,1t +τ
(u|Xt ) = ON (N −1/2 ), and
as N /P → ∞, the third term on the RHS of (15) can be written as: T −τ 1 −
2µF1 √ P t =R
N 1 −
N i=1
ϑ ∗ t ,N ,h ∗ (1{X1,1t,+τ ,i (Xt )
T −τ 1 −
≤ u}
D∗j,P ,N (u) √
ϑ
where µF1 = E (F
√
N 1 −
N i=1
P t =R
(16)
ϑ
−
ϑ1,t ,N ,h
1{X1,t +τ ,i (Xt ) ≤ u − (
ϑ
ϑ∗
ϑ1,t ,N ,h X1,t +τ
ϑ1,t ,N ,h X1,t +τ ,i (Xt )))|Xt
ϑ1,t ,N ,h X1,t +τ
−
− F
N 1 −
N i=1
P t =R N i=1
(F
ϑ1,t ,N ,h
− X1,t +τ ,i (Xt )))|Xt ) − F
≤ u}
((u
(u|Xt )).
+
(17)
By the same argument as that used in the proof of Theorem 1(i): T −τ 1 − = √
N 1 −
N i =1
P t =R
ϑ t ,N ,h − X1,1t,+τ ,i (Xt ))}
1{X1,t +τ ,i (Xt ) ≤ u − (
−
( Xt ) −
−F
ϑ1,t ,N ,h X1,t +τ
−
ϑ1∗,t ,N ,h X1,t +τ ,i
ϑ1,t ,N ,h
ϑ1,t ,N ,h X1,t +τ ,i (Xt )))|Xt )
−
((u − N 1 −
N i=1
ϑ1,t ,N ,h X1,t +τ
(u|Xt )
ϑ ∗ t ,N ,h (X1,1t,+τ ,i (Xt )
− ϑ1,t ,N ,h 1{X1,t +τ ,i (Xt )
≤ u}
Ď
ϑ P t =R N i=1 X1,1t +τ − F ϑ Ď (u|Xt ))
ϑ
T ι=1
ϑj,t ,N ,h 1{Xj,t +τ ,i (Xt∗ )
T ι=1
T 1−
N 1 −
N 1 −
N i =1
ϑ1,t ,N ,h
1{X1,t +τ ,i (Xι ) ≤ u}
2 ≤ u} −
ϑj,t ,N ,h 1{Xj,t +τ ,i (Xι )
N i=1
1{Xt∗+τ
≤ u}
≤ u} − 1{Xι+τ
2 ≤ u} .
−
Ď ϑ1
t ,N ,h ((u − (X1,1t,+τ ,i (Xt ) − X1,t +τ ,i (Xt )))|Xt )
1
X1,t +τ
and the statement then follows by the same argument as that used in Theorem 1(i). Proof of Corollary 2. Given Theorem 2, for the case of nonnested models, the result follows directly upon application of the Cramer–Wold device and the continuous mapping theorem. It remains to show that whenever Dj,P ,N (u) = op (1), because of model j and 1 are nested and correctly specified, then also D∗j,P ,N (u) = o∗p (1), conditional on the sample. Now, given (16), it is immediate to see that whenever µF1 = µFj = 0, then the
ϑj,t ,N ,h Xj,t +τ
(F
T ι=1
ϑj,t ,N ,h Xj,t +τ
T −τ 2−
(((F
T 1−
T ι=1
ϑj,t ,N ,h Xj,t +τ
ϑ1,t ,N ,h X1,t +τ
T 1−
T ι=1
ϑ1,t ,N ,h X1,t +τ
(u|Xt ) − 1{Xt +τ ≤ u})2
(u|Xt ) − 1{Xt +τ ≤ u})2
(u|Xι ) − 1{Xι+τ ≤ u})2 )2
ϑj,t ,N ,h Xj,t +τ
(F
((F
(u|Xι ) − 1{Xι+τ ≤ u})2 )2
((F
T 1−
P t =R
P t =R
ϑ1,t ,N ,h X1,t +τ
T −τ 1−
P t =R
T −τ 1−
(F
T ι=1
× ((F
= oP ∗ (1) Pr −P .
Finally, the last term on the RHS of (17), conditional on the sample, and for all samples except a set with probability measure approaching zero, has the same limiting distribution as: N T −τ 1 − 1 − (F √
N 1 −
T 1−
− F
−
T 1−
2 ≤ u}
var∗ ( D∗j,P ,N (u)) =
−
ϑ1,t ,N ,h X1,t +τ
ϑ1,t ,N ,h X1,t +τ
Now, a straightforward calculation gives that
ϑ1,t ,N ,h 1{X1,t +τ ,i (Xt )
T −τ N 1 − 1 −
(u|Xt )) + √
ϑ ∗ t ,N ,h − (X1,1t,+τ ,i (Xt )
−
( Xt )
t ,N ,h ((u − (X1,1t,+τ ,i (Xt )
t ,N ,h − X1,1t,+τ ,i (Xt ))} − F
−
ϑ1∗,t ,N ,h X1,t +τ ,i
N i =1
≤ u}
N i=1
N 1 −
N i =1
P t =R
1{Xt∗+τ
− 1{Xι+τ
ϑ∗
t ,N ,h ∗ − 1{X1,1t,+τ ,i (Xt ) ≤ u})
T −τ 1 − = √
−
t ,N ,h ∗ (1{X1,1t,+τ ,i (Xt ) ≤ u}
N − ϑ1,t ,N ,h 1 1{X1,t +τ ,i (Xt∗ ) ≤ u}
2
(u|Xt ) − 1{Xt +τ ≤ u}). Now,
Ď
ϑ X1,1t +τ
P t =R
t ,N ,h ∗ ∗ − 1{X1,1t,+τ ,i (Xt ) ≤ u}) + op (1) Pr −P ,
T −τ 1 −
bootstrap counterpart of the parameter estimation error is op∗ (1), Pr −P . Now, recalling (15) the bootstrap counterpart of Dj,P ,N (u) for the case of vanishing parameter estimation error writes as:
(u|Xt ) − 1{Xt +τ ≤ u})2
(u|Xι ) − 1{Xι+τ ≤ u})2 )
(u|Xt ) − 1{Xt +τ ≤ u})2
(F
ϑ1,t ,N ,h X1,t +τ
(u|Xι ) − 1{Xι+τ ≤ u})2 ))
√ + Op (l/ P ) + oN (1), which is op (1), as F
Ď
ϑ X1,1t +τ
(u|Xt ) = F
Ď
ϑj Xj,t +τ
(u|Xt ) = F0 (u|Xt ), and
ϑ1,t ,N ,h and ϑj,t ,N ,h are consistent for ϑ1Ď and ϑjĎ , respectively. ∗ Hence, Dj,P ,N (u) = o∗p (1). Proof of Theorem 3. We begin by analyzing the term in the test statistic that is associated with model 1. Without loss of generality and for the sake of brevity, we yet again set u1 = −∞ and u2 = u. Consider: T −τ 1 −
√
P t =R
S N 1 −−
NS j=1 i=1
T −τ 1 −
= √
P t =R
θ1,t ,N ,Sh 1{X1,t +τ ,i,j (Xt )
S N 1 −−
NS j=1 i=1
Ď
2 ≤ u} − 1{Xt +τ ≤ u}
θ 1{X1,1t +τ ,i,j (Xt )
2 ≤ u} − 1{Xt +τ ≤ u}
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
T −τ 1 −
+√
Step 3:
θ
t ,N ,S ,h (1{X1,1t,+τ ,i,j (Xt ) ≤ u}
NS j=1 i=1
P t =R θ
S N 1 −−
T 1 − sup √ |∇ϑ LNt (ϑ) − ∇ϑ Lt (ϑ)| = op (1). ϑ∈Nϑ Ď P t =R
2
Ď
− 1{X1,1t +τ ,i,j (Xt ) ≤ u})
T −τ 2 −
+√
S N 1 −−
− 1{Xt +τ ≤ u}
θ
T 1 − sup √ |∇ϑ LNt,h (ϑ) − ∇ϑ LNt (ϑ)| = op (1). ϑ∈N Ď P t =R ϑ
Ď
1{X1,1t +τ ,i,j (Xt ) ≤ u}
k
S N 1 −−
NS j=1 i=1
ϑ t ,N ,S ,h (1{X1,1t,+τ ,i,j (Xt )
≤ u}
Ď
θ 1{X1,1t +τ ,i,j (Xt )
−
Step 4:
NS j=1 i=1
P t =R
≤ u})
= IP ,N ,S ,h + IIP ,N ,S ,h + IIIP ,N ,S ,h . The statement follows by the same argument as that used in Theorem 1, as by Proposition 5 in Bhardwaj et al. (2008), for S /P → ∞ and N /P → ∞,
T −τ S N Ď 1 − 1 −− θ 1{X1,1t +τ ,i,j (Xt ) ≤ u} √ P t =R NS j=1 i=1 Ď
− F
Ď
ϑ X1,1t +τ
θ (u|Xt , Vj,h1 )
= op (1).
Proof of Theorem 4. Since S /T → ∞, we do not need to resample the initial value of volatility, and the statement thus follows by the same argument as that used in Theorem 2. For notational simplicity, in the proof of Theorems 5–8 below, we drop the subscript k, as the arguments used in this proof are the same for all k. Proof of Theorem 5. Define,
fN (Xt |Xt −1 , ϑ) =
N 1 −
N ξN i=1
K
Xtϑ,i (Xt −1 ) − Xt
ξN
t 1−
t j =1
ϑ∈Θ
,
and
ln fk,N ,h (Xj |Xj−1 , ϑ)
× τN ( fk,N ,h (Xj |Xj−1 , ϑ)), LNt (ϑ) =
t 1−
t j =1
ln fk,N (Xj |Xj−1 , ϑ)τN ( fk,N (Xj |Xj−1 , ϑ)),
(18) (19)
and Lt (ϑ) =
t 1−
t j =1
ln f (Xj |Xj−1 , ϑ),
(20)
where Lt (ϑ) is the pseudo true density under Pθ . We organize the proof into four steps. Steps 1 and 2 suffice for the statement in (i) to hold. Step 1: sup sup |LNt (ϑ) − Lt (ϑ)| = op (1).
ϑ∈Θ t ≥R
Step 2: sup sup |LNt,h (ϑ) − LNt (ϑ)| = op (1).
ϑ∈Θ t ≥R
Proof of Steps 1 and 3. We first need to show that our assumptions imply the assumptions in Theorems 1.1 and 1.2 in Fermanian and Salanié (2004), and then we outline which steps in their proofs have to be modified in order to take into account the fact that Xt is β mixing (instead of iid) and the fact that our estimator is recursive. Then, the statements in Steps 1 and 3 will follow directly from their Theorems 1.1 and 1.2. Now, Assumption A8 implies K1, in Fermanian and Salanié (2004). A1(ii)–(iii) and Assumptions A6 and A7 imply L1 and L2, with β = r, and L3, with γ = γ ′ = r > 4 in Fermanian and Salanié (2004). A3′ implies M1 with s0 = 0, and M2 with r0 = s1 = 0 and p0 = ζ = r > 4, in Fermanian and Salanié (2004). It remains to check that the rate conditions T1, R1, T2, R2 and R3 in Fermanian and Salanié (2004) are implied by the rate conditions in the statement of the theorem. First, recall that T , R, P grow at the same rate, given 0 < π < ∞ and N = T a , ∑T a > 1. Given A1(iii), Pr(supt |Xt | > ε T a ) ≤ t =1 Pr(|Xt | > ε T a ) ≤ ε1r T 1−ar E (|Xt |r ), and as a > 1 and r > 4, and so (c) in the statement of the theorem implies T2 (and hence T1) in Fermanian and Salanié (2004) for v = 1 and γ = γ ′ = ζ = r > 4. Condition (a) corresponds to R3 in Fermanian and Salanié (2004), for γ = r. Finally, (c) and (b) are equivalent to R2 in Fermanian and Salanié (2004), for m = 1 and r0 = 0. As the proof in Fermanian and Salanié (2004) is based on the rate at which 1{‖Xt , Xt −1 ‖ < N } sup | ln fN (Xt |Xt −1 , ϑ) − ln f (Xj |Xj−1 , ϑ)|
where Xtϑ,i (Xt −1 ) is the i-th simulated value, when starting the path at Xt −1 , for the case in which there is no discretization error (i.e. for the case in which we could generate continuous paths), and define: LNt,h (ϑ) =
321
1{‖Xt , Xt −1 ‖ > N } supϑ∈Θ | ln fN (Xt |Xt −1 , ϑ)| approach zero, the fact that we are estimating parameters in a recursive manner plays no role. On the other hand, the iid assumption is used in the exponential inequalities in the proof of Lemma 1 and Theorem 1.1 in Fermanian and Salanié (2004). However, given the geometric β mixing assumption in A1(i), the rate in the exponential (Bernstein and Hoeffding) inequalities is slower than in the iid case, only up to a logarithmic term (see e.g. Doukhan, 1995, p. 33–36). Thus, consistency follows from their Theorem 1.1 and asymptotic normality from their Theorem 1.2. Hence, it remains to prove Step 4, as Step 2 follows by the same argument. Proof of Step 4. T 1 − sup √ |∇ϑ LNt,h (ϑ) − ∇ϑ LNt (ϑ)| ϑ∈N Ď P t = R ϑ k
T t 1 − 1 − ≤ sup √ τN ( fN (Xj |Xj−1 , ϑ)) ϑ∈N Ď P t =R t j =1 ϑk
×
1
fN (Xj |Xj−1 , ϑ) ∂ fN ,h (Xj |Xj−1 , ϑ) ∂ fN (Xj |Xj−1 , ϑ) × − ∂ϑ ∂ϑ t 1 − τN ( fN ,h (Xj |Xj−1 , ϑ)) τN ( fN (Xj |Xj−1 , ϑ)) + − t j =1 fN ,h (Xj |Xj−1 , ϑ) fN (Xj |Xj−1 , ϑ)
322
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
∂ fN ,h (Xj |Xj−1 , ϑ) × ∂ϑ t 1 − ∂ fN ,h (Xj |Xj−1 , ϑ) + τN′ ( fN ,h (Xj |Xj−1 , ϑ)) t j =1 ∂ϑ × ln fN ,h (Xj |Xj−1 , ϑ) t 1 − ∂ fN (Xj |Xj−1 , ϑ) + τN′ ( fN (Xj |Xj−1 , ϑ)) t j =1 ∂ϑ × ln fN (Xj |Xj−1 , ϑ)
same argument as used in the study of the term A4 in the proof of Theorem 2.2 in Fermanian and Salanié (2004). Proof of Theorem 6. Define, L∗t ,Nh (θ )
Lt (θ ) = ∗
ϑ
by Theorem 2.3 in Pardoux and Talay (1985) that E ((Xjϑ,i,h (Xj−1 ) − Xjϑ,i (Xj−1 ))2 ) = O(h2 ). Thus,
√
t j =1
T −
T i=1
N ϑ Li,h
∇
( ϑ t ,N ,h )
ln f (Xj |Xj∗−1 , ϑ) ∗
−ϑ
′1
T −
T i =1
∇ϑ Li ( ϑt )
Step 1: sup sup |L∗t ,Nh (ϑ) − L∗t (ϑ)| = op∗ (1).
ϑ∈Θ t ≥R
Given Steps 1 and 2, the desired outcome follows from Theorem 1 in Corradi and Swanson (2007). Proof of Step 1. Given the definition of L∗t ,Nh (ϑ) and L∗t (ϑ), and recalling that Θ is a compact set, it suffices to show that:
t 1 − (ln fN ,h (Xl∗ |Xl∗−1 , ϑ)τN ( arg max sup fN ,h (Xl∗ |Xl∗−1 , ϑ)) ϑ∈Θk t ≥R t l =2 ∗ ∗ − ln f (Xl |Xl−1 , ϑ)) = op∗ (1)
(23)
and sup (21)
t ≥R
T 1−
T i=1
(∇ϑ L∗i,Nh ( ϑt ,N ,h ) − ∇ϑ L∗i ( ϑt )) = op∗ (1).
(24)
Now, (23) follows from Steps 1 and 2 in the proof of Theorem 5, given that the only difference is that we evaluate the likelihood at the resampled observations. Note also that (24) is majorized by:
1 P sup sup t ≥R ϑ∈N Ď t
t − τN ( fN (Xj |Xj−1 , ϑ))( fN ,h (Xj |Xj−1 , ϑ) − fN (Xj |Xj−1 , ϑ))
fN (Xj |Xj−1 , ϑ) 1 ∂ ln fN ,h (Xj |Xj−1 , ϑ) √ × + P sup sup ∂ϑ t ≥R ϑ∈N Ď t ϑk t − τN ( fN (Xj |Xj−1 , ϑ)) − τN ( fN ,h (Xj |Xj−1 , ϑ)) × fN (Xj |Xj−1 , ϑ) j=1 ∂ ln fN ,h (Xj |Xj−1 , ϑ) × fN ,h (Xj |Xj−1 , ϑ) ∂ϑ √ = Op ( P ξN−(δ+3) h| ln ξNδ |).
t 1−
′1
ln fN ,h (Xj∗ |Xj∗−1 , ϑ)τN ( fN ,h (Xj∗ |Xj∗−1 , ϑ))
T 1 − |∇ϑ L∗t ,Nh (ϑ) − ∇ϑ L∗t (ϑ)| = o∗p (1). sup √ ϑ∈Nϑ Ď P t =R
ϑk
×
t j =1
Step 2:
and given A6, A2,T ,N ,h ≤
and let ϑt∗ = arg minϑ∈Θ L∗t (θ ). We organize the proof into two steps.
Now, note that X j,i,h (Xj−1 ) ∈ (Xjϑ,i,h (Xj−1 ), Xjϑ,i (Xj−1 )), and recall
A1,T ,N ,h
=
− ϑ
= A1,T ,N ,h + A2,T ,N ,h + A3,T ,N ,h + A4,T ,N ,h .
t 1 − √ 1 ≤ ξN−δ P sup sup t N ξN t ≥R ϑ∈N Ď j=1 ϑk N − Xjϑ,i,h (Xj−1 ) − Xj × ∇ϑ K ξN i=1 Xtϑ,i (Xj−1 ) − Xj − ∇ϑ K ξN t N 1 − √ 1 − 2 ≤ ξN−(δ+3) P sup sup ∇ϑ K t ≥R ϑ∈N Ď t j=1 N i=1 ϑk Xjϑ,i,h (Xj−1 ) − Xj × ϑ ξN X j,i,h (Xj−1 ) ϑ ϑ × (Xj,i,h (Xj−1 ) − Xj,i (Xj−1 )) √ −(δ+3) = Op ( P ξN h),
t 1−
j=1
T 1 − ∗N ∗ sup (∇ϑ Li,h (ϑt ,N ,h ) − ∇ϑ Li (ϑt ,N ,h )) t ≥R T i=1 T 1 − ∗ ∗ + sup (∇ϑ Li (ϑt ,N ,h ) − ∇ϑ Li (ϑt )) . t ≥R T i=1 The first term above is op∗ (1) as a direct consequence of Steps 3 and 4 in the proof of Theorem 5. The second term is majorized by
T 1 − 2 ∗ sup (∇ϑ Li (ϑ t ,N ,h )( ϑ t ,N ,h − ϑt )) t ≥R T i=1 ≤ sup t ≥R
(22)
Hence, A1,T ,N ,h and A2,T ,N ,h are op (1) because of (d). Finally, given the rate conditions in (a)–(c), A3,T ,N ,h and A4,T ,N ,h are oP (1), by the
T 1−
T i=1
|∇ϑ2 L∗i (ϑ t ,N ,h )| sup( ϑ t ,N ,h − ϑt ) = Op∗ (1)op (1). t ≥R
Proof of Step 2. Follows directly from Steps 2 and 4 in the proof Theorem 5.
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
Proof of Theorem 7. Let: N ,S
Lt ,h (θ ) =
t 1−
t l =2
− K
ln fN ,S ,h (Xl |Xl−1 , θ )τN ,S
= Op (ξN−(δ+1) S −1/2 ),
× ( fN ,S ,h (Xl |Xl−1 , θ )). N ,S
sup sup |Lt ,h (θ ) − LNt,h (θ )| = op (1)
(25)
θ∈Θ t ≥R
IIt ,N ,S ,h
and T 1 − sup √ |∇θ LNt ,,hS (θ ) − ∇θ LNt,h (θ )| = op (1). θ∈N Ď P t =R θ
The desired outcome then follows from Theorem 5. Note first that (25) can be written as:
(θ ) − (θ )| t 1 − = sup sup (ln fN ,S ,h (Xl |Xl−1 , θ )τN ,S ( fN ,S ,h (Xl |Xl−1 , θ )) θ∈Θ t ≥R t l=2 − ln fN ,h (Xl |Xl−1 , θ )τN ( fN ,h (Xl |Xl−1 , θ ))) t 1 − ≤ sup sup τN ,S ( fN ,S ,h (Xl |Xl−1 , θ ))(ln fN ,S ,h (Xl |Xl−1 , θ ) θ ∈Θ t ≥R t l=2 − ln fN ,h (Xl |Xl−1 , θ )) t 1 − + sup sup (τN ,S ( fN ,S ,h (Xl |Xl−1 , θ )) θ∈Θ t ≥R t l=2 − τN (fN ,h (Xl |Xl−1 , θ ))) ln fN ,h (Xl |Xl−1 , θ )
θ∈Θ t ≥R
LNt,h
= sup sup(It ,N ,S ,h + IIt ,N ,S ,h ). θ∈Θ t ≥R
Let f N ,S ,h (Xl |Xl−1 , θ ) note that for all i, j
K
Xlθ,i,h (Xl−1 ) − Xl
ξN
∈ ( fN ,S ,h (Xl |Xl−1 , θ ), fN ,h (Xl |Xl−1 , θ )), and
∫ =
K
Xlθ,i,h (Xl−1 , v θ ) − Xl
V
= ES K
ξN Xlθ,i,h (Xl−1 , v θ ) − Xl
ξN
fθ (v)dv
,
where ES denotes the expectation with respect to the simulated initial values of volatility. By a mean value expansion, I t ,N ,S ,h
t 1 − ≤ ξN−δ ( f (X |X , θ ) t l = 2 N ,S ,h l l − 1 − fN ,h (Xl |Xl−1 , θ )) t N 1 − 1 − = ξN−(δ+1) t l=2 N i=1 S Xlθ,i,h (Xj−1 , Vsθ ) − Xj 1− × K S s=1 ξN
t 1 − ≤ (X |X , θ )) ln fN ,h (Xl |Xl−1 , θ ) τ ′ (f t l = 2 N ,S N ,S ,h l l − 1 × ( fN ,S ,h (Xl |Xl−1 , θ ) − fN ,h (Xl |Xl−1 , θ )) = Op (ξN−(δ+1) S −1/2 | ln ξN−δ |),
(26)
k
sup sup |
uniformly in t and θ .
Also,
We show that:
N ,S L t ,h
323
Xlθ,i,h (Xj−1 ) − Xj ξN
uniformly in t and θ .
Given the rate condition in (e), this proves (25). Turning now to (26), note that after few simple manipulations: T 1 − sup √ |∇θ LNt ,,hS (θ ) − ∇θ LNt,h (θ )| θ∈N Ď P t = R θ k
T t fN ,S ,h (Xl |Xl−1 , θ )) 1 − 1 − τN ,S ( ≤ sup √ fN ,S ,h (Xl |Xl−1 , θ ) θ∈Nθ Ď P t =R t l =2
×
∂ fN ,h (Xl |Xl−1 , θ ) ∂ fN ,S ,h (Xl |Xl−1 , θ ) − ∂θ ∂θ
t 1 − ∂ fN ,h (Xl |Xl−1 , θ ) + τN ,S ( fN ,S ,h (Xl |Xl−1 , θ )) t l =2 ∂θ ( fN ,h (Xl |Xl−1 , θ ) − fN ,S ,h (Xl |Xl−1 , θ )) × fN ,S ,h (Xl |Xl−1 , θ ) fN ,h (Xl |Xl−1 , θ ) t 1 − (τ ( f (X |X , θ )) − τN ( fN ,h (Xl |Xl−1 , θ ))) + t l = 2 N ,S N ,S ,h l l − 1 ∂ ln fN ,h (Xl |Xl−1 , θ ) × ∂θ t 1 − + τ ( f (X |X , θ ))(ln fN ,S ,h (Xl |Xl−1 , θ ) t l = 2 N ,S N ,S ,h l l − 1 − ln fN ,h (Xl |Xl−1 , θ )) t 1 − + τ ′ ( f (X |X , θ )) t l = 2 N ,S N ,S ,h l l − 1 ∂ fN ,S ,h (Xl |Xl−1 , θ ) fN ,h (Xl |Xl−1 , θ ) × − ∂θ ∂θ × ln fN ,h (Xl |Xl−1 , θ ) t 1 − (τ ′ ( f (X |X , θ )) − τN′ ( fN ,h (Xl |Xl−1 , θ ))) + t l = 2 N ,S N ,S ,h l l − 1 fN ,h (Xl |Xl−1 , θ ) × ln fN ,h (Xl |Xl−1 , θ ) ∂θ = sup V1,T ,N ,S ,h (θ ) + V2,T ,N ,S ,h (θ ) + V3,T ,N ,S ,h (θ ) θ∈Nθ Ď
+ V4,T ,N ,S ,h (θ ) + V5,T ,N ,S ,h (θ ) + V6,T ,N ,S ,h (θ ) .
324
V. Corradi, N.R. Swanson / Journal of Econometrics 161 (2011) 304–324
References
Now, recalling Assumption A9, sup V1,T ,N ,S ,h (θ )
θ∈Nθ Ď
T 1 −
≤ sup ξN−(δ+2) √ θ∈Nθ Ď
P t =R
θ t N S θ 1 − 1 − 1 − ∂ Xl,i,h (Xj−1 , Vs ) × t l=2 N i=1 S s=1 ∂θ θ θ Xl,i,h (Xj−1 , Vs ) − Xj × K′ ξN ∂ Xlθ,i,h (Xj−1 ) ′ Xlθ,i,h (Xj−1 ) − Xj − K ∂θ ξN = Op (P 1/2 ξN−(δ+2) S −1/2 ) = op (1) because of (e). Then, sup V2,T ,N ,S ,h (θ )
θ∈Nθ Ď
r 1/r t 1 − ∂ fN ,h (Xl |Xl−1 , θ ) ≤ ξN sup sup ∂θ θ ∈Nθ Ď t ≥R t l=2 T t 1 − 1 − × sup √ fN ,h (Xl |Xl−1 , θ ) θ∈Nθ Ď P t =R t l =2 (r −1)/r r /(r −1) − fN ,S ,h (Xl |Xl−1 , θ ) √ −1/2 −(1+2δ) ), = OP ( PS ξN −2 δ
and so sup V3,T ,N ,S ,h (θ )
θ∈Nθ Ď
T t 1 − 1 − ≤ sup √ (τN ,S ( fN ,S ,h (Xl |Xl−1 , θ )) θ∈Nθ Ď P t =R t l =2
∂ ln f (Xl |Xl−1 , θ ) − τN ( fN ,h (Xl |Xl−1 , θ ))) (1 + op (1)) ∂θ √ −1/2 −(1+δ) = Op ( PS ξN ). By a similar argument as that √ used in the proof of (25), −(δ+1) supθ∈N Ď V4,T ,N ,S ,h (θ ) = Op (ξN PS −1/2 ); V5,T ,N ,S ,h (θ ), (other θ than a log term), can be treated as V1,T ,N ,S ,h (θ ), and so supθ∈N Ď −(δ+2)
θ
V5,T ,N ,S ,h (θ ) = Op (P 1/2 ξN S −1/2 | ln ξN−δ |). Finally, by a similar argument as that used to examine V3,T ,N ,S ,h (θ ): sup V5,T ,N ,S ,h (θ )
θ∈Nθ Ď
T t 1 − 1 − ′ ≤ sup √ (τN ,S ( fN ,S ,h (Xl |Xl−1 , θ )) θ∈Nθ Ď P t =R t l =2
− τN′ ( fN ,h (Xl |Xl−1 , θ ))) ×
f (Xl |Xl−1 , θ )
ln fN ,h (Xl |Xl−1 , θ ) (1 + op (1))
∂θ √ −1/2 −(1+2δ) = Op ( PS ξN ).
Proof of Theorem 8. Follows immediately, given Theorem 7, and by the same arguments as those used in the proof of Theorem 6.
Altissimo, F., Mele, A., 2009. Simulated nonparametric estimation of dynamic models with application in finance. Review of Economic Studies 76, 413–450. Aït-Sahalia, Y., 1996. Testing continuous time models of the spot interest rate. Review of Financial Studies 9, 385–426. Aït-Sahalia, Y., 2002. Maximum likelihood estimation of discretely sampled diffusions: a closed form approximation approach. Econometrica 70, 223–262. Aït-Sahalia, Y., Fan, J., Peng, H., 2009. Nonparametric transition-based tests for diffusions. Journal of the American Statistical Association 104, 1102–1116. Bai, J., 2003. Testing parametric conditional distributions of dynamic models. Review of Economics and Statistics 85, 531–549. Bandi, F.M., Reno, R., 2008. Nonparametric Stochastic Volatility. Working Paper, University of Chicago. Bhardwaj, G., Corradi, V., Swanson, N.R., 2008. A simulation based specification test for diffusion processes. Journal of Business and Economic Statistics 26, 176–193. Bontemps, C., Meddahi, N., 2005. Testing normality: a GMM approach. Journal of Econometrics 124, 149–186. Carrasco, M., Chernov, M., Florens, J.P., Ghysels, E., 2007. Efficient estimation of general dynamic models with a continuum of moment conditions. Journal of Econometrics 140, 529–543. Chacko, G., Viceira, L.M., 2003. Spectral GMM estimation of continuous-time Processes. Journal of Econometrics 116, 259–292. Corradi, V., Swanson, N.R., 2005a. Bootstrap tests for diffusion processes. Journal of Econometrics 124, 117–148. Corradi, V., Swanson, N.R., 2005b. A test for comparing multiple misspecified conditional intervals. Econometric Theory 21, 991–1016. Corradi, V., Swanson, N.R., 2006a. Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics 135, 187–228. Corradi, V., Swanson, N.R., 2006b. Predictive density evaluation. In: Granger, C.W.J., Elliot, G., Timmermann, A. (Eds.), Handbook of Economic Forecasting. Elsevier, Amsterdam, pp. 197–284. Corradi, V., Swanson, N.R., 2007. Nonparametric bootstrap procedures for predictive inference based on recursive estimation schemes. International Economic Review 48, 67–109. Diebold, F.X., Gunther, T., Tay, A.S., 1998. Evaluating density forecasts with applications to finance and management. International Economic Review 39, 863–883. Diebold, F.X., Mariano, R.S., 1995. Comparing predictive accuracy. Journal of Business and Economic Statistics 13, 253–263. Doukhan, P., 1995. Mixing Properties and Examples. Springer and Verlag, New York. Dridi, R., Guay, A., Renault, E., 2007. Indirect inference and calibration of dynamic stochastic general equilibrium models. Journal of Econometrics 136, 397–439. Duan, J.C., 2003. A Specification Test for Time Series Models by a Normality Transformation. Working Paper, University of Toronto. Duffie, D., Singleton, K., 1993. Simulated moment estimation of Markov models of asset prices. Econometrica 61, 929–952. Eraker, B., Johannes, M., Polson, N., 2003. The impact of jumps in volatility and returns. Journal of Finance 58, 1269–1300. Fermanian, J.-D., Salanié, B., 2004. A nonparametric simulated maximum likelihood estimation method. Econometric Theory 20, 701–734. Gallant, A.R., Tauchen, G., 1996. Which moments to match. Econometric Theory 12, 657–681. Gourieroux, C., Monfort, A., Renault, E., 1993. Indirect inference. Journal of Applied Econometrics 8, 203–227. Hong, Y., 2001. Evaluation of Out of Sample Probability Density Forecasts with Applications to S&P 500 Stock Prices. Working Paper, Cornell University. 
Hong, Y.M., Li, H., 2005. Nonparametric specification testing for continuous time models with applications to term structure interest rates. Review of Financial Studies 18, 37–84. Hong, Y.M., Li, H., Zhao, F., 2002. Out of sample performance of spot interest rate models. Journal of Business Economics and Statistics 22, 457–473. Jiang, G.J., 1998. Nonparametric modelling of US term structure of interest rates and implications on the prices of derivative securities. Journal of Financial and Quantitative Analysis 33, 465–497. Jiang, G.J., Knight, J.L., 2002. Estimation of continuous-time processes via the empirical characteristic function. Journal of Business and Economic Statistics 20, 198–212. Kloeden, P.E., Platen, E., 1999. Numerical Solution of Stochastic Differential Equations. Springer and Verlag, New York. Kristensen, D., Shin, Y., 2008. Estimation of Dynamic Models with Nonparametric Simulated Maximum Likelihood. CREATES Research Paper 2008-58, University of Aarhus and Columbia University. Künsch, H.R., 1989. The jackknife and the bootstrap for general stationary observations. Annals of Statistics 17, 1217–1241. Masuda, H., 2007. Ergodicity and exponential β -mixing bounds for multivariate diffusions with jumps. Stochastic Processes and Their Applications 117, 35–56. Meyn, S.P., Tweedie, R.L., 1993. Markov Chains and Stochastic Stability. Spinger and Verlag, New York. Pardoux, E., Talay, D., 1985. Discretization and simulation of stochastic differential equations. Acta Applicandae Mathematicae 3, 23–47. Pritsker, M., 1998. Nonparametric density estimators and tests of continuous time interest rate models. Review of Financial Studies 11, 449–487. Singleton, K.J., 2001. Estimation of affine asset pricing models using empirical characteristic function. Journal of Econometrics 102, 111–141. Thompson, S.B., 2008. Identifying term structure volatility from the LIBOR-swap curve. Review of Financial Studies 21, 819–854. White, H., 2000. A reality check for data snooping. Econometrica 68, 1097–1126.
Journal of Econometrics 161 (2011) 325–337
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Estimation of stable distributions by indirect inference René Garcia a,∗ , Eric Renault b,c , David Veredas d a b
EDHEC Business School, France University of North Carolina at Chapel Hill, USA
c
CIRANO and CIREQ, Canada
d
ECARES, Solvay Brussels School of Economics and Management, Université libre de Bruxelles, Belgium
article
info
Article history: Available online 23 December 2010 JEL classification: C13 C15 G11 Keywords: Stable distribution Indirect inference Constrained indirect inference Skewed-t distribution
abstract This article deals with the estimation of the parameters of an α -stable distribution with indirect inference, using the skewed-t distribution as an auxiliary model. The latter distribution appears as a good candidate since it has the same number of parameters as the α -stable distribution, with each parameter playing a similar role. To improve the properties of the estimator in finite sample, we use constrained indirect inference. In a Monte Carlo study we show that this method delivers estimators with good properties in finite sample. We provide an empirical application to the distribution of jumps in the S&P 500 index returns. © 2010 Elsevier B.V. All rights reserved.
1. Introduction The α -stable distribution has been widely used for fitting data in which extreme values are frequent. As shown in early work by Mandelbrot (1963) and Fama (1965a), it accommodates heavytailed financial series, and therefore produces more reliable measures of tail risk. The α -stable distribution is also able to capture skewness in a distribution, which is another characteristic feature of financial series. The distribution is also preserved under convolution. This property is appealing when considering portfolios of assets, especially when the skewness and fat tails of returns are taken into account to determine the optimal portfolio.1 Stable processes have recently been used in the high frequency microstructure literature by Ait-Sahalia and Jacod (2007, 2008) who proposed volatility estimators for some processes built from the sum of a stable process and another Levy process. To estimate the parameters of an α -stable distribution we propose to use indirect inference (see Smith, 1993; Gouriéroux et al.,
1993, GMR hereafter), a method particularly suited to situations where the model of interest is difficult to estimate but relatively easy to simulate. Indeed, the α -stable density function does not have a closed-form expression and is only characterized as an integral difficult to compute numerically, making ML estimation not very appealing in practice.2 However, several methods are available to simulate α -stable random variables, such as the one described in Chambers et al. (1976). Indirect inference involves the use of an auxiliary model. Auxiliary parameters are recovered through maximization of the pseudo-likelihood of a model based on the fictitious i.i.d. sampling in a skewed-t distribution of Fernández and Steel (1998).3 It is a Student-t with an inverse scale factor in the positive and negative orthants, allowing for asymmetries. The distribution has four parameters which have a one-to-one correspondence with the parameters of the α -stable distribution. There is a clear and interpretable matching between the two sets, parameter by parameter.4
∗ Corresponding address: EDHEC-393, Promenade des Anglais BP3116 06202 Nice Cedex 3, France. Tel.: +33 0 493189966; fax: +33 0 493830810. E-mail addresses:
[email protected] (R. Garcia),
[email protected] (E. Renault),
[email protected] (D. Veredas). 1 Basic references on the α -stable distribution are Feller (1971), Zolotarev (1986)
2 Nevertheless, DuMouchel (1973) has shown that the maximum likelihood (ML hereafter) estimator is consistent, asymptotically normal and reaches the Cramer–Rao efficiency bound. 3 Hansen (1994) also proposes a skewed version of the Student-t. The way
and Samorodnitsky and Taqqu (1994). Its properties motivate its use in the modelling of financial series in particular by Carr et al. (2002) and Mittnik et al. (2000). For value-at-risk applications, see in particular Bassi et al. (1998) and Mittnik et al. (1998). For portfolio allocation with stable distributions, see Fama (1965b), Bawa et al. (1979), and Ortobelli et al. (2002). 0304-4076/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2010.12.007
skewness is introduced differs from that of Fernández and Steel (1998). 4 During the course of this project, we were made aware by Lombardi that Lombardi and Calzolari (2008) use the same auxiliary model to estimate a stable distribution. The two projects were conducted independently and differ in several respects.
326
R. Garcia et al. / Journal of Econometrics 161 (2011) 325–337
Our application of indirect inference is innovative in two respects. First, following McCulloch (1986) in the context of matching quantiles, we actually perform a constrained version of indirect inference, introducing an a priori constraint on one auxiliary parameter to match, namely the number of degrees of freedom of the Student-t. The theory for such constrained indirect inference (CII hereafter) has been developed in a general context by Calzolari et al. (2004) (CFS hereafter). Second, we stress in our application that the α -stable simulator need not to take into account the actual dynamic features of the data.5 We show that, for a reasonable level of asymmetry, the pseudo-ML estimators of the four parameters of the skewed-t distribution are asymptotically normal even when the observations are generated by an α -stable distribution.6 Consequently, the associated indirect inference estimators of the parameters of the α -stable distribution are asymptotically normal too. We compare our method to two moment-based estimation methods. McCulloch (1986) proposed a quantile-based estimator, by building sample counterparts of the cumulative distribution function. Another approach is to match moments produced by the characteristic function (CF hereafter). Carrasco and Florens (2000, 2002) devise an optimal generalized method of moments based on a continuum of moment conditions corresponding to the CF computed at all points. Called continuous GMM (CGMM), the method produces an efficient estimator and overcomes the necessity of choosing an arbitrary set of frequencies, which was a fundamental drawback of CF-based methods.7 In a Monte Carlo study, we compare our estimator with CGMM and report that it is often more efficient in finite sample. Since DuMouchel (1973) provides a way to compute the efficiency bound in the i.i.d. case, we are able to measure the performance of our indirect inference estimator with the ML benchmark. At least in this i.i.d. setting, the efficiency loss appears mainly negligible given the finite sample improvement brought about by indirect inference. We also compare our method to the simple but inefficient quantile-based estimator of McCulloch (1986). Our estimates are close to those obtained with the quantile-based method. However, our estimators appear to have a much smaller variance, both asymptotically and in finite sample. Many of the properties of stable models are shared by GARCH models. In particular, both models share the facts that the unconditional distribution has fat tails and that the tail shape is invariant under aggregation (see Ghose and Kroner, 1995; de Vries, 1991).8 We illustrate this observational equivalence by generating different GARCH(1, 1) and IGARCH(1, 1) with Gaussian and Student-t innovations and aggregating the generated processes to lower frequencies. We show that the unconditional density captures very well the variance and kurtosis through aggregation and memory. The tail index α remains relatively constant under aggregation while the estimated dispersion increases. We complete our analysis by applying our estimation procedure to a series of realized jumps filtered from the S&P 500 return series using the methodology of Tauchen and Zhou (2011). We find
5 The use of a wrongly specified simulator in indirect inference has not received much attention, except in Dridi et al. (2007). 6 According to our Monte Carlo experiments, the allowed level of asymmetry is
actually consistent with the one produced by an α -stable distribution with support on the whole real line. 7 Some authors, like Fielitz and Rozelle (1981), recommend to match only a few
frequencies on the basis of Monte Carlo results, while others, like Feuerverger and McDunnough (1981), recommend on the contrary to use as many frequencies as possible. 8 It is well known that, except for the limiting case of the normal distribution, all
the α -stable distributions have infinite variance. However, it should be remembered that a highly persistent GARCH with, by definition, finite conditional variances, may produce infinite moments at orders not much higher than two.
that the stable distribution that best characterizes these jumps is symmetric with an estimated tail index of 1.7. The rest of the paper is organized as follows. Section 2 briefly describes the properties of α -stable distributions and their estimation by CGMM and empirical quantiles. In Section 3 we detail the application of the indirect inference methodology to the α -stable distribution, using the skewed-t distribution as an auxiliary model. We discuss the primitive conditions that warrant identification of structural parameters and asymptotic normality of their indirect inference estimators. Section 4 reports the results of a Monte Carlo study where indirect inference is compared to CGMM and empirical quantiles. The superior performance of CII is documented through both asymptotic and Monte Carlo MSE. We also compare and illustrate through simulations the relationship between the fat-tailed unconditional distributions produced by highly persistent GARCH models and an α -stable model. Section 5 is devoted to an empirical application to jumps in equity returns. Section 6 concludes. Proofs to several propositions are provided in the Appendix. 2. The α-stable distributions, CGMM and empirical quantiles The α -stable family of distributions is characterized by four parameters α , β , σ and µ, where α is the stability parameter, β the skewness parameter, σ the scale parameter, and µ the location parameter. These parameters define the natural logarithm of the characteristic function as ln ψθ (t ) = ln E (exp(it Y ))
= iµt − σ α |t |α [1 − iβ sign(t ) tan(π α/2)]
where θ = (α, β, σ , µ) ∈ Θ =]1, 2] × [−1, 1] × R∗+ × R, Y is the random variable following the α -stable distribution S (θ ) with characteristic function ψθ (·) and sign(t ) = t /|t | for t ̸= 0 (and 0 for t = 0). Note that the α -stable distribution can also be defined for α smaller than 1 but we preclude this case to guarantee the existence of a finite expectation. This requirement is rather realistic for the application to financial returns we have in mind. More generally, E (|Y |p ) < ∞ for all p < α and in particular E ( Y ) = µ. Even though the likelihood function is not known in closed form in general, the score function for an i.i.d. sample of size n remains asymptotically root-n normal. Therefore, DuMouchel (1973) was able to show that the standard tools of maximum likelihood theory (mainly root-n asymptotic normality and Cramer–Rao bounds) may be applied to estimation of θ insofar as its domain is limited to |β| < min(α, 2 − α). This result implies that efficient estimation of the parameters of α -stable distributions remains a sensible goal and that asymptotic normality of M-estimators like MLE or QMLE can be derived by the application of standard central limit theory to well-chosen (pseudo)-score functions rather than to moments of Y , which do not exist. This idea is the main motivation of the indirect inference strategy proposed in this paper. Other estimation methods are available. Since the theoretical characteristic function has a closed form, estimation can per∑be n formed by fitting the sample characteristic function n−1 j=1 exp (itk Yj ) to the theoretical one ψθ (tk ), defined on a grid of frequencies tk , k = 1, . . . , K . The problem is that it takes an infinite number of moment conditions, indexed by tk ∈ R, to summarize the informational content of the characteristic function. Consider the moment conditions: E (h(tk , Y , θ )) = 0,
∀k = 1, . . . , K ,
(1)
where h(tk , Y , θ ) = exp(itY ) − ψθ (tk ). They amount to a set of 2K moment restrictions E (gk (θ , Y )) = 0 that include the real and imaginary parts of h(tk , Y , θ ). Standard GMM estimates are solu−1/2 tions of min ||Ωn hn (., Y , θ )|| where hn (., Y , θ ) is the sample
R. Garcia et al. / Journal of Econometrics 161 (2011) 325–337
counterpart of (1) and Ωn is an estimate of the covariance operator. As first noticed by Carrasco and Florens (2002), when the grid becomes sufficiently thin (K large), it is not possible to estimate efficiently θ since the 2K estimating functions gk (θ , Y ) tend to become collinear when K goes to infinity. As a result, the inverse of Ωn is not continuous and it needs to be stabilize by the introduction of a regularization parameter δn . This motivated Carrasco and Florens (2002) to introduce CGMM, which is based on the whole continuum of moment conditions, to estimate θ as
θˆn = arg min ||(Ωnδn )−1/2 hn (., Y , θ)||, θ∈Θ
1/2
where (Ωnδn )−1/2 = (δn K + Ωn2 )−1/2 Ωn . Carrasco et al. (2007) 5/2
show that for a sequence δn such that δn −→ 0 but nδn −→ ∞ when n −→ ∞, θˆn is not only optimal among GMM estimators but reaches the Cramer–Rao efficiency bound. An alternative moment-based estimation method is proposed by McCulloch (1986), extending an idea of Fama and Roll (1971). In fact, this estimator can be seen as a particular case of indirect inference. Let us denote by xp the p-th population quantile of S (θ ), i.e. P (Y < xp ) = p. For any different values p, q, p′ , q′ ∈]0, 1[, xp − xq x′p − x′q is independent of both µ and σ . The idea of McCulloch (1986) is to define two functions of such ratios, Φ1 = Φ1 (α, β) and Φ2 = Φ2 (α, β), that can be interpreted as two auxiliary parameters, which will allow to back out α and β . To get accurate estimators, it is intuitive to focus the first auxiliary parameter on α (for each β ) and the second on β (for each α ). McCulloch (1986) proposes to define Φ1 as a measure of the relative size of the tails with respect to the middle of the distribution:
Φ1 (α, β) = max
x0.95 − x0.05 x0.75 − x0.25
, 2.439 ,
(2)
where 2.439 is the smallest possible value of Ψ1 (α, β), when α increases to 2, and as it is irrespective of the value of β , βˆ n is not identified. The second function
Φ2 (α, β) =
(x0.95 − x0.5 ) − (x0.5 − x0.05 ) x0.95 − x0.05
(3)
is defined as a measure of the spread between the right part and the left part of the distribution. Replacing population quantiles xp by their sample counterparts xˆ p,n , the estimators αˆ n and βˆ n are the
ˆ 1,n and Φ2 (αˆ n , βˆ n ) = Φ ˆ 2,n . solutions of Φ1 (αˆ n , βˆ n ) = Φ On the other hand, for any p, q ∈]0, 1[, xp − xq is independent of µ and proportional to σ , so it is natural to define an estimator of the scale parameter σ through the auxiliary parameter defined by x0.75 − x0.25 Φ3 (α, β) =
σ
,
a standardized quantity that does not depend on µ and σ . The ˆ 3,n of the auxiliary parameter Φ3 is deduced from estimator Φ
ˆ 3,n = Φ3 (αˆ n , βˆ n ). And the the previous estimation of (α, β): Φ estimator σˆ n can be recovered from: σˆ n =
x0.75 − x0.25
ˆ 3 ,n Φ
.
(4)
Finally, to back out the location parameter µ, it is natural to locate it with respect to the median x0.5 of the distribution through α µ − x0.5 Φ4 (α, β) = + β tan Π . σ 2
ˆ 4,n = Φ4 (αˆ n , βˆ n ) we can then deduce the From the estimator Φ estimator of µ from the previously defined estimators of (α, β, σ ): αˆ n ˆ ˆ . (5) µ ˆ n = xˆ 0.5,n + σˆ n Φ4,n − βn tan Π 2
327
Therefore, to back out the indirect estimator θˆn from the estiˆ n of auxiliary parameters, one has just to invert the binding mator Φ function Φ defined as (2)–(5):
Φ (θ ) = {Φ1 (α, β), Φ2 (α, β), σ , µ}. The resulting indirect estimator θˆn is nothing but an indirect inference estimator. It turns out that, by contrast with the most standard way to perform indirect inference, no new simulations are needed to recover the indirect inference estimator in this case. The reason for that is that the only two components of the binding function which are not known in closed form (Φ1 (α, β) and Φ2 (α, β)) have been tabulated by McCulloch (1986). In other words, the required simulation work is already done. However, the computation of indirect inference standard errors would require to resort to simulations as usual for numerical computation of the partial derivatives of the binding function. This is required indeed only for a grid of values of α and β since the effect of the location–scale parameters µ and σ inside the binding function is known in closed form. Last, it is worth insisting on the constraint on Φ1 . Although immaterial asymptotically when the unknown true value of α lies in the open interval ]0, 2[, this constraint may play a role in finite ˆ 1,n is stuck on the value 2.439, the sample is charsample. When Φ ˆ 2,n , Φ ˆ 3 ,n , Φ ˆ 4,n ), acterized by a three-dimensional parameter set (Φ which does not allow to identify the four unknown structural parameters. A sensible solution for this problem is to use the CII theory of CFS. The idea is to replace the lacking fourth auxiliary parameter by the value of the Kuhn–Tucker multiplier associated with the constraint. For the sake of efficiency, we rather choose in Section 3 below to develop the CII approach with instrumental parameters provided by a well suited pseudo-likelihood function rather than by arbitrary quantiles. 3. Constrained indirect inference estimation 3.1. Indirect estimation based on a skewed-t score generator Indirect inference involves the use of an auxiliary model through its pseudo-likelihood function. For indirect estimation almost as efficient as maximum likelihood, it matters to resort to a pseudo-likelihood whose parameters match well all the information content of the structural parameters of interest. This remark motivates our choice of a pseudo-likelihood associated to the skewed-t distribution as introduced by Fernández and Steel (1998). This pseudo-likelihood function entails four auxiliary parameters denoted by Ψ = (ν, γ , λ, ω) ∈ Ξ = R∗+ ×R∗+ ×R∗+ ×R and its log-likelihood for n observations Yi , i = 1, . . . , n, is given by
2h(ν) Ln = n log √ ·
1
λ γ + γ1 2 n ν+1 − 1 Yi − ω log 1 + − gω (Yi , γ ) , 2 ν λ i=1 πν
(6)
where h(ν) =
Γ
ν+1
Γ
ν2 2
1 and gω (y, γ ) = γ 2 2 γ
if y ≥ ω if y < ω.
The degree-of-freedom parameter ν (possibly non-integer) of a Student-t distribution captures the thickness of the tails as α does for stable distributions, while the scale parameter λ and the location parameter ω can easily be introduced to match the two parameters σ and µ. Finally, the skewed-t allows one to accommodate skewness through an additional parameter γ that should
328
R. Garcia et al. / Journal of Econometrics 161 (2011) 325–337
the knowledge of λ(α, 0, 1, 0) and ν(α, 0, 1, 0) that we denote respectively λ(α) and ν(α). We can then prove:
Table 1 Relation structural and auxiliary parameters. Characteristic
Structural
Auxiliary
Tail thickness Skewness Scale Location
α β σ µ
ν γ λ ω
be informative about β . More precisely, when observations Yi , i = 1, . . . , n, are identically distributed as Y following the stable distribution S (θ ), the pseudo-log-likelihood function is integrable (i.e. log(1 + x2 ) < |x| for large x) and from its expectation L(Ψ |θ) = E (L1 (Ψ )|θ ) we can define a pseudo-true value Ψ (θ ) by
Ψ (θ) = arg max L(Ψ |θ ). Ψ
Note that Ψ (θ ) = (ν(θ ), γ (θ ), λ(θ ), ω(θ)) defines the pseudotrue value of the skewed-t parameters (ν, γ , λ, ω) when the true marginal distribution is S (θ ), irrespective of the possible dynamic structure of the data generating process. We expect the auxiliary parameters (ν(θ ), γ (θ ), λ(θ ), ω(θ)) to be very informative about θ . To see this, note that the binding function θ → Ψ (θ ) is not only one-to-one but each of its coefficients is well focused on the relevant corresponding structural parameter according to the natural association described in Table 1. The matching displayed in this table is justified by the following result. Proposition 3.1. For all (α, β, σ , µ) ∈ Θ
ω(α, β, σ , µ) = µ + ω(α, β, σ , 0) λ(α, β, σ , µ) = σ λ(α, β, 1, µ/σ ) γ (α, β, σ , µ) = γ (α, β, 1, 0) =
1
γ (α, −β, 1, 0) ν(α, β, σ , µ) = ν(α, β, 1, 0) = ν(α, −β, 1, 0). Such a nice correspondence between auxiliary and structural parameters leads us to hope that an indirect inference estimator θˆn of structural parameters, obtained as the only value of θ for which Ψ (θ ) equals the pseudo-maximum likelihood estimator Ψˆ n , should be very accurate and almost as efficient as maximum likelihood. To a large extent, this will be confirmed by our Monte Carlo experiments with an exception however concerning the estimation of the tail parameter α when its true unknown value is close to 2. The reason for that is the following: even though α and ν are both the index of regular variation characteristic of the tail of the stable and the skewed-t distribution respectively, the binding function ν(α, β, σ , µ) = ν(α, β, 1, 0) not only does not coincide with α but also becomes less and less informative about α when its true value gets closer and closer to 2. In fact, ν(α, β, 1, 0) is an even function of β that blows up to infinity when α becomes arbitrarily close to 2. For this reason, the inverse problem to recover the structural parameter α from the auxiliary one ν becomes asymptotically ill-posed when α goes to 2. To see this in a simple case, let us make more explicit the symmetric (β = 0) case. The relationship between structural and auxiliary parameters is actually even more transparent in the symmetric case. First, we can prove: Proposition 3.2. For all (α, 0, σ , µ) ∈ Θ
ω(α, 0, σ , µ) = µ γ (α, 0, σ , µ) = 1. In other words, in the symmetric case, the structural parameters are exactly equal to their pseudo-true value analogs for both the location and the skewness parameters. As far as the scale and the tail parameter are concerned, we know by Proposition 3.1 that in the symmetric case, they are fully characterized from
Proposition 3.3. For all α ∈]1, 2[, λ(α) and ν(α) are determined as solution of the two equations: 1
=E
Z (α)
ν(α) + 1 1 + Z (α) 2h′ (ν(α)) = E (log(1 + Z (α))), h(ν(α)) where Z (α) = Y 2 [ν(α)λ2 (α)]−1 , Y follows S (θ ) with θ = (α, 0, 1, 0) and h(ν) = Γ [(ν + 1)/2]/Γ [ν/2]. The function g (ν) = E (log(1 + (Y 2 /νλ2 ))) shows why the problem becomes almost ill-posed when α (and the corresponding ν ) becomes large. The absolutely value of the derivative g ′ (ν) decreases fast towards its limit zero when ν increases towards infinity. As a results, small increments on α result in huge increments on ν(α). According to our Monte Carlo work, an increase of α in the interval [1.4, 1.7] results in an increase of ν from 2 to 10. And the variations are even steeper when α becomes close to 2. Even though sensitivity of auxiliary parameters to structural ones should be a good thing for indirect inference, the explosive behavior of ν comes with an explosive behavior of the variance of νˆ n which makes indirect inference quite imprecise about α when it is too close to two. Intuitively, when the variance of Y almost exists, the indirect inference matching becomes closer and closer to behave as if the stable distribution were the normal one (corresponding to α = 2) and pushes accordingly ν to infinity, with huge variance of νˆ n since normality is far from being granted. In order to address this discontinuity issue, we need to constraint both the set of structural parameters θ and the set of auxiliary parameters Ψ as well. Like DuMouchel (1973) for asymptotic normality of MLE, we maintain for asymptotic theory of indirect inference that the true unknown value θ 0 = (α 0 , β 0 , σ 0 , µ0 ) is such that 1 < α 0 < 2. Moreover, we have to constrain the auxiliary parameter Ψ1 , first component of Ψ , by imposing on it an upper bound, exactly as McCulloch (1986) did for his auxiliary parameter that provided information about α . We choose to impose ν ≤ 2, that is to redefine Ψ1 as Ψ1 = min(ν, 2). It is worth noting that, while α < 2 by definition, the upper bound on ν does not have to be 2. Intuitively, choosing for ν an upper bound larger than 2 is detrimental for the estimation of α and beneficial for the estimation of β . This intuition, confirmed by our Monte Carlo experiments, can be explained as follows. First, as explained above through the explosive behavior of ν , if we do not constrain ν to be smaller than 2, we loose the tight connection between ν and α as both measure the index of asymptotic regular variation of the tails. Second, in case of an asymmetric stable distribution, we know that the tail behavior has some informative content about β (see Samorodnitsky and Taqqu (1994), property 1.2.15). Even though the binding function ν(α, β, σ , µ) is an even function of β , it is consequently able to convey some information about |β| that we basically waste by replacing the auxiliary parameter ν by min(ν, 2), which is the uninformative constant 2 when the true value of α is large, larger than 1.4 approximately. Of course, as already mentioned for the quantile-based example, such a constraint on the auxiliary parameter may destroy identification when it remains at its limit value 2. CII provides the right methodology to deal with this problem. 3.2. Constrained indirect inference CII is based on the Lagrangian objective function Qn (Υ ) =
1 n
Ln (Ψ ) + ρ(2 − ν),
R. Garcia et al. / Journal of Econometrics 161 (2011) 325–337
where Υ = (Ψ ′ , ρ)′ . The parameter ρ ≥ 0 is the Kuhn–Tucker multiplier associated with the constraint ν ≤ 2. The estimator Υˆ n is then defined by the first order conditions jointly with the complementary slackness restriction ρˆ n (2 − νˆ n ) = 0 and the inequality restriction. Note that Ln (Ψ ) is not differentiable with respect to ω at values that coincide with one of the n observations Yi , i = 1, . . . , n. These circumstances can be excluded almost surely. Likewise, let us denote Yih (θ ), h = 1, . . . , H, the components of H simulated paths of an α -stable process for a given value θ . The simulated path (Yih (θ ))1≤i≤n defines a simulated criterion function: Qnh (Υ |θ ) =
1 n
Lhn (Ψ |θ ) + ρ(2 − ν),
and the corresponding simulated estimators Υˆ nh (θ ) are defined by the first order conditions jointly with the slackness condition ρˆ nh (θ)·(2−νˆ nh (θ )) = 0 and the inequality restriction (with the same remark as above regarding the innocuous non-differentiability issue). Let us then consider the average estimator over the H ∑H ˆh simulated paths Υˆ n,H (θ ) = H1 h=1 Υn (θ ). The main idea of CII is to choose the estimator θˆnr as a value of θ
that matches Υˆ n,H (θ ) against Υˆ n . The superscript r reminds that the estimator is constrained, or restricted, by the Kuhn–Tucker multiplier. Unfortunately, we cannot rely directly on the results in GMR because the constrained estimator Υˆ n may not be asymptotically normal in large samples in the presence of inequality constraints on its components ν and ρ . This is the reason why we closely follow below the relevant asymptotic theory of CII, as developed by CFS, while slightly extending it to a case of finite H. Moreover, we can simplify the exposition since we only focus on a just-identified case (dim Ψ = dim θ ). We first maintain Assumption 1 of CFS. It states almost sure uniform (in (θ , Ψ )) convergence of 1n Ln (Ψ ) to the expected loglikelihood L(Ψ |θ ) = E (L1 (Ψ )|θ ), which is differentiable with respect to its both arguments.9 Note that the differentiability of L(Ψ |θ ) is granted with standard Lebesgue dominated convergence arguments, even with respect to the variable ω, since the nonsmoothness of L1 (Ψ ), event of probability zero, has no impact on the expectation. For each value of θ we can define the binding function for the constrained auxiliary parameters Υ = (Ψ ′ , ρ ′ )′ as Υ (θ ) = (Ψ r (θ )′ , ρ(θ )′ )′ from the value Ψ r (θ ) = (ν r (θ ), γ r (θ ), λr (θ ), ωr (θ))′ , which fulfills the first order conditions:
∂Q (Υ |θ )|Υ =Υ (θ) = 0, ∂θ with Q (Υ |θ ) = L(Ψ |θ ) + ρ(2 − ν), and the slackness condition ρ(θ)(2 − ν r (θ )) = 0. In addition, we assume that Υ r (θ ) is unique, in the sense that L(Ψ r (θ )|θ ) > L(Ψ |θ ) for any Ψ = (ν, γ , λ, ω)′ in a neighborhood of Ψ r (θ ) and fulfilling ν ≤ 2. As a consequence, Assumption 1 ensures the strong consistency when n → ∞ of Υˆ n (resp. Υˆ n,H (θ )) for Υ (θ 0 ) (resp Υ (θ )). To ensure local identification ∂ Υ ′ (θ ) of θ 0 , we maintain Assumption 2 in CFS, i.e. the rank of ∂θ is 4 for any θ in a neighborhood of θ 0 . To interpret Assumption 2 in our setting, it is first worth noting that trivially ωr (θ ) = ω(θ ) while hopefully γ r (θ ) is little different from γ (θ ). A constraint on the tail parameter has no impact on the location parameter and little impact on the skewness parameter, for instance always equal to one in the symmetric case. This remark
9 Even though CFS do not make it explicit, standard asymptotic arguments rest upon uniform convergence when θ and Ψ evolve in their allowed domain. While θ is only constrained between 1 < α < 2, the set of allowed values for Ψ is the range of the binding function, that is a strict subset of R∗+ × R∗+ × R∗+ × R.
329
is actually confirmed by our simulations. Therefore, the fact that ∂ Υ ′ (θ )
∂ Ψ r ′ (θ )
∂ρ
the four rows of the matrix ∂θ = [ ∂θ , ∂θ ] are linearly independent is expected as a likely consequence of the following arguments: First, by the natural association between structural and auxiliary parameters put forward in the former section and in ∂ Ψ r ′ (θ )
Table 1, we expect the matrix ∂θ to be almost diagonal and at least, to have its four rows linearly independent. Therefore, the four ∂ Υ ′ (θ )
rows of the matrix ∂θ must be linearly independent, at least when the constraint is not binding. Second, when the constraint is binding, we still have a natural correspondence between the two sets of parameters (β, σ , µ) and (γ , λ, ω) while the structural tail parameter α will not be captured by ν anymore (ν r (θ ) = 2) but by the Kuhn–Tucker multiplier ρ(θ ). In other words, we now ∂(γ r ,λr ,ωr ,ρ) expect an almost diagonal matrix when considering . ∂θ ′ For standard (unconstrained) indirect inference, we are here in a just-identified setting, so that θˆnu is simply defined as the solution
of the four-equation system Ψˆ nu = Ψˆ Hu (θˆnu ). The superscripts, u for unconstrained, are just a reminder that the corresponding estimators have been computed by choosing a zero Kuhn–Tucker multiplier: ρˆ n and ρˆ nh (θ ) are fixed to zero, for h = 1, . . . , H. Note that, from Gouriéroux et al. (1993), we know that in this justidentified setting, the indirect inference estimator θˆnu numerically coincides with the score-matching estimator as put forward by Gallant and Tauchen (1996). By contrast, in order to perform constrained indirect inference, we are faced with a seemingly overidentified problem since both Υˆ n,H (θ ) and Υˆ n entail five free parameters while the unknown θ is of dimension four. However, we know that this overidentification feature is just a finite sample problem since (see CSF, Proposition 1) the asymptotic distributions of Υˆ n and Υˆ n,H (θ ) are singular. Therefore, the overidentified finite sample-matching problem can be solved by minimizing an arbitrary distance:
θˆnr,H = arg min(Υˆ n,H (θ ) − Υˆ n )′ Wn (Υˆ n,H (θ ) − Υˆ n ). θ
In terms of asymptotic probability distribution of θˆnr , the choice of the positive definite weighting matrix Wn is immaterial. In fact, Proposition 6 of CFS shows that if dim(Ψ ) = dim(θ ), indirect methods provide estimators that are independent of the choice of Wn for large enough n. Note however that when νˆ n is stuck at its limit value, the information content of Υˆ n about the structural parameters θ will go through the Kuhn–Tucker multiplier ρˆ n = 1 ∂ Ln ˆ r (Ψn ). Therefore, constrained indirect inference will not suffer n ∂ν from the weak identification problem about β that is currently encountered with competing estimation methods when the true unknown value of α is close to 2. The needed regularity condition for asymptotic theory of constrained constraint indirect inference is maintained as in CFS. More precisely, we maintain Assumption 3 of CFS that 1 ∂ 2 Ln n ∂Ψ ∂Ψ ′
(Ψn ) to a nonstochastic matrix J0r , √ and convergence in distribution of n ∂∂QΨn (Υ (θ 0 )) to a Gaussian states convergence of
distribution with zero mean and variance I0r . Note that in order to get asymptotically normal indirect estimators, the √ key assumption is asymptotic normality of the pseudo-score n ∂∂QΨn computed at the pseudo-true value. It is worth stressing that while observations have infinite variance, the pseudo-score is expected to be well behaved in the same way we know from DuMouchel (1973) that the true score is well behaved. Then, standard theory of indirect inference (see GMR) can be applied insofar as the information content in Ψ is sufficient for local identification of θ (Assumption 2) and as the pseudo-score is root-n asymptotically normal (Assumption 3). Our Monte Carlo evidence (see also Garcia et al. (2006) for extended evidence) shows that we do get probability distributions close to normal for both
330
R. Garcia et al. / Journal of Econometrics 161 (2011) 325–337
scores and parameters for large n.10 This makes a key difference between our approach and a conventional method of moments where asymptotic normality could not be warranted due to the infinite variance of observations.11 In our setting, the asymptotic variance formula for θˆnu follows from GMR (p S93) where the cross-derivatives of Ln with respect to Υ and θ are computed by simulations. As a robustness check, which we follow in the application, we also compute the variance–covariance matrix of θˆnu by bootstrap methods. As far as constrained estimation is concerned, CFS show that a similar result of asymptotic normality still holds for the parameters of interest in spite of the constraint parameters. While they derive the asymptotic distribution of the constrained indirect inference estimator for an infinite number H of simulated paths, we do extend it here with finite H. As a general principle for simulated methods of moments, we know that the only asymptotic consequence of this is to multiply the asymptotic variance matrix of θˆnr by a factor (1 + H1 ). Under the maintained assumption of a well-specified structural model, this extension is straightforward since, as a general principle for simulated method of moments, it only amounts to multiply the asymptotic variance matrix of θˆnr by a factor (1 + 1 ). However, as stressed in the introduction, we never maintain H a serial independence assumption but only assume that the sequence of observations is stationary and ergodic. Yet, the source of paths for indirect inference are obtained as independent draws in the stationary distribution, supposed to be well specified by a stable model with parameter θ . Following Dridi et al. (2007), the sandwich formula for the asymptotic variance of QMLE of auxiliary parameters Ψ must then be computed differently depending whether this estimator is obtained from data or from simulated paths. For an infinite number of simulations, the latter is unmaintained and then CFS Proposition 4 (p. 951) allows us to state that, under Assumptions 1, 2 and 3:
misspecification has no impact on the limit J0r of the Hessian since the stationary distribution is always assumed to be well specified. By contrast, with independent draws, the long-term covariance matrix I0r must be replaced by
˜I0r = S0 (θ 0 , Υ (θ 0 )). Then, the correct asymptotic distribution for constrained indirect inference is now:
√
n(θˆnr,H − θ 0 ) →d N (0, (C0r,H )−1 ),
with C0r,H =
∂Υ ′ 0 (θ , Υ (θ 0 ))J0r ∂θ (˜I0r )−1 r ∂ Υ 0 r −1 × ( I0 ) + J0 ′ (θ , Υ (θ 0 )). H ∂θ
Note that both ˜I0r and J0r can be consistently estimated by their sample counterparts, computed either in actual data or in data simulated with a value of θ produced by a first step consistent estimator. By contrast, a consistent estimation of the matrix I0r takes a HAC (heteroskedasticity and autocorrelation consistent) estimator computed from actual data. In the application, we use Newey–West estimators (asymptotic variances in Table 7) and also check their validity by a bootstrap procedure. 4. A Monte Carlo study
√
In Section 4.1, we compare our constrained indirect estimator to the CGMM and empirical quantile estimation methods for various sets of values for θ , the vector of parameters of the stable distribution, assuming that the observations are identically and independently distributed. In Section 4.2 we show by simulation that some GARCH processes may be observationally equivalent to a stable process from the viewpoint of the unconditional distribution.
where
4.1. Independent processes
n(θˆnr,∞ − θ 0 ) →d N (0, (C0r )−1 ),
∂Υ ′ 0 ∂Υ C0r = (θ , Υ (θ 0 ))J0r (I0r )−1 J0r ′ (θ 0 , Υ (θ 0 )), ∂θ ∂θ 1 ∂ 2 Ln r 0 J0 = plimn→∞ (Ψ ), n ∂Ψ ∂Ψ ′ +∞ − I0r = Sτ (θ 0 , Υ (θ 0 )), τ =−∞
Sτ (θ, Υ ) = E (mi (Υ )mi−τ (Υ )|θ ) ∂ L1 ∂ν mi ( Υ ) = (Yi ) − ρ .
∂Ψ
and
∂Ψ
When the actual number of simulated paths is no longer infinite, the sandwich formula J0r (I0r )−1 J0r must be corrected to take into account that the simulator may be misspecified. This
10 For more primitive underpinnings of high-level Assumption 3, asymptotic normality of the unconstrained QML estimators is discussed in the Appendix. The key argument is that QML is based on self-weighted sample averages with finite variance by contrast with naive sample means. This argument is germane to the asymptotic normality of MLE for GARCH processes even when squared returns have infinite variance (see Francq and Zakoian, 2004). 11 Note that fat tails may also invalidate the efficiency argument of Efficient Method of Moments (Gallant and Tauchen, 1996) since there is no more reason to hope that a semi-nonparametric (SNP) score generator based on Hermite expansions will be able to span the true score function. The class of densities to fit with SNP considered by Coppejans and Gallant (2002) are indeed weighted with the exponential function exp (−x2 /2) which ensures finite moments at any order.
We carry out a Monte Carlo experiment to determine if the good asymptotic properties of the indirect inference estimators with a skewed-t auxiliary model are maintained in a finite sample context. As we have seen in the previous section, the asymptotic distribution of θˆn is determined by the asymptotic distribution of Ψˆ n . Therefore, it is worthwhile to examine the sample distribution of the parameter estimates for the auxiliary model in an experimental setting where we simulate data from a α -stable distribution with different values of the parameters. We generate 500 samples of 1000 observations for 2 different values of α , namely 1.5 and 1.9.12 We keep the other parameters µ, σ and β fixed and set them equal to 0, 0.5 and 0 respectively.13 The simulation experiment is divided in two parts. First we estimate the unconstrained skewed-t distribution. Since the true values of Ψ are unknown, we can only check by Monte Carlo the behavior of the first four moments of the estimators. The results are reported in columns 3–6 of Table 2. Concerning ω, λ and γ , skewness and kurtosis of the estimators are fairly close to zero and three respectively. In contrast, νˆ n is ill behaved in finite sample, especially when α = 1.9. It appears that νˆ n has a large upward bias. It is as if, in finite sample, the estimate wants to capture the spurious appearance of normality corresponding to the limit case
12 We simulate from the α -stable distribution following Chambers et al. (1976). 13 A more extensive Monte Carlo study is available in Garcia et al. (2006).
R. Garcia et al. / Journal of Econometrics 161 (2011) 325–337
0.58 0.02 0.15 2.93 0.64 0.02 −0.11 3.01
−0.01 0.04
−0.01 2.89
−0.01 0.06
−0.02 2.87
βˆ n
1.47 −0.11 0.04 0.09 0.03 0.28 2.94 3.14 1.82 0.16 0.79 0.21 −3.12 0.14 22.9 2.66
σˆ n
µ ˆn
0.50 0.01 0.13 3.45 0.50 0.01 −0.13 3.48
−0.04 0.04 0.12 3.29 0.11 0.09 1.32 7.39
First four moments of the estimated parameters of the skewed-t (columns 3–6) and the α -stable distribution (columns 7–10) using unconstrained indirect inference and when the data generating process is S (α, 0, 0.5, 0) for values of α given in the first column (500 samples of 1000 observations). Sd, Skw and Kur stand for standard deviation, skewness and kurtosis.
N
νˆ n ρˆ n
Sd Skw Kur Sd Skw Kur
500 38.82 7.18 56.53 0.00 −0.29 2.96
1000 15.46 3.08 16.53 0.01 −0.14 2.86
0.24 1.8
2.2
2.6
3.0
3.4
0
4
8
12
16
20
24
28
32
36
Density estimator for RHO
0
Table 3 Sensitivity of νˆ and ρˆ to the sample size when α = 1.9.
1.4
0.16
αˆ n
0.08
ωˆ n
0.00
2.18 1.00 0.20 0.04 0.58 −0.08 3.81 2.79 7.55 1.00 3.32 0.05 3.42 0.18 22.0 2.87
λˆ n
140
1.9
Mean Sd Skw Kur Mean Sd Skw Kur
γˆn
Density estimator for NU
20 40 60 80 100
1.5
νˆ n
0.0 0.4 0.8 1.2 1.6 2.0 2.4
Density estimator for NU
Table 2 Simulation results for the unconstrained model.
α
331
0.034 0.038 0.042 0.046 0.050 0.054 0.058
5000
10,000
0.83 0.92 4.47 0.01 −0.23 3.36
0.58 0.78 4.15 0.01 −0.17 3.03
Standard deviations, skewness and kurtosis (denoted by Sd, Skw and Kur respectively) of the estimated and using unconstrained and constrained indirect inference respectively. The date generating process is S (1.9, 0, 0.5, 0). We generate 500 samples of N = 500, 1000, 5000, 10,000 observations.
α = 2 and ν = +∞. the top plots in Fig. 1 represent the kernel density of the Monte Carlo distribution of νˆ n . When α increases the estimator νˆ n exhibits serious departures from normality. While these kernel densities are for a sample size of 1000, the role of sample size in the dependence of νˆ n in α is investigated in the top part of Table 3. For α = 1.9, no less than 10,000 observations are needed for fairly approaching normality. The implications of these shortcomings of the auxiliary model on the estimation of θ appear in columns 7–10 of Table 2. For α = 1.9, skewness and kurtosis depart substantially from the respective values of 0 and 3 for a normal distribution. For α = 1.5, the distribution of the estimates is closer to a normal. We conclude from this simulation exercise that the auxiliary model works for most cases but fails when α is close to 2. Since the estimate νˆ n is ill behaved in finite sample when α → 2, we impose an upper bound on ν , as mentioned in the previous section. To check that the estimated multiplier is well behaved in finite sample, the bottom plot of Fig. 1 shows the density of ρˆ n when α = 1.9. It can be seen that, contrary to νˆ n , the distribution is much closer to a normal. This confirms why ρˆ n is a more relevant auxiliary parameter than νˆ n in such a case. The bottom part of Table 3 shows that in fact the finite sample behavior of the multiplier is very good for any sample size. The Monte Carlo study is conducted for α = 1.5 and 1.9 and β = 0 and 0.75.14 We compare the CII method to CGMM method of Carrasco and Florens (2002) based on the characteristic function with a regularization parameter δn equal to 10−6 .15 The second one is the empirical quantile method of McCulloch (1986).16 The results are reported in Table 4.
14 The number of draws H in set to five. Results do not change qualitatively for H = 1 and 2. We choose the identity matrix for the weighting matrix Wn as explained in previous section. 15 10−6 is the same value that Carrasco and Florens (2002) chose for their Monte Carlo study with the α -stable distribution. 16 We use a GAUSS procedure written by Huston McCulloch and available in his web page http://www.econ.ohio-state.edu/jhm/jhm.html.
Fig. 1. Kernel densities for νˆ n and ρˆ n . The top plots show kernel densities of νˆ n when the true α are 1.5 and 1.9. The bottom plot shows the kernel density of ρˆ n when the true α is 1.9.
As a general assessment, one can say that the constrained indirect inference method delivers consistent estimators which are close to being normally distributed. The skewness for all parameters is close to 0 and the kurtosis close to 3. Thanks to the constraint imposed on ν in the auxiliary model, the estimator behaves well even when α approaches 2. CII compares well with the two other methods. First, with respect to CGMM, it appears that it estimates much better the parameter σ , as CGMM underestimates it. The bias for α = 1.9 is quite severe, since the mean of the 500 replications is 0.26 for a true value of 0.5. Estimates for other parameters, for example β , suffer when α gets close to 2, which is never the case for indirect inference. The CII method is also more efficient than CGMM.17 The empirical quantile method does not seem to suffer from any systematic bias except for β at α = 1.9. Its main weakness appears to be its lack of efficiency. Standard deviations are larger than in the case of CII and CGMM. To judge the efficiency of the indirect inference procedure, we also compare in Table 5 the empirical standard deviations of α and β to the asymptotic Cramer–Rao bounds reported in DuMouchel (1975) for a set of parameter values. It can be seen that CII produces standard deviations that are close to the asymptotic lower bounds. Interestingly, when α is getting closer to 2, the empirical standard deviations produced by the indirect inference procedure are smaller than the asymptotic bounds. This is a finite sample phenomenon. As we increase the number of observations from 1000 to 5000 the indirect inference standard deviation becomes higher than the lower bound while remaining close to it. 4.2. Dependent processes Several papers have investigated the relationship between stable processes and processes with conditional heteroskedasticity such as GARCH and IGARCH. de Vries (1991) has shown that under certain conditions on the parameters of a GARCH-like process, the stable and GARCH processes are observationally equivalent from the viewpoint of the unconditional distribution. Ghose and Kroner (1995) establish that many of the properties of stable models are shared by GARCH models. However, they identify distinctive properties, namely the clustering in volatility and the distributions
17 As noted above, CGMM has been performed with a fixed ad hoc regularization coefficient. An endogenous choice of δn could improve results, namely suppress the bias for σ . However, in the Monte Carlo study performed by Carrasco and Florens (2002) for the α -stable distribution, the estimated σ does not change significantly when δn is selected in an ad hoc way or endogenously.
332
R. Garcia et al. / Journal of Econometrics 161 (2011) 325–337
Table 4 Simulation results for the constrained model. CII
α
β
1.5
0
0.75
1.9
0
0.75
Mean Sd Skw Kur Mean Sd Skw Kur Mean Sd Skw Kur Mean Sd Skw Kur
CGMM
Empirical quantiles
αˆ n
βˆ n
σˆ n
µ ˆn
αˆ n
βˆ n
σˆ n
µ ˆn
αˆ n
βˆ n
σˆ n
µ ˆn
1.61 0.15 −0.78 3.55 1.53 0.04 0.23 3.03 1.91 0.02 −0.03 2.73 1.90 0.04 0.20 3.04
−0.00
0.49 0.02 −0.07 3.19 0.50 0.01 0.04 2.90 0.49 0.01 0.01 2.88 0.49 0.01 0.18 3.38
0.08 0.03 0.02 3.06 0.01 0.04 0.11 4.01 0.01 0.03 0.08 3.31 0.01 0.03 0.30 3.02
1.50 0.07 0.08 2.90 1.50 0.07 0.02 2.71 1.89 0.05 −0.46 3.17 1.88 0.05 −0.34 2.89
0.03 0.15 −0.10 2.94 0.75 0.12 −0.19 2.83 0.00 0.61 0.01 2.08 0.63 0.44 −1.52 5.37
0.35 0.02 0.14 2.95 0.35 0.02 0.39 3.55 0.26 0.02 −0.01 2.87 0.26 0.01 0.28 3.08
0.00 0.06 −0.01 3.18 0.01 0.08 0.65 3.84 −0.00 0.03 0.05 3.04 −0.00 0.03 0.02 2.99
1.49 0.08 0.26 2.95 1.51 0.11 0.38 3.01 1.88 0.10 −0.58 2.32 1.88 0.11 −0.59 2.30
0.00 0.15 0.03 3.47 0.78 0.16 −0.23 3.98 0.01 0.39 0.03 4.15 0.35 0.40 0.38 2.01
0.49 0.02 0.09 2.91 0.49 0.02 0.10 3.01 0.49 0.03 0.01 2.92 0.49 0.03 0.16 3.09
0.00 0.07 −0.29 3.59 −0.00 0.10 0.27 3.98 0.00 0.04 −0.02 2.99 −0.01 0.04 −0.09 3.04
0.09 0.05 2.92 0.75 0.07 −0.13 2.94 −0.04 0.38 −0.03 3.14 0.75 0.12 −0.54 3.17
First four moments of the estimated parameters of the α -stable distribution when the data generating process is S (α, β, 0.5, 0) for values of α and β given in the first and second columns respectively (500 samples of 1000 observations). Columns 4–7 are estimates using constrained indirect inference. Columns 8–11 are estimates using CGMM and columns 12–15 show the estimates using empirical quantiles. Sd, Skw and Kur stand for standard deviation, skewness and kurtosis. Table 5 Finite sample vs asymptotic standard deviations.
α 1.5 1.9
β 0 0.5 0 0.5
Sd. ind. inf.
Sd. asympt.
αˆ n
βˆ n
αˆ n
βˆ n
0.048 0.044 0.026 0.015
0.096 0.097 0.369 0.364
0.049 0.047 0.036 0.035
0.096 0.085 2.287 0.268
Comparison of finite sample (500 samples of 1000 observations) indirect inference standard deviations (Sd. ind. inf.) with asymptotic deviations corresponding to the Cramer–Rao bounds (Sd. asympt.) for α and β .
of extreme values, captured by their tail indices. Groenendijk et al. (1995) show that it is not always the case that the tail shapes can be used to discriminate between the competing models. More recently, Deo (2000, 2002) has devised an estimation procedure for the tail index α as well as a goodness-of-fit test that is valid in the presence of m-dependence in the series. To illustrate the results put forward in this literature, we generate GARCH(1, 1) series for a subset of parameter values chosen by Ghose and Kroner (1995). As before, we simulate 500 samples of 1000 observations from a GARCH(1, 1) model with either Gaussian or Student-t 5 that we aggregate every 5 and 20 periods. All processes exhibit empirically excess kurtosis, which increases with the memory of the model and does not vanish under aggregation. Table 6 shows the means and standard deviations for αˆ n and σˆ n . The stable density captures very well the increases in variance and
The tail index remains relatively constant under aggregation while the estimated dispersion increases. In all cases standard errors increase with aggregation as the sample size decreases. As expected, the tail index and the dispersion are higher when the process is generated from a Student-t density than when it comes from a Gaussian probability distribution. Last, the higher the memory of the model, in particular for the last three cases, the lower the tail index and the higher the dispersion. We have illustrated that stable distributions can serve as a good statistical tool for capturing the unconditional distribution of asset returns at low frequency, even if the true DGP is a conditionally heteroskedastic process like GARCH. Of course, one might argue that what matters for financial applications is the conditional distribution and that the estimation method developed here will not be useful for capturing the conditional dependence in mean and variance. Although we do not intend to show in detail how our method can be extended to characterize conditional distributions, we argue that it should be possible to introduce dependence in our indirect inference procedure. First, it should be stressed that GARCH-like processes, that is, processes exhibiting conditional heteroskedasticity, can be built from stable random variables, as shown by de Vries (1991). Second, Deo (2002) has used conditionally heteroskedastic processes with marginal stable distribution, which has infinite dependence, to estimate tail indices with a procedure that accommodates dependence of order m. He concludes that his procedure provides quite good estimates of the tail index.
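A minimal sketch of the simulation design behind Table 6, under stated assumptions: the coefficient roles follow the caption of Table 6 (δ1 on the lagged variance, δ2 on the lagged squared innovation), δ0 and the variance initialization are arbitrary choices not reported in the table, the t5 innovations are standardized to unit variance, and the outer loop over 500 replications is left out:

```python
import numpy as np

def simulate_garch(n, delta0, delta1, delta2, dist="normal", seed=None):
    """Simulate y_t = eps_t with h_t = delta0 + delta1*h_{t-1} + delta2*eps_{t-1}^2."""
    rng = np.random.default_rng(seed)
    y = np.empty(n)
    h = 1.0  # arbitrary starting variance; a burn-in period could be added
    for t in range(n):
        # draw a unit-variance innovation: N(0,1) or standardized Student-t5
        z = rng.standard_normal() if dist == "normal" else rng.standard_t(5) / np.sqrt(5.0 / 3.0)
        y[t] = np.sqrt(h) * z
        h = delta0 + delta1 * h + delta2 * y[t] ** 2
    return y

def aggregate(y, k):
    """Sum returns over non-overlapping blocks of length k (5- or 20-period aggregation)."""
    m = len(y) // k
    return y[: m * k].reshape(m, k).sum(axis=1)

y = simulate_garch(1000, delta0=0.1, delta1=0.05, delta2=0.9, dist="t5", seed=0)
y5, y20 = aggregate(y, 5), aggregate(y, 20)
```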
Table 6
Simulation results when the DGP is a GARCH model.

GARCH model                          Distribution   No aggregation    5                 20
δ1 = 0.1, δ2 = 0.8      α̂n          Normal         1.9132 [0.0348]   1.9256 [0.0424]   1.9072 [0.1095]
                        α̂n          t5             1.7277 [0.0418]   1.8423 [0.0718]   1.7929 [0.0807]
                        σ̂n          Normal         0.6800 [0.0311]   1.4731 [0.1003]   2.8884 [0.4403]
                        σ̂n          t5             0.5618 [0.0269]   1.3437 [0.1136]   2.9488 [0.4071]
δ1 = 0.05, δ2 = 0.9     α̂n          Normal         1.9361 [0.0248]   1.9136 [0.0538]   1.7992 [0.1112]
                        α̂n          t5             1.7459 [0.0382]   1.7785 [0.0652]   1.8071 [0.1333]
                        σ̂n          Normal         0.9621 [0.0427]   2.1098 [0.1439]   4.0430 [0.5512]
                        σ̂n          t5             0.8079 [0.0381]   1.9341 [0.1744]   3.9414 [0.6127]
δ1 = 0.05, δ2 = 0.95    α̂n          Normal         1.7526 [0.0992]   1.6941 [0.1102]   1.5987 [0.1983]
                        α̂n          t5             1.5882 [0.1260]   1.6085 [0.1721]   1.5942 [0.1630]
                        σ̂n          Normal         3.5478 [1.3464]   7.7671 [2.9350]   14.790 [3.9942]
                        σ̂n          t5             2.6441 [0.6771]   6.0090 [1.7830]   13.028 [3.6017]

Mean and standard deviation (in square brackets) of 500 estimated α̂n's and σ̂n's (each draw of 1000 observations) when the DGP is a GARCH(1, 1): $y_t = \varepsilon_t$, $\varepsilon_t \sim D(0, h_t)$, $h_t = \delta_0 + \delta_1 h_{t-1} + \delta_2 \varepsilon_{t-1}^2$, where D is either Gaussian or Student-t5. The last two columns show the estimated parameters when the DGP is aggregated over 5 and 20 periods.
Table 7
Estimation results.

Confidence level            0.999                      0.99
Days                        2686                       2686
Days with jumps             538                        828
Jump contribution           9.44%                      12%
Jump mean                   0.0040                     0.0038
Jump standard deviation     0.0036                     0.0034
Equal variance test         1.12 (F538,828 ≃ 1)
Equal mean test             0.599 (t1364 = 1.96)

CII
α̂n                          1.6841 [0.0490, 0.0516]    1.7224 [0.0407, 0.0452]
β̂n                          0.0298 [0.1822, 0.2772]   −0.170 [0.1681, 0.2164]
σ̂n                          0.2992 [0.0176, 0.0228]    0.2966 [0.0137, 0.0160]
µ̂n                          0.1067 [0.0355, 0.0319]    0.0752 [0.0314, 0.0251]

Empirical quantiles
α̂n                          1.5631 [0.0885]            1.6188 [0.0796]
β̂n                          0.0064 [0.1600]            0.0029 [0.1307]
σ̂n                          0.2909 [0.0202]            0.2906 [0.0129]
µ̂n                          0.0853 [0.0300]            0.0829 [0.0232]

Skewed-t
ν̂n                          4.5414 [0.8541]            5.1589 [0.8677]
γ̂n                          1.0354 [0.0569]            0.9910 [0.0455]
λ̂n                          0.4075 [0.0277]            0.4032 [0.0200]
ω̂n                          0.0625 [0.0366]            0.0705 [0.0319]

Descriptive statistics (top panel), estimation results of the α-stable distribution with CII and empirical quantiles (middle panels), and estimation results of the skewed-t distribution (bottom panel). Estimation is based on days when a jump has been detected; to detect them, a test with two confidence levels, 0.999 and 0.99, has been used. Standard deviations for CII have been computed with a nonparametric bootstrap (left numbers within the square brackets) and asymptotically (right numbers). Standard deviations for empirical quantiles have been computed with a nonparametric bootstrap.
5. An empirical illustration

We propose to apply our estimation methodology for the stable distribution to the characterization of the jump distribution in the time series of S&P 500 index returns. A series of papers by Barndorff-Nielsen and Shephard (2004a,b, 2006) and Andersen et al. (2003, 2007, 2009), as well as others, have shown that financial time series are best characterized by jump–diffusion processes. Of direct interest for our purpose is the fact that the jumps are nearly i.i.d.: jumps display less persistence than volatility and appear to occur at unpredictable times. Moreover, the α-stable density appears as a natural candidate for the distribution of a jump conditional on a jump occurring, since Lévy processes provide the theoretical foundations of jump–diffusion models. The literature on realized variance and bi-power variation has provided financial econometricians with measures of both the diffusion and the jump parts in high frequency financial time series. Barndorff-Nielsen and Shephard (2004b) develop a method for detecting the presence of jumps. The basic idea is to compare two measures of variance: realized variance, which includes the contribution of jumps to the total variance, and bi-power variation, which is robust to the contribution of jumps. Realized variance and bi-power variation are defined as follows:

$$RV_i \equiv \sum_{j=1}^{m} r_{i,j}^2, \qquad BV_i \equiv \frac{\pi}{2}\,\frac{m}{m-1}\sum_{j=2}^{m} |r_{i,j}|\,|r_{i,j-1}|,$$

where $r_{i,j}$ refers to the $j = 1, \ldots, m$ within-day return on day $i = 1, \ldots, n$. Asymptotically, the difference between the realized variance and bi-power variation is zero when there is no jump and strictly positive when there are jumps.
This property has given rise to several jump detection techniques in Barndorff-Nielsen and Shephard (2004b), Andersen et al. (2007), and Huang and Tauchen (2005). Recently, Tauchen and Zhou (2011) extended this literature to identify large jumps in a financial time series. First, they propose to use a ratio statistic identified in this literature, $RJ_i \equiv (RV_i - BV_i)/RV_i$, to construct a test statistic to detect jumps:

$$ZJ_i = \frac{RJ_i}{\sqrt{\left[(\pi/2)^2 + \pi - 5\right] m^{-1} \max\left(1, TP_i/BV_i^2\right)}},$$

where $TP_i$ is the tri-power quarticity robust to jumps defined in Barndorff-Nielsen and Shephard (2004b). This test statistic converges in distribution to a standard normal distribution. Then, they assume that there is at most one jump per day and that the jump size dominates the return when a jump occurs. Given a confidence level $\iota$, it is possible to filter out the daily realized jumps as

$$\hat{J}_{n,i} = \mathrm{sign}(r_i)\,\sqrt{RV_i - BV_i}\;\mathbb{I}_{ZJ_i \ge \Phi^{-1}_{\iota}},$$
where $r_i$ is the daily log return. We apply their methodology to filter the jumps out of the S&P 500 return series. We collected the 5-min return data from 2 January 1996 to 31 August 2006 and computed the realized variance and bi-power variation statistics leading to the test statistic, separating the days in the sample into jump days and non-jump days. We then estimate a stable distribution for the $\hat{J}_{n,i}$ with CII,18 as well as with empirical quantiles for comparison purposes.19
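The following sketch assembles the statistics above, under stated assumptions: it is not the authors' code, the finite-sample scaling constants for BV and TP follow common conventions that vary slightly across papers, and `r` is a hypothetical (n days × m intraday returns) array:

```python
import numpy as np
from math import gamma, pi
from scipy.stats import norm

def realized_jumps(r, conf=0.999):
    """Filter daily realized jumps from intraday returns r (shape: n days x m)."""
    _, m = r.shape
    rv = np.sum(r ** 2, axis=1)                                   # realized variance
    bv = (pi / 2) * m / (m - 1) * np.sum(np.abs(r[:, 1:]) * np.abs(r[:, :-1]), axis=1)
    mu43 = 2 ** (2 / 3) * gamma(7 / 6) / gamma(1 / 2)             # E|Z|^{4/3}, Z ~ N(0,1)
    tp = mu43 ** -3 * m * m / (m - 2) * np.sum(
        np.abs(r[:, 2:] * r[:, 1:-1] * r[:, :-2]) ** (4 / 3), axis=1)  # tri-power quarticity
    rj = (rv - bv) / rv                                           # relative jump measure
    zj = rj / np.sqrt(((pi / 2) ** 2 + pi - 5) / m * np.maximum(1.0, tp / bv ** 2))
    jump_day = zj >= norm.ppf(conf)                               # one-sided normal test
    daily_r = r.sum(axis=1)
    jumps = np.sign(daily_r) * np.sqrt(np.maximum(rv - bv, 0.0))
    return jumps[jump_day], jump_day
```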
18 Note that we can fit a stable distribution to the real variable $\hat{J}_{n,i}$ irrespective of any assumption about the number of jumps per day.
19 We experienced some difficulties with the CGMM method using the fixed beta value reported in the Monte Carlo study. We decided not to pursue it further since it would have involved a more elaborate data-dependent method for deriving the optimal regularization parameter.
20 An alternative, to account for potential serial dependence, is to do a block nonparametric bootstrap with replacement. We also computed standard errors with this method for a block size of 10, but the results were very similar and are not reported for space considerations.
21 Sensu stricto, the asymptotic distributional theory of realized volatility measures when the number of intraday data goes to infinity introduces a second source of uncertainty. However, both the indirect inference distributional theory and the bootstrap procedure can be interpreted for given intraday data, while the asymptotics are about the number of days.
22 Their sample covers the period 1986–2005 for a total of 4752 days.
Table 7 shows the estimation results of the α-stable and the skewed-t densities on $\hat{J}_{n,i}$. For the constrained indirect inference procedure we used H = 5, as in the reported Monte Carlo results. Standard errors are computed using both the asymptotic method described in GMR and a nonparametric bootstrap method. For the asymptotic method, we compute the cross-derivatives of the log-likelihood function with respect to the structural and auxiliary parameters by simulation. For the bootstrap, we generate a number of replications by drawing at random with replacement from the sample of data seen as the population.20 We estimate the model for each bootstrap sample and obtain a measure of the standard error from the generated distribution for each parameter. Given that for each bootstrap sample our estimation procedure involves simulations, we limited the number of replications to 100.21 To filter the jumps we used two different significance levels, 0.99 and 0.999, as in Tauchen and Zhou (2011). With the value of 0.99 we detect more jumps (828 out of 2686 days) but of smaller size, while for 0.999 the jumps are fewer (538) but larger in size. The corresponding percentages of days with jumps are 30% and 20% respectively. This means that the distribution of daily returns for 0.999 has thicker tails, which is reflected in a smaller α (1.684 for 0.999 vs. 1.722 for 0.99). The jump contribution to total variance is on the order of 10%, a bit higher than in Tauchen and Zhou (2011), who have a longer sample.22 Tauchen and Zhou (2011) show that for a small jump contribution it is better to select the more generous test level of 0.99. The estimates for the parameter α are smaller with empirical quantiles, implying thicker tails. The larger standard errors for the empirical quantile method confirm its relative lack of efficiency. The estimate for the symmetry parameter β is statistically indistinguishable from zero for both the CII and the empirical quantile estimators. Therefore one can safely conclude that the distribution of jumps does not show any asymmetry. Finally, the estimated mean is close to 0.1 and the standard deviation is around 0.3. This compares to empirical estimates of 0.05 and 0.53 respectively in Tauchen and Zhou (2011), who impose a simple Poisson-mixing-normal jump specification. The standard errors of the estimates are roughly of the same magnitude as the ones in Tauchen and Zhou (2011). As a diagnostic check, we compare by QQ plots in Fig. 2 the quantiles of the estimated stable distribution with the Gaussian quantiles for the two confidence levels of the jump detection test. The left plot is for the confidence level 0.999 and the right plot for 0.99. The stable quantiles are evaluated at the estimated parameters while the Gaussian quantiles are evaluated at the sample mean and variance. The previous estimation results indicate that days with jumps are far from Gaussian and can be better represented by a stable distribution. Indeed, the plots show clear differences between the two distributions, especially in the tails. Stable quantiles away from the median are larger than the corresponding Gaussian quantiles, indicating the presence of thick tails. Moreover, the differences in tail behavior are more substantial for the confidence level 0.999, which is natural since the jumps at this level are larger than at the 0.99 level.
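A minimal sketch of the nonparametric bootstrap described above, with `estimate` standing in as a hypothetical placeholder for any of the estimators in Table 7 (for CII this inner call itself involves simulation, which is why the number of replications was limited to 100):

```python
import numpy as np

def bootstrap_se(jumps, estimate, n_boot=100, seed=0):
    """Resample the jump sample with replacement and re-run the estimator each time."""
    rng = np.random.default_rng(seed)
    n = len(jumps)
    draws = np.array([estimate(jumps[rng.integers(0, n, size=n)]) for _ in range(n_boot)])
    return draws.std(axis=0, ddof=1)  # bootstrap standard error per parameter
```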
Fig. 2. QQ plots stable vs. Gaussian. Stable quantiles (y-axis) against the Gaussian quantiles (x-axis). The left plot is for confidence level 0.999 in the jump detection test and the right plot for confidence level 0.99. The stable quantiles are evaluated at the estimated parameters. The Gaussian quantiles are evaluated at the estimated sample mean and variance.
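A sketch of how such a QQ plot can be produced with off-the-shelf tools, assuming scipy's levy_stable parameterization is aligned with the paper's (an assumption worth checking: scipy offers both S0 and S1 conventions, and its stable quantile evaluation is slow):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import levy_stable, norm

def qq_stable_vs_gaussian(jumps, alpha, beta, sigma, mu):
    """Plot stable quantiles at fitted parameters against Gaussian quantiles."""
    p = (np.arange(1, len(jumps) + 1) - 0.5) / len(jumps)  # plotting positions
    q_stable = levy_stable.ppf(p, alpha, beta, loc=mu, scale=sigma)
    q_gauss = norm.ppf(p, loc=jumps.mean(), scale=jumps.std(ddof=1))
    plt.plot(q_gauss, q_stable, "o", ms=3)
    lo, hi = q_gauss.min(), q_gauss.max()
    plt.plot([lo, hi], [lo, hi], "k--")                    # 45-degree reference line
    plt.xlabel("Gaussian quantiles")
    plt.ylabel("Stable quantiles")
    plt.show()
```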
6. Conclusion

The stable distribution is particularly useful for modeling processes with heavy-tailed and skewed distributions, which are often encountered in financial series. However, its estimation raises several challenges, which we addressed in this paper. Since the density function of a stable distribution does not have a closed form but a stable series is relatively easy to simulate, we proposed an indirect inference estimation method, which is ideally suited to such characteristics. In a Monte Carlo study, we showed that the method performed well, and better than competing methods in terms of efficiency. To improve the properties of the estimator in finite samples when the value of the stability parameter approaches two, we used a variant of the indirect inference method called constrained indirect inference. We also showed that this new method for estimating stable distributions proved very useful for capturing the skewness and kurtosis present in daily returns with jumps.

Although we do not consider testing in this paper, indirect inference provides, as a by-product, specification tests about the matched characteristics, in our case the unconditional distribution. One can envision a battery of diagnostic tools. For example, the fact that the binding function can be interpreted parameter by parameter allows independent assessments of the ability of the stable model to capture the four relevant features of the data. One can also perform an omnibus test, by jointly matching McCulloch's quantile-based functions and our skewed-t auxiliary parameters to obtain an automatic overidentification test. These tests would complement the work of Deo (2000), who proposed, in the context of m-dependent sequences, a goodness-of-fit test for stable distributions. This is left for future research.

Acknowledgements

We are grateful to Marine Carrasco, Geert Dhaene, François Lamontagne, Marco Lombardi, Marc Paolella, Mark Steel and Casper de Vries for useful discussions and remarks, as well as to the audiences at the CIRANO–CIREQ Conference in Financial Econometrics 2004, ESEM 2004, the Canadian Econometric Study Group 2004, York University, the Conference on Heavy Tails and Stable Paretian Distributions 2005 organized by the Deutsche Bundesbank, and the Econometrics Seminar at the University of Copenhagen and Tilburg University. Marine Carrasco kindly provided us with the computer code for estimating α-stable distributions with continuous GMM. Special thanks are due to a referee for invaluable advice regarding both the theoretical developments and the empirical illustration. The first two authors gratefully acknowledge financial support from the Institut de Finance Mathématique de Montréal (IFM2), the Fonds québécois de la recherche sur la société et la culture (FQRSC), the Social Sciences and Humanities Research Council of Canada (SSHRC) and the MITACS Network of Centres of Excellence. The first author, a
research fellow at CIRANO and CIREQ, is grateful to Hydro-Québec and the Bank of Canada for financial support. An important part of this work was carried out while the third author was a postdoctoral fellow at CIRANO and later a Marie Curie postdoctoral fellow at Tilburg University. The third author thanks CIRANO, CIREQ, IFM2, the European Community’s Human Potential Programme under contract HPRN-CT-2002-00232 [MICFINMA], and the IAP P6/07 contract, from the IAP program (Belgian Federal Scientific Policy) ‘‘Economic policy and finance in the global economy’’ for financial support.
Appendix

Proof of Proposition 3.1. Pseudo-true values are defined as maximizing the expectation of the log quasi-likelihood function, that is:

$$L[(\nu, \gamma, \lambda, \omega)\,|\,(\alpha, \beta, \sigma, \mu)] = E[L_1(Y; \nu, \gamma, \lambda, \omega)]$$

where $Y \sim S(\alpha, \beta, \sigma, \mu)$. First note that when $Y \sim S(\alpha, \beta, \sigma, \mu)$ we have, for all $a \in \mathbb{R}$, $Y + a \sim S(\alpha, \beta, \sigma, \mu + a)$, while for all $a \in \mathbb{R}^*_+$, $aY \sim S(\alpha, \beta, a\sigma, a\mu)$, and

$$L_1(y + a; \nu, \gamma, \lambda, \omega + a) = L_1(y; \nu, \gamma, \lambda, \omega),$$
$$L_1(ay; \nu, \gamma, a\lambda, a\omega) = L_1(y; \nu, \gamma, \lambda, \omega).$$

Thus, for all $a \in \mathbb{R}$:
$$L[(\nu, \gamma, \lambda, \omega + a)\,|\,(\alpha, \beta, \sigma, \mu + a)] = L[(\nu, \gamma, \lambda, \omega)\,|\,(\alpha, \beta, \sigma, \mu)]$$
and for all $a \in \mathbb{R}^*_+$:
$$L[(\nu, \gamma, a\lambda, a\omega)\,|\,(\alpha, \beta, a\sigma, a\mu)] = L[(\nu, \gamma, \lambda, \omega)\,|\,(\alpha, \beta, \sigma, \mu)].$$

Then, considering first $a = -\mu$ and, second, $a = 1/\sigma$, we get:
$$L[(\nu, \gamma, \lambda, \omega)\,|\,(\alpha, \beta, \sigma, \mu)] = L[(\nu, \gamma, \lambda, \omega - \mu)\,|\,(\alpha, \beta, \sigma, 0)]$$
$$= L\left[\left(\nu, \gamma, \frac{\lambda}{\sigma}, \frac{\omega}{\sigma}\right)\Big|\left(\alpha, \beta, 1, \frac{\mu}{\sigma}\right)\right] = L\left[\left(\nu, \gamma, \frac{\lambda}{\sigma}, \frac{\omega - \mu}{\sigma}\right)\Big|(\alpha, \beta, 1, 0)\right].$$

We deduce straightforwardly from the above expressions that:
$$\omega(\alpha, \beta, \sigma, \mu) = \mu + \omega(\alpha, \beta, \sigma, 0), \qquad \lambda(\alpha, \beta, \sigma, \mu) = \sigma\,\lambda\left(\alpha, \beta, 1, \frac{\mu}{\sigma}\right),$$
$$\gamma(\alpha, \beta, \sigma, \mu) = \gamma(\alpha, \beta, 1, 0), \qquad \nu(\alpha, \beta, \sigma, \mu) = \nu(\alpha, \beta, 1, 0).$$

Finally, note that when $Y \sim S(\alpha, \beta, 1, 0)$, $(-Y) \sim S(\alpha, -\beta, 1, 0)$ and, for all $y$:
$$L_1\left(-y; \nu, \frac{1}{\gamma}, \lambda, -\omega\right) = L_1(y; \nu, \gamma, \lambda, \omega).$$
Therefore
$$\gamma(\alpha, -\beta, 1, 0) = \frac{1}{\gamma(\alpha, \beta, 1, 0)} \quad\text{and}\quad \nu(\alpha, -\beta, 1, 0) = \nu(\alpha, \beta, 1, 0).$$
This completes the proof of Proposition 3.1.

Proof of Proposition 3.2. By Proposition 3.1,
$$\gamma(\alpha, 0, \sigma, \mu) = \frac{1}{\gamma(\alpha, 0, \sigma, \mu)},$$
and thus $\gamma(\alpha, 0, \sigma, \mu) = 1$. Moreover, when $\gamma = 1$, the asymptotic first order condition for $\omega$ is:
$$0 = E\left[\frac{Y - \omega}{1 + \frac{(Y - \omega)^2}{\nu\lambda^2}}\right].$$
$\omega = \mu$ is clearly a solution of this equation since, when $\beta = 0$, $Y - \mu$ and $(Y - \mu)\left(1 + \frac{(Y - \mu)^2}{\nu\lambda^2}\right)^{-1}$ are symmetrically distributed around 0.

Proof of Proposition 3.3. The partial derivatives of the expected quasi-log-likelihood function with respect to $\lambda$ and $\nu$ are

$$\frac{\partial L}{\partial \lambda} = -\frac{1}{\lambda} - \frac{\nu + 1}{2\nu}\left(-\frac{2}{\lambda^3}\right) E\left[\frac{(Y - \omega)^2\, g_\omega(Y, \gamma)}{1 + \frac{1}{\nu}\left(\frac{Y - \omega}{\lambda}\right)^2 g_\omega(Y, \gamma)}\right]$$

and

$$\frac{\partial L}{\partial \nu} = \frac{h'(\nu)}{h(\nu)} - \frac{1}{2\nu} - \frac{1}{2} E\left[\log\left(1 + \frac{1}{\nu}\left(\frac{Y - \omega}{\lambda}\right)^2 g_\omega(Y, \gamma)\right)\right] + \frac{\nu + 1}{2\lambda^2\nu^2}\, E\left[\frac{(Y - \omega)^2\, g_\omega(Y, \gamma)}{1 + \frac{1}{\nu}\left(\frac{Y - \omega}{\lambda}\right)^2 g_\omega(Y, \gamma)}\right].$$

When $Y \sim S(\alpha, 0, 1, 0)$ the pseudo-true value of $\gamma$ is 1, so that $g_\omega(Y, \gamma) = 1$ and the above first order derivatives can be computed at the pseudo-true value $\Psi(\alpha, 0, 1, 0)$ as:

$$\frac{\partial L}{\partial \lambda} = -\frac{1}{\lambda} + \frac{\nu + 1}{\lambda}\, E\left[\frac{\frac{Y^2}{\lambda^2\nu}}{1 + \frac{Y^2}{\lambda^2\nu}}\right]$$

and

$$\frac{\partial L}{\partial \nu} = \frac{h'(\nu)}{h(\nu)} - \frac{1}{2\nu} - \frac{1}{2} E\left[\log\left(1 + \frac{Y^2}{\lambda^2\nu}\right)\right] + \frac{\nu + 1}{2\nu}\, E\left[\frac{\frac{Y^2}{\lambda^2\nu}}{1 + \frac{Y^2}{\lambda^2\nu}}\right].$$

The first equation $\partial L/\partial \lambda = 0$ gives the first result of Proposition 3.3, while plugging it into the second equation $\partial L/\partial \nu = 0$ gives the second result of Proposition 3.3.

Asymptotic normality of the unconstrained QMLE

For the sake of brevity we simplify the proof by considering that the observations $Y_i$ are i.i.d. and symmetric. Serial dependence could clearly be accommodated by referring instead to central limit theorems for mixing sequences. The symmetry assumption allows us, by virtue of Proposition 3.2, to simplify the formulas by considering only three free auxiliary parameters $(\omega, \lambda, \nu)$ to identify three structural parameters $(\mu, \sigma, \alpha)$, while $\gamma$ is fixed at its pseudo-true value $\gamma^0 = 1$. In the general case, the functions of $\gamma$, $g_\omega(Y, \gamma)$ and $(\gamma + \frac{1}{\gamma})$, could easily be incorporated in the QML first order conditions below at the cost of cumbersome algebra. Thanks to the simplifying constraint $\gamma = 1$, the first order conditions for the QML $(\hat\omega_n, \hat\lambda_n, \hat\nu_n)$ can be written:

$$\frac{\partial L_n}{\partial \omega}(\hat\omega_n, \hat\lambda_n, \hat\nu_n) = 0 \iff \hat\omega_n = \frac{\sum_{i=1}^{n} \theta_{i,n} Y_i}{\sum_{i=1}^{n} \theta_{i,n}}, \quad\text{with } \theta_{i,n} = \left[1 + \frac{1}{\hat\nu_n}\left(\frac{Y_i - \hat\omega_n}{\hat\lambda_n}\right)^2\right]^{-1}, \tag{A.7}$$
$$\frac{\partial L_n}{\partial \lambda}(\hat\omega_n, \hat\lambda_n, \hat\nu_n) = 0 \iff \hat\lambda_n^2 = \frac{\hat\nu_n + 1}{2\hat\nu_n}\,\frac{1}{n}\sum_{i=1}^{n} \theta_{i,n}(Y_i - \hat\omega_n)^2, \tag{A.8}$$

and the first order condition $\frac{\partial L_n}{\partial \nu}(\hat\omega_n, \hat\lambda_n, \hat\nu_n) = 0$ is less tractable since it involves the highly nonlinear function $h(\nu)$. It could however be handled in exactly the same way as the other first order conditions thanks to local linearization. By (A.7):

$$\hat\omega_n - \omega^0 = \frac{\sum_{i=1}^{n} \theta_{i,n}(Y_i - \omega^0)}{\sum_{i=1}^{n} \theta_{i,n}} = \frac{\lambda^0\sqrt{\nu^0}\sum_{i=1}^{n} \theta_{i,n} Z_i}{\sum_{i=1}^{n} \theta_{i,n}},$$

where $Z_i = \frac{Y_i - \omega^0}{\lambda^0\sqrt{\nu^0}}$. By the uniform law of large numbers for the bounded double array $\theta_{i,n}$ we have

$$\frac{1}{n}\sum_{i=1}^{n} \theta_{i,n} \to_p E\big((1 + Z^2)^{-1}\big).$$

Thus

$$\sqrt{n}(\hat\omega_n - \omega^0) = \frac{\lambda^0\sqrt{\nu^0}}{E\big((1 + Z^2)^{-1}\big)}\,\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \theta_{i,n} Z_i + o_p(1).$$

Since $\frac{Z_i}{1 + Z_i^2}$ is bounded and symmetric around zero (recall that $\omega^0 = \mu^0$ by Proposition 3.2), we deduce that $\sqrt{n}(\hat\omega_n - \omega^0)$ is asymptotically normal with asymptotic variance

$$\frac{{\lambda^0}^2 \nu^0}{\big[E\big((1 + Z^2)^{-1}\big)\big]^2}\, E\left[\frac{Z^2}{(1 + Z^2)^2}\right].$$

It is worth noting that the central limit argument above works because we manipulate sample means of the variables $\frac{Z_i}{1 + Z_i^2}$, which have finite variances (they are bounded), by contrast with the initial standardized $\alpha$-stable variable $Z_i$. Similarly, we deduce from (A.8) that

$$\sqrt{n}\big(\hat\lambda_n^2 - {\lambda^0}^2\big) = \frac{\nu^0 + 1}{2\nu^0}\,\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big(\theta_{i,n}(Y_i - \hat\omega)^2 - c\big) + o_p(1)$$

with

$$c = \frac{2{\lambda^0}^2 \nu^0}{\nu^0 + 1}.$$

By a standard uniformity argument, and thanks to boundedness, we can write, similarly to above,

$$\sqrt{n}\big(\hat\lambda_n^2 - {\lambda^0}^2\big) = \frac{(\nu^0 + 1){\lambda^0}^2}{2}\,\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left[\frac{Z_i^2}{1 + Z_i^2} - \frac{2}{\nu^0 + 1}\right] + o_p(1).$$

Since the variables $\frac{Z_i^2}{1 + Z_i^2}$ are bounded, the central limit theorem still applies.
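As a numerical illustration of (A.7) and (A.8) (with the $(\hat\nu_n + 1)/(2\hat\nu_n)$ factor taken from (A.8) as printed), a sketch that iterates the two weighted-moment equations to a fixed point for a trial value of ν; the intractable first order condition in ν itself is deliberately left out, and ν is simply held fixed:

```python
import numpy as np
from scipy.stats import levy_stable

def qml_omega_lambda(y, nu, n_iter=500, tol=1e-10):
    """Iterate (A.7)-(A.8) for fixed nu: a weighted mean for omega, a weighted variance for lambda^2."""
    omega = np.median(y)
    q75, q25 = np.percentile(y, [75, 25])
    lam2 = ((q75 - q25) / 1.349) ** 2  # robust starting value for the scale
    for _ in range(n_iter):
        theta = 1.0 / (1.0 + (y - omega) ** 2 / (nu * lam2))                    # weights of (A.7)
        omega_new = np.sum(theta * y) / np.sum(theta)                           # (A.7)
        lam2_new = (nu + 1) / (2 * nu) * np.mean(theta * (y - omega_new) ** 2)  # (A.8)
        done = abs(omega_new - omega) + abs(lam2_new - lam2) < tol
        omega, lam2 = omega_new, lam2_new
        if done:
            break
    return omega, np.sqrt(lam2)

# symmetric alpha-stable data, as in the i.i.d. symmetric setting of the proof
y = levy_stable.rvs(1.8, 0.0, loc=0.0, scale=1.0, size=1000, random_state=0)
omega_hat, lam_hat = qml_omega_lambda(y, nu=2.0)
```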
References

Aït-Sahalia, Y., Jacod, J., 2007. Volatility estimators for discretely sampled Lévy processes. Annals of Statistics 35, 355–392.
Aït-Sahalia, Y., Jacod, J., 2008. Fisher's information for discretely sampled Lévy processes. Econometrica 76, 727–761.
Andersen, T.G., Bollerslev, T., Diebold, F.X., Labys, P., 2003. Modeling and forecasting realized volatility. Econometrica 71, 579–625.
Andersen, T.G., Bollerslev, T., Diebold, F.X., 2007. Roughing it up: including jump components in the measurement, modeling, and forecasting of return volatility. Review of Economics and Statistics 89, 701–720.
Andersen, T.G., Bollerslev, T., Diebold, F.X., 2009. Parametric and nonparametric volatility measurement. In: Hansen, L.P., Aït-Sahalia, Y. (Eds.), Handbook of Financial Econometrics. North-Holland, Amsterdam, pp. 67–138.
Barndorff-Nielsen, O., Shephard, N., 2004a. Econometric analysis of realised covariation: high frequency based covariance, regression and correlation. Econometrica 72, 885–925.
Barndorff-Nielsen, O., Shephard, N., 2004b. Power and bipower variation with stochastic volatility and jumps. Journal of Financial Econometrics 2, 1–48.
Barndorff-Nielsen, O., Shephard, N., 2006. Econometrics of testing for jumps in financial economics using bipower variation. Journal of Financial Econometrics 4, 1–30.
Bassi, F., Embrechts, P., Kafetzaki, M., 1998. Risk management and quantile estimation. In: Adler, R., Feldman, R., Taqqu, M. (Eds.), A Practical Guide to Heavy Tails: Statistical Techniques and Applications. Birkhäuser, Boston, pp. 111–130.
Bawa, V.S., Elton, E.J., Gruber, M.J., 1979. Simple rules for optimal portfolio selection in stable Paretian markets. Journal of Finance 34, 1041–1047.
Calzolari, G., Fiorentini, G., Sentana, E., 2004. Constrained indirect estimation. Review of Economic Studies 71, 945–973.
Carr, P., Geman, H., Madan, D.B., Yor, M., 2002. The fine structure of asset returns: an empirical investigation. Journal of Business 75, 305–332.
Carrasco, M., Florens, J.P., 2000. Generalization of GMM to a continuum of moment conditions. Econometric Theory 16, 797–834.
Carrasco, M., Florens, J.P., 2002. Efficient GMM estimation using the empirical characteristic function. GREMAQ-University of Toulouse Working Paper.
Carrasco, M., Chernov, M., Florens, J.P., Ghysels, E., 2007. Efficient estimation of general dynamic models with a continuum of moment conditions. Journal of Econometrics 140, 529–573.
Chambers, J.M., Mallows, C.L., Stuck, B.W., 1976. A method for simulating stable random variables. Journal of the American Statistical Association 71, 340–344.
Coppejans, M., Gallant, A.R., 2002. Cross-validated SNP density estimates. Journal of Econometrics 110, 27–65.
Deo, R.S., 2000. On estimation and testing goodness of fit for m-dependent stable sequences. Journal of Econometrics 99, 349–372.
Deo, R.S., 2002. On testing the adequacy of stable processes under conditional heteroscedasticity. Journal of Empirical Finance 9, 257–270.
de Vries, C.G., 1991. On the relation between GARCH and stable processes. Journal of Econometrics 48, 313–324.
Dridi, R., Guay, A., Renault, E., 2007. Indirect inference and calibration of dynamic stochastic general equilibrium models. Journal of Econometrics 136, 397–430.
DuMouchel, W.H., 1973. On the asymptotic normality of the maximum-likelihood estimate when sampling from a stable distribution. Annals of Statistics 1, 948–957.
DuMouchel, W.H., 1975. Stable distributions in statistical inference: 2. Information from stably distributed samples. Journal of the American Statistical Association 70, 386–393.
Fama, E., 1965a. The behavior of stock-market prices. Journal of Business 38, 34–105.
Fama, E., 1965b. Portfolio analysis in a stable Paretian market. Management Science 11, 404–419.
Fama, E., Roll, R., 1971. Parameter estimates for symmetric stable distributions. Journal of the American Statistical Association 66, 331–338.
Feller, W., 1971. An Introduction to Probability Theory and Its Applications. John Wiley and Sons, New York.
Fernández, C., Steel, M., 1998. On Bayesian modelling of fat tails and skewness. Journal of the American Statistical Association 93, 359–371.
Feuerverger, A., McDunnough, P., 1981. On the efficiency of empirical characteristic function procedures. Journal of the Royal Statistical Society, Series B 43, 20–27.
Fielitz, B.D., Rozelle, J., 1981. Method-of-moments estimators of stable distribution parameters. Applied Mathematics and Computation 8, 303–320.
Francq, C., Zakoïan, J.M., 2004. Maximum likelihood estimation of pure GARCH and ARMA–GARCH processes. Bernoulli 10, 605–637.
Gallant, A.R., Tauchen, G., 1996. Which moments to match? Econometric Theory 12, 657–681.
Garcia, R., Renault, E., Veredas, D., 2006. Estimation of stable distributions by indirect inference. CORE Discussion Paper 2006/112.
Ghose, D., Kroner, K.F., 1995. The relationship between GARCH and symmetric stable processes: finding the source of fat tails in financial data. Journal of Empirical Finance 2, 225–251.
Gouriéroux, C., Monfort, A., Renault, E., 1993. Indirect inference. Journal of Applied Econometrics 8, S85–S118.
Groenendijk, P.A., Lucas, A., de Vries, C.G., 1995. A note on the relationship between GARCH and symmetric stable processes. Journal of Empirical Finance 2, 253–264.
Hansen, B.E., 1994. Autoregressive conditional density estimation. International Economic Review 35, 705–730.
Huang, X., Tauchen, G., 2005. The relative contribution of jumps to total price variance. Journal of Financial Econometrics 3, 456–499.
Lombardi, M.J., Calzolari, G., 2008. Indirect estimation of α-stable distributions and processes. Econometrics Journal 11, 193–208.
Mandelbrot, B., 1963. The variation of certain speculative prices. Journal of Business 36, 394–419.
McCulloch, J.H., 1986. Simple consistent estimators of stable distribution parameters. Communications in Statistics — Simulation and Computation 15, 1109–1136.
Mittnik, S., Rachev, S.T., Paolella, M.S., 1998. Stable Paretian modeling in finance: some empirical and theoretical aspects. In: Adler, R., Feldman, R., Taqqu, M. (Eds.), A Practical Guide to Heavy Tails: Statistical Techniques and Applications. Birkhäuser, Boston, pp. 79–110.
Mittnik, S., Paolella, M.S., Rachev, S.T., 2000. Diagnosing and treating the fat tails in financial returns data. Journal of Empirical Finance 7, 389–416.
Ortobelli, S., Huber, I., Schwartz, E., 2002. Portfolio selection with stable distributed returns. Mathematical Methods of Operations Research 55, 265–300.
Samorodnitsky, G., Taqqu, M.S., 1994. Stable Non-Gaussian Random Processes. Chapman and Hall, New York.
Smith, A.A., 1993. Estimating nonlinear time-series models using simulated vector autoregressions. Journal of Applied Econometrics 8, S63–S84.
Tauchen, G., Zhou, H., 2011. Realized jumps on financial markets and predicting credit spreads. Journal of Econometrics 160 (1), 102–118.
Zolotarev, V.M., 1986. One-Dimensional Stable Distributions. In: Translations of Mathematical Monographs, vol. 65. American Mathematical Society, Providence.
Journal of Econometrics 161 (2011) 338
Corrigendum to ''A simple way of computing the inverse moments of a non-central chi-square random variable'' [J. Econom. 37 (1988) 389–393]

Wen Zhi Xie
Wuhan Iron and Steel University, Wuhan, China
Article info
Article history: Available online 1 February 2011

The denominator for all the $Y_k$ equations and Eq. (2) on p. 391 should be $\lambda$ raised to the power $2k - 2$ rather than $\lambda^2$.
DOI of original article: 10.1016/S0304-4076(88)90013-9. E-mail address: [email protected].
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2010.11.013
The editorial board wishes to thank Dr. Muhammad Javed for pointing out the typos.